# Preparing Documents for RAG Pipelines: Why Markdown Beats Plain Text
*Sarah Chen, ML Engineer*
Retrieval-Augmented Generation is the dominant architecture for building LLM applications over private data. You ingest documents, split them into chunks, embed them into vectors, store them in a database, and retrieve relevant chunks at query time. The quality of every step in this pipeline depends on one thing most teams overlook: the format of the input documents.
Most tutorials show plain text extraction from PDFs. This works for demos but fails in production. Here is why Markdown is the right input format for serious RAG pipelines, and how to integrate conversion into your workflow.
## Why Input Format Matters
A RAG pipeline has three quality bottlenecks:
- Chunk quality -- are your chunks semantically coherent units?
- Retrieval precision -- does the right chunk get retrieved for a given query?
- Generation accuracy -- does the LLM produce correct answers from retrieved chunks?
All three degrade when the input format loses document structure. Plain text extraction from PDF strips headings, tables, and lists. The chunker has no signal for where one topic ends and another begins, so it splits mid-paragraph or mid-table. The embeddings reflect this noise, and retrieval suffers.
Markdown preserves structure with minimal overhead. Headings become natural chunk boundaries. Tables stay intact. Lists remain coherent. The formatting tokens (`#`, `|`, `-`, `*`) are lightweight and semantically meaningful.
## Markdown vs Plain Text vs HTML
| Property | Plain Text | HTML | Markdown |
|---|---|---|---|
| Structure preserved | None | Full | Full |
| Token overhead | Lowest | 3-5x bloat from tags | 5-10% overhead |
| Chunk boundaries | None (arbitrary split) | Tag-based (noisy) | Heading-based (clean) |
| Table handling | Destroyed | Preserved but verbose | Preserved and compact |
| LLM readability | Medium | Low (tag noise) | High |
HTML preserves structure, but at massive token cost. A simple table that takes 5 lines in Markdown takes 20+ lines in HTML with `<table>`, `<thead>`, `<tbody>`, `<tr>`, and `<td>` tags. For RAG, where every token in the chunk matters for embedding quality, this bloat is unacceptable.
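To make the overhead concrete, here is a quick character-count comparison of the same one-row table in both formats. Character count is only a rough proxy for token count (the exact ratio depends on the tokenizer), but the direction of the gap is clear:

```python
md_table = (
    "| Region | 2025 |\n"
    "|---|---|\n"
    "| Europe | 8.7M |"
)

html_table = (
    "<table>\n"
    "<thead><tr><th>Region</th><th>2025</th></tr></thead>\n"
    "<tbody><tr><td>Europe</td><td>8.7M</td></tr></tbody>\n"
    "</table>"
)

# Character counts as a rough proxy for token counts.
print(len(md_table), len(html_table))
```

The HTML version is more than twice the size of the Markdown version, and the gap widens with every additional row.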
## Heading-Based Chunking
The biggest advantage of Markdown for RAG is natural chunking by headings. Instead of splitting every N tokens (which cuts through tables and paragraphs), you split on heading boundaries:
```python
import re

def chunk_by_headings(markdown: str, max_level: int = 2) -> list[str]:
    # Split immediately before any heading of level 1..max_level.
    # (A production version should also skip headings that appear
    # inside fenced code blocks.)
    pattern = rf'(?=^#{{1,{max_level}}}\s)'
    sections = re.split(pattern, markdown, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]
```
Each chunk is a semantically complete section with its own heading. Embeddings for these chunks are far more meaningful than embeddings for arbitrary 500-token windows.
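The heading line itself is also worth keeping as metadata: the title can feed retrieval filters or be prepended to the embedding text. A minimal self-contained sketch of that idea (the function name and return shape here are my own, not from any library):

```python
import re

def sections_with_titles(markdown: str, max_level: int = 2) -> list[tuple[str, str]]:
    """Split on heading boundaries, pairing each section with its title."""
    pattern = rf'(?=^#{{1,{max_level}}}\s)'
    out = []
    for section in re.split(pattern, markdown, flags=re.MULTILINE):
        section = section.strip()
        if not section:
            continue
        first_line = section.split('\n', 1)[0]
        # A section starting with '#' carries its own heading as the title.
        title = first_line.lstrip('#').strip() if first_line.startswith('#') else ''
        out.append((title, section))
    return out
```

Storing the title alongside each chunk lets you show readable section names in retrieval results and filter candidates by topic before the vector search even runs.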
## Tables in RAG
Tables are the hardest part of document processing for RAG. A quarterly report might have dozens of tables, each containing critical data. With plain text extraction, tables become:
"Revenue 2025 2026 Growth North America 12.3M 14.1M 15% Europe 8.7M 9.2M 6%"
This is useless for retrieval. A query about "European revenue growth" might not match because the structure connecting "Europe" to "6%" is lost.
With Markdown tables:
```markdown
| Region | 2025 | 2026 | Growth |
|---|---|---|---|
| North America | 12.3M | 14.1M | 15% |
| Europe | 8.7M | 9.2M | 6% |
```
The structure is preserved. The embedding captures the relationship between region names and their corresponding values. Retrieval works correctly.
Key rule: never split a table across chunks. When chunking Markdown, detect table blocks (lines starting with `|`) and keep them as atomic units.
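One way to implement that rule is a first pass that groups consecutive `|`-prefixed lines into atomic blocks; a downstream chunker can then pack blocks into chunks without ever cutting through a table. A minimal sketch (function name is my own):

```python
def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blocks, keeping each table as one atomic block."""
    blocks, current, in_table = [], [], False
    for line in markdown.split('\n'):
        is_table_line = line.lstrip().startswith('|')
        # A transition into or out of a table closes the current block.
        if is_table_line != in_table and current:
            blocks.append('\n'.join(current).strip())
            current = []
        in_table = is_table_line
        current.append(line)
    if current:
        blocks.append('\n'.join(current).strip())
    return [b for b in blocks if b]
```

When assembling chunks from these blocks, treat any block that exceeds your token budget on its own (a very large table) as its own chunk rather than splitting it.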
## Integration Examples
LangChain integration:
```python
import requests
from langchain.text_splitter import MarkdownHeaderTextSplitter

def convert_and_chunk(file_path: str) -> list:
    # Convert to Markdown via mdstill API
    with open(file_path, 'rb') as f:
        resp = requests.post(
            'https://mdstill.com/api/convert',
            files={'file': f},
        )
    markdown = resp.text
    # Chunk by headers
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "h1"), ("##", "h2"), ("###", "h3"),
        ]
    )
    return splitter.split_text(markdown)
```
LlamaIndex integration:
```python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

# After converting via mdstill API
markdown_content = convert_via_api("report.pdf")

doc = Document(text=markdown_content)
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([doc])
```
Both frameworks have built-in Markdown-aware splitters. Use them instead of generic character-based splitting.
Batch conversion for pipelines:
```python
from concurrent.futures import ThreadPoolExecutor

import requests

def convert_file(path: str) -> str:
    with open(path, 'rb') as f:
        resp = requests.post(
            'https://mdstill.com/api/convert',
            files={'file': f},
        )
    return resp.text

files = ["report1.pdf", "report2.pdf", "report3.pdf"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(convert_file, files))
```
The format of your input documents is the foundation of your RAG pipeline. Switching from plain text to Markdown is the single highest-impact change you can make to improve retrieval quality. It costs nothing in runtime performance, requires minimal code changes, and delivers measurably better results.
Start with mdstill's converter or the API for batch processing. Your RAG pipeline will thank you.