# Preparing Documents for RAG Pipelines: Why Markdown Beats Plain Text
*Sarah Chen, ML Engineer*
Retrieval-Augmented Generation is the dominant architecture for building LLM applications over private data. You ingest documents, split them into chunks, embed them into vectors, store them in a database, and retrieve relevant chunks at query time. The quality of every step in this pipeline depends on one thing most teams overlook: the format of the input documents.
Most tutorials show plain text extraction from PDFs. This works for demos but fails in production. Here is why Markdown is the right input format for serious RAG pipelines, and how to integrate conversion into your workflow.
## Why Input Format Matters
A RAG pipeline has three quality bottlenecks:
- Chunk quality -- are your chunks semantically coherent units?
- Retrieval precision -- does the right chunk get retrieved for a given query?
- Generation accuracy -- does the LLM produce correct answers from retrieved chunks?
All three degrade when the input format loses document structure. Plain text extraction from PDF strips headings, tables, and lists. The chunker has no signal for where one topic ends and another begins, so it splits mid-paragraph or mid-table. The embeddings reflect this noise, and retrieval suffers.
Markdown preserves structure with minimal overhead. Headings become natural chunk boundaries. Tables stay intact. Lists remain coherent. The formatting tokens (`#`, `|`, `-`, `*`) are lightweight and semantically meaningful.
## Markdown vs Plain Text vs HTML
| Property | Plain Text | HTML | Markdown |
|---|---|---|---|
| Structure preserved | None | Full | Full |
| Token overhead | Lowest | 3-5x bloat from tags | 5-10% overhead |
| Chunk boundaries | None (arbitrary split) | Tag-based (noisy) | Heading-based (clean) |
| Table handling | Destroyed | Preserved but verbose | Preserved and compact |
| LLM readability | Medium | Low (tag noise) | High |
HTML preserves structure, but at massive token cost. A simple table that takes 5 lines in Markdown takes 20+ lines in HTML with `<table>`, `<thead>`, `<tbody>`, `<tr>`, and `<td>` tags. For RAG, where every token in the chunk matters for embedding quality, this bloat is unacceptable.
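To make the overhead concrete, here is a quick character-count comparison of the same one-row table in both formats. Character count is only a rough proxy for token count (the exact ratio depends on the tokenizer), but the direction of the gap is clear:

```python
md_table = (
    "| Region | 2025 |\n"
    "|---|---|\n"
    "| Europe | 8.7M |"
)

html_table = (
    "<table>\n"
    "<thead><tr><th>Region</th><th>2025</th></tr></thead>\n"
    "<tbody><tr><td>Europe</td><td>8.7M</td></tr></tbody>\n"
    "</table>"
)

# Character counts as a rough proxy for token counts.
print(len(md_table), len(html_table))
```

The HTML version is more than twice the size of the Markdown version, and the gap widens with every additional row.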
## Heading-Based Chunking
The biggest advantage of Markdown for RAG is natural chunking by headings. Instead of splitting every N tokens (which cuts through tables and paragraphs), you split on heading boundaries:
```python
import re

def chunk_by_headings(markdown: str, max_level: int = 2) -> list[str]:
    # Split immediately before any heading of level 1..max_level.
    # (A production version should also skip headings that appear
    # inside fenced code blocks.)
    pattern = rf'(?=^#{{1,{max_level}}}\s)'
    sections = re.split(pattern, markdown, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]
```
Each chunk is a semantically complete section with its own heading. Embeddings for these chunks are far more meaningful than embeddings for arbitrary 500-token windows.
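The heading line itself is also worth keeping as metadata: the title can feed retrieval filters or be prepended to the embedding text. A minimal self-contained sketch of that idea (the function name and return shape here are my own, not from any library):

```python
import re

def sections_with_titles(markdown: str, max_level: int = 2) -> list[tuple[str, str]]:
    """Split on heading boundaries, pairing each section with its title."""
    pattern = rf'(?=^#{{1,{max_level}}}\s)'
    out = []
    for section in re.split(pattern, markdown, flags=re.MULTILINE):
        section = section.strip()
        if not section:
            continue
        first_line = section.split('\n', 1)[0]
        # A section starting with '#' carries its own heading as the title.
        title = first_line.lstrip('#').strip() if first_line.startswith('#') else ''
        out.append((title, section))
    return out
```

Storing the title alongside each chunk lets you show readable section names in retrieval results and filter candidates by topic before the vector search even runs.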
## Tables in RAG
Tables are the hardest part of document processing for RAG. A quarterly report might have dozens of tables, each containing critical data. With plain text extraction, tables become:
"Revenue 2025 2026 Growth North America 12.3M 14.1M 15% Europe 8.7M 9.2M 6%"
This is useless for retrieval. A query about "European revenue growth" might not match because the structure connecting "Europe" to "6%" is lost.
With Markdown tables:
```markdown
| Region | 2025 | 2026 | Growth |
|---|---|---|---|
| North America | 12.3M | 14.1M | 15% |
| Europe | 8.7M | 9.2M | 6% |
```
The structure is preserved. The embedding captures the relationship between region names and their corresponding values. Retrieval works correctly.
Key rule: never split a table across chunks. When chunking Markdown, detect table blocks (lines starting with `|`) and keep them as atomic units.
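One way to implement that rule is a first pass that groups consecutive `|`-prefixed lines into atomic blocks; a downstream chunker can then pack blocks into chunks without ever cutting through a table. A minimal sketch (function name is my own):

```python
def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blocks, keeping each table as one atomic block."""
    blocks, current, in_table = [], [], False
    for line in markdown.split('\n'):
        is_table_line = line.lstrip().startswith('|')
        # A transition into or out of a table closes the current block.
        if is_table_line != in_table and current:
            blocks.append('\n'.join(current).strip())
            current = []
        in_table = is_table_line
        current.append(line)
    if current:
        blocks.append('\n'.join(current).strip())
    return [b for b in blocks if b]
```

When assembling chunks from these blocks, treat any block that exceeds your token budget on its own (a very large table) as its own chunk rather than splitting it.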
## Integration Examples
LangChain integration:
```python
import requests
from langchain.text_splitter import MarkdownHeaderTextSplitter

def convert_and_chunk(file_path: str) -> list:
    # Convert to Markdown via mdstill API
    with open(file_path, 'rb') as f:
        resp = requests.post(
            'https://mdstill.com/api/convert',
            files={'file': f},
        )
    markdown = resp.text
    # Chunk by headers
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "h1"), ("##", "h2"), ("###", "h3"),
        ]
    )
    return splitter.split_text(markdown)
```
LlamaIndex integration:
```python
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

# After converting via mdstill API
markdown_content = convert_via_api("report.pdf")

doc = Document(text=markdown_content)
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([doc])
```
Both frameworks have built-in Markdown-aware splitters. Use them instead of generic character-based splitting.
Batch conversion for pipelines:
```python
from concurrent.futures import ThreadPoolExecutor

import requests

def convert_file(path: str) -> str:
    with open(path, 'rb') as f:
        resp = requests.post(
            'https://mdstill.com/api/convert',
            files={'file': f},
        )
    return resp.text

files = ["report1.pdf", "report2.pdf", "report3.pdf"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(convert_file, files))
```
The format of your input documents is the foundation of your RAG pipeline. Switching from plain text to Markdown is the single highest-impact change you can make to improve retrieval quality. It costs nothing in runtime performance, requires minimal code changes, and delivers measurably better results.
Start with mdstill's converter or the API for batch processing. Your RAG pipeline will thank you.