AI · March 3, 2026

Preparing Documents for LLMs: Why Markdown Matters

Sarah Chen

ML Engineer

4 min read

If you are building anything with GPT-5, Claude, Gemini, or any other large language model, the format of your input documents matters more than you think. Feeding raw PDF text or HTML soup into an LLM wastes tokens, confuses the model, and degrades output quality. Markdown fixes this.

The Input Quality Problem

LLMs are trained on vast amounts of text, and a significant portion of that training data is Markdown -- GitHub READMEs, documentation sites, wiki pages, and technical blogs. This means the models have deep familiarity with Markdown syntax. They understand that ## means a section heading, that | pipes denote a table, and that > indicates a quote.

When you feed an LLM raw extracted text from a PDF, you lose all of this structural information:

Annual Report 2026 Financial Highlights Revenue 168.7M up 18.5%
from prior year Operating Income 42.1M Earnings Per Share 3.24
The company achieved record results across all segments

Compare that to the same content as Markdown:

# Annual Report 2026

## Financial Highlights

| Metric           |   Value | Change |
| :--------------- | ------: | -----: |
| Revenue          | $168.7M | +18.5% |
| Operating Income |  $42.1M |        |
| Earnings/Share   |   $3.24 |        |

The company achieved record results across all segments.

The second version gives the model explicit structure. It can reason about the table, reference specific values, and understand the document hierarchy.

Why LLMs Prefer Markdown

Three reasons Markdown outperforms other formats as LLM input:

1. Explicit structure without noise. HTML carries semantic information too, but it also carries CSS classes, attributes, script tags, and deeply nested divs. A typical web page is 80% markup, 20% content. Markdown is nearly 100% content with minimal, meaningful syntax.

2. Training data alignment. Models have seen billions of Markdown tokens during training. The syntax acts as a natural prompt format that activates the model's understanding of document structure. When you pass a Markdown table, the model "knows" how to read it.

3. Consistent whitespace semantics. In Markdown, whitespace is meaningful but predictable -- blank lines separate paragraphs, indentation indicates nesting. Raw text extraction from PDFs produces unpredictable whitespace that the model must guess about.
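The markup-overhead point is easy to measure directly. A minimal sketch -- the HTML and Markdown snippets are illustrative, and stripping markup with a regex is only a rough approximation of "visible content":

```python
import re

HTML = '<div class="card"><h2 class="title">Revenue</h2><p><span>$168.7M</span></p></div>'
MARKDOWN = "## Revenue\n\n$168.7M\n"

def visible_share(doc: str, markup_pattern: str) -> float:
    """Fraction of characters left after stripping markup -- a rough content ratio."""
    content = re.sub(markup_pattern, "", doc, flags=re.MULTILINE)
    return len(content.strip()) / len(doc)

html_share = visible_share(HTML, r"<[^>]+>")     # strip tags
md_share = visible_share(MARKDOWN, r"^#{1,6} ")  # strip heading markers
```

Even on this tiny snippet, the HTML is roughly 80% markup while the Markdown is roughly 80% content -- the same ratio cited above.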

Token Efficiency

Context windows are precious. Every wasted token on formatting noise is a token you cannot use for actual content. Here is a real comparison from a 10-page technical document:

| Format         | Tokens | Usable Content |
| :------------- | -----: | -------------: |
| Raw PDF text   |  8,200 |           ~60% |
| HTML (cleaned) | 12,400 |           ~45% |
| HTML (raw)     | 31,000 |           ~15% |
| Markdown       |  6,100 |            92% |

Markdown uses 25% fewer tokens than raw PDF text while preserving more structure. Compared to raw HTML, the savings are dramatic -- you fit 5x more actual content into the same context window.
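As a quick sanity check, the arithmetic behind those two claims, plugging in the token counts from the table:

```python
# Token counts from the comparison table above
tokens = {"raw_pdf": 8_200, "html_raw": 31_000, "markdown": 6_100}

# Savings of Markdown relative to raw PDF text
pdf_savings = 1 - tokens["markdown"] / tokens["raw_pdf"]   # roughly a quarter

# Content multiplier relative to raw HTML in a fixed context window
html_multiplier = tokens["html_raw"] / tokens["markdown"]  # roughly 5x
```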

This matters particularly for:

  • Long documents: fitting an entire report into a single prompt
  • RAG pipelines: maximizing the number of retrieved chunks that fit in context
  • Multi-document analysis: comparing several documents in one conversation

RAG Pipelines and Chunking

Retrieval-Augmented Generation (RAG) is the dominant architecture for building LLM applications over private data. The standard pipeline is: ingest documents, split into chunks, embed chunks, store in a vector database, retrieve relevant chunks at query time, pass them to the LLM.

Markdown dramatically improves every step of this pipeline:

Better chunking. Markdown headings provide natural split points. Instead of splitting on arbitrary character counts (which might break mid-sentence or mid-table), you can split on ## boundaries. Each chunk is a coherent section with its own heading.

import re

def chunk_markdown(md_text: str) -> list[str]:
    """Split markdown into sections based on H2 headings."""
    sections = re.split(r'(?=^## )', md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

Better embeddings. When a chunk has a heading like "## Q3 Revenue Analysis", the embedding captures not just the content but the topic. This leads to more relevant retrieval.

Better LLM responses. When retrieved chunks arrive at the model as clean Markdown with tables and headings intact, the model can reference specific data points rather than guessing at structure.
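The three improvements can be sketched end to end. In this minimal, self-contained example, the bag-of-words cosine scoring is a toy stand-in for a real embedding model and vector database -- the point is only that heading-aware chunks make retrieval land on the right section:

```python
import math
import re
from collections import Counter

def chunk_markdown(md_text: str) -> list[str]:
    """Split markdown into sections on H2 boundaries."""
    sections = re.split(r"(?=^## )", md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9$%.]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = """## Q3 Revenue Analysis

Revenue grew 18.5% to $168.7M, driven by the enterprise segment.

## Hiring Plan

Engineering headcount doubles next year.
"""

# "Vector store": one (embedding, chunk) pair per section
index = [(embed(chunk), chunk) for chunk in chunk_markdown(doc)]

def retrieve(query: str) -> str:
    """Return the chunk most similar to the query -- the heading text boosts the match."""
    q = embed(query)
    return max(index, key=lambda pair: cosine(pair[0], q))[1]
```

Because each chunk keeps its heading, a query about revenue growth retrieves the Q3 section with its table-ready numbers intact, even with this purely lexical scorer.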

Practical Workflow

Here is a concrete workflow for preparing documents for LLM consumption using mdstill:

  1. Upload your PDF, DOCX, or PPTX to mdstill
  2. Wait a second -- most documents convert in under two seconds, with no queue or mode selection needed
  3. Download the Markdown output
  4. Chunk the Markdown by heading level for RAG, or use it directly in prompts
  5. Embed and store the chunks in your vector database
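Step 4 mentions chunking "by heading level"; the H2-only splitter shown earlier generalizes with the level as a parameter. A small sketch:

```python
import re

def chunk_by_heading(md_text: str, level: int = 2) -> list[str]:
    """Split markdown at headings of the given level (level=2 splits on '## ')."""
    marker = "#" * level + " "
    sections = re.split(rf"(?=^{re.escape(marker)})", md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

report = "# Title\n\n## Revenue\n\ntext\n\n### Detail\n\nmore\n\n## Costs\n\nend\n"
chunks = chunk_by_heading(report, level=2)
```

Note that the lookahead only fires on exactly `## ` at the start of a line, so deeper `###` subsections stay attached to their parent chunk.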

For batch processing, you can use the mdstill API:

# Convert a document via the API
curl -X POST https://mdstill.com/api/convert \
  -F "file=@quarterly-report.pdf" \
  -F "mode=deep" \
  -o report.md

Format Comparison

A summary of how different input formats affect LLM performance:

| Factor              | Raw Text | HTML  | Markdown  |
| :------------------ | :------- | :---- | :-------- |
| Structure preserved | None     | Full  | Full      |
| Token efficiency    | Medium   | Poor  | High      |
| Table handling      | Broken   | Noisy | Clean     |
| LLM familiarity     | Low      | Low   | High      |
| Chunking quality    | Poor     | Fair  | Excellent |

Markdown hits the sweet spot: full structural preservation with minimal token overhead. For anyone building LLM applications over documents, converting to Markdown is not optional -- it is a prerequisite for quality results.

#ai #llm #rag #markdown #chatgpt
