AI · March 3, 2026

Preparing Documents for LLMs: Why Markdown Matters

Sarah Chen

ML Engineer

4 min read

If you are building anything with GPT-5, Claude, Gemini, or any other large language model, the format of your input documents matters more than you think. Feeding raw PDF text or HTML soup into an LLM wastes tokens, confuses the model, and degrades output quality. Markdown fixes this.

The Input Quality Problem

LLMs are trained on vast amounts of text, and a significant portion of that training data is Markdown -- GitHub READMEs, documentation sites, wiki pages, and technical blogs. This means the models have deep familiarity with Markdown syntax. They understand that ## means a section heading, that | pipes denote a table, and that > indicates a quote.

When you feed an LLM raw extracted text from a PDF, you lose all of this structural information:

Annual Report 2026 Financial Highlights Revenue 168.7M up 18.5%
from prior year Operating Income 42.1M Earnings Per Share 3.24
The company achieved record results across all segments

Compare that to the same content as Markdown:

# Annual Report 2026

## Financial Highlights

| Metric           |   Value | Change |
| :--------------- | ------: | -----: |
| Revenue          | $168.7M | +18.5% |
| Operating Income |  $42.1M |        |
| Earnings/Share   |   $3.24 |        |

The company achieved record results across all segments.

The second version gives the model explicit structure. It can reason about the table, reference specific values, and understand the document hierarchy.

Why LLMs Prefer Markdown

Three reasons Markdown outperforms other formats as LLM input:

1. Explicit structure without noise. HTML carries semantic information too, but it also carries CSS classes, attributes, script tags, and deeply nested divs. A typical web page is 80% markup, 20% content. Markdown is nearly 100% content with minimal, meaningful syntax.

2. Training data alignment. Models have seen billions of Markdown tokens during training. The syntax acts as a natural prompt format that activates the model's understanding of document structure. When you pass a Markdown table, the model "knows" how to read it.

3. Consistent whitespace semantics. In Markdown, whitespace is meaningful but predictable -- blank lines separate paragraphs, indentation indicates nesting. Raw text extraction from PDFs produces unpredictable whitespace that the model must guess about.
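The markup-overhead point is easy to measure directly. A minimal sketch -- the HTML and Markdown snippets are illustrative, and stripping markup with a regex is only a rough approximation of "visible content":

```python
import re

HTML = '<div class="card"><h2 class="title">Revenue</h2><p><span>$168.7M</span></p></div>'
MARKDOWN = "## Revenue\n\n$168.7M\n"

def visible_share(doc: str, markup_pattern: str) -> float:
    """Fraction of characters left after stripping markup -- a rough content ratio."""
    content = re.sub(markup_pattern, "", doc, flags=re.MULTILINE)
    return len(content.strip()) / len(doc)

html_share = visible_share(HTML, r"<[^>]+>")     # strip tags
md_share = visible_share(MARKDOWN, r"^#{1,6} ")  # strip heading markers
```

Even on this tiny snippet, the HTML is roughly 80% markup while the Markdown is roughly 80% content -- the same ratio cited above.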

Token Efficiency

Context windows are precious. Every wasted token on formatting noise is a token you cannot use for actual content. Here is a real comparison from a 10-page technical document:

| Format         | Tokens | Usable Content |
| :------------- | -----: | -------------: |
| Raw PDF text   |  8,200 |           ~60% |
| HTML (cleaned) | 12,400 |           ~45% |
| HTML (raw)     | 31,000 |           ~15% |
| Markdown       |  6,100 |            92% |

Markdown uses 25% fewer tokens than raw PDF text while preserving more structure. Compared to raw HTML, the savings are dramatic -- you fit 5x more actual content into the same context window.
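As a quick sanity check, the arithmetic behind those two claims, plugging in the token counts from the table:

```python
# Token counts from the comparison table above
tokens = {"raw_pdf": 8_200, "html_raw": 31_000, "markdown": 6_100}

# Savings of Markdown relative to raw PDF text
pdf_savings = 1 - tokens["markdown"] / tokens["raw_pdf"]   # roughly a quarter

# Content multiplier relative to raw HTML in a fixed context window
html_multiplier = tokens["html_raw"] / tokens["markdown"]  # roughly 5x
```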

This matters particularly for:

  • Long documents: fitting an entire report into a single prompt
  • RAG pipelines: maximizing the number of retrieved chunks that fit in context
  • Multi-document analysis: comparing several documents in one conversation

RAG Pipelines and Chunking

Retrieval-Augmented Generation (RAG) is the dominant architecture for building LLM applications over private data. The standard pipeline is: ingest documents, split into chunks, embed chunks, store in a vector database, retrieve relevant chunks at query time, pass them to the LLM.

Markdown dramatically improves every step of this pipeline:

Better chunking. Markdown headings provide natural split points. Instead of splitting on arbitrary character counts (which might break mid-sentence or mid-table), you can split on ## boundaries. Each chunk is a coherent section with its own heading.

import re

def chunk_markdown(md_text: str) -> list[str]:
    """Split markdown into sections based on H2 headings."""
    sections = re.split(r'(?=^## )', md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

Better embeddings. When a chunk has a heading like "## Q3 Revenue Analysis", the embedding captures not just the content but the topic. This leads to more relevant retrieval.

Better LLM responses. When retrieved chunks arrive at the model as clean Markdown with tables and headings intact, the model can reference specific data points rather than guessing at structure.
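The three improvements can be sketched end to end. In this minimal, self-contained example, the bag-of-words cosine scoring is a toy stand-in for a real embedding model and vector database -- the point is only that heading-aware chunks make retrieval land on the right section:

```python
import math
import re
from collections import Counter

def chunk_markdown(md_text: str) -> list[str]:
    """Split markdown into sections on H2 boundaries."""
    sections = re.split(r"(?=^## )", md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9$%.]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc = """## Q3 Revenue Analysis

Revenue grew 18.5% to $168.7M, driven by the enterprise segment.

## Hiring Plan

Engineering headcount doubles next year.
"""

# "Vector store": one (embedding, chunk) pair per section
index = [(embed(chunk), chunk) for chunk in chunk_markdown(doc)]

def retrieve(query: str) -> str:
    """Return the chunk most similar to the query -- the heading text boosts the match."""
    q = embed(query)
    return max(index, key=lambda pair: cosine(pair[0], q))[1]
```

Because each chunk keeps its heading, a query about revenue growth retrieves the Q3 section with its table-ready numbers intact, even with this purely lexical scorer.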

Practical Workflow

Here is a concrete workflow for preparing documents for LLM consumption using mdstill:

  1. Upload your PDF, DOCX, or PPTX to mdstill
  2. Wait a second -- most documents convert in under two seconds, with no queue or mode selection needed
  3. Download the Markdown output
  4. Chunk the Markdown by heading level for RAG, or use it directly in prompts
  5. Embed and store the chunks in your vector database
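Step 4 mentions chunking "by heading level"; the H2-only splitter shown earlier generalizes with the level as a parameter. A small sketch:

```python
import re

def chunk_by_heading(md_text: str, level: int = 2) -> list[str]:
    """Split markdown at headings of the given level (level=2 splits on '## ')."""
    marker = "#" * level + " "
    sections = re.split(rf"(?=^{re.escape(marker)})", md_text, flags=re.MULTILINE)
    return [s.strip() for s in sections if s.strip()]

report = "# Title\n\n## Revenue\n\ntext\n\n### Detail\n\nmore\n\n## Costs\n\nend\n"
chunks = chunk_by_heading(report, level=2)
```

Note that the lookahead only fires on exactly `## ` at the start of a line, so deeper `###` subsections stay attached to their parent chunk.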

For batch processing, you can use the mdstill API:

# Convert a document via the API
curl -X POST https://mdstill.com/api/convert \
  -F "file=@quarterly-report.pdf" \
  -F "mode=deep" \
  -o report.md

Format Comparison

A summary of how different input formats affect LLM performance:

| Factor              | Raw Text | HTML  | Markdown  |
| :------------------ | :------- | :---- | :-------- |
| Structure preserved | None     | Full  | Full      |
| Token efficiency    | Medium   | Poor  | High      |
| Table handling      | Broken   | Noisy | Clean     |
| LLM familiarity     | Low      | Low   | High      |
| Chunking quality    | Poor     | Fair  | Excellent |

Markdown hits the sweet spot: full structural preservation with minimal token overhead. For anyone building LLM applications over documents, converting to Markdown is not optional -- it is a prerequisite for quality results.

#ai #llm #rag #markdown #chatgpt
