Optimizing PDF Extraction for LLMs
Sarah Chen
ML Engineer
When preparing documents for LLM consumption, the quality of the input Markdown directly impacts the quality of AI outputs. PDFs present a unique challenge because they are designed for visual rendering, not semantic parsing.
The Problem
PDF documents store text as positioned glyphs rather than structured content. A table in a PDF is just a collection of text strings at specific coordinates, with optional line drawings. Standard extraction tools produce garbled output when tables have merged cells, multi-line content, or complex headers.
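To make the failure mode concrete, here is a minimal sketch (the glyph tuples are invented illustration, not output from any real tool) of why coordinate-sorted extraction breaks a table whose cell content wraps onto a second line:

```python
# Hypothetical glyph data for a 2-column table; each tuple is
# (x, y, text), with y increasing down the page. The second APAC
# cell wraps onto a second line at y=155.
glyphs = [
    (50, 100, "Region"),   (200, 100, "Revenue"),
    (50, 120, "EMEA"),     (200, 120, "1,200"),
    (50, 140, "APAC"),     (200, 140, "quarterly"),
    (200, 155, "pending"),  # continuation line of the APAC cell
]

# Naive extraction: sort top-to-bottom, left-to-right, join with spaces.
naive = " ".join(t for _, _, t in sorted(glyphs, key=lambda g: (g[1], g[0])))
print(naive)
# The wrapped word "pending" is emitted as though it started a new row,
# so the reader (or LLM) can no longer tell which cell it belonged to.
```

The grid structure that a human sees instantly is simply absent from the byte stream; it has to be reconstructed.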
"Garbage in, garbage out. If your PDF extraction produces broken tables, your LLM will hallucinate the missing structure."
The impact is particularly severe for:
- Financial reports with nested tables
- Academic papers with equation-heavy content
- Government documents with multi-column layouts
Our Approach
We developed a three-stage pipeline:
- Geometric Analysis: Detect table boundaries using line detection and whitespace clustering
- Cell Reconstruction: Map text positions to logical grid cells using a constraint satisfaction algorithm
- Semantic Enrichment: Apply heuristics to identify headers, totals rows, and data types
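The whitespace-clustering idea in the first stage can be sketched as follows (the `min_gap` threshold and the function itself are illustrative assumptions, not the production implementation): column boundaries emerge wherever the horizontal gaps between text elements are large.

```python
# Illustrative whitespace clustering for stage 1: group the x-coordinates
# of text elements into columns by splitting on large horizontal gaps.
# min_gap=30 is an assumed threshold for demonstration only.
def cluster_columns(x_positions, min_gap=30):
    """Group sorted x-coordinates into columns separated by >= min_gap."""
    columns = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] < min_gap:
            columns[-1].append(x)   # small gap: same column
        else:
            columns.append([x])     # large gap: start a new column
    # Represent each column by the mean of its member coordinates.
    return [sum(c) / len(c) for c in columns]

print(cluster_columns([50, 52, 55, 200, 203, 410]))  # three column centers
```

A similar pass over y-coordinates yields candidate row boundaries, giving the grid that the next stage fills in.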
```python
# Simplified cell reconstruction
def reconstruct_table(page_elements):
    grid = detect_grid_lines(page_elements)
    cells = map_text_to_cells(page_elements, grid)
    return normalize_to_markdown(cells)
```
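The semantic-enrichment stage can also be illustrated with a small self-contained sketch. The two rules below (a row with no numeric cells is a header; a row whose first cell contains "total" is a totals row) are assumed examples of the kind of heuristic involved, not the actual rule set:

```python
# Sketch of stage-3 heuristics; the specific rules are illustrative
# assumptions, not the production classifier.
def is_numeric(cell):
    """Treat comma-grouped numbers like '1,200' as numeric."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False

def classify_row(cells):
    if cells and "total" in cells[0].lower():
        return "totals"          # e.g. a "Total" summary row
    if not any(is_numeric(c) for c in cells):
        return "header"          # no numbers anywhere: likely a header
    return "data"

rows = [["Region", "Revenue"], ["EMEA", "1,200"], ["Total", "1,200"]]
print([classify_row(r) for r in rows])  # ['header', 'data', 'totals']
```

Labels like these let the Markdown emitter render the header row with a separator line and keep totals visually distinct, which is exactly the structure an LLM needs to avoid inventing one.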
Results
After implementing this pipeline, table extraction accuracy improved significantly, especially for complex layouts with merged cells and multi-column headers. More importantly, downstream LLM tasks showed noticeably better factual accuracy when given structurally preserved Markdown rather than raw text extraction.