How to Convert PDF Tables to Clean Markdown
Tables are where most PDF-to-Markdown tools fall apart. A financial report, an academic paper, a government filing — they all lean on tabular data. Yet the average converter will turn a clean 5-column table into a mess of misaligned text. This article explains why that happens, what mdstill can handle, and when you need something else.
Why Tables Break
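A PDF page is not a structured document. It is a set of drawing instructions: place glyph "A" at coordinates (72, 540), draw a line from (70, 530) to (400, 530), and so on. The page content itself has no concept of "row" or "cell". The PDF specification does define optional structure tags that can describe tables (Tagged PDF), but many PDFs omit them, so an extractor usually has only positions to work with.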
When a naive extractor runs over this, it reads text left-to-right, top-to-bottom. That works for paragraphs, but tables require understanding vertical alignment across columns. Common failure modes include:
- Column drift: text from one column bleeds into the next
- Row merging: multi-line cell content gets collapsed into a single line or split across rows
- Header loss: the first row is not recognized as a header
- Merged cells: spanning cells are duplicated or dropped entirely
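To make the coordinate problem concrete, here is a minimal sketch of how rows can be recovered from positioned text. The `fragments` list and the `group_into_rows` helper are hypothetical, made up for illustration; this is not how mdstill or any particular extractor is implemented.

```python
from collections import defaultdict

# Hypothetical positioned fragments (x, y, text), in scrambled drawing order,
# roughly as a PDF content stream might present them.
fragments = [
    (300, 540, "Q2 2026"), (72, 540, "Metric"), (200, 540, "Q1 2026"),
    (72, 520, "Revenue"), (300, 520, "$4.8M"), (200, 520, "$4.2M"),
]


def group_into_rows(fragments, y_tolerance=2):
    """Bucket fragments by y coordinate, then order each row by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y into a bucket so slightly offset baselines share a row.
        rows[round(y / y_tolerance)].append((x, text))
    # PDF y coordinates grow upward, so larger y means higher on the page.
    ordered = sorted(rows.items(), key=lambda item: -item[0])
    return [[text for _, text in sorted(cells)] for _, cells in ordered]


for row in group_into_rows(fragments):
    print(row)
```

Fixed-size buckets are the simplest possible choice and will split a row whose baselines straddle a bucket boundary. Multi-line cells and spanning cells need more than this, which is where the failure modes above come from.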
GFM Table Format
GitHub Flavored Markdown (GFM) defines a simple pipe-based table syntax that has become the de facto standard:
| Metric | Q1 2026 | Q2 2026 | Change |
| :----------- | ------: | ------: | -----: |
| Revenue | $4.2M | $4.8M | +14% |
| Active Users | 12,400 | 15,100 | +22% |
| Churn Rate | 3.1% | 2.7% | -0.4% |
Key rules: every table needs a header row, followed by a separator row of dashes (with optional colons for alignment) that has the same number of cells as the header, then zero or more data rows. Each cell is delimited by pipes. This is what mdstill targets as output.
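The rules above are small enough to express in a few lines. The following is a sketch of a renderer, assuming a hypothetical `to_gfm_table` helper written for this article; it is not part of mdstill.

```python
def to_gfm_table(header, rows, align=None):
    """Render a header and data rows as a GFM pipe table.

    align holds "left", "right", "center", or None for each column.
    """
    align = align or [None] * len(header)
    markers = {"left": ":---", "right": "---:", "center": ":---:", None: "---"}

    def line(cells):
        # Escape literal pipes so they do not split a cell in two.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = line(markers[a] for a in align)
    return "\n".join([line(header), separator] + [line(r) for r in rows])


print(to_gfm_table(
    ["Metric", "Q1 2026"],
    [["Revenue", "$4.2M"], ["Churn Rate", "3.1%"]],
    align=["left", "right"],
))
```

The output is not padded to equal column widths. GFM does not require padding; it only makes the raw text easier to read.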
What mdstill Handles Well
mdstill's PDF conversion is optimized for the common case: digitally created PDFs where text is selectable and tables have clear row/column structure. In practice, that covers most of the PDFs you'll encounter as business reports, documentation exports, and web-sourced documents.
Feed it a PDF with a standard table and you'll get something like this in under a second:
| Line Item | 2025 ($M) | 2026 ($M) |
| :----------------- | --------: | --------: |
| Net Revenue | 142.3 | 168.7 |
| Cost of Goods Sold | 89.1 | 97.4 |
| Gross Profit | 53.2 | 71.3 |
No signup, no queue, no ML model download. Just upload and get clean GFM tables.
What mdstill Does Not Do
It's worth being direct about the limits so you can pick the right tool:
- No OCR. mdstill extracts text from the PDF's embedded text layer. Scanned PDFs — images of pages without a text layer — will return empty or near-empty output. Run them through an OCR tool (Tesseract, ocrmypdf, a cloud OCR service) first, then convert the OCR output.
- Complex merged cells may drift. Tables with multi-level headers ("Revenue" spanning "Domestic" and "International" columns) can lose their parent grouping. The cell values usually make it through, but the hierarchical header structure may collapse.
- Formulas and math-heavy layouts are not mdstill's strength. Equations are extracted as plain text where possible and may be garbled. For academic math papers, you'll want a tool built for that niche.
For these cases, look at dedicated parsers like LlamaParse, Marker, Reducto, or Unstructured. They cost more (latency, money, or both) but handle the edge cases better. mdstill is the right choice when you need speed and simplicity on the common case, not the 5% of hard documents.
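Since a scanned PDF shows up as empty or near-empty output, a quick check on the extracted text can tell you when to reach for OCR. This is a rough heuristic sketch: the `looks_scanned` function, the 20-character threshold, and the "more than half the pages" rule are all assumptions for illustration, not tuned values or mdstill behavior.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF lacks a usable text layer.

    page_texts: one extracted-text string per page, from any extractor.
    """
    if not page_texts:
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    # Treat the document as scanned when most pages are empty or near-empty.
    return sparse / len(page_texts) > 0.5


print(looks_scanned(["", "  ", "\n"]))
print(looks_scanned(["Net Revenue 142.3 168.7 Cost of Goods Sold 89.1 97.4"]))
```

A document that mixes scanned and digital pages can fall on either side of the threshold, so checking page by page may be more useful in practice.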
Practical Tips
- Use mdstill for digitally created PDFs where text is selectable. If you can highlight and copy text from the PDF in a normal viewer, mdstill will work well.
- Run OCR first for scans. `ocrmypdf input.pdf output.pdf` adds a text layer to a scanned PDF; then feed the output to mdstill.
- Check the first row. If a header row looks like body content, the PDF may have a non-standard heading style. Most of the time you can manually mark it as a header in post-processing.
- Split large documents. If you only need tables from specific pages, extract those pages first (`pdftk`, `qpdf`, or any PDF splitter) to reduce noise in the output.
- Post-process if needed. mdstill outputs clean GFM, but you may want to add bold to header cells or adjust alignment colons for your specific use case.
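As one possible take on the post-processing tip, the sketch below bolds header cells and right-aligns columns whose data cells all look numeric. The `polish_table` helper and the `NUMERIC` pattern are hypothetical and deliberately simple: they assume a well-formed GFM table and do not handle escaped pipes inside cells.

```python
import re

# Matches values such as 142.3, 12,400, $4.2M, +14% or -0.4%.
NUMERIC = re.compile(r"^[+-]?\$?[\d,]*\.?\d+[%MKB]?$")


def split_row(line):
    """Split one pipe-table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def format_row(cells):
    return "| " + " | ".join(cells) + " |"


def polish_table(markdown):
    """Bold header cells and right-align all-numeric columns."""
    lines = markdown.strip().splitlines()
    header = split_row(lines[0])
    body = [split_row(line) for line in lines[2:]]  # lines[1] is the separator
    aligns = []
    for col in range(len(header)):
        values = [row[col] for row in body if col < len(row)]
        numeric = bool(values) and all(NUMERIC.match(v) for v in values)
        aligns.append("---:" if numeric else ":---")
    bold_header = [f"**{cell}**" for cell in header]
    return "\n".join(
        [format_row(bold_header), format_row(aligns)] + [format_row(r) for r in body]
    )


table = "| Metric | Q1 |\n|---|---|\n| Revenue | $4.2M |\n| Churn | 3.1% |"
print(polish_table(table))
```

Any alignment already present in the separator row is overwritten, so treat this as a starting point to adapt.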
When to Use What
| Your document | Best tool |
|---|---|
| Digital PDF, text-heavy, simple tables | mdstill (fast, free, instant) |
| Digital PDF, complex multi-level tables | Dedicated parser (LlamaParse, Marker) |
| Scanned PDF with no text layer | OCR first (ocrmypdf), then mdstill |
| Math/scientific paper with many formulas | Math-aware tool (Nougat, Mathpix) |
| Web page HTML | mdstill (drop the HTML file or use the API) |
Being honest about the tradeoffs saves everyone time. Most PDFs are simple enough that mdstill's speed and ergonomics are the right call — and for the rest, combining tools usually beats chasing a single "best" parser.