Word to Markdown: The Complete Guide (.docx and .doc)
mdstill team
Engineering
Word is still where most business documents are written, and Markdown is where they increasingly need to live — in git repos, static site generators, wikis, and LLM context windows. This guide covers exactly what happens when you convert a .docx (or legacy .doc) to Markdown: what maps cleanly, what gets dropped, and the edge cases that trip people up.
What Converts Cleanly
Word's semantic structure has direct Markdown equivalents. These map one-to-one with no loss of meaning:
| Word feature | Markdown result |
|---|---|
| Heading 1–6 styles | # … ###### headings |
| Bold / italic | **bold** / *italic* |
| Bulleted lists | - lists (nesting preserved) |
| Numbered lists | 1. ordered lists (nesting preserved) |
| Tables | GFM pipe tables with a header row |
| Hyperlinks | [text](url) |
| Inline code / monospace | `code` |
| Block quotes | > quote |
The critical word above is styles. Markdown captures the role of a heading, not its appearance. If you built your document with Word's built-in Heading 1 / Heading 2 styles, the hierarchy comes through perfectly. If you faked a heading by manually bolding a 16pt line, there is no semantic signal to detect — it converts as a bold paragraph, not a heading.
Takeaway: documents written with real Word styles convert far better than visually-formatted ones. This is the single biggest predictor of output quality.
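To make the difference between a styled heading and a faked one concrete, here is a minimal sketch that inspects the paragraph XML inside a .docx. It is not the converter's actual detection logic. It assumes built-in heading style IDs start with "Heading" (true for English templates; localized templates may differ), and the names `classify_paragraphs` and `load_document_xml` are hypothetical helpers invented for this illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def load_document_xml(path):
    """Read the main document part from a .docx (a ZIP archive)."""
    with zipfile.ZipFile(path) as zf:
        return zf.read("word/document.xml").decode("utf-8")


def _is_bold(run):
    # <w:b/> turns bold on; <w:b w:val="0"/> or "false" turns it off.
    b = run.find("w:rPr/w:b", NS)
    return b is not None and b.get(f"{{{W}}}val", "true") not in ("0", "false")


def classify_paragraphs(document_xml):
    """Return (text, kind) pairs; kind is 'heading', 'fake-heading', or 'body'."""
    root = ET.fromstring(document_xml)
    results = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if not text.strip():
            continue
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W}}}val", "") if style is not None else ""
        runs = [r for r in p.findall("w:r", NS) if r.find("w:t", NS) is not None]
        if style_id.lower().startswith("heading"):
            kind = "heading"       # semantic signal: a real heading style
        elif runs and all(_is_bold(r) for r in runs):
            kind = "fake-heading"  # only visual: every run is bold, no style
        else:
            kind = "body"
        results.append((text, kind))
    return results


SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>Real heading</w:t></w:r></w:p>"
    '<w:p><w:r><w:rPr><w:b/><w:sz w:val="32"/></w:rPr>'
    "<w:t>Faked heading</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Plain body text.</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

for text, kind in classify_paragraphs(SAMPLE):
    print(f"{kind:12} {text}")
```

Running a check like this over a document before converting gives you a list of fully bold paragraphs that probably should have been headings, so you can apply real heading styles in Word first.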
What Gets Dropped
Markdown is intentionally minimal. Anything that is purely visual has no place to go:
- Fonts, font sizes, and colors — not part of the Markdown spec
- Text boxes and columns — flattened into the main document flow
- Headers, footers, and page numbers — page-specific layout that Markdown has no concept of
- Page breaks and section breaks — Markdown is a single continuous flow
- Custom (named) paragraph styles — only the built-in heading styles carry semantic meaning
None of this is a bug — it is the trade-off you accept for a portable, diff-able, token-efficient format. If you relied on a text box or a colored callout to carry meaning, restructure it as a heading, list, or blockquote before converting.
Tracked Changes and Comments
This is the edge case that catches people, especially with documents that have been through review:
- Tracked changes resolve to their accepted state. The Markdown reflects the document as if every pending edit were accepted — deleted text is gone, inserted text stays. Rejected-but-not-removed markup does not leak into the output.
- Comments are removed entirely. They live in a separate layer of the .docx and are not part of the document body.
If a document still has unresolved redlines, the safest move is to accept or reject all changes in Word first, so you control the final text rather than letting the conversion decide. See our Word to Markdown preserving formatting page for more on what structure survives.
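The accepted-state behaviour described above can be sketched against the underlying XML, where insertions are wrapped in `w:ins` and deletions in `w:del` (with the deleted text stored in `w:delText`). This is an illustrative sketch, not the converter's implementation. The function name `resolve` is hypothetical, and moved text and formatting-only revisions are not handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
T, DEL_TEXT = f"{{{W}}}t", f"{{{W}}}delText"
INS, DEL = f"{{{W}}}ins", f"{{{W}}}del"


def _collect(elem, skip_tag, text_tags, out):
    """Depth-first walk that skips one kind of revision wrapper entirely."""
    for child in elem:
        if child.tag == skip_tag:
            continue
        if child.tag in text_tags:
            out.append(child.text or "")
        _collect(child, skip_tag, text_tags, out)


def resolve(paragraph_xml, accept=True):
    """Return paragraph text with all tracked changes accepted or rejected."""
    root = ET.fromstring(paragraph_xml)
    out = []
    if accept:
        # Accept all: drop deletions, keep insertions and untouched text.
        _collect(root, DEL, {T}, out)
    else:
        # Reject all: drop insertions, restore the deleted text.
        _collect(root, INS, {T, DEL_TEXT}, out)
    return "".join(out)


REDLINED = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">The fee is </w:t></w:r>'
    '<w:del w:id="1" w:author="Reviewer"><w:r><w:delText>$500</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Reviewer"><w:r><w:t>$650</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> per month.</w:t></w:r>'
    "</w:p>"
)

print(resolve(REDLINED, accept=True))
print(resolve(REDLINED, accept=False))
```

The two calls produce different sentences from the same paragraph, which is why resolving redlines yourself before converting matters when the pending edits are still under discussion.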
Tables in Word
Word tables become GitHub-Flavored Markdown tables — a header row, a separator row, and pipe-delimited cells:
| Quarter | Revenue | Growth |
| :------ | ------: | -----: |
| Q1 | $4.2M | +12% |
| Q2 | $4.8M | +14% |
A few mechanical rules apply, because GFM tables are stricter than Word tables:
- The first row becomes the header — GFM has no headerless tables
- Merged cells are unmerged: content lands in the top-left cell of the merged range, and the remaining cells are left empty
- Pipe characters inside cells are escaped as \|
- Line breaks inside a cell collapse to spaces, since GFM cells can't contain newlines
Nested tables (a table inside a cell) have no GFM equivalent and are flattened. If your data is genuinely tabular, this is rarely a problem; if you were using a table for page layout, expect to clean up.
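The mechanical rules for table cells can be sketched as a small rendering function. This is a minimal illustration under stated assumptions, not the converter's code: `rows_to_gfm` and `cell_to_gfm` are hypothetical names, and the input is assumed to be a list of rows in which the covered cells of a merged range are represented as `None`.

```python
import re


def cell_to_gfm(value):
    """Make one cell safe for a GFM pipe table."""
    if value is None:
        return ""  # covered part of a merged range: left empty
    # Collapse line breaks (and the whitespace around them) to one space.
    text = re.sub(r"\s*[\r\n]+\s*", " ", str(value)).strip()
    # Escape pipes so they are not read as column separators.
    return text.replace("|", "\\|")


def rows_to_gfm(rows):
    """Render rows as a GFM table; the first row becomes the header."""
    width = max(len(row) for row in rows)
    padded = [list(row) + [None] * (width - len(row)) for row in rows]

    def line(cells):
        return "| " + " | ".join(cells) + " |"

    header, *body = padded
    out = [line([cell_to_gfm(c) for c in header]), line(["---"] * width)]
    out += [line([cell_to_gfm(c) for c in row]) for row in body]
    return "\n".join(out)


rows = [
    ["Quarter", "Notes"],
    ["Q1", "up\nsharply"],  # line break inside a cell
    ["Q2 | H1", None],      # literal pipe, plus a merged-away cell
]
print(rows_to_gfm(rows))
```

Short rows are padded to the widest row so every line has the same number of cells, which GFM renderers expect.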
.docx vs Legacy .doc
Both are supported in a single API call, but they take slightly different paths:
| Format | Path |
|---|---|
| .docx | Modern XML-based format — parsed directly |
| .doc | Legacy binary (pre-2007) — normalized to a modern document first |
For .doc, a server-side normalization step brings the old binary into a modern document structure, then it flows through the same pipeline as .docx. You don't do anything different — you upload the .doc and get Markdown back. The only practical difference is that .doc files take a second or two longer.
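If you want to know which path a file will take before uploading it (for example, to estimate batch time), the two formats can be told apart by their leading bytes: a .docx is a ZIP archive, while a legacy .doc is an OLE compound file. The sketch below is a client-side illustration only and says nothing about how the service itself routes files. `sniff_word_format` is a hypothetical name, and `word/document.xml` is the conventional location of the main document part rather than a guaranteed one.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"                          # ZIP local file header
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # OLE compound file header


def sniff_word_format(data: bytes) -> str:
    """Return 'docx', 'doc', or 'unknown' based on file content, not extension."""
    if data.startswith(OLE_MAGIC):
        # The OLE container is shared with legacy .xls and .ppt files,
        # so this only says the file is *probably* a .doc.
        return "doc"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as zf:
                # Other Office formats are ZIPs too; look for the Word part.
                if "word/document.xml" in zf.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            pass
    return "unknown"
```

Checking content rather than the file extension catches the common case of a .doc that was simply renamed to .docx, or the reverse.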
Why Convert Word to Markdown at All
A few workflows where the conversion pays off:
- Version control. Markdown diffs cleanly in git; .docx is a binary blob that shows up as "file changed" with no readable diff.
- Static site generators. Hugo, Docusaurus, and Astro all consume Markdown natively. Migrating a Word-based docs set is a bulk conversion job.
- Wikis and Obsidian. Pulling Word knowledge bases into a Markdown vault keeps everything searchable and linkable.
- LLM context. Feeding raw .docx to a model wastes tokens on XML scaffolding. Markdown preserves the structure a model needs (headings, lists, tables) while cutting the token count by roughly a third to a half versus raw document extraction. More on that in why Markdown matters for LLMs.
How to Convert
One-off file: drop the .docx or .doc onto the Word to Markdown converter and copy the result.
Many files or automation: the API takes a single multipart upload and returns Markdown, so you can batch-convert a folder of documents or wire conversion into a docs-publishing pipeline. The same endpoint handles both .docx and .doc.
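A batch job over a folder might look roughly like the sketch below, using only the Python standard library. Everything specific here is an assumption made for illustration: the `API_URL` is a placeholder and not the real endpoint, the form field name `file` is a guess, authentication is omitted, and the response is assumed to be the Markdown as plain UTF-8 text. Check the API documentation for the actual values.

```python
import uuid
from pathlib import Path
from urllib import request

API_URL = "https://api.example.com/convert"  # placeholder, NOT the real endpoint
FIELD_NAME = "file"                          # assumed form field name


def build_multipart(filename: str, payload: bytes, boundary: str = ""):
    """Build a single-file multipart/form-data body and its Content-Type."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{FIELD_NAME}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def convert_file(path: Path) -> str:
    """Upload one document and return the response body as text."""
    body, content_type = build_multipart(path.name, path.read_bytes())
    req = request.Request(
        API_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")  # assumed: raw Markdown text


def convert_folder(folder: Path) -> None:
    """Convert every .docx and .doc in a folder, writing .md files alongside."""
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() in {".docx", ".doc"}:
            markdown = convert_file(path)
            path.with_suffix(".md").write_text(markdown, encoding="utf-8")
```

Because both formats go through the same call, the loop does not branch on extension. Note that `report.doc` and `report.docx` in the same folder would both write to `report.md`, so a real script should guard against that collision.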
Frequently Asked Questions
Do my custom Word styles survive? Only via their semantic role. Built-in Heading 1/2/3 styles map to Markdown headings; custom named paragraph styles do not, because Markdown has no concept of named styles. Apply the built-in heading styles before exporting for the best result.
What happens to images embedded in the document? They are extracted and referenced from the Markdown. The text structure around them — captions, headings, surrounding paragraphs — is preserved so the document still reads in order.
Will footnotes and endnotes come through? Footnote references and their text are preserved as Markdown footnotes where the structure allows. Page-bound formatting around them (the horizontal rule, the page placement) does not carry over, because Markdown has no pages.
Is anything sent anywhere I should worry about? The document is converted and the Markdown is returned to you; nothing about the conversion requires keeping your file. For sensitive material, that single-pass model is the point — see the comparison guide for how that stacks up against general-purpose converters.
Word is the source of truth for writing; Markdown is the source of truth for publishing and processing. The conversion is the bridge — clean for everything semantic, lossy by design for everything purely visual.