Parsing

June 6, 2026

Word to Markdown: The Complete Guide (.docx and .doc)

mdstill team

Engineering

6 min read

Word is still where most business documents are written, and Markdown is where they increasingly need to live — in git repos, static site generators, wikis, and LLM context windows. This guide covers exactly what happens when you convert a .docx (or legacy .doc) to Markdown: what maps cleanly, what gets dropped, and the edge cases that trip people up.

What Converts Cleanly

Word's semantic structure has direct Markdown equivalents. These map one-to-one with no loss of meaning:

| Word feature | Markdown result |
| :----------- | :-------------- |
| Heading 1–6 styles | # through ###### headings |
| Bold / italic | **bold** / *italic* |
| Bulleted lists | - lists (nesting preserved) |
| Numbered lists | 1. ordered lists (nesting preserved) |
| Tables | GFM pipe tables with a header row |
| Hyperlinks | [text](url) |
| Inline code / monospace | `code` |
| Block quotes | > quote |

The critical word above is styles. Markdown captures the role of a heading, not its appearance. If you built your document with Word's built-in Heading 1 / Heading 2 styles, the hierarchy comes through perfectly. If you faked a heading by manually bolding a 16pt line, there is no semantic signal to detect — it converts as a bold paragraph, not a heading.

Takeaway: documents written with real Word styles convert far better than visually-formatted ones. This is the single biggest predictor of output quality.
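
To make the style-versus-appearance distinction concrete, here is a minimal Python sketch of how a converter might decide between a heading and a bold paragraph. The helper name `paragraph_to_markdown` and its parameters are hypothetical, chosen for illustration; this is not the actual conversion code.

```python
import re

# Matches Word's built-in heading styles, e.g. "Heading 1" or the style ID "Heading3".
_HEADING_STYLE = re.compile(r"^heading\s*([1-6])$", re.IGNORECASE)


def paragraph_to_markdown(text: str, style_name: str, bold: bool = False) -> str:
    """Render one paragraph from its style name (hypothetical helper)."""
    match = _HEADING_STYLE.match(style_name.strip())
    if match:
        level = int(match.group(1))
        return "#" * level + " " + text
    # No heading style: bold carries over, but font size is ignored,
    # so a "faked" heading stays a bold paragraph.
    return f"**{text}**" if bold else text


print(paragraph_to_markdown("Quarterly Results", "Heading 2"))          # ## Quarterly Results
print(paragraph_to_markdown("Quarterly Results", "Normal", bold=True))  # **Quarterly Results**
```

The function never looks at font size, so a manually enlarged bold line gives it nothing to act on.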

What Gets Dropped

Markdown is intentionally minimal. Anything that is purely visual has no place to go:

  • Fonts, font sizes, and colors — not part of the Markdown spec
  • Text boxes and columns — flattened into the main document flow
  • Headers, footers, and page numbers — page-specific layout that Markdown has no concept of
  • Page breaks and section breaks — Markdown is a single continuous flow
  • Custom (named) paragraph styles — only the built-in heading styles carry semantic meaning

None of this is a bug — it is the trade-off you accept for a portable, diff-able, token-efficient format. If you relied on a text box or a colored callout to carry meaning, restructure it as a heading, list, or blockquote before converting.

Tracked Changes and Comments

This is the edge case that catches people, especially with documents that have been through review:

  • Tracked changes resolve to their accepted state. The Markdown reflects the document as if every pending edit were accepted — deleted text is gone, inserted text stays. Rejected-but-not-removed markup does not leak into the output.
  • Comments are removed entirely. They live in a separate layer of the .docx and are not part of the document body.

If a document still has unresolved redlines, the safest move is to accept or reject all changes in Word first, so you control the final text rather than letting the conversion decide. See our Word to Markdown preserving formatting page for more on what structure survives.
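
For the curious, the "accepted state" has a simple shape in the underlying XML. In WordprocessingML, inserted runs sit inside `w:ins` and keep their text in `w:t`, while deleted runs sit inside `w:del` and store their text in `w:delText`. The sketch below, using only the Python standard library, reads one paragraph as if every change were accepted. It is an illustration under those assumptions, not the converter's implementation, and it ignores other revision types such as moved text.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def accepted_text(paragraph_xml: str) -> str:
    """Return a paragraph's text as if all tracked changes were accepted."""
    root = ET.fromstring(paragraph_xml)
    # Deleted runs use <w:delText>, so collecting only <w:t> drops them
    # and keeps both ordinary and inserted text.
    return "".join(node.text or "" for node in root.iter(f"{{{W}}}t"))


SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t xml:space="preserve">Revenue grew </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>12%</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>14%</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> in Q2.</w:t></w:r>'
    "</w:p>"
)

print(accepted_text(SAMPLE))  # Revenue grew 14% in Q2.
```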

Tables in Word

Word tables become GitHub-Flavored Markdown tables — a header row, a separator row, and pipe-delimited cells:

| Quarter | Revenue | Growth |
| :------ | ------: | -----: |
| Q1      |   $4.2M |   +12% |
| Q2      |   $4.8M |   +14% |

A few mechanical rules apply, because GFM tables are stricter than Word tables:

  • The first row becomes the header — GFM has no headerless tables
  • Merged cells are unmerged: content lands in the top-left cell of the merged range, and the remaining cells are left empty
  • Pipe characters inside cells are escaped as \|
  • Line breaks inside a cell collapse to spaces, since GFM cells can't contain newlines

Nested tables (a table inside a cell) have no GFM equivalent, so their content is flattened into the surrounding cell. If your data is genuinely tabular, this is rarely a problem; if you were using a table for page layout, expect to clean up.
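
The cell-level rules listed above are mechanical enough to sketch in a few lines of Python. The function names `escape_cell` and `to_gfm_table` are hypothetical, and merged cells are assumed to arrive already unmerged as short or empty-padded rows; treat this as an illustration rather than the converter's own code.

```python
def escape_cell(value: str) -> str:
    """Make one cell's text safe for a GFM table row."""
    # Collapse line breaks (and any other whitespace runs) to single spaces.
    flat = " ".join(value.split())
    # Escape pipes so they are not read as column separators.
    return flat.replace("|", "\\|")


def to_gfm_table(rows: list[list[str]]) -> str:
    """Render rows as a GFM pipe table; the first row becomes the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows with empty cells, as happens after unmerging.
    padded = [
        [escape_cell(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    lines = ["| " + " | ".join(row) + " |" for row in padded]
    # GFM requires a separator row directly under the header.
    lines.insert(1, "| " + " | ".join(["---"] * width) + " |")
    return "\n".join(lines)


print(to_gfm_table([
    ["Quarter", "Notes"],
    ["Q1", "Up 12%\nvs. plan"],
    ["Q2", "A|B test"],
]))
```

In the printed table, the Q1 note comes out on one line as "Up 12% vs. plan", and the Q2 note becomes "A\|B test".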

.docx vs Legacy .doc

Both are supported in a single API call, but they take slightly different paths:

| Format | Path |
| :----- | :--- |
| .docx | Modern XML-based format, parsed directly |
| .doc | Legacy binary (pre-2007), normalized to a modern document first |

For .doc, a server-side normalization step brings the old binary into a modern document structure, then it flows through the same pipeline as .docx. You don't do anything different — you upload the .doc and get Markdown back. The only practical difference is that .doc files take a second or two longer.
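
If you want to tell the two formats apart yourself before uploading, the file's first bytes are more reliable than its extension: a .docx is a ZIP container, and a legacy .doc is an OLE2 compound file. The Python sketch below shows that check plus the two paths described above. The step names in `conversion_path` are illustrative labels, not real pipeline stages.

```python
ZIP_MAGIC = b"PK\x03\x04"  # ZIP local file header; .docx is a ZIP container
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file; legacy .doc


def sniff_word_format(header: bytes) -> str:
    """Guess the Word format from the first bytes of a file."""
    # Note: a ZIP signature alone does not prove the archive is a .docx.
    if header.startswith(ZIP_MAGIC):
        return "docx"
    if header.startswith(OLE2_MAGIC):
        return "doc"
    return "unknown"


def conversion_path(header: bytes) -> list[str]:
    """Return the illustrative sequence of steps for a given file."""
    kind = sniff_word_format(header)
    if kind == "docx":
        return ["parse", "emit_markdown"]
    if kind == "doc":
        # Legacy binaries get one extra normalization step up front.
        return ["normalize", "parse", "emit_markdown"]
    raise ValueError("not a recognized Word file")


print(conversion_path(b"PK\x03\x04..."))  # ['parse', 'emit_markdown']
```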

Why Convert Word to Markdown at All

A few workflows where the conversion pays off:

  • Version control. Markdown diffs cleanly in git; .docx is a binary blob that shows up as "file changed" with no readable diff.
  • Static site generators. Hugo, Docusaurus, and Astro all consume Markdown natively. Migrating a Word-based docs set is a bulk conversion job.
  • Wikis and Obsidian. Pulling Word knowledge bases into a Markdown vault keeps everything searchable and linkable.
  • LLM context. Feeding raw .docx to a model wastes tokens on XML scaffolding. Markdown preserves the structure a model needs — headings, lists, tables — while cutting the token count by roughly a third to a half versus raw document extraction. More on that in why Markdown matters for LLMs.

How to Convert

One-off file: drop the .docx or .doc onto the Word to Markdown converter and copy the result.

Many files or automation: the API takes a single multipart upload and returns Markdown, so you can batch-convert a folder of documents or wire conversion into a docs-publishing pipeline. The same endpoint handles both .docx and .doc.
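
As a rough sketch of the batch case, the Python below builds a multipart upload with the standard library and writes a .md file next to each Word document in a folder. Everything specific to the API here is an assumption: `API_URL` is a placeholder, the form field name `file` is a guess, authentication is omitted, and the response body is assumed to be the raw Markdown. Check the API reference for the real values before using anything like this.

```python
import urllib.request
import uuid
from pathlib import Path

# ASSUMPTION: placeholder endpoint and field name, not the real API.
API_URL = "https://api.example.com/v1/convert"
FIELD_NAME = "file"


def build_multipart(filename: str, payload: bytes, boundary: str) -> bytes:
    """Encode one file as a multipart/form-data request body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{FIELD_NAME}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def convert_file(path: Path) -> str:
    """Upload one document; assumes the response body is the Markdown."""
    boundary = uuid.uuid4().hex
    body = build_multipart(path.name, path.read_bytes(), boundary)
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def convert_folder(folder: Path) -> None:
    """Write a .md file next to every .docx and .doc in the folder."""
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() in {".docx", ".doc"}:
            path.with_suffix(".md").write_text(convert_file(path), encoding="utf-8")
```

Calling `convert_folder(Path("docs"))` would then process the folder one file at a time; a real pipeline would likely add retries and error handling around `convert_file`.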

Frequently Asked Questions

Do my custom Word styles survive? Only via their semantic role. Built-in Heading 1/2/3 styles map to Markdown headings; custom named paragraph styles do not, because Markdown has no concept of named styles. Apply the built-in heading styles before converting for the best result.

What happens to images embedded in the document? They are extracted and referenced from the Markdown. The text structure around them — captions, headings, surrounding paragraphs — is preserved so the document still reads in order.

Will footnotes and endnotes come through? Footnote references and their text are preserved as Markdown footnotes where the structure allows. Page-bound formatting around them (the horizontal rule, the page placement) is not preserved, because Markdown has no pages.

Is anything sent anywhere I should worry about? The document is converted and the Markdown is returned to you; nothing about the conversion requires keeping your file. For sensitive material, that single-pass model is the point — see the comparison guide for how that stacks up against general-purpose converters.

Word is the source of truth for writing; Markdown is the source of truth for publishing and processing. The conversion is the bridge — clean for everything semantic, lossy by design for everything purely visual.

#word #docx #doc #markdown #conversion
