April 17, 2026

Token-Aware Chunking for RAG: Why Fixed-Size Splitting Fails


7 min read

If you are building a RAG pipeline, there is a boring detail that quietly decides whether your retrieval works or not: how you split documents into chunks. Most tutorials reach for the simplest splitter — fixed-size by characters or tokens — because it is one line of code. It is also the reason retrieval quality degrades the moment you feed it real documents.

This article explains what actually breaks when you chunk naïvely, what "token-aware" and "semantic" chunking mean in practice, and how to get both for free when your input is Markdown.

The Default Splitter Is the Problem

Pick up any LangChain or LlamaIndex tutorial and you will see something like:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_text(text)

This takes a string, counts characters (or tokens, if configured), and slices every N characters with some overlap. It does not know what a heading is. It does not know what a table is. It does not know where a section ends. It is a blind knife.

On a clean blog post, that is fine — prose is uniform. On a real PDF-derived document with tables, code blocks, lists, and nested headings, it produces three failure modes that silently ruin retrieval.
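To make the failure concrete, here is a minimal sketch of fixed-size splitting — a toy stand-in for RecursiveCharacterTextSplitter, not its actual implementation: character windows with overlap, no structural awareness — run against a small Markdown table:

```python
def fixed_size_split(text: str, chunk_size: int, overlap: int) -> list[str]:
    # Slide a fixed-size character window over the text, ignoring all structure.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

table = (
    "| Quarter | Revenue | Growth |\n"
    "|---------|---------|--------|\n"
    "| Q1 2025 | 14.2M   | +3%    |\n"
    "| Q2 2025 | 15.1M   | +6%    |\n"
    "| Q3 2025 | 16.0M   | +6%    |\n"
    "| Q4 2025 | 17.8M   | +11%   |\n"
)

chunks = fixed_size_split(table, chunk_size=100, overlap=20)
# The cut lands wherever the character budget runs out -- typically mid-row,
# so no single chunk holds a complete, answerable table.
for i, c in enumerate(chunks):
    print(f"--- chunk {i} ---\n{c}")
```

The first chunk ends mid-row, and no chunk contains the whole table — exactly the failure modes below.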

Failure 1: Tables Get Sliced

Imagine a financial report with a 10-row table of quarterly numbers. A fixed-size splitter hits the character budget partway through a row and cuts. Your chunks now look like:

Chunk A (ends here):

| Q1 2025 | 14.2M | +3% |
| Q2 2025 | 15.1M | +6% |
| Q3 2025 | 16.

Chunk B (starts here):

0M | +6% |
| Q4 2025 | 17.8M | +11% |

When you embed these, neither chunk contains a coherent answer to "what was Q3 revenue?" The number has been torn across two embeddings. Retrieval may return chunk A or chunk B, but neither gives the LLM enough context to answer.

Failure 2: Headings Separated From Content

A heading and its section belong together; that coupling is how readers make sense of documents. A naïve splitter happily puts a heading at the end of one chunk and its body at the start of the next:

Chunk A (ends here):

...regulations allow for extended deferral periods in limited circumstances.

## Eligible Entities for Deferral

Chunk B (starts here):

The following entity types qualify under subsection 4.2...

When the user asks "which entities can defer?", retrieval needs to match their query to the content. But the content's semantic anchor — its heading — lives in the previous chunk. You retrieve the wrong chunk or a lower-scoring one.

Failure 3: Code Blocks Mangled

Same story for fenced code blocks. The splitter does not know ``` opens a region that should stay atomic:

Chunk A (ends here):

def process_invoice(path):
    doc = load(path)
    for line in doc.line

Chunk B (starts here):

_items:
        yield line.total

Now both chunks contain broken Python. Worse, the LLM may quote this back to the user as if it were correct code.

What "Token-Aware" and "Semantic" Actually Mean

Two terms get thrown around and conflated. They are different axes:

  • Token-aware: chunks are sized by LLM tokens, not characters. This matters because embedding models and LLMs charge and truncate by tokens, not characters. A 1000-character chunk can be anywhere from 200 to 800 tokens depending on language and content. Token-aware chunking uses a real tokenizer (tiktoken for OpenAI, the native tokenizer for Anthropic or open-source models) to measure size accurately.
  • Semantic: chunks respect document structure. Headings stay attached to their sections. Tables are not split mid-row. Code blocks are not split mid-line. Lists stay together (or at least only split between list items, not mid-item).
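You can see the character/token divergence with a crude experiment. The tokenizer below is a toy proxy (one token per word or punctuation mark), not a real BPE vocabulary — with tiktoken you would call `tiktoken.get_encoding("cl100k_base").encode(text)` instead — but the chars-per-token spread it reveals is the same phenomenon:

```python
import re

def toy_tokens(text: str) -> list[str]:
    # Toy proxy: one token per word or punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

prose = "the quick brown fox jumps over the lazy sleeping dog today"
code = "x=[i**2 for i in y];z={'a':1,'b':2};print(x,z)"

for sample in (prose, code):
    toks = toy_tokens(sample)
    print(f"{len(sample)} chars -> {len(toks)} tokens "
          f"({len(sample) / len(toks):.1f} chars/token)")
```

Prose packs several characters into each token; punctuation-dense code is closer to one token per character. Sizing chunks by characters treats both the same, which is exactly the problem.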

A good chunker does both. It targets a token budget (e.g., 500 tokens per chunk) but will extend a chunk rather than break atomic content, and will start new chunks at natural boundaries (section breaks, paragraph breaks) instead of arbitrary character positions.

Why Markdown Input Makes This Easy

Here is the quiet leverage: if your input is already structured Markdown, "semantic" becomes almost free.

Markdown gives the chunker explicit signals:

  • #, ##, ### — heading hierarchy, exact section boundaries
  • |---|---| — table delimiter row; the table runs until the next blank line
  • ``` — fenced code block, atomic until closing fence
  • -, *, 1. — list items at known indentation
  • Blank line — paragraph break, safe split point

A chunker that understands Markdown can walk the document, keep a running token count, and decide at each structural boundary whether it is cheaper to start a new chunk or extend the current one — while never breaking atomic blocks.
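That walk can be sketched in a few dozen lines. This is a simplified illustration, not mdstill's implementation: a whitespace word count stands in for a real tokenizer, and it handles only headings, fenced code, and blank-line-separated blocks (tables survive because their rows have no blank lines between them):

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer (e.g. tiktoken): whitespace words only.
    return len(text.split())

def parse_blocks(md: str) -> list[str]:
    """Split Markdown into structural blocks, keeping fenced code atomic."""
    blocks, current, in_fence = [], [], False
    for line in md.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence
            current.append(line)
            if not in_fence:  # closing fence ends the atomic block
                blocks.append("\n".join(current))
                current = []
        elif in_fence:
            current.append(line)
        elif line.strip() == "":
            if current:  # blank line = safe block boundary
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def chunk(md: str, budget: int = 500) -> list[str]:
    chunks, current, used = [], [], 0
    for block in parse_blocks(md):
        n = count_tokens(block)
        is_heading = block.startswith("#")
        # Start a new chunk at a heading or when the budget would overflow --
        # but an oversized atomic block still goes out whole.
        if current and (is_heading or used + n > budget):
            carry = []
            if not is_heading and current[-1].startswith("#"):
                carry = [current.pop()]  # keep a dangling heading with its body
            if current:
                chunks.append("\n\n".join(current))
            current = carry
            used = sum(count_tokens(b) for b in current)
        current.append(block)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

FENCE = "`" * 3  # avoids nesting a literal fence inside this example
md = "\n".join([
    "# Title", "",
    "Intro para words here.", "",
    FENCE + "python", "x = 1", "y = 2", FENCE, "",
    "## Section", "",
    "Body of section.",
])
for i, c in enumerate(chunk(md, budget=5)):
    print(f"--- chunk {i} ---\n{c}\n")
```

Note the `carry` step: when the budget overflows right after a heading, the heading moves forward with its body instead of being stranded at the end of the previous chunk — the fix for Failure 2.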

This is the opposite of the "string of characters, cut every 1000" approach. It is what structured chunking looks like when the input is structured.

Token-Aware Chunking with mdstill

This is what /api/convert/structured does in mdstill. You point it at a PDF, DOCX, or any supported document, and it returns Markdown plus a pre-computed list of chunks that are both token-aware and semantic.

Here is the minimum call:

curl -X POST https://mdstill.com/api/convert/structured \
  -F "file=@report.pdf" \
  -F "chunk_tokens=500"

You get back JSON with a structure.chunks array. Each chunk has:

  • index — position in the document
  • text — the Markdown content
  • tokens — exact token count
  • heading_path — breadcrumb of parent headings (e.g., "Methods > Data Collection")
  • element_types — what kinds of blocks are inside (paragraph, table, code, list)

The chunk_tokens parameter is a soft target. Atomic blocks — tables, code fences, list items — are never split. A chunk containing a 700-token table will exceed the 500-token budget rather than tear the table apart. The documentation is explicit about this tradeoff.

The heading_path is the detail most tutorials miss. When you embed a chunk, you are embedding its text alone — the LLM does not see "this came from the Methods section." With heading_path, you can prepend the breadcrumb to each chunk's text before embedding, giving the embedding model semantic anchoring that dramatically improves retrieval on documents with repeated terms across sections.

Full Python example:

import requests

with open("report.pdf", "rb") as f:
    response = requests.post(
        "https://mdstill.com/api/convert/structured",
        files={"file": f},
        data={"chunk_tokens": 500},
    )
data = response.json()

for chunk in data["structure"]["chunks"]:
    text_for_embedding = f"{chunk['heading_path']}\n\n{chunk['text']}"
    # embed(text_for_embedding) → store in your vector DB
    print(f"Chunk {chunk['index']}: {chunk['tokens']} tokens, "
          f"section = {chunk['heading_path']}")

That is your entire chunking stage. No custom splitter, no heuristics, no post-processing to glue headings back to content.

When You Still Need Custom Chunking

Token-aware semantic chunking is a strong default, but there are cases where you still need to customize:

  • Domain-specific atomic units. If your documents have custom structures — legal clauses numbered in a specific scheme, medical protocol sections, log-format records — you may want chunks that align to those boundaries. In that case, use the Markdown output and apply your own splitter on top.
  • Very small target sizes. If you need chunks under 100 tokens (e.g., for a specific embedding model with a tiny context), semantic chunking may produce chunks that exceed your budget because they contain an atomic block. You will need to decide whether to break the atomic block or move to a different embedding model.
  • Multi-document deduplication. Semantic chunking is per-document. If you have near-duplicate chunks across multiple versions of a document, you need a dedup step on top.
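For the last case, a minimal dedup pass over chunk texts might look like the sketch below. It catches exact duplicates only, after whitespace and case normalization; genuinely near-duplicate text (reworded sentences across versions) would need shingling, MinHash, or embedding similarity on top:

```python
import hashlib
import re

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Drop whitespace/case-insensitive exact duplicates, keeping the first occurrence."""
    seen, unique = set(), []
    for chunk in chunks:
        normalized = re.sub(r"\s+", " ", chunk).strip().lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

print(dedup_chunks(["Revenue grew 6%.", "revenue  grew 6%.", "Costs fell."]))
```

Hashing the normalized text keeps memory bounded when comparing chunks across many document versions, since only digests are stored.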

For the 90% case — PDFs, Word docs, HTML pages, reports, papers, manuals — token-aware semantic chunking is what you want. Building it yourself is a multi-week project. Using it off the shelf is one API call.

Summary

Fixed-size chunking is a demo-quality technique. In production, it breaks tables, separates headings from content, and mangles code blocks — all silently, and all in ways that degrade retrieval without producing visible errors.

The fix is two things: size by tokens (not characters), and split at structural boundaries (not arbitrary positions). If your input is Markdown, both come naturally. If your input is PDF, DOCX, or HTML, convert it to Markdown with a tool that respects structure, and then chunk with structural awareness.

mdstill's /api/convert/structured endpoint does both in one step. If you want to try it on your own documents, the API docs have working examples in Python, Node, cURL, and more.

Tags: rag, chunking, llm, embeddings, pdf, markdown
