Building a PDF to Embeddings Pipeline in 20 Lines of Python
If you are building a retrieval system over PDFs, the first question is always the same: how do I get from "file on disk" to "vector in a database" without writing hundreds of lines of parsing code? Most tutorials take you through a forest of dependencies — pypdf, LangChain, a custom text splitter, a tokenizer, error handling, format quirks — and leave you with brittle code that breaks on the first unusual document.
This guide shows a different approach: offload the parsing and chunking to a service that handles document structure correctly, and spend your Python on the parts that actually belong in your application — embedding and storage. The result is an end-to-end pipeline in under 20 lines of real code.
The Pipeline
There are four stages:
- Parse & chunk — turn the PDF into semantically chunked Markdown. This is where naïve pipelines lose quality, because generic PDF parsers rip tables apart and fixed-size splitters cut headings from their content. We use mdstill's `/api/convert/structured` endpoint, which returns Markdown plus token-counted chunks that respect document structure.
- Embed — turn each chunk into a vector with an embedding model. OpenAI's `text-embedding-3-small` is a sensible default: cheap, fast, 1536 dimensions, good English quality.
- Store — put each (chunk, vector) pair into a vector database so you can query by similarity later. We show both pgvector (if you already run Postgres) and Pinecone (if you prefer managed).
- Query — embed the user's question, retrieve the top-k most similar chunks, and pass them to the LLM.
Stages 1 and 2 are shared. Stages 3 and 4 differ by database. Let us walk through each.
Prerequisites
```shell
pip install requests openai psycopg[binary] pgvector pinecone-client
```
You need:
- An OpenAI API key (`OPENAI_API_KEY`) — platform.openai.com
- An mdstill API key (free, optional for low volume) — generate one in your mdstill dashboard
- Either a Postgres database with pgvector enabled, or a Pinecone account
Step 1: Parse and Chunk the PDF
mdstill's structured endpoint takes a file and returns Markdown plus a `structure.chunks` array. Each chunk is token-aware (sized by real LLM tokens, not characters) and semantic (headings stay attached to their sections, tables and code blocks are atomic).
```python
import requests

def chunk_pdf(path: str, chunk_tokens: int = 500) -> list[dict]:
    with open(path, "rb") as f:  # context manager so the handle is closed
        resp = requests.post(
            "https://mdstill.com/api/convert/structured",
            files={"file": f},
            data={"chunk_tokens": chunk_tokens},
            headers={"Authorization": f"Bearer {MDSTILL_KEY}"},  # optional on free tier
        )
    resp.raise_for_status()
    return resp.json()["structure"]["chunks"]
```
Each chunk looks like:
```json
{
  "index": 3,
  "text": "## Methods\n\nWe used a mixed-methods approach...",
  "tokens": 487,
  "heading_path": "Methods",
  "element_types": ["heading", "paragraph", "list"]
}
```
The `heading_path` is the detail most DIY pipelines miss. When you embed just the chunk text, you lose the document structure. Prepending `heading_path` to the text before embedding gives each vector semantic anchoring and noticeably improves retrieval when the same term appears across different sections.
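Concretely, the prefix is one line. A tiny helper (the name `embed_text` is mine, not part of any API) makes the convention explicit:

```python
def embed_text(chunk: dict) -> str:
    """Build the string to embed: heading path first, then chunk text,
    so the vector carries document structure as well as local wording."""
    return f"{chunk['heading_path']}\n\n{chunk['text']}"
```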
Step 2: Embed with OpenAI
```python
from openai import OpenAI

openai_client = OpenAI()

def embed_chunks(chunks: list[dict]) -> list[tuple[dict, list[float]]]:
    texts = [f"{c['heading_path']}\n\n{c['text']}" for c in chunks]
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [(c, e.embedding) for c, e in zip(chunks, resp.data)]
```
Batching the entire document in one API call is much faster than embedding chunk by chunk, since you pay the request overhead once. OpenAI's embeddings endpoint accepts up to 2048 inputs per request, with an 8192-token limit per input, which comfortably covers most documents in a single call.
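If a very large document ever exceeds that 2048-input cap, splitting the request is straightforward. A minimal sketch (the helper name `batched` is mine), reusing the `openai_client` from above:

```python
def batched(items: list, size: int = 2048) -> list[list]:
    """Split items into consecutive batches of at most `size` elements,
    matching OpenAI's per-request input cap."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Usage sketch: one embeddings request per batch.
# embeddings = []
# for group in batched(texts):
#     resp = openai_client.embeddings.create(
#         model="text-embedding-3-small", input=group)
#     embeddings.extend(resp.data)
```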
Step 3A: Store in pgvector
If you already run Postgres, pgvector is the obvious choice — no new infrastructure, no separate billing, and SQL queries for retrieval.
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id          SERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_index INT NOT NULL,
    heading     TEXT,
    content     TEXT NOT NULL,
    embedding   VECTOR(1536)
);

CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops);
```
Insert:
```python
import psycopg
from pgvector.psycopg import register_vector

def store_pgvector(doc_id: str, embedded: list[tuple[dict, list[float]]]):
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO doc_chunks (doc_id, chunk_index, heading, content, embedding) "
                "VALUES (%s, %s, %s, %s, %s)",
                [(doc_id, c["index"], c["heading_path"], c["text"], emb)
                 for c, emb in embedded],
            )
```
Step 3B: Store in Pinecone
If you want a managed option behind a single API, Pinecone's serverless tier handles indexing and retrieval for you.
```python
from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("documents")

def store_pinecone(doc_id: str, embedded: list[tuple[dict, list[float]]]):
    vectors = [
        {
            "id": f"{doc_id}#{c['index']}",
            "values": emb,
            "metadata": {
                "doc_id": doc_id,
                "chunk_index": c["index"],
                "heading": c["heading_path"],
                "text": c["text"],
            },
        }
        for c, emb in embedded
    ]
    index.upsert(vectors=vectors)
```
Step 4: Query
Embedding the query is the same operation as embedding chunks — one embedding call. Then you ask the vector database for the top-k most similar vectors.
pgvector:
```python
def retrieve_pgvector(question: str, k: int = 5) -> list[dict]:
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.execute(
                "SELECT chunk_index, heading, content "
                "FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
                (q_emb, k),
            )
            return [{"index": r[0], "heading": r[1], "text": r[2]}
                    for r in cur.fetchall()]
```
Pinecone:
```python
def retrieve_pinecone(question: str, k: int = 5) -> list[dict]:
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    matches = index.query(vector=q_emb, top_k=k, include_metadata=True).matches
    return [m.metadata for m in matches]
```
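To close the loop on stage 4, here is a hedged sketch of passing the retrieved chunks to an LLM. It reuses `openai_client` and `retrieve_pgvector` from earlier steps; the model name `gpt-4o-mini` and the prompt wording are my choices, not part of the pipeline:

```python
def build_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into one prompt block, keeping headings
    so the model can see which section each passage came from."""
    return "\n\n---\n\n".join(f"[{c['heading']}]\n{c['text']}" for c in chunks)

def answer(question: str, k: int = 5) -> str:
    context = build_context(retrieve_pgvector(question, k))  # or retrieve_pinecone
    resp = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer from the provided context only. "
                        "If the context is insufficient, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```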
Putting It All Together
Here is the full ingestion script — 20 lines of real logic, minus imports and config:
```python
import os, requests
from openai import OpenAI

MDSTILL_KEY = os.environ["MDSTILL_API_KEY"]
openai_client = OpenAI()

def ingest(path: str, doc_id: str):
    chunks = requests.post(
        "https://mdstill.com/api/convert/structured",
        files={"file": open(path, "rb")},
        data={"chunk_tokens": 500},
        headers={"Authorization": f"Bearer {MDSTILL_KEY}"},
    ).json()["structure"]["chunks"]
    texts = [f"{c['heading_path']}\n\n{c['text']}" for c in chunks]
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts,
    ).data
    embedded = [(c, e.embedding) for c, e in zip(chunks, embeddings)]
    store_pgvector(doc_id, embedded)  # or store_pinecone

if __name__ == "__main__":
    ingest("report.pdf", "report-2025")
```
Running this on a 30-page PDF takes about 5 seconds total: ~2 seconds for parsing, ~2 seconds for embedding, under 1 second for storage. Cost: `text-embedding-3-small` is priced at $0.02 per million tokens at the time of writing. A 30-page document is typically 10,000–15,000 tokens, so embedding it costs a few hundredths of a cent.
Why This Is Better Than the DIY Path
The equivalent DIY version — pypdf or pdfplumber for parsing, a LangChain splitter for chunking, tiktoken for token counting, manual handling of table and code-block boundaries, format-specific error handling — is 150–300 lines of code and has predictable failure modes: mangled tables, heading-content separation, format quirks on edge-case PDFs.
By putting a service in front of the messy part, the code you actually maintain is small and unambiguous. Parsing and chunking are delegated to an endpoint whose job is exactly that. Your Python code is about the parts of the pipeline that belong in your application: embedding choices, database schema, retrieval logic.
Extending the Pipeline
A few upgrades to consider once the base pipeline works:
- Store the original document URL or path in metadata so you can link retrieved chunks back to the source.
- Version your embeddings — if you change embedding models, keep the model name in the schema so you can query only chunks embedded with the current model.
- Add a re-ranker — for production RAG, retrieve 20 chunks by vector similarity, then re-rank with a cross-encoder like `ms-marco-MiniLM` to get the top 5. This measurably improves answer quality at a modest latency cost.
- Handle updates — when a document changes, delete the old chunks (`DELETE WHERE doc_id = ?`) and re-ingest. Partial updates are harder than they look; most teams just re-ingest.
- Use async batching for high volume — `asyncio` + `httpx` lets you ingest many documents in parallel, which matters once you are above a few hundred documents.
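The last bullet can be sketched with a semaphore to cap concurrency. Here `ingest_async` stands in for an async version of the `ingest` function (building one with `httpx.AsyncClient` is left out), and the cap of 5 is an arbitrary choice:

```python
import asyncio

async def ingest_many(paths: list[str], ingest_async, limit: int = 5) -> list:
    """Run ingest_async over many documents with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(path: str):
        async with sem:  # wait for a free slot before starting
            return await ingest_async(path)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(p) for p in paths))
```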
Summary
A PDF-to-embeddings pipeline does not need to be a weekend project or a dependency graveyard. If you pick the right parsing layer — one that returns Markdown with semantic chunks — the Python you own is short, readable, and focused on the parts where your application differs from everyone else's.
mdstill handles the parsing and chunking. OpenAI handles the embeddings. pgvector or Pinecone handles retrieval. Your code is the glue — and the glue is 20 lines.
Try it on one of your own PDFs. The mdstill API docs have the `/api/convert/structured` endpoint documented with working examples in Python, cURL, Node.js, Go, and PHP, and the free tier is enough to ingest dozens of documents a day without signing up.