AI · April 17, 2026

Building a PDF to Embeddings Pipeline in 20 Lines of Python

6 min read

If you are building a retrieval system over PDFs, the first question is always the same: how do I get from "file on disk" to "vector in a database" without writing hundreds of lines of parsing code? Most tutorials take you through a forest of dependencies — pypdf, LangChain, a custom text splitter, a tokenizer, error handling, format quirks — and leave you with brittle code that breaks on the first unusual document.

This guide shows a different approach: offload the parsing and chunking to a service that handles document structure correctly, and spend your Python on the parts that actually belong in your application — embedding and storage. The result is an end-to-end pipeline in under 20 lines of real code.

The Pipeline

There are four stages:

  1. Parse & chunk — turn the PDF into semantically chunked Markdown. This is where naïve pipelines lose quality, because generic PDF parsers rip tables apart and fixed-size splitters cut headings from their content. We use mdstill's /api/convert/structured endpoint, which returns Markdown plus token-counted chunks that respect document structure.
  2. Embed — turn each chunk into a vector with an embedding model. OpenAI's text-embedding-3-small is a sensible default: cheap, fast, 1536 dimensions, good English quality.
  3. Store — put each (chunk, vector) pair into a vector database so you can query by similarity later. We show both pgvector (if you already run Postgres) and Pinecone (if you prefer managed).
  4. Query — embed the user's question, retrieve the top-k most similar chunks, and pass them to the LLM.

Stages 1 and 2 are shared. Stages 3 and 4 differ by database. Let us walk through each.

Prerequisites

pip install requests openai psycopg[binary] pgvector pinecone

You need:

  • An OpenAI API key (OPENAI_API_KEY) — platform.openai.com
  • An mdstill API key (free, optional for low volume) — generate one in your mdstill dashboard
  • Either a Postgres database with pgvector enabled, or a Pinecone account

Step 1: Parse and Chunk the PDF

mdstill's structured endpoint takes a file and returns Markdown plus a structure.chunks array. Each chunk is token-aware (sized by real LLM tokens, not characters) and semantic (headings stay attached to their sections, tables and code blocks are atomic).

import requests

def chunk_pdf(path: str, chunk_tokens: int = 500) -> list[dict]:
    with open(path, "rb") as f:
        resp = requests.post(
            "https://mdstill.com/api/convert/structured",
            files={"file": f},
            data={"chunk_tokens": chunk_tokens},
            headers={"Authorization": f"Bearer {MDSTILL_KEY}"},  # optional on free tier
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()["structure"]["chunks"]

Each chunk looks like:

{
  "index": 3,
  "text": "## Methods\n\nWe used a mixed-methods approach...",
  "tokens": 487,
  "heading_path": "Methods",
  "element_types": ["heading", "paragraph", "list"]
}

The heading_path is the detail most DIY pipelines miss. When you embed just the chunk text, you lose the document structure. Prepending heading_path to the text before embedding gives each vector semantic anchoring and noticeably improves retrieval when the same term appears across different sections.

Step 2: Embed with OpenAI

from openai import OpenAI

openai_client = OpenAI()

def embed_chunks(chunks: list[dict]) -> list[tuple[dict, list[float]]]:
    texts = [f"{c['heading_path']}\n\n{c['text']}" for c in chunks]
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [(c, e.embedding) for c, e in zip(chunks, resp.data)]

Batching the entire document in one API call is both faster and cheaper than embedding chunk by chunk. OpenAI's embedding endpoint accepts up to 2048 inputs per request, with an 8191-token limit per input, which comfortably covers most documents in a single call.
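For documents with more than 2048 chunks, a small helper splits the input into request-sized batches. A minimal sketch (pure Python; the embedding loop in the comment reuses the client from above):

```python
def batch(items: list, size: int = 2048) -> list[list]:
    """Split a list into consecutive batches of at most `size` items,
    matching the embeddings endpoint's per-request input cap."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Sketch: embed each batch in turn and flatten the results in order.
# embeddings = []
# for group in batch(texts):
#     resp = openai_client.embeddings.create(
#         model="text-embedding-3-small", input=group)
#     embeddings.extend(e.embedding for e in resp.data)
```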

Step 3A: Store in pgvector

If you already run Postgres, pgvector is the obvious choice — no new infrastructure, no separate billing, and SQL queries for retrieval.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id          SERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_index INT  NOT NULL,
    heading     TEXT,
    content     TEXT NOT NULL,
    embedding   VECTOR(1536)
);

-- For best recall, ivfflat indexes should be created after the data is loaded
CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops);

Insert:

import psycopg
from pgvector.psycopg import register_vector

def store_pgvector(doc_id: str, embedded: list[tuple[dict, list[float]]]):
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO doc_chunks (doc_id, chunk_index, heading, content, embedding) "
                "VALUES (%s, %s, %s, %s, %s)",
                [(doc_id, c["index"], c["heading_path"], c["text"], emb)
                 for c, emb in embedded],
            )

Step 3B: Store in Pinecone

If you prefer a managed service with one API, Pinecone's serverless tier handles the indexing and retrieval for you.

from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("documents")

def store_pinecone(doc_id: str, embedded: list[tuple[dict, list[float]]]):
    vectors = [
        {
            "id": f"{doc_id}#{c['index']}",
            "values": emb,
            "metadata": {
                "doc_id": doc_id,
                "chunk_index": c["index"],
                "heading": c["heading_path"],
                "text": c["text"],
            },
        }
        for c, emb in embedded
    ]
    index.upsert(vectors=vectors)

Step 4: Query

Embedding the query is the same operation as embedding chunks — one embedding call. Then you ask the vector database for the top-k most similar vectors.

pgvector:

def retrieve_pgvector(question: str, k: int = 5) -> list[dict]:
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding

    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        with conn.cursor() as cur:
            cur.execute(
                "SELECT chunk_index, heading, content "
                "FROM doc_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
                (q_emb, k),
            )
            return [{"index": r[0], "heading": r[1], "text": r[2]}
                    for r in cur.fetchall()]

Pinecone:

def retrieve_pinecone(question: str, k: int = 5) -> list[dict]:
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    ).data[0].embedding
    matches = index.query(vector=q_emb, top_k=k, include_metadata=True).matches
    return [m.metadata for m in matches]
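Stage 4 ends with "pass them to the LLM". A minimal sketch of that last step, with the retrieved chunks stitched into a grounded prompt (the prompt format and the chat model name in the comment are illustrative choices, not part of the API above):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Join retrieved chunks into a grounded prompt, one block per section."""
    context = "\n\n---\n\n".join(
        f"[{c['heading']}]\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Sketch: feed the prompt to a chat model (model name is illustrative).
# answer = openai_client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user",
#                "content": build_prompt(q, retrieve_pgvector(q))}],
# ).choices[0].message.content
```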

Putting It All Together

Here is the full ingestion script — 20 lines of real logic, minus imports and config:

import os, requests
from openai import OpenAI

MDSTILL_KEY = os.environ["MDSTILL_API_KEY"]
openai_client = OpenAI()

def ingest(path: str, doc_id: str):
    with open(path, "rb") as f:
        resp = requests.post(
            "https://mdstill.com/api/convert/structured",
            files={"file": f},
            data={"chunk_tokens": 500},
            headers={"Authorization": f"Bearer {MDSTILL_KEY}"},
            timeout=120,
        )
    resp.raise_for_status()
    chunks = resp.json()["structure"]["chunks"]

    texts = [f"{c['heading_path']}\n\n{c['text']}" for c in chunks]
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts,
    ).data
    embedded = [(c, e.embedding) for c, e in zip(chunks, embeddings)]

    store_pgvector(doc_id, embedded)  # or store_pinecone

if __name__ == "__main__":
    ingest("report.pdf", "report-2025")

Running this on a 30-page PDF takes about 5 seconds total: ~2 seconds for parsing, ~2 seconds for embedding, under 1 second for storage. Cost: at the time of writing, OpenAI lists text-embedding-3-small at $0.02 per million tokens. A 30-page document is typically 10,000–15,000 tokens, so embedding it costs a small fraction of a cent.
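A back-of-the-envelope helper makes that estimate concrete (the default per-million-token price is an assumption frozen at writing time; check current pricing before relying on it):

```python
def embedding_cost(total_tokens: int, usd_per_million: float = 0.02) -> float:
    """Estimate embedding cost in USD for a given token count,
    at an assumed per-million-token price."""
    return total_tokens / 1_000_000 * usd_per_million
```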

Why This Is Better Than the DIY Path

The equivalent DIY version — pypdf or pdfplumber for parsing, a LangChain splitter for chunking, tiktoken for token counting, manual handling of table and code-block boundaries, format-specific error handling — is 150–300 lines of code and has predictable failure modes: mangled tables, heading-content separation, format quirks on edge-case PDFs.

By putting a service in front of the messy part, the code you actually maintain is small and unambiguous. Parsing and chunking are delegated to an endpoint whose job is exactly that. Your Python code is about the parts of the pipeline that belong in your application: embedding choices, database schema, retrieval logic.

Extending the Pipeline

A few upgrades to consider once the base pipeline works:

  • Store the original document URL or path in metadata so you can link retrieved chunks back to the source.
  • Version your embeddings — if you change embedding models, keep the model name in the schema so you can query only chunks embedded with the current model.
  • Add a re-ranker — for production RAG, retrieve 20 chunks by vector similarity, then re-rank with a cross-encoder like ms-marco-MiniLM to get the top 5. This measurably improves answer quality at a modest latency cost.
  • Handle updates — when a document changes, delete the old chunks (DELETE WHERE doc_id = ?) and re-ingest. Partial updates are harder than they look; most teams just re-ingest.
  • Use async batching for high volume — asyncio + httpx lets you ingest many documents in parallel, which matters once you are above a few hundred documents.
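For the update path, the delete step is simple in both stores. A sketch, assuming the `doc_id#index` ID convention used in store_pinecone above (the SQL and Pinecone calls in the comments reuse the connections from earlier steps):

```python
def chunk_ids(doc_id: str, n_chunks: int) -> list[str]:
    """Reconstruct the vector IDs that store_pinecone wrote for one document."""
    return [f"{doc_id}#{i}" for i in range(n_chunks)]

# Sketch: delete, then re-ingest.
# pgvector:  cur.execute("DELETE FROM doc_chunks WHERE doc_id = %s", (doc_id,))
# Pinecone:  index.delete(ids=chunk_ids(doc_id, n_chunks))
# ...then call ingest(path, doc_id) again.
```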

Summary

A PDF-to-embeddings pipeline does not need to be a weekend project or a dependency graveyard. If you pick the right parsing layer — one that returns Markdown with semantic chunks — the Python you own is short, readable, and focused on the parts where your application differs from everyone else's.

mdstill handles the parsing and chunking. OpenAI handles the embeddings. pgvector or Pinecone handles retrieval. Your code is the glue — and the glue is 20 lines.

Try it on one of your own PDFs. The mdstill API docs have the /api/convert/structured endpoint documented with working examples in Python, cURL, Node.js, Go, and PHP, and the free tier is enough to ingest dozens of documents a day without signing up.

#rag #pdf #embeddings #python #pgvector #pinecone #openai
