Chunking is the act of splitting a document into smaller passages — chunks — so each piece can be embedded and retrieved on its own.
An embedding model can take in only a limited amount of text at once, and a language model can't usefully reason over an entire book crammed into a single block of text. So before a document can be searched, it's broken into pieces: a paragraph, a few sentences, a section. Each chunk becomes the unit that gets indexed, matched, and eventually handed back to the model.
Why chunk at all
Chunks are what make a document searchable by meaning. Each one is turned into a vector embedding — a numeric fingerprint of what it says — and stored so it can be found later by semantic search. When you ask a question, the system compares your query against those chunks and pulls back the closest matches to ground its answer. This is the retrieval step at the heart of retrieval-augmented generation.
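To make the retrieval step concrete, here is a minimal sketch in Python. It assumes a toy embedding: the embed function below is a hypothetical stand-in that just counts words, where a real system would call a trained embedding model. The names embed, cosine, and retrieve are illustrative only and not part of any particular library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks from a help-center document.
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping is free on orders over 50 dollars.",
    "Support is available by email on weekdays.",
]
top = retrieve("How long do refunds take?", chunks, k=1)
print(top)
```

Word counts only match on shared vocabulary, so this sketch would miss a paraphrase such as "money back". A learned embedding is what lets the comparison work on meaning, but the ranking logic around it stays the same.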
The whole pipeline rests on the chunk being the right size and shape. Get it wrong and the best answer in the document may never surface.
How chunk size shapes the result
Chunk size is a trade-off between two failures.
- Chunks too large. A passage that runs for pages mixes several ideas into one embedding, blurring its meaning. Retrieval quality suffers: the blended embedding matches no single question well, the relevant sentence is buried among unrelated text, and a citation can only point at the whole sprawling block, not the line that actually supports the claim.
- Chunks too small. A passage of a single clause loses the context around it. The system may retrieve a fragment that looks relevant but doesn't carry enough meaning to answer the question, and stitched-together fragments can mislead.
Good chunking splits along natural boundaries — headings, paragraphs, sentences — often with a little overlap so meaning isn't severed mid-thought. The aim is passages large enough to stand alone, small enough to cite precisely.
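One way to split along boundaries with overlap can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any specific product's chunker: paragraphs are assumed to be separated by blank lines, sentences are found with a naive regex, and the function names and the max_chars and overlap parameters are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_paragraph(paragraph: str, max_chars: int, overlap: int) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters.

    Each new chunk begins with the last `overlap` sentences of the previous
    one, so a chunk may run slightly over max_chars. A single sentence longer
    than max_chars is kept whole rather than cut mid-thought.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(paragraph):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap (none if overlap is 0).
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_document(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Chunk each blank-line-separated paragraph on its own, so no chunk
    crosses a paragraph boundary."""
    chunks: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        if paragraph.strip():
            chunks.extend(chunk_paragraph(paragraph, max_chars, overlap))
    return chunks


print(chunk_document("One. Two. Three. Four.", max_chars=10, overlap=1))
# prints ['One. Two.', 'Two. Three.', 'Three. Four.']
```

Raising max_chars moves toward the "too large" failure and lowering it toward the "too small" one. The overlap repeats a sentence across neighbouring chunks so that a thought spanning the cut still appears whole in at least one of them.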
Sidenote chunks each document along its real structure, so the passages it retrieves are coherent enough to ground an answer and tight enough that every citation scrolls to the exact source rather than a vague stretch of page.