A context window is the fixed amount of text a model can hold in mind at one time. It is measured in tokens, the small chunks of text a model reads, and it sets a hard ceiling on how much a model can take in and reason over in a single pass. Everything the model needs for an answer (the question, the instructions, and any source material) has to fit inside that window together.
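As a rough illustration of that ceiling, the sketch below checks whether a question, instructions, and source material fit in one window together. The names `count_tokens`, `fits_in_window`, and `WINDOW_TOKENS` are hypothetical, the 8,000-token limit is an arbitrary example value, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
WINDOW_TOKENS = 8_000  # hypothetical window size, not any real model's limit


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fits_in_window(question: str, instructions: str, source: str,
                   window: int = WINDOW_TOKENS) -> bool:
    """Return True if all three parts fit in the window together."""
    total = sum(count_tokens(part) for part in (question, instructions, source))
    return total <= window


if __name__ == "__main__":
    question = "What is the termination notice period?"
    instructions = "Answer only from the provided text."
    short_source = "Either party may terminate with 30 days written notice."
    long_source = "clause " * 20_000  # stands in for a very long document

    print(fits_in_window(question, instructions, short_source))
    print(fits_in_window(question, instructions, long_source))
```

The point of the check is that the parts are summed: a long source can crowd out the question and instructions just as easily as the reverse.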
Why long documents overflow it
Context windows have grown, but they are still finite, and real documents are large. A single contract, research paper, or wiki export can run to hundreds of pages, often more tokens than the available window holds. When the text won't fit, something has to give: it gets truncated, or it never makes it in at all.
Truncation is quietly dangerous. A model handed only the first half of a document will answer as if that's the whole story, with no signal that the rest existed. The answer reads as confident and complete, which is exactly how hallucinations take hold. Simply pasting an entire library into one prompt is not a real fix either — past a point, models attend less reliably to the middle of a very long window, so relevant detail can be present yet effectively overlooked.
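One way to make truncation less quiet is to report what was cut instead of dropping it silently. The following is a minimal sketch under stated assumptions: `truncate_with_report` and `Truncation` are hypothetical names, and tokens are again approximated as whitespace-separated words rather than produced by a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Truncation:
    kept: str            # text that fits in the budget
    dropped_tokens: int  # how many tokens were cut off the end


def truncate_with_report(text: str, budget: int) -> Truncation:
    """Keep the first `budget` tokens and report how many were dropped.

    Tokens are approximated as whitespace-separated words (an assumption).
    """
    tokens = text.split()
    kept = tokens[:budget]
    return Truncation(kept=" ".join(kept),
                      dropped_tokens=len(tokens) - len(kept))


if __name__ == "__main__":
    document = ("The fee is 100 dollars. "
                "Exception: the fee is waived for members.")
    result = truncate_with_report(document, budget=5)
    print(result.kept)
    if result.dropped_tokens:
        # Surface the loss so it can be passed on to the model or the user.
        print(f"warning: {result.dropped_tokens} tokens were not shown to the model")
```

In the demo, the kept text states the fee while the exception falls entirely in the dropped part, which is the failure the paragraph above describes.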
Feeding in only what matters
The durable approach is to stop trying to fit everything. The document is split into passage-sized pieces (chunking), and each piece is indexed so that semantic search can later find it by meaning rather than by keyword. When a question arrives, only the handful of passages that actually bear on it are retrieved and placed in the window, a pattern known as retrieval-augmented generation.
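The chunk, index, and retrieve steps can be sketched in a few lines. This is an illustration, not a production pipeline: `chunk`, `toy_embed`, `cosine`, and `retrieve` are hypothetical names, and `toy_embed` is a word-count stand-in that matches on shared words. A real semantic search would swap in an embedding model that captures meaning; the `embed` parameter marks where it would plug in.

```python
import math
import re
from collections import Counter
from typing import Callable


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into pieces of at most `max_words` words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts.

    This matches shared words, not meaning (an assumption for the demo).
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2,
             embed: Callable[[str], Counter] = toy_embed) -> list[str]:
    """Return the k chunks most similar to the question."""
    index = [(embed(c), c) for c in chunks]  # in practice, built once and stored
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [c for _, c in ranked[:k]]


if __name__ == "__main__":
    document = ("Payment is due within 30 days of invoice. "
                "Late payments accrue interest monthly. "
                "Either party may terminate with 60 days written notice. "
                "Termination does not cancel fees already owed.")
    pieces = chunk(document, max_words=8)
    for passage in retrieve("When is payment due?", pieces, k=2):
        print(passage)
```

Splitting on a fixed word count is the simplest possible chunker and can cut across sentences; splitting on paragraph or section boundaries is a common refinement.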
This keeps the window focused: the model sees the most relevant evidence at full fidelity instead of a thin slice of everything. It is how Sidenote reads documents far larger than any single window — pulling the passages that matter into context, then grounding and citing each answer in the exact text it was given.
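Grounding and citing can be illustrated generically by numbering the retrieved passages in the prompt and then checking that an answer cites only numbers that exist. This is a sketch of the general idea, not a description of how Sidenote implements it; `build_grounded_prompt` and `invalid_citations` are hypothetical names, and the bracketed-number citation format is an assumption.

```python
import re


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number each retrieved passage and ask for answers cited by number."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below. "
        "Cite the passage number for each claim, e.g. [1]. "
        "If the passages do not contain the answer, say so.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}"
    )


def invalid_citations(answer: str, passage_count: int) -> set[int]:
    """Return cited passage numbers that do not match any supplied passage."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return {n for n in cited if not 1 <= n <= passage_count}


if __name__ == "__main__":
    passages = ["Payment is due within 30 days of invoice.",
                "Either party may terminate with 60 days written notice."]
    print(build_grounded_prompt("When is payment due?", passages))
    # A citation to passage 3 is flagged because only two were supplied.
    print(invalid_citations("Payment is due in 30 days [1], see also [3].",
                            len(passages)))
```

A check like this only confirms that a citation points at a real passage; confirming that the passage supports the claim needs a separate comparison against the passage text.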