Context stuffing is the practice of pasting an entire document — or as much of it as will fit — into the model's context window before asking a question. The logic is simple: give the model everything and let it find what's relevant.
Why it matters
In small doses, dumping full documents into the prompt can work. For a two-page email thread or a short article, it's often fine. The problems compound as documents grow:
- Cost. LLM pricing is roughly proportional to tokens processed. Stuffing a 200-page PDF into every query is an expensive way to answer a simple question.
- Context limits. Every model has a finite context window. Current large models typically top out somewhere between roughly 100k and 1M tokens, and many are smaller. Many real-world documents (legal agreements, technical manuals, research corpora) exceed those limits, so the full text cannot be stuffed into the prompt at all.
- Accuracy dilution. Counter-intuitively, more context can mean worse answers. When a model is handed hundreds of pages, relevant passages compete with irrelevant ones. The signal-to-noise ratio drops, and the model may anchor on high-frequency or early content rather than the parts that actually answer the question.
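To make the cost and context-limit points concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration: the tokens-per-page figure, the price per million input tokens, and the context window size are hypothetical and do not describe any particular model or provider.

```python
# All figures below are illustrative assumptions, not real prices or limits.
TOKENS_PER_PAGE = 500            # assumed average for a page of dense prose
PRICE_PER_MILLION_INPUT = 3.00   # hypothetical price in dollars per 1M input tokens
CONTEXT_WINDOW = 128_000         # hypothetical model limit, in tokens


def estimate_tokens(pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Rough token count for a document of the given length."""
    return pages * tokens_per_page


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Input cost in dollars of sending `tokens` tokens in a single query."""
    return tokens / 1_000_000 * price_per_million


def fits(tokens: int, window: int = CONTEXT_WINDOW) -> bool:
    """Whether the document fits in the assumed window (ignoring the question itself)."""
    return tokens <= window


for pages in (2, 200, 2000):
    tokens = estimate_tokens(pages)
    print(
        f"{pages:>5} pages ~ {tokens:>9,} tokens, "
        f"${input_cost(tokens):.4f} per query, fits: {fits(tokens)}"
    )
```

Under these assumptions, a 200-page document comes to about 100,000 tokens and roughly $0.30 of input per question, which adds up quickly when every query resends the whole document. The 2,000-page case does not fit in the assumed window at all.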
The retrieval alternative
Retrieval-augmented generation (RAG) addresses all three problems. Instead of passing the document in full, a top-k retrieval step ranks the document's passages against the question and feeds only the k most relevant ones into the prompt. The model gets focused evidence rather than a haystack; cost scales with the number of passages retrieved, not with the size of the document; and document length is no longer bounded by the context window.
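A minimal sketch of top-k retrieval might look like the following. It is not how Sidenote or any particular system implements retrieval: real systems typically use learned embeddings and a vector index, whereas this example swaps in plain word-count cosine similarity so it runs with the standard library alone. The function names, the stopword list, and the sample passages are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would not rely on this.
STOPWORDS = {"a", "an", "the", "to", "is", "of", "in", "by", "be",
             "with", "this", "how", "much"}


def vectorize(text: str) -> Counter:
    """Lowercase word counts minus stopwords; a crude stand-in for an embedding."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_passages(question: str, passages: list[str], k: int = 3) -> list[tuple[int, str]]:
    """Return the k passages most similar to the question, paired with their index."""
    q = vectorize(question)
    scored = [(cosine(q, vectorize(p)), i, p) for i, p in enumerate(passages)]
    # Highest score first; ties broken by original position.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(i, p) for _, i, p in scored[:k]]


def build_prompt(question: str, passages: list[str], k: int = 3) -> str:
    """Assemble a prompt containing only the retrieved passages, numbered for citation."""
    evidence = "\n".join(f"[{i}] {p}" for i, p in top_k_passages(question, passages, k))
    return (
        "Answer using only the passages below and cite them by number.\n"
        f"{evidence}\n\nQuestion: {question}"
    )


passages = [
    "Either party may terminate this agreement with thirty days written notice.",
    "The supplier shall deliver goods to the warehouse in Rotterdam.",
    "Payment is due within sixty days of the invoice date.",
    "Notice of termination must be sent by registered mail.",
]
question = "How much notice is required to terminate the agreement?"
print(build_prompt(question, passages, k=2))
```

Only the two termination-related passages reach the prompt here, and each keeps its index so the answer can cite its source. The prompt size depends on k and passage length rather than on how many passages the document contains.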
Sidenote uses retrieval by default — every answer is built from the passages most relevant to your question, with each passage traced back to its source via citation.