Inference — Definition

Inference is the act of running a trained model forward to produce an output. Training is what happens once, over vast data, to build a model; inference is what happens every single time you use it — every question answered, every document summarised, every sentence generated.

Why it matters

The distinction matters because inference is where the real-world constraints live:

Latency. Generating a response token by token takes time. Longer outputs, larger models, and longer prompts all add to inference time.
Cost. Providers charge per token processed during inference — both input (the prompt and retrieved passages) and output (the generated response). Every call has a price.
Consistency. A trained model's weights don't change during inference; only the input changes. So the same question, asked twice with the same context and the temperature set to zero (greedy decoding), should produce the same answer.
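
To make the consistency point concrete, here is a minimal sketch of how one token might be picked during inference. The toy vocabulary, the scores in `LOGITS`, and the `pick_token` helper are all assumptions made up for illustration; a real model computes its scores from billions of fixed weights, but the role of temperature is the same in spirit.

```python
import math
import random

# Hypothetical fixed scores standing in for a trained model's output.
# They never change between calls, just as weights stay fixed during inference.
LOGITS = {"yes": 2.0, "no": 1.5, "maybe": 0.5}

def pick_token(logits, temperature, rng):
    """Pick one token. A temperature of 0 means greedy (argmax) decoding."""
    if temperature == 0:
        # Always take the highest-scoring token: same input, same output.
        return max(logits, key=logits.get)
    # Otherwise apply a temperature-scaled softmax and sample from it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = [math.exp(score - top) for score in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

# Greedy decoding gives the same token on every call.
greedy = [pick_token(LOGITS, 0, random.Random()) for _ in range(5)]

# Sampling at temperature 1.0 can give different tokens from call to call.
sampled = [pick_token(LOGITS, 1.0, random.Random()) for _ in range(5)]
```

With the temperature at zero, every entry in `greedy` is the top-scoring token. With a higher temperature, `sampled` may vary between runs even though the scores never changed.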

For document AI, inference happens whenever you ask Sidenote a question. The process is: retrieve the relevant passages from your document, build a prompt containing them, then run inference to generate an answer grounded in that text. The large language model never changes; what changes is the evidence placed in front of it for each query.
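
The retrieve, build, infer sequence described above can be sketched in a few lines. This is not Sidenote's implementation. The function names, the keyword-overlap ranking, and the placeholder `run_inference` are assumptions chosen to keep the example self-contained; a production system would typically use a proper retriever and a real model call.

```python
import re

def words(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(passages, question, k=2):
    """Toy retrieval: rank passages by how many words they share with the question."""
    q_words = words(question)
    ranked = sorted(passages, key=lambda p: len(q_words & words(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, evidence):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {passage}" for passage in evidence)
    return f"Answer using only the passages below.\n{context}\n\nQuestion: {question}"

def run_inference(prompt):
    """Placeholder for a real model call (hypothetical); returns a canned string."""
    return f"[model output for a prompt of {len(prompt.split())} words]"

def answer(passages, question):
    evidence = retrieve(passages, question)    # 1. retrieve relevant passages
    prompt = build_prompt(question, evidence)  # 2. build a prompt containing them
    return run_inference(prompt)               # 3. run inference on that prompt

passages = [
    "The lease term is five years.",
    "Rent is due on the first of each month.",
    "The landlord pays for structural repairs.",
]
result = answer(passages, "When is rent due?")
```

Only `evidence` and `prompt` differ from one question to the next. The model behind `run_inference` stays the same.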

This is also why inference cost depends on how much text goes into the prompt, not on model size alone: a long retrieved context means more input tokens and higher latency. Retrieval-augmented workflows earn their keep by keeping that context focused rather than sprawling.
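
A rough worked example shows how context length feeds into cost. The per-token prices and token counts below are invented for illustration and do not reflect any provider's actual rates. The only assumption carried over from the text is that input and output tokens are both billed.

```python
def estimate_cost(input_tokens, output_tokens, in_price_per_million, out_price_per_million):
    """Estimate the price of one call from token counts and per-million-token prices."""
    total = input_tokens * in_price_per_million + output_tokens * out_price_per_million
    return total / 1_000_000

# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
IN_PRICE, OUT_PRICE = 1.0, 4.0

# Same question and same answer length; only the retrieved context differs.
focused = estimate_cost(2_000, 300, IN_PRICE, OUT_PRICE)     # a few relevant passages
sprawling = estimate_cost(40_000, 300, IN_PRICE, OUT_PRICE)  # most of the document

print(f"focused context:   {focused:.4f}")
print(f"sprawling context: {sprawling:.4f}")
```

Under these made-up numbers the sprawling prompt costs more than ten times as much per question, even though the answer is the same length.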
