Inference is the act of running a trained model forward to produce an output. Training is what happens once, over vast data, to build a model; inference is what happens every single time you use it — every question answered, every document summarised, every sentence generated.
Why it matters
The distinction matters because inference is where the real-world constraints live:
- Latency. Generating a response token by token takes time. Longer outputs, larger models, and longer prompts all add to inference time.
- Cost. Providers charge per token processed during inference — both input (the prompt and retrieved passages) and output (the generated response). Every call has a price.
- Consistency. A trained model's weights don't change during inference; only the input changes. So the same question, asked twice with the same context and the temperature set to 0 (so the model always picks its most likely next token), should produce the same answer, or very nearly so.
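To make the cost point concrete, here is a minimal sketch of per-call pricing. The `Pricing` class, the `estimate_cost` function, and the rates are all hypothetical and chosen for round numbers; they are not any provider's real prices.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # Hypothetical rates in USD per million tokens, not real provider prices.
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimate the USD cost of one inference call from its token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


# Assumed example: a 3,000-token prompt (question plus retrieved passages)
# and a 500-token answer.
pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
cost = estimate_cost(input_tokens=3_000, output_tokens=500, pricing=pricing)
print(f"Estimated cost: ${cost:.4f}")
```

Under these assumed rates, the prompt accounts for more of the bill than the answer does, even though output tokens are priced higher, because the prompt is six times longer.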
For document AI, inference happens whenever you ask Sidenote a question. The process is: retrieve the relevant passages from your document, build a prompt containing them, then run inference to generate an answer grounded in that text. The large language model never changes; what changes is the evidence placed in front of it for each query.
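The retrieve, build, and infer steps can be sketched roughly as follows. This is an illustration only, not Sidenote's actual implementation: `retrieve`, `build_prompt`, `answer`, and `fake_model` are hypothetical names, retrieval is reduced to simple word overlap, and the model is a stand-in function rather than a real LLM call.

```python
import re
from typing import Callable, List


def words(text: str) -> set:
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by word overlap with the question (a toy stand-in for real retrieval)."""
    q = words(question)
    ranked = sorted(passages, key=lambda p: len(q & words(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, evidence: List[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in evidence)
    return (
        "Answer using only the passages below.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(question: str, passages: List[str], model: Callable[[str], str]) -> str:
    """Retrieve evidence, build the prompt, then run inference with a fixed model."""
    evidence = retrieve(question, passages)
    return model(build_prompt(question, evidence))


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: reports how much text it was given.
    return f"(model received {len(prompt.split())} words of prompt)"


document = [
    "The lease term is twelve months starting in March.",
    "Rent is due on the first day of each month.",
    "Pets are allowed with written consent.",
]
print(answer("When is rent due?", document, fake_model))
```

The model function is the same for every call; only the prompt, and therefore the evidence inside it, differs from one question to the next.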
This is also why inference cost depends on how much text goes into the prompt, not on model size alone: a long retrieved context means more input tokens to pay for and higher latency. Retrieval-augmented workflows earn their keep by keeping that context focused rather than sprawling.
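One simple way to keep the context focused is to cap it at a token budget. The sketch below assumes passages arrive already ranked by relevance and approximates a token as one whitespace-separated word, which real tokenizers do not match exactly; `approx_tokens` and `fit_to_budget` are hypothetical helper names.

```python
from typing import List


def approx_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def fit_to_budget(ranked_passages: List[str], max_tokens: int) -> List[str]:
    """Keep passages in rank order, stopping before the budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for passage in ranked_passages:
        size = approx_tokens(passage)
        if used + size > max_tokens:
            break
        kept.append(passage)
        used += size
    return kept


ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
kept = fit_to_budget(ranked, max_tokens=5)
print(kept)
```

Stopping at the first passage that does not fit preserves rank order; a variant could skip oversized passages and keep looking for smaller ones, at the cost of including lower-ranked evidence.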