Glossary

Inference

Inference is running a trained model to produce output. Every summary, answer, or glossary term you generate is an inference call.

Training happens ahead of time, over vast data, to build the model. Inference is what happens every single time you use it: every question answered, every document summarised, every sentence generated.

Why it matters

The distinction matters because inference is where the real-world constraints live:

  • Latency. Generating a response token by token takes time. Longer outputs, larger models, and longer prompts all add inference time.
  • Cost. Providers charge per token processed during inference — both input (the prompt and retrieved passages) and output (the generated response). Every call has a price.
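  • Consistency. A trained model's weights don't change during inference; only the input changes. So the same question, asked twice with the same context and the temperature set to 0 (so the model always picks its most likely next token), should produce the same answer, though some serving setups still show small variations.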
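
To make the cost point concrete, here is a minimal sketch of per-call pricing. The `Pricing` class, the `estimate_cost` function, and the dollar rates are all hypothetical, chosen only to illustrate the arithmetic; real providers publish their own rates and may bill differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in US dollars per million tokens (not real rates)."""
    input_per_million: float
    output_per_million: float


def estimate_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Estimate the price of one inference call from its token counts."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


# Illustrative numbers only: same answer length, different prompt sizes.
pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
short_call = estimate_cost(input_tokens=500, output_tokens=200, pricing=pricing)
long_call = estimate_cost(input_tokens=6_000, output_tokens=200, pricing=pricing)
```

With these assumed rates, the call with the longer prompt costs several times more even though the generated answer is the same length.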

For document AI, inference happens whenever you ask Sidenote a question. The process is: retrieve the relevant passages from your document, build a prompt containing them, then run inference to generate an answer grounded in that text. The large language model never changes; what changes is the evidence placed in front of it for each query.
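
The retrieve, build, generate sequence can be sketched roughly as follows. This is not Sidenote's implementation: `retrieve`, `build_prompt`, `answer`, and `fake_model` are hypothetical names, the word-overlap ranking is a toy stand-in for real retrieval, and the model is passed in as a plain function so the example runs without any provider.

```python
from typing import Callable


def retrieve(question: str, passages: list[str], top_k: int = 2) -> list[str]:
    """Rank passages by shared words with the question (toy stand-in for retrieval)."""
    words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, evidence: list[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {p}" for p in evidence)
    return (
        "Answer using only the passages below.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, passages: list[str], generate: Callable[[str], str]) -> str:
    """Retrieve evidence, build the prompt, then run one inference call."""
    evidence = retrieve(question, passages)
    prompt = build_prompt(question, evidence)
    return generate(prompt)  # the only step that touches the model


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"(model output for a {len(prompt)}-character prompt)"


# Made-up sample passages for illustration.
wiki = [
    "Deploys run every Tuesday morning",
    "On-call rotations run weekly",
    "The office kitchen has a new kettle",
]
question = "when do deploys run"
evidence = retrieve(question, wiki)
prompt = build_prompt(question, evidence)
reply = answer(question, wiki, fake_model)
```

The model function is identical for every question; only `prompt` differs, which is the sense in which the evidence changes while the model does not.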

This is also why inference cost depends on how much text is placed in front of the model, not on model size alone: a long retrieved context costs more input tokens and pushes latency up. Retrieval-augmented workflows earn their keep by keeping that context focused rather than sprawling.
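
One simple way to keep context focused is to cap it at a token budget. The sketch below assumes passages arrive already ranked from most to least relevant; `approx_tokens` and `fit_to_budget` are hypothetical helpers, and counting whitespace-separated words is only a rough proxy for a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (real tokenizers differ)."""
    return len(text.split())


def fit_to_budget(ranked_passages: list[str], max_tokens: int) -> list[str]:
    """Keep passages in rank order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for passage in ranked_passages:
        cost = approx_tokens(passage)
        if used + cost > max_tokens:
            break  # stop at the first passage that does not fit
        kept.append(passage)
        used += cost
    return kept


# Made-up ranked passages of 3, 2, and 4 words.
ranked = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
focused = fit_to_budget(ranked, max_tokens=5)
```

Stopping at the first passage that does not fit keeps the result a strict prefix of the ranking, at the price of sometimes leaving part of the budget unused.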
