A token is the atomic unit a language model works with. Models do not read text word by word, letter by letter, or sentence by sentence. They read tokens, which are fragments produced by splitting text at statistically useful boundaries. In English, a token averages around three-quarters of a word: "tokenization" might be two tokens, "the" is one, and a punctuation mark is typically its own token.
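The three-quarters-of-a-word rule of thumb can be turned into a rough estimator. The sketch below is only a heuristic: `estimate_tokens` and `WORDS_PER_TOKEN` are illustrative names, not part of any library, and the true count depends on the specific tokenizer and on the language of the text.

```python
import math

# Rule of thumb from the text above: one token is about 0.75 English words.
# This is an assumption; real ratios vary by tokenizer and by language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate a token count from the number of whitespace-separated words."""
    words = len(text.split())
    # Round up so the estimate errs on the side of using more budget.
    return math.ceil(words / WORDS_PER_TOKEN)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over"
    print(estimate_tokens(sample))  # 6 words / 0.75 = 8
```

For exact counts, use the tokenizer that belongs to the model in question; an estimate like this is only suitable for quick sizing.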
Why it matters
Tokens are the currency of language model usage. They determine:
- How much fits in a model's context. A context window is measured in tokens, not words or pages. At about three-quarters of a word per token, a model that supports 128,000 tokens can hold roughly 96,000 English words at once. That is a lot, but it is still finite, which is why long documents need chunking and retrieval.
- How much an inference costs. Most model providers bill by tokens consumed — both the tokens in (your prompt and retrieved passages) and the tokens out (the model's response). Every source passage pulled into the prompt costs tokens; so does every word of the answer.
- How the model generates output. A large language model produces its response one token at a time, choosing each token from a probability distribution shaped by everything that came before it. This is why generation is sequential and why longer outputs take more time.
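The arithmetic behind the first two points can be sketched in a few lines. Everything here is illustrative: the function names are made up for this example, the 0.75 ratio is the rule of thumb for English, and the prices are placeholder numbers rather than any provider's actual rates.

```python
# Rule of thumb for English text; an assumption, not a property of any model.
WORDS_PER_TOKEN = 0.75


def words_that_fit(context_tokens: int) -> int:
    """Approximate number of English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request when input and output tokens are billed separately."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    print(words_that_fit(128_000))  # 96000

    # Hypothetical prices: 3.00 per million input tokens, 15.00 per million output.
    print(request_cost(10_000, 500, 3.0, 15.0))  # 0.0375
```

Separating the input and output prices reflects that both directions are billed, so a long retrieved context and a long answer each add to the total.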
For document AI, the practical implication is that token budget is a real constraint. Sidenote manages it by retrieving only the passages that bear on your question, keeping the prompt focused so the model has room for the evidence that matters rather than filler it will half-attend to.
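One simple way to respect a token budget is to take passages in order of relevance and skip any that would overflow it. The sketch below is a generic greedy selection, not a description of how Sidenote works internally; `Passage`, `select_passages`, the relevance scores, and the token counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    score: float  # hypothetical relevance score; higher means more relevant
    tokens: int   # token count, assumed to be precomputed with the model's tokenizer


def select_passages(passages: list[Passage], budget_tokens: int) -> list[Passage]:
    """Greedily keep the most relevant passages that fit within the token budget."""
    chosen: list[Passage] = []
    used = 0
    for passage in sorted(passages, key=lambda p: p.score, reverse=True):
        # Skip a passage that would overflow the budget, but keep scanning:
        # a shorter, lower-ranked passage may still fit.
        if used + passage.tokens <= budget_tokens:
            chosen.append(passage)
            used += passage.tokens
    return chosen


if __name__ == "__main__":
    candidates = [
        Passage("Clause on termination", score=0.9, tokens=400),
        Passage("Full definitions section", score=0.8, tokens=700),
        Passage("Clause on notice periods", score=0.5, tokens=300),
    ]
    for p in select_passages(candidates, budget_tokens=800):
        print(p.text)
    # Prints the termination and notice-period clauses; the definitions
    # section is skipped because 400 + 700 exceeds the 800-token budget.
```

Greedy selection is not guaranteed to make the best possible use of the budget, but it is easy to reason about and keeps the highest-ranked evidence in the prompt.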