What is a context window in an LLM and why does its size matter?
The context window is the maximum number of tokens an LLM can attend to in a single forward pass — both the input prompt and the model's own generated output count toward this limit. Its size determines how much prior text influences each prediction, sets a hard ceiling on document length and conversation history, and drives memory and compute costs that scale quadratically with sequence length under standard attention.
How to think about it
Most transformers are trained with a maximum context length, tied to their positional encoding and the sequence lengths seen during training. At inference time, the model sees only the tokens in its current window. Tokens that fall outside the window are invisible, as if they were never written.
What counts toward the window
- System prompt
- Conversation history (all prior turns)
- Retrieved documents injected by RAG
- Tool call results
- The model’s generated response so far
A 128k-token window sounds large but fills quickly: a full-length novel runs on the order of 100k tokens, and a long PDF with embedded tables can exceed that in one file.
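The accounting above can be sketched as simple budget arithmetic. This is a minimal illustration, not any provider's API: the `ContextBudget` class, the reserved output size, and the per-component token counts are all hypothetical, and real counts would come from the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Hypothetical helper for tracking how much of a context window is left."""

    window: int           # total context window, in tokens
    reserved_output: int  # tokens held back for the model's response

    def remaining(self, *used: int) -> int:
        """Tokens left for more input after the given components are counted."""
        return self.window - self.reserved_output - sum(used)


# Assumed token counts for illustration only.
budget = ContextBudget(window=128_000, reserved_output=4_000)
left = budget.remaining(
    1_500,   # system prompt
    30_000,  # conversation history
    80_000,  # retrieved documents
    10_000,  # tool call results
)
print(left)  # 2500
```

A negative result would mean the request no longer fits and something has to be trimmed, summarized, or dropped before the call is made.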
The quadratic cost problem
Standard self-attention computes an interaction between every pair of tokens in the sequence, so memory and compute grow as O(N²) with sequence length N. Doubling the context quadruples the attention cost. This is why long-context models rely on techniques such as sliding-window attention or linear attention variants, which reduce the number of interactions computed, or FlashAttention, which keeps attention exact (compute is still quadratic) but avoids materializing the full N×N score matrix, so memory grows linearly.
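To make the quadratic growth concrete, the sketch below estimates the memory needed to hold the full N×N attention score matrix for a single layer if it is materialized naively. The head count (32) and 2 bytes per value (16-bit floats) are assumed figures for illustration, not the configuration of any particular model.

```python
def attention_scores_bytes(seq_len: int, num_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one layer's N x N attention scores across all heads.

    num_heads and bytes_per_value are illustrative assumptions.
    """
    return seq_len * seq_len * num_heads * bytes_per_value


for n in (8_192, 16_384, 32_768):
    gib = attention_scores_bytes(n) / 2**30
    print(f"N={n}: {gib:.0f} GiB per layer")
```

Under these assumptions the three lengths come to 4, 16, and 64 GiB per layer: each doubling of N multiplies the footprint by four.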
Why it matters for applications
| Scenario | Impact |
|---|---|
| Multi-turn chat | Oldest turns are typically truncated first; the model “forgets” early context |
| Long-doc QA | Document must fit in window or be chunked |
| Code generation | Large codebases require selective retrieval, not full ingestion |
| Agentic loops | Tool outputs accumulate; context fills up mid-task |
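
One common way to handle the multi-turn and agentic cases is to drop the oldest turns until the history fits. The sketch below shows that policy under stated assumptions: `fit_history` is a hypothetical helper, token counts are supplied by the caller rather than computed, and real applications often pin the system prompt or summarize dropped turns instead of discarding them.

```python
def fit_history(turns: list[tuple[str, int]], budget: int) -> list[tuple[str, int]]:
    """Keep the most recent turns whose combined token count fits in the budget.

    Each turn is (text, token_count). Walks backward from the newest turn and
    stops at the first turn that would overflow the budget.
    """
    kept = []
    used = 0
    for text, tokens in reversed(turns):
        if used + tokens > budget:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()  # restore chronological order
    return kept


# Token counts here are made up for illustration.
history = [("turn 1", 500), ("turn 2", 300), ("turn 3", 400), ("turn 4", 200)]
print(fit_history(history, budget=1_000))
# [('turn 2', 300), ('turn 3', 400), ('turn 4', 200)]
```

Stopping at the first overflow keeps the retained history contiguous, so the model never sees a conversation with a gap in the middle.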
Pricing implication
Since API providers charge per token and both prompt and completion count, every token placed in the context adds to per-call cost even if the model’s answer is short. Efficient prompt design (removing boilerplate, summarizing history) reduces cost and latency.
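A rough cost comparison shows the effect of trimming the prompt. The prices below are hypothetical placeholders, and the split into separate input and output rates is an assumption about how a provider might bill; substitute the actual rates for the model in use.

```python
def call_cost(prompt_tokens: int, completion_tokens: int,
              price_in_per_million: float, price_out_per_million: float) -> float:
    """Cost of one API call, given per-million-token prices for input and output."""
    total = prompt_tokens * price_in_per_million + completion_tokens * price_out_per_million
    return total / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
full = call_cost(100_000, 500, 3.0, 15.0)     # full history sent as-is
trimmed = call_cost(20_000, 500, 3.0, 15.0)   # history summarized before sending
print(f"full: {full:.4f}, trimmed: {trimmed:.4f}")
```

With the same 500-token answer, the prompt dominates the bill in both cases, which is why shrinking the context usually matters more than shortening the response.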