NLP & LLMs · Easy · Asked at OpenAI, Anthropic, Google

What is a context window in an LLM and why does its size matter?

For: AI / LLM Engineer, ML Engineer, Data Scientist

The short answer

The context window is the maximum number of tokens an LLM can attend to in a single forward pass. Both the input prompt and the model's own generated output count toward this limit. Its size determines how much prior text can influence each prediction, sets a hard ceiling on the combined length of documents and conversation history, and drives attention cost, which scales quadratically with sequence length under standard attention.

How to think about it

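Every transformer is trained with a maximum context length, which is tied to its positional encoding scheme and to the sequence lengths it saw during training. Some schemes (RoPE with scaling, ALiBi) allow that length to be extended after training, but a deployed model still has a fixed limit. At inference time, the model sees exactly the tokens in its current window. Tokens that fall outside the window are invisible, as if they were never written.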

What counts toward the window

System prompt
Conversation history (all prior turns)
Retrieved documents injected by RAG
Tool call results
The model’s generated response so far

A 128k-token window sounds large but fills quickly: a full novel is ~100k tokens; a long PDF with embedded tables can exceed that in one file.
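To make the budget concrete, here is a minimal sketch that adds up the components listed above and reports how many tokens remain. The `estimate_tokens` heuristic (about 4 characters per token for English text), the 128k window, and the 4k output reservation are all illustrative assumptions; a real application would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token in English."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def remaining_budget(window: int, parts: dict[str, str], reserved_output: int) -> int:
    """Tokens left after counting every input component and reserving room for the reply."""
    used = sum(estimate_tokens(text) for text in parts.values())
    return window - used - reserved_output


# Hypothetical request: every entry below competes for the same window.
parts = {
    "system_prompt": "You are a helpful assistant.",
    "history": "user: hi\nassistant: hello",
    "retrieved_docs": "x" * 400_000,  # stand-in for roughly 100k tokens of documents
    "tool_results": "{}",
}

left = remaining_budget(window=128_000, parts=parts, reserved_output=4_000)
print(f"Tokens still available: {left}")
```

A negative result would mean the request does not fit and something has to be trimmed, summarized, or retrieved more selectively.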

The quadratic cost problem

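Standard self-attention computes pairwise interactions between every token in the sequence, so compute grows as O(N²) with sequence length N, and a naive implementation that materializes the full score matrix also needs O(N²) memory. Doubling the context quadruples the attention cost. This is why long-context models rely on techniques such as sliding-window attention or linear attention variants, which reduce the amount of computation, and FlashAttention, which still computes exact attention but avoids storing the N×N matrix, keeping memory linear in N.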
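The quadratic growth is easy to see with a little arithmetic. The sketch below estimates the memory needed to store one layer's full N×N attention score matrix in a naive implementation. The head count (32) and 2 bytes per value (fp16) are illustrative assumptions, not figures for any particular model, and fused kernels avoid materializing this matrix at all.

```python
def attention_score_bytes(n_tokens: int, n_heads: int = 32, bytes_per_value: int = 2) -> int:
    """Bytes to hold one layer's N x N score matrix for every head (naive attention)."""
    return n_tokens * n_tokens * n_heads * bytes_per_value


for n in (8_000, 16_000, 32_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"N={n:>6}: {gib:8.1f} GiB per layer")

# Doubling N multiplies the score-matrix size by four.
ratio = attention_score_bytes(16_000) / attention_score_bytes(8_000)
print(f"16k vs 8k ratio: {ratio}")
```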

Why it matters for applications

Scenario	Impact
Multi-turn chat	Oldest turns are typically dropped first, so the model “forgets” early context
Long-doc QA	Document must fit in window or be chunked
Code generation	Large codebases require selective retrieval, not full ingestion
Agentic loops	Tool outputs accumulate; context fills up mid-task
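The multi-turn chat row can be sketched as a simple truncation policy: keep the system prompt, then keep as many of the most recent turns as fit. This is one common approach, not the only one (summarizing old turns is another). The function names and the character-based token estimate are hypothetical stand-ins for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token, plus one."""
    return len(text) // 4 + 1


def fit_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Return the most recent turns that fit in the budget alongside the system prompt."""
    remaining = budget - estimate_tokens(system_prompt)
    kept: list[str] = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break  # this turn and everything older is dropped
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


turns = ["a" * 40, "b" * 40, "c" * 40]  # oldest to newest
print(fit_history("sys", turns, budget=25))
```

With this policy the oldest turn is the first to disappear, which is exactly the "forgetting" behavior users notice in long conversations.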

Pricing implication

Since API providers charge per token and both prompt and completion count, filling more of the context window directly increases per-call cost even if the model’s answer is short. Efficient prompt design (removing boilerplate, summarizing history) reduces both cost and latency.
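A quick calculation shows how much the prompt dominates the bill. The prices below are hypothetical placeholders, not any provider's actual rates; substitute the current figures from your provider's pricing page.

```python
def call_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Dollar cost of one call, given prices per million tokens."""
    return (prompt_tokens * input_price_per_mtok
            + completion_tokens * output_price_per_mtok) / 1_000_000


INPUT_PRICE = 3.0    # hypothetical: dollars per million input tokens
OUTPUT_PRICE = 15.0  # hypothetical: dollars per million output tokens

# Same short answer (200 tokens), two very different prompt sizes.
full = call_cost(100_000, 200, INPUT_PRICE, OUTPUT_PRICE)
trimmed = call_cost(10_000, 200, INPUT_PRICE, OUTPUT_PRICE)
print(f"full prompt:    ${full:.3f}")
print(f"trimmed prompt: ${trimmed:.3f}")
```

Under these assumed prices the completion costs the same in both calls, so nearly all of the difference comes from the prompt.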
