Deep Learning Hard Asked at GoogleAsked at OpenAIAsked at MetaAsked at Anthropic

Why is standard self-attention O(n^2) in sequence length, and how is it addressed?

For ML Engineer AI / LLM Engineer Data Scientist

The short answer

Self-attention computes a dot product between every query and every key, producing an n-by-n score matrix that must be stored and normalised. Both compute and memory scale quadratically with sequence length, which becomes prohibitive at context lengths beyond a few thousand tokens. Approximate attention methods trade exact attention for sub-quadratic complexity.

How to think about it

For a sequence of length n with model dimension d, the attention score matrix Q K^T is n × n. Computing it requires O(n^2 d) multiply-accumulate operations, and storing it requires O(n^2) memory. For n=4096 with 32-bit floats, the score matrix alone is 64 MB per head — multiply by h heads and N layers and it becomes the dominant memory cost.

Why it matters in practice:

At n=1000, quadratic is fine.
At n=32000 (typical for long-document tasks), the score matrix is ~4 GB per head — infeasible without approximation.
Inference KV-cache also grows as O(n) per token generated, which is O(n^2) total for a full generation.

Approaches to reduce cost:

Method	Key idea	Complexity
Sparse attention (Longformer, BigBird)	Each token attends to local window + global tokens	`O(n)`
FlashAttention	Tile Q, K, V in SRAM; avoid materialising full score matrix	`O(n^2)` compute, `O(n)` memory
Linear attention	Approximate softmax with kernel trick; reorder computation	`O(n)`
Sliding window (Mistral)	Local attention window of size `w`	`O(n w)`
ALiBi / RoPE	Relative bias added at score level; no extra matrix	`O(n^2)` but cheaper constant

FlashAttention is the most widely adopted: it keeps exact attention semantics while reducing memory IO by operating in hardware-fast SRAM tiles, achieving roughly 3–5× wall-clock speedup over naive attention at the same algorithmic complexity.

Learn it properly Self-attention

Why is standard self-attention O(n^2) in sequence length, and how is it addressed?

Keep practising

Explore further