Why is standard self-attention O(n^2) in sequence length, and how is it addressed?
Self-attention computes a dot product between every query and every key, producing an n-by-n score matrix that must be stored and normalised. Both compute and memory scale quadratically with sequence length, which becomes prohibitive at context lengths beyond a few thousand tokens. Approximate attention methods trade exact attention for sub-quadratic complexity.
How to think about it
For a sequence of length n with model dimension d, the attention score matrix Q K^T is n × n. Computing it requires O(n^2 d) multiply-accumulate operations, and storing it requires O(n^2) memory. For n=4096 with 32-bit floats, the score matrix alone is 64 MB per head — multiply by h heads and N layers and it becomes the dominant memory cost.
Why it matters in practice:
- At
n=1000, quadratic is fine. - At
n=32000(typical for long-document tasks), the score matrix is ~4 GB per head — infeasible without approximation. - Inference KV-cache also grows as
O(n)per token generated, which isO(n^2)total for a full generation.
Approaches to reduce cost:
| Method | Key idea | Complexity |
|---|---|---|
| Sparse attention (Longformer, BigBird) | Each token attends to local window + global tokens | O(n) |
| FlashAttention | Tile Q, K, V in SRAM; avoid materialising full score matrix | O(n^2) compute, O(n) memory |
| Linear attention | Approximate softmax with kernel trick; reorder computation | O(n) |
| Sliding window (Mistral) | Local attention window of size w | O(n w) |
| ALiBi / RoPE | Relative bias added at score level; no extra matrix | O(n^2) but cheaper constant |
FlashAttention is the most widely adopted: it keeps exact attention semantics while reducing memory IO by operating in hardware-fast SRAM tiles, achieving roughly 3–5× wall-clock speedup over naive attention at the same algorithmic complexity.