datarekha
Deep Learning Hard Asked at GoogleAsked at OpenAIAsked at MetaAsked at Anthropic

Why is standard self-attention O(n^2) in sequence length, and how is it addressed?

The short answer

Self-attention computes a dot product between every query and every key, producing an n-by-n score matrix that must be stored and normalised. Both compute and memory scale quadratically with sequence length, which becomes prohibitive at context lengths beyond a few thousand tokens. Approximate attention methods trade exact attention for sub-quadratic complexity.

How to think about it

For a sequence of length n with model dimension d, the attention score matrix Q K^T is n × n. Computing it requires O(n^2 d) multiply-accumulate operations, and storing it requires O(n^2) memory. For n=4096 with 32-bit floats, the score matrix alone is 64 MB per head — multiply by h heads and N layers and it becomes the dominant memory cost.

Why it matters in practice:

  • At n=1000, quadratic is fine.
  • At n=32000 (typical for long-document tasks), the score matrix is ~4 GB per head — infeasible without approximation.
  • Inference KV-cache also grows as O(n) per token generated, which is O(n^2) total for a full generation.

Approaches to reduce cost:

MethodKey ideaComplexity
Sparse attention (Longformer, BigBird)Each token attends to local window + global tokensO(n)
FlashAttentionTile Q, K, V in SRAM; avoid materialising full score matrixO(n^2) compute, O(n) memory
Linear attentionApproximate softmax with kernel trick; reorder computationO(n)
Sliding window (Mistral)Local attention window of size wO(n w)
ALiBi / RoPERelative bias added at score level; no extra matrixO(n^2) but cheaper constant

FlashAttention is the most widely adopted: it keeps exact attention semantics while reducing memory IO by operating in hardware-fast SRAM tiles, achieving roughly 3–5× wall-clock speedup over naive attention at the same algorithmic complexity.

Learn it properly Self-attention

Keep practising

All Deep Learning questions

Explore further

Skip to content