What is the KV cache in a transformer and why does it matter for inference?
The KV cache stores the key and value tensors computed during previous forward passes so they do not need to be recomputed for every new token during autoregressive generation. Without it, each new token would require a full forward pass over the entire context from scratch, so the total projection and feed-forward work for a generation would grow quadratically with sequence length rather than linearly.
How to think about it
Self-attention computes three vectors for each token at each layer: a query (Q), a key (K), and a value (V). During autoregressive generation the query for the current (new) token needs to attend to the keys and values of all previous tokens. With causal masking, a token's K and V depend only on that token and the ones before it, so they do not change as later tokens are appended. Recomputing them on every step is pure redundancy.
What the cache stores
After processing the prompt (the “prefill” phase), the model saves every K and V tensor for every layer. On each subsequent generation step:
- Compute Q, K, V for the single new token only.
- Append the new K and V to the cache.
- Run attention using the new Q against all cached K and V.
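This reduces the per-step projection and feed-forward work from O(N) token computations to O(1), where N is the current context length. The attention step itself still costs O(N) per new token, because the new query is scored against every cached key.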
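The three steps above can be sketched for a single attention head in plain Python. This is a minimal illustration under stated assumptions, not how any particular framework implements it: the names `KVCache`, `decode_step`, and `recompute_step` are hypothetical, the weights are random, and the prompt is fed token by token rather than in one parallel prefill pass. The loop checks that the cached path gives the same output as recomputing K and V for the whole context.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Hypothetical per-layer cache: one K and one V vector per token seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x_new, Wq, Wk, Wv, cache):
    """Project only the new token, extend the cache, attend over all cached K/V."""
    q, k, v = matvec(Wq, x_new), matvec(Wk, x_new), matvec(Wv, x_new)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def recompute_step(xs, Wq, Wk, Wv):
    """Baseline without a cache: re-project K and V for every token in the context."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]

cache = KVCache()
for t, x in enumerate(tokens):
    cached = decode_step(x, Wq, Wk, Wv, cache)
    full = recompute_step(tokens[: t + 1], Wq, Wk, Wv)
    # Both paths see the same keys and values, so the outputs agree.
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached, full))
```

Note that `attend` still loops over every cached key, which is the per-step cost that grows with context length. Only the projections are reduced to a single token.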
Memory cost
The KV cache grows linearly with sequence length and with batch size. For a single sequence with standard multi-head attention it holds:
KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_element
For a 13B model (40 layers, 40 heads, 128 head dim) at fp16 over 4096 tokens:
2 × 40 × 40 × 128 × 4096 × 2 bytes ≈ 3.4 GB
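Long contexts can require more KV cache memory than the model weights themselves: at 128k tokens the same model would need roughly 107 GB of cache per sequence, about four times its roughly 26 GB of fp16 weights. Because the cache also scales with the number of concurrent sequences, it is often what limits batch size in production serving.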
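The arithmetic above can be reproduced with a small helper. This is a sketch: the function name `kv_cache_bytes` and its parameters are made up for illustration, and the weight footprint assumes roughly 13e9 parameters at 2 bytes each.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Bytes of KV cache; the leading 2 counts one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)

# The 13B configuration from the text: 40 layers, 40 heads, head_dim 128, fp16.
size_4k = kv_cache_bytes(40, 40, 128, 4096)
size_128k = kv_cache_bytes(40, 40, 128, 128 * 1024)
weights_fp16 = 13e9 * 2  # assumption: about 13e9 parameters at 2 bytes each

print(f"{size_4k / 1e9:.2f} GB")    # 3.36 GB
print(f"{size_128k / 1e9:.1f} GB")  # 107.4 GB
print(size_128k > weights_fp16)     # True
```

Passing a smaller `num_kv_heads` or `bytes_per_element` shows how head sharing or quantisation changes the footprint.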
Techniques to reduce KV cache pressure
| Technique | Mechanism |
|---|---|
| Multi-query attention (MQA) | One K/V head shared across all Q heads; shrinks the cache by a factor of num_heads |
| Grouped-query attention (GQA) | K/V heads shared across groups of Q heads; shrinks the cache by the group size, a middle ground between MQA and full multi-head attention (MHA) |
| Sliding-window attention | Only the last W tokens are cached; older tokens are dropped, so the cache size is capped |
| KV quantisation | Store K/V at int8 or int4; 2-4x memory reduction relative to fp16, usually with small quality loss |
| PagedAttention (vLLM) | Allocates the KV cache in fixed-size blocks, like virtual memory pages, to reduce fragmentation; enables larger effective batch sizes |
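
As one illustration from the table, a sliding-window cache can be sketched as a fixed-length rolling buffer. This is only a sketch of the idea using `collections.deque`. The class name `SlidingWindowKVCache` is hypothetical, and the tuples stand in for real K and V tensors.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps K/V for only the most recent `window` tokens."""
    def __init__(self, window):
        # A deque with maxlen discards the oldest entry when a new one is appended.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    # Placeholder entries tagged with their token position.
    cache.append(("k", pos), ("v", pos))

positions = [pos for _, pos in cache.keys]
print(positions)   # [6, 7, 8, 9]
print(len(cache))  # 4
```

Memory stays constant once the window is full, at the cost of the model no longer attending directly to tokens older than W positions.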
Prefill vs decode phases
Prefill: the entire prompt is processed in one parallel forward pass. This phase is typically compute-bound, with high GPU utilisation.
Decode: tokens are generated one at a time, each reusing the cache. This phase is typically memory-bandwidth-bound: every step must read the full weight matrices and the KV cache from memory to produce a single token per sequence, so most of the available compute sits idle. This asymmetry motivates speculative decoding and continuous batching.
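A rough back-of-the-envelope estimate makes the asymmetry concrete. The sketch below uses the common approximation of about 2 FLOPs per parameter per token and assumes each weight is read from memory once per forward pass. It ignores attention FLOPs, KV cache reads, and activation traffic. The function names and the accelerator figures (300e12 FLOP/s, 2e12 bytes/s) are hypothetical, chosen only for illustration.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2):
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (multiply and add) per parameter per token and one
    read of every weight per pass; the parameter count cancels out.
    """
    return 2 * tokens_per_pass / bytes_per_param

def is_compute_bound(tokens_per_pass, peak_flops, mem_bandwidth, bytes_per_param=2):
    """Compare against the ridge point: the intensity at which compute saturates."""
    ridge = peak_flops / mem_bandwidth
    return arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge

# Hypothetical accelerator: 300e12 FLOP/s and 2e12 bytes/s, ridge = 150 FLOP/byte.
PEAK_FLOPS, BANDWIDTH = 300e12, 2e12

print(is_compute_bound(4096, PEAK_FLOPS, BANDWIDTH))  # prefill, 4096 tokens: True
print(is_compute_bound(1, PEAK_FLOPS, BANDWIDTH))     # decode, batch of 1: False
print(is_compute_bound(256, PEAK_FLOPS, BANDWIDTH))   # decode, batch of 256: True
```

Under these assumptions, batching many sequences into one decode step raises the number of tokens processed per weight read, which is the effect continuous batching aims for.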