datarekha
NLP & LLMs · Hard · Asked at OpenAI, Anthropic, Google, Meta

What is the KV cache in a transformer and why does it matter for inference?

The short answer

The KV cache stores the key and value tensors computed for earlier tokens so they do not need to be recomputed for every new token during autoregressive generation. Without it, generating each token would require a full forward pass over the entire context from scratch, so the attention cost of each step would grow quadratically with sequence length rather than linearly.

How to think about it

Self-attention computes three tensors for each token at each layer: a query (Q), a key (K), and a value (V). During autoregressive generation, the query for the current (new) token must attend to the keys and values of all previous tokens. Because of causal masking, those earlier keys and values do not depend on any later token, so they are identical at every step. Recomputing them each time is pure redundancy.

What the cache stores

After processing the prompt (the “prefill” phase), the model saves every K and V tensor for every layer. On each subsequent generation step:

  1. Compute Q, K, V for the single new token only.
  2. Append the new K and V to the cache.
  3. Run attention using the new Q against all cached K and V.

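This reduces the per-step projection and feed-forward work from O(N) tokens to O(1), since only the new token passes through the layers. The attention step itself still reads all N cached keys and values, so it stays O(N) per step.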
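The three steps above can be sketched in NumPy. This is a toy illustration, not a production implementation: it assumes a single layer with a single attention head, and the names `KVCache`, `attend_full`, `d_model`, and the random weight matrices are all hypothetical. For simplicity the prompt is also fed through the cache one token at a time, whereas a real prefill would process it in one parallel pass. The sketch checks that cached decoding gives the same outputs as recomputing causal attention over the full sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hypothetical toy width
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend_full(x):
    """Causal attention over the whole sequence, recomputing K and V for every token."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_model)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # mask out later positions
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Holds the K and V rows of all tokens seen so far (one layer, one head)."""

    def __init__(self):
        self.k = np.empty((0, d_model))
        self.v = np.empty((0, d_model))

    def step(self, x_new):
        # 1. Compute Q, K, V for the single new token only.
        q = x_new @ W_q
        # 2. Append the new K and V to the cache.
        self.k = np.vstack([self.k, x_new @ W_k])
        self.v = np.vstack([self.v, x_new @ W_v])
        # 3. Attend with the new Q against all cached K and V.
        scores = self.k @ q / np.sqrt(d_model)
        return softmax(scores) @ self.v


x = rng.standard_normal((6, d_model))  # six toy token embeddings
cache = KVCache()
cached_out = np.stack([cache.step(tok) for tok in x])
full_out = attend_full(x)
print(np.allclose(cached_out, full_out))
```

Note that step 3 still touches every cached row, which is why the attention part of a decode step scales with the number of cached tokens even though the projections do not.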

Memory cost

The KV cache grows linearly with sequence length. For a single sequence it holds (the leading 2 counts K and V):

KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × bytes_per_element

For a 13B model (40 layers, 40 heads, 128 head dim) at fp16 over 4096 tokens:

2 × 40 × 40 × 128 × 4096 × 2 bytes ≈ 3.4 GB

At long contexts (128k tokens), the KV cache can require more memory than the model weights themselves. Because every sequence in a batch needs its own cache, this puts a hard limit on batch sizes in production serving.
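The arithmetic above is easy to check with a small helper. This is a sketch under stated assumptions: the function name `kv_cache_bytes` and its parameters are hypothetical, "128k" is taken to mean 131,072 tokens, and the weight footprint is approximated as 13 billion parameters at 2 bytes each.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_element=2, batch_size=1):
    """Bytes needed to cache K and V (the leading factor of 2) for a batch of sequences."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_element * batch_size)


# 13B-style configuration from the text: 40 layers, 40 heads, head_dim 128, fp16.
size_4k = kv_cache_bytes(40, 40, 128, 4096)
size_128k = kv_cache_bytes(40, 40, 128, 128 * 1024)  # assumes 128k = 131,072 tokens
weights_fp16 = 13e9 * 2  # rough assumption: 13e9 parameters at 2 bytes each

print(f"4k tokens:   {size_4k / 1e9:.2f} GB")
print(f"128k tokens: {size_128k / 1e9:.1f} GB")
print(f"weights:     {weights_fp16 / 1e9:.1f} GB")
```

Under these assumptions a single 128k-token sequence needs roughly four times the memory of the fp16 weights, and the `batch_size` argument shows how the total multiplies again for every concurrent sequence.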

Techniques to reduce KV cache pressure

| Technique | Mechanism |
|---|---|
| Multi-query attention (MQA) | One K/V head shared across all Q heads; reduces cache by num_heads factor |
| Grouped-query attention (GQA) | K/V heads shared across groups of Q heads; balance between MQA and full MHA |
| Sliding-window attention | Only the last W tokens are cached; distant tokens are dropped |
| KV quantisation | Store K/V at int8 or int4; 2-4x memory reduction with small quality loss |
| PagedAttention (vLLM) | Manages KV cache like virtual memory pages; enables larger effective batch sizes |
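As one concrete example from the table, the sliding-window idea can be sketched with a bounded queue. This is a minimal illustration, not how any particular library implements it: the class name `SlidingWindowKVCache` is hypothetical, and strings stand in for the real K and V tensors.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps K/V entries for only the most recent `window` tokens."""

    def __init__(self, window):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(f"k{pos}", f"v{pos}")  # placeholders for real tensors

print(len(cache), list(cache.keys))
```

The cache size is now bounded by W regardless of how many tokens are generated; the trade-off is that tokens outside the window can no longer be attended to directly.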

Prefill vs decode phases

Prefill: the entire prompt is processed in one parallel forward pass. Compute-bound (high GPU utilisation).

Decode: tokens are generated one at a time, each reusing the cache. Memory-bandwidth-bound, because every step must stream the full weight matrices and the KV cache from memory to do only one token's worth of arithmetic, which leaves most of the compute capacity idle. This asymmetry motivates speculative decoding and continuous batching.
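A rough way to see the asymmetry is to compare arithmetic intensity (FLOPs per byte of weights read) for one weight matrix. This is a simplified model under stated assumptions: it counts only weight traffic and ignores activations and KV cache reads, the function name `matmul_intensity` is hypothetical, and the 5120 width is chosen to match the 40 × 128 configuration used earlier.

```python
def matmul_intensity(num_tokens, d_in, d_out, bytes_per_element=2):
    """FLOPs per byte of weights read when num_tokens share one pass over a weight matrix."""
    flops = 2 * num_tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = d_in * d_out * bytes_per_element
    return flops / weight_bytes


prefill = matmul_intensity(num_tokens=2048, d_in=5120, d_out=5120)
decode = matmul_intensity(num_tokens=1, d_in=5120, d_out=5120)
batched_decode = matmul_intensity(num_tokens=32, d_in=5120, d_out=5120)

print(prefill, decode, batched_decode)
```

In this model the intensity at fp16 equals the number of tokens processed per weight load. A single-sequence decode step sits at about 1 FLOP per byte, far below what accelerators typically need to stay compute-bound, while batching 32 sequences raises it 32-fold. That is the gap continuous batching tries to close.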
