NLP & LLMs Medium
What is a KV cache and how does it speed up LLM inference?
The short answer
During autoregressive generation, attention recomputes Keys and Values for all previous tokens at every step; the KV cache stores those K and V tensors so each new token only computes its own, turning per-step cost from quadratic to linear in sequence length. The tradeoff is memory growth proportional to sequence length and batch size.
How to think about it
During autoregressive generation, attention recomputes Keys and Values for all previous tokens at every step; the KV cache stores those K and V tensors so each new token only computes its own, turning per-step cost from quadratic to linear in sequence length. The tradeoff is memory growth proportional to sequence length and batch size.