What is a KV cache and how does it speed up LLM inference?

For: AI / LLM Engineer, MLOps Engineer, ML Engineer

The short answer

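During autoregressive generation, each new token attends to every previous token. Without a cache, the model would recompute the Keys and Values of all previous tokens at every step. The KV cache stores those K and V tensors for each layer, so each step computes K and V only for the new token and attends over the cached ones, turning per-step attention cost from quadratic to linear in sequence length. The tradeoff is memory: the cache grows in proportion to sequence length and batch size, and also to the number of layers and the width of the K and V vectors.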
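
To make the mechanism concrete, here is a minimal sketch of single-head attention during decoding, written in plain Python with no libraries. It compares a step that recomputes K and V for every token against a step that appends only the new token's K and V to a cache. All names (`vec_mat`, `attend`, `KVCache`, `step_cached`, `demo`) are illustrative assumptions, not any library's API, and real implementations use preallocated GPU tensors across many layers and heads.

```python
import math
import random

def vec_mat(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def step_no_cache(xs, Wq, Wk, Wv):
    """Output for the newest token, recomputing K and V for every token so far."""
    keys = [vec_mat(x, Wk) for x in xs]      # t projections at step t
    values = [vec_mat(x, Wv) for x in xs]
    return attend(vec_mat(xs[-1], Wq), keys, values)

class KVCache:
    """Hypothetical minimal cache: one list of keys and one of values."""
    def __init__(self):
        self.keys = []
        self.values = []

def step_cached(x_new, cache, Wq, Wk, Wv):
    """Output for the newest token, projecting K and V for that token only."""
    cache.keys.append(vec_mat(x_new, Wk))    # 1 projection per step
    cache.values.append(vec_mat(x_new, Wv))
    return attend(vec_mat(x_new, Wq), cache.keys, cache.values)

def demo(seq_len=6, d=4, seed=0):
    """Run both variants step by step; return (largest output difference, cache length)."""
    rng = random.Random(seed)
    def rand_matrix():
        return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
    cache = KVCache()
    max_diff = 0.0
    for t in range(1, seq_len + 1):
        full = step_no_cache(xs[:t], Wq, Wk, Wv)
        cached = step_cached(xs[t - 1], cache, Wq, Wk, Wv)
        max_diff = max(max_diff, max(abs(a - b) for a, b in zip(full, cached)))
    return max_diff, len(cache.keys)
```

Calling `demo()` returns the largest difference between the two variants' outputs and the number of cached entries. The outputs should agree up to floating-point error, because under causal attention the K and V of an earlier token do not depend on later tokens, which is what makes caching them safe.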

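
The memory side of the tradeoff can be estimated with simple arithmetic. The sketch below assumes a standard layout in which every layer stores one K and one V tensor of shape (batch, sequence, KV heads, head dimension). The function name `kv_cache_bytes` and the example configuration are hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Assumes one K and one V tensor per layer (the leading factor of 2), each of
    shape (batch_size, seq_len, num_kv_heads, head_dim), with bytes_per_value
    bytes per element (2 for fp16 or bf16).
    """
    return (2 * num_layers * batch_size * seq_len
            * num_kv_heads * head_dim * bytes_per_value)

# Illustrative configuration (an assumption, not a specific model's spec):
# 32 layers, 32 KV heads, head_dim 128, fp16, 4096 tokens of context.
one_seq = kv_cache_bytes(batch_size=1, seq_len=4096, num_layers=32,
                         num_kv_heads=32, head_dim=128)
eight_seqs = kv_cache_bytes(batch_size=8, seq_len=4096, num_layers=32,
                            num_kv_heads=32, head_dim=128)
print(one_seq / 1024**3, "GiB for one sequence")        # 2.0
print(eight_seqs / 1024**3, "GiB for a batch of eight")  # 16.0
```

Under these assumptions the cache takes 2 GiB per 4096-token sequence and scales linearly with both batch size and sequence length, so a batch of eight such sequences needs 16 GiB for the cache alone.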


