KV cache & continuous batching
The KV cache is why LLM serving is memory-bound, not compute-bound. How caching keys and values makes generation fast, and how PagedAttention plus continuous batching deliver 10x throughput.
What you'll learn
- Why the KV cache exists and why it dominates GPU memory during inference
- How PagedAttention stops the memory waste that limits batch size
- Why continuous batching keeps the GPU busy and multiplies throughput
Before you start
Your serving lessons — self-hosting, load balancing, caching, speculative decoding — all quietly rest on one mechanism. It’s the reason LLM inference is bottlenecked by memory, not raw compute, and the reason vLLM exists. Meet the KV cache.
Why the cache exists
Generation is autoregressive: the model produces one token, appends it, and runs again. Naively, every new token would re-run attention over the entire sequence from scratch — recomputing the keys and values for every previous token, every step. That’s quadratic and wasteful, because those keys and values don’t change.
So we cache them. After processing a token, store its attention keys and values; on the next step, only the new token is processed and it attends to the cached K/V of everything before it. That’s the KV cache, and it turns each generation step from “reprocess everything” into “process one token.” The catch: that cache has to live in GPU memory for the whole request, and it’s big.
The KV cache dominates memory
For a real model, the KV cache grows with batch size × sequence length × layers × heads — and it quickly dwarfs the model weights for long contexts. GPU memory, not FLOPs, is what limits how many requests you can serve at once. Which makes how you manage that memory the whole ballgame.
PagedAttention: stop reserving memory you won’t use
The naive approach reserves the maximum possible sequence length for every request up front, as one contiguous block. A request that ends up using 40 tokens still locks memory for 2,000 — so the pool fills with empty reservations and only a handful of requests fit. PagedAttention (the idea behind vLLM) borrows from operating-system virtual memory: split the cache into small fixed blocks and hand them out on demand as each sequence grows. No giant up-front reservation, almost no waste. See the difference:
PagedAttention cut KV-cache waste so dramatically that achievable batch sizes — and therefore throughput — jumped severalfold, with near-zero fragmentation.
Continuous batching: never wait for the slowest request
The second half of the win is scheduling. Static batching groups requests, runs them together, and can’t admit a new one until the whole batch finishes — so short requests sit idle waiting for the longest one, wasting the GPU. Continuous batching (a.k.a. iteration-level scheduling) works at the granularity of a single token step: the moment any sequence finishes, it frees its blocks and a waiting request takes its place — mid-flight. The GPU never idles waiting for stragglers.
Quick check
Quick check
Next
The KV cache is the foundation under the cost and throughput numbers in cost & latency engineering, and the reason model routing and caching pay off.
Practice this in an interview
All questionsPagedAttention stores the KV cache in non-contiguous fixed-size blocks like OS virtual-memory pages, eliminating the fragmentation and over-reservation of contiguous KV allocation and enabling sharing across sequences. Continuous batching dynamically adds and removes requests from a batch at the token level instead of waiting for the whole batch to finish, sharply improving GPU utilization and throughput.
During autoregressive generation, attention recomputes Keys and Values for all previous tokens at every step; the KV cache stores those K and V tensors so each new token only computes its own, turning per-step cost from quadratic to linear in sequence length. The tradeoff is memory growth proportional to sequence length and batch size.
The KV cache stores the key and value tensors computed during previous forward passes so they do not need to be recomputed for every new token during autoregressive generation. Without it, generating each token would require a full forward pass over the entire context from scratch, making inference cost grow quadratically with sequence length rather than linearly.
GPUs execute tensor operations efficiently only when the batch dimension is large enough to saturate all CUDA cores. Dynamic batching collects individual requests arriving within a short window and fuses them into a single GPU call, dramatically improving throughput and cost efficiency without sacrificing per-request latency beyond the configured wait threshold.