What problem does PagedAttention solve, and what is continuous batching?

PagedAttention stores the KV cache in non-contiguous fixed-size blocks like OS virtual-memory pages, eliminating the fragmentation and over-reservation of contiguous KV allocation and enabling sharing across sequences. Continuous batching dynamically adds and removes requests from a batch at the token level instead of waiting for the whole batch to finish, sharply improving GPU utilization and throughput.

What is a KV cache and how does it speed up LLM inference?

Without a cache, autoregressive generation recomputes the keys and values for all previous tokens at every step. The KV cache stores those K and V tensors so each step computes K and V only for the new token, which cuts the per-step attention cost from quadratic to linear in sequence length. The tradeoff is memory that grows in proportion to sequence length and batch size.

When the KV cache doesn't fit in GPU VRAM, what are your options?

The KV cache is working memory — it's re-read to generate every token — so it has to stay fast. When VRAM fills, you offload the least-active sessions down a memory hierarchy: GPU VRAM (active, ~3 TB/s), CPU RAM over PCIe (idle, ~50 GB/s), local SSD (long-idle), and networked storage (cold/durable only, never live decode). Idle sessions are parked lower and promoted back to VRAM on activity. The alternative is to drop the cache and recompute the prefill when the session returns; for long prompts, offloading and reloading usually beats recomputing attention over thousands of tokens.
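
As a rough illustration of parking the least-active sessions lower in the hierarchy, here is a minimal sketch. The `Session` record, the `pick_sessions_to_offload` helper, and every size and idle time are hypothetical values chosen for the example, not any serving framework's API.

```python
from dataclasses import dataclass

@dataclass
class Session:
    name: str
    kv_gib: float        # size of this session's KV cache (assumed known)
    idle_seconds: float  # time since the session last generated a token

def pick_sessions_to_offload(sessions, vram_budget_gib):
    """Return names of sessions to move out of VRAM, longest-idle first,
    until the caches that remain fit in the budget."""
    in_vram = sum(s.kv_gib for s in sessions)
    offload = []
    for s in sorted(sessions, key=lambda s: s.idle_seconds, reverse=True):
        if in_vram <= vram_budget_gib:
            break
        offload.append(s.name)
        in_vram -= s.kv_gib
    return offload

# Hypothetical sessions: 18 GiB of cache competing for a 10 GiB budget.
sessions = [
    Session("a", kv_gib=4.0, idle_seconds=0.2),
    Session("b", kv_gib=6.0, idle_seconds=900.0),
    Session("c", kv_gib=3.0, idle_seconds=45.0),
    Session("d", kv_gib=5.0, idle_seconds=2.0),
]
to_offload = pick_sessions_to_offload(sessions, vram_budget_gib=10.0)
print(to_offload)
```

A real system would also decide which tier each offloaded session lands in and when to promote it back; this sketch only covers the choice of what leaves VRAM.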

What is the KV cache in a transformer and why does it matter for inference?

The KV cache stores the key and value tensors computed during previous forward passes so they do not need to be recomputed for every new token during autoregressive generation. Without it, generating each token would require a full forward pass over the entire context from scratch, making the per-token attention cost grow quadratically with context length rather than linearly.

KV cache & continuous batching — Generative AI

Your serving lessons — self-hosting, load balancing, caching, speculative decoding — all quietly rest on one mechanism. It’s the reason LLM inference is bottlenecked by memory, not raw compute, and the reason vLLM exists. Meet the KV cache.

Why so few requests fit — until you page the cache

The GPU's KV cache is a fixed pool of 48 blocks. Add concurrent requests and watch them fill it. The old way reserves the max sequence length up front, so short requests waste most of their reservation. PagedAttention hands out blocks on demand — so far more requests fit, and the GPU stays busy.

[Interactive demo, contiguous mode with 6 requests: 4 / 6 requests fit, 31% utilization, 69% wasted. Legend: request KV, reserved but unused, free.]

Contiguous: only 4 requests fit, 69% wasted. Every request reserved 12 blocks for a sequence that might never get that long, so the pool fills with empty reservations and the batch stays tiny. That wasted memory is throughput you're not getting.

Why the cache exists

Generation is autoregressive: the model produces one token, appends it, and runs again. Naively, every new token would re-run attention over the entire sequence from scratch — recomputing the keys and values for every previous token, every step. That’s quadratic and wasteful, because those keys and values don’t change.

So we cache them. After processing a token, store its attention keys and values; on the next step, only the new token is processed and it attends to the cached K/V of everything before it. That’s the KV cache, and it turns each generation step from “reprocess everything” into “process one token.” The catch: that cache has to live in GPU memory for the whole request, and it’s big.
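
To make the saving concrete, here is a toy single-head attention loop written two ways. It is a sketch under simplifying assumptions: the weights and token embeddings are made up, the "projection" is a per-component scaling rather than a matrix multiply, and there is only one layer. The point is only that both versions give the same outputs while the cached one computes far fewer key/value projections.

```python
import math

def project(x, w):
    """Toy 'projection': scale each component of x by a fixed weight."""
    return [xi * wi for xi, wi in zip(x, w)]

def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

WQ, WK, WV = [1.0, 0.5], [0.5, 1.0], [2.0, -1.0]            # made-up weights
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # made-up embeddings

def generate_without_cache(tokens):
    outputs, kv_projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, WK) for x in prefix]    # recomputed every step
        values = [project(x, WV) for x in prefix]
        kv_projections += len(prefix)
        outputs.append(attend(project(prefix[-1], WQ), keys, values))
    return outputs, kv_projections

def generate_with_cache(tokens):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        k_cache.append(project(x, WK))   # only the new token's K and V
        v_cache.append(project(x, WV))
        kv_projections += 1
        outputs.append(attend(project(x, WQ), k_cache, v_cache))
    return outputs, kv_projections

slow_out, slow_work = generate_without_cache(tokens)
fast_out, fast_work = generate_with_cache(tokens)
print(slow_work, fast_work)
```

For these four tokens the uncached loop projects keys and values for 1 + 2 + 3 + 4 = 10 token positions, while the cached loop projects 4, one per step.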

The KV cache dominates memory

For a real model, the KV cache size scales as 2 (keys and values) × batch size × sequence length × layers × KV heads × head dimension × bytes per element, and for long contexts or large batches it can exceed the model weights. GPU memory, not FLOPs, is what limits how many requests you can serve at once. That makes how you manage that memory the whole ballgame.
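
A back-of-the-envelope calculation shows how fast this adds up. The sketch below assumes a 7B-class model shape (32 layers, 32 KV heads, head dimension 128, fp16 values); the `kv_cache_bytes` helper and these numbers are illustrative assumptions, not measurements of any particular deployment.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes for keys and values across all layers (the factor 2 is K and V)."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3

# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128, fp16.
shape = dict(layers=32, kv_heads=32, head_dim=128)
per_token = kv_cache_bytes(1, 1, **shape)
one_seq = kv_cache_bytes(1, 4096, **shape)
batch_8 = kv_cache_bytes(8, 4096, **shape)
weights = 7e9 * 2   # roughly 7 billion parameters at 2 bytes each

print(f"per token:           {per_token / 1024:.0f} KiB")
print(f"one 4096-token seq:  {one_seq / GIB:.1f} GiB")
print(f"batch of 8:          {batch_8 / GIB:.1f} GiB "
      f"vs weights {weights / GIB:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, one 4096-token sequence costs 2 GiB, and a batch of eight such sequences needs 16 GiB, more than the roughly 13 GiB of fp16 weights.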

PagedAttention: stop reserving memory you won’t use

The naive approach reserves the maximum possible sequence length for every request up front, as one contiguous block. A request that ends up using 40 tokens still locks memory for 2,000 — so the pool fills with empty reservations and only a handful of requests fit. PagedAttention (the idea behind vLLM) borrows from operating-system virtual memory: split the cache into small fixed blocks and hand them out on demand as each sequence grows. No giant up-front reservation, almost no waste. See the difference on one 48-block pool:

PagedAttention cut KV-cache waste so dramatically that achievable batch sizes — and therefore throughput — jumped severalfold, with near-zero fragmentation.
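
The on-demand allocation described above can be sketched as a free list plus a per-sequence block table. This is a minimal illustration, not vLLM's implementation: the `PagedKVPool` class, its method names, and the 16-token block size are assumptions made for the example.

```python
BLOCK_TOKENS = 16   # tokens per block (assumed size for this sketch)

class PagedKVPool:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; take a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * BLOCK_TOKENS:   # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV pool exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = PagedKVPool(num_blocks=48)
for _ in range(40):
    pool.append_token("short")    # 40 tokens occupy 3 blocks
for _ in range(100):
    pool.append_token("long")     # 100 tokens occupy 7 blocks
blocks_in_use = 48 - len(pool.free_blocks)
pool.release("short")
print(blocks_in_use, len(pool.free_blocks))
```

The only waste is the unused tail of each sequence's last block, which is why the block size is kept small relative to typical sequence lengths.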

Continuous batching: never wait for the slowest request

The second half of the win is scheduling. Static batching groups requests, runs them together, and can’t admit a new one until the whole batch finishes — so short requests sit idle waiting for the longest one, wasting the GPU. Continuous batching (a.k.a. iteration-level scheduling) works at the granularity of a single token step: the moment any sequence finishes, it frees its blocks and a waiting request takes its place — mid-flight. The GPU never idles waiting for stragglers.

Continuous batching fills freed capacity the instant a sequence finishes, instead of waiting for the whole batch.
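
The scheduling difference can be sketched as a small simulation that counts token steps. It is a simplification under stated assumptions: every request occupies exactly one batch slot, prefill and memory limits are ignored, and the request lengths are made up. The function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    """Refill free slots from the queue at every token step."""
    waiting = deque(lengths)
    running = []          # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step for every active request; drop those that finish.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps

lengths = [2, 9, 3, 2, 8, 2, 3, 2]   # tokens to generate per request (made up)
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)
print(static, continuous)
```

With four slots, static batching needs 9 + 8 = 17 steps because each batch waits for its longest request, while the continuous scheduler finishes the same work in 10 steps by reusing slots as soon as they free up.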

POOL = 48           # total KV blocks on the GPU
MAXLEN = 12         # max sequence reserved per request (contiguous mode)
req_lens = [4, 3, 6, 2, 5, 3, 7, 2, 4, 3]   # blocks each request actually uses

# Contiguous: reserve MAXLEN per request -> few fit, lots wasted
slots = POOL // MAXLEN
contig_fit = min(len(req_lens), slots)
contig_used = sum(req_lens[:contig_fit])
contig_reserved = contig_fit * MAXLEN

# Paged: allocate only what's used -> pack many more in
paged_fit, ptr = 0, 0
for L in req_lens:
    if ptr + L > POOL:
        break
    ptr += L
    paged_fit += 1

print(f"contiguous: {contig_fit} requests fit, "
      f"{round(100*(contig_reserved-contig_used)/POOL)}% of pool wasted")
print(f"paged:      {paged_fit} requests fit, ~0% wasted")
print(f"\nsame GPU, {paged_fit/contig_fit:.1f}x as many concurrent requests with paging")

contiguous: 4 requests fit, 69% of pool wasted
paged:      10 requests fit, ~0% wasted

same GPU, 2.5x as many concurrent requests with paging

The whole 48-block pool fills with just 4 contiguous reservations. Each locks 12 blocks but uses only a handful, so 69% sits empty but reserved. Paging hands out blocks as each sequence actually grows, packing all 10 requests into the same pool. Same GPU, 2.5× the concurrency in this toy pool, and that extra concurrency is where the throughput vLLM buys you comes from.

In one breath

Generation is autoregressive, so the KV cache stores past tokens’ attention keys/values — each step processes only the new token instead of recomputing everything.
That cache lives in GPU memory for the whole request and grows with batch × length × layers × heads, so serving is memory-bound, not compute-bound.
PagedAttention ditches max-length reservation for small on-demand blocks (OS-style paging), nearly eliminating waste and multiplying batch size.
Continuous batching schedules per token-step: the instant a sequence finishes it frees its blocks and a waiting request slots in, so the GPU never idles on the slowest request.
Together they are what “use vLLM/SGLang” buys: the vLLM authors reported up to roughly 24× the throughput of a naive Hugging Face Transformers baseline, though the actual gain depends on the model and workload.

Quick check

Q1. What does the KV cache store, and why?

Q2. What problem does PagedAttention solve?

Q3. How does continuous batching differ from static batching?

The KV cache is the foundation under the cost and throughput numbers in cost & latency engineering, and the reason model routing and caching pay off.
