datarekha
Infrastructure April 16, 2026

KV cache management: paged attention, prefix caching, LMCache

The KV cache is the dominant memory bottleneck in LLM serving, and three ideas — PagedAttention, prefix caching, and cross-instance LMCache — have rewired how it's managed. Here's how each layer earns its place in production.

13 min read · by datarekha · kv-cache, paged-attention, prefix-caching, vllm

The story of LLM serving in the last three years is mostly the story of one data structure: the key-value cache. Every other optimisation in the stack — continuous batching, speculative decoding, FP8 quantisation, disaggregated prefill/decode — is in some sense downstream of how you manage the KV cache. The teams that won serving were the ones that figured this out first.

By 2026 the KV cache layer has stratified into three distinct technologies, each addressing a different scale of reuse:

  • PagedAttention (vLLM, 2023) — paginate the cache inside one GPU so requests don’t fragment it.
  • Prefix caching — reuse the cache across requests on one instance so shared system prompts and tool definitions don’t get recomputed.
  • LMCache (2024) — share the cache across instances via Redis or RocksDB so multi-replica fleets don’t redo each other’s work.

Each layer earns its place. Each is doing real work the others can’t do. This post is a tour of the three layers, the numbers behind each, and how they fit together in a 2026 serving stack.

Why the KV cache is the bottleneck

A transformer generating text without a cache reprocesses the entire context for every new token, recomputing attention keys and values it has already computed. The KV cache is the standard fix: store the keys and values for every prior token in HBM so each decoding step only computes the new token’s row and attends over the stored ones. Without the cache, the cost of each new token grows quadratically with context length. With it, the per-token cost grows linearly, but the cache itself takes memory.

For Llama-3-70B in fp16 (80 layers, 8 grouped-query KV heads of dimension 128, keys and values both stored), the KV cache is roughly 4 MB per 1K tokens per layer × 80 layers = ~320 MB per 1K tokens. A single request with a 32K context occupies ~10 GB of HBM just for its cache. A 100K-context request occupies ~32 GB. Multiply by concurrent requests and the KV cache rapidly becomes larger than the model weights.
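
The arithmetic behind these figures can be sketched in a few lines. The model shape used here (80 layers, 8 grouped-query KV heads, head dimension 128) is an assumption taken from the published Llama-3-70B configuration rather than from this post, and the helper name is illustrative. Keys and values are both counted at 2 bytes per element; an FP8 cache would halve every figure.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

for context_len in (1_000, 32_000, 100_000):
    gb = per_token * context_len / 1e9  # decimal gigabytes
    print(f"{context_len:>7} tokens -> {gb:.1f} GB")
```

Swapping in another model’s layer count, KV head count, and head dimension gives the same estimate for that model. Note that grouped-query attention matters a great deal here: the cache scales with the number of KV heads, not the number of query heads.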

The pre-vLLM serving systems handled this poorly. They allocated each request a contiguous slab of cache sized to the maximum possible length, typically the model’s context window. A request that finished at 1K tokens still held 32K of cache. Empirically this wasted 60-80% of the memory set aside for the KV cache. The practical consequence was that your H100 ran two or three concurrent requests when it should have been running twenty.

[Figure: Three layers of KV cache reuse. Layer 1, PagedAttention: paginate the cache within one GPU; fixes fragmentation; 24x throughput. Layer 2, prefix caching: reuse the cache across requests on the same instance; 5-10x for agents. Layer 3, LMCache: share the cache across instances via Redis / RocksDB; 15x throughput.]
The three layers stack. Each one is reuse at a wider scope, and each is doing work the layer below cannot.

Layer 1 — PagedAttention, the virtual memory analogy

The vLLM paper (UC Berkeley’s Sky Computing Lab, 2023) made the direct analogy to OS-level virtual memory. Don’t allocate contiguous slabs. Break the cache into fixed-size blocks (typically 16 tokens of KV per block per layer). Keep a per-request page table (vLLM calls it a block table) mapping the logical sequence positions to physical block locations. Allocate blocks on demand as the sequence grows.
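
A toy model of that bookkeeping is sketched below. It is not vLLM’s implementation: the class and method names are hypothetical, block ids stand in for real KV tensors, and a production engine performs the logical-to-physical translation inside its attention kernels.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free.extend(block_ids)


class Sequence:
    """One request: a block table mapping logical blocks to physical blocks."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % self.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        block = self.block_table[position // self.block_size]
        return block, position % self.block_size

    def free(self) -> None:
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    a.append_token()  # 20 tokens occupy 2 blocks
for _ in range(5):
    b.append_token()  # 5 tokens occupy 1 block
print(len(a.block_table), len(b.block_table), len(allocator.free))
```

The only memory a sequence can waste is the unfilled tail of its last block, which is why the waste is bounded by the block size rather than by the context window.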

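The implications, in the vLLM team’s own numbers:

  • Wasted KV cache memory drops from 60-80% to under 4%.
  • The vLLM launch post reports up to 24x higher throughput than HuggingFace Transformers and up to 3.5x higher than HuggingFace TGI on the same hardware; the paper itself reports 2-4x over FasterTransformer and Orca at comparable latency.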
  • Memory utilisation at peak is high enough that a single H100 can hold dozens of concurrent 32K-context requests instead of two or three.

PagedAttention is now table-stakes. Every serious 2026 serving stack — vLLM, SGLang, TGI, NVIDIA TensorRT-LLM, Anthropic’s and OpenAI’s internal stacks — implements some variant. The NVIDIA TensorRT-LLM KV cache reuse documentation makes the dependency explicit: you can’t enable cache reuse without building the engine with --use_paged_context_fmha. The page is the unit of everything that comes after.

The thing worth understanding: PagedAttention is not free. The page table lookup, the block allocator, the CUDA kernels needed to read non-contiguous blocks — these all cost something. On a single uncontended request, PagedAttention is slower than a naive contiguous allocator. It wins on throughput because it packs many requests into the same GPU, not because it makes any single request faster. This is a subtle but important distinction when reading benchmark headlines.

Layer 2 — Prefix caching, where agents secretly live

A prefix is a sequence of tokens at the start of multiple requests that’s identical across them. The canonical example: every request to a chatbot starts with the same multi-thousand-token system prompt. Every tool-using agent starts with the same MCP tool definitions. Every RAG pipeline starts with the same few-shot exemplars. The KV for that prefix is computed identically across all those requests. Recomputing it is pure waste.

Prefix caching is the realisation that the cache for that prefix can be hashed, stored, and reused. The next request that begins with the same prefix skips the prefill compute for the cached portion entirely. The vLLM automatic prefix caching docs describe the mechanism: prefixes are hashed at the block level, matched against a cache on every incoming request, and on a hit the prefill phase starts from where the prefix ended.
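
The block-level matching can be sketched as follows. This is a simplified illustration under stated assumptions, not vLLM’s code: the names are hypothetical, SHA-256 over a string encoding stands in for whatever hash the engine uses, and the cache never evicts. Each block’s hash is chained to its parent’s hash because a token’s KV depends on every token before it, so two identical blocks are only interchangeable if everything preceding them is identical too.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Chained hash per full block: each hash covers the block and all blocks before it."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks: dict[str, int] = {}  # block hash -> physical block id
        self.next_block_id = 0

    def cached_prefix_len(self, token_ids: list[int]) -> int:
        """Number of leading tokens whose KV is already cached."""
        hit_blocks = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            hit_blocks += 1
        return hit_blocks * BLOCK_SIZE

    def insert(self, token_ids: list[int]) -> None:
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                self.blocks[h] = self.next_block_id
                self.next_block_id += 1


cache = PrefixCache()
system_prompt = list(range(100, 148))    # 48 tokens = 3 full blocks
cache.insert(system_prompt + [1, 2, 3])  # first request populates the cache
hit = cache.cached_prefix_len(system_prompt + [9, 9, 9, 9])
print(hit)  # prefill for the second request can start at this token offset
```

Only full blocks are cached in this sketch, so a shared prefix is reusable up to the last block boundary it covers.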

The numbers are not small:

  • For agentic workloads, where every tool call re-sends the same tool definitions and system prompt, prefix caching delivers 5-10x speedups on time-to-first-token. The llm-d project’s benchmarks on this are worth a read.
  • Research from the KVFlow paper on multi-agent workflows reports up to 1.83x speedup for single workflows with large prompts, and 2.19x for many concurrent workflows.
  • Anthropic exposes this to customers explicitly as prompt caching: set a cache breakpoint, pay 10% of the base input price for tokens read from the cache, and get up to ~85% lower latency on long prompts (Anthropic’s own figure).

SGLang’s RadixAttention is a more sophisticated variant. Instead of hashing block-by-block (vLLM’s default), RadixAttention stores prefixes in a radix tree indexed at the token level. This catches nested and branching reuse patterns that block-level hashing handles less precisely, which is useful when many conversations share a system prompt, diverge at user turn 1, and then each branch is extended by later turns. Reuse always stops at the first token where two sequences differ, because a token’s KV depends on everything before it. The published numbers show a 29% throughput edge over vLLM on H100 for general workloads, and up to 6.4x on prefix-heavy workloads like RAG and multi-turn chat.
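
The token-level matching idea can be illustrated with a plain trie. This is a deliberate simplification and not SGLang’s data structure: a real radix tree compresses single-child chains into one edge, attaches KV block references to nodes, and evicts under memory pressure. All names here are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}


class TokenPrefixTree:
    """Uncompressed token trie; a radix tree would merge single-child chains."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids: list[int]) -> None:
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())

    def match_len(self, token_ids: list[int]) -> int:
        """Length in tokens of the longest cached prefix of token_ids."""
        node, matched = self.root, 0
        for tok in token_ids:
            node = node.children.get(tok)
            if node is None:
                break
            matched += 1
        return matched


tree = TokenPrefixTree()
system = [7] * 20                 # shared 20-token system prompt
tree.insert(system + [1, 2, 3])   # conversation A
tree.insert(system + [4, 5])      # conversation B branches after the prompt

follow_up = system + [1, 2, 3, 8, 9]  # next turn of conversation A
print(tree.match_len(follow_up), tree.match_len(system + [6]))
```

In this example the follow-up turn matches 23 tokens and a brand-new conversation matches the 20-token system prompt. A cache that only matches whole 16-token blocks would stop at token 16 in both cases.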

The thing to internalise: prefix caching wins where prefixes are predictable, and loses where prompts are unique. For free-form chat with no system prompt, it’s neutral or slightly negative (you pay the hash-and-lookup overhead with no reuse). For an MCP-style agent where every call includes the same 4K of tool schema, it’s the single biggest performance lever you have.

Layer 3 — LMCache, the cross-instance step

Prefix caching solves the problem on one instance. Production fleets have hundreds of instances, and the prefixes are globally shared but locally fragmented. A user’s first request lands on replica A and populates replica A’s prefix cache. Their second request lands on replica B (load-balanced round-robin) and gets no hit. For prefixes that are reused only a handful of times, such as one user’s conversation history, the effective hit rate across a fleet of N replicas falls toward 1/N of what the single-instance hit rate suggests. Hot prefixes like a global system prompt fare better, but every replica still pays its own cold miss.

The fix is either sticky routing (route all requests with the same prefix to the same replica) or shared storage (store the cache in a central place all replicas can read). Sticky routing has limits — it breaks load balancing, and it doesn’t help when a replica dies. Shared storage is the more general solution, and it’s what LMCache does.
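
A small simulation makes the routing effect concrete. Everything in it is a hypothetical setup: the traffic pattern (100 users, 4 requests each, one private prefix per user), the unbounded per-replica caches, and seeded random placement standing in for any prefix-oblivious load balancer such as round-robin.

```python
import hashlib
import random


def sticky_replica(prefix_key: str, num_replicas: int) -> int:
    """Prefix-hash routing: the same prefix always maps to the same replica."""
    digest = hashlib.sha256(prefix_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas


def hit_rate(requests: list[str], num_replicas: int, sticky: bool, seed: int = 0) -> float:
    """Fraction of requests landing on a replica that already holds their prefix."""
    rng = random.Random(seed)
    caches = [set() for _ in range(num_replicas)]  # unbounded per-replica caches
    hits = 0
    for prefix_key in requests:
        if sticky:
            replica = sticky_replica(prefix_key, num_replicas)
        else:
            replica = rng.randrange(num_replicas)  # prefix-oblivious balancing
        if prefix_key in caches[replica]:
            hits += 1
        caches[replica].add(prefix_key)
    return hits / len(requests)


# Hypothetical traffic: 100 users, 4 requests each, one private prefix per user.
requests = [f"user-{u}" for _ in range(4) for u in range(100)]

oblivious = hit_rate(requests, num_replicas=8, sticky=False)
sticky = hit_rate(requests, num_replicas=8, sticky=True)
print(f"oblivious={oblivious:.2f} sticky={sticky:.2f}")
```

With sticky routing only each user’s first request misses, so the hit rate is 0.75; the prefix-oblivious rate comes out far lower. The sketch also shows the weakness of naive stickiness: hashing modulo the replica count remaps most prefixes whenever a replica is added or removed, which is one reason shared storage is the more general answer.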

LMCache is an open-source KV cache layer that sits between the serving engine (vLLM, SGLang) and a tiered storage backend. The LMCache tech report describes a three-tier hierarchy:

  • L0 — GPU HBM, the existing in-engine cache.
  • L1 — host RAM on the same machine.
  • L2 — external storage: Redis, RocksDB, S3-compatible object stores, RDMA-attached arrays.

When a request arrives, LMCache checks L0 first, then L1, then L2. Misses pay the cost of computing the KV; hits in L1 pay a host-to-device memcpy; hits in L2 pay a network round-trip but still skip prefill for the cached tokens. The Redis blog post on LMCache integration reports compelling end-to-end numbers: up to 15x higher throughput and at least 2x lower latency on enterprise workloads with high prefix reuse across replicas.
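
The lookup order can be sketched with plain dictionaries standing in for GPU HBM, host RAM, and a remote store. This is not the LMCache API: the class, its methods, the promote-on-hit policy, and the write-through-on-miss policy are all assumptions made for illustration.

```python
from typing import Callable, Optional


class TieredKVCache:
    """One replica's view: private L0 and L1, plus an L2 shared by the fleet."""

    def __init__(self, shared_l2: dict[str, bytes]):
        # Plain dicts stand in for GPU HBM, host RAM, and a remote store.
        self.tiers: list[tuple[str, dict[str, bytes]]] = [
            ("L0", {}), ("L1", {}), ("L2", shared_l2),
        ]

    def lookup(self, key: str) -> tuple[Optional[bytes], str]:
        for i, (name, store) in enumerate(self.tiers):
            if key in store:
                value = store[key]
                # Promote the hit into every faster tier.
                for _, faster in self.tiers[:i]:
                    faster[key] = value
                return value, name
        return None, "miss"

    def get_or_compute(self, key: str, prefill: Callable[[], bytes]) -> tuple[bytes, str]:
        value, source = self.lookup(key)
        if value is None:
            value = prefill()            # pay full prefill only on a miss
            for _, store in self.tiers:  # write through to every tier
                store[key] = value
        return value, source


shared_l2: dict[str, bytes] = {}
replica_a, replica_b = TieredKVCache(shared_l2), TieredKVCache(shared_l2)

_, src_a = replica_a.get_or_compute("sys-prompt", lambda: b"kv-bytes")
_, src_b = replica_b.get_or_compute("sys-prompt", lambda: b"kv-bytes")
_, src_b_again = replica_b.get_or_compute("sys-prompt", lambda: b"kv-bytes")
print(src_a, src_b, src_b_again)
```

Replica A computes the KV once, replica B finds it in the shared tier without running prefill, and B’s next request is served from its own L0. In a real deployment the L2 value would be serialised KV tensors keyed by a prefix hash, and each tier would have its own capacity and eviction policy.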

The architectural shift LMCache enables is the part the benchmarks underplay. With cross-instance KV reuse, you can:

  • Scale instances independently of cache. Spin up a new replica, and it inherits the global prefix cache from L2 on its first request.
  • Survive evictions. A long system prompt’s KV that gets evicted from L0 can be restored from L1 or L2 without a full recompute.
  • Move prefill closer to data. Compute the KV for a large RAG context once per document update, serve it from L2 forever after.

LMCache is the first KV cache layer where the cache itself is a first-class component you can scale, monitor, and reason about independently. That’s a meaningful shift in how serving infrastructure is designed.

How the three layers compose

A diagram of the full stack as it ships in a 2026 production deployment:

[Figure: KV cache stack in production. An inference router (cache-aware, prefix-sticky) hashes the incoming prefix and picks the replica with a hot cache. Replica A and Replica B each run a vLLM engine with PagedAttention, automatic prefix caching (L0), and an LMCache adapter to L1/L2. Both connect to a shared L2 store (Redis / RocksDB / RDMA pool) that provides cross-instance KV reuse and persists across replica restarts.]
A 2026 KV cache stack. Each layer extends the scope of reuse: PagedAttention on each GPU, prefix caching per replica, LMCache across the fleet.

The composition is what makes the multipliers compound. PagedAttention makes each GPU’s cache dense. Prefix caching makes each replica’s cache reusable across requests. LMCache makes the fleet’s cache reusable across replicas. The same idea — don’t recompute things you already computed — applied at three scales.

When each layer matters

A working rule for which layer to invest in, based on workload shape:

  • You’re running on one GPU, light traffic. PagedAttention is on by default in vLLM/SGLang/TGI. You’re done. Don’t over-engineer.
  • You’re running a small fleet (1-5 replicas), heavy agentic workload. Enable automatic prefix caching. The win is large and the configuration is one flag (--enable-prefix-caching in vLLM). Measure your time-to-first-token before and after; you should see a 3-10x drop on prefix-heavy traffic.
  • You’re running a large fleet (10+ replicas) with shared system prompts or big RAG contexts. This is LMCache territory. Set up Redis as L2, deploy the LMCache adapter, watch your cross-instance hit rate jump from ~1/N to whatever your prefix sharing pattern allows.
  • You’re an API provider serving thousands of distinct tenants. Cache-aware routing matters more than caching itself. Anthropic’s prompt-caching feature, Together AI’s dedicated endpoints, and the llm-d project’s distributed cache scheduler all converge on the same idea: route by prefix hash, not round-robin.

The mistake teams keep making: skipping straight to LMCache when their workload doesn’t have prefix overlap. LMCache is excellent at sharing what you already have; if you don’t have shared prefixes, you’re paying overhead for nothing. Measure first.
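
One way to measure first is to replay a sample of tokenised prompts and count how many tokens a block-level prefix cache could have skipped. The function below is a rough sketch under stated assumptions: the name is hypothetical, the cache is unbounded and shared by all requests, and only full 16-token blocks count, so the result is an upper bound rather than a prediction.

```python
def reusable_token_fraction(prompts: list[list[int]], block_size: int = 16) -> float:
    """Upper bound on the share of prompt tokens a block-level prefix cache could skip.

    Replays prompts in order against an unbounded cache of block-aligned prefixes.
    """
    seen: set[tuple[int, ...]] = set()
    reused = total = 0
    for tokens in prompts:
        total += len(tokens)
        full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
        for end in range(block_size, full + 1, block_size):
            prefix = tuple(tokens[:end])
            if prefix in seen:
                reused += block_size  # this block's KV was already computed
            else:
                seen.add(prefix)
    return reused / total if total else 0.0


# Hypothetical agent traffic: a 64-token shared schema plus 16 unique tokens per call.
shared = list(range(64))
agent_prompts = [shared + [1000 + i] * 16 for i in range(10)]

# Hypothetical free-form traffic: nothing in common between prompts.
unique_prompts = [[i] * 80 for i in range(10)]

print(reusable_token_fraction(agent_prompts), reusable_token_fraction(unique_prompts))
```

If the fraction on real traffic is close to zero, neither prefix caching nor a shared cache tier has anything to reuse. Storing whole prefixes as tuples is quadratic in prompt length, so this is meant for offline samples rather than a production path.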

The numbers in one place

A consolidated tally of what each layer is worth, in approximate production multipliers (each measured against its own baseline, not against the row above it):

KV CACHE LAYER MULTIPLIERS

LAYER                     WIN                        WORKLOAD CONDITION
PagedAttention            2-4x throughput            any concurrent load
Prefix cache (block)      5-10x TTFT                 shared system prompts
Prefix cache (radix)      up to 6.4x throughput      branching prefixes, RAG
LMCache cross-instance    up to 15x throughput       multi-replica + global reuse
FP8 / INT4 KV quant       2-4x effective L0          long contexts on small GPUs
Cache-aware routing       5-10x on cached fraction   multi-replica fleets
Each row’s multiplier is real but measured against a specific baseline. They don’t literally compose — they overlap — but each is doing work the others can’t.

The thing the table makes obvious: the multipliers are all conditional. Prefix caching is worthless if your prompts don’t share prefixes. LMCache’s gains shrink to nothing on a single replica. KV quantisation helps only when you were memory-bound to begin with. Picking which layer to invest in is, almost always, a workload-shape question first and a benchmark question second.

Where this is going

Three trends to watch over the next 12 months:

  • KV cache compression. Quantising the cache itself to FP8 or INT4 doubles or quadruples the effective L0 size. NVIDIA’s TensorRT-LLM is shipping this; vLLM has experimental support.
  • Cache as a first-class API. Anthropic’s prompt-caching is an early sign — explicit cache breakpoints in the user-facing API. Expect OpenAI and Google to follow with more granular controls.
  • Inference-time cache hints. The agent frameworks (LangGraph, MAF) are starting to surface “this prompt portion is stable, cache it; this portion is volatile, don’t” as a configuration. Cache management is becoming part of the application layer, not just the infra layer.

Takeaway

The 100M-tokens-a-second figures the frontier labs casually report would be physically impossible without aggressive KV cache management. The techniques behind those numbers are all public:

  • PagedAttention is the foundation. If you’re self-hosting, you’re already on it; you don’t have a choice.
  • Prefix caching is the highest-leverage flag in production. Turn it on, measure, and watch your TTFT collapse on agent workloads.
  • LMCache is the cross-fleet step that promotes the cache itself to a managed component. It pays for itself in mid-to-large fleets with high prefix reuse; below that scale, it’s premature optimisation.

The point of the three-layer stack is not that any single optimisation saved the day. It’s that reuse at every scale beats recompute at every scale, and the labs that won serving figured this out before everyone else did. The good news is the primitives are public. The infrastructure gap between the frontier labs and the rest of the industry is, today, mostly operational know-how — not secrets.


Further reading: the vLLM PagedAttention paper, SGLang’s RadixAttention paper, the LMCache tech report, and the NVIDIA TensorRT-LLM KV cache reuse docs together cover the whole stack.
