Continuous batching: the one trick that made LLM serving 10x cheaper

In the spring of 2023 the question “why is hosting a 70B model so expensive?” had a one-line answer that nobody outside the labs quite believed: static batching is wrong, and almost every open-source serving system uses static batching. The fix, then in research papers and not yet in production stacks, was an idea called continuous batching — sometimes also called iteration-level batching, in-flight batching, or rolling batch.

A year later, every serious LLM serving system had adopted it, and the cost-per-million-tokens curve had bent down by an order of magnitude.

This post traces what changed, why it mattered, and what the production behaviour looks like.

The static batching world

Static batching was inherited, more or less directly, from the world of training and from earlier non-autoregressive serving (BERT, T5 inference). The shape is:

The server collects a batch of N incoming requests.
It pads them all to the same length — typically the maximum sequence length supported by the model.
It launches one giant transformer forward pass that produces N output tokens in parallel.
It repeats that forward pass, one new token per request per pass, until every request in the batch has hit its end-of-sequence token.
The batch is returned to clients, and the server starts collecting the next batch.

This is a perfectly reasonable design — except for what LLM workloads actually look like. Request lengths vary by 10-100x. A “hi” request is 5 tokens; a “summarise this 8K context” request is 8000 tokens. Padding the short one to match the long one wastes ~99.9% of the compute for that slot. The GPU is doing matrix multiplies on padding tokens. And every request in the batch is blocked from leaving until the slowest one finishes.
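The padding arithmetic is easy to check by hand. The sketch below counts padded token positions as a rough proxy for wasted compute; attention cost is not linear in length, so treat the result as an approximation. The function names are illustrative and not taken from any library.

```python
from typing import List, Optional


def slot_waste(length: int, padded_length: int) -> float:
    """Fraction of one slot's token positions that are padding."""
    if length > padded_length:
        raise ValueError("sequence is longer than the padded length")
    return 1.0 - length / padded_length


def batch_waste(lengths: List[int], pad_to: Optional[int] = None) -> float:
    """Fraction of all token positions in a padded batch that are padding."""
    target = max(lengths) if pad_to is None else pad_to
    if any(n > target for n in lengths):
        raise ValueError("a sequence is longer than the padded length")
    return 1.0 - sum(lengths) / (target * len(lengths))


# The "hi" request (5 tokens) padded to match the 8000-token request.
print(f"short slot wasted: {slot_waste(5, 8000):.4%}")
# The same two requests viewed as one batch of two slots.
print(f"whole batch wasted: {batch_waste([5, 8000]):.2%}")
```

The per-slot figure is the one quoted above: the short request's slot is almost entirely padding, even though the batch as a whole is only about half wasted.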

Figure: static batching (top) versus continuous batching (bottom). Same wall-clock time, dramatically different utilisation: the GPU never sits idle waiting for stragglers.

The empirical waste in the static-batching world: Anyscale measured GPU utilisation at 30-40% on production-shaped traffic, despite the GPU running continuously. The remaining 60-70% was either padding or post-completion waiting.

The continuous batching shift

The Orca paper (OSDI ’22) made one operational change with outsized consequences: instead of treating the batch as a request-level object, treat it as an iteration-level object. That is, schedule at every forward pass, not at every request.

Concretely, at every model decoding step (every token generated for every active request):

Evict any requests that emitted end-of-sequence in the previous step.
Admit any queued requests waiting for a slot, filling vacancies.
Compute one decoding step for the now-current batch — every active request advances by exactly one token.

Three structural consequences flow from this:

No padding. Each request occupies exactly its current token slot; there’s no need to pad to a maximum.
No stragglers. A long request doesn’t block short ones from leaving; they leave the moment they’re done.
No empty slots. As soon as a slot frees, the next queued request fills it. The GPU stays full.
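The evict, admit, step loop can be sketched as a toy scheduler. This is a minimal illustration under simplifying assumptions, not the scheduler of vLLM or any other system: prefill is ignored, every request is assumed to need at least one decode step, and `Request`, `tokens_left`, and `run_continuous` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Request:
    name: str
    tokens_left: int  # decode steps remaining; stands in for "until EOS"


def run_continuous(queue: Iterable[Request], max_slots: int) -> Dict[str, int]:
    """Iteration-level scheduling: evict, admit, then one decode step.

    Returns a mapping from request name to the step on which it finished.
    """
    waiting = deque(queue)
    active: List[Request] = []
    finished_at: Dict[str, int] = {}
    step = 0
    while waiting or active:
        # 1. Evict requests that emitted their last token on the previous step.
        active = [r for r in active if r.tokens_left > 0]
        # 2. Admit queued requests into whatever slots are now free.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        if not active:
            break
        # 3. One decode step: every active request advances by one token.
        step += 1
        for r in active:
            r.tokens_left -= 1
            if r.tokens_left == 0:
                finished_at[r.name] = step
    return finished_at


done = run_continuous(
    [Request("A", 3), Request("B", 1), Request("C", 2)], max_slots=2
)
print(done)
```

With two slots, B finishes on the first step and C takes over its slot on the second, so C completes alongside A instead of waiting for A to finish first.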

The price you pay is bookkeeping. The scheduler now runs on the critical path of every forward pass — it must, in single-digit microseconds, decide which requests are active, which are evicted, which are admitted, and pass the right KV-cache pointers to the attention kernel. The vLLM and SGLang code paths that implement this are some of the most performance-sensitive in any open-source serving stack.

The numbers, and why “10x” is the conservative estimate

Anyscale’s continuous batching post ran the canonical benchmark in 2023. On OPT-13B on an A100, comparing Hugging Face Transformers (static batching) against an early version of vLLM (continuous batching), at the same SLO:

Static batching: ~7 requests/sec sustainable throughput.
Continuous batching: ~165 requests/sec.

That’s 23x. For the very specific reason that production traffic has heavy-tailed length distributions, static batching was leaving most of the GPU on the table.

A few caveats on that headline:

The 23x assumes a realistic length distribution. If all requests were exactly the same length, static batching would be fine. They never are.
The 23x assumes the bottleneck is decode throughput. If you’re prefill-bound (very long prompts, short completions), the win is smaller — maybe 3-5x.
The 23x assumes you have spare requests to admit. At very low QPS, continuous batching and static batching converge — you have one request, there’s nothing to batch with.
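The first and third caveats can be checked with a toy model that counts decode steps only. It assumes every step costs the same regardless of batch size, that requests are served in arrival order, and that prefill is free; none of that holds exactly on real hardware, so the ratios are illustrative rather than predictive. The function names are hypothetical.

```python
import heapq
from typing import List


def static_steps(lengths: List[int], slots: int) -> int:
    """Decode steps for static batching: each batch runs to its longest member."""
    return sum(
        max(lengths[i:i + slots]) for i in range(0, len(lengths), slots)
    )


def continuous_steps(lengths: List[int], slots: int) -> int:
    """Decode steps for continuous batching: a freed slot is refilled at once."""
    free_at = [0] * slots  # step at which each slot next becomes free
    heapq.heapify(free_at)
    finish = 0
    for n in lengths:
        start = heapq.heappop(free_at)  # earliest-free slot takes the request
        end = start + n
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


mixed = [10, 10, 10, 1000] * 4    # heavy-tailed completion lengths
uniform = [100] * 16              # every request the same length

print(static_steps(mixed, 4), continuous_steps(mixed, 4))
print(static_steps(uniform, 4), continuous_steps(uniform, 4))
print(static_steps([250], 4), continuous_steps([250], 4))
```

Under these assumptions the mixed workload needs 4000 steps with static batching and 1070 with continuous batching. The uniform workload needs 400 steps either way, and a lone request takes the same 250 steps under both schemes.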

Where continuous batching wins the hardest is the place LLM APIs operate hottest: moderate-to-high QPS, mixed prompt lengths, mixed completion lengths. Which is to say, the actual production case.

Continuous batching and the KV cache

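There’s an awkward interaction worth flagging. Continuous batching admits new requests mid-stream, which means new KV caches must be allocated mid-stream. With static-slab cache allocation (pre-vLLM), this is impractical: each slab is a contiguous region reserved up front for a request’s maximum possible length, so a new request can only be admitted when a whole slab is free, however little of the other slabs is actually in use.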

This is why continuous batching and paged KV cache are co-evolved. Without paged allocation, you can’t dynamically size the cache for newly admitted requests, and continuous batching’s “admit on every step” benefit is lost. vLLM’s contribution wasn’t just continuous batching, or just PagedAttention — it was the integration, the realisation that you need both for either to fully pay off.
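To make the dependency concrete, here is a minimal sketch of block-based cache bookkeeping: fixed-size blocks, a free list, and a per-request block table. It illustrates the idea only and is not vLLM’s implementation; the class and method names are hypothetical, and no tensors are actually allocated.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy paged allocator: requests own lists of fixed-size cache blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request -> block ids
        self.num_tokens: Dict[str, int] = {}

    def admit(self, req_id: str, prompt_len: int) -> bool:
        """Reserve just enough blocks for the prompt, or refuse admission."""
        needed = (prompt_len + self.block_size - 1) // self.block_size
        if needed > len(self.free_blocks):
            return False
        self.block_tables[req_id] = [
            self.free_blocks.pop() for _ in range(needed)
        ]
        self.num_tokens[req_id] = prompt_len
        return True

    def append_token(self, req_id: str) -> bool:
        """Grow by one token; a new block is needed only when the last is full."""
        if self.num_tokens[req_id] % self.block_size == 0:
            if not self.free_blocks:
                return False  # caller must wait or preempt something
            self.block_tables[req_id].append(self.free_blocks.pop())
        self.num_tokens[req_id] += 1
        return True

    def release(self, req_id: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(req_id))
        del self.num_tokens[req_id]


cache = PagedKVAllocator(num_blocks=4, block_size=16)
print(cache.admit("A", 40))   # A takes 3 of the 4 blocks
print(cache.admit("B", 20))   # B needs 2 blocks and must wait
cache.release("A")            # A finishes; its blocks are reusable at once
print(cache.admit("B", 20))   # B now fits
```

Because admission only asks for as many blocks as the request needs right now, the scheduler can decide on every step whether one more request fits, which is exactly the question continuous batching needs answered.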

Our overview of the serving stack treats this layering in more detail.

What it looks like under load — a trace

Imagine four requests arriving over a 200ms window at a vLLM endpoint:

T=0ms: Request A arrives. 5K input tokens, expects ~200 output tokens.
T=20ms: Request B arrives. 50 input tokens, expects ~10 output tokens.
T=80ms: Request C arrives. 200 input tokens, expects ~2000 output tokens.
T=150ms: Request D arrives. 100 input tokens, expects ~50 output tokens.

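Under static batching, you’d wait for a batch window (say 50ms), gather A and B, pad both to A’s length, and launch. B’s 10 tokens are done almost at once, but B is not returned until A finishes its ~200. C and D wait for the next batch, which cannot launch until the first has drained, and D is then held until C’s ~2000 tokens are done. The GPU is half-idle during each long tail.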

Under continuous batching:

T=0-5ms: A’s prefill runs.
T=5ms: A starts decoding. Slot 0 is A.
T=20ms: B arrives. Slot 1 was empty; B’s prefill runs in parallel with A’s next decode steps. By T=22ms, B is decoding.
T=30ms: B finishes (it only had 10 tokens to emit). Slot 1 is freed.
T=80ms: C arrives. Slot 1 admits C. C’s prefill, then decode.
T=150ms: D arrives. Slot 2 admits D. The GPU is now processing A + C + D in parallel.

The GPU never sat idle. Each request started decoding within milliseconds of its arrival. B left about 10ms after it arrived, D after ~50ms, and A after ~200ms. C is the long pole, and the others didn’t block on it.
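The trace can be turned into a small worked example. The sketch assumes a flat 1ms per decode step regardless of batch size, a free slot on arrival, a 50ms static batch window, and 2ms prefills for C and D; those are illustrative assumptions, so the results will not match the trace above to the millisecond.

```python
from dataclasses import dataclass
from typing import Dict, List

STEP_MS = 1.0  # assumed cost of one decode step, independent of batch size


@dataclass(frozen=True)
class Req:
    name: str
    arrival_ms: float
    prefill_ms: float   # 2ms for C and D is an assumption
    output_tokens: int


TRACE = [
    Req("A", 0, 5, 200),
    Req("B", 20, 2, 10),
    Req("C", 80, 2, 2000),
    Req("D", 150, 2, 50),
]


def continuous_finish(trace: List[Req]) -> Dict[str, float]:
    """Finish times when every request gets a slot the moment it arrives."""
    return {
        r.name: r.arrival_ms + r.prefill_ms + r.output_tokens * STEP_MS
        for r in trace
    }


def static_finish(batches: List[List[Req]], first_launch_ms: float) -> Dict[str, float]:
    """Finish times when each batch runs to its slowest member before the next."""
    finish: Dict[str, float] = {}
    launch = first_launch_ms
    for batch in batches:
        launch = max(launch, max(r.arrival_ms for r in batch))
        end = (
            launch
            + max(r.prefill_ms for r in batch)
            + max(r.output_tokens for r in batch) * STEP_MS
        )
        for r in batch:
            finish[r.name] = end  # nobody leaves before the batch is done
        launch = end
    return finish


cont = continuous_finish(TRACE)
stat = static_finish([TRACE[:2], TRACE[2:]], first_launch_ms=50)
for r in TRACE:
    print(r.name, cont[r.name] - r.arrival_ms, stat[r.name] - r.arrival_ms)
```

Under these assumptions D waits a little over two seconds with static batching, because it is stuck behind both A and C, and about 52ms with continuous batching. C, the long pole, takes roughly two seconds either way.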

This is the difference, in production, between “we keep falling over at peak hours” and “we don’t.”

Where continuous batching ends

A few cases where the simple continuous-batching story needs more care:

Very long prompts (>16K tokens). Prefill of a long prompt occupies the GPU for tens of milliseconds, long enough that the in-flight decodes for other requests stall. The fix, increasingly the norm in 2026, is to split the prefill into smaller chunks that interleave with ongoing decodes. vLLM calls this “chunked prefill”; SGLang ships it by default.
Disaggregated prefill/decode. The newest serving architectures put prefill and decode on different GPU pools so the long-tail prefill never disturbs decode latency at all. See our serving overview for more on this layer.
Strict latency SLOs. Continuous batching maximises throughput, but admitting new requests when the GPU is hot increases per-token decode latency for existing requests (more attention work per step). Some serving stacks expose a max_num_seqs cap that trades throughput for tail latency.
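The chunked-prefill idea from the first case can be sketched with a per-step token budget: each in-flight decode consumes one token of the budget, and whatever is left goes to the next slice of the long prompt. The function and its parameters are hypothetical, and the sketch assumes the number of decoding requests stays constant while the prompt is being processed.

```python
from typing import List


def plan_prefill_chunks(
    prompt_len: int, num_decoding: int, token_budget: int
) -> List[int]:
    """Prefill chunk size for each step, leaving room for ongoing decodes.

    Each step processes at most `token_budget` tokens in total:
    one per decoding request, and the remainder as a prefill chunk.
    """
    if token_budget <= num_decoding:
        raise ValueError("budget leaves no room for prefill tokens")
    chunk = token_budget - num_decoding
    chunks: List[int] = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk, remaining)
        chunks.append(take)
        remaining -= take
    return chunks


# A 20,000-token prompt arriving while 8 requests are decoding,
# with an assumed budget of 2048 tokens per step.
chunks = plan_prefill_chunks(prompt_len=20_000, num_decoding=8, token_budget=2048)
print(len(chunks), chunks[0], chunks[-1])
```

The prompt is spread over ten steps instead of monopolising one long one, and the eight decoding requests each still emit a token on every one of those steps. A smaller budget shortens each step and protects decode latency at the cost of a slower prefill, which is the same throughput-versus-tail-latency trade described for strict SLOs.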

Takeaway

Continuous batching is the change that bent the LLM serving cost curve. It is not a clever algorithm — it’s a re-framing. The realisation that batching should happen at iteration granularity, not request granularity, dragged GPU utilisation from ~35% to ~85% in production deployments and turned hosting a 70B model from a luxury into a commodity.

If you write LLM serving code: the scheduler is the most important component, not the attention kernel. If you operate LLM-backed services: you can take continuous batching as table-stakes in 2026 and ask sharper questions about chunked prefill, disaggregation, and KV-cache-aware routing — the next layers of the stack.

Further reading: the Orca paper, Anyscale’s continuous batching benchmark, the vLLM PagedAttention paper, and vLLM’s scheduler design doc.