Speculative decoding in the wild: how labs cut latency by 2-3x

There is a hard ceiling on how fast a single conversation can run on a single model. Each output token requires a full forward pass through the model, and the passes are sequential: you cannot generate token T+1 until you have generated token T. For a 70B model on H100-class hardware, that ceiling is roughly 40-80 tokens per second per stream, and adding GPUs mostly buys more concurrent streams rather than a much faster one.

This is what makes the “Claude responds at 400 tokens per second” numbers in 2026 strange. The model didn’t get smaller. The hardware got faster, but not that much faster. What changed is that frontier labs broke the sequential ceiling using a technique that, when you first hear it described, sounds like it shouldn’t work: speculative decoding.

It works. Here’s how, and what the production landscape looks like.

The basic insight

The original Leviathan et al. paper (November 2022) and the parallel DeepMind paper (February 2023) made the same argument:

Running a transformer forward pass on K tokens at once costs almost the same as running it on 1 token, as long as K is small. At low batch sizes the bottleneck is memory bandwidth (loading the weights), not compute. So if you could guess the next K tokens cheaply, you could verify all K in a single pass and keep the longest prefix that the big model accepts.
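
The bandwidth argument can be checked with back-of-envelope arithmetic. The sketch below assumes every decoded token has to stream all weights from HBM once, ignores KV-cache reads and compute, and uses roughly 3.35 TB/s as the H100 SXM memory bandwidth; the function name and the byte-per-parameter choices are illustrative, not taken from any library.

```python
def decode_ceiling_tokens_per_s(n_params: float, bytes_per_param: float,
                                hbm_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed if each token streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    return hbm_bytes_per_s / weight_bytes


# Assumption: ~3.35 TB/s HBM bandwidth (H100 SXM class); KV-cache traffic ignored.
H100_HBM_BYTES_PER_S = 3.35e12

fp8 = decode_ceiling_tokens_per_s(70e9, 1.0, H100_HBM_BYTES_PER_S)   # 1 byte per weight
fp16 = decode_ceiling_tokens_per_s(70e9, 2.0, H100_HBM_BYTES_PER_S)  # 2 bytes per weight
print(f"FP8: {fp8:.0f} tok/s, FP16: {fp16:.0f} tok/s")
```

Under these assumptions the bound comes out at roughly 48 tok/s for 8-bit weights and 24 tok/s for 16-bit weights. This is a bandwidth calculation only: 16-bit weights for a 70B model would not fit in a single 80 GB card. Verifying K tokens in one pass reads the same weights once, which is why the extra positions are nearly free.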

The verification step is the clever part. You don’t need the draft model to be correct. You only need to:

Run the big model on the proposed K tokens, getting its true probability distribution at each position.
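For each position, in order, accept the draft token with probability min(1, p/q), where p is its probability under the big model and q is its probability under the draft. A token the big model likes at least as much as the draft does is always accepted. On a rejection, sample a replacement from the residual distribution: max(0, p - q) over the vocabulary, renormalised.
Stop at the first rejection and keep the accepted prefix plus the replacement token. If nothing is rejected, the same pass also yields one extra token sampled from the big model.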
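
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the acceptance rule from the original papers (accept with probability min(1, p/q), resample from the renormalised residual max(0, p - q) on rejection), not any serving framework's implementation. The function name, array shapes, and the toy distributions are assumptions made for the example.

```python
import numpy as np


def speculative_step(p, q, draft_tokens, rng):
    """One verification step.

    p: (K+1, V) target probabilities at each drafted position, plus one extra row.
    q: (K, V) draft probabilities that the draft tokens were sampled from.
    draft_tokens: the K token ids proposed by the draft.
    Returns the tokens emitted this step (between 1 and K+1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept with probability min(1, p/q). q[i, tok] > 0 because tok was sampled from q[i].
        if rng.random() < min(1.0, p[i, tok] / q[i, tok]):
            out.append(int(tok))
            continue
        # Rejected: resample from the renormalised residual max(0, p - q), then stop.
        residual = np.maximum(p[i] - q[i], 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(residual), p=residual)))
        return out
    # Every draft token was accepted: take one bonus token from the target's last row.
    out.append(int(rng.choice(p.shape[1], p=p[-1])))
    return out


# Toy check with K = 1 and a 3-token vocabulary (made-up numbers).
rng = np.random.default_rng(0)
p = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])  # target rows: K + 1 = 2
q = np.array([[0.3, 0.6, 0.1]])                    # draft rows: K = 1
counts = np.zeros(3)
for _ in range(20000):
    draft = [rng.choice(3, p=q[0])]
    counts[speculative_step(p, q, draft, rng)[0]] += 1
freq = counts / counts.sum()
print(freq)  # should land close to p[0], even though the draft disagrees with the target
```

The toy draft deliberately prefers a different token than the target, yet the first emitted token still follows the target's distribution. Only the number of tokens emitted per step suffers.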

The original papers prove that this procedure samples from exactly the same distribution as the big model on its own. No quality loss. No distributional drift. With greedy decoding you get the same tokens you’d have gotten without speculation; with sampling you get the same distribution over outputs. It is just much faster when the draft is well-aligned.

A single speculative step: the draft proposes 4 tokens; the target verifies all of them in one forward pass; the first three are accepted, the fourth is rejected and replaced with a token drawn from the target’s distribution after the draft’s probability mass has been subtracted out.

The expected speedup depends on the acceptance rate — what fraction of draft tokens the target accepts. For well-matched draft/target pairs on typical text, this hovers around 60-80%, yielding effective speedups of 1.5-3x in wall-clock latency.
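
To make the link between acceptance rate and speedup concrete, here is a small calculator based on the closed-form estimate in Leviathan et al. It assumes each drafted token is accepted independently with the same probability alpha, which is a simplification. The `draft_cost` ratio (one draft pass as a fraction of one target pass) and the value 0.05 are hypothetical inputs chosen for illustration.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens and acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Estimated wall-clock speedup; draft_cost is one draft pass relative to one target pass."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.6, 0.7, 0.8):
    print(alpha, round(expected_speedup(alpha, k=4, draft_cost=0.05), 2))
```

With these assumed inputs the estimate runs from about 1.9x at 60% acceptance to about 2.8x at 80%, which sits inside the range quoted above. Raising `draft_cost` or pushing k past the point where alpha ** k is small both erode the gain.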

The three flavours of “draft”

The draft model can come from several different places, and the production landscape uses all of them.

Draft model approach (the original)

Use a small dedicated draft model, say a 1B-parameter Llama variant for a 70B target. This is what Leviathan et al. proposed, and it’s what vLLM’s first speculative decoding implementation shipped with. The advantages:

Generic. Works for any target model; doesn’t require modifying the target.
Reasonable acceptance rates (60-70%) when the draft is trained on similar data.

The disadvantages:

Two models to load. Extra HBM, extra warmup time.
Distribution drift. Updating the target without updating the draft drops acceptance noticeably.
K is bounded. If the draft is too slow, you lose the win; if K is too large, rejection rates dominate.

Together AI’s Turbo endpoints use this approach with hand-tuned draft models for each target.

Medusa heads (no draft model)

Medusa (January 2024) skips the draft model entirely. Instead, it bolts additional decoding heads onto the target model itself, with the k-th head trained to predict the token k positions further ahead. One forward pass then yields predictions not just for position T+1, but also for T+2, T+3, …, T+K in parallel. Taking the top few candidates from each head gives a tree of candidate continuations, which the target verifies in a single pass using tree attention.
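
Verifying a whole tree in one pass relies on an attention mask in which each candidate token sees only its own ancestors, so that sibling branches do not contaminate each other. The sketch below builds such a mask from a parent-pointer list. The representation and the function name are assumptions for illustration, and the shared prompt prefix (which every node would also attend to) is left out.

```python
import numpy as np


def tree_attention_mask(parents):
    """Build an ancestor-only attention mask for a candidate tree.

    parents[i] is the index of node i's parent, or -1 for a first-level candidate.
    Returns a boolean (N, N) array where mask[i, j] is True if node i may attend
    to node j, that is, j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask


# Hypothetical tree: two candidates for T+1 (nodes 0 and 1);
# nodes 2 and 3 continue node 0, node 4 continues node 1.
parents = [-1, -1, 0, 0, 4 - 3]
mask = tree_attention_mask(parents)
print(mask.astype(int))
```

Each root-to-leaf path in the tree is one candidate continuation. After the pass, the longest path that survives verification is kept and the rest of the tree is discarded.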

Why it’s clever:

No extra model in HBM. The Medusa heads are small (~1% of model params).
Acceptance is high because the heads share representations with the target.
Plays nicely with quantisation. The heads quantise the same way as the rest of the model.

Trade-off: requires a training run that uses the target’s weights. You can’t bolt Medusa onto a model you don’t have weights for.

EAGLE and EAGLE-2 (the production winner)

EAGLE and EAGLE-2 take a different tack: train a small autoregressive draft network on the target model’s hidden-state features rather than its tokens. The draft sees the target’s intermediate representations, so it can predict in a way that closely tracks the target’s true distribution.

EAGLE-2’s reported numbers:

3-4x lossless speedup on Llama-3-70B in greedy decoding.
70-80% acceptance rate, substantially higher than draft-model approaches.
~0.5% of target model parameters — small enough to load alongside the target.

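EAGLE-3, released in 2025, pushes the acceptance rate higher by predicting tokens directly instead of features, training the draft on its own multi-step outputs, and fusing features from several target layers (dynamic draft trees had already arrived with EAGLE-2). It is supported as a speculative method in SGLang and recent vLLM versions.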

The numbers in production

A few data points from the public record:

vLLM’s spec-dec benchmark. vLLM 0.6 (September 2024) reported 1.4-2.5x speedup on Llama-3-70B serving with draft-model spec dec, depending on workload. Higher with EAGLE.
Together’s Turbo endpoints. Marketed at “200+ tok/s” on Llama-3.1-70B; their engineering post attributes about half of the gain to speculative decoding, the other half to FP8 quantisation.
DeepSeek-V3 / R1. DeepSeek introduced Multi-Token Prediction (MTP) heads during training, which double as speculative-decoding draft heads at inference time. Their default deployment runs MTP-speculative decoding and the published spec sheet shows 30+ tok/s on enormous mixture-of-experts models.
Anthropic’s Sonnet 4 Fast. Anthropic hasn’t published implementation details, but the Claude 4.5 Sonnet release post and developer-facing observations of 400+ tok/s on the Fast variant are hard to explain without speculative decoding or something like it, since model size and hardware alone are unlikely to yield that throughput.

Where speculative decoding doesn’t help (yet)

Three regimes where the technique struggles or doesn’t apply:

High-temperature sampling. When the target distribution is broad (T=1.0+), the draft is more likely to propose a low-probability token that gets rejected. Acceptance rates drop by 10-20 percentage points compared to greedy/low-temperature.
Very long contexts. The draft model carries its own KV cache and attention compute. At 100K+ contexts, the draft’s overhead starts to eat into the gain. EAGLE’s “draft sees target hidden states” trick helps but doesn’t fully eliminate this.
Tiny models. For a small target (7B and under), the cost of running even a tiny draft is a meaningful fraction of the cost of running the target itself. The technique pays off heavily for 30B+ targets and yields little for ~1B targets.

This is why you don’t see speculative decoding on every model. It’s a frontier-model technique.

Speculative decoding and continuous batching

The two interact in non-obvious ways. Continuous batching loves big batches — more concurrent requests means better GPU utilisation. Speculative decoding expands each request’s forward pass from 1 token to K tokens, which reduces the effective batch size for a fixed compute budget.

The cost-optimal trade-off depends on workload:

Low QPS, latency-sensitive (single-stream chat): speculative decoding is a big win. You don’t have other requests to fill the GPU anyway.
High QPS, throughput-sensitive (batch processing, classification): speculative decoding can reduce throughput, because you’ve spent the GPU budget on speculation instead of more concurrent requests.
Mixed: vLLM and SGLang’s schedulers can dynamically enable/disable speculation per-batch based on load. This is the increasingly common production setup.

The Together engineering team has been vocal about this trade-off: their Turbo endpoints run speculative decoding aggressively at low load and fall back to plain continuous batching as load climbs. The decision point is typically around 30-50% GPU utilisation.
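
A load-aware policy of this kind can be sketched as a function from utilisation to draft length. Everything here is hypothetical: the function name, the linear ramp between the two thresholds, and the default of 4 drafted tokens are illustrative choices, not how vLLM, SGLang, or Together actually schedule. Only the 30-50% band comes from the text above.

```python
def choose_draft_length(gpu_utilisation: float, max_k: int = 4,
                        low: float = 0.3, high: float = 0.5) -> int:
    """Return how many tokens to draft per request; 0 means plain decoding.

    Full speculation at or below `low`, none at or above `high`,
    and a linear ramp (never below 1) in between.
    """
    if not 0.0 <= gpu_utilisation <= 1.0:
        raise ValueError("gpu_utilisation must be in [0, 1]")
    if gpu_utilisation <= low:
        return max_k
    if gpu_utilisation >= high:
        return 0
    frac = (high - gpu_utilisation) / (high - low)
    return max(1, round(max_k * frac))


for load in (0.1, 0.4, 0.8):
    print(load, choose_draft_length(load))
```

A real scheduler would more likely key off queue depth or batch size than raw utilisation, and would add hysteresis so the setting does not flap around the threshold.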

What this means for the user-visible product

If you’re shipping LLM features and you can choose a “fast” variant (Anthropic’s Sonnet 4 Fast, Together’s Turbo, Fireworks’ Speed), you are most likely buying speculative decoding plus quantisation, packaged as a SKU. A smaller sibling such as OpenAI’s gpt-4o-mini is a different trade: it is a different model, not the same model served faster. The speculative-decoding part leaves the output distribution unchanged, while quantisation can shift it slightly; the latency is typically 2-3x better. Reach for it whenever the use case is latency-bound (chat, autocomplete, real-time agent step) and the quality bar is still met.

If you’re self-hosting on vLLM or SGLang, enabling speculative decoding is a config change: turn it on, point it at an EAGLE draft, measure. On latency-bound, low-concurrency workloads it is usually a win, but you do want to measure your actual acceptance rate. If it’s below 50%, your draft is poorly matched and you’ll want to either retrain it or fall back to plain decoding.
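
Measuring the acceptance rate comes down to two counters per verification pass. The helper below is a sketch that assumes you can export (proposed, accepted) draft-token counts for each target pass from your server's metrics or logs; the function name and the sample numbers are made up for illustration.

```python
def acceptance_stats(steps):
    """Summarise speculation logs.

    steps: iterable of (proposed, accepted) draft-token counts, one pair per target pass.
    Returns (acceptance_rate, mean_tokens_emitted_per_target_pass).
    """
    steps = list(steps)
    if not steps:
        return 0.0, 0.0
    proposed = sum(p for p, _ in steps)
    accepted = sum(a for _, a in steps)
    rate = accepted / proposed if proposed else 0.0
    # Each target pass emits its accepted draft tokens plus one token from the target itself.
    tokens_per_pass = (accepted + len(steps)) / len(steps)
    return rate, tokens_per_pass


# Made-up log: four passes with 4 drafted tokens each.
rate, tokens_per_pass = acceptance_stats([(4, 3), (4, 1), (4, 4), (4, 2)])
keep_speculation = rate >= 0.5  # threshold from the rule of thumb above
print(rate, tokens_per_pass, keep_speculation)
```

Tokens per target pass is the number that maps most directly onto latency, so it is worth tracking alongside the raw rate, and worth breaking down by workload since code, chat, and high-temperature sampling can behave very differently.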

Takeaway

Speculative decoding is the latency-side equivalent of what continuous batching was for throughput: a reframing that broke a ceiling everyone had accepted. The sequential-decoding ceiling was assumed inviolable from “Attention Is All You Need” until Leviathan et al. quietly pointed out that verifying K tokens costs little more than generating one.

The frontier labs have spent the last two years operationalising that insight. The result: large models that respond at hundreds of tokens per second, a chat product that streams as fast as you can read, and a class of latency-bound use cases (real-time voice, live coding assistants, fast agents) that wouldn’t exist without it.

If you want to feel the gain firsthand: prompt the same model on its “standard” vs “fast” endpoint with a 2000-token completion. Time it. Much of the 2-3x difference is speculative decoding doing exactly what the papers promised, with quantisation typically making up the rest.
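
A minimal timing harness for that comparison might look like the sketch below. The `generate` callables are hypothetical adapters that you would write around whichever client library you use; each takes a prompt and a token limit, blocks until the completion finishes, and returns the number of tokens produced. No real endpoint or SDK call is assumed here.

```python
import time


def tokens_per_second(generate, prompt, max_tokens=2000):
    """Time one completion. generate(prompt, max_tokens) must return the token count produced."""
    start = time.perf_counter()
    n_tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def compare(standard, fast, prompt, max_tokens=2000):
    """Return the fast endpoint's throughput as a multiple of the standard endpoint's."""
    slow_tps = tokens_per_second(standard, prompt, max_tokens)
    fast_tps = tokens_per_second(fast, prompt, max_tokens)
    return fast_tps / slow_tps
```

Run it several times and on more than one prompt before trusting the ratio: network latency, time to first token, and server load all leak into a single end-to-end measurement.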

Further reading: Leviathan et al. (2022), Chen et al. DeepMind (2023), Medusa, EAGLE-2, DeepSeek MTP, and vLLM’s speculative decoding docs.