Speculative Decoding
LLMs generate one token at a time, which is slow. Speculative decoding lets a small model sprint ahead and a big model check its work in one shot — 2 to 3x faster, same output.
What you'll learn
- Why autoregressive generation is sequential and why that makes large models slow
- How a small draft model proposes tokens and a large target model verifies all of them in one parallel pass
- Why the output distribution is provably identical to running the target model alone — not an approximation
1.4 seconds per token
A team ships a 70B assistant. Latency in staging: 1.4 seconds per token. A 200-token response takes nearly 5 minutes. Users abandon the page before the first sentence finishes.
They try quantization — memory drops, but wall-clock time per token barely moves. They add more GPUs — costs triple, latency halves at best. Then they try speculative decoding. Latency drops from 1.4 seconds to under 0.5 seconds per token. Same model, same output, no quality loss.
This lesson explains exactly why that works.
The root cause: one expensive step per token
Large language models generate text autoregressively (one token at a time, where each new token depends on all the previous ones). Every new token requires a full forward pass (feeding all previous tokens through the entire model to produce a probability distribution over the vocabulary, then sampling one token from it).
That forward pass is expensive because:
- It must read every weight in the model from GPU memory — for a 70B model in FP16, that is 140 GB of data moved per token.
- It is sequential: token 5 cannot start until token 4 is done.
The GPU is underutilized. Most of the time is spent waiting for memory, not doing arithmetic. For short outputs the latency is painful. For long outputs it is a product-level problem.
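The memory argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes every weight is read once per token and uses a hypothetical aggregate memory bandwidth of 2 TB/s. That bandwidth figure and the function name are illustrative assumptions, not measurements.

```python
def min_seconds_per_token(params: float, bytes_per_param: float,
                          bandwidth_bytes_per_s: float) -> float:
    """Lower bound on decode latency if every weight is read once per token."""
    weight_bytes = params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s


# 70B parameters in FP16 (2 bytes each) is 140 GB of weights.
weight_gb = 70e9 * 2 / 1e9

# Assumed bandwidth: 2 TB/s (hypothetical, for illustration only).
floor = min_seconds_per_token(70e9, 2, 2e12)

print(f"{weight_gb:.0f} GB of weights, at least {floor * 1000:.0f} ms per token")
```

Under these assumptions the floor is set by bytes moved, not arithmetic, which is why verifying several tokens in a single pass over the weights is attractive.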
The key insight: verification is cheaper than generation
Here is what makes speculative decoding possible:
Checking whether a token is correct is cheaper than choosing the token from scratch.
More precisely: a single forward pass of the big target model can evaluate the probability of multiple proposed tokens in parallel, because each token’s probability only depends on the tokens that came before it — and if you already know the proposed sequence, you can compute all those probabilities in one batched operation.
This is the same attention computation the model always does, just fed a longer input. The extra cost of checking k extra tokens is much less than k full sequential passes.
How speculative decoding works
There are two models:
- The draft model: a small, fast model (e.g. a 1B model paired with a 70B target). It generates k tokens cheaply and sequentially. For a 1B model this might cost 50 ms total.
- The target model: the large, accurate model you actually want output from. It runs one forward pass over all k draft tokens in parallel.
The algorithm in four steps:
Step 1: Draft. The draft model generates k candidate tokens (call them t1, t2, ..., tk) one at a time, each appended to the current context. For example, given “The cat”, it might propose [sat, on, the, mat].
Step 2: Verify. The target model runs one forward pass over the full sequence (context + all k draft tokens). This yields the distribution the target would have assigned at each of the k draft positions, plus one more for the position after tk.
Step 3: Accept or reject. Starting from t1, compare the target model’s probability for each draft token against the draft model’s probability, using a rejection sampling rule. If the token is at least as likely under the target as under the draft, it is always accepted; otherwise it is accepted with probability equal to the ratio of the two. The first rejected token triggers a correction: sample a replacement from the adjusted target distribution, discard the remaining draft tokens, and stop.
Step 4: Repeat. Keep the accepted draft tokens plus the one token the target supplies (the correction after a rejection, or a bonus token sampled from the extra distribution if all k drafts were accepted). That gives n new tokens, where 1 <= n <= k + 1. Append them to the context and go back to Step 1.
If the draft model is right all k times, you get k + 1 tokens for roughly the cost of one target forward pass. If it is wrong on token 1, you still get one token per round, the same as baseline. In practice a well-matched draft model accepts 3 to 4 out of every 4 proposed tokens, giving a 2x to 3x wall-clock speedup.
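The four steps can be sketched as a single round in plain Python. This is a toy illustration under stated assumptions, not a production implementation: the two "models" are hypothetical functions that return a fixed `{token: probability}` dict, the target is called once per position where a real transformer would score all positions in one forward pass, and the bonus token after k acceptances follows the usual formulation of the algorithm.

```python
import random

# Hypothetical toy distributions; a real model would condition on the context.
TARGET = {"sat": 0.5, "on": 0.3, "the": 0.15, "mat": 0.05}
DRAFT = {"sat": 0.4, "on": 0.4, "the": 0.1, "mat": 0.1}


def target_model(tokens):
    return TARGET


def draft_model(tokens):
    return DRAFT


def sample(dist, rng):
    """Draw one token from a {token: weight} dict (weights need not sum to 1)."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def speculative_round(context, draft_model, target_model, k, rng):
    """Run one draft/verify/accept round and return the newly emitted tokens."""
    # Step 1: draft k tokens sequentially, remembering each draft distribution.
    drafted, draft_dists = [], []
    for _ in range(k):
        q = draft_model(context + drafted)
        drafted.append(sample(q, rng))
        draft_dists.append(q)

    # Step 2: verify. One distribution per draft position, plus one extra.
    target_dists = [target_model(context + drafted[:i]) for i in range(k + 1)]

    # Step 3: accept or reject, left to right.
    out = []
    for tok, q, p in zip(drafted, draft_dists, target_dists):
        if rng.random() < min(1.0, p.get(tok, 0.0) / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        out.append(sample(residual, rng))
        return out

    # All k drafts accepted: the same target pass supplies one more token.
    out.append(sample(target_dists[k], rng))
    return out


rng = random.Random(0)
new_tokens = speculative_round(["The", "cat"], draft_model, target_model, 4, rng)
print(new_tokens)
```

Each call returns between 1 and k + 1 tokens. Step 4 is simply a loop that appends `new_tokens` to the context and calls `speculative_round` again.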
The diagram: four sequential steps vs one verification step
Baseline requires one target forward pass per token (4 passes for 4 tokens). Speculative decoding drafts 4 tokens cheaply, then verifies all 4 in one target pass — accepting 3 and rejecting 1, costing roughly 1 pass instead of 4.
Why the output is provably identical — not an approximation
This is the part most explanations skip, and it matters.
The rejection sampling step is not a heuristic. It uses a mathematically precise rule: accept draft token t at position i with probability min(1, p_target(t) / p_draft(t)). If rejected, sample a replacement from the residual distribution max(0, p_target - p_draft), renormalized to sum to 1. This construction guarantees that each emitted token is distributed exactly according to the target model, identical to what you would have gotten running the target model alone.
That means speculative decoding is not a trade-off between speed and quality. It is a speed improvement with no quality change. The two code paths produce outputs drawn from the same distribution, token for token.
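The identity can be checked exactly for a single position. The sketch below computes, without any sampling, the probability that each token is emitted by one accept-or-resample step, then compares it with the target distribution. The two distributions and the function name are hypothetical, and the code assumes both dicts share the same keys and that the draft assigns nonzero probability to every token.

```python
def output_distribution(p, q):
    """Exact distribution of the token emitted by one accept/resample step.

    p is the target distribution, q is the draft distribution.
    """
    # Probability the draft proposes t AND it is accepted: q * min(1, p/q).
    accepted = {t: q[t] * min(1.0, p[t] / q[t]) for t in q}
    # Probability that the proposal is rejected, whatever it was.
    reject_mass = 1.0 - sum(accepted.values())
    # Residual distribution used on rejection: max(0, p - q), renormalized.
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(residual.values())
    return {
        t: accepted[t] + (reject_mass * residual[t] / z if z > 0 else 0.0)
        for t in p
    }


target = {"sat": 0.5, "on": 0.3, "the": 0.15, "mat": 0.05}
draft = {"sat": 0.4, "on": 0.4, "the": 0.1, "mat": 0.1}

result = output_distribution(target, draft)
for token in target:
    print(token, round(result[token], 6), target[token])
```

The accepted mass for each token is min(p, q), and the rejected mass equals the sum of max(0, p - q), so the two parts add back up to p for every token.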
The acceptance rate is everything
The acceptance rate (fraction of draft tokens accepted per round) determines the real-world speedup. If alpha is the average acceptance rate and k is the number of draft tokens per round, the expected tokens generated per target forward pass is approximately k * alpha + 1 (the +1 accounts for always generating at least one token per round).
For k = 4 and alpha = 0.8:
tokens per target pass ≈ 4 × 0.8 + 1 = 4.2
So instead of 1 token per target pass, you get roughly 4 tokens — a 4x speedup in tokens per pass. Real systems see 2x to 3x wall-clock improvement (the draft model adds some overhead, and GPU scheduling has fixed costs).
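The estimate above plugs in alpha as a measured fraction of accepted draft tokens. As a cross-check, the sketch below models acceptance token by token. It assumes (this is not stated in the text above) that each draft token is accepted independently with probability `p_accept` and that a round stops at the first rejection. Under that model the expected tokens per round is 1 + p + p^2 + ... + p^k, which is lower than k * p + 1 because later draft tokens only count if every earlier one was accepted. The function name is hypothetical.

```python
def expected_tokens_per_round(p_accept: float, k: int) -> float:
    """Expected tokens per target pass under independent per-token acceptance."""
    # Token i + 1 is emitted only if the first i draft tokens were all accepted.
    return sum(p_accept ** i for i in range(k + 1))


geometric = expected_tokens_per_round(0.8, 4)
linear = 4 * 0.8 + 1

print(f"geometric model: {geometric:.2f}, linear estimate: {linear:.1f}")
```

With k = 4 and a per-token acceptance probability of 0.8 this gives about 3.36 tokens per pass against 4.2 from the linear estimate. The same per-token quality therefore yields fewer tokens per round than the linear formula suggests, and the gap widens as k grows.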
Acceptance rate depends on:
- How well the draft model’s next-token distribution matches the target’s. A draft model trained on the same data as the target, or a distilled smaller version, works best.
- The type of text: formulaic or predictable text (code, legal boilerplate) gets very high acceptance rates. Open-ended creative generation gets lower rates.
Trade-offs and when not to use it
You run two models. GPU memory holds both the draft and target model simultaneously. A 70B target (140 GB FP16) plus a 1B draft adds around 2 GB — negligible. But if memory is extremely tight, the draft model may not fit.
The draft model must be compatible. It needs to share the same tokenizer vocabulary as the target, otherwise token IDs do not refer to the same strings and the two probability distributions cannot be compared. You cannot pair an arbitrary small model with an arbitrary large one.
Low acceptance rates eliminate the benefit. If the draft is wrong on every first token, every round costs k draft passes plus one target pass for the same 1 token, which is slower than baseline. This happens when draft and target are domain-mismatched (e.g. a general-purpose draft paired with a highly specialized medical target).
Memory bandwidth, not FLOPs, must be the bottleneck. Speculative decoding helps when the bottleneck is the sequential memory reads of the large model. If you are in a compute-bound regime (very long context, large batch), the benefit shrinks.
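These trade-offs can be put into a rough latency model using figures that appear earlier in this lesson: 1.4 s per target pass, about 50 ms for a round of k = 4 draft tokens, and k * alpha + 1 tokens per round. The sketch assumes that verifying k tokens costs the same as one ordinary target pass, which is the memory-bound case. The function name is hypothetical.

```python
def seconds_per_token(target_pass_s: float, draft_round_s: float,
                      alpha: float, k: int) -> float:
    """Average latency per emitted token for one draft round plus one target pass."""
    tokens_per_round = k * alpha + 1
    return (draft_round_s + target_pass_s) / tokens_per_round


baseline = 1.4  # seconds per token without speculative decoding

well_matched = seconds_per_token(1.4, 0.05, alpha=0.8, k=4)
mismatched = seconds_per_token(1.4, 0.05, alpha=0.0, k=4)

print(f"baseline:     {baseline:.2f} s/token")
print(f"alpha = 0.8:  {well_matched:.2f} s/token")
print(f"alpha = 0.0:  {mismatched:.2f} s/token")
```

With a well-matched draft the model lands under 0.5 s per token. With a draft that is never accepted, every round pays for the draft and still emits one token, so latency ends up above the baseline.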
Related ideas
Medusa removes the separate draft model entirely. It attaches extra “heads” (small output layers) to the target model itself, each predicting the token at a different future position. One forward pass of the target model generates both the next token and several speculative candidates. Deployment is simpler (one model checkpoint), at the cost of slightly lower acceptance rates than a well-matched external draft.
n-gram drafts use the context window itself as the draft: look for the last m tokens as a pattern, find the same pattern earlier in the context, and propose whatever tokens followed it last time. Zero extra parameters, works surprisingly well on repetitive text (code, documents with boilerplate), and is used in production by some serving frameworks as a cheap fallback draft strategy.
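A minimal sketch of an n-gram draft, assuming the context is already a list of tokens. The function name and default values are hypothetical; `m` and `k` have the same meaning as in the text.

```python
def ngram_draft(tokens, m=2, k=4):
    """Propose up to k tokens by finding the last m tokens earlier in the context."""
    if len(tokens) <= m:
        return []
    pattern = tokens[-m:]
    # Search backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - m - 1, -1, -1):
        if tokens[start:start + m] == pattern:
            # Propose whatever followed the pattern last time.
            return tokens[start + m:start + m + k]
    return []


context = "for i in range ( n ) : print ( i ) for i in".split()
proposal = ngram_draft(context, m=2, k=4)
print(proposal)  # ['range', '(', 'n', ')']
```

The proposal is then verified by the target model exactly like any other draft. When no earlier match exists the function returns an empty list, and the round falls back to ordinary one-token decoding.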
Lookahead decoding generalizes n-gram drafts with a more sophisticated parallel Jacobi iteration approach, generating multiple candidate continuations in parallel without a separate draft model.
Next
Quantization — shrink the target model’s memory footprint so speculative decoding fits on smaller hardware.