What techniques reduce LLM cost and latency in production?

Cost scales with input plus output tokens; latency scales with output tokens and model size. The highest-leverage levers are: model routing (use a small model when the task is simple), prompt caching (reuse expensive prefix computation), output length control, and batching. Combined, these can cut spend by 60–90% on favorable workloads, provided quality is measured before and after each change.
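To make the routing lever concrete, here is a minimal cost sketch. The per-1k-token prices, the request size, and the 70% routing share are all hypothetical numbers chosen for illustration, not figures from any provider.

```python
def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one request: input and output tokens are billed separately."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def blended_cost(small_fraction, small_cost, large_cost):
    """Average cost per request when a share of traffic goes to the small model."""
    return small_fraction * small_cost + (1 - small_fraction) * large_cost


# Hypothetical prices (USD per 1k tokens) and a hypothetical 2000-in / 500-out request.
large = request_cost(2000, 500, price_in_per_1k=0.010, price_out_per_1k=0.030)
small = request_cost(2000, 500, price_in_per_1k=0.001, price_out_per_1k=0.003)

# Assume 70% of requests are simple enough for the small model.
blended = blended_cost(0.7, small, large)
savings = 1 - blended / large

print(f"large-only: ${large:.4f}  routed: ${blended:.4f}  savings: {savings:.0%}")
```

Under these assumed numbers routing alone saves about 63%. The real figure depends entirely on the price gap and on how much traffic the small model can handle without a quality drop.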

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

How does an LLM generate text — what is next-token prediction and autoregression?

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
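The loop below is a minimal sketch of autoregressive sampling. TOY_MODEL is a hypothetical stand-in that looks only at the last token; a real LLM conditions on the whole context and has a vocabulary of tens of thousands of tokens.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the last token only.
TOY_MODEL = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"</s>": 1.0},
}


def next_token_distribution(context):
    """Return a probability distribution over the next token."""
    return TOY_MODEL[context[-1]]


def generate(max_new_tokens, seed=0):
    rng = random.Random(seed)
    context = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)          # one "forward pass"
        tokens, probs = zip(*dist.items())
        token = rng.choices(tokens, weights=probs, k=1)[0]  # sample one token
        context.append(token)                            # feed it back in
        if token == "</s>":
            break
    return context[1:]


print(" ".join(generate(20, seed=1)))
```

Each iteration needs the token produced by the previous one, which is why generation cannot simply be parallelized across output positions.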

Speculative Decoding — Generative AI

1.4 seconds per token

A team ships a 70B assistant. Latency in staging: 1.4 seconds per token. A 200-token response takes nearly 5 minutes. Users abandon the page before the first sentence finishes.

They try quantization — memory drops, but wall-clock time per token barely moves. They add more GPUs — costs triple, latency halves at best. Then they try speculative decoding. Latency drops from 1.4 seconds to under 0.5 seconds per token. Same model, same output, no quality loss.

This lesson explains exactly why that works.

The root cause: one expensive step per token

Large language models generate text autoregressively (one token at a time, where each new token depends on all the previous ones). Every new token requires a full forward pass (feeding all previous tokens through the entire model to produce a probability distribution over the vocabulary, then sampling one token from it).

That forward pass is expensive because:

It must read every weight in the model from GPU memory — for a 70B model in FP16, that is 140 GB of data moved per token.
It is sequential: token 5 cannot start until token 4 is done.

The GPU is underutilized. Most of the time is spent waiting for memory, not doing arithmetic. For short outputs the latency is painful. For long outputs it is a product-level problem.
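A rough way to see the memory-bound floor: divide the bytes of weights read per token by the memory bandwidth. The sketch below does that arithmetic. The 2 TB/s bandwidth is a hypothetical figure for illustration, and the estimate ignores the KV cache, activations, and multi-GPU communication.

```python
def weight_bytes(num_params, bytes_per_param):
    """Total bytes of weights that one forward pass has to read."""
    return num_params * bytes_per_param


def min_seconds_per_token(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if weight reads are the only cost."""
    return weight_bytes(num_params, bytes_per_param) / bandwidth_bytes_per_s


# 70B parameters at 2 bytes each (FP16), as in the text.
gb = weight_bytes(70e9, 2) / 1e9

# Assumed aggregate memory bandwidth of 2 TB/s (hypothetical).
t = min_seconds_per_token(70e9, 2, 2e12)

print(f"weights read per token: {gb:.0f} GB")
print(f"bandwidth-bound floor: {t * 1000:.0f} ms per token")
```

The floor does not depend on how many tokens the pass scores, which is the slack speculative decoding exploits.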

The key insight: verification is cheaper than generation

Here is what makes speculative decoding possible:

Checking whether a token is correct is cheaper than choosing the token from scratch.

More precisely: a single forward pass of the big target model can evaluate the probability of multiple proposed tokens in parallel, because each token’s probability only depends on the tokens that came before it — and if you already know the proposed sequence, you can compute all those probabilities in one batched operation.

This is the same attention computation the model always does, just fed a longer input. The extra cost of checking k extra tokens is much less than k full sequential passes.
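The sketch below illustrates why one pass over a known sequence yields the same per-position results as running the model on each prefix separately. It uses a tiny single-head causal attention with identity projections; the embeddings and token ids are hypothetical. The Python loop walks positions one by one for readability, whereas a real model computes all positions as batched matrix operations.

```python
import math

# Hypothetical 2-dimensional embeddings for a 4-token vocabulary.
EMBED = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, -0.5]}


def causal_attention(token_ids):
    """Single-head causal self-attention with identity projections."""
    x = [EMBED[t] for t in token_ids]
    outputs = []
    for i in range(len(x)):
        # Causal mask: position i attends only to positions 0..i.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w / total * x[j][d] for j, w in enumerate(weights)) for d in range(2)]
        )
    return outputs


sequence = [0, 2, 1, 3, 2]

# One pass over the whole (already known) sequence.
one_pass = causal_attention(sequence)

# The sequential alternative: one pass per prefix, keeping only the last position.
sequential = [causal_attention(sequence[: i + 1])[-1] for i in range(len(sequence))]

print(one_pass == sequential)
```

Because position i never looks at later tokens, appending draft tokens to the input does not change what the model computes for earlier positions.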

How speculative decoding works

There are two models:

The draft model — a small, fast model (e.g. a 1B model paired with a 70B target). It generates k tokens cheaply and sequentially. For a 1B model this might cost 50 ms total.
The target model — the large, accurate model you actually want output from. It runs one forward pass over all k draft tokens in parallel.

The algorithm in four steps:

Step 1 — Draft. The draft model generates k candidate tokens (call them t1, t2, ..., tk) one at a time, appended to the current context. For example, given “The cat”, it might propose [sat, on, the, mat].

Step 2 — Verify. The target model runs one forward pass over the full sequence (context + all k draft tokens). It produces the probability distribution it would have assigned to each position.

Step 3 — Accept or reject. Starting from t1, compare the target model’s probability for each draft token against the draft model’s probability, using a rejection sampling rule. If the target agrees with the draft’s choice (roughly: if the token was likely under the target too), accept it. The first rejected token triggers a correction: sample a replacement from the adjusted target distribution and stop.

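Step 4 — Repeat. Suppose n draft tokens were accepted (0 <= n <= k). The round emits n + 1 tokens: the n accepted drafts plus one token sampled from the target, which is either the replacement at the first rejection or, if every draft was accepted, a bonus token from the target's distribution at the next position. Append them to the context and go back to Step 1.

If the draft model is right all k times, you get k + 1 tokens for roughly the cost of one target forward pass. If it is wrong on token 1, you still get one token per round, the same as baseline, although you have also paid for the wasted draft passes. In practice a well-matched draft model accepts 3 to 4 out of every 4 proposed tokens, giving a 2x to 3x wall-clock speedup.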
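The four steps can be sketched as a single round over toy distributions. TARGET and DRAFT are hypothetical lookup tables keyed by the last token; in a real system they are two neural networks, and the k + 1 target lookups below would be one batched forward pass. Following the standard algorithm, the sketch samples one extra token from the target when all k drafts are accepted, so a round returns between 1 and k + 1 tokens.

```python
import random

VOCAB = ["a", "b", "c"]

# Hypothetical next-token distributions keyed by the last context token.
TARGET = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.2, "b": 0.2, "c": 0.6},
    "c": {"a": 0.7, "b": 0.2, "c": 0.1},
}
DRAFT = {
    "a": {"a": 0.2, "b": 0.5, "c": 0.3},
    "b": {"a": 0.3, "b": 0.3, "c": 0.4},
    "c": {"a": 0.6, "b": 0.2, "c": 0.2},
}


def sample(weights, rng):
    """Sample a token; weights need not be normalized."""
    return rng.choices(VOCAB, weights=[weights[v] for v in VOCAB], k=1)[0]


def speculative_round(context, k, rng):
    # Step 1: the draft model proposes k tokens sequentially.
    drafted, draft_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = DRAFT[ctx[-1]]
        token = sample(q, rng)
        drafted.append(token)
        draft_dists.append(q)
        ctx.append(token)

    # Step 2: target distributions at all k + 1 positions (one batched pass in practice).
    prefixes = [list(context) + drafted[:i] for i in range(k + 1)]
    target_dists = [TARGET[p[-1]] for p in prefixes]

    # Step 3: accept or reject left to right.
    emitted = []
    for i, token in enumerate(drafted):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[token] / q[token]):
            emitted.append(token)
            continue
        # First rejection: resample from the residual distribution and stop.
        residual = {v: max(0.0, p[v] - q[v]) for v in VOCAB}
        emitted.append(sample(residual, rng))
        return emitted

    # All k drafts accepted: take a bonus token from the target.
    emitted.append(sample(target_dists[k], rng))
    return emitted


rng = random.Random(0)
print(speculative_round(["a"], k=4, rng=rng))
```

Step 4 is simply calling speculative_round again with the emitted tokens appended to the context.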

The diagram: four sequential steps vs one verification step

Baseline requires one target forward pass per token (4 passes for 4 tokens). Speculative decoding drafts 4 tokens cheaply, then verifies all 4 in one target pass — accepting 3 and rejecting 1, costing roughly 1 pass instead of 4.

Why the output is provably identical — not an approximation

This is the part most explanations skip, and it matters.

The rejection sampling step is not a heuristic. It uses a mathematically precise rule: accept draft token t at position i with probability min(1, p_target(t) / p_draft(t)). If rejected, sample a replacement from the residual distribution max(0, p_target - p_draft), renormalized to sum to 1. Clipping at zero matters: without it the residual would contain negative entries wherever the draft over-weights a token. This construction guarantees that the distribution of each emitted token is exactly the target model's distribution, identical to what you would have gotten running the target model alone.

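That means speculative decoding is not a trade-off between speed and quality. It is a speed improvement with no quality change. The two code paths produce outputs drawn from the same distribution at every position. With greedy decoding that means the same tokens, up to floating-point differences between kernels; with sampling, individual responses can differ while still following the same distribution.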
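The guarantee can be checked empirically for a single position. The sketch below applies the accept rule and the residual resampling to two hypothetical distributions, P_TARGET and P_DRAFT, and compares the empirical output frequencies with the target. It is a simulation under those assumed numbers, not a proof.

```python
import random
from collections import Counter

# Hypothetical distributions; the draft deliberately over-weights "c".
P_TARGET = {"a": 0.5, "b": 0.3, "c": 0.2}
P_DRAFT = {"a": 0.2, "b": 0.2, "c": 0.6}
VOCAB = list(P_TARGET)


def sample(weights, rng):
    """Sample a token; weights need not be normalized."""
    return rng.choices(VOCAB, weights=[weights[v] for v in VOCAB], k=1)[0]


def speculative_token(rng):
    token = sample(P_DRAFT, rng)  # the draft proposes
    if rng.random() < min(1.0, P_TARGET[token] / P_DRAFT[token]):
        return token              # accepted as is
    # Rejected: resample from max(0, p_target - p_draft), renormalized by sample().
    residual = {v: max(0.0, P_TARGET[v] - P_DRAFT[v]) for v in VOCAB}
    return sample(residual, rng)


def empirical_distribution(n, seed=0):
    rng = random.Random(seed)
    counts = Counter(speculative_token(rng) for _ in range(n))
    return {v: counts[v] / n for v in VOCAB}


for token, freq in empirical_distribution(100_000).items():
    print(f"{token}: empirical {freq:.3f}  target {P_TARGET[token]:.3f}")
```

Working the "a" case by hand: it is drafted and accepted with probability 0.2, and reached through rejection of "c" with probability 0.4 × 0.75 = 0.3, for a total of 0.5, which matches the target.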

The acceptance rate is everything

The acceptance rate (fraction of draft tokens accepted per round) determines the real-world speedup. If alpha is the average acceptance rate and k is the number of draft tokens per round, the expected tokens generated per target forward pass is approximately k * alpha + 1 (the +1 accounts for the token the target itself contributes every round). Treat this as an optimistic estimate: a round stops at the first rejection, so any draft tokens after it are discarded even if they would have been accepted.

For k = 4 and alpha = 0.8:

tokens per target pass ≈ 4 × 0.8 + 1 = 4.2

So instead of 1 token per target pass, you get roughly 4 tokens — a 4x speedup in tokens per pass. Real systems see 2x to 3x wall-clock improvement (the draft model adds some overhead, and GPU scheduling has fixed costs).
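The sketch below compares the lesson's linear estimate with a stricter one. The stricter formula assumes each draft token is accepted independently with probability alpha and that the round stops at the first rejection, which gives a geometric sum; both function names are made up for this example.

```python
def tokens_per_pass_linear(k, alpha):
    """The lesson's estimate: k * alpha + 1."""
    return k * alpha + 1


def tokens_per_pass_geometric(k, alpha):
    """Expected tokens per round when acceptance stops at the first rejection.

    Assumes independent per-token acceptance with probability alpha:
    1 + alpha + alpha**2 + ... + alpha**k = (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for k, alpha in [(4, 0.8), (4, 0.5), (8, 0.8)]:
    linear = tokens_per_pass_linear(k, alpha)
    geometric = tokens_per_pass_geometric(k, alpha)
    print(f"k={k} alpha={alpha}: linear {linear:.2f}  geometric {geometric:.2f}")
```

For k = 4 and alpha = 0.8 the geometric estimate is about 3.36 tokens per pass, lower than the linear 4.2 and closer to the 2x to 3x improvements seen in practice. It also shows diminishing returns from raising k when alpha is fixed.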

Acceptance rate depends on:

How well the draft model's next-token distribution matches the target's. A draft model trained on the same data as the target, or a distilled smaller version, works best.
The type of text: formulaic or predictable text (code, legal boilerplate) gets very high acceptance rates. Open-ended creative generation gets lower rates.

Trade-offs and when not to use it

You run two models. GPU memory holds both the draft and target model simultaneously. A 70B target (140 GB FP16) plus a 1B draft adds around 2 GB — negligible. But if memory is extremely tight, the draft model may not fit.

The draft model must be compatible. It needs to share the same tokenizer vocabulary as the target; otherwise the two models' token IDs and token boundaries do not correspond, and the target cannot score the draft's proposals. You cannot pair an arbitrary small model with an arbitrary large one.

Low acceptance rates eliminate the benefit. If the draft is wrong on every first token, every round costs k draft passes plus one target pass and still yields only 1 token, which is slower than baseline. This happens when draft and target are domain-mismatched (e.g. a general-purpose draft paired with a highly specialized medical target).
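A simple cost model makes the break-even point visible. It is a sketch under three assumptions: per-token acceptance is independent with probability alpha, verifying k tokens costs about one target pass (the memory-bound regime), and one draft pass costs draft_cost_ratio of a target pass. The 0.05 ratio used below is hypothetical.

```python
def expected_tokens(k, alpha):
    """Expected tokens emitted per round, stopping at the first rejection."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(k, alpha, draft_cost_ratio):
    """Estimated wall-clock speedup over plain autoregressive decoding.

    A round costs k draft passes plus one target pass, measured in units of
    one target pass. Baseline decoding yields 1 token per target pass.
    """
    round_cost = k * draft_cost_ratio + 1
    return expected_tokens(k, alpha) / round_cost


# draft_cost_ratio=0.05 is an assumed value: one draft pass = 5% of a target pass.
for alpha in [0.0, 0.2, 0.5, 0.8, 0.95]:
    print(f"alpha={alpha:.2f}: estimated speedup {speedup(4, alpha, 0.05):.2f}x")
```

With these assumptions, alpha = 0 gives about 0.83x (slower than baseline) and alpha = 0.8 gives about 2.8x. Monitoring acceptance rate per workload tells you which side of break-even a deployment is on.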

Memory bandwidth, not FLOPs, must be the bottleneck. Speculative decoding helps when the bottleneck is the sequential memory reads of the large model. If you are in a compute-bound regime (very long context, large batch), the benefit shrinks.

Related ideas

Medusa removes the separate draft model entirely. It attaches extra “heads” (small output layers) to the target model itself, each predicting the token a different number of steps ahead (one head for position +2, another for +3, and so on). One forward pass of the target model generates both the next token and several speculative candidates. Deployment is simpler (one model checkpoint), at the price of slightly lower acceptance rates than a well-matched external draft.

n-gram drafts use the context window itself as the draft: look for the last m tokens as a pattern, find the same pattern earlier in the context, and propose whatever tokens followed it last time. Zero extra parameters, works surprisingly well on repetitive text (code, documents with boilerplate), and is used in production by some serving frameworks as a cheap fallback draft strategy.
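A minimal sketch of an n-gram draft, assuming tokens are plain list items and taking the most recent earlier match. ngram_draft is a hypothetical helper name, not a function from any serving framework. The proposals still go through the normal target verification step.

```python
def ngram_draft(context, m, k):
    """Propose up to k tokens by matching the last m tokens earlier in the context.

    Returns an empty list when the pattern has no earlier occurrence.
    """
    if len(context) <= m:
        return []
    pattern = context[-m:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(context) - m - 1, -1, -1):
        if context[start:start + m] == pattern:
            return context[start + m:start + m + k]
    return []


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b",
          "def", "add", "("]

print(ngram_draft(tokens, m=3, k=4))            # match found earlier in the context
print(ngram_draft(["x", "y", "z"], m=2, k=4))   # no earlier match
```

In the first call the last three tokens also appear at the start of the context, so the four tokens that followed them there become the draft. A linear scan is fine for a sketch; a production implementation would likely index n-grams as they arrive.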

Lookahead decoding generalizes n-gram drafts with a more sophisticated parallel Jacobi iteration approach, generating multiple candidate continuations in parallel without a separate draft model.

EAGLE is the strongest of the self-drafting family and the current default in many serving stacks. Instead of drafting at the token level, it drafts one layer up — at the hidden-feature level. A tiny autoregressive head predicts the target model’s next hidden state (not just the next token), conditioned on both the previous features and the already-sampled token, then reuses the target’s own output layer to turn those predicted features into draft tokens. Predicting features is an easier, more regular problem than predicting tokens, so acceptance rates climb — EAGLE-2/3 push effective tokens-per-pass well past a vanilla draft model while needing only one model checkpoint. Same exact-output guarantee; it only changes how the draft is produced.

Quiz

Quick check

Q1. A speculative decoding system uses k = 6 draft tokens per round and achieves an average acceptance rate of 0.5. How many tokens does it produce per target forward pass on average?

Q2. After deploying speculative decoding, a team notices the output of their chatbot has changed: responses that used to end with 'I hope that helps!' now sometimes end differently. What is the most likely cause?

Q3. A startup wants to deploy speculative decoding for a specialized chemistry LLM (70B parameters, fine-tuned on chemical literature). They plan to use a popular general-purpose 1B model as the draft. A colleague suggests using a 1B model fine-tuned on the same chemistry corpus instead. Who is right, and why?

Quantization — shrink the target model’s memory footprint so speculative decoding fits on smaller hardware.
