Constrained decoding
How structured generation actually guarantees valid JSON — by masking illegal tokens at each decode step. The FSM/grammar trick behind XGrammar and the JSON-mode in your API.
What you'll learn
- Why 'just ask for JSON' is a hope and constrained decoding is a guarantee
- How token masking against a grammar/FSM enforces structure at each step
- Why modern engines (XGrammar, llguidance) make it near-zero overhead
Before you start
The structured outputs lesson showed you the what — asking a model for JSON that matches a schema. This lesson is the how, and it’s more clever than most people realize. When an API guarantees “valid JSON every time,” it isn’t trusting the model to behave. It’s making invalid output impossible at the sampling step. That mechanism is constrained decoding.
Prompting is a hope; constraining is a guarantee
Ask a model “respond in JSON” and most of the time it complies — but “most of the
time” is a production nightmare. One stray token (a Python-style True, a
trailing comma, a missing quote) and your parser throws. Constrained decoding
removes the gamble entirely.
The idea: at every decode step, a grammar (or a finite-state machine compiled from your JSON schema) knows exactly which tokens are legal given what’s been emitted so far. Before sampling, it masks every illegal token, which drives that token’s probability to zero. The model literally cannot pick a token that would break the structure.
Take a step where a JSON value is expected and the model’s favorite token is invalid. Suppose the model wants Python-style True, which would break the JSON. The grammar masks it, and the valid true wins instead. No retries, no parser errors: structure is guaranteed by construction.
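To make the single-step masking concrete, here is a minimal sketch in plain Python. The logits, the five-token vocabulary, and the `masked_softmax` helper are all made up for illustration; a real engine applies the same operation to a tensor covering the full vocabulary.

```python
import math

def masked_softmax(logits, legal):
    """Return next-token probabilities after masking illegal tokens to -inf."""
    if not any(tok in legal for tok in logits):
        raise ValueError("grammar allows no token from this vocabulary")
    masked = {tok: (score if tok in legal else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits at a position where a JSON value is expected.
# The model's top choice is the invalid Python-style "True".
logits = {"True": 4.0, "true": 3.0, "false": 1.0, "null": 0.5, "}": -1.0}

# Toy subset of tokens that may start a JSON value at this position.
legal = {"true", "false", "null"}

probs = masked_softmax(logits, legal)
best = max(probs, key=probs.get)  # highest-probability token after masking
```

After masking, `probs["True"]` is exactly 0.0 and `best` is `"true"`. The remaining probabilities still sum to 1, so sampling proceeds as usual over the legal tokens only.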
How the mask is computed
at each step:
1. model produces logits over the whole vocabulary
2. the grammar/FSM, given the tokens so far, returns the set of LEGAL next tokens
3. set the logits of all illegal tokens to -infinity (the mask)
4. sample from what remains
5. advance the FSM by the chosen token; repeat
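The five steps above can be sketched as a short loop. Everything here is an assumption made for illustration: the nine-token vocabulary, the hand-written `FSM` table (which accepts only `{"ok":true}` or `{"ok":false}`), and `toy_logits`, a random stand-in for a model that is deliberately biased toward the invalid token `True`.

```python
import json
import math
import random

VOCAB = ["{", "}", '"ok"', ":", ",", "true", "false", "True", "<eos>"]

# Toy token-level FSM: state -> {legal token: next state}.
FSM = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
    "done": {"<eos>": "end"},
}

def toy_logits(rng):
    """Stand-in for a model: random scores, biased toward the invalid 'True'."""
    logits = {tok: rng.uniform(-1.0, 1.0) for tok in VOCAB}
    logits["True"] += 5.0
    return logits

def sample(masked, rng):
    """Sample from softmax(masked); tokens at -inf can never be drawn."""
    candidates = [tok for tok, score in masked.items() if score != -math.inf]
    peak = max(masked[tok] for tok in candidates)
    weights = [math.exp(masked[tok] - peak) for tok in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def constrained_decode(seed=0, max_steps=16):
    rng = random.Random(seed)
    state, out = "start", []
    for _ in range(max_steps):
        logits = toy_logits(rng)                  # 1. logits over the vocabulary
        legal = FSM[state]                        # 2. legal next tokens
        masked = {tok: (score if tok in legal else -math.inf)
                  for tok, score in logits.items()}  # 3. mask illegal tokens
        token = sample(masked, rng)               # 4. sample from what remains
        state = legal[token]                      # 5. advance the FSM
        if token == "<eos>":
            break
        out.append(token)
    return "".join(out)

text = constrained_decode(seed=0)
parsed = json.loads(text)  # parses cleanly despite the bias toward "True"
```

The toy model prefers `True` at every step, yet the output always parses, because the mask removes that token before sampling ever sees it.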
The grammar can be a JSON schema, a regular expression, or a full context-free grammar (for SQL, a programming language, a custom DSL). Anything you can express as a grammar, you can force the model to emit.
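One practical wrinkle is that grammars are defined over characters while models emit multi-character tokens. A common approach is to call a token legal when feeding all of its characters through the automaton never hits a dead end. The sketch below assumes a hand-built DFA for the regex `\d\d:\d\d` and a made-up vocabulary; the names `step`, `walk`, and `legal_tokens` are hypothetical, and a real engine would compile the automaton from the pattern rather than hard-code it.

```python
import re

DIGITS = "0123456789"
ACCEPT = 5  # state reached after matching the full pattern \d\d:\d\d

def step(state, ch):
    """Return the next DFA state, or None if `ch` is not allowed in `state`."""
    if state in (0, 1, 3, 4) and ch in DIGITS:
        return state + 1
    if state == 2 and ch == ":":
        return 3
    return None

def walk(state, token):
    """Feed a whole token through the DFA; None means the token is illegal."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

def legal_tokens(state, vocab):
    """Tokens whose every character keeps the DFA alive from `state`."""
    return {tok for tok in vocab if walk(state, tok) is not None}

# Made-up vocabulary with multi-character tokens.
VOCAB = ["1", "12", "12:", ":", ":3", "30", "123", "ab", "1a"]

at_start = legal_tokens(0, VOCAB)  # "123" is rejected: its third char must be ":"
at_colon = legal_tokens(2, VOCAB)

# Emitting "12:" then "30" lands in the accepting state.
final_state = walk(walk(0, "12:"), "30")
matches = re.fullmatch(r"\d\d:\d\d", "12:30") is not None
```

Because every live state of this linear DFA can still reach the accepting state, checking for dead ends is sufficient here. A general grammar would also need to rule out tokens that lead into states from which acceptance is impossible.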
Next
Constrained decoding is the reliability layer under tool calling and structured agent actions. To make sure the content (not just the shape) is right, pair it with LLM evals.
Practice this in an interview
Constrained decoding masks the model's next-token logits at each step so only tokens permitted by a grammar or JSON schema can be sampled, guaranteeing structurally valid output without changing the model's weights. It is how structured-output and function-calling features enforce schema conformance; placing reasoning fields before answer fields lets the model think before it commits.
Modern APIs offer constrained decoding — the model's token sampling is restricted to only produce tokens that are valid continuations of a JSON schema. Combined with Pydantic validation in application code, this eliminates the JSON-parsing errors that plagued earlier prompt-only approaches. When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.
The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.
An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.