datarekha

Constrained decoding

How structured generation actually guarantees valid JSON — by masking illegal tokens at each decode step. The FSM/grammar trick behind XGrammar and the JSON-mode in your API.

8 min read Advanced Generative AI Lesson 6 of 33

What you'll learn

  • Why 'just ask for JSON' is a hope and constrained decoding is a guarantee
  • How token masking against a grammar/FSM enforces structure at each step
  • Why modern engines (XGrammar, llguidance) make it near-zero overhead

Before you start

The structured outputs lesson showed you the what — asking a model for JSON that matches a schema. This lesson is the how, and it’s more clever than most people realize. When an API guarantees “valid JSON every time,” it isn’t trusting the model to behave. It’s making invalid output impossible at the sampling step. That mechanism is constrained decoding.

Prompting is a hope; constraining is a guarantee

Ask a model “respond in JSON” and most of the time it complies — but “most of the time” is a production nightmare. One stray token (a Python-style True, a trailing comma, a missing quote) and your parser throws. Constrained decoding removes the gamble entirely.
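To make the failure mode concrete, here is a minimal sketch using Python's standard `json` module. The three broken strings are invented for illustration; each one differs from valid JSON by exactly one of the slips named above.

```python
import json

# Near-miss outputs: each differs from valid JSON by a single small slip.
candidates = {
    "python_style_bool": '{"ok": True}',
    "trailing_comma": '{"ok": true,}',
    "missing_quote": '{"ok: true}',
    "valid": '{"ok": true}',
}

def parses(text: str) -> bool:
    """Return True if text is valid JSON, False if the parser rejects it."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

results = {name: parses(text) for name, text in candidates.items()}
for name, ok in results.items():
    print(f"{name}: {'parsed' if ok else 'rejected'}")
```

Only the last string parses. A strict parser has no notion of "almost valid", which is why a single wrong token is enough to fail the whole response.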

The idea: at every decode step, a grammar (or a finite-state machine compiled from your JSON schema) knows exactly which tokens are legal given what has been emitted so far. Before sampling, it masks every illegal token, driving its probability to zero. The model literally cannot pick a token that would break the structure. Consider the value position of a boolean field, where the model’s favorite token is invalid.

That value step is the whole lesson: the model wanted True, which would have broken the JSON. The grammar masked it, and the valid true won. No retries, no parser errors: structure guaranteed by construction.
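The value-position step can be worked through numerically. This is a sketch under stated assumptions: the token names and logit values are made up, and `masked_softmax` is a hypothetical helper, not any engine's API.

```python
import math

# Hypothetical logits at the value position, after the prefix {"ok":
# (token names and numbers are invented for illustration).
logits = {"True": 3.2, "true": 2.9, "false": 1.1, '"yes"': 0.4, "}": -1.0}

# Tokens the grammar allows for a boolean value at this position.
legal = {"true", "false"}

def masked_softmax(logits, legal):
    """Set illegal logits to -inf, then renormalize over what remains."""
    masked = {t: (v if t in legal else -math.inf) for t, v in logits.items()}
    peak = max(masked.values())
    exps = {t: math.exp(v - peak) for t, v in masked.items()}  # exp(-inf) == 0.0
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

probs = masked_softmax(logits, legal)
unconstrained_pick = max(logits, key=logits.get)  # what the raw model prefers
constrained_pick = max(probs, key=probs.get)      # what survives the mask
print(unconstrained_pick, "->", constrained_pick)
```

The masked tokens end up with probability exactly zero, and the remaining probability mass is redistributed over the legal tokens only, so sampling at any temperature still lands on a legal token.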

How the mask is computed

at each step:
  1. model produces logits over the whole vocabulary
  2. the grammar/FSM, given the tokens so far, returns the set of LEGAL next tokens
  3. set the logits of all illegal tokens to -infinity  (the mask)
  4. sample from what remains
  5. advance the FSM by the chosen token; repeat
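
The five steps above can be sketched as a loop. Everything here is a toy assumption: an eight-token vocabulary, a hand-written FSM for the single object `{"ok": <bool>}`, and a `fake_logits` stand-in that always prefers the illegal token `True`. Step 4 uses a greedy pick instead of sampling so the result is deterministic.

```python
import json
import math

VOCAB = ["{", '"ok"', ":", "true", "false", "True", ",", "}"]

# Toy FSM for the one-field object {"ok": <bool>}:
# state -> {legal token: next state}.
FSM = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
}
ACCEPT = 5

def fake_logits(tokens_so_far):
    """Hypothetical stand-in for a model: the same logits at every step,
    with the illegal token "True" always scoring highest."""
    return [1.5, 1.0, 0.5, 2.0, 0.0, 3.0, 2.5, -0.5]

def constrained_decode(max_steps=16):
    state, out = 0, []
    for _ in range(max_steps):
        if state == ACCEPT:
            break
        logits = fake_logits(out)                    # 1. logits over the vocabulary
        legal = FSM[state]                           # 2. legal next tokens
        masked = [x if tok in legal else -math.inf   # 3. mask illegal tokens
                  for tok, x in zip(VOCAB, logits)]
        best = max(range(len(VOCAB)), key=masked.__getitem__)  # 4. greedy pick
        out.append(VOCAB[best])
        state = legal[VOCAB[best]]                   # 5. advance the FSM
    return "".join(out)

text = constrained_decode()
parsed = json.loads(text)
print(text)
```

Without the mask, this fake model would emit `True` at every step. With it, the only reachable output is a string the parser accepts.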

The grammar can be a JSON schema, a regular expression, or a full context-free grammar (for SQL, a programming language, a custom DSL). Anything you can express as a grammar, you can force the model to emit.
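
The step from a regular expression to a context-free grammar matters once the structure nests. A minimal sketch, assuming a hypothetical four-token vocabulary for nested arrays of zeros: the legal-token function needs a depth counter that can grow without bound, which a finite-state machine cannot track.

```python
# Legal tokens for each "what do we expect next" situation.
LEGAL = {
    "value": frozenset({"[", "0"}),
    "value_or_close": frozenset({"[", "0", "]"}),
    "comma_or_close": frozenset({",", "]"}),
    "done": frozenset(),
}

def legal_next(prefix):
    """Return the tokens that may follow prefix in a nested-array grammar:
    value := "0" | "[" "]" | "[" value ("," value)* "]"
    """
    depth, expect = 0, "value"
    for tok in prefix:
        if tok not in LEGAL[expect]:
            raise ValueError(f"illegal token {tok!r} in prefix")
        if tok == "[":
            depth += 1
            expect = "value_or_close"
        elif tok == ",":
            expect = "value"
        else:  # "0" or "]" both finish a value
            if tok == "]":
                depth -= 1
            expect = "done" if depth == 0 else "comma_or_close"
    return LEGAL[expect]

print(sorted(legal_next(["[", "[", "0"])))  # inside two open brackets
print(sorted(legal_next(["[", "0", ","])))  # after a comma, "]" is not legal
```

The same mask-and-sample loop applies unchanged. Only the component that answers "which tokens are legal now?" becomes more powerful.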

Quick check

Q1. How does constrained decoding guarantee valid JSON?
Q2. In the value-position example, the model's highest-probability token was 'True'. What happened under the grammar?
Q3. Why isn't constrained decoding a big performance tradeoff in 2026?

Next

Constrained decoding is the reliability layer under tool calling and structured agent actions. To make sure the content (not just the shape) is right, pair it with LLM evals.

Practice this in an interview

What is constrained decoding and how does it guarantee structured outputs like valid JSON?

Constrained decoding masks the model's next-token logits at each step so only tokens permitted by a grammar or JSON schema can be sampled, guaranteeing structurally valid output without changing the model's weights. It is how structured-output and function-calling features enforce schema conformance. Because tokens are generated left to right, placing reasoning fields before answer fields in the schema lets the model think before it commits.

How do you reliably get structured outputs (JSON, typed objects) from an LLM?

Many modern APIs offer constrained decoding: token sampling is restricted to tokens that keep the output valid under a JSON schema. Combined with Pydantic validation in application code, this eliminates the JSON-parsing errors that plagued earlier prompt-only approaches. When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.
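
The prompt-only fallback can be sketched as a parse-and-retry loop. This is an illustration under assumptions, not a library API: `generate_json` and `call_model` are hypothetical names, and `flaky_model` is a stub that fails once and then succeeds.

```python
import json
from typing import Callable

def generate_json(call_model: Callable[[str], str], prompt: str,
                  required_keys: set, max_attempts: int = 3) -> dict:
    """Parse the reply, check required keys, and retry with error feedback."""
    last_error = ""
    for _ in range(max_attempts):
        full_prompt = prompt
        if last_error:
            full_prompt += (f"\n\nYour previous reply was rejected: {last_error}."
                            "\nReturn only valid JSON.")
        raw = call_model(full_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON ({exc.msg})"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value must be an object"
            continue
        missing = required_keys - set(data)
        if missing:
            last_error = f"missing keys {sorted(missing)}"
            continue
        return data
    raise ValueError(f"no valid reply after {max_attempts} attempts: {last_error}")

# Stub model: first reply uses a Python-style True, second reply is valid.
replies = iter(['{"ok": True}', '{"ok": true}'])
prompts_seen = []

def flaky_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)

result = generate_json(flaky_model, "Is the service healthy? Reply as JSON.", {"ok"})
print(result, "after", len(prompts_seen), "attempts")
```

Each retry costs a full extra model call and can still exhaust its attempts. Constrained decoding avoids that cost by preventing the invalid token in the first place.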

What prompt engineering techniques should every LLM practitioner know?

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.

How does an LLM generate text — what is next-token prediction and autoregression?

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
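
The generate, sample, append, repeat loop can be shown with a toy stand-in. The bigram table below is a hypothetical example and a deliberate simplification: it conditions only on the last token, whereas a real LLM conditions on every previous token.

```python
import random

# Hypothetical bigram "model": next-token probabilities given the last token.
BIGRAM = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.4, "dog": 0.6},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "dog": {"sat": 0.6, "</s>": 0.4},
    "sat": {"</s>": 1.0},
}

def generate(rng: random.Random, max_tokens: int = 10) -> list:
    tokens = ["<s>"]
    for _ in range(max_tokens):
        dist = BIGRAM[tokens[-1]]                  # distribution for the next token
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]  # sample
        tokens.append(nxt)                         # append, then repeat
        if nxt == "</s>":                          # stop at end-of-sequence
            break
    return tokens

sample = generate(random.Random(0))
print(" ".join(sample))
```

Constrained decoding slots into this loop between computing the distribution and sampling from it.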
