What is constrained decoding and how does it guarantee structured outputs like valid JSON?

Constrained decoding masks the model's next-token logits at each step so only tokens permitted by a grammar or JSON schema can be sampled, guaranteeing structurally valid output without changing the model's weights. It is how structured-output and function-calling features enforce schema conformance; placing reasoning fields before answer fields lets the model think before it commits.
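The field-ordering point follows from left-to-right generation: whatever the schema places first is written first. Below is a minimal sketch with a hypothetical schema; the field names `reasoning` and `answer` are assumptions for illustration, and whether a given engine emits fields in schema order is implementation-specific.

```python
import json

# Hypothetical schema for illustration; the field names are assumptions.
# Properties are listed in the order we want them generated:
# free-form reasoning first, the committed answer last.
schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string", "enum": ["yes", "no"]},
    },
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}

# Python dicts preserve insertion order, so this is the order in the schema.
field_order = list(schema["properties"])
print(field_order)
print(json.dumps(schema, indent=2))
```

If the answer field came first, the model would have to commit to it before writing any of the reasoning tokens.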

How do you reliably get structured outputs (JSON, typed objects) from an LLM?

Modern APIs offer constrained decoding: the model's token sampling is restricted to tokens that are valid continuations under a JSON schema. Combined with Pydantic validation in application code, this removes most of the JSON-parsing errors that plagued earlier prompt-only approaches (a response cut off by the token limit can still be incomplete). When constrained decoding is unavailable, few-shot examples plus output parsing with retry is the fallback.
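The parse-and-retry fallback can be sketched with the standard library alone. Here `call_model` is a hypothetical stand-in for a real LLM call, not an actual API, and the key-and-type check is a simplified substitute for full schema validation.

```python
import json

def call_model(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM call; not a real API.
    The first reply is deliberately malformed (Python-style True)."""
    replies = ['{"approved": True}', '{"approved": true}']
    return replies[min(attempt, len(replies) - 1)]

def parse_with_retry(prompt: str, required: dict, max_attempts: int = 3) -> dict:
    """Parse the reply as JSON, check required keys and types, retry on failure."""
    last_error = None
    for attempt in range(max_attempts):
        raw = call_model(prompt, attempt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc          # malformed JSON: try again
            continue
        if isinstance(data, dict) and all(
            isinstance(data.get(key), typ) for key, typ in required.items()
        ):
            return data
        last_error = ValueError(f"schema mismatch: {raw!r}")
    raise RuntimeError(f"no valid reply after {max_attempts} attempts") from last_error

result = parse_with_retry("Is the request approved? Reply in JSON.", {"approved": bool})
print(result)
```

Each retry costs another model call, which is the overhead constrained decoding avoids.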

What prompt engineering techniques should every LLM practitioner know?

The core toolkit is: system prompts (role and constraints), few-shot examples (format and tone anchoring), chain-of-thought (step-by-step reasoning), and output constraints (JSON schema, stop sequences). Combining these predictably closes the gap between a capable base model and a production-ready feature.

How does an LLM generate text — what is next-token prediction and autoregression?

An LLM generates text one token at a time by computing a probability distribution over its entire vocabulary for the next token, sampling from that distribution, appending the result, and repeating — a process called autoregression. Each new token is conditioned on all previously generated tokens, so the output at step N is only as good as the choices made at steps 1 through N-1.
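The loop described above (score, sample, append, repeat) can be sketched with a toy stand-in for the model. The `NEXT_LOGITS` table is made up for illustration and looks only at the previous token, whereas a real LLM conditions on the whole prefix.

```python
import math
import random

# Toy "model": hypothetical hand-written next-token scores keyed by the
# previous token only. A real LLM conditions on the entire prefix.
NEXT_LOGITS = {
    "<s>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "a": {"cat": 1.0, "dog": 1.0},
    "cat": {"sat": 2.0, "</s>": 0.5},
    "dog": {"sat": 1.0, "</s>": 1.0},
    "sat": {"</s>": 3.0},
}

def softmax(logits: dict) -> dict:
    """Turn raw scores into a probability distribution."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(rng: random.Random, max_tokens: int = 10) -> list:
    tokens = ["<s>"]
    for _ in range(max_tokens):
        probs = softmax(NEXT_LOGITS[tokens[-1]])     # distribution over the next token
        choices, weights = zip(*probs.items())
        next_token = rng.choices(choices, weights=weights)[0]  # sample one token
        tokens.append(next_token)                    # append it to the sequence
        if next_token == "</s>":                     # stop at end-of-sequence
            break
    return tokens

sequence = generate(random.Random(0))
print(" ".join(sequence))
```

Constrained decoding changes only the sampling line: illegal tokens are removed from the distribution before a token is drawn.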

Constrained decoding — Generative AI

The structured outputs lesson showed you the what — asking a model for JSON that matches a schema. This lesson is the how, and it’s more clever than most people realize. When an API guarantees “valid JSON every time,” it isn’t trusting the model to behave. It’s making invalid output impossible at the sampling step. That mechanism is constrained decoding.

Prompting is a hope; constraining is a guarantee

Ask a model “respond in JSON” and most of the time it complies — but “most of the time” is a production nightmare. One stray token (a Python-style True, a trailing comma, a missing quote) and your parser throws. Constrained decoding removes the gamble entirely.

The idea: at every decode step, a grammar (or a finite-state machine compiled from your JSON schema) knows exactly which tokens are legal given what’s been emitted so far. Before sampling, it sets every illegal token’s probability to zero. The model literally cannot pick a token that would break the structure. The grammar acts as a small state machine that advances one token at a time, and the value position (the slot right after a key and its colon) is where it bites.

That value step is the whole lesson: the model may want True (Python style), which would break the JSON. The grammar masks it, and the valid true wins. No retries, no parser errors — structure guaranteed by construction.

How the mask is computed

at each step:
  1. model produces logits over the whole vocabulary
  2. the grammar/FSM, given the tokens so far, returns the set of LEGAL next tokens
  3. set the logits of all illegal tokens to -infinity  (the mask)
  4. sample from what remains
  5. advance the FSM by the chosen token; repeat
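The five steps above can be sketched end to end for a tiny language. Everything here is an assumption for illustration: the vocabulary, the hand-written FSM for `{"ok": true|false}`, and `fake_logits`, which stands in for a model forward pass and deliberately prefers the invalid `True` at the value position.

```python
import json
import math

VOCAB = ["{", '"ok"', ":", "true", "false", "True", ",", "}"]

# Hypothetical token-level FSM: state -> {legal token: next state}.
FSM = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}

def fake_logits(state: str) -> list:
    """Stand-in for a model forward pass; the scores are made up."""
    preferred = {"start": "{", "key": '"ok"', "colon": ":", "value": "True", "close": "}"}
    scores = []
    for tok in VOCAB:
        if tok == preferred[state]:
            scores.append(2.5)
        elif tok == "true":
            scores.append(1.0)
        else:
            scores.append(0.0)
    return scores

def constrained_decode() -> str:
    state, out = "start", []
    while state != "done":
        logits = fake_logits(state)                   # 1. logits over the vocabulary
        legal = FSM[state]                            # 2. legal next tokens
        masked = [x if tok in legal else -math.inf    # 3. mask the illegal tokens
                  for tok, x in zip(VOCAB, logits)]
        best = max(range(len(VOCAB)), key=masked.__getitem__)  # 4. greedy pick
        token = VOCAB[best]
        out.append(token)
        state = legal[token]                          # 5. advance the FSM
    return "".join(out)

text = constrained_decode()
print(text)              # {"ok":true}
print(json.loads(text))  # parses without error
```

Step 4 uses a greedy pick to keep the run deterministic; sampling from the renormalised masked distribution works the same way, because masked tokens have zero probability.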

The grammar can be a JSON schema, a regular expression, or a full context-free grammar (for SQL, a programming language, a custom DSL). Anything you can express as a grammar, you can force the model to emit.
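One practical wrinkle with regex or grammar constraints is that model tokens are often several characters long, so a token is legal only if every one of its characters can be consumed from the current state. The sketch below assumes a hand-built character-level DFA for the regex `[0-9]{2}:[0-9]{2}` and a made-up vocabulary; a real engine would compile the automaton from the pattern instead.

```python
DIGITS = "0123456789"

def step(state: int, ch: str):
    """Character-level DFA for [0-9]{2}:[0-9]{2}; state = characters consumed.
    Returns the next state, or None if ch is not allowed here."""
    if state in (0, 1, 3, 4) and ch in DIGITS:
        return state + 1
    if state == 2 and ch == ":":
        return state + 1
    return None

def advance(state: int, token: str):
    """Feed a whole (possibly multi-character) token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

# Hypothetical vocabulary containing multi-character tokens.
VOCAB = ["1", "12", "12:", ":3", "30", "3", "0", ":", "ab", "123"]

def legal_tokens(state: int) -> list:
    """Tokens whose every character is accepted starting from this state."""
    return [tok for tok in VOCAB if advance(state, tok) is not None]

print(legal_tokens(0))  # ['1', '12', '12:', '30', '3', '0']
print(legal_tokens(2))  # [':3', ':']
```

The token `123` is rejected at the start even though it begins with legal digits, because its third character lands where a colon is required.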

import numpy as np

vocab = ["true", "True", '"yes"', "1", ",", "}"]
logits = np.array([2.1, 2.4, 1.0, 0.3, 0.5, 0.2])         # the model "prefers" True

# The grammar says: at a value position, only these tokens are legal.
legal = {"true", '"yes"', "1"}
mask = np.array([t in legal for t in vocab])

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

print("unconstrained pick:", vocab[logits.argmax()], "(invalid JSON!)")

masked = np.where(mask, logits, -np.inf)                  # illegal logits -> -inf
probs = softmax(masked)
print("constrained pick:  ", vocab[int(probs.argmax())], "(valid)")
print("masked probabilities:", {v: round(float(p), 3) for v, p in zip(vocab, probs)})

unconstrained pick: True (invalid JSON!)
constrained pick:   true (valid)
masked probabilities: {'true': 0.667, 'True': 0.0, '"yes"': 0.222, '1': 0.11, ',': 0.0, '}': 0.0}

The model’s favourite token was True (the highest logit, 2.4), and the mask set its probability to exactly 0.0. After renormalising over the legal tokens, the valid true (0.667) won cleanly. Same logits, a different and guaranteed-parseable result.

In one breath

Prompting for JSON is a hope; constrained decoding is a guarantee.
A grammar/FSM compiled from your schema knows the legal next tokens at every step.
Before sampling, illegal tokens’ logits are set to -infinity (masked), so invalid output is impossible by construction.
The grammar can be a JSON schema, a regex, or a full CFG (SQL, a DSL) — anything you can express as a grammar.
Modern engines such as XGrammar and llguidance precompute or cache much of the mask work, which keeps the overhead low inside serving stacks such as vLLM and TensorRT-LLM.
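The caching idea in the last point can be illustrated by memoising the mask per grammar state, so a state that recurs never rescans the vocabulary. This is only a sketch of the general idea with a made-up FSM and vocabulary; it is not how XGrammar or llguidance are implemented.

```python
from functools import lru_cache

VOCAB = ("{", '"ok"', ":", "true", "false", "True", ",", "}")

# Hypothetical token-level FSM: state -> set of legal tokens.
LEGAL = {
    "start": {"{"},
    "key": {'"ok"'},
    "colon": {":"},
    "value": {"true", "false"},
    "close": {"}"},
}

@lru_cache(maxsize=None)
def mask_for(state: str) -> tuple:
    """Boolean mask over VOCAB for one FSM state, computed once and then reused."""
    return tuple(tok in LEGAL[state] for tok in VOCAB)

# Visiting the same state again hits the cache instead of rescanning the vocabulary.
for state in ["value", "value", "value", "close"]:
    mask_for(state)

info = mask_for.cache_info()
print(info.hits, info.misses)  # 2 2
```

With a real vocabulary of tens of thousands of tokens, skipping that scan on repeated states is where the savings come from.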

Quick check

Q1. How does constrained decoding guarantee valid JSON?

Q2. In the value-position example, the model's highest-probability token was 'True'. What happened under the grammar?

Q3. Why isn't constrained decoding a big performance tradeoff in 2026?

Constrained decoding is the reliability layer under tool calling and structured agent actions. To make sure the content (not just the shape) is right, pair it with LLM evals.
