Structured outputs engineering: JSON mode, function calling, constrained decoding
Three families of techniques get LLMs to return parseable data. One of them is a guarantee. The other two are negotiations. Here's when each one earns its keep — and the production regression that nobody warns you about.
There is a peculiar production failure mode that every team building extraction pipelines runs into eventually. The validation dashboard shows 100% schema conformance — every record is a clean JSON object, every required field is present, every enum value is in range. The pipeline is “fixed.” Then a domain expert looks at the actual outputs and finds that 8% of the records are coherently wrong: the model picked a real-looking value to satisfy the schema instead of refusing or hedging.
This is what nobody tells you about constrained decoding. The technique that moved teams from 92% to 100% schema conformance is the same technique that turned 8% of refusals into 8% of confident hallucinations.
That’s the whole story of structured outputs in one paragraph — and the rest of this post is how to navigate it. Three families of techniques, three different theories of when they win, and one practical rule for picking.
The three families, briefly
The simplest framing: each family makes a different promise. Prompted JSON promises nothing — the model usually returns valid JSON because it’s seen a lot of it. Function calling promises that the model has been trained to return well-formed structured calls, but the parse rate is empirical, not guaranteed. Constrained decoding promises that the token stream literally cannot diverge from your grammar, because the decoder masks every token that would.
Family 1: prompted JSON
This is what every team starts with. The system prompt says “respond in JSON,” maybe with a few-shot example, and the application parses the response. It works most of the time. On gpt-4 and claude-3.5-sonnet with a clean schema it hits something like a 92–95% parse rate. Add nesting, add enums, add a long tail of edge cases, and that number drops.
The standard rescue is a retry loop with the validation error fed back in: “the JSON you returned failed validation because field priority was not one of low, medium, high. Try again.” This is the entire premise of libraries like Instructor and the original LangChain output parsers. Two retries get you to ~99% in practice; three retries are essentially indistinguishable from constrained decoding on parse rate alone.
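A minimal sketch of that retry loop, using only the standard library. The names `validate`, `extract_with_retries`, and `call_model` are hypothetical, as is the two-field ticket schema; `call_model` stands in for whatever client function sends a prompt and returns the raw text. Libraries like Instructor do considerably more than this, so treat it as the shape of the idea rather than their implementation.

```python
import json
from typing import Callable

ALLOWED_PRIORITIES = {"low", "medium", "high"}


def validate(payload: object) -> list:
    """Return human-readable validation errors (empty list if valid)."""
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    errors = []
    if payload.get("priority") not in ALLOWED_PRIORITIES:
        errors.append("field priority was not one of low, medium, high")
    if not isinstance(payload.get("title"), str):
        errors.append("field title must be a string")
    return errors


def extract_with_retries(
    call_model: Callable[[str], str], prompt: str, max_retries: int = 2
) -> dict:
    """Call the model, validate, and feed errors back until valid or out of retries."""
    message = prompt
    errors = []
    for _ in range(max_retries + 1):
        raw = call_model(message)
        try:
            payload = json.loads(raw)
            errors = validate(payload)
        except json.JSONDecodeError as exc:
            errors = [f"the response was not valid JSON ({exc.msg})"]
        if not errors:
            return payload
        # Append the validation error so the next attempt can correct it.
        message = (
            f"{prompt}\n\nThe JSON you returned failed validation because "
            f"{'; '.join(errors)}. Try again."
        )
    raise ValueError(f"no valid output after {max_retries + 1} attempts: {errors}")
```

Every retry resends the full prompt plus the error message, which is why this approach gets expensive on long inputs.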
Where it lives: anywhere the cost of the occasional retry is acceptable and you don’t want to deal with provider-specific schemas. Most internal batch-processing pipelines start (and stay) here.
Where it falls over: when each retry is expensive (latency-sensitive paths, long-context summarisation where the input itself is 20K tokens) or when the schema is deeply nested. Nested schemas blow up the failure surface — every nested object is another place the model can hallucinate a key.
Family 2: function calling
Around mid-2023 the major providers started fine-tuning models to emit structured tool calls natively. OpenAI’s function calling, Anthropic’s tool use, Gemini’s function declarations — they all share the same shape: you give the API a JSON schema describing a “function” the model can call, and the model returns a structured object that matches it. The model wasn’t prompted to do this; it was trained to do this.
Parse rates jump immediately. On a flat schema with reasonable field names, modern function-calling models hit ~99% conformance without any retry logic. The interesting failure modes are no longer “this isn’t valid JSON” but “this is valid JSON that’s missing a required field” or “this is valid JSON where the model invented a function name.”
The other reason teams adopt it: function calling is the API surface for
tool use. Once your model is calling structured tools, the natural way to
return structured data is to give it one tool — submit_extraction — and
let the model fill the arguments. Two birds, one schema.
The catch is portability. OpenAI’s function-calling schema, Anthropic’s tool schema, and Gemini’s function declaration are not the same JSON. Libraries like Pydantic AI and Instructor paper over this by accepting Pydantic models and generating the right shape for whichever provider you point them at. Without that abstraction, switching providers is a refactor.
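To make the portability problem concrete, here is a rough sketch of one JSON schema wrapped in two provider shapes. The envelope keys reflect my understanding of the OpenAI Chat Completions and Anthropic Messages tool formats at the time of writing; check the current API references before relying on them. The helper names and the loan schema are hypothetical, and Gemini is left out because its declaration format uses its own schema dialect.

```python
# Hypothetical extraction schema shared by both providers.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "applicant_name": {"type": "string"},
        "income": {"type": ["number", "null"]},
    },
    "required": ["applicant_name", "income"],
}


def to_openai_tool(name: str, description: str, schema: dict) -> dict:
    """OpenAI Chat Completions style: schema nested under function.parameters."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": schema,
        },
    }


def to_anthropic_tool(name: str, description: str, schema: dict) -> dict:
    """Anthropic Messages style: schema sits under input_schema."""
    return {"name": name, "description": description, "input_schema": schema}


openai_tool = to_openai_tool(
    "submit_extraction", "Submit the extracted fields.", EXTRACTION_SCHEMA
)
anthropic_tool = to_anthropic_tool(
    "submit_extraction", "Submit the extracted fields.", EXTRACTION_SCHEMA
)
```

The schema is identical in both cases; only the envelope differs. That is exactly the part abstraction libraries generate for you.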
Family 3: constrained decoding
This is the most interesting one because it changed in 2024–2025 from “an obscure technique in a few open-source libraries” to “the foundation of every serious provider’s structured outputs offering.”
The trick is: instead of asking the model to produce a valid output and hoping, you intercept the decoder. At every token, you compute which tokens would still keep the running output valid according to your grammar, and you mask out the rest. The model picks from the masked logits, which means the next token is always one that keeps the output schema-compliant. Repeat to EOS. The result is a token stream that, by construction, parses.
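The masking step is easier to see on a toy example. The sketch below assumes a made-up eight-token vocabulary, a “grammar” that is just a set of three valid outputs, and a fake `biased_logits` function standing in for the model. Real libraries compile the grammar into an automaton over the tokenizer’s vocabulary instead of scanning valid strings, so this shows the decode loop only.

```python
import math

EOS = "<eos>"
VOCAB = ['{"priority": "', "low", "medium", "high", "urgent", '"}', "Sure! ", EOS]
VALID_OUTPUTS = {
    '{"priority": "low"}',
    '{"priority": "medium"}',
    '{"priority": "high"}',
}


def allowed_tokens(prefix: str) -> set:
    """Tokens that keep the running output a prefix of some valid output."""
    allowed = set()
    for token in VOCAB:
        if token == EOS:
            # EOS is only legal once the output is complete.
            if prefix in VALID_OUTPUTS:
                allowed.add(token)
        elif any(v.startswith(prefix + token) for v in VALID_OUTPUTS):
            allowed.add(token)
    return allowed


def constrained_greedy_decode(logits_fn, max_steps: int = 16) -> str:
    """Greedy decoding where disallowed tokens are masked to -inf each step."""
    prefix = ""
    for _ in range(max_steps):
        logits = logits_fn(prefix)  # dict: token -> raw score
        allowed = allowed_tokens(prefix)
        masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
        token = max(masked, key=masked.get)
        if token == EOS:
            return prefix
        prefix += token
    raise RuntimeError("hit max_steps before EOS")


def biased_logits(prefix: str) -> dict:
    """Fake model that would rather chat, and prefers the off-schema 'urgent'."""
    scores = {t: 0.0 for t in VOCAB}
    scores.update({"Sure! ": 5.0, "urgent": 4.0, "high": 3.0})
    scores.update({'{"priority": "': 1.0, '"}': 1.0})
    return scores


print(constrained_greedy_decode(biased_logits))  # {"priority": "high"}
```

The fake model’s top choices were “Sure! ” and “urgent”. The mask silently redirected it to “high”. The output parses, but it is not what the model wanted to say.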
The open-source landscape clusters around four libraries:
- Outlines (~15K stars, by dottxt.ai) compiles your schema into a token-level DFA against the tokenizer’s vocabulary. Microseconds of decode overhead once compiled. Works against any model where you have logit access.
- Guidance (and its sibling llguidance) takes a similar token-masking approach with a more programmatic API. OpenAI publicly credited llguidance for foundational work behind their Structured Outputs in 2025.
- XGrammar uses a pushdown automaton for batched constrained decoding, which is what made it the default in vLLM. The Red Hat / vLLM team reported that XGrammar gives low time-per-output-token with effective grammar caching.
- llama.cpp grammars (GBNF — GGML BNF) is what local-model users hit first. The llama.cpp server can convert a JSON schema into GBNF on the fly and serve OpenAI-compatible response_format requests.
vLLM’s structured outputs ties all of this together: you can pick XGrammar (default), Outlines, or Guidance as the backend; if XGrammar can’t handle a grammar feature, vLLM transparently falls back to Outlines. That’s the kind of plumbing that turns “research curiosity” into “the way you deploy”.
OpenAI’s Structured Outputs (released August 2024, expanded through 2025) and Anthropic’s strict tool use (rolled out late 2025) both sit on top of constrained decoding internally. OpenAI compiles your JSON schema into a context-free grammar and applies it at decode time, hitting “100% schema conformance” as a hard guarantee on gpt-4o-2024-08-06 and later. Anthropic does the same with a 100–300ms upfront cost to compile the schema (then caches it for 24 hours), gated behind the anthropic-beta: structured-outputs-2025-11-13 header.
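For orientation, here is a sketch of what a strict request payload can look like on the OpenAI side. It assumes the Chat Completions `response_format` shape with `type: "json_schema"` and `strict: True`, and OpenAI’s documented requirement that strict schemas list every property as required and set `additionalProperties` to false. The helpers `make_strict` and `as_response_format` are hypothetical, and the sketch ignores union object types and `$ref`.

```python
import copy


def _tighten(node) -> None:
    """Recursively mark every object schema as closed with all fields required."""
    if isinstance(node, dict):
        if node.get("type") == "object" and "properties" in node:
            node["required"] = list(node["properties"])
            node["additionalProperties"] = False
        for value in node.values():
            _tighten(value)
    elif isinstance(node, list):
        for item in node:
            _tighten(item)


def make_strict(schema: dict) -> dict:
    """Return a tightened copy; the input schema is left untouched."""
    out = copy.deepcopy(schema)
    _tighten(out)
    return out


def as_response_format(name: str, schema: dict) -> dict:
    """Wrap a schema in the (assumed) Chat Completions response_format envelope."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": make_strict(schema)},
    }


loan_schema = {
    "type": "object",
    "properties": {
        "applicant": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
        "income": {"type": ["number", "null"]},
    },
}
payload = as_response_format("loan_extraction", loan_schema)
```

Note the side effect: once every field is required, the only way to express “missing” is a nullable type.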
The production regression nobody warns you about
Now the part of the post that justifies its existence. The JSONSchemaBench paper from January 2025 benchmarks six constrained-decoding frameworks against unconstrained generation and finds a real, repeatable result: forcing a model into a schema can improve task accuracy (because the model isn’t wandering off-format), but it can also degrade it, and the degradation is worse for harder schemas. The BAML team has a whole post called Structured Outputs Create False Confidence arguing this is the default outcome on real-world data.
Here’s what that looks like in production. An extraction pipeline at a mid-sized fintech, taking PDF loan applications and pulling structured fields out of them. Three versions:
| Version | Technique | Parse rate | Field-level F1 | Confident wrong rate |
|---|---|---|---|---|
| v1 | Prompted JSON + retries | 92% | 0.87 | 3% |
| v2 | OpenAI function calling | 99% | 0.90 | 4% |
| v3 | OpenAI Structured Outputs | 100% | 0.89 | 8% |
The numbers are stylised but the shape of the regression is real and
shows up in Databricks’ own writeup
of their structured outputs rollout. The 100% conformance run had the
highest “confidently wrong” rate because the model could no longer refuse.
When the borrower’s income wasn’t on the document, the prompted version would
emit {"income": null, "confidence": "low"} (or just decline to answer); the
constrained version would emit {"income": 75000} because the grammar said
income had to be a number and the FSM couldn’t get to EOS without one.
Two things help; neither is a silver bullet:
- Make null or "unknown" part of every field’s schema. Constrained decoding can only refuse if refusing is on the menu. Optional[int] is not a Python type hint; it’s a production safety feature.
- Pair structured outputs with a self-consistency check. Run the same extraction unconstrained alongside the constrained version, and flag rows where the two disagree by more than a token. The disagreement set is your audit queue.
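Both mitigations can be sketched in a few lines. The schema and the helpers `disagreeing_fields` and `audit_queue` are hypothetical, and the comparison is a simplification: it uses exact field equality where the text suggests a token-level tolerance, so a real pipeline would want fuzzier matching for free-text fields.

```python
from typing import Optional

# Hypothetical loan-application schema: income may be a number or null,
# so "not on the document" is a legal output under the grammar.
LOAN_SCHEMA = {
    "type": "object",
    "properties": {
        "applicant_name": {"type": "string"},
        "income": {"type": ["number", "null"]},
    },
    "required": ["applicant_name", "income"],
    "additionalProperties": False,
}


def disagreeing_fields(constrained: dict, unconstrained: Optional[dict]) -> list:
    """Field names whose values differ between the two runs.

    A missing key and an explicit None are treated as the same value.
    If the unconstrained run produced nothing parseable, flag every field.
    """
    if unconstrained is None:
        return sorted(constrained)
    keys = set(constrained) | set(unconstrained)
    return sorted(k for k in keys if constrained.get(k) != unconstrained.get(k))


def audit_queue(constrained_rows: dict, unconstrained_rows: dict) -> dict:
    """Map row id -> disagreeing fields, for rows where the two runs differ."""
    queue = {}
    for row_id, record in constrained_rows.items():
        fields = disagreeing_fields(record, unconstrained_rows.get(row_id))
        if fields:
            queue[row_id] = fields
    return queue


constrained_rows = {
    "app-1": {"applicant_name": "A. Rivera", "income": 75000},
    "app-2": {"applicant_name": "B. Chen", "income": 52000},
}
unconstrained_rows = {
    "app-1": {"applicant_name": "A. Rivera", "income": None},
    "app-2": {"applicant_name": "B. Chen", "income": 52000},
}
print(audit_queue(constrained_rows, unconstrained_rows))  # {'app-1': ['income']}
```

The second run doubles inference cost, so sampling a fraction of rows is a reasonable compromise when volume is high.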
Anthropic’s own advanced tool use
docs hint at this with the recommendation to combine tool_choice: any and
strict: true — but the same docs warn that strict mode “may reduce response
quality on tasks where the model would benefit from explaining its
reasoning.” That’s the regression, in vendor language.
Where the open-source race actually settled
A short field guide to the four libraries you’ll see in modern serving stacks, because the names blur together if you’ve not been following:
The convergence story is more interesting than any individual library. The JSONSchemaBench paper benchmarks all four against unconstrained generation; the spread between the best and worst is now within a few percentage points on parse rate and within milliseconds on TPOT (time per output token). What used to be a meaningful “which library is fastest?” question is now mostly settled — the choice is determined by your serving stack.
A few specific notes that come up enough to be worth saying out loud:
- XGrammar’s PDA approach scales better with batch size than DFA-based approaches like Outlines. If you’re serving at high concurrency on vLLM, XGrammar is the default for a reason.
- Outlines covers more of the JSON Schema spec — anything XGrammar can’t parse falls back to Outlines automatically in vLLM. The redundancy is intentional and you should leave it on.
- GBNF is for the local-llama crowd. If you’re running on llama.cpp for cost or privacy reasons, GBNF grammars are surprisingly powerful — you can constrain the model to emit valid SQL, valid Python, or domain-specific languages your application cares about. The llama.cpp grammars README is a short and worthwhile read.
- llguidance is what OpenAI quietly credited in mid-2025 for the Structured Outputs implementation. The lineage matters because it tells you the technique has been hardened by frontier-lab usage, not just open-source enthusiasts.
A practical implication for serving teams: when you upgrade vLLM, your backend may switch from XGrammar to Outlines (or back) for a specific schema based on what’s supported. The user-facing API doesn’t change. What does change, sometimes, is performance — and that’s worth a benchmark in your CI.
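One way such a CI check could look, as a sketch. The function names and the 20% tolerance are assumptions. `generate` stands in for any callable that streams tokens from your server, and TPOT is computed here as the average gap between tokens after the first one.

```python
import time
from typing import Callable, Iterable


def time_per_output_token_ms(generate: Callable[[], Iterable[str]]) -> float:
    """Average wall-clock milliseconds per token, excluding time to first token."""
    stamps = []
    for _ in generate():
        stamps.append(time.perf_counter())
    if len(stamps) < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (stamps[-1] - stamps[0]) * 1000 / (len(stamps) - 1)


def regressed(baseline_ms: float, measured_ms: float, tolerance: float = 0.20) -> bool:
    """True if the measured TPOT exceeds the baseline by more than the tolerance."""
    return measured_ms > baseline_ms * (1 + tolerance)


def fake_stream():
    """Stand-in for a streaming call to a constrained-decoding endpoint."""
    for token in ["{", '"income"', ":", "null", "}"]:
        time.sleep(0.001)
        yield token


measured = time_per_output_token_ms(fake_stream)
print(f"TPOT: {measured:.2f} ms, regressed: {regressed(baseline_ms=5.0, measured_ms=measured)}")
```

Run it per schema, not per model. The backend fallback is decided schema by schema, so an aggregate number can hide the one grammar that moved.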
Picking the right family
The decision tree, distilled from watching this rollout across a couple of dozen teams:
- If your schema is flat, your latency budget is tight, and your task is classification or simple extraction → constrained decoding via the provider’s Structured Outputs feature. The 100% conformance is real, the quality regression is small on flat schemas, and you don’t pay for retries.
- If the model has to think before producing the output (chain of thought, multi-step reasoning, tool use that interleaves with the answer) → function calling, not strict-mode constrained decoding. Let the model reason in free text, then call a final submit_answer tool.
- If you’re running open-weight models on your own GPUs → vLLM with XGrammar (or Outlines if your schema uses features XGrammar doesn’t support), or llama.cpp with GBNF if you’re CPU-bound. The performance overhead is now near-zero because the grammar caches across requests with the same schema.
- If you want to switch providers without rewriting → a library like Pydantic AI or Instructor that owns the schema-to-vendor translation. Pay the abstraction tax up front to avoid the rewrite tax later.
The one rule above all the others: use the strictest mechanism that still lets the model refuse or hedge when it should. A 92% parse rate with a 1% confident-wrong rate is, for most production pipelines, a better operating point than a 100% parse rate with an 8% confident-wrong rate.
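The arithmetic behind that claim, on a hypothetical batch of 10,000 records. This assumes both rates are fractions of all records, which is my reading and not something the numbers above spell out. `failure_breakdown` is an illustrative helper.

```python
def failure_breakdown(n_records: int, parse_rate: float, confident_wrong_rate: float) -> dict:
    """Split failures into visible ones (parse errors) and silent ones (wrong but valid)."""
    return {
        "visible_failures": round(n_records * (1 - parse_rate)),
        "silent_failures": round(n_records * confident_wrong_rate),
    }


loose = failure_breakdown(10_000, parse_rate=0.92, confident_wrong_rate=0.01)
strict = failure_breakdown(10_000, parse_rate=1.00, confident_wrong_rate=0.08)

print(loose)   # {'visible_failures': 800, 'silent_failures': 100}
print(strict)  # {'visible_failures': 0, 'silent_failures': 800}
```

The 800 visible failures can be retried or routed to a human because the pipeline knows about them. The 800 silent ones flow straight into downstream systems.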
What to take away
Structured outputs went from a research problem to a solved-enough infrastructure piece in about eighteen months. The interesting questions have moved up the stack.
- Parse rate is no longer the right metric. Every serious provider offers 100% conformance now. What matters is whether the model can express uncertainty inside the schema you gave it.
- The right schema is one that lets the model refuse. null, “unknown”, and explicit confidence fields aren’t UX niceties; they’re how you stop constrained decoding from manufacturing answers.
- Constrained decoding has won the open-source race, then quietly moved under every provider’s API. XGrammar inside vLLM, llguidance inside OpenAI, Anthropic’s strict tool mode: they’re all the same shape of trick. Knowing it’s there is more important than knowing which library.
The teams that ship reliable extraction in 2026 aren’t the ones with the fanciest grammar. They’re the ones who shipped the strictest mechanism their data could survive, and built the audit queue for the rows that didn’t quite.
Further reading: OpenAI’s Structured Outputs announcement, Anthropic’s advanced tool use post, the vLLM structured outputs docs, and the JSONSchemaBench paper for the empirical study of where each technique wins and loses.