
Structured Outputs & Constrained Decoding

> Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.

Type: Build

Languages: Python

Prerequisites: Phase 5 · 17 (Chatbots), Phase 5 · 19 (Subword Tokenization)

Time: ~60 minutes

The Problem

A classifier prompts an LLM: "Return one of {positive, negative, neutral}." The model returns "The sentiment is positive — this review is overwhelmingly favorable because the customer explicitly states that they ...". Your parser crashes. Your classifier's F1 is 0.0.

Free-form generation is not a contract. It is a suggestion. A production system needs a contract.

Three layers exist in 2026.

Prompting. Ask nicely. "Return only the JSON object." Works roughly 80% of the time on frontier models, less often on smaller ones.
Native structured output APIs. OpenAI response_format, Anthropic tool use, Gemini JSON mode. Reliable on supported schemas. Vendor-locked.
Constrained decoding. Modify the logits at every generation step so the model *cannot* emit invalid tokens. 100% valid by construction. Works on any model whose logits you control, which in practice means local or self-hosted models.

This lesson builds intuition for all three and names when to reach for which.

The Concept

[Figure: constrained decoding masking invalid tokens at each step]

How constrained decoding works. At each generation step, the LLM produces a logit vector over the full vocabulary (~100k tokens). A *logit processor* sits between the model and the sampler. It computes which tokens are valid given the current position in the target grammar — JSON Schema, regex, context-free grammar — and sets the logits of all invalid tokens to negative infinity. The softmax over the remaining logits puts probability mass only on valid continuations.
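The masking step described above can be sketched in a few lines of plain Python. The five-token vocabulary and the logit values are made up for illustration; a real model has a vocabulary of roughly 100k tokens and the valid set would come from a grammar engine rather than a hard-coded list.

```python
import math


def mask_logits(logits, valid_ids):
    """Keep the logits of valid token ids; set everything else to -inf."""
    valid = set(valid_ids)
    return [x if i in valid else float("-inf") for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary (hypothetical): the model prefers a chatty preamble ("The").
vocab = ["The", "positive", "negative", "neutral", "{"]
logits = [4.0, 2.0, 0.5, 1.0, 0.1]

# Suppose the grammar only allows one of the three labels at this step.
valid_ids = [1, 2, 3]
probs = softmax(mask_logits(logits, valid_ids))

for tok, p in zip(vocab, probs):
    print(f"{tok:>9}: {p:.3f}")
```

The unmasked favorite, "The", ends up with probability exactly zero, and the remaining mass is renormalized over the three labels in the same relative order the model originally gave them.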

Implementations in 2026:

Outlines. Compiles JSON Schema or regex into a finite-state machine and precomputes, for each state, the set of valid next tokens, so each decoding step is an O(1) lookup. FSM-based, so recursive schemas need flattening.
XGrammar / llguidance. Context-free grammar engines. Handle recursive JSON Schema. Near-zero decoding overhead. OpenAI credited llguidance in their 2025 structured output implementation.
vLLM guided decoding. Built-in guided_json, guided_regex, guided_choice, guided_grammar via Outlines, XGrammar, or lm-format-enforcer backends.
Instructor. Pydantic-based wrapper over any LLM. Retries on validation failure. Cross-provider, but does not modify logits — it relies on retries + structured-output-aware prompts.
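To make the FSM idea concrete, here is a minimal sketch of precomputing a state-to-valid-tokens index. It is not Outlines' actual code: the DFA for the regex `\d\d-\d` is written by hand, and the vocabulary is a hypothetical eight-token toy. The point is that multi-character tokens are walked through the automaton once, ahead of time, so decoding only needs a dictionary lookup.

```python
DIGITS = set("0123456789")
ACCEPT = 4  # state reached after matching \d\d-\d in full


def step(state, ch):
    """Return the next DFA state, or None if `ch` is not allowed here."""
    if state in (0, 1, 3) and ch in DIGITS:
        return state + 1
    if state == 2 and ch == "-":
        return 3
    return None


def walk(state, token):
    """Feed a whole token through the DFA one character at a time."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state


def build_index(vocab):
    """Precompute {state: {token_id: next_state}} once, before decoding."""
    index = {}
    for state in range(ACCEPT):
        index[state] = {}
        for tid, tok in enumerate(vocab):
            nxt = walk(state, tok)
            if nxt is not None:
                index[state][tid] = nxt
    return index


vocab = ["1", "7", "-", "42", "7-", "-3", "abc", "4-2"]
index = build_index(vocab)

print("state 0:", sorted(vocab[t] for t in index[0]))
print("state 2:", sorted(vocab[t] for t in index[2]))
```

At decode time, `index[state]` is both the mask (its keys) and the transition function (its values), which is why the per-step cost stays constant regardless of how complex the pattern is.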

The counterintuitive result

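Constrained decoding can be *faster* than unconstrained generation. Two reasons. First, it shrinks the next-token search space, so the model cannot spend tokens on preambles or explanations the schema does not allow. Second, clever implementations skip token generation entirely for forced tokens (scaffolding like {"name": ", where every byte is determined). The masking itself adds some per-step work, which well-optimized engines keep small.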
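The forced-token shortcut can be illustrated with a toy decoder that counts how often it has to consult the model. The token-level FSM and the `pick_token` callback are hypothetical stand-ins; a real engine works on tokenizer ids and still has to feed the forced tokens through the model's cache, which is cheaper than sampling them one by one.

```python
import random


def decode(fsm, pick_token):
    """Walk an FSM; call the model only when more than one token is valid.

    fsm: {state: {token: next_state}}; states missing from the dict are accepting.
    pick_token: stand-in for a model forward pass plus sampling.
    """
    state, out, model_calls = "start", [], 0
    while state in fsm:
        options = fsm[state]
        if len(options) == 1:
            # Forced token: the grammar leaves no choice, so skip the model.
            token = next(iter(options))
        else:
            token = pick_token(sorted(options))
            model_calls += 1
        out.append(token)
        state = options[token]
    return "".join(out), model_calls


# Hypothetical token-level FSM for {"sentiment": "<label>"}
fsm = {
    "start": {'{"sentiment": "': "label"},
    "label": {"positive": "close", "negative": "close", "neutral": "close"},
    "close": {'"}': "done"},
}

text, calls = decode(fsm, pick_token=random.choice)
print(text, "| model calls:", calls)
```

Three tokens are emitted but only the label requires a decision, so the model is consulted once.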

The pitfall that costs you

Field order matters. Put answer before reasoning, and the model commits to an answer before it thinks. JSON is valid. Answer is wrong. No validation catches it.

// BAD
{"answer": "yes", "reasoning": "because ..."}

// GOOD
{"reasoning": "... therefore ...", "answer": "yes"}

Schema field order is logic, not formatting.
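Because the mistake is invisible to schema validation, it can help to lint the schema itself. The check below is a minimal sketch: the function name and the lists of "answer-like" and "reasoning-like" field names are assumptions for illustration, and it relies on Python dicts preserving insertion order, which is typically also the order in which a constrained decoder emits the keys.

```python
def premature_answer_fields(schema,
                            answer_names=("answer", "decision"),
                            reasoning_names=("reasoning", "rationale")):
    """Return answer-like fields that appear before any reasoning field."""
    seen_reasoning = False
    offenders = []
    for name in schema.get("properties", {}):
        if name in reasoning_names:
            seen_reasoning = True
        elif name in answer_names and not seen_reasoning:
            offenders.append(name)
    return offenders


bad = {"type": "object",
       "properties": {"answer": {"type": "string"},
                      "reasoning": {"type": "string"}}}
good = {"type": "object",
        "properties": {"reasoning": {"type": "string"},
                       "answer": {"type": "string"}}}

print(premature_answer_fields(bad))   # ['answer']
print(premature_answer_fields(good))  # []
```

A check like this can run in CI next to the schema definitions, so a reordered field is caught before it silently degrades answer quality.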

Build It

Step 1: regex-constrained generation from scratch

See code/main.py for a standalone FSM implementation. The core idea in 30 lines:

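import math
import random


def mask_logits(logits, valid_token_ids):
    # Start from all -inf, then restore the logits of valid tokens only.
    masked = [float("-inf")] * len(logits)
    for tid in valid_token_ids:
        masked[tid] = logits[tid]
    return masked


def sample(logits):
    # Softmax sampling; exp(-inf) is 0.0, so masked tokens are never drawn.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    return random.choices(range(len(logits)), weights=weights)[0]


def generate_constrained(model, tokenizer, prompt, fsm, max_new_tokens=256):
    # model, tokenizer, and fsm are assumed interfaces; see code/main.py.
    ids = tokenizer.encode(prompt)
    prompt_len = len(ids)
    state = fsm.initial_state
    while not fsm.is_accept(state) and len(ids) - prompt_len < max_new_tokens:
        logits = model.next_token_logits(ids)
        valid = fsm.valid_tokens(state, tokenizer)
        logits = mask_logits(logits, valid)
        tok = sample(logits)
        ids.append(tok)
        state = fsm.transition(state, tok)
    # Decode only the generated tokens, not the prompt.
    return tokenizer.decode(ids[prompt_len:])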

The FSM tracks what parts of the grammar we have satisfied so far. valid_tokens(state, tokenizer) computes which vocabulary tokens can advance the FSM without leaving an accepting path.
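To see the loop run end to end without downloading a model, the sketch below wires it to stubs. Everything here is hypothetical: `ToyTokenizer` is a five-word vocabulary, `ChattyModel` returns fixed logits that favor the preamble word "The", and `ChoiceFSM` is the simplest possible automaton (emit exactly one allowed label, then accept). The prompt is omitted to keep the example short.

```python
import math
import random


class ToyTokenizer:
    """Hypothetical word-level tokenizer with a five-token vocabulary."""
    vocab = ["The", "sentiment", "positive", "negative", "neutral"]

    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)


class ChoiceFSM:
    """Two states: 0 = nothing emitted yet, 1 = one allowed label emitted."""
    initial_state = 0

    def __init__(self, choices):
        self.choices = set(choices)

    def is_accept(self, state):
        return state == 1

    def valid_tokens(self, state, tokenizer):
        return [i for i, t in enumerate(tokenizer.vocab) if t in self.choices]

    def transition(self, state, token_id):
        return 1


class ChattyModel:
    """Stub model that strongly prefers to start with 'The'."""

    def next_token_logits(self, ids):
        return [5.0, 3.0, 1.0, 0.2, 0.5]


def sample(logits, rng):
    """Sample from softmax(logits); -inf logits get zero weight."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    return rng.choices(range(len(logits)), weights=weights)[0]


def generate_constrained(model, tokenizer, fsm, rng, max_new_tokens=8):
    out, state = [], fsm.initial_state
    while not fsm.is_accept(state) and len(out) < max_new_tokens:
        logits = model.next_token_logits(out)
        valid = set(fsm.valid_tokens(state, tokenizer))
        masked = [x if i in valid else float("-inf") for i, x in enumerate(logits)]
        tok = sample(masked, rng)
        out.append(tok)
        state = fsm.transition(state, tok)
    return tokenizer.decode(out)


rng = random.Random(0)
tokenizer = ToyTokenizer()
fsm = ChoiceFSM(["positive", "negative", "neutral"])
outputs = {generate_constrained(ChattyModel(), tokenizer, fsm, rng) for _ in range(200)}
print(outputs)
```

Even though the stub model's top choice is always "The", every one of the 200 samples is a bare label, because the mask removes the preamble tokens before sampling ever sees them.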

Step 2: Outlines for JSON Schema

from pydantic import BaseModel
from typing import Literal
import outlines


class Review(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    evidence_span: str


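# Outlines 0.x entry points; releases from 1.0 on reorganized this API,
# so pin the version or adapt the two calls below.
model = outlines.models.transformers("meta-llama/Llama-3.2-3B-Instruct")
generator = outlines.generate.json(model, Review)

result = generator("Classify: 'The wait staff was attentive and the food arrived hot.'")
print(result)
# Illustrative output; actual values depend on the model and sampling:
# Review(sentiment='positive', confidence=0.93, evidence_span='attentive ... hot')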

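Zero schema violations, as long as generation is not cut off by a token limit before the object closes. The FSM makes invalid output unreachable. It guarantees the shape of the output, not that the values are correct.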

Step 3: Instructor for provider-agnostic Pydantic

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor: str
    total_usd: float = Field(ge=0)
    line_items: list[str]


client = instructor.from_anthropic(Anthropic())
invoice = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    response_model=Invoice,
    messages=[{"role": "user", "content": "Extract from: 'Acme Corp $420. Widget, Gizmo.'"}],
)

Different mechanism. Instructor does not touch logits. It passes the schema to the model through the prompt or the provider's tool-calling interface, parses the output, validates it with Pydantic, and retries on validation failure (default 3 times). Works with any provider Instructor supports. Retries add latency and cost. Cross-provider portability is the selling point.
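The parse, validate, retry cycle can be sketched with the standard library alone. This is not Instructor's implementation: `validate_invoice` is a hand-rolled stand-in for Pydantic validation, `call_llm` is a hypothetical callable that takes a message list and returns a string, and the wording of the error feedback message is an assumption.

```python
import json


def validate_invoice(data):
    """Hand-rolled stand-in for Pydantic validation of the Invoice schema."""
    if not isinstance(data.get("vendor"), str):
        raise ValueError("vendor must be a string")
    total = data.get("total_usd")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        raise ValueError("total_usd must be a non-negative number")
    items = data.get("line_items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        raise ValueError("line_items must be a list of strings")
    return data


def extract_with_retries(call_llm, prompt, max_retries=3):
    """Call, parse, validate; on failure, feed the error back and try again."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(1, max_retries + 1):
        raw = call_llm(messages)
        try:
            return validate_invoice(json.loads(raw)), attempt
        except (ValueError, AttributeError) as err:
            # JSONDecodeError is a ValueError; AttributeError covers non-dict JSON.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Fix this error: {err}"})
    raise RuntimeError(f"no valid output after {max_retries} attempts")


# Hypothetical stub: replies with prose first, then with valid JSON.
replies = iter([
    "Sure! Here is the invoice you asked for.",
    '{"vendor": "Acme Corp", "total_usd": 420, "line_items": ["Widget", "Gizmo"]}',
])
invoice, attempts = extract_with_retries(lambda msgs: next(replies), "Extract ...")
print(invoice, "after", attempts, "attempts")
```

Each failed attempt costs a full model call, which is the latency and cost trade-off mentioned above; constrained decoding avoids it by never producing the invalid output in the first place.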

Step 4: native vendor APIs

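import json

from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5",
    input=[{"role": "user", "content": "Classify: 'The food was cold.'"}],
    # strict=True asks the server to enforce the schema; strict mode requires
    # additionalProperties to be false.
    text={"format": {"type": "json_schema", "name": "sentiment", "strict": True,
          "schema": {"type": "object", "required": ["sentiment"],
                     "additionalProperties": False,
                     "properties": {"sentiment": {"type": "string",
                                                  "enum": ["positive", "negative", "neutral"]}}}}},
)
# responses.create returns text; output_parsed is only populated by responses.parse.
print(json.loads(response.output_text))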

Server-side constrained decoding. Reliability parity with Outlines for supported schemas. No local model management. Locks you to the vendor.

Pitfalls

Recursive schemas. Outlines flattens recursion to a fixed depth. Tree-structured outputs (nested comments, AST) need XGrammar or llguidance (CFG-based).
Huge enums. 10,000-option enum compiles slowly or times out. Switch to a retriever: predict top-k candidates first, constrain to those.
Grammar too strict. Force a date field to match a YYYY-MM-DD regex and the model cannot output "unknown" for a missing date, so it compensates by inventing one. Allow null or a sentinel value.
Premature commitment. See field-order pitfall above. Always put reasoning first.
Vendor JSON mode without schema. Pure JSON mode only guarantees valid JSON, not valid *for your use case*. Always provide a full schema.
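The "grammar too strict" pitfall is easy to demonstrate with two patterns. The sketch below uses Python's `re` module to compare a date-only pattern with one that also admits `null`; the `date_field` schema fragment shows one way the same escape hatch might be written in JSON Schema, and is an illustration rather than a required form.

```python
import re

STRICT = re.compile(r"\d{4}-\d{2}-\d{2}")
NULLABLE = re.compile(r"\d{4}-\d{2}-\d{2}|null")

for candidate in ["2026-01-15", "null", "unknown"]:
    print(f"{candidate:>10}  strict={bool(STRICT.fullmatch(candidate))}"
          f"  nullable={bool(NULLABLE.fullmatch(candidate))}")

# Equivalent idea as a JSON Schema fragment: a dated string or an explicit null.
date_field = {
    "anyOf": [
        {"type": "string", "pattern": r"^\d{4}-\d{2}-\d{2}$"},
        {"type": "null"},
    ]
}
```

Under the strict pattern the only way for the model to finish is to produce some date, real or not. The nullable pattern gives it a legal way to say the date is missing.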

Use It

The 2026 stack:

Situation	Pick
OpenAI/Anthropic/Google model, simple schema	Native vendor structured output
Any provider, Pydantic workflow, can tolerate retries	Instructor
Local model, need 100% validity, flat schema	Outlines (FSM)
Local model, recursive schema	XGrammar or llguidance
Self-hosted inference server	vLLM guided decoding
Batch processing with retries acceptable	Instructor + cheapest model
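The table above can be encoded as a small rule function. This is only a sketch: the parameter names are invented for illustration, and the rule ordering (which row wins when several apply) is an assumption the table does not spell out.

```python
def pick_mechanism(hosted_api=False, local_model=False, recursive_schema=False,
                   retries_ok=False, inference_server=False):
    """Encode the decision table as ordered rules; the first match wins."""
    if local_model and inference_server:
        return "vLLM guided decoding"
    if local_model and recursive_schema:
        return "XGrammar or llguidance"
    if local_model:
        return "Outlines (FSM)"
    if hosted_api and retries_ok:
        return "Instructor"
    if hosted_api:
        return "Native vendor structured output"
    raise ValueError("specify hosted_api=True or local_model=True")


print(pick_mechanism(local_model=True, recursive_schema=True))
print(pick_mechanism(hosted_api=True))
print(pick_mechanism(hosted_api=True, retries_ok=True))
```

In practice the inputs are rarely this clean, so treat the function as a checklist of questions to ask rather than an oracle.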

Ship It

Save as outputs/skill-structured-output-picker.md:

---
name: structured-output-picker
description: Choose a structured output approach, schema design, and validation plan.
version: 1.0.0
phase: 5
lesson: 20
tags: [nlp, llm, structured-output]
---

Given a use case (provider, latency budget, schema complexity, failure tolerance), output:

1. Mechanism. Native vendor structured output, Instructor retries, Outlines FSM, or XGrammar CFG. One-sentence reason.
2. Schema design. Field order (reasoning first, answer last), nullable fields for "unknown", enum vs regex, required fields.
3. Failure strategy. Max retries, fallback model, graceful `null` handling, out-of-distribution refusal.
4. Validation plan. Schema compliance rate (target 100%), semantic validity (LLM-judge), field-coverage rate, latency p50/p99.

Refuse any design that puts `answer` or `decision` before reasoning fields. Refuse to use bare JSON mode without a schema. Flag recursive schemas behind an FSM-only library.

Exercises

Easy. Prompt a small open-weights model (e.g., Llama-3.2-3B) without constrained decoding for Review(sentiment, confidence, evidence_span). Measure the fraction that parse as valid JSON on 100 reviews.
Medium. Same corpus with Outlines schema-constrained JSON generation (outlines.generate.json). Compare compliance rate, latency, and semantic accuracy.
Hard. Implement a regex-constrained decoder from scratch for phone numbers (\d{3}-\d{3}-\d{4}). Verify 0 invalid outputs on 1000 samples.
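For the Easy and Medium exercises, a small harness for the compliance rate may be a useful starting point. The sketch below checks raw model outputs against the Review fields by hand, using only the standard library; the four sample strings are invented, and a real run would replace them with the 100 model outputs.

```python
import json

REQUIRED = {"sentiment": str, "confidence": float, "evidence_span": str}
LABELS = {"positive", "negative", "neutral"}


def is_compliant(raw):
    """True if `raw` parses as JSON and matches the Review fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key, typ in REQUIRED.items():
        value = data.get(key)
        if typ is float:
            # Accept ints as numbers, but not booleans.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
        elif not isinstance(value, typ):
            return False
    return data["sentiment"] in LABELS


def compliance_rate(outputs):
    return sum(is_compliant(o) for o in outputs) / len(outputs)


samples = [
    '{"sentiment": "positive", "confidence": 0.9, "evidence_span": "arrived hot"}',
    "The sentiment is positive.",
    '{"sentiment": "great", "confidence": 0.9, "evidence_span": "x"}',
    '{"sentiment": "neutral", "confidence": 0.5}',
]
print(compliance_rate(samples))  # 0.25
```

Only the first sample passes: the second is prose, the third uses a label outside the enum, and the fourth is missing a required field. Tracking these failure types separately tends to be more informative than the single rate.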

Key Terms

Term	What people say	What it actually means
Constrained decoding	Force valid output	Mask invalid-token logits at every generation step.
Logit processor	The thing that constrains	Function: `(logits, state) -> masked_logits`.
FSM	Finite-state machine	Compiled grammar representation; O(1) valid-next-token lookup.
CFG	Context-free grammar	Grammar that handles recursion; more expressive than an FSM and historically slower, though engines like XGrammar and llguidance keep the overhead near zero.
Schema field order	Does it matter?	Yes — first field commits; always put reasoning before answer.
Guided decoding	vLLM's name for it	Same concept, integrated into the inference server.
JSON mode	OpenAI's early version	Guarantees JSON syntax; does NOT guarantee schema match.