Structured Outputs & Constrained Decoding

> Ask an LLM for JSON. Get JSON most of the time. In production, "most" is the problem. Constrained decoding turns "most" into "always" by editing the logits before sampling.

Type: Build

Languages: Python

Prerequisites: Phase 5 · 17 (Chatbots), Phase 5 · 19 (Subword Tokenization)

Time: ~60 minutes

The Problem

A classifier prompts an LLM: "Return one of {positive, negative, neutral}." The model returns "The sentiment is positive — this review is overwhelmingly favorable because the customer explicitly states that they ...". Your parser crashes. Your classifier's F1 is 0.0.

Free-form generation is not a contract. It is a suggestion. A production system needs a contract.

Three layers exist in 2026.

  1. Prompting. Ask nicely. "Return only the JSON object." Works ~80% on frontier models, less on smaller ones.
  2. Native structured output APIs. OpenAI response_format, Anthropic tool use, Gemini JSON mode. Reliable on supported schemas. Vendor-locked.
  3. Constrained decoding. Modify the logits at every generation step so the model *cannot* emit invalid tokens. 100% valid by construction. Works on any local model.

This lesson builds intuition for all three and names when to reach for which.

The Concept

[Figure: constrained decoding masks invalid tokens at each step]

How constrained decoding works. At each generation step, the LLM produces a logit vector over the full vocabulary (~100k tokens). A *logit processor* sits between the model and the sampler. It computes which tokens are valid given the current position in the target grammar — JSON Schema, regex, context-free grammar — and sets the logits of all invalid tokens to negative infinity. The softmax over the remaining logits puts probability mass only on valid continuations.
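The masking step can be sketched in a few lines of plain Python. This is a toy illustration, not any library's logit processor: the five-token vocabulary, the logit values, and the helper names `mask_invalid` and `softmax` are all made up for the example.

```python
import math


def mask_invalid(logits, valid_ids):
    """Return a copy of logits with every grammar-invalid position set to -inf."""
    valid = set(valid_ids)
    return [x if i in valid else float("-inf") for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # math.exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy 5-token vocabulary. Assume the grammar allows only tokens 1 and 3 here,
# even though token 2 has the highest raw logit.
logits = [2.0, 1.0, 3.0, 0.5, -1.0]
probs = softmax(mask_invalid(logits, valid_ids=[1, 3]))

for token_id, p in enumerate(probs):
    print(token_id, round(p, 3))
```

Tokens 0, 2, and 4 end up with probability exactly zero, so no sampling strategy (greedy, top-k, temperature) can select them. The relative preference between the surviving tokens is unchanged.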

Implementations in 2026:

The counterintuitive result

Constrained decoding is often *faster* than unconstrained generation. Two reasons. First, it shrinks the next-token search space. Second, clever implementations skip token generation entirely for forced tokens: in scaffolding like {"name": " every byte is determined by the schema, so no forward pass is needed to choose it.
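A character-level toy makes the forced-token shortcut concrete. The sketch below is an assumption-laden simplification: real engines work on tokenizer tokens rather than characters, and `build_template_dfa`, `generate`, and the scripted `choose` callback are hypothetical names standing in for a grammar compiler and a model forward pass.

```python
def build_template_dfa(prefix, suffix, free_chars):
    """DFA for prefix + one or more free chars + suffix, as {state: {char: next_state}}."""
    dfa = {}
    state = 0
    for ch in prefix:                        # forced scaffolding: one edge per state
        dfa[state] = {ch: state + 1}
        state += 1
    body, loop = state, state + 1
    dfa[body] = {c: loop for c in free_chars}    # first free char is mandatory
    dfa[loop] = {c: loop for c in free_chars}    # then keep going ...
    dfa[loop][suffix[0]] = loop + 1              # ... or start the closing scaffold
    state = loop + 1
    for ch in suffix[1:]:
        dfa[state] = {ch: state + 1}
        state += 1
    return dfa, state                        # the final state is the accepting one


def generate(dfa, accept, choose, max_steps=50):
    """Walk the DFA; only call `choose` (the stand-in model) when there is a real choice."""
    out, state, model_calls = [], 0, 0
    for _ in range(max_steps):
        if state == accept:
            break
        edges = dfa[state]
        if len(edges) == 1:
            ch = next(iter(edges))           # forced character: no model call needed
        else:
            ch = choose(sorted(edges))
            model_calls += 1
        out.append(ch)
        state = edges[ch]
    return "".join(out), model_calls


dfa, accept = build_template_dfa('{"name": "', '"}', "abcdefghijklmnopqrstuvwxyz")
script = iter('bob"')                        # scripted stand-in for sampled choices
text, calls = generate(dfa, accept, choose=lambda options: next(script))
print(text, calls)
```

In this run only 4 of the 15 output characters required a decision; the other 11 were dictated by the template. How much of that saving survives in practice depends on the engine and on how the forced text tokenizes.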

The pitfall that costs you

Field order matters. Put answer before reasoning, and the model commits to an answer before it thinks. JSON is valid. Answer is wrong. No validation catches it.

// BAD
{"answer": "yes", "reasoning": "because ..."}

// GOOD
{"reasoning": "... therefore ...", "answer": "yes"}

Schema field order is logic, not formatting.
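Because no validator catches a misordered schema, a small lint at schema-definition time is one way to enforce the rule. The sketch below assumes the decoder emits properties in declaration order and that fields follow a naming convention; the key sets and the function name `field_order_problems` are illustrative, not part of any library.

```python
# Assumed naming conventions; adjust to your own schemas.
REASONING_KEYS = {"reasoning", "rationale", "explanation"}
DECISION_KEYS = {"answer", "decision", "label"}


def field_order_problems(schema):
    """Return decision fields declared before the last reasoning field."""
    names = list(schema.get("properties", {}))   # dicts preserve insertion order
    reasoning_positions = [i for i, n in enumerate(names) if n in REASONING_KEYS]
    if not reasoning_positions:
        return []                                # nothing to order against
    last_reasoning = max(reasoning_positions)
    return [n for i, n in enumerate(names)
            if n in DECISION_KEYS and i < last_reasoning]


bad = {"type": "object",
       "properties": {"answer": {"type": "string"}, "reasoning": {"type": "string"}}}
good = {"type": "object",
        "properties": {"reasoning": {"type": "string"}, "answer": {"type": "string"}}}

print(field_order_problems(bad))    # ['answer']
print(field_order_problems(good))   # []
```

Running a check like this in CI turns the ordering rule from a convention into a test failure.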

Build It

Step 1: regex-constrained generation from scratch

See code/main.py for a standalone FSM implementation. The core idea in 30 lines:

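import math
import random


def mask_logits(logits, valid_token_ids):
    # Start from all -inf, then restore the logits of grammar-valid tokens only.
    mask = [float("-inf")] * len(logits)
    for tid in valid_token_ids:
        mask[tid] = logits[tid]
    return mask


def sample(logits):
    # Softmax sampling; exp(-inf) == 0.0, so masked tokens are never drawn.
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    return random.choices(range(len(logits)), weights=weights)[0]


def generate_constrained(model, tokenizer, prompt, fsm, max_new_tokens=256):
    ids = tokenizer.encode(prompt)
    prompt_len = len(ids)
    state = fsm.initial_state
    # Stop at an accepting state, or at the token budget so a grammar that
    # never reaches acceptance cannot loop forever.
    while not fsm.is_accept(state) and len(ids) - prompt_len < max_new_tokens:
        logits = model.next_token_logits(ids)
        valid = fsm.valid_tokens(state, tokenizer)
        logits = mask_logits(logits, valid)
        tok = sample(logits)
        ids.append(tok)
        state = fsm.transition(state, tok)
    return tokenizer.decode(ids[prompt_len:])  # completion only, without the prompt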

The FSM tracks what parts of the grammar we have satisfied so far. valid_tokens(state, tokenizer) computes which vocabulary tokens can advance the FSM without leaving an accepting path.
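To make the FSM interface concrete, here is a minimal sketch for the simplest possible grammar: a fixed set of allowed strings. It is not the implementation in code/main.py. The class `EnumFSM`, the toy vocabulary, and the fake logits are all hypothetical, and a plain list of token strings stands in for the tokenizer (so `transition` takes the vocabulary as an extra argument).

```python
class EnumFSM:
    """Accepts exactly one of a fixed set of strings; the state is the text emitted so far."""

    def __init__(self, choices):
        self.choices = set(choices)
        self.initial_state = ""

    def is_accept(self, state):
        return state in self.choices

    def valid_tokens(self, state, vocab):
        """IDs of tokens that keep state + token a prefix of some allowed choice."""
        return [tid for tid, text in enumerate(vocab)
                if text and any(c.startswith(state + text) for c in self.choices)]

    def transition(self, state, token_id, vocab):
        return state + vocab[token_id]


def greedy_constrained(fsm, vocab, logits_fn, max_steps=8):
    """Greedy decoding where the argmax is taken over grammar-valid tokens only."""
    state = fsm.initial_state
    for _ in range(max_steps):
        if fsm.is_accept(state):
            break
        valid = fsm.valid_tokens(state, vocab)
        logits = logits_fn(state)
        best = max(valid, key=lambda tid: logits[tid])
        state = fsm.transition(state, best, vocab)
    return state


# Toy subword vocabulary: multi-character tokens, some of them off-grammar.
vocab = ["pos", "neg", "neu", "itive", "ative", "tral", "p", "ositive", "The", " sentiment"]
fsm = EnumFSM(["positive", "negative", "neutral"])

# Fake model that badly wants to start with "The sentiment ...".
fake_logits = lambda state: [0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 4.0]
result = greedy_constrained(fsm, vocab, fake_logits)
print(fsm.valid_tokens("", vocab), result)
```

Two simplifications are worth noting. The prefix check only guarantees that the text so far can still be completed as a string; a production FSM must also confirm that some sequence of vocabulary tokens reaches acceptance, otherwise a state with no valid tokens is possible. And stopping at the first accepting state is wrong when one choice is a prefix of another.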

Step 2: Outlines for JSON Schema

from pydantic import BaseModel
from typing import Literal
import outlines


class Review(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float
    evidence_span: str


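# Pre-1.0 Outlines entry points; later releases reorganized this API, so check the docs for your installed version.
model = outlines.models.transformers("meta-llama/Llama-3.2-3B-Instruct")
generator = outlines.generate.json(model, Review)

result = generator("Classify: 'The wait staff was attentive and the food arrived hot.'")
print(result)
# Example output (exact values vary by model and sampling):
# Review(sentiment='positive', confidence=0.93, evidence_span='attentive ... hot')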

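Zero schema violations, as long as generation runs to completion. The FSM makes invalid output unreachable, but a max-token cutoff can still truncate the JSON mid-object, so set the token budget generously.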

Step 3: Instructor for provider-agnostic Pydantic

import instructor
from anthropic import Anthropic
from pydantic import BaseModel, Field


class Invoice(BaseModel):
    vendor: str
    total_usd: float = Field(ge=0)
    line_items: list[str]


client = instructor.from_anthropic(Anthropic())
invoice = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    response_model=Invoice,
    messages=[{"role": "user", "content": "Extract from: 'Acme Corp $420. Widget, Gizmo.'"}],
)

Different mechanism. Instructor does not touch logits. It passes the schema to the model (in the prompt or as a tool definition, depending on the mode), parses the output into the Pydantic model, and on validation failure re-asks with the validation error attached (default 3 times). Works with any provider. Retries add latency and cost. Cross-provider portability is the selling point.
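The parse-validate-retry loop is simple enough to sketch with the standard library. This is not Instructor's code: `validate_invoice` is a hand-rolled stand-in for Pydantic validation, `extract_with_retries` and `call_model` are hypothetical names, and the model is faked with scripted replies so the example runs offline.

```python
import json


def validate_invoice(data):
    """Stand-in for Pydantic validation; returns an error string, or None if valid."""
    if not isinstance(data, dict):
        return "top level must be an object"
    if not isinstance(data.get("vendor"), str):
        return "vendor must be a string"
    total = data.get("total_usd")
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        return "total_usd must be a number >= 0"
    items = data.get("line_items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        return "line_items must be a list of strings"
    return None


def extract_with_retries(call_model, prompt, max_attempts=3):
    """Call, parse, validate; on failure, feed the error back and ask again."""
    messages = [prompt]
    for attempt in range(1, max_attempts + 1):
        raw = call_model(messages)
        try:
            data = json.loads(raw)
            error = validate_invoice(data)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc.msg}"
        if error is None:
            return data, attempt
        # Append the bad reply and the error so the next call can correct it.
        messages += [raw, f"Validation failed: {error}. Return corrected JSON only."]
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Scripted fake model: prose first, then a negative total, then a valid reply.
replies = iter([
    "Sure! Here is the invoice you asked for.",
    '{"vendor": "Acme Corp", "total_usd": -420, "line_items": ["Widget", "Gizmo"]}',
    '{"vendor": "Acme Corp", "total_usd": 420, "line_items": ["Widget", "Gizmo"]}',
])
invoice, attempts = extract_with_retries(
    lambda messages: next(replies),
    "Extract from: 'Acme Corp $420. Widget, Gizmo.'",
)
print(invoice, attempts)
```

Each failed attempt costs a full model call, which is where the added latency and cost come from. The loop also has a hard failure mode (the final `ValueError`) that constrained decoding does not.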

Step 4: native vendor APIs

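import json

from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-5",
    input=[{"role": "user", "content": "Classify: 'The food was cold.'"}],
    # strict mode requires additionalProperties: false and every property listed as required
    text={"format": {"type": "json_schema", "name": "sentiment", "strict": True,
          "schema": {"type": "object", "required": ["sentiment"],
                     "additionalProperties": False,
                     "properties": {"sentiment": {"type": "string",
                                                  "enum": ["positive", "negative", "neutral"]}}}}},
)
# responses.create returns the JSON as text; output_parsed is populated by responses.parse instead
print(json.loads(response.output_text))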

Server-side constrained decoding. Reliability parity with Outlines for supported schemas. No local model management. Locks you to the vendor.

Pitfalls

Use It

The 2026 stack:

| Situation | Pick |
|---|---|
| OpenAI/Anthropic/Google model, simple schema | Native vendor structured output |
| Any provider, Pydantic workflow, can tolerate retries | Instructor |
| Local model, need 100% validity, flat schema | Outlines (FSM) |
| Local model, recursive schema | XGrammar or llguidance |
| Self-hosted inference server | vLLM guided decoding |
| Batch processing with retries acceptable | Instructor + cheapest model |

Ship It

Save as outputs/skill-structured-output-picker.md:

---
name: structured-output-picker
description: Choose a structured output approach, schema design, and validation plan.
version: 1.0.0
phase: 5
lesson: 20
tags: [nlp, llm, structured-output]
---

Given a use case (provider, latency budget, schema complexity, failure tolerance), output:

1. Mechanism. Native vendor structured output, Instructor retries, Outlines FSM, or XGrammar CFG. One-sentence reason.
2. Schema design. Field order (reasoning first, answer last), nullable fields for "unknown", enum vs regex, required fields.
3. Failure strategy. Max retries, fallback model, graceful `null` handling, out-of-distribution refusal.
4. Validation plan. Schema compliance rate (target 100%), semantic validity (LLM-judge), field-coverage rate, latency p50/p99.

Refuse any design that puts `answer` or `decision` before reasoning fields. Refuse to use bare JSON mode without a schema. Flag recursive schemas behind an FSM-only library.

Exercises

  1. Easy. Prompt a small open-weights model (e.g., Llama-3.2-3B) without constrained decoding for Review(sentiment, confidence, evidence_span). Measure the fraction that parse as valid JSON on 100 reviews.
  2. Medium. Same corpus with Outlines JSON mode. Compare compliance rate, latency, and semantic accuracy.
  3. Hard. Implement a regex-constrained decoder from scratch for phone numbers (\d{3}-\d{3}-\d{4}). Verify 0 invalid outputs on 1000 samples.
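For the first two exercises, a small measurement harness keeps the comparison honest. The sketch below is one possible setup, not a required one: `is_valid_review` hand-checks the Review shape from Step 2 instead of using Pydantic, `summarize` is a hypothetical helper, and the sample outputs and latencies are invented.

```python
import json
import statistics

REQUIRED = {"sentiment": str, "confidence": (int, float), "evidence_span": str}
SENTIMENTS = {"positive", "negative", "neutral"}


def is_valid_review(raw):
    """True if raw parses as JSON and matches the Review shape."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    for key, expected in REQUIRED.items():
        value = data.get(key)
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return data["sentiment"] in SENTIMENTS


def summarize(outputs, latencies_ms):
    """Schema compliance rate plus latency percentiles (needs >= 2 latency samples)."""
    valid = sum(is_valid_review(o) for o in outputs)
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"compliance_rate": valid / len(outputs),
            "p50_ms": cuts[49], "p99_ms": cuts[98]}


# Invented sample data: two valid replies, one prose reply, one off-enum label.
outputs = [
    '{"sentiment": "positive", "confidence": 0.9, "evidence_span": "arrived hot"}',
    "The sentiment is positive.",
    '{"sentiment": "mixed", "confidence": 0.5, "evidence_span": "ok"}',
    '{"sentiment": "negative", "confidence": 0.7, "evidence_span": "cold"}',
]
report = summarize(outputs, latencies_ms=[100, 110, 120, 130])
print(report)
```

Compliance rate measures syntax and schema only. Semantic accuracy (did the model pick the right sentiment?) needs labeled data or a judge and should be tracked as a separate number.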

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Constrained decoding | Force valid output | Mask invalid-token logits at every generation step. |
| Logit processor | The thing that constrains | Function: (logits, state) -> masked_logits. |
| FSM | Finite-state machine | Compiled grammar representation; O(1) valid-next-token lookup. |
| CFG | Context-free grammar | Grammar that handles recursion; slower but more expressive than FSM. |
| Schema field order | Does it matter? | Yes: the first field commits; always put reasoning before answer. |
| Guided decoding | vLLM's name for it | Same concept, integrated into the inference server. |
| JSON mode | OpenAI's early version | Guarantees JSON syntax; does NOT guarantee schema match. |

Further Reading