Dialogue State Tracking

> "I want a cheap restaurant in the north... actually make it moderate... and add Italian." Three turns, three state updates. DST keeps the slot-value dict in sync so the booking works.

Type: Build

Languages: Python

Prerequisites: Phase 5 · 17 (Chatbots), Phase 5 · 20 (Structured Outputs)

Time: ~75 minutes

The Problem

In a task-oriented dialogue system, the user's goal is encoded as a set of slot-value pairs: {cuisine: italian, area: north, price: moderate}. Every user turn can add, change, or remove a slot. The system must read the whole conversation and output the current state correctly.
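As a concrete illustration, the opening example can be written out as one state snapshot per user turn. This is a minimal sketch: `TURN_UPDATES`, `apply_turn`, and `track` are hypothetical names, and the per-turn updates are hand-labelled here, whereas a real tracker would have to derive them from the utterance text.

```python
# Hypothetical hand-labelled updates for the three turns of the opening example.
# A value of None means "the user removed this slot".
TURN_UPDATES = [
    {"price": "cheap", "area": "north"},  # "I want a cheap restaurant in the north"
    {"price": "moderate"},                # "actually make it moderate"
    {"cuisine": "italian"},               # "and add Italian"
]


def apply_turn(state, update):
    """Return a new state with one turn's adds, changes, and removals applied."""
    new_state = dict(state)
    for slot, value in update.items():
        if value is None:
            new_state.pop(slot, None)  # removal
        else:
            new_state[slot] = value    # add or overwrite
    return new_state


def track(turn_updates):
    """Return the list of dialogue states, one snapshot per turn."""
    states, state = [], {}
    for update in turn_updates:
        state = apply_turn(state, update)
        states.append(state)
    return states


states = track(TURN_UPDATES)
for turn, state in enumerate(states, start=1):
    print(turn, state)
```

The last snapshot is the state the backend would act on; the earlier snapshots are what a per-turn metric scores.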

Get a single slot wrong and the system books the wrong restaurant, schedules the wrong flight, or charges the wrong card. DST is the hinge between what the user said and what the backend executes.

Why it still matters in 2026 despite LLMs:

The modern pipeline: classical DST concepts + LLM extractors + structured-output guardrails.

The Concept

DST: dialogue history → slot-value state

Task structure. A schema defines domains (restaurant, hotel, taxi) and their slots (cuisine, area, price, people). Each slot is either empty, filled with a value from a closed set (price: {cheap, moderate, expensive}), or filled with a free-form value (name: "The Copper Kettle").
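One possible way to write such a schema down is a nested dict, sketched below. The `SCHEMA` layout, the taxi slots, and the `is_valid_value` helper are assumptions made for illustration, not a format taken from MultiWOZ or any library.

```python
# Hypothetical schema: domain -> slot -> closed vocabulary, or None for free-form.
SCHEMA = {
    "restaurant": {
        "cuisine": {"italian", "chinese", "indian", "thai"},
        "area": {"north", "south", "east", "west", "center"},
        "price": {"cheap", "moderate", "expensive"},
        "people": None,  # free-form
        "name": None,    # free-form, e.g. "The Copper Kettle"
    },
    "taxi": {
        "destination": None,
        "leave_at": None,
    },
}


def is_valid_value(domain, slot, value):
    """Check a candidate slot value against the schema.

    An empty slot (None) is always valid, closed slots must match their
    vocabulary, and free-form slots accept any non-empty value.
    """
    if domain not in SCHEMA or slot not in SCHEMA[domain]:
        return False            # unknown domain or slot
    if value is None:
        return True             # empty slot
    vocabulary = SCHEMA[domain][slot]
    if vocabulary is None:
        return value != ""      # free-form
    return value in vocabulary  # closed set
```

For example, `is_valid_value("restaurant", "price", "pricey")` is rejected because `pricey` is outside the closed set, while any non-empty restaurant name passes.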

Two DST formulations.

Metric. Joint Goal Accuracy (JGA) is the fraction of turns where *every* slot is correct. It is all-or-nothing: one wrong slot makes the whole turn count as wrong. The MultiWOZ 2.4 leaderboard tops out around 83% in 2026.

Architectures.

  1. Rule-based (slot regex + keyword). Strong baseline for narrow domains. Debuggable.
  2. TripPy / BERT-DST. Copy-based slot filling over a BERT encoder: values are copied from the dialogue rather than generated. Pre-LLM standard.
  3. LDST (LLaMA + LoRA). Instruction-tuned LLM with domain-slot prompting. Reaches ChatGPT-level quality on MultiWOZ 2.4.
  4. Ontology-free (2024–26). Skip the schema; generate slot names and values directly. Handles open domains.
  5. Prompt + structured output (2024–26). LLM with a Pydantic schema + constrained decoding. A few lines of code give a working tracker.

The classic failure modes

Build It

Step 1: rule-based slot extractor

See code/main.py. Regex + synonym dictionaries cover roughly 70% of canonical utterances in narrow domains:

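# Canonical value -> surface forms that signal it.
CUISINE_SYNONYMS = {
    "italian": ["italian", "pasta", "pizza", "italy"],
    "chinese": ["chinese", "chow mein", "noodles"],
}


def extract_cuisine(utterance):
    """Return the first canonical cuisine whose synonym appears, else None."""
    text = utterance.lower()  # lowercase once, not once per synonym
    for canonical, synonyms in CUISINE_SYNONYMS.items():
        # Substring match: simple, but it also fires inside longer words.
        if any(syn in text for syn in synonyms):
            return canonical
    return None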

Brittle outside the canonical vocabulary. Works for deterministic slot confirmations.
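A slightly stricter variant uses word-boundary regexes, which avoids matching a value inside a longer word, and handles a numeric slot. This is a sketch under assumptions: the `extract_price` and `extract_people` names, the trigger phrases, and the `NUMBER_WORDS` table are illustrative and not taken from code/main.py.

```python
import re

PRICE_VALUES = ("cheap", "moderate", "expensive")
# Hypothetical number words; extend as needed.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# \b anchors the match to whole words, so "cheaper" does not count as "cheap".
PRICE_RE = re.compile(r"\b(" + "|".join(PRICE_VALUES) + r")\b")
# Party size: a trigger phrase followed by digits or a known number word.
PEOPLE_RE = re.compile(
    r"\b(?:for|party of)\s+(\d+|" + "|".join(NUMBER_WORDS) + r")\b"
)


def extract_price(utterance):
    """Return the price range mentioned as a whole word, else None."""
    match = PRICE_RE.search(utterance.lower())
    return match.group(1) if match else None


def extract_people(utterance):
    """Return the party size as an int, else None."""
    match = PEOPLE_RE.search(utterance.lower())
    if not match:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else NUMBER_WORDS[token]
```

The pattern is still brittle: "for 19:00" would be misread as a party of 19, so a real extractor needs to rule out times and dates before trusting a bare number.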

Step 2: state update loop

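def update_state(state, utterance):
    """Apply one user turn and return the new state (the input is not mutated)."""
    # SLOT_EXTRACTORS, NEGATION_CLEARS and is_negated are assumed to be
    # defined alongside the extractors from Step 1.
    new_state = dict(state)
    for slot, extractor in SLOT_EXTRACTORS.items():
        value = extractor(utterance)
        if value is not None:  # only overwrite when this turn mentions the slot
            new_state[slot] = value
    # Negation runs last, so an explicit clear overrides any value set above.
    for slot in NEGATION_CLEARS:
        if is_negated(utterance, slot):
            new_state[slot] = None
    return new_state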
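The loop above relies on `SLOT_EXTRACTORS`, `NEGATION_CLEARS`, and `is_negated`, which are not shown in this section. The sketch below fills them in with simple assumptions (closed vocabularies, and a negation cue immediately followed by the slot name) so the loop can be run end to end; the actual helpers in code/main.py may look different.

```python
import re

# Hypothetical closed vocabularies; replace with your own schema.
VOCAB = {
    "cuisine": ("italian", "chinese", "indian", "thai"),
    "area": ("north", "south", "east", "west", "center"),
    "price": ("cheap", "moderate", "expensive"),
}
# Hypothetical cues: a slot is cleared when a cue directly precedes its name.
NEGATION_CUES = ("forget the", "never mind the", "any")


def make_extractor(values):
    """Build an extractor that returns the first whole-word value found."""
    pattern = re.compile(r"\b(" + "|".join(values) + r")\b")

    def extract(utterance):
        match = pattern.search(utterance.lower())
        return match.group(1) if match else None

    return extract


SLOT_EXTRACTORS = {slot: make_extractor(values) for slot, values in VOCAB.items()}
NEGATION_CLEARS = tuple(VOCAB)


def is_negated(utterance, slot):
    """True if a negation cue is directly followed by the slot name."""
    cues = "|".join(re.escape(cue) for cue in NEGATION_CUES)
    return re.search(rf"\b(?:{cues})\s+{slot}\b", utterance.lower()) is not None


def update_state(state, utterance):
    """Same loop as above: extract first, then apply explicit clears."""
    new_state = dict(state)
    for slot, extractor in SLOT_EXTRACTORS.items():
        value = extractor(utterance)
        if value is not None:
            new_state[slot] = value
    for slot in NEGATION_CLEARS:
        if is_negated(utterance, slot):
            new_state[slot] = None
    return new_state


state = {}
for turn in [
    "I want a cheap restaurant in the north",
    "actually make it moderate",
    "and add Italian",
]:
    state = update_state(state, turn)
print(state)
```

Note that the correction in the second turn needs no special handling here: the price extractor fires again and the new value overwrites the old one.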

Three invariants:

Step 3: LLM-driven DST with structured output

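from typing import Literal, Optional

import instructor  # patches an LLM client so calls accept response_model
from pydantic import BaseModel


class RestaurantState(BaseModel):
    cuisine: Optional[Literal["italian", "chinese", "indian", "thai", "any"]] = None
    area: Optional[Literal["north", "south", "east", "west", "center"]] = None
    price: Optional[Literal["cheap", "moderate", "expensive"]] = None
    people: Optional[int] = None
    day: Optional[str] = None


def render(history):
    # Assumes history is a list of (speaker, text) pairs.
    return "\n".join(f"{speaker}: {text}" for speaker, text in history)


def llm_dst(history, llm):
    # `llm` is assumed to wrap an Instructor-patched client call that takes
    # a prompt and a response_model and returns a validated instance.
    prompt = f"""You track the slot values of a restaurant booking across turns.
Dialogue so far:
{render(history)}

Update the state based on the latest user turn. Output only the JSON state."""
    return llm(prompt, response_model=RestaurantState)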

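Instructor + Pydantic guarantee a schema-valid state object: the LLM output is validated against the model, so there is no regex to maintain, no unknown slot, and no out-of-vocabulary value for a closed slot. Validation does not guarantee the values are the right ones; a schema-valid state can still disagree with what the user said.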
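To make concrete what the schema check rejects, the sketch below re-implements the same kind of validation with only the standard library. It is a stand-in for illustration, not how Instructor or Pydantic work internally; `parse_state`, `ALLOWED`, and `FREE_FORM` are hypothetical names that mirror the fields of `RestaurantState`.

```python
import json

# Closed vocabularies mirroring the Literal types of RestaurantState.
ALLOWED = {
    "cuisine": {"italian", "chinese", "indian", "thai", "any"},
    "area": {"north", "south", "east", "west", "center"},
    "price": {"cheap", "moderate", "expensive"},
}
# Free-form slots and the Python type each must have.
FREE_FORM = {"people": int, "day": str}
ALL_SLOTS = set(ALLOWED) | set(FREE_FORM)


def parse_state(raw_json):
    """Parse model output into a full state dict, or raise ValueError."""
    data = json.loads(raw_json)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("state must be a JSON object")
    unknown = set(data) - ALL_SLOTS
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    state = dict.fromkeys(sorted(ALL_SLOTS))  # every slot defaults to None
    for slot, value in data.items():
        if value is None:
            continue
        if slot in ALLOWED and (
            not isinstance(value, str) or value not in ALLOWED[slot]
        ):
            raise ValueError(f"{slot}={value!r} is outside the closed vocabulary")
        if slot in FREE_FORM and not isinstance(value, FREE_FORM[slot]):
            raise ValueError(f"{slot} must be {FREE_FORM[slot].__name__}")
        state[slot] = value
    return state
```

A hallucinated slot such as `{"parking": "yes"}` or a value such as `{"price": "pricey"}` raises `ValueError`, which is the point at which a caller can retry the LLM or ask the user to rephrase.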

Step 4: JGA evaluation

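def joint_goal_accuracy(predicted_states, gold_states):
    # A turn counts only if the full predicted state equals the gold state.
    if len(predicted_states) != len(gold_states):
        raise ValueError("need exactly one predicted state per gold turn")
    correct = sum(p == g for p, g in zip(predicted_states, gold_states))
    return correct / len(gold_states) if gold_states else 0.0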

Calibrate: JGA answers the question "on what fraction of turns does the system get ALL slots right?" On MultiWOZ 2.4, top 2026 systems score 80-83%. A system built for a narrow in-domain vocabulary should exceed that on its own data; if it does not, an LLM baseline is likely to beat it.
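Because JGA is all-or-nothing, it helps to pair it with a per-slot view that shows where the errors come from. The sketch below is one way to do that; `slot_accuracy` and `slot_error_counts` are hypothetical helpers, and the three-turn example data is made up to show a tracker that misses the price correction.

```python
from collections import Counter


def slot_accuracy(predicted_states, gold_states, slots):
    """Fraction of (turn, slot) pairs where the prediction equals gold."""
    total = len(gold_states) * len(slots)
    if total == 0:
        return 0.0
    correct = sum(
        p.get(slot) == g.get(slot)
        for p, g in zip(predicted_states, gold_states)
        for slot in slots
    )
    return correct / total


def slot_error_counts(predicted_states, gold_states, slots):
    """Count wrong turns per slot, to find the hardest slot."""
    errors = Counter()
    for p, g in zip(predicted_states, gold_states):
        for slot in slots:
            if p.get(slot) != g.get(slot):
                errors[slot] += 1
    return errors


SLOTS = ("cuisine", "area", "price")
gold = [
    {"price": "cheap", "area": "north"},
    {"price": "moderate", "area": "north"},
    {"price": "moderate", "area": "north", "cuisine": "italian"},
]
# Made-up predictions: the tracker never applies the correction to "moderate".
pred = [
    {"price": "cheap", "area": "north"},
    {"price": "cheap", "area": "north"},
    {"price": "cheap", "area": "north", "cuisine": "italian"},
]

jga = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(jga, slot_accuracy(pred, gold, SLOTS), slot_error_counts(pred, gold, SLOTS))
```

Here a single missed correction leaves 7 of 9 slot decisions correct but only 1 of 3 turns fully correct, which is why JGA reads so much lower than slot accuracy on the same predictions.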

Step 5: handling correction

CORRECTION_CUES = {"actually", "no wait", "on second thought", "change that to"}


def is_correction(utterance):
    return any(cue in utterance.lower() for cue in CORRECTION_CUES)

On a detected correction, overwrite the last-updated slot rather than appending a second value. This is hard to get right with rules alone. The modern pattern is to let the LLM regenerate the whole state from the full history on every turn instead of updating incrementally, which handles corrections naturally.
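For the rule-based path, the overwrite-the-last-updated-slot idea might look like the sketch below. It covers the case where the correction carries only a bare value ("actually make it 6") that no extractor would attribute to a slot on its own. `apply_correction`, `BARE_VALUE_PARSERS`, and the two parsers are hypothetical, and the caller is assumed to remember which slot was updated last.

```python
import re

CORRECTION_CUES = ("actually", "no wait", "on second thought", "change that to")


def parse_price(text):
    match = re.search(r"\b(cheap|moderate|expensive)\b", text)
    return match.group(1) if match else None


def parse_number(text):
    match = re.search(r"\b(\d+)\b", text)
    return int(match.group(1)) if match else None


# Hypothetical: how to read a bare value for each slot.
BARE_VALUE_PARSERS = {"price": parse_price, "people": parse_number}


def apply_correction(state, last_slot, utterance):
    """Overwrite `last_slot` if the utterance is a correction with a usable value.

    Returns a new state; the input state is left unmodified.
    """
    text = utterance.lower()
    new_state = dict(state)
    parser = BARE_VALUE_PARSERS.get(last_slot)
    if parser is None or not any(cue in text for cue in CORRECTION_CUES):
        return new_state  # not a correction, or no way to read a bare value
    value = parser(text)
    if value is not None:
        new_state[last_slot] = value  # overwrite, never append
    return new_state
```

The heuristic breaks as soon as the user corrects a slot other than the most recent one, which is one reason regenerating the full state is attractive.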

Pitfalls

Use It

The 2026 stack:

| Situation | Approach |
|---|---|
| Narrow domain (one or two intents) | Rule-based + regex |
| Broad domain, labeled data available | LDST (LLaMA + LoRA on MultiWOZ-style data) |
| Broad domain, no labels, prod-ready | LLM + Instructor + Pydantic schema |
| Spoken / voice | ASR + normalizer + LLM-DST |
| Multi-domain booking flow | Schema-guided LLM with per-domain Pydantic models |
| Compliance-sensitive | Rule-based primary, LLM fallback with confirmation flow |
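The compliance-sensitive pattern (rule-based primary, LLM fallback with confirmation) can be sketched as a small router. Everything here is an assumption for illustration: `track_with_fallback` is a hypothetical function, both trackers are passed in as callables with the signature `(state, utterance) -> state`, and the two demo trackers are toy stand-ins, with the "LLM" replaced by a hard-coded rule so the example runs offline.

```python
def track_with_fallback(state, utterance, rule_update, llm_update):
    """Rule-based primary with LLM fallback.

    Returns (new_state, needs_confirmation). needs_confirmation lists the
    slots whose new value came from the LLM and should be read back to the
    user before the backend acts on them.
    """
    rule_state = rule_update(state, utterance)
    if rule_state != state:
        return rule_state, []  # the rules changed something: trust them
    # An unchanged state is treated as "the rules found nothing".
    llm_state = llm_update(state, utterance)
    changed = sorted(
        slot
        for slot in set(state) | set(llm_state)
        if state.get(slot) != llm_state.get(slot)
    )
    return llm_state, changed


def demo_rule_update(state, utterance):
    """Toy rule-based tracker: price keywords only."""
    new_state = dict(state)
    for price in ("cheap", "moderate", "expensive"):
        if price in utterance.lower():
            new_state["price"] = price
    return new_state


def demo_llm_update(state, utterance):
    """Stand-in for a real LLM call: understands one paraphrase."""
    new_state = dict(state)
    if "won't break the bank" in utterance.lower():
        new_state["price"] = "cheap"
    return new_state


print(track_with_fallback({}, "something moderate", demo_rule_update, demo_llm_update))
print(track_with_fallback({}, "somewhere that won't break the bank",
                          demo_rule_update, demo_llm_update))
```

One cost of this trigger is that a turn which merely repeats an already-filled value also falls through to the LLM; a production router would likely use a more precise signal.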

Ship It

Save as outputs/skill-dst-designer.md:

---
name: dst-designer
description: Design a dialogue state tracker — schema, extractor, update policy, evaluation.
version: 1.0.0
phase: 5
lesson: 29
tags: [nlp, dialogue, task-oriented]
---

Given a use case (domain, languages, vocab openness, compliance needs), output:

1. Schema. Domain list, slots per domain, open vs closed vocabulary per slot.
2. Extractor. Rule-based / seq2seq / LLM-with-Pydantic. Reason.
3. Update policy. Regenerate-whole-state / incremental; correction handling; negation handling.
4. Evaluation. Joint Goal Accuracy on a held-out dialogue set, slot-level precision/recall, confusion on the hardest slot.
5. Confirmation flow. When to explicitly ask the user to confirm (destructive actions, low-confidence extractions).

Refuse LLM-only DST for compliance-sensitive slots without a rule-based secondary check. Refuse any DST that cannot roll back a slot on user correction. Flag schemas without version tags.

Exercises

  1. Easy. Build the rule-based state tracker in code/main.py for 3 slots (cuisine, area, price). Test on 10 hand-crafted dialogues. Measure JGA.
  2. Medium. Same dataset with Instructor + Pydantic + a small LLM. Compare JGA. Inspect the hardest turns.
  3. Hard. Implement both and route between them: rule-based primary, with LLM fallback whenever the rule-based tracker fills fewer than 2 slots confidently. Measure the combined JGA and the inference cost per turn.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| DST | Dialogue state tracking | Maintain the slot-value dict across dialogue turns. |
| Slot | Unit of user intent | Named parameter the backend needs (cuisine, date). |
| Domain | The task area | Restaurant, hotel, taxi: each is a set of slots. |
| JGA | Joint Goal Accuracy | Fraction of turns where every slot is correct. All-or-nothing. |
| MultiWOZ | The benchmark | Multi-domain WOZ dataset; standard DST evaluation. |
| Ontology-free DST | No schema | Generate slot names and values directly, no fixed list. |
| Correction | "Actually..." | Turn that overwrites a previously-filled slot. |

Further Reading