LangGraph: Stateful Graphs and Durable Execution
> LangGraph is the 2026 reference for low-level stateful orchestration. Agent is a state machine; nodes are functions; edges are transitions; state is immutable and checkpointed after every step. Resume from any failure exactly where it left off.
Type: Learn + Build
Languages: Python (stdlib)
Prerequisites: Phase 14 · 01 (Agent Loop), Phase 14 · 12 (Workflow Patterns)
Time: ~75 minutes
Learning Objectives
- Describe LangGraph's core model: state machine with immutable state, function nodes, conditional edges, and post-step checkpoints.
- Name the four capabilities the docs highlight: durable execution, streaming, human-in-the-loop, comprehensive memory.
- Explain the three orchestration topologies LangGraph supports: supervisor, peer-to-peer (swarm), hierarchical (nested subgraphs).
- Implement a stdlib state graph with immutable state, conditional edges, and a checkpoint/resume cycle.
The Problem
Agents and workflows share a problem: when a 40-step run fails at step 38, you want to resume from step 38, not start over. Second-class state models leave operators hacking retries around a library that assumes fresh runs.
LangGraph's design answer: state is a first-class typed object, mutations are explicit, and checkpoints persist after every node. Resume is a load_state(session_id) call.
The Concept
The graph
A graph is defined by:
- State type. A typed dict (or Pydantic model) that every node reads and mutates.
- Nodes. Pure functions
(state) -> state_update. Updates are merged into state after return. - Edges. Conditional or direct transitions between nodes.
- Entry and exit.
STARTandENDsentinel nodes mark the boundary.
Example: an agent with classify, refund, bug, sales, done nodes — a routing workflow as a graph.
Durable execution
After each node returns, the runtime serializes the state and writes it to a checkpointer (SQLite, Postgres, Redis, custom). On failure at step N, the runtime can resume(session_id) and pick up from step N+1 with exact state.
The LangGraph docs explicitly highlight production users where this matters: Klarna, Uber, J.P. Morgan. The claim isn't the graph shape; it's that the graph shape plus checkpointing makes recovery cheap.
Streaming
Every node can yield partial output. The graph streams per-node-delta events to the caller so UIs update as the graph runs.
Human-in-the-loop
Inspect and modify state between nodes. Implementations: pause before a critical node, surface state to a human, accept modifications, resume. The checkpointer makes this easy because state is already serialized.
Memory
Short-term (within a run — conversation history in state) and long-term (across runs — persistent via the checkpointer plus a separate long-term store). LangGraph integrates with external memory systems (Mem0, custom) via tools.
Three topologies
- Supervisor. Central router LLM dispatches to specialist subagents.
create_supervisor()inlanggraph-supervisor(though the LangChain team in 2026 recommends doing this through tool calls directly for more context control). - Swarm / peer-to-peer. Agents hand off directly via a shared tool surface. No central router.
- Hierarchical. Supervisors managing sub-supervisors, implemented as nested subgraphs.
Where this pattern goes wrong
- Checkpoints too small. Only checkpointing conversation turns leaves tool state and memory writes unrecoverable. Full state must serialize.
- Non-deterministic nodes. Resume assumes node inputs produce the same state update. Random seeds, wall-clock, external APIs must be captured.
- Over-use of conditional edges. A graph with every edge conditional is a state machine that cannot be reasoned about. Prefer linear chains with occasional branches.
Build It
code/main.py implements a stdlib stateful graph:
State— a typed dict withmessages,step,route,output,human_approval.Node— callable taking state and returning an update dict.StateGraph— nodes + edges + conditional edges + run + resume.SQLiteCheckpointer(in-memory fake) — serializes state after every node;load(session_id)restores.- A demo graph: classify -> branch(refund / bug / sales) -> human gate -> send.
Run it:
python3 code/main.py
The trace shows the first run failing at the human gate, persistence, then resume producing the final output.
Use It
- LangGraph — the reference, production-ready. Use
create_react_agent,create_supervisor, or build your own graph. - AutoGen v0.4 (Lesson 14) — actor model alternative for high-concurrency scenarios.
- Claude Agent SDK (Lesson 17) — managed harness with built-in session store.
- Custom — when you need exact control over state shape or checkpointer backend.
Ship It
outputs/skill-state-graph.md generates a LangGraph-shaped state graph in any target runtime with checkpointing and resume wired in.
Exercises
- Add a conditional edge from
classifytoendwhen classification confidence is below a threshold. Resume the run after a human setsroutemanually. - Swap the SQLite-like fake for a real SQLite checkpointer. Measure per-step serialization overhead.
- Implement parallel edges: two nodes run concurrently, merge by a custom reducer. What does immutable state buy here?
- Read
langgraph-supervisorreference. Port the toy tocreate_supervisor. Compare the trace shapes. - Add streaming: each node yields partial state while it runs. Print the deltas as they arrive.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| State graph | "Agent as state machine" | Typed state + nodes + edges + reducers |
| Checkpointer | "Persistence backend" | Serializes state after every node; enables resume |
| Reducer | "State merger" | Function that combines current state with a node's update |
| Conditional edge | "Branch" | Edge chosen by a function of state |
| Subgraph | "Nested graph" | A graph used as a node inside another graph |
| Durable execution | "Resume from failure" | Restart at the last successful node with exact state |
| Supervisor | "Router LLM" | Central dispatcher for specialist subagents |
| Swarm | "P2P agents" | Agents hand off via shared tools; no central router |
Further Reading
- LangGraph overview — the reference docs
- langgraph-supervisor reference — supervisor pattern API
- AutoGen v0.4, Microsoft Research — actor-model alternative
- Claude Agent SDK overview — session store and subagents