← Agent Economies, Token Incentives, Reputation Failure Modes — MAST, Groupthink, Monoculture, Cascading Errors →

Production Scaling — Queues, Checkpoints, Durability

> Scaling multi-agent systems to thousands of concurrent runs requires durable execution. LangGraph's runtime writes a checkpoint after each super-step keyed by thread_id (Postgres by default); worker crashes release a lease and another worker resumes. Agents can sleep indefinitely waiting for human input. MegaAgent (arXiv:2408.09955) ran a per-agent producer-consumer queue with three states (Idle / Processing / Response) and two-layer coordination (intra-group chat + inter-group admin chat). Fiber/async beats thread-per-job for LLM streaming: threads sit idle 99% of the time waiting for tokens, fibers cooperatively yield on I/O. Counterpoint: Ashpreet Bedi's "Scaling Agentic Software" argues for FastAPI + Postgres + nothing else until load proves otherwise — simple architectures go further than expected. This lesson builds a durable checkpoint log, a per-agent work queue with state transitions, an async-vs-thread demo, and lands the pragmatic "start simple" rule.

Type: Learn + Build

Languages: Python (stdlib, asyncio, sqlite3)

Prerequisites: Phase 16 · 09 (Parallel Swarm Networks), Phase 16 · 13 (Shared Memory)

Time: ~75 minutes

Problem

A prototype multi-agent system works on one laptop with three agents in an in-memory event loop. You move to production:

Agents sometimes run for hours (long research, human-in-the-loop waits).
Worker processes crash. Restarting loses state.
Peak load is 10x average; you need horizontal scaling.
Users pay per agent-run; you need exactly-once semantics for charging.

The in-memory event loop does none of these. You need a durable execution layer underneath. The 2026 canonical options are:

A workflow engine with checkpoints (Temporal, LangGraph runtime).
A message queue with a state store (Postgres + SQS/RabbitMQ).
Actor-model frameworks (MegaAgent's producer-consumer per agent).
Hand-rolled FastAPI + Postgres (Bedi's argument).

This lesson builds a miniature of each.

Concept

Durable execution, the pattern

A durable-execution engine persists the full program state after each "step" (super-step, in LangGraph's language). On crash:

worker crashes mid-step
  -> lease timeout
  -> another worker picks up the thread_id
  -> resumes from last checkpoint
  -> no duplicate side effects

Requirements for this to work:

Serializable state. All agent state has to be persistable. Function closures with live database connections do not survive.
Deterministic resume. Given the same state and same inputs, the agent produces the same actions (or defers to an external deterministic oracle for LLM calls).
Idempotent side effects. External calls (tool calls, payments) must be idempotent or use a deduplication key.

LangGraph writes a checkpoint after each super-step; Temporal writes after each activity; Restate uses event-sourced journals. All three implement the same pattern.

LangGraph's runtime

Each agent has a thread_id; state is a typed dict; each super-step writes a row to the checkpoints table. On resume, the runtime replays from the last checkpoint, not from scratch. Agents can interrupt() waiting for human input; the runtime persists and releases the worker. When input arrives, any worker can resume.

This is the reference production design in April 2026.

MegaAgent's per-agent queue

arXiv:2408.09955 describes a scale experiment: thousands of concurrent agents in one cluster. Architecture:

agent i:
  state ∈ {Idle, Processing, Response}
  in_queue   <- messages addressed to agent i
  out_queue  -> replies + side effects

coordinators:
  intra-group chat  (agents in the same group)
  inter-group admin chat  (high-level routing)

The two-layer coordination lets intra-group conversation happen densely while inter-group stays sparse — the pattern used for keeping cost linear in thousands of agents.

Async vs thread-per-job

LLM calls are I/O-bound. A thread waiting for the next token is idle 99% of the time. Threads cost ~1MB RAM each; at 10,000 concurrent calls, that is 10GB just for stacks.

Fibers (Python asyncio, Go goroutines, Rust tokio) cooperatively yield on I/O. The same 10,000 calls fit comfortably in process. At LLM-agent scale, async is not an optimization — it is the architecture.

Exception: CPU-bound post-processing (embedding, tokenizer tricks) still wants threads or processes. Separate your I/O layer from your CPU layer.

Bedi's counterpoint

"Scaling Agentic Software" (Ashpreet Bedi, 2026) argues that most teams over-engineer before they have measured load. The pragmatic default:

FastAPI + Postgres.
Each agent run is a row; state updated in-place with optimistic concurrency.
Background jobs via pg_notify or a simple Celery worker.
Retry policy in application code.

For loads under ~100 concurrent agent-runs on manageable tasks, this is often all you need. Upgrade when you measure it failing.

The rule: adopt durable-execution frameworks when you hit a concrete problem that simple architectures cannot solve. Premature adoption burns time on ceremonies that do not pay off.

Exactly-once semantics

For paid agent runs, you need "exactly-once effective" (at-least-once delivery + idempotent consumer). The engineering moves:

Dedup key per run. Include it in every side-effect call.
Outbox pattern. Side effects write to a table first, then a separate process executes them. Both steps idempotent.
Compensating transactions. When a side effect succeeds but its tracking write fails, schedule a compensate.

These are database-engineering patterns, not LLM-specific. The LLM tax is only that LLM calls are slow; everything else is standard distributed systems.

Rainbow deployment

Anthropic's multi-agent research system uses "rainbow deployments": multiple versions of the agent runtime run concurrently so long-running agents do not have to be killed on every code deploy. Canary new versions on a slice of traffic; retire old versions when their agents finish.

This is standard for long-running stateful systems; the 2026 adaptation is that agents can live for hours, so deployment cycles must accommodate.

The canonical production checklist

Durable state (checkpoints, snapshots, or outbox + replayable log).
Idempotent side effects.
Async I/O layer for LLM calls.
At-least-once delivery with dedup.
Rainbow/canary deployment for stateful workloads.
Observability: per-agent traces, super-step audit, retry counter.

Build It

code/main.py implements:

CheckpointStore — SQLite-backed checkpoint log with thread-id keys. Each super-step appends a row.
run_with_checkpoint(agent, thread_id) — simulates a crash mid-run; a second worker resumes from last checkpoint.
AgentQueue — per-agent Idle / Processing / Response state machine with a small work queue.
demo_async_vs_threads() — runs 500 concurrent simulated "LLM calls" via asyncio and via threads; reports wall-clock and peak memory (approximated).

Run:

python3 code/main.py

Expected output: checkpoint resume succeeds after simulated crash; async version handles 500 concurrent calls in < 1s; thread version takes several seconds and uses orders of magnitude more memory per concurrent unit.

Use It

outputs/skill-scaling-advisor.md advises on durable-execution choice: FastAPI + Postgres, LangGraph runtime, Temporal, or custom. Calibrated by load, state-retention needs, and deploy frequency.

Ship It

Canonical production hardening:

Start simple (Bedi's rule). FastAPI + Postgres until you measure it failing.
Instrument everything before optimizing. Per-run latency histogram, per-step time, retry count, failure categorization.
Outbox pattern for side effects. Especially payments and external API calls.
Rainbow deploys. Never kill in-flight agent runs during deploys.
Adopt durable-execution engines (Temporal / LangGraph / Restate) when you hit specific problems: hour-long human-in-the-loop waits, cross-region coordination, complex retry/compensation policies.
Async for the I/O layer. Threads only for CPU-bound post-processing.

Exercises

Run code/main.py. Confirm checkpoint resume works; measure async vs thread concurrency difference.
Implement an outbox table: every tool call writes to outbox first, then a separate goroutine/task executes. Verify idempotency by running the tool call twice.
Simulate a rainbow deploy: two concurrent runtime versions; route half of new thread_ids to each; confirm that in-flight threads on the old version are not interrupted.
Read LangGraph's runtime doc (linked below). Identify which features of the runtime would take the longest to replicate in a hand-rolled FastAPI + Postgres version. Is that a reason to adopt, or can you defer?
Read MegaAgent (arXiv:2408.09955) Section 3. The two-layer coordination (intra-group + inter-group admin chat) is explicit. Sketch how you would map this to a message queue with two queue families.

Key Terms

Term	What people say	What it actually means
Durable execution	"Persist the program state"	Engine writes state after each super-step; crash recovery is deterministic.
Super-step	"Transactional boundary"	Unit of work between checkpoints. LangGraph term.
thread_id	"Agent run identifier"	Key that binds checkpoints and resume logic.
Idempotency	"Safe to retry"	Repeating a side effect produces the same result as one attempt.
Outbox pattern	"Decouple side effects"	Write intent to a table; a separate executor performs and marks done.
At-least-once delivery	"Possible duplicates"	Message queue semantics; dedup key makes consumer effective-once.
Rainbow deploy	"Overlapping versions"	Multiple runtime versions concurrent during long-running workloads.
Async fiber	"Cooperative yielding"	User-mode concurrency; cheap compared to threads for I/O-bound loads.
Checkpoint	"State snapshot"	Serialized state at a super-step boundary; key for resume.