Supervisor / Orchestrator-Worker Pattern

> One lead agent plans and delegates; specialized workers execute in parallel contexts and report back. This is the pattern behind Anthropic's Research system (Claude Opus 4 as lead, Sonnet 4 as subagents), measured at +90.2% over single-agent Opus 4 on internal research evals. Anthropic's engineering post reports that 80% of the variance on BrowseComp is explained by token usage alone — multi-agent wins largely because each subagent gets a fresh context window. This lesson builds the supervisor pattern from the primitives and covers the 2026 engineering lessons from production deployments.

Type: Learn + Build

Languages: Python (stdlib, threading)

Prerequisites: Phase 16 · 04 (Primitive Model)

Time: ~75 minutes

Problem

Research is the prototypical task that single-agent systems fail. You ask "what changed in multi-agent systems between 2023 and 2026?" A single agent reads five papers sequentially, fills half its context with their text, and then has to reason about all of them together. It forgets the first paper by the time it reaches the fifth. It cannot parallelize.

The supervisor pattern fixes this: one lead agent plans the search, delegates each sub-question to a worker, and synthesizes. Each worker gets its own 200k-token window for a narrow question. The lead never sees the raw papers — only the worker summaries.

Anthropic's production Research system reports +90.2% on internal research evals vs a single Opus 4. The same post notes that 80% of the BrowseComp variance is explained by *token usage alone*. Fresh context per subagent is the main mechanism.

Concept

The pattern

                 ┌──────────────┐
                 │   Lead       │  plans, decomposes,
                 │  (Opus 4)    │  synthesizes
                 └──┬────┬───┬──┘
                    │    │   │
            ┌───────┘    │   └───────┐
            ▼            ▼           ▼
      ┌─────────┐  ┌─────────┐  ┌─────────┐
      │ Worker1 │  │ Worker2 │  │ Worker3 │
      │(Sonnet) │  │(Sonnet) │  │(Sonnet) │
      └─────────┘  └─────────┘  └─────────┘
         fresh       fresh        fresh
         context     context      context

The lead never reads the raw materials. The workers never see each other's work until the lead synthesizes. Each arrow is a handoff with a narrow artifact.

Why it wins

Three mechanisms:

  1. Fresh context per subagent. A worker exploring "FIPA-ACL heritage" does not carry the 40k tokens the lead spent planning. It gets a 200k window for one question.
  2. Specialization via prompt. The lead's prompt is "decompose and synthesize," not "research." Each worker's prompt is narrow: "find what changed in X." Focused prompts produce focused outputs.
  3. Parallelism. Workers run concurrently. Wall-clock time is roughly max(worker_times) + plan + synthesis, not sum(worker_times).

Engineering lessons (Anthropic 2025)

The Anthropic post lists several production lessons that are still 2026-relevant:

The LangGraph turn

LangGraph originally shipped a langgraph-supervisor library with a high-level create_supervisor helper. In 2025 LangChain moved the recommendation to implementing the supervisor pattern via tool-calling directly, because tool calls give more control over *what the supervisor sees* (context engineering). The library still works; the docs now recommend the tool-calling form.

The failure modes

When supervisor is wrong

Build It

code/main.py implements a supervisor of three parallel workers using threading. The lead decomposes a query into sub-questions, workers run concurrently on each sub-question, and the lead synthesizes. No real LLMs — the workers are scripted to simulate fetch-and-summarize.

Key structure:

Run:

python3 code/main.py

Output shows the plan, the parallel worker traces with start/end timestamps, and the final synthesis. You can see the wall-clock wins: three 0.3-second workers run in ~0.35 seconds, not 0.9.

Use It

outputs/skill-supervisor-designer.md takes a user query and produces a supervisor-pattern design: lead system prompt, worker roles, sub-question decomposition rules, and the synthesis template. Use this before building a new research-style agent system.

Ship It

Checklist before deploying a supervisor pattern:

Exercises

  1. Run code/main.py, then modify the lead to spawn 5 workers instead of 3. Observe the wall-clock effect. At what worker count does spawn overhead exceed parallel savings in this demo?
  2. Implement a worker timeout: kill any worker that runs longer than 0.5 seconds and have the lead synthesize the remaining results. What observability do you need to know a worker was cut?
  3. Add a conflict-detection step to the lead's synthesis: if two workers return contradictory answers, the lead notes the disagreement rather than picking one. How do you detect contradiction without calling an LLM?
  4. Read Anthropic's Research-system engineering post. List three practices that this toy demo would need to adopt to run in production.
  5. Compare LangGraph's create_supervisor (legacy) vs the new tool-calling recommendation. Which gives you better control over what the supervisor sees? Why does Anthropic explicitly pass only sub-answers and not raw worker context into synthesis?

Key Terms

Term What people say What it actually means
Supervisor "Lead agent" An orchestrator agent that plans, delegates, and synthesizes. Does not do the work itself.
Worker "Subagent" A focused agent invoked by the supervisor with narrow scope and its own context window.
Orchestrator-worker "Supervisor pattern" Same thing, different name. The 2026 literature uses both.
Fresh context "Clean window" A worker's context starts from its system prompt and assigned question, not the lead's history.
Rainbow deployment "Gradual rollout" Long-running stateful agents need versioned drain-and-replace, not blue-green.
Token dominance "Context is the variable" 80% of research-eval variance comes from total tokens used, not model choice, per Anthropic.
Scale effort "Match agent count to complexity" Lead estimates query difficulty, spawns 1 vs 10+ workers accordingly.
Synthesis conflict "Workers disagree" Two workers return contradictory facts; the lead must surface disagreement, not silently pick one.

Further Reading