Multi-Agent Debate and Collaboration

> Du et al. (ICML 2024, "Society of Minds") run N model instances that independently propose answers, then critique one another over R rounds until they converge. The method improves factuality, rule-following, and reasoning. Follow-up work shows that sparse topologies can match full-mesh accuracy at lower token cost.

Type: Learn + Build

Languages: Python (stdlib)

Prerequisites: Phase 14 · 12 (Workflow Patterns), Phase 14 · 05 (Self-Refine and CRITIC)

Time: ~60 minutes

Learning Objectives

The Problem

Self-Refine (Lesson 05) has one model critique itself, which risks reinforcing its own errors. CRITIC (Lesson 05) grounds critique in external tools, which are not always available. Debate introduces a third mode: multiple instances critique one another and converge by working through their disagreements.

The Concept

Society of Minds (Du et al., ICML 2024)

The original experiments used N=3 agents and R=2 rounds to limit cost. On hard problems (MMLU, GSM8K, Chess Move Validity, biography generation), accuracy improves with more agents and more rounds.
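
The propose-then-critique loop described above can be sketched with scripted stand-ins for the model calls. This is a minimal illustration, not the paper's implementation and not the contents of code/main.py: `make_scripted_agent`, `debate`, and the majority-vote update and aggregation are all assumptions made for the example.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface: (question, own_previous, peer_answers) -> answer.
# In a real system this would wrap an LLM call; here it is scripted.
Agent = Callable[[str, Optional[str], List[str]], str]


def make_scripted_agent(initial: str) -> Agent:
    """Scripted debater: proposes `initial`, then sides with the majority it sees."""
    def agent(question: str, own_previous: Optional[str], peer_answers: List[str]) -> str:
        if own_previous is None:
            # Round 0: independent proposal, no peer context.
            return initial
        # Later rounds: count own previous answer plus every peer answer.
        votes = Counter([own_previous] + peer_answers)
        return votes.most_common(1)[0][0]
    return agent


def debate(question: str, agents: List[Agent], rounds: int) -> Tuple[str, List[List[str]]]:
    """Run a full-mesh debate and return (final_answer, per-round answers)."""
    answers = [agent(question, None, []) for agent in agents]
    history = [answers]
    for _ in range(rounds):
        # Synchronous update: every agent reads the previous round only.
        answers = [
            agent(question, answers[i], answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        history.append(answers)
    # Assumed aggregation rule: majority vote over the last round.
    final = Counter(answers).most_common(1)[0][0]
    return final, history


question = "What is 7 + 5?"
agents = [make_scripted_agent(a) for a in ("12", "12", "15")]  # N=3
final, history = debate(question, agents, rounds=2)            # R=2
for r, round_answers in enumerate(history):
    print(f"round {r}: {round_answers}")
print("final:", final)
```

A real debater would prompt a model with its peers' answers and reasoning rather than count votes. The synchronous update, where each agent reads only the previous round, is one reasonable design choice; it keeps the result independent of agent ordering.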

Cross-model combinations can beat single-model debates: in the paper, ChatGPT and Bard debating together outperform either model alone.

Sparse topology

"Improving Multi-Agent Debate with Sparse Communication Topology" (arXiv:2406.11776, 2024-2025) showed that full-mesh debate is not always optimal. Sparse topologies, such as a ring or a star (also called hub-and-spoke), can match its accuracy at lower token cost because each debater sees only a subset of its peers.

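The difference between topologies can be made concrete by writing each one as a map from a debater to the peers it reads. The sketch below is illustrative only: the function names are hypothetical, the star assumes the hub reads every spoke, and "reads per round" is used as a rough proxy for input token cost under the assumption that every peer answer is about the same length.

```python
from typing import Dict, List

Topology = Dict[int, List[int]]  # debater index -> indices of peers it reads


def full_mesh(n: int) -> Topology:
    """Every debater reads every other debater."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def ring(n: int) -> Topology:
    """Each debater reads its two ring neighbours (deduplicated for small n)."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}


def star(n: int, hub: int = 0) -> Topology:
    """Spokes read only the hub; the hub is assumed to read every spoke."""
    spokes = [j for j in range(n) if j != hub]
    return {i: (spokes if i == hub else [hub]) for i in range(n)}


def reads_per_round(topology: Topology) -> int:
    """Total peer answers read in one round (proxy for input tokens)."""
    return sum(len(peers) for peers in topology.values())


n = 5
for name, build in (("full mesh", full_mesh), ("ring", ring), ("star", star)):
    print(f"{name:9s} N={n}: {reads_per_round(build(n))} reads per round")
```

Under these assumptions, full mesh grows as N(N-1) reads per round while the ring and the star grow linearly in N. With N=3 a ring is identical to full mesh, so the savings only appear with more debaters.
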
Implications:

When debate helps

When debate hurts

2026 practical instantiations

Where this pattern goes wrong

Build It

code/main.py implements stdlib debate:

Run it:

python3 code/main.py

Output: per-protocol accuracy and cost. Sparse matches full mesh on 2 of the 3 questions at lower cost.

Use It

Ship It

outputs/skill-debate.md scaffolds a multi-agent debate with configurable topology, N, R, and a convergence rule.
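
One possible shape for the configurable pieces named above (topology, N, R, and a convergence rule) is sketched below. The `DebateConfig` fields, the `agreement_threshold` parameter, and `has_converged` are hypothetical names chosen for illustration; they are not taken from outputs/skill-debate.md.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DebateConfig:
    """Hypothetical debate settings; field names are assumptions."""
    n_agents: int = 3                 # N
    max_rounds: int = 2               # R
    topology: str = "full_mesh"       # e.g. "full_mesh", "ring", "star"
    agreement_threshold: float = 1.0  # fraction of debaters that must agree


def has_converged(answers: List[str], threshold: float) -> bool:
    """True when the most common answer reaches the agreement threshold."""
    if not answers:
        return False
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers) >= threshold


config = DebateConfig(n_agents=3, max_rounds=2, agreement_threshold=0.6)
print(has_converged(["12", "12", "15"], config.agreement_threshold))
print(has_converged(["12", "12", "15"], 1.0))
```

Checking this rule after each round allows a debate to stop early once enough debaters agree, instead of always spending all R rounds.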

Exercises

  1. Implement a "forced disagreement" rule: in round 1, every debater must produce a distinct proposal. Measure effect on convergence speed.
  2. Add a confidence-weighted aggregation: debaters return (answer, confidence); aggregator weights by confidence. Does it help?
  3. Swap one "agent" for a different scripted LLM with different opinions. Does heterogeneity improve accuracy?
  4. Measure token cost for full mesh vs sparse on your 3 questions. Plot cost vs accuracy.
  5. Read the Society of Minds paper. Port your toy to N=5, R=3. What breaks? What gets better?

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Debate | "Multi-agent critique" | N proposers, R rounds of cross-critique, converge |
| Full mesh | "Everyone reads everyone" | Every debater reads every peer each round |
| Sparse topology | "Limited peer view" | Debaters read only a subset of peers |
| Hub-and-spoke | "Star topology" | One central debater, N-1 spokes read only the hub |
| Convergence | "Agreement" | Debaters converge on a shared answer |
| Society of Minds | "Du et al. debate paper" | ICML 2024 multi-agent debate method |

Further Reading