← Swarm Optimization for LLMs (PSO, ACO)Agent Economies, Token Incentives, Reputation →

MARL — MADDPG, QMIX, MAPPO

> The reinforcement-learning heritage of multi-agent coordination, which still informs LLM-agent systems in 2026. MADDPG (Lowe et al., NeurIPS 2017, arXiv:1706.02275) introduced Centralized Training, Decentralized Execution (CTDE): each critic sees all agents' states and actions during training; at test time only local actors run. Works for cooperative, competitive, and mixed settings. QMIX (Rashid et al., ICML 2018, arXiv:1803.11485) is value-decomposition with a monotonic mixing network; per-agent Qs combine into joint Q so argmax distributes cleanly — dominant on StarCraft Multi-Agent Challenge (SMAC). MAPPO (Yu et al., NeurIPS 2022, arXiv:2103.01955) is PPO with a centralized value function; "surprisingly effective" on particle-world, SMAC, Google Research Football, Hanabi with minimal tuning. These underpin training policies for agent teams that must act decentrally. MAPPO is the default 2026 cooperative-MARL baseline. This lesson builds each from a small grid-world toy and lands the three ideas in muscle memory before touching LLM-agent training.

Type: Learn

Languages: Python (stdlib, small NumPy-free implementations)

Prerequisites: Phase 09 (Reinforcement Learning), Phase 16 · 09 (Parallel Swarm Networks)

Time: ~90 minutes

Problem

LLM-agent systems increasingly train policies for inter-agent coordination: when to defer, when to act, which peer to call. The literature that tells you how to train such policies is Multi-Agent Reinforcement Learning (MARL), which predates the LLM wave and has a small set of dominant algorithms.

Reading MARL papers without the pattern vocabulary is painful. Centralized training with decentralized execution (CTDE), value decomposition, and centralized critics are not buzzwords — they are specific answers to specific problems:

Independent RL (each agent learns alone) is non-stationary from each agent's perspective. Bad.
Centralized RL (one agent controls all) does not scale and violates execution constraints.
CTDE gets the best of both: train with global information, deploy with local policies.

Concept

Three environments the papers use

Particle World (multi-agent particle env). Simple 2D physics with cooperative/competitive tasks. MADDPG's original testbed.
StarCraft Multi-Agent Challenge (SMAC). Cooperative micro-management, partial observation. QMIX's testbed. Discrete actions, continuous states.
Google Research Football, Hanabi, MPE. MAPPO baselines.

Different envs have different action/observation types. The algorithms pick accordingly.

MADDPG (2017) — the CTDE pattern

Each agent i has an actor mu_i(o_i) that maps its own observation to action. Each agent also has a critic Q_i(x, a_1, ..., a_n) that sees all observations and all actions during training. The actor is updated by policy gradient against the critic's evaluation.

actor update:    grad_theta_i J = E[grad_theta mu_i(o_i) * grad_a_i Q_i(x, a_1..n) at a_i=mu_i(o_i)]
critic update:   TD on Q_i(x, a_1..n) given next-state joint estimate

Why CTDE: at training time, we know everyone's actions; we use that to reduce variance in each critic. At deploy time, each agent only sees o_i and calls mu_i(o_i).

Failure mode: critics grow with N agents (input includes all actions). Does not scale past ~10 agents without approximations.

QMIX (2018) — value decomposition

Cooperative only. Global reward is the sum of a monotone function of per-agent Q-values:

Q_tot(tau, a) = f(Q_1(tau_1, a_1), ..., Q_n(tau_n, a_n)),   df/dQ_i >= 0

The monotonicity guarantees argmax_a Q_tot can be computed by each agent choosing argmax_{a_i} Q_i independently. That is exactly the decentralized execution property you need. At training time, a mixing network produces Q_tot from the per-agent Qs.

Why QMIX wins on SMAC: cooperative StarCraft micro-management has homogeneous agents, local obs, global reward — perfect fit for value decomposition.

Failure mode: the monotonicity constraint is restrictive; some tasks have reward structures that are not monotone decomposable (one agent sacrificing for the team). Extensions (QTRAN, QPLEX) relax this.

MAPPO (2022) — the overlooked default

Multi-Agent PPO: PPO with a centralized value function. Each agent has its own policy; all agents share (or have per-agent) value functions that see the full state. Yu et al. 2022 benchmarked MAPPO against MADDPG, QMIX, and their extensions on five benchmarks and found:

MAPPO matches or beats off-policy MARL methods on particle-world, SMAC, Google Research Football, Hanabi, MPE.
Minimal hyperparameter tuning required.
Stable training; reproducible across seeds.

The community underrated on-policy MARL until this paper. In 2026, MAPPO is the default baseline for cooperative MARL; any new method must beat it.

Why LLM-agent engineers should care

Three direct uses:

Router training. A meta-agent chooses which sub-agent handles a task. This is a MARL problem with N decentralized sub-agents and one centralized router. MAPPO fits.
Role emergence. In generative-agent simulations, training agents to adopt complementary roles over time is a MARL problem in disguise. QMIX-style value decomposition forces complementarity by construction.
Multi-agent tool use. When agents share tools and compete for budget, training them via CTDE produces deployable local policies that respect resource constraints.

Practical caveat: in 2026, most production LLM-agent systems prompt their policies rather than train them. MARL comes in when you have (a) lots of interaction data, (b) a clear reward signal, and (c) willingness to invest in training infrastructure.

CTDE as a design pattern beyond RL

Even without training, CTDE is a useful architectural pattern:

During *design*, assume full team visibility.
At *runtime*, enforce decentralized execution: each agent sees only o_i.

The pattern forces you to keep per-agent state explicit and to think about partial observability up front. Many production multi-agent systems silently assume shared state everywhere — CTDE discipline prevents that.

The non-stationarity problem

When multiple agents learn simultaneously, each agent's environment (which includes others' policies) is non-stationary. Classical single-agent RL proofs break. The MARL algorithms in this lesson all address this:

MADDPG: global critic sees all actions, so its value estimate is stationary.
QMIX: value decomposition moves learning to a joint-Q space where optimality is well-defined.
MAPPO: the centralized value function dampens variance from others' policy changes.

In LLM-agent systems, non-stationarity manifests as "my agent worked last month, now that other agent upstream changed, mine misbehaves." Training MARL with CTDE is the principled fix; prompt-level fixes are faster but less durable.

What this lesson does NOT cover

Training actual networks is a Phase 09 topic. This lesson builds scripted-policy versions that demonstrate the CTDE, value-decomposition, and centralized-value patterns without gradient updates. The goal is to internalize the patterns before you pick up a full MARL library (PyMARL, MARLlib, RLlib multi-agent).

Build It

code/main.py implements three pattern demonstrations, all on a tiny 2-agent cooperative grid-world:

Environment: 2 agents on a 4x4 grid, one reward pellet. Reward = 1 if any agent reaches pellet; task finishes.
IndependentAgents — each agent treats others as environment. Baseline.
MADDPGStyle — centralized critic computes a joint value; actor policies update from it. Scripted policy improvement.
QMIXStyle — value decomposition with a monotone mixer.
MAPPOStyle — centralized value function; policies update against the shared baseline.

All four run the same episodes and report average steps-to-goal. The CTDE variants converge to shorter paths than the independent baseline.

Run:

python3 code/main.py

Expected output: independent agents take ~6 steps on average; CTDE variants converge toward ~3.5 steps (optimal for the 4x4 grid is 3). The pattern difference shows up despite scripted policies.

Use It

outputs/skill-marl-picker.md is a skill that picks a MARL algorithm for a given multi-agent task: cooperative vs competitive, homogeneous vs heterogeneous, action-space type, scale, reward signal.

Ship It

MARL in production is rare. When you do use it:

Start with MAPPO. The 2022 paper established this as the baseline; reproducing it first saves weeks of chasing fancier methods.
Log every agent's observation and action stream. Debugging MARL without per-agent traces is hopeless.
Separate training code from execution code. CTDE is a discipline; let the execution path really only see o_i.
Reward shaping warning. MARL is exquisitely sensitive to reward design. One coordination bug in the shaping and agents learn to exploit it. Run adversarial tests.
For LLM agents, consider prompt-level policies first. Only invest in MARL training when interaction data + reward signal + infrastructure are all present.

Exercises

Run code/main.py. Measure the steps-to-goal gap between independent and MAPPO-style agents. Does the gap grow or shrink on a 6x6 grid?
Implement a competitive variant: two agents, one pellet, only the first to reach gets reward. Which pattern handles competition cleanly? MADDPG historically.
Read MADDPG (arXiv:1706.02275) Section 3. Implement the exact critic update rule symbolically in pseudocode in your own words.
Read MAPPO (arXiv:2103.01955). Why do the authors argue centralized value + PPO beats off-policy MARL on their benchmarks? List the three strongest claims.
Apply CTDE as a design pattern to a hypothetical LLM-agent system (e.g., research agent + summarizer + coder). What is the joint information available at design time that is not available at runtime?

Key Terms

Term	What people say	What it actually means
MARL	"Multi-Agent RL"	Reinforcement learning for multi-agent systems.
CTDE	"Centralized Training, Decentralized Execution"	Train with global info; deploy with local policies.
MADDPG	"Multi-Agent DDPG"	CTDE with per-agent critic seeing all observations + actions.
QMIX	"Value decomposition"	Monotonic mixing of per-agent Qs. Cooperative.
MAPPO	"Multi-Agent PPO"	PPO with centralized value function. 2026 default baseline.
Value decomposition	"Sum of individual Qs"	Joint Q represented as a monotone function of per-agent Qs.
Non-stationarity	"Moving targets"	Each agent's env changes as others learn. The core MARL problem.
On-policy / off-policy	"Learn from current / replay"	PPO is on-policy (MAPPO); DDPG and Q-learning are off-policy.
SMAC	"StarCraft Multi-Agent Challenge"	Cooperative micromanagement benchmark; QMIX's homegrown ground.