RL for Games — AlphaZero, MuZero, and the LLM-Reasoning Era

> 1992: TD-Gammon reached near-champion strength at backgammon with pure TD learning. 2016: AlphaGo beat Lee Sedol. 2017: AlphaZero dominated chess, shogi, and Go from scratch. 2025: DeepSeek-R1 showed that a similar recipe, with GRPO replacing PPO, works on reasoning. Games are the benchmark that drives every breakthrough in this phase.

Type: Build

Languages: Python

Prerequisites: Phase 9 · 05 (DQN), Phase 9 · 08 (PPO), Phase 9 · 09 (RLHF), Phase 9 · 10 (MARL)

Time: ~120 minutes

The Problem

Games have everything RL wants. Clean reward (win/loss). Infinite episodes (self-play resets). Perfect simulation (the game *is* the simulator). Discrete or small continuous action spaces. Multi-agent structure that forces adversarial robustness.

And games are how every major RL breakthrough was tested. TD-Gammon (backgammon, 1992). Atari-DQN (2013). AlphaGo (2016). AlphaZero (2017). OpenAI Five (Dota 2, 2019). AlphaStar (StarCraft II, 2019). MuZero (learned model, 2019). AlphaTensor (matrix multiplication, 2022). AlphaDev (sorting algorithms, 2023). DeepSeek-R1 (math reasoning, 2025) — the latest demonstration that game-RL techniques work on text.

This capstone surveys three landmark methods, AlphaZero, MuZero, and GRPO, through a single unifying lens: self-play + search + policy improvement. MuZero generalizes AlphaZero to games whose rules are not given; GRPO applies the same recipe to LLM reasoning, with tokens as actions, group sampling in place of tree search, and mathematical verification as the win signal.

The Concept

AlphaZero ↔ MuZero ↔ GRPO: same loop, different environments

The unifying loop.

while True:
    trajectory = self_play(policy_net, search)          # play a game against self
    policy_target = search.improved_policy(trajectory)  # search improves the raw policy
    value_target = trajectory.outcome                   # final result (win/loss or verifier score)
    policy_net.update(policy_target, value_target)      # supervised on search output and outcome

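AlphaZero (2017). Silver et al. Given a game (chess, shogi, Go) with known rules, a single policy-value network guides MCTS during self-play and is then trained to match the MCTS visit counts (policy target) and the final game outcome (value target).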

Zero human knowledge. Zero handcrafted heuristics. A single recipe that mastered chess, shogi, and Go after a few tens of millions of self-play games each.
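The search step inside self-play picks which move to explore next with the PUCT rule listed under Key Terms, Q + c · p · √N / (1 + N_a). The following is a minimal sketch of that selection step only, assuming each child stores a prior, a visit count, and a running value sum. The `Child` class, the `puct_select` name, and `c_puct=1.5` are illustrative choices, not values from the AlphaZero paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Child:
    prior: float            # p: policy-network prior for this move
    visits: int = 0         # N_a: how often this move was explored
    value_sum: float = 0.0  # sum of backed-up values

    def q(self) -> float:
        # Mean value so far; unvisited children default to 0.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(children: dict, c_puct: float = 1.5):
    """Return the move with the highest Q + c * p * sqrt(N) / (1 + N_a)."""
    total = sum(ch.visits for ch in children.values())  # N: parent visit count

    def score(ch: Child) -> float:
        return ch.q() + c_puct * ch.prior * math.sqrt(total) / (1 + ch.visits)

    return max(children, key=lambda move: score(children[move]))


children = {
    "a": Child(prior=0.6, visits=10, value_sum=2.0),  # well explored, mediocre value
    "b": Child(prior=0.3, visits=1, value_sum=0.5),   # promising, barely explored
    "c": Child(prior=0.1, visits=0),                  # never tried
}
print(puct_select(children))
```

The exploration bonus shrinks as a move accumulates visits, so a move with a strong early value and few visits can outrank the move with the highest prior.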

MuZero (2019). Schrittwieser et al. Removes the requirement that the rules are known. It learns three functions and runs MCTS inside the learned latent space:

- h(s): encode observation to latent state.

- g(s_latent, a): predict next latent state + reward.

- f(s_latent): predict policy prior + value.
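To make the division of labour between h, g, and f concrete, here is a toy sketch of how the three functions chain together when MuZero imagines a trajectory. The function bodies are hand-written stand-ins (assumptions for illustration), where the real system uses learned neural networks; only the call pattern is the point.

```python
import math


def h(observation):
    """Representation: observation -> latent state (toy: rescale)."""
    return [x / 10.0 for x in observation]


def g(latent, action):
    """Dynamics: (latent, action) -> (next latent, predicted reward)."""
    nxt = [math.tanh(x + 0.1 * action) for x in latent]
    reward = sum(nxt) / len(nxt)
    return nxt, reward


def f(latent, num_actions=2):
    """Prediction: latent -> (policy prior, value)."""
    logits = [sum(latent) * (a + 1) for a in range(num_actions)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    prior = [e / z for e in exps]
    value = math.tanh(sum(latent))
    return prior, value


def unroll(observation, actions):
    """Imagine a trajectory: only the first step touches a real observation."""
    latent = h(observation)
    out = []
    for a in actions:
        prior, value = f(latent)        # what search would read at this node
        latent, reward = g(latent, a)   # step the learned model, not the game
        out.append((prior, value, reward))
    return out


steps = unroll([1.0, 2.0, 3.0], actions=[0, 1, 1])
for prior, value, reward in steps:
    print([round(p, 3) for p in prior], round(value, 3), round(reward, 3))
```

After the initial call to h, the environment is never consulted again, which is why the rules do not need to be known at search time.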

Stochastic MuZero (2022). Adds stochastic dynamics and chance nodes; extends to backgammon-class games.

Muesli, Gumbel MuZero (2021-2022). Muesli improves sample efficiency without deep search; Gumbel MuZero keeps policy improvement reliable when only a few simulations are available.

GRPO (2024-2025). DeepSeek-R1 recipe. Same AlphaZero-shaped loop, applied to language-model reasoning:

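L_GRPO(θ) = -E_{q, {o_i}} [ (1/G) Σ_i A_i · log π_θ(o_i | q) ] + β · KL(π_θ || π_ref)

where A_i = (r_i - mean(r)) / std(r) is computed over the G completions sampled for the same prompt q.

No learned reward model, no critic, no MCTS. A rule-based verifier supplies the reward, the group mean replaces the critic's baseline, and sampling G completions per prompt stands in for search. The DeepSeekMath paper presents GRPO as a PPO variant that drops the value network, cutting memory and compute, while still improving reasoning benchmarks.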
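A small worked example may help with the loss above. The sketch below computes the group-relative advantage for one group of G=4 completions and the first term of L_GRPO; the KL term is left out, and the function names `group_advantages` and `grpo_surrogate` are made up for this illustration.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / (std + eps), computed within one prompt's group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_surrogate(logps, rewards):
    """-(1/G) * sum_i A_i * log pi(o_i | q); the KL penalty is omitted here."""
    advs = group_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advs, logps)) / len(logps)


rewards = [1.0, 0.0, 0.0, 1.0]     # two verified-correct completions out of four
advs = group_advantages(rewards)   # roughly [+1, -1, -1, +1]
print([round(a, 3) for a in advs])

# A group where every completion gets the same reward carries no signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because the advantages within a group sum to zero, a group in which every completion is right (or every one is wrong) contributes no gradient at all.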

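The R1 recipe in full. DeepSeek-R1 (DeepSeek 2025) is two models in one paper: R1-Zero, trained with GRPO directly on the base model with no SFT, and R1 itself, which uses a four-stage pipeline: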

1. Cold-start SFT. Collect a few thousand long-CoT demonstrations with clean formatting. Supervised-finetune the base model on them. This gives a readable starting point.

2. Reasoning-oriented GRPO. Apply GRPO with rule-based accuracy and format rewards plus a *language-consistency* reward to prevent code-switching.

3. Rejection sampling + SFT round 2. Sample ~600K reasoning trajectories from the RL checkpoint, keep only those with correct final answers and readable CoT, and combine with ~200K non-reasoning SFT examples (writing, QA, self-cognition). Fine-tune the base again.

4. Full-spectrum GRPO. One more RL round covering both reasoning (rule-based rewards) and general alignment (helpfulness/harmlessness preference-based rewards).
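Stage 3 is the least familiar step for readers coming from game RL, so here is a minimal sketch of a rejection-sampling filter. It assumes each sampled trajectory is a dict with `prompt`, `cot`, and `answer` fields, and it uses a deliberately crude readability check (non-empty CoT, no mixed-language flag). Those field names and the readability proxy are assumptions for illustration, not the paper's actual filter.

```python
def rejection_sample(trajectories, reference_answers):
    """Keep only trajectories with a correct final answer and a readable CoT."""
    kept = []
    for t in trajectories:
        correct = t["answer"] == reference_answers[t["prompt"]]
        readable = t["cot"].strip() != "" and not t.get("mixed_language", False)
        if correct and readable:
            kept.append({"prompt": t["prompt"],
                         "completion": t["cot"] + "\n" + t["answer"]})
    return kept


trajectories = [
    {"prompt": "p1", "cot": "2 + 2 is 4", "answer": "4"},  # correct and readable
    {"prompt": "p1", "cot": "just a guess", "answer": "5"},  # wrong answer
    {"prompt": "p2", "cot": "   ", "answer": "9"},           # correct, empty CoT
]
reference_answers = {"p1": "4", "p2": "9"}

reasoning_sft = rejection_sample(trajectories, reference_answers)
general_sft = [{"prompt": "Write a haiku", "completion": "..."}]  # non-reasoning data
sft_mix = reasoning_sft + general_sft
print(len(reasoning_sft), len(sft_mix))
```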

The result is comparable to OpenAI o1 on AIME 2024 and MATH-500, with open weights. The same paper also releases six distilled dense models (Qwen-1.5B through Llama-70B) trained by SFT on R1's reasoning traces, with no RL at the student. In the paper's comparison, distilling from the strong RL teacher beat running RL from scratch at the student's scale.

Why GRPO instead of PPO for reasoning. Three reasons in the DeepSeekMath paper (Feb 2024): (1) no value network to train, halving memory; (2) the group baseline naturally handles the sparse end-of-trajectory reward that reasoning tasks produce; (3) per-prompt normalization makes advantages comparable across problems of wildly different difficulty, which PPO's single critic cannot.
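Reason (3) is easier to see with numbers. The sketch below contrasts a single global baseline with per-prompt normalization on two made-up reward groups, one easy prompt (7 of 8 correct) and one hard prompt (1 of 8 correct). The data and the helper name `normalize` are illustrative assumptions.

```python
import statistics


def normalize(rewards, eps=1e-8):
    """Per-prompt (group) normalization of rewards."""
    mean = sum(rewards) / len(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


easy = [1.0] * 7 + [0.0]   # the model almost always solves this prompt
hard = [0.0] * 7 + [1.0]   # the model almost never solves this prompt

# One global baseline: every success is worth +0.5, every failure -0.5,
# no matter how surprising it is for that prompt.
global_mean = sum(easy + hard) / 16
print([r - global_mean for r in hard][-1], [r - global_mean for r in easy][-1])

# Per-prompt normalization: the rare outcome in each group gets the big signal.
easy_adv, hard_adv = normalize(easy), normalize(hard)
print(round(hard_adv[-1], 3), round(easy_adv[-1], 3))
```

With group normalization the lone success on the hard prompt and the lone failure on the easy prompt both receive advantages of magnitude about 2.6, while the expected outcomes receive about 0.38.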

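Search-free vs search-based. Games have branched: board games and combinatorial problems still pair a learned network with tree search (AlphaZero, MuZero), while large multiplayer strategy games and LLM reasoning train search-free policies with PPO or GRPO.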

Build It

The code in code/main.py implements GRPO in miniature: a bandit with multiple groups of samples. The algorithm is the same as on an LLM; only the policy and environment are simpler. It teaches the *loss* and the *group-relative advantage*, the innovation DeepSeek introduced in 2024.

Step 1: a tiny verifier environment

QUESTIONS = [
    {"prompt": "q1", "correct": 3},
    {"prompt": "q2", "correct": 1},
]

def verify(prompt_idx, answer_token):
    return 1.0 if answer_token == QUESTIONS[prompt_idx]["correct"] else 0.0

In real GRPO the verifier runs unit tests or checks math equality.
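As an illustration of the math-equality case, here is a hedged sketch of a verifier that combines a format check with an exact numeric comparison. It assumes the completion wraps its final answer in `<answer>...</answer>` tags and that answers are plain numbers or fractions; both conventions, and the name `math_verify`, are assumptions for this example rather than a fixed standard.

```python
import re
from fractions import Fraction

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def math_verify(completion, expected):
    """Return 1.0 if the tagged answer equals `expected` exactly, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # format failure: no tagged answer to check
    try:
        # Fraction parses "3", "0.5", and "2/4" and compares them exactly.
        return 1.0 if Fraction(match.group(1).strip()) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answer counts as wrong


print(math_verify("Half of one is <answer>2/4</answer>", "0.5"))
print(math_verify("The answer is 0.5", "0.5"))
print(math_verify("<answer>x + 1</answer>", "0.5"))
```

Exact rational comparison avoids rewarding or punishing answers because of floating-point formatting, at the cost of rejecting symbolic answers.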

Step 2: policy: softmax over K answer tokens per prompt

def policy_probs(theta, p_idx):
    return softmax(theta[p_idx])

This is analogous to an LLM's final-layer softmax over the vocabulary, conditioned on a prompt.
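The snippets in this lesson call a few helpers (`softmax`, `sample`, `stddev`) without defining them. One possible set of pure-Python definitions is sketched below; the exact signatures are assumptions inferred from how the helpers are called, and code/main.py may define them differently.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng=None):
    """Draw an index from a categorical distribution; rng is a random.Random."""
    r = (rng or random).random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def stddev(values):
    """Population standard deviation (divides by N, not N - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


print(softmax([0.0, 0.0]), stddev([1.0, 1.0, 0.0, 0.0]))
```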

Step 3: group sampling and group-relative advantage

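def grpo_step(theta, reference, p_idx, G=8, beta=0.01, lr=0.1, rng=None):
    probs = policy_probs(theta, p_idx)
    samples = [sample(probs, rng) for _ in range(G)]
    rewards = [verify(p_idx, s) for s in samples]
    mean_r = sum(rewards) / G
    std_r = stddev(rewards) + 1e-8   # epsilon guards groups with identical rewards
    advs = [(r - mean_r) / std_r for r in rewards]

    # Policy-gradient ascent, summed over the group (the 1/G factor is folded into lr)
    for a, A in zip(samples, advs):
        for i in range(len(probs)):
            grad_i = (1.0 if i == a else 0.0) - probs[i]   # d log pi(a) / d theta_i for a softmax
            theta[p_idx][i] += lr * A * grad_i
    # KL penalty, approximated here by pulling the logits toward the reference logits
    for i in range(len(probs)):
        theta[p_idx][i] -= beta * (theta[p_idx][i] - reference[p_idx][i])
    return mean_r   # group mean reward, handy for logging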

The group-relative advantage is the 2024 DeepSeek trick. No critic needed. The "baseline" is the group mean, and normalization uses group std.
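To see the pieces run end to end, the following self-contained sketch trains the two-prompt bandit from Step 1 with the same update rule. It inlines its own helpers so it does not depend on code/main.py; the 300-step budget, the uniform reference policy, and the fixed seed are assumptions chosen for the demo.

```python
import math
import random

QUESTIONS = [{"prompt": "q1", "correct": 3}, {"prompt": "q2", "correct": 1}]
K = 4  # answer tokens per prompt


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=300, G=8, lr=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * K for _ in QUESTIONS]   # logits, start uniform
    reference = [row[:] for row in theta]    # frozen copy of the initial policy
    for _ in range(steps):
        for p_idx, q in enumerate(QUESTIONS):
            probs = softmax(theta[p_idx])
            samples = rng.choices(range(K), weights=probs, k=G)
            rewards = [1.0 if s == q["correct"] else 0.0 for s in samples]
            mean_r = sum(rewards) / G
            std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G) + 1e-8
            for a, r in zip(samples, rewards):
                adv = (r - mean_r) / std_r   # group-relative advantage
                for i in range(K):
                    theta[p_idx][i] += lr * adv * ((1.0 if i == a else 0.0) - probs[i])
            for i in range(K):               # logit-space pull toward the reference
                theta[p_idx][i] -= beta * (theta[p_idx][i] - reference[p_idx][i])
    return theta


theta = train()
for p_idx, q in enumerate(QUESTIONS):
    print(q["prompt"], round(softmax(theta[p_idx])[q["correct"]], 3))
```

The printed values are the probabilities the trained policy assigns to each correct token. The pull toward the reference keeps them from reaching exactly 1.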

Step 4: compare to REINFORCE baseline (value-free)

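Run plain REINFORCE (raw reward as the weight, no baseline) with the same setup and the same sample budget. On this bandit GRPO should converge faster and with less variance, because subtracting the group mean and dividing by the group std keeps the update scale steady.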
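One way to run that comparison is sketched below: a single-prompt bandit where the only difference between the two methods is the weight applied to each sample's log-probability gradient. The KL term is switched off so the weighting is the only variable; the 0.9 target, the 20 seeds, and the function name `steps_to_solve` are arbitrary choices for this demo.

```python
import math
import random

K, CORRECT = 4, 3  # one prompt, four answer tokens, token 3 is correct


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steps_to_solve(method, seed, G=8, lr=0.1, target=0.9, max_steps=2000):
    """Updates needed until P(correct) >= target; max_steps if never reached."""
    rng = random.Random(seed)
    theta = [0.0] * K
    for step in range(1, max_steps + 1):
        probs = softmax(theta)
        samples = rng.choices(range(K), weights=probs, k=G)
        rewards = [1.0 if s == CORRECT else 0.0 for s in samples]
        if method == "grpo":
            mean_r = sum(rewards) / G
            std_r = math.sqrt(sum((r - mean_r) ** 2 for r in rewards) / G) + 1e-8
            weights = [(r - mean_r) / std_r for r in rewards]
        else:  # plain REINFORCE: weight each sample by its raw reward
            weights = rewards
        for a, w in zip(samples, weights):
            for i in range(K):
                theta[i] += lr * w * ((1.0 if i == a else 0.0) - probs[i])
        if softmax(theta)[CORRECT] >= target:
            return step
    return max_steps


for method in ("reinforce", "grpo"):
    runs = [steps_to_solve(method, seed) for seed in range(20)]
    print(method, "mean steps:", sum(runs) / len(runs))
```

Comparing the two printed means, and the spread of `runs` across seeds, gives a rough view of sample efficiency and stability on this toy problem.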

Step 5: observe entropy and KL

Same diagnostics as RLHF: mean KL to reference, policy entropy, reward-over-time. Once these stabilize, training is done.
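For the tabular policy used here both diagnostics have closed forms. A minimal sketch, assuming the policy and the reference are given as probability lists over the same K tokens and that the reference assigns non-zero probability wherever the policy does:

```python
import math


def entropy(p):
    """Shannon entropy in nats; high means the policy is still exploring."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; how far the policy p has drifted from reference q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


reference = [0.25, 0.25, 0.25, 0.25]   # uniform starting policy
trained = [0.02, 0.02, 0.02, 0.94]     # policy after it has mostly converged

print(round(entropy(reference), 3), round(entropy(trained), 3))
print(round(kl_divergence(trained, reference), 3))
```

Logging these two numbers next to the mean reward every few updates is enough to spot the usual failure modes: entropy collapsing before reward rises, or KL growing without a matching reward gain.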

Pitfalls

Use It

The 2026 game-RL landscape, by domain:

| Domain | Dominant method |
|---|---|
| Two-player zero-sum board games (Go, chess, shogi) | AlphaZero / MuZero / KataGo |
| Imperfect-information card games (poker) | CFR + deep learning (DeepStack, Libratus, Pluribus) |
| Atari / pixel games | Muesli / MuZero / IMPALA-PPO |
| Large multiplayer strategy (Dota, StarCraft) | PPO + self-play + league (OpenAI Five, AlphaStar) |
| LLM math/code reasoning | GRPO (DeepSeek-R1, Qwen-RL, open replications) |
| LLM alignment | DPO / RLHF-PPO (the reward is a preference signal, not a verifiable check) |
| Robotics | PPO + domain randomization (not game-RL, but uses the same policy-gradient tools) |
| Combinatorial problems | AlphaZero variants (AlphaTensor, AlphaDev) |

The *recipe* — self-play, search-augmented improvement, policy distillation — spans text, pixels, and physical control. GRPO is the youngest instance; more are coming.

Ship It

Save as outputs/skill-game-rl-designer.md:

---
name: game-rl-designer
description: Design a game-RL or reasoning-RL training pipeline (AlphaZero / MuZero / GRPO) for a given domain.
version: 1.0.0
phase: 9
lesson: 12
tags: [rl, alphazero, muzero, grpo, self-play]
---

Given a target (perfect-info game / imperfect-info / Atari / LLM reasoning / combinatorial), output:

1. Environment fit. Known rules? Markov? Stochastic? Multi-agent? Informs AlphaZero vs MuZero vs GRPO.
2. Search strategy. MCTS (PUCT with learned prior), Gumbel-sampled, best-of-N, or none.
3. Self-play plan. Symmetric self-play / league / offline data / verifier-generated.
4. Target signal. Game outcome / verifier reward / preference / learned model. Include robustness plan.
5. Diagnostics. Win rate vs baseline, Elo curve, verifier pass rate, KL to reference.

Refuse AlphaZero on imperfect-info games (route to CFR). Refuse GRPO without a trusted verifier. Refuse any game-RL pipeline without a fixed baseline opponent set (self-play Elo is uncalibrated otherwise).

Exercises

  1. Easy. Implement the GRPO bandit in code/main.py. Train on 2 prompts × 4 answer tokens each. Converge in < 1,000 updates with G=8.
  2. Medium. Plug in PPO (clipped) and vanilla REINFORCE. Compare sample efficiency and reward variance to GRPO on the same bandit.
  3. Hard. Extend to a length-2 "reasoning chain": the agent emits two tokens and the verifier rewards the pair. Measure how GRPO handles the credit assignment across two-step sequences. (Hint: compute group advantage per *full sequence*, propagate to both token positions.)

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| MCTS | "Tree search with learned net" | Monte Carlo Tree Search; UCB1/PUCT selection with learned (p, v) priors. |
| AlphaZero | "Self-play + MCTS" | Policy-value net trained to match MCTS visits and game outcome. |
| MuZero | "Learned-model AlphaZero" | Same loop but in latent space via learned dynamics. |
| GRPO | "Critic-free PPO" | Group Relative Policy Optimization; REINFORCE with group-mean baseline + KL. |
| PUCT | "AlphaZero's UCB" | Q + c · p · √N / (1 + N_a); balances the value estimate with the prior. |
| Self-play | "Agent vs past self" | Standard for zero-sum; symmetric training signal. |
| League play | "Population-based self-play" | Past + current + exploiters sampled as opponents. |
| Verifier reward | "Verifiable RL" | Reward comes from a deterministic checker (tests pass, answer matches). |
| Process reward | "PRM" | Scores each reasoning step, not just the final answer. |

Further Reading