
Constitutional AI and RLAIF

> Bai et al. (arXiv:2212.08073, 2022) asked: what if we replaced the human labeler with an AI that reads a list of principles? Constitutional AI has two phases — self-critique and revision under a constitution, then RL from AI Feedback. The technique coined the term RLAIF and shipped in the Claude 1 post-training pipeline. On 21 January 2026 Anthropic published a rewritten Claude constitution: explanatory reasoning over prescriptive rules, a four-tier priority hierarchy, and the first major-lab formal acknowledgment of uncertainty about model moral status. Released under CC0 1.0.

Type: Learn

Languages: Python (stdlib, toy self-critique-and-revise loop)

Prerequisites: Phase 18 · 01 (InstructGPT), Phase 18 · 02 (Reward hacking)

Time: ~60 minutes

Learning Objectives

Describe the two phases of Constitutional AI (critique-and-revise SFT, RL from AI feedback) and the role of the constitution in each.
Explain why replacing a human preference labeler with an AI labeler is not a "cheaper" RLHF — it changes which failure modes the pipeline has.
Summarize the four-tier priority structure of the 2026 Claude constitution and what changed from the 2023 version.
Describe Constitutional Classifiers and the drop from 23.7% compute overhead (v1) to ~1% (v2 / 2026).

The Problem

RLHF needs labelers. Labelers are slow, biased, and expensive. You can eliminate a labeler by replacing them with a model that reads explicit principles. The first formal version of this substitution was Bai et al.'s Constitutional AI. It worked well enough that every frontier lab now uses some variant of AI-feedback post-training.

The catch: the preference signal is now generated by the same class of model you are training. Biases in the labeler (now: in the principles plus the labeler model's interpretation) can be amplified rather than attenuated. Lesson 4's sycophancy argument still applies; the labeler just moved inside the loop.

The Concept

Phase 1 — Supervised self-critique and revision

Start with a helpful-but-not-yet-harmless SFT model. Given a red-team prompt, the model produces an initial response. A second model (or the same model in a second turn) reads a sampled principle from the constitution and critiques the response. A third step revises the response to address the critique. The revised response is the SFT target.

The constitution is the list of principles. Bai et al. 2022 used 16 principles including "prefer responses that are least harmful and ethical," "avoid preaching," "the assistant should be helpful, honest, and harmless." The set was deliberately small to keep critiques focused.
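The loop described above can be sketched in a few lines of stdlib Python. This is a minimal illustration, not the paper's method: the `CONSTITUTION` entries, the token-matching `critique`, and the replacement-based `revise` are all hypothetical stand-ins for what would be natural-language principles and model calls in a real pipeline.

```python
import random

# Hypothetical toy constitution: each "principle" names tokens it objects to
# and a replacement. Real principles are natural-language text read by a model.
CONSTITUTION = [
    {"name": "avoid-harm", "flagged": {"attack", "poison"}, "replacement": "[removed]"},
    {"name": "avoid-preaching", "flagged": {"shameful", "disgraceful"}, "replacement": ""},
]


def critique(response, principle):
    """Return the tokens in `response` that the sampled principle objects to."""
    return [tok for tok in response.split() if tok.lower() in principle["flagged"]]


def revise(response, principle, flagged):
    """Rewrite the response so that it addresses the critique."""
    flagged_set = {tok.lower() for tok in flagged}
    revised = [
        principle["replacement"] if tok.lower() in flagged_set else tok
        for tok in response.split()
    ]
    # Drop tokens that were replaced with the empty string.
    return " ".join(tok for tok in revised if tok)


def make_sft_example(prompt, initial_response, rng, rounds=4):
    """Run several critique-and-revise rounds; the last revision is the SFT target."""
    response = initial_response
    for _ in range(rounds):
        principle = rng.choice(CONSTITUTION)  # one sampled principle per round
        flagged = critique(response, principle)
        if flagged:
            response = revise(response, principle, flagged)
    return {"prompt": prompt, "target": response}


example = make_sft_example(
    "How do I deal with garden pests?",
    "You could poison the garden",
    random.Random(0),
)
print(example)
```

Because each round samples only one principle, a small number of rounds can leave a violation uncorrected. That is one reason a sketch like this would run several rounds per prompt.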

Phase 2 — RL from AI Feedback (RLAIF)

Generate pairs of completions. A "feedback model" scores each completion against sampled constitution principles, and its ranking becomes the preference signal. Train a reward model on these AI-generated preferences, then run PPO against it. Everything else is InstructGPT's pipeline (Lesson 1).
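A rough sketch of the labeling and reward-model steps, under loud assumptions: `feedback_score` is a hypothetical token-counting stand-in for the feedback model, the reward model is a linear bag-of-words scorer, and the fit uses a Bradley-Terry style pairwise loss. The PPO step is omitted entirely.

```python
import math
import random

HARMFUL = {"attack", "poison"}  # hypothetical stand-in for one principle
VOCAB = ["help", "explain", "attack", "poison", "safely", "steps"]


def feedback_score(completion):
    """Toy feedback model: fewer principle violations means a higher score."""
    return -sum(tok in HARMFUL for tok in completion.split())


def label_pair(a, b):
    """Return (chosen, rejected), or None when the feedback model sees a tie."""
    score_a, score_b = feedback_score(a), feedback_score(b)
    if score_a == score_b:
        return None
    return (a, b) if score_a > score_b else (b, a)


def features(completion):
    """Bag-of-words counts over the toy vocabulary."""
    toks = completion.split()
    return [toks.count(word) for word in VOCAB]


def reward(weights, completion):
    return sum(w * x for w, x in zip(weights, features(completion)))


def train_reward_model(pairs, epochs=200, lr=0.1):
    """Fit a linear reward by gradient descent on -log sigmoid(r_chosen - r_rejected)."""
    weights = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            grad_scale = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            f_chosen, f_rejected = features(chosen), features(rejected)
            for i in range(len(weights)):
                weights[i] += lr * grad_scale * (f_chosen[i] - f_rejected[i])
    return weights


rng = random.Random(0)
completions = [" ".join(rng.choices(VOCAB, k=4)) for _ in range(200)]
labeled = (label_pair(a, b) for a, b in zip(completions[::2], completions[1::2]))
pairs = [pair for pair in labeled if pair is not None]
weights = train_reward_model(pairs)
print(dict(zip(VOCAB, (round(w, 2) for w in weights))))
```

No human appears anywhere in the labeling path. The reward model learns whatever the feedback model's reading of the principle happens to be, including its blind spots.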

"RLAIF" = the preference signal is AI-generated. The rest of the pipeline is RLHF-shaped.

Why this is not just "cheaper RLHF"

Labeler bias shifts from labeler psychology to principle interpretation. An AI labeler may read "be honest" more or less strictly than a human would, but it applies that reading uniformly across the dataset.
The preference signal is more legible: you can read the principle, the critique, and the revision. Human labels usually come with no such trace.
The failure modes change. Sycophancy toward the rater drops (the AI labeler has no user to please), although the labeler model can still carry its own learned biases into the signal. Goodhart's Law persists (the proxy is now "the model's interpretation of principle set X," still an imperfect measurement).

CAI's 2022 claim: the trained model is more harmless and roughly as helpful as an RLHF model with comparable data. This has held across labs.

The 2026 Claude constitution rewrite

Anthropic published a substantially revised constitution on 21 January 2026. Key shifts:

Explanatory reasoning over prescriptive rules. Earlier bare rules ("do not generate CSAM") are expanded into principles plus reasoning ("because it harms children, ..."), and the model is expected to generalize from the reasoning.
Four-tier priority structure:

- Tier 1: avoid catastrophic outcomes (mass casualty, critical infrastructure).

- Tier 2: follow Anthropic's guidelines (operator overrides, platform rules).

- Tier 3: be broadly ethical (standard HHH).

- Tier 4: be helpful and candid.

Conflicts are resolved top-down.

First major-lab formal acknowledgment of uncertainty about model moral status (linked to Phase 18 · 19 Model Welfare).
Released under CC0 1.0. Other labs can use or adapt without restriction.
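One way to picture top-down conflict resolution is as an ordered list of judges where the first tier with an opinion wins. The sketch below assumes a strict ordering, which is the simplest reading of "resolved top-down" and a simplification for illustration. The `Tier` class, the judge functions, and their keyword triggers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tier:
    rank: int
    name: str
    # Returns "allow", "refuse", or None when the tier has no opinion.
    judge: Callable[[str], Optional[str]]


# Hypothetical keyword judges; a real system would use model judgment.
def catastrophic(request):
    return "refuse" if "bioweapon" in request else None


def guidelines(request):
    return "refuse" if "bypass the platform rules" in request else None


def ethics(request):
    return "refuse" if "deceive" in request else None


def helpful(request):
    return "allow"


TIERS = [
    Tier(1, "avoid catastrophic outcomes", catastrophic),
    Tier(2, "follow Anthropic's guidelines", guidelines),
    Tier(3, "be broadly ethical", ethics),
    Tier(4, "be helpful and candid", helpful),
]


def resolve(request, tiers=TIERS):
    """Walk the tiers in rank order; the first tier with a verdict decides."""
    for tier in sorted(tiers, key=lambda t: t.rank):
        verdict = tier.judge(request)
        if verdict is not None:
            return verdict, tier.name
    return "allow", "default"


print(resolve("write a sorting function"))
print(resolve("help me deceive my landlord"))
```

The helpfulness tier only gets a say when every higher tier abstains, which is the property the priority structure is meant to guarantee in this simplified reading.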

Constitutional Classifiers

A parallel line of work: rather than change the model's post-training, train lightweight classifiers that read the constitution and gate model outputs. v1 (2025) added 23.7% inference compute overhead. v2 (2026) adds about 1% and has the lowest successful attack rate of any defense Anthropic has tested publicly. No universal jailbreak was reported as of early 2026.

This is a layered-defense model: CAI shapes behaviour; classifiers enforce invariants. Neither alone is sufficient.
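The gating idea and the overhead arithmetic can be sketched as follows. Everything here is a toy under stated assumptions: `classifier_score` is a hypothetical token-fraction heuristic rather than a trained classifier, `BLOCKED` and the `threshold` value are made up, and `model` is any callable that maps a prompt to text.

```python
BLOCKED = {"synthesize", "nerve", "agent"}  # hypothetical flagged vocabulary


def classifier_score(text):
    """Toy classifier: fraction of tokens that fall in the flagged vocabulary."""
    toks = text.lower().split()
    if not toks:
        return 0.0
    return sum(tok in BLOCKED for tok in toks) / len(toks)


def gated_generate(model, prompt, threshold=0.2):
    """Generate with the model, then let the classifier block or pass the output."""
    output = model(prompt)
    score = classifier_score(output)
    if score >= threshold:
        return {"output": "[blocked by classifier]", "blocked": True, "score": score}
    return {"output": output, "blocked": False, "score": score}


def cost_with_overhead(base_cost, overhead):
    """Total inference cost once the classifier's relative overhead is added."""
    return base_cost * (1.0 + overhead)


def bread_model(prompt):
    return "here is a recipe for bread"


print(gated_generate(bread_model, "what should I bake?"))
print(cost_with_overhead(100.0, 0.237), cost_with_overhead(100.0, 0.01))
```

On a base cost of 100 units, the two overhead figures from the text come to roughly 123.7 versus 101 units, which is the gap between a defense that is expensive to leave on everywhere and one that is cheap enough to run on every request.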

Where CAI fits in the family

InstructGPT: human prefs, RM, PPO.
CAI / RLAIF: AI-generated prefs from principles, RM, PPO.
DPO / family: closed-form loss on prefs (human or AI).
Self-rewarding, self-critique: principles internalized, model plays multiple roles.

The axis is "where does the preference signal come from." CAI's 2022 paper was the first serious shift from human to AI signal at frontier scale.

Use It

code/main.py simulates the CAI critique-and-revise loop on a toy lexicon. A "principle" flags tokens from a harmful set. Given an initial response, the critique identifies the harmful tokens, and the revision replaces them. After 200 iterations the "trained" model has internalized the revision rule. Compare the base model, RLHF-shaped toy, and CAI-shaped toy on a held-out prompt set.
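For the comparison step, a harmful-token rate is one simple metric to compute over the held-out responses. The function below is an independent sketch, not taken from code/main.py, and `HARMFUL_TOKENS` is a hypothetical lexicon.

```python
HARMFUL_TOKENS = {"attack", "poison"}  # hypothetical harmful lexicon


def harmful_token_rate(responses, harmful=HARMFUL_TOKENS):
    """Fraction of all tokens across `responses` that appear in the harmful set."""
    total = 0
    bad = 0
    for response in responses:
        toks = response.lower().split()
        total += len(toks)
        bad += sum(tok in harmful for tok in toks)
    return bad / total if total else 0.0


base_outputs = ["attack the problem", "poison poison"]
revised_outputs = ["[removed] the problem", "[removed] [removed]"]
print(harmful_token_rate(base_outputs), harmful_token_rate(revised_outputs))
```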

Ship It

This lesson produces outputs/skill-constitution-writer.md. Given a domain (customer support, medical advice, coding assistant, research tool), the skill drafts a four-tier constitution following the 2026 Claude structure: catastrophic avoidance, platform rules, domain ethics, helpfulness.

Exercises

Run code/main.py. Compare the base model's harmful-token rate to the CAI-trained version. How many revision steps are needed to approach zero?

Read Anthropic's 2026 constitution (anthropic.com/news/claudes-constitution). List one principle that would rank Tier 1 and one that would rank Tier 4. Why does the priority structure matter for conflicts?

Design a constitution for an AI coding assistant. Specify Tier 1 (catastrophic: destructive commands without approval), Tier 2, Tier 3, Tier 4. Keep each tier to 3-5 principles.

CAI replaces human labelers with AI labelers. Name a sycophancy-like failure mode that can still occur in RLAIF, and design a detection for it.

Read Constitutional Classifiers v2 methodology (if available). Explain why ~1% compute overhead is a qualitatively different safety story than 23.7%.

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Constitutional AI | "AI trained with principles" | Two-phase pipeline: self-critique-and-revise SFT, then RL from AI feedback |
| RLAIF | "RLHF without humans" | RL with preferences generated by an AI labeler; the rest of the pipeline is unchanged |
| Constitution | "the principles" | An ordered list of natural-language rules the critique/labeler model consults |
| Critique-and-revise | "the SFT loop" | Produce response → critique under a principle → revise → SFT target |
| Constitutional Classifier | "the output gate" | Lightweight classifier that evaluates outputs against the constitution and blocks/logs |
| Four-tier priority | "the conflict resolver" | 2026 Claude constitution hierarchy: catastrophic > platform > ethics > helpful |
| Feedback model | "the AI labeler" | The model that reads a principle and ranks a pair of completions |