
The Direct Preference Optimization Family

> Rafailov et al. (2023) showed that the KL-regularized RLHF optimum has a closed form, so the reward can be written in terms of the policy itself. You can then skip the explicit reward model and optimize the policy directly on preference data. That insight spawned a family (IPO, KTO, SimPO, ORPO, BPO), each fixing a failure mode of DPO. In 2026, more frontier post-training runs ship with direct alignment algorithms than with PPO. But the over-optimization curve from Lesson 2 still applies: DAAs do not escape Goodhart, they just move where it bites.

Type: Learn

Languages: Python (stdlib, six-variant preference-loss comparator)

Prerequisites: Phase 18 · 01 (InstructGPT), Phase 18 · 02 (Reward hacking), Phase 10 · 08 (DPO basics)

Time: ~75 minutes

Learning Objectives

Derive the DPO closed form from the RLHF-with-KL optimum.
State the failure mode each of IPO, KTO, SimPO, ORPO, BPO fixes in DPO.
Distinguish "implicit reward gap" from "preference strength" and explain why IPO's identity mapping matters.
Explain why Rafailov et al. (NeurIPS 2024) find empirically that DAAs over-optimize despite having no explicit RM.

The Problem

The RLHF objective (Lesson 1):

max_pi E_{x,y~pi} [ r(x, y) ] - beta * KL(pi || pi_ref)

has a known optimum:

pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp(r(x, y) / beta)

So the reward is implicitly defined by the ratio of the optimal policy to the reference:

r(x, y) = beta * log(pi*(y|x) / pi_ref(y|x)) + beta * log Z(x)

Substitute this into the Bradley-Terry preference likelihood and the partition function Z(x) cancels because it depends only on x. What remains is a loss in the policy parameters alone — no reward model needed. That is DPO.
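The cancellation can be checked numerically. The sketch below builds the closed-form optimum for one toy prompt with three candidate responses and confirms that the implicit reward differs from the true reward only by the constant beta * log Z(x), so pairwise gaps match. The function names and all numbers are illustrative assumptions, not part of code/main.py.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), for one prompt x."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z

def implicit_reward(pi, pi_ref, beta):
    """beta * log(pi / pi_ref) per response; leaves out the beta * log Z(x) term."""
    return [beta * math.log(p / q) for p, q in zip(pi, pi_ref)]

# Toy prompt with three candidate responses (made-up numbers).
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.2, -0.5]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
r_hat = implicit_reward(pi_star, pi_ref, beta)

# Every implicit reward is off from the true reward by the same constant,
# beta * log Z(x), because Z(x) depends only on the prompt.
offsets = [r - rh for r, rh in zip(rewards, r_hat)]

# Bradley-Terry only sees reward differences, so the constant drops out.
true_gap = rewards[0] - rewards[2]
implicit_gap = r_hat[0] - r_hat[2]
```

Because only the difference enters the preference likelihood, the loss never needs Z(x), which is what makes the reward model unnecessary.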

The wrinkle: the derivation assumes the optimum is reachable, the preference data is in-distribution, and the reference policy is the true mode anchor. None of these hold exactly. Every family member fixes a different violated assumption.

The Concept

DPO (Rafailov et al., 2023)

L_DPO = -log sigmoid(
  beta * log(pi(y_w | x) / pi_ref(y_w | x))
  - beta * log(pi(y_l | x) / pi_ref(y_l | x))
)

What can go wrong:

The implicit reward gap beta * (log(pi/pi_ref)_w - log(pi/pi_ref)_l) is unbounded. A tiny preference can produce an arbitrarily large gap.
The loss drives chosen and rejected log-probs in opposite directions. It can push the chosen absolute log-prob down as long as the rejected falls faster. This is the Degraded Chosen Response phenomenon.
Out-of-distribution preferences (rare pair vs. rare pair) produce arbitrary implicit rewards.
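A minimal sketch of the DPO loss for a single pair, taking summed sequence log-probs as inputs. It also illustrates the second failure mode: the loss improves even though the chosen log-prob went down. The function names and the numbers are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the implicit reward gap for one (chosen, rejected) pair."""
    gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(gap)

# Reference log-probs for the chosen and rejected responses (made-up values).
ref_w, ref_l = -10.0, -10.0

# At initialization the policy equals the reference: gap is 0, loss is log 2.
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# Later, the chosen log-prob has dropped from -10 to -12, but the rejected
# one dropped further, to -20. The gap grew, so the loss went down anyway.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)
```

Logging the chosen log-prob alongside the loss is the simplest way to spot this drift during training.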

IPO (Azar et al., 2024)

Identity Preference Optimization replaces the log-sigmoid with an identity mapping on the preference probability. The loss becomes a squared-error on a bounded target:

L_IPO = (log(pi(y_w | x) / pi_ref(y_w | x)) - log(pi(y_l | x) / pi_ref(y_l | x)) - 1/(2 beta))^2

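The loss regresses the log-ratio gap toward the fixed target 1/(2 beta), so overshooting the target is penalized just like undershooting it. The implicit-reward gap stays tied to preference strength instead of growing without bound. No blow-up.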
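A short sketch of the IPO loss as written above, for one pair. It shows that the loss is zero when the log-ratio gap equals 1/(2 beta) and rises again when the gap overshoots. The function name and the numbers are illustrative assumptions.

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Squared distance between the log-ratio gap and the target 1 / (2 * beta)."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)
    return (h - target) ** 2

ref_w, ref_l = -10.0, -10.0  # made-up reference log-probs

loss_no_gap = ipo_loss(-10.0, -10.0, ref_w, ref_l)     # gap 0, target 5
loss_at_target = ipo_loss(-8.0, -13.0, ref_w, ref_l)   # gap 5, on target
loss_overshoot = ipo_loss(-8.0, -58.0, ref_w, ref_l)   # gap 50, far past target
```

Under DPO the third case would have the lowest loss of the three; under IPO it has the highest, which removes the incentive to keep widening the gap.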

KTO (Ethayarajh et al., 2024)

Kahneman-Tversky Optimization drops pairwise structure entirely. Given a single labeled output and a binary "desirable" or "undesirable" signal, it maps to a prospect-theory utility:

v(x, y) = sigma(beta * log(pi(y|x) / pi_ref(y|x)) - z_ref)

with different weights for gains and losses (loss aversion). Benefit: you can use unpaired data, which is far more plentiful.
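A hedged sketch of the KTO value for a single labeled output, following the formula above. Two things are assumptions not spelled out in the text: the sign flip inside the sigmoid for undesirable examples, and the weight names w_gain and w_loss. In practice z_ref is estimated from a batch; here it is a plain constant.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kto_value(logp, ref_logp, desirable, z_ref=0.0, beta=0.1,
              w_gain=1.0, w_loss=1.0):
    """Prospect-theory-style value of one output with a binary label.

    Assumption: undesirable outputs use the mirrored argument, so their
    value rises as the policy moves probability away from them.
    """
    u = beta * (logp - ref_logp) - z_ref
    if desirable:
        return w_gain * sigmoid(u)
    return w_loss * sigmoid(-u)

def kto_loss(logp, ref_logp, desirable, **kwargs):
    """Weight minus value, so maximizing the value minimizes the loss."""
    weight = kwargs.get("w_gain", 1.0) if desirable else kwargs.get("w_loss", 1.0)
    return weight - kto_value(logp, ref_logp, desirable, **kwargs)

# No pairing needed: each example is scored on its own (made-up numbers).
good_up = kto_value(-8.0, -10.0, desirable=True)     # policy raised a good output
good_down = kto_value(-12.0, -10.0, desirable=True)  # policy lowered a good output
bad_down = kto_value(-12.0, -10.0, desirable=False, w_loss=1.5)
```

Setting w_loss above w_gain is one way to express loss aversion: mistakes on undesirable outputs cost more than equal-sized gains on desirable ones.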

SimPO (Meng et al., 2024)

Simple Preference Optimization aligns the training signal with generation. Remove the reference policy entirely and normalize log-likelihood by length:

L_SimPO = -log sigmoid(
  (beta / |y_w|) * log pi(y_w | x)
  - (beta / |y_l|) * log pi(y_l | x)
  - gamma
)

with a margin gamma to stabilize. The length normalization removes the incentive to exploit DPO's length-bias failure mode (longer y_w gives a larger log-prob gap by construction).
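A minimal sketch of the SimPO loss for one pair, with a toy case showing what length normalization changes. Both responses have the same per-token log-prob, but the rejected one is four times longer. The function names, default beta and gamma, and the numbers are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=0.5):
    """Reference-free loss on length-normalized log-probs with margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)

# Same quality per token (-1.0 each), different lengths (made-up numbers).
logp_w, len_w = -10.0, 10   # chosen: 10 tokens
logp_l, len_l = -40.0, 40   # rejected: 40 tokens

# Summed log-probs differ by 30 purely because of token count.
sum_gap = logp_w - logp_l
# Per-token averages are identical, so the normalized gap is 0.
avg_gap = logp_w / len_w - logp_l / len_l

loss = simpo_loss(logp_w, len_w, logp_l, len_l)
```

With the gap at zero, the loss depends only on gamma, so the model gets no credit for a difference that is purely a matter of length.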

ORPO (Hong et al., 2024)

Odds-Ratio Preference Optimization adds a preference term to the standard SFT negative log-likelihood:

L_ORPO = L_NLL(y_w) + lambda * L_OR
L_OR = -log sigmoid(log(odds(y_w) / odds(y_l)))

No reference policy — the SFT term is the regularizer. Train in a single stage from the base model to the aligned model. No separate SFT checkpoint.
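A hedged sketch of the ORPO objective for one pair. It assumes odds(y) = P(y) / (1 - P(y)) and that the inputs are average per-token log-probs (strictly negative), which keeps P(y) away from 0 and 1. The function names and the default lambda are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_odds(logp):
    """log(P / (1 - P)) computed from log P; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_w, logp_l, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus odds-ratio term."""
    nll = -logp_w
    log_or = log_odds(logp_w) - log_odds(logp_l)
    l_or = -log_sigmoid(log_or)
    return nll + lam * l_or, nll, l_or

# Chosen has average token probability 0.6, rejected 0.2 (made-up numbers).
total, nll, l_or = orpo_loss(math.log(0.6), math.log(0.2))
```

Note that no reference log-probs appear anywhere in the signature: the NLL term alone keeps the policy anchored to the chosen responses.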

BPO (ICLR 2026 submission, OpenReview id=b97EwMUWu7)

BPO targets the Degraded Chosen Response problem: DPO preserves the ranking y_w > y_l, but the absolute log-prob of y_w can drop. BPO adds a single-line correction that penalizes downward moves on the chosen response. The authors report +10.1% accuracy over DPO on math reasoning with Llama-3.1-8B-Instruct.

The universal result: DAAs still over-optimize

Rafailov et al. "Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms" (NeurIPS 2024) trained policies with DPO, IPO, SLiC on multiple datasets across KL budgets. The gold-reward-vs-KL curves have the same Gao et al. peak-and-collapse shape. The implicit reward queries out-of-distribution samples during training; KL regularization does not stabilize this.

DAAs do not escape Goodhart. They change the surface where it bites from "reward model over-optimized" to "reference policy ratio over-optimized." The universal fix — better data, ensembles, early stopping — applies to both.
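Of the three fixes, early stopping is the easiest to sketch. The example below picks a checkpoint from a peak-and-collapse curve by stopping once a held-out score has not improved for a few checkpoints. The helper name, the patience rule, and the scores are hypothetical; a real run would substitute whatever held-out quality signal is available.

```python
def early_stop_index(scores, patience=2):
    """Index of the best checkpoint seen before the score stalls.

    Stops scanning once `patience` consecutive checkpoints have failed
    to beat the best score so far.
    """
    best_idx, best = 0, scores[0]
    for i in range(1, len(scores)):
        if scores[i] > best:
            best_idx, best = i, scores[i]
        elif i - best_idx >= patience:
            break
    return best_idx

# Made-up held-out scores at successive checkpoints, with KL from the
# reference growing along the run: the score rises, peaks, then collapses.
heldout = [0.50, 0.58, 0.63, 0.65, 0.64, 0.60, 0.52]
stop_at = early_stop_index(heldout)
```

The training loss keeps falling past the peak, so the stopping signal has to come from something other than the preference loss itself.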

Choosing among them (2026)

If you have large paired preference data: DPO with conservative beta, SimPO if length bias is evident.
If you have unpaired binary feedback: KTO.
If you want a single-stage pipeline from a base model: ORPO.
If you see degraded chosen log-probs in DPO logs: BPO.
If preference strengths vary widely and DPO is saturating: IPO.

A common practice is to run all six on an evaluation battery and pick the winner per task. There is no reason the optimum should be the same for math reasoning and safety.
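The decision list above can be written down as a small function. This is only a sketch: the flag names are hypothetical, and the order in which the conditions are checked is an assumption, since the list does not say which rule wins when several apply.

```python
def recommend_loss(paired, single_stage=False, chosen_logp_dropping=False,
                   strength_varies=False, length_bias=False):
    """Map dataset and pipeline facts to one of the six preference losses.

    The priority order below is an assumption; adjust it to your setting.
    """
    if not paired:
        return "KTO"    # unpaired binary feedback
    if single_stage:
        return "ORPO"   # base model to aligned model in one pass
    if chosen_logp_dropping:
        return "BPO"    # degraded chosen log-probs seen in DPO logs
    if strength_varies:
        return "IPO"    # widely varying preference strength, DPO saturating
    if length_bias:
        return "SimPO"  # evident length bias
    return "DPO"        # large paired data, conservative beta

choice = recommend_loss(paired=True, length_bias=True)
```

A selector like this is a starting point for the per-task comparison, not a replacement for it.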

Use It

code/main.py compares six losses (DPO, IPO, KTO, SimPO, ORPO, BPO) on a toy preference dataset where the true preference strength varies by pair. Each loss is optimized against the same 500-pair sample with a small softmax policy. The script plots final win rate, chosen-log-prob drift, and implicit-reward spread per method.

Ship It

This lesson produces outputs/skill-preference-loss-selector.md. Given dataset statistics (paired vs unpaired, variable vs uniform preference strength, length distribution) and a target (single-stage or SFT-then-preference), recommend a preference loss and report the failure mode it protects against.

Exercises

Run code/main.py. Report the final chosen-log-prob drop for DPO and BPO. BPO should retain higher chosen absolute probability — verify this.

Modify the preference data so that all pairs have equal strength. Which of the six methods is most robust? Which degrades? Explain IPO's advantage here.

Make the rejected responses on average 2x longer than chosen. Without changing anything else, show DPO's length exploitation numerically and SimPO's fix.

Rafailov et al. (NeurIPS 2024) claim DAAs over-optimize. Reproduce a single-point version: plot chosen-minus-rejected KL divergence and observe over-optimization in DPO at large beta.

Read the BPO paper abstract (OpenReview b97EwMUWu7). Write down the one-line correction BPO adds to DPO. Confirm against the implementation in code/main.py.

Key Terms

Term	What people say	What it actually means
DPO	"RLHF without a reward model"	Loss derived from the closed-form RLHF optimum; policy parameters only
Implicit reward	"the log-ratio"	`beta * log(pi(y|x) / pi_ref(y|x))`, the DPO-implied reward
IPO	"bounded DPO"	Replaces log-sigmoid with identity; log-ratio gap is regressed toward the target `1/(2 beta)`
KTO	"unpaired DPO"	Prospect-theory utility over single labels with loss aversion
SimPO	"reference-free DPO"	Length-normalized log-likelihood + margin; no reference policy
ORPO	"one-stage DPO"	NLL + odds-ratio preference term; trains from base model in one pass
BPO	"chosen-preserving DPO"	DPO plus a penalty for decreasing the chosen response's absolute log-prob
Degraded Chosen	"chosen goes down"	DPO can lower the chosen log-prob as long as the rejected log-prob falls faster
DAA	"direct alignment algorithm"	Any preference-loss method that skips an explicit RM