Evaluation — FID, CLIP Score, Human Preference

> Every generative model leaderboard cites FID, CLIP score, and a win rate from a human-preference arena. Each number has a failure mode a determined researcher can game. If you do not know the failure modes, you cannot tell a real improvement from a gaming run.

Type: Build

Languages: Python

Prerequisites: Phase 8 · 01 (Taxonomy), Phase 2 · 04 (Evaluation Metrics)

Time: ~45 minutes

The Problem

A generative model is judged on *sample quality* and *conditioning adherence*, and neither has a closed-form measure. Your model renders 10,000 images, something has to assign them numbers, and you have to trust those numbers across model families, resolutions, and architectures. Three metrics survived the 2014-2026 gauntlet: FID, CLIP score, and human preference.

You will also see: IS (inception score, largely retired), KID, CMMD, ImageReward, PickScore, HPSv2, MJHQ-30k. Each corrects for one failure of the previous.

The Concept

FID, CLIP, and preference: three axes, different failure modes

FID — sample quality

Heusel et al. (2017). Steps:

  1. Extract Inception-v3 features (2048-D) for N real images and N generated.
  2. Fit a Gaussian to each pool: compute mean μ_r, μ_g and covariance Σ_r, Σ_g.
  3. FID = ||μ_r - μ_g||² + Tr(Σ_r + Σ_g - 2 · (Σ_r · Σ_g)^0.5).

Interpretation: Fréchet distance between two multivariate Gaussians in feature space. Lower = more similar distributions.
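The three steps can be sketched in pure Python under one loud simplification: the covariances are assumed to be diagonal, so the matrix square root reduces to per-dimension standard deviations and the trace term becomes a sum of (σ_r − σ_g)². Real FID uses the full 2048×2048 covariance; the names `mean_and_var` and `fid_diagonal` are illustrative and not part of any library.

```python
import math
import random

def mean_and_var(features):
    """Per-dimension mean and unbiased variance of equal-length vectors."""
    n = len(features)
    dims = len(features[0])
    mu = [sum(f[d] for f in features) / n for d in range(dims)]
    var = [sum((f[d] - mu[d]) ** 2 for f in features) / (n - 1) for d in range(dims)]
    return mu, var

def fid_diagonal(real_features, gen_features):
    """Frechet distance between two Gaussians, ASSUMING diagonal covariances."""
    mu_r, var_r = mean_and_var(real_features)
    mu_g, var_g = mean_and_var(gen_features)
    # ||mu_r - mu_g||^2
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal S_r, S_g: Tr(S_r + S_g - 2 (S_r S_g)^0.5) = sum_i (sigma_r_i - sigma_g_i)^2
    trace_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g))
    return mean_term + trace_term

# Synthetic 4-D "features": one generator close to the real pool, one far from it.
rng = random.Random(0)
real = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2000)]
close = [[rng.gauss(0.1, 1.0) for _ in range(4)] for _ in range(2000)]
far = [[rng.gauss(1.0, 2.0) for _ in range(4)] for _ in range(2000)]

print("close:", fid_diagonal(real, close))
print("far:  ", fid_diagonal(real, far))
```

The generator whose mean and spread sit nearer the real pool gets the lower score, which is all the metric promises: distance between two Gaussian fits, not a judgment about any individual image.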

Failure modes:

CLIP score — prompt adherence

Radford et al. (2021). For a generated image + prompt:

clip_score = cos_sim( CLIP_image(x_gen), CLIP_text(prompt) )

Average across 30k generated images → a scalar comparable between models.
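As a rough sketch of the aggregation step, the snippet below averages per-pair cosine similarities over a set of (image embedding, prompt embedding) pairs. The embeddings here are toy 2-D vectors; in practice they would come from a CLIP image and text encoder, which this sketch does not load. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-8)  # guard against all-zero vectors

def mean_clip_score(image_embs, text_embs):
    """Average cosine similarity over paired image/prompt embeddings."""
    if len(image_embs) != len(text_embs):
        raise ValueError("each image needs exactly one prompt embedding")
    scores = [cosine(i, t) for i, t in zip(image_embs, text_embs)]
    return sum(scores) / len(scores)

# Toy data: the first image matches its prompt, the second is orthogonal to it.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0]]
print(mean_clip_score(image_embs, text_embs))
```

Because the result is a mean, one perfectly matched pair and one completely mismatched pair land at the same score as two mediocre pairs, so it can help to inspect the per-pair scores as well.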

Failure modes:

CMMD (Jayasumana et al., 2024) addresses several of FID's weaknesses: it uses CLIP features instead of Inception features and maximum mean discrepancy (MMD) instead of the Fréchet distance, so it makes no Gaussian assumption. It is better at detecting subtle quality differences.

Human preference — the ground truth

Pick a pool of prompts. Generate with model A and model B. Show pairs to humans (or a strong LLM judge). Aggregate wins into an Elo or Bradley-Terry score. Benchmarks:

Failure modes:

Use together

A production eval report should include:

  1. FID on 10-30k samples against a held-out real distribution (sample quality).
  2. CLIP score / CMMD on the same samples vs their prompts (adherence).
  3. Win rate in a blinded arena vs the previous model (overall preference).
  4. Failure mode analysis: 50 randomly sampled outputs, flagged for known issues (hand anatomy, text rendering, consistent object count).

Any single metric is a lie. Three corroborating metrics + qualitative review are a claim.
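For item 4, one way to keep the qualitative review honest is to draw the 50 outputs with a fixed seed, so the reviewer cannot cherry-pick, and then tally the flags. This is a minimal sketch; the ID format, the flag labels, and both function names are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_for_review(output_ids, k=50, seed=0):
    """Reproducibly pick k outputs for manual failure-mode review."""
    rng = random.Random(seed)
    pool = sorted(output_ids)  # sort so the draw does not depend on input order
    return rng.sample(pool, min(k, len(pool)))

def tally_flags(flags_by_id):
    """Count how often each failure label was raised across reviewed outputs."""
    counts = Counter()
    for flags in flags_by_id.values():
        counts.update(flags)
    return counts

ids = [f"img_{i:05d}" for i in range(10_000)]  # hypothetical output IDs
review_set = sample_for_review(ids)

# Hypothetical reviewer annotations for two of the sampled outputs.
flags = {
    review_set[0]: ["hand anatomy"],
    review_set[1]: ["text rendering", "hand anatomy"],
}
print(tally_flags(flags).most_common())
```

Recording the seed in the report lets someone else pull the same 50 images and check the flags.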

Build It

code/main.py implements FID, a CLIP-style cosine score, and Elo aggregation on synthetic "feature vectors" (4-D vectors stand in for Inception features). You see:

Step 1: FID in four lines

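def fid(real_features, gen_features):
    # Helpers (mean_and_cov, trace, sqrt_cov_product) are assumed to live in code/main.py.
    mu_r, cov_r = mean_and_cov(real_features)  # Gaussian fit to real features
    mu_g, cov_g = mean_and_cov(gen_features)   # Gaussian fit to generated features
    # Squared distance between the means: ||mu_r - mu_g||^2
    mean_diff = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # sqrt_cov_product is expected to return the scalar Tr((cov_r · cov_g)^0.5)
    trace_term = trace(cov_r) + trace(cov_g) - 2 * sqrt_cov_product(cov_r, cov_g)
    return mean_diff + trace_term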

Step 2: CLIP-style cosine-similarity

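import math

def clip_like(image_feat, text_feat):
    # Cosine similarity between an image embedding and a text embedding.
    dot = sum(a * b for a, b in zip(image_feat, text_feat))
    norm = math.sqrt(sum(a * a for a in image_feat) * sum(b * b for b in text_feat))
    # The floor on the norm avoids division by zero for all-zero vectors.
    return dot / max(norm, 1e-8)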

Step 3: Elo aggregation

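def elo_update(r_a, r_b, winner, k=32):
    # Probability that A beats B under the logistic Elo model.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # No draws: any value other than "a" counts as a win for B.
    actual_a = 1.0 if winner == "a" else 0.0
    # Zero-sum update: whatever A gains, B loses.
    delta = k * (actual_a - expected_a)
    return r_a + delta, r_b - delta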
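A short usage sketch, replaying a log of pairwise outcomes through the same update rule. It also illustrates a property worth knowing before trusting an arena number: sequential Elo depends on the order in which comparisons arrive. The `replay` helper and the three-match log are made up for illustration.

```python
def elo_update(r_a, r_b, winner, k=32):
    """One Elo update for a single comparison; winner is "a" or "b" (no draws)."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    actual_a = 1.0 if winner == "a" else 0.0
    delta = k * (actual_a - expected_a)
    return r_a + delta, r_b - delta

def replay(outcomes, start=1000.0, k=32):
    """Apply elo_update to a sequence of outcomes, starting from equal ratings."""
    r_a, r_b = start, start
    for winner in outcomes:
        r_a, r_b = elo_update(r_a, r_b, winner, k)
    return r_a, r_b

outcomes = ["a", "a", "b"]                    # A wins twice, then B wins once
forward = replay(outcomes)
backward = replay(list(reversed(outcomes)))   # same results, opposite order

print("forward: ", forward)
print("backward:", backward)
```

Both replays contain the same two wins for A and one for B, yet the final ratings differ by several points. Averaging over shuffled orderings, or fitting Bradley-Terry on the full set of comparisons at once, are common ways to reduce this order dependence.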

Pitfalls

Use It

Production eval protocol in 2026:

| Pillar | Minimum | Recommended |
|---|---|---|
| Sample quality | FID on 10k vs held-out real | + CMMD on 5k + FID on subset per category |
| Prompt adherence | CLIP score on 30k | + HPSv2 + ImageReward + VQA-style question answering |
| Preference | 200 blinded pairs vs baseline | + 2000 paired human + LLM-judge + Chatbot Arena |
| Failure analysis | 50 hand-flagged | 500 hand-flagged + automated safety classifier |

All four pillars in one report = claim. Any one alone = marketing.
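To see why the preference pillar scales from 200 to 2000 pairs, it helps to put a confidence interval on the win rate. The sketch below uses a Wilson score interval at roughly 95% confidence and assumes every comparison has a winner (no ties); the 55% win rate is an invented example, not a measured result.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial win rate (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("need at least one comparison")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Same observed 55% win rate at the two sample sizes from the protocol.
for wins, n in [(110, 200), (1100, 2000)]:
    lo, hi = wilson_interval(wins, n)
    verdict = "excludes" if lo > 0.5 or hi < 0.5 else "includes"
    print(f"n={n}: [{lo:.3f}, {hi:.3f}] {verdict} 50%")
```

With 200 pairs the interval around a 55% win rate still straddles 50%, so the result is compatible with no real preference; with 2000 pairs it does not.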

Ship It

Save outputs/skill-eval-report.md. The skill takes a new model checkpoint and a baseline, and outputs a full eval plan: sample sizes, metrics, failure-mode probes, and sign-off criteria.

Exercises

  1. Easy. Run code/main.py. Compare FID at N=100 vs N=1000 on the same synthetic distributions. Report bias magnitude.
  2. Medium. Implement CMMD from synthetic CLIP-style features (see Jayasumana et al., 2024 for the formula). Compare sensitivity to quality differences vs FID.
  3. Hard. Replicate the HPSv2 setup: take 1000 image-prompt pairs from a subset of Pick-a-Pic, fine-tune a small CLIP-based scorer on the preferences, and measure its agreement with a held-out set.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| FID | "Fréchet Inception Distance" | Fréchet distance of Gaussian fits to real vs gen Inception features. |
| CLIP score | "Text-image similarity" | Cosine similarity between CLIP image and text embeddings. |
| CMMD | "FID's replacement" | CLIP-feature MMD; less biased, no Gaussian assumption. |
| IS | "Inception score" | exp(E_x[KL(p(y∣x) ‖ p(y))]); correlates poorly on modern models, retired. |
| HPSv2 / ImageReward / PickScore | "Learned preference proxies" | Small models trained on human preferences; used as automatic judges. |
| Elo | "Chess rating" | Bradley-Terry aggregation of pairwise wins. |
| PartiPrompts | "The benchmark prompt set" | 1,600 Google-curated prompts across 12 categories. |
| FD-DINO | "Self-sup replacement" | FD using DINOv2 features; better for out-of-ImageNet domains. |

Production note: evaluation is an inference workload too

Running FID on 10k samples means generating 10k images. For a 50-step SDXL base at 1024² on a single L4, that is ~11 hours of single-request inference. Evaluation budgets are real, and the framing is exactly the offline-inference scenario (maximize throughput, ignore TTFT):

For CI / regression gates: run FID + CLIP score on a 500-sample subset per PR (~30 min); run full 10k FID + HPSv2 + Elo nightly.
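The budget figures above can be sanity-checked with a few lines of arithmetic. This sketch takes the quoted ~11 hours for 10k images as given and assumes generation time scales linearly with sample count; it ignores feature extraction and metric computation, which add their own cost.

```python
FULL_RUN_HOURS = 11        # quoted figure for the full run
FULL_RUN_SAMPLES = 10_000  # images in the full FID run

# Implied per-image latency for single-request inference.
SEC_PER_IMAGE = FULL_RUN_HOURS * 3600 / FULL_RUN_SAMPLES

def generation_minutes(n_samples):
    """Generation time in minutes, assuming linear scaling in sample count."""
    return n_samples * SEC_PER_IMAGE / 60

print(f"per image: {SEC_PER_IMAGE:.2f} s")
print(f"500-sample PR gate: {generation_minutes(500):.0f} min")
print(f"10k nightly run: {generation_minutes(10_000) / 60:.0f} h")
```

The 500-sample gate comes out at roughly half an hour of generation, consistent with the ~30 min estimate.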

Further Reading