Benchmarks: SWE-bench, GAIA, AgentBench

> Three benchmarks anchor agent evaluation in 2026. SWE-bench tests code patching. GAIA tests generalist tool use. AgentBench tests multi-environment reasoning. Know their composition, their contamination story, and what they do not measure.

Type: Learn

Languages: Python (stdlib)

Prerequisites: Phase 14 · 06 (Tool Use)

Time: ~60 minutes

Learning Objectives

The Problem

Leaderboards tell you which model wins on one benchmark. They do not tell you:

Know the three anchoring benchmarks and their failure modes before you quote a number.

The Concept

SWE-bench (Jimenez et al., ICLR 2024 oral)

SWE-agent (Yang et al., 2024) hit 12.5% at release by emphasizing agent-computer interfaces (file editor commands, search syntax the model understands).

SWE-bench Verified

OpenAI, Aug 2024. Human-curated 500-task subset. Removes ambiguous issues, unreliable tests, and tasks where the fix was unclear. Primary benchmark for "does your agent ship real patches?"

Contamination

Practical implication: a model that scores 50% on SWE-bench may score 35% on SWE-bench+. Always report both if you claim SWE-bench performance.
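To make "report both" concrete, here is a minimal sketch of computing a raw and a leak-filtered resolution rate side by side. It is an illustration only, not how SWE-bench+ was built: the `TaskResult` type, the `leaked` flag (assumed to come from some prior audit), and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    resolved: bool  # the patch passed the evaluator
    leaked: bool    # hypothetical audit flag: the fix was visible in the issue text


def resolution_rate(results):
    """Fraction of tasks resolved; 0.0 for an empty list."""
    results = list(results)
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


def report(results):
    """Return the raw rate and the rate on tasks not flagged as leaked."""
    clean = [r for r in results if not r.leaked]
    return {
        "raw": resolution_rate(results),
        "leak_filtered": resolution_rate(clean),
        "n_total": len(results),
        "n_clean": len(clean),
    }


# Made-up results: two of the three resolved tasks were flagged as leaked.
results = [
    TaskResult("t1", resolved=True, leaked=True),
    TaskResult("t2", resolved=True, leaked=False),
    TaskResult("t3", resolved=False, leaked=False),
    TaskResult("t4", resolved=True, leaked=True),
    TaskResult("t5", resolved=False, leaked=False),
    TaskResult("t6", resolved=False, leaked=False),
]

summary = report(results)
print(summary)
```

Reporting `n_clean` alongside the filtered rate matters: a filtered score computed on a handful of tasks is much noisier than the raw one.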

GAIA (Mialon et al., Nov 2023)

GAIA is what you run to measure "generalist capability." Do not confuse it with code-specific benchmarks.

AgentBench (Liu et al., ICLR 2024)

What these do not measure

Where benchmarking goes wrong

Build It

code/main.py implements a toy SWE-bench-like harness:

Run it:

python3 code/main.py

The output shows the resolution rate per task and per difficulty, and makes the evaluator rules concrete.
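The evaluator rule can be sketched in a few lines. This is not the contents of code/main.py; it is a minimal, hedged illustration in which a "patch" is just a Python callable and each test is a function that takes the patched callable and returns a boolean. The task data, the `evaluate` and `run` helpers, and the difficulty labels are all hypothetical. The real SWE-bench harness instead applies a diff to a repository and runs the project's test suite.

```python
from collections import defaultdict


def evaluate(patched, fail_to_pass, pass_to_pass):
    """Resolved only if every FAIL_TO_PASS and every PASS_TO_PASS test passes."""
    def passes(test):
        try:
            return bool(test(patched))
        except Exception:
            # A test that raises counts as a failure.
            return False

    return (all(passes(t) for t in fail_to_pass)
            and all(passes(t) for t in pass_to_pass))


# Candidate patches an agent might submit (hypothetical).
def abs_good(x):
    return -x if x < 0 else x


def abs_regress(x):
    return -x  # fixes negatives but breaks positives


def div_unfixed(a, b):
    return a / b  # still raises on b == 0


ABS_F2P = [lambda f: f(-3) == 3]   # failed before the fix, must pass now
ABS_P2P = [lambda f: f(2) == 2]    # passed before the fix, must still pass

TASKS = [
    {"id": "abs-1", "difficulty": "easy", "patch": abs_good,
     "fail_to_pass": ABS_F2P, "pass_to_pass": ABS_P2P},
    {"id": "abs-2", "difficulty": "easy", "patch": abs_regress,
     "fail_to_pass": ABS_F2P, "pass_to_pass": ABS_P2P},
    {"id": "div-1", "difficulty": "hard", "patch": div_unfixed,
     "fail_to_pass": [lambda f: f(1, 0) is None],
     "pass_to_pass": [lambda f: f(6, 3) == 2]},
]


def run(tasks):
    """Return per-task outcomes and the resolution rate per difficulty."""
    per_task = {}
    buckets = defaultdict(list)
    for t in tasks:
        ok = evaluate(t["patch"], t["fail_to_pass"], t["pass_to_pass"])
        per_task[t["id"]] = ok
        buckets[t["difficulty"]].append(ok)
    per_difficulty = {d: sum(v) / len(v) for d, v in buckets.items()}
    return per_task, per_difficulty


per_task, per_difficulty = run(TASKS)
print(per_task)
print(per_difficulty)
```

Note that `abs_regress` flips its FAIL_TO_PASS test and is still rejected, because it breaks a PASS_TO_PASS test. That is the point of having two gates.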

Use It

Ship It

outputs/skill-benchmark-harness.md builds a SWE-bench-style harness for any codebase-task pair with FAIL_TO_PASS / PASS_TO_PASS gating.

Exercises

  1. Port the toy harness to run on a real repo (pick one of yours). Write 3 FAIL_TO_PASS tests for known bugs.
  2. Add a step-count metric. On your 3 tasks, how many agent steps per resolution?
  3. Read the SWE-bench+ paper. Implement a solution-leakage check (pattern-match the issue text against the diff).
  4. Download a GAIA question from the public split. Trace what a GPT-4-class agent would do. What tools does it need?
  5. Read AgentBench's per-environment breakdown. Which environment mirrors your product surface? What does "SOTA" look like there?

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| SWE-bench | "Code agent benchmark" | 2,294 GitHub issues; patch must flip FAIL_TO_PASS tests |
| SWE-bench Verified | "Clean SWE-bench" | 500 human-curated tasks, OpenAI |
| FAIL_TO_PASS | "Fix gate" | Tests previously failing that must pass after the patch |
| PASS_TO_PASS | "No-regression gate" | Tests that were passing and must still pass |
| GAIA | "Generalist benchmark" | 466 human-easy / AI-hard multi-tool questions |
| AgentBench | "Multi-env benchmark" | 8 environments; long-horizon multi-turn |
| Contamination | "Training-set leak" | Benchmark tasks present in model training |
| SWE-bench+ | "Contamination audit" | 32.67% solution leakage found in successful SWE-bench patches |

Further Reading