Benchmarks: WebArena and OSWorld

> WebArena tests web-agent capability across four self-hosted apps. OSWorld tests desktop-agent capability across Ubuntu, Windows, and macOS. At release (2023–2024), both showed a large gap between the best agents and humans: roughly 14% vs. 78% task success on WebArena and 12% vs. 72% on OSWorld. The gap is narrowing, but the failure modes have not changed.

Type: Learn

Languages: Python (stdlib)

Prerequisites: Phase 14 · 19 (SWE-bench, GAIA)

Time: ~60 minutes

Learning Objectives

The Problem

Generalist agents can call tools. Can they drive a browser across 20 clicks to complete a shopping checkout? Can they configure a Linux box using only keyboard and mouse? These are the questions WebArena and OSWorld answer.

The Concept

WebArena (Zhou et al., ICLR 2024)

The self-hosted framing matters: because the target apps are pinned and reproducible, the benchmark is not flaky.
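
The reproducibility idea can be illustrated with a toy, assuming a hypothetical in-memory app: every episode starts from the same pinned snapshot, so two runs of the same actions end in the same state. The `PINNED_SNAPSHOT` data and the `reset`, `fingerprint`, and `run_episode` names are made up for this sketch; WebArena pins real self-hosted apps, not Python dictionaries.

```python
import copy
import hashlib
import json

# Hypothetical pinned starting state for a toy forum app (illustration only).
PINNED_SNAPSHOT = {"version": "1.0.0", "posts": [{"id": 1, "title": "Welcome"}]}


def reset() -> dict:
    """Return a fresh, independent copy of the pinned state."""
    return copy.deepcopy(PINNED_SNAPSHOT)


def fingerprint(state: dict) -> str:
    """Stable hash of a state, used to compare end states across runs."""
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def run_episode() -> str:
    """Apply one fixed action to a fresh state and fingerprint the result."""
    state = reset()
    state["posts"].append({"id": 2, "title": "Hello"})  # the agent's one action
    return fingerprint(state)


# Same start + same actions => same end state, run after run.
assert run_episode() == run_episode()
# The pinned snapshot is untouched by either episode.
assert len(PINNED_SNAPSHOT["posts"]) == 1
```

If `reset` returned the snapshot itself instead of a deep copy, the second episode would start with two posts and the fingerprints would diverge, which is the kind of flakiness that pinning avoids.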

Extensions

OSWorld (Xie et al., NeurIPS 2024)

Primary failure modes

  1. GUI grounding. Mapping pixels to UI elements. Models struggle to localize elements reliably in a 1920×1080 screenshot.
  2. Operational knowledge. Knowing which menu holds a setting, which keyboard shortcut to use, and which preference pane to open. This is a long tail of knowledge that humans build up over years.
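
One way to make the two failure modes measurable, sketched under assumptions: count a step as a planning failure when the agent picked the wrong target element, and as a grounding failure when it picked the right element but its click landed outside that element's bounding box. The `Box` and `Step` types, the log format, and the `classify_step` helper are hypothetical names for illustration, not part of OSWorld or OSWorld-G.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screen pixels."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


@dataclass(frozen=True)
class Step:
    """One logged agent action (hypothetical log format)."""
    intended_element: str     # element the agent said it was clicking
    click: Tuple[int, int]    # pixel it actually clicked


def classify_step(step: Step, gold_element: str, elements: Dict[str, Box]) -> str:
    """Label a step as 'ok', 'planning', or 'grounding'."""
    if step.intended_element != gold_element:
        return "planning"     # wrong target chosen
    if not elements[gold_element].contains(*step.click):
        return "grounding"    # right target, wrong pixels
    return "ok"


# Hypothetical 1920x1080 screen with two elements.
elements = {
    "settings_menu": Box(left=1800, top=10, width=100, height=30),
    "file_menu": Box(left=10, top=10, width=60, height=30),
}

print(classify_step(Step("settings_menu", (1850, 25)), "settings_menu", elements))  # ok
print(classify_step(Step("settings_menu", (1700, 25)), "settings_menu", elements))  # grounding
print(classify_step(Step("file_menu", (30, 20)), "settings_menu", elements))        # planning
```

The split only works if the agent logs which element it intended to act on; with raw coordinates alone, a miss cannot be attributed to either cause.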

Follow-ups

Why this matters

Claude computer use, OpenAI CUA, and Gemini 2.5 Computer Use (Lesson 21) all train on workloads shaped by WebArena and OSWorld. The benchmarks are the target; the production models are the shipped answer.

Where benchmarking goes wrong

Build It

code/main.py implements a toy web-agent harness.

Run it:

python3 code/main.py

Output: per-task success rate and trajectory efficiency, mirroring OSWorld-Human's methodology.
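
For readers without the repository at hand, the shape of such a harness can be sketched as follows. This is an illustrative sketch, not the contents of code/main.py: the `ShopApp` class, its action verbs, the `run_task` helper, and the gold trajectory are all assumptions made up for this example. Success is judged by inspecting the final app state (execution-based), and trajectory efficiency is the agent's step count divided by the gold step count.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, argument), e.g. ("add_to_cart", "mug")


@dataclass
class ShopApp:
    """Hypothetical in-memory shopping app standing in for a real site."""
    cart: List[str] = field(default_factory=list)
    ordered: List[str] = field(default_factory=list)

    def step(self, action: Action) -> None:
        verb, arg = action
        if verb == "add_to_cart":
            self.cart.append(arg)
        elif verb == "remove_from_cart" and arg in self.cart:
            self.cart.remove(arg)
        elif verb == "checkout":
            self.ordered, self.cart = list(self.cart), []
        # Unknown verbs are no-ops but still count as steps.


def run_task(agent_actions: List[Action],
             gold_actions: List[Action],
             check: Callable[[ShopApp], bool]) -> Tuple[bool, float]:
    """Replay the agent's actions on a fresh app and score the result."""
    app = ShopApp()                       # fresh state per task
    for action in agent_actions:
        app.step(action)
    success = check(app)                  # judge the final state, not the path
    efficiency = len(agent_actions) / len(gold_actions)
    return success, efficiency


# Task: "order exactly one mug".
gold = [("add_to_cart", "mug"), ("checkout", "")]
agent = [("add_to_cart", "pen"), ("remove_from_cart", "pen"),
         ("add_to_cart", "mug"), ("checkout", "")]

success, efficiency = run_task(agent, gold, lambda app: app.ordered == ["mug"])
print(f"success={success} efficiency={efficiency:.1f}x")  # success=True efficiency=2.0x
```

Here the scripted agent succeeds but takes twice as many steps as gold, a gap that a success rate alone would hide.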

Use It

Ship It

outputs/skill-web-desktop-harness.md builds a web/desktop agent harness with execution-based evaluation and a trajectory-efficiency metric.

Exercises

  1. Extend the toy harness with a second app (a forum). Write 3 tasks plus gold trajectories.
  2. Add trajectory-efficiency reporting per task. On your toy, is the agent 1x, 2x, or 3x over gold?
  3. Implement a "distractor" tool — one the gold trajectory never uses. Does the scripted agent get tempted?
  4. Read OSWorld-G. How would you separate grounding failures from planning failures in your own evals?
  5. Read WebArena's apps README. What breaks when you upgrade one of the pinned app versions?

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| WebArena | "Web agent benchmark" | 812 tasks across 4 self-hosted apps; gym-style evaluation |
| VisualWebArena | "Visual WebArena" | Visually grounded WebArena; screenshots are observations |
| OSWorld | "Desktop agent benchmark" | 369 tasks on real Ubuntu/Windows/macOS |
| GUI grounding | "Pixel-to-element mapping" | Model localizing UI elements in 1920×1080 |
| Operational knowledge | "OS know-how" | Which menu, which shortcut, which preference pane |
| OSWorld-G | "Grounding suite" | 564 grounding-only samples + training set |
| OSWorld-Human | "Gold trajectories" | Manual expert action sequences to measure efficiency |
| Trajectory efficiency | "Steps over gold" | Agent step count divided by human minimum |

Further Reading