← Bounded Self-Improvement Designs Claude Code as an Autonomous Agent: Permission Modes and Auto Mode →

The Autonomous Coding Agent Landscape (2026)

> SWE-bench Verified went from 4% to 80.9% in under three years. Same Claude Sonnet 4.5 scored 43.2% on SWE-agent v1 and 59.8% on Cline autonomous — the scaffolding around the model now matters as much as the model itself. OpenHands (formerly OpenDevin) is the most active MIT-licensed platform and its CodeAct loop executes Python actions directly in a sandbox instead of JSON tool calls. The headline numbers hide a methodological issue: 161 of 500 SWE-bench Verified tasks require only a 1–2 line change, and SWE-bench Pro (10+ line tasks) sits at 23–59% for the same frontier models.

Type: Learn

Languages: Python (stdlib, CodeAct vs JSON tool-call comparison)

Prerequisites: Phase 14 · 07 (Tool use), Phase 15 · 01 (Long-horizon agents)

Time: ~45 minutes

The Problem

"Which coding agent is best" is the wrong question. The right question is: on a task distribution that matches my work, with the scaffolding I will run in production, what end-to-end reliability do I get?

Between 2022 and 2026 the field learned that scaffolding — the retrieval layer, the planner, the sandbox, the edit-verify loop, the feedback format — is load-bearing. Claude Sonnet 4.5 on SWE-agent v1 scored 43.2% on SWE-bench Verified; the same model inside Cline's autonomous scaffold scored 59.8%. 16.6 absolute points of difference, same weights. The base model is a component; the loop is the product.

The companion problem is that benchmark saturation hides regressions. SWE-bench Verified is close to saturated, and the easy-task tail (161 of 500 tasks requiring ≤2 lines) pulls top scores up. Real-world quality is better measured on distributions like SWE-bench Pro (10+ line changes), where the same leaders still sit at 23–59%.

The Concept

SWE-bench, one paragraph

SWE-bench (Jimenez et al.) takes real GitHub issues with ground-truth patches and asks an agent to produce a patch that makes the test suite pass. SWE-bench Verified (OpenAI, 2024) is a human-curated 500-task subset with the ambiguous and broken tasks removed. SWE-bench Pro is the harder successor — tasks requiring 10+ lines of change, where current frontier agents sit at 23–59%.

What the 2022 → 2026 curve actually shows

2022: research models at ~4% on raw SWE-bench.
2024: GPT-4 + Devin-style scaffolding at ~14%; SWE-agent at ~12%.
2025: Claude 3.5/3.7 Sonnet inside Aider and SWE-agent push into the 40–55% range.
2026: Claude Sonnet 4.5 and frontier competitors at 70–80%+ on SWE-bench Verified. Epoch AI's leaderboard tracks this live.

The slope came from three compounding sources: better base models, better scaffolding (CodeAct, reflection, verifier loops), and better benchmarks (Verified removing noise).

CodeAct vs JSON tool calls

OpenHands (All-Hands-AI, arXiv:2407.16741, formerly OpenDevin) took a specific architectural bet: instead of the model emitting JSON tool calls that a host decodes and executes, the model emits Python code and a Jupyter-style kernel runs it in a sandbox. The agent can loop over files, chain tools, and catch its own exceptions inside one action.

The trade-off:

JSON tool calls: every action is one turn; easy to audit; limited compositionality; safe by default because each call goes through an explicit validator.
CodeAct: one action can be a whole program; compositional; requires a hardened sandbox (OpenHands uses Docker isolation); failure modes include anything the sandbox runtime allows.

Both architectures are in production. CodeAct is dominant in open platforms (OpenHands, smolagents). JSON tool calls remain dominant in managed services (Anthropic Managed Agents, OpenAI Assistants) where the provider controls the executor.

Scaffolds in the 2026 landscape

Scaffold	License	Execution model	Notable property
OpenHands (OpenDevin)	MIT	CodeAct in Docker	Most active open platform; event-stream replayable
SWE-agent	MIT	Agent-Computer Interface (ACI)	First end-to-end SWE-bench scaffold
Aider	Apache-2	edit-via-diff in local repo	Minimal scaffold, strong regression stability
Cline	Apache-2	VS Code agent with tool policy	Highest-scoring open scaffold on Sonnet 4.5
Devin (Cognition)	Proprietary	Managed VM + planner	First "AI software engineer" product category
Claude Code	Proprietary	Permission modes + routines	Lesson 10 covers the agent loop in detail

Why scaffolding dominates

A coding run is a long-horizon trajectory (Lesson 1). Reliability compounds across steps. Three places where scaffolding buys points:

Retrieval: finding the right files to read is the silent bottleneck. SWE-agent's ACI, OpenHands' file-index, and Aider's repo-map all attack this.
Verifier loop: running tests, reading stack traces, and re-attempting is a 10+ point delta on SWE-bench.
Failure containment: a sandbox that rolls back on error prevents compounding damage. The same model with and without a verifier loop looks like two different products.

Benchmark saturation and the real distribution

The OpenHands authors and Epoch AI both flag that SWE-bench Verified has an easy tail: 161 of 500 tasks need only 1–2 lines of change. High scores are driven partly by this tail. SWE-bench Pro restricts to 10+ line changes and returns scores in the 23–59% range even for frontier systems. Your production distribution is almost certainly closer to Pro than to Verified.

Implication for choosing an agent: run a Pro-like subset of your own bug backlog. The score that matters is the score on tasks representative of what you ship.

Use It

code/main.py compares two toy agent scaffolds on a fixed mini-task distribution:

A JSON tool-call scaffold that takes one action per turn.
A CodeAct scaffold that can emit a small Python snippet per action.

Both use a stub "model" (deterministic rules) so the comparison isolates the scaffold from model quality. The output shows the CodeAct scaffold solves more tasks in fewer turns at the cost of a larger per-action blast radius.

Ship It

outputs/skill-scaffold-audit.md helps you audit a proposed coding-agent scaffold before adoption: retrieval quality, verifier presence, sandbox isolation, and benchmark-to-distribution fit.

Exercises

Run code/main.py. How many turns does each scaffold take on the same task set? What is the per-action blast radius of each?

Read the OpenHands paper (arXiv:2407.16741). The paper argues CodeAct beats JSON tool calls on complex tasks. Identify one failure mode the paper acknowledges and write one sentence on when that mode would dominate in production.

Pick one task from your bug backlog that would require 10+ lines of change across two files. Estimate the end-to-end success probability for a frontier model under (a) JSON tool calls and (b) CodeAct. Justify the gap.

SWE-bench Verified has 161 single-file, 1–2 line tasks. Construct a score that excludes them. How does the leaderboard shuffle?

Read "Introducing SWE-bench Verified" (OpenAI). Explain the specific methodology used to remove ambiguous tasks, and name one category the curation would miss.

Key Terms

Term	What people say	What it actually means
SWE-bench	"Coding benchmark"	Real GitHub issues with ground-truth patches and test suites
SWE-bench Verified	"Cleaned subset"	500 human-curated tasks, easier-tail present
SWE-bench Pro	"Harder subset"	10+ line changes; frontier sits at 23–59%
CodeAct	"Code-as-action"	Agent emits Python; Jupyter-style kernel executes in sandbox
JSON tool call	"Function calling"	Each action is a structured JSON payload validated before execution
Scaffold	"Agent framework"	Retrieval + planner + executor + verifier loop around the base model
ACI (Agent-Computer Interface)	"SWE-agent's format"	Command set designed for LLM ergonomics, not human shells
Verifier loop	"Test-and-retry"	Run tests, read output, revise patch; biggest non-model reliability gain