Multimodal Agents and Computer-Use (Capstone)
> The 2026 frontier product is a multimodal agent that reads screenshots, clicks buttons, navigates web UIs, fills forms, and completes workflows end-to-end. SeeClick and CogAgent (2024) proved the GUI-grounding primitive. Ferret-UI added mobile. ChartAgent introduced visual tool-use for charts. VisualWebArena (2024) and AgentVista (2026) are the benchmarks the frontier chases, and even Gemini 3 Pro and Claude Opus 4.7 score only ~30% on AgentVista's hard tasks. This capstone pulls together every thread of Phase 12: perception (high-res VLM), reasoning (LLM with tool use), grounding (coordinate output), long-horizon memory, and evaluation.
Type: Capstone
Languages: Python (stdlib, action schema + agent loop skeleton)
Prerequisites: Phase 12 · 05 (LLaVA), Phase 12 · 09 (Qwen-VL JSON), Phase 14 (Agent Engineering)
Time: ~240 minutes
Learning Objectives
- Design a multimodal agent loop: perceive → reason → act → observe → repeat.
- Build a GUI grounding output schema (click coordinates, type text, scroll, drag) the VLM can emit as JSON.
- Compare screenshot-only agents vs accessibility-tree agents vs hybrid agents.
- Set up a multimodal agent benchmark evaluation on a small VisualWebArena slice.
The Problem
A booking-site workflow: "find me a flight to Tokyo for April 15, aisle seat under $800, book it."
A multimodal agent needs to:
- Take a screenshot of the browser.
- Parse the screenshot + URL + goal into a plan.
- Emit a structured action: click (at x,y), type "Tokyo" (at element E), scroll down, select (radio button).
- Apply the action to the browser.
- Observe the new state (next screenshot).
- Repeat until the task is done.
Each step is a multimodal VLM call. The VLM output must be parseable JSON. Errors compound across steps, so recovery matters.
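The steps above can be sketched as a single loop. This is a minimal illustration, not the lesson's scaffold: `run_agent`, `observe`, `apply`, and `scripted_policy` are hypothetical names, the "browser" is a plain dict, and the scripted policy stands in for what would be a VLM call in a real agent.

```python
def run_agent(goal, observe, decide, apply, max_steps=20):
    """Perceive -> reason -> act -> observe loop. Returns (success, history)."""
    history = []
    state = observe()                            # perceive
    for _ in range(max_steps):
        action = decide(goal, state, history)    # reason (a VLM call in a real agent)
        history.append(action)
        if action["action"] == "done":
            return action.get("success", False), history
        apply(action)                            # act
        state = observe()                        # observe the new state
    return False, history                        # step budget exhausted


# Toy stand-ins (assumptions for illustration only).
page = {"url": "mock://search", "query": ""}

def observe():
    return dict(page)

def apply(action):
    if action["action"] == "type":
        page["query"] = action["text"]
    elif action["action"] == "click":
        page["url"] = "mock://results"

def scripted_policy(goal, state, history):
    if state["query"] == "":
        return {"action": "type", "text": "Tokyo"}
    if state["url"] == "mock://search":
        return {"action": "click", "x": 384, "y": 220}
    return {"action": "done", "success": True, "explanation": "results shown"}

success, history = run_agent("search flights to Tokyo", observe, scripted_policy, apply)
```

The `max_steps` cap is one simple guard against compounding errors: a drifting agent fails the task instead of looping forever.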
The Concept
GUI grounding — the primitive
GUI grounding is: given a screenshot and a natural language instruction, output the (x, y) coordinate to click (or other action).
SeeClick (arXiv:2401.10935) was the first open result at scale: fine-tune a VLM on synthetic and real GUI data and have it output coordinates as plain text tokens. That simple recipe works.
CogAgent (arXiv:2312.08914) added 1120x1120 high-resolution encoding for dense UIs and scores ~84% on web navigation.
Ferret-UI (arXiv:2404.05719) focuses on mobile UIs, integrates with iOS accessibility data.
Output format is usually JSON:
{"action": "click", "x": 384, "y": 220, "element_desc": "Search button"}
The element_desc helps recovery: if coordinates drift between screenshots, the semantic hint lets the system re-ground.
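Models often wrap the JSON in extra prose, so the agent needs a tolerant parser before it can act. The sketch below is one possible approach using only the standard `json` module; `parse_action` is a hypothetical helper name, and the rule "first JSON object that has an `action` key" is an assumption, not something the lesson prescribes.

```python
import json

def parse_action(model_output):
    """Return the first JSON object with an "action" key, or None if there is none."""
    decoder = json.JSONDecoder()
    start = model_output.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(model_output, start)
            if isinstance(obj, dict) and "action" in obj:
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = model_output.find("{", start + 1)
    return None

reply = 'I will click search. {"action": "click", "x": 384, "y": 220, "element_desc": "Search button"}'
action = parse_action(reply)
```

Returning `None` rather than raising lets the agent loop decide what to do with an unparseable step, for example re-prompting the model.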
Action schemas
A typical action schema has 6-10 action types:
- click: (x, y)
- type: (text, x?, y?)
- scroll: (direction, amount)
- drag: (x0, y0, x1, y1)
- select: (option_index)
- hover: (x, y)
- navigate: (url)
- wait: (ms)
- done: (success, explanation)
The agent emits one action per step. The browser wrapper executes and returns the new state.
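Before the browser wrapper executes anything, it helps to check the emitted action against the schema. The following is a minimal sketch under stated assumptions: `ACTION_SCHEMA` and `validate_action` are hypothetical names, the field types are guesses from the action list above, and extra fields such as `element_desc` are simply ignored.

```python
NUM = (int, float)

# action name -> (required fields, optional fields), each mapping field -> accepted type(s)
ACTION_SCHEMA = {
    "click":    ({"x": NUM, "y": NUM}, {}),
    "type":     ({"text": str}, {"x": NUM, "y": NUM}),
    "scroll":   ({"direction": str, "amount": NUM}, {}),
    "drag":     ({"x0": NUM, "y0": NUM, "x1": NUM, "y1": NUM}, {}),
    "select":   ({"option_index": int}, {}),
    "hover":    ({"x": NUM, "y": NUM}, {}),
    "navigate": ({"url": str}, {}),
    "wait":     ({"ms": NUM}, {}),
    "done":     ({"success": bool, "explanation": str}, {}),
}

def validate_action(action):
    """Return a list of problems; an empty list means the action is valid."""
    name = action.get("action")
    if name not in ACTION_SCHEMA:
        return [f"unknown action: {name!r}"]
    required, optional = ACTION_SCHEMA[name]
    errors = []
    for field, typ in required.items():
        if field not in action:
            errors.append(f"missing field: {field}")
        elif not isinstance(action[field], typ):
            errors.append(f"bad type for {field}")
    for field, typ in optional.items():
        if field in action and not isinstance(action[field], typ):
            errors.append(f"bad type for {field}")
    return errors
```

The error strings are written so they can be fed back to the model verbatim as a correction prompt.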
Screenshot-only vs accessibility-tree
Three input modes:
- Screenshot-only: full image, no structural info. Most general; works on any app.
- Accessibility tree: structured DOM / iOS accessibility info. Much more reliable for grounding; works where the tree is available.
- Hybrid: both, with the tree as a reliable grounder for atomic actions and the screenshot for semantic context.
Production agents use hybrid when possible. Browser automation (Selenium + accessibility) always has the tree; desktop apps sometimes do.
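One way to picture the hybrid mode: the VLM proposes coordinates from the screenshot, and the tree corrects them. The sketch below assumes a simplified tree (a flat list of nodes with a `name` and a `bbox` of `(left, top, right, bottom)`); `ground_click` and `center` are hypothetical helpers, and the substring match on `element_desc` is a deliberately crude stand-in for real semantic matching.

```python
def center(bbox):
    left, top, right, bottom = bbox
    return ((left + right) // 2, (top + bottom) // 2)

def ground_click(action, nodes):
    """Resolve a predicted click against accessibility-tree nodes; returns (x, y)."""
    x, y = action["x"], action["y"]
    # 1. If the predicted point lands inside a node, snap to that node's centre.
    for node in nodes:
        left, top, right, bottom = node["bbox"]
        if left <= x <= right and top <= y <= bottom:
            return center(node["bbox"])
    # 2. Coordinates missed every node: re-ground using the semantic hint.
    desc = action.get("element_desc", "").lower()
    if desc:
        for node in nodes:
            name = node["name"].lower()
            if name and (name in desc or desc in name):
                return center(node["bbox"])
    # 3. Nothing matched: fall back to the raw screenshot coordinates.
    return (x, y)

nodes = [
    {"name": "Search", "bbox": (360, 200, 420, 240)},
    {"name": "Cancel", "bbox": (440, 200, 500, 240)},
]
on_target = ground_click({"action": "click", "x": 384, "y": 220}, nodes)
drifted = ground_click({"action": "click", "x": 10, "y": 10, "element_desc": "Search button"}, nodes)
```

Checking point containment before the description is a design choice of this sketch; an agent that distrusts its coordinates more than its descriptions could reverse the order.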
Long-horizon memory
A 20-step workflow generates 20 screenshots. The VLM's context fills up fast. Three compression strategies:
- Summary-chain: after every 5 steps, summarize what has happened, drop old screenshots.
- Skip-frame: keep the first, last, and every 3rd screenshot.
- Tool-recorded log: execute actions, keep a text log of what was done; don't re-look at old screenshots.
Claude's computer-use API uses the log pattern, which is simpler and more reliable.
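Two of these strategies are small enough to sketch directly. The code below is an illustrative assumption, not any vendor's implementation: `skip_frame`, `log_line`, and `build_context` are hypothetical names, and screenshots are represented by opaque placeholders.

```python
def skip_frame(screenshots, every=3):
    """Keep the first, the last, and every `every`-th screenshot (by step index)."""
    last = len(screenshots) - 1
    return [s for i, s in enumerate(screenshots)
            if i == 0 or i == last or i % every == 0]

def log_line(step, action, ok):
    """One line of text history per executed action."""
    args = ", ".join(f"{k}={v!r}" for k, v in action.items() if k != "action")
    status = "ok" if ok else "failed"
    return f"step {step}: {action['action']}({args}) -> {status}"

def build_context(goal, log, current_screenshot):
    """Log pattern: goal + text log + only the latest screenshot."""
    return {"goal": goal, "history": "\n".join(log), "images": [current_screenshot]}

kept = skip_frame(list(range(20)))   # 20 steps, screenshots labelled by step index
line = log_line(1, {"action": "click", "x": 384, "y": 220}, True)
context = build_context("book a flight", [line], "screenshot-20")
```

With 20 screenshots, skip-frame still keeps 8 images in context, while the log pattern keeps exactly one regardless of workflow length.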
Visual tool use
ChartAgent (arXiv:2510.04514) introduces visual tool use for chart understanding: crop, zoom, OCR, call external detection. The agent can output "crop to region (100, 200, 300, 400) then call OCR" as a tool call. The tool returns text; the VLM continues reasoning.
This pattern generalizes: set-of-mark prompting, region annotation, and external detection tools all fit the same "output a tool call, receive a structured response" schema.
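The "output a tool call, receive a structured response" schema can be sketched as a small dispatcher. Everything here is an assumption made for illustration: the image is a list of text rows so the example stays stdlib-only, `mock_ocr` just reads characters back, and the call format (`tool` plus named `args`) is not taken from ChartAgent.

```python
def crop(image, left, top, right, bottom):
    """Crop a row-major image (a list of rows) to a half-open box."""
    return [row[left:right] for row in image[top:bottom]]

def mock_ocr(image):
    """Stand-in for a real OCR tool: join the rows and strip padding."""
    return "\n".join("".join(row).strip() for row in image).strip()

TOOLS = {"crop": crop, "ocr": mock_ocr}

def run_tool_calls(image, calls):
    """Apply tool calls in order, feeding each result into the next tool."""
    result = image
    for call in calls:
        tool = TOOLS[call["tool"]]
        result = tool(result, **call.get("args", {}))
    return result

# Toy "screenshot": a price label embedded in empty space.
image = [
    "........",
    "..$799..",
    "........",
]
calls = [
    {"tool": "crop", "args": {"left": 2, "top": 1, "right": 6, "bottom": 2}},
    {"tool": "ocr"},
]
text = run_tool_calls(image, calls)
```

The returned text goes back into the model's context as the tool response, and the VLM continues reasoning from there.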
The 2026 benchmarks
- ScreenSpot-Pro. GUI grounding on ~1k web screenshots. Open SOTA Qwen2.5-VL-72B ~85%. Frontier ~90%.
- VisualWebArena. End-to-end web tasks (shop, forum, classifieds). Open SOTA ~20%. Gemini 3 Pro ~27%.
- AgentVista (arXiv:2602.23166). The hardest 2026 benchmark. Realistic workflows across 12 domains. Frontier models score 27-40%; open models 10-20%.
- WebArena / WebShop. Older benchmarks; saturated by frontier.
Why it's still hard
Agent performance bottlenecks:
- Visual grounding at fine scale. "Click the small X" fails often at mobile resolution.
- Long-horizon planning. After 10 actions, the agent drifts from the goal.
- Error recovery. When a click fails (wrong button), detecting the failure and recovering from it is rarely covered in training data.
- Cross-page context. Jumping between tabs or long forms loses state.
Research directions: memory architectures, explicit replanning, multimodal verification (screenshot match for action success).
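A crude form of multimodal verification is to check whether the screen changed after an action and retry if it did not. The sketch below assumes screenshots arrive as raw bytes; `fingerprint` and `act_with_verification` are hypothetical names. An exact hash is brittle in practice (animations, blinking cursors), and actions such as `wait` or `hover` may legitimately leave the screen unchanged, so treat this as a starting point only.

```python
import hashlib

def fingerprint(screenshot_bytes):
    """Exact-match fingerprint of a screenshot."""
    return hashlib.sha256(screenshot_bytes).hexdigest()

def act_with_verification(action, apply, screenshot, max_retries=2):
    """Apply an action; treat 'screen did not change' as failure and retry.

    Returns (ok, attempts_used).
    """
    for attempt in range(1, max_retries + 2):
        before = fingerprint(screenshot())
        apply(action)
        if fingerprint(screenshot()) != before:
            return True, attempt
    return False, max_retries + 1

# Toy browser whose first click is silently dropped (assumption for illustration).
screen = {"pixels": b"page-1", "clicks": 0}

def shot():
    return screen["pixels"]

def flaky_apply(action):
    screen["clicks"] += 1
    if screen["clicks"] >= 2:
        screen["pixels"] = b"page-2"

ok, attempts = act_with_verification({"action": "click", "x": 384, "y": 220}, flaky_apply, shot)
```

When the retries are exhausted, the `False` result is the signal for explicit replanning rather than blindly continuing.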
The capstone build-it
The capstone task: build a computer-use agent that:
- Reads the HTML + screenshot of a booking-site mock page.
- Plans a multi-step sequence: search → select → fill form → submit.
- Emits JSON actions matching the action schema.
- Evaluates on a fixed 10-task slice.
The lesson provides scaffold code that is easy to extend into a real browser.
Use It
code/main.py is the capstone scaffold:
- Action schema JSON definition (10 actions).
- Mock browser state as dict.
- Agent loop skeleton: receive state, emit action, apply, loop.
- 10-task mini-benchmark (synthetic pages) to measure end-to-end success rate.
- Error-recovery hook for when an action fails.
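The mini-benchmark piece can be pictured as a small harness that runs the agent on each task and scores the final state. This is a hedged sketch rather than the contents of code/main.py: `evaluate`, `toy_agent`, and the task fields (`id`, `goal`, `check`) are hypothetical.

```python
def evaluate(agent, tasks):
    """Run `agent(task)` on each task; return (success_rate, per-task results)."""
    results = {}
    for task in tasks:
        try:
            final_state = agent(task)
            results[task["id"]] = bool(task["check"](final_state))
        except Exception:
            # A crash counts as a failed task, not a failed benchmark run.
            results[task["id"]] = False
    rate = sum(results.values()) / len(results) if results else 0.0
    return rate, results

def toy_agent(task):
    # Hypothetical agent: takes the last word of the goal as the destination.
    return {"destination": task["goal"].split()[-1]}

tasks = [
    {"id": "t1", "goal": "fly to Tokyo", "check": lambda s: s["destination"] == "Tokyo"},
    {"id": "t2", "goal": "fly to Osaka", "check": lambda s: s["destination"] == "Osaka"},
    {"id": "t3", "goal": "",             "check": lambda s: s["destination"] == "Kyoto"},
]
rate, results = evaluate(toy_agent, tasks)
```

Keeping per-task results alongside the aggregate rate makes it easier to see which task categories fail, which is where the error-recovery hook earns its keep.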
Ship It
This lesson produces outputs/skill-multimodal-agent-designer.md. Given a computer-use product (domain, action set, evaluation target), the skill designs the full agent loop, memory strategy, and grounding mode, and estimates the expected benchmark score.
Exercises
- Extend the action schema with a screenshot_region tool (crop + zoom). What tasks benefit?
- Read AgentVista (arXiv:2602.23166). Describe the hardest task category and why frontier models still fail.
- Long-horizon memory compression: design a summary-chain with ≤4 screenshots kept live, any number logged.
- Build an error-recovery hook: on action failure (button not found), what does the agent do next?
- Compare screenshot-only Claude Opus 4.7 to hybrid screenshot + accessibility-tree Qwen2.5-VL on 10 web tasks. Which wins on which tasks?
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| GUI grounding | "Click coordinates" | Model outputs (x,y) for the target of an instruction on a screenshot |
| Action schema | "Tool definitions" | JSON description of valid actions (click, type, scroll, drag) |
| Accessibility tree | "Structured DOM" | Machine-readable UI hierarchy from browser/iOS APIs |
| Hybrid agent | "Screenshot + tree" | Uses both image and structured info; more reliable than either alone |
| Visual tool use | "Zoom/crop/detect" | Agent calls external vision tools (OCR, detection) mid-plan |
| Summary-chain | "Memory compression" | Periodic text summaries replace long screenshot history |
| VisualWebArena | "E2E web bench" | 2024 benchmark for end-to-end web tasks |
| AgentVista | "2026 hard bench" | 12-domain realistic workflows; even Gemini 3 Pro scores ~30% |