Capstone 16 — GitHub Issue-to-PR Autonomous Agent
> AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, and Google Jules all ship the same 2026 product shape: label an issue, get a PR. Run an agent in a cloud sandbox, verify tests pass, and post a review-ready PR with rationale. The hard parts are reproducing the repo's build environment automatically, preventing credential leakage, enforcing per-repo budgets, and making sure the agent cannot force-push. This capstone builds the self-hosted version and compares it on cost and pass rate to the hosted alternatives.
Type: Capstone
Languages: Python (agent), TypeScript (GitHub App), YAML (Actions)
Prerequisites: Phase 11 (LLM engineering), Phase 13 (tools), Phase 14 (agents), Phase 15 (autonomous), Phase 17 (infrastructure)
Phases exercised: P11 · P13 · P14 · P15 · P17
Time: 30 hours
Problem
The async cloud coding agent is a separate product category from interactive coding agents (capstone 01). The UX is a GitHub label. You label an issue @agent fix this, a worker spins up in a cloud sandbox, clones the repo, runs tests, edits files, verifies, and opens a PR with the agent's rationale in the body. No interactive loop, no terminal. AWS Remote SWE Agents, Cursor Background Agents, OpenAI Codex cloud, Google Jules, and Factory Droids all converge on this.
The engineering challenges are concrete: environment reproduction (the agent has to build the repo from scratch without a cached dev image), flaky tests (must be re-run or isolated), credential scoping (a GitHub App with minimal fine-grained permissions), budget enforcement per repo per day, and no-force-push policy. The capstone measures pass rate, cost, and safety vs the hosted alternatives.
Concept
The trigger is a GitHub webhook (issue label or PR comment). A dispatcher enqueues work to ECS Fargate or Lambda. The worker pulls the repo into a Daytona or E2B sandbox with a generic Dockerfile inferred from the repo (language, framework). The agent runs a mini-swe-agent or SWE-agent v2 loop against Claude Opus 4.7 or GPT-5.4-Codex. It iterates: read code, propose fix, apply patch, run tests.
Verification is the gating step. Full CI must pass in the sandbox before the PR opens. Coverage delta is computed; if negative beyond a threshold, the PR opens but gets labeled needs-review. The agent posts the rationale as the PR description plus an @agent thread the reviewer can ping for follow-ups.
Safety is scoped through two different GitHub surfaces: the App provides a short-lived installation token with workflows: read and narrow repo contents/PR scopes; branch protection (not app permissions) enforces "no direct writes to main" and "no force-push" — the app is never added to the bypass list. Path-scoped read-only access to .github/workflows is not a real GitHub App primitive, so the agent's allow-list on file edits has to enforce that at the worker. Budget ceilings per repo per day are enforced at the dispatcher (e.g., max 5 PRs per repo per day, $20 per PR).
Architecture
GitHub issue labeled `@agent fix` or PR comment
|
v
GitHub App webhook -> AWS Lambda dispatcher
|
v
ECS Fargate task (or GitHub Actions self-hosted runner)
- pull repo
- infer Dockerfile (language, package manager)
- Daytona / E2B sandbox with target runtime
- clone -> git worktree -> agent branch
|
v
mini-swe-agent / SWE-agent v2 loop
Claude Opus 4.7 or GPT-5.4-Codex
tools: ripgrep, tree-sitter, read/edit, run_tests, git
|
v
verify CI passes in-sandbox + coverage delta check
|
v (verified)
git push + open PR via GitHub App
PR body = rationale + diff summary + trace URL
label: needs-review
|
v
operator reviews; can @-mention agent for follow-ups
Stack
- Trigger: GitHub App with fine-grained token; webhook receiver via Lambda or Fly.io
- Worker: ECS Fargate task (or GitHub Actions self-hosted runner)
- Sandbox: Daytona devcontainer or E2B sandbox per task
- Agent loop: mini-swe-agent baseline or SWE-agent v2 over Claude Opus 4.7 / GPT-5.4-Codex
- Retrieval: tree-sitter repo-map + ripgrep
- Verification: full CI in-sandbox + coverage delta gate
- Observability: Langfuse with per-PR trace archive linked from the PR body
- Budget: per-repo daily dollar ceiling; max PRs per repo per day
Build It
- GitHub App. Fine-grained installation token: issues read+write, pull_requests write, contents read+write, workflows read. Branch protection (the only surface that can do this) enforces "no direct push to
main" and "no force-push"; the app is not in the bypass list. The worker enforces "no writes under.github/workflows" as an allow-list check on the proposed diff, since GitHub App permissions are not path-scoped.
- Webhook receiver. Lambda function accepts issue label / PR comment webhooks. Filters by label
@agent fix this. Enqueues to SQS.
- Dispatcher. Pops tasks from SQS. Enforces per-repo per-day budget. Spins up an ECS Fargate task with the repo URL, issue body, and a fresh Daytona sandbox.
- Environment inference. Detect language (Python, Node, Go, Rust) and package manager (uv, pnpm, go mod, cargo). Generate a Dockerfile on the fly if one does not exist.
- Agent loop. mini-swe-agent or SWE-agent v2 with Claude Opus 4.7. Tools: ripgrep, tree-sitter repo-map, read_file, edit_file, run_tests, git. Hard limits: $20 cost, 30 min wall-clock, 30 agent turns.
- Verification. After the loop concludes, run the full test suite in-sandbox. Compute coverage delta via jacoco / coverage.py. If CI red: halt, do not open PR. If coverage drops more than 2%: open PR with
needs-reviewlabel.
- PR posting. Push the agent branch. Open PR via GitHub API with: title, rationale, diff summary, trace URL, cost, turns.
- Credential hygiene. Worker runs with a short-lived GitHub App installation token. Logs are scrubbed for secrets before archival.
- Eval. 30 seeded internal issues of varying difficulty. Measure pass rate, PR quality (diff size, style, coverage), cost, latency. Compare with Cursor Background Agents and AWS Remote SWE Agents on the same issues.
Use It
# on github.com
- user labels issue #842 with `@agent fix this`
- PR #1903 appears 14 minutes later
- body:
> Fixed NPE in widget.dedupe() caused by null comparator entry.
> Added regression test widget_test.go::TestDedupeNullComparator.
> Coverage delta: +0.12%
> Turns: 7 Cost: $1.80 Trace: langfuse:...
> Label: needs-review
Ship It
outputs/skill-issue-to-pr.md is the deliverable. A GitHub App + async cloud worker that turns labeled issues into review-ready PRs with bounded cost and scoped credentials.
| Weight | Criterion | How it is measured |
|---|---|---|
| 25 | Pass rate on 30 issues | End-to-end success (CI green + coverage OK) |
| 20 | PR quality | Diff size, coverage delta, style conformance |
| 20 | Cost and latency per resolved issue | $ and wall-clock per PR |
| 20 | Safety | Scoped token, per-repo budget, no force-push, credential hygiene |
| 15 | Operator UX | Rationale comments, retry affordance, @-mention follow-up |
| 100 |
Exercises
- Add a "fix flaky test" mode: the label
@agent stabilize-flake TestXruns the test 50 times in-sandbox and proposes a minimal change that stabilizes it.
- Compare cost vs Cursor Background Agents on three shared issues. Report which tools win where.
- Implement a budget dashboard: per-repo per-day cost, per-user cost. Alert on anomaly.
- Build a "dry-run" mode that opens a draft PR without running CI, so reviewers can examine the plan cheap.
- Add a retention policy: PR branches older than 7 days without merge get deleted automatically.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| GitHub App | "Scoped bot identity" | App with fine-grained permissions + short-lived installation token |
| Async cloud agent | "Background agent" | Non-interactive worker that runs in a cloud sandbox, not a terminal |
| Environment inference | "Dockerfile synthesis" | Detect language + package manager, generate a Dockerfile if absent |
| Verification | "CI-in-sandbox" | Run the full test suite inside the worker before opening a PR |
| Coverage delta | "Coverage preservation" | Change in test coverage % from base to agent branch |
| Per-repo budget | "Daily ceiling" | Dollar and PR-count cap enforced at the dispatcher |
| Rationale | "PR body explanation" | Agent's summary of what changed and why; required in the PR body |
Further Reading
- AWS Remote SWE Agents — the canonical async cloud agent reference
- SWE-agent — CLI reference
- Cursor Background Agents — commercial alternative
- OpenAI Codex (cloud) — hosted competitor
- Google Jules — Google's hosted version
- Factory Droids — alternate commercial reference
- GitHub App documentation — scoped bot identity
- Daytona cloud sandboxes — reference sandbox