Long-Running Background Agents: Durable Execution

> Production long-horizon agents do not run in a bare while True loop. Every LLM call becomes an activity with checkpoint, retry, and replay. Temporal's OpenAI Agents SDK integration went GA March 2026. Claude Code Routines (Anthropic) runs scheduled Claude Code invocations without a persistent local process. Sessions pause on human input, survive deploys, and resume from the latest checkpoint keyed by thread_id. Behind the new ergonomics sits an old pattern, workflow orchestration, with one new input: LLM calls as non-deterministic activities whose recorded results must be replayed deterministically on recovery.

Type: Learn

Languages: Python (stdlib, minimal durable-execution state machine)

Prerequisites: Phase 15 · 10 (Permission modes), Phase 15 · 01 (Long-horizon agents)

Time: ~60 minutes

The Problem

Consider an agent that runs for four hours. It calls three tools, prompts the user twice, and makes forty LLM calls. Halfway through, the host it is running on reboots. What happens?

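Durable execution answers this by logging every step, so that after the reboot the run resumes from the last recorded step instead of starting over. This is the same pattern workflow engines have shipped for a decade (Temporal, Cadence, and Cadence's predecessor at Uber, the Cherami task queue). What's new is that LLM calls are now a kind of activity (non-deterministic, expensive, with side effects), and they fit this pattern cleanly.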

The running theme of the lesson is that long-horizon reliability decays: METR observes a "35-minute degradation", where success rate drops roughly quadratically with horizon. Durable execution makes it possible to run longer than the reliability profile supports. Designed well, that is a new way to fail safely; designed badly, it is a new way to fail unsafely.

The Concept

Activities, workflows, and replay

This is the same shape as React re-rendering against a virtual DOM, or Git rebuilding a working tree from commits. Determinism in the orchestrator is what makes durability cheap.
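The replay idea can be sketched in a few lines of stdlib Python. This is an illustrative toy, not the contents of code/main.py and not any real engine's API: the names Engine, Crash, and workflow are hypothetical, the event log is an in-memory dict standing in for durable storage, and activities are keyed by name where real engines typically key by position in the event history.

```python
class Crash(Exception):
    """Simulated host failure (hypothetical, for the demo only)."""


class Engine:
    def __init__(self, log=None):
        # Event log: activity name -> recorded result.
        # A real engine would persist this; a dict stands in for it here.
        self.log = {} if log is None else log
        self.executed = []  # activities actually run by this engine instance

    def activity(self, name, fn, *args):
        if name in self.log:
            # Replay path: return the recorded result without re-executing.
            return self.log[name]
        result = fn(*args)
        self.executed.append(name)
        self.log[name] = result
        return result


def workflow(engine, crash_after=None):
    # Deterministic orchestration: same log in, same decisions out.
    plan = engine.activity("plan", lambda: "outline")
    if crash_after == "plan":
        raise Crash()
    draft = engine.activity("draft", lambda p: f"draft of {p}", plan)
    if crash_after == "draft":
        raise Crash()
    return engine.activity("review", lambda d: f"reviewed {d}", draft)


durable_log = {}
first = Engine(durable_log)
try:
    workflow(first, crash_after="draft")
except Crash:
    pass  # the "host" died after two activities were logged

second = Engine(durable_log)  # fresh process, same durable log
result = workflow(second)

print(first.executed)   # ['plan', 'draft']
print(second.executed)  # ['review']
print(result)           # reviewed draft of outline
```

The second run re-enters the workflow from the top, but only the activity missing from the log executes. That works only because the orchestration code makes the same decisions given the same logged results.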

Why LLM calls fit the pattern

LLM calls are:

This is exactly the activity profile. Wrapping every LLM call as an activity gives you retry with exponential backoff, checkpointing across restarts, and a replayable trace for debugging.
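As a rough sketch of the retry half of that claim, the wrapper below retries a flaky call with exponential backoff. Everything here is an assumption for illustration: run_activity, TransientError, and flaky_llm_call are hypothetical names, the "LLM call" is a stub that fails twice, and the sleep function is injected so the demo does not actually wait.

```python
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as a rate limit (hypothetical)."""


def run_activity(fn, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run fn, retrying on TransientError with exponential backoff.

    Returns (result, delays) so the caller can inspect the backoff schedule.
    """
    delays = []
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), delays
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure
            delay = base_delay * 2 ** (attempt - 1)  # 0.5, 1.0, 2.0, ...
            delays.append(delay)
            sleep(delay)


calls = {"n": 0}


def flaky_llm_call():
    # Stub: fails on the first two attempts, succeeds on the third.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "completion text"


result, delays = run_activity(flaky_llm_call, sleep=lambda seconds: None)
print(result, delays)  # completion text [0.5, 1.0]
```

In a durable engine the result would also be written to the event log once the call succeeds, so a later replay returns it without paying for the call again.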

Checkpoints keyed by thread_id

LangGraph, Microsoft Agent Framework, Cloudflare Durable Objects, and Claude Code Routines all converged on the same API shape: a thread_id (or equivalent) identifies the session; each state transition persists to a backend (PostgreSQL default, SQLite for dev, Redis for cache); resume reads the latest checkpoint.
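A minimal sketch of that shape, using SQLite from the standard library. This is not the API of LangGraph or any of the other systems named above; CheckpointStore and its table layout are hypothetical, chosen only to show "persist every transition, resume from the latest row for a thread_id".

```python
import json
import sqlite3


class CheckpointStore:
    """Hypothetical checkpoint store: one row per (thread_id, step)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            " thread_id TEXT NOT NULL,"
            " step INTEGER NOT NULL,"
            " state TEXT NOT NULL,"
            " PRIMARY KEY (thread_id, step))"
        )

    def save(self, thread_id, step, state):
        # The connection context manager commits on success
        # and rolls back if the block raises.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, step, json.dumps(state)),
            )

    def latest(self, thread_id):
        # Resume reads the highest step recorded for this thread only.
        row = self.db.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ?"
            " ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))


store = CheckpointStore()
store.save("thread-a", 1, {"phase": "plan"})
store.save("thread-a", 2, {"phase": "draft"})
store.save("thread-b", 1, {"phase": "plan"})

print(store.latest("thread-a"))  # (2, {'phase': 'draft'})
print(store.latest("thread-b"))  # (1, {'phase': 'plan'})
print(store.latest("missing"))   # None
```

Scoping every read and write by thread_id is what keeps two sessions that share one backend from seeing each other's state.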

The backend choice matters:

Human-input as a first-class state

Propose-then-commit (Lesson 15) requires a durable "waiting on human" state. The workflow pauses, the external queue holds the pending request, and an approval resumes from exactly that point. Without durability this is best-effort; with it, an overnight approval arrives and the workflow picks up in the morning.
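One way to picture the durable "waiting on human" state is a small state file that outlives the process. The sketch below is an assumption-laden toy: propose, resume, the JSON layout, and the status names are all hypothetical, and a single file stands in for the external queue and checkpoint backend.

```python
import json
import os
import tempfile


def load(path):
    with open(path) as f:
        return json.load(f)


def save(path, state):
    # Write to a temp file, then rename, so a crash mid-write
    # never leaves a half-written state file behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def propose(path, action):
    # Persist the pending request; the process may exit after this.
    save(path, {"status": "waiting_on_human", "proposal": action, "result": None})


def resume(path, approved):
    state = load(path)
    if state["status"] != "waiting_on_human":
        return state  # nothing pending: resuming twice is harmless
    if approved:
        state["result"] = f"committed: {state['proposal']}"
        state["status"] = "done"
    else:
        state["status"] = "rejected"
    save(path, state)
    return state


path = os.path.join(tempfile.mkdtemp(), "thread-a.json")
propose(path, "delete stale branches")
# ... the process exits here; the approval arrives the next morning ...
final = resume(path, approved=True)
print(final["status"], "|", final["result"])
# done | committed: delete stale branches
```

The important property is that nothing about the pending request lives only in memory: a different process, hours later, can pick up the approval and continue from the same point.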

The 35-minute degradation

METR observed that every agent class measured shows reliability decay beyond ~35 minutes of continuous operation. Doubling the task duration roughly quadruples the failure rate. Durable execution does not fix this; it lets you run longer than the reliability profile supports. The safe pattern is to combine durability with checkpoints that require fresh human-in-the-loop (HITL) approval on re-entry, and with budget kill switches (Lesson 13) that cap total compute regardless of wall-clock time.
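The two safeguards can be sketched as a guard that every LLM call must pass through. This is a hypothetical illustration, not the Lesson 13 implementation: RunGuard, NeedsApproval, and BudgetExceeded are invented names, and the call counter is kept in memory here, whereas a real deployment would persist it with the checkpoint so that a restart cannot reset the budget.

```python
class BudgetExceeded(Exception):
    """Raised when the total compute cap is hit (hypothetical)."""


class NeedsApproval(Exception):
    """Raised when a resumed run has no fresh human approval (hypothetical)."""


class RunGuard:
    def __init__(self, max_llm_calls):
        self.max_llm_calls = max_llm_calls
        self.llm_calls = 0  # assumption: persisted alongside the checkpoint
        self.approved_for_segment = False

    def on_resume(self):
        # Re-entry from a checkpoint starts a new segment;
        # an earlier approval does not carry over.
        self.approved_for_segment = False

    def approve(self):
        self.approved_for_segment = True

    def before_llm_call(self):
        if not self.approved_for_segment:
            raise NeedsApproval()
        if self.llm_calls >= self.max_llm_calls:
            raise BudgetExceeded()  # cap applies regardless of wall-clock time
        self.llm_calls += 1


events = []
guard = RunGuard(max_llm_calls=3)
guard.approve()
guard.before_llm_call()
guard.before_llm_call()

guard.on_resume()  # simulated restart from a checkpoint
try:
    guard.before_llm_call()
except NeedsApproval:
    events.append("needs_approval")

guard.approve()
guard.before_llm_call()  # third and final call within budget
try:
    guard.before_llm_call()
except BudgetExceeded:
    events.append("budget_exceeded")

print(events, guard.llm_calls)  # ['needs_approval', 'budget_exceeded'] 3
```

The budget counts work done, not time elapsed, so a run that sleeps overnight waiting for approval does not get extra compute when it wakes up.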

When durable execution is the wrong answer

Use It

code/main.py implements a minimal durable-execution engine in stdlib Python. It supports:

The driver simulates a three-activity workflow, crashes halfway through, and shows (a) a naive retry re-executing everything versus (b) a replay running only the missing activity.

Ship It

outputs/skill-durable-execution-review.md reviews a proposed long-running agent deployment for correct durable-execution shape: activities, determinism, checkpoint backend, human-input state, and HITL-on-resume policy.

Exercises

  1. Run code/main.py. Observe the difference in activity-execution count between naive retry and replay. Change the crash point and show the replay count changes accordingly.
  2. Convert the toy engine to use thread_id explicitly. Simulate two concurrent sessions sharing the engine and confirm their event logs do not collide.
  3. Take one activity in the toy engine. Introduce a non-determinism (a wall-clock timestamp inside a workflow decision). Demonstrate the divergence on replay. Explain how real engines handle this (side-effect registration, Workflow.now() APIs).
  4. Read the LangChain "Runtime behind production deep agents" post. List every state that the runtime persists and name which failure mode each covers.
  5. Design a checkpoint policy for a 6-hour autonomous coding task. Where do you checkpoint? What does resume-on-crash look like? What requires fresh HITL?

Key Terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Workflow | "Agent's script" | Deterministic orchestration code; replayable from event log |
| Activity | "A step" | Non-deterministic unit (LLM call, tool call); logged before and after |
| Event log | "The backing store" | Durable record of every state transition |
| Replay | "Resume" | Re-run workflow; completed activities return logged results without re-execution |
| Checkpoint | "Save point" | Persisted state keyed by thread_id; latest-wins on resume |
| thread_id | "Session key" | Identifier that scopes durable state |
| 35-minute degradation | "Reliability decay" | METR: success rate drops ~quadratically with horizon |
| Non-determinism | "Drift on replay" | Wall clock, random, LLM output; must be registered as side effect |

Further Reading