Prompt Injection and the PVE Defense

> Greshake et al. (AISec 2023) established indirect prompt injection as the defining agent security problem. Attacker plants instructions in data the agent retrieves; on ingest, those instructions override the developer prompt. Treat all retrieved content as arbitrary code execution on the tool-use surface.

Type: Build

Languages: Python (stdlib)

Prerequisites: Phase 14 · 06 (Tool Use), Phase 14 · 21 (Computer Use)

Time: ~75 minutes

Learning Objectives

The Problem

LLMs cannot reliably distinguish instructions that come from the user from instructions that come from retrieved content. A PDF, a web page, a memory note, or a previous agent turn can carry send $100 to X and the model may execute it as if the user asked.

This is the defining agent security problem of 2024-2026. Every production agent has to defend against it.

The Concept

Greshake et al., AISec 2023 (arXiv:2302.12173)

Attack class: indirect prompt injection.

- Data theft — agent exfiltrates conversation history to attacker-controlled URL.

- Worming — injected content instructs agent to embed the exploit in next output.

- Persistent memory poisoning — agent stores attacker's instructions; re-poisons self on next session.

- Information ecosystem contamination — injected facts spread to other agents through shared memory.

- Arbitrary tool use — any tool in the registry becomes attacker-reachable.

Central claim: processing retrieved prompts is equivalent to arbitrary code execution on the agent's tool-use surface.

The 2026 defense doctrine

Six controls that have converged across vendor guidance:

  1. Treat all retrieved content as untrusted. OpenAI CUA docs: "only direct instructions from the user count as permission."
  2. Allowlist / blocklist navigation. Narrow the set of URLs, domains, or files the agent can touch.
  3. Per-step safety evaluation. Gemini 2.5 Computer Use pattern — assess each action before execution.
  4. Guardrails on tool inputs and outputs. Lesson 16 (OpenAI Agents SDK); Lesson 06 (argument validation).
  5. Human-in-the-loop confirmation. Login, purchase, CAPTCHA, send-message — human decides.
  6. Content capture with external storage. Lesson 23 — store retrieved content externally; spans carry references, not prose; incidents are auditable.

PVE: Prompt-Validator-Executor

Deployment pattern that combines several controls:

The trade-off: an extra inference per tool call. For the vast majority of agent products, this is cheap insurance.

Where defenses fail

Build It

code/main.py implements PVE:

Run it:

python3 code/main.py

Output: per-call trace showing validator verdicts and executor behavior.

Use It

Ship It

outputs/skill-injection-defense.md scaffolds a PVE layer + content-capture discipline for any agent runtime.

Exercises

  1. Add a "source tag" to every piece of content: user_message, tool_output, retrieved. Propagate tags through the message history. Validator refuses retrieved content that looks like directives.
  2. Implement a memory-write guardrail: any memory write that looks like an instruction ("do X", "execute Y") is refused.
  3. Write a worming attack simulation: injected content tells the agent to include the exploit in its next response. Defend against it.
  4. Read Greshake et al. end to end. Implement one of the demonstrated exploits in your toy. Fix it.
  5. Measure: on normal traffic, how often does the PVE validator reject? Target: near-zero on legitimate calls.

Key Terms

Term What people say What it actually means
Indirect prompt injection "Injection in retrieved content" Instructions embedded in data the agent retrieves
Direct prompt injection "Jailbreak" User-supplied prompt bypasses guardrails
PVE "Prompt-Validator-Executor" Cheap fast validator before expensive main inference
Source tag "Content provenance" Metadata marking where content came from
Allowlist navigation "URL whitelist" Agent can only visit approved destinations
Worming "Self-replicating exploit" Injected content includes instructions to propagate
Memory poisoning "Persistent injection" Injected content stored as memory; re-poisons next session

Further Reading