Prompt Injection and the PVE Defense
> Greshake et al. (AISec 2023) established indirect prompt injection as the defining agent security problem. Attacker plants instructions in data the agent retrieves; on ingest, those instructions override the developer prompt. Treat all retrieved content as arbitrary code execution on the tool-use surface.
Type: Build
Languages: Python (stdlib)
Prerequisites: Phase 14 · 06 (Tool Use), Phase 14 · 21 (Computer Use)
Time: ~75 minutes
Learning Objectives
- State the indirect prompt injection threat model from Greshake et al.
- Name the five demonstrated exploit classes (data theft, worming, persistent memory poisoning, ecosystem contamination, arbitrary tool use).
- Describe the 2026 defense doctrine: untrusted content, allowlist navigation, per-step safety, guardrails, human-in-the-loop, external capture.
- Implement a PVE (Prompt-Validator-Executor) pattern — cheap fast validator before the expensive main model commits to a tool call.
The Problem
LLMs cannot reliably distinguish instructions that come from the user from instructions that come from retrieved content. A PDF, a web page, a memory note, or a previous agent turn can carry and the model may execute it as if the user asked.
This is the defining agent security problem of 2024-2026. Every production agent has to defend against it.
The Concept
Greshake et al., AISec 2023 (arXiv:2302.12173)
Attack class: indirect prompt injection.
- Attacker controls content the agent will retrieve: web page, PDF, email, memory note, search result.
- When ingested, the instructions in that content override the developer prompt.
- Demonstrated exploits against Bing Chat, GPT-4 code completion, synthetic agents:
- Data theft — agent exfiltrates conversation history to attacker-controlled URL.
- Worming — injected content instructs agent to embed the exploit in next output.
- Persistent memory poisoning — agent stores attacker's instructions; re-poisons self on next session.
- Information ecosystem contamination — injected facts spread to other agents through shared memory.
- Arbitrary tool use — any tool in the registry becomes attacker-reachable.
Central claim: processing retrieved prompts is equivalent to arbitrary code execution on the agent's tool-use surface.
The 2026 defense doctrine
Six controls that have converged across vendor guidance:
- Treat all retrieved content as untrusted. OpenAI CUA docs: "only direct instructions from the user count as permission."
- Allowlist / blocklist navigation. Narrow the set of URLs, domains, or files the agent can touch.
- Per-step safety evaluation. Gemini 2.5 Computer Use pattern — assess each action before execution.
- Guardrails on tool inputs and outputs. Lesson 16 (OpenAI Agents SDK); Lesson 06 (argument validation).
- Human-in-the-loop confirmation. Login, purchase, CAPTCHA, send-message — human decides.
- Content capture with external storage. Lesson 23 — store retrieved content externally; spans carry references, not prose; incidents are auditable.
PVE: Prompt-Validator-Executor
Deployment pattern that combines several controls:
- A cheap, fast validator model runs on every candidate tool invocation before the expensive main model commits.
- Validator checks: is this action consistent with the user's stated intent? Does the action touch a sensitive surface? Is there injection-shaped content in the arguments?
- If the validator rejects, the main model is told "that action was refused; try a different approach."
The trade-off: an extra inference per tool call. For the vast majority of agent products, this is cheap insurance.
Where defenses fail
- No content-source metadata. If the system can't tell "this text came from the user" vs "this text came from a web page," it cannot distinguish permission levels.
- All guardrails at the end. If validation runs only on the final output, the model already touched the world.
- Relying on instruction-following alone. "System prompt says ignore untrusted instructions" is not enforcement.
- Overtrust of retrieved memory. Yesterday's agent wrote a poisoned memory note; today's agent reads it.
Build It
code/main.py implements PVE:
- A
Validatorthat runs on every tool call: argument-shape check + injection-pattern scan. - An
Executorthat runs the main model's tool call only after validator approval. - Demo: a normal tool call passes; an injected one (prompt in the argument) is caught; a poisoned memory note triggers refusal.
Run it:
python3 code/main.py
Output: per-call trace showing validator verdicts and executor behavior.
Use It
- OpenAI Agents SDK guardrails (Lesson 16) — built-in PVE-shaped pattern.
- Gemini 2.5 Computer Use safety service — per-step vendor-managed.
- Anthropic tool-use best practices — treat retrieved content as untrusted; Claude's system prompt discusses this explicitly.
- Custom PVE — your own validator model for domain-specific injection patterns.
Ship It
outputs/skill-injection-defense.md scaffolds a PVE layer + content-capture discipline for any agent runtime.
Exercises
- Add a "source tag" to every piece of content:
user_message,tool_output,retrieved. Propagate tags through the message history. Validator refusesretrievedcontent that looks like directives. - Implement a memory-write guardrail: any memory write that looks like an instruction ("do X", "execute Y") is refused.
- Write a worming attack simulation: injected content tells the agent to include the exploit in its next response. Defend against it.
- Read Greshake et al. end to end. Implement one of the demonstrated exploits in your toy. Fix it.
- Measure: on normal traffic, how often does the PVE validator reject? Target: near-zero on legitimate calls.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Indirect prompt injection | "Injection in retrieved content" | Instructions embedded in data the agent retrieves |
| Direct prompt injection | "Jailbreak" | User-supplied prompt bypasses guardrails |
| PVE | "Prompt-Validator-Executor" | Cheap fast validator before expensive main inference |
| Source tag | "Content provenance" | Metadata marking where content came from |
| Allowlist navigation | "URL whitelist" | Agent can only visit approved destinations |
| Worming | "Self-replicating exploit" | Injected content includes instructions to propagate |
| Memory poisoning | "Persistent injection" | Injected content stored as memory; re-poisons next session |
Further Reading
- Greshake et al., Indirect Prompt Injection (arXiv:2302.12173) — canonical attack paper
- OpenAI, Computer-Using Agent — "only direct instructions from the user count as permission"
- Google, Gemini 2.5 Computer Use — per-step safety service
- OpenAI Agents SDK docs — guardrails as PVE