Chaos Engineering for LLM Production

> Chaos engineering for LLMs is its own discipline in 2026. Prerequisites before running experiments in production: defined SLI/SLO, trace+metric+log observability, automated rollback, runbooks, on-call. Architecture has four planes: control (experiment scheduler), target (services, infra, data stores), safety (guards + abort + traffic filters), observability (metrics + traces + logs), feedback (into SLO adjustments). Guardrails are mandatory: burn-rate alerts pause experiments if daily error-budget burn > 2x expected; suppression windows + trace-ID correlation dedupe alert noise. Cadence: weekly small canary + SLO review; monthly game day + postmortem; quarterly cross-team resilience audit + dependency mapping. LLM-specific experiments: memory overload, network failures, provider outages, malformed prompts, KV cache eviction storms. Tooling: Harness Chaos Engineering (LLM-derived recommendations, blast-radius downscaling, MCP tool integration); LitmusChaos (CNCF); Chaos Mesh (CNCF Kubernetes-native).

Type: Learn

Languages: Python (stdlib, toy chaos experiment runner)

Prerequisites: Phase 17 · 23 (SRE for AI), Phase 17 · 13 (Observability)

Time: ~60 minutes

Learning Objectives

The Problem

Chaos testing in traditional stacks is established. LLM stacks add new failure modes. A 4K-token prompt with a poison character stalls the tokenizer for 12 seconds. An upstream provider 429s; your gateway retries; your service OOMs on retry-amplified concurrency. A KV cache eviction storm under burst load causes re-prefill cascades that saturate compute.

None of these show up in unit tests. Chaos engineering is how you discover them before users do.

The Concept

Prerequisites

Don't run chaos in production without:

  1. SLI/SLO — defined service-level indicators and objectives.
  2. Observability — traces, metrics, logs, wired to dashboards.
  3. Automated rollback — Phase 17 · 20 policy-flag rollback.
  4. Runbooks — structured, Phase 17 · 23.
  5. On-call — someone to respond.

Missing any means chaos becomes real incident.

Four planes + feedback

Control plane — experiment scheduler (Litmus workflow, Chaos Mesh schedule, Harness UI).

Target plane — services, pods, nodes, load balancers, data stores.

Safety plane — kill switch, suppression windows, blast-radius limits, error-budget gates.

Observability plane — normal metrics + trace-ID correlation to distinguish chaos-induced from natural failures.

Feedback loop — findings feed back into SLO adjustment, runbook updates, code fixes.

Guardrails are mandatory

Five LLM-specific experiments

  1. Memory overload — force a KV cache preemption storm by sending long-context requests with high concurrency. Observe: does the service gracefully shed or crash?
  1. Network failure — cut connectivity between inference gateway and provider. Observe: does fallback kick in within SLA? (Phase 17 · 19)
  1. Provider outage simulation — 100% 429 from OpenAI. Observe: does routing failover to Anthropic? (Phase 17 · 16, 19)
  1. Malformed prompt — inject tokenizer-stalling payload (e.g., deeply nested unicode, huge UTF-8 codepoint). Observe: does a single request lock up a worker?
  1. KV eviction storm — force eviction by saturating vLLM block budget. Observe: does LMCache recover or does service degrade?

Cadence

Tooling

Starting small

First experiment: pod-kill one decode replica under steady traffic. Observe rerouting and recovery. If this works and looks safe, graduate to network chaos.

First LLM-specific experiment: inject one provider 429 for 5 minutes. Observe fallback. Most teams discover their fallback wasn't fully tested.

Numbers you should remember

Use It

code/main.py simulates three chaos experiments with safety plane gates. Reports which experiments would trip the burn-rate abort.

Ship It

This lesson produces outputs/skill-chaos-plan.md. Given stack and maturity, picks first three experiments and the tooling.

Exercises

  1. Run code/main.py. Which experiment trips the burn-rate gate and why?
  2. Design the first five chaos experiments for a vLLM-based RAG service. Include success criteria.
  3. Your burn-rate alert paused an experiment. How do you determine root cause — chaos or natural?
  4. Argue whether chaos should run in production or only staging. When is production the right answer?
  5. Name three LLM-specific failure modes that generic network-chaos cannot reproduce.

Key Terms

Term What people say What it actually means
SLI / SLO "service targets" Indicator + objective; required prerequisite
Blast radius "scope" Set of services / users affected by experiment
Burn-rate alert "budget gate" Fires when error-budget burn rate > 2x expected
Game day "monthly drill" Scheduled cross-team chaos exercise
LitmusChaos "CNCF workflow" Graduated CNCF Kubernetes chaos tool
Chaos Mesh "CNCF CRD" CNCF sandbox Kubernetes-native chaos
Harness CE "commercial AI-assisted" Harness chaos with AI recommendations
Malformed prompt "tokenizer bomb" Input that stalls tokenization
KV eviction storm "preemption cascade" Mass eviction triggering re-prefills

Further Reading