← Indirect Prompt Injection — Production Attack Surface WMDP and Dual-Use Capability Evaluation →

Red-Team Tooling — Garak, Llama Guard, PyRIT

> Three production tools frame the 2026 red-team stack. Llama Guard (Meta) — a Llama-3.1-8B classifier fine-tuned on 14 MLCommons hazard categories; the 2025 Llama Guard 4 is a 12B natively multimodal classifier pruned from Llama 4 Scout. Garak (NVIDIA) — open-source LLM vulnerability scanner with static, dynamic, and adaptive probes for hallucination, data leakage, prompt injection, toxicity, and jailbreaks. PyRIT (Microsoft) — multi-turn red-team campaigns with Crescendo, TAP, and custom converter chains for deep exploitation. Llama Guard 3 is documented in Meta's "Llama 3 Herd of Models" (arXiv:2407.21783); Llama Guard 3-1B-INT4 in arXiv:2411.17713; Garak's probe architecture in github.com/NVIDIA/garak. These tools are the 2026 production interface between red-team research (Lessons 12-15) and deployment (Lesson 17+).

Type: Build

Languages: Python (stdlib, tool-architecture simulator and Llama Guard-style classifier mock)

Prerequisites: Phase 18 · 12-15 (jailbreaks and IPI)

Time: ~75 minutes

Learning Objectives

Describe Llama Guard 3/4's position in the safety stack: input classifier, output classifier, or both.
Name the 14 MLCommons hazard categories and state one non-obvious one (Code Interpreter Abuse).
Describe Garak's probe architecture: probes, detectors, harnesses.
Describe PyRIT's multi-turn campaign structure and how it composes with Garak probes.

The Problem

Lessons 12-15 present the attack surface. Production deployments need repeatable, scalable evaluation. Three tools dominate 2026: Llama Guard (the defense classifier), Garak (the scanner), PyRIT (the campaign orchestrator). Each targets a different layer of the red-team lifecycle.

The Concept

Llama Guard (Meta)

Llama Guard 3 is a Llama-3.1-8B model fine-tuned for input/output classification over the MLCommons AILuminate 14 categories:

Violent crimes, non-violent crimes, sex-related, CSAM, defamation
Specialized advice, privacy, IP, indiscriminate weapons, hate
Suicide/self-harm, sexual content, elections, code-interpreter abuse

Supports 8 languages. Usage: place before the LLM (input moderation), after the LLM (output moderation), or both. The two uses generate different training distributions — Llama Guard 3 ships as a single model handling both.

Llama Guard 3-1B-INT4 (arXiv:2411.17713, 440MB, ~30 tokens/s on mobile CPU) is the quantized edge variant.

Llama Guard 4 (April 2025) is 12B, natively multimodal, pruned from Llama 4 Scout. It replaces both the 8B text and 11B vision predecessors with one classifier that ingests text + images.

Garak (NVIDIA)

Open-source vulnerability scanner. Architecture:

Probes. Attack generators for hallucination, data leakage, prompt injection, toxicity, jailbreaks. Static (fixed prompts), dynamic (generated prompts), adaptive (responds to target output).
Detectors. Score outputs against expected failure modes — toxic, leaked, jailbroken.
Harnesses. Manage probe-detector pairs, run campaigns, generate reports.

TrustyAI integrates Garak with the Llama-Stack shields (Prompt-Guard-86M input classifier, Llama-Guard-3-8B output classifier) for end-to-end shielded-target evaluation. Tier-based scoring (TBSA) replaces binary pass/fail — a model can pass at severity tier 3 and fail at severity tier 5 on the same probe.

PyRIT (Microsoft)

Python Risk Identification Toolkit. Multi-turn red-team campaigns. Built around:

Converters. Transform a seed prompt — paraphrase, encode, translate, roleplay.
Orchestrators. Run the campaign: Crescendo (escalation), TAP (branching), RedTeaming (custom loop).
Scoring. LLM-as-judge or classifier-as-judge.

PyRIT is the heavier cousin of Garak. Garak runs thousands of single-turn probes; PyRIT runs deep multi-turn campaigns designed to break specific failure modes.

The stack

Put Llama Guard on both sides of the model. Run Garak nightly for regression. Run PyRIT for pre-release campaigns. This is the 2026 default configuration for most production deployments.

Evaluation pitfalls

Judge identity. All three tools can use an LLM judge; judge calibration drives reported ASRs (Lesson 12). Specify the judge alongside the tool.
Probe staleness. Garak probes age as models are patched against them. Adaptive probes (PAIR-shaped) age slower than static probes.
Llama Guard FPR on benign content. Early Llama Guard versions over-flagged political and LGBTQ+ content; Llama Guard 3/4 calibrations are improved but not calibrated per-deployment.

Where this fits in Phase 18

Lessons 12-15 are the attack families. Lesson 16 is the production tooling. Lesson 17 (WMDP) is the evaluation for dual-use capability. Lesson 18 is the frontier safety frameworks that wrap these tools in a policy structure.

Use It

code/main.py builds a toy Llama Guard-style classifier (keyword + semantic features over 14 categories), a toy Garak harness (probe-detector loop), and a PyRIT-style multi-turn converter chain. You can run the three tools against a mock target and observe the different coverage signatures.

Ship It

This lesson produces outputs/skill-red-team-stack.md. Given a deployment description, it names which of the three tools are appropriate, what to configure in each, and what regression cadence to run.

Exercises

Run code/main.py. Compare the Llama-Guard-style classifier's detection rate on single-turn vs multi-turn attacks.

Implement a new Garak probe: a base64-encoded harmful request. Measure its detection by the Llama-Guard-style classifier.

Extend the PyRIT-style converter chain with a "translate to French, then paraphrase" converter. Re-measure attack success.

Read Llama Guard 3's hazard-category list. Identify two categories where the training data would realistically produce high false-positive rates on legitimate developer content.

Compare Garak and PyRIT's design principles. Argue for a deployment where each is the right tool.

Key Terms

Term	What people say	What it actually means
Llama Guard	"the classifier"	Fine-tuned Llama-3.1-8B/4-12B safety classifier with 14 hazard categories
Garak	"the scanner"	NVIDIA open-source vulnerability scanner; probes, detectors, harnesses
PyRIT	"the campaign tool"	Microsoft multi-turn red-team orchestrator; converters, orchestrators, scoring
Prompt-Guard	"the small classifier"	Meta's 86M prompt-injection classifier, paired with Llama Guard
TBSA	"tier-based scoring"	Garak's tier-based pass/fail replacing binary outcomes
Converter chain	"paraphrase + encode + ..."	PyRIT composition primitive for building multi-step attacks
MLCommons hazard categories	"the 14 taxonomies"	Industry-standard taxonomy Llama Guard targets