Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify
> Voice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (AudioSeal) that survives compression and editing. Ship both or do not ship voice cloning.
Type: Build
Languages: Python
Prerequisites: Phase 6 · 06 (Speaker Recognition), Phase 6 · 08 (Voice Cloning)
Time: ~75 minutes
The Problem
Three related defenses:
- Anti-spoofing / deepfake detection. Given an audio clip, is it synthetic or real? ASVspoof benchmarks (ASVspoof 2019 → 2021 → 5) are the gold standard.
- Audio watermarking. Embed an imperceptible signal in generated audio that a detector can extract later. AudioSeal (Meta) and WavMark are the open options.
- Authenticated provenance. Cryptographic signing of audio files + metadata. C2PA / Content Authenticity Initiative.
Detection handles adversaries who don't cooperate. Watermarking handles compliance — AI-generated audio should be identifiable as such. Both are required in 2026.
The Concept
ASVspoof 5 — the 2024-2025 benchmark
Biggest change from prior editions:
- Crowdsourced data (not studio clean) — realistic conditions.
- ~2000 speakers (vs ~100 before).
- 32 attack algorithms. TTS + voice conversion + adversarial perturbation.
- Two tracks. Countermeasure (CM) standalone detection; Spoofing-robust ASV (SASV) for biometric systems.
State-of-the-art on ASVspoof 5: ~7.23% EER. On the older ASVspoof 2019 LA: 0.42% EER. Real-world deployment: expect 5-10% EER on in-the-wild clips.
AASIST and RawNet2 — detection model families
AASIST (2021, updated through 2026). Graph-attention on spectral features. Current SOTA on ASVspoof 5 countermeasure task.
RawNet2. Convolutional front-end over raw waveform + TDNN backbone. Simpler baseline; still competitive with fine-tuning.
NeXt-TDNN + SSL features. 2025 variant: ECAPA-style + WavLM features + focal loss. Achieves the 0.42% EER on ASVspoof 2019 LA.
AudioSeal — the 2024 watermark default
Meta's AudioSeal (Jan 2024, v0.2 Dec 2024). Key design:
- Localized. Detects the watermark per sample at 16 kHz (1/16000 s resolution), so watermarked segments inside a longer clip can be pinpointed.
- Generator + detector jointly trained. Generator learns to embed inaudible signal; detector learns to find it through augmentations.
- Robust. Survives MP3 / AAC compression, EQ, speed-shift ±10%, noise mix +10 dB SNR.
- Fast. Single-pass detector, reported as about two orders of magnitude (485×) faster than WavMark's.
- Capacity. 16-bit payload (can encode model ID, generation timestamp, user ID) embeddable in each utterance.
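Sixteen bits is a tight budget, so the payload layout has to be decided up front. The sketch below assumes a hypothetical split of 4 bits for a model ID and 12 bits for a truncated hash of the user ID. Neither the layout nor the helper names (`pack_payload`, `unpack_payload`) come from AudioSeal; they only illustrate one way to fill the bits.

```python
import hashlib

MODEL_BITS = 4   # assumed split: up to 16 model IDs
USER_BITS = 12   # assumed split: 12-bit truncated hash of the user ID


def pack_payload(model_id: int, user_id: str) -> list:
    """Pack a model ID and a truncated user hash into 16 bits, MSB first."""
    if not 0 <= model_id < 2 ** MODEL_BITS:
        raise ValueError("model_id does not fit in MODEL_BITS")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Take the top USER_BITS bits of the first two digest bytes.
    user_tag = int.from_bytes(digest[:2], "big") >> (16 - USER_BITS)
    value = (model_id << USER_BITS) | user_tag
    total_bits = MODEL_BITS + USER_BITS
    return [(value >> shift) & 1 for shift in range(total_bits - 1, -1, -1)]


def unpack_payload(bits: list) -> tuple:
    """Recover (model_id, user_tag) from a 16-bit payload."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value >> USER_BITS, value & (2 ** USER_BITS - 1)


bits = pack_payload(5, "alice")
model_id, user_tag = unpack_payload(bits)
```

A 12-bit tag is not unique per user; in this sketch it only narrows the candidates, and the audit log would have to resolve the rest.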
WavMark
The pre-AudioSeal open baseline. Invertible neural network, 32 bits/sec. Problems:
- Synchronization brute-force is slow.
- Can be removed by Gaussian noise or MP3 compression.
- Not real-time friendly.
WaveVerify (July 2025)
Addresses AudioSeal's weaknesses — specifically temporal manipulations (reversal, speed). Uses FiLM-based generator + Mixture-of-Experts detector. Competitive with AudioSeal on standard attacks; handles temporal edits.
The gap adversaries exploit
From AudioMarkBench: "under pitch shift, all watermarks show Bit Recovery Accuracy below 0.6, indicating near-complete removal." Pitch-shift is the universal attack. No 2026 watermark is fully robust to aggressive pitch modification. This is why you need detection (AASIST) alongside watermarking.
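One way to read "detection alongside watermarking" is as a two-stage decision: trust a recovered watermark when it is present, and fall back to the spoof detector when it is not. The sketch below is only an illustration. The function name, the labels, and both thresholds are hypothetical, and it assumes the detector score is oriented so that higher means "more likely synthetic".

```python
def classify_clip(watermark_prob: float, spoof_score: float,
                  wm_threshold: float = 0.5,
                  spoof_threshold: float = 0.5) -> str:
    """Combine a watermark probability with a countermeasure score.

    Assumed conventions: both inputs lie in [0, 1]; a higher spoof_score
    means the detector considers the clip more likely synthetic.
    """
    if watermark_prob >= wm_threshold:
        # The watermark survived, so the clip is identifiable as generated.
        return "ai-generated (watermark found)"
    if spoof_score >= spoof_threshold:
        # Watermark absent or removed (e.g. by pitch shift); detector fallback.
        return "suspected synthetic (detector only)"
    return "no evidence of synthesis"


verdict = classify_clip(watermark_prob=0.1, spoof_score=0.9)
```

Both thresholds would need tuning on in-domain data; 0.5 is a placeholder.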
C2PA / Content Authenticity Initiative
Not an ML technique but a manifest format. Audio files carry cryptographically signed metadata about creation tool, author, and date. Audiobox / Seamless use it. Good for provenance; does nothing if a bad actor re-encodes the file and strips the metadata.
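To make the "signed metadata" idea concrete, here is a minimal stand-in that binds metadata to the exact audio bytes. It is not the C2PA format: real C2PA manifests use certificate-based signatures, whereas this sketch assumes a shared-secret HMAC from the standard library, and the function and field names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_manifest(audio_bytes: bytes, metadata: dict, key: bytes) -> dict:
    """Bind metadata to the audio bytes with an HMAC (stand-in for a real signature)."""
    manifest = dict(metadata, audio_sha256=hashlib.sha256(audio_bytes).hexdigest())
    body = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["signature"] = hmac.new(key, body, hashlib.sha256).hexdigest()
    return manifest


def verify_manifest(audio_bytes: bytes, manifest: dict, key: bytes) -> bool:
    """Check that neither the metadata nor the audio bytes have changed."""
    claimed = dict(manifest)
    signature = claimed.pop("signature", "")
    body = json.dumps(claimed, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    same_audio = claimed.get("audio_sha256") == hashlib.sha256(audio_bytes).hexdigest()
    return hmac.compare_digest(signature, expected) and same_audio


key = b"demo-key"  # placeholder secret for the example
manifest = sign_manifest(b"fake-audio-bytes", {"tool": "tts-demo"}, key)
ok = verify_manifest(b"fake-audio-bytes", manifest, key)
```

Any change to the bytes, including a re-encode, makes verification fail, and a stripped manifest leaves nothing to verify at all. That is the limitation described above.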
Build It
Step 1: a simple spectral-feature detector (toy)
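    def spectral_rolloff(spec, percentile=0.85):
        """Return the bin index below which `percentile` of the total magnitude lies."""
        cum = 0
        total = sum(spec)
        if total == 0:
            return 0
        threshold = total * percentile
        for k, v in enumerate(spec):
            cum += v
            if cum >= threshold:
                return k
        return len(spec) - 1


    def is_suspicious(audio):
        # magnitude_spectrum is not defined in this snippet; it is assumed to
        # return a list of non-negative FFT magnitudes for the clip.
        spec = magnitude_spectrum(audio)
        rolloff = spectral_rolloff(spec)
        # Flag clips whose energy extends unusually far into the top bins.
        return rolloff / len(spec) > 0.92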
Synthetic speech often has unusually flat high-frequency energy. Production detectors use AASIST, not this. But the intuition holds.
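The toy detector relies on a magnitude spectrum. For experimentation, one could fill it in with a naive DFT as sketched below. This version is an assumption, not necessarily what the lesson's own code does, and it is only practical for very short signals. The roll-off helper is repeated so the block runs on its own, and `rolloff_ratio` is a hypothetical convenience name.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (toy sizes only)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def spectral_rolloff(spec, percentile=0.85):
    """Bin index below which `percentile` of the total magnitude lies."""
    total = sum(spec)
    if total == 0:
        return 0
    cum = 0.0
    for k, v in enumerate(spec):
        cum += v
        if cum >= total * percentile:
            return k
    return len(spec) - 1


def rolloff_ratio(samples):
    """Roll-off bin as a fraction of the spectrum length."""
    spec = magnitude_spectrum(samples)
    return spectral_rolloff(spec) / len(spec)


# Two 64-sample test tones: one at a low bin, one just below Nyquist.
low_tone = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64)]
bright_tone = [math.sin(2 * math.pi * 31 * i / 64) for i in range(64)]
flags = [rolloff_ratio(low_tone) > 0.92, rolloff_ratio(bright_tone) > 0.92]
```

Only the tone with its energy in the top bins crosses the 0.92 ratio used by the toy detector.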
Step 2: AudioSeal embed + detect
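    import torch
    import torchaudio
    from audioseal import AudioSeal

    generator = AudioSeal.load_generator("audioseal_wm_16bits")
    detector = AudioSeal.load_detector("audioseal_detector_16bits")

    # AudioSeal expects a float tensor shaped (batch, channels, samples) at 16 kHz.
    # torchaudio is used for loading here; any loader that yields such a tensor works.
    wav, sr = torchaudio.load("generated.wav")
    wav = torchaudio.functional.resample(wav, sr, 16000)
    audio = wav[:1].unsqueeze(0)  # keep the first channel -> (1, 1, samples)

    # One row of 16 bits per batch item.
    payload = torch.tensor([[1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]])

    watermark = generator.get_watermark(audio, sample_rate=16000, message=payload)
    watermarked = audio + watermark

    result, decoded_payload = detector.detect_watermark(watermarked, sample_rate=16000)
    # result: float in [0, 1], probability that the clip is watermarked
    # decoded_payload: tensor of 16 bits; compare against the embedded payload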
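Comparing the decoded bits with the embedded ones gives the Bit Recovery Accuracy. A minimal sketch on plain Python lists follows; the function name is our own, and with tensors one would presumably pass something like `payload[0].tolist()` and `decoded_payload[0].tolist()`.

```python
def bit_recovery_accuracy(embedded, decoded):
    """Fraction of payload bits that match after an attack (1.0 = perfect)."""
    if len(embedded) != len(decoded) or not embedded:
        raise ValueError("payloads must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(embedded, decoded) if int(a) == int(b))
    return matches / len(embedded)


embedded = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
# Simulate an attack that flips the first four bits.
decoded = [1 - b for b in embedded[:4]] + embedded[4:]
bra = bit_recovery_accuracy(embedded, decoded)
```

Random guessing averages 0.5, so a value near 0.5 means the payload is effectively gone.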
Step 3: evaluation — EER
    def eer(real_scores, fake_scores):
        """Equal error rate; higher scores mean "more likely real"."""
        thresholds = sorted(set(real_scores) | set(fake_scores))
        best_gap, best_eer = float("inf"), 0.0
        for t in thresholds:
            # FAR: fakes accepted as real; FRR: real clips rejected.
            far = sum(1 for s in fake_scores if s >= t) / len(fake_scores)
            frr = sum(1 for s in real_scores if s < t) / len(real_scores)
            if abs(far - frr) < best_gap:
                best_gap, best_eer = abs(far - frr), (far + frr) / 2
        return best_eer
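To see where the two error rates cross, it can help to tabulate them per threshold. The sketch below uses made-up scores and a hypothetical helper, `far_frr_curve`, and assumes that higher scores mean "more likely real".

```python
def far_frr_curve(real_scores, fake_scores):
    """Return (threshold, FAR, FRR) for every observed score, ascending."""
    curve = []
    for t in sorted(set(real_scores) | set(fake_scores)):
        far = sum(1 for s in fake_scores if s >= t) / len(fake_scores)
        frr = sum(1 for s in real_scores if s < t) / len(real_scores)
        curve.append((t, far, frr))
    return curve


# Illustrative scores only, not from any benchmark.
real_scores = [0.9, 0.8, 0.7, 0.4]
fake_scores = [0.6, 0.3, 0.2, 0.1]

curve = far_frr_curve(real_scores, fake_scores)
# The EER is read off where FAR and FRR are closest.
crossing = min(curve, key=lambda row: abs(row[1] - row[2]))
eer_estimate = (crossing[1] + crossing[2]) / 2
```

At threshold 0.6, one of four fakes is accepted and one of four real clips is rejected, so both rates are 0.25 and the EER for this toy set is 25%.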
Step 4: the production integration
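    def safe_tts(text, voice, user_id, model_id, clone_reference=None):
        # All helpers below (verify_consent, tts_model, audioseal_embed, build_payload,
        # c2pa_sign, write_audit_log, now) are placeholders for your own services.
        if clone_reference is not None:
            verify_consent(user_id, clone_reference)  # should raise if consent is missing
        audio = tts_model.synthesize(text, voice)
        audio_with_wm = audioseal_embed(audio, payload=build_payload(user_id, model_id))
        manifest = c2pa_sign(audio_with_wm, user_id, timestamp=now())
        write_audit_log(user_id, model_id, manifest)
        return audio_with_wm, manifest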
Every generation ships: (1) watermark, (2) signed manifest, (3) retention-policy-compliant audit log.
Use It
| Use case | Defense |
|---|---|
| Shipping TTS / voice cloning | AudioSeal embed on every output (non-negotiable) |
| Biometric voice unlock | AASIST + ECAPA ensemble; liveness challenge |
| Call-center fraud detection | AASIST on 20% sample of incoming calls |
| Podcast authenticity | C2PA signing on upload, AudioSeal if AI-generated |
| Research / training detectors | ASVspoof 5 train/dev/eval sets |
Pitfalls
- Watermark without detector ever running. Pointless. Ship the detector in your CI.
- Detection without calibration. AASIST trained on ASVspoof LA overfits; real-world accuracy drops. Calibrate on your domain.
- Pitch-shift gap. Aggressive pitch shift removes most watermarks. Have a detection fallback.
- Metadata strip-and-rehost. C2PA metadata is trivially lost when a file is re-encoded or stripped. Always pair the cryptographic manifest with a perceptual (watermark) defense.
- Liveness as detection. Asking the user to say a random phrase prevents replay attacks, but it does not stop real-time cloning.
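For the calibration pitfall, one common approach is to pick the operating threshold from in-domain spoof scores at a target false-accept rate, then check what that costs in false rejects. A minimal sketch follows. The helper names and scores are hypothetical, and it assumes higher scores mean "more likely real".

```python
import math


def threshold_at_far(fake_scores, target_far=0.01):
    """Smallest observed threshold whose false-accept rate is <= target_far.

    A clip is accepted as real when score >= threshold.
    """
    for t in sorted(set(fake_scores)):
        far = sum(1 for s in fake_scores if s >= t) / len(fake_scores)
        if far <= target_far:
            return t
    # No observed score meets the target: go just above the highest fake score.
    return math.nextafter(max(fake_scores), math.inf)


def frr_at(real_scores, threshold):
    """False-reject rate on genuine clips at the chosen threshold."""
    return sum(1 for s in real_scores if s < threshold) / len(real_scores)


# Illustrative in-domain scores only.
domain_fakes = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
domain_reals = [0.85, 0.92, 0.97, 0.99]

threshold = threshold_at_far(domain_fakes, target_far=0.2)
false_rejects = frr_at(domain_reals, threshold)
```

The same routine, re-run on fresh in-domain data, shows when the operating point has drifted.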
Ship It
Save as outputs/skill-spoof-defender.md. Pick detection model, watermark, provenance manifest, and operational playbook for a voice-gen deployment.
Exercises
- Easy. Run code/main.py. Toy detector + toy watermark embed/detect on synthetic audio.
- Medium. Install audioseal, embed a 16-bit payload in a TTS output, re-decode. Corrupt the audio with noise and measure Bit Recovery Accuracy.
- Hard. Fine-tune a RawNet2 or AASIST on ASVspoof 2019 LA. Measure EER. Test on a held-out set of F5-TTS-generated clips and see how OOD detection degrades.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| ASVspoof | The benchmark | Recurring challenge series (2015, 2017, 2019, 2021); the 2024 edition is ASVspoof 5. |
| CM (countermeasure) | Detector | Classifier: real speech vs synthetic / converted. |
| SASV | Speaker verif + CM | Integrated biometric + spoof detection. |
| AudioSeal | Meta watermark | Localized, 16-bit payload, 485× faster than WavMark. |
| Bit Recovery Accuracy | Watermark survival | Fraction of payload bits recovered after attack. |
| C2PA | Provenance manifest | Cryptographic metadata about creation / authorship. |
| AASIST | Detector family | Graph-attention-based anti-spoofing SOTA. |
Further Reading
- Wang et al. (2024). ASVspoof 5 — the current benchmark.
- San Roman et al. (2024). AudioSeal — the watermark default.
- Pujari and Rattani (2025). WaveVerify — MoE detector for temporal attacks.
- Jung et al. (2022). AASIST — the SOTA detection backbone.
- AudioMarkBench (2024) — robustness evaluation.
- C2PA specification — provenance manifest format.