Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify

> Voice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (AudioSeal) that survives compression and editing. Ship both or do not ship voice cloning.

Type: Build

Languages: Python

Prerequisites: Phase 6 · 06 (Speaker Recognition), Phase 6 · 08 (Voice Cloning)

Time: ~75 minutes

The Problem

Three related defenses:

  1. Anti-spoofing / deepfake detection. Given an audio clip, is it synthetic or real? ASVspoof benchmarks (ASVspoof 2019 → 2021 → 5) are the gold standard.
  2. Audio watermarking. Embed an imperceptible signal in generated audio that a detector can extract later. AudioSeal (Meta) and WavMark are the open options.
  3. Authenticated provenance. Cryptographic signing of audio files + metadata. C2PA / Content Authenticity Initiative.

Detection handles adversaries who don't cooperate. Watermarking handles compliance — AI-generated audio should be identifiable as such. Both are required in 2026.

The Concept

Anti-spoofing vs watermarking vs provenance — three defense layers

ASVspoof 5 — the 2024-2025 benchmark

Biggest change from prior editions:

State-of-the-art on ASVspoof 5: ~7.23% EER. On the older ASVspoof 2019 LA: 0.42% EER. Real-world deployment: expect 5-10% EER on in-the-wild clips.

AASIST and RawNet2 — detection model families

AASIST (2021, updated through 2026). Graph-attention on spectral features. Current SOTA on ASVspoof 5 countermeasure task.

RawNet2. Convolutional front-end over raw waveform + TDNN backbone. Simpler baseline; still competitive with fine-tuning.

NeXt-TDNN + SSL features. 2025 variant: ECAPA-style + WavLM features + focal loss. Achieves the 0.42% EER on ASVspoof 2019 LA.

AudioSeal — the 2024 watermark default

Meta's AudioSeal (Jan 2024, v0.2 Dec 2024). Key design:

WavMark

The pre-AudioSeal open baseline. Invertible neural network, 32 bits/sec. Problems:

WaveVerify (July 2025)

Addresses AudioSeal's weaknesses — specifically temporal manipulations (reversal, speed). Uses FiLM-based generator + Mixture-of-Experts detector. Competitive with AudioSeal on standard attacks; handles temporal edits.

The gap adversaries exploit

From AudioMarkBench: "under pitch shift, all watermarks show Bit Recovery Accuracy below 0.6, indicating near-complete removal." Pitch-shift is the universal attack. No 2026 watermark is fully robust to aggressive pitch modification. This is why you need detection (AASIST) alongside watermarking.

C2PA / Content Authenticity Initiative

Not an ML technique — a manifest format. Audio files carry cryptographically signed metadata about creation tool, author, date. Audobox / Seamless use it. Good for provenance; does nothing if a bad actor re-encodes and strips metadata.

Build It

Step 1: a simple spectral-feature detector (toy)

def spectral_rolloff(spec, percentile=0.85):
    cum = 0
    total = sum(spec)
    if total == 0:
        return 0
    threshold = total * percentile
    for k, v in enumerate(spec):
        cum += v
        if cum >= threshold:
            return k
    return len(spec) - 1

def is_suspicious(audio):
    spec = magnitude_spectrum(audio)
    rolloff = spectral_rolloff(spec)
    return rolloff / len(spec) > 0.92

Synthetic speech often has unusually flat high-frequency energy. Production detectors use AASIST, not this. But the intuition holds.

Step 2: AudioSeal embed + detect

from audioseal import AudioSeal
import torch

generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

audio = load_wav("generated.wav", sr=16000)[None, None, :]
payload = torch.tensor([[1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]])
watermark = generator.get_watermark(audio, sample_rate=16000, message=payload)
watermarked = audio + watermark

result, decoded_payload = detector.detect_watermark(watermarked, sample_rate=16000)
# result: float in [0, 1] — probability of watermark presence
# decoded_payload: 16 bits; match against embedded payload

Step 3: evaluation — EER

def eer(real_scores, fake_scores):
    thresholds = sorted(set(real_scores + fake_scores))
    best = (1.0, 0.0)
    for t in thresholds:
        far = sum(1 for s in fake_scores if s >= t) / len(fake_scores)
        frr = sum(1 for s in real_scores if s < t) / len(real_scores)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

Step 4: the production integration

def safe_tts(text, voice, clone_reference=None):
    if clone_reference is not None:
        verify_consent(user_id, clone_reference)
    audio = tts_model.synthesize(text, voice)
    audio_with_wm = audioseal_embed(audio, payload=build_payload(user_id, model_id))
    manifest = c2pa_sign(audio_with_wm, user_id, timestamp=now())
    return audio_with_wm, manifest

Every generation ships: (1) watermark, (2) signed manifest, (3) retention-policy-compliant audit log.

Use It

Use case Defense
Shipping TTS / voice cloning AudioSeal embed on every output (non-negotiable)
Biometric voice unlock AASIST + ECAPA ensemble; liveness challenge
Call-center fraud detection AASIST on 20% sample of incoming calls
Podcast authenticity C2PA signing on upload, AudioSeal if AI-generated
Research / training detectors ASVspoof 5 train/dev/eval sets

Pitfalls

Ship It

Save as outputs/skill-spoof-defender.md. Pick detection model, watermark, provenance manifest, and operational playbook for a voice-gen deployment.

Exercises

  1. Easy. Run code/main.py. Toy detector + toy watermark embed/detect on synthetic audio.
  2. Medium. Install audioseal, embed a 16-bit payload in a TTS output, re-decode. Corrupt the audio with noise and measure Bit Recovery Accuracy.
  3. Hard. Fine-tune a RawNet2 or AASIST on ASVspoof 2019 LA. Measure EER. Test on a held-out set of F5-TTS-generated clips — see how OOD detection degrades.

Key Terms

Term What people say What it actually means
ASVspoof The benchmark Biennial challenge; 2024 = ASVspoof 5.
CM (countermeasure) Detector Classifier: real speech vs synthetic / converted.
SASV Speaker verif + CM Integrated biometric + spoof detection.
AudioSeal Meta watermark Localized, 16-bit payload, 485× faster than WavMark.
Bit Recovery Accuracy Watermark survival Fraction of payload bits recovered after attack.
C2PA Provenance manifest Cryptographic metadata about creation / authorship.
AASIST Detector family Graph-attention-based anti-spoofing SOTA.

Further Reading