← Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue Audio Evaluation — WER, MOS, UTMOS, MMAU, FAD, and the Open Leaderboards →

Voice Anti-Spoofing & Audio Watermarking — ASVspoof 5, AudioSeal, WaveVerify

> Voice cloning shipped faster than defenses. 2026 production voice systems need two things: a detector (AASIST, RawNet2) that classifies real vs fake speech, and a watermark (AudioSeal) that survives compression and editing. Ship both or do not ship voice cloning.

Type: Build

Languages: Python

Prerequisites: Phase 6 · 06 (Speaker Recognition), Phase 6 · 08 (Voice Cloning)

Time: ~75 minutes

The Problem

Three related defenses:

Anti-spoofing / deepfake detection. Given an audio clip, is it synthetic or real? ASVspoof benchmarks (ASVspoof 2019 → 2021 → 5) are the gold standard.
Audio watermarking. Embed an imperceptible signal in generated audio that a detector can extract later. AudioSeal (Meta) and WavMark are the open options.
Authenticated provenance. Cryptographic signing of audio files + metadata. C2PA / Content Authenticity Initiative.

Detection handles adversaries who don't cooperate. Watermarking handles compliance — AI-generated audio should be identifiable as such. Both are required in 2026.

The Concept

Anti-spoofing vs watermarking vs provenance — three defense layers

ASVspoof 5 — the 2024-2025 benchmark

Biggest change from prior editions:

Crowdsourced data (not studio clean) — realistic conditions.
~2000 speakers (vs ~100 before).
32 attack algorithms. TTS + voice conversion + adversarial perturbation.
Two tracks. Countermeasure (CM) standalone detection; Spoofing-robust ASV (SASV) for biometric systems.

State-of-the-art on ASVspoof 5: ~7.23% EER. On the older ASVspoof 2019 LA: 0.42% EER. Real-world deployment: expect 5-10% EER on in-the-wild clips.

AASIST and RawNet2 — detection model families

AASIST (2021, updated through 2026). Graph-attention on spectral features. Current SOTA on ASVspoof 5 countermeasure task.

RawNet2. Convolutional front-end over raw waveform + TDNN backbone. Simpler baseline; still competitive with fine-tuning.

NeXt-TDNN + SSL features. 2025 variant: ECAPA-style + WavLM features + focal loss. Achieves the 0.42% EER on ASVspoof 2019 LA.

AudioSeal — the 2024 watermark default

Meta's AudioSeal (Jan 2024, v0.2 Dec 2024). Key design:

Localized. Detects the watermark per-frame at 16 kHz sample resolution (1/16000 s).
Generator + detector jointly trained. Generator learns to embed inaudible signal; detector learns to find it through augmentations.
Robust. Survives MP3 / AAC compression, EQ, speed-shift ±10%, noise mix +10 dB SNR.
Fast. Detector runs at 485× realtime; 1000× faster than WavMark.
Capacity. 16-bit payload (can encode model ID, generation timestamp, user ID) embeddable in each utterance.

WavMark

The pre-AudioSeal open baseline. Invertible neural network, 32 bits/sec. Problems:

Synchronization brute-force is slow.
Can be removed by Gaussian noise or MP3 compression.
Not real-time friendly.

WaveVerify (July 2025)

Addresses AudioSeal's weaknesses — specifically temporal manipulations (reversal, speed). Uses FiLM-based generator + Mixture-of-Experts detector. Competitive with AudioSeal on standard attacks; handles temporal edits.

The gap adversaries exploit

From AudioMarkBench: "under pitch shift, all watermarks show Bit Recovery Accuracy below 0.6, indicating near-complete removal." Pitch-shift is the universal attack. No 2026 watermark is fully robust to aggressive pitch modification. This is why you need detection (AASIST) alongside watermarking.

C2PA / Content Authenticity Initiative

Not an ML technique — a manifest format. Audio files carry cryptographically signed metadata about creation tool, author, date. Audobox / Seamless use it. Good for provenance; does nothing if a bad actor re-encodes and strips metadata.

Build It

Step 1: a simple spectral-feature detector (toy)

def spectral_rolloff(spec, percentile=0.85):
    cum = 0
    total = sum(spec)
    if total == 0:
        return 0
    threshold = total * percentile
    for k, v in enumerate(spec):
        cum += v
        if cum >= threshold:
            return k
    return len(spec) - 1

def is_suspicious(audio):
    spec = magnitude_spectrum(audio)
    rolloff = spectral_rolloff(spec)
    return rolloff / len(spec) > 0.92

Synthetic speech often has unusually flat high-frequency energy. Production detectors use AASIST, not this. But the intuition holds.

Step 2: AudioSeal embed + detect

from audioseal import AudioSeal
import torch

generator = AudioSeal.load_generator("audioseal_wm_16bits")
detector = AudioSeal.load_detector("audioseal_detector_16bits")

audio = load_wav("generated.wav", sr=16000)[None, None, :]
payload = torch.tensor([[1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]])
watermark = generator.get_watermark(audio, sample_rate=16000, message=payload)
watermarked = audio + watermark

result, decoded_payload = detector.detect_watermark(watermarked, sample_rate=16000)
# result: float in [0, 1] — probability of watermark presence
# decoded_payload: 16 bits; match against embedded payload

Step 3: evaluation — EER

def eer(real_scores, fake_scores):
    thresholds = sorted(set(real_scores + fake_scores))
    best = (1.0, 0.0)
    for t in thresholds:
        far = sum(1 for s in fake_scores if s >= t) / len(fake_scores)
        frr = sum(1 for s in real_scores if s < t) / len(real_scores)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

Step 4: the production integration

def safe_tts(text, voice, clone_reference=None):
    if clone_reference is not None:
        verify_consent(user_id, clone_reference)
    audio = tts_model.synthesize(text, voice)
    audio_with_wm = audioseal_embed(audio, payload=build_payload(user_id, model_id))
    manifest = c2pa_sign(audio_with_wm, user_id, timestamp=now())
    return audio_with_wm, manifest

Every generation ships: (1) watermark, (2) signed manifest, (3) retention-policy-compliant audit log.

Use It

Use case	Defense
Shipping TTS / voice cloning	AudioSeal embed on every output (non-negotiable)
Biometric voice unlock	AASIST + ECAPA ensemble; liveness challenge
Call-center fraud detection	AASIST on 20% sample of incoming calls
Podcast authenticity	C2PA signing on upload, AudioSeal if AI-generated
Research / training detectors	ASVspoof 5 train/dev/eval sets

Pitfalls

Watermark without detector ever running. Pointless. Ship the detector in your CI.
Detection without calibration. AASIST trained on ASVspoof LA overfits; real-world accuracy drops. Calibrate on your domain.
Pitch-shift gap. Aggressive pitch shift removes most watermarks. Have a detection fallback.
Metadata strip-and-rehost. C2PA is trivially bypassable by re-encoding. Always add cryptographic + perceptual (watermark) defense together.
Liveness as detection. Ask user to say a random phrase. Prevents replay attacks but not real-time cloning.

Ship It

Save as outputs/skill-spoof-defender.md. Pick detection model, watermark, provenance manifest, and operational playbook for a voice-gen deployment.

Exercises

Easy. Run code/main.py. Toy detector + toy watermark embed/detect on synthetic audio.
Medium. Install audioseal, embed a 16-bit payload in a TTS output, re-decode. Corrupt the audio with noise and measure Bit Recovery Accuracy.
Hard. Fine-tune a RawNet2 or AASIST on ASVspoof 2019 LA. Measure EER. Test on a held-out set of F5-TTS-generated clips — see how OOD detection degrades.

Key Terms

Term	What people say	What it actually means
ASVspoof	The benchmark	Biennial challenge; 2024 = ASVspoof 5.
CM (countermeasure)	Detector	Classifier: real speech vs synthetic / converted.
SASV	Speaker verif + CM	Integrated biometric + spoof detection.
AudioSeal	Meta watermark	Localized, 16-bit payload, 485× faster than WavMark.
Bit Recovery Accuracy	Watermark survival	Fraction of payload bits recovered after attack.
C2PA	Provenance manifest	Cryptographic metadata about creation / authorship.
AASIST	Detector family	Graph-attention-based anti-spoofing SOTA.