Speaker Recognition & Verification

> ASR asks "what did they say?" Speaker recognition asks "who said it?" The math looks the same — embeddings plus cosine — but every production decision hinges on a single EER number.

Type: Build

Languages: Python

Prerequisites: Phase 6 · 02 (Spectrograms & Mel), Phase 5 · 22 (Embedding Models)

Time: ~45 minutes

The Problem

A user says a passphrase. You want to know: is this the person they claim to be (*verification*, 1:1), or is it the first person in your enrollment bank (*identification*, 1:N)? Or neither — is this an unknown speaker (*open-set*)?

Pre-2018: GMM-UBM + i-vectors. Reasonable EER but fragile to channel shift (phone vs laptop) and emotion. 2018–2022: x-vectors (TDNN backbone trained with angular margin). 2022+: ECAPA-TDNN and WavLM-large embeddings. By 2026 the field is dominated by three models and one metric.

The metric is EER — Equal Error Rate. Set your decision threshold so False Accept Rate = False Reject Rate. The crossover is EER. Used in every paper, every leaderboard, every procurement call.

The Concept

Enrollment + verification pipeline with embedding + cosine + EER

The pipeline. Enrollment: record 5–30 seconds of the target speaker; compute a fixed-dimension embedding (192-d for ECAPA-TDNN, 256-d for WavLM-large). Verification: get the test utterance embedding; compute cosine similarity; compare to a threshold.

ECAPA-TDNN (2020, still dominant 2026). Emphasized Channel Attention, Propagation and Aggregation - Time-Delay Neural Network. 1D conv blocks with squeeze-excitation, multi-head attention pooling, followed by a linear layer to 192-d. Trained on VoxCeleb 1+2 (2,700 speakers, 1.1M utterances) with Additive Angular Margin loss (AAM-softmax).

WavLM-SV (2022+). Fine-tune a pretrained WavLM-large SSL backbone with AAM loss. Higher quality but slower — 300+ MB vs 15 MB.

x-vector (baseline). TDNN + statistics pooling. Classic; still useful on CPU / edge.

AAM-softmax. Standard softmax with added margin m in the angular space: cos(θ + m) for the correct class. Forces inter-class angular separation. Typical m=0.2, scale s=30.

Scoring

Numbers you should know (2026)

Model VoxCeleb1-O EER Params Throughput (A100)
x-vector (classic) 3.10% 5 M 400× RT
ECAPA-TDNN 0.87% 15 M 200× RT
WavLM-SV large 0.42% 316 M 20× RT
Pyannote 3.1 segmentation + embedding 0.65% 6 M 100× RT
ReDimNet (2024) 0.39% 24 M 100× RT

Diarization

"Who spoke when" in a multi-speaker clip. Pipeline: VAD → segment → embed each segment → cluster (agglomerative or spectral) → smooth boundaries. Modern stack: pyannote.audio 3.1, which bundles speaker segmentation + embedding + clustering behind one call. 2026 SOTA DER on AMI is ~15% (down from 23% in 2022).

Build It

Step 1: toy embedding from MFCC statistics

def embed_mfcc_stats(signal, sr):
    frames = featurize_mfcc(signal, sr, n_mfcc=13)
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(13)]
    std = [
        math.sqrt(sum((f[i] - mean[i]) ** 2 for f in frames) / len(frames))
        for i in range(13)
    ]
    return mean + std  # 26-d

Not SOTA by a mile — for teaching only. code/main.py uses this as a proof-of-concept on synthetic speaker data.

Step 2: cosine similarity + threshold

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def verify(enroll, test, threshold=0.75):
    return cosine(enroll, test) >= threshold

Step 3: EER from similarity pairs

def eer(same_scores, diff_scores):
    thresholds = sorted(set(same_scores + diff_scores))
    best = (1.0, 1.0, 0.0)  # (fa, fr, threshold)
    for t in thresholds:
        fr = sum(1 for s in same_scores if s < t) / len(same_scores)
        fa = sum(1 for s in diff_scores if s >= t) / len(diff_scores)
        if abs(fa - fr) < abs(best[0] - best[1]):
            best = (fa, fr, t)
    return (best[0] + best[1]) / 2, best[2]

Returns (eer, threshold_at_eer). Report both.

Step 4: production with SpeechBrain

from speechbrain.pretrained import EncoderClassifier

clf = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# enroll: average the embeddings of 3-5 clean samples
enroll = torch.stack([clf.encode_batch(load(x)) for x in enrollment_clips]).mean(0)
# verify
score = clf.similarity(enroll, clf.encode_batch(load("test.wav"))).item()
verdict = score > 0.25   # ECAPA typical threshold; tune on your data

Step 5: diarize with pyannote

from pyannote.audio import Pipeline

pipe = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipe("meeting.wav", num_speakers=None)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}–{turn.end:.1f}  {speaker}")

Use It

The 2026 stack:

Situation Pick
Closed-set 1:1 verification, edge ECAPA-TDNN + cosine threshold
Open-set verification, cloud WavLM-SV + AS-norm
Diarization (meetings, podcasts) pyannote/speaker-diarization-3.1
Anti-spoofing (replay / deepfake detection) AASIST or RawNet2
Tiny embedded (KWS + enrollment) Titanet-Small (NeMo)

Pitfalls

Ship It

Save as outputs/skill-speaker-verifier.md. Pick model, enrollment protocol, threshold-tuning plan, and fraud safeguards.

Exercises

  1. Easy. Run code/main.py. Builds synthetic "speakers" (different tone profiles), enrolls, computes EER on a 100-pair trial list.
  2. Medium. Use SpeechBrain ECAPA on 30 VoxCeleb1 utterances (5 speakers × 6 each). Compute EER with cosine vs PLDA.
  3. Hard. Build the full enroll → diarize → verify pipeline with pyannote.audio. Evaluate DER on AMI dev set.

Key Terms

Term What people say What it actually means
EER The headline metric Threshold where False Accept = False Reject.
Verification 1:1 "Is this Alice?"
Identification 1:N "Who is speaking?"
Open-set Unknown possible Test set can contain unenrolled speakers.
Enrollment Registering Computing a speaker's reference embedding.
AAM-softmax The loss Softmax with additive angular margin; forces cluster separation.
PLDA Classic scoring Probabilistic LDA; likelihood-ratio scoring on top of embeddings.
DER Diarization metric Diarization Error Rate — miss + false alarm + confusion.

Further Reading