← Vision Transformers (ViT)Mixture of Experts (MoE) →

Audio Transformers — Whisper Architecture

> Audio is an image of frequency over time. Whisper is a ViT that eats mel spectrograms and speaks back.

Type: Learn

Languages: Python

Prerequisites: Phase 7 · 05 (Full Transformer), Phase 7 · 08 (Encoder-Decoder), Phase 7 · 09 (ViT)

Time: ~45 minutes

The Problem

Before Whisper (OpenAI, Radford et al. 2022), state-of-the-art automatic speech recognition (ASR) meant wav2vec 2.0 and HuBERT — self-supervised feature extractors plus a fine-tuned head. High quality, expensive data pipelines, domain-brittle. Multilingual speech recognition needed separate models per language family.

Whisper made three bets:

Train on everything. 680,000 hours of weakly-labeled audio scraped from the internet across 97 languages. No clean academic corpus. No phoneme labels.
Multi-task single model. One decoder trained jointly on transcription, translation, voice activity detection, language ID, and timestamping via task tokens.
Standard encoder-decoder transformer. Encoder consumes log-mel spectrograms. Decoder produces text tokens autoregressively. No vocoder, no CTC, no HMM.

The result: Whisper large-v3 is robust across accents, noise, and languages that have zero clean labeled data. It is the default speech front-end for every open-source voice assistant and most commercial ones in 2026.

The Concept

Whisper pipeline: audio → mel → encoder → decoder → text

Step 1 — resample + window

Audio at 16 kHz. Clip/pad to 30 seconds. Compute log-mel spectrogram: 80 mel bins, 10 ms stride → ~3,000 frames × 80 features. This is the "input image" that Whisper sees.

Step 2 — convolutional stem

Two Conv1D layers with kernel 3 and stride 2 reduce the 3,000 frames to 1,500. Halves sequence length without adding a lot of parameters.

Step 3 — encoder

A 24-layer (for large) transformer encoder over 1,500 timesteps. Sinusoidal positional encoding, self-attention, GELU FFN. Produces 1,500 × 1,280 hidden states.

Step 4 — decoder

A 24-layer transformer decoder. It autoregressively produces tokens from a BPE vocabulary that is a superset of GPT-2's with a few audio-specific special tokens.

Step 5 — task tokens

The decoder prompt starts with control tokens that tell the model what to do:

<|startoftranscript|>  <|en|>  <|transcribe|>  <|0.00|>

<|startoftranscript|>  <|fr|>  <|translate|>   <|0.00|>

The model was trained on this convention. You control task by prefix. The 2026 equivalent of instruction-tuning, but applied to speech.

Step 6 — output

Beam search (width 5) with a log-prob threshold. Timestamps are predicted every 0.02 seconds of audio when the <|notimestamps|> token is absent.

Whisper sizes

Model	Params	Layers	d_model	Heads	VRAM (fp16)
Tiny	39M	4	384	6	~1 GB
Base	74M	6	512	8	~1 GB
Small	244M	12	768	12	~2 GB
Medium	769M	24	1024	16	~5 GB
Large	1550M	32	1280	20	~10 GB
Large-v3	1550M	32	1280	20	~10 GB
Large-v3-turbo	809M	32	1280	20	~6 GB (4-layer decoder)

Large-v3-turbo (2024) cut the decoder from 32 layers to 4. 8× faster decoding with <1 WER point regression. That decode speed unlock is why Whisper-turbo is the default for real-time voice agents in 2026.

What Whisper does not do

No diarization (who is speaking). Pair with pyannote for that.
No real-time streaming natively — the 30-second window is fixed. Modern wrappers (faster-whisper, WhisperX) bolt on streaming via VAD + overlap.
No long-form context beyond 30 s without external chunking. Works well in practice because human speech rarely needs long-range context for transcription.

2026 landscape

Task	Model	Notes
English ASR	Whisper-turbo, Moonshine	Moonshine is 4× faster on edge
Multilingual ASR	Whisper-large-v3	97 languages
Streaming ASR	faster-whisper + VAD	150 ms latency targets achievable
TTS	Piper, XTTS-v2, Kokoro	Encoder-decoder pattern, but Whisper-shaped
Audio + language	AudioLM, SeamlessM4T	Text tokens + audio tokens in one transformer

Build It

See code/main.py. We don't train Whisper — we build the log-mel spectrogram pipeline + task-token prompt formatter. Those are the parts you actually touch in production.

Step 1: synthesize audio

Generate a 1-second sine wave at 440 Hz sampled at 16 kHz. 16,000 samples.

Step 2: log-mel spectrogram (simplified)

Full mel spectrogram needs FFT. We do a simplified framing + per-frame energy version that shows the pipeline without requiring librosa:

def frame_signal(x, frame_size=400, hop=160):
    frames = []
    for start in range(0, len(x) - frame_size + 1, hop):
        frames.append(x[start:start + frame_size])
    return frames

Frame = 25 ms, hop = 10 ms. Matches Whisper's windowing. Per-frame energy stands in for mel bins for pedagogy.

Step 3: pad to 30 s

Whisper always processes 30-second chunks. Pad (or clip) the spectrogram to 3,000 frames.

Step 4: build the prompt tokens

def whisper_prompt(lang="en", task="transcribe", timestamps=True):
    tokens = ["<|startoftranscript|>", f"<|{lang}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

That is the whole task-control surface. A 4-token prefix.

Use It

import whisper
model = whisper.load_model("large-v3-turbo")
result = model.transcribe("meeting.wav", language="en", task="transcribe")
print(result["text"])
print(result["segments"][0]["start"], result["segments"][0]["end"])

Faster, OpenAI-compatible:

from faster_whisper import WhisperModel
model = WhisperModel("large-v3-turbo", compute_type="int8_float16")
segments, info = model.transcribe("meeting.wav", vad_filter=True)
for s in segments:
    print(f"{s.start:.2f} - {s.end:.2f}: {s.text}")

When to pick Whisper in 2026:

Multilingual ASR with one model.
Robust transcription of noisy, diverse audio.
Research / prototype ASR — fastest starting point.

When to pick something else:

Ultra-low latency streaming on edge — Moonshine beats Whisper at matched quality.
Real-time conversational AI needing <200 ms — dedicated streaming ASR.
Speaker diarization — Whisper does not do this; bolt on pyannote.

Ship It

See outputs/skill-asr-configurator.md. The skill picks an ASR model, decoding parameters, and preprocessing pipeline for a new speech application.

Exercises

Easy. Run code/main.py. Confirm the frame count for a 1-second signal at 16 kHz with 10 ms hop is ~100 frames. For 30 seconds: ~3,000 frames.
Medium. Build the full log-mel spectrogram using numpy.fft. Verify 80 mel bins match librosa.feature.melspectrogram(n_mels=80) within numerical error.
Hard. Implement streaming inference: chunk audio into 10 s windows with 2 s overlap, run Whisper on each chunk, merge transcripts. Measure word-error rate vs single-pass on a 5-minute podcast sample.

Key Terms

Term	What people say	What it actually means
Mel spectrogram	"Audio image"	2D representation: frequency bins on one axis, time frames on the other; log-scaled energy per cell.
Log-mel	"What Whisper sees"	Mel spectrogram passed through log; approximates human perception of loudness.
Frame	"One time slice"	A 25 ms window of samples; overlapping at 10 ms stride.
Task token	"Prompt prefix for speech"	Special tokens like `<	transcribe	> `/` <	translate	>` in the decoder prompt.
Voice activity detection (VAD)	"Find the speech"	Gate that removes silence before ASR; cuts cost massively.
CTC	"Connectionist Temporal Classification"	Classic ASR loss for alignment-free training; Whisper does NOT use it.
Whisper-turbo	"Small decoder, full encoder"	large-v3 encoder + 4-layer decoder; 8× faster decoding.
Faster-whisper	"The production wrapper"	CTranslate2 reimplementation; int8 quantization; 4× faster than OpenAI's reference.