Real-Time Audio Processing
> Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.
Type: Build
Languages: Python, Rust
Prerequisites: Phase 6 · 02 (Spectrograms), Phase 6 · 04 (ASR), Phase 6 · 07 (TTS)
Time: ~75 minutes
The Problem
You want a voice assistant that feels alive. Human conversational turn-taking latency is ~230 ms (silence-to-response). Anything above 500 ms feels robotic; above 1500 ms feels broken. The budget for a full hear → understand → respond → speak loop in 2026 is:
| Stage | Budget |
|---|---|
| Mic → buffer | 20 ms |
| VAD | 10 ms |
| ASR (streaming) | 150 ms |
| LLM (first token) | 100 ms |
| TTS (first chunk) | 100 ms |
| Render → speaker | 20 ms |
| Total | ~400 ms |
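The budget is easier to keep honest when it lives in code next to your measurements. A minimal sketch, using the numbers from the table; the stage keys and the `check_budget` helper are illustrative names, not part of any library.

```python
# Per-stage latency budget in milliseconds, copied from the table above.
BUDGET_MS = {
    "mic_to_buffer": 20,
    "vad": 10,
    "asr_streaming": 150,
    "llm_first_token": 100,
    "tts_first_chunk": 100,
    "render_to_speaker": 20,
}

def check_budget(measured_ms, budget_ms=BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage that exceeds its budget."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0) > limit
    }

total = sum(BUDGET_MS.values())
print(f"total budget: {total} ms")          # total budget: 400 ms

# Hypothetical measurement: streaming ASR runs 30 ms over its budget.
measured = dict(BUDGET_MS, asr_streaming=180)
print(check_budget(measured))               # {'asr_streaming': 30}
```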
Moshi (Kyutai, 2024) clocked 200 ms full-duplex. GPT-4o-realtime (2024) clocks ~320 ms. Cascaded pipelines in 2022 shipped at 2500 ms. The 10× improvement came from three techniques: (1) streaming everywhere, (2) asynchronous pipelining with partial results, (3) interruptible generation.
The Concept
Frame / chunk / window. Real-time audio flows as fixed-size blocks. Common choice: 20 ms (320 samples at 16 kHz). Everything downstream must keep up with this cadence.
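The frame arithmetic can be sketched as follows. The helper names `samples_per_frame` and `frames` are illustrative; the sketch assumes mono audio held in a plain Python list.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame of frame_ms milliseconds."""
    return sample_rate_hz * frame_ms // 1000

def frames(samples, frame_len):
    """Yield consecutive fixed-size frames; a trailing partial frame is dropped."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]

frame_len = samples_per_frame(16_000, 20)    # 320 samples per 20 ms frame
one_second = [0.0] * 16_000                  # one second of silence
chunks = list(frames(one_second, frame_len))
print(frame_len, len(chunks))                # 320 50
```

Fifty frames per second is the cadence every downstream stage has to sustain.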
Ring buffer. Fixed-size circular buffer. Producer thread writes new frames, consumer thread reads. Prevents allocations in the hot path. Size ≈ maximum-latency × sample-rate; a 2-second 16 kHz ring = 32,000 samples.
VAD (Voice Activity Detection). Gates downstream work when nobody is speaking. Silero VAD 4.0 (2024) runs <1 ms per 30 ms frame on CPU. webrtcvad is the older alternative.
Streaming ASR. Models that emit partial transcripts as audio arrives. Parakeet-CTC-0.6B in streaming mode (NeMo, 2024) does 2–5% WER at 320 ms latency. Whisper-Streaming (Macháček et al., 2023) chunks Whisper for near-streaming at ~2 s latency.
Interruption. When the user speaks while the assistant is talking, you must (a) detect the barge-in, (b) stop the TTS, (c) discard the remaining LLM output. All of this has to happen within 100 ms, or the assistant seems deaf to the user.
WebRTC Opus transport. 20 ms frames, 48 kHz, adaptive bitrate 8–128 kbps. Standard for browser and mobile. LiveKit, Daily.co, Pion are the 2026 stacks for building voice apps.
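To get a feel for what those numbers mean on the wire, here is a rough sizing sketch. It treats the bitrate as an average and ignores RTP/UDP/IP headers; `opus_payload_bytes` is an illustrative helper, not an Opus or WebRTC API.

```python
def opus_payload_bytes(bitrate_bps, frame_ms=20):
    """Approximate encoded payload per frame, excluding packet headers."""
    return bitrate_bps * frame_ms // 8000    # bits/s * ms / (8 bits * 1000 ms)

samples_per_frame_48k = 48_000 * 20 // 1000  # 960 samples in a 20 ms frame

for kbps in (8, 32, 128):
    size = opus_payload_bytes(kbps * 1000)
    print(f"{kbps} kbps -> about {size} bytes per 20 ms frame")
```

At the low end of the range a frame is only about 20 bytes, so per-packet header overhead dominates the bandwidth.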
Jitter buffer. Network packets arrive out of order / late. The jitter buffer reorders and smooths; too small → audible gaps, too large → latency. 60–80 ms typical.
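The reordering part can be sketched in a few lines. This is a toy model, not how any particular WebRTC stack implements it: packets carry an integer sequence number, `depth` is the number of packets held back before playout starts (3 packets of 20 ms gives 60 ms), and a missing packet surfaces as `None` so the caller can conceal the gap. Sequence-number wraparound and adaptive depth are left out.

```python
import heapq

class JitterBuffer:
    """Reorders packets by sequence number and releases them in order."""

    def __init__(self, depth=3):
        self.depth = depth
        self.heap = []          # min-heap of (seq, payload)
        self.next_seq = None    # next sequence number due for playout

    def push(self, seq, payload):
        # A packet whose playout slot has already passed is useless: drop it.
        if self.next_seq is not None and seq < self.next_seq:
            return
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Call once per playout tick. Returns a payload, or None for a gap."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None                     # still prefilling
            self.next_seq = self.heap[0][0]
        while self.heap and self.heap[0][0] < self.next_seq:
            heapq.heappop(self.heap)            # stale duplicate
        payload = None
        if self.heap and self.heap[0][0] == self.next_seq:
            payload = heapq.heappop(self.heap)[1]
        self.next_seq += 1
        return payload

jb = JitterBuffer(depth=3)
arrivals = [(0, "a"), (2, "c"), (1, "b"), (4, "e"), (3, "d")]   # out of order
played = []
for seq, payload in arrivals:
    jb.push(seq, payload)
    out = jb.pop()              # one playout tick per arrival in this toy loop
    if out is not None:
        played.append(out)
while jb.heap:                  # drain what is still buffered
    out = jb.pop()
    if out is not None:
        played.append(out)
print(played)                   # ['a', 'b', 'c', 'd', 'e']
```

Raising `depth` tolerates later packets at the cost of 20 ms of extra latency per step, which is the trade-off described above.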
Common gotchas
- Thread contention. Python's GIL + heavy models can starve the audio thread. Use a C-callback audio library (sounddevice, PortAudio) and keep Python off the hot path.
- Sample-rate conversion latency. Resampling inside the pipeline adds 5–20 ms. Either resample upfront or use a low-latency resampler (PolyPhase, soxr_hq).
- TTS priming. Even fast TTS like Kokoro has a 100–200 ms warm-up on the first request. Cache the model and warm it with a dummy run before the first real turn.
- Echo cancellation. Without AEC, TTS output re-enters the mic and triggers ASR on the bot's own voice. WebRTC AEC3 is the open-source default.
Build It
Step 1: ring buffer
import collections

class RingBuffer:
    def __init__(self, capacity):
        # deque(maxlen=...) silently discards the oldest samples on overflow.
        self.buf = collections.deque(maxlen=capacity)

    def write(self, frame):
        self.buf.extend(frame)

    def read(self, n):
        # Returns up to n samples; fewer if the buffer holds less.
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]

    def level(self):
        return len(self.buf)
Capacity determines max buffering latency. 32,000 samples at 16 kHz = 2 s.
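A short usage sketch shows what happens when the consumer falls behind. The Step 1 class is repeated so the snippet runs on its own; the frame contents are dummy values.

```python
import collections

class RingBuffer:  # same class as Step 1, repeated so this runs standalone
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)
    def write(self, frame):
        self.buf.extend(frame)
    def read(self, n):
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]
    def level(self):
        return len(self.buf)

SAMPLE_RATE = 16_000
ring = RingBuffer(capacity=2 * SAMPLE_RATE)   # 2 s of audio = 32,000 samples

# The producer writes 3 s of 20 ms frames while the consumer reads nothing.
frame = list(range(320))
for _ in range(150):
    ring.write(frame)

print(ring.level())                  # 32000: the oldest second was discarded
print(ring.level() / SAMPLE_RATE)    # 2.0 seconds of buffered latency

out = ring.read(320)                 # the consumer drains one frame
print(len(out), ring.level())        # 320 31680
```

Dropping the oldest audio on overflow is one policy among several; a full buffer also means the consumer is already two seconds behind real time.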
Step 2: VAD gate
def simple_energy_vad(frame, threshold=0.01):
    return sum(x * x for x in frame) / len(frame) > threshold ** 2
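A quick sanity check of the energy gate on synthetic frames. The function compares mean power against the squared threshold, so `threshold` is effectively an RMS level. The tone and hiss signals are made up for illustration and assume samples scaled to [-1, 1].

```python
import math

def simple_energy_vad(frame, threshold=0.01):
    # Mean power compared against threshold squared, as in Step 2.
    return sum(x * x for x in frame) / len(frame) > threshold ** 2

SAMPLE_RATE = 16_000
FRAME_LEN = 320   # 20 ms at 16 kHz

# A 440 Hz tone at amplitude 0.1 (RMS about 0.07) and a faint 3 kHz hiss at 0.001.
tone = [0.1 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]
hiss = [0.001 * math.sin(2 * math.pi * 3000 * n / SAMPLE_RATE) for n in range(FRAME_LEN)]

print(simple_energy_vad(tone))   # True
print(simple_energy_vad(hiss))   # False
```

An energy gate fires on any loud sound, speech or not, which is why the lesson moves to a learned VAD next.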
Replace with Silero VAD in production:
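import torch
# torch.hub.load returns (model, utils) for this repo.
vad, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
# frame: float32 samples in [-1, 1]. Recent Silero releases accept only fixed
# window sizes (512 samples at 16 kHz), so re-chunk 20 ms frames if needed.
is_speech = vad(torch.tensor(frame, dtype=torch.float32), 16000).item() > 0.5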
Step 3: streaming ASR
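# Parakeet-CTC-0.6B streaming via NeMo (pseudocode)
from nemo.collections.asr.models import EncDecCTCModelBPE
asr = EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")
# chunk_ms=320 ms, look_ahead_ms=80 ms
# audio_stream() and transcribe_streaming() are placeholders for your audio
# source and for NeMo's streaming inference loop; check the NeMo docs for the
# actual streaming call in your version.
for chunk in audio_stream():
    partial_text = asr.transcribe_streaming(chunk)
    print(partial_text, end="\r")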
Step 4: interruption handler
import asyncio

class Dialog:
    def __init__(self):
        self.tts_task = None

    def on_user_speech(self, frame):
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()  # barge-in
        # then feed to streaming ASR

    def on_final_user_utterance(self, text):
        # Must be called from inside a running event loop.
        self.tts_task = asyncio.create_task(self.reply(text))

    async def reply(self, text):
        # llm_then_tts and speaker are placeholders for your own components.
        async for tts_chunk in llm_then_tts(text):
            speaker.write(tts_chunk)
This hinges on async I/O and cancellable TTS streaming. In WebRTC, calling stop() on the outgoing audio track (MediaStreamTrack.stop()) is the canonical way to cut playback.
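The cancellation path can be exercised without any models. In this sketch `fake_tts` is a hypothetical stand-in for `llm_then_tts` that emits one chunk per word every 20 ms, and a list stands in for the speaker; only the `asyncio` calls are real APIs.

```python
import asyncio

async def fake_tts(text, chunk_s=0.02):
    """Hypothetical stand-in for llm_then_tts: one chunk per word."""
    for word in text.split():
        await asyncio.sleep(chunk_s)     # pretend each chunk takes 20 ms
        yield word

class Dialog:
    def __init__(self):
        self.tts_task = None
        self.played = []                 # stand-in for speaker.write

    async def reply(self, text):
        try:
            async for chunk in fake_tts(text):
                self.played.append(chunk)
        except asyncio.CancelledError:
            # A real system would flush or fade out the speaker here.
            raise                        # re-raise so the task ends as cancelled

    def on_final_user_utterance(self, text):
        self.tts_task = asyncio.create_task(self.reply(text))

    def on_user_speech(self):
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()       # barge-in

async def demo():
    d = Dialog()
    d.on_final_user_utterance("one two three four five six seven eight nine ten")
    await asyncio.sleep(0.07)            # the user interrupts after about 70 ms
    d.on_user_speech()
    try:
        await d.tts_task
    except asyncio.CancelledError:
        pass
    return d

dialog = asyncio.run(demo())
print(dialog.played, dialog.tts_task.cancelled())
```

Only the first few words reach the speaker, and the task reports itself as cancelled. The cancel takes effect at the next `await`, so the reply loop should never hold the event loop for longer than one chunk.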
Use It
The 2026 stack:
| Layer | Pick |
|---|---|
| Transport | LiveKit (WebRTC) or Pion (Go) |
| VAD | Silero VAD 4.0 |
| Streaming ASR | Parakeet-CTC-0.6B or Whisper-Streaming |
| LLM first-token | Groq, Cerebras, vLLM-streaming |
| Streaming TTS | Kokoro or ElevenLabs Turbo v2.5 |
| Echo cancel | WebRTC AEC3 |
| End-to-end native | OpenAI Realtime API or Moshi |
Pitfalls
- Buffering 500 ms to be safe. The buffer *is* your latency floor. Shrink it.
- Not pinning threads. An audio callback that runs at lower priority than the UI thread glitches under load.
- TTS chunks too small. Sub-200 ms chunks make vocoder artifacts audible. 320 ms chunks are the sweet spot.
- No jitter buffer. Real networks are jittery; without smoothing you get pops.
- Single-shot error handling. Audio pipelines must be crash-proof. If errors are not caught per stage, one exception kills the whole session.
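The last pitfall can be sketched as a processing loop that drops a bad frame instead of dying. `flaky_stage` is a hypothetical stage that fails on every fifth frame, and the `max_consecutive_errors` escape hatch is an assumption about what a reasonable policy looks like, not a rule from any framework.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("pipeline")

def flaky_stage(frame_index):
    """Hypothetical stage that fails on every fifth frame."""
    if frame_index % 5 == 4:
        raise RuntimeError(f"decoder glitch on frame {frame_index}")
    return f"frame-{frame_index}"

def run_session(n_frames, max_consecutive_errors=3):
    """Process frames, skipping failures; escalate only on repeated failures."""
    outputs, consecutive = [], 0
    for i in range(n_frames):
        try:
            outputs.append(flaky_stage(i))
            consecutive = 0
        except Exception as exc:          # broad on purpose: keep the session alive
            consecutive += 1
            log.warning("dropped frame %d: %s", i, exc)
            if consecutive >= max_consecutive_errors:
                raise                     # persistent fault: let a supervisor restart
    return outputs

results = run_session(20)
print(len(results))                       # 16 of 20 frames survive
```

A dropped 20 ms frame is a click at worst; a dead session is a hung-up call.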
Ship It
Save as outputs/skill-realtime-designer.md. Design a real-time audio pipeline with concrete latency budgets per stage.
Exercises
- Easy. Run code/main.py. Simulates a ring buffer + energy VAD; prints stage latencies for a fake 10-second stream.
- Medium. Using sounddevice, build a passthrough loop that processes your mic in 20 ms frames and prints VAD state at each frame.
- Hard. Build a full duplex echo test with aiortc: browser → WebRTC → Python → WebRTC → browser. Measure glass-to-glass latency with a 1 kHz pulse.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Ring buffer | The circular queue | Fixed-size, lock-free (or SPSC-locked) FIFO for audio frames. |
| VAD | Silence gate | Model or heuristic marking speech vs non-speech. |
| Streaming ASR | Real-time STT | Emits partial text as audio arrives; bounded lookahead. |
| Jitter buffer | Network smoother | Queue reordering out-of-order packets; 60–80 ms typical. |
| AEC | Echo cancellation | Subtracts speaker-to-mic feedback path. |
| Barge-in | User interrupt | System detects user speech mid-TTS; must cancel playback. |
| Full duplex | Simultaneous both ways | User and bot can talk at the same time; Moshi is full duplex. |
Further Reading
- Macháček et al. (2023). Whisper-Streaming — chunked near-streaming Whisper.
- Kyutai (2024). Moshi — full-duplex 200 ms latency.
- LiveKit Agents framework (2024) — production audio agent orchestration.
- Silero VAD repo — sub-1 ms VAD, Apache 2.0.
- WebRTC AEC3 paper — echo cancellation under open source.