← Audio-Language Models — Qwen2.5-Omni, Audio Flamingo, GPT-4o Audio Build a Voice Assistant Pipeline — The Phase 6 Capstone →

Real-Time Audio Processing

> Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.

Type: Build

Languages: Python, Rust

Prerequisites: Phase 6 · 02 (Spectrograms), Phase 6 · 04 (ASR), Phase 6 · 07 (TTS)

Time: ~75 minutes

The Problem

You want a voice assistant that feels alive. Human conversational turn-taking latency is ~230 ms (silence-to-response). Anything above 500 ms feels robotic; above 1500 ms feels broken. The budget for a full hear → understand → respond → speak loop in 2026 is:

Stage	Budget
Mic → buffer	20 ms
VAD	10 ms
ASR (streaming)	150 ms
LLM (first token)	100 ms
TTS (first chunk)	100 ms
Render → speaker	20 ms
Total	~400 ms

Moshi (Kyutai, 2024) clocked 200 ms full-duplex. GPT-4o-realtime (2024) clocks ~320 ms. Cascaded pipelines in 2022 shipped at 2500 ms. The 10× improvement came from three techniques: (1) streaming everywhere, (2) asynchronous pipelining with partial results, (3) interruptible generation.

The Concept

Streaming audio pipeline with ring buffer, VAD gate, interruption

Frame / chunk / window. Real-time audio flows as fixed-size blocks. Common choice: 20 ms (320 samples at 16 kHz). Everything downstream must keep up with this cadence.

Ring buffer. Fixed-size circular buffer. Producer thread writes new frames, consumer thread reads. Prevents allocations in the hot path. Size ≈ maximum-latency × sample-rate; a 2-second 16 kHz ring = 32,000 samples.

VAD (Voice Activity Detection). Gates downstream work when nobody is speaking. Silero VAD 4.0 (2024) runs <1 ms per 30 ms frame on CPU. webrtcvad is the older alternative.

Streaming ASR. Models that emit partial transcripts as audio arrives. Parakeet-CTC-0.6B in streaming mode (NeMo, 2024) does 2–5% WER at 320 ms latency. Whisper-Streaming (Macháček et al., 2023) chunks Whisper for near-streaming at ~2 s latency.

Interruption. When the user speaks while the assistant is talking, you must (a) detect the barge-in, (b) stop the TTS, (c) discard the remaining LLM output. All within 100 ms, or the user perceives deaf assistant.

WebRTC Opus transport. 20 ms frames, 48 kHz, adaptive bitrate 8–128 kbps. Standard for browser and mobile. LiveKit, Daily.co, Pion are the 2026 stacks for building voice apps.

Jitter buffer. Network packets arrive out of order / late. The jitter buffer reorders and smooths; too small → audible gaps, too large → latency. 60–80 ms typical.

Common gotchas

Thread contention. Python's GIL + heavy models can starve the audio thread. Use a C-callback audio library (sounddevice, PortAudio) and keep Python off the hot path.
Sample-rate conversion latency. Resampling inside the pipeline adds 5–20 ms. Either resample upfront or use a zero-latency resampler (PolyPhase, soxr_hq).
TTS priming. Even fast TTS like Kokoro has a 100–200 ms warm-up on first request. Cache model + warm it with a dummy run before the first real turn.
Echo cancellation. Without AEC, TTS output re-enters the mic and triggers ASR on the bot's own voice. WebRTC AEC3 is the open-source default.

Build It

Step 1: ring buffer

import collections

class RingBuffer:
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)
    def write(self, frame):
        self.buf.extend(frame)
    def read(self, n):
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]
    def level(self):
        return len(self.buf)

Capacity determines max buffering latency. 32,000 samples at 16 kHz = 2 s.

Step 2: VAD gate

def simple_energy_vad(frame, threshold=0.01):
    return sum(x * x for x in frame) / len(frame) > threshold ** 2

Replace with Silero VAD in production:

import torch
vad, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
is_speech = vad(torch.tensor(frame), 16000).item() > 0.5

Step 3: streaming ASR

# Parakeet-CTC-0.6B streaming via NeMo
from nemo.collections.asr.models import EncDecCTCModelBPE
asr = EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")
# chunk_ms=320 ms, look_ahead_ms=80 ms
for chunk in audio_stream():
    partial_text = asr.transcribe_streaming(chunk)
    print(partial_text, end="\r")

Step 4: interruption handler

class Dialog:
    def __init__(self):
        self.tts_task = None

    def on_user_speech(self, frame):
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()   # barge-in
        # then feed to streaming ASR

    def on_final_user_utterance(self, text):
        self.tts_task = asyncio.create_task(self.reply(text))

    async def reply(self, text):
        async for tts_chunk in llm_then_tts(text):
            speaker.write(tts_chunk)

Hinges on async I/O and cancellable TTS streaming. WebRTC peerconnection.stop() on the audio track is the canonical way.

Use It

The 2026 stack:

Layer	Pick
Transport	LiveKit (WebRTC) or Pion (Go)
VAD	Silero VAD 4.0
Streaming ASR	Parakeet-CTC-0.6B or Whisper-Streaming
LLM first-token	Groq, Cerebras, vLLM-streaming
Streaming TTS	Kokoro or ElevenLabs Turbo v2.5
Echo cancel	WebRTC AEC3
End-to-end native	OpenAI Realtime API or Moshi

Pitfalls

Buffering 500 ms to be safe. The buffer *is* your latency floor. Shrink it.
Not pinning threads. Audio callback on a priority-lower-than-UI thread = glitches under load.
TTS chunks too small. Sub-200 ms chunks make vocoder artifacts audible. 320 ms chunks are the sweet spot.
No jitter buffer. Real networks are jittery; without smoothing you get pops.
Single-shot error handling. Audio pipelines must be crash-proof. One exception kills the session.

Ship It

Save as outputs/skill-realtime-designer.md. Design a real-time audio pipeline with concrete latency budgets per stage.

Exercises

Easy. Run code/main.py. Simulates a ring buffer + energy VAD; prints stage latencies for a fake 10-second stream.
Medium. Using sounddevice, build a passthrough loop that processes your mic in 20 ms frames and prints VAD state at each frame.
Hard. Build a full duplex echo test with aiortc: browser → WebRTC → Python → WebRTC → browser. Measure glass-to-glass latency with a 1 kHz pulse.

Key Terms

Term	What people say	What it actually means
Ring buffer	The circular queue	Fixed-size, lock-free (or SPSC-locked) FIFO for audio frames.
VAD	Silence gate	Model or heuristic marking speech vs non-speech.
Streaming ASR	Real-time STT	Emits partial text as audio arrives; bounded lookahead.
Jitter buffer	Network smoother	Queue reordering out-of-order packets; 60–80 ms typical.
AEC	Echo cancellation	Subtracts speaker-to-mic feedback path.
Barge-in	User interrupt	System detects user speech mid-TTS; must cancel playback.
Full duplex	Simultaneous both ways	User and bot can talk at the same time; Moshi is full duplex.