Real-Time Audio Processing

> Batch pipelines process a file. Real-time pipelines process the next 20 milliseconds before the next 20 arrive. Every conversational AI, broadcast studio, and telephony bot lives and dies by this latency budget.

Type: Build

Languages: Python, Rust

Prerequisites: Phase 6 · 02 (Spectrograms), Phase 6 · 04 (ASR), Phase 6 · 07 (TTS)

Time: ~75 minutes

The Problem

You want a voice assistant that feels alive. Human conversational turn-taking latency is ~230 ms (silence-to-response). Anything above 500 ms feels robotic; above 1500 ms feels broken. The budget for a full hear → understand → respond → speak loop in 2026 is:

| Stage | Budget |
|---|---|
| Mic → buffer | 20 ms |
| VAD | 10 ms |
| ASR (streaming) | 150 ms |
| LLM (first token) | 100 ms |
| TTS (first chunk) | 100 ms |
| Render → speaker | 20 ms |
| Total | ~400 ms |
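
The arithmetic behind this budget can be checked in a few lines. This is a minimal sketch: the stage keys and the `classify` helper are illustrative names, and the thresholds are the 500 ms and 1500 ms figures quoted above.

```python
# Per-stage latency budget in milliseconds, copied from the budget above.
BUDGET_MS = {
    "mic_to_buffer": 20,
    "vad": 10,
    "asr_streaming": 150,
    "llm_first_token": 100,
    "tts_first_chunk": 100,
    "render_to_speaker": 20,
}

def classify(total_ms):
    """Map a silence-to-response latency onto the perceptual bands quoted above."""
    if total_ms > 1500:
        return "broken"
    if total_ms > 500:
        return "robotic"
    return "conversational"

total_ms = sum(BUDGET_MS.values())
print(total_ms, classify(total_ms))  # 400 conversational
```

Any stage that overruns its allotment has to be paid for by another stage, which is why the budget is tracked per stage rather than only as a total.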

Moshi (Kyutai, 2024) clocked 200 ms full-duplex. GPT-4o-realtime (2024) clocks ~320 ms. Cascaded pipelines in 2022 shipped at 2500 ms. The 10× improvement came from three techniques: (1) streaming everywhere, (2) asynchronous pipelining with partial results, (3) interruptible generation.

The Concept

Figure: Streaming audio pipeline with ring buffer, VAD gate, and interruption handling.

Frame / chunk / window. Real-time audio flows as fixed-size blocks. Common choice: 20 ms (320 samples at 16 kHz). Everything downstream must keep up with this cadence.
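
The frame-size arithmetic and the framing step can be sketched as follows. The helper names `samples_per_frame` and `frames` are illustrative, and the sketch assumes the samples are already in memory as a list; a live capture loop would receive one frame per callback instead.

```python
def samples_per_frame(sample_rate_hz, frame_ms):
    """Number of samples in one frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000

def frames(samples, frame_len):
    """Yield consecutive full frames; a trailing partial frame is held back."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        yield samples[start:start + frame_len]

frame_len = samples_per_frame(16_000, 20)                  # 320 samples
one_second = [0.0] * 16_000
n_frames = sum(1 for _ in frames(one_second, frame_len))   # 50 frames per second
print(frame_len, n_frames)
```

At 20 ms per frame, every downstream stage gets called 50 times per second and has to finish its work inside that period on average.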

Ring buffer. Fixed-size circular buffer. Producer thread writes new frames, consumer thread reads. Prevents allocations in the hot path. Size ≈ maximum-latency × sample-rate; a 2-second 16 kHz ring = 32,000 samples.
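
One way to see why a ring avoids allocation in the hot path is to preallocate the storage once and move only indices. This is a minimal single-threaded sketch assuming float samples; the class name `PreallocatedRing` is illustrative, and a production version would add single-producer/single-consumer synchronization and copy into a caller-supplied buffer instead of returning a new list.

```python
from array import array

class PreallocatedRing:
    """Fixed-capacity ring of float samples; storage is allocated once up front."""

    def __init__(self, capacity):
        self.buf = array("f", [0.0]) * capacity
        self.capacity = capacity
        self.write_pos = 0   # next index to write
        self.count = 0       # samples currently stored

    def write(self, frame):
        """Copy a frame in, overwriting the oldest samples when full."""
        for x in frame:
            self.buf[self.write_pos] = x
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.count = min(self.count + 1, self.capacity)

    def read(self, n):
        """Pop up to n of the oldest samples, oldest first."""
        n = min(n, self.count)
        start = (self.write_pos - self.count) % self.capacity
        out = [self.buf[(start + i) % self.capacity] for i in range(n)]
        self.count -= n
        return out

# Size = maximum latency x sample rate: 2 s at 16 kHz.
capacity = int(2.0 * 16_000)
ring = PreallocatedRing(capacity)
ring.write([0.1] * 320)          # one 20 ms frame
print(capacity, ring.count)      # 32000 320
```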

VAD (Voice Activity Detection). Gates downstream work when nobody is speaking. Silero VAD 4.0 (2024) runs <1 ms per 30 ms frame on CPU. webrtcvad is the older alternative.

Streaming ASR. Models that emit partial transcripts as audio arrives. Parakeet-CTC-0.6B in streaming mode (NeMo, 2024) does 2–5% WER at 320 ms latency. Whisper-Streaming (Macháček et al., 2023) chunks Whisper for near-streaming at ~2 s latency.

Interruption. When the user speaks while the assistant is talking, you must (a) detect the barge-in, (b) stop the TTS, (c) discard the remaining LLM output. All three must complete within 100 ms, or the user perceives the assistant as deaf.

WebRTC Opus transport. 20 ms frames, 48 kHz, adaptive bitrate 8–128 kbps. Standard for browser and mobile. LiveKit, Daily.co, Pion are the 2026 stacks for building voice apps.
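
The numbers above translate into per-packet sizes with simple arithmetic. This is a rough sketch: the helper names are illustrative, and the payload figure is an average for a given target bitrate that ignores RTP/UDP/IP header overhead.

```python
def opus_frame_samples(sample_rate_hz=48_000, frame_ms=20):
    """Samples per channel in one frame."""
    return sample_rate_hz * frame_ms // 1000

def avg_payload_bytes(bitrate_bps, frame_ms=20):
    """Average encoded payload per frame at a target bitrate (headers excluded)."""
    return bitrate_bps * frame_ms // 8000

packets_per_second = 1000 // 20
print(opus_frame_samples())          # 960
print(packets_per_second)            # 50
print(avg_payload_bytes(8_000))      # 20
print(avg_payload_bytes(128_000))    # 320
```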

Jitter buffer. Network packets arrive out of order / late. The jitter buffer reorders and smooths; too small → audible gaps, too large → latency. 60–80 ms typical.
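
The reorder-and-delay behaviour can be sketched with a fixed playout schedule keyed on sequence numbers. This is a simplified illustration, not how any particular WebRTC stack implements it: the class and method names are hypothetical, the delay is fixed rather than adaptive, and sequence-number wraparound is ignored.

```python
class JitterBuffer:
    """Fixed-delay jitter buffer: slot `seq` plays at base + delay + k * frame_ms."""

    def __init__(self, delay_ms=60, frame_ms=20):
        self.delay_ms = delay_ms
        self.frame_ms = frame_ms
        self.packets = {}      # seq -> payload, waiting for playout
        self.base = None       # (seq, arrival_ms) of the first packet seen
        self.next_seq = None   # next sequence number owed to the decoder

    def push(self, seq, arrival_ms, payload):
        if self.base is None:
            self.base = (seq, arrival_ms)
            self.next_seq = seq
        if seq >= self.next_seq:          # drop packets whose slot has passed
            self.packets[seq] = payload

    def pop_due(self, now_ms):
        """Return (seq, payload) for every slot due by now_ms; payload is None for a gap."""
        out = []
        if self.base is None:
            return out
        base_seq, base_ms = self.base
        while now_ms >= base_ms + self.delay_ms + (self.next_seq - base_seq) * self.frame_ms:
            out.append((self.next_seq, self.packets.pop(self.next_seq, None)))
            self.next_seq += 1
        return out

jb = JitterBuffer(delay_ms=60, frame_ms=20)
jb.push(0, 0, "a")
jb.push(2, 45, "c")    # arrives before packet 1
jb.push(1, 50, "b")    # late, but still inside the 60 ms cushion
print(jb.pop_due(60))  # [(0, 'a')]
print(jb.pop_due(100)) # [(1, 'b'), (2, 'c')]
```

Raising `delay_ms` tolerates later packets at the cost of added latency; lowering it turns more late packets into `None` gaps that the decoder has to conceal.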

Common gotchas

Build It

Step 1: ring buffer

import collections

class RingBuffer:
    def __init__(self, capacity):
        self.buf = collections.deque(maxlen=capacity)
    def write(self, frame):
        self.buf.extend(frame)
    def read(self, n):
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]
    def level(self):
        return len(self.buf)

Capacity determines max buffering latency. 32,000 samples at 16 kHz = 2 s.

Step 2: VAD gate

def simple_energy_vad(frame, threshold=0.01):
    # True when RMS > threshold; max() avoids dividing by zero on an empty frame.
    return sum(x * x for x in frame) / max(len(frame), 1) > threshold ** 2

Replace with Silero VAD in production:

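import torch
# torch.hub.load returns (model, utils); only the model is needed here.
vad, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")
# frame: float samples in [-1, 1]. Recent Silero releases expect 512-sample
# windows at 16 kHz, so 20 ms frames may need rebuffering first.
is_speech = vad(torch.tensor(frame), 16000).item() > 0.5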
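
Per-frame decisions from either detector tend to flicker at word boundaries, so a gate usually debounces them before starting or stopping downstream work. The sketch below is one common approach (onset and hangover counters), not something prescribed by Silero; the class name `VadGate` and the default counts are assumptions chosen for illustration.

```python
class VadGate:
    """Debounces per-frame VAD decisions with onset and hangover counters."""

    def __init__(self, onset_frames=2, hangover_frames=10):
        self.onset_frames = onset_frames        # speech frames needed to open
        self.hangover_frames = hangover_frames  # silent frames needed to close
        self.speech_run = 0
        self.silence_run = 0
        self.open = False

    def update(self, is_speech):
        """Feed one frame's decision; returns True while the gate is open."""
        if is_speech:
            self.speech_run += 1
            self.silence_run = 0
            if self.speech_run >= self.onset_frames:
                self.open = True
        else:
            self.silence_run += 1
            self.speech_run = 0
            if self.silence_run >= self.hangover_frames:
                self.open = False
        return self.open

gate = VadGate(onset_frames=2, hangover_frames=3)
decisions = [True, True, False, False, False, True]
print([gate.update(d) for d in decisions])
# [False, True, True, True, False, False]
```

Each onset frame adds one frame period of detection latency, so the onset count trades false triggers against the VAD share of the budget.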

Step 3: streaming ASR

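# Parakeet-CTC-0.6B streaming via NeMo (pseudocode).
# audio_stream() and transcribe_streaming() are placeholders, not confirmed NeMo
# APIs; see NeMo's streaming inference documentation for the actual entry point.
from nemo.collections.asr.models import EncDecCTCModelBPE
asr = EncDecCTCModelBPE.from_pretrained("nvidia/parakeet-ctc-0.6b")
# Chunk size 320 ms, look-ahead 80 ms.
for chunk in audio_stream():
    partial_text = asr.transcribe_streaming(chunk)
    print(partial_text, end="\r")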
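
The 320 ms chunk and 80 ms look-ahead named in the comment above imply a particular slicing of the input. The generator below is a framework-independent sketch of that slicing only; the name `chunks_with_lookahead` is hypothetical, it works on an in-memory list, and it says nothing about how NeMo buffers audio internally.

```python
def chunks_with_lookahead(samples, sample_rate_hz=16_000, chunk_ms=320, look_ahead_ms=80):
    """Yield (chunk, look_ahead) pairs; a chunk is emitted once its look-ahead exists."""
    chunk_len = sample_rate_hz * chunk_ms // 1000        # 5120 samples
    ahead_len = sample_rate_hz * look_ahead_ms // 1000   # 1280 samples
    start = 0
    while start + chunk_len + ahead_len <= len(samples):
        chunk = samples[start:start + chunk_len]
        ahead = samples[start + chunk_len:start + chunk_len + ahead_len]
        yield chunk, ahead
        start += chunk_len   # chunks do not overlap; the look-ahead is re-read later

one_second = [0.0] * 16_000
pairs = list(chunks_with_lookahead(one_second))
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # 2 5120 1280
```

In this sketch the first chunk cannot be emitted until chunk plus look-ahead samples have arrived, so the look-ahead is paid for in latency.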

Step 4: interruption handler

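import asyncio

class Dialog:
    def __init__(self):
        self.tts_task = None  # in-flight reply task, if any

    def on_user_speech(self, frame):
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()   # barge-in: stops TTS and drops pending LLM output
        # then feed to streaming ASR

    def on_final_user_utterance(self, text):
        # Must be called from inside a running event loop.
        self.tts_task = asyncio.create_task(self.reply(text))

    async def reply(self, text):
        # llm_then_tts and speaker are placeholders for the LLM → TTS stream and the audio sink.
        async for tts_chunk in llm_then_tts(text):
            speaker.write(tts_chunk)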

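This hinges on async I/O and cancellable TTS streaming. Over WebRTC, playback is cut at the audio track: in the browser API that is MediaStreamTrack.stop(), which ends the track permanently. RTCPeerConnection itself exposes close(), not stop().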
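
The cancellation path can be exercised end to end without any audio hardware. This is a self-contained sketch in which `fake_tts_stream`, `speak`, and `demo` are stand-ins invented for illustration; the 20 ms chunk cadence and the 110 ms barge-in time are arbitrary.

```python
import asyncio

async def fake_tts_stream(n_chunks, chunk_ms=20):
    """Stand-in for a streaming TTS: yields one chunk every chunk_ms."""
    for i in range(n_chunks):
        await asyncio.sleep(chunk_ms / 1000)
        yield i

async def speak(played):
    async for chunk in fake_tts_stream(50):
        played.append(chunk)   # stand-in for speaker.write(chunk)

async def demo():
    played = []
    tts_task = asyncio.create_task(speak(played))
    await asyncio.sleep(0.11)   # user barges in about 110 ms into playback
    tts_task.cancel()           # stop TTS; the unplayed chunks are never generated
    try:
        await tts_task
    except asyncio.CancelledError:
        pass
    return played, tts_task.cancelled()

played, cancelled = asyncio.run(demo())
print(len(played), cancelled)   # a handful of chunks played, then True
```

Because the generator is cancelled at its next `await`, the worst-case stop time in this sketch is one chunk period, which is one reason to keep TTS chunks short.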

Use It

The 2026 stack:

| Layer | Pick |
|---|---|
| Transport | LiveKit (WebRTC) or Pion (Go) |
| VAD | Silero VAD 4.0 |
| Streaming ASR | Parakeet-CTC-0.6B or Whisper-Streaming |
| LLM first-token | Groq, Cerebras, vLLM-streaming |
| Streaming TTS | Kokoro or ElevenLabs Turbo v2.5 |
| Echo cancel | WebRTC AEC3 |
| End-to-end native | OpenAI Realtime API or Moshi |

Pitfalls

Ship It

Save as outputs/skill-realtime-designer.md. Design a real-time audio pipeline with concrete latency budgets per stage.

Exercises

  1. Easy. Run code/main.py. It simulates a ring buffer and an energy VAD, then prints stage latencies for a fake 10-second stream.
  2. Medium. Using sounddevice, build a passthrough loop that processes your mic in 20 ms frames and prints VAD state at each frame.
  3. Hard. Build a full-duplex echo test with aiortc: browser → WebRTC → Python → WebRTC → browser. Measure end-to-end (mouth-to-ear) latency with a 1 kHz pulse.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Ring buffer | The circular queue | Fixed-size, lock-free (or SPSC-locked) FIFO for audio frames. |
| VAD | Silence gate | Model or heuristic marking speech vs non-speech. |
| Streaming ASR | Real-time STT | Emits partial text as audio arrives; bounded lookahead. |
| Jitter buffer | Network smoother | Queue reordering out-of-order packets; 60–80 ms typical. |
| AEC | Echo cancellation | Subtracts speaker-to-mic feedback path. |
| Barge-in | User interrupt | System detects user speech mid-TTS; must cancel playback. |
| Full duplex | Simultaneous both ways | User and bot can talk at the same time; Moshi is full duplex. |

Further Reading