Build a Voice Assistant Pipeline — The Phase 6 Capstone
> Everything from lessons 01-11, stitched together. Build a voice assistant that listens, reasons, and talks back. In 2026 that is a solved engineering problem, not a research problem — but the integration details decide whether it ships.
Type: Build
Languages: Python
Prerequisites: Phase 6 · 04, 05, 06, 07, 11; Phase 11 · 09 (Function Calling); Phase 14 · 01 (Agent Loop)
Time: ~120 minutes
The Problem
Build an end-to-end assistant:
- Captures mic input (16 kHz mono).
- Detects start/end of user speech.
- Transcribes streaming.
- Passes transcript to an LLM that can call tools (timer, weather, calendar).
- Streams LLM text to a TTS.
- Plays audio back to the user.
- Stops if the user interrupts mid-response.
Latency target: first TTS audio byte within 800 ms of the user finishing their utterance on a laptop CPU. Quality target: no missed words, no hallucinated subtitles on silence, no voice cloning leakage, no prompt injection success.
The Concept
The seven components
- Audio capture. Mic → 16 kHz mono → 20 ms chunks. Usually
sounddevicein Python or native AudioUnit/ALSA/WASAPI in production. - VAD (Lesson 11). Silero VAD @ threshold 0.5, min speech 250 ms, silence hang-over 500 ms. Signals "start" and "end."
- Streaming STT (Lesson 4-5). Whisper-streaming, Parakeet-TDT, or Deepgram Nova-3 (API). Partial + final transcripts.
- LLM with tool calling. GPT-4o / Claude 3.5 / Gemini 2.5 Flash. JSON schema for tools. Stream tokens.
- Streaming TTS (Lesson 7). Kokoro-82M (fastest open) or Cartesia Sonic (commercial). Start TTS after 20 LLM tokens.
- Playback. Speaker out; opus-encode for low-bandwidth networks.
- Interruption handler. If VAD fires during TTS playback, stop playback, cancel LLM, restart STT.
The three failure modes you will hit
- First-word clip. VAD starts a beat too late. User's "hey" is missing. Start threshold at 0.3, not 0.5.
- Mid-response interrupt confusion. LLM keeps generating after user interrupts; assistant talks over user. Wire VAD → cancel-LLM.
- Silence hallucination. Whisper outputs "Thanks for watching" on the silent warm-up frames. Always VAD-gate.
2026 production reference stacks
| Stack | Latency | License | Notes |
|---|---|---|---|
| LiveKit + Deepgram + GPT-4o + Cartesia | 350-500 ms | commercial API | Industry default 2026 |
| Pipecat + Whisper-streaming + GPT-4o + Kokoro | 500-800 ms | mostly open | DIY-friendly |
| Moshi (full-duplex) | 200-300 ms | CC-BY 4.0 | Single-model; different architecture, lesson 15 |
| Vapi / Retell (managed) | 300-500 ms | commercial | Fastest to launch; limited customization |
| Whisper.cpp + llama.cpp + Kokoro-ONNX | offline | open | Privacy / edge |
Build It
Step 1: mic capture with chunking (pseudocode)
import sounddevice as sd
def mic_stream(chunk_ms=20, sr=16000):
q = queue.Queue()
def cb(indata, frames, time, status):
q.put(indata.copy().flatten())
with sd.InputStream(channels=1, samplerate=sr, blocksize=int(sr * chunk_ms/1000), callback=cb):
while True:
yield q.get()
Step 2: VAD-gated turn capture
def capture_turn(stream, vad, pre_roll_ms=300, silence_ms=500):
buf, pre, triggered = [], collections.deque(maxlen=pre_roll_ms // 20), False
silent = 0
for chunk in stream:
pre.append(chunk)
if vad(chunk):
if not triggered:
buf = list(pre)
triggered = True
buf.append(chunk)
silent = 0
elif triggered:
silent += 20
buf.append(chunk)
if silent >= silence_ms:
return b"".join(buf)
Step 3: streaming STT → LLM → TTS
async def turn(audio_bytes):
transcript = await stt.transcribe(audio_bytes)
async for token in llm.stream(transcript):
async for audio in tts.stream(token):
await speaker.play(audio)
Step 4: tool calling inside the LLM loop
tools = [
{"name": "get_weather", "parameters": {"location": "string"}},
{"name": "set_timer", "parameters": {"seconds": "int"}},
]
async for chunk in llm.stream(user_text, tools=tools):
if chunk.type == "tool_call":
result = dispatch(chunk.name, chunk.args)
continue_streaming(result)
if chunk.type == "text":
await tts.stream(chunk.text)
Step 5: interruption handling
tts_task = asyncio.create_task(tts_loop())
while True:
chunk = await mic.get()
if vad(chunk):
tts_task.cancel()
await speaker.stop()
await new_turn()
break
Use It
See code/main.py for a runnable simulation that wires all seven components with stub models, so you can see the pipeline shape even without hardware. For a real implementation, swap stubs with:
silero-vad(pip install silero-vad)deepgram-sdkoropenai-whisperopenai(gpt-4o) oranthropickokoroorcartesiasounddevicefor I/O
Pitfalls
- Logging PII forever. Full-turn audio is PII in most jurisdictions. 30-day retention, encrypted at rest.
- No barge-in. Users will interrupt. Your assistant must stop talking.
- TTS that blocks. Synchronous TTS blocks the event loop. Use async or a separate thread.
- No tool-call error handling. Tools fail. LLM must get back the error + retry once, then gracefully degrade.
- Overzealous hallucination filters. Over-filter and the assistant repeats "I can't help with that." Under-filter and it says anything. Calibrate on a held-out set.
- No wake-word option. Always-listening is a privacy liability. Add a wake-word gate (Porcupine or openWakeWord).
Ship It
Save as outputs/skill-voice-assistant-architect.md. Given budget + scale + language + compliance constraints, produce a full stack spec.
Exercises
- Easy. Run
code/main.py. It simulates one full turn end-to-end with stub modules and prints per-stage latency. - Medium. Replace the STT stub with a real Whisper model on a pre-recorded
.wav. Measure WER and end-to-end latency. - Hard. Add tool calling: implement
get_weather(any API) andset_timer. Route the LLM through the tools and verify that when the user says "set a 5 minute timer" the right function fires and the spoken reply confirms it.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Turn | A user + assistant round-trip | One VAD-bounded user speech + one LLM-TTS response. |
| Barge-in | Interruption | User speaks while assistant talks; assistant stops. |
| Wake word | "Hey assistant" | Short keyword detector; Porcupine, Snowboy, openWakeWord. |
| End-pointing | Turn ending | VAD + min-silence decision that user has finished. |
| Pre-roll | Pre-speech buffer | Keep 200-400 ms of audio before VAD fires to avoid first-word clip. |
| Tool call | Function invocation | LLM emits JSON; runtime dispatches; result feeds back in-loop. |
Further Reading
- LiveKit — voice agent quickstart — production-grade reference.
- Pipecat — voice agent examples — DIY-friendly framework.
- OpenAI Realtime API — the managed voice-native path.
- Kyutai Moshi — full-duplex reference (Lesson 15).
- Porcupine wake-word — wake-word gating.
- Anthropic — tool use guide — LLM function calling.