Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue
> 2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex architecture over Mimi codec tokens. This is the new reference design.
Type: Learn
Languages: Python
Prerequisites: Phase 6 · 13 (Neural Audio Codecs), Phase 6 · 11 (Real-Time Audio), Phase 7 · 05 (Full Transformer)
Time: ~75 minutes
The Problem
Every voice agent built from Lessons 11 + 12 has a fundamental latency floor around 300-500 ms: VAD fires, STT processes, LLM reasons, TTS generates. Each stage has its own minimum latency. You can tune and parallelize, but the pipeline shape caps you.
Moshi (Kyutai, 2024-2026) asks a different question: what if there is no pipeline? What if one model takes audio in and emits audio out directly, continuously, with text as an intermediate "inner monologue" instead of a required stage?
The answer is full-duplex speech-to-speech. The theoretical latency is 160 ms (one 80 ms Mimi frame plus 80 ms of acoustic delay); the practical latency is about 200 ms on a single L4 GPU. That is roughly half the 300-500 ms floor of a pipelined voice agent.
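The 160 ms figure is plain arithmetic on the codec frame rate. A minimal sketch, assuming the 12.5 Hz Mimi frame rate used throughout this lesson and an acoustic delay of exactly one frame (the function names are illustrative, not part of any Moshi API):

```python
# Mimi frame rate as stated in this lesson.
MIMI_FRAME_RATE_HZ = 12.5


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int = 1) -> float:
    """One frame to buffer the input, plus `delay_frames` frames of acoustic delay.

    Assumption: the acoustic delay is a whole number of frames (one by default).
    """
    frame_ms = frame_duration_ms(frame_rate_hz)
    return frame_ms + delay_frames * frame_ms


if __name__ == "__main__":
    print(frame_duration_ms(MIMI_FRAME_RATE_HZ))       # 80.0
    print(theoretical_latency_ms(MIMI_FRAME_RATE_HZ))  # 160.0
```

The gap between 160 ms and the practical 200 ms is compute and transport overhead, which this sketch does not model.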
The Concept
The Moshi architecture
Inputs. Two Mimi codec streams, both at 12.5 Hz × 8 codebooks:
- Stream 1: user audio (Mimi-encoded, constantly arriving)
- Stream 2: Moshi's own audio (generated by Moshi)
The transformer. A 7B-parameter Temporal Transformer processes both streams and a text "inner monologue" stream. At each 80 ms step, it:
- Consumes the latest user Mimi tokens (8 codebooks).
- Consumes the most recent Moshi Mimi tokens (8 codebooks, as produced).
- Generates the next Moshi text token (inner monologue).
- Generates the next Moshi Mimi tokens (8 codebooks via a small Depth Transformer).
All three streams (user audio, Moshi audio, Moshi text) run in parallel. Moshi can hear the user while speaking, stop itself when the user interrupts, and back-channel ("mhm") without breaking its main utterance.
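The per-step behaviour above can be sketched symbolically. This is a toy under stated assumptions: `fake_model_step` is a hypothetical random stand-in for the real transformer, and the vocabulary sizes are illustrative values, not taken from this lesson.

```python
import random
from dataclasses import dataclass, field

NUM_CODEBOOKS = 8       # from the lesson: 8 Mimi codebooks per frame
CODEBOOK_SIZE = 2048    # assumption: illustrative audio vocabulary size
TEXT_VOCAB_SIZE = 32000  # assumption: illustrative text vocabulary size


@dataclass
class DuplexState:
    """The three parallel streams, one entry per 80 ms step."""
    user_audio: list = field(default_factory=list)
    moshi_audio: list = field(default_factory=list)
    moshi_text: list = field(default_factory=list)


def fake_model_step(state: DuplexState, rng: random.Random):
    """Hypothetical stand-in for the model: returns (text_token, audio_frame)."""
    text_token = rng.randrange(TEXT_VOCAB_SIZE)
    audio_frame = [rng.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS)]
    return text_token, audio_frame


def run_duplex(user_frames, seed: int = 0) -> DuplexState:
    """Advance all three streams in lockstep, one step per user frame."""
    rng = random.Random(seed)
    state = DuplexState()
    for user_frame in user_frames:
        # Consume the latest user tokens; earlier Moshi output is already in state.
        state.user_audio.append(user_frame)
        # Emit the text token first, then the audio frame for the same step.
        text_token, audio_frame = fake_model_step(state, rng)
        state.moshi_text.append(text_token)
        state.moshi_audio.append(audio_frame)
    return state


if __name__ == "__main__":
    silence = [[0] * NUM_CODEBOOKS for _ in range(5)]
    final = run_duplex(silence)
    print(len(final.user_audio), len(final.moshi_text), len(final.moshi_audio))
```

The point of the sketch is the shape of the loop: Moshi produces output on every step whether or not the user is speaking, which is what makes back-channels and interruptions possible.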
The depth transformer. Within a frame, the 8 codebooks are not predicted in parallel, because they have inter-codebook dependencies. A small Depth Transformer (6 layers in the Moshi paper) predicts them sequentially within the 80 ms step. This temporal-plus-depth factorization follows the RQ-Transformer design that the Moshi paper builds on.
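A toy illustration of that sequential dependency. `toy_depth_step` is a hypothetical deterministic stand-in, not the real depth transformer, and the codebook size is an illustrative assumption; the only thing the sketch demonstrates is that codebook k is chosen after, and conditioned on, codebooks 0..k-1.

```python
NUM_CODEBOOKS = 8      # from the lesson: 8 Mimi codebooks per frame
CODEBOOK_SIZE = 2048   # assumption: illustrative vocabulary size per codebook


def toy_depth_step(context, previous_tokens, codebook_index):
    """Hypothetical stand-in for one depth-transformer step.

    A real model would return a distribution over CODEBOOK_SIZE tokens; this toy
    mixes its inputs deterministically so the dependency is visible.
    """
    mixed = sum(context) + 31 * sum(previous_tokens) + codebook_index
    return mixed % CODEBOOK_SIZE


def predict_frame(context):
    """Predict the 8 codebook tokens of one frame, one after another."""
    tokens = []
    for k in range(NUM_CODEBOOKS):
        # Codebook k sees the frame context plus the codebooks already chosen.
        tokens.append(toy_depth_step(context, tokens, k))
    return tokens


if __name__ == "__main__":
    print(predict_frame([1, 2, 3]))
```

Changing an early codebook changes every later one, which is why the eight predictions cannot simply be made in parallel.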
Why inner-monologue text helps
Without explicit text, the model has to implicitly model language in its acoustic stream. Moshi's insight: force it to emit text tokens alongside audio. The text stream is essentially the transcript of what Moshi is saying. This improves semantic coherence, makes it easier to swap out a language model head, and gives you transcripts for free.
Hibiki: streaming speech-to-speech translation
Same architecture, trained on translation pairs. Source audio in, target-language audio out, continuously. Hibiki-Zero (Feb 2026) eliminates the need for word-level aligned training data: it uses sentence-level data plus GRPO reinforcement learning to optimize latency.
Four language pairs are supported initially; the model can be adapted to a new language with ≈1000 hours of training audio.
The broader Kyutai stack (2026)
- Moshi — full-duplex dialogue (French first, English well-supported)
- Hibiki / Hibiki-Zero — simultaneous speech translation
- Kyutai STT — streaming ASR (500 ms or 2.5 s look-ahead)
- Kyutai Pocket TTS — 100M-param TTS runs on CPU (Jan 2026)
- Unmute — full pipeline combining these on public servers
Throughput on an L40S GPU: 64 concurrent sessions at 3× real-time.
Sesame CSM — the cousin
Sesame CSM (2025) uses a similar idea — a Llama-3 backbone with a Mimi codec head. But CSM is single-directional (takes context + text, produces speech) rather than full-duplex. It's the best "voice presence" TTS on the market; not quite the same as Moshi's full-duplex capability.
2026 performance numbers
| Model | Latency | Use case | License |
|---|---|---|---|
| Moshi | 200 ms (L4) | full-duplex English / French dialogue | CC-BY 4.0 |
| Hibiki | not stated (runs at the 12.5 Hz Mimi frame rate) | French ↔ English streaming translation | CC-BY 4.0 |
| Hibiki-Zero | same as Hibiki | 5 language pairs, no aligned data | CC-BY 4.0 |
| Sesame CSM-1B | 200 ms TTFA | context-conditioned TTS | Apache-2.0 |
| GPT-4o Realtime | ~300 ms | closed, OpenAI API | commercial |
| Gemini 2.5 Live | ~350 ms | closed, Google API | commercial |
Build It
Step 1: the interface
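Moshi exposes a WebSocket server that takes 80 ms chunks of audio and returns 80 ms chunks of audio, in both directions, continuously. The sketch below treats each payload as Mimi tokens for clarity; check the Moshi repo for the actual wire format before relying on it.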
import asyncio
import websockets
# Illustrative helper names: confirm what your installed moshi package actually exports.
from moshi.client_utils import encode_audio_mimi, decode_audio_mimi
async def moshi_chat():
    # Open one WebSocket and run the uplink and downlink tasks concurrently.
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        mic_task = asyncio.create_task(stream_mic_to(ws))
        spk_task = asyncio.create_task(stream_from_to_speaker(ws))
        await asyncio.gather(mic_task, spk_task)
Step 2: the full-duplex loop
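async def stream_mic_to(ws):
    # Uplink: one 80 ms microphone chunk per frame (12.5 frames per second).
    # mic_stream_at_12_5_hz and serialize are placeholders you must supply.
    async for chunk_80ms in mic_stream_at_12_5_hz():
        mimi_tokens = encode_audio_mimi(chunk_80ms)
        await ws.send(serialize(mimi_tokens))
async def stream_from_to_speaker(ws):
    # Downlink: each message carries one audio frame plus an inner-monologue text token.
    # deserialize and play are placeholders you must supply.
    async for msg in ws:
        mimi_tokens, text_token = deserialize(msg)
        audio = decode_audio_mimi(mimi_tokens)
        await play(audio)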
Both directions run simultaneously over the same WebSocket. Python asyncio tasks or Rust futures are the usual way to drive the two loops concurrently.
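The loop above leaves `serialize` and `deserialize` undefined. One possible framing, purely as an assumption for experimenting locally: pack the 8 codebook tokens as unsigned 16-bit integers followed by one signed 32-bit text token. This layout is hypothetical and is not Moshi's actual wire protocol.

```python
import struct

NUM_CODEBOOKS = 8  # from the lesson: 8 Mimi codebooks per frame

# Hypothetical layout: 8 little-endian uint16 audio tokens, then one int32 text token.
_FRAME = struct.Struct("<8Hi")
NO_TEXT_TOKEN = -1  # assumption: sentinel used when a frame carries no text token


def serialize(mimi_tokens, text_token: int = NO_TEXT_TOKEN) -> bytes:
    """Pack one frame (8 audio tokens plus an optional text token) into 20 bytes."""
    if len(mimi_tokens) != NUM_CODEBOOKS:
        raise ValueError(f"expected {NUM_CODEBOOKS} tokens, got {len(mimi_tokens)}")
    return _FRAME.pack(*mimi_tokens, text_token)


def deserialize(payload: bytes):
    """Unpack one frame into (mimi_tokens, text_token)."""
    *mimi_tokens, text_token = _FRAME.unpack(payload)
    return mimi_tokens, text_token


if __name__ == "__main__":
    payload = serialize([1, 2, 3, 4, 5, 6, 7, 8], text_token=42)
    print(len(payload), deserialize(payload))
```

A fixed-size frame keeps the receive loop trivial because every message is exactly one 80 ms step; a real protocol would also need message types and versioning.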
Step 3: the training objective (conceptual)
For every 80 ms frame t:
- Input: user_mimi[0..t], moshi_mimi[0..t-1], moshi_text[0..t-1]
- Predict: moshi_text[t], then moshi_mimi[t, codebook_0..7]
Text is predicted before audio (the inner monologue); the audio is then predicted one codebook at a time by the depth transformer.
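The input/target split for one frame can be written out with plain lists. This is a conceptual sketch only: `build_example` is a hypothetical helper, and real training operates on batched tensors with the codec's acoustic delay applied, which this ignores.

```python
def build_example(user_mimi, moshi_mimi, moshi_text, t):
    """Return (context, targets) for frame t.

    user_mimi, moshi_mimi: lists of frames, each frame a list of 8 codebook tokens.
    moshi_text: list of text tokens, one per frame.
    """
    context = {
        "user_mimi": user_mimi[: t + 1],  # frames 0..t (the user is heard up to now)
        "moshi_mimi": moshi_mimi[:t],     # frames 0..t-1 (Moshi's own past audio)
        "moshi_text": moshi_text[:t],     # text tokens 0..t-1
    }
    # Target order matters: the text token first, then codebooks 0..7 in sequence.
    targets = [("text", moshi_text[t])]
    targets += [(f"codebook_{k}", tok) for k, tok in enumerate(moshi_mimi[t])]
    return context, targets


if __name__ == "__main__":
    user = [[u] * 8 for u in range(3)]
    moshi = [[10 + m] * 8 for m in range(3)]
    text = [100, 101, 102]
    context, targets = build_example(user, moshi, text, t=2)
    print(len(context["user_mimi"]), len(context["moshi_mimi"]), len(targets))
```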
Step 4: where Moshi wins and where it doesn't
Moshi wins:
- Sub-250 ms end-to-end on cheap hardware.
- Natural back-channels and interruptions.
- No pipeline glue code.
Moshi does not win:
- Tool calling (not trained for it; you need a separate LLM path).
- Long reasoning (Moshi is a 7B dialogue model, not Claude/GPT-4).
- Factual accuracy on niche topics.
- Most production enterprise use cases (still use pipelines in 2026).
Use It
| Situation | Pick |
|---|---|
| Lowest-latency voice companion | Moshi |
| Live translation call | Hibiki |
| Voice demo / research | Moshi, CSM |
| Enterprise agent with tools | Pipeline (Lesson 12), not Moshi |
| Custom-voice TTS in context | Sesame CSM |
| Speech-to-speech, any languages | GPT-4o Realtime or Gemini 2.5 Live (commercial) |
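The table can be encoded as a small decision helper for the write-up. A minimal sketch: the `Workload` fields and the order in which the rules are checked are assumptions made for illustration, and the reasons simply restate this lesson.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a voice-agent workload."""
    needs_tools: bool = False
    needs_unsupported_languages: bool = False
    is_live_translation: bool = False
    needs_custom_voice_tts: bool = False


def pick_architecture(w: Workload):
    """Return (pick, reason). Rules are checked from most to least constraining."""
    if w.needs_tools:
        return "Pipeline", "Moshi is not trained for tool calling."
    if w.needs_unsupported_languages:
        return ("GPT-4o Realtime or Gemini 2.5 Live",
                "Open models cover only a few languages.")
    if w.is_live_translation:
        return "Hibiki", "Streaming speech-to-speech translation."
    if w.needs_custom_voice_tts:
        return "Sesame CSM", "Context-conditioned TTS."
    return "Moshi", "Lowest-latency full-duplex dialogue."


if __name__ == "__main__":
    print(pick_architecture(Workload(needs_tools=True)))
    print(pick_architecture(Workload()))
```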
Pitfalls
- Limited tool calling. Moshi is a dialogue model, not an agent framework. Combine it with a pipeline path when you need tools.
- Specific-voice conditioning. Moshi uses a single trained persona; cloning is a separate training run.
- Language coverage. French + English is excellent; others limited. Hibiki-Zero helps, but you still need training data.
- Resource cost. A full Moshi session holds a GPU slot; not a cheap shared-tenant deploy pattern.
Ship It
Save your write-up as outputs/skill-duplex-pipeline.md: pick a pipeline or a full-duplex architecture for a voice-agent workload, and give the reason.
Exercises
- Easy. Run code/main.py. It simulates the two-stream + inner-monologue architecture symbolically.
- Medium. Pull Moshi from HuggingFace, run the server, test one conversation. Measure wall-clock latency from end-of-user-speech to start-of-Moshi-response.
- Hard. Take your Lesson 12 pipeline agent and compare P50 latency vs Moshi on 20 matched test utterances. Write up when a pipeline architecturally wins anyway.
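For the Medium and Hard exercises, a small helper keeps the latency bookkeeping consistent. A sketch using only the standard library; the function names are illustrative, and the timestamps are assumed to be seconds on a shared clock that you record yourself.

```python
import statistics


def latency_ms(end_of_user_speech_s: float, start_of_response_s: float) -> float:
    """Wall-clock latency from end of user speech to first response audio."""
    return (start_of_response_s - end_of_user_speech_s) * 1000.0


def p50(latencies_ms) -> float:
    """Median latency over a set of test utterances."""
    return statistics.median(latencies_ms)


def compare_p50(pipeline_ms, duplex_ms) -> dict:
    """Summarize two latency lists measured on the same matched utterances."""
    pipeline_p50 = p50(pipeline_ms)
    duplex_p50 = p50(duplex_ms)
    return {
        "pipeline_p50_ms": pipeline_p50,
        "duplex_p50_ms": duplex_p50,
        "delta_ms": pipeline_p50 - duplex_p50,  # positive means full-duplex is faster
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    print(compare_p50([400.0, 450.0, 500.0], [190.0, 200.0, 230.0]))
```

Reporting the median rather than the mean keeps a single slow outlier (a cold start, a network hiccup) from dominating the comparison.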
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Full-duplex | Hear-and-speak at once | Two audio streams active simultaneously on the same model. |
| Inner monologue | Model's text stream | Moshi emits text tokens alongside its audio output. |
| Depth transformer | Inter-codebook predictor | Small transformer that predicts 8 codebooks within one 80 ms frame. |
| Mimi | Kyutai's codec | 12.5 Hz × 8 codebooks; semantic+acoustic; powers Moshi. |
| Streaming S2S | Audio → audio live | Chunk-by-chunk translation/dialogue, no pipeline stages. |
| Back-channeling | "Mhm" reactions | Moshi can emit small acknowledgments without breaking its turn. |
Further Reading
- Défossez et al. (2024). Moshi — speech-text foundation model — the paper.
- Kyutai Labs (2026). Hibiki-Zero — streaming translation without aligned data.
- Sesame (2025). Crossing the uncanny valley of voice — CSM spec.
- Kyutai — Moshi repo — install + server.
- OpenAI — Realtime API — closed commercial peer.
- Kyutai — Delayed Streams Modeling — the STT/TTS framework under the hood.