
Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue

> 2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex architecture over Mimi codec tokens. This is the new reference design.

Type: Learn

Languages: Python

Prerequisites: Phase 6 · 13 (Neural Audio Codecs), Phase 6 · 11 (Real-Time Audio), Phase 7 · 05 (Full Transformer)

Time: ~75 minutes

The Problem

Every voice agent built from Lessons 11 + 12 has a fundamental latency floor around 300-500 ms: VAD fires, STT processes, LLM reasons, TTS generates. Each stage has its own minimum latency. You can tune and parallelize, but the pipeline shape caps you.

Moshi (Kyutai, 2024-2026) asks a different question: what if there is no pipeline? What if one model takes audio in and emits audio out directly, continuously, with text as an intermediate "inner monologue" instead of a required stage?

The answer is full-duplex speech-to-speech. Theoretical latency is 160 ms (80 ms Mimi frame + 80 ms acoustic delay). Practical latency is about 200 ms on a single L4 GPU. That's roughly half the 300-500 ms floor of a pipelined voice agent.
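The latency figures follow directly from the Mimi frame rate. A minimal sketch of the arithmetic, using only numbers quoted in this lesson (the variable names are illustrative):

```python
# Latency arithmetic from the figures quoted in this lesson.
MIMI_FRAME_RATE_HZ = 12.5                # Mimi frames per second

frame_ms = 1000 / MIMI_FRAME_RATE_HZ     # duration of one Mimi frame
acoustic_delay_ms = frame_ms             # one extra frame of acoustic delay
theoretical_ms = frame_ms + acoustic_delay_ms

print(f"frame: {frame_ms:.0f} ms, theoretical latency: {theoretical_ms:.0f} ms")
```

The gap between this theoretical figure and the practical 200 ms is compute and I/O overhead on real hardware.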

The Concept

Moshi architecture: two parallel Mimi streams + inner-monologue text

The Moshi architecture

Inputs. Two Mimi codec streams, both at 12.5 Hz × 8 codebooks:

Stream 1: user audio (Mimi-encoded, constantly arriving)
Stream 2: Moshi's own audio (generated by Moshi)

The transformer. A 7B-parameter Temporal Transformer processes both streams and a text "inner monologue" stream. At each 80 ms step, it:

Consumes the latest user Mimi tokens (8 codebooks).
Consumes the most recent Moshi Mimi tokens (8 codebooks, as produced).
Generates the next Moshi text token (inner monologue).
Generates the next Moshi Mimi tokens (8 codebooks via a small Depth Transformer).

All three streams — user audio, Moshi audio, Moshi text — run in parallel. Moshi can hear the user while speaking; can interrupt itself when the user interrupts; can back-channel ("mhm") without breaking its main utterance.
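To make the per-step bookkeeping concrete, here is a small sketch of what one 80 ms step carries. The `Step` class and its field names are hypothetical, chosen for illustration; only the counts (8 + 8 codebooks, 1 text token, 12.5 Hz) come from the description above.

```python
from dataclasses import dataclass

N_CODEBOOKS = 8        # Mimi codebooks per frame
FRAME_RATE_HZ = 12.5   # one step every 80 ms

@dataclass
class Step:
    """Tokens handled in one 80 ms step (illustrative container, not Moshi's API)."""
    user_audio: list[int]    # 8 tokens heard from the user this frame
    moshi_audio: list[int]   # 8 tokens Moshi produced this frame
    moshi_text: int          # 1 inner-monologue text token

    def tokens_per_step(self) -> int:
        return len(self.user_audio) + len(self.moshi_audio) + 1

step = Step(
    user_audio=[0] * N_CODEBOOKS,
    moshi_audio=[0] * N_CODEBOOKS,
    moshi_text=0,
)
tokens_per_second = step.tokens_per_step() * FRAME_RATE_HZ
print(step.tokens_per_step(), tokens_per_second)
```

Each step therefore touches 17 tokens across the three streams, which is why the sequence stays short enough to model jointly.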

The depth transformer. Within a frame, the 8 codebooks are not predicted in parallel, because they have inter-codebook dependencies. A small 6-layer "depth transformer" predicts them sequentially within the 80 ms step. This split between a large temporal model and a small per-frame depth model is a common factorization for autoregressive codec LMs.
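The sequential dependency between codebooks can be sketched with a toy loop. This is not Moshi's code: `toy_depth_step` is a deterministic stand-in for the depth transformer, and `VOCAB_SIZE` is an assumed value used only to keep the toy tokens in range.

```python
from typing import Callable, List

N_CODEBOOKS = 8
VOCAB_SIZE = 2048  # assumed codebook size for this toy; not stated in the lesson

def predict_frame(
    temporal_state: int,
    depth_step: Callable[[int, List[int]], int],
) -> List[int]:
    """Predict one frame's codebook tokens one at a time."""
    tokens: List[int] = []
    for _ in range(N_CODEBOOKS):
        # Each codebook is conditioned on the temporal context
        # and on every codebook already chosen in this frame.
        tokens.append(depth_step(temporal_state, tokens))
    return tokens

def toy_depth_step(temporal_state: int, previous: List[int]) -> int:
    """Stand-in for the depth transformer: a fixed formula, not a real model."""
    return (temporal_state + sum(previous) + len(previous)) % VOCAB_SIZE

frame = predict_frame(temporal_state=5, depth_step=toy_depth_step)
print(frame)
```

Because each call sees the tokens chosen before it, changing an early codebook changes every later one, which is the dependency a parallel prediction would miss.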

Why inner-monologue text helps

Without explicit text, the model has to implicitly model language in its acoustic stream. Moshi's insight: force it to emit text tokens alongside audio. The text stream is essentially the transcript of what Moshi is saying. This improves semantic coherence, makes it easier to swap out a language model head, and gives you transcripts for free.

Hibiki: streaming speech-to-speech translation

Same architecture, trained on translation pairs. Source audio in, target-language audio out, continuously. Hibiki-Zero (Feb 2026) eliminates the need for word-level aligned training data: it trains on sentence-level data and uses GRPO reinforcement learning to optimize latency.

Four language pairs are supported initially; the model can be adapted to a new language with ≈1000 hours of data.

The broader Kyutai stack (2026)

Moshi — full-duplex dialogue (French first, English well-supported)
Hibiki / Hibiki-Zero — simultaneous speech translation
Kyutai STT — streaming ASR (500 ms or 2.5 s look-ahead)
Kyutai Pocket TTS — 100M-param TTS runs on CPU (Jan 2026)
Unmute — full pipeline combining these on public servers

Throughput on an L40S GPU: 64 concurrent sessions at 3× real-time.

Sesame CSM — the cousin

Sesame CSM (2025) uses a similar idea — a Llama-3 backbone with a Mimi codec head. But CSM is single-directional (takes context + text, produces speech) rather than full-duplex. It's the best "voice presence" TTS on the market; not quite the same as Moshi's full-duplex capability.

2026 performance numbers

Model	Latency / frame rate	Use case	License
Moshi	200 ms (L4)	full-duplex English / French dialogue	CC-BY 4.0
Hibiki	12.5 Hz framerate	French ↔ English streaming translation	CC-BY 4.0
Hibiki-Zero	12.5 Hz framerate	5 language pairs, no aligned data	CC-BY 4.0
Sesame CSM-1B	200 ms TTFA	context-conditioned TTS	Apache-2.0
GPT-4o Realtime	~300 ms	closed, OpenAI API	commercial
Gemini 2.5 Live	~350 ms	closed, Google API	commercial

Build It

Step 1: the interface

Moshi exposes a WebSocket server that takes 80 ms chunks of Mimi-encoded audio and returns 80 ms chunks of Mimi-encoded audio. Both ways. Constantly.

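import asyncio
import websockets

# NOTE: the two helper names below are illustrative placeholders;
# confirm what your installed moshi package actually exports.
from moshi.client_utils import encode_audio_mimi, decode_audio_mimi

async def moshi_chat():
    # One WebSocket carries both directions of the conversation.
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        # stream_mic_to and stream_from_to_speaker are defined in Step 2.
        # gather() runs the uplink and downlink coroutines concurrently.
        await asyncio.gather(stream_mic_to(ws), stream_from_to_speaker(ws))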

Step 2: the full-duplex loop

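# mic_stream_at_12_5_hz, serialize, deserialize, and play are placeholders
# for your own audio I/O and wire-format code.

async def stream_mic_to(ws):
    # Uplink: one 80 ms microphone chunk per iteration (12.5 chunks per second).
    async for chunk_80ms in mic_stream_at_12_5_hz():
        mimi_tokens = encode_audio_mimi(chunk_80ms)
        await ws.send(serialize(mimi_tokens))

async def stream_from_to_speaker(ws):
    # Downlink: each message carries one frame of Moshi audio plus its text token.
    async for msg in ws:
        mimi_tokens, text_token = deserialize(msg)  # text_token is the inner monologue; unused here
        audio = decode_audio_mimi(mimi_tokens)
        await play(audio)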

Both directions run concurrently over the same WebSocket connection. Python asyncio or Rust futures are the usual way to drive the two loops.
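The `serialize` and `deserialize` helpers used in Step 2 are left undefined. One possible shape is sketched here under a loud assumption: the frame layout (8 unsigned 16-bit audio tokens followed by one signed 32-bit text token) is invented for illustration and is not Moshi's actual wire protocol.

```python
import struct
from typing import List, Tuple

N_CODEBOOKS = 8
# Hypothetical frame layout: 8 little-endian uint16 audio tokens + 1 int32 text token.
_FRAME = struct.Struct("<8Hi")
NO_TEXT = -1  # hypothetical marker for "no text token in this frame"

def serialize(mimi_tokens: List[int], text_token: int = NO_TEXT) -> bytes:
    """Pack one frame of codebook tokens (and an optional text token) into bytes."""
    if len(mimi_tokens) != N_CODEBOOKS:
        raise ValueError(f"expected {N_CODEBOOKS} tokens, got {len(mimi_tokens)}")
    return _FRAME.pack(*mimi_tokens, text_token)

def deserialize(msg: bytes) -> Tuple[List[int], int]:
    """Inverse of serialize: return (audio tokens, text token)."""
    *audio, text = _FRAME.unpack(msg)
    return list(audio), text

payload = serialize([1, 2, 3, 4, 5, 6, 7, 8], text_token=42)
print(len(payload), deserialize(payload))
```

A fixed-size binary frame keeps per-message overhead small at 12.5 messages per second in each direction; check the server you run for the format it really expects.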

Step 3: the training objective (conceptual)

For every 80 ms frame t:

Input: user_mimi[0..t], moshi_mimi[0..t-1], moshi_text[0..t-1]
Predict: moshi_text[t], then moshi_mimi[t, codebook_0..7]

Text is predicted before audio (the inner monologue); the depth transformer then predicts the audio tokens one codebook at a time.
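The indexing in this objective is easy to get off by one, so here is a small sketch that builds the context and the ordered targets for a single frame. The function name `frame_example` and the toy token values are hypothetical; only the index ranges and the text-then-codebooks order come from the description above.

```python
from typing import Dict, List

def frame_example(
    user_mimi: List[List[int]],
    moshi_mimi: List[List[int]],
    moshi_text: List[int],
    t: int,
) -> Dict[str, object]:
    """Build the context and ordered prediction targets for frame t."""
    context = {
        "user_mimi": user_mimi[: t + 1],   # frames 0..t (the user is heard up to now)
        "moshi_mimi": moshi_mimi[:t],      # frames 0..t-1
        "moshi_text": moshi_text[:t],      # frames 0..t-1
    }
    # Text first, then the codebooks of frame t in order 0..7.
    targets = [("text", moshi_text[t])]
    targets += [(f"codebook_{k}", tok) for k, tok in enumerate(moshi_mimi[t])]
    return {"context": context, "targets": targets}

# Toy data: 3 frames, 8 codebooks each.
user = [[t] * 8 for t in range(3)]
moshi = [[10 + t] * 8 for t in range(3)]
text = [100, 101, 102]

example = frame_example(user, moshi, text, t=2)
print([name for name, _ in example["targets"]])
```

Note the asymmetry: the user stream is visible through frame t, while Moshi's own streams are visible only through frame t-1.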

Step 4: where Moshi wins and where it doesn't

Moshi wins:

Sub-250 ms end-to-end on cheap hardware.
Natural back-channels and interruptions.
No pipeline glue code.

Moshi does not win:

Tool calling (not trained for it; you need a separate LLM path).
Long reasoning (Moshi is a 7B-class dialogue model, not Claude/GPT-4).
Factual accuracy on niche topics.
Most production enterprise use cases (still use pipelines in 2026).

Use It

Situation	Pick
Lowest-latency voice companion	Moshi
Live translation call	Hibiki
Voice demo / research	Moshi, CSM
Enterprise agent with tools	Pipeline (Lesson 12), not Moshi
Custom-voice TTS in context	Sesame CSM
Speech-to-speech, any language	GPT-4o Realtime or Gemini 2.5 Live (commercial)
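The table can be encoded as a small decision helper. This is a sketch, not a definitive rule set: the function name, the flag names, and especially the priority order between the checks are assumptions made for illustration.

```python
def pick_architecture(
    needs_tools: bool = False,
    broad_language_coverage: bool = False,
    live_translation: bool = False,
    custom_voice_tts: bool = False,
) -> str:
    """Map workload flags to a recommendation from the table (order is an assumption)."""
    if needs_tools:
        return "Pipeline (Lesson 12)"        # Moshi is not trained for tool calling
    if broad_language_coverage:
        return "GPT-4o Realtime or Gemini 2.5 Live"
    if live_translation:
        return "Hibiki"
    if custom_voice_tts:
        return "Sesame CSM"
    return "Moshi"                            # default: lowest-latency companion

print(pick_architecture(needs_tools=True))
print(pick_architecture(live_translation=True))
print(pick_architecture())
```

Checking the tool requirement first reflects the point made in Step 4: no latency advantage compensates for a capability the model lacks.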

Pitfalls

Limited tool calling. Moshi is a dialogue model, not an agent framework. Combine it with a pipeline when you need tools.
Specific-voice conditioning. Moshi uses a single trained persona; cloning is a separate training run.
Language coverage. French + English is excellent; others limited. Hibiki-Zero helps, but you still need training data.
Resource cost. A full Moshi session holds a GPU slot; not a cheap shared-tenant deploy pattern.

Ship It

Save as outputs/skill-duplex-pipeline.md. Given a voice-agent workload, pick a pipeline or a full-duplex architecture and state the reason.

Exercises

Easy. Run code/main.py. It simulates the two-stream + inner-monologue architecture symbolically.
Medium. Pull Moshi from HuggingFace, run the server, test one conversation. Measure wall-clock latency from end-of-user-speech to start-of-Moshi-response.
Hard. Take your Lesson 12 pipeline agent and compare P50 latency vs Moshi on 20 matched test utterances. Write up when a pipeline architecturally wins anyway.
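For the Medium and Hard exercises, a minimal timing harness might look like the sketch below. The `respond` callable is hypothetical: it stands for whatever function in your setup returns once the first response audio arrives, so that the measured span runs from end-of-user-speech to start-of-response.

```python
import statistics
import time
from typing import Callable, List

def measure_latency_ms(respond: Callable[[str], object], utterance: str) -> float:
    """Wall-clock time for one call to respond(), in milliseconds."""
    start = time.perf_counter()
    respond(utterance)  # assumed to return when the first response audio arrives
    return (time.perf_counter() - start) * 1000

def p50(latencies_ms: List[float]) -> float:
    """Median (P50) latency."""
    return statistics.median(latencies_ms)

# Toy stand-in for an agent so the harness runs on its own.
def fake_agent(utterance: str) -> str:
    time.sleep(0.01)
    return "ok"

samples = [measure_latency_ms(fake_agent, f"utterance {i}") for i in range(5)]
print(f"P50: {p50(samples):.1f} ms")
```

Run the same utterances through both systems and compare the two P50 values rather than single measurements, since individual calls vary.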

Key Terms

Term	What people say	What it actually means
Full-duplex	Hear-and-speak at once	Two audio streams active simultaneously on the same model.
Inner monologue	Model's text stream	Moshi emits text tokens alongside its audio output.
Depth transformer	Inter-codebook predictor	Small transformer that predicts 8 codebooks within one 80 ms frame.
Mimi	Kyutai's codec	12.5 Hz × 8 codebooks; semantic+acoustic; powers Moshi.
Streaming S2S	Audio → audio live	Chunk-by-chunk translation/dialogue, no pipeline stages.
Back-channeling	"Mhm" reactions	Moshi can emit small acknowledgments without breaking its turn.