Streaming Speech-to-Speech — Moshi, Hibiki, and Full-Duplex Dialogue

> 2024-2026 redefined voice AI. Moshi ships a single model that listens and speaks simultaneously at 200 ms latency. Hibiki does speech-to-speech translation chunk-by-chunk. Both abandon the ASR → LLM → TTS pipeline for a unified full-duplex architecture over Mimi codec tokens. This is the new reference design.

Type: Learn

Languages: Python

Prerequisites: Phase 6 · 13 (Neural Audio Codecs), Phase 6 · 11 (Real-Time Audio), Phase 7 · 05 (Full Transformer)

Time: ~75 minutes

The Problem

Every voice agent built from Lessons 11 + 12 has a fundamental latency floor around 300-500 ms: VAD fires, STT processes, LLM reasons, TTS generates. Each stage has its own minimum latency. You can tune and parallelize, but the pipeline shape caps you.

Moshi (Kyutai, 2024-2026) asks a different question: what if there is no pipeline? What if one model takes audio in and emits audio out directly, continuously, with text as an intermediate "inner monologue" instead of a required stage?

The answer is full-duplex speech-to-speech. Theoretical latency is 160 ms (one 80 ms Mimi frame plus 80 ms of acoustic delay). Practical latency is about 200 ms on a single L4 GPU. That's roughly half the 300-500 ms floor of a pipelined voice agent.
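The latency figures above follow from simple frame arithmetic. The sketch below reproduces them, assuming the acoustic delay is exactly one Mimi frame and taking the 200 ms practical figure as quoted; the constant names are illustrative, not part of any Moshi API.

```python
# Latency budget for one Moshi step, using the figures quoted in this lesson.
MIMI_FRAME_RATE_HZ = 12.5       # Mimi frame rate
ACOUSTIC_DELAY_FRAMES = 1       # assumption: the 80 ms acoustic delay is one frame
PRACTICAL_MS = 200.0            # practical latency quoted for a single L4 GPU

frame_ms = 1000 / MIMI_FRAME_RATE_HZ                      # 80.0 ms per frame
theoretical_ms = frame_ms * (1 + ACOUSTIC_DELAY_FRAMES)   # 160.0 ms
overhead_ms = PRACTICAL_MS - theoretical_ms               # 40.0 ms left for compute and I/O

print(frame_ms, theoretical_ms, overhead_ms)  # 80.0 160.0 40.0
```

The remaining 40 ms is whatever the deployment spends on inference and transport on top of the architectural floor.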

The Concept

[Figure: Moshi architecture, two parallel Mimi streams plus inner-monologue text]

The Moshi architecture

Inputs. Two Mimi codec streams, both at 12.5 Hz × 8 codebooks: one carrying the user's audio and one carrying Moshi's own audio.
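As a quick check on those rates, the token budget per second can be worked out directly. This is plain arithmetic on the numbers quoted here; the variable names are illustrative.

```python
# Token rates implied by 12.5 Hz x 8 codebooks on two audio streams.
FRAME_RATE_HZ = 12.5
CODEBOOKS = 8
AUDIO_STREAMS = 2   # user audio and Moshi audio

audio_tokens_per_frame = CODEBOOKS * AUDIO_STREAMS              # 16 audio tokens per 80 ms step
audio_tokens_per_sec_per_stream = FRAME_RATE_HZ * CODEBOOKS     # 100.0
audio_tokens_per_sec = audio_tokens_per_sec_per_stream * AUDIO_STREAMS  # 200.0

print(audio_tokens_per_frame, audio_tokens_per_sec_per_stream, audio_tokens_per_sec)
# 16 100.0 200.0
```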

The transformer. A 7B-parameter Temporal Transformer processes both streams and a text "inner monologue" stream. At each 80 ms step, it:

  1. Consumes the latest user Mimi tokens (8 codebooks).
  2. Consumes the most recent Moshi Mimi tokens (8 codebooks, as produced).
  3. Generates the next Moshi text token (inner monologue).
  4. Generates the next Moshi Mimi tokens (8 codebooks via a small Depth Transformer).

All three streams (user audio, Moshi audio, Moshi text) run in parallel. Moshi can hear the user while speaking, cut itself off when the user interrupts, and back-channel ("mhm") without breaking its main utterance.
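The four-step loop above can be sketched symbolically. The model here is a random stub, not Moshi; `Streams`, `stub_model`, `step`, and both vocabulary sizes are assumptions made for illustration. The point is only the bookkeeping: every 80 ms step appends exactly one entry to each of the three streams.

```python
import random
from dataclasses import dataclass, field

CODEBOOKS = 8
CODEBOOK_SIZE = 2048   # assumed entries per codebook, illustration only
TEXT_VOCAB = 32        # assumed toy text vocabulary

@dataclass
class Streams:
    user_audio: list = field(default_factory=list)    # one 8-token frame per step
    moshi_audio: list = field(default_factory=list)   # one 8-token frame per step
    moshi_text: list = field(default_factory=list)    # one text token per step

def stub_model(user_frame, prev_moshi_frame, rng):
    """Random stand-in for the real transformers; it ignores its inputs."""
    text_token = rng.randrange(TEXT_VOCAB)
    audio_frame = [rng.randrange(CODEBOOK_SIZE) for _ in range(CODEBOOKS)]
    return text_token, audio_frame

def step(streams, user_frame, rng):
    streams.user_audio.append(user_frame)                         # 1. latest user tokens
    prev = streams.moshi_audio[-1] if streams.moshi_audio else [0] * CODEBOOKS  # 2. last Moshi tokens
    text_token, audio_frame = stub_model(user_frame, prev, rng)   # 3. text, then 4. audio
    streams.moshi_text.append(text_token)
    streams.moshi_audio.append(audio_frame)
    return text_token, audio_frame

rng = random.Random(0)
streams = Streams()
for _ in range(5):  # five steps = 400 ms of dialogue
    user_frame = [rng.randrange(CODEBOOK_SIZE) for _ in range(CODEBOOKS)]
    step(streams, user_frame, rng)

print(len(streams.user_audio), len(streams.moshi_audio), len(streams.moshi_text))  # 5 5 5
```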

The depth transformer. Within a frame, the 8 codebooks are not predicted in parallel, because each one depends on the codebooks before it. A small 6-layer "depth transformer" predicts them sequentially within the 80 ms step. This temporal/depth split follows the RQ-Transformer factorization for autoregressive codec LMs, which Sesame CSM also uses.
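Sequential prediction within a frame means the joint distribution over the 8 tokens is factored as a product of conditionals, p(a_1..a_8 | h) = ∏ p(a_k | h, a_1..a_{k-1}), where h is the temporal context. The sketch below shows only that control flow; `toy_next_token` is a hypothetical deterministic stand-in for the depth transformer, and the codebook size is an assumption.

```python
import random

CODEBOOKS = 8
CODEBOOK_SIZE = 2048  # assumed entries per codebook, illustration only

def toy_next_token(context_id, k, previous_tokens):
    """Stand-in for p(a_k | h, a_<k): output depends on context, index k, and earlier tokens."""
    seed = context_id * 1_000_003 + k * 7919
    for i, tok in enumerate(previous_tokens):
        seed += (i + 1) * 31 * tok
    return random.Random(seed).randrange(CODEBOOK_SIZE)

def sample_frame(context_id):
    """Predict the 8 codebook tokens of one frame, one after another."""
    tokens = []
    for k in range(CODEBOOKS):
        # Codebook k sees every token already chosen for this frame.
        tokens.append(toy_next_token(context_id, k, tokens))
    return tokens

frame = sample_frame(context_id=42)
print(len(frame))  # 8
```

Because each call receives the tokens chosen so far, the 8 predictions cannot be batched across codebooks, which is why this inner model is kept small.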

Why inner-monologue text helps

Without explicit text, the model has to implicitly model language in its acoustic stream. Moshi's insight: force it to emit text tokens alongside audio. The text stream is essentially the transcript of what Moshi is saying. This improves semantic coherence, makes it easier to swap out a language model head, and gives you transcripts for free.
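To see how a frame-aligned text stream doubles as a transcript, here is a minimal sketch. It assumes that frames with no new text carry a padding symbol; the `<pad>` token, the sub-word pieces, and both function names are hypothetical and chosen only for illustration.

```python
FRAME_MS = 80
PAD = "<pad>"  # hypothetical padding symbol for frames with no new text

def transcript(text_stream):
    """Drop padding and join the remaining pieces into plain text."""
    return "".join(tok for tok in text_stream if tok != PAD).strip()

def token_onsets_ms(text_stream):
    """Each text token inherits the timestamp of the frame it was emitted on."""
    return [(i * FRAME_MS, tok) for i, tok in enumerate(text_stream) if tok != PAD]

# One text slot per 80 ms frame; most slots are padding because words span several frames.
stream = [PAD, " hel", "lo", PAD, PAD, " there", PAD]

print(transcript(stream))       # hello there
print(token_onsets_ms(stream))  # [(80, ' hel'), (160, 'lo'), (400, ' there')]
```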

Hibiki: streaming speech-to-speech translation

Same architecture, trained on translation pairs. Source audio in, target-language audio out, continuously. Hibiki-Zero (Feb 2026) eliminates the need for word-level aligned training data: it uses sentence-level data plus GRPO reinforcement learning to optimize latency.

Four language pairs are supported initially; the model can be adapted to a new language with ≈1000 hours of audio.

The broader Kyutai stack (2026)

Throughput on an L40S GPU: 64 concurrent sessions at 3× real-time.

Sesame CSM — the cousin

Sesame CSM (2025) uses a similar idea: a Llama-3 backbone with a Mimi codec head. But CSM is one-directional (it takes context plus text and produces speech) rather than full-duplex. It's the best "voice presence" TTS on the market, but it does not offer Moshi's full-duplex capability.

2026 performance numbers

| Model | Latency | Use case | License |
|---|---|---|---|
| Moshi | 200 ms (L4) | full-duplex English / French dialogue | CC-BY 4.0 |
| Hibiki | 12.5 Hz frame rate (80 ms frames) | French ↔ English streaming translation | CC-BY 4.0 |
| Hibiki-Zero | same | 5 language pairs, no aligned data | CC-BY 4.0 |
| Sesame CSM-1B | 200 ms TTFA | context-conditioned TTS | Apache-2.0 |
| GPT-4o Realtime | ~300 ms | closed, OpenAI API | commercial |
| Gemini 2.5 Live | ~350 ms | closed, Google API | commercial |

Build It

Step 1: the interface

Moshi exposes a WebSocket server that takes 80 ms chunks of Mimi-encoded audio and returns 80 ms chunks of Mimi-encoded audio. Both ways. Constantly.

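import asyncio
import websockets
# Helper names as used in this lesson; confirm them against your installed moshi version.
from moshi.client_utils import encode_audio_mimi, decode_audio_mimi

async def moshi_chat():
    # One WebSocket carries both directions; the two coroutines are defined in Step 2.
    async with websockets.connect("ws://localhost:8998/api/chat") as ws:
        mic_task = asyncio.create_task(stream_mic_to(ws))            # uplink: mic -> server
        spk_task = asyncio.create_task(stream_from_to_speaker(ws))   # downlink: server -> speaker
        await asyncio.gather(mic_task, spk_task)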

Step 2: the full-duplex loop

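# mic_stream_at_12_5_hz, serialize, deserialize, and play are placeholders you supply.

async def stream_mic_to(ws):
    # Uplink: encode each 80 ms microphone chunk and send it immediately.
    async for chunk_80ms in mic_stream_at_12_5_hz():
        mimi_tokens = encode_audio_mimi(chunk_80ms)
        await ws.send(serialize(mimi_tokens))

async def stream_from_to_speaker(ws):
    # Downlink: each message carries one frame of Moshi audio plus its text token.
    async for msg in ws:
        mimi_tokens, text_token = deserialize(msg)  # text_token is the inner-monologue piece
        audio = decode_audio_mimi(mimi_tokens)
        await play(audio)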

Both directions run concurrently over the same WebSocket. Python asyncio tasks or Rust futures are the usual way to drive the two loops.
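The concurrency pattern can be tried without a Moshi server. The sketch below is a self-contained stand-in, with in-memory `asyncio.Queue` objects in place of the WebSocket and an echo coroutine in place of the model; every name in it is hypothetical. It records the order of events to show that frames come back while later frames are still being sent.

```python
import asyncio

async def send_loop(uplink, events, n_frames):
    """Uplink: push one frame per step, yielding between frames."""
    for i in range(n_frames):
        await uplink.put(i)
        events.append(("sent", i))
        await asyncio.sleep(0)   # yield, as waiting for the next mic chunk would
    await uplink.put(None)       # end-of-stream marker

async def fake_server(uplink, downlink):
    """Echo stand-in for the model: one frame out for each frame in."""
    while (frame := await uplink.get()) is not None:
        await downlink.put(frame)
    await downlink.put(None)

async def recv_loop(downlink, events):
    """Downlink: consume frames as soon as they arrive."""
    while (frame := await downlink.get()) is not None:
        events.append(("recv", frame))

async def main(n_frames=5):
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    events = []
    await asyncio.gather(
        send_loop(uplink, events, n_frames),
        fake_server(uplink, downlink),
        recv_loop(downlink, events),
    )
    return events

events = asyncio.run(main())
print(events)
```

In the printed log, the first `recv` entry appears before the last `sent` entry, which is the property a half-duplex request/response loop would not have.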

Step 3: the training objective (conceptual)

For every 80 ms frame t:

Text is predicted before audio (inner monologue); audio is predicted codebook-sequential within the depth transformer.
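The per-frame objective itself is not shown here. One form consistent with the sentence above, written as a sketch under stated assumptions rather than as Moshi's exact loss, is a cross-entropy term for the text token plus one term per codebook. The symbols are introduced for illustration: $x_{<t}$ is every token from all three streams before frame $t$, $w_t$ is Moshi's text token, $a_{t,k}$ is the token of codebook $k$, and $\alpha_k$ are hypothetical per-codebook weights.

```latex
\mathcal{L}_t
  = -\log p_\theta\!\left(w_t \mid x_{<t}\right)
    \;-\; \sum_{k=1}^{8} \alpha_k \,
      \log p_\theta\!\left(a_{t,k} \mid x_{<t},\, w_t,\, a_{t,<k}\right)
```

The conditioning on $w_t$ encodes "text before audio", and the conditioning on $a_{t,<k}$ encodes the codebook-sequential order inside the depth transformer.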

Step 4: where Moshi wins and where it doesn't

Moshi wins:

Moshi does not win:

Use It

| Situation | Pick |
|---|---|
| Lowest-latency voice companion | Moshi |
| Live translation call | Hibiki |
| Voice demo / research | Moshi, CSM |
| Enterprise agent with tools | Pipeline (Lesson 12), not Moshi |
| Custom-voice TTS in context | Sesame CSM |
| Speech-to-speech, any languages | GPT-4o Realtime or Gemini 2.5 Live (commercial) |
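The table can be encoded as a small decision helper. This is a sketch only: the function name, the flag names, and especially the precedence order when several flags are set are assumptions, not something the table specifies.

```python
def pick_architecture(*, needs_tools=False, any_languages=False,
                      live_translation=False, custom_voice_tts=False):
    """Map workload flags to the picks in the table (precedence order is an assumption)."""
    if needs_tools:
        return "Pipeline (Lesson 12)"
    if any_languages:
        return "GPT-4o Realtime or Gemini 2.5 Live"
    if live_translation:
        return "Hibiki"
    if custom_voice_tts:
        return "Sesame CSM"
    return "Moshi"  # default: lowest-latency voice companion

print(pick_architecture())                       # Moshi
print(pick_architecture(live_translation=True))  # Hibiki
print(pick_architecture(needs_tools=True))       # Pipeline (Lesson 12)
```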

Pitfalls

Ship It

Save as outputs/skill-duplex-pipeline.md. The skill picks a pipeline or a full-duplex architecture for a given voice-agent workload and states the reason.

Exercises

  1. Easy. Run code/main.py. It simulates the two-stream + inner-monologue architecture symbolically.
  2. Medium. Pull Moshi from HuggingFace, run the server, test one conversation. Measure wall-clock latency from end-of-user-speech to start-of-Moshi-response.
  3. Hard. Take your Lesson 12 pipeline agent and compare P50 latency vs Moshi on 20 matched test utterances. Write up when a pipeline architecturally wins anyway.
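For exercises 2 and 3, a small helper keeps the latency bookkeeping consistent. It is a sketch: the function names are hypothetical, and the sample values are made-up placeholders to show the call shape, not measurements of any system.

```python
import statistics

def response_latency_ms(user_speech_end_s, response_start_s):
    """Wall-clock gap from end of user speech to first response audio, in ms."""
    return (response_start_s - user_speech_end_s) * 1000.0

def p50_ms(latencies_ms):
    """P50 is the median of the per-utterance latencies."""
    return statistics.median(latencies_ms)

# Placeholder values only; replace with your own measurements from 20 matched utterances.
pipeline_ms = [410.0, 380.0, 455.0]
moshi_ms = [205.0, 190.0, 230.0]

print(p50_ms(pipeline_ms), p50_ms(moshi_ms))  # 410.0 205.0
print(response_latency_ms(2.0, 2.25))         # 250.0
```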

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Full-duplex | Hear-and-speak at once | Two audio streams active simultaneously on the same model. |
| Inner monologue | Model's text stream | Moshi emits text tokens alongside its audio output. |
| Depth transformer | Inter-codebook predictor | Small transformer that predicts 8 codebooks within one 80 ms frame. |
| Mimi | Kyutai's codec | 12.5 Hz × 8 codebooks; semantic+acoustic; powers Moshi. |
| Streaming S2S | Audio → audio live | Chunk-by-chunk translation/dialogue, no pipeline stages. |
| Back-channeling | "Mhm" reactions | Moshi can emit small acknowledgments without breaking its turn. |

Further Reading