Show-o and Discrete-Diffusion Unified Models

> Transfusion mixes continuous and discrete representations. Show-o (Xie et al., August 2024) goes the other way: text tokens use causal next-token prediction, image tokens use masked discrete diffusion in the spirit of MaskGIT. Both sit inside one transformer with a hybrid attention mask. The result unifies VQA, text-to-image, inpainting, and mixed-modality generation on one backbone, one tokenizer per modality, one loss formulation (next-token extended to masked prediction). This lesson walks the Show-o design — why masked discrete diffusion is a parallel, few-step image generator — and contrasts with Transfusion and Emu3.

Type: Learn

Languages: Python (stdlib, masked-discrete-diffusion sampler)

Prerequisites: Phase 12 · 13 (Transfusion)

Time: ~120 minutes

Learning Objectives

The Problem

Transfusion's two-loss training works but has trickier dynamics — the continuous diffusion loss lives on a different numerical scale from the discrete NTP loss. Balancing loss weights is a hyperparameter search. The architecture is effective but complex.

Show-o's answer: keep both modalities discrete (like Chameleon), but generate images in parallel via masked discrete diffusion instead of sequentially. The training objective becomes a single masked-token-prediction loss that naturally generalizes next-token prediction.

The Concept

Masked discrete diffusion (MaskGIT)

The original Chang et al. (2022) MaskGIT trick is elegant. Start from a fully masked image (every token is the special mask id). At each step, predict all masked tokens in parallel, then keep the top-K most confident predictions and re-mask the rest. After ~8-16 iterations, all tokens are filled in. The schedule of how many tokens to unmask per step is tuned; cosine schedules work well.
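The loop above can be sketched in a few lines of stdlib Python. This is a toy illustration, not the MaskGIT or Show-o implementation: `toy_predict` is a hypothetical stand-in for the transformer that returns random tokens and confidences, `MASK = -1` is an assumed sentinel id, and the sketch commits a fixed K per step rather than following a tuned schedule.

```python
import math
import random

MASK = -1  # assumed sentinel standing in for the special mask id


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical model: one (token, confidence) guess per masked position."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def maskgit_decode(num_tokens=16, vocab_size=8, steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens          # start fully masked
    k = math.ceil(num_tokens / steps)     # fixed top-K per step (a simplification)
    masked_after_step = []
    for _ in range(steps):
        preds = toy_predict(tokens, vocab_size, rng)  # predict all masked slots at once
        ranked = sorted(preds, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:k]:              # commit the K most confident predictions
            tokens[i] = preds[i][0]
        # Everything not committed simply stays masked for the next step.
        masked_after_step.append(tokens.count(MASK))
    return tokens, masked_after_step


tokens, masked_after_step = maskgit_decode()
print(masked_after_step)
```

With the defaults (16 tokens, 4 steps) the number of masked tokens falls by four per step until none remain. A real sampler would replace the fixed K with the schedule discussed under "Masking schedule".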

Training is simple: sample a masking ratio uniformly from [0, 1], apply it to the image's VQ tokens, train the transformer to recover the masked ones. Exactly what BERT did for text, scaled to image generation.
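A minimal sketch of how one training example could be built under that recipe. The function name `mask_for_training`, the `MASK = -1` sentinel, and the choice to always mask at least one token are assumptions made for illustration; they are not taken from the Show-o code.

```python
import random

MASK = -1  # assumed sentinel standing in for the special mask id


def mask_for_training(image_tokens, rng):
    """Return (model inputs, {position: target token}) for one image."""
    ratio = rng.random()  # masking ratio drawn uniformly from [0, 1)
    # Assumption: mask at least one token so every example has a target.
    n_mask = max(1, round(ratio * len(image_tokens)))
    positions = set(rng.sample(range(len(image_tokens)), n_mask))
    inputs = [MASK if i in positions else t for i, t in enumerate(image_tokens)]
    targets = {i: image_tokens[i] for i in positions}
    return inputs, targets


rng = random.Random(0)
vq_tokens = [101, 57, 903, 12, 440, 7, 66, 250]  # made-up VQ ids
inputs, targets = mask_for_training(vq_tokens, rng)
print(inputs, targets)
```

The loss is then computed only at the positions in `targets`; unmasked positions serve as context.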

Show-o: one transformer, hybrid mask

Show-o puts MaskGIT inside a causal-language-model transformer. The attention mask is hybrid: causal over text tokens and bidirectional within image blocks.
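One way to picture a mask that is causal over text and bidirectional within image blocks is the sketch below. It assumes each contiguous run of image tokens forms one block and that every token may also attend to everything before it; the function name and the `"text"`/`"image"` labels are hypothetical.

```python
def hybrid_attention_mask(modalities):
    """modalities: list of "text" / "image" labels, one per sequence position.

    Returns mask[i][j] == True when position i may attend to position j.
    """
    n = len(modalities)
    block = [None] * n   # image-block id per position, None for text
    current = -1
    for i, m in enumerate(modalities):
        if m == "image":
            if i == 0 or modalities[i - 1] != "image":
                current += 1          # a new contiguous image run starts here
            block[i] = current
    return [
        [
            j <= i                                            # causal part
            or (block[i] is not None and block[i] == block[j])  # same image block
            for j in range(n)
        ]
        for i in range(n)
    ]


seq = ["text", "text", "image", "image", "image", "text"]
for row in hybrid_attention_mask(seq):
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the image rows form a filled square over their own block, while text rows keep the usual lower-triangular shape.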

Training alternates between:

  1. Standard NTP on text sequences.
  2. T2I samples: text → image with masked image tokens, masked-token-prediction loss.
  3. VQA samples: image → text with masked text tokens (really just NTP).

The unified loss is cross-entropy on tokens, which covers both text NTP (only the last token is "masked") and image masked-diffusion (random subset is masked).
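To make the "one loss, different target positions" idea concrete, here is a small stdlib sketch. The helper names are hypothetical, the logits are made up, and any weighting between text and image terms is deliberately left out.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def unified_loss(logits_per_pos, targets, loss_positions):
    """Mean cross-entropy over whichever positions count as prediction targets."""
    losses = [cross_entropy(logits_per_pos[i], targets[i]) for i in loss_positions]
    return sum(losses) / len(losses)


# Made-up logits over a 4-token vocabulary at three positions.
logits = [[0.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [1, 0, 3]

text_loss = unified_loss(logits, targets, loss_positions=[0, 1, 2])  # NTP: every position is a target
image_loss = unified_loss(logits, targets, loss_positions=[1])       # masked: only masked positions
print(text_loss, image_loss)
```

The same function serves both cases; only `loss_positions` changes between a text sequence and a masked image.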

Parallel sampling

Show-o generates an image in ~16 steps, versus ~1000 for token-by-token autoregressive decoding and ~20 for continuous diffusion. At each step, predict all masked tokens in parallel; commit the top-K most confident; repeat.

Compare:

At similar model scale, Show-o is faster than Chameleon and roughly matches Transfusion's step count, with a lower per-step cost (discrete vocabulary logits rather than continuous latent prediction).

Tasks in one checkpoint

Show-o supports four tasks at inference, selected by prompt format: VQA, text-to-image, inpainting, and mixed-modality generation.

The inpainting capability comes for free from the masked-prediction training. Mask a region of the VQ-token grid, feed the rest plus a text prompt, predict the masked tokens.
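Masking a region of the token grid might look like the sketch below. The grid values, the `MASK = -1` sentinel, and the function name are illustrative assumptions; the resulting grid would then be flattened and handed to the same iterative unmasking loop used for generation.

```python
MASK = -1  # assumed sentinel standing in for the special mask id


def mask_region(grid, top, left, height, width):
    """Return a copy of a 2D VQ-token grid with a rectangle replaced by MASK."""
    return [
        [
            MASK if top <= r < top + height and left <= c < left + width else tok
            for c, tok in enumerate(row)
        ]
        for r, row in enumerate(grid)
    ]


# A made-up 4x4 grid of VQ ids.
grid = [[10 * r + c for c in range(4)] for r in range(4)]
for row in mask_region(grid, top=1, left=1, height=2, width=2):
    print(row)
```

Only the masked positions are predicted; the surrounding tokens and the text prompt act as fixed context.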

Masking schedule

The schedule of how many tokens to unmask per step shapes quality. Show-o recommends cosine:

mask_ratio(t) = cos(pi * t / (2 * T))   # t = 0..T

At step 0, all tokens masked (ratio 1.0). At step T, none masked. Cosine concentrates mass on mid-range ratios where prediction is most informative. Linear schedules also work but plateau faster.
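The formula translates directly into code. The sketch below assumes the number of tokens left masked after step t is the floor of `num_tokens * mask_ratio(t)`; the rounding rule and the helper name `masked_counts` are assumptions, not something the source specifies.

```python
import math


def mask_ratio(t, T):
    """Cosine schedule: 1.0 at t = 0, falling to 0.0 at t = T."""
    return math.cos(math.pi * t / (2 * T))


def masked_counts(num_tokens, T):
    """Tokens still masked after each step t = 0..T (floor rounding is an assumption)."""
    return [math.floor(num_tokens * mask_ratio(t, T)) for t in range(T + 1)]


counts = masked_counts(num_tokens=1024, T=4)
committed = [a - b for a, b in zip(counts, counts[1:])]
print(counts)     # tokens still masked after each step
print(committed)  # tokens committed at each step
```

For 1024 tokens and T=4 this gives masked counts of 1024, 946, 724, 391, 0, so the sampler commits 78, 222, 333, and 391 tokens in turn: few commitments while context is scarce, more as the image fills in.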

Show-o2

Show-o2 (2025 follow-up, arXiv:2506.15564) scales Show-o: larger LLM base, better tokenizer, improved mask schedule. Same architectural pattern.

Where Show-o sits

In the 2026 taxonomy:

Pick by task: Show-o when you want T2I + inpainting + VQA in one open model with reasonable speed; Transfusion when quality is paramount and you can afford the two-loss plumbing.

Use It

code/main.py simulates Show-o sampling:

Run it and watch the mask dissolve step by step.

Ship It

This lesson produces outputs/skill-unified-gen-model-picker.md. Given a product that needs both understanding (VQA, captioning) and generation (T2I, inpainting) under an open-weights constraint, the skill picks among the Show-o family, the Transfusion/MMDiT family, and the Emu3/Chameleon family, with concrete trade-offs.

Exercises

  1. Masked discrete diffusion samples in ~16 steps. Why not 1? What breaks if you unmask everything at step 0?
  2. Inpainting is free with masked diffusion. Propose a product use case (real or hypothetical) where Show-o's inpainting beats a specialist model.
  3. Cosine schedule vs linear schedule: trace the number of unmasked tokens per step for T=8. Which is more balanced?
  4. A 512x512 Show-o image is 1024 tokens. At vocab K=16384, the model emits 1024 * log2(16384) = 14,336 bits (~1.75 KiB) of data. Stable Diffusion outputs 512*512*24 bits = 6,291,456 bits (~768 KiB) of raw pixels. What is the compression ratio and what quality does it buy?
  5. Read LlamaGen (arXiv:2406.06525). How is LlamaGen's class-conditional autoregressive image model different from Show-o's masked approach?

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| Masked discrete diffusion | "MaskGIT-style" | Training to predict masked tokens; at inference, iteratively unmask the most-confident predictions |
| Cosine schedule | "Unmask schedule" | Decay of mask ratio over inference steps; concentrates confidence growth at mid-range |
| Parallel decoding | "All tokens at once" | Every step predicts the full sequence of masked tokens in one forward pass, then commits top-K |
| Hybrid attention | "Causal + bidirectional" | Mask that is causal over text tokens and bidirectional within image blocks |
| Inpainting | "Fill-in generation" | Condition on an image with some tokens masked, predict the missing ones; free from the training objective |
| Commitment rate | "Top-K per step" | How many tokens are declared "done" per iteration; controls inference vs quality trade-off |

Further Reading