Chameleon and Early-Fusion Token-Only Multimodal Models

> Every VLM we have seen so far keeps images and text separate. Visual tokens come from a vision encoder, flow into a projector, then meet text inside the LLM. The vision and text vocabularies never overlap. Chameleon (Meta, May 2024) asked: what if they did? Train a VQ-VAE that turns an image into a sequence of discrete tokens from a shared vocabulary. Every multimodal document is now one sequence — text tokens and image tokens interleaved, a single autoregressive loss. Side effect: the model can generate mixed-modality outputs — alternating text and image tokens in a single inference call. This lesson reads the early-fusion thesis and builds a toy version end to end.

Type: Build

Languages: Python (stdlib, VQ-VAE tokenizer + interleaved decoder)

Prerequisites: Phase 12 · 05, Phase 8 (Generative AI)

Time: ~180 minutes

Learning Objectives

The Problem

Adapter-based VLMs (LLaVA, BLIP-2, Qwen-VL) treat text and image as two different things. A text token goes through embed(text_token); an image goes through visual_encoder(image) → projector → ... pseudo_tokens. The model has two input paths that merge partway in.

Three consequences:

  1. The LLM can only consume images, not emit them. Output is text only.
  2. Mixed-modality documents (alternating paragraphs and images, as in an article) are awkward — you either parse the multimodal input outside the model or chain generations.
  3. Distributional mismatch. Visual tokens and text tokens live in different regions of the hidden space, creating subtle alignment issues.

Chameleon rejects the premise: images are just sequences of discrete tokens from a shared vocabulary. Train the model on interleaved documents, one loss, one autoregressive decoder, and you unlock mixed-modality generation for free.

The Concept

VQ-VAE as image tokenizer

The tokenizer is a vector-quantized variational autoencoder. The architecture:

Training: VAE reconstruction loss + commitment loss + codebook loss. The codebook indices form a discrete alphabet for images.

For Chameleon: one image becomes 32*32 = 1024 tokens drawn from a vocabulary of 8192. Concatenate with text tokens (from the LLM's BPE vocabulary, say 32000). Final vocabulary: 40192. The transformer sees one sequence, one loss.

The shared vocabulary

Chameleon's vocabulary combines text tokens, image tokens, and modality separators. Each token has a single ID. The input embedding layer maps every ID to a D-dim hidden vector. The output projection maps hidden back to vocab logits. Softmax picks the next token, whatever modality.

Separators matter: and tags bracket the image-token sequence. At generation time, if the model emits , downstream software knows the next 1024 tokens are VQ indices to send to the decoder for pixel rendering.

Mixed-modality generation

Inference is next-token prediction in the shared vocabulary. Example prompt: "Draw a cat and describe it." Chameleon emits:

<image> 4821 1029 2891 ... (1024 image tokens) </image>
The cat is orange, sitting on a windowsill...

The model picks the order autonomously — it may produce image then text, text then image, or interleave. Same decoder, same loss.

Compare to adapter VLMs where generation is text-only. Chameleon reopens the question of model output modalities.

Training stability — QK-Norm, dropout, LayerNorm ordering

Early-fusion training is unstable at scale. Chameleon's paper documents three tricks:

Without these tricks, 34B-param Chameleon training diverged at multiple checkpoints. With them, it converges. The training recipe is as much of the contribution as the architecture.

The tokenizer's reconstruction ceiling

VQ-VAE is lossy. At 8192 codebook entries and 1024 tokens per 512x512 image, reconstruction PSNR caps around 26-28 dB. This is enough for recognizable image gen but visibly worse than continuous-space diffusion (Stable Diffusion 3 achieves 32+ dB).

The tokenizer is the bottleneck. Better tokenizers (MAGVIT-v2, IBQ, SBER-MoVQGAN) lift the ceiling. Emu3 (Lesson 12.12) achieves SDXL-quality generation via a better tokenizer alone.

Chameleon vs BLIP-2 / LLaVA

Chameleon (early fusion, shared vocab):

BLIP-2 / LLaVA (late fusion, separate towers):

Pick by task. If you need image generation, Chameleon family. If you only need understanding, adapter-VLM is simpler and reuses more pretrained compute.

Fuyu and AnyGPT

Fuyu (Adept, 2023) is a related approach: skip the separate vision encoder entirely, feed raw image patches through the LLM's input projection as if they were tokens, no tokenizer. Simpler than Chameleon, loses the shared-vocab output generation.

AnyGPT (Zhan et al., 2024) extends Chameleon to four modalities: text, image, speech, music. Same VQ-VAE trick for each, shared transformer. Any-to-any generation. Covered more in Lesson 12.16.

Use It

code/main.py builds a toy end-to-end early-fusion model:

The code intentionally keeps the transformer tiny (bigrams) so you can trace the signal flow end to end.

Ship It

This lesson produces outputs/skill-tokenizer-vs-adapter-picker.md. Given a product spec (understand only vs understand + generate, required image quality, cost budget), it picks between Chameleon-family (early fusion) and LLaVA-family (late fusion) and justifies with quantitative rules of thumb.

Exercises

  1. Chameleon uses K=8192 codebook entries and 1024 tokens per 512x512 image. Estimate the compression ratio vs a 24-bit RGB image. Is it lossy? How lossy?
  1. A 4K image (3840x2160) at the same VQ-VAE density produces how many image tokens? Can a Chameleon-style model generate a 4K image in one inference call? What breaks first — context, tokenizer quality, or KV cache?
  1. Implement QK-Norm in pure Python. Given a 64-dim query and key, show the dot product before and after LayerNorm. Why is magnitude control important at depth?
  1. Read Chameleon Section 2.3 on training stability. Describe the exact failure mode the paper observed at 34B without QK-Norm. What was the "norm explosion" signature?
  1. Extend the toy decoder to emit a mixed-modality response given a text-only prompt. Measure how often the model picks image-first vs text-first given training-data distribution 60% text-first / 40% image-first.

Key Terms

Term What people say What it actually means
Early fusion "Unified tokens" Images converted to discrete tokens sharing the transformer's vocabulary from step one
VQ-VAE "Image tokenizer" CNN + ViT + codebook that maps images to integer indices the transformer can predict
Shared vocabulary "One dictionary" A single token ID space covering text + image + modality separators
QK-Norm "Attention stabilizer" LayerNorm applied to query and key before their dot product, prevents norm blowup
Mixed-modality generation "Text + image output" Inference that autonomously produces interleaved text and image tokens in one pass
Codebook size "K entries" Number of discrete vectors the VQ-VAE can quantize to; trades compression for fidelity
Tokenizer ceiling "Reconstruction limit" Best PSNR achievable by decoding VQ tokens; bounds the model's image quality

Further Reading