Janus-Pro: Decoupled Encoders for Unified Multimodal Models

> Unified multimodal models have an unavoidable tension. Understanding wants semantic features — SigLIP or DINOv2 output vectors rich with concept-level information. Generation wants reconstruction-friendly codes — VQ tokens that compose back into crisp pixels. The two goals are not compatible in a single encoder. Janus (DeepSeek, October 2024) and Janus-Pro (DeepSeek, January 2025) argue the fix is to stop trying: decouple the two encoders. Share the transformer body between tasks, but route understanding through SigLIP and generation through a VQ tokenizer. At 7B, Janus-Pro beats DALL-E 3 on GenEval while matching LLaVA on MMMU. This lesson reads why two encoders work where one fails.

Type: Build

Languages: Python (stdlib, dual-encoder routing + shared-body signal)

Prerequisites: Phase 12 · 13 (Transfusion), Phase 12 · 14 (Show-o)

Time: ~120 minutes

Learning Objectives

The Problem

Unified models share a transformer body across understanding and generation. Previous attempts (Chameleon, Show-o, Transfusion) all use one visual tokenizer for both directions. The tokenizer is a compromise:

Show-o and Transfusion pay for this with a visible quality tax on one direction. Janus-Pro asks: why require one tokenizer when the tasks have different needs?

The Concept

Decoupled visual encoding

Janus-Pro's architecture separates the two encoders:

The transformer body is shared. Everything upstream and downstream of the body is task-specific.

Inputs are disambiguated by prompt format: a tag routes through SigLIP; routes through VQ. Or the routing is implicit from task.

Why this works

Understanding loss gets SigLIP features, which CLIP-style pretraining has tuned for semantic similarity. The model's perception benchmarks improve over Show-o / Transfusion because the input features are better for the task.

Generation loss gets VQ tokens, which a tokenizer has tuned for reconstruction. Image quality improves over Show-o because VQ codes compose back to pixels cleanly.

The shared transformer body sees two input distributions (SigLIP and VQ) and learns to work with both. The claim: enough data + enough parameters, the body absorbs the switching.

Data scaling — Janus vs Janus-Pro

Janus (original, arXiv 2410.13848) introduced the decoupling but at small scale (1.3B params, limited data). Janus-Pro (arXiv 2501.17811) scaled:

The upshot: Janus-Pro-7B matches LLaVA on MMMU (60.3 vs ~58) and beats DALL-E 3 on GenEval (0.80 vs 0.67). One open model, competitive on both sides of the unified spectrum.

JanusFlow — the rectified flow variant

JanusFlow (arXiv 2411.07975) swaps the VQ generation path for a rectified-flow generation path (continuous). The split becomes SigLIP-for-understanding + rectified-flow-for-generation. Quality ceilings lift further. The architecture remains decoupled-encoders-shared-body.

The shared body's job

The transformer body processes a unified sequence but with two input distributions. Its job is to:

The body has no modality-specific weights per block. It is the text-style transformer you'd expect to find inside Qwen or Llama, plus the two input adapters.

Interestingly, this means Janus-Pro's body could be initialized from a pretrained LLM. Janus-Pro does initialize from DeepSeek-MoE-7B. That choice matters: the LLM contributes reasoning ability that pure-from-scratch unified models struggle to reach.

Compared to InternVL-U

InternVL-U (Lesson 12.10) is the 2026 follow-up. It combines:

InternVL-U subsumes Janus-Pro's architectural choice into a larger framework. The decoupled-encoder idea is now the default for unified models at scale.

Limitations

Decoupled encoders add architectural complexity. Two tokenizers to train, two input paths to maintain, two sets of fail modes. For products that do not need generation, Janus-Pro is over-engineered — pick a LLaVA-family understanding model.

For products that do not need understanding, Janus-Pro is overqualified — pick a Stable Diffusion 3 / Flux model.

For products that need both, Janus-Pro is now the reference open architecture.

Use It

code/main.py simulates Janus-Pro routing:

Print the routed paths for 3 examples: image QA, T2I, image editing.

Ship It

This lesson produces outputs/skill-decoupled-encoder-picker.md. Given a product that wants unified generation + understanding at frontier-ish quality, it picks Janus-Pro, JanusFlow, or InternVL-U with a concrete data-scale recommendation.

Exercises

  1. Janus-Pro-7B beats DALL-E 3 on GenEval. Explain why a 7B open model can match a frontier proprietary model on generation but not on understanding.
  1. Implement a router function: given prompt text, classify as understand or generate. How do you handle ambiguous prompts like "describe and then sketch"?
  1. JanusFlow replaces the VQ path with rectified flow. What does the transformer body now output, and what changes in the loss?
  1. Propose a fourth task the Janus-Pro architecture could handle with one more decoupled encoder. Examples: image segmentation (DINO-style), depth (MiDaS-style).
  1. Read Janus-Pro Section 4.2 on data scaling. Which data stage contributes most to the T2I quality gain vs Janus?

Key Terms

Term What people say What it actually means
Decoupled encoding "Two visual encoders" Separate tokenizer or encoder per direction: semantic for understanding, reconstruction for generation
Shared body "One transformer" Single transformer processes either encoder's output; no modality-specific weights
SigLIP for understanding "Semantic features" CLIP-family vision tower providing rich conceptual features but poor reconstruction
VQ for generation "Reconstruction codes" Vector-quantized tokens that decode cleanly back to pixels
JanusFlow "Rectified-flow variant" Janus-Pro with a continuous flow-matching generation head instead of VQ
Routing tag "Task tag" Prompt marker ( / ) that picks the input encoder

Further Reading