
LLaVA and Visual Instruction Tuning

> LLaVA (April 2023) is the most copied multimodal architecture on the planet. It replaced BLIP-2's Q-Former with a 2-layer MLP, replaced Flamingo's gated cross-attention with naive token concatenation, and trained on 158k visual-instruction turns generated by GPT-4 from text-only captions. Any practitioner who built a VLM between 2023 and 2026 built some variant of LLaVA. LLaVA-1.5 added academic-task data to the instruction mix. LLaVA-NeXT added AnyRes for higher-resolution input. LLaVA-OneVision unified image, multi-image, and video in one recipe. This lesson reads the recipe, implements the projector, and explains why "simpler won."

Type: Build

Languages: Python (stdlib, projector + instruction-template builder)

Prerequisites: Phase 12 · 02 (CLIP), Phase 11 (LLM Engineering — instruction tuning)

Time: ~180 minutes

Learning Objectives

Build a 2-layer MLP projector that maps ViT patch embeddings (dim 1024) to an LLM's embedding dim (dim 4096).
Walk the LLaVA two-stage recipe: (1) projector alignment on 558k caption pairs, (2) visual instruction tuning on 158k GPT-4-generated turns.
Construct a LLaVA-format prompt with the image token placeholder, system prompt, and user/assistant turns.
Explain why the community moved from Q-Former to MLP despite Q-Former's token-budget win.

The Problem

BLIP-2's Q-Former (Lesson 12.03) compresses an image to 32 tokens. Clean, efficient, good for benchmarks. But it has two problems.

First, the Q-Former's pretraining objectives are not the final task. Stage 1 trains ITC+ITM+ITG. Stage 2 trains LM loss. The queries learn an intermediate representation that the LLM then has to decode, and information is lost in the 32-token bottleneck.

Second, the Q-Former takes 188M params, and at LLaVA's 2023 scale you had to co-design it with your target LLM. Change the LLM, retrain the Q-Former. Change the vision encoder, retrain. Every combination was a separate R&D project.

The LLaVA answer was embarrassing in its simplicity: take the ViT's 576 patch tokens, pass each through a 2-layer MLP (1024 → 4096 → 4096), and dump all 576 into the LLM's input sequence. No bottleneck. No stage 1 pretraining on weird objectives. Just train the MLP on a direct LM loss.

Where does the data come from? LLaVA's second insight: use GPT-4 (text-only) to generate instruction data. Feed GPT-4 the COCO caption and bounding-box data for an image, ask it to produce conversations, descriptions, and complex reasoning questions. 158k instruction-response turns for free. No human annotation.

The result: a VLM that trained on 8 A100s in about a day, outscored BLIP-2 and OpenFlamingo on the GPT-4-judged LLaVA-Bench, and shipped an open checkpoint the community could extend. By late 2023 it had spawned 50+ forks.

The Concept

The architecture

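LLaVA-1.5 at 7B:

Vision encoder: CLIP ViT-L/14 @ 336 (frozen during stage 1, optionally unfrozen stage 2).
Projector: 2-layer MLP with GELU activation, 1024 → 4096 → 4096.
LLM: Vicuna-7B, hidden size 4096 (later Llama-3.1-8B). The 13B variant uses Vicuna-13B, whose hidden size is 5120, so its projector is 1024 → 5120 → 5120.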
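The projector in the list above is small enough to sketch in pure Python. The version below is a toy, not the LLaVA implementation: the dimensions (16 → 32 → 32), the 9-patch input, the random initialization, and the helper names `make_linear`, `linear`, and `project` are all assumptions chosen for illustration. The GELU uses the common tanh approximation.

```python
import math
import random


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_linear(d_in, d_out, rng):
    # Hypothetical init: uniform weights scaled by 1/sqrt(d_in), zero bias.
    scale = 1.0 / math.sqrt(d_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def linear(x, layer):
    # y = W x + b for a single vector x.
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def project(patches, fc1, fc2):
    # Each patch embedding is projected independently: Linear -> GELU -> Linear.
    return [linear([gelu(h) for h in linear(p, fc1)], fc2) for p in patches]


rng = random.Random(0)
VIT_DIM, LLM_DIM, NUM_PATCHES = 16, 32, 9  # toy stand-ins for 1024, 4096, 576

fc1 = make_linear(VIT_DIM, LLM_DIM, rng)
fc2 = make_linear(LLM_DIM, LLM_DIM, rng)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VIT_DIM)] for _ in range(NUM_PATCHES)]

tokens = project(patches, fc1, fc2)
print(len(tokens), len(tokens[0]))  # one LLM-width vector per patch
```

The number of tokens is unchanged by the projector; only the width of each vector changes, which is why the visual token count is set entirely by the vision encoder.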

Forward pass on an image + text prompt:

img -> ViT -> 576 patches of dim 1024
patches -> MLP -> 576 tokens of dim 4096
prompt: system + "<image>" placeholder + user question
replace <image> token with the 576 projected tokens
feed the full sequence to the LLM
decode response

The image occupies 576 tokens of the LLM context. At 2048 context, that leaves 1472 tokens for text. At 32k context, it is a rounding error.
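The arithmetic above is easy to check for other context sizes. This is a minimal sketch; the function name `context_share` is hypothetical, and "32k" and "128k" are assumed to mean 32,768 and 131,072 tokens.

```python
VISUAL_TOKENS = 576


def context_share(visual_tokens, context_len):
    # Returns (tokens left for text, percent of context used by the image).
    remaining = context_len - visual_tokens
    return remaining, 100.0 * visual_tokens / context_len


for ctx in (2048, 32_768, 131_072):
    remaining, pct = context_share(VISUAL_TOKENS, ctx)
    print(f"{ctx:>7} context: {remaining:>7} tokens left for text, image uses {pct:.1f}%")
```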

Stage 1: projector alignment

Freeze ViT. Freeze LLM. Train only the 2-layer MLP. Dataset: 558k image-caption pairs (LAION-CC-SBU). Loss: language modeling on the caption, conditioned on the projected image tokens.

In a single epoch at batch 128 this is done in a few hours. The projector learns to map ViT-space to LLM-space. No task-specific supervision.

Stage 2: visual instruction tuning

Keep the projector trainable. Unfreeze the LLM (usually fully, sometimes LoRA). Train on 158k visual-instruction turns.
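The two stages differ only in which modules receive gradients. The sketch below writes that schedule down as plain data. The function name `trainable_modules` and the `unfreeze_vision` flag are hypothetical, and the LoRA option is not modeled.

```python
MODULES = ("vision_encoder", "projector", "llm")


def trainable_modules(stage, unfreeze_vision=False):
    # Stage 1: only the projector learns; ViT and LLM stay frozen.
    if stage == 1:
        return {"projector"}
    # Stage 2: projector and LLM learn; the ViT is optionally unfrozen.
    if stage == 2:
        modules = {"projector", "llm"}
        if unfreeze_vision:
            modules.add("vision_encoder")
        return modules
    raise ValueError(f"unknown stage: {stage}")


for stage in (1, 2):
    active = trainable_modules(stage)
    print(stage, {m: ("train" if m in active else "frozen") for m in MODULES})
```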

The instruction data is the trick. Liu et al. generated it by:

Take a COCO image.
Extract the text description (5 human captions + bounding-box list).
Send to GPT-4 with three prompt templates:

- Conversation: "Generate a back-and-forth dialogue between a user and assistant about this image."

- Detailed description: "Give a rich, detailed description of the image."

- Complex reasoning: "Ask a question that requires reasoning about the image, then answer it."

Parse GPT-4's output into (instruction, response) pairs.

None of this touches the image directly — only the text description. GPT-4 hallucinates plausible image content. Some noise, but it worked: 158k turns was enough to unlock dialogue.
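The pipeline above can be sketched without calling any model. Everything here is an illustrative assumption: the layout of the text context, the normalized box format, the "Question:/Answer:" reply format, and the helper names `build_context`, `build_request`, and `parse_turns`. Only the three template strings come from the list above. The GPT-4 reply is mocked with a hard-coded string.

```python
TEMPLATES = {
    "conversation": "Generate a back-and-forth dialogue between a user and assistant about this image.",
    "detailed_description": "Give a rich, detailed description of the image.",
    "complex_reasoning": "Ask a question that requires reasoning about the image, then answer it.",
}


def build_context(captions, boxes):
    # Text-only stand-in for the image: captions plus labeled boxes.
    lines = ["Captions:"] + [f"- {c}" for c in captions]
    lines.append("Objects (label: x1, y1, x2, y2):")
    for label, (x1, y1, x2, y2) in boxes:
        lines.append(f"- {label}: {x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}")
    return "\n".join(lines)


def build_request(kind, captions, boxes):
    # The text that would be sent to the text-only model for one template.
    return f"{build_context(captions, boxes)}\n\nTask: {TEMPLATES[kind]}"


def parse_turns(reply):
    # Assumed reply format: alternating "Question: ..." and "Answer: ..." lines.
    pairs, question = [], None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Question:"):
            question = line[len("Question:"):].strip()
        elif line.startswith("Answer:") and question is not None:
            pairs.append((question, line[len("Answer:"):].strip()))
            question = None
    return pairs


request = build_request(
    "conversation",
    ["A dog runs along a beach."],
    [("dog", (0.10, 0.20, 0.50, 0.90))],
)
mock_reply = "Question: What animal is shown?\nAnswer: A dog running on a beach."
print(request)
print(parse_turns(mock_reply))
```

Because the model only ever sees `request`, any detail missing from the captions and boxes cannot be grounded, which is where the noise mentioned above comes from.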

Why the community copied this

No stage-1-specific losses to tune. LM loss throughout.
Projector trains in hours, not days.
LLM can be swapped (LLaVA-Llama2, LLaVA-Mistral, LLaVA-Llama3) by rerunning the same recipe: realign the projector, then instruction-tune. Nothing in the architecture is tied to one LLM.
Visual-instruction data pipeline uses GPT-4 and is cheap to regenerate for a new domain.

LLaVA-1.5 and LLaVA-NeXT

LLaVA-1.5 (October 2023) added:

Academic-task data (VQA, OKVQA, RefCOCO) mixed into instruction tuning.
Better system prompt.
Input resolution raised from 224 to 336 px, which is where the 576 patch tokens (24 x 24 at patch size 14) come from.

LLaVA-NeXT (January 2024) added:

AnyRes: split high-res images into a grid of 336x336 crops (for example 2x2 or 1x3), plus one global low-res thumbnail. Each crop becomes 576 tokens, so a 2x2 grid yields 5 x 576 = 2880 visual tokens per image. OCR and chart tasks jumped.
Better instruction data mixture with ShareGPT4V (high-quality GPT-4V captions).
Stronger base LLMs (Mistral-7B, Yi-34B).
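The AnyRes token count follows directly from the grid shape. The sketch below counts crops plus the global thumbnail and nothing else; the function name `anyres_tokens` is hypothetical, and any extra separator tokens or padding removal a real implementation might apply are ignored.

```python
TOKENS_PER_CROP = 576  # 24 x 24 patches for a 336 px crop at patch size 14


def anyres_tokens(grid_rows, grid_cols, tokens_per_crop=TOKENS_PER_CROP):
    # Every grid crop plus one global thumbnail contributes a full token block.
    crops = grid_rows * grid_cols
    return (crops + 1) * tokens_per_crop


print(anyres_tokens(2, 2))  # 2880
print(anyres_tokens(1, 3))  # 2304
```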

LLaVA-OneVision

Lesson 12.08 covers OneVision in depth. Short version: same projector, but trained with a curriculum that covers single-image, multi-image, and video in one model with shared visual-token budget.

The comparison to Q-Former

Property	Q-Former (BLIP-2)	MLP (LLaVA)
Visual tokens per image	32	576 (base) or 2880 (AnyRes)
Trainable params	188M + LM	~21M + LM
Stage 1 loss	ITC+ITM+ITG	LM only
LLM drop-in	Requires retrain	Swap with minimal retrain
Multi-image	Awkward	Natural (concat)
Video	Awkward	Natural (per-frame concat)
Token budget	Small	Large

MLP wins on simplicity and token flexibility. Q-Former wins on token budget. By late 2023 the token budget was no longer the binding constraint (LLM contexts grew to 32k-128k+) and simplicity dominated.

The prompt format

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image> Describe this image in detail. ASSISTANT: The image shows ...

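<image> is a placeholder token. The text around it is tokenized and embedded as usual, and the placeholder's position in the embedded sequence is then filled with the 576 projected visual tokens (or 2880 with AnyRes). The visual tokens never pass through the tokenizer: they are continuous vectors, not vocabulary entries. The LLM can read them because stage 1 trained the projector to map them into its embedding space, and stage 2 tuned the LLM on the mixed sequence.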
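The prompt format and the placeholder expansion can be sketched as follows. This is a toy: whitespace splitting stands in for a real tokenizer, and the `<vis_i>` strings stand in for projected embedding vectors. The names `build_prompt` and `expand_image_token` are hypothetical.

```python
SYSTEM = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)
IMAGE_TOKEN = "<image>"


def build_prompt(question, system=SYSTEM):
    # Single-turn prompt ending where the assistant's generation begins.
    return f"{system} USER: {IMAGE_TOKEN} {question} ASSISTANT:"


def expand_image_token(tokens, num_visual):
    # Replace the one placeholder with num_visual slots for visual tokens.
    out = []
    for tok in tokens:
        if tok == IMAGE_TOKEN:
            out.extend(f"<vis_{i}>" for i in range(num_visual))
        else:
            out.append(tok)
    return out


prompt = build_prompt("Describe this image in detail.")
tokens = prompt.split()  # toy tokenizer: whitespace split
expanded = expand_image_token(tokens, 576)
print(prompt)
print(len(tokens), "->", len(expanded))
```

The sequence grows by 575 positions: one placeholder is removed and 576 slots are added.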

Parameter economy

LLaVA-1.5-7B breakdown:

CLIP ViT-L/14 @ 336: 303M (frozen stage 1, often unfrozen stage 2).
Projector (2x linear): ~21M trainable.
Vicuna-7B (Llama-2-based): 7B.
Total: 7.3B params. Trainable during stage 2: full 7B + 21M projector.

Training cost for stage 2: ~20 hours on 8xA100. This is the key number — one day, one node, reproducible. That is why LLaVA spread.
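The breakdown above can be reproduced with a few lines of arithmetic. The projector count is exact for two linear layers with bias; the 303M and 7B figures are the rounded values from the list, so the totals are approximate. The helper names are hypothetical.

```python
def linear_params(d_in, d_out, bias=True):
    # Weight matrix plus optional bias vector.
    return d_in * d_out + (d_out if bias else 0)


def mlp_projector_params(d_vit, d_llm):
    # Linear(d_vit -> d_llm), GELU (no parameters), Linear(d_llm -> d_llm).
    return linear_params(d_vit, d_llm) + linear_params(d_llm, d_llm)


VISION_ENCODER = 303_000_000  # rounded, from the breakdown above
LLM = 7_000_000_000           # rounded, from the breakdown above

projector = mlp_projector_params(1024, 4096)
total = VISION_ENCODER + projector + LLM
stage2_trainable = projector + LLM

print(projector)                          # 20979712
print(f"{total / 1e9:.2f}B")              # 7.32B
print(f"{stage2_trainable / 1e9:.2f}B")   # 7.02B
```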

Use It

code/main.py implements:

The 2-layer MLP projector (dim 16 → 32 → 32 for toy scale) in pure Python.
The prompt-building pipeline: system prompt + <image> replaced with N projected tokens + user turn + assistant generation placeholder.
A visualizer for what the 576-token visual block looks like in LLM context (percentage of 2k / 32k / 128k context consumed).

Ship It

This lesson produces outputs/skill-llava-vibes-eval.md. Given a LLaVA-family checkpoint, it runs a 10-prompt vibes-eval suite (3 captioning, 3 VQA, 2 reasoning, 2 refusal) and reports a human-readable scorecard. Not a benchmark; a smoke test to confirm the projector and LLM are connecting well.

Exercises

Compute the trainable-parameter count for the 2-layer MLP projector at 1024 → 4096 → 4096. With GELU and bias, what fraction of LLaVA-13B does it represent?

Construct a LLaVA prompt for a "refusal" case — the image contains a private individual. Write the expected assistant response. Why should LLaVA refuse this zero-shot and what training data would be needed to reinforce the refusal?

Read the AnyRes section of the LLaVA-NeXT blog. Compute the visual token count for a 1344x672 image at AnyRes. Compare to base 576 tokens at 336x336.

The LLaVA stage-1 projector is trained with LM loss on captions. What happens if you skip stage 1 and go straight to stage 2 (visual instruction tuning)? Cite the Prismatic VLMs ablation (arXiv:2402.07865) for the answer.

LLaVA-Instruct-150k uses GPT-4 with COCO captions to generate instructions. For a new domain (medical X-rays, satellite imagery), describe the four-step data pipeline to generate domain instructions. What could go wrong at each step?

Key Terms

Term	What people say	What it actually means
Projector	"MLP bridge"	2-layer MLP with GELU mapping ViT dim to LLM dim
Image token	"<image> placeholder"	Prompt marker replaced by N projected visual tokens before inference
Visual instruction tuning	"LLaVA stage 2"	Training on GPT-4-generated (image, instruction, response) triplets
Stage 1 alignment	"Projector pretraining"	Freeze ViT and LLM, train projector with LM loss on captions
AnyRes	"Multi-crop tiling"	Split high-res image into a tile grid and concatenate each tile's visual tokens
LLaVA-Instruct	"GPT-4-generated"	158k instruction-response pairs synthesized from COCO captions + GPT-4
Vision encoder freeze	"Backbone locked"	CLIP weights do not update in stage 1, sometimes not in stage 2 either
ShareGPT4V	"Better captions"	About 100k dense captions generated by GPT-4V, expanded to roughly 1.2M by a captioner trained on them, used for higher-quality alignment
VQA	"Visual question answering"	Task of answering a free-form question about an image
Prismatic VLMs	"Design-space paper"	Karamcheti 2024 ablation systematically testing projector and data choices