AI Engineering from Scratch

> Vision-native document RAG is one slice. Production multimodal RAG goes wider — retrieving across text, images, audio, and video for workflows like trip planning ("find me a quiet vegan brunch with natural light"), medical triage ("what injury matches this photo + these notes"), e-commerce ("outfits similar to this selfie, in my size"), and field service ("diagnose this engine sound plus photo of the part"). Three 2025 surveys — Abootorabi et al., Mei et al., Zhao et al. — codified the sub-problems: cross-modal retrieval, retrieval fusion, generation grounding, multimodal evaluation. This lesson reads the surveys and designs a production pipeline.

Type: Build

Languages: Python (stdlib, cross-modal retriever with fusion + grounded generator)

Prerequisites: Phase 12 · 23 (ColPali), Phase 11 (RAG basics)

Time: ~180 minutes

Learning Objectives

Design cross-modal retrieval: text → image, image → text, audio → video, etc.
Compare three fusion strategies: score fusion, attention-based fusion, MoE fusion.
Explain generation grounding: what "cite your sources" looks like when sources are a mix of modalities.
Name the three canonical multimodal RAG surveys of 2025 and their sub-problem taxonomy.

The Problem

Single-modality RAG is a solved pattern: embed query, embed chunks, retrieve, stuff into LLM. Multimodal RAG requires:

Multiple retrieval heads (each modality needs embeddings in a compatible space).
Fusion of retrieval results across modalities.
Generation grounding that cites sources across modalities.
Evaluation metrics that cover cross-modal signal.

The 2025 surveys all arrive at the same taxonomy.

The Concept

Cross-modal retrieval

Cross-modal retrieval means retrieving documents of modality B given a query of modality A. Three patterns:

Shared embedding space. CLIP (text + image) and CLAP (text + audio) each embed two modalities into one shared space, so cosine similarity across modalities works directly. The limitation: this only covers the modality pairs the model was trained on.

Per-modality encoder + translation. Text encoder + image encoder + a small translator module mapping between spaces. Sen2Sen by Gupta et al. and other 2024 designs. Flexible but adds complexity.

VLM as encoder. Use a VLM's hidden states as the retrieval representation. Any modality the VLM supports works. Higher quality, more expensive.

Choice: CLIP / SigLIP 2 for text+image; CLAP for text+audio; VLM-hidden-states for cross-modal at frontier quality.
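The shared-embedding-space pattern can be illustrated with a minimal stdlib sketch. The 3-dimensional vectors and file names below are made up for illustration; in a real system they would come from a text encoder and an image encoder trained into the same space (for example a CLIP-style model), which this sketch does not call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical image embeddings (assumption: already in the shared space).
IMAGE_INDEX = {
    "sunny_patio.jpg": [0.9, 0.1, 0.0],
    "dim_bar.jpg": [0.1, 0.9, 0.1],
    "menu_closeup.jpg": [0.0, 0.2, 0.9],
}

def retrieve(query_vec, index, k=2):
    """Return the k (score, name) pairs most similar to the query vector."""
    scored = [(cosine(query_vec, vec), name) for name, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Stand-in for the text embedding of a query such as "natural light".
text_query_vec = [0.8, 0.2, 0.1]
top = retrieve(text_query_vec, IMAGE_INDEX)
for score, name in top:
    print(f"{name}: {score:.3f}")
```

Because both sides live in one space, the text query ranks images with no translation step. The translator and VLM-as-encoder patterns change how the vectors are produced, not how they are compared.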

Fusion strategies

You retrieved 10 results: 5 images, 3 text passages, 2 audio clips. How do you merge?

Score fusion (cheapest). Each modality has its own retriever, and each returns scores on its own scale. Normalize scores within each modality, then take a (weighted) sum. Simple, often works.

Attention-based fusion. Concatenate all retrieved items, let a small attention network weight them. Needs training.

MoE fusion. Gating network routes to modality-specific experts. Different query types route differently — a visual question weights images higher.

Production default: score fusion with a slight bias toward the query's dominant modality. Upgrade to MoE if A/B shows clear wins on your domain.
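A minimal sketch of score fusion with a bias toward the dominant modality, assuming min-max normalization within each modality and hand-set weights. The document ids, raw scores, and the 1.2 image weight are illustrative assumptions, not recommended values.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] within one modality."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(per_modality, weights):
    """Weighted sum of normalized per-modality scores, best first."""
    fused = {}
    for modality, scores in per_modality.items():
        w = weights.get(modality, 1.0)
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw scores on deliberately different scales.
results = {
    "text": {"cafe_a": 12.0, "cafe_b": 9.0, "cafe_c": 3.0},
    "image": {"cafe_a": 0.2, "cafe_b": 0.35, "cafe_c": 0.25},
    "audio": {"cafe_b": 0.5, "cafe_c": 0.9},
}
# Assumed: the query is mostly visual, so images get a slight boost.
weights = {"text": 1.0, "image": 1.2, "audio": 1.0}

ranking = fuse(results, weights)
for doc, score in ranking:
    print(f"{doc}: {score:.2f}")
```

Note that min-max normalization forces the lowest-scoring item in each modality to 0 regardless of how relevant it actually is. Rank-based schemes avoid that at the cost of discarding score magnitudes.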

Generation grounding

The LLM should cite which retrieved item drove each claim. With multimodal sources, the citation format depends on the modality:

Text source: standard citation [1].
Image source: [img 3] with a short caption.
Audio: [audio 2 at 0:34].

Train the generator with grounding-aware data: each claim in the training target is tagged with the source index. A model trained this way tends to emit citations in the same format at inference, without extra prompting.
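The citation formats listed above can be sketched as a small renderer. The source dictionaries and their field names (`modality`, `index`, `offset_s`) are hypothetical, chosen only for this example; the image caption is left out to keep the sketch short.

```python
def cite(source):
    """Render one retrieved source as an inline citation marker."""
    kind, idx = source["modality"], source["index"]
    if kind == "text":
        return f"[{idx}]"
    if kind == "image":
        return f"[img {idx}]"
    if kind == "audio":
        # Convert an offset in seconds to m:ss for the timestamp.
        minutes, seconds = divmod(int(source["offset_s"]), 60)
        return f"[audio {idx} at {minutes}:{seconds:02d}]"
    raise ValueError(f"unknown modality: {kind}")

def render(claims):
    """Join (claim_text, source) pairs into one cited answer string."""
    return " ".join(f"{text} {cite(src)}" for text, src in claims)

claims = [
    ("Reviews describe the room as quiet.", {"modality": "text", "index": 1}),
    ("The dining area has large windows.", {"modality": "image", "index": 3}),
    ("Background noise is low.",
     {"modality": "audio", "index": 2, "offset_s": 34}),
]
answer = render(claims)
print(answer)
```

The same (claim, source index) pairing is the shape grounding-aware training targets take: one tagged source per claim.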

The 2025 surveys

Abootorabi et al. (arXiv:2502.08826, "Ask in Any Modality"): taxonomy for multimodal RAG. Covers retrieval, fusion, generation. Broadest coverage.

Mei et al. (arXiv:2504.08748, "A Survey of Multimodal RAG"): focuses on sub-task benchmarks and failure modes. Useful for evaluation design.

Zhao et al. (arXiv:2503.18016): vision-focused survey. Strong on ColPali-family work.

Reading all three gives you the state of the art as of spring 2025. Most of the sub-problems are still open.

MuRAG — the foundational paper

MuRAG (Chen et al., 2022) was the first multimodal RAG system. It retrieved images and text from a multimodal knowledge base and generated answers from them, showing the approach was feasible before the VLM wave. Modern systems (REACT, VisRAG, M3DocRAG) build on it.

A production trip-planner example

Query: "find me a quiet vegan brunch with natural light."

Pipeline:

1. Decompose query. "quiet" → audio/review keyword; "vegan brunch" → menu item; "natural light" → image feature.
2. Retrieve per modality:
   - Text retrieval on reviews: "vegan brunch, quiet ambiance."
   - Image retrieval on restaurant photos: "natural light, airy."
   - Audio retrieval on ambient-sound clips: "low decibel, no music."
3. Fuse scores. Each restaurant has a composite score.
4. Top-k restaurants → VLM generator with all evidence → answer with citations.
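The decompose, retrieve, and fuse steps of this pipeline can be sketched with tag sets standing in for real embeddings. The corpus, the tags, and the already-decomposed sub-queries are all invented for illustration; a real decomposition step would be an LLM call, and the generator step is omitted.

```python
# Hypothetical corpus: per-restaurant evidence in three modalities.
CORPUS = {
    "green_table": {"text": {"vegan", "brunch", "quiet"},
                    "image": {"natural light", "airy"},
                    "audio": {"low decibel", "no music"}},
    "loud_grill": {"text": {"brunch", "burgers"},
                   "image": {"neon", "dim"},
                   "audio": {"live music"}},
    "sun_cafe": {"text": {"vegan", "brunch"},
                 "image": {"natural light"},
                 "audio": {"low decibel", "live music"}},
}

# Assumed output of the decomposition step for the example query.
SUB_QUERIES = {
    "text": {"vegan", "brunch", "quiet"},
    "image": {"natural light", "airy"},
    "audio": {"low decibel", "no music"},
}

def overlap(query_tags, doc_tags):
    """Fraction of query tags present in the document's tags (0 to 1)."""
    return len(query_tags & doc_tags) / len(query_tags)

def composite_scores(corpus, sub_queries):
    """Score each restaurant per modality, then average into a composite."""
    out = {}
    for name, evidence in corpus.items():
        per_mod = {m: overlap(q, evidence[m]) for m, q in sub_queries.items()}
        out[name] = (sum(per_mod.values()) / len(per_mod), per_mod)
    return out

scores = composite_scores(CORPUS, SUB_QUERIES)
top_k = sorted(scores, key=lambda n: scores[n][0], reverse=True)[:2]
for name in top_k:
    print(name, round(scores[name][0], 3), scores[name][1])
```

Keeping the per-modality breakdown alongside the composite matters for the last step: the generator needs to know which modality supported which part of the answer in order to cite it.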

This is well beyond text-RAG. Each modality adds signal that text alone misses.

Agentic multimodal RAG

Multi-hop: if the first retrieval does not return high-confidence answers, the LLM reformulates and retrieves again. Agentic RAG patterns from Phase 14 apply here. Examples:

Retrieve initial top-10 → LLM asks "too noisy, filter for <40 dB" → re-retrieve.
Retrieve images → LLM sees one has a menu → retrieve the menu text → answer.

Adds complexity but handles queries that single-shot retrieval cannot.
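A minimal sketch of the reformulate-and-retry loop, assuming a mock retriever whose confidence is the fraction of query terms matched and a synonym table standing in for the LLM rewrite step. The threshold and hop limit are arbitrary illustrative values.

```python
def retrieve(query, corpus):
    """Mock retriever: return the best doc and its match fraction."""
    best_doc, best_conf = None, 0.0
    for doc, tags in corpus.items():
        conf = len(query & tags) / len(query)
        if conf > best_conf:
            best_doc, best_conf = doc, conf
    return best_doc, best_conf

def reformulate(query, synonyms):
    """Stand-in for the LLM rewrite: swap terms for known synonyms."""
    return {synonyms.get(term, term) for term in query}

def agentic_retrieve(query, corpus, synonyms, threshold=0.8, max_hops=3):
    """Retrieve, then re-query with a rewritten query while confidence is low."""
    trace = []
    doc, conf = None, 0.0
    for hop in range(max_hops):
        doc, conf = retrieve(query, corpus)
        trace.append((hop, sorted(query), doc, conf))
        if conf >= threshold:
            break
        new_query = reformulate(query, synonyms)
        if new_query == query:
            break  # the rewrite changed nothing, so stop early
        query = new_query
    return doc, conf, trace

corpus = {"green_table": {"vegan", "brunch", "quiet"},
          "loud_grill": {"brunch", "burgers"}}
synonyms = {"plant-based": "vegan", "calm": "quiet"}

doc, conf, trace = agentic_retrieve(
    {"plant-based", "brunch", "calm"}, corpus, synonyms)
for step in trace:
    print(step)
```

The hop limit and the early exit on an unchanged query bound the latency cost: each extra hop is one more retrieval plus one more rewrite call.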

Evaluation

Cross-modal evaluation is still immature. Common proxies:

Recall@k per modality.
Fused top-k accuracy.
Human-judged end-to-end satisfaction.
Task-specific (bookings completed, purchases made).

No standard benchmark spans all modalities. Most papers evaluate on domain-specific tasks.
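Recall@k per modality, the first proxy above, is simple to compute once relevance labels exist. The ranked ids and labels below are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Hypothetical run: ranked ids per modality and hand-labeled relevant ids.
retrieved = {
    "text": ["t1", "t7", "t3", "t9"],
    "image": ["i4", "i2", "i8", "i1"],
    "audio": ["a5", "a6", "a2", "a3"],
}
relevant = {
    "text": {"t1", "t3"},
    "image": {"i1", "i2"},
    "audio": {"a9"},
}

report = {m: recall_at_k(retrieved[m], relevant[m], k=3) for m in retrieved}
print(report)
```

Reporting the metric per modality rather than pooled makes a weak retriever visible: here the audio retriever misses its only relevant clip, which a pooled average would partly hide.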

Use It

code/main.py:

Three mock retrievers (text, image, audio) operating on a shared corpus of restaurants.
Score fusion that combines modality scores with configurable weights.
A generator stub that emits a final answer with citations.
A simple agentic loop that reformulates the query if confidence is low.

Ship It

This lesson produces outputs/skill-multimodal-rag-designer.md. Given a product spec with a multimodal query flow, the skill designs the retrievers, fusion, generator, and evaluation.

Exercises

Propose a medical-triage multimodal RAG: query = photo of injury + text symptoms. What modalities retrieve from what KB?

Score fusion is a simple weighted sum. What failure mode does it have that MoE fusion avoids?

Read Abootorabi et al.'s taxonomy (Section 3). What are the three canonical sub-problems and how do they map to your chosen product?

Design an eval spec for a trip-planner multimodal RAG. What metrics cover image recall, audio recall, and composite correctness?

Agentic multi-hop RAG has a latency tax per round-trip. At what query difficulty does the accuracy gain justify the latency?

Key Terms

Term	What people say	What it actually means
Cross-modal retrieval	"Query one modality, retrieve another"	Text query retrieves images; image query retrieves text; requires a shared space or translator
Score fusion	"Combine scores"	Weighted sum of per-modality retrieval scores; simplest fusion
MoE fusion	"Modality-routed experts"	Gating network picks which modality's scores to trust per query
Grounded generation	"Cite your sources"	Each claim in the answer tagged with the source index
MuRAG	"First multimodal RAG"	2022 paper that established the multimodal RAG pattern
Agentic multi-hop	"Reformulate and retry"	LLM re-queries retrievers when first-pass confidence is low

Multimodal RAG and Cross-Modal Retrieval