Capstone 12 — Video Understanding Pipeline (Scene, QA, Search)
> Twelve Labs productized Marengo + Pegasus. VideoDB shipped the CRUD-for-video API. AI2's Molmo 2 published open VLM checkpoints. Gemini long-context handles hours of video natively. TimeLens-100K defined temporal grounding at scale. The 2026 pipeline is settled: scene segmentation, per-scene caption + embedding, transcript alignment, multi-vector index, and a query that answers with (start, end) timestamps plus frame previews. The capstone is ingesting 100 hours, hitting public benchmarks, and measuring hallucination on counting and action questions.
Type: Capstone
Languages: Python (pipeline), TypeScript (UI)
Prerequisites: Phase 4 (CV), Phase 6 (speech), Phase 7 (transformers), Phase 11 (LLM engineering), Phase 12 (multimodal), Phase 17 (infrastructure)
Phases exercised: P4 · P6 · P7 · P11 · P12 · P17
Time: 30 hours
Problem
Long-form video QA is the most bandwidth-hungry multimodal problem at 2026 scale. Gemini 2.5 Pro can read a 2-hour video natively, but ingesting 100 hours of video into a queryable corpus still requires a scene-level index. The production shape combines scene segmentation (TransNetV2 or PySceneDetect), per-scene captioning with a VLM (Gemini 2.5, Qwen3-VL-Max, or Molmo 2), transcript alignment (Whisper-v3-turbo with word timestamps), and a multi-vector index that stores caption, frame embedding, and transcript side by side. The query pipeline answers with (start, end) timestamps plus frame previews.
Benchmarks are public (ActivityNet-QA, NeXT-GQA) plus your own 100-query custom set. Hallucination on counting and action-type questions is the known-hard failure class; the capstone explicitly measures it.
Concept
Three pipelines run in parallel at ingest. Scene segmentation cuts the video into scenes. VLM captioning generates a caption per scene and a frame embedding from a keyframe. ASR alignment produces word-level timestamps. The three streams are joined by (scene_id, time range). Each scene gets three vector types in a multi-vector index (Qdrant): caption embedding, keyframe embedding, transcript embedding.
At query time, the natural-language question fires against all three vectors; results merge with RRF; a temporal-grounding adapter (TimeLens-style) refines the (start, end) window within the top scene. The VLM synthesizer (Gemini 2.5 Pro or Qwen3-VL-Max) takes query + top scenes + cropped frames and answers with cited timestamps and a frame preview.
The hallucination measurement matters. Counting ("how many people enter the room?") and action-type ("does the chef pour before stirring?") questions are notoriously unreliable. Report accuracy separately from descriptive questions.
Architecture
video file / URL
|
v
PySceneDetect / TransNetV2 (scene segmentation)
|
+--- per-scene keyframe --- VLM caption + frame embedding
| (Gemini 2.5 Pro / Qwen3-VL-Max / Molmo 2)
|
+--- audio channel --- Whisper-v3-turbo ASR + word timestamps
|
v
multi-vector Qdrant: {caption_emb, keyframe_emb, transcript_emb}
|
query:
dense queries against all three -> RRF merge -> top-k scenes
|
v
TimeLens / VideoITG temporal grounding (refine start/end within scene)
|
v
VLM synth: query + top scenes + frame previews
|
v
answer + (start, end) timestamps + frame thumbs + citations
Stack
- Scene segmentation: TransNetV2 (state-of-the-art 2024-26) or PySceneDetect
- ASR: Whisper-v3-turbo via faster-whisper with word timestamps
- VLM captioner + answerer: Gemini 2.5 Pro or Qwen3-VL-Max or Molmo 2
- Temporal grounding: TimeLens-100K-trained adapter or VideoITG
- Index: Qdrant with multi-vector support (caption / frame / transcript)
- UI: Next.js 15 with HTML5 video player and scene thumbnails
- Eval: ActivityNet-QA, NeXT-GQA, custom 100-question hand-labeled set
- Hallucination benchmark: counting and action-type subsets with hand labels
Build It
- Ingest walker. Accept YouTube URLs or local MP4s. Downscale to 720p if needed. Persist
{video_id, file_path}.
- Scene segmentation. Run TransNetV2 or PySceneDetect to produce
[{scene_id, start_ms, end_ms, keyframe_path}]. Target 100 hours: ~6k-8k scenes.
- ASR pass. Run Whisper-v3-turbo on audio; export word-level timestamps; split into per-scene transcript slices.
- VLM captioning. Per scene, call Gemini 2.5 Pro (or Qwen3-VL-Max) with the keyframe and a short caption template. Produce caption + frame embedding.
- Multi-vector index. Qdrant collection with three named vectors. Payload:
{video_id, scene_id, start_ms, end_ms, keyframe_url}.
- Query. Natural-language question fires three dense queries; merge with reciprocal rank fusion; top-k=5 scenes.
- Temporal grounding. Run TimeLens-style adapter on the top scene to refine the (start, end) window within the scene.
- VLM synth. Call Gemini 2.5 Pro with query + top-3 scene clips (as images or short clips) + transcripts. Require
(video_id, start_ms, end_ms)citations.
- Eval. Run ActivityNet-QA and NeXT-GQA. Build a 100-query custom set. Report overall accuracy + per-class breakdown (counting, action, descriptive).
Use It
$ video-qa ask --url=https://youtube.com/watch?v=X "how many cars pass the intersection in the first minute?"
[scene] 23 scenes detected
[asr] transcript complete, 4m12s
[index] 69 vectors written (23 scenes x 3)
[query] top scene: scene 3 [01:32-01:54], confidence 0.84
[ground] refined window: [00:12-00:58]
[synth] gemini 2.5 pro, 1.4s
answer: 5 cars pass the intersection between 00:12 and 00:58.
citations: [scene 3: 00:12-00:58]
[frame preview at 00:14, 00:27, 00:44, 00:51, 00:57]
Ship It
outputs/skill-video-qa.md is the deliverable. Given a YouTube URL or uploaded video, the pipeline indexes scenes and answers questions with timestamped citations.
| Weight | Criterion | How it is measured |
|---|---|---|
| 25 | Temporal grounding IoU | Intersection-over-union on held-out grounding set |
| 20 | QA accuracy | NeXT-GQA and custom 100-query |
| 20 | Ingest throughput | Hours of video per dollar spent |
| 20 | UI and citation UX | Timestamp links, thumbnail strip, jump-to-frame |
| 15 | Hallucination rate | Counting and action-type accuracy separately |
| 100 |
Exercises
- Swap Gemini 2.5 Pro for Qwen3-VL-Max on the captioning pass. Report caption quality delta on a human-rated 50-scene sample.
- Reduce per-scene frame embedding to one pooled vector instead of multi-vector. Measure the retrieval regression.
- Build a "counting strict" mode: the synthesizer extracts each counted instance with a timestamp and the user clicks to verify. Measure whether user-verification reduces hallucination.
- Benchmark ingest cost: hours-of-video-per-dollar across three VLM choices. Pick the sweet spot.
- Add speaker-diarized transcript: run pyannote speaker diarization on the audio and embed per-speaker transcripts. Demonstrate "what did Alice say about X?" queries.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Scene segmentation | "Shot detection" | Cutting video into scenes at shot boundaries |
| Multi-vector index | "Caption + frame + transcript" | Qdrant collection with named vectors per representation |
| Temporal grounding | "When exactly did it happen" | Refining the (start, end) window for a query answer |
| Frame embedding | "Visual representation" | A vector embedding of a keyframe; used for scene-visual similarity |
| RRF fusion | "Reciprocal rank fusion" | Merge strategy across multiple ranked lists; a classic hybrid-retrieval trick |
| Counting hallucination | "Miscount" | Known failure mode of VLMs on "how many X" questions |
| ActivityNet-QA | "Video-QA benchmark" | Long-form video QA accuracy benchmark |
Further Reading
- AI2 Molmo 2 — open VLM checkpoints
- TimeLens (CVPR 2026) — temporal grounding at scale
- Gemini Video long-context — the hosted reference
- VideoDB — CRUD-for-video API reference
- Twelve Labs Marengo + Pegasus — commercial reference
- TransNetV2 — scene segmentation model
- PySceneDetect — classic open alternative
- ActivityNet-QA — reference eval benchmark