Document and Diagram Understanding

> Documents are not photos. A PDF, scientific paper, invoice, or handwritten form has layout, tables, diagrams, footnotes, headers, and semantic structure that plain image understanding cannot capture. The pre-VLM stack was a pipeline: Tesseract OCR + LayoutLMv3 + table-extraction heuristics. The next wave introduced OCR-free models such as Donut (2022) and Nougat (2023), which emit structured markup directly from pixels. By 2026 the frontier is just "feed the page image to Claude Opus 4.7 at 2576px native," and the structured-markup output comes for free. This lesson traces the three-era arc of document AI.

Type: Build

Languages: Python (stdlib, layout-aware document parser skeleton)

Prerequisites: Phase 12 · 05 (LLaVA), Phase 5 (NLP)

Time: ~180 minutes

Learning Objectives

The Problem

"Understand this PDF" is deceptively hard. The information sits in:

Raw OCR dumps the text and loses the rest. A system that cares about invoices needs to know "Total: $1,245" came from the bottom-right, not from a footnote.
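The invoice example can be made concrete with a minimal sketch. It assumes OCR has already produced words with page-relative positions; the `Word` class, the `pick_total` helper, and the "closest to the bottom-right corner" rule are illustrative assumptions, not part of any particular OCR library.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge as a fraction of page width (0..1)
    y: float  # top edge as a fraction of page height (0..1)


def pick_total(words):
    """Return the amount-like word closest to the bottom-right corner, or None."""
    amount = re.compile(r"^\$[\d,]+(\.\d{2})?$")
    candidates = [w for w in words if amount.match(w.text)]
    if not candidates:
        return None
    # Squared distance to the bottom-right corner (1.0, 1.0); smaller is better.
    return min(candidates, key=lambda w: (1.0 - w.x) ** 2 + (1.0 - w.y) ** 2)


words = [
    Word("$1,245", x=0.82, y=0.91),  # totals box, bottom right
    Word("$50", x=0.10, y=0.97),     # footnote about a late fee, bottom left
]
print(pick_total(words).text)
```

A plain text dump sees two dollar amounts and has no way to rank them; the position is what separates the total from the footnote.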

The Concept

Era 1 — OCR pipeline (pre-2021)

The classic stack:

  1. PDF → image per page.
  2. Tesseract (or commercial OCR) extracts text with per-word bounding boxes.
  3. Layout analyzer identifies blocks (header, table, paragraph).
  4. Table structure recognizer parses tables.
  5. Domain rules + regex extract fields.

Works for clean printed text. Breaks on handwriting, skewed scans, complex tables, non-English scripts. Every failure mode requires a custom exception path.
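Steps 3 to 5 of the classic stack can be sketched in a few lines of stdlib Python. This is a toy under stated assumptions: the OCR step is replaced by a hand-written list of `(text, x, y)` tuples, and `group_into_lines`, `extract_fields`, `row_height`, and both regex patterns are hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict


def group_into_lines(words, row_height=10):
    """Bucket (text, x, y) word tuples into text lines by their y coordinate."""
    rows = defaultdict(list)
    for text, x, y in words:
        rows[round(y / row_height)].append((x, text))
    # Rows top to bottom, words left to right within each row.
    return [" ".join(t for _, t in sorted(row)) for _, row in sorted(rows.items())]


def extract_fields(lines):
    """Apply domain regexes to each line; the first match per field wins."""
    patterns = {
        "invoice_no": re.compile(r"Invoice\s*#\s*(\w+)"),
        "total": re.compile(r"Total:\s*\$([\d,]+(?:\.\d{2})?)"),
    }
    fields = {}
    for line in lines:
        for name, pattern in patterns.items():
            match = pattern.search(line)
            if match and name not in fields:
                fields[name] = match.group(1)
    return fields


# Stand-in for OCR output: word text plus pixel position of its top-left corner.
ocr_words = [
    ("Invoice", 40, 52), ("#", 110, 50), ("A1734", 125, 51),
    ("Total:", 400, 702), ("$1,245", 460, 700),
]
lines = group_into_lines(ocr_words)
fields = extract_fields(lines)
print(lines)
print(fields)
```

The fragility is visible in the bucketing step: a skewed scan that shifts one word's `y` by more than `row_height` splits "Total:" from its amount, and the regex silently stops matching.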

TrOCR (2021)

TrOCR (Li et al., arXiv:2109.10282) replaced the classic CNN + RNN text recognizer (Tesseract's own engine is LSTM-based) with a transformer encoder-decoder pre-trained on large-scale synthetic text images and fine-tuned on human-labeled data. Clean win on printed, handwritten, and scene text. Still a pipeline (detector, then TrOCR, then layout), but the OCR step improved dramatically.

Era 2 — OCR-free (2022-2023)

The first OCR-free models said: skip detection entirely, map image pixels to structured output directly.

Donut (Kim et al., arXiv:2111.15664):

Nougat (Blecher et al., arXiv:2308.13418):

These are specialists, not generalists. Donut on a scientific paper fails; Nougat on an invoice fails.

LayoutLMv3 (2022)

A different track. LayoutLMv3 (Huang et al., arXiv:2204.08387) keeps OCR but adds layout understanding:

LayoutLMv3 is the peak of OCR-based document understanding. Strong on forms and invoices. Requires OCR upstream. Best pre-VLM accuracy on standardized document benchmarks.
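To illustrate what "OCR plus layout" means at the input level, here is a small sketch of the bounding-box normalization that LayoutLM-family models conventionally expect: pixel boxes are rescaled to an integer 0 to 1000 grid so that pages of different sizes share one coordinate space. The function name, the page size, and the sample boxes are assumptions made for the example.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0..1000 integer grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output for a 1700 x 2200 px page: (word, pixel bbox).
page_w, page_h = 1700, 2200
ocr = [
    ("Total:", (1360, 1980, 1450, 2010)),
    ("$1,245", (1460, 1980, 1560, 2010)),
]

# Each token carries its text and its normalized box into the model.
model_input = [
    {"word": word, "bbox": normalize_bbox(box, page_w, page_h)}
    for word, box in ocr
]
for item in model_input:
    print(item)
```

Both words land at roughly y = 900 of 1000 and x ≥ 800 of 1000, which is the "bottom-right" signal a text-only model never sees.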

DocLLM (2023)

DocLLM (Wang et al., arXiv:2401.00908) is LayoutLM's generative sibling. Generates free-form answers conditioned on layout tokens. Better for QA on documents; still depends on OCR input.

Era 3 — VLM-native (2024+)

By 2024, VLMs had become good enough to replace the pipeline entirely. Feed the full page image at high resolution to a VLM, ask the question, get an answer.

The gap between VLM-native and OCR-pipeline closed rapidly. By 2026, VLM-native wins on:

OCR pipelines still win on:

The Claude 4.7 / GPT-5 frontier

At 2576-pixel native input, frontier VLMs do document understanding at near-human accuracy. The benchmark numbers from early 2026:
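A quick back-of-the-envelope check shows what a 2576-pixel input buys for a scanned page. The sketch assumes the figure is a cap on the longer side of the image, which the text does not state explicitly; `fit_long_edge` and `effective_dpi` are hypothetical helper names.

```python
def fit_long_edge(width_px, height_px, max_edge=2576):
    """Downscale so the longer side is at most max_edge, keeping aspect ratio."""
    scale = min(1.0, max_edge / max(width_px, height_px))
    return round(width_px * scale), round(height_px * scale)


def effective_dpi(page_long_edge_inches, max_edge=2576):
    """Dots per inch that survive when a page's long edge maps to max_edge pixels."""
    return max_edge / page_long_edge_inches


# US Letter (8.5 x 11 in) scanned at 300 DPI is 2550 x 3300 px.
print(fit_long_edge(2550, 3300))
print(round(effective_dpi(11)))
```

Under that assumption a Letter page arrives at roughly 234 DPI, a little below a standard 300 DPI scan, so small print such as footnotes loses some detail but stays well above thumbnail resolution.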

The closed-model gap is mostly resolution and base-LLM scale. Open models at 7B are a few points behind but catching up.

Math equations and LaTeX output

Scientific papers need exact LaTeX output for equations. Nougat was trained on this. VLMs trained with LaTeX targets (Qwen2.5-VL-Math, Nougat derivatives) produce usable LaTeX. Without explicit LaTeX training, VLMs produce readable but imprecise transcriptions.

For scientific-paper pipelines in 2026: chain Nougat on the PDF, then a VLM on tricky pages.
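The chaining idea can be sketched as a router. Everything named here is hypothetical: `run_specialist` and `run_vlm` stand in for real inference calls, and `looks_tricky` is one possible heuristic (empty output or unbalanced delimiters), not a rule taken from Nougat or any VLM.

```python
def looks_tricky(markup):
    """Hypothetical heuristic: flag empty output or unbalanced math delimiters."""
    if not markup.strip():
        return True
    if markup.count("{") != markup.count("}"):
        return True
    return markup.count("$") % 2 == 1


def convert_paper(pages, run_specialist, run_vlm):
    """Run the specialist on every page; re-run flagged pages through the VLM."""
    results = []
    for page in pages:
        markup = run_specialist(page)
        source = "specialist"
        if looks_tricky(markup):
            markup = run_vlm(page)
            source = "vlm"
        results.append((source, markup))
    return results


# Stub models standing in for real inference calls.
def fake_specialist(page):
    canned = {"p1": r"Energy is $E = mc^2$.", "p2": r"\frac{a}{b"}
    return canned[page]


def fake_vlm(page):
    return r"\frac{a}{b}"


out = convert_paper(["p1", "p2"], fake_specialist, fake_vlm)
for source, markup in out:
    print(source, markup)
```

The design choice is cost: the cheaper specialist handles every page, and the more expensive model is paid for only on pages whose output fails a cheap sanity check.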

Handwriting

Still the hardest sub-task. Mixed printed + handwritten documents (doctors' notes, filled forms) are where OCR pipelines still beat VLMs on cost. VLM accuracy on handwriting-only pages is improving (Claude 4.7, PaliGemma 2).

2026 recipe

For a new document-AI project:

Use It

code/main.py:

Ship It

This lesson produces outputs/skill-document-ai-stack-picker.md. Given a document-AI project (domain, scale, quality bar, regulatory constraints), the skill picks among an OCR pipeline, an OCR-free specialist, and a VLM-native stack.
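As a rough illustration of the kind of decision the skill encodes, the sketch below turns a few statements from this lesson into rules. The function name, parameters, rule order, and return strings are all assumptions made for the example; the actual skill file may weigh the inputs differently.

```python
def pick_stack(domain, has_handwriting=False, regulated=False):
    """Toy stack picker; rules are illustrative, pieced together from this lesson."""
    if regulated:
        # Deterministic primary path with a second opinion (see Exercise 5).
        return "ocr-pipeline + vlm cross-check"
    if has_handwriting:
        # Mixed printed + handwritten is where OCR pipelines still win on cost.
        return "ocr-pipeline"
    if domain == "scientific-paper":
        # Specialist for LaTeX-heavy pages, VLM for the pages it struggles with.
        return "ocr-free specialist (Nougat) + vlm on tricky pages"
    # Default for a new project: send the page image straight to a VLM.
    return "vlm-native"


print(pick_stack("invoice"))
print(pick_stack("scientific-paper"))
print(pick_stack("medical-form", has_handwriting=True))
print(pick_stack("bank-statement", regulated=True))
```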

Exercises

  1. Your project is 10M invoices per day. Which stack minimizes cost-per-page without losing accuracy?
  2. Why does LayoutLMv3 outperform pure CLIP-based VLMs on form QA but underperform on scene text? What does the bbox stream give up?
  3. Nougat generates LaTeX. Propose a test case where VLM-native output beats Nougat on LaTeX fidelity, and a case where Nougat wins.
  4. Read the PaliGemma 2 paper (Google, 2024). What was the key training-data addition that lifted document accuracy vs PaliGemma 1?
  5. Design a regulatory-safe hybrid: OCR pipeline as primary, VLM as secondary cross-check. How do you resolve disagreement?

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| OCR pipeline | "Tesseract-style" | Stage-wise stack: detect -> OCR -> layout -> rules; deterministic, fragile |
| OCR-free | "Donut-style" | Image-to-output transformer that skips explicit OCR; single model |
| Layout-aware | "LayoutLM" | Input includes per-token bbox coordinates; unified masking across modalities |
| VLM-native | "Frontier VLM" | Feed page image directly to Claude/GPT/Qwen VLM at high resolution; no pipeline |
| DocVQA | "Doc benchmark" | Document VQA standard; most-cited score |
| Markup output | "LaTeX / MD" | Structured output format instead of free-form text; enables downstream automation |

Further Reading