
Document and Diagram Understanding

> Documents are not photos. A PDF, scientific paper, invoice, or handwritten form has layout, tables, diagrams, footnotes, headers, and semantic structure that plain image understanding cannot capture. The pre-VLM stack was a pipeline: Tesseract OCR + LayoutLMv3 + table-extraction heuristics. The next wave brought OCR-free models such as Donut (2022) and Nougat (2023), which emit structured markup directly, alongside DocLLM (2023), a layout-aware generative model that still consumes OCR text. By 2026 the frontier is just "feed the page image to Claude Opus 4.7 at 2576px native," and the structured-markup output comes for free. This lesson traces the three-era arc of document AI.

Type: Build

Languages: Python (stdlib, layout-aware document parser skeleton)

Prerequisites: Phase 12 · 05 (LLaVA), Phase 5 (NLP)

Time: ~180 minutes

Learning Objectives

Explain the three eras of document AI: OCR pipeline, OCR-free, VLM-native.
Describe LayoutLMv3's three input streams: text, layout (bbox), image patches, with unified masking.
Compare Donut (OCR-free, image → markup), Nougat (scientific paper → LaTeX), DocLLM (layout-aware generative), PaliGemma 2 (VLM-native).
Pick a document model for a new task (invoices, scientific papers, handwritten forms, Chinese receipts).

The Problem

"Understand this PDF" is deceptively hard. The information sits in:

Text content (90% of the signal).
Layout (headers, footnotes, sidebars, two-column format).
Tables (rows, columns, merged cells).
Figures and diagrams.
Handwritten annotations.
Fonts and typography (title vs body).

Raw OCR dumps the text and loses the rest. A system that cares about invoices needs to know "Total: $1,245" came from the bottom-right, not from a footnote.

The Concept

Era 1 — OCR pipeline (pre-2021)

The classic stack:

PDF → image per page.
Tesseract (or commercial OCR) extracts text with per-word bounding boxes.
Layout analyzer identifies blocks (header, table, paragraph).
Table structure recognizer parses tables.
Domain rules + regex extract fields.

Works for clean printed text. Breaks on handwriting, skewed scans, complex tables, non-English scripts. Every failure mode requires a custom exception path.
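The last stage of this stack (domain rules + regex over positioned words) can be sketched in a few lines. This is a minimal illustration, not a production extractor: the `Word` tuple, the `find_total` helper, the lower-half-of-page rule, and the sample coordinates are all hypothetical, and the sketch assumes a top-left origin with y increasing downward.

```python
import re
from typing import List, NamedTuple, Optional

class Word(NamedTuple):
    """One OCR word with its pixel bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

# Matches amounts such as 1,245 or $1,245.00
AMOUNT = re.compile(r"^\$?\d{1,3}(,\d{3})*(\.\d{2})?$")

def find_total(words: List[Word], page_height: float, y_tol: float = 5.0) -> Optional[str]:
    """Return the amount right of a 'Total' label located in the lower half of the page."""
    for label in words:
        if label.text.rstrip(":").lower() != "total":
            continue
        if label.y0 < page_height / 2:
            continue  # ignore a 'Total' that appears in the upper half
        same_line = [
            w for w in words
            if abs(w.y0 - label.y0) <= y_tol   # roughly the same text line
            and w.x0 > label.x1                # to the right of the label
            and AMOUNT.match(w.text)
        ]
        if same_line:
            return min(same_line, key=lambda w: w.x0).text  # nearest amount
    return None

words = [
    Word("Total", 80, 120, 120, 132),     # header mention, upper half
    Word("due", 125, 120, 150, 132),
    Word("Total:", 400, 700, 450, 712),   # the real field label
    Word("$1,245", 460, 701, 520, 713),
    Word("$9.99", 460, 760, 500, 772),    # footnote amount on another line
]
print(find_total(words, page_height=800))  # $1,245
```

Every hard-coded threshold here (`y_tol`, the half-page cut) is the kind of custom exception path the paragraph above describes: it works until a skewed scan or a new template shifts the coordinates.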

TrOCR (2021)

TrOCR (Li et al., arXiv:2109.10282) replaced the classic CNN/RNN + CTC text recognizer with a transformer encoder-decoder, pre-trained on large-scale synthetic text images and fine-tuned on human-labeled data. Clean win on printed, handwritten, and scene text. Still a pipeline (detector, then TrOCR, then layout), but the OCR step improved dramatically.

Era 2 — OCR-free (2022-2023)

The first OCR-free models said: skip detection entirely, map image pixels to structured output directly.

Donut (Kim et al., arXiv:2111.15664):

Encoder-decoder transformer, encoder is Swin-B.
Output is JSON for form understanding, markdown for summarization, or any task-specific schema.
No OCR, no layout, no detection.
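To make "output is JSON" concrete: the decoder emits a flat token string, and a small post-processing step turns it into a nested structure. The sketch below assumes a `<s_field>...</s_field>` tag convention like the one used in the Donut reference code; the `tokens_to_json` helper and the sample string are hypothetical, and this simplified parser does not handle list separators or a field nested inside a field of the same name.

```python
import json
import re
from typing import Union

TAG = re.compile(r"<s_(\w+)>")

def tokens_to_json(seq: str) -> Union[dict, str]:
    """Convert '<s_key>value</s_key>' spans into a nested dict; plain text is returned stripped."""
    out = {}
    pos = 0
    while True:
        m = TAG.search(seq, pos)
        if m is None:
            break
        key = m.group(1)
        close = f"</s_{key}>"
        end = seq.find(close, m.end())
        if end == -1:
            break  # unclosed field: stop parsing rather than guess
        out[key] = tokens_to_json(seq[m.end():end])  # recurse into the field body
        pos = end + len(close)
    return out if out else seq.strip()

# Hypothetical decoder output for an invoice page
decoded = (
    "<s_vendor>ACME Corp</s_vendor>"
    "<s_total><s_amount>1,245.00</s_amount><s_currency>USD</s_currency></s_total>"
)
print(json.dumps(tokens_to_json(decoded), indent=2))
```

The schema lives entirely in the tag vocabulary, which is why a new task needs fine-tuning on its own field names rather than a new pipeline stage.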

Nougat (Blecher et al., arXiv:2308.13418):

Trained specifically on scientific papers.
Output is LaTeX / markdown.
Handles equations, multi-column layout, figures.
A common default for arXiv-parsing tools.

These are specialists, not generalists. Donut on a scientific paper fails; Nougat on an invoice fails.

LayoutLMv3 (2022)

A different track. LayoutLMv3 (Huang et al., arXiv:2204.08387) keeps OCR but adds layout understanding:

Three input streams: OCR text tokens, per-token 2D bounding boxes, image patches.
Pre-training combines three objectives: masked language modeling (text), masked image modeling (patches), and word-patch alignment, which ties each text token to its image region.
Downstream: classification, entity extraction, table QA.

LayoutLMv3 is the peak of OCR-based document understanding. Strong on forms and invoices. Requires OCR upstream. Best pre-VLM accuracy on standardized document benchmarks.
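The text and layout streams can be sketched as parallel lists: one token per position and one bounding box per token. The LayoutLM family expects boxes normalized to a 0-1000 grid; everything else below is an assumption for illustration, including the `build_layout_input` name, the whitespace "tokenizer" standing in for a real subword tokenizer, and the omission of the image-patch stream.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

def normalize_bbox(bbox: BBox, page_w: float, page_h: float, scale: int = 1000) -> List[int]:
    """Map pixel coordinates to the 0..scale integer grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(scale * x0 / page_w),
        int(scale * y0 / page_h),
        int(scale * x1 / page_w),
        int(scale * y1 / page_h),
    ]

def build_layout_input(words: Sequence[Tuple[str, BBox]], page_w: float, page_h: float) -> Dict[str, list]:
    """Turn (text, bbox) pairs into aligned token and bbox streams."""
    tokens, boxes = [], []
    for text, bbox in words:
        norm = normalize_bbox(bbox, page_w, page_h)
        for piece in text.split():   # toy tokenizer: whitespace only
            tokens.append(piece)
            boxes.append(norm)       # each piece inherits its source box
    return {"tokens": tokens, "bbox": boxes}

ocr_words = [
    ("Total:", (400, 700, 450, 712)),
    ("$1,245", (460, 701, 520, 713)),
]
example = build_layout_input(ocr_words, page_w=800, page_h=1000)
print(example["tokens"])  # ['Total:', '$1,245']
print(example["bbox"])    # [[500, 700, 562, 712], [575, 701, 650, 713]]
```

Normalizing to a fixed grid makes the layout signal independent of scan resolution, so the same form scanned at different DPI produces the same box values.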

DocLLM (2023)

DocLLM (Wang et al., arXiv:2401.00908) is LayoutLM's generative sibling. Generates free-form answers conditioned on layout tokens. Better for QA on documents; still depends on OCR input.

Era 3 — VLM-native (2024+)

In 2024, VLMs became good enough to replace the pipeline entirely. Feed the full page image at high resolution to a VLM, ask the question, get an answer.

LLaVA-NeXT 336-tile AnyRes works for small documents.
Qwen2.5-VL dynamic-resolution handles 2048+ pixels natively.
Claude Opus 4.7 supports 2576px documents.
PaliGemma 2 (December 2024) trains specifically for documents + handwriting.
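A rough token count shows why tile-based AnyRes suits small documents better than full pages. The sketch below assumes 336-pixel tiles, a 14-pixel patch size (so 24 × 24 = 576 tokens per tile), and one extra downscaled overview tile. The `anyres_token_estimate` name is hypothetical, and real implementations choose from a fixed set of grid shapes and may resize the page first, so treat the numbers as an illustration of scaling rather than any model's exact budget.

```python
import math
from typing import Tuple

def anyres_token_estimate(
    width: int,
    height: int,
    tile: int = 336,
    patch: int = 14,
    include_overview: bool = True,
) -> Tuple[int, int]:
    """Return (number of tiles covering the page, estimated visual tokens)."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    tokens_per_tile = (tile // patch) ** 2          # 24 * 24 = 576 by default
    total_tiles = tiles + (1 if include_overview else 0)
    return tiles, total_tiles * tokens_per_tile

# A 672x672 crop: 4 tiles plus the overview tile
print(anyres_token_estimate(672, 672))      # (4, 2880)

# A US Letter page rendered at 150 DPI (1275x1650 pixels)
print(anyres_token_estimate(1275, 1650))    # (20, 12096)
```

At full-page resolution the untiled count grows past ten thousand visual tokens, which is the pressure that dynamic-resolution encoders are designed to relieve.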

The gap between VLM-native and OCR-pipeline closed rapidly. By 2026, VLM-native wins on:

Scene text (handwritten + printed, mixed scripts).
Complex tables with merged cells.
Math equations embedded in text.
Figures with text annotations.

OCR pipelines still win on:

Pure-scan workloads at massive scale where per-page latency matters.
Pipeline reliability (deterministic failures vs VLM hallucinations).
Regulated environments requiring auditable OCR output.

The Claude 4.7 / GPT-5 frontier

At 2576-pixel native input, frontier VLMs do document understanding at near-human accuracy. The benchmark numbers from early 2026:

DocVQA: Claude 4.7 ~95.1, PaliGemma 2 ~88.4, Nougat ~77.3, pipelined LayoutLMv3 ~83.
ChartQA: Claude 4.7 ~92.2, GPT-4V ~78.
VisualMRC: Claude 4.7 ~94.

The closed-model gap is mostly resolution and base-LLM scale. Open models at 7B are a few points behind but catching up.

Math equations and LaTeX output

Scientific papers need exact LaTeX output for equations. Nougat was trained on this. VLMs trained with LaTeX targets (Qwen2.5-VL-Math, Nougat derivatives) produce usable LaTeX. Without explicit LaTeX training, VLMs produce readable but imprecise transcriptions.

For scientific-paper pipelines in 2026: chain Nougat on the PDF, then a VLM on tricky pages.
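One way to decide which pages count as "tricky" is a cheap sanity check on the specialist's markup, sending only the failures to the VLM. The heuristics below (empty output, unbalanced braces, unmatched LaTeX environments, a run of identical lines from a degenerate decoding loop) and the `needs_vlm_fallback` name are assumptions for illustration, not part of Nougat or any VLM API.

```python
import re

BEGIN = re.compile(r"\\begin\{(\w+\*?)\}")
END = re.compile(r"\\end\{(\w+\*?)\}")

def needs_vlm_fallback(markup: str, max_repeat: int = 4) -> bool:
    """Heuristic check on one page of LaTeX/markdown output; True means re-run with a VLM."""
    if not markup.strip():
        return True                                   # empty page output
    if markup.count("{") != markup.count("}"):
        return True                                   # unbalanced braces
    if sorted(BEGIN.findall(markup)) != sorted(END.findall(markup)):
        return True                                   # unmatched environments
    lines = [ln.strip() for ln in markup.splitlines() if ln.strip()]
    run = 1
    for prev, cur in zip(lines, lines[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return True                               # same line repeated in a loop
    return False

pages = [
    "\\begin{equation}\nE = mc^{2}\n\\end{equation}",  # well-formed
    "\\begin{align}\nx = \\frac{1}{2}",                # environment never closed
    "\n".join(["the same line"] * 5),                  # repetition loop
]
print([needs_vlm_fallback(p) for p in pages])  # [False, True, True]
```

A check like this only catches structural failures; a page with well-formed but wrong LaTeX would still pass, so sampling pages for manual review remains useful.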

Handwriting

Still the hardest sub-task. Mixed printed + handwritten (doctors' notes, filled forms) is where OCR pipelines still beat VLMs for cost. VLM accuracy on handwritten-only documents is improving (Claude 4.7, PaliGemma 2).

2026 recipe

For a new document-AI project:

Pure-printed invoices at scale: LayoutLMv3 + rules, cost-efficient.
Mixed documents (scientific + handwritten + forms): VLM-native (PaliGemma 2 or Qwen2.5-VL).
Full arXiv ingestion: Nougat for math, VLM for figures.
Regulatory: OCR pipeline + VLM validator for cross-check.
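The recipe above can be written as a small decision function. This is a sketch under stated assumptions: the `Project` fields, the rule ordering (regulatory first, then math, then scale), and the one-million-pages-per-day threshold are hypothetical choices, not values given in this lesson.

```python
from dataclasses import dataclass

@dataclass
class Project:
    """Hypothetical description of a document-AI workload."""
    printed_only: bool
    pages_per_day: int
    has_math: bool = False
    regulated: bool = False

def pick_stack(p: Project, scale_threshold: int = 1_000_000) -> str:
    """Map a project description to one of the four recipe entries."""
    if p.regulated:
        return "OCR pipeline + VLM validator"          # auditability comes first
    if p.has_math:
        return "Nougat for math + VLM for figures"     # scientific papers
    if p.printed_only and p.pages_per_day >= scale_threshold:
        return "LayoutLMv3 + rules"                    # cost-efficient at scale
    return "VLM-native (PaliGemma 2 or Qwen2.5-VL)"    # mixed documents

print(pick_stack(Project(printed_only=True, pages_per_day=5_000_000)))
print(pick_stack(Project(printed_only=False, pages_per_day=20_000)))
print(pick_stack(Project(printed_only=True, pages_per_day=5_000_000, regulated=True)))
```

Putting the regulatory check first reflects the idea that auditability is a hard constraint while cost is a preference; a real picker would also weigh quality targets and latency.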

Use It

code/main.py:

A toy layout-aware tokenizer: given (text, bbox) pairs, produces the LayoutLMv3-style input.
A Donut-style task schema generator: JSON template for forms.
A comparison of token budgets per page across OCR-pipeline, Donut, Nougat, and VLM-native.

Ship It

This lesson produces outputs/skill-document-ai-stack-picker.md. Given a document-AI project (domain, scale, quality, regulatory), picks between OCR pipeline, OCR-free specialist, and VLM-native.

Exercises

Your project is 10M invoices per day. Which stack minimizes cost-per-page without losing accuracy?

Why does LayoutLMv3 outperform pure CLIP-based VLMs on form QA but underperform on scene text? What does the bbox stream give up?

Nougat generates LaTeX. Propose a test case where VLM-native output beats Nougat on LaTeX fidelity, and a case where Nougat wins.

Read the PaliGemma 2 paper (Google, 2024). What was the key training-data addition that lifted document accuracy vs PaliGemma 1?

Design a regulatory-safe hybrid: OCR pipeline as primary, VLM as secondary cross-check. How do you resolve disagreement?

Key Terms

Term	What people say	What it actually means
OCR pipeline	"Tesseract-style"	Stage-wise stack: detect -> OCR -> layout -> rules; deterministic, fragile
OCR-free	"Donut-style"	Image-to-output transformer that skips explicit OCR; single model
Layout-aware	"LayoutLM"	Input includes per-token bbox coordinates; unified masking across modalities
VLM-native	"Frontier VLM"	Feed page image directly to Claude/GPT/Qwen VLM at high resolution; no pipeline
DocVQA	"Doc benchmark"	Standard document visual question answering benchmark; the most-cited score in document AI
Markup output	"LaTeX / MD"	Structured output format instead of free-form text; enables downstream automation