OCR & Document Understanding
> OCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.
Type: Learn + Use
Languages: Python
Prerequisites: Phase 4 Lesson 06 (Detection), Phase 7 Lesson 02 (Self-Attention)
Time: ~45 minutes
Learning Objectives
- Trace the classical OCR pipeline (detect -> recognise -> layout) and the modern end-to-end alternatives (Donut, Qwen-VL-OCR)
- Implement CTC (Connectionist Temporal Classification) loss for sequence-to-sequence OCR training
- Use PaddleOCR or EasyOCR for production document parsing without training
- Distinguish OCR, layout parsing, and document understanding — and pick the right tool per task
The Problem
Images full of text are everywhere: receipts, invoices, IDs, scanned books, forms, whiteboards, signs, screenshots. Extracting structured data from them — not just the characters, but "this is the total amount" — is one of the highest-value applied-vision problems.
The field splits into three skill layers:
- OCR proper: turn pixels into text.
- Layout parsing: group OCR output into regions (title, body, table, header).
- Document understanding: extract structured fields ("invoice_total = $42.50") from layout.
Each layer has classical and modern approaches, and the gap between "I want text from an image" and "I need the total amount from this receipt" is bigger than most teams realise.
The Concept
The classical pipeline
Pipeline diagram: Text detection (DB, EAST, CRAFT) -> Word/line bounding boxes -> Crop each region -> Recognition (CRNN + CTC) -> Text strings -> Layout ordering -> Reading-order text
- Text detection produces per-line or per-word quadrilaterals.
- Recognition crops each region, resizes it to a fixed height, and runs a CNN + BiLSTM trained with CTC to produce a character sequence.
- Layout rebuilds reading order (top-to-bottom, left-to-right for Latin; different for Arabic, Japanese).
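For a single-column Latin page, the layout step can be approximated by grouping boxes into lines and sorting each line by x. The following is a minimal sketch under stated assumptions: boxes are axis-aligned and arrive as hypothetical `(x, y, w, h, text)` tuples, and `line_tol` is a hand-picked threshold. Multi-column, rotated, or right-to-left layouts need more than this.

```python
def reading_order(boxes, line_tol=0.5):
    """Group boxes into text lines, then read top-to-bottom, left-to-right.

    boxes: list of (x, y, w, h, text) with (x, y) the top-left corner.
    line_tol: max vertical-centre offset, as a fraction of line height,
              for a box to join the current line (an assumed heuristic).
    """
    def centre_y(box):
        return box[1] + box[3] / 2

    lines = []
    for box in sorted(boxes, key=centre_y):
        if lines:
            current = lines[-1]
            mean_cy = sum(centre_y(b) for b in current) / len(current)
            mean_h = sum(b[3] for b in current) / len(current)
            if abs(centre_y(box) - mean_cy) <= line_tol * mean_h:
                current.append(box)
                continue
        lines.append([box])  # start a new text line

    # Within each line, order boxes by their left edge.
    return [" ".join(b[4] for b in sorted(line, key=lambda b: b[0]))
            for line in lines]


detected = [
    (120, 12, 60, 20, "42.50"),
    (10, 10, 80, 20, "Total:"),
    (10, 50, 90, 20, "Thank"),
    (110, 48, 40, 20, "you"),
]
print(reading_order(detected))
```

Sorting by raw `y` alone would interleave words whose boxes differ by a pixel or two, which is why the sketch clusters by line first.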
CTC in one paragraph
OCR recognition produces a variable-length sequence from a fixed-length feature map. CTC (Graves et al., 2006) lets you train this without character-level alignment. The model outputs a distribution over (vocab + blank) at every time step; CTC loss marginalises over all alignments that reduce to the target text after merging repeats and removing blanks.
raw output: "h h h _ _ e e l l _ l l o _ _"
after merge repeats and remove blanks: "hello"
CTC is the reason CRNN worked in 2015 and still trains most production OCR models in 2026.
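To make "marginalises over all alignments" concrete, the sketch below applies the collapse rule and then sums path probabilities by brute force. It is only an illustration: the three-symbol alphabet and uniform per-step probabilities are assumptions chosen so the result can be checked by hand, and enumeration is exponential in the number of time steps, so real CTC implementations use dynamic programming instead.

```python
from itertools import groupby, product

BLANK = "_"


def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def brute_force_ctc_prob(probs, alphabet, target):
    """Sum the probability of every path that collapses to `target`.

    probs[t][k] is the probability of alphabet[k] at time step t.
    """
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(probs)):
        if collapse([alphabet[k] for k in path]) == target:
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


print(collapse("h h h _ _ e e l l _ l l o _ _".split()))  # hello

alphabet = [BLANK, "a", "b"]
uniform = [[1 / 3] * 3 for _ in range(3)]  # 3 time steps, uniform output
# Five paths collapse to "ab": aab, abb, _ab, a_b, ab_  ->  5/27
print(brute_force_ctc_prob(uniform, alphabet, "ab"))
```

The CTC loss is the negative log of this sum. Note that "aa" over three steps has only one valid path (`a_a`), because a blank is required between repeated characters.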
Modern end-to-end models
- Donut (Kim et al., 2022): a Swin Transformer encoder + a BART-style text decoder; reads an image and emits JSON directly. No text detector, no layout module.
- TrOCR: ViT + transformer decoder for line-level OCR.
- Qwen-VL-OCR / InternVL: full vision-language models fine-tuned for OCR tasks; best accuracy in 2026 on complex documents.
- PaddleOCR: classical DB + CRNN pipeline in a mature production package; still the open-source workhorse.
End-to-end models need more data and compute but skip the error accumulation of multi-stage pipelines.
Layout parsing
For structured documents, run a layout detector (for example LayoutLMv3, or a detector trained on the DocLayNet dataset) that labels each region: Title, Paragraph, Figure, Table, Footnote. Reading order then becomes "iterate through regions in layout order, concatenate."
For forms, use key-value extraction models (Donut for visually-rich documents, LayoutLMv3 for plain scans). LayoutLMv3 takes the image plus the OCR text and box positions; Donut takes only the image. Both predict structured key-value pairs.
Evaluation metrics
- Character Error Rate (CER): Levenshtein distance / length of reference. Lower is better. Production target: < 2% on clean scans.
- Word Error Rate (WER): same at the word level.
- F1 on structured fields: for key-value tasks; measures whether {invoice_total: 42.50} appears correctly.
- Edit distance on JSON: for end-to-end document parsing; the Donut paper reports an accuracy based on normalised tree edit distance.
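CER and WER are the same computation at different granularities. Here is a minimal sketch, assuming plain whitespace tokenisation for WER and no text normalisation (case folding, Unicode normalisation), both of which a real evaluation would need to pin down.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference, hypothesis):
    """Word error rate: the same ratio over whitespace-split tokens."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


ref, hyp = "total 42.50", "tota1 42.50"
print(f"CER={cer(ref, hyp):.3f}  WER={wer(ref, hyp):.3f}")
```

One wrong character costs 1/11 in CER but 1/2 in WER here, which is why the two metrics can tell very different stories on short fields. Both can exceed 1.0 when the hypothesis is much longer than the reference.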
Build It
Step 1: CTC loss + greedy decoder
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0):
    """
    log_probs:      (T, N, C) log-softmax over vocab including blank at index 0
    targets:        (N, S) padded int targets, or 1-D concatenated targets of
                    length sum(target_lengths); no blanks in either form
    input_lengths:  (N,) per-sample time steps used
    target_lengths: (N,) per-sample target length
    """
    # zero_infinity=True zeroes the loss for samples with no valid alignment
    # (e.g. a target longer than the input) instead of returning inf.
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=blank, reduction="mean", zero_infinity=True)


def greedy_ctc_decode(log_probs, blank=0):
    """
    log_probs: (T, N, C) log-softmax
    returns:   list of index sequences (blanks removed, repeats merged)
    """
    # Best class per time step, rearranged to one row per sample: (N, T)
    preds = log_probs.argmax(dim=-1).transpose(0, 1).cpu().tolist()
    out = []
    for seq in preds:
        decoded = []
        prev = None
        for idx in seq:
            # Emit only when the symbol changes and is not the blank.
            if idx != prev and idx != blank:
                decoded.append(idx)
            prev = idx
        out.append(decoded)
    return out
```
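F.ctc_loss is PyTorch's built-in CTC implementation; on GPU it can dispatch to cuDNN when the inputs meet cuDNN's requirements (for example blank=0 and concatenated int32 targets). The greedy decoder is simpler than beam search and, without a language model, is often close to it in CER, though the gap depends on the data.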
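Behind `F.ctc_loss` sits a dynamic programme over the target interleaved with blanks. The pure-Python sketch below shows the forward (alpha) recursion for a single sample. It works in plain probability space for readability, which is an assumption made for illustration only; real implementations work in log space to avoid underflow, and this is not PyTorch's actual code.

```python
import math


def ctc_forward_prob(probs, target, blank=0):
    """Probability that a CTC model emits `target`.

    probs[t][k]: probability of class k at time step t.
    target:      list of class ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)

    # At t = 0 a path may start on the leading blank or the first symbol.
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one
            # Skipping a blank is allowed only between different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last symbol or the trailing blank.
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)


uniform = [[1 / 3] * 3 for _ in range(3)]   # 3 steps, classes {blank, 1, 2}
p = ctc_forward_prob(uniform, [1, 2])
print(p, -math.log(p))                      # probability and CTC loss
```

The cost is O(T x S) instead of exponential in T, which is what makes CTC trainable on long lines.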
Step 2: Tiny CRNN recogniser
Minimal CNN + BiLSTM for line OCR.
```python
class TinyCRNN(nn.Module):
    def __init__(self, vocab_size=40, hidden=128, feat=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, feat, 3, 1, 1), nn.BatchNorm2d(feat), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat, feat * 2, 3, 1, 1), nn.BatchNorm2d(feat * 2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat * 2, feat * 4, 3, 1, 1), nn.BatchNorm2d(feat * 4), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(feat * 4, feat * 4, 3, 1, 1), nn.BatchNorm2d(feat * 4), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(feat * 4, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(hidden * 2, vocab_size)

    def forward(self, x):
        # x: (N, 1, H, W)
        f = self.cnn(x)                      # (N, C, H', W')
        f = f.mean(dim=2).transpose(1, 2)    # (N, W', C)
        h, _ = self.rnn(f)
        return F.log_softmax(self.head(h).transpose(0, 1), dim=-1)  # (W', N, vocab)
```
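The model expects fixed-height line crops (32 px here): the four pooling layers reduce height by 16× (32 -> 2), and forward averages over the remaining rows. Width is pooled only twice (W -> W/4) and becomes the time dimension for CTC.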
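Because width becomes the time axis, the crop width bounds how long a string CTC can emit. Below is a small sketch of that arithmetic, assuming the two width-halving pools of TinyCRNN; the helper names are hypothetical and not part of the lesson code.

```python
def crnn_time_steps(width, width_halvings=2):
    """Time steps left after the pooling layers that halve the width."""
    return width // (2 ** width_halvings)


def min_ctc_steps(text):
    """Fewest time steps CTC needs: one per character, plus one blank
    between every pair of identical neighbours (e.g. the 'll' in hello)."""
    repeats = sum(a == b for a, b in zip(text, text[1:]))
    return len(text) + repeats


def ctc_feasible(width, text):
    """True if a crop of this width leaves enough steps for `text`."""
    return crnn_time_steps(width) >= min_ctc_steps(text)


print(crnn_time_steps(80), min_ctc_steps("hello"), ctc_feasible(80, "hello"))
```

When the check fails, the target has no valid alignment and the per-sample CTC loss is infinite, so it is worth running before training on narrow crops.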
Step 3: Synthetic OCR
Generate black-on-white alphanumeric strings for an end-to-end smoke test.
```python
import numpy as np


def synthetic_line(text, height=32, char_width=16):
    """Render each character as a fixed pseudo-glyph in its own cell."""
    W = char_width * len(text)
    img = np.ones((height, W), dtype=np.float32)
    for i, c in enumerate(text):
        x = i * char_width
        # Seed by character so every symbol gets its own repeatable pattern;
        # one uniform block per cell would make all characters look identical
        # and leave the model nothing to learn.
        rng = np.random.default_rng(ord(c))
        glyph = rng.integers(0, 2, size=(height - 12, char_width - 4))
        img[6:height - 6, x + 2:x + char_width - 2] = glyph.astype(np.float32)
    return img


def build_batch(strings, vocab):
    H = 32
    W = 16 * max(len(s) for s in strings)
    imgs = np.ones((len(strings), 1, H, W), dtype=np.float32)  # white padding
    target_lengths = []
    targets = []
    for i, s in enumerate(strings):
        imgs[i, 0, :, :16 * len(s)] = synthetic_line(s)
        ids = [vocab.index(c) for c in s]
        targets.extend(ids)                 # 1-D concatenated targets
        target_lengths.append(len(ids))
    return torch.from_numpy(imgs), torch.tensor(targets), torch.tensor(target_lengths)


vocab = ["_"] + list("0123456789abcdefghijklmnopqrstuvwxyz")  # index 0 is the blank
imgs, targets, lengths = build_batch(["hello", "world"], vocab)
print(f"images: {imgs.shape}  targets: {targets.shape}  lengths: {lengths.tolist()}")
```
A real OCR dataset adds fonts, noise, rotation, blur, and colour. The training pipeline stays the same.
Step 4: Training sketch
```python
model = TinyCRNN(vocab_size=len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    strings = ["abc" + str(step % 10)] * 4 + ["xyz" + str((step + 1) % 10)] * 4
    imgs, targets, target_lens = build_batch(strings, vocab)
    log_probs = model(imgs)  # (W', 8, vocab)
    input_lens = torch.full((8,), log_probs.size(0), dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```
Loss should drop steadily over 200 steps on this trivial synthetic data, provided the rendered glyphs differ per character; the exact values depend on initialisation.
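To check what the model has learned, the index lists returned by greedy_ctc_decode need mapping back to strings. The following is a minimal sketch; `ids_to_text` and `exact_match_rate` are hypothetical helper names rather than part of the lesson code, and the vocabulary is repeated so the block runs on its own.

```python
vocab = ["_"] + list("0123456789abcdefghijklmnopqrstuvwxyz")  # index 0 = blank


def ids_to_text(id_seqs, vocab):
    """Map decoded index sequences (blanks already removed) to strings."""
    return ["".join(vocab[i] for i in seq) for seq in id_seqs]


def exact_match_rate(predictions, references):
    """Fraction of lines recognised with no error at all."""
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Stand-in for decoder output: the indices of "hello" and of "w0rld".
decoded = [[vocab.index(c) for c in "hello"], [vocab.index(c) for c in "w0rld"]]
texts = ids_to_text(decoded, vocab)
print(texts, exact_match_rate(texts, ["hello", "world"]))
```

Exact match is a stricter signal than CER and is often the number that matters for short fields such as totals or IDs.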
Use It
Three production paths:
- PaddleOCR: mature, fast, multilingual. One-line usage: paddleocr.PaddleOCR(lang="en").ocr(image_path) (the entry point differs between major versions, so check the installed release).
- EasyOCR: Python-native, multilingual, PyTorch backbone.
- Tesseract: classical; still useful for old scanned documents where newer models struggle.
For end-to-end document parsing, use Donut or a VLM:
```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
```
For receipts, invoices, and forms with repeatable structure, fine-tune Donut. For arbitrary documents or OCR with reasoning, a VLM like Qwen-VL-OCR is the current default.
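The guidance in this section can be written down as a small decision function. This is a rule-of-thumb sketch only: the function name, its boolean inputs, and the order of the rules are assumptions made for illustration, not a benchmarked policy.

```python
def pick_ocr_stack(needs_structured_fields,
                   repeatable_layout,
                   needs_reasoning=False,
                   degraded_legacy_scan=False):
    """Suggest an OCR stack from coarse document traits (heuristic)."""
    # Arbitrary documents, or questions that need reasoning over the text.
    if needs_reasoning or (needs_structured_fields and not repeatable_layout):
        return "VLM-OCR"
    # Receipts, invoices, and forms with a stable template.
    if needs_structured_fields and repeatable_layout:
        return "Donut (fine-tuned)"
    # Plain text extraction from old scans where newer models struggle.
    if degraded_legacy_scan:
        return "Tesseract"
    # Default for plain text extraction.
    return "PaddleOCR or EasyOCR"


print(pick_ocr_stack(needs_structured_fields=True, repeatable_layout=True))
print(pick_ocr_stack(needs_structured_fields=False, repeatable_layout=False))
```

In practice the choice also depends on language coverage, latency, and hardware budget, which this sketch deliberately leaves out.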
Ship It
This lesson produces:
- outputs/prompt-ocr-stack-picker.md: a prompt that picks Tesseract / PaddleOCR / Donut / VLM-OCR given document type, language, and structure.
- outputs/skill-ctc-decoder.md: a skill that writes greedy and beam-search CTC decoders from scratch, including length normalisation.
Exercises
- (Easy) Train the TinyCRNN on 5-digit random numeric strings for 500 steps. Report CER on a held-out set.
- (Medium) Replace greedy decoding with beam search (beam_width=5). Report CER delta. On which inputs does beam search win?
- (Hard) Use PaddleOCR on a set of 20 receipts, extract line items, and compute F1 against hand-labelled ground truth for {item_name, price} pairs.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| OCR | "Text from pixels" | Turning image regions into character sequences |
| CTC | "Alignment-free loss" | Loss that trains a sequence model without per-timestep labels; marginalises over alignments |
| CRNN | "Classic OCR model" | Conv feature extractor + BiLSTM + CTC; the 2015 baseline still used in production |
| Donut | "End-to-end OCR" | Swin Transformer encoder + text decoder; emits JSON directly from image |
| Layout parsing | "Find regions" | Detect and label Title/Table/Figure/Paragraph regions in a document |
| Reading order | "Text sequence" | Ordering of recognised regions into a single text stream; trivial for single-column Latin text, non-trivial for mixed layouts |
| CER / WER | "Error rates" | Levenshtein distance / reference length at character or word granularity |
| VLM-OCR | "LLM that reads" | A vision-language model trained or prompted for OCR tasks; current SOTA on complex documents |
Further Reading
- CRNN (Shi et al., 2015) — the original CNN+RNN+CTC architecture
- CTC (Graves et al., 2006) — the original CTC paper; densely packed with the algorithmic ideas
- Donut (Kim et al., 2022) — OCR-free document understanding transformer
- PaddleOCR — the open-source production OCR stack