
OCR & Document Understanding

> OCR is a three-stage pipeline — detect text boxes, recognise the characters, then lay them out. Every modern OCR system reorders these stages or merges them.

Type: Learn + Use

Languages: Python

Prerequisites: Phase 4 Lesson 06 (Detection), Phase 7 Lesson 02 (Self-Attention)

Time: ~45 minutes

Learning Objectives

Trace the classical OCR pipeline (detect -> recognise -> layout) and the modern end-to-end alternatives (Donut, Qwen-VL-OCR)
Implement CTC (Connectionist Temporal Classification) loss for sequence-to-sequence OCR training
Use PaddleOCR or EasyOCR for production document parsing without training
Distinguish OCR, layout parsing, and document understanding — and pick the right tool per task

The Problem

Images full of text are everywhere: receipts, invoices, IDs, scanned books, forms, whiteboards, signs, screenshots. Extracting structured data from them — not just the characters, but "this is the total amount" — is one of the highest-value applied-vision problems.

The field splits into three skill layers:

OCR proper: turn pixels into text.
Layout parsing: group OCR output into regions (title, body, table, header).
Document understanding: extract structured fields ("invoice_total = $42.50") from layout.

Each layer has classical and modern approaches, and the gap between "I want text from an image" and "I need the total amount from this receipt" is bigger than most teams realise.

The Concept

The classical pipeline

flowchart LR
    IMG["Image"] --> DET["Text detection<br/>(DB, EAST, CRAFT)"]
    DET --> BOX["Word/line<br/>bounding boxes"]
    BOX --> CROP["Crop each region"]
    CROP --> REC["Recognition<br/>(CRNN + CTC)"]
    REC --> TXT["Text strings"]
    TXT --> LAY["Layout<br/>ordering"]
    LAY --> OUT["Reading-order text"]
    style DET fill:#dbeafe,stroke:#2563eb
    style REC fill:#fef3c7,stroke:#d97706
    style OUT fill:#dcfce7,stroke:#16a34a

Text detection produces per-line or per-word quadrilaterals.
Recognition crops each region, resizes it to a fixed height, and runs a CNN + BiLSTM trained with CTC to produce a character sequence.
Layout rebuilds reading order (top-to-bottom, left-to-right for Latin; right-to-left for Arabic; often vertical columns for Japanese).
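For single-column Latin text, the layout step can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular OCR library: boxes are assumed to be axis-aligned `(x0, y0, x1, y1, text)` tuples, and the names `reading_order` and `line_tol` are hypothetical.

```python
def reading_order(boxes, line_tol=0.5):
    """Sort (x0, y0, x1, y1, text) boxes top-to-bottom, then left-to-right.

    A box joins the current text line when its vertical centre is within
    line_tol * box height of that line's mean centre.
    Assumes a single column of left-to-right text.
    """
    # Visit boxes from top to bottom by vertical centre.
    by_centre = sorted(boxes, key=lambda b: (b[1] + b[3]) / 2)
    lines = []
    for box in by_centre:
        cy = (box[1] + box[3]) / 2
        height = box[3] - box[1]
        if lines:
            current = lines[-1]
            line_cy = sum((b[1] + b[3]) / 2 for b in current) / len(current)
            if abs(cy - line_cy) < line_tol * height:
                current.append(box)
                continue
        lines.append([box])
    # Inside each line, order boxes by their left edge.
    return [b[4] for line in lines for b in sorted(line, key=lambda b: b[0])]


# Made-up detections, deliberately out of order.
boxes = [
    (120, 12, 200, 30, "world"),
    (10, 10, 100, 30, "hello"),
    (10, 50, 90, 70, "second"),
    (100, 52, 160, 70, "line"),
]
print(reading_order(boxes))  # ['hello', 'world', 'second', 'line']
```

Multi-column pages, tables, and right-to-left or vertical scripts break this heuristic, which is where a layout detector earns its place.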

CTC in one paragraph

OCR recognition has to produce a variable-length character sequence from a feature map with a fixed number of time steps. CTC (Graves et al., 2006) lets you train this without character-level alignment. The model outputs a distribution over (vocab + blank) at every time step; the blank lets the model emit nothing at a step and separates genuinely repeated characters, such as the double "l" in "hello". CTC loss marginalises over all alignments that reduce to the target text after merging repeats and removing blanks.

raw output: "h h h _ _ e e l l _ l l o _ _"
after merge repeats and remove blanks: "hello"

CTC is the reason CRNN worked in 2015 and still trains most production OCR models in 2026.
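To make "marginalises over all alignments" concrete, here is a brute-force sketch in plain Python. It applies the collapse rule (merge repeats, then drop blanks) and sums the probability of every path that collapses to the target. The function names are illustrative, and enumeration is exponential in the number of time steps; real implementations compute the same quantity with a forward-backward dynamic programme.

```python
from itertools import groupby, product

BLANK = "_"


def collapse(path):
    """CTC collapse rule: merge consecutive repeats first, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)


def brute_force_ctc_prob(probs, symbols, target):
    """Sum the probability of every length-T path that collapses to target.

    probs[t][k] is the probability of symbols[k] at time step t.
    Only usable for tiny T: there are len(symbols) ** T paths.
    """
    total = 0.0
    for path in product(range(len(symbols)), repeat=len(probs)):
        if collapse([symbols[k] for k in path]) == target:
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total


print(collapse("hhh__eell_llo__"))  # hello

# Three time steps, uniform distribution over {blank, a, b}.
symbols = [BLANK, "a", "b"]
uniform = [[1 / 3] * 3 for _ in range(3)]
p_ab = brute_force_ctc_prob(uniform, symbols, "ab")
print(p_ab)  # 5 of the 27 paths collapse to "ab": aab, abb, ab_, a_b, _ab
```

The CTC loss for this sample is the negative log of `p_ab`. Note that without the blank between them, the two "l" symbols in "hello" would be merged into one.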

Modern end-to-end models

Donut (Kim et al., 2022) — a ViT encoder + a text decoder; reads an image and emits JSON directly. No text detector, no layout module.
TrOCR — ViT + transformer decoder for line-level OCR.
Qwen-VL-OCR / InternVL — full vision-language models fine-tuned for OCR tasks; best accuracy in 2026 on complex documents.
PaddleOCR — classical DB + CRNN pipeline in a mature production package; still the open-source workhorse.

End-to-end models need more data and compute but skip the error accumulation of multi-stage pipelines.

Layout parsing

For structured documents, run a layout detector (for example LayoutLMv3, or a detector trained on the DocLayNet dataset) that labels each region: Title, Paragraph, Figure, Table, Footnote. Reading order then becomes "iterate through regions in layout order, concatenate."

For forms, use key-value extraction models (Donut for visually rich documents, LayoutLMv3 for plain scans). LayoutLMv3 takes the image plus the detected text and its positions; Donut works from the image alone. Both predict structured key-value pairs.

Evaluation metrics

Character Error Rate (CER) — Levenshtein distance / length of reference. Lower is better. Production target: < 2% on clean scans.
Word Error Rate (WER) — same at the word level.
F1 on structured fields — for key-value tasks; measures whether {invoice_total: 42.50} appears correctly.
Edit distance on JSON — for end-to-end document parsing; the Donut paper introduced normalised tree edit distance.
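The first three metrics fit in a short, dependency-free sketch. The function names and the sample strings are made up for illustration; `field_f1` uses exact matching on (key, value) pairs, which is one common convention rather than the only one.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences; insert, delete, substitute cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: the same ratio over whitespace-separated words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return levenshtein(ref_words, hyp_words) / max(len(ref_words), 1)


def field_f1(ref_fields, pred_fields):
    """Exact-match F1 over the (key, value) pairs of two flat dicts."""
    ref_items, pred_items = set(ref_fields.items()), set(pred_fields.items())
    tp = len(ref_items & pred_items)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_items), tp / len(ref_items)
    return 2 * precision * recall / (precision + recall)


print(cer("hello", "helo"))                          # 0.2
print(wer("total amount due", "total amount duo"))   # one of three words wrong
print(field_f1({"invoice_total": "42.50", "currency": "USD"},
               {"invoice_total": "42.50", "currency": "EUR"}))  # 0.5
```

Because insertions count as errors, CER and WER can exceed 1.0 when the hypothesis is much longer than the reference.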

Build It

Step 1: CTC loss + greedy decoder

import torch
import torch.nn as nn
import torch.nn.functional as F


def ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0):
    """
    log_probs:      (T, N, C) log-softmax over the vocab, with the blank at index `blank`
    targets:        (N, S) padded int targets, or 1-D concatenated targets of
                    length sum(target_lengths); no blanks in either form
    input_lengths:  (N,) per-sample time steps used
    target_lengths: (N,) per-sample target length
    """
    return F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=blank, reduction="mean", zero_infinity=True)


def greedy_ctc_decode(log_probs, blank=0):
    """
    log_probs: (T, N, C) log-softmax
    returns: list of index sequences (blanks removed, repeats merged)
    """
    preds = log_probs.argmax(dim=-1).transpose(0, 1).cpu().tolist()
    out = []
    for seq in preds:
        decoded = []
        prev = None
        for idx in seq:
            if idx != prev and idx != blank:
                decoded.append(idx)
            prev = idx
        out.append(decoded)
    return out

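On CUDA, F.ctc_loss can use a cuDNN kernel, but only under the conditions listed in the PyTorch docs (for example concatenated int32 targets, blank=0, and every input length equal to T); otherwise it falls back to PyTorch's native implementation. The greedy decoder is simpler than beam search and is often close to it in CER when no language model is involved.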

Step 2: Tiny CRNN recogniser

Minimal CNN + BiLSTM for line OCR.

class TinyCRNN(nn.Module):
    def __init__(self, vocab_size=40, hidden=128, feat=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, feat, 3, 1, 1), nn.BatchNorm2d(feat), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat, feat * 2, 3, 1, 1), nn.BatchNorm2d(feat * 2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(feat * 2, feat * 4, 3, 1, 1), nn.BatchNorm2d(feat * 4), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(feat * 4, feat * 4, 3, 1, 1), nn.BatchNorm2d(feat * 4), nn.ReLU(inplace=True),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(feat * 4, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(hidden * 2, vocab_size)

    def forward(self, x):
        # x: (N, 1, H, W)
        f = self.cnn(x)                # (N, C, H', W')
        f = f.mean(dim=2).transpose(1, 2)  # (N, W', C)
        h, _ = self.rnn(f)
        return F.log_softmax(self.head(h).transpose(0, 1), dim=-1)  # (W', N, vocab)

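Input height is fixed (32 in this lesson): the four pooling layers reduce it to 2 rows, and forward averages those rows away. Only the first two pools halve the width, so a W-pixel line yields W/4 time steps, and that width axis is the time dimension for CTC.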
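It is worth checking that the recogniser emits enough time steps for the text you want to read, because CTC cannot align a target that needs more steps than the input provides. The arithmetic can be sketched as follows; `crnn_time_steps` and `min_ctc_steps` are hypothetical helper names, and the pool count mirrors the TinyCRNN above.

```python
def crnn_time_steps(width, width_pools=2):
    """Time steps out of TinyCRNN: each MaxPool2d(2) floor-halves the width.

    The two MaxPool2d((2, 1)) layers only shrink height, so they are ignored.
    """
    for _ in range(width_pools):
        width //= 2
    return width


def min_ctc_steps(text):
    """Fewest time steps CTC needs for text.

    One step per character, plus one blank between each pair of adjacent
    identical characters (otherwise the repeats would be merged).
    """
    repeats = sum(a == b for a, b in zip(text, text[1:]))
    return len(text) + repeats


width = 16 * len("hello")          # 16 pixels per character, as in this lesson
steps = crnn_time_steps(width)
needed = min_ctc_steps("hello")
print(steps, needed)               # 20 6
assert steps >= needed, "input too short for CTC to align this target"
```

When the input is too short, the true CTC loss is infinite; `zero_infinity=True` in the `ctc_loss` wrapper zeroes those samples instead of propagating `inf`.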

Step 3: Synthetic OCR

Generate synthetic black-on-white text lines for an end-to-end smoke test.

import numpy as np

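def synthetic_line(text, height=32, char_width=16):
    """Render each character as a 7-stripe barcode of its ASCII code.

    These are not real glyphs, just enough signal to tell characters apart.
    Assumes height >= 28 so all seven 4-pixel stripes fit.
    """
    W = char_width * len(text)
    img = np.ones((height, W), dtype=np.float32)  # white background
    for i, c in enumerate(text):
        x = i * char_width
        code = ord(c) & 0x7F
        for bit in range(7):
            if (code >> bit) & 1:
                # Black stripe for every set bit, 4 rows tall.
                img[4 * bit:4 * bit + 4, x + 2:x + char_width - 2] = 0.0
    return img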


def build_batch(strings, vocab):
    H = 32
    W = 16 * max(len(s) for s in strings)
    imgs = np.ones((len(strings), 1, H, W), dtype=np.float32)
    target_lengths = []
    targets = []
    for i, s in enumerate(strings):
        imgs[i, 0, :, :16 * len(s)] = synthetic_line(s)
        ids = [vocab.index(c) for c in s]
        targets.extend(ids)
        target_lengths.append(len(ids))
    return torch.from_numpy(imgs), torch.tensor(targets), torch.tensor(target_lengths)


vocab = ["_"] + list("0123456789abcdefghijklmnopqrstuvwxyz")
imgs, targets, lengths = build_batch(["hello", "world"], vocab)
print(f"images: {imgs.shape}   targets: {targets.shape}   lengths: {lengths.tolist()}")

A real OCR dataset uses rendered or scanned glyphs with varied fonts, noise, rotation, blur, and colour; the batching and training code stays the same.

Step 4: Training sketch

model = TinyCRNN(vocab_size=len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    # Two distinct 4-character strings per batch, four copies of each.
    strings = ["abc" + str(step % 10)] * 4 + ["xyz" + str((step + 1) % 10)] * 4
    imgs, targets, target_lens = build_batch(strings, vocab)
    log_probs = model(imgs)  # (W', 8, vocab)
    # Every sample uses all W' time steps.
    input_lens = torch.full((8,), log_probs.size(0), dtype=torch.long)
    loss = ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.3f}")

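On this trivial synthetic data the loss should fall steadily over the 200 steps. How low it gets depends on the seed and on whether the rendered characters are visually distinguishable: the model cannot learn to separate characters that render identically.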

Use It

Three production paths:

PaddleOCR — mature, fast, multilingual. One-line usage: paddleocr.PaddleOCR(lang="en").ocr(image_path).
EasyOCR — Python-native, multilingual, PyTorch backbone.
Tesseract — classical; still useful for old scanned documents when models struggle.
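Whichever engine you pick, its raw output still needs post-processing before it becomes a field. The sketch below works on a hand-written list shaped like the `(bounding box, text, confidence)` tuples that EasyOCR's `readtext` returns. The receipt contents are made up, the helper names are hypothetical, and the "amount after the word total" rule is a deliberately naive stand-in for a real key-value model.

```python
import re

# Hand-written stand-in for OCR output, already in reading order:
# (four-corner bounding box, recognised text, confidence).
sample_results = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "ACME STORE", 0.98),
    ([[10, 40], [80, 40], [80, 60], [10, 60]], "Coffee", 0.95),
    ([[150, 40], [200, 40], [200, 60], [150, 60]], "3.50", 0.93),
    ([[10, 70], [60, 70], [60, 90], [10, 90]], "T0TAL", 0.31),
    ([[10, 100], [70, 100], [70, 120], [10, 120]], "Total", 0.91),
    ([[150, 100], [200, 100], [200, 120], [150, 120]], "$42.50", 0.96),
]

AMOUNT = re.compile(r"\$?(\d+\.\d{2})")


def confident_texts(results, min_conf=0.5):
    """Drop low-confidence detections and keep only the text."""
    return [text for _box, text, conf in results if conf >= min_conf]


def find_total(texts):
    """Return the first amount that appears after a 'total' token, else None."""
    seen_total = False
    for text in texts:
        if seen_total:
            match = AMOUNT.fullmatch(text.strip())
            if match:
                return float(match.group(1))
        if "total" in text.lower():
            seen_total = True
    return None


texts = confident_texts(sample_results)
print(texts)
print(find_total(texts))  # 42.5
```

Rules like this break as soon as the receipt layout changes, which is the gap between OCR proper and document understanding.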

For end-to-end document parsing, use Donut or a VLM:

from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

For receipts, invoices, and forms with repeatable structure, fine-tune Donut. For arbitrary documents or OCR with reasoning, a VLM like Qwen-VL-OCR is the current default.

Ship It

This lesson produces:

outputs/prompt-ocr-stack-picker.md — a prompt that picks Tesseract / PaddleOCR / Donut / VLM-OCR given document type, language, and structure.
outputs/skill-ctc-decoder.md — a skill that writes greedy and beam-search CTC decoders from scratch, including length normalisation.

Exercises

(Easy) Train the TinyCRNN on 5-digit random numeric strings for 500 steps. Report CER on a held-out set.
(Medium) Replace greedy decoding with beam search (beam_width=5). Report CER delta. On which inputs does beam search win?
(Hard) Use PaddleOCR on a set of 20 receipts, extract line items, and compute F1 against hand-labelled ground truth for {item_name, price} pairs.

Key Terms

Term	What people say	What it actually means
OCR	"Text from pixels"	Turning image regions into character sequences
CTC	"Alignment-free loss"	Loss that trains a sequence model without per-timestep labels; marginalises over alignments
CRNN	"Classic OCR model"	Conv feature extractor + BiLSTM + CTC; the 2015 baseline still used in production
Donut	"End-to-end OCR"	ViT encoder + text decoder; emits JSON directly from image
Layout parsing	"Find regions"	Detect and label Title/Table/Figure/Paragraph regions in a document
Reading order	"Text sequence"	Ordering of recognised regions into the sequence a person would read; simple for single-column Latin text, non-trivial for multi-column or mixed layouts
CER / WER	"Error rates"	Levenshtein distance / reference length at character or word granularity
VLM-OCR	"LLM that reads"	A vision-language model trained or prompted for OCR tasks; current SOTA on complex documents