Machine Translation

> Translation is the task that paid for NLP research for thirty years and keeps paying now.

Type: Build

Languages: Python

Prerequisites: Phase 5 · 10 (Attention Mechanism), Phase 5 · 04 (GloVe, FastText, Subword)

Time: ~75 minutes

The Problem

A model reads a sentence in one language and produces a sentence in another. Length varies. Word order varies. Some source words map to multiple target words and vice versa. Idioms refuse one-to-one mapping. "I miss you" in French is "tu me manques" — literally "you are lacking to me." No word-level alignment survives that.

Machine translation is the task that forced NLP to invent encoder-decoders, attention, transformers, and eventually the whole LLM paradigm. Every step forward arrived because translation quality was measurable and the gap between human and machine was stubborn.

This lesson skips the history and teaches the working pipeline of 2026: a pretrained multilingual encoder-decoder (NLLB-200 or mBART), subword tokenization, beam search, BLEU and chrF evaluation, and the handful of failure modes that still ship to production uncaught.

The Concept

MT pipeline: tokenize → encode → decode with attention → detokenize

Modern MT is a transformer encoder-decoder trained on parallel text. The encoder reads the subword-tokenized source. The decoder generates the target one subword at a time, attending to the encoder's output via cross-attention (lesson 10). Decoding uses beam search to avoid the greedy-decoding trap: picking the single most likely token at every step can commit early to a prefix that leads to a worse overall sentence. The output is detokenized, detruecased, and scored against a reference.
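To make the greedy-decoding trap concrete, here is a minimal sketch of beam search over a hand-written toy model. The `NEXT` table and its probabilities are invented for illustration and come from no real system; a real decoder would get these scores from the model's softmax and would usually apply a length penalty as well.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
# The numbers are invented so that the greedy choice leads to a worse sentence.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.3, "</s>": 0.3},
    "a": {"cat": 0.9, "</s>": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def beam_search(beam_width, max_len=5):
    """Return (tokens, probability) of the best finished hypothesis found."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the top `beam_width` candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    best_score, best_tokens = max(finished or beams, key=lambda c: c[0])
    return best_tokens, math.exp(best_score)


print(beam_search(beam_width=1))  # greedy: ends up with "the cat", probability about 0.24
print(beam_search(beam_width=2))  # beam of 2: finds "a cat", probability about 0.36
```

With a beam width of 1 the search is exactly greedy decoding: it takes "the" because 0.6 beats 0.4, then can do no better than 0.24 overall. Keeping two hypotheses alive lets the initially weaker "a" prefix win.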

Three operational choices drive real-world MT quality.

Build It

Step 1: a pretrained MT call

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

src = "The cats are running."
inputs = tok(src, return_tensors="pt")

out = model.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("fra_Latn"),
    num_beams=5,
    length_penalty=1.0,
    max_new_tokens=64,
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
# Output: Les chats courent.

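Two things matter here. src_lang tells the tokenizer which language tag to attach to the source sequence; NLLB uses one shared subword vocabulary for all languages, so the tag, not the segmentation, is what changes. forced_bos_token_id forces the decoder's first token to be the target-language tag, which is how it knows which language to generate. The codes themselves (eng_Latn, fra_Latn) are NLLB-specific; mBART and M2M-100 use their own codes and conventions, and they are not interchangeable.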

Step 2: BLEU and chrF

BLEU measures n-gram overlap between output and reference: clipped precision for four n-gram sizes (1 to 4), combined by geometric mean and multiplied by a brevity penalty for too-short output. sacrebleu reports the score on a 0-100 scale. It is widely used and frustrating to interpret. As rough rules of thumb, 30 BLEU is "usable", 40 is "good", and 50 is "exceptional"; differences under 1 BLEU are usually noise.
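The arithmetic behind BLEU can be sketched in a few lines of plain Python. This is a teaching sketch only, assuming whitespace tokenization, a single reference, and no smoothing; `toy_bleu` is a hypothetical helper name, and its scores are not comparable to published numbers. For anything you report, use sacrebleu.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of size n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(hypothesis, reference, max_n=4):
    """Single-sentence, single-reference BLEU without smoothing (illustration only)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipping: a hypothesis n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty n-gram order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the four precisions, scaled to 0-100.
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(toy_bleu("the cat sat on the mat", reference))  # identical: 100.0
print(toy_bleu("the cat sat on", reference))          # every n-gram matches, but brevity penalty applies
```

The second call shows why the brevity penalty exists: a truncated output can have perfect precision, so precision alone would reward saying less.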

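chrF is an F-score computed over character n-grams rather than words. It gives partial credit when a word shares a stem with the reference but carries the wrong inflection, which makes it more informative than BLEU for morphologically rich languages, where BLEU undercounts matches. It is often reported alongside BLEU.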
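A simplified sketch of the idea, assuming character n-grams of size 1 to 6, whitespace ignored, and recall weighted twice as heavily as precision (beta = 2). `toy_chrf` is a hypothetical helper and will not reproduce sacrebleu's numbers exactly; it only shows where the partial credit comes from.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def toy_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score on a 0-100 scale (illustration only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short to have n-grams of this size
        overlap = sum((hyp & ref).values())  # multiset intersection
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "Les chats courent."
wrong_inflection = toy_chrf("Les chats courait.", reference)
wrong_verb = toy_chrf("Les chats dorment.", reference)
print(wrong_inflection > wrong_verb)  # True: the shared stem "cour" earns partial credit
```

A word-level metric treats "courait" and "dorment" as equally wrong against "courent". The character-level score separates them.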

import sacrebleu

hypotheses = ["Les chats courent."]
references = [["Les chats courent."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")

Always use sacrebleu. It normalizes tokenization so scores are comparable across papers. Rolling your own BLEU computation is how misleading benchmarks happen.

The three-tier evaluation hierarchy (2026)

Modern MT evaluation uses three complementary metric families. Ship with at least two.

Practical 2026 stack: sacrebleu for BLEU and chrF, unbabel-comet for COMET, and a prompted LLM for the final human-facing signal. Calibrate every metric against 50-100 human-labeled examples before trusting it on production data.
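One way to do that calibration is to check how well the metric ranks translations compared with human raters. The sketch below computes a Spearman rank correlation in plain Python; the scores and ratings are invented placeholders, and which correlation level counts as "trustworthy" is a judgment call for your own data.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder data: one metric score and one 1-5 human rating per translation.
metric_scores = [0.82, 0.41, 0.67, 0.90, 0.35]
human_ratings = [4, 2, 4, 5, 1]
print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.2f}")
```

Rank correlation is used here because metric scores and human ratings live on different scales; what matters is whether the metric orders translations the way people do.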

Reference-free metrics (COMET-QE, BLEURT-QE, LLM-as-judge) let you evaluate translations without a reference, which matters for long-tail language pairs where reference translations do not exist.

Step 3: what breaks in production

The working pipeline above will translate fluently 80% of the time and silently fail the remaining 20%. Named failure modes:
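Silent failures of this kind can often be caught with cheap checks on the output before it is returned. The sketch below covers two issues the evaluator checklist later in this lesson names: truncation or length explosion, and dropped UI placeholders such as `{name}`. The function names and the 0.5 to 2.0 ratio bounds are illustrative assumptions; sensible bounds depend on the language pair.

```python
import re

# Matches simple brace placeholders such as {name} or {count}.
PLACEHOLDER = re.compile(r"\{[^{}]+\}")


def length_ratio_ok(source, translation, low=0.5, high=2.0):
    """False if the translation is empty, far shorter, or far longer than the source."""
    if not source.strip() or not translation.strip():
        return False
    ratio = len(translation) / len(source)
    return low <= ratio <= high


def placeholders_preserved(source, translation):
    """True if the translation contains exactly the placeholders found in the source."""
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(translation))


def check_translation(source, translation):
    """Return a list of issue labels; an empty list means no check fired."""
    issues = []
    if not length_ratio_ok(source, translation):
        issues.append("length_ratio")
    if not placeholders_preserved(source, translation):
        issues.append("placeholders")
    return issues


print(check_translation("Hello, {name}!", "Bonjour, {name} !"))  # []
print(check_translation("Hello, {name}!", "Bonjour, {nom} !"))   # ['placeholders']
```

Checks like these do not measure quality. They only turn some silent failures into loud ones, which is the point.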

Step 4: fine-tuning for a domain

Pretrained models are generalists. Legal, medical, or game-dialog translation benefits measurably from fine-tuning on domain parallel data. The recipe is not exotic:

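from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import Dataset

# Reuses `tok` and `model` from Step 1.
tok.src_lang = "eng_Latn"
tok.tgt_lang = "fra_Latn"  # language tag applied to the target side

pairs = [
    {"src": "The defendant pleaded guilty.", "tgt": "L'accusé a plaidé coupable."},
]

ds = Dataset.from_list(pairs)


def preprocess(ex):
    # text_target tokenizes the target side into `labels`.
    return tok(ex["src"], text_target=ex["tgt"], truncation=True, max_length=128)


ds = ds.map(preprocess, remove_columns=["src", "tgt"])

# Pads each batch dynamically and sets label padding to -100,
# so padded positions do not contribute to the loss.
collator = DataCollatorForSeq2Seq(tok, model=model)

args = TrainingArguments(output_dir="out", per_device_train_batch_size=4, num_train_epochs=3, learning_rate=3e-5)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()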

A few thousand high-quality parallel examples beat a few hundred thousand noisy web-scraped ones. Quality of training data is the single largest production lever.
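A first pass at data quality can be done with simple heuristics before any training run. The sketch below drops empty sides, untranslated copies, pairs with extreme length ratios, and exact duplicates. `clean_pairs` and its thresholds are hypothetical, and the noisy example pairs are invented for illustration; real pipelines typically add language-ID and alignment scoring on top.

```python
def clean_pairs(pairs, min_ratio=0.5, max_ratio=2.0):
    """Filter a list of {"src", "tgt"} dicts with cheap quality heuristics."""
    seen = set()
    kept = []
    for pair in pairs:
        src, tgt = pair["src"].strip(), pair["tgt"].strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # source copied into target, nothing was translated
        ratio = len(tgt) / len(src)
        if not (min_ratio <= ratio <= max_ratio):
            continue  # lengths too different, likely misaligned
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)
        kept.append({"src": src, "tgt": tgt})
    return kept


raw = [
    {"src": "The defendant pleaded guilty.", "tgt": "L'accusé a plaidé coupable."},
    {"src": "The defendant pleaded guilty.", "tgt": "L'accusé a plaidé coupable."},  # duplicate
    {"src": "See section 4.", "tgt": ""},                                            # empty target
    {"src": "Copyright 2020", "tgt": "Copyright 2020"},                              # untranslated
    {"src": "Yes.", "tgt": "Le tribunal a rejeté la demande du plaignant."},         # misaligned
]
print(len(clean_pairs(raw)))  # 1
```

Filters this crude will also discard some good pairs. With a few thousand examples as the target, that trade is usually worth making.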

Use It

The 2026 production stack for MT:

| Use case | Recommended starting point |
|---|---|
| Any-to-any, 200 languages | facebook/nllb-200-distilled-600M (laptop) or nllb-200-3.3B (production) |
| English-centric, high quality, 50 languages | facebook/mbart-large-50-many-to-many-mmt |
| Short runs, cheap inference, English-French/German/Spanish | Helsinki-NLP / Marian models |
| Latency-critical browser-side | ONNX-quantized Marian (~50 MB) |
| Maximum quality, willing to pay | GPT-4 / Claude / Gemini with translation prompts |

LLMs now outperform specialized MT models on several language pairs as of 2026, particularly on idiomatic content and long context. The tradeoff is per-token cost and latency. Pick an LLM when context length, stylistic consistency, or domain adaptation via prompting matters more than throughput.

Ship It

Save as outputs/skill-mt-evaluator.md:

---
name: mt-evaluator
description: Evaluate a machine translation output for shipping.
version: 1.0.0
phase: 5
lesson: 11
tags: [nlp, translation, evaluation]
---

Given a source text and a candidate translation, output:

1. Automatic score estimate. BLEU and chrF ranges you would expect. State whether a reference is available.
2. Five-point human-verifiable checklist: (a) content preservation (no hallucinations), (b) correct language, (c) register / formality match, (d) terminology consistency with glossary if provided, (e) no truncation or length explosion.
3. One domain-specific issue to probe. E.g., for legal: named entities and statute citations. For medical: drug names and dosages. For UI: placeholder variables `{name}`.
4. Confidence flag. "Ship" / "Ship with review" / "Do not ship". Tie to the severity of issues found in step 2.

Refuse to ship a translation without a language-ID check on output. Refuse to evaluate without a reference unless the user explicitly opts in to reference-free scoring (COMET-QE, BLEURT-QE). Flag any content over 1000 tokens as likely needing chunked translation.

Exercises

  1. Easy. Translate a 5-sentence English paragraph to French and back to English using nllb-200-distilled-600M. Measure how close the round-trip is to the original. You should see semantic preservation with word-choice drift.
  2. Medium. Implement a language-ID check on translation outputs using fasttext lid.176 or langdetect. Integrate into the MT call so off-target generations are caught before returning.
  3. Hard. Fine-tune nllb-200-distilled-600M on a 5,000-pair domain corpus of your choice. Measure BLEU on a held-out set before and after fine-tuning. Report which kinds of sentences improved and which regressed.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| BLEU | Translation score | N-gram precision with brevity penalty. [0, 100]. |
| chrF | Character F-score | Character-level F-score. More sensitive for morphologically rich languages. |
| NMT | Neural MT | Transformer encoder-decoder trained on parallel text. The 2017+ default. |
| NLLB | No Language Left Behind | Meta's 200-language MT model family. |
| Constrained decoding | Controlled output | Force specific tokens or n-grams to appear / not appear in the output. |
| Hallucination | Invented content | Model output that is not supported by the source. |

Further Reading