GloVe, FastText, and Subword Embeddings
> Word2Vec trained one embedding per word. GloVe factorized the co-occurrence matrix. FastText embedded the pieces. BPE bridged to transformers.
Type: Build
Languages: Python
Prerequisites: Phase 5 · 03 (Word2Vec from Scratch)
Time: ~45 minutes
The Problem
Word2Vec left two open questions.
First, there was a parallel line of research that factorized the co-occurrence matrix directly (LSA, HAL) rather than doing online skip-gram updates. Was Word2Vec's iterative approach fundamentally better, or was the difference an artifact of how the two methods handled counts? GloVe answered that: matrix factorization with a thoughtfully chosen loss matches or beats Word2Vec, and costs less to train.
Second, neither method had a story for words it had never seen. Zoomer-approved, dogecoin, any proper noun coined last week, every inflected form of a rare root. FastText fixed this by embedding character n-grams: a word is the sum of its parts, including morphemes, so even out-of-vocabulary words get a sensible vector.
Third, once transformers arrived, the question shifted again. Word-level vocabularies cap out around a million entries; real language is more open than that. Byte-pair encoding (BPE) and its relatives solved this by learning a vocabulary of frequent subword units that covers everything. Every modern tokenizer for every modern LLM is a subword tokenizer.
This lesson walks all three, then explains which to reach for when.
The Concept
GloVe (Global Vectors). Build the word-word co-occurrence matrix X where X[i][j] is how often word j appears in the context of word i. Train vectors such that v_i · v_j + b_i + b_j ≈ log(X[i][j]). Weight the loss so frequent pairs do not dominate. Done.
FastText. A word is the sum of its character n-grams plus the word itself. where becomes . The word vector is the sum of those component vectors. Train as Word2Vec. Benefit: unseen words (whereupon) compose from known n-grams.
BPE (Byte-Pair Encoding). Start with a vocabulary of individual bytes (or characters). Count every adjacent pair in the corpus. Merge the most frequent pair into a new token. Repeat for k iterations. Result: a vocabulary of k + 256 tokens where frequent sequences (ing, tion, the) are single tokens and rare words are broken into familiar pieces. Every sentence tokenizes into something.
Build It
GloVe: factorize the co-occurrence matrix
import numpy as np
from collections import Counter
def build_cooccurrence(docs, window=5):
pair_counts = Counter()
vocab = {}
for doc in docs:
for token in doc:
if token not in vocab:
vocab[token] = len(vocab)
for doc in docs:
indexed = [vocab[t] for t in doc]
for i, center in enumerate(indexed):
for j in range(max(0, i - window), min(len(indexed), i + window + 1)):
if i != j:
distance = abs(i - j)
pair_counts[(center, indexed[j])] += 1.0 / distance
return vocab, pair_counts
def glove_train(vocab, pair_counts, dim=16, epochs=100, lr=0.05, x_max=100, alpha=0.75, seed=0):
n = len(vocab)
rng = np.random.default_rng(seed)
W = rng.normal(0, 0.1, size=(n, dim))
W_tilde = rng.normal(0, 0.1, size=(n, dim))
b = np.zeros(n)
b_tilde = np.zeros(n)
for epoch in range(epochs):
for (i, j), x_ij in pair_counts.items():
weight = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(x_ij)
coef = weight * diff
grad_W_i = coef * W_tilde[j]
grad_W_tilde_j = coef * W[i]
W[i] -= lr * grad_W_i
W_tilde[j] -= lr * grad_W_tilde_j
b[i] -= lr * coef
b_tilde[j] -= lr * coef
return W + W_tilde
Two moving pieces worth naming. The weighting function f(x) = (x/x_max)^alpha downweights very frequent pairs (like (the, and)) so they do not dominate the loss. The final embedding is the sum of W (center) and W_tilde (context) tables. Summing both is a published trick that tends to outperform using just one.
FastText: subword-aware embeddings
def char_ngrams(word, n_min=3, n_max=6):
wrapped = f"<{word}>"
grams = {wrapped}
for n in range(n_min, n_max + 1):
for i in range(len(wrapped) - n + 1):
grams.add(wrapped[i:i + n])
return grams
>>> char_ngrams("where")
{'<where>', '<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', '<wher', 'where', 'here>'}
Each word is represented by its set of n-grams (typically 3 to 6 characters). The word embedding is the sum of its n-gram embeddings. For skip-gram training, plug this in where Word2Vec used a single vector.
def fasttext_vector(word, ngram_table):
grams = char_ngrams(word)
vecs = [ngram_table[g] for g in grams if g in ngram_table]
if not vecs:
return None
return np.sum(vecs, axis=0)
For an unseen word, you still get a vector as long as some of its n-grams are known. whereupon shares her, ere, and where, so the two land near each other.
BPE: learned subword vocabulary
def learn_bpe(corpus, k_merges):
vocab = Counter()
for word, freq in corpus.items():
tokens = tuple(word) + ("</w>",)
vocab[tokens] = freq
merges = []
for _ in range(k_merges):
pair_freq = Counter()
for tokens, freq in vocab.items():
for a, b in zip(tokens, tokens[1:]):
pair_freq[(a, b)] += freq
if not pair_freq:
break
best = pair_freq.most_common(1)[0][0]
merges.append(best)
new_vocab = Counter()
for tokens, freq in vocab.items():
new_tokens = []
i = 0
while i < len(tokens):
if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
new_tokens.append(tokens[i] + tokens[i + 1])
i += 2
else:
new_tokens.append(tokens[i])
i += 1
new_vocab[tuple(new_tokens)] = freq
vocab = new_vocab
return merges
def apply_bpe(word, merges):
tokens = list(word) + ["</w>"]
for a, b in merges:
new_tokens = []
i = 0
while i < len(tokens):
if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
new_tokens.append(a + b)
i += 2
else:
new_tokens.append(tokens[i])
i += 1
tokens = new_tokens
return tokens
>>> corpus = Counter({"low": 5, "lower": 2, "newest": 6, "widest": 3})
>>> merges = learn_bpe(corpus, k_merges=10)
>>> apply_bpe("lowest", merges)
['low', 'est</w>']
First iteration merges the most common adjacent pair. After enough iterations, frequent substrings (low, est, tion) become single tokens and rare words break cleanly.
The real GPT / BERT / T5 tokenizers learn 30k-100k merges. Result: any text tokenizes into a bounded-length sequence of known IDs, no OOV ever.
Use It
In practice, you rarely train any of these yourself. You load pre-trained checkpoints.
import fasttext.util
fasttext.util.download_model("en", if_exists="ignore")
ft = fasttext.load_model("cc.en.300.bin")
print(ft.get_word_vector("whereupon").shape)
print(ft.get_word_vector("zoomerapproved").shape)
For BPE-style subword tokenization in the transformer era:
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("unbelievably tokenized"))
['un', 'bel', 'iev', 'ably', 'Ġtoken', 'ized']
The Ġ prefix marks word boundaries (a GPT-2 convention). Every modern tokenizer is a BPE variant, WordPiece (BERT), or SentencePiece (T5, LLaMA).
When to pick which
| Situation | Pick |
|---|---|
| Pretrained general-purpose word vectors, no OOV tolerance needed | GloVe 300d |
| Pretrained general-purpose word vectors, must handle misspellings / neologisms / morphologically rich languages | FastText |
| Anything going into a transformer (training or inference) | Whatever tokenizer the model shipped with. Never swap. |
| Training your own language model from scratch | Train a BPE or SentencePiece tokenizer on your corpus first |
| Production text classification with a linear model | Still TF-IDF. Lesson 02. |
Ship It
Save as outputs/skill-tokenizer-picker.md:
name: tokenizer-picker
description: Pick a tokenization approach for a new language model or text pipeline.
version: 1.0.0
phase: 5
lesson: 04
tags: [nlp, tokenization, embeddings]
---
Given a task and dataset description, you output:
1. Tokenization strategy (word-level, BPE, WordPiece, SentencePiece, byte-level). One-sentence reason.
2. Vocabulary size target (e.g., 32k for an English-only LM, 64k-100k for multilingual).
3. Library call with the exact training command. Name the library. Quote the arguments.
4. One reproducibility pitfall. Tokenizer-model mismatch is the single most common silent production bug; call out which pair must be used together.
Refuse to recommend training a custom tokenizer when the user is fine-tuning a pretrained LLM. Refuse to recommend word-level tokenization for any model targeting production inference. Flag non-English / multi-script corpora as needing SentencePiece with byte fallback.
Exercises
- Easy. Run
char_ngrams("playing")andchar_ngrams("played"). Compute the Jaccard overlap of the two n-gram sets. You should see substantial shared pieces (pla,lay,play), which is why FastText transfers well across morphological variants. - Medium. Extend
learn_bpeto track vocabulary growth. Plot tokens-per-corpus-character as a function of number of merges. You should see rapid compression at first, asymptoting near ~2-3 chars per token. - Hard. Train a 1k-merge BPE on Shakespeare's complete works. Compare tokenization of common words vs. rare proper nouns. Measure average tokens per word before and after. Write up what surprised you.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Co-occurrence matrix | Word-word frequency table | X[i][j] = how often word j appears in a window around word i. |
| Subword | Piece of a word | A character n-gram (FastText) or learned token (BPE/WordPiece/SentencePiece). |
| BPE | Byte-pair encoding | Iterative merging of most-frequent adjacent pairs until vocabulary hits target size. |
| OOV | Out of vocabulary | Word the model has never seen. Word2Vec/GloVe fail. FastText and BPE handle it. |
| Byte-level BPE | BPE on raw bytes | GPT-2's scheme. Vocabulary starts with 256 bytes, so nothing is ever OOV. |
Further Reading
- Pennington, Socher, Manning (2014). GloVe: Global Vectors for Word Representation — the GloVe paper, seven pages, still the best derivation of the loss.
- Bojanowski et al. (2017). Enriching Word Vectors with Subword Information — FastText.
- Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units — the paper that introduced BPE to modern NLP.
- Hugging Face tokenizer summary — how BPE, WordPiece, and SentencePiece actually differ in practice.