← Structured Outputs: JSON, Schema Validation, Constrained Decoding Context Engineering: Windows, Budgets, Memory, and Retrieval →

Embeddings & Vector Representations

> Text is discrete. Math is continuous. Every time you ask an LLM to find "similar" documents, compare meanings, or search beyond keywords, you're relying on a bridge between these two worlds. That bridge is an embedding. If you don't understand embeddings, you don't understand modern AI. You just use it.

Type: Build

Languages: Python

Prerequisites: Phase 11, Lesson 01 (Prompt Engineering)

Time: ~75 minutes

Related: Phase 5 · 22 (Embedding Models Deep Dive) covers dense vs sparse vs multi-vector, Matryoshka truncation, and per-axis model selection. This lesson focuses on the production pipeline (vector DBs, HNSW, similarity math). Read Phase 5 · 22 before picking a model.

Learning Objectives

Generate text embeddings using API providers and open-source models, and compute cosine similarity between them
Explain why embeddings solve the vocabulary mismatch problem that keyword search cannot handle
Build a semantic search index that retrieves documents by meaning rather than exact keyword match
Evaluate embedding quality using retrieval benchmarks (precision@k, recall) and choose the right embedding model for your task

The Problem

You have 10,000 support tickets. A customer writes "my payment didn't go through." You need to find similar past tickets. Keyword search finds tickets containing "payment" and "didn't go through." It misses "transaction failed," "charge was declined," and "billing error." These tickets describe the exact same problem with completely different words.

This is the vocabulary mismatch problem. Human language has dozens of ways to say the same thing. Keyword search treats each word as an independent symbol with no meaning. It cannot know that "declined" and "didn't go through" refer to the same concept.

You need a representation of text where meaning, not spelling, determines similarity. You need a way to place "my payment didn't go through" and "transaction was declined" close together in some mathematical space, while pushing "my payment arrived on time" far away despite sharing the word "payment."

That representation is an embedding.

The Concept

What Is an Embedding?

An embedding is a dense vector of floating-point numbers that represents the meaning of text. The word "dense" matters -- every dimension carries information, unlike sparse representations (bag-of-words, TF-IDF) where most dimensions are zero.

"The cat sat on the mat" becomes something like [0.023, -0.041, 0.087, ..., 0.012] -- a list of 768 to 3072 numbers depending on the model. These numbers encode meaning. You never inspect them directly. You compare them.

The Word2Vec Breakthrough

In 2013, Tomas Mikolov and colleagues at Google published Word2Vec. The core insight: train a neural network to predict a word from its neighbors (or neighbors from a word), and the hidden layer weights become meaningful vector representations.

The famous result:

king - man + woman = queen

Vector arithmetic on word embeddings captures semantic relationships. The direction from "man" to "woman" is roughly the same as the direction from "king" to "queen." This was the moment the field realized that geometry could encode meaning.

Word2Vec produced 300-dimensional vectors. Each word got one vector regardless of context. "Bank" in "river bank" and "bank account" had the same embedding. This limitation drove the next decade of research.

From Words to Sentences

Word embeddings represent single tokens. Production systems need to embed entire sentences, paragraphs, or documents. Four approaches emerged:

Averaging: take the mean of all word vectors in the sentence. Cheap, lossy, surprisingly decent for short text. Loses word order entirely -- "dog bites man" and "man bites dog" get identical embeddings.

CLS token: transformer models (BERT, 2018) output a special [CLS] token embedding that represents the entire input. Better than averaging but the [CLS] token was trained for next-sentence prediction, not similarity.

Contrastive learning: train the model explicitly to push similar pairs together and dissimilar pairs apart. Sentence-BERT (Reimers & Gurevych, 2019) used this approach and became the foundation for modern embedding models. Given "How do I reset my password?" and "I need to change my password," the model learns these should have nearly identical vectors.

Instruction-tuned embeddings: the latest approach. Models like E5 and GTE accept a task prefix ("search_query:", "search_document:") that tells the model what kind of embedding to produce. This lets one model serve multiple tasks.

graph LR subgraph "2013: Word2Vec" W1["king"] --> V1["[0.2, -0.1, ...]"] W2["queen"] --> V2["[0.3, -0.2, ...]"] end subgraph "2019: Sentence-BERT" S1["How do I reset my password?"] --> E1["[0.04, 0.12, ...]"] S2["I need to change my password"] --> E2["[0.05, 0.11, ...]"] end subgraph "2024: Instruction-Tuned" I1["search_query: password reset"] --> T1["[0.08, 0.09, ...]"] I2["search_document: To reset your password, click..."] --> T2["[0.07, 0.10, ...]"] end

Modern Embedding Models

The market has settled into a handful of production-grade options (MTEB scores as of early 2026, MTEB v2):

Model	Provider	Dimensions	MTEB	Context	Cost / 1M tokens
Gemini Embedding 2	Google	3072 (Matryoshka)	67.7 (retrieval)	8192	$0.15
embed-v4	Cohere	1024 (Matryoshka)	65.2	128K	$0.12
voyage-4	Voyage AI	1024/2048 (Matryoshka)	66.8	32K	$0.12
text-embedding-3-large	OpenAI	3072 (Matryoshka)	64.6	8192	$0.13
text-embedding-3-small	OpenAI	1536 (Matryoshka)	62.3	8192	$0.02
BGE-M3	BAAI	1024 (dense+sparse+ColBERT)	63.0 multilingual	8192	Open-weight
Qwen3-Embedding	Alibaba	4096 (Matryoshka)	66.9	32K	Open-weight
Nomic-embed-v2	Nomic	768 (Matryoshka)	63.1	8192	Open-weight

MTEB (Massive Text Embedding Benchmark) v2 covers 100+ tasks across retrieval, classification, clustering, reranking, and summarization. Higher is better. By 2026, open-weight models (Qwen3-Embedding, BGE-M3) match or beat closed hosted models on most axes. Gemini Embedding 2 leads pure retrieval; Voyage/Cohere lead specific domains (finance, law, code). Always benchmark on your own queries before committing.

Similarity Metrics

Given two embedding vectors, three ways to measure how similar they are:

Cosine similarity: the cosine of the angle between two vectors. Ranges from -1 (opposite) to 1 (identical direction). Ignores magnitude -- a 10-word sentence and a 500-word document can score 1.0 if they point the same direction. This is the default for 90% of use cases.

cosine_sim(a, b) = dot(a, b) / (||a|| * ||b||)

Dot product: the raw inner product of two vectors. Identical to cosine similarity when vectors are normalized (unit length). Faster to compute. OpenAI's embeddings are normalized, so dot product and cosine give the same ranking.

dot(a, b) = sum(a_i * b_i)

Euclidean (L2) distance: straight-line distance in the vector space. Smaller = more similar. Sensitive to magnitude differences. Use when the absolute position in space matters, not just the direction.

L2(a, b) = sqrt(sum((a_i - b_i)^2))

When to use which:

Metric	Use when	Avoid when
Cosine similarity	Comparing texts of different lengths; most retrieval tasks	Magnitude carries information
Dot product	Embeddings are already normalized; maximum speed	Vectors have varying magnitudes
Euclidean distance	Clustering; spatial nearest-neighbor problems	Comparing documents of wildly different lengths

Vector Databases and HNSW

A brute-force similarity search compares the query against every stored vector. At 1 million vectors with 1536 dimensions, that is 1.5 billion multiply-add operations per query. Too slow.

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. The dominant algorithm is HNSW (Hierarchical Navigable Small World):

Build a multi-layer graph of vectors
Top layers are sparse -- long-range connections between distant clusters
Bottom layers are dense -- fine-grained connections between nearby vectors
Search starts at the top layer, greedily descending to refine
Returns approximate top-k results in O(log n) time instead of O(n)

HNSW trades a small accuracy loss (typically 95-99% recall) for massive speed gains. At 10 million vectors, brute force takes seconds. HNSW takes milliseconds.

Production options:

Database	Type	Best for	Max scale
Pinecone	Managed SaaS	Zero-ops production	Billions
Weaviate	Open source	Self-hosted, hybrid search	100M+
Qdrant	Open source	High performance, filtering	100M+
ChromaDB	Embedded	Prototyping, local dev	1M
pgvector	Postgres extension	Already using Postgres	10M
FAISS	Library	In-process, research	1B+

Chunking Strategies

Documents are too long to embed as single vectors. A 50-page PDF covers dozens of topics -- its embedding becomes an average of everything, similar to nothing specific. You split documents into chunks and embed each one.

Fixed-size chunking: split every N tokens with M-token overlap. Simple and predictable. Works well when documents have no clear structure. A 512-token chunk with 50-token overlap: chunk 1 is tokens 0-511, chunk 2 is tokens 462-973.

Sentence-based chunking: split at sentence boundaries, grouping sentences until reaching the token limit. Each chunk is at least one complete sentence. Better than fixed-size because you never cut a thought in half.

Recursive chunking: try splitting at the largest boundary first (section headers). If still too large, try paragraph boundaries. Then sentence boundaries. Then character limits. This is LangChain's RecursiveCharacterTextSplitter and it works well for mixed-format corpora.

Semantic chunking: embed each sentence, then group consecutive sentences whose embeddings are similar. When the embedding similarity drops below a threshold, start a new chunk. Expensive (requires embedding every sentence individually) but produces the most coherent chunks.

Strategy	Complexity	Quality	Best for
Fixed-size	Low	Decent	Unstructured text, logs
Sentence-based	Low	Good	Articles, emails
Recursive	Medium	Good	Markdown, HTML, mixed docs
Semantic	High	Best	Critical retrieval quality

The sweet spot for most systems: 256-512 token chunks with 50-token overlap.

Bi-Encoders vs Cross-Encoders

A bi-encoder embeds the query and documents independently, then compares vectors. Fast -- you embed the query once and compare against pre-computed document embeddings. This is what you use for retrieval.

A cross-encoder takes the query and a document as a single input and outputs a relevance score. Slow -- it processes each query-document pair through the full model. But far more accurate because it can attend across query and document tokens simultaneously.

The production pattern: bi-encoder retrieves top-100 candidates, cross-encoder reranks them to top-10. This is the retrieve-then-rerank pipeline.

graph LR Q["Query"] --> BE["Bi-Encoder: embed query"] BE --> VS["Vector search: top 100"] VS --> CE["Cross-Encoder: rerank"] CE --> R["Top 10 results"]

Reranking models: Cohere Rerank 3.5 ($2 per 1000 queries), BGE-reranker-v2 (free, open source), Jina Reranker v2 (free, open source).

Matryoshka Embeddings

Traditional embeddings are all-or-nothing. A 1536-dimensional vector uses 1536 floats. You cannot truncate to 256 dimensions without retraining.

Matryoshka Representation Learning (Kusupati et al., 2022) fixes this. The model is trained so that the first N dimensions capture the most important information, like a Russian nesting doll. Truncating a 1536-d Matryoshka embedding to 256 dimensions loses some accuracy but remains functional.

OpenAI's text-embedding-3-small and text-embedding-3-large support Matryoshka truncation via the dimensions parameter. Requesting 256 dimensions instead of 1536 cuts storage by 6x with roughly 3-5% accuracy loss on MTEB benchmarks.

Binary Quantization

A 1536-dimensional embedding stored as float32 uses 6,144 bytes. Multiply by 10 million documents: 61 GB just for vectors.

Binary quantization converts each float to a single bit: positive values become 1, negative values become 0. Storage drops from 6,144 bytes to 192 bytes -- a 32x reduction. Similarity is computed using Hamming distance (count differing bits), which CPUs can do in a single instruction.

The accuracy hit is around 5-10% on retrieval recall. The common pattern: binary quantization for the first-pass search over millions of vectors, then rescore the top-1000 with full-precision vectors. This gets you 95%+ of full-precision accuracy at 32x less memory.

Build It

We build a semantic search engine from scratch. No vector database. No external embedding API. Pure Python with numpy for the math.

Step 1: Text Chunking

def chunk_text(text, chunk_size=200, overlap=50):
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap
    return chunks


def chunk_by_sentences(text, max_chunk_tokens=200):
    sentences = text.replace("\n", " ").split(".")
    sentences = [s.strip() + "." for s in sentences if s.strip()]
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence_length = len(sentence.split())
        if current_length + sentence_length > max_chunk_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(sentence)
        current_length += sentence_length
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

Step 2: Building Embeddings from Scratch

We implement a simple dense embedding using TF-IDF with L2 normalization. This is not a neural embedding, but it follows the same contract: text in, fixed-size vector out, similar texts produce similar vectors.

import math
import numpy as np
from collections import Counter

class SimpleEmbedder:
    def __init__(self):
        self.vocab = []
        self.idf = []
        self.word_to_idx = {}

    def fit(self, documents):
        vocab_set = set()
        for doc in documents:
            vocab_set.update(doc.lower().split())
        self.vocab = sorted(vocab_set)
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        n = len(documents)
        self.idf = np.zeros(len(self.vocab))
        for i, word in enumerate(self.vocab):
            doc_count = sum(1 for doc in documents if word in doc.lower().split())
            self.idf[i] = math.log((n + 1) / (doc_count + 1)) + 1

    def embed(self, text):
        words = text.lower().split()
        count = Counter(words)
        total = len(words) if words else 1
        vec = np.zeros(len(self.vocab))
        for word, freq in count.items():
            if word in self.word_to_idx:
                tf = freq / total
                vec[self.word_to_idx[word]] = tf * self.idf[self.word_to_idx[word]]
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm
        return vec

Step 3: Similarity Functions

def cosine_similarity(a, b):
    dot = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return float(dot / (norm_a * norm_b))


def dot_product(a, b):
    return float(np.dot(a, b))


def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

Step 4: Vector Index with Brute-Force Search

class VectorIndex:
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []

    def add(self, vector, text, meta=None):
        self.vectors.append(vector)
        self.texts.append(text)
        self.metadata.append(meta or {})

    def search(self, query_vector, top_k=5, metric="cosine"):
        scores = []
        for i, vec in enumerate(self.vectors):
            if metric == "cosine":
                score = cosine_similarity(query_vector, vec)
            elif metric == "dot":
                score = dot_product(query_vector, vec)
            elif metric == "euclidean":
                score = -euclidean_distance(query_vector, vec)
            else:
                raise ValueError(f"Unknown metric: {metric}")
            scores.append((i, score))
        scores.sort(key=lambda x: x[1], reverse=True)
        results = []
        for idx, score in scores[:top_k]:
            results.append({
                "text": self.texts[idx],
                "score": score,
                "metadata": self.metadata[idx],
                "index": idx
            })
        return results

    def size(self):
        return len(self.vectors)

Step 5: The Semantic Search Engine

class SemanticSearchEngine:
    def __init__(self, chunk_size=200, overlap=50):
        self.embedder = SimpleEmbedder()
        self.index = VectorIndex()
        self.chunk_size = chunk_size
        self.overlap = overlap

    def index_documents(self, documents, source_names=None):
        all_chunks = []
        all_sources = []
        for i, doc in enumerate(documents):
            chunks = chunk_text(doc, self.chunk_size, self.overlap)
            all_chunks.extend(chunks)
            name = source_names[i] if source_names else f"doc_{i}"
            all_sources.extend([name] * len(chunks))
        self.embedder.fit(all_chunks)
        for chunk, source in zip(all_chunks, all_sources):
            vec = self.embedder.embed(chunk)
            self.index.add(vec, chunk, {"source": source})
        return len(all_chunks)

    def search(self, query, top_k=5, metric="cosine"):
        query_vec = self.embedder.embed(query)
        return self.index.search(query_vec, top_k, metric)

    def search_with_scores(self, query, top_k=5):
        results = self.search(query, top_k)
        return [
            {
                "text": r["text"][:200],
                "source": r["metadata"].get("source", "unknown"),
                "score": round(r["score"], 4)
            }
            for r in results
        ]

Step 6: Comparing Similarity Metrics

def compare_metrics(engine, query, top_k=3):
    results = {}
    for metric in ["cosine", "dot", "euclidean"]:
        hits = engine.search(query, top_k=top_k, metric=metric)
        results[metric] = [
            {"score": round(h["score"], 4), "preview": h["text"][:80]}
            for h in hits
        ]
    return results

Use It

With a production embedding API, the architecture stays identical. Only the embedder changes:

from openai import OpenAI

client = OpenAI()

def openai_embed(texts, model="text-embedding-3-small", dimensions=None):
    kwargs = {"model": model, "input": texts}
    if dimensions:
        kwargs["dimensions"] = dimensions
    response = client.embeddings.create(**kwargs)
    return [item.embedding for item in response.data]

Matryoshka truncation with OpenAI -- same model, fewer dimensions, lower storage:

full = openai_embed(["semantic search query"], dimensions=1536)
compact = openai_embed(["semantic search query"], dimensions=256)

The 256-d vector uses 6x less storage. For 10 million documents, that is 10 GB vs 61 GB. The accuracy loss is roughly 3-5% on standard benchmarks.

For reranking with Cohere:

import cohere

co = cohere.ClientV2()

results = co.rerank(
    model="rerank-v3.5",
    query="What is the refund policy?",
    documents=["Full refund within 30 days...", "No refunds after 90 days..."],
    top_n=3
)

For local embeddings with no API dependency:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(["semantic search query", "another document"])

The VectorIndex class from our build works with any of these. Swap the embedding function, keep the search logic.

Ship It

This lesson produces:

outputs/prompt-embedding-advisor.md -- a prompt for choosing embedding models and strategies for specific use cases
outputs/skill-embedding-patterns.md -- a skill that teaches agents how to use embeddings effectively in production

Exercises

Metric comparison: run the same 5 queries against the sample documents using cosine similarity, dot product, and euclidean distance. Record the top-3 results for each. For which queries do the metrics disagree? Why?

Chunk size experiment: index the sample documents with chunk sizes of 50, 100, 200, and 500 words. For each, run 5 queries and record the top-1 similarity score. Plot the relationship between chunk size and retrieval quality. Find the point where larger chunks start hurting.

Matryoshka simulation: build a SimpleEmbedder that produces 500-d vectors. Truncate to 50, 100, 200, and 500 dimensions. Measure how retrieval recall degrades at each truncation. This simulates Matryoshka behavior without needing the real training trick.

Binary quantization: take the embeddings from the search engine, convert them to binary (1 if positive, 0 if negative), and implement Hamming distance search. Compare the top-10 results against full-precision cosine similarity. Measure the overlap percentage.

Sentence-based chunking: replace fixed-size chunking with chunk_by_sentences. Run the same queries and compare retrieval scores. Does respecting sentence boundaries improve the results?

Key Terms

Term	What people say	What it actually means
Embedding	"Text to numbers"	A dense vector where geometric proximity encodes semantic similarity
Word2Vec	"The OG embedding"	2013 model that learned word vectors by predicting context words; proved vector arithmetic encodes meaning
Cosine similarity	"How similar are two vectors"	Cosine of the angle between vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite
HNSW	"Fast vector search"	Hierarchical Navigable Small World graph -- multi-layer structure enabling O(log n) approximate nearest neighbor search
Bi-encoder	"Embed separately, compare fast"	Encodes query and document independently into vectors; enables pre-computation and fast retrieval
Cross-encoder	"Slow but accurate reranker"	Processes query-document pair jointly through the full model; higher accuracy, no pre-computation
Matryoshka embeddings	"Truncatable vectors"	Embeddings trained so the first N dimensions capture the most important information, enabling variable-size storage
Binary quantization	"1-bit embeddings"	Converting float vectors to binary (sign bit only) for 32x storage reduction with Hamming distance search
Chunking	"Split docs for embedding"	Breaking documents into 256-512 token segments so each can be independently embedded and retrieved
Vector database	"Search engine for embeddings"	Data store optimized for storing vectors and performing approximate nearest neighbor search at scale
Contrastive learning	"Train by comparison"	Training approach that pushes similar pair embeddings together and dissimilar pair embeddings apart
MTEB	"The embedding benchmark"	Massive Text Embedding Benchmark -- 56 datasets across 8 tasks; standard for comparing embedding models