Self-Attention from Scratch
> Attention is a lookup table where every word asks "who matters to me?" - and learns the answer.
Type: Build
Languages: Python
Prerequisites: Phase 3 (Deep Learning Core), Phase 5 Lesson 10 (Sequence-to-Sequence)
Time: ~90 minutes
Learning Objectives
- Implement scaled dot-product self-attention from scratch using only NumPy, including query/key/value projections and the softmax-weighted sum
- Build a multi-head attention layer that splits heads, computes parallel attention, and concatenates results
- Trace how the attention matrix captures token relationships and explain why scaling by sqrt(d_k) prevents softmax saturation
- Apply causal masking to convert bidirectional attention into autoregressive (decoder-style) attention
The Problem
RNNs process sequences one token at a time. By the time you reach token 50, the information from token 1 has been squeezed through 50 compression steps. Long-range dependencies get crushed into a fixed-size hidden state - a bottleneck that no amount of LSTM gating fully solves.
The 2014 Bahdanau attention paper showed the fix: let the decoder look back at every encoder position and decide which ones matter for the current step. But it was still bolted onto an RNN. The 2017 "Attention Is All You Need" paper asked a sharper question: what if attention is the *only* mechanism? No recurrence. No convolution. Just attention.
Self-attention lets every position in a sequence attend to every other position in a single parallel step. That is what makes transformers fast, scalable, and dominant.
The Concept
The Database Lookup Analogy
Think of attention as a soft database lookup:
Traditional database:
Query: "capital of France" --> exact match --> "Paris"
Attention:
Query: "capital of France" --> similarity to ALL keys --> weighted blend of ALL values
Every token generates three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide if selected?"
The dot product between a query and all keys produces attention scores. High score means "this key matches my query." Those scores weight the values. The output is a weighted sum of values.
Q, K, V Computation
Each token embedding gets projected through three learned weight matrices:
Input embeddings (sequence of n tokens, each d-dimensional):
X = [x1, x2, x3, ..., xn] shape: (n, d)
Three weight matrices:
Wq shape: (d, dk)
Wk shape: (d, dk)
Wv shape: (d, dv)
Projections:
Q = X @ Wq shape: (n, dk) each token's query
K = X @ Wk shape: (n, dk) each token's key
V = X @ Wv shape: (n, dv) each token's value
Visually, for one token:
Wq
x_i ------[*]------> q_i "What am I looking for?"
|
| Wk
+----[*]------> k_i "What do I contain?"
|
| Wv
+----[*]------> v_i "What do I offer?"
The Attention Matrix
Once you have Q, K, V for all tokens, attention scores form a matrix:
Scores = Q @ K^T shape: (n, n)
k1 k2 k3 k4 k5
+-----+-----+-----+-----+-----+
q1 | 2.1 | 0.3 | 0.1 | 0.8 | 0.2 | <- how much q1 attends to each key
+-----+-----+-----+-----+-----+
q2 | 0.4 | 1.9 | 0.7 | 0.1 | 0.3 |
+-----+-----+-----+-----+-----+
q3 | 0.2 | 0.6 | 2.3 | 0.5 | 0.1 |
+-----+-----+-----+-----+-----+
q4 | 0.9 | 0.1 | 0.4 | 1.7 | 0.6 |
+-----+-----+-----+-----+-----+
q5 | 0.1 | 0.3 | 0.2 | 0.5 | 2.0 |
+-----+-----+-----+-----+-----+
Each row: one token's attention over the entire sequence
Why Scale?
The dot products grow with dimension dk. If dk = 64, dot products can be in the range of tens, pushing softmax into regions where gradients vanish. The fix: divide by sqrt(dk).
Scaled scores = (Q @ K^T) / sqrt(dk)
This keeps values in a range where softmax produces useful gradients.
Softmax Turns Scores into Weights
Softmax converts raw scores into a probability distribution across each row:
Raw scores for q1: [2.1, 0.3, 0.1, 0.8, 0.2]
|
softmax
|
Attention weights: [0.52, 0.09, 0.07, 0.14, 0.08] (sums to ~1.0)
Now each token has a set of weights saying how much to attend to every other token.
Weighted Sum of Values
The final output for each token is a weighted sum of all value vectors:
output_i = sum( attention_weight[i][j] * v_j for all j )
For token 1:
output_1 = 0.52 * v1 + 0.09 * v2 + 0.07 * v3 + 0.14 * v4 + 0.08 * v5
Full Pipeline
X (input) ----->| @ Wq |-----> Q
+-------+
+-------+
X (input) ----->| @ Wk |-----> K
+-------+ +----------+
+-------+ | |
X (input) ----->| @ Wv |-----> V ---------->| weighted |----> output
+-------+ ^ | sum |
| +----------+
+--------+--------+
| softmax |
+---------+-------+
^
+---------+-------+
| Q @ K^T / sqrt |
+-----------------+
Formula in one line:
Attention(Q, K, V) = softmax( Q @ K^T / sqrt(dk) ) @ V
Build It
Step 1: Softmax from scratch
Softmax converts raw logits into probabilities. Subtract the max for numerical stability.
import numpy as np
def softmax(x):
shifted = x - np.max(x, axis=-1, keepdims=True)
exp_x = np.exp(shifted)
return exp_x / np.sum(exp_x, axis=-1, keepdims=True)
logits = np.array([2.0, 1.0, 0.1])
print(f"logits: {logits}")
print(f"softmax: {softmax(logits)}")
print(f"sum: {softmax(logits).sum():.4f}")
Step 2: Scaled dot-product attention
The core function. Takes Q, K, V matrices and returns the attention output plus the weight matrix.
def scaled_dot_product_attention(Q, K, V):
dk = Q.shape[-1]
scores = Q @ K.T / np.sqrt(dk)
weights = softmax(scores)
output = weights @ V
return output, weights
Step 3: Self-attention class with learned projections
A full self-attention module with Wq, Wk, Wv weight matrices initialized with Xavier-like scaling.
class SelfAttention:
def __init__(self, d_model, dk, dv, seed=42):
rng = np.random.default_rng(seed)
scale = np.sqrt(2.0 / (d_model + dk))
self.Wq = rng.normal(0, scale, (d_model, dk))
self.Wk = rng.normal(0, scale, (d_model, dk))
scale_v = np.sqrt(2.0 / (d_model + dv))
self.Wv = rng.normal(0, scale_v, (d_model, dv))
self.dk = dk
def forward(self, X):
Q = X @ self.Wq
K = X @ self.Wk
V = X @ self.Wv
output, weights = scaled_dot_product_attention(Q, K, V)
return output, weights
Step 4: Run it on a sentence
Create fake embeddings for a sentence and watch the attention weights.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
n_tokens = len(sentence)
d_model = 8
dk = 4
dv = 4
rng = np.random.default_rng(42)
X = rng.normal(0, 1, (n_tokens, d_model))
attn = SelfAttention(d_model, dk, dv, seed=42)
output, weights = attn.forward(X)
print("Attention weights (each row: where that token looks):\n")
print(f"{'':>6}", end="")
for token in sentence:
print(f"{token:>6}", end="")
print()
for i, token in enumerate(sentence):
print(f"{token:>6}", end="")
for j in range(n_tokens):
w = weights[i][j]
print(f"{w:6.3f}", end="")
print()
Step 5: Visualize attention with ASCII heatmap
Map attention weights to characters for a quick visual.
def ascii_heatmap(weights, tokens, chars=" ░▒▓█"):
n = len(tokens)
print(f"\n{'':>6}", end="")
for t in tokens:
print(f"{t:>6}", end="")
print()
for i in range(n):
print(f"{tokens[i]:>6}", end="")
for j in range(n):
level = int(weights[i][j] * (len(chars) - 1) / weights.max())
level = min(level, len(chars) - 1)
print(f"{' ' + chars[level] + ' '}", end="")
print()
ascii_heatmap(weights, sentence)
Use It
PyTorch's nn.MultiheadAttention does exactly what we built, plus multi-head splitting and output projection:
import torch
import torch.nn as nn
d_model = 8
n_heads = 2
seq_len = 6
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
X_torch = torch.randn(1, seq_len, d_model)
output, attn_weights = mha(X_torch, X_torch, X_torch)
print(f"Input shape: {X_torch.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weight shape: {attn_weights.shape}")
print(f"\nAttn weights (averaged over heads):")
print(attn_weights[0].detach().numpy().round(3))
The key difference: multi-head attention runs multiple attention functions in parallel, each with its own Q, K, V projections of size dk = d_model / n_heads, then concatenates results. This lets the model attend to different relationship types simultaneously.
Ship It
This lesson produces:
outputs/prompt-attention-explainer.md- a prompt for explaining attention through the database lookup analogy
Exercises
- Modify
scaled_dot_product_attentionto accept an optional mask matrix that sets certain positions to negative infinity before softmax (this is how causal/decoder masking works) - Implement multi-head attention from scratch: split Q, K, V into
n_headschunks, run attention on each, concatenate, and project through a final weight matrix Wo - Take two different sentences of the same length, feed them through the same SelfAttention instance, and compare their attention patterns. What changes? What stays the same?
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Query (Q) | "The question vector" | A learned projection of the input that represents what information this token is looking for |
| Key (K) | "The label vector" | A learned projection that represents what information this token contains, matched against queries |
| Value (V) | "The content vector" | A learned projection carrying the actual information that gets aggregated based on attention scores |
| Scaled dot-product attention | "The attention formula" | softmax(QK^T / sqrt(dk)) @ V - scaling prevents softmax saturation in high dimensions |
| Self-attention | "The token looks at itself and others" | Attention where Q, K, V all come from the same sequence, letting every position attend to every other position |
| Attention weights | "How much focus" | A probability distribution over positions, produced by softmax over scaled dot products |
| Multi-head attention | "Parallel attention" | Running multiple attention functions with different projections, then concatenating results for richer representations |
Further Reading
- Attention Is All You Need (Vaswani et al., 2017) - the original transformer paper
- The Illustrated Transformer (Jay Alammar) - best visual walkthrough of the full architecture
- The Annotated Transformer (Harvard NLP) - line-by-line PyTorch implementation with explanations