Tensor Operations
> Tensors are the common language between data and deep learning. Every image, every sentence, every gradient flows through them.
Type: Build
Language: Python
Prerequisites: Phase 1, Lessons 01 (Linear Algebra Intuition), 02 (Vectors, Matrices & Operations)
Time: ~90 minutes
Learning Objectives
- Implement a tensor class with shape, strides, reshape, transpose, and element-wise operations from scratch
- Apply broadcasting rules to operate on tensors of different shapes without copying data
- Write einsum expressions for dot products, matrix multiplications, outer products, and batched operations
- Trace the exact tensor shapes through every step of multi-head attention
The Problem
You build a transformer. The forward pass looks clean. You run it and get: RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x768 and 512x768). You stare at the shapes. You try a transpose. Now it says Expected 4D input (got 3D input). You add an unsqueeze. Something else breaks.
Shape errors are the most common bug in deep learning code. They are not hard conceptually -- each operation has a shape contract -- but they multiply fast. A transformer has dozens of reshapes, transposes, and broadcasts chained together. One wrong axis and the error cascades. Worse, some shape mistakes do not throw errors at all. They silently produce garbage by broadcasting along the wrong dimension or summing over the wrong axis.
Matrices handle pairwise relationships between two sets of things. Real data does not fit into two dimensions. A batch of 32 RGB images at 224x224 is a 4D tensor: (32, 3, 224, 224). Self-attention with 12 heads is also 4D: (batch, heads, seq_len, head_dim). You need a data structure that generalizes to any number of dimensions, with operations that compose cleanly across all of them. That structure is the tensor. Master its operations and shape errors become trivially debuggable.
The Concept
What a tensor is
A tensor is a multi-dimensional array of numbers with a uniform data type. The number of dimensions is the rank (or order). Each dimension is an axis. The shape is a tuple listing the size along each axis.
rank 0
shape: ()"] --> V["Vector
rank 1
shape: (3,)"] V --> M["Matrix
rank 2
shape: (2,3)"] M --> T3["3D Tensor
rank 3
shape: (2,2,2)"] T3 --> T4["4D Tensor
rank 4
shape: (B,C,H,W)"]
Total elements = product of all sizes. A shape (2, 3, 4) holds 2 * 3 * 4 = 24 elements.
Tensor shapes in deep learning
Different data types map to specific tensor shapes by convention.
32, 3, 224, 224"] end subgraph NLP N1["(B, T, D)
16, 128, 768"] end subgraph Attention A1["(B, H, T, D)
16, 12, 128, 64"] end subgraph Weights W1["Linear: (out, in)
Conv2D: (out_c, in_c, kH, kW)
Embedding: (vocab, dim)"] end
PyTorch uses NCHW (channels-first). TensorFlow defaults to NHWC (channels-last). Mismatched layouts cause silent slowdowns or errors.
How memory layout works
A 2D array in memory is a 1D sequence of bytes. Strides tell you how many elements to skip to move one step along each axis.
strides: (3, 1)"] end subgraph "Column-major (F order)" C["a d b e c f
strides: (1, 2)"] end
Transpose does not move data. It swaps the strides, making the tensor non-contiguous -- the elements for a row are no longer adjacent in memory.
Broadcasting rules
Broadcasting lets you operate on tensors of different shapes without copying data. Align shapes from the right. Two dimensions are compatible when they are equal or one is 1. Fewer dimensions get padded with 1s on the left.
Tensor A: (8, 1, 6, 1)
Tensor B: (7, 1, 5)
Padded B: (1, 7, 1, 5)
Result: (8, 7, 6, 5)
Einsum: the universal tensor operation
Einstein summation labels each axis with a letter. Axes in the input but not the output get summed. Axes in both are kept.
Key patterns: i,i-> (dot product), i,j->ij (outer product), ii-> (trace), ij->ji (transpose), bij,bjk->bik (batch matmul), bhtd,bhsd->bhts (attention scores).
Build It
The code lives in code/tensors.py. Each step references the implementation there.
Step 1: Tensor storage and strides
A tensor stores a flat list of numbers plus shape metadata. Strides tell the indexing logic how to map multi-dimensional indices to flat positions.
class Tensor:
def __init__(self, data, shape=None):
if isinstance(data, (list, tuple)):
self._data, self._shape = self._flatten_nested(data)
elif isinstance(data, np.ndarray):
self._data = data.flatten().tolist()
self._shape = tuple(data.shape)
else:
self._data = [data]
self._shape = ()
if shape is not None:
total = reduce(lambda a, b: a * b, shape, 1)
if total != len(self._data):
raise ValueError(
f"Cannot reshape {len(self._data)} elements into shape {shape}"
)
self._shape = tuple(shape)
self._strides = self._compute_strides(self._shape)
@staticmethod
def _compute_strides(shape):
if len(shape) == 0:
return ()
strides = [1] * len(shape)
for i in range(len(shape) - 2, -1, -1):
strides[i] = strides[i + 1] * shape[i + 1]
return tuple(strides)
For shape (3, 4), strides are (4, 1) -- skip 4 elements to advance one row, skip 1 element to advance one column.
Step 2: Reshape, squeeze, unsqueeze
Reshape changes the shape without changing element order. The total number of elements must stay the same. Use -1 for one dimension to infer its size.
t = Tensor(list(range(12)), shape=(2, 6))
r = t.reshape((3, 4))
r = t.reshape((-1, 3))
Squeeze removes axes of size 1. Unsqueeze inserts one. Unsqueezing is critical for broadcasting -- a bias vector (D,) added to a batch (B, T, D) needs unsqueezing to (1, 1, D).
t = Tensor(list(range(6)), shape=(1, 3, 1, 2))
s = t.squeeze()
v = Tensor([1, 2, 3])
u = v.unsqueeze(0)
Step 3: Transpose and permute
Transpose swaps two axes. Permute reorders all axes. This is how you convert between NCHW and NHWC.
mat = Tensor(list(range(6)), shape=(2, 3))
tr = mat.transpose(0, 1)
t4d = Tensor(list(range(24)), shape=(1, 2, 3, 4))
perm = t4d.permute((0, 2, 3, 1))
After transpose or permute, the tensor is non-contiguous in memory. In PyTorch, view fails on non-contiguous tensors -- use reshape or call .contiguous() first.
Step 4: Element-wise operations and reductions
Element-wise ops (add, multiply, subtract) apply independently to each element and preserve shape. Reductions (sum, mean, max) collapse one or more axes.
a = Tensor([[1, 2], [3, 4]])
b = Tensor([[10, 20], [30, 40]])
c = a + b
d = a * 2
s = a.sum(axis=0)
Global average pooling in a CNN: (B, C, H, W).mean(axis=[2, 3]) produces (B, C). Sequence mean pooling in NLP: (B, T, D).mean(axis=1) produces (B, D).
Step 5: Broadcasting with NumPy
The demo_broadcasting_numpy() function in tensors.py shows the core patterns.
activations = np.random.randn(4, 3)
bias = np.array([0.1, 0.2, 0.3])
result = activations + bias
images = np.random.randn(2, 3, 4, 4)
scale = np.array([0.5, 1.0, 1.5]).reshape(1, 3, 1, 1)
result = images * scale
a = np.array([1, 2, 3]).reshape(-1, 1)
b = np.array([10, 20, 30, 40]).reshape(1, -1)
outer = a * b
Pairwise distance via broadcasting: reshape (M, 2) to (M, 1, 2) and (N, 2) to (1, N, 2), subtract, square, sum along last axis, take square root. Result: (M, N).
Step 6: Einsum operations
The demo_einsum() and demo_einsum_gallery() functions walk through every common pattern.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = np.einsum("i,i->", a, b)
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
B = np.array([[7, 8, 9], [10, 11, 12]], dtype=float)
matmul = np.einsum("ik,kj->ij", A, B)
batch_A = np.random.randn(4, 3, 5)
batch_B = np.random.randn(4, 5, 2)
batch_mm = np.einsum("bij,bjk->bik", batch_A, batch_B)
The computational cost of a contraction is the product of all index sizes (kept and summed). For bij,bjk->bik with B=32, I=128, J=64, K=128: 32 * 128 * 64 * 128 = 33,554,432 multiply-adds.
Step 7: Attention mechanism via einsum
The demo_attention_einsum() function implements multi-head attention end to end.
B, H, T, D = 2, 4, 8, 16
E = H * D
X = np.random.randn(B, T, E)
W_q = np.random.randn(E, E) * 0.02
Q = np.einsum("bte,ek->btk", X, W_q)
Q = Q.reshape(B, T, H, D).transpose(0, 2, 1, 3)
scores = np.einsum("bhtd,bhsd->bhts", Q, K) / np.sqrt(D)
weights = softmax(scores, axis=-1)
attn_output = np.einsum("bhts,bhsd->bhtd", weights, V)
concat = attn_output.transpose(0, 2, 1, 3).reshape(B, T, E)
output = np.einsum("bte,ek->btk", concat, W_o)
Every step is a tensor operation: projection (matmul via einsum), head splitting (reshape + transpose), attention scores (batch matmul via einsum), weighted sum (batch matmul via einsum), head merging (transpose + reshape), output projection (matmul via einsum).
Use It
Scratch vs NumPy
| Operation | Scratch (Tensor class) | NumPy |
|---|---|---|
| Create | Tensor([[1,2],[3,4]]) |
np.array([[1,2],[3,4]]) |
| Reshape | t.reshape((3,4)) |
a.reshape(3,4) |
| Transpose | t.transpose(0,1) |
a.T or a.transpose(0,1) |
| Squeeze | t.squeeze(0) |
np.squeeze(a, 0) |
| Sum | t.sum(axis=0) |
a.sum(axis=0) |
| Einsum | N/A | np.einsum("ij,jk->ik", a, b) |
Scratch vs PyTorch
import torch
t = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
t.shape
t.stride()
t.is_contiguous()
t.reshape(3, 2)
t.unsqueeze(0)
t.transpose(0, 1)
t.transpose(0, 1).contiguous()
torch.einsum("ik,kj->ij", A, B)
PyTorch adds autograd, GPU support, and optimized BLAS kernels. The shape semantics are identical. If you understand the scratch version, PyTorch shape errors become readable.
Every neural network layer as a tensor operation
| Operation | Tensor Form | Einsum |
|---|---|---|
| Linear layer | Y = X @ W.T + b |
"bd,od->bo" + bias |
| Attention QKV | Q = X @ W_q |
"btd,dh->bth" |
| Attention scores | Q @ K.T / sqrt(d) |
"bhtd,bhsd->bhts" |
| Attention output | softmax(scores) @ V |
"bhts,bhsd->bhtd" |
| Batch norm | (X - mu) / sigma * gamma |
element-wise + broadcast |
| Softmax | exp(x) / sum(exp(x)) |
element-wise + reduction |
Ship It
This lesson produces two reusable prompts:
outputs/prompt-tensor-shapes.md-- A systematic prompt for debugging tensor shape mismatches. Includes decision tables for every common operation (matmul, broadcast, cat, Linear, Conv2d, BatchNorm, softmax) and a fix lookup table.
outputs/prompt-tensor-debugger.md-- A step-by-step debugging prompt you paste into any AI assistant when a shape error is blocking you. Feed it the error message and your tensor shapes, get back the exact fix.
Exercises
- Easy -- Reshape round-trip. Take a tensor of shape
(2, 3, 4). Reshape it to(6, 4), then to(24,), then back to(2, 3, 4). Verify element order is preserved at each step by printing the flat data.
- Medium -- Implement broadcasting. Extend the
Tensorclass with abroadcast_to(shape)method that expands dimensions of size 1 to match a target shape. Then modify_elementwise_opto automatically broadcast before operating. Test with shapes(3, 1)and(1, 4)producing(3, 4).
- Hard -- Build einsum from scratch. Implement a basic
einsum(subscripts, *tensors)function that handles at least: dot product (i,i->), matrix multiply (ij,jk->ik), outer product (i,j->ij), and transpose (ij->ji). Parse the subscript string, identify contracted indices, and loop over all index combinations. Compare your results againstnp.einsum.
- Hard -- Attention shape tracker. Write a function that takes
batch_size,seq_len,embed_dim, andnum_headsas inputs and prints the exact shape at every step of multi-head attention: input, Q/K/V projection, head split, attention scores, softmax weights, weighted sum, head merge, output projection. Verify against thedemo_attention_einsum()output.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Tensor | "A matrix but more dimensions" | A multi-dimensional array with uniform type and defined shape, strides, and operations |
| Rank | "The number of dimensions" | The number of axes. A matrix has rank 2, not rank equal to its matrix rank |
| Shape | "The size of the tensor" | A tuple listing the size along each axis. (2, 3) means 2 rows, 3 columns |
| Stride | "How memory is laid out" | The number of elements to skip to advance one position along each axis |
| Broadcasting | "It just works when shapes differ" | A strict set of rules: align from right, dimensions must be equal or one must be 1 |
| Contiguous | "The tensor is normal" | Elements stored sequentially in memory with no gaps or reordering from the logical layout |
| Einsum | "A fancy way to write matmul" | A general notation that expresses any tensor contraction, outer product, trace, or transpose in one line |
| View | "Same as reshape" | A tensor sharing the same memory buffer but with different shape/stride metadata. Fails on non-contiguous data |
| Contraction | "Summing over an index" | The general operation where a shared index between tensors is multiplied and summed, producing a lower-rank result |
| NCHW / NHWC | "PyTorch vs TensorFlow format" | Memory layout conventions for image tensors. NCHW puts channels before spatial dims, NHWC puts them after |
Further Reading
- NumPy Broadcasting -- The canonical rules with visual examples
- PyTorch Tensor Views -- When views work and when they copy
- einops -- A library that makes tensor reshaping readable and safe
- The Illustrated Transformer -- Visualizes the tensor shapes flowing through attention
- Einstein Summation in NumPy -- Full einsum documentation with examples