← Singular Value Decomposition Numerical Stability →

Tensor Operations

> Tensors are the common language between data and deep learning. Every image, every sentence, every gradient flows through them.

Type: Build

Language: Python

Prerequisites: Phase 1, Lessons 01 (Linear Algebra Intuition), 02 (Vectors, Matrices & Operations)

Time: ~90 minutes

Learning Objectives

Implement a tensor class with shape, strides, reshape, transpose, and element-wise operations from scratch
Apply broadcasting rules to operate on tensors of different shapes without copying data
Write einsum expressions for dot products, matrix multiplications, outer products, and batched operations
Trace the exact tensor shapes through every step of multi-head attention

The Problem

You build a transformer. The forward pass looks clean. You run it and get: RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x768 and 512x768). You stare at the shapes. You try a transpose. Now it says Expected 4D input (got 3D input). You add an unsqueeze. Something else breaks.

Shape errors are the most common bug in deep learning code. They are not hard conceptually -- each operation has a shape contract -- but they multiply fast. A transformer has dozens of reshapes, transposes, and broadcasts chained together. One wrong axis and the error cascades. Worse, some shape mistakes do not throw errors at all. They silently produce garbage by broadcasting along the wrong dimension or summing over the wrong axis.

Matrices handle pairwise relationships between two sets of things. Real data does not fit into two dimensions. A batch of 32 RGB images at 224x224 is a 4D tensor: (32, 3, 224, 224). Self-attention with 12 heads is also 4D: (batch, heads, seq_len, head_dim). You need a data structure that generalizes to any number of dimensions, with operations that compose cleanly across all of them. That structure is the tensor. Master its operations and shape errors become trivially debuggable.

The Concept

What a tensor is

A tensor is a multi-dimensional array of numbers with a uniform data type. The number of dimensions is the rank (or order). Each dimension is an axis. The shape is a tuple listing the size along each axis.

graph LR S["Scalar
rank 0
shape: ()"] --> V["Vector
rank 1
shape: (3,)"] V --> M["Matrix
rank 2
shape: (2,3)"] M --> T3["3D Tensor
rank 3
shape: (2,2,2)"] T3 --> T4["4D Tensor
rank 4
shape: (B,C,H,W)"]

Total elements = product of all sizes. A shape (2, 3, 4) holds 2 * 3 * 4 = 24 elements.

Tensor shapes in deep learning

Different data types map to specific tensor shapes by convention.

graph TD subgraph Vision V1["(B, C, H, W)
32, 3, 224, 224"] end subgraph NLP N1["(B, T, D)
16, 128, 768"] end subgraph Attention A1["(B, H, T, D)
16, 12, 128, 64"] end subgraph Weights W1["Linear: (out, in)
Conv2D: (out_c, in_c, kH, kW)
Embedding: (vocab, dim)"] end

PyTorch uses NCHW (channels-first). TensorFlow defaults to NHWC (channels-last). Mismatched layouts cause silent slowdowns or errors.

How memory layout works

A 2D array in memory is a 1D sequence of bytes. Strides tell you how many elements to skip to move one step along each axis.

graph LR subgraph "Row-major (C order)" R["a b c d e f
strides: (3, 1)"] end subgraph "Column-major (F order)" C["a d b e c f
strides: (1, 2)"] end

Transpose does not move data. It swaps the strides, making the tensor non-contiguous -- the elements for a row are no longer adjacent in memory.

Broadcasting rules

Broadcasting lets you operate on tensors of different shapes without copying data. Align shapes from the right. Two dimensions are compatible when they are equal or one is 1. Fewer dimensions get padded with 1s on the left.

Tensor A:     (8, 1, 6, 1)
Tensor B:        (7, 1, 5)
Padded B:     (1, 7, 1, 5)
Result:       (8, 7, 6, 5)

Einsum: the universal tensor operation

Einstein summation labels each axis with a letter. Axes in the input but not the output get summed. Axes in both are kept.

graph LR subgraph "matmul: ik,kj -> ij" A["A(I,K)"] --> |"sum over k"| C["C(I,J)"] B["B(K,J)"] --> |"sum over k"| C end

Key patterns: i,i-> (dot product), i,j->ij (outer product), ii-> (trace), ij->ji (transpose), bij,bjk->bik (batch matmul), bhtd,bhsd->bhts (attention scores).

Build It

The code lives in code/tensors.py. Each step references the implementation there.

Step 1: Tensor storage and strides

A tensor stores a flat list of numbers plus shape metadata. Strides tell the indexing logic how to map multi-dimensional indices to flat positions.

class Tensor:
    def __init__(self, data, shape=None):
        if isinstance(data, (list, tuple)):
            self._data, self._shape = self._flatten_nested(data)
        elif isinstance(data, np.ndarray):
            self._data = data.flatten().tolist()
            self._shape = tuple(data.shape)
        else:
            self._data = [data]
            self._shape = ()

        if shape is not None:
            total = reduce(lambda a, b: a * b, shape, 1)
            if total != len(self._data):
                raise ValueError(
                    f"Cannot reshape {len(self._data)} elements into shape {shape}"
                )
            self._shape = tuple(shape)

        self._strides = self._compute_strides(self._shape)

    @staticmethod
    def _compute_strides(shape):
        if len(shape) == 0:
            return ()
        strides = [1] * len(shape)
        for i in range(len(shape) - 2, -1, -1):
            strides[i] = strides[i + 1] * shape[i + 1]
        return tuple(strides)

For shape (3, 4), strides are (4, 1) -- skip 4 elements to advance one row, skip 1 element to advance one column.

Step 2: Reshape, squeeze, unsqueeze

Reshape changes the shape without changing element order. The total number of elements must stay the same. Use -1 for one dimension to infer its size.

t = Tensor(list(range(12)), shape=(2, 6))
r = t.reshape((3, 4))
r = t.reshape((-1, 3))

Squeeze removes axes of size 1. Unsqueeze inserts one. Unsqueezing is critical for broadcasting -- a bias vector (D,) added to a batch (B, T, D) needs unsqueezing to (1, 1, D).

t = Tensor(list(range(6)), shape=(1, 3, 1, 2))
s = t.squeeze()
v = Tensor([1, 2, 3])
u = v.unsqueeze(0)

Step 3: Transpose and permute

Transpose swaps two axes. Permute reorders all axes. This is how you convert between NCHW and NHWC.

mat = Tensor(list(range(6)), shape=(2, 3))
tr = mat.transpose(0, 1)

t4d = Tensor(list(range(24)), shape=(1, 2, 3, 4))
perm = t4d.permute((0, 2, 3, 1))

After transpose or permute, the tensor is non-contiguous in memory. In PyTorch, view fails on non-contiguous tensors -- use reshape or call .contiguous() first.

Step 4: Element-wise operations and reductions

Element-wise ops (add, multiply, subtract) apply independently to each element and preserve shape. Reductions (sum, mean, max) collapse one or more axes.

a = Tensor([[1, 2], [3, 4]])
b = Tensor([[10, 20], [30, 40]])
c = a + b
d = a * 2
s = a.sum(axis=0)

Global average pooling in a CNN: (B, C, H, W).mean(axis=[2, 3]) produces (B, C). Sequence mean pooling in NLP: (B, T, D).mean(axis=1) produces (B, D).

Step 5: Broadcasting with NumPy

The demo_broadcasting_numpy() function in tensors.py shows the core patterns.

activations = np.random.randn(4, 3)
bias = np.array([0.1, 0.2, 0.3])
result = activations + bias

images = np.random.randn(2, 3, 4, 4)
scale = np.array([0.5, 1.0, 1.5]).reshape(1, 3, 1, 1)
result = images * scale

a = np.array([1, 2, 3]).reshape(-1, 1)
b = np.array([10, 20, 30, 40]).reshape(1, -1)
outer = a * b

Pairwise distance via broadcasting: reshape (M, 2) to (M, 1, 2) and (N, 2) to (1, N, 2), subtract, square, sum along last axis, take square root. Result: (M, N).

Step 6: Einsum operations

The demo_einsum() and demo_einsum_gallery() functions walk through every common pattern.

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = np.einsum("i,i->", a, b)

A = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)
B = np.array([[7, 8, 9], [10, 11, 12]], dtype=float)
matmul = np.einsum("ik,kj->ij", A, B)

batch_A = np.random.randn(4, 3, 5)
batch_B = np.random.randn(4, 5, 2)
batch_mm = np.einsum("bij,bjk->bik", batch_A, batch_B)

The computational cost of a contraction is the product of all index sizes (kept and summed). For bij,bjk->bik with B=32, I=128, J=64, K=128: 32 * 128 * 64 * 128 = 33,554,432 multiply-adds.

Step 7: Attention mechanism via einsum

The demo_attention_einsum() function implements multi-head attention end to end.

B, H, T, D = 2, 4, 8, 16
E = H * D

X = np.random.randn(B, T, E)
W_q = np.random.randn(E, E) * 0.02

Q = np.einsum("bte,ek->btk", X, W_q)
Q = Q.reshape(B, T, H, D).transpose(0, 2, 1, 3)

scores = np.einsum("bhtd,bhsd->bhts", Q, K) / np.sqrt(D)
weights = softmax(scores, axis=-1)
attn_output = np.einsum("bhts,bhsd->bhtd", weights, V)

concat = attn_output.transpose(0, 2, 1, 3).reshape(B, T, E)
output = np.einsum("bte,ek->btk", concat, W_o)

Every step is a tensor operation: projection (matmul via einsum), head splitting (reshape + transpose), attention scores (batch matmul via einsum), weighted sum (batch matmul via einsum), head merging (transpose + reshape), output projection (matmul via einsum).

Use It

Scratch vs NumPy

Operation	Scratch (Tensor class)	NumPy
Create	`Tensor([[1,2],[3,4]])`	`np.array([[1,2],[3,4]])`
Reshape	`t.reshape((3,4))`	`a.reshape(3,4)`
Transpose	`t.transpose(0,1)`	`a.T` or `a.transpose(0,1)`
Squeeze	`t.squeeze(0)`	`np.squeeze(a, 0)`
Sum	`t.sum(axis=0)`	`a.sum(axis=0)`
Einsum	N/A	`np.einsum("ij,jk->ik", a, b)`

Scratch vs PyTorch

import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
t.shape
t.stride()
t.is_contiguous()

t.reshape(3, 2)
t.unsqueeze(0)
t.transpose(0, 1)
t.transpose(0, 1).contiguous()

torch.einsum("ik,kj->ij", A, B)

PyTorch adds autograd, GPU support, and optimized BLAS kernels. The shape semantics are identical. If you understand the scratch version, PyTorch shape errors become readable.

Every neural network layer as a tensor operation

Operation	Tensor Form	Einsum
Linear layer	`Y = X @ W.T + b`	`"bd,od->bo"` + bias
Attention QKV	`Q = X @ W_q`	`"btd,dh->bth"`
Attention scores	`Q @ K.T / sqrt(d)`	`"bhtd,bhsd->bhts"`
Attention output	`softmax(scores) @ V`	`"bhts,bhsd->bhtd"`
Batch norm	`(X - mu) / sigma * gamma`	element-wise + broadcast
Softmax	`exp(x) / sum(exp(x))`	element-wise + reduction

Ship It

This lesson produces two reusable prompts:

outputs/prompt-tensor-shapes.md -- A systematic prompt for debugging tensor shape mismatches. Includes decision tables for every common operation (matmul, broadcast, cat, Linear, Conv2d, BatchNorm, softmax) and a fix lookup table.

outputs/prompt-tensor-debugger.md -- A step-by-step debugging prompt you paste into any AI assistant when a shape error is blocking you. Feed it the error message and your tensor shapes, get back the exact fix.

Exercises

Easy -- Reshape round-trip. Take a tensor of shape (2, 3, 4). Reshape it to (6, 4), then to (24,), then back to (2, 3, 4). Verify element order is preserved at each step by printing the flat data.

Medium -- Implement broadcasting. Extend the Tensor class with a broadcast_to(shape) method that expands dimensions of size 1 to match a target shape. Then modify _elementwise_op to automatically broadcast before operating. Test with shapes (3, 1) and (1, 4) producing (3, 4).

Hard -- Build einsum from scratch. Implement a basic einsum(subscripts, *tensors) function that handles at least: dot product (i,i->), matrix multiply (ij,jk->ik), outer product (i,j->ij), and transpose (ij->ji). Parse the subscript string, identify contracted indices, and loop over all index combinations. Compare your results against np.einsum.

Hard -- Attention shape tracker. Write a function that takes batch_size, seq_len, embed_dim, and num_heads as inputs and prints the exact shape at every step of multi-head attention: input, Q/K/V projection, head split, attention scores, softmax weights, weighted sum, head merge, output projection. Verify against the demo_attention_einsum() output.

Key Terms

Term	What people say	What it actually means
Tensor	"A matrix but more dimensions"	A multi-dimensional array with uniform type and defined shape, strides, and operations
Rank	"The number of dimensions"	The number of axes. A matrix has rank 2, not rank equal to its matrix rank
Shape	"The size of the tensor"	A tuple listing the size along each axis. `(2, 3)` means 2 rows, 3 columns
Stride	"How memory is laid out"	The number of elements to skip to advance one position along each axis
Broadcasting	"It just works when shapes differ"	A strict set of rules: align from right, dimensions must be equal or one must be 1
Contiguous	"The tensor is normal"	Elements stored sequentially in memory with no gaps or reordering from the logical layout
Einsum	"A fancy way to write matmul"	A general notation that expresses any tensor contraction, outer product, trace, or transpose in one line
View	"Same as reshape"	A tensor sharing the same memory buffer but with different shape/stride metadata. Fails on non-contiguous data
Contraction	"Summing over an index"	The general operation where a shared index between tensors is multiplied and summed, producing a lower-rank result
NCHW / NHWC	"PyTorch vs TensorFlow format"	Memory layout conventions for image tensors. NCHW puts channels before spatial dims, NHWC puts them after