← Debugging and Profiling Vectors, Matrices & Operations →

Linear Algebra Intuition

> Every AI model is just matrix math wearing a fancy hat.

Type: Learn

Languages: Python, Julia

Prerequisites: Phase 0

Time: ~60 minutes

Learning Objectives

Implement vector and matrix operations (addition, dot product, matrix multiply) from scratch in Python
Explain geometrically what the dot product, projection, and Gram-Schmidt process do
Determine linear independence, rank, and basis of a set of vectors using row reduction
Connect linear algebra concepts to their AI applications: embeddings, attention scores, and LoRA

The Problem

Open any ML paper. Within the first page, you'll see vectors, matrices, dot products, and transformations. Without linear algebra intuition, these are just symbols. With it, you can see what a neural network is actually doing -- moving points around in space.

You don't need to be a mathematician. You need to see what these operations mean geometrically, then code them yourself.

The Concept

Vectors Are Points (and Directions)

A vector is just a list of numbers. But those numbers mean something -- they're coordinates in space.

2D vector [3, 2]:

x	y	Point
3	2	The vector points from origin (0,0) to (3, 2) on the plane

The vector has magnitude sqrt(3^2 + 2^2) = sqrt(13) and points up and to the right.

In AI, vectors represent everything:

A word → a vector of 768 numbers (its "meaning" in embedding space)
An image → a vector of millions of pixel values
A user → a vector of preferences

Matrices Are Transformations

A matrix transforms one vector into another. It can rotate, scale, stretch, or project.

graph LR
    subgraph Before
        A["Point A"]
        B["Point B"]
    end
    subgraph Matrix["Matrix Multiplication"]
        M["M (transformation)"]
    end
    subgraph After
        A2["Point A'"]
        B2["Point B'"]
    end
    A --> M
    B --> M
    M --> A2
    M --> B2

In AI, matrices ARE the model:

Neural network weights → matrices that transform input into output
Attention scores → matrices that decide what to focus on
Embeddings → matrices that map words to vectors
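
To make "rotate, scale, stretch, or project" concrete, here is a minimal sketch in plain Python. The helper `apply` and the three example matrices are illustrative names chosen for this sketch, not part of any library.

```python
import math

def apply(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

point = [3.0, 2.0]

scale = [[2.0, 0.0], [0.0, 0.5]]         # stretch x by 2, shrink y by half
project_x = [[1.0, 0.0], [0.0, 0.0]]     # keep x, drop y
theta = math.pi / 2                      # 90 degrees counterclockwise
rotate = [[math.cos(theta), -math.sin(theta)],
          [math.sin(theta),  math.cos(theta)]]

scaled = apply(scale, point)
projected = apply(project_x, point)
rotated = apply(rotate, point)

print(f"scaled:    {scaled}")
print(f"projected: {projected}")
print(f"rotated:   {[round(c, 6) for c in rotated]}")
```

Each matrix is the same kind of object (a grid of numbers), yet each one moves the point in a different way. The rotation result is rounded for printing because `math.cos(math.pi / 2)` is a tiny nonzero float rather than exactly 0.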

The Dot Product Measures Similarity

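The dot product of two vectors tells you how aligned they are. Geometrically, a · b = |a| |b| cos θ, where θ is the angle between them, so the sign depends only on direction while the size also depends on the vectors' lengths.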

a · b = a₁×b₁ + a₂×b₂ + ... + aₙ×bₙ

Same direction:      a · b > 0  (similar)
Perpendicular:       a · b = 0  (unrelated)
Opposite direction:  a · b < 0  (dissimilar)

This is the core of how vector search, recommendation systems, and RAG retrieval work -- find the stored vectors with the highest dot product (or cosine similarity) against the query vector.
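
A minimal sketch of that retrieval idea, assuming made-up 3-dimensional vectors (real embeddings have hundreds of dimensions). The names `query`, `docs`, and `rank_by_dot` are hypothetical and exist only for this illustration.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_dot(query, docs):
    """Return (name, score) pairs sorted from highest to lowest dot product."""
    scored = [(name, dot(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy "embeddings"; the values are invented for illustration.
query = [1.0, 0.0, 1.0]
docs = {
    "same_direction": [2.0, 0.0, 2.0],
    "perpendicular": [0.0, 3.0, 0.0],
    "opposite": [-1.0, 0.0, -1.0],
}

ranked = rank_by_dot(query, docs)
for name, score in ranked:
    print(f"{name}: {score}")
```

The three scores come out positive, zero, and negative, matching the three cases listed above. Because the dot product also grows with vector length, many systems normalize the vectors first, which turns the score into cosine similarity.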

Linear Independence

Vectors are linearly independent if no vector in the set can be written as a combination of the others. If v1, v2, v3 are independent, they span a 3D space. If one is a combination of the others, they only span a plane.

Why it matters for AI: your feature matrix should have linearly independent columns. If two features are perfectly correlated (linearly dependent), the model cannot distinguish their effects. This is multicollinearity in regression -- the fitted weights become unstable, and small changes in the data produce wild swings in the coefficients.

Concrete example:

v1 = [1, 0, 0]
v2 = [0, 1, 0]
v3 = [2, 1, 0]   # v3 = 2*v1 + v2

v1 and v2 are independent -- neither is a scalar multiple or combination of the other. But v3 = 2*v1 + v2, so {v1, v2, v3} is a dependent set. These three vectors all lie in the xy-plane. No matter how you combine them, you cannot reach [0, 0, 1]. You have three vectors but only two dimensions of freedom.

In a dataset: if feature_3 = 2*feature_1 + feature_2, adding feature_3 gives the model zero new information. Worse, it makes the normal equations singular -- there is no unique solution for the weights.
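
The singularity claim can be checked directly. The sketch below builds X^T X (the matrix in the normal equations) for three features and computes its determinant with exact integer arithmetic. The feature values and the helper names `gram_matrix` and `det3` are assumptions made for this example.

```python
def gram_matrix(columns):
    """Return X^T X, where X has the given lists as its columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Invented feature columns (four samples each).
feature_1 = [1, 2, 3, 4]
feature_2 = [0, 1, 0, 1]
feature_3 = [2 * x + y for x, y in zip(feature_1, feature_2)]  # exactly 2*f1 + f2
other_feature = [1, 0, 0, 2]  # hypothetical feature that is NOT a combination of f1, f2

det_dependent = det3(gram_matrix([feature_1, feature_2, feature_3]))
det_independent = det3(gram_matrix([feature_1, feature_2, other_feature]))

print(f"det(X^T X) with redundant feature:   {det_dependent}")
print(f"det(X^T X) with independent feature: {det_independent}")
```

A zero determinant means X^T X has no inverse, so the normal equations cannot pin down a single weight vector. Swapping in a feature that carries new information makes the determinant nonzero again.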

Basis and Rank

A basis is a minimal set of linearly independent vectors that span the entire space. The number of basis vectors is the dimension of the space.

The standard basis for 3D space is {[1,0,0], [0,1,0], [0,0,1]}. But any three independent vectors in 3D form a valid basis. The choice of basis is a choice of coordinate system.
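
As a small illustration of "choice of basis is a choice of coordinate system", the sketch below expresses the same 2D vector in two different bases. The function `coordinates_in_basis` is a hypothetical helper written for this example; it uses Cramer's rule, which is fine for 2x2 but not how larger systems are solved in practice.

```python
def coordinates_in_basis(v, b1, b2):
    """Solve c1*b1 + c2*b2 = v for (c1, c2) in 2D using Cramer's rule."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("b1 and b2 are linearly dependent, so they are not a basis")
    c1 = (v[0] * b2[1] - b2[0] * v[1]) / det
    c2 = (b1[0] * v[1] - v[0] * b1[1]) / det
    return [c1, c2]

v = [3, 1]
standard = coordinates_in_basis(v, [1, 0], [0, 1])    # the usual x/y axes
diagonal = coordinates_in_basis(v, [1, 1], [1, -1])   # axes along the diagonals

print(f"coordinates in standard basis: {standard}")
print(f"coordinates in diagonal basis: {diagonal}")
```

The point itself has not moved. Only the numbers used to describe it change, because 2*[1, 1] + 1*[1, -1] lands on the same spot as 3*[1, 0] + 1*[0, 1].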

Rank of a matrix = number of linearly independent columns = number of linearly independent rows. If rank < min(rows, cols), the matrix is rank-deficient. This means:

The system has infinitely many solutions (or none)
Information is lost in the transformation
The matrix cannot be inverted

Situation	Rank	What it means for ML
Full rank (rank = min(m, n))	Maximum possible	No redundant directions. With full column rank (m >= n), a unique least-squares solution exists. Conditioning still depends on the singular values.
Rank deficient (rank < min(m, n))	Below maximum	Features are redundant. Infinitely many weight solutions. Regularization needed.
Rank 1	1	Every column is a scaled copy of one vector. All data lies on a line.
Near rank-deficient (small singular values)	Numerically low	Matrix is ill-conditioned. Tiny input noise causes large output changes. Use SVD truncation or ridge regression.
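
The rows of this table can be told apart numerically by looking at singular values. The sketch below uses NumPy's `np.linalg.matrix_rank`, `np.linalg.svd`, and `np.linalg.cond`; the three example matrices are invented for illustration.

```python
import numpy as np

matrices = {
    "full rank": np.array([[2.0, 0.0], [0.0, 1.0]]),
    "rank deficient": np.array([[1.0, 2.0], [2.0, 4.0]]),        # row 2 = 2 * row 1
    "near rank-deficient": np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]]),
}

report = {}
for name, m in matrices.items():
    singular_values = np.linalg.svd(m, compute_uv=False)  # sorted largest first
    report[name] = {
        "rank": int(np.linalg.matrix_rank(m)),
        "singular_values": singular_values,
        "condition_number": float(np.linalg.cond(m)),     # largest / smallest singular value
    }
    print(name, report[name])
```

Note that the near rank-deficient matrix still reports rank 2 under NumPy's default tolerance, yet its condition number is enormous. That gap between "technically full rank" and "numerically fragile" is why the last row of the table gets its own entry.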

Projection

Projecting vector a onto vector b gives the component of a in the direction of b:

proj_b(a) = (a dot b / b dot b) * b

The residual (a - proj_b(a)) is perpendicular to b. This orthogonal decomposition is the foundation of least-squares fitting.

Projection is everywhere in ML:

Linear regression minimizes the distance from observations to the column space -- the solution IS a projection
PCA projects data onto the directions of maximum variance
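Attention in transformers scores each query against each key with a dot product, which measures how much of the query lies along the key (a projection scaled by the key's length)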

Example: a = [3, 4], b = [1, 0]

proj_b(a) = (3*1 + 4*0) / (1*1 + 0*0) * [1, 0] = 3 * [1, 0] = [3, 0]

The projection drops the y-component. This is dimensionality reduction in its simplest form -- throw away the directions you don't care about.
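
The same formula works when b is not an axis. This sketch, with vectors chosen so the arithmetic stays exact, also checks the claim that the residual is perpendicular to b. The helper names are illustrative.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def project(a, b):
    """Component of a in the direction of b: (a . b / b . b) * b."""
    scalar = dot(a, b) / dot(b, b)
    return [scalar * x for x in b]

a = [3, 4]
b = [2, 1]

proj = project(a, b)
residual = [ai - pi for ai, pi in zip(a, proj)]

print(f"projection of a onto b: {proj}")
print(f"residual:               {residual}")
print(f"residual . b:           {dot(residual, b)}")
```

Here a splits into [4, 2] along b plus [-1, 2] at right angles to it. Least-squares fitting performs the same split, only with a whole column space in place of the single vector b.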

Gram-Schmidt Process

The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal basis. Orthonormal means every vector has length 1 and every pair is perpendicular.

The algorithm:

Take the first vector, normalize it
Take the second vector, subtract its projection onto the first, normalize
Take the third vector, subtract its projections onto all previous vectors, normalize
Repeat for remaining vectors

Input:  v1, v2, v3, ... (linearly independent)

u1 = v1 / |v1|

w2 = v2 - (v2 dot u1) * u1
u2 = w2 / |w2|

w3 = v3 - (v3 dot u1) * u1 - (v3 dot u2) * u2
u3 = w3 / |w3|

Output: u1, u2, u3, ... (orthonormal basis)

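This is the idea behind QR decomposition: Q holds the orthonormal basis and R holds the projection coefficients. (Numerical libraries usually compute QR with Householder reflections rather than Gram-Schmidt, because they are more stable, but the factorization has the same structure.) QR decomposition is used in:

Solving linear systems (more stable than Gaussian elimination without pivoting)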
Computing eigenvalues (QR algorithm)
Least-squares regression (the standard numerical method)
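
To show how Q and R fall out of Gram-Schmidt, the sketch below records each projection coefficient in R while building Q, then checks that the original vectors can be rebuilt from the two. It subtracts projections one at a time from the running remainder, and it assumes the inputs are linearly independent. The function names and example vectors are made up for this illustration.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def qr_gram_schmidt(columns):
    """Return (q_columns, r) such that input column j equals sum_i r[i][j] * q_i.

    Assumes the input columns are linearly independent.
    """
    n = len(columns)
    q_columns = []
    r = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(columns):
        w = [float(x) for x in a]
        for i, q in enumerate(q_columns):
            r[i][j] = dot(q, w)                        # projection coefficient onto q_i
            w = [wk - r[i][j] * qk for wk, qk in zip(w, q)]
        r[j][j] = dot(w, w) ** 0.5                     # length of what is left over
        q_columns.append([wk / r[j][j] for wk in w])
    return q_columns, r

def rebuild_column(q_columns, r, j):
    """Recombine the orthonormal vectors using column j of R."""
    dim = len(q_columns[0])
    return [sum(r[i][j] * q_columns[i][k] for i in range(len(q_columns)))
            for k in range(dim)]

columns = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]   # example vectors chosen for illustration
q_columns, r = qr_gram_schmidt(columns)

max_error = max(
    abs(rebuilt - original)
    for j, col in enumerate(columns)
    for rebuilt, original in zip(rebuild_column(q_columns, r, j), col)
)
print(f"largest reconstruction error: {max_error:.2e}")
```

R comes out upper triangular because vector j only ever gets projected onto the basis vectors built before it. For real work, `np.linalg.qr` is the safer choice.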

Build It

Step 1: Vectors from scratch (Python)

class Vector:
    def __init__(self, components):
        self.components = list(components)
        self.dim = len(self.components)

    def __add__(self, other):
        return Vector([a + b for a, b in zip(self.components, other.components)])

    def __sub__(self, other):
        return Vector([a - b for a, b in zip(self.components, other.components)])

    def dot(self, other):
        return sum(a * b for a, b in zip(self.components, other.components))

    def magnitude(self):
        return sum(x**2 for x in self.components) ** 0.5

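    def normalize(self):
        mag = self.magnitude()
        if mag == 0:
            # The zero vector has no direction, so it cannot be scaled to length 1.
            raise ValueError("cannot normalize the zero vector")
        return Vector([x / mag for x in self.components])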

    def cosine_similarity(self, other):
        return self.dot(other) / (self.magnitude() * other.magnitude())

    def __repr__(self):
        return f"Vector({self.components})"


a = Vector([1, 2, 3])
b = Vector([4, 5, 6])

print(f"a + b = {a + b}")
print(f"a · b = {a.dot(b)}")
print(f"|a| = {a.magnitude():.4f}")
print(f"cosine similarity = {a.cosine_similarity(b):.4f}")

Step 2: Matrices from scratch (Python)

class Matrix:
    def __init__(self, rows):
        self.rows = [list(row) for row in rows]
        self.shape = (len(self.rows), len(self.rows[0]))

    def __matmul__(self, other):
        if isinstance(other, Vector):
            return Vector([
                sum(self.rows[i][j] * other.components[j] for j in range(self.shape[1]))
                for i in range(self.shape[0])
            ])
        rows = []
        for i in range(self.shape[0]):
            row = []
            for j in range(other.shape[1]):
                row.append(sum(
                    self.rows[i][k] * other.rows[k][j]
                    for k in range(self.shape[1])
                ))
            rows.append(row)
        return Matrix(rows)

    def transpose(self):
        return Matrix([
            [self.rows[j][i] for j in range(self.shape[0])]
            for i in range(self.shape[1])
        ])

    def __repr__(self):
        return f"Matrix({self.rows})"


rotation_90 = Matrix([[0, -1], [1, 0]])
point = Vector([3, 1])

rotated = rotation_90 @ point
print(f"Original: {point}")
print(f"Rotated 90°: {rotated}")

Step 3: Why this matters for AI

import random

random.seed(42)
weights = Matrix([[random.gauss(0, 0.1) for _ in range(3)] for _ in range(2)])
input_vector = Vector([1.0, 0.5, -0.3])

output = weights @ input_vector
print(f"Input (3D): {input_vector}")
print(f"Output (2D): {output}")
print("This is what a neural network layer does -- matrix multiplication.")

Step 4: Julia version

using LinearAlgebra  # provides the ⋅ (dot) operator

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

println("a + b = ", a + b)
println("a · b = ", a ⋅ b)       # Julia supports unicode operators
println("|a| = ", √(a ⋅ a))
println("cosine = ", (a ⋅ b) / (√(a ⋅ a) * √(b ⋅ b)))

# Matrix-vector multiplication
W = [0.1 -0.2 0.3; 0.4 0.5 -0.1]
x = [1.0, 0.5, -0.3]
println("Wx = ", W * x)
println("This is a neural network layer.")

Step 5: Linear independence and projection from scratch (Python)

def is_linearly_independent(vectors):
    n = len(vectors)
    dim = len(vectors[0].components)
    mat = Matrix([v.components[:] for v in vectors])
    rows = [row[:] for row in mat.rows]
    rank = 0
    for col in range(dim):
        pivot = None
        for row in range(rank, len(rows)):
            if abs(rows[row][col]) > 1e-10:
                pivot = row
                break
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        scale = rows[rank][col]
        rows[rank] = [x / scale for x in rows[rank]]
        for row in range(len(rows)):
            if row != rank and abs(rows[row][col]) > 1e-10:
                factor = rows[row][col]
                rows[row] = [rows[row][j] - factor * rows[rank][j] for j in range(dim)]
        rank += 1
    return rank == n


def project(a, b):
    scalar = a.dot(b) / b.dot(b)
    return Vector([scalar * x for x in b.components])


def gram_schmidt(vectors):
    orthonormal = []
    for v in vectors:
        w = v
        for u in orthonormal:
            proj = project(w, u)
            w = w - proj
        if w.magnitude() < 1e-10:
            continue
        orthonormal.append(w.normalize())
    return orthonormal


v1 = Vector([1, 0, 0])
v2 = Vector([1, 1, 0])
v3 = Vector([1, 1, 1])
basis = gram_schmidt([v1, v2, v3])
for i, u in enumerate(basis):
    print(f"u{i+1} = {u}")
    print(f"  |u{i+1}| = {u.magnitude():.6f}")

print(f"u1 · u2 = {basis[0].dot(basis[1]):.6f}")
print(f"u1 · u3 = {basis[0].dot(basis[2]):.6f}")
print(f"u2 · u3 = {basis[1].dot(basis[2]):.6f}")

Use It

Now the same thing with NumPy -- what you'll actually use in practice:

import numpy as np

a = np.array([1, 2, 3], dtype=float)
b = np.array([4, 5, 6], dtype=float)

print(f"a + b = {a + b}")
print(f"a · b = {np.dot(a, b)}")
print(f"|a| = {np.linalg.norm(a):.4f}")
print(f"cosine = {np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)):.4f}")

W = np.random.randn(2, 3) * 0.1
x = np.array([1.0, 0.5, -0.3])
print(f"Wx = {W @ x}")

Rank, Projection, and QR with NumPy

import numpy as np

A = np.array([[1, 2], [2, 4]])
print(f"Rank: {np.linalg.matrix_rank(A)}")

a = np.array([3, 4])
b = np.array([1, 0])
proj = (np.dot(a, b) / np.dot(b, b)) * b
print(f"Projection of {a} onto {b}: {proj}")

Q, R = np.linalg.qr(np.random.randn(3, 3))
print(f"Q is orthogonal: {np.allclose(Q @ Q.T, np.eye(3))}")
print(f"R is upper triangular: {np.allclose(R, np.triu(R))}")

PyTorch -- Tensors Are Vectors with Autodiff

import torch

x = torch.randn(3, requires_grad=True)
y = torch.tensor([1.0, 0.0, 0.0])

similarity = torch.dot(x, y)
similarity.backward()

print(f"x = {x.data}")
print(f"y = {y.data}")
print(f"dot product = {similarity.item():.4f}")
print(f"d(dot)/dx = {x.grad}")

The gradient of the dot product with respect to x is just y. PyTorch computed this automatically. Every operation in a neural network is built from operations like this -- matrix multiplies, dot products, projections -- and autodiff tracks gradients through all of them.
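
The claim that the gradient equals y can be sanity-checked without any autodiff library. Since dot(x, y) = x1*y1 + x2*y2 + x3*y3, nudging x_i by a small amount changes the result by that amount times y_i. The sketch below estimates this with central differences; `numerical_gradient` is a hypothetical helper and the values of x are arbitrary.

```python
def dot(a, b):
    """Sum of element-wise products of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for each coordinate i."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

x = [0.7, -1.2, 2.5]   # arbitrary values chosen for illustration
y = [1.0, 0.0, 0.0]

grad = numerical_gradient(lambda v: dot(v, y), x)
print(f"estimated gradient: {[round(g, 6) for g in grad]}")
print(f"y:                  {y}")
```

Finite differences like this are slow and approximate, which is why frameworks use autodiff instead, but they make a handy cross-check when a gradient looks suspicious.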

You just built from scratch what NumPy does in one line. Now you know what's happening under the hood.

Ship It

This lesson produces:

outputs/prompt-linear-algebra-tutor.md -- a prompt for AI assistants to teach linear algebra through geometric intuition

Connections

Everything in this lesson connects to specific parts of modern AI:

Concept	Where it shows up
Dot product	Attention scores in transformers, cosine similarity in RAG
Matrix multiply	Every neural network layer, every linear transformation
Linear independence	Feature selection, avoiding multicollinearity
Rank	Determining if a system is solvable, LoRA (low-rank adaptation)
Projection	Linear regression (projecting onto column space), PCA
Gram-Schmidt / QR	Numerical solvers, eigenvalue computation
Orthonormal basis	Stable numerical computation, whitening transforms

LoRA deserves special mention. It fine-tunes large language models by decomposing weight updates into low-rank matrices. Instead of updating a full 4096x4096 weight matrix (about 16.8M parameters), LoRA trains two matrices of size 4096x16 and 16x4096 (about 131K parameters in total). Their product has rank at most 16, so LoRA assumes the useful weight update can be captured by a rank-16 matrix, one whose outputs all lie in a 16-dimensional subspace of the 4096-dimensional space. That is linear algebra doing real work.
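
The parameter arithmetic is easy to reproduce. In the sketch below, `lora_param_counts` is a hypothetical helper, and the names B and A for the two small matrices follow a common convention rather than anything defined in this lesson.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full_update_params, lora_params) for one d_out x d_in weight matrix."""
    full_update = d_out * d_in
    lora = d_out * rank + rank * d_in     # B is d_out x rank, A is rank x d_in
    return full_update, lora

full_update, lora = lora_param_counts(4096, 4096, 16)

print(f"full update: {full_update:,} parameters")
print(f"LoRA update: {lora:,} parameters")
print(f"reduction:   {full_update // lora}x")
```

For these sizes the low-rank form trains 128 times fewer parameters. Raising the rank shrinks that factor proportionally, which is the trade-off you tune when picking a LoRA rank.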

Exercises

Implement Vector.angle_between(other) that returns the angle in degrees between two vectors
Create a 2D scaling matrix that doubles the x-coordinate and triples the y-coordinate, then apply it to the vector [1, 1]
Given 5 random word-like vectors (dimension 50), find the two most similar using cosine similarity
Verify that the Gram-Schmidt output is truly orthonormal: check that every pair has dot product 0 and every vector has magnitude 1
Create a 3x3 matrix with rank 2. Verify with np.linalg.matrix_rank. Then explain what geometric object the columns span.
Project the vector [1, 2, 3] onto [1, 1, 1]. What does the result represent geometrically?

Key Terms

Term	What people say	What it actually means
Vector	"An arrow"	A list of numbers representing a point or direction in n-dimensional space
Matrix	"A table of numbers"	A transformation that maps vectors from one space to another
Dot product	"Multiply and sum"	A measure of how aligned two vectors are -- the core of similarity search
Embedding	"Some AI magic"	A vector that represents the meaning of something (word, image, user)
Linear independence	"They don't overlap"	No vector in the set can be written as a combination of the others
Rank	"How many dimensions"	The number of linearly independent columns (or rows) in a matrix
Projection	"The shadow"	The component of one vector in the direction of another
Basis	"The coordinate axes"	A minimal set of independent vectors that span the space
Orthonormal	"Perpendicular unit vectors"	Vectors that are mutually perpendicular and each have length 1
