Calculus for Machine Learning

> Derivatives tell you which way is downhill. That is all a neural network needs to learn.

Type: Learn

Language: Python

Prerequisites: Phase 1, Lessons 01-03

Time: ~60 minutes

Learning Objectives

The Problem

You have a neural network with millions of weights. Each weight is a knob. You need to figure out which direction to turn every single knob to make the model slightly less wrong. Calculus gives you that direction.

Without calculus, training a neural network would mean trying random changes and hoping for the best. With derivatives, you know exactly how each weight affects the error. You turn every knob the right way, every time.

The Concept

What is a derivative?

A derivative measures the rate of change. For a function y = f(x), the derivative f'(x) tells you: if you nudge x by a tiny amount, how much does y change?

Geometrically, the derivative is the slope of the tangent line at a point.

f(x) = x^2:

x f(x) f'(x) (slope)
0 0 0 (flat, at the bottom)
1 1 2
2 4 4 (tangent line slope at this point)
3 9 6

At x=2, the slope is 4. If you move x a tiny bit to the right, y increases by about 4 times that amount. At x=0, the slope is 0. You are at the bottom of the bowl.

The formal definition:

f'(x) = lim   f(x + h) - f(x)
        h->0  -----------------
                     h

In code, you skip the limit and just use a very small h. That is the numerical derivative.

Partial derivatives: one variable at a time

Real functions have many inputs. A neural network loss depends on thousands of weights. A partial derivative holds all variables constant except one, then takes the derivative with respect to that one.

f(x, y) = x^2 + 3xy + y^2

df/dx = 2x + 3y     (treat y as a constant)
df/dy = 3x + 2y     (treat x as a constant)

Each partial derivative answers: if I nudge just this one weight, how does the loss change?

The gradient: vector of all partial derivatives

The gradient collects every partial derivative into one vector. For a function f(x, y, z), the gradient is:

grad f = [ df/dx, df/dy, df/dz ]

The gradient points in the direction of steepest ascent. To minimize a function, go in the opposite direction.

Contour plot of f(x,y) = x^2 + y^2:

The function forms a bowl shape with concentric circles as contour lines. The minimum is at (0, 0).

Point grad f -grad f (descent direction)
(1, 1) [2, 2] (points uphill, away from minimum) [-2, -2] (points downhill, toward minimum)
(0, 0) [0, 0] (flat, at the minimum) [0, 0]

This is gradient descent in a picture. Compute the gradient, negate it, take a step.

The connection to optimization

Training a neural network is optimization. You have a loss function L(w1, w2, ..., wn) that measures how wrong the model is. You want to minimize it.

Gradient descent update rule:

  w_new = w_old - learning_rate * dL/dw

For every weight:
  1. Compute the partial derivative of loss with respect to that weight
  2. Subtract a small multiple of it from the weight
  3. Repeat

The learning rate controls step size. Too big and you overshoot. Too small and you crawl.

Loss landscape (1D slice):

The loss function L(w) forms a curve with peaks and valleys as the weight w varies.

Feature Description
Global minimum The lowest point on the entire curve -- the best solution
Local minimum A valley that is lower than its neighbors but not the lowest overall
Slope Gradient descent follows the slope downhill from any starting point

Gradient descent follows the slope downhill. It can get stuck in local minima, but in high-dimensional spaces (millions of weights) this is rarely a practical problem.

Numerical vs analytical derivatives

There are two ways to compute a derivative.

Analytical: apply calculus rules by hand. For f(x) = x^2, the derivative is f'(x) = 2x. Exact. Fast.

Numerical: approximate using the definition. Compute f(x+h) and f(x-h) for a tiny h, then use the difference.

Numerical (central difference):

f'(x) ~= f(x + h) - f(x - h)
          -----------------------
                  2h

h = 0.0001 works well in practice

Numerical derivatives are slower but work for any function. Analytical derivatives are fast but require you to derive the formula. Neural network frameworks use a third approach: automatic differentiation, which computes exact derivatives mechanically. You will see that in Phase 3.

Derivatives by hand for simple functions

These are the derivatives you will see over and over in ML.

Function        Derivative       Used in
--------        ----------       -------
f(x) = x^2     f'(x) = 2x      Loss functions (MSE)
f(x) = wx + b  f'(w) = x        Linear layer (gradient w.r.t. weight)
                f'(b) = 1        Linear layer (gradient w.r.t. bias)
                f'(x) = w        Linear layer (gradient w.r.t. input)
f(x) = e^x     f'(x) = e^x     Softmax, attention
f(x) = ln(x)   f'(x) = 1/x     Cross-entropy loss
f(x) = 1/(1+e^-x)  f'(x) = f(x)(1-f(x))   Sigmoid activation

For f(x) = x^2:

f(x) = x^2    f'(x) = 2x

  x    f(x)   f'(x)   meaning
  -2    4      -4      slope tilts left (decreasing)
  -1    1      -2      slope tilts left (decreasing)
   0    0       0      flat (minimum!)
   1    1       2      slope tilts right (increasing)
   2    4       4      slope tilts right (increasing)

For f(w) = wx + b with x=3, b=1:

f(w) = 3w + 1    f'(w) = 3

The derivative with respect to w is just x.
If x is big, a small change in w causes a big change in output.

The chain rule

When functions are composed, the chain rule tells you how to differentiate.

If y = f(g(x)), then dy/dx = f'(g(x)) * g'(x)

Example: y = (3x + 1)^2
  outer: f(u) = u^2       f'(u) = 2u
  inner: g(x) = 3x + 1    g'(x) = 3
  dy/dx = 2(3x + 1) * 3 = 6(3x + 1)

Neural networks are chains of functions: input -> linear -> activation -> linear -> activation -> loss. Backpropagation is the chain rule applied repeatedly from output to input. That is the entire algorithm.

The Hessian Matrix

The gradient tells you the slope. The Hessian tells you the curvature.

The Hessian is the matrix of second-order partial derivatives. For a function f(x1, x2, ..., xn), entry (i, j) of the Hessian is:

H[i][j] = d^2f / (dx_i * dx_j)

For a 2-variable function f(x, y):

H = | d^2f/dx^2    d^2f/dxdy |
    | d^2f/dydx    d^2f/dy^2 |

What the Hessian tells you at a critical point (where gradient = 0):

Hessian property Meaning Example surface
Positive definite (all eigenvalues > 0) Local minimum Bowl pointing up
Negative definite (all eigenvalues < 0) Local maximum Bowl pointing down
Indefinite (mixed eigenvalues) Saddle point Horse saddle shape

Example: f(x, y) = x^2 - y^2 (a saddle function)

df/dx = 2x       df/dy = -2y
d^2f/dx^2 = 2    d^2f/dy^2 = -2    d^2f/dxdy = 0

H = | 2   0 |
    | 0  -2 |

Eigenvalues: 2 and -2 (one positive, one negative)
--> Saddle point at (0, 0)

Compare with f(x, y) = x^2 + y^2 (a bowl):

H = | 2  0 |
    | 0  2 |

Eigenvalues: 2 and 2 (both positive)
--> Local minimum at (0, 0)

Why the Hessian matters in ML:

Newton's method uses the Hessian to take better optimization steps than gradient descent. Instead of just following the slope, it accounts for curvature:

Newton's update:    w_new = w_old - H^(-1) * gradient
Gradient descent:   w_new = w_old - lr * gradient

Newton's method converges faster because the Hessian "rescales" the gradient -- steep directions get smaller steps, flat directions get larger steps.

The catch: for a neural network with N parameters, the Hessian is N x N. A model with 1 million parameters would need a 1 trillion-entry matrix. That is why we use approximations.

Method What it uses Cost Convergence
Gradient descent First derivatives only O(N) per step Slow (linear)
Newton's method Full Hessian O(N^3) per step Fast (quadratic)
L-BFGS Approximate Hessian from gradient history O(N) per step Medium (superlinear)
Adam Per-parameter adaptive rates (diagonal Hessian approx) O(N) per step Medium
Natural gradient Fisher information matrix (statistical Hessian) O(N^2) per step Fast

In practice, Adam is the default optimizer for deep learning. It approximates second-order information cheaply by tracking the running mean and variance of gradients per parameter.

Taylor Series Approximation

Any smooth function can be approximated locally by a polynomial:

f(x + h) = f(x) + f'(x)*h + (1/2)*f''(x)*h^2 + (1/6)*f'''(x)*h^3 + ...

The more terms you include, the better the approximation -- but only near the point x.

Why Taylor series matter for ML:

Approximation order    What it captures    Optimization method
-------------------    -----------------   -------------------
0th order (constant)   Just the value      Random search
1st order (linear)     Slope               Gradient descent
2nd order (quadratic)  Curvature           Newton's method
Higher orders          Finer structure     Rarely used in ML

The key insight: all gradient-based optimization is really about approximating the loss function locally and stepping to the minimum of that approximation.

Integrals in ML

Derivatives tell you rates of change. Integrals compute accumulations -- area under a curve.

In ML, you rarely compute integrals by hand, but the concept is everywhere:

Probability. For a continuous random variable with density p(x):

P(a < X < b) = integral from a to b of p(x) dx

The area under the probability density curve between a and b is the probability of landing in that range.

Expected value. The average outcome weighted by probability:

E[f(X)] = integral of f(x) * p(x) dx

The expected loss over a data distribution is an integral. Training minimizes an empirical approximation of this.

KL divergence. Measures how different two distributions are:

KL(p || q) = integral of p(x) * log(p(x) / q(x)) dx

Used in VAEs, knowledge distillation, and Bayesian inference.

Normalization constants. In Bayesian inference:

p(w | data) = p(data | w) * p(w) / integral of p(data | w) * p(w) dw

The denominator is an integral over all possible parameter values. It is often intractable, which is why we use approximations like MCMC and variational inference.

Integral concept Where it appears in ML
Area under curve Probability from density functions
Expected value Loss functions, risk minimization
KL divergence VAEs, policy optimization, distillation
Normalization Bayesian posteriors, softmax denominator
Marginal likelihood Model comparison, evidence lower bound (ELBO)

Multivariable Chain Rule in a Computation Graph

The chain rule does not just apply to scalar functions in a line. In a neural network, variables fan out and merge. Here is how derivatives flow through a simple forward pass:

graph LR x["x (input)"] -->|"*w"| z1["z1 = w*x"] z1 -->|"+b"| z2["z2 = w*x + b"] z2 -->|"sigmoid"| a["a = sigmoid(z2)"] a -->|"loss fn"| L["L = -(y*log(a) + (1-y)*log(1-a))"]

The backward pass computes gradients right to left:

graph RL dL["dL/dL = 1"] -->|"dL/da"| da["dL/da = -y/a + (1-y)/(1-a)"] da -->|"da/dz2 = a(1-a)"| dz2["dL/dz2 = dL/da * a(1-a)"] dz2 -->|"dz2/dw = x"| dw["dL/dw = dL/dz2 * x"] dz2 -->|"dz2/db = 1"| db["dL/db = dL/dz2 * 1"]

Each arrow multiplies by the local derivative. The gradient for any parameter is the product of all local derivatives along the path from loss to that parameter. When paths branch and merge, you sum the contributions (multivariate chain rule).

This is all backpropagation is: the chain rule applied systematically through a computation graph, from output to inputs.

The Jacobian matrix

When a function maps a vector to a vector (like a neural network layer), its derivative is a matrix. The Jacobian contains every partial derivative of every output with respect to every input.

For f: R^n -> R^m, the Jacobian J is an m x n matrix:

x1 x2 ... xn
f1 df1/dx1 df1/dx2 ... df1/dxn
f2 df2/dx1 df2/dx2 ... df2/dxn
... ... ... ... ...
fm dfm/dx1 dfm/dx2 ... dfm/dxn

You will not compute Jacobians by hand for neural networks. PyTorch handles it. But knowing it exists helps you understand shapes in backpropagation: if a layer maps R^n to R^m, its Jacobian is m x n. The gradient flows backward through the transpose of this matrix.

Why this matters for neural networks

Every weight in a neural network gets a gradient. The gradient tells you how to adjust that weight to reduce the loss.

graph LR subgraph Forward["Forward Pass"] I["input"] --> W1["W1"] --> R["relu"] --> W2["W2"] --> S["softmax"] --> L["loss"] end
graph RL subgraph Backward["Backward Pass"] dL["dL/dloss"] --> dW2["dL/dW2"] --> d2["..."] --> dW1["dL/dW1"] end

Each weight update:

The forward pass computes the prediction and loss. The backward pass computes the gradient of the loss with respect to every weight. Then every weight takes a small step downhill. Repeat for millions of steps. That is deep learning.

Build It

Step 1: Numerical derivative from scratch

def numerical_derivative(f, x, h=1e-7):
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

for x in [-2, -1, 0, 1, 2]:
    numerical = numerical_derivative(f, x)
    analytical = 2 * x
    print(f"x={x:2d}  f'(x) numerical={numerical:.6f}  analytical={analytical:.1f}")

The numerical derivative matches the analytical one to many decimal places.

Step 2: Partial derivatives and gradients

def numerical_gradient(f, point, h=1e-7):
    gradient = []
    for i in range(len(point)):
        point_plus = list(point)
        point_minus = list(point)
        point_plus[i] += h
        point_minus[i] -= h
        partial = (f(point_plus) - f(point_minus)) / (2 * h)
        gradient.append(partial)
    return gradient

def f_multi(point):
    x, y = point
    return x**2 + 3*x*y + y**2

grad = numerical_gradient(f_multi, [1.0, 2.0])
print(f"Numerical gradient at (1,2): {[f'{g:.4f}' for g in grad]}")
print(f"Analytical gradient at (1,2): [2*1+3*2, 3*1+2*2] = [{2*1+3*2}, {3*1+2*2}]")

Step 3: Gradient descent to find the minimum of f(x) = x^2

x = 5.0
lr = 0.1
for step in range(20):
    grad = 2 * x
    x = x - lr * grad
    print(f"step {step:2d}  x={x:8.4f}  f(x)={x**2:10.6f}")

Starting at x=5, each step moves closer to x=0 (the minimum).

Step 4: Gradient descent on a 2D function

def f_2d(point):
    x, y = point
    return x**2 + y**2

point = [4.0, 3.0]
lr = 0.1
for step in range(30):
    grad = numerical_gradient(f_2d, point)
    point = [p - lr * g for p, g in zip(point, grad)]
    loss = f_2d(point)
    if step % 5 == 0 or step == 29:
        print(f"step {step:2d}  point=({point[0]:7.4f}, {point[1]:7.4f})  f={loss:.6f}")

Step 5: Comparing numerical and analytical derivatives

import math

test_functions = [
    ("x^2",      lambda x: x**2,          lambda x: 2*x),
    ("x^3",      lambda x: x**3,          lambda x: 3*x**2),
    ("sin(x)",   lambda x: math.sin(x),   lambda x: math.cos(x)),
    ("e^x",      lambda x: math.exp(x),   lambda x: math.exp(x)),
    ("1/x",      lambda x: 1/x,           lambda x: -1/x**2),
]

x = 2.0
print(f"{'Function':<12} {'Numerical':>12} {'Analytical':>12} {'Error':>12}")
print("-" * 50)
for name, f, df in test_functions:
    num = numerical_derivative(f, x)
    ana = df(x)
    err = abs(num - ana)
    print(f"{name:<12} {num:12.6f} {ana:12.6f} {err:12.2e}")

Step 6: Computing the Hessian numerically

def hessian_2d(f, x, y, h=1e-5):
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / (h ** 2)
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / (h ** 2)
    fxy = (f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return [[fxx, fxy], [fxy, fyy]]

def saddle(x, y):
    return x ** 2 - y ** 2

def bowl(x, y):
    return x ** 2 + y ** 2

H_saddle = hessian_2d(saddle, 0.0, 0.0)
H_bowl = hessian_2d(bowl, 0.0, 0.0)
print(f"Saddle Hessian: {H_saddle}")  # [[2, 0], [0, -2]] -- mixed signs
print(f"Bowl Hessian:   {H_bowl}")    # [[2, 0], [0, 2]]  -- both positive

The Hessian of the saddle function has eigenvalues 2 and -2 (mixed signs, confirming a saddle point). The bowl has eigenvalues 2 and 2 (both positive, confirming a minimum).

Step 7: Taylor approximation in action

import math

def taylor_approx(f, f_prime, f_double_prime, x0, h, order=2):
    result = f(x0)
    if order >= 1:
        result += f_prime(x0) * h
    if order >= 2:
        result += 0.5 * f_double_prime(x0) * h ** 2
    return result

x0 = 0.0
for h in [0.1, 0.5, 1.0, 2.0]:
    true_val = math.sin(h)
    t1 = taylor_approx(math.sin, math.cos, lambda x: -math.sin(x), x0, h, order=1)
    t2 = taylor_approx(math.sin, math.cos, lambda x: -math.sin(x), x0, h, order=2)
    print(f"h={h:.1f}  sin(h)={true_val:.4f}  order1={t1:.4f}  order2={t2:.4f}")

Near x0=0, sin(x) ~ x (first-order Taylor). The approximation is excellent for small h but breaks down for large h. This is why gradient descent works best with small learning rates -- each step assumes the linear approximation is accurate.

Step 8: Why this matters for a neural network

import random

random.seed(42)

w = random.gauss(0, 1)
b = random.gauss(0, 1)
lr = 0.01

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

for epoch in range(200):
    total_loss = 0
    dw = 0
    db = 0
    for x, y in zip(xs, ys):
        pred = w * x + b
        error = pred - y
        total_loss += error ** 2
        dw += 2 * error * x
        db += 2 * error
    dw /= len(xs)
    db /= len(xs)
    total_loss /= len(xs)
    w -= lr * dw
    b -= lr * db
    if epoch % 40 == 0 or epoch == 199:
        print(f"epoch {epoch:3d}  w={w:.4f}  b={b:.4f}  loss={total_loss:.6f}")

print(f"\nLearned: y = {w:.2f}x + {b:.2f}")
print(f"Actual:  y = 2x + 1")

Every gradient-based training loop follows this pattern: predict, compute loss, compute gradients, update weights.

Use It

With NumPy, the same operations are faster and more concise:

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

w, b = np.random.randn(), np.random.randn()
lr = 0.01

for epoch in range(200):
    pred = w * x + b
    error = pred - y
    loss = np.mean(error ** 2)
    dw = np.mean(2 * error * x)
    db = np.mean(2 * error)
    w -= lr * dw
    b -= lr * db

print(f"Learned: y = {w:.2f}x + {b:.2f}")

You just built gradient descent from scratch. PyTorch automates the gradient computation, but the update loop is identical.

Exercises

  1. Implement numerical_second_derivative(f, x) using numerical_derivative called twice. Verify that the second derivative of x^3 at x=2 is 12.
  2. Use gradient descent to find the minimum of f(x, y) = (x - 3)^2 + (y + 1)^2. Start from (0, 0). The answer should converge to (3, -1).
  3. Add momentum to the gradient descent loop: maintain a velocity vector that accumulates past gradients. Compare convergence speed with and without momentum on f(x) = x^4 - 3x^2.

Key Terms

Term What people say What it actually means
Derivative "The slope" The rate of change of a function at a point. Tells you how much the output changes per unit change in input.
Partial derivative "Derivative of one variable" The derivative with respect to one variable while all others are held constant.
Gradient "Direction of steepest ascent" A vector of all partial derivatives. Points in the direction that increases the function fastest.
Gradient descent "Go downhill" Subtract the gradient (times a learning rate) from the parameters to reduce the loss. The core of neural network training.
Learning rate "Step size" A scalar that controls how big each gradient descent step is. Too large: diverge. Too small: converge slowly.
Chain rule "Multiply the derivatives" The rule for differentiating composed functions: df/dx = df/dg * dg/dx. The mathematical basis of backpropagation.
Jacobian "Matrix of derivatives" When a function maps vectors to vectors, the Jacobian is the matrix of all partial derivatives of outputs with respect to inputs.
Numerical derivative "Finite differences" Approximating a derivative by evaluating the function at two nearby points and computing the slope between them.
Backpropagation "Reverse-mode autodiff" Computing gradients layer by layer from output to input using the chain rule. How neural networks learn.
Hessian "Matrix of second derivatives" The matrix of all second-order partial derivatives. Describes the curvature of a function. Positive definite Hessian at a critical point means local minimum.
Taylor series "Polynomial approximation" Approximating a function near a point using its derivatives: f(x+h) ~ f(x) + f'(x)h + (1/2)f''(x)h^2 + ... The basis for understanding why gradient descent and Newton's method work.
Integral "Area under the curve" The accumulation of a quantity over a range. In ML, integrals define probabilities, expected values, and KL divergence.

Further Reading