← Chain Rule & Automatic Differentiation Bayes' Theorem →

Probability and Distributions

> Probability is the language AI uses to express uncertainty.

Type: Learn

Language: Python

Prerequisites: Phase 1, Lessons 01-04

Time: ~75 minutes

Learning Objectives

Implement PMFs and PDFs from scratch for Bernoulli, categorical, Poisson, uniform, and normal distributions
Compute expected value, variance, and use the Central Limit Theorem to explain why Gaussians dominate
Build softmax and log-softmax functions with the numerical stability trick (subtract max logit)
Calculate cross-entropy loss from logits and connect it to negative log-likelihood

The Problem

A classifier outputs [0.03, 0.91, 0.06]. A language model picks the next word from 50,000 candidates. A diffusion model generates images by sampling from learned distributions. All of these are probability in action.

Every prediction a model makes is a probability distribution. Every loss function measures how far the predicted distribution is from the true one. Every training step adjusts parameters to make one distribution look more like another. Without probability, you cannot read a single ML paper, debug a single model, or understand why your training loss is NaN.

The Concept

Events, Sample Spaces, and Probability

The sample space S is the set of all possible outcomes. An event is a subset of the sample space. Probability maps events to numbers between 0 and 1.

Coin flip:
  S = {H, T}
  P(H) = 0.5,  P(T) = 0.5

Single die roll:
  S = {1, 2, 3, 4, 5, 6}
  P(even) = P({2, 4, 6}) = 3/6 = 0.5

Three axioms define all of probability:

P(A) >= 0 for any event A
P(S) = 1 (something always happens)
P(A or B) = P(A) + P(B) when A and B cannot both occur

Everything else (Bayes' theorem, expectations, distributions) follows from these three rules.

Conditional Probability and Independence

P(A|B) is the probability of A given that B happened.

P(A|B) = P(A and B) / P(B)

Example: deck of cards
  P(King | Face card) = P(King and Face card) / P(Face card)
                      = (4/52) / (12/52)
                      = 4/12 = 1/3

Two events are independent when knowing one tells you nothing about the other:

Independent:   P(A|B) = P(A)
Equivalent to: P(A and B) = P(A) * P(B)

Coin flips are independent. Drawing cards without replacement is not.

Probability Mass Functions vs Probability Density Functions

Discrete random variables have a probability mass function (PMF). Each outcome has a specific probability that you can read off directly.

PMF: P(X = k)

Fair die:
  P(X = 1) = 1/6
  P(X = 2) = 1/6
  ...
  P(X = 6) = 1/6

  Sum of all probabilities = 1

Continuous random variables have a probability density function (PDF). The density at a single point is not a probability. Probability comes from integrating the density over an interval.

PDF: f(x)

P(a <= X <= b) = integral of f(x) from a to b

f(x) can be greater than 1 (density, not probability)
integral from -inf to +inf of f(x) dx = 1

This distinction matters in ML. Classification outputs are PMFs (discrete choices). VAE latent spaces use PDFs (continuous).

Common Distributions

Bernoulli: one trial, two outcomes. Models binary classification.

P(X = 1) = p
P(X = 0) = 1 - p
Mean = p,  Variance = p(1-p)

Categorical: one trial, k outcomes. Models multi-class classification (softmax output).

P(X = i) = p_i,  where sum of p_i = 1
Example: P(cat) = 0.7,  P(dog) = 0.2,  P(bird) = 0.1

Uniform: all outcomes equally likely. Used for random initialization.

Discrete: P(X = k) = 1/n for k in {1, ..., n}
Continuous: f(x) = 1/(b-a) for x in [a, b]

Normal (Gaussian): the bell curve. Parameterized by mean (mu) and variance (sigma^2).

f(x) = (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))

Standard normal: mu = 0, sigma = 1
  68% of data within 1 sigma
  95% within 2 sigma
  99.7% within 3 sigma

Poisson: counts of rare events in a fixed interval. Models event rates.

P(X = k) = (lambda^k * e^(-lambda)) / k!
Mean = lambda,  Variance = lambda

Expected Value and Variance

Expected value is the weighted average outcome.

Discrete:   E[X] = sum of x_i * P(X = x_i)
Continuous: E[X] = integral of x * f(x) dx

Variance measures spread around the mean.

Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
Standard deviation = sqrt(Var(X))

In ML, expected value appears as the loss function (average loss over the data distribution). Variance tells you about model stability. High variance in gradients means noisy training.

Joint and Marginal Distributions

A joint distribution P(X, Y) describes two random variables together.

Joint PMF example (X = weather, Y = umbrella):

Y=0 (no umbrella)	Y=1 (umbrella)	Marginal P(X)
X=0 (sun)	0.40	0.10	P(X=0) = 0.50
X=1 (rain)	0.05	0.45	P(X=1) = 0.50
Marginal P(Y)	P(Y=0) = 0.45	P(Y=1) = 0.55	1.00

The marginal distribution sums out the other variable:

P(X = x) = sum over all y of P(X = x, Y = y)

The row and column totals in the table above are the marginals.

Why the Normal Distribution Shows Up Everywhere

The Central Limit Theorem: the sum (or average) of many independent random variables converges to a normal distribution, regardless of the original distribution.

Roll 1 die:  uniform distribution (flat)
Average of 2 dice:  triangular (peaked)
Average of 30 dice: nearly perfect bell curve

This works for ANY starting distribution.

This is why:

Measurement errors are approximately normal (many small independent sources)
Weight initializations in neural networks use normal distributions
Gradient noise in SGD is approximately normal (sum of many sample gradients)
The normal distribution is the maximum entropy distribution for a given mean and variance

Log Probabilities

Raw probabilities cause numerical problems. Multiplying many small probabilities together quickly underflows to zero.

P(sentence) = P(word1) * P(word2) * ... * P(word_n)
            = 0.01 * 0.003 * 0.02 * ...
            -> 0.0 (underflow after ~30 terms)

Log probabilities fix this. Multiplications become additions.

log P(sentence) = log P(word1) + log P(word2) + ... + log P(word_n)
                = -4.6 + -5.8 + -3.9 + ...
                -> finite number (no underflow)

Rules:

log(a * b) = log(a) + log(b)
log probabilities are always <= 0 (since 0 < P <= 1)
More negative = less likely
Cross-entropy loss is the negative log probability of the correct class

Softmax as a Probability Distribution

Neural networks output raw scores (logits). Softmax converts them into a valid probability distribution.

softmax(z_i) = exp(z_i) / sum(exp(z_j) for all j)

Properties:
  - All outputs are in (0, 1)
  - All outputs sum to 1
  - Preserves relative ordering of inputs
  - exp() amplifies differences between logits

The softmax trick: subtract the max logit before exponentiating to prevent overflow.

z = [100, 101, 102]
exp(102) = overflow

z_shifted = z - max(z) = [-2, -1, 0]
exp(0) = 1  (safe)

Same result, no overflow.

Log-softmax combines softmax and log for numerical stability. PyTorch uses this internally for cross-entropy loss.

Sampling

Sampling means drawing random values from a distribution. In ML:

Dropout randomly samples which neurons to zero out
Data augmentation samples random transformations
Language models sample the next token from the predicted distribution
Diffusion models sample noise and progressively denoise

Sampling from arbitrary distributions requires techniques like inverse transform sampling, rejection sampling, or the reparameterization trick (used in VAEs).

Build It

Step 1: Probability basics

import math
import random

def factorial(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def combinations(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

def conditional_probability(p_a_and_b, p_b):
    return p_a_and_b / p_b

p_king_given_face = conditional_probability(4/52, 12/52)
print(f"P(King | Face card) = {p_king_given_face:.4f}")

Step 2: PMF and PDF from scratch

def bernoulli_pmf(k, p):
    return p if k == 1 else (1 - p)

def categorical_pmf(k, probs):
    return probs[k]

def poisson_pmf(k, lam):
    return (lam ** k) * math.exp(-lam) / factorial(k)

def uniform_pdf(x, a, b):
    if a <= x <= b:
        return 1.0 / (b - a)
    return 0.0

def normal_pdf(x, mu, sigma):
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)

Step 3: Expected value and variance

def expected_value(values, probabilities):
    return sum(v * p for v, p in zip(values, probabilities))

def variance(values, probabilities):
    mu = expected_value(values, probabilities)
    return sum(p * (v - mu) ** 2 for v, p in zip(values, probabilities))

die_values = [1, 2, 3, 4, 5, 6]
die_probs = [1/6] * 6
mu = expected_value(die_values, die_probs)
var = variance(die_values, die_probs)
print(f"Die: E[X] = {mu:.4f}, Var(X) = {var:.4f}, SD = {var**0.5:.4f}")

Step 4: Sampling from distributions

def sample_bernoulli(p, n=1):
    return [1 if random.random() < p else 0 for _ in range(n)]

def sample_categorical(probs, n=1):
    cumulative = []
    total = 0
    for p in probs:
        total += p
        cumulative.append(total)
    samples = []
    for _ in range(n):
        r = random.random()
        for i, c in enumerate(cumulative):
            if r <= c:
                samples.append(i)
                break
    return samples

def sample_normal_box_muller(mu, sigma, n=1):
    samples = []
    for _ in range(n):
        u1 = random.random()
        u2 = random.random()
        z = math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)
        samples.append(mu + sigma * z)
    return samples

Step 5: Softmax and log probabilities

def softmax(logits):
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    log_sum_exp = max_logit + math.log(sum(math.exp(z) for z in shifted))
    return [z - log_sum_exp for z in logits]

def cross_entropy_loss(logits, target_index):
    log_probs = log_softmax(logits)
    return -log_probs[target_index]

Step 6: Central Limit Theorem demonstration

def demonstrate_clt(dist_fn, n_samples, n_averages):
    averages = []
    for _ in range(n_averages):
        samples = [dist_fn() for _ in range(n_samples)]
        averages.append(sum(samples) / len(samples))
    return averages

Step 7: Visualization

import matplotlib.pyplot as plt

xs = [mu + sigma * (i - 500) / 100 for i in range(1001)]
ys = [normal_pdf(x, mu, sigma) for x, mu, sigma in ...]
plt.plot(xs, ys)

Full implementations with all visualizations are in code/probability.py.

Use It

With NumPy and SciPy, everything above is one-liners:

import numpy as np
from scipy import stats

normal = stats.norm(loc=0, scale=1)
samples = normal.rvs(size=10000)
print(f"Mean: {np.mean(samples):.4f}, Std: {np.std(samples):.4f}")
print(f"P(X < 1.96) = {normal.cdf(1.96):.4f}")

logits = np.array([2.0, 1.0, 0.1])
from scipy.special import softmax, log_softmax
probs = softmax(logits)
log_probs = log_softmax(logits)
print(f"Softmax: {probs}")
print(f"Log-softmax: {log_probs}")

You built these from scratch. Now you know what the library calls are doing.

Exercises

Implement inverse transform sampling for the exponential distribution. Verify by sampling 10,000 values and comparing the histogram to the true PDF.

Build a joint distribution table for two loaded dice. Compute the marginal distributions and check whether the dice are independent.

Compute the cross-entropy loss for a 5-class classifier that outputs logits [2.0, 0.5, -1.0, 3.0, 0.1] when the correct class is index 3. Then verify your answer with PyTorch's nn.CrossEntropyLoss.

Write a function that takes a list of log probabilities and returns the most likely sequence, the total log probability, and the equivalent raw probability. Test it with a sentence of 50 words where each word has probability 0.01.

Key Terms

Term	What people say	What it actually means
Sample space	"All the possibilities"	The set S of every possible outcome of an experiment
PMF	"The probability function"	A function that gives the exact probability of each discrete outcome, summing to 1
PDF	"The probability curve"	A density function for continuous variables. Integrate it over an interval to get probability
Conditional probability	"Probability given something"	P(A\	B) = P(A and B) / P(B). The foundation of Bayesian thinking and Bayes' theorem
Independence	"They don't affect each other"	P(A and B) = P(A) * P(B). Knowing one event tells you nothing about the other
Expected value	"The average"	The probability-weighted sum of all outcomes. The loss function is an expected value
Variance	"How spread out"	The expected squared deviation from the mean. High variance = noisy, unstable estimates
Normal distribution	"The bell curve"	f(x) = (1/sqrt(2pisigma^2)) * exp(-(x-mu)^2/(2*sigma^2)). Appears everywhere due to the CLT
Central Limit Theorem	"Averages become normal"	The mean of many independent samples converges to a normal distribution regardless of the source
Joint distribution	"Two variables together"	P(X, Y) describes the probability of every combination of X and Y outcomes
Marginal distribution	"Sum out the other variable"	P(X) = sum_y P(X, Y). Recovers one variable's distribution from the joint
Log probability	"Log of the probability"	log P(x). Turns products into sums, preventing numerical underflow in long sequences
Softmax	"Turn scores into probabilities"	softmax(z_i) = exp(z_i) / sum(exp(z_j)). Maps real-valued logits to a valid probability distribution
Cross-entropy	"The loss function"	-sum(p_true * log(p_predicted)). Measures how different two distributions are. Lower is better
Logits	"Raw model outputs"	Unnormalized scores before softmax. Named after the logistic function
Sampling	"Drawing random values"	Generating values according to a probability distribution. How models generate output