Bayes' Theorem
> Probability is about what you expect. Bayes' theorem is about what you learn.
Type: Build
Language: Python
Prerequisites: Phase 1, Lesson 06 (Probability Fundamentals)
Time: ~75 minutes
Learning Objectives
- Apply Bayes' theorem to compute posterior probabilities from priors, likelihoods, and evidence
- Build a Naive Bayes text classifier from scratch with Laplace smoothing and log-space computation
- Compare MLE and MAP estimation and explain how MAP corresponds to L2 regularization
- Implement sequential Bayesian updating using Beta-Binomial conjugate priors for A/B testing
The Problem
A medical test is 99% accurate. You test positive. What are the chances you actually have the disease?
Most people say 99%. The real answer depends on how rare the disease is. If 1 in 10,000 people have it, a positive result only gives you about a 1% chance of being sick. The other 99% of positive results are false alarms from healthy people.
This is not a trick question. It is Bayes' theorem. Every spam filter, every medical diagnostic, every machine learning model that quantifies uncertainty uses this exact reasoning. You start with a belief. You see evidence. You update.
If you build ML systems without understanding this, you will misinterpret model outputs, set bad thresholds, and ship overconfident predictions.
The Concept
From joint probability to Bayes
You already know from Lesson 06 that conditional probability is:
P(A|B) = P(A and B) / P(B)
And symmetrically:
P(B|A) = P(A and B) / P(A)
Both expressions share the same numerator: P(A and B). Set them equal and rearrange:
P(A and B) = P(A|B) * P(B) = P(B|A) * P(A)
Therefore:
P(A|B) = P(B|A) * P(A) / P(B)
That is Bayes' theorem. Four quantities, one equation.
The four parts
| Part | Name | What it means | |
|---|---|---|---|
| P(A\ | B) | Posterior | Your updated belief about A after seeing evidence B |
| P(B\ | A) | Likelihood | How probable the evidence B is if A is true |
| P(A) | Prior | Your belief about A before seeing any evidence | |
| P(B) | Evidence | Total probability of seeing B under all possibilities |
The evidence term P(B) acts as a normalizer. You can expand it using the law of total probability:
P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
Medical test example
A disease affects 1 in 10,000 people. The test is 99% accurate (catches 99% of sick people, gives false positives 1% of the time).
P(sick) = 0.0001 (prior: disease is rare)
P(positive|sick) = 0.99 (likelihood: test catches it)
P(positive|healthy) = 0.01 (false positive rate)
P(positive) = P(positive|sick) * P(sick) + P(positive|healthy) * P(healthy)
= 0.99 * 0.0001 + 0.01 * 0.9999
= 0.000099 + 0.009999
= 0.010098
P(sick|positive) = P(positive|sick) * P(sick) / P(positive)
= 0.99 * 0.0001 / 0.010098
= 0.0098
= 0.98%
Less than 1%. The prior dominates. When a condition is rare, even accurate tests produce mostly false positives. This is why doctors order confirmation tests.
Spam filter example
You receive an email containing the word "lottery". Is it spam?
P(spam) = 0.3 (30% of email is spam)
P("lottery"|spam) = 0.05 (5% of spam emails contain "lottery")
P("lottery"|not spam) = 0.001 (0.1% of legitimate emails contain "lottery")
P("lottery") = 0.05 * 0.3 + 0.001 * 0.7
= 0.015 + 0.0007
= 0.0157
P(spam|"lottery") = 0.05 * 0.3 / 0.0157
= 0.955
= 95.5%
One word shifts the probability from 30% to 95.5%. A real spam filter applies Bayes across hundreds of words simultaneously.
Naive Bayes: independence assumption
Naive Bayes extends this to multiple features by assuming all features are conditionally independent given the class:
P(class | feature_1, feature_2, ..., feature_n)
= P(class) * P(feature_1|class) * P(feature_2|class) * ... * P(feature_n|class)
/ P(feature_1, feature_2, ..., feature_n)
The "naive" part is the independence assumption. In text, word occurrences are not independent ("New" and "York" are correlated). But the assumption works surprisingly well in practice because the classifier only needs to rank classes, not produce calibrated probabilities.
Since the denominator is the same for all classes, you can skip it and just compare numerators:
score(class) = P(class) * product of P(feature_i | class)
Pick the class with the highest score.
Maximum likelihood estimation (MLE)
How do you get P(feature|class) from training data? Count.
P("free"|spam) = (number of spam emails containing "free") / (total spam emails)
This is MLE: choose the parameter values that make the observed data most likely. You are maximizing the likelihood function, which for discrete counts reduces to relative frequency.
Problem: if a word never appears in spam during training, MLE gives it probability zero. One unseen word kills the entire product. Fix this with Laplace smoothing:
P(word|class) = (count(word, class) + 1) / (total_words_in_class + vocabulary_size)
Adding 1 to every count ensures no probability is ever zero.
Maximum a posteriori (MAP)
MLE asks: what parameters maximize P(data|parameters)?
MAP asks: what parameters maximize P(parameters|data)?
By Bayes' theorem:
P(parameters|data) proportional to P(data|parameters) * P(parameters)
MAP adds a prior over the parameters themselves. If you believe parameters should be small, you encode that as a prior that penalizes large values. This is identical to L2 regularization in ML. The "ridge" penalty in ridge regression is literally a Gaussian prior on the weights.
| Estimation | Optimizes | ML equivalent | |
|---|---|---|---|
| MLE | P(data\ | params) | Unregularized training |
| MAP | P(data\ | params) * P(params) | L2 / L1 regularization |
Bayesian vs frequentist: the practical difference
Frequentists treat parameters as fixed unknowns. They ask: "If I repeated this experiment many times, what would happen?"
Bayesians treat parameters as distributions. They ask: "Given what I have observed, what do I believe about the parameters?"
For building ML systems, the practical difference:
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Output | Point estimate | Distribution over values |
| Uncertainty | Confidence intervals (about procedure) | Credible intervals (about parameter) |
| Small data | Can overfit | Prior acts as regularization |
| Computation | Usually faster | Often requires sampling (MCMC) |
Most production ML is frequentist (SGD, point estimates). Bayesian methods shine when you need calibrated uncertainty (medical decisions, safety-critical systems) or when data is scarce (few-shot learning, cold start).
Why Bayesian thinking matters for ML
The connection is deeper than analogy:
Priors are regularization. A Gaussian prior on weights is L2 regularization. A Laplace prior is L1. Every time you add a regularization term, you are making a Bayesian statement about what parameter values you expect.
Posteriors are uncertainty. A single predicted probability tells you nothing about how confident the model is in that estimate. Bayesian methods give you a distribution: "I think P(spam) is between 0.8 and 0.95."
Bayes updates are online learning. Today's posterior becomes tomorrow's prior. When your model sees new data, it updates its beliefs incrementally instead of retraining from scratch.
Model comparison is Bayesian. Bayesian information criterion (BIC), marginal likelihood, and Bayes factors all use Bayesian reasoning to choose between models without overfitting.
Build It
Step 1: Bayes theorem function
def bayes(prior, likelihood, false_positive_rate):
evidence = likelihood * prior + false_positive_rate * (1 - prior)
posterior = likelihood * prior / evidence
return posterior
result = bayes(prior=0.0001, likelihood=0.99, false_positive_rate=0.01)
print(f"P(sick|positive) = {result:.4f}")
Step 2: Naive Bayes classifier
import math
from collections import defaultdict
class NaiveBayes:
def __init__(self, smoothing=1.0):
self.smoothing = smoothing
self.class_counts = defaultdict(int)
self.word_counts = defaultdict(lambda: defaultdict(int))
self.class_word_totals = defaultdict(int)
self.vocab = set()
def train(self, documents, labels):
for doc, label in zip(documents, labels):
self.class_counts[label] += 1
words = doc.lower().split()
for word in words:
self.word_counts[label][word] += 1
self.class_word_totals[label] += 1
self.vocab.add(word)
def predict(self, document):
words = document.lower().split()
total_docs = sum(self.class_counts.values())
vocab_size = len(self.vocab)
best_class = None
best_score = float("-inf")
for cls in self.class_counts:
score = math.log(self.class_counts[cls] / total_docs)
for word in words:
count = self.word_counts[cls].get(word, 0)
total = self.class_word_totals[cls]
score += math.log((count + self.smoothing) / (total + self.smoothing * vocab_size))
if score > best_score:
best_score = score
best_class = cls
return best_class
Log probabilities prevent underflow. Multiplying many small probabilities produces numbers too tiny for floating point. Summing log-probabilities is numerically stable and mathematically equivalent.
Step 3: Train on spam data
train_docs = [
"win free money now",
"free lottery ticket winner",
"claim your prize today free",
"urgent offer free cash",
"congratulations you won free",
"meeting tomorrow at noon",
"project update attached",
"can we schedule a call",
"quarterly report review",
"lunch on thursday sounds good",
"team standup notes attached",
"please review the pull request",
]
train_labels = [
"spam", "spam", "spam", "spam", "spam",
"ham", "ham", "ham", "ham", "ham", "ham", "ham",
]
classifier = NaiveBayes()
classifier.train(train_docs, train_labels)
test_messages = [
"free money waiting for you",
"meeting rescheduled to friday",
"you won a free prize",
"please review the attached report",
]
for msg in test_messages:
print(f" '{msg}' -> {classifier.predict(msg)}")
Step 4: Inspect the learned probabilities
def show_top_words(classifier, cls, n=5):
vocab_size = len(classifier.vocab)
total = classifier.class_word_totals[cls]
probs = {}
for word in classifier.vocab:
count = classifier.word_counts[cls].get(word, 0)
probs[word] = (count + classifier.smoothing) / (total + classifier.smoothing * vocab_size)
sorted_words = sorted(probs.items(), key=lambda x: x[1], reverse=True)
for word, prob in sorted_words[:n]:
print(f" {word}: {prob:.4f}")
print("\nTop spam words:")
show_top_words(classifier, "spam")
print("\nTop ham words:")
show_top_words(classifier, "ham")
Use It
Scikit-learn ships production-ready naive Bayes implementations:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = MultinomialNB()
clf.fit(X_train, train_labels)
X_test = vectorizer.transform(test_messages)
predictions = clf.predict(X_test)
for msg, pred in zip(test_messages, predictions):
print(f" '{msg}' -> {pred}")
Same algorithm. CountVectorizer handles tokenization and vocabulary building. MultinomialNB handles smoothing and log-probabilities internally. Your from-scratch version does the same thing in 40 lines.
Ship It
The NaiveBayes class built here demonstrates the full pipeline: tokenization, probability estimation with Laplace smoothing, log-space prediction. The code in code/bayes.py runs end-to-end with no dependencies beyond Python's standard library.
Conjugate Priors
When the prior and posterior belong to the same family of distributions, the prior is called "conjugate." This makes Bayesian updating algebraically clean -- you get a closed-form posterior without numerical integration.
| Likelihood | Conjugate Prior | Posterior | Example |
|---|---|---|---|
| Bernoulli | Beta(a, b) | Beta(a + successes, b + failures) | Coin flip bias estimation |
| Normal (known variance) | Normal(mu_0, sigma_0) | Normal(weighted mean, smaller variance) | Sensor calibration |
| Poisson | Gamma(a, b) | Gamma(a + sum of counts, b + n) | Modeling arrival rates |
| Multinomial | Dirichlet(alpha) | Dirichlet(alpha + counts) | Topic modeling, language models |
Why this matters: without conjugate priors, you need Monte Carlo sampling or variational inference to approximate the posterior. With conjugate priors, you just update two numbers.
The Beta distribution is the most common conjugate prior in practice. Beta(a, b) represents your belief about a probability parameter. The mean is a/(a+b). The larger a+b, the more concentrated (confident) the distribution.
Special cases of the Beta prior:
- Beta(1, 1) = uniform. You have no opinion about the parameter.
- Beta(10, 10) = peaked at 0.5. You strongly believe the parameter is near 0.5.
- Beta(1, 10) = skewed toward 0. You believe the parameter is small.
The update rule is dead simple:
Prior: Beta(a, b)
Data: s successes, f failures
Posterior: Beta(a + s, b + f)
No integrals. No sampling. Just addition.
Sequential Bayesian Updating
Bayesian inference is naturally sequential. Today's posterior becomes tomorrow's prior. This is how real systems learn incrementally without reprocessing all historical data.
Concrete example: estimating whether a coin is fair.
Day 1: No data yet.
Start with Beta(1, 1) -- a uniform prior. You have no opinion.
- Prior mean: 0.5
- Prior is flat across [0, 1]
Day 2: Observe 7 heads, 3 tails.
Posterior = Beta(1 + 7, 1 + 3) = Beta(8, 4)
- Posterior mean: 8/12 = 0.667
- Evidence suggests the coin is biased toward heads
Day 3: Observe 5 more heads, 5 more tails.
Use yesterday's posterior as today's prior.
Posterior = Beta(8 + 5, 4 + 5) = Beta(13, 9)
- Posterior mean: 13/22 = 0.591
- The balanced new data pulled the estimate back toward 0.5
Beta(1,1)
mean = 0.50"] -->|"7H, 3T"| B["Posterior 1
Beta(8,4)
mean = 0.67"] B -->|"becomes prior"| C["Prior 2
Beta(8,4)"] C -->|"5H, 5T"| D["Posterior 2
Beta(13,9)
mean = 0.59"]
The order of observations does not matter. Beta(1,1) updated with all 12 heads and 8 tails at once gives Beta(13, 9) -- the same result. Sequential updating and batch updating are mathematically equivalent. But sequential updating lets you make decisions at each step without storing raw data.
This is the foundation of online learning in production ML systems. Thompson sampling for bandits, incremental recommendation systems, and streaming anomaly detectors all use this pattern.
Connection to A/B Testing
A/B testing is Bayesian inference in disguise.
Setup: you are testing two button colors. Variant A (blue) and variant B (green). You want to know which one gets more clicks.
The Bayesian A/B test:
- Prior. Start with Beta(1, 1) for both variants. No prior preference.
- Data. Variant A: 50 clicks out of 1000 views. Variant B: 65 clicks out of 1000 views.
- Posteriors.
- A: Beta(1 + 50, 1 + 950) = Beta(51, 951). Mean = 0.051
- B: Beta(1 + 65, 1 + 935) = Beta(66, 936). Mean = 0.066
- Decision. Compute P(B > A) -- the probability that B's true conversion rate is higher than A's.
Computing P(B > A) analytically is hard. But Monte Carlo makes it trivial:
1. Draw 100,000 samples from Beta(51, 951) -> samples_A
2. Draw 100,000 samples from Beta(66, 936) -> samples_B
3. P(B > A) = fraction of samples where B > A
If P(B > A) > 0.95, you ship variant B. If it is between 0.05 and 0.95, you keep collecting data. If P(B > A) < 0.05, you ship variant A.
Advantages over frequentist A/B testing:
- You get a direct probability statement: "there is a 97% chance B is better"
- No p-value confusion. No "fail to reject the null hypothesis" hedging.
- You can check results at any time without inflating false positive rates (no "peeking problem")
- You can incorporate prior knowledge (e.g., previous tests suggest conversion rates are usually 3-8%)
| Aspect | Frequentist A/B | Bayesian A/B |
|---|---|---|
| Output | p-value | P(B > A) |
| Interpretation | "How surprising is this data if A=B?" | "How likely is B better than A?" |
| Early stopping | Inflates false positives | Safe at any point (given a well-chosen prior and correctly specified model) |
| Prior knowledge | Not used | Encoded as Beta prior |
| Decision rule | p < 0.05 | P(B > A) > threshold |
Exercises
- Multiple tests. A patient tests positive twice on independent tests (both 99% accurate, disease prevalence 1 in 10,000). What is P(sick) after both tests? Use the posterior from the first test as the prior for the second.
- Smoothing impact. Run the spam classifier with smoothing values of 0.01, 0.1, 1.0, and 10.0. How do the top word probabilities change? What happens with smoothing=0 and a word that appears only in ham?
- Add features. Extend the NaiveBayes class to also use message length (short/long) as a feature alongside word counts. Estimate P(short|spam) and P(short|ham) from the training data and fold it into the prediction score.
- MAP by hand. Given observed data (7 heads in 10 coin flips), compute the MAP estimate of the bias using a Beta(2,2) prior. Compare it to the MLE estimate (7/10).
Key Terms
| Term | What people say | What it actually means | |
|---|---|---|---|
| Prior | "My initial guess" | P(hypothesis) before observing evidence. In ML: the regularization term. | |
| Likelihood | "How well the data fits" | P(evidence\ | hypothesis). How probable the observed data is under a specific hypothesis. |
| Posterior | "My updated belief" | P(hypothesis\ | evidence). The prior multiplied by the likelihood, then normalized. |
| Evidence | "The normalizing constant" | P(data) across all hypotheses. Ensures the posterior sums to 1. | |
| Naive Bayes | "That simple text classifier" | A classifier that assumes features are independent given the class. Works well despite the false assumption. | |
| Laplace smoothing | "Add-one smoothing" | Adding a small count to every feature to prevent zero probabilities from unseen data. | |
| MLE | "Just use the frequencies" | Choose parameters that maximize P(data\ | parameters). No prior. Can overfit with small data. |
| MAP | "MLE with a prior" | Choose parameters that maximize P(data\ | parameters) * P(parameters). Equivalent to regularized MLE. |
| Log-probability | "Work in log space" | Using log(P) instead of P to avoid floating-point underflow when multiplying many small numbers. | |
| False positive | "A wrong alarm" | The test says positive, but the true state is negative. Drives the base rate fallacy. |
Further Reading
- 3Blue1Brown: Bayes' theorem - visual explanation with the medical test example
- Stanford CS229: Generative Learning Algorithms - naive Bayes and its connection to discriminative models
- Think Bayes - free book, Bayesian statistics with Python code
- scikit-learn Naive Bayes - production implementations and when to use each variant