Logistic Regression
> Logistic regression bends a straight line into an S-curve to answer yes-or-no questions with probabilities.
Type: Build
Languages: Python
Prerequisites: Phase 2 Lesson 1-2 (What Is ML, Linear Regression)
Time: ~90 minutes
Learning Objectives
- Implement logistic regression from scratch using the sigmoid function and binary cross-entropy loss
- Compute and interpret precision, recall, F1 score, and the confusion matrix for binary classification
- Explain why MSE fails for classification and why binary cross-entropy produces a convex cost surface
- Build a softmax regression model for multi-class classification and evaluate threshold tuning tradeoffs
The Problem
You want to predict whether a tumor is malignant or benign given its size. You try linear regression. It outputs numbers like 0.3 or 1.7 or -0.5. What do those mean? Is 1.7 "very malignant"? Is -0.5 "very benign"? Linear regression outputs unbounded numbers. Classification needs bounded probabilities between 0 and 1, and a clear decision: yes or no.
Logistic regression solves this. It takes the same linear combination (wx + b) and passes it through the sigmoid function, which squashes any number into the range (0, 1). The output is a probability. You set a threshold (usually 0.5) and make a decision.
This is one of the most widely used algorithms in practice. Despite its name, logistic regression is a classification algorithm, not a regression algorithm. The name comes from the logistic (sigmoid) function it uses.
The Concept
Why Linear Regression Fails for Classification
Imagine predicting pass/fail (1/0) based on study hours. Linear regression fits a line through the data:
hours: 1 2 3 4 5 6 7 8 9 10
actual: 0 0 0 0 1 1 1 1 1 1
A linear fit might produce predictions like -0.2 at hour 1 and 1.3 at hour 10. These values are not probabilities. They go below 0 and above 1. Worse, a single outlier (someone who studied 50 hours) would drag the entire line, changing predictions for everyone.
Classification needs a function that:
- Outputs values between 0 and 1 (probabilities)
- Creates a sharp transition (a decision boundary)
- Is not distorted by outliers far from the boundary
The Sigmoid Function
The sigmoid function does exactly this:
sigmoid(z) = 1 / (1 + e^(-z))
Properties:
- When z is large and positive, sigmoid(z) approaches 1
- When z is large and negative, sigmoid(z) approaches 0
- When z = 0, sigmoid(z) = 0.5
- The output is always between 0 and 1
- The function is smooth and differentiable everywhere
The derivative has a convenient form: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)). This makes gradient computation efficient.
Logistic Regression = Linear Model + Sigmoid
The model computes z = wx + b (same as linear regression), then applies sigmoid:
The output p is interpreted as P(y=1 | x), the probability that the input belongs to class 1. The decision boundary is where wx + b = 0, which makes sigmoid output exactly 0.5.
Binary Cross-Entropy Loss
You cannot use MSE for logistic regression. MSE with a sigmoid creates a non-convex cost surface with many local minima. Instead, use binary cross-entropy (log loss):
Loss = -(1/n) * sum(y * log(p) + (1-y) * log(1-p))
Why this works:
- When y=1 and p is close to 1: log(1) = 0, so loss is near 0 (correct, low cost)
- When y=1 and p is close to 0: log(0) approaches negative infinity, so loss is huge (wrong, high cost)
- When y=0 and p is close to 0: log(1) = 0, so loss is near 0 (correct, low cost)
- When y=0 and p is close to 1: log(0) approaches negative infinity, so loss is huge (wrong, high cost)
This loss function is convex for logistic regression, guaranteeing a single global minimum.
Gradient Descent for Logistic Regression
The gradients for binary cross-entropy with sigmoid have a clean form:
dL/dw = (1/n) * sum((p - y) * x)
dL/db = (1/n) * sum(p - y)
These look identical to the linear regression gradients. The difference is that p = sigmoid(wx + b) instead of p = wx + b. The sigmoid introduces the nonlinearity, but the gradient update rule stays the same.
The Decision Boundary
For a 2D input (two features), the decision boundary is the line where:
w1*x1 + w2*x2 + b = 0
Points on one side get classified as 1, points on the other side as 0. Logistic regression always produces a linear decision boundary. If you need a curved boundary, you either add polynomial features or use a nonlinear model.
Multi-Class Classification with Softmax
Binary logistic regression handles two classes. For k classes, use the softmax function:
softmax(z_i) = e^(z_i) / sum(e^(z_j) for all j)
Each class has its own weight vector. The model computes a score z_i for each class, then softmax converts scores to probabilities that sum to 1. The predicted class is the one with the highest probability.
The loss function becomes categorical cross-entropy:
Loss = -(1/n) * sum(sum(y_k * log(p_k)))
where y_k is 1 for the true class and 0 for all others (one-hot encoding).
Evaluation Metrics
Accuracy alone is not enough. For a dataset with 95% negative and 5% positive, a model that always predicts negative gets 95% accuracy but is useless.
Confusion Matrix:
| Predicted Positive | Predicted Negative | |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Precision: Of all predicted positives, how many are actually positive?
Precision = TP / (TP + FP)
Recall (Sensitivity): Of all actual positives, how many did we catch?
Recall = TP / (TP + FN)
F1 Score: Harmonic mean of precision and recall. Balances both metrics.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When to prioritize:
- Precision: when false positives are costly (spam filter, you do not want to block legitimate email)
- Recall: when false negatives are costly (cancer screening, you do not want to miss a tumor)
- F1: when you need a single balanced metric
Build It
Step 1: Sigmoid function and data generation
import random
import math
def sigmoid(z):
z = max(-500, min(500, z))
return 1.0 / (1.0 + math.exp(-z))
random.seed(42)
N = 200
X = []
y = []
for _ in range(N // 2):
X.append([random.gauss(2, 1), random.gauss(2, 1)])
y.append(0)
for _ in range(N // 2):
X.append([random.gauss(5, 1), random.gauss(5, 1)])
y.append(1)
combined = list(zip(X, y))
random.shuffle(combined)
X, y = zip(*combined)
X = list(X)
y = list(y)
print(f"Generated {N} samples (2 classes, 2 features)")
print(f"Class 0 center: (2, 2), Class 1 center: (5, 5)")
print(f"First 5 samples:")
for i in range(5):
print(f" Features: [{X[i][0]:.2f}, {X[i][1]:.2f}], Label: {y[i]}")
Step 2: Logistic regression from scratch
class LogisticRegression:
def __init__(self, n_features, learning_rate=0.01):
self.weights = [0.0] * n_features
self.bias = 0.0
self.lr = learning_rate
self.loss_history = []
def predict_proba(self, x):
z = sum(w * xi for w, xi in zip(self.weights, x)) + self.bias
return sigmoid(z)
def predict(self, x, threshold=0.5):
return 1 if self.predict_proba(x) >= threshold else 0
def compute_loss(self, X, y):
n = len(y)
total = 0.0
for i in range(n):
p = self.predict_proba(X[i])
p = max(1e-15, min(1 - 1e-15, p))
total += y[i] * math.log(p) + (1 - y[i]) * math.log(1 - p)
return -total / n
def fit(self, X, y, epochs=1000, print_every=200):
n = len(y)
n_features = len(X[0])
for epoch in range(epochs):
dw = [0.0] * n_features
db = 0.0
for i in range(n):
p = self.predict_proba(X[i])
error = p - y[i]
for j in range(n_features):
dw[j] += error * X[i][j]
db += error
for j in range(n_features):
self.weights[j] -= self.lr * (dw[j] / n)
self.bias -= self.lr * (db / n)
loss = self.compute_loss(X, y)
self.loss_history.append(loss)
if epoch % print_every == 0:
print(f" Epoch {epoch:4d} | Loss: {loss:.4f} | w: [{self.weights[0]:.3f}, {self.weights[1]:.3f}] | b: {self.bias:.3f}")
return self
def accuracy(self, X, y):
correct = sum(1 for i in range(len(y)) if self.predict(X[i]) == y[i])
return correct / len(y)
split = int(0.8 * N)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print("\n=== Training Logistic Regression ===")
model = LogisticRegression(n_features=2, learning_rate=0.1)
model.fit(X_train, y_train, epochs=1000, print_every=200)
print(f"\nTrain accuracy: {model.accuracy(X_train, y_train):.4f}")
print(f"Test accuracy: {model.accuracy(X_test, y_test):.4f}")
print(f"Weights: [{model.weights[0]:.4f}, {model.weights[1]:.4f}]")
print(f"Bias: {model.bias:.4f}")
Step 3: Confusion matrix and metrics from scratch
class ClassificationMetrics:
def __init__(self, y_true, y_pred):
self.tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
self.tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
self.fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
self.fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
def accuracy(self):
total = self.tp + self.tn + self.fp + self.fn
return (self.tp + self.tn) / total if total > 0 else 0
def precision(self):
denom = self.tp + self.fp
return self.tp / denom if denom > 0 else 0
def recall(self):
denom = self.tp + self.fn
return self.tp / denom if denom > 0 else 0
def f1(self):
p = self.precision()
r = self.recall()
return 2 * p * r / (p + r) if (p + r) > 0 else 0
def print_confusion_matrix(self):
print(f"\n Confusion Matrix:")
print(f" Predicted")
print(f" Pos Neg")
print(f" Actual Pos {self.tp:4d} {self.fn:4d}")
print(f" Actual Neg {self.fp:4d} {self.tn:4d}")
def print_report(self):
self.print_confusion_matrix()
print(f"\n Accuracy: {self.accuracy():.4f}")
print(f" Precision: {self.precision():.4f}")
print(f" Recall: {self.recall():.4f}")
print(f" F1 Score: {self.f1():.4f}")
y_pred_test = [model.predict(x) for x in X_test]
print("\n=== Classification Report (Test Set) ===")
metrics = ClassificationMetrics(y_test, y_pred_test)
metrics.print_report()
Step 4: Decision boundary analysis
print("\n=== Decision Boundary ===")
w1, w2 = model.weights
b = model.bias
print(f"Decision boundary: {w1:.4f}*x1 + {w2:.4f}*x2 + {b:.4f} = 0")
if abs(w2) > 1e-10:
print(f"Solved for x2: x2 = {-w1/w2:.4f}*x1 + {-b/w2:.4f}")
print("\nSample predictions near the boundary:")
test_points = [
[3.0, 3.0],
[3.5, 3.5],
[4.0, 4.0],
[2.5, 2.5],
[5.0, 5.0],
]
for point in test_points:
prob = model.predict_proba(point)
pred = model.predict(point)
print(f" [{point[0]}, {point[1]}] -> prob={prob:.4f}, class={pred}")
Step 5: Multi-class with softmax
class SoftmaxRegression:
def __init__(self, n_features, n_classes, learning_rate=0.01):
self.n_features = n_features
self.n_classes = n_classes
self.lr = learning_rate
self.weights = [[0.0] * n_features for _ in range(n_classes)]
self.biases = [0.0] * n_classes
def softmax(self, scores):
max_score = max(scores)
exp_scores = [math.exp(s - max_score) for s in scores]
total = sum(exp_scores)
return [e / total for e in exp_scores]
def predict_proba(self, x):
scores = [
sum(self.weights[k][j] * x[j] for j in range(self.n_features)) + self.biases[k]
for k in range(self.n_classes)
]
return self.softmax(scores)
def predict(self, x):
probs = self.predict_proba(x)
return probs.index(max(probs))
def fit(self, X, y, epochs=1000, print_every=200):
n = len(y)
for epoch in range(epochs):
grad_w = [[0.0] * self.n_features for _ in range(self.n_classes)]
grad_b = [0.0] * self.n_classes
total_loss = 0.0
for i in range(n):
probs = self.predict_proba(X[i])
for k in range(self.n_classes):
target = 1.0 if y[i] == k else 0.0
error = probs[k] - target
for j in range(self.n_features):
grad_w[k][j] += error * X[i][j]
grad_b[k] += error
true_prob = max(probs[y[i]], 1e-15)
total_loss -= math.log(true_prob)
for k in range(self.n_classes):
for j in range(self.n_features):
self.weights[k][j] -= self.lr * (grad_w[k][j] / n)
self.biases[k] -= self.lr * (grad_b[k] / n)
if epoch % print_every == 0:
print(f" Epoch {epoch:4d} | Loss: {total_loss / n:.4f}")
return self
def accuracy(self, X, y):
correct = sum(1 for i in range(len(y)) if self.predict(X[i]) == y[i])
return correct / len(y)
random.seed(42)
X_3class = []
y_3class = []
centers = [(1, 1), (5, 1), (3, 5)]
for label, (cx, cy) in enumerate(centers):
for _ in range(50):
X_3class.append([random.gauss(cx, 0.8), random.gauss(cy, 0.8)])
y_3class.append(label)
combined = list(zip(X_3class, y_3class))
random.shuffle(combined)
X_3class, y_3class = zip(*combined)
X_3class = list(X_3class)
y_3class = list(y_3class)
split_3 = int(0.8 * len(X_3class))
X_train_3 = X_3class[:split_3]
y_train_3 = y_3class[:split_3]
X_test_3 = X_3class[split_3:]
y_test_3 = y_3class[split_3:]
print("\n=== Multi-class Softmax Regression (3 classes) ===")
softmax_model = SoftmaxRegression(n_features=2, n_classes=3, learning_rate=0.1)
softmax_model.fit(X_train_3, y_train_3, epochs=1000, print_every=200)
print(f"\nTrain accuracy: {softmax_model.accuracy(X_train_3, y_train_3):.4f}")
print(f"Test accuracy: {softmax_model.accuracy(X_test_3, y_test_3):.4f}")
print("\nSample predictions:")
for i in range(5):
probs = softmax_model.predict_proba(X_test_3[i])
pred = softmax_model.predict(X_test_3[i])
print(f" True: {y_test_3[i]}, Predicted: {pred}, Probs: [{', '.join(f'{p:.3f}' for p in probs)}]")
Step 6: Threshold tuning
print("\n=== Threshold Tuning ===")
print("Default threshold: 0.5. Adjusting the threshold trades precision for recall.\n")
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
print(f"{'Threshold':>10} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1':>10}")
print("-" * 52)
for t in thresholds:
y_pred_t = [1 if model.predict_proba(x) >= t else 0 for x in X_test]
m = ClassificationMetrics(y_test, y_pred_t)
print(f"{t:>10.1f} {m.accuracy():>10.4f} {m.precision():>10.4f} {m.recall():>10.4f} {m.f1():>10.4f}")
Use It
Now the same thing with scikit-learn.
from sklearn.linear_model import LogisticRegression as SklearnLR
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
np.random.seed(42)
X_0 = np.random.randn(100, 2) + [2, 2]
X_1 = np.random.randn(100, 2) + [5, 5]
X_sk = np.vstack([X_0, X_1])
y_sk = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X_sk, y_sk, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)
lr = SklearnLR()
lr.fit(X_tr_sc, y_tr)
y_pred = lr.predict(X_te_sc)
print("=== Scikit-learn Logistic Regression ===")
print(f"Accuracy: {accuracy_score(y_te, y_pred):.4f}")
print(f"Precision: {precision_score(y_te, y_pred):.4f}")
print(f"Recall: {recall_score(y_te, y_pred):.4f}")
print(f"F1: {f1_score(y_te, y_pred):.4f}")
print(f"\nConfusion Matrix:\n{confusion_matrix(y_te, y_pred)}")
print(f"\nClassification Report:\n{classification_report(y_te, y_pred)}")
Your from-scratch implementation produces the same decision boundary and metrics. Scikit-learn adds solver options (liblinear, lbfgs, saga), automatic regularization, multi-class strategies (one-vs-rest, multinomial), and numerical stability optimizations.
Ship It
This lesson produces:
code/logistic_regression.py- logistic regression from scratch with metrics
Exercises
- Generate a dataset that is NOT linearly separable (e.g., two concentric circles). Train logistic regression and observe its failure. Then add polynomial features (x1^2, x2^2, x1*x2) and train again. Show that the accuracy improves.
- Implement a multi-class confusion matrix for the 3-class softmax model. Compute per-class precision and recall. Which class is hardest to classify?
- Build an ROC curve from scratch. For 100 threshold values from 0 to 1, compute the true positive rate and false positive rate. Calculate the AUC (area under the curve) using the trapezoidal rule.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Logistic regression | "Regression for classification" | A linear model followed by a sigmoid function that outputs class probabilities |
| Sigmoid function | "The S-curve" | The function 1/(1+e^(-z)) that maps any real number to the range (0, 1) |
| Binary cross-entropy | "Log loss" | The loss function -[y*log(p) + (1-y)*log(1-p)] that penalizes confident wrong predictions severely |
| Decision boundary | "The dividing line" | The surface where the model's output probability equals 0.5, separating predicted classes |
| Softmax | "Multi-class sigmoid" | A function that converts a vector of scores into probabilities that sum to 1 |
| Precision | "How many selected are relevant" | TP / (TP + FP), the fraction of positive predictions that are actually positive |
| Recall | "How many relevant are selected" | TP / (TP + FN), the fraction of actual positives that the model correctly identifies |
| F1 score | "Balanced accuracy" | The harmonic mean of precision and recall: 2*P*R / (P+R) |
| Confusion matrix | "The error breakdown" | A table showing TP, TN, FP, FN counts for each class pair |
| Threshold | "The cutoff" | The probability value above which the model predicts class 1 (default 0.5, tunable) |
| One-hot encoding | "Binary columns for categories" | Representing class k as a vector of zeros with a 1 at position k |
| Categorical cross-entropy | "Multi-class log loss" | The extension of binary cross-entropy to k classes using one-hot encoded labels |