Handling Imbalanced Data

> When 99% of your data is "normal," accuracy is a lie.

Type: Build

Language: Python

Prerequisites: Phase 2, Lessons 01-09 (especially evaluation metrics)

Time: ~90 minutes

Learning Objectives

Implement SMOTE from scratch and explain how synthetic oversampling differs from random duplication
Evaluate imbalanced classifiers using F1, AUPRC, and Matthews Correlation Coefficient instead of accuracy
Compare class weighting, threshold tuning, and resampling strategies and select the right approach for a given imbalance ratio
Build a complete imbalanced data pipeline that combines SMOTE, class weights, and threshold optimization

The Problem

You build a fraud detection model. It gets 99.9% accuracy. You celebrate. Then you realize it predicts "not fraud" for every single transaction.

This is not a bug. It is the rational thing to do when only 0.1% of transactions are fraudulent. The model learns that always guessing the majority class minimizes overall error. It is technically correct and completely useless.

This happens everywhere real classification matters. Disease diagnosis: 1% positive rate. Network intrusion: 0.01% attacks. Manufacturing defects: 0.5% defective. Spam filtering: 20% spam. Churn prediction: 5% churners. The more consequential the minority class, the rarer it tends to be.

Accuracy fails because it treats all correct predictions equally. Correctly labeling a legitimate transaction and correctly catching fraud both count as one point of accuracy. But catching fraud is the entire reason the model exists. We need metrics, techniques, and training strategies that force the model to pay attention to the rare but important class.

The Concept

Why Accuracy Fails

Consider a dataset with 1000 samples: 990 negative, 10 positive. A model that always predicts negative:

Predicted Positive	Predicted Negative
Actually Positive	0 (TP)	10 (FN)
Actually Negative	0 (FP)	990 (TN)

Accuracy = (0 + 990) / 1000 = 99.0%

The model catches zero fraud. Zero disease. Zero defects. But accuracy says 99%. This is why accuracy is dangerous for imbalanced problems.

Better Metrics

Precision = TP / (TP + FP). Of everything flagged as positive, how many actually are? High precision means few false alarms.

Recall = TP / (TP + FN). Of everything actually positive, how many did we catch? High recall means few missed positives.

F1 Score = 2 * precision * recall / (precision + recall). The harmonic mean. Penalizes extreme imbalance between precision and recall more than the arithmetic mean would.

F-beta Score = (1 + beta^2) * precision * recall / (beta^2 * precision + recall). When beta > 1, recall matters more. When beta < 1, precision matters more. F2 is common in fraud detection (missing fraud is worse than a false alarm).

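AUPRC (Area Under Precision-Recall Curve). Like AUC-ROC but more informative for imbalanced data. A random classifier has AUPRC equal to the positive class rate (not 0.5 like ROC). Because neither precision nor recall counts true negatives, the large majority class cannot inflate the score, so gains on the minority class are easier to see.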

Matthews Correlation Coefficient = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Ranges from -1 to +1. Only gives a high score when the model does well on both classes. Balanced even when classes are very different sizes.

For the "always predict negative" model above: precision = 0/0 (undefined, conventionally set to 0), recall = 0/10 = 0, F1 = 0, and MCC = 0/0 (also undefined, conventionally set to 0). These metrics correctly identify the model as worthless.
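To make the contrast with accuracy concrete, here is a minimal sketch that computes these metrics straight from confusion-matrix counts. The function name `metrics_from_counts` and the second set of counts (a model that catches 8 of 10 positives with 12 false alarms) are illustrative assumptions, not results from this lesson.

```python
import math


def metrics_from_counts(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall, F-beta, and MCC from raw counts.

    Ratios that come out as 0/0 are reported as 0.0 (a common convention).
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    b2 = beta ** 2
    fbeta_denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / fbeta_denom if fbeta_denom > 0 else 0.0

    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom > 0 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "fbeta": fbeta,
        "mcc": mcc,
    }


# The "always predict negative" model from the table above.
always_negative = metrics_from_counts(tp=0, fp=0, fn=10, tn=990)

# Hypothetical model: catches 8 of 10 positives, raises 12 false alarms.
useful = metrics_from_counts(tp=8, fp=12, fn=2, tn=978, beta=2.0)

print(always_negative)
print(useful)
```

The hypothetical second model has slightly lower accuracy (0.986 versus 0.990) yet clearly better recall, F2, and MCC, which is the gap accuracy hides.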

The Imbalanced Data Pipeline

flowchart TD
    A[Imbalanced Dataset] --> B{Imbalance Ratio?}
    B -->|Mild: 80/20| C[Class Weights]
    B -->|Moderate: 95/5| D[SMOTE + Threshold Tuning]
    B -->|Severe: 99/1| E[SMOTE + Class Weights + Threshold]
    C --> F[Train Model]
    D --> F
    E --> F
    F --> G[Evaluate with F1 / AUPRC / MCC]
    G --> H{Good Enough?}
    H -->|No| I[Try Different Strategy]
    H -->|Yes| J[Deploy with Monitoring]
    I --> B

SMOTE: Synthetic Minority Oversampling Technique

Random oversampling duplicates existing minority samples. This works but risks overfitting because the model sees identical points repeatedly.

SMOTE creates new synthetic minority samples that are plausible but not copies. The algorithm:

For each minority sample x, find its k nearest neighbors among other minority samples
Pick one neighbor at random
Create a new sample on the line segment between x and that neighbor

The formula: new_sample = x + random(0, 1) * (neighbor - x)

This interpolates between real minority points, creating samples in the same region of feature space without just copying existing data.
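As a quick check of the formula, the sketch below interpolates between two minority points with a fixed t. The helper name `interpolate` is an assumption for illustration; the points and t = 0.4 are the ones used in the diagram that follows.

```python
import numpy as np


def interpolate(x, neighbor, t):
    """Return the point a fraction t of the way from x to neighbor (0 <= t <= 1)."""
    return x + t * (neighbor - x)


x1 = np.array([1.0, 2.0])
x2 = np.array([1.5, 2.5])

# t is normally drawn uniformly from [0, 1); it is fixed here for clarity.
new_point = interpolate(x1, x2, t=0.4)
print(new_point)  # approximately [1.2 2.2]
```

With t = 0 the result is x itself and with t = 1 it is the neighbor, so every synthetic point lies on the segment between two real minority samples.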

flowchart LR
    subgraph Original["Original Minority Points"]
        P1["x1 (1.0, 2.0)"]
        P2["x2 (1.5, 2.5)"]
        P3["x3 (2.0, 1.5)"]
    end
    subgraph SMOTE["SMOTE Generation"]
        direction TB
        S1["Pick x1, neighbor x2"]
        S2["random t = 0.4"]
        S3["new = x1 + 0.4*(x2-x1)"]
        S4["new = (1.2, 2.2)"]
        S1 --> S2 --> S3 --> S4
    end
    Original --> SMOTE
    subgraph Result["Augmented Set"]
        R1["x1 (1.0, 2.0)"]
        R2["x2 (1.5, 2.5)"]
        R3["x3 (2.0, 1.5)"]
        R4["synthetic (1.2, 2.2)"]
    end
    SMOTE --> Result

Sampling Strategies Compared

Random Oversampling: duplicate minority samples to match majority count.

Pros: simple, no information loss
Cons: exact duplicates cause overfitting, increases training time

Random Undersampling: remove majority samples to match minority count.

Pros: fast training, simple
Cons: throws away potentially useful majority data, higher variance

SMOTE: create synthetic minority samples via interpolation.

Pros: generates new data points, reduces overfitting compared to random oversampling
Cons: can create noisy samples near the decision boundary, does not account for majority class distribution

Strategy	Data Changed	Risk	When to Use
Oversample	Minority duplicated	Overfitting	Small datasets, moderate imbalance
Undersample	Majority removed	Information loss	Large datasets, want fast training
SMOTE	Synthetic minority added	Boundary noise	Moderate imbalance, enough minority samples for k-NN

Class Weights

Instead of changing the data, change how the model treats errors. Assign higher weight to misclassifying the minority class.

For a binary problem with 950 negative and 50 positive samples:

Weight for negative class = n_samples / (2 * n_negative) = 1000 / (2 * 950) = 0.526
Weight for positive class = n_samples / (2 * n_positive) = 1000 / (2 * 50) = 10.0

The positive class gets 19x the weight. Misclassifying one positive sample costs as much as misclassifying 19 negative samples. The model is forced to pay attention to the minority class.
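The arithmetic above can be reproduced in a few lines. This is a minimal sketch of the n_samples / (n_classes * count) rule; the function name `balanced_class_weights` is an assumption for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(labels)
ratio = weights[1] / weights[0]

print(round(weights[0], 3), round(weights[1], 3), round(ratio, 1))
```

The same rule extends to more than two classes, since n_classes is taken from the data rather than hard-coded as 2.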

In logistic regression, this modifies the loss function:

weighted_loss = -sum(w_i * [y_i * log(p_i) + (1-y_i) * log(1-p_i)])

where w_i depends on the class of sample i.

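Class weights are equivalent in expectation to random oversampling: weighting a sample by w has the same effect on the loss as including that sample w times. They achieve this without enlarging the dataset, so training is faster and the result does not depend on which minority samples happened to be duplicated.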
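The link between weighting and duplication can be checked numerically. The sketch below uses made-up labels and predicted probabilities, and an integer weight of 3 so that the duplication is exact rather than "in expectation"; `weighted_log_loss` is a hypothetical helper that mirrors the weighted_loss formula above.

```python
import numpy as np


def weighted_log_loss(y, p, w):
    """Sum of per-sample binary cross-entropy terms, each scaled by its weight."""
    return -np.sum(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))


# Toy data (illustrative only): three negatives, one positive.
y = np.array([0, 0, 0, 1])
p = np.array([0.1, 0.2, 0.3, 0.4])

# Option 1: give the positive sample weight 3.
w = np.array([1.0, 1.0, 1.0, 3.0])
loss_weighted = weighted_log_loss(y, p, w)

# Option 2: include the positive sample three times, all weights 1.
y_dup = np.array([0, 0, 0, 1, 1, 1])
p_dup = np.array([0.1, 0.2, 0.3, 0.4, 0.4, 0.4])
loss_duplicated = weighted_log_loss(y_dup, p_dup, np.ones(len(y_dup)))

print(loss_weighted, loss_duplicated)
```

Both losses agree, so their gradients do too. Random oversampling only matches this on average, because each run draws a different set of duplicates.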

Threshold Tuning

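Most classifiers output a probability. The default threshold is 0.5: if P(positive) >= 0.5, predict positive. But 0.5 is only a convention. For a model trained on imbalanced data without reweighting or resampling, predicted probabilities for the minority class tend to be low, so the threshold that maximizes F1 or recall is usually well below 0.5.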

The process:

Train a model
Get predicted probabilities on the validation set
Sweep thresholds from 0.0 to 1.0
Compute F1 (or your chosen metric) at each threshold
Pick the threshold that maximizes your metric

flowchart LR
    A[Model] --> B[Predict Probabilities]
    B --> C[Sweep Thresholds 0.0 to 1.0]
    C --> D[Compute F1 at Each]
    D --> E[Pick Best Threshold]
    E --> F[Use in Production]

A model might output P(fraud) = 0.15 for a fraudulent transaction. At threshold 0.5, this is classified as not fraud. At threshold 0.10, it is correctly caught. For threshold tuning, ranking matters more than calibration: as long as fraud cases tend to score higher than non-fraud cases, some threshold separates them well, even if the scores are not accurate probabilities.
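A small sweep shows the effect. The labels and probabilities below are invented for illustration (including a fraud case scored at 0.15), and `f1_at` is a hypothetical helper. In practice the sweep should run on a validation set, not the test set.

```python
import numpy as np

# Toy validation data (illustrative only): 8 legitimate, 2 fraudulent.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.12, 0.15, 0.30])


def f1_at(y_true, y_prob, threshold):
    """F1 when predicting positive for every probability >= threshold."""
    y_pred = y_prob >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
    return 2 * tp / denom if denom > 0 else 0.0


f1_default = f1_at(y_true, y_prob, 0.5)   # nothing is flagged
f1_lowered = f1_at(y_true, y_prob, 0.10)  # both frauds caught, one false alarm

# Use the observed probabilities as candidate cutoffs: F1 can only
# change when the threshold crosses one of them.
candidates = np.unique(y_prob)
best_threshold = max(candidates, key=lambda t: f1_at(y_true, y_prob, t))

print(f1_default, f1_lowered, best_threshold)
```

Using the observed probabilities as candidates is a design choice that avoids the floating-point drift of a fixed-step grid; a fixed grid is also fine when the number of samples is large.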

Cost-Sensitive Learning

Generalization of class weights. Instead of uniform costs, assign specific misclassification costs:

Predict Positive	Predict Negative
Actually Positive	0 (correct)	C_FN = 100
Actually Negative	C_FP = 1	0 (correct)

Missing a fraudulent transaction (FN) costs 100 times as much as a false alarm (FP). The model optimizes for total cost, not total error count.

This is the most principled approach when you can estimate real-world costs. A missed cancer diagnosis has a very different cost than a false alarm that leads to an extra biopsy. Making these costs explicit forces the right tradeoffs.
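One way to act on a cost matrix, assuming the model's probabilities are reasonably calibrated, is to predict whichever class has the lower expected cost. The sketch below derives the resulting threshold; the function names are hypothetical, and the costs are the ones from the table above.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has lower expected cost.

    Predicting negative costs p * cost_fn on average; predicting positive
    costs (1 - p) * cost_fp. Setting them equal gives
    p = cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def min_cost_prediction(p_positive, cost_fp, cost_fn):
    """Return 1 if flagging is cheaper in expectation, else 0."""
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0


threshold = cost_optimal_threshold(cost_fp=1, cost_fn=100)
print(round(threshold, 4))  # about 0.0099

print(min_cost_prediction(0.02, cost_fp=1, cost_fn=100))   # flag it
print(min_cost_prediction(0.005, cost_fp=1, cost_fn=100))  # let it through
```

With a 100:1 cost ratio, even a 2% fraud probability is worth flagging. If the probabilities are poorly calibrated, this threshold is only a starting point and should be checked against validation data.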

Decision Flowchart

flowchart TD
    A[Start: Imbalanced Dataset] --> B{How imbalanced?}
    B -->|"< 70/30"| C["Mild: try class weights first"]
    B -->|"70/30 to 95/5"| D["Moderate: SMOTE + class weights"]
    B -->|"> 95/5"| E["Severe: combine multiple strategies"]
    C --> F{Enough data?}
    D --> F
    E --> F
    F -->|"< 1000 samples"| G["Oversample or SMOTE, avoid undersampling"]
    F -->|"1000-10000"| H["SMOTE + threshold tuning"]
    F -->|"> 10000"| I["Undersampling OK, or class weights"]
    G --> J[Train + Evaluate with F1/AUPRC]
    H --> J
    I --> J
    J --> K{Recall high enough?}
    K -->|No| L[Lower threshold]
    K -->|Yes| M{Precision acceptable?}
    M -->|No| N[Raise threshold or add features]
    M -->|Yes| O[Ship it]

Build It

Step 1: Generate an imbalanced dataset

import numpy as np


def make_imbalanced_data(n_majority=950, n_minority=50, seed=42):
    rng = np.random.RandomState(seed)

    X_maj = rng.randn(n_majority, 2) * 1.0 + np.array([0.0, 0.0])
    X_min = rng.randn(n_minority, 2) * 0.8 + np.array([2.5, 2.5])

    X = np.vstack([X_maj, X_min])
    y = np.concatenate([np.zeros(n_majority), np.ones(n_minority)])

    shuffle_idx = rng.permutation(len(y))
    return X[shuffle_idx], y[shuffle_idx]

Step 2: SMOTE from scratch

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))


def find_k_neighbors(X, idx, k):
    distances = []
    for i in range(len(X)):
        if i == idx:
            continue
        d = euclidean_distance(X[idx], X[i])
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])
    return [d[0] for d in distances[:k]]


def smote(X_minority, k=5, n_synthetic=100, seed=42):
    rng = np.random.RandomState(seed)
    n_samples = len(X_minority)
    k = min(k, n_samples - 1)
    synthetic = []

    for _ in range(n_synthetic):
        idx = rng.randint(0, n_samples)
        neighbors = find_k_neighbors(X_minority, idx, k)
        neighbor_idx = neighbors[rng.randint(0, len(neighbors))]
        t = rng.random()
        new_point = X_minority[idx] + t * (X_minority[neighbor_idx] - X_minority[idx])
        synthetic.append(new_point)

    return np.array(synthetic)
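The loop-based version above recomputes neighbors for every synthetic point, which is easy to read but slow for larger minority sets. As an optional variation, here is a sketch that builds the neighbor table once with NumPy broadcasting. The name `smote_vectorized` is an assumption, and it draws its random numbers in a different order, so its output will not match `smote` point for point.

```python
import numpy as np


def smote_vectorized(X_minority, k=5, n_synthetic=100, seed=42):
    """SMOTE-style interpolation with a precomputed k-nearest-neighbor table."""
    rng = np.random.RandomState(seed)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise squared distances, shape (n, n). Memory grows as n^2.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist2 = np.sum(diffs ** 2, axis=2)
    np.fill_diagonal(dist2, np.inf)  # a point is never its own neighbor

    # Row i holds the indices of the k nearest minority neighbors of point i.
    neighbor_table = np.argsort(dist2, axis=1)[:, :k]

    # Pick a base point, one of its neighbors, and an interpolation factor.
    base_idx = rng.randint(0, n, size=n_synthetic)
    neighbor_col = rng.randint(0, k, size=n_synthetic)
    neighbor_idx = neighbor_table[base_idx, neighbor_col]
    t = rng.random_sample(size=(n_synthetic, 1))

    base = X_minority[base_idx]
    return base + t * (X_minority[neighbor_idx] - base)


# Illustrative usage on random 2-D points.
X_demo = np.random.RandomState(0).randn(20, 2)
synthetic_demo = smote_vectorized(X_demo, k=5, n_synthetic=50)
print(synthetic_demo.shape)  # (50, 2)
```

Because each synthetic point lies between two real points, the output always stays inside the bounding box of the minority samples, which makes a handy sanity check.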

Step 3: Random oversampling and undersampling

def random_oversample(X, y, seed=42):
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    max_count = counts.max()

    X_resampled = list(X)
    y_resampled = list(y)

    for cls, count in zip(classes, counts):
        if count < max_count:
            cls_indices = np.where(y == cls)[0]
            n_needed = max_count - count
            chosen = rng.choice(cls_indices, size=n_needed, replace=True)
            X_resampled.extend(X[chosen])
            y_resampled.extend(y[chosen])

    X_out = np.array(X_resampled)
    y_out = np.array(y_resampled)
    shuffle = rng.permutation(len(y_out))
    return X_out[shuffle], y_out[shuffle]


def random_undersample(X, y, seed=42):
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    min_count = counts.min()

    X_resampled = []
    y_resampled = []

    for cls in classes:
        cls_indices = np.where(y == cls)[0]
        chosen = rng.choice(cls_indices, size=min_count, replace=False)
        X_resampled.extend(X[chosen])
        y_resampled.extend(y[chosen])

    X_out = np.array(X_resampled)
    y_out = np.array(y_resampled)
    shuffle = rng.permutation(len(y_out))
    return X_out[shuffle], y_out[shuffle]

Step 4: Logistic regression with class weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))


def logistic_regression_weighted(X, y, weights, lr=0.01, epochs=200):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0

    for _ in range(epochs):
        z = X @ w + b
        pred = sigmoid(z)
        error = pred - y
        weighted_error = error * weights

        gradient_w = (X.T @ weighted_error) / n_samples
        gradient_b = np.mean(weighted_error)

        w -= lr * gradient_w
        b -= lr * gradient_b

    return w, b


def compute_class_weights(y):
    classes, counts = np.unique(y, return_counts=True)
    n_samples = len(y)
    n_classes = len(classes)
    weight_map = {}
    for cls, count in zip(classes, counts):
        weight_map[cls] = n_samples / (n_classes * count)
    return np.array([weight_map[yi] for yi in y])

Step 5: Threshold tuning

def find_optimal_threshold(y_true, y_probs, metric="f1"):
    best_threshold = 0.5
    best_score = -1.0

    for threshold in np.arange(0.05, 0.96, 0.01):
        y_pred = (y_probs >= threshold).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        if metric == "f1":
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
            score = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        elif metric == "recall":
            score = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        elif metric == "precision":
            score = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        else:
            raise ValueError(f"unknown metric: {metric!r}")

        if score > best_score:
            best_score = score
            best_threshold = threshold

    return best_threshold, best_score

Step 6: Evaluation functions

def confusion_matrix_values(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp, tn, fp, fn


def compute_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_matrix_values(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0

    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }
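The functions above cover threshold-dependent metrics but not AUPRC, which needs the predicted scores rather than hard labels. A minimal sketch follows, using the average-precision (step-wise) estimate of the area. The name `average_precision` is an assumption, and the code assumes 0/1 labels and no tied scores; library implementations handle ties more carefully.

```python
import numpy as np


def average_precision(y_true, y_scores):
    """Step-wise AUPRC: mean of the precision at each true positive's rank."""
    y_true = np.asarray(y_true)
    # Sort samples from highest to lowest score.
    order = np.argsort(-np.asarray(y_scores), kind="mergesort")
    y_sorted = y_true[order]

    n_pos = y_sorted.sum()
    if n_pos == 0:
        return 0.0  # no positives: precision-recall curve is undefined

    tp_cum = np.cumsum(y_sorted)                      # positives found in top k
    precision_at_k = tp_cum / np.arange(1, len(y_sorted) + 1)
    # Only ranks that hold a positive contribute to the sum.
    return float(np.sum(precision_at_k * y_sorted) / n_pos)


# Positives ranked 1st and 3rd: (1/1 + 2/3) / 2
ap_example = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.5])
print(round(ap_example, 4))  # 0.8333

# Perfect ranking: every positive scores above every negative.
ap_perfect = average_precision([0, 0, 1, 0, 1], [0.1, 0.2, 0.9, 0.3, 0.4])
print(ap_perfect)  # 1.0
```

Compare the result against the positive class rate of the evaluation set, since that is what a random ranking would score.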

Step 7: Compare all approaches

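X, y = make_imbalanced_data(950, 50, seed=42)

# 60/20/20 split: train, validation (for threshold tuning), test (for final scores)
n_train = int(0.6 * len(y))
n_val = int(0.8 * len(y))
X_train, X_val, X_test = X[:n_train], X[n_train:n_val], X[n_val:]
y_train, y_val, y_test = y[:n_train], y[n_train:n_val], y[n_val:]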

# Baseline: no treatment
w_base, b_base = logistic_regression_weighted(
    X_train, y_train, np.ones(len(y_train)), lr=0.1, epochs=300
)
probs_base = sigmoid(X_test @ w_base + b_base)
preds_base = (probs_base >= 0.5).astype(int)

# Oversampled
X_over, y_over = random_oversample(X_train, y_train)
w_over, b_over = logistic_regression_weighted(
    X_over, y_over, np.ones(len(y_over)), lr=0.1, epochs=300
)
preds_over = (sigmoid(X_test @ w_over + b_over) >= 0.5).astype(int)

# SMOTE
minority_mask = y_train == 1
X_minority = X_train[minority_mask]
synthetic = smote(X_minority, k=5, n_synthetic=len(y_train) - 2 * int(minority_mask.sum()))
X_smote = np.vstack([X_train, synthetic])
y_smote = np.concatenate([y_train, np.ones(len(synthetic))])
w_sm, b_sm = logistic_regression_weighted(
    X_smote, y_smote, np.ones(len(y_smote)), lr=0.1, epochs=300
)
preds_smote = (sigmoid(X_test @ w_sm + b_sm) >= 0.5).astype(int)

# Class weights
sample_weights = compute_class_weights(y_train)
w_cw, b_cw = logistic_regression_weighted(
    X_train, y_train, sample_weights, lr=0.1, epochs=300
)
probs_cw = sigmoid(X_test @ w_cw + b_cw)
preds_cw = (probs_cw >= 0.5).astype(int)

# Threshold tuning (tune on held-out validation set, not test set)
probs_val = sigmoid(X_val @ w_cw + b_cw)
best_thresh, best_f1 = find_optimal_threshold(y_val, probs_val, metric="f1")
preds_thresh = (probs_cw >= best_thresh).astype(int)

The code file runs all of this in a single script and prints results.

Use It

With scikit-learn and imbalanced-learn, these techniques are one-liners:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

model_weighted = LogisticRegression(class_weight="balanced")
model_weighted.fit(X_train, y_train)
print(classification_report(y_test, model_weighted.predict(X_test)))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
model_smote = LogisticRegression()
model_smote.fit(X_resampled, y_resampled)
print(classification_report(y_test, model_smote.predict(X_test)))

pipeline = Pipeline([
    ("smote", SMOTE()),
    ("model", LogisticRegression(class_weight="balanced")),
])
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))

The from-scratch implementations show exactly what each technique does. SMOTE is just k-NN interpolation on the minority class. Class weights multiply the loss. Threshold tuning is a for-loop over cutoffs. No magic.

Ship It

This lesson produces:

outputs/skill-imbalanced-data.md -- a decision checklist for handling imbalanced classification problems

Exercises

Borderline-SMOTE: modify the SMOTE implementation to only generate synthetic samples for minority points that are near the decision boundary (those whose k-nearest neighbors include majority class samples). Compare results with standard SMOTE on a dataset where classes overlap.

Cost matrix optimization: implement cost-sensitive learning where the cost matrix is a parameter. Create a function that takes a cost matrix and returns optimal predictions that minimize expected cost. Test with different cost ratios (1:10, 1:100, 1:1000) and plot how the precision-recall tradeoff changes.

Threshold calibration: implement Platt scaling (fit a logistic regression on the model's raw outputs to produce calibrated probabilities). Compare the precision-recall curve before and after calibration. Show that calibration does not change the ranking (AUC stays the same) but makes the probabilities more meaningful.

Ensemble with balanced bagging: train multiple models, each on a balanced bootstrap sample (all minority + random subset of majority). Average their predictions. Compare this approach against a single model with SMOTE. Measure both performance and variance across runs.

Imbalance ratio experiment: take a balanced dataset and progressively increase the imbalance ratio (50/50, 70/30, 90/10, 95/5, 99/1). For each ratio, train with and without SMOTE. Plot F1 vs imbalance ratio for both approaches. At what ratio does SMOTE start making a meaningful difference?

Key Terms

Term	What people say	What it actually means
Class imbalance	"One class has way more samples"	The distribution of classes in the dataset is significantly skewed, causing models to favor the majority class
SMOTE	"Synthetic oversampling"	Creates new minority samples by interpolating between existing minority samples and their k-nearest minority neighbors
Class weights	"Making errors on rare classes more expensive"	Multiplying the loss function by class-specific weights so the model penalizes minority misclassification more heavily
Threshold tuning	"Moving the decision boundary"	Changing the probability cutoff for classification from the default 0.5 to a value that optimizes the desired metric
Precision-recall tradeoff	"You cannot have both"	Lowering the threshold catches more positives (higher recall) but also flags more false positives (lower precision), and vice versa
AUPRC	"Area under the PR curve"	Summarizes the precision-recall curve into a single number; more informative than AUC-ROC when classes are heavily imbalanced
Matthews Correlation Coefficient	"The balanced metric"	A correlation between predicted and actual labels that produces a high score only when the model performs well on both classes
Cost-sensitive learning	"Different mistakes cost different amounts"	Incorporating real-world misclassification costs into the training objective so the model optimizes for total cost, not error count
Random oversampling	"Duplicate the minority"	Repeating minority class samples to balance class counts; simple but risks overfitting to duplicated points