Debugging Neural Networks

> Your network compiled. It ran. It produced a number. The number is wrong and nothing crashed. Welcome to the hardest kind of debugging -- the kind where there is no error message.

Type: Practice

Languages: Python, PyTorch

Prerequisites: Phase 03 Lessons 01-10 (especially backpropagation, loss functions, optimizers)

Time: ~90 minutes

Learning Objectives

Diagnose common neural network failures (NaN loss, flat loss curve, overfitting, oscillation) using systematic debugging strategies
Apply the "overfit one batch" technique to verify that your model architecture and training loop are correct
Inspect gradient magnitudes, activation distributions, and weight norms to identify vanishing/exploding gradient problems
Build a debugging checklist that covers data pipeline, model architecture, loss function, optimizer, and learning rate issues

The Problem

Traditional software crashes when it is broken. A null pointer throws an exception. A type mismatch fails at compile time. An off-by-one error produces a clearly wrong output.

Neural networks do not give you that luxury.

A broken neural network runs to completion, prints a loss value, and outputs predictions. The loss might decrease. The predictions might look plausible. But the model is silently wrong -- learning shortcuts, memorizing noise, or converging to a useless local minimum. Google researchers estimated that 60-70% of ML debugging time is spent on "silent" bugs that produce no errors but degrade model quality.

The difference between a working model and a broken one is often a single misplaced line: a missing zero_grad(), a transposed dimension, a learning rate off by 10x. The canonical "Recipe for Training Neural Networks" (2019) opens with this: "The most common neural net mistakes are bugs that don't crash."

This lesson teaches you to find those bugs.

The Concept

The Debugging Mindset

Forget print-and-pray debugging. Neural network debugging requires a systematic approach because the feedback loop is slow (minutes to hours per training run) and the symptoms are ambiguous (bad loss could mean 20 different things).

The golden rule: start simple, add complexity one piece at a time, and verify each piece independently.

flowchart TD
    A["Loss not decreasing"] --> B{"Check learning rate"}
    B -->|"Too high"| C["Loss oscillates or explodes"]
    B -->|"Too low"| D["Loss barely moves"]
    B -->|"Reasonable"| E{"Check gradients"}
    E -->|"All zeros"| F["Dead ReLUs or vanishing gradients"]
    E -->|"NaN/Inf"| G["Exploding gradients"]
    E -->|"Normal"| H{"Check data pipeline"}
    H -->|"Labels shuffled"| I["Random-chance accuracy"]
    H -->|"Preprocessing bug"| J["Model learns noise"]
    H -->|"Data is fine"| K{"Check architecture"}
    K -->|"Too small"| L["Underfitting"]
    K -->|"Too deep"| M["Optimization difficulty"]

Symptom 1: Loss Not Decreasing

This is the most common complaint. The training loop runs, epochs tick by, and the loss stays flat or oscillates wildly.

Wrong learning rate. Too high: loss oscillates or jumps to NaN. Too low: loss decreases so slowly it looks flat. For Adam, start at 1e-3. For SGD, start at 1e-1 or 1e-2. Always try 3 learning rates spanning 10x each (e.g., 1e-2, 1e-3, 1e-4) before concluding something else is wrong.

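Dead ReLUs. A ReLU outputs 0 and passes back a gradient of 0 whenever its input is negative. If a neuron's input is negative for every training example (for instance, after one large update pushes its bias far below zero), it receives no gradient and cannot recover on its own. If enough neurons die, the network cannot learn. Check: print the fraction of activations that are exactly 0 after each ReLU layer. If more than 50% are zero batch after batch, switch to LeakyReLU or reduce the learning rate.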
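The check described above can be sketched as follows. This is a minimal illustration, assuming a single hypothetical `nn.Linear` layer whose first 8 biases are forced far negative so that those units are dead. It separates two numbers that are easy to confuse: the fraction of zero activations, and the number of units that are zero for every sample in the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical layer; the first 8 units are killed on purpose for illustration.
layer = nn.Linear(10, 32)
with torch.no_grad():
    layer.bias[:8] = -100.0

x = torch.randn(256, 10)
act = torch.relu(layer(x))

# Fraction of individual activations that are exactly zero.
zero_fraction = (act == 0).float().mean().item()

# A unit counts as dead on this batch only if it is zero for EVERY sample.
dead_units = (act == 0).all(dim=0)
num_dead = int(dead_units.sum())

print(f"zero fraction: {zero_fraction:.1%}, dead units: {num_dead} of {act.shape[1]}")
```

A high zero fraction alone is only a warning sign, since a healthy ReLU layer zeroes a large share of its inputs anyway. Units that stay at zero across many batches are the ones that have actually stopped learning.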

Vanishing gradients. In deep networks with sigmoid or tanh activations, gradients shrink exponentially as they propagate backward. By the time they reach the first layer, they are ~0. The first layers stop learning. Fix: use ReLU/GELU, add residual connections, or use batch normalization.

Exploding gradients. The opposite problem -- gradients grow exponentially. Common in RNNs and very deep networks. Loss jumps to NaN. Fix: gradient clipping (torch.nn.utils.clip_grad_norm_), lower learning rate, or add normalization.
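A minimal sketch of where clipping sits in a training step, assuming a toy model, badly scaled random inputs, and a threshold of 1.0 (all illustrative choices, not recommendations). The call goes after `backward()` and before `step()`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data, purely illustrative.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
x = torch.randn(16, 10) * 100.0   # badly scaled inputs to provoke large gradients
y = torch.randint(0, 2, (16,))

max_norm = 1.0  # assumed threshold; tune per model

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

# Clip AFTER backward() and BEFORE step(). The return value is the total
# gradient norm measured before clipping, which is worth logging.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Recompute the norm to confirm the gradients were rescaled in place.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

optimizer.step()
print(f"norm before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

Logging the pre-clip norm over time shows whether clipping fires occasionally (usually fine) or on every step (a sign the learning rate or initialization needs attention).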

Symptom 2: Loss Decreasing But Model is Bad

The loss goes down. Training accuracy hits 99%. But test accuracy is 55%. Or the model produces nonsensical outputs on real data.

Overfitting. The model memorizes training data instead of learning patterns. Gap between training and validation loss grows over time. Fix: more data, dropout, weight decay, early stopping, data augmentation.
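Of the fixes listed, early stopping is the easiest to sketch in a few lines. The helper below is hypothetical (the name `early_stopping_epoch` and the `patience` parameter are assumptions for illustration). It scans a list of per-epoch validation losses and reports when training would have stopped.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best, best_epoch = float("inf"), -1
    for epoch, value in enumerate(val_losses):
        if value < best:
            # New best validation loss: remember it and reset the wait.
            best, best_epoch = value, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return None, best_epoch


# Validation loss improves until epoch 2, then climbs as the model overfits.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real loop you would also save a checkpoint whenever the validation loss improves, so the best weights can be restored after stopping.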

Data leakage. Test data leaked into training. Accuracy is suspiciously high. Common causes: shuffling time-ordered or grouped data before splitting, preprocessing with statistics from the full dataset, duplicate samples across splits. Fix: split first, preprocess second, check for duplicates.
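A minimal sketch of "split first, preprocess second, check for duplicates", assuming a hypothetical in-memory feature tensor of i.i.d. samples. The key point is that the normalization statistics come from the training split only.

```python
import torch

torch.manual_seed(0)

# Hypothetical dataset: 100 samples, 5 features, not yet normalized.
features = torch.randn(100, 5) * 3.0 + 7.0

# 1. Split FIRST. A random split suits i.i.d. data; time series need a
#    chronological split instead.
perm = torch.randperm(features.shape[0])
train_idx, test_idx = perm[:80], perm[80:]
train_x, test_x = features[train_idx], features[test_idx]

# 2. Compute normalization statistics on the training split only.
mean = train_x.mean(dim=0, keepdim=True)
std = train_x.std(dim=0, keepdim=True) + 1e-8

# 3. Apply the SAME training statistics to both splits.
train_x_norm = (train_x - mean) / std
test_x_norm = (test_x - mean) / std

# 4. Check that no row appears in both splits.
train_rows = {tuple(row.tolist()) for row in train_x}
overlap = sum(tuple(row.tolist()) in train_rows for row in test_x)
print(f"duplicate rows across splits: {overlap}")
```

The test split will not have exactly zero mean and unit std after this, and that is expected: it was normalized with statistics it never contributed to.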

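Label errors. Real datasets contain mislabeled examples. Northcutt et al. (2021, "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks") estimated an average error rate of at least 3.3% across the test sets of ten widely used benchmarks, and about 6% for the ImageNet validation set. The model learns the noise. Fix: use confident learning to find and fix mislabeled examples, or use loss truncation to ignore high-loss samples.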

Symptom 3: NaN or Inf in Loss

The loss value becomes nan or inf. Training is dead.

Learning rate too high. Gradient updates overshoot so far that weights explode. Fix: reduce by 10x.

log(0) or log(negative). Cross-entropy loss computes log(p). If your model outputs exactly 0 or a negative probability, the log explodes. Fix: clamp predictions to [eps, 1-eps] where eps=1e-7.

Division by zero. Batch normalization divides by standard deviation. A batch with constant values has std=0. Fix: add epsilon to the denominator (PyTorch does this by default, but custom implementations might not).

Numerical overflow. Large activations fed into exp() produce Inf. Softmax is especially prone. Fix: subtract the max before exponentiating (the log-sum-exp trick).
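The overflow and its fix are easy to reproduce. The sketch below uses deliberately huge logits (an illustrative assumption) to show a naive softmax producing NaN while the max-subtracted version stays finite.

```python
import torch

logits = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows float32 to inf, and inf / inf is nan.
naive = torch.exp(logits) / torch.exp(logits).sum()

# Stable softmax: subtracting the max leaves the result mathematically
# unchanged but keeps every exponent <= 0.
shifted = logits - logits.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()

# PyTorch's built-ins are numerically stable, so prefer them in real code.
builtin = torch.softmax(logits, dim=0)
log_probs = torch.log_softmax(logits, dim=0)

print("naive: ", naive)
print("stable:", stable)
```

In practice this is a reason to pass raw logits to `nn.CrossEntropyLoss` rather than applying softmax and log yourself.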

Technique 1: Gradient Checking

Compare your analytical gradients (from backprop) to numerical gradients (from finite differences). If they disagree, your backward pass has a bug.

Numerical gradient for parameter w:

grad_numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)

Agreement metric (relative difference):

rel_diff = |grad_analytical - grad_numerical| / max(|grad_analytical|, |grad_numerical|, 1e-8)

If rel_diff < 1e-5: correct. If rel_diff > 1e-3: almost certainly a bug.
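To make the two formulas concrete, here is a small worked example in plain Python. The scalar "loss" w^3 + 2w and the deliberately buggy gradient are assumptions chosen only so that the right answer is known in closed form.

```python
def loss(w):
    return w ** 3 + 2 * w


def grad_correct(w):
    # d/dw (w^3 + 2w) = 3w^2 + 2
    return 3 * w ** 2 + 2


def grad_buggy(w):
    # Simulated backward-pass bug: the derivative of the 2w term is missing.
    return 3 * w ** 2


def rel_diff(analytical, numerical):
    return abs(analytical - numerical) / max(abs(analytical), abs(numerical), 1e-8)


w, eps = 1.5, 1e-5
# Central difference: (loss(w + eps) - loss(w - eps)) / (2 * eps)
grad_numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)

ok = rel_diff(grad_correct(w), grad_numerical)
bad = rel_diff(grad_buggy(w), grad_numerical)
print(f"correct gradient: rel_diff={ok:.2e}")
print(f"buggy gradient:   rel_diff={bad:.2e}")
```

The correct gradient lands far below the 1e-5 threshold, while the buggy one is off by more than 20%, well past the 1e-3 "almost certainly a bug" line.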

flowchart LR
    A["Parameter w"] --> B["w + eps"]
    A --> C["w - eps"]
    B --> D["Forward pass"]
    C --> E["Forward pass"]
    D --> F["loss+"]
    E --> G["loss-"]
    F --> H["(loss+ - loss-) / 2eps"]
    G --> H
    H --> I["Compare to backprop gradient"]

Technique 2: Activation Statistics

Monitor the mean and standard deviation of activations after each layer during training. Healthy networks maintain activations with mean near 0 and std near 1 (after normalization) or at least bounded.

Health indicator	Mean	Std	Diagnosis
Healthy	~0	~1	Network is learning normally
Saturated	>>0 or <<0	~0	Activations stuck at extreme values
Dead	0	0	Neurons are dead (all zeros)
Exploding	>>10	>>10	Activations growing without bound

Technique 3: Gradient Flow Visualization

Plot the average gradient magnitude for each layer. In a healthy network, gradient magnitudes should be roughly similar across layers. If early layers have gradients 1000x smaller than later layers, you have vanishing gradients.

graph LR
    subgraph "Healthy Gradient Flow"
        L1["Layer 1<br/>grad: 0.05"] --- L2["Layer 2<br/>grad: 0.04"] --- L3["Layer 3<br/>grad: 0.06"] --- L4["Layer 4<br/>grad: 0.05"]
    end

graph LR
    subgraph "Vanishing Gradient Flow"
        V1["Layer 1<br/>grad: 0.0001"] --- V2["Layer 2<br/>grad: 0.003"] --- V3["Layer 3<br/>grad: 0.02"] --- V4["Layer 4<br/>grad: 0.08"]
    end
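The per-layer numbers in these diagrams can be collected with a few lines of PyTorch. The sketch below assumes a hypothetical 8-layer sigmoid MLP on random data, picked because stacked sigmoids are a textbook case of vanishing gradients. It prints the mean absolute weight gradient of each `nn.Linear`, from input side to output side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical deep sigmoid MLP, for illustration only.
layers = []
for _ in range(8):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 2))

x = torch.randn(64, 32)
y = torch.randint(0, 2, (64,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()

# Mean absolute weight gradient per Linear layer, in input-to-output order.
grad_by_layer = [
    (name, p.grad.abs().mean().item())
    for name, p in model.named_parameters()
    if name.endswith("weight")
]
for name, g in grad_by_layer:
    print(f"{name:12s} {g:.2e}")

# Ratio of the output-side layer's gradient to the input-side layer's gradient.
ratio = grad_by_layer[-1][1] / grad_by_layer[0][1]
print(f"last/first ratio: {ratio:.0f}x")
```

Swapping `nn.Sigmoid()` for `nn.ReLU()` in the same script is a quick way to see how much the ratio depends on the activation function.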

Technique 4: The Overfit-One-Batch Test

The single most important debugging technique in deep learning.

Take one small batch (8-32 samples). Train on it for 100+ iterations. The loss should go to nearly zero and training accuracy should hit 100%. If it does not, your model or training loop has a fundamental bug -- do not proceed to full training.

This test catches:

Broken loss functions
Broken backward passes
Architecture too small to represent the data
Optimizer not connected to model parameters
Data and labels misaligned

This takes 30 seconds to run and saves hours of debugging full training runs.

Technique 5: Learning Rate Finder

Leslie Smith (2017) proposed sweeping the learning rate from very small (1e-7) to very large (10) over one epoch while recording the loss. Plot loss vs learning rate. The optimal learning rate is roughly 10x smaller than the rate where loss starts decreasing fastest.

graph TD
    subgraph "LR Finder Plot"
        direction LR
        A["1e-7: loss=2.3"] --> B["1e-5: loss=2.3"]
        B --> C["1e-3: loss=1.8"]
        C --> D["1e-2: loss=0.9 -- steepest"]
        D --> E["1e-1: loss=0.5"]
        E --> F["1.0: loss=NaN -- too high"]
    end

Best LR in this example: ~1e-3 (one order of magnitude before the steepest point).

Common PyTorch Bugs

These are the bugs that waste the most collective hours in the PyTorch community:

Bug	Symptom	Fix
Forgetting `optimizer.zero_grad()`	Gradients accumulate across batches, loss oscillates	Add `optimizer.zero_grad()` before `loss.backward()`
Forgetting `model.eval()` at test time	Dropout and batch norm behave differently, test accuracy varies between runs	Add `model.eval()` and `torch.no_grad()`
Wrong tensor shapes	Silent broadcasting produces wrong results, no error	Print shapes after every operation during debugging
CPU/GPU mismatch	`RuntimeError: Expected all tensors to be on the same device`	Use `.to(device)` on model AND data
Not detaching tensors (e.g. `total_loss += loss`)	Computation graph is never freed, memory grows until OOM	Accumulate `loss.item()`, or use `.detach()` / `with torch.no_grad()`
In-place operations breaking autograd	`RuntimeError: modified by in-place operation`	Replace `x += 1` with `x = x + 1`
Data not normalized	Loss stuck at random-chance level	Normalize inputs to mean=0, std=1
Labels as wrong dtype	Cross-entropy expects `Long`, got `Float`	Cast labels: `labels.long()`
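The "wrong tensor shapes" row is the hardest of these to spot because nothing fails. Below is a minimal reproduction, assuming a regression setup where the model emits shape `(N, 1)` and the targets are stored as shape `(N,)`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

pred = torch.randn(8, 1)    # model output with a trailing singleton dimension
target = torch.randn(8)     # targets stored as a flat vector

# (8, 1) - (8,) broadcasts to (8, 8): every prediction is compared against
# every target. No error is raised.
diff = pred - target
wrong_loss = (diff ** 2).mean()

# Fix: make the shapes match explicitly before computing the loss.
right_loss = F.mse_loss(pred.squeeze(-1), target)

print(f"diff shape: {tuple(diff.shape)}")
print(f"wrong loss: {wrong_loss.item():.4f}, right loss: {right_loss.item():.4f}")
```

Both losses are plausible-looking positive numbers, which is why printing shapes (or asserting `pred.shape == target.shape`) is the only reliable guard.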

The Master Debugging Table

Symptom	Likely cause	First thing to try
Loss stuck at -log(1/num_classes) = log(num_classes)	Model predicting uniform distribution	Check data pipeline, verify labels match inputs
Loss NaN after a few steps	Learning rate too high	Reduce LR by 10x
Loss NaN immediately	log(0) or division by zero	Add epsilon to log/division operations
Loss oscillating wildly	LR too high or batch size too small	Reduce LR, increase batch size
Loss decreasing then plateaus	LR too high for fine-tuning phase	Add LR schedule (cosine or step decay)
Training acc high, test acc low	Overfitting	Add dropout, weight decay, more data
Training acc = test acc = chance	Model not learning anything	Run overfit-one-batch test
Training acc = test acc but both low	Underfitting	Bigger model, more layers, more features
Gradients all zero	Dead ReLUs or detached computation graph	Switch to LeakyReLU, check `.requires_grad`
Out of memory during training	Batch too large or graph not freed	Reduce batch size, use `torch.no_grad()` for eval
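The first row of the table is worth memorizing as a number. A short sketch, assuming a 10-class problem and all-zero logits to stand in for a model that has learned nothing:

```python
import math

import torch
import torch.nn as nn

num_classes = 10

# All-zero logits give a uniform softmax, i.e. a model that predicts every
# class with probability 1/num_classes.
logits = torch.zeros(32, num_classes)
labels = torch.randint(0, num_classes, (32,))

chance_loss = nn.CrossEntropyLoss()(logits, labels).item()
expected = math.log(num_classes)   # same value as -log(1/num_classes)

print(f"loss at chance: {chance_loss:.4f}, log(num_classes): {expected:.4f}")
```

If your training loss sits at this value after many steps, the model is ignoring its inputs, and the data pipeline is the first place to look.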

Build It

You will build a diagnostic toolkit that monitors activations, gradients, and loss curves. You will then deliberately break a network and use the toolkit to diagnose each problem.

Step 1: The NetworkDebugger Class

Hooks into a PyTorch model to record activation and gradient statistics per layer.

import torch
import torch.nn as nn
import math


class NetworkDebugger:
    def __init__(self, model):
        self.model = model
        self.activation_stats = {}
        self.gradient_stats = {}
        self.loss_history = []
        self.lr_losses = []
        self.hooks = []
        self._register_hooks()

    def _register_hooks(self):
        for name, module in self.model.named_modules():
            if isinstance(module, (nn.Linear, nn.Conv2d, nn.ReLU, nn.LeakyReLU)):
                hook = module.register_forward_hook(self._make_activation_hook(name))
                self.hooks.append(hook)
                hook = module.register_full_backward_hook(self._make_gradient_hook(name))
                self.hooks.append(hook)

    def _make_activation_hook(self, name):
        def hook(module, input, output):
            with torch.no_grad():
                out = output.detach().float()
                self.activation_stats[name] = {
                    "mean": out.mean().item(),
                    "std": out.std().item(),
                    "fraction_zero": (out == 0).float().mean().item(),
                    "min": out.min().item(),
                    "max": out.max().item(),
                }
        return hook

    def _make_gradient_hook(self, name):
        def hook(module, grad_input, grad_output):
            if grad_output[0] is not None:
                with torch.no_grad():
                    grad = grad_output[0].detach().float()
                    self.gradient_stats[name] = {
                        "mean": grad.mean().item(),
                        "std": grad.std().item(),
                        "abs_mean": grad.abs().mean().item(),
                        "max": grad.abs().max().item(),
                    }
        return hook

    def record_loss(self, loss_value):
        self.loss_history.append(loss_value)

    def check_loss_health(self):
        if len(self.loss_history) < 2:
            return "NOT_ENOUGH_DATA"
        recent = self.loss_history[-10:]
        if any(math.isnan(v) or math.isinf(v) for v in recent):
            return "NAN_OR_INF"
        if len(self.loss_history) >= 20:
            first_half = sum(self.loss_history[:10]) / 10
            second_half = sum(self.loss_history[-10:]) / 10
            if second_half >= first_half * 0.99:
                return "NOT_DECREASING"
        if len(recent) >= 5:
            diffs = [recent[i+1] - recent[i] for i in range(len(recent)-1)]
            if max(diffs) - min(diffs) > 2 * abs(sum(diffs) / len(diffs)):
                return "OSCILLATING"
        return "HEALTHY"

    def check_activations(self):
        issues = []
        for name, stats in self.activation_stats.items():
            if stats["fraction_zero"] > 0.5:
                issues.append(f"DEAD_NEURONS: {name} has {stats['fraction_zero']:.0%} zero activations")
            if abs(stats["mean"]) > 10:
                issues.append(f"EXPLODING_ACTIVATIONS: {name} mean={stats['mean']:.2f}")
            if stats["std"] < 1e-6:
                issues.append(f"COLLAPSED_ACTIVATIONS: {name} std={stats['std']:.2e}")
        return issues if issues else ["HEALTHY"]

    def check_gradients(self):
        issues = []
        grad_magnitudes = []
        for name, stats in self.gradient_stats.items():
            grad_magnitudes.append((name, stats["abs_mean"]))
            if stats["abs_mean"] < 1e-7:
                issues.append(f"VANISHING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
            if stats["abs_mean"] > 100:
                issues.append(f"EXPLODING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
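        if len(grad_magnitudes) >= 2:
            # Backward hooks fire from the output layer back toward the input,
            # so entry 0 is the layer nearest the loss and entry -1 is the
            # layer nearest the input.
            output_side_mag = grad_magnitudes[0][1]
            input_side_mag = grad_magnitudes[-1][1]
            if input_side_mag > 0 and output_side_mag / input_side_mag > 100:
                issues.append(f"GRADIENT_RATIO: output-side/input-side = {output_side_mag/input_side_mag:.0f}x (vanishing)")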
        return issues if issues else ["HEALTHY"]

    def print_report(self):
        print("\n=== NETWORK DEBUGGER REPORT ===")
        print(f"\nLoss health: {self.check_loss_health()}")
        if self.loss_history:
            print(f"  Last 5 losses: {[f'{v:.4f}' for v in self.loss_history[-5:]]}")
        print("\nActivation diagnostics:")
        for item in self.check_activations():
            print(f"  {item}")
        print("\nGradient diagnostics:")
        for item in self.check_gradients():
            print(f"  {item}")
        print("\nPer-layer activation stats:")
        for name, stats in self.activation_stats.items():
            print(f"  {name}: mean={stats['mean']:.4f} std={stats['std']:.4f} zero={stats['fraction_zero']:.1%}")
        print("\nPer-layer gradient stats:")
        for name, stats in self.gradient_stats.items():
            print(f"  {name}: abs_mean={stats['abs_mean']:.2e} max={stats['max']:.2e}")

    def remove_hooks(self):
        for hook in self.hooks:
            hook.remove()
        self.hooks.clear()

Step 2: The Overfit-One-Batch Test

def overfit_one_batch(model, x_batch, y_batch, criterion, lr=0.01, steps=200):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    print("\n=== OVERFIT ONE BATCH TEST ===")
    print(f"Batch size: {x_batch.shape[0]}, Steps: {steps}")

    for step in range(steps):
        optimizer.zero_grad()
        output = model(x_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()

        if step % 50 == 0 or step == steps - 1:
            with torch.no_grad():
                preds = (output > 0).float() if output.shape[-1] == 1 else output.argmax(dim=1)
                targets = y_batch if y_batch.dim() == 1 else y_batch.squeeze()
                acc = (preds.squeeze() == targets).float().mean().item()
            print(f"  Step {step:3d} | Loss: {loss.item():.6f} | Accuracy: {acc:.1%}")

    final_loss = loss.item()
    if final_loss > 0.1:
        print(f"\n  FAIL: Loss did not converge ({final_loss:.4f}). Model or training loop is broken.")
        return False
    print(f"\n  PASS: Loss converged to {final_loss:.6f}")
    return True

Step 3: Learning Rate Finder

def find_learning_rate(model, x_data, y_data, criterion, start_lr=1e-7, end_lr=10, steps=100):
    import copy
    original_state = copy.deepcopy(model.state_dict())
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    lr_mult = (end_lr / start_lr) ** (1 / steps)

    model.train()
    results = []
    best_loss = float("inf")
    current_lr = start_lr

    print("\n=== LEARNING RATE FINDER ===")

    for step in range(steps):
        optimizer.zero_grad()
        output = model(x_data)
        loss = criterion(output, y_data)

        if math.isnan(loss.item()) or loss.item() > best_loss * 10:
            break

        best_loss = min(best_loss, loss.item())
        results.append((current_lr, loss.item()))

        loss.backward()
        optimizer.step()

        current_lr *= lr_mult
        for param_group in optimizer.param_groups:
            param_group["lr"] = current_lr

    model.load_state_dict(original_state)

    if len(results) < 10:
        print("  Could not complete LR sweep -- loss diverged too quickly")
        return results

    min_loss_idx = min(range(len(results)), key=lambda i: results[i][1])
    suggested_lr = results[max(0, min_loss_idx - 10)][0]

    print(f"  Swept {len(results)} steps from {start_lr:.0e} to {results[-1][0]:.0e}")
    print(f"  Minimum loss {results[min_loss_idx][1]:.4f} at lr={results[min_loss_idx][0]:.2e}")
    print(f"  Suggested learning rate: {suggested_lr:.2e}")

    return results

Step 4: Gradient Checker

def _flat_to_multi_index(flat_idx, shape):
    multi_idx = []
    remaining = flat_idx
    for dim in reversed(shape):
        multi_idx.insert(0, remaining % dim)
        remaining //= dim
    return tuple(multi_idx)


def gradient_check(model, x, y, criterion, eps=1e-4):
    model.train()
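    x_double = x.double()
    # Only cast floating-point targets. Integer class labels (e.g. for
    # CrossEntropyLoss) must stay as long tensors.
    y_double = y.double() if y.is_floating_point() else y
    # .double() converts the model in place; it is cast back with
    # model.float() at the end of this function.
    model_double = model.double()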

    print("\n=== GRADIENT CHECK ===")
    overall_max_diff = 0
    checked = 0

    for name, param in model_double.named_parameters():
        if not param.requires_grad:
            continue

        layer_max_diff = 0

        model_double.zero_grad()
        output = model_double(x_double)
        loss = criterion(output, y_double)
        loss.backward()
        analytical_grad = param.grad.clone()

        num_checks = min(5, param.numel())
        for i in range(num_checks):
            idx = _flat_to_multi_index(i, param.shape)
            original = param.data[idx].item()

            param.data[idx] = original + eps
            with torch.no_grad():
                loss_plus = criterion(model_double(x_double), y_double).item()

            param.data[idx] = original - eps
            with torch.no_grad():
                loss_minus = criterion(model_double(x_double), y_double).item()

            param.data[idx] = original

            numerical = (loss_plus - loss_minus) / (2 * eps)
            analytical = analytical_grad[idx].item()

            denom = max(abs(numerical), abs(analytical), 1e-8)
            rel_diff = abs(numerical - analytical) / denom

            layer_max_diff = max(layer_max_diff, rel_diff)
            checked += 1

        overall_max_diff = max(overall_max_diff, layer_max_diff)
        status = "OK" if layer_max_diff < 1e-5 else "MISMATCH"
        print(f"  {name}: max_rel_diff={layer_max_diff:.2e} [{status}]")

    model.float()

    print(f"\n  Checked {checked} parameters")
    if overall_max_diff < 1e-5:
        print("  PASS: Gradients match (rel_diff < 1e-5)")
    elif overall_max_diff < 1e-3:
        print("  WARN: Small differences (1e-5 < rel_diff < 1e-3)")
    else:
        print("  FAIL: Gradient mismatch detected (rel_diff > 1e-3)")
    return overall_max_diff

Step 5: Deliberately Broken Networks

Now apply the toolkit to broken networks and diagnose each one.

def demo_broken_networks():
    torch.manual_seed(42)
    x = torch.randn(64, 10)
    y = (x[:, 0] > 0).long()

    print("\n" + "=" * 60)
    print("BUG 1: Learning rate too high (lr=10)")
    print("=" * 60)
    model1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger1 = NetworkDebugger(model1)
    optimizer1 = torch.optim.SGD(model1.parameters(), lr=10.0)
    criterion = nn.CrossEntropyLoss()
    for step in range(20):
        optimizer1.zero_grad()
        out = model1(x)
        loss = criterion(out, y)
        debugger1.record_loss(loss.item())
        loss.backward()
        optimizer1.step()
    debugger1.print_report()
    debugger1.remove_hooks()

    print("\n" + "=" * 60)
    print("BUG 2: Dead ReLUs from bad initialization")
    print("=" * 60)
    model2 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
    with torch.no_grad():
        for m in model2.modules():
            if isinstance(m, nn.Linear):
                m.weight.fill_(-1.0)
                m.bias.fill_(-5.0)
    debugger2 = NetworkDebugger(model2)
    optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
    for step in range(50):
        optimizer2.zero_grad()
        out = model2(x)
        loss = criterion(out, y)
        debugger2.record_loss(loss.item())
        loss.backward()
        optimizer2.step()
    debugger2.print_report()
    debugger2.remove_hooks()

    print("\n" + "=" * 60)
    print("BUG 3: Missing zero_grad (gradients accumulate)")
    print("=" * 60)
    model3 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger3 = NetworkDebugger(model3)
    optimizer3 = torch.optim.SGD(model3.parameters(), lr=0.01)
    for step in range(50):
        out = model3(x)
        loss = criterion(out, y)
        debugger3.record_loss(loss.item())
        loss.backward()
        optimizer3.step()
    debugger3.print_report()
    debugger3.remove_hooks()

    print("\n" + "=" * 60)
    print("HEALTHY NETWORK: Correct setup for comparison")
    print("=" * 60)
    model_good = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    debugger_good = NetworkDebugger(model_good)
    optimizer_good = torch.optim.Adam(model_good.parameters(), lr=1e-3)
    for step in range(50):
        optimizer_good.zero_grad()
        out = model_good(x)
        loss = criterion(out, y)
        debugger_good.record_loss(loss.item())
        loss.backward()
        optimizer_good.step()
    debugger_good.print_report()
    debugger_good.remove_hooks()

    print("\n" + "=" * 60)
    print("OVERFIT-ONE-BATCH TEST (healthy model)")
    print("=" * 60)
    model_test = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    overfit_one_batch(model_test, x[:8], y[:8], criterion)

    print("\n" + "=" * 60)
    print("LEARNING RATE FINDER")
    print("=" * 60)
    model_lr = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    find_learning_rate(model_lr, x, y, criterion)

    print("\n" + "=" * 60)
    print("GRADIENT CHECK")
    print("=" * 60)
    model_grad = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
    gradient_check(model_grad, x[:4], y[:4], criterion)

Use It

PyTorch Built-in Tools

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

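# Placeholder batch, labels, and loss so the snippet runs on its own.
input_tensor = torch.randn(32, 768)
target = torch.randint(0, 10, (32,))
criterion = nn.CrossEntropyLoss()

# detect_anomaly() raises an error at the backward op that produced NaN and
# reports the forward op that created it. It is slow, so enable it only
# while debugging.
with torch.autograd.detect_anomaly():
    output = model(input_tensor)
    loss = criterion(output, target)
    loss.backward()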

for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad_mean={param.grad.abs().mean():.2e}")
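PyTorch also ships a finite-difference checker, `torch.autograd.gradcheck`, which automates the comparison from Technique 1. A minimal sketch, assuming a small hypothetical function `tiny_layer` as the code under test. The checker expects double-precision inputs with `requires_grad=True`.

```python
import torch

torch.manual_seed(0)


def tiny_layer(x, w, b):
    # Hypothetical function under test: a linear map followed by tanh.
    return torch.tanh(x @ w + b)


# gradcheck needs double precision and requires_grad=True on checked inputs.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(3, 2, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)

# Returns True when analytical and numerical Jacobians agree within tolerance;
# by default it raises an error describing the mismatch otherwise.
passed = torch.autograd.gradcheck(tiny_layer, (x, w, b), eps=1e-6, atol=1e-4)
print(f"gradcheck passed: {passed}")
```

This is most useful for custom `torch.autograd.Function` implementations, where the backward pass is written by hand.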

Weights & Biases Integration

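import torch
import wandb

wandb.init(project="debug-training")

for epoch in range(100):
    # train_one_epoch, model, and optimizer are assumed to be defined elsewhere.
    loss = train_one_epoch()
    metrics = {
        "loss": loss,
        "lr": optimizer.param_groups[0]["lr"],
        # With max_norm=inf nothing is clipped; the call just returns the total gradient norm.
        "grad_norm": torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item(),
    }

    for name, param in model.named_parameters():
        if param.grad is not None:
            metrics[f"grad/{name}"] = wandb.Histogram(param.grad.detach().cpu().numpy())

    # One log call per epoch keeps every metric on the same W&B step.
    wandb.log(metrics)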

TensorBoard

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/debug_experiment")

for epoch in range(100):
    # train_one_epoch and model are assumed to be defined elsewhere.
    loss = train_one_epoch()
    writer.add_scalar("Loss/train", loss, epoch)

    for name, param in model.named_parameters():
        writer.add_histogram(f"weights/{name}", param, epoch)
        if param.grad is not None:
            writer.add_histogram(f"gradients/{name}", param.grad, epoch)

# Flush pending events to disk and release the file handle.
writer.close()

The Debug Checklist (Before Full Training)

Run overfit-one-batch test. If it fails, stop.
Print model summary -- verify parameter count is reasonable.
Run a single forward pass with random data -- check output shape.
Train for 5 epochs -- verify loss decreases.
Check activation statistics -- no dead layers, no explosions.
Check gradient flow -- no vanishing, no exploding.
Verify data pipeline -- print 5 random samples with labels.
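The parameter-count and output-shape items on this checklist take only a few lines. A sketch, assuming a hypothetical two-layer model with 768 input features and 10 classes:

```python
import torch
import torch.nn as nn

# Hypothetical model and input size, for illustration only.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 10))

# Trainable parameter count: (768*256 + 256) + (256*10 + 10).
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# One forward pass on random data to confirm the output shape.
model.eval()
with torch.no_grad():
    out = model(torch.randn(4, 768))

print(f"trainable parameters: {num_params:,}")
print(f"output shape: {tuple(out.shape)}")
```

Working out the expected count by hand, as in the comment, catches layers that were accidentally left out of the model or frozen.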

Ship It

This lesson produces:

outputs/prompt-nn-debugger.md -- a prompt for diagnosing neural network training failures
outputs/skill-debug-checklist.md -- a decision-tree checklist for debugging training issues

Key deployment patterns for debugging:

Add monitoring hooks to production training scripts
Log activation and gradient statistics to W&B or TensorBoard every N steps
Implement automatic alerts for NaN loss, dead neurons (>80% zero), or gradient explosion
Always run the overfit-one-batch test when changing architectures or data pipelines
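The alerting pattern can be as small as one function called every N steps. The helper below is a hypothetical sketch: the name `check_step`, the alert strings, and the gradient-norm threshold are assumptions. The 80% dead-neuron threshold follows the list above.

```python
import math


def check_step(loss_value, grad_norm, zero_fraction,
               max_grad_norm=1e3, max_zero_fraction=0.8):
    """Return a list of alert strings for one training step (empty if healthy)."""
    alerts = []
    if math.isnan(loss_value) or math.isinf(loss_value):
        alerts.append("NAN_OR_INF_LOSS")
    if math.isnan(grad_norm) or grad_norm > max_grad_norm:
        alerts.append("GRADIENT_EXPLOSION")
    if zero_fraction > max_zero_fraction:
        alerts.append("DEAD_NEURONS")
    return alerts


# Example readings: a healthy step, then a step that trips all three alerts.
print(check_step(loss_value=0.42, grad_norm=1.7, zero_fraction=0.45))
print(check_step(loss_value=float("nan"), grad_norm=5e4, zero_fraction=0.93))
```

In a training script, the returned alerts could be logged, sent to a notification channel, or used to halt the run before more compute is wasted.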

Exercises

Add an exploding gradient detector. Modify the NetworkDebugger to detect when gradients exceed a threshold and automatically suggest a gradient clipping value. Test it on a 20-layer network with no normalization.

Build a dead neuron resurrector. Write a function that identifies dead ReLU neurons (always outputting 0) and reinitializes their incoming weights with Kaiming initialization. Show that this recovers a network where >70% of neurons are dead.

Implement the learning rate finder with plotting. Extend find_learning_rate to save results as a CSV and write a separate script that reads the CSV and displays the LR vs loss curve using matplotlib. Identify the optimal LR for ResNet-18 on CIFAR-10.

Create a data pipeline validator. Write a function that checks for: duplicate samples across train/test splits, label distribution imbalance (>10:1 ratio), input normalization (mean near 0, std near 1), and NaN/Inf values in the data. Run it on a deliberately corrupted dataset.

Debug a real failure. Take the mini-framework from Lesson 10, introduce a subtle bug (e.g., transpose the weight matrix in backward), and use gradient checking to locate exactly which parameter has incorrect gradients. Document the debugging process.

Key Terms

Term	What people say	What it actually means
Silent bug	"It runs but gives bad results"	A bug that produces no error but degrades model quality -- the dominant failure mode in ML
Dead ReLU	"The neurons died"	A ReLU neuron whose input is always negative, so it outputs 0 and receives 0 gradient permanently
Vanishing gradients	"Early layers stop learning"	Gradients shrink exponentially through layers, making weights in early layers effectively frozen
Exploding gradients	"Loss went to NaN"	Gradients grow exponentially through layers, causing weight updates so large they overflow
Gradient checking	"Verify backprop is correct"	Comparing analytical gradients from backprop to numerical gradients from finite differences
Overfit-one-batch	"The most important debug test"	Training on a single small batch to verify the model CAN learn -- if it cannot, something is fundamentally broken
LR finder	"Sweep to find the right learning rate"	Exponentially increasing the learning rate over one epoch while recording the loss, then picking a rate about 10x below the point where the loss falls fastest
Data leakage	"Test data leaked into training"	When information from the test set contaminates training, producing artificially high accuracy
Activation statistics	"Monitor layer health"	Tracking mean, std, and zero-fraction of each layer's output to detect dead, saturated, or exploding neurons
Gradient clipping	"Cap the gradient magnitude"	Scaling gradients down when their norm exceeds a threshold, preventing exploding gradient updates