Debugging Neural Networks
> Your network compiled. It ran. It produced a number. The number is wrong and nothing crashed. Welcome to the hardest kind of debugging -- the kind where there is no error message.
Type: Practice
Languages: Python, PyTorch
Prerequisites: Phase 03 Lessons 01-10 (especially backpropagation, loss functions, optimizers)
Time: ~90 minutes
Learning Objectives
- Diagnose common neural network failures (NaN loss, flat loss curve, overfitting, oscillation) using systematic debugging strategies
- Apply the "overfit one batch" technique to verify that your model architecture and training loop are correct
- Inspect gradient magnitudes, activation distributions, and weight norms to identify vanishing/exploding gradient problems
- Build a debugging checklist that covers data pipeline, model architecture, loss function, optimizer, and learning rate issues
The Problem
Traditional software crashes when it is broken. A null pointer throws an exception. A type mismatch fails at compile time. An off-by-one error produces a clearly wrong output.
Neural networks do not give you that luxury.
A broken neural network runs to completion, prints a loss value, and outputs predictions. The loss might decrease. The predictions might look plausible. But the model is silently wrong -- learning shortcuts, memorizing noise, or converging to a useless local minimum. Google researchers estimated that 60-70% of ML debugging time is spent on "silent" bugs that produce no errors but degrade model quality.
The difference between a working model and a broken one is often a single misplaced line: a missing zero_grad(), a transposed dimension, a learning rate off by 10x. the canonical "Recipe for Training Neural Networks" (2019) opens with this: "The most common neural net mistakes are bugs that don't crash."
This lesson teaches you to find those bugs.
The Concept
The Debugging Mindset
Forget print-and-pray debugging. Neural network debugging requires a systematic approach because the feedback loop is slow (minutes to hours per training run) and the symptoms are ambiguous (bad loss could mean 20 different things).
The golden rule: start simple, add complexity one piece at a time, and verify each piece independently.
Symptom 1: Loss Not Decreasing
This is the most common complaint. The training loop runs, epochs tick by, and the loss stays flat or oscillates wildly.
Wrong learning rate. Too high: loss oscillates or jumps to NaN. Too low: loss decreases so slowly it looks flat. For Adam, start at 1e-3. For SGD, start at 1e-1 or 1e-2. Always try 3 learning rates spanning 10x each (e.g., 1e-2, 1e-3, 1e-4) before concluding something else is wrong.
Dead ReLUs. If a ReLU neuron receives a large negative input, it outputs 0 and its gradient is 0. It never activates again. If enough neurons die, the network cannot learn. Check: print the fraction of activations that are exactly 0 after each ReLU layer. If >50% are dead, switch to LeakyReLU or reduce the learning rate.
Vanishing gradients. In deep networks with sigmoid or tanh activations, gradients shrink exponentially as they propagate backward. By the time they reach the first layer, they are ~0. The first layers stop learning. Fix: use ReLU/GELU, add residual connections, or use batch normalization.
Exploding gradients. The opposite problem -- gradients grow exponentially. Common in RNNs and very deep networks. Loss jumps to NaN. Fix: gradient clipping (torch.nn.utils.clip_grad_norm_), lower learning rate, or add normalization.
Symptom 2: Loss Decreasing But Model is Bad
The loss goes down. Training accuracy hits 99%. But test accuracy is 55%. Or the model produces nonsensical outputs on real data.
Overfitting. The model memorizes training data instead of learning patterns. Gap between training and validation loss grows over time. Fix: more data, dropout, weight decay, early stopping, data augmentation.
Data leakage. Test data leaked into training. Accuracy is suspiciously high. Common causes: shuffling before splitting, preprocessing with statistics from the full dataset, duplicate samples across splits. Fix: split first, preprocess second, check for duplicates.
Label errors. 5-10% of labels in most real datasets are wrong (Northcutt et al., 2021 -- "Pervasive Label Errors in Test Sets"). The model learns the noise. Fix: use confident learning to find and fix mislabeled examples, or use loss truncation to ignore high-loss samples.
Symptom 3: NaN or Inf in Loss
The loss value becomes nan or inf. Training is dead.
Learning rate too high. Gradient updates overshoot so far that weights explode. Fix: reduce by 10x.
log(0) or log(negative). Cross-entropy loss computes log(p). If your model outputs exactly 0 or a negative probability, the log explodes. Fix: clamp predictions to [eps, 1-eps] where eps=1e-7.
Division by zero. Batch normalization divides by standard deviation. A batch with constant values has std=0. Fix: add epsilon to the denominator (PyTorch does this by default, but custom implementations might not).
Numerical overflow. Large activations fed into exp() produce Inf. Softmax is especially prone. Fix: subtract the max before exponentiating (the log-sum-exp trick).
Technique 1: Gradient Checking
Compare your analytical gradients (from backprop) to numerical gradients (from finite differences). If they disagree, your backward pass has a bug.
Numerical gradient for parameter w:
grad_numerical = (loss(w + eps) - loss(w - eps)) / (2 * eps)
Agreement metric (relative difference):
rel_diff = |grad_analytical - grad_numerical| / max(|grad_analytical|, |grad_numerical|, 1e-8)
If rel_diff < 1e-5: correct. If rel_diff > 1e-3: almost certainly a bug.
Technique 2: Activation Statistics
Monitor the mean and standard deviation of activations after each layer during training. Healthy networks maintain activations with mean near 0 and std near 1 (after normalization) or at least bounded.
| Health indicator | Mean | Std | Diagnosis |
|---|---|---|---|
| Healthy | ~0 | ~1 | Network is learning normally |
| Saturated | >>0 or <<0 | ~0 | Activations stuck at extreme values |
| Dead | 0 | 0 | Neurons are dead (all zeros) |
| Exploding | >>10 | >>10 | Activations growing without bound |
Technique 3: Gradient Flow Visualization
Plot the average gradient magnitude for each layer. In a healthy network, gradient magnitudes should be roughly similar across layers. If early layers have gradients 1000x smaller than later layers, you have vanishing gradients.
grad: 0.05"] --- L2["Layer 2
grad: 0.04"] --- L3["Layer 3
grad: 0.06"] --- L4["Layer 4
grad: 0.05"] end
grad: 0.0001"] --- V2["Layer 2
grad: 0.003"] --- V3["Layer 3
grad: 0.02"] --- V4["Layer 4
grad: 0.08"] end
Technique 4: The Overfit-One-Batch Test
The single most important debugging technique in deep learning.
Take one small batch (8-32 samples). Train on it for 100+ iterations. The loss should go to nearly zero and training accuracy should hit 100%. If it does not, your model or training loop has a fundamental bug -- do not proceed to full training.
This test catches:
- Broken loss functions
- Broken backward passes
- Architecture too small to represent the data
- Optimizer not connected to model parameters
- Data and labels misaligned
This takes 30 seconds to run and saves hours of debugging full training runs.
Technique 5: Learning Rate Finder
Leslie Smith (2017) proposed sweeping the learning rate from very small (1e-7) to very large (10) over one epoch while recording the loss. Plot loss vs learning rate. The optimal learning rate is roughly 10x smaller than the rate where loss starts decreasing fastest.
Best LR in this example: ~1e-3 (one order of magnitude before the steepest point).
Common PyTorch Bugs
These are the bugs that waste the most collective hours in the PyTorch community:
| Bug | Symptom | Fix |
|---|---|---|
Forgetting optimizer.zero_grad() |
Gradients accumulate across batches, loss oscillates | Add optimizer.zero_grad() before loss.backward() |
Forgetting model.eval() at test time |
Dropout and batch norm behave differently, test accuracy varies between runs | Add model.eval() and torch.no_grad() |
| Wrong tensor shapes | Silent broadcasting produces wrong results, no error | Print shapes after every operation during debugging |
| CPU/GPU mismatch | RuntimeError: expected CUDA tensor |
Use .to(device) on model AND data |
| Not detaching tensors | Computation graph grows forever, OOM | Use .detach() or with torch.no_grad() |
| In-place operations breaking autograd | RuntimeError: modified by in-place operation |
Replace x += 1 with x = x + 1 |
| Data not normalized | Loss stuck at random-chance level | Normalize inputs to mean=0, std=1 |
| Labels as wrong dtype | Cross-entropy expects Long, got Float |
Cast labels: labels.long() |
The Master Debugging Table
| Symptom | Likely cause | First thing to try |
|---|---|---|
| Loss stuck at -log(1/num_classes) | Model predicting uniform distribution | Check data pipeline, verify labels match inputs |
| Loss NaN after a few steps | Learning rate too high | Reduce LR by 10x |
| Loss NaN immediately | log(0) or division by zero | Add epsilon to log/division operations |
| Loss oscillating wildly | LR too high or batch size too small | Reduce LR, increase batch size |
| Loss decreasing then plateaus | LR too high for fine-tuning phase | Add LR schedule (cosine or step decay) |
| Training acc high, test acc low | Overfitting | Add dropout, weight decay, more data |
| Training acc = test acc = chance | Model not learning anything | Run overfit-one-batch test |
| Training acc = test acc but both low | Underfitting | Bigger model, more layers, more features |
| Gradients all zero | Dead ReLUs or detached computation graph | Switch to LeakyReLU, check .requires_grad |
| Out of memory during training | Batch too large or graph not freed | Reduce batch size, use torch.no_grad() for eval |
Build It
A diagnostic toolkit that monitors activations, gradients, and loss curves. You will deliberately break a network and use the toolkit to diagnose each problem.
Step 1: The NetworkDebugger Class
Hooks into a PyTorch model to record activation and gradient statistics per layer.
import torch
import torch.nn as nn
import math
class NetworkDebugger:
def __init__(self, model):
self.model = model
self.activation_stats = {}
self.gradient_stats = {}
self.loss_history = []
self.lr_losses = []
self.hooks = []
self._register_hooks()
def _register_hooks(self):
for name, module in self.model.named_modules():
if isinstance(module, (nn.Linear, nn.Conv2d, nn.ReLU, nn.LeakyReLU)):
hook = module.register_forward_hook(self._make_activation_hook(name))
self.hooks.append(hook)
hook = module.register_full_backward_hook(self._make_gradient_hook(name))
self.hooks.append(hook)
def _make_activation_hook(self, name):
def hook(module, input, output):
with torch.no_grad():
out = output.detach().float()
self.activation_stats[name] = {
"mean": out.mean().item(),
"std": out.std().item(),
"fraction_zero": (out == 0).float().mean().item(),
"min": out.min().item(),
"max": out.max().item(),
}
return hook
def _make_gradient_hook(self, name):
def hook(module, grad_input, grad_output):
if grad_output[0] is not None:
with torch.no_grad():
grad = grad_output[0].detach().float()
self.gradient_stats[name] = {
"mean": grad.mean().item(),
"std": grad.std().item(),
"abs_mean": grad.abs().mean().item(),
"max": grad.abs().max().item(),
}
return hook
def record_loss(self, loss_value):
self.loss_history.append(loss_value)
def check_loss_health(self):
if len(self.loss_history) < 2:
return "NOT_ENOUGH_DATA"
recent = self.loss_history[-10:]
if any(math.isnan(v) or math.isinf(v) for v in recent):
return "NAN_OR_INF"
if len(self.loss_history) >= 20:
first_half = sum(self.loss_history[:10]) / 10
second_half = sum(self.loss_history[-10:]) / 10
if second_half >= first_half * 0.99:
return "NOT_DECREASING"
if len(recent) >= 5:
diffs = [recent[i+1] - recent[i] for i in range(len(recent)-1)]
if max(diffs) - min(diffs) > 2 * abs(sum(diffs) / len(diffs)):
return "OSCILLATING"
return "HEALTHY"
def check_activations(self):
issues = []
for name, stats in self.activation_stats.items():
if stats["fraction_zero"] > 0.5:
issues.append(f"DEAD_NEURONS: {name} has {stats['fraction_zero']:.0%} zero activations")
if abs(stats["mean"]) > 10:
issues.append(f"EXPLODING_ACTIVATIONS: {name} mean={stats['mean']:.2f}")
if stats["std"] < 1e-6:
issues.append(f"COLLAPSED_ACTIVATIONS: {name} std={stats['std']:.2e}")
return issues if issues else ["HEALTHY"]
def check_gradients(self):
issues = []
grad_magnitudes = []
for name, stats in self.gradient_stats.items():
grad_magnitudes.append((name, stats["abs_mean"]))
if stats["abs_mean"] < 1e-7:
issues.append(f"VANISHING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
if stats["abs_mean"] > 100:
issues.append(f"EXPLODING_GRADIENT: {name} abs_mean={stats['abs_mean']:.2e}")
if len(grad_magnitudes) >= 2:
first_mag = grad_magnitudes[0][1]
last_mag = grad_magnitudes[-1][1]
if last_mag > 0 and first_mag / last_mag > 100:
issues.append(f"GRADIENT_RATIO: first/last = {first_mag/last_mag:.0f}x (vanishing)")
return issues if issues else ["HEALTHY"]
def print_report(self):
print("\n=== NETWORK DEBUGGER REPORT ===")
print(f"\nLoss health: {self.check_loss_health()}")
if self.loss_history:
print(f" Last 5 losses: {[f'{v:.4f}' for v in self.loss_history[-5:]]}")
print("\nActivation diagnostics:")
for item in self.check_activations():
print(f" {item}")
print("\nGradient diagnostics:")
for item in self.check_gradients():
print(f" {item}")
print("\nPer-layer activation stats:")
for name, stats in self.activation_stats.items():
print(f" {name}: mean={stats['mean']:.4f} std={stats['std']:.4f} zero={stats['fraction_zero']:.1%}")
print("\nPer-layer gradient stats:")
for name, stats in self.gradient_stats.items():
print(f" {name}: abs_mean={stats['abs_mean']:.2e} max={stats['max']:.2e}")
def remove_hooks(self):
for hook in self.hooks:
hook.remove()
self.hooks.clear()
Step 2: The Overfit-One-Batch Test
def overfit_one_batch(model, x_batch, y_batch, criterion, lr=0.01, steps=200):
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
model.train()
print("\n=== OVERFIT ONE BATCH TEST ===")
print(f"Batch size: {x_batch.shape[0]}, Steps: {steps}")
for step in range(steps):
optimizer.zero_grad()
output = model(x_batch)
loss = criterion(output, y_batch)
loss.backward()
optimizer.step()
if step % 50 == 0 or step == steps - 1:
with torch.no_grad():
preds = (output > 0).float() if output.shape[-1] == 1 else output.argmax(dim=1)
targets = y_batch if y_batch.dim() == 1 else y_batch.squeeze()
acc = (preds.squeeze() == targets).float().mean().item()
print(f" Step {step:3d} | Loss: {loss.item():.6f} | Accuracy: {acc:.1%}")
final_loss = loss.item()
if final_loss > 0.1:
print(f"\n FAIL: Loss did not converge ({final_loss:.4f}). Model or training loop is broken.")
return False
print(f"\n PASS: Loss converged to {final_loss:.6f}")
return True
Step 3: Learning Rate Finder
def find_learning_rate(model, x_data, y_data, criterion, start_lr=1e-7, end_lr=10, steps=100):
import copy
original_state = copy.deepcopy(model.state_dict())
optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
lr_mult = (end_lr / start_lr) ** (1 / steps)
model.train()
results = []
best_loss = float("inf")
current_lr = start_lr
print("\n=== LEARNING RATE FINDER ===")
for step in range(steps):
optimizer.zero_grad()
output = model(x_data)
loss = criterion(output, y_data)
if math.isnan(loss.item()) or loss.item() > best_loss * 10:
break
best_loss = min(best_loss, loss.item())
results.append((current_lr, loss.item()))
loss.backward()
optimizer.step()
current_lr *= lr_mult
for param_group in optimizer.param_groups:
param_group["lr"] = current_lr
model.load_state_dict(original_state)
if len(results) < 10:
print(" Could not complete LR sweep -- loss diverged too quickly")
return results
min_loss_idx = min(range(len(results)), key=lambda i: results[i][1])
suggested_lr = results[max(0, min_loss_idx - 10)][0]
print(f" Swept {len(results)} steps from {start_lr:.0e} to {results[-1][0]:.0e}")
print(f" Minimum loss {results[min_loss_idx][1]:.4f} at lr={results[min_loss_idx][0]:.2e}")
print(f" Suggested learning rate: {suggested_lr:.2e}")
return results
Step 4: Gradient Checker
def _flat_to_multi_index(flat_idx, shape):
multi_idx = []
remaining = flat_idx
for dim in reversed(shape):
multi_idx.insert(0, remaining % dim)
remaining //= dim
return tuple(multi_idx)
def gradient_check(model, x, y, criterion, eps=1e-4):
model.train()
x_double = x.double()
y_double = y.double()
model_double = model.double()
print("\n=== GRADIENT CHECK ===")
overall_max_diff = 0
checked = 0
for name, param in model_double.named_parameters():
if not param.requires_grad:
continue
layer_max_diff = 0
model_double.zero_grad()
output = model_double(x_double)
loss = criterion(output, y_double)
loss.backward()
analytical_grad = param.grad.clone()
num_checks = min(5, param.numel())
for i in range(num_checks):
idx = _flat_to_multi_index(i, param.shape)
original = param.data[idx].item()
param.data[idx] = original + eps
with torch.no_grad():
loss_plus = criterion(model_double(x_double), y_double).item()
param.data[idx] = original - eps
with torch.no_grad():
loss_minus = criterion(model_double(x_double), y_double).item()
param.data[idx] = original
numerical = (loss_plus - loss_minus) / (2 * eps)
analytical = analytical_grad[idx].item()
denom = max(abs(numerical), abs(analytical), 1e-8)
rel_diff = abs(numerical - analytical) / denom
layer_max_diff = max(layer_max_diff, rel_diff)
checked += 1
overall_max_diff = max(overall_max_diff, layer_max_diff)
status = "OK" if layer_max_diff < 1e-5 else "MISMATCH"
print(f" {name}: max_rel_diff={layer_max_diff:.2e} [{status}]")
model.float()
print(f"\n Checked {checked} parameters")
if overall_max_diff < 1e-5:
print(" PASS: Gradients match (rel_diff < 1e-5)")
elif overall_max_diff < 1e-3:
print(" WARN: Small differences (1e-5 < rel_diff < 1e-3)")
else:
print(" FAIL: Gradient mismatch detected (rel_diff > 1e-3)")
return overall_max_diff
Step 5: Deliberately Broken Networks
Now apply the toolkit to broken networks and diagnose each one.
def demo_broken_networks():
torch.manual_seed(42)
x = torch.randn(64, 10)
y = (x[:, 0] > 0).long()
print("\n" + "=" * 60)
print("BUG 1: Learning rate too high (lr=10)")
print("=" * 60)
model1 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
debugger1 = NetworkDebugger(model1)
optimizer1 = torch.optim.SGD(model1.parameters(), lr=10.0)
criterion = nn.CrossEntropyLoss()
for step in range(20):
optimizer1.zero_grad()
out = model1(x)
loss = criterion(out, y)
debugger1.record_loss(loss.item())
loss.backward()
optimizer1.step()
debugger1.print_report()
debugger1.remove_hooks()
print("\n" + "=" * 60)
print("BUG 2: Dead ReLUs from bad initialization")
print("=" * 60)
model2 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 2))
with torch.no_grad():
for m in model2.modules():
if isinstance(m, nn.Linear):
m.weight.fill_(-1.0)
m.bias.fill_(-5.0)
debugger2 = NetworkDebugger(model2)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
for step in range(50):
optimizer2.zero_grad()
out = model2(x)
loss = criterion(out, y)
debugger2.record_loss(loss.item())
loss.backward()
optimizer2.step()
debugger2.print_report()
debugger2.remove_hooks()
print("\n" + "=" * 60)
print("BUG 3: Missing zero_grad (gradients accumulate)")
print("=" * 60)
model3 = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
debugger3 = NetworkDebugger(model3)
optimizer3 = torch.optim.SGD(model3.parameters(), lr=0.01)
for step in range(50):
out = model3(x)
loss = criterion(out, y)
debugger3.record_loss(loss.item())
loss.backward()
optimizer3.step()
debugger3.print_report()
debugger3.remove_hooks()
print("\n" + "=" * 60)
print("HEALTHY NETWORK: Correct setup for comparison")
print("=" * 60)
model_good = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
debugger_good = NetworkDebugger(model_good)
optimizer_good = torch.optim.Adam(model_good.parameters(), lr=1e-3)
for step in range(50):
optimizer_good.zero_grad()
out = model_good(x)
loss = criterion(out, y)
debugger_good.record_loss(loss.item())
loss.backward()
optimizer_good.step()
debugger_good.print_report()
debugger_good.remove_hooks()
print("\n" + "=" * 60)
print("OVERFIT-ONE-BATCH TEST (healthy model)")
print("=" * 60)
model_test = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
overfit_one_batch(model_test, x[:8], y[:8], criterion)
print("\n" + "=" * 60)
print("LEARNING RATE FINDER")
print("=" * 60)
model_lr = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
find_learning_rate(model_lr, x, y, criterion)
print("\n" + "=" * 60)
print("GRADIENT CHECK")
print("=" * 60)
model_grad = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
gradient_check(model_grad, x[:4], y[:4], criterion)
Use It
PyTorch Built-in Tools
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(768, 256),
nn.ReLU(),
nn.Linear(256, 10),
)
with torch.autograd.detect_anomaly():
output = model(input_tensor)
loss = criterion(output, target)
loss.backward()
for name, param in model.named_parameters():
if param.grad is not None:
print(f"{name}: grad_mean={param.grad.abs().mean():.2e}")
Weights & Biases Integration
import wandb
wandb.init(project="debug-training")
for epoch in range(100):
loss = train_one_epoch()
wandb.log({
"loss": loss,
"lr": optimizer.param_groups[0]["lr"],
"grad_norm": torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")),
})
for name, param in model.named_parameters():
if param.grad is not None:
wandb.log({f"grad/{name}": wandb.Histogram(param.grad.cpu().numpy())})
TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/debug_experiment")
for epoch in range(100):
loss = train_one_epoch()
writer.add_scalar("Loss/train", loss, epoch)
for name, param in model.named_parameters():
writer.add_histogram(f"weights/{name}", param, epoch)
if param.grad is not None:
writer.add_histogram(f"gradients/{name}", param.grad, epoch)
The Debug Checklist (Before Full Training)
- Run overfit-one-batch test. If it fails, stop.
- Print model summary -- verify parameter count is reasonable.
- Run a single forward pass with random data -- check output shape.
- Train for 5 epochs -- verify loss decreases.
- Check activation statistics -- no dead layers, no explosions.
- Check gradient flow -- no vanishing, no exploding.
- Verify data pipeline -- print 5 random samples with labels.
Ship It
This lesson produces:
outputs/prompt-nn-debugger.md-- a prompt for diagnosing neural network training failuresoutputs/skill-debug-checklist.md-- a decision-tree checklist for debugging training issues
Key deployment patterns for debugging:
- Add monitoring hooks to production training scripts
- Log activation and gradient statistics to W&B or TensorBoard every N steps
- Implement automatic alerts for NaN loss, dead neurons (>80% zero), or gradient explosion
- Always run the overfit-one-batch test when changing architectures or data pipelines
Exercises
- Add an exploding gradient detector. Modify the
NetworkDebuggerto detect when gradients exceed a threshold and automatically suggest a gradient clipping value. Test it on a 20-layer network with no normalization.
- Build a dead neuron resurrector. Write a function that identifies dead ReLU neurons (always outputting 0) and reinitializes their incoming weights with Kaiming initialization. Show that this recovers a network where >70% of neurons are dead.
- Implement the learning rate finder with plotting. Extend
find_learning_rateto save results as a CSV and write a separate script that reads the CSV and displays the LR vs loss curve using matplotlib. Identify the optimal LR for ResNet-18 on CIFAR-10.
- Create a data pipeline validator. Write a function that checks for: duplicate samples across train/test splits, label distribution imbalance (>10:1 ratio), input normalization (mean near 0, std near 1), and NaN/Inf values in the data. Run it on a deliberately corrupted dataset.
- Debug a real failure. Take the mini-framework from Lesson 10, introduce a subtle bug (e.g., transpose the weight matrix in backward), and use gradient checking to locate exactly which parameter has incorrect gradients. Document the debugging process.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Silent bug | "It runs but gives bad results" | A bug that produces no error but degrades model quality -- the dominant failure mode in ML |
| Dead ReLU | "The neurons died" | A ReLU neuron whose input is always negative, so it outputs 0 and receives 0 gradient permanently |
| Vanishing gradients | "Early layers stop learning" | Gradients shrink exponentially through layers, making weights in early layers effectively frozen |
| Exploding gradients | "Loss went to NaN" | Gradients grow exponentially through layers, causing weight updates so large they overflow |
| Gradient checking | "Verify backprop is correct" | Comparing analytical gradients from backprop to numerical gradients from finite differences |
| Overfit-one-batch | "The most important debug test" | Training on a single small batch to verify the model CAN learn -- if it cannot, something is fundamentally broken |
| LR finder | "Sweep to find the right learning rate" | Exponentially increasing the learning rate over one epoch and picking the rate just before loss diverges |
| Data leakage | "Test data leaked into training" | When information from the test set contaminates training, producing artificially high accuracy |
| Activation statistics | "Monitor layer health" | Tracking mean, std, and zero-fraction of each layer's output to detect dead, saturated, or exploding neurons |
| Gradient clipping | "Cap the gradient magnitude" | Scaling gradients down when their norm exceeds a threshold, preventing exploding gradient updates |
Further Reading
- Smith, "Cyclical Learning Rates for Training Neural Networks" (2017) -- the paper introducing the learning rate range test (LR finder)
- Northcutt et al., "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks" (2021) -- demonstrates that 3-6% of labels in ImageNet, CIFAR-10, and other major benchmarks are wrong
- Zhang et al., "Understanding Deep Learning Requires Rethinking Generalization" (2017) -- the paper showing neural networks can memorize random labels, which is why the overfit-one-batch test works
- PyTorch documentation on
torch.autograd.detect_anomalyandtorch.autograd.set_detect_anomalyfor built-in NaN/Inf detection