Instance Segmentation — Mask R-CNN
> Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.
Type: Build + Learn
Languages: Python
Prerequisites: Phase 4 Lesson 06 (YOLO), Phase 4 Lesson 07 (U-Net)
Time: ~75 minutes
Learning Objectives
- Trace the Mask R-CNN architecture end-to-end: backbone, FPN, RPN, RoIAlign, box head, mask head
- Implement RoIAlign from scratch and explain why RoIPool is no longer used
- Use the torchvision `maskrcnn_resnet50_fpn_v2` pretrained model for production-quality instance masks and read its output format correctly
- Fine-tune Mask R-CNN on a small custom dataset by replacing the box and mask heads and keeping the backbone frozen
The Problem
Semantic segmentation gives you one mask per class. Instance segmentation gives you one mask per object, even when two objects share a class. Counting individuals, tracking across frames, and measuring things (the bounding box of each brick in a wall, each cell in a microscope image) all demand instance segmentation.
Mask R-CNN (He et al., 2017) solved this by reframing instance segmentation as detection-plus-a-mask. The design was so clean that for the next five years almost every instance segmentation paper was a Mask R-CNN variant, and the torchvision implementation is still the production default for small to medium datasets.
The hard engineering problem is sampling: how do you crop a fixed-size feature region out of a proposal box whose corners do not align with pixel boundaries? Getting that wrong costs mask AP on every object, most of all at strict IoU thresholds. RoIAlign is the answer.
The Concept
The architecture
[Architecture diagram: backbone -> Feature Pyramid Network; FPN -> Region Proposal Network and RoIAlign; RPN -> RoIAlign (top-K proposals); RoIAlign -> Box head (class + refine) and Mask head (14x14 conv); both heads -> NMS -> boxes + classes + masks]
Five pieces to understand:
- Backbone: ResNet-50 or ResNet-101 pretrained on ImageNet. Produces a hierarchy of feature maps at strides 4, 8, 16, 32.
- FPN (Feature Pyramid Network): top-down + lateral connections that give every level C channels of semantically rich features. Detection queries the FPN level matching the object size.
- RPN (Region Proposal Network): a small conv head that, at every anchor position, predicts "is there an object here?" and "how do I refine the box?". Produces ~1000 proposals per image.
- RoIAlign: samples a fixed-size (e.g. 7x7) feature patch from any box on any FPN level. Bilinear sampling, no quantisation.
- Heads: a two-layer box head that refines the box and picks a class, plus a small conv head that outputs a 28x28 binary mask for each proposal.
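The "level matching the object size" rule can be sketched with the heuristic from the FPN paper, k = floor(k0 + log2(sqrt(w * h) / 224)), clamped to the available levels. The function name `fpn_level` and the defaults (k0 = 4, levels 2 to 5) are illustrative assumptions, not torchvision's exact code.

```python
import math


def fpn_level(box, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the FPN level (2..5, i.e. strides 4..32) for an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    scale = math.sqrt(max(x2 - x1, 0.0) * max(y2 - y1, 0.0))
    # A 224x224 box maps to level k0; halving the side moves one level down.
    # The small epsilon keeps log2 finite for degenerate boxes.
    k = math.floor(k0 + math.log2(scale / canonical_size + 1e-6))
    return min(max(k, k_min), k_max)


for box in [(0, 0, 32, 32), (0, 0, 112, 112), (0, 0, 224, 224), (0, 0, 640, 480)]:
    level = fpn_level(box)
    print(f"box {box} -> level {level} (stride {2 ** level})")
```

Small boxes land on the high-resolution stride-4 map and large boxes on the coarse stride-32 map, so each RoI is cropped from features at a comparable relative scale.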
Why RoIAlign, not RoIPool
The original Fast R-CNN used RoIPool, which splits a proposal box into a grid, takes the maximum feature in each cell, and rounds all coordinates to integers. That rounding shifts the pooled features relative to the input pixels by up to a full feature-map cell: tolerable at stride 4, catastrophic at stride 32, where one cell covers 32 input pixels.
```
RoIPool:
  box (34.7, 51.3, 98.2, 142.9)
  round -> (35, 51, 98, 143)
  split grid -> round each cell boundary
  misalignment accumulates at every step

RoIAlign:
  box (34.7, 51.3, 98.2, 142.9)
  sample at exact float coordinates using bilinear interpolation
  no rounding anywhere
```
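In the paper's ablation, RoIAlign lifts mask AP by roughly 3 points on COCO at no extra cost, with most of the gain at strict IoU thresholds. Region-based detectors that care about localisation have used it ever since. Query-based models such as RT-DETR and Mask2Former take a different route and do not crop RoIs at all.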
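To put a number on the misalignment, here is a minimal sketch that projects a box onto feature maps of different strides and measures how far rounding the corners moves them, expressed in image pixels. The helper names and the example box are hypothetical, and only the first rounding step is modelled; RoIPool rounds the bin boundaries as well.

```python
def project(box, stride):
    """Map an image-space box onto the feature map without rounding."""
    return [c / stride for c in box]


def quantisation_error_px(box, stride):
    """Worst-case corner shift, in image pixels, caused by rounding the projected box."""
    exact = project(box, stride)
    rounded = [round(c) for c in exact]
    return max(abs(e - r) for e, r in zip(exact, rounded)) * stride


box = (347.0, 513.0, 982.0, 1429.0)  # hypothetical proposal in image pixels
for stride in (4, 8, 16, 32):
    err = quantisation_error_px(box, stride)
    print(f"stride {stride:2d}: corners move by up to {err:.1f} image pixels")
```

The shift is bounded by half the stride, so the same rounding that is barely visible at stride 4 can move a corner by many pixels at stride 32.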
The RPN in one paragraph
At every position of a feature map, place K anchor boxes of different sizes and shapes. Predict an objectness score for each anchor and a regression offset to turn the anchor into a better-fitting box. Keep the top ~1,000 boxes by score, apply NMS at IoU 0.7, and hand the survivors to the heads. The RPN is trained with its own mini-loss — the same structure as the YOLO loss from Lesson 6, just with two classes (object / no object).
The mask head
For each proposal (after RoIAlign) the mask head is a tiny FCN: four 3x3 convs, a 2x deconv, a final 1x1 conv that produces num_classes output channels at 28x28 resolution. Only the channel corresponding to the predicted class is kept; the others are ignored. This decouples mask prediction from classification.
Upsample the 28x28 mask to the proposal's original pixel size to produce the final binary mask.
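A minimal PyTorch sketch of that head, assuming 256-channel 14x14 RoI features as input. The class name `TinyMaskHead` and the layer widths are illustrative, not torchvision's module names.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyMaskHead(nn.Module):
    """Four 3x3 convs, one 2x deconv, one 1x1 conv: (N, C, 14, 14) -> (N, num_classes, 28, 28)."""

    def __init__(self, in_channels=256, hidden=256, num_classes=91):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, hidden, 3, padding=1), nn.ReLU(inplace=True)]
            c = hidden
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(hidden, hidden, 2, stride=2)
        self.logits = nn.Conv2d(hidden, num_classes, 1)

    def forward(self, roi_features):
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.logits(x)


head = TinyMaskHead(num_classes=5)
rois = torch.randn(3, 256, 14, 14)        # RoIAlign output for 3 proposals
labels = torch.tensor([1, 4, 2])          # class predicted by the box head
logits = head(rois)                       # (3, 5, 28, 28)
picked = logits[torch.arange(3), labels]  # keep only the predicted class: (3, 28, 28)
probs = picked.sigmoid()

# Resize the first mask to a hypothetical 60x40 pixel box and binarise it.
full = F.interpolate(probs[0][None, None], size=(60, 40), mode="bilinear", align_corners=False)
binary = full[0, 0] > 0.5
print(tuple(logits.shape), tuple(picked.shape), tuple(binary.shape))
```

Because each class has its own channel with a sigmoid, classes do not compete inside the mask; the box head alone decides which channel is read.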
Losses
Mask R-CNN sums five losses:
L = L_rpn_cls + L_rpn_box + L_box_cls + L_box_reg + L_mask
- `L_rpn_cls`, `L_rpn_box`: objectness + box regression for the RPN proposals.
- `L_box_cls`: cross-entropy over (C+1) classes (including background) on the head's classifier.
- `L_box_reg`: smooth L1 on the head's box refinement.
- `L_mask`: per-pixel binary cross-entropy on the 28x28 mask output.
In training mode torchvision returns the five terms as separate entries of a loss dict and leaves the weighting to your training loop; summing them unweighted is the usual starting point.
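As a rough illustration of the mask term, the sketch below computes per-pixel binary cross-entropy on the channel of the ground-truth class only. The function name `mask_loss` and the tensor shapes are assumptions; a real implementation also crops and resizes each ground-truth mask to its proposal before this step.

```python
import torch
import torch.nn.functional as F


def mask_loss(mask_logits, gt_labels, gt_masks):
    """
    mask_logits: (N, num_classes, 28, 28) raw mask head outputs
    gt_labels:   (N,) ground-truth class id of each positive proposal
    gt_masks:    (N, 28, 28) ground-truth masks already resized to the head resolution
    """
    idx = torch.arange(mask_logits.shape[0])
    picked = mask_logits[idx, gt_labels]  # one channel per proposal: (N, 28, 28)
    return F.binary_cross_entropy_with_logits(picked, gt_masks.float())


torch.manual_seed(0)
logits = torch.randn(4, 5, 28, 28)
labels = torch.tensor([1, 2, 3, 4])
gt = torch.rand(4, 28, 28) > 0.5
loss = mask_loss(logits, labels, gt)
print(f"L_mask = {loss.item():.4f}")
```

Channels of other classes receive no gradient from this term, which is what lets the mask branch ignore classification.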
Output format
torchvision.models.detection.maskrcnn_resnet50_fpn_v2 returns a list of dicts, one per image:
```
{
    "boxes":  (N, 4) in (x1, y1, x2, y2) pixel coordinates,
    "labels": (N,) class IDs, 0 = background so indices are 1-based,
    "scores": (N,) confidence scores,
    "masks":  (N, 1, H, W) float masks in [0, 1], threshold at 0.5 for binary,
}
```
The mask is full image resolution already. The 28x28 head output has been upsampled internally.
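A small post-processing sketch for that output format: drop low-confidence detections and binarise the remaining masks. The helper `filter_predictions`, the thresholds, and the hand-made prediction dict are illustrative assumptions, not part of torchvision.

```python
import torch


def filter_predictions(pred, score_thresh=0.5, mask_thresh=0.5):
    """Keep detections above score_thresh and return boolean (K, H, W) masks."""
    keep = pred["scores"] >= score_thresh
    return {
        "boxes": pred["boxes"][keep],
        "labels": pred["labels"][keep],
        "scores": pred["scores"][keep],
        "masks": (pred["masks"][keep] > mask_thresh).squeeze(1),
    }


# Hand-made stand-in with the same keys and shapes as one model output dict.
fake = {
    "boxes": torch.tensor([[0.0, 0.0, 4.0, 3.0], [1.0, 1.0, 5.0, 4.0], [2.0, 0.0, 6.0, 2.0]]),
    "labels": torch.tensor([1, 3, 1]),
    "scores": torch.tensor([0.9, 0.3, 0.7]),
    "masks": torch.rand(3, 1, 4, 6),
}
kept = filter_predictions(fake)
print(f"kept {len(kept['scores'])} of {len(fake['scores'])} detections")
```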
Build It
Step 1: RoIAlign from scratch
This is the one component of Mask R-CNN that is simpler to understand as code than as prose.
```python
import torch
import torch.nn.functional as F


def roi_align_single(feature, box, output_size=7, spatial_scale=1 / 16.0):
    """
    feature: (C, H, W) single-image feature map
    box: (x1, y1, x2, y2) in original image pixel coordinates
    output_size: side of the output grid (7 for box head, 14 for mask head)
    spatial_scale: reciprocal of the feature map stride
    """
    C, H, W = feature.shape
    # Project the box onto the feature map; the -0.5 shift puts pixel centres on integers.
    x1, y1, x2, y2 = [c * spatial_scale - 0.5 for c in box]
    bin_w = (x2 - x1) / output_size
    bin_h = (y2 - y1) / output_size
    # One sample at the centre of each output bin.
    grid_y = torch.linspace(y1 + bin_h / 2, y2 - bin_h / 2, output_size)
    grid_x = torch.linspace(x1 + bin_w / 2, x2 - bin_w / 2, output_size)
    yy, xx = torch.meshgrid(grid_y, grid_x, indexing="ij")
    # grid_sample expects coordinates normalised to [-1, 1], ordered (x, y).
    gx = 2 * (xx + 0.5) / W - 1
    gy = 2 * (yy + 0.5) / H - 1
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)
    sampled = F.grid_sample(feature.unsqueeze(0), grid, mode="bilinear",
                            align_corners=False)
    return sampled.squeeze(0)
```
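Every output value comes from a bilinearly sampled position. No rounding, no quantisation, no dropped gradients. This version takes one sample per bin; the paper averages four regularly spaced samples per bin.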
Step 2: Compare to torchvision's RoIAlign
```python
from torchvision.ops import roi_align

feature = torch.randn(1, 16, 50, 50)
boxes = torch.tensor([[0, 10, 20, 100, 90]], dtype=torch.float32)  # (batch_idx, x1, y1, x2, y2)

ours = roi_align_single(feature[0], boxes[0, 1:].tolist(), output_size=7, spatial_scale=1/4)
theirs = roi_align(feature, boxes, output_size=(7, 7), spatial_scale=1/4, sampling_ratio=1, aligned=True)[0]

print(f"shape ours: {tuple(ours.shape)}")
print(f"shape theirs: {tuple(theirs.shape)}")
print(f"max|diff|: {(ours - theirs).abs().max().item():.3e}")
```
With sampling_ratio=1 and aligned=True, the two match to within 1e-5.
Step 3: Load a pretrained Mask R-CNN
```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2, MaskRCNN_ResNet50_FPN_V2_Weights

model = maskrcnn_resnet50_fpn_v2(weights=MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT)
model.eval()

print(f"params: {sum(p.numel() for p in model.parameters()):,}")
print(f"classes (including background): {model.roi_heads.box_predictor.cls_score.out_features}")
```
46M parameters, 91 classes (COCO). The first class (id 0) is background; everything the model actually detects starts at id 1.
Step 4: Run inference
```python
with torch.no_grad():
    x = torch.rand(3, 400, 600)  # random stand-in image; the model expects float values in [0, 1]
    predictions = model([x])

p = predictions[0]
print(f"boxes: {tuple(p['boxes'].shape)}")
print(f"labels: {tuple(p['labels'].shape)}")
print(f"scores: {tuple(p['scores'].shape)}")
print(f"masks: {tuple(p['masks'].shape)}")
```
The mask tensor is shape (N, 1, H, W). Threshold at 0.5 to get a binary mask per object:
```python
binary_masks = (p['masks'] > 0.5).squeeze(1)  # (N, H, W) boolean
```
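For counting or visualisation it is often handy to merge the per-object masks into one integer map. A minimal sketch, assuming boolean (N, H, W) masks and one score per instance; the function name and the overlap rule (higher score wins) are illustrative choices.

```python
import torch


def masks_to_instance_map(binary_masks, scores):
    """Paint (N, H, W) boolean masks into one (H, W) map: 0 = background, i + 1 = instance i."""
    _, h, w = binary_masks.shape
    instance_map = torch.zeros((h, w), dtype=torch.int64)
    # Paint low-score instances first so that higher-score ones win any overlap.
    for i in torch.argsort(scores).tolist():
        instance_map[binary_masks[i]] = i + 1
    return instance_map


masks = torch.zeros(2, 4, 4, dtype=torch.bool)
masks[0, :2, :] = True    # instance 0 covers rows 0-1
masks[1, 1:3, :] = True   # instance 1 covers rows 1-2 and overlaps instance 0 on row 1
scores = torch.tensor([0.9, 0.6])
imap = masks_to_instance_map(masks, scores)
print(imap)
```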
Step 5: Swap the heads for a custom class count
The common fine-tuning recipe: reuse the backbone, FPN, and RPN; replace the two classifier heads.
```python
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def build_custom_maskrcnn(num_classes):
    model = maskrcnn_resnet50_fpn_v2(weights=MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, hidden_layer, num_classes)
    return model


custom = build_custom_maskrcnn(num_classes=5)
print(f"custom cls_score.out_features: {custom.roi_heads.box_predictor.cls_score.out_features}")
```
num_classes must include the background class, so a dataset with 4 object classes uses num_classes=5.
Step 6: Freeze what does not need training
On small datasets, freeze the backbone and the FPN. Only the RPN objectness + regression and the two heads learn.
```python
def freeze_backbone_and_fpn(model):
    # torchvision Mask R-CNN packs the FPN inside `model.backbone` (as
    # `model.backbone.fpn`), so iterating `model.backbone.parameters()` covers
    # both the ResNet feature layers and the FPN lateral/output convs.
    for p in model.backbone.parameters():
        p.requires_grad = False
    return model


custom = freeze_backbone_and_fpn(custom)
trainable = sum(p.numel() for p in custom.parameters() if p.requires_grad)
print(f"trainable after freeze: {trainable:,}")
```
On 500-image datasets this is the difference between convergence and overfitting.
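After freezing, it is tidy to give the optimizer only the parameters that still require gradients. The sketch below uses a tiny stand-in network so it runs quickly; with the real model you would filter `custom.parameters()` the same way. The hyperparameters are placeholder assumptions, not tuned values.

```python
import torch
from torch import nn

# Stand-in for Mask R-CNN: the first conv plays the frozen backbone, the last one the heads.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 4, 1))
for p in model[0].parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9, weight_decay=1e-4)

n_opt = sum(p.numel() for group in optimizer.param_groups for p in group["params"])
print(f"parameters seen by the optimizer: {n_opt}")
```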
Use It
The full training loop for Mask R-CNN in torchvision is 40 lines and does not change meaningfully between tasks — swap datasets and go.
```python
def train_step(model, images, targets, optimizer):
    model.train()
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    return {k: v.item() for k, v in loss_dict.items()}
```
The targets list must have per-image dicts with `boxes`, `labels`, and `masks` (as (num_instances, H, W) binary tensors). In training mode the model returns a dict of five losses; in eval mode it returns a list of predictions. The switch is `model.training`.
The pycocotools evaluator produces mAP@IoU=0.5:0.95 both for boxes and for masks; you need both numbers to know if the box head or the mask head is the bottleneck.
Ship It
This lesson produces:
- `outputs/prompt-instance-vs-semantic-router.md`: a prompt that asks three questions and picks instance vs semantic vs panoptic plus the exact model to start with.
- `outputs/skill-mask-rcnn-head-swapper.md`: a skill that generates the 10 lines of code for swapping heads on any torchvision detection model, given the new `num_classes`.
Exercises
- (Easy) Verify your RoIAlign against `torchvision.ops.roi_align` on 100 random boxes. Report the max absolute difference. Also run RoIPool (pre-2017 behaviour) and show it diverges by ~1-2 feature-map pixels on boxes near the border.
- (Medium) Fine-tune `maskrcnn_resnet50_fpn_v2` on a 50-image custom dataset (any two classes: balloons, fish, pothole, logos). Freeze the backbone, train for 20 epochs, report mask AP@0.5.
- (Hard) Replace Mask R-CNN's mask head with one that predicts at 56x56 instead of 28x28. Measure mAP@IoU=0.75 before and after. Explain why the gain (or lack of one) matches the expected boundary-precision / memory trade-off.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Mask R-CNN | "Detection plus masks" | Faster R-CNN + a small FCN head that predicts a 28x28 mask per proposal per class |
| FPN | "Feature pyramid" | Top-down + lateral connections that give every stride level C channels of semantic-rich features |
| RPN | "Region proposer" | A small conv head that produces ~1000 object/no-object proposals per image |
| RoIAlign | "No-rounding crop" | Bilinearly samples a fixed-size feature grid from any float-coordinate box |
| RoIPool | "Pre-2017 crop" | Same purpose as RoIAlign but rounds box coordinates; obsolete |
| Mask AP | "Instance mAP" | Average precision computed with mask IoU instead of box IoU; the COCO instance segmentation metric |
| Binary mask head | "Per-class mask" | Predicts one binary mask per class for each proposal; only the predicted class's channel is kept |
| Background class | "Class 0" | The catch-all "no object" class; indices for real classes start at 1 |
Further Reading
- Mask R-CNN (He et al., 2017) — the paper; section 3 on RoIAlign is the critical read
- FPN: Feature Pyramid Networks (Lin et al., 2017) — the FPN paper; every modern detector uses it
- torchvision Mask R-CNN tutorial — the reference for the fine-tuning loop
- Detectron2 model zoo — production implementations with trained weights for nearly every detection and segmentation variant