Instance Segmentation — Mask R-CNN

> Add a tiny mask branch to a Faster R-CNN detector and you have instance segmentation. The hard part is RoIAlign, and it is harder than it looks.

Type: Build + Learn

Languages: Python

Prerequisites: Phase 4 Lesson 06 (YOLO), Phase 4 Lesson 07 (U-Net)

Time: ~75 minutes

Learning Objectives

The Problem

Semantic segmentation gives you one mask per class. Instance segmentation gives you one mask per object, even when two objects share a class. Counting individuals, tracking across frames, and measuring things (the bounding box of each brick in a wall, each cell in a microscope image) all demand instance segmentation.

Mask R-CNN (He et al., 2017) solved this by reframing instance segmentation as detection-plus-a-mask. The design was so clean that for the next five years almost every instance segmentation paper was a Mask R-CNN variant, and the torchvision implementation is still a common production default for small to medium datasets.

The hard engineering problem is sampling: how do you crop a fixed-size feature region out of a proposal box whose corners do not align with pixel boundaries? Getting that wrong costs several points of mask AP, because every mask is then predicted from slightly shifted features. RoIAlign is the answer.

The Concept

The architecture

flowchart LR
    IMG["Input"] --> BB["ResNet<br/>backbone"]
    BB --> FPN["Feature<br/>Pyramid Network"]
    FPN --> RPN["Region<br/>Proposal<br/>Network"]
    FPN --> RA["RoIAlign"]
    RPN -->|"top-K proposals"| RA
    RA --> BH["Box head<br/>(class + refine)"]
    RA --> MH["Mask head<br/>(14x14 conv)"]
    BH --> NMS["NMS"]
    MH --> NMS
    NMS --> OUT["boxes +<br/>classes + masks"]
    style BB fill:#dbeafe,stroke:#2563eb
    style FPN fill:#fef3c7,stroke:#d97706
    style RPN fill:#fecaca,stroke:#dc2626
    style OUT fill:#dcfce7,stroke:#16a34a

Five pieces to understand:

  1. Backbone — ResNet-50 or ResNet-101 trained on ImageNet. Produces a hierarchy of feature maps at strides 4, 8, 16, 32.
  2. FPN (Feature Pyramid Network) — top-down + lateral connections that give every level C channels of semantic-rich features. Detection queries the FPN level matching the object size.
  3. RPN (Region Proposal Network) — a small conv head that, at every anchor position, predicts "is there an object here?" and "how do I refine the box?". Produces ~1000 proposals per image.
  4. RoIAlign — samples a fixed-size (e.g. 7x7) feature patch from any box on any FPN level. Bilinear sampling, no quantisation.
  5. Heads — two-layer box head that refines the box and picks a class, plus a small conv head that outputs a 28x28 binary mask for each proposal.
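The "FPN level matching the object size" rule in item 2 can be sketched in a few lines. This follows the assignment heuristic from the FPN paper, k = floor(k0 + log2(sqrt(w*h) / 224)) with k0 = 4. The function name `fpn_level` and the clamp to levels 2 to 5 (strides 4 to 32) are assumptions for illustration, not torchvision's exact code.

```python
import math

def fpn_level(box, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick a pyramid level for a box (x1, y1, x2, y2) in image pixels.

    Level k has stride 2**k, so levels 2..5 correspond to strides 4..32.
    Hypothetical helper for illustration only.
    """
    x1, y1, x2, y2 = box
    size = math.sqrt(max(x2 - x1, 0.0) * max(y2 - y1, 0.0))
    if size == 0:
        return k_min  # degenerate box: fall back to the finest level
    k = math.floor(k0 + math.log2(size / canonical_size))
    return min(max(k, k_min), k_max)

# Small boxes land on fine (low-stride) levels, large boxes on coarse ones.
for box in [(0, 0, 32, 32), (0, 0, 112, 112), (0, 0, 224, 224), (0, 0, 640, 480)]:
    k = fpn_level(box)
    print(box, "-> level", k, "stride", 2 ** k)
```

A 224x224 box maps to level 4 (stride 16); halving the side length moves it one level finer.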

Why RoIAlign, not RoIPool

The original Fast R-CNN used RoIPool, which rounds the proposal box to integer feature-map coordinates, splits it into a grid, rounds each cell boundary again, and takes the maximum feature in each cell. That rounding shifts the pooled features relative to the input by up to a full feature-map cell. A cell covers `stride` input pixels, so the error is a few pixels at stride 4 but up to 32 pixels at stride 32, which is enough to ruin a pixel-accurate mask.

RoIPool:
  box (34.7, 51.3, 98.2, 142.9)
  round -> (34, 51, 98, 142)
  split grid -> round each cell boundary
  misalignment accumulates at every step

RoIAlign:
  box (34.7, 51.3, 98.2, 142.9)
  sample at exact float coordinates using bilinear interpolation
  no rounding anywhere
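To make the contrast concrete, here is a minimal pure-Python sketch of reading one value at a float coordinate, once with bilinear interpolation and once with RoIPool-style rounding. The helper `bilinear` and the ramp-shaped toy feature map are assumptions made for this illustration; integer coordinates are treated as pixel centres.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D list-of-lists at float (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid range so the four neighbours always exist.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

# Toy feature map: a linear ramp, value = 10 * row + col.
feature = [[10 * r + c for c in range(8)] for r in range(8)]

y, x = 2.7, 4.3
exact = bilinear(feature, y, x)    # RoIAlign-style read at the float position
rounded = feature[int(y)][int(x)]  # RoIPool-style read after truncation

# On a linear ramp, bilinear interpolation recovers 10*2.7 + 4.3 = 31.3
# (up to float error); truncation reads the value at (2, 4) instead.
print(f"bilinear: {exact:.2f}  rounded: {rounded}")
```

The gap between the two reads is the misalignment that RoIPool bakes into every pooled cell.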

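RoIAlign lifts mask AP by roughly 3 points on COCO in the original paper's ablation, at essentially no extra compute. Two-stage detectors that crop per-region features (Faster R-CNN, Mask R-CNN, Cascade R-CNN and their descendants) now use it by default. Single-stage and query-based models such as RT-DETR or Mask2Former do not crop RoIs, so they have no need for it.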

The RPN in one paragraph

At every position of a feature map, place K anchor boxes of different sizes and shapes. Predict an objectness score for each anchor and a regression offset to turn the anchor into a better-fitting box. Keep the top ~1,000 boxes by score, apply NMS at IoU 0.7, and hand the survivors to the heads. The RPN is trained with its own mini-loss — the same structure as the YOLO loss from Lesson 6, just with two classes (object / no object).
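The "keep the top boxes, then NMS at IoU 0.7" step can be sketched without any framework. This is a plain greedy NMS over Python tuples, written for readability rather than speed; the function names and the toy boxes are assumptions for the example, and real code would use `torchvision.ops.nms`.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thresh=0.7, top_k=1000):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:top_k]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (0, 0, 10, 9), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.75, 0.6]
keep = greedy_nms(boxes, scores)
print("kept indices:", keep)
```

Box 1 overlaps box 0 with IoU 0.9 and is suppressed; box 2 overlaps box 0 with IoU of about 0.68, just under the 0.7 threshold, so it survives.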

The mask head

For each proposal (after RoIAlign) the mask head is a tiny FCN: four 3x3 convs, a 2x deconv, a final 1x1 conv that produces num_classes output channels at 28x28 resolution. Only the channel corresponding to the predicted class is kept; the others are ignored. This decouples mask prediction from classification.

Upsample the 28x28 mask to the proposal's original pixel size to produce the final binary mask.
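A minimal sketch of the head described above, assuming 14x14 RoIAlign features as input. `TinyMaskHead` and its layer widths are hypothetical and chosen to run quickly; it mirrors the structure (four 3x3 convs, a 2x deconv, a 1x1 conv) but is not torchvision's module.

```python
import torch
from torch import nn

class TinyMaskHead(nn.Module):
    """Four 3x3 convs -> 2x transposed conv -> 1x1 conv with one channel per class."""

    def __init__(self, in_channels=256, hidden=256, num_classes=91):
        super().__init__()
        layers, c = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(c, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c = hidden
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(hidden, hidden, kernel_size=2, stride=2)
        self.logits = nn.Conv2d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.convs(x)                # (N, hidden, 14, 14)
        x = torch.relu(self.deconv(x))   # (N, hidden, 28, 28)
        return self.logits(x)            # (N, num_classes, 28, 28)

# Small widths keep the demo fast; real heads use 256 channels.
head = TinyMaskHead(in_channels=16, hidden=16, num_classes=5)
roi_feats = torch.randn(3, 16, 14, 14)   # three proposals after RoIAlign
labels = torch.tensor([1, 4, 2])         # class predicted by the box head

logits = head(roi_feats)
picked = logits[torch.arange(len(labels)), labels]  # keep one channel per proposal
probs = picked.sigmoid()                            # per-pixel foreground probability
print(tuple(logits.shape), tuple(picked.shape))
```

Indexing with the predicted label is what keeps mask prediction decoupled from classification: each channel only ever answers "is this pixel part of an object of my class?".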

Losses

Mask R-CNN sums five losses: two for the RPN, two for the box head, and one for the mask head:

L = L_rpn_cls + L_rpn_box + L_box_cls + L_box_reg + L_mask

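torchvision returns these as five separate entries in a loss dict (loss_objectness, loss_rpn_box_reg, loss_classifier, loss_box_reg, loss_mask) and does not apply per-term weights for you. The usual recipe simply sums them; any reweighting happens in your own training loop.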
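If you want to emphasise one term, one option is to scale the entries of the loss dict before summing. The sketch below uses the five key names torchvision's Mask R-CNN returns in training mode; the helper `weighted_total` and the numeric values are made-up placeholders, and the same function works when the values are tensors.

```python
def weighted_total(loss_dict, weights=None):
    """Sum a dict of losses, scaling each term by weights[name] (default 1.0)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in loss_dict.items())

# Placeholder numbers standing in for the tensors the model would return.
example_losses = {
    "loss_objectness": 0.05,    # L_rpn_cls
    "loss_rpn_box_reg": 0.03,   # L_rpn_box
    "loss_classifier": 0.40,    # L_box_cls
    "loss_box_reg": 0.20,       # L_box_reg
    "loss_mask": 0.50,          # L_mask
}

unweighted = weighted_total(example_losses)
mask_heavy = weighted_total(example_losses, {"loss_mask": 2.0})
print(f"unweighted: {unweighted:.2f}  mask x2: {mask_heavy:.2f}")
```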

Output format

torchvision.models.detection.maskrcnn_resnet50_fpn_v2 returns a list of dicts, one per image:

{
    "boxes":  (N, 4) in (x1, y1, x2, y2) pixel coordinates,
    "labels": (N,) class IDs, 0 = background so indices are 1-based,
    "scores": (N,) confidence scores,
    "masks":  (N, 1, H, W) float masks in [0, 1] — threshold at 0.5 for binary,
}

The mask is full image resolution already. The 28x28 head output has been upsampled internally.

Build It

Step 1: RoIAlign from scratch

This is the one component of Mask R-CNN that is simpler to understand as code than as prose.

import torch
import torch.nn.functional as F

def roi_align_single(feature, box, output_size=7, spatial_scale=1 / 16.0):
    """
    feature: (C, H, W) single-image feature map
    box: (x1, y1, x2, y2) in original image pixel coordinates
    output_size: side of the output grid (7 for box head, 14 for mask head)
    spatial_scale: reciprocal of the feature map stride
    """
    C, H, W = feature.shape
    x1, y1, x2, y2 = [c * spatial_scale - 0.5 for c in box]
    bin_w = (x2 - x1) / output_size
    bin_h = (y2 - y1) / output_size

    grid_y = torch.linspace(y1 + bin_h / 2, y2 - bin_h / 2, output_size)
    grid_x = torch.linspace(x1 + bin_w / 2, x2 - bin_w / 2, output_size)
    yy, xx = torch.meshgrid(grid_y, grid_x, indexing="ij")

    gx = 2 * (xx + 0.5) / W - 1
    gy = 2 * (yy + 0.5) / H - 1
    grid = torch.stack([gx, gy], dim=-1).unsqueeze(0)
    sampled = F.grid_sample(feature.unsqueeze(0), grid, mode="bilinear",
                            align_corners=False)
    return sampled.squeeze(0)

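Every output value is read at a bilinearly sampled position, one sample at the centre of each bin. No rounding, no quantisation, no dropped gradients. Samples that fall outside the feature map are zero-padded here, so boxes that touch the image border can differ slightly from torchvision's result.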

Step 2: Compare to torchvision's RoIAlign

from torchvision.ops import roi_align

feature = torch.randn(1, 16, 50, 50)
boxes = torch.tensor([[0, 10, 20, 100, 90]], dtype=torch.float32)  # (batch_idx, x1, y1, x2, y2)

ours = roi_align_single(feature[0], boxes[0, 1:].tolist(), output_size=7, spatial_scale=1/4)
theirs = roi_align(feature, boxes, output_size=(7, 7), spatial_scale=1/4, sampling_ratio=1, aligned=True)[0]

print(f"shape ours:   {tuple(ours.shape)}")
print(f"shape theirs: {tuple(theirs.shape)}")
print(f"max|diff|:    {(ours - theirs).abs().max().item():.3e}")

With sampling_ratio=1 and aligned=True, the two match to within 1e-5.

Step 3: Load a pretrained Mask R-CNN

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn_v2, MaskRCNN_ResNet50_FPN_V2_Weights

model = maskrcnn_resnet50_fpn_v2(weights=MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT)
model.eval()
print(f"params: {sum(p.numel() for p in model.parameters()):,}")
print(f"classes (including background): {model.roi_heads.box_predictor.cls_score.out_features}")

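About 46M parameters and 91 output classes. The 91 slots follow the original COCO category ids: id 0 is background, 80 ids are real categories, and the remaining ids are unused placeholders. Everything the model actually detects starts at id 1.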

Step 4: Run inference

with torch.no_grad():
    # The model expects float images in [0, 1]; random noise usually yields few or no detections.
    x = torch.rand(3, 400, 600)
    predictions = model([x])
p = predictions[0]
print(f"boxes:  {tuple(p['boxes'].shape)}")
print(f"labels: {tuple(p['labels'].shape)}")
print(f"scores: {tuple(p['scores'].shape)}")
print(f"masks:  {tuple(p['masks'].shape)}")

The mask tensor is shape (N, 1, H, W). Threshold at 0.5 to get a binary mask per object:

binary_masks = (p['masks'] > 0.5).squeeze(1)  # (N, H, W) boolean
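Counting and measuring instances usually starts from exactly this output. Below is a minimal sketch that keeps confident detections, binarises their masks, and reports a pixel area per instance. The helper `summarise_instances`, the thresholds, and the hand-built `fake` prediction are assumptions for the demo; with a real model you would pass `predictions[0]` instead.

```python
import torch

def summarise_instances(pred, score_thresh=0.5, mask_thresh=0.5):
    """Count confident instances and measure their mask areas in pixels."""
    keep = pred["scores"] >= score_thresh
    masks = (pred["masks"][keep] > mask_thresh).squeeze(1)  # (K, H, W) boolean
    return {
        "count": int(keep.sum()),
        "labels": pred["labels"][keep].tolist(),
        "areas": masks.flatten(1).sum(dim=1).tolist(),
    }

# Hand-built prediction dict in the same layout the model returns.
fake = {
    "boxes": torch.tensor([[2.0, 1.0, 6.0, 4.0], [0.0, 0.0, 3.0, 3.0]]),
    "labels": torch.tensor([1, 3]),
    "scores": torch.tensor([0.9, 0.2]),
    "masks": torch.zeros(2, 1, 8, 8),
}
fake["masks"][0, 0, 1:4, 2:6] = 0.8  # 3 rows x 4 cols of confident pixels
fake["masks"][1, 0, 0:3, 0:3] = 0.9  # belongs to the low-score detection

summary = summarise_instances(fake)
print(summary)
```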

Step 5: Swap the heads for a custom class count

The common fine-tuning recipe: reuse the backbone, FPN, and RPN; replace the two predictor heads (box and mask).

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_custom_maskrcnn(num_classes):
    model = maskrcnn_resnet50_fpn_v2(weights=MaskRCNN_ResNet50_FPN_V2_Weights.DEFAULT)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, hidden_layer, num_classes)
    return model

custom = build_custom_maskrcnn(num_classes=5)
print(f"custom cls_score.out_features: {custom.roi_heads.box_predictor.cls_score.out_features}")

num_classes must include the background class, so a dataset with 4 object classes uses num_classes=5.

Step 6: Freeze what does not need training

On small datasets, freeze the backbone and the FPN. Only the RPN objectness + regression and the two heads learn.

def freeze_backbone_and_fpn(model):
    # torchvision Mask R-CNN packs the FPN inside `model.backbone` (as
    # `model.backbone.fpn`), so iterating `model.backbone.parameters()` covers
    # both the ResNet feature layers and the FPN lateral/output convs.
    for p in model.backbone.parameters():
        p.requires_grad = False
    return model

custom = freeze_backbone_and_fpn(custom)
trainable = sum(p.numel() for p in custom.parameters() if p.requires_grad)
print(f"trainable after freeze: {trainable:,}")

On datasets of a few hundred images this is often the difference between a model that generalises and one that overfits.
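After freezing, it is tidy to hand the optimizer only the parameters that still require gradients. The sketch below shows the pattern on a toy two-part module so it runs instantly; the `toy` model and the SGD hyperparameters are placeholder assumptions, not tuned values for Mask R-CNN.

```python
import torch
from torch import nn

# Toy stand-in: a "backbone" we freeze and a "head" we train.
toy = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 8, kernel_size=3, padding=1),
    "head": nn.Linear(8, 5),
})
for p in toy["backbone"].parameters():
    p.requires_grad = False

# Only parameters that still require gradients go to the optimizer.
trainable = [p for p in toy.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9, weight_decay=1e-4)

n_total = sum(p.numel() for p in toy.parameters())
n_opt = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
print(f"optimizer sees {n_opt} of {n_total} parameters")
```

With the real model the same list comprehension over `custom.parameters()` applies unchanged.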

Use It

The full training loop for Mask R-CNN in torchvision is 40 lines and does not change meaningfully between tasks — swap datasets and go.

def train_step(model, images, targets, optimizer):
    model.train()
    loss_dict = model(images, targets)
    losses = sum(loss for loss in loss_dict.values())
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()
    return {k: v.item() for k, v in loss_dict.items()}

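The targets list must hold one dict per image with boxes (float tensor, (num_instances, 4), in (x1, y1, x2, y2) pixel coordinates), labels (int64 tensor, (num_instances,)), and masks (uint8 tensor, (num_instances, H, W), binary). The model returns a dict of five losses in training mode and a list of predictions in eval mode; the switch is model.training.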
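A minimal sketch of building one such target dict from per-instance binary masks. The helper `make_target` is hypothetical; it assumes every mask contains at least one foreground pixel and derives each box from the mask extent, adding 1 to the max index so that boxes have positive width and height.

```python
import torch

def make_target(instance_masks, labels):
    """instance_masks: (N, H, W) 0/1 tensor; labels: list of N class ids (1-based)."""
    masks = instance_masks.to(torch.uint8)
    boxes = []
    for m in masks:
        ys, xs = torch.where(m > 0)
        # Tight box around the mask, as (x1, y1, x2, y2) with exclusive max edge.
        boxes.append([xs.min().item(), ys.min().item(),
                      xs.max().item() + 1, ys.max().item() + 1])
    return {
        "boxes": torch.tensor(boxes, dtype=torch.float32).reshape(-1, 4),
        "labels": torch.tensor(labels, dtype=torch.int64),
        "masks": masks,
    }

# Two toy instances on an 8x8 image.
instance_masks = torch.zeros(2, 8, 8)
instance_masks[0, 1:4, 2:6] = 1
instance_masks[1, 5:7, 0:3] = 1

target = make_target(instance_masks, labels=[1, 2])
for key, value in target.items():
    print(key, tuple(value.shape), value.dtype)
```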

The pycocotools evaluator produces mAP@IoU=0.5:0.95 both for boxes and for masks; you need both numbers to know if the box head or the mask head is the bottleneck.
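The two numbers can disagree because mask AP matches predictions to ground truth by mask IoU rather than box IoU. The toy sketch below (the helper `mask_iou` and the triangular masks are assumptions for illustration, not part of pycocotools) shows two shapes whose bounding boxes coincide exactly while their masks barely overlap.

```python
import torch

def mask_iou(a, b):
    """IoU between two (H, W) binary masks."""
    a, b = a.bool(), b.bool()
    inter = (a & b).sum().item()
    union = (a | b).sum().item()
    return inter / union if union else 0.0

# Two triangles that share only the diagonal of an 8x8 square.
lower = torch.ones(8, 8).tril()  # lower-left triangle, diagonal included
upper = torch.ones(8, 8).triu()  # upper-right triangle, diagonal included

# Both masks span the full 8x8 extent, so their bounding boxes are identical
# (box IoU = 1), yet the masks overlap only on the 8 diagonal pixels.
iou_value = mask_iou(lower, upper)
print(f"mask IoU: {iou_value:.3f}")
```

A model in this situation would score well on box AP and poorly on mask AP, which points at the mask head as the bottleneck.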

Ship It

This lesson produces:

Exercises

  1. (Easy) Verify your RoIAlign against torchvision.ops.roi_align on 100 random boxes. Report the max absolute difference. Also run RoIPool (pre-2017 behaviour) and show it diverges by ~1-2 feature-map pixels on boxes near the border.
  2. (Medium) Fine-tune maskrcnn_resnet50_fpn_v2 on a 50-image custom dataset (any two classes: balloons, fish, pothole, logos). Freeze the backbone, train for 20 epochs, report mask AP@0.5.
  3. (Hard) Replace Mask R-CNN's mask head with one that predicts at 56x56 instead of 28x28. Measure mAP@IoU=0.75 before and after. Explain why the gain (or lack of one) matches the expected boundary-precision / memory trade-off.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Mask R-CNN | "Detection plus masks" | Faster R-CNN + a small FCN head that predicts a 28x28 mask per proposal per class |
| FPN | "Feature pyramid" | Top-down + lateral connections that give every stride level C channels of semantic-rich features |
| RPN | "Region proposer" | A small conv head that produces ~1000 object/no-object proposals per image |
| RoIAlign | "No-rounding crop" | Bilinearly samples a fixed-size feature grid from any float-coordinate box |
| RoIPool | "Pre-2017 crop" | Same purpose as RoIAlign but rounds box coordinates; obsolete |
| Mask AP | "Instance mAP" | Average precision computed with mask IoU instead of box IoU; the COCO instance segmentation metric |
| Binary mask head | "Per-class mask" | Predicts one binary mask per class for each proposal; only the predicted class's channel is kept |
| Background class | "Class 0" | The catch-all "no object" class; indices for real classes start at 1 |

Further Reading