Multi-Object Tracking & Video Memory
> Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.
Type: Build
Languages: Python
Prerequisites: Phase 4 Lesson 06 (YOLO Detection), Phase 4 Lesson 08 (Mask R-CNN), Phase 4 Lesson 24 (SAM 3)
Time: ~60 minutes
Learning Objectives
- Distinguish tracking-by-detection from query-based tracking and name the algorithm families (SORT, DeepSORT, ByteTrack, BoT-SORT, SAM 2 memory tracker, SAM 3.1 Object Multiplex)
- Implement IoU + Hungarian assignment from scratch for classic tracking-by-detection
- Explain SAM 2's memory bank and why it handles occlusion better than IoU-based association
- Read the three tracking metrics (MOTA, IDF1, HOTA) and pick which one matters for a given use case
The Problem
A detector tells you where the objects are in a single frame. A tracker tells you which detection in frame t is the same object as a detection in frame t-1. Without that, you cannot count objects crossing a line, follow a ball through an occlusion, or know "car #4 has been in the lane for 8 seconds."
Tracking is essential to every video-facing product: sports analytics, surveillance, autonomous driving, medical video analysis, wildlife monitoring, wordmark counting. The core building blocks are shared: a per-frame detector, a motion model (Kalman filter or something richer), an association step (Hungarian algorithm on IoU / cosine / learned features), and a track lifecycle (birth, update, death).
2026 brought two new patterns: SAM 2 memory-based tracking (feature-memory instead of motion-model association) and SAM 3.1 Object Multiplex (shared memory for many instances of the same concept). This lesson walks the classical stack first, then the memory-based approach.
The Concept
Tracking-by-detection
(Kalman)"] PREDICT --> PRED["Predicted tracks at t"] D1 --> ASSOC["Hungarian assignment
(IoU / cosine / motion)"] PRED --> ASSOC ASSOC --> UPDATE["Update matched tracks"] ASSOC --> NEW["Birth new tracks"] ASSOC --> DEAD["Age unmatched tracks; delete after N"] UPDATE --> NEXT["Tracks at t"] NEW --> NEXT DEAD --> NEXT style DET fill:#dbeafe,stroke:#2563eb style ASSOC fill:#fef3c7,stroke:#d97706 style NEXT fill:#dcfce7,stroke:#16a34a
Every tracker you will encounter in 2026 is a variation on this loop. The differences:
- SORT (2016): Kalman filter + IoU Hungarian. Simple, fast, no appearance model.
- DeepSORT (2017): SORT + a CNN-based appearance feature per track (ReID embedding). Handles crossings better.
- ByteTrack (2021): associates low-confidence detections as a second stage; no appearance features needed but top performer on MOT17.
- BoT-SORT (2022): Byte + camera motion compensation + ReID.
- StrongSORT / OC-SORT — ByteTrack descendants with better motion and appearance.
Kalman filter in one paragraph
A Kalman filter maintains a per-track state (x, y, w, h, dx, dy, dw, dh) with a covariance. At each frame, predict the state using a constant-velocity model, then update with the matched detection. The update trusts the detection more when the predict uncertainty is high. This gives smooth trajectories and the ability to continue a track through a short occlusion (1-5 frames).
Every classical tracker uses a Kalman filter in the motion-prediction step.
The Hungarian algorithm
Given a M x N cost matrix (tracks x detections), find the one-to-one assignment that minimises total cost. Cost is usually 1 - IoU(track_bbox, detection_bbox) or negative cosine similarity of appearance features. Runtime is O((M+N)^3); for M, N up to ~1000 it is fast enough in Python via scipy.optimize.linear_sum_assignment.
ByteTrack's key idea
Standard trackers drop low-confidence detections (< 0.5). ByteTrack keeps them around as second-stage candidates: after matching tracks to high-confidence detections, unmatched tracks try to match low-confidence detections with a slightly looser IoU threshold. Recovers short occlusions, ID switches near crowds.
SAM 2 memory-based tracking
SAM 2 handles video by keeping a memory bank of per-instance spatio-temporal features. Given a prompt (click, box, text) on one frame, it encodes the instance into memory. On subsequent frames, the memory is cross-attended against the new frame's features, and the decoder produces a mask for the same instance in the new frame.
No Kalman filter, no Hungarian assignment. The association is implicit in the memory-attention operation.
Pros:
- Robust to large occlusions (memory carries instance identity across many frames).
- Open-vocabulary when combined with SAM 3's text prompts.
- Works without a separate motion model.
Cons:
- Slower than ByteTrack for many-object tracking.
- Memory bank grows; limits the context window.
SAM 3.1 Object Multiplex
Prior SAM 2 / SAM 3 tracking keeps a separate memory bank per instance. For 50 objects, 50 memory banks. Object Multiplex (March 2026) collapses them into one shared memory with per-instance query tokens. Cost scales sub-linearly in number of instances.
Multiplex is the new default for crowd tracking in 2026: concert crowds, warehouse workers, traffic intersections.
Three metrics to know
- MOTA (Multi-Object Tracking Accuracy) — 1 - (FN + FP + ID switches) / GT. Weighted by error type; a single metric that conflates detection and association failures.
- IDF1 (ID F1) — harmonic mean of ID precision and recall. Focuses specifically on how well each ground-truth track keeps its ID over time. Better than MOTA for ID-switch-sensitive tasks.
- HOTA (Higher Order Tracking Accuracy) — decomposes into detection accuracy (DetA) and association accuracy (AssA). The community standard since 2020; most comprehensive.
For surveillance (who is who): IDF1 is what you report. For sports analytics (counting passes): HOTA. For general academic comparison: HOTA.
Build It
Step 1: IoU-based cost matrix
import numpy as np
def bbox_iou(a, b):
"""
a, b: (N, 4) arrays of [x1, y1, x2, y2].
Returns (N_a, N_b) IoU matrix.
"""
ax1, ay1, ax2, ay2 = a[:, 0], a[:, 1], a[:, 2], a[:, 3]
bx1, by1, bx2, by2 = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
inter_x1 = np.maximum(ax1[:, None], bx1[None, :])
inter_y1 = np.maximum(ay1[:, None], by1[None, :])
inter_x2 = np.minimum(ax2[:, None], bx2[None, :])
inter_y2 = np.minimum(ay2[:, None], by2[None, :])
inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)
area_a = (ax2 - ax1) * (ay2 - ay1)
area_b = (bx2 - bx1) * (by2 - by1)
union = area_a[:, None] + area_b[None, :] - inter
return inter / np.clip(union, 1e-8, None)
Step 2: Minimal SORT-style tracker
Fixed constant-velocity Kalman omitted for brevity — we use a simple IoU association here; in production the Kalman predict is essential. The sort Python package provides the full version.
from scipy.optimize import linear_sum_assignment
class Track:
def __init__(self, tid, bbox, frame):
self.id = tid
self.bbox = bbox
self.last_frame = frame
self.hits = 1
def update(self, bbox, frame):
self.bbox = bbox
self.last_frame = frame
self.hits += 1
class SimpleTracker:
def __init__(self, iou_threshold=0.3, max_age=5):
self.tracks = []
self.next_id = 1
self.iou_threshold = iou_threshold
self.max_age = max_age
def step(self, detections, frame):
if not self.tracks:
for d in detections:
self.tracks.append(Track(self.next_id, d, frame))
self.next_id += 1
return [(t.id, t.bbox) for t in self.tracks]
track_boxes = np.array([t.bbox for t in self.tracks])
det_boxes = np.array(detections) if len(detections) else np.empty((0, 4))
iou = bbox_iou(track_boxes, det_boxes) if len(det_boxes) else np.zeros((len(track_boxes), 0))
cost = 1 - iou
cost[iou < self.iou_threshold] = 1e6
matched_track = set()
matched_det = set()
if cost.size > 0:
row, col = linear_sum_assignment(cost)
for r, c in zip(row, col):
if cost[r, c] < 1.0:
self.tracks[r].update(det_boxes[c], frame)
matched_track.add(r); matched_det.add(c)
for i, d in enumerate(det_boxes):
if i not in matched_det:
self.tracks.append(Track(self.next_id, d, frame))
self.next_id += 1
self.tracks = [t for t in self.tracks if frame - t.last_frame <= self.max_age]
return [(t.id, t.bbox) for t in self.tracks]
60 lines. Takes per-frame detections, returns per-frame track IDs. Real systems add the Kalman predict, ByteTrack's second-stage re-match, and appearance features.
Step 3: Synthetic trajectory test
def synthetic_frames(num_frames=20, num_objects=3, H=240, W=320, seed=0):
rng = np.random.default_rng(seed)
starts = rng.uniform(20, 200, size=(num_objects, 2))
velocities = rng.uniform(-5, 5, size=(num_objects, 2))
frames = []
for f in range(num_frames):
dets = []
for i in range(num_objects):
cx, cy = starts[i] + f * velocities[i]
dets.append([cx - 10, cy - 10, cx + 10, cy + 10])
frames.append(dets)
return frames
tracker = SimpleTracker()
for f, dets in enumerate(synthetic_frames()):
tracks = tracker.step(dets, f)
Three objects moving in straight lines should keep their IDs across all 20 frames.
Step 4: ID-switch metric
def count_id_switches(tracks_per_frame, gt_per_frame):
"""
tracks_per_frame: list of list of (track_id, bbox)
gt_per_frame: list of list of (gt_id, bbox)
Returns number of ID switches.
"""
prev_assignment = {}
switches = 0
for tracks, gts in zip(tracks_per_frame, gt_per_frame):
if not tracks or not gts:
continue
t_boxes = np.array([b for _, b in tracks])
g_boxes = np.array([b for _, b in gts])
iou = bbox_iou(g_boxes, t_boxes)
for g_idx, (gt_id, _) in enumerate(gts):
j = iou[g_idx].argmax()
if iou[g_idx, j] > 0.5:
t_id = tracks[j][0]
if gt_id in prev_assignment and prev_assignment[gt_id] != t_id:
switches += 1
prev_assignment[gt_id] = t_id
return switches
This is a simplified IDF1-adjacent metric: count how many times a ground-truth object changes its assigned predicted track ID. Real MOTA / IDF1 / HOTA tooling lives in py-motmetrics and TrackEval.
Use It
Production trackers in 2026:
ultralytics— YOLOv8 + ByteTrack / BoT-SORT built-in.results = model.track(source, tracker="bytetrack.yaml"). The default.supervision(Roboflow) — ByteTrack wrappers plus annotation utilities.- SAM 2 / SAM 3.1 — memory-based tracking via
processor.track(). - Custom stack: detector (YOLOv8 / RT-DETR) +
sort-tracker/OC-SORT/StrongSORT.
Picking:
- Pedestrians / cars / boxes at 30+ fps: ByteTrack with ultralytics.
- Many instances of one class in a crowd: SAM 3.1 Object Multiplex.
- Heavy occlusions with identifiable appearance: DeepSORT / StrongSORT (ReID features).
- Sports / complex interactions: BoT-SORT or learned trackers (MOTRv3).
Ship It
This lesson produces:
outputs/prompt-tracker-picker.md— picks SORT / ByteTrack / BoT-SORT / SAM 2 / SAM 3.1 given scene type, occlusion patterns, and latency budget.outputs/skill-mot-evaluator.md— writes a complete evaluation harness for MOTA / IDF1 / HOTA against ground-truth tracks.
Exercises
- (Easy) Run the synthetic tracker above with 3, 10, and 30 objects. Report ID-switch count in each case. Identify where the simple IoU-only association starts to fail.
- (Medium) Add a constant-velocity Kalman predict step before association. Show that short (2-3 frame) occlusions no longer cause ID switches.
- (Hard) Integrate SAM 2's memory-based tracker (via
transformers) as an alternative tracker backend. Run both SimpleTracker and SAM 2 on a 30-second clip of a crowd and compare ID-switch counts, manually labelling ground-truth IDs for 5 salient people.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Tracking-by-detection | "Detect then associate" | Per-frame detector + Hungarian assignment on IoU / appearance |
| Kalman filter | "Motion predict" | Linear dynamics + covariance for smooth track predictions and occlusion handling |
| Hungarian algorithm | "Optimal assignment" | Solves the minimum-cost bipartite matching problem; scipy.optimize.linear_sum_assignment |
| ByteTrack | "Low-confidence second pass" | Re-match unmatched tracks to low-confidence detections to recover short occlusions |
| DeepSORT | "SORT + appearance" | Adds a ReID feature for cross-frame matching; better for ID preservation |
| Memory bank | "SAM 2 trick" | Per-instance spatio-temporal features stored across frames; cross-attention replaces explicit association |
| Object Multiplex | "SAM 3.1 shared memory" | Single shared memory with per-instance queries for fast many-object tracking |
| HOTA | "Modern tracking metric" | Decomposes into detection and association accuracy; community standard |
Further Reading
- SORT (Bewley et al., 2016) — the minimal tracking-by-detection paper
- DeepSORT (Wojke et al., 2017) — adds appearance feature
- ByteTrack (Zhang et al., 2022) — low-confidence second pass
- BoT-SORT (Aharon et al., 2022) — camera motion compensation
- HOTA (Luiten et al., 2020) — decomposed tracking metric
- SAM 2 video segmentation (Meta, 2024) — memory-based tracker
- SAM 3.1 Object Multiplex (Meta, March 2026)