
Multi-Object Tracking & Video Memory

> Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.

Type: Build

Languages: Python

Prerequisites: Phase 4 Lesson 06 (YOLO Detection), Phase 4 Lesson 08 (Mask R-CNN), Phase 4 Lesson 24 (SAM 3)

Time: ~60 minutes

Learning Objectives

Distinguish tracking-by-detection from query-based tracking and name the algorithm families (SORT, DeepSORT, ByteTrack, BoT-SORT, SAM 2 memory tracker, SAM 3.1 Object Multiplex)
Implement IoU + Hungarian assignment from scratch for classic tracking-by-detection
Explain SAM 2's memory bank and why it handles occlusion better than IoU-based association
Read the three tracking metrics (MOTA, IDF1, HOTA) and pick which one matters for a given use case

The Problem

A detector tells you where the objects are in a single frame. A tracker tells you which detection in frame t is the same object as a detection in frame t-1. Without that, you cannot count objects crossing a line, follow a ball through an occlusion, or know "car #4 has been in the lane for 8 seconds."

Tracking is essential to every video-facing product: sports analytics, surveillance, autonomous driving, medical video analysis, wildlife monitoring, wordmark counting. The core building blocks are shared: a per-frame detector, a motion model (Kalman filter or something richer), an association step (Hungarian algorithm on IoU / cosine / learned features), and a track lifecycle (birth, update, death).

Two newer patterns sit alongside the classical stack: SAM 2 memory-based tracking (feature-memory instead of motion-model association) and SAM 3.1 Object Multiplex (shared memory for many instances of the same concept). This lesson walks the classical stack first, then the memory-based approach.

The Concept

Tracking-by-detection

flowchart LR
    F1["Frame t"] --> DET["Detector"] --> D1["Detections at t"]
    PREV["Tracks up to t-1"] --> PREDICT["Motion predict<br/>(Kalman)"]
    PREDICT --> PRED["Predicted tracks at t"]
    D1 --> ASSOC["Hungarian assignment<br/>(IoU / cosine / motion)"]
    PRED --> ASSOC
    ASSOC --> UPDATE["Update matched tracks"]
    ASSOC --> NEW["Birth new tracks"]
    ASSOC --> DEAD["Age unmatched tracks; delete after N"]
    UPDATE --> NEXT["Tracks at t"]
    NEW --> NEXT
    DEAD --> NEXT
    style DET fill:#dbeafe,stroke:#2563eb
    style ASSOC fill:#fef3c7,stroke:#d97706
    style NEXT fill:#dcfce7,stroke:#16a34a

Every tracker you will encounter in 2026 is a variation on this loop. The differences:

SORT (2016): Kalman filter + IoU Hungarian. Simple, fast, no appearance model.
DeepSORT (2017): SORT + a CNN-based appearance feature per track (ReID embedding). Handles crossings better.
ByteTrack (2021): associates low-confidence detections in a second stage. Needs no appearance features, yet was a top performer on MOT17.
BoT-SORT (2022): ByteTrack + camera motion compensation + ReID.
StrongSORT / OC-SORT: descendants of DeepSORT and SORT respectively, with better appearance and motion handling.

Kalman filter in one paragraph

A Kalman filter maintains a per-track state (x, y, w, h, dx, dy, dw, dh) with a covariance. At each frame, it predicts the state using a constant-velocity model, then updates it with the matched detection. The update trusts the detection more when the prediction's uncertainty is high. This gives smooth trajectories and lets a track coast on prediction alone through a short occlusion (1-5 frames).

Every classical tracker uses a Kalman filter in the motion-prediction step.
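The predict / update cycle described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the filter used by any particular tracker: the class name `ConstantVelocityKalman`, the centre-size box layout, and the noise values `q`, `r`, and the initial covariance are all assumptions chosen for readability.

```python
import numpy as np


class ConstantVelocityKalman:
    """Toy constant-velocity filter over (x, y, w, h); noise values are assumed."""

    def __init__(self, box, q=1e-2, r=1.0):
        # State: [x, y, w, h, dx, dy, dw, dh]; velocities start at zero.
        self.x = np.concatenate([np.asarray(box, dtype=float), np.zeros(4)])
        self.P = np.eye(8) * 10.0      # initial uncertainty (assumed)
        self.F = np.eye(8)
        self.F[:4, 4:] = np.eye(4)     # position += velocity, one frame per step
        self.H = np.eye(4, 8)          # only the box is observed, not its velocity
        self.Q = np.eye(8) * q         # process noise (assumed)
        self.R = np.eye(4) * r         # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, z):
        z = np.asarray(z, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(8) - K @ self.H) @ self.P
        return self.x[:4]


kf = ConstantVelocityKalman([0.0, 0.0, 20.0, 20.0])
for t in range(1, 11):
    kf.predict()
    kf.update([2.0 * t, 0.0, 20.0, 20.0])   # object moves +2 px per frame in x

# Occlusion: three frames with no detection, so only predict is called.
coasted = [kf.predict().copy() for _ in range(3)]
```

During the three occluded frames the box keeps moving at the learned velocity, which is what lets IoU association pick the object up again when it reappears.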

The Hungarian algorithm

Given an M x N cost matrix (tracks x detections), find the one-to-one assignment that minimises total cost. Cost is usually 1 - IoU(track_bbox, detection_bbox) or negative cosine similarity of appearance features. Runtime is O((M+N)^3); for M, N up to ~1000 it is fast enough in Python via scipy.optimize.linear_sum_assignment.
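As a small worked example, the snippet below runs `scipy.optimize.linear_sum_assignment` on a hand-written IoU matrix. The IoU values and the `MIN_IOU` gate of 0.3 are made up for illustration; the point is that the solver always returns a full assignment, so low-overlap pairs have to be filtered afterwards.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical IoU values: rows are 3 tracks, columns are 4 detections.
iou = np.array([
    [0.80, 0.10, 0.00, 0.00],
    [0.20, 0.70, 0.05, 0.00],
    [0.00, 0.00, 0.10, 0.00],
])
cost = 1.0 - iou
rows, cols = linear_sum_assignment(cost)   # minimises the summed cost

MIN_IOU = 0.3   # assumed gate: pairs below this overlap are not real matches
matches = [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= MIN_IOU]

unmatched_tracks = sorted(set(range(iou.shape[0])) - {r for r, _ in matches})
unmatched_dets = sorted(set(range(iou.shape[1])) - {c for _, c in matches})
```

Here tracks 0 and 1 are matched to detections 0 and 1. Track 2 is paired with detection 2 by the solver but rejected by the gate, so it ages, and detections 2 and 3 become candidates for new tracks.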

ByteTrack's key idea

Standard trackers drop low-confidence detections (< 0.5). ByteTrack keeps them as second-stage candidates: after matching tracks to high-confidence detections, the still-unmatched tracks try to match low-confidence detections with a slightly looser IoU threshold. This recovers objects through short occlusions and reduces ID switches in crowds.
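The two-stage idea can be sketched as follows. This is a simplified illustration rather than the ByteTrack reference implementation: the function names and all four thresholds (`high_thresh`, `low_thresh`, `iou_high`, `iou_low`) are assumptions, and the Kalman prediction that would normally produce `track_boxes` is left out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou_matrix(a, b):
    """IoU between every box in a (M, 4) and b (N, 4), boxes as [x1, y1, x2, y2]."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / np.clip(area_a[:, None] + area_b[None, :] - inter, 1e-8, None)


def match(track_boxes, det_boxes, min_iou):
    """Hungarian match on 1 - IoU; returns [(track_idx, det_idx)] with IoU >= min_iou."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return []
    iou = iou_matrix(track_boxes, det_boxes)
    rows, cols = linear_sum_assignment(1.0 - iou)
    return [(int(r), int(c)) for r, c in zip(rows, cols) if iou[r, c] >= min_iou]


def two_stage_associate(track_boxes, det_boxes, scores,
                        high_thresh=0.5, low_thresh=0.1,
                        iou_high=0.3, iou_low=0.2):
    """Match tracks to high-score detections first, then leftovers to low-score ones."""
    scores = np.asarray(scores)
    high = np.flatnonzero(scores >= high_thresh)
    low = np.flatnonzero((scores >= low_thresh) & (scores < high_thresh))

    # Stage 1: all tracks against high-confidence detections.
    first = match(track_boxes, [det_boxes[i] for i in high], iou_high)
    matches = [(t, int(high[d])) for t, d in first]

    # Stage 2: still-unmatched tracks against low-confidence detections.
    matched_tracks = {t for t, _ in matches}
    left = [t for t in range(len(track_boxes)) if t not in matched_tracks]
    second = match([track_boxes[t] for t in left], [det_boxes[i] for i in low], iou_low)
    matches += [(left[t], int(low[d])) for t, d in second]
    return matches


tracks = [[0, 0, 10, 10], [50, 50, 60, 60]]
dets = [[1, 1, 11, 11], [51, 51, 61, 61], [200, 200, 210, 210]]
scores = [0.9, 0.3, 0.05]   # the second object is partly occluded, hence the low score
matches = two_stage_associate(tracks, dets, scores)
```

A tracker that simply discarded everything below 0.5 would leave track 1 unmatched on this frame. The second stage recovers it, while the 0.05 detection is still dropped as noise.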

SAM 2 memory-based tracking

SAM 2 handles video by keeping a memory bank of per-instance spatio-temporal features. Given a prompt (click, box, or mask) on one frame, it encodes the instance into memory. On subsequent frames, the memory is cross-attended against the new frame's features, and the decoder produces a mask for the same instance in the new frame.

No Kalman filter, no Hungarian assignment. The association is implicit in the memory-attention operation.

Pros:

Robust to large occlusions (memory carries instance identity across many frames).
Open-vocabulary when combined with SAM 3's text prompts.
Works without a separate motion model.

Cons:

Slower than ByteTrack for many-object tracking.
Memory bank grows; limits the context window.
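To make "association is implicit in the memory-attention operation" concrete, here is a toy cross-attention read over a bounded memory. It is a cartoon of the idea only and does not reproduce SAM 2's architecture or API: `ToyMemoryBank`, the hand-built feature vectors, and the `max_frames` limit are all hypothetical.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class ToyMemoryBank:
    """FIFO of (keys, values) per past frame for one instance (illustrative only)."""

    def __init__(self, max_frames=6):
        self.entries = deque(maxlen=max_frames)   # oldest frames fall out

    def add(self, keys, values):
        self.entries.append((keys, values))

    def read(self, queries):
        """Cross-attend current-frame queries (Q, C) against all memory tokens."""
        keys = np.concatenate([k for k, _ in self.entries], axis=0)     # (M, C)
        values = np.concatenate([v for _, v in self.entries], axis=0)   # (M, Cv)
        attn = softmax(queries @ keys.T / np.sqrt(keys.shape[1]), axis=-1)
        return attn @ values                                            # (Q, Cv)


C = 8
obj = np.zeros(C); obj[0] = 8.0   # made-up feature for the tracked instance
bg = np.zeros(C); bg[1] = 8.0     # made-up feature for background

bank = ToyMemoryBank(max_frames=6)
for _ in range(8):                # 8 past frames, but only the last 6 are kept
    bank.add(keys=np.stack([obj, bg]), values=np.array([[1.0], [0.0]]))

# New frame with four locations; the instance has moved to location 2.
queries = np.stack([bg, bg, obj, bg])
objectness = bank.read(queries)[:, 0]
predicted_location = int(objectness.argmax())
```

No box overlap is consulted anywhere: the instance is found because its features resemble what is stored in memory, which is why a long occlusion or a large jump does not break the association. The `deque(maxlen=...)` mirrors the second con above, since a bounded memory limits how far back the tracker can remember.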

SAM 3.1 Object Multiplex

Prior SAM 2 / SAM 3 tracking keeps a separate memory bank per instance. For 50 objects, 50 memory banks. Object Multiplex (March 2026) collapses them into one shared memory with per-instance query tokens. Cost scales sub-linearly in number of instances.

Multiplex is the new default for crowd tracking in 2026: concert crowds, warehouse workers, traffic intersections.

Three metrics to know

MOTA (Multi-Object Tracking Accuracy): 1 - (FN + FP + ID switches) / GT. Every error counts equally, so the far more numerous detection errors (FN, FP) usually dominate; a single metric that conflates detection and association failures.
IDF1 (ID F1): harmonic mean of ID precision and recall. Focuses specifically on how well each ground-truth track keeps its ID over time. Better than MOTA for ID-switch-sensitive tasks.
HOTA (Higher Order Tracking Accuracy): decomposes into detection accuracy (DetA) and association accuracy (AssA). The community standard since 2020; most comprehensive.

For surveillance (who is who): IDF1 is what you report. For sports analytics (counting passes): HOTA. For general academic comparison: HOTA.
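The arithmetic behind the three metrics is short enough to write out. The sketch below assumes the error counts (FN, FP, IDTP, and so on) have already been produced by a matching step, which is the hard part that py-motmetrics and TrackEval handle; the input numbers are hypothetical.

```python
def mota(fn, fp, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (fn + fp + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), the harmonic mean of IDP and IDR."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def hota_at_alpha(det_a, ass_a):
    """HOTA at one localisation threshold: geometric mean of DetA and AssA."""
    return (det_a * ass_a) ** 0.5


# Hypothetical counts for one sequence with 1000 ground-truth boxes.
example_mota = mota(fn=100, fp=50, id_switches=10, num_gt=1000)   # 0.84
example_idf1 = idf1(idtp=800, idfp=150, idfn=200)                 # about 0.82
example_hota = hota_at_alpha(det_a=0.64, ass_a=0.49)              # 0.56
```

Note how the 10 ID switches barely move MOTA next to 150 detection errors, which is why IDF1 or HOTA is the better choice when identity matters. The reported HOTA score averages `hota_at_alpha` over a range of localisation thresholds.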

Build It

Step 1: IoU-based cost matrix

import numpy as np


def bbox_iou(a, b):
    """
    a: (N_a, 4) array, b: (N_b, 4) array, boxes as [x1, y1, x2, y2].
    Returns (N_a, N_b) IoU matrix.
    """
    ax1, ay1, ax2, ay2 = a[:, 0], a[:, 1], a[:, 2], a[:, 3]
    bx1, by1, bx2, by2 = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
    inter_x1 = np.maximum(ax1[:, None], bx1[None, :])
    inter_y1 = np.maximum(ay1[:, None], by1[None, :])
    inter_x2 = np.minimum(ax2[:, None], bx2[None, :])
    inter_y2 = np.minimum(ay2[:, None], by2[None, :])
    inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.clip(union, 1e-8, None)

Step 2: Minimal SORT-style tracker

The constant-velocity Kalman filter is omitted for brevity; this tracker associates on raw IoU against each track's last known box. In production the Kalman predict is essential. The sort Python package provides the full version.

from scipy.optimize import linear_sum_assignment


class Track:
    def __init__(self, tid, bbox, frame):
        self.id = tid
        self.bbox = bbox
        self.last_frame = frame
        self.hits = 1

    def update(self, bbox, frame):
        self.bbox = bbox
        self.last_frame = frame
        self.hits += 1


class SimpleTracker:
    def __init__(self, iou_threshold=0.3, max_age=5):
        self.tracks = []
        self.next_id = 1
        self.iou_threshold = iou_threshold
        self.max_age = max_age

    def step(self, detections, frame):
        # Normalise input to an (N, 4) float array, even when there are no detections.
        det_boxes = np.asarray(detections, dtype=float).reshape(-1, 4)

        matched_det = set()
        if self.tracks and len(det_boxes):
            track_boxes = np.array([t.bbox for t in self.tracks])
            iou = bbox_iou(track_boxes, det_boxes)
            cost = 1 - iou
            cost[iou < self.iou_threshold] = 1e6  # gate: forbid low-overlap pairs

            row, col = linear_sum_assignment(cost)
            for r, c in zip(row, col):
                if cost[r, c] < 1.0:  # skip pairs that were gated out
                    self.tracks[r].update(det_boxes[c], frame)
                    matched_det.add(c)

        # Birth: every unmatched detection starts a new track.
        for i, d in enumerate(det_boxes):
            if i not in matched_det:
                self.tracks.append(Track(self.next_id, d, frame))
                self.next_id += 1

        # Death: drop tracks that have gone unmatched for more than max_age frames.
        self.tracks = [t for t in self.tracks if frame - t.last_frame <= self.max_age]
        return [(t.id, t.bbox) for t in self.tracks]

Roughly 50 lines. It takes per-frame detections and returns per-frame track IDs. Real systems add the Kalman predict, ByteTrack's second-stage re-match, and appearance features.

Step 3: Synthetic trajectory test

def synthetic_frames(num_frames=20, num_objects=3, H=240, W=320, seed=0):
    rng = np.random.default_rng(seed)
    starts = rng.uniform(20, 200, size=(num_objects, 2))
    velocities = rng.uniform(-5, 5, size=(num_objects, 2))
    frames = []
    for f in range(num_frames):
        dets = []
        for i in range(num_objects):
            cx, cy = starts[i] + f * velocities[i]
            dets.append([cx - 10, cy - 10, cx + 10, cy + 10])
        frames.append(dets)
    return frames


tracker = SimpleTracker()
for f, dets in enumerate(synthetic_frames()):
    tracks = tracker.step(dets, f)

Three objects moving in straight lines should keep their IDs across all 20 frames.

Step 4: ID-switch metric

def count_id_switches(tracks_per_frame, gt_per_frame):
    """
    tracks_per_frame:  list of list of (track_id, bbox)
    gt_per_frame:      list of list of (gt_id, bbox)
    Returns number of ID switches.
    """
    prev_assignment = {}
    switches = 0
    for tracks, gts in zip(tracks_per_frame, gt_per_frame):
        if not tracks or not gts:
            continue
        t_boxes = np.array([b for _, b in tracks])
        g_boxes = np.array([b for _, b in gts])
        iou = bbox_iou(g_boxes, t_boxes)
        for g_idx, (gt_id, _) in enumerate(gts):
            j = iou[g_idx].argmax()
            if iou[g_idx, j] > 0.5:
                t_id = tracks[j][0]
                if gt_id in prev_assignment and prev_assignment[gt_id] != t_id:
                    switches += 1
                prev_assignment[gt_id] = t_id
    return switches

This is a simplified version of the ID-switch count (the IDSW term in MOTA): it counts how many times a ground-truth object changes its assigned predicted track ID. Real MOTA / IDF1 / HOTA tooling lives in py-motmetrics and TrackEval.

Use It

Production trackers in 2026:

ultralytics: YOLOv8 + ByteTrack / BoT-SORT built-in. results = model.track(source, tracker="bytetrack.yaml"). A sensible default choice; note that when no tracker argument is passed, ultralytics itself uses BoT-SORT.
supervision (Roboflow): ByteTrack wrappers plus annotation utilities.
SAM 2 / SAM 3.1: memory-based tracking via processor.track().
Custom stack: detector (YOLOv8 / RT-DETR) + sort-tracker / OC-SORT / StrongSORT.

Picking:

Pedestrians / cars / boxes at 30+ fps: ByteTrack with ultralytics.
Many instances of one class in a crowd: SAM 3.1 Object Multiplex.
Heavy occlusions with identifiable appearance: DeepSORT / StrongSORT (ReID features).
Sports / complex interactions: BoT-SORT or learned trackers (MOTRv3).

Ship It

This lesson produces:

outputs/prompt-tracker-picker.md — picks SORT / ByteTrack / BoT-SORT / SAM 2 / SAM 3.1 given scene type, occlusion patterns, and latency budget.
outputs/skill-mot-evaluator.md — writes a complete evaluation harness for MOTA / IDF1 / HOTA against ground-truth tracks.

Exercises

(Easy) Run the synthetic tracker above with 3, 10, and 30 objects. Report ID-switch count in each case. Identify where the simple IoU-only association starts to fail.
(Medium) Add a constant-velocity Kalman predict step before association. Show that short (2-3 frame) occlusions no longer cause ID switches.
(Hard) Integrate SAM 2's memory-based tracker (via transformers) as an alternative tracker backend. Run both SimpleTracker and SAM 2 on a 30-second clip of a crowd and compare ID-switch counts, manually labelling ground-truth IDs for 5 salient people.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Tracking-by-detection | "Detect then associate" | Per-frame detector + Hungarian assignment on IoU / appearance |
| Kalman filter | "Motion predict" | Linear dynamics + covariance for smooth track predictions and occlusion handling |
| Hungarian algorithm | "Optimal assignment" | Solves the minimum-cost bipartite matching problem; `scipy.optimize.linear_sum_assignment` |
| ByteTrack | "Low-confidence second pass" | Re-match unmatched tracks to low-confidence detections to recover short occlusions |
| DeepSORT | "SORT + appearance" | Adds a ReID feature for cross-frame matching; better for ID preservation |
| Memory bank | "SAM 2 trick" | Per-instance spatio-temporal features stored across frames; cross-attention replaces explicit association |
| Object Multiplex | "SAM 3.1 shared memory" | Single shared memory with per-instance queries for fast many-object tracking |
| HOTA | "Modern tracking metric" | Decomposes into detection and association accuracy; community standard |