Multi-Object Tracking & Video Memory

> Tracking is detection plus association. Detect every frame. Match this frame's detections to last frame's tracks by ID.

Type: Build

Languages: Python

Prerequisites: Phase 4 Lesson 06 (YOLO Detection), Phase 4 Lesson 08 (Mask R-CNN), Phase 4 Lesson 24 (SAM 3)

Time: ~60 minutes

Learning Objectives

The Problem

A detector tells you where the objects are in a single frame. A tracker tells you which detection in frame t is the same object as a detection in frame t-1. Without that, you cannot count objects crossing a line, follow a ball through an occlusion, or know "car #4 has been in the lane for 8 seconds."

Tracking is essential to every video-facing product: sports analytics, surveillance, autonomous driving, medical video analysis, wildlife monitoring, wordmark counting. The core building blocks are shared: a per-frame detector, a motion model (Kalman filter or something richer), an association step (Hungarian algorithm on IoU / cosine / learned features), and a track lifecycle (birth, update, death).

2026 brought two new patterns: SAM 2 memory-based tracking (feature-memory instead of motion-model association) and SAM 3.1 Object Multiplex (shared memory for many instances of the same concept). This lesson walks the classical stack first, then the memory-based approach.

The Concept

Tracking-by-detection

flowchart LR F1["Frame t"] --> DET["Detector"] --> D1["Detections at t"] PREV["Tracks up to t-1"] --> PREDICT["Motion predict
(Kalman)"] PREDICT --> PRED["Predicted tracks at t"] D1 --> ASSOC["Hungarian assignment
(IoU / cosine / motion)"] PRED --> ASSOC ASSOC --> UPDATE["Update matched tracks"] ASSOC --> NEW["Birth new tracks"] ASSOC --> DEAD["Age unmatched tracks; delete after N"] UPDATE --> NEXT["Tracks at t"] NEW --> NEXT DEAD --> NEXT style DET fill:#dbeafe,stroke:#2563eb style ASSOC fill:#fef3c7,stroke:#d97706 style NEXT fill:#dcfce7,stroke:#16a34a

Every tracker you will encounter in 2026 is a variation on this loop. The differences:

Kalman filter in one paragraph

A Kalman filter maintains a per-track state (x, y, w, h, dx, dy, dw, dh) with a covariance. At each frame, predict the state using a constant-velocity model, then update with the matched detection. The update trusts the detection more when the predict uncertainty is high. This gives smooth trajectories and the ability to continue a track through a short occlusion (1-5 frames).

Every classical tracker uses a Kalman filter in the motion-prediction step.

The Hungarian algorithm

Given a M x N cost matrix (tracks x detections), find the one-to-one assignment that minimises total cost. Cost is usually 1 - IoU(track_bbox, detection_bbox) or negative cosine similarity of appearance features. Runtime is O((M+N)^3); for M, N up to ~1000 it is fast enough in Python via scipy.optimize.linear_sum_assignment.

ByteTrack's key idea

Standard trackers drop low-confidence detections (< 0.5). ByteTrack keeps them around as second-stage candidates: after matching tracks to high-confidence detections, unmatched tracks try to match low-confidence detections with a slightly looser IoU threshold. Recovers short occlusions, ID switches near crowds.

SAM 2 memory-based tracking

SAM 2 handles video by keeping a memory bank of per-instance spatio-temporal features. Given a prompt (click, box, text) on one frame, it encodes the instance into memory. On subsequent frames, the memory is cross-attended against the new frame's features, and the decoder produces a mask for the same instance in the new frame.

No Kalman filter, no Hungarian assignment. The association is implicit in the memory-attention operation.

Pros:

Cons:

SAM 3.1 Object Multiplex

Prior SAM 2 / SAM 3 tracking keeps a separate memory bank per instance. For 50 objects, 50 memory banks. Object Multiplex (March 2026) collapses them into one shared memory with per-instance query tokens. Cost scales sub-linearly in number of instances.

Multiplex is the new default for crowd tracking in 2026: concert crowds, warehouse workers, traffic intersections.

Three metrics to know

For surveillance (who is who): IDF1 is what you report. For sports analytics (counting passes): HOTA. For general academic comparison: HOTA.

Build It

Step 1: IoU-based cost matrix

import numpy as np


def bbox_iou(a, b):
    """
    a, b: (N, 4) arrays of [x1, y1, x2, y2].
    Returns (N_a, N_b) IoU matrix.
    """
    ax1, ay1, ax2, ay2 = a[:, 0], a[:, 1], a[:, 2], a[:, 3]
    bx1, by1, bx2, by2 = b[:, 0], b[:, 1], b[:, 2], b[:, 3]
    inter_x1 = np.maximum(ax1[:, None], bx1[None, :])
    inter_y1 = np.maximum(ay1[:, None], by1[None, :])
    inter_x2 = np.minimum(ax2[:, None], bx2[None, :])
    inter_y2 = np.minimum(ay2[:, None], by2[None, :])
    inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.clip(union, 1e-8, None)

Step 2: Minimal SORT-style tracker

Fixed constant-velocity Kalman omitted for brevity — we use a simple IoU association here; in production the Kalman predict is essential. The sort Python package provides the full version.

from scipy.optimize import linear_sum_assignment


class Track:
    def __init__(self, tid, bbox, frame):
        self.id = tid
        self.bbox = bbox
        self.last_frame = frame
        self.hits = 1

    def update(self, bbox, frame):
        self.bbox = bbox
        self.last_frame = frame
        self.hits += 1


class SimpleTracker:
    def __init__(self, iou_threshold=0.3, max_age=5):
        self.tracks = []
        self.next_id = 1
        self.iou_threshold = iou_threshold
        self.max_age = max_age

    def step(self, detections, frame):
        if not self.tracks:
            for d in detections:
                self.tracks.append(Track(self.next_id, d, frame))
                self.next_id += 1
            return [(t.id, t.bbox) for t in self.tracks]

        track_boxes = np.array([t.bbox for t in self.tracks])
        det_boxes = np.array(detections) if len(detections) else np.empty((0, 4))

        iou = bbox_iou(track_boxes, det_boxes) if len(det_boxes) else np.zeros((len(track_boxes), 0))
        cost = 1 - iou
        cost[iou < self.iou_threshold] = 1e6

        matched_track = set()
        matched_det = set()
        if cost.size > 0:
            row, col = linear_sum_assignment(cost)
            for r, c in zip(row, col):
                if cost[r, c] < 1.0:
                    self.tracks[r].update(det_boxes[c], frame)
                    matched_track.add(r); matched_det.add(c)

        for i, d in enumerate(det_boxes):
            if i not in matched_det:
                self.tracks.append(Track(self.next_id, d, frame))
                self.next_id += 1

        self.tracks = [t for t in self.tracks if frame - t.last_frame <= self.max_age]
        return [(t.id, t.bbox) for t in self.tracks]

60 lines. Takes per-frame detections, returns per-frame track IDs. Real systems add the Kalman predict, ByteTrack's second-stage re-match, and appearance features.

Step 3: Synthetic trajectory test

def synthetic_frames(num_frames=20, num_objects=3, H=240, W=320, seed=0):
    rng = np.random.default_rng(seed)
    starts = rng.uniform(20, 200, size=(num_objects, 2))
    velocities = rng.uniform(-5, 5, size=(num_objects, 2))
    frames = []
    for f in range(num_frames):
        dets = []
        for i in range(num_objects):
            cx, cy = starts[i] + f * velocities[i]
            dets.append([cx - 10, cy - 10, cx + 10, cy + 10])
        frames.append(dets)
    return frames


tracker = SimpleTracker()
for f, dets in enumerate(synthetic_frames()):
    tracks = tracker.step(dets, f)

Three objects moving in straight lines should keep their IDs across all 20 frames.

Step 4: ID-switch metric

def count_id_switches(tracks_per_frame, gt_per_frame):
    """
    tracks_per_frame:  list of list of (track_id, bbox)
    gt_per_frame:      list of list of (gt_id, bbox)
    Returns number of ID switches.
    """
    prev_assignment = {}
    switches = 0
    for tracks, gts in zip(tracks_per_frame, gt_per_frame):
        if not tracks or not gts:
            continue
        t_boxes = np.array([b for _, b in tracks])
        g_boxes = np.array([b for _, b in gts])
        iou = bbox_iou(g_boxes, t_boxes)
        for g_idx, (gt_id, _) in enumerate(gts):
            j = iou[g_idx].argmax()
            if iou[g_idx, j] > 0.5:
                t_id = tracks[j][0]
                if gt_id in prev_assignment and prev_assignment[gt_id] != t_id:
                    switches += 1
                prev_assignment[gt_id] = t_id
    return switches

This is a simplified IDF1-adjacent metric: count how many times a ground-truth object changes its assigned predicted track ID. Real MOTA / IDF1 / HOTA tooling lives in py-motmetrics and TrackEval.

Use It

Production trackers in 2026:

Picking:

Ship It

This lesson produces:

Exercises

  1. (Easy) Run the synthetic tracker above with 3, 10, and 30 objects. Report ID-switch count in each case. Identify where the simple IoU-only association starts to fail.
  2. (Medium) Add a constant-velocity Kalman predict step before association. Show that short (2-3 frame) occlusions no longer cause ID switches.
  3. (Hard) Integrate SAM 2's memory-based tracker (via transformers) as an alternative tracker backend. Run both SimpleTracker and SAM 2 on a 30-second clip of a crowd and compare ID-switch counts, manually labelling ground-truth IDs for 5 salient people.

Key Terms

Term What people say What it actually means
Tracking-by-detection "Detect then associate" Per-frame detector + Hungarian assignment on IoU / appearance
Kalman filter "Motion predict" Linear dynamics + covariance for smooth track predictions and occlusion handling
Hungarian algorithm "Optimal assignment" Solves the minimum-cost bipartite matching problem; scipy.optimize.linear_sum_assignment
ByteTrack "Low-confidence second pass" Re-match unmatched tracks to low-confidence detections to recover short occlusions
DeepSORT "SORT + appearance" Adds a ReID feature for cross-frame matching; better for ID preservation
Memory bank "SAM 2 trick" Per-instance spatio-temporal features stored across frames; cross-attention replaces explicit association
Object Multiplex "SAM 3.1 shared memory" Single shared memory with per-instance queries for fast many-object tracking
HOTA "Modern tracking metric" Decomposes into detection and association accuracy; community standard

Further Reading