Keypoint Detection & Pose Estimation

> A pose is a set of ordered keypoints. A keypoint detector is a heatmap regressor. Everything else is bookkeeping.

Type: Build

Languages: Python

Prerequisites: Phase 4 Lesson 06 (Detection), Phase 4 Lesson 07 (U-Net)

Time: ~45 minutes

Learning Objectives

Distinguish top-down and bottom-up pose estimation and state when each is used
Regress heatmaps for K keypoints with a Gaussian-per-keypoint target and extract keypoint coordinates at inference
Explain Part Affinity Fields (PAFs) and how bottom-up pipelines associate keypoints into instances
Use MediaPipe Pose or MMPose for production keypoint estimation and understand their output format

The Problem

Keypoint tasks hide under many names: human pose (17 body joints), face landmarks (68 or 478 points), hand (21 points), animal pose, robotic object pose, medical anatomy landmarks. Every one of them shares the same structure: detect K discrete points on an object and output their (x, y) coordinates.

Pose estimation is the foundation of motion capture, fitness apps, sports analytics, gesture control, animation, AR try-on, and robotic grasping. The 2D case is mature; 3D pose (estimating joint positions in world coordinates from a single camera) is the current research frontier.

The engineering question is scale. A single-image, single-person pose is a 20ms problem. Multi-person pose in a crowd at 30 fps is a different problem with different architectures.

The Concept

Top-down vs bottom-up

flowchart LR
    subgraph TD["Top-down pipeline"]
        A1["Detect person boxes"] --> A2["Crop each box"]
        A2 --> A3["Per-box keypoint model<br/>(HRNet, ViTPose)"]
    end
    subgraph BU["Bottom-up pipeline"]
        B1["One pass over image"] --> B2["All keypoint heatmaps<br/>+ association field"]
        B2 --> B3["Group keypoints into<br/>instances (greedy matching)"]
    end
    style TD fill:#dbeafe,stroke:#2563eb
    style BU fill:#fef3c7,stroke:#d97706

Top-down — detect people first, then run a per-person keypoint model on each crop. Highest accuracy; scales linearly with number of people.
Bottom-up — one forward pass predicts all keypoints plus an association field; group them. Constant time regardless of crowd size.

Top-down (HRNet, ViTPose) is the accuracy leader; bottom-up (OpenPose, HigherHRNet) is the throughput leader for crowded scenes.
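Most of the top-down "bookkeeping" is mapping keypoints predicted inside a resized crop back to full-image coordinates. A minimal sketch, assuming boxes are given as (x0, y0, x1, y1) and the crop is a plain resize of the box; the helper name `crop_to_image_coords` and the 64 x 48 crop size are illustrative choices, not part of any library:

```python
import numpy as np

def crop_to_image_coords(kps_crop, box, crop_hw):
    """Map (K, 2) keypoints in crop pixels (x, y) back to image pixels.

    box:     (x0, y0, x1, y1) person box in image pixels (assumed format)
    crop_hw: (height, width) the crop was resized to before the keypoint model
    """
    x0, y0, x1, y1 = box
    crop_h, crop_w = crop_hw
    # Undo the resize, then undo the crop offset.
    scale = np.array([(x1 - x0) / crop_w, (y1 - y0) / crop_h], dtype=np.float32)
    offset = np.array([x0, y0], dtype=np.float32)
    return np.asarray(kps_crop, dtype=np.float32) * scale + offset

# One person box (96 wide, 128 tall) resized to a 64 x 48 (H x W) crop.
box = (100, 50, 196, 178)
kps_crop = [[24.0, 32.0], [0.0, 0.0]]
kps_img = crop_to_image_coords(kps_crop, box, crop_hw=(64, 48))
print(kps_img)
```

This step runs once per detected person, which is where the linear scaling of top-down pipelines comes from.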

Heatmap regression

Instead of regressing (x, y) directly, predict an H x W heatmap per keypoint with a Gaussian blob centred at the true location.

target[k, y, x] = exp(-((x - cx_k)^2 + (y - cy_k)^2) / (2 sigma^2))

At inference, the argmax of each heatmap is the predicted keypoint location.

Why heatmaps work better than direct regression: the network's spatial structure (a conv feature map) aligns naturally with a spatial output. Gaussian targets also regularise: a prediction that lands a pixel or two off still overlaps most of the blob and incurs only a small loss, whereas a one-hot target would penalise every miss equally.

Sub-pixel localisation

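Argmax gives integer coordinates. For sub-pixel precision, refine by fitting a parabola to the argmax and its neighbours, or use the well-known quarter-pixel offset, which shifts the peak 0.25 px toward the higher neighbour on each axis: dx = 0.25 * sign(heatmap[y, x+1] - heatmap[y, x-1]), and likewise dy from the vertical neighbours.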
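The quarter-pixel offset can be sketched in a few lines of NumPy. This reads the rule as a 0.25 px step toward the higher neighbour (the sign of the neighbour difference), which is an assumption about the intended formula; `refine_quarter_pixel` is a hypothetical helper name:

```python
import numpy as np

def refine_quarter_pixel(heatmap):
    """Return (x, y) of the peak of one (H, W) heatmap, nudged by a quarter pixel."""
    H, W = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fx, fy = float(x), float(y)
    # Only refine when both neighbours exist (peak not on the border).
    if 0 < x < W - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < H - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return float(fx), float(fy)

# Gaussian blob whose true centre (20.3, 11.6) lies between pixel centres.
yy, xx = np.meshgrid(np.arange(32), np.arange(32), indexing="ij")
hm = np.exp(-((xx - 20.3) ** 2 + (yy - 11.6) ** 2) / (2 * 2.0 ** 2))
print(refine_quarter_pixel(hm))  # (20.25, 11.75)
```

The integer argmax here is (20, 12); the refined estimate moves toward the true centre on both axes.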

Part Affinity Fields (PAFs)

OpenPose's trick for bottom-up association. For each pair of connected keypoints (e.g. left shoulder to left elbow), predict a 2-channel field that encodes the unit vector pointing from one to the other. To associate a shoulder with its elbow, integrate the PAF along the line connecting candidate pairs; the pair with the highest integral is matched.

For each connection (limb):
  PAF channels: 2 (unit vector x, y)
  Line integral: sum over sample points of (PAF . line_direction)
  Higher integral = stronger match

Elegant and scales to arbitrary crowd sizes without per-person crops.
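The line integral above can be sketched with nearest-pixel sampling. This is a toy illustration under stated assumptions, not OpenPose's implementation: `paf_score` is a hypothetical name, the field is hand-built, and the score is averaged rather than summed so that limbs of different lengths are comparable.

```python
import numpy as np

def paf_score(paf, p_from, p_to, num_samples=10):
    """Average alignment between a (2, H, W) PAF and the segment p_from -> p_to.

    paf[0] holds the x component, paf[1] the y component; points are (x, y).
    """
    p_from = np.asarray(p_from, dtype=np.float32)
    p_to = np.asarray(p_to, dtype=np.float32)
    vec = p_to - p_from
    length = np.linalg.norm(vec)
    if length < 1e-6:
        return 0.0
    direction = vec / length
    # Sample evenly along the segment and read the field at the nearest pixel.
    ts = np.linspace(0.0, 1.0, num_samples)
    pts = p_from[None, :] + ts[:, None] * vec[None, :]
    xs = np.clip(np.rint(pts[:, 0]).astype(int), 0, paf.shape[2] - 1)
    ys = np.clip(np.rint(pts[:, 1]).astype(int), 0, paf.shape[1] - 1)
    field = np.stack([paf[0, ys, xs], paf[1, ys, xs]], axis=1)  # (S, 2)
    return float((field @ direction).mean())

# Toy field: one horizontal limb from (5, 10) to (25, 10), 3 pixels thick.
paf = np.zeros((2, 32, 32), dtype=np.float32)
paf[0, 9:12, 5:26] = 1.0  # unit vector (1, 0) along the limb

shoulder = (5, 10)
elbows = [(25, 10), (25, 25)]  # the first candidate is the true partner
scores = [paf_score(paf, shoulder, e) for e in elbows]
best = int(np.argmax(scores))
print(scores, best)
```

The true elbow scores close to 1 because every sample lies on the limb and points along it; the distractor leaves the limb after the first sample and scores near 0.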

COCO keypoints

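The standard body-pose dataset: 17 keypoints per person, with OKS (Object Keypoint Similarity) as the primary metric. OKS is the keypoint analogue of IoU, and COCO keypoint AP is averaged over OKS thresholds. PCK (Percentage of Correct Keypoints) is a simpler threshold-based metric that is more common on other benchmarks.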
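For a single instance, OKS averages a Gaussian similarity exp(-d_i^2 / (2 s^2 k_i^2)) over the labelled keypoints, where d_i is the pixel distance, s^2 the object area, and k_i a per-keypoint constant. A minimal sketch; the function name `oks` and the constant k = 0.1 used in the example are placeholders, not the official COCO values:

```python
import numpy as np

def oks(pred, gt, visible, area, k):
    """Object Keypoint Similarity for one instance.

    pred, gt: (K, 2) keypoints in pixels
    visible:  (K,) bool, True where the ground-truth keypoint is labelled
    area:     object area in pixels^2 (the scale term s^2)
    k:        (K,) per-keypoint falloff constants (placeholder values below)
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    visible, k = np.asarray(visible, bool), np.asarray(k, float)
    if not visible.any():
        return 0.0
    d2 = ((pred - gt) ** 2).sum(axis=1)
    sim = np.exp(-d2 / (2.0 * area * k ** 2))
    # Unlabelled keypoints are excluded from the average.
    return float(sim[visible].mean())

gt = np.array([[10, 10], [30, 10], [10, 40], [30, 40]], dtype=float)
vis = np.array([True, True, True, False])
k = np.full(4, 0.1)      # hypothetical constants for a 4-keypoint object
area = 64.0 * 64.0

perfect = oks(gt, gt, vis, area, k)
near = oks(gt + [3.0, 0.0], gt, vis, area, k)
far = oks(gt + [30.0, 0.0], gt, vis, area, k)
print(perfect, near, far)
```

Like IoU, the score is 1 for a perfect match and decays toward 0 as predictions drift, with the tolerance growing with object size.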

2D vs 3D

2D pose — image coordinates; solved at production quality (MediaPipe, HRNet, ViTPose).
3D pose — world / camera coordinates; still active research. Common approaches:

- Lift 2D predictions to 3D with a small lifting network (VideoPose3D, MHFormer).

- Direct 3D regression from image (PyMAF).

- Multi-view setups (CMU Panoptic) for ground truth.

Build It

Step 1: Gaussian heatmap target

import numpy as np
import torch

def gaussian_heatmap(size, cx, cy, sigma=2.0):
    yy, xx = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2)).astype(np.float32)

hm = gaussian_heatmap(64, 32, 32, sigma=2.0)
print(f"peak: {hm.max():.3f} at ({hm.argmax() % 64}, {hm.argmax() // 64})")

Per-keypoint heatmaps stacked along a channel axis give the full target tensor.
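Stacking can be sketched as follows. The `visible` flag and the choice to leave unlabelled channels at zero are assumptions added for illustration (one common convention, not something this lesson prescribes), and `make_target` is a hypothetical name:

```python
import numpy as np

def make_target(keypoints, visible, size, sigma=2.0):
    """Stack one Gaussian heatmap per keypoint into a (K, size, size) target.

    keypoints: (K, 2) array of (x, y); visible: (K,) bool.
    Channels of unlabelled keypoints stay all-zero (assumed convention).
    """
    yy, xx = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    target = np.zeros((len(keypoints), size, size), dtype=np.float32)
    for k, ((cx, cy), vis) in enumerate(zip(keypoints, visible)):
        if vis:
            target[k] = np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
    return target

kps = np.array([[10, 20], [40, 12], [33, 50]])
vis = np.array([True, True, False])
target = make_target(kps, vis, size=64)
print(target.shape)  # (3, 64, 64)
```

Channel order is what makes the keypoints "ordered": channel k always refers to the same landmark.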

Step 2: Tiny keypoint head

A small encoder-decoder in the spirit of U-Net (without skip connections, to keep it tiny) that outputs K heatmap channels.

import torch.nn as nn
import torch.nn.functional as F

class TinyKeypointNet(nn.Module):
    def __init__(self, num_keypoints=4, base=16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(3, base, 3, 2, 1), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, 2, 1), nn.ReLU(inplace=True))
        self.mid = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, 1, 1), nn.ReLU(inplace=True))
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, 2)
        self.up2 = nn.ConvTranspose2d(base, num_keypoints, 2, 2)

    def forward(self, x):
        h1 = self.down1(x)
        h2 = self.down2(h1)
        h3 = self.mid(h2)
        u1 = self.up1(h3)
        return self.up2(u1)

Input (N, 3, H, W), output (N, K, H, W) when H and W are divisible by 4. Loss is per-pixel MSE against Gaussian targets.

Step 3: Inference — extract keypoint coordinates

def heatmap_to_coords(heatmaps):
    """
    heatmaps: (N, K, H, W)
    returns:  (N, K, 2) float (x, y) coordinates in heatmap pixels
    """
    N, K, H, W = heatmaps.shape
    hm = heatmaps.reshape(N, K, -1)
    idx = hm.argmax(dim=-1)
    ys = (idx // W).float()
    xs = (idx % W).float()
    return torch.stack([xs, ys], dim=-1)

coords = heatmap_to_coords(torch.randn(2, 4, 32, 32))
print(f"coords: {coords.shape}")  # (2, 4, 2)

Decoding is a single argmax per heatmap. For sub-pixel refinement, interpolate around the argmax.

Step 4: Synthetic keypoint dataset

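Simple: draw four differently coloured squares on a white canvas and learn to predict them. The colours matter: keypoints are ordered, so each heatmap channel must be tied to a visually distinct mark, otherwise the network cannot tell which square belongs to which channel.

# One distinct colour per keypoint channel (red, green, blue, black).
KP_COLOURS = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=np.float32)

def make_synthetic_sample(size=64, rng=None):
    rng = rng or np.random.default_rng()
    img = np.ones((3, size, size), dtype=np.float32)
    kps = rng.integers(8, size - 8, size=(4, 2))
    for (cx, cy), colour in zip(kps, KP_COLOURS):
        # 5 x 5 square centred exactly on (cx, cy)
        img[:, cy - 2:cy + 3, cx - 2:cx + 3] = colour[:, None, None]
    hms = np.stack([gaussian_heatmap(size, cx, cy) for cx, cy in kps])
    return img, hms, kps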

Easy enough for a tiny model to learn in a minute.

Step 5: Training

model = TinyKeypointNet(num_keypoints=4)
opt = torch.optim.Adam(model.parameters(), lr=3e-3)

for step in range(200):
    batch = [make_synthetic_sample() for _ in range(16)]
    imgs = torch.from_numpy(np.stack([b[0] for b in batch]))
    hms = torch.from_numpy(np.stack([b[1] for b in batch]))
    pred = model(imgs)
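    # The model already outputs full resolution when H and W are divisible by 4;
    # this resize only guards against a size mismatch for other input sizes.
    if pred.shape[-2:] != hms.shape[-2:]:
        pred = F.interpolate(pred, size=hms.shape[-2:], mode="bilinear", align_corners=False)
    loss = F.mse_loss(pred, hms)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  loss {loss.item():.4f}")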
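To check that training worked, decode the predicted heatmaps and measure the pixel distance to the true keypoints. A NumPy-only sketch with hypothetical helper names (`decode_argmax`, `mean_l2_error`); it assumes predictions have been converted to an array first, for example with `pred.detach().numpy()`:

```python
import numpy as np

def decode_argmax(heatmaps):
    """(N, K, H, W) heatmaps -> (N, K, 2) integer-valued (x, y) peaks."""
    N, K, H, W = heatmaps.shape
    idx = heatmaps.reshape(N, K, -1).argmax(axis=-1)
    return np.stack([idx % W, idx // W], axis=-1).astype(np.float32)

def mean_l2_error(pred_heatmaps, true_kps):
    """Mean Euclidean distance in pixels between decoded and true keypoints."""
    pred = decode_argmax(pred_heatmaps)
    diff = pred - np.asarray(true_kps, dtype=np.float32)
    return float(np.linalg.norm(diff, axis=-1).mean())

# Sanity check with hand-made heatmaps: one peak exact, one off by (3, 4).
hms = np.zeros((1, 2, 16, 16), dtype=np.float32)
hms[0, 0, 5, 7] = 1.0  # peak at x=7, y=5
hms[0, 1, 8, 6] = 1.0  # peak at x=6, y=8
true_kps = np.array([[[7, 5], [3, 4]]])
print(mean_l2_error(hms, true_kps))  # 2.5
```

Testing the metric on hand-made inputs first makes it easier to trust the number it reports for the trained model.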

Use It

MediaPipe Pose — Google's production pose estimator; ships WebGL + mobile runtimes with sub-10ms latency.
MMPose (OpenMMLab) — comprehensive research codebase; every SOTA architecture with pretrained weights.
YOLOv8-pose — fastest real-time multi-person pose with a single forward pass.
transformers HumanDPT / PoseAnything — newer vision-language approaches for open-vocabulary pose (any object, any keypoint set).

Ship It

This lesson produces:

outputs/prompt-pose-stack-picker.md — a prompt that picks MediaPipe / YOLOv8-pose / HRNet / ViTPose given latency, crowd size, and 2D vs 3D need.
outputs/skill-heatmap-to-coords.md — a skill that writes the sub-pixel heatmap-to-coordinate routine used by every production pose model.

Exercises

(Easy) Train the tiny keypoint model on the synthetic 4-point dataset. Report mean L2 error between predicted and true keypoints after 200 steps.
(Medium) Add sub-pixel refinement: given the argmax position, fit a 1D parabola along x and y from the neighbouring pixels. Report the accuracy gain vs integer argmax.
(Hard) Build a 2-person synthetic dataset where each image shows two instances of the 4-keypoint pattern. Train a bottom-up pipeline with PAFs that predict which keypoint belongs to which instance, and evaluate OKS.

Key Terms

Term	What people say	What it actually means
Keypoint	"A landmark"	A specific ordered point on an object (joint, corner, feature)
Pose	"The skeleton"	An ordered set of keypoints belonging to one instance
Top-down	"Detect then pose"	Two-stage pipeline: person detector + per-crop keypoint model; highest accuracy
Bottom-up	"Pose first, group later"	Single-pass all-keypoint prediction + grouping; constant time in crowd size
Heatmap	"Gaussian target"	H x W tensor per keypoint with peak at the true location; the preferred regression target
PAF	"Part Affinity Field"	2-channel unit vector field encoding limb directions; used to group keypoints into instances
OKS	"Keypoint IoU"	Object Keypoint Similarity; the COCO metric for pose
HRNet	"High-Resolution Net"	The dominant top-down keypoint architecture; preserves high-res features throughout