Real-Time Vision — Edge Deployment

> Edge inference is the discipline of getting a 90-accuracy model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds of latency.

Type: Learn + Build

Languages: Python

Prerequisites: Phase 4 Lesson 04 (Image Classification), Phase 10 Lesson 11 (Quantization)

Time: ~75 minutes

Learning Objectives

The Problem

A training-time vision model is a floating-point monster. 100M parameters, 10 GFLOPs per forward pass, 2 GB of VRAM. None of that fits on a phone, a car's infotainment unit, an industrial camera, or a drone. Shipping a vision system means fitting the same predictions into a budget that is 100x smaller.

Three knobs do most of the work: model choice (a smaller architecture with the same recipe), quantisation (INT8 instead of FP32), and the inference runtime (ONNX Runtime, TensorRT, Core ML, TFLite). Getting them right is the difference between a demo that runs on a workstation and a product that ships on a $30 camera module.

This lesson sets up the measurement discipline first (you cannot optimise what you cannot measure), then walks the three knobs. The goal is not to learn every edge runtime but to know what levers exist and how to verify each one does what you think.

The Concept

The three budgets

flowchart LR M["Model"] --> LAT["Latency
ms per image"] M --> MEM["Memory
peak MB"] M --> PWR["Power
mJ per inference"] LAT --> SHIP["Ship / no-ship
decision"] MEM --> SHIP PWR --> SHIP style LAT fill:#fecaca,stroke:#dc2626 style MEM fill:#fef3c7,stroke:#d97706 style PWR fill:#dbeafe,stroke:#2563eb

A table of (model, latency, memory, accuracy) is what an edge decision is made from. Every cell is measured on the target device, not the workstation.

Measurement discipline

Three rules that every edge profile should follow:

  1. Warm up the model with 5-10 dummy forward passes before measuring. Cold caches and JIT compilation produce unrepresentative first numbers.
  2. Synchronise GPU workloads with torch.cuda.synchronize() before and after the timed block. Without this you measure kernel dispatch, not kernel execution.
  3. Fix input sizes to the production resolution. Latency on 224x224 is not latency on 512x512.

FLOPs as a proxy

FLOPs (floating-point operations per inference) is a cheap, device-independent proxy for latency. Useful for architecture comparison, misleading as absolute wall-clock. A model with 10% more FLOPs can be 2x faster in practice because it uses hardware-friendly ops (depthwise convs compile well, large 7x7 convs do not).

Rule: use FLOPs for architecture search, use on-device latency for deployment decisions.

Quantisation in one paragraph

Replace FP32 weights and activations with INT8. Model size drops 4x, memory bandwidth drops 4x, compute drops 2-4x on hardware that has INT8 kernels (every modern mobile SoC, every NVIDIA GPU with Tensor Cores). Accuracy loss on vision tasks is typically 0.1-1 percentage points with post-training static quantisation.

Types:

For vision, post-training static quantisation gives 95% of the benefit with 5% of the effort. Use QAT only when accuracy loss from PTQ is unacceptable.

Pruning and distillation

The inference runtimes

In practice: export PyTorch -> ONNX -> pick the runtime for the target. ONNX is the lingua franca.

Edge architecture picker

Budget Model Why
< 3M params MobileNetV3-Small Compiles everywhere, good baseline
3-10M EfficientNet-Lite-B0 Best accuracy per param on TFLite
10-20M ConvNeXt-Tiny Best accuracy-per-param, CPU-friendly
20-30M MobileViT-S or EfficientViT Transformer with ImageNet accuracy
30-80M Swin-V2-Tiny If stack supports window attention

Quantise all of these to INT8 unless you have a specific reason not to.

Build It

Step 1: Measure latency correctly

import time
import torch

def measure_latency(model, input_shape, device="cpu", warmup=10, iters=50):
    model = model.to(device).eval()
    x = torch.randn(input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        times = []
        for _ in range(iters):
            if device == "cuda":
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            model(x)
            if device == "cuda":
                torch.cuda.synchronize()
            times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return {
        "p50_ms": times[len(times) // 2],
        "p95_ms": times[int(len(times) * 0.95)],
        "p99_ms": times[int(len(times) * 0.99)],
        "mean_ms": sum(times) / len(times),
    }

Warm up, synchronise, use time.perf_counter(). Report percentiles, not just mean.

Step 2: Parameter and FLOP counts

def parameter_count(model):
    return sum(p.numel() for p in model.parameters())

def flops_estimate(model, input_shape):
    """
    Rough FLOP count for a conv/linear-only model. For production use `fvcore` or `ptflops`.
    """
    total = 0
    def conv_hook(m, inp, out):
        nonlocal total
        c_out, c_in, kh, kw = m.weight.shape
        h, w = out.shape[-2:]
        total += 2 * c_in * c_out * kh * kw * h * w
    def linear_hook(m, inp, out):
        nonlocal total
        total += 2 * m.in_features * m.out_features
    hooks = []
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            hooks.append(m.register_forward_hook(conv_hook))
        elif isinstance(m, torch.nn.Linear):
            hooks.append(m.register_forward_hook(linear_hook))
    model.eval()
    with torch.no_grad():
        model(torch.randn(input_shape))
    for h in hooks:
        h.remove()
    return total

For real projects use fvcore.nn.FlopCountAnalysis or ptflops; they handle every module type correctly.

Step 3: Post-training static quantisation

def quantise_ptq(model, calibration_loader, backend="x86"):
    import torch.ao.quantization as tq
    model = model.eval().cpu()
    model.qconfig = tq.get_default_qconfig(backend)
    tq.prepare(model, inplace=True)
    with torch.no_grad():
        for x, _ in calibration_loader:
            model(x)
    tq.convert(model, inplace=True)
    return model

Three steps: configure, prepare (insert observers), calibrate with real data, convert (fuse + quantise). Requires the model to be fused (Conv -> BN -> ReLU -> ConvBnReLU), which torch.ao.quantization.fuse_modules handles.

Step 4: Export to ONNX

def export_onnx(model, sample_input, path="model.onnx"):
    model = model.eval()
    torch.onnx.export(
        model,
        sample_input,
        path,
        input_names=["input"],
        output_names=["output"],
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
        opset_version=17,
    )
    return path

opset_version=17 is the safe default in 2026. dynamic_axes lets you run the ONNX model with arbitrary batch size.

Step 5: Benchmark and compare regimes

import torch.nn as nn
from torchvision.models import mobilenet_v3_small

def compare_regimes():
    model = mobilenet_v3_small(weights=None, num_classes=10)
    params = parameter_count(model)
    flops = flops_estimate(model, (1, 3, 224, 224))
    lat_fp32 = measure_latency(model, (1, 3, 224, 224), device="cpu")
    print(f"FP32 MobileNetV3-Small: {params:,} params  {flops/1e9:.2f} GFLOPs  "
          f"p50={lat_fp32['p50_ms']:.2f}ms  p95={lat_fp32['p95_ms']:.2f}ms")

Run the same function for resnet50, efficientnet_v2_s, and convnext_tiny and you have the comparison table you need for a deployment decision.

Use It

Production stacks converge on one of three paths:

For measurement, torch-tb-profiler, nvprof / nsys, and Instruments on macOS give layer-by-layer breakdowns. benchmark_app (OpenVINO) and trtexec (TensorRT) give standalone CLI numbers.

Ship It

This lesson produces:

Exercises

  1. (Easy) Measure p50 latency for resnet18, mobilenet_v3_small, efficientnet_v2_s, and convnext_tiny at 224x224 on CPU. Report the table and identify which architecture has the best accuracy-per-ms.
  2. (Medium) Apply post-training static quantisation to mobilenet_v3_small. Report FP32 vs INT8 latency and accuracy loss on a held-out subset of CIFAR-10 or similar.
  3. (Hard) Export convnext_tiny to ONNX, run it through onnxruntime with the CPUExecutionProvider, and compare latency to the PyTorch eager baseline. Identify the first layer where ONNX Runtime is faster and explain why.

Key Terms

Term What people say What it actually means
Latency "How fast" Time from input to output; p50/p95/p99 percentiles, not mean
FLOPs "Model size" Floating-point ops per forward pass; rough proxy for compute cost
INT8 quantisation "8-bit" Replace FP32 weights/activations with 8-bit integers; ~4x smaller, 2-4x faster
PTQ "Post-training quantisation" Quantise a trained model without retraining; easy, usually enough
QAT "Quantisation-aware training" Simulate quantisation during training; best accuracy, requires labelled data
ONNX "The neutral format" Model exchange format supported by every mainstream inference runtime
TensorRT "NVIDIA compiler" Compiles ONNX into an optimised engine for NVIDIA GPUs
Distillation "Teacher -> student" Train a small model to mimic a big model's logits; recovers most lost accuracy

Further Reading