Real-Time Vision — Edge Deployment
> Edge inference is the discipline of getting a 90-accuracy model to run at 30 fps on a device with 2 GB of RAM. Every percentage point of accuracy is traded against milliseconds of latency.
Type: Learn + Build
Languages: Python
Prerequisites: Phase 4 Lesson 04 (Image Classification), Phase 10 Lesson 11 (Quantization)
Time: ~75 minutes
Learning Objectives
- Measure inference latency, peak memory, and throughput for any PyTorch model, and read the FLOPs / params / latency trade-off
- Quantise a vision model to INT8 using PyTorch's post-training quantisation and verify accuracy loss < 1%
- Export to ONNX and compile with ONNX Runtime or TensorRT; name the three most common export failures and their fixes
- Explain when to pick MobileNetV3, EfficientNet-Lite, ConvNeXt-Tiny, or MobileViT for an edge constraint
The Problem
A training-time vision model is a floating-point monster. 100M parameters, 10 GFLOPs per forward pass, 2 GB of VRAM. None of that fits on a phone, a car's infotainment unit, an industrial camera, or a drone. Shipping a vision system means fitting the same predictions into a budget that is 100x smaller.
Three knobs do most of the work: model choice (a smaller architecture with the same recipe), quantisation (INT8 instead of FP32), and the inference runtime (ONNX Runtime, TensorRT, Core ML, TFLite). Getting them right is the difference between a demo that runs on a workstation and a product that ships on a $30 camera module.
This lesson sets up the measurement discipline first (you cannot optimise what you cannot measure), then walks the three knobs. The goal is not to learn every edge runtime but to know what levers exist and how to verify each one does what you think.
The Concept
The three budgets
ms per image"] M --> MEM["Memory
peak MB"] M --> PWR["Power
mJ per inference"] LAT --> SHIP["Ship / no-ship
decision"] MEM --> SHIP PWR --> SHIP style LAT fill:#fecaca,stroke:#dc2626 style MEM fill:#fef3c7,stroke:#d97706 style PWR fill:#dbeafe,stroke:#2563eb
- Latency: p50, p95, p99. Averaging only p50 hides tail behaviour that matters for real-time systems.
- Peak memory: the maximum the device ever sees, not the steady-state average. Matters because OOMs are fatal on embedded targets.
- Power / energy: millijoules per inference on a battery-powered device. Often proxied by CPU/GPU utilisation * time.
A table of (model, latency, memory, accuracy) is what an edge decision is made from. Every cell is measured on the target device, not the workstation.
Measurement discipline
Three rules that every edge profile should follow:
- Warm up the model with 5-10 dummy forward passes before measuring. Cold caches and JIT compilation produce unrepresentative first numbers.
- Synchronise GPU workloads with
torch.cuda.synchronize()before and after the timed block. Without this you measure kernel dispatch, not kernel execution. - Fix input sizes to the production resolution. Latency on 224x224 is not latency on 512x512.
FLOPs as a proxy
FLOPs (floating-point operations per inference) is a cheap, device-independent proxy for latency. Useful for architecture comparison, misleading as absolute wall-clock. A model with 10% more FLOPs can be 2x faster in practice because it uses hardware-friendly ops (depthwise convs compile well, large 7x7 convs do not).
Rule: use FLOPs for architecture search, use on-device latency for deployment decisions.
Quantisation in one paragraph
Replace FP32 weights and activations with INT8. Model size drops 4x, memory bandwidth drops 4x, compute drops 2-4x on hardware that has INT8 kernels (every modern mobile SoC, every NVIDIA GPU with Tensor Cores). Accuracy loss on vision tasks is typically 0.1-1 percentage points with post-training static quantisation.
Types:
- Dynamic — quantise weights to INT8, activations computed in FP. Easy, small speedup.
- Static (post-training) — quantise weights + calibrate activation ranges on a small calibration set. Much faster than dynamic.
- Quantisation-aware training (QAT) — simulate quantisation during training so the model learns around it. Best accuracy, needs labelled data.
For vision, post-training static quantisation gives 95% of the benefit with 5% of the effort. Use QAT only when accuracy loss from PTQ is unacceptable.
Pruning and distillation
- Pruning — remove unimportant weights (magnitude-based) or channels (structured). Works well on overparameterised models; less useful on already-compact architectures.
- Distillation — train a small student to mimic a large teacher's logits. Often recovers most of the accuracy lost by shrinking the model. Standard for production edge models.
The inference runtimes
- PyTorch eager — slow, not for deployment. Use for development only.
- TorchScript — legacy. Superseded by
torch.compileand ONNX export. - ONNX Runtime — the neutral runtime. CPU, CUDA, CoreML, TensorRT, OpenVINO all have ONNX providers. Start here.
- TensorRT — NVIDIA's compiler. Best latency on NVIDIA GPUs (workstation and Jetson). Integrates with ONNX Runtime or standalone.
- Core ML — Apple's runtime for iOS/macOS. Needs
.mlmodelor.mlpackage. - TFLite — Google's runtime for Android/ARM. Needs
.tflite. - OpenVINO — Intel's runtime for CPU/VPU. Needs
.xml+.bin.
In practice: export PyTorch -> ONNX -> pick the runtime for the target. ONNX is the lingua franca.
Edge architecture picker
| Budget | Model | Why |
|---|---|---|
| < 3M params | MobileNetV3-Small | Compiles everywhere, good baseline |
| 3-10M | EfficientNet-Lite-B0 | Best accuracy per param on TFLite |
| 10-20M | ConvNeXt-Tiny | Best accuracy-per-param, CPU-friendly |
| 20-30M | MobileViT-S or EfficientViT | Transformer with ImageNet accuracy |
| 30-80M | Swin-V2-Tiny | If stack supports window attention |
Quantise all of these to INT8 unless you have a specific reason not to.
Build It
Step 1: Measure latency correctly
import time
import torch
def measure_latency(model, input_shape, device="cpu", warmup=10, iters=50):
model = model.to(device).eval()
x = torch.randn(input_shape, device=device)
with torch.no_grad():
for _ in range(warmup):
model(x)
if device == "cuda":
torch.cuda.synchronize()
times = []
for _ in range(iters):
if device == "cuda":
torch.cuda.synchronize()
t0 = time.perf_counter()
model(x)
if device == "cuda":
torch.cuda.synchronize()
times.append((time.perf_counter() - t0) * 1000)
times.sort()
return {
"p50_ms": times[len(times) // 2],
"p95_ms": times[int(len(times) * 0.95)],
"p99_ms": times[int(len(times) * 0.99)],
"mean_ms": sum(times) / len(times),
}
Warm up, synchronise, use time.perf_counter(). Report percentiles, not just mean.
Step 2: Parameter and FLOP counts
def parameter_count(model):
return sum(p.numel() for p in model.parameters())
def flops_estimate(model, input_shape):
"""
Rough FLOP count for a conv/linear-only model. For production use `fvcore` or `ptflops`.
"""
total = 0
def conv_hook(m, inp, out):
nonlocal total
c_out, c_in, kh, kw = m.weight.shape
h, w = out.shape[-2:]
total += 2 * c_in * c_out * kh * kw * h * w
def linear_hook(m, inp, out):
nonlocal total
total += 2 * m.in_features * m.out_features
hooks = []
for m in model.modules():
if isinstance(m, torch.nn.Conv2d):
hooks.append(m.register_forward_hook(conv_hook))
elif isinstance(m, torch.nn.Linear):
hooks.append(m.register_forward_hook(linear_hook))
model.eval()
with torch.no_grad():
model(torch.randn(input_shape))
for h in hooks:
h.remove()
return total
For real projects use fvcore.nn.FlopCountAnalysis or ptflops; they handle every module type correctly.
Step 3: Post-training static quantisation
def quantise_ptq(model, calibration_loader, backend="x86"):
import torch.ao.quantization as tq
model = model.eval().cpu()
model.qconfig = tq.get_default_qconfig(backend)
tq.prepare(model, inplace=True)
with torch.no_grad():
for x, _ in calibration_loader:
model(x)
tq.convert(model, inplace=True)
return model
Three steps: configure, prepare (insert observers), calibrate with real data, convert (fuse + quantise). Requires the model to be fused (Conv -> BN -> ReLU -> ConvBnReLU), which torch.ao.quantization.fuse_modules handles.
Step 4: Export to ONNX
def export_onnx(model, sample_input, path="model.onnx"):
model = model.eval()
torch.onnx.export(
model,
sample_input,
path,
input_names=["input"],
output_names=["output"],
dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
opset_version=17,
)
return path
opset_version=17 is the safe default in 2026. dynamic_axes lets you run the ONNX model with arbitrary batch size.
Step 5: Benchmark and compare regimes
import torch.nn as nn
from torchvision.models import mobilenet_v3_small
def compare_regimes():
model = mobilenet_v3_small(weights=None, num_classes=10)
params = parameter_count(model)
flops = flops_estimate(model, (1, 3, 224, 224))
lat_fp32 = measure_latency(model, (1, 3, 224, 224), device="cpu")
print(f"FP32 MobileNetV3-Small: {params:,} params {flops/1e9:.2f} GFLOPs "
f"p50={lat_fp32['p50_ms']:.2f}ms p95={lat_fp32['p95_ms']:.2f}ms")
Run the same function for resnet50, efficientnet_v2_s, and convnext_tiny and you have the comparison table you need for a deployment decision.
Use It
Production stacks converge on one of three paths:
- Web / serverless: PyTorch -> ONNX -> ONNX Runtime (CPU or CUDA provider). Easiest, good enough for most.
- NVIDIA edge (Jetson, GPU server): PyTorch -> ONNX -> TensorRT. Best latency, biggest engineering effort.
- Mobile: PyTorch -> ONNX -> Core ML (iOS) or TFLite (Android). Quantise before export.
For measurement, torch-tb-profiler, nvprof / nsys, and Instruments on macOS give layer-by-layer breakdowns. benchmark_app (OpenVINO) and trtexec (TensorRT) give standalone CLI numbers.
Ship It
This lesson produces:
outputs/prompt-edge-deployment-planner.md— a prompt that picks backbone, quantisation strategy, and runtime given target device and latency SLA.outputs/skill-latency-profiler.md— a skill that writes a complete latency-benchmarking script with warmup, synchronisation, percentiles, and memory tracking.
Exercises
- (Easy) Measure p50 latency for
resnet18,mobilenet_v3_small,efficientnet_v2_s, andconvnext_tinyat 224x224 on CPU. Report the table and identify which architecture has the best accuracy-per-ms. - (Medium) Apply post-training static quantisation to
mobilenet_v3_small. Report FP32 vs INT8 latency and accuracy loss on a held-out subset of CIFAR-10 or similar. - (Hard) Export
convnext_tinyto ONNX, run it throughonnxruntimewith theCPUExecutionProvider, and compare latency to the PyTorch eager baseline. Identify the first layer where ONNX Runtime is faster and explain why.
Key Terms
| Term | What people say | What it actually means |
|---|---|---|
| Latency | "How fast" | Time from input to output; p50/p95/p99 percentiles, not mean |
| FLOPs | "Model size" | Floating-point ops per forward pass; rough proxy for compute cost |
| INT8 quantisation | "8-bit" | Replace FP32 weights/activations with 8-bit integers; ~4x smaller, 2-4x faster |
| PTQ | "Post-training quantisation" | Quantise a trained model without retraining; easy, usually enough |
| QAT | "Quantisation-aware training" | Simulate quantisation during training; best accuracy, requires labelled data |
| ONNX | "The neutral format" | Model exchange format supported by every mainstream inference runtime |
| TensorRT | "NVIDIA compiler" | Compiles ONNX into an optimised engine for NVIDIA GPUs |
| Distillation | "Teacher -> student" | Train a small model to mimic a big model's logits; recovers most lost accuracy |
Further Reading
- EfficientNet (Tan & Le, 2019) — compound scaling for efficient architectures
- MobileNetV3 (Howard et al., 2019) — mobile-first architecture with h-swish and squeeze-excite
- A Practical Guide to TensorRT Optimization (NVIDIA) — how to actually get the throughput numbers in the paper
- ONNX Runtime docs — quantisation, graph optimisation, provider selection