Monocular Depth & Geometry Estimation

> A depth map is a single-channel image where each pixel is a distance from the camera. Predicting it from one RGB frame used to be impossible without stereo or LiDAR. In 2026 a frozen ViT encoder plus a lightweight head gets within a few percent of ground truth.

Type: Build + Use

Languages: Python

Prerequisites: Phase 4 Lesson 14 (ViT), Phase 4 Lesson 17 (Self-Supervised Vision), Phase 4 Lesson 07 (U-Net)

Time: ~60 minutes

Learning Objectives

The Problem

Depth is the missing axis in 2D computer vision. Given RGB, you know where things appear in the image plane; you do not know how far they are. Depth sensors (stereo rigs, LiDAR, time-of-flight) solve this directly but are expensive, fragile, and limited in range.

Monocular depth estimation — predicting depth from a single RGB frame — used to produce blurry, unreliable output. By 2026 large pretrained encoders changed that: Depth Anything V3 uses a frozen DINOv2 backbone and produces depth maps that generalise across indoor, outdoor, medical, and satellite domains. Marigold reframes depth as a conditional diffusion problem. ZoeDepth regresses true metric distances.

Depth is also the bridge between 2D detection and 3D understanding: back-project a detected box's pixels through the camera intrinsics using their depth values and you lift the 2D object into a 3D point cloud. That is the core of every AR occlusion system, every obstacle-avoidance pipeline, and every "pick up the cup" robot.

The Concept

Relative vs metric depth

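Relative depth gives ordered z-values without real-world units: you know what is nearer and what is farther, but not by how many metres. Metric depth gives distances in metres. MiDaS, Depth Anything V3, and Marigold produce relative depth. ZoeDepth, UniDepth, and Metric3D produce metric depth. Metric models are sensitive to camera intrinsics; relative models are not.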

The encoder-decoder pattern

flowchart LR
    IMG["Image (H x W x 3)"] --> ENC["Frozen ViT encoder<br/>(DINOv2 / DINOv3)"]
    ENC --> FEATS["Dense features<br/>(H/14, W/14, d)"]
    FEATS --> DEC["Depth decoder<br/>(conv upsampler,<br/>DPT-style)"]
    DEC --> DEPTH["Depth map<br/>(H, W, 1)"]
    style ENC fill:#dbeafe,stroke:#2563eb
    style DEC fill:#fef3c7,stroke:#d97706
    style DEPTH fill:#dcfce7,stroke:#16a34a

Depth Anything V3 freezes the encoder and trains only the DPT-style decoder. The encoder provides rich features; the decoder interpolates them back to image resolution and regresses depth.
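The frozen-encoder, trainable-decoder setup can be sketched in a few lines of PyTorch. This is a toy illustration, not Depth Anything's actual architecture: `ToyPatchEncoder` is a hypothetical stand-in for a pretrained ViT (a single strided convolution that yields an H/14 x W/14 feature grid), and `ToyDepthHead` is a simplified convolutional upsampler rather than a real DPT decoder. Note that PyTorch uses channel-first tensors, so the output is (B, 1, H, W) rather than (H, W, 1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained ViT: one 14x14 patch embedding."""

    def __init__(self, dim=64, patch=14):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x)  # (B, dim, H/14, W/14)


class ToyDepthHead(nn.Module):
    """Simplified conv decoder: refine features, upsample, predict one channel."""

    def __init__(self, dim=64):
        super().__init__()
        self.refine = nn.Conv2d(dim, 32, kernel_size=3, padding=1)
        self.out = nn.Conv2d(32, 1, kernel_size=3, padding=1)

    def forward(self, feats, out_size):
        x = F.relu(self.refine(feats))
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return F.softplus(self.out(x))  # softplus keeps predicted depth positive


encoder = ToyPatchEncoder()
head = ToyDepthHead()

# Freeze the encoder: only the head's parameters will receive gradients.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# The optimizer only ever sees the decoder head.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

image = torch.randn(2, 3, 224, 224)  # H and W divisible by the patch size
with torch.no_grad():
    feats = encoder(image)  # (2, 64, 16, 16)
depth = head(feats, out_size=image.shape[-2:])  # (2, 1, 224, 224)
```

Because the encoder runs under `torch.no_grad()` and its parameters are frozen, a backward pass through `depth` would update only the head, which is what keeps training cheap relative to fine-tuning the whole backbone.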

Why a single image produces depth at all

A 2D image contains many monocular cues that correlate with depth:

A ViT trained on billions of images internalises these cues. With enough data and a strong backbone, monocular depth hits reasonable accuracy without any explicit 3D supervision.

What monocular depth cannot do

Depth Anything V3 in 2026

This is the drop-in model to call when you need depth in 2026.

Marigold — diffusion for depth

Marigold (Ke et al., CVPR 2024) reframes depth estimation as conditional image-to-image diffusion. Conditioning: RGB. Target: depth map. Uses a pretrained Stable Diffusion 2 U-Net as backbone. Output depth maps are exceptionally sharp at object boundaries. Trade-off: slower inference than feed-forward models (10-50 denoising steps).

Intrinsics and the pinhole camera

To lift a pixel (u, v) with depth d to a 3D point (X, Y, Z) in camera coordinates:

fx, fy, cx, cy = camera intrinsics
X = (u - cx) * d / fx
Y = (v - cy) * d / fy
Z = d

Intrinsics come from EXIF metadata, a calibration pattern, or a monocular intrinsics estimator (Perspective Fields, UniDepth). Without intrinsics, you can still render a point cloud by assuming a 60-70° FOV and a principal point at the image centre. That is usable for visualisation, not for measurement.
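When no calibration is available, a fallback set of intrinsics can be derived from the image size and an assumed field of view. The sketch below is one way to do that, assuming square pixels, a principal point at the image centre, and a horizontal FOV; the helper names `intrinsics_from_fov` and `lift_pixel` are illustrative, not from any library. The usage lines pick a 90° FOV only because it makes the arithmetic easy to check by hand.

```python
import math


def intrinsics_from_fov(width, height, hfov_deg=65.0):
    """Guess pinhole intrinsics from image size and an assumed horizontal FOV.

    Assumes square pixels (fx == fy) and a principal point at the image centre.
    """
    fx = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    fy = fx
    cx = width / 2.0
    cy = height / 2.0
    return fx, fy, cx, cy


def lift_pixel(u, v, d, intrinsics):
    """Back-project pixel (u, v) with depth d to camera coordinates (X, Y, Z)."""
    fx, fy, cx, cy = intrinsics
    return (u - cx) * d / fx, (v - cy) * d / fy, d


# tan(45 deg) = 1, so a 640-pixel-wide image with a 90 deg FOV has fx of about 320.
intr = intrinsics_from_fov(640, 480, hfov_deg=90.0)
point = lift_pixel(480, 240, 2.0, intr)  # 160 px right of centre, 2 units deep
```

A wrong FOV guess stretches or squashes the cloud along X and Y while leaving Z untouched, which is why the result is fine for visualisation but not for measuring sizes.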

Evaluation

Two standard metrics:

For relative depth (Depth Anything V3, MiDaS), evaluation uses scale-and-shift invariant versions of both metrics.
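Written out, the two metrics match the definitions in the Key Terms table and the Build It code. The symbols are notation chosen here: $\hat{d}_i$ is the predicted depth at valid pixel $i$, $d_i$ the ground truth, and $N$ the number of valid pixels.

```latex
\mathrm{AbsRel} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{d}_i - d_i \rvert}{d_i}
\qquad
\delta_{1.25} = \frac{1}{N} \sum_{i=1}^{N}
  \mathbf{1}\!\left[ \max\!\left( \frac{\hat{d}_i}{d_i},\; \frac{d_i}{\hat{d}_i} \right) < 1.25 \right]
```

Lower is better for AbsRel; higher is better for $\delta_{1.25}$. For the scale-and-shift invariant versions, $\hat{d}_i$ is first replaced by $a\hat{d}_i + b$, with $a$ and $b$ chosen by least squares against the ground truth.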

Build It

Step 1: Depth metrics

import torch

def abs_rel_error(pred, target, mask=None):
    if mask is not None:
        pred = pred[mask]
        target = target[mask]
    return (torch.abs(pred - target) / target.clamp(min=1e-6)).mean().item()


def delta_accuracy(pred, target, threshold=1.25, mask=None):
    if mask is not None:
        pred = pred[mask]
        target = target[mask]
    ratio = torch.maximum(pred / target.clamp(min=1e-6), target / pred.clamp(min=1e-6))
    return (ratio < threshold).float().mean().item()

Always mask invalid depth pixels (zero, NaN, saturated) before evaluation.
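A validity mask along these lines can be built directly from the ground-truth tensor. This is a minimal sketch: `valid_depth_mask` is a hypothetical helper, and the `min_depth` and `max_depth` limits are assumed, dataset-dependent values that should be set from your sensor's actual range.

```python
import torch


def valid_depth_mask(target, min_depth=1e-3, max_depth=80.0):
    """Boolean mask of pixels with usable ground-truth depth.

    min_depth and max_depth are assumed, dataset-dependent limits; readings at
    or beyond max_depth are treated as saturated.
    """
    finite = torch.isfinite(target)  # rejects NaN and +/-inf
    in_range = (target > min_depth) & (target < max_depth)
    return finite & in_range


gt = torch.tensor([[2.0, 0.0], [float("nan"), 80.0]])
pred = torch.tensor([[2.2, 1.0], [3.0, 50.0]])
mask = valid_depth_mask(gt)  # only the top-left pixel survives

# Same computation as abs_rel_error, restricted to valid pixels.
abs_rel = (torch.abs(pred[mask] - gt[mask]) / gt[mask]).mean().item()
```

The resulting `mask` can be passed as the `mask` argument of `abs_rel_error` and `delta_accuracy`.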

Step 2: Scale-and-shift alignment

For relative-depth models, align prediction to ground truth before computing metrics. Least-squares fit of a * pred + b = target:

def align_scale_shift(pred, target, mask=None):
    if mask is not None:
        p = pred[mask]
        t = target[mask]
    else:
        p = pred.flatten()
        t = target.flatten()
    A = torch.stack([p, torch.ones_like(p)], dim=1)
    coeffs, *_ = torch.linalg.lstsq(A, t.unsqueeze(-1))
    a, b = coeffs[:2, 0]
    return a * pred + b

Run align_scale_shift before abs_rel_error when evaluating MiDaS / Depth Anything.

Step 3: Lift depth to a point cloud

import numpy as np

def depth_to_point_cloud(depth, intrinsics):
    H, W = depth.shape
    fx, fy, cx, cy = intrinsics
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)


depth = np.random.uniform(0.5, 4.0, (240, 320))
intr = (320.0, 320.0, 160.0, 120.0)
pc = depth_to_point_cloud(depth, intr)
print(f"point cloud shape: {pc.shape}  (H, W, 3)")

One function, every 3D-lifted application. Export the point cloud to .ply and open in MeshLab or CloudCompare.
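For the export step, an ASCII PLY file is simple enough to write by hand. The sketch below assumes an XYZ-only cloud with no colours or normals; `save_point_cloud_ply` is a hypothetical helper name, and the temporary-directory path is used only to keep the example self-contained.

```python
import os
import tempfile

import numpy as np


def save_point_cloud_ply(path, points):
    """Write an (H, W, 3) or (N, 3) array of XYZ points as an ASCII PLY file."""
    pts = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    pts = pts[np.isfinite(pts).all(axis=1)]  # drop points with NaN/inf coordinates
    header = [
        "ply",
        "format ascii 1.0",
        f"element vertex {len(pts)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for x, y, z in pts:
            f.write(f"{x:.6f} {y:.6f} {z:.6f}\n")
    return len(pts)


cloud = np.random.uniform(-1.0, 1.0, (4, 5, 3))
cloud[0, 0, 2] = np.nan  # one invalid point, which should be skipped
ply_path = os.path.join(tempfile.gettempdir(), "cloud.ply")
n_written = save_point_cloud_ply(ply_path, cloud)
```

ASCII PLY is slow and bulky for large clouds; for full-resolution frames a binary writer or a library such as open3d is the more practical choice.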

Step 4: Smoke test with a synthetic depth scene

def synthetic_depth(size=96):
    yy, xx = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    # Floor: linear gradient from near (top) to far (bottom)
    depth = 1.0 + (yy / size) * 4.0
    # Box in the middle: closer
    mask = (np.abs(xx - size / 2) < size / 6) & (np.abs(yy - size * 0.6) < size / 6)
    depth[mask] = 2.0
    return depth.astype(np.float32)


gt = torch.from_numpy(synthetic_depth(96))
# Simulated relative-depth prediction: unknown scale and shift plus a little noise,
# so that alignment has something to correct.
pred = 0.5 * gt + 1.0 + 0.05 * torch.randn_like(gt)
aligned = align_scale_shift(pred, gt)
print(f"before align  absRel = {abs_rel_error(pred, gt):.3f}")
print(f"after align   absRel = {abs_rel_error(aligned, gt):.3f}")

Step 5: Depth Anything V3 usage (reference)

import numpy as np
from PIL import Image
from transformers import pipeline

# Depth Anything V2 checkpoint in transformers format
pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Large-hf")

image = Image.open("street.jpg").convert("RGB")
out = pipe(image)
depth_np = np.array(out["depth"])  # PIL grayscale image -> NumPy array

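The pipeline call itself is three lines. out["depth"] is a PIL grayscale image rescaled for display, so convert it to a NumPy array before doing math; out["predicted_depth"] holds the raw prediction as a tensor. This example loads a Depth Anything V2 checkpoint; for Depth Anything V3, swap the model id if a transformers-compatible checkpoint is available.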

Use It

Typical integration pattern:

  1. RGB frame arrives.
  2. Depth model produces depth map.
  3. Detector produces boxes.
  4. Lift box centroids through depth to 3D; merge with point cloud if available.
  5. Downstream: AR occlusion, path planning, object-size estimation, stereo replacement.
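Step 4 of the pattern above can be sketched as follows. This assumes metric depth (so Z is in metres), known intrinsics, and integer pixel box coordinates; taking the median depth inside the box is one assumed heuristic for ignoring background pixels, and `lift_box_to_3d` is a hypothetical helper name.

```python
import numpy as np


def lift_box_to_3d(depth, box, intrinsics):
    """Return the camera-space (X, Y, Z) of a 2D box's centre.

    box is (x_min, y_min, x_max, y_max) in pixels. The median depth inside the
    box is used (an assumed heuristic) so a few background pixels do not skew Z.
    """
    fx, fy, cx, cy = intrinsics
    x0, y0, x1, y1 = box
    patch = depth[y0:y1, x0:x1]
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None  # no usable depth inside the box
    z = float(np.median(valid))
    u = (x0 + x1) / 2.0
    v = (y0 + y1) / 2.0
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])


depth = np.full((240, 320), 4.0, dtype=np.float32)  # background wall at 4 m
depth[100:140, 200:240] = 2.0                       # object at 2 m
intr = (320.0, 320.0, 160.0, 120.0)                 # fx, fy, cx, cy
centre = lift_box_to_3d(depth, (195, 95, 245, 145), intr)  # box slightly larger than the object
```

The median works here because the object fills most of the box; for thin or heavily occluded objects, a segmentation mask is a more reliable way to pick which depth pixels belong to the object.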

For real-time use, Depth Anything V2 Small (INT8 quantised) hits ~30 fps on a consumer GPU at 518x518.

Ship It

This lesson produces:

Exercises

  1. (Easy) Run Depth Anything V2 on any 10 images of your desk. Save depth as grayscale PNGs and inspect. Identify one object whose predicted depth looks wrong and explain why the monocular cues failed.
  2. (Medium) Given RGB + depth from Depth Anything V2, lift to a point cloud and render with open3d. Compare two scenes (indoor / outdoor) and note which looks more believable.
  3. (Hard) Take five pairs of images that differ only by a known object's position (e.g. bottle moved 30 cm closer). Use UniDepth to predict metric depth on both. Report the predicted distance delta vs the true 30 cm.

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| Monocular depth | "Single-image depth" | Depth estimation from one RGB frame, no stereo or LiDAR |
| Relative depth | "Ordered depth" | Ordered z-values without real-world units |
| Metric depth | "Absolute distance" | Depth in metres; requires calibration or a model trained with metric supervision |
| AbsRel | "Absolute relative error" | Mean of abs(d_pred - d_gt) / d_gt over valid pixels; standard depth metric |
| Delta accuracy | "delta < 1.25" | Fraction of pixels where max(d_pred / d_gt, d_gt / d_pred) < 1.25, i.e. the prediction is within a factor of 1.25 of ground truth |
| Pinhole camera | "fx, fy, cx, cy" | The camera model used to lift (u, v, d) to (X, Y, Z) |
| DPT | "Dense Prediction Transformer" | The conv-based decoder used on top of frozen ViT encoders for depth |
| DINOv2 backbone | "The reason it works" | Self-supervised features that generalise across domains without depth labels |

Further Reading