SAM 3 & Open-Vocabulary Segmentation

> Give a model a text prompt and an image and get masks for every matching object. SAM 3 made that a single forward pass.

Type: Use + Build

Languages: Python

Prerequisites: Phase 4 Lesson 07 (U-Net), Phase 4 Lesson 08 (Mask R-CNN), Phase 4 Lesson 18 (CLIP)

Time: ~60 minutes

Learning Objectives

The Problem

The 2023 SAM was a visual-prompt-only model: you click a point or draw a box and it returns a mask. For "give me all the oranges in this photo" you needed a detector (Grounding DINO) to produce boxes, then SAM to segment each. Grounded SAM turned this into a pipeline, but it was a cascade of two frozen models with inevitable error accumulation.

SAM 3 (Meta, Nov 2025, ICLR 2026) collapsed the cascade. It accepts a short noun phrase or an image exemplar as prompt and returns all matching masks and instance IDs in a single forward pass. That is Promptable Concept Segmentation (PCS). Combined with the March 2026 Object Multiplex update (SAM 3.1), it tracks multiple instances of the same concept through video efficiently.

This lesson is about the structural shift this represents. 2D segmentation, detection, and text-image grounding have merged into one model. The production question is no longer "which pipeline do I chain together" but "which promptable model handles my use case end-to-end."

The Concept

The three generations

flowchart LR
    subgraph SAM1["SAM (2023)"]
        A1["Image + point/box prompt"] --> A2["ViT encoder"] --> A3["Mask decoder"]
        A3 --> A4["Mask for that prompt"]
    end
    subgraph GSAM2["Grounded SAM 2 (2024)"]
        B1["Text"] --> B2["Grounding DINO"] --> B3["Boxes"] --> B4["SAM 2"] --> B5["Masks + tracking"]
        B6["Image"] --> B2
        B6 --> B4
    end
    subgraph SAM3["SAM 3 (2025)"]
        C1["Text OR image exemplar"] --> C2["Shared backbone"]
        C3["Image"] --> C2
        C2 --> C4["Image detector + memory tracker<br/>+ presence head"]
        C4 --> C5["All matching masks<br/>+ instance IDs"]
    end
    style SAM1 fill:#e5e7eb,stroke:#6b7280
    style GSAM2 fill:#fef3c7,stroke:#d97706
    style SAM3 fill:#dcfce7,stroke:#16a34a

Promptable Concept Segmentation

A "concept prompt" is a short noun phrase ("yellow school bus", "striped red umbrella", "hand holding a mug") or an image exemplar. The model returns segmentation masks for every instance in the image that matches the concept, plus a unique instance ID per match.

This differs from classic visual-prompt SAM in three ways:

  1. No per-instance prompting required — one text prompt returns all matches.
  2. Open-vocabulary — the concept can be anything describable in natural language.
  3. Returns multiple instances at once rather than one mask per prompt.

Key architectural pieces

Training at scale

SAM 3 was trained on 4 million unique concepts generated by a data engine that iteratively annotates and corrects labels using AI plus human review. The new SA-CO benchmark contains 270K unique concepts, 50x more than prior benchmarks. SAM 3 reaches 75-80% of human performance on SA-CO and doubles the accuracy of existing systems on both image and video PCS.

SAM 3.1 Object Multiplex

March 2026 update: Object Multiplex introduces a shared-memory mechanism for joint tracking of many instances of the same concept at once. Previously, tracking N instances meant N separate memory banks. Multiplex collapses that into one shared memory with per-instance queries. Result: substantially faster multi-object tracking without sacrificing accuracy.
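The memory-scaling idea in that paragraph can be illustrated with a toy attention read. This is a conceptual sketch only, not SAM 3.1's actual implementation: the function names, tensor shapes, and the use of plain dot-product attention are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def read_separate(queries, banks):
    """One memory bank per instance: queries[i] attends only to banks[i]."""
    out = []
    for q, bank in zip(queries, banks):            # q: (d,), bank: (m, d)
        attn = softmax(bank @ q / np.sqrt(q.size))  # (m,)
        out.append(attn @ bank)                     # (d,)
    return np.stack(out)

def read_shared(queries, memory):
    """One shared memory: every instance query attends to it in one matmul."""
    d = queries.shape[1]
    attn = softmax(queries @ memory.T / np.sqrt(d), axis=-1)  # (n, m)
    return attn @ memory                                      # (n, d)

rng = np.random.default_rng(0)
n_instances, slots, dim = 8, 16, 32                 # hypothetical sizes
queries = rng.normal(size=(n_instances, dim))
banks = rng.normal(size=(n_instances, slots, dim))  # N separate banks
memory = rng.normal(size=(slots, dim))              # one shared bank

sep = read_separate(queries, banks)
shared = read_shared(queries, memory)
print(sep.shape, shared.shape)   # (8, 32) (8, 32)
print(banks.size, memory.size)   # 4096 512
```

Both reads produce one feature vector per instance, but the separate-bank layout stores N times as many memory entries and needs a Python-level loop (or an extra batch dimension) to read them.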

Where Grounded SAM still matters in 2026

Modular pipelines still have a place. For most production work, SAM 3 is the simpler answer.

YOLO-World vs SAM 3

Production split: YOLO-World for fast detection-only pipelines (robotics navigation, fast dashboards), SAM 3 for anything that needs masks or tracking.

SAM-MI efficiency

SAM-MI (2025-2026) addresses SAM's decoder bottleneck. Key ideas:

Result: ~1.6× speedup over Grounded-SAM on open-vocabulary benchmarks.

Output format for the three models

All return the same general structure (boxes + labels + scores + masks + IDs), which is helpful — your pipeline downstream does not have to branch on which model ran.

Build It

Step 1: Prompt construction

Build a helper that turns a user sentence into a list of SAM 3 concept prompts. This is the boundary where "what the user typed" meets "what the model consumes".

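import re

# Separators: comma, semicolon, ampersand, or the whole words "and" / "or".
_SEPARATORS = re.compile(r"\s*(?:[,;&]|\band\b|\bor\b)\s*", flags=re.IGNORECASE)

def split_concepts(sentence):
    """
    Heuristic splitter for multi-concept prompts.
    Returns list of short noun phrases.
    Splits on whole words only, so "sandwich" and "orange" stay intact.
    Known limit: "black and white cat" is still split into two phrases.
    """
    parts = _SEPARATORS.split(sentence)
    return [p.strip() for p in parts if p.strip()]

print(split_concepts("cats, dogs and balloons"))  # ['cats', 'dogs', 'balloons']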

SAM 3 accepts one concept per forward pass; for multi-concept queries, loop or batch them.
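The loop over concepts can be sketched as follows. `segment_concepts` and `fake_detect` are hypothetical names introduced here; `detect_fn` stands in for whatever single-concept call your backend exposes.

```python
from typing import Callable, Dict, List

def segment_concepts(image, concepts: List[str], detect_fn: Callable) -> Dict[str, list]:
    """Run one detect call per unique concept and group results by concept."""
    results: Dict[str, list] = {}
    for concept in concepts:
        if concept in results:
            continue  # do not pay for the same concept twice
        results[concept] = detect_fn(image, concept)
    return results

def fake_detect(image, concept):
    # Hypothetical stand-in for a model call: returns len(concept) % 3 matches.
    return [{"concept": concept, "instance_id": i} for i in range(len(concept) % 3)]

out = segment_concepts(None, ["cats", "dogs", "balloons", "cats"], fake_detect)
print({k: len(v) for k, v in out.items()})  # {'cats': 1, 'dogs': 1, 'balloons': 2}
```

Keying the result by concept keeps instance IDs unambiguous, since each backend call numbers its own matches from zero.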

Step 2: Post-processing helpers

Turn SAM 3's raw outputs into a clean list of detections that match our Phase 4 Lesson 16 pipeline contract.

from dataclasses import dataclass
from typing import List

@dataclass
class ConceptDetection:
    concept: str
    instance_id: int
    box: tuple          # (x1, y1, x2, y2)
    score: float
    mask_rle: str       # run-length encoded


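def rle_encode(binary_mask):
    """Encode a binary NumPy mask as "<value>x<count>" runs, e.g. "0x100;1x50"."""
    flat = binary_mask.flatten().astype("uint8")
    if flat.size == 0:
        return ""  # empty mask: nothing to encode, and flat[0] would raise
    runs = []
    prev, count = flat[0], 0
    for v in flat:
        if v == prev:
            count += 1
        else:
            # Value changed: close the current run and start a new one.
            runs.append((int(prev), count))
            prev, count = v, 1
    runs.append((int(prev), count))  # flush the final run
    return ";".join(f"{v}x{c}" for v, c in runs)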

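RLE keeps response payloads small even for many high-resolution masks. The same format works across SAM 2, SAM 3, and Grounded SAM 2, because it encodes the final binary mask rather than anything model-specific. Note that this is a simple text format defined for this lesson, not the COCO RLE encoding.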
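A consumer of these payloads needs the inverse operation. The following is a minimal sketch of a decoder for the `value x count` string format used above; `rle_decode` is a hypothetical helper name, and the caller is assumed to know the mask shape.

```python
import numpy as np

def rle_decode(rle: str, shape):
    """Rebuild a binary mask from a "0x3;1x2;..." string and its target shape."""
    runs = [r for r in rle.split(";") if r]
    values = np.array([int(r.split("x")[0]) for r in runs], dtype=np.uint8)
    counts = np.array([int(r.split("x")[1]) for r in runs], dtype=np.int64)
    flat = np.repeat(values, counts)  # expand each run to its full length
    expected = int(np.prod(shape))
    if flat.size != expected:
        raise ValueError(f"RLE covers {flat.size} pixels, shape needs {expected}")
    return flat.reshape(shape)

mask = rle_decode("0x3;1x2;0x1", (2, 3))
print(mask.tolist())  # [[0, 0, 0], [1, 1, 0]]
```

The size check matters in practice: the string alone does not carry the image dimensions, so a payload paired with the wrong shape should fail loudly rather than reshape silently.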

Step 3: A unified open-vocab segmentation interface

Wrap whatever backend you have (SAM 3, Grounded SAM 2, YOLO-World + SAM 2) behind a single method. Your downstream code does not change when the backend does.

from abc import ABC, abstractmethod
import numpy as np

class OpenVocabSeg(ABC):
    @abstractmethod
    def detect(self, image: np.ndarray, concept: str) -> List[ConceptDetection]:
        ...


class StubOpenVocabSeg(OpenVocabSeg):
    """
    Deterministic stub used for pipeline testing when real models are not loaded.
    """
    def detect(self, image, concept):
        h, w = image.shape[:2]
        return [
            ConceptDetection(
                concept=concept,
                instance_id=0,
                box=(w * 0.2, h * 0.3, w * 0.5, h * 0.8),
                score=0.89,
                mask_rle="0x100;1x50;0x200",
            ),
            ConceptDetection(
                concept=concept,
                instance_id=1,
                box=(w * 0.55, h * 0.25, w * 0.85, h * 0.75),
                score=0.74,
                mask_rle="0x80;1x40;0x220",
            ),
        ]

The real SAM3OpenVocabSeg subclass would wrap transformers.Sam3Model and Sam3Processor.
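The shape of such a wrapper can be sketched without loading a model. In this sketch the backend is an injected callable returning `(masks, boxes, scores)`; that return convention, the `CallableBackend` name, and the trimmed `Detection` dataclass (a stand-in for `ConceptDetection` so the block runs on its own) are assumptions, not the transformers API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import numpy as np

@dataclass
class Detection:
    """Trimmed stand-in for ConceptDetection."""
    concept: str
    instance_id: int
    box: Tuple[float, float, float, float]
    score: float
    mask: np.ndarray

class CallableBackend:
    """Adapter around predict_fn(image, concept) -> (masks, boxes, scores)."""

    def __init__(self, predict_fn: Callable, min_score: float = 0.5):
        self.predict_fn = predict_fn
        self.min_score = min_score

    def detect(self, image: np.ndarray, concept: str) -> List[Detection]:
        masks, boxes, scores = self.predict_fn(image, concept)
        detections: List[Detection] = []
        for mask, box, score in zip(masks, boxes, scores):
            if score < self.min_score:
                continue  # drop low-confidence matches
            detections.append(Detection(
                concept=concept,
                instance_id=len(detections),  # re-number survivors from 0
                box=tuple(float(v) for v in box),
                score=float(score),
                mask=np.asarray(mask, dtype=bool),
            ))
        return detections

def fake_predict(image, concept):
    # Hypothetical model output: two candidate instances on a 4x4 image.
    masks = np.zeros((2, 4, 4), dtype=bool)
    masks[0, :2, :2] = True
    masks[1, 2:, 2:] = True
    boxes = np.array([[0, 0, 2, 2], [2, 2, 4, 4]], dtype=float)
    scores = np.array([0.9, 0.3])
    return masks, boxes, scores

backend = CallableBackend(fake_predict, min_score=0.5)
dets = backend.detect(np.zeros((4, 4, 3), dtype=np.uint8), "cat")
print(len(dets), dets[0].box, dets[0].score)  # 1 (0.0, 0.0, 2.0, 2.0) 0.9
```

Injecting the model call keeps score filtering and ID assignment testable without GPU weights, and swapping backends only changes `predict_fn`.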

Step 4: Hugging Face SAM 3 usage (reference)

For the actual model, the transformers integration:

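from transformers import Sam3Processor, Sam3Model
from PIL import Image
import torch

processor = Sam3Processor.from_pretrained("facebook/sam3")
model = Sam3Model.from_pretrained("facebook/sam3").eval()

pil_image = Image.open("bus.jpg").convert("RGB")  # placeholder path

# The text prompt goes through the processor together with the image.
inputs = processor(images=pil_image, text="yellow school bus", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Thresholds are illustrative. Check method and argument names against the
# transformers version you have installed; the integration is recent.
results = processor.post_process_instance_segmentation(
    outputs,
    threshold=0.5,
    mask_threshold=0.5,
    target_sizes=inputs.get("original_sizes").tolist(),
)[0]
masks = results["masks"]
boxes = results["boxes"]
scores = results["scores"]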

One prompt, all matches returned in a single call.

Step 5: Measure what Grounded SAM 2 gave you for free

An honest benchmark: what happens when you replace Grounded SAM 2 with SAM 3 in a real pipeline?
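One way to run that comparison is a small harness that records latency and mask agreement for two backends on the same image. This is a sketch under stated assumptions: each backend is a callable returning a list of binary masks, and `backend_a` / `backend_b` are hypothetical stand-ins, not real model calls. The agreement score measures how much the two backends overlap, not accuracy; with labelled data, pass the ground-truth mask to `mask_iou` instead.

```python
import time
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0

def union_mask(masks, shape):
    """Collapse a list of instance masks into one foreground mask."""
    out = np.zeros(shape, dtype=bool)
    for m in masks:
        out |= m.astype(bool)
    return out

def median_latency(backend_fn, image, concept, repeats=5):
    """Median wall-clock seconds per call over a few repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        backend_fn(image, concept)
        timings.append(time.perf_counter() - start)
    return float(np.median(timings))

def backend_a(image, concept):
    m = np.zeros(image.shape[:2], dtype=bool)
    m[0:2, :] = True   # hypothetical prediction: top two rows
    return [m]

def backend_b(image, concept):
    m = np.zeros(image.shape[:2], dtype=bool)
    m[1:3, :] = True   # hypothetical prediction: middle two rows
    return [m]

image = np.zeros((4, 4, 3), dtype=np.uint8)
shape = image.shape[:2]
agreement = mask_iou(
    union_mask(backend_a(image, "cat"), shape),
    union_mask(backend_b(image, "cat"), shape),
)
print(round(agreement, 3))                            # 0.333
print(median_latency(backend_a, image, "cat") >= 0.0)  # True
```

Using the median rather than the mean keeps a single slow warm-up call from dominating the latency figure.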

Conclusion: SAM 3 is the default for 2026 open-vocab seg. Grounded SAM 2 is still the right answer when you need detector flexibility or different license terms.

Use It

Production deployment patterns:

Ultralytics wraps SAM 3 in its Python package:

from ultralytics import SAM

model = SAM("sam3.pt")
results = model(image_path, prompts="yellow school bus")

Same interface as YOLO and SAM 2.

Ship It

This lesson produces:

Exercises

  1. (Easy) Run SAM 3 on 10 images with concept prompts you choose. Compare against SAM 2 + Grounding DINO 1.5 on the same images. Report which concepts each model missed.
  2. (Medium) Build a "click-to-include / click-to-exclude" UI on top of SAM 3: a text prompt returns candidate instances, and the user's clicks decide which ones count as positive. Output the final concept set as JSON.
  3. (Hard) Fine-tune SAM 3 on a custom concept set (e.g. 5 types of electronic components) with 20 labelled images each. Compare to zero-shot SAM 3 on the same test set; measure mask IoU improvement.

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| Open-vocabulary segmentation | "Segment by text" | Produce masks for objects described in natural language, not a fixed label set |
| PCS | "Promptable Concept Segmentation" | SAM 3's core task: given a noun phrase or image exemplar, segment all matching instances |
| Concept prompt | "The text input" | Short noun phrase or image exemplar; not a full sentence |
| Presence head | "Is it here?" | SAM 3 module that decides whether the concept exists in the image before localisation |
| SA-CO | "SAM 3 benchmark" | 270K-concept open-vocabulary segmentation benchmark; 50x larger than prior open-vocab benchmarks |
| Object Multiplex | "SAM 3.1 update" | Shared-memory multi-object tracking; fast joint tracking of many instances |
| Grounded SAM 2 | "Modular pipeline" | Detector + SAM 2 cascade; still relevant when detector swap matters |
| SAM-MI | "Efficient SAM variant" | Mask Injection for 1.6x speedup over Grounded-SAM |

Further Reading