Inpainting, Outpainting & Image Editing

> Text-to-image makes new things. Inpainting fixes old ones. In production, 70% of billable image work is editing — swap a background, remove a logo, extend the canvas, regenerate a hand. Inpainting is where diffusion earns its keep.

Type: Build

Languages: Python

Prerequisites: Phase 8 · 07 (Latent Diffusion), Phase 8 · 08 (ControlNet & LoRA)

Time: ~75 minutes

The Problem

A client sends a perfect product photo with a distracting sign in the background. You want to erase the sign and leave everything else pixel-identical. You cannot run text-to-image from scratch — the result will have a different color, different lighting, different product angle. You want to regenerate *only* the masked region, and you want the regeneration to respect the surrounding context.

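That is inpainting. The variants covered in this lesson are outpainting (extend the canvas), SDEdit (noise the image and re-denoise it with a new prompt), InstructPix2Pix (instruction-driven edits), and RePaint (resampling with an unconditional model).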

Every diffusion pipeline in 2026 ships an inpainting mode: Flux.1-Fill, Stable Diffusion Inpaint, SDXL-Inpaint, DALL-E 3 Edit. They all work on the same principle.

The Concept

Inpainting: mask-aware denoising with context-preserving reinjection

The naive approach (and why it's wrong)

Run standard text-to-image with a mask. At each sampling step, replace the unmasked region of the noisy latent with the forward-diffused clean image. It works, but badly. Seams appear at the mask boundary because the model was never trained to condition on the known region: it sees the surrounding context only through noise, so the fill is not harmonized with it.

The proper inpainting model

Train a modified U-Net that takes 9 input channels instead of 4:

input = concat([ noisy_latent (4ch), encoded_image (4ch), mask (1ch) ], dim=channel)

The extra channels are a copy of the VAE-encoded source image plus a single-channel mask. At training time, you randomly mask regions of the image and train the model to denoise only the masked region while the unmasked region is given as a clean conditioning signal. At inference, the model can "see" what surrounds the masked region and produces coherent completions.
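As a shape-level illustration of that concatenation, here is a minimal pure-Python sketch. The helper names `build_inpaint_input` and `channel`, the 2x2 latent size, and the channel order (copied from the line above) are assumptions for illustration; real checkpoints fix their own channel order and operate on tensors, not nested lists.

```python
def build_inpaint_input(noisy_latent, encoded_image, mask):
    """Concatenate along the channel axis; each argument is a list of HxW channels."""
    assert len(noisy_latent) == 4 and len(encoded_image) == 4 and len(mask) == 1
    return noisy_latent + encoded_image + mask


def channel(value, h=2, w=2):
    """Hypothetical helper: one constant HxW channel."""
    return [[value] * w for _ in range(h)]


noisy_latent = [channel(0.5) for _ in range(4)]   # x_t in latent space
encoded_image = [channel(1.0) for _ in range(4)]  # VAE-encoded source
mask = [[[1, 0], [0, 0]]]                         # 1 = regenerate, 0 = keep

unet_input = build_inpaint_input(noisy_latent, encoded_image, mask)
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))  # 9 2 2
```

The only architectural change this implies is a wider first convolution: everything after the input layer is the same U-Net.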

SD-Inpaint, SDXL-Inpaint, and Flux-Fill all use this 9-channel input or an analog of it. In Diffusers, the corresponding pipelines include StableDiffusionInpaintPipeline and FluxFillPipeline.

SDEdit (Meng et al., 2022) — free editing

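Add noise to the source image up to some intermediate t, then run the reverse chain from t down to 0 with a new prompt. No retraining. The choice of starting t trades fidelity for creative freedom: a small t keeps the output close to the source, while a large t destroys more of the source and gives the prompt more room to change it.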
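The noising half of SDEdit can be sketched in a few lines. This is a toy version on a 5-float vector, not an image pipeline; the function name `sdedit_start`, the `strength` parameter (the fraction t/T), and the linear-beta schedule are assumptions chosen for illustration.

```python
import math
import random


def sdedit_start(x0, strength, alpha_bars, rng):
    """Noise x0 up to the intermediate step selected by strength in (0, 1]."""
    num_steps = len(alpha_bars)
    t0 = max(1, min(num_steps, round(strength * num_steps)))  # reverse steps to run
    a_bar = alpha_bars[t0 - 1]
    x_t0 = [math.sqrt(a_bar) * x + math.sqrt(1 - a_bar) * rng.gauss(0, 1) for x in x0]
    return t0, x_t0


# Hypothetical 100-step linear-beta schedule.
alpha_bars, prod = [], 1.0
for step in range(100):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * step / 99)
    alpha_bars.append(prod)

rng = random.Random(0)
source = [1.0, 1.0, 1.0, 1.0, 1.0]
t_low, x_low = sdedit_start(source, 0.2, alpha_bars, rng)
t_high, x_high = sdedit_start(source, 0.8, alpha_bars, rng)
print(t_low, t_high)  # 20 80
```

At strength 0.2 the retained signal weight `alpha_bars[t_low - 1]` is much larger than at 0.8, which is the fidelity side of the trade-off. The reverse chain would then run from `t0` down to 0 under the new prompt.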

InstructPix2Pix (Brooks et al., 2023)

Fine-tune a diffusion model on (input_image, instruction, output_image) triples. At inference, condition on both the input image and a text instruction ("make it sunset", "add a dragon"). Two CFG scales: image scale and text scale.
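The two scales combine three noise predictions per step: one with no conditioning, one with the image only, and one with image plus instruction. The sketch below follows the guidance rule given in the InstructPix2Pix paper; the function name `dual_cfg` and the toy one-element predictions are assumptions for illustration.

```python
def dual_cfg(eps_uncond, eps_image, eps_full, image_scale, text_scale):
    """Two-scale classifier-free guidance.

    eps_uncond: prediction with neither image nor text conditioning
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on input image and instruction
    """
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy values: with both scales at 1 the result is just the fully conditioned prediction.
print(dual_cfg([0.0], [1.0], [3.0], 1.0, 1.0))  # [3.0]
print(dual_cfg([0.0], [1.0], [3.0], 1.5, 7.5))  # [16.5]
```

Raising the image scale pulls the result toward the input image; raising the text scale pushes harder in the direction the instruction adds on top of it.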

RePaint (Lugmayr et al., 2022)

Keep a standard unconditional diffusion model. During the reverse chain, periodically resample: jump back to a noisier state and denoise again, so the generated region gets several passes to harmonize with the known region. This reduces boundary artifacts. Use it when you don't have a trained inpainting model.
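The resampling idea is easiest to see as a timestep schedule. The sketch below is a simplified schedule generator, not the published RePaint schedule; the function name `repaint_schedule` and the parameters `jump_every`, `jump_len`, and `n_resample` are assumptions for illustration. A step down means one denoising step, a step up means one re-noising step.

```python
def repaint_schedule(num_steps, jump_every=10, jump_len=5, n_resample=1):
    """Return the sequence of noise levels visited, from num_steps (pure noise) to 0."""
    # Each multiple of jump_every triggers n_resample jumps back up by jump_len.
    jumps_left = {
        t: n_resample
        for t in range(jump_every, num_steps - jump_len + 1, jump_every)
    }
    times = [num_steps]
    t = num_steps
    while t > 0:
        t -= 1                      # one denoising step
        times.append(t)
        if jumps_left.get(t, 0) > 0:
            jumps_left[t] -= 1
            for _ in range(jump_len):
                t += 1              # one re-noising step
                times.append(t)
    return times


print(repaint_schedule(6, jump_every=3, jump_len=2))
# [6, 5, 4, 3, 4, 5, 4, 3, 2, 1, 0]
```

The cost is visible in the length of the list: every jump adds `2 * jump_len` extra network evaluations' worth of steps, half of them cheap re-noising and half full denoising.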

Build It

code/main.py implements a toy inpainting scheme on 5-dimensional vectors. We train a DDPM on 5-D mixture data, where each sample is 5 floats drawn around one of two cluster centers. At inference, we "mask" 2 of the 5 dimensions, inject the forward-noised version of the 3 unmasked dimensions at each step, and regenerate only the masked dimensions.

Step 1: 5-D DDPM data

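import random

def sample_data(rng):
    # rng is a random.Random instance; pick one of the two clusters uniformly
    cluster = rng.choice([0, 1])
    center = [-1.0] * 5 if cluster == 0 else [1.0] * 5
    # jitter every dimension around the cluster center (std 0.2)
    return [c + rng.gauss(0, 0.2) for c in center], cluster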

Step 2: train denoiser over all 5 dims

Standard DDPM. Net outputs 5-D noise prediction for 5-D noisy input.
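For reference, one training example in that setup can be sketched as follows. The network itself is left out; the names `make_training_pair` and `mse`, the 50-step linear-beta schedule, and the all-zeros stand-in prediction are assumptions for illustration, not the contents of code/main.py.

```python
import math
import random


def make_training_pair(x0, alpha_bars, rng):
    """Return (x_t, t, eps): noisy input, its timestep, and the regression target."""
    t = rng.randrange(len(alpha_bars))
    a_bar = alpha_bars[t]
    eps = [rng.gauss(0, 1) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1 - a_bar) * e for x, e in zip(x0, eps)]
    return x_t, t, eps


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


# Hypothetical 50-step linear-beta schedule.
alpha_bars, prod = [], 1.0
for step in range(50):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * step / 49)
    alpha_bars.append(prod)

rng = random.Random(0)
x0 = [1.0 + rng.gauss(0, 0.2) for _ in range(5)]  # a sample near the +1 cluster
x_t, t, eps = make_training_pair(x0, alpha_bars, rng)
zero_guess = [0.0] * 5                            # stand-in for an untrained net
loss = mse(zero_guess, eps)
```

Training repeats this over many samples and timesteps, feeding `(x_t, t)` to the network and minimizing `mse(prediction, eps)` over all 5 dimensions. No mask is involved at training time in the naive scheme.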

Step 3: at inference, mask-aware reverse

import math

def inpaint_step(x_t, mask, clean_image, alpha_bars, t, rng):
    # mask[i] is True for dims to regenerate, False for dims known from the source
    # replace unmasked dims with a freshly noised version of the clean source
    a_bar = alpha_bars[t]
    for i in range(len(x_t)):
        if not mask[i]:
            x_t[i] = math.sqrt(a_bar) * clean_image[i] + math.sqrt(1 - a_bar) * rng.gauss(0, 1)
    # ...then run the normal reverse step on x_t
    return x_t

This is the naive approach, and it is good enough for this toy vector data. Real image inpainting uses the 9-channel input because coherent texture across the mask boundary matters far more there.
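Putting the re-injection inside a full reverse chain might look like the sketch below. It is an assumption-laden illustration rather than the code in code/main.py: `predict_noise` stands in for the trained denoiser, and `oracle` is a hypothetical exact denoiser for a data distribution that is a single point, used only so the loop runs without training. The names `linear_schedule` and `inpaint_sample` are likewise made up here.

```python
import math
import random


def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas and their cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def inpaint_sample(predict_noise, clean, mask, betas, alpha_bars, rng):
    """Naive inpainting: mask[i] True = regenerate, False = keep from clean."""
    n = len(clean)
    x = [rng.gauss(0, 1) for _ in range(n)]
    for t in reversed(range(len(betas))):
        a_bar = alpha_bars[t]
        for i in range(n):  # re-inject the known dims at noise level t
            if not mask[i]:
                x[i] = math.sqrt(a_bar) * clean[i] + math.sqrt(1 - a_bar) * rng.gauss(0, 1)
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1 - a_bar)
        mean = [(x[i] - coef * eps[i]) / math.sqrt(1 - betas[t]) for i in range(n)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        x = [m + sigma * rng.gauss(0, 1) for m in mean]
    # paste the known dims back so they stay identical to the source
    return [x[i] if mask[i] else clean[i] for i in range(n)]


def oracle(center, alpha_bars):
    """Exact noise predictor when all data sits at `center` (demo only)."""
    def predict(x, t):
        a = alpha_bars[t]
        return [(xi - math.sqrt(a) * c) / math.sqrt(1 - a) for xi, c in zip(x, center)]
    return predict


betas, alpha_bars = linear_schedule(50)
clean = [1.1, 0.9, 1.0, 0.0, 0.0]          # last two values are unknown placeholders
mask = [False, False, False, True, True]   # regenerate dims 3 and 4
result = inpaint_sample(oracle([1.0] * 5, alpha_bars), clean, mask,
                        betas, alpha_bars, random.Random(0))
```

With the oracle, the two masked dimensions land on the cluster center while the three known dimensions come back unchanged. With a trained network the masked values would instead be samples from whichever cluster the known dimensions suggest.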

Step 4: outpainting

Outpainting is inpainting with the mask inverted: mask the new (previously non-existent) canvas, fill the rest with the original. Identical training objective.
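In practice the only new work is building the larger canvas and its mask. A minimal sketch on a tiny grayscale grid, assuming a hypothetical helper `outpaint_canvas` that extends the image to the right only:

```python
def outpaint_canvas(image, pad_right, fill=0.0):
    """Extend each row to the right; mask is 1 where pixels must be generated."""
    canvas = [row + [fill] * pad_right for row in image]
    mask = [[0] * len(row) + [1] * pad_right for row in image]
    return canvas, mask


image = [[0.2, 0.4],
         [0.6, 0.8]]
canvas, mask = outpaint_canvas(image, pad_right=2)
print(mask)  # [[0, 0, 1, 1], [0, 0, 1, 1]]
```

The `canvas` and `mask` pair then goes through the same inpainting call as any other masked edit; the fill value in the new area is a placeholder that the model overwrites.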

Pitfalls

Use It

| Task | Pipeline |
|---|---|
| Remove object, small mask | SD-Inpaint or Flux-Fill, standard prompt |
| Replace sky | SD-Inpaint + "blue sky at sunset" |
| Extend canvas | SDXL outpaint mode (8px feather) or Flux-Fill with outpaint mask |
| Regenerate hand / face | SD-Inpaint with prompt re-describing the subject + ControlNet-Openpose |
| Change style of one region | SDEdit at t/T=0.5 on masked region |
| "Make it sunset" | InstructPix2Pix or Flux-Kontext |
| Background replacement | SAM mask → SD-Inpaint |
| Ultra-high-fidelity | Flux-Fill or GPT-Image (hosted) for hardest cases |

SAM (Meta's Segment Anything, 2023) + diffusion inpaint is the 2026 background-removal pipeline. SAM 2 (2024) works on video.

Ship It

Save outputs/skill-editing-pipeline.md. Skill takes an original image + edit description + optional mask (or SAM prompt) and outputs: mask-generation approach, base model, CFG scales (image + text), SDEdit-t or inpainting mode, and QA checklist.

Exercises

  1. Easy. In code/main.py, vary the fraction of dimensions masked from 0.2 to 0.8. At what fraction does the inpaint quality (residual in masked dims) equal unconditional generation?
  2. Medium. Implement RePaint: at every 10th reverse step, jump back 5 steps (add noise) and re-denoise. Measure whether it reduces boundary residual at the mask edge.
  3. Hard. Use Hugging Face diffusers to compare: SD 1.5 Inpaint + ControlNet-Openpose vs Flux.1-Fill on 20 face-regeneration tasks. Score pose adherence and identity preservation separately.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| Inpainting | "Fill the hole" | Regenerate inside a mask; keep outside pixels. |
| Outpainting | "Extend the canvas" | Regenerate outside the canvas; keep inside. |
| 9-channel U-Net | "Proper inpainting model" | U-Net with `noisy`, `encoded-source`, and `mask` channels concatenated as input. |
| SDEdit | "Img2img with noise level" | Noise to time t, denoise with new prompt. |
| InstructPix2Pix | "Text-only edits" | Fine-tuned diffusion on (image, instruction, output) triples. |
| RePaint | "No retraining" | Re-noise periodically during reverse to reduce seams. |
| SAM | "Segment Anything" | Mask generator by clicks or boxes; pairs with inpaint. |
| Flux-Kontext | "Edit with context" | Flux variant that accepts a reference image + instruction for edits. |

Production note: edit pipelines are latency-sensitive

Users editing an image expect sub-5-second round trips. A 30-step SDXL-Inpaint at 1024² is 3-4 s on an L4, plus SAM mask generation (~200 ms) and VAE encode/decode (~500 ms combined). In production framing, this is latency-bound (TTFT-bound, in LLM-serving terms) rather than throughput-bound: batch size 1, low concurrency, so every stage is worth minimizing.
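Adding up the figures quoted above shows how little slack the 5-second target leaves. The stage names below are labels invented for this sketch; the numbers are the ones from the paragraph, in milliseconds as (best, worst) pairs.

```python
BUDGET_MS = 5000  # the sub-5-second round-trip target

# (best case, worst case) per stage, in milliseconds
stages_ms = {
    "sam_mask": (200, 200),
    "vae_encode_decode": (500, 500),
    "sdxl_inpaint_30_steps": (3000, 4000),
}

best = sum(lo for lo, _ in stages_ms.values())
worst = sum(hi for _, hi in stages_ms.values())
print(best, worst, BUDGET_MS - worst)  # 3700 4700 300
```

In the worst case only about 300 ms remain for everything not listed, such as upload, queueing, and returning the result, which is why the denoising stage is the first place to look for savings.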

Further Reading