3D Generation

> 3D is the modality where leverage from 2D models is strongest. The 2023 breakthrough was 3D Gaussian Splatting. The 2024-2026 generative push layers multi-view diffusion and 3D reconstruction on top of it to produce objects and scenes from a single prompt or photo.

Type: Learn

Languages: Python

Prerequisites: Phase 4 (Vision), Phase 8 · 07 (Latent Diffusion)

Time: ~45 minutes

The Problem

3D content is painful:

The 2026 stack separates the two problems. First, generate *2D multi-view images* with a diffusion model. Second, fit a *3D representation* (usually Gaussian splatting) to those images.

The Concept

3D generation: multi-view diffusion + 3D reconstruction

Representation: 3D Gaussian Splatting (Kerbl et al., 2023)

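Represent a scene as a cloud of ~1M 3D Gaussians. Each has 59 parameters in the quaternion + scale parameterization: position (3), covariance (quaternion 4 + scale 3, which encodes the 6 unique entries of a symmetric 3x3 matrix), opacity (1), and spherical-harmonics color (48 at degree 3, or just 3 at degree 0).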
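The count of 59 follows from the spherical-harmonics degree: a degree-d expansion has (d + 1)^2 coefficients per color channel. A small sketch of that arithmetic, assuming the quaternion + scale covariance layout (the function name is illustrative, not part of any library):

```python
def gaussian_param_count(sh_degree: int, use_quat_scale: bool = True) -> int:
    """Count the floats stored per Gaussian for a given SH degree."""
    position = 3
    # Covariance: quaternion (4) + scale (3), or 6 unique entries of a symmetric 3x3.
    covariance = 4 + 3 if use_quat_scale else 6
    opacity = 1
    # (degree + 1)^2 SH coefficients per channel, times 3 RGB channels.
    color = 3 * (sh_degree + 1) ** 2
    return position + covariance + opacity + color


print(gaussian_param_count(3))  # 3 + 7 + 1 + 48 = 59
print(gaussian_param_count(0))  # 3 + 7 + 1 + 3 = 14
```

At 32-bit floats, 59 parameters across ~1M Gaussians comes to roughly 236 MB before any compression.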

Rendering = projection + alpha-compositing. Fast (~100 fps at 1080p on a 4090). Differentiable. Fit by gradient descent against ground-truth photos. A scene fits in 5-30 minutes on a consumer GPU.

Two 2023-2024 innovations on top:

Multi-view diffusion

Fine-tune a pretrained image diffusion model to generate multiple consistent views of the same object from a text prompt or single image. Examples: Zero123 (Liu et al., 2023), MVDream (Shi et al., 2023), SV3D (Stability, 2024), CAT3D (Google, 2024). These models usually output 4-16 views around the object, which are then lifted to 3D via Gaussian splatting or NeRF.
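The target views are often placed on an orbit around the object. A minimal sketch of evenly spaced camera positions, assuming a fixed radius and elevation (the default values and the function name are illustrative and not taken from any of the models above):

```python
import math


def orbit_cameras(n_views: int, radius: float = 2.0, elevation_deg: float = 15.0):
    """Return camera positions evenly spaced in azimuth, all at the same elevation."""
    elev = math.radians(elevation_deg)
    cams = []
    for i in range(n_views):
        azim = 2 * math.pi * i / n_views          # evenly spaced around the object
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)               # constant height above the object
        cams.append((x, y, z))
    return cams
```

Each camera would look at the origin; the reconstruction step then needs these poses alongside the generated images.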

Text-to-3D pipelines

| Model | Input | Output | Time |
|---|---|---|---|
| DreamFusion (2022) | text | NeRF via SDS | ~1 hour per asset |
| Magic3D | text | mesh + texture | ~40 min |
| Shap-E (OpenAI, 2023) | text | implicit 3D | ~1 min |
| SJC / ProlificDreamer | text | NeRF / mesh | ~30 min |
| LRM (Adobe, 2023) | image | triplane | ~5 s |
| InstantMesh (2024) | image | mesh | ~10 s |
| SV3D (Stability, 2024) | image | novel views | ~2 min |
| CAT3D (Google, 2024) | 1-64 images | 3D NeRF | ~1 min |
| TripoSR (2024) | image | mesh | ~1 s |
| Meshy 4 (2025) | text + image | PBR mesh | ~30 s |
| Rodin Gen-1.5 (2025) | text + image | PBR mesh | ~60 s |
| Tencent Hunyuan3D 2.0 (2025) | image | mesh | ~30 s |

2025-2026 direction: direct text-to-mesh models with PBR materials suitable for game engines. A multi-view diffusion intermediate step is still the best-performing recipe for general objects.

NeRF (for context)

Neural Radiance Field (Mildenhall et al., 2020). A small MLP takes a 3D position (x, y, z) plus a view direction and outputs color and density. Render by integrating along camera rays. It beats mesh-based novel-view synthesis in quality, but it is roughly 100-1000x slower to render than rasterization-based methods because every pixel needs many MLP queries. Superseded by Gaussian splatting for most real-time use but still dominant in research.
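"Integrating along rays" is done numerically in practice. A minimal sketch of the standard volume-rendering quadrature for one ray, assuming the densities and colors have already been produced at each sample (plain lists stand in for MLP outputs here, color is a single gray value, and the function name is illustrative):

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: density sigma at each sample
    colors:    color at each sample (scalar gray for simplicity)
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0   # fraction of light that has not been absorbed yet
    pixel = 0.0
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this ray segment
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return pixel, transmittance
```

Every pixel repeats this loop over dozens to hundreds of samples, each requiring an MLP evaluation, which is where the rendering cost comes from.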

Build It

code/main.py implements a toy 2D "Gaussian splatting" fit: it represents a synthetic target image (a smooth gradient) as a sum of 2D Gaussian splats and optimizes their positions, colors, and covariances by gradient descent to match the target. You see the two core operations: the forward render (splat and accumulate) and the fit by gradient descent.

Step 1: 2D Gaussian splat

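import math

def gaussian_at(x, y, gaussian):
    """Unnormalized isotropic 2D Gaussian evaluated at pixel (x, y)."""
    px, py = gaussian["pos"]
    sigma = gaussian["sigma"]
    d2 = (x - px) ** 2 + (y - py) ** 2  # squared distance to the splat center
    return math.exp(-d2 / (2 * sigma * sigma))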

Step 2: render by summing splats

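def render(image_size, gaussians):
    """Render a grayscale image by summing every splat's contribution per pixel."""
    img = [[0.0] * image_size for _ in range(image_size)]
    for g in gaussians:
        for y in range(image_size):
            for x in range(image_size):
                # Additive blend: splat color weighted by its Gaussian falloff.
                img[y][x] += g["color"] * gaussian_at(x, y, g)
    return img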

Real 3D Gaussian splatting sorts Gaussians by depth and alpha-composites in order. Our 2D toy just sums.
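For contrast with the plain sum, here is a minimal sketch of depth-sorted alpha compositing for a single pixel. It assumes each splat has already been reduced to a (depth, alpha, color) tuple, where alpha is the splat's opacity times its Gaussian falloff at that pixel. The function name and the early-exit threshold are illustrative choices.

```python
def composite_pixel(splats):
    """Front-to-back alpha compositing of the splats covering one pixel.

    splats: iterable of (depth, alpha, color) tuples, in any order.
    """
    color = 0.0
    transmittance = 1.0
    for depth, alpha, c in sorted(splats, key=lambda s: s[0]):  # nearest first
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:   # assumed cutoff: pixel is effectively opaque
            break
    return color
```

Unlike the additive toy, the result here depends on depth order: an opaque near splat hides everything behind it.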

Step 3: fit by gradient descent

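for step in range(steps):
    pred = render(size, gaussians)                      # forward render
    loss = mse(pred, target)                            # mean squared error vs target
    gradients = compute_grads(pred, target, gaussians)  # dLoss / d(splat parameters)
    update(gaussians, gradients, lr)                    # one gradient-descent step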
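The loop above leaves `mse`, `compute_grads`, and `update` undefined. One way to fill them in is with analytic gradients for an isotropic grayscale splat. This is a self-contained sketch under that assumption, not the contents of code/main.py. The helper bodies, the `fit` wrapper, and all default values (image size, splat count, learning rate, sigma floor) are illustrative.

```python
import math
import random


def render(size, gaussians):
    img = [[0.0] * size for _ in range(size)]
    for g in gaussians:
        px, py = g["pos"]
        inv = 1.0 / (2 * g["sigma"] ** 2)
        for y in range(size):
            for x in range(size):
                img[y][x] += g["color"] * math.exp(-((x - px) ** 2 + (y - py) ** 2) * inv)
    return img


def mse(pred, target):
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n


def compute_grads(pred, target, gaussians):
    """Analytic gradient of the MSE w.r.t. each splat's color, position, and sigma."""
    size = len(pred)
    n = size * size
    grads = []
    for g in gaussians:
        px, py = g["pos"]
        s = g["sigma"]
        gc = gx = gy = gs = 0.0
        for y in range(size):
            for x in range(size):
                d2 = (x - px) ** 2 + (y - py) ** 2
                w = math.exp(-d2 / (2 * s * s))
                r = 2.0 * (pred[y][x] - target[y][x]) / n   # dLoss/dpred at this pixel
                gc += r * w                                  # dpred/dcolor = w
                gx += r * g["color"] * w * (x - px) / (s * s)
                gy += r * g["color"] * w * (y - py) / (s * s)
                gs += r * g["color"] * w * d2 / (s ** 3)
        grads.append({"color": gc, "pos": (gx, gy), "sigma": gs})
    return grads


def update(gaussians, grads, lr):
    for g, d in zip(gaussians, grads):
        g["color"] -= lr * d["color"]
        g["pos"] = (g["pos"][0] - lr * d["pos"][0], g["pos"][1] - lr * d["pos"][1])
        g["sigma"] = max(0.5, g["sigma"] - lr * d["sigma"])   # keep sigma positive


def fit(size=12, n_gaussians=9, steps=100, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Target: horizontal gradient from 0 (left) to 1 (right).
    target = [[x / (size - 1) for x in range(size)] for _ in range(size)]
    gaussians = [
        {"pos": (rng.uniform(0, size - 1), rng.uniform(0, size - 1)), "sigma": 3.0, "color": 0.1}
        for _ in range(n_gaussians)
    ]
    losses = []
    for step in range(steps):
        pred = render(size, gaussians)
        losses.append(mse(pred, target))
        update(gaussians, compute_grads(pred, target, gaussians), lr)
    return losses


if __name__ == "__main__":
    history = fit()
    print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Production 3DGS implementations get these gradients from custom CUDA kernels or autodiff rather than hand-written loops; the hand derivation here is only to show what the optimizer is doing.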

Pitfalls

Use It

| Task | 2026 pick |
|---|---|
| Scene reconstruction from photos | Gaussian splatting (3DGS, Gsplat, Scaniverse) |
| Text-to-3D object for games | Meshy 4 or Rodin Gen-1.5 (PBR output) |
| Image-to-3D | Hunyuan3D 2.0, TripoSR, InstantMesh |
| Novel-view synthesis from few images | CAT3D, SV3D |
| Dynamic scene reconstruction | 4D Gaussian Splatting |
| Avatar / clothed human | Gaussian Avatar, HUGS |
| Research / SOTA | Whatever dropped last week |

For shipping production 3D in a game or e-commerce pipeline: Meshy 4 or Rodin Gen-1.5 output PBR meshes that go straight into Unity / Unreal.

Ship It

Save the skill as outputs/skill-3d-pipeline.md. The skill takes a 3D brief (input: text / one image / few images; output: mesh / splat / NeRF; usage: render / game / VR) and outputs a pipeline (multi-view diffusion + fit, or a direct mesh model), base model, iteration budget, topology post-processing, and the material channels needed.

Exercises

  1. Easy. Run code/main.py with 4, 16, 64 Gaussians. Report final MSE vs target.
  2. Medium. Extend to color Gaussians (RGB). Confirm reconstruction matches the target color pattern.
  3. Hard. Using gsplat or Nerfstudio, reconstruct a real object from a 50-photo capture. Report fit time and final SSIM on held-out views.

Key Terms

| Term | What people say | What it actually means |
|---|---|---|
| 3D Gaussian Splatting | "3DGS" | Scene as a cloud of 3D Gaussians; differentiable alpha-composite render. |
| NeRF | "Neural radiance field" | MLP that outputs color + density at a 3D point; render by ray integration. |
| Triplane | "Three 2-D planes" | Factor 3D into three 2-D axis-aligned feature grids; cheaper than volumetric. |
| SDS | "Score distillation sampling" | Train 3D model by using 2D-diffusion score as pseudo-gradient. |
| Multi-view diffusion | "Many views at once" | Diffusion model that outputs a batch of consistent camera views. |
| PBR | "Physically-based rendering" | Material with albedo, roughness, metallic, normal channels. |
| Densification | "Grow splats" | 3DGS training heuristic: split / clone splats in high-gradient regions. |

Production note: 3D has no shared substrate yet

Unlike image (latent diffusion + DiT) and video (spatiotemporal DiT), 3D has no single dominant runtime in 2026. The production decision tree forks on the representation:

For most 2026 products, the right answer is "run a multi-view diffusion model on request, reconstruct to 3DGS asynchronously, serve the 3DGS for real-time viewing". This splits the workload cleanly between a GPU-inference server (fast) and an offline optimizer (slow).
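The fast/slow split can be sketched with asyncio: the request handler waits only for view generation, then hands reconstruction to a background task. Everything here is a stand-in under stated assumptions: `generate_views` and `fit_splats` are hypothetical stubs rather than real model calls, and the in-memory `jobs` dict takes the place of a real job queue.

```python
import asyncio


async def generate_views(prompt, n_views=8):
    """Stub for the multi-view diffusion call (the fast, on-request step)."""
    await asyncio.sleep(0)            # stands in for GPU inference
    return [f"{prompt}-view-{i}" for i in range(n_views)]


async def fit_splats(views):
    """Stub for the 3DGS optimizer (the slow, offline step)."""
    await asyncio.sleep(0)            # stands in for minutes of optimization
    return {"n_views": len(views), "status": "ready"}


async def handle_request(prompt, jobs):
    views = await generate_views(prompt)
    job_id = f"job-{len(jobs)}"
    # Start reconstruction in the background and return to the caller immediately.
    jobs[job_id] = asyncio.create_task(fit_splats(views))
    return job_id


async def main():
    jobs = {}
    job_id = await handle_request("red chair", jobs)
    asset = await jobs[job_id]        # a viewer would poll for this instead
    return job_id, asset
```

In a deployed system the background task would typically live in a separate worker process with a persistent queue, so that a web-server restart does not lose in-flight reconstructions.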

Further Reading