← Multi-Object Tracking & Video Memory Text Processing — Tokenization, Stemming, Lemmatization →

World Models & Video Diffusion

> A video model that predicts the next seconds of a scene is a world simulator. Condition that prediction on actions and you have a learned game engine.

Type: Learn + Build

Languages: Python

Prerequisites: Phase 4 Lesson 10 (Diffusion), Phase 4 Lesson 12 (Video Understanding), Phase 4 Lesson 23 (DiT + Rectified Flow)

Time: ~75 minutes

Learning Objectives

Explain the difference between a pure video generation model (Sora 2) and an action-conditioned world model (Genie 3, DreamerV3)
Describe a video DiT: spatio-temporal patches, 3D position encoding, joint attention across (T, H, W) tokens
Trace how a world model plugs into robotics: VLM plans → video model simulates → inverse dynamics emits actions
Pick between Sora 2, Genie 3, Runway GWM-1 Worlds, Wan-Video, and HunyuanVideo for a given use case (creative video, interactive sim, autonomous-driving synthesis)

The Problem

Video generation and world modelling converged in 2026. A model that can generate a coherent minute of video has, in some sense, learned how the world moves: object permanence, gravity, causality, style. If you condition that prediction on actions (walk left, open the door), the video model becomes a learnable simulator that can replace a game engine, a driving simulator, or a robotics environment.

The stakes are concrete. Genie 3 generates playable environments from a single image. Runway GWM-1 Worlds synthesises infinite explorable scenes. Sora 2 produces minute-long videos with synchronised audio and modelled physics. NVIDIA Cosmos-Drive, Wayve Gaia-2, and Tesla DrivingWorld generate realistic driving video for autonomous-vehicle training data. The world-model paradigm is quietly taking over sim-to-real for robotics.

This lesson is the "big picture" lesson for Phase 4. It connects image generation, video understanding, and agentic reasoning into the architecture pattern dominant research is moving toward.

The Concept

Three families of world-modelling

flowchart LR subgraph GEN["Pure video generation"] G1["Text / image prompt"] --> G2["Video DiT"] --> G3["Video frames"] end subgraph ACTION["Action-conditioned world model"] A1["Past frames + action"] --> A2["Latent-action video DiT"] --> A3["Next frames"] A3 --> A1 end subgraph RL["World models for RL (DreamerV3)"] R1["State + action"] --> R2["Latent transition model"] --> R3["Next latent + reward"] R3 --> R1 end style GEN fill:#dbeafe,stroke:#2563eb style ACTION fill:#fef3c7,stroke:#d97706 style RL fill:#dcfce7,stroke:#16a34a

Sora 2 is pure video generation conditioned on prompts. No action interface. You cannot "steer" it mid-rollout.
Genie 3, GWM-1 Worlds, Mirage / Magica are action-conditioned world models. Infer latent actions from observed video, then condition future frame predictions on actions. Interactive — you press keys or move a camera and the scene responds.
DreamerV3 and the classic RL world-model family predict in a latent space with explicit action conditioning, trained on a reward signal. Less visual; more useful for sample-efficient RL.

Video DiT architecture

Video latent:          (C, T, H, W)
Patchify (spatial):    grid of P_h x P_w patches per frame
Patchify (temporal):   group P_t frames into a temporal patch
Resulting tokens:      (T / P_t) * (H / P_h) * (W / P_w) tokens

Positional encoding is 3D: a rotary or learned embedding per (t, h, w) coordinate. Attention can be:

Full joint — all tokens attend to all tokens. O(N^2) with N tokens. Prohibitive for long videos.
Divided — alternate temporal attention (same spatial position, across time: (H*W) * T^2) and spatial attention (same timestep, across space: T * (H*W)^2). Used by TimeSformer and most video DiTs.
Window — local windows in (t, h, w). Used by Video Swin.

Every 2026 video diffusion model uses one of these three patterns plus AdaLN conditioning (Lesson 23) and rectified flow.

Conditioning on actions: latent action models

Genie learns a latent action per frame by discriminatively predicting the action between a pair of consecutive frames. The model's decoder then conditions on the inferred latent action — not on explicit keyboard keys. At inference, a user can specify a latent action (or sample one from a fresh prior) and the model generates the next frame consistent with that action.

Sora skips the action interface entirely. Its decoder predicts next spacetime tokens from past spacetime tokens. Prompt conditions the start; nothing steers it mid-generation.

Physical plausibility

Sora 2's 2026 release explicitly advertised physical plausibility: weight, balance, object permanence, cause-and-effect. Measured by the team via hand-rated plausibility scores; the model visibly improves on dropped objects, characters colliding, and failures-on-purpose (a missed jump) versus Sora 1.

Plausibility remains the dominant failure mode. 2024-2025 videos of people eating spaghetti or drinking from glasses revealed the model's lack of persistent object representation. 2026 models (Sora 2, Runway Gen-5, HunyuanVideo) reduce but do not eliminate these.

Autonomous driving world models

Driving world models generate realistic road scenes conditioned on trajectories, bounding boxes, or navigation maps. Usage:

Cosmos-Drive-Dreams (NVIDIA) — generates minutes of driving video for RL training.
Gaia-2 (Wayve) — trajectory-conditioned scene synthesis for policy evaluation.
DrivingWorld (Tesla) — simulates varied weather, time-of-day, traffic conditions.
Vista (ByteDance) — reactive driving scene synthesis.

They replace expensive real-world data collection for corner cases — pedestrian jaywalks at night, icy intersections, unusual vehicle types — that would otherwise require millions of miles of driving.

Robotics stack: VLM + video model + inverse dynamics

The emerging three-component robotics loop:

VLM parses the goal ("pick up the red cup"), plans a high-level action sequence.
Video generation model simulates what executing each action would look like — predicts observations N frames ahead.
Inverse dynamics model extracts the concrete motor commands that would produce those observations.

This replaces reward shaping and sample-heavy RL. The world model does the imagination; the inverse dynamics closes the loop on actuation. Genie Envisioner is one instantiation; many research groups are converging on this structure.

Evaluation

Visual quality — FVD (Fréchet Video Distance), user studies.
Prompt alignment — CLIPScore per frame, VQA-style evaluation.
Physical plausibility — hand-rated on a benchmark suite (Sora 2's internal benchmark, VBench).
Controllability (for interactive world models) — action → observation consistency; can you go back to a prior state?

Model landscape in 2026

Model	Use	Parameters	Output	License
Sora 2	text-to-video, audio	—	1-min 1080p + audio	API only
Runway Gen-5	text/image-to-video	—	10s clips	API
Runway GWM-1 Worlds	interactive world	—	infinite 3D rollout	API
Genie 3	interactive world from image	11B+	playable frames	research preview
Wan-Video 2.1	open text-to-video	14B	high-quality clips	non-commercial
HunyuanVideo	open text-to-video	13B	10s clips	permissive
Cosmos / Cosmos-Drive	autonomous driving sim	7-14B	driving scenes	NVIDIA open
Magica / Mirage 2	AI-native game engine	—	modifiable worlds	product

Build It

Step 1: 3D patchify for video

import torch
import torch.nn as nn


class VideoPatch3D(nn.Module):
    def __init__(self, in_channels=4, dim=64, patch_t=2, patch_h=2, patch_w=2):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels, dim,
            kernel_size=(patch_t, patch_h, patch_w),
            stride=(patch_t, patch_h, patch_w),
        )
        self.patch_t = patch_t
        self.patch_h = patch_h
        self.patch_w = patch_w

    def forward(self, x):
        # x: (N, C, T, H, W)
        x = self.proj(x)
        n, c, t, h, w = x.shape
        tokens = x.reshape(n, c, t * h * w).transpose(1, 2)
        return tokens, (t, h, w)

A 3D conv with stride equal to kernel acts as the spatio-temporal patchifier. (T, H, W) -> (T/2, H/2, W/2) grid of tokens.

Step 2: 3D rotary position encoding

Rotary Position Embeddings (RoPE) separately applied along t, h, w axes:

def rope_3d(tokens, t_dim, h_dim, w_dim, grid):
    """
    tokens: (N, T*H*W, D)
    grid: (T, H, W) sizes
    t_dim + h_dim + w_dim == D
    """
    T, H, W = grid
    n, seq, d = tokens.shape
    if t_dim + h_dim + w_dim != d:
        raise ValueError(f"t_dim+h_dim+w_dim ({t_dim}+{h_dim}+{w_dim}) must equal D={d}")
    assert seq == T * H * W
    t_idx = torch.arange(T, device=tokens.device).repeat_interleave(H * W)
    h_idx = torch.arange(H, device=tokens.device).repeat_interleave(W).repeat(T)
    w_idx = torch.arange(W, device=tokens.device).repeat(T * H)
    # Simplified: just scale channels by frequencies. Real RoPE rotates pairs.
    freqs_t = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(t_dim // 2, device=tokens.device) / (t_dim // 2))
    freqs_h = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(h_dim // 2, device=tokens.device) / (h_dim // 2))
    freqs_w = torch.exp(-torch.log(torch.tensor(10000.0)) * torch.arange(w_dim // 2, device=tokens.device) / (w_dim // 2))
    emb_t = torch.cat([torch.sin(t_idx[:, None] * freqs_t), torch.cos(t_idx[:, None] * freqs_t)], dim=-1)
    emb_h = torch.cat([torch.sin(h_idx[:, None] * freqs_h), torch.cos(h_idx[:, None] * freqs_h)], dim=-1)
    emb_w = torch.cat([torch.sin(w_idx[:, None] * freqs_w), torch.cos(w_idx[:, None] * freqs_w)], dim=-1)
    return tokens + torch.cat([emb_t, emb_h, emb_w], dim=-1)

Simplified additive form. Real RoPE rotates paired channels at frequencies; the positional information is the same.

Step 3: Divided attention block

class DividedAttentionBlock(nn.Module):
    def __init__(self, dim=64, heads=2):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, grid):
        T, H, W = grid
        n, seq, d = x.shape
        # time attention: same (h, w), across t
        xt = x.view(n, T, H * W, d).permute(0, 2, 1, 3).reshape(n * H * W, T, d)
        a, _ = self.time_attn(self.ln1(xt), self.ln1(xt), self.ln1(xt), need_weights=False)
        xt = (xt + a).reshape(n, H * W, T, d).permute(0, 2, 1, 3).reshape(n, seq, d)
        # space attention: same t, across (h, w)
        xs = xt.view(n, T, H * W, d).reshape(n * T, H * W, d)
        a, _ = self.space_attn(self.ln2(xs), self.ln2(xs), self.ln2(xs), need_weights=False)
        xs = (xs + a).reshape(n, T, H * W, d).reshape(n, seq, d)
        xs = xs + self.mlp(self.ln3(xs))
        return xs

The time attention attends within each spatial position across time; the space attention attends within each frame across positions. Two O(T^2 + (HW)^2) operations instead of one O((THW)^2). This is the core of TimeSformer and every modern video DiT.

Step 4: Compose a tiny video DiT

class TinyVideoDiT(nn.Module):
    def __init__(self, in_channels=4, dim=64, depth=2, heads=2):
        super().__init__()
        self.patch = VideoPatch3D(in_channels=in_channels, dim=dim, patch_t=2, patch_h=2, patch_w=2)
        self.blocks = nn.ModuleList([DividedAttentionBlock(dim, heads) for _ in range(depth)])
        self.out = nn.Linear(dim, in_channels * 2 * 2 * 2)

    def forward(self, x):
        tokens, grid = self.patch(x)
        for blk in self.blocks:
            tokens = blk(tokens, grid)
        return self.out(tokens), grid

Not a working video generator; a structural demo that every piece shapes correctly.

Step 5: Check shapes

vid = torch.randn(1, 4, 8, 16, 16)  # (N, C, T, H, W)
model = TinyVideoDiT()
out, grid = model(vid)
print(f"input  {tuple(vid.shape)}")
print(f"tokens grid {grid}")
print(f"output {tuple(out.shape)}")

Expect grid = (4, 8, 8) and out = (1, 256, 32) after patching; the head then projects to per-token spatio-temporal patches, ready to be un-patchified back into a video.

Use It

Production access patterns for 2026:

Sora 2 API (OpenAI) — text-to-video, synchronized audio. Premium pricing.
Runway Gen-5 / GWM-1 (Runway) — image-to-video, interactive worlds.
Wan-Video 2.1 / HunyuanVideo — open-source self-host.
Cosmos / Cosmos-Drive (NVIDIA) — driving simulation open weights.
Genie 3 — research preview, request access.

For building an interactive world-model demo: start with Wan-Video for quality, layer on a latent-action adapter for interactivity. For autonomous driving simulation: Cosmos-Drive is the 2026 open reference.

For robotics, the stack in the wild:

Language goal -> VLM (Qwen3-VL) -> high-level plan.
Plan -> latent-action video model -> imagined rollout.
Rollout -> inverse dynamics model -> low-level actions.
Actions executed -> observation fed back into step 1.

Ship It

This lesson produces:

outputs/prompt-video-model-picker.md — picks between Sora 2 / Runway / Wan / HunyuanVideo / Cosmos given task, license, and latency.
outputs/skill-physical-plausibility-checks.md — a skill that defines automated checks (object permanence, gravity, continuity) to run on any generated video before shipping.

Exercises

(Easy) Compute the token count for a 5-second 360p video at patch-t=2, patch-h=8, patch-w=8. Reason about memory for attention at this size.
(Medium) Swap the divided attention block above for a full joint attention block and measure the shape and parameter count. Explain why divided attention is necessary for real video models.
(Hard) Build a minimal latent-action video model: take a dataset of (frame_t, action_t, frame_{t+1}) triples (any simple 2D game), train a tiny video DiT conditioned on action embeddings, and show that different actions produce different next frames.

Key Terms

Term	What people say	What it actually means
World model	"Learned simulator"	A model that predicts future observations given state and action
Video DiT	"Spacetime transformer"	Diffusion transformer with 3D patchification and divided attention
Latent action	"Inferred control"	Discrete or continuous action latent inferred from frame pairs; used to condition next-frame generation
Divided attention	"Time then space"	Two attention operations per block — across time then across space — to keep O(N^2) manageable
Object permanence	"Things stay real"	Scene property that video models must learn; classic failure mode on food, glassware
FVD	"Fréchet Video Distance"	Video equivalent of FID; primary visual quality metric
Inverse dynamics model	"Observations to actions"	Given (state, next state), output the action that connects them; closes robotics loop
Cosmos-Drive	"NVIDIA driving sim"	Open-weights autonomous-driving world model for RL and evaluation