← Jamba — Hybrid SSM-Transformer Speculative Decoding and EAGLE →

Async and Hogwild! Inference

> Speculative decoding (Phase 10 · 15) parallelizes tokens within one sequence. Multi-agent frameworks parallelize across whole sequences but force explicit coordination (voting, sub-task splitting). Hogwild! Inference (Rodionov et al., arXiv:2504.06261) does something else: run N instances of the same LLM in parallel against a SHARED key-value cache. Each worker sees every other worker's generated tokens instantly. Modern reasoning models — QwQ, DeepSeek-R1 — can self-coordinate through that shared cache without any fine-tuning. The approach is experimental but it opens an entirely new axis of inference parallelism that sits orthogonal to spec decode. This lesson implements a two-worker Hogwild! simulator in stdlib Python and explains why the shared-cache collaboration emerges from the existing model's reasoning abilities.

Type: Build

Languages: Python (stdlib)

Prerequisites: Phase 10 · 12 (inference optimization), Phase 10 · 15 (speculative decoding)

Time: ~60 minutes

Learning Objectives

Describe the three common parallel-LLM topologies (voting, sub-task, Hogwild!) and name which problems each one targets.
State the core Hogwild! setup: multiple workers, one shared KV cache, emergent coordination via self-prompting.
Compute the wall-time speedup of Hogwild! as a function of worker count N, task-level parallelism p, and coordination overhead c.
Implement a two-worker Hogwild! simulator on a toy problem and observe the emergent task division.

The Problem

Modern LLMs solve hard problems by producing long chains of reasoning — 5000 tokens of step-by-step logic is common, tens of thousands of tokens happens on deep math problems. At 35 tokens/sec decode on a 70B model, 50k tokens is 24 minutes. Interactive the model is not.

Speculative decoding (Phase 10 · 15) gets you a 3-5x speedup by parallelizing within one sequence. Past that the sequential dependency of autoregressive decoding is the hard ceiling. Each new token depends on every prior token.

The obvious question: can we parallelize across sequences? Run multiple copies of the same model on the same problem, let them cooperate, have them divide the work?

Prior work: voting ensembles (run N models, pick the majority answer), tree-of-thought (branch reasoning paths and recombine), and multi-agent frameworks (assign each agent a sub-task, use a coordinator). These all help in specific task domains. They all also introduce explicit coordination machinery — voting rules, branch-and-prune logic, agent-to-agent messaging protocols.

Hogwild! Inference takes a different approach. N workers share a single KV cache. Each worker sees every other worker's generated tokens immediately, as if they were its own context. The workers — without any training or fine-tuning — figure out how to divide the work. Modern reasoning models (QwQ, DeepSeek-R1, Claude-family reasoning mode) can read the shared cache and say things like "I see worker 2 already handled the base case, so I'll work on the inductive step."

The speedup is workload-dependent and experimental as of April 2026. But the idea is worth knowing because it opens a new axis of inference parallelism.

The Concept

The setup

Initialize N worker processes, all running the same LLM. Instead of per-worker KV caches, maintain ONE shared cache. When worker i generates token t_j, the token is written into the shared cache at the next position. When worker k takes its next step, it reads the current state of the cache (which includes everything all N workers have generated so far).

At step time, workers race to write tokens. There is no per-worker position index — the cache is a single growing sequence. Order is determined by write arrival time.

Why coordination emerges

The workers share a prompt. Typically something like "You are one of N instances working together on this problem. Each instance reads the shared memory and can see what other instances have written. Avoid redundant work." The prompt plus the shared cache is enough. Reasoning models read the cache, notice which parts of the problem have already been attempted, and (often but not always) pivot to unexplored parts.

The Hogwild! paper (Rodionov et al., 2025) reports observations like:

Workers formulate plans and communicate them to other workers via the cache.
Workers notice errors in other workers' reasoning and call them out.
Workers adapt when a plan fails and propose alternatives.
When prompted to check for redundancy, workers detect it and pivot.

None of this requires fine-tuning. The emergent behavior comes from the reasoning capabilities the model already has.

The naming

The paper's name riffs on Hogwild! SGD (Recht et al., 2011), an asynchronous-update optimizer. The analogy: SGD's asynchronous workers all write to a shared parameter vector; Hogwild! Inference's workers all write to a shared KV cache. Both rely on empirical convergence rather than synchronization guarantees.

RoPE makes this tractable

Rotary Position Embeddings (RoPE, Su et al. 2021) encode position information via rotation in the Q and K vectors. Because positions are rotations and not baked-in offsets, a token's position can shift without recomputing the KV cache entry. When worker i writes into the shared cache at position p, other workers reading that position can use the cached entry directly — no re-rotation needed.

In a learned-position or absolute-position model, Hogwild! would need cache invalidation on every concurrent write. RoPE lets the cache stay stable.

Wall-time math

Let T_serial be the time for one worker to solve the problem alone. Let p be the task-level parallelizable fraction. Let c be the per-step coordination overhead (reading the extended cache, deciding what to write).

Single-worker time: T_serial.

N-worker Hogwild! time, if coordination is free: T_serial * ((1 - p) + p / N). Classic Amdahl.

With coordination overhead: T_serial * ((1 - p) + p / N) + c * steps_per_worker.

For a worker to be productive, c must be small relative to the per-step decode time. On reasoning models producing 5k+ tokens, the workers can afford hundreds of tokens of coordination overhead and still come out ahead. On short chat tasks, coordination dominates and Hogwild! is worse than serial.

Concrete example

Reasoning problem: 10k tokens of chain-of-thought. Suppose the problem has p = 0.7 parallelizable content (different proof strategies, different case analyses) and c = 200 tokens of coordination overhead per worker. With N = 4 workers:

Serial time: 10000 decode steps.
Hogwild! time: 10000 * (0.3 + 0.7 / 4) + 200 * 4 = 10000 * 0.475 + 800 = 5550 decode steps.
Speedup: 10000 / 5550 = 1.8x.

That is modest. But on longer reasoning problems (50k tokens), the coordination overhead amortizes and the speedup pushes 2.5-3x. Hogwild! is the inference equivalent of thread-level parallelism in a language that lets you write multi-threaded code naturally.

When to reach for Hogwild!

Long reasoning problems (thousands of tokens) where the task can be parallelized across independent sub-goals.
Reasoning models that have been trained to think step by step. Non-reasoning models do not self-coordinate well.
Single-node deployments with enough VRAM to hold the shared cache plus N worker processes. The cache is shared, but each worker has its own activation memory.

When not to

Short interactive chat. Coordination overhead dominates.
Tasks that don't parallelize (single linear proof, single compilation). N=1 is the max.
Non-reasoning models. No coordination emerges.
Multi-node deployments. The shared cache needs very fast cross-worker synchronization. Intra-node is fine; cross-node is a latency disaster.

The experimental status

As of April 2026, Hogwild! is a research method with an open-source PyTorch implementation. Production adoption has not happened. Three blockers:

Shared KV cache management across concurrent processes is non-trivial engineering.
Emergent coordination is task-dependent; benchmarks are still being built.
The speedups are modest compared to what speculative decoding already delivers, and the two can be combined but the combined engineering is another layer.

Worth knowing. Worth experimenting with. Not yet worth betting a product on.

Build It

code/main.py implements a toy Hogwild! simulator:

Two worker processes, each a deterministic "LLM" that produces one of several token categories (work-token, observe-token, coordinate-token) with known probabilities.
A shared cache (just a list of tokens) that both workers read and write.
A simple coordination logic: when a worker sees that the other has already produced enough work tokens in a category, it picks a different category.

The simulator runs for a fixed step budget and reports:

Total work-tokens produced.
Total wall time (number of worker steps).
Effective speedup over a single worker.
A trace of which worker wrote which token.

Step 1: the shared cache

A list that both workers append to. Simple locking (Python threading.Lock) in a real implementation; we simulate with a counter.

Step 2: the worker loop

Each worker, on each step:

Reads the current shared cache.
Decides what category of token to write based on what is already there.
Writes one token.

Step 3: the coordination heuristic

If category X already has K tokens in the cache and worker's intended category is X, worker switches to category Y. This is a toy stand-in for the reasoning-model behavior of "notice this is already covered, do something else instead."

Step 4: measured speedup

Run the simulator with N=1 worker and with N=2 workers, same total step budget. Count work-tokens produced. N=2 should produce roughly 1.5-1.8x more work-tokens because of the coordination-driven task division.

Step 5: stress the coordination

Reduce the coordination heuristic's sensitivity. Run again. Observe that without good coordination, N=2 redundantly produces the same tokens and the speedup drops below 1. This matches the paper's observation: the trick only works if the workers have the reasoning capacity to self-coordinate.

Use It

Hogwild! integration in production as of April 2026 is research-grade. The reference implementation from Yandex/HSE/IST is PyTorch-based and targets single-node multi-process setups on DeepSeek-R1 and QwQ models.

Pragmatic adoption path:

Profile your reasoning-task workload. Measure the fraction of tokens that are exploratory (multiple strategies, case analyses, search) vs linear.
If exploration dominates, run a two-worker Hogwild! experiment. Measure wall-time improvement.
If the improvement is under 1.3x, you are in the coordination-dominated regime. Revert to single-worker.
If the improvement is over 1.5x, push to N=4 and measure again. Diminishing returns typically hit around N=4-8.

Combine with speculative decoding: each Hogwild! worker can independently use spec decode. The two speedups multiply (roughly), bringing a 3x spec decode and 1.8x Hogwild! to an effective 5.4x over naive single-worker decoding.

Ship It

This lesson produces outputs/skill-parallel-inference-router.md. Given a reasoning workload profile (token budget, task parallelism profile, model family, deployment target), it routes between voting, tree-of-thought, multi-agent, Hogwild!, and speculative decoding strategies.

Exercises

Run code/main.py with the default settings. Confirm the N=2 Hogwild! configuration produces more work-tokens than the N=1 baseline in the same wall time.

Reduce the coordination heuristic's strength (set coordination_weight=0.1). Re-run. Show that speedup collapses. Explain why: the workers duplicate effort when they cannot coordinate.

Compute the expected Hogwild! speedup for a 50k-token reasoning task with p=0.8, c=500 and N=4 workers. Do the same for a 1k-token chat task with p=0.3, c=200 and N=4. Why is one a win and the other a loss?

Read the Hogwild! paper's Section 4 (preliminary evaluation). Identify the two failure modes the authors report. Describe how a better coordination prompt might mitigate each.

Combine Hogwild! with speculative decoding in the toy: each worker uses a 2-token spec-decode internally. Report the multiplicative speedup. What bookkeeping problem arises when two workers both want to extend the same shared-cache prefix?

Key Terms

Term	What people say	What it actually means
Hogwild!	"Parallel workers, shared cache"	N instances of the same LLM running concurrently with one shared KV cache; emergent coordination via self-prompting
Shared KV cache	"The coordination medium"	A single growing KV buffer that all workers read and write; enables instant token visibility across workers
Emergent coordination	"No training needed"	Reasoning-capable LLMs can read the shared cache and divide work without any fine-tuning or explicit protocol
Coordination overhead (c)	"Tokens spent orienting"	The per-worker cost of reading the extended cache and deciding what to do; must stay small vs total decode time
Parallelizable fraction (p)	"What can run in parallel"	Task-level parallelism: the fraction of the total work that is not intrinsically sequential
RoPE enables Hogwild!	"Rotary positions are shift-invariant"	Because positions are rotations, writing into a shared cache does not require recomputing prior tokens
Voting ensemble	"Run N, pick the majority"	The simplest parallel inference topology; useful for classification, less for long-form reasoning
Tree of thought	"Branch and prune"	Reasoning strategy that explores multiple branches and prunes; explicit coordination logic
Multi-agent framework	"Assign sub-tasks"	Each agent gets a role; a coordinator orchestrates; heavy protocol overhead