Inference Platform Economics — Fireworks, Together, Baseten, Modal, Replicate, Anyscale

> The 2026 inference market is no longer GPU time rental. It splits into three segments: custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and API-first marketplaces (Replicate, DeepInfra). Fireworks raised prices by $1/hr per GPU on May 1, 2026, and its $4B valuation on 10T+ tokens/day tells you the volume-driven model works. Baseten closed a $300M Series E at a $5B valuation in January 2026. The competitive positioning rule is simple: Fireworks optimizes latency, Together optimizes catalog breadth, Baseten optimizes enterprise polish, Modal optimizes Python-native DX, Replicate optimizes multimodal reach, Anyscale optimizes distributed Python. This lesson gives you a matrix you can hand a founder.

Type: Learn

Languages: Python (stdlib, toy per-call economics comparator)

Prerequisites: Phase 17 · 01 (Managed LLM Platforms), Phase 17 · 04 (vLLM Serving Internals)

Time: ~60 minutes

Learning Objectives

The Problem

You evaluated managed hyperscaler platforms. You decided you need a narrower, faster provider — Fireworks for latency, Together for breadth, Baseten for a fine-tuned custom model. Now you have six real choices and the pricing pages do not line up. Fireworks shows $/M tokens; Baseten shows $/minute; Modal shows $/second; Replicate shows $/prediction. You cannot compare them head-to-head without modeling the workload.

Worse, the business model behind each pricing page is different. Fireworks runs its own custom engine (FireAttention) on shared GPUs; the per-token rate reflects their utilization curve. Baseten gives you Truss + dedicated GPUs; per-minute reflects exclusivity. Modal is true Python serverless — per-second billing with sub-second cold starts. Same output (an LLM response), three different cost functions.
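A minimal sketch of what "modeling the workload" can look like: convert each billing unit to $/day for one workload, then back out an effective $/M tokens. The `Workload` fields, the function names, and every rate in the demo are hypothetical placeholders rather than vendor quotes, and this is not the contents of code/main.py.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requests_per_day: int
    tokens_per_request: int         # input + output, billed at one rate for simplicity
    gpu_seconds_per_request: float  # busy GPU time a single request consumes

def per_token_cost(w: Workload, usd_per_million_tokens: float) -> float:
    """$/day when billed by tokens consumed (no charge for idle)."""
    tokens = w.requests_per_day * w.tokens_per_request
    return tokens / 1_000_000 * usd_per_million_tokens

def per_minute_cost(usd_per_gpu_minute: float, gpus: int = 1) -> float:
    """$/day for dedicated GPUs: wall-clock billing, busy or idle."""
    return usd_per_gpu_minute * 60 * 24 * gpus

def per_second_cost(w: Workload, usd_per_gpu_second: float) -> float:
    """$/day when billed per busy second (ignores cold starts and idle timeouts)."""
    return w.requests_per_day * w.gpu_seconds_per_request * usd_per_gpu_second

def per_prediction_cost(w: Workload, usd_per_prediction: float) -> float:
    """$/day when billed per model invocation."""
    return w.requests_per_day * usd_per_prediction

def effective_usd_per_million(daily_cost: float, w: Workload) -> float:
    """Normalize any $/day figure back to $/M tokens for comparison."""
    tokens = w.requests_per_day * w.tokens_per_request
    return daily_cost / tokens * 1_000_000

# Hypothetical workload and rates, chosen only to exercise the functions.
demo = Workload(requests_per_day=10_000, tokens_per_request=1_000,
                gpu_seconds_per_request=2.0)
costs = {
    "per-token": per_token_cost(demo, usd_per_million_tokens=0.90),
    "per-minute": per_minute_cost(usd_per_gpu_minute=0.10),
    "per-second": per_second_cost(demo, usd_per_gpu_second=0.0015),
    "per-prediction": per_prediction_cost(demo, usd_per_prediction=0.002),
}
for name, usd_per_day in costs.items():
    eff = effective_usd_per_million(usd_per_day, demo)
    print(f"{name:15s} ${usd_per_day:8.2f}/day  ${eff:6.2f}/M tokens")
```

The point of the normalization is that the ranking depends on the workload, not on the pricing page: change `requests_per_day` by 10x and the order can flip.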

This lesson models the six and tells you when each wins.

The Concept

The three segments

Custom silicon — Groq (LPU), Cerebras (WSE), SambaNova (RDU). Typically 5-10x faster decode than a GPU-based cluster on the same model. Higher per-token price (Groq was ~$0.99/M on Llama-70B late 2025) but unbeatable for latency-sensitive use cases. Groq is the production pick for voice agents and real-time translation.

GPU platforms — Baseten, Together, Fireworks, Modal, Anyscale. Run on NVIDIA (H100, H200, B200 in 2026) or sometimes AMD. The economic layer between "raw GPU rental" (RunPod, Lambda) and "hyperscaler managed service" (Bedrock).

API-first marketplaces — Replicate, DeepInfra, OpenRouter, Fal. Broad catalog, pay-per-prediction or pay-per-second, emphasize time-to-first-call.

Fireworks — latency-optimized GPU platform

Together — breadth-optimized

Baseten — enterprise-polish-optimized

Replicate — multimodal breadth

Anyscale — Ray-native

Per-token versus per-minute — when each wins

Per-token makes sense when the workload is latency-insensitive and bursty — you only pay for what you use. Per-minute makes sense when utilization is high and predictable — you beat per-token once you're saturating the GPU.

Rough rule: for workloads above ~30% sustained utilization of a dedicated GPU, time-based billing (Baseten per-minute, Modal per-second) starts to beat per-token (Fireworks, Together). Below that, per-token wins because you avoid paying for idle.
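The crossover can be derived rather than memorized. If a dedicated GPU costs a fixed amount per hour and sustains some peak token throughput, the break-even utilization is the dedicated hourly cost divided by what the same fully loaded hour of tokens would cost at the per-token rate. The sketch below assumes a single blended per-token price and a fixed peak throughput; the function names and all three input numbers are hypothetical placeholders.

```python
def break_even_utilization(dedicated_usd_per_hour: float,
                           peak_tokens_per_second: float,
                           usd_per_million_tokens: float) -> float:
    """Utilization (0..1) above which dedicated billing beats per-token.

    A result above 1.0 means dedicated never wins at these rates.
    """
    if min(dedicated_usd_per_hour, peak_tokens_per_second,
           usd_per_million_tokens) <= 0:
        raise ValueError("all inputs must be positive")
    full_load_tokens_per_hour = peak_tokens_per_second * 3600
    per_token_usd_at_full_load = (
        full_load_tokens_per_hour / 1_000_000 * usd_per_million_tokens
    )
    return dedicated_usd_per_hour / per_token_usd_at_full_load

def cheaper_option(utilization: float, dedicated_usd_per_hour: float,
                   peak_tokens_per_second: float,
                   usd_per_million_tokens: float) -> str:
    """Name the cheaper billing model at a given sustained utilization."""
    threshold = break_even_utilization(
        dedicated_usd_per_hour, peak_tokens_per_second, usd_per_million_tokens
    )
    return "dedicated" if utilization > threshold else "per-token"

# Placeholder inputs: $4/hr GPU, 4,000 tok/s peak, $0.90/M tokens.
u_star = break_even_utilization(4.0, 4_000, 0.90)
print(f"break-even utilization: {u_star:.1%}")
print("at 10% utilization:", cheaper_option(0.10, 4.0, 4_000, 0.90))
print("at 50% utilization:", cheaper_option(0.50, 4.0, 4_000, 0.90))
```

With these placeholder inputs the crossover lands near 31%, but it is sensitive to the throughput figure: halve `peak_tokens_per_second` and the threshold doubles. Measure throughput on your own model and batch sizes before trusting any rule of thumb.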

Custom engine is the real moat

Every platform layered above vLLM and SGLang claims a custom engine: FireAttention, RayTurbo, Baseten's inference stack. Those claims shade into marketing. The honest framing is that vLLM and SGLang together represent about 80% of production open-source inference, and the differentiators at the platform layer are DX, attribution, and SLAs.

Numbers you should remember

Use It

code/main.py compares the six vendors on a synthetic workload across pricing models and reports $/day and effective $/M tokens. Run it to find the break-even between per-token and per-minute.

Ship It

This lesson produces outputs/skill-inference-platform-picker.md. Given a workload profile, SLA, and budget, it picks the primary inference platform and names the runner-up.

Exercises

  1. Run code/main.py. At what sustained utilization does Baseten (per-minute) beat Fireworks (per-token) for a 70B model on one H100? Derive the crossover yourself and compare to the rule of thumb.
  2. Your product serves image generation plus chat plus speech-to-text. Pick platforms for each modality and name the gateway pattern that unifies them.
  3. Fireworks raises prices by $1/hr on your primary model. Model the blended cost impact if 40% of your traffic moves to batch tier (50% off).
  4. A regulated customer requires SOC 2 Type II + HIPAA + dedicated GPUs. Which three platforms are viable and which one wins on FinOps?
  5. Compare cost per 1,000 predictions for Llama 3.1 70B on Fireworks serverless, Together on-demand, Baseten dedicated, and Replicate API. Which is cheapest at 10 predictions/day? At 10,000?
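As a starting point for exercise 3, a blended rate is a traffic-weighted average of the interactive and batch rates. The sketch below assumes the batch discount applies to the same hourly rate as interactive traffic, which is a simplification; `blended_rate` is a hypothetical helper and the $5/hr base rate is a placeholder to replace with your own figure.

```python
def blended_rate(base_rate: float, increase: float,
                 batch_share: float, batch_discount: float) -> float:
    """Traffic-weighted rate after a price increase and a shift to batch tier.

    batch_share:    fraction of traffic moved to the batch tier (0..1)
    batch_discount: fractional discount on batch traffic (0.5 means 50% off)
    """
    if not 0 <= batch_share <= 1 or not 0 <= batch_discount <= 1:
        raise ValueError("batch_share and batch_discount must be in [0, 1]")
    new_rate = base_rate + increase
    interactive_part = (1 - batch_share) * new_rate
    batch_part = batch_share * new_rate * (1 - batch_discount)
    return interactive_part + batch_part

# Placeholder base rate of $5/hr; the +$1/hr, 40%, and 50% come from exercise 3.
base = 5.0
after = blended_rate(base, increase=1.0, batch_share=0.4, batch_discount=0.5)
print(f"blended rate: ${after:.2f}/hr ({(after - base) / base:+.1%} vs. base)")
```

Vary `base` to see how the net impact of the increase depends on where you started.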

Key Terms

| Term | What people say | What it actually means |
|------|-----------------|------------------------|
| Custom silicon | "non-GPU chips" | Groq LPU, Cerebras WSE, SambaNova RDU; optimized for decode |
| FireAttention | "Fireworks engine" | Custom attention kernel; marketed at 4x lower latency than vLLM |
| Truss | "Baseten's format" | Model packaging manifest; dependencies + secrets + serving config |
| Per-token | "API pricing" | Charge by tokens consumed; no charge for idle time |
| Per-minute | "dedicated pricing" | Charge by wall-clock GPU time; wins at high utilization |
| Per-prediction | "Replicate pricing" | Charge per model invocation; common for image/video |
| RayTurbo | "Anyscale engine" | Proprietary inference on Ray; competes with vLLM on Ray clusters |
| Batch tier | "50% off" | Non-interactive queue at reduced rate; common on Fireworks, OpenAI |
| Fine-tuned at base rate | "Fireworks LoRA" | Charge LoRA-served requests at base model's rate (differentiator) |

Further Reading