AI Engineering from Scratch


> The 2026 inference market is no longer GPU time rental. It splits into three segments: custom silicon (Groq, Cerebras, SambaNova), GPU platforms (Baseten, Together, Fireworks, Modal), and API-first marketplaces (Replicate, DeepInfra). Fireworks raised its on-demand GPU rental price by $1/hr effective May 1, 2026, and its $4B valuation on 10T+ tokens/day suggests the volume-driven model works. Baseten closed a $300M Series E at a $5B valuation in January 2026. The competitive positioning rule is simple: Fireworks optimizes latency, Together optimizes catalog breadth, Baseten optimizes enterprise polish, Modal optimizes Python-native DX, Replicate optimizes multimodal reach, Anyscale optimizes distributed Python. This lesson gives you a matrix you can hand a founder.

Type: Learn

Languages: Python (stdlib, toy per-call economics comparator)

Prerequisites: Phase 17 · 01 (Managed LLM Platforms), Phase 17 · 04 (vLLM Serving Internals)

Time: ~60 minutes

Learning Objectives

Name the three market segments (custom silicon, GPU platforms, API-first) and map each vendor to a segment.
Explain why the "per-token" API pricing model compresses toward the serving engine's cost curve, not the hardware's.
Compute effective cost per request across at least three vendors and explain when per-minute (Baseten, Modal) beats per-token.
Identify which platform is the right default for a given workload (serverless bursty, steady high-throughput, fine-tuned variants, multimodal).

The Problem

You evaluated managed hyperscaler platforms. You decided you need a narrower, faster provider — Fireworks for latency, Together for breadth, Baseten for a fine-tuned custom model. Now you have six real choices and the pricing pages do not line up. Fireworks shows $/M tokens; Baseten shows $/minute; Modal shows $/second; Replicate shows $/prediction. You cannot compare them head-to-head without modeling the workload.

Worse, the business model behind each pricing page is different. Fireworks runs its own custom engine (FireAttention) on shared GPUs, so the per-token rate reflects its utilization curve. Baseten gives you Truss plus dedicated GPUs, so per-minute billing reflects exclusivity. Modal is true Python serverless: per-second billing, with sub-second cold starts for small models. Same output (an LLM response), three different cost functions.
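One way to make the billing units comparable is to express each as a cost per request. The sketch below does that for the four units named above (per-token, per-minute, per-second, per-prediction). The `Request` profile and every rate are placeholder assumptions chosen for illustration, not vendor quotes, and the per-minute and per-second functions deliberately ignore idle time between requests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A hypothetical request profile; every field is an assumption."""
    input_tokens: int
    output_tokens: int
    gpu_seconds: float  # wall-clock GPU time the request occupies


def per_token_cost(req: Request, usd_per_m_tokens: float) -> float:
    """Billing by tokens processed (one blended rate for input and output)."""
    return (req.input_tokens + req.output_tokens) / 1_000_000 * usd_per_m_tokens


def per_minute_cost(req: Request, usd_per_minute: float) -> float:
    """Billing by GPU minutes; idle time between requests is not modeled."""
    return req.gpu_seconds / 60 * usd_per_minute


def per_second_cost(req: Request, usd_per_second: float) -> float:
    """Billing by GPU seconds; idle time between requests is not modeled."""
    return req.gpu_seconds * usd_per_second


def per_prediction_cost(req: Request, usd_per_prediction: float) -> float:
    """Flat charge per invocation, independent of request size."""
    return usd_per_prediction


req = Request(input_tokens=1_500, output_tokens=500, gpu_seconds=3.0)

# Placeholder rates for illustration only, not quotes from any vendor.
costs = {
    "per-token": per_token_cost(req, usd_per_m_tokens=0.90),
    "per-minute": per_minute_cost(req, usd_per_minute=0.10),
    "per-second": per_second_cost(req, usd_per_second=0.001),
    "per-prediction": per_prediction_cost(req, usd_per_prediction=0.002),
}

# Print cheapest first so the ranking for this one request is visible.
for model, usd in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{model:<15} ${usd:.4f} per request")
```

The ranking changes as soon as the request profile does, which is why a single request is not enough: a dedicated GPU also bills for the seconds in which no request is running.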

This lesson models the six and tells you when each wins.

The Concept

The three segments

Custom silicon — Groq (LPU), Cerebras (WSE), SambaNova (RDU). Typically 5-10x faster decode than a GPU-based cluster on the same model. Higher per-token price (Groq was ~$0.99/M on Llama-70B late 2025) but unbeatable for latency-sensitive use cases. Groq is the production pick for voice agents and real-time translation.

GPU platforms — Baseten, Together, Fireworks, Modal, Anyscale. Run on NVIDIA (H100, H200, B200 in 2026) or sometimes AMD. The economic layer between "raw GPU rental" (RunPod, Lambda) and "hyperscaler managed service" (Bedrock).

API-first marketplaces — Replicate, DeepInfra, OpenRouter, Fal. Broad catalog, pay-per-prediction or pay-per-second, emphasize time-to-first-call.
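The segment map above is small enough to keep as a lookup table. A minimal sketch follows, using only the vendors named in this section; the names `SEGMENTS` and `segment_of` are hypothetical helpers, not part of the lesson's code.

```python
# Vendor lists are copied from the three segment descriptions above.
SEGMENTS = {
    "custom silicon": ["Groq", "Cerebras", "SambaNova"],
    "GPU platform": ["Baseten", "Together", "Fireworks", "Modal", "Anyscale"],
    "API-first marketplace": ["Replicate", "DeepInfra", "OpenRouter", "Fal"],
}

# Invert the mapping so a vendor name resolves to its segment.
VENDOR_TO_SEGMENT = {
    vendor: segment for segment, vendors in SEGMENTS.items() for vendor in vendors
}


def segment_of(vendor: str) -> str:
    """Return the segment for a vendor, or 'unknown' if it is not listed."""
    return VENDOR_TO_SEGMENT.get(vendor, "unknown")


for name in ("Groq", "Modal", "Replicate", "RunPod"):
    print(f"{name:<10} -> {segment_of(name)}")
```

RunPod resolves to "unknown" here because the text places raw GPU rental outside the three segments.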

Fireworks — latency-optimized GPU platform

FireAttention engine (custom); marketed as 4x lower latency than vLLM on equivalent configs.
Batch tier at ~50% of serverless rate for non-interactive workloads.
Fine-tuned model served at the same rate as the base model — a real differentiator versus providers that charge a premium for your LoRA.
Mid-2026: raised on-demand GPU rental by $1/hour, effective May 1, 2026. Volume pricing negotiable at scale.
Financial signal: $4B valuation, 10T+ tokens/day handled.
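The effect of the batch tier on a bill is a weighted average of the two rates. The sketch below assumes the roughly 50% discount quoted above; the function name, the $1.00/M serverless rate, and the 25% batch share are illustrative placeholders.

```python
def blended_rate(serverless_rate: float, batch_fraction: float,
                 batch_discount: float = 0.5) -> float:
    """Average rate when `batch_fraction` of traffic runs on the batch tier.

    `batch_discount` is the fractional discount on the serverless rate
    (0.5 means the batch tier costs half as much).
    """
    if not 0.0 <= batch_fraction <= 1.0:
        raise ValueError("batch_fraction must be between 0 and 1")
    batch_rate = serverless_rate * (1 - batch_discount)
    return (1 - batch_fraction) * serverless_rate + batch_fraction * batch_rate


# Hypothetical: $1.00 per million tokens serverless, 25% of traffic batched.
rate = blended_rate(serverless_rate=1.00, batch_fraction=0.25)
print(f"blended rate: ${rate:.3f} per million tokens")
```

Because the discount applies only to the batched share, moving a quarter of traffic to batch cuts the average rate by an eighth, not by half.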

Together — breadth-optimized

200+ models including open-source releases within days of upstream publication.
50-70% cheaper than Replicate on equivalent LLMs; the "AI Native Cloud" positioning is about volume and catalog.
Inference + fine-tuning + training in one API.

Baseten — enterprise-polish-optimized

Truss framework: model packaging with dependencies, secrets, serving config in one manifest.
GPU range from T4 through B200. Per-minute billing with reasonable cold-start mitigation.
SOC 2 Type II, HIPAA-ready. Common fintech and healthcare pick.
$5B valuation, January 2026 Series E ($300M from CapitalG, IVP, NVIDIA).

Modal — Python-native DX

Infrastructure-as-code in pure Python. Create an app with modal.App(...), decorate a function with @app.function(gpu="A100"), and deploy with one command.
Per-second billing. Cold starts 2-4s with pre-warming; <1s for small models.
$87M Series B at $1.1B valuation (2025). Strongest developer experience score in independent surveys.

Replicate — multimodal breadth

Pay-per-prediction. The default platform for image, video, and audio models.
Integration ecosystem (Zapier, Vercel, CMS plugins).
Less competitive on LLM per-token rates but wins on multimodal variety.

Anyscale — Ray-native

Built on Ray; RayTurbo is Anyscale's proprietary inference engine (competes with vLLM).
Best for distributed Python workloads where the inference step is one node in a larger graph.
Managed Ray clusters; tight integration with Ray AIR and Ray Serve.

Per-token versus per-minute — when each wins

Per-token makes sense when the workload is latency-insensitive and bursty — you only pay for what you use. Per-minute makes sense when utilization is high and predictable — you beat per-token once you're saturating the GPU.

Rough rule: for workloads above ~30% sustained utilization of a dedicated GPU, per-minute (Baseten, Modal) starts to beat per-token (Fireworks, Together). Below that, per-token wins because you avoid paying for idle.
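The crossover can be derived directly. If a dedicated GPU costs C dollars per hour, sustains T tokens per second when saturated, and the per-token alternative charges p dollars per million tokens, then the two bills are equal at utilization u* = C / (3600 · T · p / 10^6). The sketch below computes that value. The inputs are placeholders picked so that the result lands near the rule of thumb; they are not measurements or vendor prices, and real crossovers depend on measured throughput and quoted rates.

```python
def break_even_utilization(gpu_usd_per_hour: float,
                           peak_tokens_per_sec: float,
                           usd_per_m_tokens: float) -> float:
    """Utilization at which a dedicated GPU costs the same as per-token billing.

    A result above 1.0 means the dedicated GPU never wins at these rates.
    """
    tokens_per_hour_at_peak = peak_tokens_per_sec * 3600
    # What the per-token provider would charge for one fully saturated hour.
    per_token_usd_at_peak = tokens_per_hour_at_peak / 1_000_000 * usd_per_m_tokens
    return gpu_usd_per_hour / per_token_usd_at_peak


# Placeholder inputs (assumptions, not vendor numbers).
u_star = break_even_utilization(
    gpu_usd_per_hour=4.00,
    peak_tokens_per_sec=4_000,
    usd_per_m_tokens=0.90,
)
print(f"break-even utilization: {u_star:.1%}")
```

Halving the assumed throughput doubles the break-even utilization, so the serving engine's real tokens-per-second figure matters as much as either price.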

Custom engine is the real moat

Every platform that sits above vLLM and SGLang in the stack claims a custom engine: FireAttention, RayTurbo, Baseten's inference stack. These claims shade into marketing. The honest framing is that vLLM and SGLang together represent about 80% of production open-source inference, and the differentiators at the platform layer are DX, attribution, and SLAs.

Numbers you should remember

Fireworks GPU rental: $1/hr increase effective May 1, 2026.
Fireworks claim: 4x lower latency than vLLM on equivalent configs.
Together: 50-70% cheaper than Replicate on LLMs.
Baseten valuation: $5B (Series E, Jan 2026, $300M round).
Modal valuation: $1.1B (Series B, 2025).
Per-minute beats per-token above ~30% sustained utilization.

Use It

code/main.py compares the six vendors on a synthetic workload across pricing models. Reports $/day and effective $/M tokens. Run it to find the break-even between per-token and per-minute.
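The sketch below is not the contents of code/main.py. It is an assumed, minimal illustration of the two reported figures, $/day and effective $/M tokens, for one dedicated GPU against one per-token rate. All inputs are placeholders.

```python
def daily_report(utilization: float, gpu_usd_per_hour: float,
                 peak_tokens_per_sec: float, usd_per_m_tokens: float):
    """Return (per-token $/day, dedicated $/day, dedicated effective $/M tokens)."""
    m_tokens_per_day = utilization * peak_tokens_per_sec * 86_400 / 1_000_000
    per_token_usd_day = m_tokens_per_day * usd_per_m_tokens
    dedicated_usd_day = gpu_usd_per_hour * 24  # billed whether busy or idle
    if m_tokens_per_day > 0:
        effective = dedicated_usd_day / m_tokens_per_day
    else:
        effective = float("inf")
    return per_token_usd_day, dedicated_usd_day, effective


# Placeholder inputs (assumptions, not vendor numbers).
GPU_USD_PER_HOUR = 4.00
PEAK_TOKENS_PER_SEC = 4_000
USD_PER_M_TOKENS = 0.90

print(f"{'util':>5} {'per-token $/day':>16} {'dedicated $/day':>16} {'eff. $/M':>9}")
for u in (0.05, 0.10, 0.30, 0.60, 0.90):
    per_token, dedicated, effective = daily_report(
        u, GPU_USD_PER_HOUR, PEAK_TOKENS_PER_SEC, USD_PER_M_TOKENS
    )
    print(f"{u:>5.0%} {per_token:>16.2f} {dedicated:>16.2f} {effective:>9.2f}")
```

The dedicated column stays flat while the per-token column grows with traffic; the row where the effective $/M figure drops below the per-token rate is the break-even.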

Ship It

This lesson produces outputs/skill-inference-platform-picker.md. Given workload profile, SLA, and budget, picks the primary inference platform and names the runner-up.
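A picker of this kind can start as a simple overlap score between workload needs and each platform's stated strengths. The sketch below is a hypothetical starting point, not the skill file itself. The tag names are invented for illustration, the strengths are paraphrased from the platform descriptions in this lesson, and SLA and budget are not modeled.

```python
# Strength tags paraphrase this lesson's platform descriptions; the tag
# vocabulary itself is an assumption made for this sketch.
STRENGTHS = {
    "Fireworks": {"low_latency", "fine_tuned_lora", "batch"},
    "Together": {"catalog_breadth", "fine_tuning", "training"},
    "Baseten": {"compliance", "dedicated_gpu", "custom_model"},
    "Modal": {"python_native", "bursty_serverless"},
    "Replicate": {"multimodal", "integrations"},
    "Anyscale": {"distributed_python", "ray"},
}


def pick_platform(needs: set):
    """Return (primary, runner_up) ranked by how many needs each platform covers.

    Either slot is None when no platform covers at least one need.
    Ties keep the order in which platforms appear in STRENGTHS.
    """
    scored = [(len(needs & tags), name) for name, tags in STRENGTHS.items()]
    # sorted() is stable, so equal scores preserve dictionary order.
    ranked = [name for score, name in sorted(scored, key=lambda s: -s[0]) if score > 0]
    primary = ranked[0] if ranked else None
    runner_up = ranked[1] if len(ranked) > 1 else None
    return primary, runner_up


print(pick_platform({"compliance", "dedicated_gpu", "python_native"}))
print(pick_platform({"multimodal"}))
```

An overlap count treats every need as equally important; a fuller version would weight needs and filter out platforms that violate a hard constraint before ranking.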

Exercises

Run code/main.py. At what sustained utilization does Baseten (per-minute) beat Fireworks (per-token) for a 70B model on one H100? Derive the crossover yourself and compare to the rule of thumb.
Your product serves image generation plus chat plus speech-to-text. Pick platforms for each modality and name the gateway pattern that unifies them.
Fireworks raises prices by $1/hr on your primary model. Model the blended cost impact if 40% of your traffic moves to batch tier (50% off).
A regulated customer requires SOC 2 Type II + HIPAA + dedicated GPUs. Which three platforms are viable and which one wins on FinOps?
Compare cost per 1,000 predictions for Llama 3.1 70B on Fireworks serverless, Together on-demand, Baseten dedicated, and Replicate API. Which is cheapest at 10 predictions/day? At 10,000?

Key Terms

Term	What people say	What it actually means
Custom silicon	"non-GPU chips"	Groq LPU, Cerebras WSE, SambaNova RDU — optimized for decode
FireAttention	"Fireworks engine"	Custom attention kernel; marketed as 4x lower latency than vLLM
Truss	"Baseten's format"	Model packaging manifest; dependencies + secrets + serving config
Per-token	"API pricing"	Charge by tokens consumed; no charge for idle time
Per-minute	"dedicated pricing"	Charge by wall-clock GPU time; wins at high utilization
Per-prediction	"Replicate pricing"	Charge per model invocation; common for image/video
RayTurbo	"Anyscale engine"	Proprietary inference on Ray; competes with vLLM on Ray clusters
Batch tier	"50% off"	Non-interactive queue at reduced rate; common on Fireworks, OpenAI
Fine-tuned at base rate	"Fireworks LoRA"	Charge LoRA-served requests at base model's rate (differentiator)

Inference Platform Economics — Fireworks, Together, Baseten, Modal, Replicate, Anyscale