Model Routing as a Cost-Reduction Primitive

> A dynamic broker evaluates every request (task type, token length, embedding similarity, confidence) and sends simple queries to a cheap model, escalating complex ones to a frontier model. The cheap-first, escalate-on-low-confidence variant is called model cascading. Production case studies report 20-60% cost reduction at iso-quality across US/UK/EU deployments; a 30% routing efficiency improvement on high-volume SaaS turns into six-figure annual savings. The 2026 context is that LLM inference prices have fallen fast, with ~10x per year often quoted for a fixed quality level; by this lesson's figures a GPT-4-class token went from $20/M to ~$0.40/M between late 2022 and 2026, about 50x overall. Most of the drop is better serving stacks (Phase 17 · 04-09), not hardware. Routing is how you convert that price drop into margin without product regression. The failure mode is cheap-model drift: the route pushes 40% of traffic to a weaker model, quality drops 3-5% on reasoning tasks, and no one notices for a quarter. Gate routes by online quality metrics, not just offline eval sets.

Type: Learn

Languages: Python (stdlib, toy cascading router simulator)

Prerequisites: Phase 17 · 01 (Managed LLM Platforms), Phase 17 · 19 (AI Gateways)

Time: ~60 minutes

Learning Objectives

Explain model cascading: cheap-first with confidence check, escalate on low confidence.
Enumerate the four routing signals (task classification, prompt length, embedding similarity to known-hard set, self-confidence from first-pass).
Compute expected blended cost for a target routing split and quality-loss tolerance.
Name the drift-monitoring metric (online quality gate) that catches cheap-model creep.

The Problem

Your service costs $80k/month on GPT-5. Your analytics show 70% of queries are simple: "what time is it in Paris?" "rephrase this sentence." A Haiku-class model handles those perfectly at 3% of the cost. The remaining 30% need GPT-5's reasoning: coding, math, multi-step planning.

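If you route the 70% to the cheap model and keep the 30% on the expensive one, your bill drops by roughly two-thirds at the same product quality: 0.70 × 0.03 + 0.30 × 1.00 ≈ 0.32 of the original, assuming simple and complex queries cost about the same per request on GPT-5 and ignoring router overhead. This is routing. The hard part is building the broker without regressing quality.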
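A minimal sketch of the arithmetic behind this estimate. It assumes spend is proportional to query share (simple and complex queries cost the same per request on the frontier model) and ignores router overhead; the function name `blended_cost` is illustrative, not part of the lesson's code.

```python
def blended_cost(baseline_monthly, cheap_share, cheap_cost_ratio):
    """Monthly bill after moving `cheap_share` of spend to a cheaper model.

    cheap_cost_ratio is the cheap model's cost as a fraction of the frontier model's.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be in [0, 1]")
    frontier_share = 1.0 - cheap_share
    return baseline_monthly * (cheap_share * cheap_cost_ratio + frontier_share)


baseline = 80_000  # $/month with all traffic on the frontier model
routed = blended_cost(baseline, cheap_share=0.70, cheap_cost_ratio=0.03)
savings = 1.0 - routed / baseline
print(f"routed bill: ${routed:,.0f}/month, savings: {savings:.0%}")
# prints: routed bill: $25,680/month, savings: 68%
```

With the figures above this comes to about 68% savings before any classifier or retry cost is added back.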

The Concept

Four routing signals

Task classification: simple/complex/codegen/math/chat. Can be a rules-based classifier, a small LLM (Haiku-class at $0.25/M), or embedding similarity to labeled buckets. Output: route = cheap / balanced / frontier.

Prompt length: prompts >4K tokens often need frontier for coherence. Prompts <500 tokens usually don't.
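The first two signals can be combined in a small rules-based pre-router. The sketch below is an assumption-heavy illustration: the keyword list, the 4-characters-per-token estimate, and the function names are hypothetical; only the 4K and 500 token thresholds come from the text.

```python
import re

# Hypothetical keyword rules; a real deployment would tune these on labeled traffic.
HARD_PATTERNS = re.compile(
    r"\b(prove|derive|debug|refactor|implement|step[- ]by[- ]step|plan)\b",
    re.IGNORECASE,
)
LONG_PROMPT_TOKENS = 4000   # above this, prefer frontier
SHORT_PROMPT_TOKENS = 500   # below this, cheap is usually enough


def estimate_tokens(prompt):
    """Crude token estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(prompt) // 4)


def pre_route(prompt):
    """Return 'cheap', 'balanced', or 'frontier' from task keywords and prompt length."""
    tokens = estimate_tokens(prompt)
    if tokens > LONG_PROMPT_TOKENS or HARD_PATTERNS.search(prompt):
        return "frontier"
    if tokens < SHORT_PROMPT_TOKENS:
        return "cheap"
    return "balanced"


print(pre_route("What time is it in Paris?"))              # cheap
print(pre_route("Debug this function and explain why."))   # frontier
```

Rules like these are cheap to run and easy to audit, which makes them a reasonable starting point before training a classifier.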

Embedding similarity to known-hard set: if the query is close (cosine > 0.88) to a known-hard bucket, escalate to frontier directly.
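A sketch of the similarity check in plain Python. The 0.88 threshold is the one quoted above; the toy 3-dimensional vectors stand in for real embeddings and are made up for illustration, as are the function names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_known_hard(query_vec, hard_vecs, threshold=0.88):
    """True if the query is closer than `threshold` to any known-hard example."""
    return any(cosine(query_vec, h) > threshold for h in hard_vecs)


# Toy vectors (hypothetical values) standing in for embeddings of known-hard queries.
hard_set = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.98]]
print(is_known_hard([0.88, 0.15, 0.02], hard_set))  # True: near the first example
print(is_known_hard([0.1, 0.95, 0.1], hard_set))    # False: far from both
```

A linear scan is fine for a few hundred hard examples; a larger set would likely call for an approximate nearest-neighbor index.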

Self-confidence from first-pass: send the request to the cheap model first; if its log-probs show low confidence, it refuses, or it outputs hedging language, retry on frontier. This adds tail latency on the ~10% of traffic that escalates but saves 50%+ on the other 90%.
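One way to sketch the first-pass confidence check, assuming each model is a callable that returns the answer text plus per-token log-probs. The hedge phrase list and the 0.80 average-probability threshold are hypothetical starting points, not values from the lesson.

```python
import math

# Hypothetical hedge/refusal phrases; tune on real transcripts.
HEDGES = ("i'm not sure", "i am not sure", "i cannot", "i can't", "as an ai")


def is_confident(text, token_logprobs, min_avg_prob=0.80):
    """Accept the cheap answer only if it does not hedge and its tokens are likely."""
    lowered = text.lower()
    if any(h in lowered for h in HEDGES):
        return False
    if not token_logprobs:
        return False  # no evidence of confidence; escalate to be safe
    # Geometric-mean token probability = exp(mean log-prob).
    avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    return avg_prob >= min_avg_prob


def cascade(prompt, cheap_model, frontier_model):
    """Try the cheap model first; escalate to frontier on low confidence.

    Each model is a callable returning (text, token_logprobs).
    Returns (text, route) where route is 'cheap' or 'frontier'.
    """
    text, logprobs = cheap_model(prompt)
    if is_confident(text, logprobs):
        return text, "cheap"
    text, _ = frontier_model(prompt)
    return text, "frontier"


# Stub models for illustration only.
sure = lambda p: ("Paris is on Central European Time.", [-0.05, -0.10, -0.02])
unsure = lambda p: ("Maybe 42?", [-1.2, -0.9])
frontier = lambda p: ("Frontier answer.", [-0.01])

print(cascade("time in Paris?", sure, frontier)[1])     # cheap
print(cascade("hard math", unsure, frontier)[1])        # frontier
```

Averaging log-probs over the whole answer is a blunt instrument; it tends to work better on short, factual outputs than on long generations.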

Three patterns

Pre-route (classifier up front): ~5-10ms latency added; fastest overall.

Cascade (cheap-first, escalate on low confidence): ~1.2x median latency (cheap run plus verify), ~2x on escalated. Best quality floor.

Ensemble route (run cheap and frontier in parallel for a sample and let a reward model pick): highest quality, highest cost; use only for critical A/B tests.
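The cost side of the three patterns can be compared with a simple per-request model. This sketch assumes costs are normalized so the frontier model costs 1.0 per request, and it leaves out the reward-model cost for the ensemble; the function and parameter names are illustrative.

```python
def pattern_costs(cheap, frontier, cheap_share, escalation_rate, classifier=0.0):
    """Expected cost per request for each pattern, in the same units as the inputs."""
    return {
        # Classifier runs on every request; each request then hits exactly one model.
        "pre_route": classifier + cheap_share * cheap + (1 - cheap_share) * frontier,
        # Every request pays for the cheap run; escalated ones also pay for frontier.
        "cascade": cheap + escalation_rate * frontier,
        # Both models run on every request (reward-model cost not included).
        "ensemble": cheap + frontier,
    }


costs = pattern_costs(cheap=0.03, frontier=1.0, cheap_share=0.70, escalation_rate=0.30)
for name, cost in costs.items():
    print(f"{name:10s} {cost:.3f}")
```

With these inputs pre-route comes to 0.321, cascade to 0.330, and ensemble to 1.030 of the all-frontier cost. The cascade pays for the cheap run on every request, so it only beats pre-route on cost when its escalation rate is lower than the share pre-route would send to frontier.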

Implementation

AI gateways (Phase 17 · 19) expose routing. LiteLLM has router config with fallback and cost-routing. Portkey has guards + routing. Kong AI Gateway has plugin-based routing. OpenRouter's model marketplace exposes a recommendation API.

Dedicated routers: RouteLLM (LMSYS, open-source), Not Diamond (commercial), Prompt Mule.

The 2026 price curve

Model class	Late 2022	2026	Change
GPT-4-level quality	~$20/M	~$0.40/M	50x cheaper
Frontier (GPT-5, Claude 4)	—	~$3-10/M	new tier
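To put the first row in per-year terms, the sketch below computes the constant annual factor implied by the two endpoints. The 3.5-year span (late 2022 to 2026) is an assumption; the table does not give exact dates.

```python
def annualized_factor(start_price, end_price, years):
    """Constant per-year price-reduction factor implied by a start and end price."""
    return (start_price / end_price) ** (1.0 / years)


total = 20.0 / 0.40
per_year = annualized_factor(20.0, 0.40, years=3.5)  # 3.5 years is an assumption
print(f"total: {total:.0f}x, per year: {per_year:.1f}x")
```

Under that assumption the endpoints imply roughly 3x per year compounded, which is worth keeping in mind when comparing against faster headline rates.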

Most of the improvement is serving efficiency — the core lessons in Phase 17 · 04-09 turned into provider-side cost drops. Routing lets you capture those gains at the app layer instead of waiting for all your users to migrate to the cheap tier.

Drift is the real risk

Your route sends 40% of traffic to the cheap model. Over six months, the task distribution shifts (users get more sophisticated and ask longer questions). The router doesn't notice because its classifier was trained on Q1 data. Quality drops silently. Nobody complains loudly enough. You find out when you lose a competitor benchmark.

Gate routes by online quality metrics:

User thumbs-up / thumbs-down per route.
Automated LLM-judge on a held-out sample (5%) per route.
Escalation rate: if the cascade escalates more than 30% of requests, too much traffic is being sent to the cheap model first.
Refusal rate per route.
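The four metrics above can be folded into one per-route gate. In this sketch only the 30% escalation limit comes from the text; the refusal, LLM-judge, and thumbs-up thresholds are hypothetical defaults, and `RouteStats` is an illustrative container rather than anything from the lesson's code.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    """Counters for one route over a monitoring window."""
    requests: int
    escalations: int
    refusals: int
    thumbs_up: int
    thumbs_down: int
    judge_pass: int      # LLM-judge passes on the sampled subset
    judge_sampled: int   # size of the sampled subset (for example ~5% of requests)


def gate_alerts(stats, max_escalation=0.30, max_refusal=0.05,
                min_judge_pass=0.90, min_thumbs_up=0.80):
    """Return a list of alerts; an empty list means the route passes the gate."""
    alerts = []
    if stats.requests:
        if stats.escalations / stats.requests > max_escalation:
            alerts.append("escalation rate above limit")
        if stats.refusals / stats.requests > max_refusal:
            alerts.append("refusal rate above limit")
    if stats.judge_sampled and stats.judge_pass / stats.judge_sampled < min_judge_pass:
        alerts.append("LLM-judge pass rate below floor")
    votes = stats.thumbs_up + stats.thumbs_down
    if votes and stats.thumbs_up / votes < min_thumbs_up:
        alerts.append("thumbs-up ratio below floor")
    return alerts


healthy = RouteStats(10_000, 900, 100, 400, 50, 470, 500)
drifting = RouteStats(10_000, 3_400, 100, 300, 150, 420, 500)
print(gate_alerts(healthy))    # []
print(gate_alerts(drifting))
```

Running the gate per route and per time window, rather than on aggregate traffic, is what lets a slow shift on the cheap route show up before it reaches the overall numbers.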

Numbers you should remember

2026 routing savings at iso-quality: 20-60% case studies.
LLM price drop 2022-2026: ~10x per year is the commonly quoted rate; the GPT-4-level figures in this lesson work out to ~50x overall.
GPT-4-level 2022 vs 2026: ~$20/M → ~$0.40/M.
Cascade latency impact: ~1.2x median, ~2x escalated (~10% of traffic).

Use It

code/main.py simulates pre-route, cascade, and ensemble on a mixed workload. Reports blended cost, quality loss, and escalation rate.

Ship It

This lesson produces outputs/skill-router-plan.md. Given a workload and a quality budget, it picks a routing pattern and the signals to use.

Exercises

Run code/main.py. At what accuracy floor does cascade beat pre-route?
Your user base is 30% enterprise (complex queries), 70% free tier (simple). Design the routing split. What online metric gates it?
A route drops quality by 2% but saves 40%. Is that a ship? Depends on product — argue both.
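Implement a confidence check using token logprobs from a provider API that exposes them (OpenAI's logprobs option, for example; not every provider returns logprobs, so check before relying on this signal). What's the threshold you start with?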
Over six months, escalation rate climbs from 8% to 22%. Diagnose three causes and the fix for each.

Key Terms

Term	What people say	What it actually means
Model routing	"cost broker"	Dynamic choice of model per request
Model cascade	"cheap-first escalate"	Run cheap, fall through to frontier on low confidence
Pre-route	"classify first"	Classifier up front; no re-run
Ensemble route	"parallel pick"	Run multiple, reward-model picks best
Escalation rate	"uprouted %"	Fraction of cascade requests that escalated
RouteLLM	"LMSYS router"	OSS router library
Not Diamond	"commercial router"	SaaS model-routing product
Drift	"cheap creep"	Distribution shift without router noticing
Online quality gate	"live check"	Automated LLM-judge sampling live traffic