← Prompt Caching and Semantic Caching Economics Model Routing as a Cost-Reduction Primitive →

Batch APIs — the 50% Discount as Industry Standard

> Every major provider ships an async batch API with a 50% discount and ~24-hour turnaround. OpenAI, Anthropic, Google, and most of the inference platforms (Fireworks batch tier, Together batch) implement the same pattern. Stack batch with prompt caching and overnight pipelines drop to ~10% of synchronous-uncached cost. The rule is brutally simple: if it is not interactive, it belongs on batch. Content generation pipelines, document classification, data extraction, report generation, bulk labeling, catalog tagging — anything tolerant of 24-hour latency is money left on the table until it moves to batch. The 2026 production pattern is to triage every new LLM workload into three lanes: interactive (synchronous with caching), semi-interactive (async queue with fallback), batch (overnight, cached input stacked). Workloads that pretend to be interactive but tolerate minutes of latency waste most.

Type: Learn

Languages: Python (stdlib, toy batch-vs-sync cost simulator)

Prerequisites: Phase 17 · 14 (Prompt & Semantic Caching)

Time: ~45 minutes

Learning Objectives

Name the three provider batch APIs (OpenAI, Anthropic, Google) and the common 50% discount + 24h turnaround guarantees.
Compute the cost for stacking batch + cached-input on an overnight classification workload and compare to synchronous-uncached baseline.
Triage a workload into interactive / semi-interactive / batch and justify the lane.
Name the two traps: partial interactivity (user expects faster than 24h) and output-schema drift (batch file format differs per provider).

The Problem

Your team ships a nightly report generation pipeline. 50,000 documents, summarize each, cluster the summaries, draft an executive brief. Running synchronously it takes 4 hours at $2,000/night. You hear about batch APIs.

The batch gets you 50% off. You also enable prompt caching on the system prompt (shared across all 50k calls). Stacked, the bill drops to $180/night — ~9% of baseline. Same pipeline, three config changes.

Batch is the cheapest lever in the LLM cost toolkit that nobody pulls. The reason is mostly organizational: teams think "real-time" when the SLA actually is "by morning." This lesson is about not leaving 90% of the bill on the table.

The Concept

The three batch APIs

OpenAI Batch API: JSONL file upload with a list of requests. Promised 24-hour turnaround (usually ~2-8 hours in practice). 50% discount on input and output tokens. /v1/batches endpoint. Cache-eligible inputs also get cached-input pricing on top.

Anthropic Message Batches: JSONL upload. 24-hour turnaround. 50% discount. Supports cache_control — cache writes are explicit, reads happen automatically within the batch.

Google Vertex AI Batch Prediction: BigQuery or GCS input. Similar 50% discount for Gemini. Integrates with Vertex pipelines.

Semantic: asynchronous, not slow

Batch is "I promise to return within 24 hours" — not "this will take 24 hours." Typical P50 is 2-6 hours. Provider schedules your batch during off-peak windows when GPU inventory is underutilized.

Stack with caching

A 50k-document summarization with the same 4K-token system prompt:

Synchronous uncached: 50000 × ($input × 4000 + $output × 200) at full rates.
Synchronous cached: system prompt cached after first write; remaining 49999 get 10x cheaper input.
Batch cached: all of the above plus 50% discount on both read and write.

The stack: batch + cache = ~10% of sync uncached bill. Any workload that runs overnight and has a shared system prompt should use this.

Workload triage

Interactive — user waits for the response. TTFT matters. Synchronous call with prompt caching. Cannot batch.

Semi-interactive — user submits a task, checks back in minutes. Async queue with fallback to sync if batch not available. Think moderate-volume RAG indexing.

Batch — user expects results "by morning" or "next hour." Content pipelines, classification at scale, offline analysis. Always batch, always stack caching.

Common mistake: classifying everything as interactive because the pipeline is production. Production is not a latency spec — SLA is.

The partial-interactivity trap

Some features look interactive but tolerate 5-10 minutes. Example: a nightly customer health report with "refresh" button. User clicks refresh; wait 10 minutes is fine. Team ships it as synchronous. 50 concurrent refreshes cost 10x what batched-and-delivered-via-email would cost.

The question to ask: "What does 24-hour mean for this user?" If the answer is "they wouldn't notice," batch it.

The output-schema trap

Batch file formats differ per provider:

OpenAI: JSONL, one request per line.
Anthropic: JSONL, one message per line; response format embedded.
Vertex: BigQuery table or GCS prefix with TFRecord.

Writing "one batch client" across providers means adapter code per provider. Gateways that advertise multi-provider batch (Portkey, LiteLLM some tiers) still thin-wrap the raw format.

Numbers you should remember

Batch discount across providers: 50% flat on input + output.
Turnaround SLA: 24 hours guaranteed, 2-6 hours typical P50.
Stacked batch + cached input: ~10% of sync uncached cost.
Workload triage rule: if 24h latency acceptable, always batch.

Use It

code/main.py computes costs across sync, sync+cache, batch, and batch+cache for a 50k-document workload. Reports savings in $ and percent.

Ship It

This lesson produces outputs/skill-batch-triager.md. Given workload characteristics, triages into interactive/semi/batch and estimates savings.

Exercises

Run code/main.py. For a 100k-doc pipeline with 3K-token system prompt and 500-token output, compute the savings of full stack (batch + cache) vs sync baseline.
Pick three features in a real product you know. Triage each into interactive/semi/batch.
A user complains their report took 3 hours. Was that a batch mis-triage or a legitimate interactive? Write the decision criterion.
Your batch API return SLA is 24h but P99 is 20 hours. How do you communicate this to the user — what is the downstream system behavior on the edge case?
Compute break-even: at what shared-prefix length does batch + cache become cheaper than running overnight on your own reserved GPU?

Key Terms

Term	What people say	What it actually means
Batch API	"async discount"	50% off with 24h turnaround
JSONL	"batch format"	One JSON request per line; OpenAI/Anthropic standard
Message Batches	"Anthropic batch"	Anthropic's batch API product name
Batch prediction	"Vertex batch"	Vertex AI's batch API product
Turnaround SLA	"24h promise"	Guarantee, not typical; typical is 2-6h
Workload triage	"interactivity decision"	Interactive / semi / batch routing decision
Output schema	"response format"	Per-provider JSONL layout; not portable
Stacked discount	"batch + cache"	~10% of uncached sync bill when both apply