← FinOps for LLMs — Unit Economics and Multi-Tenant Attribution Instruction-Following as Alignment Signal →

Self-Hosted Serving Selection — llama.cpp, Ollama, TGI, vLLM, SGLang

> Four engines dominate self-hosted inference in 2026. Pick based on hardware, scale, and ecosystem. llama.cpp is fastest on CPU — widest model support, full control over quantization and threading. Ollama is the dev-laptop one-command install, ~15-30% slower than llama.cpp (Go + CGo + HTTP serialization), 3x throughput gap under prod-like load. TGI entered maintenance mode December 11, 2025 — only bug fixes, ~10% slower raw throughput than vLLM but historically top observability and HF-ecosystem integration. That maintenance status makes it a risky long-term bet — SGLang or vLLM are safer defaults for new projects. vLLM is the general-purpose production default — v0.15.1 (February 2026) adds PyTorch 2.10, RTX Blackwell SM120, H200 optimization. SGLang is the agentic multi-turn / prefix-heavy specialist — 400,000+ GPUs in production (xAI, LinkedIn, Cursor, Oracle, GCP, Azure, AWS). Hardware constraints: CPU-only → llama.cpp only. AMD / non-NVIDIA → vLLM only (TRT-LLM is NVIDIA-locked). 2026 pipeline pattern: dev = Ollama, staging = llama.cpp, prod = vLLM or SGLang. Same GGUF/HF weights throughout.

Type: Learn

Languages: Python (stdlib, engine-decision tree walker)

Prerequisites: All Phase 17 lessons covering engines (04, 06, 07, 09, 18)

Time: ~45 minutes

Learning Objectives

Pick an engine given hardware (CPU / AMD / NVIDIA Hopper / Blackwell), scale (1 user / 100 / 10,000), and workload (general chat / agent / long-context).
Name the 2026 TGI maintenance-mode status (December 11, 2025) and why it biases new projects toward vLLM or SGLang.
Describe the dev/staging/prod pipeline using the same GGUF or HF weights throughout.
Explain why "CPU only" forces llama.cpp and "AMD" excludes TRT-LLM.

The Problem

Your team starts a new self-hosted LLM project. One engineer says Ollama, another says vLLM, a third says "doesn't TGI just work out of the box?" All three are right for different contexts. None is right for all.

In 2026 the choice tree matters: hardware first, scale second, workload third. And one specific 2025 event — TGI entering maintenance mode December 11 — changes the default for new projects.

The Concept

The five engines

Engine	Best for	Notes
llama.cpp	CPU / edge / minimal deps / widest model support	Fastest on CPU, full control
Ollama	Dev laptops, single user, one-command install	15-30% slower than llama.cpp; 3x prod throughput gap
TGI	HF ecosystem, regulated industries	Maintenance mode Dec 11, 2025
vLLM	General-purpose production, 100+ users	Broad production default; v0.15.1 Feb 2026
SGLang	Agentic multi-turn, prefix-heavy workloads	400,000+ GPUs in production

Hardware-first decision

CPU only → llama.cpp. Ollama works too but is slower. No other engine is competitive on CPU.

AMD GPU → vLLM (AMD ROCm support). SGLang also works. TRT-LLM is NVIDIA-locked, so it's out.

NVIDIA Hopper (H100 / H200) → vLLM or SGLang or TRT-LLM. All three top-tier.

NVIDIA Blackwell (B200 / GB200) → TRT-LLM is the throughput leader (Phase 17 · 07). vLLM and SGLang follow close.

Apple Silicon (M-series) → llama.cpp (Metal). Ollama wraps this.

Scale-second decision

1 user / local dev → Ollama. One command, first-token in seconds.

10-100 users / small team → vLLM single-GPU.

100-10k users / production → vLLM production-stack (Phase 17 · 18) or SGLang.

10k+ users / enterprise → vLLM production-stack + disaggregated (Phase 17 · 17) + LMCache (Phase 17 · 18).

Workload-third decision

General chat / Q&A → vLLM wins on broad default.

Agentic multi-turn (tools, planning, memory) → SGLang's RadixAttention (Phase 17 · 06) dominates.

RAG with heavy prefix reuse → SGLang.

Code generation → vLLM fine; SGLang slightly better on cache.

Long context (128K+) → vLLM + chunked prefill; SGLang + tiered KV.

The TGI maintenance trap

Hugging Face TGI entered maintenance mode December 11, 2025 — only bug fixes going forward. Historically: top-tier observability, best-in-class HF-ecosystem integration (model cards, safety tools), slightly behind vLLM on raw throughput.

For new projects in 2026: default away from TGI. Existing TGI deployments can continue but should migrate eventually. SGLang and vLLM are the safer defaults.

The pipeline pattern

Dev (Ollama) → staging (llama.cpp) → prod (vLLM). Same GGUF or HF weights throughout. Engineers iterate quickly on laptops; staging mirrors production quantization; prod is the serving target.

Ollama caveat

Ollama is great for dev. It is not great for shared production: Go HTTP serialization adds overhead, concurrency management is simpler than vLLM, OpenTelemetry support lags. Use Ollama where it shines — one user, one command — and switch to vLLM for shared.

Self-hosted vs managed is a separate decision

Phase 17 · 01 (managed hyperscalers), · 02 (inference platforms) cover managed. This lesson assumes you've already decided to self-host. Reasons to self-host: data residency, custom fine-tune, total cost ownership at scale, domain model not available on hosted.

Numbers you should remember

TGI maintenance mode: December 11, 2025.
vLLM v0.15.1: February 2026; PyTorch 2.10; Blackwell SM120 support.
SGLang production footprint: 400,000+ GPUs.
Ollama throughput gap vs llama.cpp: 15-30% slower; 3x under prod load.

Use It

code/main.py is a decision-tree walker: given hardware + scale + workload, picks an engine and explains why.

Ship It

This lesson produces outputs/skill-engine-picker.md. Given constraints, picks an engine and writes the migration plan.

Exercises

Run code/main.py with your hardware / scale / workload. Does the output match your intuition?
Your infra is 12 H100s and 8 MI300X AMD. What engine? Why is TRT-LLM off the table?
A team wants to use TGI in 2026 because "it's what we know." Argue the migration case.
Ollama dev to vLLM prod: what changes in quantization, configuration, and observability?
RAG product with P99 prefix length 8K and high reuse across tenants. Pick an engine and stack it with Phase 17 · 11 + 18.

Key Terms

Term	What people say	What it actually means
llama.cpp	"the CPU one"	Widest model support, fastest on CPU
Ollama	"the laptop one"	One-command install, dev-grade throughput
TGI	"HF's serving"	Maintenance mode since Dec 2025
vLLM	"the default"	Broad production baseline 2026
SGLang	"the agentic one"	Prefix-heavy, RadixAttention
TRT-LLM	"NVIDIA-locked"	Blackwell throughput leader, NVIDIA only
GGUF	"llama.cpp format"	Bundled K-quant variants
Production-stack	"vLLM K8s"	Phase 17 · 18 reference deployment
Pipeline pattern	"dev→stage→prod"	Ollama → llama.cpp → vLLM on same weights