A/B Testing LLM Features — GrowthBook, Statsig, and the Vibes Problem

> Traditional A/B testing was not built for non-deterministic LLMs. The critical distinction: evals answer "can the model do the job?" A/B tests answer "do users care?" Both are required; shipping on vibe checks is over. What to test in 2026: prompt engineering (wording), model selection (GPT-4 vs GPT-3.5 vs OSS; accuracy vs cost vs latency), generation parameters (temperature, top-p). Real cases: a chatbot reward-model variant delivered +70% conversation length and +30% retention; Nextdoor AI subject-line experiments delivered +1% CTR after reward-function refinement; Khan Academy Khanmigo iterated on a latency-vs-math-accuracy axis. Platform split: Statsig (acquired by OpenAI for $1.1B in September 2025) — sequential testing, CUPED, all-in-one. GrowthBook — open-source, warehouse-native, Bayesian + Frequentist + Sequential engines, CUPED, SRM checks, Benjamini-Hochberg + Bonferroni corrections. You pick based on warehouse-SQL preference and whether "acquired by OpenAI" matters to your organization.

Type: Learn

Languages: Python (stdlib, toy sequential test simulator)

Prerequisites: Phase 17 · 13 (Observability), Phase 17 · 20 (Progressive Deployment)

Time: ~60 minutes

Learning Objectives

The Problem

You hand-tuned a system prompt. It feels better. You ship it. Conversion changes by noise. You blame the metric. Or you shipped a new model and conversion didn't move — did the model degrade or was the change too small to detect? You don't know, because you shipped without an A/B.

Evals answer whether the model can do a task on a labeled set. They do not answer whether users prefer the output. Only a controlled online experiment answers that, and only if the experiment has enough power, controls for non-determinism, and corrects for multiple comparisons.

The Concept

Evals vs A/B tests

Evals — offline, labeled set, judge (rubric or LLM-as-judge or human). Answer: "Is the output correct / helpful / safe on this fixed distribution?"

A/B test — online, live users, randomized. Answer: "Does the new variant move the user-level metric that matters?"

Both required. Evals catch regressions before exposure; A/B confirms product impact after.

What to test

  1. Prompt engineering — wording, system-prompt structure, examples. Metric: task success, user retention, cost/request.
  2. Model selection — GPT-4 vs GPT-3.5-Turbo vs Llama-OSS. Metric: accuracy (task) + cost/request + latency P99. Multi-objective.
  3. Generation parameters — temperature, top-p, max_tokens. Metric: task-specific (output diversity vs determinism).

CUPED — variance reduction

Controlled-experiments Using Pre-Experiment Data. Regress out pre-period variance before comparing post-period. Typical variance reduction: 30-70%. Effective sample size goes up for free.

Implementation: both Statsig and GrowthBook implement.

Sequential testing

Classical A/B assumes fixed sample size. Sequential tests ("peek-and-decide") control false-positive rate under repeated looks. Always-valid sequential procedures (mSPRT, Howard's confidence sequences) let you stop early on clear winners.

Multiple-comparison corrections

Running 20 A/B tests at 95% confidence produces one false positive by chance. Bonferroni correction tightens α per-test; Benjamini-Hochberg controls false-discovery rate. GrowthBook implements both.

SRM — sample ratio mismatch

Assignment hash randomizes users to variants. If 50/50 split delivers 47/53, something is broken — SRM check flags it. Both platforms implement.

Statsig vs GrowthBook

Statsig:

GrowthBook:

Non-determinism complicates power

Same prompt produces varying outputs. Traditional power calculations assume IID observations. With LLM non-determinism, effective sample size is lower than nominal. Multiply required sample size by ~1.3-1.5x as a safety margin.

Real case outcomes

The anti-pattern: shipping on vibes

Every senior engineer can name a feature that was shipped because "it feels better" with no A/B. Most of them regressed product metrics the team didn't notice for months. A/B is the forcing function.

Numbers you should remember

Use It

code/main.py simulates a sequential A/B test with fixed and sequential boundaries. Shows how sequential lets you stop early.

Ship It

This lesson produces outputs/skill-ab-plan.md. Given feature change, workload, baseline, picks platform, gates, sample size.

Exercises

  1. Run code/main.py. For an expected 5% lift with baseline 3% conversion, what sample size to 80% power?
  2. Pick Statsig or GrowthBook for a healthcare-regulated on-prem customer.
  3. Design an A/B that tests GPT-4 vs GPT-3.5 on cost-per-resolved-ticket. What's the primary metric, guardrail metric, secondary?
  4. Your canary passes but A/B shows -1.2% conversion. Do you ship? Write the escalation criteria.
  5. Apply CUPED to a pre-period with 60% of the variance of post. Compute the effective-sample-size boost.

Key Terms

Term What people say What it actually means
Eval "offline test" Labeled-set evaluation of model capability
A/B test "experiment" Live randomized comparison on users
CUPED "variance reduction" Pre-period regression to reduce variance
Sequential test "peek-ok test" Always-valid procedure allowing early stop
Multiple comparison "the family error" Running many tests inflates false positives
Bonferroni "tight correction" Divide α by number of tests
Benjamini-Hochberg "BH FDR" False-discovery-rate control, less conservative
SRM "bad split" Sample ratio mismatch; assignment bug
Statsig "OpenAI owned" Commercial all-in-one, acquired 2025
GrowthBook "the OSS one" MIT warehouse-native platform
mSPRT "sequential probability ratio test" Classical sequential procedure

Further Reading