A/B testing & experimentation
Deploying a model gets it live; A/B testing proves it's actually better. This lesson covers hypothesis testing, primary vs guardrail metrics, sample size, and variance reduction with CUPED.
What you'll learn
- Why online A/B testing is the real measure of a model improvement
- Primary metrics vs guardrail metrics, and how sample size determines statistical power
- Reducing variance with CUPED to reach significance faster
Before you start
Deployment strategies like canary and shadow get a new model running in production. But “running” isn’t “better.” A model with higher offline accuracy can lose in production — slower responses, weird edge cases, a metric that didn’t translate to user behavior. A/B testing is how you actually prove the new model wins, with statistics instead of hope.
The setup: split traffic, compare
Randomly split users into control (old model A) and treatment (new model B), serve each its model, and compare a metric. The question is never “is B’s number higher?” — random noise guarantees some difference — but “is B’s number higher beyond what noise would produce?” That’s a hypothesis test: you reject the null (“no difference”) only when the gap is statistically significant.
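As a minimal sketch of that hypothesis test, assuming the metric is a conversion rate, a two-proportion z-test is one common choice. The function name and the counts below are hypothetical, chosen only for illustration; other metrics (for example revenue per session) would call for a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null
    # hypothesis ("no difference") is true.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Probability of a gap at least this large arising from noise alone.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical counts: control converts 500 of 10,000 users,
# treatment converts 540 of 10,000.
diff, z, p_value = two_proportion_z_test(500, 10_000, 540, 10_000)
alpha = 0.05  # significance level, fixed before the experiment starts
print(f"diff={diff:.4f}  z={z:.2f}  p={p_value:.3f}  reject null: {p_value < alpha}")
```

With these made-up counts B's rate is higher, yet the p-value stays above 0.05, so the null is not rejected. The same rates measured on ten times as many users would clear the threshold.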
The trap is sample size. A true 8% lift is invisible until you’ve collected enough data for it to clear the noise band. As the sample grows, the confidence interval around the measured lift tightens until it no longer includes zero.
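To see how much data an 8% lift might need, here is a rough sketch of the standard normal-approximation sample-size formula for comparing two proportions. The 5% baseline conversion rate, the 5% significance level, and the 80% power target are assumptions for illustration, not values from this lesson.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift
    in a conversion rate with a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# Assumed 5% baseline conversion and the 8% relative lift from the text.
print(sample_size_per_arm(baseline_rate=0.05, relative_lift=0.08))
```

Under these assumptions the answer lands in the tens of thousands of users per arm. Because the required sample scales with the inverse square of the absolute difference, halving the lift you want to detect roughly quadruples the traffic you need.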
Primary metrics vs guardrail metrics
Optimizing one metric blindly is how you ship a disaster. So every experiment has two kinds:
- Primary metric — the thing you’re trying to improve (conversion, click-through, revenue per session).
- Guardrail metrics — things that must not get worse, even if the primary improves: latency, error rate, unsubscribes, infra cost. A new model that lifts conversion 2% but doubles p99 latency is not a winner.
A result ships only if the primary metric wins and no guardrail breaks.
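The ship rule can be sketched as a small decision function. Everything here is hypothetical: the `GuardrailResult` type, the thresholds, and the simplifying assumption that every guardrail is a "lower is better" metric compared against a fixed tolerance. A production check would usually also test each guardrail statistically rather than compare point estimates.

```python
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    name: str
    relative_change: float       # treatment vs control; positive = metric went up
    max_allowed_increase: float  # tolerated relative degradation


def should_ship(primary_lift, primary_p_value, guardrails, alpha=0.05):
    """Return (ship?, names of broken guardrails)."""
    # The primary metric must improve, and the improvement must be significant.
    primary_wins = primary_lift > 0 and primary_p_value < alpha
    # Any guardrail that degraded past its tolerance blocks the launch.
    broken = [g.name for g in guardrails
              if g.relative_change > g.max_allowed_increase]
    return primary_wins and not broken, broken


# The example from the text: conversion up 2%, but p99 latency doubled.
guardrails = [
    GuardrailResult("p99_latency", relative_change=1.00, max_allowed_increase=0.05),
    GuardrailResult("error_rate", relative_change=0.00, max_allowed_increase=0.01),
]
ship, broken = should_ship(primary_lift=0.02, primary_p_value=0.01,
                           guardrails=guardrails)
print(ship, broken)
```

Here the primary metric wins, but the latency guardrail is broken, so the function declines to ship and names the offending metric.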
CUPED — get to significance faster
You don’t always have the traffic to wait. CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by adjusting each user’s metric using a pre-experiment covariate, typically their own behavior before the test. Because a heavy user was always going to convert more, removing that predictable component shrinks the noise. The variance falls by a factor of 1 − ρ², where ρ is the correlation between the covariate and the metric, so a correlation of about 0.7 cuts the needed sample size by roughly 50% for the same power.
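A minimal sketch of the CUPED adjustment on synthetic data follows. The data-generating numbers are invented for illustration, and `cuped_adjust` is a hypothetical helper, not a library function. The adjusted metric is y − θ(x − mean(x)), with θ = cov(x, y) / var(x), where x is the pre-experiment covariate and y is the in-experiment metric.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Remove the part of metric y that is predictable from covariate x."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    # Centering x keeps the mean of the adjusted metric unchanged,
    # so the estimated treatment effect is not biased by the adjustment.
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


# Synthetic users: in-experiment activity depends on pre-experiment activity.
random.seed(0)
pre = [random.gauss(10, 3) for _ in range(5_000)]
post = [0.8 * xi + random.gauss(0, 2) for xi in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.2f}")
print(f"variance after:  {variance(adjusted):.2f}")
```

In a real experiment, θ would typically be estimated once on the pooled control and treatment data and then applied to both arms. The covariate must be measured before assignment so that the treatment cannot influence it.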
Next
A/B testing closes the deploy loop. Next, the monitoring that triggers the next model: drift and retraining.
Practice this in an interview
Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.
CUPED (Controlled-experiment Using Pre-Experiment Data) removes variance in the outcome metric that is explained by a pre-experiment covariate — typically the same metric measured before the experiment. This makes the residual variance smaller, which is equivalent to running a more powerful test or reaching significance faster with the same sample.
A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.
A rigorous A/B test requires a pre-registered hypothesis, a single primary metric, sample size calculated before launch, random unit-level assignment, and a fixed runtime. Skipping any of these steps opens the door to false positives and post-hoc rationalization.