datarekha

A/B testing & experimentation

Deploying a model gets it live; A/B testing proves it's actually better. Hypothesis testing, primary vs guardrail metrics, sample size, and variance reduction with CUPED.

7 min read Intermediate MLOps Lesson 16 of 28

What you'll learn

  • Why online A/B testing is the real measure of a model improvement
  • Primary metrics vs guardrail metrics, plus sample size and statistical power
  • Reducing variance with CUPED to reach significance faster

Deployment strategies like canary and shadow get a new model running in production. But “running” isn’t “better.” A model with higher offline accuracy can lose in production — slower responses, weird edge cases, a metric that didn’t translate to user behavior. A/B testing is how you actually prove the new model wins, with statistics instead of hope.

The setup: split traffic, compare

Randomly split users into control (old model A) and treatment (new model B), serve each its model, and compare a metric. The question is never “is B’s number higher?” — random noise guarantees some difference — but “is B’s number higher beyond what noise would produce?” That’s a hypothesis test: you reject the null (“no difference”) only when the gap is statistically significant.
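For a conversion-style metric, one common way to run that test is a two-proportion z-test. The sketch below is illustrative only: the function name and the counts are made up, and it assumes each user converts at most once and that samples are large enough for the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under the null ("no difference") both arms share one conversion rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: the same observed rates (10.0% vs 10.8%) at two sample sizes.
z_small, p_small = two_proportion_ztest(100, 1_000, 108, 1_000)
z_large, p_large = two_proportion_ztest(10_000, 100_000, 10_800, 100_000)

print(f"1k users per arm:   z={z_small:.2f}, p={p_small:.3f}")
print(f"100k users per arm: z={z_large:.2f}, p={p_large:.2e}")
```

With 1,000 users per arm the gap is indistinguishable from noise; with 100,000 per arm the same gap is clearly significant.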

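The trap is sample size. A true 8% lift is invisible until you’ve collected enough data for it to clear the noise band. The confidence interval around the observed lift narrows in proportion to 1/√n, so the result becomes significant only once the sample is large enough for that interval to exclude zero.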
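To put a number on "enough data", here is a rough sample-size sketch using the standard normal-approximation formula for comparing two proportions. The 10% baseline conversion rate, the 5% significance level, and the 80% power are assumptions chosen for illustration, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a relative lift in a rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n_8pct = sample_size_per_arm(0.10, 0.08)  # the 8% lift from the text
n_4pct = sample_size_per_arm(0.10, 0.04)  # a lift half that size

print(n_8pct, n_4pct)
```

Under these assumptions the 8% lift needs about 23,000 users per arm, and halving the lift roughly quadruples that, because the required sample scales with 1/(effect size)².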

Primary metrics vs guardrail metrics

Optimizing one metric blindly is how you ship a disaster. So every experiment tracks two kinds of metric:

  • Primary metric — the thing you’re trying to improve (conversion, click-through, revenue per session).
  • Guardrail metrics — things that must not get worse, even if the primary improves: latency, error rate, unsubscribes, infra cost. A new model that lifts conversion 2% but doubles p99 latency is not a winner.

A result ships only if the primary metric wins and no guardrail breaks.
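That ship rule can be sketched as a small decision function. Everything here is hypothetical: the `MetricResult` type, the tolerance values, and the confidence intervals are invented for illustration. It also assumes one particular, conservative reading of "no guardrail breaks", namely that the harmful end of each guardrail's interval must stay within a tolerance.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    ci_low: float   # lower bound of the CI on relative change (B vs A)
    ci_high: float  # upper bound of the same CI
    higher_is_better: bool = True


def primary_wins(m: MetricResult) -> bool:
    # Win only if the whole interval sits on the "better" side of zero.
    return m.ci_low > 0 if m.higher_is_better else m.ci_high < 0


def guardrail_holds(m: MetricResult, tolerance: float) -> bool:
    # Worst plausible degradation is the harmful end of the interval.
    worst_case = -m.ci_low if m.higher_is_better else m.ci_high
    return worst_case <= tolerance


def should_ship(primary, guardrails) -> bool:
    # guardrails is a list of (MetricResult, tolerance) pairs.
    return primary_wins(primary) and all(
        guardrail_holds(m, tol) for m, tol in guardrails
    )


conversion = MetricResult("conversion", ci_low=0.005, ci_high=0.035)
slow_p99 = MetricResult("p99_latency", 0.80, 1.20, higher_is_better=False)
ok_p99 = MetricResult("p99_latency", -0.01, 0.02, higher_is_better=False)

blocked = should_ship(conversion, [(slow_p99, 0.05)])  # latency roughly doubled
shipped = should_ship(conversion, [(ok_p99, 0.05)])    # latency within tolerance
print(blocked, shipped)
```

The conversion win is identical in both calls; only the latency guardrail decides whether the model ships.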

CUPED — get to significance faster

You don’t always have the traffic to wait. CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by adjusting each user’s metric using a pre-experiment covariate, typically their own behavior before the test. Because a heavy user was always going to convert more, removing that predictable component shrinks the noise. The variance falls by a factor of 1 − ρ², where ρ is the correlation between the covariate and the metric, so a covariate with ρ ≈ 0.7 roughly halves the sample size needed for the same power.
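A minimal sketch of the CUPED adjustment on synthetic data, using only the standard library. The data-generating numbers are arbitrary assumptions chosen so that pre-experiment and in-experiment behavior are correlated, and `cuped_adjust` is a hypothetical helper name.

```python
import random
from statistics import fmean, pvariance


def cuped_adjust(y, x):
    """Return y with the part predictable from the covariate x removed."""
    x_bar, y_bar = fmean(x), fmean(y)
    cov_xy = fmean([(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)])
    theta = cov_xy / pvariance(x)  # regression slope of y on x
    # Centering x keeps the mean of the adjusted metric unchanged.
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


rng = random.Random(0)
# Synthetic users: pre-experiment activity, then correlated in-experiment activity.
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.8 * xi + rng.gauss(2, 2) for xi in pre]

adjusted = cuped_adjust(post, pre)
var_raw = pvariance(post)
var_cuped = pvariance(adjusted)

print(f"variance before: {var_raw:.2f}, after CUPED: {var_cuped:.2f}")
```

In a real experiment the slope is usually estimated once on the pooled control and treatment data and applied to both arms, and the covariate must be measured before assignment so the treatment cannot influence it.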

Quick check

Q1. Your new model has higher offline accuracy. Why still run an online A/B test?
Q2. What's the role of guardrail metrics?
Q3. What does CUPED do?

Next

A/B testing closes the deploy loop. Next, the monitoring that triggers the next model: drift and retraining.


Practice this in an interview

How do you decide if a new model is actually better in production?

Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.

What is CUPED and how does it reduce variance in A/B tests?

CUPED (Controlled-experiment Using Pre-Experiment Data) removes variance in the outcome metric that is explained by a pre-experiment covariate, typically the same metric measured before the experiment. The residual variance is smaller, which means more statistical power at the same sample size, or the same power with fewer samples.
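Written out, the adjustment looks like this. The symbols are notation chosen here rather than taken from the lesson: Y is the in-experiment metric, X the pre-experiment covariate, and ρ their correlation.

```latex
\begin{aligned}
Y^{\text{cuped}}_i &= Y_i - \theta\,(X_i - \bar{X}), \\
\theta &= \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)}, \\
\operatorname{Var}\!\left(Y^{\text{cuped}}\right) &= \operatorname{Var}(Y)\,(1 - \rho^2).
\end{aligned}
```

The last line is why the choice of covariate matters: the stronger the correlation with the outcome, the larger the variance reduction.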

When would you use a multi-armed bandit or shadow deployment instead of a fixed A/B test?

A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.
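To make the bandit side of that contrast concrete, here is a toy Thompson-sampling simulation, one common bandit strategy among several. The conversion rates are invented, and in a real system they would be unknown; the simulation only uses them to generate outcomes.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw a plausible rate for each arm from its Beta posterior
        # (uniform Beta(1, 1) prior), then serve the arm with the best draw.
        samples = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls


# Hypothetical arms: model A converts at 10%, model B at 15%.
pulls = thompson_sampling([0.10, 0.15], rounds=5_000)
print(pulls)
```

Traffic drifts toward the stronger arm as evidence accumulates, which is exactly what lowers regret and also what makes the final comparison less clean than a fixed split.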

How do you design an A/B test from scratch?

A rigorous A/B test requires a pre-registered hypothesis, a single primary metric, sample size calculated before launch, random unit-level assignment, and a fixed runtime. Skipping any of these steps opens the door to false positives and post-hoc rationalization.
