How do you decide if a new model is actually better in production?
Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.
How to think about it
The short answer
Offline accuracy rarely tells you whether a model moves the business. To know if a new model is actually better, run a controlled online experiment (A/B test): randomly split live traffic between the current champion and the challenger, and compare a pre-registered business metric with a proper significance test. Ship only if the lift is both statistically and practically significant.
Why online, not offline
There’s a well-known gap between offline metrics and real-world impact. A model with higher AUC can lose on revenue, retention, or latency. A/B testing is the reality check that measures the thing you actually care about under real user behavior and feedback loops.
How to run it properly
- Pick one primary metric (e.g., conversion, revenue per session) and define it before you start.
- Randomize assignment. Early on, the split is often skewed (e.g., 90/10 champion/challenger) to limit risk.
- Power the test: compute the sample size / duration needed to detect your minimum meaningful effect; don’t peek and stop early.
- Run a significance test and report a confidence interval, not just a p-value.
- Watch guardrails: latency, error rate, and counter-metrics — a model that lifts clicks but tanks p99 latency isn’t a win.
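The "power the test" step can be sketched with the standard normal-approximation formula for comparing two proportions. This is a minimal sketch, not a definitive calculator: the function name `sample_size_per_arm`, the 5% baseline conversion, and the 0.5 percentage point minimum effect are illustrative assumptions, and it assumes an equal 50/50 split (a skewed split such as 90/10 needs more total traffic for the same power).

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per arm for a two-sided two-proportion z-test.

    p_base  : baseline conversion rate of the champion (assumed known)
    mde_abs : minimum detectable effect, as an absolute difference in rates
    Assumes equal traffic in both arms.
    """
    p_new = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 5% baseline conversion, detect a 0.5 point absolute lift.
n = sample_size_per_arm(p_base=0.05, mde_abs=0.005)
print(f"about {n} users per arm")
```

With these inputs the answer is on the order of 31,000 users per arm. Dividing by daily eligible traffic gives the test duration to commit to before launch.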
Speeding it up: CUPED
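Variance-reduction techniques like CUPED (Controlled-experiment Using Pre-Experiment Data) adjust each user's metric with a covariate measured before the experiment, typically the same metric in the pre-period. Because that covariate cannot be affected by the treatment, the adjustment shrinks variance without biasing the estimated lift, so you reach significance with less traffic or time. That is valuable when traffic is limited.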
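A minimal sketch of the CUPED adjustment, assuming one pre-experiment covariate per user and synthetic data. The helper name `cuped_adjust` and the simulated numbers are illustrative, not from any particular library. The adjusted metric is y - theta * (x - mean(x)), with theta = cov(x, y) / var(x) estimated on data pooled across both arms.

```python
import random
from statistics import fmean, variance


def cuped_adjust(y, x):
    """Return CUPED-adjusted outcomes: y_i - theta * (x_i - mean(x)).

    y : in-experiment metric per user
    x : the same user's pre-experiment covariate (must predate assignment)
    """
    mx, my = fmean(x), fmean(y)
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


# Synthetic users whose in-experiment metric correlates with their pre-period metric.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5000)]
post = [0.8 * p + rng.gauss(0, 1) for p in pre]

adjusted = cuped_adjust(post, pre)
print(f"variance before: {variance(post):.2f}, after: {variance(adjusted):.2f}")
```

The adjustment leaves the overall mean unchanged and removes the share of variance explained by the covariate. After adjusting, you compare arm means on the adjusted values exactly as you would on the raw metric.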
Concrete example
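The challenger shows +0.4% conversion after a week, but the CI is [-0.1%, +0.9%]. The interval includes zero, so the lift is not statistically significant. You can run longer (if the test plan allows it), apply CUPED, or conclude that the test did not detect an improvement. Don't ship on the point estimate alone.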
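The example above can be reproduced with a simple Wald interval for a difference in conversion rates. This is a sketch under stated assumptions: the counts (28,000 users per arm, 2,800 vs. 2,912 conversions) are made up to roughly match the numbers in the example, the lift is read as an absolute difference in percentage points, and a 95% confidence level is assumed.

```python
import math
from statistics import NormalDist


def diff_in_proportions(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Difference in conversion rate (B minus A) with a Wald CI and two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value


# Hypothetical counts: champion (A) vs. challenger (B).
diff, (ci_low, ci_high), p_value = diff_in_proportions(2800, 28000, 2912, 28000)
print(f"lift = {diff:+.2%}, CI = [{ci_low:+.2%}, {ci_high:+.2%}], p = {p_value:.3f}")
```

With these counts the interval runs from roughly -0.1 to +0.9 percentage points and the p-value is above 0.05, which matches the "not significant" reading.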
Common follow-up / trap
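The classic trap is peeking: checking results repeatedly and stopping as soon as they look good inflates false positives. Mention committing to a fixed horizon, or using a sequential testing method designed for repeated looks. Another trap is ignoring novelty effects, which can fake an early lift, and network interference, where users in one arm affect outcomes in the other and bias the comparison.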
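A small simulation makes the peeking problem concrete. This is an illustrative sketch, not a recipe: it simulates A/A tests (no true difference), models each new batch of data as one standard-normal increment to the test statistic, and compares "stop the first time |z| > 1.96 at any of 10 looks" against "test once at the end". The function name and parameters are assumptions chosen for the demo.

```python
import math
import random


def false_positive_rates(n_sims=4000, n_looks=10, z_crit=1.96, seed=0):
    """Simulate A/A tests and return (rate_with_peeking, rate_at_fixed_horizon)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        total = 0.0
        crossed = False
        for look in range(1, n_looks + 1):
            total += rng.gauss(0, 1)     # one more batch of null data
            z = total / math.sqrt(look)  # z-statistic on all data so far
            if abs(z) > z_crit:
                crossed = True           # a peeker would have stopped and shipped
        peeking_hits += crossed
        fixed_hits += abs(z) > z_crit    # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking: {peeking_rate:.1%}, fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate stays near the nominal 5%, while the peeking rate is several times higher, even though there is no real effect in any simulated test.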