MLOps Medium

How do you decide if a new model is actually better in production?

For: ML Engineer, MLOps Engineer, Data Scientist

The short answer

Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.

How to think about it

The short answer

Offline accuracy rarely tells you whether a model moves the business. To know if a new model is actually better, run a controlled online experiment (A/B test): randomly split live traffic between the current champion and the challenger, and compare a pre-registered business metric with a proper significance test. Ship only if the lift is both statistically and practically significant.

Why online, not offline

There’s a well-known gap between offline metrics and real-world impact. A model with higher AUC can lose on revenue, retention, or latency. A/B testing is the reality check that measures the thing you actually care about under real user behavior and feedback loops.

How to run it properly

Pick one primary metric (e.g., conversion, revenue per session) and define it before you start.
Randomize assignment, usually with a skewed split early on (e.g., 90/10 in favor of the champion) to limit risk.
Power the test: compute the sample size / duration needed to detect your minimum meaningful effect; don’t peek and stop early.
Run a significance test and report a confidence interval, not just a p-value.
Watch guardrails: latency, error rate, and counter-metrics — a model that lifts clicks but tanks p99 latency isn’t a win.
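
The assignment and power steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the function names, the experiment key, and the example rates are all hypothetical, and the sample-size formula is the standard normal approximation for comparing two conversion rates with a 50/50 split. A skewed split such as 90/10 needs more total traffic, because the smaller arm limits precision.

```python
import hashlib
import math
from statistics import NormalDist


def assign_variant(user_id: str, experiment: str, challenger_share: float = 0.10) -> str:
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "challenger" if bucket < challenger_share else "champion"


def sample_size_per_arm(baseline_rate: float, min_abs_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH arm of a 50/50 test on a conversion rate."""
    p1 = baseline_rate
    p2 = baseline_rate + min_abs_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / min_abs_lift ** 2)


# Hypothetical inputs: 5% baseline conversion, +0.4 points is the
# smallest lift worth shipping.
needed = sample_size_per_arm(baseline_rate=0.05, min_abs_lift=0.004)
print(f"users needed per arm: {needed}")
print(assign_variant("user-123", "ranker-v2"))
```

Hashing the user ID together with the experiment name keeps a user in the same arm across sessions, while different experiments get independent splits. Dividing the required sample size by daily eligible traffic gives the planned duration, which is the horizon to commit to before looking at results.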

Speeding it up: CUPED

Variance-reduction techniques like CUPED (Controlled experiment Using Pre-Experiment Data) use pre-period covariates to shrink variance, so you reach significance with less traffic or time — valuable when traffic is limited.
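
As a rough sketch of the idea, CUPED subtracts from each user's metric the part that is predictable from a pre-experiment covariate: adjusted = y - theta * (x - mean(x)), with theta = cov(y, x) / var(x). The data below is synthetic and the helper names are made up for illustration. In a real test, theta would typically be estimated on the pooled data from both arms and the same theta applied to each.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x))."""
    theta = covariance(metric, pre_metric) / variance(pre_metric)
    pre_mean = mean(pre_metric)
    return [y - theta * (x - pre_mean) for y, x in zip(metric, pre_metric)]


# Synthetic data (an assumption for this sketch): each user's in-experiment
# spend is correlated with their pre-experiment spend.
rng = random.Random(42)
pre = [rng.gauss(10.0, 3.0) for _ in range(5000)]
post = [0.8 * x + rng.gauss(2.0, 1.5) for x in pre]

adjusted = cuped_adjust(post, pre)
print(f"mean before / after:     {mean(post):.3f} / {mean(adjusted):.3f}")
print(f"variance before / after: {variance(post):.2f} / {variance(adjusted):.2f}")
```

The adjustment leaves the mean unchanged and removes the variance explained by the covariate, so the stronger the correlation between pre-period and in-experiment behavior, the larger the saving.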

Concrete example

The challenger shows +0.4% conversion after a week, but the confidence interval is [-0.1%, +0.9%]. Because the interval includes zero, the lift is not statistically significant. You either run longer, apply CUPED, or conclude it’s not better. Don’t ship on the point estimate alone.
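
Here is a minimal sketch of how such an interval could be computed and turned into a decision. The counts are invented for illustration (15,000 users per arm, 750 vs. 810 conversions) and chosen to land close to the example. It also assumes the +0.4% is an absolute difference in conversion rate and that a 95% Wald interval is used, neither of which the example states. The `decide` helper and its thresholds are hypothetical.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute lift (B minus A) in conversion rate with a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    lift = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, lift - z * se, lift + z * se


def decide(lift, low, high, min_practical_lift):
    """Require both statistical and practical significance before shipping."""
    if low <= 0 <= high:
        return "inconclusive"  # interval includes zero
    if high < 0:
        return "worse"
    return "ship" if lift >= min_practical_lift else "too small to matter"


# Illustrative counts only: champion 750/15000, challenger 810/15000.
lift, low, high = lift_confidence_interval(750, 15000, 810, 15000)
print(f"lift = {lift:+.2%}, 95% CI = [{low:+.2%}, {high:+.2%}]")
print(decide(lift, low, high, min_practical_lift=0.003))
```

With these counts the interval comes out at roughly -0.10% to +0.90%, so the decision is "inconclusive" even though the point estimate looks attractive.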

Common follow-up / trap

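The classic trap is peeking: checking results repeatedly and stopping as soon as they look good inflates the false-positive rate. Mention a fixed horizon or a sequential testing procedure as the fix. Another trap is ignoring novelty effects, which can fake an early lift that fades, and network interference, where users in one arm influence users in the other and bias the comparison.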
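
A small simulation can make the cost of peeking tangible. The sketch below runs many A/A tests (both arms identical, so any "significant" result is a false positive) and compares two rules: stopping at the first look where |z| > 1.96, versus testing once at the planned end. The setup is an assumption for illustration: a metric with known unit variance, 10 evenly spaced looks, and arbitrary simulation counts.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% threshold


def run_aa_test(rng, looks=10, users_per_look=500):
    """Simulate one A/A test; return (peeking_rejects, fixed_horizon_rejects)."""
    sum_a = sum_b = 0.0
    n = 0
    peeking_rejects = False
    z = 0.0
    for _ in range(looks):
        # The sum of k independent N(0, 1) values is N(0, k), so each
        # batch total is drawn directly instead of user by user.
        sum_a += rng.gauss(0.0, math.sqrt(users_per_look))
        sum_b += rng.gauss(0.0, math.sqrt(users_per_look))
        n += users_per_look
        # Difference in means is (sum_b - sum_a) / n with standard
        # error sqrt(2 / n), which simplifies to the z below.
        z = (sum_b - sum_a) / math.sqrt(2 * n)
        if abs(z) > Z_CRIT:
            peeking_rejects = True  # a peeker would have stopped here
    return peeking_rejects, abs(z) > Z_CRIT


def false_positive_rates(simulations=4000, seed=7):
    rng = random.Random(seed)
    peek = fixed = 0
    for _ in range(simulations):
        p, f = run_aa_test(rng)
        peek += p
        fixed += f
    return peek / simulations, fixed / simulations


peek_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking:    {peek_rate:.1%}")
print(f"false positives, fixed horizon:  {fixed_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the peeking rate comes out several times higher. Sequential testing procedures exist to allow interim looks while keeping the overall error rate controlled.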
