MLOps | Medium | Asked at Netflix, Lyft, Twitter, Instacart

How do you safely roll back a model in production and what triggers a rollback?

The short answer

A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.

How to think about it

Rollback is not a failure mode to avoid — it is a feature to build deliberately. Every model deployment should ship with a rollback plan decided before the deploy, not during an incident.

What triggers a rollback

Hard automated triggers should be defined before deployment and evaluated continuously:

Serving error rate exceeds a threshold (e.g., 5%) within the first 15 minutes of a canary deploy.
p99 inference latency exceeds SLA.
Business KPI (conversion rate, click-through rate) drops more than X% vs. the control arm in an A/B test with sufficient statistical power.
Offline evaluation metric (AUC, RMSE) of the new model on a fresh holdout is worse than the current champion.
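As a rough illustration, the live triggers in the list above can be written as a single check that runs against each metrics window. This is a minimal sketch, not a reference implementation: the names `CanaryMetrics`, `RollbackThresholds`, and `breached_triggers` are hypothetical, and every threshold except the 5% error rate is a placeholder to be replaced by the values agreed before the deploy. The offline holdout comparison is left out because it is usually checked once before promotion rather than continuously.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryMetrics:
    """Live canary metrics aggregated over one evaluation window (hypothetical shape)."""
    error_rate: float           # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float       # 99th-percentile inference latency
    kpi_relative_change: float  # (canary - control) / control; -0.03 means a 3% drop
    kpi_significant: bool       # True once the A/B test has enough power to trust


@dataclass
class RollbackThresholds:
    """Placeholder thresholds; fix the real values before the deploy."""
    max_error_rate: float = 0.05       # the 5% example from the text
    max_p99_latency_ms: float = 200.0  # assumed SLA, not from the source
    max_kpi_drop: float = 0.02         # assumed value for "X%"


def breached_triggers(m: CanaryMetrics, t: RollbackThresholds) -> List[str]:
    """Return the names of every hard trigger that fired; an empty list means healthy."""
    fired = []
    if m.error_rate > t.max_error_rate:
        fired.append("error_rate")
    if m.p99_latency_ms > t.max_p99_latency_ms:
        fired.append("p99_latency")
    # Act on a KPI drop only when the test result is statistically trustworthy.
    if m.kpi_significant and m.kpi_relative_change < -t.max_kpi_drop:
        fired.append("business_kpi")
    return fired
```

Returning the list of fired triggers, rather than a single boolean, keeps the reason for the rollback available for the incident record.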

Soft manual triggers cover slower-moving regressions: a 48-hour A/B result showing a negative trend, a domain expert identifying systematic prediction errors in a spot check, or a data quality alarm suggesting the new model was trained on corrupted data.

How to execute a rollback safely

Model registry with versioned artifacts — every model is stored with a unique version identifier, the training run ID, the dataset hash, and evaluation metrics. The serving layer references a registry tag (e.g., “champion”), not a file path. Rollback is a tag update, not a file operation.
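To make "rollback is a tag update" concrete, here is a small in-memory sketch. It does not use any real registry product's API; `ModelRegistry`, `ModelVersion`, and their methods are hypothetical names chosen for illustration, and a production registry would persist this state and control who may move tags.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ModelVersion:
    version: str
    training_run_id: str
    dataset_hash: str
    metrics: Dict[str, float]


@dataclass
class ModelRegistry:
    versions: Dict[str, ModelVersion] = field(default_factory=dict)
    tags: Dict[str, str] = field(default_factory=dict)  # tag -> version id
    tag_history: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, mv: ModelVersion) -> None:
        self.versions[mv.version] = mv

    def set_tag(self, tag: str, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.tag_history.setdefault(tag, []).append(version)
        self.tags[tag] = version

    def resolve(self, tag: str) -> ModelVersion:
        """What the serving layer calls; it never sees a file path."""
        return self.versions[self.tags[tag]]

    def rollback(self, tag: str) -> str:
        """Point the tag back at the version it held before the latest promotion."""
        history = self.tag_history.get(tag, [])
        if len(history) < 2:
            raise RuntimeError(f"no previous version recorded for tag {tag!r}")
        history.pop()  # drop the failing version from the tag's history only
        self.tags[tag] = history[-1]
        return history[-1]


registry = ModelRegistry()
registry.register(ModelVersion("v1", "run-001", "sha256:aaa", {"auc": 0.91}))
registry.register(ModelVersion("v2", "run-002", "sha256:bbb", {"auc": 0.93}))
registry.set_tag("champion", "v1")
registry.set_tag("champion", "v2")       # promotion
restored = registry.rollback("champion")  # rollback: the tag moves, no files change
```

The failed version stays in `versions` after the rollback, so the artifact remains available for later investigation.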

Traffic routing control — canary, blue/green, or weighted-split routing means rollback is a traffic weight change (100% to old model) rather than a code redeploy. This takes seconds, not minutes.
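The weight-change idea can be sketched with a toy router. This is only an illustration of the mechanism: `WeightedRouter` is a hypothetical class, and in practice the weights would live in a load balancer, service mesh, or feature-flag system rather than in application code.

```python
import random
from typing import Dict, Optional


class WeightedRouter:
    """Toy weighted-split router; each request is sent to one model version."""

    def __init__(self, weights: Dict[str, float], seed: Optional[int] = None):
        self._rng = random.Random(seed)
        self.set_weights(weights)

    def set_weights(self, weights: Dict[str, float]) -> None:
        if not weights or any(w < 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and non-negative")
        total = sum(weights.values())
        if total <= 0:
            raise ValueError("at least one weight must be positive")
        # Normalise so the stored weights always sum to 1.
        self.weights = {v: w / total for v, w in weights.items()}

    def pick(self) -> str:
        """Choose the model version that serves the next request."""
        versions = list(self.weights)
        probs = [self.weights[v] for v in versions]
        return self._rng.choices(versions, weights=probs, k=1)[0]

    def rollback_to(self, version: str) -> None:
        """Send 100% of traffic to one version; no redeploy involved."""
        self.set_weights({version: 1.0})


router = WeightedRouter({"v1": 0.9, "v2": 0.1}, seed=0)  # 10% canary on v2
router.rollback_to("v1")                                  # all traffic back to v1
```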

Inference logging — keep logs of inputs, outputs, and model version for every request. Post-rollback, you can replay the new model’s traffic through the old model offline to understand the magnitude of divergence.
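One possible shape for the offline replay is sketched below. The log schema (`InferenceLog`), the `replay_divergence` helper, and the `tolerance` parameter are assumptions made for illustration, and the sketch assumes a model that returns a single numeric score per request.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass
class InferenceLog:
    """Hypothetical per-request log record."""
    request_id: str
    model_version: str
    features: Sequence[float]
    prediction: float


def replay_divergence(
    logs: Iterable[InferenceLog],
    old_model: Callable[[Sequence[float]], float],
    failed_version: str,
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Re-score requests served by the failed version with the old model.

    Returns the number of replayed requests, the mean absolute difference
    between the two models' predictions, and the share of requests whose
    predictions differ by more than `tolerance`.
    """
    n = 0
    total_abs_diff = 0.0
    disagreements = 0
    for rec in logs:
        if rec.model_version != failed_version:
            continue  # only traffic the rolled-back model actually served
        diff = abs(old_model(rec.features) - rec.prediction)
        total_abs_diff += diff
        disagreements += diff > tolerance
        n += 1
    if n == 0:
        return {"n": 0, "mean_abs_diff": 0.0, "disagreement_rate": 0.0}
    return {
        "n": n,
        "mean_abs_diff": total_abs_diff / n,
        "disagreement_rate": disagreements / n,
    }
```

A call might look like `replay_divergence(logs, champion.predict_one, "v2")`, where `champion.predict_one` stands in for whatever scoring function the restored model exposes.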

Automated gating — do not rely on humans to notice a regression. Configure automated promotion gates that compare live metrics at each traffic split milestone; halt and revert automatically if a gate fails.
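A staged rollout with an automatic revert might be sketched as follows. The function name `staged_rollout` and its two callbacks are hypothetical: `set_canary_weight` stands in for whatever changes the traffic split, and `gate_passes` stands in for the metric comparison at each milestone. A real pipeline would also wait for a soak period at each stage before evaluating the gate.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    set_canary_weight: Callable[[float], None],
    gate_passes: Callable[[float], bool],
) -> Tuple[bool, float]:
    """Walk the canary through increasing traffic shares.

    Returns (True, final_weight) if every gate passed, or
    (False, failing_weight) after reverting the canary to 0% traffic.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")
    for weight in stages:
        set_canary_weight(weight)
        # In production, collect metrics for a soak period here before gating.
        if not gate_passes(weight):
            set_canary_weight(0.0)  # automatic revert, no human in the loop
            return False, weight
    return True, stages[-1]


# Example: the gate starts failing once the canary reaches 50% of traffic.
applied = []
outcome = staged_rollout([0.01, 0.1, 0.5, 1.0], applied.append, lambda w: w < 0.5)
```

Here `applied` records every weight that was set, ending with `0.0` because the gate failed at the 50% stage.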

What to do after rollback

Rollback buys time; it does not fix the underlying problem. File an incident, preserve the failed model artifact (do not delete it), reproduce the failure in a staging environment, and root-cause before attempting another promotion.
