How do you safely roll back a model in production and what triggers a rollback?
A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.
How to think about it
Rollback is not a failure mode to avoid — it is a feature to build deliberately. Every model deployment should ship with a rollback plan decided before the deploy, not during an incident.
What triggers a rollback
Hard automated triggers should be defined before deployment and evaluated continuously:
- Serving error rate exceeds a threshold (e.g., 5%) within the first 15 minutes of a canary deploy.
- p99 inference latency exceeds the SLA threshold.
- Business KPI (conversion rate, click-through rate) drops more than X% vs. the control arm in an A/B test with sufficient statistical power.
- Offline evaluation metric (AUC, RMSE) of the new model on a fresh holdout is worse than that of the current champion on the same holdout.
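The live triggers in the list above could be encoded as a small threshold check that is agreed on before the deploy. The sketch below is illustrative only: `CanarySnapshot`, `RollbackThresholds`, and `fired_triggers` are hypothetical names, and the latency and KPI limits are placeholders rather than recommended values. It assumes the A/B significance test is computed elsewhere and passed in as a flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanarySnapshot:
    """Metrics for the new model over one evaluation window (hypothetical shape)."""
    error_rate: float           # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float       # 99th-percentile inference latency
    kpi_relative_change: float  # (canary - control) / control, e.g. -0.03 is a 3% drop
    kpi_significant: bool       # True if the A/B test reached significance


@dataclass(frozen=True)
class RollbackThresholds:
    max_error_rate: float = 0.05       # the 5% example from the list above
    max_p99_latency_ms: float = 200.0  # placeholder SLA, not from the source
    max_kpi_drop: float = 0.02         # placeholder for "X%", not from the source


def fired_triggers(snap: CanarySnapshot, limits: RollbackThresholds) -> List[str]:
    """Return the name of every hard trigger that fired; an empty list means healthy."""
    fired = []
    if snap.error_rate > limits.max_error_rate:
        fired.append("error_rate")
    if snap.p99_latency_ms > limits.max_p99_latency_ms:
        fired.append("p99_latency")
    # Only act on a KPI drop that the A/B test considers statistically significant.
    if snap.kpi_significant and snap.kpi_relative_change < -limits.max_kpi_drop:
        fired.append("kpi_drop")
    return fired
```

Returning the list of fired trigger names, rather than a single boolean, keeps the reason for a rollback available for the incident record.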
Soft manual triggers cover slower-moving regressions: a 48-hour A/B result showing a negative trend, a domain expert identifying systematic prediction errors in a spot check, or a data quality alarm suggesting the new model was trained on corrupted data.
How to execute a rollback safely
Model registry with versioned artifacts — every model is stored with a unique version identifier, the training run ID, the dataset hash, and evaluation metrics. The serving layer references a registry tag (e.g., “champion”), not a file path. Rollback is a tag update, not a file operation.
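To make "rollback is a tag update" concrete, here is a minimal in-memory sketch. It is not the API of any particular registry product: `ModelVersion`, `ModelRegistry`, `set_tag`, and `rollback_tag` are hypothetical names, and a real registry would persist this state rather than hold it in a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class ModelVersion:
    version: str
    training_run_id: str
    dataset_hash: str
    metrics: Dict[str, float]


@dataclass
class ModelRegistry:
    """Toy registry: versions are immutable records, tags are movable pointers."""
    versions: Dict[str, ModelVersion] = field(default_factory=dict)
    tags: Dict[str, str] = field(default_factory=dict)
    tag_history: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, mv: ModelVersion) -> None:
        self.versions[mv.version] = mv

    def set_tag(self, tag: str, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        previous = self.tags.get(tag)
        if previous is not None:
            # Remember what the tag pointed at so it can be restored later.
            self.tag_history.setdefault(tag, []).append(previous)
        self.tags[tag] = version

    def rollback_tag(self, tag: str) -> str:
        """Point the tag back at the version it referenced before the last update."""
        history = self.tag_history.get(tag, [])
        if not history:
            raise RuntimeError(f"no earlier version recorded for tag {tag!r}")
        self.tags[tag] = history.pop()
        return self.tags[tag]

    def resolve(self, tag: str) -> ModelVersion:
        """What the serving layer would call instead of reading a file path."""
        return self.versions[self.tags[tag]]
```

Because the serving layer only ever calls `resolve("champion")`, a rollback never touches the stored artifacts, and the failed version stays registered for later analysis.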
Traffic routing control — canary, blue/green, or weighted-split routing means rollback is a traffic weight change (100% to old model) rather than a code redeploy. This takes seconds, not minutes.
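A weighted split can be sketched in a few lines to show why rollback reduces to a weight change. `WeightedRouter` and its methods are hypothetical; in practice the weights usually live in a load balancer or service mesh rather than in application code.

```python
import random
from typing import Dict, Optional


class WeightedRouter:
    """Picks a model version per request according to configurable weights."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights: Dict[str, float] = {}
        self.set_weights(weights)

    def set_weights(self, weights: Dict[str, float]) -> None:
        if not weights or any(w < 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and non-negative")
        total = sum(weights.values())
        if total <= 0:
            raise ValueError("at least one weight must be positive")
        # Normalize so the stored weights always sum to 1.
        self.weights = {version: w / total for version, w in weights.items()}

    def choose(self, rng: Optional[random.Random] = None) -> str:
        """Return the version that should serve the next request."""
        source = rng if rng is not None else random
        versions = list(self.weights)
        return source.choices(versions, weights=[self.weights[v] for v in versions], k=1)[0]

    def rollback_to(self, version: str) -> None:
        """Send 100% of traffic to one known-good version."""
        if version not in self.weights:
            raise KeyError(f"unknown version: {version}")
        self.set_weights({v: 1.0 if v == version else 0.0 for v in self.weights})
```

For example, a router created with `{"v1": 0.9, "v2": 0.1}` sends all traffic to `v1` after `rollback_to("v1")`, with no redeploy of either model.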
Inference logging — keep logs of inputs, outputs, and model version for every request. Post-rollback, you can replay the new model’s traffic through the old model offline to understand the magnitude of divergence.
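The offline replay described above might look like the following sketch. It assumes each request is logged as one JSON line with `model_version`, `features`, and a numeric `prediction`; that schema, along with `log_record` and `replay_divergence`, is an assumption made for illustration.

```python
import json
from typing import Callable, Dict, Iterable


def log_record(model_version: str, features: Dict[str, float], prediction: float) -> str:
    """Serialize one inference as a JSON line (assumed log schema)."""
    return json.dumps(
        {"model_version": model_version, "features": features, "prediction": prediction},
        sort_keys=True,
    )


def replay_divergence(
    log_lines: Iterable[str],
    old_model: Callable[[Dict[str, float]], float],
    failed_version: str,
) -> Dict[str, float]:
    """Re-score the failed model's logged inputs with the old model and summarize the gap."""
    diffs = []
    for line in log_lines:
        record = json.loads(line)
        if record["model_version"] != failed_version:
            continue  # only replay traffic that the failed model actually served
        diffs.append(abs(record["prediction"] - old_model(record["features"])))
    if not diffs:
        return {"count": 0, "mean_abs_diff": 0.0, "max_abs_diff": 0.0}
    return {
        "count": len(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "max_abs_diff": max(diffs),
    }
```

Absolute difference suits regression-style outputs; for a classifier, a disagreement rate over predicted labels would be the more natural summary.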
Automated gating — do not rely on humans to notice a regression. Configure automated promotion gates that compare live metrics at each traffic split milestone; halt and revert automatically if a gate fails.
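A promotion gate of this kind can be sketched as a loop over traffic milestones. `staged_rollout`, `set_canary_weight`, and `gate_passes` are hypothetical names; the sketch assumes `gate_passes` blocks until enough live data has accumulated at the current split and then returns a verdict.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    milestones: Sequence[float],
    set_canary_weight: Callable[[float], None],
    gate_passes: Callable[[float], bool],
) -> Tuple[bool, float]:
    """Step the new model through traffic milestones, reverting on the first failed gate.

    Returns (promoted, last_weight_attempted).
    """
    if not milestones:
        raise ValueError("at least one milestone is required")
    for weight in milestones:
        set_canary_weight(weight)
        # Assumed to wait for a soak period and compare live metrics to thresholds.
        if not gate_passes(weight):
            set_canary_weight(0.0)  # automatic revert: all traffic back to the old model
            return False, weight
    return True, milestones[-1]
```

With milestones of `[0.05, 0.25, 1.0]` and a gate that fails at 25%, the function applies weights 0.05, 0.25, and then 0.0, and returns `(False, 0.25)` without waiting for a human to notice the regression.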
What to do after rollback
Rollback buys time; it does not fix the underlying problem. File an incident, preserve the failed model artifact (do not delete it), reproduce the failure in a staging environment, and root-cause before attempting another promotion.