Retraining & continual learning
Drift detection is useless if nothing acts on it. This lesson covers scheduled vs trigger-based retraining, the champion-challenger pattern, and how to retrain safely without auto-shipping a worse model.
What you'll learn
- Scheduled vs trigger-based retraining and when each fits
- The champion-challenger pattern for safe model updates
- Why automated retraining needs guardrails, not blind trust
Drift detection tells you the world has changed. But an alert nobody acts on is just noise. Retraining is the action that closes the loop — and the question isn’t whether to retrain, but when and how safely. Get it wrong and you either serve a stale model for months or auto-ship a broken one.
Scheduled vs trigger-based
Two ways to decide when to retrain:
- Scheduled: retrain on a fixed cadence (nightly, weekly, monthly). Simple and predictable, and a fine default. The downside is that a fixed cadence rarely matches reality: you either retrain too often (burning compute when nothing changed) or too rarely (the model goes stale between runs).
- Trigger-based: retrain when a signal fires, for example drift crossing a threshold, performance dropping below an SLA, or enough new labeled data arriving. This is more efficient because compute is spent only when there is evidence of change, but it takes more machinery: you need reliable monitoring and an automated pipeline to act on it.
Most mature teams use both: a scheduled floor (retrain at least weekly) plus drift/performance triggers for when reality moves faster.
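As a rough illustration of combining a scheduled floor with triggers, here is a minimal sketch. The names `RetrainPolicy` and `retrain_reason`, and every threshold value, are assumptions made up for this example; real values depend on your model and your monitoring.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    # All defaults are illustrative assumptions, not recommendations.
    max_age: timedelta = timedelta(days=7)  # scheduled floor: retrain at least weekly
    drift_threshold: float = 0.2            # drift score above this fires a trigger
    min_accuracy: float = 0.90              # performance SLA
    min_new_labels: int = 10_000            # volume of new labeled data


def retrain_reason(
    policy: RetrainPolicy,
    last_trained: datetime,
    now: datetime,
    drift_score: float,
    accuracy: float,
    new_labels: int,
) -> Optional[str]:
    """Return the first reason to retrain, or None if nothing fired."""
    # Triggers are checked first so the reason reflects the evidence.
    if drift_score > policy.drift_threshold:
        return "drift"
    if accuracy < policy.min_accuracy:
        return "performance"
    if new_labels >= policy.min_new_labels:
        return "new_data"
    # Scheduled floor: retrain anyway once the model is old enough.
    if now - last_trained >= policy.max_age:
        return "schedule"
    return None


policy = RetrainPolicy()
now = datetime(2024, 1, 10)

# Fresh model, healthy metrics: nothing to do.
print(retrain_reason(policy, datetime(2024, 1, 8), now, 0.05, 0.95, 1_200))  # None
# Same model, but drift has crossed the threshold.
print(retrain_reason(policy, datetime(2024, 1, 8), now, 0.35, 0.95, 1_200))  # drift
# Healthy metrics, but the model is older than the weekly floor.
print(retrain_reason(policy, datetime(2024, 1, 1), now, 0.05, 0.95, 1_200))  # schedule
```

Returning the reason rather than a bare boolean makes it easier to log why each retraining run started.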
The retraining loop
Champion-challenger — never auto-ship blind
The dangerous failure mode is automatically deploying every retrained model. Retraining can produce a worse model — bad new data, a label pipeline bug, a distribution shift the model handled poorly. The safe pattern is champion-challenger:
- The champion is the model currently serving production.
- A freshly retrained challenger is evaluated against it: first offline (it must beat the champion on the holdout set), then often online in a shadow deployment, a canary rollout, or a proper A/B test.
- The challenger is promoted only if it wins. Otherwise the champion stays, and you’ve lost nothing.
This makes retraining safe to automate: the worst case is “we kept the old model.”
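The offline half of that gate can be sketched in a few lines. This is a minimal sketch under stated assumptions: models are plain callables, accuracy is the only metric, and `offline_gate`, `GateResult`, and `min_gain` are hypothetical names for this example. A real gate would usually also check business metrics and then hand over to a shadow or canary stage.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label)
Model = Callable[[Sequence[float]], int]


def accuracy(model: Model, holdout: Sequence[Example]) -> float:
    """Fraction of holdout examples the model labels correctly."""
    if not holdout:
        raise ValueError("holdout set must not be empty")
    correct = sum(1 for features, label in holdout if model(features) == label)
    return correct / len(holdout)


@dataclass
class GateResult:
    promote: bool
    champion_score: float
    challenger_score: float


def offline_gate(
    champion: Model,
    challenger: Model,
    holdout: Sequence[Example],
    min_gain: float = 0.0,  # assumed margin; a tie never promotes
) -> GateResult:
    """Promote only if the challenger strictly beats the champion by min_gain."""
    champion_score = accuracy(champion, holdout)
    challenger_score = accuracy(challenger, holdout)
    promote = challenger_score > champion_score + min_gain
    return GateResult(promote, champion_score, challenger_score)


# Toy models standing in for real ones (assumption for the example).
def current_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.7)


def retrained_model(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


holdout = [([0.2], 0), ([0.4], 0), ([0.6], 1), ([0.8], 1)]

result = offline_gate(current_model, retrained_model, holdout)
print(result.promote, result.champion_score, result.challenger_score)  # True 0.75 1.0

# If retraining had produced the weaker model, the champion simply stays.
print(offline_gate(retrained_model, current_model, holdout).promote)  # False
```

Both models are scored on the same holdout set, and the comparison is strict, so a challenger that merely ties leaves the champion in place.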
Next
Safe retraining depends on the model registry gate and orchestration to run the loop. Next: the governance side — responsible-AI ops.
Practice this in an interview
Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with shadow or canary so a bad model never fully replaces the champion.
Scheduled retraining is simple and predictable but wastes compute when nothing has shifted and reacts slowly when drift is sudden. Event-driven retraining ties compute to evidence — a drift alarm, a performance threshold breach, or a data volume trigger — and is more efficient at scale. Most mature systems combine both.
Full retraining trains a fresh model from scratch on the latest data window, giving the cleanest result but at the highest cost and slowest cadence. Incremental or warm-start training continues from existing weights on new data, which is cheaper and faster but can accumulate drift and forgetting. Continual online learning updates the model continuously from a live stream for maximum freshness, at the cost of stability, harder evaluation, and vulnerability to bad or poisoned data.
Data drift is a change in the statistical distribution of model inputs; concept drift is a change in the relationship between inputs and the target; label drift is a shift in the marginal distribution of the target itself. They require different detectors and carry different business urgency.
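To make the distinction concrete, the sketch below applies one detector, the Population Stability Index (PSI), to two different things: an input feature (data drift) and the target (label drift). PSI is one common choice, not the only one, and the sample data and the `psi` helper are assumptions for this example. Concept drift is a different case: it usually needs fresh labels and a performance or error-rate monitor, because the marginal distributions alone may not reveal it.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def psi(reference: Sequence[Hashable], current: Sequence[Hashable], eps: float = 1e-6) -> float:
    """Population Stability Index between two samples of categorical
    (or already binned) values. 0.0 means identical proportions."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    total = 0.0
    for category in set(ref_counts) | set(cur_counts):
        # eps avoids log(0) when a category is missing from one sample.
        p = max(ref_counts[category] / len(reference), eps)
        q = max(cur_counts[category] / len(current), eps)
        total += (q - p) * math.log(q / p)
    return total


# Data drift: the mix of an input feature has changed.
train_device = ["mobile"] * 50 + ["desktop"] * 50
live_device = ["mobile"] * 80 + ["desktop"] * 20
print(f"feature PSI: {psi(train_device, live_device):.3f}")

# Label drift check: here the target's marginal distribution is unchanged.
train_labels = [0] * 90 + [1] * 10
live_labels = [0] * 90 + [1] * 10
print(f"label PSI:   {psi(train_labels, live_labels):.3f}")
```

In this toy data the feature PSI is clearly above zero while the label PSI is zero, so an input-side alert fires without any shift in the target. What threshold counts as "drifted" is a judgment call to tune per feature.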