What's the difference between full retraining, incremental (warm-start) training, and continual online learning?

Full retraining trains a fresh model from scratch on the latest data window, giving the cleanest result but at the highest cost and slowest cadence. Incremental or warm-start training continues from existing weights on new data, which is cheaper and faster but can accumulate drift and forgetting. Continual online learning updates the model continuously from a live stream for maximum freshness, at the cost of stability, harder evaluation, and vulnerability to bad or poisoned data.
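
To make the three modes concrete, here is a minimal sketch on a toy one-weight model trained with plain SGD. Everything in it (the `sgd_step` and `train` helpers, the learning rate, the synthetic windows) is an illustrative assumption, not a production recipe. The only thing that differs between the modes is the starting weights and how much data each update sees.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pairs for a toy model y ≈ w * x


def sgd_step(w: float, x: float, y: float, lr: float = 0.1) -> float:
    """One gradient step on the squared error (w*x - y)**2."""
    return w - lr * 2.0 * (w * x - y) * x


def train(data: List[Example], w: float = 0.0, epochs: int = 50) -> float:
    """Run SGD over `data` for `epochs` passes, starting from weight `w`."""
    for _ in range(epochs):
        for x, y in data:
            w = sgd_step(w, x, y)
    return w


xs = [0.5, 1.0, 1.5]
old_window = [(x, 2.0 * x) for x in xs]  # hypothetical past: y = 2x
new_window = [(x, 3.0 * x) for x in xs]  # hypothetical present: y = 3x

w_old = train(old_window)                      # the model currently serving
w_full = train(new_window)                     # full retraining: fresh start on the latest window
w_warm = train(new_window, w=w_old, epochs=5)  # warm start: continue from existing weights

w_online = w_old                               # online learning: one update per streamed example
for x, y in new_window:
    w_online = sgd_step(w_online, x, y)
```

In this toy setup the warm-started model gets close to the new relationship in a tenth of the passes, while the online model has only partly adapted after seeing each new example once. That mirrors the cost versus freshness tradeoff described above, though real models will not behave this cleanly.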

How do you decide when to retrain a model, and how do you do it safely?

Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with a shadow or canary deployment, so a bad model never fully replaces the champion.

When and how should you trigger model retraining — scheduled vs. event-driven?

Scheduled retraining is simple and predictable but wastes compute when nothing has shifted and reacts slowly when drift is sudden. Event-driven retraining ties compute to evidence — a drift alarm, a performance threshold breach, or a data volume trigger — and is more efficient at scale. Most mature systems combine both.

What is the difference between data drift, concept drift, and label drift — and how do you detect each?

Data drift is a change in the statistical distribution of model inputs; concept drift is a change in the relationship between inputs and the target; label drift is a shift in the marginal distribution of the target itself. They require different detectors and carry different business urgency.
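
One common detector for data drift on a single numeric feature is the Population Stability Index (PSI). The sketch below is a minimal pure-Python version, assuming equal-width bins taken from the reference sample. The `psi` name, the bin count, and the `eps` floor are illustrative choices. Running the same function on the target values gives a rough label drift check. Concept drift cannot be seen from distributions alone, so it is usually detected by tracking model error on freshly labeled data.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], recent: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `recent` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant reference feature: everything lands in one bin

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps floor avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    ref_frac, rec_frac = fractions(reference), fractions(recent)
    return sum((r - e) * math.log(r / e) for e, r in zip(ref_frac, rec_frac))
```

Identical samples give a PSI of zero, and the score grows as the recent distribution moves away from the reference. Any alert threshold you put on it is an assumption to tune against your own data.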

Retraining & continual learning — MLOps

The last lesson left us holding a confirmed drift alert and a sentence full of hidden machinery: “retrain on the recent window, evaluate against production, promote only if it wins.” We asked what’s actually inside that — which window, how to avoid training on the very bug that set off the alarm, how often, and how to be sure the fresh model isn’t quietly worse. This lesson unpacks it.

Drift detection tells you the world has changed. But an alert nobody acts on is just noise. Retraining is the action that closes the loop — and the question isn’t whether to retrain, but when and how safely. Get it wrong and you either serve a stale model for months or auto-ship a broken one.

Scheduled vs trigger-based

Two ways to decide when to retrain:

Scheduled: retrain on a fixed cadence (nightly, weekly, monthly). Simple and predictable, and a fine default. The downside: you either retrain too often (burning compute when nothing changed) or too rarely (the model goes stale between runs).
Trigger-based: retrain when a signal fires: drift crosses a threshold, performance drops below an SLA, or enough new labeled data has arrived. More efficient (you retrain only when there is evidence you need to) but more machinery: you need the monitoring and the automated pipeline to act on it.

Most mature teams use both: a scheduled floor (retrain at least weekly) plus drift/performance triggers for when reality moves faster.
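
The combined policy can be sketched as a single decision function. This is a minimal illustration, not a reference implementation: `RetrainPolicy`, `retrain_reason`, and every threshold value below are hypothetical and would come from your own monitoring and SLAs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    # All defaults are placeholder assumptions, to be tuned per system.
    max_age: timedelta = timedelta(days=7)  # scheduled floor
    drift_threshold: float = 0.2            # e.g. a drift score on key features
    min_metric: float = 0.80                # SLA on the live quality metric
    min_new_labels: int = 10_000            # volume trigger


def retrain_reason(policy: RetrainPolicy, last_trained: datetime, now: datetime,
                   drift_score: float, live_metric: float,
                   new_labels: int) -> Optional[str]:
    """Return why a retrain should start now, or None if nothing fired."""
    if drift_score > policy.drift_threshold:
        return "drift"
    if live_metric < policy.min_metric:
        return "performance"
    if new_labels >= policy.min_new_labels:
        return "volume"
    if now - last_trained >= policy.max_age:
        return "schedule"
    return None
```

Returning the reason rather than a bare boolean makes it easy to log which signal started each run, which helps later when you ask whether the triggers or the schedule are doing most of the work.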

The retraining loop

Monitor → trigger → retrain → validate → promote only if it wins, then back to monitoring.

Champion-challenger: never auto-ship blind

The dangerous failure mode is automatically deploying every retrained model. Retraining can produce a worse model — bad new data, a label pipeline bug, a distribution shift the model handled poorly. The safe pattern is champion-challenger:

The champion is the model currently serving production.
A freshly retrained challenger is evaluated against it — first offline (must beat the champion on the holdout), then often in a shadow or canary deployment or a proper A/B test.
The challenger is promoted only if it wins. Otherwise the champion stays, and you’ve lost nothing.

This makes retraining safe to automate: the worst case is “we kept the old model.”
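
The offline half of that gate fits in a few lines. The sketch below assumes models are plain callables and uses accuracy on a shared holdout as the metric. `select_model`, `accuracy`, and the `min_gain` margin are illustrative names, and a real pipeline would follow this with the shadow, canary, or A/B stage.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], int]  # toy model: one feature in, class label out
Example = Tuple[float, int]     # (feature, true label)


def accuracy(model: Model, holdout: Sequence[Example]) -> float:
    """Fraction of holdout examples the model labels correctly."""
    correct = sum(1 for x, y in holdout if model(x) == y)
    return correct / len(holdout)


def select_model(champion: Model, challenger: Model,
                 holdout: Sequence[Example], min_gain: float = 0.0) -> Model:
    """Return the challenger only if it beats the champion by more than min_gain.

    Ties and losses keep the champion, so the worst case is no change.
    """
    champion_score = accuracy(champion, holdout)
    challenger_score = accuracy(challenger, holdout)
    if challenger_score - champion_score > min_gain:
        return challenger
    return champion
```

Both models are scored on the same holdout inside one function, so a challenger can never be promoted on the strength of an easier evaluation set.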

In one breath

Retraining is the action that closes the drift loop, and the only questions are when and how safely: decide when with a mix of scheduled (a simple cadence floor) and trigger-based (fire on drift or an SLA breach); decide safely with champion-challenger — the freshly retrained challenger must beat the live champion offline and in a shadow/canary/A-B test before promotion, so the worst case is just “we kept the old model”; and watch for the feedback loop where a model trained on data its own decisions generated narrows into a self-reinforcing rut, which you counter with logged propensities and injected exploration.
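
The two countermeasures named above, logged propensities and injected exploration, can be sketched together. This is a minimal illustration assuming an epsilon-greedy policy and an inverse propensity scoring (IPS) estimate. The function names and the log layout are hypothetical, and other exploration schemes and estimators exist.

```python
import random
from typing import Dict, List, Tuple

LogRow = Tuple[str, float, int]  # (item shown, propensity at serving time, clicked 0/1)


def choose_item(scores: Dict[str, float], epsilon: float,
                rng: random.Random) -> Tuple[str, float]:
    """Epsilon-greedy choice that also returns the probability of that choice."""
    items = list(scores)
    best = max(items, key=lambda item: scores[item])
    item = rng.choice(items) if rng.random() < epsilon else best
    # Every item can be picked by exploration; the best one also by exploitation.
    propensity = epsilon / len(items) + ((1.0 - epsilon) if item == best else 0.0)
    return item, propensity


def ips_click_rate(log: List[LogRow], item: str) -> float:
    """IPS estimate of the click rate if `item` had been shown to everyone."""
    weighted_clicks = sum(click / p for shown, p, click in log if shown == item)
    return weighted_clicks / len(log)
```

Dividing each click by its logged propensity upweights the rarely shown items, so an item that was shown once in ten impressions and clicked is not drowned out by the item the old model favored. Without exploration some propensities are zero and no reweighting can recover what users would have done.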

Practice

Before the quiz, reason about the feedback loop concretely. A recommender shows item A to most users, so next month’s clicks are overwhelmingly on A, so the next model learns “users love A” even harder. Trace why naive retraining on that log entrenches the bias rather than correcting it, and name the one ingredient (hinted in the summary above) that breaks the cycle. Then the safety question: champion-challenger makes automated retraining safe to run unattended. What is the single guarantee it gives you that makes “retrain every night” no longer terrifying?

Quick check

Q1. What's the tradeoff between scheduled and trigger-based retraining?

Q2. Why use a champion-challenger pattern instead of auto-deploying every retrained model?

Q3. What is the feedback-loop risk in continual learning?

A question to carry forward

Everything in this lesson was the planned response to a model going wrong — a scheduled cadence, a measured trigger, a challenger evaluated at leisure and promoted only if it wins. It is calm, deliberate, reversible. It is also a luxury you don’t always get.

Because sometimes the model isn’t slowly drifting; it is failing right now. A feature pipeline ships a bug at 1 a.m. and by 3 a.m. the model is serving garbage to real customers, the dashboard is red, and support tickets are stacking up. There is no time to retrain a careful challenger and wait for an A/B test — the building is on fire. So the question to carry forward is the one no schedule prepares you for: when the model breaks in production this minute, with users watching, what do you do in the first ten minutes? That is incident response — the 3 a.m. ML page — and it is the next lesson.
