Deployment Strategies: Canary, Blue-Green & Shadow

A new model can pass every offline test and still break in production. Deployment strategies (canary, blue-green, rolling, and shadow) control the blast radius when it does. This lesson shows how traffic shifting, metric gates, and instant rollback let you ship v2 without betting the whole service on it.

9 min read · Intermediate · MLOps · Lesson 15 of 28

What you'll learn

Why a model that passes offline eval can still fail in production — and why that forces gradual rollout
Canary deployment — ramp a tiny traffic share, gate on live metrics, auto-rollback
Blue-green and rolling — instant cutover vs in-place batch replacement, and their rollback trade-offs
Shadow (dark launch) — test v2 on real traffic with zero user impact
Why fast, reliable rollback is the feature that makes any of this safe

Before you start

Serving with FastAPI (MLOps lesson)
BentoML & Ray Serve (MLOps lesson)
Python (section)
Machine Learning (section)

The last lesson ended on the one thing the whole chapter had quietly never done: change the model. We have a live service (batched, scaled, matched to its latency need) serving version 1, and no model stays at version 1 for long. Retraining produces a v2, and it has to take over while the service is up and answering real traffic, with no window to go dark. We asked how to swap one for the other safely. This lesson is the set of answers.

You trained a better model. Offline, v2 beats v1 on every metric. Now you have to put it in front of real users, and here’s the uncomfortable truth that defines this whole topic: a model can pass every offline test and still fail in production. The live input distribution drifts from your eval set, a feature pipeline computes something subtly different at serving time than at training time (training-serving skew), or latency blows past budget under real load. So the question is never “is v2 good?”, because you can’t fully know until it’s live. The question is “when v2 misbehaves, how few users find out?” Deployment strategies are the answers, and they all come down to one lever: how you shift traffic.

The core idea: control the blast radius

Every strategy makes the switch from v1 to v2 gradual and reversible instead of a single all-or-nothing flip. Two properties decide how safe a rollout is:

Blast radius — what fraction of users hit a broken v2 before you catch it. Smaller is safer.
Rollback speed — how fast you can get back to the known-good v1 once metrics go red.
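
The two properties multiply: how many requests a broken v2 answers depends on both its traffic share and how long it stays live. The sketch below is a deliberately rough model, assuming traffic is uniform over time and that the window runs from the bad deploy to the completed rollback. The function name and the 200 requests/s and 90 s figures are hypothetical, chosen only for illustration.

```python
def exposed_requests(traffic_share: float,
                     requests_per_second: float,
                     seconds_to_rollback: float) -> float:
    """Rough count of requests answered by a broken v2 before rollback completes.

    traffic_share is the blast radius as a fraction (0.01 means 1%).
    Assumes uniform traffic; real traffic is burstier than this.
    """
    if not 0.0 <= traffic_share <= 1.0:
        raise ValueError("traffic_share must be between 0 and 1")
    return traffic_share * requests_per_second * seconds_to_rollback


# Hypothetical service: 200 requests/s, 90 s from bad deploy to finished rollback.
full_cutover = exposed_requests(1.0, 200, 90)   # every request hits v2
canary_1pct = exposed_requests(0.01, 200, 90)   # only the 1% slice hits v2
print(full_cutover, canary_1pct)
```

With the same rollback time, shrinking the traffic share from 100% to 1% cuts the exposure a hundredfold, which is the whole argument for limiting blast radius rather than relying on rollback speed alone.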

Hold those two in mind as you step through the four strategies below — the same bad v2, four very different outcomes:

[Interactive demo: Four ways to roll out v2, and what a bad v2 costs. Pick a strategy and step through the rollout, then inject a bad v2 and watch the blast radius: the share of users who hit a broken model before it's caught.]

Canary — the default for ML

A canary deployment (named for the canary in a coal mine) routes a tiny slice of live traffic — often 1% — to v2 while everyone else stays on v1. You watch v2’s live metrics: error rate, latency, and ideally a guardrail business metric (click-through, conversion). If they hold, you ramp — 1% → 10% → 50% → 100%. If they degrade, an automated gate trips and traffic rolls back to v1. The blast radius is just the canary percentage, which is why canary is the workhorse for serving ML models: you find out v2 is bad while only 1% of users could possibly have noticed.
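
The ramp-and-gate loop described above can be sketched in a few lines. This is a minimal illustration, not a production controller: `Metrics`, `gate_passes`, `run_canary`, and every threshold are hypothetical, and the two callbacks stand in for whatever your monitoring and routing layers actually expose. A real controller would also wait at each stage long enough to collect a meaningful sample before reading metrics.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0..1
    p95_latency_ms: float   # 95th percentile prediction latency
    ctr: float              # guardrail business metric (click-through rate)


def gate_passes(v1: Metrics, v2: Metrics,
                max_error_increase: float = 0.005,
                max_latency_ratio: float = 1.2,
                max_ctr_drop: float = 0.02) -> bool:
    """True if v2 stays within tolerance of v1 on every watched metric.

    The tolerances are illustrative assumptions, not recommended values.
    """
    return (v2.error_rate <= v1.error_rate + max_error_increase
            and v2.p95_latency_ms <= v1.p95_latency_ms * max_latency_ratio
            and v2.ctr >= v1.ctr * (1 - max_ctr_drop))


RAMP_PERCENTAGES = [1, 10, 50, 100]


def run_canary(read_metrics: Callable[[int], Tuple[Metrics, Metrics]],
               set_v2_percent: Callable[[int], None]) -> bool:
    """Ramp v2 through each stage; roll back to 0% the first time the gate trips.

    Returns True if v2 reached 100%, False if it was rolled back.
    """
    for percent in RAMP_PERCENTAGES:
        set_v2_percent(percent)
        v1, v2 = read_metrics(percent)   # in practice: wait, then query monitoring
        if not gate_passes(v1, v2):
            set_v2_percent(0)            # automated rollback to v1
            return False
    return True


# A bad v2 whose error rate quadruples is caught at the first stage.
baseline = Metrics(error_rate=0.01, p95_latency_ms=80.0, ctr=0.050)
bad_v2 = Metrics(error_rate=0.04, p95_latency_ms=85.0, ctr=0.050)
history = []
promoted = run_canary(lambda pct: (baseline, bad_v2), history.append)
print(promoted, history)
```

In this run the traffic history is 1% followed by 0%: the gate trips at the first stage, so v2 never reaches the 10% step.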

Blue-Green — instant cutover, instant rollback

Blue-green runs two complete, identical environments: blue (v1, live) and green (v2, idle). You deploy and smoke-test green with zero live traffic, then flip the router so 100% goes to green in one instant. Rollback is equally instant — flip back to blue. The upsides are simplicity and a clean, atomic switch with no mixed-version window. The downsides: you’re paying for two full environments, and if a bug slips past the smoke test, the blast radius is everyone for the seconds before you flip back. Great for fast recovery; not great at limiting exposure.
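
The flip itself is just a pointer swap, which is why both cutover and rollback are instant. The sketch below models that with a hypothetical in-process `BlueGreenRouter`; in practice the switch usually lives in a load balancer or service mesh rather than in application code, so treat the class and its method names as illustrative assumptions.

```python
import threading
from typing import Any, Callable, Dict


class BlueGreenRouter:
    """Sends 100% of traffic to whichever environment is currently live."""

    def __init__(self, environments: Dict[str, Callable[[Any], Any]], live: str):
        if live not in environments:
            raise KeyError(live)
        self._environments = environments
        self._live = live
        self._lock = threading.Lock()

    def predict(self, features: Any) -> Any:
        # Read the live pointer under the lock, then call outside it
        # so a slow model does not block a cutover.
        with self._lock:
            handler = self._environments[self._live]
        return handler(features)

    def cut_over(self, target: str) -> str:
        """Point all traffic at `target`; return the previous live name for rollback."""
        if target not in self._environments:
            raise KeyError(target)
        with self._lock:
            previous, self._live = self._live, target
        return previous


# Stand-in models: blue serves v1, green serves v2.
router = BlueGreenRouter({"blue": lambda x: "v1", "green": lambda x: "v2"}, live="blue")
previous = router.cut_over("green")     # flip: everyone is now on v2
after_flip = router.predict({})
router.cut_over(previous)               # rollback: flip straight back to blue
after_rollback = router.predict({})
print(after_flip, after_rollback)
```

Nothing in the router limits exposure: between the two `cut_over` calls, every request goes to green.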

Rolling — replace in place, no second environment

A rolling deployment upgrades the existing fleet in batches: take down a few v1 instances, bring them up on v2, repeat. No separate environment, so it’s cheaper than blue-green. But during the transition you have a mix of v1 and v2 serving simultaneously, a bad v2 affects a growing slice as the rollout proceeds, and rollback is slower — you have to roll the instances back, with no pre-warmed standby to flip to. It’s the common default for stateless services; for risky model changes, canary’s metric gate is safer.
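
To see why the affected slice grows, it helps to trace the fleet batch by batch. The function below is a small simulation, assuming the load balancer spreads requests evenly so the share of instances on v2 approximates the share of traffic on v2. The name `rolling_exposure` and the fleet and batch sizes are hypothetical.

```python
def rolling_exposure(fleet_size: int, batch_size: int) -> list[float]:
    """Fraction of instances running v2 after each batch of a rolling update."""
    if fleet_size <= 0 or batch_size <= 0:
        raise ValueError("fleet_size and batch_size must be positive")
    shares = []
    upgraded = 0
    while upgraded < fleet_size:
        # The final batch may be smaller than batch_size.
        upgraded = min(fleet_size, upgraded + batch_size)
        shares.append(upgraded / fleet_size)
    return shares


# Ten instances, upgraded two at a time.
shares = rolling_exposure(fleet_size=10, batch_size=2)
print(shares)
```

If the bug is noticed after the second batch, 40% of the fleet is already on v2, and undoing it means rolling those instances back batch by batch rather than flipping a switch.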

Shadow — test on real traffic, zero user impact

Shadow (or dark launch) is the safest test of all: v2 receives a copy of real production traffic, runs its predictions, and its responses are discarded — users only ever see v1. You compare v2’s outputs to v1’s offline. Because no user ever receives a v2 response, the blast radius is zero, which makes shadow uniquely good for validating a model on the true live distribution before risking a single real request. The catches: it doesn’t exercise user-facing write paths or capture real user reactions, and you pay to run inference twice. Shadow most often comes before a canary — build confidence in the dark, then ramp for real.
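
The mirroring logic can be sketched as a wrapper around the two models. This is a simplified illustration: `make_shadow_handler` is a hypothetical helper, and it calls the shadow model inline for readability, whereas a real setup would typically mirror the request asynchronously or at the proxy layer so v2 cannot add latency to the user's response.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("shadow")


def make_shadow_handler(primary: Callable[[Any], Any],
                        shadow: Callable[[Any], Any],
                        comparisons: List[Tuple[Any, Any]]) -> Callable[[Any], Any]:
    """Serve `primary` to users while recording what `shadow` would have said."""

    def handle(features: Any) -> Any:
        response = primary(features)          # the only response a user ever sees
        try:
            shadow_response = shadow(features)
            comparisons.append((response, shadow_response))   # compared offline later
        except Exception:
            # A crashing v2 must never affect the user-facing response.
            logger.exception("shadow model failed; user response unaffected")
        return response

    return handle


# Stand-in models: v2 disagrees with v1 by one.
log: List[Tuple[Any, Any]] = []
handle = make_shadow_handler(lambda x: x * 2, lambda x: x * 2 + 1, log)
result = handle(10)
print(result, log)
```

The user receives v1's answer, while the recorded pair shows the two models disagreed, which is the material for the offline comparison.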

What “watch the metrics” really means for models

For a plain web service you watch HTTP 500s and latency. A model needs more: prediction latency, the score distribution (has v2’s output distribution shifted vs v1?), and a guardrail business metric that proves v2 isn’t quietly worse. Those signals are exactly what the drift & monitoring tooling provides — the metric gate that decides “ramp or roll back” is wired straight into it. A canary without a metric gate is just a slow outage.
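
One way to put a number on "has v2's output distribution shifted vs v1?" is the Population Stability Index (PSI). The lesson does not prescribe a specific statistic, so treat PSI as one common choice among several. The sketch assumes scores lie in [0, 1] and uses equal-width bins; the bin count and the floor `eps` for empty bins are illustrative assumptions.

```python
import math
from typing import List, Sequence


def psi(v1_scores: Sequence[float], v2_scores: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between two sets of scores in [0, 1].

    0.0 means identical binned distributions; larger means more shift.
    """
    if not v1_scores or not v2_scores:
        raise ValueError("both score lists must be non-empty")

    def bin_shares(scores: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for score in scores:
            index = min(int(score * bins), bins - 1)   # put 1.0 in the last bin
            counts[index] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(count / len(scores), eps) for count in counts]

    p = bin_shares(v1_scores)
    q = bin_shares(v2_scores)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


v1_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
v2_shifted = [0.7, 0.8, 0.8, 0.9, 0.9, 0.9, 0.95, 1.0]
print(psi(v1_scores, v1_scores), psi(v1_scores, v2_shifted))
```

A metric gate could compare this value against a threshold tuned for the service. The right threshold depends on traffic volume and how sensitive the downstream product is to score changes, so it is left out here rather than guessed.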

In one breath

Because a model can pass every offline test and still fail live (distribution drift, training-serving skew, latency under load), you never bet the whole service on a v1→v2 flip. You make the switch gradual and reversible, judging each strategy on two properties: blast radius (what fraction of users a broken v2 reaches) and rollback speed. Canary ramps a tiny share behind a live-metric gate (smallest blast radius among strategies that serve v2); blue-green flips 100% instantly with instant rollback but full exposure at cutover; rolling replaces in place cheaply but with a growing mixed-version blast radius; and shadow mirrors traffic to a v2 whose responses are discarded (zero blast radius). All of it rests on one capability: fast, reliable rollback.

Practice

Before the quiz, run the same bad v2 through all four strategies in your head and rank them by how many users get hurt. A feature-skew bug slips past testing: under shadow, canary-at-1%, rolling-at-40%-complete, and blue-green-at-cutover, who sees the broken predictions in each? Then the subtle one the lesson flags: a canary and an A/B test split traffic identically, so what different question is each one asking, and why does “v2 didn’t crash in the canary” not mean “v2 is better”?

Quick check

Q1. Why do ML teams roll out a new model gradually instead of just replacing v1 with v2 after it wins offline evaluation?

Q2. A bad v2 is shipped. Under which strategy is the blast radius smallest, and why?

Q3. TRANSFER: A team uses blue-green: smoke-test green, then cut 100% of traffic to v2. A subtle feature-skew bug passes the smoke test and reaches every user for ~90 seconds before rollback. What change most reduces this exposure without giving up rollback speed?

A question to carry forward

We kept drawing a careful line in this lesson — most sharply in that last callout. A canary and an A/B test split traffic the exact same way, yet they ask opposite questions. The canary asks a safety question — “is v2 healthy enough to keep ramping?” — and it ends the moment it’s confident nothing is on fire. It will happily ramp a v2 that is perfectly stable, fast, error-free… and very slightly worse for users than v1. Because “didn’t crash” is not “is an improvement.”

And that gap is the whole problem the next lesson exists to close. Getting v2 running safely is only half the job; you still have to prove it’s actually better than v1 — not on your offline eval set, which already disagrees with production, but on live users, with enough data that the difference can’t just be noise. So the question to carry forward is: once v2 is safely live next to v1, how do you measure, with statistics rather than hope, whether it genuinely wins? That is A/B testing and experimentation, and it is the next lesson.

FAQ: Common questions

Questions about this lesson

What is a canary deployment?

A canary deployment routes a small slice of live traffic — often 1% — to a new model version while everyone else stays on the current one, then watches the new version's live metrics (error rate, latency, a business KPI). If they hold, it ramps to 10%, 50%, 100%; if they degrade, an automated gate rolls traffic back. The blast radius is just the canary percentage, which makes it the default for serving ML.

What is the difference between canary and blue-green deployment?

A canary ramps traffic gradually (1% to 100%) and gates each step on live metrics, so a bad version is caught while only a small share of users are exposed. Blue-green runs two full environments and flips 100% of traffic from the old to the new in one instant, giving fast rollback but a full-blast-radius window if a bug slips past the smoke test. Canary limits exposure; blue-green optimizes for a clean, instant cutover.

What is a shadow deployment?

In a shadow (or dark launch) deployment, the new model version receives a copy of real production traffic and runs its predictions, but its responses are discarded — users only ever see the current version. This validates the new model on the true live distribution with a zero-percent blast radius, making it the safest pre-launch test. It usually precedes a canary.

Practice this in an interview

What is the difference between shadow deployment and canary deployment for ML models, and when do you use each?

Shadow deployment mirrors live traffic to the new model and discards its predictions, so you can evaluate performance and load without any user impact. Canary deployment routes a small real slice of traffic to the new model and uses its predictions, so real user impact is possible but limited and monitored.

How do you safely roll back a model in production and what triggers a rollback?

A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.

When would you use a multi-armed bandit or shadow deployment instead of a fixed A/B test?

A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.

How do you decide when to retrain a model, and how do you do it safely?

Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with shadow or canary so a bad model never fully replaces the champion.
