datarekha

Deployment Strategies: Canary, Blue-Green & Shadow

A new model can pass every offline test and still break in production. Deployment strategies (canary, blue-green, rolling, and shadow) control the blast radius when it does. This lesson covers how traffic shifting, metric gates, and instant rollback let you ship v2 without betting the whole service on it.

9 min read Intermediate MLOps Lesson 15 of 28

What you'll learn

  • Why a model that passes offline eval can still fail in production — and why that forces gradual rollout
  • Canary deployment — ramp a tiny traffic share, gate on live metrics, auto-rollback
  • Blue-green and rolling — instant cutover vs in-place batch replacement, and their rollback trade-offs
  • Shadow (dark launch) — test v2 on real traffic with zero user impact
  • Why fast, reliable rollback is the feature that makes any of this safe

Before you start

You trained a better model. Offline, v2 beats v1 on every metric. Now you have to put it in front of real users, and here’s the uncomfortable truth that defines this whole topic: a model can pass every offline test and still fail in production. The live input distribution drifts from your eval set. A feature pipeline computes something subtly different at serving time than at training time (training-serving skew). Latency blows past budget under real load. So the question is never “is v2 good?”, because you can’t fully know until it’s live. The question is “when v2 misbehaves, how few users find out?” Deployment strategies are the answers, and they all come down to one lever: how you shift traffic.

The core idea: control the blast radius

Every strategy makes the switch from v1 to v2 gradual and reversible instead of a single all-or-nothing flip. Two properties decide how safe a rollout is:

  • Blast radius — what fraction of users hit a broken v2 before you catch it. Smaller is safer.
  • Rollback speed — how fast you can get back to the known-good v1 once metrics go red.
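To make the two properties concrete, here is a rough back-of-the-envelope sketch in Python. The function name `exposed_requests` and the 1,000 requests-per-second figure are illustrative assumptions, not part of the lesson; the 90-second window mirrors the scenario in the quick check further down. It assumes uniform traffic and that every request routed to v2 is affected.

```python
def exposed_requests(traffic_share: float,
                     requests_per_second: float,
                     seconds_to_rollback: float) -> float:
    """Estimate how many requests a bad v2 serves before rollback completes.

    traffic_share is the blast radius (0.0 to 1.0); seconds_to_rollback
    covers both detection time and the time to shift traffic back to v1.
    """
    if not 0.0 <= traffic_share <= 1.0:
        raise ValueError("traffic_share must be between 0 and 1")
    if requests_per_second < 0 or seconds_to_rollback < 0:
        raise ValueError("rate and duration must be non-negative")
    return traffic_share * requests_per_second * seconds_to_rollback


# Hypothetical service: 1,000 requests/s, bad v2 caught and reverted in 90 s.
canary_exposure = exposed_requests(0.01, 1000, 90)  # 1% of traffic on v2
full_exposure = exposed_requests(1.0, 1000, 90)     # 100% of traffic on v2
print(canary_exposure, full_exposure)
```

Shrinking either factor, the traffic share or the time to roll back, shrinks the damage by the same proportion.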

Hold those two in mind as you step through the four strategies below — the same bad v2, four very different outcomes:

Canary — the default for ML

A canary deployment (named for the canary in a coal mine) routes a tiny slice of live traffic, often 1%, to v2 while everyone else stays on v1. You watch v2’s live metrics: error rate, latency, and ideally a guardrail business metric (click-through, conversion). If they hold, you ramp: 1% → 10% → 50% → 100%. If they degrade, an automated gate trips and traffic rolls back to v1. The blast radius is the canary percentage at the step where the gate trips, which is why canary is the workhorse for serving ML models: if v2 is bad from the start, you find out while only about 1% of traffic could have been affected.
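The ramp-and-gate loop can be sketched in a few lines of Python. Everything named here is hypothetical: the `Metrics` fields, the thresholds in `gate_passes`, and the `read_metrics` and `set_v2_share` callbacks stand in for whatever your monitoring system and traffic router actually expose. A real rollout would also wait at each step long enough to collect meaningful data before evaluating the gate.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

RAMP_STEPS = [0.01, 0.10, 0.50, 1.00]  # 1% -> 10% -> 50% -> 100%


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests
    p99_latency_ms: float   # tail latency
    conversion_rate: float  # guardrail business metric


def gate_passes(v1: Metrics, v2: Metrics,
                max_error_increase: float = 0.005,
                max_latency_ratio: float = 1.2,
                max_conversion_drop: float = 0.02) -> bool:
    """Return True if v2 is within tolerance of v1 on every signal.

    The thresholds are placeholders to tune, not recommendations.
    """
    return (
        v2.error_rate <= v1.error_rate + max_error_increase
        and v2.p99_latency_ms <= v1.p99_latency_ms * max_latency_ratio
        and v2.conversion_rate >= v1.conversion_rate * (1 - max_conversion_drop)
    )


def run_canary(read_metrics: Callable[[float], Tuple[Metrics, Metrics]],
               set_v2_share: Callable[[float], None]) -> bool:
    """Ramp v2 through RAMP_STEPS; send all traffic back to v1 on a failed gate."""
    for share in RAMP_STEPS:
        set_v2_share(share)
        v1, v2 = read_metrics(share)  # live metrics for both versions
        if not gate_passes(v1, v2):
            set_v2_share(0.0)  # automated rollback
            return False
    return True
```

Passing the router and the metrics reader in as callbacks keeps the gate logic testable without touching real infrastructure.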

Blue-Green — instant cutover, instant rollback

Blue-green runs two complete, identical environments: blue (v1, live) and green (v2, idle). You deploy and smoke-test green with zero live traffic, then flip the router so 100% goes to green in one instant. Rollback is equally instant — flip back to blue. The upsides are simplicity and a clean, atomic switch with no mixed-version window. The downsides: you’re paying for two full environments, and if a bug slips past the smoke test, the blast radius is everyone for the seconds before you flip back. Great for fast recovery; not great at limiting exposure.
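As a minimal sketch of the cutover mechanics, the Python class below models the router as a pointer to whichever environment is live. `BlueGreenRouter` and its method names are hypothetical, and the two environments are stand-in callables; in practice the flip is usually a load balancer or DNS change rather than an in-process attribute.

```python
from typing import Any, Callable, Iterable


class BlueGreenRouter:
    """Send 100% of traffic to exactly one of two environments."""

    def __init__(self, blue: Callable[[Any], Any], green: Callable[[Any], Any]):
        self._envs = {"blue": blue, "green": green}
        self.live = "blue"  # v1 serves all traffic at the start

    def smoke_test(self, samples: Iterable[Any],
                   is_valid: Callable[[Any], bool]) -> bool:
        """Exercise green directly, bypassing live traffic."""
        green = self._envs["green"]
        return all(is_valid(green(x)) for x in samples)

    def cut_over(self) -> None:
        self.live = "green"  # atomic switch: no mixed-version window

    def roll_back(self) -> None:
        self.live = "blue"   # blue is still running, so this is instant

    def predict(self, x: Any) -> Any:
        return self._envs[self.live](x)


# Stand-in models for illustration only.
router = BlueGreenRouter(blue=lambda x: "v1", green=lambda x: "v2")
if router.smoke_test([1, 2, 3], is_valid=lambda y: y == "v2"):
    router.cut_over()
```

Note what the sketch does not have: any notion of a traffic percentage. Between `cut_over()` and `roll_back()`, every request goes to green.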

Rolling — replace in place, no second environment

A rolling deployment upgrades the existing fleet in batches: take down a few v1 instances, bring them up on v2, repeat. No separate environment, so it’s cheaper than blue-green. But during the transition you have a mix of v1 and v2 serving simultaneously, a bad v2 affects a growing slice as the rollout proceeds, and rollback is slower — you have to roll the instances back, with no pre-warmed standby to flip to. It’s the common default for stateless services; for risky model changes, canary’s metric gate is safer.
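The growing exposure is easy to see in a small simulation. The Python sketch below is an illustration under simplifying assumptions: the fleet is a list of version labels, `healthy` is a hypothetical health check supplied by the caller, and traffic is assumed to spread evenly across instances so that the v2 share of the fleet equals the v2 share of traffic.

```python
from typing import Callable, List


def rolling_upgrade(fleet: List[str], batch_size: int,
                    healthy: Callable[[List[str]], bool]) -> List[float]:
    """Upgrade the fleet in place, batch by batch.

    Returns the v2 share of the fleet after each step. If the health
    check fails, every instance is reverted to v1 and 0.0 is appended.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    shares: List[float] = []
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = "v2"  # take the instance down, bring it up on v2
        shares.append(fleet.count("v2") / len(fleet))
        if not healthy(fleet):
            # No standby environment: rollback means redeploying v1
            # on every instance that was already upgraded.
            for i in range(len(fleet)):
                fleet[i] = "v1"
            shares.append(0.0)
            break
    return shares


fleet = ["v1"] * 4
history = rolling_upgrade(fleet, batch_size=1, healthy=lambda f: True)
print(history)
```

In a real fleet the revert loop is itself a second rolling deployment, which is why rollback here is slower than a blue-green flip.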

Shadow — test on real traffic, zero user impact

Shadow (or dark launch) is the safest test of all: v2 receives a copy of real production traffic, runs its predictions, and its responses are discarded — users only ever see v1. You compare v2’s outputs to v1’s offline. Because no user ever receives a v2 response, the blast radius is zero, which makes shadow uniquely good for validating a model on the true live distribution before risking a single real request. The catches: it doesn’t exercise user-facing write paths or capture real user reactions, and you pay to run inference twice. Shadow most often comes before a canary — build confidence in the dark, then ramp for real.
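A minimal Python sketch of the mirroring step follows. The function `serve_with_shadow` and the shape of the log records are assumptions made for illustration. To stay short, the sketch calls v2 synchronously; a production setup would typically mirror the request asynchronously or at the proxy layer so that v2 cannot add latency to the user-facing path.

```python
from typing import Any, Callable, Dict, List


def serve_with_shadow(request: Any,
                      v1: Callable[[Any], Any],
                      v2: Callable[[Any], Any],
                      comparison_log: List[Dict[str, Any]]) -> Any:
    """Answer with v1; run v2 on the same request and only log its output."""
    response = v1(request)
    try:
        shadow_response = v2(request)
        comparison_log.append(
            {"request": request, "v1": response, "v2": shadow_response}
        )
    except Exception as exc:
        # A crashing v2 must never affect the user-facing response.
        comparison_log.append(
            {"request": request, "v1": response, "v2_error": repr(exc)}
        )
    return response  # v2's output is discarded from the user's point of view


# Stand-in models for illustration only.
log: List[Dict[str, Any]] = []
answer = serve_with_shadow(2, v1=lambda x: x * 2, v2=lambda x: x * 3,
                           comparison_log=log)
```

The accumulated `comparison_log` is what you analyze offline: how often v2 disagrees with v1, by how much, and whether it raised errors on inputs v1 handled.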

What “watch the metrics” really means for models

For a plain web service you watch HTTP 500s and latency. A model needs more: prediction latency, the score distribution (has v2’s output distribution shifted vs v1?), and a guardrail business metric that proves v2 isn’t quietly worse. Those signals are exactly what the drift & monitoring tooling provides — the metric gate that decides “ramp or roll back” is wired straight into it. A canary without a metric gate is just a slow outage.
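One way to turn "has the score distribution shifted?" into a number is the population stability index (PSI). The lesson does not prescribe a specific statistic, so treat PSI as one common choice among several. In the Python sketch below, the function names, the bin count, and the `max_psi` threshold are assumptions to tune for your own model, and scores are assumed to lie in the range 0 to 1.

```python
import math
from typing import List, Sequence


def bucket_shares(scores: Sequence[float], bins: int, eps: float) -> List[float]:
    """Share of scores falling in each equal-width bin over [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    counts = [0] * bins
    for s in scores:
        # Clamp so out-of-range scores land in the first or last bin.
        idx = max(0, min(int(s * bins), bins - 1))
        counts[idx] += 1
    # Floor empty bins at eps so the logarithm below is always defined.
    return [max(c / len(scores), eps) for c in counts]


def psi(v1_scores: Sequence[float], v2_scores: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between v1 and v2 score distributions."""
    p = bucket_shares(v1_scores, bins, eps)
    q = bucket_shares(v2_scores, bins, eps)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def score_gate_passes(v1_scores: Sequence[float], v2_scores: Sequence[float],
                      max_psi: float = 0.2) -> bool:
    """Gate on distribution shift; max_psi is a placeholder threshold."""
    return psi(v1_scores, v2_scores) <= max_psi
```

A check like `score_gate_passes` would sit alongside the error-rate, latency, and business-metric checks in the gate, not replace them: a shifted score distribution signals that something changed, not whether the change is harmful.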

Quick check

Q1. Why do ML teams roll out a new model gradually instead of just replacing v1 with v2 after it wins offline evaluation?
Q2. A bad v2 is shipped. Under which strategy is the blast radius smallest, and why?
Q3. TRANSFER: A team uses blue-green: smoke-test green, then cut 100% of traffic to v2. A subtle feature-skew bug passes the smoke test and reaches every user for ~90 seconds before rollback. What change most reduces this exposure without giving up rollback speed?

Next

Every strategy here leans on one thing: live metrics good enough to trust the gate. Error rate, latency, score-distribution shift, a guardrail KPI — deciding “ramp or roll back” is only as good as your monitoring. That’s the drift & monitoring lesson — the signals that make a canary safe.

FAQ: Common questions about this lesson

What is a canary deployment?

A canary deployment routes a small slice of live traffic — often 1% — to a new model version while everyone else stays on the current one, then watches the new version's live metrics (error rate, latency, a business KPI). If they hold, it ramps to 10%, 50%, 100%; if they degrade, an automated gate rolls traffic back. The blast radius is just the canary percentage, which makes it the default for serving ML.

What is the difference between canary and blue-green deployment?

A canary ramps traffic gradually (1% to 100%) and gates each step on live metrics, so a bad version is caught while only a small share of users are exposed. Blue-green runs two full environments and flips 100% of traffic from the old to the new in one instant, giving fast rollback but a full-blast-radius window if a bug slips past the smoke test. Canary limits exposure; blue-green optimizes for a clean, instant cutover.

What is a shadow deployment?

In a shadow (or dark launch) deployment, the new model version receives a copy of real production traffic and runs its predictions, but its responses are discarded — users only ever see the current version. This validates the new model on the true live distribution with a zero-percent blast radius, making it the safest pre-launch test. It usually precedes a canary.

Practice this in an interview

What is the difference between shadow deployment and canary deployment for ML models, and when do you use each?

Shadow deployment mirrors live traffic to the new model and discards its predictions, so you can evaluate performance and load without any user impact. Canary deployment routes a small real slice of traffic to the new model and uses its predictions, so real user impact is possible but limited and monitored.

How do you safely roll back a model in production and what triggers a rollback?

A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.

When would you use a multi-armed bandit or shadow deployment instead of a fixed A/B test?

A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.

How do you decide when to retrain a model, and how do you do it safely?

Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with shadow or canary so a bad model never fully replaces the champion.
