Deployment Strategies: Canary, Blue-Green & Shadow
A new model can pass every offline test and still break in production. Deployment strategies (canary, blue-green, rolling, and shadow) control the blast radius when it does. This lesson shows how traffic shifting, metric gates, and instant rollback let you ship v2 without betting the whole service on it.
What you'll learn
- Why a model that passes offline eval can still fail in production — and why that forces gradual rollout
- Canary deployment — ramp a tiny traffic share, gate on live metrics, auto-rollback
- Blue-green and rolling — instant cutover vs in-place batch replacement, and their rollback trade-offs
- Shadow (dark launch) — test v2 on real traffic with zero user impact
- Why fast, reliable rollback is the feature that makes any of this safe
You trained a better model. Offline, v2 beats v1 on every metric. Now you have to put it in front of real users, and here’s the uncomfortable truth that defines this whole topic: a model can pass every offline test and still fail in production. The live input distribution drifts from your eval set; a feature pipeline computes something subtly different at serving time than at training time (training-serving skew); latency blows past budget under real load. So the question is never “is v2 good?”, because you can’t fully know until it’s live. The question is “when v2 misbehaves, how few users find out?” Deployment strategies are the answers, and they all come down to one lever: how you shift traffic.
The core idea: control the blast radius
Every strategy makes the switch from v1 to v2 gradual and reversible instead of a single all-or-nothing flip. Two properties decide how safe a rollout is:
- Blast radius — what fraction of users hit a broken v2 before you catch it. Smaller is safer.
- Rollback speed — how fast you can get back to the known-good v1 once metrics go red.
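The two properties multiply. As a rough back-of-the-envelope sketch (the function name and the sample numbers are illustrative assumptions, not measurements), the number of requests that reach a broken v2 is the traffic share times the request rate times the time it takes to detect the problem and finish rolling back:

```python
def exposed_requests(traffic_share: float,
                     requests_per_second: float,
                     seconds_to_rollback: float) -> float:
    """Requests served by a bad v2 before traffic is back on v1.

    traffic_share is the fraction of traffic on v2 (0.01 for a 1% canary,
    1.0 for a full cutover). seconds_to_rollback covers detection time
    plus the rollback itself.
    """
    return traffic_share * requests_per_second * seconds_to_rollback


# Hypothetical service: 1,000 requests/s, 5 minutes to detect and roll back.
canary_cost = exposed_requests(0.01, 1_000, 300)   # 1% canary
cutover_cost = exposed_requests(1.0, 1_000, 300)   # everyone on v2
```

With the same detection time, the full cutover exposes about 100 times as many requests as the 1% canary, which is why both levers matter: shrink the share, and shrink the time.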
Hold those two in mind as you step through the four strategies below — the same bad v2, four very different outcomes:
[Interactive figure: four ways to roll out v2 and what a bad v2 costs, measured as blast radius (the share of users who hit a broken model before it's caught).]
Canary — the default for ML
A canary deployment (named for the canary in a coal mine) routes a tiny slice of live traffic — often 1% — to v2 while everyone else stays on v1. You watch v2’s live metrics: error rate, latency, and ideally a guardrail business metric (click-through, conversion). If they hold, you ramp — 1% → 10% → 50% → 100%. If they degrade, an automated gate trips and traffic rolls back to v1. The blast radius is just the canary percentage, which is why canary is the workhorse for serving ML models: you find out v2 is bad while only 1% of users could possibly have noticed.
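The ramp-and-gate loop described above can be sketched in a few lines. This is a minimal illustration, not a production controller: `routes_to_v2`, `Metrics`, `gate_passes`, `run_canary`, and the threshold values are all hypothetical names and numbers chosen for the example. It assumes traffic is split by hashing a user ID, so a user who lands in the 1% canary stays on v2 as the ramp widens.

```python
import hashlib
from dataclasses import dataclass

RAMP_STEPS = [1, 10, 50, 100]  # percent of traffic on v2, as in the text


def routes_to_v2(user_id: str, canary_percent: float) -> bool:
    """Sticky assignment: hash the user into one of 10,000 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < canary_percent * 100


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests
    p99_latency_ms: float


def gate_passes(canary: Metrics, baseline: Metrics,
                max_error_delta: float = 0.005,
                max_latency_ratio: float = 1.2) -> bool:
    """Assumed thresholds; tune them to your own service."""
    error_ok = canary.error_rate - baseline.error_rate <= max_error_delta
    latency_ok = (canary.p99_latency_ms
                  <= baseline.p99_latency_ms * max_latency_ratio)
    return error_ok and latency_ok


def run_canary(observe, steps=RAMP_STEPS):
    """observe(percent) returns (canary_metrics, baseline_metrics).

    Returns the final v2 traffic percent and a log of each step.
    """
    log = []
    for percent in steps:
        canary, baseline = observe(percent)
        if not gate_passes(canary, baseline):
            log.append((percent, "rollback"))
            return 0, log   # all traffic back on v1
        log.append((percent, "ramp"))
    return steps[-1], log
```

In a real system `observe` would wait for enough traffic at each step to make the comparison meaningful before returning; here it is just a callback so the control flow stays visible.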
Blue-Green — instant cutover, instant rollback
Blue-green runs two complete, identical environments: blue (v1, live) and green (v2, idle). You deploy and smoke-test green with zero live traffic, then flip the router so 100% goes to green in one instant. Rollback is equally instant: flip back to blue. The upsides are simplicity and a clean, atomic switch with no mixed-version window. The downsides: you’re paying for two full environments, and if a bug slips past the smoke test, the blast radius is everyone until you detect the problem and flip back. The flip itself takes seconds; noticing that you need it can take longer. Great for fast recovery; not great at limiting exposure.
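A toy version of the blue-green switch, assuming each environment is just a callable that returns a prediction. `BlueGreenRouter` and its methods are hypothetical names for this sketch; in practice the "router" is a load balancer, DNS record, or service-mesh rule rather than a Python object.

```python
class BlueGreenRouter:
    """Sends all traffic to one of two environments; flip() swaps them."""

    def __init__(self, blue, green):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"          # v1 starts out live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def smoke_test(self, cases) -> bool:
        """Run (input, expected) pairs against the idle environment only."""
        model = self.environments[self.idle]
        return all(model(x) == expected for x, expected in cases)

    def flip(self) -> None:
        """Cutover and rollback are the same operation: swap the pointer."""
        self.live = self.idle

    def predict(self, x):
        return self.environments[self.live](x)
```

Because cutover and rollback are the same pointer swap, rollback is only as slow as your detection. Nothing in this sketch limits how many users see green once it is live.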
Rolling — replace in place, no second environment
A rolling deployment upgrades the existing fleet in batches: take down a few v1 instances, bring them up on v2, repeat. No separate environment, so it’s cheaper than blue-green. But during the transition you have a mix of v1 and v2 serving simultaneously, a bad v2 affects a growing slice as the rollout proceeds, and rollback is slower — you have to roll the instances back, with no pre-warmed standby to flip to. It’s the common default for stateless services; for risky model changes, canary’s metric gate is safer.
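To see why a bad v2 affects a growing slice during a rolling deployment, here is a small simulation. The fleet is modeled as a list of version labels, and `rolling_update` is an illustrative helper, not any orchestrator's API.

```python
def rolling_update(fleet, batch_size, new_version="v2"):
    """Replace instances batch by batch, yielding the fleet after each batch."""
    fleet = list(fleet)
    for start in range(0, len(fleet), batch_size):
        for i in range(start, min(start + batch_size, len(fleet))):
            fleet[i] = new_version      # take down v1, bring up v2
        yield list(fleet)


# Four instances, one replaced per batch.
for step in rolling_update(["v1"] * 4, batch_size=1):
    v2_share = step.count("v2") / len(step)
    print(step, f"v2 share: {v2_share:.0%}")
```

Assuming requests are spread evenly across instances, the v2 share of traffic tracks the v2 share of the fleet, so exposure climbs with every batch. Rolling back means running the same loop in reverse.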
Shadow — test on real traffic, zero user impact
Shadow (or dark launch) is the safest test of all: v2 receives a copy of real production traffic, runs its predictions, and its responses are discarded — users only ever see v1. You compare v2’s outputs to v1’s offline. Because no user ever receives a v2 response, the blast radius is zero, which makes shadow uniquely good for validating a model on the true live distribution before risking a single real request. The catches: it doesn’t exercise user-facing write paths or capture real user reactions, and you pay to run inference twice. Shadow most often comes before a canary — build confidence in the dark, then ramp for real.
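The mirroring logic can be sketched as a request handler that always answers from v1 and records what v2 would have said. `make_shadow_handler` and the log record fields are assumptions for this example. A real deployment would usually run the shadow call off the request path (asynchronously or from a traffic mirror) so v2's latency cannot slow the user's response; it is inline here only to keep the sketch short.

```python
def make_shadow_handler(primary, shadow, log):
    """Serve from primary (v1); run shadow (v2) on the same input and log it."""

    def handle(request):
        response = primary(request)          # the only thing the user sees
        try:
            shadow_response = shadow(request)
            log.append({
                "request": request,
                "v1": response,
                "v2": shadow_response,
                "match": response == shadow_response,
                "error": None,
            })
        except Exception as exc:             # a crashing v2 must not hurt users
            log.append({
                "request": request,
                "v1": response,
                "v2": None,
                "match": False,
                "error": repr(exc),
            })
        return response

    return handle
```

The log is what you analyze offline: agreement rate between v1 and v2, how often v2 errors, and where the two disagree.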
What “watch the metrics” really means for models
For a plain web service you watch HTTP 500s and latency. A model needs more: prediction latency, the score distribution (has v2’s output distribution shifted vs v1?), and a guardrail business metric that proves v2 isn’t quietly worse. Those signals are exactly what the drift & monitoring tooling provides — the metric gate that decides “ramp or roll back” is wired straight into it. A canary without a metric gate is just a slow outage.
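One common way to put a number on "has v2's output distribution shifted vs v1?" is the Population Stability Index (PSI). The lesson does not name a specific statistic, so treat this as one possible choice. The sketch assumes scores lie in [0, 1] and uses ten equal-width bins; the 0.2 alert threshold is a widely quoted rule of thumb, not a universal constant.

```python
import math


def histogram(scores, bins=10):
    """Fraction of scores in each equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        index = min(int(s * bins), bins - 1)   # put 1.0 in the last bin
        counts[index] += 1
    return [c / len(scores) for c in counts]


def psi(baseline_scores, candidate_scores, bins=10, eps=1e-6):
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    expected = histogram(baseline_scores, bins)
    actual = histogram(candidate_scores, bins)
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)        # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


PSI_ALERT = 0.2  # assumed threshold; calibrate on your own traffic


def score_shift_gate(v1_scores, v2_scores) -> bool:
    """True if v2's score distribution is close enough to v1's."""
    return psi(v1_scores, v2_scores) < PSI_ALERT
```

A shift in scores is not automatically bad (a better model may legitimately score differently), so this gate is best read as a prompt to look closer, alongside latency and the guardrail business metric.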
Next
Every strategy here leans on one thing: live metrics good enough to trust the gate. Error rate, latency, score-distribution shift, a guardrail KPI — deciding “ramp or roll back” is only as good as your monitoring. That’s the drift & monitoring lesson — the signals that make a canary safe.
Questions about this lesson
What is a canary deployment?
A canary deployment routes a small slice of live traffic — often 1% — to a new model version while everyone else stays on the current one, then watches the new version's live metrics (error rate, latency, a business KPI). If they hold, it ramps to 10%, 50%, 100%; if they degrade, an automated gate rolls traffic back. The blast radius is just the canary percentage, which makes it the default for serving ML.
What is the difference between canary and blue-green deployment?
A canary ramps traffic gradually (1% to 100%) and gates each step on live metrics, so a bad version is caught while only a small share of users are exposed. Blue-green runs two full environments and flips 100% of traffic from the old to the new in one instant, giving fast rollback but a full-blast-radius window if a bug slips past the smoke test. Canary limits exposure; blue-green optimizes for a clean, instant cutover.
What is a shadow deployment?
In a shadow (or dark launch) deployment, the new model version receives a copy of real production traffic and runs its predictions, but its responses are discarded — users only ever see the current version. This validates the new model on the true live distribution with a zero-percent blast radius, making it the safest pre-launch test. It usually precedes a canary.
Practice this in an interview
Shadow deployment mirrors live traffic to the new model and discards its predictions, so you can evaluate performance and load without any user impact. Canary deployment routes a small real slice of traffic to the new model and uses its predictions, so real user impact is possible but limited and monitored.
A rollback reverts serving traffic to a known-good model version when the newly deployed model shows metric regression beyond a tolerance threshold. Safe rollback requires versioned model artifacts, traffic-routing control, and pre-defined automated or manual triggers — not ad hoc decisions under pressure.
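A minimal sketch of those three ingredients: versioned artifacts (a history of version labels), routing control (a pointer to the live version), and a pre-defined trigger (a tolerance decided before the rollout). `ServingPointer`, `check_and_roll_back`, and the version strings are hypothetical, and the metric is assumed to be higher-is-better.

```python
class ServingPointer:
    """Tracks which model version is live and remembers earlier ones."""

    def __init__(self, known_good: str):
        self.history = [known_good]

    @property
    def live(self) -> str:
        return self.history[-1]

    def promote(self, version: str) -> None:
        self.history.append(version)

    def roll_back(self) -> str:
        """Revert to the previous version; never drops the last known-good one."""
        if len(self.history) > 1:
            self.history.pop()
        return self.live


def regression_exceeds_tolerance(baseline: float, current: float,
                                 tolerance: float) -> bool:
    """For a higher-is-better metric such as click-through rate."""
    return baseline - current > tolerance


def check_and_roll_back(pointer: ServingPointer, baseline: float,
                        current: float, tolerance: float) -> str:
    """Apply the pre-agreed trigger and return the version now live."""
    if regression_exceeds_tolerance(baseline, current, tolerance):
        return pointer.roll_back()
    return pointer.live
```

The point of writing the trigger as a function is that the tolerance is chosen before the rollout, not argued about while the dashboard is red.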
A fixed A/B test holds traffic splits constant to get a clean, statistically powered comparison, which is ideal when you need a trustworthy ship decision. A multi-armed bandit dynamically shifts traffic toward the better-performing model, reducing regret when you can't run long enough for significance or when the best arm may change. Shadow deployment sends real traffic to the new model without serving its outputs, so you validate behavior and latency risk-free before any user is exposed.
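To make the bandit idea concrete, here is an epsilon-greedy router, one of the simplest bandit policies (others, such as Thompson sampling or UCB, are common in practice). The class name, the reward definition, and the default epsilon are assumptions for this sketch.

```python
import random


class EpsilonGreedyRouter:
    """Mostly sends traffic to the best arm so far; explores with prob. epsilon."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)       # seeded for repeatable runs
        self.pulls = {arm: 0 for arm in arms}
        self.reward = {arm: 0.0 for arm in arms}

    def mean(self, arm) -> float:
        return self.reward[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        untried = [arm for arm, n in self.pulls.items() if n == 0]
        if untried:
            return untried[0]                # try every arm at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.pulls))   # explore
        return max(self.pulls, key=self.mean)          # exploit

    def update(self, arm, reward: float) -> None:
        self.pulls[arm] += 1
        self.reward[arm] += reward
```

Usage is a loop of `arm = router.choose()`, serve with that model, then `router.update(arm, reward)` once the outcome (a click, say) is known. Unlike a fixed A/B split, the traffic shares drift toward the better arm, which reduces regret but makes a clean significance test harder.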
Choose between scheduled retraining on a fixed cadence and trigger-based retraining fired by monitored drift or a performance drop, picking based on how fast the data distribution changes and how good your monitoring is. Retrain safely by treating it as an automated pipeline that validates data, trains, and gates the new model against the current champion on held-out and business metrics before promotion. Then roll out progressively with shadow or canary so a bad model never fully replaces the champion.
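The champion-versus-challenger gate at the end of a retraining pipeline might look like the sketch below. It assumes every metric is higher-is-better and arrives as a plain dict; `should_promote`, the metric names, and the thresholds are illustrative, not a standard API.

```python
def should_promote(champion: dict, challenger: dict, primary: str,
                   min_gain: float = 0.0,
                   guardrails=(), max_regression: float = 0.0) -> bool:
    """Promote only if the challenger gains on the primary metric
    and does not regress on any guardrail metric beyond max_regression."""
    if challenger[primary] - champion[primary] < min_gain:
        return False
    return all(champion[m] - challenger[m] <= max_regression
               for m in guardrails)


# Hypothetical held-out and business metrics for the two models.
champion = {"auc": 0.80, "conversion": 0.050}
challenger = {"auc": 0.83, "conversion": 0.049}

decision = should_promote(champion, challenger, primary="auc",
                          min_gain=0.01, guardrails=["conversion"],
                          max_regression=0.005)
```

Passing this gate only earns the challenger a progressive rollout; the shadow or canary stage still decides whether it replaces the champion.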