MLOps | Medium | Asked at: Google, Spotify, Airbnb, Databricks, Weights and Biases
How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?
The short answer
ML CI/CD must validate model quality as well as code correctness. That adds automated retraining triggers, data validation, model evaluation gates, and model-aware canary checks, most of which have no direct equivalent in a standard software pipeline. A regression in model AUC is as much a deployment failure as a 500 error.
How to think about it
Standard software CI/CD tests code changes deterministically — the same input always produces the same output. ML CI/CD must additionally validate non-deterministic artifacts (model weights) against quality thresholds, and it must handle the fact that the trigger for retraining can be data drift, not just a code commit.
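The data-drift trigger mentioned above can be sketched as a small check that runs on a schedule and starts the pipeline when a feature's distribution has moved. This is a minimal sketch, assuming the Population Stability Index (PSI) as the drift measure; the function names `psi` and `should_retrain`, the bin count, and the 0.2 threshold are illustrative choices, not something the pipeline above prescribes.

```python
import math
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def should_retrain(reference: Sequence[float], current: Sequence[float],
                   threshold: float = 0.2) -> bool:
    """Hypothetical trigger: request retraining when PSI exceeds the threshold."""
    return psi(reference, current) > threshold
```

A scheduler could call `should_retrain` with the training-time feature sample as `reference` and a recent serving sample as `current`, then start the same pipeline that a code commit would.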
ML CI stages:
Code commit / data drift alert
→ Lint + unit tests (feature transforms, preprocessing functions)
→ Data validation (Great Expectations / Soda checks on training split)
→ Model training (on a subset for PR checks, full dataset for main)
→ Evaluation gate (AUC, RMSE, fairness metrics vs threshold)
→ Model registration (push to registry if gate passes)
ML CD stages:
Registry promotion event
→ Integration test (load model, score fixture inputs, check outputs)
→ Shadow deployment (mirror live traffic, compare distributions)
→ Canary rollout (5% of traffic, monitor business metrics for 30 min)
→ Progressive rollout to 100%
→ Champion replacement in registry
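The shadow deployment stage above compares what the candidate model predicts on mirrored traffic with what the champion predicts. One possible way to "compare distributions" is a two-sample Kolmogorov-Smirnov statistic over the two sets of scores. The sketch below assumes that choice; `ks_statistic`, `shadow_passes`, and the `max_ks` limit of 0.1 are hypothetical names and values for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the empirical CDFs of two score samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for x in set(a_sorted) | set(b_sorted):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def shadow_passes(champion_scores: Sequence[float],
                  shadow_scores: Sequence[float],
                  max_ks: float = 0.1) -> bool:
    """Hypothetical gate: allow the canary only if the distributions stay close."""
    return ks_statistic(champion_scores, shadow_scores) <= max_ks
```

A large statistic does not by itself mean the candidate is worse, only that it behaves differently, so a failed check is a reason to inspect before the canary rather than an automatic rejection.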
# GitHub Actions: evaluation gate
- name: Evaluate model
  run: |
    python evaluate.py \
      --model-uri runs:/${{ steps.train.outputs.run_id }}/model \
      --threshold-auc 0.91
  # evaluate.py exits 1 if AUC < threshold, failing the CI job
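The source does not show `evaluate.py` itself, so the following is only a sketch of what such a gate script might look like. It assumes the two flags used in the workflow step, computes AUC with a plain pairwise comparison instead of a library call, and uses a tiny hard-coded label/score set as a placeholder where a real script would load the model at `--model-uri` and score a held-out set.

```python
import argparse
from typing import Optional, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gate(auc: float, threshold: float) -> int:
    """Return a process exit code: 0 passes the CI step, 1 fails it."""
    return 0 if auc >= threshold else 1


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Model evaluation gate (sketch)")
    parser.add_argument("--model-uri", required=True)
    parser.add_argument("--threshold-auc", type=float, required=True)
    args = parser.parse_args(argv)

    # Placeholder data (assumption): a real script would load the model at
    # args.model_uri and score a held-out evaluation set here.
    labels, scores = [0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]

    auc = roc_auc(labels, scores)
    print(f"AUC={auc:.3f} threshold={args.threshold_auc}")
    return gate(auc, args.threshold_auc)

# Script entry point would be: raise SystemExit(main())
```

Returning the exit code from `main` rather than calling `sys.exit` inside it keeps the gate logic testable from a unit test, which fits the lint-and-unit-test stage earlier in the pipeline.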
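# Great Expectations data validation in pipeline
# (run_checkpoint is the pre-1.0 API; in GX 1.x the equivalent is likely
#  context.checkpoints.get("training_data_checkpoint").run())
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="training_data_checkpoint")
if not result.success:
    # Raising here stops the pipeline before any training compute is spent
    raise ValueError("Data validation failed: aborting training")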
Key differences from software CI/CD:
- Triggers: code change OR data drift OR schedule — not just Git push.
- Artifacts: model weights are not deterministic; you test quality, not exact equality.
- Rollback: reverting a Git commit does not revert a model — you must promote the previous registry version.
- Environment parity: training environment (GPU cluster) differs from serving environment (inference server) — both must be tested.
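The rollback point in the list above can be made concrete with a toy registry. This is a minimal in-memory sketch, not the API of any real model registry; `ModelRegistry` and its methods are hypothetical names used only to show that rollback means re-promoting the previous champion version rather than reverting a commit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelRegistry:
    """Toy in-memory registry: tracks versions and which one is champion."""
    versions: List[str] = field(default_factory=list)
    champion_history: List[str] = field(default_factory=list)

    def register(self, version: str) -> None:
        self.versions.append(version)

    def promote(self, version: str) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown model version: {version}")
        self.champion_history.append(version)

    @property
    def champion(self) -> Optional[str]:
        return self.champion_history[-1] if self.champion_history else None

    def rollback(self) -> str:
        """Re-promote the previous champion; no Git revert is involved."""
        if len(self.champion_history) < 2:
            raise RuntimeError("no previous champion to roll back to")
        self.champion_history.pop()
        return self.champion_history[-1]


registry = ModelRegistry()
registry.register("v1")
registry.register("v2")
registry.promote("v1")
registry.promote("v2")   # v2 becomes champion after a successful rollout
registry.rollback()      # a bad canary result sends serving back to v1
```

Keeping the promotion history separate from the list of registered versions is what makes the rollback a single registry operation, independent of the state of the code repository.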