datarekha
MLOps · Medium · Asked at Google, Spotify, Airbnb, Databricks, Weights and Biases

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

The short answer

ML CI/CD must validate model quality as well as code correctness. It adds retraining triggers, data validation, and model evaluation gates that standard software pipelines have no equivalent for, and it extends canary checks from service health to model and business metrics. A regression in model AUC is as much a deployment failure as a 500 error.

How to think about it

Standard software CI/CD tests code changes deterministically — the same input always produces the same output. ML CI/CD must additionally validate non-deterministic artifacts (model weights) against quality thresholds, and it must handle the fact that the trigger for retraining can be data drift, not just a code commit.
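
To make the data-drift trigger concrete, here is a minimal sketch that computes a Population Stability Index (PSI) for one numeric feature and fires a retraining trigger when it crosses a threshold. PSI is only one of several drift measures and is not prescribed by the answer above; the function names, the equal-width binning, and the 0.2 threshold (a common rule of thumb) are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def should_retrain(
    reference: Sequence[float],
    current: Sequence[float],
    psi_threshold: float = 0.2,  # assumed threshold, tune per feature
) -> bool:
    """Hypothetical trigger: request retraining when drift exceeds the threshold."""
    return population_stability_index(reference, current) > psi_threshold
```

A scheduled job could call `should_retrain` on the training-time sample and the latest serving window for each monitored feature, then start the same pipeline a code commit would start.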

ML CI stages:

Code commit / data drift alert
  → Lint + unit tests (feature transforms, preprocessing functions)
  → Data validation (Great Expectations / Soda checks on training split)
  → Model training (on a subset for PR checks, full dataset for main)
  → Evaluation gate (AUC, RMSE, fairness metrics vs threshold)
  → Model registration (push to registry if gate passes)
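
The first CI stage treats feature transforms as ordinary deterministic code, so plain unit tests apply. The sketch below assumes a hypothetical `standardize` transform; the function and test names are illustrative and not taken from any particular codebase.

```python
import math
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Hypothetical feature transform: scale to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - mean) / std for v in values]


def test_standardize_has_zero_mean_and_unit_variance() -> None:
    out = standardize([1.0, 2.0, 3.0, 4.0])
    assert math.isclose(sum(out) / len(out), 0.0, abs_tol=1e-9)
    assert math.isclose(sum(v * v for v in out) / len(out), 1.0, abs_tol=1e-9)


def test_standardize_handles_constant_column() -> None:
    assert standardize([5.0, 5.0, 5.0]) == [0.0, 0.0, 0.0]
```

Tests like these run in seconds under a test runner such as pytest, so they can gate every pull request before any training job is started.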

ML CD stages:

Registry promotion event
  → Integration test (load model, score fixture inputs, check outputs)
  → Shadow deployment (mirror live traffic, compare distributions)
  → Canary rollout (5% of traffic, monitor business metrics for 30 min)
  → Progressive rollout to 100%
  → Champion replacement in registry
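
The canary step needs an explicit rule for when to continue and when to back out. The following is a rough sketch of such a rule, comparing the canary slice with the champion slice over the monitoring window. The `WindowStats` type, the use of conversions as the business metric, and every threshold are assumptions for illustration; a production gate would more likely use a statistical test than fixed margins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregates for one traffic slice over the monitoring window."""
    requests: int
    errors: int
    conversions: int  # stand-in for whatever business metric is tracked

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.requests if self.requests else 0.0


def canary_decision(
    champion: WindowStats,
    canary: WindowStats,
    min_requests: int = 1000,
    max_error_rate_increase: float = 0.005,
    max_relative_conversion_drop: float = 0.05,
) -> str:
    """Return "wait", "rollback", or "promote" for the canary slice."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if canary.error_rate - champion.error_rate > max_error_rate_increase:
        return "rollback"  # serving health regressed
    floor = champion.conversion_rate * (1.0 - max_relative_conversion_drop)
    if canary.conversion_rate < floor:
        return "rollback"  # business metric regressed
    return "promote"
```

Returning a third "wait" state keeps a low-traffic canary from being promoted or rolled back on noise alone.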

# GitHub Actions: evaluation gate
# Assumes an earlier step with `id: train` that sets a `run_id` output.
- name: Evaluate model
  run: |
    python evaluate.py \
      --model-uri runs:/${{ steps.train.outputs.run_id }}/model \
      --threshold-auc 0.91
  # evaluate.py exits 1 if AUC < threshold, failing the CI job
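
# Great Expectations data validation in pipeline
# Note: run_checkpoint is the legacy (pre-1.0) API; in GX 1.x the equivalent is
# context.checkpoints.get("training_data_checkpoint").run()
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="training_data_checkpoint")
if not result.success:
    # Stop before training so a bad batch never produces a model
    raise ValueError("Data validation failed: aborting training")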
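
The `evaluate.py` script called by the GitHub Actions step is not shown in the answer. Here is a minimal sketch of what it might look like, keeping the `--model-uri` and `--threshold-auc` flags from the step and returning exit code 1 when AUC falls below the threshold. The `load_holdout_scores` hook is a hypothetical placeholder: how the model is loaded from the URI and scored on held-out data depends on the tracking tool in use. AUC is computed with a simple pairwise count to avoid extra dependencies.

```python
import argparse
from typing import Callable, List, Optional, Sequence, Tuple

Scored = Tuple[List[int], List[float]]  # (true labels, predicted scores)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    # O(len(pos) * len(neg)); fine for a sketch, too slow for large holdout sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def load_holdout_scores(model_uri: str) -> Scored:
    """Hypothetical hook: load the model at model_uri and score the holdout set."""
    raise NotImplementedError("wire this to your model registry and holdout data")


def main(
    argv: Optional[Sequence[str]] = None,
    scorer: Callable[[str], Scored] = load_holdout_scores,
) -> int:
    parser = argparse.ArgumentParser(description="Model evaluation gate")
    parser.add_argument("--model-uri", required=True)
    parser.add_argument("--threshold-auc", type=float, required=True)
    args = parser.parse_args(argv)

    labels, scores = scorer(args.model_uri)
    auc = roc_auc(labels, scores)
    print(f"AUC={auc:.4f} threshold={args.threshold_auc:.4f}")
    # A non-zero return value becomes the process exit code and fails the CI job.
    return 0 if auc >= args.threshold_auc else 1
```

In a real script the last line would be `sys.exit(main())` under an `if __name__ == "__main__":` guard. Passing `scorer` as a parameter keeps the gate logic testable without a trained model.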

Key differences from software CI/CD:

  • Triggers: code change OR data drift OR schedule — not just Git push.
  • Artifacts: model weights are not deterministic; you test quality, not exact equality.
  • Rollback: reverting a Git commit does not revert a model — you must promote the previous registry version.
  • Environment parity: training environment (GPU cluster) differs from serving environment (inference server) — both must be tested.
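
The rollback point can be illustrated with a toy in-memory registry: model versions are immutable, and rollback only moves the champion pointer back to the version that was serving before. The `ModelRegistry` class and its methods are hypothetical and do not correspond to the API of any specific registry product.

```python
from typing import Dict, List, Optional


class ModelRegistry:
    """Toy registry: immutable versions plus a movable champion pointer."""

    def __init__(self) -> None:
        self._artifacts: Dict[int, str] = {}
        self._champion_history: List[int] = []

    def register(self, artifact_uri: str) -> int:
        """Store a new model version and return its version number."""
        version = len(self._artifacts) + 1
        self._artifacts[version] = artifact_uri
        return version

    def promote(self, version: int) -> None:
        """Make the given version the champion that serving should load."""
        if version not in self._artifacts:
            raise KeyError(f"unknown model version {version}")
        self._champion_history.append(version)

    def rollback(self) -> int:
        """Re-point the champion at the version that served before the current one."""
        if len(self._champion_history) < 2:
            raise RuntimeError("no previous champion to roll back to")
        self._champion_history.pop()
        return self._champion_history[-1]

    @property
    def champion(self) -> Optional[int]:
        return self._champion_history[-1] if self._champion_history else None

    def versions(self) -> List[int]:
        return sorted(self._artifacts)
```

Because rollback never deletes an artifact, the regressed version stays available for offline debugging while the previous champion serves traffic again.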
