MLOps, Medium. Asked at Google, Spotify, Airbnb, Databricks, Weights and Biases

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

For: MLOps Engineer, ML Engineer, AI / LLM Engineer, Data Engineer

The short answer

ML CI/CD must validate model quality as well as code correctness. That adds stages that standard software pipelines have no equivalent for: automated retraining triggers, data validation, model evaluation gates, and canary deployment checks. A regression in model AUC is as much a deployment failure as a 500 error.

How to think about it

Standard software CI/CD tests code changes deterministically — the same input always produces the same output. ML CI/CD must additionally validate non-deterministic artifacts (model weights) against quality thresholds, and it must handle the fact that the trigger for retraining can be data drift, not just a code commit.
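A drift-based trigger can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes the Population Stability Index (PSI) as the drift measure, and the function names, the bin count, and the 0.2 threshold (a common rule of thumb) are all assumptions introduced here.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float], actual: Sequence[float], bins: int = 10
) -> float:
    """PSI between a reference sample and a recent sample of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def bucket_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # A small floor keeps log() defined when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(code_changed: bool, psi: float, psi_threshold: float = 0.2) -> bool:
    """Trigger the pipeline on a code change OR on drift above the (assumed) threshold."""
    return code_changed or psi > psi_threshold
```

In practice the drift score would be computed per feature by a monitoring job, and the pipeline would be started through whatever trigger mechanism the CI system offers.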

ML CI stages:

Code commit / data drift alert
  → Lint + unit tests (feature transforms, preprocessing functions)
  → Data validation (Great Expectations / Soda checks on training split)
  → Model training (on a subset for PR checks, full dataset for main)
  → Evaluation gate (AUC, RMSE, fairness metrics vs threshold)
  → Model registration (push to registry if gate passes)
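The first CI stage is ordinary deterministic testing applied to feature code. A minimal sketch, assuming a hypothetical transform named `log1p_amount` (not from any particular codebase):

```python
import math
import unittest


def log1p_amount(amount: float) -> float:
    """Hypothetical feature transform: compress a non-negative monetary amount."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return math.log1p(amount)


class TestLog1pAmount(unittest.TestCase):
    def test_zero_maps_to_zero(self):
        self.assertEqual(log1p_amount(0.0), 0.0)

    def test_is_monotonic(self):
        self.assertLess(log1p_amount(10.0), log1p_amount(100.0))

    def test_rejects_negative_amounts(self):
        with self.assertRaises(ValueError):
            log1p_amount(-1.0)
```

Tests like these run on every commit with a standard test runner; only the later stages (data validation, training, evaluation gate) need ML-specific handling.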

ML CD stages:

Registry promotion event
  → Integration test (load model, score fixture inputs, check outputs)
  → Shadow deployment (mirror live traffic, compare distributions)
  → Canary rollout (5% traffic, monitor business metrics 30 min)
  → Progressive rollout to 100%
  → Champion replacement in registry
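The canary step ends in a promote-or-rollback decision. The sketch below shows one way that decision could be encoded. The `WindowStats` fields, the use of conversions as the business metric, and every threshold are illustrative assumptions, not values from a specific platform.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts collected over the monitoring window for one traffic slice."""
    requests: int
    errors: int
    conversions: int  # stand-in for whatever business metric is monitored

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,                     # assumed minimum sample size
    max_error_rate_increase: float = 0.005,      # assumed absolute tolerance
    max_relative_conversion_drop: float = 0.05,  # assumed relative tolerance
) -> str:
    """Return 'wait', 'rollback', or 'promote' for the canary slice."""
    if canary.requests < min_requests:
        return "wait"  # too little traffic to judge
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if baseline.conversion_rate > 0:
        drop = (baseline.conversion_rate - canary.conversion_rate) / baseline.conversion_rate
        if drop > max_relative_conversion_drop:
            return "rollback"
    return "promote"
```

A real rollout controller would typically add a statistical test rather than fixed tolerances, but the shape of the decision stays the same: compare the canary slice against the champion on both technical and business metrics.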

# GitHub Actions: evaluation gate
- name: Evaluate model
  run: |
    python evaluate.py \
      --model-uri runs:/${{ steps.train.outputs.run_id }}/model \
      --threshold-auc 0.91
  # evaluate.py exits 1 if AUC < threshold, failing the CI job
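The core of a script like `evaluate.py` could look like the sketch below. The original does not show the script, so this is an assumption: it computes ROC AUC in plain Python using the rank-sum identity and maps the result to an exit code. The names `roc_auc` and `gate_exit_code` are hypothetical, and loading the model from the `runs:/` URI is left out.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the rank-sum (Mann-Whitney U) identity, averaging ranks over ties."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def gate_exit_code(auc: float, threshold: float) -> int:
    """0 lets the CI job continue; 1 fails it."""
    return 0 if auc >= threshold else 1
```

The script's entry point would parse `--threshold-auc`, score a held-out set, and finish with `sys.exit(gate_exit_code(auc, threshold))`, so that the CI runner sees a non-zero status when the model falls below the bar.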

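# Great Expectations data validation in pipeline
# Note: run_checkpoint is the legacy (pre-1.0) API. In GX 1.x, fetch the
# checkpoint with context.checkpoints.get(...) and call .run() on it.
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="training_data_checkpoint")
if not result.success:
    # Stop before training so a bad data split never produces a model
    raise ValueError("Data validation failed: aborting training")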

Key differences from software CI/CD:

Triggers: code change OR data drift OR schedule — not just Git push.
Artifacts: model weights are not deterministic; you test quality, not exact equality.
Rollback: reverting a Git commit does not revert a model — you must promote the previous registry version.
Environment parity: training environment (GPU cluster) differs from serving environment (inference server) — both must be tested.
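The rollback point above can be made concrete with a toy registry. This is an in-memory sketch for illustration only: real registries such as MLflow expose different APIs, and the class, its method names, and the artifact URIs in the usage example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelRegistry:
    """Toy in-memory registry that tracks versions and the champion history."""
    versions: Dict[int, str] = field(default_factory=dict)  # version -> artifact URI
    champion_history: List[int] = field(default_factory=list)

    def register(self, artifact_uri: str) -> int:
        version = len(self.versions) + 1
        self.versions[version] = artifact_uri
        return version

    def promote(self, version: int) -> None:
        if version not in self.versions:
            raise KeyError(f"unknown model version {version}")
        self.champion_history.append(version)

    @property
    def champion(self) -> Optional[int]:
        return self.champion_history[-1] if self.champion_history else None

    def rollback(self) -> int:
        """Re-promote the previous champion; the bad version stays registered for audit."""
        if len(self.champion_history) < 2:
            raise RuntimeError("no previous champion to roll back to")
        self.champion_history.pop()
        return self.champion_history[-1]


# Usage: version 2 misbehaves in production, so the champion pointer moves back to 1.
registry = ModelRegistry()
v1 = registry.register("s3://example-bucket/models/v1")  # hypothetical URI
v2 = registry.register("s3://example-bucket/models/v2")  # hypothetical URI
registry.promote(v1)
registry.promote(v2)
restored = registry.rollback()
```

Rollback here changes a pointer in the registry, not the Git history: the serving layer reloads whichever version the registry marks as champion.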
