Testing ML & the ML Test Score
Unit tests on your code aren't enough: ML systems need data tests, model tests, infrastructure tests, and monitoring. This lesson covers Google's ML Test Score rubric and the tests that actually catch production failures.
What you'll learn
- Why ML testing spans data, model, infrastructure, and monitoring
- The ML Test Score rubric and its weakest-link scoring
- Concrete tests that catch silent ML failures
You can have 100% code coverage and still ship a broken model. In ML, the code can be perfect while the data is corrupt, the model silently degrades, or the serving features don’t match training. So “testing an ML system” means far more than unit tests — and the canonical map of what to test is Google’s ML Test Score.
Four kinds of tests
The ML Test Score (Breck et al., Google) groups production-readiness tests into four categories. A real ML system needs all four — and is only as reliable as its weakest one:
- Data tests — schema validation, no training/serving skew, features aren’t stale, PII controls.
- Model tests — beats a simple baseline, tested on data slices (not just aggregate accuracy), hyperparameters tuned and logged.
- Infrastructure tests — the pipeline is reproducible, the model spec is unit-tested, the full pipeline is integration-tested, there’s a rollback path.
- Monitoring tests — serving features match training, prediction quality is watched, drift is alerted, latency/errors are monitored.
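Training/serving skew shows up in both the data and monitoring categories, so it is a good first check to automate. Here is a minimal sketch, assuming a single numeric feature and using the population stability index (PSI) as the distance measure. PSI is a swapped-in technique, not something the rubric prescribes, and the function names, the equal-width binning, and the 0.2 threshold (a common rule of thumb) are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    train: Sequence[float], serving: Sequence[float], n_bins: int = 10
) -> float:
    """PSI between two samples of one numeric feature.

    Bins are equal-width and derived from the training sample (an assumption;
    quantile bins are another common choice).
    """
    lo, hi = min(train), max(train)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() finite when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = bin_fractions(train), bin_fractions(serving)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def has_skew(
    train: Sequence[float], serving: Sequence[float], threshold: float = 0.2
) -> bool:
    """True when the serving distribution has moved past the (hypothetical) threshold."""
    return population_stability_index(train, serving) > threshold


train_feature = [i / 100 for i in range(100)]
shifted_serving = [x + 0.5 for x in train_feature]
print(has_skew(train_feature, train_feature))    # identical samples: no skew
print(has_skew(train_feature, shifted_serving))  # shifted samples: skew flagged
```

In a real pipeline the same check would run per feature, once in CI against a held-out serving log and then continuously as a monitoring alert.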
The scoring insight: minimum, not average
The ML Test Score is deliberately a minimum, not an average. Each test earns 0 points if it is absent, 0.5 if it is run manually with documented results, and 1 if it is automated; points are summed within each category, and the final score is the lowest of the four category totals. A brilliant model with no monitoring scores zero, because in production the unwatched dimension is exactly where you'll get paged at 3 a.m. This forces balanced investment instead of over-polishing the modeling and ignoring the ops.
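The scoring rule is easy to sketch in code. The example below assumes a tiny rubric with two tests per category; the test names are hypothetical shorthand for the bullets above, not the rubric's official wording, and the `Status` enum and `ml_test_score` function are illustrative.

```python
from enum import Enum
from typing import Dict, Tuple


class Status(Enum):
    MISSING = 0.0    # test does not exist
    MANUAL = 0.5     # run by hand
    AUTOMATED = 1.0  # runs automatically


def ml_test_score(
    rubric: Dict[str, Dict[str, Status]]
) -> Tuple[float, Dict[str, float]]:
    """Sum points within each category, then take the minimum across categories."""
    per_category = {
        category: sum(status.value for status in tests.values())
        for category, tests in rubric.items()
    }
    return min(per_category.values()), per_category


# Hypothetical pipeline: strong modeling, no monitoring at all.
rubric = {
    "data": {"schema_validation": Status.AUTOMATED, "pii_controls": Status.MANUAL},
    "model": {"beats_baseline": Status.AUTOMATED, "slice_eval": Status.AUTOMATED},
    "infrastructure": {"reproducible": Status.MANUAL, "rollback": Status.AUTOMATED},
    "monitoring": {"drift_alert": Status.MISSING, "latency_alert": Status.MISSING},
}

score, breakdown = ml_test_score(rubric)
print(breakdown)  # per-category totals
print(score)      # 0.0, capped by the monitoring category
```

Averaging the same breakdown would give 1.25 and hide the gap; the minimum makes it impossible to ignore.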
Concrete tests that earn their keep
Beyond the rubric, a few tests catch a disproportionate share of real failures. Three worth automating:
- A behavioral (invariance) test: perturb an input in a way that shouldn't change the prediction, for example by adding a neutral word, and assert that it doesn't.
- A slice test: accuracy must hold on each important segment, not just overall, which also ties into fairness.
- A schema test: fail the build when an upstream column changes type (see data contracts).
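The slice and schema tests can be sketched in plain Python. This assumes evaluation rows arrive as dictionaries with `label` and `prediction` keys; the column names, the `EXPECTED_SCHEMA` contract, and the 0.8 accuracy floor are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def accuracy_by_slice(rows: List[Dict[str, Any]], slice_key: str) -> Dict[Any, float]:
    """Accuracy computed separately for each value of the slice column."""
    hits: Dict[Any, int] = defaultdict(int)
    totals: Dict[Any, int] = defaultdict(int)
    for row in rows:
        segment = row[slice_key]
        totals[segment] += 1
        hits[segment] += int(row["label"] == row["prediction"])
    return {segment: hits[segment] / totals[segment] for segment in totals}


def failing_slices(
    rows: List[Dict[str, Any]], slice_key: str, min_accuracy: float
) -> Dict[Any, float]:
    """Slices whose accuracy falls below the floor, even if the aggregate looks fine."""
    by_slice = accuracy_by_slice(rows, slice_key)
    return {s: acc for s, acc in by_slice.items() if acc < min_accuracy}


# Hypothetical contract for one upstream table.
EXPECTED_SCHEMA = {"user_id": int, "country": str, "amount": float}


def schema_violations(row: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Human-readable problems; an empty list means the row matches the contract."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(f"{column}: expected {expected_type.__name__}, got {actual}")
    return problems


# Aggregate accuracy is 0.9, but the small "DE" slice sits at 0.5.
eval_rows = (
    [{"country": "US", "label": 1, "prediction": 1} for _ in range(8)]
    + [{"country": "DE", "label": 1, "prediction": 1},
       {"country": "DE", "label": 1, "prediction": 0}]
)
print(failing_slices(eval_rows, "country", min_accuracy=0.8))  # {'DE': 0.5}

# An upstream change turned `amount` into a string.
print(schema_violations({"user_id": 7, "country": "US", "amount": "12.50"}, EXPECTED_SCHEMA))
```

In CI, either function returning a non-empty result would fail the build.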
Next
Tests need versioned data to be reproducible (data versioning) and a gate to enforce them before promotion (model registry).
Practice this in an interview
Unlike traditional software, ML systems need tests across four areas: the data, the model and training, the infrastructure and pipeline, and ongoing monitoring, because behavior depends on data, not just code. Google's ML Test Score is a rubric of 28 actionable tests across those four categories that scores a system's production readiness and technical debt. A low score flags fragile, hard-to-maintain systems even if offline accuracy looks good.
Behavioral tests check a model's input-output behavior against expectations rather than just aggregate accuracy, an idea popularized by the CheckList framework. Invariance tests assert that label-preserving perturbations do not change the prediction, directional tests assert a change moves the output the expected way, and minimum-functionality tests are simple cases the model must get right. They catch real-world failures that high overall accuracy can hide.
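The three test types can be illustrated against a toy model. The keyword-counting sentiment scorer below is a hypothetical stand-in for a real model, and the tests are plain assertions in the spirit of CheckList rather than calls to the CheckList library itself.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def score(text: str) -> int:
    """Toy stand-in model: positive word count minus negative word count."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def predict(text: str) -> str:
    return "positive" if score(text) > 0 else "negative"


def test_invariance() -> None:
    # Label-preserving perturbation: a neutral word must not flip the prediction.
    base = "the pasta was great"
    for neutral in ("today", "honestly", "overall"):
        assert predict(f"{base} {neutral}") == predict(base)


def test_directional() -> None:
    # Adding praise should raise the score; adding a complaint should lower it.
    base = "the service was good"
    assert score(base + " and the dessert was great") > score(base)
    assert score(base + " but the wait was awful") < score(base)


def test_minimum_functionality() -> None:
    # Simple cases the model must always get right.
    cases = [
        ("i love it", "positive"),
        ("this is bad", "negative"),
        ("awful awful food", "negative"),
    ]
    for text, expected in cases:
        assert predict(text) == expected


if __name__ == "__main__":
    test_invariance()
    test_directional()
    test_minimum_functionality()
    print("all behavioral tests passed")
```

The functions follow the `test_` naming convention, so a runner such as pytest would collect them if `predict` were swapped for a real model's inference call.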
ML CI/CD must validate not just code correctness but also model quality: data validation, model evaluation gates, and automated retraining triggers have no direct counterpart in standard software pipelines, and canary deployments must watch prediction quality as well as errors. A regression in model AUC is as much a deployment failure as a 500 error.
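A model evaluation gate can be as small as a function that compares the candidate's metrics with production's and returns a pass/fail verdict for the pipeline to act on. This is a minimal sketch: the metric dictionary shape, the `evaluation_gate` name, and both thresholds are assumptions, not part of any particular CI system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateResult:
    passed: bool
    reasons: List[str]


def evaluation_gate(
    candidate: Dict[str, float],
    production: Dict[str, float],
    min_auc: float = 0.70,        # hypothetical absolute floor
    max_auc_drop: float = 0.005,  # hypothetical tolerated regression
) -> GateResult:
    """Block promotion when the candidate is too weak or regresses against production."""
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"AUC {candidate['auc']:.3f} is below the floor {min_auc:.3f}")
    drop = production["auc"] - candidate["auc"]
    if drop > max_auc_drop:
        reasons.append(f"AUC dropped by {drop:.3f} (allowed: {max_auc_drop:.3f})")
    return GateResult(passed=not reasons, reasons=reasons)


result = evaluation_gate(candidate={"auc": 0.81}, production={"auc": 0.84})
print(result.passed)   # False: the candidate regresses against production
print(result.reasons)
```

A CI step would typically exit with a non-zero status when `result.passed` is false, so that the regression fails the build the same way a broken unit test would.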
Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.
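One way to make the four layers concrete is a threshold table keyed by layer, checked against a metrics snapshot. Every metric name and limit below is hypothetical; real values depend on the system and on whoever owns each layer.

```python
from typing import Dict, List, Tuple

# Hypothetical upper limits, grouped by monitoring layer.
THRESHOLDS: Dict[str, Dict[str, float]] = {
    "data_quality": {"null_rate": 0.02},
    "model_behaviour": {"prediction_drift_psi": 0.2},
    "operational_health": {"p99_latency_ms": 300.0, "error_rate": 0.01},
    "business_kpis": {"conversion_drop_pct": 5.0},
}


def breached_alerts(
    metrics: Dict[str, Dict[str, float]],
    thresholds: Dict[str, Dict[str, float]] = THRESHOLDS,
) -> List[Tuple[str, str, str]]:
    """Return (layer, metric, reason) for every limit that is exceeded or unreported."""
    alerts = []
    for layer, limits in thresholds.items():
        for name, limit in limits.items():
            value = metrics.get(layer, {}).get(name)
            if value is None:
                # A metric that stops reporting is itself a monitoring failure.
                alerts.append((layer, name, "metric missing"))
            elif value > limit:
                alerts.append((layer, name, f"{value} > {limit}"))
    return alerts


snapshot = {
    "data_quality": {"null_rate": 0.05},
    "model_behaviour": {"prediction_drift_psi": 0.08},
    "operational_health": {"p99_latency_ms": 210.0, "error_rate": 0.002},
    # business_kpis is absent: nobody is reporting it yet.
}
for alert in breached_alerts(snapshot):
    print(alert)
```

Keeping the layer in each alert makes it straightforward to route it to the right owner.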