How do you test an ML system, and what is the ML Test Score?
Unlike traditional software, an ML system's behavior depends on data, not just code, so it needs tests in four areas: the data, the model and training, the infrastructure and pipeline, and ongoing monitoring. Google's ML Test Score is a rubric of 28 actionable tests across those four categories that scores a system's production readiness and technical debt. A low score flags a fragile, hard-to-maintain system even if offline accuracy looks good.
How to think about it
The short answer
ML systems can’t be tested like ordinary code because their behavior is determined by data, not just logic. You test across four areas: (1) data, (2) model/training, (3) infrastructure/pipeline, and (4) monitoring over time. Google’s ML Test Score is a rubric of 28 actionable tests spanning exactly those four categories, producing a score for production readiness and technical-debt risk.
Why a special rubric
A model can pass every unit test and still fail in production because the data drifted, a feature was silently dropped, or training and serving compute a feature differently. The ML Test Score paper (Breck, Cai, Nielsen, Salib, Sculley) was written because, as its abstract argues, ML system behavior depends strongly on data and models that cannot be strongly specified a priori, so you test the system around the model and not only the model’s code.
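To make the last failure mode concrete, here is a minimal sketch of a training-serving skew check. The two feature functions, their names, and the sample inputs are hypothetical and not taken from the paper; they stand in for a batch pipeline and an online serving path that were implemented separately.

```python
import math

def training_feature(price: float) -> float:
    """Hypothetical batch-pipeline feature: natural log of (1 + price)."""
    return math.log1p(price)

def serving_feature(price: float) -> float:
    """Hypothetical serving-path reimplementation of the same feature.
    It uses a base-10 log by mistake, which is the skew we want to catch."""
    return math.log10(price + 1.0)

def max_skew(examples, train_fn, serve_fn) -> float:
    """Largest absolute difference between the two code paths on shared inputs."""
    return max(abs(train_fn(x) - serve_fn(x)) for x in examples)

# Illustrative inputs; a real check would replay logged serving requests.
examples = [0.0, 9.0, 99.0, 999.0]
skew = max_skew(examples, training_feature, serving_feature)
skew_detected = skew > 1e-9  # tolerance is an assumption, tune per feature
```

Each function would pass a unit test that only checks its own output at `price = 0.0`. Only the comparison across both paths on the same inputs exposes the mismatch, which is why this belongs in system-level tests.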
The four categories, with example tests
- Data: feature distributions are validated, no feature has leakage, a schema/contract catches anomalies.
- Model: every hyperparameter is tuned, the model is tested against a simple baseline, the impact of model staleness is known.
- Infrastructure: training is reproducible, the full pipeline is integration-tested, you can roll back a model.
- Monitoring: training-serving skew is detected, prediction quality and data invariants are monitored in prod.
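Two of the tests above, a data schema check and a baseline comparison, can be sketched as plain functions. This is a minimal illustration under stated assumptions: the schema format, feature names, rows, labels, and predictions are all hypothetical, and a production system would more likely rely on a dedicated data-validation tool than on hand-written checks.

```python
# Hypothetical schema: feature name -> (expected type, min, max). Illustrative only.
SCHEMA = {
    "age": (int, 0, 120),
    "income": (float, 0.0, 1e7),
}

def validate_row(row: dict, schema: dict) -> list:
    """Data test: return human-readable schema violations for one row."""
    errors = []
    for name, (typ, lo, hi) in schema.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
            continue
        value = row[name]
        if not isinstance(value, typ):
            errors.append(f"{name}: expected {typ.__name__}, got {type(value).__name__}")
        elif not lo <= value <= hi:
            errors.append(f"{name}: {value} outside [{lo}, {hi}]")
    return errors

def accuracy(preds, labels) -> float:
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def beats_baseline(model_preds, labels, margin: float = 0.0) -> bool:
    """Model test: the model must beat a majority-class baseline by `margin`."""
    majority = max(set(labels), key=labels.count)
    baseline_acc = accuracy([majority] * len(labels), labels)
    return accuracy(model_preds, labels) > baseline_acc + margin

# Toy data, invented for illustration.
good_row = {"age": 34, "income": 52000.0}
bad_row = {"age": 340}  # out-of-range age, missing income
labels = [1, 0, 1, 1, 0, 1]
model_preds = [1, 0, 1, 1, 1, 1]
```

Here `validate_row(bad_row, SCHEMA)` reports both the out-of-range age and the missing feature, and `beats_baseline(model_preds, labels)` passes only because the model does better than always predicting the majority class.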
How it’s scored
Each test earns 0 points if it is not done, 0.5 if it is run manually with the results documented, and 1 if it is automated and run on a repeated basis. You sum the points within each category and take the minimum of the four category sums as the system’s overall score, so one neglected dimension (often monitoring) caps the result.
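The arithmetic can be sketched in a few lines. The function name, status labels, and the example system below are assumptions made for illustration; the only things taken from the rubric are the 0 / 0.5 / 1 point values, the per-category sum, the minimum across categories, and seven tests per category (28 in total).

```python
# Points per test status, following the scoring scheme described above.
POINTS = {"not_done": 0.0, "manual": 0.5, "automated": 1.0}

def ml_test_score(status_by_category: dict) -> tuple:
    """Return (overall score, per-category sums).

    The overall score is the minimum category sum, so the weakest
    category determines the result.
    """
    per_category = {
        category: sum(POINTS[status] for status in statuses)
        for category, statuses in status_by_category.items()
    }
    return min(per_category.values()), per_category

# Hypothetical system with seven tests per category; statuses are invented.
system = {
    "data": ["automated"] * 4 + ["manual"] * 2 + ["not_done"] * 1,
    "model": ["automated"] * 3 + ["manual"] * 2 + ["not_done"] * 2,
    "infrastructure": ["automated"] * 5 + ["manual"] * 2,
    "monitoring": ["manual"] * 2 + ["not_done"] * 5,
}

overall, per_category = ml_test_score(system)
print(per_category)  # {'data': 5.0, 'model': 4.0, 'infrastructure': 6.0, 'monitoring': 1.0}
print(overall)       # 1.0
```

In this example the infrastructure category scores 6.0 out of 7, yet the overall score is 1.0 because monitoring has only two manual tests. Using the minimum instead of the total is what stops strength in one area from hiding neglect in another.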
Common follow-up / trap
A classic probe: “Your offline accuracy is 0.95 — is it production-ready?” The strong answer says accuracy is one cell of a 28-test grid; without data validation, reproducibility, rollback, and skew monitoring it isn’t ready. The trap is conflating “the model is accurate” with “the system is tested.” Name a couple of concrete tests per category to show depth.