datarekha

Testing ML & the ML Test Score

Unit tests on your code aren't enough — ML systems need data tests, model tests, infra tests, and monitoring. Google's ML Test Score rubric, and the tests that actually catch production failures.

7 min read Intermediate MLOps Lesson 8 of 28

What you'll learn

  • Why ML testing spans data, model, infrastructure, and monitoring
  • The ML Test Score rubric and its weakest-link scoring
  • Concrete tests that catch silent ML failures

You can have 100% code coverage and still ship a broken model. In ML, the code can be perfect while the data is corrupt, the model silently degrades, or the serving features don’t match training. So “testing an ML system” means far more than unit tests — and the canonical map of what to test is Google’s ML Test Score.

Four kinds of tests

The ML Test Score (Breck et al., Google) groups production-readiness tests into four categories. A real ML system needs all four — and is only as reliable as its weakest one:

  • Data tests — schema validation, no training/serving skew, features aren’t stale, PII controls.
  • Model tests — beats a simple baseline, tested on data slices (not just aggregate accuracy), hyperparameters tuned and logged.
  • Infrastructure tests — the pipeline is reproducible, the model spec is unit-tested, the full pipeline is integration-tested, there’s a rollback path.
  • Monitoring tests — serving features match training, prediction quality is watched, drift is alerted, latency/errors are monitored.
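
As one illustration of the "serving features match training" check, the sketch below runs the same raw records through two feature functions and reports every value that differs. The function names (`training_features`, `serving_features`, `find_skew`) and the sample records are hypothetical, and the serving path contains a deliberate off-by-one bug so the check has something to catch.

```python
import math

def training_features(record):
    """Hypothetical batch feature code used at training time."""
    return {
        "amount_log": math.log1p(record["amount"]),
        "is_weekend": 1.0 if record["day_of_week"] >= 5 else 0.0,
    }

def serving_features(record):
    """Hypothetical online feature code; day 5 is missed by mistake."""
    return {
        "amount_log": math.log1p(record["amount"]),
        "is_weekend": 1.0 if record["day_of_week"] > 5 else 0.0,
    }

def find_skew(records, tol=1e-9):
    """Return (record_index, feature_name) for every value that differs."""
    mismatches = []
    for i, record in enumerate(records):
        train = training_features(record)
        serve = serving_features(record)
        for name in train:
            if not math.isclose(train[name], serve[name], abs_tol=tol):
                mismatches.append((i, name))
    return mismatches

# Made-up raw records; day_of_week uses 0 = Monday ... 6 = Sunday.
records = [
    {"amount": 12.5, "day_of_week": 2},
    {"amount": 80.0, "day_of_week": 5},
    {"amount": 3.0, "day_of_week": 6},
]
skew = find_skew(records)
print(skew)
```

In a real pipeline the two functions would be the actual training and serving code paths, and the records would be a sample of logged serving traffic.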

The scoring insight: minimum, not average

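The ML Test Score is deliberately the minimum across categories, not the average. Each test earns 0 if it isn’t done, 0.5 if it is run manually with documented results, and 1 if it is automated; the points are summed within each category, and the final score is the lowest of the four category totals. A brilliant model with no monitoring scores zero, because in production the unwatched dimension is exactly where you’ll get paged at 3 a.m. This forces balanced investment instead of over-polishing the modeling and ignoring the ops.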
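
The scoring rule is small enough to sketch directly. The pipeline below is a made-up example with strong modeling and no monitoring; the test names are illustrative placeholders, not the rubric's official wording.

```python
# Points per test: 0 = not done, 0.5 = manual, 1 = automated.
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}

def ml_test_score(pipeline):
    """Sum points within each category, then take the minimum category total."""
    totals = {
        category: sum(POINTS[status] for status in tests.values())
        for category, tests in pipeline.items()
    }
    return min(totals.values()), totals

# Hypothetical pipeline: polished model, nothing watching it in production.
pipeline = {
    "data": {"schema_validation": "automated", "pii_controls": "manual"},
    "model": {
        "beats_baseline": "automated",
        "slice_eval": "automated",
        "hyperparameters_tuned": "automated",
    },
    "infrastructure": {"reproducible_training": "manual", "rollback": "automated"},
    "monitoring": {"skew_check": "none", "drift_alert": "none"},
}

score, totals = ml_test_score(pipeline)
average = sum(totals.values()) / len(totals)
print(totals)
print("score (minimum):", score, "| average:", average)
```

Here the average is 1.5 while the score is 0.0: the average hides the unmonitored category, and the minimum exposes it.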

Concrete tests that earn their keep

Beyond the rubric, a few tests catch a disproportionate share of real failures. Those worth automating include:

  • Behavioral / invariance test: perturb an input in a way that shouldn’t change the prediction (for example, add a neutral word) and assert that it doesn’t.
  • Slice test: accuracy must hold on each important segment, not just overall, which also ties to fairness.
  • Schema test: fail the build when an upstream column changes type (see data contracts).
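
A minimal sketch of the invariance, slice, and schema tests, using a toy keyword classifier as a stand-in for a real model. Everything here is an assumption made for illustration: the `predict` function, the 0.8 accuracy floor, the segment names, and the expected schema.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def predict(text):
    """Toy stand-in for a real model: 1 = positive, 0 = negative."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score > 0 else 0

def check_invariance(texts, neutral_word="today"):
    # Adding a neutral word must not change the prediction.
    for text in texts:
        assert predict(text) == predict(f"{text} {neutral_word}"), text

def slice_accuracy(rows):
    """Accuracy per segment; rows are (segment, text, label) tuples."""
    hits, counts = {}, {}
    for segment, text, label in rows:
        counts[segment] = counts.get(segment, 0) + 1
        hits[segment] = hits.get(segment, 0) + (predict(text) == label)
    return {segment: hits[segment] / counts[segment] for segment in counts}

def check_slices(rows, floor=0.8):
    # Every segment must clear the floor, not just the overall average.
    for segment, acc in slice_accuracy(rows).items():
        assert acc >= floor, f"slice {segment!r}: accuracy {acc:.2f} < {floor}"

EXPECTED_SCHEMA = {"user_id": int, "amount": float, "country": str}

def schema_errors(row):
    """List columns that are missing or have an unexpected type."""
    errors = []
    for col, typ in EXPECTED_SCHEMA.items():
        if col not in row:
            errors.append(f"missing column {col}")
        elif not isinstance(row[col], typ):
            errors.append(
                f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}"
            )
    return errors

rows = [
    ("mobile", "great phone", 1), ("mobile", "bad screen", 0),
    ("desktop", "love it", 1), ("desktop", "hate the fan", 0),
]
check_invariance(["great phone", "awful battery"])
check_slices(rows)
# An upstream change turned `amount` into a string: the schema test flags it.
print(schema_errors({"user_id": 7, "amount": "19.99", "country": "IN"}))
```

Wired into CI, a non-empty `schema_errors` result or a failed assertion would fail the build before the model ever trains on the changed data.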

Quick check

Q1. Why isn't high code-coverage enough to call an ML system 'tested'?
Q2. Why is the ML Test Score the minimum across categories rather than the average?
Q3. What is a 'slice test'?

Next

Tests need versioned data to be reproducible (data versioning) and a gate to enforce them before promotion (model registry).

Practice this in an interview

How do you test an ML system, and what is the ML Test Score?

Unlike traditional software, ML systems need tests across four areas: the data, the model and training, the infrastructure and pipeline, and ongoing monitoring, because behavior depends on data, not just code. Google's ML Test Score is a rubric of 28 actionable tests across those four categories that scores a system's production readiness and technical debt. A low score flags fragile, hard-to-maintain systems even if offline accuracy looks good.

What are behavioral tests for ML models (invariance, directional, and minimum-functionality tests)?

Behavioral tests check a model's input-output behavior against expectations rather than just aggregate accuracy, an idea popularized by the CheckList framework. Invariance tests assert that label-preserving perturbations do not change the prediction, directional tests assert a change moves the output the expected way, and minimum-functionality tests are simple cases the model must get right. They catch real-world failures that high overall accuracy can hide.
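
The directional and minimum-functionality cases can be sketched in a few lines. This is not the CheckList library's API; `sentiment_score` is a toy keyword scorer standing in for a real model, and the test sentences are made up.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def sentiment_score(text):
    """Toy stand-in for a real model: higher means more positive."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def directional_failures(texts, suffix=" and the battery is awful"):
    """Directional: appending a clearly negative clause must not raise the score."""
    return [t for t in texts if sentiment_score(t + suffix) > sentiment_score(t)]

def mft_failures(cases):
    """Minimum functionality: simple cases whose polarity the model must get right."""
    failures = []
    for text, expected_positive in cases:
        if (sentiment_score(text) > 0) != expected_positive:
            failures.append(text)
    return failures

texts = ["great phone", "the screen is good", "bad speakers"]
cases = [("i love it", True), ("i hate it", False), ("good value", True)]

# Both lists should be empty for a model that behaves as expected.
print(directional_failures(texts))
print(mft_failures(cases))
```

Returning the failing inputs instead of a bare pass/fail makes it easier to see which behavior broke when a new model version is evaluated.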

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate model quality as well as code correctness. On top of the usual build and test stages, an ML pipeline adds data validation, automated retraining triggers, model evaluation gates, and canary checks on prediction quality, most of which have no direct equivalent in a standard software pipeline. A regression in model AUC is as much a deployment failure as a 500 error.
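
A model evaluation gate can be as simple as a function the pipeline calls before promotion. The sketch below is an assumption-laden example: the thresholds (`max_auc_drop`, `min_auc`) and the metric dictionaries are hypothetical, and a real gate would read them from an evaluation job.

```python
def evaluation_gate(candidate, production, max_auc_drop=0.005, min_auc=0.75):
    """Return (passed, reasons) for promoting a candidate model."""
    reasons = []
    if candidate["auc"] < min_auc:
        reasons.append(f"AUC {candidate['auc']:.3f} is below the floor {min_auc}")
    if candidate["auc"] < production["auc"] - max_auc_drop:
        reasons.append(
            f"AUC regressed from {production['auc']:.3f} to {candidate['auc']:.3f}"
        )
    return (not reasons), reasons

# Made-up metrics for the current production model and a new candidate.
production = {"auc": 0.861}
candidate = {"auc": 0.842}

passed, reasons = evaluation_gate(candidate, production)
if not passed:
    # In CI this branch would fail the job, for example by raising or exiting non-zero.
    print("Blocked:", "; ".join(reasons))
```

Treating the gate's failure exactly like a failed unit test is what makes an AUC regression block a release the same way a 500 error would.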

What metrics should you monitor for a production ML model, and at what layer?

Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.
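
For the data-quality and model-behaviour layers, one common drift measure is the Population Stability Index (PSI), which compares the binned distribution of a feature or prediction at serving time against training. The sketch below is a plain-Python version under stated assumptions: the bin edges and samples are made up, and the 0.25 alert threshold is a widely quoted rule of thumb to tune, not a fixed standard.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached or passed.
            counts[sum(v >= e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

edges = [0.25, 0.5, 0.75]
training = [0.1, 0.2, 0.3, 0.4, 0.55, 0.6, 0.8, 0.9]
serving_ok = list(training)
serving_shifted = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.95, 0.85]

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff
for name, sample in [("ok", serving_ok), ("shifted", serving_shifted)]:
    value = psi(training, sample, edges)
    print(name, "ALERT" if value > ALERT_THRESHOLD else "stable")
```

The same function can run on model scores to watch prediction drift, while latency, error rate, and business KPIs usually come from the existing operational dashboards.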
