What are behavioral tests for ML models (invariance, directional, and minimum-functionality tests)?

Behavioral tests check a model's input-output behavior against expectations rather than just aggregate accuracy, an idea popularized by the CheckList framework. Invariance tests assert that label-preserving perturbations do not change the prediction, directional tests assert a change moves the output the expected way, and minimum-functionality tests are simple cases the model must get right. They catch real-world failures that high overall accuracy can hide.
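A minimal sketch of the three test types, assuming a toy keyword-counting sentiment model. The names `sentiment_score` and `predict`, the word lists, and the example sentences are hypothetical stand-ins, not part of CheckList itself; a real suite would call your trained model instead.

```python
# Hypothetical toy model: counts positive and negative keywords.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def sentiment_score(text):
    """Return (#positive words - #negative words) for the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def predict(text):
    """Map the score to a label."""
    return "positive" if sentiment_score(text) > 0 else "negative"

def test_minimum_functionality():
    # Simple cases the model must get right.
    assert predict("I love this phone") == "positive"
    assert predict("I hate this phone") == "negative"

def test_invariance():
    # Swapping a person's name preserves the label, so the prediction must not change.
    assert predict("Anna said the food was great") == predict("Omar said the food was great")

def test_directional():
    # Appending a negative clause must lower the score, never raise it.
    base = sentiment_score("The service was good")
    worse = sentiment_score("The service was good but the food was terrible")
    assert worse < base

for test in (test_minimum_functionality, test_invariance, test_directional):
    test()
print("behavioral tests passed")
```

Written as plain `test_*` functions, the same checks can be collected by a test runner such as pytest without changes.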

How do you test an ML system, and what is the ML Test Score?

Unlike traditional software, ML systems need tests across four areas: the data, the model and training, the infrastructure and pipeline, and ongoing monitoring, because behavior depends on data, not just code. Google's ML Test Score is a rubric of 28 actionable tests across those four categories that scores a system's production readiness and technical debt. A low score flags fragile, hard-to-maintain systems even if offline accuracy looks good.

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate model quality as well as code correctness. On top of the usual build and test stages, it adds data validation, automated retraining triggers, model evaluation gates, and canary checks on model metrics, most of which have no direct equivalent in a standard software pipeline. A regression in model AUC is as much a deployment failure as a 500 error.
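One way to picture a model evaluation gate is as a CI step that compares a candidate model against the one currently serving. The sketch below is illustrative only: `evaluation_gate`, `ci_exit_code`, the `max_drop` tolerance, and the AUC values are hypothetical and not tied to any particular CI tool.

```python
def evaluation_gate(candidate_auc, baseline_auc, max_drop=0.005):
    """Pass only if the candidate is no more than max_drop below the baseline."""
    return candidate_auc >= baseline_auc - max_drop

def ci_exit_code(candidate_auc, baseline_auc):
    """Translate the gate into a process exit code: 0 = ship, 1 = block the deploy."""
    return 0 if evaluation_gate(candidate_auc, baseline_auc) else 1

# Assumed metric values, for illustration only.
print("improved model  ->", ci_exit_code(candidate_auc=0.912, baseline_auc=0.910))
print("regressed model ->", ci_exit_code(candidate_auc=0.890, baseline_auc=0.910))
```

Returning a non-zero exit code lets the pipeline treat a metric regression the same way it treats a failing unit test.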

What metrics should you monitor for a production ML model, and at what layer?

Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behavior (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.
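As an illustration of the model-behavior layer, prediction drift is often quantified with the Population Stability Index (PSI), which compares the binned distribution of live scores with the training distribution. This is a sketch under assumptions: the bin edges, the sample scores, and the 0.2 alert threshold (a common rule of thumb) are illustrative, and PSI is only one of several drift measures.

```python
import math

def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each [edges[i], edges[i+1]) bin; the last bin includes its right edge.
    Values outside the edges are ignored in this sketch."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # Clamp empty bins to eps so the logarithm below stays defined.
    return [max(c / len(values), eps) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    e = bin_proportions(expected, edges)
    a = bin_proportions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

EDGES = [0.0, 0.25, 0.5, 0.75, 1.0]
training_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]    # assumed sample
live_scores = [0.6, 0.7, 0.8, 0.9, 0.8, 0.9, 0.85, 0.95]      # assumed sample

DRIFT_THRESHOLD = 0.2  # assumed alert level; tune per model
drift = psi(training_scores, live_scores, EDGES)
print("prediction drift alert:", drift > DRIFT_THRESHOLD)
```

The same function can be pointed at an input feature instead of model scores, which covers the data-quality layer with one metric.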

Testing ML & the ML Test Score — MLOps

The last lesson made the promotion gate demand evidence before a model ships, but left a hole exactly where it mattered: what counts as proof that an ML system is ready? “It got 0.91 F1” is not proof; it is one number that a corrupt feature can inflate and a failing slice can hide behind. This lesson supplies the evidence the gate has been waiting for.

You can have 100% code coverage and still ship a broken model. In ML, the code can be perfect while the data is corrupt, the model silently degrades, or the serving features don’t match training. So “testing an ML system” means far more than unit tests — and the canonical map of what to test is Google’s ML Test Score.

Four kinds of tests

The ML Test Score (Breck et al., Google) groups its 28 production-readiness tests into four categories of seven tests each; the lists below give representative examples. A real ML system needs all four categories, and is only as reliable as its weakest one:

Data tests — schema validation, no training/serving skew, features aren’t stale, PII controls.
Model tests — beats a simple baseline, tested on data slices (not just aggregate accuracy), hyperparameters tuned and logged.
Infrastructure tests — the pipeline is reproducible, the model spec is unit-tested, the full pipeline is integration-tested, there’s a rollback path.
Monitoring tests — serving features match training, prediction quality is watched, drift is alerted, latency/errors are monitored.


Production-ready is your weakest dimension

Google's ML Test Score rates readiness across four categories. The catch: your overall score is the minimum across categories, so a great model with no monitoring still scores zero.


The scoring insight: minimum, not average

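The ML Test Score is deliberately a minimum, not an average. Each test earns 0 points if it is missing, 0.5 if it is run manually with the results documented, and 1 if it is automated; the points are summed within each category, and the final score is the lowest of the four category totals. A brilliant model with no monitoring scores zero, because in production the unwatched dimension is exactly where you’ll get paged at 3 a.m. This forces balanced investment instead of over-polishing the model and ignoring operations.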
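The scoring rule is small enough to sketch directly. The example below assumes a hypothetical pipeline with four illustrative tests per category; the test names and their statuses are made up, and only the point values and the minimum rule come from the rubric.

```python
# Points per test status, as in the rubric: absent, manual, automated.
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}

def ml_test_score(pipeline):
    """Sum points within each category, then take the minimum across categories."""
    per_category = {
        category: sum(POINTS[status] for status in tests.values())
        for category, tests in pipeline.items()
    }
    return min(per_category.values()), per_category

# Hypothetical pipeline: strong everywhere except monitoring.
pipeline = {
    "data": {"schema": "automated", "skew": "automated", "staleness": "automated", "pii": "automated"},
    "model": {"baseline": "automated", "slices": "automated", "hyperparams": "automated", "staleness": "manual"},
    "infrastructure": {"reproducible": "automated", "unit": "automated", "integration": "manual", "rollback": "manual"},
    "monitoring": {"skew": "none", "quality": "none", "drift": "none", "latency": "none"},
}

score, per_category = ml_test_score(pipeline)
average = sum(per_category.values()) / len(per_category)
print("per category:", per_category)
print("average would be:", average)   # 2.625
print("ML Test Score:", score)        # 0.0
```

The average of 2.625 looks respectable, while the minimum of 0.0 exposes the category nobody is testing.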

Concrete tests that earn their keep

Beyond the rubric, a few tests catch a disproportionate share of real failures:

import numpy as np
import pandas as pd

# A tiny "data test" you can run in CI before training.
df = pd.DataFrame({
    "age": [25, 34, -3, 41, 999],
    "income": [50000, 62000, 58000, np.nan, 71000],
})

def validate(df):
    """Return a list of data problems; an empty list means the data passed."""
    errs = []
    # Ages outside a plausible human range point to entry or pipeline errors.
    if (df["age"] < 0).any() or (df["age"] > 120).any():
        errs.append("age out of range")
    # Tolerate at most 10% missing incomes.
    if df["income"].isna().mean() > 0.1:
        errs.append("too many missing incomes")
    if df["age"].isna().any():
        errs.append("null ages")
    return errs

problems = validate(df)
print("data test:", "PASS" if not problems else f"FAIL -> {problems}")

data test: FAIL -> ['age out of range', 'too many missing incomes']

Two failures, both the kind a model would otherwise swallow without complaint. The age column holds a -3 and a 999 — impossible values that no training run would reject on its own; the rule catches them. And income is missing one value out of five, a 20% null rate that trips the 10% ceiling. Notice what didn’t fire: age has no nulls, so “null ages” stays silent — the test reports exactly what is wrong and nothing it isn’t. Drop this single function into CI and a pull request that would have trained on broken data fails before it merges. It is the humblest test in the chapter and the highest-ROI one.

Others worth automating: a behavioral / invariance test (perturb an input in a way that shouldn’t change the prediction — e.g. add a neutral word — and assert it doesn’t), a slice test (accuracy must hold on each important segment, not just overall — ties to fairness), and a schema test that fails the build when an upstream column changes type (see data contracts).
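The slice test and the schema test can be sketched with the standard library alone. Everything here is illustrative: `accuracy_by_slice`, `failing_slices`, `schema_errors`, the `region` column, the 0.8 accuracy floor, and the sample records are hypothetical, and a real pipeline would more likely run these checks against DataFrames or a dedicated validation tool.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Accuracy computed separately for each value of slice_key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[slice_key]] += 1
        hits[r[slice_key]] += int(r["label"] == r["prediction"])
    return {s: hits[s] / totals[s] for s in totals}

def failing_slices(records, slice_key, min_accuracy):
    """Slices whose accuracy falls below the required floor."""
    by_slice = accuracy_by_slice(records, slice_key)
    return {s: acc for s, acc in by_slice.items() if acc < min_accuracy}

def schema_errors(rows, schema):
    """Report missing columns and values whose type differs from the schema."""
    errs = []
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            if column not in row:
                errs.append(f"row {i}: missing column {column!r}")
            elif type(row[column]) is not expected:
                found = type(row[column]).__name__
                errs.append(f"row {i}: {column!r} is {found}, expected {expected.__name__}")
    return errs

# Assumed evaluation records: 90% accurate overall, but weak on one region.
records = (
    [{"region": "north", "label": 1, "prediction": 1}] * 8
    + [{"region": "south", "label": 1, "prediction": 1},
       {"region": "south", "label": 1, "prediction": 0}]
)
overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
print("overall accuracy:", overall)                                             # 0.9
print("failing slices:", failing_slices(records, "region", min_accuracy=0.8))   # {'south': 0.5}

# Assumed upstream change: "age" starts arriving as a string.
EXPECTED_SCHEMA = {"age": int, "income": float}
rows = [{"age": 25, "income": 50000.0}, {"age": "34", "income": 62000.0}]
print("schema errors:", schema_errors(rows, EXPECTED_SCHEMA))
```

The aggregate accuracy of 0.9 would pass most gates, yet the south slice sits at 0.5, which is the failure a slice test exists to surface.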

In one breath

Testing an ML system means far more than unit tests, because the code can be flawless while the data is corrupt, the model has silently degraded, or serving features mismatch training — so Google’s ML Test Score spans four categories (data, model, infrastructure, monitoring) and scores a pipeline by its weakest one, not its average, forcing balanced investment; the highest-ROI test is the humblest, a data-validation gate in CI that refuses to train on broken data.

Practice

Before the quiz, sit with the minimum-not-average rule. Imagine a team with a beautifully tuned model, exhaustive data tests, a reproducible pipeline — and zero monitoring. The ML Test Score gives them a flat zero. Argue why that is the right score rather than a harsh one. Then connect it to the data test you just ran: it caught age = 999 and a 20% null rate — which of the four ML-Test-Score categories does that gate belong to, and why is it the cheapest insurance you can buy?

Quick check

Q1. Why isn't high code coverage enough to call an ML system 'tested'?

Q2. Why is the ML Test Score the minimum across categories rather than the average?

Q3. What is a 'slice test'?

A question to carry forward

Look at two of the four categories we just laid out — infrastructure tests demand a reproducible pipeline, and monitoring tests demand that serving features match training. Both quietly assume something we have never guaranteed: that the model behaves identically wherever it runs. But your data test passing in CI proves only that it passed there, on that machine, with those library versions.

And that assumption is exactly where production betrayal lives. The same code, with a different NumPy version or a different OS, can round a float differently, resolve a dependency differently, and serve a subtly different model — the classic “but it worked on my laptop.” So the question to carry forward out of testing is: how do you freeze the entire environment — interpreter, libraries, system, right down to the bytes — so that “passed in CI” actually means “behaves identically in production”? That is what Docker for ML does, and it is the next lesson.
