What are behavioral tests for ML models (invariance, directional, and minimum-functionality tests)?
Behavioral tests check a model's input-output behavior against expectations rather than just aggregate accuracy, an idea popularized by the CheckList framework. Invariance tests assert that label-preserving perturbations do not change the prediction, directional tests assert a change moves the output the expected way, and minimum-functionality tests are simple cases the model must get right. They catch real-world failures that high overall accuracy can hide.
How to think about it
The short answer
Behavioral tests check how a model behaves on specific inputs rather than only its average accuracy. There are three core types (from the CheckList framework): invariance tests (label-preserving changes shouldn’t change the output), directional tests (a change should move the output in the expected direction), and minimum-functionality tests (simple cases the model must get right).
Why
Aggregate accuracy hides systematic failures. A sentiment model at 92% accuracy can still flip its prediction when you swap a name or add a typo, and a single accuracy number never reveals that bug. Behavioral tests turn fuzzy expectations into concrete, runnable assertions you can put in CI and run on every retrain.
The three types with examples
- Invariance (INV): “This restaurant is great” and “This eatery is great” should get the same sentiment. Changing a location name in an NER input shouldn’t change the predictions for the other, unrelated tokens.
- Directional (DIR): adding “…but the service was terrible” should push sentiment more negative, never more positive. For a price model, increasing square footage (with everything else held fixed) should not lower the predicted price.
- Minimum functionality (MFT): tiny, unambiguous cases, e.g. “I love this” must be positive. They act like unit tests for a single capability.
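The three types above can be sketched as plain pytest-style functions. This is a minimal illustration, not CheckList's own API: `sentiment_score` and `sentiment_label` are hypothetical stand-ins (a tiny word-list scorer) for whatever prediction function your real model exposes, and the word lists are made up for the example.

```python
# Toy stand-in for a real sentiment model (hypothetical; swap in your own model call).
POSITIVE = {"love", "great", "good", "excellent"}
NEGATIVE = {"hate", "terrible", "bad", "awful"}


def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]; higher means more positive."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def sentiment_label(text: str) -> str:
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def test_mft_obvious_positive():
    # MFT: a tiny, unambiguous case the model must get right.
    assert sentiment_label("I love this") == "positive"


def test_inv_synonym_swap():
    # INV: a label-preserving synonym swap must not change the prediction.
    original = sentiment_label("This restaurant is great")
    perturbed = sentiment_label("This eatery is great")
    assert original == perturbed


def test_dir_negative_clause_does_not_raise_score():
    # DIR: appending a negative clause must never make the score more positive.
    base = "This restaurant is great"
    perturbed = base + ", but the service was terrible"
    assert sentiment_score(perturbed) <= sentiment_score(base)
```

With a real model you would typically generate many perturbed pairs per test from templates and report a failure rate, rather than asserting on a single hand-written pair.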
How it fits the bigger picture
These complement the model-development tests in the ML Test Score rubric. Invariance and directional tests double as robustness/fairness checks: if swapping a demographic-correlated name flips the decision, that’s both a bug and a fairness red flag.
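A name-swap invariance check of this kind might look like the sketch below. Everything here is illustrative: `name_swap_flips` is a hypothetical helper, `flawed_predict` is a deliberately broken toy model so the check has something to catch, and the names are placeholders (in practice you would draw them from a vetted list).

```python
from typing import Callable, List, Sequence, Tuple


def name_swap_flips(
    predict: Callable[[str], str],
    template: str,
    names: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (reference_name, other_name) pairs whose predictions differ."""
    reference = names[0]
    reference_pred = predict(template.format(name=reference))
    return [
        (reference, other)
        for other in names[1:]
        if predict(template.format(name=other)) != reference_pred
    ]


# Deliberately flawed toy model (hypothetical): it reacts to one specific name.
def flawed_predict(text: str) -> str:
    return "reject" if "Bob" in text else "approve"


TEMPLATE = "{name} has a stable income and no missed payments."
NAMES = ["Alice", "Bob", "Chen"]  # placeholder names for illustration only

flips = name_swap_flips(flawed_predict, TEMPLATE, NAMES)
# A non-empty list means the decision depends on the name alone.
```

An empty result on one template is weak evidence; the check is more convincing when run over many templates and name groups.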
Common follow-up / trap
Interviewers ask: “How is this different from a normal eval set?” An eval set measures average performance on a sampled distribution; behavioral tests assert specific guarantees and are designed to expose blind spots that the distribution under-represents. The trap is thinking that high test-set accuracy makes these tests redundant; the whole point is that it doesn’t. Wire them into the retraining pipeline so that a behavioral regression blocks promotion.
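One way to picture the promotion gate is a small function that runs every behavioral check against the candidate model and refuses promotion if any fail. This is a sketch under assumptions: `BehavioralCheck`, `failed_checks`, and `can_promote` are hypothetical names, and a real pipeline would more likely run the tests through its existing test runner and CI system.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class BehavioralCheck:
    name: str
    check: Callable[[], bool]  # returns True when the behavior holds


def failed_checks(checks: Sequence[BehavioralCheck]) -> List[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for item in checks:
        try:
            ok = item.check()
        except Exception:
            # A check that crashes is treated as a failure, not skipped.
            ok = False
        if not ok:
            failures.append(item.name)
    return failures


def can_promote(checks: Sequence[BehavioralCheck]) -> bool:
    """Allow promotion only when no behavioral check fails."""
    return not failed_checks(checks)
```

In a CI job, the result would typically decide the exit code (for example `sys.exit(0 if can_promote(checks) else 1)`), so a failing check stops the candidate from replacing the current model.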