What is the accuracy paradox and how does it expose the failure of accuracy as a metric?
The accuracy paradox occurs when a trivial model, one that always predicts the majority class, achieves high accuracy on an imbalanced dataset despite having zero predictive power for the minority class. A model that predicts 'not fraud' on every transaction achieves 99.9% accuracy if fraud is 0.1% of the data, but its recall for fraud is zero. Accuracy is only informative when classes are roughly balanced and the two kinds of error cost about the same.
How to think about it
Lead with the clearest concrete example, derive the numbers, then show what metric to use instead.
The paradox in one example
A credit-card fraud dataset: 99,900 legitimate transactions and 100 fraudulent ones (0.1% fraud rate).
A “model” that predicts legitimate for every single row achieves:
- Accuracy = 99,900 / 100,000 = 99.9%
- Recall for fraud = 0 / 100 = 0%
- Precision for fraud = 0 / 0, which is undefined (no positive predictions)
- F1 for fraud = 2·TP / (2·TP + FP + FN) = 0 / 100 = 0
The 99.9% figure is misleading because it is simply the majority-class baseline: it equals the share of legitimate transactions and reflects nothing the model has learned. Any real fraud model must beat this baseline on fraud-specific metrics, not on accuracy.
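The numbers above can be reproduced in a few lines. This is a minimal sketch in plain Python; the helper names `confusion_counts` and `summarize` are illustrative, and the labels are synthetic, built only from the counts in the example (1 = fraud, 0 = legitimate).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def summarize(y_true, y_pred, positive=1):
    """Accuracy plus positive-class recall, precision, and F1.

    A metric whose denominator is zero is reported as None (undefined).
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    f1_denom = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn) if (tp + fn) else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "f1": 2 * tp / f1_denom if f1_denom else None,
    }


# Synthetic labels matching the example: 99,900 legitimate, 100 fraudulent.
y_true = [0] * 99_900 + [1] * 100
# The trivial "model": predict legitimate for every row.
y_pred = [0] * len(y_true)

baseline = summarize(y_true, y_pred)
print(baseline)
```

The result has accuracy 0.999, recall 0.0, precision `None`, and F1 0.0. Computing F1 from counts rather than from precision and recall avoids the 0 / 0 that makes precision undefined here.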
Why accuracy fails in this setting
Accuracy weights every mistake equally: missing one fraud counts the same as wrongly flagging one legitimate transaction, even though the real costs usually differ widely. On imbalanced data, the cheapest route to high accuracy is to never predict the minority class, and any training, threshold tuning, or model selection that optimizes accuracy (or an unweighted loss that tracks it) will tend to take that route.
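One way to see the equal-weighting problem is to price the two error types separately. The sketch below compares the trivial model with a hypothetical detector on the same 100,000 transactions. The detector's confusion counts and both cost figures are invented for illustration and are not taken from any real system.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def total_cost(fp, fn, cost_fp, cost_fn):
    """Cost-weighted error: each error type carries its own price."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical per-error costs (assumptions for illustration only).
COST_FALSE_ALARM = 5      # reviewing a wrongly flagged legitimate transaction
COST_MISSED_FRAUD = 500   # absorbing one undetected fraud

# Trivial model: never flags anything.
trivial = dict(tp=0, fp=0, fn=100, tn=99_900)
# Hypothetical detector: catches 80 of 100 frauds, wrongly flags 500 legitimate rows.
detector = dict(tp=80, fp=500, fn=20, tn=99_400)

for name, cm in [("trivial", trivial), ("detector", detector)]:
    acc = accuracy(**cm)
    cost = total_cost(cm["fp"], cm["fn"], COST_FALSE_ALARM, COST_MISSED_FRAUD)
    print(f"{name}: accuracy={acc:.4f}, cost={cost}")
```

Under these assumed costs the trivial model has the higher accuracy (0.9990 against 0.9948) but four times the cost (50000 against 12500), so ranking the two models by accuracy picks the worse one.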
When accuracy is valid
Accuracy is a fine metric when:
- Classes are approximately balanced (for a binary problem, say, no class below 30–40% of samples).
- The costs of false positives (FP) and false negatives (FN) are roughly symmetric.
- You are comparing multiple models head-to-head on the same balanced dataset.
What to use instead
| Situation | Preferred metric |
|---|---|
| Imbalanced binary classification | F1 (positive class), PR-AUC |
| Rare events (fraud, disease) | Recall @ fixed FPR, PR-AUC |
| Multiple minority classes | Macro or weighted F1 |
| Ranked retrieval | MAP, NDCG |
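Two of the metrics in the table can be sketched in plain Python. Average precision is one common estimator of PR-AUC, and macro F1 is the unweighted mean of per-class F1. The function names and toy data below are illustrative. `average_precision` assumes binary labels with no tied scores, and a production library may handle ties and empty classes differently.

```python
def average_precision(y_true, scores):
    """Mean of precision@k taken at the rank k of each positive.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` items
    return total / n_pos


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Toy binary example: 2 positives among 10 rows, ranked by model score.
labels = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.1, 0.2, 0.9, 0.8, 0.4, 0.3, 0.05, 0.15, 0.25, 0.35]
ap = average_precision(labels, scores)

# Toy multiclass example: the model predicts the majority class "a" everywhere.
truth = ["a"] * 8 + ["b", "c"]
preds = ["a"] * 10
mf1 = macro_f1(truth, preds)

print(f"average precision = {ap:.3f}, macro F1 = {mf1:.3f}")
```

In the multiclass toy case accuracy is 0.8, yet macro F1 is about 0.296, because the two minority classes each contribute an F1 of zero to the average.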