How do you approach anomaly detection, and why is accuracy a bad metric for it?
Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, or PR-AUC on the anomaly class instead (ROC-AUC is usable but can look optimistic under extreme imbalance), and pick the operating point by the cost of false positives vs false negatives.
How to think about it
The crisp answer
Anomaly detection identifies rare observations that deviate from the normal pattern, often by learning what “normal” looks like and flagging departures. Accuracy is a bad metric because anomalies are so rare that a trivial model predicting “normal” everywhere gets near-perfect accuracy while detecting zero anomalies.
The approach
Because anomalies are scarce and often unlabeled, you usually model normality:
- Statistical: z-scores, IQR, Gaussian fits — flag low-probability points.
- Distance / density: KNN distance, Local Outlier Factor, DBSCAN noise points.
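- Model-based: Isolation Forest, which the Analytics Vidhya isolation forest guide describes as isolating anomalies via random splits: anomalies sit in sparse regions, so they need fewer splits to isolate and end up with shorter average path lengths. Also one-class SVM (learns a boundary around the normal data) and autoencoder reconstruction error (a high error means the point is unlike what the model was trained on).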
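To make the "fewer splits to isolate" idea concrete, here is a minimal one-dimensional sketch using only the standard library. It is not the full Isolation Forest algorithm (which builds trees on random subsamples, picks random features, and normalizes path lengths into a score); the names `isolation_depth`, `mean_depth`, and the `readings` data are hypothetical and chosen for illustration.

```python
import random

def isolation_depth(point, data, rng, max_depth=20):
    """Count random splits until `point` is alone (or only among duplicates)."""
    current = list(data)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates of the point remain
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the point.
        if point < split:
            current = [x for x in current if x < split]
        else:
            current = [x for x in current if x >= split]
        depth += 1
    return depth

def mean_depth(point, data, n_trials=200, seed=0):
    """Average isolation depth over many random split sequences."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trials))
    return total / n_trials

# Hypothetical sensor readings: nine values near 10 and one obvious outlier.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 25.0]

outlier_depth = mean_depth(25.0, readings)
inlier_depth = mean_depth(10.0, readings)
print(f"outlier: {outlier_depth:.2f} splits, inlier: {inlier_depth:.2f} splits")
```

The outlier is usually cut off by the very first split because most of the value range lies between it and the cluster, while a point inside the cluster needs several splits. A shorter average depth is what gets turned into a higher anomaly score.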
Why accuracy fails (and what to use)
With 0.1% anomalies, “always normal” is 99.9% accurate and useless. Use metrics that focus on the rare class:
- Precision / recall / F1 on the anomaly class.
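- PR-AUC, which is more informative than ROC-AUC under extreme imbalance: the huge number of true negatives keeps the false positive rate low even when false alarms far outnumber true detections.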
- Choose the operating threshold by the cost of false positives vs false negatives (a missed fraud vs a false alarm).
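The arithmetic behind the 99.9% figure can be checked directly. The sketch below compares the trivial "always normal" baseline with a hypothetical detector on 10,000 points containing 10 anomalies; the detector's counts (8 caught, 12 false alarms) are made up for illustration, and `summarize` is an assumed helper name.

```python
def summarize(y_true, y_pred):
    """Accuracy plus precision, recall, and F1 for the anomaly class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# 10,000 points with 10 anomalies (0.1%), anomalies listed first.
y_true = [1] * 10 + [0] * 9990

# Baseline: predict "normal" for everything.
always_normal = [0] * 10000

# Hypothetical detector: catches 8 of 10 anomalies, raises 12 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 9978

baseline_metrics = summarize(y_true, always_normal)
detector_metrics = summarize(y_true, detector)
print(baseline_metrics)  # accuracy 0.999, recall 0.0
print(detector_metrics)  # accuracy 0.9986, precision 0.4, recall 0.8
```

Note that the detector has slightly lower accuracy than the useless baseline, which is exactly why accuracy cannot rank anomaly detectors.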
Concrete example
Fraud detection: you care about catching fraud (recall) without drowning analysts in false alarms (precision), so you tune the threshold on a PR curve, not accuracy.
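One way to turn that trade-off into a threshold is to sweep the candidate thresholds, compute precision and recall at each, and pick the one with the lowest total cost. This is a sketch under stated assumptions: the scores, labels, and per-error costs are invented, `pr_points` and `cheapest_threshold` are hypothetical helper names, and in practice the sweep would run on held-out validation data.

```python
def pr_points(scores, labels):
    """Precision/recall at each threshold, flagging a point when score >= threshold."""
    if 1 not in labels:
        raise ValueError("need at least one labeled anomaly to compute recall")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        pairs = list(zip(scores, labels))
        tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
        fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
        fn = sum(1 for s, y in pairs if s < threshold and y == 1)
        points.append({
            "threshold": threshold,
            "precision": tp / (tp + fp),  # tp + fp >= 1: threshold is an observed score
            "recall": tp / (tp + fn),
            "fp": fp,
            "fn": fn,
        })
    return points

def cheapest_threshold(scores, labels, cost_fp, cost_fn):
    """Threshold minimizing cost_fp * false alarms + cost_fn * missed anomalies."""
    best = min(pr_points(scores, labels),
               key=lambda p: cost_fp * p["fp"] + cost_fn * p["fn"])
    return best["threshold"]

# Hypothetical anomaly scores (higher = more suspicious) and fraud labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Missed fraud is 10x worse than a false alarm: favors recall, low threshold.
recall_heavy = cheapest_threshold(scores, labels, cost_fp=1, cost_fn=10)

# False alarms are 10x worse than a miss: favors precision, high threshold.
precision_heavy = cheapest_threshold(scores, labels, cost_fp=10, cost_fn=1)

print(recall_heavy, precision_heavy)
```

The same scores yield different operating points depending only on the cost ratio, which is the point of tuning on the PR curve rather than on accuracy.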
The common trap
The classic mistakes are reporting accuracy and training on contaminated “normal” data that secretly contains anomalies, which teaches the model that those anomalies are normal. Another is forgetting that both normal behavior and anomalies drift over time, so models need monitoring and retraining.
Follow-up: “Supervised or unsupervised?” Usually unsupervised or semi-supervised, because labeled anomalies are scarce; if you do have labels, treat it as imbalanced classification with resampling or class weights.
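For the labeled case, class weights can be derived from class frequencies. The sketch below uses the common inverse-frequency heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for its "balanced" class weighting; `balanced_class_weights` is a hypothetical helper name and the label counts are illustrative.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 10 labeled anomalies among 10,000 points.
labels = [1] * 10 + [0] * 9990
weights = balanced_class_weights(labels)
print(weights)  # the anomaly class gets weight 500.0, the normal class about 0.5
```

With these weights, misclassifying one anomaly costs the loss function roughly as much as misclassifying a thousand normal points, which stops the classifier from collapsing to "always normal".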