Anomaly detection
Find the rare, the fraudulent, the broken — without labels. Isolation Forest, Local Outlier Factor, and one-class methods for spotting points that don't belong.
What you'll learn
- Anomaly detection as unsupervised ranking by isolation/rarity
- Isolation Forest and Local Outlier Factor — how each scores anomalies
- The contamination parameter and the novelty-vs-outlier distinction
Before you start
Fraud, defective parts, server intrusions, sensor failures — the most valuable events are often the rarest, and you usually have no labels for them. Anomaly detection finds the points that don’t fit the pattern, unsupervised. It’s a high-value skill in fraud, security, and ops roles.
The core idea: score by isolation
Anomaly detection ranks every point by how unusual it is, then flags the top scorers. The cleanest intuition is that an anomaly is easy to isolate: it sits alone, far from the dense mass of normal data. The contamination parameter sets the fraction of points that get flagged; raise it and more of the lonely points cross the threshold.
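To make "rank, then flag the top fraction" concrete, here is a minimal sketch. The score used (mean distance to the k nearest neighbours) is only a stand-in for "how lonely is this point"; it is not Isolation Forest or LOF. The function names and the toy data are assumptions made for illustration.

```python
import math
import random

def knn_distance_scores(points, k=3):
    """Score each point by its mean distance to its k nearest neighbours.

    Larger score = lonelier point. A stand-in score for illustration only.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def flag_top_fraction(scores, contamination):
    """Return the indices of the `contamination` fraction with the highest scores."""
    n_flag = max(1, round(contamination * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_flag])

rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(97)]
planted = [(8.0, 8.0), (-9.0, 7.0), (10.0, -8.0)]
points = normal + planted  # planted anomalies sit at indices 97, 98, 99

scores = knn_distance_scores(points)
flagged_3pct = flag_top_fraction(scores, contamination=0.03)
flagged_5pct = flag_top_fraction(scores, contamination=0.05)
print(sorted(flagged_3pct))
print(sorted(flagged_5pct))
```

Raising contamination from 0.03 to 0.05 does not change the ranking. It only moves the cut-off further down the same list, so two ordinary points near the edge of the cluster get flagged too.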
The two workhorses
- Isolation Forest — builds random trees that split the data on random features at random thresholds. Anomalies get isolated in very few splits (they’re far from everything), so a short average path length = anomalous. Fast, scales well, and the go-to default for tabular data.
- Local Outlier Factor (LOF) — compares a point’s local density to its neighbors’. A point in a sparse pocket surrounded by dense clusters is flagged, even if it’s not globally far away. Better at local anomalies; doesn’t scale as well.
Plus One-Class SVM (learn a boundary around the normal data) and simple statistical methods (z-score, IQR) for single features.
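A minimal sketch of these three methods, assuming scikit-learn and NumPy are installed. The toy data, the contamination value of 0.01, and the `nu` and `n_neighbors` settings are illustrative assumptions, not recommendations. Isolation Forest and LOF are fit on data that already contains the anomalies (outlier detection), while the One-Class SVM is fit on normal data only and then asked about new points (novelty detection).

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_planted = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([X_normal, X_planted])  # last two rows are the planted anomalies

# Isolation Forest: short average path length -> low score -> label -1.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)      # -1 = anomaly, 1 = normal
iso_scores = iso.score_samples(X)    # lower = more anomalous

# LOF: compares each point's local density with that of its neighbours.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X)              # -1 = anomaly, 1 = normal
lof_scores = lof.negative_outlier_factor_    # lower = more anomalous

# One-Class SVM: learn a boundary around normal data, then score unseen points.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X_normal)
new_point_label = ocsvm.predict(np.array([[8.0, 8.0]]))[0]

print("Isolation Forest flagged rows:", np.where(iso_labels == -1)[0])
print("LOF flagged rows:", np.where(lof_labels == -1)[0])
print("One-Class SVM label for (8, 8):", new_point_label)
```

All three estimators use the same convention: -1 marks an anomaly and 1 marks a normal point. The raw scores are usually more useful than the labels, because they let you pick the threshold yourself.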
Quick check
Next
Anomaly detection joins clustering in the unsupervised toolkit. To see the structure these methods exploit, visualize with t-SNE & UMAP.
Practice this in an interview
Isolation Forest builds many random trees by repeatedly picking a random feature and a random split value, partitioning the data until points are isolated. Anomalies get isolated in far fewer splits because they're rare and different, so their average path length across trees is short. The shorter the expected path length, the higher the anomaly score. Because the trees rely only on random splits and need no distance computations, the method is fast and holds up well in high dimensions.
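The path-length idea can be made concrete with a small from-scratch sketch. It is not the full Isolation Forest algorithm: it skips subsampling and the score normalization, and it follows one point through freshly drawn random splits instead of storing trees. The function names, the depth cap, and the toy data are assumptions made for illustration.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:  # nothing left to split on this feature
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that still contains the point.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def mean_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many independent random partitions."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
inlier = (0.0, 0.0)    # sits in the middle of the dense cluster
outlier = (8.0, 8.0)   # sits far from everything
data = cluster + [inlier, outlier]

inlier_len = mean_path_length(inlier, data)
outlier_len = mean_path_length(outlier, data)
print(f"inlier mean path length:  {inlier_len:.2f}")
print(f"outlier mean path length: {outlier_len:.2f}")
```

The outlier is cut off after a handful of splits because a random threshold often lands in the empty gap between it and the cluster. The inlier needs many more splits before nothing else shares its partition.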
Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, PR-AUC, or ROC-AUC instead, chosen by the cost of false positives vs false negatives.
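A small worked example shows why accuracy misleads here. The counts are made up for illustration: 1,000 events with 10 true anomalies, a baseline that always predicts "normal", and a hypothetical detector that raises 20 alerts, 8 of them correct.

```python
def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 with 1 = anomaly, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1] * 10 + [0] * 990

# Baseline: never raises an alert.
always_normal = [0] * 1000

# Hypothetical detector: catches 8 of 10 anomalies, plus 12 false alarms.
detector = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

baseline_metrics = metrics(y_true, always_normal)
detector_metrics = metrics(y_true, detector)
print("always normal:", baseline_metrics)
print("detector:     ", detector_metrics)
```

The do-nothing baseline reaches 99% accuracy with zero recall. The detector scores slightly lower on accuracy (98.6%) yet finds 80% of the anomalies at 40% precision, which is the trade-off that actually matters.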
Outliers are detected via statistical rules (IQR, Z-score), visualization, or isolation-based algorithms. Handling options are removal, capping (Winsorization), transformation, or using robust algorithms. The right action depends on whether the outlier is a measurement error or a genuine extreme value — genuine extremes carry signal and should not be blindly removed.
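The statistical rules and the capping option can be sketched with the standard library alone. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules, and clipping to the IQR fences is one variant of capping; classic Winsorization clips at chosen percentiles instead. The sensor readings are invented for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` population standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_fences(values, k=1.5):
    """Return the fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def cap_to_fences(values, k=1.5):
    """Clip values to the IQR fences instead of dropping them."""
    lo, hi = iqr_fences(values, k)
    return [min(max(v, lo), hi) for v in values]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 10.2,
            9.9, 10.1, 10.0, 9.8, 10.3, 10.0, 9.9, 10.1, 10.2, 25.0]

z_flagged = zscore_outliers(readings)
lo, hi = iqr_fences(readings)
iqr_flagged = [v for v in readings if v < lo or v > hi]
capped = cap_to_fences(readings)

print("z-score flags:", z_flagged)
print("IQR flags:    ", iqr_flagged)
print("largest value after capping:", max(capped))
```

One caveat with the z-score rule: using the population standard deviation, no single value can score above the square root of n - 1, so in a sample of ten or fewer points a threshold of 3 never fires. The IQR rule does not have that limitation because the quartiles are barely moved by one extreme value.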
Handling outliers starts with understanding whether they are errors, rare genuine observations, or leverage points that reveal real signal. The appropriate response — removal, transformation, robust estimation, or explicit modelling — depends entirely on their cause, not on how extreme they look.