datarekha

Anomaly detection

Find the rare, the fraudulent, the broken — without labels. Isolation Forest, Local Outlier Factor, and one-class methods for spotting points that don't belong.

7 min read Intermediate Machine Learning Lesson 31 of 33

What you'll learn

  • Anomaly detection as unsupervised ranking by isolation/rarity
  • Isolation Forest and Local Outlier Factor — how each scores anomalies
  • The contamination parameter and the novelty-vs-outlier distinction

Before you start

Fraud, defective parts, server intrusions, sensor failures — the most valuable events are often the rarest, and you usually have no labels for them. Anomaly detection finds the points that don’t fit the pattern, unsupervised. It’s a high-value skill in fraud, security, and ops roles.

The core idea: score by isolation

Anomaly detection ranks every point by how unusual it is, then flags the top scorers. The cleanest intuition: an anomaly is easy to isolate because it sits alone, far from the dense mass of normal data. The contamination threshold sets what fraction of points gets flagged: raise it and more of the lonely points are flagged.
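To make the "rank, then flag the top fraction" idea concrete, here is a minimal sketch. The data is synthetic, the helper `flag_top_fraction` is a hypothetical name, and the score (distance from the per-feature median) is a stand-in chosen for simplicity, not what Isolation Forest or LOF compute.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 "normal" points around the origin plus 5 far-away points (synthetic data).
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
lonely = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, lonely])

# Illustrative score (an assumption, not a real detector):
# distance from the per-feature median. Higher = more unusual.
scores = np.linalg.norm(X - np.median(X, axis=0), axis=1)

def flag_top_fraction(scores, contamination):
    """Flag the `contamination` fraction of points with the highest scores."""
    n_flagged = int(np.ceil(contamination * len(scores)))
    ranked = np.argsort(scores)[::-1]          # most unusual first
    flags = np.zeros(len(scores), dtype=bool)
    flags[ranked[:n_flagged]] = True
    return flags

# Raising the contamination value flags more points.
for contamination in (0.01, 0.02, 0.10):
    flags = flag_top_fraction(scores, contamination)
    print(contamination, int(flags.sum()))
```

The ranking never changes as you move the threshold; only the cut-off point does. That is why contamination is best treated as a business decision about how many alerts you can afford to review.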

The two workhorses

  • Isolation Forest — builds random trees that split the data on random features at random thresholds. Anomalies get isolated in very few splits (they’re far from everything), so a short average path length = anomalous. Fast, scales well, and the go-to default for tabular data.
  • Local Outlier Factor (LOF) — compares a point’s local density to its neighbors’. A point in a sparse pocket surrounded by dense clusters is flagged, even if it’s not globally far away. Better at local anomalies; doesn’t scale as well.
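Both methods are available in scikit-learn. The sketch below runs them on synthetic data; the `contamination=0.02` and `n_neighbors=20` values are assumptions picked for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),   # dense "normal" cloud (synthetic)
    rng.uniform(6.0, 9.0, size=(6, 2)),    # a few isolated points
])

# contamination = expected fraction of anomalies (assumed value for this demo).
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
iso_labels = iso.fit_predict(X)             # -1 = anomaly, 1 = normal
iso_scores = -iso.score_samples(X)          # negated so higher = more anomalous

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
lof_labels = lof.fit_predict(X)             # -1 = anomaly, 1 = normal
lof_scores = -lof.negative_outlier_factor_  # negated so higher = more anomalous

print("Isolation Forest flagged rows:", np.where(iso_labels == -1)[0])
print("LOF flagged rows:", np.where(lof_labels == -1)[0])
```

Both estimators return -1 for flagged points and 1 for the rest, so it is easy to compare them row by row. On data like this they should largely agree; they tend to diverge when clusters have very different densities, which is where LOF's local view matters.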

Also worth knowing: One-Class SVM, which learns a boundary around the normal data, and simple statistical methods (z-score, IQR) for single features.
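For a single feature, the statistical rules fit in a few lines. This is a sketch on a made-up sample of response times; the cutoffs (2 standard deviations, 1.5 × IQR) are common conventions rather than fixed rules.

```python
import numpy as np

# Hypothetical single feature: response times in ms, with two obvious spikes.
x = np.array([102, 98, 101, 97, 103, 99, 100, 104, 96, 100, 250, 310], dtype=float)

# z-score rule: flag values more than `z_cut` standard deviations from the mean.
z_cut = 2.0
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > z_cut

# IQR rule (Tukey fences): flag values beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("z-score flags:", x[z_flags])
print("IQR flags:", x[iqr_flags])
```

On this sample the z-score rule flags only 310, because the two spikes inflate the mean and standard deviation they are measured against. The IQR rule flags both 250 and 310, since quartiles barely move when a couple of extreme values are added.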

Quick check

Q1. How does Isolation Forest decide a point is anomalous?
Q2. What does the contamination parameter control?
Q3. What's the difference between outlier detection and novelty detection?

Next

Anomaly detection joins clustering in the unsupervised toolkit. To see the structure these methods exploit, visualize with t-SNE & UMAP.

Practice this in an interview

How does the Isolation Forest algorithm detect anomalies?

Isolation Forest builds many random trees by repeatedly picking a random feature and a random split value, partitioning the data until points are isolated. Anomalies get isolated in far fewer splits because they're rare and different, so their average path length across trees is short. The shorter the expected path length, the higher the anomaly score. Because the method never computes distances or densities, it is fast and holds up well in high dimensions.
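The path-length idea can be sketched in plain Python. This is a deliberately simplified illustration, not a full Isolation Forest: it omits the subsampling and score normalization that real implementations use, and the names `isolation_depth` and `mean_depth` are hypothetical.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))            # random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth                               # simplification: stop if no split is possible
    threshold = rng.uniform(lo, hi)                # random split value
    # Follow only the side of the split that contains `point`.
    if point[feature] < threshold:
        side = [row for row in data if row[feature] < threshold]
    else:
        side = [row for row in data if row[feature] >= threshold]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Synthetic data: one tight cluster plus a single far-away point.
data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

print("outlier mean depth:", mean_depth(outlier, data))
print("cluster point mean depth:", mean_depth(cluster[0], data))
```

The far-away point should come out with a much smaller mean depth than the cluster point, since a random threshold has a good chance of landing in the wide gap between it and everything else.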

How do you approach anomaly detection, and why is accuracy a bad metric for it?

Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, PR-AUC, or ROC-AUC instead, chosen by the cost of false positives vs false negatives.
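The accuracy trap is easy to reproduce. A small sketch with made-up numbers (1,000 transactions, 10 fraudulent) and a "detector" that predicts normal for everything:

```python
# Hypothetical dataset: 1,000 transactions, 10 of them fraudulent (1 = anomaly).
y_true = [1] * 10 + [0] * 990

# A useless "detector" that predicts normal for every row.
y_pred = [0] * 1000

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is flagged; it is reported as 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
# prints: accuracy=0.990 recall=0.000 precision=0.000
```

99% accuracy, zero anomalies caught. Recall exposes the failure immediately, which is why it (together with precision) is the number to watch on rare-event problems.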

How do you detect and handle outliers in a machine learning dataset?

Outliers are detected via statistical rules (IQR, Z-score), visualization, or isolation-based algorithms. Handling options are removal, capping (Winsorization), transformation, or using robust algorithms. The right action depends on whether the outlier is a measurement error or a genuine extreme value — genuine extremes carry signal and should not be blindly removed.
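Three of the handling options can be sketched with NumPy on a made-up feature. Capping here uses the IQR fences as limits, which is an assumption for the example; percentile-based limits are a common alternative.

```python
import numpy as np

# Hypothetical feature with two extreme values.
x = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 13.0, 120.0, 300.0])

# Detection: IQR fences at 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1, capping: pull extreme values back to the nearest fence.
capped = np.clip(x, lower, upper)

# Option 2, transformation: log1p compresses the long right tail, keeps every row.
logged = np.log1p(x)

# Option 3, removal: keep only rows inside the fences.
kept = x[(x >= lower) & (x <= upper)]

print("fences:", lower, upper)
print("capped:", capped)
print("rows kept after removal:", len(kept), "of", len(x))
```

Capping and transformation keep all ten rows, while removal drops two. If 120 and 300 are genuine values rather than measurement errors, the first two options preserve the fact that something large happened there.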

How do you handle outliers statistically, and how do you decide whether to remove them?

Handling outliers starts with understanding whether they are errors, rare genuine observations, or leverage points that reveal real signal. The appropriate response — removal, transformation, robust estimation, or explicit modelling — depends entirely on their cause, not on how extreme they look.
