How do you approach anomaly detection, and why is accuracy a bad metric for it?

Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, PR-AUC, or ROC-AUC instead, chosen by the cost of false positives vs false negatives.
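To make the accuracy trap concrete, here is a minimal sketch with made-up labels (10 anomalies in 1,000 points is an assumption for illustration, not a figure from any real dataset). It scores a "detector" that predicts normal for everything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = anomaly, 0 = normal. 10 anomalies in 1,000 points (1%).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "detector" that predicts normal for everything.
y_pred = np.zeros(1000, dtype=int)

# zero_division=0 reports 0 (without a warning) when nothing is predicted positive.
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}  precision={prec:.2f}  recall={rec:.2f}  F1={f1:.2f}")
```

Accuracy comes out at 0.99 while recall and F1 are 0: the model looks excellent and catches nothing.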

How does the Isolation Forest algorithm detect anomalies?

Isolation Forest builds many random trees by repeatedly picking a random feature and a random split value, partitioning the data until points are isolated. Anomalies get isolated in far fewer splits because they're rare and different, so their average path length across trees is short. The shorter the expected path length, the higher the anomaly score. Because it needs no distance or density computations, the method is fast and scales well to large, high-dimensional data.
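The path-length idea can be sketched in a few lines of NumPy. This is a toy illustration, not scikit-learn's implementation: the helper `isolation_depth` is hypothetical, it follows a single point down one random partitioning instead of building full trees, and it skips the subsampling and score normalization a real Isolation Forest uses.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=100):
    """Count the random axis-aligned splits needed to isolate row `idx` of X."""
    members = np.arange(len(X))            # points still sharing idx's partition
    depth = 0
    while len(members) > 1 and depth < max_depth:
        f = rng.integers(X.shape[1])       # pick a random feature
        lo, hi = X[members, f].min(), X[members, f].max()
        if lo == hi:                       # degenerate partition: stop early in this sketch
            break
        split = rng.uniform(lo, hi)        # random split value inside the current range
        # Keep only the side of the split that contains the point being followed.
        if X[idx, f] < split:
            members = members[X[members, f] < split]
        else:
            members = members[X[members, f] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 256 normal points plus one obvious outlier at (8, 8) (made-up data).
X = np.vstack([rng.normal(0, 1, size=(256, 2)), [[8.0, 8.0]]])
outlier_idx = len(X) - 1
inlier_idx = int(np.argmin(np.linalg.norm(X[:-1], axis=1)))  # point closest to the origin

n_trials = 200
depth_outlier = np.mean([isolation_depth(X, outlier_idx, rng) for _ in range(n_trials)])
depth_inlier = np.mean([isolation_depth(X, inlier_idx, rng) for _ in range(n_trials)])

print(f"average depth, outlier: {depth_outlier:.1f}")
print(f"average depth, inlier:  {depth_inlier:.1f}")
```

The outlier's average depth should come out well below the inlier's, which is the signal the anomaly score is built on.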

How do you detect and handle outliers in a machine learning dataset?

Outliers are detected via statistical rules (IQR, Z-score), visualization, or isolation-based algorithms. Handling options are removal, capping (Winsorization), transformation, or using robust algorithms. The right action depends on whether the outlier is a measurement error or a genuine extreme value — genuine extremes carry signal and should not be blindly removed.
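A minimal sketch of the two statistical rules and of capping, on made-up measurements. The 1.5 × IQR fences and the |z| > 3 cutoff are the usual conventions, assumed here rather than taken from the text.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
x = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 16.0, 95.0])

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flags = (x < lower) | (x > upper)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_flags = np.abs(z) > 3

# Capping (Winsorization-style): clip to the IQR fences instead of dropping rows.
x_capped = np.clip(x, lower, upper)

print("IQR flags:    ", x[iqr_flags])
print("Z-score flags:", x[z_flags])
print("Capped max:   ", x_capped.max())
```

On this tiny sample the IQR rule flags 95 but the z-score rule does not fire: the extreme value inflates the mean and standard deviation it is judged against, so its z-score lands just under 3. That is one reason to cross-check rules rather than trust a single one.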

How do you handle outliers statistically, and how do you decide whether to remove them?

Handling outliers starts with understanding whether they are errors, rare genuine observations, or leverage points that reveal real signal. The appropriate response — removal, transformation, robust estimation, or explicit modelling — depends entirely on their cause, not on how extreme they look.
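Robust estimation is the option that avoids deciding up front whether to delete a point. The sketch below uses made-up data and compares the mean and median with and without one extreme value, then flags points with a median/MAD-based modified z-score. The 0.6745 scaling and the 3.5 cutoff are a commonly used convention, assumed here for illustration.

```python
import numpy as np

# Hypothetical measurements; the last value may be an error or a genuine extreme.
x = np.array([10.0, 11.0, 12.0, 12.0, 13.0, 13.0, 14.0, 15.0, 16.0, 95.0])

# The mean shifts a lot when the extreme point is dropped; the median does not.
mean_all, median_all = x.mean(), np.median(x)
mean_trim, median_trim = x[:-1].mean(), np.median(x[:-1])

# Modified z-score built on the median and the median absolute deviation (MAD).
mad = np.median(np.abs(x - median_all))
robust_z = 0.6745 * (x - median_all) / mad
robust_flags = np.abs(robust_z) > 3.5

print(f"mean:   {mean_all:.2f} with the point, {mean_trim:.2f} without")
print(f"median: {median_all:.2f} with the point, {median_trim:.2f} without")
print("flagged by modified z-score:", x[robust_flags])
```

The flag only says the point is unusual. Whether to remove, cap, or model it still depends on what caused it.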

Anomaly detection — Machine Learning

Fraud, defective parts, server intrusions, sensor failures — the most valuable events are often the rarest, and you usually have no labels for them. Anomaly detection finds the points that don’t fit the pattern, unsupervised. It’s a high-value skill in fraud, security, and ops roles.

The core idea: score by isolation

Anomaly detection ranks every point by how unusual it is, then flags the top scorers. The cleanest intuition: an anomaly is easy to isolate because it sits alone, far from the dense mass of normal data. Those lonely points are exactly the ones a detector flags.

The two workhorses

Isolation Forest — builds random trees that split the data on random features at random thresholds. Anomalies get isolated in very few splits (they’re far from everything), so a short average path length = anomalous. Fast, scales well, and the go-to default for tabular data.
Local Outlier Factor (LOF) — compares a point’s local density to its neighbors’. A point in a sparse pocket surrounded by dense clusters is flagged, even if it’s not globally far away. Better at local anomalies; doesn’t scale as well.

Two further options: One-Class SVM, which learns a boundary around the normal data, and simple statistical methods (z-score, IQR) for single features.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(-6, 6, size=(15, 2))
X = np.vstack([normal, outliers])

# Isolation Forest: contamination = expected fraction of anomalies
iso = IsolationForest(contamination=0.07, random_state=0).fit(X)
pred_iso = iso.predict(X)            # -1 = anomaly, 1 = normal
print(f"Isolation Forest flagged {(pred_iso == -1).sum()} anomalies")

# LOF: fit_predict in outlier mode
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.07)
pred_lof = lof.fit_predict(X)
print(f"LOF flagged             {(pred_lof == -1).sum()} anomalies")
print("\n(15 true outliers were injected.)")
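Because the outliers in this toy setup were injected, the flags can be scored against them. The sketch below rebuilds the same synthetic data, sweeps a few `contamination` values (the specific values are arbitrary choices for illustration), and reports precision and recall for each. With real data you would rarely have such labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(200, 2))
outliers = rng.uniform(-6, 6, size=(15, 2))
X = np.vstack([normal, outliers])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(15, dtype=int)]  # 1 = injected outlier

results = []
for contamination in (0.02, 0.07, 0.15):
    # Same random_state means the same trees each time; only the cutoff moves.
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X)
    y_pred = (iso.predict(X) == -1).astype(int)   # 1 = flagged as anomaly
    results.append((
        contamination,
        int(y_pred.sum()),
        precision_score(y_true, y_pred, zero_division=0),
        recall_score(y_true, y_pred),
    ))

for c, n_flagged, prec, rec in results:
    print(f"contamination={c:.2f}  flagged={n_flagged:3d}  precision={prec:.2f}  recall={rec:.2f}")
```

Some injected points land inside the normal cloud by chance, so perfect recall is not expected here. Raising `contamination` flags more points, which can only keep or raise recall and usually costs precision.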

In one breath

Anomaly detection finds the rare points that don’t fit, unsupervised — fraud, defects, intrusions — by ranking every point’s unusualness and flagging the top.
The clean intuition: an anomaly is easy to isolate, sitting far from the dense mass of normal data.
Isolation Forest isolates points with random splits (short path = anomalous) — fast, scalable, the tabular default; LOF compares local density and catches local anomalies.
The contamination parameter sets the flagging threshold — your precision/recall dial, usually set from domain knowledge since labels are scarce.
Outlier detection finds anomalies in already-contaminated training data; novelty detection trains on clean data and flags deviations later. And an anomaly isn’t always an error.
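The outlier/novelty distinction shows up directly in scikit-learn's LOF API. A minimal sketch, assuming the training set is clean and using two made-up new points: with `novelty=True` the model is fitted once on the clean data and then scores points it has never seen.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))      # assumed to contain only normal points

# Two hypothetical new observations: one typical, one far from the training data.
X_new = np.array([[0.0, 0.0], [6.0, 6.0]])

# novelty=True enables predict() on unseen data; fit on the clean set only.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
pred_new = lof.predict(X_new)                  # -1 = novelty, 1 = normal

for point, label in zip(X_new, pred_new):
    print(point, "novelty" if label == -1 else "normal")
```

In the default outlier mode (`novelty=False`), LOF only offers `fit_predict` on the data it was fitted on, which matches the already-contaminated case.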

Quick check

Q1. How does Isolation Forest decide a point is anomalous?

Q2. What does the contamination parameter control?

Q3. What's the difference between outlier detection and novelty detection?

Anomaly detection joins clustering in the unsupervised toolkit. To see the structure these methods exploit, visualize with t-SNE & UMAP.
