What are t-SNE and UMAP, how do they differ from PCA, and what are their limitations for ML workflows?
t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, which preserves global variance, they preserve local neighborhood structure, which tends to produce cleaner cluster separations in plots. Neither is a good preprocessing step for training a supervised model: t-SNE is transductive (it learns no mapping for unseen points), UMAP's mapping for new points is only approximate, and both are stochastic, so their output is not stable across runs.
How to think about it
If PCA is a projector that preserves global variance, t-SNE and UMAP are neighborhood maps: they ask “which points are near each other?” and try to replicate that neighborhood in 2D.
t-SNE
t-SNE (t-distributed Stochastic Neighbor Embedding) models pairwise similarity in high-dimensional space with a Gaussian kernel, then tries to match those similarities in 2D using a Student-t kernel. The heavy-tailed t kernel pushes dissimilar clusters apart, producing visually clean separations.
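The two kernels can be written out explicitly. The notation here is an assumption added for illustration and does not appear in the text above: $x_i$ are the original points, $y_i$ their 2D positions, $n$ the number of points, and $\sigma_i$ a per-point Gaussian bandwidth that the algorithm tunes to match the chosen perplexity. In the standard formulation, the embedding is found by minimizing the Kullback-Leibler divergence between the two sets of similarities:

$$
p_{j\mid i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \ne i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j\mid i} + p_{i\mid j}}{2n}
$$

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \ne l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

Because $q_{ij}$ decays polynomially rather than exponentially, a moderately dissimilar pair must be placed relatively far apart in 2D before its $q_{ij}$ becomes as small as its $p_{ij}$, which is where the visual gaps between clusters come from.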
Characteristics:
- Excellent for revealing cluster structure in visualization.
- Not deterministic — different random seeds produce different layouts.
- Does not preserve global distances: two clusters being far apart in the plot may or may not mean they are far apart in the original space.
- Exact t-SNE has quadratic time complexity O(n²); the Barnes-Hut approximation reduces this to O(n log n), but it is still slow for n > ~50,000 rows.
- The `perplexity` hyperparameter (typical range 5–50) controls effective neighborhood size and strongly influences the output.
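As a rough illustration of the seed and perplexity sensitivity listed above, here is a minimal sketch using scikit-learn's `TSNE`. The synthetic `make_blobs` data and the helper name `embed` are assumptions for the example, and `init="random"` is chosen deliberately so that the effect of the seed is visible.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data (an assumption): 200 points, 10 features, 4 clusters.
X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)


def embed(perplexity, seed):
    """Hypothetical helper: run t-SNE once with a given perplexity and seed."""
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of samples
        init="random",          # random start, so the seed changes the layout
        random_state=seed,
    )
    return tsne.fit_transform(X)


emb_seed0 = embed(perplexity=30, seed=0)
emb_seed1 = embed(perplexity=30, seed=1)
emb_small = embed(perplexity=5, seed=0)   # much smaller neighborhoods

# Same data and perplexity, different seed: the coordinates do not match.
print(np.allclose(emb_seed0, emb_seed1))
```

Plotting `emb_seed0`, `emb_seed1`, and `emb_small` side by side is usually the quickest way to see how much of a layout is signal and how much depends on the settings.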
UMAP
UMAP (Uniform Manifold Approximation and Projection) constructs a weighted graph of nearest neighbors in high-dimensional space, then optimizes a low-dimensional embedding to preserve that graph structure.
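To make the graph step more concrete, the sketch below builds a plain k-nearest-neighbor graph with scikit-learn. This is only a stand-in for the first stage: it stores raw distances as edge weights, whereas UMAP converts distances into its own membership strengths before optimizing the layout. The random data and the choice of 15 neighbors are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Synthetic stand-in for real features (an assumption): 500 points, 20 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# Sparse (500, 500) matrix: row i holds the distances from point i
# to its 15 nearest neighbors; every other entry is an implicit zero.
knn = kneighbors_graph(X, n_neighbors=15, mode="distance", include_self=False)

print(knn.shape)  # one row and one column per point
print(knn.nnz)    # number of stored edges: 15 per point
```

The graph is sparse, with a fixed number of edges per point, which is one reason neighbor-graph methods can scale to large datasets.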
Advantages over t-SNE:
- Much faster, scales to millions of points.
- Better preservation of global structure (cluster relative positions are more meaningful).
- Supports `transform()` on new data, which partially addresses t-SNE’s transductive limitation.
- Fewer sensitive hyperparameters.
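```python
import umap  # provided by the umap-learn package
from sklearn.preprocessing import StandardScaler

# X and X_new are assumed to be numeric arrays of shape (n_samples, n_features).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X_scaled)

# For new points (approximate): reuse the fitted scaler, then project.
X_new_scaled = scaler.transform(X_new)
new_embedding = reducer.transform(X_new_scaled)
```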
PCA vs t-SNE vs UMAP
| Property | PCA | t-SNE | UMAP |
|---|---|---|---|
| Linear | Yes | No | No |
| Global structure | Yes | Weak | Moderate |
| Speed | Fast | Slow | Fast |
| New-data transform | Yes | No | Approximate |
| Interpretable axes | Partially | No | No |
| Use in model pipeline | Yes | No | Caution |
When to use each
Use PCA for preprocessing before supervised models. Use t-SNE or UMAP for exploratory visualization: inspecting cluster quality, detecting label overlap, or sanity-checking embeddings. Do not use t-SNE or UMAP coordinates as feature inputs for a supervised classifier: the embedding is not stable across runs or datasets, t-SNE has no way to place a held-out test set in an existing embedding, and UMAP's `transform()` only approximates where held-out points would land.
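For the PCA case, a minimal sketch of what "preprocessing before a supervised model" can look like in scikit-learn is shown below. The digits dataset, the 20 components, and the logistic regression classifier are assumptions chosen for illustration, not recommendations. The point is that the projection is fitted on the training split only and then applied unchanged to held-out data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an assumption): 8x8 digit images flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling and PCA are fitted inside the pipeline, on training data only.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# score() pushes the test set through the same fitted scaler and projection.
test_accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Because PCA is a fixed linear map once fitted, the test points receive exactly the transformation the classifier was trained on, which is the property t-SNE lacks and UMAP only approximates.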