datarekha

What's the difference between t-SNE and UMAP, and what are the pitfalls of interpreting their plots?

The short answer

Both are nonlinear dimensionality-reduction methods used mainly for visualization, and both preserve local neighborhood structure. UMAP is typically faster, scales to larger datasets, and tends to preserve more global structure, while t-SNE emphasizes tight local clusters. The main pitfall is over-interpreting the plots: cluster sizes, densities, and distances between clusters are not reliably meaningful, and results depend heavily on hyperparameters such as perplexity (t-SNE) or n_neighbors (UMAP). Be cautious about using either as features for a downstream model: standard t-SNE cannot embed new points at all, and a UMAP embedding, although it can transform new data, is lossy and sensitive to hyperparameters.

How to think about it

The crisp answer

t-SNE and UMAP are both nonlinear techniques that project high-dimensional data to 2D/3D for visualization, preserving which points are neighbors. The practical differences: UMAP is faster, scales to larger data, and preserves more global structure, while t-SNE produces very tight, well-separated local clusters but is slower and tends to distort global layout.

How they differ

The PCA vs t-SNE vs UMAP guide summarizes it: t-SNE converts distances to probabilities and minimizes KL divergence between high- and low-dimensional neighbor distributions, emphasizing local structure. UMAP builds a fuzzy topological graph and optimizes a low-dimensional layout, which is faster and keeps relative cluster positions more meaningful. PCA, by contrast, is linear and used for compression, not just visualization.
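
For reference, the t-SNE objective described above can be written out as follows. This is the standard formulation from the original t-SNE paper, not something taken from the guide. The symbols are assumptions introduced here: x_i are the high-dimensional points, y_i their low-dimensional positions, N is the number of points, and sigma_i is a per-point bandwidth chosen so that each point's neighbor distribution matches the requested perplexity.

```latex
p_{j \mid i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                    {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

The cost is large when p_ij is large but q_ij is small, so placing true neighbors far apart is penalized heavily, while placing distant points close together costs little. That asymmetry is why the method emphasizes local structure over global layout.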

The big pitfalls

  • Cluster sizes and densities are not meaningful: t-SNE in particular tends to expand dense clusters and contract sparse ones, so how tight or large a cluster looks on the plot says little about its spread in the original space.
  • Distances between clusters mostly don’t mean anything in t-SNE (UMAP is somewhat better but still unreliable).
  • Hyperparameters dominate: perplexity (t-SNE) and n_neighbors/min_dist (UMAP) change the picture dramatically; always try several.
  • Both are stochastic: different runs produce different layouts unless you fix the random seed.
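
The hyperparameter and seeding pitfalls above can be checked directly. The following is a minimal sketch using scikit-learn's TSNE, not a definitive recipe. The synthetic blobs, the `embed` helper, and the specific perplexity values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic data (an assumption for illustration): three Gaussian blobs
# in 10 dimensions with very different spreads.
X, labels = make_blobs(
    n_samples=[60, 60, 60],
    n_features=10,
    cluster_std=[0.5, 2.0, 5.0],
    random_state=0,
)


def embed(perplexity, seed):
    """Run t-SNE once with the given perplexity and random seed."""
    # init="random" makes the seed dependence explicit; newer scikit-learn
    # releases default to a PCA-based initialization instead.
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",
        random_state=seed,
    )
    return tsne.fit_transform(X)


# Same data, same seed, three perplexities: compare the layouts side by side.
embeddings = {p: embed(p, seed=0) for p in (5, 30, 50)}

# Rerun one setting with the same seed and with a different seed.
same_seed = embed(30, seed=0)
other_seed = embed(30, seed=1)

for p, emb in embeddings.items():
    print(f"perplexity={p}: shape={emb.shape}")
print("same seed matches:", np.allclose(embeddings[30], same_seed))
print("different seed matches:", np.allclose(embeddings[30], other_seed))
```

Plotting each entry of `embeddings` as a scatter colored by `labels` shows how far the picture moves with perplexity alone, and how little the plotted cluster sizes reflect the very different spreads in the original data. If the umap-learn package is installed, the same loop can be run over `n_neighbors` and `min_dist` with `umap.UMAP(...)`.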

The common trap

Reading these plots as ground truth: drawing conclusions from gaps, cluster sizes, or inter-cluster distances. The arXiv critique “Stop Misusing t-SNE and UMAP for Visual Analytics” warns about exactly this kind of over-interpretation. Also, be wary of feeding t-SNE/UMAP embeddings into a downstream model as features. Standard t-SNE learns no mapping that can be applied to unseen points, and although UMAP can transform new data, the embedding is lossy and shifts with hyperparameters and seed, so treat both primarily as exploration tools.

Follow-up: “PCA vs these?” Use PCA for compression/preprocessing (linear, deterministic, fast) and t-SNE/UMAP for visual exploration.
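
To illustrate the PCA side of that follow-up, here is a minimal sketch of PCA used as a preprocessing step in a scikit-learn pipeline. The digits dataset, the choice of 20 components, and the logistic-regression classifier are assumptions made for the example, not recommendations from the source.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# PCA is fit on the training split only; the learned linear projection is
# then applied unchanged to the held-out split inside the pipeline.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    LogisticRegression(max_iter=2000),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy with PCA features: {accuracy:.3f}")

# scikit-learn's TSNE offers fit_transform but no transform method, so it
# cannot serve as a pipeline step that must embed unseen rows.
tsne_can_transform = hasattr(TSNE(), "transform")
print("TSNE has transform():", tsne_can_transform)
```

The design point is that PCA learns a fixed projection that can be reapplied to new rows, which is what a preprocessing step needs. scikit-learn's t-SNE only produces coordinates for the points it was fit on.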
