t-SNE & UMAP
See the structure in high-dimensional data by projecting it to 2D — the right way. How nonlinear embeddings reveal clusters PCA misses, and the traps that make their plots easy to misread.
What you'll learn
- Why nonlinear projections reveal clusters that linear PCA flattens
- How t-SNE and UMAP differ (and when to use each)
- The traps — cluster sizes, distances, and random seeds can mislead
Before you start
PCA is linear — it can only rotate and project, so curved structure gets flattened and overlapping clusters smear together. t-SNE and UMAP are nonlinear projections built for one job: making a 2D picture where similar high-dimensional points land near each other, so you can see clusters that PCA hides. They’re the standard tools for visualizing embeddings, gene expression, and any wide dataset.
The idea: preserve neighborhoods, not distances
Both methods optimize a 2D layout so that points which were close in high-dimensional space stay close in the picture. They don’t try to preserve global distances — only local neighborhoods. That’s exactly why they reveal clusters: tight high-dimensional groups become visually separated blobs.
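One way to make "preserving neighborhoods" concrete is to count how many of each point's nearest neighbors in the original space are still its nearest neighbors in the 2D picture. The sketch below is only an illustration under stated assumptions: the helper names (`knn_indices`, `neighborhood_overlap`, `pca_2d`), the choice of k = 10, and the random Gaussian data are all made up for this example, and the 2D layout comes from a plain NumPy PCA rather than from t-SNE or UMAP.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row, excluding the row itself."""
    # Pairwise squared Euclidean distances via broadcasting (fine for small n).
    diffs = X[:, None, :] - X[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k high-D neighbors that survive in the low-D layout."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

def pca_2d(X):
    """Linear 2D projection onto the top two principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # hypothetical 20-dimensional data

score_identity = neighborhood_overlap(X, X, k=10)       # layout identical to the data
score_pca = neighborhood_overlap(X, pca_2d(X), k=10)    # linear 2D projection
print(f"identity: {score_identity:.2f}, PCA to 2D: {score_pca:.2f}")
```

A score of 1 means every local neighborhood survived the projection. Passing a t-SNE or UMAP layout as `X_low` gives a rough, hypothetical way to compare how well different embeddings keep neighbors together on your own data.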
t-SNE vs UMAP
- t-SNE: the original. It renders local structure well, but it is slow and largely discards global structure, so distances between clusters should not be read as meaningful. Tuned by perplexity (roughly, how many neighbors each point considers).
- UMAP: newer, typically much faster, scales to millions of points, and preserves somewhat more global structure. It’s now the default for most embedding visualization. Tuned by n_neighbors and min_dist.
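As a rough sketch of how these knobs appear in code, the example below assumes scikit-learn for t-SNE and the separate umap-learn package for UMAP. The wrapper names `embed_tsne` and `embed_umap`, the synthetic blob data, and the parameter values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

def embed_tsne(X, perplexity=30.0, seed=0):
    """2D t-SNE layout; perplexity must be smaller than the number of samples."""
    return TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(X)

def embed_umap(X, n_neighbors=15, min_dist=0.1, seed=0):
    """2D UMAP layout; assumes the optional umap-learn package is installed."""
    import umap  # imported lazily so the t-SNE part runs without umap-learn
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=n_neighbors,  # size of the local neighborhood considered
        min_dist=min_dist,        # how tightly points may be packed in the layout
        random_state=seed,
    )
    return reducer.fit_transform(X)

# Hypothetical data: three Gaussian blobs in 10 dimensions.
X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

Y_tsne = embed_tsne(X, perplexity=30.0)
print(Y_tsne.shape)  # (150, 2)

# With umap-learn installed, the UMAP layout has the same shape:
# Y_umap = embed_umap(X, n_neighbors=15, min_dist=0.1)
```

Because both layouts depend on these hyperparameters and on the seed, it is usually worth rerunning with a few values before reading anything into a plot.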
Next
You’ve now seen the full unsupervised toolkit. The last practical lesson: AutoML — when to let a tool do the search for you.
Practice this in an interview
t-SNE and UMAP are nonlinear dimensionality reduction algorithms designed primarily for 2D/3D visualization of high-dimensional data. Unlike PCA, they preserve local neighborhood structure rather than global variance, producing cleaner cluster separations in plots. Neither is a good default preprocessing step for training a supervised model: both are stochastic, so their output varies across runs, and standard t-SNE is transductive, with no way to map new points into an existing embedding.
Both are nonlinear dimensionality-reduction methods for visualization that preserve local neighborhood structure, but UMAP is faster, scales better, and tends to preserve more global structure, while t-SNE emphasizes tight local clusters. The main pitfall is over-interpreting the plots: cluster sizes, densities, and distances between clusters are not meaningful, and results depend heavily on hyperparameters like perplexity or n_neighbors. Neither should be used as features for a downstream model.
t-SNE is a visualization method that optimizes a non-parametric 2D/3D embedding preserving local neighborhoods; it has no stable transform, distorts global structure and distances, and is stochastic, so its coordinates aren't reliable predictive features. It also can't naturally project new (test) points. For feature compression use PCA, autoencoders, or supervised embeddings, which provide a consistent, reusable mapping.
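The difference between a reusable mapping and a one-off layout shows up directly in scikit-learn's estimator interface. The following is a minimal sketch on made-up blob data, assuming scikit-learn is available: PCA is fitted on a training split and then applied unchanged to held-out points, while the t-SNE estimator exposes no `transform` method to call on new data.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Hypothetical data: three Gaussian blobs in 10 dimensions.
X, y = make_blobs(n_samples=200, n_features=10, centers=3, random_state=0)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# PCA learns a fixed linear map from the training data...
pca = PCA(n_components=2).fit(X_train)
Z_train = pca.transform(X_train)
# ...and the same map can be applied to points it has never seen.
Z_test = pca.transform(X_test)
print(Z_train.shape, Z_test.shape)  # (150, 2) (50, 2)

# scikit-learn's TSNE only offers fit / fit_transform: there is no
# transform() for projecting new points into an existing layout.
tsne_has_transform = hasattr(TSNE(), "transform")
print(tsne_has_transform)  # False
```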
PCA finds the orthogonal directions of maximum variance in the data and projects onto a lower-dimensional subspace, reducing the number of features while retaining as much variance as possible. It is most useful before distance-based models or when training is bottlenecked by dimensionality. Its main limits are loss of interpretability, sensitivity to scale, and an assumption of linear structure.
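The sensitivity to scale can be seen in a small NumPy sketch. The data here is invented for illustration: one feature is independent noise measured in much larger units, and two features are strongly correlated. The helper `first_component` is a hypothetical name for a plain SVD-based PCA.

```python
import numpy as np

def first_component(X):
    """Unit-length direction of maximum variance (first principal axis)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)  # independent feature, recorded in large units below
b = rng.normal(size=n)  # shared signal behind the two correlated features
X = np.column_stack([1000.0 * a, b, b + 0.1 * rng.normal(size=n)])

# On raw data the large-unit feature dominates the first component.
pc1_raw = first_component(X)

# After standardizing each column, the correlated pair dominates instead.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = first_component(X_std)

print("raw loadings:         ", np.round(np.abs(pc1_raw), 2))
print("standardized loadings:", np.round(np.abs(pc1_std), 2))
```

Standardizing features before PCA is a common choice when the columns are measured in different units, though whether it is appropriate depends on the data.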