When would you use DBSCAN instead of k-means, and what are its main limitations?
Use DBSCAN when clusters have arbitrary, non-spherical shapes, when the number of clusters is unknown, or when you need to detect outliers, since it groups by density and labels low-density points as noise. Its main limitations are sensitivity to the eps and minPts parameters, difficulty when clusters have very different densities, and poor behavior in high dimensions, where distances become unreliable.
How to think about it
The crisp answer
Reach for DBSCAN when k-means’s assumptions fail: you have non-spherical clusters, you don’t know the number of clusters, or you expect outliers. DBSCAN groups points by density and explicitly labels sparse points as noise, so it finds arbitrary shapes and sets outliers aside instead of forcing them into clusters.
Why it differs from k-means
K-means assigns every point to one of k clusters by nearest centroid, which favors compact, roughly spherical groups. DBSCAN instead defines clusters as dense regions connected through core points (points with at least minPts points, counting themselves, within radius eps). As the Hex comparison of density-based methods explains, this lets it recover crescents, rings, and other shapes k-means cuts through, while reporting noise points separately.
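A minimal sketch of that contrast, assuming scikit-learn is available. The two-moons dataset and the values eps=0.2 and min_samples=5 are illustrative choices, not recommendations; the adjusted Rand index is used here only as one convenient way to compare each labeling against the known crescents.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: a classic non-spherical test case (synthetic data).
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# k-means must be told k and splits the plane by nearest centroid.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN is not told the number of clusters; eps and min_samples are assumed values.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the true crescents (1.0 would be a perfect match).
ari_km = adjusted_rand_score(y_true, km_labels)
ari_db = adjusted_rand_score(y_true, db_labels)
print(f"k-means ARI: {ari_km:.2f}, DBSCAN ARI: {ari_db:.2f}")
```

On this kind of data DBSCAN typically follows each crescent, while k-means draws a straight boundary through both.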
When to use it
- Geospatial clustering, anomaly detection, and customer segmentation when the cluster count is unknown.
- Data with noise you want flagged rather than absorbed.
- Clusters of irregular shape and roughly similar density.
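To illustrate the noise-flagging use case from the list above, here is a small sketch on made-up data: two tight blobs plus one distant point. The blob positions, eps=0.5, and min_samples=5 are assumptions picked for the illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)

# Two tight synthetic blobs plus a single far-away point (all values invented).
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(50, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outlier])

db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN marks noise with the label -1; k-means has no noise label at all.
print("DBSCAN label for the outlier:", db_labels[-1])
print("k-means label for the outlier:", km_labels[-1])
```

The distant point has no neighbors within eps, so DBSCAN reports it as noise, whereas k-means always returns a regular cluster id for it.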
The main limitations
- Parameter sensitivity: eps and minPts are hard to set; a k-distance plot (each point’s distance to its k-th nearest neighbor, sorted) helps pick eps near the elbow of the curve.
- Varying densities: a single global eps can’t capture clusters that are dense in one region and sparse in another (HDBSCAN addresses this).
- High dimensions: distances concentrate, so density becomes unreliable — the curse of dimensionality.
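The k-distance heuristic mentioned in the list can be sketched as follows. The helper name k_distance_curve, the synthetic blobs, and k=4 are hypothetical choices for illustration; the curve is only computed here, and plotting it (for example with matplotlib) is left to the reader.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def k_distance_curve(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Called without arguments, kneighbors() excludes each point from its own neighbor list.
    distances, _ = nn.kneighbors()
    return np.sort(distances[:, -1])


# Synthetic data, scaled first so no single feature dominates the distance.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

curve = k_distance_curve(X_scaled, k=4)
# A candidate eps is the height at which this sorted curve bends sharply upward.
print(curve[:5], curve[-5:])
```

Because a single eps is read off one global curve, data with several distinct densities tends to show several bends rather than one clear elbow, which is the varying-density limitation in practice.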
The common trap
Forgetting that DBSCAN has no centroids, so it has no built-in way to assign new points without re-running (scikit-learn’s DBSCAN offers fit_predict but no predict), and that it struggles with multi-density data. Always scale features first, because eps is a single distance threshold applied across all features. Follow-up: “What if densities vary a lot?” Use HDBSCAN, which builds a hierarchy and extracts clusters at varying density levels, or hierarchical clustering when you want a dendrogram and no fixed k.
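If re-running is not an option, one common workaround is to give a new point the label of its nearest core sample when that sample lies within eps, and to call it noise otherwise. The sketch below assumes scikit-learn; assign_new_points is a hypothetical helper, not part of the library, and it only approximates what a full re-fit would produce, since new points never become core points themselves.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def assign_new_points(db, X_new):
    """Label new points by their nearest core sample, or -1 if none is within db.eps."""
    X_new = np.asarray(X_new)
    labels = np.full(X_new.shape[0], -1, dtype=int)
    if len(db.components_) == 0:  # no core samples: everything stays noise
        return labels
    core_labels = db.labels_[db.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(db.components_)
    dist, idx = nn.kneighbors(X_new)
    within = dist[:, 0] <= db.eps
    labels[within] = core_labels[idx[within, 0]]
    return labels


# Synthetic data; scale before clustering and reuse the same scaler for new points.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10]],
                  cluster_std=0.5, random_state=0)
scaler = StandardScaler().fit(X)
db = DBSCAN(eps=0.3, min_samples=5).fit(scaler.transform(X))

new_points = scaler.transform([[0.1, -0.1], [50.0, 50.0]])
new_labels = assign_new_points(db, new_points)
print(new_labels)
```

The first new point sits inside a blob and inherits that blob’s label; the second is far from every core sample and is reported as noise.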