DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds arbitrarily shaped clusters and labels outliers as noise points.
When to use:
- Don't know number of clusters in advance
- Clusters have arbitrary shapes (not just spherical)
- Need to identify outliers/anomalies
- Clusters have roughly similar densities (a single eps has to fit all of them)
Strengths: Finds arbitrary shapes, detects outliers, no need to specify k, robust to noise
Weaknesses: Sensitive to parameters (eps, min_samples), struggles with varying densities, not scalable to very large datasets
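A minimal sketch of the behavior described above, assuming the model maps onto scikit-learn's DBSCAN (the parameter names eps and min_samples suggest this, but it is an assumption). The toy data is made up for illustration: two tight groups and one far-away point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (invented for this example): two tight groups plus one distant point.
X = np.array([
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.2],
    [8.0, 8.0], [8.1, 8.2], [7.9, 8.0], [8.2, 7.9], [8.0, 8.1],
    [4.5, 15.0],
])

# No number of clusters is given; only the density parameters.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# Clusters are numbered from 0; noise points get the label -1.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = labels.tolist().count(-1)
print(labels.tolist(), n_clusters, n_noise)
```

Here the distant point is the only one labeled -1, which is how outliers show up in the result.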
Model Parameters
Eps (default: 0.5, required) Maximum distance between two samples for them to be considered neighbors. This is the parameter that most strongly affects the result.
- Too low: Many small clusters and noise points
- Too high: Merges distinct clusters
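- Use a k-distance plot to choose eps: sort each point's distance to its k-th nearest neighbor and look for the elbow where the curve rises sharply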
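The k-distance idea can be sketched in plain Python. This is an illustrative helper, not part of any library: `k_distances` and the toy `points` are hypothetical names, and choosing k close to Min Samples is a common convention rather than a rule.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Toy data (invented): two tight groups plus one distant point.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.2),
    (8.0, 8.0), (8.1, 8.2), (7.9, 8.0), (8.2, 7.9), (8.0, 8.1),
    (4.5, 15.0),
]

kd = k_distances(points, k=2)

# Plotting kd against its index gives the k-distance plot.
# A reasonable eps sits just below the sharp jump at the end of the curve.
for index, distance in enumerate(kd):
    print(index, round(distance, 3))
```

On this data the sorted distances stay small for the clustered points and jump for the isolated one, so any eps between the two levels separates clusters from noise. On larger datasets a nearest-neighbor index would be the usual tool instead of this quadratic loop.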
Min Samples (default: 5, required) Minimum number of points that must lie within eps of a point for it to count as a core point, that is, for its neighborhood to be a dense region.
- 3-5: Sensitive, more clusters
- 5-10: Good default
- 10+: Conservative, fewer clusters, more noise
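The effect of raising Min Samples can be seen on a small example. This sketch again assumes scikit-learn's DBSCAN, where the count includes the point itself; the data (a group of five and a group of three) is invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (invented): one group of 5 points and one group of 3 points.
X = np.array([
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.2],
    [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
])

results = {}
for min_samples in (3, 5):
    labels = DBSCAN(eps=0.5, min_samples=min_samples).fit_predict(X).tolist()
    n_clusters = len(set(labels) - {-1})
    n_noise = labels.count(-1)
    results[min_samples] = (n_clusters, n_noise)
    print(f"min_samples={min_samples}: clusters={n_clusters}, noise={n_noise}")
```

With the lower setting both groups are clusters; with the higher one the group of three no longer has enough neighbors to contain a core point and is reported as noise.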
Metric (default: "euclidean") Distance metric:
- euclidean: Standard distance (default)
- manhattan: City-block distance
- chebyshev: Maximum coordinate difference
- Others: cosine, minkowski, etc.
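The metric decides which points count as neighbors at a given eps. A minimal plain-Python sketch of the three named metrics follows; the function names and the sample points are chosen here for illustration only.

```python
import math

def euclidean(a, b):
    # Straight-line distance.
    return math.dist(a, b)

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (0.4, 0.4)
eps = 0.5

is_neighbor = {}
for name, metric in (("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("chebyshev", chebyshev)):
    distance = metric(a, b)
    is_neighbor[name] = distance <= eps
    print(f"{name}: distance={distance:.3f}, neighbor={is_neighbor[name]}")
```

The same pair of points falls inside eps under chebyshev but outside it under euclidean and manhattan, so eps generally needs retuning whenever the metric changes.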
Algorithm (default: "auto") Algorithm for nearest neighbors:
- auto: Automatically choose best (default)
- ball_tree: Copes better than kd_tree as the number of dimensions grows
- kd_tree: Fast for low dimensions
- brute: Compares every pair of points; slow on large data, fine for small datasets
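The Algorithm setting changes how neighbors are searched, not which points are neighbors, so it should affect speed rather than the clustering. A quick check, assuming scikit-learn's DBSCAN and the same invented toy data as elsewhere:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (invented): two tight groups plus one distant point.
X = np.array([
    [1.0, 1.0], [1.2, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.2],
    [8.0, 8.0], [8.1, 8.2], [7.9, 8.0], [8.2, 7.9], [8.0, 8.1],
    [4.5, 15.0],
])

labelings = {}
for algorithm in ("auto", "ball_tree", "kd_tree", "brute"):
    model = DBSCAN(eps=0.5, min_samples=3, algorithm=algorithm)
    labelings[algorithm] = model.fit_predict(X).tolist()

# Every search strategy is expected to produce the same labels on this data.
same_labels = all(labels == labelings["auto"] for labels in labelings.values())
print(same_labels)
```

In practice the choice matters mainly for run time and memory on larger datasets; leaving it on auto is a reasonable starting point.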
P (optional) Power parameter for Minkowski metric (2 = Euclidean, 1 = Manhattan).
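How P generalizes the other metrics can be shown with a short plain-Python sketch. The `minkowski` function below is written for illustration and is not a library call.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

d1 = minkowski(a, b, 1)    # same as Manhattan: 3 + 4
d2 = minkowski(a, b, 2)    # same as Euclidean: sqrt(9 + 16)
d10 = minkowski(a, b, 10)  # larger p moves toward Chebyshev: max(3, 4)

print(d1, d2, round(d10, 3))
```

As p grows, the largest coordinate difference dominates, which is why large values of P behave more and more like the chebyshev metric.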