Clustering
Discover hidden patterns and group similar data points automatically
Clustering algorithms automatically discover natural groupings in your data without predefined labels. Use clustering to segment customers, detect anomalies, organize documents, or explore data structure.
🎓 Learn About Clustering
New to clustering? Visit our Clustering Concepts Guide to learn about evaluation metrics (Silhouette Score, Davies-Bouldin Index), common approaches, and when to use clustering for your unsupervised learning tasks.
Available Models
We support 11 different clustering algorithms, each suited for different data patterns:
Centroid-Based Models
- K-Means - Fast, scalable clustering for spherical clusters
- Mini Batch K-Means - Faster K-Means variant for large datasets
- Bisecting K-Means - Hierarchical variant of K-Means
Density-Based Models
- DBSCAN - Finds arbitrary-shaped clusters and detects outliers
- OPTICS - DBSCAN variant that does not require a preset distance (eps) parameter
- Mean Shift - Finds clusters by seeking density peaks
Hierarchical Models
- Hierarchical Clustering - Creates a tree of clusters (dendrogram)
- BIRCH - Memory-efficient hierarchical clustering for large datasets
Model-Based
- Gaussian Mixture Model - Probabilistic clustering with soft assignments
Graph-Based
- Spectral Clustering - Uses graph theory to find non-convex clusters
Message-Passing
- Affinity Propagation - Automatically determines number of clusters
Common Configuration
Most models share these settings:
Feature Configuration
Feature Columns (required) Select which columns from your dataset to use for clustering. Choose features that capture meaningful variation in your data. Consider scaling features to similar ranges for distance-based algorithms.
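To illustrate why scaling matters, here is a minimal sketch using scikit-learn's StandardScaler (the two features here are invented for the example; your platform may handle scaling internally):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales:
# e.g. tenure in years vs. annual spend in dollars.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

# StandardScaler gives each column zero mean and unit variance,
# so neither feature dominates Euclidean distance calculations.
X_scaled = StandardScaler().fit_transform(X)
```

Without scaling, the spend column (thousands) would dwarf the tenure column (single digits) in every distance computation.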
Hyperparameter Tuning
Enable Hyperparameter Tuning Automatically search for the best model parameters. This can improve cluster quality but takes longer.
- Disabled: Use default parameters (faster)
- Enabled: Search for better parameters (slower, usually better results)
Tuning Method (when tuning is enabled)
- Grid Search: Try all combinations systematically
- Random Search: Try random combinations (faster)
- Bayesian Search: Intelligently search the parameter space
CV Folds (when tuning is enabled) Number of cross-validation folds (default: 5). Higher values give more reliable results but take longer.
Scoring Metric (when tuning is enabled) How to evaluate cluster quality:
- Silhouette Score: How well each point fits its cluster vs. other clusters (-1 to 1, higher is better)
- Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
- Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance (higher is better)
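As a rough sketch of what silhouette-scored tuning does under the hood (on toy data; the actual search the platform runs may differ):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Score each candidate k by silhouette (higher is better) and keep the best.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Grid search enumerates candidates exhaustively like this loop; random and Bayesian search sample the same space more cheaply.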
Understanding Clustering Metrics
Silhouette Score
Measures how similar a point is to its own cluster compared to other clusters.
- Range: -1 to 1
- Higher is better: 1 = perfect, 0 = overlapping, negative = wrong cluster
- Interpretation:
- 0.71-1.0: Strong structure
- 0.51-0.70: Reasonable structure
- 0.26-0.50: Weak structure
- ≤0.25: No substantial structure
Davies-Bouldin Index
Average similarity ratio of each cluster with its most similar cluster.
- Lower is better: Closer to 0 means better separation
- Interpretation: Measures both compactness and separation
- Use: Compare different k values
Calinski-Harabasz Index (Variance Ratio Criterion)
Ratio of between-cluster to within-cluster variance.
- Higher is better: More distinct clusters
- Interpretation: Higher values indicate better-defined clusters
- Use: Finding optimal k
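All three metrics are available in scikit-learn, which the sketch below assumes (the blob data is synthetic, chosen only to make the scores easy to interpret):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Three well-separated synthetic clusters.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better, >= 0
ch = calinski_harabasz_score(X, labels)  # higher is better, unbounded
```

Each metric takes only the data and the predicted labels, so they can be computed for any clustering result, not just K-Means.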
Choosing the Right Model
Quick Start Guide
- Start with K-Means: Fast baseline if you know k
- Try DBSCAN: If you don't know k or need arbitrary shapes
- Experiment with Hierarchical: For visualization and multiple granularities
- Fine-tune: Adjust parameters based on results
- Validate: Use multiple metrics and visual inspection
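The first two steps can be sketched on a dataset where cluster shape matters (the half-moon data and the eps/min_samples values here are illustrative, not a recommendation for your data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Baseline: K-Means assumes roughly spherical clusters,
# so it will cut each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density, so it can follow the curved shapes;
# eps and min_samples usually need tuning for your data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(db_labels) - {-1})  # label -1 marks noise points
```

Note that DBSCAN infers the number of clusters from density, while K-Means requires k up front.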
By Dataset Size
- Small (<1k rows): Any algorithm, try Hierarchical for visualization
- Medium (1k-10k): K-Means, DBSCAN, Gaussian Mixture
- Large (10k-100k): K-Means, Mini Batch K-Means, BIRCH
- Very Large (>100k): Mini Batch K-Means, BIRCH
By Cluster Shape
- Spherical: K-Means, Mini Batch K-Means, Gaussian Mixture (spherical)
- Arbitrary shapes: DBSCAN, OPTICS, Mean Shift, Spectral
- Elliptical: Gaussian Mixture (full covariance)
- Non-convex: Spectral, DBSCAN
By Requirements
- Know k: K-Means, Hierarchical, Gaussian Mixture
- Don't know k: DBSCAN, Mean Shift, Affinity Propagation, OPTICS
- Need speed: K-Means, Mini Batch K-Means, BIRCH
- Need hierarchy: Hierarchical, BIRCH, Bisecting K-Means
- Outlier detection: DBSCAN, OPTICS
- Probabilistic: Gaussian Mixture
- Varying densities: OPTICS, Mean Shift
By Data Characteristics
- High dimensional: Spectral, Gaussian Mixture
- Graph/network data: Spectral
- Text/sparse data: Bisecting K-Means, K-Means
- Noisy data: DBSCAN, OPTICS, K-Means with preprocessing
- Mixed densities: OPTICS, Mean Shift
- Streaming/online: BIRCH, Mini Batch K-Means
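For the streaming case, Mini Batch K-Means can be fed data incrementally via `partial_fit`, as in this sketch (the simulated batches below stand in for data arriving from a stream):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3)

# Feed data in batches, as it might arrive from a stream;
# each batch is drawn around one of three centers.
for center in [-5.0, 0.0, 5.0] * 5:
    batch = rng.normal(loc=center, size=(100, 2))
    mbk.partial_fit(batch)

centers = mbk.cluster_centers_
```

BIRCH supports the same incremental pattern via its own `partial_fit` method.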
Best Practices
- Scale your features - Critical for distance-based algorithms (K-Means, DBSCAN, etc.)
- Determine k - Use elbow method, silhouette analysis, or domain knowledge
- Start simple - K-Means is a great baseline
- Validate results - Use multiple metrics (Silhouette, Davies-Bouldin) AND visual inspection
- Handle outliers - Remove them beforehand, or use DBSCAN/OPTICS, which treat outliers as noise naturally
- Feature selection - Remove irrelevant features; they add noise in high dimensions
- Try multiple algorithms - Different algorithms may reveal different patterns
- Iterate - Clustering is exploratory; refine based on domain knowledge
- Reproducibility - Always set random_state for consistent results
- Interpret results - Statistics are guides, not truth; validate with domain experts
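The elbow method mentioned above can be sketched as follows (synthetic data; in practice you would plot the inertia values and look for the bend):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Inertia (within-cluster sum of squares) drops as k grows;
# the "elbow" where the drop flattens suggests a good k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in range(1, 8)}
```

Because inertia always decreases as k increases, the raw minimum is useless; it is the point of diminishing returns that matters, cross-checked against silhouette analysis or domain knowledge.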
Common Pitfalls
- Not scaling features: Distance metrics are sensitive to feature scales
- Using wrong algorithm: K-Means on non-spherical clusters won't work well
- Ignoring outliers: Can severely affect centroid-based methods
- Optimizing only on metrics: Visual inspection is essential
- Over-clustering: Too many clusters lose interpretability
- Under-clustering: Missing important distinctions
Next Steps
Ready to cluster? Head to the Training page and:
- Select your dataset
- Choose a clustering model based on this guide
- Configure parameters (or enable hyperparameter tuning)
- Analyze results with visualizations and metrics
- Iterate and refine based on domain knowledge