Clustering
Discover hidden patterns and group similar data points automatically
Clustering algorithms automatically discover natural groupings in your data without predefined labels. Use clustering to segment customers, detect anomalies, organize documents, or explore data structure.
🎓 Learn About Clustering
New to clustering? Visit our Clustering Concepts Guide to learn about evaluation metrics (Silhouette Score, Davies-Bouldin Index), common approaches, and when to use clustering for your unsupervised learning tasks.
Available Models
We support 11 different clustering algorithms, each suited for different data patterns:
Centroid-Based Models
- K-Means - Fast, scalable clustering for spherical clusters
- Mini Batch K-Means - Faster K-Means variant for large datasets
- Bisecting K-Means - Hierarchical variant of K-Means
Density-Based Models
- DBSCAN - Finds arbitrary-shaped clusters and detects outliers
- OPTICS - DBSCAN variant that does not require a preset distance (eps) parameter
- Mean Shift - Finds clusters by seeking density peaks
Hierarchical Models
- Hierarchical Clustering - Creates a tree of clusters (dendrogram)
- BIRCH - Memory-efficient hierarchical clustering for large datasets
Model-Based
- Gaussian Mixture Model - Probabilistic clustering with soft assignments
Graph-Based
- Spectral Clustering - Uses graph theory to find non-convex clusters
Message-Passing
- Affinity Propagation - Automatically determines number of clusters
Common Configuration
Most models share these settings:
Feature Configuration
Feature Columns (required) Select which columns from your dataset to use for clustering. Choose features that capture meaningful variation in your data. Consider scaling features to similar ranges for distance-based algorithms.
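To illustrate why scaling matters, here is a minimal sketch using scikit-learn's StandardScaler (the two features here are invented for the example; your platform may handle scaling internally):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales:
# e.g. tenure in years vs. annual spend in dollars.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

# StandardScaler gives each column zero mean and unit variance,
# so neither feature dominates Euclidean distance calculations.
X_scaled = StandardScaler().fit_transform(X)
```

Without scaling, the spend column (thousands) would dwarf the tenure column (single digits) in every distance computation.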
Hyperparameter Tuning
Enable Hyperparameter Tuning Automatically search for the best model parameters. This can improve cluster quality but takes longer.
- Disabled: Use default parameters (faster)
- Enabled: Search for better parameters (slower, usually better results)
Tuning Method (when tuning is enabled)
- Grid Search: Try all combinations systematically
- Random Search: Try random combinations (faster)
- Bayesian Search: Intelligently search the parameter space
CV Folds (when tuning is enabled) Number of cross-validation folds (default: 5). Higher values give more reliable results but take longer.
Scoring Metric (when tuning is enabled) How to evaluate cluster quality:
- Silhouette Score: How well each point fits its cluster vs. other clusters (-1 to 1, higher is better)
- Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
- Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance (higher is better)
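As a rough sketch of what silhouette-scored tuning does under the hood (on toy data; the actual search the platform runs may differ):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Score each candidate k by silhouette (higher is better) and keep the best.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Grid search enumerates candidates exhaustively like this loop; random and Bayesian search sample the same space more cheaply.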
Understanding Clustering Metrics
Silhouette Score
Measures how similar a point is to its own cluster compared to other clusters.
- Range: -1 to 1
- Higher is better: 1 = perfect, 0 = overlapping, negative = wrong cluster
- Interpretation:
- 0.71-1.0: Strong structure
- 0.51-0.70: Reasonable structure
- 0.26-0.50: Weak structure
- ≤0.25: No substantial structure
Davies-Bouldin Index
Average similarity ratio of each cluster with its most similar cluster.
- Lower is better: Closer to 0 means better separation
- Interpretation: Measures both compactness and separation
- Use: Compare different k values
Calinski-Harabasz Index (Variance Ratio Criterion)
Ratio of between-cluster to within-cluster variance.
- Higher is better: More distinct clusters
- Interpretation: Higher values indicate better-defined clusters
- Use: Finding optimal k
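All three metrics are available in scikit-learn, which the sketch below assumes (the blob data is synthetic, chosen only to make the scores easy to interpret):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Three well-separated synthetic clusters.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better, >= 0
ch = calinski_harabasz_score(X, labels)  # higher is better, unbounded
```

Each metric takes only the data and the predicted labels, so they can be computed for any clustering result, not just K-Means.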
Choosing the Right Model
Quick Start Guide
- Start with K-Means: Fast baseline if you know k
- Try DBSCAN: If you don't know k or need arbitrary shapes
- Experiment with Hierarchical: For visualization and multiple granularities
- Fine-tune: Adjust parameters based on results
- Validate: Use multiple metrics and visual inspection
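The first two steps can be sketched on a dataset where cluster shape matters (the half-moon data and the eps/min_samples values here are illustrative, not a recommendation for your data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Baseline: K-Means assumes roughly spherical clusters,
# so it will cut each moon in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups by density, so it can follow the curved shapes;
# eps and min_samples usually need tuning for your data.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(db_labels) - {-1})  # label -1 marks noise points
```

Note that DBSCAN infers the number of clusters from density, while K-Means requires k up front.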
By Dataset Size
- Small (<1k rows): Any algorithm, try Hierarchical for visualization
- Medium (1k-10k): K-Means, DBSCAN, Gaussian Mixture
- Large (10k-100k): K-Means, Mini Batch K-Means, BIRCH
- Very Large (>100k): Mini Batch K-Means, BIRCH
By Cluster Shape
- Spherical: K-Means, Mini Batch K-Means, Gaussian Mixture (spherical)
- Arbitrary shapes: DBSCAN, OPTICS, Mean Shift, Spectral
- Elliptical: Gaussian Mixture (full covariance)
- Non-convex: Spectral, DBSCAN
By Requirements
- Know k: K-Means, Hierarchical, Gaussian Mixture
- Don't know k: DBSCAN, Mean Shift, Affinity Propagation, OPTICS
- Need speed: K-Means, Mini Batch K-Means, BIRCH
- Need hierarchy: Hierarchical, BIRCH, Bisecting K-Means
- Outlier detection: DBSCAN, OPTICS
- Probabilistic: Gaussian Mixture
- Varying densities: OPTICS, Mean Shift
By Data Characteristics
- High dimensional: Spectral, Gaussian Mixture
- Graph/network data: Spectral
- Text/sparse data: Bisecting K-Means, K-Means
- Noisy data: DBSCAN, OPTICS, K-Means with preprocessing
- Mixed densities: OPTICS, Mean Shift
- Streaming/online: BIRCH, Mini Batch K-Means
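For the streaming case, Mini Batch K-Means can be fed data incrementally via `partial_fit`, as in this sketch (the simulated batches below stand in for data arriving from a stream):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
mbk = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=3)

# Feed data in batches, as it might arrive from a stream;
# each batch is drawn around one of three centers.
for center in [-5.0, 0.0, 5.0] * 5:
    batch = rng.normal(loc=center, size=(100, 2))
    mbk.partial_fit(batch)

centers = mbk.cluster_centers_
```

BIRCH supports the same incremental pattern via its own `partial_fit` method.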
Best Practices
- Scale your features - Critical for distance-based algorithms (K-Means, DBSCAN, etc.)
- Determine k - Use elbow method, silhouette analysis, or domain knowledge
- Start simple - K-Means is a great baseline
- Validate results - Use multiple metrics (Silhouette, Davies-Bouldin) AND visual inspection
- Handle outliers - Remove them beforehand, or use DBSCAN/OPTICS, which treat outliers as noise naturally
- Feature selection - Remove irrelevant features; they add noise in high dimensions
- Try multiple algorithms - Different algorithms may reveal different patterns
- Iterate - Clustering is exploratory; refine based on domain knowledge
- Reproducibility - Always set random_state for consistent results
- Interpret results - Statistics are guides, not truth; validate with domain experts
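The elbow method mentioned above can be sketched as follows (synthetic data; in practice you would plot the inertia values and look for the bend):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Inertia (within-cluster sum of squares) drops as k grows;
# the "elbow" where the drop flattens suggests a good k.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in range(1, 8)}
```

Because inertia always decreases as k increases, the raw minimum is useless; it is the point of diminishing returns that matters, cross-checked against silhouette analysis or domain knowledge.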
Common Pitfalls
- Not scaling features: Distance metrics are sensitive to feature scales
- Using wrong algorithm: K-Means on non-spherical clusters won't work well
- Ignoring outliers: Can severely affect centroid-based methods
- Optimizing only on metrics: Visual inspection is essential
- Over-clustering: Too many clusters lose interpretability
- Under-clustering: Missing important distinctions
Next Steps
Ready to cluster? Head to the Training page and:
- Select your dataset
- Choose a clustering model based on this guide
- Configure parameters (or enable hyperparameter tuning)
- Analyze results with visualizations and metrics
- Iterate and refine based on domain knowledge