
Clustering

Discover hidden patterns and group similar data points automatically

Clustering algorithms automatically discover natural groupings in your data without predefined labels. Use clustering to segment customers, detect anomalies, organize documents, or explore data structure.

🎓 Learn About Clustering

New to clustering? Visit our Clustering Concepts Guide to learn about evaluation metrics (Silhouette Score, Davies-Bouldin Index), common approaches, and when to use clustering for your unsupervised learning tasks.

Available Models

We support 11 different clustering algorithms, each suited for different data patterns:

Centroid-Based Models

  • K-Means - Fast, general-purpose clustering into k spherical groups
  • Mini Batch K-Means - K-Means variant that scales to very large datasets
  • Bisecting K-Means - Splits clusters recursively for a top-down hierarchy

Density-Based Models

  • DBSCAN - Finds arbitrary-shaped clusters and detects outliers
  • OPTICS - DBSCAN variant that does not require a preset distance (eps) parameter
  • Mean Shift - Finds clusters by seeking density peaks

Hierarchical Models

  • Hierarchical (Agglomerative) - Builds a tree of nested clusters, viewable at multiple granularities
  • BIRCH - Incremental tree-based clustering for large or streaming data

Model-Based

  • Gaussian Mixture - Probabilistic clustering with soft assignments and flexible cluster shapes

Graph-Based

  • Spectral Clustering - Uses graph structure to find non-convex clusters

Message-Passing

  • Affinity Propagation - Points exchange messages to elect exemplars; no preset k required
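
To make the density-based family above concrete, here is a minimal sketch (assuming scikit-learn; the dataset and parameters are illustrative). DBSCAN separates two interleaved crescents, a shape centroid-based methods cannot handle, and marks outliers with the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents -- non-spherical clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))  # DBSCAN marks outliers with label -1
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

Note that `eps` and `min_samples` control what counts as a dense region; OPTICS removes the need to fix `eps` in advance.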

Common Configuration

Most models share these settings:

Feature Configuration

Feature Columns (required)
Select which columns from your dataset to use for clustering. Choose features that capture meaningful variation in your data. Consider scaling features to similar ranges for distance-based algorithms.
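
A minimal sketch of that scaling advice (assuming scikit-learn; the age/income columns are made up): standardization keeps the large-magnitude income column from dominating Euclidean distances in K-Means or DBSCAN.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[25.0, 50_000.0],   # [age, income] -- wildly different scales
              [32.0, 64_000.0],
              [47.0, 120_000.0],
              [51.0, 98_000.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6))  # each column now has mean ~0
print(X_scaled.std(axis=0).round(6))   # ...and unit standard deviation
```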

Hyperparameter Tuning

Enable Hyperparameter Tuning
Automatically search for the best model parameters. This can improve cluster quality but takes longer.

  • Disabled: Use default parameters (faster)
  • Enabled: Search for optimal parameters (better results)

Tuning Method (when tuning is enabled)

  • Grid Search: Try all combinations systematically
  • Random Search: Try random combinations (faster)
  • Bayesian Search: Intelligently search the parameter space

CV Folds (when tuning is enabled)
Number of cross-validation folds (default: 5). Higher values give more reliable results but take longer.

Scoring Metric (when tuning is enabled)
How to evaluate cluster quality:

  • Silhouette Score: How well each point fits its cluster vs. other clusters (-1 to 1, higher is better)
  • Davies-Bouldin Index: Ratio of within-cluster to between-cluster distances (lower is better)
  • Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance (higher is better)
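
The three scoring metrics can be computed directly for any fitted clustering. A minimal sketch (assuming scikit-learn; the dataset and k are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better
print(sil, db, ch)
```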

Understanding Clustering Metrics

Silhouette Score

Measures how similar a point is to its own cluster compared to other clusters.

  • Range: -1 to 1
  • Higher is better: 1 = perfect, 0 = overlapping, negative = wrong cluster
  • Interpretation:
    • 0.71-1.0: Strong structure
    • 0.51-0.70: Reasonable structure
    • 0.26-0.50: Weak structure
    • ≤0.25: No substantial structure

Davies-Bouldin Index

Average similarity ratio of each cluster with its most similar cluster.

  • Lower is better: Closer to 0 means better separation
  • Interpretation: Measures both compactness and separation
  • Use: Compare different k values

Calinski-Harabasz Index (Variance Ratio Criterion)

Ratio of between-cluster to within-cluster variance.

  • Higher is better: More distinct clusters
  • Interpretation: Higher values indicate better-defined clusters
  • Use: Finding optimal k
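
Using these metrics to find k amounts to sweeping candidate values and keeping the best score. A minimal sketch (assuming scikit-learn) with the silhouette score; the Calinski-Harabasz index could be used the same way. The blob centers are fixed here so the true k is 3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=[[0, 0], [6, 6], [0, 8]],
                  cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the sweep recovers k = 3 for this well-separated data
```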

Choosing the Right Model

Quick Start Guide

  1. Start with K-Means: Fast baseline if you know k
  2. Try DBSCAN: If you don't know k or need arbitrary shapes
  3. Experiment with Hierarchical: For visualization and multiple granularities
  4. Fine-tune: Adjust parameters based on results
  5. Validate: Use multiple metrics and visual inspection

By Dataset Size

  • Small (<1k rows): Any algorithm, try Hierarchical for visualization
  • Medium (1k-10k): K-Means, DBSCAN, Gaussian Mixture
  • Large (10k-100k): K-Means, Mini Batch K-Means, BIRCH
  • Very Large (>100k): Mini Batch K-Means, BIRCH

By Cluster Shape

  • Spherical: K-Means, Mini Batch K-Means, Gaussian Mixture (spherical)
  • Arbitrary shapes: DBSCAN, OPTICS, Mean Shift, Spectral
  • Elliptical: Gaussian Mixture (full covariance)
  • Non-convex: Spectral, DBSCAN
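
The shape distinction is easy to see on synthetic data. A minimal sketch (assuming scikit-learn): on two interleaved crescents, K-Means imposes spherical clusters and splits them incorrectly, while Spectral Clustering follows the graph structure and recovers them.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=1)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=1).fit_predict(X)

# Adjusted Rand Index vs. the true crescent labels (1.0 = perfect recovery).
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sp_labels)
print(ari_kmeans, ari_spectral)  # spectral scores far higher here
```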

By Requirements

  • Know k: K-Means, Hierarchical, Gaussian Mixture
  • Don't know k: DBSCAN, Mean Shift, Affinity Propagation, OPTICS
  • Need speed: K-Means, Mini Batch K-Means, BIRCH
  • Need hierarchy: Hierarchical, BIRCH, Bisecting K-Means
  • Outlier detection: DBSCAN, OPTICS
  • Probabilistic: Gaussian Mixture
  • Varying densities: OPTICS, Mean Shift

By Data Characteristics

  • High dimensional: Spectral, Gaussian Mixture
  • Graph/network data: Spectral
  • Text/sparse data: Bisecting K-Means, K-Means
  • Noisy data: DBSCAN, OPTICS, K-Means with preprocessing
  • Mixed densities: OPTICS, Mean Shift
  • Streaming/online: BIRCH, Mini Batch K-Means
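
For the streaming/online case, data can be clustered incrementally as it arrives. A minimal sketch (assuming scikit-learn) using `MiniBatchKMeans.partial_fit`; BIRCH supports the same chunk-at-a-time pattern:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)
model = MiniBatchKMeans(n_clusters=5, random_state=0, n_init=3)

for chunk in np.array_split(X, 20):  # simulate 20 incoming batches
    model.partial_fit(chunk)         # update centroids incrementally

labels = model.predict(X)            # assign the full dataset at the end
print(model.cluster_centers_.shape)
```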

Best Practices

  1. Scale your features - Critical for distance-based algorithms (K-Means, DBSCAN, etc.)
  2. Determine k - Use elbow method, silhouette analysis, or domain knowledge
  3. Start simple - K-Means is a great baseline
  4. Validate results - Use multiple metrics (Silhouette, Davies-Bouldin) AND visual inspection
  5. Handle outliers - Consider removing or using DBSCAN/OPTICS that handle them naturally
  6. Feature selection - Remove irrelevant features; they add noise in high dimensions
  7. Try multiple algorithms - Different algorithms may reveal different patterns
  8. Iterate - Clustering is exploratory; refine based on domain knowledge
  9. Reproducibility - Always set random_state for consistent results
  10. Interpret results - Statistics are guides, not truth; validate with domain experts

Common Pitfalls

  • Not scaling features: Distance metrics are sensitive to feature scales
  • Using wrong algorithm: K-Means on non-spherical clusters won't work well
  • Ignoring outliers: Can severely affect centroid-based methods
  • Optimizing only on metrics: Visual inspection is essential
  • Over-clustering: Too many clusters lose interpretability
  • Under-clustering: Missing important distinctions

Next Steps

Ready to cluster? Head to the Training page and:

  1. Select your dataset
  2. Choose a clustering model based on this guide
  3. Configure parameters (or enable hyperparameter tuning)
  4. Analyze results with visualizations and metrics
  5. Iterate and refine based on domain knowledge
