Clustering - K-Means
Discover natural groupings in the Iris flower dataset using K-Means clustering
This case study demonstrates K-Means clustering on the famous Iris flower dataset. K-Means is an unsupervised learning algorithm that partitions data into K distinct clusters by minimizing within-cluster variance. It's widely used for customer segmentation, pattern recognition, and data exploration.
Dataset: Iris Flowers
- Source: Kaggle (Iris Species Dataset)
- Type: Unsupervised clustering
- Size: 150 samples
- Features: Sepal length/width, Petal length/width (cm)
- True Classes: 3 species (Setosa, Versicolor, Virginica)
- Goal: Discover natural groupings without labels
Model Configuration
```json
{
  "model": "kmeans",
  "category": "clustering",
  "model_config": {
    "n_clusters": 3,
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
    "algorithm": "lloyd"
  }
}
```
Clustering Results
Cluster Visualization (2D PCA Projection)
Three distinct clusters identified:
No plot data available
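The projection behind such a plot can be computed with PCA; a sketch assuming scikit-learn, where each 2-D point would then be colored by its cluster label:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Reduce the 4-D measurements to 2-D for plotting
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2)
```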
Elbow Method (Optimal K)
Determining the best number of clusters:
No plot data available
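The elbow method described above can be sketched by refitting over a range of K values and inspecting where the inertia curve flattens (scikit-learn assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Inertia (within-cluster sum of squares) for each candidate K;
# the "elbow" is where additional clusters stop paying off
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, wcss in inertias.items():
    print(k, round(wcss, 2))
```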
Silhouette Score Analysis
Cluster quality metric (higher is better):
No plot data available
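The silhouette analysis can be sketched the same way, scoring each candidate K (the metric is only defined for K ≥ 2; exact values depend on preprocessing, so they need not match the 0.76 reported below):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Average silhouette per candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(k, round(s, 3))
```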
Cluster Characteristics
Mean feature values for each cluster:
No plot data available
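Per-cluster mean feature values like those described above are a one-line groupby once labels are attached; a sketch using the pandas frame that `load_iris(as_frame=True)` returns:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Mean of each measurement per cluster (one row per cluster)
profile = df.groupby("cluster").mean()
print(profile.round(2))
```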
Cluster Size Distribution
Number of samples in each cluster:
No plot data available
Feature Importance for Clustering
Which features drive cluster separation?
No plot data available
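K-Means itself reports no feature importances; one common heuristic (an assumption here, not a library feature) is the spread of the cluster centroids along each standardized feature, since features whose centroids sit far apart drive the separation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # compare features on equal footing
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Heuristic: per-feature standard deviation of the centroids.
# Larger spread => that feature separates the clusters more.
spread = km.cluster_centers_.std(axis=0)
for name, s in sorted(zip(iris.feature_names, spread), key=lambda t: -t[1]):
    print(f"{name}: {s:.2f}")
```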
Common Use Cases
- Customer Segmentation: Group customers by behavior, preferences
- Image Compression: Reduce colors by clustering similar pixels
- Anomaly Detection: Identify outliers far from cluster centers
- Document Clustering: Group similar documents or articles
- Market Segmentation: Identify market niches
- Genomics: Group genes with similar expression patterns
- Recommendation Systems: User/item grouping for recommendations
Key Settings
Essential Parameters
- n_clusters: Number of clusters to form (K)
- init: Initialization method ("k-means++", "random")
- n_init: Number of times to run with different seeds
- max_iter: Maximum iterations per run
- tol: Convergence tolerance
Algorithm Variants
- algorithm: "lloyd" (standard), "elkan" (faster for dense data)
- random_state: Reproducible results
Advanced Configuration
- n_jobs: Parallel processing (-1 for all cores); removed from KMeans in scikit-learn 1.0, which parallelizes via OpenMP instead
- verbose: Progress output level
Performance Metrics
- Silhouette Score: 0.76 (good separation)
- Davies-Bouldin Index: 0.42 (lower is better)
- Calinski-Harabasz Index: 561.6 (higher is better)
- Inertia (WCSS): 78.85
- Purity: 96.0% (agreement with true labels)
- Adjusted Rand Index: 0.88
- Convergence: 7 iterations
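The external metrics above can be reproduced from the true species labels; a sketch in which purity is computed manually from the contingency matrix (scikit-learn does not expose it directly), and exact values depend on preprocessing:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand Index: chance-corrected agreement with true labels
ari = adjusted_rand_score(y, labels)

# Purity: give each cluster its majority class, then score accuracy
cm = contingency_matrix(y, labels)
purity = cm.max(axis=0).sum() / cm.sum()

print(round(ari, 2), round(purity, 2))
```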
Tips for Success
- Feature Scaling: Always standardize features before K-Means
- Optimal K: Use elbow method, silhouette analysis
- Initialization: k-means++ generally better than random
- Multiple Runs: Set n_init ≥ 10 for stability
- Distance Metric: K-Means uses Euclidean distance
- Outliers: Consider removing before clustering
- High Dimensions: Use PCA for visualization and performance
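The first tip, standardizing before K-Means, is easiest to enforce with a pipeline so that new data is scaled with the training statistics; a minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardization + clustering in one estimator; predict() on new data
# reuses the scaler's fitted mean and variance automatically
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=42))
labels = pipe.fit_predict(X)
print(labels[:10])
```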
Example Scenarios
Scenario 1: Cluster 0 (Setosa)
- Characteristics:
- Small petal length (1.5 cm avg)
- Small petal width (0.2 cm avg)
- Wider sepals relative to length
- Size: 50 flowers (33%)
- Distinctness: Completely separated from other clusters
Scenario 2: Cluster 1 (Versicolor)
- Characteristics:
- Medium petal length (4.3 cm avg)
- Medium petal width (1.3 cm avg)
- Moderate sepal dimensions
- Size: 48 flowers (32%)
- Distinctness: Some overlap with Virginica
Scenario 3: Cluster 2 (Virginica)
- Characteristics:
- Large petal length (5.7 cm avg)
- Large petal width (2.1 cm avg)
- Longest sepals overall
- Size: 52 flowers (35%)
- Distinctness: Slight overlap with Versicolor
Troubleshooting
Problem: Poor cluster quality (low silhouette score)
- Solution: Try different K values, remove outliers, normalize features
Problem: Clusters dominated by one feature
- Solution: Ensure proper feature scaling, consider feature selection
Problem: Results vary between runs
- Solution: Increase n_init, set random_state, use k-means++
Problem: Slow convergence
- Solution: Loosen tol, use the elkan algorithm, or subsample the data (reducing max_iter only caps iterations, it does not speed convergence)
Problem: Empty clusters created
- Solution: Reduce n_clusters, improve initialization, remove duplicates
K-Means vs Other Clustering Methods
| Method | Speed | Shape Flexibility | Scalability | Requires K |
|---|---|---|---|---|
| K-Means | Fast | Spherical only | Excellent | Yes |
| DBSCAN | Medium | Arbitrary | Good | No |
| Hierarchical | Slow | Arbitrary | Poor | No |
| GMM | Medium | Elliptical | Good | Yes |
| Mean-Shift | Slow | Arbitrary | Poor | No |
Cluster Validation Metrics
Internal Metrics (no ground truth needed)
- Silhouette Score: [-1, 1], higher better (0.76)
- Davies-Bouldin: [0, ∞], lower better (0.42)
- Calinski-Harabasz: [0, ∞], higher better (561.6)
External Metrics (with ground truth)
- Adjusted Rand Index: [-1, 1], higher better (0.88)
- Purity: [0, 1], higher better (0.96)
- Normalized Mutual Information: [0, 1], higher better (0.85)
Next Steps
After performing K-Means clustering, you can:
- Apply cluster labels to new data
- Use clusters as features for supervised learning
- Analyze cluster profiles for business insights
- Create customer personas from segments
- Build targeted marketing campaigns per cluster
- Perform hierarchical clustering within large clusters
- Compare with other clustering algorithms (DBSCAN, GMM)
- Visualize in lower dimensions with t-SNE or UMAP
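For the first step above, applying cluster labels to new data, a fitted model assigns unseen points to the nearest learned centroid; a sketch with a hypothetical new measurement:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hypothetical sepal length/width, petal length/width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
cluster = km.predict(new_flower)
print(cluster)
```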