Clustering - K-Means
Discover natural groupings in the Iris flower dataset using K-Means clustering
This case study demonstrates K-Means clustering on the famous Iris flower dataset. K-Means is an unsupervised learning algorithm that partitions data into K distinct clusters by minimizing within-cluster variance. It's widely used for customer segmentation, pattern recognition, and data exploration.
Dataset: Iris Flowers
- Source: Kaggle (Iris Species Dataset)
- Type: Unsupervised clustering
- Size: 150 samples
- Features: Sepal length/width, Petal length/width (cm)
- True Classes: 3 species (Setosa, Versicolor, Virginica)
- Goal: Discover natural groupings without labels
Model Configuration
```json
{
  "model": "kmeans",
  "category": "clustering",
  "model_config": {
    "n_clusters": 3,
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
    "algorithm": "lloyd"
  }
}
```
Clustering Results
Cluster Visualization (2D PCA Projection)
Three distinct clusters identified:
No plot data available
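The projection behind such a plot can be computed with PCA; a sketch assuming scikit-learn, where each 2-D point would then be colored by its cluster label:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Reduce the 4-D measurements to 2-D for plotting
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2)
```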
Elbow Method (Optimal K)
Determining the best number of clusters:
No plot data available
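The elbow method described above can be sketched by refitting over a range of K values and inspecting where the inertia curve flattens (scikit-learn assumed):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Inertia (within-cluster sum of squares) for each candidate K;
# the "elbow" is where additional clusters stop paying off
inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, wcss in inertias.items():
    print(k, round(wcss, 2))
```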
Silhouette Score Analysis
Cluster quality metric (higher is better):
No plot data available
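The silhouette analysis can be sketched the same way, scoring each candidate K (the metric is only defined for K ≥ 2; exact values depend on preprocessing, so they need not match the 0.76 reported below):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

# Average silhouette per candidate K
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, s in scores.items():
    print(k, round(s, 3))
```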
Cluster Characteristics
Mean feature values for each cluster:
No plot data available
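Per-cluster mean feature values like those described above are a one-line groupby once labels are attached; a sketch using the pandas frame that `load_iris(as_frame=True)` returns:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Mean of each measurement per cluster (one row per cluster)
profile = df.groupby("cluster").mean()
print(profile.round(2))
```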
Cluster Size Distribution
Number of samples in each cluster:
No plot data available
Feature Importance for Clustering
Which features drive cluster separation?
No plot data available
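K-Means itself reports no feature importances; one common heuristic (an assumption here, not a library feature) is the spread of the cluster centroids along each standardized feature, since features whose centroids sit far apart drive the separation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # compare features on equal footing
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Heuristic: per-feature standard deviation of the centroids.
# Larger spread => that feature separates the clusters more.
spread = km.cluster_centers_.std(axis=0)
for name, s in sorted(zip(iris.feature_names, spread), key=lambda t: -t[1]):
    print(f"{name}: {s:.2f}")
```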
Common Use Cases
- Customer Segmentation: Group customers by behavior, preferences
- Image Compression: Reduce colors by clustering similar pixels
- Anomaly Detection: Identify outliers far from cluster centers
- Document Clustering: Group similar documents or articles
- Market Segmentation: Identify market niches
- Genomics: Group genes with similar expression patterns
- Recommendation Systems: User/item grouping for recommendations
Key Settings
Essential Parameters
- n_clusters: Number of clusters to form (K)
- init: Initialization method ("k-means++", "random")
- n_init: Number of times to run with different seeds
- max_iter: Maximum iterations per run
- tol: Convergence tolerance
Algorithm Variants
- algorithm: "lloyd" (standard), "elkan" (faster for dense data)
- random_state: Reproducible results
Advanced Configuration
- n_jobs: Parallel processing (-1 for all cores); removed from KMeans in scikit-learn 1.0, which parallelizes via OpenMP instead
- verbose: Progress output level
Performance Metrics
- Silhouette Score: 0.76 (good separation)
- Davies-Bouldin Index: 0.42 (lower is better)
- Calinski-Harabasz Index: 561.6 (higher is better)
- Inertia (WCSS): 78.85
- Purity: 96.0% (agreement with true labels)
- Adjusted Rand Index: 0.88
- Convergence: 7 iterations
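The external metrics above can be reproduced from the true species labels; a sketch in which purity is computed manually from the contingency matrix (scikit-learn does not expose it directly), and exact values depend on preprocessing:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

X, y = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand Index: chance-corrected agreement with true labels
ari = adjusted_rand_score(y, labels)

# Purity: give each cluster its majority class, then score accuracy
cm = contingency_matrix(y, labels)
purity = cm.max(axis=0).sum() / cm.sum()

print(round(ari, 2), round(purity, 2))
```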
Tips for Success
- Feature Scaling: Always standardize features before K-Means
- Optimal K: Use elbow method, silhouette analysis
- Initialization: k-means++ generally better than random
- Multiple Runs: Set n_init ≥ 10 for stability
- Distance Metric: K-Means uses Euclidean distance
- Outliers: Consider removing before clustering
- High Dimensions: Use PCA for visualization and performance
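The first tip, standardizing before K-Means, is easiest to enforce with a pipeline so that new data is scaled with the training statistics; a minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardization + clustering in one estimator; predict() on new data
# reuses the scaler's fitted mean and variance automatically
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=42))
labels = pipe.fit_predict(X)
print(labels[:10])
```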
Example Scenarios
Scenario 1: Cluster 0 (Setosa)
- Characteristics:
- Small petal length (1.5 cm avg)
- Small petal width (0.2 cm avg)
- Wider sepals relative to length
- Size: 50 flowers (33%)
- Distinctness: Completely separated from other clusters
Scenario 2: Cluster 1 (Versicolor)
- Characteristics:
- Medium petal length (4.3 cm avg)
- Medium petal width (1.3 cm avg)
- Moderate sepal dimensions
- Size: 48 flowers (32%)
- Distinctness: Some overlap with Virginica
Scenario 3: Cluster 2 (Virginica)
- Characteristics:
- Large petal length (5.7 cm avg)
- Large petal width (2.1 cm avg)
- Longest sepals overall
- Size: 52 flowers (35%)
- Distinctness: Slight overlap with Versicolor
Troubleshooting
Problem: Poor cluster quality (low silhouette score)
- Solution: Try different K values, remove outliers, normalize features
Problem: Clusters dominated by one feature
- Solution: Ensure proper feature scaling, consider feature selection
Problem: Results vary between runs
- Solution: Increase n_init, set random_state, use k-means++
Problem: Slow convergence
- Solution: Loosen tol, use the elkan algorithm, or subsample the data (reducing max_iter only caps iterations, it does not speed convergence)
Problem: Empty clusters created
- Solution: Reduce n_clusters, improve initialization, remove duplicates
K-Means vs Other Clustering Methods
| Method | Speed | Shape Flexibility | Scalability | Requires K |
|---|---|---|---|---|
| K-Means | Fast | Spherical only | Excellent | Yes |
| DBSCAN | Medium | Arbitrary | Good | No |
| Hierarchical | Slow | Arbitrary | Poor | No |
| GMM | Medium | Elliptical | Good | Yes |
| Mean-Shift | Slow | Arbitrary | Poor | No |
Cluster Validation Metrics
Internal Metrics (no ground truth needed)
- Silhouette Score: [-1, 1], higher better (0.76)
- Davies-Bouldin: [0, ∞], lower better (0.42)
- Calinski-Harabasz: [0, ∞], higher better (561.6)
External Metrics (with ground truth)
- Adjusted Rand Index: [-1, 1], higher better (0.88)
- Purity: [0, 1], higher better (0.96)
- Normalized Mutual Information: [0, 1], higher better (0.85)
Next Steps
After performing K-Means clustering, you can:
- Apply cluster labels to new data
- Use clusters as features for supervised learning
- Analyze cluster profiles for business insights
- Create customer personas from segments
- Build targeted marketing campaigns per cluster
- Perform hierarchical clustering within large clusters
- Compare with other clustering algorithms (DBSCAN, GMM)
- Visualize in lower dimensions with t-SNE or UMAP
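For the first step above, applying cluster labels to new data, a fitted model assigns unseen points to the nearest learned centroid; a sketch with a hypothetical new measurement:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Hypothetical sepal length/width, petal length/width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
cluster = km.predict(new_flower)
print(cluster)
```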