K-Means Clustering
Definition
Core Statement
K-Means is an unsupervised learning algorithm that partitions data into k distinct, non-overlapping clusters, where each point belongs to the cluster with the nearest centroid (the cluster mean).
Purpose
- Customer Segmentation: Grouping customers by behavior (e.g., big spenders, window shoppers).
- Image Compression: Reducing colors by clustering pixel values (see the sketch after this list).
- Anomaly Detection: Points far from any centroid are potential outliers.
- Pattern Recognition: Identifying underlying structures in unlabeled data.
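To make the image-compression idea concrete, here is a minimal sketch on a synthetic RGB array; the 16-color palette size and the random image are illustrative assumptions, not values from the text:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical RGB image as an (H, W, 3) uint8 array; replace with real pixel data
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

# Cluster the pixel colors into a 16-color palette
pixels = image.reshape(-1, 3).astype(float)
kmeans = KMeans(n_clusters=16, random_state=0, n_init=10).fit(pixels)

# Replace every pixel with its nearest palette color (the cluster centroid)
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape).astype(np.uint8)
```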
When to Use
Use K-Means When...
- You have unlabeled data (no target variable).
- You want to find groups of similar items.
- You know (or can estimate) the number of clusters (k).
- Clusters are expected to be roughly spherical and of similar size.
Limitations
- Requires specifying k in advance.
- Sensitive to initialization (mitigated by K-Means++).
- Sensitive to outliers.
- Fails on non-globular clusters (e.g., rings, crescents); use DBSCAN instead (see the comparison sketch after this list).
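The crescent failure mode is easy to reproduce. A minimal comparison sketch using scikit-learn's toy `make_moons` dataset (the `eps=0.2` setting is an illustrative assumption, not a value from the text):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving crescents: K-Means cuts straight across them, DBSCAN follows the shapes
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means label counts:", np.bincount(km_labels))
print("DBSCAN labels (-1 = noise):", np.unique(db_labels, return_counts=True))
```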
Theoretical Background
The Algorithm (Lloyd's Algorithm)
- Initialize: Choose k random centroids.
- Assign: Assign each data point to the nearest centroid (Euclidean distance).
- Update: Recalculate each centroid as the mean of all points assigned to its cluster.
- Repeat: Steps 2-3 until the centroids stop moving (convergence); a minimal sketch of these steps follows this list.
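A minimal NumPy sketch of the four steps above (an illustrative toy implementation, not what scikit-learn does internally; use `sklearn.cluster.KMeans` in practice):

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Toy Lloyd's algorithm for illustration only."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign: label each point with the index of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: three obvious blobs in 2D
X = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])
labels, centroids = kmeans_lloyd(X, k=3)
```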
Objective Function (Inertia)
Minimize the within-cluster sum of squares:

$$ J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 $$

where $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is its centroid (mean).
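Equivalently, as code (a sketch of the quantity scikit-learn reports as `inertia_`; the array names are illustrative):

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())
```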
Worked Example: T-Shirt Sizing
Problem
A clothing brand wants to define 3 sizes (Small, Medium, Large) based on customer height and weight.
Data: 1000 customers.
Goal: Find 3 centroids to represent standard sizes.
- Initialization: Pick 3 random customers as starting points.
- Assignment: Every customer is labeled "S", "M", or "L" based on which starting point they are closest to.
- Update: Calculate the average height/weight of all "S" customers. That becomes the new "Small". Do same for M and L.
- Iterate: Re-assign customers to the new, better-placed centroids.
- Result:
  - Cluster 1 Centroid: (160cm, 55kg) → Size S
  - Cluster 2 Centroid: (175cm, 70kg) → Size M
  - Cluster 3 Centroid: (185cm, 90kg) → Size L
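A sketch of this workflow on synthetic height/weight data (the customer numbers below are made up for illustration, not the brand's real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 1000 customers: columns = (height cm, weight kg)
rng = np.random.default_rng(42)
customers = np.column_stack([rng.normal(172, 9, 1000), rng.normal(72, 12, 1000)])

scaler = StandardScaler().fit(customers)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(scaler.transform(customers))

# Map the scaled centroids back to cm/kg and order them so cluster ids read S < M < L
centroids_cm_kg = scaler.inverse_transform(kmeans.cluster_centers_)
order = centroids_cm_kg[:, 0].argsort()
size_names = {cluster_id: name for cluster_id, name in zip(order, ["S", "M", "L"])}
sizes = [size_names[label] for label in kmeans.labels_]
print(centroids_cm_kg[order])  # approximate S/M/L prototypes
```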
Choosing K: The Elbow Method
Plot Inertia (WCSS) vs. the number of clusters (k).
- As k increases, inertia decreases.
- Look for the "Elbow": the point where adding another cluster gives diminishing returns.
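When the elbow is ambiguous, the silhouette score (see the Interpretation Guide below) is a common complementary check. A minimal sketch on toy data (the blob layout is an illustrative assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Toy data with three well-separated blobs (replace with your own feature matrix)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0.0, 4.0, 8.0)])
X_scaled = StandardScaler().fit_transform(X)

for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    print(f"k={k}: silhouette={silhouette_score(X_scaled, labels):.3f}")  # higher is better
```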
Assumptions, Limitations & Pitfalls
Pitfalls
- Scaling is Critical: If one variable is "Income" (0-100,000) and another is "Age" (0-100), distance is dominated by Income. Always standardize (Z-score) before clustering (see the sketch after this list).
- Random Initialization Trap: Bad starting points can lead to bad clusters. Always use K-Means++ initialization and run multiple times (`n_init=10`).
- High Dimensions: In very high dimensions, Euclidean distance loses meaning (Curse of Dimensionality). Use Principal Component Analysis (PCA) first.
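A tiny sketch of why unscaled features break distances (the income/age numbers are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Four customers: (income in $, age in years) -- made-up values
customers = np.array([[30_000, 20], [50_000, 25], [50_500, 65], [90_000, 40]], dtype=float)

# Raw Euclidean distance between customers 1 and 2 is dominated by the $500 income gap,
# even though the 40-year age difference is far more meaningful
print(np.linalg.norm(customers[1] - customers[2]))

# After z-scoring each column, both features contribute on a comparable scale
scaled = StandardScaler().fit_transform(customers)
print(np.linalg.norm(scaled[1] - scaled[2]))
```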
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder data: replace with your own feature matrix (e.g. height/weight per customer)
rng = np.random.default_rng(42)
raw_data = np.column_stack([rng.normal(172, 9, 1000), rng.normal(72, 12, 1000)])

# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(raw_data)

# 2. Elbow Method
inertias = []
k_range = range(1, 10)
for k in k_range:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Plot')
plt.show()

# 3. Fit Optimal K=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
centroids = kmeans.cluster_centers_

print("Cluster Labels:", labels)
print("Centroids (Scaled):", centroids)
```
Interpretation Guide
| Output | Interpretation |
|---|---|
| Inertia | Sum of squared distances from each point to its nearest centroid. Lower is better, but it always decreases as k grows, so weigh it against the number of clusters. |
| Silhouette Score | Measures how distinct clusters are (-1 to 1). Higher is better. |
| Cluster Centroid | The "prototype" or average member of that group. |
Related Concepts
- Principal Component Analysis (PCA) - Often used before K-Means.
- DBSCAN - Alternative clustering for irregular shapes.
- Hierarchical Clustering - Alternative that builds a tree.
- Euclidean Distance