K-Means Clustering

Definition

Core Statement

K-Means is an unsupervised learning algorithm that partitions data into k distinct clusters. It aims to minimize the within-cluster sum of squares (WCSS), placing each data point in the cluster with the nearest mean (centroid).


Purpose

  1. Customer Segmentation: Grouping customers by behavior (e.g., big spenders, window shoppers).
  2. Image Compression: Reducing colors by clustering pixel values (a sketch follows this list).
  3. Anomaly Detection: Points far from any centroid are potential outliers.
  4. Pattern Recognition: Identifying underlying structures in unlabeled data.
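
For the image-compression use case, here is a minimal sketch that quantizes pixel colors with K-Means. The random "image" and the 16-color palette are placeholder choices; with a real photo you would load its pixel array instead:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for an RGB image: 100x100 pixels, channels in [0, 1]
rng = np.random.default_rng(0)
image = rng.random((100, 100, 3))

# Flatten to an (n_pixels, 3) matrix of colors and cluster into 16 colors
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=16, random_state=0, n_init=10).fit(pixels)

# Replace every pixel with its centroid color: 16 colors now encode the image
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print("Unique colors after compression:", len(np.unique(compressed.reshape(-1, 3), axis=0)))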

When to Use

Use K-Means When...

  • You have unlabeled data (no target variable).
  • You want to find groups of similar items.
  • You know (or can estimate) the number of clusters (k).
  • Clusters are expected to be roughly spherical and of similar size.

Limitations

  • Requires specifying k in advance.
  • Sensitive to initialization (solved by K-Means++).
  • Sensitive to outliers.
  • Fails on non-globular clusters (e.g., rings, crescents); use DBSCAN instead (see the comparison sketch below).
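
The DBSCAN recommendation above can be checked on synthetic crescents. A minimal comparison follows; eps=0.2 is hand-tuned for this particular data:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving crescents: a classic non-globular case
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
db_labels = DBSCAN(eps=0.2).fit_predict(X)

# Agreement with the true crescents (1.0 = perfect recovery)
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN ARI:", adjusted_rand_score(y_true, db_labels))

On this data, K-Means cuts straight across the crescents while DBSCAN follows their shape.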


Theoretical Background

The Algorithm (Lloyd's Algorithm)

  1. Initialize: Choose k random centroids.
  2. Assign: Assign each data point to the nearest centroid (Euclidean distance).
  3. Update: Recalculate centroids as the mean of all points assigned to that cluster.
  4. Repeat: Steps 2-3 until centroids stop moving (convergence); a from-scratch sketch follows this list.
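
A from-scratch NumPy sketch of these four steps, using plain random initialization rather than K-Means++ and no handling of empty clusters, for brevity:

import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the points assigned to each cluster
        #    (assumes no cluster empties out; robust code would re-seed it)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # 4. Repeat: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels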

Objective Function (Inertia)

Minimize:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

where \mu_i is the centroid of cluster C_i.
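
scikit-learn exposes this quantity as inertia_; a quick sanity check that it matches the formula, on synthetic blobs:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)

# J: sum over clusters of squared distances from points to their centroid
J = sum(((X[km.labels_ == i] - mu) ** 2).sum()
        for i, mu in enumerate(km.cluster_centers_))
print(J, km.inertia_)  # the two values agree up to floating-point rounding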


Worked Example: T-Shirt Sizing

Problem

A clothing brand wants to define 3 sizes (Small, Medium, Large) based on customer height and weight.
Data: 1000 customers.
Goal: Find 3 centroids to represent standard sizes.

  1. Initialization: Pick 3 random customers as starting points.
  2. Assignment: Every customer is labeled "S", "M", or "L" based on which starting point they are closest to.
  3. Update: Calculate the average height/weight of all "S" customers; that becomes the new "Small" centroid. Do the same for M and L.
  4. Iterate: Re-assign customers to the new, better-placed centroids.
  5. Result:
    • Cluster 1 Centroid: (160 cm, 55 kg) → Size S
    • Cluster 2 Centroid: (175 cm, 70 kg) → Size M
    • Cluster 3 Centroid: (185 cm, 90 kg) → Size L
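
A runnable sketch of this example on synthetic height/weight data (the distributions are invented for illustration). The inverse_transform step maps the scaled centroids back to cm and kg so they can be read as sizes:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 1000 synthetic customers: height in cm, weight in kg
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(172, 10, 1000),   # heights
                     rng.normal(72, 12, 1000)])   # weights

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(scaler.fit_transform(X))

# Centroids back in original units: one (height, weight) prototype per size
print(scaler.inverse_transform(kmeans.cluster_centers_))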

Choosing K: The Elbow Method

Plot inertia (WCSS) against the number of clusters k and pick the "elbow": the point where adding another cluster stops producing a large drop in inertia. The Python Implementation section below includes the code for this plot.


Assumptions

  • Clusters are roughly spherical (isotropic), so each is well represented by a single mean.
  • Clusters have similar size and density.
  • Euclidean distance is a meaningful similarity measure (features are on comparable scales).

Limitations & Pitfalls

Pitfalls

  1. Scaling is Critical: If one variable is "Income" (0-100,000) and another is "Age" (0-100), distance is dominated by Income. Always standardize (z-score) features before clustering.
  2. Random Initialization Trap: Bad starting points can lead to bad clusters. Always use K-Means++ initialization (scikit-learn's default) and run multiple restarts (n_init=10).
  3. High Dimensions: In very high dimensions, Euclidean distance loses meaning (the Curse of Dimensionality). Apply Principal Component Analysis (PCA) first (see the pipeline sketch after this list).
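
All three pitfalls can be handled in a single scikit-learn Pipeline. This is a sketch on synthetic 20-dimensional blobs; the PCA component count (10 here) is a placeholder to tune for your own data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data standing in for a real feature matrix
X, _ = make_blobs(n_samples=500, n_features=20, centers=3, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),        # Pitfall 1: standardize features
    ("pca", PCA(n_components=10)),      # Pitfall 3: reduce dimensionality first
    ("kmeans", KMeans(n_clusters=3,     # Pitfall 2: k-means++ is the default
                      n_init=10, random_state=42)),
])
labels = pipeline.fit_predict(X)
print(labels[:20])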


Python Implementation

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# 0. Example data so the script runs end-to-end (swap in your own array)
raw_data, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(raw_data)

# 2. Elbow Method
inertias = []
k_range = range(1, 10)
for k in k_range:
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

plt.plot(k_range, inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('Elbow Plot')
plt.show()

# 3. Fit Optimal K=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
centroids = kmeans.cluster_centers_

print("Cluster Labels:", labels)
print("Centroids (Scaled):", centroids)

Interpretation Guide

  • Inertia: Sum of squared distances from each point to its nearest centroid. Lower is better, but weigh it against the cost of added clusters.
  • Silhouette Score: Measures how distinct clusters are (-1 to 1). Higher is better.
  • Cluster Centroid: The "prototype" or average member of that group.
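
Because the elbow plot can be ambiguous, the silhouette score gives a complementary way to choose k. A minimal sketch on synthetic blobs, scanning a few candidate values:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette is undefined for k=1, so start the scan at 2; higher is better
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))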