K-Nearest Neighbors (KNN)

Definition

Core Statement

K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm used for classification and regression. It predicts the label of a new data point by looking at the 'K' closest training data points and taking a majority vote (for classification) or average (for regression).

"Tell me who your friends are, and I will tell you who you are."

Purpose

Baseline Model: Often used as a simple benchmark.
Imputation: Filling missing values by looking at similar rows ("KNN Imputer").
Recommendation Systems: "Users like you also liked..." (Item-based collaborative filtering).

When to Use

Use KNN When...

Training data is small to medium (It is "Lazy Learning" - training is free, prediction is expensive).
Decision boundaries are highly irregular/non-linear.
You need a simple, interpretable explanation ("It's similar to Case X").

Limitations

Slow Prediction: Must calculate distance to every training point. $O (N)$ .
Memory Hog: Must store all training data.
Curse of Dimensionality: In high dimensions, everything is "far away", and distance loses meaning.

Theoretical Background

The Distances

How do we define "Near"?

Euclidean Distance: $\sqrt{\sum (x_{i} - y_{i})^{2}}$ . (Standard straight line). (Requires Scaling!).
Manhattan Distance: $\sum | x_{i} - y_{i} |$ . (Grid/City block).
Cosine Similarity: Angle between vectors (Good for text data).

Choice of K

Small K (e.g., K=1): Low Bias, High Variance. Captures noise. (Overfitting).
Large K (e.g., K=100): High Bias, Low Variance. Smoothes boundaries excessively. (Underfitting).

Worked Example: Fruit Classification

Problem

Predict if an object is an Apple or Orange based on Weight and Redness.
New Object: 150g, very red.

Dataset:
- Point A (Apple): 140g, Red. (Dist = 2)
- Point B (Apple): 160g, Red. (Dist = 3)
- Point C (Orange): 150g, Orange. (Dist = 10)
Set K=3:
- Nearest neighbors are A, B, C.
- Votes: Apple (2), Orange (1).
- Prediction: Apple.

Note: If K=1, and we had a noisily labeled Orange nearby, we might have misclassified it.

Assumptions

Feature Scaling: CRITICAL. If "Weight" is in grams (150) and "Redness" is 0-1, Weight will dominate the distance calc. Always normalize (Z-score) features.
Relevant Features: Adding irrelevant noise features confuses the distance metric.

Python Implementation

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# 1. Scale Data (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# 3. Fit
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# 4. Predict
accuracy = knn.score(X_test, y_test)
print(f"Accuracy with K=5: {accuracy:.4f}")

Feature Scaling - Mandatory pre-processing.
Bias-Variance Trade-off - Controlled by K.
Euclidean Distance
Support Vector Machines (SVM) - Alternative for classification.