K-Nearest Neighbors (KNN)
Definition
Core Statement
K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm used for classification and regression. It predicts the label of a new data point by looking at the 'K' closest training data points and taking a majority vote (for classification) or average (for regression).
"Tell me who your friends are, and I will tell you who you are."
Purpose
- Baseline Model: Often used as a simple benchmark.
- Imputation: Filling missing values by looking at similar rows ("KNN Imputer"); see the sketch after this list.
- Recommendation Systems: "Users like you also liked..." (Item-based collaborative filtering).
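For the imputation use case, scikit-learn provides KNNImputer; a small sketch on a toy matrix (the values below are made up purely for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value (np.nan)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing entry is replaced by the mean of that feature
# among the 2 nearest rows (distance computed on the observed features)
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```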
When to Use
Use KNN When...
- Training data is small to medium (KNN is "Lazy Learning": training is essentially free, prediction is expensive; see the timing sketch after this list).
- Decision boundaries are highly irregular/non-linear.
- You need a simple, interpretable explanation ("It's similar to Case X").
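The "lazy learning" point can be checked directly: fitting mostly just stores (or lightly indexes) the data, while prediction is where the distance computations happen. A sketch on arbitrary synthetic data (sizes chosen only for illustration):

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((20_000, 20))            # arbitrary synthetic data
y = rng.integers(0, 2, size=20_000)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X, y)                           # "training": mostly stores/indexes the data
t1 = time.perf_counter()
knn.predict(X[:2_000])                  # prediction: distance work happens here
t2 = time.perf_counter()

print(f"fit:     {t1 - t0:.3f}s")
print(f"predict: {t2 - t1:.3f}s")
```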
Limitations
- Slow Prediction: Must calculate distance to every training point.
- Memory Hog: Must store all training data.
- Curse of Dimensionality: In high dimensions, everything is "far away", and distance loses meaning.
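A quick way to see the curse of dimensionality (the point counts and dimensions below are arbitrary): as the number of features grows, the nearest and farthest neighbors end up at almost the same distance.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                       # 500 random points in the unit cube
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    print(f"d={d:4d}  nearest/farthest ratio = {dists.min() / dists.max():.2f}")
```

The ratio creeps toward 1 as d grows, which is the sense in which everything becomes "far away" and "nearest" stops being informative.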
Theoretical Background
The Distances
How do we define "Near"?
- Euclidean Distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$ (Standard straight line. Requires Scaling!)
- Manhattan Distance: $d(p, q) = \sum_i |p_i - q_i|$ (Grid/City block)
- Cosine Similarity: Angle between vectors (Good for text data).
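All three measures are available in SciPy; a small check on two illustrative vectors:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

print(euclidean(a, b))    # straight-line (Euclidean) distance
print(cityblock(a, b))    # Manhattan / city-block distance
print(1 - cosine(a, b))   # cosine similarity (SciPy's `cosine` is a distance)
```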
Choice of K
- Small K (e.g., K=1): Low Bias, High Variance. Captures noise. (Overfitting).
- Large K (e.g., K=100): High Bias, Low Variance. Smooths boundaries excessively. (Underfitting).
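One way to see this trade-off is to compare train and test accuracy for a few values of K. A sketch using load_iris purely as a placeholder dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # placeholder data; swap in your own X, y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

for k in (1, 15, 75):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:2d}  train={knn.score(X_train, y_train):.2f}  test={knn.score(X_test, y_test):.2f}")
```

Typically K=1 shows near-perfect train accuracy but a weaker test score (variance), while very large K drags both down (bias).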
Worked Example: Fruit Classification
Problem
Predict if an object is an Apple or Orange based on Weight and Redness.
New Object: 150g, very red.
Dataset (the distances below are illustrative, computed on the scaled features):
- Point A (Apple): 140g, Red. (Dist = 2)
- Point B (Apple): 160g, Red. (Dist = 3)
- Point C (Orange): 150g, Orange. (Dist = 10)
Set K=3:
- Nearest neighbors are A, B, C.
- Votes: Apple (2), Orange (1).
- Prediction: Apple.
Note: With K=1, a single noisily labeled Orange nearby could have caused a misclassification.
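The vote can be replayed in a few lines, using the distances stated above:

```python
from collections import Counter

# (label, distance to the new object) as given in the example
neighbors = [("Apple", 2), ("Apple", 3), ("Orange", 10)]

k = 3
nearest = sorted(neighbors, key=lambda pair: pair[1])[:k]
votes = Counter(label for label, _ in nearest)
print(votes.most_common(1))   # [('Apple', 2)] -> predict Apple
```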
Assumptions
- Locality: points that are close in feature space tend to share the same label (the target varies smoothly).
- Meaningful distances: features are on comparable scales and are all relevant; unscaled or irrelevant features distort the neighborhoods.
Python Implementation
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# 0. Load data (load_iris is a placeholder; substitute your own X, y)
X, y = load_iris(return_X_y=True)
# 1. Split first, so the scaler never sees the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# 2. Scale Data (CRITICAL) - fit on the training set only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# 3. Fit
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# 4. Evaluate
accuracy = knn.score(X_test, y_test)
print(f"Accuracy with K=5: {accuracy:.4f}")
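Rather than hard-coding K=5, one option is a cross-validated grid search, with the scaler wrapped in a Pipeline so each fold is scaled using only its own training portion. A sketch, again with load_iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)     # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling lives inside the pipeline, so cross-validation does not leak
# test-fold statistics into the scaler
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, f"held-out accuracy: {grid.score(X_test, y_test):.3f}")
```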
Related Concepts
- Feature Scaling - Mandatory pre-processing.
- Bias-Variance Trade-off - Controlled by K.
- Euclidean Distance
- Support Vector Machines (SVM) - Alternative for classification.