Overfitting & Underfitting
Definition
Core Statement
Overfitting occurs when a model learns the noise in the training data, performing well on training but poorly on new data. Underfitting occurs when a model is too simple to capture the underlying pattern, performing poorly on both training and test data.
Purpose
- Diagnose why a model fails to generalize.
- Guide model selection and complexity tuning.
- Prevent deployment of unreliable models.
When to Use
Check for overfitting and underfitting whenever you build predictive models. Key diagnostics:
- Training error ≪ test error → overfitting.
- Training error ≈ test error, and both are high → underfitting.
Theoretical Background
The Spectrum
| Model State | Training Error | Test Error | Bias | Variance |
|---|---|---|---|---|
| Underfitting | High | High | High | Low |
| Good Fit | Low | Low (similar to train) | Moderate | Moderate |
| Overfitting | Very Low | High | Low | High |
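The spectrum can be seen empirically by sweeping model complexity and comparing training scores with cross-validated scores. Below is a minimal sketch using scikit-learn's validation_curve; the synthetic sine data and the depth grid are illustrative assumptions, not part of the table above.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)

# Sweep tree depth: shallow trees underfit, very deep trees overfit
depths = [1, 2, 3, 5, 8, 12, 20]
train_scores, cv_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5, scoring="r2",
)

for d, tr, cv in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(f"max_depth={d:>2}: train R² = {tr:.2f}, CV R² = {cv:.2f}")
```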
Overfitting Example
A 15th-degree polynomial perfectly fits 20 data points (Training Error = 0), but when new data arrives, predictions are wildly wrong because the model learned random noise.
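A minimal sketch of this scenario; the sine-plus-noise data are an illustrative assumption, and the point is the near-zero training error paired with a much larger test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 20 noisy training points from a smooth underlying function (illustrative)
rng = np.random.default_rng(1)
X_train = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.2, 20)
X_test = rng.uniform(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(0, 0.2, 100)

# 15th-degree polynomial: near-zero training error, much larger test error
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print("Train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 4))
print("Test MSE: ", round(mean_squared_error(y_test, model.predict(X_test)), 4))
```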
Underfitting Example
Fitting a straight line to exponential data. The model is too rigid to capture the curve, so both training and test errors are high.
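A sketch of the same idea on synthetic exponential data (the noise level and the cubic comparison model are illustrative assumptions): the straight line shows similar but limited scores on train and test, while a slightly more flexible model improves both.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Exponential trend with additive noise (illustrative)
rng = np.random.default_rng(2)
X = rng.uniform(0, 4, size=(200, 1))
y = np.exp(X).ravel() + rng.normal(0, 2, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Straight line: too rigid, so train and test scores are similar but limited
line = LinearRegression().fit(X_tr, y_tr)
# Cubic polynomial: flexible enough to follow the curvature
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X_tr, y_tr)

for name, m in [("Line ", line), ("Cubic", cubic)]:
    print(f"{name}: train R² = {m.score(X_tr, y_tr):.2f}, test R² = {m.score(X_te, y_te):.2f}")
```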
Detecting Overfitting
Signs of Overfitting
- Large gap between training and test performance.
- Model has many parameters relative to data size.
- Performance degrades when tested on new data.
- High variance in Cross-Validation folds.
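The last sign can be checked directly with cross_val_score: an overfitting-prone model typically shows a lower mean score and a larger spread across folds. A minimal sketch on synthetic data (data and models are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic linear data with noise (illustrative)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(120, 1))
y = 2 * X.ravel() + rng.normal(0, 2, 120)

# Compare a constrained tree with an unconstrained, overfitting-prone one
for name, depth in [("max_depth=3  ", 3), ("unconstrained", None)]:
    scores = cross_val_score(DecisionTreeRegressor(max_depth=depth, random_state=0),
                             X, y, cv=5, scoring="r2")
    # A large fold-to-fold spread is a warning sign of high variance
    print(f"{name}: mean R² = {scores.mean():.2f}, std across folds = {scores.std():.2f}")
```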
Detecting Underfitting
Signs of Underfitting
- High error on both training and test sets.
- Model is too simple (e.g., linear model for non-linear data).
- Adding model complexity noticeably improves performance.
Preventing Overfitting
| Method | Mechanism |
|---|---|
| Cross-Validation | Evaluate on multiple test folds; detects overfitting. |
| Regularization (Ridge Regression, Lasso Regression) | Penalize large coefficients; reduce model complexity. |
| Early Stopping (Neural Networks) | Stop training before memorizing noise. |
| Pruning (Decision Trees) | Remove branches that don't improve validation performance. |
| More Data | More samples make it harder to memorize noise. |
| Feature Selection | Remove irrelevant features. |
| Dropout (Deep Learning) | Randomly drop neurons during training. |
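As one concrete instance of the regularization row, a Ridge penalty can rein in an otherwise overfitting polynomial fit. A minimal sketch, where the degree, the alpha value, and the synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Few noisy points, many polynomial terms: a recipe for overfitting (illustrative)
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(30, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

def poly_model(regressor):
    # Scaling lets the penalty treat all polynomial terms comparably
    return make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), regressor)

plain = poly_model(LinearRegression()).fit(X_tr, y_tr)   # no penalty
ridge = poly_model(Ridge(alpha=1.0)).fit(X_tr, y_tr)     # penalized coefficients

for name, m in [("No penalty", plain), ("Ridge     ", ridge)]:
    print(f"{name}: train R² = {m.score(X_tr, y_tr):.2f}, test R² = {m.score(X_te, y_te):.2f}")
```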
Preventing Underfitting
| Method | Mechanism |
|---|---|
| Add Features | Include relevant variables (polynomial terms, interactions). |
| Increase Model Complexity | Use more flexible models (e.g., ensemble methods). |
| Reduce Regularization | Allow model to fit data more closely. |
| Feature Engineering | Create informative features from raw data. |
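The "Reduce Regularization" row, illustrated in reverse: an excessive Ridge penalty forces underfitting, and relaxing it restores the fit. A minimal sketch where the alpha values and the synthetic linear data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Linear signal in five features (illustrative)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.0, 0.5]) + rng.normal(0, 0.5, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# An excessive penalty shrinks coefficients toward zero and underfits;
# relaxing it lets the model fit the signal
for alpha in [1e4, 1.0]:
    m = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:>8.1f}: train R² = {m.score(X_tr, y_tr):.2f}, "
          f"test R² = {m.score(X_te, y_te):.2f}")
```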
Limitations
Pitfalls
- Neither can be fully eliminated: reducing one tends to increase the other, because the Bias-Variance Trade-off is fundamental.
- Test data contamination: If test data leaks into training, overfitting is hidden.
- Small test sets: Test error may be unreliable; use cross-validation.
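The contamination pitfall often enters through preprocessing (for example, feature selection) fitted on the full dataset before splitting. A minimal sketch of the classic illustration on pure-noise data (the feature counts and k are illustrative assumptions): the leaky setup typically reports an optimistic score, while putting selection inside a Pipeline keeps it honest.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: an honest evaluation should report R² near zero (or below)
rng = np.random.default_rng(6)
X = rng.normal(size=(100, 500))
y = rng.normal(size=100)

# Leaky: feature selection sees the whole dataset before cross-validation
X_leaky = SelectKBest(f_regression, k=10).fit_transform(X, y)
leaky = cross_val_score(LinearRegression(), X_leaky, y, cv=5, scoring="r2")

# Clean: selection is refit inside each training fold only
pipe = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())
clean = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"Leaky CV R²: {leaky.mean():.2f}")  # typically optimistic
print(f"Clean CV R²: {clean.mean():.2f}")  # typically near zero or negative
```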
Python Implementation
```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Generate linear data with noise
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.ravel() + np.random.randn(100) * 2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Underfit model (max_depth = 1)
model_underfit = DecisionTreeRegressor(max_depth=1).fit(X_train, y_train)

# Good fit (max_depth = 3)
model_good = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# Overfit model (no depth limit)
model_overfit = DecisionTreeRegressor(max_depth=None).fit(X_train, y_train)

# Evaluate: compare train vs. test R² for each model
for name, model in [("Underfit", model_underfit),
                    ("Good Fit", model_good),
                    ("Overfit", model_overfit)]:
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print(f"{name}: Train R² = {train_score:.3f}, Test R² = {test_score:.3f}")
```
R Implementation
```r
library(rpart)
library(caret)

# Generate linear data with noise
set.seed(42)
X <- matrix(runif(100) * 10, ncol = 1)
y <- 2 * X + rnorm(100, 0, 2)
df <- data.frame(X = X, y = y)

# Train/test split (70/30)
train_idx <- createDataPartition(df$y, p = 0.7, list = FALSE)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Underfit (max depth = 1)
model_underfit <- rpart(y ~ X, data = train, control = rpart.control(maxdepth = 1))

# Good fit (max depth = 3)
model_good <- rpart(y ~ X, data = train, control = rpart.control(maxdepth = 3))

# Overfit (max depth = 30)
model_overfit <- rpart(y ~ X, data = train, control = rpart.control(maxdepth = 30))

# Evaluate train vs. test RMSE for each model
for (name in c("Underfit", "Good", "Overfit")) {
  model <- get(paste0("model_", tolower(name)))
  train_rmse <- sqrt(mean((predict(model, train) - train$y)^2))
  test_rmse  <- sqrt(mean((predict(model, test) - test$y)^2))
  cat(name, "- Train RMSE:", round(train_rmse, 2), "Test RMSE:", round(test_rmse, 2), "\n")
}
```
Interpretation Guide
| Result | Diagnosis |
|---|---|
| Train R² = 0.99, Test R² = 0.40 | Overfitting. Model memorized training noise. |
| Train R² = 0.50, Test R² = 0.48 | Underfitting. Model is too simple. |
| Train R² = 0.85, Test R² = 0.82 | Good fit. Generalizes well. |
Related Concepts
- Bias-Variance Trade-off - Theoretical foundation.
- Cross-Validation - Detection method.
- Ridge Regression / Lasso Regression - Prevention via regularization.
- Model Selection