Bias-Variance Trade-off

Definition

Core Statement

The Bias-Variance Trade-off describes the fundamental tension in predictive modeling: simple models have high bias (systematic error), while complex models have high variance (sensitivity to noise). The goal is to find the sweet spot that minimizes total error.


Purpose

  1. Understand why models fail (underfitting vs overfitting).
  2. Guide model selection and regularization strategies.
  3. Explain why more complexity is not always better.

When to Use

This is a conceptual framework for interpreting model performance rather than a formal test. Apply it whenever you need to diagnose underfitting or overfitting, choose a model's complexity, or decide how strongly to regularize.


Theoretical Background

Decomposition of Prediction Error

For a model $\hat{f}(x)$ estimating the true function $f(x)$, the expected test error at a point $x$ decomposes as:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] =
\underbrace{\big(\operatorname{Bias}[\hat{f}(x)]\big)^2}_{\text{systematic error}}
+ \underbrace{\operatorname{Var}[\hat{f}(x)]}_{\text{sensitivity to data}}
+ \underbrace{\sigma^2}_{\text{irreducible error}}
$$
Component         | Meaning                                      | Cause
Bias              | Error from wrong assumptions.                | Model is too simple (underfitting); misses important patterns.
Variance          | Error from sensitivity to the training data. | Model is too complex (overfitting); fits noise.
Irreducible Error | Noise in the data itself.                    | Cannot be reduced by any model.
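
The decomposition follows from expanding the squared error. A sketch of the derivation, assuming $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$ independent of the fitted model (the expectation is taken over training sets and noise):

$$
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \mathbb{E}\big[(f(x) + \varepsilon - \hat{f}(x))^2\big] \\
  &= \mathbb{E}\big[(f(x) - \hat{f}(x))^2\big] + \sigma^2
     \qquad (\mathbb{E}[\varepsilon] = 0 \text{ removes the cross term}) \\
  &= \big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2
     + \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big] + \sigma^2 \\
  &= \operatorname{Bias}[\hat{f}(x)]^2 + \operatorname{Var}[\hat{f}(x)] + \sigma^2
\end{aligned}
$$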

The Trade-off

Model Complexity | Bias                | Variance            | Total Error
Too Simple       | High (underfitting) | Low                 | High
Optimal          | Moderate            | Moderate            | Minimum
Too Complex      | Low                 | High (overfitting)  | High

Goldilocks Zone

The best model is neither too simple nor too complex. It balances bias and variance to minimize total error.


Visual Intuition

Test Error
    │
    │ ╲                               ╱
    │  ╲   Bias² dominates           ╱   Variance dominates
    │   ╲  (underfitting)           ╱    (overfitting)
    │    ╲                         ╱
    │     ╲_____             _____╱
    │           ╲___________╱   ← minimum total error
    │
    └──────────────────────────── Model Complexity
     Simple        Optimal       Complex

Assumptions

This is a mathematical decomposition, not a test with assumptions. Note, however, that the clean additive form above is specific to squared-error loss; other losses (such as 0-1 classification error) do not decompose as simply.


Limitations

Pitfalls

  1. Bias and variance cannot be measured separately on real data; only their sum, the total error, is observable. On simulated data where the true function is known, they can be estimated by refitting on many training sets (see the sketch after this list).
  2. Trade-off is not always smooth. Discontinuities can occur (e.g., adding a critical variable).
  3. Depends on data size: With infinite data, variance goes to zero and only bias matters.
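
A minimal sketch of such a simulation-based estimate in Python, assuming a sine-wave truth like the example in the Python Implementation section below; the grid, noise level, and number of repetitions are arbitrary choices for illustration:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
x_grid = np.linspace(0, 10, 50).reshape(-1, 1)  # fixed evaluation points
f_true = np.sin(x_grid).ravel()                 # known true function
n_sims, n_train, sigma = 200, 50, 0.2

for degree in (1, 3, 15):
    preds = np.empty((n_sims, len(x_grid)))
    for s in range(n_sims):
        # draw a fresh training set from the same data-generating process
        x_tr = rng.uniform(0, 10, n_train).reshape(-1, 1)
        y_tr = np.sin(x_tr).ravel() + rng.normal(0, sigma, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[s] = model.fit(x_tr, y_tr).predict(x_grid)

    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                   # average variance over training sets
    # degree 15 is ill-conditioned without feature scaling, so its variance can be very large
    print(f"degree {degree:2d}: bias^2 ~ {bias_sq:.3f}, variance ~ {variance:.3f}")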


Addressing the Trade-off

Problem                     | Solution
High Bias (Underfitting)    | Add features, increase model complexity, reduce regularization.
High Variance (Overfitting) | Use Cross-Validation, add Ridge Regression/Lasso Regression, reduce features, collect more data.
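
For instance, the high-variance row can be illustrated with Ridge Regression: keep the overly complex features and let the penalty strength control the variance. A minimal sketch, assuming sine-wave data like the Python example below; the alpha values are arbitrary:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in (0.001, 0.1, 10.0):
    # same degree-15 features each time; only the penalty changes
    model = make_pipeline(PolynomialFeatures(15), StandardScaler(), Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"alpha = {alpha:6.3f}: CV MSE = {-scores.mean():.3f}")

# A moderate penalty usually lowers the CV error of the degree-15 fit:
# it adds a little bias but removes much of the variance.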

Python Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate Data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = np.sin(X).ravel()
y = y_true + np.random.normal(0, 0.2, 100)

# Fit Polynomials of Different Degrees
degrees = [1, 3, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, degree in enumerate(degrees):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    
    y_pred = model.predict(X_poly)
    mse = mean_squared_error(y, y_pred)  # in-sample (training) MSE

    axes[i].scatter(X, y, alpha=0.5, label='Data')
    axes[i].plot(X, y_pred, 'r-', label=f'Degree {degree}')
    axes[i].set_title(f"Degree {degree}\nTrain MSE = {mse:.3f}")
    axes[i].legend()

plt.tight_layout()
plt.show()

# Degree 1: High Bias (Underfitting)
# Degree 3: Good Balance
# Degree 15: High Variance (Overfitting); lowest training MSE, yet the curve chases the noise
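
The MSE printed above is in-sample, so it keeps falling as the degree grows; the damage from high variance only shows up out of sample. A short extension, continuing from the variables defined above and assuming a simple held-out split, makes the U-shape in test error visible:

from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in range(1, 16):
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    test_mse = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Train MSE falls monotonically with degree; test MSE typically bottoms out
# at a moderate degree and then rises again as variance takes over.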

R Implementation

set.seed(42)

# True function: sine wave
x <- seq(0, 10, length.out = 100)
y_true <- sin(x)
y <- y_true + rnorm(100, 0, 0.2)

# Fit Polynomial Models
par(mfrow = c(1, 3))

for (degree in c(1, 3, 15)) {
  model <- lm(y ~ poly(x, degree))
  y_pred <- predict(model)
  
  plot(x, y, main = paste("Degree", degree),
       xlab = "x", ylab = "y", pch = 16, col = "gray")
  lines(x, y_pred, col = "red", lwd = 2)
}

Interpretation Guide

Scenario                                  | Diagnosis                             | Action
Training Error = 0.01, Test Error = 0.50  | High Variance (Overfitting)           | Regularize, simplify the model, collect more data.
Training Error = 0.30, Test Error = 0.32  | High Bias (Underfitting)              | Add features, increase complexity.
Training Error = 0.10, Test Error = 0.12  | Good fit; bias and variance balanced  | Deploy the model.
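
These rules of thumb can be wired into a small check. A sketch with a hypothetical diagnose() helper; the gap ratio and error threshold are illustrative defaults, not standard values:

def diagnose(train_err, test_err, gap_ratio=2.0, high_err=0.25):
    """Rough heuristic mirroring the table above; thresholds are illustrative."""
    if test_err > gap_ratio * train_err:
        return "High variance (overfitting): regularize, simplify, or collect more data."
    if train_err > high_err:
        return "High bias (underfitting): add features or increase complexity."
    return "Reasonable balance of bias and variance."

print(diagnose(0.01, 0.50))  # -> high variance
print(diagnose(0.30, 0.32))  # -> high bias
print(diagnose(0.10, 0.12))  # -> balanced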