Cross-Validation
Definition
Core Statement
Cross-Validation (CV) is a model validation technique that assesses how well a predictive model will generalize to an independent dataset. It partitions data into training and testing subsets multiple times to obtain a robust estimate of out-of-sample performance.
Purpose
- Estimate model performance on unseen data.
- Detect overfitting (model performs well on training data, poorly on test data).
- Select hyperparameters (e.g., $\lambda$ in Ridge/Lasso).
- Compare models fairly.
When to Use
Use Cross-Validation When...
- Data is limited and you can't afford a separate test set.
- You need a reliable performance estimate.
- Tuning hyperparameters (e.g., regularization strength).
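As a concrete illustration of the last point, the sketch below selects a Ridge regularization strength by cross-validated error. The synthetic data, the alpha grid, and the use of GridSearchCV are illustrative choices only, not prescribed by this section.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data purely for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# 5-fold CV over a small grid of regularization strengths (alpha plays the role of lambda)
grid = GridSearchCV(Ridge(),
                    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="neg_mean_absolute_error")
grid.fit(X, y)

print(grid.best_params_)   # alpha with the best cross-validated MAE
print(-grid.best_score_)   # mean CV MAE at that alpha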
Theoretical Background
Types of Cross-Validation
| Type | Description | Use Case |
|---|---|---|
| K-Fold CV | Split data into $k$ equal folds; train on $k-1$ folds and test on the remaining fold, repeating $k$ times. | General-purpose. |
| Leave-One-Out (LOOCV) | K-Fold with $k = n$: each observation is held out once as its own test set. | Very small datasets. High variance. |
| Stratified K-Fold | Ensures each fold has the same class distribution. | Classification with imbalanced classes. |
| Time Series CV | Expanding window or rolling window to respect temporal order. | Time series data. |
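The split behaviour of these variants can be seen directly from their index generators. The toy arrays below are illustrative only; any small labelled dataset would do.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)        # 12 toy samples
y = np.array([0] * 9 + [1] * 3)         # imbalanced labels (25% positives)

# Plain K-Fold ignores the labels; a test fold may contain no positives at all.
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    print("KFold test:", test_idx, "positives:", y[test_idx].sum())

# Stratified K-Fold keeps one positive in every test fold here.
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    print("Stratified test:", test_idx, "positives:", y[test_idx].sum())

# TimeSeriesSplit: the training window always precedes the test window (expanding window).
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to index", train_idx.max(), "-> test", test_idx)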
K-Fold Procedure
- Shuffle the data randomly.
- Split it into $k$ equal folds.
- For each fold $i = 1, \dots, k$:
  - Train on the other $k - 1$ folds.
  - Evaluate on fold $i$.
- Average performance across all $k$ folds.
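The loop below is a minimal NumPy sketch of this procedure. It assumes a scikit-learn-style estimator with fit/predict and uses MAE as the score; both choices are illustrative, not required.

import numpy as np

def k_fold_cv(model, X, y, k=5, seed=42):
    # 1. Shuffle: random permutation of the row indices
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    # 2. Split into k (roughly) equal folds
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        # 3. Train on the other k-1 folds, evaluate on fold i
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.abs(pred - y[test_idx]).mean())  # MAE on the held-out fold
    # 4. Average performance across all folds
    return np.mean(scores)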
Worked Example: Manual 3-Fold CV
Problem
Data: $y = [10, 20, 30, 40, 50, 60]$ (six observations, split into three folds of two).
Model: Simple Average (predict the mean of the training data).
Metric: MAE (Mean Absolute Error).
Fold 1: Test on $[10, 20]$, train on $[30, 40, 50, 60]$.
- Train Mean: 45.
- Prediction: 45.
- Errors: $|10 - 45| = 35$, $|20 - 45| = 25$. Average Error = 30.
Fold 2: Test on $[30, 40]$, train on $[10, 20, 50, 60]$.
- Train Mean: 35.
- Errors: $|30 - 35| = 5$, $|40 - 35| = 5$. Average Error = 5.
Fold 3: Test on $[50, 60]$, train on $[10, 20, 30, 40]$.
- Train Mean: 25.
- Errors: $|50 - 25| = 25$, $|60 - 25| = 35$. Average Error = 30.
Total CV Score: $\text{MAE}_{CV} = \frac{30 + 5 + 30}{3} \approx 21.67$
Interpretation: On average, our model is off by about 21.7 units.
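The short sketch below reproduces this manual calculation with the same data and the same mean-of-training "model"; the fixed fold indices mirror the three test splits above.

import numpy as np

y = np.array([10, 20, 30, 40, 50, 60], dtype=float)
test_folds = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # three folds of two points

fold_maes = []
for test_idx in test_folds:
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    prediction = y[train_idx].mean()                      # "model": mean of the training folds
    fold_maes.append(np.abs(y[test_idx] - prediction).mean())

print(fold_maes)           # [30.0, 5.0, 30.0]
print(np.mean(fold_maes))  # 21.666... -> the CV score above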
Assumptions, Limitations & Pitfalls
- Data Leakage (The "Future Peek"):
  - Wrong: Normalize ALL the data, then split. (The test data influenced the mean.)
  - Right: Split first, compute the normalization statistics on the training folds only, then apply them to the test fold (see the pipeline sketch after this list).
- Time Series Error: Randomly shuffling stock prices means using "tomorrow's" price to predict "yesterday's", which is impossible in deployment. Use TimeSeriesSplit (expanding window).
- Imbalanced Classes: If Fold 1 has no "Fraud" cases, the model learns nothing about fraud from that split. Use StratifiedKFold to preserve class proportions in every fold.
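In scikit-learn, one way to avoid both the leakage and the imbalance pitfalls is to put preprocessing inside a Pipeline and split with StratifiedKFold. The dataset and estimator below are illustrative stand-ins, not part of the examples above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative imbalanced classification data (~10% positives)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# The scaler sits inside the pipeline, so it is re-fit on each training fold;
# the test fold never influences the normalization statistics (no "future peek").
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# StratifiedKFold keeps the positive rate roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print(scores.mean())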
Python Implementation
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
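# X, y: feature matrix and target vector, assumed to be defined elsewhere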
# K-Fold Cross-Validation
model = LogisticRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"Fold Scores: {scores}")
print(f"Mean Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
R Implementation
library(caret)
# K-Fold CV (trainControl specifies the CV method)
ctrl <- trainControl(method = "cv", number = 10)
model <- train(Y ~ ., data = df, method = "glm",
trControl = ctrl, family = "binomial")
print(model)
# Gives mean performance across folds.
Interpretation Guide
| Output | Interpretation |
|---|---|
| Mean CV Accuracy = 0.85 | Expected performance on unseen data is ~85%. |
| Training Accuracy = 0.95, CV Accuracy = 0.70 | Model is overfitting. Needs regularization or a simpler model. |
| Low variance across folds | Model performance is stable. |
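To check the overfitting pattern in the second row directly, training-fold and validation-fold scores can be compared. The dataset and the deliberately flexible model below are illustrative choices only.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; an unpruned decision tree will memorize its training folds
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
results = cross_validate(DecisionTreeClassifier(random_state=0), X, y,
                         cv=5, return_train_score=True)

train_acc = results["train_score"].mean()
cv_acc = results["test_score"].mean()
print(f"Train accuracy: {train_acc:.2f}, CV accuracy: {cv_acc:.2f}")
# A large train-vs-CV gap is the overfitting signature described in the table above.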
Related Concepts
- Bias-Variance Trade-off
- Ridge Regression / Lasso Regression - Use CV to tune $\lambda$.
- Overfitting