Train-Test Split

Definition

Core Statement

Train-Test Split is a fundamental technique in machine learning where the data is divided into two subsets: a training set (to fit the model) and a test set (to evaluate performance). Evaluating on held-out data reveals whether the model generalizes or merely memorizes the training data.


Purpose

  1. Evaluate how well a model generalizes to new data.
  2. Detect overfitting (model memorizes training data but fails on test data).
  3. Provide an unbiased estimate of model performance.

When to Use

Always Use Train-Test Split When...

  • Building predictive models.
  • You have sufficient data (typically n > 100).
  • You want an honest assessment of performance.

Alternatives

  • Small datasets: Use Cross-Validation instead (more efficient use of data); see the sketch after this list.
  • Very large datasets: Can afford separate train/validation/test (3-way split).
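
For the small-dataset case, a minimal sketch of 5-fold cross-validation with scikit-learn; the synthetic dataset and model choice here are illustrative assumptions, not part of the original example:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a small dataset (assumption for illustration)
X, y = make_classification(n_samples=60, n_features=4, random_state=42)

# 5-fold CV: every observation serves in both a training and a testing fold
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")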


Theoretical Background

Standard Split Ratios

Ratio      Training Set   Test Set                    When to Use
70/30      70%            30%                         Balanced approach.
80/20      80%            20%                         Common default.
60/20/20   60%            20% validation + 20% test   When tuning hyperparameters.

Three-Way Split (Train/Validation/Test)

Set          Purpose
Training     Fit model parameters.
Validation   Tune hyperparameters, select models.
Test         Final unbiased evaluation; never used during development.

Test Set is Sacred

The test set should never influence any decision during model development. It is only used once at the very end for final evaluation.
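
A 60/20/20 three-way split can be produced with two consecutive calls to train_test_split; a minimal sketch, where the synthetic data and variable names are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)

# First carve off the final test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Split the remaining 80% into train and validation:
# 25% of 80% = 20% of the full data, leaving 60% for training
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20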


Assumptions

  • Observations are independent and identically distributed (i.i.d.); time series violate this (see Pitfalls).
  • The test set is drawn from the same distribution as the data the model will face in production.

Limitations & Pitfalls

  1. Data Leakage: Accidentally using test information during training (e.g., fitting a scaler before splitting); see the leakage-safe sketch after this list.
  2. Imbalanced Classes: Random split may create unbalanced train/test. Use stratified split.
  3. Time Series: Random split breaks temporal order. Use time-based split (train on early, test on late).
  4. Small Data: Single split has high variance. Use Cross-Validation.
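
Pitfall 1 can be avoided by fitting all preprocessing on the training data only. A minimal sketch using a scikit-learn Pipeline, which guarantees the scaler is fit on X_train alone; the synthetic data are an assumption for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=42)  # illustrative data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit() runs the scaler on X_train only; X_test is later transformed with
# the training statistics, so no test information leaks into training
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(f"Test Accuracy: {pipe.score(X_test, y_test):.2f}")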


Python Implementation

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

# Example Data (10 samples: a stratified 80/20 split needs at least
# one test sample per class, which 5 samples cannot provide)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
              [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# 80/20 Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify keeps class balance
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

# Train Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on Test Set
test_score = model.score(X_test, y_test)
print(f"Test Accuracy: {test_score:.2f}")

R Implementation

library(caret)

# Example Data
set.seed(42)
df <- data.frame(
  X1 = 1:100,
  X2 = rnorm(100),
  y = factor(sample(c("A", "B"), 100, replace = TRUE))
)

# 80/20 Split (Stratified)
train_idx <- createDataPartition(df$y, p = 0.8, list = FALSE)
train_set <- df[train_idx, ]
test_set <- df[-train_idx, ]

# Train Model
model <- glm(y ~ X1 + X2, data = train_set, family = "binomial")

# Predict on Test Set
predictions <- predict(model, newdata = test_set, type = "response")
predicted_class <- ifelse(predictions > 0.5, "B", "A")

# Evaluate
# Align factor levels so confusionMatrix works even if one class is never predicted
confusionMatrix(factor(predicted_class, levels = levels(test_set$y)), test_set$y)

Interpretation Guide

Scenario                           Interpretation
Train Acc = 95%, Test Acc = 90%    Good generalization; slight overfitting (normal).
Train Acc = 99%, Test Acc = 60%    Severe overfitting; the model memorized the training data.
Train Acc = 65%, Test Acc = 63%    Underfitting; the model is too simple.
Test Acc > Train Acc               Unusual; check for data leakage or a lucky split.
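
To locate your own model in this table, compare both scores directly; a minimal sketch that continues the Python example above (it reuses model, X_train, X_test, y_train, and y_test from that block):

# Diagnose fit quality by comparing train vs. test accuracy
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train Accuracy: {train_score:.2f}")
print(f"Test Accuracy:  {test_score:.2f}")
print(f"Gap: {train_score - test_score:.2f}")  # a large positive gap signals overfitting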