Central Limit Theorem (CLT)
Definition
The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean will approximate a Normal Distribution as the sample size ($n$) increases, regardless of the shape of the underlying population distribution.
Purpose
The CLT is the foundational pillar of inferential statistics. It justifies:
- Using Z-scores and T-scores to calculate Confidence Intervals.
- Using parametric tests (T-tests, ANOVA) on non-normal data when $n$ is large.
- The assumption of normality for sample means in hypothesis testing.
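As a minimal sketch of the first point, the snippet below builds a z-based 95% confidence interval for a population mean from a single sample drawn from a skewed population. The sample data (Exponential with mean 2, $n = 50$) is an illustrative assumption, not part of the original text; the CLT is what justifies treating the sample mean as approximately Normal here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative sample from a skewed (Exponential) population -- an assumption for the demo
sample = rng.exponential(scale=2, size=50)

n = sample.size
x_bar = sample.mean()                         # sample mean
se = sample.std(ddof=1) / np.sqrt(n)          # estimated standard error of the mean

# CLT: for large n, x_bar is approximately Normal, so a z-based 95% CI is reasonable
z = stats.norm.ppf(0.975)                     # ~1.96
ci_low, ci_high = x_bar - z * se, x_bar + z * se
print(f"mean = {x_bar:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```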
When to Use
- You are working with sample means (not raw individual data points).
- Your sample size is $n \geq 30$ (classic rule of thumb).
- You need to make inferences about a population mean from a sample.
When Not to Use
- Analyzing individual observations (not means).
- Sample size is very small ($n < 30$) and the population is heavily skewed.
- Data has extreme outliers that distort the mean.
Theoretical Background
The Formal Statement
Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2$.

As $n \to \infty$, the distribution of the standardized sample mean converges to the standard Normal distribution:

$$Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$

Equivalently, for large $n$, $\bar{X}_n$ is approximately $N\!\left(\mu, \frac{\sigma^2}{n}\right)$.
Key Implications
| Concept | Formula | Meaning |
|---|---|---|
| Expected Value | $E[\bar{X}] = \mu$ | The average of sample averages equals the population mean. |
| Standard Error (SE) | $SE = \dfrac{\sigma}{\sqrt{n}}$ | As $n$ increases, the spread of the sample means shrinks. |
| Precision | $SE \propto \dfrac{1}{\sqrt{n}}$ | To double precision, you need $4\times$ the sample size. |
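A quick simulation can make the standard-error rows concrete. The sketch below is an illustrative assumption (reusing an Exponential population like the one in the later Python section): it checks that the empirical spread of the sample means tracks $\sigma/\sqrt{n}$ and roughly halves when $n$ is quadrupled.

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2, size=100_000)   # skewed population, sigma ~= 2
sigma = population.std()

for n in (25, 100):                                   # quadrupling n should halve the SE
    means = [rng.choice(population, n).mean() for _ in range(5_000)]
    print(f"n={n:4d}  empirical SE={np.std(means):.3f}  "
          f"theory sigma/sqrt(n)={sigma / np.sqrt(n):.3f}")
```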
Assumptions
- Observations are independent (e.g., obtained by random sampling).
- Observations are identically distributed (drawn from the same population).
- The population has a finite variance $\sigma^2$.
- The sample size is large enough (commonly $n \geq 30$; more for heavily skewed data).
Worked Example: Delivery Truck Safety
A delivery truck can carry a maximum load of 2100 kg. It is loaded with 40 boxes.
- The weight of any individual box is a random variable with Mean ($\mu$) = 50 kg and Standard Deviation ($\sigma$) = 10 kg.
- The distribution of box weights is non-normal (some heavy outliers, some light).
Question: What is the probability that the total weight exceeds the 2100 kg limit?
Solution:
- Identify Parameters:
  - $n = 40$ boxes (sample size > 30, so CLT applies).
  - $\mu_{\text{sum}} = n\mu = 40 \times 50 = 2000$ kg.
  - $\sigma_{\text{sum}} = \sigma\sqrt{n} = 10\sqrt{40} \approx 63.25$ kg.
  - Target Value (load limit) = 2100 kg.
- Calculate Z-Score:
  - Since $n$ is large, we approximate the distribution of the sum as Normal: $\text{Total} \approx N(2000, 63.25^2)$.
  - $Z = \dfrac{2100 - 2000}{63.25} \approx 1.58$
- Find Probability:
  - Looking up $Z = 1.58$ in a Z-table gives an area of $\approx 0.943$ to the left.
  - $P(\text{Total} > 2100) = 1 - 0.943 \approx 0.057$
Interpretation:
There is approximately a 5.7% chance the truck will be overloaded, even though the expected weight (2000 kg) is well below the limit. This calculation relies on the CLT because the individual box weights are not normally distributed.
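As a cross-check, the worked example can be reproduced numerically. The sketch below computes the analytic CLT answer with scipy's Normal CDF, then runs a small Monte Carlo simulation using an assumed skewed (Gamma) box-weight distribution; the Gamma choice is purely illustrative, since the example only specifies the mean and standard deviation.

```python
import numpy as np
from scipy import stats

mu, sigma, n, limit = 50, 10, 40, 2100

# Analytic CLT approximation: Total ~ N(n*mu, n*sigma^2)
z = (limit - n * mu) / (sigma * np.sqrt(n))
p_clt = 1 - stats.norm.cdf(z)
print(f"Z = {z:.2f}, P(overload) ~= {p_clt:.3f}")   # Z ~= 1.58, p ~= 0.057

# Monte Carlo check with a skewed Gamma box-weight distribution matching
# mean 50 and sd 10 -- an illustrative assumption, not from the example.
rng = np.random.default_rng(0)
shape = (mu / sigma) ** 2      # k = 25
scale = sigma ** 2 / mu        # theta = 2
totals = rng.gamma(shape, scale, size=(100_000, n)).sum(axis=1)
print(f"Simulated P(overload) ~= {(totals > limit).mean():.3f}")
```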
Limitations
- The "Population Becomes Normal" Fallacy: A common misconception is that if
is large, the original data becomes normal. False. The population distribution remains exactly the same; only the distribution of means looks normal. - "Large enough" is relative: For symmetric distributions,
may suffice. For highly skewed distributions (e.g., Exponential), or more may be needed. - Outliers: Extreme values can distort the mean, requiring even larger
for normality. - Does not apply to medians or variances: CLT is specifically about sample means. The sampling distribution of the median or variance has different properties.
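To illustrate the last point, the following sketch (an illustrative simulation, not from the original text) compares the empirical standard error of the sample mean, which matches the CLT's $\sigma/\sqrt{n}$, with that of the sample median, which follows a different formula.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, reps = 1.0, 100, 10_000

# Many samples from a Normal(0, sigma^2) population (illustrative choice)
samples = rng.normal(loc=0, scale=sigma, size=(reps, n))

se_mean = samples.mean(axis=1).std()          # spread of sample means
se_median = np.median(samples, axis=1).std()  # spread of sample medians

print(f"SE of mean   ~= {se_mean:.3f} (CLT predicts {sigma / np.sqrt(n):.3f})")
print(f"SE of median ~= {se_median:.3f} (larger: ~1.25 * sigma/sqrt(n) for Normal data)")
```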
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# --- Simulation ---
np.random.seed(42)

# 1. Create a Non-Normal Population (Exponential)
population = np.random.exponential(scale=2, size=100000)

# 2. Draw Many Samples and Calculate Means
sample_sizes = [5, 30, 100]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]

    # Shapiro-Wilk Test for Normality
    _, p_val = stats.shapiro(sample_means)

    sns.histplot(sample_means, kde=True, ax=axes[i], stat='density')
    axes[i].set_title(f'n = {n} (Shapiro p = {p_val:.3f})')

plt.suptitle('CLT in Action: Exponential Pop -> Normal Sampling Distribution')
plt.tight_layout()
plt.show()
```
R Implementation
```r
library(ggplot2)
set.seed(42)

# 1. Population (Exponential - Right Skewed)
pop <- rexp(100000, rate = 0.5)

# 2. Draw Samples
simulate_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(pop, n)))
}

means_5   <- simulate_means(5)
means_30  <- simulate_means(30)
means_100 <- simulate_means(100)

# 3. Combine and Plot
df <- data.frame(
  mean = c(means_5, means_30, means_100),
  n = factor(rep(c("n=5", "n=30", "n=100"), each = 1000),
             levels = c("n=5", "n=30", "n=100"))
)

ggplot(df, aes(x = mean, fill = n)) +
  geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.7) +
  # Theoretical Normal curve for n = 30, drawn in every facet as a reference
  stat_function(fun = dnorm,
                args = list(mean = mean(pop), sd = sd(pop) / sqrt(30)),
                color = "red", size = 1, linetype = "dashed") +
  facet_wrap(~n) +
  labs(title = "CLT Simulation: Exponential -> Normal") +
  theme_minimal()
```
Interpretation Guide
| Observation | Meaning |
|---|---|
| Sampling distribution is Bell-Shaped | CLT is working; parametric inference (T-tests, Z-tests) is valid. |
| Sampling distribution is still skewed (small $n$) | CLT cannot be relied upon. Use non-parametric tests (e.g., Wilcoxon). |
| Standard Error decreases | As $n$ grows, sample means cluster more tightly around $\mu$, so estimates become more precise. |
| Sum vs Mean | CLT applies to Sums as well as Means ($\text{Sum} \approx N(n\mu, n\sigma^2)$). |
Related Concepts
- Normal Distribution - The target distribution.
- Law of Large Numbers - Related but distinct (LLN is about accuracy, CLT is about distribution shape).
- Standard Error - Derived directly from CLT.
- T-Distribution - Used when $\sigma$ is unknown.
- Confidence Intervals - Built on CLT logic.