Normal Distribution

Definition

Core Statement

The Normal Distribution (Gaussian Distribution) is a continuous probability distribution that is symmetric about the mean, with the property that data points near the mean are more frequent than those far from the mean. It is fully characterized by two parameters: mean (μ) and standard deviation (σ).


Purpose

The Normal Distribution is central to statistics because:

  1. Many natural phenomena approximate normality (height, IQ, measurement error).
  2. The Central Limit Theorem (CLT) ensures the sampling distribution of the mean is approximately normal for sufficiently large samples, regardless of the population distribution.
  3. Most parametric tests (T-tests, ANOVA, Regression) assume normality of residuals.
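Point 2 can be checked empirically. The sketch below (using an Exponential(1) population purely as an illustrative non-normal example) shows that sample means land near the CLT prediction N(μ, σ/√n), even though the parent distribution is strongly skewed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Skewed parent population: Exponential(1) has mean 1 and std 1
n, reps = 50, 10_000
sample_means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# CLT prediction: means are approx. N(mu, sigma / sqrt(n)) = N(1, 0.141)
print(f"Empirical mean of sample means: {sample_means.mean():.3f}")  # near 1.0
print(f"Empirical std of sample means:  {sample_means.std():.3f}")   # near 1/sqrt(50) ≈ 0.141
```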

When to Use

Assume Normality When...

  • Modeling continuous variables known to be approximately normal.
  • Working with sample means of large samples (via CLT).
  • Residuals of a regression model have been verified to be normal.

Do NOT Assume Normality When...

  • Data is discrete (use Binomial, Poisson).
  • Data is bounded (e.g., proportions on [0, 1], which are often better modeled by a Beta distribution).
  • Data has heavy tails or extreme outliers.


Theoretical Background

Probability Density Function (PDF)

f(x | μ, σ) = (1 / (σ√(2π))) · exp( −½ · ((x − μ)/σ)² )

Properties

Property  | Value
----------|-------------------------
Mean      | μ
Median    | μ
Mode      | μ
Variance  | σ²
Skewness  | 0 (perfectly symmetric)
Kurtosis  | 3 (mesokurtic)
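These values can be cross-checked with SciPy. One caveat: scipy.stats reports Fisher (excess) kurtosis, so a normal distribution shows 0 rather than the Pearson value of 3 listed above:

```python
from scipy import stats

# Theoretical moments of the standard normal N(0, 1);
# 'mvsk' returns (mean, variance, skewness, kurtosis)
mean, var, skew, kurt = stats.norm.stats(moments='mvsk')

print(float(mean), float(var))  # 0.0 1.0
print(float(skew))              # 0.0, perfectly symmetric
print(float(kurt))              # 0.0, scipy reports *excess* kurtosis (Pearson 3 minus 3)
```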

The Empirical Rule (68-95-99.7)

Memorize This

  • 68.27% of data falls within μ±1σ
  • 95.45% of data falls within μ±2σ
  • 99.73% of data falls within μ±3σ
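These coverages follow directly from the standard normal CDF, so the rule can be verified in one loop:

```python
from scipy import stats

# P(mu - k*sigma < X < mu + k*sigma) = Phi(k) - Phi(-k) for k = 1, 2, 3
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"Within ±{k}σ: {coverage:.4%}")  # 68.27%, 95.45%, 99.73%
```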

The Standard Normal Distribution (Z)

Any normal variable X ~ N(μ, σ²) can be standardized to Z ~ N(0, 1):

Z = (X − μ) / σ

Interpretation: Z represents how many standard deviations X is away from the mean.
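A quick sketch of standardization on an IQ-style scale (μ = 100, σ = 15, the same example used in the implementation sections below); probabilities computed on the Z scale match those on the original scale:

```python
from scipy import stats

mu, sigma = 100, 15   # IQ-style scale
x = 130

z = (x - mu) / sigma  # standardize
print(f"Z = {z:.1f}")  # 2.0: x lies two standard deviations above the mean

# Same probability whether computed on the original scale or the Z scale
print(stats.norm.cdf(x, loc=mu, scale=sigma))
print(stats.norm.cdf(z))
```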


Limitations

Pitfalls

  1. Normality is often assumed, not verified. Always run diagnostics.
  2. Sensitivity to Outliers: a few extreme values can cause normality tests to reject even when the bulk of the data is well-behaved.
  3. Misuse with Proportions/Counts: proportions and counts are bounded or discrete; model them with Beta, Binomial, or Poisson distributions rather than forcing a normal approximation.
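Pitfall 2 is easy to demonstrate: appending a single gross outlier to otherwise normal data flips the Shapiro-Wilk verdict (a simulated sketch, exact p-values depend on the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
clean = rng.normal(loc=0.0, scale=1.0, size=100)

# Contaminate with a single gross outlier, 15 standard deviations out
contaminated = np.append(clean, 15.0)

stat_clean, p_clean = stats.shapiro(clean)
stat_bad, p_bad = stats.shapiro(contaminated)

print(f"Clean p-value:        {p_clean:.4f}")
print(f"Contaminated p-value: {p_bad:.3g}")  # collapses toward zero
```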


Assessing Normality

Visual Methods

  1. Histogram: Should appear symmetric and bell-shaped.
  2. Q-Q Plot: Points should fall on the 45-degree diagonal reference line.

Statistical Tests

Test               | Best For                  | Null Hypothesis
-------------------|---------------------------|----------------
Shapiro-Wilk       | Small samples (n < 50)    | Data is normal
Kolmogorov-Smirnov | Larger samples (n > 50)   | Data is normal
D'Agostino-Pearson | Skewness/kurtosis checks  | Data is normal
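The D'Agostino-Pearson test is available in SciPy as stats.normaltest, which combines sample skewness and kurtosis into a single statistic. A small comparison (simulated data, so exact p-values vary with the seed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=300)       # symmetric, bell-shaped
skewed_data = rng.exponential(size=300)  # strongly right-skewed

# stats.normaltest implements the D'Agostino-Pearson omnibus test
stat_n, p_n = stats.normaltest(normal_data)
stat_s, p_s = stats.normaltest(skewed_data)

print(f"Normal data: p = {p_n:.4g}")
print(f"Skewed data: p = {p_s:.4g}")  # tiny p-value, normality rejected
```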

Python Implementation

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# --- Properties ---
mu, sigma = 100, 15  # IQ example

# P(X < 115) = ?
prob = stats.norm.cdf(115, loc=mu, scale=sigma)
print(f"P(X < 115): {prob:.4f}")

# What value is at the 90th percentile?
x_90 = stats.norm.ppf(0.90, loc=mu, scale=sigma)
print(f"90th Percentile: {x_90:.2f}")

# --- Generate and Test ---
data = np.random.normal(mu, sigma, 200)

# Shapiro-Wilk
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk p-value: {p:.4f}")

# Q-Q Plot
import statsmodels.api as sm
sm.qqplot(data, fit=True, line='45')  # fit=True standardizes the data so the 45° line applies
plt.title("Q-Q Plot")
plt.show()

R Implementation

# --- Properties ---
mu <- 100
sigma <- 15

# P(X < 115)
pnorm(115, mean = mu, sd = sigma)

# 90th Percentile
qnorm(0.90, mean = mu, sd = sigma)

# --- Generate and Test ---
data <- rnorm(200, mean = mu, sd = sigma)

# Shapiro-Wilk Test
shapiro.test(data)

# Q-Q Plot
qqnorm(data, main = "Q-Q Plot for Normality")
qqline(data, col = "red", lwd = 2)

Interpretation Guide

Scenario                 | Interpretation
-------------------------|---------------------------------------------------------------
Z = 0                    | Observation is exactly at the mean.
Z = 2                    | Observation is 2 std. deviations above the mean (top ~2.3%).
Z = −1.5                 | Observation is 1.5 std. deviations below the mean.
Shapiro p > 0.05         | Fail to reject H0; data is consistent with normality.
Q-Q plot curves at tails | Data has heavier/lighter tails than normal (kurtosis issue).