Descriptive Statistics

Definition

Core Statement

Descriptive Statistics summarize and describe the main features of a dataset. Unlike Inferential Statistics (which use a sample to test hypotheses or estimate population parameters), descriptive statistics present the data at hand in a manageable quantitative form, focusing on Central Tendency, Variability, and Shape.

1. Measures of Central Tendency (The "Middle")

These metrics describe the center of the data distribution.

Measure	Definition	Pros	Cons
Mean ( $\bar{x}$ )	Arithmetic average ( $\frac{\sum x}{n}$ ).	Uses all data; basis for many tests.	Not Robust: Highly sensitive to outliers.
Median	The middle value when sorted.	Robust: Largely unaffected by outliers. Representative of skewed data (e.g., Income).	Harder to manipulate mathematically.
Mode	Most frequent value.	Works for categorical data.	Not unique (can be bimodal); can be unstable in small samples.
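
The three measures in the table can be sketched with Python's standard `statistics` module. The sample values and the `colors` list are illustrative assumptions, not data from a real study.

```python
import statistics

# Hypothetical numeric sample with one extreme value (100).
values = [10, 12, 12, 14, 15, 18, 20, 100]

mean_value = statistics.mean(values)      # uses every observation
median_value = statistics.median(values)  # middle of the sorted values

# Hypothetical categorical data: the mode is the only measure that applies.
colors = ["red", "blue", "red", "green", "blue"]
# multimode returns every most-frequent value, so ties are not hidden.
modes = statistics.multimode(colors)

print(f"mean={mean_value}, median={median_value}, modes={modes}")
```

Here the single value of 100 drags the mean well above the median, and the tie between two colors shows why the mode may not be unique.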

Mean vs Median

Symmetric: Mean $\approx$ Median.
Right Skew: Mean > Median (Outliers pull Mean up).
Left Skew: Mean < Median (Outliers pull Mean down).
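
As a rough illustration of the rule of thumb above, the sign of (mean - median) can be checked on small samples. The helper name `mean_minus_median` and the three datasets are hypothetical, and the rule is a heuristic that can fail for some multimodal or discrete distributions.

```python
import statistics

def mean_minus_median(values):
    """Return mean - median; the sign hints at the direction of skew."""
    return statistics.mean(values) - statistics.median(values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]   # long tail on the high side
left_skewed = [-50, 2, 3, 4, 5]   # long tail on the low side

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skewed),
                   ("left skew", left_skewed)]:
    print(f"{name}: mean - median = {mean_minus_median(data)}")
```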

2. Measures of Variability (The "Spread")

These metrics describe how spread out or dispersed the data is.

Measure	Definition	Notes
Range	Max - Min.	Heavily influenced by outliers. Simplest.
Variance ( $\sigma^{2}$ )	Average squared deviation from Mean (the sample version $s^{2}$ divides by $n - 1$ instead of $n$).	Hard to interpret (units are squared).
Standard Deviation ( $\sigma$ )	$\sqrt{\text{Variance}}$ .	Same units as data. Roughly the typical distance from the mean.
IQR (Interquartile Range)	Q3 - Q1.	Robust: Measures spread of middle 50%.
CV (Coef of Variation)	$\sigma / \mu$ .	Unitless. Good for comparing variation across different scales; meaningful only for ratio data with a nonzero mean.
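
A minimal sketch of the spread measures in the table, using only the standard `statistics` module. The sample is hypothetical, and the quartiles use the "inclusive" interpolation method, which is one of several conventions, so other tools may report slightly different values for Q1 and Q3.

```python
import statistics

values = [10, 12, 12, 14, 15, 18, 20, 100]  # hypothetical sample

data_range = max(values) - min(values)

sample_var = statistics.variance(values)  # divides by n - 1
sample_sd = statistics.stdev(values)      # square root of sample_var
pop_sd = statistics.pstdev(values)        # divides by n

# quantiles(n=4) returns the three quartile cut points [Q1, Q2, Q3].
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# CV is only meaningful for ratio data with a nonzero mean.
cv = sample_sd / statistics.mean(values)

print(f"range={data_range}, sample SD={sample_sd:.2f}, population SD={pop_sd:.2f}")
print(f"IQR={iqr}, CV={cv:.2f}")
```

The range and SD are dominated by the single extreme value, while the IQR stays small because it only looks at the middle half of the data.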

3. Measures of Shape

Measure	Description	Interpretation
Skewness	Asymmetry.	0: Symmetric. >0: Right skew (Tail right). <0: Left skew (Tail left).
Kurtosis	"Tailedness" (weight of the tails, not peakedness).	3 (approx): Normal (0 on the excess-kurtosis scale, which subtracts 3). High (Leptokurtic): Heavy tails (Outlier prone). Low (Platykurtic): Light tails.
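
Both shape measures can be written in terms of central moments. The sketch below uses the simple population (biased) moment formulas, so its numbers will differ somewhat from bias-corrected estimators in libraries such as pandas. The function name `moment_shape` is hypothetical.

```python
import statistics

def moment_shape(values):
    """Return (skewness, kurtosis) from population central moments.

    Kurtosis is the non-excess form, so a normal distribution scores about 3.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("shape measures are undefined when all values are equal")
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skewness, kurtosis = moment_shape([10, 12, 12, 14, 15, 18, 20, 100])
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}, excess={kurtosis - 3:.2f}")
```

Subtracting 3 from the second value gives excess kurtosis, the scale on which a normal distribution scores 0.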

Worked Example: Company Salaries

Problem

Data: [40k, 42k, 45k, 48k, 50k, 2000k] (CEO outlier).

Calculations:

Mean: $\frac{2225 k}{6} \approx 371 k$ . (Misleading! No one earns near this).
Median: Average of 45k and 48k = $46.5 k$ . (Representative).
Range: 1,960k. (Huge).
Std Dev: $\approx 798 k$ (sample formula; $\approx 729 k$ with the population formula). (Huge variability due to outlier).

Conclusion: For this dataset, Mean and SD are misleading because a single outlier dominates both. Report Median and IQR (or just median).
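
The salary figures can be re-derived with the standard `statistics` module. This is a sketch under two assumptions: the SD uses the sample formula (divisor n - 1), and the quartiles use the "inclusive" interpolation method, so other conventions may give a slightly different IQR.

```python
import statistics

salaries_k = [40, 42, 45, 48, 50, 2000]  # salaries in thousands, from the example

mean_k = statistics.mean(salaries_k)
median_k = statistics.median(salaries_k)
range_k = max(salaries_k) - min(salaries_k)
sample_sd_k = statistics.stdev(salaries_k)  # divisor n - 1

# Quartiles for the robust spread measure recommended in the conclusion.
q1, _, q3 = statistics.quantiles(salaries_k, n=4, method="inclusive")
iqr_k = q3 - q1

print(f"mean={mean_k:.1f}k, median={median_k}k, range={range_k}k")
print(f"sample SD={sample_sd_k:.1f}k, IQR={iqr_k}k")
```

The IQR comes out at a few thousand, which describes the five regular salaries far better than the SD does.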

Assumptions

Variable Type: Mean/SD require interval/ratio data. Mode works for nominal.
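Independence: Descriptive stats treat each observation as a separate data point; duplicated or serially dependent records can distort the summaries (unless the dependence itself is being described, as with autocorrelation).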

Limitations

Pitfalls

The "Average" Lie: Reporting only the Mean for skewed data (like Wealth) is deceptive. Always report Median too.
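Zero Variance: If $\sigma = 0$ , all data points are identical, and measures that divide by $\sigma$ (skewness, kurtosis) are undefined.
Anscombe's Quartet: Different datasets can have identical Mean and Variance (and even correlation) but look completely different. Always plot the data (Histogram, Boxplot, Scatter plot).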
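
The point behind Anscombe's Quartet can be illustrated with a much smaller case. The two datasets below are made up for this sketch (they are not Anscombe's original data) and share the same mean and population variance despite having very different shapes.

```python
import statistics

# Hypothetical data: two clusters, nothing in the middle.
bimodal = [2, 2, 2, 2, 8, 8, 8, 8]
# Hypothetical data: almost everything at 5, plus two extreme values.
peaked = [-1, 5, 5, 5, 5, 5, 5, 11]

for name, data in [("bimodal", bimodal), ("peaked", peaked)]:
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance (divisor n)
    print(f"{name}: mean={mean}, variance={variance}, distinct values={sorted(set(data))}")
```

A histogram of each list would show the difference immediately, which the two summary numbers alone cannot.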

Python Implementation

import pandas as pd
from scipy import stats

data = [10, 12, 12, 14, 15, 18, 20, 100]  # Outlier at 100

# Pandas describe: count, mean, sample std (ddof=1), min, quartiles, max
df = pd.DataFrame(data, columns=['Values'])
print(df.describe())

# Robust stats: barely affected by the single outlier
median = df['Values'].median()
iqr = stats.iqr(data)  # Q3 - Q1

# Shape: pandas returns bias-corrected skewness and excess kurtosis (normal = 0)
skew = df['Values'].skew()
kurt = df['Values'].kurt()

print(f"Median: {median}")
print(f"IQR: {iqr}")
print(f"Skew: {skew:.2f} (High positive skew)")
print(f"Excess kurtosis: {kurt:.2f}")

Related Concepts

Normal Distribution - Reference for skew/kurtosis.
Boxplot - Visualizing the 5-number summary.
Outlier Analysis (Standardized Residuals)
Coefficient of Variation