Descriptive Statistics
Definition
Core Statement
Descriptive Statistics summarizes and describes the main features of a dataset. Unlike Inferential Statistics (which draws conclusions about a population from a sample, for example by testing hypotheses), descriptive statistics presents quantitative descriptions in a manageable form, focusing on Central Tendency, Variability, and Shape.
1. Measures of Central Tendency (The "Middle")
These metrics describe the center of the data distribution.
| Measure | Definition | Pros | Cons |
|---|---|---|---|
| Mean ($\mu$ or $\bar{x}$) | Arithmetic average ($\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$). | Uses all data; basis for many tests. | Not Robust: Highly sensitive to outliers. |
| Median | The middle value when sorted. | Robust: Largely unaffected by outliers. Representative of skewed data (e.g., Income). | Harder to manipulate mathematically. |
| Mode | Most frequent value. | Works for categorical data. | Not unique (can be bimodal); can be unstable in small samples. |
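The three measures in the table can be computed with Python's standard `statistics` module. This is a minimal sketch; the sample values and the `shirt_sizes` list are hypothetical and exist only to illustrate the "works for categorical data" and "not unique" points.

```python
import statistics

# Hypothetical numeric sample, chosen only for illustration.
values = [2, 3, 3, 5, 7]

mean_value = statistics.mean(values)      # uses every observation
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequent value

# The mode also works for categorical data, where mean and median are undefined.
# multimode returns every value tied for most frequent, in first-seen order,
# so a bimodal sample yields two entries.
shirt_sizes = ["S", "M", "M", "L", "L", "XL"]
size_modes = statistics.multimode(shirt_sizes)

print(mean_value, median_value, mode_value)
print(size_modes)
```

Here `multimode` is used instead of `mode` for the categorical sample so that a tie is reported rather than silently resolved.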
Mean vs Median
- Symmetric: Mean $\approx$ Median.
- Right Skew: Mean > Median (Outliers pull Mean up).
- Left Skew: Mean < Median (Outliers pull Mean down).
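The ordering of mean and median under skew can be checked directly. The sketch below uses three small made-up samples (the left-skewed one is simply the mirror image of the right-skewed one); it is an illustration, not a general proof.

```python
import statistics

# Hypothetical samples: one large value drags the mean toward the tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]
left_skewed = [-x for x in right_skewed]  # mirror image, tail on the left
symmetric = [1, 2, 3, 4, 5]

samples = [
    ("symmetric", symmetric),
    ("right skew", right_skewed),
    ("left skew", left_skewed),
]

for name, sample in samples:
    mean_value = statistics.mean(sample)
    median_value = statistics.median(sample)
    print(f"{name:>10}: mean={mean_value:.2f}  median={median_value:.2f}")
```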
2. Measures of Variability (The "Spread")
These metrics describe how spread out or dispersed the data is.
| Measure | Definition | Notes |
|---|---|---|
| Range | Max - Min. | Heavily influenced by outliers. Simplest. |
| Variance ($\sigma^2$) | Average squared deviation from Mean. | Hard to interpret (units are squared). |
| Standard Deviation ($\sigma$) | Square root of the Variance ($\sigma = \sqrt{\sigma^2}$). | Same units as data. Roughly the "typical distance from the mean". |
| IQR (Interquartile Range) | Q3 - Q1. | Robust: Measures spread of middle 50%. |
| CV (Coefficient of Variation) | $\sigma / \mu$ (Standard Deviation divided by Mean). | Unitless. Good for comparing variation across different scales. |
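Each row of the table maps to a one-line computation. The sketch below uses the standard `statistics` module and the population formulas (`pvariance`, `pstdev`) to match the "average squared deviation" definition; swapping in `variance` and `stdev` would give the sample versions with an $n - 1$ denominator. The `"inclusive"` quantile method is an assumption made here because it interpolates linearly between data points.

```python
import statistics

data = [10, 12, 12, 14, 15, 18, 20, 100]  # one outlier at 100

data_range = max(data) - min(data)     # Max - Min
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

# Quartiles: n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                          # spread of the middle 50%

cv = std_dev / statistics.mean(data)   # unitless ratio

print(f"Range: {data_range}")
print(f"Variance: {variance:.2f}")
print(f"Std Dev: {std_dev:.2f}")
print(f"IQR: {iqr}")
print(f"CV: {cv:.2f}")
```

The range and standard deviation are driven by the single outlier, while the IQR only reflects the middle of the data.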
3. Measures of Shape
| Measure | Description | Interpretation |
|---|---|---|
| Skewness | Asymmetry. | 0: Symmetric. >0: Right skew (Tail right). <0: Left skew (Tail left). |
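| Kurtosis | "Tailedness" (weight of the tails; often loosely described as peakedness). | 3 (approx): Normal (excess kurtosis, defined as kurtosis minus 3, is then about 0). High (Leptokurtic): Heavy tails (Outlier prone). Low (Platykurtic): Light tails (Flat). |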
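Both shape measures are ratios of central moments. The helper below, `shape_measures`, is a hypothetical name introduced for this sketch; it uses the plain population moments, so its results differ slightly from libraries that apply a small-sample bias correction or report excess kurtosis (kurtosis minus 3).

```python
import statistics


def shape_measures(values):
    """Return (skewness, kurtosis) computed from population central moments."""
    n = len(values)
    mean_value = statistics.fmean(values)
    m2 = sum((x - mean_value) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean_value) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean_value) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # about 3 for a normal distribution
    return skewness, kurtosis


symmetric_skew, symmetric_kurt = shape_measures([1, 2, 3, 4, 5])
outlier_skew, outlier_kurt = shape_measures([10, 12, 12, 14, 15, 18, 20, 100])

print(f"Symmetric sample: skew={symmetric_skew:.2f}, kurtosis={symmetric_kurt:.2f}")
print(f"Outlier sample:   skew={outlier_skew:.2f}, kurtosis={outlier_kurt:.2f}")
```

The evenly spaced sample has zero skew and kurtosis below 3 (flat, light tails), while the sample with an outlier has positive skew and kurtosis above 3.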
Worked Example: Company Salaries
Problem
Data: [40k, 42k, 45k, 48k, 50k, 2000k] (CEO outlier).
Calculations:
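- Mean: $\frac{40 + 42 + 45 + 48 + 50 + 2000}{6} \approx 370.8$k. (Misleading! No one earns near this).
- Median: Average of 45k and 48k = $46.5$k. (Representative).
- Range: 1,960k. (Huge).
- Std Dev: $\approx 798$k (sample SD, $n - 1$ denominator). (Huge variability due to outlier).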
Conclusion: For this dataset, Mean and SD are dominated by a single outlier and misrepresent the typical salary. Report Median and IQR (or just median).
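The salary figures can be reproduced with the standard `statistics` module. This sketch assumes the sample standard deviation ($n - 1$ denominator) and the `"inclusive"` quartile method; other conventions give somewhat different SD and IQR values but the same conclusion.

```python
import statistics

salaries_k = [40, 42, 45, 48, 50, 2000]  # in thousands; 2000 is the CEO

mean_k = statistics.mean(salaries_k)
median_k = statistics.median(salaries_k)
range_k = max(salaries_k) - min(salaries_k)
std_dev_k = statistics.stdev(salaries_k)  # sample SD (n - 1)

q1, _, q3 = statistics.quantiles(salaries_k, n=4, method="inclusive")
iqr_k = q3 - q1

print(f"Mean:    {mean_k:.1f}k")
print(f"Median:  {median_k:.1f}k")
print(f"Range:   {range_k}k")
print(f"Std Dev: {std_dev_k:.1f}k")
print(f"IQR:     {iqr_k}k")
```

The median (46.5k) and IQR (6.75k) describe the five regular employees well, whereas the mean and SD mostly describe the CEO.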
Assumptions
Limitations
Pitfalls
- The "Average" Lie: Reporting only the Mean for skewed data (like Wealth) is deceptive. Always report Median too.
- Zero Variance: If $\sigma^2 = 0$, all data points are identical.
- Anscombe's Quartet: Different datasets can have identical Mean and Variance but look completely different. Always plot the data (Histogram, Boxplot).
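The Anscombe's Quartet point can be illustrated numerically. The sketch below uses only the y-values of the four published datasets and compares their mean and sample variance with their median and maximum; it stops short of plotting, which remains the more reliable check.

```python
import statistics

# y-values of Anscombe's four datasets (x-values omitted in this sketch).
anscombe_y = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

summary = {}
for name, ys in anscombe_y.items():
    summary[name] = {
        "mean": statistics.mean(ys),
        "variance": statistics.variance(ys),  # sample variance (n - 1)
        "median": statistics.median(ys),
        "max": max(ys),
    }

for name, s in summary.items():
    print(
        f"{name:>3}: mean={s['mean']:.2f}  var={s['variance']:.2f}  "
        f"median={s['median']:.2f}  max={s['max']:.2f}"
    )
```

Mean and variance agree to about two decimal places across all four sets, while the medians and maxima already hint that the shapes differ.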
Python Implementation
import pandas as pd
from scipy import stats
data = [10, 12, 12, 14, 15, 18, 20, 100] # Outlier at 100
# Pandas describe: count, mean, std, min, quartiles, max
df = pd.DataFrame(data, columns=['Values'])
print(df.describe())
# Robust stats (barely affected by the outlier)
median = df['Values'].median()
iqr = stats.iqr(data) # Q3 - Q1
# Shape
skew = df['Values'].skew()
kurt = df['Values'].kurt() # Excess kurtosis: a normal distribution gives 0, not 3
print(f"Median: {median}")
print(f"IQR: {iqr}")
print(f"Skew: {skew:.2f} (positive = right skew)")
print(f"Excess kurtosis: {kurt:.2f}")
Related Concepts
- Normal Distribution - Reference for skew/kurtosis.
- Boxplot - Visualizing the 5-number summary.
- Outlier Analysis (Standardized Residuals)
- Coefficient of Variation