Boxplot
Definition
Core Statement
A Boxplot (or Box-and-Whisker Plot) is a standardized way of displaying the distribution of data based on a five-number summary: Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum. It is one of the most widely used tools for flagging potential outliers and visualizing skewness.
Purpose
- Identify Outliers: Points beyond the "whiskers" are plotted individually and flagged as potential outliers.
- Visualize Spread: The box size (IQR) shows the variability of the middle 50% of data.
- Comparisons: Ideally suited for comparing distributions side-by-side across groups (e.g., Salary by Department).
- Detect Skewness: Asymmetry in the box or whiskers indicates skew.
The Anatomy of a Boxplot
- Median (Q2, 50th percentile): The line inside the box.
- The Box (IQR): Spans from Q1 (25th percentile) to Q3 (75th percentile). Contains the middle 50% of data.
- The Whiskers: Extend to the furthest data point within $1.5 \times IQR$ of the box edges (Q1 and Q3).
- Outliers (Diamonds/Dots): Any point beyond the whiskers.
Theoretical Background
Calculating IQR and Fences
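- Interquartile Range (IQR): $IQR = Q_3 - Q_1$.
- Lower Fence: $Q_1 - 1.5 \times IQR$.
- Upper Fence: $Q_3 + 1.5 \times IQR$.

Any data point below the lower fence or above the upper fence is flagged as an outlier.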
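The fence rule (lower fence $Q_1 - 1.5 \times IQR$, upper fence $Q_3 + 1.5 \times IQR$) can be sketched as a small helper. The function names `tukey_fences` and `is_outlier`, the parameter `k`, and the example quartiles 10 and 20 are illustrative assumptions, not part of the original note.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for quartiles q1 and q3.

    k is the fence multiplier; 1.5 is the conventional choice.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_outlier(x, q1, q3, k=1.5):
    """True if x lies strictly outside the fences."""
    lower, upper = tukey_fences(q1, q3, k)
    return x < lower or x > upper


# Hypothetical quartiles chosen only to illustrate the arithmetic.
lower, upper = tukey_fences(10, 20)   # IQR = 10, so the fences are -5 and 35
flagged = is_outlier(40, 10, 20)      # 40 lies above the upper fence
```

Keeping `k` as a parameter makes it easy to try a wider multiplier for skewed data, where 1.5 tends to flag too many points.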
Worked Example: Detecting Outliers
Problem
Data:
Task: Draw boxplot and find outliers.
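- Find Quartiles:
  - Sort the data in ascending order.
  - Median (Q2): 16.
  - Q1 (median of the first half): 12.5.
  - Q3 (median of the second half): 21.
- Calculate IQR: $IQR = Q_3 - Q_1 = 21 - 12.5 = 8.5$.
- Calculate Fences:
  - Lower: $12.5 - 1.5 \times 8.5 = -0.25$.
  - Upper: $21 + 1.5 \times 8.5 = 33.75$.
- Identify Outliers:
  - Is $100 > 33.75$? Yes. 100 is an outlier.
  - Is any value below $-0.25$? No.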
Visualization: The box spans 12.5 to 21. The right whisker ends at 22. The dot at 100 is isolated.
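The steps of the worked example can be reproduced in a few lines. The sketch below uses the median-of-halves rule, excluding the middle value when the count is odd. The dataset is hypothetical: it was chosen only because it yields the quartiles (12.5, 16, 21), the whisker end (22), and the outlier (100) stated here, and it is not necessarily the original data. The helper name `quartiles_by_halves` is also an assumption.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1, Q2, Q3 via the median-of-halves rule (needs at least 2 values).

    For an odd count the middle value is excluded from both halves.
    """
    xs = sorted(values)
    n = len(xs)
    lower_half = xs[: n // 2]
    upper_half = xs[(n + 1) // 2:]
    return median(lower_half), median(xs), median(upper_half)


# Hypothetical data consistent with the results quoted in the worked example.
data = [10, 12, 13, 15, 16, 19, 20, 22, 100]

q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
# Whiskers stop at the most extreme points that are still inside the fences.
whisker_low = min(x for x in data if x >= lower_fence)
whisker_high = max(x for x in data if x <= upper_fence)
```

Quartile conventions differ between tools. Interpolating methods, such as the default in pandas `quantile`, can return slightly different Q1 and Q3 for the same data, so the fences may shift a little.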
Interpretation Guide
| Visual Assessment | Meaning |
|---|---|
| Median is at the center of the box | Roughly symmetric distribution (Normal-ish). |
| Median closer to bottom | Right-Skewed (Positive skew). Tail extends up. |
| Median closer to top | Left-Skewed (Negative skew). Tail extends down. |
| Many outliers | Heavy-tailed distribution (e.g., Cauchy / Log-Normal). |
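The "median closer to bottom or top" reading can be turned into a number. One option, not mentioned in the original note, is Bowley's quartile skewness coefficient, $(Q_3 + Q_1 - 2 Q_2) / (Q_3 - Q_1)$. It is 0 when the median sits at the center of the box, positive when the median is closer to Q1, and negative when it is closer to Q3. The function name `quartile_skewness` below is an assumption.

```python
def quartile_skewness(q1, q2, q3):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1).

    Ranges from -1 to 1 and uses only the box, not the whiskers.
    """
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, so the coefficient is undefined")
    return (q3 + q1 - 2 * q2) / iqr


# Quartiles from the worked example: the median (16) is closer to Q1 (12.5)
# than to Q3 (21), so the coefficient is positive (right skew).
skew = quartile_skewness(12.5, 16, 21)
```

Because the coefficient ignores the whiskers and outliers, it is a rough companion to the visual check rather than a replacement for it.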
Limitations & Pitfalls
Pitfalls
- Hides Multimodality: A boxplot cannot distinguish between a Normal (bell) peak and a Bimodal (two-peak) distribution. They might have identical quartiles. Use a Violin Plot or Histogram to check shape.
- Sample Size Blindness: A boxplot of a handful of points looks just as authoritative as one of thousands. Always annotate with the sample size $n$.
- The "1.5" rule is arbitrary: Using $1.5 \times IQR$ works well for Normal data (about 0.7% of points flagged as outliers), but for naturally skewed data, it marks too many valid points as outliers.
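The roughly 0.7% figure for Normal data can be checked directly from the standard normal distribution. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

q3 = std_normal.inv_cdf(0.75)        # upper quartile, about 0.674
q1 = -q3                             # lower quartile, by symmetry
iqr = q3 - q1                        # about 1.349
upper_fence = q3 + 1.5 * iqr         # about 2.70 standard deviations

# Probability of landing beyond either fence (two symmetric tails).
frac_outside = 2 * (1 - std_normal.cdf(upper_fence))
print(f"{frac_outside:.2%} of Normal data falls outside the fences")
```

The same calculation for a skewed distribution would need the two tails computed separately, because the fences are no longer symmetric around the median.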
Python Implementation
```python
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be a pandas DataFrame with 'Department' and 'Salary' columns.

# Comparative boxplot: one box per department
sns.boxplot(x='Department', y='Salary', data=df)
plt.title("Salary Distribution by Department")
plt.show()

# Detect outliers with the 1.5 * IQR rule (fences computed over all salaries)
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < Q1 - 1.5 * IQR) | (df['Salary'] > Q3 + 1.5 * IQR)]
```
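The outlier filter in this section uses one pair of fences computed over every salary, whereas the comparative boxplot draws separate fences for each department. To flag the same points the plot shows, the fences can be computed per group. The sketch below assumes pandas is installed; the small DataFrame and the names `q1`, `q3`, and `group_outliers` are hypothetical.

```python
import pandas as pd

# Hypothetical data: 90 is unusual within "Ops" but not across the whole company.
df = pd.DataFrame({
    "Department": ["Eng"] * 5 + ["Ops"] * 5,
    "Salary": [100, 105, 110, 115, 120, 50, 52, 54, 56, 90],
})

grouped = df.groupby("Department")["Salary"]

# Broadcast each department's quartiles back onto its rows.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

is_outlier = (df["Salary"] < q1 - 1.5 * iqr) | (df["Salary"] > q3 + 1.5 * iqr)
group_outliers = df[is_outlier]
```

With this data the per-department fences flag the 90 in "Ops", while fences computed over the whole column flag nothing, which is why the two approaches can disagree.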
Related Concepts
- Normal Distribution - Reference shape.
- Violin Plot - Boxplot + Density (Best of both worlds).
- Histogram - Binned view of distribution.
- Outlier Analysis (Standardized Residuals)