Principal Component Analysis (PCA)
Definition
Core Statement
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated components (Principal Components) that capture the maximum variance.
Purpose
- Reduce Dimensionality: Compress many features into fewer components.
- Remove Multicollinearity: Create uncorrelated inputs for regression.
- Visualize High-Dimensional Data: Project data to 2D or 3D.
- Noise Reduction: Discard components with low variance.
When to Use
Use PCA When...
- You have many correlated features.
- You want to reduce dimensionality before modeling.
- You need visualization of high-dimensional data.
- VIF (Variance Inflation Factor) indicates severe multicollinearity.
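As a quick diagnostic before reaching for PCA, VIF can be computed per feature with statsmodels. A minimal sketch, assuming statsmodels is installed and X is a numeric pandas DataFrame:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    # Add an intercept column, then compute one VIF per original feature.
    arr = sm.add_constant(X).to_numpy(dtype=float)
    return pd.Series(
        [variance_inflation_factor(arr, i + 1) for i in range(X.shape[1])],
        index=X.columns,
    )

# VIF > 10 is a common (rough) threshold for severe multicollinearity.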
Limitations
- PCA is a linear method; non-linear relationships may not be captured.
- Interpretability is lost: Principal Components are linear combinations of original variables.
Theoretical Background
How It Works
- Standardize Data: Center (mean 0) and scale (std 1). Mandatory.
- Calculate Covariance Matrix.
- Find Eigenvalues and Eigenvectors: Eigenvectors define component directions; eigenvalues define variance explained.
- Rank Components: PC1 has highest variance, PC2 second highest, etc.
- Project Data: Transform data onto selected components.
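These steps can be reproduced in a few lines of NumPy. A minimal sketch for illustration (assumes X is an (n_samples, n_features) array; in practice, use scikit-learn's PCA as shown later):

import numpy as np

def pca_from_scratch(X, n_components=2):
    # 1. Standardize: mean 0, std 1 per feature
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features
    cov = np.cov(Z, rowvar=False)
    # 3. Eigendecomposition (eigh is appropriate for symmetric matrices)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Rank components by variance explained (eigh returns ascending order)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Project the data onto the top components
    scores = Z @ eigenvectors[:, :n_components]
    return scores, eigenvalues / eigenvalues.sum()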
Variance Explained
Each component captures a proportion of the total variance:

$$\text{Variance Explained}_k = \frac{\lambda_k}{\sum_{i=1}^{p} \lambda_i}$$

where $\lambda_k$ is the eigenvalue of component $k$ and $p$ is the number of features. For example, eigenvalues (2.4, 0.4, 0.2) give PC1 = 2.4 / 3.0 = 80% of the variance.
Rule of Thumb: Keep enough components to reach ~80-90% cumulative variance explained.
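scikit-learn can apply this rule directly: passing a float between 0 and 1 as n_components keeps just enough components to reach that cumulative fraction. A sketch, assuming X_scaled is standardized data:

from sklearn.decomposition import PCA

# Keep the smallest number of components explaining >= 90% of total variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)
print(pca.n_components_)  # how many components were kept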
Loadings
Loadings are the correlations between original variables and principal components. High absolute loading = variable contributes strongly to that component.
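With scikit-learn, correlation-style loadings can be recovered by scaling each eigenvector by the square root of its eigenvalue (valid for standardized inputs). A sketch, assuming a fitted PCA object pca and a feature DataFrame X:

import numpy as np
import pandas as pd

# Loadings = eigenvector * sqrt(eigenvalue); for standardized data these are
# the correlations between original variables and component scores.
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=X.columns,
    columns=[f"PC{i+1}" for i in range(pca.n_components_)],
)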
Worked Example: Customer Segmentation
Problem
You have data on customers with 3 correlated variables:
- Annual Income (Mean = 50k, SD = 15k)
- Spending Score, 0-100 (Mean = 50, SD = 25)
- Credit Card Debt (Mean = 5k, SD = 2k)
Goal: Reduce these 3 dimensions to 2 Principal Components.
Solution Process:
1. Standardize (Z-scores): Subtract the mean, divide by the SD.
   - Example Customer A: Income = 80k, Spending = 80, Debt = 8k.
   - z_Income = (80k − 50k) / 15k = 2.0; z_Spending = (80 − 50) / 25 = 1.2; z_Debt = (8k − 5k) / 2k = 1.5.
   - Input vector: z = (2.0, 1.2, 1.5).
2. PCA Transformation:
   - Let's say PCA gives the PC1 eigenvector (loadings) w = (0.577, 0.577, 0.577), i.e. all variables correlate positively with roughly equal weight (illustrative values).
   - Calculate the PC1 score for A: PC1 = 0.577 × 2.0 + 0.577 × 1.2 + 0.577 × 1.5 ≈ 2.71.
3. Interpretation:
   - Customer A has a high PC1 score. If PC1 represents "Overall Wealth/Status", this customer is "High Status".
   - We have reduced 3 numbers to 1 (or 2) while retaining the core information about their deviation from the mean.
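The arithmetic above can be checked in a few lines (the PC1 weights are the illustrative values assumed in the example, not fitted ones):

import numpy as np

mean = np.array([50_000, 50, 5_000])
sd = np.array([15_000, 25, 2_000])
customer_a = np.array([80_000, 80, 8_000])

z = (customer_a - mean) / sd             # -> [2.0, 1.2, 1.5]
w_pc1 = np.array([0.577, 0.577, 0.577])  # assumed illustrative PC1 loadings
print(z, z @ w_pc1)                      # PC1 score ~= 2.71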
Pitfalls
- Forgot to Scale? If you run PCA on Income (0-1,000,000) and Age (0-100) without scaling, PC1 will just be "Income" because it has huge variance. Always Standardize first.
- Interpretation Black Box: "PC1 decreased by 2 units" is meaningless to business stakeholders. You must analyze the loadings to translate it (e.g., "Wealth Score decreased").
- Outliers: PCA maximizes variance, a squared quantity, so a single extreme outlier will pull the principal component towards it and skew the result. Use Robust PCA for messy data.
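A small synthetic demo of the outlier effect (illustrative; random data):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # well-behaved cloud
X_out = np.vstack([X, [[50.0, 50.0]]])   # add one extreme point

print(PCA(2).fit(X).components_[0])      # direction set by the bulk of the data
print(PCA(2).fit(X_out).components_[0])  # pulled towards the outlier, ~[0.71, 0.71] up to sign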
Python Implementation
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is assumed to be a pandas DataFrame of numeric features.
# 1. Standardize (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 2. Fit PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)
# 3. Variance Explained
print("Variance Explained:", pca.explained_variance_ratio_)
print("Cumulative:", pca.explained_variance_ratio_.cumsum())
# 4. Scree Plot
plt.plot(range(1, 6), pca.explained_variance_ratio_, 'o-')
plt.xlabel("Component")
plt.ylabel("Variance Explained")
plt.title("Scree Plot")
plt.show()
# 5. Loadings (rows = original features, columns = components)
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f"PC{i}" for i in range(1, 6)],
    index=X.columns,
)
print(loadings)
R Implementation
# 1. PCA (the center = TRUE and scale. = TRUE arguments are CRITICAL)
pca_result <- prcomp(df, center = TRUE, scale. = TRUE)
# 2. Summary
summary(pca_result)
# Look at "Cumulative Proportion" to decide how many components to keep.
# 3. Scree Plot
screeplot(pca_result, type = "lines")
# 4. Biplot (Visualize loadings and scores)
biplot(pca_result)
# 5. Loadings
pca_result$rotation
Interpretation Guide
| Output | Interpretation |
|---|---|
| PC1 Explains 60% | The first axis captures 60% of the information in the dataset. |
| Cumulative Variance > 80% | Stop adding components. You have retained enough signal. |
| Absolute loading > 0.5 | Variable is strongly associated with this component. |
| Biplot Arrows | Variables with arrows pointing in same direction are highly correlated. |
Related Concepts
- Factor Analysis (EFA & CFA) - Similar but for latent constructs.
- VIF (Variance Inflation Factor) - Diagnoses the multicollinearity that PCA removes.
- t-SNE - Non-linear visualization.