Principal Component Analysis (PCA)

Definition

Core Statement

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables, called Principal Components, ordered so that each successive component captures as much of the remaining variance as possible.


Purpose

  1. Reduce Dimensionality: Compress many features into fewer components.
  2. Remove Multicollinearity: Create uncorrelated inputs for regression.
  3. Visualize High-Dimensional Data: Project data to 2D or 3D.
  4. Noise Reduction: Discard components with low variance.

When to Use

Use PCA When...

  • You have many correlated features.
  • You want to reduce dimensionality before modeling.
  • You need visualization of high-dimensional data.
  • VIF (Variance Inflation Factor) indicates severe multicollinearity.
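
For the last point, a quick multicollinearity check with statsmodels (a sketch on toy data where x1 and x2 are nearly collinear by construction; a common rule flags VIF above ~10 as severe):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(0, 0.1, size=100),  # nearly collinear with x1
    "x3": rng.normal(size=100),
})

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 show very large VIFs; x3 stays near 1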

Limitations

  • PCA is a linear method; non-linear relationships may not be captured.
  • Interpretability is lost: Principal Components are linear combinations of original variables.


Theoretical Background

How It Works

  1. Standardize Data: Center (mean 0) and scale (std 1). Mandatory.
  2. Calculate Covariance Matrix.
  3. Find Eigenvalues and Eigenvectors: Eigenvectors define component directions; eigenvalues define variance explained.
  4. Rank Components: PC1 has highest variance, PC2 second highest, etc.
  5. Project Data: Transform data onto selected components.
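
These five steps fit in a few lines of NumPy. A minimal sketch on toy correlated data (sklearn's PCA reaches the same result via SVD rather than an explicit eigen-decomposition):

import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples of 3 correlated features
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.8, 0.5],
                                          [0.0, 0.6, 0.4],
                                          [0.0, 0.0, 0.3]])

# 1. Standardize: mean 0, std 1 per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix (features x features)
C = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition (eigh, since C is symmetric)
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Rank components by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top k components
k = 2
scores = Z @ eigvecs[:, :k]
print("Variance explained:", (eigvals / eigvals.sum()).round(3))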

Variance Explained

Each component captures a proportion of total variance:

Proportion_k = λ_k / (λ_1 + λ_2 + … + λ_p)

where λ_k is the eigenvalue of the k-th component and p is the total number of components.

Rule of Thumb: Keep enough components to reach roughly 80-90% cumulative variance.
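
Applying the rule is a two-line computation; a sketch with hypothetical eigenvalues (in practice these come from a fitted PCA):

import numpy as np

# Hypothetical eigenvalues of a 5-feature covariance matrix
eigvals = np.array([3.0, 1.2, 0.4, 0.25, 0.15])

prop = eigvals / eigvals.sum()            # proportion of variance per component
cum = np.cumsum(prop)                     # cumulative proportion
k = int(np.searchsorted(cum, 0.80)) + 1   # smallest k reaching 80%
print(prop.round(2), cum.round(2), k)     # k = 2 here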

Loadings

Loadings are the correlations between original variables and principal components. High absolute loading = variable contributes strongly to that component.
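
Note that sklearn's pca.components_ are unit-length eigenvectors, not correlations. One common convention rescales each eigenvector by the square root of its eigenvalue to obtain correlation-style loadings on standardized data; a sketch using the iris dataset for concreteness:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Rescale unit eigenvectors into correlation-style loadings
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings[:, 0].round(2))  # loadings of each variable on PC1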


Worked Example: Customer Segmentation

Problem

You have data on customers with 3 correlated variables:

  • X1: Annual Income (Mean=50k, SD=15k)
  • X2: Spending Score (0-100) (Mean=50, SD=25)
  • X3: Credit Card Debt (Mean=5k, SD=2k)

Goal: Reduce these 3 dimensions to 2 Principal Components.

Solution Process:

  1. Standardize (Z-scores):

    • Subtract mean, divide by SD.
    • Example Customer A: Income=80k, Spending=80, Debt=8k.
    • Z1 = (80 − 50) / 15 = 2.0
    • Z2 = (80 − 50) / 25 = 1.2
    • Z3 = (8 − 5) / 2 = 1.5
    • Input Vector: [2.0, 1.2, 1.5].
  2. PCA Transformation:

    • Let's say PCA gives the eigenvector (loadings) for PC1: v1 = [0.58, 0.58, 0.58] (≈ 1/√3 each, so all three variables load equally and positively).
    • Calculate the PC1 score for Customer A (verified in the sketch after this list):
      PC1 = (0.58 × 2.0) + (0.58 × 1.2) + (0.58 × 1.5) = 1.16 + 0.696 + 0.87 = 2.726
  3. Interpretation:

    • Customer A has a high PC1 score. If PC1 represents "Overall Wealth/Status", this customer is "High Status".
    • We have reduced 3 numbers to 1 (or 2) while retaining the core information about their deviation from the mean.
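
Step 2 is just a dot product of the standardized vector with the eigenvector; a quick check of the arithmetic above:

import numpy as np

z = np.array([2.0, 1.2, 1.5])       # Customer A, standardized
v1 = np.array([0.58, 0.58, 0.58])   # PC1 eigenvector (loadings)
print(round(z @ v1, 3))             # 2.726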

Pitfalls

  1. Forgot to Scale? If you run PCA on Income (range 0-1,000,000) and Age (range 0-100) without scaling, PC1 will simply be "Income" because it has by far the largest variance. Always standardize first (demonstrated in the sketch after this list).
  2. Interpretation Black Box: "PC1 decreased by 2 units" is meaningless to business stakeholders. You must analyze the loadings to translate it (e.g., "Wealth Score decreased").
  3. Outliers: PCA maximizes variance, so a single extreme outlier can pull the leading component toward itself and skew the result. Consider Robust PCA for messy data.
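
Pitfall 1 is easy to demonstrate; a sketch with simulated, correlated income/age data (the exact numbers are made up):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
income = rng.normal(50_000, 15_000, 500)            # variance ~2.25e8
age = 25 + income / 2_000 + rng.normal(0, 5, 500)   # correlated, variance ~81
X = np.column_stack([income, age])

# Unscaled: PC1 is essentially the income axis (loading on age ~ 0)
print(PCA(n_components=1).fit(X).components_)

# Standardized: both variables contribute (~0.71 each)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(PCA(n_components=1).fit(Z).components_)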


Python Implementation

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd

# X is assumed to be a pandas DataFrame of numeric features
# (at least 5 columns, since n_components=5 below).

# 1. Standardize (CRITICAL)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Fit PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_scaled)

# 3. Variance Explained
print("Variance Explained:", pca.explained_variance_ratio_)
print("Cumulative:", pca.explained_variance_ratio_.cumsum())

# 4. Scree Plot
plt.plot(range(1, 6), pca.explained_variance_ratio_, 'o-')
plt.xlabel("Component")
plt.ylabel("Variance Explained")
plt.title("Scree Plot")
plt.show()

# 5. Loadings (rows = original variables, columns = components)
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f'PC{i}' for i in range(1, 6)],
                        index=X.columns)
print(loadings)
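
Two useful variations, continuing with X_scaled from step 1 above: sklearn's PCA also accepts a float n_components (it keeps enough components to reach that cumulative variance), and inverse_transform reconstructs the data from the kept components, which is the noise-reduction use from the Purpose list.

# Keep enough components for 90% cumulative variance.
pca_90 = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca_90.fit_transform(X_scaled)
print("Components kept:", pca_90.n_components_)

# Noise reduction: project down, then reconstruct in the (scaled)
# feature space; low-variance directions are discarded.
X_denoised = pca_90.inverse_transform(X_reduced)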

R Implementation

# 1. PCA (center = TRUE and scale. = TRUE are CRITICAL)
pca_result <- prcomp(df, center = TRUE, scale. = TRUE)

# 2. Summary
summary(pca_result)
# Look at "Cumulative Proportion" to decide how many components to keep.

# 3. Scree Plot
screeplot(pca_result, type = "lines")

# 4. Biplot (Visualize loadings and scores)
biplot(pca_result)

# 5. Loadings
pca_result$rotation

Interpretation Guide

Output                       Interpretation
---------------------------  ------------------------------------------------------------------------
PC1 explains 60%             The first axis captures 60% of the information in the dataset.
Cumulative variance > 80%    Stop adding components; you have retained enough signal.
|Loading| > 0.5              The variable is strongly associated with this component.
Biplot arrows                Variables whose arrows point in the same direction are highly correlated.