Confusion Matrix

Definition

Core Statement

A Confusion Matrix is a table that summarizes the performance of a classification model by comparing predicted labels to actual labels. It forms the basis for calculating metrics like Accuracy, Precision, Recall, and F1-Score.


Purpose

  1. Visualize classification performance.
  2. Calculate key metrics for model evaluation.
  3. Identify types of errors (False Positives vs False Negatives).

When to Use

Use Confusion Matrix When...

  • Evaluating any classification model (Binary or Multi-class).
  • You need to understand what kind of errors the model makes.
  • Accuracy alone is misleading (imbalanced classes).


Theoretical Background

The Matrix (Binary Classification)

                    | Predicted Negative (0)             | Predicted Positive (1)
Actual Negative (0) | True Negative (TN)                 | False Positive (FP), Type I Error
Actual Positive (1) | False Negative (FN), Type II Error | True Positive (TP)
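
For the binary case, the four cells can be pulled straight out of scikit-learn's confusion_matrix. A minimal sketch with hypothetical labels:

from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# For labels [0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 2 1 3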

Key Metrics

Metric                    | Formula                                         | Interpretation
Accuracy                  | (TP + TN) / Total                               | Overall correctness. (Misleading if imbalanced.)
Precision                 | TP / (TP + FP)                                  | Of all predicted positives, how many are correct? (Avoid false alarms.)
Recall (Sensitivity, TPR) | TP / (TP + FN)                                  | Of all actual positives, how many were found? (Avoid missing cases.)
Specificity (TNR)         | TN / (TN + FP)                                  | Of all actual negatives, how many were correctly identified?
F1-Score                  | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of Precision and Recall. Balances both.
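
A minimal sketch computing these metrics with scikit-learn's built-in functions, reusing the hypothetical labels from the sketch above (specificity is just recall with the negative class treated as positive):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical labels for illustration (TN=2, FP=2, FN=1, TP=3)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

print("Accuracy:   ", accuracy_score(y_true, y_pred))              # (TP + TN) / Total = 0.625
print("Precision:  ", precision_score(y_true, y_pred))             # TP / (TP + FP)    = 0.60
print("Recall:     ", recall_score(y_true, y_pred))                # TP / (TP + FN)    = 0.75
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))   # TN / (TN + FP)    = 0.50
print("F1-Score:   ", f1_score(y_true, y_pred))                    # 2PR / (P + R)     = 0.667 (rounded)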

Precision vs Recall Trade-off

Context Matters

  • High Precision Priority: Spam detection. (False Positives = real emails sent to the spam folder.)
  • High Recall Priority: Disease screening. (False Negatives = sick patients who are missed.)
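
A minimal sketch of how this trade-off can be inspected with scikit-learn's precision_recall_curve, using hypothetical labels and scores; each candidate threshold yields one (precision, recall) pair:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted scores for illustration
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.15, 0.40, 0.35, 0.60, 0.70, 0.80, 0.20, 0.90])

# Moving along the curve trades precision against recall
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")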


Limitations

Pitfalls

  1. Accuracy Paradox: In imbalanced data (e.g., 99% negative), a model predicting all negative gets 99% accuracy but is useless.
  2. Threshold Dependent: Metrics change with the classification threshold. Use ROC & AUC for threshold-independent evaluation.
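
As a minimal sketch of point 2, ROC AUC summarizes ranking quality across all possible thresholds (hypothetical labels and probabilities for illustration):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.15, 0.40, 0.35, 0.60, 0.70, 0.80, 0.20, 0.90])

# AUC is threshold-independent: it scores how well positives rank above negatives
print("ROC AUC:", roc_auc_score(y_true, y_prob))  # 0.875

# roc_curve returns one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("Thresholds evaluated:", len(thresholds))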


Python Implementation

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# y_true: actual labels, y_pred: predicted labels (example data; replace with your own)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Heatmap Visualization
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Pred 0', 'Pred 1'], yticklabels=['Actual 0', 'Actual 1'])
plt.title("Confusion Matrix")
plt.show()

# Full Report: precision, recall, F1 per class plus accuracy
print(classification_report(y_true, y_pred))
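
For multi-class problems the same functions apply. A sketch using ConfusionMatrixDisplay (assuming scikit-learn >= 1.0) builds and plots the matrix in one call, with hypothetical three-class labels:

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Hypothetical multi-class labels for illustration
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

# Computes the confusion matrix and draws the heatmap in one step
ConfusionMatrixDisplay.from_predictions(y_true, y_pred, cmap='Blues')
plt.show()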

R Implementation

library(caret)

# Example labels (replace with your own)
predicted_labels <- c(0, 0, 1, 1, 1, 1, 1, 0)
actual_labels    <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Create factors with matching levels
pred   <- factor(predicted_labels, levels = c(0, 1))
actual <- factor(actual_labels, levels = c(0, 1))

# Confusion Matrix (declare which class counts as "positive")
confusionMatrix(data = pred, reference = actual, positive = "1")

# Output: Accuracy, Sensitivity, Specificity, Precision, etc.

Worked Numerical Example

Rare Disease Detection (Imbalanced)

Data: 100 Patients (95 Healthy, 5 Sick).
Model Prediction: Predicts everyone is "Healthy" (All Negative).

Confusion Matrix:

  • TP = 0, FN = 5 (Missed all sick people!)
  • TN = 95, FP = 0

Metrics:

  • Accuracy: (0 + 95) / 100 = 95% (looks amazing!)
  • Recall: 0 / (0 + 5) = 0% (missed everyone!)
  • Precision: undefined (0/0), conventionally reported as 0.

Conclusion: The model is useless despite 95% accuracy.
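
This example can be reproduced with scikit-learn in a few lines (a sketch; note that precision is reported as 0 with a warning when there are no predicted positives):

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# 5 sick (1) and 95 healthy (0) patients; the model predicts "healthy" for all
y_true = np.array([1] * 5 + [0] * 95)
y_pred = np.zeros(100, dtype=int)

print(confusion_matrix(y_true, y_pred))             # [[95  0]
                                                    #  [ 5  0]]  -> TN=95, FP=0, FN=5, TP=0
print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95
print("Recall:  ", recall_score(y_true, y_pred))    # 0.0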


Interpretation Guide

Output                     | Interpretation                                                            | Edge Case Notes
High Accuracy, Low Recall  | Accuracy Paradox.                                                         | Common in imbalanced data (fraud, disease). Ignore accuracy.
High Recall, Low Precision | Casts the net too wide; many false alarms.                               | Okay for screening tests (cheap filters).
High Precision, Low Recall | Very picky; misses many cases, but positive predictions are trustworthy. | Good for spam filters (don't delete real email).
F1 = 0.9                   | Strong balance of Precision and Recall.                                  | Excellent model.

Common Pitfall Example

Threshold Setting Blindness

Scenario: Logistic Regression outputs probabilities (0.1, 0.6, 0.9, ...).
Default: Classify as 1 if p>0.5.

Problem:

  • If "Positive" is Cancer, p>0.5 might be too strict.
  • You miss patients with p=0.4 who might have cancer.

Solution:

  • Tune the Threshold.
  • Lowering the threshold to 0.2 increases Recall (finds more cancer cases) but decreases Precision (more false alarms); see the sketch after this list.
  • Use the ROC Curve to make this decision intentionally.
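
A minimal sketch of this threshold tuning, using hypothetical probabilities and true labels; lowering the cut-off from 0.5 to 0.2 raises Recall and lowers Precision:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted probabilities and true labels for illustration
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.45])
y_true = np.array([0,    1,    0,    1,    1,    0,    1,    0])

for threshold in [0.5, 0.2]:
    # Convert probabilities to hard labels at the chosen cut-off
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"recall={recall_score(y_true, y_pred):.2f}, "
          f"precision={precision_score(y_true, y_pred):.2f}")

# threshold=0.5: recall=0.75, precision=1.00
# threshold=0.2: recall=1.00, precision=0.57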