ROC & AUC
Definition
ROC (Receiver Operating Characteristic) Curve plots the True Positive Rate (Recall) against the False Positive Rate at all possible classification thresholds. AUC (Area Under the Curve) summarizes the ROC into a single number representing the model's ability to discriminate between classes.
Purpose
- Evaluate classifier performance across all thresholds.
- Compare multiple models using a single metric (AUC).
- Diagnose trade-offs between sensitivity and specificity.
When to Use
- Comparing binary classifiers.
- You need a threshold-independent metric.
- Classes are reasonably balanced.
- Not when classes are heavily imbalanced: ROC can be overly optimistic there; use Precision-Recall AUC instead.
Theoretical Background
Axes of the ROC Curve
- Y-axis: True Positive Rate (TPR, Recall): TPR = TP / (TP + FN)
- X-axis: False Positive Rate (FPR): FPR = FP / (FP + TN)
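To make the two axes concrete, here is a minimal sketch that computes a single (FPR, TPR) point at one threshold, using made-up labels and scores purely for illustration:

import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels and scores, purely illustrative
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.60, 0.70, 0.90])

threshold = 0.5
y_pred = (scores >= threshold).astype(int)

# Binary confusion matrix layout in sklearn: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # one point on the ROC y-axis
fpr = fp / (fp + tn)  # one point on the ROC x-axis
print(f"Threshold {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")

Sweeping the threshold from high to low traces out the full ROC curve, one (FPR, TPR) point per threshold.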
Interpreting the Curve
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier. |
| 0.9 - 1.0 | Excellent discrimination. |
| 0.8 - 0.9 | Good discrimination. |
| 0.7 - 0.8 | Fair discrimination. |
| 0.5 | Random guessing (diagonal line). |
| < 0.5 | Worse than random (the model's ranking is inverted; flipping its predictions would beat chance). |
Geometric Interpretation
AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
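A quick way to check this interpretation is to estimate the pairwise ranking probability by brute force and compare it with roc_auc_score; a minimal sketch on synthetic scores (all names and numbers here are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives
pos_scores = rng.normal(1.0, 1.0, 500)
neg_scores = rng.normal(0.0, 1.0, 500)

y_true = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([pos_scores, neg_scores])

# P(random positive outranks random negative); ties count as half
diff = pos_scores[:, None] - neg_scores[None, :]
rank_prob = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(f"Pairwise ranking probability: {rank_prob:.3f}")
print(f"roc_auc_score:                {roc_auc_score(y_true, scores):.3f}")

The two printed numbers match, which is exactly the geometric claim above.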
Limitations
- Imbalanced Data: With rare positives, a model can have high AUC but low precision. Use Precision-Recall AUC.
- Ignores Calibration: AUC measures ranking, not probability correctness. Use Brier Score for calibration.
- Does not pick threshold: AUC tells you how good the model is overall, but you still need to choose a threshold for deployment.
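Since AUC itself does not pick an operating point, one common heuristic (not the only one) is Youden's J statistic, which selects the threshold maximizing TPR - FPR. A minimal sketch on synthetic data; the dataset and model here are placeholders:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Synthetic data and model, purely for illustration
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)
best = np.argmax(tpr - fpr)  # Youden's J = TPR - FPR
print(f"Threshold by Youden's J: {thresholds[best]:.3f} (TPR={tpr[best]:.2f}, FPR={fpr[best]:.2f})")

In practice the deployed threshold should reflect the actual costs of false positives versus false negatives, not just a geometric rule.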
Python Implementation
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
# 'model' is assumed to be an already-fitted classifier; X_test, y_test are held-out data
# Get predicted probabilities for the positive class (not class labels)
probs = model.predict_proba(X_test)[:, 1]
# Calculate AUC
auc = roc_auc_score(y_test, probs)
print(f"AUC: {auc:.3f}")
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.show()
R Implementation
library(pROC)
# actual_labels: true 0/1 labels; predicted_probs: predicted probabilities from a fitted model
roc_obj <- roc(actual_labels, predicted_probs)
# Get AUC
auc(roc_obj)
# Plot
plot(roc_obj, main = "ROC Curve", print.auc = TRUE)
Worked Numerical Example
Scenario: Comparing two models for detecting spam emails.
- Model A: AUC = 0.92
- Model B: AUC = 0.85
Interpretation:
- If you pick a random Spam email and a random Non-Spam email:
- Model A has a 92% chance of assigning a higher "Spam Score" to the actual spam email.
- Model A separates the classes better than Model B.
BUT: Model A might be worse at low False Positive Rates (e.g., it blocks too many real emails). You must check the curve shape, not just the AUC number.
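One way to "check the curve shape" numerically is to compare the TPR each model achieves at a fixed low FPR by interpolating its ROC curve; a sketch with two stand-in models on synthetic data (the 1% FPR budget is an assumed business constraint):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-ins for Model A and Model B
X, y = make_classification(n_samples=5000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Model A": RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
    "Model B": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
}

target_fpr = 0.01  # assumed budget: block at most 1% of legitimate emails
for name, m in models.items():
    p = m.predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, p)
    tpr_at_budget = np.interp(target_fpr, fpr, tpr)  # TPR achievable at FPR = 1%
    print(f"{name}: AUC = {roc_auc_score(y_te, p):.3f}, TPR at 1% FPR = {tpr_at_budget:.2f}")

The model with the higher overall AUC is not necessarily the one with the higher TPR inside your FPR budget.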
Interpretation Guide
| Scenario | Interpretation | Edge Case Notes |
|---|---|---|
| AUC = 0.95 | Excellent discrimination. | Check for target leakage (an AUC at or near 1.0 is suspicious). |
| AUC = 0.50 | Random guessing. | Model provides no value. |
| AUC < 0.50 | Worse than random. | Inverted labels? Maybe 0=True and 1=False in code? |
| Curves Cross | Models trade off performance. | One model is better in the low-FPR region, the other at high recall. Pick based on the business goal. |
Common Pitfall Example
Scenario: Fraud Detection (0.1% Fraud).
Model:
- AUC = 0.90 (Looks good!).
- Precision at Recall 50% = 5%.
The Problem:
- ROC looks high because True Negatives (non-fraud) flood the calculation.
- In reality, to catch 50% of fraud, you flag roughly 19 false alarms for every 1 real fraud (that is what 5% precision means).
- The model might be practically useless despite "high" AUC.
Solution: Use Precision-Recall Curve (PR-AUC).
- It ignores True Negatives and shows directly how many False Positives you incur for each True Positive.
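A minimal sketch of that contrast on a heavily imbalanced synthetic dataset (about 1% positives, loosely mirroring the fraud scenario above; exact numbers will vary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# ~1% positives: rare-event setting where ROC-AUC can look flattering
X, y = make_classification(n_samples=50000, weights=[0.99, 0.01],
                           n_informative=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

print(f"ROC-AUC: {roc_auc_score(y_te, probs):.3f}")            # typically looks strong
print(f"PR-AUC : {average_precision_score(y_te, probs):.3f}")  # typically much lower

average_precision_score is scikit-learn's standard summary of the Precision-Recall curve; the gap between the two printed numbers is the pitfall described above.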
Related Concepts
- Confusion Matrix - Metrics at a single threshold.
- Precision-Recall Curve - Better for imbalanced data.
- Binary Logistic Regression