A-B Testing

A/B Testing

Definition

Core Statement

A/B Testing (or Split Testing) is a randomized controlled experiment where two or more variants of a variable (Webpage A vs Webpage B) are shown to different segments of users at the same time to determine which version leaves the maximum impact on a specific metric (Conversion Rate).

It is the industry application of the Two-Sample Hypothesis Test.

Purpose

Causality: The only way to prove Change X caused Result Y (ruling out seasonality, trends, etc.).
Optimization: Iteratively improving products (CRO - Conversion Rate Optimization).
Risk Management: Testing risky changes on 5% of users before full rollout.

The Workflow

Metric Definition: Choose a Primary Metric (e.g., Click-Through Rate) and Guardrail Metrics (e.g., Page Latency).
Power Analysis: Determine sample size needed to detect the Minimum Detectable Effect (MDE).
Randomization: Hash User IDs to assign buckets (Control vs Treatment).
Run Test: Run for full business cycles (e.g., 1 week, 2 weeks) to capture weekday/weekend effects.
Analyze: Use statistical test (t-test or z-test).

Statistical Backend

Usually depends on the metric:

Conversion Rate (Probability): Z-Test (Two-Proportion) or Chi-Square Test.
Revenue Per User (Mean): Student's T-Test.

Limitations & Pitfalls

Pitfalls

Peeking (P-Hacking): Checking p-value every day and stopping when significant. Solution: Fix sample size in advance or use Sequential Testing.
Novelty Effect: Users click the new button just because it's new. Effect fades over time.
Interference (SUTVA violation): In social networks, treating User A might affect User B (their friend). Standard A/B test fails here.
SRM (Sample Ratio Mismatch): If you planned 50/50 split but got 48/52, your randomization is broken. Abort test.

Python Implementation

from statsmodels.stats.proportion import proportions_ztest

# Data
conversions = [400, 480]  # Control, Treatment
n_obs = [10000, 10000]    # Samples

# z-test
stat, pval = proportions_ztest(conversions, n_obs)

print(f"P-value: {pval:.4f}")
if pval < 0.05:
    print("Significant Uplift! Roll out B.")
else:
    print("No significant difference.")

Z-Test - The math engine.
Power Analysis - How many users do I need?
Sample Ratio Mismatch (SRM) - Diagnostic check.
Multi-Armed Bandit - Alternative to A/B testing (Optimization > Learning).