Simple Linear Regression

Definition

Core Statement

Simple Linear Regression (SLR) models the linear relationship between a single continuous independent variable (X) and a continuous dependent variable (Y). It fits a straight line that minimizes the sum of squared residuals (Ordinary Least Squares - OLS).


Purpose

  1. Prediction: Estimate Y for a given X.
  2. Explanation: Quantify the strength and direction of the association between X and Y.
  3. Testing: Determine if the relationship is statistically significant (β1 ≠ 0).

When to Use

Use SLR When...

  • You have one continuous predictor and one continuous outcome.
  • You believe the relationship is linear.
  • You want to understand/test the bivariate association.

Do NOT Use SLR When...

  • You have more than one predictor (use multiple regression).
  • The relationship is clearly non-linear (transform X or use a non-linear model).
  • The outcome is categorical or binary (use logistic regression instead).

Theoretical Background

The Model Equation

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
| Term | Name | Interpretation |
|------|------|----------------|
| Yi | Dependent Variable | The outcome we are predicting. |
| Xi | Independent Variable | The predictor. |
| β0 | Intercept | Expected value of Y when X = 0. |
| β1 | Slope | Change in Y for a 1-unit increase in X. |
| εi | Error Term | Random noise; captures unexplained variation. |
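
To make the model equation concrete, here is a minimal Python sketch that simulates data from it (the coefficient values are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

# Made-up parameters, purely for illustration
beta0, beta1, sigma = 1000.0, 2.5, 50.0

X = rng.uniform(0, 500, size=100)      # single continuous predictor
eps = rng.normal(0, sigma, size=100)   # error term: random noise around the line
Y = beta0 + beta1 * X + eps            # Y = β0 + β1·X + ε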

OLS Estimation

The OLS method finds β0 and β1 that minimize the Residual Sum of Squares (RSS):

$$\mathrm{RSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

Closed-Form Solutions:

$$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
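
As a sanity check, the closed-form estimates are easy to compute by hand; a minimal sketch with a tiny made-up dataset:

import numpy as np

# Tiny made-up dataset, purely for illustration
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

x_bar, y_bar = X.mean(), Y.mean()

# Slope: sample Cov(X, Y) / Var(X)
beta1_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
# Intercept: Y-bar minus slope times X-bar
beta0_hat = y_bar - beta1_hat * x_bar

print(beta1_hat, beta0_hat)  # agrees with np.polyfit(X, Y, 1)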

Coefficient of Determination (R²)

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = \frac{\text{Explained Variance}}{\text{Total Variance}}$$

Interpretation: The proportion of variance in Y explained by X.
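
Continuing with the same made-up data from the sketch above, R² can be computed directly from its definition:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

beta1_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0_hat = Y.mean() - beta1_hat * X.mean()
Y_hat = beta0_hat + beta1_hat * X          # fitted values

ss_res = np.sum((Y - Y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
print(1 - ss_res / ss_tot)                 # close to 1: X explains almost all the variance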


Assumptions (LINE)

The LINE Mnemonic

  • Linearity: The relationship between X and Y is linear.
  • Independence: Observations (and their errors) are independent of one another.
  • Normality: The residuals are approximately normally distributed.
  • Equal Variance (Homoscedasticity): The residuals have constant variance across the range of X.

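These assumptions can also be checked numerically; below is a minimal sketch using standard statsmodels tests (the data here are made up, and the pass/fail thresholds are rules of thumb, not hard rules):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan

# Made-up data that satisfies the assumptions by construction
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 200)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

# Independence: Durbin-Watson near 2 suggests no autocorrelation
print("Durbin-Watson:", durbin_watson(resid))

# Normality: Jarque-Bera p-value above 0.05 is consistent with normal residuals
_, jb_pvalue, _, _ = jarque_bera(resid)
print("Jarque-Bera p:", jb_pvalue)

# Equal variance: Breusch-Pagan p-value above 0.05 is consistent with homoscedasticity
_, bp_pvalue, _, _ = het_breuschpagan(resid, model.model.exog)
print("Breusch-Pagan p:", bp_pvalue)
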
Limitations

Pitfalls

  1. Correlation ≠ Causation: SLR identifies association, not causation. Confounders may exist.
  2. Sensitive to Outliers: Extreme points can disproportionately influence the slope and R². Check Cook's Distance (see the sketch after this list).
  3. Extrapolation is Dangerous: Predicting Y for X values outside the observed range is unreliable.
  4. Omitted Variable Bias: If a relevant variable is omitted from the model, the estimate of β1 may be biased.
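
For pitfall 2, a minimal sketch of an outlier check with Cook's Distance in statsmodels (made-up data with one injected outlier; the 4/n cutoff is a common rule of thumb):

import numpy as np
import statsmodels.api as sm

# Made-up data with one injected outlier
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)
y[0] = 50.0  # force an extreme point

model = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's Distance for every observation
cooks_d, _ = model.get_influence().cooks_distance

# Flag points exceeding the 4/n rule of thumb
print(np.where(cooks_d > 4 / len(x))[0])  # should include index 0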


Python Implementation

import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

# 1. Prepare Data (df with 'Ad_Spend' and 'Sales' columns; values are illustrative)
df = pd.DataFrame({
    'Ad_Spend': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550],
    'Sales':    [1250, 1380, 1510, 1600, 1770, 1880, 2010, 2140, 2230, 2390],
})
y = df['Sales']
X = sm.add_constant(df['Ad_Spend'])  # Add intercept column

# 2. Fit Model
model = sm.OLS(y, X).fit()

# 3. Results
print(model.summary())

# 4. Visualization
plt.scatter(df['Ad_Spend'], df['Sales'], alpha=0.6)
plt.plot(df['Ad_Spend'], model.predict(X), color='red', linewidth=2)
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.title(f"SLR: R-squared = {model.rsquared:.3f}")
plt.show()

# 5. Diagnostics
# plot_regress_exog draws a 2x2 grid of diagnostic panels, so pass it an empty figure
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(model, 'Ad_Spend', fig=fig)
plt.show()

R Implementation

# 1. Fit Model
model <- lm(Sales ~ Ad_Spend, data = df)

# 2. Results
summary(model)

# 3. Confidence Intervals for Coefficients
confint(model)

# 4. Diagnostics (Standard R Plots)
par(mfrow = c(2, 2))
plot(model)
# Plot 1: Residuals vs Fitted (Linearity, Homoscedasticity)
# Plot 2: Normal Q-Q (Normality of Residuals)
# Plot 3: Scale-Location (Homoscedasticity)
# Plot 4: Residuals vs Leverage (Influential Points)

# 5. Prediction
new_data <- data.frame(Ad_Spend = c(100, 200, 300))
predict(model, newdata = new_data, interval = "confidence")
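# For predicting individual new observations, a prediction interval
# (wider than a confidence interval) is usually more appropriate:
predict(model, newdata = new_data, interval = "prediction")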

Worked Numerical Example

Predicting Sales from Ad Spend

Data: 10 months of observations.
Result: Sales = 1000 + 2.5(Ad_Spend)

Interpretation:

  • Intercept (1000): If Ad Spend is $0, we expect $1000 in baseline sales (brand awareness, organic).
  • Slope (2.5): For every additional $1 spent on ads, Sales increase by $2.50 on average.

Prediction:

  • If Ad Spend = $500: Sales = 1000 + 2.5(500) = 1000 + 1250 = $2250.
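
A quick check of this arithmetic in Python, using the coefficients from the fitted line above:

# Coefficients from the worked example (Sales = 1000 + 2.5 * Ad_Spend)
b0, b1 = 1000, 2.5
print(b0 + b1 * 500)  # 2250.0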

Interpretation Guide

| Output | Example Value | Interpretation | Edge Case Notes |
|--------|---------------|----------------|-----------------|
| β1 (Slope) | 2.5 | Positive association; Y increases with X. | If β1 = 0, there is no linear relationship. |
| β0 (Intercept) | -50 | Value of Y when X = 0. | Negative sales are impossible; the intercept may have no physical meaning if X = 0 lies outside the data range. |
| p-value for β1 | 0.001 | Slope ≠ 0; the relationship is statistically significant. | Significance does not mean the relationship is strong, just that it is reliably non-zero. |
| R-squared | 0.65 | 65% of the variance in Sales is explained. | The remaining 35% is noise/unexplained variation. |
| R-squared | 0.01 | X explains almost nothing about Y. | Check for a non-linear (e.g., U-shaped) relationship! |

Common Pitfall Example

Extrapolation Danger

Scenario: Modeling child height vs. age (ages 2-10).
Model: Height = 80 + 6 × Age (in cm). Fits the observed data well.

Prediction Error:

  • Predict height for a 30-year-old: Height = 80 + 6(30) = 260 cm (about 8.5 feet!).

Lesson: Linear trends rarely continue forever. Never extrapolate far outside the range of your training data.
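
A minimal simulation of this failure mode (the training data are synthetic, generated from the example's growth trend, so all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(7)

# Synthetic training data following the example's growth trend (ages 2-10)
ages = rng.uniform(2, 10, 100)
heights = 80 + 6 * ages + rng.normal(0, 2, 100)

slope, intercept = np.polyfit(ages, heights, 1)  # fit the straight line

# Inside the training range the model is sensible...
print(f"Age 8:  {intercept + slope * 8:.0f} cm")   # ~128 cm, plausible

# ...far outside it, the same line is absurd
print(f"Age 30: {intercept + slope * 30:.0f} cm")  # ~260 cm for an adult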