Simple Linear Regression
Definition
Simple Linear Regression (SLR) models the linear relationship between a single continuous independent variable ($X$) and a continuous dependent variable ($Y$) by fitting a straight line to the observed data.
Purpose
- Prediction: Estimate $Y$ for a given $X$.
- Explanation: Quantify the strength and direction of the association between $X$ and $Y$.
- Testing: Determine if the relationship is statistically significant ($\beta_1 \neq 0$).
When to Use
- You have one continuous predictor and one continuous outcome.
- You believe the relationship is linear.
- You want to understand/test the bivariate association.
When Not to Use
- You have multiple predictors (use Multiple Linear Regression).
- The outcome is categorical (use Logistic Regression).
- The relationship is clearly non-linear (consider polynomial regression or a GAM).
Theoretical Background
The Model Equation

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

| Term | Name | Interpretation |
|---|---|---|
| $Y$ | Dependent Variable | The outcome we are predicting. |
| $X$ | Independent Variable | The predictor. |
| $\beta_0$ | Intercept | Expected value of $Y$ when $X = 0$. |
| $\beta_1$ | Slope | Change in $Y$ for a one-unit increase in $X$. |
| $\varepsilon$ | Error Term | Random noise; captures unexplained variation. |
OLS Estimation

The Ordinary Least Squares (OLS) method finds the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

Closed-Form Solutions:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Interpretation: The proportion of variance in $Y$ explained by $X$. For example, $R^2 = 0.70$ means 70% of the variability in $Y$ is explained by $X$.
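The closed-form estimates and $R^2$ above can be verified directly with a few lines of NumPy (the numbers here are toy values chosen only for illustration):

```python
import numpy as np

# Toy data: a nearly linear relationship (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# R^2 = 1 - SS_residual / SS_total
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y_bar) ** 2)
r_squared = 1 - ss_res / ss_tot

print(beta0, beta1, r_squared)  # intercept ≈ 0.09, slope ≈ 1.97, R^2 close to 1
```

The same estimates fall out of `np.polyfit(x, y, 1)`, which is a quick way to cross-check a hand calculation.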
Assumptions (LINE)
Limitations
- Correlation $\neq$ Causation: SLR identifies association, not causation. Confounders may exist.
- Sensitive to Outliers: Extreme points can disproportionately influence the slope and $R^2$. Check Cook's Distance.
- Extrapolation is Dangerous: Predicting $Y$ for $X$ values outside the observed range is unreliable.
- Omitted Variable Bias: If a relevant variable is missing, $\hat{\beta}_1$ may be biased.
Python Implementation
```python
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative placeholder data; replace with your own DataFrame
df = pd.DataFrame({
    'Ad_Spend': [100, 150, 200, 250, 300, 350, 400, 450, 500, 550],
    'Sales':    [1260, 1370, 1490, 1640, 1740, 1880, 2010, 2120, 2260, 2380],
})

# 1. Prepare Data
X = df['Ad_Spend']
y = df['Sales']
X = sm.add_constant(X)  # Add intercept column

# 2. Fit Model
model = sm.OLS(y, X).fit()

# 3. Results
print(model.summary())

# 4. Visualization
plt.scatter(df['Ad_Spend'], df['Sales'], alpha=0.6)
plt.plot(df['Ad_Spend'], model.predict(X), color='red', linewidth=2)
plt.xlabel("Ad Spend ($)")
plt.ylabel("Sales ($)")
plt.title(f"SLR: R-squared = {model.rsquared:.3f}")
plt.show()

# 5. Diagnostics (plot_regress_exog draws its own 2x2 grid of diagnostic plots)
fig = plt.figure(figsize=(12, 8))
sm.graphics.plot_regress_exog(model, 'Ad_Spend', fig=fig)
plt.show()
```
R Implementation
```r
# df is assumed to be a data frame with Sales and Ad_Spend columns

# 1. Fit Model
model <- lm(Sales ~ Ad_Spend, data = df)

# 2. Results
summary(model)

# 3. Confidence Intervals for Coefficients
confint(model)

# 4. Diagnostics (Standard R Plots)
par(mfrow = c(2, 2))
plot(model)
# Plot 1: Residuals vs Fitted (Linearity, Homoscedasticity)
# Plot 2: Normal Q-Q (Normality of Residuals)
# Plot 3: Scale-Location (Homoscedasticity)
# Plot 4: Residuals vs Leverage (Influential Points)

# 5. Prediction
new_data <- data.frame(Ad_Spend = c(100, 200, 300))
predict(model, newdata = new_data, interval = "confidence")
```
Worked Numerical Example
Data: 10 months of paired Ad Spend and Sales observations.
Result: Sales = 1000 + 2.5(Ad_Spend)
Interpretation:
- Intercept (1000): If Ad Spend is $0, we expect $1000 in baseline sales (brand awareness, organic).
- Slope (2.5): For every additional $1 spent on ads, Sales increase by $2.50 on average.
Prediction:
- If Ad Spend = $500:
- Sales = 1000 + 2.5(500) = 1000 + 1250 = $2250.
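The arithmetic above can be checked with a one-line helper (the coefficients are taken from the worked example, not from a real fit):

```python
def predict_sales(ad_spend, intercept=1000.0, slope=2.5):
    """Fitted line from the worked example: Sales = 1000 + 2.5 * Ad_Spend."""
    return intercept + slope * ad_spend

print(predict_sales(500))  # 1000 + 2.5 * 500 = 2250.0
```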
Interpretation Guide
| Output | Example Value | Interpretation | Edge Case Notes |
|---|---|---|---|
| Slope ($\hat{\beta}_1$) | 2.5 | Positive association; $Y$ increases with $X$. | If 0, no linear relationship. |
| Intercept ($\hat{\beta}_0$) | -50 | Value of $Y$ when $X = 0$. | Negative sales? Impossible. The intercept may have no physical meaning if $X = 0$ is outside the data range. |
| p-value (slope) | 0.001 | Slope is significantly different from 0. | Does not mean the relationship is strong, just reliable. |
| R-squared | 0.65 | 65% of the variance in Sales is explained. | 35% is noise/error. |
| R-squared | 0.01 | X explains almost nothing about Y. | Check for a non-linear U-shape! |
Common Pitfall Example
Scenario: Modeling child height vs. age (ages 2-10).
Model: Height = 80 + 6(Age), in cm. (Fits the observed data well.)
Prediction Error:
- Predict height for a 30-year-old:
- Height = 80 + 6(30) = 260 cm (8.5 feet!).
Lesson: Linear trends rarely continue forever. Never extrapolate far outside the range of your training data.
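The failure mode is easy to demonstrate in code (the coefficients come from the toy example above):

```python
def predict_height_cm(age_years):
    """Toy model from the example above: Height = 80 + 6 * Age, fit on ages 2-10."""
    return 80 + 6 * age_years

print(predict_height_cm(8))   # inside the training range: 128 cm, plausible
print(predict_height_cm(30))  # far outside the range: 260 cm, clearly absurd
```

A production version of such a model would typically warn or refuse when the input falls outside the range the model was trained on.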
Related Concepts
- Multiple Linear Regression - More than one predictor.
- Pearson Correlation - Related but different (correlation $\neq$ slope).
- Residual Analysis - Checking assumptions.
- Cook's Distance - Identifying influential observations.