Multiple Linear Regression Assumptions and Diagnostics

Key Assumptions of Multiple Linear Regression (MLR)

Important Information from Midterm:

  • MLR1: Linearity -> The model is linear in parameters. (In practice this means the model is correctly specified: residuals average to zero at every level of the fitted values and predictors.)
  • MLR2: Random Sample -> The data is randomly sampled from the population, ensuring that the sample represents the population.
  • MLR3: No Perfect Collinearity -> The independent variables are not perfectly correlated, so the model can estimate coefficients uniquely.
  • MLR4: Zero Conditional Mean -> The error term has an expected value of zero conditional on the independent variables. This is critical for unbiasedness, as it implies no omitted variable bias or endogeneity.
  • MLR5: Homoskedasticity -> The variance of the error term is constant across all levels of the independent variables. This ensures valid standard errors for hypothesis testing and confidence intervals.
  • MLR6: Error Term is Normally Distributed -> The error term follows a normal distribution. Important for small samples to ensure that hypothesis testing and confidence intervals are valid (by the Central Limit Theorem, normality is less critical for large samples).

Unbiasedness: MLR1–MLR4.

Inference: MLR5, MLR6 (in addition to MLR1–MLR4).

MLR Assumptions: Causes, Diagnostics, and Handling

Each assumption below lists what causes a violation, how to diagnose it, how to handle it, and the relevant R code.
MLR1: Linearity

Causes:
  • Incorrect model specification.
  • Missing interaction terms or non-linear relationships.

Diagnose:
  • Plot residuals vs. fitted values (should show no pattern).
  • Use scatterplots of independent vs. dependent variables.
  • RESET test: an F-test that compares the original model to one augmented with squared and cubed fitted values. H0: the model is well-specified. A significant p-value is not an instruction to add polynomial terms, but a signal to investigate the relationship between the predictors and the outcome further.

Handle:
  • Check the other assumptions.
  • Return to data processing.
  • Log transformations / quadratic terms.
  • Interaction terms.
  • Alternative models.

plot(m1, 1) -> Residuals vs fitted

library(lmtest)
resettest(m1, power = 2:3, type = "fitted")  # or type = "regressors"

MLR2: Random Sample

Causes:
  • Non-representative or biased sample.
  • Sampling errors or convenience sampling.
  • Missing values, outliers, non-random samples.

*In small samples, outliers can have a large effect on model estimates, distorting predictions and coefficient values; their impact diminishes as N increases.

Diagnose:
  • Inspect data collection methods.
  • Check descriptive statistics for sample characteristics.
  • NAs at random: no bias, but data is lost, so less information -> higher standard errors.
  • Outlier types: extreme values (far outside the typical range, often in the Y variable -> large residuals), leverage points (unusual X values), and influential points (observations that substantially affect model estimates: predicted values and parameter estimates).

Handle:

Over/under-representation: weighting / post-stratification.

NAs: (1) listwise deletion (reduces statistical power); (2) mean/median imputation (shrinks variance -> bias); (3) regression imputation (useful when NAs are not random, but overstates confidence in predictions); (4) flagging NAs (combines imputation with a binary indicator; useful when missingness carries information, but may introduce bias); (5) missing category (for categorical variables -> "unknown/undisclosed").
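A minimal sketch of options (2) and (4) above, using dplyr on a toy data frame (all variable names and values are illustrative):

```r
library(dplyr)

# Toy data frame with one missing income value
df <- data.frame(income = c(25, 40, NA, 55), age = c(23, 35, 41, 52))

df_handled <- df %>%
  mutate(
    income_missing = ifelse(is.na(income), 1, 0),        # (4) flag the NA
    income_imp = ifelse(is.na(income),
                        mean(income, na.rm = TRUE),      # (2) mean imputation
                        income)
  )
```

Flagging preserves the information that the value was missing; mean imputation shrinks the variance of the variable, so standard errors will be understated.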

Outliers: (1) Extreme values: studentized residuals (residual divided by its estimated standard deviation, adjusted for leverage); externally studentized = the observation itself is excluded when estimating the standard deviation. (2) Leverage: distance between an observation's X value and the mean of X; high-leverage points can pull the regression line. (3) Influence: on predictions, DFFITS or residuals-vs-leverage plots; on parameter estimates, DFBETAS or coefficient comparisons (before/after excluding the point).

Solutions for outliers: cap extreme values (winsorizing); add, drop, or transform variables; fix data-entry issues; remove observations outside the intended population.

Extreme values, via studentized residuals: dataset %>% mutate(res_stud = rstudent(m1), res_stud_large = ifelse(between(res_stud, -3, 3), "Normal", "Extreme"))

Leverage and influence: residuals-vs-leverage plot with Cook's distance contours: plot(m1, 5)
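The DFFITS and DFBETAS measures mentioned above ship with base R's stats package; a sketch using the built-in mtcars data (m1 is a stand-in model, and the cutoffs are common rules of thumb, not hard thresholds):

```r
m1 <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
k <- length(coef(m1))  # number of estimated parameters

# Influence on predictions: a common cutoff is |DFFITS| > 2 * sqrt(k / n)
dffits_vals <- dffits(m1)

# Influence on each coefficient: a common cutoff is |DFBETAS| > 2 / sqrt(n)
dfbetas_vals <- dfbetas(m1)

# Cook's distance combines residual size and leverage
head(sort(cooks.distance(m1), decreasing = TRUE), 3)  # most influential rows
```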

MLR3: No Perfect Collinearity

Causes:
  • Independent variables that are perfectly correlated.
  • N too small relative to the number of parameters: we need at least k + 1 observations to estimate k + 1 parameters (the intercept plus k slopes).
  • Dummy variable trap (e.g., including all categories of a categorical variable alongside the intercept).

Diagnose:
  • Calculate the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated by multicollinearity with the other predictors. A high VIF (e.g., > 10) signals potential multicollinearity.
  • Look at correlation matrices.

Why does it matter? High multicollinearity reduces the reliability of individual predictors, making it hard to assess their true relationship with the outcome variable. A high degree of linear relationship between x1 and x2 leads to large variances for the OLS estimators; what ultimately matters is how big β̂j is relative to its standard error.

Handle:
  • Combine highly correlated variables, or drop one.
  • Increase the sample size to help disentangle variable effects.

numeric_vars <- laptops %>% select(Price_euros, Inches, Ram, Weight, ScreenW, ScreenH, CPU_freq, PrimaryStorage)

correlation_matrix <- round(cor(numeric_vars, use = "complete.obs"), 2)

library(car)
vif(m1)

MLR4: Zero Conditional Mean

Causes:
  • Omitted variable bias.
  • Measurement error in the independent variables.
  • Endogeneity (correlation between predictors and the error term).

Diagnose:
  • Perform a residual analysis.
  • Use theoretical reasoning to identify potential omitted variables.
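Omitted variable bias can be made concrete with a small simulation (all names and effect sizes are illustrative): x2 affects y and is correlated with x1, so dropping x2 pushes it into the error term and biases the coefficient on x1.

```r
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)          # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)  # correctly specified
omitted <- lm(y ~ x1)       # x2 in the error term -> E[u | x1] != 0

coef(full)["x1"]     # close to the true value 2
coef(omitted)["x1"]  # roughly 2 + 3 * 0.8 = 4.4, biased upward
```

The direction of the bias follows the usual formula: true effect plus (effect of the omitted variable) × (slope of the omitted variable on the included one).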
MLR5: Homoskedasticity

Causes:
  • Heteroskedasticity (non-constant error variance).
  • Outliers or data skewness.

Diagnose:
  • Breusch-Pagan or White test.
  • Plot residuals vs. fitted values (look for patterns such as a funnel shape).
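A sketch of the Breusch-Pagan test plus a common remedy, heteroskedasticity-robust standard errors (m1 is a stand-in model fitted on the built-in mtcars data):

```r
library(lmtest)
library(sandwich)

m1 <- lm(mpg ~ wt + hp, data = mtcars)

# Breusch-Pagan test; H0: homoskedastic errors
bp <- bptest(m1)
bp

# If H0 is rejected, report heteroskedasticity-robust (HC) standard errors
coeftest(m1, vcov = vcovHC(m1, type = "HC1"))
```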
MLR6: Normality of Errors

Causes:
  • Small sample size.
  • Extreme outliers.
  • Skewed or otherwise non-normal error distribution.

Diagnose:
  • Q-Q plot of residuals.
  • Shapiro-Wilk or Kolmogorov-Smirnov test.
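Both diagnostics in a short sketch (m1 is a stand-in model on the built-in mtcars data):

```r
m1 <- lm(mpg ~ wt + hp, data = mtcars)

plot(m1, 2)                   # Q-Q plot of standardized residuals
sw <- shapiro.test(resid(m1)) # H0: residuals are normally distributed
sw
```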

Difference in Differences (DiD) Framework

The DiD framework is a widely used method for policy evaluation that mimics random assignment using:

  • A treatment group affected by an intervention (e.g., homes near the incinerator).
  • A control group unaffected by the intervention (e.g., homes far from the incinerator).
  • Pre- and post-intervention data for both groups.
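The canonical DiD specification is a regression with a treatment-group dummy, a post-period dummy, and their interaction; the interaction coefficient is the DiD estimate. A sketch on simulated data (all names and effect sizes are illustrative):

```r
set.seed(1)
n     <- 400
treat <- rep(c(0, 1), each = n / 2)   # 1 = homes near the incinerator
post  <- rep(c(0, 1), times = n / 2)  # 1 = after the intervention
# Simulated true effect on treated homes after the intervention: -10
price <- 100 + 5 * treat + 3 * post - 10 * treat * post + rnorm(n, sd = 2)

did <- lm(price ~ treat * post)
coef(did)["treat:post"]  # the DiD estimate, close to -10
```

The interaction term nets out both the fixed group difference (treat) and the common time trend (post), which is why the design requires pre- and post-period data for both groups.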