Multiple Linear Regression Assumptions and Diagnostics
Key Assumptions of Multiple Linear Regression (MLR)
Important Information from Midterm:
- MLR1: Linearity -> The model is linear in parameters. (Diagnostically: residuals have a mean of zero at every level of the fitted values and predictors.)
- MLR2: Random Sample -> The data is randomly sampled from the population, ensuring that the sample represents the population.
- MLR3: No Perfect Collinearity -> The independent variables are not perfectly correlated, so the model can estimate coefficients uniquely.
- MLR4: Zero Conditional Mean -> The error term has an expected value of zero conditional on the independent variables. This is critical for unbiasedness, as it implies no omitted variable bias or endogeneity.
- MLR5: Homoskedasticity -> The variance of the error term is constant across all levels of the independent variables. This ensures valid standard errors for hypothesis testing and confidence intervals.
- MLR6: Error Term is Normally Distributed -> The error term follows a normal distribution. Important for small samples to ensure that hypothesis testing and confidence intervals are valid (by the Central Limit Theorem, normality is less critical for large samples).
Unbiasedness requires MLR1–MLR4.
Valid inference (standard errors, t- and F-tests) requires MLR5 and MLR6, in addition to MLR1–MLR4.
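MLR1–MLR4 are what make OLS unbiased. As a minimal simulation sketch (the coefficients, sample sizes, and data-generating process below are illustrative choices, not from the notes), repeated samples drawn under those assumptions give OLS estimates that average out to the true parameters:

```python
import numpy as np

# Simulate repeated samples from a model satisfying MLR1-MLR4
# (linear in parameters, random sampling, no perfect collinearity,
# errors independent of the regressors) and average the OLS estimates.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0, -0.5])  # intercept, slope on x1, slope on x2
n, reps = 200, 500

estimates = np.empty((reps, len(beta_true)))
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept
    u = rng.normal(size=n)                     # E[u | X] = 0 (MLR4)
    y = X @ beta_true + u                      # linear in parameters (MLR1)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

bias = estimates.mean(axis=0) - beta_true      # close to zero under MLR1-MLR4
```

Making `u` depend on `x1` (violating MLR4) pushes `bias` away from zero, which is exactly the omitted-variable/endogeneity problem the notes flag.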
MLR Assumptions: Causes, Diagnostics, and Handling
| Assumption | What Causes This? | How to Diagnose? | How to Handle? | R Code |
| --- | --- | --- | --- | --- |
| MLR1 | | | | |
| MLR2 | Over-/under-representation of groups; missing values (NAs); outliers* | Extreme values: studentized residuals (residual / its standard deviation, adjusted for leverage; externally studentized = error variance estimated excluding the observation itself). Leverage: distance between an observation's X value and the mean of X; high-leverage points can pull the regression line. Influence on predictions: DFFITS or residuals-vs-leverage plots. Influence on parameter estimates: DFBETAS or coefficient comparisons before/after excluding the point. | Over-/under-representation: weighting / post-stratification. NAs: (1) listwise deletion (reduces statistical power); (2) mean/median imputation (shrinks variance -> bias); (3) regression imputation (overstates confidence in predictions when NAs are not random); (4) flagging NAs (imputation plus a binary indicator; may introduce bias when missingness carries information); (5) a "missing" category for categorical variables (unknown/undisclosed). Outliers: cap extreme values (winsorizing); add, drop, or transform variables; fix data-entry issues; remove observations outside the intended population. | Extreme values: studentized residuals; leverage; Cook's distance (residuals vs. leverage) |
| MLR3 | | | | |
| MLR4 | | | | |
| MLR5 | | | | |
| MLR6 | | | | |

*In small samples, outliers can have a large effect on model estimates, distorting predictions and coefficient values; their impact diminishes as N increases.

Why does MLR3 matter? High multicollinearity reduces the reliability of individual predictors, making it hard to assess their true relationship with the outcome variable.
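The MLR2 row names studentized residuals, leverage, and Cook's distance but leaves the R-code cell as labels only. As a hedged sketch (simulated data with one deliberately injected outlier; all numbers are illustrative), the same quantities can be computed by hand in numpy — in R the analogues are `rstudent()`, `hatvalues()`, `cooks.distance()`, `dffits()`, and `dfbetas()` on an `lm` fit:

```python
import numpy as np

# Simulated simple regression with one injected vertical outlier.
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[0] += 8.0                                  # make observation 0 an outlier

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
rss = resid @ resid

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
h = np.diag(H)                               # leverage: how far x_i sits from the bulk of X
s2 = rss / (n - p)                           # usual error-variance estimate

# Internally studentized residuals: each residual scaled by its own std. dev.
r_int = resid / np.sqrt(s2 * (1 - h))

# Externally studentized: error variance re-estimated *excluding* observation i.
s2_i = (rss - resid**2 / (1 - h)) / (n - p - 1)
r_ext = resid / np.sqrt(s2_i * (1 - h))

# Cook's distance: joint influence on all fitted values (residual x leverage).
cooks_d = r_int**2 * h / (p * (1 - h))
```

The injected point shows up as the largest externally studentized residual, which is why excluding the observation from its own variance estimate makes extreme values easier to detect.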
Difference in Differences (DiD) Framework
The DiD framework is a widely used method for policy evaluation that mimics random assignment using:
- A treatment group affected by an intervention (e.g., homes near the incinerator).
- A control group unaffected by the intervention (e.g., homes far from the incinerator).
- Pre- and post-intervention data for both groups.
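The three ingredients above reduce to a difference of group-mean differences. A minimal arithmetic sketch (the four group means below are made-up numbers, not the incinerator data):

```python
# Mean outcomes (e.g., average home price) in four cells; values are made up.
treat_pre, treat_post = 100.0, 95.0        # homes near the incinerator
control_pre, control_post = 100.0, 103.0   # homes far from the incinerator

# DiD estimate: change in the treatment group net of the change in the
# control group, which absorbs the common time trend under parallel trends.
did = (treat_post - treat_pre) - (control_post - control_pre)
print(did)  # -8.0: an 8-unit relative drop associated with the intervention
```

In regression form, the same number is the coefficient on the interaction term in y = b0 + b1*treated + b2*post + b3*(treated*post) + u.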