Linear Regression Models: Building, Selection, and Variables

Linear Regression Models

Theoretical/Population Model

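µy = β0 + β1x1 + β2x2 + β3xi + β4xf. With a single explanatory variable, y = β0 + β1x + ε where ε ∼ N(0, σ), which describes the relationship between x and y as a simple linear regression. With several x's, as in the first equation, the model is a multiple linear regression.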

  • µy — average (mean) value of y in population for fixed value of x
  • β0 — population intercept – the mean value of the response when the explanatory variable is 0
  • β1 — population slope – the mean change in the response for every one unit increase in the explanatory variable
  • xi and xf are qualitative variables (for example, long or short hair)
  • y — actual value of the response
  • ε — population error, i.e. y − µy: the difference between the actual value of the response and the mean response from the model

Estimated/Predicted Model

ŷ = b0 + b1x1 + b2x2 + b3xi + b4xf, where ŷ is the predicted value. b0, b1, and the other b's are the sample statistics that estimate the population parameters β0, β1, and so on.

  • ŷ — Predicted value of y for a given value of x
  • b0 — Predicted value of y when x is 0
  • b1 — Predicted change in y for every one unit increase in x
  • xi and xf are qualitative variables (for example, long or short hair)
  • e — residual, the difference between the actual and predicted response (y − ŷ)
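
As a rough illustration of the estimated model, the sketch below fits ŷ = b0 + b1x1 + b2xL by least squares with NumPy and computes the residuals e = y − ŷ. The data values and the names `x1`, `xL`, `A`, and `b` are made up for this example and do not come from these notes.

```python
import numpy as np

# Made-up illustration data (an assumption, not from the notes):
# x1 is quantitative, xL is a 0/1 indicator for a qualitative variable.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
xL = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = np.array([3.1, 6.9, 7.2, 11.1, 10.8, 15.2])

# Design matrix: a column of ones for the intercept b0, then one column per x.
A = np.column_stack([np.ones_like(x1), x1, xL])

# Least-squares estimates b = (b0, b1, b2).
b, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ b   # predicted values
e = y - y_hat   # residuals: actual minus predicted

print("b0, b1, b2 =", b)
print("residuals  =", e)
```

Because the model includes an intercept, the residuals from a least-squares fit sum to zero up to rounding, which is a quick sanity check on the output.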

Model Building (Forward Selection and Backwards Elimination)

Take note of the probability to leave. A variable whose p-value is larger than the probability to leave can be eliminated from the model.

Forward Selection

Step 1:

  • Model 1: x1, Prob > |t| < .0001, t-ratio = 30
  • Model 2: x2, Prob > |t| < .0001, t-ratio = 60
  • Model 3: x3, Prob > |t| < .0001, t-ratio = 90 <– Include this variable (it has the lowest p-value, as indicated by the greatest t-ratio in absolute value)

Step 2:

Consider only those of models 4, 5, and 6 that include x3, then repeat the process: among the variables being added, find the one with the lowest p-value (greatest t-ratio).
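
The two forward-selection steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are simulated, the helper names `t_ratios` and `forward_step` are hypothetical, and the sketch ranks candidates by |t-ratio| without applying any stopping rule. In practice the t-ratios and p-values would be read from software output such as JMP.

```python
import numpy as np

def t_ratios(X, y):
    """Fit y on X (plus an intercept) by least squares; return the t-ratio of each slope."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    df = n - A.shape[1]                    # error degrees of freedom
    s2 = resid @ resid / df                # estimate of the error variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(A.T @ A)))
    return (b / se)[1:]                    # drop the intercept's t-ratio

def forward_step(data, y, included):
    """Try adding each remaining variable; return the one whose slope has the largest |t|."""
    best_name, best_t = None, 0.0
    for name in data:
        if name in included:
            continue
        X = np.column_stack([data[c] for c in included + [name]])
        t_new = t_ratios(X, y)[-1]         # t-ratio of the newly added variable
        if abs(t_new) > abs(best_t):
            best_name, best_t = name, t_new
    return best_name, best_t

# Simulated data (an assumption): y depends strongly on x3, weakly on x1, not on x2.
rng = np.random.default_rng(0)
n = 200
data = {name: rng.normal(size=n) for name in ("x1", "x2", "x3")}
y = 1 + 2 * data["x1"] + 5 * data["x3"] + rng.normal(scale=0.5, size=n)

first, t_first = forward_step(data, y, [])          # Step 1: one-variable models
second, t_second = forward_step(data, y, [first])   # Step 2: models that keep `first`
print(first, round(float(t_first), 1))
print(second, round(float(t_second), 1))
```

Ranking by |t-ratio| is equivalent to ranking by p-value here because every candidate model in a given step has the same error degrees of freedom.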

Backwards Elimination

Step 1: (Probability to leave = .02)

Model 7 contains x1, x2, and x3:

  • x1: Prob > |t| < .0001
  • x2: Prob > |t| < .0001
  • x3: Prob > |t| = .0483 <– Exclude this variable from the model (it has the highest p-value, and that p-value is larger than the probability to leave)

Step 2:

Find the model that contains only x1 and x2 (from models 4, 5, 6). The variable with the highest p-value is removed if that p-value is larger than the probability to leave; otherwise the process stops and the current model is kept.
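
The decision rule for a single backward-elimination step can be written as a small Python sketch. The function name `backward_step` is hypothetical, and the p-values reported as "< .0001" are entered as 0.0001, which is an assumption standing in for an upper bound.

```python
def backward_step(p_values, prob_to_leave):
    """Return the variable to remove, or None if every variable stays.

    p_values maps each variable in the current model to its p-value.
    """
    worst = max(p_values, key=p_values.get)   # variable with the highest p-value
    if p_values[worst] > prob_to_leave:
        return worst
    return None

# Step 1 from the notes: model 7 with probability to leave = .02
model_7 = {"x1": 0.0001, "x2": 0.0001, "x3": 0.0483}
print(backward_step(model_7, 0.02))

# Step 2: the model with only x1 and x2 (same assumed upper bounds)
print(backward_step({"x1": 0.0001, "x2": 0.0001}, 0.02))
```

The first call selects x3 for removal because .0483 is larger than .02. The second call returns `None`, meaning no variable leaves and elimination stops.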

Multicollinearity

Occurs when two or more explanatory variables in the model are highly correlated and therefore potentially provide redundant information about the response.

Methods to check for multicollinearity:

  • The estimated slope has the opposite sign from the correlation between that explanatory variable and the response.
  • Large VIF (> 10) <– found on JMP output
  • Strong correlations between explanatory variables (look at the scatterplot matrix or the matrix of correlations).
  • Strong correlation between an explanatory variable and the response despite a large p-value for the test of the corresponding slope.
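
Two of the checks above, the correlation matrix and the VIF, can be sketched with NumPy. The VIF for a variable is 1 / (1 − R²), where R² comes from regressing that variable on the other explanatory variables; the sketch uses the equivalent shortcut that the VIFs are the diagonal of the inverse correlation matrix. The data are simulated and the function name `vif` is hypothetical.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (rows = observations, columns = explanatory variables)."""
    corr = np.corrcoef(X, rowvar=False)    # correlations between explanatory variables
    return np.diag(np.linalg.inv(corr))    # diagonal of the inverse = VIFs

# Simulated data (an assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
print(np.corrcoef(X, rowvar=False).round(2))   # look for strong pairwise correlations
print(vifs.round(1))                           # values above 10 flag multicollinearity
```

In this setup x1 and x2 should show VIFs well above 10, while x3 stays close to 1.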

Categorical Variables – Coding Indicator Variables

(2 levels)

An indicator variable codes the levels of a categorical variable as 0 or 1. For example, with 2 levels (hair type): xL = 1 if long hair, 0 if short hair.

Example equation – x1 is years of experience… (Long hair) µL = β0 + β1x1 + β2xL = β0 + β1x1 + β2, since xL = 1, or (Short hair) µS = β0 + β1x1, since xL = 0

(3 levels)

Job title, such as secretary, manager, or computer technician. The number of indicators needed is one less than the number of levels (in this case, 2).

x1 = years of experience,

  • xL = 1 if long hair, 0 if short hair
  • xs = 1 if secretary, 0 if not
  • xm = 1 if manager, 0 if not

Example equation: µy = β0 + β1x1 + β2xL + β3xs + β4xm
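
The indicator coding can be sketched in Python as below. Computer technician is the baseline level, since it is the job title with xs = 0 and xm = 0. The function names and the coefficient values in `beta` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def code_hair(hair):
    """xL = 1 if long hair, 0 if short hair."""
    if hair not in ("long", "short"):
        raise ValueError(f"unknown hair type: {hair!r}")
    return 1 if hair == "long" else 0

def code_job_title(job_title):
    """Return (xs, xm); computer technician is the baseline level (0, 0)."""
    if job_title not in ("secretary", "manager", "computer technician"):
        raise ValueError(f"unknown job title: {job_title!r}")
    xs = 1 if job_title == "secretary" else 0
    xm = 1 if job_title == "manager" else 0
    return xs, xm

def mean_response(x1, hair, job_title, beta):
    """µy = β0 + β1x1 + β2xL + β3xs + β4xm for the given person."""
    b0, b1, b2, b3, b4 = beta
    xL = code_hair(hair)
    xs, xm = code_job_title(job_title)
    return b0 + b1 * x1 + b2 * xL + b3 * xs + b4 * xm

# Hypothetical coefficients (β0, β1, β2, β3, β4), not estimates from any data.
beta = (20, 2, 1, 3, 10)

for title in ("secretary", "manager", "computer technician"):
    print(title, code_job_title(title), mean_response(5, "short", title, beta))
```

With three levels, two indicators are enough: each job title gets its own intercept shift (β3 or β4) relative to the baseline.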

Interaction

An interaction between two explanatory variables does not imply that the two variables must be correlated. It only implies that the effect of one variable on the response depends on the level of the other variable.

First order model –> µy = β0 + β1x1 + β2x2

Higher order model –> µy = β0 + β1x1 + β2x2 + β3x1x2

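Population model: µy = β0 + β1x1 + β2x2 + β3x1x2

– Increasing x1 by 1 with x2 held fixed, the mean becomes β0 + β1x1 + β1 + β2x2 + β3x1x2 + β3x2

– Increasing x2 by 1 with x1 held fixed, the mean becomes β0 + β1x1 + β2x2 + β2 + β3x1x2 + β3x1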

(β1 + β3x2) represents the change in µy for every 1-unit increase in x1, holding x2 fixed

(β2 + β3x1) represents the change in µy for every 1-unit increase in x2, holding x1 fixed
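
A short Python check makes the interaction slopes concrete: the change in µy for a 1-unit increase in x1 equals β1 + β3x2, so it differs at each value of x2. The coefficient values in `beta` are hypothetical integers chosen so the arithmetic is exact.

```python
def mu(x1, x2, beta):
    """Mean response for the model with interaction: β0 + β1x1 + β2x2 + β3x1x2."""
    b0, b1, b2, b3 = beta
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Hypothetical coefficients (β0, β1, β2, β3).
beta = (10, 2, 3, 4)
b0, b1, b2, b3 = beta

x1 = 3
for x2 in (0, 1, 5):
    change = mu(x1 + 1, x2, beta) - mu(x1, x2, beta)   # raise x1 by 1, hold x2 fixed
    print(f"x2 = {x2}: change in mean = {change}, β1 + β3x2 = {b1 + b3 * x2}")

x2 = 5
change = mu(x1, x2 + 1, beta) - mu(x1, x2, beta)       # raise x2 by 1, hold x1 fixed
print(f"x1 = {x1}: change in mean = {change}, β2 + β3x1 = {b2 + b3 * x1}")
```

When β3 = 0 the two slopes reduce to β1 and β2, which is the first order model with no interaction.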