Multiple Regression Analysis in Estimating Annual Salary

Posted on May 13, 2024 in Mathematics

Multiple Regression Analysis

Multiple regression analysis is an extension of simple regression analysis. It uses two or more independent variables to estimate the value of a dependent variable. The method of least squares identifies the best-fitting equation (a plane or hyperplane rather than a line) by minimizing the sum of squared differences between the observed and predicted values.
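As a rough illustration of the least squares idea, the sketch below fits a model with two independent variables by solving the normal equations in plain Python. The function names and the data are hypothetical, chosen only for this example; the data are generated without noise from y = 1 + 2 * x1 + 3 * x2, so the fit should recover those coefficients up to rounding.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    # A leading column of ones lets the intercept b0 be estimated too.
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: y = 1 + 2*x1 + 3*x2 exactly (no noise).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_least_squares(rows, y)
print([round(b, 6) for b in coefficients])
```

In practice a statistics package would do this fit and also report standard errors and p-values; the point here is only that the coefficients come from minimizing squared residuals.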

Additional Concepts

Constant (b₀ intercept): The value of the dependent variable (DV) in the regression equation when all independent variables (IVs) are equal to 0.
Partial Regression Coefficient (conditional coefficient): Shows the impact of one IV on the DV while keeping other IVs constant.
F-test: Assesses whether the overall model (all IVs together) is significant.
T-test (partial regression coefficient): Evaluates if each independent variable contributes significantly to the model.
Stepwise Regression Analysis: A method of selecting variables for a regression model by adding (forward) or removing (backward) them one at a time based on specific criteria.
Coefficient of Multiple Correlation (R): Represents the extent of the relationship between independent variables taken as a group and the dependent variable. It shows how well the group of IVs predicts the DV.
Coefficient of Multiple Determination (R²): The proportion of variance in the DV that is explained by the IVs.
Adjusted Coefficient of Multiple Determination (Adjusted R²): A modified version of R² that adjusts for the number of predictors in the model. Because R² never decreases when an independent variable is added, even a useless one, it can overstate the fit; adjusted R² penalizes each additional predictor and so gives a more reliable measure.
Coefficient of Partial Correlation: Indicates the correlation between one IV and the DV, while other independent variable(s) are held constant.
Coefficient of Partial Determination: The squared value of the coefficient of partial correlation. It indicates the proportion of variance statistically accounted for by one particular independent variable, with the other independent variable(s) held constant.
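For the case of exactly two independent variables, the coefficient of partial correlation can be computed from the three simple correlations using the standard first-order formula. The sketch below assumes that two-predictor case; the function name and the correlation values are hypothetical.

```python
import math


def partial_correlation(r_y1, r_y2, r_12):
    """Correlation between Y and X1 with X2 held constant.

    r_y1: simple correlation between Y and X1
    r_y2: simple correlation between Y and X2
    r_12: simple correlation between X1 and X2
    """
    return (r_y1 - r_y2 * r_12) / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))


# Hypothetical simple correlations, for illustration only.
r = partial_correlation(r_y1=0.6, r_y2=0.5, r_12=0.4)
partial_determination = r ** 2  # squared partial correlation
print(round(r, 4), round(partial_determination, 4))
```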

Dummy (Indicator) Variables

Dummy variables are used to include qualitative data in a regression model. They are typically coded as 0 and 1 (e.g., female or male). The coding simply represents a difference in category; which category gets 0 and which gets 1 is a convention and does not imply that one is better than the other. For a qualitative variable with k categories, k – 1 indicators are needed to avoid redundancy; the category without an indicator serves as the baseline. For instance, gender (2 categories) needs 1 indicator.
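The k – 1 rule can be sketched in a few lines. The helper below is hypothetical (not from any particular library): it builds one 0/1 column per category except the chosen baseline.

```python
def dummy_code(values, baseline):
    """Return k - 1 indicator columns (0/1), omitting the baseline category."""
    categories = sorted(set(values))
    if baseline not in categories:
        raise ValueError("baseline must be one of the observed categories")
    return {
        category: [1 if v == category else 0 for v in values]
        for category in categories
        if category != baseline
    }


# Two categories -> one indicator.
gender = ["female", "male", "male", "female"]
print(dummy_code(gender, baseline="female"))
# {'male': [0, 1, 1, 0]}

# Three categories -> two indicators; "A" is the baseline.
division = ["A", "B", "C", "A", "B"]
print(dummy_code(division, baseline="A"))
# {'B': [0, 1, 0, 0, 1], 'C': [0, 0, 1, 0, 0]}
```

A row with 0 in every indicator column belongs to the baseline category, which is why a separate column for it would be redundant.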

Regression Statistics Example

Table 2: Regression Statistics

Multiple Correlation Coefficient (Multiple R): The correlation between the observed values of the dependent variable and the values predicted from the independent variables (X₁, X₂, etc.) taken together. A value of X shows how strong that combined relationship is.
R Square (Multiple Determination Coefficient): Our regression model explains X% of the variance in the dependent variable. The higher the R square value, the better the model fits the data.
Adjusted R Square: Even after adjusting for the number of independent variables in the model, it still explains around X% of the variance in the dependent variable. This indicates a very good model. R square never decreases when we add factors to the model, whereas adjusted R square increases only if the added factor genuinely improves the fit.
Standard Error: Predictions of the dependent variable made using our model will typically differ from the actual values by about X dollars.
Observations: Our model is based on X houses (observations), each with a value for every variable. This is a small sample.
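The statistics in this table can all be derived from the observed values, the predicted values, and the number of independent variables. The sketch below uses the usual textbook formulas; the function name and the numbers are hypothetical, and it assumes the predictions come from a least squares fit with an intercept (otherwise Multiple R is not simply the square root of R square).

```python
import math


def regression_statistics(y, y_hat, k):
    """Summary statistics for a fitted model with k independent variables."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    r_square = 1 - ss_residual / ss_total
    # Penalize for the number of predictors k.
    adjusted = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
    return {
        "multiple_r": math.sqrt(r_square),
        "r_square": r_square,
        "adjusted_r_square": adjusted,
        # Typical size of a prediction error, in the units of y.
        "standard_error": math.sqrt(ss_residual / (n - k - 1)),
        "observations": n,
    }


# Hypothetical observed and predicted values, two independent variables.
y = [2, 4, 6, 8, 10, 12]
y_hat = [3, 4, 6, 8, 10, 11]
stats = regression_statistics(y, y_hat, k=2)
for name, value in stats.items():
    print(name, round(value, 4))
```

Note that adjusted R square is always at most R square, and the gap widens as k grows relative to the number of observations.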

Table 1: Coefficient Table

Formal Regression Equation: Y = b₀ + b₁ * X₁ + b₂ * X₂ + b₃ * X₃ + b₄ * X₄ + b₅ * X₅

Coefficient:
Intercept = b₀ = Constant: When all independent variables (X₁, X₂, etc.) are equal to 0, the predicted value of the dependent variable (Y) is X.
X₁ = b₁ = Partial Regression Coefficient: For each additional unit of X₁, Y changes by b₁ on average, holding the other independent variables constant. Here, that is a change of around X dollars.
X₂ = b₂: If X₂ increases by 1 unit, the dependent variable will increase on average by X dollars, holding the other independent variables constant.
X_3a = b₃: Division A has a coefficient of 0 (it serves as the baseline category), so locating a house there does not change the predicted price.
X_3b = b₄: A house in X_3b is predicted to be around X dollars more expensive than a comparable house in Division A, suggesting that X_3b is more prestigious than Division A.
Standard Error: Measures the precision of each coefficient estimate. Coefficient / Standard Error = t-statistic.
P-value:
Lower 95% & Upper 95%: Confidence interval
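The t-statistic and the 95% confidence interval in the coefficient table both follow from the coefficient and its standard error. In the sketch below the coefficient, standard error, and degrees of freedom are hypothetical, and the critical t value is taken from a t table rather than computed.

```python
def t_statistic(coefficient, standard_error):
    """t-statistic for testing whether a coefficient differs from 0."""
    return coefficient / standard_error


def confidence_interval(coefficient, standard_error, t_critical):
    """Lower and upper bounds: coefficient +/- t_critical * standard error."""
    margin = t_critical * standard_error
    return coefficient - margin, coefficient + margin


# Hypothetical coefficient and standard error.
b1, se_b1 = 800.0, 200.0
# Two-tailed 95% critical t value for 21 degrees of freedom (from a t table).
t_crit = 2.080

t = t_statistic(b1, se_b1)
lower, upper = confidence_interval(b1, se_b1, t_crit)
print(round(t, 2), round(lower, 2), round(upper, 2))
```

If the interval does not contain 0, the coefficient is statistically significant at the corresponding level (here 5%).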

Table 3: ANOVA Analysis

F-statistic: MS Regression / MS Residual, where each MS (mean square) is the corresponding SS (sum of squares) divided by its degrees of freedom.
Significance F = Probability (risk): The p-value of the F-test. Our model is statistically significant at the (1 – Significance F) level of confidence. Our model is good, with a confidence level greater than 99%. If Significance F is large (so the confidence level is low), the model is not statistically significant.
P-Value = Risk:
A value shown with E (scientific notation) is a very small number. Probability (confidence) = 1 – Risk:
Intercept: Statistically significant at the X% level.
X₁: We are X% certain that X₁ is a statistically significant predictor.
X₂: Not that important because the probability is only X% (small number), indicating that it is not statistically significant when it comes to predicting the dependent variable.
X_3a & X_3b: Not statistically significant at all, not important (not calculated).
X_3c: Statistically significant at X% (high number), a very important predictor of the dependent variable.
Lower 95% & Upper 95% (Confidence Interval):
Intercept: We are 95% confident that the true coefficient b₀ is between X and X.
X₁: We are 95% confident that the true coefficient b₁ is between X and X.
X₂: We are 95% confident that the true coefficient b₂ is between X and X.
X_3a: Not relevant (all 0).
X_3b: We are 95% confident that the true coefficient b₄ for X_3b is between X and X.
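The F-statistic in the ANOVA table is the ratio of two mean squares, each a sum of squares divided by its degrees of freedom. A minimal sketch with hypothetical sums of squares (the function and parameter names are illustrative):

```python
def f_statistic(ss_regression, ss_residual, n, k):
    """F = MS regression / MS residual for n observations and k predictors."""
    ms_regression = ss_regression / k        # regression df = k
    ms_residual = ss_residual / (n - k - 1)  # residual df = n - k - 1
    return ms_regression / ms_residual


# Hypothetical ANOVA values: 24 observations, 3 independent variables.
f_value = f_statistic(ss_regression=900.0, ss_residual=300.0, n=24, k=3)
print(f_value)  # 20.0
```

Significance F is then the probability of seeing an F at least this large if none of the independent variables were related to the dependent variable; statistical software reports it alongside F.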

Multiple Regression and Correlation Analysis Example

Hypothesis Testing

Test the null hypothesis that there is no relationship in the population between the three predictors taken as a group and the annual salary of the technicians, using the 1 percent level of significance.

If the p-value is lower than 0.01, we reject the null hypothesis: there is a statistically significant relationship between the three independent variables taken as a group and the dependent variable at the 1% level of risk. In this case, the level of risk is 0.84%, which is less than 1%, so the null hypothesis is rejected. If the p-value were higher than the significance level, we would fail to reject the null hypothesis and conclude that there is no evidence of a relationship at this level of significance.

Backward Stepwise Regression

Backward stepwise regression is to be used to remove any predictors that do not contribute to the regression model, using the 10 percent level of significance. What predictor would be removed from the present model, and why? To answer this, look at the p-value of each variable in the coefficient table (the last table). A variable contributes to the model only if its p-value is below 10%. The risk for gender is much higher than that, so gender needs to be removed from the model and the regression analysis redone without it.
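One step of backward elimination can be sketched as follows: find the predictor with the highest p-value and drop it if that p-value exceeds the significance level. The p-values below are hypothetical placeholders, not the values from this example's output, and the function name is illustrative.

```python
def backward_step(p_values, alpha):
    """Return the predictor to drop, or None if every p-value is <= alpha."""
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > alpha else None


# Hypothetical p-values from a coefficient table.
p_values = {"experience": 0.001, "education": 0.03, "gender": 0.62}
to_remove = backward_step(p_values, alpha=0.10)
print(to_remove)  # gender

# After re-running the regression without that predictor, the p-values
# change, so they must be recomputed before the next step (hypothetical here).
p_values_refit = {"experience": 0.001, "education": 0.02}
print(backward_step(p_values_refit, alpha=0.10))  # None
```

The procedure stops when no remaining predictor has a p-value above the chosen level.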

Estimate the annual salary of a female with 16 years of experience and 2.5 years of postsecondary education.

Y = b₀ + b₁ * X₁ + b₂ * X₂ + b₃ * X₃ = $45,495.329 + $801.57 * (16) + $1,595.74 * (2.5) + $382.60 * (0) ≈ $62,309.81, where X₁ is years of experience, X₂ is years of postsecondary education, and X₃ is gender (0 for female). The salary for a female with 16 years of experience and 2.5 years of postsecondary education is therefore expected to be around $62,309.81.
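The same estimate can be checked with a short script. The slope coefficients are the ones quoted in the equation above, and female is coded 0 as in that equation. The intercept is taken here as 45,495.329, which is an assumption: it is the value that reproduces the quoted total. The function and parameter names are hypothetical.

```python
def predict_salary(experience_years, education_years, gender):
    """Estimated annual salary in dollars; gender is 0 for female, 1 for male."""
    b0 = 45495.329  # intercept (assumed value, see the note above)
    b1 = 801.57     # dollars per year of experience
    b2 = 1595.74    # dollars per year of postsecondary education
    b3 = 382.6      # gender indicator coefficient
    return b0 + b1 * experience_years + b2 * education_years + b3 * gender


estimate = predict_salary(experience_years=16, education_years=2.5, gender=0)
print(round(estimate, 2))
```

With these rounded coefficients the result is about 62,309.80, within a cent of the 62,309.81 quoted above.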
