Regression Analysis: Modeling Qualitative Variables
Modeling Qualitative Variables with Regression
I. Qualitative Independent Variables
a. Modeling Values as Base and Differences
On July 19th, 2011, Dell Computers offered a base Inspiron 600 for $299.99. A buyer could customize this computer. One of the choices was the version of Office 2010 (Windows 7 Home Premium was already included in the base price):
- If you wanted Microsoft Office and Student 2010, add $119 (price becomes $418.99).
- If you wanted Microsoft Office Home and Business 2010, add $199 (price becomes $498.99).
- If you wanted Microsoft Office Professional 2010, add $349 (price becomes $648.99).
The prices of the computer under the different options become:
- Base price: $299.99
- Microsoft Office and Student 2010: $299.99 + $119 = $418.99
- Microsoft Office Home and Business 2010: $299.99 + $199 = $498.99
- Microsoft Office Professional 2010: $299.99 + $349 = $648.99
Because software needs numeric inputs, we use 1 and 0 to represent yes and no (dummy variables). The choices above can then be modeled with the following equation:
price = 299.99 + 119*(x1) + 199*(x2) + 349*(x3)
where:
- x1 = 1 if you choose option 1 (Office and Student), 0 otherwise
- x2 = 1 if you choose option 2 (Home and Business), 0 otherwise
- x3 = 1 if you choose option 3 (Professional), 0 otherwise
For example, if you choose option 2, the price would be 299.99 + 119*(0) + 199*(1) + 349*(0) = 299.99 + 199 = 498.99
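The base-plus-differences equation above can be sketched in a few lines of Python. The names `BASE_PRICE`, `OFFICE_ADDONS`, and `price` are illustrative assumptions, not part of the original notes; the dollar amounts are the ones listed above.

```python
# Base price and the add-on (difference from base) for each Office option.
BASE_PRICE = 299.99
OFFICE_ADDONS = {1: 119.00, 2: 199.00, 3: 349.00}  # option number -> add-on


def price(option=None):
    """Price for the chosen option (None means the base computer)."""
    # Build the dummy variables: x_k = 1 for the chosen option, 0 otherwise.
    x = {k: 1 if k == option else 0 for k in OFFICE_ADDONS}
    total = BASE_PRICE + sum(OFFICE_ADDONS[k] * x[k] for k in OFFICE_ADDONS)
    return round(total, 2)


print(price())   # base computer
print(price(2))  # option 2 (Home and Business)
```

At most one dummy variable equals 1 at a time, so every price is the base price plus a single difference.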
How would you model the price of the following computers?
- Computer 1: Base computer with 1 year support, $338.99
- Computer 2: 90 day support, $299.99
- Computer 3: 2 year support, $418.99
b. Using Averages
If you are purchasing the same type of computer across many dealers, some prices will be higher and some will be lower; therefore, we will average the prices. What are the prices of the following computers?
average price = 300 + 50 (option 1) + 200 (option 2) – 140 (option 3)
If there had been 5 options, how many dummy variables would have been needed?
In general, we will model a qualitative variable with c levels using c-1 dummy variables.
Population mean of y = β0 + β1*x1 + β2*x2 + … + β(c-1)*x(c-1)
where xi = 1 if level i and 0 otherwise, for i = 1, 2, …, c-1
In this case, what would be the base average price?
What would β3 represent?
What would have been the population mean of level 2?
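As a rough sketch of the c-1 coding rule, the functions below turn a level into its dummy variables and evaluate the mean equation. The function names and level labels are assumptions made for illustration; the coefficients come from the average-price equation above (300, 50, 200, -140).

```python
def dummy_code(level, levels):
    """Return the c-1 dummy values for `level`; levels[0] is the base level."""
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    # One dummy per non-base level: 1 if it matches, 0 otherwise.
    return [1 if level == other else 0 for other in levels[1:]]


def population_mean(level, levels, betas):
    """betas[0] is the base mean; betas[i] is the difference of level i from base."""
    x = dummy_code(level, levels)
    return betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))


# Hypothetical labels for the four levels in the average-price equation.
levels = ["base", "option 1", "option 2", "option 3"]
betas = [300, 50, 200, -140]

print(dummy_code("option 1", levels))              # three dummies for four levels
print(population_mean("option 3", levels, betas))  # 300 - 140
```

The base level has all dummies equal to 0, so its mean is the intercept alone.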
c. Errors Due to Sampling
If you are able to obtain only a sample of prices, the sample averages and the estimated changes in average price due to changing options will be subject to sampling error.
Example: You are measuring the tensile strength provided by 4 suppliers. Supplier 1 has been your supplier in the past and will be considered your base level. You create 3 dummy variables: x1 = 1 if supplier 2, 0 otherwise; x2 = 1 if supplier 3, 0 otherwise; and x3 = 1 if supplier 4, 0 otherwise. After taking random samples of size 5 from each supplier, you find the following result:
ANOVA

| | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3 | 63.2855 | 21.09517 | 3.461629 | 0.041366 |
| Residual | 16 | 97.504 | 6.094 | | |
| Total | 19 | 160.7895 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 19.52 | 1.103993 | 17.68128 | 6.34e-12 | 17.17964 | 21.86036 |
| x1 | 4.74 | 1.561282 | 3.035968 | 0.007866 | 1.430231 | 8.049769 |
| x2 | 3.32 | 1.561282 | 2.126458 | 0.049376 | 0.010231 | 6.629769 |
| x3 | 1.64 | 1.561282 | 1.050419 | 0.309133 | -1.66977 | 4.949769 |

Predicted tensile strength = 19.52 + 4.74 x1 + 3.32 x2 + 1.64 x3
- What is the sample average for supplier 4?
Supplier 4 corresponds to x3 = 1, so the sample average tensile strength for supplier 4 would be 19.52 + 1.64 = 21.16
- What happens to the average tensile strength in the sample when you go from your base supplier to supplier 3?
When you go from your base supplier to supplier 3, the sample average tensile strength improves by 3.32.
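The sample averages implied by the fitted equation can be checked with a short sketch. The dictionary and function names are assumptions for illustration; the coefficients are those in the regression output, and the supplier-to-dummy mapping follows the definitions in the example (x1 for supplier 2, x2 for supplier 3, x3 for supplier 4).

```python
# Coefficients from the fitted equation.
coef = {"Intercept": 19.52, "x1": 4.74, "x2": 3.32, "x3": 1.64}

# Supplier 1 is the base level and has no dummy variable.
dummy_for_supplier = {1: None, 2: "x1", 3: "x2", 4: "x3"}


def sample_mean(supplier):
    """Sample average tensile strength for a supplier: base mean plus its difference."""
    name = dummy_for_supplier[supplier]
    difference = coef[name] if name is not None else 0.0
    return round(coef["Intercept"] + difference, 2)


for s in (1, 2, 3, 4):
    print(s, sample_mean(s))
```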
- Using the supplier 3 result to infer about the difference in population mean tensile strength, what would be the largest error you would expect?
The sample slope b2 = 3.32 has a standard error of 1.561282. This is multiple regression with k = c-1 = 3 independent variables, so the t statistic has n-k-1 = 20-3-1 = 16 degrees of freedom. The margin of error is then
t16 * sb2 = 2.1199 * 1.561282 = 3.3098
Conclusion: With 95% confidence, when you change from the original supplier to supplier 3, the average tensile strength will improve by 3.32 with a margin of error of 3.31, that is, by between 0.01 and 6.63. (Excel gives this range in the table above.)
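The margin-of-error arithmetic can be reproduced with a small sketch. The function name is an assumption; the critical value 2.1199 is taken from the notes rather than computed, so the limits agree with Excel's only to about five decimal places.

```python
# Two-sided 95% t critical value with 16 degrees of freedom, as quoted in the notes.
T_CRIT_16 = 2.1199


def conf_interval(b, se, t_crit=T_CRIT_16):
    """Return (lower, upper, margin) for a slope b with standard error se."""
    margin = t_crit * se
    return b - margin, b + margin, margin


# Slope and standard error for x2 (supplier 3 versus the base supplier).
lower, upper, margin = conf_interval(3.32, 1.561282)
print(round(margin, 4), round(lower, 4), round(upper, 4))
```

The same function applies to any other row of the coefficient table by changing the slope and standard error.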
What would be the interpretation of the confidence interval for β1 ?
- Can you conclude that the average tensile strength of supplier 1 differs from that of supplier 4? This is a two-sided t-test, with b3 = 1.64, sb3 = 1.561282 and n-k-1 = 16 degrees of freedom. Try this between now and next class.
- Can you conclude that the average tensile strength differs among the four suppliers? The means would be equal if all the differences were zero. Multiple slopes are tested together using an F test:
H0: β1 = β2 = β3 = 0 (equal means)
H1: at least one β is not zero (at least two means differ)
Test statistic: F = MSR/MSE = 21.09517/6.094 = 3.46
Rejection region: reject H0 if F > F(3, 16) = 3.24 (α = 0.05)
Conclusion: Since 3.46 > 3.24, we reject H0. We can say that the mean tensile strength differs for at least two suppliers.
Compare this with the F test of suppliers in one-way analysis of variance.
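The F statistic can be rebuilt from the sums of squares in the ANOVA table, as in the sketch below. The variable names are assumptions; the critical value 3.24 is the one quoted in the notes, not computed here.

```python
# Sums of squares and degrees of freedom from the ANOVA table.
ss_regression, df_regression = 63.2855, 3
ss_residual, df_residual = 97.504, 16

# Mean squares are sums of squares divided by their degrees of freedom.
msr = ss_regression / df_regression
mse = ss_residual / df_residual
f_stat = msr / mse

F_CRIT = 3.24  # critical value for (3, 16) degrees of freedom, as quoted in the notes
reject_h0 = f_stat > F_CRIT

print(round(msr, 5), round(mse, 3), round(f_stat, 2), reject_h0)
```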
d. Interactions with Other Variables
If there are other variables in the model, you must add "holding all other variables constant" when interpreting a single coefficient. If you believe there are interactions between a qualitative variable and another independent variable, you would add product terms to the model.
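As an illustration of a product term, consider the simplest case: one dummy variable x1 and one quantitative variable z. Both the variable z and this two-level setup are assumptions chosen for the sketch; the notes do not specify a particular model.

```latex
\begin{align*}
E(y) &= \beta_0 + \beta_1 x_1 + \beta_2 z + \beta_3 x_1 z \\
x_1 = 0:\quad E(y) &= \beta_0 + \beta_2 z \\
x_1 = 1:\quad E(y) &= (\beta_0 + \beta_1) + (\beta_2 + \beta_3)\, z
\end{align*}
```

Without the product term the two levels share the slope β2 and differ only in intercept; with it, β3 is the difference in slopes between the two levels.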
II. Qualitative Dependent Variable
We will restrict our discussion to qualitative dependent variables that have only two levels (success-failure, pass-fail, product a or b, candidate a or b, etc.). We cannot use a dummy variable, dy, as the dependent variable in least squares regression because, according to our notes, the mean of a dummy variable is p and its variance is p(1-p). Thus the variance of dy would not be constant but would depend on the mean.
In fact we switch from least squares to another estimation procedure called maximum likelihood. In maximum likelihood the probability functions of each observation are multiplied together and then derivatives are taken with respect to the unknown parameters. Estimates are formed based on solving for the unknown parameters using the values found in the sample.
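The procedure just described can be worked through in the simplest possible case: n independent success/failure observations y_i (each 0 or 1) that share a single unknown probability p. This one-parameter setting is an assumption made for illustration; it has no independent variables.

```latex
\begin{align*}
L(p) &= \prod_{i=1}^{n} p^{\,y_i} (1-p)^{\,1-y_i} \\
\ln L(p) &= \Big(\sum_{i=1}^{n} y_i\Big) \ln p + \Big(n - \sum_{i=1}^{n} y_i\Big) \ln(1-p) \\
\frac{d \ln L}{dp} &= \frac{\sum_{i} y_i}{p} - \frac{n - \sum_{i} y_i}{1-p} = 0
\quad\Longrightarrow\quad \hat{p} = \frac{1}{n} \sum_{i=1}^{n} y_i
\end{align*}
```

Here the maximum likelihood estimate is the sample proportion of successes. With independent variables in the model there is generally no closed-form solution, and software solves the equations numerically.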
Our new model in terms of the parameters is
ln(p/(1-p)) = β0 + β1x1 + β2x2 + … + βkxk
where p is the probability of a success and ln is the natural logarithm. The maximum likelihood (ML) estimate is
estimated ln(p/(1-p)) = b0 + b1x1 + b2x2 + … + bkxk
The estimates b0, b1, …, bk are approximately normally distributed, and the hypothesis that all βi = 0 (i = 1, …, k) is tested with a χ² (chi-square) test with k degrees of freedom.
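Because the model is written for the log-odds, an estimated probability is recovered by inverting the logarithm: p = 1/(1 + e^(-log-odds)). The sketch below shows that step. The function name and the coefficient values are hypothetical and are not estimates from any data in these notes.

```python
import math


def predicted_probability(b, x):
    """Convert estimated log-odds b0 + b1*x1 + ... + bk*xk into a probability.

    b: list [b0, b1, ..., bk] of estimated coefficients
    x: list [x1, ..., xk] of values of the independent variables
    """
    log_odds = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, chosen only to show the calculation.
b = [-1.0, 0.5]
for x1 in (0, 2, 4):
    print(x1, round(predicted_probability(b, [x1]), 3))
```

Log-odds of 0 correspond to p = 0.5, and the result always lies strictly between 0 and 1.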
A software program such as SAS can be used to estimate a confidence interval for p for new values (similar to predicting the value of y in regression). This interval can be used to classify a value as a success or a failure. For example, if the interval for the probability of a success was from 0.98 to 0.99, you would classify the value as a success. You need to decide on a cutoff value of p: the value at or above which an observation is classified as a success and below which it is classified as a failure.
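A minimal sketch of the classification step follows. The cutoff of 0.5, the "undetermined" label for intervals that straddle the cutoff, and the function names are all assumptions made for illustration; the notes leave the choice of cutoff open.

```python
def classify(p_hat, cutoff=0.5):
    """Classify a single estimated probability against the cutoff."""
    return "success" if p_hat >= cutoff else "failure"


def classify_interval(lower, upper, cutoff=0.5):
    """Classify using a confidence interval for p.

    If the whole interval is on one side of the cutoff the call is clear;
    otherwise the interval does not settle the classification.
    """
    if lower >= cutoff:
        return "success"
    if upper < cutoff:
        return "failure"
    return "undetermined"


print(classify_interval(0.98, 0.99))  # the interval from the example in the text
print(classify_interval(0.40, 0.60))  # straddles the assumed cutoff
```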
It is best to use a second sample to see how well the predictions work on data that were not used to estimate the model. (See the analysis and validation data sets in the model building notes.)
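One way to set up such a check is sketched below: hold out part of the data as a validation set, then compare predicted and actual classes on it. The 30% validation fraction, the fixed seed, and the function names are assumptions for illustration; the model building notes may prescribe a different split.

```python
import random


def split(rows, validation_fraction=0.3, seed=0):
    """Randomly split rows into (analysis, validation) lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predicted, actual):
    """Fraction of validation observations classified correctly."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


analysis, validation = split(range(10))
print(len(analysis), len(validation))
print(accuracy(["success", "failure"], ["success", "success"]))
```

The model would be estimated on the analysis rows only, and `accuracy` applied to its predictions for the validation rows.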