Regression Analysis: Modeling Qualitative Variables

Posted on Mar 5, 2025 in Archaeology

Modeling Qualitative Variables with Regression

I. Qualitative Independent Variables

a. Modeling Values as Base and Differences

On July 19th, 2011, Dell Computers offered a base Inspiron 600 for $299.99. A buyer could customize this computer. One of the choices was the version of Microsoft Office 2010 to add; Windows 7 Home Premium was already included in the base price:

If you wanted Microsoft Office and Student 2010, add $119 (price becomes $418.99).
If you wanted Microsoft Office Home and Business 2010, add $199 (price becomes $498.99).
If you wanted Microsoft Office Professional 2010, add $349 (price becomes $648.99).

The prices of the computer under the different options become:

Base price: $299.99
Microsoft Office and Student 2010: $299.99 + $119 = $418.99
Microsoft Office Home and Business 2010: $299.99 + $199 = $498.99
Microsoft Office Professional 2010: $299.99 + $349 = $648.99

Since a computer needs numbers to represent yes and no, we use 1 for yes and 0 for no (dummy variables). We can then model the choices above using the following equation:

price = 299.99 + 119*(x₁) + 199*(x₂)+349*(x₃)

where:

x₁ = 1 if you choose option 1 (Office and Student), 0 otherwise
x₂ = 1 if you choose option 2 (Home and Business), 0 otherwise
x₃ = 1 if you choose option 3 (Professional), 0 otherwise

For example, if you choose option 2, the price would be 299.99 + 119*(0) + 199*(1) + 349*(0) = 299.99 + 199 = 498.99
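The dummy-variable equation above can be checked with a few lines of Python. This is only a minimal sketch: the function names `dummies` and `price` are illustrative and not part of the original notes, and option 0 is assumed to stand for the base computer with no Office add-on.

```python
# Price model from the text: base price plus the add-on for the chosen option.
BASE_PRICE = 299.99
ADD_ONS = [119.0, 199.0, 349.0]  # coefficients of x1, x2, x3

def dummies(option, n_options=3):
    """Return [x1, ..., xn]: 1 for the chosen option, 0 otherwise.

    option = 0 is assumed to mean the base computer (all dummies 0).
    """
    return [1 if option == i else 0 for i in range(1, n_options + 1)]

def price(option):
    """Evaluate price = 299.99 + 119*x1 + 199*x2 + 349*x3."""
    x = dummies(option)
    return BASE_PRICE + sum(b * xi for b, xi in zip(ADD_ONS, x))

for option in range(4):
    print(option, dummies(option), round(price(option), 2))
```

Only one dummy is ever equal to 1, so each price is the base price plus exactly one add-on.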

How would you model the price of the following computers?

Computer 1: Base computer with 1 year support, $338.99
Computer 2: 90 day support, $299.99
Computer 3: 2 year support, $418.99

b. Using Averages

If you are purchasing the same type of computer across many dealers, some prices will be higher and some will be lower; therefore, we will average the prices. What are the prices of the following computers?

average price = 300 + 50 (option 1) + 200 (option 2) – 140 (option 3)

If there had been 5 options, how many dummy variables would have been needed?

In general, we will model a qualitative variable with c levels using c-1 dummy variables.

Population mean of y = β₀ + β₁x₁ + β₂x₂ + … + β_{c-1}x_{c-1}

where x_i = 1 if level i and 0 otherwise, i = 1, 2, …, c-1
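As a rough illustration of the general model, the sketch below evaluates the population mean for one level from a list of coefficients. The function name `level_mean` is hypothetical, level 0 is assumed to denote the base level, and the coefficients are taken from the average price equation earlier in this section, reading the "–" before 140 as a minus sign.

```python
def level_mean(betas, level):
    """Mean of y for `level`, where level 0 is the base level.

    betas = [b0, b1, ..., b_{c-1}]. With dummy coding, every x is 0 except
    x_level, so the mean reduces to b0 (base) or b0 + b_level.
    """
    c = len(betas)  # number of levels, including the base
    if not 0 <= level < c:
        raise ValueError("level must be between 0 and c-1")
    x = [1 if level == i else 0 for i in range(1, c)]  # the c-1 dummies
    return betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))

# Coefficients of the average price equation (assumed reading of the signs).
betas = [300, 50, 200, -140]
print(len(betas) - 1)        # number of dummy variables for c = 4 levels
print(level_mean(betas, 1))  # mean for level 1: 300 + 50
```

Each slope is therefore the difference between the mean of its level and the mean of the base level.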

In this case, what would be the base average price?

What would β₃ represent?

What would have been the population mean of level 2?

c. Errors Due to Sampling

If you are able to obtain only a sample of prices, the sample averages and the estimated changes in average price due to changing options will contain sampling error.

Example: You are measuring the tensile strength provided by 4 suppliers. Supplier 1 has been your supplier in the past and will be considered your base level. You create 3 dummy variables: x₁ = 1 if supplier 2, 0 otherwise; x₂ = 1 if supplier 3, 0 otherwise; and x₃ = 1 if supplier 4, 0 otherwise. After taking random samples of size 5 from each supplier, you find the following result:


ANOVA
             df    SS          MS          F          Significance F
Regression    3    63.2855     21.09517    3.461629   0.041366
Residual     16    97.504      6.094
Total        19    160.7895

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    19.52          1.103993         17.68128   6.34e-12   17.17964    21.86036
x1           4.74           1.561282         3.035968   0.007866   1.430231    8.049769
x2           3.32           1.561282         2.126458   0.049376   0.010231    6.629769
x3           1.64           1.561282         1.050419   0.309133   -1.66977    4.949769

Predicted tensile strength = 19.52 + 4.74 x₁ + 3.32 x₂ + 1.64 x₃

What is the sample average for supplier 2?

The sample average tensile strength for supplier 4 would be 19.52 + 1.64 = 21.16

What happens to the average tensile strength in the sample when you go from your base supplier to supplier 3?

When you go from your base supplier to supplier 3, the sample average tensile strength improves by 3.32.

If you use this sample result to infer the difference in population mean tensile strength between the base supplier and supplier 3, what would be the largest error you would expect?

The sample slope of 3.32 has a standard error of 1.561282. This is multiple regression with k = c-1 = 3 independent variables. The degrees of freedom for the t distribution are n-k-1 = 20-3-1 = 16. The margin of error is then

t₁₆ * s_b2 = 2.1199 * 1.561282 = 3.309761712

Conclusion: With 95% confidence, when you change from the original supplier to supplier 3, the average tensile strength will improve by 3.32 with a margin of error of 3.31. (Excel gives the corresponding interval in the table above.)
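The margin of error calculation can be reproduced as follows. This is a sketch only: the helper name `confidence_interval` is made up for illustration, and the critical value 2.1199 is copied from the calculation above rather than computed from the t distribution.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Return (margin, lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return margin, estimate - margin, estimate + margin

n, k = 20, 3       # 4 suppliers x 5 observations, 3 dummy variables
df = n - k - 1     # degrees of freedom for the t distribution
b2 = 3.32          # sample slope for x2 (supplier 3 vs. base supplier)
se_b2 = 1.561282   # its standard error from the regression output
t_crit = 2.1199    # two-sided 95% critical value for 16 df, as quoted above

margin, lower, upper = confidence_interval(b2, se_b2, t_crit)
print(df, round(margin, 4), round(lower, 4), round(upper, 4))
```

The resulting limits agree with the Lower 95% and Upper 95% columns for x2 up to rounding of the critical value.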

What would be the interpretation of the confidence interval for β₁?

Can you conclude that the average tensile strength of supplier 1 differs from that of supplier 4? This is a two-sided t-test, with b₃ = 1.64, s_b3 = 1.561282 and n-k-1 = 16 degrees of freedom. Try this between now and next class.

Can you conclude that the average tensile strength differs among the four suppliers? The means would all be equal if all the differences were zero. Multiple slopes are tested together using an F test:

H₀: β₁ = β₂ = β₃ = 0 (equal means)

H₁: at least one β is not zero (at least two means differ)

T.S.: F = MSR/MSE = 21.09517/6.094 = 3.46

R.R.: at α = 0.05, reject H₀ if F > F_(3,16) = 3.24

Conclusion: Since 3.46 > 3.24, we reject the null hypothesis and can say that the mean tensile strength differs for at least two suppliers.
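The test statistic can be rebuilt from the sums of squares and degrees of freedom in the ANOVA table. The sketch below assumes the critical value 3.24 quoted above rather than computing it, and the function name `f_statistic` is illustrative.

```python
def f_statistic(ss_reg, df_reg, ss_res, df_res):
    """Return (MSR, MSE, F) where F = MSR / MSE."""
    msr = ss_reg / df_reg
    mse = ss_res / df_res
    return msr, mse, msr / mse

# Values from the ANOVA table in the regression output.
msr, mse, f = f_statistic(ss_reg=63.2855, df_reg=3, ss_res=97.504, df_res=16)

F_CRIT = 3.24        # critical value for (3, 16) df, as quoted in the notes
reject = f > F_CRIT  # rejection region: reject if F exceeds the critical value

print(round(msr, 5), round(mse, 3), round(f, 2), reject)
```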

Compare this with the F test of suppliers in one-way analysis of variance.
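One way to make the comparison concrete is to compute the between-suppliers sum of squares directly from the four sample means implied by the fitted equation. The sketch below assumes equal samples of size 5 (as stated in the example) and uses the residual mean square from the regression output as the within-suppliers mean square.

```python
N_PER_GROUP = 5

# Sample means implied by the fitted equation: base, then base + each slope.
intercept = 19.52
slopes = [4.74, 3.32, 1.64]
means = [intercept] + [intercept + b for b in slopes]

# With equal group sizes the grand mean is the plain average of the means.
grand_mean = sum(means) / len(means)

# One-way ANOVA between-groups sum of squares and mean square.
ss_between = N_PER_GROUP * sum((m - grand_mean) ** 2 for m in means)
ms_between = ss_between / (len(means) - 1)

ms_within = 6.094  # residual mean square (MSE) from the regression output
f = ms_between / ms_within

print(round(ss_between, 4), round(f, 6))
```

The between-suppliers sum of squares matches the regression sum of squares, so the two F statistics coincide.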

d. Interactions with Other Variables

If there are other variables in the model, you must add “holding all other variables constant” when interpreting a single coefficient. If you believe there are interactions between a qualitative variable and another independent variable, you would add product terms to the model.
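A product term lets the slope of the other independent variable differ between levels of the qualitative variable. The sketch below uses one dummy d and one quantitative variable x; all coefficient values are hypothetical and chosen only to illustrate the idea.

```python
def predict(b0, b_d, b_x, b_dx, d, x):
    """Model with an interaction: y = b0 + b_d*d + b_x*x + b_dx*(d*x)."""
    return b0 + b_d * d + b_x * x + b_dx * (d * x)

# Hypothetical coefficients, not taken from any data in these notes.
coef = dict(b0=10.0, b_d=2.0, b_x=0.5, b_dx=0.25)

# Change in y for a one-unit increase in x at the base level (d = 0) ...
slope_base = predict(**coef, d=0, x=1) - predict(**coef, d=0, x=0)
# ... and at the other level (d = 1).
slope_other = predict(**coef, d=1, x=1) - predict(**coef, d=1, x=0)

print(slope_base, slope_other)  # b_x and b_x + b_dx
```

Without the product term, both levels would share the same slope and differ only in their intercepts.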

II. Qualitative Dependent Variable

We will restrict our discussion to qualitative dependent variables that have only two levels (success-failure, pass-fail, product a or b, candidate a or b, etc.). We cannot use a dummy variable, dy, as the dependent variable in least squares regression, since according to our notes the mean of a dummy variable is p and its variance is p(1-p). Thus the variance of dy would not be constant, as least squares assumes, but would depend on the mean.

Instead, we switch from least squares to another estimation procedure called maximum likelihood. In maximum likelihood, the probability functions of the observations are multiplied together to form the likelihood, and derivatives are taken with respect to the unknown parameters. The estimates are the parameter values that maximize the likelihood for the values found in the sample.
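To see the idea in its simplest form, consider a single unknown probability p with no independent variables. The sketch below uses a hypothetical sample of ten 0/1 outcomes and a grid search in place of derivatives; it works with the log of the likelihood, which has its maximum at the same value of p.

```python
import math

def log_likelihood(p, outcomes):
    """Log of the product of p^y * (1-p)^(1-y) over all observations."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in outcomes)

# Hypothetical sample: 7 successes in 10 trials.
outcomes = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

# Search a grid of candidate values instead of solving the derivative equation.
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, outcomes))

print(p_hat, sum(outcomes) / len(outcomes))
```

In this simple case the maximum likelihood estimate is the sample proportion of successes. With independent variables in the model, software performs the same kind of maximization over all the coefficients.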

Our new model in terms of the parameters is

ln(p/(1-p)) = β₀ + β₁x₁ + β₂x₂ + … + β_k x_k

where ln is the natural logarithm. The maximum likelihood (ML) estimate is

estimated ln(p/(1-p)) = b₀ + b₁x₁ + b₂x₂ + … + b_k x_k
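The estimated log odds can be converted back to an estimated probability by inverting the logarithm: p = 1 / (1 + e^(-z)), where z is the right-hand side of the equation. The sketch below uses hypothetical coefficients and x values; the function names are illustrative.

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse of the log odds: p = 1 / (1 + exp(-z))."""
    return 1 / (1 + math.exp(-z))

def estimated_probability(b, x):
    """b = [b0, b1, ..., bk], x = [x1, ..., xk]."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return inv_logit(z)

# Hypothetical estimates and a hypothetical new observation.
b = [-1.0, 0.8, 0.5]
x = [1.0, 2.0]

p = estimated_probability(b, x)  # log odds = -1.0 + 0.8*1.0 + 0.5*2.0 = 0.8
print(round(p, 4), round(logit(p), 4))
```

Whatever the value of the linear combination, the resulting p always lies between 0 and 1.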

The estimates b₀, b₁, …, b_k are approximately normally distributed, and the hypothesis that all β_i = 0 is tested with a χ² test with k degrees of freedom.

A software program such as SAS can be used to estimate a confidence interval for p at new values of the independent variables (similar to predicting the value of y in regression). This interval can be used to classify an observation as a success or a failure. For example, if the interval for the probability of a success ran from 0.98 to 0.99, you would classify the observation as a success. You need to decide on a cutoff point: the value of p that separates observations classified as successes from those classified as failures.
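One possible classification rule is sketched below. It is an assumption for illustration, not a rule given in these notes: an observation is called a success when the whole interval lies at or above the cutoff, a failure when the whole interval lies below it, and left undecided when the interval straddles the cutoff. The cutoff of 0.5 is also hypothetical.

```python
def classify_interval(lower, upper, cutoff=0.5):
    """Classify from a confidence interval for p (assumed rule, see above)."""
    if lower >= cutoff:
        return "success"
    if upper < cutoff:
        return "failure"
    return "undecided"  # interval straddles the cutoff

print(classify_interval(0.98, 0.99))  # the interval from the example above
print(classify_interval(0.05, 0.20))
print(classify_interval(0.40, 0.65))
```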

It is best to use a second sample to see how well the predictions work on data that were not used to estimate the model. (See the analysis and validation data sets in the model building notes.)
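A simple way to summarize performance on the second sample is the proportion of observations that are misclassified. The sketch below assumes a cutoff of 0.5 and uses made-up predicted probabilities and outcomes for a small validation set.

```python
def misclassification_rate(probs, actual, cutoff=0.5):
    """Share of validation observations whose predicted class is wrong."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    wrong = sum(1 for y_hat, y in zip(predicted, actual) if y_hat != y)
    return wrong / len(actual)

# Hypothetical validation sample: predicted probabilities from the fitted
# model and the outcomes actually observed (1 = success, 0 = failure).
probs = [0.91, 0.15, 0.62, 0.40, 0.77, 0.08]
actual = [1, 0, 0, 1, 1, 0]

rate = misclassification_rate(probs, actual)
print(round(rate, 3))
```

A rate on the validation sample that is much higher than on the analysis sample suggests the model was fit too closely to the data used to estimate it.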