Vehicle Tax Analysis: Price, Age, and Regression Insights
1. Interpreting the Slope in the Simple Linear Regression Model
A 1% increase in price is associated with a 0.8% increase in taxes. Given that the increase is less than 1%, the vehicle tax is regressive, not progressive, meaning that more expensive cars pay proportionally less tax.
Back-transforming the fitted model log_rate = b1 + 0.8161 * log_price gives:
rate = exp(b1) * exp(0.8161 * log_price) = exp(b1) * (exp(log_price))^0.8161 = exp(b1) * (price)^0.8161
Hence, a 1% increase in price multiplies the rate by (1.01)^0.8161 = 1.00815, that is, an increase of about 0.8%.
In the log scale, an increment of 1 unit in log_price represents an increment of 0.8161 units in log_tax.
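The elasticity reading of the slope can be checked numerically. The following is a minimal sketch: only the slope 0.8161 comes from the text, and the function name `tax_multiplier` is an illustrative choice.

```python
# Slope of log_price in the simple log-log regression (value quoted in the text).
slope = 0.8161

def tax_multiplier(price_increase_pct, elasticity=slope):
    """Factor applied to the tax when the price rises by the given percentage."""
    return (1 + price_increase_pct / 100) ** elasticity

one_pct = tax_multiplier(1)
print(f"price +1%   -> tax multiplied by {one_pct:.5f}")   # 1.00815, about +0.8%
print(f"price +100% -> tax multiplied by {tax_multiplier(100):.4f}")
```

Because the exponent is below 1, the multiplier applied to the tax is always smaller than the multiplier applied to the price, which is the sense in which the tax is regressive.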
2. Do New Vehicles Pay More Taxes Than Old Vehicles of the Same Price?
Interpreting the Coefficient of the Dummy Variable in the Additive ANCOVA Model and Testing its Statistical Significance
The coefficient of 0.0806 associated with new vehicles in the additive ANCOVA model is positive (the model with the factor alone cannot be used, since it captures only the gross effect). It represents an additive effect of 0.0806 log units, that is, a multiplicative effect of exp(0.0806) = 1.084, or roughly 8% more tax for new cars of the same price, which seems small.
It must be formally tested with an F-test (Fisher test). The null hypothesis is that the Age factor has no additive effect on the relationship between log_tax and log_price.
RSS(Covariate) – RSS(Covariate + Factor) = 0.1040 with 1 degree of freedom (d.f.).
The best estimate of the model variance comes from the complete model and is 1.5485 / 62 = 0.02497 with 62 d.f. The quotient 0.1040 / 0.02497 is equal to 4.1648, which is the Fisher statistic with (1, 62) d.f. Its p-value is P(F(1,62) > 4.16) = 0.0457, below the usual 5% reference threshold. There is therefore evidence to reject the null hypothesis: new cars pay more tax than old cars of the same price.
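The test above can be reproduced from the quoted sums of squares. This is a sketch under stated assumptions: it uses only the standard library, relies on the identity F(1, d) = t(d)^2, and obtains the tail probability by Simpson's rule; the helper names `t_pdf` and `f_upper_tail_1df` are hypothetical. In practice a statistics package would normally supply the F distribution directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def f_upper_tail_1df(f_value, df2, steps=2000):
    """P(F(1, df2) > f_value), using F(1, df2) = t(df2)^2 and Simpson's rule."""
    upper = math.sqrt(f_value)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df2) + t_pdf(upper, df2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df2)
    central_mass = 2 * total * h / 3  # P(-upper < t < upper)
    return 1 - central_mass

ss_factor = 0.1040               # RSS(Covariate) - RSS(Covariate + Factor), 1 d.f.
rss_full, df_full = 1.5485, 62   # residual sum of squares and d.f. of the complete model

sigma2 = rss_full / df_full      # estimate of the model variance
f_stat = (ss_factor / 1) / sigma2
p_value = f_upper_tail_1df(f_stat, df_full)

print(f"variance estimate = {sigma2:.5f}")
print(f"F(1, {df_full}) = {f_stat:.3f}, p-value = {p_value:.4f}")
```

The F statistic computed from the rounded inputs may differ from the quoted 4.1648 in the last digits, since the text works with more precise sums of squares.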
3. Multiple Correlation Between the Response and its Two Predictors
Assuming the additive ANCOVA model, it explains (5.8851 + 0.1040) = 5.9891 of the 7.5583 units of target variability, that is, a proportion of 0.7924 or 79.24% (coefficient of determination). The multiple correlation coefficient is the square root of the coefficient of determination, 0.8902.
If only the covariate is kept in the model, it explains 5.8851 of the 7.5583 units of target variability, so the coefficient of determination is 77.86% and the correlation coefficient is its square root, 0.8824.
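Both coefficients follow directly from the explained and total sums of squares quoted above. A minimal sketch, with the illustrative helper name `fit_summary`:

```python
import math

total_ss = 7.5583  # total variability of the response (log tax)

def fit_summary(explained_ss, total=total_ss):
    """Return (coefficient of determination, multiple correlation coefficient)."""
    r2 = explained_ss / total
    return r2, math.sqrt(r2)

# Explained sum of squares 5.8851 + 0.1040 (covariate plus factor)
r2_add, r_add = fit_summary(5.8851 + 0.1040)
# Explained sum of squares 5.8851 alone
r2_cov, r_cov = fit_summary(5.8851)

print(f"explained 5.9891: R^2 = {r2_add:.4f}, R = {r_add:.4f}")
print(f"explained 5.8851: R^2 = {r2_cov:.4f}, R = {r_cov:.4f}")
```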
4. Statistical Evidence Suggesting Different Relationships Between Response and Covariate in New or Old Vehicles
A test of the interaction between Covariate and Factor is required; there is no alternative. We do this by applying the Fisher test.
RSS(Covariate + Factor) – RSS(Covariate * Factor) = 0.0207 with 1 d.f. The best estimate of the model variance comes from the complete model and is 1.5485 / 62 = 0.02497 with 62 d.f. The quotient 0.0207 / 0.02497 = 0.8288 is the Fisher statistic with (1, 62) d.f. Its p-value is 0.3661, much higher than the usual 5% reference threshold, so there is no evidence to reject the null hypothesis of no interaction. Thus, the slope of the linear relationship between log price and log tax does not depend on the vehicle age factor.
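The sums of squares quoted across the sections form a sequential decomposition, which gives a useful consistency check: subtracting each term from the total variability should end at the residual sum of squares of the complete model. A sketch, where the term labels are illustrative:

```python
total_ss = 7.5583  # total variability of log tax

# Sequential sums of squares, each with 1 d.f., as quoted in the text.
sequential_ss = [
    ("log_price", 5.8851),
    ("age factor", 0.1040),
    ("log_price x age", 0.0207),
]

rss = total_ss
rss_path = []
for term, ss in sequential_ss:
    rss -= ss  # residual sum of squares after adding this term
    rss_path.append((term, round(rss, 4)))
    print(f"after adding {term:<16} RSS = {rss:.4f}")

residual_df = 62
sigma2 = rss / residual_df          # variance estimate from the complete model
f_interaction = 0.0207 / sigma2     # F statistic for the interaction, (1, 62) d.f.
print(f"F for interaction = {f_interaction:.4f}")
```

The last residual value matches the 1.5485 used as the variance numerator in both Fisher tests.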
5. Observation with Leverage Greater than 0.1
Maximum Acceptable Value for Leverage
The maximum acceptable value is 2p/n, where p is the number of model parameters and n the number of observations. In the additive model (p = 3, n = 66) it is 6/66 = 0.091. An observation with leverage above 0.1 must have a very large or very small value of the explanatory variable log_Price. At first sight it looks like the observation with the lowest price; in fact, it turns out to be an old car with a high price.
6. Observation with a Studentized Residual Lower than -2.7 in the Additive Model
The negative sign indicates a vehicle that pays relatively little tax for its price, so it points towards a new vehicle that lies below the fitted line of its group. It is the observation corresponding to log_Price 6.8 and the minimum log_rate within the group of new cars.
7. Observations Suspected of Being Influential Data
Maximum Reference Value for the Statistic Used to Answer the Question
There are 2 candidates among the old vehicles: the cheapest (less than 6.4 in logPrice) and the second most expensive (approx. logPrice 7.5). Some authors consider as a posteriori influential those observations whose Cook's distance is unusually high relative to the rest of the observations; under that view, a boxplot of the Cook's distances could be used and the decision made according to a standard descriptive statistics criterion. Chatterjee and Hadi propose using a threshold of 4/(n-p), which in our case is 4/63 = 0.0635.
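The two reference values used in the diagnostics (2p/n for leverage and 4/(n-p) for Cook's distance) are simple to compute and apply. In this sketch n = 66 and p = 3 are taken from the additive model, while the helper `flag` and the list of Cook's distances are hypothetical, included only to show how a cutoff would be applied.

```python
n, p = 66, 3  # observations and parameters of the additive model

leverage_cutoff = 2 * p / n   # maximum acceptable leverage
cooks_cutoff = 4 / (n - p)    # Chatterjee and Hadi threshold for Cook's distance

def flag(values, cutoff):
    """Return the positions of the values that exceed the cutoff."""
    return [i for i, v in enumerate(values) if v > cutoff]

print(f"leverage cutoff = {leverage_cutoff:.3f}")
print(f"Cook's distance cutoff = {cooks_cutoff:.4f}")

# Hypothetical Cook's distances, NOT taken from the data set.
example_cooks = [0.010, 0.004, 0.120, 0.030, 0.075]
print("flagged positions:", flag(example_cooks, cooks_cutoff))
```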
8. Determining Binary Logit Null Model Parameter Estimate for Abstention Proportion
First, the data have to be grouped appropriately to simplify the calculation. The average probability of abstention is 0.3165 (= 5281443 / 16689375), the odds are 0.4630, and the log-odds, which are the intercept estimate of the null model, are -0.7701. The null deviance is 1140214 units.
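In a null (intercept-only) binary logit model, the intercept estimate is the logit of the overall proportion. A minimal sketch using the two totals from the text; the variable names are assumptions, and the denominator is read here as the total number of eligible voters.

```python
import math

abstentions = 5281443   # numerator quoted in the text
electorate = 16689375   # denominator quoted in the text (assumed to be all eligible voters)

p_hat = abstentions / electorate    # overall abstention proportion
odds = p_hat / (1 - p_hat)          # abstainers per voter
intercept = math.log(odds)          # null-model parameter estimate (log-odds)

# Applying the inverse logit to the intercept recovers the proportion.
p_back = 1 / (1 + math.exp(-intercept))

print(f"proportion = {p_hat:.4f}")
print(f"odds       = {odds:.4f}")
print(f"log-odds   = {intercept:.4f}")
```

The null deviance is not recomputed here because it depends on how the data are grouped, which the totals alone do not determine.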