Statistical Analysis: ANOVA, Linear Regression, and Logistic Regression
ANOVA: Analysis of Variance
ANOVA (Analysis of Variance) is used to compare means across multiple groups. It tests whether there are significant differences between the means of three or more groups and determines whether variation is due to differences between or within groups.
- Between-group variability: Measures how much the means of different groups differ from one another. It represents the systematic effect of the independent variable.
- Within-group variability: Measures how much individual data points within each group differ from the group mean. It represents random error or inherent variability in the data.
The total variability partitions as SST = SSB + SSW, where n denotes the sample size in each group.
Assumptions:
- Independence
- Normality
- Homogeneity of Variance
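Neither test is prescribed in these notes, but a common way to check the last two assumptions in Python is a Shapiro-Wilk test per group (normality) and Levene's test across groups (homogeneity of variance). A minimal sketch, assuming three made-up groups g1, g2, g3:

```python
# Minimal sketch: checking normality and equal variances before an ANOVA.
# g1, g2, g3 are hypothetical groups, not data from these notes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1, g2, g3 = rng.normal(5, 1, 30), rng.normal(6, 1, 30), rng.normal(7, 1, 30)

# Normality: Shapiro-Wilk on each group (p > .05 suggests no clear violation).
for name, g in [("g1", g1), ("g2", g2), ("g3", g3)]:
    w, p = stats.shapiro(g)
    print(f"{name}: W = {w:.3f}, p = {p:.3f}")

# Homogeneity of variance: Levene's test across all groups.
stat, p = stats.levene(g1, g2, g3)
print(f"Levene: statistic = {stat:.3f}, p = {p:.3f}")
```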
One-Way ANOVA
One-way ANOVA is used when comparing more than two groups based on a single independent variable.
Hypotheses:
- H0: μ1 = μ2 = … = μk (The means of all groups are equal)
- HA: At least one group mean is different.
F-statistic = MSB / MSW, where MSB = SSB / (k − 1) and MSW = SSW / (N − k).
Critical F: depends on both degrees of freedom (between-group df = k − 1, within-group df = N − k).
If F-obtained is greater than F-critical, we reject the null hypothesis.
Effect size: η² = SSB / SST, the proportion of variance in the outcome variable explained by group membership. Small: 0.01, Medium: 0.09, Large: 0.25.
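A minimal sketch of these calculations on made-up data, computing SSB, SSW, SST, the F-statistic, and η² by hand, then cross-checking F with scipy.stats.f_oneway:

```python
# One-way ANOVA by hand on hypothetical data, cross-checked with scipy.
import numpy as np
from scipy import stats

groups = [np.array([4.0, 5.0, 6.0, 5.5]),
          np.array([6.5, 7.0, 8.0, 7.5]),
          np.array([9.0, 8.5, 10.0, 9.5])]

grand_mean = np.concatenate(groups).mean()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group SS
sst = ssb + ssw                                                   # SST = SSB + SSW

k = len(groups)                   # number of groups
N = sum(len(g) for g in groups)   # total sample size
msb, msw = ssb / (k - 1), ssw / (N - k)
f_stat = msb / msw
eta_sq = ssb / sst                # effect size: SSB / SST

print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.2f}")
print(stats.f_oneway(*groups))    # should report the same F-statistic
```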
Post Hoc Testing
Tukey’s HSD:
- Use if you want to test the relationship between specific groups within your ANOVA.
- Use if you want to check all pairwise comparisons.
- Compares all pairwise group means to identify which specific pairs are different.
- Controls Type I error across all pairwise comparisons (assumes equal sample sizes).
Bonferroni:
- Use if you want to test the relationship between specific groups within your ANOVA.
- Controls Type I error rate by adjusting the significance threshold for multiple comparisons.
- More conservative than Tukey’s HSD, making it less likely to find significant differences.
Calculate the HSD statistic: T = q * sqrt(MSE / n), where n = number of subjects in one group.
To find q, use a studentized range table with df = total number of subjects − number of groups (N − k) and k = number of groups. MSE = MSW.
If the absolute difference between two group means is larger than the calculated HSD value, those means are significantly different.
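A minimal sketch of these pairwise comparisons with statsmodels' pairwise_tukeyhsd; the scores and group labels below are hypothetical:

```python
# Tukey's HSD on hypothetical scores for three groups A, B, C.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([4.0, 5.0, 6.0, 5.5,    # group A
                   6.5, 7.0, 8.0, 7.5,    # group B
                   9.0, 8.5, 10.0, 9.5])  # group C
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

# Compares every pair of group means; 'reject = True' marks pairs whose
# mean difference exceeds the HSD threshold.
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result.summary())
```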
Factorial ANOVA
Factorial ANOVA allows analysis of two or more independent variables (factors) affecting a dependent variable. It assesses main effects (the effect of each factor on the dependent variable, averaging over the other factors) and interaction effects (whether the effect of one factor depends on the level of another).
- 2×2: 2 factors, each with 2 levels
- 2×3: 2 factors, one with 2 levels and another with 3 levels
- 2×4×6: 3 factors, one with 2 levels, another with 4, and another with 6 levels
Null hypotheses:
- Main effect of factor A: H0: μA1 = μA2
- Main effect of factor B: H0: μB1 = μB2
- Interaction effect: H0: (μA1,B1 − μA2,B1) = (μA1,B2 − μA2,B2) (The effect of factor A is the same across all levels of factor B, and the effect of factor B is the same across all levels of factor A)
Main effects assume no interaction and can’t be used to assess simple effects. They are null and void if you have a significant interaction effect.
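A minimal sketch of a 2×2 factorial ANOVA with statsmodels; the factors A and B and the outcome values are hypothetical:

```python
# 2x2 factorial ANOVA: two main effects plus their interaction.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "A": ["a1"] * 8 + ["a2"] * 8,
    "B": (["b1"] * 4 + ["b2"] * 4) * 2,
    "y": [3, 4, 3, 5, 6, 7, 6, 8, 5, 4, 6, 5, 12, 11, 13, 12],
})

# 'C(A) * C(B)' expands to C(A) + C(B) + C(A):C(B), i.e. both main effects
# and the interaction term.
model = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # ANOVA table with F tests per effect
```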
Linear Regression
Purpose: To assess a linear relationship between two or more variables and form predictions. We can use our independent variable X to form predictions about our outcome variable Y, assuming a linear relationship. It models the best-fitting line to our data.
y = mx + b
- Simple linear regression: y = β0 + β1X + ε
- Multiple linear regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Best-fitting line: We can't fit a line that perfectly predicts all of our data points. Instead, we find the line that minimizes the sum of squared residuals.
Residuals: Observed outcome (yi) − predicted outcome (ŷi). The least-squares criterion minimizes Σ(yi − ŷi)².
Effect size: R² = SSM / SST represents the proportion of variance in the outcome that can be explained by the predictor(s). For example, if R² = 0.23, 23% of the variance in murder rate can be explained by unemployment. Small = 0.01, Medium = 0.09, Large = 0.25.
Standard error of estimate: Quantifies the typical size of a residual, √(MSE), where MSE = SSE / dfE. It tells us how far away, on average, our predicted values are from the actual values. (Example model: distance = β0 + β1 · speed.)
Null hypothesis: no relationship between the variables, H0: β1 = 0 (the population slope is 0). Alternative hypothesis: there is a relationship between the variables, HA: β1 ≠ 0.
Interpretation of the linear regression:
- Significance: p-value for each predictor
- Coefficients of predictors:
  - Continuous predictors: Describe the change in the predicted outcome for a one-unit increase in the predictor (holding all other predictors constant in a multiple regression).
  - Binary predictors: Describe the difference in the dependent variable between the two levels of the binary predictor.
- Intercept:
  - Continuous predictors: Predicted outcome when all predictors are equal to zero.
  - Binary predictors: The predicted value of the dependent variable for the reference group.
  - Often, the intercept does not make sense in context.
- R²: The amount of variability in the outcome that is accounted for by the predictors.
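A minimal sketch of fitting and reading a simple regression with statsmodels, using made-up speed/distance values to echo the distance = β0 + β1 · speed example above:

```python
# Simple linear regression: distance ~ speed (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "speed":    [4, 7, 8, 9, 10, 12, 13, 15, 18, 20],
    "distance": [2, 4, 16, 10, 18, 24, 26, 34, 42, 48],
})

result = smf.ols("distance ~ speed", data=df).fit()
print(result.params)              # beta0 (intercept) and beta1 (slope)
print(result.pvalues["speed"])    # p-value for H0: beta1 = 0
print(result.rsquared)            # R^2: variance in distance explained by speed
print(np.sqrt(result.mse_resid))  # standard error of estimate, sqrt(SSE / dfE)
```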
Multiple Linear Regression
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε (here β3X1X2 is an interaction term between X1 and X2)
Each coefficient represents the change in Y for a 1-unit change in the corresponding X, while holding the other predictors constant; with an interaction term, β3 describes how the effect of X1 on Y changes as X2 changes.
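A minimal sketch with simulated predictors x1 and x2; in the statsmodels formula, 'x1 * x2' expands to the main effects plus the x1:x2 interaction:

```python
# Multiple regression with an interaction term on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2 + 1.5 * x1 - 0.8 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.5, size=100)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
result = smf.ols("y ~ x1 * x2", data=df).fit()
print(result.params)  # beta0, beta1 (x1), beta2 (x2), beta3 (x1:x2)
```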
Regression with Categorical Predictors
This is used when the predictor is categorical and the outcome is continuous.
Differences from linear regression:
- No change in model fit
- Same linear trend
- Interpretation changes: the coefficient is the expected difference in the outcome between a category and the reference category (or between the two levels of a binary predictor).
Why: Because you can make predictions. ANOVA only tells us if there is a difference between groups. Categorical regression tells us how groups contribute to a much larger model.
Dummy coding: use k − 1 dummy variables (where k = number of groups), because including all k would create multicollinearity (predictors too related to each other; their variances can't be told apart). Dummy coding converts the categories into binary variables: each category is represented by its own dummy variable except for a single reference category, giving each category a unique pattern of 0s and 1s.
Limitations: Each dummy coefficient can only compare a category to the reference category, not categories to each other. To compare other pairs, rerun the same model with a different reference category, e.g., "premium cut" or "ideal cut" (see the sketch below).
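A minimal sketch of dummy coding in a regression formula; the cut categories and prices are made-up stand-ins for the diamond-cut example, and the second model simply switches the reference category:

```python
# Regression with a categorical predictor (dummy / treatment coding).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "cut":   ["fair", "fair", "premium", "premium", "ideal", "ideal"] * 5,
    "price": [300, 320, 480, 500, 610, 640] * 5,
})

# C(cut) creates k - 1 = 2 dummy variables; the alphabetically first level
# ("fair") is the default reference category.
m1 = smf.ols("price ~ C(cut)", data=df).fit()
print(m1.params)

# Compare against a different reference category by re-specifying it.
m2 = smf.ols("price ~ C(cut, Treatment(reference='ideal'))", data=df).fit()
print(m2.params)
```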
Logistic Regression
Logistic regression is used when our outcome is binary (has two levels, 0 or 1).
Differences from linear regression: We are not fitting a linear trend but a logistic (logit) curve. Interpretation also changes: coefficients describe a change in log odds rather than a change in the raw outcome value. Log odds → odds → probability.
Why: To constrain our predictions between 0 and 1, where 0 is the event not happening and 1 is the event happening.
Odds vs. probability: The probability that an event will occur is the fraction of the time you expect to see that event. Odds are the probability that an event will occur divided by the probability that it will not occur, so odds = P / (1 − P); going the other way, P = odds / (1 + odds).
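Two tiny helper functions illustrating these conversions:

```python
# Converting between probability and odds.
def prob_to_odds(p: float) -> float:
    """Odds = P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds: float) -> float:
    """P = odds / (1 + odds)."""
    return odds / (1 + odds)

print(prob_to_odds(0.75))  # 3.0 (3-to-1 odds)
print(odds_to_prob(3.0))   # 0.75
```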
The coefficient βi represents the change in the log odds of the outcome for a one-unit change in predictor Xi.
Exponentiated coefficients convert values to odds ratios (easier to interpret): odds ratio = e^βi. If OR = 1, there is no effect; if OR > 1, the odds of success increase with Xi; if OR < 1, the odds of success decrease with Xi.
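A minimal sketch on simulated data: fit a logit model, then exponentiate the coefficients to read them as odds ratios:

```python
# Logistic regression on simulated binary data, with odds ratios.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))  # true model: log odds = -0.5 + 1.2x
y = rng.binomial(1, p)

df = pd.DataFrame({"x": x, "y": y})
result = smf.logit("y ~ x", data=df).fit(disp=0)

print(result.params)          # coefficients on the log-odds scale
print(np.exp(result.params))  # odds ratios: OR > 1 means odds rise with x
```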