Statistical Analysis: ANOVA, Linear Regression, and Logistic Regression
ANOVA: Analysis of Variance
ANOVA (Analysis of Variance) is used to compare means across multiple groups. It tests whether there are significant differences between the means of three or more groups and determines whether variation is due to differences between or within groups.
- Between-group variability: Measures how much the means of different groups differ from one another. It represents the systematic effect of the independent variable.
- Within-group variability: Measures how much individual data points within each group differ from the group mean. It represents random error or inherent variability in the data.
The total variability partitions as SST = SSB + SSW, where n denotes the sample size in each group.
Assumptions:
- Independence
- Normality
- Homogeneity of Variance
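Neither test is prescribed in these notes, but a common way to check the last two assumptions in Python is a Shapiro-Wilk test per group (normality) and Levene's test across groups (homogeneity of variance). A minimal sketch, assuming three made-up groups g1, g2, g3:

```python
# Minimal sketch: checking normality and equal variances before an ANOVA.
# g1, g2, g3 are hypothetical groups, not data from these notes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1, g2, g3 = rng.normal(5, 1, 30), rng.normal(6, 1, 30), rng.normal(7, 1, 30)

# Normality: Shapiro-Wilk on each group (p > .05 suggests no clear violation).
for name, g in [("g1", g1), ("g2", g2), ("g3", g3)]:
    w, p = stats.shapiro(g)
    print(f"{name}: W = {w:.3f}, p = {p:.3f}")

# Homogeneity of variance: Levene's test across all groups.
stat, p = stats.levene(g1, g2, g3)
print(f"Levene: statistic = {stat:.3f}, p = {p:.3f}")
```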
One-Way ANOVA
One-way ANOVA is used when comparing more than two groups based on a single independent variable.
Hypotheses:
- H0: μ1 = μ2 = … = μk (The means of all groups are equal)
- HA: At least one group mean is different.
F-statistic = MSB / MSW, where MSB = SSB / (k − 1) and MSW = SSW / (N − k).
Critical F: depends on both degrees of freedom (between-group df = k − 1, within-group df = N − k).
If F-obtained is greater than F-critical, we reject the null hypothesis.
Effect size: η² = SSB / SST, the proportion of variance in the outcome variable explained by group membership. Small: 0.01, Medium: 0.09, Large: 0.25.
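A minimal sketch of these calculations on made-up data, computing SSB, SSW, SST, the F-statistic, and η² by hand, then cross-checking F with scipy.stats.f_oneway:

```python
# One-way ANOVA by hand on hypothetical data, cross-checked with scipy.
import numpy as np
from scipy import stats

groups = [np.array([4.0, 5.0, 6.0, 5.5]),
          np.array([6.5, 7.0, 8.0, 7.5]),
          np.array([9.0, 8.5, 10.0, 9.5])]

grand_mean = np.concatenate(groups).mean()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between-group SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within-group SS
sst = ssb + ssw                                                   # SST = SSB + SSW

k = len(groups)                   # number of groups
N = sum(len(g) for g in groups)   # total sample size
msb, msw = ssb / (k - 1), ssw / (N - k)
f_stat = msb / msw
eta_sq = ssb / sst                # effect size: SSB / SST

print(f"F = {f_stat:.2f}, eta squared = {eta_sq:.2f}")
print(stats.f_oneway(*groups))    # should report the same F-statistic
```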
Post Hoc Testing
Tukey’s HSD:
- Use if you want to test the relationship between specific groups within your ANOVA.
- Use if you want to check all pairwise comparisons.
- Compares all pairwise group means to identify which specific pairs are different.
- Controls Type I error across all pairwise comparisons (assumes equal sample sizes).
Bonferroni:
- Use if you want to test the relationship between specific groups within your ANOVA.
- Controls Type I error rate by adjusting the significance threshold for multiple comparisons.
- More conservative than Tukey’s HSD, making it less likely to find significant differences.
Calculate the HSD statistic: T = q * sqrt(MSE / n), where n = number of subjects in one group.
To find q, use a studentized range table with df = total number of subjects − number of groups (N − k) and k = number of groups. MSE = MSW.
If the absolute difference between two group means is larger than the calculated HSD value, those means are significantly different.
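A minimal sketch of these pairwise comparisons with statsmodels' pairwise_tukeyhsd; the scores and group labels below are hypothetical:

```python
# Tukey's HSD on hypothetical scores for three groups A, B, C.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = np.array([4.0, 5.0, 6.0, 5.5,    # group A
                   6.5, 7.0, 8.0, 7.5,    # group B
                   9.0, 8.5, 10.0, 9.5])  # group C
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

# Compares every pair of group means; 'reject = True' marks pairs whose
# mean difference exceeds the HSD threshold.
result = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(result.summary())
```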
Factorial ANOVA
Factorial ANOVA allows analysis of two or more independent variables (factors) affecting a dependent variable. It assesses main effects (the effect of each factor on the dependent variable, averaging over the other factors) and interaction effects (whether the effect of one factor depends on the level of another).
- 2×2: 2 factors, each with 2 levels
- 2×3: 2 factors, one with 2 levels and another with 3 levels
- 2×4×6: 3 factors, one with 2 levels, another with 4, and another with 6 levels
Null hypotheses:
- Main effect of factor A: H0: μA1 = μA2
- Main effect of factor B: H0: μB1 = μB2
- Interaction effect: H0: (μA1,B1 − μA2,B1) = (μA1,B2 − μA2,B2) (The effect of factor A is the same across all levels of factor B, and the effect of factor B is the same across all levels of factor A)
Main effects assume no interaction and can’t be used to assess simple effects. They are null and void if you have a significant interaction effect.
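A minimal sketch of a 2×2 factorial ANOVA with statsmodels; the factors A and B and the outcome values are hypothetical:

```python
# 2x2 factorial ANOVA: two main effects plus their interaction.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "A": ["a1"] * 8 + ["a2"] * 8,
    "B": (["b1"] * 4 + ["b2"] * 4) * 2,
    "y": [3, 4, 3, 5, 6, 7, 6, 8, 5, 4, 6, 5, 12, 11, 13, 12],
})

# 'C(A) * C(B)' expands to C(A) + C(B) + C(A):C(B), i.e. both main effects
# and the interaction term.
model = smf.ols("y ~ C(A) * C(B)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # ANOVA table with F tests per effect
```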
Linear Regression
Purpose: To assess a linear relationship between two or more variables and form predictions. We can use our independent variable X to form predictions about our outcome variable Y, assuming a linear relationship. It models the best-fitting line to our data.
y = mx + b
- Simple linear regression: y = β0 + β1X + ε
- Multiple linear regression: Y = β0 + β1X1 + β2X2 + … + βnXn + ε
Best-fitting line: We can't fit a line that perfectly predicts all of our data points. Instead, we find the line that minimizes the sum of squared residuals.
Residuals: Observed outcome (yi) − predicted outcome (ŷi). The least-squares criterion minimizes Σ(yi − ŷi)².
Effect size: R² = SSM / SST represents the proportion of variance in the outcome that can be explained by the predictor(s). For example, if R² = 0.23, 23% of the variance in murder rate can be explained by unemployment. Small = 0.01, Medium = 0.09, Large = 0.25.
Standard error of estimate: Quantifies the typical size of a residual, √(MSE), where MSE = SSE / dfE. It tells us how far away, on average, our predicted values are from the actual values. (Example model: distance = β0 + β1 · speed.)
Null hypothesis: no relationship between the variables, H0: β1 = 0 (the population slope is 0). Alternative hypothesis: there is a relationship between the variables, HA: β1 ≠ 0.
Interpretation of the linear regression:
- Significance: p-value for each predictor
- Coefficients of predictors:
  - Continuous predictors: Describe the change in the predicted outcome for a one-unit increase in the predictor (holding all other predictors constant in a multiple regression).
  - Binary predictors: Describe the difference in the dependent variable between the two levels of the binary predictor.
- Intercept:
  - Continuous predictors: Predicted outcome when all predictors are equal to zero.
  - Binary predictors: The predicted value of the dependent variable for the reference group.
  - Often, the intercept does not make sense in context.
- R²: The amount of variability in the outcome that is accounted for by the predictors.
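A minimal sketch of fitting and reading a simple regression with statsmodels, using made-up speed/distance values to echo the distance = β0 + β1 · speed example above:

```python
# Simple linear regression: distance ~ speed (hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "speed":    [4, 7, 8, 9, 10, 12, 13, 15, 18, 20],
    "distance": [2, 4, 16, 10, 18, 24, 26, 34, 42, 48],
})

result = smf.ols("distance ~ speed", data=df).fit()
print(result.params)              # beta0 (intercept) and beta1 (slope)
print(result.pvalues["speed"])    # p-value for H0: beta1 = 0
print(result.rsquared)            # R^2: variance in distance explained by speed
print(np.sqrt(result.mse_resid))  # standard error of estimate, sqrt(SSE / dfE)
```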
Multiple Linear Regression
Y = β0 + β1X1 + β2X2 + β3X1X2 + ε (here β3X1X2 is an interaction term between X1 and X2)
Each coefficient represents the change in Y for a 1-unit change in the corresponding X, while holding the other predictors constant; with an interaction term, β3 describes how the effect of X1 on Y changes as X2 changes.
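A minimal sketch with simulated predictors x1 and x2; in the statsmodels formula, 'x1 * x2' expands to the main effects plus the x1:x2 interaction:

```python
# Multiple regression with an interaction term on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2 + 1.5 * x1 - 0.8 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.5, size=100)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
result = smf.ols("y ~ x1 * x2", data=df).fit()
print(result.params)  # beta0, beta1 (x1), beta2 (x2), beta3 (x1:x2)
```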
Regression with Categorical Predictors
This is used when the predictor is categorical and the outcome is continuous.
Differences from linear regression:
- No change in model fit
- Same linear trend
- Interpretation changes: the coefficient is the expected difference in the outcome between a category and the reference category (or between the two levels of a binary predictor).
Why: Because you can make predictions. ANOVA only tells us if there is a difference between groups. Categorical regression tells us how groups contribute to a much larger model.
Dummy coding: use k − 1 dummy variables (where k = number of groups), because including all k would create multicollinearity (predictors too related to each other; their variances can't be told apart). Dummy coding converts the categories into binary variables: each category is represented by its own dummy variable except for a single reference category, giving each category a unique pattern of 0s and 1s.
Limitations: Each dummy coefficient can only compare a category to the reference category, not categories to each other. To compare other pairs, rerun the same model with a different reference category, e.g., "premium cut" or "ideal cut" (see the sketch below).
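A minimal sketch of dummy coding in a regression formula; the cut categories and prices are made-up stand-ins for the diamond-cut example, and the second model simply switches the reference category:

```python
# Regression with a categorical predictor (dummy / treatment coding).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "cut":   ["fair", "fair", "premium", "premium", "ideal", "ideal"] * 5,
    "price": [300, 320, 480, 500, 610, 640] * 5,
})

# C(cut) creates k - 1 = 2 dummy variables; the alphabetically first level
# ("fair") is the default reference category.
m1 = smf.ols("price ~ C(cut)", data=df).fit()
print(m1.params)

# Compare against a different reference category by re-specifying it.
m2 = smf.ols("price ~ C(cut, Treatment(reference='ideal'))", data=df).fit()
print(m2.params)
```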
Logistic Regression
Logistic regression is used when our outcome is binary (has two levels, 0 or 1).
Differences from linear regression: We are not fitting a linear trend but a logistic (logit) curve. Interpretation also changes: coefficients describe a change in log odds rather than a change in the raw outcome value. Log odds → odds → probability.
Why: To constrain our predictions between 0 and 1, where 0 is the event not happening and 1 is the event happening.
Odds vs. probability: The probability that an event will occur is the fraction of the time you expect to see that event. Odds are the probability that an event will occur divided by the probability that it will not occur, so odds = P / (1 − P); going the other way, P = odds / (1 + odds).
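Two tiny helper functions illustrating these conversions:

```python
# Converting between probability and odds.
def prob_to_odds(p: float) -> float:
    """Odds = P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds: float) -> float:
    """P = odds / (1 + odds)."""
    return odds / (1 + odds)

print(prob_to_odds(0.75))  # 3.0 (3-to-1 odds)
print(odds_to_prob(3.0))   # 0.75
```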
The coefficient βi represents the change in the log odds of the outcome for a one-unit change in predictor Xi.
Exponentiated coefficients convert values to odds ratios (easier to interpret): odds ratio = e^βi. If OR = 1, there is no effect; if OR > 1, the odds of success increase with Xi; if OR < 1, the odds of success decrease with Xi.
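A minimal sketch on simulated data: fit a logit model, then exponentiate the coefficients to read them as odds ratios:

```python
# Logistic regression on simulated binary data, with odds ratios.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))  # true model: log odds = -0.5 + 1.2x
y = rng.binomial(1, p)

df = pd.DataFrame({"x": x, "y": y})
result = smf.logit("y ~ x", data=df).fit(disp=0)

print(result.params)          # coefficients on the log-odds scale
print(np.exp(result.params))  # odds ratios: OR > 1 means odds rise with x
```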