Regression Analysis, ANOVA, Cluster, and Chi-Square

Regression Analysis Exercises

Key Concepts in Regression Analysis:

Formula (Coefficients): Y = b0 + b1X1 + b2X2 + …

  • Coefficients:
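    • b0 (intercept): The expected value of Y when every independent variable (X) equals 0.
    • b1: For each additional unit of X1, the average of Y changes by b1 (an increase if b1 is positive, a decrease if it is negative), with the other X variables held constant.
  • P-value: The probability of observing a coefficient at least this far from 0 if the true coefficient were 0. If p < 0.05, the coefficient is statistically significant at the 95% confidence level.
  • Lower 95% / Upper 95%: The bounds of the 95% confidence interval for the coefficient. We are 95% confident that the true coefficient lies between the lower and upper bounds.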
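
As a rough illustration of how b0 and b1 are obtained, the sketch below fits Y = b0 + b1X by least squares for a single independent variable. The function name and the data are hypothetical, chosen only for this example; regression software also reports the p-values and 95% bounds, which are not computed here.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the predicted Y when X = 0.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data that follow Y = 1 + 2X exactly.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_regression(x, y)  # b0 = 1.0, b1 = 2.0
```

Here b0 = 1 is the predicted Y at X = 0, and b1 = 2 means each extra unit of X raises the predicted Y by 2.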

Regression Analysis Metrics:

  • Multiple Correlation Coefficient (Multiple R): The correlation between the observed Y and the Y predicted from all X variables taken together. It ranges from 0 to 1 and measures the strength of the relationship; it is not a percentage of variance.
  • R-squared: The proportion of the variance in Y that our model explains (the higher, the better the model's fit).
  • Adjusted R-squared: R-squared adjusted for the number of independent variables in the model. Use it to compare models with different numbers of X variables, since plain R-squared never decreases when a variable is added.
  • Standard Error: The typical amount, in units of Y, by which the model's prediction of Y differs from the observed value.
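
The metrics above can be computed from the observed and predicted values of Y. The following is a minimal sketch, assuming the predictions come from a least-squares model with an intercept and k independent variables; the function name and the data are hypothetical.

```python
import math

def fit_metrics(y, y_hat, k):
    """Multiple R, R-squared, adjusted R-squared and standard error."""
    n = len(y)
    mean_y = sum(y) / n
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation in Y
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return {
        # For a least-squares fit with an intercept, Multiple R = sqrt(R^2);
        # max() guards against a negative value from arbitrary predictions.
        "multiple_r": math.sqrt(max(r2, 0.0)),
        "r2": r2,
        "adj_r2": adj_r2,
        "std_error": math.sqrt(sse / (n - k - 1)),
    }

# Hypothetical data: y_hat are least-squares predictions for x = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
metrics = fit_metrics(y, y_hat, k=1)
```

For this data R-squared is 0.6 and adjusted R-squared is lower (about 0.47), because the adjustment penalizes the model for using a predictor on only five observations.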

ANOVA (Analysis of Variance):

  • Significance F (P-value): The p-value of the F-test for the model as a whole. The lower, the better: if Significance F < 0.01, the model is statistically significant at the 99% confidence level (if < 0.05, at the 95% level).
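
To show where Significance F comes from, the sketch below computes the F statistic of the ANOVA table and applies the decision rule. The function names and data are hypothetical; converting F into Significance F requires the F distribution, which is left to statistical software here, so the values passed to the decision rule are illustrative only.

```python
def f_statistic(y, y_hat, k):
    """Overall F statistic for a regression with k independent variables."""
    n = len(y)
    mean_y = sum(y) / n
    ssr = sum((fi - mean_y) ** 2 for fi in y_hat)          # explained by the model
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # left unexplained
    return (ssr / k) / (sse / (n - k - 1))

def is_significant(significance_f, alpha=0.01):
    """Decision rule: the model is significant when Significance F < alpha."""
    return significance_f < alpha

# Hypothetical data: y_hat are least-squares predictions for x = 1..5.
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
f = f_statistic(y, y_hat, k=1)  # 4.5

# Illustrative Significance F values, as read from a regression output.
strong_model = is_significant(0.004)  # True: significant at the 99% level
weak_model = is_significant(0.20)     # False
```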

Dummy Variables:

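A qualitative variable can be included in a multiple regression model by coding it as binary (0, 1) dummy variables. 'K' designates the number of categories of the qualitative variable. If there are only two categories, a single dummy coded 0 and 1 is enough. If there are more, use K - 1 dummies, one for each category except a baseline category. Coding the categories as 1, 2, 3, etc. in a single variable would wrongly treat them as ordered and evenly spaced.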
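
A common way to code a qualitative variable with K categories is K - 1 binary (0/1) columns, with one category left out as the baseline. The sketch below illustrates this; the function name, the `baseline` parameter, and the region data are hypothetical.

```python
def dummy_code(values, baseline=None):
    """Return K - 1 binary columns for a qualitative variable with K categories."""
    categories = sorted(set(values))
    if baseline is None:
        baseline = categories[0]  # default baseline: first category alphabetically
    kept = [c for c in categories if c != baseline]
    # One 0/1 column per non-baseline category.
    return {c: [1 if v == c else 0 for v in values] for c in kept}

regions = ["north", "south", "east", "south", "north"]
dummies = dummy_code(regions, baseline="east")
# {"north": [1, 0, 0, 0, 1], "south": [0, 1, 0, 1, 0]}
```

A row with 0 in every column belongs to the baseline category ("east" here), and each dummy's coefficient is read as the difference from that baseline.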

Cluster Analysis

Cluster analysis is a statistical method for grouping data. It organizes items into groups (clusters) based on how closely associated they are. The objective is to find groups of similar subjects, where the similarity between each pair of subjects is judged across the whole set of characteristics. It is used when no assumption is made about the likely relationships within the data. Subjects are separated into groups so that each subject is more similar to other subjects in its group than to subjects outside the group.

  • Intra-cluster distance: The distance between data points *inside* the cluster (minimize this).
  • Inter-cluster distance: The distance between data points in *different* clusters (maximize this).
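
One simple way to measure these two distances is the average pairwise Euclidean distance, sketched below. Other definitions (for example, distances between cluster centroids) are also used; the function names and the points are hypothetical.

```python
import math
from itertools import combinations, product

def mean_intra_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

cluster_a = [(0, 0), (0, 1), (1, 0)]
cluster_b = [(5, 5), (5, 6), (6, 5)]

intra = mean_intra_distance(cluster_a)             # small: points are close together
inter = mean_inter_distance(cluster_a, cluster_b)  # large: clusters are far apart
```

A good clustering shows a small intra-cluster distance relative to the inter-cluster distance, as in this example.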

Types of Clustering:

  • Partitional: Divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  • Hierarchical: A set of nested clusters organized as a hierarchical tree.
  • Exclusive vs. Non-exclusive: In non-exclusive clustering, points may belong to multiple clusters.
  • Fuzzy vs. Non-fuzzy: In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  • Partial vs. Complete: In some cases, we only want to cluster some of the data.
  • Heterogeneous vs. Homogeneous: In homogeneous clustering, objects are similar to each other on the characteristics we are clustering for. How much similarity is enough is ambiguous and depends on the analyst.
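
To make the partitional, exclusive case concrete, the sketch below uses k-means, one well-known partitional method. The notes do not name a specific algorithm, so this choice, the function name, the starting centroids, and the points are all assumptions made for illustration.

```python
import math

def k_means(points, centroids, iterations=10):
    """Basic k-means: returns (labels, centroids) after convergence."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to exactly one (the nearest) centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the partition is stable
        centroids = new_centroids
    return labels, centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
# labels == [0, 0, 0, 1, 1, 1]
```

Every point receives exactly one label, which is what makes the result partitional and exclusive; a fuzzy method would instead return a weight between 0 and 1 for each point and cluster.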

Additional Concepts

  • K (constant): The value of the b0 intercept. In Multiple Regression Analysis (MRA), it is the value of the dependent variable (Y) given by the regression equation when all independent variables equal 0.
  • Partial Regression Coefficient: The coefficient of one independent variable, given that one or more other independent variables are included in the regression equation. It is the slope of the regression line between the independent variable of interest and the dependent variable, with the other independent variables held constant.
  • F-test: Tests the significance of the overall model.
  • T-test: Tests whether the partial regression coefficient for each independent variable represents a significant contribution to the overall model.
  • Stepwise Regression Analysis: One independent variable is added to the model in each step, so that variables are selected one at a time for inclusion in the final model.
  • Backward Stepwise Regression Analysis: We begin with all variables under consideration being included in the model, then remove one variable in each step.
  • Coefficient of Multiple Correlation (R): Indicates the extent of the relationship between the independent variables taken as a group and the dependent variable.
  • Coefficient of Multiple Determination (R^2): Indicates the proportion of variance in the dependent variable that is statistically accounted for by knowledge of the two or more independent variables.
  • Coefficient of Partial Correlation: Indicates the correlation between one of the independent variables in the MRA and the dependent variable, with the other independent variables held constant.
  • Coefficient of Partial Determination: The squared value of the coefficient of partial correlation. This coefficient indicates the proportion of variance statistically accounted for by one particular independent variable, with the other independent variables held constant.
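
The partial correlation and partial determination coefficients can be illustrated for the simplest case of one independent variable of interest (x) and a single other independent variable (z) held constant. The sketch below uses the standard first-order partial correlation formula; the function names and the data are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

def partial_correlation(x, y, z):
    """Correlation between x and y with z held constant."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical data: y is the dependent variable, x and z are independent variables.
x = [1, 2, 3, 4, 5]
z = [2, 1, 4, 3, 5]
y = [2, 4, 5, 4, 5]

r_partial = partial_correlation(x, y, z)
partial_determination = r_partial ** 2  # share of variance attributed to x alone
```

With more than one other independent variable, the same idea applies, but the calculation is usually left to statistical software.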

Chi-Square Test

Chi-Square for Representativeness:

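  • t > c (test statistic greater than the critical value): Not representative. The sample distribution differs significantly from the population distribution.
  • t < c (test statistic below the critical value): Representative. No significant difference was found.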
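
A minimal sketch of the representativeness check follows, reading t as the chi-square test statistic and c as the critical value. The sample counts and population shares are hypothetical. For this test the degrees of freedom are the number of categories minus 1, and 5.991 is the chi-square critical value for df = 2 at the 5% level.

```python
def chi_square_gof(observed, expected_shares):
    """Chi-square statistic comparing sample counts with population shares."""
    total = sum(observed)
    expected = [total * share for share in expected_shares]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical sample of 200 people in three age groups,
# compared with the known population shares of those groups.
observed = [90, 70, 40]
population_shares = [0.5, 0.3, 0.2]

t = chi_square_gof(observed, population_shares)  # about 2.67
c = 5.991                                        # critical value, df = 2, alpha = 0.05
representative = t < c                           # True: the sample is representative
```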

Chi-Square for Two Qualitative Variables:

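  • t > c (test statistic greater than the critical value): Yes, there is a statistically significant relationship between the two variables.
  • t < c (test statistic below the critical value): No statistically significant relationship was found.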
  • Degrees of Freedom (df) = (number of rows – 1) * (number of columns – 1)
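
The test for two qualitative variables can be sketched as follows, again reading t as the test statistic and c as the critical value. The contingency table is hypothetical, and 3.841 is the chi-square critical value for df = 1 at the 5% level.

```python
def chi_square_independence(table):
    """Return (t, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            t += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return t, df

# Hypothetical 2 x 2 table: rows = gender, columns = product preference.
table = [[30, 20],
         [20, 30]]

t, df = chi_square_independence(table)  # t = 4.0, df = 1
c = 3.841                               # critical value, df = 1, alpha = 0.05
related = t > c                         # True: significant relationship
```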