Multiple Regression & Cluster Analysis: A Comprehensive Guide
Multiple Regression Analysis
Multiple regression analysis is an extension of simple regression analysis that uses two or more independent variables to estimate the value of a dependent variable.
Multiple Regression Equation
The multiple regression equation identifies the best-fitting surface based on the method of least squares: a line with one independent variable, and a plane or hyperplane in higher-dimensional space with two or more. The calculations needed to determine the parameter estimates for multiple regression equations and their associated standard errors are complex and are normally done by software.
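To make the least-squares idea concrete, here is a minimal sketch that estimates the constant and two partial regression coefficients with NumPy. The data are invented purely for illustration:

```python
# Minimal least-squares fit for Y = b0 + b1*X1 + b2*X2 (invented data)
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])          # columns are X1 and X2
y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

# Prepend a column of ones so the first estimate is the constant b0
design = np.column_stack([np.ones(len(y)), X])
coeffs, residuals, rank, sv = np.linalg.lstsq(design, y, rcond=None)
print("b0, b1, b2 =", coeffs)
```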
Key Components
- Constant: Represents the intercept (b0) in the regression equation, indicating the value of the dependent variable (Y) when all independent variables are 0.
- Partial Regression Coefficient: Indicates the slope of the regression line between the independent variable of interest and the dependent variable, holding the other independent variables constant.
- F-test: Assesses the overall significance of the regression model.
- t-test: Determines the significance of individual partial regression coefficients in contributing to the overall model.
- Stepwise Regression Analysis (a sketch of the forward variant follows this list):
- Forward Stepwise: Starts with no variables and adds one independent variable at each step for model selection.
- Backward Stepwise: Starts with all variables included and removes one at each step.
- Coefficient of Multiple Correlation (R): Measures the relationship between independent variables as a group and the dependent variable.
- Coefficient of Multiple Determination (R²): Indicates the proportion of variance in the dependent variable explained by the independent variables.
- Adjusted Coefficient of Multiple Determination (Adjusted R²): Penalizes R² for the number of independent variables in the model, since R² can only increase as variables are added; this makes comparisons between models with different numbers of predictors more reliable.
- Coefficient of Partial Correlation: Shows the correlation between an independent variable and the dependent variable, holding other independent variables constant.
- Coefficient of Partial Determination: Squared value of the coefficient of partial correlation, representing the proportion of variance explained by one independent variable while holding others constant.
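As noted in the stepwise bullet above, here is a rough sketch of forward stepwise selection. Using adjusted R² as the entry criterion is a simplifying assumption for this sketch; statistical packages more commonly use F-to-enter tests, and the data here are synthetic:

```python
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R² of a least-squares fit of y on X (plus a constant)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_stepwise(X, y):
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        # Try each remaining variable and keep the one that helps most
        score, j = max((adjusted_r2(X[:, selected + [j]], y), j)
                       for j in remaining)
        if score <= best:          # no variable improves the model: stop
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected, best

# Demo on synthetic data where only the first two columns matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=100)
print(forward_stepwise(X, y))      # typically selects columns 0 and 1
```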
Dummy Variables
Dummy variables allow the inclusion of categorical data, such as sex or country of origin, in regression models. Each category is represented by a binary code (0 or 1). If there are k categories, k – 1 dummy variables are needed. For example, binary categories like male/female require only one dummy variable.
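A short sketch of dummy coding with pandas, assuming an invented DataFrame; drop_first=True implements the k – 1 rule and avoids perfect collinearity with the constant (the so-called dummy variable trap):

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [200, 250, 180, 300],
    "region": ["North", "South", "East", "North"],
})

# "region" has k = 3 categories, so k - 1 = 2 dummy columns are kept
coded = pd.get_dummies(df, columns=["region"], drop_first=True)
print(coded)
```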
Cluster Analysis Theory
Cluster analysis is a statistical method for processing data. It organizes items into groups (clusters) based on their similarity. The goal is to find groups of similar subjects, such that subjects within a group share more characteristics with each other than with subjects in other groups. It is commonly used in fields like market research and health analysis to identify patterns or segments within data.
Key aspects of cluster analysis include minimizing intra-cluster distance, so that data points within the same cluster are as similar as possible, and maximizing inter-cluster distance, so that data points in different clusters are as dissimilar as possible.
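To make this concrete, the sketch below runs k-means, a common partitional method that works by minimizing the total intra-cluster squared distance. The data are synthetic:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of synthetic 2-D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster centers:\n", km.cluster_centers_)
print("inertia (total intra-cluster squared distance):", km.inertia_)
```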
Types of Cluster Analysis
- Partitional: Divides data objects into non-overlapping subsets (clusters), with each data object belonging to exactly one subset.
- Hierarchical: Organizes a set of nested clusters into a hierarchical tree structure (a sketch follows this list).
- Non-exclusive: Data points may belong to multiple clusters.
- Exclusive: Data points belong to only one cluster.
- Fuzzy: Data points belong to clusters with a weight between 0 and 1, indicating partial membership.
- Non-fuzzy: Data points have binary membership, belonging entirely to one cluster or not at all.
- Partial: Only a portion of the data is clustered.
- Complete: The entire dataset is clustered.
- Heterogeneous: Clusters exhibit diverse characteristics among the observations within them.
- Homogeneous: Clusters demonstrate relatively similar characteristics within the categories being clustered.
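As referenced in the hierarchical bullet above, here is a sketch that builds a nested cluster tree with SciPy on synthetic data and then cuts it into two exclusive clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=3.0, scale=0.3, size=(10, 2)),
])

# Ward linkage merges, at each step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance
tree = linkage(points, method="ward")

# Cut the tree into two flat (exclusive) clusters
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```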
Multiple Regression and Correlation Analysis Example
Let's walk through the summary output of a formal regression model that estimates a dependent variable (Y) from independent variables (X1, X2, X3). Throughout this example, x stands in for the actual value reported by the software.
Regression Statistics
- Multiple R: x% of the variability in the dependent variable (Y) is associated with the variability in the independent variables (X1, X2, and X3) taken together.
- R Square: x% of variance in the dependent variable (Y) is explained by our regression model. This indicates a very good model (higher is better).
- Adjusted R Square: Plain R Square can only increase as independent variables are added, even when they contribute little, so the adjusted version penalizes the model for each extra variable. After this adjustment, our model explains x% of the variance in the dependent variable (Y). This is still a very good model (higher is better; a computation sketch follows this list).
- Standard Error in $: A prediction of the dependent variable (Y) made with our model will typically differ from the actual value by around $x.
- Observations: Our model includes x houses (observations) for each variable; keep in mind whether this is a small or a large sample.
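As referenced in the Adjusted R Square bullet, this sketch shows how the summary statistics relate to each other; the numeric values are placeholders, not output from a real model:

```python
# Placeholder values: R Square, number of observations, and number of
# independent variables
r2, n, k = 0.85, 50, 3

multiple_r = r2 ** 0.5                           # Multiple R
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Adjusted R Square
print(multiple_r, adj_r2)
```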
ANOVA Analysis
- SS: Sum of squared differences
- MS: Mean of squared differences (SS divided by its degrees of freedom)
- F: MS regression / MS residual = x
- Significance F: The p-value for the model as a whole; if it is written in scientific notation (contains an E), it is extremely small. The model is statistically significant at the (1 – Significance F) confidence level, so a tiny Significance F means confidence greater than 99% and a good model. If that confidence level is low, the model is not statistically significant. (A small computation sketch follows.)
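As referenced above, here is a sketch of how F and Significance F come out of the ANOVA table, using assumed values for the sums of squares and degrees of freedom:

```python
from scipy import stats

ss_reg, ss_res = 900.0, 100.0    # regression / residual sums of squares
df_reg, df_res = 3, 46           # k and n - k - 1

ms_reg = ss_reg / df_reg         # MS = SS / degrees of freedom
ms_res = ss_res / df_res
f_stat = ms_reg / ms_res         # F = MS regression / MS residual

# Significance F: probability of an F this large if no variable mattered
sig_f = stats.f.sf(f_stat, df_reg, df_res)
print(f_stat, sig_f)
```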
Coefficient Table
Coefficients
- Intercept → b0 → Constant: If all independent variables (X1, X2, X3) are equal to 0, the estimated average value of the dependent variable (Y) is $x.
- Partial Regression Coefficients (X1, X2, X3): For each one-unit increase in the respective independent variable, the dependent variable (Y) is expected to increase or decrease by $x, holding the other variables constant.
Formal Regression Equation
Ŷ = b0 + b1X1 + b2X2 + b3X3 + …
Standard Error
t-Stat = Coefficient / Standard Error
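A quick sketch of the t-test for a single coefficient, with placeholder numbers:

```python
from scipy import stats

coef, std_err = 12.5, 4.0        # placeholder coefficient and its SE
df_res = 46                      # residual degrees of freedom

t_stat = coef / std_err          # t-Stat = Coefficient / Standard Error
p_value = 2 * stats.t.sf(abs(t_stat), df_res)    # two-sided P-value
print(t_stat, p_value)
```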
P-Value (Risk)
- If the P-value contains an E, it is very small, and the confidence level is 1 – P-value. A coefficient is treated as statistically significant when this confidence level is high (above roughly 80–90%; 95% is the usual convention).
- Intercept: Statistically significant at the x% confidence level.
- X1: Statistically significant at the x% level, so X1 is a meaningful predictor.
- X2: Not statistically significant at the x% level → no reliable relationship with the dependent variable.
- X3A and X3B: Not statistically significant at all, so they are not important predictors.
- X3C: Statistically significant at the x% level (a high number), making it a very important predictor of the dependent variable.
Lower 95% & Upper 95% (Confidence Interval)
Intercept: We are 95% confident that the true coefficient b0 lies between x and x.
X1, X2, X3A, X3B, X3C: For each of these, we are 95% confident that the true coefficient lies between its respective lower and upper bound.
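Finally, a sketch of how the 95% bounds are computed from a coefficient and its standard error (placeholder numbers again):

```python
from scipy import stats

coef, std_err, df_res = 12.5, 4.0, 46
t_crit = stats.t.ppf(0.975, df_res)     # two-sided 95% critical value

lower = coef - t_crit * std_err         # Lower 95%
upper = coef + t_crit * std_err         # Upper 95%
print(lower, upper)
```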