Understanding Regression Analysis and Variable Relationships
Relationship Between Variables and Regression: The term “regression” was introduced by Galton in his 1889 book Natural Inheritance, referring to the universal law of regression: each peculiarity in a person is shared by their descendants, but on average to a lesser degree (regression to the mean). His work focused on describing the physical traits of descendants (one variable) in terms of those of their parents (another variable).
Pearson (Galton’s friend) conducted a study of over 1,000 households, examining relationships such as: child height ≈ 85 cm + 0.5 * father’s height. Conclusion: very tall parents tend to have tall children, but the children’s heights regress about halfway back towards the average; the same applies to short parents. Today, “regression” refers to predicting the value of one variable from knowledge of another.
Joint Study of Two Variables
One way to collect data on two variables is by observing several individuals in a sample. Each row represents an individual, and each column represents the values of a variable. The individuals are not listed in any particular order. These observations can be represented in a scatterplot, where each individual is a point with coordinates representing the variable values. The goal is to determine if there is a relationship between the variables, what kind of relationship it is, and if possible, predict the value of one variable based on the other.
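As a concrete illustration, the sketch below draws a scatterplot with matplotlib; the height/weight values are made up purely for this example.

```python
# A minimal sketch of a scatterplot for two variables, using matplotlib.
# The height/weight values below are invented purely for illustration.
import matplotlib.pyplot as plt

# Each row of the sample is one individual; each column is a variable.
height = [160, 165, 170, 175, 180, 185]  # X, in cm
weight = [55, 62, 66, 72, 79, 84]        # Y, in kg

plt.scatter(height, weight)              # one point per individual
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatterplot of the sample")
plt.show()
```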
Direct and Inverse Relationships
- Uncorrelated: For values of X above average, values of Y are above and below average in similar proportions.
- Direct: For values of X greater than the mean, values of Y are also greater. For values of X lower than the mean, values of Y are also lower.
- Inverse: For values of X greater than the mean, values of Y tend to be lower than the mean. This is also called a decreasing relationship. (All three cases are simulated in the sketch below.)
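The following sketch simulates the three cases with NumPy (the parameters are arbitrary) and checks, for each one, how often an above-average X comes with an above-average Y.

```python
# A simulated illustration of the three cases above (not real course data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(170, 10, 500)
noise = rng.normal(0, 5, 500)

y_direct = 0.8 * x + noise         # direct: Y rises with X
y_inverse = -0.8 * x + noise       # inverse: Y falls as X rises
y_uncorr = rng.normal(70, 5, 500)  # uncorrelated: Y ignores X

for name, y in [("direct", y_direct), ("inverse", y_inverse), ("uncorrelated", y_uncorr)]:
    above_x = x > x.mean()
    # Among individuals with above-average X, what fraction also have above-average Y?
    frac = (y[above_x] > y.mean()).mean()
    print(f"{name}: fraction of Y above its mean, given X above its mean: {frac:.2f}")
# direct prints close to 1, inverse close to 0, uncorrelated close to 0.5
```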
When is a Regression Model Good?
The adequacy of a model depends on the ratio between the marginal dispersion of Y and the dispersion of Y conditional on X. By fixing values of X, we observe how Y is distributed: the distribution of Y for fixed values of X is called the conditional distribution, and the distribution of Y regardless of the value of X is called the marginal distribution. If conditioning on X substantially reduces the dispersion of Y, the regression model is worthwhile.
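As a hedged illustration, the sketch below simulates data where Y depends strongly on X and compares the marginal standard deviation of Y with its typical standard deviation inside narrow bands of X.

```python
# A sketch of the marginal-vs-conditional comparison, on simulated data.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 2000)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, 2000)  # Y depends strongly on X

print("marginal SD of Y:", y.std())  # large: the full spread of Y

# Conditional dispersion: SD of Y within narrow bands of X
bins = np.digitize(x, np.linspace(0, 10, 11))
cond_sds = [y[bins == b].std() for b in range(1, 11)]
print("typical conditional SD of Y given X:", np.mean(cond_sds))
# The conditional SD (about 1) is far smaller than the marginal SD (about 6),
# so a regression of Y on X is worthwhile here.
```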
Covariance of Two Variables X and Y
The covariance of two variables, Sxy = (1/n) * Σ (xi − x̄)(yi − ȳ), indicates whether the relationship is direct or inverse.
- Direct: Sxy > 0
- Inverse: Sxy < 0
- Uncorrelated: Sxy = 0
The sign of the covariance indicates whether the scatterplot trend is increasing or decreasing but doesn’t reveal the strength of the relationship.
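The sketch below computes Sxy directly from the definition above on made-up data, and checks it against NumPy (note that np.cov divides by n − 1 by default, so ddof=0 is needed to match this definition).

```python
# Covariance from its definition: Sxy = (1/n) * sum((xi - x̄) * (yi - ȳ)).
import numpy as np

x = np.array([160, 165, 170, 175, 180, 185], dtype=float)  # invented data
y = np.array([55, 62, 66, 72, 79, 84], dtype=float)

s_xy = ((x - x.mean()) * (y - y.mean())).mean()
print(s_xy)                        # positive here, so the relationship is direct
print(np.cov(x, y, ddof=0)[0, 1])  # same value via NumPy (ddof=0 divides by n)
```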
Pearson Linear Correlation Coefficient
The Pearson linear correlation coefficient, r = Sxy / (Sx * Sy), measures the tendency of data points to align along a straight line (it is undefined for horizontal and vertical lines, where one of the variables is constant). It has the same sign as the covariance Sxy, thus indicating a direct or inverse relationship. r is useful for detecting linear relationships but not other types (quadratic, logarithmic, etc.).
Properties of r
- Dimensionless.
- Ranges from -1 to +1.
- r = 0 indicates uncorrelated variables (no linear relationship).
- r = +1 or -1 indicates a perfect linear relationship.
- The closer r is to +1 or -1, the stronger the linear relationship (assuming no outliers).
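A minimal sketch, on the same made-up data as before, computing r from its definition and checking it against np.corrcoef:

```python
# Pearson's r from its definition: r = Sxy / (Sx * Sy).
import numpy as np

x = np.array([160, 165, 170, 175, 180, 185], dtype=float)  # invented data
y = np.array([55, 62, 66, 72, 79, 84], dtype=float)

s_xy = ((x - x.mean()) * (y - y.mean())).mean()
r = s_xy / (x.std() * y.std())    # dimensionless: the units of x and y cancel
print(r)                          # always between -1 and +1
print(np.corrcoef(x, y)[0, 1])    # same value via NumPy
```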
Frequently Asked Questions
If r = 0, does it mean the variables are independent? Not necessarily: r = 0 only rules out a linear relationship, and the variables may still be related in a nonlinear way (see the sketch below). The converse, however, is true: independence implies uncorrelation.
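A classic counterexample, sketched below: Y = X² on a symmetric range is completely determined by X, yet r is essentially zero.

```python
# Uncorrelated but strongly dependent: Y is a function of X, yet r ≈ 0.
import numpy as np

x = np.linspace(-3, 3, 201)
y = x ** 2                       # perfectly dependent on X, but not linearly
print(np.corrcoef(x, y)[0, 1])   # ≈ 0: no *linear* relationship
```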
I calculated r = 1.2. Is the relationship “superlinear”? This is a miscalculation. ‘r’ always falls between -1 and +1.
Above what value of r is there a good linear relationship? There is no universal threshold. This course suggests that |r| > 0.7 indicates a good linear relationship and |r| > 0.4 indicates some relationship (though in practice it is more nuanced, as outliers and variance homogeneity must also be considered).
Regression
Regression is a technique for predicting one variable in terms of another (or several others).
- Y = dependent variable
- X = independent/predictor/explanatory variable
The goal is to find a relationship: Y = f(X) + error, where f is a function and the error is random, small, and independent of X.
Simple Linear Regression Model
In a simple linear regression model with two variables (Y and X), we fit a line Ŷ = b0 + b1X, where b0 is the intercept and b1 is the slope. Y and Ŷ rarely coincide, even with a good model; the quantity e = Y − Ŷ is called the residual error.
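The sketch below fits the line by least squares on made-up data, using the standard formulas b1 = Sxy / Sx² and b0 = ȳ − b1 * x̄ (these formulas are standard, though the section above does not state them):

```python
# Fitting the least-squares line Ŷ = b0 + b1*X. The sample data are invented.
import numpy as np

x = np.array([160, 165, 170, 175, 180, 185], dtype=float)
y = np.array([55, 62, 66, 72, 79, 84], dtype=float)

# Standard least-squares formulas: b1 = Sxy / Sx^2, b0 = ȳ - b1*x̄
s_xy = ((x - x.mean()) * (y - y.mean())).mean()
b1 = s_xy / x.var()
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x
residuals = y - y_hat            # e = Y - Ŷ, one per individual
print(b0, b1)
print(residuals)
```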
Summary of Goodness of Fit
The goodness of fit is measured using the coefficient of determination (R²).
- R² is dimensionless and ranges from 0 to 1.
- A good fit has R² close to 1.
- A bad fit has R² close to 0.
- R² is the proportion of the variability of Y explained by the regression model (often quoted as a percentage).
- In a simple linear model, R² = r².
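The following sketch computes R² as one minus the ratio of residual to total variability, reusing the fit from the previous sketch, and verifies that it matches r² in the simple linear case:

```python
# R² = 1 - (residual variability / total variability), and the identity R² = r²
# for a simple linear fit. Same invented data and fit as in the earlier sketch.
import numpy as np

x = np.array([160, 165, 170, 175, 180, 185], dtype=float)
y = np.array([55, 62, 66, 72, 79, 84], dtype=float)

b1 = ((x - x.mean()) * (y - y.mean())).mean() / x.var()
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_res = ((y - y_hat) ** 2).sum()     # unexplained variability
ss_tot = ((y - y.mean()) ** 2).sum()  # total variability of Y
r2 = 1 - ss_res / ss_tot
print(r2)                             # close to 1: a good fit

r = np.corrcoef(x, y)[0, 1]
print(r ** 2)                         # equals R² in the simple linear model
```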
Other Regression Models
Other models can be considered depending on the scatterplot (nonlinear regression) or when the dependent variable relies on several variables (multiple regression).
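As a rough sketch on simulated data, the snippet below fits a quadratic model with np.polyfit (one nonlinear option) and a two-predictor multiple regression with np.linalg.lstsq; all parameter values are invented for illustration.

```python
# Two quick sketches on simulated data: a quadratic fit (nonlinear in X)
# and a multiple regression with two predictors.
import numpy as np

rng = np.random.default_rng(2)

# Nonlinear regression: fit Ŷ = c2*X² + c1*X + c0 when the scatterplot is curved
x = np.linspace(0, 10, 100)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(0, 0.5, 100)
coeffs = np.polyfit(x, y, deg=2)    # returned highest degree first: [c2, c1, c0]
print(coeffs)

# Multiple regression: Y depends on two predictors X1 and X2
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(0, 10, 100)
y2 = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 0.5, 100)
design = np.column_stack([np.ones(100), x1, x2])  # intercept + predictors
betas, *_ = np.linalg.lstsq(design, y2, rcond=None)
print(betas)                        # approximately [2.0, 1.5, -0.7]
```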