Statistical Analysis of Data Distribution and Correlation
Interval | Class Mark (xi) | Deviation from the Mean (X) | Moment of Order n (n = power to which the deviation is raised) | Pearson Coefficient (b2) |
---|---|---|---|---|
Linf – Lsup | (Linf + Lsup) / 2 | xi – X | (xi – X)^n * fi | b2 = m4 / (σ²)² (where m4 is the fourth moment about the mean and σ² is the population variance) |
Data Distribution Analysis
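Pearson's kurtosis coefficient (B2) is the fourth moment about the mean divided by the square of the variance. It measures how pointed or flat a distribution is compared with the normal distribution, for which B2 = 3.
For example, using the reference values for samples of size 50:
- If B2 < 2.15, the distribution is significantly flatter than normal (platykurtic).
- If B2 > 3.99, the distribution is significantly more pointed than normal (leptokurtic).
- If B2 lies between 2.15 and 3.99, the departure from B2 = 3 is not significant, and the distribution is treated as mesokurtic.
- As a purely descriptive guide, B2 > 3 suggests a shape more pointed than normal (leptokurtic), B2 < 3 a flatter shape (platykurtic), and B2 = 3 the kurtosis of the normal distribution (mesokurtic).
For the example of the heights of the sample of 40 students, using the reference values for n = 50 as an approximation, B2 = 2.84 lies between 2.15 and 3.99, so we cannot conclude that the kurtosis differs from that of a normal distribution (at the 10% significance level).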
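The calculation of B2 from a grouped frequency table can be sketched in Python as follows. This is a minimal illustration, not the procedure used for the heights example: the class marks and frequencies are hypothetical, the function names `pearson_b2` and `classify_kurtosis` are made up for this sketch, and the thresholds 2.15 and 3.99 are the n = 50 reference values quoted above.

```python
def pearson_b2(class_marks, frequencies):
    """Return b2 = m4 / m2**2 for grouped data, using population moments."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(class_marks, frequencies)) / n

    def central_moment(order):
        # Moment of the given order about the mean, weighted by frequency.
        return sum((x - mean) ** order * f
                   for x, f in zip(class_marks, frequencies)) / n

    return central_moment(4) / central_moment(2) ** 2


def classify_kurtosis(b2, lower=2.15, upper=3.99):
    """Compare b2 with the reference values (defaults: n = 50, from the text)."""
    if b2 < lower:
        return "platykurtic"
    if b2 > upper:
        return "leptokurtic"
    return "mesokurtic (no significant departure from 3)"


# Hypothetical grouped data: class marks and their frequencies.
marks = [1, 2, 3]
freqs = [1, 2, 1]
b2_example = pearson_b2(marks, freqs)  # m4 = 0.5, m2 = 0.5, so b2 = 2.0
print(b2_example, classify_kurtosis(b2_example))

# The value reported for the heights example.
print(classify_kurtosis(2.84))
```

With only three classes the hypothetical data are far too coarse for a real analysis; they are chosen so that the arithmetic can be checked by hand.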
Standardization of Variables
a) The subject averages show that the group performed best in Statistics (average: 5.6 points) and worst in Computing (average: 3.1 points).
b) Peter scored the same (4.5) in all three subjects, while Maria’s raw scores are more variable. However, their average raw score is the same (4.5).
c) Peter’s standardized scores show that his standing relative to the group varies by subject. His best relative performance was in Computing (Z = 1.17), and his poorest was in Statistics (Z = -1.22).
d) Maria’s standardized scores also vary by subject. Her best relative performance was in Computing (Z = 0.50), and her weakest was in Psychology (Z = -0.15).
e) Maria has a better average standardized score (Z = 0.19) than Peter (Z = 0.06), despite their equal raw score averages.
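The standardized scores above follow from Z = (x - mean) / standard deviation. The sketch below shows the calculation for Peter. The subject means (5.6 and 3.1) and Peter's raw score (4.5) come from the text, but the standard deviations (0.9 and 1.2) are assumptions chosen to be consistent with the quoted Z values, since the original data table is not reproduced here.

```python
def z_score(raw_score, mean, std_dev):
    """Standardize a raw score: distance from the mean in standard deviations."""
    return (raw_score - mean) / std_dev


peter_raw = 4.5  # Peter's score in every subject (from the text)

# Subject means are from the text; standard deviations are ASSUMED values.
z_statistics = z_score(peter_raw, mean=5.6, std_dev=0.9)
z_computing = z_score(peter_raw, mean=3.1, std_dev=1.2)

print(round(z_statistics, 2))  # -1.22
print(round(z_computing, 2))   # 1.17
```

The same raw score of 4.5 lands below the group in Statistics and above it in Computing, which is why equal raw averages can hide different relative standings.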
Standardized Scores and Percentiles
Standardized scores can be converted to percentiles, especially if the variable follows a normal distribution. The table of the standard normal distribution gives cumulative probabilities, that is, the proportion of values at or below a given Z. For example, the cumulative probability up to Z = 1.62 is Φ(1.62) = 0.9474, so a score with Z = 1.62 lies at roughly the 95th percentile.
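As a small check of the table lookup, the cumulative probability can be computed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch assumes the variable is normally distributed, as the paragraph above does.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probability up to Z = 1.62, as read from the table in the text.
cumulative = standard_normal.cdf(1.62)
print(round(cumulative, 4))  # 0.9474

# Going the other way: the Z value below which 94.74% of values fall.
z_back = standard_normal.inv_cdf(0.9474)
print(round(z_back, 2))  # 1.62
```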
Correlation Coefficient Properties
Key properties of the Pearson correlation coefficient (r):
- The value of r lies between -1 and 1, inclusive.
- r > 0 indicates a direct linear association, r = 0 indicates no linear relationship, and r < 0 indicates an inverse linear association.
- r = 1 or r = -1 when all sample points fall exactly on a straight line (r = 1 if the line slopes upward, r = -1 if it slopes downward).
- The higher the absolute value of r, the stronger the linear association.
- The value of r is independent of the units of measurement.
- r is symmetric: the correlation of X with Y equals the correlation of Y with X.
- r is appropriate only for linear relationships.
- A low r value doesn’t necessarily imply no relationship, just no linear relationship.
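Several of these properties can be checked numerically. The sketch below computes r directly from its definition. All data sets are hypothetical and were chosen only to illustrate symmetry, independence from units, a perfect linear relationship, and a strong nonlinear relationship with r = 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from sums of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 71, 85]

r_original = pearson_r(heights_cm, weights_kg)

# Symmetry: swapping the variables leaves r unchanged.
r_swapped = pearson_r(weights_kg, heights_cm)

# Unit independence: heights in metres instead of centimetres.
heights_m = [h / 100 for h in heights_cm]
r_rescaled = pearson_r(heights_m, weights_kg)

# All points on an upward-sloping line: r = 1.
r_line = pearson_r([1, 2, 3], [2, 4, 6])

# y = x**2 is a perfect but nonlinear relationship, and here r = 0.
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_original, r_swapped, r_rescaled, r_line, r_parabola)
```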
Significance of the Correlation Coefficient
Interpreting r is straightforward when it equals 1, -1, or 0. For other values, statistical inference is needed. If the population correlation coefficient (ρ) is 0, the variables have no linear relationship (and, if they are jointly normally distributed, they are independent). Even then, a sample correlation coefficient (r) will usually deviate from 0 because of sampling variation. To decide whether a sample r could come from a population with ρ = 0, we use a hypothesis test. For example, if r = 0.9254 for n = 5, the degrees of freedom are df = n - 2 = 3. At a 5% significance level (two-tailed), the critical value is 0.878. Since r > 0.878, we reject the hypothesis that ρ = 0 and conclude there is a direct linear relationship (with a 5% risk of error).
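The tabulated critical value of r can be related to Student's t distribution through t = r * sqrt(n - 2) / sqrt(1 - r²), which is a standard equivalent form of this test, although the text above works directly with the table of critical r values. The sketch below assumes the two-tailed 5% critical value of t for 3 degrees of freedom, 3.182, taken from a standard t table. The function names are made up for this illustration.

```python
import math


def t_statistic(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def critical_r(t_crit, df):
    """Critical value of r implied by a critical value of t."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


r_sample, n_sample = 0.9254, 5  # values from the example in the text
df = n_sample - 2

# ASSUMED input: two-tailed 5% critical t for df = 3, from a standard t table.
T_CRIT = 3.182

r_crit = critical_r(T_CRIT, df)
t_obs = t_statistic(r_sample, n_sample)

print(round(r_crit, 3))  # 0.878, matching the critical value quoted in the text
print(t_obs > T_CRIT)    # True: reject the hypothesis that rho = 0
```

Comparing r with 0.878 and comparing t with 3.182 always give the same decision, so either table can be used.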