Statistical Analysis of Data Distribution and Correlation

Posted on Nov 13, 2024 in Mathematics

Interval	Class mark (x_i)	Deviation from the mean (X)	Moment of order n (n = the power to which the deviation is raised)	Pearson coefficient
L_inf – L_sup	(L_inf + L_sup) / 2	x_i – X	(x_i – X)^n · f_i	b₂ = m₄ / (σ²)² (where m₄ is the fourth moment about the mean and σ² is the population variance)


Data Distribution Analysis

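The Pearson coefficient (B₂) is based on the fourth moment about the mean and measures kurtosis, that is, how pointed or flat a distribution is compared with the normal distribution, for which B₂ = 3.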

For example, using the reference values for samples of size n = 50:

If B₂ < 2.15, the distribution is flatter than normal (platykurtic).
If B₂ > 3.99, the distribution is more pointed than normal (leptokurtic).
If B₂ lies between 2.15 and 3.99, its departure from the normal value B₂ = 3 is not significant, and the distribution is treated as normal (mesokurtic).

For the example of the heights of the sample of 40 students, using the reference values for n = 50, because B₂ = 2.84 (greater than 2.15 and less than 3.99), we conclude that the kurtosis of the distribution does not differ significantly from that of a normal distribution (with a 10% risk of error).
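
As a rough illustration of the calculation, the following Python sketch computes B₂ from a grouped frequency table and compares it with the reference values 2.15 and 3.99. The height intervals and frequencies are hypothetical (the original data table is not reproduced here), and the function names are chosen only for this example.

```python
def grouped_b2(intervals, freqs):
    """Pearson's b2 = m4 / (sigma^2)^2 for a grouped frequency table."""
    # Class mark of each interval: (L_inf + L_sup) / 2
    marks = [(lo + hi) / 2 for lo, hi in intervals]
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(marks, freqs)) / n

    def central_moment(order):
        # Moment of the given order about the mean, weighted by frequency
        return sum((x - mean) ** order * f for x, f in zip(marks, freqs)) / n

    variance = central_moment(2)  # population variance
    return central_moment(4) / variance ** 2


def classify_kurtosis(b2, lower=2.15, upper=3.99):
    """Compare b2 with the reference values quoted in the text for n = 50."""
    if b2 < lower:
        return "platykurtic"
    if b2 > upper:
        return "leptokurtic"
    return "mesokurtic"


# Hypothetical height data (cm) for 40 students; NOT the original data set.
intervals = [(150, 155), (155, 160), (160, 165), (165, 170), (170, 175)]
freqs = [4, 9, 14, 9, 4]

b2 = grouped_b2(intervals, freqs)
print(round(b2, 3), classify_kurtosis(b2))
```

The thresholds are passed as parameters because they depend on the sample size and on the accepted risk of error.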

Standardization of Variables

a) The average results indicate the best performance was in Statistics (average: 5.6 points) and the lowest in Computing (average: 3.1 points).

b) Peter has a steady performance (4.5) across all three subjects, while Maria’s performance is more variable. However, their average score is the same (4.5).

c) Although Peter’s raw scores are identical, his standardized scores reveal variable performance relative to the group. His best performance was in Computing (Z = 1.17), and his poorest was in Statistics (Z = -1.22).

d) Maria’s standardized scores also show variable performance. Her best performance was in Computing (Z = 0.50), and her weakest was in Psychology (Z = -0.15).

e) Maria has a better average standardized score (Z = 0.19) than Peter (Z = 0.06), despite their equal raw score averages.
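
The standardization behind these comparisons is Z = (score – mean) / standard deviation. The sketch below applies it to Peter's score of 4.5 in Computing and Statistics. The subject means (3.1 and 5.6) come from the text; the standard deviations (1.2 and 0.9) are assumed values chosen so that the results match the Z scores quoted above, since the original data are not reproduced here.

```python
def z_score(score, mean, std_dev):
    """Standardized score: distance from the mean in standard deviations."""
    return (score - mean) / std_dev


# Means are taken from the text; standard deviations are ASSUMED for illustration.
subjects = {
    "Computing": {"mean": 3.1, "std_dev": 1.2},
    "Statistics": {"mean": 5.6, "std_dev": 0.9},
}

peter_raw = 4.5  # Peter's score in every subject

peter_z = {
    name: round(z_score(peter_raw, s["mean"], s["std_dev"]), 2)
    for name, s in subjects.items()
}
print(peter_z)
```

The same raw score of 4.5 is well above the group mean in Computing and well below it in Statistics, which is why identical raw scores can produce very different standardized scores.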

Standardized Scores and Percentiles

Standardized scores can be associated with percentiles, especially if the variable follows a normal distribution. The table of the standard normal distribution gives cumulative probabilities. For example, the cumulative probability up to Z = 1.62 is Φ(1.62) = 0.9474, so a standardized score of 1.62 lies at roughly the 95th percentile.
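
The tabulated value can be checked without a printed table. This minimal sketch uses the identity Φ(z) = ½ · (1 + erf(z / √2)) and Python's standard `math.erf`; the function name `normal_cdf` is chosen only for this example.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p = normal_cdf(1.62)
print(round(p, 4))        # cumulative probability up to Z = 1.62
print(round(100 * p, 1))  # the same value expressed as a percentile
```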

Correlation Coefficient Properties

Key properties of the Pearson correlation coefficient (r):

The value of r always lies between -1 and 1, inclusive.
r > 0 indicates a direct (linear) association, r = 0 indicates no linear relationship, and r < 0 indicates an inverse relationship.
r = 1 when all sample points fall exactly on a line with positive slope, and r = -1 when they all fall on a line with negative slope.
The higher the absolute value of r, the stronger the linear association.
The value of r is independent of the units of measurement.
r is symmetric: the correlation of X with Y equals the correlation of Y with X.
r is appropriate only for linear relationships.
A low r value doesn’t necessarily imply no relationship, just no linear relationship.
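
Several of these properties can be verified numerically. The sketch below computes r from its definition and checks symmetry, independence from units, the perfect-line case, and a curved relationship for which r = 0. All data values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                       # symmetry
r_rescaled = pearson_r([100 * x for x in xs], ys)   # change of units
r_line = pearson_r(xs, [3 * x + 1 for x in xs])     # points exactly on a line
r_parabola = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x^2

print(r, r_swapped, r_rescaled, r_line, r_parabola)
```

In the last case y is completely determined by x, yet r is 0, which illustrates that a low r rules out only a linear relationship.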

Significance of the Correlation Coefficient

Interpreting r is straightforward for values of 1, -1, or 0. For other values, inference is used. If the population correlation coefficient (ρ) is 0, there is no linear relationship between the variables. However, a sample correlation coefficient (r) can deviate from 0 by chance even when ρ = 0. To determine whether a sample r is compatible with a population in which ρ = 0, we can use hypothesis testing. For example, if r = 0.9254 for n = 5, the degrees of freedom are df = n – 2 = 3. At a 5% significance level (two-tailed), the critical value is 0.878. Since r > 0.878, we reject the hypothesis that ρ = 0 and conclude there is a direct linear relationship (with a 5% risk of error).
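
The critical value of r is related to Student's t distribution through t = r · √(n – 2) / √(1 – r²), or equivalently r_crit = t_crit / √(t_crit² + df). The sketch below reproduces the example under the assumption of a two-tailed test at the 5% level; the value t_crit = 3.182 for df = 3 is taken from a standard t table and hard-coded rather than computed.

```python
import math


def t_statistic(r, n):
    """t statistic for testing rho = 0 with a sample correlation r."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1.0 - r ** 2)


def critical_r(t_crit, df):
    """Critical |r| that corresponds to a critical t value."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


r, n = 0.9254, 5
df = n - 2
T_CRIT_5PCT_DF3 = 3.182  # two-tailed 5% critical t for df = 3 (from a t table)

t = t_statistic(r, n)
r_crit = critical_r(T_CRIT_5PCT_DF3, df)
reject_rho_zero = abs(r) > r_crit

print(round(t, 2), round(r_crit, 3), reject_rho_zero)
```

Comparing r with r_crit and comparing t with t_crit are the same decision rule expressed on two scales.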
