Statistical Sampling and Analysis Techniques
Correlation Coefficients
Pearson Correlation Coefficient (r)
Measures the strength and direction of the linear relationship between two continuous variables.
Key Characteristics
- Range: r values range from -1 to +1.
- r = +1: Perfect positive linear relationship.
- r = -1: Perfect negative linear relationship.
- r = 0: No linear relationship.
Formula: r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)^2 ⋅ Σ(y – ȳ)^2], where x̄ and ȳ are the sample means of x and y.
Assumptions: Variables are continuous and normally distributed. The relationship between variables is linear. No significant outliers.
Applications: Assessing the relationship between study hours and test scores. Determining the correlation between temperature and ice cream sales.
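As a quick illustration, a minimal Python sketch of this formula; the hours/scores data are made up:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of cross-deviations divided by the square root of the product of squared deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Example: study hours vs. test scores (hypothetical data)
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 75]
print(round(pearson_r(hours, scores), 3))  # close to +1: strong positive linear relationship
```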
Spearman Rank Correlation Coefficient (ρ)
Measures the strength and direction of a monotonic relationship between two variables using their ranks.
Key Characteristics
- Range: ρ values range from -1 to +1.
- ρ = +1: Perfect monotonic increasing relationship.
- ρ = -1: Perfect monotonic decreasing relationship.
- ρ = 0: No monotonic relationship.
Formula (no tied ranks): ρ = 1 – (6 Σ d^2) / (n(n^2 – 1)), where d is the difference between the two ranks of each observation and n is the number of observations.
Assumptions: Variables can be ordinal, interval, or ratio. Measures monotonic relationships (linear or nonlinear).
Applications: Assessing the relationship between rank in a competition and practice hours. Studying the correlation between income level and job satisfaction rank.
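A minimal sketch of the rank-based calculation using the simplified d² formula (exact only when there are no tied ranks); scipy's spearmanr is shown as a cross-check. The practice/rank data are invented:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_rho(x, y):
    """Spearman rho via 1 - 6*sum(d^2)/(n*(n^2 - 1)); valid when there are no ties."""
    rx, ry = rankdata(x), rankdata(y)
    d = rx - ry
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Example: practice hours vs. finishing rank in a competition (hypothetical data)
practice = [10, 25, 7, 40, 15]
finish_rank = [4, 2, 5, 1, 3]
print(round(spearman_rho(practice, finish_rank), 3))            # -1.0: perfect monotonic decrease
print(round(spearmanr(practice, finish_rank).correlation, 3))   # library cross-check
```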
Types of Correlations
1. Based on Direction
- Positive Correlation: Both variables move in the same direction.
- Definition: As one variable increases, the other also increases, and vice versa.
- Examples: Study hours and exam scores. Income and expenditure.
- Graphical Representation: Points form an upward-sloping line.
- Negative Correlation: Variables move in opposite directions.
- Definition: As one variable increases, the other decreases, and vice versa.
- Examples: Speed and travel time. Price of a product and demand.
- Graphical Representation: Points form a downward-sloping line.
- No Correlation: No discernible relationship between variables.
- Definition: Changes in one variable do not predict changes in the other.
- Examples: Shoe size and intelligence. Rainfall and daily stock market returns.
- Graphical Representation: Points are scattered randomly.
2. Based on Strength
- Perfect Correlation: The relationship between variables is exact.
- Example: Distance traveled and time (at constant speed).
- Graph: All points lie on a straight line.
- Strong Correlation: Variables have a clear, strong relationship but not perfect.
- Example: Height and weight in adults.
- Graph: Points are closely packed around a straight line.
- Weak Correlation: Variables show a slight relationship, but the association is not strong.
- Example: Hours of sleep and productivity.
- Graph: Points are loosely scattered but show some pattern.
3. Based on Nature of Relationship
- Linear Correlation: Relationship can be represented by a straight line.
- Example: Age and height in children.
- Non-Linear (Curvilinear) Correlation: Relationship cannot be represented by a straight line.
- Example: Stress levels and performance (Yerkes-Dodson curve).
Sampling Techniques
1. Probability Sampling
- Simple Random:
- Definition: Every individual in the population has an equal chance of being selected.
- Method: Use random number generators or draw lots.
- Example: Selecting 50 students from a university using a random number table.
- Systematic:
- Definition: Select every kth individual from a list after randomly choosing a starting point.
- Formula: k = Population size / Sample size
- Example: In a list of 1,000 people, selecting every 10th person after starting at a random position.
- Stratified:
- Definition: Divide the population into subgroups (strata) based on a shared characteristic, then randomly sample from each stratum.
- Example: A school with 60% male and 40% female students; ensure the sample maintains the same gender proportion.
- Cluster:
- Definition: Divide the population into clusters (e.g., geographic areas) and randomly select entire clusters.
- Example: Surveying students from randomly selected schools instead of individual students from all schools.
- Multi-Stage:
- Definition: Combine several sampling methods in stages.
- Example: Randomly select states, then districts, then schools within those districts, and finally students.
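A minimal Python sketch of three of these schemes (simple random, systematic, stratified); the population of 1,000 and the 60/40 strata are illustrative assumptions:

```python
import random

random.seed(42)
population = list(range(1, 1001))  # hypothetical list of 1,000 people

# Simple random: every individual has an equal chance of selection
simple = random.sample(population, 100)

# Systematic: every k-th person after a random start, k = population size / sample size
k = len(population) // 100
start = random.randrange(k)
systematic = population[start::k]

# Stratified: sample each stratum in proportion to its size (assumed 60%/40% split)
strata = {"male": population[:600], "female": population[600:]}
stratified = [p for name, group in strata.items()
              for p in random.sample(group, int(0.1 * len(group)))]

print(len(simple), len(systematic), len(stratified))  # 100 100 100
```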
2. Non-Probability Sampling
- Convenience:
- Definition: Select samples that are easy to access or available.
- Example: Surveying people at a nearby mall.
- Judgmental:
- Definition: The researcher selects participants based on their judgment of who will provide the best information.
- Example: Interviewing doctors to study medical practices.
- Quota:
- Definition: Divide the population into groups and fill a pre-determined quota for each group.
- Example: Interviewing 30 males and 30 females, selected non-randomly until each quota is filled.
- Snowball:
- Definition: Existing participants recruit others to participate in the study.
- Example: Studying a hidden population like drug users or refugees.
Probability Distributions
Normal Distribution
A continuous probability distribution that is symmetric about the mean, forming a bell-shaped curve.
Key Characteristics
- Shape: Symmetrical and bell-shaped.
- Parameters:
- Mean (μ): Determines the center of the curve.
- Standard Deviation (σ): Determines the spread or width of the curve.
- Probability Density Function: f(x) = (1 / (σ√(2π))) ⋅ e^(–(x – μ)^2 / (2σ^2))
- Empirical Rule: 68% of data lies within ±1σ of the mean. 95% within ±2σ. 99.7% within ±3σ.
Applications: Heights, weights, IQ scores, and measurement errors typically follow a normal distribution. Central to many statistical methods due to the Central Limit Theorem.
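A short sketch that checks the empirical rule numerically with scipy.stats.norm; the values of μ and σ are arbitrary:

```python
from scipy.stats import norm

mu, sigma = 100, 15  # illustrative IQ-like scale
dist = norm(loc=mu, scale=sigma)

for k in (1, 2, 3):
    prob = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"P(mu - {k}σ < X < mu + {k}σ) = {prob:.4f}")
# ≈ 0.6827, 0.9545, 0.9973 — the 68–95–99.7 empirical rule
```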
Binomial Distribution
A discrete probability distribution that models the number of successes in a fixed number of independent trials, each with the same probability of success.
Key Characteristics
- Parameters:
- n: Number of trials.
- p: Probability of success in each trial.
- Probability Mass Function: P(X = k) = C(n, k) ⋅ p^k ⋅ (1 – p)^(n – k), for k = 0, 1, …, n
- Mean and Variance:
- Mean (μ) = n ⋅ p
- Variance (σ^2) = n ⋅ p ⋅ (1 – p)
Applications: Tossing a coin n times and counting heads. Quality control (e.g., defective vs. non-defective items).
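A small sketch of the binomial PMF, mean, and variance for a coin-tossing example (n = 10 tosses and p = 0.5 are assumed):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)"""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 tosses of a fair coin
print(sum(binom_pmf(k, n, p) for k in range(n + 1)))  # probabilities sum to 1
print(binom_pmf(5, n, p))                             # P(exactly 5 heads) ≈ 0.246
print("mean =", n * p, " variance =", n * p * (1 - p))
```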
Poisson Distribution
A discrete probability distribution that models the number of events occurring in a fixed interval of time or space, given that the events occur independently and at a constant rate.
Key Characteristics
- Parameter: λ: Average number of events in the interval (rate parameter).
- Probability Mass Function: P(X = k) = (λ^k ⋅ e^(–λ)) / k!, for k = 0, 1, 2, …
- Mean and Variance:
- Mean (μ) = λ.
- Variance (σ^2) = λ.
- Assumptions:
- Events occur independently.
- Events are rare in a given interval.
Applications: Counting phone calls to a call center per hour. Number of traffic accidents at a junction per day. DNA mutation counts in a specific length of genome.
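A minimal sketch of the Poisson PMF for the call-center example, assuming an average rate of λ = 4 calls per hour:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) = lam^k * e^(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)

lam = 4  # assumed average of 4 calls per hour
print(poisson_pmf(0, lam))                         # P(no calls in an hour)
print(sum(poisson_pmf(k, lam) for k in range(3)))  # P(fewer than 3 calls)
print("mean = variance =", lam)
```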
Significance and Confidence Levels
| Level of Significance (α) | Confidence Level |
|---|---|
| The probability of rejecting a true null hypothesis (Type I error). | The probability that the confidence interval contains the true population parameter. |
| Common values: 0.05, 0.01, 0.10 | Common values: 90%, 95%, 99% |
| α = 1 – Confidence Level (expressed as a proportion) | Confidence Level = (1 – α) × 100% |
| The likelihood of making a false positive error (rejecting a true null). | The degree of certainty that the true parameter is within the interval. |
| α = 0.05: 5% chance of making a Type I error. | 95% confidence: 95% of such intervals would contain the true population mean. |
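To make the α / confidence-level relationship concrete, a sketch of a 95% confidence interval for a mean using scipy's t distribution; the sample data are invented:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])  # hypothetical sample
alpha = 0.05             # significance level
confidence = 1 - alpha   # 95% confidence level

mean = data.mean()
sem = stats.sem(data)    # standard error of the mean
low, high = stats.t.interval(confidence, len(data) - 1, loc=mean, scale=sem)
print(f"{confidence:.0%} CI for the mean: ({low:.3f}, {high:.3f})")
```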
F-Ratio in ANOVA
The F-ratio is a statistical measure used in Analysis of Variance (ANOVA) to determine whether there are significant differences between the means of multiple groups. It is the ratio of two variances and helps in testing hypotheses about group means.
The F-ratio is calculated by dividing the between-group variance by the within-group variance: F = MSB / MSW.
In simpler terms:
- Between-group variance (MSB) represents the variance due to the differences between the group means.
- Within-group variance (MSW) represents the variance within each group, accounting for random error or individual differences.
Interpretation
The F-ratio tests whether the variability between the group means is larger than the variability within the groups (random error).
- High F-Ratio: A large F-ratio suggests that the variance between the group means is much greater than the variance within the groups, indicating that the differences between the group means are statistically significant. This means that at least one group mean is different from the others.
- Low F-Ratio: A small F-ratio suggests that the variance between group means is similar to or smaller than the variance within the groups. This implies that any observed differences between the groups may be due to random chance, and the null hypothesis is not rejected.
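A quick sketch of computing a one-way F-ratio with scipy.stats.f_oneway on made-up group data:

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three groups
group_a = [23, 25, 27, 22, 26]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A large F (small p) suggests at least one group mean differs from the others.
```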
Steps in Two-Way ANOVA
- State the Hypotheses:
- Null Hypothesis for Factor A (Main Effect A).
- Null Hypothesis for Factor B (Main Effect B).
- Null Hypothesis for Interaction (AB).
- Calculate the Sums of Squares (SS):
- SSA (Factor A)
- SSB (Factor B)
- SSAB (Interaction AB)
- SSW (Within Groups / Error)
- SST (Total Sum of Squares)
- Calculate the Degrees of Freedom (df):
- df for Factor A.
- df for Factor B.
- df for Interaction (AB).
- df for Within Groups.
- Calculate the Mean Squares (MS):
- MSA
- MSB
- MSAB
- MSW
- Calculate the F-Ratios:
- F for Factor A.
- F for Factor B.
- F for Interaction (AB).
- Decision Rule: Compare the calculated F-ratio for each factor and interaction to the critical F-value from the F-distribution table with the appropriate degrees of freedom at a chosen significance level (usually 0.05). If the calculated F-ratio is greater than the critical value, reject the null hypothesis for that factor or interaction.
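A sketch of these steps using statsmodels' formula interface, assuming pandas and statsmodels are available; the data frame, factor names (A, B), and response y are hypothetical, and anova_lm performs the SS/df/MS/F computations listed above:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical balanced design: response y under two factors A and B (3 replicates per cell)
df = pd.DataFrame({
    "A": ["low", "low", "high", "high"] * 3,
    "B": ["daily", "weekly"] * 6,
    "y": [20, 18, 25, 22, 21, 17, 26, 23, 19, 16, 27, 24],
})

model = smf.ols("y ~ C(A) * C(B)", data=df).fit()  # main effects + interaction
table = sm.stats.anova_lm(model, typ=2)            # SS, df, F, and p for A, B, and A:B
print(table)
```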
Q: The table below gives the sample size (n), average span, and standard deviation (s) for three plants. Using one-way ANOVA, compute the F-ratio to test whether the mean spans differ.
| Plant | n | Average Span | s |
|---|---|---|---|
| Hibiscus | 5 | 12 | 2 |
| Marigold | 5 | 16 | 1 |
| Rose | 5 | 20 | 4 |
p (number of groups) = 3
n (observations per group) = 5
N (total observations) = 15
x̅ (grand mean) = (12 + 16 + 20) / 3 = 16
SST (between-group/treatment sum of squares) = Σ n(x̄ᵢ – x̅)^2, where x̄ᵢ is each group mean
SST = 5(12 – 16)^2 + 5(16 – 16)^2 + 5(20 – 16)^2 = 160
MST = SST / (p – 1) = 160 / (3 – 1) = 80
SSE (within-group/error sum of squares) = Σ(n – 1)s^2 = 4(2^2) + 4(1^2) + 4(4^2) = 84
MSE = SSE / (N – p) = 84 / (15 – 3) = 7
F = MST / MSE = 80 / 7
F = 11.429
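The same calculation can be reproduced from the summary statistics alone; a short Python check:

```python
# Summary statistics from the table: (n, mean, s) for each plant
groups = [(5, 12, 2), (5, 16, 1), (5, 20, 4)]
p = len(groups)
N = sum(n for n, _, _ in groups)
grand_mean = sum(n * m for n, m, _ in groups) / N

sst = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)  # between-group (treatment) SS
sse = sum((n - 1) * s ** 2 for n, _, s in groups)           # within-group (error) SS
mst, mse = sst / (p - 1), sse / (N - p)
print(sst, sse, round(mst / mse, 3))  # 160 84 11.429
```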
Parametric vs. Non-Parametric Tests
Parametric Tests
Definition: Statistical tests that assume the data follow a specific underlying distribution (usually normal) and are applied to continuous, quantitative data.
Key Characteristics
- Assumes normal distribution of data.
- Requires homogeneity of variance (equal variances across groups).
- Sensitive to outliers and non-normal data.
Type of Data: Works with interval or ratio scale data (e.g., height, weight, test scores).
Examples:
- t-Test: Compares the means of two groups.
- Independent t-Test: For unrelated groups.
- Paired t-Test: For related groups (e.g., pre-test vs. post-test).
- ANOVA (Analysis of Variance): Compares means across three or more groups.
- Pearson Correlation: Measures linear relationships between two variables.
- Regression Analysis: Models the relationship between variables.
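Illustrative calls to the parametric tests listed above using scipy; the data are invented:

```python
from scipy import stats

before = [72, 75, 70, 68, 74, 73]       # hypothetical pre-test scores
after = [78, 80, 74, 71, 79, 77]        # hypothetical post-test scores
other_group = [65, 70, 68, 66, 71, 69]  # hypothetical unrelated group

print(stats.ttest_ind(before, other_group))        # independent t-test: unrelated groups
print(stats.ttest_rel(before, after))              # paired t-test: related groups
print(stats.f_oneway(before, after, other_group))  # one-way ANOVA: 3+ groups
print(stats.pearsonr(before, after))               # Pearson correlation
```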
Non-Parametric Tests
Definition: Statistical tests that do not rely on data following a specific distribution, making them more versatile for various types of data.
Key Characteristics
- No assumption of normality or equal variance.
- Suitable for small sample sizes and non-normally distributed data.
- Robust against outliers.
Type of Data: Works with ordinal, nominal, or non-normal continuous data (e.g., satisfaction ratings, ranks, frequencies).
Examples:
- Mann-Whitney U Test: Non-parametric equivalent of the independent t-test.
- Wilcoxon Signed-Rank Test: Non-parametric equivalent of the paired t-test.
- Kruskal-Wallis Test: Non-parametric equivalent of ANOVA.
- Spearman’s Rank Correlation: Measures monotonic relationships between two variables.
- Chi-Square Test: Compares frequencies in categorical data.
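Their non-parametric counterparts, again with scipy and invented data:

```python
from scipy import stats

group_1 = [3, 5, 4, 2, 5, 4]  # e.g., satisfaction ratings (ordinal, hypothetical)
group_2 = [2, 3, 3, 1, 4, 2]
group_3 = [4, 5, 5, 3, 4, 5]

print(stats.mannwhitneyu(group_1, group_2))      # Mann-Whitney U: independent groups
print(stats.wilcoxon(group_1, group_2))          # Wilcoxon signed-rank: paired data
print(stats.kruskal(group_1, group_2, group_3))  # Kruskal-Wallis: 3+ groups
print(stats.spearmanr(group_1, group_3))         # Spearman rank correlation

observed = [[20, 30], [25, 25]]                  # hypothetical contingency table of frequencies
print(stats.chi2_contingency(observed))          # Chi-square test of independence
```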
Type I and Type II Errors
Type I Error (False Positive)
A Type I error occurs when we reject a null hypothesis that is actually true. In other words, it’s a “false positive” result.
Key Points
- Definition: The test suggests there is an effect or difference when, in fact, there isn’t.
- Symbol: Denoted by α, also called the significance level. Typically, α = 0.05, meaning a 5% chance of making a Type I error.
- Consequences: Concluding that a treatment or intervention works when it does not, or detecting a difference when there is none.
- Example: A medical test falsely indicates that a healthy person has a disease (false positive).
- Control: Reducing the significance level (α) reduces the risk of a Type I error, but this also makes it harder to reject the null hypothesis. A smaller α (e.g., 0.01) lowers the chance of a Type I error but may increase the chance of a Type II error.
Type II Error (False Negative)
A Type II error occurs when we fail to reject a null hypothesis that is actually false. In other words, it’s a “false negative” result.
Key Points
- Definition: The test fails to detect a true effect or difference that exists.
- Symbol: Denoted by β. The power of a test is 1 – β, representing the probability of correctly rejecting a false null hypothesis.
- Consequences: Failing to detect a treatment’s effectiveness or missing a real difference between groups.
- Example: A medical test fails to detect a disease in a person who actually has it (false negative).
- Control: Increasing the sample size improves the test’s power, which reduces the risk of a Type II error. A more powerful test reduces β, increasing the likelihood of detecting a true effect if it exists.
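A small simulation sketch of both error rates, assuming a two-sample t-test, n = 30 per group, and a true effect of 0.5 standard deviations when one exists:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 5000

# Type I error rate: both groups come from the same distribution, yet H0 is sometimes rejected
type1 = np.mean([stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
                 for _ in range(trials)])

# Type II error rate: a real difference exists (effect size 0.5), yet H0 is sometimes not rejected
type2 = np.mean([stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.5, 1, n)).pvalue >= alpha
                 for _ in range(trials)])

print(f"Type I rate ≈ {type1:.3f} (should be close to α = {alpha})")
print(f"Type II rate ≈ {type2:.3f}; power ≈ {1 - type2:.3f}")
```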
Q: 7 fair dice are thrown 729 times. How many times do you expect at least four dice to show a three or a five?
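Solution: Call a die showing a three or a five a success, so p = 2/6 = 1/3 for each die, and the number of successes among the 7 dice follows a Binomial(n = 7, p = 1/3) distribution.
P(at least 4 successes) = Σ C(7, k) ⋅ (1/3)^k ⋅ (2/3)^(7 – k) for k = 4 to 7 = (280 + 84 + 14 + 1) / 2187 = 379/2187
Expected number of throws (out of 729) with at least four such faces = 729 ⋅ 379/2187 = 379/3 ≈ 126.33, i.e., about 126 times.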