Statistics Cheat Sheet: Key Concepts and Formulas
Variables and Confounding
Types of Variables
Numerical (Measurable):
- Continuous (Range): Can take any value within a range (e.g., height, temperature).
- Discrete (Limited Value): Can take only specific, countable values (e.g., number of siblings).
Categorical:
- Nominal (Unordered): Categories have no inherent order.
- Ordinal (Ordered): Categories have a natural order.
Confounding Variables
A confounding variable is associated with both the explanatory and response variables, potentially influencing the observed relationship between them. It is not a consequence of the explanatory variable but rather a separate factor that can explain changes in the response variable.
Sampling Methods and Bias
Sampling Methods
- Simple Random Sample: Each member of the population has an equal chance of being selected.
- Stratified Sampling: The population is divided into subgroups (strata), and samples are randomly selected from each subgroup.
- Cluster Sampling: The population is divided into clusters, and a random sample of clusters is selected.
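The three methods above can be sketched with Python's standard library (the population and group sizes here are illustrative):

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sample: every member has an equal chance of selection
srs = random.sample(population, 10)

# Stratified sampling: divide into strata, then sample randomly within each
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: divide into clusters, then select whole clusters at random
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]
```

Note the difference between the last two: stratified sampling draws a few members from every subgroup, while cluster sampling keeps every member of a few randomly chosen subgroups.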
Sampling Bias
- Non-response Bias: Occurs when a substantial portion of the sample does not respond and the non-respondents differ systematically from the respondents, making the sample unrepresentative.
- Voluntary Response Bias: Occurs when individuals with strong opinions are more likely to participate, leading to a biased sample.
Experimental Design Principles
- Randomization: Subjects are randomly assigned to treatment and control groups to minimize bias.
- Replication: The experiment is repeated multiple times to ensure the results are reliable.
- Blocking: Individuals with similar characteristics are grouped together to control for potential confounding variables.
Descriptive Statistics
Measures of Central Tendency
- Population Mean (μ): The average of all values in the population.
- Sample Mean (x̄): A point estimate of the population mean, calculated from a sample.
- Median: The middle value when data is ordered, dividing the data into two equal halves.
Measures of Variability
- Variance (S²): The average of squared deviations from the mean (a sample variance divides the sum of squared deviations by n − 1), measuring the spread of data around the mean.
- Standard Deviation (S): The square root of the variance, expressed in the same units as the data and therefore easier to interpret.
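As a quick illustration, Python's statistics module computes these measures directly (the data values are made up):

```python
import statistics

data = [4, 8, 6, 5, 3, 7]

mean = statistics.mean(data)      # sum of values divided by n
median = statistics.median(data)  # middle value of the ordered data
var = statistics.variance(data)   # sample variance (divides by n - 1)
sd = statistics.stdev(data)       # square root of the sample variance
```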
Percentiles and Outliers
Percentiles
- Quartiles: Divide data into four equal parts: Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).
- Interquartile Range (IQR): The difference between Q3 and Q1, representing the spread of the middle 50% of the data.
Outliers
Observations that fall beyond the whiskers of a box plot (commonly, values more than 1.5 × IQR below Q1 or above Q3), indicating extreme values.
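The quartile-based outlier rule can be sketched as follows (the data set is illustrative; the 1.5 × IQR fence is the usual box-plot convention):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Box-plot convention: whiskers reach at most 1.5 * IQR beyond the quartiles;
# anything outside that range is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```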
Skewness and Normal Distribution
Skewness
The extent to which a distribution is asymmetrical, with tails extending more to one side than the other.
Normal Distribution
A bell-shaped distribution characterized by its mean and standard deviation, often used to model real-world data.
Z-Score
A standardized score that indicates how many standard deviations a data point is from the mean.
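A quick sketch of the z-score calculation (the exam scores are made up):

```python
import statistics

scores = [70, 75, 80, 85, 90]
mu = statistics.mean(scores)       # population mean
sigma = statistics.pstdev(scores)  # population standard deviation

# z-score: how many standard deviations a value lies from the mean
z = (90 - mu) / sigma
```

A value equal to the mean has z = 0; positive z means above the mean, negative z means below it.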
Sampling Distributions and Confidence Intervals
Sampling Distribution
The distribution of a statistic (e.g., sample mean) from all possible samples of a given size from a population.
Central Limit Theorem (CLT)
States that the sampling distribution of the sample mean is approximately normal when the observations are independent and the sample size is sufficiently large (a common rule of thumb is n ≥ 30), regardless of the shape of the population distribution.
Standard Error (SE)
The standard deviation of the sampling distribution, measuring the variability of the sample mean.
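A small simulation (hypothetical skewed population, seeded for reproducibility) illustrates both ideas: the sample means of a skewed distribution cluster around the population mean, and their spread approximates σ/√n:

```python
import random
import statistics

random.seed(0)

# Skewed population: exponential with mean 1 and standard deviation 1
n = 30  # size of each sample
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# By the CLT, the means cluster near the population mean (1.0),
# and their standard deviation approximates SE = sigma / sqrt(n)
se_observed = statistics.stdev(sample_means)
se_theoretical = 1 / n ** 0.5
```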
Confidence Interval (CI)
A range of values that is likely to contain the true population parameter with a certain level of confidence (e.g., 95%).
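A minimal sketch of a 95% CI for a mean, using the normal critical value 1.96 (a t critical value would give a slightly wider interval for a sample this small; the data are illustrative):

```python
import statistics

data = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
n = len(data)

mean = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5  # standard error of the mean

# 95% confidence interval: point estimate +/- critical value * SE
ci = (mean - 1.96 * se, mean + 1.96 * se)
```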
Hypothesis Testing
Hypothesis
A statement about a population parameter that we want to test.
P-value
The probability of obtaining the observed results or something more extreme if the null hypothesis is true.
Type I Error
Rejecting the null hypothesis when it is actually true (false positive).
Type II Error
Failing to reject the null hypothesis when it is actually false (false negative).
Two-Sample Tests
Two-Sample Proportion Test (Z-test)
Compares the proportions of two groups to determine if there is a statistically significant difference.
Chi-Square Test
Tests whether the observed frequencies in different categories of a categorical variable differ significantly from the expected frequencies.
Two-Sample T-test
Compares the means of two groups to determine if there is a statistically significant difference.
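To keep the sketch dependency-free, here is Welch's version of the two-sample t statistic implemented from its definition (the group values are made up):

```python
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [11.2, 11.5, 11.1, 11.4, 11.3]
t = welch_t(group_a, group_b)  # a large |t| suggests the means differ
```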
Analysis of Variance (ANOVA)
Compares the means of three or more groups to determine if there is a statistically significant difference between any of the groups.
Correlation and Regression
Correlation
Measures the strength and direction of the linear relationship between two continuous variables.
Pearson’s Correlation Coefficient (r)
- Ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation, and 1 indicating a perfect positive correlation.
- The closer r is to 0, the weaker the relationship.
- The closer r is to 1 or -1, the stronger the relationship.
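Pearson's r can be computed directly from its definition with the standard library (the paired data below are illustrative):

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's correlation coefficient from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)  # close to 1: strong positive relationship
```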
Regression Analysis
Models the relationship between a dependent variable and one or more independent variables.
Regression Line
The line that best fits the data points in a scatter plot, represented by the equation ŷ = a + bx, where a is the intercept and b is the slope.
Residuals
The differences between the observed values and the predicted values from the regression line.
Slope
The change in the dependent variable for a one-unit change in the independent variable.
Intercept
The value of the dependent variable when the independent variable is zero.
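The slope, intercept, and residuals above can be sketched from the least-squares formulas (the data are illustrative):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

mx, my = statistics.mean(x), statistics.mean(y)

# Least-squares fit for y^ = a + b x
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))  # slope
a = my - b * mx                          # intercept

# Residuals: observed minus predicted; for a least-squares fit they sum to ~0
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```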