Statistics Cheat Sheet: Key Concepts and Formulas
Variables and Confounding
Types of Variables
Numerical (Measurable):
- Continuous (Range): Can take any value within a range (e.g., height, temperature).
- Discrete (Limited Value): Can take only specific, countable values (e.g., number of siblings).
Categorical:
- Nominal (Unordered): Categories have no inherent order.
- Ordinal (Ordered): Categories have a natural order.
Confounding Variables
A confounding variable is associated with both the explanatory and response variables, potentially influencing the observed relationship between them. It is not a consequence of the explanatory variable but rather a separate factor that can explain changes in the response variable.
Sampling Methods and Bias
Sampling Methods
- Simple Random Sample: Each member of the population has an equal chance of being selected.
- Stratified Sampling: The population is divided into subgroups (strata), and samples are randomly selected from each subgroup.
- Cluster Sampling: The population is divided into clusters, and a random sample of clusters is selected.
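The three methods above can be sketched with Python's standard library (the population and group sizes here are illustrative):

```python
import random

population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sample: every member has an equal chance of selection
srs = random.sample(population, 10)

# Stratified sampling: divide into strata, then sample randomly within each
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in random.sample(group, 5)]

# Cluster sampling: divide into clusters, then select whole clusters at random
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for c in random.sample(clusters, 2) for m in c]
```

Note the difference between the last two: stratified sampling draws a few members from every subgroup, while cluster sampling keeps every member of a few randomly chosen subgroups.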
Sampling Bias
- Non-response Bias: Occurs when a substantial portion of the sample does not respond and the non-respondents differ systematically from the respondents, making the sample unrepresentative.
- Voluntary Response Bias: Occurs when individuals with strong opinions are more likely to participate, leading to a biased sample.
Experimental Design Principles
- Randomization: Subjects are randomly assigned to treatment and control groups to minimize bias.
- Replication: The experiment is repeated multiple times to ensure the results are reliable.
- Blocking: Individuals with similar characteristics are grouped together to control for potential confounding variables.
Descriptive Statistics
Measures of Central Tendency
- Population Mean (μ): The average of all values in the population.
- Sample Mean (x̄): A point estimate of the population mean, calculated from a sample.
- Median: The middle value when data is ordered, dividing the data into two equal halves.
Measures of Variability
- Variance (S²): The average of squared deviations from the mean (a sample variance divides the sum of squared deviations by n − 1), measuring the spread of data around the mean.
- Standard Deviation (S): The square root of the variance, expressed in the same units as the data and therefore easier to interpret.
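As a quick illustration, Python's statistics module computes these measures directly (the data values are made up):

```python
import statistics

data = [4, 8, 6, 5, 3, 7]

mean = statistics.mean(data)      # sum of values divided by n
median = statistics.median(data)  # middle value of the ordered data
var = statistics.variance(data)   # sample variance (divides by n - 1)
sd = statistics.stdev(data)       # square root of the sample variance
```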
Percentiles and Outliers
Percentiles
- Quartiles: Divide data into four equal parts: Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).
- Interquartile Range (IQR): The difference between Q3 and Q1, representing the spread of the middle 50% of the data.
Outliers
Observations that fall beyond the whiskers of a box plot (commonly, values more than 1.5 × IQR below Q1 or above Q3), indicating extreme values.
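The quartile-based outlier rule can be sketched as follows (the data set is illustrative; the 1.5 × IQR fence is the usual box-plot convention):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1

# Box-plot convention: whiskers reach at most 1.5 * IQR beyond the quartiles;
# anything outside that range is flagged as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```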
Skewness and Normal Distribution
Skewness
The extent to which a distribution is asymmetrical, with tails extending more to one side than the other.
Normal Distribution
A bell-shaped distribution characterized by its mean and standard deviation, often used to model real-world data.
Z-Score
A standardized score that indicates how many standard deviations a data point is from the mean.
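A quick sketch of the z-score calculation (the exam scores are made up):

```python
import statistics

scores = [70, 75, 80, 85, 90]
mu = statistics.mean(scores)       # population mean
sigma = statistics.pstdev(scores)  # population standard deviation

# z-score: how many standard deviations a value lies from the mean
z = (90 - mu) / sigma
```

A value equal to the mean has z = 0; positive z means above the mean, negative z means below it.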
Sampling Distributions and Confidence Intervals
Sampling Distribution
The distribution of a statistic (e.g., sample mean) from all possible samples of a given size from a population.
Central Limit Theorem (CLT)
States that the sampling distribution of the sample mean is approximately normal when the observations are independent and the sample size is sufficiently large (a common rule of thumb is n ≥ 30), regardless of the shape of the population distribution.
Standard Error (SE)
The standard deviation of the sampling distribution, measuring the variability of the sample mean.
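A small simulation (hypothetical skewed population, seeded for reproducibility) illustrates both ideas: the sample means of a skewed distribution cluster around the population mean, and their spread approximates σ/√n:

```python
import random
import statistics

random.seed(0)

# Skewed population: exponential with mean 1 and standard deviation 1
n = 30  # size of each sample
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(n))
    for _ in range(2000)
]

# By the CLT, the means cluster near the population mean (1.0),
# and their standard deviation approximates SE = sigma / sqrt(n)
se_observed = statistics.stdev(sample_means)
se_theoretical = 1 / n ** 0.5
```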
Confidence Interval (CI)
A range of values that is likely to contain the true population parameter with a certain level of confidence (e.g., 95%).
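A minimal sketch of a 95% CI for a mean, using the normal critical value 1.96 (a t critical value would give a slightly wider interval for a sample this small; the data are illustrative):

```python
import statistics

data = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.9]
n = len(data)

mean = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5  # standard error of the mean

# 95% confidence interval: point estimate +/- critical value * SE
ci = (mean - 1.96 * se, mean + 1.96 * se)
```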
Hypothesis Testing
Hypothesis
A statement about a population parameter that we want to test.
P-value
The probability of obtaining the observed results or something more extreme if the null hypothesis is true.
Type I Error
Rejecting the null hypothesis when it is actually true (false positive).
Type II Error
Failing to reject the null hypothesis when it is actually false (false negative).
Two-Sample Tests
Two-Sample Proportion Test (Z-test)
Compares the proportions of two groups to determine if there is a statistically significant difference.
Chi-Square Test
Tests whether the observed frequencies in different categories of a categorical variable differ significantly from the expected frequencies.
Two-Sample T-test
Compares the means of two groups to determine if there is a statistically significant difference.
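To keep the sketch dependency-free, here is Welch's version of the two-sample t statistic implemented from its definition (the group values are made up):

```python
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [12.1, 11.8, 12.4, 12.0, 11.9]
group_b = [11.2, 11.5, 11.1, 11.4, 11.3]
t = welch_t(group_a, group_b)  # a large |t| suggests the means differ
```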
Analysis of Variance (ANOVA)
Compares the means of three or more groups to determine if there is a statistically significant difference between any of the groups.
Correlation and Regression
Correlation
Measures the strength and direction of the linear relationship between two continuous variables.
Pearson’s Correlation Coefficient (r)
- Ranges from -1 to 1, with -1 indicating a perfect negative correlation, 0 indicating no correlation, and 1 indicating a perfect positive correlation.
- The closer r is to 0, the weaker the relationship.
- The closer r is to 1 or -1, the stronger the relationship.
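Pearson's r can be computed directly from its definition with the standard library (the paired data below are illustrative):

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson's correlation coefficient from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
r = pearson_r(hours, score)  # close to 1: strong positive relationship
```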
Regression Analysis
Models the relationship between a dependent variable and one or more independent variables.
Regression Line
The line that best fits the data points in a scatter plot, represented by the equation ŷ = a + bx, where a is the intercept and b is the slope.
Residuals
The differences between the observed values and the predicted values from the regression line.
Slope
The change in the dependent variable for a one-unit change in the independent variable.
Intercept
The value of the dependent variable when the independent variable is zero.
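The slope, intercept, and residuals above can be sketched from the least-squares formulas (the data are illustrative):

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

mx, my = statistics.mean(x), statistics.mean(y)

# Least-squares fit for y^ = a + b x
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))  # slope
a = my - b * mx                          # intercept

# Residuals: observed minus predicted; for a least-squares fit they sum to ~0
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```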