Statistical Inference: Sampling, Bootstrapping, and Confidence Intervals
Packages: infer for sampling and inference, ggplot2 for visualization, tidyr for tidying data (part of the tidyverse), dplyr for data wrangling, and readr for reading rectangular data such as CSV files.
Sampling Techniques
Sampling: Use rep_sample_n() for random draws from a population, ensuring unbiased samples. As the number of replicates increases, the simulated sampling distribution becomes smoother and more symmetric/bell-shaped. As the sample size (n) increases, the variability (SD) of the sampling distribution decreases.
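A minimal sketch of repeated sampling, assuming a hypothetical population tibble pop with a numeric column height:

```r
library(infer)
library(dplyr)

# Draw 1000 random samples of size 50 from the population
samples <- pop |>
  rep_sample_n(size = 50, reps = 1000)

# One sample mean per replicate = the simulated sampling distribution
sampling_dist <- samples |>
  group_by(replicate) |>
  summarize(sample_mean = mean(height))
```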
Population Terminology
Population Terminology:
- Population: Collection of all individuals we are interested in (N).
- Population Parameter: Numerical summary about the population (usually unknown; e.g., p for a proportion).
- Census: Measuring every individual in the population; often impractical when N is large.
- Sampling: Collection of a subset of the population, done when a census is not possible.
- Point Estimate: Summary sample statistic computed from a sample that estimates the unknown population parameter.
Sample Characteristics
Characteristics of a Sample:
- Representative: If the sample is random and reflects the makeup of the population.
- Generalizable: If results based on the sample can be used to make good guesses about the population.
- Biased/Unbiased: To ensure a representative (unbiased) sample, use random sampling.
- As sample size n increases, the standard error of the sample statistic decreases: SE = σ / sqrt(n).
Law of Large Numbers
Law of Large Numbers: As the number of trials (sample size) increases, the sample mean (point estimate) approaches the true population mean (parameter).
Bootstrapping
Bootstrapping: Use replace = TRUE in rep_sample_n(). The bootstrap distribution is centered at the original sample statistic rather than at the population parameter, so it cannot improve the quality of the point estimate. However, it has a similar shape and spread to the sampling distribution, so it provides a good estimate of the standard error. Works best if we have a large sample.
Ideal results are achieved when the bootstrap resample size equals n (the size of the original sample). If too large, resamples contain redundant observations without adding new information; if too small, variability increases and representativeness suffers.
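A sketch of bootstrapping one sample, assuming a hypothetical tibble one_sample with a numeric column height:

```r
library(infer)
library(dplyr)

# Resample the original sample WITH replacement, keeping size = n
boot_dist <- one_sample |>
  rep_sample_n(size = nrow(one_sample), replace = TRUE, reps = 1000) |>
  group_by(replicate) |>
  summarize(boot_mean = mean(height))

# The sd of the bootstrap means estimates the standard error
boot_se <- sd(boot_dist$boot_mean)
```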
Confidence Intervals
Confidence Intervals: Interpretation – If we repeated our sampling procedure a large number of times, we would expect about 95% of the resulting confidence intervals to capture the value of the population parameter; equivalently, we are 95% confident that the computed confidence interval captures the value of the true population parameter.
Patterns: As the confidence level increases, the width of the confidence interval increases; as the sample size increases, the width of the confidence interval decreases.
Percentile Method: Take the middle x% of the bootstrap distribution (e.g., a 95% CI is the middle 95%). That is, cut off the bottom 2.5% and top 2.5% of observations and use the 2.5th and 97.5th percentiles as the interval endpoints.
SE Method: CI = sample statistic ± (multiplier × SE), i.e., x̄ ± z × SE. Get the SE by applying sd() to the bootstrap distribution. The multiplier is the z-score corresponding to the confidence level (CL), obtained with qnorm((1 + CL)/2); e.g., for a 95% CI, use qnorm(0.975).
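A minimal SE-method sketch, reusing the hypothetical boot_dist from the bootstrapping sketch above and assuming x_bar holds the original sample mean:

```r
se <- sd(boot_dist$boot_mean)   # bootstrap estimate of the SE
z  <- qnorm((1 + 0.95) / 2)     # ~1.96 for a 95% CI
c(lower = x_bar - z * se,
  upper = x_bar + z * se)
```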
To calculate CIs, graphs, and statistics, we prefer infer over dplyr because:
- Designed for Inference: infer is built specifically for statistical inference, making hypothesis testing and confidence intervals easier.
- Consistent Syntax: Provides a unified framework for resampling, bootstrapping, and permutation tests.
- Built-in Statistical Functions: Handles simulations, p-values, and null distributions more efficiently than dplyr.
- Minimal Code for Inference: Reduces the need for manual calculations compared to dplyr, which focuses more on data manipulation.
- Better Reproducibility: Ensures a structured approach to hypothesis testing, reducing errors and improving clarity.
Infer Workflow to calculate CI/graph: sample |> specify(response = col) |> generate(reps = ..., type = "bootstrap") |> calculate(stat = "mean"/"median"/"sum"/"sd"/"prop").
For a histogram only: add visualize() to the pipeline, or call visualize(bootstrap_dist) outside the pipeline. Shade the interval by adding + shade_confidence_interval(endpoints = percentile_ci). To compute the CI, pipe the calculated distribution into get_confidence_interval(level = 0.CL, type = "percentile"/"se"); the "se" type also requires a point_estimate argument.
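A minimal end-to-end sketch, again assuming a hypothetical tibble one_sample with a numeric column height:

```r
library(infer)

# Bootstrap distribution of the sample mean
boot_dist <- one_sample |>
  specify(response = height) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

# Percentile-method 95% CI
percentile_ci <- boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

# Histogram with the CI shaded
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```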
Dplyr Workflow: Select a column to sample from and use rep_sample_n to generate samples. Use group_by(replicate), then compute sample statistics with summarize() to get a list of sample means/medians, etc. To visualize, use ggplot and geom_histogram. To calculate a CI from the raw sample, use summarize(lower_ci = mean - z * sd/sqrt(n), upper_ci = mean + z * sd/sqrt(n)), where n is the number of observations, obtained with the n() or length() function in the tibble. Note that when summarizing a sampling or bootstrap distribution of statistics, sd() of those statistics already estimates the SE, so no division by sqrt(n) is needed. Alternatively, use lower_ci = qnorm(0.025, mean, SE) and upper_ci = qnorm(0.975, mean, SE).
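A sketch of the dplyr-style workflow, reusing the hypothetical population tibble pop with a numeric column height from earlier:

```r
library(infer)    # for rep_sample_n()
library(dplyr)
library(ggplot2)

# Simulated sampling distribution: 1000 sample means of size-50 samples
sampling_dist <- pop |>
  rep_sample_n(size = 50, reps = 1000) |>
  group_by(replicate) |>
  summarize(sample_mean = mean(height))

# Visualize with a histogram
ggplot(sampling_dist, aes(x = sample_mean)) +
  geom_histogram(bins = 30)

# 95% CI via qnorm(); sd(sample_mean) already estimates the SE here
sampling_dist |>
  summarize(lower_ci = qnorm(0.025, mean(sample_mean), sd(sample_mean)),
            upper_ci = qnorm(0.975, mean(sample_mean), sd(sample_mean)))
```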
Central Limit Theorem
Central Limit Theorem: The sample proportion (or difference in proportions) will approximately follow a bell-shaped curve (normal distribution), even if the population distribution is non-normal, provided the following conditions are satisfied:
- Observations in the sample are independent. Independence is guaranteed when we take a random sample from a population. If the sample is random, it is representative of the population.
- The sample is large enough; generally, a large sample means at least 10 successes and 10 failures.
According to the CLT, X̄ (the sample mean) approximately follows a Normal distribution with mean equal to the population mean and standard deviation equal to the SE; a short simulation sketch follows the list below.
Assume normality:
- If small n and no clear outliers.
- If large n and no extreme outliers.
- Slight skew is okay for n ≥ 15.
- Moderate skew is okay for n ≥ 30.
- Strong skew is okay for n ≥ 60.
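To see the CLT in action, a quick simulation sketch (all names illustrative): sample means drawn from a strongly right-skewed exponential population still form a bell shape.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
pop_skewed <- tibble::tibble(x = rexp(1e5, rate = 1))  # right-skewed population

# 2000 sample means with n = 60 (large enough for strong skew)
means <- infer::rep_sample_n(pop_skewed, size = 60, reps = 2000) |>
  group_by(replicate) |>
  summarize(xbar = mean(x))

ggplot(means, aes(x = xbar)) +
  geom_histogram(bins = 40)  # approximately normal despite the skewed population
```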
Normal Distribution
Normal Distribution: The standard normal distribution has mean = 0 and SD = 1. The z-score is the number of standard deviations a value falls above or below the mean: Z = (x - mean)/SD. To get the probability (area below) for a z-score, use pnorm(); to get a z-score from a probability, use qnorm(). Note: qnorm() only takes probabilities from 0 to 1.
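A few sanity-check calls:

```r
pnorm(1.96)                      # ~0.975: probability below z = 1.96
qnorm(0.975)                     # ~1.96: z-score with 97.5% of the area below it
pnorm(110, mean = 100, sd = 15)  # P(X <= 110) for a non-standard normal
```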
T-Distributions
T-Distributions: Use when the population SD is unknown or the sample size is small (only do this if indicated in the problem; if the sample size is large, the CLT applies). df = n - 1, where n is the size of the sample.
Instead of qnorm, use lower_ci = mean - qt((1+CL)/2, df)*SE and upper_ci = mean + qt((1+CL)/2, df)*SE.
Example code:
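A minimal sketch, assuming a hypothetical numeric sample vector x and a 95% confidence level:

```r
n      <- length(x)
se     <- sd(x) / sqrt(n)                 # SE when the population SD is unknown
t_star <- qt((1 + 0.95) / 2, df = n - 1)  # t multiplier with df = n - 1
c(lower = mean(x) - t_star * se,
  upper = mean(x) + t_star * se)
```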
Difference in Proportions
Difference in Proportions: Can be used if independence and normality are verified for both groups. According to the CLT, the difference in proportions is approximately normal, so margin of error = critical value × SE. (For a difference in means, use a t score with df = min(n1 - 1, n2 - 1).)
Example code:
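A by-hand sketch, assuming hypothetical counts x1 successes out of n1 in group 1 and x2 out of n2 in group 2:

```r
p1 <- x1 / n1
p2 <- x2 / n2
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE of p1 - p2
z  <- qnorm(0.975)                                   # 95% multiplier
c(lower = (p1 - p2) - z * se,
  upper = (p1 - p2) + z * se)
```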
Infer workflow:
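A bootstrap sketch with infer, assuming a hypothetical tibble dat with a binary response outcome ("success"/"failure") and a two-level group grp ("A"/"B"):

```r
library(infer)

boot_dist <- dat |>
  specify(response = outcome, explanatory = grp, success = "success") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in props", order = c("A", "B"))

boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")
```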
Important functions
n() | Counts the number of rows in a group
nrow() / ncol() | Number of rows (observations) / columns (variables) in a tibble
mutate() | Add a new column
length() | Length of a vector or tibble column
ungroup() | Removes the replicate grouping after bootstrapping
na.rm = TRUE | Ignore NA values while summarizing or mutating
geom_vline() | geom_vline(xintercept = value, color = "color", linetype = "type", linewidth = thickness); linewidth was called size in older ggplot2
is.na(col) | Returns TRUE where col is NA; use filter(!is.na(col)) to drop NA rows
mean(), sd() | Mean and standard deviation of a numeric vector
as.numeric(col) | Coerce col to numeric