Statistical Inference: Sampling, Bootstrapping, and Confidence Intervals

Packages: infer for sampling and inference, ggplot2 for visualization, tidyr for data reshaping, dplyr for data wrangling, and readr for reading rectangular data such as CSV files (tidyr, dplyr, ggplot2, and readr are part of the tidyverse).


Sampling Techniques

Sampling: Use rep_sample_n() for random draws from a population, ensuring unbiased samples. As the number of replicates increases, the sampling distribution becomes more symmetric and bell-shaped. As the sample size (n) increases, the variability (SD) of the sampling distribution decreases.
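A minimal sketch of this, assuming a made-up population tibble (the `population` data, `height` column, and sizes below are illustrative, not from the notes):

```r
library(infer)   # provides rep_sample_n()
library(dplyr)

set.seed(123)
# Hypothetical population of 10,000 heights (cm)
population <- tibble(height = rnorm(10000, mean = 170, sd = 10))

# Draw 1500 random samples of size 40 each (no replacement)
samples <- population |>
  rep_sample_n(size = 40, reps = 1500)

# Sampling distribution: one sample mean per replicate
sampling_dist <- samples |>
  group_by(replicate) |>
  summarize(mean_height = mean(height))
```

Increasing reps smooths the histogram of mean_height; increasing size tightens it around the population mean.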


Population Terminology

Population Terminology:

  • Population: Collection of all individuals we are interested in (N).
  • Population Parameter: Numerical summary of the population (usually unknown; denoted, e.g., p for a proportion).
  • Census: Measuring every individual in the population; often impractical when N is large.
  • Sampling: Collection of a subset of the population, done when a census is not possible.
  • Point Estimate: Summary sample statistic computed from a sample that estimates the unknown population parameter.

Sample Characteristics

Characteristics of a Sample:

  • Representative: If it is random and reflects the makeup of the population.
  • Generalizable: If results based on the sample can be used to make good guesses about the population.
  • Biased/Unbiased: Random sampling avoids bias and helps keep the sample representative.
  • As sample size n increases, the standard error of the sample decreases.


Standard error: SE = SD (sigma) / sqrt(n), where n is the sample size.


Law of Large Numbers

Law of Large Numbers: As the number of trials (sample size) increases, the sample mean (point estimate) approaches the true population mean (parameter).


Bootstrapping

Bootstrapping: Use replace = TRUE in rep_sample_n(). The bootstrap distribution is centered at the sample statistic rather than at the population parameter, so it does not share the center of the sampling distribution and cannot improve the quality of the point estimate. However, it has a similar shape and spread, which makes it a good way to estimate the standard error. It works best if we have a large original sample.

Ideal results are achieved when the bootstrap sample size equals n (the size of the original sample). If it is too large, resamples contain redundant observations without adding new information; if it is too small, variability increases and representativeness drops.
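A minimal bootstrap sketch, assuming a hypothetical original sample of 50 observations (names and numbers are illustrative):

```r
library(infer)
library(dplyr)

set.seed(42)
# Hypothetical original sample
original_sample <- tibble(x = rnorm(50, mean = 100, sd = 15))
n <- nrow(original_sample)

# Resample with replacement; bootstrap sample size equals n
bootstrap_dist <- original_sample |>
  rep_sample_n(size = n, replace = TRUE, reps = 1000) |>
  group_by(replicate) |>
  summarize(boot_mean = mean(x))

# The spread of the bootstrap distribution estimates the standard error
se_estimate <- sd(bootstrap_dist$boot_mean)
```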


Confidence Intervals

Confidence Intervals: Interpreting – If we repeated our sampling procedure a large number of times, we would expect about 95% of the resulting confidence intervals to capture the value of the population parameter. Equivalently: we are 95% confident that the computed confidence interval captures the value of the true population parameter.

Patterns: As the confidence level increases, the width of the confidence interval increases; as the sample size increases, the width of the confidence interval decreases.

Percentile Method: Take the middle x% of the bootstrap distribution (e.g., a 95% CI is the middle 95%). That is, use the 2.5th and 97.5th percentiles as the endpoints, discarding the bottom 2.5% and top 2.5% of observations.
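The percentile method can be sketched with base R's quantile(); boot_stats here is an invented stand-in for a column of bootstrap statistics:

```r
set.seed(9)
boot_stats <- rnorm(1000, mean = 50, sd = 2)  # stand-in for bootstrap means

# Middle 95%: cut 2.5% off each tail
percentile_ci <- quantile(boot_stats, probs = c(0.025, 0.975))
```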

SE Method: CI = Sample Statistic ± (Multiplier × SE). We can get SE using the sd() function. The multiplier is the z-score corresponding to the CL, obtained using qnorm((1+CL)/2). E.g., for 95% CI, use qnorm(0.975).

Calculated as x̄ ± z × SE.
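The SE method, sketched for a small hypothetical sample (values invented for illustration):

```r
x  <- c(12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 11.9, 10.2)  # hypothetical sample
n  <- length(x)
se <- sd(x) / sqrt(n)

cl <- 0.95
z  <- qnorm((1 + cl) / 2)   # 1.96 for a 95% CI

lower_ci <- mean(x) - z * se
upper_ci <- mean(x) + z * se
```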

To calculate CIs, graphs, and statistics, we prefer infer over dplyr because:

  • Designed for Inference: infer is built specifically for statistical inference, making hypothesis testing and confidence intervals easier.
  • Consistent Syntax: Provides a unified framework for resampling, bootstrapping, and permutation tests.
  • Built-in Statistical Functions: Handles simulations, p-values, and null distributions more efficiently than dplyr.
  • Minimal Code for Inference: Reduces the need for manual calculations compared to dplyr, which focuses more on data manipulation.
  • Better Reproducibility: Ensures a structured approach to hypothesis testing, reducing errors and improving clarity.

Infer Workflow to calculate CI/graph: sample |> specify(response = col) |> generate(reps = ..., type = "bootstrap") |> calculate(stat = "mean"/"median"/"sum"/"sd"/"prop")

For a histogram only: add visualize() to the pipeline; outside the pipeline, use visualize(bootstrap_stuff). Shade the interval with + shade_confidence_interval(endpoints = percentile_ci) (ggplot layers are added with +, not |>). To get the interval itself, pipe the calculated statistics into get_confidence_interval(level = 0.CL, type = "percentile"/"se").
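Putting the pipeline together, using mtcars$mpg as an illustrative stand-in dataset (not from the notes):

```r
library(infer)

set.seed(1)
boot_dist <- mtcars |>
  specify(response = mpg) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "mean")

percentile_ci <- boot_dist |>
  get_confidence_interval(level = 0.95, type = "percentile")

# Histogram with the CI shaded (ggplot layers are added with +)
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```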

Dplyr Workflow: Select a column to sample from and use rep_sample_n() to generate samples. Use group_by(replicate), then compute sample statistics with summarize() to get a list of sample means/medians, etc. To visualize, use ggplot() with geom_histogram(). To calculate a CI from a single sample, use summarize(lower_ci = mean - z * sd/sqrt(n), upper_ci = mean + z * sd/sqrt(n)), where n is the sample size, which we can get with n() or length() in our tibble. For a sampling or bootstrap distribution of means, the sd of the replicate means is already the standard error, so use mean ± z * sd directly, or lower_ci = qnorm(0.025, mean, SE) AND upper_ci = qnorm(0.975, mean, SE).
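The same bootstrap CI via the dplyr route, again on mtcars$mpg (an illustrative dataset, not from the notes):

```r
library(dplyr)
library(infer)    # for rep_sample_n()
library(ggplot2)

set.seed(7)
boot_means <- mtcars |>
  select(mpg) |>
  rep_sample_n(size = nrow(mtcars), replace = TRUE, reps = 1000) |>
  group_by(replicate) |>
  summarize(mean_mpg = mean(mpg))

# sd of the replicate means is already the SE of the mean
ci <- boot_means |>
  summarize(lower_ci = qnorm(0.025, mean(mean_mpg), sd(mean_mpg)),
            upper_ci = qnorm(0.975, mean(mean_mpg), sd(mean_mpg)))

ggplot(boot_means, aes(mean_mpg)) + geom_histogram(bins = 30)
```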


Central Limit Theorem

Central Limit Theorem: The sample mean or proportion (or a difference in means/proportions) will appear to follow a bell-shaped curve (normal distribution), even if the population distribution is non-normal, but only if the following conditions are satisfied:

  1. Observations in the sample are independent. Independence is guaranteed when we take a random sample from a population. If the sample is random, it is representative of the population.
  2. The sample is large enough. The sample size cannot be too small. (Generally, a large sample means at least 10 successes and 10 failures).

According to the CLT, the sample mean X̄ follows a Normal distribution with mean equal to the population mean and standard deviation equal to the SE.

Assume normality:

  • If small n and no clear outliers.
  • If large n and no extreme outliers.
  • Slight skew okay for n = 15.
  • Moderate skew okay for n = 30.
  • Strong skew okay for n = 60.

Normal Distribution

Standard Normal Distribution: Mean = 0, SD = 1. The z-score is the number of standard deviations a value falls above or below the mean: Z = (x − mean)/SD. To get the probability below a z-score, use pnorm(). To get a z-score from a probability, use qnorm(). Note: qnorm() only takes probabilities between 0 and 1.
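For example (values invented): the z-score of x = 185 in a Normal(170, 10) population, with its pnorm()/qnorm() counterparts:

```r
z <- (185 - 170) / 10   # z = 1.5
p <- pnorm(z)           # probability below z = 1.5 (about 0.933)
q <- qnorm(0.975)       # z-score with 97.5% of the area below it (about 1.96)
```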





T-Distributions

T-Distributions: Use if the population SD is unknown or the sample size is small (only do this if stated in the problem; if the sample size is large, the CLT applies). df = n − 1, where n is the size of the sample.

Instead of qnorm, we will use lower_ci = mean - qt((1+CL)/2, df)*SE and upper_ci = mean + qt((1+CL)/2, df)*SE.

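A sketch of the t-based CI following the qt() formula above, using a small invented sample:

```r
x  <- c(4.2, 5.1, 3.8, 4.9, 5.5, 4.0, 4.7)  # hypothetical small sample
n  <- length(x)
df <- n - 1
se <- sd(x) / sqrt(n)

cl <- 0.95
t_star <- qt((1 + cl) / 2, df)   # wider than the z multiplier for small n

lower_ci <- mean(x) - t_star * se
upper_ci <- mean(x) + t_star * se
```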


Difference in Proportions

Difference in Proportions: Can be used if independence and normality are verified for both groups; df = min(n1 − 1, n2 − 1); margin of error = t-score × SE.
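A sketch of the margin-of-error calculation for a two-group comparison, shown here for a difference in means with invented data, using the conservative df = min(n1 − 1, n2 − 1) rule above:

```r
g1 <- c(10.2, 11.5, 9.8, 12.0, 10.9)     # hypothetical group 1
g2 <- c(9.1, 8.7, 10.0, 9.5, 8.9, 9.8)   # hypothetical group 2

diff_est <- mean(g1) - mean(g2)
se <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
df <- min(length(g1) - 1, length(g2) - 1)   # conservative df

me <- qt(0.975, df) * se                    # margin of error, 95% confidence
ci <- c(diff_est - me, diff_est + me)
```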




Important functions

  • n(): Counts the number of rows in a group.
  • nrow() / ncol(): Number of rows/columns in a tibble.
  • mutate(): Adds a new column.
  • length(): Length of a vector or tibble column.
  • ungroup(): Removes the replicate grouping after bootstrapping.
  • na.rm = TRUE: Ignores NA values while summarizing or mutating.
  • geom_vline(): geom_vline(xintercept = value, color = "color", linetype = "type", size = thickness).
  • is.na(col): Tests which values of col are NA (use filter(!is.na(col)) to drop them).
  • mean(), sd(): Sample mean and standard deviation.
  • as.numeric(col): Converts col to numeric.
