Statistics and ggplot2: Quick Reference

Central Tendency

  • Mean: The average of values, affected by outliers.
    • Formula: \(\bar{x} = \frac{\Sigma x_i}{n}\)
  • Median: The middle value, robust to outliers.
  • Mode: The most frequent value in a dataset.
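
A minimal base R sketch of the three measures. The data vector `x` and the helper `stat_mode()` are hypothetical names introduced for illustration; the helper is needed because base R's `mode()` reports an object's storage type, not its most frequent value.

```r
# Hypothetical sample with one unusually large value (40)
x <- c(2, 3, 3, 5, 7, 40)

# Hypothetical helper: tabulate the values and keep the most frequent one(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

x_mean   <- mean(x)       # sum(x) / length(x)
x_median <- median(x)     # middle value of the sorted data
x_mode   <- stat_mode(x)  # most frequent value
```

Here the single large value pulls the mean up to 10, while the median stays at 4 and the mode at 3.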

Variability Metrics

  • Range: \(\text{Max} - \text{Min}\)
  • Population Variance: \(\sigma^2 = \frac{\Sigma (x_i - \mu)^2}{N}\)
  • Sample Variance: \(s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}\) (Bessel’s correction).
  • Standard Deviation (SD): The square root of variance.
    • Formula: \(s = \sqrt{s^2}\)
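
The formulas above can be checked in base R. This sketch assumes a small hypothetical vector `x`. Note that `var()` and `sd()` use the \(n-1\) denominator, so the population variance is computed by hand, treating `x` as the whole population.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
n <- length(x)

x_range  <- max(x) - min(x)           # Max - Min
pop_var  <- sum((x - mean(x))^2) / n  # population formula: divide by N
samp_var <- var(x)                    # sample formula: divide by n - 1
samp_sd  <- sd(x)                     # square root of the sample variance
```

For these values the population variance is 4 and the sample variance is 32/7, slightly larger because of the smaller denominator.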

Standard Error (SE)

  • Formula: \(SE = \frac{s}{\sqrt{n}}\)
  • Measures the precision of the sample mean as an estimate of the population mean.
  • Larger sample size = lower SE (SE shrinks in proportion to \(1/\sqrt{n}\)).
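
As a rough illustration of the SE formula, the sketch below defines a hypothetical helper `standard_error()` (base R has no built-in for this) and then shows how SE changes with \(n\) when \(s\) is held at an assumed value of 10.

```r
# Hypothetical helper: sample SD divided by the square root of n
standard_error <- function(v) sd(v) / sqrt(length(v))

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
se_x <- standard_error(x)

# With s fixed at 10 (an assumed value), quadrupling n halves the SE
s <- 10
n <- c(25, 100, 400)
se_by_n <- s / sqrt(n)  # 2.0, 1.0, 0.5
```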

Central Limit Theorem (CLT)

  • Rule of thumb: for \(n \geq 30\), the sampling distribution of the mean is approximately normal, regardless of population shape (assuming finite variance; heavily skewed populations may need larger \(n\)).
  • Allows normal-based inference methods even with non-normal populations.
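
One way to see the CLT at work is a small simulation. This sketch assumes a right-skewed exponential population with rate 1 (mean 1, SD 1) and 5000 replicates; the distribution, seed, and variable names are illustrative choices, not part of the original notes.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

n    <- 30    # size of each sample
reps <- 5000  # number of samples drawn

# Draw many samples from a skewed population and keep each sample's mean
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

center <- mean(sample_means)  # should be close to the population mean, 1
spread <- sd(sample_means)    # should be close to sigma / sqrt(n) = 1 / sqrt(30)
```

Calling `hist(sample_means)` on the result should show a roughly bell-shaped histogram even though the underlying population is strongly skewed.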

When to Use Mean vs. Median

  • Mean: Symmetrical data with no outliers.
  • Median: Skewed data or when there are outliers (e.g., income).
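
A short sketch of why the median is preferred for skewed data such as income. The figures are made up (in thousands) and the variable names are hypothetical.

```r
incomes      <- c(30, 32, 35, 38, 40, 42, 45)  # hypothetical incomes, roughly symmetric
with_outlier <- c(incomes, 1000)               # add one very high earner

mean_before   <- mean(incomes)         # about 37.4
median_before <- median(incomes)       # 38
mean_after    <- mean(with_outlier)    # 157.75, dominated by the outlier
median_after  <- median(with_outlier)  # 39, barely moves
```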

Normal Distribution

  • 68%-95%-99.7% Rule:
    • About 68% of values fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.
  • Z-Score: \(z = \frac{x - \mu}{\sigma}\)
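
The rule's percentages can be recovered from the standard normal CDF with `pnorm()`. The z-score example assumes a hypothetical population with \(\mu = 100\) and \(\sigma = 15\).

```r
# Probability mass within k SDs of the mean for a normal distribution
within_1 <- pnorm(1) - pnorm(-1)  # about 0.6827
within_2 <- pnorm(2) - pnorm(-2)  # about 0.9545
within_3 <- pnorm(3) - pnorm(-3)  # about 0.9973

# Z-score for x = 130 under the assumed mu and sigma
mu    <- 100
sigma <- 15
z     <- (130 - mu) / sigma  # 2, i.e. two SDs above the mean
```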

Boxplots

  • Visualizes the median, spread (IQR), and outliers.
  • Useful for detecting skewness and variability.
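
The numbers behind a boxplot can be sketched in base R. The notes do not say which outlier rule is meant; this example assumes the common 1.5 × IQR convention (the default whisker coefficient in `geom_boxplot()`), with hypothetical data.

```r
x <- c(2, 3, 3, 5, 7, 40)  # hypothetical data

q1  <- unname(quantile(x, 0.25))  # lower quartile
q3  <- unname(quantile(x, 0.75))  # upper quartile
iqr <- IQR(x)                     # q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]
```

With these values the quartiles are 3 and 6.5, so the upper fence is 11.75 and only 40 is flagged.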

Histograms

  • Displays the frequency distribution of continuous data.
  • Shape (e.g., symmetric, skewed) provides insights into data.
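
A histogram is a table of bin counts drawn as bars. This sketch uses base R's `hist()` with `plot = FALSE` to expose those counts; the data and the bin width of 2 are arbitrary choices for illustration.

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 9)  # hypothetical data

# Bins are (0,2], (2,4], (4,6], (6,8], (8,10]; the lowest bin also includes 0
h <- hist(x, breaks = seq(0, 10, by = 2), plot = FALSE)

bin_counts <- h$counts  # 3, 5, 1, 0, 1
```

Most values sit in the two lowest bins with a single value far to the right, which is the pattern of a right-skewed distribution.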

ggplot2 Reference

Basics

  • Framework:
    ggplot(data, aes(x, y)) + geom_...()
  • aes(): Maps variables to aesthetics (e.g., x, y, color).
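
A minimal sketch of the framework, assuming the ggplot2 package is installed. It uses R's built-in `mtcars` dataset, which is a choice made here for illustration.

```r
library(ggplot2)

# data = mtcars; aes() maps car weight to x and fuel economy to y
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()  # the geom decides how the mapped data are drawn

# print(p) renders the plot in an interactive session
```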

Common Geoms

  • Scatterplot: geom_point() for relationships between variables.
  • Line Graph: geom_line() for trends over time.
  • Histogram:
    geom_histogram(aes(y = after_stat(density)), bins = 30) + geom_density()
  • Bar Plot: geom_bar() for categorical counts.
  • Boxplot: geom_boxplot() for variability and outliers.
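
The geoms listed above could be applied as follows, again assuming ggplot2 is installed and using the built-in `mtcars` dataset as stand-in data. The histogram uses `after_stat(density)`, which requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Histogram scaled to density so the density curve shares the same y-axis
hist_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density()

# Bar plot: counts of cars per cylinder category
bar_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Boxplot: fuel economy distribution within each cylinder category
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()
```

`cyl` is stored as a number in `mtcars`, so it is wrapped in `factor()` to be treated as a category.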

Customization

  • Labels:
    labs(title = "Title", x = "X-axis", y = "Y-axis")
  • Color by Category:
    aes(color = variable)
  • Faceting:
    facet_wrap(~ category)
  • Themes:
    theme(axis.text = element_text(size = 12))
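
The customization pieces can be combined in one plot. This is a sketch, assuming ggplot2 is installed; the dataset (`mtcars`), label text, and variable choices are illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  facet_wrap(~ gear) +                         # one panel per number of gears
  labs(
    title = "Fuel economy vs. weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"                        # legend title for the color mapping
  ) +
  theme(axis.text = element_text(size = 12))  # larger axis tick labels
```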

Graph Interpretation

  • Scatterplot: Shows correlation (positive, negative, or none).
  • Boxplot: Highlights spread, median, and outliers.
  • Histogram: Reveals data distribution shape (symmetric, skewed, bimodal).
  • Line Graph: Shows trends and patterns over time.

Scales and Limits

  • Set Axis Limits:
    scale_x_continuous(limits = c(min, max))
  • Log Scale:
    scale_y_log10()
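
A sketch of both scale adjustments, assuming ggplot2 is installed and using `mtcars` as stand-in data. One detail worth knowing: limits set on a scale remove observations outside the range before plotting, whereas `coord_cartesian()` (not mentioned in the notes above, added here for comparison) only zooms the view.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Scale limits: points with wt outside [2, 4] are dropped (with a warning when drawn)
p_limits <- base + scale_x_continuous(limits = c(2, 4))

# Coordinate zoom: same visible range, but no data are removed
p_zoom <- base + coord_cartesian(xlim = c(2, 4))

# Log scale on y: useful when values span several orders of magnitude
p_log <- ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point() +
  scale_y_log10()
```

The difference between the first two matters most for geoms that compute statistics, such as boxplots or smoothers, because dropped points change the computed result.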

Quick Reference Table

Metric               Formula/Key Info
Mean                 \(\bar{x} = \frac{\Sigma x_i}{n}\)
Sample Variance      \(s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}\)
Standard Deviation   \(s = \sqrt{s^2}\)
Standard Error       \(SE = \frac{s}{\sqrt{n}}\)
Z-Score              \(z = \frac{x - \mu}{\sigma}\)
CLT                  Sampling distribution of the mean: \(\bar{x} \approx N(\mu, \frac{\sigma^2}{n})\), i.e., standard deviation \(\frac{\sigma}{\sqrt{n}}\)
Histogram            Visualizes frequency of continuous data.
Boxplot              Shows median, quartiles, whiskers, and outliers.
Scatterplot          Examines relationships between variables.