Statistics and ggplot2: Quick Reference
Posted on Feb 3, 2025 in Statistics
Central Tendency
- Mean: The average of values, affected by outliers.
- Formula: \(\bar{x} = \frac{\Sigma x_i}{n}\)
- Median: The middle value, robust to outliers.
- Mode: The most frequent value in a dataset.
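A minimal sketch of the three measures in base R, using made-up data. Note that base R's `mode()` returns the storage type of an object, not the statistical mode, so the helper `stat_mode` below is a hypothetical name introduced only for this example.

```r
# Hypothetical sample data.
x <- c(2, 3, 3, 5, 7, 10)

mean_x <- mean(x)      # sum(x) / length(x) = 30 / 6 = 5
median_x <- median(x)  # middle of the sorted values: (3 + 5) / 2 = 4

# Helper (hypothetical name): returns the most frequent value(s).
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_x <- stat_mode(x)  # 3 appears twice, every other value once
```

If several values tie for the highest count, `stat_mode` returns all of them.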
Variability Metrics
- Range: \(\text{Max} - \text{Min}\)
- Population Variance: \(\sigma^2 = \frac{\Sigma (x_i - \mu)^2}{N}\)
- Sample Variance: \(s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}\) (Bessel’s correction).
- Standard Deviation (SD): The square root of variance.
- Formula: \(s = \sqrt{s^2}\)
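The formulas above can be checked in base R. This is a sketch on made-up data; `var()` and `sd()` use the \(n-1\) denominator, so the population variance is computed by hand here.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data, mean = 5
n <- length(x)

range_x <- max(x) - min(x)            # 9 - 2 = 7
pop_var <- sum((x - mean(x))^2) / n   # divides by N: 32 / 8 = 4
samp_var <- var(x)                    # divides by n - 1: 32 / 7
samp_sd <- sd(x)                      # square root of samp_var
```

The sample variance is slightly larger than the population variance on the same data because of the smaller denominator.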
Standard Error (SE)
- Formula: \(SE = \frac{s}{\sqrt{n}}\)
- Measures the precision of the sample mean as an estimate of the population mean.
- Larger sample size = Lower SE: SE shrinks in proportion to \(1/\sqrt{n}\), so quadrupling \(n\) halves it.
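A short sketch of the SE formula in base R. The function names `std_error` and `se_for_n` are hypothetical helpers, and the data are made up.

```r
# Helper (hypothetical name): standard error of the mean, s / sqrt(n).
std_error <- function(v) sd(v) / sqrt(length(v))

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical data
se_x <- std_error(x)

# With s held fixed at 10, going from n = 100 to n = 400 halves the SE.
se_for_n <- function(s, n) s / sqrt(n)
ratio <- se_for_n(10, 400) / se_for_n(10, 100)  # 0.5 / 1 = 0.5
```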
Central Limit Theorem (CLT)
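- For large enough samples (\(n \geq 30\) is a common rule of thumb), the sampling distribution of the mean is approximately normal, regardless of population shape, provided the population variance is finite. Heavily skewed populations may need a larger \(n\).
- Allows normal-based methods for inference about the mean even with non-normal populations.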
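One way to see the CLT at work is a small simulation. The sketch below assumes an exponential population (rate 1, so mean 1 and SD 1), which is strongly right-skewed; the seed, sample size, and number of repetitions are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

n <- 30       # size of each sample
reps <- 5000  # number of simulated samples

# Draw `reps` samples of size n and keep each sample's mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean_of_means <- mean(sample_means)  # should land near the population mean, 1
sd_of_means <- sd(sample_means)      # should land near sigma / sqrt(n)
theoretical_se <- 1 / sqrt(n)
```

Plotting `hist(sample_means)` should show a roughly bell-shaped distribution even though the population itself is skewed.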
When to Use Mean vs. Median
- Mean: Symmetrical data with no outliers.
- Median: Skewed data or when there are outliers (e.g., income).
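The income case can be illustrated with made-up numbers: a single very high earner pulls the mean far from the typical value, while the median barely moves.

```r
# Hypothetical annual incomes (in thousands); the last value is an outlier.
income <- c(30, 32, 35, 38, 40, 42, 45, 1000)

mean_income <- mean(income)      # 1262 / 8 = 157.75, above all but one value
median_income <- median(income)  # (38 + 40) / 2 = 39, close to the typical earner
```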
Normal Distribution
- 68%-95%-99.7% Rule:
- About 68% of values fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.
- Z-Score: \(z = \frac{x - \mu}{\sigma}\)
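The rule can be reproduced from the standard normal CDF with `pnorm()`. The helper name `within_k_sd` and the values used for the z-score (\(x = 130\), \(\mu = 100\), \(\sigma = 15\)) are assumptions made for illustration.

```r
# Probability that a normal variable falls within k SDs of its mean.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
round(coverage, 4)  # 0.6827 0.9545 0.9973

# Z-score for a hypothetical observation.
z <- (130 - 100) / 15  # 2, i.e. two SDs above the mean
```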
Boxplots
- Visualizes the median, spread (IQR), and outliers.
- Useful for detecting skewness and variability.
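The numbers behind a boxplot can be computed directly. This sketch assumes the common 1.5 × IQR rule for flagging outliers (a convention not stated above) and uses made-up data. `quantile()` uses R's default interpolation, so quartiles may differ slightly from the hinges some plotting functions draw.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # hypothetical data

q <- quantile(x, c(0.25, 0.5, 0.75))   # Q1 = 3.25, median = 5.5, Q3 = 7.75
iqr <- IQR(x)                          # Q3 - Q1 = 4.5

# Assumed outlier rule: beyond 1.5 * IQR from the quartiles.
lower_fence <- q[[1]] - 1.5 * iqr      # 3.25 - 6.75 = -3.5
upper_fence <- q[[3]] + 1.5 * iqr      # 7.75 + 6.75 = 14.5
outliers <- x[x < lower_fence | x > upper_fence]  # 30
```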
Histograms
- Displays the frequency distribution of continuous data.
- Shape (e.g., symmetric, skewed) provides insights into data.
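A histogram is just binned counts, which base R's `hist()` can return without drawing anything. The sketch below uses simulated right-skewed data; the seed and the number of breaks are arbitrary.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
x <- rexp(1000)  # simulated right-skewed data

# plot = FALSE returns the bin edges and counts instead of drawing.
h <- hist(x, breaks = 20, plot = FALSE)
bin_counts <- h$counts  # number of observations per bin
bin_edges <- h$breaks   # one more edge than there are bins

# A quick numeric hint of right skew: the mean sits above the median.
right_skewed <- mean(x) > median(x)
```

`breaks = 20` is treated by `hist()` as a suggestion, so the actual number of bins can differ.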
ggplot2 Reference
Basics
Common Geoms
Customization
- Labels:
labs(title = "Title", x = "X-axis", y = "Y-axis")
- Color by Category:
aes(color = variable)
- Faceting:
facet_wrap(~ category)
- Themes:
theme(axis.text = element_text(size = 12))
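The customization calls above can be combined in a single plot. This is a sketch, assuming the built-in `mtcars` dataset; the choice of variables and the label text are illustrative only.

```r
library(ggplot2)

# mtcars ships with R. cyl is wrapped in factor() so it maps to
# discrete colors rather than a continuous gradient.
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel economy vs. weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       color = "Cylinders") +
  facet_wrap(~ gear) +                        # one panel per gear count
  theme(axis.text = element_text(size = 12))  # larger axis tick labels
```

Calling `print(p)` (or typing `p` at the console) renders the plot.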
Graph Interpretation
- Scatterplot: Shows correlation (positive, negative, or none).
- Boxplot: Highlights spread, median, and outliers.
- Histogram: Reveals data distribution shape (symmetric, skewed, bimodal).
- Line Graph: Shows trends and patterns over time.
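For scatterplots, the visual impression of correlation can be backed by a number. A minimal sketch with the built-in `mtcars` data, where heavier cars tend to have lower fuel economy:

```r
# Pearson correlation between car weight and miles per gallon.
r <- cor(mtcars$wt, mtcars$mpg)

# Sign gives the direction, magnitude gives the strength of the linear trend.
direction <- if (r > 0) "positive" else if (r < 0) "negative" else "none"
```

Here `direction` is `"negative"`: the points in a plot of `mpg` against `wt` slope downward.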
Scales and Limits
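The source gives no details under this heading, so the following is only a sketch of ggplot2 calls commonly used for scales and limits, assuming the built-in `mtcars` data. The practical difference worth knowing: limits set on a scale drop data outside the range before any statistics are computed, while `coord_cartesian()` only zooms the view.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Limits on the scale: points with wt outside [2, 4] are removed
# (ggplot2 warns about the dropped rows when the plot is drawn).
p_scale <- base + scale_x_continuous(limits = c(2, 4))

# Limits on the coordinate system: zooms in without discarding data.
p_zoom <- base + coord_cartesian(xlim = c(2, 4), ylim = c(10, 30))

# Transforming an axis: log10 scale on y.
p_log <- base + scale_y_log10()
```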
Quick Reference Table
Metric | Formula/Key Info |
---|---|
Mean | \(\bar{x} = \frac{\Sigma x_i}{n}\) |
Sample Variance | \(s^2 = \frac{\Sigma (x_i - \bar{x})^2}{n-1}\) |
Standard Deviation | \(s = \sqrt{s^2}\) |
Standard Error | \(SE = \frac{s}{\sqrt{n}}\) |
Z-Score | \(z = \frac{x - \mu}{\sigma}\) |
CLT | Sampling distribution of the mean \(\approx N(\mu, \frac{\sigma^2}{n})\), i.e. SD \(\frac{\sigma}{\sqrt{n}}\) |
Histogram | Visualizes frequency of continuous data. |
Boxplot | Shows median, quartiles, whiskers, and outliers. |
Scatterplot | Examines relationships between variables. |