Introduction to Statistics and Data Analysis
Chapter 1: Population and Sampling
Population: The entire group (of individuals or objects) we want to learn about.
Population parameter: A numerical fact about a population.
Census: An attempt to survey every member of the population; costly and often not feasible.
Good Samples
- Include all available sampling units from the population.
- Do not contain irrelevant sampling units from another population.
- Do not contain duplicated sampling units.
- Do not contain sampling units in clusters.
Selection bias: Due to non-probability sampling or a poor sampling frame.
Probability Sampling
Simple random sample: Units are drawn from the frame without replacement such that every group of n members is equally likely to be chosen.
Systematic sampling: Every k-th member of the frame is chosen, starting from a random member between 1 and k.
Stratified sampling: The population is split into groups/strata, and members are randomly selected from each stratum.
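The three probability sampling methods above can be sketched in Python; the frame, strata, and sample sizes here are all hypothetical:

```python
import random

random.seed(0)
frame = list(range(1, 101))  # hypothetical sampling frame of 100 units

# Simple random sample: every group of n units is equally likely (no replacement).
srs = random.sample(frame, 10)

# Systematic sample: random start between 1 and k, then every k-th unit.
k = 10
start = random.randint(1, k)
systematic = frame[start - 1::k]

# Stratified sample: split the frame into strata, then SRS within each stratum.
strata = {"low": frame[:50], "high": frame[50:]}
stratified = [u for s in strata.values() for u in random.sample(s, 5)]

print(len(srs), len(systematic), len(stratified))  # prints: 10 10 10
```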
Non-Probability Sampling
Convenience sampling: Selecting units that are easiest to reach; prone to selection bias and non-response bias.
Volunteer sampling: Self-selected sample, usually from individuals with strong opinions.
Generalizability
A study is generalizable if it:
- Has a good sampling frame that is equal to or larger than the population.
- Uses a probability-based sampling method to minimize selection bias.
- Has a large sample size.
- Minimizes the non-response rate.
Variable Types
Numerical variables: Discrete (has gaps) or continuous (no gaps).
Categorical variables: Ordinal (has order) or nominal (no order).
Descriptive Statistics
Mean: Σxi / n (adding c to all points increases the mean by c; scaling points by c scales the mean by c).
Variance: Σ(xi – mean)² / (n – 1) (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales variance by c²).
Standard deviation: sqrt(Variance) (adding c to all points doesn’t change the value; scaling points by c scales the SD by |c|).
IQR: 3rd quartile – 1st quartile (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales IQR by |c|). Points more than 1.5 × IQR beyond the quartiles are commonly flagged as outliers.
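These shift and scale properties can be checked numerically with Python’s statistics module (the data set is made up):

```python
from statistics import mean, variance, stdev, quantiles

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

m, v, s = mean(data), variance(data), stdev(data)  # sample variance uses n - 1
q1, _, q3 = quantiles(data, n=4)                   # quartile cut points
iqr = q3 - q1

# Shifting every point by c changes the mean by c but not the spread;
# scaling by c scales the mean by c, the SD and IQR by |c|, the variance by c².
c = 3.0
shifted = [x + c for x in data]
scaled = [c * x for x in data]
assert abs(mean(shifted) - (m + c)) < 1e-9
assert abs(variance(shifted) - v) < 1e-9
assert abs(variance(scaled) - c**2 * v) < 1e-9
assert abs(stdev(scaled) - abs(c) * s) < 1e-9
```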
Study Design
Random assignment: Controlled experiment.
Self-assignment: Observational study.
Single blinding: Subjects are blinded.
Double blinding: Subjects and assessors are blinded.
Chapter 2: Relationships Between Variables
Marginal rate: Relates to only one variable.
Conditional rate: Relates two variables (e.g., rate(A|B) is the rate of A among those for whom B holds).
Joint rate: Probability of A AND B.
Positive association: A and B tend to occur together.
Negative association: A and NB tend to occur together.
Symmetry rule: rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA).
Basic rule of rates: The overall rate(A) always lies between rate(A|B) and rate(A|NB). The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B).
Simpson’s paradox: A trend appears in multiple groups of data but disappears or reverses when the groups are combined.
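A worked illustration of Simpson’s paradox, with hypothetical counts patterned after the classic kidney-stone example: treatment A has the higher success rate within each subgroup, yet the lower rate overall:

```python
# Hypothetical (successes, total) counts per subgroup and treatment arm.
groups = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(success, total):
    return success / total

# Within each subgroup, A beats B...
for arms in groups.values():
    assert rate(*arms["A"]) > rate(*arms["B"])

# ...but on the combined data, B beats A: the trend reverses.
totals = {t: (sum(groups[g][t][0] for g in groups),
              sum(groups[g][t][1] for g in groups)) for t in ("A", "B")}
assert rate(*totals["A"]) < rate(*totals["B"])
```

The reversal happens because subgroup sizes are unbalanced across the arms: subgroup (here, stone size) acts as a confounder.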
Confounder: A third variable associated with both the independent and dependent variables.
Chapter 3: Data Visualization and Summarization
Histogram bins: [a, b) includes a but excludes b.
Unimodal: One peak.
Multimodal: Multiple peaks.
Left skewed: Tail at low values; Mean < Median < Mode.
Symmetrical: Mean = Median = Mode.
Right skewed: Tail at high values; Mean > Median > Mode.
Spread: Low variability (low standard deviation, steep curve); High variability (high standard deviation, gentle curve).
Bivariate Data (Association)
Direction: Positive (increase in A means increase in B); Negative (increase in A means decrease in B).
Form: Linear, non-linear (quadratic, exponential, logarithmic, cubic).
Strength: How closely the points cluster around a line or curve (closer -> stronger).
Correlation coefficient: Sign indicates direction; Magnitude indicates strength (0-0.3 weak, 0.3-0.7 moderate, 0.7-1 strong).
Standard units: (xi – mean(x)) / SDx
Correlation coefficient (r): Unchanged by multiplying either variable by a positive constant, adding the same value to either variable, or interchanging the axes.
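Computing r as the average product of standard units makes these invariances easy to verify (the sample data is made up; population SD is used to match the 1/n average):

```python
from statistics import mean, pstdev

def corr(xs, ys):
    # r = average product of standard units: (1/n) Σ zx · zy
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = corr(x, y)

# r is unchanged by a positive scaling plus a shift, and by swapping the axes.
assert abs(corr([2 * xi + 7 for xi in x], y) - r) < 1e-9
assert abs(corr(y, x) - r) < 1e-9
```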
Ecological fallacy: Drawing conclusions about individuals based on aggregate data.
Atomistic fallacy: Drawing conclusions about aggregates based on individual data.
Regression
Linear regression: Y = mX + b (m is the slope, b is the y-intercept, X is the independent variable, Y is the dependent variable).
Least squares linear regression: Predicts the average Y value based on X.
Exponential regression: Y = c·b^X.
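A least squares line can be computed directly from the summary statistics, using slope m = r × SDy/SDx and intercept b = mean(Y) − m × mean(X) (the data here is made up):

```python
from statistics import mean, pstdev

def least_squares(xs, ys):
    # Slope m = r · SDy / SDx, intercept b = mean(y) − m · mean(x).
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    r = sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / len(xs)
    m = r * sy / sx
    return m, my - m * mx

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1
m, b = least_squares(x, y)
y_hat = m * 3.5 + b        # predicted average Y at X = 3.5
```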
Chapter 4: Probability and Inference
Sample space: The collection of all outcomes.
Event: A subset of the sample space.
Mutually exclusive: Events cannot happen simultaneously.
Independent: One event does not affect the probability of another.
Rules of Probabilities
- 0 ≤ P(E) ≤ 1.
- P(S) = 1.
- P(E∪F) = P(E) + P(F) if E and F are mutually exclusive.
Uniform probability: Assigning equal probability to every outcome (1/N).
Conditional probability: P(E|F) = P(E∩F) / P(F).
Independence: If A and B are independent, P(A) = P(A|B) = P(A|NB).
Conditional independence: A and B are conditionally independent given C if P(A∩B|C) = P(A|C) * P(B|C).
Law of total probability: If E and F are mutually exclusive and E∪F = S, then P(G) = P(G|E)P(E) + P(G|F)P(F).
Prosecutor’s fallacy: Assuming P(A|B) = P(B|A).
Conjunction fallacy: Assuming P(A∩B) > P(A).
Base rate fallacy: Ignoring base rate information.
Sensitivity: P(Test positive | Infected).
Specificity: P(Test negative | Not infected).
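Sensitivity, specificity, and the base rate combine via the law of total probability and Bayes’ rule; this sketch (with hypothetical numbers) shows why the base rate fallacy is so tempting:

```python
# Hypothetical test: sensitivity 0.99, specificity 0.95, base rate 1%.
sens, spec, base = 0.99, 0.95, 0.01

# Law of total probability:
# P(positive) = P(+|infected)P(infected) + P(+|not infected)P(not infected)
p_pos = sens * base + (1 - spec) * (1 - base)

# Bayes' rule gives P(infected | positive) ≈ 0.17, far below the 99% that
# the prosecutor's fallacy (confusing P(+|infected) with P(infected|+))
# would suggest.
p_inf_given_pos = sens * base / p_pos
```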
Random variable: A numerical variable with probabilities assigned to each possible value.
Sample statistic: Population parameter + bias + random error.
Confidence interval: A range of values likely to contain a population parameter.
Proportion CI: p̂ ± z* × sqrt(p̂(1 – p̂) / n), where z* ≈ 1.96 at 95% confidence.
Mean CI: x̄ ± t* × s / sqrt(n), where t* has n – 1 degrees of freedom.
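A sketch of a proportion confidence interval under the usual normal approximation (the survey counts and the 95% z* value are illustrative):

```python
from math import sqrt

# Hypothetical survey: 210 of 400 respondents say "yes".
n, successes = 400, 210
p_hat = successes / n          # sample proportion

z = 1.96                       # z* for a 95% confidence level
moe = z * sqrt(p_hat * (1 - p_hat) / n)   # margin of error
ci = (p_hat - moe, p_hat + moe)
```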
Hypothesis test: A method to decide if sample data supports a hypothesis about a population.
Significance level: 100% – confidence level.
p-value: The probability of obtaining a result as extreme or more extreme than the observed result, assuming the null hypothesis is true.
Null hypothesis: The observation can be explained by chance variation.
Alternative hypothesis: The observation is not due to random chance.
Chi-square test: Null hypothesis: No association. Alternative hypothesis: There is an association.
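The chi-square statistic compares observed counts against the counts expected under the null hypothesis of no association; a hand computation on a hypothetical 2×2 table:

```python
# Hypothetical 2×2 table: rows = treatment, columns = outcome.
observed = [[30, 20], [20, 30]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
total = sum(row_tot)

# Expected count under H0 (no association): row total × column total / grand total.
chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / total) ** 2
           / (row_tot[i] * col_tot[j] / total)
           for i in range(2) for j in range(2))
# Compare chi2 against the chi-square distribution with (r−1)(c−1) = 1 df.
```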