Introduction to Statistics and Data Analysis

Chapter 1: Population and Sampling

Population: The entire group (of individuals or objects) we want to learn about.

Population parameter: A numerical fact about a population.

Census: Surveying every member of the population; costly and often not feasible.

Good Samples

  • Include all available sampling units from the population.
  • Do not contain irrelevant sampling units from another population.
  • Do not contain duplicated sampling units.
  • Do not contain sampling units in clusters.

Selection bias: Due to non-probability sampling or a poor sampling frame.

Probability Sampling

Simple random sample: Every member of the frame has an equal chance of being chosen without replacement.

Systematic sampling: Every k-th member of the frame is chosen, starting from a random member between 1 and k.

Stratified sampling: The population is split into groups/strata, and members are randomly selected from each stratum.
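The three probability sampling methods above can be sketched with Python's standard library. The frame of 100 units and the two strata are made-up illustrations:

```python
import random

frame = list(range(1, 101))  # hypothetical sampling frame of 100 units

# Simple random sample: every unit has an equal chance, chosen without replacement.
srs = random.sample(frame, 10)

# Systematic sample: random start among the first k units, then every k-th unit after.
k = 10
start = random.randrange(k)
systematic = frame[start::k]

# Stratified sample: split the frame into strata, sample randomly within each.
strata = {"first half": frame[:50], "second half": frame[50:]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]
```

Each method yields a 10-unit sample here, but only stratified sampling guarantees representation from every stratum.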

Non-Probability Sampling

Convenience sampling: Sampling whoever is easiest to reach; prone to selection bias (and often non-response bias).

Volunteer sampling: Self-selected sample, usually from individuals with strong opinions.

Generalizability

A study is generalizable if it:

  1. Has a good sampling frame that is equal to or larger than the population.
  2. Uses a probability-based sampling method to minimize selection bias.
  3. Has a large sample size.
  4. Minimizes the non-response rate.

Variable Types

Numerical variables: Discrete (has gaps) or continuous (no gaps).

Categorical variables: Ordinal (has order) or nominal (no order).

Descriptive Statistics

Mean: Σxi / n (scaling all points by c scales the mean by c; adding c to all points increases the mean by c).

Variance: Σ(xi – mean)² / (n – 1) (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales variance by c²).

Standard deviation: sqrt(Variance) (adding c to all points doesn’t change the value; scaling points by c scales the SD by |c|).

IQR: 3rd quartile – 1st quartile (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales the IQR by |c|). Points more than 1.5 × IQR beyond the quartiles are commonly flagged as outliers.
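The shift and scale properties of these summary statistics can be checked directly with Python's `statistics` module (the data and the constant c are made-up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
c = 3

mean = statistics.mean(data)
var = statistics.variance(data)   # sample variance: divides by n - 1
sd = statistics.stdev(data)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

shifted = [x + c for x in data]   # adding c: mean shifts by c, spread is unchanged
scaled = [x * c for x in data]    # scaling by c: variance scales by c**2, SD and IQR by |c|

assert statistics.mean(shifted) == mean + c
assert abs(statistics.variance(shifted) - var) < 1e-9
assert abs(statistics.variance(scaled) - var * c**2) < 1e-9
assert abs(statistics.stdev(scaled) - sd * abs(c)) < 1e-9
```

Note that `statistics.quantiles` supports several quartile conventions; textbooks differ on the exact rule, so quartile values may not match hand calculations precisely.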

Study Design

Random assignment: Controlled experiment.

Self-assignment: Observational study.

Single blinding: Subjects are blinded.

Double blinding: Subjects and assessors are blinded.

Chapter 2: Relationships Between Variables

Marginal rate: Relates to only one variable.

Conditional rate: Relates to two variables (e.g., P(A|B) is the probability of A given that B is already true).

Joint rate: Probability of A AND B.

Positive association: A and B tend to occur together.

Negative association: A and NB tend to occur together.

Symmetry rule: rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA).

Basic rule of rates: The overall rate(A) always lies between rate(A|B) and rate(A|NB). The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B).

Simpson’s paradox: A trend appears in multiple groups of data but disappears or reverses when the groups are combined.

Confounder: A third variable associated with both the independent and dependent variables.
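Simpson’s paradox can be demonstrated numerically. The counts below are the classic kidney-stone treatment figures often used to illustrate it, with stone size acting as the confounder:

```python
# (successes, total) per treatment, within each stone-size group
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(success, total):
    return success / total

# Treatment A has the higher success rate inside every group...
for group, treatments in data.items():
    assert rate(*treatments["A"]) > rate(*treatments["B"])

# ...yet the lower rate once the groups are combined.
overall = {t: [sum(x) for x in zip(*(g[t] for g in data.values()))]
           for t in ("A", "B")}
assert rate(*overall["A"]) < rate(*overall["B"])
```

The reversal happens because treatment A was given mostly to the harder (large-stone) cases, so pooling mixes unequal group sizes into the overall rates.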

Chapter 3: Data Visualization and Summarization

Histogram bins: [a, b) includes a but excludes b.

Unimodal: One peak.

Multimodal: Multiple peaks.

Left skewed: Tail at low values; Mean < Median < Mode.

Symmetrical: Mean = Median = Mode.

Right skewed: Tail at high values; Mean > Median > Mode.

Spread: Low variability (low standard deviation, steep curve); High variability (high standard deviation, gentle curve).

Bivariate Data (Association)

Direction: Positive (increase in A means increase in B); Negative (increase in A means decrease in B).

Form: Linear, non-linear (quadratic, exponential, logarithmic, cubic).

Strength: How closely the points conform to the regression line or curve (closer -> stronger).

Correlation coefficient: Sign indicates direction; Magnitude indicates strength (0-0.3 weak, 0.3-0.7 moderate, 0.7-1 strong).

Standard units: (xi – mean(x)) / SDx

Correlation coefficient (r): Unchanged by multiplying either variable by a positive constant, adding a constant to either variable, or interchanging the two axes.
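Computing r as the average product of standard units makes these invariances easy to verify. A minimal sketch with made-up data (using n − 1 in the average, to match the sample SD):

```python
import statistics

def correlation(xs, ys):
    """r = average product of the standard units (divided by n - 1 to match sample SD)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 7]
r = correlation(xs, ys)

# r is unchanged by positive scaling, shifting, or interchanging the axes
assert abs(correlation([3 * x + 10 for x in xs], ys) - r) < 1e-9
assert abs(correlation(ys, xs) - r) < 1e-9
```

Standard units strip away each variable’s location and scale, which is exactly why r cannot be affected by shifts, positive rescaling, or swapping the axes.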

Ecological fallacy: Drawing conclusions about individuals based on aggregate (group-level) data.

Atomistic fallacy: Drawing conclusions about groups based on individual-level data.

Regression

Linear regression: Y = mX + b (m is the slope, b is the y-intercept, X is the independent variable, Y is the dependent variable).

Least squares linear regression: Predicts the average Y value based on X.

Exponential regression: Y = c·b^X.
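The least squares slope and intercept follow from the quantities already defined: m = r · (SDy / SDx), and the line passes through the point of means, so b = mean(Y) − m · mean(X). A minimal sketch with made-up data that lies exactly on y = 2x + 1:

```python
import statistics

def least_squares(xs, ys):
    """Slope m = r * (SDy / SDx); intercept b = mean(y) - m * mean(x)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)
    m = r * sy / sx
    b = my - m * mx
    return m, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1, so r = 1
m, b = least_squares(xs, ys)
assert abs(m - 2) < 1e-9 and abs(b - 1) < 1e-9
```

With perfectly linear data r = 1 and the fitted line reproduces the true slope and intercept; with noisy data the same formulas give the line minimizing the sum of squared vertical errors.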

Chapter 4: Probability and Inference

Sample space: The collection of all outcomes.

Event: A subset of the sample space.

Mutually exclusive: Events cannot happen simultaneously.

Independent: One event does not affect the probability of another.

Rules of Probabilities

  1. 0 ≤ P(E) ≤ 1.
  2. P(S) = 1.
  3. P(E∪F) = P(E) + P(F) if E and F are mutually exclusive.

Uniform probability: Assigning equal probability to every outcome (1/N).

Conditional probability: P(E|F) = P(E∩F) / P(F).

Independence: If A and B are independent, P(A) = P(A|B) = P(A|NB).

Conditional independence: A and B are conditionally independent given C if P(A∩B|C) = P(A|C) * P(B|C).

Law of total probability: If E and F are mutually exclusive and E∪F = S, then P(G) = P(G|E)P(E) + P(G|F)P(F).

Prosecutor’s fallacy: Assuming P(A|B) = P(B|A).

Conjunction fallacy: Assuming P(A∩B) > P(A).

Base rate fallacy: Ignoring base rate information.

Sensitivity: P(Test positive | Infected).

Specificity: P(Test negative | Not infected).
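Sensitivity, specificity, the law of total probability, and the base rate fallacy come together in diagnostic testing. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 90% specificity):

```python
# Hypothetical test characteristics
prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

# Law of total probability:
# P(positive) = P(+ | infected) P(infected) + P(+ | not infected) P(not infected)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Conditional probability (Bayes): P(infected | positive)
p_infected_given_positive = sensitivity * prevalence / p_positive
```

Despite the 95% sensitivity, `p_infected_given_positive` comes out below 9%, because the 1% base rate means most positives are false positives. Equating P(infected | positive) with P(positive | infected) is the prosecutor's fallacy; ignoring the 1% prevalence is the base rate fallacy.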

Random variable: A numerical variable with probabilities assigned to each possible value.

Sample statistic: Population parameter + bias + random error.

Confidence interval: A range of values likely to contain a population parameter.

Proportion CI: p̂ ± z* · sqrt(p̂(1 − p̂) / n), where p̂ is the sample proportion and z* is the critical value for the chosen confidence level.

Mean CI: x̄ ± t* · s / sqrt(n), where x̄ is the sample mean, s is the sample standard deviation, and t* is the critical value.
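A 95% confidence interval for a proportion is p̂ ± z*·sqrt(p̂(1 − p̂)/n) with z* ≈ 1.96. A quick sketch with made-up poll numbers:

```python
import math

# Hypothetical poll: 520 of 1000 respondents say yes.
n, successes = 1000, 520
p_hat = successes / n
z = 1.96                              # z* critical value for 95% confidence

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)
```

Here the interval is roughly (0.489, 0.551): the poll cannot distinguish the observed 52% from an even 50% split at the 95% confidence level.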

Hypothesis test: A method to decide if sample data supports a hypothesis about a population.

Significance level: 100% – confidence level (e.g., a 95% confidence level corresponds to a 5% significance level).

p-value: The probability of obtaining a result as extreme or more extreme than the observed result, assuming the null hypothesis is true.

Null hypothesis: The observation can be explained by chance variation.

Alternative hypothesis: The observation is not due to random chance.

Chi-square test: Null hypothesis: No association. Alternative hypothesis: There is an association.
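The chi-square statistic compares observed counts with the counts expected under the null of no association. A minimal sketch for a hypothetical 2×2 table (rows: treatment/control; columns: improved/not improved):

```python
# Hypothetical observed counts
observed = [[30, 20], [15, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under no association: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# df = (rows - 1)(cols - 1) = 1; the 5% critical value is about 3.841
reject_null = chi_sq > 3.841
```

Here the statistic is about 9.09, well past 3.841, so the null hypothesis of no association would be rejected at the 5% significance level.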