Introduction to Statistics and Data Analysis

Chapter 1: Population and Sampling

Population: The entire group (of individuals or objects) we want to learn about.

Population parameter: A numerical fact about a population.

Census: Surveying every member of the population; costly and often not feasible.

Good Samples

  • Include all available sampling units from the population.
  • Do not contain irrelevant sampling units from another population.
  • Do not contain duplicated sampling units.
  • Do not contain sampling units in clusters.

Selection bias: Due to non-probability sampling or a poor sampling frame.

Probability Sampling

Simple random sample: Every member of the frame has an equal chance of being chosen without replacement.

Systematic sampling: Every k-th member of the frame is chosen, starting from a random member between 1 and k.

Stratified sampling: The population is split into groups/strata, and members are randomly selected from each stratum.
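The three probability sampling methods above can be sketched with Python's standard library. The frame of 100 units and the two strata are made-up illustrations:

```python
import random

frame = list(range(1, 101))  # hypothetical sampling frame of 100 units

# Simple random sample: every unit has an equal chance, chosen without replacement.
srs = random.sample(frame, 10)

# Systematic sample: random start among the first k units, then every k-th unit after.
k = 10
start = random.randrange(k)
systematic = frame[start::k]

# Stratified sample: split the frame into strata, sample randomly within each.
strata = {"first half": frame[:50], "second half": frame[50:]}
stratified = [u for units in strata.values() for u in random.sample(units, 5)]
```

Each method yields a 10-unit sample here, but only stratified sampling guarantees representation from every stratum.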

Non-Probability Sampling

Convenience sampling: Sampling whoever is easiest to reach; prone to selection bias (and often non-response bias).

Volunteer sampling: Self-selected sample, usually from individuals with strong opinions.

Generalizability

A study is generalizable if it:

  1. Has a good sampling frame that is equal to or larger than the population.
  2. Uses a probability-based sampling method to minimize selection bias.
  3. Has a large sample size.
  4. Minimizes the non-response rate.

Variable Types

Numerical variables: Discrete (has gaps) or continuous (no gaps).

Categorical variables: Ordinal (has order) or nominal (no order).

Descriptive Statistics

Mean: Σxi / n (scaling all points by c scales the mean by c; adding c to all points increases the mean by c).

Variance: Σ(xi – mean)² / (n – 1) (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales variance by c²).

Standard deviation: sqrt(Variance) (adding c to all points doesn’t change the value; scaling points by c scales the SD by |c|).

IQR: 3rd quartile – 1st quartile (always non-negative; adding c to all points doesn’t change the value; scaling points by c scales the IQR by |c|). Points more than 1.5 × IQR beyond the quartiles are commonly flagged as outliers.
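The shift and scale properties of these summary statistics can be checked directly with Python's `statistics` module (the data and the constant c are made-up):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
c = 3

mean = statistics.mean(data)
var = statistics.variance(data)   # sample variance: divides by n - 1
sd = statistics.stdev(data)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

shifted = [x + c for x in data]   # adding c: mean shifts by c, spread is unchanged
scaled = [x * c for x in data]    # scaling by c: variance scales by c**2, SD and IQR by |c|

assert statistics.mean(shifted) == mean + c
assert abs(statistics.variance(shifted) - var) < 1e-9
assert abs(statistics.variance(scaled) - var * c**2) < 1e-9
assert abs(statistics.stdev(scaled) - sd * abs(c)) < 1e-9
```

Note that `statistics.quantiles` supports several quartile conventions; textbooks differ on the exact rule, so quartile values may not match hand calculations precisely.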

Study Design

Random assignment: Controlled experiment.

Self-assignment: Observational study.

Single blinding: Subjects are blinded.

Double blinding: Subjects and assessors are blinded.

Chapter 2: Relationships Between Variables

Marginal rate: Relates to only one variable.

Conditional rate: Relates to two variables (e.g., P(A|B) is the probability of A given that B is already true).

Joint rate: Probability of A AND B.

Positive association: A and B tend to occur together.

Negative association: A and NB tend to occur together.

Symmetry rule: rate(A|B) > rate(A|NB) ⇔ rate(B|A) > rate(B|NA).

Basic rule of rates: The overall rate(A) always lies between rate(A|B) and rate(A|NB). The closer rate(B) is to 100%, the closer rate(A) is to rate(A|B).

Simpson’s paradox: A trend appears in multiple groups of data but disappears or reverses when the groups are combined.

Confounder: A third variable associated with both the independent and dependent variables.
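Simpson’s paradox can be demonstrated numerically. The counts below are the classic kidney-stone treatment figures often used to illustrate it, with stone size acting as the confounder:

```python
# (successes, total) per treatment, within each stone-size group
data = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(success, total):
    return success / total

# Treatment A has the higher success rate inside every group...
for group, treatments in data.items():
    assert rate(*treatments["A"]) > rate(*treatments["B"])

# ...yet the lower rate once the groups are combined.
overall = {t: [sum(x) for x in zip(*(g[t] for g in data.values()))]
           for t in ("A", "B")}
assert rate(*overall["A"]) < rate(*overall["B"])
```

The reversal happens because treatment A was given mostly to the harder (large-stone) cases, so pooling mixes unequal group sizes into the overall rates.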

Chapter 3: Data Visualization and Summarization

Histogram bins: [a, b) includes a but excludes b.

Unimodal: One peak.

Multimodal: Multiple peaks.

Left skewed: Tail at low values; Mean < Median < Mode.

Symmetrical: Mean = Median = Mode.

Right skewed: Tail at high values; Mean > Median > Mode.

Spread: Low variability (low standard deviation, steep curve); High variability (high standard deviation, gentle curve).

Bivariate Data (Association)

Direction: Positive (increase in A means increase in B); Negative (increase in A means decrease in B).

Form: Linear, non-linear (quadratic, exponential, logarithmic, cubic).

Strength: How closely the points conform to the regression line or curve (closer -> stronger).

Correlation coefficient: Sign indicates direction; Magnitude indicates strength (0-0.3 weak, 0.3-0.7 moderate, 0.7-1 strong).

Standard units: (xi – mean(x)) / SDx

Correlation coefficient (r): Unchanged by multiplying either variable by a positive constant, adding a constant to either variable, or interchanging the two axes.
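Computing r as the average product of standard units makes these invariances easy to verify. A minimal sketch with made-up data (using n − 1 in the average, to match the sample SD):

```python
import statistics

def correlation(xs, ys):
    """r = average product of the standard units (divided by n - 1 to match sample SD)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 7]
r = correlation(xs, ys)

# r is unchanged by positive scaling, shifting, or interchanging the axes
assert abs(correlation([3 * x + 10 for x in xs], ys) - r) < 1e-9
assert abs(correlation(ys, xs) - r) < 1e-9
```

Standard units strip away each variable’s location and scale, which is exactly why r cannot be affected by shifts, positive rescaling, or swapping the axes.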

Ecological fallacy: Drawing conclusions about individuals based on aggregate (group-level) data.

Atomistic fallacy: Drawing conclusions about groups based on individual-level data.

Regression

Linear regression: Y = mX + b (m is the slope, b is the y-intercept, X is the independent variable, Y is the dependent variable).

Least squares linear regression: Predicts the average Y value based on X.

Exponential regression: Y = c·b^X.
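The least squares slope and intercept follow from the quantities already defined: m = r · (SDy / SDx), and the line passes through the point of means, so b = mean(Y) − m · mean(X). A minimal sketch with made-up data that lies exactly on y = 2x + 1:

```python
import statistics

def least_squares(xs, ys):
    """Slope m = r * (SDy / SDx); intercept b = mean(y) - m * mean(x)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)
    m = r * sy / sx
    b = my - m * mx
    return m, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1, so r = 1
m, b = least_squares(xs, ys)
assert abs(m - 2) < 1e-9 and abs(b - 1) < 1e-9
```

With perfectly linear data r = 1 and the fitted line reproduces the true slope and intercept; with noisy data the same formulas give the line minimizing the sum of squared vertical errors.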

Chapter 4: Probability and Inference

Sample space: The collection of all outcomes.

Event: A subset of the sample space.

Mutually exclusive: Events cannot happen simultaneously.

Independent: One event does not affect the probability of another.

Rules of Probabilities

  1. 0 ≤ P(E) ≤ 1.
  2. P(S) = 1.
  3. P(E∪F) = P(E) + P(F) if E and F are mutually exclusive.

Uniform probability: Assigning equal probability to every outcome (1/N).

Conditional probability: P(E|F) = P(E∩F) / P(F).

Independence: If A and B are independent, P(A) = P(A|B) = P(A|NB).

Conditional independence: A and B are conditionally independent given C if P(A∩B|C) = P(A|C) * P(B|C).

Law of total probability: If E and F are mutually exclusive and E∪F = S, then P(G) = P(G|E)P(E) + P(G|F)P(F).

Prosecutor’s fallacy: Assuming P(A|B) = P(B|A).

Conjunction fallacy: Assuming P(A∩B) > P(A).

Base rate fallacy: Ignoring base rate information.

Sensitivity: P(Test positive | Infected).

Specificity: P(Test negative | Not infected).
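Sensitivity, specificity, the law of total probability, and the base rate fallacy come together in diagnostic testing. A sketch with hypothetical numbers (1% prevalence, 95% sensitivity, 90% specificity):

```python
# Hypothetical test characteristics
prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

# Law of total probability:
# P(positive) = P(+ | infected) P(infected) + P(+ | not infected) P(not infected)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Conditional probability (Bayes): P(infected | positive)
p_infected_given_positive = sensitivity * prevalence / p_positive
```

Despite the 95% sensitivity, `p_infected_given_positive` comes out below 9%, because the 1% base rate means most positives are false positives. Equating P(infected | positive) with P(positive | infected) is the prosecutor's fallacy; ignoring the 1% prevalence is the base rate fallacy.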

Random variable: A numerical variable with probabilities assigned to each possible value.

Sample statistic: Population parameter + bias + random error.

Confidence interval: A range of values likely to contain a population parameter.

Proportion CI: p̂ ± z* · sqrt(p̂(1 − p̂) / n), where p̂ is the sample proportion and z* is the critical value for the chosen confidence level.

Mean CI: x̄ ± t* · s / sqrt(n), where x̄ is the sample mean, s is the sample standard deviation, and t* is the critical value.
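A 95% confidence interval for a proportion is p̂ ± z*·sqrt(p̂(1 − p̂)/n) with z* ≈ 1.96. A quick sketch with made-up poll numbers:

```python
import math

# Hypothetical poll: 520 of 1000 respondents say yes.
n, successes = 1000, 520
p_hat = successes / n
z = 1.96                              # z* critical value for 95% confidence

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - margin, p_hat + margin)
```

Here the interval is roughly (0.489, 0.551): the poll cannot distinguish the observed 52% from an even 50% split at the 95% confidence level.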

Hypothesis test: A method to decide if sample data supports a hypothesis about a population.

Significance level: 100% – confidence level (e.g., a 95% confidence level corresponds to a 5% significance level).

p-value: The probability of obtaining a result as extreme or more extreme than the observed result, assuming the null hypothesis is true.

Null hypothesis: The observation can be explained by chance variation.

Alternative hypothesis: The observation is not due to random chance.

Chi-square test: Null hypothesis: No association. Alternative hypothesis: There is an association.
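The chi-square statistic compares observed counts with the counts expected under the null of no association. A minimal sketch for a hypothetical 2×2 table (rows: treatment/control; columns: improved/not improved):

```python
# Hypothetical observed counts
observed = [[30, 20], [15, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under no association: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# df = (rows - 1)(cols - 1) = 1; the 5% critical value is about 3.841
reject_null = chi_sq > 3.841
```

Here the statistic is about 9.09, well past 3.841, so the null hypothesis of no association would be rejected at the 5% significance level.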