Statistics Terminology and Concepts
Basic Terminology
Data Types
Sample: The subset of units selected from the population and measured to gather information.
Target Population: The entire group of individuals or objects that the researcher is interested in studying.
Continuous Data: Data that can take on any possible value within a range.
Discrete Data: Data that can take only separate, countable values, such as whole numbers.
Categorical Measurements
Measurements where a unit is placed into a category based on an observed attribute or quality.
Nominal Labels: Categories with no inherent order (e.g., male, female).
Ordinal Labels: Categories with a natural order that do not correspond to specific numbers (e.g., small, medium, large).
Probability
Addition Rule
P(A or B) = P(A) + P(B) – P(A & B)
If A & B are mutually exclusive/disjoint: P(A or B) = P(A) + P(B).
Multiplication Rule
P(A & B) = P(A)P(B|A) = P(B)P(A|B)
Conditional Probability
P(A|B) = P(A & B)/P(B), P(B|A) = P(A & B)/P(A)
If A and B are independent: P(A & B) = P(A)P(B), and equivalently P(A|B) = P(A).
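The addition, multiplication, and conditional probability rules can be checked on a small example. The sketch below assumes one roll of a fair six-sided die; the events A and B are illustrative choices, not part of the notes.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (illustrative assumption).
outcomes = set(range(1, 7))
A = {2, 4, 6}   # event A: the roll is even
B = {4, 5, 6}   # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & outcomes), len(outcomes))

p_a, p_b = prob(A), prob(B)
p_a_and_b = prob(A & B)
p_a_or_b = prob(A | B)

# Addition rule: P(A or B) = P(A) + P(B) - P(A & B)
assert p_a_or_b == p_a + p_b - p_a_and_b

# Conditional probability: P(A|B) = P(A & B) / P(B)
p_a_given_b = p_a_and_b / p_b
p_b_given_a = p_a_and_b / p_a

# Multiplication rule: P(A & B) = P(A)P(B|A) = P(B)P(A|B)
assert p_a_and_b == p_a * p_b_given_a == p_b * p_a_given_b

# A and B are independent only if P(A & B) = P(A)P(B).
independent = (p_a_and_b == p_a * p_b)
print(p_a_or_b, p_a_given_b, independent)
```

Here the two events overlap and are not independent, so the simplified rules for disjoint or independent events do not apply.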
Probability Distributions
Cumulative Distribution Function (CDF)
F(x) = P(X ≤ x) for -∞ < x < ∞ (e.g., P(1.60 < X ≤ 1.75) = P(X ≤ 1.75) – P(X ≤ 1.60))
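An interval probability is the difference of two CDF values. The sketch below assumes, purely for illustration, that X is a height in metres following a normal distribution with mean 1.70 and standard deviation 0.10; these parameters are not from the notes.

```python
from statistics import NormalDist

# Hypothetical model: heights in metres, assumed Normal(mean=1.70, sd=0.10).
heights = NormalDist(mu=1.70, sigma=0.10)

# P(1.60 < X <= 1.75) = P(X <= 1.75) - P(X <= 1.60)
p_interval = heights.cdf(1.75) - heights.cdf(1.60)
print(round(p_interval, 4))
```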
Bernoulli (Binary) Random Variable
P(Y = y) = p^y (1-p)^(1-y) for y = 0 or 1, where p is the probability of success.
Y = 1 if success is observed, and Y = 0 if not.
Binomial Distribution
Expected Value
E(X) of a random variable X is the long-run average value over infinitely many repeated draws. For X ~ Binomial(n, p): E(X) = np.
Variance and Standard Deviation
Variance: V(X) = np(1-p)
Standard Deviation: SD(X) = sqrt(np(1-p))
Both measure the expected variability.
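As a quick check of the binomial formulas, the mean and variance can be computed directly from the probability mass function and compared with np and np(1-p). The values n = 10 and p = 0.3 below are illustrative assumptions.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3   # illustrative values, not from the notes

# Mean and variance computed directly from the distribution.
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
var = sum((k - mean) ** 2 * binom_pmf(k, n, p) for k in range(n + 1))

# Compare with the shortcut formulas np and np(1-p).
print(round(mean, 6), n * p)
print(round(var, 6), n * p * (1 - p))
print(round(sqrt(var), 4))   # standard deviation
```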
Normal Distribution
PDF of Normal Distribution: f(x) = 1/(σ sqrt(2π)) exp(-(x-μ)^2/(2σ^2))
Standard Normal Distribution: μ = 0 and σ = 1: Z ~ N(0,1)
Standardization: Z = (X-μ)/σ
Inverse Standardization: X = σZ+μ
Standard Error (of the sample mean): (Standard Deviation)/sqrt(n), where n is the sample size.
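The standardization formulas can be tried on a small example. The sketch below assumes a hypothetical population with μ = 100 and σ = 15 and a made-up sample of five values; the standard error of the mean is computed as the sample standard deviation divided by the square root of the sample size.

```python
from math import sqrt
from statistics import NormalDist, stdev

mu, sigma = 100.0, 15.0   # hypothetical population mean and SD

def standardize(x, mu, sigma):
    """Z = (X - mu) / sigma"""
    return (x - mu) / sigma

def unstandardize(z, mu, sigma):
    """X = sigma * Z + mu"""
    return sigma * z + mu

z = standardize(130.0, mu, sigma)   # 130 is two SDs above the mean
x = unstandardize(z, mu, sigma)     # recovers the original value

# P(X <= 130) equals P(Z <= z) under the standard normal.
p = NormalDist().cdf(z)

# Standard error of the mean for a small made-up sample.
sample = [98.0, 105.0, 110.0, 92.0, 101.0]
se = stdev(sample) / sqrt(len(sample))
print(z, x, round(p, 4), round(se, 4))
```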
Empirical Rule
- 68%: Data falls within 1 standard deviation of the mean.
- 95%: Data falls within 2 standard deviations of the mean.
- 99.7%: Data falls within 3 standard deviations of the mean.
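The three percentages follow from the standard normal CDF. A minimal check using Python's standard library:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, N(0, 1)

# P(-k <= Z <= k): share of a normal population within k SDs of the mean.
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.4f}")
```

The rule is specific to (approximately) normal data; skewed or heavy-tailed data can give noticeably different percentages.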
Sampling Distributions
Mean: The average value of the statistic.
Standard Deviation: How spread out the statistic is around its mean; also called the standard error of the statistic.
Influencing factors: Shape of the distribution, sample size, statistic being computed.
Central Limit Theorem (CLT)
The distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution. The approximation improves with larger sample sizes and a parent distribution closer to normal.
Mean of the sample mean: μ
Variance of the sample mean: σ^2/n, where n is the sample size
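A small simulation can illustrate the CLT. The sketch below assumes a skewed parent population (exponential with rate 1, so μ = 1 and σ^2 = 1) and draws many samples of size n = 50; the population, sample size, and number of repetitions are all illustrative choices.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an Exponential(rate=1) population."""
    return mean([random.expovariate(1.0) for _ in range(n)])

n = 50        # sample size (illustrative)
reps = 5000   # number of repeated samples (illustrative)
means = [sample_mean(n) for _ in range(reps)]

print(round(mean(means), 3))        # should be close to mu = 1
print(round(pvariance(means), 4))   # should be close to sigma^2 / n = 0.02
```

Although the parent distribution is strongly skewed, the simulated sample means cluster around μ with a spread close to σ^2/n.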
Confidence Intervals
Inference: Drawing conclusions about a population using data from a sample.
Example: With 95% confidence, the interval (9.4,15.0) covers the true mean PSA value for men awaiting radical prostatectomy.
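One common way to build such an interval is estimate ± critical value × standard error. The sketch below uses a large-sample normal approximation (for small samples a t critical value would normally replace z) and simulated data that only stand in for real measurements; `mean_ci` is a hypothetical helper name.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(data, conf=0.95):
    """Large-sample (normal-approximation) confidence interval for a mean."""
    n = len(data)
    se = stdev(data) / sqrt(n)                     # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    centre = mean(data)
    return centre - z * se, centre + z * se

# Simulated measurements (made up for illustration, not the PSA data).
random.seed(1)
data = [random.gauss(12.0, 8.0) for _ in range(100)]
low, high = mean_ci(data)
print(round(low, 2), round(high, 2))
```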
Hypothesis Testing
Procedure
- State question and define parameters: Let μ0 and μy be the mean log-PSA for men 65 or older and men under 65, respectively, awaiting radical prostatectomy.
- State Hypothesis: H0 (null hypothesis): μ0 = μy vs HA (alternative hypothesis): μ0 ≠ μy
- State Type I error rate (significance level): α = 0.05
- State test statistic (T statistic): T = (difference in sample means)/(standard error of the difference)
- Compute test statistic and p-value
- Decision: If p-value > α, do not reject H0. Else, reject H0.
- Conclusion: Interpret the results in the context of the problem.
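The procedure above can be sketched for two independent groups. This is a minimal illustration under stated assumptions: the data are simulated, the standard error uses separate group variances, and the p-value uses a normal approximation to the t distribution, which is reasonable only for fairly large samples. The function name `two_sample_test` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def two_sample_test(x, y, alpha=0.05):
    """Two-sided test of H0: the two population means are equal.

    Returns the test statistic, an approximate p-value, and the decision.
    """
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    t_stat = (mean(x) - mean(y)) / se
    # Normal approximation to the two-sided p-value (large samples only).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    decision = "reject H0" if p_value <= alpha else "do not reject H0"
    return t_stat, p_value, decision

random.seed(2)
older = [random.gauss(2.6, 0.9) for _ in range(80)]     # simulated log-PSA, 65 or older
younger = [random.gauss(2.3, 0.9) for _ in range(90)]   # simulated log-PSA, under 65
t_stat, p_value, decision = two_sample_test(older, younger)
print(round(t_stat, 3), round(p_value, 4), decision)
```

The final step, the conclusion, still has to be written in the context of the problem; the code only produces the numbers behind the decision.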
Contingency Tables
Proportion: A part of the whole (e.g., 15 out of 1453).
Rate: A proportion rescaled for easier interpretation (e.g., 15 out of 1453 is about 1.03 per 100).
binom.confint() (R package binom): Computes confidence intervals for a binomial proportion, including the Wilson interval (methods = "wilson").
In a two-way table, rows represent the explanatory variable (X) and columns represent the response variable (Y).
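For reference, the Wilson interval for a proportion can be computed by hand. The sketch below is a plain Python version of the standard Wilson score formula, not the R implementation, applied to the 15 out of 1453 example; `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(x, n, conf=0.95):
    """Wilson score confidence interval for a binomial proportion x / n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = x / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(15, 1453)
print(f"proportion = {15 / 1453:.4f}, per 100 = {100 * 15 / 1453:.2f}")
print(f"95% Wilson CI: ({low:.4f}, {high:.4f})")
```

Unlike the simple estimate ± z × SE interval, the Wilson interval is not centred on the sample proportion and stays inside [0, 1], which matters for rare events such as this one.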
Diagnostic Testing
Truth: Unknown and possibly unverifiable.
Test Result: The binary outcome of a test (+ve/-ve).
Sensitivity: Probability of a positive test on a positive person (true positive rate).
Specificity: Probability of a negative test on a negative person (true negative rate).
False positive rate = 1 – Specificity; false negative rate = 1 – Sensitivity.
Prevalence: Proportion of the population with the condition.
High Sensitivity: Detects most true cases (few false negatives).
Low Specificity: Produces many false alarms (false positives).
Perfect Test: Sensitivity = 1, Specificity = 1.
Good Threshold: A cutoff beyond which sensitivity increases only slowly as (1-Specificity) increases, so further gains in detection cost many more false positives.
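These quantities are simple ratios of counts once test results are cross-classified against the true condition. The counts below are hypothetical and assume the truth is known for every person.

```python
# Hypothetical 2x2 counts: test result versus true condition.
tp, fn = 90, 10     # condition present: positive tests, negative tests
fp, tn = 45, 855    # condition absent: positive tests, negative tests

sensitivity = tp / (tp + fn)            # P(test + | condition present)
specificity = tn / (tn + fp)            # P(test - | condition absent)
false_positive_rate = 1 - specificity   # P(test + | condition absent)
prevalence = (tp + fn) / (tp + fn + fp + tn)

print(sensitivity, specificity, round(false_positive_rate, 3), prevalence)
```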
Relative Risk (RR): Ratio of probabilities of an event in two groups.
RR = p1/p2, where p1 = n11/(n11 + n12), p2 = n21/(n21 + n22).
Odds Ratio (OR): Ratio of odds of success in two groups: OR = [p1/(1-p1)]/[p2/(1-p2)].
Odds of Success: P(success)/P(failure) = p/(1-p)
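Both measures come straight from the cell counts of a two-way table. The counts below are hypothetical; the cell names follow the n11, n12, n21, n22 notation used above.

```python
# Hypothetical two-way table: rows are groups, columns are event / no event.
n11, n12 = 30, 70   # group 1: events, non-events
n21, n22 = 15, 85   # group 2: events, non-events

p1 = n11 / (n11 + n12)
p2 = n21 / (n21 + n22)

relative_risk = p1 / p2

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1 - p)

odds_ratio = odds(p1) / odds(p2)   # algebraically (n11 * n22) / (n12 * n21)
print(round(relative_risk, 3), round(odds_ratio, 3))
```

The odds ratio is further from 1 than the relative risk here; the two are close only when the event is rare in both groups.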
Pearson Chi-Squared Test
(aa = chisq.test(x = c.table, correct = FALSE))
- Assume the null hypothesis H0 (independence) is true.
- Estimate expected cell counts under H0.
- Compare observed and expected cell counts to assess evidence against H0.
- Interpret p-value: If p-value < α, reject H0 (evidence that the variables are associated). If p-value > α, there is insufficient evidence to suggest an association.
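The steps above can be sketched by hand for a 2x2 table. This is a plain Python illustration without continuity correction, not the R function itself; the counts are hypothetical, and the p-value shortcut used here holds only for 1 degree of freedom.

```python
from math import erfc, sqrt

def pearson_chi2_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / n.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi-squared > c) = erfc(sqrt(c / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

c_table = [[30, 70], [15, 85]]   # hypothetical counts
chi2, p_value = pearson_chi2_2x2(c_table)
print(round(chi2, 4), round(p_value, 4))
```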
Fisher’s Exact Test
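(fisher.test): An alternative test for small samples, such as tables with small expected cell counts. Computes exact p-values from the hypergeometric distribution of tables with the same margins as the observed table.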
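For a 2x2 table the exact p-value can be computed by enumerating every table with the observed row and column totals. The sketch below is a plain Python illustration using one common two-sided convention (summing the probabilities of all tables no more likely than the observed one); the table is hypothetical and `fisher_exact_2x2` is a hypothetical helper name.

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-7))

p_value = fisher_exact_2x2([[3, 1], [1, 3]])   # hypothetical small table
print(round(p_value, 4))
```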