Key Concepts in Statistics: Data Analysis and Probability

Data and Variables

Statistics: A branch of science that deals with collecting, organizing, analyzing, interpreting, summarizing, and presenting data.

Unit/Individual: An object on which we take a measurement or observation (e.g., people, places, things).

Population: The collection of all individuals or units under consideration.

Sample: A subset of the population from which we obtain data.

Variable: Any characteristic or property of an individual.

  • Quantitative Data: Numerical characteristics of an individual for which arithmetic operations make sense (e.g., birth weight, final exam score, blood sugar level).
  • Qualitative (Categorical) Ordinal Data: Places individuals into groups based on common characteristics for which arithmetic operations do not make sense, but for which there is a logical, commonly accepted ordering from lower to higher (e.g., satisfaction level, shirt size (S, M, L), shoe size (1, 2, 3)).
  • Qualitative (Categorical) Nominal Data: Places individuals into groups based on common characteristics for which arithmetic operations do not make sense and for which there is no logical ordering from lower to higher (e.g., car type, area code, cell phone brand).

Distributions

Distribution (Dist.): Tells us what values a variable takes on and how often it takes on these values.

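Frequency Distribution (Freq. Dist.): A count of how many data values fall into each of a set of predetermined classes (intervals). The first interval contains the minimum value and the last contains the maximum value. Each interval is half-open, [x, y), so it includes its lower endpoint x but not its upper endpoint y.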

Relative Frequency (Rel. Freq.): The number of data values in a class divided by the total number of data values in a sample.
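
As a minimal sketch of the two definitions above, the following Python snippet counts values in half-open classes [lo, hi) and divides each count by the sample size. The function name frequency_table, the class boundaries, and the exam scores are all made up for illustration.

```python
from collections import Counter

def frequency_table(values, start, width, n_classes):
    """Return (lo, hi, count, relative_frequency) for each half-open class [lo, hi)."""
    counts = Counter()
    for v in values:
        idx = int((v - start) // width)  # index of the class that contains v
        if 0 <= idx < n_classes:
            counts[idx] += 1
    total = len(values)
    return [
        (start + i * width, start + (i + 1) * width, counts[i], counts[i] / total)
        for i in range(n_classes)
    ]

# Hypothetical exam scores, not taken from the notes.
scores = [52, 67, 71, 74, 78, 80, 83, 85, 91, 98]
table = frequency_table(scores, start=50, width=10, n_classes=5)

for lo, hi, count, rel in table:
    print(f"[{lo}, {hi}): count={count}, relative frequency={rel:.2f}")
```

The relative frequencies across all classes add up to 1 as long as every value falls into one of the classes.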

Skewed to the Right Histogram (Skew to Right Hist.): A distribution is skewed to the right if the right tail extends farther than the left. In that case the mean is typically to the right of (greater than) the median. If the data are skewed, summarize them with the 5-number summary.

Pie Chart: A circle divided into slices, where the area of each slice is proportional to the relative frequency of the category it represents.

Descriptive Statistics

Mean: (x̄) The sum of observations in a data set divided by the number of observations.

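Median: The middle value in an ordered dataset. With an even number of observations, it is the average of the two middle values.

Quartiles: The values (Q1, Q2, Q3) that divide an ordered dataset into four equal parts.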

5-Number Summary (5#Sumry): Minimum, Q1, Q2 (Median), Q3, Maximum.

Interquartile Range (IQR): Q3 – Q1. Measures the spread of the middle 50%.

Variance: s² = Σ(xᵢ – x̄)² / (n – 1). The average squared distance of data from the mean.

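Standard Deviation (S.D.): The positive square root of the variance. It measures the typical distance of observations from the mean, in the same units as the data. It is not the same as the average absolute distance from the mean.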
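
The variance and standard deviation formulas can be worked through step by step. This is a small Python sketch on a made-up data set. It computes the sample variance by hand with the n – 1 divisor and compares the result with the standard library's statistics.variance, which uses the same divisor.

```python
import math
import statistics

# Hypothetical data, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n                                  # x-bar
squared_deviations = [(x - mean) ** 2 for x in data]  # (x_i - x-bar)^2
sample_var = sum(squared_deviations) / (n - 1)        # s^2, divisor n - 1
sample_sd = math.sqrt(sample_var)                     # s

print(f"mean = {mean}, s^2 = {sample_var:.4f}, s = {sample_sd:.4f}")
print("matches statistics.variance:",
      math.isclose(sample_var, statistics.variance(data)))
```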

Boxplot: A graph of the 5-number summary. If the right tail (whisker) is longer than the left, the distribution is right-skewed.

Lower Fence (LF): Q1 – 1.5 * IQR

Upper Fence (UF): Q3 + 1.5 * IQR
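
The 5-number summary, IQR, and fences can be computed together. The Python sketch below uses made-up data and assumes the "inclusive" quartile method of statistics.quantiles. Textbooks and software use several quartile conventions, so Q1 and Q3 may differ slightly from a hand calculation or from R. Values outside the fences are flagged here as potential outliers.

```python
import statistics

# Hypothetical data with one unusually large value.
data = sorted([1, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles under the "inclusive" convention (an assumption; other conventions exist).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (data[0], q1, q2, q3, data[-1])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond either fence are candidates for outliers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("5-number summary:", five_number_summary)
print("IQR:", iqr, "LF:", lower_fence, "UF:", upper_fence)
print("potential outliers:", outliers)
```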

Probability

Random: Individual outcomes are uncertain, but there is a regular distribution of outcomes in a large number of repetitions.

Probability: The proportion of times the outcome would occur in an infinitely long series of trials.

Probability Distribution (Prob. Dist.): A mathematical model of random behavior that consists of possible outcomes and the probability of each outcome.

Sample Space: The set of all possible outcomes, denoted S. For example, for an experiment made up of two trials, S = {(x₁, x₂) | x₁, x₂ ∈ {set of possible values}}.

Event: A subset of outcomes in a sample space.

Probability of an Event: The sum of the probabilities of all outcomes that make up the event.
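
To make sample space, event, and event probability concrete, here is a small Python sketch for rolling two fair six-sided dice. The dice example is an assumption added for illustration. Each of the 36 ordered pairs is taken to be equally likely, and the event "the two faces sum to 7" gets its probability by adding up the probabilities of its outcomes.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# Sample space S: all ordered pairs (x1, x2) with x1, x2 in {1, ..., 6}.
sample_space = list(product(faces, repeat=2))

# Probability distribution: fair dice, so each outcome has probability 1/36.
prob = {outcome: Fraction(1, len(sample_space)) for outcome in sample_space}

# Event: the subset of outcomes whose faces sum to 7.
event = [outcome for outcome in sample_space if sum(outcome) == 7]

# Probability of the event: sum of the probabilities of its outcomes.
p_event = sum(prob[outcome] for outcome in event)

print("outcomes in S:", len(sample_space))
print("outcomes in event:", event)
print("P(sum is 7) =", p_event)
```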

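Density Curve: A mathematical model of a variable’s distribution. The total area under the curve is 1, and the area under the curve over a range of values is the proportion of observations that fall in that range.

punif(x, a, b): The probability that a variable uniformly distributed between a and b takes a value less than or equal to x.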
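
For a uniform density on [a, b], the area to the left of x is a rectangle of width (x – a) and height 1 / (b – a). The Python function below is a hand-written sketch of that calculation. The name uniform_cdf is hypothetical and is meant only to mirror what punif(x, a, b) computes, not to reproduce R's implementation.

```python
def uniform_cdf(x, a=0.0, b=1.0):
    """P(X <= x) for X uniformly distributed on [a, b]."""
    if b <= a:
        raise ValueError("b must be greater than a")
    if x <= a:
        return 0.0   # no area lies to the left of a
    if x >= b:
        return 1.0   # all of the area lies to the left of b
    return (x - a) / (b - a)  # rectangle: width (x - a), height 1 / (b - a)

# Example: X uniform on [0, 10]; proportion of values at or below 3.
print(uniform_cdf(3, 0, 10))
# Proportion between 2 and 5 is a difference of two left-hand areas.
print(uniform_cdf(5, 0, 10) - uniform_cdf(2, 0, 10))
```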

Parameters and Statistics

Parameter (μ, σ², σ): A number that describes a population.

Statistic (x̄, s², s): A number computed from sample data, used to estimate the value of a parameter.

Population Mean: μ. The average value of all units in the population.

Sample Mean: x̄. The average of all values in a sample of the population. Use it to estimate μ.

Population Variance: σ². The average squared distance of all units in the population from the population mean μ. Its square root, σ, is the standard deviation of all units in the population.

Sample Variance: s². The sample counterpart of the population variance, computed from the squared distances of the sample values from the sample mean x̄ and divided by n – 1. Use it to estimate σ².

Normal Distributions

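68-95-99.7 Rule: In a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.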
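
The three percentages in the rule's name can be checked numerically as the normal area within 1, 2, and 3 standard deviations of the mean. This sketch uses Python's statistics.NormalDist as a stand-in for a normal table. The standard normal is used here, but the proportions are the same for any μ and σ.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area between -k and +k standard deviations, for k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD of the mean: {area:.4f}")
```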

Standard Normal Distribution: Has a mean (μ) of 0 and a standard deviation (σ) of 1.

Z = (X – μ) / σ

pnorm(z): Returns P(Z ≤ z), the area under the standard normal curve to the left of z.
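
Standardizing and then looking up a left-tail area can be sketched in Python as follows. The helper name z_score and the numbers (a value of 130 from a normal population with μ = 100 and σ = 15) are made up. NormalDist().cdf(z) plays the role that pnorm(z) plays in these notes.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical values, not taken from the notes.
z = z_score(130, mu=100, sigma=15)

# Area to the left of z under the standard normal curve, P(Z <= z).
p_below = NormalDist().cdf(z)
# Area to the right is the complement.
p_above = 1 - p_below

print(f"z = {z}, P(Z <= z) = {p_below:.4f}, P(Z > z) = {p_above:.4f}")
```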

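kth Percentile: The value for which k% of observations are less than or equal to the value (e.g., a person at the 75th percentile of height is at least as tall as 75% of the population).

qnorm(k, μ, σ): Returns the percentile value of a normal distribution with mean μ and standard deviation σ. Here k is the percentile written as a proportion between 0 and 1 (e.g., 0.75 for the 75th percentile).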
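
Finding a percentile is the reverse of finding a left-tail area: start from the proportion and recover the value. In this Python sketch, NormalDist.inv_cdf plays the role of qnorm. The height distribution (μ = 170, σ = 10) is a made-up assumption.

```python
from statistics import NormalDist

# Hypothetical population of heights in centimetres.
heights = NormalDist(mu=170, sigma=10)

# 75th percentile: the height with 75% of the population at or below it.
p75 = heights.inv_cdf(0.75)

# Round trip: the area to the left of that value should be 0.75 again.
area_below = heights.cdf(p75)

print(f"75th percentile = {p75:.2f} cm, area below = {area_below:.2f}")
```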

Sampling Distributions

Standard Deviation of x̄: σ / √n. For n > 1, this is smaller than σ, the standard deviation of X.

Sampling Distribution: The distribution of values taken by the statistic in all possible samples of the same size from the same population.

Unbiased (x̄): A statistic used to estimate a parameter is unbiased if the mean of its sampling distribution is equal to the true value of the parameter.

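Biased: A statistic is biased if it systematically over- or underestimates the value of a parameter. For example, a sample variance computed with divisor n instead of n – 1 underestimates σ² on average.

Asymptotically Normally Distributed: The sampling distribution of a statistic becomes approximately normal when the sample size is sufficiently large. For example, the distribution of x̄ becomes approximately normal even if X is skewed.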
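
The ideas of sampling distribution and bias can be checked exactly on a tiny example by listing every possible sample. The Python sketch below assumes a made-up population {1, 2, 3, 4} and samples of size 2 drawn with replacement. It averages x̄, the variance with divisor n – 1, and the variance with divisor n over all 16 samples, then compares each average with the population value.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical population
n = 2                      # sample size

def mean(xs):
    return Fraction(sum(xs), len(xs))

def variance(xs, divisor):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / divisor

mu = mean(population)                            # population mean
sigma2 = variance(population, len(population))   # population variance

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_xbar = mean([mean(s) for s in samples])
avg_var_n_minus_1 = mean([variance(s, n - 1) for s in samples])
avg_var_n = mean([variance(s, n) for s in samples])

print("mu =", mu, "| mean of x-bar over all samples =", avg_xbar)
print("sigma^2 =", sigma2, "| mean of s^2 (divisor n-1) =", avg_var_n_minus_1)
print("mean of variance with divisor n =", avg_var_n)
```

In this setup the averages of x̄ and of the n – 1 version of the variance match μ and σ² exactly, while the divisor-n version comes out too small.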

Central Limit Theorem (CLT)

Draw a Simple Random Sample (SRS) of size n from any population with mean μ and standard deviation σ. When n is large enough (a common rule of thumb is n ≥ 30), the sampling distribution of the sample mean x̄ is approximately normal, with mean μ and standard deviation σ / √n.
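
A short simulation can illustrate the theorem. This Python sketch assumes a right-skewed population (exponential with mean 1 and standard deviation 1), draws 2000 samples of size 40, and compares the centre and spread of the resulting sample means with μ and σ / √n. The population, sample size, number of repetitions, and seed are all arbitrary choices made for illustration.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

n = 40       # sample size
reps = 2000  # number of simulated samples

# Population: exponential with rate 1, which is right-skewed with mean 1 and SD 1.
sample_means = []
for _ in range(reps):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

center = statistics.mean(sample_means)   # should be close to mu = 1
spread = statistics.stdev(sample_means)  # should be close to sigma / sqrt(n)
theoretical_spread = 1 / math.sqrt(n)

print(f"mean of sample means: {center:.3f} (mu = 1)")
print(f"SD of sample means:   {spread:.3f} (sigma / sqrt(n) = {theoretical_spread:.3f})")
```

A histogram of sample_means would look roughly bell-shaped even though the individual exponential values are strongly skewed.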

Sample Proportion

p̂ = x / n, where p̂ is the sample proportion of successes in an SRS drawn from a large population with population proportion p of success. Here, p̂ is the statistic used to estimate the population parameter p.

  • Mean of p̂: μ = p
  • Standard deviation of p̂: σ = √(p(1 – p) / n)
  • Z = (p̂ – p) / √(p(1 – p) / n)

This normal approximation is safe to use when np ≥ 10 and n(1 – p) ≥ 10.
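
The sample proportion formulas can be combined into one calculation. This Python sketch uses a hypothetical helper, proportion_z, and made-up numbers (60 successes in an SRS of 100 when p = 0.5). It checks the np and n(1 – p) conditions first, then computes p̂, its standard deviation, and the z-score.

```python
import math
from statistics import NormalDist

def proportion_z(x, n, p):
    """Return (p_hat, sd_of_p_hat, z) for x successes in n trials when the true proportion is p."""
    if n * p < 10 or n * (1 - p) < 10:
        raise ValueError("normal approximation not safe: need np >= 10 and n(1-p) >= 10")
    p_hat = x / n
    sd = math.sqrt(p * (1 - p) / n)
    z = (p_hat - p) / sd
    return p_hat, sd, z

# Hypothetical example values.
p_hat, sd, z = proportion_z(x=60, n=100, p=0.5)

# Approximate probability of a sample proportion at least this large.
p_upper = 1 - NormalDist().cdf(z)

print(f"p_hat = {p_hat}, SD of p_hat = {sd}, z = {z:.2f}")
print(f"P(p_hat >= {p_hat}) is approximately {p_upper:.4f}")
```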