Introduction to Statistics: A Comprehensive Guide

Posted on May 30, 2024 in Mathematics

Chapter 1: Introduction to Statistics

1. Population and Sample

Population: A collection of all units of interest.
Subject: An individual unit of a population.
Sample: An observed subset of the units of a population.

2. Parameter and Statistic

Parameter: A number that describes a population, which is usually unknown.
Statistic: A number that describes a sample. It must be computable from the sample, and therefore is known.

3. Statistical Inference

Statistical inference is the procedure of using a sample to learn about the population.
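
As a minimal sketch of the parameter/statistic distinction, the snippet below simulates a population, so its parameter can be computed for comparison. In practice the parameter would be unknown. The population values, the seed, and the variable names are all hypothetical.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated heights in cm (not real data).
rng = random.Random(0)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: describes the whole population; usually unknown in practice.
population_mean = statistics.fmean(population)

# Sample: an observed subset of the units of the population.
sample = rng.sample(population, 100)

# Statistic: computed from the sample alone, so it is always known.
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

Statistical inference uses sample_mean as an estimate of population_mean; the two are typically close but not equal.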

Chapter 2: Types of Variables

4. Types of Variables

Categorical
Quantitative
- Discrete
- Continuous

5. Graphical Summaries for Categorical Variables

Pie chart
Bar plot

6. Graphical Summaries for Quantitative Variables

Stem-and-leaf plot
Histogram
Box plot

7. Finding the Center and Spread of Quantitative Data from Graphical Summaries

(a) Stem-and-leaf plot and histogram

Center: The center is located at the longest stem(s) in the stem-and-leaf plot or the tallest rectangular bar(s) in the histogram.
Range: Begins at the lowest valued stem in the stem-and-leaf plot with nonzero length or the leftmost rectangular bar in the histogram with nonzero height; ends at the highest valued stem in the stem-and-leaf plot with nonzero length or the rightmost rectangular bar in the histogram with nonzero height.
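
The reading described above can be tried on a small text stem-and-leaf plot. This is a sketch that assumes non-negative two-digit integers (stem = tens digit, leaf = ones digit); the scores and the helper name stem_and_leaf are made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers (stem = tens, leaf = ones)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

# Hypothetical exam scores, made up for illustration.
scores = [52, 58, 61, 64, 67, 68, 70, 71, 73, 73, 75, 78, 79, 82, 85, 91]
plot = stem_and_leaf(scores)

# Print every stem between the lowest and highest, including empty ones.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")

# Center: the longest stem. Range: from the lowest to the highest nonempty stem.
longest_stem = max(plot, key=lambda s: len(plot[s]))
print("longest stem:", longest_stem)
print("stems run from", min(plot), "to", max(plot))
```

Here the longest stem is 7, so the center is in the 70s, and the data run from the 50s to the 90s.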

(b) Box plot

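Center: The center (the median) is located at the thick line in the box.
Range: Begins at the bottom of the lower whisker and ends at the top of the upper whisker. If outliers are drawn as separate points beyond the whiskers, the range extends to the most extreme of those points.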
Inter-quartile range (IQR): The height of the box.

8. Outliers

Outliers are observations that fall far from the bulk of the data.

9. Shape of a Distribution

(a) Number of modes

Unimodal: If the histogram of a distribution contains one peak.
Bimodal: If the histogram contains two peaks.
Uniform: If the histogram does not contain any obvious peaks.

(b) Skewness

Symmetric: If the histogram of a distribution is symmetric about the center.
Right-skewed: If the maximum lies farther from the center than the minimum does, i.e. the upper tail is longer.
Left-skewed: If the minimum lies farther from the center than the maximum does, i.e. the lower tail is longer.

10. Computing and Interpreting the Numerical Summaries of Quantitative Data

(a) Measures of center

Mean: x̄ = Σxi / n
Median: M = the middle value of the sorted observations (when the number of observations is odd); the average of the two middle values (when the number of observations is even).

Note: For a right-skewed distribution, the mean is usually greater than the median; for a left-skewed distribution, the mean is usually less than the median.
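
A quick check of the note above on a small right-skewed data set. The numbers are hypothetical and chosen so that a few large values stretch the upper tail.

```python
import statistics

# Hypothetical right-skewed data: most values are small, a few are large.
incomes = [22, 25, 27, 30, 31, 33, 36, 40, 95, 160]  # made-up values, in $1000s

mean = statistics.mean(incomes)
median = statistics.median(incomes)  # average of the two middle values (n is even)

print(f"mean = {mean}, median = {median}")
print("mean > median:", mean > median)
```

The two large values pull the mean well above the median, which stays with the bulk of the data.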

(b) Measures of spread

Standard deviation: s = √[ Σ(xi – x̄)^2 / (n – 1) ]
Range: maximum – minimum
Inter-quartile range (IQR): 3rd quartile – 1st quartile
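
The three measures of spread can be computed as sketched below, using made-up data. The standard deviation is computed both by the formula (dividing by n – 1) and with the standard library for comparison. Quartile conventions differ between textbooks and software; statistics.quantiles with its default "exclusive" method is only one of them, so the IQR here may differ slightly from a hand calculation.

```python
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical observations

n = len(data)
x_bar = sum(data) / n

# Standard deviation from the formula: divide the sum of squared deviations by n - 1.
s_manual = (sum((x - x_bar) ** 2 for x in data) / (n - 1)) ** 0.5
s_library = statistics.stdev(data)  # sample standard deviation

# Range: maximum - minimum.
data_range = max(data) - min(data)

# IQR: 3rd quartile - 1st quartile (default method="exclusive").
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"s (formula) = {s_manual:.4f}, s (library) = {s_library:.4f}")
print(f"range = {data_range}, IQR = {iqr}")
```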

11. A numerical summary of the data is resistant if extreme observations have little influence on its value.

Examples: The median and IQR are resistant, but the mean, range, and standard deviation are not.
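
To illustrate resistance, the sketch below replaces the largest value of a small hypothetical data set with an extreme one and compares each summary before and after. The helper name summarize is an assumption, and the quartiles use the default convention of statistics.quantiles.

```python
import statistics

def summarize(data):
    """Return the five numerical summaries discussed above as a dict."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }

original = [10, 12, 13, 14, 15, 16, 17, 18, 20]  # hypothetical data
with_outlier = original[:-1] + [200]              # largest value replaced by an extreme one

before = summarize(original)
after = summarize(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the median and IQR do not move at all, while the mean, standard deviation, and range change substantially.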

12. 3-standard-deviations rule: For approximately bell-shaped distributions, nearly all observations fall within 3 standard deviations of the mean (see also the 68-95-99.7 rule in Chapter 6).
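
The rule can be checked by simulation. This sketch draws bell-shaped data from a normal distribution (an assumption made for illustration; the seed and sample size are arbitrary) and counts the fraction of observations within 1, 2, and 3 standard deviations of the mean.

```python
import random
import statistics

rng = random.Random(42)
data = [rng.gauss(0, 1) for _ in range(100_000)]  # simulated bell-shaped data

x_bar = statistics.fmean(data)
s = statistics.stdev(data)

def fraction_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(k):.3f}")
```

The printed fractions should come out close to 0.68, 0.95, and 0.997.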

Chapter 4: Sampling and Experiments

13. Experiments and Observational Studies

Experiment: The researcher assigns the subjects to various experimental conditions (i.e. treatments) and then observes the response variable.
Observational study: The researcher observes both the explanatory variable(s) and the response variable; there is no assigning of treatments.

14. Causal relationships can only be established through experiments.

15. The sampling frame is the list of subjects in the population from which we draw the sample.

16. Simple random sampling is a method of sampling in which every possible sample of size n from the population is equally likely to be selected. It is the best way to obtain a sample that is representative of the population.
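
A minimal sketch of drawing a simple random sample from a sampling frame. The frame of 500 ID numbers and the seed are hypothetical. Python's random.sample selects without replacement, so no subject appears twice.

```python
import random

# Hypothetical sampling frame: ID numbers for 500 subjects.
sampling_frame = list(range(1, 501))

rng = random.Random(2024)  # fixed seed only so the example is reproducible

# Draw a simple random sample of size n = 20 without replacement.
sample = rng.sample(sampling_frame, k=20)

print(sorted(sample))
```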

17. Types of biases

Sampling bias: Using a sample that is not randomly drawn from the population.
Nonresponse bias: Subjects in the sample are unavailable or (intentionally or unintentionally) do not respond.
Response bias: Subjects in the sample give inaccurate responses, either because they are untruthful or because the questions are misleading.

18. A convenience sample is a sample that is easy to obtain. For example, a volunteer sample is a common form of convenience sample. Convenience samples are usually unrepresentative of the population.

Chapter 5: Probability

19. A random phenomenon is a situation in which the individual outcomes are uncertain but the long-run behavior is regular.

20. The probability of an event can be interpreted as its long-run proportion of occurring.
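
The long-run interpretation can be illustrated by simulation. This sketch assumes a fair six-sided die and tracks the proportion of rolls showing a six as the number of rolls grows; the seed and the function name proportion_of_sixes are arbitrary choices.

```python
import random

rng = random.Random(1)

def proportion_of_sixes(n_rolls):
    """Roll a simulated fair die n_rolls times; return the proportion of sixes."""
    rolls = (rng.randint(1, 6) for _ in range(n_rolls))
    return sum(r == 6 for r in rolls) / n_rolls

for n in (10, 100, 1_000, 100_000):
    print(f"{n:>7} rolls: proportion of sixes = {proportion_of_sixes(n):.4f}")

print(f"probability of a six = {1 / 6:.4f}")
```

Short runs can be far from 1/6, while long runs settle near it.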

21. A probability model contains a list of all possible outcomes and the probability of each outcome.

22. The sample space of a random phenomenon is the set of all possible outcomes.

23. The basic probability rules

0 ≤ P(A) ≤ 1
Total probability: P(S) = 1, where S is the sample space.
Complement rule: P(A^c) = 1 – P(A), where A^c is the complement of A.
General addition rule: P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
Partition of probability: P(A) = P(A ∩ B) + P(A ∩ B^c).
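
Each rule above can be verified exactly on a small example. This sketch assumes one roll of a fair die with equally likely outcomes, so that P(event) is the number of outcomes in the event divided by 6. The events A and B are arbitrary choices for illustration.

```python
from fractions import Fraction

S = frozenset(range(1, 7))  # sample space of one fair die roll

def P(event):
    """Probability of an event (a set of outcomes) under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {2, 4, 6}  # "the roll is even"
B = {4, 5, 6}  # "the roll is greater than 3"
A_c = S - A    # complement of A
B_c = S - B    # complement of B

assert 0 <= P(A) <= 1                                      # probabilities lie in [0, 1]
assert P(S) == 1                                           # the sample space has probability 1
assert P(A_c) == 1 - P(A)                                  # complement rule
assert P(A.union(B)) == P(A) + P(B) - P(A.intersection(B))           # general addition rule
assert P(A) == P(A.intersection(B)) + P(A.intersection(B_c))         # partition of probability

print("P(A) =", P(A), " P(B) =", P(B), " P(A or B) =", P(A.union(B)))
```

Using Fraction keeps the arithmetic exact, so the equalities hold without rounding error.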

24. Disjoint events

Events A and B are disjoint if A ∩ B is empty.
If A and B are disjoint, then P(A ∪ B) = P(A) + P(B) and P(A ∩ B) = 0.

25. Independent events: Events A and B are independent if P(A ∩ B) = P(A)P(B), that is, knowing that one occurs does not affect the probability that the other occurs.

26. The conditional probability of event A given event B, denoted P(A | B), is the probability that event A occurs given that event B has occurred. Provided that P(B) > 0, we have P(A | B) = P(A ∩ B) / P(B).

27. General multiplication rule: P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

28. Two events A and B are independent if any of the following equivalent conditions holds:

P(A ∩ B) = P(A)P(B).
P(A | B) = P(A).
P(B | A) = P(B).
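
Conditional probability, the multiplication rule, and the independence conditions can be checked on a small example. This sketch assumes two fair dice with 36 equally likely outcomes. The events A, B, C and the helper P_given are hypothetical names chosen for illustration.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes (die 1, die 2)

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))

def P_given(a, b):
    """Conditional probability P(a | b) = P(a and b) / P(b); requires P(b) > 0."""
    return P(a & b) / P(b)

A = {o for o in S if o[0] == 6}          # first die shows 6
B = {o for o in S if o[1] % 2 == 0}      # second die is even
C = {o for o in S if o[0] + o[1] >= 10}  # sum is at least 10

# A and B are independent: all three equivalent conditions hold.
print(P(A & B) == P(A) * P(B), P_given(A, B) == P(A), P_given(B, A) == P(B))

# A and C are not independent: knowing C occurred changes the probability of A.
print("P(A) =", P(A), " P(A | C) =", P_given(A, C))

# General multiplication rule holds either way.
print(P(A & C) == P_given(A, C) * P(C) == P_given(C, A) * P(A))
```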

Chapter 6: Random Variables

29. A random variable is the numerical outcome of a random phenomenon.

30. Types of random variables

Discrete: the possible values can be listed.
Continuous: the possible values form an interval.

31. Properties of discrete random variables

(a) Each possible outcome has a non-negative probability.
(b) The probabilities add up to 1.

32. Center and spread of a discrete random variable

(a) Mean: µ = Σ x P(x), which can be interpreted as the long-run average outcome, i.e. the expected average value of the outcomes if a very large sample is drawn.
(b) Standard deviation: σ = √[ Σ (x − µ)^2 P(x) ].

33. The probability distributions of continuous random variables are described by density curves, which have the following properties:

(a) Non-negative everywhere.
(b) For any numbers a and b such that a ≤ b, the probability P(a […] a), rewrite it as 1 − P(Z