Statistics Final Exam Study Guide: Key Concepts & Formulas

Histograms

Appropriate for quantitative data

  • X-axis is quantitative
  • Y-axis is the frequency of the data in the bin
  • Bins are a range of values for collecting data, should be the same size

Shape

  • Symmetric/Bell-shaped
  • Normal
  • Skewed
  • Uniform

Center

  • Mean – useful when data are symmetric
  • Median – useful when data are skewed/outliers
  • Mode – useful for categorical data
    • Unimodal
    • Bimodal

Spread

  • Range – easy to calculate, but determined entirely by the two extreme values, so rarely useful on its own
  • Interquartile range – easy to calculate, useful when data are skewed/outliers
  • Standard deviation – useful when data are symmetric
  • Outliers/Extreme values – look for gaps in the histogram

Boxplots

Appropriate for quantitative data

  • Visual representation of 5-number summary
  • Easy to construct, conveys information easily
  • Min, Q1, Med, Q3, Max
  • Identifies outliers
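The five-number summary and the common 1.5×IQR rule for flagging outliers can be sketched with Python's standard library (hypothetical data; note that textbooks and software differ slightly in how they compute quartiles):

```python
import statistics

# Hypothetical data set with one extreme value
data = [1, 3, 5, 7, 9, 11, 13, 15, 40]

q1, med, q3 = statistics.quantiles(data, n=4)  # the three quartiles
five_number = (min(data), q1, med, q3, max(data))
print(five_number)

# 1.5*IQR rule: values beyond the "fences" are flagged as outliers
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(outliers)  # [40]
```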

Descriptive Statistics

Mean

  • Uses all values in calculation
  • Sensitive to outliers/skewness
  • Moves in the direction of the skew
  • Always paired with standard deviation

Median

  • Uses the positions (order) of the data values, not their magnitudes
  • Resistant to outliers/skewness
  • Always paired with interquartile range (IQR)

Mode

  • Only appropriate for categorical data
  • Not paired with a measure of spread

Standard Deviation

  • Measure of variability in the data
  • Uses all values in calculation
  • Sensitive to outliers/skewness
  • Increases with skew
  • Always a positive value

Interquartile Range (IQR)

  • Measure of variability in the data
  • Uses positions in calculation
  • Resistant to outliers/skewness
  • Q3-Q1
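A quick sketch of how these measures respond to an outlier, using hypothetical data and Python's standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [50]   # one extreme value added

# Mean uses every value, so the outlier drags it upward
print(statistics.mean(data), statistics.mean(with_outlier))      # 5.0 10.0

# Median uses position, so it barely moves
print(statistics.median(data), statistics.median(with_outlier))  # 4.5 5

# Standard deviation also uses every value and inflates with the outlier
print(statistics.stdev(data) < statistics.stdev(with_outlier))   # True
```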

Research Designs

Observational Studies

Random Selection

  • Avoids bias and makes sample representative of population
  • Simple Random Sample – name out of a hat
  • Stratified Sample – organized by a similar trait
  • Cluster Sample – organized by location: classroom, neighborhood, etc.
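A minimal sketch of simple random and stratified sampling (the roster names and strata here are hypothetical):

```python
import random

random.seed(1)  # fixed seed for reproducibility
population = [f"student_{i}" for i in range(1, 101)]  # hypothetical roster

# Simple random sample: like drawing 10 names out of a hat
srs = random.sample(population, 10)

# Stratified sample: split by a shared trait, then sample within each stratum
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [name for group in strata.values()
              for name in random.sample(group, 5)]

print(len(srs), len(stratified))  # 10 10
```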

Observe what is naturally occurring

  • No manipulation
  • Survey for attitudes, beliefs
  • Measurements for physical, mental, skills

Association

  • Two variables can be associated in an observational study
  • Correlation does not prove causation

Experimental Design

Random Assignment

  • The intent is to create groups that are similar for comparison

Change something

  • Manipulate one variable – treatment group
  • Do nothing to another group – control group
    • Blind the control group with an inert treatment so they don’t know nothing is happening – placebo
    • When the control group experiences some change anyway – placebo effect

Causation

  • Causation can be established through a well-designed experiment
  • Only generalizable to the same population as the volunteers

Ethical Questions

  • You are asking/forcing participants to do something they wouldn’t normally do
  • It is unethical to have participants do something harmful
    • Tuskegee Syphilis Study
    • Stanford Prison Experiment
    • Watson’s “Little Albert” fear-conditioning experiment
  • An IRB (Institutional Review Board) protects against unethical experiments

Distributions

Normal Distributions

  • Used with proportions and means when population standard deviation (σ) is known
  • Unimodal
  • Symmetric
    • Mean=Median=Mode
  • Area under the curve = 1
    • This means we can calculate percentages, proportions, and probabilities for a given range of values
  • The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations

Standard Normal Distribution

  • Center = 0
  • Standard Deviation = 1
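Any normal value can be put on this standard scale with a z-score, z = (x − μ)/σ. A tiny sketch, using a hypothetical exam with mean 70 and standard deviation 8:

```python
# Hypothetical exam scores: mean 70, standard deviation 8
mu, sigma = 70, 8

def z_score(x):
    # Distance from the mean, measured in standard deviations
    return (x - mu) / sigma

print(z_score(82))  # 1.5 (1.5 standard deviations above the mean)
print(z_score(70))  # 0.0 (exactly at the mean)
```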

Empirical Rule

  • 68% of all values are within 1 standard deviation of the mean
  • 95% of all values are within 2 standard deviations of the mean
  • 99.7% of all values are within 3 standard deviations of the mean
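These percentages can be checked from the normal curve itself, since P(−k < Z < k) = erf(k/√2). A sketch using only the standard library:

```python
import math

def area_within(k):
    # Area under the standard normal curve between -k and +k
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k), 3))
# 1 0.683
# 2 0.954
# 3 0.997
```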

Student t Distributions

  • Used with means when σ is unknown
  • Unimodal
  • Symmetric
    • Mean=Median=Mode
  • Area under the curve = 1
    • This means we can calculate percentages, proportions, and probabilities for a given range of values
  • The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations
  • NO Standard t Distribution
    • Family of curves based on degrees of freedom (d.f.)
      • As the d.f. approaches infinity it becomes the Normal Distribution
    • Center = 0
    • Standard deviation is greater than 1 (it approaches 1 as d.f. increases)
  • NO Empirical Rule
    • The area in the tails is higher than in the Normal Distribution
    • It changes with every d.f.
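The heavier tails can be seen by simulation: a t value with d.f. = k can be generated as Z divided by √(χ²ₖ/k), where χ²ₖ is a sum of k squared standard normals. This sketch (fixed seed, hypothetical choice of d.f. = 5) compares the area beyond ±2:

```python
import math
import random

random.seed(42)
DF, N = 5, 100_000

def t_draw(df):
    # t = Z / sqrt(chi-square_df / df)
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

t_tail = sum(abs(t_draw(DF)) > 2 for _ in range(N)) / N
z_tail = sum(abs(random.gauss(0, 1)) > 2 for _ in range(N)) / N
print(t_tail > z_tail)  # True: the t distribution has more area in its tails
```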

Sampling Distributions

  • A sampling distribution is the distribution of a statistic (e.g., the sample mean or sample proportion) computed from every possible sample of a given size

What does the Central Limit Theorem tell us about Sampling Distributions?

  • The center of the distribution is the same as the center of the population
  • The amount of variability decreases as sample size increases
    • An individual varies the most
    • When we use the population proportion or population standard deviation, it is called the standard deviation of the sampling distribution
      • s.d.(p-hat) or s.d.(x-bar)
    • When we use the sample proportion or sample standard deviation, it is called the standard error. The different name helps keep the two straight, BUT it still measures the variability of the samples
      • s.e.(p-hat) or s.e.(x-bar)
  • Even if the population trait is not normally distributed, the sampling distribution will be if the conditions are met.
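The Central Limit Theorem can be seen in a small simulation: sample means drawn from a skewed (exponential) population still center on the population mean, with variability close to σ/√n. A sketch with a fixed seed and hypothetical parameters (population mean and standard deviation both equal to 1):

```python
import random
import statistics

random.seed(0)
n, reps = 40, 5000   # sample size and number of repeated samples

# Exponential population: strongly skewed, mean = 1, standard deviation = 1
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 2))  # close to the population mean, 1.0
print(round(statistics.stdev(means), 2))  # close to 1/sqrt(40), about 0.16
```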

When you use proportions or the σ is known

  • Use Normal Distribution

When the population standard deviation is not known

  • Use Student t Distribution

Linear Regression

  • Only for two variables that are quantitative
    • Explanatory variable is on the x-axis
      • It is the one we think explains changes in the response
    • Response variable is on the y-axis
      • It is the one we think responds to changes in the explanatory
  • Appropriate graph/display is a scatterplot
  • The slope measures how much the response changes per one-unit increase in the explanatory variable
  • The intercept is the predicted response when the explanatory variable is zero
  • R² measures the proportion of variability in the response explained by changes in the explanatory variable
    • Value from 0 – 1
    • Think of like a %
  • The line of best fit, the linear regression, is best because it minimizes the error
    • Smallest Sum of Squared Errors (SSE)
    • The error is a residual
      • The residual is the difference between the actual and predicted
      • Residual = actual – predicted
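The slope, intercept, and R² can be computed by hand straight from these definitions (hypothetical data):

```python
# Least-squares fit by hand (hypothetical data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Slope: covariance-style sum over the x spread
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual = actual - predicted; SSE is what least squares minimizes
predicted = [intercept + slope * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst

print(round(slope, 2), round(intercept, 2), round(r_squared, 2))  # 0.6 2.2 0.6
```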

Inferential Statistics

Confidence Intervals

  • Useful when you don’t have a population estimate
  • Because of the Central Limit Theorem, samples are the best estimate
  • But samples vary, so there is a margin of error
    • This creates the lower endpoint and upper endpoint
    • m.e. = critical value (based on confidence) x s.d. (based on sample size)
    • Decreasing confidence decreases the interval, makes it smaller
      • Because the critical value gets smaller
    • Increasing sample size decreases the interval, makes it smaller
      • Because the s.d.(p-hat) or the s.d.(x-bar) gets smaller
  • 95% confidence means that about 95 out of 100 intervals constructed this way will contain the true population parameter
    • All values inside the interval are plausible values for the parameter
    • Values outside the interval are unlikely
  • Supports a claim if it is within the interval
    • This means it can be used like a hypothesis test
    • Better, because it gives a range (effect size) of how much different the value is from the claim
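The margin-of-error formula above in action for a proportion, using the Normal critical value 1.96 for 95% confidence (hypothetical survey: 540 "yes" out of 1,000):

```python
import math

n, yes = 1000, 540            # hypothetical survey results
p_hat = yes / n
z_star = 1.96                 # critical value for 95% confidence

sd_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # shrinks as n grows
margin_of_error = z_star * sd_p_hat
lower, upper = p_hat - margin_of_error, p_hat + margin_of_error

print(round(lower, 3), round(upper, 3))  # 0.509 0.571
```

Increasing n shrinks sd(p-hat) and therefore the interval; lowering the confidence level shrinks the critical value and does the same.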

Hypothesis Test

  • Useful when you do have a population estimate
  • You are testing a claim that the value/trait of the sample is significantly different
  • Significantly different – the value is farther away from the expected by more than random variation
  • The process tests a sample (evidence) against the null hypothesis (nothing changed, no difference)
  • If the P-value is small, results like ours would be unlikely if the null hypothesis were true
  • Therefore, we reject the null hypothesis in favor of the alternative (something did happen)
  • If this is a mistake – we reject the null when it is actually true – we call it a Type I error
  • If the P-value is large, results like ours would be quite plausible if the null hypothesis were true
  • Therefore, we do not reject (fail to reject) the null hypothesis. There is no evidence that anything happened
  • If this is a mistake – we fail to reject the null when it is actually false – we call it a Type II error
  • How large or small depends on your level of significance (α). This is the cut-off between random variation and “something happened”
  • Alpha (α) is exactly the probability of making a Type I error
  • Alpha (α) is inversely linked to the probability of making a Type II error
    • Increasing α decreases the probability of making a Type II error
  • A confidence interval can be used to test a hypothesis
  • If the null hypothesis value is in the confidence interval, then do not reject
  • If the null hypothesis value is NOT in the confidence interval, then reject
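The whole process for one proportion, sketched with the standard library (hypothetical claim p = 0.5, observed 540 "yes" out of 1,000 – the same hypothetical numbers as a survey example above):

```python
import math

n, yes, p0 = 1000, 540, 0.5   # hypothetical data; null hypothesis: p = 0.5
alpha = 0.05                  # level of significance

p_hat = yes / n
sd_null = math.sqrt(p0 * (1 - p0) / n)   # s.d.(p-hat) assuming the null is true
z = (p_hat - p0) / sd_null

def normal_cdf(x):
    # Standard normal cumulative probability via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided test

print(round(z, 2), round(p_value, 3), p_value < alpha)  # 2.53 0.011 True
```

Since p_value < α, we reject the null: a sample proportion of 0.54 is farther from 0.5 than random variation would plausibly produce.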