Statistics Final Exam Study Guide: Key Concepts & Formulas
Histograms
Appropriate for quantitative data
- X-axis is quantitative
- Y-axis is the frequency of the data in the bin
- Bins are ranges of values for grouping the data; they should all be the same width (see the sketch below)
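To make the binning concrete, here is a minimal Python sketch (numpy assumed, data hypothetical) that groups quantitative values into equal-width bins and reports the frequency in each:

```python
import numpy as np

# Hypothetical exam scores (quantitative data)
scores = [62, 71, 74, 75, 78, 80, 81, 83, 85, 85, 88, 90, 92, 95, 99]

# Equal-width bins: numpy chooses edges so every bin spans the same range
counts, edges = np.histogram(scores, bins=4)

for count, lo, hi in zip(counts, edges[:-1], edges[1:]):
    # x-axis: the bin's range of values; y-axis: frequency of data in the bin
    print(f"[{lo:5.2f}, {hi:5.2f}): {count}")
```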
Shape
- Symmetric/Bell-shaped
- Normal
- Skewed
- Uniform
Center
- Mean – useful when data are symmetric
- Median – useful when data are skewed/outliers
- Mode – useful for categorical data
- Unimodal
- Bimodal
Spread
- Range – easy to calculate, but not very useful because it depends only on the two extreme values
- Interquartile range – easy to calculate, useful when data are skewed/outliers
- Standard deviation – useful when data are symmetric
- Outliers/Extreme values – often show up as gaps in the histogram
Boxplots
Appropriate for quantitative data
- Visual representation of 5-number summary
- Easy to construct, conveys information easily
- Min, Q1, Med, Q3, Max
- Identifies outliers
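A quick sketch of the 5-number summary and the common 1.5×IQR outlier rule (Python with numpy assumed; quartile conventions differ slightly across texts and software):

```python
import numpy as np

data = np.array([3, 5, 7, 8, 9, 11, 12, 13, 14, 18, 35])  # hypothetical values

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond the "fences" at 1.5*IQR from the quartiles are flagged as outliers
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Min, Q1, Med, Q3, Max:", data.min(), q1, med, q3, data.max())
print("Outliers:", outliers)  # 35 lies above the upper fence
```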
Descriptive Statistics
Mean
- Uses all values in calculation
- Sensitive to outliers/skewness
- Moves in the direction of the skew
- Always paired with standard deviation
Median
- Uses the positions of the data, not the values themselves
- Resistant to outliers/skewness
- Always paired with interquartile range (IQR)
Mode
- Only appropriate for categorical data
- Not paired with a measure of spread
Standard Deviation
- Measure of variability in the data
- Uses all values in calculation
- Sensitive to outliers/skewness
- Increases with skew
- Always a positive value
Interquartile Range (IQR)
- Measure of variability in the data
- Uses positions in calculation
- Resistant to outliers/skewness
- Q3-Q1
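The sketch below (standard-library Python, hypothetical data) shows why the mean and standard deviation are sensitive to an outlier while the median and IQR resist it:

```python
import statistics as st

data = [10, 12, 13, 14, 15, 16, 18]
skewed = data + [90]  # one extreme value added

for label, d in [("original", data), ("with outlier", skewed)]:
    q1, _, q3 = st.quantiles(d, n=4)  # quartile method varies by convention
    print(f"{label:12s} mean={st.mean(d):5.1f}"   # pulled toward the outlier
          f" median={st.median(d):5.1f}"          # barely moves
          f" sd={st.stdev(d):5.1f}"               # inflated by the outlier
          f" IQR={q3 - q1:5.1f}")                 # resistant
```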
Research Designs
Observational Studies
Random Selection
- Avoids bias and makes the sample representative of the population
- Simple Random Sample – name out of a hat
- Stratified Sample – population split into groups (strata) by a shared trait, then sampled from each
- Cluster Sample – organized by location: classroom, neighborhood, etc.
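A minimal sketch of two of these sampling schemes using only Python's standard library (the roster and strata are hypothetical):

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible
roster = [f"student{i}" for i in range(1, 101)]  # hypothetical population

# Simple random sample: every name is equally likely, like names out of a hat
srs = random.sample(roster, 10)

# Stratified sample: split by a shared trait, then sample within each stratum
strata = {"first-year": roster[:50], "senior": roster[50:]}
stratified = [name for group in strata.values()
              for name in random.sample(group, 5)]

print(srs)
print(stratified)
```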
Observe what is naturally occurring
- No manipulation
- Survey for attitudes, beliefs
- Measurements for physical, mental, skills
Association
- Two variables can be associated in an observational study
- Correlation does not prove causation
Experimental Design
Random Assignment
- The intent is to create groups that are similar for comparison
Change something
- Manipulate one variable – treatment group
- Do nothing to another group – control group
- Keep the control group blind to the fact that nothing is happening by giving a fake treatment – placebo
- The control group may still experience some change – placebo effect
Causation
- Causation can be established through a well-designed experiment
- Only generalizable to the same population as the volunteers
Ethical Questions
- You are asking/forcing participants to do something they wouldn’t normally do
- It is unethical to have participants do something harmful
- Tuskegee Experiment
- Prisoner Experiments
- Watson’s “Little Albert” fear experiment
- IRB (Institutional Review Board) protects against unethical experiments
Distributions
Normal Distributions
- Used with proportions and means when population standard deviation (σ) is known
- Unimodal
- Symmetric
- Mean=Median=Mode
- Area under the curve = 1
- This means we can calculate percentages, proportions, and probabilities for a given range of values
- The values mathematically extend out to infinity, but practically there is essentially no area under the curve beyond 10 standard deviations
Standard Normal Distribution
- Center = 0
- Standard Deviation = 1
Empirical Rule
- 68% of all values are within ±1 standard deviation
- 95% of all values are within ±2 standard deviations
- 99.7% of all values are within ±3 standard deviations
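The rule can be checked directly from the Standard Normal's area. A small sketch using only Python's math module (the 68/95/99.7 figures are roundings of 68.3/95.4/99.7):

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """P(Z <= z) for the Standard Normal (center 0, standard deviation 1)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Area under the curve within +/- k standard deviations of the mean
for k in (1, 2, 3):
    area = std_normal_cdf(k) - std_normal_cdf(-k)
    print(f"within +/-{k} sd: {area:.3f}")  # ~0.683, ~0.954, ~0.997
```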
Student t Distributions
- Used with means when σ is unknown
- Unimodal
- Symmetric
- Mean=Median=Mode
- Area under the curve = 1
- This means we can calculate percentages, proportions, and probabilities for a given range of values
- The values mathematically extend out to infinity, but practically there is essentially no area under the curve beyond 10 standard deviations
- NO single Standard t Distribution – a family of curves based on degrees of freedom (d.f.)
- As the d.f. approaches infinity, the curve approaches the Standard Normal Distribution (center = 0, standard deviation = 1)
- NO Empirical Rule
- The area in the tails is higher than in the Normal Distribution
- It changes with every d.f.
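To see the heavier tails, compare the upper-tail area beyond 2 for the Normal and for t curves at several d.f. (a sketch assuming scipy is available):

```python
from scipy import stats

# Upper-tail area beyond 2 (no Empirical Rule for t: it changes with d.f.)
print(f"Normal tail beyond 2: {1 - stats.norm.cdf(2):.4f}")
for df in (3, 10, 30, 100):
    tail = 1 - stats.t.cdf(2, df)
    # More area in the tail at small d.f.; approaches the Normal as d.f. grows
    print(f"t (d.f.={df:3d}) tail beyond 2: {tail:.4f}")
```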
Sampling Distributions
- A sampling distribution is the distribution (histogram) of a statistic, such as the sample mean or sample proportion, computed from all possible samples of that size
What does the Central Limit Theorem tell us about Sampling Distributions?
- The center of the distribution is the same as the center of the population
- The amount of variability decreases as sample size increases
- An individual observation (a sample of size 1) varies the most
- When the formula uses the population proportion or population standard deviation, it is called the standard deviation of the sampling distribution
- s.d.(p-hat) or s.d.(x-bar)
- When the formula uses the sample proportion or sample standard deviation instead, it is called the standard error. The different name helps keep the two straight, BUT it still measures the variability of the sample statistic
- s.e.(p-hat) or s.e.(x-bar)
- Even if the population trait is not normally distributed, the sampling distribution will be if the conditions are met.
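A simulation sketch (standard-library Python, hypothetical skewed population) illustrating the Central Limit Theorem: the center of the x-bar distribution stays put while its spread shrinks as n grows:

```python
import random
import statistics as st

random.seed(42)  # reproducible for illustration

# A right-skewed population (exponential, mean ~1.0)
population = [random.expovariate(1.0) for _ in range(50_000)]

for n in (2, 10, 50):
    # Sampling distribution of x-bar: means of many samples of size n
    xbars = [st.mean(random.sample(population, n)) for _ in range(2_000)]
    print(f"n={n:3d}  center={st.mean(xbars):.3f}  s.d.(x-bar)={st.stdev(xbars):.3f}")
    # Center stays near the population mean; spread falls roughly like 1/sqrt(n)
```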
When you use proportions or the σ is known
- Use Normal Distribution
When the population standard deviation is not known
- Use Student t Distribution
Linear Regression
- Only for two variables that are quantitative
- Explanatory variable is on the x-axis
- It is the one we think explains changes in the response
- Response variable is on the y-axis
- It is the one we think responds to changes in the explanatory
- Appropriate graph/display is a scatterplot
- The slope measures the change in the response per one-unit increase in the explanatory variable
- The intercept is the predicted response when the explanatory variable equals zero
- R² measures the proportion of variability in the response explained by changes in the explanatory variable
- Value from 0 – 1
- Think of it like a %
- The line of best fit, the linear regression line, is best because it minimizes the error
- Smallest Sum of Squared Errors (SSE)
- The error is a residual
- The residual is the difference between the actual and predicted
- Residual = actual – predicted
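A least-squares sketch with numpy and hypothetical data, showing the slope, intercept, residuals, SSE, and R² described above:

```python
import numpy as np

# Hypothetical data: hours studied (explanatory, x) vs. exam score (response, y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 64, 70, 71, 78, 83], dtype=float)

# Line of best fit: the slope/intercept pair that minimizes the SSE
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x
residuals = y - predicted            # residual = actual - predicted
sse = np.sum(residuals ** 2)         # smallest possible sum of squared errors

# R^2: share of the variability in the response explained by the explanatory
r_squared = 1 - sse / np.sum((y - y.mean()) ** 2)

print(f"score-hat = {intercept:.1f} + {slope:.1f} * hours")
print(f"SSE = {sse:.1f}, R^2 = {r_squared:.3f}")
```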
Inferential Statistics
Confidence Intervals
- Useful when you don’t have a value for the population parameter
- Because of the Central Limit Theorem, the sample statistic is the best estimate of the parameter
- But samples vary, so there is a margin of error
- This creates the lower endpoint and upper endpoint
- m.e. = critical value (based on confidence) x s.d. (based on sample size)
- Decreasing confidence decreases the interval, makes it smaller
- Because the critical value gets smaller
- Increasing sample size decreases the interval, makes it smaller
- Because the s.d.(p-hat) or the s.d.(x-bar) gets smaller
- 95% confidence means that about 95 out of 100 intervals built this way will contain the true population parameter
- All values inside the interval are plausible values for the parameter
- Values outside the interval are unlikely
- The interval supports a claim if the claimed value is within the interval
- This means it can be used like a hypothesis test
- Better, because it gives a range (effect size) showing how different the value is from the claim
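A sketch of a t confidence interval for a mean (scipy assumed for the critical value; data hypothetical). With σ unknown, the s.d.(x-bar) in the margin-of-error formula is estimated by the standard error:

```python
import statistics as st
from scipy import stats

sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.4, 5.0]  # hypothetical measurements
n = len(sample)
xbar, s = st.mean(sample), st.stdev(sample)

conf = 0.95
t_star = stats.t.ppf((1 + conf) / 2, df=n - 1)  # critical value (based on confidence)
se = s / n ** 0.5                               # s.e.(x-bar), shrinks as n grows
me = t_star * se                                # m.e. = critical value x s.e.

print(f"{conf:.0%} CI for the mean: ({xbar - me:.2f}, {xbar + me:.2f})")
```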
Hypothesis Test
- Useful when you do have a claimed value for the population parameter
- You are testing a claim that the value/trait of the sample is significantly different
- Significantly different – the value is farther from what is expected than random variation alone would explain
- The process tests a sample (evidence) against the null hypothesis (nothing changed, no difference)
- If the P-value is small, there is a small chance of getting results like these when nothing happened
- Therefore, we reject the null hypothesis in favor of the alternative (something did happen)
- If this is a mistake (rejecting the null when it is actually true), we call it a Type I error
- If the P-value is large, there is a large chance of getting results like these when nothing happened
- Therefore, we do not reject (fail to reject) the null hypothesis; probably nothing happened
- If this is a mistake (failing to reject the null when it is actually false), we call it a Type II error
- How large or small depends on your level of significance (α). This becomes the cut-off between random variation and “something happened”
- Alpha (α) is directly related to the probability of making a Type I error
- α is exactly the probability of making a Type I error
- Alpha (α) is inversely linked to the probability of making a Type II error
- Increasing α decreases your probability of making a Type II error
- A confidence interval can be used to test a hypothesis
- If the null hypothesis value is in the confidence interval, then do not reject
- If the null hypothesis value is NOT in the confidence interval, then reject
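Finally, a sketch of a two-sided one-sample t test (scipy assumed, data reused from the interval sketch above). The null value 5.5 falls outside that 95% interval, and the test rejects at α = 0.05, illustrating the duality:

```python
from scipy import stats

sample = [4.2, 5.1, 4.8, 5.6, 4.9, 5.3, 4.4, 5.0]  # hypothetical data
mu_0 = 5.5  # the claimed population mean (null hypothesis: nothing changed)

# Two-sided test: is the sample mean significantly different from mu_0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

alpha = 0.05  # level of significance, the cut-off for "something happened"
print(f"t = {t_stat:.2f}, P-value = {p_value:.3f}")
if p_value < alpha:
    print("Small P-value: reject the null (something did happen)")
else:
    print("Large P-value: fail to reject the null")
```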