Statistics Final Exam Study Guide: Key Concepts & Formulas

Histograms

Appropriate for quantitative data

  • X-axis is quantitative
  • Y-axis is the frequency of the data in the bin
  • Bins are a range of values for collecting data, should be the same size

Shape

  • Symmetric/Bell-shaped
  • Normal
  • Skewed
  • Uniform

Center

  • Mean – useful when data are symmetric
  • Median – useful when data are skewed/outliers
  • Mode – useful for categorical data
    • Unimodal
    • Bimodal

Spread

  • Range – easy to calculate, but determined entirely by the two extreme values, so rarely useful on its own
  • Interquartile range – easy to calculate, useful when data are skewed/outliers
  • Standard deviation – useful when data are symmetric
  • Outliers/Extreme values – look for gaps in the histogram

Boxplots

Appropriate for quantitative data

  • Visual representation of 5-number summary
  • Easy to construct, conveys information easily
  • Min, Q1, Med, Q3, Max
  • Identifies outliers
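The five-number summary and the common 1.5×IQR rule for flagging outliers can be sketched with Python's standard library (hypothetical data; note that textbooks and software differ slightly in how they compute quartiles):

```python
import statistics

# Hypothetical data set with one extreme value
data = [1, 3, 5, 7, 9, 11, 13, 15, 40]

q1, med, q3 = statistics.quantiles(data, n=4)  # the three quartiles
five_number = (min(data), q1, med, q3, max(data))
print(five_number)

# 1.5*IQR rule: values beyond the "fences" are flagged as outliers
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(outliers)  # [40]
```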

Descriptive Statistics

Mean

  • Uses all values in calculation
  • Sensitive to outliers/skewness
  • Moves in the direction of the skew
  • Always paired with standard deviation

Median

  • Uses the positions (order) of the data values, not their magnitudes
  • Resistant to outliers/skewness
  • Always paired with interquartile range (IQR)

Mode

  • Only appropriate for categorical data
  • Not paired with a measure of spread

Standard Deviation

  • Measure of variability in the data
  • Uses all values in calculation
  • Sensitive to outliers/skewness
  • Increases with skew
  • Always a positive value

Interquartile Range (IQR)

  • Measure of variability in the data
  • Uses positions in calculation
  • Resistant to outliers/skewness
  • Q3-Q1
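A quick sketch of how these measures respond to an outlier, using hypothetical data and Python's standard library:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [50]   # one extreme value added

# Mean uses every value, so the outlier drags it upward
print(statistics.mean(data), statistics.mean(with_outlier))      # 5.0 10.0

# Median uses position, so it barely moves
print(statistics.median(data), statistics.median(with_outlier))  # 4.5 5

# Standard deviation also uses every value and inflates with the outlier
print(statistics.stdev(data) < statistics.stdev(with_outlier))   # True
```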

Research Designs

Observational Studies

Random Selection

  • Avoids bias and makes sample representative of population
  • Simple Random Sample – name out of a hat
  • Stratified Sample – organized by a similar trait
  • Cluster Sample – organized by location: classroom, neighborhood, etc.
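A minimal sketch of simple random and stratified sampling (the roster names and strata here are hypothetical):

```python
import random

random.seed(1)  # fixed seed for reproducibility
population = [f"student_{i}" for i in range(1, 101)]  # hypothetical roster

# Simple random sample: like drawing 10 names out of a hat
srs = random.sample(population, 10)

# Stratified sample: split by a shared trait, then sample within each stratum
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [name for group in strata.values()
              for name in random.sample(group, 5)]

print(len(srs), len(stratified))  # 10 10
```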

Observe what is naturally occurring

  • No manipulation
  • Survey for attitudes, beliefs
  • Measurements for physical, mental, skills

Association

  • Two variables can be associated in an observational study
  • Correlation does not prove causation

Experimental Design

Random Assignment

  • The intent is to create groups that are similar for comparison

Change something

  • Manipulate one variable – treatment group
  • Do nothing to another group – control group
    • Blind the control group with an inert treatment so they don’t know nothing is happening – placebo
    • When the control group experiences some change anyway – placebo effect

Causation

  • Causation can be established through a well-designed experiment
  • Only generalizable to the same population as the volunteers

Ethical Questions

  • You are asking/forcing participants to do something they wouldn’t normally do
  • It is unethical to have participants do something harmful
    • Tuskegee Syphilis Study
    • Stanford Prison Experiment
    • Watson’s “Little Albert” fear-conditioning experiment
  • An IRB (Institutional Review Board) protects against unethical experiments

Distributions

Normal Distributions

  • Used with proportions and means when population standard deviation (σ) is known
  • Unimodal
  • Symmetric
    • Mean=Median=Mode
  • Area under the curve = 1
    • This means we can calculate percentages, proportions, and probabilities for a given range of values
  • The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations

Standard Normal Distribution

  • Center = 0
  • Standard Deviation = 1
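Any normal value can be put on this standard scale with a z-score, z = (x − μ)/σ. A tiny sketch, using a hypothetical exam with mean 70 and standard deviation 8:

```python
# Hypothetical exam scores: mean 70, standard deviation 8
mu, sigma = 70, 8

def z_score(x):
    # Distance from the mean, measured in standard deviations
    return (x - mu) / sigma

print(z_score(82))  # 1.5 (1.5 standard deviations above the mean)
print(z_score(70))  # 0.0 (exactly at the mean)
```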

Empirical Rule

  • 68% of all values are within 1 standard deviation of the mean
  • 95% of all values are within 2 standard deviations of the mean
  • 99.7% of all values are within 3 standard deviations of the mean
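These percentages can be checked from the normal curve itself, since P(−k < Z < k) = erf(k/√2). A sketch using only the standard library:

```python
import math

def area_within(k):
    # Area under the standard normal curve between -k and +k
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(area_within(k), 3))
# 1 0.683
# 2 0.954
# 3 0.997
```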

Student t Distributions

  • Used with means when σ is unknown
  • Unimodal
  • Symmetric
    • Mean=Median=Mode
  • Area under the curve = 1
    • This means we can calculate percentages, proportions, and probabilities for a given range of values
  • The values mathematically extend out to infinity, but practically – there is no area under the curve after 10 standard deviations
  • NO Standard t Distribution
    • Family of curves based on degrees of freedom (d.f.)
      • As the d.f. approaches infinity it becomes the Normal Distribution
    • Center = 0
    • Standard deviation is greater than 1 (it approaches 1 as d.f. increases)
  • NO Empirical Rule
    • The area in the tails is higher than in the Normal Distribution
    • It changes with every d.f.
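The heavier tails can be seen by simulation: a t value with d.f. = k can be generated as Z divided by √(χ²ₖ/k), where χ²ₖ is a sum of k squared standard normals. This sketch (fixed seed, hypothetical choice of d.f. = 5) compares the area beyond ±2:

```python
import math
import random

random.seed(42)
DF, N = 5, 100_000

def t_draw(df):
    # t = Z / sqrt(chi-square_df / df)
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return z / math.sqrt(chi2 / df)

t_tail = sum(abs(t_draw(DF)) > 2 for _ in range(N)) / N
z_tail = sum(abs(random.gauss(0, 1)) > 2 for _ in range(N)) / N
print(t_tail > z_tail)  # True: the t distribution has more area in its tails
```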

Sampling Distributions

  • A sampling distribution is the distribution of a statistic (e.g., the sample mean or sample proportion) computed from every possible sample of a given size

What does the Central Limit Theorem tell us about Sampling Distributions?

  • The center of the distribution is the same as the center of the population
  • The amount of variability decreases as sample size increases
    • An individual varies the most
    • When we use the population proportion or population standard deviation, it is called the standard deviation of the sampling distribution
      • s.d.(p-hat) or s.d.(x-bar)
    • When we use the sample proportion or sample standard deviation, it is called the standard error. The different name helps keep the two straight, BUT it still measures the variability of the samples
      • s.e.(p-hat) or s.e.(x-bar)
  • Even if the population trait is not normally distributed, the sampling distribution will be if the conditions are met.
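The Central Limit Theorem can be seen in a small simulation: sample means drawn from a skewed (exponential) population still center on the population mean, with variability close to σ/√n. A sketch with a fixed seed and hypothetical parameters (population mean and standard deviation both equal to 1):

```python
import random
import statistics

random.seed(0)
n, reps = 40, 5000   # sample size and number of repeated samples

# Exponential population: strongly skewed, mean = 1, standard deviation = 1
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.fmean(means), 2))  # close to the population mean, 1.0
print(round(statistics.stdev(means), 2))  # close to 1/sqrt(40), about 0.16
```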

When you use proportions or the σ is known

  • Use Normal Distribution

When the population standard deviation is not known

  • Use Student t Distribution

Linear Regression

  • Only for two variables that are quantitative
    • Explanatory variable is on the x-axis
      • It is the one we think explains changes in the response
    • Response variable is on the y-axis
      • It is the one we think responds to changes in the explanatory
  • Appropriate graph/display is a scatterplot
  • The slope measures how much the response changes per one-unit increase in the explanatory variable
  • The intercept is the predicted response when the explanatory variable is zero
  • R² measures the proportion of variability in the response explained by changes in the explanatory variable
    • Value from 0 – 1
    • Think of like a %
  • The line of best fit, the linear regression, is best because it minimizes the error
    • Smallest Sum of Squared Errors (SSE)
    • The error is a residual
      • The residual is the difference between the actual and predicted
      • Residual = actual – predicted
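The slope, intercept, and R² can be computed by hand straight from these definitions (hypothetical data):

```python
# Least-squares fit by hand (hypothetical data)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Slope: covariance-style sum over the x spread
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

# Residual = actual - predicted; SSE is what least squares minimizes
predicted = [intercept + slope * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
sst = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - sse / sst

print(round(slope, 2), round(intercept, 2), round(r_squared, 2))  # 0.6 2.2 0.6
```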

Inferential Statistics

Confidence Intervals

  • Useful when you don’t have a population estimate
  • Because of the Central Limit Theorem, samples are the best estimate
  • But samples vary, so there is a margin of error
    • This creates the lower endpoint and upper endpoint
    • m.e. = critical value (based on confidence) x s.d. (based on sample size)
    • Decreasing confidence decreases the interval, makes it smaller
      • Because the critical value gets smaller
    • Increasing sample size decreases the interval, makes it smaller
      • Because the s.d.(p-hat) or the s.d.(x-bar) gets smaller
  • 95% confidence means that about 95 out of 100 intervals constructed this way will contain the true population parameter
    • All values inside the interval are plausible values for the parameter
    • Values outside the interval are unlikely
  • Supports a claim if it is within the interval
    • This means it can be used like a hypothesis test
    • Better, because it gives a range (effect size) of how much different the value is from the claim
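The margin-of-error formula above in action for a proportion, using the Normal critical value 1.96 for 95% confidence (hypothetical survey: 540 "yes" out of 1,000):

```python
import math

n, yes = 1000, 540            # hypothetical survey results
p_hat = yes / n
z_star = 1.96                 # critical value for 95% confidence

sd_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # shrinks as n grows
margin_of_error = z_star * sd_p_hat
lower, upper = p_hat - margin_of_error, p_hat + margin_of_error

print(round(lower, 3), round(upper, 3))  # 0.509 0.571
```

Increasing n shrinks sd(p-hat) and therefore the interval; lowering the confidence level shrinks the critical value and does the same.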

Hypothesis Test

  • Useful when you do have a population estimate
  • You are testing a claim that the value/trait of the sample is significantly different
  • Significantly different – the value is farther away from the expected by more than random variation
  • The process tests a sample (evidence) against the null hypothesis (nothing changed, no difference)
  • If the P-value is small, results like ours would be unlikely if the null hypothesis were true
  • Therefore, we reject the null hypothesis in favor of the alternative (something did happen)
  • If this is a mistake – we reject the null when it is actually true – we call it a Type I error
  • If the P-value is large, results like ours would be quite plausible if the null hypothesis were true
  • Therefore, we do not reject (fail to reject) the null hypothesis. There is no evidence that anything happened
  • If this is a mistake – we fail to reject the null when it is actually false – we call it a Type II error
  • How large or small depends on your level of significance (α). This is the cut-off between random variation and “something happened”
  • Alpha (α) is exactly the probability of making a Type I error
  • Alpha (α) is inversely linked to the probability of making a Type II error
    • Increasing α decreases the probability of making a Type II error
  • A confidence interval can be used to test a hypothesis
  • If the null hypothesis value is in the confidence interval, then do not reject
  • If the null hypothesis value is NOT in the confidence interval, then reject
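The whole process for one proportion, sketched with the standard library (hypothetical claim p = 0.5, observed 540 "yes" out of 1,000 – the same hypothetical numbers as a survey example above):

```python
import math

n, yes, p0 = 1000, 540, 0.5   # hypothetical data; null hypothesis: p = 0.5
alpha = 0.05                  # level of significance

p_hat = yes / n
sd_null = math.sqrt(p0 * (1 - p0) / n)   # s.d.(p-hat) assuming the null is true
z = (p_hat - p0) / sd_null

def normal_cdf(x):
    # Standard normal cumulative probability via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided test

print(round(z, 2), round(p_value, 3), p_value < alpha)  # 2.53 0.011 True
```

Since p_value < α, we reject the null: a sample proportion of 0.54 is farther from 0.5 than random variation would plausibly produce.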