Key Statistical Concepts and Data Visualization

Key Statistical Definitions

  • Independence: The random choice of each individual in the sample is not influenced by which other individuals are chosen.
  • Sample of Convenience: Samples chosen because they are easily available.
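  • Haphazard Sampling: Samples chosen without a formal random procedure, in the hope that the choice was effectively random.
  • Volunteer Bias: A systematic difference between individuals who volunteer for a study and the population they are meant to represent.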
  • Accuracy: How close the average estimate from many studies is to the parameter.
  • Precision: How spread out repeated estimates are from their average.
  • Ordinal Categorical Data: Non-numerical categories that have an order.
  • Nominal Categorical Data: Categories that do not have an order.
  • Continuous Numerical Data: Can take any real-number value within a range.
  • Discrete Numerical Data: Can only be specific values.
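
The distinction between accuracy and precision can be sketched numerically. The example below is a minimal illustration, not a standard procedure: the function name bias_and_spread, the value TRUE_VALUE, and both lists of estimates are hypothetical, and the sample standard deviation is assumed as the measure of spread.

```python
import statistics

def bias_and_spread(estimates, parameter):
    """Return (bias, spread) for repeated estimates of a known parameter."""
    average = statistics.mean(estimates)
    bias = average - parameter             # close to 0 means accurate
    spread = statistics.stdev(estimates)   # small means precise
    return bias, spread

TRUE_VALUE = 50.0  # hypothetical population parameter

# Hypothetical estimates from repeated studies (made-up numbers)
accurate_imprecise = [40.0, 60.0, 45.0, 55.0, 50.0]
precise_inaccurate = [57.0, 58.0, 58.5, 57.5, 59.0]

for label, estimates in [("accurate but imprecise", accurate_imprecise),
                         ("precise but inaccurate", precise_inaccurate)]:
    bias, spread = bias_and_spread(estimates, TRUE_VALUE)
    print(f"{label}: bias = {bias:.2f}, spread = {spread:.2f}")
```

The first set averages to the parameter but is widely scattered; the second is tightly clustered but centered away from the parameter.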

Distributions

  • Frequency Distribution: Describes the number of times each value of a variable occurs in a sample. If we hypothetically repeated the study, we would get a different frequency distribution each time.
  • Probability Distribution: Shows the frequencies of any (range of) values from the population; this does not change with each sample. It is a list of the probabilities of all mutually exclusive outcomes of a random trial.
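
As a rough sketch of the contrast between the two, the example below uses a fair six-sided die as a stand-in population. The die, the sample size of 30, the seed, and the helper name sample_frequency are all assumptions made for illustration.

```python
import random
from collections import Counter
from fractions import Fraction

# Probability distribution of a fair die: fixed, a property of the population.
probability = {face: Fraction(1, 6) for face in range(1, 7)}

def sample_frequency(n, rng):
    """Roll the die n times and count how often each face occurs."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return Counter(rolls)

rng = random.Random(1)  # arbitrary seed so the run is repeatable

# Two hypothetical repetitions of the same study
first = sample_frequency(30, rng)
second = sample_frequency(30, rng)

for face in range(1, 7):
    print(face, probability[face], first[face], second[face])
```

The probability column is the same on every run of the study, while the two count columns will generally differ from each other because each is based on a different random sample.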

Study Types

  • Observational Studies: Studies in which the researcher does not assign treatments to individuals. Because other variables may differ between groups, these studies cannot by themselves establish causation.
  • Experimental Studies: Studies that allow attribution of statistical relationships to causation by the experimentally manipulated variables. The researcher assigns treatments randomly to individuals.

Variables

  • Explanatory Variable: A variable that predicts or affects the other variable in a study (independent variable).
  • Response Variable: The variable that is predicted or affected by the explanatory variable in a study or experiment (dependent variable).

Effective Data Visualization

Characteristics of a Good Graph

  • Shows the data clearly.
  • Makes patterns in the data easy to see.
  • Represents magnitudes honestly.
  • Draws graphical elements clearly.

Graph Types

  • Relative Frequency Distribution: Describes the fraction of occurrences of each value of a variable (showing data for one variable).
  • Histogram: Uses the area of rectangular bars to display the (relative) frequency distribution of a numerical variable. Can present discrete or continuous data. The mode is the highest peak; a bimodal distribution has two distinct peaks.
  • Mode: The interval corresponding to the highest peak in the frequency distribution.
  • Skew: Refers to asymmetry in the shape of a frequency distribution for a numerical variable.
    • Positive Skew: Mode to the left, long tail to the right.
    • Negative Skew: Mode to the right, long tail to the left.
  • Contingency Table: Gives the frequency (or relative frequency, i.e., proportion) of occurrences of all combinations of two (or more) categorical variables.
  • Grouped Bar Graph: Uses the height of rectangular bars to display the (relative) frequency distributions of two or more categorical variables.
  • Mosaic Plot: Uses the area of rectangles to display the relative frequency of occurrence of all combinations of two categorical variables.
  • Scatter Plot: A graphical display of two numerical variables in which each observation is represented as a point on a graph with two axes.
  • Strip Chart: A graphical display of a numerical variable and a categorical variable in which each observation is represented as a dot. Shows all data points.
  • Box Plot: Uses lines and a rectangular box to display the median, quartiles, range, and extreme measurements of the data.
  • Line Graph: Uses points linked by straight line segments to display a trend in a measurement over time or another ordered series.
  • Violin Plot: A hybrid of a box plot and a kernel density plot, which shows peaks in the data.
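
The counts behind a contingency table, grouped bar graph, or mosaic plot can be tallied directly from paired categorical observations. The sketch below assumes a small made-up data set; the variable names and the "control"/"treated" and "survived"/"died" categories are hypothetical.

```python
from collections import Counter

# Hypothetical observations: (treatment group, outcome) for each individual
observations = [
    ("control", "survived"), ("control", "survived"), ("control", "died"),
    ("treated", "survived"), ("treated", "survived"), ("treated", "survived"),
    ("treated", "died"), ("control", "died"),
]

# Frequency of each combination of the two categorical variables
counts = Counter(observations)

# Relative frequency (proportion) of each combination
total = sum(counts.values())
relative = {combo: n / total for combo, n in counts.items()}

# Print the table with groups as rows and outcomes as columns
outcomes = ["survived", "died"]
print("group    " + "  ".join(outcomes))
for group in ["control", "treated"]:
    row = "  ".join(f"{counts[(group, o)]:>8}" for o in outcomes)
    print(f"{group:<8} {row}")
```

Dividing each count by the total, as in relative, gives the proportions that a mosaic plot represents as areas.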

Statistical Measures

  • Standard Deviation (SD): Used to measure the spread of a distribution from the mean. Large if most observations are far from the mean; small if most measurements lie close to the mean. It is the square root of the variance.
  • Coefficient of Variation (CV): Expresses the standard deviation as a percentage of the mean. CV = (Standard Deviation / Mean) * 100%. A lower CV means individuals are more similar to one another relative to the mean; a higher CV means there is more variability. Used to compare the variability of traits that do not have the same units.
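
A minimal sketch of both measures, using Python's standard statistics module. The measurements are made up, the helper name coefficient_of_variation is hypothetical, and the sample standard deviation (n - 1 denominator) is assumed, since the definition above does not say which form is intended.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # square root of the sample variance
    return sd / mean * 100

# Hypothetical measurements of two traits recorded in different units
body_mass_g = [20.0, 22.0, 24.0, 26.0, 28.0]
tail_length_cm = [9.0, 9.5, 10.0, 10.5, 11.0]

for name, values in [("body mass (g)", body_mass_g),
                     ("tail length (cm)", tail_length_cm)]:
    sd = statistics.stdev(values)
    cv = coefficient_of_variation(values)
    print(f"{name}: SD = {sd:.3f}, CV = {cv:.1f}%")
```

The two standard deviations are in grams and centimeters and cannot be compared directly, but the CVs are unitless percentages, so in this made-up data body mass is the more variable trait relative to its mean.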