Key Statistical Concepts and Data Visualization
Posted on Jan 6, 2025 in Statistics
Key Statistical Definitions
- Independence: The random choice of each individual in the sample is not influenced by which other individuals are chosen.
- Sample of Convenience: Samples chosen because they are easily available.
- Haphazard Sampling: Samples chosen without a formal random procedure, in the hope that the choice is effectively random.
- Volunteer Bias: A bias that arises when individuals who volunteer for a study differ systematically from the population of interest.
- Accuracy: How close the average estimate from many studies is to the parameter.
- Precision: How spread out repeated estimates are from their average.
- Ordinal Categorical Data: Non-numerical categories that have an order.
- Nominal Categorical Data: Categories that do not have an order.
- Continuous Numerical Data: Can take any value within a range.
- Discrete Numerical Data: Can only be specific values.
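The accuracy and precision definitions above can be made concrete with a small calculation. This is a minimal sketch, not a standard routine: the function name `bias_and_spread` and the two sets of estimates are made up for illustration, and the true parameter is assumed to be known (which it rarely is in practice).

```python
import statistics

def bias_and_spread(estimates, parameter):
    """Return (bias, spread) for repeated estimates of a known parameter.

    bias   = mean of the estimates minus the parameter (relates to accuracy)
    spread = standard deviation of the estimates around their own mean
             (relates to precision)
    """
    mean_estimate = statistics.mean(estimates)
    bias = mean_estimate - parameter
    spread = statistics.stdev(estimates)
    return bias, spread

# Made-up estimates from two hypothetical study designs; the true value is 50.
true_value = 50
design_a = [49, 51, 50, 48, 52]  # centred on the parameter, but scattered
design_b = [55, 55, 56, 54, 55]  # tightly clustered, but off target

bias_a, spread_a = bias_and_spread(design_a, true_value)
bias_b, spread_b = bias_and_spread(design_b, true_value)

print("design A: bias", bias_a, "spread", round(spread_a, 2))
print("design B: bias", bias_b, "spread", round(spread_b, 2))
```

In this made-up data, design A is accurate but less precise, while design B is precise but inaccurate.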
Distributions
- Frequency Distribution: Describes the number of times each value of a variable occurs in a sample. If we hypothetically repeat the study, each new sample gives a somewhat different frequency distribution.
- Probability Distribution: Describes the probability of each value (or range of values) in the population; unlike a sample's frequency distribution, it does not change from sample to sample. For a random trial, it lists the probabilities of all mutually exclusive outcomes.
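The contrast between the two distributions can be sketched with a simulation. The fair six-sided die, the sample size of 60, and the seed are all assumptions chosen for illustration; the point is only that the sample counts vary between repeats while the probabilities stay fixed.

```python
import random
from collections import Counter

# Assumed population model: a fair die, so each face has probability 1/6.
probability_distribution = {face: 1 / 6 for face in range(1, 7)}

def sample_frequency_distribution(n, rng):
    """Roll the die n times and count how often each face occurs."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return Counter(rolls)

rng = random.Random(1)  # seeded only so the sketch is repeatable
first_sample = sample_frequency_distribution(60, rng)
second_sample = sample_frequency_distribution(60, rng)

# Columns: face, count in sample 1, count in sample 2, expected count
for face in range(1, 7):
    expected = probability_distribution[face] * 60
    print(face, first_sample[face], second_sample[face], round(expected, 1))
```

Each run of `sample_frequency_distribution` plays the role of repeating the study, whereas `probability_distribution` is a property of the population and never changes.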
Study Types
- Observational Studies: Studies in which the assignment of treatments is not made by the researcher. They can reveal associations but cannot, on their own, answer questions about causation.
- Experimental Studies: Studies that allow attribution of statistical relationships to causation by the experimentally manipulated variables. The researcher assigns treatments randomly to individuals.
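Random assignment, the step that separates an experimental study from an observational one, might be sketched as follows. The helper `assign_treatments`, the subject labels, and the two treatment names are hypothetical; real designs often add blocking or stratification that this sketch ignores.

```python
import random

def assign_treatments(individuals, treatments, rng):
    """Randomly assign individuals to treatments in near-equal group sizes."""
    shuffled = list(individuals)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)         # the random order decides who gets what
    assignment = {}
    for position, individual in enumerate(shuffled):
        # Cycle through the treatments so group sizes differ by at most one.
        assignment[individual] = treatments[position % len(treatments)]
    return assignment

subjects = [f"subject_{i}" for i in range(1, 9)]
groups = assign_treatments(subjects, ["control", "treatment"], random.Random(42))

for subject in subjects:
    print(subject, groups[subject])
```

Because the shuffle, not the researcher or the subject, determines each assignment, other variables tend to be balanced across groups on average.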
Variables
- Explanatory Variable: A variable that predicts or affects the other variable in a study (independent variable).
- Response Variable: The variable that is predicted or affected by the explanatory variable; usually the variable of focus in a study or experiment (dependent variable).
Effective Data Visualization
Characteristics of a Good Graph
- Shows the data clearly.
- Makes patterns in the data easy to see.
- Represents magnitudes honestly.
- Draws graphical elements clearly.
Graph Types
- Relative Frequency Distribution: Describes the fraction of occurrences of each value of a variable (showing data for one variable).
- Histogram: Uses the area of rectangular bars to display the (relative) frequency distribution of a numerical variable. Can present discrete or continuous data. Mode is the peak. Bimodal indicates two distinct peaks.
- Mode: The interval corresponding to the highest peak in the frequency distribution.
- Skew: Refers to asymmetry in the shape of a frequency distribution for a numerical variable.
- Positive Skew: Mode to the left, long tail to the right (right-skewed).
- Negative Skew: Mode to the right, long tail to the left (left-skewed).
- Contingency Table: Gives the frequency (a count, sometimes shown as a proportion) of occurrences of every combination of two (or more) categorical variables.
- Grouped Bar Graph: Uses the height of rectangular bars to display the (relative) frequency distributions of two or more categorical variables.
- Mosaic Plot: Uses the area of rectangles to display the relative frequency of occurrence of all combinations of two categorical variables.
- Scatter Plot: A graphical display of two numerical variables in which each observation is represented as a point on a graph with two axes.
- Strip Chart: A graphical display of a numerical variable and a categorical variable in which each observation is represented as a dot. Shows all data points.
- Box Plot: Uses lines and a rectangular box to display the median, quartiles, range, and extreme measurements of the data.
- Line Graph: Points connected by straight line segments, used to display a trend in a measurement over time or another ordered series.
- Violin Plot: A hybrid of a box plot and a kernel density plot, which shows peaks in the data.
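The direction of skew described in the list above can also be checked numerically. The sketch below uses one common moment-based definition of skewness (mean cubed deviation divided by the cubed population SD); other conventions add small-sample corrections. The function name and both data sets are made up for illustration.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by SD cubed.

    Uses the population SD (divide by n); this is an assumed convention.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    n = len(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Made-up data: most values are small, with a long tail of large values.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]
# Mirroring the data flips the tail to the left.
left_skewed = [-x for x in right_skewed]

print("right-skewed:", round(sample_skewness(right_skewed), 2))
print("left-skewed: ", round(sample_skewness(left_skewed), 2))
print("mean vs median:", statistics.fmean(right_skewed),
      statistics.median(right_skewed))
```

A positive result goes with a long right tail and a negative result with a long left tail; in the right-skewed data the long tail also pulls the mean above the median.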
Statistical Measures
- Standard Deviation (SD): Used to measure the spread of a distribution from the mean. Large if most observations are far from the mean; small if most measurements lie close to the mean. It is the square root of the variance.
- Coefficient of Variation (CV): Expresses the standard deviation as a percentage of the mean. CV = (Standard Deviation / Mean) * 100%. A lower CV means individuals are more similar to one another relative to the mean; a higher CV means there is more variability. Used to compare the variability of traits that do not have the same units.
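Both measures are quick to compute with Python's standard `statistics` module. This is a minimal sketch: the helper `coefficient_of_variation` and the two sets of measurements are hypothetical, and the sample standard deviation (dividing by n - 1) is assumed.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Return the CV as a percentage: (sample SD / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements of two traits in different units.
body_mass_g = [20, 22, 18, 21, 19]
tail_length_mm = [80, 100, 60, 90, 70]

# The standard deviation is the square root of the variance.
sd_mass = statistics.stdev(body_mass_g)
assert math.isclose(sd_mass, math.sqrt(statistics.variance(body_mass_g)))

cv_mass = coefficient_of_variation(body_mass_g)
cv_tail = coefficient_of_variation(tail_length_mm)

print("body mass:   SD", round(sd_mass, 2), "g, CV", round(cv_mass, 1), "%")
print("tail length: CV", round(cv_tail, 1), "%")
```

The two standard deviations are in grams and millimetres and cannot be compared directly, but the CVs are unitless, so in this made-up data tail length is the more variable trait relative to its mean.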