Key Concepts in Statistics: Data Analysis and Metrics
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.
Population vs. Sample
- Population: All items of interest (e.g., all cars bought in Ontario last year). Note: It is often impossible to collect all these data points.
- Sample: Items randomly selected from the population (e.g., 1000 cars bought in Ontario last year).
Parameter vs. Statistic
- Parameter: A numerical description of the population.
- Statistic: A numerical description of a sample.
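The distinction between a parameter and a statistic can be illustrated with a minimal Python sketch. The population of car prices below is made up for illustration (it is not real Ontario data), and the fixed seed is only there so the run is reproducible.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: prices (in dollars) of 10,000 cars.
population = [random.uniform(15_000, 60_000) for _ in range(10_000)]

# Parameter: a numerical description computed from the whole population.
population_mean = statistics.fmean(population)

# Statistic: the same description computed from a random sample of 1,000 cars.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean.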
Types of Data
- Quantitative (Numerical) Data: Numbers and things you can measure objectively.
- Qualitative (Categorical) Data: Characteristics and descriptors that can’t be easily measured, but can be observed subjectively.
Data Scales
- Ratio/Interval: Differences between values are meaningful (e.g., “97 grade in COMM162”). Ratio data also have a true zero, so ratios between values are meaningful as well.
- Ordinal: Data are ranked (e.g., “A+ mark”).
- Nominal: Data may only be classified (e.g., “Pass”).
Data Types by Time
- Cross-sectional Data: Observations on many subjects collected at a single point in time, so the ordering of observations does not matter.
- Time Series Data: Data for which the sequencing over time is important.
- Panel Data: Data that blends the two: the same cross-sectional units observed over several time periods.
Data Visualization and Tables
Frequency Tables
- Frequency Table: Lists the categories and the number of times a given category occurs.
- Relative Frequency Table: Lists the categories and the proportion with which each category occurs.
- Cumulative Frequency Table: Lists the categories and, for each one, the running total of the relative frequencies up to and including that category.
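The three tables above can be built in a few lines of Python. This is a minimal sketch; the letter grades are hypothetical data chosen for illustration.

```python
from collections import Counter

# Hypothetical categorical data: letter grades for 10 students.
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

counts = Counter(grades)
total = len(grades)

frequency = {}    # category -> number of occurrences
relative = {}     # category -> proportion of all observations
cumulative = {}   # category -> running total of the proportions

running_count = 0
for category in sorted(counts):
    frequency[category] = counts[category]
    relative[category] = counts[category] / total
    running_count += counts[category]
    cumulative[category] = running_count / total

for category in frequency:
    print(category, frequency[category], relative[category], cumulative[category])
```

The last cumulative value is always 1.0, which is a quick check that no category was missed.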
Charts and Graphs
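- Bar Graph: Built from a frequency or relative frequency table; the height of each bar shows the category's count or proportion.
- Pie Chart: Built from a relative frequency table; the size of each slice shows the category's proportion.
- Pareto Chart: A bar graph with categories sorted from most to least frequent, together with a line showing their cumulative impact.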
- Line Graph: Helps to depict the time-series properties of data, illustrating trends and supporting forecasts. Examples: Unemployment rates, annual ticket sales.
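The numbers behind a Pareto chart (frequencies sorted in descending order plus their cumulative share) can be computed as in the sketch below. The complaint categories are hypothetical, and plotting is left out so the example needs only the standard library.

```python
from collections import Counter

# Hypothetical categorical data: reasons for customer complaints.
complaints = ["late"] * 5 + ["damaged"] * 3 + ["wrong item"] + ["billing"]

counts = Counter(complaints)
total = sum(counts.values())

# Each row: (category, frequency, cumulative share of all complaints).
pareto_rows = []
running = 0
for category, count in counts.most_common():  # most frequent first
    running += count
    pareto_rows.append((category, count, running / total))

for row in pareto_rows:
    print(row)
```

The bars of the chart would use the frequencies, and the overlaid line would use the cumulative shares.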
Scatter Plots
Used to depict the relationship between two variables:
- Y (Response Variable): Variable we wish to explain, placed on the vertical axis.
- X (Explanatory Variable): Helps explain changes in Y, placed on the horizontal axis.
- Airline example: Y = cost (for the airline to run a flight), X = number of passengers on the flight.
- Restaurant example: Is there a relationship between meal cost and the number of guests in a party? Y = meal cost, X = number of guests.
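Histograms and Ogives
- Histogram: Shows the distribution of quantitative data by counting how many values fall in each interval (bin).
- Ogive: A line graph of the cumulative frequency (or cumulative relative frequency) plotted against the upper boundary of each interval.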
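The counts behind a histogram and the running totals behind an ogive can be sketched as follows. The exam scores and the bin edges are assumptions made for illustration.

```python
from itertools import accumulate

# Hypothetical quantitative data: 15 exam scores.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 93]

# Assumed bins: [50, 60), [60, 70), [70, 80), [80, 90), [90, 100).
bin_edges = [50, 60, 70, 80, 90, 100]

# Histogram: number of scores falling in each bin.
hist_counts = [0] * (len(bin_edges) - 1)
for score in scores:
    for i in range(len(hist_counts)):
        if bin_edges[i] <= score < bin_edges[i + 1]:
            hist_counts[i] += 1
            break

# Ogive: cumulative frequency at the upper edge of each bin.
ogive = list(accumulate(hist_counts))

print("histogram counts:", hist_counts)
print("ogive (cumulative):", ogive)
```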
Data Shape Descriptors
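- Symmetry: A distribution is either symmetrical (not skewed) or skewed (positively or negatively).
- Skewness: Indicates the direction of skew. Positive skew has a longer right tail; negative skew has a longer left tail.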
- Modality: Number of peaks (unimodal, bimodal).
- Bell-shaped (Normal): Symmetrical, not skewed, unimodal.
Measures of Central Tendency
- Mean: The arithmetic average of values.
- Median: The middle value (after sorting). If the number of values is even, it is the average of the two middle values. Note: The median is less affected by extreme values than the mean.
- Mode: The most frequently occurring value. If there is a tie, list all frequent values.
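The three measures can be computed with Python's standard `statistics` module. This is a small sketch on made-up values; the outlier (40) is included to show how it pulls the mean but not the median.

```python
import statistics

# Hypothetical data with one extreme value (40).
values = [2, 3, 3, 5, 7, 10, 40]

mean = statistics.mean(values)        # sum of values divided by their count
median = statistics.median(values)    # middle value after sorting
modes = statistics.multimode(values)  # all most-frequent values, as a list

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)

# With an even number of values, the median averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])

# With a tie, multimode lists every most-frequent value.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
```

Here the mean (10) is twice the median (5), because the single extreme value affects the mean much more than the median.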
Normal Shape
For a perfectly normal (bell-shaped) distribution, Mean = Median = Mode.
Measures of Dispersion
- Range: Difference between the largest and smallest values. Range = max(X) – min(X).
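- Variance: Average of squared deviations from the mean. For a population, divide the sum of squared deviations by N; for a sample, divide by n – 1.
- Standard Deviation (SD): Square root of variance. Measures how concentrated the data are around the mean, in the same units as the data.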
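A short sketch of these dispersion measures using the standard `statistics` module. The data are hypothetical; the population functions divide by N, while the sample functions divide by n – 1.

```python
import statistics

# Hypothetical data: 8 values with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)           # Range = max(X) - min(X)

pop_variance = statistics.pvariance(data)    # divides by N
pop_sd = statistics.pstdev(data)             # square root of pop_variance

sample_variance = statistics.variance(data)  # divides by n - 1
sample_sd = statistics.stdev(data)           # square root of sample_variance

print("range:", data_range)
print("population variance and SD:", pop_variance, pop_sd)
print("sample variance and SD:", sample_variance, sample_sd)
```

The sample versions are slightly larger than the population versions because of the smaller divisor.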
Percentiles and Quartiles
- Percentiles: Provide information about the position of particular values relative to the dataset. Example: 10th Percentile = Value with 10% of data below it.
- Quartiles: 25th Percentile (Q1), Median (Q2, the 50th percentile), 75th Percentile (Q3).
- Interquartile Range (IQR): Measures spread of the middle 50% of the data. IQR = Q3 – Q1.
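Quartiles, the IQR, and other percentiles can be sketched with `statistics.quantiles` (Python 3.8+). Several interpolation conventions exist; the `method="inclusive"` option used here is assumed to be the one that corresponds to Excel's PERCENTILE.INC() and QUARTILE.INC(), which is worth verifying on your own data.

```python
import statistics

# Hypothetical data: nine sorted values.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into quarters and returns the cut points Q1, Q2, Q3.
quartiles = statistics.quantiles(data, n=4, method="inclusive")
q1, q2, q3 = quartiles

iqr = q3 - q1  # spread of the middle 50% of the data

# n=10 returns the 10th, 20th, ..., 90th percentiles; take the first one.
p10 = statistics.quantiles(data, n=10, method="inclusive")[0]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("10th percentile:", p10)
```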
Coefficient of Variation (CV)
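Measures variation relative to the mean: CV = SD / Mean, often expressed as a percentage. Low CV = low variation relative to the mean; high CV = high variation. Because it has no units, the CV is useful for comparing the spread of datasets with different units or very different means.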
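A minimal sketch of the CV, computed as the standard deviation divided by the mean. The two price series are hypothetical, and the population SD is an assumed choice (a sample SD could be used instead).

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean (population SD used here)."""
    return statistics.pstdev(values) / statistics.fmean(values)

# Hypothetical prices: same absolute spread, very different means.
stock_a = [98, 100, 102]  # mean 100
stock_b = [8, 10, 12]     # mean 10

cv_a = coefficient_of_variation(stock_a)
cv_b = coefficient_of_variation(stock_b)

print(f"CV of stock A: {cv_a:.1%}")
print(f"CV of stock B: {cv_b:.1%}")
```

Both series have the same standard deviation, but stock B varies ten times as much relative to its mean.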
Skewness and Kurtosis
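- Skewness: A measure of asymmetry.
- Skewness = 0: Symmetric.
- Skewness < 0: Negatively skewed (longer left tail).
- Skewness > 0: Positively skewed (longer right tail).
- Kurtosis: A measure of peakedness, reported here as excess kurtosis (kurtosis relative to the normal distribution).
- Kurtosis = 0: Mesokurtic (normal).
- Kurtosis < 0: Platykurtic (flat, spread out).
- Kurtosis > 0: Leptokurtic (tall, thin).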
Testing for Bell-shaped (Normal) Distributions
- Symmetrical.
- Not skewed.
- Unimodal.
- Visuals often show histograms with a single peak, symmetrical tails, and no skewness.
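As a numeric complement to the visual checks, skewness and excess kurtosis can be computed directly. The sketch below uses the simple population (moment-based) formulas, which is an assumption; Excel's SKEW() and KURT() apply sample adjustments, so their results differ slightly. The datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Population skewness: the average of cubed z-scores."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

def excess_kurtosis(values):
    """Population excess kurtosis: average of z-scores^4, minus 3."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 4 for x in values) / len(values) - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]           # long right tail
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail

print("symmetric skewness:", skewness(symmetric))
print("right-skewed skewness:", skewness(right_skewed))
print("left-skewed skewness:", skewness(left_skewed))
print("symmetric excess kurtosis:", excess_kurtosis(symmetric))
```

Values of skewness and excess kurtosis near 0 are consistent with a bell shape, though they do not prove it.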
The Empirical Rule
For normal distributions:
- ~68% of values lie within ±1σ of the mean.
- ~95% of values lie within ±2σ of the mean.
- ~99.7% of values lie within ±3σ of the mean.
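The rule can be checked by simulation. This is a rough sketch, assuming 100,000 random draws from a normal distribution with an arbitrary mean of 50 and standard deviation of 10; the observed shares will be close to, not exactly, the stated percentages.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical data: simulated draws from a normal distribution.
data = [random.gauss(50, 10) for _ in range(100_000)]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

def share_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the center."""
    return sum(abs(x - center) <= k * spread for x in values) / len(values)

shares = {k: share_within(data, mu, sigma, k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within ±{k}σ: {share:.3f}")
```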
Chebyshev’s Theorem
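For any distribution:
- At least (1 – 1/k²) of values lie within kσ of the mean, for any k > 1. For example, k = 2 guarantees at least 75% of values, and k = 3 at least about 88.9%.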
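The bound can be compared with what is observed in clearly non-normal data. The sketch below assumes simulated, strongly right-skewed data (exponential draws) purely for illustration; `chebyshev_bound` is a helper name introduced here.

```python
import random
import statistics

def chebyshev_bound(k):
    """Minimum share of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical skewed data: 10,000 draws from an exponential distribution.
data = [random.expovariate(1.0) for _ in range(10_000)]

mu = statistics.fmean(data)
sigma = statistics.pstdev(data)

observed = {}
for k in (1.5, 2, 3):
    observed[k] = sum(abs(x - mu) <= k * sigma for x in data) / len(data)
    print(f"k={k}: guaranteed >= {chebyshev_bound(k):.3f}, observed {observed[k]:.3f}")
```

The observed shares are typically well above the guaranteed minimum, since the theorem is a worst-case bound that holds for every distribution.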
Summary
- Central Tendency: Mean, median, mode.
- Spread: Range, SD, variance, CV, percentiles, IQR.
- Shape: Skewness, kurtosis.
- Testing Normality: Use graphs and shape measures.
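- Use Excel for descriptive statistics:
- Functions: AVERAGE(), MEDIAN(), MODE(), VAR.P(), STDEV.P(), PERCENTILE.INC(), QUARTILE.INC(). Note: VAR.P() and STDEV.P() treat the data as a population; use VAR.S() and STDEV.S() when the data are a sample.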