Key Concepts in Statistics: Data Analysis and Metrics

Posted on Jan 27, 2025 in Statistics

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.

Population vs. Sample

Population: All items of interest (e.g., all cars bought in Ontario last year). Note: It is often impossible to collect all these data points.
Sample: A subset of the population, ideally selected at random (e.g., 1,000 of the cars bought in Ontario last year).

Parameter vs. Statistic

Parameter: A numerical description of the population.
Statistic: A numerical description of a sample.
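
To make the distinction concrete, here is a minimal Python sketch. The population is simulated (hypothetical purchase prices, not real Ontario data), and the names population, sample, parameter, and statistic are illustrative assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated car purchase prices.
population = [random.gauss(30_000, 5_000) for _ in range(100_000)]

# A simple random sample of 1,000 items, drawn without replacement.
sample = random.sample(population, 1_000)

parameter = statistics.fmean(population)  # describes the whole population
statistic = statistics.fmean(sample)      # describes only the sample

print(f"Population mean (parameter): {parameter:,.0f}")
print(f"Sample mean (statistic):     {statistic:,.0f}")
```

The two numbers will usually be close but not identical: the statistic is an estimate of the parameter.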

Types of Data

Quantitative (Numerical) Data: Values that are counted or measured on a numerical scale (e.g., price, mileage).
Qualitative (Categorical) Data: Labels or categories that describe a characteristic rather than measure it (e.g., colour, make of car).

Data Scales

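Ratio/Interval: Differences between values are meaningful (e.g., “97 grade in COMM162”). Ratio data also have a true zero, so ratios of values are meaningful; interval data do not.
Ordinal: Data can be ranked, but the differences between ranks are not meaningful (e.g., “A+ mark”).
Nominal: Data can only be classified into categories (e.g., “Pass”).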

Data Types by Time

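Cross-sectional Data: Data on many subjects collected at a single point in time, so the order of observations does not matter.
Time Series Data: Data on one subject collected over successive time periods, so the sequencing over time is important.
Panel Data: Data that blends the two: the same subjects observed over several time periods.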

Data Visualization and Tables

Frequency Tables

Frequency Table: Lists the categories and the number of times a given category occurs.
Relative Frequency Table: Lists the categories and the proportion with which each category occurs.
Cumulative Frequency Table: Lists the running total of the frequencies (or relative frequencies) up to and including each category.
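
The three tables can be built from the same raw data. The following is a small Python sketch using a hypothetical list of letter grades; the variable names are assumptions made for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: letter grades for ten students.
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

counts = Counter(grades)
categories = sorted(counts)          # ["A", "B", "C", "D"]
n = len(grades)

frequency = [counts[c] for c in categories]        # times each grade occurs
relative = [f / n for f in frequency]              # proportion of the total
cumulative = list(accumulate(frequency))           # running total of counts
cumulative_relative = [c / n for c in cumulative]  # running proportion

print("Cat  Freq  Rel   CumRel")
for row in zip(categories, frequency, relative, cumulative_relative):
    print("{:<5}{:<6}{:<6.2f}{:<6.2f}".format(*row))
```

The last cumulative relative frequency is always 1, which is a quick check that no category was missed.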

Charts and Graphs

Bar Graph: Plots the counts or proportions from a frequency or relative frequency table as one bar per category.
Pie Chart: Plots the proportions from a relative frequency table as slices of a circle.
Pareto Chart: A bar graph with the categories sorted from most to least frequent, plus a line showing their cumulative impact.
Line Graph: Helps to depict the time-series properties of data, illustrating trends and supporting forecasts. Examples: Unemployment rates, annual ticket sales.
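
The numbers behind a Pareto chart are a sorted frequency table plus a cumulative percentage. A minimal Python sketch, assuming a hypothetical set of customer complaint counts (the categories and counts are made up for illustration):

```python
from itertools import accumulate

# Hypothetical complaint counts by category.
complaints = {
    "Wrong item": 18,
    "Late delivery": 42,
    "Other": 5,
    "Damaged item": 25,
    "Billing error": 10,
}

# Sort categories from most to least frequent (the bars of the chart).
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)

# Running total and cumulative percentage (the line of the chart).
total = sum(complaints.values())
running = list(accumulate(count for _, count in ordered))
cumulative_pct = [100 * r / total for r in running]

for (category, count), pct in zip(ordered, cumulative_pct):
    print(f"{category:<15}{count:>4}{pct:>8.1f}%")
```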

Scatter Plots

Used to depict the relationship between two variables:

Airline: Y = cost (for airline to run flight), X = number of passengers on flight.
Restaurant: Is there a relationship between meal cost and number of guests in a party? Y = meal cost, X = number of guests.

Y (Response Variable): Variable we wish to explain, placed on the vertical axis.
X (Explanatory Variable): Helps explain changes in Y, placed on the horizontal axis.

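Histograms and Ogives

Histogram: Shows the distribution of a quantitative variable by grouping values into intervals (classes) and plotting the frequency of each.
Ogive: A line graph of the cumulative frequency (or cumulative relative frequency) across those intervals.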
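
Both charts start from class counts. The sketch below groups hypothetical exam scores into classes of width 10 and prints a text histogram alongside the cumulative counts an ogive would plot; the class boundaries are an assumption chosen for this example.

```python
from itertools import accumulate

# Hypothetical exam scores.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 94]

low, width, n_bins = 50, 10, 5   # classes 50-59, 60-69, ..., 90-99

# Histogram: count how many scores fall in each class.
counts = [0] * n_bins
for s in scores:
    counts[(s - low) // width] += 1

# Ogive: running total of the class counts.
cumulative = list(accumulate(counts))

for i, (c, cum) in enumerate(zip(counts, cumulative)):
    start = low + i * width
    print(f"{start}-{start + width - 1}: {'*' * c:<6} cumulative = {cum}")
```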

Data Shape Descriptors

Symmetry: A distribution is either symmetrical (not skewed) or skewed.
Skewness: Indicates the direction of skew. Positive (right) skew means a longer right tail; negative (left) skew means a longer left tail.
Modality: Number of peaks (unimodal, bimodal).
Bell-shaped (Normal): Symmetrical, not skewed, unimodal.

Measures of Central Tendency

Mean: The arithmetic average of values.
Median: The middle value (after sorting). If the number of values is even, it is the average of the two middle values. Note: It is less affected by extreme values than the mean.
Mode: The most frequently occurring value. If there is a tie, list all of the tied values.
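
A short Python sketch of the three measures, using the standard statistics module on hypothetical data. The extreme value (100) is included on purpose to show how it pulls the mean but not the median.

```python
import statistics

# Hypothetical data with one extreme value (100).
values = [2, 3, 3, 5, 7, 8, 100]

mean = statistics.mean(values)        # 128 / 7, about 18.29
median = statistics.median(values)    # middle of the sorted values: 5
modes = statistics.multimode(values)  # every most-frequent value: [3]

# Even number of values: the median averages the two middle values.
even_median = statistics.median([1, 2, 3, 4])        # (2 + 3) / 2 = 2.5

# A tie for most frequent: multimode lists every tied value.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])   # [1, 2]

print(mean, median, modes, even_median, tied_modes)
```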

Normal Shape

For a normal (bell-shaped) distribution, mean = median = mode.

Measures of Dispersion

Range: Difference between the largest and smallest values. Range = max(X) – min(X).
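Variance: Average of squared deviations from the mean. The population variance divides the sum of squared deviations by N; the sample variance divides by n – 1.
Standard Deviation (SD): Square root of the variance, expressed in the same units as the data. Measures how concentrated the data are around the mean.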
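
The sketch below computes these measures in Python on hypothetical data. It shows both conventions for variance: dividing the sum of squared deviations by N (population) and by n – 1 (sample).

```python
import statistics

# Hypothetical data; the mean is 5 and the squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)          # 9 - 2 = 7

pop_variance = statistics.pvariance(values)     # 32 / 8 = 4
pop_sd = statistics.pstdev(values)              # sqrt(4) = 2

sample_variance = statistics.variance(values)   # 32 / 7, about 4.57
sample_sd = statistics.stdev(values)            # about 2.14

print(data_range, pop_variance, pop_sd, sample_variance, sample_sd)
```

The sample versions are slightly larger because they divide by a smaller number.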

Percentiles and Quartiles

Percentiles: Provide information about the position of particular values relative to the dataset. Example: 10th Percentile = Value with 10% of data below it.
Quartiles: 25th percentile (Q1), 50th percentile or median (Q2), 75th percentile (Q3).
Interquartile Range (IQR): Measures spread of the middle 50% of the data. IQR = Q3 – Q1.
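
Software packages use slightly different interpolation rules for percentiles, so results can differ between tools. The Python sketch below uses the "inclusive" method of statistics.quantiles on hypothetical data; the choice of method is an assumption made for this example.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data

# Cutting the data into 4 equal groups gives the three quartiles.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1   # spread of the middle 50% of the data

# Cutting into 10 equal groups gives the deciles;
# the first cut point is the 10th percentile.
p10 = statistics.quantiles(values, n=10, method="inclusive")[0]

print(q1, q2, q3, iqr, p10)
```

Here Q2 equals the median of the data, as expected.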

Coefficient of Variation (CV)

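Measures variation relative to the mean: CV = SD / mean, often expressed as a percentage. Because it has no units, it can be used to compare datasets measured on different scales. Low CV = low variation; high CV = high variation.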
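
A minimal Python sketch, taking CV as the sample standard deviation divided by the mean. The function name coefficient_of_variation and both datasets are hypothetical. The two datasets have the same SD but very different relative variation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Two hypothetical datasets with the same SD (2) but different means.
large_scale = [98, 100, 102]
small_scale = [8, 10, 12]

cv_large = coefficient_of_variation(large_scale)   # 2 / 100 = 0.02
cv_small = coefficient_of_variation(small_scale)   # 2 / 10  = 0.20

print(f"CV (mean 100): {cv_large:.0%}")
print(f"CV (mean 10):  {cv_small:.0%}")
```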

Skewness and Kurtosis

Skewness: A measure of asymmetry.
- = 0: Symmetric.
- < 0: Negatively skewed.
- > 0: Positively skewed.
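Kurtosis: A measure of peakedness and tail weight. The values below refer to excess kurtosis (kurtosis minus 3, the kurtosis of a normal distribution).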
- = 0: Mesokurtic (normal).
- < 0: Platykurtic (flat, spread out).
- > 0: Leptokurtic (tall, thin).
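
Both measures can be computed from the moments of the data. The Python sketch below uses the simple population-moment formulas (third and fourth central moments scaled by powers of the variance). This is one common convention and an assumption of this example; tools that apply a sample-size adjustment, such as Excel's SKEW() and KURT(), will give somewhat different numbers on small datasets.

```python
import statistics

def shape_measures(values):
    """Return (skewness, excess kurtosis) from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]            # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical, long right tail

sym_skew, sym_kurt = shape_measures(symmetric)
right_skew, right_kurt = shape_measures(right_skewed)

print(f"symmetric:    skew = {sym_skew:.2f}, excess kurtosis = {sym_kurt:.2f}")
print(f"right-skewed: skew = {right_skew:.2f}, excess kurtosis = {right_kurt:.2f}")
```

The evenly spread dataset has zero skewness and negative excess kurtosis (flat), while the dataset with one large value has positive skewness.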

Testing for Bell-shaped (Normal) Distributions

Data are approximately bell-shaped if the distribution is:
- Symmetrical (not skewed).
- Unimodal.
A histogram of such data shows a single peak with symmetrical tails on either side.

The Empirical Rule

For normal distributions:
- ~68% of values lie within ±1σ of the mean.
- ~95% of values lie within ±2σ of the mean.
- ~99.7% of values lie within ±3σ of the mean.
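
These percentages come directly from the normal curve. A short Python sketch that recovers them from the standard normal cumulative distribution function (the helper name share_within is an assumption for illustration):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within = {k: share_within(k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within ±{k}σ: {share:.2%}")
```

Because the result depends only on k, the same percentages hold for any normal distribution, whatever its mean and standard deviation.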

Chebyshev’s Theorem

For any distribution:
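- At least (1 – 1/k²) of values lie within k standard deviations (kσ) of the mean, for any k > 1. For example, k = 2 guarantees at least 75% and k = 3 at least about 89%.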
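
The bound is usually conservative: the actual share of values within kσ is often much higher than the guaranteed minimum. The Python sketch below checks this on a hypothetical, strongly skewed dataset; the function names are assumptions for illustration.

```python
import statistics

def chebyshev_bound(k):
    """Minimum proportion of values within k SDs of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def share_within(values, k):
    """Actual proportion of values within k population SDs of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(abs(x - mean) <= k * sd for x in values) / len(values)

skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, far from bell-shaped

for k in (1.5, 2, 3):
    actual = share_within(skewed, k)
    bound = chebyshev_bound(k)
    print(f"k = {k}: at least {bound:.1%} guaranteed, {actual:.1%} observed")
```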

Summary

Central Tendency: Mean, median, mode.
Spread: Range, SD, variance, CV, percentiles, IQR.
Shape: Skewness, kurtosis.
Testing Normality: Use graphs and shape measures.
Use Excel for descriptive statistics:
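- Functions: AVERAGE(), MEDIAN(), MODE(), VAR.P(), STDEV.P(), PERCENTILE.INC(), QUARTILE.INC().
- VAR.P() and STDEV.P() treat the data as an entire population; use VAR.S() and STDEV.S() when the data are a sample.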