Key Concepts in Statistics: Data Analysis and Metrics

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.

Population vs. Sample

  • Population: All items of interest (e.g., all cars bought in Ontario last year). Note: It is often impossible to collect all these data points.
  • Sample: A subset of the population, ideally selected at random (e.g., 1000 cars bought in Ontario last year).

Parameter vs. Statistic

  • Parameter: A numerical description of the population.
  • Statistic: A numerical description of a sample.
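
The distinction can be illustrated with a small simulation. This is only a sketch: the population below is made up (simulated car prices, with hypothetical names and numbers), and the mean is used as the numerical description.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: prices of 10,000 cars (simulated, not real data).
population = [random.gauss(30_000, 8_000) for _ in range(10_000)]

# Parameter: computed from every item in the population.
population_mean = statistics.fmean(population)

# Statistic: computed from a random sample of 1,000 items.
sample = random.sample(population, 1_000)
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.0f}")
print(f"statistic (sample mean):     {sample_mean:.0f}")
```

The two numbers are usually close but not identical, which is why a statistic is treated as an estimate of the parameter.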

Types of Data

  • Quantitative (Numerical) Data: Values that are counted or measured on a numerical scale (e.g., price, weight).
  • Qualitative (Categorical) Data: Labels or descriptors that place each item in a category rather than measuring it (e.g., car make).

Data Scales

  • Ratio/Interval: Differences between values are meaningful; ratio data also have a true zero, so ratios of values are meaningful too (e.g., “97 grade in COMM162”).
  • Ordinal: Data are ranked (e.g., “A+ mark”).
  • Nominal: Data may only be classified (e.g., “Pass”).

Data Types by Time

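  • Cross-sectional Data: Observations on many items collected at a single point in time, so the order of the observations does not matter.
  • Time Series Data: Observations on one item collected over time, so the sequencing is important.
  • Panel Data: Observations on the same set of items collected over several time periods, combining the two.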

Data Visualization and Tables

Frequency Tables

  • Frequency Table: Lists the categories and the number of times a given category occurs.
  • Relative Frequency Table: Lists the categories and the proportion with which each category occurs.
  • Cumulative Frequency Table: Lists, for each category, the running total of the relative frequencies up to and including that category.
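
A minimal sketch of how the three tables relate, using only the Python standard library. The grade data and variable names are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample of letter grades (illustrative data only).
grades = ["A", "B", "B", "C", "A", "B", "D", "C", "B", "A"]

counts = Counter(grades)
categories = sorted(counts)               # A, B, C, D
n = len(grades)

frequency = [counts[c] for c in categories]   # number of times each category occurs
relative = [f / n for f in frequency]         # proportion of the total
cumulative = list(accumulate(relative))       # running total of the proportions

for c, f, r, cum in zip(categories, frequency, relative, cumulative):
    print(f"{c}: freq={f}  rel={r:.2f}  cum={cum:.2f}")
```

The last cumulative value should always come out to 1 (100%), which is a quick check that no category was missed.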

Charts and Graphs

  • Bar Graph: Plots each category’s frequency or relative frequency as the height of a bar.
  • Pie Chart: Shows each category’s relative frequency as a slice of the whole.
  • Pareto Chart: Displays the frequency of categories, as well as their cumulative impact.
  • Line Graph: Helps to depict the time-series properties of data, illustrating trends and supporting forecasts. Examples: Unemployment rates, annual ticket sales.
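
The numbers behind a Pareto chart can be sketched as follows: categories are sorted from most to least frequent, and a cumulative share is tracked alongside. The complaint data are hypothetical, and the sketch stops at the table rather than drawing the chart.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical complaint categories (illustrative data only).
complaints = ["late"] * 5 + ["damaged"] * 3 + ["wrong item"] * 1 + ["other"] * 1

# Sort categories from most to least frequent, as a Pareto chart does.
ordered = Counter(complaints).most_common()
total = sum(count for _, count in ordered)

# Cumulative impact: share of all complaints covered so far.
cumulative_share = list(accumulate(count / total for _, count in ordered))

for (category, count), share in zip(ordered, cumulative_share):
    print(f"{category:<12} count={count}  cumulative={share:.0%}")
```

In this made-up example the first two categories account for 80% of the complaints, which is the kind of pattern a Pareto chart is meant to expose.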

Scatter Plots

Used to depict the relationship between two variables:

  • Y (Response Variable): Variable we wish to explain, placed on the vertical axis.
  • X (Explanatory Variable): Helps explain changes in Y, placed on the horizontal axis.
  • Airline example: Y = cost (for airline to run flight), X = number of passengers on flight.
  • Restaurant example: Is there a relationship between meal cost and number of guests in a party? Y = meal cost, X = number of guests.

Two related charts describe a single quantitative variable rather than a relationship:

  • Histogram: Shows the distribution of values grouped into intervals (bins).
  • Ogive: Shows the cumulative frequency as a line graph.

Data Shape Descriptors

  • Symmetry: A distribution is either symmetrical (not skewed) or skewed.
  • Skewness: Indicates the direction of skew. Positive skew has a longer right tail; negative skew has a longer left tail.
  • Modality: Number of peaks (unimodal, bimodal).
  • Bell-shaped (Normal): Symmetrical, not skewed, unimodal.

Measures of Central Tendency

  • Mean: The arithmetic average of values.
  • Median: The middle value (after sorting). If the number of values is even, it is the average of the two middle values. Note: Not as affected by extreme values.
  • Mode: The most frequently occurring value. If there is a tie, list all frequent values.
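
A short sketch of the three measures using Python's standard statistics module. The data are hypothetical and include one extreme value to show why the median is less affected than the mean.

```python
import statistics

values = [2, 3, 3, 5, 7, 40]  # illustrative data with one extreme value (40)

mean = statistics.mean(values)        # 60 / 6 = 10, pulled upward by the 40
median = statistics.median(values)    # average of the two middle values, 3 and 5
modes = statistics.multimode(values)  # list of every most-frequent value

print(mean, median, modes)

# With a tie, multimode() lists all tied values.
tied_modes = statistics.multimode([1, 1, 2, 2, 3])
print(tied_modes)
```

Here the mean is 10 while the median is 4, so the single extreme value moves the mean far more than the median.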

Normal Shape

For a perfectly bell-shaped (normal) distribution, Mean = Median = Mode.

Measures of Dispersion

  • Range: Difference between the largest and smallest values. Range = max(X) – min(X).
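  • Variance: Average of squared deviations from the mean. For a population, divide the sum of squared deviations by N; for a sample, divide by n – 1.
  • Standard Deviation (SD): Square root of variance. Measures how spread out the data are around the mean, in the same units as the data.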
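
The measures above can be sketched with the standard statistics module. The data are hypothetical. Note that the module offers both population versions (divide by N) and sample versions (divide by n – 1); which one applies depends on whether the data are the whole population or a sample.

```python
import statistics

values = [4, 8, 6, 5, 3, 7, 9]  # illustrative data, mean = 6

data_range = max(values) - min(values)   # 9 - 3 = 6

# Population versions: divide the sum of squared deviations by N.
pop_var = statistics.pvariance(values)   # 28 / 7 = 4
pop_sd = statistics.pstdev(values)       # square root of 4 = 2

# Sample versions: divide by n - 1 instead.
samp_var = statistics.variance(values)   # 28 / 6
samp_sd = statistics.stdev(values)

print(data_range, pop_var, pop_sd, samp_var, samp_sd)
```

The sample variance is always a little larger than the population variance computed from the same numbers, and the gap shrinks as the dataset grows.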

Percentiles and Quartiles

  • Percentiles: Provide information about the position of particular values relative to the dataset. Example: 10th Percentile = Value with 10% of data below it.
  • 25th Percentile (Q1), Median (Q2), 75th Percentile (Q3).
  • Interquartile Range (IQR): Measures spread of the middle 50% of the data. IQR = Q3 – Q1.
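
A sketch of quartiles and the IQR with the standard statistics module. The data are hypothetical. Several percentile conventions exist; this sketch assumes the "inclusive" method, which treats the minimum and maximum as the 0th and 100th percentiles and, as far as can be told, follows the same convention as the .INC functions in Excel.

```python
import statistics

values = [7, 3, 9, 5, 4, 8, 6]  # illustrative data; quantiles() sorts internally

# method="inclusive" treats the smallest and largest values as the
# 0th and 100th percentiles and interpolates linearly in between.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Deciles: the first of the nine cut points is the 10th percentile.
p10 = statistics.quantiles(values, n=10, method="inclusive")[0]

print(q1, q2, q3, iqr, p10)  # 4.5 6.0 7.5 3.0 3.6
```

Other methods (including the module's default, "exclusive") can give slightly different quartiles for small datasets, so it helps to state which convention was used.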

Coefficient of Variation (CV)

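Measures relative variation: CV = SD / mean, often expressed as a percentage. Because it has no units, it can compare variation across datasets measured in different units or on different scales. Low CV = low variation relative to the mean; high CV = high variation.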
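
A minimal sketch of the CV as SD divided by the mean, expressed as a percentage. The function name and the data are hypothetical, and the population SD is assumed here; a sample SD could be substituted.

```python
import statistics

def coefficient_of_variation(values):
    """Population SD divided by the mean, as a percentage (assumed convention)."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100

# Hypothetical measurements in different units.
heights_cm = [160, 170, 180]
weights_kg = [50, 70, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"heights CV: {cv_heights:.1f}%")
print(f"weights CV: {cv_weights:.1f}%")
```

In this made-up example the weights vary more relative to their mean than the heights do, a comparison the raw SDs could not make directly because the units differ.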

Skewness and Kurtosis

  • Skewness: A measure of asymmetry.
    • = 0: Symmetric.
    • < 0: Negatively skewed.
    • > 0: Positively skewed.
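  • Kurtosis: A measure of peakedness and tail weight, reported here as excess kurtosis (kurtosis minus 3), so that a normal distribution scores 0.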
    • = 0: Mesokurtic (normal).
    • < 0: Platykurtic (flat, spread out).
    • > 0: Leptokurtic (tall, thin).
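
A sketch of how the two shape measures can be computed from central moments. This assumes the simple moment-based (population) formulas; spreadsheet functions such as Excel's SKEW() and KURT() apply sample-size adjustments, so their results differ somewhat on small datasets. Function names and data are hypothetical.

```python
import statistics

def central_moment(values, k):
    """Average of (x - mean)**k over all values."""
    mean = statistics.fmean(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by the SD cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by the variance squared, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # 0 for perfectly symmetric data
print(skewness(right_tailed))      # positive: long right tail
print(excess_kurtosis(symmetric))  # negative: flatter than normal
```

The evenly spaced data come out platykurtic (excess kurtosis below 0), while the single large value in the second dataset produces positive skew.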

Testing for Bell-shaped (Normal) Distributions

  • Symmetrical.
  • Not skewed.
  • Unimodal.
  • Visual check: a histogram of roughly normal data shows a single peak and symmetrical tails.

The Empirical Rule

  • For normal distributions:
    • ~68% of values lie within ±1σ of the mean.
    • ~95% of values lie within ±2σ of the mean.
    • ~99.7% of values lie within ±3σ of the mean.
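
The three percentages can be checked against the normal distribution itself. This sketch uses NormalDist from the standard statistics module; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value falls within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {share_within(k):.4f}")
```

The results are roughly 0.683, 0.954, and 0.997, which the rule rounds to 68%, 95%, and 99.7%.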

Chebyshev’s Theorem

  • For any distribution:
    • At least (1 – 1/k²) of values lie within kσ of the mean (k > 1). Example: for k = 2, at least 1 – 1/4 = 75% of values lie within ±2σ.
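
A sketch comparing the Chebyshev bound with the share actually observed in a skewed dataset. The data and function names are hypothetical, and the population SD is used.

```python
import statistics

def chebyshev_bound(k):
    """Minimum share of values within k SDs of the mean, for k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

def share_within(values, k):
    """Observed share of values within k population SDs of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)

skewed = [1, 1, 1, 2, 2, 3, 4, 6, 9, 30]  # illustrative, strongly right-skewed

for k in (2, 3):
    print(f"k={k}: bound={chebyshev_bound(k):.3f}  observed={share_within(skewed, k):.3f}")
```

The observed share is never below the bound, even for data this far from normal; the bound is simply less precise than the Empirical Rule, which applies only to bell-shaped data.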

Summary

  • Central Tendency: Mean, median, mode.
  • Spread: Range, SD, variance, CV, percentiles, IQR.
  • Shape: Skewness, kurtosis.
  • Testing Normality: Use graphs and shape measures.
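  • Use Excel for descriptive statistics:
    • Functions: AVERAGE(), MEDIAN(), MODE.SNGL() (or the legacy MODE()), VAR.P(), STDEV.P(), PERCENTILE.INC(), QUARTILE.INC().
    • Note: VAR.P() and STDEV.P() treat the data as the whole population; use VAR.S() and STDEV.S() when the data are a sample.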