Understanding Statistical Analysis: Methods and Measures

Statistics

Statistics is the set of scientific methods used to collect, organize, summarize, present, and analyze data from a set of observations. These methods allow us to draw valid conclusions and make sound decisions based on that analysis.

Population: The complete collection of measurements of any characteristic of interest. It may be finite or so large that it is virtually infinite.

Sample: A representative subset selected from a population.

Univariate analysis: The analysis and processing of a single variable. Its aim is to analyze and synthesize the information contained in statistical data by means of tables, graphs, and numerical summary measures.

Unit: Anything that can be studied, e.g., a person, a machine, etc.

Variable: A characteristic or property that can be observed in each unit of the population.

Qualitative (attributes) – 2 types: Allows data to be tabulated or organized in tables that summarize the qualities or attributes.

  • Ordinal: Level of study, academic performance
  • Nominal: Eye color

Quantitative – 2 types:

  • Discrete: Number of family members (can be grouped into individual classes or class intervals).
  • Continuous: Length of a table (can be grouped only into class intervals).

Xi: Value of a discrete variable

Number of observations: Absolute frequency of each value or class

Class interval: Used to group a continuous variable

Relative frequency: The proportion (or percentage) of each case or category

fi: Relative frequency

N: Total data

ni: Number of observations of class

fi = ni / N

Cumulative Relative Frequency (%)

Fi = Ni / N

Ni = Cumulative absolute frequency

N = Total data

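k: Number of classes (class intervals)

To calculate fi%:

fi% = (ni / N) x 100. For example, a class with 2 observations out of a total of 28 gives (2 / 28) x 100 ≈ 7.14%.

Frequencies of qualitative variables should not be accumulated (for nominal variables the categories have no order, so a cumulative total has no meaning).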
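The frequency definitions above (ni, fi, Ni, Fi) can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and the sample of family sizes are hypothetical and do not come from the text.

```python
from collections import Counter

def frequency_table(data):
    """Return rows of (xi, ni, fi, Ni, Fi) for a discrete variable."""
    counts = Counter(data)
    total = len(data)              # N: total data
    rows = []
    cumulative = 0                 # Ni: cumulative absolute frequency
    for value in sorted(counts):
        ni = counts[value]         # absolute frequency of the class
        cumulative += ni
        rows.append((value, ni, ni / total, cumulative, cumulative / total))
    return rows

# Hypothetical sample: number of family members in 8 households
members = [2, 3, 3, 4, 4, 4, 5, 2]
for xi, ni, fi, Ni, Fi in frequency_table(members):
    print(f"{xi:>2} {ni:>2} {fi * 100:6.2f}% {Ni:>2} {Fi * 100:6.2f}%")
```

The last row always has Ni = N and Fi = 1 (100%), which is a quick check that no observation was lost.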

Range: R = Xmax – Xmin

Number of Intervals

k = 1 + 3.3 * log10(N) (Sturges' rule; the logarithm is base 10, and k is rounded to a whole number)

Class Width (Amplitude)

A = R / k

Note: The class width must be expressed with the same number of decimal places as the data.

If the data have 2 decimal places, A also has 2 decimal places (e.g., A = 0.05).

Lower limit of the first interval:

LI = Xmin

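Upper limit of the last interval

UT (work unit): the smallest increment the data can show, set by the number of decimal places:

  • 0 decimals: UT = 1
  • 1 decimal: UT = 0.1
  • 2 decimals: UT = 0.01
  • 3 decimals: UT = 0.001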

Ls(k) = LI1 + (A * k) – UT, where UT is the work unit

For example, with k = 5: Ls(5) = 1.58 + (0.05 * 5) – 0.01 = 1.82

This value must reach or exceed the maximum value (1.82 ≥ Xmax), so that the intervals cover all the data.

If it does not, increase the class width.

For example, from A = 0.05 to A = 0.06.
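The procedure for building class intervals can be sketched as follows. This is a minimal illustration under stated assumptions: the function `class_intervals` and the height data are hypothetical, k is rounded to the nearest whole number, the width is rounded up to a whole number of work units, and the last upper limit is required to be at least Xmax. The arithmetic is done in whole work units to avoid floating-point rounding problems.

```python
import math

def class_intervals(data, decimals):
    """Return (k, A, limits) following the steps described in the text."""
    scale = 10 ** decimals                       # 1 / UT
    values = [round(x * scale) for x in data]    # data in whole work units
    n = len(values)
    lo, hi = min(values), max(values)            # Xmin, Xmax
    k = round(1 + 3.3 * math.log10(n))           # number of intervals
    width = max(1, math.ceil((hi - lo) / k))     # A = R / k, rounded up
    # Ls(k) = LI1 + A * k - UT must reach Xmax; otherwise widen A
    while lo + width * k - 1 < hi:
        width += 1
    limits = []
    for i in range(k):
        lower = lo + i * width
        upper = lower + width - 1                # upper limit = next lower - UT
        limits.append((lower / scale, upper / scale))
    return k, width / scale, limits

# Hypothetical heights in meters, recorded with 2 decimal places
heights = [1.58, 1.60, 1.62, 1.65, 1.66, 1.70, 1.71, 1.75, 1.78, 1.80]
k, a, limits = class_intervals(heights, decimals=2)
print(k, a)
for lower, upper in limits:
    print(f"{lower:.2f} - {upper:.2f}")
```

With these ten values the sketch gives 4 intervals of width 0.06, the last one ending at 1.81, which covers the maximum of 1.80.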

Bar Graph

It is used in qualitative or quantitative discrete observations.

Each class is represented by a bar whose height equals the frequency of the class.

Horizontal axis: Classes are represented

Vertical axis: Absolute frequencies

Histogram

It is used in quantitative variables.

It is a collection of rectangles, each representing a group or class interval.

Their bases are equal to the class interval width, and the height is set so that the area of each rectangle is proportional to the frequency of its class.

Horizontal axis: Class boundaries, from the lower boundary “Fi” to the upper boundary “Fs” of each interval

Vertical axis: Absolute frequencies “ni”

Relative frequencies can also be plotted in histograms and in the other graphs.

Frequency Polygon

It is a line graph.

It is constructed with line segments joining the points plotted at the midpoint (class mark) of each interval, at the height of its frequency.

It is used to see the shape of the frequency distribution of the observations, in order to fit a suitable probability function.

Horizontal axis: Class marks “mi”

Vertical axis: Absolute frequencies “ni”

Ogive

It is a cumulative frequency polygon.

It starts at zero and ends at 100% (or at N when absolute frequencies are plotted).

The polygon starts at the lower boundary of the first interval, and at the upper boundary of each class a point is plotted at the height of its cumulative frequency.

Horizontal axis: Class boundaries, from the lower boundary “Fi” to the upper boundary “Fs” of each interval

Vertical axis: Cumulative absolute frequencies “Ni”
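The points of an ogive can be computed before plotting them. The sketch below is illustrative only: `ogive_points`, the boundaries, and the frequencies are hypothetical, and the cumulative frequency is expressed as a percentage so that the curve runs from 0 to 100.

```python
from itertools import accumulate

def ogive_points(boundaries, freqs):
    """Pair each class boundary with the cumulative percentage reached there.

    boundaries has one more element than freqs: the lower boundary of the
    first class followed by the upper boundary of every class.
    """
    total = sum(freqs)                        # N
    cumulative = [0, *accumulate(freqs)]      # 0, N1, N2, ..., N
    return [(b, c * 100 / total) for b, c in zip(boundaries, cumulative)]

# Hypothetical boundaries and absolute frequencies for 4 classes
boundaries = [1.575, 1.635, 1.695, 1.755, 1.815]
freqs = [3, 2, 3, 2]
for boundary, percent in ogive_points(boundaries, freqs):
    print(f"{boundary:.3f} -> {percent:5.1f}%")
```

Joining these points with straight segments gives the cumulative polygon.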

Circular Chart (Pie Chart)

It represents absolute frequencies, or relative frequencies as percentages, as sectors of a circle.

The number of degrees of the circle corresponding to each class is found by proportion: degrees = (ni / N) x 360.
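The proportion between frequencies and degrees can be sketched as follows. The function `pie_angles` and the eye-color counts are hypothetical and serve only to show the arithmetic.

```python
def pie_angles(frequencies):
    """Map each class to its sector angle: (ni / N) x 360 degrees."""
    total = sum(frequencies.values())         # N
    return {label: ni / total * 360 for label, ni in frequencies.items()}

# Hypothetical absolute frequencies for a nominal variable (N = 28)
eye_color = {"brown": 14, "blue": 7, "green": 7}
for label, degrees in pie_angles(eye_color).items():
    print(f"{label}: {degrees:.1f} degrees")
```

The angles always add up to 360 degrees, whatever the number of classes.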

Stem and Leaf Graph

It is a semi-graphical procedure for quantitative variables.

The digits are separated into two parts:

Stem: Defines a class and corresponds to a certain number of digits counted from left to right.

Leaf: The digit that follows the stem; any remaining digits are discarded. The number of leaves on a stem gives the absolute frequency of that class.

The representation of the data is performed using a column for the stems, sorted in ascending order and without repeating, and one for the corresponding leaves.
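A stem and leaf display can be sketched in a few lines. This illustration assumes whole-number data where the stem is every digit except the last and the leaf is the last digit; `stem_and_leaf` and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all digits but the last); leaf = last digit."""
    stems = defaultdict(list)
    for v in sorted(values):                  # sorting keeps stems and leaves ascending
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical ages
ages = [23, 38, 31, 25, 42, 38]
for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each stem appears once, and the length of its row of leaves is the absolute frequency of the class.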

Measures of Central Tendency

Mode

It is the category or score that occurs most often. It can be used with any level of measurement.

Median

It is the value that divides the distribution in half. That is, half the cases fall below the median, and half are above the median.

The median can be used with ordinal, interval, or ratio levels of measurement.

Mean

It is the arithmetic mean of the distribution.

It is the sum of all values divided by the number of cases. It applies only to interval or ratio measurements. For individual (ungrouped) data:

X̄ = (3 + 5 + 6) / 3 = 14 / 3 ≈ 4.67
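The three measures of central tendency can be checked with Python's standard `statistics` module. The data below are hypothetical, apart from the values 3, 5, and 6 used in the mean example.

```python
import statistics

values = [3, 5, 6]
mean = statistics.mean(values)        # (3 + 5 + 6) / 3
median = statistics.median(values)    # middle value of the sorted data
print(round(mean, 2), median)

# The mode needs repeated values; multimode returns every most frequent value
scores = [3, 5, 5, 6, 7]              # hypothetical scores
modes = statistics.multimode(scores)
print(modes)
```

`multimode` is used instead of `mode` so that a distribution with several equally frequent values reports all of them.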

Dispersion Measurement

These are measures of dispersion or variability of data from a series of values.

They represent the similarity or difference between individuals of a group in connection with some quantitative variable (age, income, schooling, etc.).

The main ones are:

  • Variance
  • Standard deviation
  • Index of dispersion

Variance: The average of the squared deviations of each value in a series from the arithmetic mean.

Standard Deviation: It is the square root of the variance.

Measures of Dispersion

Quantify the dispersion of data around the center of the data.

The most common are:

  • Range
  • Interquartile range
  • Variance
  • Standard deviation
  • Coefficient of variation

Variance

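It is the measure of dispersion most used in statistical applications.

Its formula depends on how the data are organized or grouped. Taking the variance as the average of the squared deviations from the arithmetic mean X̄, the result is obtained as follows:

Individual (ungrouped) data: S² = Σ (xi – X̄)² / N

Data grouped in individual classes: S² = Σ ni (xi – X̄)² / N

Data grouped in class intervals: S² = Σ ni (mi – X̄)² / N, where mi is the class mark of each interval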
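The three cases named above can be sketched in Python. The sketch assumes the population form of the variance (dividing by N), in line with the definition of variance as an average of squared deviations; a sample variance would divide by N – 1 instead. The function names and data are hypothetical.

```python
import math

def variance_ungrouped(values):
    """Individual data: average of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def variance_grouped(points, freqs):
    """Grouped data: points are the class values xi (individual classes)
    or the class marks mi (class intervals); freqs are the ni."""
    n = sum(freqs)
    mean = sum(p * f for p, f in zip(points, freqs)) / n
    return sum(f * (p - mean) ** 2 for p, f in zip(points, freqs)) / n

# Hypothetical data, first listed one by one and then grouped
raw = [2, 2, 3, 3, 4, 4, 4, 5]
var_raw = variance_ungrouped(raw)
var_grouped = variance_grouped([2, 3, 4, 5], [2, 2, 3, 1])
print(var_raw, var_grouped, math.sqrt(var_raw))   # last value: standard deviation
```

Grouping in individual classes loses no information, so both calls give the same variance. With class intervals the class marks only approximate the original values, so the result is an approximation.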

Standard Deviation

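It is the square root of the variance and measures the typical deviation of the original data from their arithmetic mean.

It is denoted by S. For data that follow an approximately normal (bell-shaped) distribution:

  • The interval X̄ ± 1S contains approximately 68% of the observations.
  • The interval X̄ ± 2S contains approximately 95% of the observations.
  • The interval X̄ ± 3S contains approximately 100% (about 99.7%) of the observations.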

Coefficient of Variation

It gives the degree of variability of the data as a percentage of the mean, CV = (S / X̄) x 100, and is used to compare two distributions that may have different units of measure.

Rule of Thumb:

  • If the CV is ≤ 35%, the set is homogeneous.
  • If the CV is > 35%, the set is heterogeneous.
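The coefficient of variation and the rule of thumb can be sketched as follows. The sketch assumes the population standard deviation (`statistics.pstdev`) and a mean different from zero; the function names and the data are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: standard deviation divided by the mean, times 100."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

def classify(values, threshold=35):
    """Apply the rule of thumb: CV <= 35% means a homogeneous set."""
    cv = coefficient_of_variation(values)
    return "homogeneous" if cv <= threshold else "heterogeneous"

# Hypothetical data sets
incomes = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5, standard deviation 2
heights = [158, 160, 162, 165, 170]
print(coefficient_of_variation(incomes), classify(incomes))
print(round(coefficient_of_variation(heights), 2), classify(heights))
```

Because the CV has no units, the two sets can be compared directly even though they measure different things.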

Coefficient of Bias

Bias (skewness): The degree of asymmetry of a frequency distribution. It is determined by the coefficient of bias.

Rules:

  • If the coefficient of bias is positive (+), the distribution is positively skewed (longer tail to the right).
  • If the coefficient of bias is negative (–), the distribution is negatively skewed (longer tail to the left).
  • If the coefficient of bias is zero, then the distribution is symmetrical.
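The text does not state which formula it uses for the coefficient of bias. As an illustration only, the sketch below assumes one common choice, Pearson's second coefficient of skewness, 3 x (mean – median) / S, with the population standard deviation. The function name and data are hypothetical, and the data must not all be equal (S would be zero).

```python
import statistics

def pearson_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.pstdev(values)             # assumes S > 0
    return 3 * (mean - median) / s

# Hypothetical data sets illustrating the three rules
print(pearson_skewness([1, 2, 2, 3, 10]))     # long right tail: positive
print(pearson_skewness([1, 8, 9, 9, 10]))     # long left tail: negative
print(pearson_skewness([1, 2, 3, 4, 5]))      # symmetrical: zero
```

Other coefficients (for example, one based on the mode or on third moments) follow the same sign rules but give different numerical values.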

Quantile

A quantile partitions the area under the frequency polygon into more than two parts. The usual divisions are into four, ten, and one hundred parts.

Quartiles: Divide the frequency distribution into 4 equal parts.

Deciles: Divide the frequency distribution into 10 equal parts.

Percentiles: Divide the frequency distribution into 100 equal parts.
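Quartiles, deciles, and percentiles can be obtained with `statistics.quantiles` from Python's standard library (available from Python 3.8). The data are hypothetical. The function returns the cut points between the parts, so n parts give n – 1 values, and its default interpolation method is only one of several conventions in use.

```python
import statistics

data = list(range(1, 101))                       # hypothetical data: 1, 2, ..., 100

quartiles = statistics.quantiles(data, n=4)      # 3 cut points, 4 parts
deciles = statistics.quantiles(data, n=10)       # 9 cut points, 10 parts
percentiles = statistics.quantiles(data, n=100)  # 99 cut points, 100 parts

print(quartiles)
print(len(deciles), len(percentiles))
```

The second quartile, the fifth decile, and the fiftieth percentile all coincide with the median.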