Understanding Statistical Analysis: Methods and Measures
Statistics
Statistics are the scientific methods through which we can collect, organize, summarize, present, and analyze data on a set of observations. They allow us to draw valid conclusions and make logical decisions based on such analysis.
Population: A collection of measurements, either a finite number or a large, virtually infinite, amount of data about any characteristic of interest.
Sample: A representative subset selected from a population.
Univariate analysis: The analysis and processing of a single variable, which aims to summarize the information contained in statistical data by means of tables, graphs, and numerical summary measures.
Unit: Anything that we can study, e.g., a person, a machine, etc.
Variable: It is a characteristic or property associated with an observable unit of the population.
Qualitative (attributes) – 2 types: Allows data to be tabulated or organized in tables that summarize the qualities or attributes.
- Ordinal: Level of study, academic performance
- Nominal: Eye color
Quantitative – 2 types:
- Discrete: Number of family members (can be grouped in individual classes or in class intervals).
- Continuous: Length of a table (can be grouped only in class intervals).
Xi: Discrete variable
Number of observations: Absolute frequency of each value or class
Class interval: Continuous variable
Relative frequencies: (Proportion or percentage of each case or category)
fi: Relative frequency
N: Total number of observations
ni: Number of observations in the class (absolute frequency)
fi = ni / N
Relative cumulative frequency (%)
Fi = Ni / N
Ni: Cumulative absolute frequency
N: Total number of observations
k: Number of classes
To calculate fi%:
fi% = ni (number of observations in the class) / N (total) x 100; e.g., 2 / 28 x 100 ≈ 7.14%
The frequencies of qualitative variables should not be accumulated.
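The frequency definitions above can be sketched in Python. This is only an illustration: the function name `frequency_table` and the family-size data are hypothetical, not taken from these notes.

```python
from collections import Counter

def frequency_table(values):
    """Return rows (xi, ni, fi, Ni, Fi) for a discrete variable."""
    counts = Counter(values)
    total = len(values)            # N: total number of observations
    rows = []
    cumulative = 0                 # Ni: running cumulative absolute frequency
    for xi in sorted(counts):
        ni = counts[xi]            # absolute frequency of the class
        cumulative += ni
        rows.append((xi, ni, ni / total, cumulative, cumulative / total))
    return rows

# Hypothetical data: number of family members in 10 households
family_sizes = [2, 3, 3, 4, 4, 4, 5, 5, 6, 4]
for xi, ni, fi, Ni, Fi in frequency_table(family_sizes):
    print(f"{xi:>2} {ni:>2} {fi:6.1%} {Ni:>2} {Fi:6.1%}")
```

The last row always has Ni = N and Fi = 100%, which is a quick check that the table is complete.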
Range: R = Xmax – Xmin
Number of Intervals
k = 1 + 3.3 * log10(N) (Sturges' rule, rounded to a whole number)
Amplitude (class width)
A = R / k
Note: The amplitude must be expressed with the same number of decimal places as the data.
If the data have 2 decimal places, A also has 2 decimal places (0.00).
Lower limit of the first interval:
LI1 = Xmin
Upper limit of the last interval:
LS(k) = LI1 + (A * k) – UT
UT (work unit) depends on the number of decimal places in the data:
- 0 decimals: UT = 1
- 1 decimal: UT = 0.1
- 2 decimals: UT = 0.01
- 3 decimals: UT = 0.001
Example with k = 5: LS(5) = 1.58 + (0.05 * 5) – 0.01 = 1.82 > Xmax
This value must be greater than the maximum value.
If it is not, increase the amplitude by one work unit (e.g., A = 0.05 becomes 0.06).
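The interval-building steps above can be sketched as follows. The function name `class_intervals`, the choice to round the amplitude up, and the height data are assumptions made for illustration. The data were chosen so that the result matches the 1.58 / 0.05 / 1.82 example.

```python
import math

def class_intervals(data, decimals):
    """Build class intervals following the steps in these notes (a sketch)."""
    ut = 10 ** -decimals                        # work unit (UT): 1, 0.1, 0.01, ...
    x_min, x_max = min(data), max(data)
    data_range = x_max - x_min                  # R = Xmax - Xmin
    k = round(1 + 3.3 * math.log10(len(data)))  # number of intervals
    # Assumption: round the amplitude up to the precision of the data
    amplitude = round(math.ceil(round(data_range / k / ut, 6)) * ut, decimals)
    # Increase A by one work unit until the last upper limit exceeds Xmax
    while round(x_min + amplitude * k - ut, decimals) <= x_max:
        amplitude = round(amplitude + ut, decimals)
    intervals = []
    for i in range(k):
        lower = round(x_min + amplitude * i, decimals)
        upper = round(lower + amplitude - ut, decimals)
        intervals.append((lower, upper))
    return k, amplitude, intervals

# Hypothetical data: 20 heights in metres, recorded with 2 decimal places
heights = [1.58, 1.60, 1.62, 1.63, 1.65, 1.66, 1.67, 1.68, 1.69, 1.70,
           1.70, 1.71, 1.72, 1.73, 1.74, 1.75, 1.76, 1.78, 1.79, 1.80]
k, amplitude, intervals = class_intervals(heights, decimals=2)
print(k, amplitude, intervals)
```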
Bar Graph
It is used for qualitative or discrete quantitative observations.
Each class is represented by a bar whose height equals the frequency of the class.
Horizontal axis: Classes
Vertical axis: Absolute frequencies
Histogram
It is used for quantitative variables.
It is a collection of rectangles, each representing a class interval.
Their bases are equal to the interval width, and the height is determined so that the area is proportional to the frequency of each class.
Horizontal axis: Class boundaries “Fi – Fs” (lower and upper)
Vertical axis: Absolute frequencies “ni”
Relative frequencies can also be plotted in histograms and other graphs.
Frequency Polygon
It is a line graph.
It is constructed with line segments joining the midpoints (class marks) of adjacent intervals.
It is used to identify the shape of the frequency distribution of the observations, so that a suitable probability function can be fitted to it.
Horizontal axis: Class marks “mi”
Vertical axis: Absolute frequencies “ni”
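The points of a frequency polygon (class mark, absolute frequency) can be computed as in this sketch. The function name `polygon_points`, the intervals, and the data are hypothetical, and the intervals are treated as closed on both ends.

```python
def polygon_points(data, intervals):
    """Return (class mark, absolute frequency) pairs for closed intervals."""
    points = []
    for lower, upper in intervals:
        mark = (lower + upper) / 2                        # class mark mi
        ni = sum(1 for x in data if lower <= x <= upper)  # absolute frequency ni
        points.append((round(mark, 4), ni))
    return points

# Hypothetical intervals and two-decimal measurements
intervals = [(1.58, 1.62), (1.63, 1.67), (1.68, 1.72), (1.73, 1.77), (1.78, 1.82)]
data = [1.58, 1.60, 1.64, 1.66, 1.67, 1.70, 1.71, 1.75, 1.79, 1.80]
print(polygon_points(data, intervals))
```

Joining these points with line segments gives the polygon. The same counts are the bar heights of the histogram when all intervals have equal width.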
Ogive
It is a cumulative frequency polygon.
It starts at zero and ends at the total N (or at 100% when relative cumulative frequencies are used).
The polygon starts at the lower boundary of the first interval; at the upper boundary of each class it marks that class's cumulative frequency.
Horizontal axis: Class boundaries “Fi – Fs” (lower and upper)
Vertical axis: Cumulative absolute frequencies “Ni”
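A minimal sketch of the ogive points, using cumulative percentages. The function name `ogive_points`, the boundaries, and the frequencies are illustrative assumptions.

```python
from itertools import accumulate

def ogive_points(lower_boundary, upper_boundaries, frequencies):
    """Return (boundary, cumulative %) points for an ogive."""
    total = sum(frequencies)
    points = [(lower_boundary, 0.0)]          # the ogive starts at zero
    # accumulate() yields the cumulative absolute frequency Ni of each class
    for boundary, Ni in zip(upper_boundaries, accumulate(frequencies)):
        points.append((boundary, 100 * Ni / total))
    return points

# Hypothetical boundaries and absolute frequencies for five classes
points = ogive_points(1.575, [1.625, 1.675, 1.725, 1.775, 1.825], [2, 3, 2, 1, 2])
print(points)
```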
Circular Chart
It represents absolute frequencies, or relative frequencies as percentages, as sectors of a circle.
The number of degrees corresponding to each class is found by proportion: degrees = 360 x ni / N.
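The proportion can be applied directly, as in this short sketch. The function name `pie_angles` and the eye-color counts are made up for illustration.

```python
def pie_angles(frequencies):
    """Map each category to its sector angle in degrees (360 * ni / N)."""
    total = sum(frequencies.values())
    return {category: 360 * ni / total for category, ni in frequencies.items()}

# Hypothetical absolute frequencies of a nominal variable
eye_colors = {"brown": 12, "blue": 5, "green": 3}
angles = pie_angles(eye_colors)
print(angles)
```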
Stem and Leaf Graph
It is a semi-graphical procedure for quantitative variables.
The digits of each value are separated into two parts:
Stem: Defines a class and corresponds to a certain number of digits counted from left to right.
Leaf: The next digit after the stem, discarding any remaining digits. The number of leaves on a stem gives the absolute frequency of the class.
The data are displayed using one column for the stems, sorted in ascending order and without repetition, and one column for the corresponding leaves.
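A minimal sketch for non-negative two-digit integers, where the stem is the tens digit and the leaf is the units digit. The function name `stem_and_leaf` and the ages are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (tens and above) and leaf (units)."""
    stems = defaultdict(list)
    for value in sorted(values):          # sorting keeps stems and leaves ascending
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

# Hypothetical ages
ages = [23, 25, 31, 34, 34, 38, 42, 47, 47, 49, 51]
for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```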
Measures of Central Tendency
Mode
It is the category or score that occurs most often. It can be used with any level of measurement.
Median
It is the value that divides the distribution in half. That is, half the cases fall below the median, and half are above the median.
The median can be used with ordinal, interval, or ratio levels of measurement.
Mean
It is the arithmetic mean of the distribution.
It is the sum of all values divided by the number of cases. It applies only to interval or ratio levels of measurement (individual classes).
X̄ = (3 + 5 + 6) / 3 = 14 / 3 ≈ 4.67
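The three measures are available in Python's standard `statistics` module. The sketch below repeats the mean example and uses a second, hypothetical list of scores for the mode and median.

```python
from statistics import mean, median, mode

values = [3, 5, 6]
average = mean(values)            # (3 + 5 + 6) / 3
print(round(average, 2))

# Hypothetical scores with a repeated value, so the mode is well defined
scores = [3, 5, 5, 6, 8]
print(mode(scores), median(scores), mean(scores))
```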
Dispersion Measurement
These are measures of dispersion or variability of data from a series of values.
They represent the similarity or difference between individuals of a group in connection with some quantitative variable (age, income, schooling, etc.).
The main ones are:
- Variance
- Standard deviation
- Index of dispersion
Variance: The average of the squared deviations of each value in a series from the arithmetic mean.
Standard Deviation: The square root of the variance.
Measures of Dispersion
They quantify the dispersion of the data around their center.
The most common are:
- Range
- Interquartile range
- Variance
- Standard deviation
- Coefficient of variation
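The measures listed above can be computed with the standard `statistics` module, as in this sketch. The scores are hypothetical, and the population forms (`pvariance`, `pstdev`, dividing by N) are an assumption. Quartile conventions also differ between textbooks and software, so the interquartile range may not match a hand calculation.

```python
from statistics import mean, pstdev, pvariance, quantiles

# Hypothetical scores
scores = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(scores) - min(scores)   # range
q1, _, q3 = quantiles(scores, n=4)       # quartiles (default "exclusive" method)
iqr = q3 - q1                            # interquartile range
variance = pvariance(scores)             # mean squared deviation (divides by N)
std_dev = pstdev(scores)                 # square root of the variance
cv = 100 * std_dev / mean(scores)        # coefficient of variation, in %

print(data_range, iqr, variance, std_dev, cv)
```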
Variance
It is the most useful measure of dispersion in statistical applications.
Its formula depends on how the data are ordered or grouped:
- Individual (ungrouped) data
- Data grouped in individual classes
- Data grouped in class intervals
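The formulas for the three cases are missing from these notes. The standard forms are sketched below, assuming the divisor N implied by the "average of squared deviations" definition; a sample variance would divide by N – 1 instead. Here x̄ is the arithmetic mean, ni the absolute frequency, mi the class mark, and k the number of classes.

```latex
% Individual (ungrouped) data
s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2

% Data grouped in individual classes
s^2 = \frac{1}{N} \sum_{i=1}^{k} n_i \, (x_i - \bar{x})^2

% Data grouped in class intervals (m_i is the class mark)
s^2 = \frac{1}{N} \sum_{i=1}^{k} n_i \, (m_i - \bar{x})^2
```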
Standard Deviation
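It is defined as the typical deviation of the original data from their arithmetic mean, computed as the square root of the variance.
It is denoted by s (or σ for a population). For approximately bell-shaped distributions:
- The mean ± 1 standard deviation contains approximately 68% of the observations.
- The mean ± 2 standard deviations contains approximately 95% of the observations.
- The mean ± 3 standard deviations contains approximately 100% (about 99.7%) of the observations.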
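The share of observations within k standard deviations of the mean can be checked on any data set, as in this sketch. The function name `share_within` and the scores are hypothetical. With so few values, which are not bell-shaped, the shares will not match the 68% / 95% figures closely.

```python
from statistics import mean, pstdev

def share_within(values, k):
    """Fraction of values within k standard deviations of the mean."""
    center, spread = mean(values), pstdev(values)
    inside = [x for x in values if center - k * spread <= x <= center + k * spread]
    return len(inside) / len(values)

# Hypothetical scores (mean 5, standard deviation 2)
scores = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1, 2, 3):
    print(k, share_within(scores, k))
```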
Coefficient of Variation
It gives the degree (as a percentage) of variability of the data relative to the mean, CV = (standard deviation / mean) x 100, and is used to compare two distributions that may have different units of measure.
Rule of Thumb:
- If the CV is ≤ 35%, the set is homogeneous.
- If the CV is > 35%, the set is heterogeneous.
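The rule of thumb can be sketched as below. The function names, the data, and the use of the population standard deviation are assumptions; the 35% threshold comes from the rule above.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV in percent: standard deviation relative to the mean."""
    return 100 * pstdev(values) / mean(values)

def classify(values, threshold=35.0):
    """Apply the rule of thumb: CV <= threshold means homogeneous."""
    cv = coefficient_of_variation(values)
    return "homogeneous" if cv <= threshold else "heterogeneous"

# Hypothetical data sets in different units
scores = [2, 4, 4, 4, 5, 5, 7, 9]          # points
heights = [160, 165, 170, 175, 180]        # centimetres
print(coefficient_of_variation(scores), classify(scores))
print(coefficient_of_variation(heights), classify(heights))
```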
Coefficient of Skewness
Skewness: Degree of asymmetry of a frequency distribution. It is measured by a coefficient of skewness.
Rules:
- If the coefficient of skewness is positive (+), the distribution is positively skewed.
- If the coefficient of skewness is negative (-), the distribution is negatively skewed.
- If the coefficient of skewness is zero, the distribution is symmetrical.
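The formula for the coefficient is missing from these notes. As an assumption, the sketch below uses one common choice, Pearson's second coefficient, 3 x (mean – median) / standard deviation. Other definitions exist and may be the one intended. The data sets are hypothetical.

```python
from statistics import mean, median, pstdev

def pearson_skewness(values):
    """Pearson's second skewness coefficient (an assumed choice of formula)."""
    return 3 * (mean(values) - median(values)) / pstdev(values)

right_skewed = [2, 4, 4, 4, 5, 5, 7, 9]   # mean above the median
symmetric = [1, 2, 3, 4, 5]               # mean equals the median
left_skewed = [1, 5, 5, 6, 7]             # mean below the median

for name, data in [("right", right_skewed), ("symmetric", symmetric), ("left", left_skewed)]:
    print(name, round(pearson_skewness(data), 3))
```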
Quantile
A quantile partitions the area under the frequency polygon into more than two parts; the usual choices are four, ten, and one hundred parts.
Quartiles: Divide the frequency distribution into 4 equal parts.
Deciles: Divide the frequency distribution into 10 equal parts.
Percentiles: Divide the frequency distribution into 100 equal parts.
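Cut points for the three kinds of quantile can be obtained with `statistics.quantiles`, as in this sketch. The data (the integers 1 to 99) are hypothetical. The function's default interpolation method is only one of several conventions, so results on other data may differ slightly from a hand calculation on a grouped table.

```python
from statistics import quantiles

# Hypothetical data: the integers 1..99
data = list(range(1, 100))

quartiles = quantiles(data, n=4)       # 3 cut points -> 4 equal parts
deciles = quantiles(data, n=10)        # 9 cut points -> 10 equal parts
percentiles = quantiles(data, n=100)   # 99 cut points -> 100 equal parts

print(quartiles)
print(deciles)
print(len(percentiles))
```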