Understanding Data Dispersion: Standard Deviation & More
How can we tell if data is dispersed or not?
- The standard deviation has no universal interpretation: whether a given value counts as large or small depends on the context and on the criterion we choose.
- For example, in a sample of screws for technical equipment, where we measure the size of each screw, we expect a dispersion close to zero. A larger dispersion would mean that many screws have to be considered defective.
- On the other hand, in a sample of a company's workers and their salaries, we should expect high dispersion, since the salaries of the CEOs will be much higher than the salaries of the base workers.
Pearson’s Coefficient of Variation (CV)
- Pearson’s coefficient of variation is the standard deviation divided by the absolute value of the mean. Unlike the variance and the standard deviation, it has no units, which makes it very useful for comparing dispersion between two samples.
- We can interpret the CV as the dispersion expressed as a fraction of the mean (or, multiplied by 100, as a percentage).
- As a rule of thumb, if the CV is below 0.3, we consider that the data is not very dispersed; if it is above 0.3, the data is considered dispersed. However, this is not a strict rule.
- If the mean is close to zero, the CV is not very useful: dividing by a near-zero mean makes it arbitrarily large and unstable.
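As a minimal sketch of how the CV lets us compare samples measured in different units, the following computes it for two made-up samples that echo the screw and salary examples. The data values and the function name coefficient_of_variation are illustrative assumptions, and the sample standard deviation (dividing by n - 1) is one possible choice; the population version would work just as well.

```python
import statistics


def coefficient_of_variation(sample):
    """Return the sample standard deviation divided by the absolute mean."""
    mean = statistics.mean(sample)
    if mean == 0:
        # The CV is undefined for a zero mean (and unreliable near zero).
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(sample) / abs(mean)


# Hypothetical data, invented purely for illustration.
screw_lengths_mm = [20.0, 20.1, 19.9, 20.0, 20.2, 19.8]
salaries_eur = [1500, 1600, 1550, 1700, 2500, 12000]

cv_screws = coefficient_of_variation(screw_lengths_mm)
cv_salaries = coefficient_of_variation(salaries_eur)

print(f"CV screws:   {cv_screws:.4f}")
print(f"CV salaries: {cv_salaries:.4f}")
```

With these invented numbers the screws fall far below the 0.3 rule of thumb and the salaries far above it, even though millimetres and euros cannot be compared directly.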
Chebyshev’s Inequality
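There is a particular distribution of data called the normal distribution, which is very important (we’ll learn more about it at the end of the course). Therefore, whenever we analyze a variable, we want to know whether the extracted data follows a normal distribution.
Chebyshev’s Inequality states that, for any data set and any k > 1, at least 1 - 1/k^2 of the data lies within k standard deviations of the mean. Because it holds for every distribution, it cannot confirm normality by itself. It does provide a first check, though: we can compare the proportion of data actually observed within 1, 2 and 3 standard deviations of the mean with the proportions expected under a normal distribution (about 68%, 95% and 99.7%).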
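Chebyshev’s Inequality guarantees that, for any k > 1, at least 1 - 1/k^2 of the data lies within k standard deviations of the mean. The sketch below compares that guaranteed minimum with the fraction actually observed in a small made-up sample. The helper names fraction_within and chebyshev_bound are illustrative, and the population standard deviation is assumed.

```python
import statistics


def fraction_within(sample, k):
    """Fraction of observations within k standard deviations of the mean."""
    mean = statistics.mean(sample)
    sd = statistics.pstdev(sample)  # population standard deviation (assumption)
    inside = [x for x in sample if abs(x - mean) <= k * sd]
    return len(inside) / len(sample)


def chebyshev_bound(k):
    """Minimum fraction guaranteed by Chebyshev's inequality (k > 1)."""
    return 1 - 1 / k**2


# Hypothetical data, invented purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

for k in (1.5, 2, 3):
    observed = fraction_within(data, k)
    bound = chebyshev_bound(k)
    print(f"k={k}: observed {observed:.3f}, Chebyshev minimum {bound:.3f}")
```

The observed fractions can never fall below the Chebyshev minimum. To get a rough idea of normality, the same observed fractions can be compared with the approximately 68%, 95% and 99.7% expected within 1, 2 and 3 standard deviations for normally distributed data.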
Outliers
An outlier is an individual datum that falls outside of the overall pattern of the data (an extreme value). Such values are problematic because they tend to distort the analysis of the whole data set.
An outlier could appear either because of the variability of the measurement or because of some experimental error. In the first case, the existence of the outlier is fully justified, while in the other case, we want to detect and eliminate such data.
Ultimately, whether to delete an outlier is a judgment call that depends on your own criteria!
Boxplot
A boxplot is a compact graph for representing the distribution of a data set. It is based on the quartiles and the analysis of outliers that we have just seen.
A boxplot consists of:
- A central box, with the bottom side marking the first quartile (Q1) and the top side marking the third quartile (Q3). The box is divided in two by a segment marking the second quartile (Q2).
- A bottom whisker from Q1 to the lower end of the admissible range and a top whisker from Q3 to the upper end of the admissible range.
- Outliers, if there are any, marked as dots below or above the whiskers.
- The width of the box carries no information; it is chosen only for readability.
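The elements of a boxplot can be computed by hand before drawing anything. The sketch below assumes the common convention that the admissible range is [Q1 - 1.5 IQR, Q3 + 1.5 IQR], where IQR = Q3 - Q1; this convention, the function name boxplot_summary and the data are assumptions made for illustration. Quartiles can be computed with several slightly different methods, and the "inclusive" method of Python's statistics.quantiles is just one of them.

```python
import statistics


def boxplot_summary(sample, fence_factor=1.5):
    """Return the quartiles, admissible range and outliers of a sample."""
    q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: admissible range = quartiles +/- 1.5 * IQR.
    lower_end = q1 - fence_factor * iqr
    upper_end = q3 + fence_factor * iqr
    outliers = sorted(x for x in sample if x < lower_end or x > upper_end)
    return {
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "lower_end": lower_end,
        "upper_end": upper_end,
        "outliers": outliers,
    }


# Hypothetical data, invented purely for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = boxplot_summary(data)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Many plotting libraries draw each whisker only as far as the most extreme observation that still lies inside the admissible range, so a plotted whisker may be shorter than the range computed here.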
Data Symmetry and Skewness
In a symmetric data set, the mean and the median agree. That is, the mean divides the sample into two halves of the same size.
- The difference between the mean and the median is a first indicator of (a)symmetry, although not a very good one, since the two can also coincide in a data set that is not symmetric.
- A data set is skewed to the right if the majority of the data is concentrated to the left of the mean (they are smaller than the mean), with a long tail extending to the right.
- A data set is skewed to the left if the majority of the data is concentrated to the right of the mean (they are greater than the mean), with a long tail extending to the left.
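To make the direction of the skew concrete, the sketch below compares the mean with the median and also computes the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). That coefficient is a common choice and is not taken from the text above; the function name skewness and the data are likewise illustrative assumptions.

```python
import statistics


def skewness(sample):
    """Moment coefficient of skewness, using population moments."""
    n = len(sample)
    mean = statistics.mean(sample)
    m2 = sum((x - mean) ** 2 for x in sample) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in sample) / n  # third central moment
    return m3 / m2**1.5


# Hypothetical data, invented purely for illustration.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
left_skewed = [-x for x in right_skewed]  # mirror image of right_skewed

for name, sample in [
    ("symmetric", symmetric),
    ("right_skewed", right_skewed),
    ("left_skewed", left_skewed),
]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{name}: mean={mean:.3f}, median={median:.3f}, "
          f"skewness={skewness(sample):.3f}")
```

In the right-skewed sample, most values lie below the mean and the coefficient is positive; the mirrored sample gives the opposite sign.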
Kurtosis
The kurtosis in a sample measures how concentrated the data are in the tails of its distribution.
- A sample whose kurtosis coefficient is close to zero (tails comparable to those of a normal distribution) is called mesokurtic.
- A sample is leptokurtic if it has thick tails.
- A sample with thin tails is called platykurtic.
The Kurtosis Coefficient
The kurtosis coefficient g3 measures the concentration of data in tails by comparison to the normal distribution. Thus:
- A mesokurtic sample (with g3 close to zero) has tails comparable to those of a normal distribution. This is consistent with normality, but it does not prove it.
- A leptokurtic sample (with g3 > 0) has more data concentrated in the tails than a normal distribution would (thick tails).
- A platykurtic sample (with g3 < 0) has less data concentrated in the tails than a normal distribution would (thin tails).
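The classification above can be tried on small samples. The sketch below assumes that the coefficient is the usual excess kurtosis, that is, the fourth central moment divided by the square of the second, minus 3, so that a normal distribution scores zero. That formula, the function name excess_kurtosis and the data are assumptions made for illustration; the exact definition of g3 used in the course may differ by a small-sample correction.

```python
def excess_kurtosis(sample):
    """Excess kurtosis m4 / m2**2 - 3, using population moments."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in sample) / n  # fourth central moment
    return m4 / m2**2 - 3


# Hypothetical data, invented purely for illustration.
flat = [1, 2, 3, 4, 5]                            # evenly spread values
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]    # a few extreme values

g_flat = excess_kurtosis(flat)
g_heavy = excess_kurtosis(heavy_tailed)

print(f"flat sample:         {g_flat:.3f}")
print(f"heavy-tailed sample: {g_heavy:.3f}")
```

The evenly spread sample gives a negative value (platykurtic), while the sample with a few extreme values gives a positive one (leptokurtic). With samples this small the coefficient is very unstable, so these numbers only illustrate the sign convention.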