Data Attributes and Similarity Measures: Exercises
Data Attributes Classification and Analysis
Exercise 1: Attribute Types
Classify the following attributes as binary, discrete, or continuous. Also, classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Briefly indicate your reasoning if there may be some ambiguity.
- Time in terms of AM or PM.
Binary, qualitative, ordinal.
- Brightness as measured by a light meter.
Continuous, quantitative, ratio.
- Brightness as measured by people’s judgments.
Discrete, qualitative, ordinal.
- Angles as measured in degrees between 0° and 360°.
Continuous, quantitative, ratio.
- Bronze, Silver, and Gold medals as awarded at the Olympics.
Discrete, qualitative, ordinal.
- Height above sea level.
Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin).
- Number of patients in a hospital.
Discrete, quantitative, ratio.
- ISBN numbers for books.
Discrete, qualitative, nominal (ISBN numbers do have order information, though).
- Ability to pass light in terms of the following values: opaque, translucent, transparent.
Discrete, qualitative, ordinal.
- Military rank.
Discrete, qualitative, ordinal.
- Distance from the center of campus.
Continuous, quantitative, interval/ratio (depends).
- Density of a substance in grams per cubic centimeter.
Continuous, quantitative, ratio.
- Coat check number.
Discrete, qualitative, nominal.
Exercise 2: Identification Numbers for Prediction
Can you think of a situation in which identification numbers would be useful for prediction?
Example: Student IDs are often assigned in order of enrollment, in which case they are a good predictor of graduation date.
Exercise 3: Noise vs. Outliers
Distinguish between noise and outliers. Consider the following questions:
- Is noise ever interesting or desirable? Outliers?
Noise: No, by definition. Outliers: Yes (See Chapter 9).
- Can noise objects be outliers?
Yes. Random distortion of the data is often responsible for outliers.
- Are noise objects always outliers?
No. Random distortion can result in an object or value much like a normal one.
- Are outliers always noise objects?
No. Often outliers merely represent a class of objects that are different from normal objects.
- Can noise make a typical value into an unusual one, or vice versa?
Yes.
Exercise 4: Similarity Measures for Elephant Herd
The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure would you use to compare or group these elephants? Justify your answer and explain any special circumstances.
These attributes are all numerical but can have widely varying ranges of values, depending on the scale used to measure them. Furthermore, the attributes are not asymmetric, and the magnitude of an attribute matters. These latter two facts eliminate the cosine and correlation measures. Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.
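The standardize-then-measure approach described above can be sketched in a few lines of Python. This is only an illustration: the `herd` measurements are made-up values, the helper names `standardize` and `euclidean` are hypothetical, and the population standard deviation (dividing by n) is an assumed choice; dividing by n - 1 is equally common.

```python
import math

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Assumption: population standard deviation (divide by n).
    # A constant column would give a standard deviation of 0 and is not handled here.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical measurements, one row per elephant:
# weight (kg), height (cm), tusk length (cm), trunk length (cm), ear area (cm^2)
herd = [
    [4000.0, 270.0, 150.0, 180.0, 5000.0],
    [3200.0, 250.0, 120.0, 170.0, 4600.0],
    [4500.0, 290.0, 160.0, 190.0, 5400.0],
]

z = standardize(herd)
print(euclidean(z[0], z[1]))  # distance between elephants 0 and 1 after standardization
```

Without the standardization step, the attributes with the largest numeric ranges (here weight and ear area) would dominate the distance simply because of the units chosen.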
Similarity and Distance Calculations
Exercise 5: Vector Calculations
For the following vectors, x and y, calculate the indicated similarity or distance measures.
- x = (1, 1, 1, 1), y = (2, 2, 2, 2)
cosine, correlation, Euclidean
cos(x, y) = 1, corr(x, y) = 0/0 (undefined), Euclidean(x, y) = 2
- x = (0, 1, 0, 1), y = (1, 0, 1, 0)
cosine, correlation, Euclidean, Jaccard
cos(x, y) = 0, corr(x, y) = -1, Euclidean(x, y) = 2, Jaccard(x, y) = 0
- x = (0, -1, 0, 1), y = (1, 0, -1, 0)
cosine, correlation, Euclidean
cos(x, y) = 0, corr(x, y) = 0, Euclidean(x, y) = 2
- x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1)
cosine, correlation, Jaccard
cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
- x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1)
cosine, correlation
cos(x, y) = 0, corr(x, y) = 0
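The answers above can be checked with a short Python sketch. The function names are hypothetical helpers written for this illustration, not part of any library; `correlation` returns `None` for the 0/0 case, and `jaccard` assumes binary (0/1) vectors.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    """Pearson correlation; None when either vector has zero variance (0/0)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: 1-1 matches over positions that are not 0-0."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / not_both_zero if not_both_zero else None

x, y = (1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)
print(round(cosine(x, y), 2), round(correlation(x, y), 2), jaccard(x, y))  # 0.75 0.25 0.6
```

In the first pair the correlation is undefined because both vectors are constant, so each has zero variance; the sketch reports this as `None` rather than raising a division error.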