Data Attributes and Similarity Measures: Exercises

Data Attributes Classification and Analysis

Exercise 1: Attribute Types

Classify the following attributes as binary, discrete, or continuous. Also, classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Briefly indicate your reasoning if there may be some ambiguity.

  1. Time in terms of AM or PM.
    Binary, qualitative, ordinal.
  2. Brightness as measured by a light meter.
    Continuous, quantitative, ratio.
  3. Brightness as measured by people’s judgments.
    Discrete, qualitative, ordinal.
  4. Angles as measured in degrees between 0° and 360°.
    Continuous, quantitative, ratio.
  5. Bronze, Silver, and Gold medals as awarded at the Olympics.
    Discrete, qualitative, ordinal.
  6. Height above sea level.
    Continuous, quantitative, interval/ratio (depends on whether sea level is regarded as an arbitrary origin).
  7. Number of patients in a hospital.
    Discrete, quantitative, ratio.
  8. ISBN numbers for books.
    Discrete, qualitative, nominal (ISBN numbers do have order information, though).
  9. Ability to pass light in terms of the following values: opaque, translucent, transparent.
    Discrete, qualitative, ordinal.
  10. Military rank.
    Discrete, qualitative, ordinal.
  11. Distance from the center of campus.
    Continuous, quantitative, interval/ratio (depends on whether the center of campus is regarded as an arbitrary origin).
  12. Density of a substance in grams per cubic centimeter.
    Continuous, quantitative, ratio.
  13. Coat check number.
    Discrete, qualitative, nominal.

Exercise 2: Identification Numbers for Prediction

Can you think of a situation in which identification numbers would be useful for prediction?

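Example: Student IDs can be a good predictor of graduation date when they are assigned sequentially at enrollment, since a lower number then indicates an earlier start.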

Exercise 3: Noise vs. Outliers

Distinguish between noise and outliers. Consider the following questions:

  1. Is noise ever interesting or desirable? Outliers?
    Noise: No, by definition. Outliers: Yes (See Chapter 9).
  2. Can noise objects be outliers?
    Yes. Random distortion of the data is often responsible for outliers.
  3. Are noise objects always outliers?
    No. Random distortion can result in an object or value much like a normal one.
  4. Are outliers always noise objects?
    No. Often outliers merely represent a class of objects that are different from normal objects.
  5. Can noise make a typical value into an unusual one, or vice versa?
    Yes.

Exercise 4: Similarity Measures for Elephant Herd

The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure would you use to compare or group these elephants? Justify your answer and explain any special circumstances.

These attributes are all numerical but can have widely varying ranges of values, depending on the scale used to measure them. Furthermore, the attributes are not asymmetric, and the magnitude of an attribute matters. These latter two facts eliminate the cosine and correlation measures. Euclidean distance, applied after standardizing the attributes to have a mean of 0 and a standard deviation of 1, would be appropriate.
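The effect of standardization can be illustrated with a minimal sketch. The herd measurements below are hypothetical values invented for the example, and the helper names standardize and euclidean are illustrative, not taken from any library. The sketch assumes every attribute varies across the herd, since a constant column would have a standard deviation of 0.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column standard deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) for c in columns]  # population standard deviation per attribute
    return [
        [(v - m) / s for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical measurements, one row per elephant:
# weight (kg), height (m), tusk length (m), trunk length (m), ear area (m^2)
herd = [
    [4000.0, 2.7, 1.2, 1.8, 0.50],
    [3500.0, 2.5, 0.9, 1.7, 0.45],
    [5000.0, 3.0, 1.5, 2.0, 0.60],
]

z = standardize(herd)

# Distance between the first two elephants, before and after standardization.
raw_d = euclidean(herd[0], herd[1])
std_d = euclidean(z[0], z[1])
print(f"raw distance:          {raw_d:.3f}")
print(f"standardized distance: {std_d:.3f}")
```

On the raw values the distance is almost entirely the weight difference, because weight is recorded on a scale thousands of times larger than the other attributes. After standardization each attribute contributes on a comparable scale.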

Similarity and Distance Calculations

Exercise 5: Vector Calculations

For the following vectors, x and y, calculate the indicated similarity or distance measures.

  1. x = (1, 1, 1, 1), y = (2, 2, 2, 2)
    cosine, correlation, Euclidean
    cos(x, y) = 1, corr(x, y) = 0/0 (undefined), Euclidean(x, y) = 2
  2. x = (0, 1, 0, 1), y = (1, 0, 1, 0)
    cosine, correlation, Euclidean, Jaccard
    cos(x, y) = 0, corr(x, y) = -1, Euclidean(x, y) = 2, Jaccard(x, y) = 0
  3. x = (0, -1, 0, 1), y = (1, 0, -1, 0)
    cosine, correlation, Euclidean
    cos(x, y) = 0, corr(x, y) = 0, Euclidean(x, y) = 2
  4. x = (1, 1, 0, 1, 0, 1), y = (1, 1, 1, 0, 0, 1)
    cosine, correlation, Jaccard
    cos(x, y) = 0.75, corr(x, y) = 0.25, Jaccard(x, y) = 0.6
  5. x = (2, -1, 0, 2, 0, -3), y = (-1, 1, -1, 0, 0, -1)
    cosine, correlation
    cos(x, y) = 0, corr(x, y) = 0
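
The answers above can be checked with a short script. This is a sketch using only the standard library; the function names are illustrative. Two conventions are assumptions of the sketch: correlation returns None when either vector has zero variance (the 0/0 case in item 1), and jaccard treats the vectors as binary and ignores 0-0 matches.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def correlation(x, y):
    """Pearson correlation; None when either vector has zero variance (0/0)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cx = [a - mx for a in x]
    cy = [b - my for b in y]
    denom = math.sqrt(dot(cx, cx) * dot(cy, cy))
    if denom == 0:
        return None
    return dot(cx, cy) / denom

def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: 1-1 matches over positions that are not 0-0."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    not_both_zero = len(x) - f00
    if not_both_zero == 0:
        return None  # both vectors are all zeros; the coefficient is undefined
    return f11 / not_both_zero

# The five vector pairs from the exercise, in order.
pairs = [
    ((1, 1, 1, 1), (2, 2, 2, 2)),
    ((0, 1, 0, 1), (1, 0, 1, 0)),
    ((0, -1, 0, 1), (1, 0, -1, 0)),
    ((1, 1, 0, 1, 0, 1), (1, 1, 1, 0, 0, 1)),
    ((2, -1, 0, 2, 0, -3), (-1, 1, -1, 0, 0, -1)),
]

for i, (x, y) in enumerate(pairs, start=1):
    print(i, "cos:", cosine(x, y), "corr:", correlation(x, y),
          "Euclidean:", euclidean(x, y))

# Jaccard applies only to the binary pairs (items 2 and 4).
for i in (2, 4):
    x, y = pairs[i - 1]
    print(i, "Jaccard:", jaccard(x, y))
```

Because the computations use floating-point arithmetic, results that are exactly 0 or 0.25 on paper may print with a tiny rounding error.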