Clustering Algorithms: Applications and Methods

1. What are Sample Applications of Clustering in Which the Objective is to Cluster the Samples/Observations?

  • This is the most common use of clustering.
  • Examples include:
    • Machine learning
    • Data mining
    • Information retrieval
    • Bioinformatics
    • Data compression

2. What are Sample Applications of Clustering in Which the Objective is to Cluster the Variables/Features?

  • Market segmentation, where we try to find customers who are similar to each other, whether in terms of behaviors or attributes.
  • Image segmentation/compression, where we try to group similar regions together.
  • Document clustering based on topics.

3. Clustering Can Be Loosely Defined as “Grouping Items That are Similar/Close to Each Other”. What Components Do We Need to Formulate to Have a Well-Defined Computational Framework for Clustering?

  • A proximity measure: how similar or close two items are.
  • A criterion function: how we judge whether a given clustering is good or bad.
  • An algorithm: how we actually compute a clustering that scores well on the criterion.

4. What are Possible Ways of Choosing the Number of Clusters?

  • K-means:
    • Assume we know in advance how many clusters there are.
    • Or try different values of K and look at the SSE (sum of squared errors): SSE always decreases as K grows, so look for the point of diminishing returns (the "elbow") rather than the raw minimum (see the sketch below).
  • Incorporate the trade-off between complexity and fit into the objective function, or evaluate that trade-off after computing clusterings for different values of K.
  • Hierarchical clustering:
    • Pick the number of clusters after the distance tree (dendrogram) has been built.
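
One way to carry out the "try different values of K" strategy is to run K-means for a range of K values and watch how the SSE falls. A rough sketch, assuming scikit-learn's KMeans and some illustrative synthetic data:

```python
# Sketch: evaluate the fit/complexity trade-off by trying several values of K.
# The data below is an illustrative assumption (three well-separated 2-D blobs).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (4, 0), (2, 4)]])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia_ is the within-cluster SSE
# SSE drops sharply up to K = 3 and only slightly afterwards; that "elbow"
# suggests K = 3. AIC/BIC-style criteria make the same trade-off explicit.
```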

5. What is the Objective Function of K-Means Clustering? Can K-Means Still Be Applied if Our Notion of Distance is Not Euclidean Distance?

  • The objective is the total within-cluster scatter: K-means minimizes the sum of squared distances between each point and the centroid of the cluster it is assigned to (see the formulation below).
  • K-means finds a locally optimal solution and is not guaranteed to find the (globally) optimal solution.
  • Yes, the distance function can be changed depending on the circumstances; note, however, that the mean-based centroid update is tied to squared Euclidean distance, so other distances typically call for variants such as K-medoids.
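
For reference, the K-means objective can be written as the total within-cluster sum of squared (Euclidean) distances, where μ_k denotes the centroid (mean) of cluster C_k:

```latex
J(C_1,\dots,C_K) \;=\; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```

K-means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points; each step can only decrease (or leave unchanged) this value.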

6. Is K-Means Guaranteed to Identify the Optimal Clustering According to its Objective Function? If Not, What Can We Say About the Optimality of the Clustering it Identifies?

  • No, it is not guaranteed to find the globally optimal clustering (the clustering that minimizes the objective function over all possible assignments).
  • What we can say is that the clustering it identifies is locally optimal: the objective never increases from one iteration to the next, and the final solution depends on the initialization.

7. Considering an Instance of Clustering Where We are Trying to Cluster Points/Observations in 2-Dimensional Space, Can You Give a Sample Instance and an Initialization That Will Make K-Means Converge to an Undesirable Solution?

  • An instance whose clusters are not linearly separable (e.g., elongated or otherwise non-spherical clusters): the boundary K-means draws between two centroids is a straight line, so such clusters cannot be recovered.
  • A poor initialization can also cause this: for example, if two initial centroids fall inside the same true cluster, K-means can converge with that cluster split in two and two other true clusters merged (see the sketch below).
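
A minimal sketch of such an instance, assuming plain NumPy; the data and the centroid positions are illustrative assumptions:

```python
# Sketch: a 2-D instance plus an initialization that traps K-means in a poor
# local optimum.
import numpy as np

def kmeans(X, centroids, n_iter=50):
    """Plain Lloyd's algorithm; returns final centroids, labels, and SSE."""
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                        # assign each point to nearest centroid
        centroids = np.array([X[labels == k].mean(axis=0)   # move centroids to cluster means
                              for k in range(len(centroids))])
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, sse

rng = np.random.default_rng(0)
# Three compact groups of points.
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in [(0, 0), (5, 0), (5, 5)]])

bad_init  = np.array([[0.0, -0.1], [0.0, 0.1], [5.0, 2.5]])   # two centroids in the same group
good_init = np.array([[0.0,  0.0], [5.0, 0.0], [5.0, 5.0]])

for name, init in [("bad init ", bad_init), ("good init", good_init)]:
    _, _, sse = kmeans(X, init.copy())
    print(name, "SSE =", round(float(sse), 2))
# With the bad initialization, K-means typically converges with the left group
# split in two and the two right-hand groups merged, giving a much larger SSE.
```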

8. Consider a Run of K-Means in 1-Dimensional Space, with Say K=2 Clusters. If You are Shown the Configurations of the Cluster Centroids and Assignment of Data Points to Clusters at the End of Each Iteration, Can You Order Those Iterations?

  • Yes: the K-means objective (SSE) never increases from one iteration to the next, so the shown configurations can be ordered by decreasing SSE (the centroids drift toward the means of their assigned points as the run progresses). Refer to the slides for a visual example.

9. What are the Weaknesses of K-Means?

  • Sensitive to outliers: a single extreme point can pull a centroid far away from the rest of its cluster.
  • The user needs to specify K in advance.
  • Only applicable when a mean can be computed for the data, so it does not directly handle, e.g., categorical attributes.

10. What are Possible Strategies for Making Algorithms More Robust to Outliers in the Application of K-Means?

  • Remove outliers (e.g., points that remain far from any centroid), either before clustering or during it.
  • Cluster a random sample (a small subset) of the data points: because outliers are rare, few of them are likely to be sampled, and the remaining points can be assigned to the resulting clusters afterwards.

11. What are the Main Components of an Agglomerative Hierarchical Clustering Algorithm?

  • A distance (proximity) matrix specifying the pairwise distances between items.
  • A rule for measuring the distance between groups of items (clusters), used to decide which pair of clusters to merge next.
  • The algorithm works bottom-up, repeatedly merging the two closest clusters (top-down is the divisive alternative); see the sketch below.
  • Merge decisions are greedy and cannot be undone.
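
A bare-bones sketch of the agglomerative loop built from these components, using single linkage as the between-group rule; the data at the end is an illustrative assumption:

```python
# Sketch: agglomerative clustering from a pairwise distance matrix D.
import numpy as np

def agglomerate(D):
    """Repeatedly merge the two closest clusters (single linkage); return the merges."""
    clusters = [[i] for i in range(len(D))]           # every item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):                # scan all pairs of current clusters
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # greedy: this merge is never undone
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

pts = np.array([0.0, 1.0, 5.0, 6.0])                  # four 1-D points
D = np.abs(pts[:, None] - pts[None, :])
for a, b, d in agglomerate(D):
    print(a, "+", b, "merged at distance", d)
```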

12. What are the Possible Ways of Assessing the Distance Between Groups of Items Given the Distances Between Pairs of Items?

  • Closest pair (single linkage)
  • Farthest pair (complete linkage)
  • Average of all pairs (average linkage); a small sketch of all three follows below.
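
The three rules correspond to the following between-group distance functions, sketched here for a pairwise distance matrix D and two groups of item indices (the helper names are my own):

```python
import numpy as np

def single_linkage(D, group_a, group_b):
    """Closest pair: minimum over all cross-group pairwise distances."""
    return D[np.ix_(group_a, group_b)].min()

def complete_linkage(D, group_a, group_b):
    """Farthest pair: maximum over all cross-group pairwise distances."""
    return D[np.ix_(group_a, group_b)].max()

def average_linkage(D, group_a, group_b):
    """Average of all cross-group pairwise distances."""
    return D[np.ix_(group_a, group_b)].mean()

# Example: four 1-D points at 0, 1, 5, 6, comparing group {0, 1} with group {2, 3}.
pts = np.array([0.0, 1.0, 5.0, 6.0])
D = np.abs(pts[:, None] - pts[None, :])
print(single_linkage(D, [0, 1], [2, 3]))    # 4.0  (point 1 vs. point 5)
print(complete_linkage(D, [0, 1], [2, 3]))  # 6.0  (point 0 vs. point 6)
print(average_linkage(D, [0, 1], [2, 3]))   # 5.0
```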

13. Can You Give an Example of Hierarchical Clustering in 2-Dimensional Space with Euclidean Distance as the Distance Function, Such That Single-Linkage and Complete-Linkage Will Produce the Same Dendrogram?

14. Can You Give an Example of Hierarchical Clustering in 2-Dimensional Space with Euclidean Distance as the Distance Function, Such That Single-Linkage and Complete-Linkage Will Produce Different Dendrograms?

15. What is the Key Idea of Ward’s Method?

  • When considering joining two clusters, ask: by how much would the total within-cluster variation (the sum of squared distances of points to their cluster centroid) increase if we merged them?
  • At each step, merge the pair of clusters for which this increase is smallest, i.e., the merge that is most "worth it" (see the formulation below).
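
One common way to write the cost Ward's method assigns to merging clusters A and B, i.e., the increase in total within-cluster SSE caused by the merge (with μ_A, μ_B the cluster centroids and |A|, |B| the cluster sizes):

```latex
\Delta(A, B) \;=\; \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^{2}
```

At each step, the pair of clusters with the smallest Δ is merged.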

16. What is the Common Requirement of K-Means and Hierarchical Clustering Using Ward’s Method?

  • Both require that a mean (centroid) can be computed for a group of data points, i.e., the data must live in a space where averaging is meaningful.

17. What are the Pros and Cons of Applying a Divisive Hierarchical Clustering Algorithm That, for Example, Uses K-Means with K=2 to Split the Points at Each Node of the Dendrogram (as Compared to Agglomerative Clustering)?

  • Agglomerative clustering is faster to compute, but its merge decisions look only at pairwise distances, so it is greedy and somewhat "blind" to the overall structure of the data.
  • Divisive clustering is "less blind": each split (here, running K-means with K=2 at a node) considers all the points at that node, so it can find a better overall split into two groups than greedy pairwise merges would.

18. What is the (Common) Key Idea Behind Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC) in Formulating a Measure to Guide the Selection of the Number of Clusters?

  • Both measure how well a model fits the data in a maximum-likelihood sense while penalizing the number of parameters used.
  • The guiding question: how many parameters (here, how many clusters) are really needed to obtain a good fit? (See the formulas below.)
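
For reference, the standard forms are:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}
```

where \hat{L} is the maximized likelihood of the model, k the number of estimated parameters, and n the number of observations; lower values are better for both, and the two differ only in the complexity penalty (2k versus k ln n).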

19. What is the Key Difference Between AIC and BIC?

  • AIC (Akaike Information Criterion) tries to select the model that most adequately describes an unknown, possibly high-dimensional, reality; it does not assume that the true model is among the candidates.
  • BIC (Bayesian Information Criterion) tries to find the TRUE model among the set of candidates.
  • As a result, AIC tends to offer better predictive performance.
  • BIC penalizes complexity more heavily (its penalty grows with the sample size), so it tends to select smaller, more parsimonious models.

20. Assume That You are Given a Dendrogram and You are Trying to Translate This into a Disjoint Set of Clusters Such That the Model Complexity and Model Fit are (Locally) Optimized Together. You are Also Given a Procedure That Computes AIC for a Given Clustering. Describe an Algorithm/Method You Would Use to Identify the Final Set of Clusters.

  • Move through the dendrogram cuts (i.e., the candidate numbers of clusters) and compute AIC (or the change in SSE) for each resulting set of clusters.
  • Look for the largest ΔSSE (or ΔAIC) between neighboring candidates: a large drop between X clusters and Y clusters indicates that X clusters divide the points into much more homogeneous groups than Y clusters do. From this reasoning, X clusters can be picked as the final solution, stopping once adding further clusters no longer improves the criterion (see the sketch below).
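
A sketch of one way this could be implemented, assuming SciPy's hierarchical-clustering utilities; the simple Gaussian-style AIC below merely stands in for the "given procedure that computes AIC" and is an illustrative assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def compute_aic(X, labels):
    """Toy AIC: model each cluster as a spherical Gaussian around its centroid."""
    n, d = X.shape
    ks = np.unique(labels)
    sse = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum() for k in ks)
    variance = sse / (n * d) + 1e-12
    log_lik = -0.5 * n * d * (np.log(2 * np.pi * variance) + 1)
    n_params = len(ks) * d                               # one centroid per cluster (simplified)
    return 2 * n_params - 2 * log_lik

def pick_k_from_dendrogram(X, max_k=10):
    Z = linkage(X, method="ward")                        # build the dendrogram
    best_k, best_aic = 1, float("inf")
    for k in range(1, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree into k clusters
        aic = compute_aic(X, labels)
        if aic < best_aic:                               # fit improved enough to justify the extra cluster
            best_k, best_aic = k, aic
        else:                                            # AIC got worse: keep the previous (locally optimal) cut
            break
    return best_k
```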