Cluster: A collection of data objects

 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)

 Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

 Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)

 Typical applications

 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications

 Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus, and species

 Information retrieval: document clustering

 Land use: identification of areas of similar land use in an earth observation database

 Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

 City planning: identifying groups of houses according to their house type, value, and geographical location

 Earthquake studies: observed earthquake epicenters should be clustered along continental faults

 Climate: understanding earth climate, finding patterns in atmospheric and ocean data

 Economic science: market research

Quality: What Is Good Clustering?

 A good clustering method will produce high-quality clusters with

 high intra-class similarity: cohesive within clusters

 low inter-class similarity: distinctive between clusters

 The quality of a clustering method depends on

 the similarity measure used by the method,

 its implementation, and

 its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

 Dissimilarity/similarity metric

 Similarity is expressed in terms of a distance function, typically a metric: d(i, j)

 The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables

 Weights should be associated with different variables based on applications and data semantics

 Quality of clustering:

 There is usually a separate “quality” function that measures the “goodness” of a cluster

 It is hard to define “similar enough” or “good enough”

 The answer is typically highly subjective
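As a concrete illustration of distance functions for interval-scaled variables, here is a minimal sketch of two common metrics (Minkowski distance with q = 2 and q = 1); the sample objects are made up for illustration:

```python
import math

def euclidean(i, j):
    """Minkowski distance with q = 2: the usual straight-line metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    """Minkowski distance with q = 1: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(i, j))

# Two toy objects with three interval-scaled attributes.
x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
```

Per-variable weights, as noted above, would simply multiply each squared or absolute difference inside the sum.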

Requirements and Challenges

 Scalability

 Clustering all the data instead of only samples

 Ability to deal with different types of attributes

 Numerical, binary, categorical, ordinal, linked, and mixtures of these

 Constraint-based clustering

 The user may give inputs on constraints

 Use domain knowledge to determine input parameters

 Interpretability and usability

 Others

 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

Major Clustering Approaches (I)

 Partitioning approach:

 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors

 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:

 Create a hierarchical decomposition of the set of data (or objects) using some criterion

 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

 Density-based approach:

 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DenClue

 Grid-based approach:

 Based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

Partitioning Algorithms: Basic Concept

 Partitioning method: partition a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized: E = Σ_{i=1..k} Σ_{p ∈ Ci} d(p, ci)², where ci is the centroid or medoid of cluster Ci

 Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

 Global optimum: exhaustively enumerate all partitions

 Heuristic methods: the k-means and k-medoids algorithms

 k-means (MacQueen ’67, Lloyd ’57/’82): each cluster is represented by the center of the cluster

 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster
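The sum-of-squared-distances criterion E can be computed directly from a partition and its centers; a minimal sketch (the toy partition and centers are hypothetical):

```python
def sse(clusters, centers):
    """Sum of squared Euclidean distances from each object p to its cluster center ci."""
    total = 0.0
    for points, c in zip(clusters, centers):
        for p in points:
            total += sum((a - b) ** 2 for a, b in zip(p, c))
    return total

# A hypothetical partition of three 2-D objects into k = 2 clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
centers = [(1.0, 0.0), (5.0, 5.0)]
print(sse(clusters, centers))  # 2.0
```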

The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four steps:

 1. Partition the objects into k nonempty subsets

 2. Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)

 3. Assign each object to the cluster with the nearest seed point

 4. Go back to Step 2; stop when the assignment does not change
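The four steps above can be sketched as a minimal Lloyd-style implementation; the sample points and the random initial seeding are illustrative assumptions, not part of the original text:

```python
import math
import random

def kmeans(points, k, seed=0):
    """Lloyd's algorithm: assign to nearest centroid, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: initial seed points
    while True:
        # step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # step 2: recompute each centroid as the mean point of its cluster
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:                   # step 4: stop when nothing changes
            return new, clusters
        centroids = new

# Two obvious groups: around (1, 1.5) and around (8.5, 8.5).
pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
centers, clusters = kmeans(pts, 2)
```

On this toy data any choice of two distinct initial seeds converges to the same two centroids; in general k-means only finds a local optimum that depends on the initialization.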

What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!

 An object with an extremely large value may substantially distort the distribution of the data

 K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster
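A small sketch of the medoid idea: the object minimizing total distance to the other cluster members stays near the bulk of the data, while the mean is dragged toward the outlier (the cluster values are made up):

```python
import math

def medoid(points):
    """Most centrally located object: minimizes total distance to all cluster members."""
    return min(points, key=lambda m: sum(math.dist(m, p) for p in points))

# A hypothetical cluster with one extreme outlier at (40, 40).
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (40.0, 40.0)]
print(medoid(cluster))  # (1.0, 0.0): an actual object near the bulk of the data
mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
print(mean)             # (10.25, 10.5): the mean is dragged toward the outlier
```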



 Supervised learning (classification)

 Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

 New data is classified based on the training set

 Unsupervised learning (clustering)

 The class labels of the training data are unknown

 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

 Classification

 predicts categorical class labels (discrete or nominal)

 classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

 Numeric prediction

 models continuous-valued functions, i.e., predicts unknown or missing values

 Typical applications

 Credit/loan approval

 Medical diagnosis: whether a tumor is cancerous or benign

 Fraud detection: whether a transaction is fraudulent

 Web page categorization: which category a page belongs to

 Model construction: describing a set of predetermined classes

 Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

 The set of tuples used for model construction is the training set

 The model is represented as classification rules, decision trees, or mathematical formulae

 Model usage: classifying future or unknown objects

 Estimate the accuracy of the model

 The known label of each test sample is compared with the classified result from the model

 The accuracy rate is the percentage of test set samples that are correctly classified by the model

 The test set is independent of the training set (otherwise overfitting occurs)

 If the accuracy is acceptable, use the model to classify new data
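Estimating the accuracy rate on an independent test set can be sketched like this; the rule-based "model" and its income threshold are purely hypothetical:

```python
def accuracy(model, test_set):
    """Fraction of test tuples whose predicted label matches the known label."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Hypothetical one-rule classifier: approve a loan when income >= 30.
model = lambda x: "approve" if x["income"] >= 30 else "reject"
test = [({"income": 50}, "approve"),
        ({"income": 10}, "reject"),
        ({"income": 35}, "reject")]
print(accuracy(model, test))  # 0.666...: 2 of 3 test tuples classified correctly
```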

Issues: Evaluating Classification Methods

 Accuracy

 classifier accuracy: predicting the class label

 predictor accuracy: guessing the value of predicted attributes

 Speed

 time to construct the model (training time)

 time to use the model (classification/prediction time)

 Robustness: handling noise and missing values

 Scalability: efficiency for disk-resident databases

 Interpretability

 understanding and insight provided by the model

 Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Algorithm for Decision Tree Induction

 Basic algorithm (a greedy algorithm)

 The tree is constructed in a top-down, recursive, divide-and-conquer manner

 At the start, all the training examples are at the root

 Attributes are categorical (if continuous-valued, they are discretized in advance)

 Examples are partitioned recursively based on selected attributes

 Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

 Conditions for stopping partitioning

 All samples for a given node belong to the same class

 There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf

 There are no samples left
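The information-gain heuristic mentioned above can be sketched as follows; the tiny weather-style dataset is invented for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Reduction in entropy obtained by splitting the examples on `attr`."""
    n = len(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

rows = [{"outlook": "sunny"}, {"outlook": "sunny"},
        {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, "outlook", labels))  # 1.0: the split separates the classes perfectly
```

The induction algorithm would compute this gain for every candidate attribute at a node and split on the one with the highest value.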

Rule Extraction from a Decision Tree

 Rules are easier to understand than large trees

 One rule is created for each path from the root to a leaf

 Each attribute–value pair along a path forms a conjunction; the leaf holds the class prediction

 Rules are mutually exclusive and exhaustive
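A minimal sketch of path-to-rule conversion, assuming a toy tree encoded as (attribute, {value: subtree}) tuples with string class labels at the leaves (this encoding is an illustrative assumption):

```python
def extract_rules(tree, conds=()):
    """One IF-THEN rule per root-to-leaf path; tests along the path are AND-ed."""
    if isinstance(tree, str):                  # a leaf holds the class prediction
        return [f"IF {' AND '.join(conds)} THEN class = {tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conds + (f"{attr} = {value}",))
    return rules

# Hypothetical tree: split on outlook, then on humidity under "sunny".
tree = ("outlook", {"sunny": ("humidity", {"high": "no", "normal": "yes"}),
                    "rain": "yes"})
for rule in extract_rules(tree):
    print(rule)
# IF outlook = sunny AND humidity = high THEN class = no
# IF outlook = sunny AND humidity = normal THEN class = yes
# IF outlook = rain THEN class = yes
```

Because each example follows exactly one root-to-leaf path, the extracted rules are mutually exclusive and exhaustive, as the slide states.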

Model Evaluation and Selection

 Evaluation metrics: How can we measure accuracy? Other metrics to consider?

 Use a validation test set of class-labeled tuples instead of the training set when assessing accuracy

 Methods for estimating a classifier’s accuracy:

 Holdout method, random subsampling

 Cross-validation

 Bootstrap

 Comparing classifiers:

 Confidence intervals

 Cost–benefit analysis and ROC curves
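Cross-validation, one of the methods listed above, can be sketched as follows; the majority-class learner and the toy dataset are hypothetical stand-ins for a real classifier:

```python
def k_fold_accuracy(data, k, train_fn):
    """k-fold cross-validation: each fold serves as the test set once; average accuracy."""
    folds = [data[i::k] for i in range(k)]     # simple deterministic partition
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(train)
        scores.append(sum(1 for x, y in test if model(x) == y) / len(test))
    return sum(scores) / k

def majority_learner(train):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

# Toy data: 7 "pos" and 3 "neg" examples keyed by an integer feature.
data = [(i, "pos" if i > 2 else "neg") for i in range(10)]
print(k_fold_accuracy(data, 5, majority_learner))  # 0.7
```

A real partition would usually shuffle the data first; the stride-based split here keeps the example deterministic.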

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity, and Specificity

 Classifier accuracy, or recognition rate: the percentage of test set tuples that are correctly classified

 Accuracy = (TP + TN)/All

 Error rate: 1 – accuracy, or

 Error rate = (FP + FN)/All

 Class imbalance problem:

 One class may be rare, e.g., fraud or HIV-positive

 Significant majority of the negative class and minority of the positive class

 Sensitivity: true positive recognition rate

 Sensitivity = TP/P

 Specificity: true negative recognition rate

 Specificity = TN/N
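The four measures above can be computed straight from the confusion-matrix counts; a small sketch with hypothetical counts chosen to show the class imbalance problem:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, error rate, sensitivity (TP rate), and specificity (TN rate)."""
    p, n = tp + fn, tn + fp            # actual positives P and actual negatives N
    total = p + n
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / p,
        "specificity": tn / n,
    }

# Hypothetical imbalanced test set: 300 positives vs. 9,700 negatives.
m = metrics(tp=90, tn=9560, fp=140, fn=210)
print(m["accuracy"])     # 0.965 -- looks excellent...
print(m["sensitivity"])  # 0.3   -- ...yet only 30% of positives are recognized
```

This is exactly why accuracy alone is misleading on rare classes: a high overall rate can hide very poor sensitivity on the minority class.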