Data Mining Techniques and Applications
Chapter 1: What is Data Mining?
Data mining is the process of discovering patterns and knowledge from large datasets.
Be Able to Identify Data Mining Tasks
Examples include classification, clustering, association rule discovery, and anomaly detection.
Why Do We Need Data Mining?
To extract useful insights, make predictions, and support decision-making from large and complex datasets.
Understand Classification Definition
Classification assigns predefined labels to data points based on a training dataset.
Understand Clustering Definition
Clustering groups similar data points together without predefined labels.
Understand Association Rule Discovery Definition
Association rule discovery identifies relationships and patterns between items in transactional data.
Dividing the customers of a company according to their gender:
(No. This is a simple database query.)
Dividing the customers of a company according to their profitability:
(No. This is an accounting calculation.)
Computing the total sales of a company:
(No. Again, this is simple accounting.)
Sorting a student database based on student identification numbers:
(No. Again, this is a simple database query.)
Predicting the outcomes of tossing a (fair) pair of dice:
(No. Since the dice are fair, this is a probability calculation.)
Predicting the future stock price of a company using historical records:
(Yes. We would attempt to create a model that can predict the continuous value of the stock price.)
Monitoring the heart rate of a patient for abnormalities:
(Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an unusual heart behavior occurred.)
Monitoring seismic waves for earthquake activities:
(Yes. We would build a model of the seismic wave behavior associated with earthquakes and classify new observations against it; this is an example of the area of data mining known as classification.)
Extracting the frequencies of a sound wave:
(No. This is signal processing.)
Chapter 2: Data
Types of Attributes and Properties
Attributes can be binary, discrete, or continuous, and classified as nominal, ordinal, interval, or ratio based on their properties.
Types of Datasets
Common types include relational (tables), transactional (e.g., purchase logs), temporal (time-series), spatial, text, and multimedia datasets.
Data Quality Issues
Key issues include noise (random errors), outliers (extreme values), missing values, and duplicate data, all of which affect data reliability.
Data Preprocessing
Involves steps like aggregation (data summarization), sampling (reducing data size), dimensionality reduction (e.g., PCA), feature selection/creation, discretization, binarization, and normalization.
Measures: Similarity and Dissimilarity
Quantifies how alike or different two data points are.
Distance Metrics
Common measures include Euclidean and Manhattan distance, cosine similarity, and the Jaccard coefficient.
Density
Measures compactness of data regions, used in clustering algorithms like DBSCAN.
Age in years:
(Discrete, Quantitative, Ratio)
Time in terms of AM or PM:
(Binary, Qualitative, Ordinal)
Brightness as measured by a light meter:
(Continuous, Quantitative, Ratio)
Brightness as measured by people’s judgments:
(Discrete, Qualitative, Ordinal)
What is the main difference between sampling and feature selection? What is the main similarity between them?
Sampling reduces the number of data objects (rows), whereas feature selection reduces the number of attributes (columns). Both return a subset of the original data.
What is the main difference between feature selection and dimensionality reduction? What is the main similarity between them?
Feature selection retains a subset of the original attributes, whereas dimensionality reduction creates new attributes by transforming the original attribute space (e.g., PCA). Both reduce the number of attributes.
Given a number x = 480 in the range of [-100,9990], we need to normalize and project the number into a new range [-1, 1]. What is the new value of x if we use decimal scaling for normalization? What is the new value of x if we use min-max normalization?
Decimal scaling: k = 4 (divide by 10⁴ = 10000), thus 480 → 480/10000 = 0.048.
Min-max normalization: (480 − (−100)) / (9990 − (−100)) × (1 − (−1)) + (−1) = 580/10090 × 2 − 1 ≈ −0.885.
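As a quick check of this arithmetic, here is a minimal plain-Python sketch of the two normalizations (the function names and signatures are illustrative, not from any particular library):

```python
def decimal_scaling(x, max_abs):
    """Divide by 10^k, where k is the smallest integer with max(|value|)/10^k < 1."""
    k = 0
    while max_abs / (10 ** k) >= 1:
        k += 1
    return x / (10 ** k)

def min_max(x, old_min, old_max, new_min=-1.0, new_max=1.0):
    """Linearly map x from [old_min, old_max] onto [new_min, new_max]."""
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(decimal_scaling(480, max_abs=9990))                   # 0.048
print(round(min_max(480, old_min=-100, old_max=9990), 3))   # -0.885
```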
For the following vectors, x and y, calculate the indicated similarity or distance measures.
(a) x = (1, 1, 1, 1), y = (2, 2, 2, 2)
cos(x, y) = 1, corr(x, y) = 0/0 (undefined, since both vectors have zero standard deviation), Euclidean(x, y) = 2
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0)
cos(x, y) = 0, corr(x, y) = −1, Euclidean(x, y) = 2, Jaccard(x, y) = 0
(c) x = (0,−1, 0, 1), y = (1, 0,−1, 0)
cos(x, y) = 0, corr(x, y) = 0, Euclidean(x, y) = 2.
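The values above can be reproduced with a short sketch using NumPy (an assumed dependency); the helper functions are illustrative implementations of the measures, shown here for case (b):

```python
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def correlation(x, y):
    # Undefined (0/0) when either vector has zero standard deviation, as in case (a).
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def euclidean(x, y):
    return np.linalg.norm(x - y)

def jaccard(x, y):
    # Binary vectors: matching 1s divided by positions where at least one vector has a 1.
    return np.sum((x == 1) & (y == 1)) / np.sum((x == 1) | (y == 1))

x, y = np.array([0, 1, 0, 1]), np.array([1, 0, 1, 0])
print(cosine(x, y), correlation(x, y), euclidean(x, y), jaccard(x, y))
# 0.0  -1.0  2.0  0.0
```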
Chapter 3: Classification
General Structure of Hunt’s Algorithm
If all records at a node belong to the same class, label it; otherwise, split the records using an attribute test.
Determine How to Split the Records
- Attribute type: the form of the test condition depends on whether the attribute is nominal, ordinal, or continuous.
- Number of splits: use a 2-way (binary) split or a multi-way split.
Measures of Node Impurity
Evaluate splits using Gini Index, Entropy, or Misclassification Error.
Finding the Best Split
Calculate Gain = P (parent impurity) – M (weighted average child impurity) and choose the split maximizing the gain.
Practical Issues of Classification
- Underfitting: Model is too simple, leading to poor performance on both training and test data.
- Overfitting: Model is too complex, fitting noise in training data.
- Missing values: Handle them by imputation or ignoring records.
- Model evaluation and comparison: Evaluate models using metrics like accuracy, precision, recall, F1-score, and ROC/AUC; compare models using cross-validation results.
Hunt’s algorithm:
If all records in Dt belong to the same class yt, label the node as yt. If Dt is empty, label it with the default class yd. Otherwise, split Dt using an attribute test and apply the procedure recursively to each subset.
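A minimal, runnable sketch of this recursive structure (plain Python with a toy record set; the attribute choice here is a placeholder, whereas a real implementation would pick the test with the best impurity-based gain, as described below):

```python
from collections import Counter, defaultdict

def majority_class(records):
    return Counter(r["class"] for r in records).most_common(1)[0][0]

def partition(records, attr):
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups

def hunt(records, attributes, default_class):
    """Grow a decision (sub)tree from the records Dt reaching this node."""
    if not records:                          # Dt is empty -> label with default class yd
        return {"label": default_class}
    classes = {r["class"] for r in records}
    if len(classes) == 1:                    # all records belong to the same class yt
        return {"label": classes.pop()}
    if not attributes:                       # nothing left to test -> majority class
        return {"label": majority_class(records)}
    test = attributes[0]                     # placeholder: a real implementation chooses
                                             # the attribute test with the best gain
    node = {"test": test, "children": {}}
    for value, subset in partition(records, test).items():
        node["children"][value] = hunt(subset, attributes[1:], majority_class(records))
    return node

records = [{"outlook": "sunny", "windy": False, "class": "no"},
           {"outlook": "sunny", "windy": True,  "class": "no"},
           {"outlook": "rain",  "windy": False, "class": "yes"},
           {"outlook": "rain",  "windy": True,  "class": "no"}]
print(hunt(records, ["outlook", "windy"], default_class="no"))
```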
Gini Index:
Measures node impurity as Gini(t) = 1 − ∑_j [p(j∣t)]².
Entropy:
Measures homogeneity as Entropy(t) = −∑_j p(j∣t) log₂ p(j∣t).
Misclassification Error:
Measures error as Error(t) = 1 − max_j p(j∣t).
Finding Best Split:
Compute impurity before (P) and after splitting (M); choose the split that maximizes Gain = P−M or minimizes child impurity.
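A small plain-Python sketch of the three impurity measures and the gain P − M for one candidate two-way split; the class counts are hypothetical:

```python
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def error(counts):
    return 1 - max(counts) / sum(counts)

def gain(parent, children, impurity=gini):
    """Gain = P - M, with M the size-weighted average impurity of the child nodes."""
    n = sum(parent)
    m = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - m

parent = [7, 5]            # hypothetical class counts at the parent node
split  = [[6, 1], [1, 4]]  # class counts in the two children after a candidate test
print(gain(parent, split, gini), gain(parent, split, entropy), gain(parent, split, error))
```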
Gini Index:
Measures impurity as Gini(t) = 1 − ∑_j [p(j∣t)]².
Maximum: 1 − 1/n_c (records equally distributed among the n_c classes, least informative).
Minimum: 0 (all records in one class, most informative).
Splitting Based on GINI
In CART, SLIQ, and SPRINT, splitting quality is computed as Gini_split = ∑_i (n_i / n) Gini(i), where n_i is the number of records in child node i and n is the number of records at the parent. For a continuous attribute, efficient computation involves sorting the attribute values, scanning once to update the class counts on each side, and selecting the split position with the lowest weighted Gini index.
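A sketch of that sorted scan for a single continuous attribute (plain Python; the income-like values and labels are hypothetical): candidate split positions are taken midway between adjacent distinct values, and the position with the lowest weighted Gini is kept.

```python
from collections import Counter

def best_gini_split(values, labels):
    """Scan the sorted (value, label) pairs and return (split_point, weighted_gini)."""
    def gini(counts):
        n = sum(counts.values())
        return 1 - sum((c / n) ** 2 for c in counts.values())

    pairs = sorted(zip(values, labels))
    left, right = Counter(), Counter(label for _, label in pairs)
    best_split, best_value = None, float("inf")
    n = len(pairs)
    for i in range(1, n):
        left[pairs[i - 1][1]] += 1            # move one record from right to left
        right[pairs[i - 1][1]] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                          # cannot split between identical values
        candidate = (pairs[i - 1][0] + pairs[i][0]) / 2
        weighted = i / n * gini(left) + (n - i) / n * gini(right)
        if weighted < best_value:
            best_split, best_value = candidate, weighted
    return best_split, best_value

# Hypothetical income values (in thousands) with a binary class label.
print(best_gini_split([125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
                      ["N", "N", "N", "N", "Y", "N", "N", "Y", "N", "Y"]))
# -> split at 97.5 with weighted Gini 0.30
```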
Gain Ratio (C4.5):
Adjusts Information Gain by SplitINFO, i.e., GainRatio = Gain / SplitINFO with SplitINFO = −∑_i (n_i / n) log₂(n_i / n), to penalize splits that produce many small partitions.
Entropy:
Measures node homogeneity as Entropy(t) = −∑_j p(j∣t) log₂ p(j∣t).
Maximum: log₂(n_c) (equal class distribution).
Minimum: 0.0 (all records in one class).
Information Gain (ID3, C4.5):
Measures the entropy reduction due to a split: Gain = Entropy(p) − ∑_i (n_i / n) Entropy(i). It tends to prefer splits that produce a large number of small but pure partitions.
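A short plain-Python sketch contrasting the two measures on hypothetical class counts: an ID-like attribute that splits 20 records into 20 singleton partitions wins on raw information gain but is penalized by SplitINFO under the gain ratio.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    n = sum(parent)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return info_gain(parent, children) / split_info

parent   = [10, 10]
two_way  = [[8, 2], [2, 8]]                 # few, reasonably pure partitions
id_like  = [[1, 0]] * 10 + [[0, 1]] * 10    # 20 singleton partitions (ID-like attribute)
print(info_gain(parent, two_way), gain_ratio(parent, two_way))   # ~0.28, ~0.28
print(info_gain(parent, id_like), gain_ratio(parent, id_like))   #  1.00, ~0.23
```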
Classification Error:
Measures misclassification at a node as Error(t) = 1 − max_j p(j∣t).
Maximum: 1 − 1/n_c (equal class distribution).
Minimum: 0.0 (all records in one class).
Model Overfitting
Occurs when the model is too complex, fitting noise or small sample anomalies, leading to poor generalization.
Estimation of Generalization Errors
Use validation datasets, cross-validation, or pessimistic error estimates to evaluate model performance.
Handling Overfitting in Decision Tree Induction
Apply pruning techniques (e.g., post-pruning) or set a minimum threshold for node splits to simplify the tree.
Estimating Generalization Errors
Re-substitution Errors: Training error (∑e(t)).
Generalization Errors: Testing error (∑e′(t)).
Methods
Optimistic Approach: e′(t)=e(t).
Pessimistic Approach: Adjust errors using e′(t)=e(t)+0.5 per leaf. Total error: e′(T)=e(T)+N×0.5 (N = number of leaf nodes).
Example: for a tree with 30 leaf nodes that misclassifies 10 of 1000 training instances, the training error is 10/1000 = 1%, while the pessimistic estimate of the generalization error is (10 + 30 × 0.5)/1000 = 2.5%.
Reduced Error Pruning (REP): Use a validation dataset to estimate errors and prune the tree.
Chapter 4: Other Classification Techniques
Rule-based Classifier: Uses if-then rules for classification.
Nearest-neighbor Classifier: Classifies a record by the majority class among its k nearest neighbors.
Bayesian Classifier (naive Bayes): Applies Bayes’ theorem, assuming the attributes are conditionally independent given the class.
Artificial Neural Network (ANN): Multi-layered networks trained using backpropagation.
Support Vector Machine (SVM): Maximizes margin between classes, uses kernel tricks for non-linearity.
Ensemble Methods: Combines multiple models (e.g., Bagging, Boosting) to improve accuracy.
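For concreteness, a minimal sketch that trains a few of the classifiers above on synthetic data, assuming scikit-learn is available (the dataset and parameter values are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "k-NN (k=5)":       KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes":      GaussianNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Bagging":          BaggingClassifier(n_estimators=25, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))   # accuracy on the held-out split
```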
Chapter 7: Cluster Analysis
What is Cluster Analysis?
Groups data into clusters based on similarity, with high intra-cluster similarity and low inter-cluster similarity.
What is not Cluster Analysis?
Supervised classification (which uses predefined class labels), simple segmentation (e.g., dividing customers alphabetically), and the results of a database query are not cluster analysis.
Types of Clusterings
Partitional: Divides data into non-overlapping clusters (e.g., K-means).
Hierarchical: Produces nested clusters visualized using a dendrogram.
- Well-separated clusters: each point is closer (or more similar) to every point in its cluster than to any point outside it.
- Center-based clusters: Points are close to a centroid (e.g., K-means).
- Contiguous clusters: Points are connected based on proximity.
- Density-based clusters: Formed by dense regions (e.g., DBSCAN).
Clustering Algorithms
K-means: Assigns points to the nearest centroid and iteratively updates clusters.
Hierarchical Clustering: Builds nested clusters using linkage methods (e.g., MIN/MAX).
Density-based Clustering: Identifies clusters as dense regions of data, handles noise (e.g., DBSCAN).
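A brief density-based example, assuming scikit-learn provides the DBSCAN implementation (eps and min_samples are illustrative; points labeled -1 are treated as noise):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # non-globular shapes
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; -1 marks points treated as noise
```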
K-means Clustering
A partitional method that assigns points to the nearest centroid; requires specifying K (number of clusters).
Limitations:
Struggles with clusters of varying sizes, densities, or shapes and is sensitive to outliers.
Solution:
Use many clusters and merge them later.
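A minimal K-means sketch, assuming scikit-learn (the three Gaussian blobs and K = 3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three hypothetical Gaussian blobs around different centers.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3), (0, 3))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # learned centroids
print(km.inertia_)           # sum of squared distances to the nearest centroid (SSE)
```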
Hierarchical Clustering
Produces nested clusters visualized as a dendrogram, showing the sequence of merges or splits.
MIN (single linkage): the distance between two clusters is the distance between their closest pair of points; the two nearest clusters are merged at each step.
MAX (complete linkage): the distance between two clusters is the distance between their farthest pair of points.
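A short sketch of single- and complete-linkage clustering using SciPy (an assumed dependency); the returned linkage matrix encodes the merge sequence that a dendrogram visualizes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # a tight group
              [4.0, 4.0], [4.1, 4.2], [6.0, 0.0]])   # another group plus an outlier

single   = linkage(X, method="single")    # MIN: merge by closest pair of points
complete = linkage(X, method="complete")  # MAX: merge by farthest pair of points

print(fcluster(single, t=2, criterion="maxclust"))   # flat clusters cut from the tree
# scipy.cluster.hierarchy.dendrogram(single) would draw the merge sequence (needs matplotlib)
```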
Chapter 5: Association Analysis
Frequent Itemset: A set of items appearing together in transactions above a minimum support threshold.
Association Rule: An implication of the form X→Y, indicating relationships between itemsets.
Association Rule Mining Task
Identify frequent itemsets and generate rules satisfying minimum support and confidence.
Two-step Approach
1. Generate frequent itemsets; 2. Derive rules from frequent itemsets.
Apriori Principle: All subsets of a frequent itemset must also be frequent.
Generate Hash Tree: Efficiently organizes candidate itemsets for pruning and counting.
FP-growth Algorithm: Mines frequent itemsets using a compressed FP-tree structure.
Apriori Algorithm: Start with k=1; generate frequent 1-itemsets. Iteratively create (k+1)-itemsets from k-itemsets, prune infrequent subsets, count support, and retain frequent candidates until no new itemsets are found.
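A compact sketch of this generate-and-prune loop in plain Python (the transactions and the minimum support count are hypothetical):

```python
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer", "eggs"},
                {"milk", "diapers", "beer", "cola"},
                {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers", "cola"}]
min_sup = 3   # minimum support count

def support(itemset):
    return sum(itemset <= t for t in transactions)

# k = 1: frequent single items
frequent = [frozenset([i]) for i in set.union(*transactions)
            if support(frozenset([i])) >= min_sup]
all_frequent = list(frequent)

k = 2
while frequent:
    # Candidate generation: unions of frequent (k-1)-itemsets of size k, then prune
    # any candidate with an infrequent (k-1)-subset (the Apriori principle).
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    candidates = [c for c in candidates
                  if all(frozenset(s) in set(frequent) for s in combinations(c, k - 1))]
    frequent = [c for c in candidates if support(c) >= min_sup]
    all_frequent += frequent
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))
```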
FP-growth Algorithm:
Compress the database into an FP-tree, which retains item frequency and ordering. Use a recursive, divide-and-conquer method to efficiently mine frequent itemsets without candidate generation.
Generate Hash Tree:
Efficiently organizes candidate itemsets into nodes using a hash function to speed up pruning and counting.
FP-Tree Construction:
Compresses the database into a tree structure that retains itemset frequency information.
FP-growth (Frequent Itemset Generation):
Uses the FP-tree and a recursive divide-and-conquer approach to mine frequent itemsets.
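If a library implementation is preferred over building the FP-tree by hand, the mlxtend package (an assumed, optional dependency) exposes both Apriori and FP-growth over a one-hot encoded transaction table; a possible usage sketch:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer", "eggs"],
                ["milk", "diapers", "beer", "cola"],
                ["bread", "milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "cola"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Both calls return the same frequent itemsets (min support 0.6 = 3 of 5 transactions);
# fpgrowth builds an FP-tree and avoids explicit candidate generation.
print(apriori(df, min_support=0.6, use_colnames=True))
print(fpgrowth(df, min_support=0.6, use_colnames=True))
```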