Association Rules and Decision Trees in Machine Learning
Unsupervised Machine Learning
Market Basket Analysis (MBA)
- Definition: Identifies relationships between items frequently bought together, aiding businesses in decisions like product placement and cross-selling.
- Output: Generates association rules that describe patterns in transactions.
- Example: {butter, jam} → {bread} indicates that customers who buy butter and jam also tend to buy bread.
Association Rules
- LHS (Left-Hand Side): The antecedent, i.e., the itemset observed in a transaction (e.g., {butter, jam}).
- RHS (Right-Hand Side): The consequent, i.e., the items likely to be purchased together with the LHS (e.g., {bread}).
- Goal: Understand patterns in a transaction database, not prediction.
Apriori Algorithm
- Identifies frequent itemsets using the Apriori property: every subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent subset can be pruned.
- Reduces computational complexity by focusing only on combinations with sufficient frequency.
- The process of creating associations occurs in two phases (see the sketch below this list):
- Phase 1: Identify itemsets that meet a minimum support threshold (frequency).
- Phase 2: Create association rules from those itemsets that meet a minimum confidence threshold (certainty of co-occurrence).
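A minimal sketch of the two phases in plain Python; the transaction data and thresholds below are invented for illustration:

```python
from itertools import combinations

# Toy transaction database (illustrative data only)
transactions = [
    {"butter", "jam", "bread"},
    {"butter", "bread"},
    {"jam", "bread", "milk"},
    {"butter", "jam", "bread", "milk"},
    {"milk", "bread"},
]
min_support = 0.4      # Phase 1: minimum frequency threshold
min_confidence = 0.7   # Phase 2: minimum certainty threshold

def support(itemset):
    """Proportion of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Phase 1: grow frequent itemsets level by level, pruning infrequent candidates
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support({i}) >= min_support]
all_frequent = list(frequent)
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)

# Phase 2: derive rules LHS -> RHS with sufficient confidence
for itemset in all_frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(f"{set(lhs)} -> {set(rhs)} (confidence={conf:.2f})")
```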
Evaluating Association Rules
- Support: Proportion of transactions containing a specific subset.
- $\text{Support}(X) = \dfrac{\text{Transactions with } X}{\text{Total transactions}}$
- Confidence: Likelihood of RHS given LHS.
- $\text{Confidence}(X \to Y) = \dfrac{\text{Transactions with } X \text{ and } Y}{\text{Transactions with } X}$
- Lift: Measures the strength of association beyond random co-occurrence.
- $\text{Lift}(X \to Y) = \dfrac{\text{Confidence}(X \to Y)}{\text{Support}(Y)}$
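A worked example with hypothetical numbers: suppose there are 100 transactions, 20 contain {butter, jam}, 15 contain {butter, jam, bread}, and 50 contain {bread}. Then Support({butter, jam, bread}) = 15/100 = 0.15, Confidence({butter, jam} → {bread}) = 15/20 = 0.75, and Lift = 0.75 / 0.50 = 1.5. A lift above 1 means the items co-occur more often than would be expected by chance.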
Categorizing Rules
- Actionable: Useful insights for decision-making (e.g., cross-selling opportunities).
- Trivial: Obvious and not valuable.
- Inexplicable: No clear cause-effect relationship.
Advantages and Disadvantages of Association Rule Mining
Advantages:
- Able to work with large transactional databases.
- Results in easy-to-understand rules.
- Useful for discovering unexpected patterns in databases.
Disadvantages:
- Not very useful for small databases.
- Drawing valid conclusions takes effort, since the algorithm typically generates a large number of rules.
- Relatively easy to misinterpret random patterns as meaningful.
Supervised Machine Learning
Decision Trees
- Definition: Decision trees predict outcomes by splitting data based on variable values (numerical or categorical), producing a path of decisions that leads to a final prediction.
- Example: When deciding whether to accept a job offer, you may consider factors like:
- "Travel time greater than 1 hour?"
- "Salary greater than €50,000?"
- "Company aligns with personal values?"
- Based on these, the tree leads to a decision: ACCEPT or DECLINE.
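A minimal sketch of fitting such a tree with scikit-learn; the job-offer features, labels, and feature names below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical job-offer data: [travel_time_hours, salary_keur, values_match (0/1)]
X = [
    [0.5, 60, 1],
    [1.5, 55, 1],
    [0.3, 45, 0],
    [2.0, 70, 0],
    [0.8, 52, 1],
    [1.2, 40, 1],
]
y = ["ACCEPT", "DECLINE", "DECLINE", "DECLINE", "ACCEPT", "DECLINE"]

# Entropy-based splitting mirrors the information-gain idea described later
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned decision path as text and classify a new offer
print(export_text(tree, feature_names=["travel_time", "salary", "values_match"]))
print(tree.predict([[0.4, 58, 1]]))
```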
Advantages of Decision Trees
- Easy to visualize and interpret, even for non-experts.
- Transparent: Useful when understanding the decision-making process is crucial (e.g., credit approval, hiring).
- Works with both categorical and numerical data.
Disadvantages of Decision Trees
- Risk of overfitting: May perform well on training data but poorly on new data.
- Sensitive to data changes: Small variations can lead to a completely different tree.
- Large trees may become complex and hard to interpret.
How Decision Trees Work
- Divide and Conquer Principle:
- Start with the entire dataset (root node).
- Split data into branches based on the variable that provides the best separation of outcomes (high purity).
Node Types:
- Internal Node: Splits data further.
- Leaf Node: Represents the final decision/class.
Stopping Conditions
A tree stops growing when:
- All or most examples in a node belong to the same class (pure node).
- No more variables to split on.
- The tree reaches a predefined maximum depth.
Key Concepts
- Purity:
- Purity measures the homogeneity of a subset. A pure node contains data of only one class.
Entropy (Shannon’s Entropy):
- Metric to measure impurity. Lower entropy means higher purity.
Information Gain (IG):
- Measures the reduction in entropy achieved by a split (a worked computation is sketched below this list).
- High IG indicates a good split.
Alternative Metrics:
- Gini Index: Measures impurity.
- Chi-Square Statistic: Tests independence.
- Gain Ratio: Adjusts IG for its bias toward attributes with many distinct values.
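A minimal sketch of how entropy and information gain could be computed for a binary split; the helper functions and labels are made up for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Reduction in entropy achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical labels before and after a candidate split
parent = ["yes", "yes", "no", "no", "yes", "no"]
left = ["yes", "yes", "yes"]   # one branch, pure
right = ["no", "no", "no"]     # other branch, pure
print(entropy(parent))                         # 1.0 (maximally impure for 3 vs 3)
print(information_gain(parent, left, right))   # 1.0 -> a perfect split
```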
Common Algorithms
- C5.0 Algorithm:
- Developed by J. Ross Quinlan.
- An improvement over earlier ID3 and C4.5 algorithms.
- Focuses on maximizing purity through entropy and IG.
Tree Complexity and Pruning
- Overfitting: Trees that grow too large perform poorly on new data.
- Pruning Techniques:
- Pre-Pruning: Limit growth (e.g., max depth or minimum node size).
- Disadvantage: May miss subtle but meaningful patterns.
- Post-Pruning: Let the tree grow fully, then remove unnecessary branches.
- Improves generalization ability (both pruning styles are sketched below).
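A minimal sketch of both pruning styles using scikit-learn's DecisionTreeClassifier; the dataset and parameter values are illustrative, and cost-complexity pruning (ccp_alpha) stands in here as one concrete form of post-pruning:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap depth and require a minimum number of samples per leaf
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity (ccp_alpha > 0)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post_pruned.fit(X_train, y_train)

print("pre-pruned accuracy: ", pre_pruned.score(X_test, y_test))
print("post-pruned accuracy:", post_pruned.score(X_test, y_test))
```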
Evaluating Decision Trees
- Confusion Matrix:
- Compares predicted vs. actual outcomes to assess accuracy.
- Key metrics include:
- True Positive (TP): Correctly predicted positives.
- False Positive (FP): Incorrectly predicted as positive.
- False Negative (FN): Missed positives.
- True Negative (TN): Correctly predicted negatives.
Cost Matrix:
- Weighs certain errors (e.g., FP, FN) higher if their consequences are more severe.
- Example: In medical diagnosis, missing a disease (FN) is costlier than a false alarm (FP).
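A minimal sketch of deriving these counts with scikit-learn's confusion_matrix and applying an illustrative cost matrix; the labels, predictions, and cost values are made up:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual vs. predicted outcomes (1 = disease present, 0 = absent)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

# A cost matrix weights some errors more heavily, e.g. a missed disease (FN)
costs = {"FP": 1, "FN": 10}  # illustrative: FN is 10x costlier than FP
total_cost = fp * costs["FP"] + fn * costs["FN"]
print("Total misclassification cost:", total_cost)
```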