Association Rules and Decision Trees in Machine Learning

Unsupervised Machine Learning

Market Basket Analysis (MBA)

  • Definition: Identifies relationships between items frequently bought together, aiding businesses in decisions like product placement and cross-selling.
  • Output: Generates association rules that describe patterns in transactions.
    • Example: {butter, jam} → {bread} implies that buying butter and jam often leads to buying bread.

Association Rules

  • LHS (Left-Hand Side): Items frequently purchased together (e.g., {butter, jam}).
  • RHS (Right-Hand Side): Related items likely to be purchased with LHS (e.g., {bread}).
  • Goal: Understand patterns in a transaction database, not prediction.

Apriori Algorithm

  • Identifies frequent itemsets using the Apriori property: every subset of a frequent itemset must itself be frequent, so candidates containing an infrequent subset can be pruned.
  • Reduces computational complexity by focusing only on combinations with sufficient frequency.
  • The process of creating associations occurs in two phases:
  1. Identify itemsets that meet a minimum support threshold (frequency).
  2. Create association rules that meet a minimum confidence threshold (certainty of co-occurrence); a short sketch of both phases follows below.
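
A minimal Python sketch of the two phases on an invented toy transaction database (item names, thresholds, and the restriction to item pairs are assumptions made purely for illustration; libraries such as mlxtend in Python or arules in R implement the full algorithm):

```python
from itertools import combinations

# Toy transaction database (invented for illustration).
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "jam", "milk"},
]

def support(itemset):
    """Proportion of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.7

# Phase 1: find frequent itemsets, building larger candidates only from
# smaller frequent ones (the Apriori pruning idea), here up to pairs.
frequent_1 = [{item} for item in set().union(*transactions)
              if support({item}) >= MIN_SUPPORT]
candidate_pairs = [a | b for a, b in combinations(frequent_1, 2)]
frequent_2 = [s for s in candidate_pairs if support(s) >= MIN_SUPPORT]

# Phase 2: turn frequent pairs into rules LHS -> RHS and keep the confident ones.
for itemset in frequent_2:
    for lhs_item in itemset:
        lhs, rhs = {lhs_item}, itemset - {lhs_item}
        confidence = support(itemset) / support(lhs)
        if confidence >= MIN_CONFIDENCE:
            print(f"{lhs} -> {rhs}  confidence={confidence:.2f}")
```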

Evaluating Association Rules

  1. Support: Proportion of transactions containing a specific subset.
    • $\text{Support}(X) = \frac{\text{Transactions with } X}{\text{Total Transactions}}$
  2. Confidence: Likelihood of RHS given LHS.
    • $\text{Confidence}(X \to Y) = \frac{\text{Transactions with } X \text{ and } Y}{\text{Transactions with } X}$
  3. Lift: Measures the strength of association beyond random co-occurrence.
    • $\text{Lift}(X \to Y) = \frac{\text{Confidence}(X \to Y)}{\text{Support}(Y)}$
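  • Worked example (invented numbers, matching the toy transactions sketched above): out of 5 transactions, 3 contain {butter, jam}, 2 contain {butter, jam, bread}, and 4 contain {bread}. Then Support of the rule = 2/5 = 0.40, Confidence({butter, jam} → {bread}) = 0.40 / (3/5) ≈ 0.67, and Lift ≈ 0.67 / (4/5) ≈ 0.83. A lift below 1 means the items co-occur less often than independence would predict; a lift above 1 indicates a positive association.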

Categorizing Rules

  • Actionable: Useful insights for decision-making (e.g., cross-selling opportunities).
  • Trivial: Obvious and not valuable.
  • Inexplicable: No clear cause-effect relationship.

Advantages and Disadvantages of Association Rule Mining

Advantages:

  1. Able to work with large transactional databases.
  2. Results in easy-to-understand rules.
  3. Useful for discovering unexpected patterns in databases.

Disadvantages:

  1. Not very useful for small databases.
  2. It takes effort to draw valid conclusions.
  3. Relatively easy to misinterpret random patterns as meaningful.


Supervised Machine Learning

Decision Trees

  • Definition:
    Decision trees predict outcomes by splitting data based on variable values (numerical or categorical), providing a path of decisions leading to a final prediction.
  • Example:
    When deciding whether to accept a job offer, you may consider factors like:
    • “Travel time greater than 1 hour?”
    • “Salary greater than €50,000?”
    • “Company aligns with personal values?”
      Based on these, the tree leads to a decision: ACCEPT or DECLINE.
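
As a minimal sketch, a tree like this can be fitted with scikit-learn's DecisionTreeClassifier; the encoded job-offer data below is invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [travel_time_hours, salary_k_eur, values_aligned (0/1)]
X = [
    [0.5, 55, 1],
    [1.5, 60, 1],
    [2.0, 45, 0],
    [0.8, 40, 1],
    [1.2, 70, 0],
    [0.3, 52, 0],
]
y = ["ACCEPT", "ACCEPT", "DECLINE", "DECLINE", "ACCEPT", "DECLINE"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned splits: internal nodes test a threshold, leaves give the class.
print(export_text(tree, feature_names=["travel_time", "salary", "values_aligned"]))

# Classify a new offer by following the path of decisions.
print(tree.predict([[1.0, 58, 1]]))
```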

Advantages of Decision Trees

  1. Easy to visualize and interpret, even for non-experts.
  2. Transparent: Useful when understanding the decision-making process is crucial (e.g., credit approval, hiring).
  3. Works with both categorical and numerical data.

Disadvantages of Decision Trees

  1. Risk of overfitting: May perform well on training data but poorly on new data.
  2. Sensitive to data changes: Small variations can lead to a completely different tree.
  3. Large trees may become complex and hard to interpret.

How Decision Trees Work

  1. Divide and Conquer Principle:
  • Start with the entire dataset (root node).
  • Split data into branches based on the variable that provides the best separation of outcomes (high purity).

Node Types:

  • Internal Node: Splits data further.
  • Leaf Node: Represents the final decision/class.


Stopping Conditions

A tree stops growing when:

  1. All or most examples in a node belong to the same class (pure node).
  2. No more variables to split on.
  3. The tree reaches a predefined maximum depth.
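
To make the divide-and-conquer process and the stopping conditions concrete, here is a simplified Python sketch of recursive tree building (the toy data and function names are invented, and it uses Gini impurity as the splitting criterion for brevity rather than C5.0's entropy-based gain):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0 means the node is pure (one class only)."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Divide and conquer: split on the feature/threshold that leaves the
    purest children, until a stopping condition is reached."""
    # Stopping conditions: pure node, no features, or maximum depth reached.
    if gini(labels) == 0 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]   # leaf node: majority class

    # Try every binary split and keep the one with the lowest weighted impurity.
    best = None
    for f in features:
        for threshold in {row[f] for row in rows}:
            left = [i for i, row in enumerate(rows) if row[f] <= threshold]
            right = [i for i in range(len(rows)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, threshold, left, right)

    if best is None:                                  # no valid split (rows identical)
        return Counter(labels).most_common(1)[0][0]

    _, f, threshold, left, right = best
    return {"split": (f, threshold),                  # internal node
            "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                               features, depth + 1, max_depth),
            "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                                features, depth + 1, max_depth)}

# Invented job-offer data for illustration.
rows = [{"travel": 0.5, "salary": 55, "values": 1},
        {"travel": 2.0, "salary": 45, "values": 0},
        {"travel": 1.2, "salary": 70, "values": 0},
        {"travel": 0.8, "salary": 40, "values": 1}]
labels = ["ACCEPT", "DECLINE", "ACCEPT", "DECLINE"]
print(build_tree(rows, labels, features=["travel", "salary", "values"]))
```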

Key Concepts

  1. Purity:
  • Purity measures the homogeneity of a subset. A pure node contains data of only one class.
  2. Entropy (Shannon's Entropy):
  • Metric for measuring impurity. Lower entropy means higher purity.
  3. Information Gain (IG):
  • Measures the reduction in entropy after a split.
  • High IG indicates a good split.
  4. Alternative Metrics:
  • Gini Index: Measures impurity; a common alternative to entropy.
  • Chi-Square Statistic: Tests independence between the split variable and the class.
  • Gain Ratio: Adjusts IG for its bias toward splits with many distinct values.
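
A small Python sketch of these metrics (class labels invented for illustration); note how a good split reduces entropy far more than a poor one:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a set of class labels; 0 for a pure node."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gini(labels):
    """Gini impurity; also 0 for a pure node."""
    total = len(labels)
    return 1 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Reduction in entropy achieved by splitting `parent` into `children`."""
    weighted = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["ACCEPT", "ACCEPT", "ACCEPT", "DECLINE", "DECLINE", "DECLINE"]
good_split = [["ACCEPT", "ACCEPT", "ACCEPT"], ["DECLINE", "DECLINE", "DECLINE"]]
poor_split = [["ACCEPT", "DECLINE", "ACCEPT"], ["DECLINE", "ACCEPT", "DECLINE"]]

print(entropy(parent), gini(parent))          # 1.0 and 0.5 for a 50/50 node
print(information_gain(parent, good_split))   # 1.0: children are pure
print(information_gain(parent, poor_split))   # ~0.08: children nearly as mixed as parent
```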

Common Algorithms

  1. C5.0 Algorithm:
  • Developed by J. Ross Quinlan.
  • An improvement over earlier ID3 and C4.5 algorithms.
  • Focuses on maximizing purity through entropy and IG.


Tree Complexity and Pruning

  • Overfitting: Trees that grow too large perform poorly on new data.
  • Pruning Techniques:
  1. Pre-Pruning: Limit growth (e.g., maximum depth or minimum node size).
  • Disadvantage: May miss subtle but meaningful patterns.
  2. Post-Pruning: Let the tree grow fully, then remove unnecessary branches.
  • Improves generalization ability (see the sketch below).
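
A hedged scikit-learn sketch of both ideas: max_depth and min_samples_leaf act as pre-pruning limits, while ccp_alpha applies cost-complexity pruning after growth (one common post-pruning approach; C5.0 uses its own pruning procedure). The dataset is a standard scikit-learn example, not one from these notes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth and leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow fully, then collapse branches whose contribution
# does not justify their complexity (controlled by ccp_alpha).
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
post.fit(X_train, y_train)

print("pre-pruned depth:", pre.get_depth(), " test accuracy:", pre.score(X_test, y_test))
print("post-pruned depth:", post.get_depth(), " test accuracy:", post.score(X_test, y_test))
```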

Evaluating Decision Trees

  1. Confusion Matrix:
  • Compares predicted vs. actual outcomes to assess accuracy.
  • Key metrics include:
    • True Positive (TP): Correctly predicted positives.
    • False Positive (FP): Incorrectly predicted as positive.
    • False Negative (FN): Missed positives.
    • True Negative (TN): Correctly predicted negatives.

  2. Cost Matrix:
  • Weighs certain errors (e.g., FP, FN) higher if their consequences are more severe.
  • Example: In medical diagnosis, missing a disease (FN) is costlier than a false alarm (FP).
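
A short sketch of computing the confusion matrix with scikit-learn and applying a hypothetical cost matrix that penalizes false negatives more heavily (the labels and cost values are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Invented predictions from a diagnostic classifier (1 = disease present).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels {0, 1}, ravel() returns counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")

# Cost matrix: a missed disease (FN) is assumed ten times costlier than a false alarm (FP).
COST_FP, COST_FN = 1, 10
print("total misclassification cost:", fp * COST_FP + fn * COST_FN)
```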