Machine Learning Methods: Supervised Learning, Simulation, and Ensemble Techniques
1. Supervised Learning Basics
Linear Regression is a foundational model for supervised learning, designed to predict a continuous outcome variable Y based on one or more predictors X1, X2, …, Xp. The model minimizes the sum of squared errors (SSE) between observed and predicted values, producing the line of best fit. The linear regression equation is:
Y = β0 + β1X1 + … + βpXp + ε
- Estimation of Coefficients: Using the normal equation β̂ = (X'X)⁻¹X'Y, the coefficients βi are estimated to minimize SSE. Each βi represents the effect of a one-unit change in the corresponding predictor Xi on Y, holding the other predictors constant.
- Interpreting Outputs:
- Coefficient Signs: Positive coefficients indicate a positive association with Y, while negative ones indicate a negative association.
- R-Squared (R2): Represents the proportion of variance in Y explained by the model. Higher values indicate a better fit, though values close to 1 can signal overfitting.
- p-values: Test whether each coefficient significantly differs from zero. Low p-values (typically < 0.05) suggest significance.
Inference is about understanding relationships within the data; prediction is about forecasting future outcomes.
Coding Tips:
- Coefficient interpretation: each coefficient represents the expected change in the response variable for a one-unit change in the predictor, holding the other variables constant.
- If residuals show skewness, applying a log transformation to the predictor or response variable can improve the fit; be sure to back-transform the log predictions to the original units before computing the RMSE (see the sketch after these tips).
- Categorical variables (factors) are recoded as binary dummy variables: a factor with 70 ZIP codes becomes 69 dummies (one level is the reference), and factors with more than about 50 levels typically trigger a warning.
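A minimal sketch of these tips in R, assuming a hypothetical data frame housing with a skewed price response and a few illustrative predictors:

```r
# Hypothetical data: `housing` with columns price, sqft, age, zip
fit <- lm(log(price) ~ sqft + age + factor(zip), data = housing)
summary(fit)   # coefficients, p-values, R-squared

# Back-transform the log predictions to the original units before computing RMSE
pred <- exp(predict(fit, newdata = housing))
rmse <- sqrt(mean((housing$price - pred)^2))
rmse
```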
Logistic Regression is used for binary classification problems, where the outcome variable is categorical (e.g., yes/no). We model the probability of an event occurring using a linear combination of predictors:
logit(p) = ln(p / (1 − p)) = β0 + β1X1 + … + βpXp; the logit transforms a probability p in (0, 1) onto an unbounded continuous scale.
Inverse logit: p = 1 / (1 + e^(−y)), where y = β0 + β1X1 + … + βpXp and p is the probability of the event occurring (e.g., a purchase).
The logit is therefore the link function that maps the linear predictor (continuous) to the probability scale.
- Odds Ratios: e^βi gives the odds ratio for each predictor, representing the multiplicative change in the odds for a one-unit increase in Xi. An odds ratio > 1 suggests a positive association. Increasing Xi by one unit multiplies the odds by e^βi; for example, if βi = 0.5 the odds ratio is e^0.5 ≈ 1.65, so each additional hour spent online multiplies the odds of buying by about 1.65, i.e., a 65% increase in the odds (not in the probability itself).
- Logistic Regression + MLE: the coefficients are found by maximum likelihood estimation. The likelihood measures how likely the observed data are given the model's predictions; each predicted probability is pi = 1 / (1 + e^(−(β0 + β1Xi1 + … + βpXip))), and the likelihood being maximized is L(β) = Π pi^yi (1 − pi)^(1 − yi).
- Performance Metrics:
- Confusion Matrix: Summarizes model predictions, showing true positives, true negatives, false positives, and false negatives.
Sensitivity = Recall = TPR = TP / (TP + FN); Precision = PPV = TP / (TP + FP); Specificity = TNR = TN / (TN + FP); FPR = 1 − Specificity = FP / (FP + TN)
- AUC-ROC Curve: Assesses the trade-off between sensitivity (TPR) and the false positive rate (1 − specificity) across thresholds. A higher AUC indicates a better model: 1 is perfect, 0.5 is a random model.
- Cumulative Lift Chart: lift ratio = P(Truth = 1 | top x% scored) / P(Truth = 1 | population).
- If the cost of errors is known, use it directly as the performance metric (convention here: gains are negative, costs are positive); lift measures how many more positives we capture than by picking at random.
Low specificity means that many customers who will not default are being incorrectly classified as defaulters (false positives). In this context, this leads to unnecessary actions against non-defaulters, such as restricting credit. To improve, we can raise the classification threshold to reduce the number of false positives and correctly identify more non-defaulters. However, this adjustment needs to be balanced: pushing the threshold too far increases false negatives (i.e., missing actual defaulters).
In a two-stage modeling approach, we first build two separate models:
1- The first model predicts the probability that a client will make a purchase (upgrade).
2- The second model predicts the amount the client will spend, fitted only on clients who purchase, so the prediction applies conditional on purchasing.
3- Once both models are built, we combine them by multiplying the predicted probability of purchasing by the expected amount, giving the expected conditional spend for each customer (see the sketch below). This lets us predict not only the likelihood of purchase but also how much the client will spend if they do purchase.
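A hedged sketch of this two-stage combination, assuming a hypothetical data frame clients with a binary purchase flag, a spend amount, and illustrative predictors age, income, and visits:

```r
# Stage 1: probability of purchase, fitted on all clients
m_prob <- glm(purchase ~ age + income + visits, data = clients, family = binomial)

# Stage 2: expected amount, fitted only on clients who actually purchased
m_amount <- lm(amount ~ age + income + visits, data = subset(clients, purchase == 1))

# Combine: expected conditional spend = P(purchase) * E[amount | purchase]
p_hat      <- predict(m_prob, newdata = clients, type = "response")
amount_hat <- predict(m_amount, newdata = clients)
clients$expected_spend <- p_hat * amount_hat
```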
(type="response")
so that we have a probability and not a continuous prediction- Lift = 3, the model identifies 3 times as many positive events than a model based on random guessing. Depth % of people who get the offer; lift depends on the prevalence of the event in the population; if high = maximum achievable lift is lower because random guessing would do the same
- Consider adjusting the decision threshold from the default 0.5 to balance between false positives and false negatives depending on the use case.
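A minimal sketch of fitting, scoring, and thresholding in base R, assuming a hypothetical default_df data set with a 0/1 default column and illustrative predictors; the 0.3 threshold is purely for illustration:

```r
m <- glm(default ~ balance + income, data = default_df, family = binomial)

# type = "response" returns probabilities, not linear-predictor values
p_hat <- predict(m, type = "response")

# Move the threshold away from 0.5 depending on the cost of FP vs FN
threshold  <- 0.3
pred_class <- factor(ifelse(p_hat >= threshold, 1, 0), levels = c(0, 1))

# Confusion matrix, sensitivity, and specificity
tab <- table(truth = default_df$default, predicted = pred_class)
sensitivity <- tab["1", "1"] / sum(tab["1", ])
specificity <- tab["0", "0"] / sum(tab["0", ])
```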
2. Simulation and Bootstrapping
Simulation involves generating synthetic data to model and understand complex random processes. In R, pseudo-random number generators (RNGs) such as rnorm() for the normal distribution or runif() for the uniform distribution can simulate draws from various distributions. The Bootstrap method provides a powerful way to estimate variability, allowing estimation of confidence intervals and variances without relying on strict assumptions about the data's distribution.
- Bootstrap Steps:
- Draw multiple samples (with replacement) from the original dataset.
- Calculate a statistic (e.g., mean or median) for each sample.
- Use the distribution of the sample statistics to estimate confidence intervals.
Interpretation of Outputs:
- Bootstrap Confidence Intervals: These intervals give an empirical measure of uncertainty for estimated statistics, which can be more reliable when standard assumptions are invalid.
- Variability in Simulated Data: Large variations across bootstrap samples indicate higher uncertainty.
Common Transformations:
- Inverse Transform Sampling: Generates random samples from any distribution by transforming uniform random variables. For example, for an exponential distribution with rate λ, use X = −(1/λ) ln(U), where U ~ Uniform(0, 1).
- Box-Muller for Normals: Uses two uniform random variables U1 and U2 to generate standard normal variables: Z1 = √(−2 ln U1) cos(2πU2) and Z2 = √(−2 ln U1) sin(2πU2).
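Both transformations are easy to verify empirically in R:

```r
set.seed(42)
n <- 10000

# Inverse transform sampling: exponential with rate lambda
lambda <- 2
u     <- runif(n)
x_exp <- -log(u) / lambda            # X = -(1/lambda) * ln(U)

# Box-Muller: two independent standard normals from two uniforms
u1 <- runif(n); u2 <- runif(n)
z1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
z2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)

# Sanity checks against theory
mean(x_exp)   # should be close to 1/lambda = 0.5
sd(z1)        # should be close to 1
```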
Coding Tips:
- Use sample() in R to draw bootstrap samples and calculate confidence intervals. The boot package can simplify this process, particularly for complex models.
- For Monte Carlo simulations, use replicate() to repeat a process and observe variability across iterations.
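A minimal bootstrap sketch in base R, using an assumed numeric vector x; the boot package's boot() and boot.ci() wrap the same idea for more complex statistics:

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)    # example data

# Draw B resamples (with replacement) and compute the statistic on each
B <- 2000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile confidence interval from the bootstrap distribution
quantile(boot_means, probs = c(0.025, 0.975))

# Spread of the bootstrap statistics = empirical measure of uncertainty
sd(boot_means)
```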
3. Decision Trees and Pruning
Decision Trees are interpretable models that segment data by making sequential splits based on predictor values. For classification tasks, the tree splits the data to maximize the homogeneity of the target classes in each subset; regression trees are designed for continuous target variables and choose splits that minimize the variance of the target within each node.
- Splitting Criteria:
- Gini Index (classification) measures node impurity: Gini = 1 − Σ(pk)², which is minimized (at 0) when a node is pure, i.e., one class has probability 1.
- Entropy (classification) measures impurity / information gain: Entropy = −Σ pk log2(pk), where pk is the proportion of observations in each class.
- Mean Squared Error (for regression trees) is minimized at each split, reducing variance within the resulting nodes.
Pruning Techniques (splits themselves are always chosen using an objective measure of diversity/impurity):
- Pre-Pruning (Early Stopping): Limits tree depth by setting conditions on minimum samples per node or maximum tree depth. This prevents overfitting but risks underfitting.
- Post-Pruning (Cost-Complexity Pruning): Penalized risk, also known as cost-complexity pruning, is used to avoid overfitting in decision trees. It adds a penalty for the size of the tree (number of leaves or splits) to the regular error (misclassification, variance, etc.). The penalized risk is Rα(T) = R(T) + α|T|, where R(T) is the error of tree T and |T| is its number of leaves. A higher α results in a smaller tree by pruning more branches; a lower α allows more complexity.
- Missing Values: Surrogate splits. When a value is missing for the primary feature chosen at a split, a surrogate split acts as a backup, using another feature whose split closely mimics the primary one (based on its agreement with the primary split). This lets the tree make a decision even when the primary splitting feature is missing, maintaining model robustness.
Interpreting Decision Tree Outputs:
- Leaf Nodes: Each leaf represents a final decision or prediction. In classification, the dominant class in the leaf is the predicted class.
- Focus on the xerror column to identify the tree with the smallest cross-validated error, as this indicates the best-performing subtree. If several trees have the same xerror, prefer those with lower xstd, which suggests more stable, robust performance across data splits. Look at the nsplit column: a higher number of splits indicates a more complex model with more leaves, which may overfit if complexity is too high.
- Tree Depth: A deeper tree typically captures more detail but may overfit. Use cross-validation to decide optimal depth.
- Feature Importance: the total decrease in impurity (information gain) contributed by each variable across all the splits that use it, usually reported relative to the top variable: e.g., if couleur appears in second place with a value of 27%, it explains 27% of what the top variable (clarté) explains. Feature importance is not an inference tool; it is not the same thing as statistical significance.
- Strengths of trees: interpretability, built-in handling of missing values, no need for variable transformations, capture interactions naturally, robust to extreme values. Weaknesses: rarely the best raw performance; they work best when the true decision boundaries are rectangular (axis-aligned) rather than curved.
Coding Tips:
- In R, rpart() or tree() can be used for tree building, with pruning guided by cross-validation (the cp parameter in rpart). Visualize the tree structure with plot() to understand splits and node contents.
- For feature importance, randomForest() provides ranked feature importance based on the average impurity decrease (see the sketch below).
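A hedged sketch of this workflow, assuming a data frame df with a factor target y; the starting cp value is illustrative:

```r
library(rpart)
library(randomForest)

# Grow a deliberately large tree, then inspect the cross-validation table
fit <- rpart(y ~ ., data = df, method = "class", cp = 0.001)
printcp(fit)                       # columns include nsplit, xerror, xstd

# Prune back to the cp with the smallest cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
plot(pruned); text(pruned)         # visualize splits and node contents

# Ranked feature importance from a random forest
rf <- randomForest(y ~ ., data = df, importance = TRUE)
importance(rf)
varImpPlot(rf)
```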
4. Advanced Methods: Support Vector Machines (SVMs) and Ensemble Learning
Support Vector Machines (SVMs) are particularly effective for classification tasks, especially in high-dimensional spaces. An SVM identifies the hyperplane that maximizes the margin between classes, defined as the distance between the hyperplane and the closest data points (support vectors).
Uplift modeling: estimates the incremental effect of a treatment on an outcome by comparing the difference in the probability of a desired outcome between a treated group and a control group. This method helps identify "persuadables": customers who are more likely to respond positively because of the treatment.
- Soft Margin: For data that are not linearly separable, a soft margin allows some misclassifications, balancing margin width with classification accuracy. The penalty parameter C regulates this trade-off; high values lead to fewer misclassifications but may overfit. Low C allows more errors for better generalization.
- Kernel Trick: Enables SVMs to classify non-linear data by mapping them into a higher-dimensional space. Common kernels include:
- Linear: Suitable for linearly separable data.
- Polynomial: Allows curved boundaries for complex patterns.
- Radial Basis Function (RBF): A common choice for non-linear boundaries, controlled by parameter γ, which affects the influence of points on the classification boundary.
Interpreting SVM Outputs:
- Support Vectors: These data points directly influence the position of the separating hyperplane and are critical to the classification boundary.
- Model Accuracy: Use the confusion matrix and cross-validation accuracy to assess model performance. A large number of support vectors may indicate a more flexible model but also signal potential overfitting.
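A minimal sketch using the e1071 package (one common SVM interface in R), with assumed train_df / test_df data frames and a factor target y; cost and gamma here are illustrative and should be tuned:

```r
library(e1071)

# RBF-kernel SVM: cost (C) controls the soft margin, gamma the kernel width
m_svm <- svm(y ~ ., data = train_df, kernel = "radial", cost = 1, gamma = 0.1)

length(m_svm$index)                        # number of support vectors
pred <- predict(m_svm, newdata = test_df)
table(truth = test_df$y, predicted = pred) # confusion matrix on the test set

# Grid search with cross-validation over cost and gamma
tuned <- tune(svm, y ~ ., data = train_df, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tuned)
```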
Ensemble methods combine multiple models (often called “weak learners”) to improve predictive performance. The idea is that by combining several models, the strengths of each model can offset the weaknesses of others, leading to better overall accuracy and robustness. Here are some of the main types of ensemble methods, along with their strengths and weaknesses:
1. Bagging (Bootstrap Aggregating):
- Definition: Bagging builds multiple models (often decision trees) using different random subsets of the data (bootstrapping). Predictions are made by averaging (for regression) or voting (for classification) across the models. Example: Random Forest is a popular algorithm based on bagging.
- Strengths: Reduces variance in model predictions by averaging across many models.
- Helps to prevent overfitting, especially in decision trees.
- Weaknesses: Doesn’t always reduce bias, especially if individual models are biased.
- Computationally intensive if using a large number of models.
2. Boosting:
- Definition: Boosting works by training models sequentially. Each new model focuses on correcting the errors made by the previous models, improving performance with each iteration.
- Examples: AdaBoost and Gradient Boosting. AdaBoost adds weights to observations so that later models focus on the cases misclassified by earlier ones.
- Strengths: Reduces bias by focusing on improving weak learners in each step.
- Can result in very accurate models.
- Weaknesses: Prone to overfitting if not controlled properly.
- More sensitive to noise in the data, as errors are amplified in later stages.
3. Random Forest (a special case of bagging):
- Definition: Random Forest is an ensemble method that creates a large number of decision trees using random subsets of features and data. The final prediction is the average (for regression) or majority vote (for classification) of the individual trees.
- Strengths: Excellent at reducing variance compared to a single decision tree.
- Handles missing data and categorical variables well.
- Robust against overfitting in many cases.
- Weaknesses: Less interpretable than a single decision tree.
- Can be computationally expensive with a large number of trees.
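A hedged comparison sketch of bagging versus boosting, assuming a data frame df with a 0/1 numeric target y (randomForest wants a factor, gbm wants a numeric response); hyperparameters are illustrative:

```r
library(randomForest)
library(gbm)

# Bagging-style ensemble: random forest averages many decorrelated trees
df_rf <- transform(df, y = factor(y))       # factor target for classification
rf <- randomForest(y ~ ., data = df_rf, ntree = 500)

# Boosting: sequential trees, each one correcting the previous ones' errors
gb <- gbm(y ~ ., data = df, distribution = "bernoulli",
          n.trees = 1000, interaction.depth = 3,
          shrinkage = 0.01, cv.folds = 5)

# Choose the number of boosting iterations by cross-validation
best_iter <- gbm.perf(gb, method = "cv")
p_boost   <- predict(gb, newdata = df, n.trees = best_iter, type = "response")
```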
Coding Tips:
Cross-Validation: Check the mean accuracy and standard deviation. A smaller deviation suggests more consistent results.
AUC/ROC Curve: Look for an AUC closer to 1, indicating strong classification performance. AUC above 0.7 is acceptable; 0.8-0.9 is good.
Confusion Matrix: Analyze True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). High FP or FN rates require threshold adjustment.
Variable Importance: Focus on the top variables contributing the most. High values indicate stronger influence on predictions.
R-squared & Adjusted R-squared: Higher values indicate a better fit. Adjusted R-squared accounts for model complexity.
Residual Plots: Check for random scatter, which indicates a good fit. Patterns suggest model misspecification.
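These checks can be scripted, for example with the caret and pROC packages (one possible tooling choice, assuming a data frame df with a factor target y with levels "no"/"yes"):

```r
library(caret)
library(pROC)

# 5-fold cross-validation: mean accuracy and its spread across folds
ctrl   <- trainControl(method = "cv", number = 5, classProbs = TRUE)
cv_fit <- train(y ~ ., data = df, method = "rf", trControl = ctrl)
mean(cv_fit$resample$Accuracy)
sd(cv_fit$resample$Accuracy)

# AUC from predicted probabilities of the positive class
p_hat   <- predict(cv_fit, newdata = df, type = "prob")[, "yes"]
roc_obj <- roc(response = df$y, predictor = p_hat)
auc(roc_obj)
```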