Machine Learning Methods: Supervised Learning, Simulation, and Ensemble Techniques
1. Supervised Learning Basics
Linear Regression is a foundational model for supervised learning, designed to predict a continuous outcome variable Y based on one or more predictors X1, X2, …, Xp. The model minimizes the sum of squared errors (SSE) between observed and predicted values, producing the line of best fit. The linear regression equation is:
Y = β0 + β1X1 + … + βpXp + ε
- Estimation of Coefficients: Using the normal equation β̂ = (X'X)⁻¹X'Y, the coefficients βi are estimated to minimize SSE. Each βi represents the effect of a one-unit change in the corresponding predictor Xi on Y, holding the other predictors constant.
- Interpreting Outputs:
- Coefficient Signs: Positive coefficients indicate a positive association with Y, while negative ones indicate a negative association.
- R-Squared (R2): Represents the proportion of variance in Y explained by the model. Higher values indicate a better fit, though values close to 1 can signal overfitting.
- p-values: Test whether each coefficient significantly differs from zero. Low p-values (typically < 0.05) suggest significance.
Inference is about understanding relationships within the data; prediction is about forecasting future outcomes.
Coding Tips:
- Coefficient interpretation: each coefficient represents the expected change in the response variable for a one-unit change in the predictor, holding the other variables constant.
- If residuals show skewness, applying a log transformation to the predictor or response variable can improve the fit; be sure to back-transform the log predictions to the original units before computing the RMSE (see the sketch after these tips).
- Categorical variables (factors) are recoded as binary dummy variables: a factor with 70 ZIP codes becomes 69 dummies (one level is the reference), and factors with more than about 50 levels typically trigger a warning.
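A minimal sketch of these tips in R, assuming a hypothetical data frame housing with a skewed price response and a few illustrative predictors:

```r
# Hypothetical data: `housing` with columns price, sqft, age, zip
fit <- lm(log(price) ~ sqft + age + factor(zip), data = housing)
summary(fit)   # coefficients, p-values, R-squared

# Back-transform the log predictions to the original units before computing RMSE
pred <- exp(predict(fit, newdata = housing))
rmse <- sqrt(mean((housing$price - pred)^2))
rmse
```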
Logistic Regression is used for binary classification problems, where the outcome variable is categorical (e.g., yes/no). We model the probability of an event occurring using a linear combination of predictors:
logit(p) = ln(p / (1 − p)) = β0 + β1X1 + … + βpXp; the logit transforms a probability p in (0, 1) onto an unbounded continuous scale.
Inverse logit: p = 1 / (1 + e^(−y)), where y = β0 + β1X1 + … + βpXp and p is the probability of the event occurring (e.g., a purchase).
The logit is therefore the link function that maps the linear predictor (continuous) to the probability scale.
- Odds Ratios: e^βi gives the odds ratio for each predictor, representing the multiplicative change in the odds for a one-unit increase in Xi. An odds ratio > 1 suggests a positive association. Increasing Xi by one unit multiplies the odds by e^βi; for example, if βi = 0.5 the odds ratio is e^0.5 ≈ 1.65, so each additional hour spent online multiplies the odds of buying by about 1.65, i.e., a 65% increase in the odds (not in the probability itself).
- Logistic Regression + MLE: the coefficients are found by maximum likelihood estimation. The likelihood measures how likely the observed data are given the model's predictions; each predicted probability is pi = 1 / (1 + e^(−(β0 + β1Xi1 + … + βpXip))), and the likelihood being maximized is L(β) = Π pi^yi (1 − pi)^(1 − yi).
- Performance Metrics:
- Confusion Matrix: Summarizes model predictions, showing true positives, true negatives, false positives, and false negatives.
Sensitivity = Recall = TPR = TP / (TP + FN); Precision = PPV = TP / (TP + FP); Specificity = TNR = TN / (TN + FP); FPR = 1 − Specificity = FP / (FP + TN)
- AUC-ROC Curve: Assesses the trade-off between sensitivity (TPR) and the false positive rate (1 − specificity) across thresholds. A higher AUC indicates a better model: 1 is perfect, 0.5 is a random model.
- Cumulative Lift Chart: lift ratio = P(Truth = 1 | top x% scored) / P(Truth = 1 | population).
- If the cost of errors is known, use it directly as the performance metric (convention here: gains are negative, costs are positive); lift measures how many more positives we capture than by picking at random.
Low specificity means that many customers who will not default are being incorrectly classified as defaulters (false positives). In this context, this leads to unnecessary actions against non-defaulters, such as restricting credit. To improve, we can raise the classification threshold to reduce the number of false positives and correctly identify more non-defaulters. However, this adjustment needs to be balanced: pushing the threshold too far increases false negatives (i.e., missing actual defaulters).
In a two-stage modeling approach, we first build two separate models:
1- The first model predicts the probability that a client will make a purchase (upgrade).
2- The second model predicts the amount the client will spend, fitted only on clients who purchase, so the prediction applies conditional on purchasing.
3- Once both models are built, we combine them by multiplying the predicted probability of purchasing by the expected amount, giving the expected conditional spend for each customer (see the sketch below). This lets us predict not only the likelihood of purchase but also how much the client will spend if they do purchase.
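A hedged sketch of this two-stage combination, assuming a hypothetical data frame clients with a binary purchase flag, a spend amount, and illustrative predictors age, income, and visits:

```r
# Stage 1: probability of purchase, fitted on all clients
m_prob <- glm(purchase ~ age + income + visits, data = clients, family = binomial)

# Stage 2: expected amount, fitted only on clients who actually purchased
m_amount <- lm(amount ~ age + income + visits, data = subset(clients, purchase == 1))

# Combine: expected conditional spend = P(purchase) * E[amount | purchase]
p_hat      <- predict(m_prob, newdata = clients, type = "response")
amount_hat <- predict(m_amount, newdata = clients)
clients$expected_spend <- p_hat * amount_hat
```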
(type="response")
so that we have a probability and not a continuous prediction- Lift = 3, the model identifies 3 times as many positive events than a model based on random guessing. Depth % of people who get the offer; lift depends on the prevalence of the event in the population; if high = maximum achievable lift is lower because random guessing would do the same
- Consider adjusting the decision threshold from the default 0.5 to balance between false positives and false negatives depending on the use case.
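A minimal sketch of fitting, scoring, and thresholding in base R, assuming a hypothetical default_df data set with a 0/1 default column and illustrative predictors; the 0.3 threshold is purely for illustration:

```r
m <- glm(default ~ balance + income, data = default_df, family = binomial)

# type = "response" returns probabilities, not linear-predictor values
p_hat <- predict(m, type = "response")

# Move the threshold away from 0.5 depending on the cost of FP vs FN
threshold  <- 0.3
pred_class <- factor(ifelse(p_hat >= threshold, 1, 0), levels = c(0, 1))

# Confusion matrix, sensitivity, and specificity
tab <- table(truth = default_df$default, predicted = pred_class)
sensitivity <- tab["1", "1"] / sum(tab["1", ])
specificity <- tab["0", "0"] / sum(tab["0", ])
```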
2. Simulation and Bootstrapping
Simulation involves generating synthetic data to model and understand complex random processes. In R, pseudo-random number generators (RNGs) such as rnorm() for the normal distribution or runif() for the uniform distribution can simulate draws from various distributions. The Bootstrap method provides a powerful way to estimate variability, allowing estimation of confidence intervals and variances without relying on strict assumptions about the data's distribution.
- Bootstrap Steps:
- Draw multiple samples (with replacement) from the original dataset.
- Calculate a statistic (e.g., mean or median) for each sample.
- Use the distribution of the sample statistics to estimate confidence intervals.
Interpretation of Outputs:
- Bootstrap Confidence Intervals: These intervals give an empirical measure of uncertainty for estimated statistics, which can be more reliable when standard assumptions are invalid.
- Variability in Simulated Data: Large variations across bootstrap samples indicate higher uncertainty.
Common Transformations:
- Inverse Transform Sampling: Generates random samples from any distribution by transforming uniform random variables. For example, for an exponential distribution with rate λ, use X = −(1/λ) ln(U), where U ~ Uniform(0, 1).
- Box-Muller for Normals: Uses two uniform random variables U1 and U2 to generate standard normal variables: Z1 = √(−2 ln U1) cos(2πU2) and Z2 = √(−2 ln U1) sin(2πU2).
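Both transformations are easy to verify empirically in R:

```r
set.seed(42)
n <- 10000

# Inverse transform sampling: exponential with rate lambda
lambda <- 2
u     <- runif(n)
x_exp <- -log(u) / lambda            # X = -(1/lambda) * ln(U)

# Box-Muller: two independent standard normals from two uniforms
u1 <- runif(n); u2 <- runif(n)
z1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)
z2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)

# Sanity checks against theory
mean(x_exp)   # should be close to 1/lambda = 0.5
sd(z1)        # should be close to 1
```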
Coding Tips:
- Use sample() in R to draw bootstrap samples and calculate confidence intervals. The boot package can simplify this process, particularly for complex models.
- For Monte Carlo simulations, use replicate() to repeat a process and observe variability across iterations.
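A minimal bootstrap sketch in base R, using an assumed numeric vector x; the boot package's boot() and boot.ci() wrap the same idea for more complex statistics:

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)    # example data

# Draw B resamples (with replacement) and compute the statistic on each
B <- 2000
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile confidence interval from the bootstrap distribution
quantile(boot_means, probs = c(0.025, 0.975))

# Spread of the bootstrap statistics = empirical measure of uncertainty
sd(boot_means)
```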
3. Decision Trees and Pruning
Decision Trees are interpretable models that segment data by making sequential splits based on predictor values. For classification tasks, the tree splits the data to maximize the homogeneity of the target classes in each subset; regression trees are designed for continuous target variables and choose splits that minimize the variance of the target within each node.
- Splitting Criteria:
- Gini Index (classification) measures node impurity: Gini = 1 − Σ(pk)², which is minimized (at 0) when a node is pure, i.e., one class has probability 1.
- Entropy (classification) measures impurity / information gain: Entropy = −Σ pk log2(pk), where pk is the proportion of observations in each class.
- Mean Squared Error (for regression trees) is minimized at each split, reducing variance within the resulting nodes.
Pruning Techniques (splits themselves are always chosen using an objective measure of diversity/impurity):
- Pre-Pruning (Early Stopping): Limits tree depth by setting conditions on minimum samples per node or maximum tree depth. This prevents overfitting but risks underfitting.
- Post-Pruning (Cost-Complexity Pruning): Penalized risk, also known as cost-complexity pruning, is used to avoid overfitting in decision trees. It adds a penalty for the size of the tree (number of leaves or splits) to the regular error (misclassification, variance, etc.). The penalized risk is Rα(T) = R(T) + α|T|, where R(T) is the error of tree T and |T| is its number of leaves. A higher α results in a smaller tree by pruning more branches; a lower α allows more complexity.
- Missing Values: Surrogate splits. When a value is missing for the primary feature chosen at a split, a surrogate split acts as a backup, using another feature whose split closely mimics the primary one (based on its agreement with the primary split). This lets the tree make a decision even when the primary splitting feature is missing, maintaining model robustness.
Interpreting Decision Tree Outputs:
- Leaf Nodes: Each leaf represents a final decision or prediction. In classification, the dominant class in the leaf is the predicted class.
- Focus on the xerror column to identify the tree with the smallest cross-validated error, as this indicates the best-performing subtree. If several trees have the same xerror, prefer those with lower xstd, which suggests more stable, robust performance across data splits. Look at the nsplit column: a higher number of splits indicates a more complex model with more leaves, which may overfit if complexity is too high.
- Tree Depth: A deeper tree typically captures more detail but may overfit. Use cross-validation to decide optimal depth.
- Feature Importance: the total decrease in impurity (information gain) contributed by each variable across all the splits that use it, usually reported relative to the top variable: e.g., if couleur appears in second place with a value of 27%, it explains 27% of what the top variable (clarté) explains. Feature importance is not an inference tool; it is not the same thing as statistical significance.
- Strengths of trees: interpretability, built-in handling of missing values, no need for variable transformations, capture interactions naturally, robust to extreme values. Weaknesses: rarely the best raw performance; they work best when the true decision boundaries are rectangular (axis-aligned) rather than curved.
Coding Tips:
- In R, rpart() or tree() can be used for tree building, with pruning guided by cross-validation (the cp parameter in rpart). Visualize the tree structure with plot() to understand splits and node contents.
- For feature importance, randomForest() provides ranked feature importance based on the average impurity decrease (see the sketch below).
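A hedged sketch of this workflow, assuming a data frame df with a factor target y; the starting cp value is illustrative:

```r
library(rpart)
library(randomForest)

# Grow a deliberately large tree, then inspect the cross-validation table
fit <- rpart(y ~ ., data = df, method = "class", cp = 0.001)
printcp(fit)                       # columns include nsplit, xerror, xstd

# Prune back to the cp with the smallest cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
plot(pruned); text(pruned)         # visualize splits and node contents

# Ranked feature importance from a random forest
rf <- randomForest(y ~ ., data = df, importance = TRUE)
importance(rf)
varImpPlot(rf)
```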
4. Advanced Methods: Support Vector Machines (SVMs) and Ensemble Learning
Support Vector Machines (SVMs) are particularly effective for classification tasks, especially in high-dimensional spaces. An SVM identifies the hyperplane that maximizes the margin between classes, defined as the distance between the hyperplane and the closest data points (support vectors).
Uplift modeling: estimates the incremental effect of a treatment on an outcome by comparing the difference in the probability of a desired outcome between a treated group and a control group. This method helps identify "persuadables": customers who are more likely to respond positively because of the treatment.
- Soft Margin: For data that are not linearly separable, a soft margin allows some misclassifications, balancing margin width with classification accuracy. The penalty parameter C regulates this trade-off; high values lead to fewer misclassifications but may overfit. Low C allows more errors for better generalization.
- Kernel Trick: Enables SVMs to classify non-linear data by mapping them into a higher-dimensional space. Common kernels include:
- Linear: Suitable for linearly separable data.
- Polynomial: Allows curved boundaries for complex patterns.
- Radial Basis Function (RBF): A common choice for non-linear boundaries, controlled by parameter γ, which affects the influence of points on the classification boundary.
Interpreting SVM Outputs:
- Support Vectors: These data points directly influence the position of the separating hyperplane and are critical to the classification boundary.
- Model Accuracy: Use the confusion matrix and cross-validation accuracy to assess model performance. A large number of support vectors may indicate a more flexible model but also signal potential overfitting.
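A minimal sketch using the e1071 package (one common SVM interface in R), with assumed train_df / test_df data frames and a factor target y; cost and gamma here are illustrative and should be tuned:

```r
library(e1071)

# RBF-kernel SVM: cost (C) controls the soft margin, gamma the kernel width
m_svm <- svm(y ~ ., data = train_df, kernel = "radial", cost = 1, gamma = 0.1)

length(m_svm$index)                        # number of support vectors
pred <- predict(m_svm, newdata = test_df)
table(truth = test_df$y, predicted = pred) # confusion matrix on the test set

# Grid search with cross-validation over cost and gamma
tuned <- tune(svm, y ~ ., data = train_df, kernel = "radial",
              ranges = list(cost = c(0.1, 1, 10), gamma = c(0.01, 0.1, 1)))
summary(tuned)
```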
Ensemble methods combine multiple models (often called “weak learners”) to improve predictive performance. The idea is that by combining several models, the strengths of each model can offset the weaknesses of others, leading to better overall accuracy and robustness. Here are some of the main types of ensemble methods, along with their strengths and weaknesses:
1. Bagging (Bootstrap Aggregating):
- Definition: Bagging builds multiple models (often decision trees) using different random subsets of the data (bootstrapping). Predictions are made by averaging (for regression) or voting (for classification) across the models. Example: Random Forest is a popular algorithm based on bagging.
- Strengths: Reduces variance in model predictions by averaging across many models.
- Helps to prevent overfitting, especially in decision trees.
- Weaknesses: Doesn’t always reduce bias, especially if individual models are biased.
- Computationally intensive if using a large number of models.
2. Boosting:
- Definition: Boosting works by training models sequentially. Each new model focuses on correcting the errors made by the previous models, improving performance with each iteration.
- Examples: AdaBoost and Gradient Boosting. AdaBoost adds weights to observations so that later models focus on the cases misclassified by earlier ones.
- Strengths: Reduces bias by focusing on improving weak learners in each step.
- Can result in very accurate models.
- Weaknesses: Prone to overfitting if not controlled properly.
- More sensitive to noise in the data, as errors are amplified in later stages.
3. Random Forest (a special case of bagging):
- Definition: Random Forest is an ensemble method that creates a large number of decision trees using random subsets of features and data. The final prediction is the average (for regression) or majority vote (for classification) of the individual trees.
- Strengths: Excellent at reducing variance compared to a single decision tree.
- Handles missing data and categorical variables well.
- Robust against overfitting in many cases.
- Weaknesses: Less interpretable than a single decision tree.
- Can be computationally expensive with a large number of trees.
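A hedged comparison sketch of bagging versus boosting, assuming a data frame df with a 0/1 numeric target y (randomForest wants a factor, gbm wants a numeric response); hyperparameters are illustrative:

```r
library(randomForest)
library(gbm)

# Bagging-style ensemble: random forest averages many decorrelated trees
df_rf <- transform(df, y = factor(y))       # factor target for classification
rf <- randomForest(y ~ ., data = df_rf, ntree = 500)

# Boosting: sequential trees, each one correcting the previous ones' errors
gb <- gbm(y ~ ., data = df, distribution = "bernoulli",
          n.trees = 1000, interaction.depth = 3,
          shrinkage = 0.01, cv.folds = 5)

# Choose the number of boosting iterations by cross-validation
best_iter <- gbm.perf(gb, method = "cv")
p_boost   <- predict(gb, newdata = df, n.trees = best_iter, type = "response")
```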
Coding Tips:
Cross-Validation: Check the mean accuracy and standard deviation. A smaller deviation suggests more consistent results.
AUC/ROC Curve: Look for an AUC closer to 1, indicating strong classification performance. AUC above 0.7 is acceptable; 0.8-0.9 is good.
Confusion Matrix: Analyze True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). High FP or FN rates require threshold adjustment.
Variable Importance: Focus on the top variables contributing the most. High values indicate stronger influence on predictions.
R-squared & Adjusted R-squared: Higher values indicate a better fit. Adjusted R-squared accounts for model complexity.
Residual Plots: Check for random scatter, which indicates a good fit. Patterns suggest model misspecification.
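These checks can be scripted, for example with the caret and pROC packages (one possible tooling choice, assuming a data frame df with a factor target y with levels "no"/"yes"):

```r
library(caret)
library(pROC)

# 5-fold cross-validation: mean accuracy and its spread across folds
ctrl   <- trainControl(method = "cv", number = 5, classProbs = TRUE)
cv_fit <- train(y ~ ., data = df, method = "rf", trControl = ctrl)
mean(cv_fit$resample$Accuracy)
sd(cv_fit$resample$Accuracy)

# AUC from predicted probabilities of the positive class
p_hat   <- predict(cv_fit, newdata = df, type = "prob")[, "yes"]
roc_obj <- roc(response = df$y, predictor = p_hat)
auc(roc_obj)
```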