Ensemble Methods: Bagging and Stacking in Machine Learning

Automated Feature Selection

Backward Feature Elimination

This process begins with a model containing all features. After training and testing, the least significant feature is removed, and the cycle repeats until only the desired number of features remains. Backward feature elimination is implemented in scikit-learn with Recursive Feature Elimination (RFE).

Code Example for RFE:

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Read the data.
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate target and independent variables.
X = df.copy()        # Create a separate copy.
del X['Divorce']     # Delete the target variable.
y = df['Divorce']

# Create the model object.
model = LogisticRegression()

# Specify the number of features to select.
rfe = RFE(model, n_features_to_select=8)

# Fit the model.
rfe = rfe.fit(X, y)

# Print the selected features.
print('\n\nFEATURES SELECTED\n\n')
print(rfe.support_)
for i in range(len(X.columns)):
    if rfe.support_[i]:
        print(X.columns[i])

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

from sklearn.model_selection import train_test_split

def buildAndEvaluateClassifier(features, X, y):
    # Use only significant columns after chi-square test.
    X = X[features]

    # Split data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Perform logistic regression.
    logisticModel = LogisticRegression(fit_intercept=True,
                                       solver='liblinear', random_state=0)
    # Fit the model.
    logisticModel.fit(X_train, y_train)
    y_pred = logisticModel.predict(X_test)

    # Show accuracy scores.
    print('Results without scaling:')

    # Show confusion matrix.
    cm = pd.crosstab(y_test, y_pred, rownames=['Actual'],
                     colnames=['Predicted'])
    print("\nConfusion Matrix")
    print(cm)
    print("Recall:    " + str(recall_score(y_test, y_pred)))
    print("Precision: " + str(precision_score(y_test, y_pred)))
    print("F1:        " + str(f1_score(y_test, y_pred)))
    print("Accuracy:  " + str(accuracy_score(y_test, y_pred)))

features = ['Q3', 'Q6', 'Q17', 'Q18', 'Q26', 'Q39', 'Q40', 'Q49']
buildAndEvaluateClassifier(features, X, y)

Selecting Features with the Chi-Square Test

The chi-square test identifies significant predictors when the target is categorical, as in logistic regression. A chi-square score of 3.84 or greater (the critical value at p = 0.05 with one degree of freedom) indicates significance.
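In scikit-learn, the chi2 scorer computes this test directly. Below is a minimal sketch, assuming the same Divorce.csv file used in the surrounding examples (chi2 requires non-negative feature values, which holds for the questionnaire columns):

import pandas as pd
from sklearn.feature_selection import chi2

df = pd.read_csv("path/to/Divorce.csv", header=0)
X = df.copy()
del X['Divorce']
y = df['Divorce']

# chi2 returns a (scores, p-values) pair, one entry per feature.
scores, pvalues = chi2(X, y)

# Keep features whose score meets the 3.84 critical value.
significant = [col for col, score in zip(X.columns, scores) if score >= 3.84]
print(significant)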

Forward Feature Selection

This process finds the optimal feature set by starting with the single best feature and incrementally adding the next best features.

Code Example for Forward Feature Selection:

import pandas as pd
from sklearn.feature_selection import f_regression

# Read the data.
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate target and independent variables.
X = df.copy()
del X['Divorce']
y = df['Divorce']

# Use f_regression to compute an F-score for each feature.
ffs = f_regression(X, y)

# Keep features whose F-score meets the threshold.
variable = []
for i in range(len(X.columns)):
    if ffs[0][i] >= 700:
        variable.append(X.columns[i])
print(variable)

Bagging

Bagging averages the predictions of several weak learners to produce a more stable result. The name combines “bootstrapping” (sampling with replacement) and “aggregating”: each weak learner is trained on a bootstrapped sample of the data, and their predictions are then combined.
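To make the two steps concrete, here is a minimal hand-rolled sketch on synthetic data (the make_classification dataset and the ten-tree majority vote are illustrative assumptions, not part of the housing example that follows); BaggingClassifier, used below, performs the same bootstrap-and-combine procedure internally.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

Xb, yb = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

predictions = []
for _ in range(10):
    # Bootstrap: draw a sample of rows with replacement.
    idx = rng.integers(0, len(Xb), size=len(Xb))
    tree = DecisionTreeClassifier().fit(Xb[idx], yb[idx])
    predictions.append(tree.predict(Xb))

# Aggregate: majority vote across the ten weak learners.
votes = np.round(np.mean(predictions, axis=0)).astype(int)
print("Agreement with true labels:", np.mean(votes == yb))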

Code Example for Bagging:

from pydataset import data
import pandas as pd

# Get housing data.
df = data('Housing')

# Show all columns.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print("\n*** Before data prep.")
print(df.head(5))

# Categorize the price variable into three equal-sized bins.
df['price'] = pd.qcut(df['price'], 3, labels=[0, 1, 2]).cat.codes
print("\nNewly categorized target (price) values")
print(df['price'].value_counts())

# Convert yes/no columns to 1/0.
def convertToBinaryValues(df, columns):
    for column in columns:
        df[column] = df[column].map({'yes': 1, 'no': 0})
    return df

df = convertToBinaryValues(df, ['driveway', 'recroom', 'fullbase',
                                'gashw', 'airco', 'prefarea'])

# Split data into features and target.
y = df['price']
X = df.drop('price', axis=1)
print("\n X")
print(X.head(5))
print("\n y")
print(y.head(5))

Bagging (Continued)

from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC

# Create classifiers.
rf = RandomForestClassifier()
et = ExtraTreesClassifier()
knn = KNeighborsClassifier()
svc = SVC()
rg = RidgeClassifier()

# Build array of classifiers.
classifierArray = [rf, et, knn, svc, rg]

def showStats(classifier, scores):
    print(classifier + ": ", end="")
    strMean = str(round(scores.mean(), 2))
    strStd = str(round(scores.std(), 2))
    print("Mean: " + strMean + " ", end="")
    print("Std: " + strStd)

from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

def evaluateModel(model, X_test, y_test, title):
    print("\n " + title + " ")
    predictions = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, predictions)
    recall = metrics.recall_score(y_test, predictions, average='weighted')
    precision = metrics.precision_score(y_test, predictions, average='weighted')
    f1 = metrics.f1_score(y_test, predictions, average='weighted')
    print("Accuracy:  " + str(accuracy))
    print("Precision: " + str(precision))
    print("Recall:    " + str(recall))
    print("F1:        " + str(f1))

# Search for the best classifier.
for clf in classifierArray:
    modelType = clf.__class__.__name__

    # Evaluate stand-alone model.
    clfModel = clf.fit(X_train, y_train)
    evaluateModel(clfModel, X_test, y_test, modelType)

    # Evaluate bagged model.
    bagging_clf = BaggingClassifier(clf, max_samples=0.4,
                                    max_features=11, n_estimators=10)
    baggedModel = bagging_clf.fit(X_train, y_train)
    evaluateModel(baggedModel, X_test, y_test, "Bagged: " + modelType)

Stacked Models

A stacked model feeds the predictions of several base models into a second-level model (the super learner), which often improves accuracy over any single base model.

Code Example for Stacked Regression:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Show all columns.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Prepare data.
PATH = "./Data/"
CSV_DATA = "USA_Housing.csv"
dataset = pd.read_csv(PATH + CSV_DATA)
print(dataset.head())
X = dataset[['Avg. Area Income', 'Avg. Area House Age',
             'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms',
             'Area Population']].values
y = dataset['Price']

def getUnfitModels():
    models = list()
    models.append(ElasticNet())
    models.append(SVR(gamma='scale'))
    models.append(DecisionTreeRegressor())
    models.append(AdaBoostRegressor())
    models.append(RandomForestRegressor(n_estimators=10))
    models.append(ExtraTreesRegressor(n_estimators=10))
    return models

def evaluateModel(y_test, predictions, model):
    mse = mean_squared_error(y_test, predictions)
    rmse = round(np.sqrt(mse), 3)
    print(" RMSE:" + str(rmse) + " " + model.__class__.__name__)

def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    # Fit each base model and store its test-set predictions in one column.
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models

def fitStackedModel(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model

# Split data: train for the base models, test for the stacked model,
# validation for final evaluation.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)

# Get base models.
unfitModels = getUnfitModels()

# Fit base models on training data, then fit the stacked model on their
# test-set predictions.
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)

# Evaluate base models with validation data.
print("\n Evaluate Base Models ")
dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])

Stacked Regression (Continued)

# Evaluate stacked model
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)

Code Example for Stacked Classification:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Show all columns.
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Prepare data.
PATH = "/Users/pm/Desktop/DayDocs/data/"
CSV_DATA = "Social_Network_Ads.csv"
df = pd.read_csv(PATH + CSV_DATA)
df = pd.get_dummies(df, columns=['Gender'])
del df['User ID']
X = df.copy()
del X['Purchased']
y = df['Purchased']

def getUnfitModels():
    models = list()
    models.append(LogisticRegression())
    models.append(DecisionTreeClassifier())
    models.append(AdaBoostClassifier())
    models.append(RandomForestClassifier(n_estimators=10))
    return models

def evaluateModel(y_test, predictions, model):
    precision = round(precision_score(y_test, predictions), 2)
    recall = round(recall_score(y_test, predictions), 2)
    f1 = round(f1_score(y_test, predictions), 2)
    accuracy = round(accuracy_score(y_test, predictions), 2)
    print("Precision:" + str(precision) + " Recall:" + str(recall)
          + " F1:" + str(f1) + " Accuracy:" + str(accuracy)
          + " " + model.__class__.__name__)

def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    # Fit each base model and store its test-set predictions in one column.
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models

def fitStackedModel(X, y):
    model = LogisticRegression()
    model.fit(X, y)
    return model

# Split data: train for the base models, test for the stacked model,
# validation for final evaluation.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)

# Get base models.
unfitModels = getUnfitModels()

# Fit base models on training data, then fit the stacked model on their
# test-set predictions.
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)

# Evaluate base models with validation data.
print("\n Evaluate Base Models ")

Stacked Classification (Continued)

dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])

# Evaluate stacked model with validation data.
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)

Stacked Model Explanation

  1. Split the data into three groups: train, test, and validation.
  2. Fit standalone models with training data.
  3. Standalone models make predictions with test data.
  4. Build a DataFrame with the individual predictions.
  5. Fit a super learner with the DataFrame from the previous step (scikit-learn can automate steps 2 through 5; see the sketch after this list).
  6. Evaluate standalone models with validation data.
  7. Evaluate the super learner with validation data.
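
For comparison, scikit-learn ships a built-in StackingClassifier (and StackingRegressor) that automates the fit-base-models and fit-super-learner steps, using cross-validated predictions in place of a separate test split. A minimal sketch, assuming the X_train, y_train, X_val, and y_val splits from the classification example above:

# Minimal sketch of the built-in stacking API. The estimator list loosely
# mirrors the base models used above; cv=5 replaces the manual test split.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

estimators = [('lr', LogisticRegression()),
              ('dt', DecisionTreeClassifier()),
              ('rf', RandomForestClassifier(n_estimators=10))]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
print("Stacked accuracy:", stack.score(X_val, y_val))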

Benefits of Bagging and Stacking

1. Bagging (Bootstrap Aggregating)

  • Reduction of variance: Creates multiple models on data subsets, averaging predictions for stability.
  • Improved accuracy: Reduces overfitting by capturing diverse data patterns.
  • Outlier robustness: Combines predictions, minimizing outlier impact.

2. Stacking (Stacked Generalization)

  • Increased model performance: Integrates diverse models with complementary strengths.
  • Model specialization: Creates meta-features from individual predictions, capturing higher-level information.
  • Flexibility: Uses various model types, leveraging multiple algorithms.

Both techniques improve predictions, reduce overfitting, and enhance robustness, though at additional computational cost.