Ensemble Methods: Bagging and Stacking in Machine Learning
Automated Feature Selection
Backward Feature Elimination
This process begins with a model containing all features. After training and testing, the least significant feature is removed, and the cycle repeats until only the desired number of features remains. Backward feature elimination can be implemented in scikit-learn using Recursive Feature Elimination (RFE).
Code Example for RFE:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Read the data
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate the target and independent variables
X = df.copy()    # Create separate copy
del X['Divorce'] # Delete target variable
y = df['Divorce']

# Create the model object
model = LogisticRegression()

# Specify the number of features to select
rfe = RFE(model, n_features_to_select=8)

# Fit the model
rfe = rfe.fit(X, y)

# Print the selected features
print('\n\nFEATURES SELECTED\n\n')
print(rfe.support_)
for i in range(0, len(X.keys())):
    if rfe.support_[i]:
        print(X.keys()[i])
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split

def buildAndEvaluateClassifier(features, X, y):
    # Use only significant columns after the chi-square test
    X = X[features]

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Perform logistic regression
    logisticModel = LogisticRegression(fit_intercept=True, solver='liblinear', random_state=0)

    # Fit the model
    logisticModel.fit(X_train, y_train)
    y_pred = logisticModel.predict(X_test)

    # Show accuracy scores
    print('Results without scaling:')

    # Show confusion matrix
    cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
    print("\nConfusion Matrix")
    print(cm)
    print("Recall: " + str(recall_score(y_test, y_pred)))
    print("Precision: " + str(precision_score(y_test, y_pred)))
    print("F1: " + str(f1_score(y_test, y_pred)))
    print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

features = ['Q3', 'Q6', 'Q17', 'Q18', 'Q26', 'Q39', 'Q40', 'Q49']
buildAndEvaluateClassifier(features, X, y)
Selecting Features with the Chi-Square Test
The chi-square test identifies significant predictors, and is especially useful for the categorical targets used in logistic regression. With one degree of freedom, a chi-square score of 3.84 or greater indicates significance at the 0.05 level.
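A minimal sketch of this test with scikit-learn's chi2 function, assuming X and y are prepared as in the RFE example above (note that chi2 requires non-negative feature values):
from sklearn.feature_selection import chi2

# Compute the chi-square statistic and p-value for each feature.
chi_scores, p_values = chi2(X, y)

# Keep features whose score meets the 3.84 critical value.
significant = [X.columns[i] for i in range(len(X.columns)) if chi_scores[i] >= 3.84]
print(significant)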
Forward Feature Selection
This process finds the optimal feature set by starting with the single best feature and incrementally adding the next best features.
Code Example for Forward Feature Selection:
import pandas as pd
from sklearn.feature_selection import f_regression

# Read the data
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate the target and independent variables
X = df.copy()
del X['Divorce']
y = df['Divorce']

# Use f_regression to compute the F-statistic for each feature
ffs = f_regression(X, y)

# Keep features whose F-statistic meets the threshold
variable = []
for i in range(0, len(X.columns)):
    if ffs[0][i] >= 700:
        variable.append(X.columns[i])
print(variable)
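Note that the example above is a filter approach: it scores every feature at once by its F-statistic rather than adding features one at a time. True stepwise forward selection is available through scikit-learn's SequentialFeatureSelector; a minimal sketch, assuming the same X and y:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Start from zero features and greedily add the best one at each step.
sfs = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=8,
                                direction='forward')
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))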
Bagging
Bagging averages the predictions of many weak learners to produce a more stable result. The name combines “bootstrapping” (sampling with replacement) and “aggregating”: each weak learner is trained on its own bootstrapped sample of the data, and their predictions are then combined.
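To make “sampling with replacement” concrete, here is a minimal sketch of drawing one bootstrap sample, assuming any pandas DataFrame df:
# A bootstrap sample has the same number of rows as the original data,
# but rows are drawn with replacement, so some repeat and some are absent.
bootstrap_sample = df.sample(frac=1.0, replace=True, random_state=0)
print(bootstrap_sample.index.value_counts().head())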
Code Example for Bagging:
from pydataset import data
import pandas as pd
# Get housing data
df = data('Housing')
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print("\n*** Before data prep.")
print(df.head(5))
# Categorize price variable
df['price'] = pd.qcut(df['price'], 3, labels=[0,1,2]).cat.codes
print("\nNewly categorized target (price) values")
print(df['price'].value_counts())
def convertToBinaryValues(df, columns):
    for column in columns:
        df[column] = df[column].map({'yes': 1, 'no': 0})
    return df
df = convertToBinaryValues(df, ['driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'prefarea'])
# Split data
y = df['price']
X = df.drop('price', axis=1)
print("\n X")
print(X.head(5))
print("\n y")
print(y.head(5))
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
# Create classifiers
rf = RandomForestClassifier()
et = ExtraTreesClassifier()
knn = KNeighborsClassifier()
svc = SVC()
rg = RidgeClassifier()
# Build array of classifiers
classifierArray = [rf, et, knn, svc, rg]
def showStats(classifier, scores):
    print(classifier + ": ", end="")
    strMean = str(round(scores.mean(), 2))
    strStd = str(round(scores.std(), 2))
    print("Mean: " + strMean + " ", end="")
    print("Std: " + strStd)
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
def evaluateModel(model, X_test, y_test, title):
    print("\n " + title + " ")
    predictions = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, predictions)
    recall = metrics.recall_score(y_test, predictions, average='weighted')
    precision = metrics.precision_score(y_test, predictions, average='weighted')
    f1 = metrics.f1_score(y_test, predictions, average='weighted')
    print("Accuracy: " + str(accuracy))
    print("Precision: " + str(precision))
    print("Recall: " + str(recall))
    print("F1: " + str(f1))
# Search for the best classifier
for clf in classifierArray:
    modelType = clf.__class__.__name__

    # Evaluate stand-alone model
    clfModel = clf.fit(X_train, y_train)
    evaluateModel(clfModel, X_test, y_test, modelType)

    # Evaluate bagged model
    bagging_clf = BaggingClassifier(clf, max_samples=0.4, max_features=11, n_estimators=10)
    baggedModel = bagging_clf.fit(X_train, y_train)
    evaluateModel(baggedModel, X_test, y_test, "Bagged: " + modelType)
Stacked Models
A stacked model feeds the predictions of several base models into a final model (a “super learner”), often improving accuracy over any single base model.
Code Example for Stacked Regression:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# Prepare data
PATH = "./Data/"
CSV_DATA = "USA_Housing.csv"
dataset = pd.read_csv(PATH + CSV_DATA)
print(dataset.head())
X = dataset[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
             'Avg. Area Number of Bedrooms', 'Area Population']].values
y = dataset['Price']
def getUnfitModels():
    models = list()
    models.append(ElasticNet())
    models.append(SVR(gamma='scale'))
    models.append(DecisionTreeRegressor())
    models.append(AdaBoostRegressor())
    models.append(RandomForestRegressor(n_estimators=10))
    models.append(ExtraTreesRegressor(n_estimators=10))
    return models
def evaluateModel(y_test, predictions, model):
    mse = mean_squared_error(y_test, predictions)
    rmse = round(np.sqrt(mse), 3)
    print(" RMSE:" + str(rmse) + " " + model.__class__.__name__)
def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models
def fitStackedModel(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model
# Split data into train, test, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)
# Get base models
unfitModels = getUnfitModels()
# Fit base and stacked models
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)
# Evaluate base models with validation data
print("\n Evaluate Base Models ")
dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])
# Evaluate stacked model
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)
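As an alternative to wiring the pieces together by hand, scikit-learn's StackingRegressor fits the base models and the final estimator for you, generating the base-model predictions through internal cross-validation instead of a held-out test set; a minimal sketch with a subset of the base models used above:
from sklearn.ensemble import StackingRegressor

# Base models feed their cross-validated predictions to the final estimator.
estimators = [('en', ElasticNet()),
              ('dt', DecisionTreeRegressor()),
              ('rf', RandomForestRegressor(n_estimators=10))]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=5)
stack.fit(X_train, y_train)
evaluateModel(y_val, stack.predict(X_val), stack)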
Code Example for Stacked Classification:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# Prepare data
PATH = "/Users/pm/Desktop/DayDocs/data/"
CSV_DATA = "Social_Network_Ads.csv"
df = pd.read_csv(PATH + CSV_DATA)
df = pd.get_dummies(df,columns=['Gender'])
del df['User ID']
X = df.copy()
del X['Purchased']
y = df['Purchased']
def getUnfitModels():
    models = list()
    models.append(LogisticRegression())
    models.append(DecisionTreeClassifier())
    models.append(AdaBoostClassifier())
    models.append(RandomForestClassifier(n_estimators=10))
    return models
def evaluateModel(y_test, predictions, model):
    precision = round(precision_score(y_test, predictions), 2)
    recall = round(recall_score(y_test, predictions), 2)
    f1 = round(f1_score(y_test, predictions), 2)
    accuracy = round(accuracy_score(y_test, predictions), 2)
    print("Precision:" + str(precision) + " Recall:" + str(recall) +
          " F1:" + str(f1) + " Accuracy:" + str(accuracy) +
          " " + model.__class__.__name__)
def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models
def fitStackedModel(X, y):
    model = LogisticRegression()
    model.fit(X, y)
    return model
# Split data into train, test, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)
# Get base models
unfitModels = getUnfitModels()
# Fit base and stacked models
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)
# Evaluate base models with validation data
print("\n Evaluate Base Models ")
dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])
# Evaluate stacked model with validation data
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)
Stacked Model Explanation
- Split data into three groups: Train, Test, Validation
- Fit standalone models with training data.
- Standalone models make predictions with test data.
- Build a DataFrame with individual predictions.
- Fit a super learner with the DataFrame from the last step.
- Evaluate standalone models with validation data.
- Evaluate the super learner with validation data (a library-based alternative is sketched below).
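The same procedure can be delegated to scikit-learn's StackingClassifier, which trains the base models and the super learner together, using internal cross-validation in place of the separate test set; a minimal sketch with the base models from the classification example:
from sklearn.ensemble import StackingClassifier

# Each base model is named; the final estimator acts as the super learner.
estimators = [('lr', LogisticRegression()),
              ('dt', DecisionTreeClassifier()),
              ('rf', RandomForestClassifier(n_estimators=10))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
print("Accuracy: " + str(stack.score(X_val, y_val)))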
Benefits of Bagging and Stacking
1. Bagging (Bootstrap Aggregating)
- Reduction of variance: Creates multiple models on data subsets, averaging predictions for stability.
- Improved accuracy: Reduces overfitting by capturing diverse data patterns.
- Outlier robustness: Combines predictions, minimizing outlier impact.
2. Stacking (Stacked Generalization)
- Increased model performance: Integrates diverse models with complementary strengths.
- Model specialization: Creates meta-features from individual predictions, capturing higher-level information.
- Flexibility: Uses various model types, leveraging multiple algorithms.
Both techniques improve predictions, reduce overfitting, and enhance robustness, but both come at additional computational cost.