Ensemble Methods: Bagging and Stacking in Machine Learning
Automated Feature Selection
Backward Feature Elimination
This process begins with a model containing all features. After training and testing, the least significant feature is removed, and the cycle repeats until only the desired number of features remains. Backward feature elimination can be implemented in scikit-learn using Recursive Feature Elimination (RFE).
Code Example for RFE:
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Read the data
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate the target and independent variables
X = df.copy()    # Create separate copy
del X['Divorce'] # Delete target variable
y = df['Divorce']

# Create the model object
model = LogisticRegression()

# Specify the number of features to select
rfe = RFE(model, n_features_to_select=8)

# Fit the model
rfe = rfe.fit(X, y)

# Print the selected features
print('\n\nFEATURES SELECTED\n\n')
print(rfe.support_)
for i in range(0, len(X.keys())):
    if rfe.support_[i]:
        print(X.keys()[i])
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split

def buildAndEvaluateClassifier(features, X, y):
    # Use only significant columns after the chi-square test
    X = X[features]

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Perform logistic regression
    logisticModel = LogisticRegression(fit_intercept=True, solver='liblinear', random_state=0)

    # Fit the model
    logisticModel.fit(X_train, y_train)
    y_pred = logisticModel.predict(X_test)

    # Show accuracy scores
    print('Results without scaling:')

    # Show confusion matrix
    cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
    print("\nConfusion Matrix")
    print(cm)
    print("Recall: " + str(recall_score(y_test, y_pred)))
    print("Precision: " + str(precision_score(y_test, y_pred)))
    print("F1: " + str(f1_score(y_test, y_pred)))
    print("Accuracy: " + str(accuracy_score(y_test, y_pred)))

features = ['Q3', 'Q6', 'Q17', 'Q18', 'Q26', 'Q39', 'Q40', 'Q49']
buildAndEvaluateClassifier(features, X, y)
Selecting Features with the Chi-Square Test
The chi-square test identifies significant predictors, and is especially useful for the categorical targets used in logistic regression. With one degree of freedom, a chi-square score of 3.84 or greater indicates significance at the 0.05 level.
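A minimal sketch of this test with scikit-learn's chi2 function, assuming X and y are prepared as in the RFE example above (note that chi2 requires non-negative feature values):
from sklearn.feature_selection import chi2

# Compute the chi-square statistic and p-value for each feature.
chi_scores, p_values = chi2(X, y)

# Keep features whose score meets the 3.84 critical value.
significant = [X.columns[i] for i in range(len(X.columns)) if chi_scores[i] >= 3.84]
print(significant)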
Forward Feature Selection
This process finds the optimal feature set by starting with the single best feature and incrementally adding the next best features.
Code Example for Forward Feature Selection:
import pandas as pd
from sklearn.feature_selection import f_regression

# Read the data
df = pd.read_csv("path/to/Divorce.csv", header=0)

# Separate the target and independent variables
X = df.copy()
del X['Divorce']
y = df['Divorce']

# Use f_regression to compute the F-statistic for each feature
ffs = f_regression(X, y)

# Keep features whose F-statistic meets the threshold
variable = []
for i in range(0, len(X.columns)):
    if ffs[0][i] >= 700:
        variable.append(X.columns[i])
print(variable)
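Note that the example above is a filter approach: it scores every feature at once by its F-statistic rather than adding features one at a time. True stepwise forward selection is available through scikit-learn's SequentialFeatureSelector; a minimal sketch, assuming the same X and y:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Start from zero features and greedily add the best one at each step.
sfs = SequentialFeatureSelector(LogisticRegression(), n_features_to_select=8,
                                direction='forward')
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))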
Bagging
Bagging averages the predictions of many weak learners to produce a more stable result. The name combines “bootstrapping” (sampling with replacement) and “aggregating”: each weak learner is trained on its own bootstrapped sample of the data, and their predictions are then combined.
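To make “sampling with replacement” concrete, here is a minimal sketch of drawing one bootstrap sample, assuming any pandas DataFrame df:
# A bootstrap sample has the same number of rows as the original data,
# but rows are drawn with replacement, so some repeat and some are absent.
bootstrap_sample = df.sample(frac=1.0, replace=True, random_state=0)
print(bootstrap_sample.index.value_counts().head())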
Code Example for Bagging:
from pydataset import data
import pandas as pd
# Get housing data
df = data('Housing')
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print("\n*** Before data prep.")
print(df.head(5))
# Categorize price variable
df['price'] = pd.qcut(df['price'], 3, labels=[0,1,2]).cat.codes
print("\nNewly categorized target (price) values")
print(df['price'].value_counts())
def convertToBinaryValues(df, columns):
    for column in columns:
        df[column] = df[column].map({'yes': 1, 'no': 0})
    return df
df = convertToBinaryValues(df, ['driveway', 'recroom', 'fullbase', 'gashw', 'airco', 'prefarea'])
# Split data
y = df['price']
X = df.drop('price', axis=1)
print("\n X")
print(X.head(5))
print("\n y")
print(y.head(5))
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
# Create classifiers
rf = RandomForestClassifier()
et = ExtraTreesClassifier()
knn = KNeighborsClassifier()
svc = SVC()
rg = RidgeClassifier()
# Build array of classifiers
classifierArray = [rf, et, knn, svc, rg]
def showStats(classifier, scores):
    print(classifier + ": ", end="")
    strMean = str(round(scores.mean(), 2))
    strStd = str(round(scores.std(), 2))
    print("Mean: " + strMean + " ", end="")
    print("Std: " + strStd)
from sklearn import metrics
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
def evaluateModel(model, X_test, y_test, title):
    print("\n " + title + " ")
    predictions = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, predictions)
    recall = metrics.recall_score(y_test, predictions, average='weighted')
    precision = metrics.precision_score(y_test, predictions, average='weighted')
    f1 = metrics.f1_score(y_test, predictions, average='weighted')
    print("Accuracy: " + str(accuracy))
    print("Precision: " + str(precision))
    print("Recall: " + str(recall))
    print("F1: " + str(f1))
# Search for the best classifier
for clf in classifierArray:
    modelType = clf.__class__.__name__

    # Evaluate stand-alone model
    clfModel = clf.fit(X_train, y_train)
    evaluateModel(clfModel, X_test, y_test, modelType)

    # Evaluate bagged model
    bagging_clf = BaggingClassifier(clf, max_samples=0.4, max_features=11, n_estimators=10)
    baggedModel = bagging_clf.fit(X_train, y_train)
    evaluateModel(baggedModel, X_test, y_test, "Bagged: " + modelType)
Stacked Models
A stacked model feeds the predictions of several base models into a final model (a “super learner”), often improving accuracy over any single base model.
Code Example for Stacked Regression:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# Prepare data
PATH = "./Data/"
CSV_DATA = "USA_Housing.csv"
dataset = pd.read_csv(PATH + CSV_DATA)
print(dataset.head())
X = dataset[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
             'Avg. Area Number of Bedrooms', 'Area Population']].values
y = dataset['Price']
def getUnfitModels():
    models = list()
    models.append(ElasticNet())
    models.append(SVR(gamma='scale'))
    models.append(DecisionTreeRegressor())
    models.append(AdaBoostRegressor())
    models.append(RandomForestRegressor(n_estimators=10))
    models.append(ExtraTreesRegressor(n_estimators=10))
    return models
def evaluateModel(y_test, predictions, model):
    mse = mean_squared_error(y_test, predictions)
    rmse = round(np.sqrt(mse), 3)
    print(" RMSE:" + str(rmse) + " " + model.__class__.__name__)
def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models
def fitStackedModel(X, y):
    model = LinearRegression()
    model.fit(X, y)
    return model
# Split data into train, test, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)
# Get base models
unfitModels = getUnfitModels()
# Fit base and stacked models
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)
# Evaluate base models with validation data
print("\n Evaluate Base Models ")
dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])
# Evaluate stacked model
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)
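As an alternative to wiring the pieces together by hand, scikit-learn's StackingRegressor fits the base models and the final estimator for you, generating the base-model predictions through internal cross-validation instead of a held-out test set; a minimal sketch with a subset of the base models used above:
from sklearn.ensemble import StackingRegressor

# Base models feed their cross-validated predictions to the final estimator.
estimators = [('en', ElasticNet()),
              ('dt', DecisionTreeRegressor()),
              ('rf', RandomForestRegressor(n_estimators=10))]
stack = StackingRegressor(estimators=estimators, final_estimator=LinearRegression(), cv=5)
stack.fit(X_train, y_train)
evaluateModel(y_val, stack.predict(X_val), stack)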
Code Example for Stacked Classification:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# Show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# Prepare data
PATH = "/Users/pm/Desktop/DayDocs/data/"
CSV_DATA = "Social_Network_Ads.csv"
df = pd.read_csv(PATH + CSV_DATA)
df = pd.get_dummies(df,columns=['Gender'])
del df['User ID']
X = df.copy()
del X['Purchased']
y = df['Purchased']
def getUnfitModels():
    models = list()
    models.append(LogisticRegression())
    models.append(DecisionTreeClassifier())
    models.append(AdaBoostClassifier())
    models.append(RandomForestClassifier(n_estimators=10))
    return models
def evaluateModel(y_test, predictions, model):
    precision = round(precision_score(y_test, predictions), 2)
    recall = round(recall_score(y_test, predictions), 2)
    f1 = round(f1_score(y_test, predictions), 2)
    accuracy = round(accuracy_score(y_test, predictions), 2)
    print("Precision:" + str(precision) + " Recall:" + str(recall) +
          " F1:" + str(f1) + " Accuracy:" + str(accuracy) +
          " " + model.__class__.__name__)
def fitBaseModels(X_train, y_train, X_test, models):
    dfPredictions = pd.DataFrame()
    for i in range(0, len(models)):
        models[i].fit(X_train, y_train)
        predictions = models[i].predict(X_test)
        colName = str(i)
        dfPredictions[colName] = predictions
    return dfPredictions, models
def fitStackedModel(X, y):
    model = LogisticRegression()
    model.fit(X, y)
    return model
# Split data into train, test, and validation sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.70)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.50)
# Get base models
unfitModels = getUnfitModels()
# Fit base and stacked models
dfPredictions, models = fitBaseModels(X_train, y_train, X_test, unfitModels)
stackedModel = fitStackedModel(dfPredictions, y_test)
# Evaluate base models with validation data
print("\n Evaluate Base Models ")
dfValidationPredictions = pd.DataFrame()
for i in range(0, len(models)):
    predictions = models[i].predict(X_val)
    colName = str(i)
    dfValidationPredictions[colName] = predictions
    evaluateModel(y_val, predictions, models[i])
# Evaluate stacked model with validation data
stackedPredictions = stackedModel.predict(dfValidationPredictions)
print("\n Evaluate Stacked Model ")
evaluateModel(y_val, stackedPredictions, stackedModel)
Stacked Model Explanation
- Split data into three groups: Train, Test, Validation
- Fit standalone models with training data.
- Standalone models make predictions with test data.
- Build a DataFrame with individual predictions.
- Fit a super learner with the DataFrame from the last step.
- Evaluate standalone models with validation data.
- Evaluate the super learner with validation data (a library-based alternative is sketched below).
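The same procedure can be delegated to scikit-learn's StackingClassifier, which trains the base models and the super learner together, using internal cross-validation in place of the separate test set; a minimal sketch with the base models from the classification example:
from sklearn.ensemble import StackingClassifier

# Each base model is named; the final estimator acts as the super learner.
estimators = [('lr', LogisticRegression()),
              ('dt', DecisionTreeClassifier()),
              ('rf', RandomForestClassifier(n_estimators=10))]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
stack.fit(X_train, y_train)
print("Accuracy: " + str(stack.score(X_val, y_val)))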
Benefits of Bagging and Stacking
1. Bagging (Bootstrap Aggregating)
- Reduction of variance: Creates multiple models on data subsets, averaging predictions for stability.
- Improved accuracy: Reduces overfitting by capturing diverse data patterns.
- Outlier robustness: Combines predictions, minimizing outlier impact.
2. Stacking (Stacked Generalization)
- Increased model performance: Integrates diverse models with complementary strengths.
- Model specialization: Creates meta-features from individual predictions, capturing higher-level information.
- Flexibility: Uses various model types, leveraging multiple algorithms.
Both techniques improve predictions, reduce overfitting, and enhance robustness, but both come at additional computational cost.