Lecture 12

Ensemble Learning

Data Mining

Improving Model Performance

Once you have a classification or prediction model, how can you make it better?
Hyperparameter tuning is one option, but it can only go so far
Why not combine multiple models to get better performance?

Why Combine Models?

Single models can be unstable (high variance)
Different models make different errors
Combining predictions averages out individual errors and can reduce overall error

Mathematical Reasoning

If we have \(B\) learners \(\hat{f}_1, \ldots, \hat{f}_B\), each with variance \(\sigma^2\), and pairwise correlation \(\rho\):

\[\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2\]

The second term shrinks as \(B\) increases; the first term, \(\rho\sigma^2\), is a floor that adding more learners cannot remove
Averaging helps most when learners are weakly correlated (low \(\rho\))
Building diverse learners is central to ensemble design
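To make the formula concrete, here is a minimal sketch that evaluates it for a few illustrative values. The function name and the chosen values of \(\rho\), \(\sigma^2\), and \(B\) are assumptions for illustration only.

```python
def ensemble_variance(rho: float, sigma2: float, B: int) -> float:
    """Variance of the average of B learners, each with variance sigma2
    and common pairwise correlation rho (the formula above)."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2


# Illustrative values: unit variance, 100 learners, varying correlation.
for rho in (0.0, 0.5, 1.0):
    print(rho, ensemble_variance(rho, sigma2=1.0, B=100))
```

With uncorrelated learners the variance falls like \(1/B\); with perfectly correlated learners averaging does nothing.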

Bagging: Bootstrap Aggregating

Draw \(B\) bootstrap samples from training data
Fit a learner to each bootstrap sample
Average predictions from all \(B\) learners

Note

Each bootstrap sample excludes about 36.8% of observations on average (the out-of-bag, or OOB, observations) → We can use OOB observations to estimate test error without a separate validation set
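The 36.8% figure comes from \((1 - 1/n)^n \to e^{-1}\). A minimal simulation sketch, assuming a hypothetical sample size of 1000 and 200 bootstrap draws, checks it empirically:

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def oob_fraction(n, rng):
    """Fraction of the n observations that never appear in one bootstrap sample."""
    in_bag = set(bootstrap_indices(n, rng))
    return 1.0 - len(in_bag) / n


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 1000                # illustrative sample size
fractions = [oob_fraction(n, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
theory = (1 - 1 / n) ** n  # probability a given row is left out

print(round(mean_oob, 3), round(theory, 3))
```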

import { baggingWidget } from "./widgets/ensembles/bagging-widget.js"
baggingWidget()

Random Forests Redux

Bagging + one important modification
At each split, only consider a random subset of \(m\) predictors
Common defaults: \(m = \sqrt{p}\) (classification), \(m = p/3\) (regression)

Why it works

Reduces correlation between trees (lower \(\rho\) in variance formula)
Each split considers only a random subset of predictors, so a few strong predictors cannot dominate every tree
More diverse ensemble → better predictions
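The per-split feature subsampling can be sketched as follows. The helper names are hypothetical, and rounding \(\sqrt{p}\) and \(p/3\) down to an integer is an assumed convention (implementations may round differently).

```python
import math
import random


def default_mtry(p, task):
    """Common default for m given p predictors (rounded down, at least 1)."""
    if task == "classification":
        return max(1, math.floor(math.sqrt(p)))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError("task must be 'classification' or 'regression'")


def candidate_features(p, m, rng):
    """Predictor indices eligible at ONE split; redrawn at every split."""
    return sorted(rng.sample(range(p), m))


rng = random.Random(1)
p = 16
m = default_mtry(p, "classification")
for split in range(3):
    # Each split gets a fresh random subset, so different trees
    # (and different nodes within a tree) use different predictors.
    print(f"split {split}: consider features {candidate_features(p, m, rng)}")
```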

Variable Importance

Two main approaches:

Impurity-based: Sum of Gini/variance reduction from splits on each variable
Permutation: Increase in OOB error when variable values are randomly shuffled
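A minimal sketch of the permutation approach. For simplicity it scores a stand-in model on a fixed data set rather than on OOB observations; the model, data, and function names are assumptions made for illustration.

```python
import random


def mse(model, X, y):
    """Mean squared error of model over rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, feature, rng, n_repeats=20):
    """Average increase in MSE when one feature column is shuffled."""
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


rng = random.Random(0)
# Toy data: y depends on feature 0 only; feature 1 is pure noise.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]


def model(row):
    """Stand-in for a fitted model; it ignores feature 1 entirely."""
    return 3.0 * row[0]


imp0 = permutation_importance(model, X, y, 0, rng)
imp1 = permutation_importance(model, X, y, 1, rng)
print("feature 0:", imp0, "feature 1:", imp1)
```

Shuffling a feature the model relies on breaks its link to the response and the error rises; shuffling an unused feature leaves the error unchanged.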

Boosting: Sequential Error Correction

Build learners sequentially, not independently
Each new learner targets the residuals (errors) of the current ensemble
Learning rate controls how much each model contributes

Pseudo-code for boosting:

Start with a simple model (e.g., mean prediction)
Fit a model to the residuals
Add this model to the ensemble (weighted by learning rate)
Repeat the previous two steps for a fixed number of rounds (or until validation error stops improving)

Why weight by the learning rate?

To prevent overfitting by controlling how much each tree can correct the errors of the previous ones.
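The pseudo-code above can be sketched with one-split regression trees (stumps) on a 1-D problem. The stump learner, toy data, number of rounds, and learning rate are illustrative assumptions, not a prescribed setup.

```python
import math


def fit_stump(x, r):
    """Best single-split regression stump on 1-D inputs (least squares)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent x values
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, n_rounds=50, learning_rate=0.1):
    f0 = sum(y) / len(y)              # start with the mean prediction
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # current errors
        stump = fit_stump(x, resid)                    # fit to residuals
        stumps.append(stump)
        # add the new learner, shrunk by the learning rate
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)


x = [i / 10 for i in range(40)]
y = [math.sin(xi) for xi in x]
model = boost(x, y)

mean_y = sum(y) / len(y)
mse_mean = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boost = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_mean, mse_boost)
```

A smaller `learning_rate` makes each stump's correction smaller, so more rounds are needed to reach the same training error.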

import { boostingCombinedWidget } from "./widgets/ensembles/boosting-combined-widget.js"
boostingCombinedWidget()

Gradient Boosting vs XGBoost

Gradient Boosting

General framework for minimizing a loss function
At each step, fit a tree to the negative gradient of the loss
Loss function determines what we’re optimizing (MSE, logloss, etc.)

What Changes In Gradient Boosting?

Ordinary boosting says “fit the next learner to what the current ensemble is getting wrong”
Gradient boosting makes that precise by defining “wrong” as the negative gradient of a chosen loss
For squared error, this is just the usual residual, so it looks like basic boosting
For classification, survival models, and other settings, the pseudo-residual depends on the loss rather than raw observed minus predicted values
Updates may happen on a transformed scale, such as log-odds for Bernoulli logloss, before converting back to probabilities
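A short sketch of the pseudo-residuals for two losses. It assumes squared error is written as \(\tfrac{1}{2}(y - f)^2\) and that, for Bernoulli logloss, the ensemble output \(f\) is on the log-odds scale; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logloss(y, f):
    """Bernoulli logloss for one observation, with f on the log-odds scale."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def neg_gradient_squared_error(y, f):
    """-d/df of 0.5 * (y - f)^2: the ordinary residual y - f."""
    return [yi - fi for yi, fi in zip(y, f)]


def neg_gradient_logloss(y, f):
    """-d/df of the logloss: observed label minus predicted probability."""
    return [yi - sigmoid(fi) for yi, fi in zip(y, f)]


y = [1, 0, 1]
f = [0.0, 0.0, 2.0]  # current ensemble output (log-odds for the logloss case)
print(neg_gradient_squared_error(y, f))
print(neg_gradient_logloss(y, f))
```

For logloss the pseudo-residual is `y - p`, not `y - f`: the tree is fit on the log-odds scale, and probabilities are recovered with `sigmoid`.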

XGBoost (eXtreme Gradient Boosting)

Optimized implementation with regularization
Adds L2 penalty on leaf weights: \(\lambda \sum_j w_j^2\)
Adds tree complexity penalty: \(\gamma \times \text{number of leaves}\)
Typically much faster than classic gradient boosting implementations, with regularization built directly into the objective
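To see how the two penalties act, here is a sketch of the closed-form leaf weight and split gain. It assumes the convention used in the XGBoost documentation, where the L2 penalty is written \(\tfrac{1}{2}\lambda \sum_j w_j^2\), and where \(G\) and \(H\) are the sums of first and second derivatives of the loss over the observations in a leaf. The numbers are made up for illustration.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight; larger lambda shrinks it toward zero."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Quality contribution of a single leaf."""
    return G * G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Improvement from splitting one leaf into left/right children.
    gamma is the price of the extra leaf; a split with gain <= 0 is not worth making."""
    parent = leaf_score(GL + GR, HL + HR, lam)
    children = leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
    return 0.5 * (children - parent) - gamma


# Illustrative gradient statistics for one candidate split.
print(leaf_weight(-4.0, 4.0, lam=0.0), leaf_weight(-4.0, 4.0, lam=4.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, lam=2.0, gamma=1.0))
```

Raising `lam` shrinks leaf weights and reduces the gain of every split; raising `gamma` removes splits whose gain does not cover the cost of an extra leaf.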

XGBoost Hyperparameters

Parameter	Effect
`eta` (learning rate)	Lower = smaller step per tree; needs more trees, usually less overfitting
`max_depth`	Deeper = more complex, more overfitting
`subsample`	Fraction of data per tree (reduces overfitting)
`colsample_bytree`	Fraction of features per tree
`min_child_weight`	Minimum sum of instance weight in leaf
`lambda` (reg_lambda)	L2 regularization on leaf weights
`alpha` (reg_alpha)	L1 regularization on leaf weights
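As a starting point, the parameters in the table could be collected in a dictionary like the one below. The values are illustrative assumptions, not recommendations; they should be tuned by cross-validation. The sketch does not import XGBoost, so it runs on its own.

```python
# Parameter names follow XGBoost's native (non-scikit-learn) interface.
# All values are placeholders chosen for illustration only.
params = {
    "objective": "binary:logistic",  # assumed task: binary classification
    "eval_metric": "logloss",
    "eta": 0.05,              # learning rate: small steps, so more rounds needed
    "max_depth": 4,           # shallower trees are less prone to overfitting
    "subsample": 0.8,         # fraction of rows sampled per tree
    "colsample_bytree": 0.8,  # fraction of features sampled per tree
    "min_child_weight": 1,    # minimum sum of instance weight in a child
    "lambda": 1.0,            # L2 penalty on leaf weights
    "alpha": 0.0,             # L1 penalty on leaf weights
}

# With the xgboost package installed and a DMatrix `dtrain` prepared,
# this dictionary would be passed as:
#   xgboost.train(params, dtrain, num_boost_round=500)
for name, value in params.items():
    print(f"{name:>18}: {value}")
```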

import { xgboostWidget } from "./widgets/ensembles/xgboost-widget.js"
xgboostWidget()

Stacking: Combining Different Model Families

Use different model families as base learners
Train a meta-learner on their predictions
Base learner predictions become features for the meta-learner

Key idea

Each model family has different inductive biases → different errors → a well-chosen meta-learner can exploit complementary strengths
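A minimal stacking sketch with two deliberately different base learners (a straight line and k-nearest neighbours) and a linear meta-learner. The toy data, the choice of learners, the fold scheme, and all function names are assumptions for illustration. The meta-learner is trained on out-of-fold predictions so it never sees predictions made on a base learner's own training rows.

```python
import math
import random


def fit_linear(x, y):
    """Least-squares line through 1-D data."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return lambda q: my + slope * (q - mx)


def fit_knn(x, y, k=5):
    """Average of the k nearest training responses."""
    pairs = list(zip(x, y))

    def predict(q):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
        return sum(v for _, v in nearest) / len(nearest)
    return predict


def out_of_fold(fit, x, y, n_folds=5):
    """Predict each row with a model trained on the other folds."""
    preds = [0.0] * len(x)
    for fold in range(n_folds):
        train = [i for i in range(len(x)) if i % n_folds != fold]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(x), n_folds):
            preds[i] = model(x[i])
    return preds


def fit_meta(p1, p2, y):
    """Least squares for y ~ w1*p1 + w2*p2 via the 2x2 normal equations."""
    a = sum(u * u for u in p1)
    b = sum(u * v for u, v in zip(p1, p2))
    c = sum(v * v for v in p2)
    d = sum(u * t for u, t in zip(p1, y))
    e = sum(v * t for v, t in zip(p2, y))
    det = a * c - b * b
    return (d * c - b * e) / det, (a * e - b * d) / det


def sse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y))


rng = random.Random(0)
x = [rng.uniform(0, 6) for _ in range(100)]
y = [0.5 * xi + math.sin(2 * xi) + rng.gauss(0, 0.2) for xi in x]

oof_lin = out_of_fold(fit_linear, x, y)   # base predictions become features
oof_knn = out_of_fold(fit_knn, x, y)
w_lin, w_knn = fit_meta(oof_lin, oof_knn, y)
oof_stack = [w_lin * a + w_knn * b for a, b in zip(oof_lin, oof_knn)]

# Refit base learners on all data for final predictions.
lin, knn = fit_linear(x, y), fit_knn(x, y)


def stacked_predict(q):
    return w_lin * lin(q) + w_knn * knn(q)


print("weights:", w_lin, w_knn)
print("OOF SSE:", sse(oof_lin, y), sse(oof_knn, y), sse(oof_stack, y))
```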

import { stackingWidget } from "./widgets/ensembles/stacking-widget.js"
stackingWidget()

AutoML: Automated Model Selection

Automatically searches over many model families and hyperparameters
Popular tools: H2O AutoML, auto-sklearn, TPOT, mljar

Trade-offs

Pros:

Fast baseline performance
Less manual tuning required
Good for initial exploration

Cons:

Can be a “black box”
Risk of data leakage in validation
May miss domain-specific insights
Often overkill for simple problems

import { automlWidget } from "./widgets/ensembles/automl-widget.js"

automl_data = FileAttachment("widgets/ensembles/data/longbeach-automl.json").json()

automlWidget(automl_data)

Summary: Ensemble Methods

Method	Key Idea	When to Use
Bagging	Average bootstrap trees	Reduce variance of unstable models
Random Forests	Bagging + feature subsampling	Default for tabular data
Boosting	Sequential error correction	When you need high accuracy
XGBoost	Regularized boosting	Industry standard for competitions
Stacking	Combine different model families	When you have diverse models
AutoML	Automated search	Fast baseline, exploration

Key Takeaways

Ensembles typically outperform single models
In bagging and random forests, more trees = more stable (but diminishing returns); in boosting, too many trees can overfit
Balance complexity with interpretability
Always validate properly (don’t overfit to validation!)

For More Details

Notes Chapter 9: Full theoretical treatment with mathematical derivations
Lab 9: Hands-on practice with ensemble methods
Tidymodels documentation: https://www.tidymodels.org/
XGBoost documentation: https://xgboost.readthedocs.io/
