Lecture 12

Ensemble Learning

Data Mining

Improving Model Performance

  • Once you have a classification or prediction model, how can you make it better?

  • Hyperparameter tuning is one option, but it can only go so far

  • Why not combine multiple models to get better performance?

Why Combine Models?

  • Single models can be unstable (high variance)
  • Different models make different errors
  • Combining predictions can reduce overall error, provided the models' errors are not perfectly correlated

Mathematical Reasoning

If we have \(B\) learners \(\hat{f}_1, \ldots, \hat{f}_B\), each with variance \(\sigma^2\), and pairwise correlation \(\rho\):

\[\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^B \hat{f}_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2\]

  • The second term shrinks as \(B\) increases; the first term \(\rho\sigma^2\) does not, so it sets a floor on the variance
  • Averaging helps most when learners are weakly correlated (low \(\rho\))
  • Building diverse learners is central to ensemble design
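A quick numerical look at the variance formula above may help. This is a minimal sketch: the function name `ensemble_variance` and the values \(\sigma^2 = 1\), \(\rho \in \{0, 0.5, 0.9\}\) are illustrative assumptions, not part of the lecture.

```python
def ensemble_variance(sigma2: float, rho: float, B: int) -> float:
    """Variance of the average of B learners that share variance sigma2
    and pairwise correlation rho (the formula on this slide)."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

# Illustrative values only: sigma2 = 1.0 is an assumption.
for rho in (0.0, 0.5, 0.9):
    row = [round(ensemble_variance(1.0, rho, B), 3) for B in (1, 10, 100)]
    print(f"rho={rho}: variance at B=1, 10, 100 -> {row}")
```

With \(\rho = 0\) the variance keeps falling as \(1/B\); with \(\rho = 0.9\) it barely moves, because the floor \(\rho\sigma^2\) dominates.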

Bagging: Bootstrap Aggregating

  • Draw \(B\) bootstrap samples from training data
  • Fit a learner to each bootstrap sample
  • Average predictions from all \(B\) learners (majority vote for classification)

Note

Each bootstrap sample leaves out about 36.8% of observations on average, since \((1 - 1/n)^n \to e^{-1} \approx 0.368\). These are the out-of-bag (OOB) observations → We can use OOB observations to estimate test error without a validation set
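The 36.8% figure can be checked by simulation. The sketch below uses only the Python standard library; the sample size `n = 2000`, the 200 repetitions, and the helper name `oob_fraction` are arbitrary assumptions for illustration.

```python
import random

def oob_fraction(n: int, rng: random.Random) -> float:
    """Draw one bootstrap sample of size n (with replacement) and return
    the fraction of observations that were never selected."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(in_bag) / n

rng = random.Random(0)
n = 2000
fractions = [oob_fraction(n, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)

# Probability that one given observation is missed by all n draws.
exact = (1 - 1 / n) ** n
print(f"simulated OOB fraction: {mean_oob:.3f}, exact: {exact:.3f}")
```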

Interactive: Bagging Widget

Random Forests Redux

  • Bagging + one important modification
  • At each split, only consider a random subset of \(m\) predictors
  • Common defaults: \(m = \sqrt{p}\) (classification), \(m = p/3\) (regression)
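The split-level sampling described above can be sketched as follows. The helper names `default_mtry` and `candidate_predictors` are hypothetical, and rounding \(\sqrt{p}\) and \(p/3\) down to an integer is an assumption (implementations differ on rounding).

```python
import math
import random

def default_mtry(p: int, task: str) -> int:
    """Common default number of candidate predictors per split.
    Rounding down is an assumption made for this sketch."""
    if task == "classification":
        return max(1, math.isqrt(p))
    return max(1, p // 3)  # regression

def candidate_predictors(p: int, m: int, rng: random.Random) -> list:
    """Indices of the m predictors a single split is allowed to consider."""
    return rng.sample(range(p), m)

rng = random.Random(0)
p = 16
m = default_mtry(p, "classification")
for split in range(3):  # a fresh subset is drawn at every split
    print(f"split {split}: consider predictors {sorted(candidate_predictors(p, m, rng))}")
```

Because the subset is redrawn at every split, a single tree can still end up using all \(p\) predictors across its splits.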

Why it works

  • Reduces correlation between trees (lower \(\rho\) in variance formula)
  • Each split considers only a random subset of predictors, so a few strong predictors cannot dominate every tree
  • More diverse ensemble → better predictions

Variable Importance

Two main approaches:

  1. Impurity-based: Sum of Gini/variance reduction from splits on each variable
  2. Permutation: Increase in OOB error when variable values are randomly shuffled
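A minimal sketch of the permutation approach, under two loudly stated assumptions: the "fitted model" is a hypothetical stand-in function rather than a real forest, and the error is measured on the full synthetic data set rather than on OOB observations.

```python
import random

def mse(model, X, y):
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, rng, n_repeats=20):
    """Mean increase in MSE when each column is shuffled in turn."""
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

rng = random.Random(1)
# Synthetic data: y depends on column 0 only; column 1 is pure noise.
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
y = [3.0 * x0 + rng.gauss(0, 0.1) for x0, _ in X]

def model(row):
    # Hypothetical stand-in for a fitted model (it ignores column 1).
    return 3.0 * row[0]

imp = permutation_importance(model, X, y, rng)
print(f"importance of column 0: {imp[0]:.2f}, column 1: {imp[1]:.2f}")
```

Shuffling the informative column raises the error sharply, while shuffling the column the model ignores changes nothing.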

Boosting: Sequential Error Correction

  • Build learners sequentially, not independently
  • Each new learner targets the residuals (errors) of the current ensemble
  • Learning rate controls how much each model contributes

Pseudo-code for boosting:

  1. Start with a simple model (e.g., mean prediction)
  2. Fit a model to the current residuals
  3. Add this model to the ensemble (weighted by learning rate)
  4. Repeat the fit-and-add steps for a fixed number of rounds
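The pseudo-code above can be turned into a small runnable sketch for squared-error regression. Everything here is an assumption for illustration: the weak learner is a one-split stump, the data is a synthetic sine curve, and the names `fit_stump` and `boost` are hypothetical.

```python
import math

def fit_stump(x, r):
    """Best single-split regression stump for targets r (squared error)."""
    best = None
    for t in sorted(set(x))[:-1]:  # drop the max so the right side is never empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=100, learning_rate=0.1):
    f0 = sum(y) / len(y)  # initial model: the mean prediction
    stumps = []
    pred = [f0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)  # fit the weak learner to the residuals
        stumps.append(stump)
        # Add the new learner, shrunk by the learning rate.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)

x = [i / 10 for i in range(60)]
y = [math.sin(xi) for xi in x]
model = boost(x, y)

train_mse = sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)
baseline_mse = sum((sum(y) / len(y) - yi) ** 2 for yi in y) / len(y)
print(f"mean-only MSE: {baseline_mse:.4f}, boosted MSE: {train_mse:.4f}")
```

These are training errors only; a held-out set would be needed to see where extra rounds stop helping.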

Why weight by the learning rate?

Shrinking each tree's contribution limits how much it can correct the errors of the previous ones, which helps prevent overfitting.

Interactive: Combined Boosting Widget

Gradient Boosting vs XGBoost

Gradient Boosting

  • General framework for minimizing a loss function
  • At each step, fit a tree to the negative gradient of the loss
  • Loss function determines what we’re optimizing (MSE, log loss, etc.)

What Changes In Gradient Boosting?

  • Ordinary boosting says “fit the next learner to what the current ensemble is getting wrong”
  • Gradient boosting makes that precise by defining “wrong” as the negative gradient of a chosen loss
  • For squared error, this is just the usual residual, so it looks like basic boosting
  • For classification, survival models, and other settings, the pseudo-residual depends on the loss rather than raw observed minus predicted values
  • Updates may happen on a transformed scale, such as log-odds for Bernoulli log loss, before converting back to probabilities
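To make the pseudo-residual idea concrete, the sketch below compares squared error with Bernoulli log loss, where the score `f` is on the log-odds scale. The function names are hypothetical; the numerical check at the end only confirms that `y - sigmoid(f)` is the negative gradient.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    """Bernoulli log loss for label y in {0, 1} and log-odds score f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual_squared_error(y, f):
    # Loss = 0.5 * (y - f)^2, so the negative gradient is the ordinary residual.
    return y - f

def pseudo_residual_log_loss(y, f):
    # Negative gradient of the log loss with respect to the log-odds f.
    return y - sigmoid(f)

# Check the log-loss pseudo-residual against a central-difference gradient.
y, f, h = 1, 0.3, 1e-6
numeric = -(log_loss(y, f + h) - log_loss(y, f - h)) / (2 * h)
analytic = pseudo_residual_log_loss(y, f)
print(f"numeric: {numeric:.6f}, analytic: {analytic:.6f}")
```

In both cases the next tree is fit to "observed minus predicted", but for log loss the prediction is the probability implied by the current log-odds.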

XGBoost (eXtreme Gradient Boosting)

  • Optimized implementation with regularization
  • Adds L2 penalty on leaf weights: \(\tfrac{1}{2}\lambda \sum_j w_j^2\)
  • Adds tree complexity penalty: \(\gamma \times \text{number of leaves}\)
  • Typically much faster than classic gradient boosting implementations, with explicit regularization built into the objective
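A sketch of how the two penalties enter tree building, assuming the convention from the XGBoost paper: the L2 term is \(\tfrac{1}{2}\lambda \sum_j w_j^2\), and \(G\), \(H\) are the sums of first and second derivatives of the loss over the observations in a leaf. The function names are hypothetical.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Reduction in the regularized objective from splitting one leaf
    into a left child (GL, HL) and a right child (GR, HR)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Squared error: each observation has gradient (prediction - y) and Hessian 1.
# Four observations, each with residual +1, give G = -4 and H = 4.
print(leaf_weight(-4.0, 4.0, lam=0.0))  # unpenalized: the mean residual
print(leaf_weight(-4.0, 4.0, lam=4.0))  # lambda shrinks the weight toward 0

# gamma is charged per extra leaf, so a split must earn more than gamma.
print(split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, lam=0.0, gamma=10.0))
```

Larger \(\lambda\) pulls leaf weights toward zero, and larger \(\gamma\) makes marginal splits show negative gain.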

XGBoost Hyperparameters

| Parameter | Effect |
|---|---|
| eta (learning rate) | Lower = more trees needed, less overfitting |
| max_depth | Deeper = more complex, more overfitting |
| subsample | Fraction of data per tree (reduces overfitting) |
| colsample_bytree | Fraction of features per tree |
| min_child_weight | Minimum sum of instance weights required in a leaf (higher = more conservative) |
| lambda (reg_lambda) | L2 regularization on leaf weights |
| alpha (reg_alpha) | L1 regularization on leaf weights |
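For reference, the parameters above might be collected into a dictionary like the one below. The key names follow XGBoost's native interface; the values are illustrative starting points chosen for this sketch, not recommendations from the lecture, and the `objective` entry is an added assumption (squared-error regression).

```python
# Illustrative values only; tune them with cross-validation.
params = {
    "objective": "reg:squarederror",  # assumed task: regression
    "eta": 0.05,                      # learning rate
    "max_depth": 4,
    "subsample": 0.8,                 # fraction of rows per tree
    "colsample_bytree": 0.8,          # fraction of features per tree
    "min_child_weight": 1,
    "lambda": 1.0,                    # L2 penalty on leaf weights
    "alpha": 0.0,                     # L1 penalty on leaf weights
}

# With xgboost installed, a dict like this is the first argument to
# xgboost.train(params, dtrain, num_boost_round=...), where dtrain is a DMatrix.
for name, value in params.items():
    print(f"{name:>18}: {value}")
```

The scikit-learn style wrapper uses `learning_rate`, `reg_lambda`, and `reg_alpha` in place of `eta`, `lambda`, and `alpha`.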

Interactive: XGBoost Widget

Stacking: Combining Different Model Families

  • Use different model families as base learners
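  • Train a meta-learner on their out-of-fold (cross-validated) predictions, so it never sees predictions made on a base learner's own training rows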
  • Base learner predictions become features for the meta-learner

Key idea

Each model family has different inductive biases → different errors → a well-chosen meta-learner can exploit complementary strengths
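A minimal stacking sketch in plain Python, under stated assumptions: the two base learners (a least-squares line and a k-nearest-neighbour average), the no-intercept linear meta-learner, the 5 folds, and the synthetic data are all illustrative choices, and every function name is hypothetical.

```python
import math
import random

def fit_linear(x, y):
    """Least-squares line through (x, y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return lambda q: my + slope * (q - mx)

def fit_knn(x, y, k=5):
    """Average of the k nearest training targets."""
    pairs = list(zip(x, y))
    def predict(q):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - q))[:k]
        return sum(b for _, b in nearest) / k
    return predict

def fit_meta(z1, z2, y):
    """Least-squares weights for y ~ w1*z1 + w2*z2 (no intercept)."""
    a11 = sum(a * a for a in z1)
    a22 = sum(b * b for b in z2)
    a12 = sum(a * b for a, b in zip(z1, z2))
    b1 = sum(a * t for a, t in zip(z1, y))
    b2 = sum(b * t for b, t in zip(z2, y))
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

def stack(x, y, n_folds=5):
    n = len(x)
    z1, z2 = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        xt, yt = [x[i] for i in train], [y[i] for i in train]
        lin, knn = fit_linear(xt, yt), fit_knn(xt, yt)
        for i in range(fold, n, n_folds):  # rows held out of this fold's fit
            z1[i], z2[i] = lin(x[i]), knn(x[i])
    w1, w2 = fit_meta(z1, z2, y)  # meta-learner sees out-of-fold predictions only
    lin, knn = fit_linear(x, y), fit_knn(x, y)  # refit base learners on all data
    return lambda q: w1 * lin(q) + w2 * knn(q)

rng = random.Random(0)
x = [rng.uniform(0, 6) for _ in range(200)]
y = [0.5 * a + math.sin(2 * a) + rng.gauss(0, 0.2) for a in x]
model = stack(x, y)
print(f"stacked prediction at x=3.0: {model(3.0):.3f}")
```

The line captures the trend and the neighbour average captures the local wiggle; the meta-learner decides how much to trust each.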

Interactive: Stacking Widget

AutoML: Automated Model Selection

  • Automatically searches over many model families and hyperparameters
  • Popular tools: H2O AutoML, auto-sklearn, TPOT, mljar

Trade-offs

Pros:

  • Fast baseline performance
  • Less manual tuning required
  • Good for initial exploration

Cons:

  • Can be a “black box”
  • Risk of data leakage in validation
  • May miss domain-specific insights
  • Often overkill for simple problems

Interactive: AutoML Widget

Summary: Ensemble Methods

| Method | Key Idea | When to Use |
|---|---|---|
| Bagging | Average bootstrap trees | Reduce variance of unstable models |
| Random Forests | Bagging + feature subsampling | Default for tabular data |
| Boosting | Sequential error correction | When you need high accuracy |
| XGBoost | Regularized boosting | Industry standard for competitions |
| Stacking | Combine different model families | When you have diverse models |
| AutoML | Automated search | Fast baseline, exploration |

Key Takeaways

  • Ensembles typically outperform single models
  • In bagging and random forests, more trees = more stable (but diminishing returns); in boosting, too many trees can overfit
  • Balance complexity with interpretability
  • Always validate properly (don’t overfit to validation!)

For More Details

  • Notes Chapter 9: Full theoretical treatment with mathematical derivations
  • Lab 9: Hands-on practice with ensemble methods
  • Tidymodels documentation: https://www.tidymodels.org/
  • XGBoost documentation: https://xgboost.readthedocs.io/