What are Ensemble Methods in Statistical or Machine Learning?
There are plenty of explanations elsewhere; here I’d like to share an example question in an interview setting.
What are ensemble methods in statistical or machine learning?
Here are some tips for readers’ reference:
Ensemble methods are techniques that aim to improve a model’s accuracy by combining multiple models instead of relying on a single one. The combined models typically produce significantly more accurate results than any individual model, which has boosted the popularity of ensemble methods in machine learning.
How is “ensembling” done?
In the simplest case, by averaging the models’ predictions (or, for classification, taking a majority vote).
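To make that concrete, here is a minimal sketch of averaging in Python. The models, dataset, and hyperparameters below are illustrative choices of mine, not from any of the sources quoted here:

```python
# A minimal sketch of "ensembling by averaging": train a few different
# regressors on the same data and average their predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [Ridge(), DecisionTreeRegressor(random_state=0), KNeighborsRegressor()]
preds = []
for model in models:
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

# The ensemble prediction is simply the mean of the individual predictions.
ensemble_pred = np.mean(preds, axis=0)

for model, p in zip(models, preds):
    print(type(model).__name__, mean_squared_error(y_test, p))
print("Averaged ensemble", mean_squared_error(y_test, ensemble_pred))
```

On most runs the averaged ensemble’s error lands at or below the best single model’s, which is exactly the effect the interview answer is pointing at.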
Let’s contrast how this is explained by machine learning professor Jeremy Howard and statistical learning professors Trevor Hastie and Rob Tibshirani. (We covered the difference between statistical learning and machine learning in a previous post here!)
Jeremy:
Rob and Trevor:
The full episode on bagging is a classic, worth watching to the end!
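Since the episode is about bagging, here is a hedged scikit-learn sketch of the idea (the dataset and estimator count are illustrative assumptions, not from the lecture): each tree is fit on a bootstrap resample of the training data, and the trees’ predictions are combined by voting.

```python
# Bagging sketch: compare one decision tree against 100 bagged trees,
# each trained on a bootstrap sample of the same data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
# BaggingClassifier uses a decision tree as its default base estimator;
# 100 estimators is an arbitrary illustrative choice.
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y).mean())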
Why does ensembling work? We’ll cover it in a future post!
In addition, ensemble methods show up in more places than you might think! The famous Transformer neural network architecture uses “multi-head attention”, which can be viewed as a kind of ensemble model that “combines multiple single-head attention modules by calculating their average”.
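As a toy illustration of that “heads as an ensemble” view, here is a small numpy sketch that runs several single-head attention modules on the same input and averages their outputs. Note that standard Transformers concatenate the heads and apply a learned projection; the plain average here follows the simplified framing in the quote, and all weights and dimensions are made up for illustration:

```python
# Toy sketch: several single-head attention modules, "ensembled" by averaging.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, Wq, Wk, Wv):
    # Scaled dot-product attention for one head.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

x = rng.normal(size=(seq_len, d_model))
heads = [
    single_head_attention(
        x,
        rng.normal(size=(d_model, d_model)),  # random illustrative weights
        rng.normal(size=(d_model, d_model)),
        rng.normal(size=(d_model, d_model)),
    )
    for _ in range(n_heads)
]

# "Ensemble" of heads: elementwise average of the single-head outputs.
output = np.mean(heads, axis=0)
print(output.shape)  # (5, 16)
```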
Happy practicing!
Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr!
Note: There are different angles from which to answer an interview question. This newsletter does not try to find a reference that answers each question exhaustively; rather, it shares some quick insights to help readers think, practice, and do further research as necessary.
Source of quotes/images:
Blog: What are Ensemble Methods?
Blog: Many Heads Are Better Than One: The Case for Ensemble Learning
Source of videos:
Statistical Learning: 8.4 Bagging, Trevor Hastie and Rob Tibshirani
Lesson 7: Practical Deep Learning for Coders 2022, Jeremy Howard
Good reads:
Paper: Multi-head or Single-head? An Empirical Comparison for Transformer Training