Ensemble Models
Data Mining - Volinsky - 2011 - Columbia University
Ensemble Models - Motivation
• Remember this picture? We are always looking for a balance between low complexity ('good on average' but bad for prediction) and high complexity ('good for specific cases' but prone to overfitting).
• By combining many different models, ensembles make it easier to hit the 'sweet spot' of modelling.
• Ensembles work best when the individual models draw on diverse, independent opinions – the Wisdom of Crowds.
[Figure: training error S_train and test error S_test as a function of model complexity]
Ensemble Methods - Motivation
• Models are just models – their assumptions are usually not true! The truth is often much more complex than any single model can capture.
• Combinations of simple models can be arbitrarily complex (e.g. spam/robots models, neural nets, splines).
• Notion: an average of several measurements is often more accurate and stable than a single measurement.
  – Accuracy: how well the model does for estimation and prediction
  – Stability: small changes in inputs have little effect on outputs
Ensemble Methods – How They Work
• The ensemble predicts a target value as an average or a vote over the predictions of several individual models.
  – Each model is fit independently of the others.
  – The final prediction is a combination of the independent predictions of all models.
• For a continuous target, an ensemble averages the predictions (usually weighted).
• For a categorical target (classification), an ensemble may average the predicted probabilities of the target values, or may use 'voting'.
  – Voting classifies a case into the class that was selected most often by the individual models (see the sketch below).
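A minimal sketch of these two combination rules, assuming each base model is already fitted and exposes a predict method (the model list and helper names here are hypothetical, not from the slides):

```python
import numpy as np

def ensemble_average(models, X, weights=None):
    """Weighted average of continuous predictions from several fitted models."""
    preds = np.array([m.predict(X) for m in models])   # shape (n_models, n_cases)
    return np.average(preds, axis=0, weights=weights)  # uniform weights if None

def ensemble_vote(models, X):
    """Majority vote over class predictions (labels assumed to be small non-negative integers)."""
    preds = np.array([m.predict(X) for m in models]).astype(int)
    # for each case, pick the class label chosen by the most models
    return np.array([np.bincount(col).argmax() for col in preds.T])
```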
Ensemble Models – Why They Work
• Voting example:
  – 5 independent classifiers, each with 70% accuracy.
  – Combine them with majority voting.
  – What is the probability that the ensemble model is correct?
• Let's simulate it (see the sketch below).
  – What about 100 classifiers?
  – (Not a realistic example – why?)
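A minimal simulation of this thought experiment; the setup below simply restates the slide's numbers (5 classifiers, 70% accuracy each), and the exact binomial calculation is included as a check:

```python
import numpy as np
from math import comb

def majority_vote_accuracy(n_classifiers, p_correct):
    """Exact P(majority of independent classifiers is correct), via the binomial distribution."""
    need = n_classifiers // 2 + 1
    return sum(comb(n_classifiers, k) * p_correct**k * (1 - p_correct)**(n_classifiers - k)
               for k in range(need, n_classifiers + 1))

def simulate(n_classifiers, p_correct, n_trials=100_000, seed=0):
    """Monte Carlo check of the same quantity."""
    rng = np.random.default_rng(seed)
    correct = rng.random((n_trials, n_classifiers)) < p_correct
    return np.mean(correct.sum(axis=1) > n_classifiers / 2)

print(majority_vote_accuracy(5, 0.7))    # ~0.837, versus 0.70 for a single classifier
print(simulate(5, 0.7))                  # should agree closely
print(majority_vote_accuracy(100, 0.7))  # essentially 1 with 100 independent classifiers
```

One reason the exercise is not realistic: classifiers trained on the same data are not independent, and correlated errors shrink the gain from voting considerably.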
Ensemble Schemes
• The beauty is that you can average together models of any kind!
• You don't need fancy schemes – just average!
• But there are fancy schemes; each has its own way of fitting many models to the same data and combining them by voting or averaging:
  – Stacking (Wolpert 92): fit many leave-one-out models
  – Bagging (Breiman 96): build models on many resampled versions of the original data
  – Boosting (Freund & Schapire 96): iteratively re-model, using re-weighted data based on errors from previous models…
  – Arcing (Breiman 98), Bumping (Tibshirani 97), Crumpling (Anderson & Elder 98), Born-Again (Breiman 98)
  – Bayesian Model Averaging – near to my heart…
• We'll explore BMA, bagging and boosting…
Model Averaging
• Idea: account for the inherent variance of the model selection process.
• Posterior Variance = Within-Model Variance + Between-Model Variance (see the decomposition below).
• Data-driven model selection is risky: "part of the evidence is spent to specify the model" (Leamer, 1978).
• Inferences based on a single selected model can be over-precise, because model uncertainty is ignored.
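The decomposition referenced above is the law of total variance taken over models; in the notation of the next slide (Δ the quantity of interest, M a model, D the data):

\mathrm{Var}(\Delta \mid D) = \mathrm{E}_{M}\big[\mathrm{Var}(\Delta \mid M, D)\big] + \mathrm{Var}_{M}\big(\mathrm{E}[\Delta \mid M, D]\big)

where the outer expectation and variance are taken with respect to the posterior model probabilities Pr(M | D).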
Model Averaging
• For some quantity of interest Δ, average over all models M_k, given the data D:

  \Pr(\Delta \mid D) = \sum_k \Pr(\Delta \mid M_k, D)\,\Pr(M_k \mid D)

• To calculate the first term properly, you need to integrate out the model parameters θ:

  \Pr(\Delta \mid M, D) = \int \Pr(\Delta \mid M, \theta, D)\,\Pr(\theta \mid M, D)\,d\theta \;\approx\; \Pr(\Delta \mid M, \hat{\theta}, D)

  where \hat{\theta} is the MLE.

• For the second term, note that

  \Pr(M_k \mid D) \propto \Pr(D \mid M_k)\,\Pr(M_k)

  and the log marginal likelihood can be approximated by BIC:

  \mathrm{BIC}_k = \log \Pr(D \mid M_k) \approx \log \Pr(D \mid \hat{\theta}_k, M_k) - \frac{d_k}{2}\log(n)

  where d_k is the number of parameters in model M_k and n is the sample size.
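A sketch of how these approximations turn into averaging weights, assuming you already have a BIC-style log marginal likelihood for each candidate model (the model scores and prediction arrays below are hypothetical placeholders):

```python
import numpy as np

def bma_weights(log_marglik, log_prior=None):
    """Posterior model probabilities Pr(M_k | D) from log Pr(D | M_k) (e.g. the BIC approximation)."""
    log_marglik = np.asarray(log_marglik, dtype=float)
    if log_prior is None:
        log_prior = np.zeros_like(log_marglik)   # uniform prior over models
    log_post = log_marglik + log_prior
    log_post -= log_post.max()                   # stabilise before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

# hypothetical example: three candidate models with BIC-approximated log Pr(D | M_k)
weights = bma_weights([-520.3, -521.1, -525.8])
per_model_predictions = np.array([[1.2, 0.7],    # rows: models, columns: cases
                                  [1.4, 0.6],
                                  [0.9, 0.9]])
bma_prediction = weights @ per_model_predictions  # model-averaged prediction for each case
```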
Bayesian Model Averaging
• The approximations on the previous page allow you to calculate many posterior model probabilities quickly, and give you the weights to use for averaging.
• But how do you know which models to average over?
  – Example: regression with p candidate predictors.
  – Each subset of the p predictors is a 'model'.
  – 2^p possible models!
• Idea: restrict the average to a manageable set of the best-supported models rather than all 2^p.
Model Averaging
• But how do we find the best models without fitting all of them?
• Solution: the Leaps and Bounds algorithm can find the best models without fitting all of them.
  – Goal: find the single best model of each model size.
[Figure: branch-and-bound search tree over predictor subsets; a branch need not be traversed if no model in it can beat the best model found so far (e.g. the model AB)]
BMA - Example
[Figure: candidate models ranked by PMP (posterior model probability), with the best models highlighted. Score on holdout data: BMA wins.]
Boosting…
• Different approach to model ensembles – mostly for classification
• Observed: when model predictions are not highly correlated, combining does well
• Big idea: can we fit models specifically to the “difficult” parts of the data?
Boosting - Advantages
• Fast algorithms - AdaBoost
• Flexible - can work with any classification algorithm
• Individual models don't have to be good
  – In fact, the method works best with bad models!
  – (bad = slightly better than random guessing)
  – Most common base model: "boosted stumps" (one-split decision trees) - see the sketch below
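A minimal sketch of boosted stumps in the classic AdaBoost style described above: repeatedly fit a one-split stump to re-weighted data, where the weights grow on the cases previous stumps got wrong. Labels are assumed to be coded as -1/+1; this is an illustration under those assumptions, not the exact algorithm from the slides:

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-split 'stump' under sample weights w; y in {-1, +1}. Brute-force search."""
    n, p = X.shape
    best = (0, 0.0, 1, np.inf)                        # (feature, threshold, polarity, weighted error)
    for j in range(p):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - t) > 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def adaboost(X, y, n_rounds=20):
    """Classic binary AdaBoost: re-weight the data toward cases the previous stumps misclassified."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps, alphas = [], []
    for _ in range(n_rounds):
        j, t, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-10)                         # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)         # weight of this stump in the ensemble
        pred = np.where(pol * (X[:, j] - t) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)                # up-weight misclassified cases
        w /= w.sum()
        stumps.append((j, t, pol))
        alphas.append(alpha)

    def predict(Xnew):
        score = sum(a * np.where(p * (Xnew[:, j] - t) > 0, 1, -1)
                    for (j, t, p), a in zip(stumps, alphas))
        return np.sign(score)                         # -1/+1 prediction
    return predict
```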
Bagging for Combining Classifiers
Bagging = Bootstrap aggregating
• Big idea:
  – To avoid overfitting to a specific dataset, fit models to "bootstrapped" random versions of the data.
• Bootstrap:
  – Random sample, with replacement, from the data set.
  – Size of sample = size of data.
  – X = (1,2,3,4,5,6,7,8,9,10)
  – B1 = (1,2,3,3,4,5,6,6,7,8)
  – B2 = (1,1,1,1,2,2,2,5,6,8)
  – …
• Bootstrap samples have the same statistical properties as the original data (see the resampling sketch below).
• By creating similar datasets you can see how much stability there is in your data. If there is a lack of stability, averaging helps.
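A tiny sketch of the resampling step above; the seed and variable names are arbitrary choices:

```python
import numpy as np

X = np.arange(1, 11)                       # the toy data (1, ..., 10) from the slide
rng = np.random.default_rng(1)

# a bootstrap sample: draw n items with replacement from the n original items
B1 = np.sort(rng.choice(X, size=len(X), replace=True))
B2 = np.sort(rng.choice(X, size=len(X), replace=True))
print(B1, B2)                              # some values repeat, others are left out ("out of bag")
```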
Bagging
• Training data set of size N.
• Generate B "bootstrap" sampled data sets of size N.
• Build B models (e.g., trees), one for each bootstrap sample.
  – Intuition: the bootstrapping "perturbs" the data enough to make the models more resistant to true variability.
  – Note: only ~63% of the distinct data points are included in any given bootstrap sample (1 - 1/e ≈ 0.632).
  – Can use the rest as an out-of-sample ("out-of-bag") estimate!
• For prediction, combine the predictions from the B models (see the sketch below).
  – Voting or averaging; the "out-of-bag" samples give an honest error estimate.
  – Plus: generally improves accuracy for models such as trees.
  – Negative: you lose interpretability.
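A sketch of bagged trees with an out-of-bag error estimate, using scikit-learn's decision trees as the base models; the dataset and parameter values are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n, B = len(y), 50
rng = np.random.default_rng(0)

votes = np.zeros((n, 2))                              # class votes from models that did NOT see the case
for _ in range(B):
    idx = rng.integers(0, n, size=n)                  # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)             # ~37% of cases are "out of bag"
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    pred = tree.predict(X[oob])
    votes[oob, pred] += 1                             # tally out-of-bag votes

covered = votes.sum(axis=1) > 0
oob_error = np.mean(votes[covered].argmax(axis=1) != y[covered])
print("out-of-bag error of the bagged ensemble:", round(oob_error, 3))
```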
Random Forests
• Trees are great, but…
  – As we've seen, they are "unstable".
  – Also, trees are sensitive to the primary split, which can lead the tree in inappropriate directions.
  – One way to see this: fit a tree on a random sample, or a bootstrapped sample, of the data (as in the sketch below).
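A quick way to try this yourself, assuming scikit-learn and its bundled breast-cancer data; the seeds and printing are incidental choices:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names
n = len(y)

# fit the same kind of tree on a few bootstrap samples and inspect the primary (root) split
for seed in range(3):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    root_feature = tree.tree_.feature[0]
    root_threshold = tree.tree_.threshold[0]
    print(f"bootstrap {seed}: root split on '{names[root_feature]}' at {root_threshold:.2f}")
# the chosen root variable and/or threshold can change from sample to sample
```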
Example of Tree Instability
[Figure from G. Ridgeway, 2003]
Random Forests
• Solution: random forests - an ensemble of decision trees.
  – Similar to bagging: inject randomness to overcome instability.
  – Each tree is built on a random subset of the training data (a bootstrapped version of the data).
  – At each split point, only a random subset of the predictors is considered.
  – Use the "out-of-bag" hold-out sample to estimate the size of each tree.
  – Prediction is simply the majority vote of the trees (or the mean prediction of the trees). See the sketch below.
• Randomizing the variables used is the key - it reduces correlation between the models!
• Has the advantages of trees, with more robustness and a smoother decision rule.
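A short usage sketch with scikit-learn's implementation; the parameter values here are plausible defaults for illustration, not recommendations from the slides:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# 500 trees, each grown on a bootstrap sample; at every split only sqrt(p) predictors are considered
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, random_state=0).fit(X, y)

print("out-of-bag accuracy:", round(forest.oob_score_, 3))
# variables that keep being chosen for splits get high importance scores
print("top importances:", sorted(forest.feature_importances_, reverse=True)[:3])
```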
Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32
Random Forests – How Big A Tree
• Breiman’s original algorithm said: “to keep bias low, trees are to be grown to maximum depth”
• However, empirical evidence typically shows that “stumps” do best
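A quick, hedged way to probe this trade-off on your own data with scikit-learn; the dataset and depth grid below are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

for depth in (1, 3, None):        # stumps, shallow trees, fully grown trees
    rf = RandomForestClassifier(n_estimators=300, max_depth=depth, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(f"max_depth={depth}: cross-validated accuracy {score:.3f}")
```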
Ensembles – Main Points
• Averaging models together has been shown to be effective for prediction.
• Many weird names:
  – See papers by Leo Breiman (e.g. "Bagging Predictors", "Arcing the Edge", and "Random Forests") for more detail.
• Key points:
  – Models average well if they are uncorrelated.
  – You can inject randomness to ensure uncorrelated models.
  – Averaging small models works better than averaging large ones.
• Ensembles can also give more insight into the variables than a single tree:
  – Variables that show up again and again must be good.
Visualizing Forests
• Data: Wisconsin Breast Cancer
  – Courtesy S. Urbanek
References
• Random Forests, from Leo Breiman himself:
  – Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1), 5-32.
• Hastie, Tibshirani, Friedman (HTF), The Elements of Statistical Learning – Chapters 8, 10, 15, 16.