cvilledatascience
Online Random Forest in 10 Minutes
Traditional Supervised Learning Algorithms
● Regression
● Random Forest
● Support Vector Machines
● Classification and Regression Tree (CART)
● etc.
Inputs
● Data Matrix (Regression)
Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4
.56         Red          .456         Male         .589
.78         Green        .654         Female       .6654
.987        Blue         .678         Female       .789
.123        Blue         .999         Male         .543
Inputs
● Data Matrix (Binary Classification)
Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4
Yes         Red          .456         Male         .589
No          Green        .654         Female       .6654
Yes         Blue         .678         Female       .789
No          Blue         .999         Male         .543
Inputs To Streaming Classification
● Observations now have an explicit arrival order.
Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4  Time
Yes         Red          .456         Male         .589         Jan 1st 2011
No          Green        .654         Female       .6654        Feb 4th 2012
Yes         Blue         .678         Female       .789         Feb 5th 2013
No          Blue         .999         Male         .543         July 4th 2013
Inputs To Streaming Classification
● New observations can arrive at any time
Predictand  Predictor 1  Predictor 2  Predictor 3  Predictor 4  Time
Yes         Red          .456         Male         .589         Jan 1st 2011
No          Green        .654         Female       .6654        Feb 4th 2012
Yes         Blue         .678         Female       .789         Feb 5th 2013
No          Blue         .999         Male         .543         July 4th 2013
Yes         Red          .456         Male         .456         NOW
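A minimal sketch of what "explicit arrival order" means in code, assuming each row is wrapped in a small record type. The names `Observation` and `as_stream` are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterator, List


@dataclass(frozen=True)
class Observation:
    label: str         # predictand: "Yes" or "No"
    predictors: tuple  # (Predictor 1, ..., Predictor 4)
    time: date         # arrival time of the observation


def as_stream(rows: List[Observation]) -> Iterator[Observation]:
    """Yield observations one at a time, oldest first."""
    for obs in sorted(rows, key=lambda o: o.time):
        yield obs


# The four dated rows from the table, deliberately listed out of order.
rows = [
    Observation("Yes", ("Blue", 0.678, "Female", 0.789), date(2013, 2, 5)),
    Observation("Yes", ("Red", 0.456, "Male", 0.589), date(2011, 1, 1)),
    Observation("No", ("Blue", 0.999, "Male", 0.543), date(2013, 7, 4)),
    Observation("No", ("Green", 0.654, "Female", 0.6654), date(2012, 2, 4)),
]

stream_labels = [obs.label for obs in as_stream(rows)]
```

A learner consuming `as_stream(rows)` sees one observation at a time and never the whole matrix at once.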
Problems
● Do the important predictors change over time and when does this change occur?
● How far back is data relevant to today’s problem?
● What happens when our predictors change again in the future?
● What if this is all happening rapidly… will it scale?
Enter Online Random Forest
● Input is a single new observation
● Trees learn incrementally on this new data
● Trees are dropped from the forest based on performance and replaced by a new “ungrown” tree
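The three points above can be sketched as an update loop. This is an illustration under assumptions, not the authors' implementation: `ToyTree` is a hypothetical stand-in that only tracks class counts (a real tree would also grow splits), and the rolling window of 50, `min_age`, and `min_accuracy` are made-up parameters.

```python
from collections import Counter, deque


class ToyTree:
    """Hypothetical stand-in for an incrementally grown tree.

    It predicts the majority class seen so far and does no splitting.
    """

    def __init__(self):
        self.counts = Counter()
        self.recent_hits = deque(maxlen=50)  # rolling record of right/wrong votes
        self.age = 0                         # number of observations seen

    def predict(self, x):
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Test-then-train: score the tree on the observation before learning it.
        self.recent_hits.append(self.predict(x) == y)
        self.counts[y] += 1
        self.age += 1

    def accuracy(self):
        if not self.recent_hits:
            return 0.0
        return sum(self.recent_hits) / len(self.recent_hits)


class OnlineForest:
    def __init__(self, n_trees=5, min_age=20, min_accuracy=0.5):
        self.trees = [ToyTree() for _ in range(n_trees)]
        self.min_age = min_age            # assumed grace period for young trees
        self.min_accuracy = min_accuracy  # assumed drop threshold
        self.replaced = 0

    def update(self, x, y):
        """Feed one new observation to every tree, dropping poor performers."""
        for i, tree in enumerate(self.trees):
            tree.learn(x, y)
            if tree.age >= self.min_age and tree.accuracy() < self.min_accuracy:
                self.trees[i] = ToyTree()  # replace with a new "ungrown" tree
                self.replaced += 1

    def predict(self, x):
        votes = Counter(t.predict(x) for t in self.trees)
        votes.pop(None, None)  # ungrown trees with no data do not vote
        return votes.most_common(1)[0][0] if votes else None


# A stream whose signal flips from "Yes" to "No" halfway through.
forest = OnlineForest()
for y in ["Yes"] * 100 + ["No"] * 100:
    forest.update(None, y)
```

After the flip, the old trees keep voting "Yes", their rolling accuracy falls below the threshold, and they are replaced by fresh trees that learn the new signal.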
Visualization of a single tree
5, 6 0, 70
Accuracy on test cases: 75%
Pure data: stop splitting
Visualization of a single tree
2, 25
Accuracy on test cases: 55%
0, 70
50 new observations have arrived, so we create another split on the parent node’s left branch
20,3
Tree gets pruned
2, 25
Accuracy on test cases: 55%. Compare this to a random variable, incorporating the age of the tree. The accuracy is too low, so the tree is pruned.
0, 70
20,3
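The slides do not give the exact pruning formula, so the following is only one plausible reading of "compare to a random variable and incorporate the age of the tree". The function `should_prune`, the `mature_age` parameter, and the way error and age are combined are all assumptions.

```python
import random


def should_prune(accuracy, age, mature_age=100, rng=random):
    """Hypothetical pruning rule (the exact formula is not in the slides).

    The prune probability grows with the tree's error rate and is scaled
    down for young trees that have not yet seen much data.
    """
    error = 1.0 - accuracy
    maturity = min(1.0, age / mature_age)  # 0 for a brand-new tree, 1 once mature
    prune_probability = error * maturity
    # Compare against a uniform random draw in [0, 1).
    return rng.random() < prune_probability
```

Under this rule a mature tree at 55% accuracy is pruned with probability 0.45 on each check, while a tree of age 0 is never pruned.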
New Tree
It’s a stump that hasn’t yet split any data. If asked for a classification, it votes the prior probability calculated from the last 100 observations that the old pruned tree saw.
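A minimal sketch of such a stump, assuming the pruned tree hands over the labels it saw. The class name `Stump` and its `vote` method are illustrative.

```python
from collections import Counter, deque


class Stump:
    """An ungrown tree that votes class priors from recent labels."""

    def __init__(self, labels_seen_by_old_tree):
        # maxlen=100 keeps only the most recent 100 labels.
        self.recent = deque(labels_seen_by_old_tree, maxlen=100)

    def vote(self):
        """Return the prior probability of each class, e.g. {"Yes": 0.75, ...}."""
        total = len(self.recent)
        if total == 0:
            return {}
        counts = Counter(self.recent)
        return {label: n / total for label, n in counts.items()}


# The old tree saw 150 labels; only the last 100 (75 "Yes", 25 "No") count.
stump = Stump(["No"] * 50 + ["Yes"] * 75 + ["No"] * 25)
priors = stump.vote()
```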
Online Random Forest
● By dropping trees that predict poorly we can adapt to change in important predictors
● If previous data is relevant to today’s problem, trees learned from it in the past. If it is no longer relevant, this will be reflected in the accuracy and the tree will get pruned
Online Random Forest
● This process of incremental learning and dropping is constantly occurring so we can constantly adapt to a changing signal
● We built our Online Random Forest with Scala’s actor framework
● We distribute our trees’ computations (and physical location), so we can handle high-volume input data streams
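The original system used Scala actors, which are not shown in the slides. As a rough analogue only, the sketch below fans a classification request out to the trees with a Python thread pool and takes a majority vote. The function `majority_vote` and the callable "trees" are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(trees, x, max_workers=4):
    """Ask every tree for its vote concurrently and return the majority class.

    Each tree is modelled as a plain callable; in an actor system each one
    would instead be an actor receiving the observation as a message.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        votes = list(pool.map(lambda tree: tree(x), trees))
    return Counter(votes).most_common(1)[0][0]


# Three toy trees with fixed answers, standing in for real trees.
toy_trees = [lambda x: "Yes", lambda x: "Yes", lambda x: "No"]
result = majority_vote(toy_trees, x=None)
```

Because trees do not share state, each one can score or learn from an observation independently, which is what makes spreading them across workers or machines straightforward.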
Example Stream
Changing Feature Importance