Online random forests in 10 minutes


Traditional Supervised Learning Algorithms

● Regression

● Random Forest

● Support Vector Machines

● Classification and Regression Tree (CART)

● etc.

Inputs

● Data Matrix (Regression)

Predictand   Predictor 1   Predictor 2   Predictor 3   Predictor 4
.56          Red           .456          Male          .589
.78          Green         .654          Female        .6654
.987         Blue          .678          Female        .789
.123         Blue          .999          Male          .543

Inputs

● Data Matrix (Binary Classification)

Predictand   Predictor 1   Predictor 2   Predictor 3   Predictor 4
Yes          Red           .456          Male          .589
No           Green         .654          Female        .6654
Yes          Blue          .678          Female        .789
No           Blue          .999          Male          .543

Inputs To Streaming Classification

● Observations now have an explicit arrival order.

Predictand   Predictor 1   Predictor 2   Predictor 3   Predictor 4   Time
Yes          Red           .456          Male          .589          Jan 1st 2011
No           Green         .654          Female        .6654         Feb 4th 2012
Yes          Blue          .678          Female        .789          Feb 5th 2013
No           Blue          .999          Male          .543          July 4th 2013

Inputs To Streaming Classification

● New observations can arrive at any time

Predictand   Predictor 1   Predictor 2   Predictor 3   Predictor 4   Time
Yes          Red           .456          Male          .589          Jan 1st 2011
No           Green         .654          Female        .6654         Feb 4th 2012
Yes          Blue          .678          Female        .789          Feb 5th 2013
No           Blue          .999          Male          .543          July 4th 2013
Yes          Red           .456          Male          .456          NOW
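One way to picture the streaming input is as a sequence of timestamped observations consumed one at a time. The sketch below is illustrative only: the `Observation` class and the `as_stream` helper are hypothetical names, not part of the implementation described in these slides. The rows are the dated rows from the table above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Observation:
    label: str                      # predictand: "Yes" or "No"
    predictors: Tuple[object, ...]  # mixed categorical and numeric predictors
    arrived: date                   # arrival time of the observation


def as_stream(observations: Iterable[Observation]) -> Iterator[Observation]:
    """Yield observations one at a time, oldest first."""
    yield from sorted(observations, key=lambda o: o.arrived)


# Deliberately listed out of order to show that arrival time drives the order.
rows = [
    Observation("No", ("Blue", 0.999, "Male", 0.543), date(2013, 7, 4)),
    Observation("Yes", ("Red", 0.456, "Male", 0.589), date(2011, 1, 1)),
    Observation("Yes", ("Blue", 0.678, "Female", 0.789), date(2013, 2, 5)),
    Observation("No", ("Green", 0.654, "Female", 0.6654), date(2012, 2, 4)),
]

stream_labels = [o.label for o in as_stream(rows)]
```

A learner that consumes `as_stream(rows)` sees each observation exactly once, in arrival order, which is the setting the rest of the deck assumes.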

Problems

● Do the important predictors change over time and when does this change occur?

● How far back is data relevant to today’s problem?

● What happens when our predictors change again in the future?

● What if this is all happening rapidly… will it scale?

Enter Online Random Forest

● Input is a single new observation

● Trees learn incrementally on this new data

● Trees are dropped from the forest based on performance and replaced by a new “ungrown” tree
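The three points above can be sketched as a small control loop. This is a sketch under stated assumptions, not the authors' implementation: `CountingTree` is a hypothetical stand-in that only counts labels instead of growing splits, the fixed `min_accuracy` threshold is an assumed drop rule, and the per-tree randomisation of a real random forest is left out.

```python
from collections import Counter, deque


class CountingTree:
    """Hypothetical stand-in for an incremental tree: it only tracks label counts."""

    def __init__(self):
        self.counts = Counter()
        self.recent_hits = deque(maxlen=50)  # rolling record of correct votes

    def predict(self, x):
        # Vote for the most frequent label seen so far (None while ungrown).
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

    def accuracy(self):
        if not self.recent_hits:
            return 1.0
        return sum(self.recent_hits) / len(self.recent_hits)


class OnlineForest:
    def __init__(self, n_trees=5, min_accuracy=0.5, min_votes=20):
        self.trees = [CountingTree() for _ in range(n_trees)]
        self.min_accuracy = min_accuracy  # assumed threshold for dropping a tree
        self.min_votes = min_votes        # do not judge a tree on too few votes
        self.replaced = 0

    def predict(self, x):
        votes = Counter(tree.predict(x) for tree in self.trees)
        votes.pop(None, None)  # ungrown trees abstain
        return votes.most_common(1)[0][0] if votes else None

    def update(self, x, y):
        """Process a single new observation."""
        for i, tree in enumerate(self.trees):
            guess = tree.predict(x)
            if guess is not None:
                tree.recent_hits.append(guess == y)  # score the tree first
            tree.learn(x, y)                         # then let it learn
            judged = len(tree.recent_hits) >= self.min_votes
            if judged and tree.accuracy() < self.min_accuracy:
                self.trees[i] = CountingTree()       # replace with an ungrown tree
                self.replaced += 1


forest = OnlineForest()
for label in ["Yes"] * 100 + ["No"] * 100:  # the signal changes halfway through
    forest.update(None, label)
```

In this toy stream every tree is replaced once after the signal flips, and the forest then votes for the new majority label. Scoring each tree on an observation before it learns from that observation keeps the accuracy estimate honest.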

Visualization of a single tree

Leaf class counts: (5, 6) and (0, 70)

Accuracy on test cases: 75%

The (0, 70) leaf holds pure data, so it stops splitting
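The stopping rule on this slide can be sketched with an impurity measure. The sketch below assumes Gini impurity and a hypothetical helper `should_stop_splitting`; the slides do not say which purity measure the implementation uses.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class observation counts."""
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty node has nothing to split
    return 1.0 - sum((n / total) ** 2 for n in counts)


def should_stop_splitting(counts):
    """Hypothetical rule: a node whose data is pure is not split further."""
    return gini(counts) == 0.0


mixed_leaf = (5, 6)   # class counts shown on the slide
pure_leaf = (0, 70)   # only one class present
```

With these counts, `should_stop_splitting(pure_leaf)` is true, while the `(5, 6)` leaf still mixes both classes and remains a candidate for a future split.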

Visualization of a single tree

Node class counts: (2, 25), (0, 70), and (20, 3)

50 new observations have arrived, and we create another split off the parent node’s left branch

Accuracy on test cases: 55%

Tree gets pruned

Node class counts: (2, 25), (0, 70), and (20, 3)

Accuracy on test cases: 55%. This accuracy is compared to a random variable, taking the age of the tree into account. The accuracy is too low, so the tree is pruned.
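The slide does not give the exact pruning formula, so the following is only one plausible reading of "compare to a random variable and incorporate the age of the tree". The function `should_prune`, the `full_age` parameter, and the way error and age are combined are all assumptions made for illustration.

```python
import random


def should_prune(accuracy, age, rng, full_age=100):
    """Hypothetical rule: older, less accurate trees are more likely to be dropped.

    accuracy -- fraction of recent test cases the tree classified correctly
    age      -- number of observations the tree has seen
    full_age -- assumed age at which a tree stops being protected
    """
    error = 1.0 - accuracy
    age_weight = min(1.0, age / full_age)  # young trees are mostly protected
    # Draw a uniform random number and prune when it falls below the weighted error.
    return rng.random() < error * age_weight


rng = random.Random(42)  # seeded only to make the sketch repeatable
decision = should_prune(accuracy=0.55, age=100, rng=rng)
```

Under this assumed rule a brand-new tree (age 0) is never pruned, and a tree with perfect accuracy is never pruned regardless of age. A mature tree at 55% accuracy is dropped on roughly 45% of checks.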

New Tree

It’s a stump that hasn’t yet split any data. If asked for a classification, it votes the prior probability calculated from the last 100 observations that the old pruned tree saw
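A minimal sketch of how such a stump could vote, assuming the old tree kept its most recent labels in a fixed-size window. The `Stump` class and the synthetic label stream are hypothetical; only the window size of 100 comes from the slide.

```python
from collections import Counter, deque


class Stump:
    """An ungrown tree that votes the class prior inherited from the pruned tree."""

    def __init__(self, recent_labels):
        self.prior = Counter(recent_labels)

    def vote(self):
        total = sum(self.prior.values())
        if total == 0:
            return {}  # nothing inherited, so no opinion
        return {label: n / total for label, n in self.prior.items()}


# The old tree remembers only the last 100 labels it saw.
window = deque(maxlen=100)
for i in range(250):  # synthetic stream: every fourth label is "Yes"
    window.append("Yes" if i % 4 == 0 else "No")

stump = Stump(window)
prior_vote = stump.vote()
```

Using `deque(maxlen=100)` means older labels fall out automatically, so the prior reflects only the recent signal rather than the tree's whole history.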

Online Random Forest

● By dropping trees that predict poorly we can adapt to change in important predictors

● If previous data is relevant to today’s problem, the trees have already learned from it. If it is no longer relevant, that will be reflected in the accuracy and the tree will get pruned

Online Random Forest

● This process of incremental learning and dropping is constantly occurring so we can constantly adapt to a changing signal

● We built our Online Random Forest with Scala’s actor framework

● We distribute our trees’ computations (and physical location), so we can handle high-volume input data streams
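The actor idea can be illustrated in Python with one thread and one mailbox per tree. This is an analogy only: the implementation described here uses Scala actors, and the `TreeActor` class below is a hypothetical stand-in whose "tree" just counts labels. It runs on a single machine and does not show distribution across hosts.

```python
import queue
import threading
from collections import Counter


class TreeActor(threading.Thread):
    """A tree that owns its state and learns only from messages in its mailbox."""

    def __init__(self):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.counts = Counter()  # stand-in for real tree state

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:  # sentinel: stop processing
                break
            _predictors, label = message
            self.counts[label] += 1


actors = [TreeActor() for _ in range(4)]
for actor in actors:
    actor.start()

observations = [
    (("Red", 0.456), "Yes"),
    (("Green", 0.654), "No"),
    (("Blue", 0.678), "Yes"),
]
for observation in observations:
    for actor in actors:
        actor.mailbox.put(observation)  # broadcast without waiting for the tree

for actor in actors:
    actor.mailbox.put(None)
for actor in actors:
    actor.join()
```

Because each tree touches only its own state and its own mailbox, trees never block each other, which is the property that lets the work be spread across many workers.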

Example Stream

Changing Feature Importance

Technology
