Stochastic Gradient Boosting: An Introduction to TreeNet™
Salford Systems, http://www.salford-systems.com, [email protected]
Mikhail Golovnya, Dan Steinberg, Scott Cardell

Introduction to TreeNet (2004)



TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. TreeNet has been responsible for the majority of Salford’s modeling competition awards. TreeNet demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error–correcting process to converge to an accurate model.


Page 1: Introduction to TreeNet (2004)

Stochastic Gradient Boosting

An Introduction to TreeNet™

Salford Systems, http://www.salford-systems.com

[email protected]

Mikhail Golovnya, Dan Steinberg, Scott Cardell

Page 2: Introduction to TreeNet (2004)

New approach to machine learning/function approximation

◦ Developed by Jerome H. Friedman at Stanford University

Co-author of CART® with Breiman, Olshen and Stone; author of MARS™, PRIM, Projection Pursuit

Good for classification and regression problems

Builds on the notions of committees of experts and boosting but is substantially different in implementation details

Introduction to Stochastic Gradient Boosting

Page 3: Introduction to TreeNet (2004)

Stagewise function approximation in which each stage models the residuals from the previous stage's model

◦ Conventional boosting models the original target at each stage

Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes

◦ Conventional bagging and boosting use full-size and even massively large trees

Each stage learns from a fraction of the available training data: typically less than 50% to start, falling to 20% or less by the last stage

Each stage learns only a little: the contribution of each new tree is severely down-weighted (the learning rate is typically 0.10 or less)

Focus in classification is on points near the decision boundary; points far away from the boundary are ignored even if they are on the wrong side of it

Stochastic Gradient Boosting: Key Innovations

Page 4: Introduction to TreeNet (2004)

Built on CART trees and thus:

◦ Immune to outliers

◦ Handles missing values automatically

◦ Selects variables

◦ Results invariant wrt monotone transformations of variables

Trains very rapidly: many small trees do not take much longer to run than one large tree

Resistant to overtraining - generalizes very well

Can be remarkably accurate with little effort

BUT resulting model may be very complex

Reported Benefits of TreeNet™

Page 5: Introduction to TreeNet (2004)

An intuitive introduction

TreeNet Mathematical Basics

◦ Specifications of the TreeNet model as a series expansion

◦ Non-parametric approach to steepest descent optimization

TreeNet at work

◦ Small trees, learning rates, sub-sample fractions, regression types

◦ Reading the output: reports and diagnostics

Comparing to AdaBoost and other methods

Overview of TreeNet Tutorial

Page 6: Introduction to TreeNet (2004)

Consider the basic problem of estimating continuous outcome y based on a vector of predictors X

Running a step-wise multiple linear regression will produce an estimate f1(X) and the associated residuals r1 = y - f1(X)

A simple intuitive idea: run a second-stage regression model to produce an estimate of the residuals f2(X) and the associated updated residuals r2 = y - f1 - f2

Repeating this process multiple times results in the following series expansion: y = f1 + f2 + f3 + …

Recursive Regression

Page 7: Introduction to TreeNet (2004)

The above idea can be easily implemented

Unfortunately, the direct implementation suffers from overfitting

The residuals from the previous model essentially communicate information about where that model fails the most; hence, the next-stage model effectively tries to improve the previous model where it failed

This is generally known as boosting

We may want to replace individual regressions with something simpler- regression trees, for example

It is not yet clear whether this simple idea actually works, nor is it clear how to generalize it to various types of loss functions or to classification

Notes
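The sketch below illustrates this recursive-residual idea using small regression trees as the stage models, as the slide suggests. It is a minimal Python illustration using scikit-learn on made-up data, not Salford's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(1000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=1000)   # noisy toy target

# Stage-wise fitting: each small tree models the residuals left over by the
# previous stages, and the final model is the sum of all stages.
stages, residual = [], y.copy()
for _ in range(200):
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, residual)
    stages.append(tree)
    residual -= tree.predict(X)          # r_k = y - (f_1 + ... + f_k)

prediction = sum(t.predict(X) for t in stages)
print("train MSE:", round(float(np.mean((y - prediction) ** 2)), 4))
```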

Page 8: Introduction to TreeNet (2004)

For any given set of inputs X we want to predict some outcome y

Thus we want to construct a “nice” function f(X) which in turn can be used to express an estimate of y

We need to define how “nice” can be measured

Predictive Modeling

Page 9: Introduction to TreeNet (2004)

In regression, when y is continuous, the easiest approach is to assume that f(X) itself is the estimate of y

We may then define the loss function as the loss incurred when y is estimated by f(X)

For example, the least squares (LS) loss is defined as L[y, f] = (y - f(X))^2

Formally, a “nicely” defined f(X) will have the smallest expected loss (over the entire population) within the boundaries of its construction (for example, in multiple linear regression, f(X) belongs to the class of linear functions)

Loss for Regression

Page 10: Introduction to TreeNet (2004)

In reality, we have a set of N observed pairs (x,y) from the population, not the entire population

Hence, the expected loss can be replaced with the sample estimate R = (1/N) Σ L[yi, fi]

Here fi = f(xi)

The problem thus reduces to finding a function f(X) that minimizes R

Unfortunately, classification will demand additional treatment

Practical Estimate

Page 11: Introduction to TreeNet (2004)

Consider binary classification and assume that y is coded as +1 or -1

The most detailed solution would then give us the associated probabilities p(y)

Since probabilities are naturally constrained to the [0,1] interval, we assume that the function f(X) is transformed

p(y)=1/(1+exp(-2fy))

Note that p(+1)+p(-1)=1

The “trick” here is finding an unconstrained estimate f instead of constrained estimate p

Also note that f is simply half the log-odds of y=+1

Classification
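A short Python sketch of this transform (illustrative only; the symbols follow the slide):

```python
import numpy as np

def prob(f, y):
    """p(y) = 1 / (1 + exp(-2 f y)) for y in {-1, +1}, as on the slide."""
    return 1.0 / (1.0 + np.exp(-2.0 * f * y))

def half_log_odds(p_plus):
    """Invert the transform: f = 0.5 * log(p(+1) / p(-1))."""
    return 0.5 * np.log(p_plus / (1.0 - p_plus))

f = 0.8
print(prob(f, +1) + prob(f, -1))   # p(+1) + p(-1) is always 1
print(half_log_odds(prob(f, +1)))  # recovers f = 0.8
```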

Page 12: Introduction to TreeNet (2004)

(insert graph)

This graph shows the one-to-one correspondence between f and p for y=+1

Note that the most significant probability change occurs when f is between -3 and +3

S-transform

Page 13: Introduction to TreeNet (2004)

Again, the main question is what “nice” f means given that we observed N pairs (x,y) from the population

Approaching this problem from the maximum likelihood point of view, one may show that the negative log-likelihood in this case becomes

R = Σ log[1 + exp(-2 yi fi)]

The problem once again reduces to finding f that minimizes R above

We could obtain the same result formally by introducing a special loss function for classification: L[y, f] = log[1 + exp(-2yf)]

The above likelihood considerations show a “natural” way to arrive at such a peculiar loss function

Through Likelihood to Loss

Page 14: Introduction to TreeNet (2004)

Other approaches to defining the loss functions for binary classification are possible

For example, by throwing away the log term in the previous equation one would arrive at the following loss: L = exp(-2yf)

It is possible to show that this loss function is effectively used in the “classical” AdaBoost algorithm

AdaBoost can be considered a predecessor of gradient boosting; we defer the comparison until later

Other Loss Functions

Page 15: Introduction to TreeNet (2004)

To summarize, we are looking for a function f(X) that minimizes the estimate of loss

The typical loss functions are

(insert equations)

Summary of Losses
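As an illustration, the three losses used throughout the deck can be written as follows; the function names are ours, not TreeNet's.

```python
import numpy as np

# Per-observation losses used in this deck: y is continuous for LS/LAD,
# and coded as -1/+1 for the binary classification loss.
def ls_loss(y, f):         # least squares
    return (y - f) ** 2

def lad_loss(y, f):        # least absolute deviation
    return np.abs(y - f)

def logistic_loss(y, f):   # negative log-likelihood for y in {-1, +1}
    return np.log(1.0 + np.exp(-2.0 * y * f))

def risk(loss, y, f):      # R: the sample estimate of the expected loss
    return np.mean(loss(y, f))
```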

Page 16: Introduction to TreeNet (2004)

The function f(X) is introduced as a known function of a fixed set of unknown parameters

The problem then reduces to finding a set of optimal parameter estimates using non-linear optimization techniques

Multiple linear regression and logistic regression: f(X) is a linear combination of fixed predictors, the parameters being the intercept term and the slope coefficients

Major problem: the function and predictors need to be specified beforehand; this usually results in a lengthy trial-and-error process

Parametric Approach

Page 17: Introduction to TreeNet (2004)

Construct f(X) using stage-wise approach

Start with a constant, then at each stage adjust the values of f(X) in various regions of the data

It is important to keep the adjustment rate low- the resulting model will become smoother and usually less subject to overfitting

Note that we are effectively treating the values fi = f(xi) at all individual observed data points as separate parameters

Non-parametric Approach

Page 18: Introduction to TreeNet (2004)

More specifically, assume that we have gone through k-1 stages and obtained the current version fk-1(X)

We want to construct an updated version fk(X) resulting in a smaller value of R

Treating the individual values fi = f(xi) as parameters, we proceed by computing the anti-gradient components gk,i = -dR/dfi

The individual components mark the “directions” in which the individual values of fk-1 must be changed to obtain a smaller R

To induce smoothness, let us limit our “freedom” by allowing only M (a small number, say between 2 and 10) distinct constant adjustments at any given stage

The Gradient Comes in

Page 19: Introduction to TreeNet (2004)

The optimal strategy is then to group the individual components gk,i into M mutually exclusive groups, such that the variance within each group is minimized

But this is equivalent to growing a fixed-size (M terminal nodes) regression tree using gk,i as the target

Suppose we found M subsets (insert equation) of cases (insert equation)

The constant adjustments ak,j are computed to minimize (insert equation)

Finally the updated f(X) is (insert equation)

The Regression Tree Comes In

Page 20: Introduction to TreeNet (2004)

For a given loss function L[y, f(X)], M, and MaxTrees:

◦ Make an initial guess f(X) = f0

◦ For k=0 to MaxTrees-1

◦ Compute the anti-gradient gk by taking the derivative of the loss with respect to f(X) and substituting y and the current fk(X)

◦ Fit an M-node regression tree to the components of the negative gradient - this will partition the observations into M mutually exclusive groups

◦ Find the within-node updates ak,j by performing M univariate optimizations of the node contributions to the estimated loss

◦ Do the update (insert equation)

◦ End for

Generic Algorithm
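A minimal Python sketch of this generic loop, using scikit-learn's DecisionTreeRegressor as the M-node tree. It returns only the fitted values on the learn data and omits scoring of new data, shrinkage, and sampling, which are introduced later in the deck; treat it as an illustration, not TreeNet's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def generic_boost(X, y, neg_gradient, node_update, f0, M=6, max_trees=500):
    """Sketch of the generic loop on this slide (fitted values only).
    neg_gradient(y, f)         -> anti-gradient components g_k
    node_update(y, f, in_node) -> constant adjustment minimizing loss in a node
    f0(y)                      -> initial constant guess
    """
    f = np.full(len(y), f0(y), dtype=float)
    for _ in range(max_trees):
        g = neg_gradient(y, f)                                    # anti-gradient
        tree = DecisionTreeRegressor(max_leaf_nodes=M).fit(X, g)  # M-node tree
        leaves = tree.apply(X)                                    # node membership
        for node in np.unique(leaves):
            in_node = leaves == node
            f[in_node] += node_update(y, f, in_node)              # per-node update
    return f
```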

Page 21: Introduction to TreeNet (2004)

For L[y, f(X)] = (y - f(X))^2, M, and MaxTrees

Initial guess f(X) = f0 = mean(y)

For k=0 to MaxTrees-1

The anti-gradient component is gk,i = yi - f(xi), which is the traditional definition of the current residual

Fit an M-node regression tree to the current residuals - this will partition the observations into M mutually exclusive groups

The within-node updates ak,j simply become the node averages of the current residuals

Do the update: (insert equation) End for

LS Loss

Page 22: Introduction to TreeNet (2004)

For L[y, f(X)] = |y - f(X)|, M, and MaxTrees

Initial guess f(X) = f0 = median(y)

For k=0 to MaxTrees-1

The anti-gradient component is gk,i = sign(yi - f(xi)), which is the sign of the current residual

Fit an M-node regression tree to the sign of the current residuals - this will partition the observations into M mutually exclusive groups

The within-node updates ak,j now become the node medians of the current residuals

Do the update (insert equation) End for

LAD Loss
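Assuming the generic_boost sketch above, the LS and LAD specializations differ only in the initial guess, the anti-gradient, and the node update (illustrative only):

```python
import numpy as np

# LS loss: residuals as the target, node means as updates, f0 = mean(y)
ls = dict(
    f0=np.mean,
    neg_gradient=lambda y, f: y - f,
    node_update=lambda y, f, m: np.mean(y[m] - f[m]),
)

# LAD loss: sign of residuals as the target, node medians as updates, f0 = median(y)
lad = dict(
    f0=np.median,
    neg_gradient=lambda y, f: np.sign(y - f),
    node_update=lambda y, f, m: np.median(y[m] - f[m]),
)

# Usage (with X, y defined): f_hat = generic_boost(X, y, **ls)
```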

Page 23: Introduction to TreeNet (2004)

For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees

Initial guess f(X) = f0 = half the log-odds of y=+1

For k=0 to MaxTrees-1

Recall that the anti-gradient component is gk,i = 2yi / (1 + exp(2yi fi)); we call these generalized residuals

Fit an M-node regression tree to the generalized residuals - this will partition the observations into M mutually exclusive groups

The within-node updates ak,j are somewhat more complicated (insert equation), where all measures are taken with respect to the node and the variance (insert equation)

Do the update (insert equation) End for

Binary Classification
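A sketch of these ingredients in Python, following the two-class logistic formulas from Friedman's gradient boosting paper; TreeNet's exact node update is not documented here, so treat this as an approximation that plugs into the generic_boost sketch above.

```python
import numpy as np

def logit_f0(y):
    """Initial guess: half the log-odds of y = +1 (y coded as -1/+1)."""
    p = np.mean(y == 1)
    return 0.5 * np.log(p / (1.0 - p))

def generalized_residual(y, f):
    """Anti-gradient of log(1 + exp(-2 y f)) with respect to f."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

def logit_node_update(y, f, m):
    """Single Newton step for the node constant (Friedman's two-class formula)."""
    r = generalized_residual(y[m], f[m])
    return np.sum(r) / np.sum(np.abs(r) * (2.0 - np.abs(r)))

# Usage (with X and y in {-1, +1}):
# f_hat = generic_boost(X, y, generalized_residual, logit_node_update, logit_f0)
```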

Page 24: Introduction to TreeNet (2004)

Consider the following simple dataset with a single predictor X and 1000 observations

Here and in the following slides negative response observations are marked in blue whereas positive response observations are marked in red

The general tendency is to have positive response in the middle of the range of X

(insert table)

Example: Binary Data

Page 25: Introduction to TreeNet (2004)

The dataset was generated using the following model described by f(X) and the corresponding p(X) for y=+1

(insert graphs)

Example: Probability Model

Page 26: Introduction to TreeNet (2004)

(insert graph)

TreeNet fits constant probability 0.55

The residuals are positive for y=+1 and negative for y=-1

Initial Stage

Page 27: Introduction to TreeNet (2004)

(insert graph)

The dataset was partitioned into 3 regions: low X (negative adjustment), middle X (positive), and large X (negative)

The residuals “reflect” the directions of the adjustments

Next Stage

Page 28: Introduction to TreeNet (2004)

(insert graph)

This graph shows the predicted f(X) after 1000 iterations with a very small learning rate of 0.002

Note how the true shape was nearly perfectly recovered

Many Stages

Page 29: Introduction to TreeNet (2004)

The purpose of running a regression tree is to group observations into homogeneous subsets

Once we have the right partition, the adjustments for each terminal node are computed separately to optimize the given loss function; these are generally different from the predictions generated by the regression tree itself (they are the same only for the LS loss)

Thus, the procedure is no longer as simple as the initial intuitive recursive regression approach we started with

Nonetheless, the tree is used to define the actual form of f(X) over the range of X and not only for the individual data points observed

This becomes important in the final model deployment and scoring

A Note on Mechanics

Page 30: Introduction to TreeNet (2004)

Up to this point we guarded against overfitting only by allowing a small number of adjustments at each stage

We can further guard against overfitting by forcing the adjustments to be smaller

This is done by introducing a new parameter called “shrinkage” (learning rate) that is set to a constant value between 0 and 1

Small learning rates result in smoother models: a rate of 0.1 means that TreeNet will take roughly 10 times more iterations to extract the same signal; more variables will be tried, finer partitions will result, and smaller boundary jumps will take place

Ideally, one might ultimately want to keep the learning rate close to zero and the number of stages (trees) close to infinity

However, rates below 0.001 usually become impractical

Shrinkage
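For illustration, the same trade-off can be seen with scikit-learn's GradientBoostingRegressor, an analogous open-source implementation rather than TreeNet; the toy data and settings below are ours.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=2000)

# No shrinkage: each stage's adjustment is applied in full (rough fit).
rough = GradientBoostingRegressor(n_estimators=100, learning_rate=1.0,
                                  max_leaf_nodes=3).fit(X, y)

# Heavy shrinkage: 10x smaller steps, so roughly 10x more stages are needed
# to extract the same signal, yielding a smoother fit.
smooth = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                   max_leaf_nodes=3).fit(X, y)
```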

Page 31: Introduction to TreeNet (2004)

(insert graph)

This graph shows the predicted f(X) after 100 iterations with a learning rate of 1

Note the roughness of the shape and the presence of abrupt strong jumps

100 stages, No shrinkage

Page 32: Introduction to TreeNet (2004)

(insert graph)

This graph shows predicted f(X) after 1000 iterations and a very small learning rate of 0.0002

Note how the true shape was nearly perfectly recovered

It may be further improved

1000 Stages, Substantial Shrinkage

Page 33: Introduction to TreeNet (2004)

At each stage, instead of working with the entire learn dataset, consider taking a random sample of a fixed size

Typical sampling rates are set to 50% of the learn data (the default) and even smaller for very large datasets

In the long run, the entire learn dataset is exploited, but the running time is reduced by a factor of two with the 50% sampling rate

Sampling forces TreeNet to “rethink” optimal partition points from run to run due to random fluctuations of the residuals

This, combined with shrinkage and a large number of iterations, results in an overall improvement of the captured signal shape

Sampling Rate
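In scikit-learn's analogous implementation the same idea is exposed through the subsample parameter (a sketch of the concept, not TreeNet's interface):

```python
from sklearn.ensemble import GradientBoostingRegressor

# subsample < 1.0 makes each stage fit a random fraction of the learn data
# (stochastic gradient boosting); 0.5 mirrors the 50% rate on this slide.
model = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.1,
                                  subsample=0.5, max_leaf_nodes=6)
# model.fit(X, y)   # X, y as in the earlier sketches
```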

Page 34: Introduction to TreeNet (2004)

(insert graph)

This graph shows predicted f(X) after 1000 stages, learning rate of 0.002, and 50% sampling

Note the minor fluctuations in the average loss

The resulting model is nice and smooth but there is still room for improvement

1000 Stages, Shrinkage and Sampling

Page 35: Introduction to TreeNet (2004)

(insert graph)

All previous runs allowed as few as 10 cases per individual region/node (the default)

Here we have increased this limit to 50

This immediately resulted in an even smoother shape

In practice, various node size limits should be tried

Limiting Node Size

Page 36: Introduction to TreeNet (2004)

In classification problems, it is possible to further reduce the amount of data processed at each stage

We ignore data points “too far” from the decision boundary to be usefully considered

◦ Well-classified points are ignored (just like conventional boosting)

◦ Badly misclassified data points are also ignored (very different from conventional boosting)

◦ The focus is on the cases most difficult to classify correctly: those near the decision boundary

Ignoring data far from the decision boundary in classification problems
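One way to implement this idea follows the influence-trimming rule from Friedman's stochastic gradient boosting paper; this is a sketch, and TreeNet's internal rule may differ.

```python
import numpy as np

def keep_near_boundary(y, f, alpha=0.9):
    """Return indices of the cases to use at this stage.

    The influence weight |r| * (2 - |r|), with r the generalized residual,
    is near zero both for well-classified points (r close to 0) and for
    badly misclassified points (|r| close to 2), so both groups are dropped
    while the points near the decision boundary are kept.
    """
    r = 2.0 * y / (1.0 + np.exp(2.0 * y * f))
    w = np.abs(r) * (2.0 - np.abs(r))
    order = np.argsort(-w)                     # largest influence first
    cum = np.cumsum(w[order]) / np.sum(w)      # cumulative share of influence
    n_keep = int(np.searchsorted(cum, alpha)) + 1
    return order[:n_keep]
```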

Page 37: Introduction to TreeNet (2004)

(insert graph)

2-dimensional predictor space

Red dots represent cases with +1 target

Green dots represent cases with -1 target

Black curve represents the decision boundary

Decision Boundary Diagram

Page 38: Introduction to TreeNet (2004)

The remaining slides present TreeNet runs on real data as well as give examples of GUI controls

We start with the Boston Housing dataset to illustrate regression

Then we proceed with the Cell Phone dataset to illustrate classification

Real Data Runs

Page 39: Introduction to TreeNet (2004)

(insert graph)

A Simple TreeNet Run

Page 40: Introduction to TreeNet (2004)

(insert graph)

Scatter Plot

Page 41: Introduction to TreeNet (2004)

(insert graph)

Essentially a regression tree with 2 terminal nodes

Predicted Response

Page 42: Introduction to TreeNet (2004)

(insert table)

CART run with TARGET=MV

PREDICTORS= LSTAT

LIMIT DEPTH= 1

Save residuals as RESI

Equivalent CART Model