TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. TreeNet has been responsible for the majority of Salford’s modeling competition awards. TreeNet demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error–correcting process to converge to an accurate model.
Stochastic Gradient Boosting
An Introduction to TreeNet™
Salford Systems
http://www.salford-systems.com
[email protected]
Mikhail Golovnya, Dan Steinberg, Scott Cardell
New approach to machine learning/function approximation developed by Jerome H. Friedman at Stanford University
◦ Co-author of CART® with Breiman, Olshen and Stone
◦ Author of MARS™, PRIM, Projection Pursuit
Good for classification and regression problems
Builds on the notions of committees of experts and boosting but is substantially different in implementation details
Introduction to Stochastic Gradient Boosting
Stagewise function approximation in which each stage models the residuals from the last step's model
◦ Conventional boosting models the original target at each stage
Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes
◦ Conventional bagging and boosting use full-size trees and even massively large trees
Each stage learns from a fraction of the available training data: typically less than 50% to start, falling to 20% or less by the last stage
Each stage learns only a little: the contribution of each new tree is severely down-weighted (the learning rate is typically 0.10 or less)
Focus in classification is on points near the decision boundary; points far from the boundary are ignored even if they are on the wrong side of it
Stochastic Gradient Boosting: Key Innovations
Built on CART trees and thus
◦ Immune to outliers
◦ Handles missing values automatically
◦ Selects variables
◦ Results invariant wrt monotone transformations of variables
Trains very rapidly: many small trees do not take much longer to build than one large tree
Resistant to overtraining: generalizes very well
Can be remarkably accurate with little effort
BUT resulting model may be very complex
Reported Benefits of TreeNet™
An intuitive introduction
TreeNet Mathematical Basics
◦ Specifications of the TreeNet model as a series expansion
◦ Non-parametric approach to steepest descent optimization
TreeNet at work
◦ Small trees, learning rates, sub-sample fractions, regression types
◦ Reading the output: reports and diagnostics
Comparing to AdaBoost and other methods
Overview of TreeNet Tutorial
Consider the basic problem of estimating continuous outcome y based on a vector of predictors X
Running a step-wise multiple linear regression will produce an estimate f1(X) and associated residuals r1 = y - f1(X)
A simple intuitive idea: run a second-stage regression model to produce an estimate of the residuals f2(X) and the associated updated residuals r2 = y - (f1 + f2)
Repeating this process multiple times results in the following series expansion: y = f1 + f2 + f3 + …
Recursive Regression
The above idea can be easily implemented
Unfortunately, the direct implementation suffers from overfitting
The residuals from the previous model essentially communicate information about where that model fails the most; hence, the next-stage model effectively tries to improve the previous model where it failed
This is generally known as boosting
We may want to replace individual regressions with something simpler- regression trees, for example
It is not yet known whether this simple idea actually works, nor is it clear how to generalize it to various types of loss functions or to classification
Notes
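The recursive-regression idea above can be written out in a few lines. The sketch below is an illustrative reconstruction, not Salford's code (all names are mine): each stage fits a two-node regression "tree" (a stump, with the split chosen to minimize squared error) to the current residuals and adds its predictions to the running model.

```python
import numpy as np

def fit_stump(x, r):
    """Two-node regression tree: choose the split on x that minimizes
    the within-node sum of squared errors; predict node means."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((r[left] - r[left].mean()) ** 2).sum() \
            + ((r[~left] - r[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, r[left].mean(), r[~left].mean())
    _, split, lval, rval = best
    return split, lval, rval

def predict_stump(stump, x):
    split, lval, rval = stump
    return np.where(x <= split, lval, rval)

def recursive_regression(x, y, n_stages=20):
    """Approximate y by f1 + f2 + ...; each stage models the residuals
    left over by the previous stages."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_stages):
        r = y - pred                 # residuals from the current model
        stump = fit_stump(x, r)      # next stage models the residuals
        pred += predict_stump(stump, x)
    return pred
```

With a nonlinear target, the residual sum of squares falls steadily as stages are added, which is exactly the series expansion y ≈ f1 + f2 + … described above.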
For any given set of inputs X we want to predict some outcome y
Thus we want to construct a “nice” function f(X) which in turn can be used to express an estimate of y
We need to define how “nice” can be measured
Predictive Modeling
In regression, when y is continuous, the simplest choice is to take f(X) itself as the estimate of y
We may then define a loss function L[y, f(X)] as the loss incurred when y is estimated by f(X)
For example, the least squares (LS) loss is defined as L[y, f(X)] = (y - f(X))^2
Formally, a “nicely” defined f(X) will have the smallest expected loss (over the entire population) within the boundaries of its construction (for example, in multiple linear regression, f(X) belongs to the class of linear functions)
Loss for Regression
In reality, we have a set of N observed pairs (x,y) from the population, not the entire population
Hence, the expected loss E L[y, f(X)] can be replaced with its sample estimate R = (1/N) Σ L[yi, fi]
Here fi = f(xi)
The problem thus reduces to finding a function f(X) that minimizes R
Unfortunately, classification will demand additional treatment
Practical Estimate
Consider binary classification and assume that y is coded as +1 or -1
The most detailed solution would then give us the associated probabilities p(y)
Since probabilities are naturally constrained to the [0,1] interval, we assume that the function f(X) is transformed
p(y)=1/(1+exp(-2fy))
Note that p(+1)+p(-1)=1
The “trick” here is finding an unconstrained estimate f instead of constrained estimate p
Also note that f is simply half the log-odds of y = +1
Classification
(insert graph)
This graph shows the one-to-one correspondence between f and p for y=+1
Note that the most significant probability change occurs when f is between -3 and +3
S-transform
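The transform and its inverse can be stated directly; a small sketch (function names are mine) of the one-to-one correspondence between f and p described above:

```python
import math

def prob_from_f(f):
    """p(y=+1) given the unconstrained score f (f is half the log-odds)."""
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def f_from_prob(p):
    """Inverse transform: half the log-odds of y = +1."""
    return 0.5 * math.log(p / (1.0 - p))
```

Note that p(+1) + p(-1) = 1 follows because prob_from_f(f) + prob_from_f(-f) = 1, and almost all of the probability movement indeed happens for f between -3 and +3.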
Again, the main question is what “nice” f means given that we observed N pairs (x,y) from the population
Approaching this problem from the maximum likelihood point of view, one may show that the negative log-likelihood in this case becomes
R = (1/N) Σ log[1 + exp(-2 yi fi)]
The problem once again reduces to finding f that minimizes R above
We could obtain the same result formally by introducing a special loss function for classification: L[y, f] = log[1 + exp(-2yf)]
The above likelihood considerations show a “natural” way to arrive at such a peculiar loss function
Through Likelihood to Loss
Other approaches to defining the loss functions for binary classification are possible
For example, by throwing away the log term in the previous equation one arrives at the following loss: L = exp(-2yf)
It is possible to show that this loss function is effectively used in the “classical” AdaBoost algorithm
AdaBoost can be considered a predecessor of gradient boosting; we will defer the comparison until later
Other Loss Functions
To summarize we are looking for a function f(X) that minimizes the estimate of loss
The typical loss functions are
◦ Least squares (LS): L = (y - f)^2
◦ Least absolute deviation (LAD): L = |y - f|
◦ Logistic (binary classification): L = log[1 + exp(-2yf)]
◦ Exponential (AdaBoost): L = exp(-2yf)
Summary of Losses
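The losses discussed in this deck, written out as code (the naming is mine; y is coded -1/+1 for the classification losses):

```python
import numpy as np

def ls_loss(y, f):            # least squares (regression)
    return (y - f) ** 2

def lad_loss(y, f):           # least absolute deviation (regression)
    return np.abs(y - f)

def logistic_loss(y, f):      # binary classification, y in {-1, +1}
    return np.log(1.0 + np.exp(-2.0 * y * f))

def exp_loss(y, f):           # AdaBoost's exponential loss
    return np.exp(-2.0 * y * f)
```

All four agree on the direction of improvement: moving f toward y (or toward the correct sign of y) reduces the loss.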
The function f(X) is introduced as a known function of a fixed set of unknown parameters
The problem then reduces to finding a set of optimal parameter estimates using non-linear optimization techniques
Multiple linear regression and logistic regression: f(X) is a linear combination of fixed predictors, the parameters being the intercept term and the slope coefficients
Major problem: the function and predictors need to be specified beforehand; this usually results in a lengthy trial-and-error process
Parametric Approach
Construct f(X) using stage-wise approach
Start with a constant, then at each stage adjust the values of f(X) in various regions of data
It is important to keep the adjustment rate low; the resulting model will become smoother and usually less subject to overfitting
Note that we are effectively treating the values fi = f(xi) at all individual observed data points as separate parameters
Non-parametric Approach
More specifically, assume that we have gone through k-1 stages and obtained the current version fk-1(X)
We want to construct an updated version fk(X) resulting in a smaller value of R
Treating the individual values fi = f(xi) as parameters, we proceed by computing the anti-gradient gi = -∂R/∂fi
The individual components mark the “directions” in which the individual values of fk-1 must be changed to obtain a smaller R
To induce smoothness, let's limit our “freedom” by allowing only M (a small number, say between 2 and 10) distinct constant adjustments at any given stage
The Gradient Comes in
The optimal strategy is then to group the individual components gi into M mutually exclusive groups such that the variance within each group is minimized
But this is equivalent to growing a fixed-size (M terminal nodes) regression tree using gi as the target
Suppose we found M subsets S1, …, SM of cases
The constant adjustments akj (one per node) are computed to minimize the estimated loss R within each node
Finally, the updated model is fk(X) = fk-1(X) + Σj akj I(X ∈ Sj)
The Regression Tree Comes In
For the given loss function L[y, f(X)], node count M, and MaxTrees
◦ Make an initial guess f(X) = f0 (a constant)
◦ For k = 0 to MaxTrees-1
◦ Compute the anti-gradient gk by taking the derivative of the loss with respect to f(X) and substituting y and the current fk(X)
◦ Fit an M-node regression tree to the components of the negative gradient; this will partition observations into M mutually exclusive groups
◦ Find the within-node updates akj by performing M univariate optimizations of the node contributions to the estimated loss
◦ Do the update fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
◦ End for
Generic Algorithm
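The generic loop translates almost line for line into code. The sketch below is my reconstruction, not TreeNet itself: the loss enters through two callbacks (one for the anti-gradient, one for the within-node update), and the M-node tree is approximated by a best-split stump (M = 2) for brevity.

```python
import numpy as np

def best_split(x, g):
    """Fit a 2-node regression tree to the anti-gradient components g;
    return the boolean mask of the left node."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((g[left] - g[left].mean()) ** 2).sum() \
            + ((g[~left] - g[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, left)
    return best[1]

def boost(x, y, anti_gradient, node_update, max_trees=50):
    f = np.zeros_like(y, dtype=float)   # initial guess f0 = 0
    for _ in range(max_trees):
        g = anti_gradient(y, f)         # anti-gradient at the current f
        left = best_split(x, g)         # partition into M = 2 groups
        for mask in (left, ~left):      # M univariate node optimizations
            f[mask] += node_update(y[mask], f[mask])
    return f

# LS loss instance: anti-gradient = residuals,
# node update = node average of the residuals
ls_grad = lambda y, f: y - f
ls_update = lambda y, f: (y - f).mean()
```

For the LAD loss the two callbacks become sign(y - f) and the node median of y - f, matching the specializations on the following slides.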
For L[y, f(X)] = (y - f)^2, M, and MaxTrees
Initial guess f(X) = f0 = mean(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gi = yi - f(xi), which is the traditional definition of the current residual
Fit an M-node regression tree to the current residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj simply become node averages of the current residuals
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
LS Loss
For L[y, f(X)] = |y - f|, M, and MaxTrees
Initial guess f(X) = f0 = median(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gi = sign(yi - f(xi)), which is the sign of the current residual
Fit an M-node regression tree to the signs of the current residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj now become node medians of the current residuals
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
LAD Loss
For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees
Initial guess f(X) = f0 = half the log-odds of y = +1
For k = 0 to MaxTrees-1
The anti-gradient component is gi = 2yi / (1 + exp(2 yi fi)); we call these generalized residuals
Fit an M-node regression tree to the generalized residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj are somewhat more complicated: akj = Σ gi / Σ |gi| (2 - |gi|), where the sums are taken over the cases in the node
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
Binary Classification
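The classification pieces can be spelled out the same way. The node update below is the standard one-step-Newton formula from Friedman's gradient boosting papers; this is my reconstruction for illustration, not TreeNet's exact code.

```python
import numpy as np

def logistic_risk(y, f):
    """Estimated logistic loss; y coded as -1/+1."""
    return np.log(1.0 + np.exp(-2.0 * y * f)).mean()

def generalized_residuals(y, f):
    """Anti-gradient components of the logistic loss."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

def node_update(y, f):
    """Within-node adjustment: one Newton step on the node's
    contribution to the estimated logistic loss."""
    g = generalized_residuals(y, f)
    return g.sum() / (np.abs(g) * (2.0 - np.abs(g))).sum()
```

For example, at f = 0 every generalized residual equals y, so the first update of a node is simply mean(y), a first-order approximation to half the log-odds of y = +1 in that node.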
Consider the following simple data set with single predictor X and 1000 observations
Here and in the following slides negative response observations are marked in blue whereas positive response observations are marked in red
The general tendency is to have positive response in the middle of the range of X
(insert table)
Example: Binary Data
The dataset was generated using the following model described by f(X) and the corresponding p(X) for y=+1
(insert graphs)
Example: Probability Model
(insert graph)
TreeNet fits constant probability 0.55
The residuals are positive for y=+1 and negative for y=-1
Initial Stage
(insert graph)
The dataset was partitioned into 3 regions: low X (negative adjustment), middle X (positive), and large X (negative)
The residuals “reflect” the directions of the adjustments
Next Stage
(insert graph)
This graph shows the predicted f(X) after 1000 iterations and a very small learning rate of 0.002
Note how the true shape was nearly perfectly recovered
Many Stages
The purpose of running a regression tree is to group observations into homogeneous subsets
Once we have the right partition, the adjustments for each terminal node are computed separately to optimize the given loss function; these are generally different from the predictions generated by the regression tree itself (they are the same only for the LS loss)
Thus, the procedure is no longer as simple as the initial intuitive recursive regression approach we started with
Nonetheless, the tree is used to define the actual form of f(X) over the whole range of X and not only at the individual data points observed
This becomes important in the final model deployment and scoring
A Note on Mechanics
Up to this point we guarded against overfitting only by allowing a small number of adjustments at each stage
We can guard further by forcing the adjustments themselves to be smaller
This is done by introducing a new parameter called “shrinkage” (learning rate) that is set to a constant value between 0 and 1
Small learning rates result in smoother models: a rate of 0.1 means that TreeNet will take 10 times more iterations to extract the same signal; more variables will be tried, finer partitions will result, and smaller boundary jumps will take place
Ideally, one might ultimately want to keep the learning rate close to zero and the number of stages (trees) close to infinity
However, rates below 0.001 usually become impractical
Shrinkage
(insert graph)
This graph shows the predicted f(X) after 100 iterations and a learning rate of 1
Note the roughness of the shape and the presence of abrupt strong jumps
100 stages, No shrinkage
(insert graph)
This graph shows predicted f(X) after 1000 iterations and a very small learning rate of 0.0002
Note how the true shape was nearly perfectly recovered
It may be further improved
1000 Stages, Substantial Shrinkage
At each stage, instead of working with the entire learn dataset, consider taking a random sample of a fixed size
Typical sampling rates are 50% of the learn data (the default), and even smaller for very large datasets
In the long run the entire learn dataset is exploited, but the running time is reduced by a factor of two with the 50% sampling rate
Sampling forces TreeNet to “rethink” optimal partition points from run to run due to random fluctuations of the residuals
This, combined with shrinkage and a large number of iterations, results in an overall improvement of the captured signal shape
Sampling Rate
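Both refinements change only a couple of lines of the basic loop. The sketch below (my names; LS loss; 2-node trees) shows where the learning rate and the sampling rate enter: the stage tree is fit on a random subsample of the residuals, and only a shrunken fraction of its adjustments is added to the model.

```python
import numpy as np

def fit_stump(x, r):
    """Best-split 2-node regression tree on residuals r."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((r[left] - r[left].mean()) ** 2).sum() \
            + ((r[~left] - r[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, r[left].mean(), r[~left].mean())
    return best[1:]

def boost_ls(x, y, max_trees=100, learn_rate=0.1, sample_rate=0.5, seed=0):
    """LS stochastic gradient boosting with shrinkage and subsampling."""
    rng = np.random.default_rng(seed)
    f = np.full(len(y), y.mean())
    for _ in range(max_trees):
        # each stage sees only a random fraction of the learn data
        idx = rng.choice(len(y), size=max(2, int(sample_rate * len(y))),
                         replace=False)
        r = y[idx] - f[idx]                     # residuals on the subsample
        split, lval, rval = fit_stump(x[idx], r)
        adj = np.where(x <= split, lval, rval)  # adjustment for ALL cases...
        f += learn_rate * adj                   # ...but only a fraction of it
    return f
```

With learn_rate = 1 and sample_rate = 1 this reduces to the plain stagewise procedure described earlier.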
(insert graph)
This graph shows predicted f(X) after 1000 stages, learning rate of 0.002, and 50% sampling
Note the minor fluctuations in the average loss
The resulting model is nice and smooth but there is still room for improvement
1000 Stages, Shrinkage and Sampling
(insert graph)
All previous runs allowed as few as 10 cases per individual region/node (the default)
Here we have increased this limit to 50
This immediately resulted in an even smoother shape
In practice, various node size limits should be tried
Limiting Node Size
In classification problems, it is possible to further reduce the amount of data processed at each stage
We ignore data points “too far” from the decision boundary to be usefully considered
◦ Well-classified points are ignored (just like conventional boosting)
◦ Badly misclassified data points are also ignored (very different from conventional boosting)
◦ The focus is on the cases most difficult to classify correctly: those near the decision boundary
Ignoring data far from the decision boundary in classification problems
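One way to implement this, following Friedman's "influence trimming" idea (a sketch of the mechanism, not TreeNet's exact rule): the influence w = |g| (2 - |g|) of a case, with g its generalized residual, is near zero both for well-classified cases (|g| ≈ 0) and for badly misclassified ones (|g| ≈ 2), so dropping low-influence cases ignores exactly the points far from the boundary on either side.

```python
import numpy as np

def boundary_mask(y, f, keep=0.8):
    """Keep the fraction `keep` of cases with the largest influence
    w = |g| * (2 - |g|), where g is the generalized residual."""
    g = 2.0 * y / (1.0 + np.exp(2.0 * y * f))
    w = np.abs(g) * (2.0 - np.abs(g))
    cutoff = np.quantile(w, 1.0 - keep)
    return w >= cutoff
```

A stage would then fit its tree only to the masked cases, cutting the work per stage while keeping the cases that actually shape the boundary.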
(insert graph)
2-dimensional predictor space
Red dots represent cases with +1 target
Green dots represent cases with -1 target
Black curve represents the decision boundary
Decision Boundary Diagram
The remaining slides present TreeNet runs on real data as well as give examples of GUI controls
We start with the Boston Housing dataset to illustrate regression
Then we proceed with the Cell Phone dataset to illustrate classification
Real Data Runs
(insert graph)
A Simple TreeNet Run
(insert graph)
Scatter Plot
(insert graph)
Essentially a regression tree with 2 terminal nodes
Predicted Response
(insert table) CART run with TARGET=MV
PREDICTORS= LSTAT
LIMIT DEPTH= 1
Save residuals as RESI
Equivalent CART Model