TreeNet is Salford's most flexible and powerful data mining tool, capable of consistently generating extremely accurate models. TreeNet has been responsible for the majority of Salford’s modeling competition awards. TreeNet demonstrates remarkable performance for both regression and classification. The algorithm typically generates thousands of small decision trees built in a sequential error–correcting process to converge to an accurate model.
Stochastic Gradient Boosting
An Introduction to TreeNet™
Salford Systems
http://www.salford-systems.com
[email protected]
Mikhail Golovnya, Dan Steinberg, Scott Cardell
New approach to machine learning/function approximation developed by Jerome H. Friedman at Stanford University
◦ Co-author of CART® with Breiman, Olshen and Stone
◦ Author of MARS™, PRIM, Projection Pursuit
Good for classification and regression problems
Builds on the notions of committees of experts and boosting but is substantially different in implementation details
Introduction to Stochastic Gradient Boosting
Stagewise function approximation in which each stage models the residuals from the last step's model
◦ Conventional boosting models the original target at each stage
Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes
◦ Conventional bagging and boosting use full-size trees and even massively large trees
Each stage learns from a fraction of the available training data: typically less than 50% to start, falling to 20% or less by the last stage
Each stage learns only a little: the contribution of each new tree is severely down-weighted (the learning rate is typically 0.10 or less)
Focus in classification is on points near the decision boundary; points far from the boundary are ignored even if they are on the wrong side of it
Stochastic Gradient Boosting: Key Innovations
Built on CART trees and thus
◦ Immune to outliers
◦ Handles missing values automatically
◦ Selects variables
◦ Results invariant wrt monotone transformations of variables
Trains very rapidly: many small trees do not take much longer to build than one large tree
Resistant to overtraining: generalizes very well
Can be remarkably accurate with little effort
BUT resulting model may be very complex
Reported Benefits of TreeNet™
An intuitive introduction
TreeNet Mathematical Basics
◦ Specifications of the TreeNet model as a series expansion
◦ Non-parametric approach to steepest descent optimization
TreeNet at work
◦ Small trees, learning rates, sub-sample fractions, regression types
◦ Reading the output: reports and diagnostics
Comparing to AdaBoost and other methods
Overview of TreeNet Tutorial
Consider the basic problem of estimating continuous outcome y based on a vector of predictors X
Running a step-wise multiple linear regression will produce an estimate f1(X) and associated residuals r1 = y - f1(X)
A simple intuitive idea: run a second-stage regression model to produce an estimate of the residuals f2(X) and the associated updated residuals r2 = y - (f1 + f2)
Repeating this process multiple times results in the following series expansion: y = f1 + f2 + f3 + …
Recursive Regression
The above idea can be easily implemented
Unfortunately, the direct implementation suffers from overfitting
The residuals from the previous model essentially communicate information about where that model fails the most; hence, the next-stage model effectively tries to improve the previous model where it failed
This is generally known as boosting
We may want to replace individual regressions with something simpler- regression trees, for example
It is not yet known whether this simple idea actually works, nor is it clear how to generalize it to various types of loss functions or to classification
Notes
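The recursive-regression idea above can be written out in a few lines. The sketch below is an illustrative reconstruction, not Salford's code (all names are mine): each stage fits a two-node regression "tree" (a stump, with the split chosen to minimize squared error) to the current residuals and adds its predictions to the running model.

```python
import numpy as np

def fit_stump(x, r):
    """Two-node regression tree: choose the split on x that minimizes
    the within-node sum of squared errors; predict node means."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((r[left] - r[left].mean()) ** 2).sum() \
            + ((r[~left] - r[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, r[left].mean(), r[~left].mean())
    _, split, lval, rval = best
    return split, lval, rval

def predict_stump(stump, x):
    split, lval, rval = stump
    return np.where(x <= split, lval, rval)

def recursive_regression(x, y, n_stages=20):
    """Approximate y by f1 + f2 + ...; each stage models the residuals
    left over by the previous stages."""
    pred = np.zeros_like(y, dtype=float)
    for _ in range(n_stages):
        r = y - pred                 # residuals from the current model
        stump = fit_stump(x, r)      # next stage models the residuals
        pred += predict_stump(stump, x)
    return pred
```

With a nonlinear target, the residual sum of squares falls steadily as stages are added, which is exactly the series expansion y ≈ f1 + f2 + … described above.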
For any given set of inputs X we want to predict some outcome y
Thus we want to construct a “nice” function f(X) which in turn can be used to express an estimate of y
We need to define how “nice” can be measured
Predictive Modeling
In regression, when y is continuous, the simplest choice is to take f(X) itself as the estimate of y
We may then define a loss function L[y, f(X)] as the loss incurred when y is estimated by f(X)
For example, the least squares (LS) loss is defined as L[y, f(X)] = (y - f(X))^2
Formally, a “nicely” defined f(X) will have the smallest expected loss (over the entire population) within the boundaries of its construction (for example, in multiple linear regression, f(X) belongs to the class of linear functions)
Loss for Regression
In reality, we have a set of N observed pairs (x,y) from the population, not the entire population
Hence, the expected loss E L[y, f(X)] can be replaced with its sample estimate R = (1/N) Σ L[yi, fi]
Here fi = f(xi)
The problem thus reduces to finding a function f(X) that minimizes R
Unfortunately, classification will demand additional treatment
Practical Estimate
Consider binary classification and assume that y is coded as +1 or -1
The most detailed solution would then give us the associated probabilities p(y)
Since probabilities are naturally constrained to the [0,1] interval, we assume that the function f(X) is transformed
p(y)=1/(1+exp(-2fy))
Note that p(+1)+p(-1)=1
The “trick” here is finding an unconstrained estimate f instead of constrained estimate p
Also note that f is simply half the log-odds of y = +1
Classification
(insert graph)
This graph shows the one-to-one correspondence between f and p for y=+1
Note that the most significant probability change occurs when f is between -3 and +3
S-transform
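The transform and its inverse can be stated directly; a small sketch (function names are mine) of the one-to-one correspondence between f and p described above:

```python
import math

def prob_from_f(f):
    """p(y=+1) given the unconstrained score f (f is half the log-odds)."""
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def f_from_prob(p):
    """Inverse transform: half the log-odds of y = +1."""
    return 0.5 * math.log(p / (1.0 - p))
```

Note that p(+1) + p(-1) = 1 follows because prob_from_f(f) + prob_from_f(-f) = 1, and almost all of the probability movement indeed happens for f between -3 and +3.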
Again, the main question is what “nice” f means given that we observed N pairs (x,y) from the population
Approaching this problem from the maximum likelihood point of view, one may show that the negative log-likelihood in this case becomes
R = (1/N) Σ log[1 + exp(-2 yi fi)]
The problem once again reduces to finding f that minimizes R above
We could obtain the same result formally by introducing a special loss function for classification: L[y, f] = log[1 + exp(-2yf)]
The above likelihood considerations show a “natural” way to arrive at such a peculiar loss function
Through Likelihood to Loss
Other approaches to defining the loss functions for binary classification are possible
For example, by throwing away the log term in the previous equation one arrives at the following loss: L = exp(-2yf)
It is possible to show that this loss function is effectively used in the “classical” AdaBoost algorithm
AdaBoost can be considered a predecessor of gradient boosting; we will defer the comparison until later
Other Loss Functions
To summarize we are looking for a function f(X) that minimizes the estimate of loss
The typical loss functions are
◦ Least squares (LS): L = (y - f)^2
◦ Least absolute deviation (LAD): L = |y - f|
◦ Logistic (binary classification): L = log[1 + exp(-2yf)]
◦ Exponential (AdaBoost): L = exp(-2yf)
Summary of Losses
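The losses discussed in this deck, written out as code (the naming is mine; y is coded -1/+1 for the classification losses):

```python
import numpy as np

def ls_loss(y, f):            # least squares (regression)
    return (y - f) ** 2

def lad_loss(y, f):           # least absolute deviation (regression)
    return np.abs(y - f)

def logistic_loss(y, f):      # binary classification, y in {-1, +1}
    return np.log(1.0 + np.exp(-2.0 * y * f))

def exp_loss(y, f):           # AdaBoost's exponential loss
    return np.exp(-2.0 * y * f)
```

All four agree on the direction of improvement: moving f toward y (or toward the correct sign of y) reduces the loss.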
The function f(X) is introduced as a known function of a fixed set of unknown parameters
The problem then reduces to finding a set of optimal parameter estimates using non-linear optimization techniques
Multiple linear regression and logistic regression: f(X) is a linear combination of fixed predictors, the parameters being the intercept term and the slope coefficients
Major problem: the function and predictors need to be specified beforehand; this usually results in a lengthy trial-and-error process
Parametric Approach
Construct f(X) using stage-wise approach
Start with a constant, then at each stage adjust the values of f(X) in various regions of data
It is important to keep the adjustment rate low; the resulting model will become smoother and usually less subject to overfitting
Note that we are effectively treating the values fi = f(xi) at all individual observed data points as separate parameters
Non-parametric Approach
More specifically, assume that we have gone through k-1 stages and obtained the current version fk-1(X)
We want to construct an updated version fk(X) resulting in a smaller value of R
Treating the individual values fi = f(xi) as parameters, we proceed by computing the anti-gradient gi = -∂R/∂fi
The individual components mark the “directions” in which the individual values of fk-1 must be changed to obtain a smaller R
To induce smoothness, let's limit our “freedom” by allowing only M (a small number, say between 2 and 10) distinct constant adjustments at any given stage
The Gradient Comes in
The optimal strategy is then to group the individual components gi into M mutually exclusive groups such that the variance within each group is minimized
But this is equivalent to growing a fixed-size (M terminal nodes) regression tree using gi as the target
Suppose we found M subsets S1, …, SM of cases
The constant adjustments akj (one per node) are computed to minimize the estimated loss R within each node
Finally, the updated model is fk(X) = fk-1(X) + Σj akj I(X ∈ Sj)
The Regression Tree Comes In
For the given loss function L[y, f(X)], node count M, and MaxTrees
◦ Make an initial guess f(X) = f0 (a constant)
◦ For k = 0 to MaxTrees-1
◦ Compute the anti-gradient gk by taking the derivative of the loss with respect to f(X) and substituting y and the current fk(X)
◦ Fit an M-node regression tree to the components of the negative gradient; this will partition observations into M mutually exclusive groups
◦ Find the within-node updates akj by performing M univariate optimizations of the node contributions to the estimated loss
◦ Do the update fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
◦ End for
Generic Algorithm
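The generic loop translates almost line for line into code. The sketch below is my reconstruction, not TreeNet itself: the loss enters through two callbacks (one for the anti-gradient, one for the within-node update), and the M-node tree is approximated by a best-split stump (M = 2) for brevity.

```python
import numpy as np

def best_split(x, g):
    """Fit a 2-node regression tree to the anti-gradient components g;
    return the boolean mask of the left node."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((g[left] - g[left].mean()) ** 2).sum() \
            + ((g[~left] - g[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, left)
    return best[1]

def boost(x, y, anti_gradient, node_update, max_trees=50):
    f = np.zeros_like(y, dtype=float)   # initial guess f0 = 0
    for _ in range(max_trees):
        g = anti_gradient(y, f)         # anti-gradient at the current f
        left = best_split(x, g)         # partition into M = 2 groups
        for mask in (left, ~left):      # M univariate node optimizations
            f[mask] += node_update(y[mask], f[mask])
    return f

# LS loss instance: anti-gradient = residuals,
# node update = node average of the residuals
ls_grad = lambda y, f: y - f
ls_update = lambda y, f: (y - f).mean()
```

For the LAD loss the two callbacks become sign(y - f) and the node median of y - f, matching the specializations on the following slides.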
For L[y, f(X)] = (y - f)^2, M, and MaxTrees
Initial guess f(X) = f0 = mean(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gi = yi - f(xi), which is the traditional definition of the current residual
Fit an M-node regression tree to the current residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj simply become node averages of the current residuals
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
LS Loss
For L[y, f(X)] = |y - f|, M, and MaxTrees
Initial guess f(X) = f0 = median(y)
For k = 0 to MaxTrees-1
The anti-gradient component is gi = sign(yi - f(xi)), which is the sign of the current residual
Fit an M-node regression tree to the signs of the current residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj now become node medians of the current residuals
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
LAD Loss
For L[y, f(X)] = log[1 + exp(-2yf)], M, and MaxTrees
Initial guess f(X) = f0 = half the log-odds of y = +1
For k = 0 to MaxTrees-1
The anti-gradient component is gi = 2yi / (1 + exp(2 yi fi)); we call these generalized residuals
Fit an M-node regression tree to the generalized residuals; this will partition observations into M mutually exclusive groups
The within-node updates akj are somewhat more complicated: akj = Σ gi / Σ |gi| (2 - |gi|), where the sums are taken over the cases in the node
Do the update: fk+1(X) = fk(X) + Σj akj I(X ∈ Sj)
End for
Binary Classification
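The classification pieces can be spelled out the same way. The node update below is the standard one-step-Newton formula from Friedman's gradient boosting papers; this is my reconstruction for illustration, not TreeNet's exact code.

```python
import numpy as np

def logistic_risk(y, f):
    """Estimated logistic loss; y coded as -1/+1."""
    return np.log(1.0 + np.exp(-2.0 * y * f)).mean()

def generalized_residuals(y, f):
    """Anti-gradient components of the logistic loss."""
    return 2.0 * y / (1.0 + np.exp(2.0 * y * f))

def node_update(y, f):
    """Within-node adjustment: one Newton step on the node's
    contribution to the estimated logistic loss."""
    g = generalized_residuals(y, f)
    return g.sum() / (np.abs(g) * (2.0 - np.abs(g))).sum()
```

For example, at f = 0 every generalized residual equals y, so the first update of a node is simply mean(y), a first-order approximation to half the log-odds of y = +1 in that node.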
Consider the following simple data set with single predictor X and 1000 observations
Here and in the following slides negative response observations are marked in blue whereas positive response observations are marked in red
The general tendency is to have positive response in the middle of the range of X
(insert table)
Example: Binary Data
The dataset was generated using the following model described by f(X) and the corresponding p(X) for y=+1
(insert graphs)
Example: Probability Model
(insert graph)
TreeNet fits constant probability 0.55
The residuals are positive for y=+1 and negative for y=-1
Initial Stage
(insert graph)
The dataset was partitioned into 3 regions: low X (negative adjustment), middle X (positive), and large X (negative)
The residuals “reflect” the directions of the adjustments
Next Stage
(insert graph)
This graph shows the predicted f(X) after 1000 iterations and a very small learning rate of 0.002
Note how the true shape was nearly perfectly recovered
Many Stages
The purpose of running a regression tree is to group observations into homogeneous subsets
Once we have the right partition, the adjustments for each terminal node are computed separately to optimize the given loss function; these are generally different from the predictions generated by the regression tree itself (they are the same only for the LS loss)
Thus, the procedure is no longer as simple as the initial intuitive recursive regression approach we started with
Nonetheless, the tree is used to define the actual form of f(X) over the whole range of X and not only at the individual data points observed
This becomes important in the final model deployment and scoring
A Note on Mechanics
Up to this point we guarded against overfitting only by allowing a small number of adjustments at each stage
We can guard further by forcing the adjustments themselves to be smaller
This is done by introducing a new parameter called “shrinkage” (learning rate) that is set to a constant value between 0 and 1
Small learning rates result in smoother models: a rate of 0.1 means that TreeNet will take 10 times more iterations to extract the same signal; more variables will be tried, finer partitions will result, and smaller boundary jumps will take place
Ideally, one might ultimately want to keep the learning rate close to zero and the number of stages (trees) close to infinity
However, rates below 0.001 usually become impractical
Shrinkage
(insert graph)
This graph shows the predicted f(X) after 100 iterations and a learning rate of 1
Note the roughness of the shape and the presence of abrupt strong jumps
100 stages, No shrinkage
(insert graph)
This graph shows predicted f(X) after 1000 iterations and a very small learning rate of 0.0002
Note how the true shape was nearly perfectly recovered
It may be further improved
1000 Stages, Substantial Shrinkage
At each stage, instead of working with the entire learn dataset, consider taking a random sample of a fixed size
Typical sampling rates are 50% of the learn data (the default), and even smaller for very large datasets
In the long run the entire learn dataset is exploited, but the running time is reduced by a factor of two with the 50% sampling rate
Sampling forces TreeNet to “rethink” optimal partition points from run to run due to random fluctuations of the residuals
This, combined with shrinkage and a large number of iterations, results in an overall improvement of the captured signal shape
Sampling Rate
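Both refinements change only a couple of lines of the basic loop. The sketch below (my names; LS loss; 2-node trees) shows where the learning rate and the sampling rate enter: the stage tree is fit on a random subsample of the residuals, and only a shrunken fraction of its adjustments is added to the model.

```python
import numpy as np

def fit_stump(x, r):
    """Best-split 2-node regression tree on residuals r."""
    best = None
    for s in np.unique(x)[:-1]:
        left = x <= s
        sse = ((r[left] - r[left].mean()) ** 2).sum() \
            + ((r[~left] - r[~left].mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, r[left].mean(), r[~left].mean())
    return best[1:]

def boost_ls(x, y, max_trees=100, learn_rate=0.1, sample_rate=0.5, seed=0):
    """LS stochastic gradient boosting with shrinkage and subsampling."""
    rng = np.random.default_rng(seed)
    f = np.full(len(y), y.mean())
    for _ in range(max_trees):
        # each stage sees only a random fraction of the learn data
        idx = rng.choice(len(y), size=max(2, int(sample_rate * len(y))),
                         replace=False)
        r = y[idx] - f[idx]                     # residuals on the subsample
        split, lval, rval = fit_stump(x[idx], r)
        adj = np.where(x <= split, lval, rval)  # adjustment for ALL cases...
        f += learn_rate * adj                   # ...but only a fraction of it
    return f
```

With learn_rate = 1 and sample_rate = 1 this reduces to the plain stagewise procedure described earlier.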
(insert graph)
This graph shows predicted f(X) after 1000 stages, learning rate of 0.002, and 50% sampling
Note the minor fluctuations in the average loss
The resulting model is nice and smooth but there is still room for improvement
1000 Stages, Shrinkage and Sampling
(insert graph)
All previous runs allowed as few as 10 cases per individual region/node (the default)
Here we have increased this limit to 50
This immediately resulted in an even smoother shape
In practice, various node size limits should be tried
Limiting Node Size
In classification problems, it is possible to further reduce the amount of data processed at each stage
We ignore data points “too far” from the decision boundary to be usefully considered
◦ Well-classified points are ignored (just like conventional boosting)
◦ Badly misclassified data points are also ignored (very different from conventional boosting)
◦ The focus is on the cases most difficult to classify correctly: those near the decision boundary
Ignoring data far from the decision boundary in classification problems
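One way to implement this, following Friedman's "influence trimming" idea (a sketch of the mechanism, not TreeNet's exact rule): the influence w = |g| (2 - |g|) of a case, with g its generalized residual, is near zero both for well-classified cases (|g| ≈ 0) and for badly misclassified ones (|g| ≈ 2), so dropping low-influence cases ignores exactly the points far from the boundary on either side.

```python
import numpy as np

def boundary_mask(y, f, keep=0.8):
    """Keep the fraction `keep` of cases with the largest influence
    w = |g| * (2 - |g|), where g is the generalized residual."""
    g = 2.0 * y / (1.0 + np.exp(2.0 * y * f))
    w = np.abs(g) * (2.0 - np.abs(g))
    cutoff = np.quantile(w, 1.0 - keep)
    return w >= cutoff
```

A stage would then fit its tree only to the masked cases, cutting the work per stage while keeping the cases that actually shape the boundary.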
(insert graph)
2-dimensional predictor space
Red dots represent cases with +1 target
Green dots represent cases with -1 target
Black curve represents the decision boundary
Decision Boundary Diagram
The remaining slides present TreeNet runs on real data as well as give examples of GUI controls
We start with the Boston Housing dataset to illustrate regression
Then we proceed with the Cell Phone dataset to illustrate classification
Real Data Runs
(insert graph)
A Simple TreeNet Run
(insert graph)
Scatter Plot
(insert graph)
Essentially a regression tree with 2 terminal nodes
Predicted Response
(insert table) CART run with TARGET=MV
PREDICTORS= LSTAT
LIMIT DEPTH= 1
Save residuals as RESI
Equivalent CART Model