Machine Learning Basics Using Tree Algorithms (Random Forest, Gradient Boosting)


DESCRIPTION

A primer on recursive partitioning using decision trees for regression and classification.


Proprietary Information created by Parth Khare

Machine Learning

Classification & Decision Trees

04/01/2013


Contents

Recursive Partitioning: Classification and Regression (Decision) Trees

Bagging: Random Forest

Boosting: Gradient Boosting

Questions


Detail and flow

What is ML, and how is it different from classical statistics?
What is the difference between supervised and unsupervised learning?
Supervised learning -> an application: trees
The most elementary analysis: the CART tree


Basics

Supervised learning: called "supervised" because the outcome variable is present to guide the learning process; the goal is to build a learner (model) that predicts the outcome for new, unseen objects.

Alternatively,

Unsupervised learning: we observe only the features and have no measurements of the outcome; the task is instead to describe how the data are organized or clustered.


Machine Learning vs. Statistics: 'learning' vs. 'fitting'

Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.

Statistics bases everything on probability models: it assumes the data are samples from a random variable with some distribution, and then makes inferences about the parameters of that distribution.

Machine learning may use probability models, and when it does, it overlaps with statistics. It is not as committed to probability, however, and also uses approaches to problem solving that are not based on probability.

The basic optimization concept for trees is the same as that of parametric techniques: minimizing an error metric. Instead of a squared-error function or maximum likelihood, tree-based machine learning optimizes criteria such as entropy or node impurity.

An application -> trees


Decision Tree Approach: Parlance

A decision tree represents a hierarchical segmentation of the data

The original segment is called the root node and is the entire data set

The root node is partitioned into two or more segments by applying a series of simple rules over an input variable

For example, risk = low vs. risk = not low. Each rule assigns an observation to a segment based on its input value

Each resulting segment can be further partitioned into sub-segments, and so on

For example, the segment risk = low can be further partitioned into income = low and income = not low

The segments are also called nodes, and the final segments are called leaf nodes or leaves

A final node that survives all partitioning is called a terminal node


Decision Tree Example: Risk Assessment (Loan)

Income
  < $30k  -> split on Age
    Age < 25   -> not on-time
    Age >= 25  -> on-time
  >= $30k -> split on Credit Score
    Credit Score < 600  -> not on-time
    Credit Score >= 600 -> on-time
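As a minimal sketch, the example tree above can also be written as nested splitting rules in R; the data frame, column names, and values here are hypothetical and purely for illustration:

```r
# Hypothetical loan applicants: income, age, credit score
loans <- data.frame(
  income = c(25000, 28000, 45000, 60000),
  age    = c(22, 40, 30, 35),
  credit = c(650, 700, 550, 720)
)

# The example tree expressed as nested if/else rules
predict_on_time <- function(income, age, credit) {
  if (income < 30000) {
    if (age < 25) "not on-time" else "on-time"
  } else {
    if (credit < 600) "not on-time" else "on-time"
  }
}

mapply(predict_on_time, loans$income, loans$age, loans$credit)
```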


CART: Heuristic and Visual

Generic supervised learning problem: given data (x1, y1), (x2, y2), ..., (xn, yn) and a new point x, the objective is to associate a y with this new x.

Main idea: form a binary tree and minimize the error in each leaf.

Given a dataset, a decision tree chooses a sequence of binary splits of the data.


Growing the tree

Growing the tree involves successively partitioning the data, i.e. recursive partitioning

If an input variable is binary, then its two categories can be used to split the data (based on the relative concentration of 0s and 1s)

If an input variable is interval, a splitting value is used to classify the data into two segments

For example, if household income is interval and there are 100 distinct incomes in the data set, then there are about 100 candidate splitting values, such as income < $30k vs. income >= $30k (a sketch of this scan follows)
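A minimal R sketch of scanning candidate splitting values for an interval input, scoring each split by weighted Gini impurity; the income and on-time vectors are simulated purely for illustration:

```r
set.seed(1)
income  <- round(runif(200, 15000, 90000))                   # interval input
on_time <- rbinom(200, 1, plogis((income - 40000) / 10000))  # 0/1 target

# Weighted Gini impurity of the split: income < t vs. income >= t
gini <- function(y) { p <- mean(y); 2 * p * (1 - p) }
split_impurity <- function(t) {
  left  <- on_time[income <  t]
  right <- on_time[income >= t]
  (length(left) * gini(left) + length(right) * gini(right)) / length(on_time)
}

# Every distinct observed value is a candidate threshold
thresholds <- sort(unique(income))[-1]
best <- thresholds[which.min(sapply(thresholds, split_impurity))]
best  # chosen splitting value: income < best vs. income >= best
```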


Classification Tree: Again (reference)

A classification tree is represented by a series of binary splits. Each internal node represents a query on one of the variables, e.g. "Is X3 > 0.4?". If the answer is "Yes", go right; otherwise go left.

The terminal nodes are the decision nodes. Typically each terminal node is dominated by one of the classes.

The tree is grown using training data, by recursive splitting.

The tree is often pruned to an optimal size, evaluated by cross-validation.

New observations are classified by passing their X down to a terminal node of the tree, and then using majority vote.
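A minimal R sketch of growing and cross-validation pruning such a tree with the rpart package; it uses the kyphosis data that ships with rpart, and the cp choice follows the usual minimum-xerror rule:

```r
library(rpart)
set.seed(123)

# Grow a classification tree by recursive splitting on the training data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

printcp(fit)   # cross-validated error (xerror) for each subtree size

# Prune to the subtree with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# New observations are passed down to a terminal node; the majority class is returned
predict(pruned, newdata = kyphosis[1:5, ], type = "class")
```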


Evaluating the partitions

When the target is categorical, for each partition of an input variable a chi-square statistic is computed

A contingency table is formed that maps responders and non-responders against the partitioned input variable

For example, the null hypothesis might be that there is no difference between people with income <$30k and those with income >=$30k in making an on-time loan payment

The lower the p-value, the stronger the evidence for rejecting this hypothesis, meaning that the income split is a discriminating factor
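A minimal R sketch of this chi-square check on a hypothetical 2x2 contingency table of the income split against on-time payment; the counts are invented purely for illustration:

```r
# Rows: income segment; columns: loan payment outcome (hypothetical counts)
payment <- matrix(c(45, 55,    # income <  $30k: on-time, not on-time
                    70, 30),   # income >= $30k: on-time, not on-time
                  nrow = 2, byrow = TRUE,
                  dimnames = list(income  = c("< $30k", ">= $30k"),
                                  payment = c("on-time", "not on-time")))

chisq.test(payment)  # a small p-value suggests the income split is discriminating
```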


Splitting Criteria: Categorical

Information Gain -> Entropy

The rarity of an event with probability p_i is defined as -log2(p_i)

Impurity measure (entropy): -Pr(Y=0) x log2[Pr(Y=0)] - Pr(Y=1) x log2[Pr(Y=1)]; e.g. check the value at Pr(Y=0) = 0.5 (it equals 1, the maximum)

Entropy sums up the rarity of response and non-response over all observations

Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (equal mix of responders and non-responders)

Link: http://www.youtube.com/watch?v=p17C9q2M00Q
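A minimal R sketch of this entropy measure, confirming the endpoints described above; the function name is just for illustration:

```r
# Entropy of a binary node: -p0*log2(p0) - p1*log2(p1), with 0*log2(0) taken as 0
node_entropy <- function(p1) {
  p0 <- 1 - p1
  h  <- function(p) ifelse(p == 0, 0, -p * log2(p))
  h(p0) + h(p1)
}

node_entropy(0.5)  # 1: worst case, an equal mix of responders and non-responders
node_entropy(0)    # 0: best case, a pure node
node_entropy(0.9)  # roughly 0.47
```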


Splitting Criteria: Continuous

An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue

As in the sum-of-squares decomposition in multiple regression, the F-statistic is based on the ratio of the between-group sum of squares to the within-group sum of squares, each adjusted for its degrees of freedom

The null hypothesis is that there is no difference in the target mean between the two groups
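A minimal R sketch of this F-test for a two-group split on an interval target such as revenue; the data are simulated purely for illustration:

```r
set.seed(3)
# Hypothetical revenue for the two segments produced by an income split
revenue <- c(rnorm(60, mean = 100, sd = 20), rnorm(40, mean = 130, sd = 20))
group   <- factor(rep(c("income < $30k", "income >= $30k"), times = c(60, 40)))

# F = (between-group SS / df_between) / (within-group SS / df_within)
anova(lm(revenue ~ group))                      # F statistic and p-value
oneway.test(revenue ~ group, var.equal = TRUE)  # the same F test
```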


Contents

Recursive Partitioning: Classification and Regression (Decision) Trees

Bagging: Random Forest

Boosting: Gradient Boosting


Bagging

Ensemble models combine the results from different models; a random forest is an ensemble classifier built from many decision tree models.

Bagging: bootstrapped samples of the data.

How a random forest works (a minimal R sketch follows):
A different subset of the training data (roughly 2/3) is selected, with replacement, to train each tree
The remaining training data (out-of-bag, OOB) are used to estimate error and variable importance
Class assignment is made by the vote count across all trees; for regression, the average of the results is used
A randomly selected subset of variables is considered when splitting each node
The number of variables tried at each split is chosen by the user (the mtry parameter in R)
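A minimal sketch of this workflow with the randomForest package on the built-in iris data; the ntree and mtry values are arbitrary choices for illustration:

```r
library(randomForest)
set.seed(10)

# Each tree is trained on a bootstrap sample (~2/3 of rows); the rest are OOB.
# mtry = number of randomly chosen variables tried at each split.
fit <- randomForest(Species ~ ., data = iris,
                    ntree = 500, mtry = 2, importance = TRUE)

fit                 # prints the OOB error rate and confusion matrix
importance(fit)     # OOB-based variable importance
predict(fit, newdata = iris[1:5, ])
```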


Bagging: Stanford

Suppose C(S, x) is a classifier, such as a tree, based on our training data S, producing a predicted class label at input point x.

To bag C, we draw bootstrap samples S*_1, ..., S*_B, each of size N, with replacement from the training data.

Then C_bag(x) = Majority Vote { C(S*_b, x) }, b = 1, ..., B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction.

However, any simple structure in C (e.g. a tree) is lost.
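A minimal hand-rolled sketch of this bagging recipe in R, using rpart trees on the built-in iris data; B and the tree settings are arbitrary illustrative choices:

```r
library(rpart)
set.seed(42)

B <- 25
n <- nrow(iris)

# Grow B trees, each on a bootstrap sample S*_b of size n drawn with replacement
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# C_bag(x): majority vote over the B trees
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == iris$Species)  # training accuracy of the bagged classifier
```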


Bootstrapped samples


Contents

Recursive Partitioning: Classification and Regression (Decision) Trees

Bagging: Random Forest

Boosting: Gradient Boosting


Boosting

Make copies of the data. Boosting idea: based on the "strength of weak learnability" principle. Example:

IF Gender=MALE AND Age<=25 THEN claim_freq.=‘high’

Simple or "weak" learners are not perfect, but a combination of weak learners increases accuracy.

Every boosting algorithm can be interpreted as optimizing a loss function in a greedy, stage-wise manner.

How it works (gradient descent; a minimal sketch follows below):
A first tree is created and its residuals are observed
A second tree is then fitted to the residuals of the first tree, and so on
In this way, boosting grows trees in series, with later trees dependent on the results of previous trees
Key tuning parameters: shrinkage, CV folds, interaction depth
Related algorithms and losses: AdaBoost, DirectBoost, Laplace loss (Gaussian boosting)
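A minimal hand-rolled sketch of this residual-fitting idea for a regression target, using small rpart trees as the weak learners; the simulated data, shrinkage value, and tree depth are illustrative choices only:

```r
library(rpart)
set.seed(7)

x   <- runif(300, 0, 10)
y   <- sin(x) + rnorm(300, sd = 0.3)
dat <- data.frame(x = x, y = y)

shrinkage <- 0.1                      # learning rate
n_trees   <- 100
pred      <- rep(mean(y), nrow(dat))  # start from the mean prediction

for (m in seq_len(n_trees)) {
  # Residuals are the negative gradient of squared-error loss
  dat$resid <- y - pred
  # Fit a small "weak" tree to the residuals of the current model
  tree <- rpart(resid ~ x, data = dat,
                control = rpart.control(maxdepth = 2))
  # Greedy stage-wise update, damped by the shrinkage parameter
  pred <- pred + shrinkage * predict(tree, dat)
}

mean((y - pred)^2)  # training MSE after boosting
```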


GBM

Gradient tree boosting is a generalization of boosting to arbitrary differentiable loss functions. GBRT (gradient boosted regression trees) is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.

What it does, essentially: by sequentially learning from the errors of the previous trees, gradient boosting in a way tries to 'learn' the distribution of the target variable. So, analogous to how we use different distribution families in GLM modelling, GBM replicates the distribution in the given data as closely as possible.

This comes with an additional risk of over-fitting, mitigated by methods such as internal cross-validation, a minimum number of observations per node, etc.

How the parameters work: OOB data/error.
The first tree of a GBM is built on the training data, and subsequent trees are developed on the errors from the previous trees; this process carries on.
For OOB estimation, the training data at each step are also split into two parts: the trees are developed on one part and tested on the other. This second part is called the OOB data, and the error obtained on it is known as the OOB error.
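A minimal sketch with the gbm package showing these knobs; the simulated data and all parameter values are arbitrary illustrative choices (bag.fraction < 1 is what creates the per-iteration OOB data):

```r
library(gbm)
set.seed(99)

dat   <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
dat$y <- rbinom(1000, 1, plogis(dat$x1 - dat$x2))  # 0/1 target

# bag.fraction < 1 holds part of the data out at each iteration (the OOB data);
# cv.folds adds cross-validation on top of that.
fit <- gbm(y ~ x1 + x2, data = dat, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.05,
           bag.fraction = 0.5, cv.folds = 5)

gbm.perf(fit, method = "OOB")  # number of trees suggested by the OOB error
gbm.perf(fit, method = "cv")   # number of trees suggested by cross-validation
```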


Summary: RF and GBM

Main similarities:

Both derive many benefits from ensembling, with few disadvantages
Both are typically applied to ensembles of decision trees

Main differences:

Boosting performs an exhaustive search for the best predictor to split on; RF searches only a small random subset at each node

Boosting grows trees in series, with later trees dependent on the results of previous trees; RF grows trees in parallel, independently of one another

RF (as implemented in R's randomForest) cannot handle missing values directly, whereas GBM can


More differences between RF and GBM

Algorithmic difference: random forests are trained on random samples of the data (with further randomization available, such as feature randomization) and rely on that randomization to generalize better beyond the training set.

Gradient boosted trees, on the other hand, additionally try to find an optimal linear combination of trees (the final model is a weighted sum of the predictions of the individual trees) with respect to the given training data. This extra tuning can be seen as the key difference. Note that there are many variations of both algorithms.

On the practical side, owing to this tuning stage, gradient boosted trees are more sensitive to noisy data. This final stage makes GBT more likely to overfit, so if the test cases differ markedly from the training cases the algorithm starts to lag.

Random forests, by contrast, are more resistant to overfitting, although they may give up some accuracy in the opposite situation.


Questions

Concept / Interpretation

Application

For further details contact:

Parth Khare: https://www.linkedin.com/profile/view?id=43877647&trk=nav_responsive_tab_profile
