
Instructor’s notes Ch.2 - Supervised Learning

Notation: □ means a pencil-and-paper QUIZ; ► means a coding QUIZ

V. Ensembles of Decision Trees (pp.85-94)

Ensemble = combination of multiple ML algorithms/models to obtain a better algorithm/model. The algorithms/models that get combined are called the base learners.

Q: Why do we need ensembles?

A: Let us learn more ML theory!

Bias and variance (not in text)

Bias is due to the simplifying assumptions made in the algorithm to make the target function easier to learn. A biased algorithm will consistently learn the wrong thing by not taking into account all the information in the data (a.k.a. underfitting).

As an example, parametric algorithms (like Ordinary Least Squares, Logistic Regression) are prone to high bias, because their models summarize data with a fixed-size set of parameters, irrespective of the number and distribution of training examples. “No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.” This high bias makes parametric models inaccurate on complex datasets.

At the other extreme, KNN with k=1 and DTs have naturally low bias, because they make few assumptions about the shape of the model: KNN with k=1 always correctly predicts the class of the points in the training set, and an (unrestricted) decision tree grows “naturally” as deep as it needs to be in order to perfectly classify the entire training set.

Variance is due to the flexibility provided in the algorithm to be able to learn complex target functions. A high-variance algorithm will learn the wrong thing by fitting the errors/noise in the data (a.k.a. overfitting).

Parametric algorithms generally have low variance: for example, one new datapoint affected by noise will not cause a big change in the regression line, because it gets “averaged out”.

On the contrary, KNN with k=1 and DTs have naturally high variance … explain!

Virtually all ML algorithms have “knobs” (a.k.a. hyper-parameters) that allow us to control the complexity of the model, for example:

k ≥ 3 in KNN can cause points in the training set to be classified incorrectly; this is bias, but it also reduces the variance of the model, because points affected by errors can be “overruled” by the vote of their neighbors (a coding sketch of this trade-off follows these examples).

Larger values of C in SVM cause the classifier to attempt to classify more points correctly, at the expense of a wider margin. Larger values of C therefore increase the variance and decrease the bias of the model (and vice versa for smaller values of C).
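A minimal coding sketch of this trade-off (not from the notes): it assumes scikit-learn's KNeighborsClassifier and the breast_cancer dataset, and simply compares training vs. test accuracy as the knob k grows.

# Hedged illustration (assumed example): how k in KNN trades variance for bias.
# k=1 memorizes the training set (train accuracy 1.0, high variance);
# a very large k smooths the model toward higher bias.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for k in (1, 3, 15, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train={knn.score(X_train, y_train):.3f}  test={knn.score(X_test, y_test):.3f}")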

“Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0. However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.”1

1 http://scott.fortmann-roe.com/docs/BiasVariance.html


Here is a nice illustration of the bias-variance trade-off:2 [figure not reproduced]

Ensemble theory (not in text)

Ensemble methods are divided into two groups:

1. parallel, a.k.a. bagging: the base learners are generated in parallel. The errors of the individual base learners can be reduced dramatically by averaging. Bagging is a portmanteau word formed from bootstrap and aggregation.

2. sequential, a.k.a. boosting: the base learners are generated sequentially. The overall performance is boosted (improved) by giving previously mislabeled examples a higher weight; this way, subsequent learners concentrate on the data points that are difficult to classify.

[Figure not reproduced: bagging vs. boosting illustration3]

Ensembles are actually meta-algorithms, because the base learners are interchangeable. For example, we can implement AdaBoost with Decision Trees, Neural Networks, and even “lowly” Linear Regression as base learners. In practice, bagging is done mostly with DTs (a.k.a. Random Forest) and NNs, and boosting with DTs.
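As a hedged sketch of this "interchangeable base learners" idea (not code from the notes), the snippet below runs scikit-learn's AdaBoostClassifier twice with different base learners; it assumes scikit-learn >= 1.2, where the constructor argument is named estimator (older releases call it base_estimator).

# Hedged sketch (assumed example): the same boosting meta-algorithm
# wrapped around two different base learners.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for base in (DecisionTreeClassifier(max_depth=1),   # a decision "stump"
             LogisticRegression(max_iter=5000)):    # a "lowly" linear model
    ada = AdaBoostClassifier(estimator=base, n_estimators=50, random_state=0)
    ada.fit(X_train, y_train)
    print(type(base).__name__, "as base learner -> test accuracy:",
          round(ada.score(X_test, y_test), 3))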

While “out-of-the-box” ensemble methods use the same type of base learner, a.k.a. homogeneous ensemble, it is also possible to use an ensemble of different types of base learners, a.k.a. heterogeneous ensemble; the “BellKor's Pragmatic Chaos” classifier that won the Netflix Prize in 2009 used an ensemble of KNN, linear regression, logistic regression, NN, SVD (Singular Value Decomposition), and other models.

2 Image source: http://scott.fortmann-roe.com/docs/BiasVariance.html
3 Image source: https://quantdare.com/what-is-the-difference-between-bagging-and-boosting/

Bagging (bootstrap aggregation) is used for base learners that have high variance: small changes in the training dataset result in large changes in the model. DTs are a model known for high variance, because even one training data point can make the decision process take another branch. We can see this behavior on the two-moons dataset.

Bagging reduces the variance, and also reduces overfitting. That is why we’re not concerned with pruning the base learners (let the trees in the forest grow deep!).

Bagging does not reduce the bias in the base learners. Fortunately, DTs have naturally low bias, as explained above – this is why bagging works well with DTs!

Back to our text:

Random Forest (RF) is a bagging algorithm. Bagging = bootstrap + aggregation

What is bootstrapping?

Several DTs are built.

Each of them is based on a number of points equal to the size of the entire dataset, but ...

... the points themselves are chosen with replacement from the dataset!

Example:
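The worked example from the notes is not reproduced here; as a stand-in, here is a minimal NumPy sketch (names assumed) of drawing one bootstrap sample:

# Hedged illustration: sample n indices *with replacement* from n points.
# On average only about 63% of the original points appear in each sample.
import numpy as np

rng = np.random.default_rng(0)
n = 10
data = np.arange(n)                      # stand-in for 10 training points
boot_idx = rng.integers(0, n, size=n)    # indices drawn with replacement
print("bootstrap sample:", data[boot_idx])
print("distinct points used:", np.unique(boot_idx).size, "out of", n)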

It can be shown using Probability Theory that sampling with replacement reduces the variance of the forest without increasing the bias.

If we have enough data points, it is possible to do sub-bagging: sample without replacement, and even use non-overlapping samples.


In Scikit-learn’s DecisionTreeClassifier, when selecting a split point, the default behavior is to look through all features and all values of each feature in order to select the optimal split point.4

As such, even with bagging, the DTs can have a lot of structural similarities and in turn have high correlation in their predictions. Combining predictions from multiple models in ensembles works better if the predictions from the base learners are uncorrelated (or at least weakly correlated). The RF algorithm changes this procedure so that the learning algorithm is limited to a random sample of features to search in.

In Scikit-learn’s RandomForestClassifier this is controlled by max_features; the default is sqrt(n_features).

The number of base learners is n_estimators.
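A minimal sketch (assumed code, not copied from the notes) of these two hyper-parameters on the two-moons dataset mentioned above:

from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5,       # number of base learners (trees)
                                max_features="sqrt",  # features searched at each split
                                random_state=2)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))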

#visualization code not shown ...

Conclusion: Due to averaging, the boundary is smoother!

In the lab --> Increase the number of trees and find out what happens to the boundary!

4 http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html


The proof is in the pudding:
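The actual run and its output are not reproduced here; below is a minimal sketch (assumed code) of the same kind of experiment on the breast_cancer dataset.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("train accuracy:", forest.score(X_train, y_train))
print("test accuracy:", forest.score(X_test, y_test))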

This is the best of all classifiers covered so far for the breast_cancer dataset!

We can also tweak any of the pre-pruning hyper-parameters: max_depth, max_leaf_nodes, min_samples_leaf. This usually does not change the precision, but may make the algorithm more efficient (in time and/or memory).

Even with the random selection of features, the feature importances are not always correct.

In the lab --> Will this problem be alleviated if we increase n_estimators?
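A hedged sketch (assumed code) for setting up this lab experiment: compare the top-ranked feature importances reported by forests of different sizes.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer = load_breast_cancer()
for n in (10, 100, 1000):
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest.fit(cancer.data, cancer.target)
    top = forest.feature_importances_.argsort()[::-1][:3]   # three largest importances
    print(n, "trees -> top features:", [cancer.feature_names[i] for i in top])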

As explained for individual DTs, the RF algorithm can also be used for regression. The Scikit-learn model is also in the ensemble module, and it is called RandomForestRegressor.


Boosting (serial ensemble): Reduces the bias (b/c it trains specifically on the difficult cases) and to some extent the variance (because of averaging), but may increase overfitting because some data points are given more importance (weight).

Example:

Scikit-learn uses a specific boosting method called gradient boosting (GB), and the base learners are again DTs. Unlike with bagging, overfitting is a concern here, so the strategy is to keep the base learners “weak”, i.e. shallow trees with a lot of pre-pruning.

The default is max_depth=3 (no larger than 5!). Note: RandomForestClassifier also has a max_depth parameter, but it is set to None by default (no pruning).

The learning_rate controls how aggressively each new tree tries to correct the mistakes of the previous ones.

The default is learning_rate=0.1

We usually set n_estimators based on the time and memory available, and then explore different values for learning_rate.
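A minimal sketch (assumed code) of that exploration: fix n_estimators and max_depth, then vary learning_rate.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

for lr in (0.01, 0.1, 1.0):
    gbrt = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                      learning_rate=lr, random_state=0)
    gbrt.fit(X_train, y_train)
    print(f"learning_rate={lr}: train={gbrt.score(X_train, y_train):.3f}, test={gbrt.score(X_test, y_test):.3f}")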

As explained for individual DTs, the GB algorithm can also be used for regression. The Scikit-learn model is also in the ensemble module, and it is called GradientBoostingRegressor.

▀ To do for more practice: see the 30 Q&A on tree-based models at the link below and on our webpage!

https://www.analyticsvidhya.com/blog/2017/09/30-questions-test-tree-based-models/