
Page 1: DMTM 2015 - 15 Classification Ensembles

Prof. Pier Luca Lanzi

Classification: Ensemble Methods
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Page 2: DMTM 2015 - 15 Classification Ensembles


Readings

•  “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” Second Edition, February 2009, Trevor Hastie, Robert Tibshirani, Jerome Friedman (http://statweb.stanford.edu/~tibs/ElemStatLearn/)
   § Section 8.7
   § Chapter 10
   § Chapter 15
   § Chapter 16


Page 3: DMTM 2015 - 15 Classification Ensembles


What are Ensemble Methods?

Page 4: DMTM 2015 - 15 Classification Ensembles


What is the idea?

[Figure: several independent diagnoses are combined into a single final diagnosis]

Page 5: DMTM 2015 - 15 Classification Ensembles


Ensemble Methods

Generate a set of classifiers from the training data

Predict class label of previously unseen cases by aggregating predictions made by multiple classifiers

Page 6: DMTM 2015 - 15 Classification Ensembles


What is the General Idea?

[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, ..., Dt-1, Dt; Step 2 builds one classifier Ci for each data set Di; Step 3 combines the classifiers C1, ..., Ct into a single classifier C*]


Page 7: DMTM 2015 - 15 Classification Ensembles


Building Model Ensembles

•  Basic idea
   § Build different “experts” and let them vote
•  Advantage
   § Often improves predictive performance
•  Disadvantage
   § Usually produces output that is very hard to analyze

However, there are approaches that aim to produce a single comprehensible structure

Page 8: DMTM 2015 - 15 Classification Ensembles


Why does it work?

•  Suppose there are 25 base classifiers

•  Each classifier has an error rate ε = 0.35

•  Assume the classifiers are independent

•  The probability that the ensemble (majority vote) makes a wrong prediction is the probability that at least 13 of the 25 classifiers are wrong:

   P(\text{wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1-\varepsilon)^{25-i} \approx 0.06

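As a quick numerical check of the calculation above, the minimal sketch below (plain Python, standard library only) evaluates the binomial sum for 25 independent base classifiers with ε = 0.35.

```python
# A quick numerical check: 25 independent base classifiers, each with
# error rate 0.35, combined by majority vote.
from math import comb

eps, n = 0.35, 25
# The majority vote is wrong when at least 13 of the 25 classifiers are wrong
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(f"ensemble error: {p_wrong:.3f}")   # about 0.06, far below the individual 0.35
```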

Page 9: DMTM 2015 - 15 Classification Ensembles


how can we generate several models using the same data and the same approach (e.g., Trees)?

Page 10: DMTM 2015 - 15 Classification Ensembles


we might …

randomize the selection of training data?

randomize the approach?

Page 11: DMTM 2015 - 15 Classification Ensembles


Bagging or Bootstrap Aggregation

Page 12: DMTM 2015 - 15 Classification Ensembles


What is Bagging? (Bootstrap Aggregation)

•  Analogy
   § Diagnosis based on multiple doctors’ majority vote
•  Training
   § Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
   § A classifier model Mi is learned for each training set Di
•  Classification (classify an unknown sample X)
   § Each classifier Mi returns its class prediction
   § The bagged classifier M* counts the votes and assigns the class with the most votes to X
•  Prediction
   § Bagging can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple


Page 13: DMTM 2015 - 15 Classification Ensembles


What is Bagging?

•  Combining predictions by voting/averaging
   § The simplest way is to give equal weight to each model
•  “Idealized” version
   § Sample several training sets of size n (instead of just having one training set of size n)
   § Build a classifier for each training set
   § Combine the classifiers’ predictions
•  Problem
   § We only have one dataset!
•  Solution
   § Generate new data sets of size n using the bootstrap, i.e., by sampling from the original one with replacement


Page 14: DMTM 2015 - 15 Classification Ensembles


Bagging Classifiers

Model generation
   Let n be the number of instances in the training data
   For each of t iterations:
      Sample n instances from the training set (with replacement)
      Apply the learning algorithm to the sample
      Store the resulting model

Classification
   For each of the t models:
      Predict the class of the instance using the model
   Return the class that is predicted most often
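As a concrete illustration of the pseudocode above, here is a minimal Python sketch of bagging. The helper names bagging_fit and bagging_predict are made up for this example, and it assumes NumPy arrays X, y and scikit-learn decision trees as the base learner.

```python
# A minimal sketch of the bagging pseudocode above (hypothetical helpers).
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, random_state=0):
    """Build t models, each trained on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.integers(0, n, size=n)      # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """Return the class predicted most often by the t models (majority vote)."""
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```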

Page 15: DMTM 2015 - 15 Classification Ensembles


Bagging works because it reduces variance by voting/averaging

usually, the more classifiers the better

it can help a lot if data are noisy

however, in some pathological hypothetical situations the overall error might increase

Page 16: DMTM 2015 - 15 Classification Ensembles


When Does Bagging Work?

•  A learning algorithm is unstable if small changes to the training set cause large changes in the learned classifier

•  If the learning algorithm is unstable, then bagging almost always improves performance

•  Bagging stable classifiers is not a good idea

•  Examples of unstable classifiers? Decision trees, regression trees, linear regression, neural networks

•  Examples of stable classifiers? K-nearest neighbors


Page 17: DMTM 2015 - 15 Classification Ensembles


Randomize the Algorithm, not the Data

•  We could randomize the learning algorithm instead of the data

•  Some algorithms already have a random component, e.g., the initial weights in a neural network

•  Most algorithms can be randomized (e.g., attribute selection in decision trees)

•  More generally applicable than bagging, e.g., random subsets in a nearest-neighbor scheme

•  It can be combined with bagging


Page 18: DMTM 2015 - 15 Classification Ensembles


Boosting

Page 19: DMTM 2015 - 15 Classification Ensembles


What is Boosting?

•  Analogy
   § Consult several doctors and base the final decision on a combination of weighted diagnoses
   § The weight assigned to each doctor is based on the accuracy of their previous diagnoses
•  How does Boosting work?
   § Weights are assigned to each training example
   § A series of k classifiers is iteratively learned
   § After a classifier Mi is learned, the weights are updated to allow the subsequent classifier Mi+1 to pay more attention to the training tuples that were misclassified by Mi
   § The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy


Page 20: DMTM 2015 - 15 Classification Ensembles


Example of Boosting

•  Suppose there are just 5 training examples {1, 2, 3, 4, 5}
•  Initially, each example has a 0.2 (1/5) probability of being sampled
•  The 1st round of boosting samples (with replacement) 5 examples, {2, 4, 4, 3, 2}, and builds a classifier from them
•  Suppose examples 2, 3, 5 are correctly predicted by this classifier, and examples 1, 4 are wrongly predicted
   § The weight of examples 1 and 4 is increased
   § The weight of examples 2, 3, 5 is decreased
•  The 2nd round of boosting again samples 5 examples, but now examples 1 and 4 are more likely to be sampled, and so on until convergence is achieved


Page 21: DMTM 2015 - 15 Classification Ensembles


AdaBoost.M1

Model generation
   Assign equal weight to each training instance
   For t iterations:
      Apply the learning algorithm to the weighted dataset and store the resulting model
      Compute the model's error e on the weighted dataset
      If e = 0 or e ≥ 0.5: terminate model generation
      For each instance in the dataset:
         If classified correctly by the model: multiply the instance's weight by e/(1-e)
      Normalize the weights of all instances

Classification
   Assign weight = 0 to all classes
   For each of the t (or fewer) models:
      Add -log(e/(1-e)) to the weight of the class this model predicts
   Return the class with the highest weight
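A minimal Python sketch of the AdaBoost.M1 procedure above, restricted to binary labels in {-1, +1}. The names adaboost_fit and adaboost_predict are made up for this example, and it assumes NumPy arrays X, y and scikit-learn decision stumps as the weak learners.

```python
# A minimal sketch of AdaBoost.M1 for binary labels in {-1, +1} (hypothetical helpers).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, t=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                   # equal weight to each training instance
    models, alphas = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = w[pred != y].sum()                # model's error on the weighted dataset
        if e == 0 or e >= 0.5:                # terminate model generation
            break
        w[pred == y] *= e / (1 - e)           # shrink weights of correctly classified instances
        w /= w.sum()                          # normalize the weights
        models.append(stump)
        alphas.append(np.log((1 - e) / e))    # classifier weight, i.e. -log(e / (1 - e))
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted vote of the models; the sign gives the predicted class in {-1, +1}
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```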

Page 22: DMTM 2015 - 15 Classification Ensembles


Page 23: DMTM 2015 - 15 Classification Ensembles


More on Boosting

•  Boosting needs weights ... but
•  We can adapt the learning algorithm ... or
•  We can apply boosting without weights
   § Resample with probability determined by the weights
   § Disadvantage: not all instances are used
   § Advantage: if the error exceeds 0.5, we can resample again
•  Boosting stems from computational learning theory
•  Theoretical result: the training error decreases exponentially
•  Boosting works if the base classifiers are not too complex and their error does not become too large too quickly
•  Boosting works with weak learners; the only condition is that their error does not exceed 0.5
•  In practice, boosting sometimes overfits (in contrast to bagging)


Page 24: DMTM 2015 - 15 Classification Ensembles


AdaBoost

•  Given the base classifiers C1, C2, …, CT

•  The error rate of classifier Ci on the N weighted training examples is computed as

   \varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\left( C_i(x_j) \neq y_j \right)

•  The importance of a classifier is computed as

   \alpha_i = \frac{1}{2} \ln \frac{1 - \varepsilon_i}{\varepsilon_i}

•  Classification is computed as

   C^*(x) = \arg\max_y \sum_{i=1}^{T} \alpha_i \, \delta\left( C_i(x) = y \right)


Page 25: DMTM 2015 - 15 Classification Ensembles


[Figure: boosting example on 10 training points labeled + and -, each starting with weight 0.1; the first boosting round builds classifier B1 with α = 1.9459 and re-weights the points (updated weights include 0.0094 and 0.4623)]

Page 26: DMTM 2015 - 15 Classification Ensembles


[Figure: three boosting rounds build classifiers B1, B2, B3 with α = 1.9459, 2.9323, and 3.8744, each trained on a differently re-weighted version of the 10 points; the overall weighted vote of the three classifiers predicts all 10 original labels correctly]

Page 27: DMTM 2015 - 15 Classification Ensembles


[Figure: boosting as a cascade: classifier C(0)(x) is trained on the original training sample; the sample is then re-weighted and classifier C(1)(x) is trained on the weighted sample, then C(2)(x), C(3)(x), and so on up to C(m)(x)]

The final classifier combines the N individual classifiers as

   y(x) = \sum_i w_i \, C^{(i)}(x)

Page 28: DMTM 2015 - 15 Classification Ensembles


[Figure: the same boosting cascade as on the previous slide, with the error rate of each classifier annotated]

AdaBoost re-weights the events misclassified by the previous classifier by

   \frac{1 - f_{\mathrm{err}}}{f_{\mathrm{err}}}, \qquad \text{with} \quad f_{\mathrm{err}} = \frac{\text{misclassified events}}{\text{all events}}

AdaBoost also weights the classifiers using the error rate of the individual classifier, according to

   y(x) = \sum_i \log\!\left( \frac{1 - f_{\mathrm{err}}^{(i)}}{f_{\mathrm{err}}^{(i)}} \right) C^{(i)}(x)

Page 29: DMTM 2015 - 15 Classification Ensembles


AdaBoost in Pictures

[Figure: AdaBoost iterations: start with equal event weights; at each step the misclassified events get larger weights, and so on]

Page 30: DMTM 2015 - 15 Classification Ensembles


Boosting

•  Also uses voting/averaging
•  Weights models according to their performance
•  Iterative: new models are influenced by the performance of previously built ones
   § Encourage the new model to become an “expert” for instances misclassified by earlier models
   § Intuitive justification: models should be experts that complement each other
•  Several variants
   § Boosting by sampling: the weights are used to sample the data for training
   § Boosting by weighting: the weights are used directly by the learning algorithm


Page 31: DMTM 2015 - 15 Classification Ensembles


Random Forests

Page 32: DMTM 2015 - 15 Classification Ensembles


learning ensemble consisting of a bagging of unpruned decision tree learners with randomized selection of features at each split

Page 33: DMTM 2015 - 15 Classification Ensembles


What is a Random Forest?

•  Random forests (RF) are a combination of tree predictors

•  Each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest

•  The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them

•  Using a random selection of features to split each node yields error rates that compare favorably to AdaBoost and are more robust with respect to noise


Page 34: DMTM 2015 - 15 Classification Ensembles


How do Random Forests Work?

•  D = training set; F = set of variables to test for a split
•  k = number of trees in the forest; n = number of variables
•  for i = 1 to k do
   § build data set Di by sampling with replacement from D
   § learn tree Ti from Di, using at each node only a random subset F of the n variables
   § save tree Ti as it is, do not perform any pruning
•  The output is computed as the majority vote (for classification) or the average (for prediction) of the set of k generated trees

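A minimal from-scratch Python sketch of the procedure above. The names random_forest_fit and random_forest_predict are made up for this example, and it assumes NumPy arrays X, y and scikit-learn trees, where max_features restricts the variables considered at each split. In practice, sklearn.ensemble.RandomForestClassifier implements the same idea directly.

```python
# A minimal sketch of the random forest procedure above (hypothetical helpers).
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, k=100, max_features="sqrt", random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)                          # bootstrap sample Di
        tree = DecisionTreeClassifier(max_features=max_features)  # random subset of variables per split
        trees.append(tree.fit(X[idx], y[idx]))                    # unpruned tree Ti
    return trees

def random_forest_predict(trees, x):
    votes = [t.predict(x.reshape(1, -1))[0] for t in trees]
    return Counter(votes).most_common(1)[0][0]                    # majority vote
```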

Page 35: DMTM 2015 - 15 Classification Ensembles


Page 36: DMTM 2015 - 15 Classification Ensembles


Out of Bag Samples

•  For each observation (xi, yi), construct its random forest predictor by averaging only those trees corresponding to bootstrap samples in which the observation did not appear.

•  The OOB error estimate is almost identical to that obtained by N-fold cross-validation.

•  Thus, random forests can be fit in one sequence, with cross-validation being performed along the way

•  Once the OOB error stabilizes, the training can be terminated.

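A brief sketch of the OOB estimate in practice, assuming scikit-learn's RandomForestClassifier: with oob_score=True each observation is scored only by the trees whose bootstrap sample did not contain it. The dataset is just an illustrative choice.

```python
# OOB error estimate with scikit-learn's RandomForestClassifier (illustrative data).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)
print("OOB error estimate:", 1 - forest.oob_score_)   # comparable to a cross-validated estimate
```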

Page 37: DMTM 2015 - 15 Classification Ensembles



Page 38: DMTM 2015 - 15 Classification Ensembles


Properties of Random Forests

•  Easy to use (“off-the-shelf”): only 2 parameters (number of trees, % of variables used for each split)
•  Very high accuracy
•  No overfitting when selecting a large number of trees (choose it high)
•  Insensitive to the choice of the split percentage (~20%)
•  Returns an estimate of variable importance
•  Random forests are an effective tool in prediction
•  Forests give results competitive with boosting and adaptive bagging, yet do not progressively change the training set
•  Random inputs and random features produce good results in classification, less so in regression
•  For larger data sets, we can gain accuracy by combining random features with boosting
