
Page 1: Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00

Ensemble Classification Methods

Rayid Ghani

IR Seminar – 9/26/00

Page 2: What is Ensemble Classification?

- Set of classifiers
- Decisions combined in “some” way
- Often more accurate than the individual classifiers
- What properties should the base learners have?

Page 3: Why should it work?

- More accurate ONLY if the individual classifiers disagree
- Each classifier needs an error rate < 0.5, and the errors should be independent
- The ensemble's error rate is highly correlated with how correlated the errors of the different learners are (Ali & Pazzani)
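The independence argument above can be illustrated with a small calculation. This is a minimal sketch, assuming n classifiers that each err independently with the same probability p (the function name and the values of n and p are illustrative, not from the talk). Under that assumption the majority vote is wrong only when more than half of the classifiers are wrong, which is a binomial tail:

```python
from math import comb

def majority_vote_error(n, p):
    """Probability that a majority of n independent classifiers are wrong,
    when each one errs with probability p (n is assumed odd to avoid ties)."""
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Illustrative values: with p < 0.5 the error shrinks as n grows,
# with p > 0.5 voting makes things worse.
for n in (1, 5, 21):
    print(n, majority_vote_error(n, 0.3), majority_vote_error(n, 0.6))
```

If the errors are correlated rather than independent, this calculation no longer applies and the benefit can disappear.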

Page 4: Averaging Fails!

- Use delta functions as classifiers (predict +1 at a point and –1 everywhere else)
- For training sample size m, construct a set of at most 2m classifiers s.t. the majority vote is always correct on the training sample
  - Associate 1 delta function with every example
  - Add M+ (# of +ve examples) copies of the function that predicts +1 everywhere and M- (# of -ve examples) copies of the function that predicts -1 everywhere
- Applying boosting to this results in zero training error but poor generalization
- Applying the margin analysis results in zero training error but the margin is small, O(1/m)
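The construction above can be checked numerically. This is a sketch under one assumption that the slide does not state: the delta function attached to a negative example is taken to be the mirrored one (-1 at its point, +1 elsewhere). The function names and the toy data are hypothetical.

```python
def build_delta_ensemble(xs, ys):
    """Build the 2m-classifier ensemble for training points xs (distinct)
    with labels ys in {+1, -1}."""
    classifiers = []
    for xi, yi in zip(xs, ys):
        # Delta function: predicts the example's own label at xi and the
        # opposite label everywhere else (mirroring for negatives is assumed).
        classifiers.append(lambda x, xi=xi, yi=yi: yi if x == xi else -yi)
    m_pos = sum(1 for y in ys if y == 1)
    m_neg = len(ys) - m_pos
    classifiers += [lambda x: 1] * m_pos    # M+ copies of "always +1"
    classifiers += [lambda x: -1] * m_neg   # M- copies of "always -1"
    return classifiers

def vote_sum(classifiers, x):
    """Unnormalized vote: positive means the majority predicts +1."""
    return sum(h(x) for h in classifiers)

xs = [0.1, 0.4, 0.7, 0.9, 1.3]
ys = [1, -1, 1, 1, -1]
ensemble = build_delta_ensemble(xs, ys)
train_margins = [y * vote_sum(ensemble, x) / len(ensemble) for x, y in zip(xs, ys)]
unseen_vote = vote_sum(ensemble, 0.55)
print(train_margins, unseen_vote)
```

Every training point is classified correctly with normalized margin 1/m, while the vote at any unseen point is an exact tie, so the ensemble carries no information beyond the training sample.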

Page 5: Ideas?

- Subsampling training examples
  - Bagging, Cross-Validated Committees, Boosting
- Manipulating input features
  - Choose different features
- Manipulating output targets
  - ECOC and variants
- Injecting randomness
  - NN (different initial weights), DT (pick different splits), injecting noise, MCMC
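Two of the subsampling ideas listed above can be sketched in a few lines. This is only an illustration of how the training sets might be generated, assuming the examples fit in a Python list; the helper names are hypothetical, and training the base learners on each set is left out.

```python
import random

def bootstrap_sample(examples, rng):
    """Bagging-style resample: draw len(examples) items with replacement."""
    return [rng.choice(examples) for _ in examples]

def cv_committee_splits(examples, k):
    """Cross-validated committees: k training sets, each leaving out one fold."""
    folds = [examples[i::k] for i in range(k)]
    return [[ex for j, fold in enumerate(folds) if j != i for ex in fold]
            for i in range(k)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
data = list(range(10))            # stand-in for real training examples
bags = [bootstrap_sample(data, rng) for _ in range(3)]
committees = cv_committee_splits(data, 5)
```

Each element of `bags` or `committees` would then be handed to the same learning algorithm to produce one ensemble member.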

Page 6: Combining Classifiers

- Unweighted Voting
  - Bagging, ECOC etc.
- Weighted Voting
  - Weight by accuracy (training or holdout set), LSR (weights 1/variance)
- Bayesian model averaging
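The difference between unweighted and weighted voting can be shown with a toy example. This is a minimal sketch, assuming each classifier outputs a single label and the weights are holdout accuracies; the function name, labels, and numbers are made up for illustration.

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """Return the label with the largest total weight.
    With weights=None every classifier counts equally (unweighted voting)."""
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

preds = ["A", "B", "B"]              # labels predicted by three classifiers
holdout_acc = [0.95, 0.40, 0.45]     # hypothetical holdout accuracies

unweighted = vote(preds)             # two of three votes go to "B"
weighted = vote(preds, holdout_acc)  # "A" has 0.95 against 0.85 for "B"
```

Here one accurate classifier outvotes two weak ones once the votes are weighted, which is the behaviour weighted voting is meant to provide.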

Page 7: BMA

- All possible models in the model space are used, weighted by their probability of being the “Correct” model
- Optimal given the correct model space and priors
- Not widely used even though it was said not to overfit (Buntine, 1990)

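Page 8: BMA - Equations

Posterior of a model h given the training set:

$$P(h \mid \vec{x}, \vec{c}) = \frac{P(h)}{P(\vec{x}, \vec{c})} \prod_{i=1}^{n} P(x_i, c_i \mid h)$$

where P(h) is the prior and the product is the likelihood.

$$P(x_i, c_i \mid h) = P(x_i \mid h)\, P(c_i \mid x_i, h)$$

where P(c_i | x_i, h) is the noise model.

Prediction for a new example x, averaging over the model space H:

$$P(c \mid x, \vec{x}, \vec{c}, H) = \sum_{h \in H} P(c \mid x, h)\, P(h \mid \vec{x}, \vec{c})$$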

Page 9: Equations

- Posterior
- Uniform noise model
- Pure classification model
- Model space too large – approximation required
  - Model with highest posterior, Sampling

Page 10: BMA of Bagged C4.5 Rules

- Bagging as a form of importance sampling where all samples are weighted equally
- Experimental Results
  - Every version of BMA performed worse than bagging on 19 out of 26 datasets
  - Posteriors skewed – dominated by a single rule model – model selection rather than averaging

Page 11: BMA of various learners

- RISE
  - Rule sets with partitioning
  - 8 databases from UCI
  - BMA worse than RISE in every domain
- Trading Rules
  - Intuition (there is no single right rule so BMA should help)
  - BMA similar to choosing the single best rule

Page 12: Overfitting in BMA

- Issue of overfitting is usually ignored (Freund et al. 2000)
- Is overfitting the explanation for the poor performance of BMA?
- Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
- Overfitting is the result of the likelihood’s exponential sensitivity to random fluctuations in the sample, and it increases with the # of models considered
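The exponential sensitivity mentioned above can be made concrete. This is a sketch assuming a uniform class-noise model, in which a model's posterior is proportional to P(h) (1-eps)^s eps^(n-s) for s correct training examples out of n; the noise level eps, the example counts, and the function name are illustrative assumptions rather than values from the talk.

```python
import math

def bma_weights(num_correct, n, eps, priors=None):
    """Normalized posterior weights, proportional to
    P(h) * (1 - eps)**s * eps**(n - s), computed in log space for stability."""
    if priors is None:
        priors = [1.0 / len(num_correct)] * len(num_correct)  # uniform prior
    log_post = [math.log(p) + s * math.log(1 - eps) + (n - s) * math.log(eps)
                for p, s in zip(priors, num_correct)]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Three models that get 85, 82 and 80 of 100 training examples right.
weights = bma_weights(num_correct=[85, 82, 80], n=100, eps=0.1)
print(weights)
```

A gap of only three training examples gives the first model a weight 9^3 = 729 times that of the second, so the average behaves almost like selecting the single model that fits the sample best.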

Page 13: To BMA or not to BMA?

- Net effect will depend on which effect prevails
  - Increased overfitting (small if few models are considered)
  - Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
- Ali & Pazzani (1996) report good results but bagging wasn’t tried
- Domingos (2000) used bootstrapping before BMA so the models were built from less data

Page 14: Why do they work?

- Bias / Variance Decomposition
- Training data insufficient for choosing a single best classifier
- Learning algorithms not “smart” enough!
- Hypothesis space may not contain the true function

Page 15: Definitions

- Bias is the persistent/systematic error of a learner, independent of the training set. Zero for a learner that always makes the optimal prediction
- Variance is the error incurred by fluctuations in response to different training sets. Independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set

Page 16: Bias–Variance Decomposition

- Kong & Dietterich (1995) – variance can be negative and noise is ignored
- Breiman (1996) – undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
- Tibshirani (1996), Hastie (1997)
- Kohavi & Wolpert (1996) – allows the bias of the Bayes optimal classifier to be non-zero
- Friedman (1997) – leaves bias and variance for zero-one loss undefined

Page 17: Domingos (2000)

- Single definition of bias and variance
- Applicable to “any” loss function
- Explains the margin effect (Schapire et al. 1997) using the decomposition
- Incorporates variable misclassification costs
- Experimental study

Page 18: Unified Decomposition

- Loss functions
  - Squared: L(t,y) = (t-y)^2
  - Absolute: L(t,y) = |t-y|
  - Zero-One: L(t,y) = 0 if y = t else 1
- Goal = Minimize average L(t,y) over all weighted examples
- Expected loss at x = c1 N(x) + B(x) + c2 V(x), with N the noise, B the bias and V the variance
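The terms of the decomposition can be spelled out as follows. This is a sketch of the definitions as given by Domingos (2000), to the best of my reading; the symbols y_* (optimal prediction), y_m (main prediction, i.e. the prediction that minimizes expected loss against the learner's own predictions) and D (the distribution over training sets) are notation introduced here and do not appear on the slide.

$$
\begin{aligned}
N(x) &= E_t\big[L(t, y_*)\big] \\
B(x) &= L(y_*, y_m) \\
V(x) &= E_D\big[L(y_m, y)\big] \\
E_{D,t}\big[L(t, y)\big] &= c_1 N(x) + B(x) + c_2 V(x)
\end{aligned}
$$

For squared loss, c_1 = c_2 = 1. For two-class zero-one loss, c_1 = 2 P_D(y = y_*) - 1, and c_2 = +1 if y_m = y_* and -1 otherwise, which is why variance helps on biased examples and hurts on unbiased ones.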

Page 19: Properties of the unified decomposition

- Relation to order-correct learner
- Relation to margin of a learner
- Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones.

Page 20: Experimental Study

- 30 UCI datasets
- Methodology
  - 100 bootstrap samples – averaged over the test set with uniform weights
  - Estimate bias, variance, zero-one loss
- DT, kNN, boosting
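The estimation step in this methodology might look roughly like the following. It is a sketch under stated assumptions, not the study's actual code: two classes, zero noise (so the test label is treated as the optimal prediction), and predictions already collected from models trained on each bootstrap sample. The function name and the toy predictions are hypothetical.

```python
from collections import Counter

def zero_one_decomposition(predictions, truth):
    """predictions[j][i] is the label predicted for test example i by the
    model trained on bootstrap sample j. Returns test-set averages of
    zero-one loss, bias, and variance on unbiased and biased examples."""
    n_models, n_test = len(predictions), len(truth)
    loss = bias = var_unbiased = var_biased = 0.0
    for i, t in enumerate(truth):
        votes = [predictions[j][i] for j in range(n_models)]
        main = Counter(votes).most_common(1)[0][0]        # main prediction (mode)
        b = 0 if main == t else 1                         # bias at this example
        v = sum(y != main for y in votes) / n_models      # variance at this example
        loss += sum(y != t for y in votes) / n_models
        bias += b
        if b:
            var_biased += v
        else:
            var_unbiased += v
    totals = {"loss": loss, "bias": bias,
              "var_unbiased": var_unbiased, "var_biased": var_biased}
    return {name: value / n_test for name, value in totals.items()}

# Toy input: 4 bootstrap models, 3 test examples.
preds = [[1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 0, 0]]
truth = [1, 1, 0]
result = zero_one_decomposition(preds, truth)
```

In the two-class, noise-free case the average loss equals bias plus variance on unbiased examples minus variance on biased examples, which the returned values can be used to check.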

Page 21: Boosting C4.5 - Results

- Decreases both bias and variance
- Bulk of bias reduction happens in the first few rounds
- Variance reduction is more gradual and the dominant effect

Page 22: kNN results

- kNN bias increases with k, and this increase dominates the variance reduction
- However, increasing k has the effect of reducing variance on unbiased examples while increasing it on biased ones.

Page 23: Issues

- Does not work with “Any” loss function, e.g. absolute loss
- Decomposition is not purely additive, unlike the original one for squared loss

Page 24: Spectrum of ensembles

[Figure: Bagging, Boosting and BMA placed along a spectrum, with axes labelled "Asymmetry of weights" and "Overfitting"]

Page 25: Open Issues concerning ensembles

- Best way to construct ensembles?
- No extensive comparison done
- Computationally expensive
- Not easily comprehensible

Page 26: Bibliography

- Overview
  - T. Dietterich
  - Bauer & Kohavi
- Averaging
  - Domingos
  - Freund, Mansour, Schapire
  - Ali, Pazzani
- Bias – Variance Decomposition
  - Kohavi & Wolpert
  - Domingos
  - Friedman
  - Kong & Dietterich