Ensemble Classification Methods
Rayid Ghani
IR Seminar – 9/26/00
What is Ensemble Classification?
A set of classifiers whose decisions are combined in "some" way
Often more accurate than the individual classifiers
What properties should the base learners have?
Why should it work?
The ensemble is more accurate ONLY if the individual classifiers disagree
Each classifier should have an error rate < 0.5, and the errors should be independent
The ensemble error rate is highly correlated with the correlation between the errors made by the different learners (Ali & Pazzani)
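To make the independence argument concrete, here is a minimal sketch (not from the talk) that computes the probability that a majority vote is wrong when each of n classifiers errs independently with rate p; the value p = 0.3 and the ensemble sizes are arbitrary illustrative choices.

```python
# Minimal sketch: majority-vote error of n independent classifiers, each with
# error rate p < 0.5. The values of p and n below are illustrative assumptions.
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """P(majority of n independent classifiers is wrong) when each errs with rate p."""
    k_min = n // 2 + 1  # number of wrong votes needed to flip the majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 5, 21, 101):
    print(f"n = {n:3d}  ensemble error = {majority_vote_error(0.3, n):.4f}")
# The error shrinks toward zero as n grows, but only under the independence
# assumption; correlated errors (as Ali & Pazzani note) erase most of the benefit.
```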
Averaging Fails!
Use delta functions as classifiers (predict +1 at a single point and -1 everywhere else)
For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct:
Associate one delta function with every example
Add M+ (# of positive examples) copies of the function that predicts +1 everywhere and M- (# of negative examples) copies of the function that predicts -1 everywhere
Applying boosting to this results in zero training error but bad generalization
The margin analysis also gives zero training error, but the margin is small: O(1/m)
Ideas?
Subsampling training examples: Bagging, cross-validated committees, Boosting (a bagging sketch follows below)
Manipulating input features: choose different feature subsets
Manipulating output targets: ECOC and variants
Injecting randomness: NN (different initial weights), DT (pick different splits), injecting noise, MCMC
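The first idea is the easiest to make concrete. Below is a minimal hand-rolled sketch of bagging: each base tree is trained on a bootstrap resample of the training set and the predictions are combined by an unweighted majority vote. The dataset, the use of decision trees, and the 25 replicates are illustrative assumptions, not details from the talk.

```python
# Minimal bagging sketch: bootstrap-resample the training set, fit one tree per
# resample, combine by unweighted majority vote. All parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):                                  # 25 bootstrap replicates
    idx = rng.integers(0, len(X_tr), len(X_tr))      # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

votes = np.stack([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)    # unweighted majority vote

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree accuracy:", (single.predict(X_te) == y_te).mean())
print("bagged vote accuracy:", (majority == y_te).mean())
```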
Combining Classifiers
Unweighted voting: Bagging, ECOC, etc.
Weighted voting: weights from accuracy (on the training or a holdout set), or LSR (weights proportional to 1/variance); a holdout-accuracy sketch follows below
Bayesian model averaging
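A small sketch of the weighted-voting scheme, weighting each classifier by its accuracy on a holdout set. The particular base learners, dataset, and split sizes are illustrative assumptions.

```python
# Minimal weighted-voting sketch: weight each classifier's +/-1 vote by its
# holdout accuracy. Learners, dataset, and split sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=2)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, random_state=2)

models = [m.fit(X_tr, y_tr) for m in
          (LogisticRegression(max_iter=1000), GaussianNB(),
           DecisionTreeClassifier(random_state=0))]
weights = np.array([m.score(X_ho, y_ho) for m in models])        # holdout accuracies

def weighted_vote(X_new):
    votes = np.stack([2 * m.predict(X_new) - 1 for m in models])  # votes in {-1, +1}
    return (weights @ votes > 0).astype(int)                      # sign of the weighted sum

print("individual holdout accuracies:", np.round(weights, 3))
print("weighted-vote holdout accuracy:", (weighted_vote(X_ho) == y_ho).mean())
```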
BMA
All possible models in the model space are used, weighted by their probability of being the "correct" model
Optimal given the correct model space and priors
Not widely used, even though it was claimed not to overfit (Buntine, 1990)
BMA - Equations
Posterior over models, given training data D = {(x_1, c_1), ..., (x_n, c_n)}:
  P(h | D) = [ ∏_{i=1..n} P(x_i, c_i | h) ] · P(h) / P(D)    (likelihood × prior)
The likelihood factors as:
  P(x_i, c_i | h) = P(x_i | h) · P(c_i | x_i, h)    (the last factor is the noise model)
Prediction for a new example x:
  P(C = c | x, D, H) = Σ_{h ∈ H} P(c | x, h) · P(h | D)
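As a concrete reading of the prediction equation, here is a small sketch of BMA over a finite model space H, assuming a uniform prior P(h) and using each model's predicted class probabilities as its noise model P(c | x, h). The hypothesis space (trees of different depths) and the dataset are illustrative assumptions; note how skewed the posterior weights tend to come out, which foreshadows the experimental findings below.

```python
# Minimal BMA sketch over a finite hypothesis space H: posterior ~ prior x
# likelihood, prediction = posterior-weighted average of P(c | x, h).
# Model space, prior, and dataset are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

# A small model space H: trees of different depths, with a uniform prior P(h).
H = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
     for d in (1, 2, 3, 5, None)]
log_prior = np.full(len(H), -np.log(len(H)))

# Log-likelihood of the training labels under each model's noise model P(c | x, h).
log_lik = np.array([
    np.log(np.clip(h.predict_proba(X_tr)[np.arange(len(y_tr)), y_tr], 1e-12, 1.0)).sum()
    for h in H
])
log_post = log_prior + log_lik
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()                          # P(h | D), normalised
print("posterior weights:", np.round(posterior, 3))   # usually dominated by one model

# BMA prediction: P(c | x, D, H) = sum_h P(c | x, h) P(h | D)
proba = sum(w * h.predict_proba(X_te) for w, h in zip(posterior, H))
print("BMA test accuracy:", (proba.argmax(axis=1) == y_te).mean())
```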
Equations
Posterior
Uniform noise model / pure classification model
Model space too large – approximation required
Model with the highest posterior, or sampling
BMA of Bagged C4.5 Rules
Bagging can be seen as a form of importance sampling in which all samples are weighted equally
Experimental Results
Every version of BMA performed worse than bagging on 19 out of 26 datasets
Posteriors were skewed – dominated by a single rule model – so BMA amounted to model selection rather than averaging
BMA of various learners
RISE: rule sets with partitioning
8 databases from UCI
BMA worse than RISE in every domain
Trading Rules
Intuition: there is no single right rule, so BMA should help
In practice, BMA was similar to choosing the single best rule
Overfitting in BMA
The issue of overfitting is usually ignored (Freund et al. 2000)
Is overfitting the explanation for the poor performance of BMA?
Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
Overfitting results from the likelihood's exponential sensitivity to random fluctuations in the sample, and it increases with the number of models considered
To BMA or not to BMA?
The net effect depends on which of two effects prevails:
Increased overfitting (small if few models are considered)
Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
Ali & Pazzani (1996) report good results, but bagging wasn't tried
Domingos (2000) used bootstrapping before BMA, so the models were built from less data
Why do they work?
Bias/variance decomposition
Training data insufficient for choosing a single best classifier
Learning algorithms not "smart" enough
The hypothesis space may not contain the true function
Definitions
Bias is the persistent/systematic error of a learner, independent of the training set; it is zero for a learner that always makes the optimal prediction
Variance is the error incurred by fluctuations in response to different training sets; it is independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set
Bias–Variance Decomposition
Kong & Dietterich (1995) – variance can be negative, and noise is ignored
Breiman (1996) – undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
Tibshirani (1996), Hastie (1997)
Kohavi & Wolpert (1996) – allows the bias of the Bayes-optimal classifier to be non-zero
Friedman (1997) – leaves bias and variance for zero-one loss undefined
Domingos (2000)
A single definition of bias and variance
Applicable to "any" loss function
Explains the margin effect (Schapire et al. 1997) using the decomposition
Incorporates variable misclassification costs
Experimental study
Unified Decomposition
Loss functions:
Squared: L(t,y) = (t-y)^2
Absolute: L(t,y) = |t-y|
Zero-one: L(t,y) = 0 if y = t, else 1
Goal: minimize the average L(t,y) over all weighted examples
The expected loss decomposes as c1·N(x) + B(x) + c2·V(x)
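For reference, a sketch of the quantities behind this formula, following Domingos' (2000) definitions as summarized here (y is the learner's prediction when trained on a random training set D; t is the true value):

y*(x)  = argmin_{y'} E_t[ L(t, y') ]          (optimal prediction)
y_m(x) = argmin_{y'} E_D[ L(y, y') ]          (main prediction over training sets)
N(x) = E_t[ L(t, y*) ],   B(x) = L(y*, y_m),   V(x) = E_D[ L(y_m, y) ]
E_{D,t}[ L(t, y) ] = c1·N(x) + B(x) + c2·V(x)

For two-class zero-one loss, c1 = 2·P_D(y = y*) - 1 and c2 = +1 on unbiased examples (y_m = y*) and -1 on biased ones, which is why increasing variance on biased examples can actually reduce loss (see the margin and kNN slides below).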
Properties of the unified decomposition
Relation to an order-correct learner
Relation to the margin of a learner
Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones
Experimental Study
30 UCI datasets
Methodology: 100 bootstrap samples; bias, variance, and zero-one loss estimated over the test set with uniform weights
Learners: DT, kNN, boosting
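A minimal sketch of that estimation procedure for zero-one loss: train one model per bootstrap replicate, take the most frequent prediction on each test example as the main prediction, and read bias and variance off the test set. The dataset, the learner, and the simplification of treating the test label as the optimal prediction (i.e. assuming zero noise) are illustrative assumptions.

```python
# Minimal sketch of bootstrap-based bias/variance estimation for zero-one loss.
# Dataset, learner, and the zero-noise assumption are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_boot = 100                                   # 100 bootstrap samples, as in the study
preds = np.empty((n_boot, len(y_te)), dtype=int)
for b in range(n_boot):
    idx = rng.integers(0, len(y_tr), len(y_tr))            # bootstrap resample
    model = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    preds[b] = model.predict(X_te)

# Main prediction y_m(x): the most frequent class over training sets.
y_main = np.array([np.bincount(col).argmax() for col in preds.T])

# Assume zero noise, so the test label stands in for the optimal prediction y*.
bias = (y_main != y_te).astype(float)                      # B(x) = L(y*, y_m)
variance = (preds != y_main).mean(axis=0)                  # V(x) = E_D[L(y_m, y)]

print(f"average bias      = {bias.mean():.3f}")
print(f"average variance  = {variance.mean():.3f}")
print(f"average 0-1 loss  = {(preds != y_te).mean():.3f}")
```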
Boosting C4.5 - Results
Decreases both bias and variance
The bulk of the bias reduction happens in the first few rounds
Variance reduction is more gradual and is the dominant effect
kNN results
The increase in bias with k dominates the variance reduction
However, increasing k reduces variance on unbiased examples while increasing it on biased ones
Issues
Does not work with "any" loss function, e.g. absolute loss
The decomposition is not purely additive, unlike the original one for squared loss
Spectrum of ensembles
[Diagram: Bagging, Boosting, and BMA placed along axes of asymmetry of weights and overfitting]
Open Issues concerning ensembles
What is the best way to construct ensembles?
No extensive comparison has been done
Computationally expensive
Not easily comprehensible
Bibliography
Overview: T. Dietterich; Bauer & Kohavi
Averaging: Domingos; Freund, Mansour & Schapire; Ali & Pazzani
Bias–Variance Decomposition: Kohavi & Wolpert; Domingos; Friedman; Kong & Dietterich