Ensemble Classification Methods
Rayid Ghani
IR Seminar – 9/26/00
What is Ensemble Classification?
A set of classifiers whose decisions are combined in "some" way
Often more accurate than the individual classifiers
What properties should the base learners have?
Why should it work?
The ensemble is more accurate ONLY if the individual classifiers disagree
Each classifier should have an error rate < 0.5, and the errors should be independent
The ensemble error rate is highly correlated with the correlation between the errors made by the different learners (Ali & Pazzani)
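To make the independence argument concrete, here is a minimal sketch (not from the talk) that computes the probability that a majority vote is wrong when each of n classifiers errs independently with rate p; the value p = 0.3 and the ensemble sizes are arbitrary illustrative choices.

```python
# Minimal sketch: majority-vote error of n independent classifiers, each with
# error rate p < 0.5. The values of p and n below are illustrative assumptions.
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """P(majority of n independent classifiers is wrong) when each errs with rate p."""
    k_min = n // 2 + 1  # number of wrong votes needed to flip the majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 5, 21, 101):
    print(f"n = {n:3d}  ensemble error = {majority_vote_error(0.3, n):.4f}")
# The error shrinks toward zero as n grows, but only under the independence
# assumption; correlated errors (as Ali & Pazzani note) erase most of the benefit.
```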
Averaging Fails!
Use delta functions as classifiers (predict +1 at a single point and -1 everywhere else)
For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct:
Associate one delta function with every example
Add M+ (# of positive examples) copies of the function that predicts +1 everywhere and M- (# of negative examples) copies of the function that predicts -1 everywhere
Applying boosting to this results in zero training error but bad generalization
The margin analysis also gives zero training error, but the margin is small: O(1/m)
Ideas?
Subsampling training examples: Bagging, cross-validated committees, Boosting (a bagging sketch follows below)
Manipulating input features: choose different feature subsets
Manipulating output targets: ECOC and variants
Injecting randomness: NN (different initial weights), DT (pick different splits), injecting noise, MCMC
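The first idea is the easiest to make concrete. Below is a minimal hand-rolled sketch of bagging: each base tree is trained on a bootstrap resample of the training set and the predictions are combined by an unweighted majority vote. The dataset, the use of decision trees, and the 25 replicates are illustrative assumptions, not details from the talk.

```python
# Minimal bagging sketch: bootstrap-resample the training set, fit one tree per
# resample, combine by unweighted majority vote. All parameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
rng = np.random.default_rng(1)

trees = []
for _ in range(25):                                  # 25 bootstrap replicates
    idx = rng.integers(0, len(X_tr), len(X_tr))      # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

votes = np.stack([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)    # unweighted majority vote

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree accuracy:", (single.predict(X_te) == y_te).mean())
print("bagged vote accuracy:", (majority == y_te).mean())
```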
Combining Classifiers
Unweighted voting: Bagging, ECOC, etc.
Weighted voting: weights from accuracy (on the training or a holdout set), or LSR (weights proportional to 1/variance); a holdout-accuracy sketch follows below
Bayesian model averaging
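A small sketch of the weighted-voting scheme, weighting each classifier by its accuracy on a holdout set. The particular base learners, dataset, and split sizes are illustrative assumptions.

```python
# Minimal weighted-voting sketch: weight each classifier's +/-1 vote by its
# holdout accuracy. Learners, dataset, and split sizes are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=2)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.25, random_state=2)

models = [m.fit(X_tr, y_tr) for m in
          (LogisticRegression(max_iter=1000), GaussianNB(),
           DecisionTreeClassifier(random_state=0))]
weights = np.array([m.score(X_ho, y_ho) for m in models])        # holdout accuracies

def weighted_vote(X_new):
    votes = np.stack([2 * m.predict(X_new) - 1 for m in models])  # votes in {-1, +1}
    return (weights @ votes > 0).astype(int)                      # sign of the weighted sum

print("individual holdout accuracies:", np.round(weights, 3))
print("weighted-vote holdout accuracy:", (weighted_vote(X_ho) == y_ho).mean())
```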
BMA
All possible models in the model space are used, weighted by their probability of being the "correct" model
Optimal given the correct model space and priors
Not widely used, even though it was claimed not to overfit (Buntine, 1990)
BMA - Equations
Posterior over models, given training data D = {(x_1, c_1), ..., (x_n, c_n)}:
  P(h | D) = [ ∏_{i=1..n} P(x_i, c_i | h) ] · P(h) / P(D)    (likelihood × prior)
The likelihood factors as:
  P(x_i, c_i | h) = P(x_i | h) · P(c_i | x_i, h)    (the last factor is the noise model)
Prediction for a new example x:
  P(C = c | x, D, H) = Σ_{h ∈ H} P(c | x, h) · P(h | D)
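As a concrete reading of the prediction equation, here is a small sketch of BMA over a finite model space H, assuming a uniform prior P(h) and using each model's predicted class probabilities as its noise model P(c | x, h). The hypothesis space (trees of different depths) and the dataset are illustrative assumptions; note how skewed the posterior weights tend to come out, which foreshadows the experimental findings below.

```python
# Minimal BMA sketch over a finite hypothesis space H: posterior ~ prior x
# likelihood, prediction = posterior-weighted average of P(c | x, h).
# Model space, prior, and dataset are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

# A small model space H: trees of different depths, with a uniform prior P(h).
H = [DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
     for d in (1, 2, 3, 5, None)]
log_prior = np.full(len(H), -np.log(len(H)))

# Log-likelihood of the training labels under each model's noise model P(c | x, h).
log_lik = np.array([
    np.log(np.clip(h.predict_proba(X_tr)[np.arange(len(y_tr)), y_tr], 1e-12, 1.0)).sum()
    for h in H
])
log_post = log_prior + log_lik
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()                          # P(h | D), normalised
print("posterior weights:", np.round(posterior, 3))   # usually dominated by one model

# BMA prediction: P(c | x, D, H) = sum_h P(c | x, h) P(h | D)
proba = sum(w * h.predict_proba(X_te) for w, h in zip(posterior, H))
print("BMA test accuracy:", (proba.argmax(axis=1) == y_te).mean())
```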
Equations
Posterior
Uniform noise model / pure classification model
Model space too large – approximation required
Model with the highest posterior, or sampling
BMA of Bagged C4.5 Rules
Bagging can be seen as a form of importance sampling in which all samples are weighted equally
Experimental Results
Every version of BMA performed worse than bagging on 19 out of 26 datasets
Posteriors were skewed – dominated by a single rule model – so BMA amounted to model selection rather than averaging
BMA of various learners
RISE: rule sets with partitioning
8 databases from UCI
BMA worse than RISE in every domain
Trading Rules
Intuition: there is no single right rule, so BMA should help
In practice, BMA was similar to choosing the single best rule
Overfitting in BMA
The issue of overfitting is usually ignored (Freund et al. 2000)
Is overfitting the explanation for the poor performance of BMA?
Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
Overfitting results from the likelihood's exponential sensitivity to random fluctuations in the sample, and it increases with the number of models considered
To BMA or not to BMA?
The net effect depends on which of two effects prevails:
Increased overfitting (small if few models are considered)
Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
Ali & Pazzani (1996) report good results, but bagging wasn't tried
Domingos (2000) used bootstrapping before BMA, so the models were built from less data
Why do they work?
Bias/variance decomposition
Training data insufficient for choosing a single best classifier
Learning algorithms not "smart" enough
The hypothesis space may not contain the true function
Definitions
Bias is the persistent/systematic error of a learner, independent of the training set; it is zero for a learner that always makes the optimal prediction
Variance is the error incurred by fluctuations in response to different training sets; it is independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set
Bias–Variance Decomposition
Kong & Dietterich (1995) – variance can be negative, and noise is ignored
Breiman (1996) – undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
Tibshirani (1996), Hastie (1997)
Kohavi & Wolpert (1996) – allows the bias of the Bayes-optimal classifier to be non-zero
Friedman (1997) – leaves bias and variance for zero-one loss undefined
Domingos (2000)
A single definition of bias and variance
Applicable to "any" loss function
Explains the margin effect (Schapire et al. 1997) using the decomposition
Incorporates variable misclassification costs
Experimental study
Unified Decomposition
Loss functions:
Squared: L(t,y) = (t-y)^2
Absolute: L(t,y) = |t-y|
Zero-one: L(t,y) = 0 if y = t, else 1
Goal: minimize the average L(t,y) over all weighted examples
The expected loss decomposes as c1·N(x) + B(x) + c2·V(x)
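For reference, a sketch of the quantities behind this formula, following Domingos' (2000) definitions as summarized here (y is the learner's prediction when trained on a random training set D; t is the true value):

y*(x)  = argmin_{y'} E_t[ L(t, y') ]          (optimal prediction)
y_m(x) = argmin_{y'} E_D[ L(y, y') ]          (main prediction over training sets)
N(x) = E_t[ L(t, y*) ],   B(x) = L(y*, y_m),   V(x) = E_D[ L(y_m, y) ]
E_{D,t}[ L(t, y) ] = c1·N(x) + B(x) + c2·V(x)

For two-class zero-one loss, c1 = 2·P_D(y = y*) - 1 and c2 = +1 on unbiased examples (y_m = y*) and -1 on biased ones, which is why increasing variance on biased examples can actually reduce loss (see the margin and kNN slides below).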
Properties of the unified decomposition
Relation to an order-correct learner
Relation to the margin of a learner
Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones
Experimental Study
30 UCI datasets
Methodology: 100 bootstrap samples; bias, variance, and zero-one loss estimated over the test set with uniform weights
Learners: DT, kNN, boosting
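A minimal sketch of that estimation procedure for zero-one loss: train one model per bootstrap replicate, take the most frequent prediction on each test example as the main prediction, and read bias and variance off the test set. The dataset, the learner, and the simplification of treating the test label as the optimal prediction (i.e. assuming zero noise) are illustrative assumptions.

```python
# Minimal sketch of bootstrap-based bias/variance estimation for zero-one loss.
# Dataset, learner, and the zero-noise assumption are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_boot = 100                                   # 100 bootstrap samples, as in the study
preds = np.empty((n_boot, len(y_te)), dtype=int)
for b in range(n_boot):
    idx = rng.integers(0, len(y_tr), len(y_tr))            # bootstrap resample
    model = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    preds[b] = model.predict(X_te)

# Main prediction y_m(x): the most frequent class over training sets.
y_main = np.array([np.bincount(col).argmax() for col in preds.T])

# Assume zero noise, so the test label stands in for the optimal prediction y*.
bias = (y_main != y_te).astype(float)                      # B(x) = L(y*, y_m)
variance = (preds != y_main).mean(axis=0)                  # V(x) = E_D[L(y_m, y)]

print(f"average bias      = {bias.mean():.3f}")
print(f"average variance  = {variance.mean():.3f}")
print(f"average 0-1 loss  = {(preds != y_te).mean():.3f}")
```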
Boosting C4.5 - Results
Decreases both bias and variance
The bulk of the bias reduction happens in the first few rounds
Variance reduction is more gradual and is the dominant effect
kNN results
The increase in bias with k dominates the variance reduction
However, increasing k reduces variance on unbiased examples while increasing it on biased ones
Issues
Does not work with "any" loss function, e.g. absolute loss
The decomposition is not purely additive, unlike the original one for squared loss
Spectrum of ensembles
[Diagram: Bagging, Boosting, and BMA placed along axes of asymmetry of weights and overfitting]
Open Issues concerning ensembles
What is the best way to construct ensembles?
No extensive comparison has been done
Computationally expensive
Not easily comprehensible
Bibliography
Overview: T. Dietterich; Bauer & Kohavi
Averaging: Domingos; Freund, Mansour & Schapire; Ali & Pazzani
Bias–Variance Decomposition: Kohavi & Wolpert; Domingos; Friedman; Kong & Dietterich