Data Dependence in Combining Classifiers

Data Dependence in Data Dependence in Combining ClassifiersCombining Classifiers

Mohamed KamelMohamed KamelPAMI LabPAMI Lab

University of WaterlooUniversity of Waterloo

IntroductionIntroductionData DependenceData Dependence

Implicit DependenceImplicit Dependence Explicit DependenceExplicit Dependence

Feature Based ArchitectureFeature Based Architecture Training AlgorithmTraining Algorithm

ResultsResultsConclusionsConclusions

OutlineOutline

MCS 2003MCS 2003 Data Dependence in Combining ClassifiersData Dependence in Combining Classifiers

IntroductionData Dependence

ImplicitExplicit

Feature BasedTraining

ResultsConclusions

IntroductionIntroduction

Pattern Recognition SystemsPattern Recognition Systems Best possible classification rates.Best possible classification rates. Increase efficiency and accuracy.Increase efficiency and accuracy.

Multiple Classifier SystemsMultiple Classifier Systems Empirical ObservationEmpirical Observation Problem decomposed naturally from using various Problem decomposed naturally from using various

sensorssensors Avoid making commitments to arbitrary initial Avoid making commitments to arbitrary initial

conditions or parametersconditions or parameters

““Patterns mis-classified by different classifiers are Patterns mis-classified by different classifiers are not necessarily the same”not necessarily the same”[Kittler et. al., 98][Kittler et. al., 98]

Introduction



ImplicitExplicit


ResultsConclusions

Categorization of MCSCategorization of MCS

ArchitectureArchitecture

Input/Output MappingInput/Output Mapping

RepresentationRepresentation

Specialized classifiersSpecialized classifiers

Introduction



ImplicitExplicit


ResultsConclusions

Categorization of MCS Categorization of MCS (cntd…)(cntd…)

ArchitectureArchitecture

Parallel Parallel [Dasarathy, 94][Dasarathy, 94]

Serial Serial [Dasarathy, 94][Dasarathy, 94]

Classifier 1

Classifier 2

Classifier N

FUSION

Input 1

OutputInput 2

Input N

Classifier 1 Classifier 2 Classifier NInput 1

Input 2 Input N

Output

Introduction



ImplicitExplicit


ResultsConclusions


Input/Output MappingInput/Output Mapping

Linear MappingLinear Mapping Sum RuleSum Rule Weighted AverageWeighted Average [Hashem 97][Hashem 97]

Non-linear Mapping Non-linear Mapping MaximumMaximum MajorityMajority Hierarchal Mixture of ExpertsHierarchal Mixture of Experts [Jordon and Jacobs 94][Jordon and Jacobs 94]

Stacked GeneralizationStacked Generalization [Wolpert 92][Wolpert 92]

Introduction



ImplicitExplicit


ResultsConclusions


RepresentationRepresentation

Similar representationsSimilar representations Classifiers need to be differentClassifiers need to be different

Different representationDifferent representation Use of different sensorsUse of different sensors Different features extracted from the same data setDifferent features extracted from the same data set

Introduction



ImplicitExplicit


ResultsConclusions


Specialized ClassifiersSpecialized Classifiers

Specialized classifiers Specialized classifiers Encourage specialization in areas of the feature spaceEncourage specialization in areas of the feature space All classifiers must contribute to achieve a final decision All classifiers must contribute to achieve a final decision Hierarchal Mixture of ExpertsHierarchal Mixture of Experts [Jordon and Jacobs 94][Jordon and Jacobs 94]

Co-operative Modular Neural NetworksCo-operative Modular Neural Networks [Auda and Kamel 98][Auda and Kamel 98]

Ensemble of classifiersEnsemble of classifiers Set of redundant classifiersSet of redundant classifiers

Introduction



ImplicitExplicit


ResultsConclusions


Data DependenceData Dependence Classifiers inherently dependent on the data.Classifiers inherently dependent on the data. Describe how the final aggregation uses the Describe how the final aggregation uses the

information present in the input pattern.information present in the input pattern. Describe the relationship between the final Describe the relationship between the final

output output Q(x)Q(x) and the pattern under and the pattern under classification classification xx

Introduction



ImplicitExplicit


ResultsConclusions

Data DependenceData Dependence

Data IndependentData Independent

Implicitly DependentImplicitly Dependent

Explicitly DependentExplicitly Dependent

Data Dependence



ImplicitExplicit


ResultsConclusions

Data IndependenceData Independence

Solely rely on output of classifiers to determine Solely rely on output of classifiers to determine final classification output.final classification output.

Q(x)Q(x) is the final class assigned for pattern is the final class assigned for pattern xx

CCjj is a vector composed of the output of the various is a vector composed of the output of the various

classifiers in the ensemble {classifiers in the ensemble {cc1j1j,c,c2j2j,...,c,...,cNjNj} } for a given for a given

classclass y yjj

ccijij is the confidence classifier is the confidence classifier ii has in pattern has in pattern xx

belonging to class belonging to class yyjj

Mapping Mapping FFjj can be linear or non-linearcan be linear or non-linear

Data Dependence



ImplicitExplicit


ResultsConclusions

Data Independence Data Independence (cntd…)(cntd…)

ExampleExample

Average VoteAverage Vote

Aggregation result only relies on the output confidences of the Aggregation result only relies on the output confidences of the classifiersclassifiers

The operator The operator FFjj is the summation operationis the summation operation Result skewed if individual confidences contain biasResult skewed if individual confidences contain bias Aggregation has no means of correcting this biasAggregation has no means of correcting this bias

Data Dependence



ImplicitExplicit


ResultsConclusions

Data Independence Data Independence (cntd…)(cntd…)

Simple voting techniques are data independentSimple voting techniques are data independent AverageAverage MaximumMaximum MajorityMajority

Susceptible to incorrect estimates of the confidenceSusceptible to incorrect estimates of the confidence

Data Dependence



ImplicitExplicit


ResultsConclusions

Implicit Data DependenceImplicit Data Dependence

Train the combiner on global performance of the Train the combiner on global performance of the datadata

W(C(x)) W(C(x)) is the weighting matrix composed of is the weighting matrix composed of elementselements w wijij

wwijij is the weight assigned to class is the weight assigned to class j j in classifierin classifier i i

Implicit



ImplicitExplicit


ResultsConclusions

Implicit Data DependenceImplicit Data Dependence (cntd…)(cntd…)

ExampleExample

Weighted AverageWeighted Average Based on the error correlation matrix the individual Based on the error correlation matrix the individual

weights are assigned asweights are assigned as

The weights are dependent on the behavior of the The weights are dependent on the behavior of the classifiers amongst themselvesclassifiers amongst themselves

Weights can be represented as the function Weights can be represented as the function W(CW(Cjj(x))(x))

Implicit



ImplicitExplicit


ResultsConclusions


ExampleExample

Weighted AverageWeighted Average Mapping is the summation operatorMapping is the summation operator Hence Weighted average fits in the Hence Weighted average fits in the

representationrepresentation

Implicit



ImplicitExplicit


ResultsConclusions


Implicitly data dependent approaches includeImplicitly data dependent approaches include Weighted average Weighted average [Hashem 97][Hashem 97]

Fuzzy Measures Fuzzy Measures [Gader 96][Gader 96]

Belief theory Belief theory [Xu and Krzyzak, 92][Xu and Krzyzak, 92]

Behavior Knowledge Space (BKS) Behavior Knowledge Space (BKS) [Huang, 95][Huang, 95]

Decision Templates Decision Templates [Kuncheva 01][Kuncheva 01]

Modular approaches Modular approaches [Auda and Kamel, 98][Auda and Kamel, 98]

Stacked Generalization Stacked Generalization [Wolpert 92][Wolpert 92]

Boosting Boosting [Schapire, 90][Schapire, 90]

Lacks consideration for local superiority of Lacks consideration for local superiority of classifiersclassifiers

Implicit



ImplicitExplicit


ResultsConclusions

Explicit Data DependenceExplicit Data Dependence

Classifier selection or combining performed Classifier selection or combining performed based on the sub-space which the input pattern based on the sub-space which the input pattern belongs to.belongs to.

Final classification is dependent on the pattern Final classification is dependent on the pattern being classified.being classified.

Explicit



ImplicitExplicit


ResultsConclusions

Explicit Data Dependence Explicit Data Dependence (cntd…)(cntd…)

ExampleExample

Dynamic Classifier Selection (DCS)Dynamic Classifier Selection (DCS) Estimation of the accuracy of each classifier in local Estimation of the accuracy of each classifier in local

regions of the feature spaceregions of the feature space Estimate determined by observing the input patternEstimate determined by observing the input pattern Once superiority of classifier is identified, it’s output is Once superiority of classifier is identified, it’s output is

used as the final decisionused as the final decision i.e. Binary weights are assigned based on the local i.e. Binary weights are assigned based on the local

superiority of the classifiers.superiority of the classifiers. Since weights are dependent on the input feature Since weights are dependent on the input feature

space they can be represented as space they can be represented as W(x)W(x) DCS could therefore be considered explicitly data DCS could therefore be considered explicitly data

dependent with the mapping Fdependent with the mapping F jj being the maximum being the maximum

operatoroperator

Explicit



ImplicitExplicit


ResultsConclusions

Explicit Data Dependence Explicit Data Dependence (cntd…)(cntd…)

Explicitly Data Dependent approach includeExplicitly Data Dependent approach include Dynamic Classifier Selection (DCS)Dynamic Classifier Selection (DCS)

DCS With local Accuracy (DCS_LA) DCS With local Accuracy (DCS_LA) [Woods et. al.,97][Woods et. al.,97]

DCS based on Multiple Classifier Behavior (DCS_MCB) DCS based on Multiple Classifier Behavior (DCS_MCB) [Giancinto and Roli, 01][Giancinto and Roli, 01]

Hierarchal Mixture of ExpertsHierarchal Mixture of Experts [Jordon and Jacobs 94][Jordon and Jacobs 94]

Feature-based approach Feature-based approach [Wanas et. al., 99][Wanas et. al., 99]

Weights demonstrate dependence on the input Weights demonstrate dependence on the input pattern.pattern.

Intuitively will perform better than other methodsIntuitively will perform better than other methods

Explicit



ImplicitExplicit


ResultsConclusions

Feature Based ArchitecturesFeature Based Architectures

Methodology to incorporate multiple classifiers in Methodology to incorporate multiple classifiers in a dynamically adapting systema dynamically adapting system

Aggregation adapts to the behavior of the Aggregation adapts to the behavior of the ensembleensemble

Detectors generate weights for each classifier that Detectors generate weights for each classifier that reflect the degree of confidence in each classifier for a reflect the degree of confidence in each classifier for a given inputgiven input

A trained aggregation learns to combine the different A trained aggregation learns to combine the different decisionsdecisions

Feature Based



ImplicitExplicit


ResultsConclusions

Feature Based ArchitecturesFeature Based Architectures (cntd…)(cntd…)

Architecture IArchitecture I

Classifier 1

Classifier 2

Classifier N

FusionClassifier

INPUT

FINALDECISION

Detector

Feature Based



ImplicitExplicit


ResultsConclusions


ClassifiersClassifiers Each individual classifier, CEach individual classifier, C ii, produces some output , produces some output

representing its interpretation of the input representing its interpretation of the input xx Utilizing sub-optimal classifiers.Utilizing sub-optimal classifiers. The collection of classifier outputs for class The collection of classifier outputs for class yyjj is is

represented as represented as CCjj(x)(x)

DetectorDetector Detector Detector DDl l is a classifier that uses input features to is a classifier that uses input features to

extract useful information for aggregationextract useful information for aggregation Doesn’t aim to solve the classification problem.Doesn’t aim to solve the classification problem. Detector output Detector output ddlglg(x)(x) is a probablilty that the input is a probablilty that the input

pattern x is categorized to group pattern x is categorized to group gg.. The output of all the detectors is represented by The output of all the detectors is represented by D(X)D(X)

Feature Based



ImplicitExplicit


ResultsConclusions


AggregationAggregation Fusion layer for all the classifiersFusion layer for all the classifiers Trained to adapt to the behavior of the Trained to adapt to the behavior of the

various modulesvarious modules Explicit data dependentExplicit data dependent

Weights dependent on the input pattern Weights dependent on the input pattern being classifiedbeing classified

Feature Based



ImplicitExplicit


ResultsConclusions


Architecture IIArchitecture II

Classifier 1

Classifier 2

Classifier N

FusionClassifier

INPUT

FINALDECISION

Detector

Feature Based



ImplicitExplicit


ResultsConclusions


ClassifiersClassifiers Each individual classifier, CEach individual classifier, C ii, produces some output , produces some output

representing its interpretation of the input representing its interpretation of the input xx Utilizing sub-optimal classifiers.Utilizing sub-optimal classifiers. The collection of classifier outputs for class The collection of classifier outputs for class yyjj is is

represented as represented as CCjj(x)(x)

DetectorDetector Appends input to output of classifier ensemble.Appends input to output of classifier ensemble. Produces a weighting factor, Produces a weighting factor, wwijij ,for each class in a ,for each class in a

classifier output.classifier output. The dependence of the weights on both the classifier The dependence of the weights on both the classifier

output and the input pattern is represented by output and the input pattern is represented by W(x,CW(x,Cjj (x)) (x))

Feature Based



ImplicitExplicit


ResultsConclusions


AggregationAggregation Fusion layer for all the classifiersFusion layer for all the classifiers Trained to adapt to the behavior of the various Trained to adapt to the behavior of the various

modulesmodules Combines implicit and explicit data dependenceCombines implicit and explicit data dependence

Weights dependent on the input pattern and the Weights dependent on the input pattern and the performance of the classifiers.performance of the classifiers.

Feature Based



ImplicitExplicit


ResultsConclusions

ResultsResults

Five one-hidden layer BP classifiersFive one-hidden layer BP classifiers

Training used partially disjoint data setsTraining used partially disjoint data sets

No optimization is performed for the trained No optimization is performed for the trained networksnetworks

The parameters of all the networks are The parameters of all the networks are maintained for all the classifiers that are trainedmaintained for all the classifiers that are trained

Three data setsThree data sets 20 Class Gaussian20 Class Gaussian SatimagesSatimages Clouds dataClouds data

Results



ImplicitExplicit


ResultsConclusions

Results Results (cntd…)(cntd…)

Data SetData Set 20 Class20 Class CloudsClouds SatimagesSatimages

SinglenetSinglenet 13.82 13.82 1.16 1.16 10.92 10.92 0.08 0.08 14.06 14.06 1.33 1.33

OracleOracle 7.29 7.29 1.06 1.06 7.41 7.41 0.16 0.16 7.20 7.20 0.36 0.36

Data Dependent ApproachesData Dependent Approaches

MaximumMaximum 12.92 12.92 0.35 0.35 10.68 10.68 0.04 0.04 13.61 13.61 0.21 0.21

MajorityMajority 13.13 13.13 0.36 0.36 10.71 10.71 0.02 0.02 13.40 13.40 0.16 0.16

AverageAverage 12.83 12.83 0.26 0.26 10.66 10.66 0.04 0.04 13.23 13.23 0.22 0.22

BordaBorda 13.04 13.04 0.30 0.30 10.71 10.71 0.02 0.02 13.77 13.77 0.20 0.20

Implicitly Data Dependent ApproachesImplicitly Data Dependent Approaches

Weighted Avg.Weighted Avg. 12.57 12.57 0.20 0.20 10.59 10.59 0.05 0.05 13.14 13.14 0.21 0.21

BayesianBayesian 12.48 12.48 0.21 0.21 10.71 10.71 0.02 0.02 13.51 13.51 0.16 0.16

Fuzzy IntegralFuzzy Integral 12.95 12.95 0.34 0.34 10.67 10.67 0.05 0.05 13.71 13.71 0.19 0.19

Explicit Data DependentExplicit Data Dependent

Feature-basedFeature-based 8.64 8.64 0.60 0.60 10.28 10.28 0.10 0.10 12.48 12.48 0.19 0.19

Results



ImplicitExplicit


ResultsConclusions

TrainingTraining

Training each component independentlyTraining each component independently Optimize individual components, may not Optimize individual components, may not

lead to overall improvementlead to overall improvement Collinearity, high correlation between Collinearity, high correlation between

classifiersclassifiers Components, under-trained or over-trainedComponents, under-trained or over-trained

Training



ImplicitExplicit


ResultsConclusions

Training Training (cntd…)(cntd…)

Adaptive trainingAdaptive training Selective: Selective: Reducing correlation between Reducing correlation between

componentscomponents Focused: Focused: Re-training focuses on misclassified Re-training focuses on misclassified

patterns.patterns. Efficient: Efficient: Determined the duration of trainingDetermined the duration of training

Training



ImplicitExplicit


ResultsConclusions

Adaptive Training: Main loopAdaptive Training: Main loop

Increase diversity among Increase diversity among ensembleensemble

Incremental learningIncremental learning

Evaluation of training to Evaluation of training to determine the re-training setdetermine the re-training set

Initialize

DONE = TRUE

Train

Evaluate andcomposetraining

END

START

YES

NO

Training



ImplicitExplicit


ResultsConclusions

Adaptive Training: TrainingAdaptive Training: Training

Save classifier if it performs Save classifier if it performs well on the evaluation setwell on the evaluation set

Determine when to Determine when to terminate training for each terminate training for each modulemodule

i k

DONEi = TRUE

Train CiEvaluate Ci

CFi > CFi_best

Save CiCFi_best = CFi

CFi-CFi-1 <

DONEi=TRUE

i=i+1

No

YES

YES

YES

NO

No

NO

Yes

Training



ImplicitExplicit


ResultsConclusions

Adaptive Training: EvaluationAdaptive Training: Evaluation

Train aggregation modulesTrain aggregation modules

Evaluate training sets for Evaluate training sets for each classifiereach classifier

Compose new training dataCompose new training data i kEvaluate

System onTraini

DONEi=TRUE i

DONE=TRUE

TrainAggregation

Select newTraining data

YES

NO

YES

Training



ImplicitExplicit


ResultsConclusions

Adaptive Training: Data SelectionAdaptive Training: Data Selection

New training data are composed by New training data are composed by concatenatingconcatenating

ErrorErrorii: Misclassified entries of training : Misclassified entries of training

data for classifier data for classifier i.i. CorrectCorrectii: Random choice of : Random choice of R*(P*R*(P*δδ_i)_i)

correctly classified entries of the correctly classified entries of the training data for classifier training data for classifier i.i.

Training



ImplicitExplicit


ResultsConclusions

ResultsResults

Five one-hidden layer BP classifiersFive one-hidden layer BP classifiers

Training used partially disjoint data setsTraining used partially disjoint data sets

No optimization is performed for the trained No optimization is performed for the trained networksnetworks

The parameters of all the networks are The parameters of all the networks are maintained for all the classifiers that are trainedmaintained for all the classifiers that are trained

Three data setsThree data sets 20 Class Gaussian20 Class Gaussian SatimagesSatimages Clouds dataClouds data

Results



ImplicitExplicit


ResultsConclusions

Results Results (cntd…)(cntd…)

Data SetData Set 20 Class20 Class CloudsClouds SatimagesSatimages

SinglenetSinglenet 13.82 13.82 1.16 1.16 10.92 10.92 0.08 0.08 14.06 14.06 1.33 1.33

Normal TrainingNormal Training

Best ClassifierBest Classifier 14.03 14.03 0.64 0.64 11.00 11.00 0.09 0.09 14.72 14.72 0.43 0.43

OracleOracle 7.29 7.29 1.06 1.06 7.41 7.41 0.16 0.16 7.20 7.20 0.36 0.36

Feature BasedFeature Based 8.64 8.64 0.60 0.60 10.28 10.28 0.10 0.10 12.48 12.48 0.19 0.19

Ensemble Trained Adaptively using WA as the evaluation functionEnsemble Trained Adaptively using WA as the evaluation function


OracleOracle 6.79 6.79 2.30 2.30 5.73 5.73 0.11 0.11 5.58 5.58 0.17 0.17


Feature Based Architecture Trained AdaptivelyFeature Based Architecture Trained Adaptively


OracleOracle 5.42 5.42 1.30 1.30 5.43 5.43 0.11 0.11 5.48 5.48 0.18 0.18


Results



ImplicitExplicit


ResultsConclusions

ConclusionsConclusions

Categorization of various combining Categorization of various combining approaches based on data dependenceapproaches based on data dependence

Independent : vulnerable to incorrect Independent : vulnerable to incorrect confidence estimatesconfidence estimates

implicitly dependent: doesn’t take into implicitly dependent: doesn’t take into account local superiority of classifiersaccount local superiority of classifiers

Explicitly dependent: Literature focuses on Explicitly dependent: Literature focuses on selection not combiningselection not combining

Conclusions



ImplicitExplicit


ResultsConclusions

Conclusions Conclusions (cntd…)(cntd…)

Feature-based approachFeature-based approach Combines implicit and explicit data dependenceCombines implicit and explicit data dependence Uses an Evolving training algorithm to enhance Uses an Evolving training algorithm to enhance

diversity amongst classifiersdiversity amongst classifiers Reduces harmful correlationReduces harmful correlation Determines duration of trainingDetermines duration of training Improved classification accuracyImproved classification accuracy

Conclusions



ImplicitExplicit


ResultsConclusions

ReferencesReferences[Kittler et. al., 98] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On Combining Classifiers”, IEEE Trans. [Kittler et. al., 98] J. Kittler, M. Hatef, R. Duin, and J. Matas, “On Combining Classifiers”, IEEE Trans.

PAMI, 20:3, 226-239, 1998.PAMI, 20:3, 226-239, 1998.[Dasarthy, 94] B. Dasarthy, “Decision Fusion”, IEEE Computer Soc. Press, 1994.[Dasarthy, 94] B. Dasarthy, “Decision Fusion”, IEEE Computer Soc. Press, 1994.[Hashem, 1997] S. Hashem, “Algorthims for Optimal Linear Combination of Neural Networks” Int. Conf. [Hashem, 1997] S. Hashem, “Algorthims for Optimal Linear Combination of Neural Networks” Int. Conf.

on Neural Networks, Vol 1, 242-247, 1997.on Neural Networks, Vol 1, 242-247, 1997.[Jordon and Jacob, 94] M. Jordon, and R. Jacobs, “Hierarchical Mixture of Experts and the EM [Jordon and Jacob, 94] M. Jordon, and R. Jacobs, “Hierarchical Mixture of Experts and the EM

Algorithm”, Neural Computing, 181-214, 1994.Algorithm”, Neural Computing, 181-214, 1994.[Wolpert, 92] D. Wolpert, “Stacked Generalization”, Neural Networks, Vol 5, 241-259, 1992[Wolpert, 92] D. Wolpert, “Stacked Generalization”, Neural Networks, Vol 5, 241-259, 1992[Auda and Kamel, 98] [Auda and Kamel, 98] G. Auda and M. Kamel, “Modular Neural Network Classifiers: A Comparative

Study”, J. Int. Rob. Sys., Vol. 21, 117–129, 1998.[Gader et. al., 96] [Gader et. al., 96] P. Gader, M. Mohamed, and J. Keller, “Fusion of Handwritten Word Classifiers”, Patt.

Reco. Let.,17(6), 577–584, 1996.[Xu et. al., 92] L. Xu, A. Kazyzak, C. Suen, “Methods of Combining Multiple Classifiers and their [Xu et. al., 92] L. Xu, A. Kazyzak, C. Suen, “Methods of Combining Multiple Classifiers and their

Applications to Handwritten Recognition”, IEEE Sys. Man and Cyb., 22(3), 418-435, 1992Applications to Handwritten Recognition”, IEEE Sys. Man and Cyb., 22(3), 418-435, 1992[Kuncheva et. al., 01] [Kuncheva et. al., 01] L. Kuncheva, J. Bezdek, and R. Duin, “Decsion Templates for Multiple Classifier

Fusion: An Experimental Comparison”, Patt. Reco., vol. 34, 299–314, 2001.[Huang et. al., 95] [Huang et. al., 95] Y. Huang, K. Liu, and C. Suen, “The Combination of Multiple Classifiers by a Neural

Network Approach”, J. Patt. Reco. and Art. Int., Vol. 9, 579–597, 1995.[Schapire, 90] [Schapire, 90] R. Schapire, “The Strength of Weak Learnability”, Mach. Lear., Vol. 5, 197–227,1990.[Giancinto and Roli, 01] G. Giancinto and F. Roli, “Dynamic Classifier Selection based on Multiple

Classifier Behavior”, Patt. Reco., Vol. 34, 1879-1881, 2001.[Wanas et., al., 99] N. Wanas, M. Kamel, G. Auda, and F. Karray, “Feature Based Decision Aggregation

in Modular Neural Network Classifiers”, Patt. Reco. Lett., 20(11-13), 1353-1359, 1999.

Documents

Data Dependence in Combining Classifiers