An Experimental Study about Simple Decision Trees for
Bagging Ensemble on Datasets with Classification Noise
Joaquín Abellán and Andrés R. Masegosa
Department of Computer Science and Artificial Intelligence
University of Granada
Verona, July 2009
10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty
Part I
Introduction
Ensembles of Decision Trees (DT)
Features
They usually build different DTs from different samples of the training dataset.
The final prediction is a combination of the individual predictions of each tree.
They take advantage of the inherent instability of DTs.
Bagging, AdaBoost and Randomization are the best-known approaches.
Classification Noise (CN) in the class values
Definition
The class values of the samples given to the learning algorithm contain some errors.
Random classification noise: the label of each example is flipped randomly and independently with some fixed probability, called the noise rate.
Causes
It is mainly due to errors in the data capture process.
Very common in real-world applications: surveys, biological or medical information...
Effects on ensembles of decision trees
The presence of classification noise degrades the performance of any classification inducer.
AdaBoost is known to be strongly affected by the presence of classification noise.
Bagging is the ensemble approach with the best response to classification noise.
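To make the noise model concrete, here is a minimal NumPy sketch of random classification noise injection, of the kind applied to the training data in the experiments later on. The function name is ours, and since the slides do not fix how a flipped label is reassigned in the multi-class case, we assume a different class value chosen uniformly at random.

    import numpy as np

    def add_classification_noise(y, noise_rate, rng=None):
        # Flip each label independently with probability noise_rate
        # (the "noise rate" of the random classification noise model).
        rng = np.random.default_rng(rng)
        y = np.asarray(y).copy()
        classes = np.unique(y)
        for i in np.flatnonzero(rng.random(len(y)) < noise_rate):
            # Assumption: a flipped label becomes a *different* class,
            # chosen uniformly at random.
            y[i] = rng.choice(classes[classes != y[i]])
        return y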
Motivation of this study
Description
Decision trees built with different split criteria are considered in a Bagging scheme.
Common split criteria (InfoGain, InfoGain Ratio and Gini Index) and a new split criterion based on imprecise probabilities are analyzed.
Analyze which split criterion is the most robust to the presence of classification noise.
Outline
Description of the different split criteria.
Bagging Decision Trees
Experimental Results.
Conclusions and Future Work.
Part II
Split Criteria
Decision Trees
Description
Attributes are placed at the nodes.
Class values are placed at the leaves.
Each leaf corresponds to a decision rule.
Learning
The split criterion selects the attribute to place at each branching node.
The stop criterion decides when to fix a leaf and stop branching.
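To fix ideas, a minimal ID3-style sketch of this recursive build loop. It is not the authors' implementation: it assumes discrete attributes, rows encoded as dicts, and a leaf represented by its majority class; split_score can be any of the criteria defined next.

    from collections import Counter

    def build_tree(rows, attributes, target, split_score, min_score=1e-6):
        # rows: list of dicts; target: key holding the class value;
        # split_score(rows, attr, target): any split criterion, higher is better.
        labels = [r[target] for r in rows]
        majority = Counter(labels).most_common(1)[0][0]
        # Stop criterion: pure node, no attributes left, or no useful split.
        if len(set(labels)) == 1 or not attributes:
            return majority                       # a leaf is a bare class value
        best = max(attributes, key=lambda a: split_score(rows, a, target))
        if split_score(rows, best, target) < min_score:
            return majority
        children = {}
        for value in {r[best] for r in rows}:     # one branch per observed value
            subset = [r for r in rows if r[best] == value]
            rest = [a for a in attributes if a != best]
            children[value] = build_tree(subset, rest, target, split_score)
        return (best, children, majority)         # internal branching node

    def classify(tree, row):
        # Follow branches until a leaf; unseen values fall back to the majority.
        while isinstance(tree, tuple):
            attr, children, majority = tree
            tree = children.get(row[attr], majority)
        return tree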
Classic Split Criteria
Description
A real-valued function which measures the goodness of an attribute X as a split node in the decision tree.
A local measure that allows the decision tree to be built recursively.
Information Gain (IG)
Introduced by Quinlan as the basis of his ID3 model [18].
It is based on Shannon’s entropy.
$IG(X, C) = H(C) - H(C|X) = \sum_i \sum_j p(c_j, x_i) \log \frac{p(c_j, x_i)}{p(c_j)\, p(x_i)}$
Tendency to select attributes with a high number of states.
Information Gain Ratio (IGR)
An improved version of IG (Quinlan's C4.5 tree inducer [19]).
Normalizes the information gain by dividing it by the entropy of the split attribute.
$IGR(X, C) = \frac{IG(X, C)}{H(X)}$
Penalizes attributes with many states.
Gini Index (GIx)
Measures the impurity degree of a partition.
Introduced by Breiman as the basis of the CART tree inducer [8].
$GIx(X, C) = gini(C) - gini(C|X)$

$gini(C|X) = \sum_i p(x_i)\, gini(C|X = x_i), \qquad gini(C) = 1 - \sum_j p^2(c_j)$
Tendency to select attributes with a high number of states.
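The three classic criteria are easy to state in code. Below is a minimal sketch under the same assumptions as the tree-building sketch above (rows as dicts, discrete attributes); each function is plug-compatible with build_tree's split_score argument.

    import math
    from collections import Counter

    def entropy(values):
        # Shannon entropy of a list of class values.
        n = len(values)
        return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

    def gini_impurity(values):
        # gini(C) = 1 - sum_j p^2(c_j)
        n = len(values)
        return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

    def conditional(measure, rows, attr, target):
        # sum_i p(x_i) * measure(C | X = x_i)
        n, total = len(rows), 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            total += len(subset) / n * measure(subset)
        return total

    def info_gain(rows, attr, target):
        # IG(X, C) = H(C) - H(C|X)
        return entropy([r[target] for r in rows]) - conditional(entropy, rows, attr, target)

    def info_gain_ratio(rows, attr, target):
        # IGR(X, C) = IG(X, C) / H(X)
        h_x = entropy([r[attr] for r in rows])
        return info_gain(rows, attr, target) / h_x if h_x > 0 else 0.0

    def gini_index(rows, attr, target):
        # GIx(X, C) = gini(C) - gini(C|X)
        g_c = gini_impurity([r[target] for r in rows])
        return g_c - conditional(gini_impurity, rows, attr, target)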
Split Criteria based on Imprecise Probabilities
Imprecise Information Gain (IIG) [3]
It is based on an uncertainty measure for convex sets of probability distributions.
Probability intervals for each state of the class variable are computed from the dataset using Walley's Imprecise Dirichlet Model (IDM) [24].
$p(c_j) \in \left[ \frac{n_{c_j}}{N + s},\; \frac{n_{c_j} + s}{N + s} \right] \equiv I_{c_j}, \qquad p(c_j \mid x_i) \in \left[ \frac{n_{c_j, x_i}}{N_{x_i} + s},\; \frac{n_{c_j, x_i} + s}{N_{x_i} + s} \right] \equiv I_{c_j, x_i}$
Let $K(C)$ and $K(C \mid X = x_i)$ denote the following sets of probability distributions $q$ on $\Omega_C$:

$K(C) = \{ q \mid q(c_j) \in I_{c_j} \}, \qquad K(C \mid X = x_i) = \{ q \mid q(c_j) \in I_{c_j, x_i} \}$
Imprecise Info-Gain for each variable X is defined as:
$IIG(X, C) = S(K(C)) - \sum_i p(x_i)\, S(K(C \mid X = x_i))$
where S() is the maximum entropy function of a convex set.
It can be efficiently computed for s=1 [1].
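Below is a sketch of the s = 1 special case, following the max-entropy procedure of [1]: start from the IDM lower bounds and assign the free mass s/(N + s) to the least-frequent class values, split equally among ties. Helper names are ours, and this is a reading of the cited algorithm rather than the authors' code.

    import math

    def max_entropy_idm(counts, s=1):
        # S(K(C)) for the IDM credal set; this closed-form rule is the
        # efficient s = 1 case referenced above [1].
        n = sum(counts)
        probs = [c / (n + s) for c in counts]        # IDM lower bounds
        ties = [j for j, c in enumerate(counts) if c == min(counts)]
        for j in ties:                               # free mass s/(N+s) goes to
            probs[j] += s / ((n + s) * len(ties))    # the least-frequent classes
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def imprecise_info_gain(rows, attr, target, classes, s=1):
        # IIG(X, C) = S(K(C)) - sum_i p(x_i) S(K(C | X = x_i))
        n = len(rows)
        counts = lambda rs: [sum(r[target] == c for r in rs) for c in classes]
        result = max_entropy_idm(counts(rows), s)
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            result -= len(subset) / n * max_entropy_idm(counts(subset), s)
        return result

With the class values fixed in advance, e.g. split_score = lambda rows, a, t: imprecise_info_gain(rows, a, t, classes) plugs straight into the build_tree sketch above.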
Part III
Bagging Decision Trees
Procedure
Samples $T_i$ are generated by random sampling with replacement from the initial training dataset.
From each sample $T_i$, a simple decision tree is built using a given split criterion.
The final prediction is made by majority voting.
Description
As Breiman [9] said about Bagging: "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then Bagging can improve accuracy."
The combination of multiple models reduces the overfitting of the single decision trees to the dataset.
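A minimal sketch of this Bagging scheme follows. We let scikit-learn's DecisionTreeClassifier stand in for the simple trees (it offers the 'entropy' and 'gini' criteria, though not IGR or IIG), and we assume integer-coded class labels for the voting step.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class BaggingTrees:
        def __init__(self, n_trees=100, criterion="entropy", seed=0):
            self.n_trees, self.criterion, self.seed = n_trees, criterion, seed

        def fit(self, X, y):
            rng = np.random.default_rng(self.seed)
            self.trees_ = []
            for _ in range(self.n_trees):
                # Bootstrap sample T_i: random sampling with replacement.
                idx = rng.integers(0, len(X), size=len(X))
                tree = DecisionTreeClassifier(criterion=self.criterion)
                self.trees_.append(tree.fit(X[idx], y[idx]))
            return self

        def predict(self, X):
            # Majority voting over the individual predictions of each tree
            # (class labels assumed integer-coded for np.bincount).
            votes = np.stack([t.predict(X) for t in self.trees_])
            return np.array([np.bincount(col).argmax() for col in votes.T])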
Part IV
Experiments
Experimental Set-up
Datasets Benchmark
25 UCI datasets with very different features.
Missing values were replaced with the mean and mode for continuous and discrete attributes, respectively.
Continuous attributes were discretized with Fayyad & Irani’s method [13].
Preprocessing was carried out using information from the training datasets only.
Evaluated Algorithms
Bagging ensembles of 100 trees.
Different split criteria: IG, IGR, GIx and IIG.
Evaluation Method
Different noise rates were applied to the training datasets (not to the test datasets): 0%, 5%, 10%, 20% and 30%.
10-fold cross-validation repeated 10 times was used to estimate the classification accuracy.
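Putting the pieces together, a sketch of this evaluation protocol. It reuses add_classification_noise from the earlier sketch, and RepeatedStratifiedKFold is our choice for the repeated 10-fold splits; the fold-generation details of the original study are not specified here.

    import numpy as np
    from sklearn.model_selection import RepeatedStratifiedKFold

    def noisy_cv_accuracy(make_model, X, y, noise_rate, seed=0):
        # 10-fold cross-validation repeated 10 times; noise is injected
        # into the training folds only, never into the test folds.
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
        accuracies = []
        for train, test in cv.split(X, y):
            # add_classification_noise is the sketch from Part I.
            y_train = add_classification_noise(y[train], noise_rate, rng=seed)
            model = make_model().fit(X[train], y_train)
            accuracies.append(np.mean(model.predict(X[test]) == y[test]))
        return float(np.mean(accuracies))

    # e.g. noisy_cv_accuracy(lambda: BaggingTrees(100), X, y, noise_rate=0.10)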
Statistical Tests
Two classifiers on a single dataset
Corrected Paired T-test [26]: a corrected version of the paired T-test implemented in Weka.
Two classifiers on multiple datasets
Wilcoxon Signed-Ranks Test [25]: a non-parametric test which ranks the differences in each dataset.
Sign Test [20,22]: a binomial test that counts the number of wins, losses and ties across the datasets.
Multiple classifiers on multiple datasets
Friedman Test [15,16]: a non-parametric test that ranks the algorithms for each dataset: the best one gets rank 1, the second one rank 2, and so on. The null hypothesis is that all algorithms perform equally well.
Nemenyi Test [17]: a post-hoc test employed to compare the algorithms among themselves when the null hypothesis of the Friedman test is rejected.
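For reference, the multi-dataset tests are readily available in SciPy; below is a toy sketch with purely illustrative accuracy numbers (one value per dataset and method). The Nemenyi post-hoc is not in SciPy itself but is provided by third-party packages such as scikit-posthocs.

    import numpy as np
    from scipy import stats

    # One accuracy per dataset and method (toy numbers, purely illustrative).
    acc = {"IIG": np.array([0.83, 0.79, 0.91, 0.75, 0.88]),
           "IG":  np.array([0.80, 0.76, 0.90, 0.70, 0.85]),
           "IGR": np.array([0.82, 0.78, 0.90, 0.73, 0.86]),
           "GIx": np.array([0.80, 0.75, 0.89, 0.71, 0.84])}

    # Two classifiers on multiple datasets: Wilcoxon signed-ranks test.
    print(stats.wilcoxon(acc["IIG"], acc["IG"]))

    # Sign test: a binomial test on the number of wins (ties are dropped).
    wins = int(np.sum(acc["IIG"] > acc["IG"]))
    decided = int(np.sum(acc["IIG"] != acc["IG"]))
    print(stats.binomtest(wins, decided, p=0.5))

    # Multiple classifiers on multiple datasets: Friedman test.
    print(stats.friedmanchisquare(*acc.values()))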
Average Performance
Analysis
The average accuracy is similar when no noise is introduced.
The introduction of noise deteriorates the performance of classifiers.
But IIG is more robust to noise: its average performance is higher at every noise level.
Corrected Paired T-Test at 0.05 level
Number of accumulated Wins, Ties and Defeats (W/T/D) of IIG with respect to IG, IGR and GIx on the 25 datasets.
Noise   IG        IGR       GIx
0%      2/22/1    1/23/1    2/22/1
5%      11/14/0   10/15/0   11/14/0
10%     13/12/0   10/15/0   13/12/0
20%     16/9/0    11/14/0   18/7/0
30%     17/8/0    11/14/0   17/8/0
Analysis
Without noise, there is a tie in almost all datasets.
The more noise is added, the higher the number of wins.
IIG wins on a high number of datasets and is not defeated on any of them.
Wilcoxon and Sign Tests at 0.05 level
Comparison of IIG with respect to the other split criteria. '-' indicates no statistically significant difference.
        Wilcoxon Test          Sign Test
Noise   IG    IGR   GIx        IG    IGR   GIx
0%      IIG   -     IIG        IIG   -     IIG
5%      IIG   IIG   IIG        IIG   IIG   IIG
10%     IIG   IIG   IIG        IIG   IIG   IIG
20%     IIG   IIG   IIG        IIG   IIG   IIG
30%     IIG   IIG   IIG        IIG   IIG   IIG
Analysis
Without noise, IIG outperforms IG and GIx, but not IGR.
With any level of noise, IIG outperforms the rest of the split criteria.
IGR also outperforms IG and GIx when some level of noise is present.
Friedman Test at 0.05 level
The ranks assigned by the Friedman test are shown below.
The lower the rank, the better the performance.
Ranks in bold face indicate the criteria that IIG statistically outperforms according to the Nemenyi test.
Noise   IIG    IG     IGR    GIx
0%      1.86   2.92   2.52   2.70
5%      1.18   3.18   2.54   3.12
10%     1.12   3.26   2.36   3.26
20%     1.12   3.20   2.16   3.52
30%     1.12   3.36   2.26   3.26
Analysis
Without noise, IIG has the best ranking and outperforms IG.
With a noise level higher than 10%, IIG outperforms the rest.
IGR also outperforms IG and GIx when the noise level is higher than 20%.
Computational Time
Analysis
Without noise, all split criteria have similar average running times.
The introduction of noise deteriorates the computational performance of the classifiers.
IIG and GIx consume less time than the other split criteria; IGR is the most time-consuming.
Part V
Conclusions and Future Work
Conclusions
An experimental study of the performance of different split criteria in a Bagging scheme under classification noise.
Three classic split criteria were considered: IG, IGR and GIx; and a new one based on imprecise probabilities: IIG.
Bagging with IIG is clearly more robust than the other criteria as the noise level increases.
IGR also performs well under noise, but worse than IIG.
Future Work
Extend the methods to continuous attributes and missing values.
Further investigate the computational cost of the models, as well as other factors such as the number of trees, pruning...
Introduce new imprecise models.
Thanks for your attention!!
Questions?