An Experimental Study about Simple Decision Trees for
Bagging Ensemble on Datasets with Classification Noise
Joaquín Abellán and Andrés R. Masegosa
Department of Computer Science and Artificial Intelligence
University of Granada
Verona, July 2009
10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty
Part I
Introduction
Ensembles of Decision Trees (DT)
Features
They usually build different DTs from different samples of the training dataset.
The final prediction is a combination of the individual predictions of each tree.
They take advantage of the inherent instability of DTs.
Bagging, AdaBoost and Randomization are the best-known approaches.
Classification Noise (CN) in the class values
Definition
The class values of the samples given to the learning algorithm contain some errors.
Random classification noise: the label of each example is flipped randomly and independently with some fixed probability, called the noise rate.
Causes
It is mainly due to errors in the data capture process.
Very common in real-world applications: surveys, biological or medical information...
Effects on ensembles of decision trees
The presence of classification noise degrades the performance of any classification inducer.
AdaBoost is known to be strongly affected by the presence of classification noise.
Bagging is the ensemble approach with the best response to classification noise.
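To make the noise model concrete, here is a minimal NumPy sketch of random classification noise injection, of the kind applied to the training data in the experiments later on. The function name is ours, and since the slides do not fix how a flipped label is reassigned in the multi-class case, we assume a different class value chosen uniformly at random.

    import numpy as np

    def add_classification_noise(y, noise_rate, rng=None):
        # Flip each label independently with probability noise_rate
        # (the "noise rate" of the random classification noise model).
        rng = np.random.default_rng(rng)
        y = np.asarray(y).copy()
        classes = np.unique(y)
        for i in np.flatnonzero(rng.random(len(y)) < noise_rate):
            # Assumption: a flipped label becomes a *different* class,
            # chosen uniformly at random.
            y[i] = rng.choice(classes[classes != y[i]])
        return y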
Motivation of this study
Description
Decision trees built with different split criteria are considered in a Bagging scheme.
Common split criteria (InfoGain, InfoGain Ratio and Gini Index) and a new split criterion based on imprecise probabilities are analyzed.
Analyze which split criterion is the most robust to the presence of classification noise.
Outline
Description of the different split criteria.
Bagging Decision Trees
Experimental Results.
Conclusions and Future Work.
Part II
Split Criteria
Decision Trees
Description
Attributes are placed at the nodes.
Class values are placed at the leaves.
Each leaf corresponds to a decision rule.
Learning
The split criterion selects the attribute to place at each branching node.
The stop criterion decides when to fix a leaf and stop branching.
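To fix ideas, a minimal ID3-style sketch of this recursive build loop. It is not the authors' implementation: it assumes discrete attributes, rows encoded as dicts, and a leaf represented by its majority class; split_score can be any of the criteria defined next.

    from collections import Counter

    def build_tree(rows, attributes, target, split_score, min_score=1e-6):
        # rows: list of dicts; target: key holding the class value;
        # split_score(rows, attr, target): any split criterion, higher is better.
        labels = [r[target] for r in rows]
        majority = Counter(labels).most_common(1)[0][0]
        # Stop criterion: pure node, no attributes left, or no useful split.
        if len(set(labels)) == 1 or not attributes:
            return majority                       # a leaf is a bare class value
        best = max(attributes, key=lambda a: split_score(rows, a, target))
        if split_score(rows, best, target) < min_score:
            return majority
        children = {}
        for value in {r[best] for r in rows}:     # one branch per observed value
            subset = [r for r in rows if r[best] == value]
            rest = [a for a in attributes if a != best]
            children[value] = build_tree(subset, rest, target, split_score)
        return (best, children, majority)         # internal branching node

    def classify(tree, row):
        # Follow branches until a leaf; unseen values fall back to the majority.
        while isinstance(tree, tuple):
            attr, children, majority = tree
            tree = children.get(row[attr], majority)
        return tree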
Classic Split Criteria
Description
A real-valued function which measures the goodness of an attribute X as a split node in the decision tree.
A local measure that allows the decision tree to be built recursively.
Information Gain (IG)
Introduced by Quinlan as the basis of his ID3 model [18].
It is based on Shannon’s entropy.
$IG(X, C) = H(C) - H(C|X) = \sum_i \sum_j p(c_j, x_i) \log \frac{p(c_j, x_i)}{p(c_j)\, p(x_i)}$
Tendency to select attributes with a high number of states.
Information Gain Ratio (IGR)
An improved version of IG (Quinlan's C4.5 tree inducer [19]).
Normalizes the information gain by dividing it by the entropy of the split attribute.
$IGR(X, C) = \frac{IG(X, C)}{H(X)}$
Penalizes attributes with many states.
Gini Index (GIx)
Measures the impurity degree of a partition.
Introduced by Breiman as the basis of the CART tree inducer [8].
$GIx(X, C) = gini(C) - gini(C|X)$

$gini(C|X) = \sum_i p(x_i)\, gini(C|X = x_i), \qquad gini(C) = 1 - \sum_j p^2(c_j)$
Tendency to select attributes with a high number of states.
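The three classic criteria are easy to state in code. Below is a minimal sketch under the same assumptions as the tree-building sketch above (rows as dicts, discrete attributes); each function is plug-compatible with build_tree's split_score argument.

    import math
    from collections import Counter

    def entropy(values):
        # Shannon entropy of a list of class values.
        n = len(values)
        return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

    def gini_impurity(values):
        # gini(C) = 1 - sum_j p^2(c_j)
        n = len(values)
        return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())

    def conditional(measure, rows, attr, target):
        # sum_i p(x_i) * measure(C | X = x_i)
        n, total = len(rows), 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            total += len(subset) / n * measure(subset)
        return total

    def info_gain(rows, attr, target):
        # IG(X, C) = H(C) - H(C|X)
        return entropy([r[target] for r in rows]) - conditional(entropy, rows, attr, target)

    def info_gain_ratio(rows, attr, target):
        # IGR(X, C) = IG(X, C) / H(X)
        h_x = entropy([r[attr] for r in rows])
        return info_gain(rows, attr, target) / h_x if h_x > 0 else 0.0

    def gini_index(rows, attr, target):
        # GIx(X, C) = gini(C) - gini(C|X)
        g_c = gini_impurity([r[target] for r in rows])
        return g_c - conditional(gini_impurity, rows, attr, target)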
Split Criteria based on Imprecise Probabilities
Imprecise Information Gain (IIG) [3]
It is based on an uncertainty measure for convex sets of probability distributions.
Probability intervals for each state of the class variable are computed from the dataset using Walley's Imprecise Dirichlet Model (IDM) [24].
$p(c_j) \in \left[ \frac{n_{c_j}}{N + s},\; \frac{n_{c_j} + s}{N + s} \right] \equiv I_{c_j}, \qquad p(c_j \mid x_i) \in \left[ \frac{n_{c_j, x_i}}{N_{x_i} + s},\; \frac{n_{c_j, x_i} + s}{N_{x_i} + s} \right] \equiv I_{c_j, x_i}$
Let $K(C)$ and $K(C \mid X = x_i)$ denote the following sets of probability distributions $q$ on $\Omega_C$:

$K(C) = \{ q \mid q(c_j) \in I_{c_j} \}, \qquad K(C \mid X = x_i) = \{ q \mid q(c_j) \in I_{c_j, x_i} \}$
Imprecise Info-Gain for each variable X is defined as:
$IIG(X, C) = S(K(C)) - \sum_i p(x_i)\, S(K(C \mid X = x_i))$
where S() is the maximum entropy function of a convex set.
It can be efficiently computed for s=1 [1].
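Below is a sketch of the s = 1 special case, following the max-entropy procedure of [1]: start from the IDM lower bounds and assign the free mass s/(N + s) to the least-frequent class values, split equally among ties. Helper names are ours, and this is a reading of the cited algorithm rather than the authors' code.

    import math

    def max_entropy_idm(counts, s=1):
        # S(K(C)) for the IDM credal set; this closed-form rule is the
        # efficient s = 1 case referenced above [1].
        n = sum(counts)
        probs = [c / (n + s) for c in counts]        # IDM lower bounds
        ties = [j for j, c in enumerate(counts) if c == min(counts)]
        for j in ties:                               # free mass s/(N+s) goes to
            probs[j] += s / ((n + s) * len(ties))    # the least-frequent classes
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def imprecise_info_gain(rows, attr, target, classes, s=1):
        # IIG(X, C) = S(K(C)) - sum_i p(x_i) S(K(C | X = x_i))
        n = len(rows)
        counts = lambda rs: [sum(r[target] == c for r in rs) for c in classes]
        result = max_entropy_idm(counts(rows), s)
        for value in {r[attr] for r in rows}:
            subset = [r for r in rows if r[attr] == value]
            result -= len(subset) / n * max_entropy_idm(counts(subset), s)
        return result

With the class values fixed in advance, e.g. split_score = lambda rows, a, t: imprecise_info_gain(rows, a, t, classes) plugs straight into the build_tree sketch above.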
Part III
Bagging Decision Trees
Procedure
Samples $T_i$ are generated by random sampling with replacement from the initial training dataset.
From each sample $T_i$, a simple decision tree is built using a given split criterion.
The final prediction is made by majority voting.
Description
As Breiman [9] said about Bagging: "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then Bagging can improve accuracy."
The combination of multiple models reduces the overfitting of the single decision trees to the dataset.
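A minimal sketch of this Bagging scheme follows. We let scikit-learn's DecisionTreeClassifier stand in for the simple trees (it offers the 'entropy' and 'gini' criteria, though not IGR or IIG), and we assume integer-coded class labels for the voting step.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class BaggingTrees:
        def __init__(self, n_trees=100, criterion="entropy", seed=0):
            self.n_trees, self.criterion, self.seed = n_trees, criterion, seed

        def fit(self, X, y):
            rng = np.random.default_rng(self.seed)
            self.trees_ = []
            for _ in range(self.n_trees):
                # Bootstrap sample T_i: random sampling with replacement.
                idx = rng.integers(0, len(X), size=len(X))
                tree = DecisionTreeClassifier(criterion=self.criterion)
                self.trees_.append(tree.fit(X[idx], y[idx]))
            return self

        def predict(self, X):
            # Majority voting over the individual predictions of each tree
            # (class labels assumed integer-coded for np.bincount).
            votes = np.stack([t.predict(X) for t in self.trees_])
            return np.array([np.bincount(col).argmax() for col in votes.T])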
Part IV
Experiments
Experimental Set-up
Datasets Benchmark
25 UCI datasets with very different features.
Missing values were replaced with the mean and mode for continuous and discrete attributes, respectively.
Continuous attributes were discretized with Fayyad & Irani’s method [13].
Preprocessing was carried out using information from the training datasets only.
Evaluated Algorithms
Bagging ensembles of 100 trees.
Different split criteria: IG, IGR, GIx and IIG.
Evaluation Method
Different noise rates were applied to the training datasets (not to the test datasets): 0%, 5%, 10%, 20% and 30%.
10-fold cross-validation repeated 10 times was used to estimate the classification accuracy.
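Putting the pieces together, a sketch of this evaluation protocol. It reuses add_classification_noise from the earlier sketch, and RepeatedStratifiedKFold is our choice for the repeated 10-fold splits; the fold-generation details of the original study are not specified here.

    import numpy as np
    from sklearn.model_selection import RepeatedStratifiedKFold

    def noisy_cv_accuracy(make_model, X, y, noise_rate, seed=0):
        # 10-fold cross-validation repeated 10 times; noise is injected
        # into the training folds only, never into the test folds.
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
        accuracies = []
        for train, test in cv.split(X, y):
            # add_classification_noise is the sketch from Part I.
            y_train = add_classification_noise(y[train], noise_rate, rng=seed)
            model = make_model().fit(X[train], y_train)
            accuracies.append(np.mean(model.predict(X[test]) == y[test]))
        return float(np.mean(accuracies))

    # e.g. noisy_cv_accuracy(lambda: BaggingTrees(100), X, y, noise_rate=0.10)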
Statistical Tests
Two classifiers on a single dataset
Corrected Paired T-test [26]: a corrected version of the paired T-test implemented in Weka.
Two classifiers on multiple datasets
Wilcoxon Signed-Ranks Test [25]: a non-parametric test which ranks the differences in each dataset.
Sign Test [20,22]: a binomial test that counts the number of wins, losses and ties across the datasets.
Multiple classifiers on multiple datasets
Friedman Test [15,16]: a non-parametric test that ranks the algorithms for each dataset: the best one gets rank 1, the second one rank 2, and so on. The null hypothesis is that all algorithms perform equally well.
Nemenyi Test [17]: a post-hoc test employed to compare the algorithms among themselves when the null hypothesis of the Friedman test is rejected.
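For reference, the multi-dataset tests are readily available in SciPy; below is a toy sketch with purely illustrative accuracy numbers (one value per dataset and method). The Nemenyi post-hoc is not in SciPy itself but is provided by third-party packages such as scikit-posthocs.

    import numpy as np
    from scipy import stats

    # One accuracy per dataset and method (toy numbers, purely illustrative).
    acc = {"IIG": np.array([0.83, 0.79, 0.91, 0.75, 0.88]),
           "IG":  np.array([0.80, 0.76, 0.90, 0.70, 0.85]),
           "IGR": np.array([0.82, 0.78, 0.90, 0.73, 0.86]),
           "GIx": np.array([0.80, 0.75, 0.89, 0.71, 0.84])}

    # Two classifiers on multiple datasets: Wilcoxon signed-ranks test.
    print(stats.wilcoxon(acc["IIG"], acc["IG"]))

    # Sign test: a binomial test on the number of wins (ties are dropped).
    wins = int(np.sum(acc["IIG"] > acc["IG"]))
    decided = int(np.sum(acc["IIG"] != acc["IG"]))
    print(stats.binomtest(wins, decided, p=0.5))

    # Multiple classifiers on multiple datasets: Friedman test.
    print(stats.friedmanchisquare(*acc.values()))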
Average Performance
Analysis
The average accuracy is similar when no noise is introduced.
The introduction of noise deteriorates the performance of classifiers.
But IIG is more robust to noise: its average performance is higher at every noise level.
Corrected Paired T-Test at 0.05 level
Number of accumulated Wins, Ties and Defeats (W/T/D) of IIG with respect to IG, IGR and GIx on the 25 datasets.
Noise   IG        IGR       GIx
0%      2/22/1    1/23/1    2/22/1
5%      11/14/0   10/15/0   11/14/0
10%     13/12/0   10/15/0   13/12/0
20%     16/9/0    11/14/0   18/7/0
30%     17/8/0    11/14/0   17/8/0
Analysis
Without noise, there is a tie in almost all datasets.
The more noise is added, the higher the number of wins.
IIG wins on a high number of datasets and is not defeated on any of them.
Wilcoxon and Sign Tests at 0.05 level
Comparison of IIG with respect to the other split criteria. '-' indicates no statistically significant difference.
        Wilcoxon Test          Sign Test
Noise   IG    IGR   GIx        IG    IGR   GIx
0%      IIG   -     IIG        IIG   -     IIG
5%      IIG   IIG   IIG        IIG   IIG   IIG
10%     IIG   IIG   IIG        IIG   IIG   IIG
20%     IIG   IIG   IIG        IIG   IIG   IIG
30%     IIG   IIG   IIG        IIG   IIG   IIG
Analysis
Without noise, IIG outperforms IG and GIx, but not IGR.
With any level of noise, IIG outperforms the rest of the split criteria.
IGR also outperforms IG and GIx when some level of noise is present.
Friedman Test at 0.05 level
The ranks assigned by the Friedman test are shown below.
The lower the rank, the better the performance.
Ranks in bold face indicate the criteria that IIG statistically outperforms according to the Nemenyi test.
Noise   IIG    IG     IGR    GIx
0%      1.86   2.92   2.52   2.70
5%      1.18   3.18   2.54   3.12
10%     1.12   3.26   2.36   3.26
20%     1.12   3.20   2.16   3.52
30%     1.12   3.36   2.26   3.26
Analysis
Without noise, IIG has the best ranking and outperforms IG.
With a noise level higher than 10%, IIG outperforms the rest.
IGR also outperforms IG and GIx when the noise level is higher than 20%.
Computational Time
Analysis
Without noise, all split criteria have similar average running times.
The introduction of noise deteriorates the computational performance of the classifiers.
IIG and GIx consume less time than the other split criteria; IGR is the most time-consuming.
Part V
Conclusions and Future Work
Conclusions
An experimental study of the performance of different split criteria in a Bagging scheme under classification noise.
Three classic split criteria were considered: IG, IGR and GIx; and a new one based on imprecise probabilities: IIG.
Bagging with IIG is clearly more robust than the other criteria as the noise level increases.
IGR also performs well under noise, but worse than IIG.
Future Work
Extend the methods to continuous attributes and missing values.
Further investigate the computational cost of the models, as well as other factors such as the number of trees, pruning...
Introduce new imprecise models.
Thanks for your attention!!
Questions?