
Page 1:

A statistical approach for separability of classes

Colloque Apprentissage statistique : Théorie et application

Djamel A. ZIGHED, Stéphane LALLICH, Fabrice MUHLENBACH

Laboratoire ERIC - Lyon, Université Lumière Lyon 2
5, avenue Pierre Mendès-France, 69676 BRON Cedex, FRANCE

[email protected], [email protected], [email protected]

Page 2:

Overview

• General framework
• Class separability
• Neighborhood graph and clusters
• Cut weighted edge statistic
• Experiments
• Conclusions and future work

Page 3:

General framework

Page 4:

Population Ω of labeled examples.

[Figure: data table of labeled examples; the class attribute Y (categorical) is to be predicted from the predictive attributes (X1, X2, X3, …, Xp) (continuous); table rows omitted.]

X : Ω → X(Ω), the representation space
Y : Ω → E = {y1, …, ym}, the class attribute

A machine learning algorithm (Neural Net, Induction Graph, Discriminant Analysis, SVM, …) learns a model ϕ, with error ε, such that ϕ(X(ω)) = Y(ω), as accurate as possible.

Page 5:

Assume that we aim to find a prediction model ϕ that fits the data with an accuracy rate ε > ε∗, whatever the machine learning algorithm:

X(Ω), Y(Ω) → Neural Net → (ϕ_NN, ε < ε∗)
X(Ω), Y(Ω) → Ind. Graph → (ϕ_IG, ε < ε∗)

We failed to find such a prediction model. Two explanations are possible:
• None of the MLAs tried is suitable for this specific problem, so we have to look for a new MLA, … until …
• … the classes simply are not separable.
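To make this search loop concrete, here is a minimal sketch, assuming scikit-learn is available; the threshold EPS_STAR, the candidate list, and the dataset (X, y) are illustrative placeholders, not anything from the talk:

```python
# Minimal sketch of the "try one MLA after another" loop.
# EPS_STAR and the candidate algorithms are hypothetical placeholders.
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

EPS_STAR = 0.90  # hypothetical target accuracy eps*

def find_model(X, y):
    candidates = [MLPClassifier(max_iter=1000), DecisionTreeClassifier(), SVC()]
    for algo in candidates:
        accuracy = cross_val_score(algo, X, y, cv=10).mean()
        if accuracy > EPS_STAR:   # found a model phi with eps > eps*
            return algo.fit(X, y)
    return None  # every MLA failed: perhaps the classes are not separable
```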

Page 6:

Class separability

Page 7:

[Scatter plot in the representation space, axes X1, Xi, Xp.]

Here the classes are not easily separable: it will be difficult to find a method that can produce a reliable model.

Page 8:

[Scatter plot in the representation space, axes X1, Xi, Xp.]

Here the classes seem to be well separated: a machine learning method can probably find a model of these data.

Page 9:

[Figure: five example point sets. Panels (a) and (b): 2 clusters; panel (c): 3 clusters; panels (d) and (e): 12 clusters.]

Page 10:

The separability of the classes seems to be linked to:
• the number of homogeneous clusters
• the size of each cluster

[Figure: two panels, one with few homogeneous clusters, one with many homogeneous clusters.]

Page 11:

Neighborhood graph and clusters

Page 12:

[Scatter plot, axes X1 and X2.]

For a "natural neighborhood" structure, we need a connected graph.

Page 13:

[Figure: Minimal Spanning Tree built on the points, axes X1 and X2.]

Page 14:

[Figure: Gabriel Graph on the points (axes X1, X2), illustrated on a pair of points α and β: they are linked if no other point lies inside the ball of diameter [αβ].]

Page 15:

[Figure: Relative Neighborhood Graph (Toussaint) on the points (axes X1, X2): α and β are linked if no other point γ is closer to both of them, i.e. d(α, β) ≤ max(d(α, γ), d(β, γ)) for all γ.]

Page 16:

There are many other ways to build such a neighborhood graph, for instance the Voronoi diagram or the Delaunay triangulation.
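To make the construction concrete, here is a brute-force sketch of Toussaint's Relative Neighborhood Graph (my own minimal implementation, not the authors' code; numpy assumed):

```python
import numpy as np

def relative_neighborhood_graph(points):
    """Brute-force RNG: points i and j are linked iff no third point k
    is strictly closer to both, i.e. d(i,j) <= max(d(i,k), d(j,k)) for all k."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if all(d[i, j] <= max(d[i, k], d[j, k])
                   for k in range(n) if k != i and k != j):
                edges.append((i, j))
    return edges

# e.g.: edges = relative_neighborhood_graph(np.random.rand(30, 2))
```

This O(n³) scan is fine for the dataset sizes in the experiments below; faster constructions exist but are not needed here.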

Page 17:

[Figure: neighborhood graph on the points (axes X1, X2), vertices colored by class.]

Cutting the 6 edges whose endpoints belong to different classes yields 3 clusters.
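The cluster construction on this slide is easy to state in code: cut every edge whose endpoints carry different labels, then take the connected components of what remains. A self-contained sketch with a small union-find (all names are mine):

```python
def clusters_after_cutting(n_vertices, edges, labels):
    """Cut every edge whose endpoints have different labels, then return
    the connected components (clusters) of the remaining graph and the cut edges."""
    parent = list(range(n_vertices))

    def find(a):
        # find the component root, with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    cut = []
    for i, j in edges:
        if labels[i] == labels[j]:
            parent[find(i)] = find(j)   # keep the edge: merge components
        else:
            cut.append((i, j))          # cut edge: endpoints disagree
    clusters = {}
    for v in range(n_vertices):
        clusters.setdefault(find(v), []).append(v)
    return list(clusters.values()), cut
```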

Page 18:

Cut weighted edge statistic

Page 19:

Weighted edges

• Connection: wij = 1 if (i, j) is an edge, e.g. wij = 1, wik = 1, wjk = 0 (no edge).
• Similarity: wij = 1 / (1 + d(i, j)), e.g. with d(i, j) = 3 and d(i, k) = 1, wij = 1/4 and wik = 1/2.
• Rank: w'ij = 1 / (rank of j among the neighbors of i, nearest first), e.g. w'ij = 1/2 = 0.5 and w'ik = 1/1 = 1.
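A sketch of the three weighting schemes; the 'rank' reading (w' = 1 over the rank of the neighbor, nearest first) is my interpretation of the slide's example, and 'similarity' corresponds to the distance-based weighting used in the experiments:

```python
import numpy as np

def edge_weights(points, edges, scheme="connection"):
    """Weights for the edges of a neighborhood graph, per the slide's schemes."""
    def dist(i, j):
        return float(np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j])))

    # adjacency lists, needed for the rank scheme
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)

    w = {}
    for i, j in edges:
        if scheme == "connection":
            w[(i, j)] = 1.0
        elif scheme == "similarity":
            w[(i, j)] = 1.0 / (1.0 + dist(i, j))
        elif scheme == "rank":
            ranked = sorted(nbrs[i], key=lambda k: dist(i, k))
            w[(i, j)] = 1.0 / (ranked.index(j) + 1)
    return w
```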

Page 20:

Principle: the classes are well separable if the global weight of the cut edges (those removed to generate the clusters) is very small.

[Figure: panel (a): 77 edges, 5 edges cut, 2 clusters; panel (b): 57 edges, 39 edges cut, 9 green + 13 red = 22 clusters.]

Page 21:

Notation:
Let I be the sum of the weights of the non-cut edges.
Let J be the sum of the weights of the cut edges.

Null hypothesis H0: the vertices of the graph are labeled independently of each other, according to the same probability distribution of the labels.

Rejecting H0 means that the classes are not independently distributed, or that the probability distribution of the classes is not the same for the different vertices.

We study the statistic J.

Page 22:

Law of J:

• Boolean case (Moran 1948; Cliff & Ord 1986), with 2 classes labeled 1 and 2:

J_{1,2} = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} Z_{ij}

where Z_{ij} = 1 if Y(i) ≠ Y(j) and 0 otherwise.

• Multiple classes: we consider all the pairs of different labels r and s,

J_{r,s} = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} Z_{ij}^{(r,s)}, with Z_{ij}^{(r,s)} = 1 if {Y(i), Y(j)} = {r, s} and 0 otherwise,

J = Σ_{r=1}^{k−1} Σ_{s=r+1}^{k} J_{r,s}

The mean of J under H0 is calculated from the J_{r,s}.

Decision: H0 is rejected if the p-value of the standardized statistic Js is lower than 5% (for example).
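The slide standardizes J using moments derived analytically under H0 (Moran 1948, Cliff & Ord). As a numerically simple stand-in, a permutation (randomization) test estimates the same left-tail p-value by shuffling the labels over the vertices, which conditions on the observed label counts; a sketch, not the authors' exact computation:

```python
import random

def cut_edge_weight(edges, weights, labels):
    """J = total weight of the edges whose endpoints carry different labels."""
    return sum(weights[e] for e in edges if labels[e[0]] != labels[e[1]])

def p_value_by_permutation(edges, weights, labels, n_perm=10000, seed=0):
    """Left-tail p-value of J under H0, estimated by random label permutations.
    A small J (very negative Js) means the classes are well separated."""
    rng = random.Random(seed)
    observed = cut_edge_weight(edges, weights, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # permuting labels simulates H0
        if cut_edge_weight(edges, weights, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```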

Page 23:

Experiments

Domain name   n     p   k  clust.  edges  J/(I+J)   Js      p-value
Waves-20      20    21  3    6       25   0.400     -0.44   0.6635
Waves-50      50    21  3   11       72   0.375     -4.05   5.0E-05
Waves-100     100   21  3   12      156   0.301     -8.44   3.3E-17
Waves-1000    1000  21  3   49     2443   0.255    -42.75   0

Breiman L., Friedman J.H., Olshen R.A. and Stone C.J., Classification and Regression Trees, Wadsworth Int., 1984.

Page 24:

• 13 benchmarks from the UCI Machine Learning Repository
• Graph: Relative Neighborhood Graph of Toussaint
• Weights: connection, distance and rank

General information                                    weighting: connection      weighting: distance        weighting: rank
Domain name       n     p    k  clust. edges  err.r.   J/(I+J)  Js      p-value   J/(I+J)  Js      p-value   J/(I+J)  Js      p-value
Wine recognition  178   13   3    9     281   0.0389   0.093   -19.32   0         0.054   -19.40   0         0.074   -19.27   0
Breast Cancer     683    9   2   10    7562   0.0409   0.008   -25.29   0         0.003   -24.38   0         0.014   -25.02   0
Iris (Bezdek)     150    4   3    6     189   0.0533   0.090   -16.82   0         0.077   -17.01   0         0.078   -16.78   0
Iris plants       150    4   3    6     196   0.0600   0.087   -17.22   0         0.074   -17.41   0         0.076   -17.14   0
Musk "Clean1"     476  166   2   14     810   0.0650   0.167   -17.53   0         0.115    -7.69   2E-14     0.143   -18.10   0
Image seg.        210   19   7   27     268   0.1238   0.224   -29.63   0         0.141   -29.31   0         0.201   -29.88   0
Ionosphere        351   34   2   43     402   0.1397   0.137   -11.34   0         0.046   -11.07   0         0.136   -11.33   0
Waveform          1000   21  3   49    2443   0.1860   0.255   -42.75   0         0.248   -42.55   0         0.248   -42.55   0
Pima Indians      768    8   2   82    1416   0.2877   0.310    -8.74   2E-18     0.282    -9.86   0         0.305    -8.93   4E-19
Glass Ident.      214    9   6   52     275   0.3169   0.356   -12.63   0         0.315   -12.90   0         0.342   -12.93   0
Haberman          306    3   2   47     517   0.3263   0.331    -1.92   0.054     0.321    -2.20   0.028     0.331    -1.90   0.058
Bupa              345    6   2   50     581   0.3632   0.401    -3.89   1E-04     0.385    -4.33   1E-05     0.394    -4.08   5E-05
Yeast             1484   8  10  401    2805   0.4549   0.524   -27.03   0         0.512   -27.18   0         0.509   -28.06   0

n: number of instances; p: number of predictive attributes; k: number of classes; clust.: number of clusters; edges: number of edges in the graph; err.r.: error rate of a 1-NN classifier (10-fold cross-validation); J/(I+J): relative cut edge weight; Js: standardized cut edge weight.

Page 25:

• Error rate in machine learning and weight of the cut edges

General information                           Statistical value           Error rate
Domain name       n     p    k  clust. edges  J/(I+J)  Js      p-value    1-NN   C4.5   Sipina  Perc.  MLP    N.Bayes  Mean
Breast Cancer     683    9   2   10    7562   0.008   -25.29   0          0.041  0.059  0.050   0.032  0.032  0.026    0.040
BUPA liver        345    6   2   50     581   0.401    -3.89   0.0001     0.363  0.369  0.347   0.305  0.322  0.380    0.348
Glass Ident.      214    9   6   52     275   0.356   -12.63   0          0.317  0.289  0.304   0.350  0.448  0.401    0.352
Haberman          306    3   2   47     517   0.331    -1.92   0.0544     0.326  0.310  0.294   0.241  0.275  0.284    0.288
Image seg.        210   19   7   27     268   0.224   -29.63   0          0.124  0.124  0.152   0.119  0.114  0.605    0.206
Ionosphere        351   34   2   43     402   0.137   -11.34   0          0.140  0.074  0.114   0.128  0.131  0.160    0.124
Iris (Bezdek)     150    4   3    6     189   0.090   -16.82   0          0.053  0.060  0.067   0.060  0.053  0.087    0.063
Iris plants       150    4   3    6     196   0.087   -17.22   0          0.060  0.033  0.053   0.067  0.040  0.080    0.056
Musk "Clean1"     476  166   2   14     810   0.167   -17.53   0          0.065  0.162  0.232   0.187  0.113  0.227    0.164
Pima Indians      768    8   2   82    1416   0.310    -8.74   2.4E-18    0.288  0.283  0.270   0.231  0.266  0.259    0.266
Waveform          1000   21  3   49    2443   0.255   -42.75   0          0.186  0.260  0.251   0.173  0.169  0.243    0.214
Wine recognition  178   13   3    9     281   0.093   -19.32   0          0.039  0.062  0.073   0.011  0.017  0.186    0.065
Yeast             1484   8  10  401    2805   0.524   -27.03   0          0.455  0.445  0.437   0.447  0.446  0.435    0.444

Mean                                                                      0.189  0.195  0.203   0.181  0.187  0.259    0.202
R² (J/(I+J); error rate)                                                  0.933  0.934  0.937   0.912  0.877  0.528    0.979
R² (Js; error rate)                                                       0.076  0.020  0.019   0.036  0.063  0.005    0.026

1-NN: instance-based learning method; C4.5: decision tree; Sipina: induction graph; Perc., MLP: neural networks; N.Bayes: Naive Bayes.

Page 26:

• Relative cut edge weight and mean of the error rates

[Scatter plot: mean error rate (y axis, 0.00 to 0.50) against J/(I+J) (x axis, 0.00 to 0.60); fitted line y = 0.8663x + 0.0036, R² = 0.979.]
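The fit on this slide can be checked against the table two slides back (connection weighting: J/(I+J) column vs. the mean error rate); a small numpy sketch with the pairs transcribed by hand:

```python
import numpy as np

# (J/(I+J), mean error rate) pairs transcribed from the results table
data = np.array([
    (0.008, 0.040), (0.401, 0.348), (0.356, 0.352), (0.331, 0.288),
    (0.224, 0.206), (0.137, 0.124), (0.090, 0.063), (0.087, 0.056),
    (0.167, 0.164), (0.310, 0.266), (0.255, 0.214), (0.093, 0.065),
    (0.524, 0.444),
])
x, y = data[:, 0], data[:, 1]
slope, intercept = np.polyfit(x, y, 1)   # least-squares line
r2 = np.corrcoef(x, y)[0, 1] ** 2
# should land near the slide's y = 0.8663x + 0.0036, R^2 = 0.979
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.3f}")
```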

Page 27:

Conclusion

The cut weighted edge statistic is a good class separability index.

It gives appropriate information about the a priori ability of a database to be learnt.

Related work:
• A local version of the test to detect outliers (Lallich, Muhlenbach, Zighed, ISMIS 2002)
• Feature selection based on the best separability of classes (in progress)