
Page 1:

A statistical approach for separability of classes

Colloque Apprentissage statistique : Théorie et application

Djamel A. ZIGHED, Stéphane LALLICH, Fabrice MUHLENBACH

Laboratoire ERIC - Lyon, Université Lumière Lyon 2
5, avenue Pierre Mendès-France, 69676 BRON Cedex, FRANCE

[email protected], [email protected], [email protected]

Page 2:

Overview

• General framework
• Class separability
• Neighborhood graph and clusters
• Cut weighted edge statistic
• Experiments
• Conclusions and future work

Page 3:

General framework

Page 4:

Population Ω of labeled examples.

[Figure: data table of labeled examples; the class attribute Y (categorical) is to be predicted from the predictive attributes (X1, X2, X3, …, Xp) (continuous); table rows omitted.]

X : Ω → X(Ω), the representation space
Y : Ω → E = {y1, …, ym}, the class attribute

A machine learning algorithm (Neural Net, Induction Graph, Discriminant Analysis, SVM, …) learns a model ϕ, with error ε, such that ϕ(X(ω)) = Y(ω), as accurate as possible.

Page 5:

Assume that we aim to find a prediction model ϕ that fits the data with an accuracy rate ε > ε∗, whatever the machine learning algorithm:

X(Ω), Y(Ω) → Neural Net → (ϕ_NN, ε < ε∗)
X(Ω), Y(Ω) → Ind. Graph → (ϕ_IG, ε < ε∗)

We failed to find such a prediction model. Two explanations are possible:
• None of the MLAs tried is suitable for this specific problem, so we have to look for a new MLA, … until …
• … the classes simply are not separable.
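To make this search loop concrete, here is a minimal sketch, assuming scikit-learn is available; the threshold EPS_STAR, the candidate list, and the dataset (X, y) are illustrative placeholders, not anything from the talk:

```python
# Minimal sketch of the "try one MLA after another" loop.
# EPS_STAR and the candidate algorithms are hypothetical placeholders.
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

EPS_STAR = 0.90  # hypothetical target accuracy eps*

def find_model(X, y):
    candidates = [MLPClassifier(max_iter=1000), DecisionTreeClassifier(), SVC()]
    for algo in candidates:
        accuracy = cross_val_score(algo, X, y, cv=10).mean()
        if accuracy > EPS_STAR:   # found a model phi with eps > eps*
            return algo.fit(X, y)
    return None  # every MLA failed: perhaps the classes are not separable
```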

Page 6:

Class separability

Page 7:

[Scatter plot in the representation space, axes X1, Xi, Xp.]

Here the classes are not easily separable: it will be difficult to find a method that can produce a reliable model.

Page 8:

[Scatter plot in the representation space, axes X1, Xi, Xp.]

Here the classes seem to be well separated: a machine learning method can probably find a model of these data.

Page 9:

[Figure: five example point sets. Panels (a) and (b): 2 clusters; panel (c): 3 clusters; panels (d) and (e): 12 clusters.]

Page 10:

The separability of the classes seems to be linked to:
• the number of homogeneous clusters
• the size of each cluster

[Figure: two panels, one with few homogeneous clusters, one with many homogeneous clusters.]

Page 11:

Neighborhood graph and clusters

Page 12:

[Scatter plot, axes X1 and X2.]

For a "natural neighborhood" structure, we need a connected graph.

Page 13:

[Figure: Minimal Spanning Tree built on the points, axes X1 and X2.]

Page 14:

[Figure: Gabriel Graph on the points (axes X1, X2), illustrated on a pair of points α and β: they are linked if no other point lies inside the ball of diameter [αβ].]

Page 15:

[Figure: Relative Neighborhood Graph (Toussaint) on the points (axes X1, X2): α and β are linked if no other point γ is closer to both of them, i.e. d(α, β) ≤ max(d(α, γ), d(β, γ)) for all γ.]

Page 16:

There are many other ways to build such a neighborhood graph, for instance the Voronoi diagram or the Delaunay triangulation.
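To make the construction concrete, here is a brute-force sketch of Toussaint's Relative Neighborhood Graph (my own minimal implementation, not the authors' code; numpy assumed):

```python
import numpy as np

def relative_neighborhood_graph(points):
    """Brute-force RNG: points i and j are linked iff no third point k
    is strictly closer to both, i.e. d(i,j) <= max(d(i,k), d(j,k)) for all k."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # pairwise Euclidean distance matrix
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if all(d[i, j] <= max(d[i, k], d[j, k])
                   for k in range(n) if k != i and k != j):
                edges.append((i, j))
    return edges

# e.g.: edges = relative_neighborhood_graph(np.random.rand(30, 2))
```

This O(n³) scan is fine for the dataset sizes in the experiments below; faster constructions exist but are not needed here.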

Page 17:

[Figure: neighborhood graph on the points (axes X1, X2), vertices colored by class.]

Cutting the 6 edges whose endpoints belong to different classes yields 3 clusters.
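The cluster construction on this slide is easy to state in code: cut every edge whose endpoints carry different labels, then take the connected components of what remains. A self-contained sketch with a small union-find (all names are mine):

```python
def clusters_after_cutting(n_vertices, edges, labels):
    """Cut every edge whose endpoints have different labels, then return
    the connected components (clusters) of the remaining graph and the cut edges."""
    parent = list(range(n_vertices))

    def find(a):
        # find the component root, with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    cut = []
    for i, j in edges:
        if labels[i] == labels[j]:
            parent[find(i)] = find(j)   # keep the edge: merge components
        else:
            cut.append((i, j))          # cut edge: endpoints disagree
    clusters = {}
    for v in range(n_vertices):
        clusters.setdefault(find(v), []).append(v)
    return list(clusters.values()), cut
```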

Page 18:

Cut weighted edge statistic

Page 19:

Weighted edges

• Connection: wij = 1 if (i, j) is an edge, e.g. wij = 1, wik = 1, wjk = 0 (no edge).
• Similarity: wij = 1 / (1 + d(i, j)), e.g. with d(i, j) = 3 and d(i, k) = 1, wij = 1/4 and wik = 1/2.
• Rank: w'ij = 1 / (rank of j among the neighbors of i, nearest first), e.g. w'ij = 1/2 = 0.5 and w'ik = 1/1 = 1.
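A sketch of the three weighting schemes; the 'rank' reading (w' = 1 over the rank of the neighbor, nearest first) is my interpretation of the slide's example, and 'similarity' corresponds to the distance-based weighting used in the experiments:

```python
import numpy as np

def edge_weights(points, edges, scheme="connection"):
    """Weights for the edges of a neighborhood graph, per the slide's schemes."""
    def dist(i, j):
        return float(np.linalg.norm(np.asarray(points[i]) - np.asarray(points[j])))

    # adjacency lists, needed for the rank scheme
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)

    w = {}
    for i, j in edges:
        if scheme == "connection":
            w[(i, j)] = 1.0
        elif scheme == "similarity":
            w[(i, j)] = 1.0 / (1.0 + dist(i, j))
        elif scheme == "rank":
            ranked = sorted(nbrs[i], key=lambda k: dist(i, k))
            w[(i, j)] = 1.0 / (ranked.index(j) + 1)
    return w
```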

Page 20:

Principle: the classes are well separable if the global weight of the cut edges (those removed to generate the clusters) is very small.

[Figure: panel (a): 77 edges, 5 edges cut, 2 clusters; panel (b): 57 edges, 39 edges cut, 9 green + 13 red = 22 clusters.]

Page 21:

Notation:
Let I be the sum of the weights of the non-cut edges.
Let J be the sum of the weights of the cut edges.

Null hypothesis H0: the vertices of the graph are labeled independently of each other, according to the same probability distribution of the labels.

Rejecting H0 means that the classes are not independently distributed, or that the probability distribution of the classes is not the same for the different vertices.

We study the statistic J.

Page 22:

Law of J:

• Boolean case (Moran 1948; Cliff & Ord 1986), with 2 classes labeled 1 and 2:

J_{1,2} = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} Z_{ij}

where Z_{ij} = 1 if Y(i) ≠ Y(j) and 0 otherwise.

• Multiple classes: we consider all the pairs of different labels r and s,

J_{r,s} = (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} w_{ij} Z_{ij}^{(r,s)}, with Z_{ij}^{(r,s)} = 1 if {Y(i), Y(j)} = {r, s} and 0 otherwise,

J = Σ_{r=1}^{k−1} Σ_{s=r+1}^{k} J_{r,s}

The mean of J under H0 is calculated from the J_{r,s}.

Decision: H0 is rejected if the p-value of the standardized statistic Js is lower than 5% (for example).
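The slide standardizes J using moments derived analytically under H0 (Moran 1948, Cliff & Ord). As a numerically simple stand-in, a permutation (randomization) test estimates the same left-tail p-value by shuffling the labels over the vertices, which conditions on the observed label counts; a sketch, not the authors' exact computation:

```python
import random

def cut_edge_weight(edges, weights, labels):
    """J = total weight of the edges whose endpoints carry different labels."""
    return sum(weights[e] for e in edges if labels[e[0]] != labels[e[1]])

def p_value_by_permutation(edges, weights, labels, n_perm=10000, seed=0):
    """Left-tail p-value of J under H0, estimated by random label permutations.
    A small J (very negative Js) means the classes are well separated."""
    rng = random.Random(seed)
    observed = cut_edge_weight(edges, weights, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # permuting labels simulates H0
        if cut_edge_weight(edges, weights, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```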

Page 23:

Experiments

Domain name   n     p   k  clust.  edges  J/(I+J)   Js      p-value
Waves-20      20    21  3    6       25   0.400     -0.44   0.6635
Waves-50      50    21  3   11       72   0.375     -4.05   5.0E-05
Waves-100     100   21  3   12      156   0.301     -8.44   3.3E-17
Waves-1000    1000  21  3   49     2443   0.255    -42.75   0

Breiman L., Friedman J.H., Olshen R.A. and Stone C.J., Classification and Regression Trees, Wadsworth Int., 1984.

Page 24:

• 13 benchmarks from the UCI Machine Learning Repository
• Graph: Relative Neighborhood Graph of Toussaint
• Weights: connection, distance and rank

General information                                    weighting: connection      weighting: distance        weighting: rank
Domain name       n     p    k  clust. edges  err.r.   J/(I+J)  Js      p-value   J/(I+J)  Js      p-value   J/(I+J)  Js      p-value
Wine recognition  178   13   3    9     281   0.0389   0.093   -19.32   0         0.054   -19.40   0         0.074   -19.27   0
Breast Cancer     683    9   2   10    7562   0.0409   0.008   -25.29   0         0.003   -24.38   0         0.014   -25.02   0
Iris (Bezdek)     150    4   3    6     189   0.0533   0.090   -16.82   0         0.077   -17.01   0         0.078   -16.78   0
Iris plants       150    4   3    6     196   0.0600   0.087   -17.22   0         0.074   -17.41   0         0.076   -17.14   0
Musk "Clean1"     476  166   2   14     810   0.0650   0.167   -17.53   0         0.115    -7.69   2E-14     0.143   -18.10   0
Image seg.        210   19   7   27     268   0.1238   0.224   -29.63   0         0.141   -29.31   0         0.201   -29.88   0
Ionosphere        351   34   2   43     402   0.1397   0.137   -11.34   0         0.046   -11.07   0         0.136   -11.33   0
Waveform          1000   21  3   49    2443   0.1860   0.255   -42.75   0         0.248   -42.55   0         0.248   -42.55   0
Pima Indians      768    8   2   82    1416   0.2877   0.310    -8.74   2E-18     0.282    -9.86   0         0.305    -8.93   4E-19
Glass Ident.      214    9   6   52     275   0.3169   0.356   -12.63   0         0.315   -12.90   0         0.342   -12.93   0
Haberman          306    3   2   47     517   0.3263   0.331    -1.92   0.054     0.321    -2.20   0.028     0.331    -1.90   0.058
Bupa              345    6   2   50     581   0.3632   0.401    -3.89   1E-04     0.385    -4.33   1E-05     0.394    -4.08   5E-05
Yeast             1484   8  10  401    2805   0.4549   0.524   -27.03   0         0.512   -27.18   0         0.509   -28.06   0

n: number of instances; p: number of predictive attributes; k: number of classes; clust.: number of clusters; edges: number of edges in the graph; err.r.: error rate of a 1-NN classifier (10-fold cross-validation); J/(I+J): relative cut edge weight; Js: standardized cut edge weight.

Page 25:

• Error rate in machine learning and weight of the cut edges

General information                           Statistical value           Error rate
Domain name       n     p    k  clust. edges  J/(I+J)  Js      p-value    1-NN   C4.5   Sipina  Perc.  MLP    N.Bayes  Mean
Breast Cancer     683    9   2   10    7562   0.008   -25.29   0          0.041  0.059  0.050   0.032  0.032  0.026    0.040
BUPA liver        345    6   2   50     581   0.401    -3.89   0.0001     0.363  0.369  0.347   0.305  0.322  0.380    0.348
Glass Ident.      214    9   6   52     275   0.356   -12.63   0          0.317  0.289  0.304   0.350  0.448  0.401    0.352
Haberman          306    3   2   47     517   0.331    -1.92   0.0544     0.326  0.310  0.294   0.241  0.275  0.284    0.288
Image seg.        210   19   7   27     268   0.224   -29.63   0          0.124  0.124  0.152   0.119  0.114  0.605    0.206
Ionosphere        351   34   2   43     402   0.137   -11.34   0          0.140  0.074  0.114   0.128  0.131  0.160    0.124
Iris (Bezdek)     150    4   3    6     189   0.090   -16.82   0          0.053  0.060  0.067   0.060  0.053  0.087    0.063
Iris plants       150    4   3    6     196   0.087   -17.22   0          0.060  0.033  0.053   0.067  0.040  0.080    0.056
Musk "Clean1"     476  166   2   14     810   0.167   -17.53   0          0.065  0.162  0.232   0.187  0.113  0.227    0.164
Pima Indians      768    8   2   82    1416   0.310    -8.74   2.4E-18    0.288  0.283  0.270   0.231  0.266  0.259    0.266
Waveform          1000   21  3   49    2443   0.255   -42.75   0          0.186  0.260  0.251   0.173  0.169  0.243    0.214
Wine recognition  178   13   3    9     281   0.093   -19.32   0          0.039  0.062  0.073   0.011  0.017  0.186    0.065
Yeast             1484   8  10  401    2805   0.524   -27.03   0          0.455  0.445  0.437   0.447  0.446  0.435    0.444

Mean                                                                      0.189  0.195  0.203   0.181  0.187  0.259    0.202
R² (J/(I+J); error rate)                                                  0.933  0.934  0.937   0.912  0.877  0.528    0.979
R² (Js; error rate)                                                       0.076  0.020  0.019   0.036  0.063  0.005    0.026

1-NN: instance-based learning method; C4.5: decision tree; Sipina: induction graph; Perc., MLP: neural networks; N.Bayes: Naive Bayes.

Page 26:

• Relative cut edge weight and mean of the error rates

[Scatter plot: mean error rate (y axis, 0.00 to 0.50) against J/(I+J) (x axis, 0.00 to 0.60); fitted line y = 0.8663x + 0.0036, R² = 0.979.]
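The fit on this slide can be checked against the table two slides back (connection weighting: J/(I+J) column vs. the mean error rate); a small numpy sketch with the pairs transcribed by hand:

```python
import numpy as np

# (J/(I+J), mean error rate) pairs transcribed from the results table
data = np.array([
    (0.008, 0.040), (0.401, 0.348), (0.356, 0.352), (0.331, 0.288),
    (0.224, 0.206), (0.137, 0.124), (0.090, 0.063), (0.087, 0.056),
    (0.167, 0.164), (0.310, 0.266), (0.255, 0.214), (0.093, 0.065),
    (0.524, 0.444),
])
x, y = data[:, 0], data[:, 1]
slope, intercept = np.polyfit(x, y, 1)   # least-squares line
r2 = np.corrcoef(x, y)[0, 1] ** 2
# should land near the slide's y = 0.8663x + 0.0036, R^2 = 0.979
print(f"y = {slope:.4f}x + {intercept:.4f}, R^2 = {r2:.3f}")
```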

Page 27:

Conclusion

The cut weighted edge statistic is a good class separability index.

It gives appropriate information about the a priori ability of a database to be learnt.

Related work:
• A local version of the test to detect outliers (Lallich, Muhlenbach, Zighed, ISMIS 2002)
• Feature selection based on the best separability of classes (in progress)