
Influence function of the error rate of generalized k-means

Joint work with G. Haesbroeck

Ch. Ruwet

UNIVERSITY OF LIÈGE

30 March 2009

Outline

Introduction

Error rate

Influence function of the error rate

Bias of the error rate

Simulation study

Future research


Introduction


The k-means clustering method

Aim of clustering: group similar observations into k clusters $C_1, \dots, C_k$.

The k-means algorithm constructs clusters so as to minimize the within-cluster sum of squared distances.

Let us focus on k = 2 groups.


Classification based on clustering

When the data contain natural groups, clustering may be used to find these groups via a classification rule:

$$x \in C_j \Leftrightarrow \|x - x_j\| = \min_{1 \le i \le 2} \|x - x_i\|$$

where $x_j$ denotes the center of cluster $C_j$.


Example in 1D

[Figure: Height of students in 1BM. Axis: Height (in cm), from 160 to 190.]


Example in 2D

[Figure: Students of 1BM. Horizontal axis: Weight (in kg), from 40 to 90. Vertical axis: Height (in cm), from 160 to 190.]


Contaminated example in 1D

[Figure: Height of students in 1BM, with contamination. Axis: Height (in cm), from 120 to 180.]


The generalized 2-means clustering method

The cluster centers $(T_1, T_2)$ are solutions of

$$\min_{\{t_1, t_2\} \subset \mathbb{R}^2} \sum_{i=1}^{n} \Omega\left(\inf_{1 \le j \le 2} \|x_i - t_j\|\right)$$

for a suitable nondecreasing penalty function $\Omega$.

Classical penalty functions:

$\Omega(x) = x^2$ → 2-means method

$\Omega(x) = x$ → 2-medoids method
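A minimal sketch of this criterion in one dimension, under stated assumptions and not the authors' implementation: because $\Omega$ is nondecreasing, each observation is charged to its nearest center, so the two clusters of a sorted sample are contiguous and an exhaustive search over the $n - 1$ split points is enough. The function name `generalized_two_means_1d`, its `penalty` argument and the height values are hypothetical and serve only as an illustration.

```python
import numpy as np

def generalized_two_means_1d(x, penalty="square"):
    """Exhaustive 1D search for two centers minimising sum_i Omega(min_j |x_i - t_j|).

    penalty="square": Omega(x) = x^2 (2-means, each center is a cluster mean).
    penalty="abs":    Omega(x) = x   (2-medoids, each center is a cluster median).
    """
    xs = np.sort(np.asarray(x, dtype=float))
    if penalty == "square":
        center, omega = np.mean, np.square
    elif penalty == "abs":
        center, omega = np.median, np.abs
    else:
        raise ValueError("penalty must be 'square' or 'abs'")
    best_cost, best_centers = np.inf, None
    for k in range(1, len(xs)):              # left cluster xs[:k], right cluster xs[k:]
        t1, t2 = center(xs[:k]), center(xs[k:])
        cost = omega(xs[:k] - t1).sum() + omega(xs[k:] - t2).sum()
        if cost < best_cost:
            best_cost, best_centers = cost, (float(t1), float(t2))
    return best_centers

heights = [160, 162, 165, 168, 178, 180, 183, 185]   # hypothetical heights in cm
contaminated = heights + [120]                        # one hypothetical outlier

print(generalized_two_means_1d(heights, "square"))
print(generalized_two_means_1d(contaminated, "square"))
print(generalized_two_means_1d(contaminated, "abs"))
```

On this toy sample, the 2-means solution isolates the outlier at 120 in a cluster of its own, while the 2-medoids solution still separates the two height groups.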


Contaminated example with the 2-medoids method

[Figure: Height of students in 1BM, with contamination. Axis: Height (in cm), from 120 to 190.]


The generalized 2-means clustering method

The classification rule:

$$x \in C_j \Leftrightarrow \Omega(\|x - T_j\|) = \min_{1 \le i \le 2} \Omega(\|x - T_i\|)$$

In one dimension, the estimated clusters are simply

$$C_1 = \,]-\infty, C[ \qquad C_2 = \,]C, +\infty[$$

where $C = \frac{T_1 + T_2}{2}$ is the cut-off point.
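The one-dimensional rule can be sketched as follows. Since $\Omega$ is nondecreasing, comparing $\Omega(|x - T_1|)$ with $\Omega(|x - T_2|)$ amounts to comparing the distances themselves, which is why the cut-off is the midpoint whatever the penalty. The function names are hypothetical, and the code assumes $T_1 < T_2$.

```python
import numpy as np

def cutoff_point(t1, t2):
    """Cut-off C = (T1 + T2) / 2 between the two estimated centers."""
    return 0.5 * (t1 + t2)

def assign_cluster(x, t1, t2):
    """Return 1 for C1 = ]-inf, C[ and 2 for C2 = ]C, +inf[, assuming t1 < t2.

    A point exactly equal to C is sent to cluster 2 by convention.
    """
    x = np.asarray(x, dtype=float)
    return np.where(x < cutoff_point(t1, t2), 1, 2)

# Hypothetical centers (in cm) and a few hypothetical heights to classify.
print(cutoff_point(163.75, 181.5))
print(assign_cluster([160, 172, 173, 190], 163.75, 181.5))
```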


Error rate


Example in 1D

[Figure: Height of students in 1BM. Axis: Height (in cm), from 160 to 190. Legend: Girls, Boys.]


Example in 1D with the 2-means

[Figure: Height of students in 1BM, clustered with the 2-means. Axis: Height (in cm), from 160 to 190. Legend: Girls, Boys.]

ER = 0.326


Contaminated example in 1D

[Figure: Height of students in 1BM, with contamination. Axis: Height (in cm), from 120 to 180. Legend: Girls, Boys.]


Contaminated example in 1D with the 2-means

[Figure: Height of students in 1BM, with contamination, clustered with the 2-means. Axis: Height (in cm), from 120 to 180. Legend: Girls, Boys.]

ER = 0.609


Contaminated example in 1D with the 2-medoids

[Figure: Height of students in 1BM, with contamination, clustered with the 2-medoids. Axis: Height (in cm), from 120 to 180. Legend: Girls, Boys.]

ER = 0.370


Classification setting

Suppose $X$ arises from 2 groups $G_1$ and $G_2$ with $\pi_i(F) = \mathbb{P}_F[X \in G_i]$. Then $F$ is a mixture of two distributions,

$$F = \pi_1(F) F_1 + \pi_2(F) F_2$$

with densities $f_1$ and $f_2$.

Additional assumption: one dimension!


The generalized 2-means as statistical functionals

The cluster centers $(T_1(F), T_2(F))$ are solutions of

$$\min_{\{t_1, t_2\} \subset \mathbb{R}^2} \int \Omega\left(\inf_{1 \le j \le 2} \|x - t_j\|\right) dF(x)$$

for a suitable nondecreasing penalty function $\Omega$.

The classification rule is

$$R_F(x) = C_j(F) \Leftrightarrow \Omega(\|x - T_j(F)\|) = \min_{1 \le i \le 2} \Omega(\|x - T_i(F)\|)$$


Optimality in classification

A classification rule is optimal if the corresponding error rate is minimal.

The optimal classification rule is the Bayes rule (Anderson, 1958):

$$x \in C_1 \Leftrightarrow \pi_1(F) f_1(x) > \pi_2(F) f_2(x)$$

The 2-means procedure is optimal under the model

$$F_N = 0.5\, N(\mu_1, \sigma^2) + 0.5\, N(\mu_2, \sigma^2) \quad \text{with } \mu_1 < \mu_2$$

(Qiu and Tamhane, 2007)
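For two normal components with a common variance, the Bayes rule reduces to a single cut-off: solving $\pi_1 f_1(x) = \pi_2 f_2(x)$ gives $x^* = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \ln(\pi_1/\pi_2)}{\mu_2 - \mu_1}$. The sketch below, with hypothetical function names, evaluates this cut-off. With equal weights it coincides with the midpoint of the two means, which is consistent with the optimality of the 2-means under $F_N$ stated above.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_cutoff(pi1, mu1, mu2, sigma):
    """Point where pi1 * f1 = pi2 * f2 for N(mu1, sigma^2) and N(mu2, sigma^2), mu1 < mu2.

    Observations below the returned value are assigned to C1 by the Bayes rule.
    """
    pi2 = 1.0 - pi1
    return 0.5 * (mu1 + mu2) + sigma**2 * math.log(pi1 / pi2) / (mu2 - mu1)

# Balanced model F_N with mu1 = -1, mu2 = 1: the cut-off is the midpoint.
print(bayes_cutoff(0.5, -1.0, 1.0, 1.0))
# Unbalanced weights move the cut-off towards the mean of the smaller group.
print(bayes_cutoff(0.4, -1.5, 1.5, 1.0))
```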


Theoretical vs empirical error rate

Theoretical error rate:
• Training sample according to $F$: estimation of the rule
• Test sample according to $F_m$: evaluation of the rule
• In ideal circumstances: $F = F_m$

$$ER(F, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\left[R_F(X) \ne C_j(F) \mid G_j\right]$$

Empirical error rate:
• Training sample according to $F$: estimation and evaluation of the rule

$$ER(F, F) = \sum_{j=1}^{2} \pi_j(F)\, \mathbb{P}_{F}\left[R_F(X) \ne C_j(F) \mid G_j\right]$$


Contaminated distribution

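A contaminated distribution is defined by a two-branch scheme: with probability $1 - \varepsilon$ the observation is drawn from $F$, and with probability $\varepsilon$ it is drawn from $G$, where $G$ is an arbitrary distribution function.

To see the influence of one single point $x$, take $G = \Delta_x$, the point mass at $x$, leading to

$$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$$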


Contaminated mixture

[Figure: two density plots against x, from −4 to 4. Left panel, "Mixture": components with weights 0.4 and 0.6. Right panel, "Contaminated mixture": components with weights (1−ε)0.4 and (1−ε)0.6.]


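Theoretical vs empirical error rate under contamination

Now, the training sample is distributed as $F_\varepsilon$, which is a contaminated mixture.

Theoretical error rate:

$$ER(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\left[R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \mid G_j\right]$$

Empirical error rate:

$$ER(F_\varepsilon, F_\varepsilon) = \sum_{j=1}^{2} \pi_j(F_\varepsilon)\, \mathbb{P}_{F_\varepsilon}\left[R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \mid G_j\right]$$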


Theoretical vs empirical error rate under contamination

Graphically:

$F_m = F_N \equiv 0.5\, N(-1, 1) + 0.5\, N(1, 1)$, an optimal model

$C(F_N) = \frac{-1 + 1}{2} = 0$ (Qiu and Tamhane, 2007)

$F_\varepsilon = (1 - \varepsilon) F_m + \varepsilon \Delta_x$

• $\varepsilon$ varying and $x = -0.5$
• $x \in G_1$ varying and $\varepsilon = 0.1$


Theoretical vs empirical error rate under contamination

Theoretical error rate (with the 2-means):

[Figure: two panels. Left: error rate against ε, for ε from 0.0 to 0.8, with error rates between 0.160 and 0.180. Right: error rate against x, for x from −2 to 2, with error rates between 0.159 and 0.163.]


Theoretical vs empirical error rate under contamination

Empirical error rate (with the 2-means):

[Figure: two panels. Left: error rate against ε, for ε from 0.0 to 0.8, with error rates between 0.0 and 1.0 and two curves labelled "If x is well placed" and "If not". Right: error rate against x, for x from −2 to 2, with error rates between 0.14 and 0.24.]


Influence function of the error rate


Influence functions

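Hampel et al. (1986): for any statistical functional $T$ and any distribution $F$,

$$IF(x; T, F) = \lim_{\varepsilon \to 0} \frac{T(F_\varepsilon) - T(F)}{\varepsilon} = \left.\frac{\partial}{\partial \varepsilon} T(F_\varepsilon)\right|_{\varepsilon = 0}$$

where

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$;

$E_F[IF(X; T, F)] = 0$;

$T(F_\varepsilon) \approx T(F) + \varepsilon\, IF(x; T, F)$ for $\varepsilon$ small enough (first-order von Mises expansion of $T$ at $F$).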
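To make the definition concrete, here is a small numerical sketch for the simplest functional, the mean, whose influence function is known to be $IF(x; T, F) = x - E_F[X]$. The mean is used only as an illustration and is not a functional studied in the talk; $F$ is taken to be the empirical distribution of a hypothetical sample, and the function names are assumptions.

```python
import numpy as np

def contaminated_mean(sample, x, eps):
    """Mean functional T evaluated at F_eps = (1 - eps) * F_n + eps * Delta_x."""
    return (1.0 - eps) * np.mean(sample) + eps * x

def numerical_if_mean(sample, x, eps=1e-6):
    """Finite-difference approximation of IF(x; T, F_n) for the mean functional."""
    return (contaminated_mean(sample, x, eps) - np.mean(sample)) / eps

sample = np.array([1.0, 2.0, 3.0, 6.0])      # hypothetical data, mean = 3
print(numerical_if_mean(sample, 10.0))        # close to 10 - 3
# The influence function averages to zero over the sample itself.
print(np.mean([numerical_if_mean(sample, xi) for xi in sample]))
```

The same finite-difference recipe applies to any functional that can be evaluated at $F_\varepsilon$, which is how the plots of an influence function can be checked numerically.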


Theoretical vs empirical error rate

$ER(F_\varepsilon, F) \approx ER(F, F) + \varepsilon\, IF(x; ER, F)$

$ER(F_\varepsilon, F_\varepsilon) \approx ER(F, F) + \varepsilon\, IF(x; ER, F)$

Theoretical error rate:

$ER(F_\varepsilon, F_N) \ge ER(F_N, F_N) \Rightarrow IF(x; ER, F_N) \equiv 0$

Empirical error rate: the IF does not vanish!

From now on, $ER(F) = ER(F, F)$.


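$ER(F_\varepsilon) = ?$

$$ER(F) = \sum_{j=1}^{2} \pi_j(F)\, \mathbb{P}_F\left[R_F(X) \ne C_j(F) \mid G_j\right] = \pi_1(F)\{1 - F_1(C(F))\} + \pi_2(F) F_2(C(F))$$

Under $F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$, one has

$$ER(F_\varepsilon) = \pi_1(F_\varepsilon)\{1 - F_{1,\varepsilon}(C(F_\varepsilon))\} + \pi_2(F_\varepsilon) F_{2,\varepsilon}(C(F_\varepsilon))$$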
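As a worked instance of the closed form for $ER(F)$, the sketch below evaluates it for a mixture of two normal distributions with a common standard deviation. The function names and the normality of $F_1$ and $F_2$ are assumptions made for the illustration.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def error_rate(cut, pi1, mu1, mu2, sigma=1.0):
    """ER = pi1 * (1 - F1(C)) + pi2 * F2(C) for a cut-off C and normal components."""
    pi2 = 1.0 - pi1
    return pi1 * (1.0 - normal_cdf(cut, mu1, sigma)) + pi2 * normal_cdf(cut, mu2, sigma)

# Balanced model 0.5 N(-1, 1) + 0.5 N(1, 1) with the cut-off C = 0.
print(error_rate(0.0, 0.5, -1.0, 1.0))
# Moving the cut-off away from 0 increases the error rate under this model.
print(error_rate(0.3, 0.5, -1.0, 1.0))
```

For the balanced model with cut-off 0 the value is $1 - \Phi(1)$, roughly 0.159.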


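$\pi_i(F_\varepsilon) = ?$ and $F_{i,\varepsilon} = ?$

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$

$$\pi_i(F_\varepsilon) = (1 - \varepsilon)\, \pi_i(F) + \varepsilon\, I\{x \in G_i\}$$

$$F_{i,\varepsilon} = \left(1 - \frac{\varepsilon\, I\{x \in G_i\}}{\pi_i(F_\varepsilon)}\right) F_i + \frac{\varepsilon\, I\{x \in G_i\}}{\pi_i(F_\varepsilon)}\, \Delta_x$$

$\Rightarrow F_\varepsilon = \pi_1(F_\varepsilon) F_{1,\varepsilon} + \pi_2(F_\varepsilon) F_{2,\varepsilon}$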
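The decomposition can be checked numerically at any point $t$. The sketch below assumes normal components with unit variance and a contaminating point $x$ labelled as coming from group `group`. All names are hypothetical.

```python
import math

def normal_cdf(t, mu, sigma=1.0):
    """Distribution function of N(mu, sigma^2) at t."""
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def contaminated_parts(t, x, eps, group, pi1, mu1, mu2):
    """Return (F_eps(t), sum_i pi_i(F_eps) * F_{i,eps}(t)) for a point mass at x."""
    pis = {1: pi1, 2: 1.0 - pi1}
    cdfs = {1: normal_cdf(t, mu1), 2: normal_cdf(t, mu2)}
    step = 1.0 if x <= t else 0.0                       # cdf of Delta_x at t
    f_eps = (1.0 - eps) * (pis[1] * cdfs[1] + pis[2] * cdfs[2]) + eps * step
    total = 0.0
    for i in (1, 2):
        ind = 1.0 if group == i else 0.0                # I{x in G_i}
        pi_eps = (1.0 - eps) * pis[i] + eps * ind       # pi_i(F_eps)
        w = eps * ind / pi_eps
        f_i_eps = (1.0 - w) * cdfs[i] + w * step        # F_{i,eps}(t)
        total += pi_eps * f_i_eps
    return f_eps, total

lhs, rhs = contaminated_parts(t=0.2, x=-0.5, eps=0.1, group=1, pi1=0.4, mu1=-1.5, mu2=1.5)
print(lhs, rhs)   # the two values agree up to rounding
```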


Influence function of the empirical error rate

Proposition

For all $x \ne C(F)$,

$$IF(x; ER, F) = -ER(F) + I\{x \le C(F)\}\,(1 - 2\, I\{x \in G_1\}) + I\{x \in G_1\} + \frac{1}{2}\bigl(IF(x; T_1, F) + IF(x; T_2, F)\bigr)\,\bigl\{\pi_2(F) f_2(C(F)) - \pi_1(F) f_1(C(F))\bigr\}.$$

Expressions of $IF(x; T_1, F)$ and $IF(x; T_2, F)$ have been computed by García-Escudero and Gordaliza (1999).
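A sketch of how this expression can be obtained, assuming that $\varepsilon \mapsto C(F_\varepsilon)$ is differentiable at 0 and that $x \ne C(F)$, so that $I\{x \le C(F_\varepsilon)\} = I\{x \le C(F)\}$ for small $\varepsilon$. It uses only the expressions of $\pi_i(F_\varepsilon)$ and $F_{i,\varepsilon}$ given earlier. The notation $IF(x; C, F)$ for the influence function of the cut-off is introduced here for convenience.

```latex
\begin{align*}
ER(F_\varepsilon)
 &= \pi_1(F_\varepsilon) - \pi_1(F_\varepsilon) F_{1,\varepsilon}(C(F_\varepsilon))
    + \pi_2(F_\varepsilon) F_{2,\varepsilon}(C(F_\varepsilon)), \\
\pi_i(F_\varepsilon) F_{i,\varepsilon}(t)
 &= (1-\varepsilon)\,\pi_i(F) F_i(t) + \varepsilon\, I\{x \in G_i\}\, I\{x \le t\}, \\
\left.\frac{\partial}{\partial\varepsilon} ER(F_\varepsilon)\right|_{\varepsilon=0}
 &= -\pi_1(F) + \pi_1(F) F_1(C(F)) - \pi_2(F) F_2(C(F)) + I\{x \in G_1\} \\
 &\quad + I\{x \le C(F)\}\,\bigl(I\{x \in G_2\} - I\{x \in G_1\}\bigr) \\
 &\quad + IF(x; C, F)\,\bigl\{\pi_2(F) f_2(C(F)) - \pi_1(F) f_1(C(F))\bigr\}.
\end{align*}
```

The first three terms of the derivative sum to $-ER(F)$, the identity $I\{x \in G_2\} = 1 - I\{x \in G_1\}$ gives the factor $1 - 2\, I\{x \in G_1\}$, and $C = (T_1 + T_2)/2$ gives $IF(x; C, F) = \frac{1}{2}\bigl(IF(x; T_1, F) + IF(x; T_2, F)\bigr)$.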


Graphics of IF(x ; ER, F )

[Figure: two panels showing IF(x; ER, F) against x, for x from −4 to 4, with one curve for x in G1 and one for x in G2.]


Graphics of IF(x ; ER, F )

[Figure: two panels showing IF(x; ER, F) against x, for x from −4 to 4, with curves for π1 = 0.2, 0.4, 0.6 and 0.8.]


Graphics of IF(x ; ER, F )

$\Delta = \mu_2 - \mu_1$

[Figure: two panels showing IF(x; ER, F) against x, for x from −4 to 4, with curves for Δ = 1, 3 and 5.]


Graphics of IF(x ; ER, F )

[Figure: two panels showing IF(x; ER, F) against x, for x from −4 to 4, with curves for σ = 0.5, 1, 1.5 and 2.]


Bias of the error rate


Definition

In practice, $F$ is replaced by $F_n$, the empirical cdf.

The bias is defined by $B_n(ER) = E_F[ER(F_n) - ER(F)]$.

Fernholz (2001):

$$B_n(ER) = \frac{1}{2n} E_F[IF2(X; ER, F)] + o(n^{-1})$$

where

$$IF2(x; ER, F) = \left.\frac{\partial^2}{\partial \varepsilon^2} ER\bigl((1 - \varepsilon) F + \varepsilon \Delta_x\bigr)\right|_{\varepsilon = 0}.$$

This result holds if $ER(\cdot)$ is Hadamard or Fréchet differentiable.


Computation of the bias

Proposition

Under asymptotic normality and consistency of $T_1$ and $T_2$ (Pollard, 1981 and 1982):

$$B_n(ER) \approx \frac{1}{4n}\bigl\{\pi_2(F) f_2(C(F)) - \pi_1(F) f_1(C(F))\bigr\}\bigl(E_F[IF2(X; T_1, F)] + E_F[IF2(X; T_2, F)]\bigr) + \frac{1}{8n}\bigl\{\pi_2(F) f_2'(C(F)) - \pi_1(F) f_1'(C(F))\bigr\}\bigl(ASV(T_1) + ASV(T_2) + 2\, ASC(T_1, T_2)\bigr).$$

Expressions of $IF2(X; T_1, F)$ and $IF2(X; T_2, F)$ have been computed.


Asymptotic difference

How much difference in error rate is to be expected when a clustering rule is estimated from a finite sample?

$$\text{A-Diff}(ER) = \lim_{n \to \infty} n\, B_n(ER)$$

⇒ Graphical comparisons of the 2-means and 2-medoids methods:

$F = \pi_1 N(-\Delta/2, 1) + (1 - \pi_1) N(\Delta/2, 1)$
• $\pi_1$ varying and $\Delta = 3$
• $\Delta$ varying and $\pi_1 = 0.4$

$F_N = 0.5\, N(-\Delta/2, 1) + 0.5\, N(\Delta/2, 1)$


Graphics of A-Diff(ER) under F

[Figure: two panels comparing the 2-means and the 2-medoids. Left: A-Diff against π1, for π1 from 0.2 to 0.8. Right: A-Diff against Δ, for Δ from 0 to 10.]


Graphic of A-Diff(ER) under FN

[Figure: A-Diff for the 2-means and the 2-medoids, horizontal axis from 0 to 10, A-Diff values between −0.06 and 0.00.]


Simulation study


Simulation settings

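$F = 0.4\, N(-1.5, 1) + 0.6\, N(1.5, 1)$

$n = 100$

$F_\varepsilon = (1 - \varepsilon) F + \varepsilon \Delta_x$ with
• $\varepsilon = 0.01$ and $|x| = 5$
• $\varepsilon = 0.05$ and $|x| = 5$
• $\varepsilon = 0.01$ and $|x| = 50$
• $\varepsilon = 0.05$ and $|x| = 50$

$N = 1000$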
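The slides do not spell out the simulation protocol, so the following is only a plausible sketch under explicit assumptions: each of the $n$ observations is replaced by the point $x$ with probability $\varepsilon$ and given the label `group`; the centers are found by exhaustive search over split points; the empirical error rate is computed on the contaminated training sample, with cluster $C_1$ (left of the cut-off) matched to $G_1$. All function names are hypothetical.

```python
import numpy as np

def fit_centers(x, penalty):
    """Exhaustive 1D generalized 2-means: 'square' -> 2-means, 'abs' -> 2-medoids."""
    xs = np.sort(x)
    center = np.mean if penalty == "square" else np.median
    omega = np.square if penalty == "square" else np.abs
    best_cost, best = np.inf, None
    for k in range(1, len(xs)):
        t1, t2 = center(xs[:k]), center(xs[k:])
        cost = omega(xs[:k] - t1).sum() + omega(xs[k:] - t2).sum()
        if cost < best_cost:
            best_cost, best = cost, (t1, t2)
    return best

def empirical_error_rate(x, labels, t1, t2):
    """Share of observations whose cluster (1 below the cut-off, 2 above) differs from the label."""
    cut = 0.5 * (t1 + t2)
    predicted = np.where(x < cut, 1, 2)
    return float(np.mean(predicted != labels))

def simulate(eps, x0, group, penalty, n=100, reps=1000, seed=0):
    """Average empirical error rate over `reps` contaminated samples of size n."""
    rng = np.random.default_rng(seed)
    rates = np.empty(reps)
    for r in range(reps):
        labels = np.where(rng.random(n) < 0.4, 1, 2)          # pi1 = 0.4, pi2 = 0.6
        x = rng.normal(np.where(labels == 1, -1.5, 1.5), 1.0)
        bad = rng.random(n) < eps                              # contaminated positions
        x[bad] = x0
        labels[bad] = group
        rates[r] = empirical_error_rate(x, labels, *fit_centers(x, penalty))
    return float(rates.mean())

if __name__ == "__main__":
    for penalty in ("square", "abs"):
        print(penalty, simulate(eps=0.01, x0=-5.0, group=1, penalty=penalty, reps=100))
```

The number of replications is kept small in the usage example so that it runs quickly. Whether `reps` corresponds to the $N$ of the slides is itself an assumption.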


Simulation results

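               2-means (0.071)      2-medoids (0.072)
               in C1     in C2      in C1     in C2
from G1        0.068     0.083      0.071     0.083
               0.064     0.139      0.066     0.127
               0.39      0.61       0.072     0.083
               0.35      0.65       0.35      0.65
from G2        0.078     0.072      0.081     0.073
               0.119     0.083      0.116     0.072
               0.41      0.59       0.081     0.073
               0.45      0.55       0.45      0.55

Within each group, the four rows appear to follow the order of the four contamination settings (ε = 0.01 and |x| = 5; ε = 0.05 and |x| = 5; ε = 0.01 and |x| = 50; ε = 0.05 and |x| = 50).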


Contaminated example with the 2-medoids

[Figure: Height of students in 1BM, with contamination, clustered with the 2-medoids. Axis: Height (in cm), from 120 to 180. Legend: Girls, Boys.]


Conclusion of these simulations

A well-placed contamination in the smallest group makes the error rate decrease.

Too much contamination makes the error rate of the 2-means break down.

Unfortunately, the error rate of the 2-medoids also breaks down when there are too many outliers and they lie too far away!


Future research


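Generalized trimmed 2-means: for $\alpha \in [0, 1]$, $(T_1(F), T_2(F))$ are solutions of

$$\min_{A : F(A) = 1 - \alpha}\ \min_{\{t_1, t_2\} \subset \mathbb{R}} \int_A \Omega\left(\inf_{1 \le j \le 2} \|x - t_j\|\right) dF(x).$$

Theoretical error rate

$$ER(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathbb{P}_{F_m}\left[R_{F_\varepsilon}(X) \ne C_j(F_\varepsilon) \mid G_j\right]$$

More than 1 dimension or more than 2 groups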


Thank you for your attention!


Bibliography

CROUX C., FILZMOSER P. and JOOSSENS K. (2008), Classification efficiencies for robust linear discriminant analysis, Statistica Sinica, 18, pp. 581-599.

CROUX C., HAESBROECK G. and JOOSSENS K. (2008), Logistic discrimination using robust estimators: an influence function approach, The Canadian Journal of Statistics, 36, pp. 157-174.

FERNHOLZ L. T. (2001), On multivariate higher order von Mises expansions, Metrika, 53, pp. 123-140.


GARCÍA-ESCUDERO L. A. and GORDALIZA A. (1999), Robustness properties of k means and trimmed k means, Journal of the American Statistical Association, 94 (447), pp. 956-969.

HAMPEL F. R., RONCHETTI E. M., ROUSSEEUW P. J. and STAHEL W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, John Wiley and Sons, New York.

ANDERSON T. W. (1958), An Introduction to Multivariate Statistical Analysis, Wiley, New York, pp. 126-133.


POLLARD D. (1981), Strong consistency of k-means clustering, The Annals of Probability, 9 (4), pp. 919-926.

POLLARD D. (1982), A central limit theorem for k-means clustering, The Annals of Probability, 10 (1), pp. 135-140.

QIU D. and TAMHANE A. C. (2007), A comparative study of the k-means algorithm and the normal mixture model for clustering: univariate case, Journal of Statistical Planning and Inference, 137, pp. 3722-3740.
