
Statistical framework Empirical Risk Minimization Cross-validation Results Conclusion

Segmentation of the mean of heteroscedastic data via cross-validation

Alain Celisse

¹ UMR 8524 CNRS - Université Lille 1

² SSB Group, Paris

joint work with Sylvain Arlot

GDR "Statistique et Santé"

Paris, October 21, 2009



Illustration: Original signal

[Figure: the original signal plotted against position t (0 to 100).]


Illustration: Observed signal (discretized)

[Figure: the discretized signal (n = 100 observations) plotted against position t (0 to 100).]


Illustration: Find breakpoints

[Figure: the observed signal plotted against position t, with question marks at candidate breakpoint locations.]


Illustration: True regression function

[Figure: the observed signal and the true regression function plotted against position t; legend: Signal, Reg. func.]


Statistical framework: Change-point detection

$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0, 1] \times \mathcal{Y}$ independent,

$Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = \mathbb{R}$

Instants $t_i$: deterministic (e.g. $t_i = i/n$).

$s$: piecewise constant

Residuals $\varepsilon$: $E[\varepsilon_i] = 0$ and $E[\varepsilon_i^2] = 1$.

Noise level: $\sigma_i$ (heteroscedastic)
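As an illustration of this framework, here is a minimal simulation sketch. The function name `simulate_signal`, the Gaussian residuals, the breakpoint positions, the levels and the noise profile are all assumptions chosen for the example; the slides only require centered residuals with unit variance.

```python
import numpy as np

def simulate_signal(n, breakpoints, levels, sigma, rng):
    """Draw Y_i = s(t_i) + sigma_i * eps_i on the grid t_i = i/n."""
    t = np.arange(1, n + 1) / n
    # s is piecewise constant: levels[k] on the k-th block of indices.
    edges = np.concatenate(([0], breakpoints, [n])).astype(int)
    s = np.repeat(np.asarray(levels, dtype=float), np.diff(edges))
    # Gaussian residuals (an assumption): E[eps_i] = 0 and E[eps_i^2] = 1.
    eps = rng.standard_normal(n)
    y = s + sigma * eps
    return t, s, y

rng = np.random.default_rng(0)
n = 100
# Hypothetical heteroscedastic noise: three times larger on the second half.
sigma = np.where(np.arange(n) < 50, 0.1, 0.3)
t, s, y = simulate_signal(n, breakpoints=[30, 70], levels=[0.0, 1.0, 0.3],
                          sigma=sigma, rng=rng)
```

Here `breakpoints` holds the indices at which a new constant piece starts, so the example has D = 3 pieces and D − 1 = 2 breakpoints.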




Estimation versus Identification

Purpose:

Estimate s to recover most of the important jumps w.r.t. the noise level −→ Estimation purpose.

[Figure: zoom on positions 55 to 100, showing the signal Y and the regression function s.]

Strategy:

1 Use piecewise constant functions.

2 Adopt the model selection point of view.


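Model selection

Models:

$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$

$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$

Strategy:

$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\hat s_m)_{m \in \mathcal{M}_n} \longrightarrow \hat s_{\hat m}$ ???

Goal:

Oracle inequality (in expectation, or with large probability):

$\|s - \hat s_{\hat m}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat s_m\|^2 + R(m, n) \right\}$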




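Least-squares estimator

Empirical risk minimizer over $S_m$ (= model):

$\hat s_m \in \arg\min_{u \in S_m} P_n\gamma(u) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^n (u(t_i) - Y_i)^2 .$

Regressogram:

$\hat s_m = \sum_{\lambda \in \Lambda_m} \hat\beta_\lambda \mathbf{1}_{I_\lambda}, \qquad \hat\beta_\lambda = \frac{1}{\mathrm{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i .$

Oracle:

$m^* := \mathrm{Argmin}_{m \in \mathcal{M}_n} \|s - \hat s_m\|^2 .$

−→ $\hat s_{m^*}$: best estimator among $\{\hat s_m \mid m \in \mathcal{M}_n\}$.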
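The regressogram and its empirical risk can be sketched in a few lines. This is only an illustration: the names `regressogram` and `empirical_risk` are hypothetical, and a partition is encoded (by assumption) as the sorted list of indices at which a new interval starts.

```python
import numpy as np

def regressogram(y, breakpoints):
    """Least-squares fit on a partition: the mean of y on each interval."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    edges = np.concatenate(([0], breakpoints, [n])).astype(int)
    fit = np.empty(n)
    for a, b in zip(edges[:-1], edges[1:]):
        # beta_lambda: average of the Y_i whose t_i falls in I_lambda.
        fit[a:b] = y[a:b].mean()
    return fit

def empirical_risk(y, fit):
    """P_n gamma(u) = (1/n) * sum_i (u(t_i) - Y_i)^2."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((fit - y) ** 2))

y = [1.0, 3.0, 2.0, 10.0, 12.0]
fit = regressogram(y, breakpoints=[3])   # two intervals: y[0:3] and y[3:5]
risk = empirical_risk(y, fit)
```

On this toy input the fit is the two interval means, 2 and 11, and the empirical risk is (1 + 1 + 0 + 1 + 1)/5 = 0.8.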




Empirical Risk Minimization (ERM)

Assumption:

The number D − 1 of breakpoints is known.

Question:

Find the locations of the D − 1 breakpoints (D is given).

Strategy:

The "best" segmentation in D pieces is obtained by applying the ERM algorithm over $\bigcup_{D_m = D} S_m$:

ERM algorithm:

$\hat m_{\mathrm{ERM}}(D) = \mathrm{Argmin}_{m \mid D_m = D} P_n\gamma(\hat s_m) .$
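The minimization over all segmentations with D pieces can be carried out exactly by dynamic programming, because the empirical risk is a sum of per-segment costs. The slides do not say which algorithm is used, so the sketch below is one standard possibility rather than the authors' implementation; the name `erm_segmentation` is hypothetical.

```python
import numpy as np

def erm_segmentation(y, D):
    """Best segmentation of y into D pieces for the least-squares criterion.

    Returns (breakpoints, empirical_risk), where breakpoints are the start
    indices of pieces 2..D. Dynamic programming in O(D * n^2) time.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 1 <= D <= n:
        raise ValueError("need 1 <= D <= n")
    # Cumulative sums give the cost of any segment in O(1).
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(a, b):
        # Residual sum of squares of y[a:b] around its own mean.
        s1 = c1[b] - c1[a]
        return (c2[b] - c2[a]) - s1 * s1 / (b - a)

    # best[d, j]: minimal total cost of splitting y[:j] into d pieces.
    best = np.full((D + 1, n + 1), np.inf)
    arg = np.zeros((D + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):
                c = best[d - 1, i] + cost(i, j)
                if c < best[d, j]:
                    best[d, j], arg[d, j] = c, i
    # Backtrack the start index of each piece, from the last to the second.
    bps, j = [], n
    for d in range(D, 1, -1):
        j = int(arg[d, j])
        bps.append(j)
    return bps[::-1], float(best[D, n]) / n

bps, risk = erm_segmentation([0, 0, 0, 5, 5, 5, 5, 1, 1], D=3)
```

On this noise-free toy input the three constant pieces are recovered exactly, with breakpoints at indices 3 and 7 and zero empirical risk.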


ERM segmentation: Homoscedastic

[Figure: "Segmentation (Homoscedastic)"; signal plotted against position t (0 to 100); legend: Y_i, Signal, Oracle, ERM.]

−→ ERM is close to the oracle


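Expectations

Homoscedastic:

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \sigma^2 \frac{D_m}{n} + \text{const},$

$E[P_n\gamma(\hat s_m)] = \mathrm{dist}(s, S_m) - \sigma^2 \frac{D_m}{n} + \text{const}.$

Conclusions:

1 The variance term $\sigma^2 D_m / n$ is the same for all models with $D_m = D$, so it does not matter,

2 The models $S_m$ are only distinguished according to $\mathrm{dist}(s, S_m)$.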


ERM segmentation: Heteroscedastic

[Figure: "Segmentation (Heteroscedastic)"; signal plotted against position t (0 to 100); legend: Y_i, Signal, Oracle, ERM.]

−→ ERM overfits in noisy regions


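ERM overfitting: Expectations

Heteroscedastic:

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const},$

$E[P_n\gamma(\hat s_m)] = \mathrm{dist}(s, S_m) - \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const},$

with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^n \sigma_i^2 \mathbf{1}_{I_\lambda}(t_i)$, $n_\lambda := \mathrm{Card}(\{i \mid t_i \in I_\lambda\})$.

Conclusions:

1 The variance term differs between models $S_m$ of the same dimension D,

2 ERM tends to put breakpoints in the noisy regions.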
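Both expectations follow from a short computation, sketched here under two assumptions that the slides do not spell out: dist(s, S_m) is taken to be the empirical quadratic distance between s and its best approximation s_m in S_m, and R is taken to be the expected quadratic loss on an independent copy of the observations at the same design points.

```latex
% Notation (assumed): e_i = \sigma_i \varepsilon_i; \bar s_\lambda and \bar e_\lambda
% are the averages of s(t_i) and e_i over the n_\lambda design points in I_\lambda.
\begin{align*}
Y_i - \hat\beta_\lambda
  &= \bigl(s(t_i) - \bar s_\lambda\bigr) + \bigl(e_i - \bar e_\lambda\bigr),
  \qquad t_i \in I_\lambda, \\
E\Bigl[\sum_{t_i \in I_\lambda} (e_i - \bar e_\lambda)^2\Bigr]
  &= \sum_{t_i \in I_\lambda} \sigma_i^2
   - \frac{1}{n_\lambda} \sum_{t_i \in I_\lambda} \sigma_i^2
   = \sum_{t_i \in I_\lambda} \sigma_i^2 - (\sigma^r_\lambda)^2, \\
E\bigl[P_n\gamma(\hat s_m)\bigr]
  &= \underbrace{\frac{1}{n} \sum_{i=1}^n \bigl(s(t_i) - s_m(t_i)\bigr)^2}_{\mathrm{dist}(s, S_m)}
   - \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2
   + \frac{1}{n} \sum_{i=1}^n \sigma_i^2, \\
E\bigl[(\hat\beta_\lambda - \bar s_\lambda)^2\bigr]
  &= \frac{(\sigma^r_\lambda)^2}{n_\lambda}
  \;\Longrightarrow\;
  R(\hat s_m) = \mathrm{dist}(s, S_m)
   + \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2
   + \frac{1}{n} \sum_{i=1}^n \sigma_i^2 .
\end{align*}
```

The cross terms vanish in expectation because the residuals are centered, and the last sum is the constant term, which does not depend on the model. When all σ_i equal σ, each (σ^r_λ)² equals σ² and the sum over the D_m intervals gives σ² D_m / n, which recovers the homoscedastic case. In the heteroscedastic case the empirical risk is lowered most by intervals located where σ_i is large, which is why ERM is drawn towards noisy regions.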


Cross-validation principle

[Figure: four panels illustrating the cross-validation principle; horizontal axis from 0 to 1, vertical axis from −3 to 3.]


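Cross-validation

Leave-p-out (Lpo): $\forall\, 1 \le p \le n - 1$,

$\hat R_p(\hat s_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \hat s_m^{D^{(t)}}(X_i) - Y_i \right)^2 ,$

where $\mathcal{E}_p = \left\{ D^{(t)} \subset \{Z_1, \ldots, Z_n\} \mid \mathrm{Card}(D^{(t)}) = n - p \right\}$.

Algorithmic complexity: exponential.

Theorem (C. Ph.D. (2008))

$\hat R_p(\hat s_m) = \sum_{\lambda \in \Lambda(m)} \left\{ S_{\lambda,2} A_\lambda + (S_{\lambda,1}^2 - S_{\lambda,2}) B_\lambda \right\},$

where $S_{\lambda,1} := \sum_{i=1}^n Y_i \mathbf{1}_{I_\lambda}(t_i)$, $S_{\lambda,2} := \sum_{i=1}^n Y_i^2 \mathbf{1}_{I_\lambda}(t_i)$,

$A_\lambda, B_\lambda$: known functions.

Algorithmic complexity: O(n).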
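For p = 1 the closed form can be checked directly. The slides do not give A_λ and B_λ; the expressions used below, A_λ = n_λ / (n (n_λ − 1)) and B_λ = −n_λ / (n (n_λ − 1)²), are worked out here for the leave-one-out case only (from the fact that removing one point from an interval changes its mean in a known way) and are not quoted from the cited thesis. Function names are hypothetical, and every interval is assumed to contain at least two points.

```python
import numpy as np

def _edges(n, breakpoints):
    return np.concatenate(([0], breakpoints, [n])).astype(int)

def loo_risk_closed_form(y, breakpoints):
    """Leave-one-out risk of the regressogram from per-interval sums only."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    total = 0.0
    for a, b in zip(_edges(n, breakpoints)[:-1], _edges(n, breakpoints)[1:]):
        n_lam = b - a                       # assumed >= 2
        s1 = y[a:b].sum()                   # S_{lambda,1}
        s2 = (y[a:b] ** 2).sum()            # S_{lambda,2}
        A = n_lam / (n * (n_lam - 1))       # p = 1 only (derived here)
        B = -n_lam / (n * (n_lam - 1) ** 2)
        total += s2 * A + (s1 ** 2 - s2) * B
    return total

def loo_risk_brute_force(y, breakpoints):
    """Same quantity, recomputing the interval mean with each point left out."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    total = 0.0
    for a, b in zip(_edges(n, breakpoints)[:-1], _edges(n, breakpoints)[1:]):
        for i in range(a, b):
            others = np.delete(y[a:b], i - a)
            total += (others.mean() - y[i]) ** 2
    return total / n

y = [1.0, 2.0, 4.0, 7.0, 7.0, 9.0, 3.0]
closed = loo_risk_closed_form(y, breakpoints=[3])
brute = loo_risk_brute_force(y, breakpoints=[3])
```

The two values agree up to rounding. The closed form only needs the sums S_{λ,1} and S_{λ,2}, so it costs O(n) once those sums are available, in line with the complexity stated above.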

Applicability of cross-validation

Lpo-based model selection procedure:

1 Lpo is computationally tractable:

C. and Robin (2008), CSDA: Density

C. and Robin (2008), arXiv: Multiple Testing

C. Ph.D. (2008), TEL: Density, regression

C. (2009), arXiv: Density

2 As computationally expensive as ERM.

Lpo segmentation of dimension D:

For every $1 \le p \le n - 1$,

$\hat m_p(D) = \mathrm{Argmin}_{m \mid D_m = D} \hat R_p(\hat s_m).$


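Taking variance into account: Lpo expectation

Theorem (C. Ph.D. (2008))

Homoscedastic:

$E[\hat R_p(\hat s_m)] \approx \mathrm{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2 ,$

Heteroscedastic:

$E[\hat R_p(\hat s_m)] \approx \mathrm{dist}(s, S_m) + \frac{1}{n - p} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const}.$

Compare with the risk:

$R(\hat s_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_\lambda (\sigma^r_\lambda)^2 + \text{const}.$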


Leave-one-out (Lpo with p = 1): An alternative to ERM

Strategy:

Replace ERM by leave-one-out (Loo) to take variance into account.

Loo algorithm:

$\hat m_1(D) = \mathrm{Argmin}_{m \mid D_m = D} \hat R_1(\hat s_m).$

Conclusion:

Loo prevents overfitting.

[Figure: two panels, signal plotted against position (0 to 100). First panel legend: Oracle, ERM. Second panel legend: Oracle, Loo, ERM.]
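Since the Loo criterion is also a sum of per-interval terms, the same kind of dynamic programming that solves ERM can minimize it. The sketch below is an illustration under assumptions, not the authors' code: the name `loo_segmentation` is hypothetical, every interval is required to hold at least two points so that leaving one out is well defined, and the per-interval cost is the p = 1 expression derived for this example.

```python
import numpy as np

def loo_segmentation(y, D):
    """Segmentation of y into D pieces minimizing the leave-one-out risk.

    Returns (breakpoints, loo_risk); breakpoints are the start indices of
    pieces 2..D. Each piece must contain at least two points.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if D < 1 or n < 2 * D:
        raise ValueError("need D >= 1 and at least two points per piece")
    c1 = np.concatenate(([0.0], np.cumsum(y)))
    c2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def loo_cost(a, b):
        # Sum of squared leave-one-out residuals on y[a:b]: the usual
        # residual sum of squares inflated by (k / (k - 1))^2.
        k = b - a
        s1, s2 = c1[b] - c1[a], c2[b] - c2[a]
        return (k / (k - 1)) ** 2 * (s2 - s1 * s1 / k)

    # best[d, j]: minimal total cost of splitting y[:j] into d pieces.
    best = np.full((D + 1, n + 1), np.inf)
    arg = np.zeros((D + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, D + 1):
        for j in range(2 * d, n + 1):
            for i in range(2 * (d - 1), j - 1):
                c = best[d - 1, i] + loo_cost(i, j)
                if c < best[d, j]:
                    best[d, j], arg[d, j] = c, i
    bps, j = [], n
    for d in range(D, 1, -1):
        j = int(arg[d, j])
        bps.append(j)
    return bps[::-1], float(best[D, n]) / n

bps, risk = loo_segmentation([0, 0, 0, 5, 5, 5, 5, 1, 1], D=3)
```

Compared with least squares, only the per-interval cost changes: the factor (k / (k − 1))² inflates the residuals of short intervals, which is one way to read the claim that Loo is less inclined than ERM to carve small pieces out of noisy regions.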


Quality of the segmentations w.r.t. D

[Figure: two panels showing the average loss value against the number of breakpoints for ERM and Loo, N = 300 trials. First panel: "Segmentation quality (Homosc.)". Second panel: "Segmentation quality (heterosc.)".]


Quality of the best segmentation

s·   σ·      ERM            Loo
2    c       2.88 ± 0.01    2.93 ± 0.01
2    pc,1    1.31 ± 0.02    1.16 ± 0.02
2    pc,3    3.09 ± 0.03    2.52 ± 0.03
3    c       3.18 ± 0.01    3.25 ± 0.01
3    pc,1    3.00 ± 0.01    2.67 ± 0.02
3    pc,3    4.41 ± 0.02    3.97 ± 0.02

Table: Average of $E\left[\inf_D \|s - \hat s_{\hat m_A(D)}\|^2\right] / E\left[\inf_m \|s - \hat s_m\|^2\right]$ over 10 000 samples. A denotes either ERM or Loo.

−→ Same results when D is chosen by VFCV.




Summary

1 Lpo takes variance into account

−→ outperforms ERM (heteroscedastic).

−→ close to ERM (homoscedastic).

2 Lpo is fully tractable (closed-form expressions)

−→ as computationally expensive as ERM.

3 Similar results when D is chosen by V-fold cross-validation.

Conclusion:

Cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.

−→ Arlot and C. (2009), arXiv




The Bt474 Cell lines

These are epithelial cells

Obtained from human breast cancer tumors

A test genome is compared to a reference male genome

We only consider chromosomes 1 and 9


Results: Chromosome 9

[Figure: three panels. Homoscedastic model (Picard et al. (05)); Heteroscedastic model (Picard et al. (05)); LOO+VFCV.]


Results: Chromosome 1

[Figure: three panels. Homoscedastic model (Picard et al. (05)); Heteroscedastic model (Picard et al. (05)); LOO+VFCV.]




Prospects

1 Optimality results for segmentation procedures.

2 Other resampling schemes (Bootstrap, Rademacher penalties, ...)

3 Extension to the multivariate setting: Detect ANR project

Biology: Multi-patient CGH profile segmentation.

Computer vision: Video segmentation

Thank you.

