Statistical framework Empirical Risk Minimization Cross-validation Results Conclusion
Segmentation of the mean of heteroscedastic data via cross-validation
Alain Celisse
(1) UMR 8524 CNRS - Université Lille 1
(2) SSB Group, Paris
Joint work with Sylvain Arlot
GDR "Statistique et Santé"
Paris, October 21, 2009
Segmentation of the mean of heteroscedastic data via cross-validation Alain Celisse
Illustration: Original signal
[Figure: original signal; x-axis: position t, y-axis: signal]
Illustration: Observed signal (discretized)
[Figure: discretized signal (n = 100 observations); x-axis: position t, y-axis: signal]
Illustration: Find breakpoints
[Figure: observed signal with candidate breakpoint locations marked "?"; x-axis: position t, y-axis: signal]
Illustration: True regression function
[Figure: observed signal with the true regression function overlaid; x-axis: position t, y-axis: signal; legend: Signal, Reg. func.]
Statistical framework: Change-point detection
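$(t_1, Y_1), \ldots, (t_n, Y_n) \in [0, 1] \times \mathcal{Y}$ independent,
$Y_i = s(t_i) + \sigma_i \varepsilon_i \in \mathcal{Y} = \mathbb{R}$
Instants $t_i$: deterministic (e.g. $t_i = i/n$).
$s$: piecewise constant
Residuals $\varepsilon_i$: $E[\varepsilon_i] = 0$ and $E[\varepsilon_i^2] = 1$.
Noise level: $\sigma_i$ (heteroscedastic)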
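A minimal sketch of how data from this model could be simulated. The slides only require residuals with mean 0 and variance 1; Gaussian residuals, the breakpoint positions, the levels, and the noise profile below are all hypothetical choices made for illustration.

```python
import random

def regression_function(t, breakpoints, levels):
    """Piecewise constant s: levels[k] on the k-th interval delimited by breakpoints."""
    k = sum(1 for b in breakpoints if t >= b)
    return levels[k]

def simulate(n, breakpoints, levels, sigma, seed=0):
    """Draw Y_i = s(t_i) + sigma(t_i) * eps_i at the deterministic instants t_i = i/n.

    eps_i is standard Gaussian here (an assumption; the model only needs
    E[eps_i] = 0 and E[eps_i^2] = 1).
    """
    rng = random.Random(seed)
    t = [i / n for i in range(1, n + 1)]
    s = [regression_function(ti, breakpoints, levels) for ti in t]
    y = [si + sigma(ti) * rng.gauss(0.0, 1.0) for ti, si in zip(t, s)]
    return t, s, y

# Hypothetical example: three jumps, noise level four times larger on the right half.
t, s, y = simulate(
    n=100,
    breakpoints=[0.2, 0.5, 0.8],
    levels=[0.0, 1.0, 0.3, 0.7],
    sigma=lambda u: 0.05 if u < 0.5 else 0.2,
)
```

Passing `sigma` as a function of the position keeps the homoscedastic case available as `sigma=lambda u: 0.1`.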
Estimation versus Identification
Purpose:
Estimate s to recover most of the important jumps w.r.t. the noise
level −→ Estimation purpose.
[Figure: zoom on positions 55 to 100; legend: signal Y, regression function s]
Strategy:
1. Use piecewise constant functions.
2. Adopt the model selection point of view.
Model selection
Models:
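$(I_\lambda)_{\lambda \in \Lambda_m}$: partition of $[0, 1]$
$S_m$: linear space of piecewise constant functions on $(I_\lambda)_{\lambda \in \Lambda_m}$
Strategy:
$(S_m)_{m \in \mathcal{M}_n} \longrightarrow (\hat{s}_m)_{m \in \mathcal{M}_n} \longrightarrow \hat{s}_{\hat{m}}$ ???
Goal:
Oracle inequality (in expectation, or with large probability):
$$\|s - \hat{s}_{\hat{m}}\|^2 \le C \inf_{m \in \mathcal{M}_n} \left\{ \|s - \hat{s}_m\|^2 + R(m, n) \right\}$$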
Least-squares estimator
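Empirical risk minimizer over $S_m$ (= model):
$$\hat{s}_m \in \arg\min_{u \in S_m} P_n\gamma(u) = \arg\min_{u \in S_m} \frac{1}{n} \sum_{i=1}^{n} \left( u(t_i) - Y_i \right)^2 .$$
Regressogram:
$$\hat{s}_m = \sum_{\lambda \in \Lambda_m} \hat{\beta}_\lambda \mathbf{1}_{I_\lambda}, \qquad \hat{\beta}_\lambda = \frac{1}{\mathrm{Card}\{t_i \in I_\lambda\}} \sum_{t_i \in I_\lambda} Y_i .$$
Oracle:
$$m^* := \arg\min_{m \in \mathcal{M}_n} \|s - \hat{s}_m\|^2 .$$
$\longrightarrow \hat{s}_{m^*}$: best estimator among $\{\hat{s}_m \mid m \in \mathcal{M}_n\}$.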
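The regressogram and the oracle can be sketched as follows. A segmentation is encoded here as a tuple of breakpoint indices (a hypothetical encoding), and the norm is taken to be the mean squared difference at the design points, which is an assumption since the slides do not define it.

```python
def regressogram(y, breakpoints):
    """Least-squares fit on a segmentation: the mean of Y_i on each segment.

    `breakpoints` are sorted indices 0 < b_1 < ... < b_{D-1} < n, so that the
    k-th segment is y[b_k:b_{k+1}]. Returns the fitted value at every t_i.
    """
    bounds = [0] + list(breakpoints) + [len(y)]
    fit = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        mean = sum(y[a:b]) / (b - a)
        fit.extend([mean] * (b - a))
    return fit

def empirical_risk(y, fit):
    """P_n gamma(u) = (1/n) * sum_i (u(t_i) - Y_i)^2."""
    return sum((f - yi) ** 2 for f, yi in zip(fit, y)) / len(y)

def loss(s, fit):
    """||s - fit||^2, taken as the mean squared difference at the t_i (assumption)."""
    return sum((si - f) ** 2 for si, f in zip(s, fit)) / len(s)

def oracle(s, y, candidates):
    """Segmentation whose regressogram is closest to the true s (needs s, so not usable in practice)."""
    return min(candidates, key=lambda m: loss(s, regressogram(y, m)))

# Toy data: s jumps from 0 to 1 after the second observation.
s = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
y = [0.1, -0.1, 1.2, 0.8, 1.0, 0.9]
print(oracle(s, y, [(2,), (3,), (4,)]))
```

The oracle depends on the unknown $s$, which is why it serves as a benchmark rather than as a procedure.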
Empirical Risk Minimization (ERM)
Assumption:
The number D − 1 of breakpoints is known.
Question:
Find the locations of the D − 1 breakpoints (D is given).
Strategy:
The "best" segmentation in $D$ pieces is obtained by applying the ERM algorithm over $\bigcup_{D_m = D} S_m$:
ERM algorithm:
$$\hat{m}_{\mathrm{ERM}}(D) = \arg\min_{m \mid D_m = D} P_n\gamma(\hat{s}_m) .$$
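The slides do not say how this argmin is computed. A standard choice, assumed here, is dynamic programming over the position of the last breakpoint, using prefix sums so that the residual sum of squares of any segment costs O(1). The function names are hypothetical.

```python
def rss_cost(y):
    """Return cost(a, b): residual sum of squares of y[a:b] around its mean."""
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    return cost

def best_segmentation(n, cost, D):
    """Minimise the sum of segment costs over segmentations of range(n) into D pieces.

    Returns (breakpoints, minimal total cost). Runs in O(D * n^2).
    """
    inf = float("inf")
    # best[d][b]: minimal cost of splitting the first b points into d segments.
    best = [[inf] * (n + 1) for _ in range(D + 1)]
    arg = [[0] * (n + 1) for _ in range(D + 1)]
    best[0][0] = 0.0
    for d in range(1, D + 1):
        for b in range(d, n + 1):
            for a in range(d - 1, b):
                c = best[d - 1][a] + cost(a, b)
                if c < best[d][b]:
                    best[d][b] = c
                    arg[d][b] = a
    # Backtrack the start of each last segment to recover the breakpoints.
    breakpoints, b = [], n
    for d in range(D, 1, -1):
        b = arg[d][b]
        breakpoints.append(b)
    return breakpoints[::-1], best[D][n]

y = [0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 3.0, 3.1]
breakpoints, total = best_segmentation(len(y), rss_cost(y), D=3)
print(breakpoints)  # [3, 6]
```

Dividing `total` by `n` gives the empirical risk $P_n\gamma(\hat{s}_m)$ of the selected segmentation; the division does not change the argmin.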
ERM segmentation: Homoscedastic
[Figure: Segmentation (Homoscedastic); x-axis: position t, y-axis: signal; legend: Yi, Signal, Oracle, ERM]
−→ ERM is close to the oracle
Expectations
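Homoscedastic:
$$R(\hat{s}_m) = \mathrm{dist}(s, S_m) + \frac{\sigma^2 D_m}{n} + \text{const.},$$
$$E[P_n\gamma(\hat{s}_m)] = \mathrm{dist}(s, S_m) - \frac{\sigma^2 D_m}{n} + \text{const.}$$
Conclusions:
1. The variance term $\sigma^2 D_m / n$ does not matter: it is the same for all models of dimension $D_m = D$,
2. The models $S_m$ are only distinguished according to $\mathrm{dist}(s, S_m)$.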
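The two expectations can be checked numerically. The sketch below assumes that dist(s, S_m) is the squared distance in the empirical norm, that R is the expected quadratic loss at the design points (constant 0), and that the constant in the second line is σ². Gaussian noise and all numerical values are hypothetical.

```python
import random

def segment_means_fit(y, bounds):
    """Regressogram on the segments y[bounds[k]:bounds[k+1]]."""
    fit = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        mean = sum(y[a:b]) / (b - a)
        fit.extend([mean] * (b - a))
    return fit

def mean_sq(u, v):
    """(1/n) * sum_i (u_i - v_i)^2."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def monte_carlo(s, bounds, sigma, trials=8000, seed=1):
    """Monte Carlo estimates of E||s - s_hat_m||^2 and of E[P_n gamma(s_hat_m)]."""
    rng = random.Random(seed)
    risk = emp = 0.0
    for _ in range(trials):
        y = [si + sigma * rng.gauss(0.0, 1.0) for si in s]
        fit = segment_means_fit(y, bounds)
        risk += mean_sq(s, fit)
        emp += mean_sq(y, fit)
    return risk / trials, emp / trials

n, sigma = 40, 0.5
s = [0.0] * 20 + [1.0] * 20
bounds = [0, 10, 25, 40]            # a model with D_m = 3 that misses the true jump
D = len(bounds) - 1
dist = mean_sq(s, segment_means_fit(s, bounds))
risk, emp = monte_carlo(s, bounds, sigma)
print("risk          :", risk, "vs", dist + sigma ** 2 * D / n)
print("empirical risk:", emp, "vs", dist - sigma ** 2 * D / n + sigma ** 2)
```

The gap between the two quantities, up to the additive σ², is $2\sigma^2 D_m / n$, which depends on the model only through its dimension.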
ERM segmentation: Heteroscedastic
[Figure: Segmentation (Heteroscedastic); x-axis: position t, y-axis: signal; legend: Yi, Signal, Oracle, ERM]
−→ ERM overfits in noisy regions
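ERM overfitting: Expectations
Heteroscedastic:
$$R(\hat{s}_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \text{const.},$$
$$E[P_n\gamma(\hat{s}_m)] = \mathrm{dist}(s, S_m) - \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \text{const.},$$
with $(\sigma^r_\lambda)^2 := \frac{1}{n_\lambda} \sum_{i=1}^{n} \sigma_i^2 \mathbf{1}_{I_\lambda}(t_i)$, $n_\lambda := \mathrm{Card}(\{i \mid t_i \in I_\lambda\})$.
Conclusions:
1. The variance term differs between models $S_m$ of the same dimension $D$,
2. ERM tends to put breakpoints in the noisy regions.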
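A small numerical illustration of the first conclusion: two segmentations with the same number of pieces can have very different variance terms. The noise profile and both segmentations below are hypothetical.

```python
def variance_term(sigma2, bounds):
    """(1/n) * sum over segments of (sigma^r_lambda)^2.

    (sigma^r_lambda)^2 is the average of sigma_i^2 over the points of the segment
    sigma2[bounds[k]:bounds[k+1]].
    """
    n = len(sigma2)
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        total += sum(sigma2[a:b]) / (b - a)
    return total / n

# Hypothetical noise profile: quiet left half, noisy right half.
sigma2 = [0.01] * 50 + [1.0] * 50
quiet = [0, 20, 40, 100]   # D = 3, both breakpoints in the quiet region
noisy = [0, 60, 80, 100]   # D = 3, both breakpoints in the noisy region
print(variance_term(sigma2, quiet))  # about 0.00855
print(variance_term(sigma2, noisy))  # about 0.02175
```

Since the variance term enters the expected empirical risk with a minus sign, the second segmentation looks better to ERM even though its risk is larger, whenever the two models have the same dist(s, S_m) (for instance when s is constant).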
Cross-validation principle
[Figure: four panels illustrating the cross-validation principle; x-axis from 0 to 1, y-axis from −3 to 3]
Cross-validation
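Leave-p-out (Lpo): for all $1 \le p \le n - 1$,
$$\widehat{R}_p(\hat{s}_m) = \binom{n}{p}^{-1} \sum_{D^{(t)} \in \mathcal{E}_p} \frac{1}{p} \sum_{Z_i \in D^{(v)}} \left( \hat{s}_m^{D^{(t)}}(X_i) - Y_i \right)^2 ,$$
where $\mathcal{E}_p = \left\{ D^{(t)} \subset \{Z_1, \ldots, Z_n\} \mid \mathrm{Card}(D^{(t)}) = n - p \right\}$.
Algorithmic complexity: exponential.
Theorem (C. Ph.D. (2008))
$$\widehat{R}_p(\hat{s}_m) = \sum_{\lambda \in \Lambda(m)} \left\{ S_{\lambda,2} A_\lambda + \left( S_{\lambda,1}^2 - S_{\lambda,2} \right) B_\lambda \right\},$$
where $S_{\lambda,1} := \sum_{i=1}^{n} Y_i \mathbf{1}_{I_\lambda}(t_i)$, $S_{\lambda,2} := \sum_{i=1}^{n} Y_i^2 \mathbf{1}_{I_\lambda}(t_i)$,
$A_\lambda, B_\lambda$: known functions.
Algorithmic complexity: O(n).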
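The slides do not give $A_\lambda$ and $B_\lambda$. For the special case p = 1, with at least two points in every segment, elementary algebra gives $A_\lambda = n_\lambda / (n (n_\lambda - 1))$ and $B_\lambda = -n_\lambda / (n (n_\lambda - 1)^2)$; these expressions are derived here and are not quoted from the thesis. The sketch below compares this closed form with the definition.

```python
def loo_brute_force(y, bounds):
    """Leave-one-out risk from the definition: refit each segment mean without Y_i."""
    n = len(y)
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = y[a:b]
        for i, yi in enumerate(seg):
            others = seg[:i] + seg[i + 1:]
            total += (sum(others) / len(others) - yi) ** 2
    return total / n

def loo_closed_form(y, bounds):
    """Same quantity from S1 = sum of Y_i and S2 = sum of Y_i^2 on each segment.

    A and B are the p = 1 coefficients derived in the lead-in (an assumption of
    this sketch); every segment must contain at least two points.
    """
    n = len(y)
    total = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        k = b - a
        s1 = sum(y[a:b])
        s2 = sum(v * v for v in y[a:b])
        coef_a = k / (n * (k - 1))
        coef_b = -k / (n * (k - 1) ** 2)
        total += s2 * coef_a + (s1 * s1 - s2) * coef_b
    return total

y = [0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 3.0, 3.1]
bounds = [0, 3, 6, 8]
print(loo_brute_force(y, bounds), loo_closed_form(y, bounds))
```

With prefix sums of $Y_i$ and $Y_i^2$, each segment's contribution is available in constant time, which is what makes the criterion as cheap as the empirical risk.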
Applicability of cross-validation
Lpo-based model selection procedure:
1. Lpo is computationally tractable:
   - C. and Robin (2008), CSDA: Density
   - C. and Robin (2008), arXiv: Multiple testing
   - C. Ph.D. (2008), TEL: Density, regression
   - C. (2009), arXiv: Density
2. As computationally expensive as ERM.
Lpo segmentation of dimension D:
For every $1 \le p \le n - 1$,
$$\hat{m}_p(D) = \arg\min_{m \mid D_m = D} \widehat{R}_p(\hat{s}_m) .$$
Taking variance into account: Lpo expectation
Theorem (C. Ph.D. (2008))
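Homoscedastic:
$$E[\widehat{R}_p(\hat{s}_m)] \approx \mathrm{dist}(s, S_m) + \sigma^2 \frac{D_m}{n - p} + \sigma^2 ,$$
Heteroscedastic:
$$E[\widehat{R}_p(\hat{s}_m)] \approx \mathrm{dist}(s, S_m) + \frac{1}{n - p} \sum_{\lambda} (\sigma^r_\lambda)^2 + \text{const.}$$
$$R(\hat{s}_m) = \mathrm{dist}(s, S_m) + \frac{1}{n} \sum_{\lambda} (\sigma^r_\lambda)^2 + \text{const.}$$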
Leave-one-out (Lpo with p = 1): An alternative to ERM
Strategy:
Replace ERM by leave-one-out (Loo) to take variance into account.
Loo algorithm:
$$\hat{m}_1(D) = \arg\min_{m \mid D_m = D} \widehat{R}_1(\hat{s}_m) .$$
Conclusion:
Loo prevents overfitting.
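A sketch of the Loo segmentation next to ERM, under two assumptions that are not in the slides: the argmin is computed by dynamic programming, and the Loo criterion of a segment with k ≥ 2 points is (k / (k − 1))² times its residual sum of squares, which follows from leaving one point out of a segment mean. The simulated data are hypothetical.

```python
import random

def make_costs(y):
    """Per-segment costs on y[a:b]: ERM (residual sum of squares) and Loo."""
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def erm(a, b):
        k, t1, t2 = b - a, s1[b] - s1[a], s2[b] - s2[a]
        return t2 - t1 * t1 / k

    def loo(a, b):
        k, t1, t2 = b - a, s1[b] - s1[a], s2[b] - s2[a]
        if k < 2:                       # Loo needs two points to refit a mean
            return float("inf")
        return k * (k * t2 - t1 * t1) / (k - 1) ** 2

    return erm, loo

def argmin_segmentation(n, cost, D):
    """Dynamic programming over the last breakpoint; returns (breakpoints, total cost)."""
    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(D + 1)]
    arg = [[0] * (n + 1) for _ in range(D + 1)]
    best[0][0] = 0.0
    for d in range(1, D + 1):
        for b in range(d, n + 1):
            for a in range(d - 1, b):
                c = best[d - 1][a] + cost(a, b)
                if c < best[d][b]:
                    best[d][b], arg[d][b] = c, a
    breakpoints, b = [], n
    for d in range(D, 1, -1):
        b = arg[d][b]
        breakpoints.append(b)
    return breakpoints[::-1], best[D][n]

# Hypothetical heteroscedastic sample: one true jump at index 50, noisy last 30 points.
rng = random.Random(0)
n = 100
s = [0.0] * 50 + [1.0] * 50
sigma = [0.1] * 70 + [1.0] * 30
y = [si + sg * rng.gauss(0.0, 1.0) for si, sg in zip(s, sigma)]
erm_cost, loo_cost = make_costs(y)
for name, cost in (("ERM", erm_cost), ("Loo", loo_cost)):
    print(name, argmin_segmentation(n, cost, D=4)[0])
```

The factor (k / (k − 1))² inflates the cost of short segments, which is how the Loo criterion discourages isolating a few noisy points.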
[Figure: two panels; x-axis from 0 to 100, y-axis from −1.5 to 2; first panel legend: Oracle, ERM; second panel legend: Oracle, Loo, ERM]
Quality of the segmentations w.r.t. D
[Figure: Segmentation quality (homoscedastic), N = 300 trials; x-axis: number of breakpoints, y-axis: average loss value; curves: ERM, Loo]
[Figure: Segmentation quality (heteroscedastic), N = 300 trials; x-axis: number of breakpoints, y-axis: average loss value; curves: ERM, Loo]
Quality of the best segmentation
s·    σ·      ERM            Loo
2     c       2.88 ± 0.01    2.93 ± 0.01
2     pc,1    1.31 ± 0.02    1.16 ± 0.02
2     pc,3    3.09 ± 0.03    2.52 ± 0.03
3     c       3.18 ± 0.01    3.25 ± 0.01
3     pc,1    3.00 ± 0.01    2.67 ± 0.02
3     pc,3    4.41 ± 0.02    3.97 ± 0.02
Table: Average of $E\left[ \inf_D \|s - \hat{s}_{\hat{m}_A(D)}\|^2 \right] / E\left[ \inf_m \|s - \hat{s}_m\|^2 \right]$ over 10 000 samples. A denotes either ERM or Loo.
−→ Same results when D is chosen by VFCV.
Summary
1. Lpo takes variance into account
   −→ outperforms ERM (heteroscedastic),
   −→ close to ERM (homoscedastic).
2. Lpo is fully tractable (closed-form expressions)
   −→ as computationally expensive as ERM.
3. Similar results when D is chosen by V-fold cross-validation.
Conclusion:
Cross-validation is a robust (to heteroscedasticity) and reliable alternative to ERM.
−→ Arlot and C. (2009), arXiv
The Bt474 Cell lines
These are epithelial cells
Obtained from human breast cancer tumors
A test genome is compared to a reference male genome
We only consider chromosomes 1 and 9
Results: Chromosome 9
[Figure: results for chromosome 9, three panels: Homoscedastic model (Picard et al. (05)); Heteroscedastic model (Picard et al. (05)); LOO+VFCV]
Results: Chromosome 1
[Figure: results for chromosome 1, three panels: Homoscedastic model (Picard et al. (05)); Heteroscedastic model (Picard et al. (05)); LOO+VFCV]
Prospects
1. Optimality results for segmentation procedures.
2. Other resampling schemes (bootstrap, Rademacher penalties, ...).
3. Extension to the multivariate setting: Detect ANR project
   - Biology: multi-patient CGH profile segmentation.
   - Computer vision: video segmentation.
Thank you.