Deep Kernel Representation Learning for Complex Data and Reliability Issues
Pierre LAFORGUE, PhD Defense, LTCI, Télécom Paris, France
Jury:
d’ALCHÉ-BUC Florence, Supervisor
BONALD Thomas, President
CLÉMENÇON Stephan, co-Supervisor
KADRI Hachem, Examiner
LUGOSI Gábor, Reviewer
MAIRAL Julien, Examiner
VERT Jean-Philippe, Reviewer
0
Motivation: need for structured data representations
Goal of ML: infer, from a set of examples, the relationship between some explanatory variables x and a target output y
A representation: set of features characterizing the observations
Ex 1: digit recognition (MNIST)
Ex 2: molecule activity prediction
How to (automatically) learn structured data representations?
1
Motivation: need for reliable procedures
Empirical Risk Minimization: minimize the average error on train data
[Histogram omitted: occurrences of activity values, with a group of outliers]
Ordinary Least Squares fails: need for more robust loss functions and/or mean estimators
Importance Sampling may only correct on the space covered by the training observations
How to adapt to data with outliers and/or sampling bias?
2
Outline for today
Empirical Risk Minimization (ERM), formally:
min_{h measurable} E_P[ ℓ(h(X), Y) ]   →   min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i)
Part I: Deep kernel architectures for complex data
Part II: Robust losses for RKHSs with infinite dimensional outputs
Part III: Reliable learning through Median-of-Means approaches
Backup: Statistical learning from biased training samples
3
Part I: Deep kernel architectures for complex data
min_{h measurable} E_P[ ℓ(h(X), Y) ]   →   min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i)
4
Two opposite representation learning paradigms
Deep Learning: representations learned along with the training, key to the success [Erhan et al., 2009]
Kernel Methods: linear method after embedding through the feature map φ, choice of kernel ⇐⇒ choice of representation
Question: Is it possible to combine both approaches [Mairal et al., 2014]?
5
Autoencoders (AEs)
• Idea: compress and reconstruct inputs by a Neural Net (NN)
• Base mapping: f_{W,b} : R^d → R^p such that f_{W,b}(x) = σ(Wx + b)
• Hour-glass shaped network, reconstruction criterion (self-supervised):
min_{W,b,W′,b′} ‖x − f_{W′,b′} ∘ f_{W,b}(x)‖²
Fig. 1: A one-hidden-layer autoencoder (input x, internal code z, reconstruction x′, weights W, b and W′, b′)
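To make the recipe concrete, here is a minimal sketch of the one-hidden-layer autoencoder of Fig. 1, assuming PyTorch is available; the dimensions, sigmoid activations and optimizer settings are illustrative choices, not those of the slides.

```python
import torch
import torch.nn as nn

d, p = 4, 2                                   # input dim, code dim (hour-glass shape)

class AE(nn.Module):
    def __init__(self, d, p):
        super().__init__()
        self.enc = nn.Linear(d, p)            # f_{W,b}
        self.dec = nn.Linear(p, d)            # f_{W',b'}

    def forward(self, x):
        z = torch.sigmoid(self.enc(x))        # internal representation z
        return torch.sigmoid(self.dec(z))     # reconstruction x'

ae = AE(d, p)
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
x = torch.rand(32, d)                         # a toy batch
for _ in range(100):
    opt.zero_grad()
    loss = ((x - ae(x)) ** 2).sum(dim=1).mean()   # self-supervised reconstruction criterion
    loss.backward()
    opt.step()
```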
6
Autoencoders: uses
• Data compression, link to Principal Component Analysis (PCA) [Bourlard and Kamp, 1988, Hinton and Salakhutdinov, 2006]
• Pre-training of neural networks [Bengio et al., 2007]
• Denoising [Vincent et al., 2010]
• For non-vectorial data?
Fig. 2: PCA/AE on MNIST (reproduced from HS ’06)
Fig. 3: Pre-training of a bigger network through AEs
7
Scalar kernel methods [Schölkopf et al., 2004]
• feature map φ : X → H_k associated to a scalar kernel k : X × X → R such that ⟨φ(x), φ(x′)⟩_{H_k} = k(x, x′)
• Replace x with φ(x) and use linear methods. Ridge regression:
min_{β∈R^p} ∑_i (y_i − ⟨x_i, β⟩_{R^p})² + 2nλ‖β‖²_{R^p}      min_{ω∈H_k} ∑_i (y_i − ⟨φ(x_i), ω⟩_{H_k})² + 2nλ‖ω‖²_{H_k}
• In an autoencoder? Need for Hilbert-valued functions!
min_{f_1,…,f_L} (1/n) ∑_{i=1}^n ‖φ(x_i) − f_L ∘ … ∘ f_1(φ(x_i))‖²_{H_k}
8
Operator-valued kernel methods [Carmeli et al., 2006]
Generalization of scalar kernel methods to output Hilbert spaces:
• k : X × X → R   vs.   K : X × X → L(Y)
• k(x′, x) = k(x, x′)   vs.   K(x′, x) = K(x, x′)*
• ∑_{i,j} α_i k(x_i, x_j) α_j ≥ 0   vs.   ∑_{i,j} ⟨α_i, K(x_i, x_j) α_j⟩_Y ≥ 0
• H_k = span{k(·, x)} ⊂ R^X   vs.   H_K = span{K(·, x)y} ⊂ Y^X
Kernel trick in the output space [Cortes ’05, Geurts ’06, Brouard ’11, Kadri ’13, Brouard ’16], Input Output Kernel Regression (IOKR).
9
How to learn in vector-valued RKHSs? OVK ridge regression
For (x_i, y_i)_{i=1}^n ∈ (X × Y)^n with Y a Hilbert space, we want to solve:
ĥ ∈ argmin_{h∈H_K} (1/n) ∑_{i=1}^n ‖h(x_i) − y_i‖²_Y + (Λ/2) ‖h‖²_{H_K}.
Representer Theorem [Micchelli and Pontil, 2005]: ∃ (α_i)_{i=1}^n ∈ Y^n s.t. ĥ(·) = ∑_{i=1}^n K(·, x_i) α_i, and differentiating gives:
∑_{i=1}^n (K(x_1, x_i) + Λn δ_{1i} I_Y) α_i = y_1,
⋮
∑_{i=1}^n (K(x_n, x_i) + Λn δ_{ni} I_Y) α_i = y_n.
If K(x, x′) = k(x, x′) I_Y, then closed form solution: α_i = ∑_j A_{ij} y_j with A = (K + Λn I_n)^{-1}.
10
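As an illustration of the closed form above for an identity-decomposable kernel K(x, x′) = k(x, x′) I_Y, here is a small numpy sketch; the Gaussian scalar kernel, the toy data and the regularization value are placeholders, not the setting of the thesis.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    # Gram matrix of a Gaussian scalar kernel k(x, x') = exp(-gamma ||x - x'||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
n, d, q = 50, 3, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, q))          # outputs y_i, finite dimensional here for illustration
Lam = 0.1

K = gaussian_gram(X)
A = np.linalg.inv(K + Lam * n * np.eye(n))   # A = (K + Λ n I_n)^{-1}
alpha = A @ Y                                # α_i = Σ_j A_ij y_j (stacked row-wise)
H = K @ alpha                                # predictions h(x_i) = Σ_j k(x_i, x_j) α_j
```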
The Kernel Autoencoder [Laforgue et al., 2019a]
Complex structured input x ∈ X  →  finite dimensional representation (e.g. (0.94, 0.55))
[Diagram omitted: φ(x) ∈ F_X → internal code z ∈ R² → reconstruction of φ(x) ∈ F_X, with layers f_1 ∈ H_{K_1} and f_2 ∈ H_{K_2} (vv-RKHSs)]
K2AE:   min_{f_l ∈ vv-RKHS} (1/n) ∑_{i=1}^n ‖φ(x_i) − f_L ∘ … ∘ f_1(φ(x_i))‖²_{F_X} + ∑_{l=1}^L λ_l ‖f_l‖²_{H_l}
11
Connection to kernel Principal Component Analysis (PCA)
2-layer K2AE with linear kernels, internal layer of size p, and no penalization. Let ((σ_1, u_1), …, (σ_p, u_p)) denote the largest eigenvalue/eigenvector pairs of K_in. It holds:
K2AE output: (√σ_1 u_1, …, √σ_p u_p) ∈ R^{n×p}
KPCA output: (σ_1 u_1, …, σ_p u_p) ∈ R^{n×p}
Proof: X ∈ R^{n×d}, Y = XX⊤A ∈ R^{n×p}, Z = YY⊤B. The objective writes min_{A,B} ‖X − Z‖²_Fro and Eckart–Young gives:
Z* = U_d Σ_p V_d⊤   with   X = U_d Σ_d V_d⊤.
Sufficient: A = U_p Σ_p^{−3/2} ∈ R^{n×p}, B = U_d V_d⊤ ∈ R^{n×d}.
Extends to X ∈ L(Y, R^n) as the SVD exists for compact operators.
12
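Both representations can be read directly off the eigendecomposition of the input Gram matrix; a small numpy sketch under the stated assumptions (linear kernels, internal layer of size p, no penalization), with purely illustrative toy data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K_in = X @ X.T                                   # linear kernel Gram matrix
p = 2

vals, vecs = np.linalg.eigh(K_in)                # eigenvalues in ascending order
sig = vals[::-1][:p]                             # top-p eigenvalues σ_1 >= ... >= σ_p
U = vecs[:, ::-1][:, :p]                         # corresponding eigenvectors u_1, ..., u_p

kpca_repr = U * sig                              # KPCA output: (σ_1 u_1, ..., σ_p u_p)
k2ae_repr = U * np.sqrt(sig)                     # K2AE output: (√σ_1 u_1, ..., √σ_p u_p)
```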
A composite representer theorem [Laforgue et al., 2019a]
How to train the Kernel Autoencoder?
min_{f_l ∈ vv-RKHS} (1/n) ∑_{i=1}^n ‖φ(x_i) − f_L ∘ … ∘ f_1(φ(x_i))‖²_{F_X} + ∑_{l=1}^L λ_l ‖f_l‖²_{H_l}
For l ≤ L, X_l Hilbert, X_0 = X_L = F_X, K_l : X_{l−1} × X_{l−1} → L(X_l).
For all L_0 ≤ L, there is (α_{1,1}, …, α_{1,n}, …, α_{L_0,1}, …, α_{L_0,n}) ∈ X_1^n × … × X_{L_0}^n such that for all l ≤ L_0 it holds:
f_l(·) = ∑_{i=1}^n K_l(·, x_i^{(l−1)}) α_{l,i},
with the notation, for all i ≤ n: x_i^{(l)} = f_l ∘ … ∘ f_1(x_i) and x_i^{(0)} = x_i.
13
Optimization algorithm
How to train the Kernel Autoencoder?
min_{f_l ∈ vv-RKHS} (1/n) ∑_{i=1}^n ‖φ(x_i) − f_L ∘ … ∘ f_1(φ(x_i))‖²_{F_X} + ∑_{l=1}^L λ_l ‖f_l‖²_{H_l}
• The last layer’s infinite dimensional coefficients make it impossible to perform Gradient Descent directly
• Yet, the gradient can propagate through the last layer (with [N_L]_{ij} = ⟨α_{L,i}, α_{L,j}⟩):
∑_{i,i′=1}^n [N_l]_{ii′} (∇^{(1)} k_l(x_i^{(l−1)}, x_{i′}^{(l−1)}))⊤ Jac_{x_i^{(l−1)}}(α_{l_0,i_0})
• If the inner layers are fixed and K_L = k_L I_{X_0}, closed-form solution for N_L
Alternate descent: gradient steps and OVKRR resolution
14
Application to molecule activity prediction
KAE representations are useful for downstream supervised tasks
[Plot omitted: normalized mean square errors (NMSEs) of KRR on the input kernel, and of KPCA and K2AE representations of sizes 5, 10, 25, 50, 100, for cancer indices 1–5]
Fig. 4: Performance of the different strategies on 5 cancers (NCI dataset)
15
Part II: Robust losses for RKHSs with infinite dimensional outputs
min_{h measurable} E_P[ ℓ(h(X), Y) ]   →   min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i)
16
Infinite dimensional outputs in machine learning
Kernel Autoencoder [Laforgue et al., 2019a].
min_{h_1,h_2 ∈ H_K^1 × H_K^2} (1/2n) ∑_{i=1}^n ‖φ(x_i) − h_2 ∘ h_1(φ(x_i))‖²_{F_X} + Λ Reg(h_1, h_2)
Structured prediction by ridge-IOKR [Brouard et al., 2016].
ĥ = argmin_{h∈H_K} (1/2n) ∑_{i=1}^n ‖φ(z_i) − h(x_i)‖²_{F_Y} + (Λ/2) ‖h‖²_{H_K}
g(x) = argmin_{z∈Z} ‖φ(z) − ĥ(x)‖_{F_Z}
Function to function regression [Kadri et al., 2016].
[Plots omitted: EMG curves (millivolts vs. seconds) and lip acceleration curves (meters/s² vs. seconds)]
min_{h∈H_K} (1/2n) ∑_{i=1}^n ‖y_i − h(x_i)‖²_{L²} + (Λ/2) ‖h‖²_{H_K}
17
Purpose of this part
Question: Is it possible to extend the previous approaches to different (ideally robust) loss functions?
First answer: Yes, possible extension to maximum-margin regression [Brouard et al., 2016], and to ε-insensitive loss functions for matrix-valued kernels [Sangnier et al., 2017]
What about general Operator-Valued Kernels (OVKs)?
What about other types of loss functions?
18
Learning in vector-valued RKHSs (reminder)
For (x_i, y_i)_{i=1}^n ∈ (X × Y)^n with Y a Hilbert space, we want to solve:
ĥ ∈ argmin_{h∈H_K} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i) + (Λ/2) ‖h‖²_{H_K}.
Representer Theorem [Micchelli and Pontil, 2005]: ∃ (α_i)_{i=1}^n ∈ Y^n (infinite dimensional!) s.t. ĥ(·) = ∑_{i=1}^n K(·, x_i) α_i.
If ℓ(·, ·) = (1/2)‖· − ·‖²_Y and K = k · I_Y:   α_i = ∑_{j=1}^n A_{ij} y_j,   A = (K + nΛ I_n)^{-1}.
19
Applying duality
ĥ ∈ argmin_{h∈H_K} (1/n) ∑_{i=1}^n ℓ_i(h(x_i)) + (Λ/2) ‖h‖²_{H_K}   writes   ĥ = (1/Λn) ∑_{i=1}^n K(·, x_i) α̂_i,
with (α̂_i)_{i=1}^n ∈ Y^n the solutions to the dual problem:
min_{(α_i)_{i=1}^n ∈ Y^n} ∑_{i=1}^n ℓ_i*(−α_i) + (1/(2Λn)) ∑_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y,
with f* : α ∈ Y ↦ sup_{y∈Y} ⟨α, y⟩_Y − f(y) the Fenchel-Legendre transform of f.
• 1st limitation: the FL transform ℓ* must be computable (→ assumption)
• 2nd limitation: the dual variables (α_i)_{i=1}^n are still infinite dimensional!
If Ŷ = span{y_j, j ≤ n} is invariant by K, i.e. y ∈ Ŷ ⇒ K(x, x′)y ∈ Ŷ:
α̂_i ∈ Ŷ → possible reparametrization: α̂_i = ∑_j ω_ij y_j
20
The double representer theorem [Laforgue et al., 2020]
Assume that the OVK K and the loss ℓ satisfy the appropriate assumptions (verified by standard kernels and losses). Then
ĥ = argmin_{h∈H_K} (1/n) ∑_i ℓ(h(x_i), y_i) + (Λ/2) ‖h‖²_{H_K}   is given by   ĥ = (1/Λn) ∑_{i,j=1}^n K(·, x_i) ω̂_ij y_j,
with Ω̂ = [ω̂_ij] ∈ R^{n×n} solution to the finite dimensional problem
min_{Ω∈R^{n×n}} ∑_{i=1}^n L_i(Ω_{i:} K^Y) + (1/(2Λn)) Tr(M⊤(Ω ⊗ Ω)),
with M the n² × n² matrix rewriting of the tensor M s.t. M_ijkl = ⟨y_k, K(x_i, x_j) y_l⟩_Y.
21
The double representer theorem (2/2)
If K further satisfies K(x, x′) = ∑_t k_t(x, x′) A_t, then the tensor M simplifies to M_ijkl = ∑_t [K^X_t]_ij [K^Y_t]_kl and the problem rewrites
min_{Ω∈R^{n×n}} ∑_{i=1}^n L_i(Ω_{i:} K^Y) + (1/(2Λn)) ∑_{t=1}^T Tr(K^X_t Ω K^Y_t Ω⊤).
Rmk. Only the n⁴ tensor ⟨y_k, K(x_i, x_j) y_l⟩_Y is needed to learn OVK machines. It simplifies to the two n² matrices K^X and K^Y if K is decomposable.
How to apply the duality approach?
22
Infimal convolution and Fenchel-Legendre transforms
Infimal-convolution operator between proper lower semicontinuous functions [Bauschke et al., 2011]:
(f □ g)(x) = inf_y f(y) + g(x − y).
Relation to the FL transform: (f □ g)* = f* + g*.
Ex: ε-insensitive losses. Let ℓ : Y → R be a convex loss with unique minimum at 0, and ε > 0. Its ε-insensitive version, denoted ℓ_ε, is defined by:
ℓ_ε(y) = (ℓ □ χ_{B_ε})(y) = ℓ(0) if ‖y‖_Y ≤ ε,   inf_{‖d‖_Y ≤ 1} ℓ(y − εd) otherwise,
and has FL transform:
ℓ_ε*(y) = (ℓ □ χ_{B_ε})*(y) = ℓ*(y) + ε‖y‖.
23
Interesting loss functions: sparsity and robustness
• ε-Ridge: (1/2)‖·‖² □ χ_{B_ε}   (sparsity)
• ε-SVR: ‖·‖ □ χ_{B_ε}   (sparsity, robustness)
• κ-Huber: κ‖·‖ □ (1/2)‖·‖²   (robustness)
[Plots omitted: 1D and 2D profiles of the three losses and their ε-insensitive / Huber counterparts]
24
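For reference, these three infimal convolutions admit simple closed forms on a residual y; the numpy sketch below uses a finite dimensional stand-in for the output Hilbert space, with the formulas derived from the definitions above rather than copied from the slides.

```python
import numpy as np

def eps_ridge(y, eps):
    # (1/2)||.||^2 infimal-convolved with the indicator of the eps-ball
    return 0.5 * max(np.linalg.norm(y) - eps, 0.0) ** 2

def eps_svr(y, eps):
    # ||.|| infimal-convolved with the indicator of the eps-ball
    return max(np.linalg.norm(y) - eps, 0.0)

def kappa_huber(y, kappa):
    # kappa*||.|| infimal-convolved with (1/2)||.||^2: the vector Huber loss
    n = np.linalg.norm(y)
    return 0.5 * n ** 2 if n <= kappa else kappa * n - 0.5 * kappa ** 2
```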
Specific dual problems
For the ε-ridge, ε-SVR and κ-Huber, it holds Ω̂ = Ŵ V^{-1}, with Ŵ the solution to these finite dimensional dual problems:
(D1) min_{W∈R^{n×n}} (1/2) ‖AW − B‖²_Fro + ε ‖W‖_{2,1},
(D2) min_{W∈R^{n×n}} (1/2) ‖AW − B‖²_Fro + ε ‖W‖_{2,1},   s.t. ‖W‖_{2,∞} ≤ 1,
(D3) min_{W∈R^{n×n}} (1/2) ‖AW − B‖²_Fro,   s.t. ‖W‖_{2,∞} ≤ κ,
with V, A, B such that: VV⊤ = K^Y, A⊤A = K^X/(Λn) + I_n (or A⊤A = K^X/(Λn) for the ε-SVR), and A⊤B = V.
25
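Problem (D1) is a standard group-sparse least squares problem; one possible way to solve it is proximal gradient descent with row-wise soft-thresholding (the proximal operator of the ‖·‖_{2,1} penalty). The sketch below assumes A and B have already been built from K^X and K^Y as described above; it is not the solver used in the thesis. Ω̂ is then recovered as Ŵ V^{-1}.

```python
import numpy as np

def prox_group(W, thresh):
    # row-wise soft-thresholding: prox of thresh * ||W||_{2,1}
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - thresh / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def solve_D1(A, B, eps, n_iter=500):
    # proximal gradient descent on (1/2)||AW - B||_F^2 + eps * ||W||_{2,1}
    step = 1.0 / np.linalg.norm(A, 2) ** 2        # 1 / Lipschitz constant of the gradient
    W = np.zeros((A.shape[1], B.shape[1]))
    for _ in range(n_iter):
        grad = A.T @ (A @ W - B)
        W = prox_group(W - step * grad, step * eps)
    return W
```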
Application to structured prediction
• Experiments on the YEAST dataset
• Empirically, ε-SV-IOKR outperforms ridge-IOKR for a wide range of ε
• Promotes sparsity and acts as a regularizer
[Plots omitted: test MSE (ε-SVR vs. KRR) and sparsity (% null components) as functions of Λ, for ε ranging from 0.1 to 1.0]
Fig. 5: MSEs and sparsity w.r.t. Λ for several ε
26
Part III: Reliable learning through Median-of-Means approaches
min_{h measurable} E_P[ ℓ(h(X), Y) ]   →   min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i)
27
Preliminaries
Sample S_n = {Z_1, …, Z_n} i.i.d. ∼ Z such that E[Z] = θ
• θ̂_avg = (1/n) ∑_{i=1}^n Z_i
• θ̂_med = Z_{σ((n+1)/2)}, with Z_{σ(1)} ≤ … ≤ Z_{σ(n)}
• Deviation probabilities [Catoni, 2012]: P{|θ̂ − θ| > t}.
• If Z is bounded (see Hoeffding’s inequality) or sub-Gaussian:
P{ |θ̂_avg − θ| > σ √(2 ln(2/δ)/n) } ≤ δ.
Do estimators exist with same guarantees under weaker assumptions?
How to use them to perform (robust) learning?
28
The Median-of-Means
[Scheme: split Z_1, …, Z_n into K blocks of size B; compute the mean θ̂_k of each block; θ̂_MoM = median(θ̂_1, …, θ̂_K)]
Z_1, …, Z_n i.i.d. realizations of a r.v. Z s.t. E[Z] = θ, Var(Z) = σ².
∀δ ∈ [e^{1−2n/9}, 1), for K = ⌈(9/2) ln(1/δ)⌉ it holds [Devroye et al., 2016]:
P{ |θ̂_MoM − θ| > 3√6 σ √((1 + ln(1/δ))/n) } ≤ δ.
29
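A minimal numpy sketch of the estimator (equal-size blocks, block means, then their median); the heavy-tailed Student sample and the value of K are illustrative.

```python
import numpy as np

def median_of_means(Z, K, seed=0):
    # split the sample into K blocks, average within blocks, take the median
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Z))          # shuffling is harmless for i.i.d. data
    blocks = np.array_split(idx, K)
    return np.median([Z[b].mean() for b in blocks])

Z = np.random.default_rng(1).standard_t(df=2, size=10_000)   # heavy-tailed sample
print(Z.mean(), median_of_means(Z, K=45))                     # e.g. K = ceil((9/2) ln(1/delta))
```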
Proof
θ̂_k = (1/B) ∑_{i∈B_k} Z_i,   I_{k,t} = I{|θ̂_k − θ| > t},   p_t = E[I_{1,t}] = P{|θ̂_1 − θ| > t}
P{|θ̂_MoM − θ| > t} ≤ P{ ∑_{k=1}^K I_{k,t} ≥ K/2 }
                    ≤ P{ (1/K) ∑_{k=1}^K (I_{k,t} − p_t) ≥ 1/2 − σ²/(Bt²) }
                    ≤ exp(−2K (1/2 − σ²/(Bt²))²)
                    ≤ δ   for K = (9/2) ln(1/δ) and σ²/(Bt²) = 1/6 ⇔ t = 3√3 σ √(ln(1/δ)/n).
30
U-statistics & pairwise learning
Estimator of E[h(Z, Z′)] with minimal variance, defined from an i.i.d. sample Z_1, …, Z_n as:
U_n(h) = 2/(n(n−1)) ∑_{1≤i<j≤n} h(Z_i, Z_j).
Ex: the empirical variance when h(z, z′) = (z − z′)²/2.
Encountered e.g. in pairwise ranking and metric learning:
R_n(r) = 2/(n(n−1)) ∑_{1≤i<j≤n} I{ r(X_i, X_j) · (Y_i − Y_j) ≤ 0 },
R_n(d) = 2/(n(n−1)) ∑_{1≤i<j≤n} I{ Y_ij · (d(X_i, X_j) − ε) > 0 }.
How to extend MoM to U-statistics?
31
The Median-of-U-statistics
[Scheme: split Z_1, …, Z_n into K blocks; build all pairs within each block; compute the U-statistic U_k(h) of each block; θ̂_MoU(h) = median(U_1(h), …, U_K(h))]
w.p. 1 − δ,   |θ̂_MoU(h) − θ(h)| ≤ C_1(h) √((1 + ln(1/δ))/n) + C_2(h) (1 + ln(1/δ))/n
32
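The same pattern applies to U-statistics: compute the U-statistic on the pairs of each block, then take the median of the K values. A small numpy sketch, using the empirical-variance kernel h(z, z′) = (z − z′)²/2 as a sanity check; the sample size and K are illustrative.

```python
import numpy as np
from itertools import combinations

def u_stat(h, Z):
    # U_n(h) = 2/(n(n-1)) * sum_{i<j} h(Z_i, Z_j)
    return np.mean([h(Z[i], Z[j]) for i, j in combinations(range(len(Z)), 2)])

def median_of_u_stats(h, Z, K):
    # MoU: one U-statistic per block, then the median of the K block values
    blocks = np.array_split(np.arange(len(Z)), K)
    return np.median([u_stat(h, Z[b]) for b in blocks])

h_var = lambda z, zp: (z - zp) ** 2 / 2            # gives the unbiased sample variance
Z = np.random.default_rng(0).standard_t(df=2, size=300)
print(u_stat(h_var, Z), median_of_u_stats(h_var, Z, K=6))
```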
Why randomization?
• Partition
• Build all possible blocks [Joly and Lugosi, 2016]
• Random Blocks
• Random Pairs
Randomization allows for a better exploration
33
The Median-of-Randomized-Means [Laforgue et al., 2019b]
[Scheme: K blocks of size B drawn by sampling without replacement (SWoR) from Z_1, …, Z_n; compute the mean θ̄_k of each block; θ̂_MoRM = median(θ̄_1, …, θ̄_K)]
With blocks formed by SWoR, ∀τ ∈ (0, 1/2), ∀δ ∈ [2e^{−8τ²n/9}, 1), set K := ⌈ln(2/δ) / (2(1/2 − τ)²)⌉ and B := ⌊8τ²n / (9 ln(2/δ))⌋; it holds:
P{ |θ̂_MoRM − θ| > (3√3 σ / (2τ^{3/2})) √(ln(2/δ)/n) } ≤ δ.
34
Proof
Random block B_k characterized by a random vector ε_k = (ε_{k,1}, …, ε_{k,n}) ∈ {0,1}^n, drawn i.i.d. uniformly over Λ_{n,B} = {ε ∈ {0,1}^n : 1⊤ε = B}, of cardinality binom(n, B).
θ̄_k = (1/B) ∑_{i∈B_k} Z_i,   I_{ε_k,t} = I{|θ̄_k − θ| > t},   p_t = E[I_{ε_k,t}] = P{|θ̄_1 − θ| > t}
U_{n,t} = E_ε[ (1/K) ∑_{k=1}^K I_{ε_k,t} | S_n ] = binom(n, B)^{-1} ∑_{ε∈Λ_{n,B}} I_{ε,t} = binom(n, B)^{-1} ∑_I I{ |(1/B) ∑_{j=1}^B X_{I_j} − θ| > t }
P{|θ̂_MoRM − θ| > t} ≤ P{ (1/K) ∑_{k=1}^K I_{ε_k,t} − U_{n,t} + U_{n,t} − p_t ≥ 1/2 − p_t + τ − τ }
                      ≤ exp(−2K(1/2 − τ)²) + exp(−2(n/B)(τ − σ²/(Bt²))²).
35
The Median-of-Randomized-U-statistics [Laforgue et al., 2019b]
[Scheme: K blocks of size B drawn by SWoR; build all pairs within each block; compute the U-statistic U_k(h) of each block; θ̂_MoRU(h) = median(U_1(h), …, U_K(h))]
w.p.a.l. 1 − δ,   |θ̂_MoRU − θ(h)| ≤ C_1(h, τ) √(ln(2/δ)/n) + C_2(h, τ) ln(2/δ)/n
36
The tournament procedure [Lugosi and Mendelson, 2016]
We want g* ∈ argmin_{g∈G} R(g) = E[(g(X) − Y)²]. For any pair (g, g′) ∈ G²:
1) Compute the MoM estimate of ‖g − g′‖_{L1}:
Φ_S(g, g′) = median( Ê_1|g − g′|, …, Ê_K|g − g′| ).
2) If it is large enough, compute the match:
Ψ_{S′}(g, g′) = median( Ê_1[(g(X) − Y)² − (g′(X) − Y)²], …, Ê_K[(g(X) − Y)² − (g′(X) − Y)²] ).
A ĝ winning all its matches verifies, w.p.a.l. 1 − exp(−c_0 n min{1, r²}):
R(ĝ) − R(g*) ≤ c r.
Can be extended to pairwise learning thanks to MoU
37
The MoM Gradient Descent [Lecue et al., 2018]
If G is parametric, we want to compute the minimizer of:
MoM[ℓ(g_u, Z)] = median( Ê_1[ℓ(g_u, Z)], …, Ê_K[ℓ(g_u, Z)] )
Idea: find the block with median risk, and use it as a mini-batch
38
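A sketch of the idea for least squares regression: at every epoch, re-partition the data, locate the block whose empirical risk is the median of the K block risks, and take a gradient step on that block only. The function names and hyperparameters below are illustrative, not those of [Lecue et al., 2018].

```python
import numpy as np

def mom_gd(X, y, K=10, lr=0.01, n_epochs=200, seed=0):
    # MoM gradient descent for least squares; lr and n_epochs are illustrative
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        blocks = np.array_split(rng.permutation(n), K)
        risks = [np.mean((X[b] @ w - y[b]) ** 2) for b in blocks]
        b_med = blocks[int(np.argsort(risks)[K // 2])]        # block with the median risk
        grad = 2 * X[b_med].T @ (X[b_med] @ w - y[b_med]) / len(b_med)
        w -= lr * grad                                        # gradient step on that block only
    return w
```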
MoU Gradient Descent for metric learning
We want to minimize, for M ∈ S_q^+(R):
2/(n(n−1)) ∑_{i<j} max(0, 1 + y_ij (d²_M(x_i, x_j) − 2))
[Plots omitted: 2D scatter of classes 1–3 with outliers; train and test objective values over 250 epochs for GD and MoU-GD, on sane and contaminated data]
39
Conclusion
40
Conclusion
min_{h measurable} E_P[ ℓ(h(X), Y) ]   →   min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(x_i), y_i)
• New hypothesis set for RL inspired from deep and kernel methods
• Link with Kernel PCA, optimization based on a composite RT
• Allows to autoencode any type of data, empirical success on molecules
• Double RT: coefficients are linear combinations of the outputs
• Allows to cope with many losses (ε-insensitive, Huber) and kernels
• Empirical improvements on surrogate tasks
• Extension of MoM to randomized blocks and/or U-statistics
• Extension of MoM tournaments and MoM-GD to pairwise learning
• Remarkable empirical resistance to the presence of outliers
41
Perspectives
From K2AE to deep IOKR:
  ▸ fully supervised scheme
  ▸ benefits of a hybrid architecture?
  ▸ learning the output embeddings?
Ŷ’s invariance: the good characterization for K?
  ▸ what if we relax the hypothesis?
  ▸ case of integral losses: ℓ(h(x), y) = ∫ ℓ_θ[h(x)(θ), y(θ)] dθ
Among the numerous MoM possibilities:
  ▸ a partial representer theorem?
  ▸ concentration in presence of outliers?
42
Acknowledgements
• PhD supervisors: Florence d’Alché-Buc, Stephan Clémençon
• Co-authors: Alex Lambert, Luc Brogat-Motte, Patrice Bertail
• Thank you: Olivier Fercoq
  ▸ Autoencoding any data through kernel autoencoders, with S. Clémençon and F. d’Alché-Buc, AISTATS 2019
  ▸ On medians-of-randomized-(pairwise)-means, with S. Clémençon and P. Bertail, ICML 2019
  ▸ Duality in RKHSs with infinite dimensional outputs: application to robust losses, with A. Lambert, L. Brogat-Motte and F. d’Alché-Buc, ICML 2020
  ▸ On statistical learning from biased training samples, with S. Clémençon, submitted
43
References
Audiffren, J. and Kadri, H. (2013). Stability of multi-task kernel regression algorithms. In Asian Conference on Machine Learning, pages 1–16.
Bauschke, H. H., Combettes, P. L., et al. (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153–160.
Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291–294.
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.
Brouard, C., Szafranski, M., and d’Alché-Buc, F. (2016). Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. Journal of Machine Learning Research, 17:176:1–176:48.
Carmeli, C., De Vito, E., and Toigo, A. (2006). Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(04):377–408.
Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 48, pages 1148–1185.
Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I., et al. (2016). Sub-Gaussian mean estimators. The Annals of Statistics, 44(6):2695–2725.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1.
Gill, R., Vardi, Y., and Wellner, J. (1988). Large sample theory of empirical distributions in biased sampling models. The Annals of Statistics, 16(3):1069–1112.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.
Joly, E. and Lugosi, G. (2016). Robust estimation of U-statistics. Stochastic Processes and their Applications, 126(12):3760–3773.
Kadri, H., Duflos, E., Preux, P., Canu, S., Rakotomamonjy, A., and Audiffren, J. (2016). Operator-valued kernels for learning from functional response data. Journal of Machine Learning Research, 17:20:1–20:54.
Kadri, H., Ghavamzadeh, M., and Preux, P. (2013). A generalized kernel approach to structured output learning. In International Conference on Machine Learning (ICML), pages 471–479.
Laforgue, P., Clémençon, S., and d’Alché-Buc, F. (2019a). Autoencoding any data through kernel autoencoders. In Artificial Intelligence and Statistics, pages 1061–1069.
Laforgue, P., Clémençon, S., and Bertail, P. (2019b). On medians of (randomized) pairwise means. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 1272–1281, Long Beach, California, USA. PMLR.
Laforgue, P., Lambert, A., Motte, L., and d’Alché-Buc, F. (2020). Duality in RKHSs with infinite dimensional outputs: Application to robust losses. arXiv preprint arXiv:1910.04621.
Lecué, G., Lerasle, M., and Mathieu, T. (2018). Robust classification via MoM minimization. arXiv preprint arXiv:1808.03106.
Lugosi, G. and Mendelson, S. (2016). Risk minimization by median-of-means tournaments. arXiv preprint arXiv:1608.00757.
Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. (2014). Convolutional kernel networks. In Advances in Neural Information Processing Systems, pages 2627–2635.
Maurer, A. (2014). A chain rule for the expected suprema of Gaussian processes. In Algorithmic Learning Theory: 25th International Conference, ALT 2014, Bled, Slovenia, October 8-10, 2014, Proceedings, volume 8776, page 245. Springer.
Maurer, A. (2016). A vector-contraction inequality for Rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer.
Maurer, A. and Pontil, M. (2016). Bounds for vector-valued function estimation. arXiv preprint arXiv:1606.01487.
Micchelli, C. A. and Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1):177–204.
Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes rendus hebdomadaires des séances de l’Académie des sciences, 255:2897–2899.
Sangnier, M., Fercoq, O., and d’Alché-Buc, F. (2017). Data sparse nonparametric regression with ε-insensitive losses. In Asian Conference on Machine Learning, pages 192–207.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004). Support Vector Machine Applications in Computational Biology. MIT Press.
Vardi, Y. (1985). Empirical distributions in selection bias models. The Annals of Statistics, 13:178–203.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408.
Appendices: Kernel Autoencoder
54
Functional spaces similarity
• Neural mapping f_NN parametrized by a matrix A ∈ R^{p×d} with rows (a_j)_{j≤p}, and an activation function σ
• Kernel mapping f_OVK from a decomposable OVK K = k I_p, associated to the (scalar) feature map φ_k
f_NN(x) = ( σ(⟨a_1, x⟩), …, σ(⟨a_p, x⟩) )⊤      f_OVK(x) = ( f¹(x) = ⟨f¹, φ_k(x)⟩, …, f^p(x) = ⟨f^p, φ_k(x)⟩ )⊤
They only differ in the order in which the linear/nonlinear mappings are applied (and in their nature)
55
Disentangling concentric circles
More complex layers enhance the learned representations
[Plots omitted: learned 2D representations of the concentric circles across successive layers]
Fig. 6: KAE performance on concentric circles
56
KAE generalization bound
2-layer KAE on data bounded in norm by M, with:
• internal layer of size p
• encoder f ∈ H_1 such that ‖f‖ ≤ s
• decoder g ∈ H_2 such that ‖g‖ ≤ t, with Lipschitz constant L
Then it holds:
ε(ĝ_n ∘ f̂_n) − ε* ≤ C_0 L M s t √(Kp/n) + 24M² √(log(2/δ)/(2n)),
with ε(g ∘ f) = E_X ‖X − g ∘ f(X)‖²_{X_0}
57
Proof
Based on the vector-valued Rademacher average:
R_n(C(S)) = E_σ[ sup_{h∈C} (1/n) ∑_{i=1}^n ⟨σ_i, h(x_i)⟩_H ].
With H_{s,t} ⊂ F(X_0, X_0) the class composed from H_{1,s} and H_{2,t}, and ℓ the squared norm on X_0, it holds:
R_n((ℓ ∘ (id − H_{s,t}))(S)) ≤ 2√2 M R_n((id − H_{s,t})(S)) ≤ 2√2 M R_n(H_{s,t}(S)) ≤ 2√π M G_n(H_{s,t}(S)).
G_n(H_{s,t}(S)) ≤ C_1 L(H_{2,t}, H_{1,s}(S)) G_n(H_{1,s}(S)) + (C_2/n) R(H_{2,t}, H_{1,s}(S)) D(H_{1,s}(S)) + (1/n) G(H_{2,t}(0)),
using [Maurer, 2016, Maurer, 2014] in the spirit of [Maurer and Pontil, 2016].
58
Appendices: Duality in vv-RKHSs
59
On the invariance assumption
With Ŷ = span{y_j, j ≤ n}, the assumption reads:
∀(x, x′) ∈ X², y ∈ Ŷ ⇒ K(x, x′)y ∈ Ŷ
• We do not need it to hold for every collection {y_i}_{i≤n} ∈ Y^n
• Rather an a posteriori condition to ensure that the kernel is aligned
• The little we know about Y should be preserved through K
• If Y is finite dimensional, and there are sufficiently many outputs, then Ŷ = Y
• Identity-decomposable kernels fit (nontrivial in infinite dimension)
• The empirical covariance kernel ∑_i y_i ⊗ y_i [Kadri et al., 2013] fits
60
Admissible kernels
• K(s, t) = ∑_i k_i(s, t) y_i ⊗ y_i, with k_i positive semi-definite (p.s.d.) scalar kernels for all i ≤ n
• K(s, t) = ∑_i μ_i k(s, t) y_i ⊗ y_i, with k a p.s.d. scalar kernel and μ_i ≥ 0 for all i ≤ n
• K(s, t) = ∑_i k(s, x_i) k(t, x_i) y_i ⊗ y_i
• K(s, t) = ∑_{i,j} k_ij(s, t) (y_i + y_j) ⊗ (y_i + y_j), with k_ij p.s.d. scalar kernels for all i, j ≤ n
• K(s, t) = ∑_{i,j} μ_ij k(s, t) (y_i + y_j) ⊗ (y_i + y_j), with k a p.s.d. scalar kernel and μ_ij ≥ 0
• K(s, t) = ∑_{i,j} k(s, x_i, x_j) k(t, x_i, x_j) (y_i + y_j) ⊗ (y_i + y_j)
61
Admissible losses
∀i ≤ n, ∀(α_Ŷ, α_⊥) ∈ Ŷ × Ŷ^⊥,   ℓ_i*(α_Ŷ) ≤ ℓ_i*(α_Ŷ + α_⊥)
• ℓ_i(y) = f(⟨y, z_i⟩), z_i ∈ Ŷ and f : R → R convex. Maximum-margin obtained with z_i = y_i and f(t) = max(0, 1 − t).
• ℓ(y) = f(‖y‖), f : R⁺ → R convex increasing s.t. t ↦ f′(t)/t is continuous over R⁺. Includes the functions (λ/η)‖y‖^η_Y for η > 1, λ > 0.
• ∀λ > 0, with B_λ the centered ball of radius λ: ℓ(y) = λ‖y‖, ℓ(y) = λ‖y‖ log(‖y‖), ℓ(y) = χ_{B_λ}(y), ℓ(y) = λ(exp(‖y‖) − 1).
• ℓ_i(y) = f(y − y_i), with f* verifying the condition.
• Infimal convolution of functions verifying the condition (ε-insensitive losses [Sangnier et al., 2017], the Huber loss [Huber, 1964], Moreau or Pasch-Hausdorff envelopes [Moreau, 1962, Bauschke et al., 2011]).
62
Proof of the Double Representer Theorem
Dual problem:
(α̂_i)_{i=1}^n ∈ argmin_{(α_i)_{i=1}^n ∈ Y^n} ∑_{i=1}^n ℓ_i*(−α_i) + (1/(2Λn)) ∑_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y.
• Decompose α_i = α_i^Ŷ + α_i^⊥, with (α_i^Ŷ)_{i≤n} ∈ Ŷ^n and (α_i^⊥)_{i≤n} ∈ (Ŷ^⊥)^n
• Assume that ℓ_i*(α^Ŷ) ≤ ℓ_i*(α^Ŷ + α^⊥) (satisfied if ℓ relies on ⟨·,·⟩)
Then it holds:
∑_{i=1}^n ℓ_i*(−α_i^Ŷ) + (1/(2Λn)) ∑_{i,j=1}^n ⟨α_i^Ŷ, K(x_i, x_j) α_j^Ŷ⟩_Y
≤ ∑_{i=1}^n ℓ_i*(−α_i^Ŷ − α_i^⊥) + (1/(2Λn)) ∑_{i,j=1}^n ⟨α_i^Ŷ + α_i^⊥, K(x_i, x_j)(α_j^Ŷ + α_j^⊥)⟩_Y.
63
Approximating the dual problem if no invariance
The kernel K = k · A is a separable OVK, with A a compact operator.
There exists an o.n.b. (ψ_j)_{j=1}^∞ of Y s.t. A = ∑_{j=1}^∞ λ_j ψ_j ⊗ ψ_j (λ_j ≥ 0).
There exists (ω_i)_{i=1}^n ∈ ℓ²(R)^n such that ∀i ≤ n, α_i = ∑_{j=1}^∞ ω_ij ψ_j.
Denoting Y_m = span({ψ_j}_{j=1}^m) and S = diag(λ_j)_{j=1}^m, solve instead:
min_{(α_i)_{i=1}^n ∈ Y_m^n} ∑_{i=1}^n ℓ_i*(−α_i) + (1/(2Λn)) ∑_{i,j=1}^n ⟨α_i, K(x_i, x_j) α_j⟩_Y.
The final solution is given by: ĥ = (1/Λn) ∑_{i=1}^n ∑_{j=1}^m k(·, x_i) λ_j ω̂_ij ψ_j,
with Ω̂ solution to:
min_{Ω∈R^{n×m}} ∑_{i=1}^n L_i(Ω_{i:}, R_{i:}) + (1/(2Λn)) Tr(K^X Ω S Ω⊤).
64
Application to robust function-to-function regression
• Predict lip acceleration from EMG signals [Kadri et al., 2016]
• Dataset augmented with outliers, model learned with the Huber loss
• Improvement for every output size m
[Plot omitted: LOO generalization error as a function of κ for output sizes m = 4, 5, 6, 7, 15, against the ridge regression baseline]
Fig. 7: LOO generalization error w.r.t. κ
65
Application to kernel autoencoding
• Experiments on molecules with the Tanimoto-Gaussian kernel
• Empirical improvements for a wide range of ε
• Introduces sparsity
[Plot omitted: MSE and ‖W‖_{2,1} norm of the ε-KAE vs. the standard KAE as functions of ε, with the number of discarded data]
Fig. 8: Performances of the ε-insensitive Kernel Autoencoder
66
Algorithmic stability analysis [Bousquet and Elisseeff, 2002]
Algorithm A has stability β if for any sample S_n and any i ≤ n, it holds:
sup_{(x,y)∈X×Y} |ℓ(h_{A(S_n)}(x), y) − ℓ(h_{A(S_n^{∖i})}(x), y)| ≤ β
Let A be an algorithm with stability β and loss function bounded by M. Then, for any n ≥ 1 and δ ∈ (0, 1), it holds with probability at least 1 − δ:
R(h_{A(S_n)}) ≤ R_n(h_{A(S_n)}) + 2β + (4nβ + M) √(ln(1/δ)/(2n)).
If ‖K(x, x)‖_op ≤ γ² and |ℓ(h_S(x), y) − ℓ(h_{S^{∖i}}(x), y)| ≤ C ‖h_S(x) − h_{S^{∖i}}(x)‖_Y, then the OVK algorithm has stability β ≤ C²γ²/(Λn) [Audiffren and Kadri, 2013].
Loss       M                                                    C
ε-SVR      √(M_Y − ε) (√2 γ/√Λ + √(M_Y − ε))                    1
ε-Ridge    (M_Y − ε)² (1 + 2√2 γ/√Λ + 2γ²/Λ)                    2(M_Y − ε)(1 + γ√2/√Λ)
κ-Huber    κ √(M_Y − κ/2) (γ√(2κ)/√Λ + √(M_Y − κ/2))            κ
67
Appendices: Learning with Sample Bias
68
Empirical Risk Minimization (ERM)
General goal of supervised machine learning: from a r.v. Z = (X, Y) and a loss function ℓ : Y × Y → R, find:
h* = argmin_{h measurable} R(h) = E_P[ℓ(h(X), Y)].
Empirical Risk Minimization (ERM):
• P is unknown (and the set of measurable functions is too large)
• sample (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ P, hypothesis set H
ĥ_n = argmin_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(X_i), Y_i) = argmin_{h∈H} E_{P̂_n}[ℓ(h(X), Y)],
with P̂_n = (1/n) ∑_i δ_{Z_i} and Z_i = (X_i, Y_i). It holds P̂_n → P as n → +∞.
69
Importance Sampling (IS)
What if the data is not drawn from P?
Sample (X_1, Y_1), …, (X_n, Y_n) i.i.d. ∼ Q such that dQ/dP(z) = q(z)/p(z). Now (1/n) ∑_i δ_{Z_i} = Q̂_n → Q.
Ex: q(x)/p(x) = 1/2.
min_{h∈H} (1/n) ∑_{i=1}^n ℓ(h(X_i), Y_i) · p(Z_i)/q(Z_i) = min_{h∈H} E_{Q̂_n}[ℓ(h(X), Y) · p(Z)/q(Z)]
↓
E_Q[ℓ(h(X), Y) · p(Z)/q(Z)] = E_P[ℓ(h(X), Y)]
70
Importance Sampling (IS)
Same reweighting, but now q(x)/p(x) = I{15 ≤ x ≤ 55}:
min_{h∈H} E_{Q̂_n}[ℓ(h(X), Y) · p(Z)/q(Z)] is not possible, since p/q is undefined outside the support of q.
70
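A numpy sketch of the reweighting when it is possible, i.e. when q > 0 wherever p > 0; the Gaussian densities and the quadratic per-sample loss are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=1.0, size=5_000)          # Z ~ Q = N(1, 1)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)    # target density, P = N(0, 1)
q = lambda x: np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)

loss = z ** 2                                           # some per-sample loss l(h, Z)
naive = loss.mean()                                     # estimates E_Q[l], biased for E_P[l] (= 2 here)
weighted = (loss * p(z) / q(z)).mean()                  # estimates E_P[l] (= 1 here)
```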
Adding samples
q_1(x)/p(x) = I{15 ≤ x ≤ 55}
q_2(x)/p(x) = I{50 ≤ x ≤ 70}
q_3(x)/p(x) = I{x ≤ 20} + I{x ≥ 60}
We need: ⋃_{k=1}^K Supp(q_k) = Supp(p).
Sample-wise IS does not work because of the sample proportions.
71
Setting and assumptions
• K independent i.i.d. samples D_k = {Z_{k,1}, …, Z_{k,n_k}}
• n = ∑_k n_k, λ̂_k = n_k/n for k ≤ K
• sample k drawn according to Q_k such that dQ_k/dP(z) = ω_k(z)/Ω_k
• The Ω_k = E_P[ω_k(Z)] = ∫_Z ω_k(z) P(dz) are unknown
• ∃ C, λ, λ_1, …, λ_K > 0, |λ̂_k − λ_k| ≤ C/√n and λ ≤ λ_k
• The graph G_κ is connected
• ∃ ξ > 0, ∀k ≤ K, Ω_k ≥ ξ
• ∃ m, M > 0, m ≤ inf_z max_{k≤K} ω_k(z) and sup_z max_{k≤K} ω_k(z) ≤ M
72
Building an unbiased estimate of P (1/2)
Without considering the bias issue:
Q̂_n = (1/n) ∑_{i=1}^n δ_{Z_i} = ∑_{k=1}^K (n_k/n) (1/n_k) ∑_{i∈D_k} δ_{Z_i} → ∑_{k=1}^K λ_k Q_k ≠ P.
But it holds:
dQ_k = (ω_k/Ω_k) dP,   ∑_k λ_k dQ_k = (∑_k λ_k ω_k/Ω_k) dP,
so   dP = (∑_k λ_k ω_k/Ω_k)^{-1} ∑_k λ_k dQ_k.   (1)
We only need to estimate the Ω_k’s.
73
Building an unbiased estimate of P (2/2)
It holds:
Ω_k = ∫ ω_k dP = ∫ ω_k (∑_l λ_l ω_l/Ω_l)^{-1} ∑_l λ_l dQ_l.
Ω̂ is the solution to the system:
∀k ≤ K,   Ĥ_k(Ω) − 1 = 0,
with Ĥ_k(Ω) = (1/Ω_k) ∫ ω_k (∑_l λ̂_l ω_l/Ω_l)^{-1} ∑_l λ̂_l dQ̂_l.
The final estimate is obtained by plugging Ω̂ into Equation (1).
74
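One possible empirical counterpart of this system is the fixed-point iteration sketched below, in the spirit of [Vardi, 1985]: the Ω_k’s are updated through the plug-in version of the integral above, computed on the pooled sample. The weight functions, the stopping rule and the handling of the weights π_i are assumptions of this sketch, not the exact procedure of the thesis.

```python
import numpy as np

def estimate_Omega(z, src, omegas, n_iter=200):
    # z: pooled sample (all K datasets), src: numpy array with the index of the
    # dataset each z_i comes from, omegas: list of the K weight functions w_k.
    # Fixed-point iteration on
    #   Omega_k = (1/n) sum_i w_k(z_i) / ( sum_l lambda_l w_l(z_i) / Omega_l ),
    # an empirical counterpart of the system H_k(Omega) = 1 above.
    n, K = len(z), len(omegas)
    lam = np.array([(src == k).mean() for k in range(K)])          # lambda_k = n_k / n
    W = np.stack([w(z) for w in omegas])                           # K x n matrix of w_k(z_i)
    Omega = np.ones(K)
    for _ in range(n_iter):
        denom = (lam[:, None] * W / Omega[:, None]).sum(axis=0)    # sum_l lambda_l w_l(z_i)/Omega_l
        Omega = (W / denom).mean(axis=1)
    denom = (lam[:, None] * W / Omega[:, None]).sum(axis=0)
    pi = 1.0 / (n * denom)                                         # weights pi_i of the debiased measure
    return Omega, pi
```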
Non-asymptotic guarantees
Debiasing procedure due to [Vardi, 1985, Gill et al., 1988], but only asymptotic results.
With P̂_n = (∑_k λ̂_k ω_k/Ω̂_k)^{-1} ∑_k λ̂_k dQ̂_k, there exist (π_i)_{i≤n} such that:
E_{P̂_n}[ℓ(h(X), Y)] = ∑_{i=1}^n π_i · ℓ(h(X_i), Y_i),   (2)
and ĥ_n, minimizer of Equation (2), satisfies with probability 1 − δ:
R(ĥ_n) − R(h*) ≤ C_1 √(K³/n) + C_2 √(K log n / n) + C_3 √(K log(1/δ)/n).
75
Experiments on the Adult dataset
[Plots omitted: population and proportion of incomes above 50k$ per years of education; proportion of incomes above 50k$ per age, for high/medium/low education levels]
Dataset of size 6,000: 98% from 13+ years of education, 2% unbiased. Scores:
                   LogReg          RF
ERM                63.95 ± 1.37    42.73 ± 3.36
debiased ERM       79.77 ± 1.72    43.58 ± 4.77
unbiased sample    77.75 ± 2.27    22.16 ± 6.18
76