

Page 1:

Deep Kernel Representation Learning for Complex Data and Reliability Issues

Pierre LAFORGUE, PhD Defense
LTCI, Télécom Paris, France

Jury:

d'ALCHÉ-BUC Florence, Supervisor
BONALD Thomas, President
CLÉMENÇON Stephan, co-Supervisor
KADRI Hachem, Examiner
LUGOSI Gábor, Reviewer
MAIRAL Julien, Examiner
VERT Jean-Philippe, Reviewer

0

Page 2:

Motivation: need for structured data representations

Goal of ML: infer, from a set of examples, the relationship between some explanatory variables x and a target output y

A representation: set of features characterizing the observations

Ex 1: digit recognition (MNIST)

Ex 2: molecule activity prediction

How to (automatically) learn structured data representations?

1

Page 3:

Motivation: need for reliable procedures

Empirical Risk Minimization: minimize the average error on train data

[Figure: histogram of molecule activity values (x-axis: Activity, from 0.0 to 1.0; y-axis: Occurrences).]

Ordinary Least Squares fails: need for more robust loss functions and/or mean estimators

Importance Sampling may only correct on the space covered by the training observations

How to adapt to data with outliers and/or sampling bias?

2

Page 4:

Outline for today

Empirical Risk Minimization (ERM), formally:

minh measurable

EP

[`(h(X ),Y )

]→ min

h∈H

1n

n∑i=1

`(

h(xi ), yi

)

Part I: Deep kernel architectures for complex data

Part II: Robust losses for RKHSs with infinite dimensional outputs

Part III: Reliable learning through Median-of-Means approaches

Backup: Statistical learning from biased training samples

3

Page 5:

Part I: Deep kernel architectures for complex data

$$\min_{h \text{ measurable}} \; \mathbb{E}_P\big[\ell(h(X), Y)\big] \;\longrightarrow\; \min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)$$

4

Page 6:

Two opposite representation learning paradigms

Deep Learning: representations are learned along with the training, key to its success [Erhan et al., 2009]

Kernel Methods: linear method after embedding through a feature map φ; choice of kernel ⇐⇒ choice of representation

Question: Is it possible to combine both approaches [Mairal et al., 2014]?

5

Page 7:

Autoencoders (AEs)

• Idea: compress and reconstruct inputs by a Neural Net (NN)

• Base mapping: $f_{W,b} : \mathbb{R}^d \to \mathbb{R}^p$ such that $f_{W,b}(x) = \sigma(Wx + b)$

• Hour-glass shaped network, reconstruction criterion:

$$\min_{W, b, W', b'} \; \big\| x - f_{W',b'} \circ f_{W,b}(x) \big\|^2 \quad \text{(self-supervised)}$$

Fig. 1: A 1-hidden-layer autoencoder: inputs $(x_1, \ldots, x_4)$ are mapped to a code $(z_1, z_2)$ via weights $(W, b)$, then back to reconstructions $(x'_1, \ldots, x'_4)$ via $(W', b')$.
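For illustration only (not the architecture used later in the deck), a minimal one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error; layer sizes, learning rate and data are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data standing in for real observations: n points in dimension d.
rng = np.random.default_rng(0)
n, d, p = 200, 4, 2                                  # p < d: hour-glass bottleneck
X = rng.normal(size=(n, d))

W, b = 0.1 * rng.normal(size=(d, p)), np.zeros(p)    # encoder parameters
W2, b2 = 0.1 * rng.normal(size=(p, d)), np.zeros(d)  # decoder parameters (linear, for simplicity)

lr = 0.5
for _ in range(500):
    Z = sigmoid(X @ W + b)              # code z = sigma(Wx + b)
    Xhat = Z @ W2 + b2                  # reconstruction
    G = 2.0 * (Xhat - X) / n            # gradient of the mean squared reconstruction error
    gW2, gb2 = Z.T @ G, G.sum(axis=0)
    dH = (G @ W2.T) * Z * (1.0 - Z)     # backpropagation through the sigmoid
    gW, gb = X.T @ dH, dH.sum(axis=0)
    W, b, W2, b2 = W - lr * gW, b - lr * gb, W2 - lr * gW2, b2 - lr * gb2

print("reconstruction MSE:", np.mean((sigmoid(X @ W + b) @ W2 + b2 - X) ** 2))
```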

6

Page 8:

Autoencoders: uses

• Data compression, link to Principal Component Analysis (PCA) [Bourlard and Kamp, 1988, Hinton and Salakhutdinov, 2006]
• Pre-training of neural networks [Bengio et al., 2007]
• Denoising [Vincent et al., 2010]
• For non-vectorial data?

Fig. 2: PCA/AE on MNIST (reproduced from HS '06)

Fig. 3: Pre-training of a bigger network through AEs

7

Page 9:

Scalar kernel methods [Schölkopf et al., 2004]

• Feature map $\phi : \mathcal{X} \to \mathcal{H}_k$ associated to the scalar kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that $\langle \phi(x), \phi(x')\rangle_{\mathcal{H}_k} = k(x, x')$

• Replace x with φ(x) and use linear methods. Ridge regression:

$$\min_{\beta \in \mathbb{R}^p} \sum_i \big(y_i - \langle x_i, \beta\rangle_{\mathbb{R}^p}\big)^2 + 2n\lambda \|\beta\|_{\mathbb{R}^p}^2 \qquad\leadsto\qquad \min_{\omega \in \mathcal{H}_k} \sum_i \big(y_i - \langle \phi(x_i), \omega\rangle_{\mathcal{H}_k}\big)^2 + 2n\lambda \|\omega\|_{\mathcal{H}_k}^2$$

• In an autoencoder? Need for Hilbert-valued functions!

$$\min_{f_\ell} \; \frac{1}{n}\sum_{i=1}^{n} \Big\| \phi(x_i) - f_L \circ \ldots \circ f_1\big(\phi(x_i)\big) \Big\|_{\mathcal{H}_k}^2$$
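A minimal sketch of scalar kernel ridge regression in its dual form, as used above; the Gaussian kernel, its bandwidth, and the $(K + n\lambda I)^{-1}$ regularization scaling are choices made for this example and may differ by a constant factor from the slide's convention.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 1e-2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.linspace(-1, 1, 5)[:, None]
y_pred = gaussian_kernel(X_test, X) @ alpha                     # h(x) = sum_i alpha_i k(x, x_i)
print(y_pred)
```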

8

Page 10:

Operator-valued kernel methods [Carmeli et al., 2006]

Generalization of scalar kernel methods to output Hilbert spaces:

• $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$  vs.  $K : \mathcal{X} \times \mathcal{X} \to \mathcal{L}(\mathcal{Y})$
• $k(x', x) = k(x, x')$  vs.  $K(x', x) = K(x, x')^*$
• $\sum_{i,j} \alpha_i\, k(x_i, x_j)\, \alpha_j \ge 0$  vs.  $\sum_{i,j} \langle \alpha_i, K(x_i, x_j)\alpha_j\rangle_{\mathcal{Y}} \ge 0$
• $\mathcal{H}_k = \overline{\mathrm{Span}}\{k(\cdot, x)\} \subset \mathbb{R}^{\mathcal{X}}$  vs.  $\mathcal{H}_K = \overline{\mathrm{Span}}\{K(\cdot, x)y\} \subset \mathcal{Y}^{\mathcal{X}}$

Kernel trick in the output space [Cortes '05, Geurts '06, Brouard '11, Kadri '13, Brouard '16], Input Output Kernel Regression (IOKR).

9

Page 11:

How to learn in vector-valued RKHSs? OVK ridge regression

For $(x_i, y_i)_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ with $\mathcal{Y}$ a Hilbert space, we want to solve:

$$h \in \operatorname*{argmin}_{h \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} \big\| h(x_i) - y_i \big\|_{\mathcal{Y}}^2 + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_K}^2.$$

Representer Theorem [Micchelli and Pontil, 2005]: $\exists\, (\alpha_i)_{i=1}^n \in \mathcal{Y}^n$ s.t. $h(\cdot) = \sum_{i=1}^{n} K(\cdot, x_i)\,\alpha_i$, and differentiating gives the system:

$$\sum_{i=1}^{n} \big( K(x_j, x_i) + \Lambda n\, \delta_{ji}\, I_{\mathcal{Y}} \big)\, \alpha_i = y_j, \qquad j = 1, \ldots, n.$$

If $K(x, x') = k(x, x')\, I_{\mathcal{Y}}$, then closed-form solution: $\alpha_i = \sum_j A_{ij}\, y_j$ with $A = (K + \Lambda n\, I_n)^{-1}$.

10
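A sketch of the closed form above in the identity-decomposable case $K(x, x') = k(x, x')\, I_{\mathcal{Y}}$; a finite-dimensional output space stands in for $\mathcal{Y}$, and the Gaussian scalar kernel and dimensions are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, d, q = 80, 3, 5                                  # q = output dimension (stand-in for Y)
X = rng.normal(size=(n, d))
Y = np.tanh(X @ rng.normal(size=(d, q)))            # toy vector-valued targets

Lam = 1e-1
K = gaussian_kernel(X, X)
A = np.linalg.inv(K + Lam * n * np.eye(n))          # A = (K + Lambda*n*I)^(-1)
alpha = A @ Y                                       # alpha_i = sum_j A_ij y_j (rows of alpha)

X_new = rng.normal(size=(2, d))
Y_hat = gaussian_kernel(X_new, X) @ alpha           # h(x) = sum_i k(x, x_i) alpha_i
print(Y_hat.shape)                                  # (2, 5)
```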

Page 12:

The Kernel Autoencoder [Laforgue et al., 2019a]

[Diagram: a complex structured input $x \in \mathcal{X}$ is embedded as $\phi(x) \in \mathcal{F}_X$, encoded by $f_1 \in \mathcal{H}_{K_1}$ (vv-RKHS) into a finite-dimensional representation $z \in \mathbb{R}^2$ (e.g., (0.94, 0.55)), then decoded by $f_2 \in \mathcal{H}_{K_2}$ (vv-RKHS) back to $\mathcal{F}_X$.]

K2AE: $$\min_{f_\ell \in \text{vv-RKHS}} \; \frac{1}{n}\sum_{i=1}^{n} \Big\| \phi(x_i) - f_L \circ \ldots \circ f_1\big(\phi(x_i)\big) \Big\|_{\mathcal{F}_X}^2 + \sum_{\ell=1}^{L} \lambda_\ell \|f_\ell\|_{\mathcal{H}_\ell}^2$$

11

Page 13:

Connection to kernel Principal Component Analysis (PCA)

2-layer K2AE with linear kernels, internal layer of size p, and no penalization. Let $((\sigma_1, u_1), \ldots, (\sigma_p, u_p))$ denote the largest eigenvalue/eigenvector pairs of $K_{\text{in}}$. It holds:

K2AE output: $(\sqrt{\sigma_1}\, u_1, \ldots, \sqrt{\sigma_p}\, u_p) \in \mathbb{R}^{n \times p}$

KPCA output: $(\sigma_1 u_1, \ldots, \sigma_p u_p) \in \mathbb{R}^{n \times p}$

Proof: $X \in \mathbb{R}^{n \times d}$, $Y = XX^\top A \in \mathbb{R}^{n \times p}$, $Z = YY^\top B$. The objective writes $\min_{A,B} \|X - Z\|_{\text{Fro}}^2$ and Eckart-Young gives $Z^* = U_d\, \Sigma_p\, V_d^\top$ with $X = U_d\, \Sigma_d\, V_d^\top$.

Sufficient: $A = U_p\, \Sigma_p^{-3/2} \in \mathbb{R}^{n \times p}$, $\quad B = U_d\, V_d^\top \in \mathbb{R}^{n \times d}$.

Extends to $X \in \mathcal{L}(\mathcal{Y}, \mathbb{R}^n)$ as the SVD exists for compact operators.

12

Page 14:

A composite representer theorem [Laforgue et al., 2019a]

How to train the Kernel Autoencoder?

$$\min_{f_\ell \in \text{vv-RKHS}} \; \frac{1}{n}\sum_{i=1}^{n} \Big\| \phi(x_i) - f_L \circ \ldots \circ f_1\big(\phi(x_i)\big) \Big\|_{\mathcal{F}_X}^2 + \sum_{\ell=1}^{L} \lambda_\ell \|f_\ell\|_{\mathcal{H}_\ell}^2$$

For $\ell \le L$, $\mathcal{X}_\ell$ Hilbert, $\mathcal{X}_0 = \mathcal{X}_L = \mathcal{F}_X$, $K_\ell : \mathcal{X}_{\ell-1} \times \mathcal{X}_{\ell-1} \to \mathcal{L}(\mathcal{X}_\ell)$.

For all $L_0 \le L$, there are $(\alpha_{1,1}, \ldots, \alpha_{1,n}, \ldots, \alpha_{L_0,1}, \ldots, \alpha_{L_0,n}) \in \mathcal{X}_1^n \times \ldots \times \mathcal{X}_{L_0}^n$ such that for all $\ell \le L_0$ it holds:

$$f_\ell(\cdot) = \sum_{i=1}^{n} K_\ell\big(\cdot\,, x_i^{(\ell-1)}\big)\, \alpha_{\ell,i},$$

with the notation, for all $i \le n$: $x_i^{(\ell)} = f_\ell \circ \ldots \circ f_1(x_i)$ and $x_i^{(0)} = x_i$.

13

Page 15:

Optimization algorithm

How to train the Kernel Autoencoder?

$$\min_{f_\ell \in \text{vv-RKHS}} \; \frac{1}{n}\sum_{i=1}^{n} \Big\| \phi(x_i) - f_L \circ \ldots \circ f_1\big(\phi(x_i)\big) \Big\|_{\mathcal{F}_X}^2 + \sum_{\ell=1}^{L} \lambda_\ell \|f_\ell\|_{\mathcal{H}_\ell}^2$$

• The last layer's infinite-dimensional coefficients make it impossible to perform Gradient Descent directly

• Yet, the gradient can propagate through the last layer (with $[N_L]_{ij} = \langle \alpha_{L,i}, \alpha_{L,j}\rangle$):

$$\sum_{i,i'=1}^{n} [N_\ell]_{ii'} \Big( \nabla_{(1)} k_\ell\big(x_i^{(\ell-1)}, x_{i'}^{(\ell-1)}\big) \Big)^\top \mathrm{Jac}_{x_i^{(\ell-1)}}\big(\alpha_{\ell_0, i_0}\big)$$

• If the inner layers are fixed and $K_L = k_L\, I_{\mathcal{X}_0}$, closed-form solution for $N_L$

Alternate descent: gradient steps and OVKRR resolution

14

Page 16:

Application to molecule activity prediction

KAE representations are useful for downstream supervised tasks

[Figure: normalized mean squared errors on 5 cancer indices for KRR, KPCA (representation sizes 5, 10, 25, 50, 100) and K2AE (sizes 5, 10, 25, 50, 100).]

Fig. 4: Performance of the different strategies on 5 cancers (NCI dataset)

15

Page 17:

Part II: Robust losses for RKHSs with infinite dimensional outputs

$$\min_{h \text{ measurable}} \; \mathbb{E}_P\big[\ell(h(X), Y)\big] \;\longrightarrow\; \min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)$$

16

Page 18:

Infinite dimensional outputs in machine learning

Kernel Autoencoder [Laforgue et al., 2019a]:

$$\min_{h_1 \in \mathcal{H}_K^1,\, h_2 \in \mathcal{H}_K^2} \; \frac{1}{2n}\sum_{i=1}^{n} \big\| \phi(x_i) - h_2 \circ h_1(\phi(x_i)) \big\|_{\mathcal{F}_X}^2 + \Lambda\, \mathrm{Reg}(h_1, h_2)$$

Structured prediction by ridge-IOKR [Brouard et al., 2016]:

$$h = \operatorname*{argmin}_{h \in \mathcal{H}_K} \; \frac{1}{2n}\sum_{i=1}^{n} \big\| \phi(z_i) - h(x_i) \big\|_{\mathcal{F}_Y}^2 + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_K}^2, \qquad g(x) = \operatorname*{argmin}_{z \in \mathcal{Z}} \big\| \phi(z) - h(x) \big\|_{\mathcal{F}_Z}$$

Function-to-function regression [Kadri et al., 2016]:

[Figure: input EMG curves (millivolts vs. seconds) and output lip acceleration curves (meters/s² vs. seconds).]

$$\min_{h \in \mathcal{H}_K} \; \frac{1}{2n}\sum_{i=1}^{n} \big\| y_i - h(x_i) \big\|_{L^2}^2 + \frac{\Lambda}{2}\|h\|^2$$

17

Page 19:

Purpose of this part

Question: Is it possible to extend the previous approaches to different (ideally robust) loss functions?

First answer: Yes, possible extensions to maximum-margin regression [Brouard et al., 2016], and to ε-insensitive loss functions for matrix-valued kernels [Sangnier et al., 2017]

What about general Operator-Valued Kernels (OVKs)?

What about other types of loss functions?

18

Page 20:

Learning in vector-valued RKHSs (reminder)

For $(x_i, y_i)_{i=1}^n \in (\mathcal{X} \times \mathcal{Y})^n$ with $\mathcal{Y}$ a Hilbert space, we want to solve:

$$h \in \operatorname*{argmin}_{h \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_K}^2.$$

Representer Theorem [Micchelli and Pontil, 2005]: $\exists\, (\alpha_i)_{i=1}^n \in \mathcal{Y}^n$ (infinite dimensional!) s.t. $h(\cdot) = \sum_{i=1}^{n} K(\cdot, x_i)\,\alpha_i$.

If $\ell(\cdot, \cdot) = \frac{1}{2}\|\cdot - \cdot\|_{\mathcal{Y}}^2$ and $K = k \cdot I_{\mathcal{Y}}$: $\alpha_i = \sum_{j=1}^{n} A_{ij}\, y_j$, $A = (K + n\Lambda\, I_n)^{-1}$.

19

Page 21:

Applying duality

$$h \in \operatorname*{argmin}_{h \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} \ell_i\big(h(x_i)\big) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_K}^2 \quad \text{writes} \quad h = \frac{1}{\Lambda n}\sum_{i=1}^{n} K(\cdot, x_i)\,\alpha_i,$$

with $(\alpha_i)_{i=1}^n \in \mathcal{Y}^n$ the solutions to the dual problem:

$$\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n} \; \sum_{i=1}^{n} \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n}\sum_{i,j=1}^{n} \langle \alpha_i, K(x_i, x_j)\alpha_j\rangle_{\mathcal{Y}},$$

with $f^\star : \alpha \in \mathcal{Y} \mapsto \sup_{y \in \mathcal{Y}} \langle \alpha, y\rangle_{\mathcal{Y}} - f(y)$ the Fenchel-Legendre transform of $f$.

20

Page 22:

Applying duality

Same dual problem as above, but:

• 1st limitation: the FL transform $\ell^\star$ must be computable (→ assumption)

• 2nd limitation: the dual variables $(\alpha_i)_{i=1}^n$ are still infinite dimensional!

If $\mathrm{Span}\{y_j, j \le n\}$ is invariant by $K$, i.e. $y \in \mathrm{Span}\{y_j\} \Rightarrow K(x, x')y \in \mathrm{Span}\{y_j\}$:

$$\alpha_i \in \mathrm{Span}\{y_j\} \;\to\; \text{possible reparametrization: } \alpha_i = \sum_j \omega_{ij}\, y_j$$

20


Page 24:

The double representer theorem [Laforgue et al., 2020]

Assume that the OVK $K$ and the loss $\ell$ satisfy the appropriate assumptions (verified by standard kernels and losses). Then

$$h = \operatorname*{argmin}_{\mathcal{H}_K} \; \frac{1}{n}\sum_i \ell\big(h(x_i), y_i\big) + \frac{\Lambda}{2}\|h\|_{\mathcal{H}_K}^2 \quad \text{is given by} \quad h = \frac{1}{\Lambda n}\sum_{i,j=1}^{n} K(\cdot, x_i)\, \omega_{ij}\, y_j,$$

with $\Omega = [\omega_{ij}] \in \mathbb{R}^{n \times n}$ the solution to the finite dimensional problem

$$\min_{\Omega \in \mathbb{R}^{n \times n}} \; \sum_{i=1}^{n} L_i\big(\Omega_{i:}, K^Y\big) + \frac{1}{2\Lambda n}\, \mathrm{Tr}\big( \mathbf{M}^\top (\Omega \otimes \Omega) \big),$$

with $\mathbf{M}$ the $n^2 \times n^2$ matrix writing of the tensor $M$ s.t. $M_{ijkl} = \langle y_k, K(x_i, x_j)\, y_l\rangle_{\mathcal{Y}}$.

21

Page 25:

The double representer theorem (2/2)

If $K$ further satisfies $K(x, x') = \sum_t k_t(x, x')\, A_t$, then the tensor $M$ simplifies to $M_{ijkl} = \sum_t [K^X_t]_{ij}\, [K^Y_t]_{kl}$ and the problem rewrites

$$\min_{\Omega \in \mathbb{R}^{n \times n}} \; \sum_{i=1}^{n} L_i\big(\Omega_{i:}, K^Y\big) + \frac{1}{2\Lambda n}\sum_{t=1}^{T} \mathrm{Tr}\big( K^X_t\, \Omega\, K^Y_t\, \Omega^\top \big).$$

Rmk. Only the $n^4$ tensor $\langle y_k, K(x_i, x_j)\, y_l\rangle_{\mathcal{Y}}$ is needed to learn OVKMs. It simplifies to two $n^2$ matrices $K^X$ and $K^Y$ if $K$ is decomposable.

How to apply the duality approach?

22

Page 26:

Infimal convolution and Fenchel-Legendre transforms

Infimal-convolution operator between proper lower semicontinuous functions [Bauschke et al., 2011]:

$$(f \,\square\, g)(x) = \inf_y\; f(y) + g(x - y).$$

Relation to the FL transform: $(f \,\square\, g)^\star = f^\star + g^\star$

Ex: ε-insensitive losses. Let $\ell : \mathcal{Y} \to \mathbb{R}$ be a convex loss with unique minimum at 0, and $\varepsilon > 0$. Its ε-insensitive version, denoted $\ell_\varepsilon$, is defined by:

$$\ell_\varepsilon(y) = (\ell \,\square\, \chi_{B_\varepsilon})(y) = \begin{cases} \ell(0) & \text{if } \|y\|_{\mathcal{Y}} \le \varepsilon \\ \inf_{\|d\|_{\mathcal{Y}} \le 1} \ell(y - \varepsilon d) & \text{otherwise,} \end{cases}$$

and has FL transform:

$$\ell_\varepsilon^\star(y) = (\ell \,\square\, \chi_{B_\varepsilon})^\star(y) = \ell^\star(y) + \varepsilon\|y\|.$$
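A small sketch of the ε-insensitive and Huber-type losses applied to the norm of a vector residual, following the infimal-convolution definitions above; the residual values are made up.

```python
import numpy as np

def eps_insensitive_sq(r, eps):
    # (1/2)||.||^2 infimal-convolved with the indicator of the ball B_eps:
    # 0 inside the eps-ball, (1/2)(||r|| - eps)^2 outside.
    nrm = np.linalg.norm(r)
    return 0.0 if nrm <= eps else 0.5 * (nrm - eps) ** 2

def eps_insensitive_abs(r, eps):
    # ||.|| infimal-convolved with the indicator of B_eps (vector eps-SVR loss).
    return max(np.linalg.norm(r) - eps, 0.0)

def huber(r, kappa):
    # kappa*||.|| infimal-convolved with (1/2)||.||^2: quadratic near 0, linear in the tails.
    nrm = np.linalg.norm(r)
    return 0.5 * nrm ** 2 if nrm <= kappa else kappa * (nrm - kappa / 2.0)

residual = np.array([0.3, -1.2, 0.5])   # toy residual h(x) - y
print(eps_insensitive_sq(residual, 0.5), eps_insensitive_abs(residual, 0.5), huber(residual, 1.0))
```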

23

Page 27:

Interesting loss functions: sparsity and robustness

ε-Ridge: $\;\frac{1}{2}\|\cdot\|^2 \,\square\, \chi_{B_\varepsilon}\;$ (Sparsity)

ε-SVR: $\;\|\cdot\| \,\square\, \chi_{B_\varepsilon}\;$ (Sparsity, Robustness)

κ-Huber: $\;\kappa\|\cdot\| \,\square\, \frac{1}{2}\|\cdot\|^2\;$ (Robustness)

[Figure: 1-D and 2-D profiles of the $\frac{1}{2}\|x\|^2$-insensitive, $\|x\|$ ε-insensitive, and Huber losses.]

24

Page 28:

Specific dual problems

For the ε-ridge, ε-SVR and κ-Huber, it holds $\Omega = W V^{-1}$, with $W$ the solution to these finite dimensional dual problems:

(D1) $\displaystyle \min_{W \in \mathbb{R}^{n \times n}} \; \frac{1}{2}\|AW - B\|_{\text{Fro}}^2 + \varepsilon \|W\|_{2,1}$

(D2) $\displaystyle \min_{W \in \mathbb{R}^{n \times n}} \; \frac{1}{2}\|AW - B\|_{\text{Fro}}^2 + \varepsilon \|W\|_{2,1}$ s.t. $\|W\|_{2,\infty} \le 1$

(D3) $\displaystyle \min_{W \in \mathbb{R}^{n \times n}} \; \frac{1}{2}\|AW - B\|_{\text{Fro}}^2$ s.t. $\|W\|_{2,\infty} \le \kappa$

with $V, A, B$ such that: $VV^\top = K^Y$, $A^\top A = K^X/(\Lambda n) + I_n$ (or $A^\top A = K^X/(\Lambda n)$ for the ε-SVR), and $A^\top B = V$.

25

Page 29:

Application to structured prediction

• Experiments on the YEAST dataset
• Empirically, ε-SV-IOKR outperforms ridge-IOKR for a wide range of ε
• Promotes sparsity and acts as a regularizer

[Figure: test MSE and sparsity (% of null components) as functions of Λ, for KRR and for ε-SVR with ε ranging from 0.1 to 1.0.]

Fig. 5: MSEs and sparsity w.r.t. Λ for several ε

26

Page 30:

Part III: Reliable learning through Median-of-Means approaches

$$\min_{h \text{ measurable}} \; \mathbb{E}_P\big[\ell(h(X), Y)\big] \;\longrightarrow\; \min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)$$

27

Page 31:

Preliminaries

Sample $S_n = \{Z_1, \ldots, Z_n\} \sim Z$ i.i.d. such that $\mathbb{E}[Z] = \theta$

• $\hat\theta_{\text{avg}} = \frac{1}{n}\sum_{i=1}^{n} Z_i$

• $\hat\theta_{\text{med}} = Z_{\sigma(\frac{n+1}{2})}$, with $Z_{\sigma(1)} \le \ldots \le Z_{\sigma(n)}$

• Deviation probabilities [Catoni, 2012]: $\mathbb{P}\{|\hat\theta - \theta| > t\}$

• If $Z$ is bounded (see Hoeffding's inequality) or sub-Gaussian:

$$\mathbb{P}\left\{ \big|\hat\theta_{\text{avg}} - \theta\big| > \sigma\sqrt{\frac{2\ln(2/\delta)}{n}} \right\} \le \delta.$$

Do estimators exist with the same guarantees under weaker assumptions?

How to use them to perform (robust) learning?

28

Page 32:

The Median-of-Means

[Diagram: the sample $Z_1, \ldots, Z_n$ is split into $K$ blocks of size $B$; a mean $\hat\theta_1, \ldots, \hat\theta_K$ is computed on each block, and $\hat\theta_{\text{MoM}}$ is the median of the block means.]

$Z_1, \ldots, Z_n$ i.i.d. realizations of a r.v. $Z$ s.t. $\mathbb{E}[Z] = \theta$, $\mathrm{Var}(Z) = \sigma^2$.

$\forall \delta \in [e^{1 - \frac{2n}{9}}, 1[$, for $K = \lceil \frac{9}{2}\ln(1/\delta)\rceil$ it holds [Devroye et al., 2016]:

$$\mathbb{P}\left\{ \big|\hat\theta_{\text{MoM}} - \theta\big| > 3\sqrt{6}\,\sigma\sqrt{\frac{1 + \ln(1/\delta)}{n}} \right\} \le \delta.$$
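A minimal sketch of the Median-of-Means estimator on a contaminated heavy-tailed sample; the number of blocks is left as a free parameter here (the statement above ties K to the confidence level δ).

```python
import numpy as np

def median_of_means(z, n_blocks):
    # Split the sample into n_blocks blocks, average each, take the median of the averages.
    rng = np.random.default_rng(0)
    z = rng.permutation(z)                     # shuffle so blocks are exchangeable
    blocks = np.array_split(z, n_blocks)
    return np.median([b.mean() for b in blocks])

rng = np.random.default_rng(1)
z = rng.standard_t(df=2.5, size=10_000)        # heavy-tailed sample, true mean 0
z[:50] += 500.0                                # a few gross outliers
print("empirical mean:  ", z.mean())
print("median-of-means: ", median_of_means(z, n_blocks=30))
```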

29

Page 33:

Proof

$\hat\theta_k = \frac{1}{B}\sum_{i \in \mathcal{B}_k} Z_i$, $\quad I_{k,t} = \mathbb{I}\{|\hat\theta_k - \theta| > t\}$, $\quad p_t = \mathbb{E}[I_{1,t}] = \mathbb{P}\{|\hat\theta_1 - \theta| > t\}$

$$\mathbb{P}\big\{ |\hat\theta_{\text{MoM}} - \theta| > t \big\} \le \mathbb{P}\left\{ \sum_{k=1}^{K} I_{k,t} \ge \frac{K}{2} \right\} \le \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} (I_{k,t} - p_t) \ge \frac{1}{2} - \frac{\sigma^2}{Bt^2} \right\}$$
$$\le \exp\left( -2K\Big(\frac{1}{2} - \frac{\sigma^2}{Bt^2}\Big)^2 \right) \le \delta \quad \text{for } K = \frac{9\ln(1/\delta)}{2} \text{ and } \frac{\sigma^2}{Bt^2} = \frac{1}{6} \;\Leftrightarrow\; t = 3\sqrt{3}\,\sigma\sqrt{\frac{\ln(1/\delta)}{n}}.$$

30

Page 34:

U-statistics & pairwise learning

Estimator of $\mathbb{E}[h(Z, Z')]$ with minimal variance, defined from an i.i.d. sample $Z_1, \ldots, Z_n$ as:

$$U_n(h) = \frac{2}{n(n-1)}\sum_{1 \le i < j \le n} h(Z_i, Z_j).$$

Ex: the empirical variance when $h(z, z') = \frac{(z - z')^2}{2}$.

Encountered e.g. in pairwise ranking and metric learning:

$$\hat R_n(r) = \frac{2}{n(n-1)}\sum_{1 \le i < j \le n} \mathbb{I}\big\{ r(X_i, X_j) \cdot (Y_i - Y_j) \le 0 \big\},$$
$$\hat R_n(d) = \frac{2}{n(n-1)}\sum_{1 \le i < j \le n} \mathbb{I}\big\{ Y_{ij} \cdot (d(X_i, X_j) - \varepsilon) > 0 \big\}.$$
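A quick sketch of a second-order U-statistic, using the variance kernel $h(z, z') = (z - z')^2/2$ mentioned above; it matches the unbiased sample variance.

```python
import numpy as np
from itertools import combinations

def u_statistic(z, h):
    # U_n(h) = 2/(n(n-1)) * sum_{i<j} h(z_i, z_j)
    pairs = list(combinations(range(len(z)), 2))
    return sum(h(z[i], z[j]) for i, j in pairs) / len(pairs)

rng = np.random.default_rng(0)
z = rng.normal(size=200)
var_u = u_statistic(z, lambda a, b: 0.5 * (a - b) ** 2)
print(var_u, z.var(ddof=1))   # the two estimates coincide
```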

How to extend MoM to U-statistics?

31

Page 35:

The Median-of-U-statistics

[Diagram: the sample is split into K blocks; within each block all pairs are built and a U-statistic $U_1(h), \ldots, U_K(h)$ is computed; $\hat\theta_{\text{MoU}}(h)$ is the median of the block U-statistics.]

w.p. $1 - \delta$: $\quad \big|\hat\theta_{\text{MoU}}(h) - \theta(h)\big| \le C_1(h)\sqrt{\dfrac{1 + \ln(1/\delta)}{n}} + C_2(h)\,\dfrac{1 + \ln(1/\delta)}{n}$

32

Page 36:

Why randomization?

Three ways to form blocks: a partition of the sample; all possible blocks [Joly and Lugosi, 2016]; random blocks or random pairs.

Randomization allows for a better exploration

33


Page 39:

The Median-of-Randomized-Means [Laforgue et al., 2019b]

[Diagram: K random blocks of size B are drawn from the sample by sampling without replacement (SWoR); block means $\hat\theta_1, \ldots, \hat\theta_K$ are computed and $\hat\theta_{\text{MoRM}}$ is their median.]

With blocks formed by SWoR, $\forall\, \tau \in\, ]0, 1/2[$, $\forall\, \delta \in [2e^{-\frac{8\tau^2 n}{9}}, 1[$, set $K := \big\lceil \frac{\ln(2/\delta)}{2(1/2 - \tau)^2}\big\rceil$ and $B := \big\lfloor \frac{8\tau^2 n}{9\ln(2/\delta)}\big\rfloor$; it holds:

$$\mathbb{P}\left\{ \big|\hat\theta_{\text{MoRM}} - \theta\big| > \frac{3\sqrt{3}\,\sigma}{2\,\tau^{3/2}}\sqrt{\frac{\ln(2/\delta)}{n}} \right\} \le \delta.$$
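A sketch of the Median-of-Randomized-Means: K blocks of size B are drawn by sampling without replacement, each is averaged, and the median of the block means is returned; K and B are left free here rather than set from (τ, δ) as in the statement above.

```python
import numpy as np

def median_of_randomized_means(z, n_blocks, block_size, seed=0):
    rng = np.random.default_rng(seed)
    means = [rng.choice(z, size=block_size, replace=False).mean()   # SWoR block means
             for _ in range(n_blocks)]
    return np.median(means)

rng = np.random.default_rng(1)
z = np.concatenate([rng.normal(size=5_000), np.full(25, 1e3)])      # contaminated sample
print(median_of_randomized_means(z, n_blocks=51, block_size=100))
```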

34

Page 40:

Proof

Random block $\mathcal{B}_k$ characterized by a random vector $\epsilon_k = (\epsilon_{k,1}, \ldots, \epsilon_{k,n}) \in \{0,1\}^n$, i.i.d. uniformly over $\Lambda_{n,B} = \{\epsilon \in \{0,1\}^n : \mathbf{1}^\top \epsilon = B\}$, of cardinality $\binom{n}{B}$.

$\hat\theta_k = \frac{1}{B}\sum_{i \in \mathcal{B}_k} Z_i$, $\quad I_{\epsilon_k,t} = \mathbb{I}\{|\hat\theta_k - \theta| > t\}$, $\quad p_t = \mathbb{E}[I_{\epsilon_k,t}] = \mathbb{P}\{|\hat\theta_1 - \theta| > t\}$

$$U_{n,t} = \mathbb{E}_\epsilon\left[ \frac{1}{K}\sum_{k=1}^{K} I_{\epsilon_k,t} \,\Big|\, S_n \right] = \frac{1}{\binom{n}{B}}\sum_{\epsilon \in \Lambda_{n,B}} I_{\epsilon,t} = \frac{1}{\binom{n}{B}}\sum_{I} \mathbb{I}\left\{ \Big| \frac{1}{B}\sum_{j=1}^{B} X_{I_j} - \theta \Big| > t \right\}$$

$$\mathbb{P}\big\{ |\hat\theta_{\text{MoRM}} - \theta| > t \big\} \le \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_{\epsilon_k,t} \ge \frac{1}{2} \right\},$$

35

Page 41:

Proof

Continuing, by decomposing around $U_{n,t}$ and $p_t$:

$$\mathbb{P}\big\{ |\hat\theta_{\text{MoRM}} - \theta| > t \big\} \le \mathbb{P}\left\{ \frac{1}{K}\sum_{k=1}^{K} I_{\epsilon_k,t} - U_{n,t} + U_{n,t} - p_t \ge \frac{1}{2} - p_t + \tau - \tau \right\}$$
$$\le \exp\left( -2K\Big(\frac{1}{2} - \tau\Big)^2 \right) + \exp\left( -2\,\frac{n}{B}\Big(\tau - \frac{\sigma^2}{Bt^2}\Big)^2 \right).$$

35

Page 42:

The Median-of-Randomized-U-statistics [Laforgue et al., 2019b]

[Diagram: K random blocks are drawn; within each block all pairs are built and a U-statistic $U_1(h), \ldots, U_K(h)$ is computed; $\hat\theta_{\text{MoRU}}(h)$ is their median.]

w.p.a.l. $1 - \delta$: $\quad \big|\hat\theta_{\text{MoRU}} - \theta(h)\big| \le C_1(h, \tau)\sqrt{\dfrac{\ln(2/\delta)}{n}} + C_2(h, \tau)\,\dfrac{\ln(2/\delta)}{n}$

36

Page 43:

The tournament procedure [Lugosi and Mendelson, 2016]

We want $g^* \in \operatorname*{argmin}_{g \in \mathcal{G}} R(g) = \mathbb{E}[(g(X) - Y)^2]$. For any pair $(g, g') \in \mathcal{G}^2$:

1) Compute the MoM estimate of $\|g - g'\|_{L_1}$:
$$\Phi_S(g, g') = \mathrm{median}\big( \hat{\mathbb{E}}_1|g - g'|, \ldots, \hat{\mathbb{E}}_K|g - g'| \big).$$

2) If it is large enough, compute the match
$$\Psi_{S'}(g, g') = \mathrm{median}\big( \hat{\mathbb{E}}_1[(g(X) - Y)^2 - (g'(X) - Y)^2], \ldots, \hat{\mathbb{E}}_K[(g(X) - Y)^2 - (g'(X) - Y)^2] \big).$$

A $\hat g$ winning all its matches verifies, w.p.a.l. $1 - \exp(-c_0\, n \min\{1, r^2\})$:
$$R(\hat g) - R(g^*) \le c\, r.$$

Can be extended to pairwise learning thanks to MoU

37

Page 44:

The MoM Gradient Descent [Lecue et al., 2018]

If $\mathcal{G}$ is parametric, we want to compute the minimizer of:
$$\mathrm{MoM}\big[\ell(g_u, Z)\big] = \mathrm{median}\big( \hat{\mathbb{E}}_1[\ell(g_u, Z)], \ldots, \hat{\mathbb{E}}_K[\ell(g_u, Z)] \big)$$
Idea: find the block with median risk, and use it as a mini-batch (see the sketch below).
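A minimal sketch of this idea for least-squares linear regression (the model and loss are my choice for the example): at each step the data are split into blocks, the block whose empirical risk is the median is selected, and a gradient step is taken on that block only.

```python
import numpy as np

def mom_gradient_descent(X, y, n_blocks=15, lr=0.1, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = rng.permutation(len(y))
        blocks = np.array_split(idx, n_blocks)
        risks = [np.mean((X[b] @ w - y[b]) ** 2) for b in blocks]
        med = blocks[np.argsort(risks)[len(risks) // 2]]   # block achieving the median risk
        grad = 2 * X[med].T @ (X[med] @ w - y[med]) / len(med)
        w -= lr * grad                                     # gradient step on the median block
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1_000)
y[:30] += 100.0                                            # corrupted labels
print(mom_gradient_descent(X, y))                          # close to (1, -2, 0.5)
```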

38

Page 45:

MoU Gradient Descent for metric learning

We want to minimize, for $M \in \mathcal{S}_q^+(\mathbb{R})$:
$$\frac{2}{n(n-1)}\sum_{i < j} \max\Big( 0,\; 1 + y_{ij}\big(d_M^2(x_i, x_j) - 2\big) \Big)$$

[Figure: a 2-D synthetic dataset with three classes plus outliers; train and test objective values over epochs for GD and MoU-GD, on the sane and the contaminated datasets.]

39

Page 46:

Conclusion

40

Page 47:

Conclusion

$$\min_{h \text{ measurable}} \; \mathbb{E}_P\big[\ell(h(X), Y)\big] \;\longrightarrow\; \min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i), y_i\big)$$

• New hypothesis set for representation learning inspired from deep and kernel methods
• Link with Kernel PCA, optimization based on the composite representer theorem
• Allows to autoencode any type of data, empirical success on molecules

• Double representer theorem: coefficients are linear combinations of the outputs
• Allows to cope with many losses (ε-insensitive, Huber) and kernels
• Empirical improvements on surrogate tasks

• Extension of MoM to randomized blocks and/or U-statistics
• Extension of MoM tournaments and MoM-GD to pairwise learning
• Remarkable empirical resistance to the presence of outliers

41


Page 50:

Perspectives

From K2AE to deep IOKR:
• fully supervised scheme
• benefits of a hybrid architecture?
• learning the output embeddings?

$\mathcal{Y}$'s invariance: the good characterization for $K$?
• what if we relax the hypothesis?
• case of integral losses: $\ell(h(x), y) = \int \ell_\theta\big[h(x)(\theta), y(\theta)\big]\, d\theta$

Among the numerous MoM possibilities:
• a partial representer theorem?
• concentration in presence of outliers?

42

Page 51:

Acknowledgements

• PhD supervisors: Florence d'Alché-Buc, Stephan Clémençon
• Co-authors: Alex Lambert, Luc Brogat-Motte, Patrice Bertail
• Thank you: Olivier Fercoq

• Autoencoding any data through kernel autoencoders, with S. Clémençon and F. d'Alché-Buc, AISTATS 2019
• On medians-of-randomized-(pairwise)-means, with S. Clémençon and P. Bertail, ICML 2019
• Duality in RKHSs with infinite dimensional outputs: application to robust losses, with A. Lambert, L. Brogat-Motte and F. d'Alché-Buc, ICML 2020
• On statistical learning from biased training samples, with S. Clémençon, submitted

43

Page 52:

References I

Audiffren, J. and Kadri, H. (2013).Stability of multi-task kernel regression algorithms.In Asian Conference on Machine Learning, pages 1–16.

Bauschke, H. H., Combettes, P. L., et al. (2011).Convex analysis and monotone operator theory in Hilbertspaces, volume 408.Springer.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007).Greedy layer-wise training of deep networks.In Advances in neural information processing systems, pages153–160.

44

Page 53:

References II

Bourlard, H. and Kamp, Y. (1988).Auto-association by multilayer perceptrons and singular valuedecomposition.Biological cybernetics, 59(4):291–294.

Bousquet, O. and Elisseeff, A. (2002).Stability and generalization.Journal of Machine Learning Research, 2(Mar):499–526.

Brouard, C., Szafranski, M., and d’Alche-Buc, F. (2016).Input output kernel regression: Supervised and semi-supervisedstructured output prediction with operator-valued kernels.Journal of Machine Learning Research, 17:176:1–176:48.

45

Page 54:

References III

Carmeli, C., De Vito, E., and Toigo, A. (2006).Vector valued reproducing kernel hilbert spaces of integrablefunctions and mercer theorem.Analysis and Applications, 4(04):377–408.

Catoni, O. (2012).Challenging the empirical mean and empirical variance: adeviation study.In Annales de l’Institut Henri Poincare, Probabilites et Statistiques,volume 48, pages 1148–1185. Institut Henri Poincare.

Devroye, L., Lerasle, M., Lugosi, G., Oliveira, R. I., et al. (2016).Sub-gaussian mean estimators.The Annals of Statistics, 44(6):2695–2725.

46

Page 55:

References IV

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. (2009).Visualizing higher-layer features of a deep network.University of Montreal, 1341(3):1.

Gill, R., Vardi, Y., and Wellner, J. (1988).Large sample theory of empirical distributions in biasedsampling models.The Annals of Statistics, 16(3):1069–1112.

Hinton, G. E. and Salakhutdinov, R. R. (2006).Reducing the dimensionality of data with neural networks.science, 313(5786):504–507.

Huber, P. J. (1964).Robust estimation of a location parameter.The Annals of Mathematical Statistics, pages 73–101.

47

Page 56:

References V

Joly, E. and Lugosi, G. (2016).Robust estimation of u-statistics.Stochastic Processes and their Applications, 126(12):3760–3773.

Kadri, H., Duflos, E., Preux, P., Canu, S., Rakotomamonjy, A., andAudiffren, J. (2016).Operator-valued kernels for learning from functional responsedata.Journal of Machine Learning Research, 17:20:1–20:54.

Kadri, H., Ghavamzadeh, M., and Preux, P. (2013).A generalized kernel approach to structured output learning.In International Conference on Machine Learning (ICML), pages471–479.

48

Page 57:

References VI

Laforgue, P., Clemencon, S., and d’Alche-Buc, F. (2019a).Autoencoding any data through kernel autoencoders.In Artificial Intelligence and Statistics, pages 1061–1069.

Laforgue, P., Clemencon, S., and Bertail, P. (2019b).On medians of (Randomized) pairwise means.In Proceedings of the 36th International Conference on MachineLearning, volume 97 of Proceedings of Machine Learning Research,pages 1272–1281, Long Beach, California, USA. PMLR.

Laforgue, P., Lambert, A., Motte, L., and d’Alche Buc, F. (2020).Duality in rkhss with infinite dimensional outputs: Applicationto robust losses.arXiv preprint arXiv:1910.04621.

49

Page 58:

References VII

Lecue, G., Lerasle, M., and Mathieu, T. (2018).Robust classification via mom minimization.arXiv preprint arXiv:1808.03106.

Lugosi, G. and Mendelson, S. (2016).Risk minimization by median-of-means tournaments.arXiv preprint arXiv:1608.00757.

Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. (2014).Convolutional kernel networks.In Advances in neural information processing systems, pages2627–2635.

50

Page 59:

References VIII

Maurer, A. (2014).A chain rule for the expected suprema of gaussian processes.In Algorithmic Learning Theory: 25th International Conference, ALT2014, Bled, Slovenia, October 8-10, 2014, Proceedings, volume8776, page 245. Springer.

Maurer, A. (2016).A vector-contraction inequality for rademacher complexities.In International Conference on Algorithmic Learning Theory, pages3–17. Springer.

Maurer, A. and Pontil, M. (2016).Bounds for vector-valued function estimation.arXiv preprint arXiv:1606.01487.

51

Page 60:

References IX

Micchelli, C. A. and Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1):177–204.

Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 255:2897–2899.

Sangnier, M., Fercoq, O., and d'Alché-Buc, F. (2017). Data sparse nonparametric regression with ε-insensitive losses. In Asian Conference on Machine Learning, pages 192–207.

Scholkopf, B., Tsuda, K., and Vert, J.-P. (2004).Support vector machine applications in computational biology.MIT press.

52

Page 61:

References X

Vardi, Y. (1985). Empirical distributions in selection bias models. Ann. Statist., 13:178–203.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408.

53

Page 62:

Appendices Kernel Autoencoder

54

Page 63:

Functional spaces similarity

• Neural mapping $f_{\text{NN}}$ parametrized by a matrix $A \in \mathbb{R}^{p \times d}$ with rows $(a_j)_{j \le p}$, and an activation function σ
• Kernel mapping $f_{\text{OVK}}$ from the decomposable OVK $K = k\, I_p$, associated to the (scalar) feature map $\phi_k$

$$f_{\text{NN}}(x) = \big( \sigma(\langle a_1, x\rangle), \ldots, \sigma(\langle a_p, x\rangle) \big)^\top \qquad f_{\text{OVK}}(x) = \big( f^1(x) = \langle f^1, \phi_k(x)\rangle, \ldots, f^p(x) = \langle f^p, \phi_k(x)\rangle \big)^\top$$

They only differ in the order in which linear/nonlinear mappings are applied (and in their nature)

55

Page 64:

Disentangling concentric circles

More complex layers enhance the learned representations

[Figure: four 2-D scatter plots of the representations learned on concentric circles.]

Fig. 6: KAE performance on concentric circles

56

Page 65:

KAE generalization bound

2-layer KAE on data bounded in norm by M, with:

• internal layer of size p
• encoder $f \in \mathcal{H}_1$ such that $\|f\| \le s$
• decoder $g \in \mathcal{H}_2$ such that $\|g\| \le t$, with Lipschitz constant L

Then it holds:
$$\varepsilon(\hat g_n \circ \hat f_n) - \varepsilon^* \le C_0\, L M s t \sqrt{\frac{Kp}{n}} + 24 M^2 \sqrt{\frac{\log(2/\delta)}{2n}},$$
with $\varepsilon(g \circ f) = \mathbb{E}_X \|X - g \circ f(X)\|_{\mathcal{X}_0}^2$

57

Page 66:

Proof

Based on the vector-valued Rademacher average:
$$\mathcal{R}_n(\mathcal{C}(S)) = \mathbb{E}_\sigma\left[ \sup_{h \in \mathcal{C}} \frac{1}{n}\sum_{i=1}^{n} \langle \sigma_i, h(x_i)\rangle_{\mathcal{H}} \right].$$

With $\mathcal{H}_{s,t} \subset \mathcal{F}(\mathcal{X}_0, \mathcal{X}_0)$ the class built from $\mathcal{H}_{1,s}$ and $\mathcal{H}_{2,t}$, and $\ell$ the squared norm on $\mathcal{X}_0$, it holds:
$$\mathcal{R}_n\big( (\ell \circ (\mathrm{id} - \mathcal{H}_{s,t}))(S) \big) \le 2\sqrt{2} M\, \mathcal{R}_n\big( (\mathrm{id} - \mathcal{H}_{s,t})(S) \big) \le 2\sqrt{2} M\, \mathcal{R}_n\big( \mathcal{H}_{s,t}(S) \big) \le 2\sqrt{\pi} M\, \mathcal{G}_n\big( \mathcal{H}_{s,t}(S) \big),$$
$$\mathcal{G}_n\big( \mathcal{H}_{s,t}(S) \big) \le C_1\, L\big(\mathcal{H}_{2,t}, \mathcal{H}_{1,s}(S)\big)\, \mathcal{G}_n\big( \mathcal{H}_{1,s}(S) \big) + \frac{C_2}{n}\, R\big(\mathcal{H}_{2,t}, \mathcal{H}_{1,s}(S)\big)\, D\big( \mathcal{H}_{1,s}(S) \big) + \frac{1}{n}\, \mathcal{G}\big( \mathcal{H}_{2,t}(0) \big),$$
using [Maurer, 2016, Maurer, 2014] in the spirit of [Maurer and Pontil, 2016]

58

Page 67:

Appendices Duality in vv-RKHSs

59

Page 68:

On the invariance assumption

With $\hat{\mathcal{Y}} = \mathrm{Span}\{y_j, j \le n\}$, the assumption reads:
$$\forall (x, x') \in \mathcal{X}^2,\; \forall y \in \mathcal{Y}, \quad y \in \hat{\mathcal{Y}} \implies K(x, x')\,y \in \hat{\mathcal{Y}}$$

• We do not need it to hold for every collection of $\{y_i\}_{i \le n} \in \mathcal{Y}^n$
• Rather an a posteriori condition to ensure that the kernel is aligned
• The little we know about $\mathcal{Y}$ should be preserved through $K$
• If $\mathcal{Y}$ is finite dimensional, and there are sufficiently many outputs, then $\hat{\mathcal{Y}} = \mathcal{Y}$
• Identity-decomposable kernels fit (nontrivial in infinite dimension)
• The empirical covariance kernel $\sum_i y_i \otimes y_i$ [Kadri et al., 2013] fits

60

Page 69:

Admissible kernels

• $K(s, t) = \sum_i k_i(s, t)\; y_i \otimes y_i$, with the $k_i$ positive semi-definite (p.s.d.) scalar kernels for all $i \le n$
• $K(s, t) = \sum_i \mu_i\, k(s, t)\; y_i \otimes y_i$, with $k$ a p.s.d. scalar kernel and $\mu_i \ge 0$ for all $i \le n$
• $K(s, t) = \sum_i k(s, x_i)\, k(t, x_i)\; y_i \otimes y_i$
• $K(s, t) = \sum_{i,j} k_{ij}(s, t)\; (y_i + y_j) \otimes (y_i + y_j)$, with the $k_{ij}$ p.s.d. scalar kernels for all $i, j \le n$
• $K(s, t) = \sum_{i,j} \mu_{ij}\, k(s, t)\; (y_i + y_j) \otimes (y_i + y_j)$, with $k$ a p.s.d. scalar kernel and $\mu_{ij} \ge 0$
• $K(s, t) = \sum_{i,j} k(s, x_i, x_j)\, k(t, x_i, x_j)\; (y_i + y_j) \otimes (y_i + y_j)$

61

Page 70:

Admissible losses

$$\forall i \le n,\; \forall (\alpha^{\hat{\mathcal{Y}}}, \alpha^\perp) \in \hat{\mathcal{Y}} \times \hat{\mathcal{Y}}^\perp, \quad \ell_i^\star(\alpha^{\hat{\mathcal{Y}}}) \le \ell_i^\star(\alpha^{\hat{\mathcal{Y}}} + \alpha^\perp)$$

• $\ell_i(y) = f(\langle y, z_i\rangle)$, with $z_i \in \mathcal{Y}$ and $f : \mathbb{R} \to \mathbb{R}$ convex. Maximum-margin obtained with $z_i = y_i$ and $f(t) = \max(0, 1 - t)$.
• $\ell(y) = f(\|y\|)$, with $f : \mathbb{R}_+ \to \mathbb{R}$ convex increasing s.t. $t \mapsto f'(t)/t$ is continuous over $\mathbb{R}_+$. Includes the functions $\frac{\lambda}{\eta}\|y\|_{\mathcal{Y}}^\eta$ for $\eta > 1$, $\lambda > 0$.
• $\forall \lambda > 0$, with $B_\lambda$ the centered ball of radius $\lambda$: $\ell(y) = \lambda\|y\|$, $\;\ell(y) = \lambda\|y\|\log(\|y\|)$, $\;\ell(y) = \chi_{B_\lambda}(y)$, $\;\ell(y) = \lambda(\exp(\|y\|) - 1)$.
• $\ell_i(y) = f(y - y_i)$, with $f^\star$ verifying the condition.
• Infimal convolutions of functions verifying the condition (ε-insensitive versions [Sangnier et al., 2017], the Huber loss [Huber, 1964], Moreau or Pasch-Hausdorff envelopes [Moreau, 1962, Bauschke et al., 2011]).

62

Page 71:

Proof of the Double Representer Theorem

Dual problem:
$$(\alpha_i)_{i=1}^n \in \operatorname*{argmin}_{(\alpha_i)_{i=1}^n \in \mathcal{Y}^n} \; \sum_{i=1}^{n} \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n}\sum_{i,j=1}^{n} \langle \alpha_i, K(x_i, x_j)\alpha_j\rangle_{\mathcal{Y}}.$$

• Decompose $\alpha_i = \alpha_i^{\hat{\mathcal{Y}}} + \alpha_i^\perp$, with $(\alpha_i^{\hat{\mathcal{Y}}})_{i \le n}, (\alpha_i^\perp)_{i \le n} \in \hat{\mathcal{Y}}^n \times \hat{\mathcal{Y}}^{\perp\, n}$
• Assume that $\ell_i^\star(\alpha^{\hat{\mathcal{Y}}}) \le \ell_i^\star(\alpha^{\hat{\mathcal{Y}}} + \alpha^\perp)$ (satisfied if $\ell$ relies on $\langle\cdot,\cdot\rangle$)

Then it holds:
$$\sum_{i=1}^{n} \ell_i^\star(-\alpha_i^{\hat{\mathcal{Y}}}) + \frac{1}{2\Lambda n}\sum_{i,j=1}^{n} \big\langle \alpha_i^{\hat{\mathcal{Y}}}, K(x_i, x_j)\,\alpha_j^{\hat{\mathcal{Y}}}\big\rangle_{\mathcal{Y}} \;\le\; \sum_{i=1}^{n} \ell_i^\star(-\alpha_i^{\hat{\mathcal{Y}}} - \alpha_i^\perp) + \frac{1}{2\Lambda n}\sum_{i,j=1}^{n} \big\langle \alpha_i^{\hat{\mathcal{Y}}} + \alpha_i^\perp, K(x_i, x_j)\,(\alpha_j^{\hat{\mathcal{Y}}} + \alpha_j^\perp)\big\rangle_{\mathcal{Y}}.$$

63

Page 72:

Approximating the dual problem if no invariance

The kernel $K = k \cdot A$ is a separable OVK, with $A$ a compact operator.
There exists an o.n.b. $(\psi_j)_{j=1}^\infty$ of $\mathcal{Y}$ s.t. $A = \sum_{j=1}^\infty \lambda_j\, \psi_j \otimes \psi_j$ ($\lambda_j \ge 0$).
There exist $(\omega_i)_{i=1}^n \in \ell^2(\mathbb{R})^n$ such that $\forall i \le n$, $\alpha_i = \sum_{j=1}^\infty \omega_{ij}\,\psi_j$.

Denoting by $\mathcal{Y}_m = \mathrm{span}(\{\psi_j\}_{j=1}^m)$ and $S = \mathrm{diag}(\lambda_j)_{j=1}^m$, solve instead:
$$\min_{(\alpha_i)_{i=1}^n \in \mathcal{Y}_m^n} \; \sum_{i=1}^{n} \ell_i^\star(-\alpha_i) + \frac{1}{2\Lambda n}\sum_{i,j=1}^{n} \langle \alpha_i, K(x_i, x_j)\alpha_j\rangle_{\mathcal{Y}}.$$

The final solution is given by: $\displaystyle h = \frac{1}{\Lambda n}\sum_{i=1}^{n}\sum_{j=1}^{m} k(\cdot, x_i)\, \lambda_j\, \omega_{ij}\, \psi_j$, with $\Omega$ solution to:
$$\min_{\Omega \in \mathbb{R}^{n \times m}} \; \sum_{i=1}^{n} L_i(\Omega_{i:}, R_{i:}) + \frac{1}{2\Lambda n}\, \mathrm{Tr}\big( K^X\, \Omega\, S\, \Omega^\top \big).$$

64

Page 73:

Application to robust function-to-function regression

• Predict lip acceleration from EMG signals [Kadri et al., 2016]
• Dataset augmented with outliers, model learned with the Huber loss
• Improvement for every output size m

[Figure: leave-one-out generalization error as a function of κ for m = 4, 5, 6, 7, 15, compared to ridge regression.]

Fig. 7: LOO generalization error w.r.t. κ

65

Page 74:

Application to kernel autoencoding

• Experiments on molecules with the Tanimoto-Gaussian kernel
• Empirical improvements for a wide range of ε
• Introduces sparsity

[Figure: MSE and $\|W\|_{2,1}$ norm of the ε-insensitive KAE vs. the standard KAE as ε varies from 0 to 1, with the number of discarded data points increasing with ε.]

Fig. 8: Performances of the ε-insensitive Kernel Autoencoder

66

Page 75:

Algorithmic stability analysis [Bousquet and Elisseeff, 2002]

Algorithm $A$ has stability $\beta$ if for any sample $S_n$ and any $i \le n$, it holds:
$$\sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} \big| \ell(h_{A(S_n)}(x), y) - \ell(h_{A(S_n^{\setminus i})}(x), y) \big| \le \beta$$

Let $A$ be an algorithm with stability $\beta$ and a loss function bounded by $M$. Then, for any $n \ge 1$ and $\delta \in\, ]0, 1[$, it holds with probability at least $1 - \delta$:
$$R(h_{A(S_n)}) \le R_n(h_{A(S_n)}) + 2\beta + (4n\beta + M)\sqrt{\frac{\ln(1/\delta)}{2n}}.$$

If $\|K(x, x)\|_{\text{op}} \le \gamma^2$ and $|\ell(h_S(x), y) - \ell(h_{S^{\setminus i}}(x), y)| \le C\, \|h_S(x) - h_{S^{\setminus i}}(x)\|_{\mathcal{Y}}$, then the OVK algorithm has stability $\beta \le C^2\gamma^2/(\Lambda n)$ [Audiffren and Kadri, 2013].

Bounds M and constants C for the three losses:

ε-SVR: $\;M = \sqrt{M_Y - \varepsilon}\,\big(\tfrac{\sqrt{2}\gamma}{\sqrt{\Lambda}} + \sqrt{M_Y - \varepsilon}\big)$, $\;C = 1$
ε-Ridge: $\;M = (M_Y - \varepsilon)^2\big(1 + \tfrac{2\sqrt{2}\gamma}{\sqrt{\Lambda}} + \tfrac{2\gamma^2}{\Lambda}\big)$, $\;C = 2(M_Y - \varepsilon)\big(1 + \tfrac{\gamma\sqrt{2}}{\sqrt{\Lambda}}\big)$
κ-Huber: $\;M = \kappa\sqrt{M_Y - \tfrac{\kappa}{2}}\,\big(\tfrac{\gamma\sqrt{2\kappa}}{\sqrt{\Lambda}} + \sqrt{M_Y - \tfrac{\kappa}{2}}\big)$, $\;C = \kappa$

67

Page 76:

Appendices Learning with Sample Bias

68

Page 77:

Empirical Risk Minimization (ERM)

General goal of supervised machine learning: from a r.v. $Z = (X, Y)$ and a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, find:
$$h^* = \operatorname*{argmin}_{h \text{ measurable}} \; R(h) = \mathbb{E}_P\big[\ell(h(X), Y)\big].$$

Empirical Risk Minimization (ERM):

• $P$ is unknown (and the set of measurable functions is too large)
• sample $(X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} P$, hypothesis set $\mathcal{H}$

$$h_n = \operatorname*{argmin}_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h(X_i), Y_i) = \mathbb{E}_{P_n}\big[\ell(h(X), Y)\big],$$
with $P_n = \frac{1}{n}\sum_i \delta_{Z_i}$ and $Z_i = (X_i, Y_i)$. It holds $P_n \underset{n \to +\infty}{\longrightarrow} P$.

69

Page 78:

Importance Sampling (IS)

What if the data is not drawn from P?

Sample $(X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} Q$ such that $\frac{dQ}{dP}(z) = \frac{q(z)}{p(z)}$.

Now $\frac{1}{n}\sum_i \delta_{Z_i} = Q_n \underset{n \to +\infty}{\longrightarrow} Q$.

70

Page 79:

Importance Sampling (IS)

Example: $q(x)/p(x) = 1/2$. Importance Sampling reweights the empirical risk:

$$\min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h(X_i), Y_i) \cdot \frac{p(Z_i)}{q(Z_i)} \;=\; \min_{h \in \mathcal{H}} \; \mathbb{E}_{Q_n}\left[ \ell(h(X), Y) \cdot \frac{p(Z)}{q(Z)} \right] \;\longrightarrow\; \mathbb{E}_{Q}\left[ \ell(h(X), Y) \cdot \frac{p(Z)}{q(Z)} \right] = \mathbb{E}_P\big[\ell(h(X), Y)\big]$$
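A toy sketch of importance-sampling-reweighted least squares with a known, non-constant density ratio p/q (the densities and model below are invented for the illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
# Biased sampling: q has density 2(1 - x) on [0, 1] (oversamples small x),
# while the target distribution p is uniform on [0, 1], so p(z)/q(z) = 1 / (2(1 - x)).
u = rng.uniform(size=n)
x = 1.0 - np.sqrt(1.0 - u)               # inverse-CDF sampling from q
y = x ** 2 + 0.05 * rng.normal(size=n)
w = 1.0 / (2.0 * (1.0 - x))              # importance weights p/q

X = np.column_stack([x, np.ones(n)])     # linear model y ~ a*x + b
plain = np.linalg.lstsq(X, y, rcond=None)[0]
weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print("ERM under q:    ", plain)         # fit driven by the oversampled region
print("IS-weighted ERM:", weighted)      # approximates the least-squares fit under p
```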

70

Page 80:

Importance Sampling (IS)

Example: $q(x)/p(x) = \mathbb{I}\{15 \le x \le 55\}$.

$$\min_{h \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n} \ell(h(X_i), Y_i) \cdot \frac{p(Z_i)}{q(Z_i)} \;=\; \min_{h \in \mathcal{H}} \; \mathbb{E}_{Q_n}\left[ \ell(h(X), Y) \cdot \frac{p(Z)}{q(Z)} \right] \quad \text{not possible!}$$

Here the support of $q$ does not cover the support of $p$, so the identity $\mathbb{E}_{Q}[\ell(h(X), Y) \cdot p(Z)/q(Z)] = \mathbb{E}_P[\ell(h(X), Y)]$ breaks down and the $P$-risk cannot be recovered.

70

Page 81:

Adding samples

$q_1(x)/p(x) = \mathbb{I}\{15 \le x \le 55\}$, $\quad q_2(x)/p(x) = \mathbb{I}\{50 \le x \le 70\}$, $\quad q_3(x)/p(x) = \mathbb{I}\{x \le 20\} + \mathbb{I}\{x \ge 60\}$

We need: $\displaystyle \bigcup_{k=1}^{K} \mathrm{Supp}(q_k) = \mathrm{Supp}(p)$.

Sample-wise IS does not work because of the samples' proportions.

71

Page 84:

Setting and assumptions

• $K$ independent i.i.d. samples $\mathcal{D}_k = \{Z_{k,1}, \ldots, Z_{k,n_k}\}$
• $n = \sum_k n_k$, $\hat\lambda_k = n_k/n$ for $k \le K$
• sample $k$ drawn according to $Q_k$ such that $\frac{dQ_k}{dP}(z) = \frac{\omega_k(z)}{\Omega_k}$
• The $\Omega_k = \mathbb{E}_P[\omega_k(Z)] = \int_{\mathcal{Z}} \omega_k(z)\, P(dz)$ are unknown.
• $\exists\, C, \lambda, \lambda_1, \ldots, \lambda_K > 0$: $|\hat\lambda_k - \lambda_k| \le \frac{C}{\sqrt{n}}$ and $\lambda \le \lambda_k$.
• The graph $G_\kappa$ is connected.
• $\exists\, \xi > 0$, $\forall k \le K$, $\Omega_k \ge \xi$.
• $\exists\, m, M > 0$: $m \le \inf_z \max_{k \le K} \omega_k(z)$ and $\sup_z \max_{k \le K} \omega_k(z) \le M$.

72

Page 85:

Building an unbiased estimate of P (1/2)

Without considering the bias issue:
$$\hat Q_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{Z_i} = \sum_{k=1}^{K} \frac{n_k}{n}\, \frac{1}{n_k}\sum_{i \in \mathcal{D}_k} \delta_{Z_i} \;\longrightarrow\; \sum_{k=1}^{K} \lambda_k Q_k \ne P.$$

But it holds:
$$dQ_k = \frac{\omega_k}{\Omega_k}\, dP, \qquad \sum_k \lambda_k\, dQ_k = \sum_k \frac{\lambda_k \omega_k}{\Omega_k}\, dP, \qquad dP = \left( \sum_k \frac{\lambda_k \omega_k}{\Omega_k} \right)^{-1} \sum_k \lambda_k\, dQ_k \quad (1)$$

We only need to estimate the $\Omega_k$'s.

73

Page 86:

Building an unbiased estimate of P (2/2)

It holds:
$$\Omega_k = \int \omega_k\, dP = \int \omega_k \left( \sum_l \frac{\lambda_l \omega_l}{\Omega_l} \right)^{-1} \sum_l \lambda_l\, dQ_l.$$

$\hat\Omega$ is the solution to the system:
$$\forall k \le K, \quad H_k(\Omega) - 1 = 0, \qquad \text{with } H_k(\Omega) = \frac{1}{\Omega_k}\int \omega_k \left( \sum_l \frac{\hat\lambda_l \omega_l}{\Omega_l} \right)^{-1} \sum_l \hat\lambda_l\, d\hat Q_l.$$

The final estimate is obtained by plugging $\hat\Omega$ into Equation (1).
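A sketch of how this system can be solved by fixed-point iteration on its empirical version (two biased strata of a uniform target with indicator bias functions; the update rule is my reading of the equations above, not code from the thesis).

```python
import numpy as np

rng = np.random.default_rng(0)
# Target P = uniform on [0, 1]; two biased samples with known bias functions omega_k,
# dQ_k/dP = omega_k / Omega_k (indicator biases chosen for the example).
omegas = [lambda z: (z <= 0.6).astype(float), lambda z: (z >= 0.4).astype(float)]
z = np.concatenate([rng.uniform(0.0, 0.6, 600),     # sample 1 ~ Q_1
                    rng.uniform(0.4, 1.0, 400)])    # sample 2 ~ Q_2
lam = np.array([0.6, 0.4])                          # hat-lambda_k = n_k / n
n = len(z)

W = np.stack([w(z) for w in omegas])                # W[k, i] = omega_k(Z_i)
Omega = np.ones(2)                                  # unknown Omega_k = E_P[omega_k]
for _ in range(100):
    denom = (lam[:, None] * W / Omega[:, None]).sum(axis=0)  # sum_l lam_l w_l(Z_i) / Omega_l
    Omega = (W / denom).mean(axis=1)                         # fixed-point update of Omega_k
denom = (lam[:, None] * W / Omega[:, None]).sum(axis=0)
Omega /= (1.0 / denom).mean()                       # give the plug-in measure total mass one

pi = 1.0 / (n * (lam[:, None] * W / Omega[:, None]).sum(axis=0))  # debiased weights
print(Omega)              # approx. (0.6, 0.6) = (P(Z <= 0.6), P(Z >= 0.4))
print((pi * z).sum())     # reweighted mean, approx. E_P[Z] = 0.5
```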

74

Page 87:

Non-asymptotic guarantees

The debiasing procedure is due to [Vardi, 1985, Gill et al., 1988], but with only asymptotic results.

With $\hat P_n = \big( \sum_k \hat\lambda_k \frac{\omega_k}{\hat\Omega_k} \big)^{-1} \sum_k \hat\lambda_k\, d\hat Q_k$, there exist $(\hat\pi_i)_{i \le n}$ such that:
$$\mathbb{E}_{\hat P_n}\big[\ell(h(X), Y)\big] = \sum_{i=1}^{n} \hat\pi_i \cdot \ell(h(X_i), Y_i), \qquad (2)$$
and $\hat h_n$ minimizer of Equation (2) satisfies with probability $1 - \delta$:
$$R(\hat h_n) - R(h^*) \le C_1\sqrt{\frac{K^3}{n}} + C_2\sqrt{\frac{K \log n}{n}} + C_3\sqrt{\frac{K \log 1/\delta}{n}}.$$

75

Page 88:

Experiments on the Adult dataset

[Figure: distribution of years of education in the population with the proportion of incomes above 50k$ per education level; proportion of incomes above 50k$ as a function of age for the high, medium and low education groups.]

Dataset of size 6,000: 98% from 13+ years of education, 2% unbiased. Scores:

                   LogReg          RF
ERM                63.95 ± 1.37    42.73 ± 3.36
debiased ERM       79.77 ± 1.72    43.58 ± 4.77
unbiased sample    77.75 ± 2.27    22.16 ± 6.18

76