
Test of Time Award

Online Dictionary Learning for Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro

International Conference on Machine Learning, 2019

Julien Mairal Online Dictionary Learning for Sparse Coding 1/15


Test of Time Award

Online Learning for Matrix Factorization and Sparse Coding

Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro

International Conference on Machine Learning, 2019

[Photos of the co-authors: Francis Bach, Jean Ponce, Guillermo Sapiro]

What are these papers about?

They are dealing with matrix factorization,

    X ≈ D × A,  with X in R^(m×n), D in R^(m×p), A in R^(p×n),

when one factor is sparse, or the other one, or both; or when one factor is not only sparse but also admits a particular structure; or when one factor admits a particular structure (e.g., piecewise constant) but is not sparse.

What are these papers about?

In these papers, data matrices have many columns,

    X ≈ D × A,  with X in R^(m×n), D in R^(m×p), A in R^(p×n), n → +∞,

or an infinite number of columns, or columns that are streamed online.

Formulation(s)

X = [x_1, x_2, ..., x_n] is a data matrix.
We may call D = [d_1, ..., d_p] a dictionary.
A = [α_1, ..., α_n] carries the decomposition coefficients of X onto D.

Interpretation as signal/data decomposition:

    X ≈ DA  ⟺  for all i, x_i ≈ Dα_i = ∑_{j=1}^p α_i[j] d_j.

Generic formulation:

    min_{D ∈ 𝒟} (1/n) ∑_{i=1}^n L(x_i, D)  with  L(x, D) ≜ min_{α ∈ 𝒜} (1/2) ‖x − Dα‖² + λ ψ(α).

Generic formulation, stochastic case:

    min_{D ∈ 𝒟} E_x[L(x, D)]  with the same loss L.
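For the ℓ1 penalty ψ(α) = ‖α‖₁, the inner problem defining L(x, D) is a Lasso, which can be solved, for instance, by proximal gradient descent (ISTA). A minimal NumPy sketch, where the step size 1/L and the iteration count are illustrative choices, not prescribed by the papers:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam, n_iter=200):
    """Approximately solve min_alpha 0.5*||x - D alpha||^2 + lam*||alpha||_1 via ISTA."""
    L = np.linalg.norm(D.T @ D, 2)          # Lipschitz constant of the smooth part's gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - x)        # gradient of 0.5*||x - D alpha||^2
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)              # unit-norm atoms, as in the constraint set
x = D[:, :3] @ np.array([1.0, -2.0, 0.5])   # a signal with a 3-atom decomposition
alpha = sparse_code(x, D, lam=0.1)
print(np.sum(np.abs(alpha) > 1e-6), "nonzero coefficients")
```

The returned code is sparse: most coefficients are driven exactly to zero by the soft-thresholding step.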

Formulation(s)

    min_{D ∈ 𝒟} E_x[L(x, D)]  with  L(x, D) ≜ min_{α ∈ 𝒜} (1/2) ‖x − Dα‖² + λ ψ(α).

Which formulations does it cover?

                                       𝒟                                  𝒜       ψ
  non-negative matrix factorization    R_+^{m×p}                          R_+^p   0
  sparse coding                        {D : ∀j, ‖d_j‖ ≤ 1}                R^p     ‖·‖₁
  non-negative sparse coding           {D : ∀j, ‖d_j‖ ≤ 1}                R_+^p   ‖·‖₁
  structured sparse coding             {D : ∀j, ‖d_j‖ ≤ 1}                R^p     ‖·‖₁ + Ω(·)
  ≈ sparse PCA                         {D : ∀j, ‖d_j‖₂² + ‖d_j‖₁ ≤ 1}     R^p     ‖·‖₁
  ...                                  ...                                ...     ...

[Paatero and Tapper, '94], [Olshausen and Field, '96], [Hoyer, 2002], [Mairal et al., 2011], [Zou et al., 2004].
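In practice, the rows of the table above differ mainly in which operator is applied in the sparse coding step: a proximal operator for ψ, composed with a projection onto 𝒜. A hedged NumPy sketch of three such operators (the function names and the mapping to rows are illustrative, not from the papers):

```python
import numpy as np

def prox_l1(v, t):
    """psi = ||.||_1, A = R^p (sparse coding): soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_nonneg(v, t):
    """psi = 0, A = R^p_+ (NMF-style codes): projection onto the nonnegative orthant."""
    return np.maximum(v, 0.0)

def prox_nonneg_l1(v, t):
    """psi = ||.||_1, A = R^p_+ (non-negative sparse coding): shrink, then clip."""
    return np.maximum(v - t, 0.0)

v = np.array([1.5, -0.3, 0.05, -2.0])
print(prox_l1(v, 0.1))         # shrinks toward zero, keeps signs
print(prox_nonneg_l1(v, 0.1))  # shrinks and zeroes out negatives
```

Swapping one operator for another inside the same optimization loop is what makes the generic formulation cover all the rows at once.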

The sparse coding context

Sparse coding was introduced by Olshausen and Field, '96. It was the first time (together with ICA, see [Bell and Sejnowski, '97]) that a simple unsupervised learning principle would lead to various sorts of "Gabor-like" filters when trained on natural image patches.

The sparse coding context

Remember that we can play with various structured sparsity-inducing penalties:

[Jenatton et al., 2010], [Kavukcuoglu et al., 2009], [Mairal et al., 2011], [Hyvarinen and Hoyer, 2001].

Sparsity and simplicity principles

[Timeline: 1921 to 2010]

1921: Wrinch and Jeffreys' simplicity principle.
1952: Markowitz's portfolio selection.
1960s and 70s: best subset selection in statistics.
1990s: the wavelet era in signal processing.
1994–1996: the Lasso (Tibshirani) and Basis pursuit (Chen and Donoho).
1996: Olshausen and Field's dictionary learning method.
2004: compressed sensing (Candes, Romberg and Tao).
2006: Elad and Aharon's image denoising method.

Context of 2009

Many success stories of dictionary learning in image processing:
  image denoising, inpainting, demosaicing, super-resolution, ...
  [Elad and Aharon, 2006], [Mairal et al., 2008], [Yang et al., 2008], ...

Also success stories in computer vision for modeling local features:
  dictionary learning on top of SIFT wins the PASCAL VOC'09 challenge.
  another variant wins the ImageNet 2010 challenge.
  [Yang et al., 2009], [Lin et al., 2010], ...

Matrix factorization becomes a key technique for unsupervised data modeling:
  recommender systems (the Netflix prize) and social networks.
  document clustering.
  genomic pattern discovery.
  [Koren et al., 2009b], [Ma et al., 2008], [Xu et al., 2003], [Brunet et al., 2004], ...

Context of 2009

Classical approach for matrix factorization: alternating minimization,

    min_{D ∈ 𝒟, A ∈ 𝒜} (1/2) ‖X − DA‖_F² + λ ψ(A),

which requires loading all the data at every iteration (batch optimization).

Meanwhile, Leon Bottou was advocating stochastic optimization for machine learning, which makes the risk minimization point of view relevant:

    min_{D ∈ 𝒟} (1/n) ∑_{i=1}^n L(x_i, D)  with  L(x, D) ≜ min_{α ∈ 𝒜} (1/2) ‖x − Dα‖² + λ ψ(α).

See Leon's tutorial at NIPS'07, or the NeurIPS'18 test of time award [Bottou and Bousquet, 2008].
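The batch alternating scheme above can be sketched in a few lines of NumPy, here with ψ = ‖·‖₁ handled by ISTA steps on A and a projected gradient step on D; the initialization, step sizes, and iteration counts are illustrative choices, not from the papers:

```python
import numpy as np

def soft_threshold(V, t):
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def alternating_minimization(X, p, lam, n_outer=20, n_inner=50):
    """min_{D,A} 0.5*||X - D A||_F^2 + lam*||A||_1, unit-norm columns of D."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((p, n))
    for _ in range(n_outer):
        # A-step: ISTA on all codes at once; note the whole of X is in memory (batch).
        L = np.linalg.norm(D.T @ D, 2)
        for _ in range(n_inner):
            A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
        # D-step: one gradient step, then project each column onto the unit ball.
        LA = np.linalg.norm(A @ A.T, 2) + 1e-12
        D -= (D @ A - X) @ A.T / LA
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D, A

X = np.random.default_rng(1).standard_normal((10, 100))
D, A = alternating_minimization(X, p=15, lam=0.2)
print("residual:", np.linalg.norm(X - D @ A))
```

Every outer iteration touches the full matrix X, which is exactly the scalability bottleneck the online approach removes.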

What we did

We started experimenting with SGD, and tuning the step size turned out to be painful. Can we then design an algorithm that would be as fast as SGD, but more practical?

Idea 1: If we knew optimal codes α*_i for all the x_i in advance, the problem would become

    min_{D ∈ 𝒟} (1/2) tr(DᵀDB) − tr(DᵀC)  with  B = (1/n) ∑_{i=1}^n α*_i α*_iᵀ  and  C = (1/n) ∑_{i=1}^n x_i α*_iᵀ,

which yields parameter-free block coordinate descent rules for updating D.

Idea 2: Build appropriate matrices B and C in an online fashion.

What about theory?

We could provide guarantees of convergence to stationary points, even though the problem is non-convex, stochastic, constrained, and non-smooth.

[Neal and Hinton, '98], [Mairal, 2013], [Mensch, 2018].
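A minimal NumPy sketch of the two ideas combined, assuming an ℓ1 penalty: accumulate B and C as samples arrive, then update the columns of D by block coordinate descent with no step size to tune. Refinements of the original papers (mini-batches, rescaling of past information, warm restarts) are omitted here:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam, n_iter=100):
    """Lasso step via ISTA: min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D.T @ D, 2)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - D.T @ (D @ a - x) / L, lam / L)
    return a

def online_dictionary_learning(stream, m, p, lam):
    """Idea 1 + Idea 2: online accumulation of B and C, then a
    parameter-free block coordinate descent pass over the columns of D."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    B = np.zeros((p, p))   # running sum of alpha alpha^T
    C = np.zeros((m, p))   # running sum of x alpha^T
    for x in stream:
        a = sparse_code(x, D, lam)
        B += np.outer(a, a)
        C += np.outer(x, a)
        for j in range(p):                  # block coordinate descent on column j
            if B[j, j] > 1e-10:
                u = D[:, j] + (C[:, j] - D @ B[:, j]) / B[j, j]
                D[:, j] = u / max(1.0, np.linalg.norm(u))  # project onto the unit ball
    return D

rng = np.random.default_rng(1)
stream = (rng.standard_normal(10) for _ in range(200))
D = online_dictionary_learning(stream, m=10, p=15, lam=0.1)
print(D.shape)
```

Each iteration costs only one sparse coding step plus rank-one updates of B and C; the data matrix itself is never stored, which is what makes the method online.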

Page 43: Test of Time Award *0.2cm Online Dictionary Learning for Sparse …no-parent)-12-10-00-5269-online_dictiona.pdf · Test of Time Award Online Learning for Matrix Factorization and

Reasons for impact: How did it help other fields?

A timely context ( ≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for morescalable matrix factorization methods.

A combination of mathematics and engineering?

an efficient software package: the SPAMS toolbox.

robustness to hyper-parameters: default setting that works (many times) in practice.

(try it with pip install spams in Python, or download R/Matlab packages).

Flexibility in the constraints/penalty design

allowing the method to be used in unexpected contexts.

Julien Mairal Online Dictionary Learning for Sparse Coding 13/15

Page 44: Test of Time Award *0.2cm Online Dictionary Learning for Sparse …no-parent)-12-10-00-5269-online_dictiona.pdf · Test of Time Award Online Learning for Matrix Factorization and

Reasons for impact: How did it help other fields?

A timely context ( ≈ luck)

Datasets were becoming larger and larger, and there was suddenly a need for morescalable matrix factorization methods.

A combination of mathematics and engineering?

an efficient software package: the SPAMS toolbox.

robustness to hyper-parameters: default setting that works (many times) in practice.

(try it with pip install spams in Python, or download R/Matlab packages).

Flexibility in the constraints/penalty design

allowing the method to be used in unexpected contexts.

Julien Mairal Online Dictionary Learning for Sparse Coding 13/15

Page 51:

Connection with neural networks

A cheap way to obtain a sparse code β from x and D is

β = relu(D⊤x − λ),

versus

α ∈ argmin_{α ∈ A} (1/2)‖x − Dα‖² + λ‖α‖₁.

Then, not surprisingly, for dictionary learning,

end-to-end feature learning is feasible.

one can design convolutional and multilayer models.

sparse decomposition algorithms perform neural network-like operations (LISTA).

[Mairal et al., 2012], [Zeiler and Fergus, 2010], [Gregor and LeCun, 2010].

Julien Mairal Online Dictionary Learning for Sparse Coding 14/15
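The two codes above can be compared numerically. A minimal numpy sketch, assuming a random unit-norm dictionary and plain ISTA as the lasso solver (illustrative choices, not the exact setup of any cited paper):

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of the l1 norm
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

rng = np.random.default_rng(0)
m, p, lam = 20, 30, 0.3
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0)   # unit-norm atoms
x = rng.standard_normal(m)

# cheap feed-forward code: one matrix product and a relu
beta = np.maximum(D.T @ x - lam, 0.0)

# lasso code via ISTA: argmin_a 0.5*||x - D a||^2 + lam*||a||_1
L = np.linalg.norm(D, 2) ** 2    # Lipschitz constant of the smooth part
alpha = np.zeros(p)
for _ in range(500):
    alpha = soft_threshold(alpha - D.T @ (D @ alpha - x) / L, lam / L)

print("nonzeros in beta / alpha:",
      int((beta > 0).sum()), int((np.abs(alpha) > 1e-8).sum()))
```

Both codes are sparse, but β costs one matrix-vector product while α requires an iterative solver; this gap is exactly what LISTA-style unrolled networks try to close.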

Page 56:

Thoughts

Is Wrinch and Jeffreys’s simplicity principle still relevant?

Simplicity is a key to interpretability and to model/hypothesis selection.

The next form will probably be different from ℓ1. Which one?

Simplicity is not enough: various forms of robustness and stability are also needed.

[Yu and Kumbier, 2019].

Julien Mairal Online Dictionary Learning for Sparse Coding 15/15