

Understanding Trainable Sparse Coding with Matrix Factorization

Thomas Moreau CMLA - ENS Paris-Saclay

J. Bruna, J. Audiffren


Plan

1 Physiological signals

2 End-to-end Approaches for Time Series

3 Post-training for Deep Learning

4 Adaptive Iterative Soft Thresholding

5 Numerical Experiments


Physiological signals

ECG


Physiological signals

EEG


Physiological signals

Oculometric signals

Figure: eye positions over time [s].


Physiological signals

Accelerometers

Figure: acceleration / angular velocity over time [s].


Statistical analysis for Time Series

• Failure of vectorial distances
• Alignment issues, different lengths (can be solved with DTW)
• "Curse of dimensionality"
• Different approaches, which can be classified in 2 categories:
  • Model-based methods: feature extraction + vectorial method, ...
  • Data-driven methods: end-to-end models, neural networks, ...


End-to-end Approaches

Neural Networks:

• Raw signal as input → No feature-engineering
• Internally select the data representation → Adaptive
• Representation adapted to the task → Performant
• Simple training algorithms → Scalable


Theoretical guarantees

The risk error splits into 3 terms [Bottou and Bousquet, 2008]:

• Approximation error: universal approximation [Hornik, 1991],
• Estimation error: generalization bound [Kawaguchi et al., 2017],
• Optimization error: learning convexification [Haeffele and Vidal, 2017].


Neural Networks

Main drawback:

Lack of interpretability. It is often seen as a black box.

How can we bring interpretability to the internal representation?


End-to-end Approaches

Task-driven Dictionary Learning [Mairal et al., 2012]:

• Raw signal as input → No feature-engineering
• Representation adapted to the task → Performant
• Complex training algorithms → Scalable
• Highlight local structures → Interpretable


Problem statement

Can we study the links between these two models to bring more interpretability to neural networks?


Post-training for Deep Learning

Paper with J. Audiffren: arXiv:1611.04499

Use the idea of splitting representation learning from task resolution:

• Post-training step: only train the last layer,
• Easy problem: this problem is often convex,
• Link with kernels: closed-form solution for the optimal last layer (see the sketch below),
• Experiments: consistent performance boost with multiple architectures.
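To make the post-training step concrete, here is a minimal sketch (not the paper's implementation): the network body is frozen, its outputs are treated as fixed features, and the last linear layer is obtained in closed form as a ridge regression for the squared loss. The helper name `post_train_last_layer` and the regularization value are illustrative assumptions.

```python
import numpy as np

def post_train_last_layer(features, targets, reg=1e-3):
    """Hypothetical post-training step: `features` holds the outputs of the
    frozen network body (one row per sample); the optimal last linear layer
    for the squared loss with l2 regularization has a closed form."""
    n_feat = features.shape[1]
    # Solve  min_W ||features @ W - targets||^2 + reg * ||W||^2  in closed form.
    gram = features.T @ features + reg * np.eye(n_feat)
    return np.linalg.solve(gram, features.T @ targets)

# Toy usage with random stand-ins for the frozen network's features.
rng = np.random.default_rng(0)
features = rng.standard_normal((256, 32))
targets = rng.standard_normal((256, 10))
W_last = post_train_last_layer(features, targets)
```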


LASSO [Tibshirani, 1996]

The LASSO or sparse coding problem is defined as

$$\arg\min_z F(z) := \underbrace{\|x - Dz\|_2^2}_{E(z)} + \lambda \|z\|_1 , \qquad (1)$$

where $x \in \mathbb{R}^P$, $D \in \mathbb{R}^{P \times K}$ and $z \in \mathbb{R}^K$.

(1) can be rewritten as a proximal problem for the $B$-norm:

$$\arg\min_z \underbrace{(y - z)^T B (y - z)}_{E(z)} + \lambda \|z\|_1 \quad \big(= F(z)\big)$$

where $B = D^T D$ is the Gram matrix of $D$ and $y = D^\dagger x$.
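A minimal numerical check of this rewriting (illustrative, not from the deck): for random data, the two objectives agree up to a constant that does not depend on $z$.

```python
import numpy as np

rng = np.random.default_rng(0)
P, K, lmbd = 64, 100, 0.01
D = rng.standard_normal((P, K))
x = rng.standard_normal(P)

B = D.T @ D                      # Gram matrix of the dictionary
y = np.linalg.pinv(D) @ x        # y = D^+ x

def F_lasso(z):
    return np.sum((x - D @ z) ** 2) + lmbd * np.abs(z).sum()

def F_bnorm(z):
    return (y - z) @ B @ (y - z) + lmbd * np.abs(z).sum()

# The difference F_lasso(z) - F_bnorm(z) is the same for every z:
# it equals ||x||^2 - y^T B y, the energy of x outside the span of D.
z1, z2 = rng.standard_normal(K), rng.standard_normal(K)
assert np.isclose(F_lasso(z1) - F_bnorm(z1), F_lasso(z2) - F_bnorm(z2))
```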


Learned ISTA [Gregor and Lecun, 2010]

Accelerate the LASSO resolution using a neural network.

Figure: the recurrent network computing ISTA (weights $W_e$, $W_g$) and its unfolded version with per-layer weights $W_e^{(q)}$, $W_g^{(q)}$. Adapted from [Gregor and Lecun, 2010].

It links the dictionary learning model and sparse representations.

Why does it work?


ISTA (1/2) [Daubechies et al., 2004, Beck and Teboulle, 2009]

Surrogate function $F_q$ associated with the point $z^{(q)}$:

$$F_q(z) = E(z^{(q)}) + \langle B(z^{(q)} - y),\, z - z^{(q)} \rangle + \frac{\|B\|_2}{2} \|z - z^{(q)}\|_2^2 + \lambda \|z\|_1 ,$$

Properties

This surrogate function satisfies

1. $F_q(z^{(q)}) = F(z^{(q)})$,
2. for all $z$, $F_q(z) \ge F(z)$,
3. solving $\arg\min_z F_q(z)$ is computationally efficient.


ISTA (2/2)

Iterative procedure: proximal splitting

$$z^{(q+1)} = \arg\min_z F_q(z) = \mathrm{prox}_{\lambda \|\cdot\|_1}\!\left(z^{(q)} - \frac{1}{L} \nabla E(z^{(q)})\right) \qquad (2)$$

Properties

1. $z^*$ is a fixed point of (2),
2. efficient computation of $z^{(q+1)}$, as the problem is separable,
3. convergence in $O\!\big(\tfrac{1}{q}\big)$ in general.

Figure: the cost $F$ and its surrogate $F_q$, with the update from $z^{(q)}$ to $z^{(q+1)}$.
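A minimal numpy sketch of the resulting ISTA iterations for (1) (an illustration, not the authors' code; here $E(z) = \|x - Dz\|_2^2$, so the gradient carries a factor 2 and the Lipschitz constant is $2\|B\|_2$):

```python
import numpy as np

def soft_thresholding(u, thresh):
    """prox of thresh * ||.||_1, applied entrywise."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def ista(x, D, lmbd, n_iter=100):
    """Minimal ISTA sketch for  min_z ||x - D z||_2^2 + lmbd * ||z||_1 ."""
    B = D.T @ D
    L = 2 * np.linalg.norm(B, 2)          # Lipschitz constant of grad E
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2 * (B @ z - D.T @ x)      # grad of E(z) = ||x - D z||_2^2
        z = soft_thresholding(z - grad / L, lmbd / L)
    return z
```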


Why does it work?

• Guaranteed descent:

The construction of the next point guarantees that the cost function is decreasing:

$$F(z^{(q+1)}) \le F_q(z^{(q+1)}) \le F_q(z^{(q)}) = F(z^{(q)})$$

• Efficient computation:

With the isotropic quadratic form $\frac{L}{2} I_K$, the function $F_q$ is separable. The computations are linear in $K$.


Toward an adaptive procedure

We define $Q_S(u, v) = \frac{1}{2}(u - v)^T S (u - v) + \lambda \|u\|_1$.

ISTA:

$$F_q(z) = E(z^{(q)}) + \langle B(z^{(q)} - y),\, z - z^{(q)} \rangle + Q_{L I_K}(z, z^{(q)}) \quad\rightarrow\quad \min_z Q_{L I_K}\!\Big(z,\; z^{(q)} - \tfrac{1}{L} B(z^{(q)} - y)\Big)$$

⇒ Replace $B$ with the upper bound $L I_K$.

FacNet: for any diagonal matrix $S$ and unitary matrix $A$, we define:

$$\tilde F_q(z) = E(z^{(q)}) + \langle B(z^{(q)} - y),\, z - z^{(q)} \rangle + Q_S(Az, Az^{(q)}) \quad\rightarrow\quad \min_z Q_S\!\Big(Az,\; Az^{(q)} - S^{-1} A B(z^{(q)} - y)\Big)$$

⇒ Replace $B$ with an approximation $A^T S A$.

Can we choose $A, S$ to accelerate the optimization compared to ISTA?
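A sketch of one such adaptive step (hypothetical helper, written directly from the update above): the quadratic part is separable in the rotated variable $u = Az$, so the prox is an entrywise soft-thresholding with per-coordinate thresholds $\lambda / s_k$. With $A = I_K$ and $s = \|B\|_2 \cdot \mathbf{1}$ it reduces to the ISTA step.

```python
import numpy as np

def adaptive_step(z, y, B, A, s, lmbd):
    """One FacNet-style step (sketch): approximate B by A^T diag(s) A with
    A unitary and s > 0, then minimize Q_S(Az, v), which is separable in
    the rotated variable u = A z."""
    v = A @ z - (A @ (B @ (z - y))) / s                      # v = A z - S^{-1} A B (z - y)
    u = np.sign(v) * np.maximum(np.abs(v) - lmbd / s, 0.0)   # entrywise prox
    return A.T @ u                                           # rotate back to the z-domain
```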


Toward an adaptive procedure

Similar iterative procedure with steps adapted to the problem topology.

$$\tilde F_q(z) = F(z) + (z - z^{(q)})^T R (z - z^{(q)}) + \delta_A(z)$$

Tradeoff between:

• Rotation to align the $\|\cdot\|_B$-norm and the $\ell_1$-norm (computation): $R = A^T S A - B$
• Deformation of the $\ell_1$-norm by the rotation $A$ (accuracy): $\delta_A(z) = \lambda\left(\|Az\|_1 - \|z\|_1\right)$
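Both quantities in this tradeoff are cheap to evaluate; a small illustrative helper (names are hypothetical):

```python
import numpy as np

def tradeoff_terms(A, s, B, z, lmbd):
    """R measures how well A^T diag(s) A approximates the Gram matrix B;
    delta_A measures how much the rotation A deforms the l1-norm at z."""
    R = A.T @ np.diag(s) @ A - B
    delta_A = lmbd * (np.abs(A @ z).sum() - np.abs(z).sum())
    return R, delta_A
```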


One step improvement

Proposition

Suppose that $R = A^T S A - B \succ 0$ is positive definite, and define

$$z^{(q+1)} = \arg\min_z \tilde F_q(z) .$$

Then

$$F(z^{(q+1)}) - F(z^*) \le \frac{1}{2}(z^{(q)} - z^*)^T R (z^{(q)} - z^*) + \delta_A(z^*) - \delta_A(z^{(q+1)}) .$$

We are interested in factorizations $(A, S)$ for which $\|R\|_2$ and $\delta_A$ are small.


Adaptive Iterative Soft thresholding - Convergence rate

Theorem

Let $A_q, S_q$ be the pair of unitary and diagonal matrices corresponding to iteration $q$, chosen such that $R_q = A_q^T S_q A_q - B \succeq 0$. It results that

$$F(z^{(q)}) - F(z^*) \le \frac{(z^* - z^{(0)})^T R_0 (z^* - z^{(0)}) + 2 L_{A_0}(z^{(1)}) \|z^* - z^{(1)}\|_2}{2q} + \frac{\alpha_q - \beta_q}{2q},$$

$$\alpha_q = \sum_{i=1}^{q-1} \left( 2 L_{A_i}(z^{(i+1)}) \|z^* - z^{(i+1)}\|_2 + (z^* - z^{(i)})^T (R_{i-1} - R_i)(z^* - z^{(i)}) \right),$$

$$\beta_q = \sum_{i=0}^{q-1} (i + 1)\left( (z^{(i+1)} - z^{(i)})^T R_i (z^{(i+1)} - z^{(i)}) + 2\delta_{A_i}(z^{(i+1)}) - 2\delta_{A_i}(z^{(i)}) \right),$$

where $L_A(z)$ denotes the local Lipschitz constant of $\delta_A$ at $z$.


Interpretation

• For $A_q = I_K$ and $S_q = \|B\|_2 I_K$, the procedure is equivalent to ISTA, with the same rate of convergence.

• If $\|R_0\|_2 + \frac{2 L_{A_0}(z^{(1)})}{\|z^* - z^{(0)}\|_2} \le \frac{\|B\|_2}{2}$, and $A_q = I_K$, $S_q = \|B\|_2 I_K$ for $q > 0$, then the procedure gets a head start compared to ISTA.

• Phase transition: the upper bound is improved when $\|R_q\|_2 + \frac{2 L_{A_q}(z^{(q+1)})}{\|z^* - z^{(q)}\|_2} \le \frac{\|B\|_2}{2}$; it is thus harder to gain as $\|z^{(q)} - z^*\|_2 \to 0$.


Generic Dictionaries

A dictionary $D \in \mathbb{R}^{p \times K}$ is a generic dictionary when its columns $D_i$ are drawn uniformly over the $\ell_2$ unit sphere $S^{p-1}$.

Theorem (Acceleration conditions)

In expectation over the generic dictionary $D$, the factorization algorithm using a diagonally dominant matrix $A \in \mathcal{E}_\delta$ has better performance at iteration $q+1$ than the normal ISTA iteration (which uses the identity) when

$$\lambda\, \mathbb{E}_z\!\left[\|z^{(q+1)}\|_1 + \|z^*\|_1\right] \le \frac{\sqrt{K(K-1)}}{p}\, \underbrace{\mathbb{E}_z\!\left[\|z^{(q)} - z^*\|_2^2\right]}_{\text{expected resolution at iteration } q}$$
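A generic dictionary is easy to sample, since a normalized standard Gaussian vector is uniform on the unit sphere (a minimal sketch; the helper name is illustrative):

```python
import numpy as np

def generic_dictionary(p, K, seed=None):
    """Columns drawn uniformly over the l2 unit sphere S^{p-1}."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, K))
    return D / np.linalg.norm(D, axis=0, keepdims=True)
```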


Generic Dictionaries

Corollary (Acceleration conditions)

If the input distribution and the regularization parameter $\lambda$ verify

$$\frac{\lambda \sqrt{p}}{8} \le \mathbb{E}_z\!\left[\|z^*\|_1\right],$$

then for any resolution $\mathbb{E}_z\!\left[\|z^{(q)} - z^*\|_2\right] = \varepsilon > 0$ at iteration $q$, the performance of our factorization algorithm is better than the performance of ISTA, in expectation over the generic dictionaries.

FacNet can improve performance compared to ISTA when this condition is verified.


Learned ISTA [Gregor and Lecun, 2010]

Figure: Network architecture for ISTA/LISTA. LISTA is the unfolded version of the RNN of ISTA, trainable with back-propagation.

If $W_e = \frac{D^T}{L}$ and $W_g = I - \frac{B}{L}$, this network is exactly 2 iterations of ISTA.
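A minimal forward pass of such a network (a sketch with weights shared across layers, not the trained model); the function name and the shared-weight simplification are assumptions:

```python
import numpy as np

def soft_thresholding(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def lista_forward(x, W_e, W_g, theta, n_layers=2):
    """Sketch of a LISTA forward pass: z <- h_theta(W_e x + W_g z), starting
    from z = h_theta(W_e x). W_e, W_g and theta are the trainable parameters
    (shared across layers in this simple variant)."""
    z = soft_thresholding(W_e @ x, theta)
    for _ in range(n_layers - 1):
        z = soft_thresholding(W_e @ x + W_g @ z, theta)
    return z

# Per the slide, W_e = D.T / L and W_g = np.eye(K) - B / L (with a threshold
# proportional to lmbd / L) make each layer reproduce one ISTA iteration.
```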


FacNet

Specialization of LISTA

$$z^{(q+1)} = A^T \mathrm{prox}_S\!\left(A z^{(q)} - S^{-1} A B (z^{(q)} - y)\right),$$

with $A$ unitary and $S$ diagonal. Same architecture with more constraints on the parameter space:

$$W_e = S^{-1} A D^T, \qquad W_g = A^T - S^{-1} A B A^T$$

⇒ LISTA can be at least as good as this model.


Learned FISTA

The same ideas can also be applied to FISTA to obtain similar procedures:

Figure: Network architecture for L-FISTA.


Artificial simulation

Generating Model:

• $D = \left(\frac{d_1}{\|d_1\|_2}, \ldots, \frac{d_K}{\|d_K\|_2}\right)$ with $d_k \sim \mathcal{N}(0, I_P)$ for all $k \in [\![1, K]\!]$,

• $z = (z_1, \ldots, z_K)$ is constructed following a Bernoulli-Gaussian model: $z_k = b_k a_k$, with $b_k \sim \mathcal{B}(\rho)$ and $a \sim \mathcal{N}(0, \sigma I_K)$,

with $K = 100$, $P = 64$ for the dimensions, $\sigma = 10$ and $\lambda = 0.01$.

⇒ The sparsity patterns are uniformly distributed.
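A sketch of this generating model with the parameters above. Two points are assumptions where the slide is not explicit: $\sigma$ is taken as the scale of the Gaussian amplitudes, and the signals are formed as $x = Dz$ without additional observation noise.

```python
import numpy as np

rng = np.random.default_rng(42)
K, P, sigma, lmbd, rho = 100, 64, 10.0, 0.01, 1 / 20

d = rng.standard_normal((P, K))
D = d / np.linalg.norm(d, axis=0)     # columns d_k / ||d_k||_2

def sample_codes(n_samples):
    """Bernoulli-Gaussian codes: z_k = b_k * a_k with b_k ~ B(rho)."""
    b = rng.random((n_samples, K)) < rho
    a = sigma * rng.standard_normal((n_samples, K))
    return b * a

Z = sample_codes(1000)
X = Z @ D.T                           # observed signals x = D z
```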


Artificial simulation

Figure (ISTA, Linear, FISTA, L-ISTA, FacNet, L-FISTA): evolution of the cost function $F(z^{(q)}) - F(z^*)$ with the number of layers/iterations $q$, for a sparse model $\rho = 1/20$.


Artificial simulation

Figure (ISTA, Linear, FISTA, L-ISTA, FacNet, L-FISTA): evolution of the cost function $F(z^{(q)}) - F(z^*)$ with the number of layers/iterations $q$, for a denser model $\rho = 1/4$.


Adversarial dictionary

Adversarial dictionary:

$$D = \begin{bmatrix} d_1 & \ldots & d_K \end{bmatrix} \in \mathbb{R}^{K \times p},$$

with $d_j = e^{-i 2\pi j \zeta_q / K}$ for a random subset of frequencies $\{\zeta_i\}_{i \le m}$.

⇒ The eigenvectors of $D$ are far from the canonical basis.


Adversarial dictionary

Figure (ISTA, FISTA, L-ISTA, FacNet): evolution of the cost function $F(z^{(q)}) - F(z^*)$ with the number of layers/iterations $q$, for an adversarial dictionary.


PASCAL 08

Sparse coding for the PASCAL 08 dataset over the Haar wavelet family.

The sparse coding is performed on patches of size 8×8.

Trained over 500 images and tested over 100 images.


PASCAL 08

Figure (ISTA, Linear, FISTA, L-ISTA, FacNet, L-FISTA): evolution of the cost function $F(z^{(q)}) - F(z^*)$ with the number of layers/iterations $q$, for PASCAL VOC 2008.


MNIST

Dictionary $D$ with $K = 100$ atoms learned on 10,000 MNIST samples (17×17) with dictionary learning. LISTA is trained on the MNIST training set and tested on the MNIST test set.

Figure (ISTA, Linear, FISTA, L-ISTA, FacNet, L-FISTA): evolution of the cost function $F(z^{(q)}) - F(z^*)$ with the number of layers/iterations $q$, for MNIST.


Conclusion

• Non-asymptotic acceleration is possible: approximate matrix factorization of $B = D^T D$
  • nearly diagonalize the kernel,
  • $\ell_1$-norm nearly invariant under this orthogonal transformation.

• Future work:
  • improve the factorization formulation: $\min_{A^T A = I_K} f(\|DA\|_{1,2}) + \lambda_q \|A\|_{1,1}$,
  • give generic bounds for sub-Gaussian $D$,
  • link to Sparse PCA.


Conclusion

Questions?

Code: tomMoral/AdaptiveOptim

Paper: https://arxiv.org/abs/1706.01338

More at tommoral.github.io tomMoral


References

Beck, A. and Teboulle, M. (2009). A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bottou, L. and Bousquet, O. (2008). Learning using large datasets. Mining Massive DataSets for Security, 3.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457.

Gregor, K. and Lecun, Y. (2010). Learning Fast Approximations of Sparse Coding. In International Conference on Machine Learning (ICML), volume 152, pages 399–406, Haifa, Israel.

Haeffele, B. D. and Vidal, R. (2017). Global Optimality in Neural Network Training. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7331–7339, Honolulu, HI, USA.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.

Kawaguchi, K., Pack Kaelbling, L., and Bengio, Y. (2017). Generalization in Deep Learning. Preprint, arXiv:1710.05468.

Mairal, J., Bach, F., and Ponce, J. (2012). Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791–804.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288.
