Sparse Coding: An Overview
Brian Booth
SFU Machine Learning Reading Group
November 12, 2013
The aim of sparse coding
Every column of D is a prototype.
Similar to, but more general than, PCA.
Example: Sparse Coding of Images
Sparse Coding in V1
Example: Image Denoising
Example: Image Restoration
Sparse Coding and Acoustics
The inner ear (cochlea) also does sparse coding of frequencies.
Sparse Coding and Natural Language Processing
Outline
1 Introduction: Why Sparse Coding?
2 Sparse Coding: The Basics
3 Adding Prior Knowledge
4 Conclusions
The aim of sparse coding, revisited

We assume our data x satisfies

$$x \approx \sum_{i=1}^{n} \alpha_i d_i = \alpha D$$

Learning: Given training data $x_j$, $j \in \{1, \dots, m\}$, learn the dictionary D and the sparse codes $\alpha_j$.
Encoding: Given test data x and the dictionary D, learn the sparse code $\alpha$.
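To make the model concrete before the formal objective, here is a minimal numpy sketch of what it asserts. All names and sizes are illustrative; D and α are random stand-ins for learned quantities, and rows of D are treated as the atoms so that the product αD type-checks:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 64, 128                    # signal dimension, number of atoms (n > k: over-complete)

D = rng.standard_normal((n, k))   # dictionary: row i is atom d_i (stand-in for a learned D)
alpha = np.zeros(n)               # sparse code: only a few entries active
active = rng.choice(n, size=5, replace=False)
alpha[active] = rng.standard_normal(5)

x_hat = alpha @ D                 # x ≈ Σ_i α_i d_i = αD: a weighted sum of 5 atoms
print(x_hat.shape, np.count_nonzero(alpha))   # (64,) 5
```

Learning fits both D and the codes to training data; encoding solves only for α with D held fixed.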
Learning: The Objective Function

Dictionary learning involves optimizing:

$$\arg\min_{d_i, \alpha_j} \sum_{j=1}^{m} \Big\| x_j - \sum_{i=1}^{n} \alpha_{j,i}\, d_i \Big\|^2 + \beta \sum_{j=1}^{m} \sum_{i=1}^{n} |\alpha_{j,i}|$$

subject to $\|d_i\|^2 \le c$, $\forall i = 1, \dots, n$.

In matrix notation:

$$\arg\min_{D, A} \|X - AD\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|$$

subject to $\sum_i D_{i,j}^2 \le c$, $\forall j = 1, \dots, n$.

The problem is not jointly convex, but it is convex in D with A fixed and convex in A with D fixed, so we split the optimization over D and A into two alternating steps.
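As a sanity check on the notation, a small helper that evaluates the matrix form of the objective (a sketch; `X`, `A`, `D`, and `beta` are assumed inputs with the shapes m × k, m × n, n × k used above):

```python
import numpy as np

def learning_objective(X, A, D, beta):
    """Evaluate ‖X − AD‖_F² + β·Σ|α_ij| (the norm constraint on D is handled separately)."""
    residual = X - A @ D
    return np.sum(residual**2) + beta * np.sum(np.abs(A))
```

Alternating minimization then repeatedly fixes A to update D (Step 1) and fixes D to update A (Step 2), driving this value down at each step.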
Step 1: Learning the Dictionary

Reduced optimization problem:

$$\arg\min_{D} \|X - AD\|_F^2$$

subject to $\sum_i D_{i,j}^2 \le c$, $\forall j = 1, \dots, n$.

Introduce Lagrange multipliers:

$$\mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( (X - AD)^T (X - AD) \right) + \sum_{j=1}^{n} \lambda_j \left( \sum_i D_{i,j}^2 - c \right)$$

where each $\lambda_j \ge 0$ is a dual variable.
Step 1: Moving to the dual

From the Lagrangian

$$\mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( (X - AD)^T (X - AD) \right) + \sum_{j=1}^{n} \lambda_j \left( \sum_i D_{i,j}^2 - c \right)$$

minimize over D to obtain the Lagrange dual:

$$\mathcal{D}(\lambda) = \min_{D} \mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( X^T X - XA^T \left( AA^T + \Lambda \right)^{-1} \left( XA^T \right)^T - c\Lambda \right)$$

The dual can be optimized using conjugate gradient.
Only n dual variables λ_j, compared to the n × k entries of D.
Step 1: Dual to the Dictionary

With the optimal Λ, our dictionary is

$$D^T = \left( AA^T + \Lambda \right)^{-1} \left( XA^T \right)^T$$

Key point: Moving to the dual reduces the number of optimization variables, speeding up the optimization.
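Once Λ is known, the update is a single linear solve. The sketch below is one literal transcription, assuming the column convention under which the formula type-checks (data points as columns of X, one code per column of A, atoms as columns of D, so X ≈ DA); computing `lam` itself is the conjugate-gradient optimization of the dual above, not shown here:

```python
import numpy as np

def dictionary_from_dual(X, A, lam):
    """Closed-form update D^T = (A A^T + Λ)^{-1} (X A^T)^T.

    X: (k, m) data matrix, A: (n, m) code matrix,
    lam: (n,) optimal dual variables (assumed given).
    """
    G = A @ A.T + np.diag(lam)            # (n, n), symmetric positive definite
    Dt = np.linalg.solve(G, (X @ A.T).T)  # solve instead of forming the inverse explicitly
    return Dt.T                           # D is (k, n): atoms as columns
```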
Step 2: Learning the Sparse Code

With D now fixed, optimize for A:

$$\arg\min_{A} \|X - AD\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|$$

An unconstrained, convex optimization problem (L1-regularized least squares, reducible to a quadratic program).
Many solvers exist for this (e.g. interior-point methods, the in-crowd algorithm, fixed-point continuation); one simple solver is sketched below.

Note: this is the same problem solved at encoding time, which raises the question of how fast the optimization runs in the encoding stage.
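The slides name these solvers without detail; as a stand-in, here is a minimal ISTA-style proximal-gradient sketch for the same L1-regularized least-squares problem. The step size comes from the Lipschitz constant of the gradient; the iteration count is an arbitrary illustrative choice, not a tuned value:

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t·‖·‖₁: shrink every entry toward zero by t."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def encode_ista(X, D, beta, n_iters=200):
    """Minimize ‖X − AD‖_F² + β·Σ|A| over A with D fixed (ISTA sketch).

    X: (m, k) data rows, D: (n, k) dictionary rows.
    """
    A = np.zeros((X.shape[0], D.shape[0]))
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant: 2·σ_max(D)²
    for _ in range(n_iters):
        grad = 2.0 * (A @ D - X) @ D.T         # gradient of the smooth data-fit term
        A = soft_threshold(A - grad / L, beta / L)
    return A
```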
Speeding up the testing phase

There is a fair amount of work on speeding up the encoding stage:

H. Lee et al., Efficient sparse coding algorithms,
http://ai.stanford.edu/~hllee/nips06-sparsecoding.pdf

K. Gregor and Y. LeCun, Learning Fast Approximations of Sparse Coding,
http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf

S. Hawe et al., Separable Dictionary Learning,
http://arxiv.org/pdf/1303.5244v1.pdf
Relationships between Dictionary atoms

Dictionaries are over-complete bases.
We can dictate relationships between atoms.
Example: hierarchical dictionaries.
Example: Image Patches
Example: Document Topics
Problem Statement

Goal: Have sub-groups of the sparse code α be all non-zero (or all zero).

Hierarchical:
If a node is non-zero, its parent must be non-zero.
If a node's parent is zero, the node must be zero.

Implementation: Change the regularization to enforce sparsity differently.
Grouping Code Entries
A code entry at level k of the tree is included in k + 1 groups.
Add |α_i| to the objective function once for each group containing it, as in the sketch below.
Group Regularization

Updated objective function:

$$\arg\min_{D, \alpha_j} \sum_{j=1}^{m} \left[ \|x_j - D\alpha_j\|^2 + \beta\, \Omega(\alpha_j) \right]$$

where

$$\Omega(\alpha) = \sum_{g \in \mathcal{P}} w_g \,\|\alpha|_g\|$$

α|_g are the code values for group g.
w_g weights the enforcement of the hierarchy.
Solve using proximal methods; a per-group sketch follows.
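Proximal methods handle Ω through its proximal operator. For non-overlapping groups with the Euclidean norm, that operator is the block soft-thresholding below (a sketch; the overlapping groups of a hierarchy need the more careful composition of prox operators described in the Mairal et al. reference that follows, but the per-group computation has this same shape):

```python
import numpy as np

def prox_group(alpha, groups, weights, t):
    """Proximal operator of t·Σ_g w_g·‖α|_g‖ for non-overlapping groups.

    Scales each group's sub-vector toward zero, zeroing the whole group
    when its norm falls below t·w_g — all-or-nothing sparsity per group.
    """
    out = alpha.copy()
    for g, w in zip(groups, weights):
        norm = np.linalg.norm(alpha[g])
        scale = max(1.0 - t * w / norm, 0.0) if norm > 0 else 0.0
        out[g] = scale * alpha[g]
    return out
```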
Other Examples
Other examples of structured sparsity:

M. Stojnic et al., On the Reconstruction of Block-Sparse Signals With an Optimal Number of Measurements,
http://dx.doi.org/10.1109/TSP.2009.2020754

J. Mairal et al., Convex and Network Flow Optimization for Structured Sparsity,
http://jmlr.org/papers/volume12/mairal11a/mairal11a.pdf
Summary

Two interesting directions:
Increasing the speed of the testing phase.
Optimizing dictionary structure.