Sparse Coding: An Overview
Brian Booth
SFU Machine Learning Reading Group
November 12, 2013
The aim of sparse coding
Every column of D is a prototype.
Similar to, but more general than, PCA.
Example: Sparse Coding of Images
Sparse Coding in V1
Example: Image Denoising
Example: Image Restoration
Sparse Coding and Acoustics
The inner ear (cochlea) also does sparse coding of frequencies.
Sparse Coding and Natural Language Processing
Outline
1 Introduction: Why Sparse Coding?
2 Sparse Coding: The Basics
3 Adding Prior Knowledge
4 Conclusions
The aim of sparse coding, revisited

We assume our data x satisfies

$$x \approx \sum_{i=1}^{n} \alpha_i d_i = \alpha D$$

Learning: Given training data $x_j$, $j \in \{1, \dots, m\}$, learn the dictionary D and the sparse codes $\alpha_j$.
Encoding: Given test data x and the dictionary D, learn the sparse code $\alpha$.
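To make the model concrete before the formal objective, here is a minimal numpy sketch of what it asserts. All names and sizes are illustrative; D and α are random stand-ins for learned quantities, and rows of D are treated as the atoms so that the product αD type-checks:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 64, 128                    # signal dimension, number of atoms (n > k: over-complete)

D = rng.standard_normal((n, k))   # dictionary: row i is atom d_i (stand-in for a learned D)
alpha = np.zeros(n)               # sparse code: only a few entries active
active = rng.choice(n, size=5, replace=False)
alpha[active] = rng.standard_normal(5)

x_hat = alpha @ D                 # x ≈ Σ_i α_i d_i = αD: a weighted sum of 5 atoms
print(x_hat.shape, np.count_nonzero(alpha))   # (64,) 5
```

Learning fits both D and the codes to training data; encoding solves only for α with D held fixed.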
Learning: The Objective Function

Dictionary learning involves optimizing:

$$\arg\min_{d_i, \alpha_j} \sum_{j=1}^{m} \Big\| x_j - \sum_{i=1}^{n} \alpha_{j,i}\, d_i \Big\|^2 + \beta \sum_{j=1}^{m} \sum_{i=1}^{n} |\alpha_{j,i}|$$

subject to $\|d_i\|^2 \le c$, $\forall i = 1, \dots, n$.

In matrix notation:

$$\arg\min_{D, A} \|X - AD\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|$$

subject to $\sum_i D_{i,j}^2 \le c$, $\forall j = 1, \dots, n$.

The problem is not jointly convex, but it is convex in D with A fixed and convex in A with D fixed, so we split the optimization over D and A into two alternating steps.
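As a sanity check on the notation, a small helper that evaluates the matrix form of the objective (a sketch; `X`, `A`, `D`, and `beta` are assumed inputs with the shapes m × k, m × n, n × k used above):

```python
import numpy as np

def learning_objective(X, A, D, beta):
    """Evaluate ‖X − AD‖_F² + β·Σ|α_ij| (the norm constraint on D is handled separately)."""
    residual = X - A @ D
    return np.sum(residual**2) + beta * np.sum(np.abs(A))
```

Alternating minimization then repeatedly fixes A to update D (Step 1) and fixes D to update A (Step 2), driving this value down at each step.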
Step 1: Learning the Dictionary

Reduced optimization problem:

$$\arg\min_{D} \|X - AD\|_F^2$$

subject to $\sum_i D_{i,j}^2 \le c$, $\forall j = 1, \dots, n$.

Introduce Lagrange multipliers:

$$\mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( (X - AD)^T (X - AD) \right) + \sum_{j=1}^{n} \lambda_j \left( \sum_i D_{i,j}^2 - c \right)$$

where each $\lambda_j \ge 0$ is a dual variable.
Step 1: Moving to the dual

From the Lagrangian

$$\mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( (X - AD)^T (X - AD) \right) + \sum_{j=1}^{n} \lambda_j \left( \sum_i D_{i,j}^2 - c \right)$$

minimize over D to obtain the Lagrange dual:

$$\mathcal{D}(\lambda) = \min_{D} \mathcal{L}(D, \lambda) = \operatorname{tr}\!\left( X^T X - XA^T \left( AA^T + \Lambda \right)^{-1} \left( XA^T \right)^T - c\Lambda \right)$$

The dual can be optimized using conjugate gradient.
Only n dual variables λ_j, compared to the n × k entries of D.
Step 1: Dual to the Dictionary

With the optimal Λ, our dictionary is

$$D^T = \left( AA^T + \Lambda \right)^{-1} \left( XA^T \right)^T$$

Key point: Moving to the dual reduces the number of optimization variables, speeding up the optimization.
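Once Λ is known, the update is a single linear solve. The sketch below is one literal transcription, assuming the column convention under which the formula type-checks (data points as columns of X, one code per column of A, atoms as columns of D, so X ≈ DA); computing `lam` itself is the conjugate-gradient optimization of the dual above, not shown here:

```python
import numpy as np

def dictionary_from_dual(X, A, lam):
    """Closed-form update D^T = (A A^T + Λ)^{-1} (X A^T)^T.

    X: (k, m) data matrix, A: (n, m) code matrix,
    lam: (n,) optimal dual variables (assumed given).
    """
    G = A @ A.T + np.diag(lam)            # (n, n), symmetric positive definite
    Dt = np.linalg.solve(G, (X @ A.T).T)  # solve instead of forming the inverse explicitly
    return Dt.T                           # D is (k, n): atoms as columns
```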
Step 2: Learning the Sparse Code

With D now fixed, optimize for A:

$$\arg\min_{A} \|X - AD\|_F^2 + \beta \sum_{i,j} |\alpha_{i,j}|$$

An unconstrained, convex optimization problem (L1-regularized least squares, reducible to a quadratic program).
Many solvers exist for this (e.g. interior-point methods, the in-crowd algorithm, fixed-point continuation); one simple solver is sketched below.

Note: this is the same problem solved at encoding time, which raises the question of how fast the optimization runs in the encoding stage.
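The slides name these solvers without detail; as a stand-in, here is a minimal ISTA-style proximal-gradient sketch for the same L1-regularized least-squares problem. The step size comes from the Lipschitz constant of the gradient; the iteration count is an arbitrary illustrative choice, not a tuned value:

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t·‖·‖₁: shrink every entry toward zero by t."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def encode_ista(X, D, beta, n_iters=200):
    """Minimize ‖X − AD‖_F² + β·Σ|A| over A with D fixed (ISTA sketch).

    X: (m, k) data rows, D: (n, k) dictionary rows.
    """
    A = np.zeros((X.shape[0], D.shape[0]))
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant: 2·σ_max(D)²
    for _ in range(n_iters):
        grad = 2.0 * (A @ D - X) @ D.T         # gradient of the smooth data-fit term
        A = soft_threshold(A - grad / L, beta / L)
    return A
```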
Speeding up the testing phase

There is a fair amount of work on speeding up the encoding stage:

H. Lee et al., Efficient sparse coding algorithms,
http://ai.stanford.edu/~hllee/nips06-sparsecoding.pdf

K. Gregor and Y. LeCun, Learning Fast Approximations of Sparse Coding,
http://yann.lecun.com/exdb/publis/pdf/gregor-icml-10.pdf

S. Hawe et al., Separable Dictionary Learning,
http://arxiv.org/pdf/1303.5244v1.pdf
Relationships between Dictionary atoms

Dictionaries are over-complete bases.
We can dictate relationships between atoms.
Example: hierarchical dictionaries.
Example: Image Patches
Example: Document Topics
Problem Statement

Goal: Have sub-groups of the sparse code α be all non-zero (or all zero).

Hierarchical:
If a node is non-zero, its parent must be non-zero.
If a node's parent is zero, the node must be zero.

Implementation: Change the regularization to enforce sparsity differently.
Grouping Code Entries
A code entry at level k of the tree is included in k + 1 groups.
Add |α_i| to the objective function once for each group containing it, as in the sketch below.
Group Regularization

Updated objective function:

$$\arg\min_{D, \alpha_j} \sum_{j=1}^{m} \left[ \|x_j - D\alpha_j\|^2 + \beta\, \Omega(\alpha_j) \right]$$

where

$$\Omega(\alpha) = \sum_{g \in \mathcal{P}} w_g \,\|\alpha|_g\|$$

α|_g are the code values for group g.
w_g weights the enforcement of the hierarchy.
Solve using proximal methods; a per-group sketch follows.
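Proximal methods handle Ω through its proximal operator. For non-overlapping groups with the Euclidean norm, that operator is the block soft-thresholding below (a sketch; the overlapping groups of a hierarchy need the more careful composition of prox operators described in the Mairal et al. reference that follows, but the per-group computation has this same shape):

```python
import numpy as np

def prox_group(alpha, groups, weights, t):
    """Proximal operator of t·Σ_g w_g·‖α|_g‖ for non-overlapping groups.

    Scales each group's sub-vector toward zero, zeroing the whole group
    when its norm falls below t·w_g — all-or-nothing sparsity per group.
    """
    out = alpha.copy()
    for g, w in zip(groups, weights):
        norm = np.linalg.norm(alpha[g])
        scale = max(1.0 - t * w / norm, 0.0) if norm > 0 else 0.0
        out[g] = scale * alpha[g]
    return out
```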
Other Examples
Other examples of structured sparsity:

M. Stojnic et al., On the Reconstruction of Block-Sparse Signals With an Optimal Number of Measurements,
http://dx.doi.org/10.1109/TSP.2009.2020754

J. Mairal et al., Convex and Network Flow Optimization for Structured Sparsity,
http://jmlr.org/papers/volume12/mairal11a/mairal11a.pdf
Summary

Two interesting directions:
Increasing the speed of the testing phase.
Optimizing dictionary structure.