An Introduction to Sparse Coding and Dictionary Learning
Kai Cao, January 14, 2014
Outline
• Introduction
• Mathematical foundation
• Sparse coding
• Dictionary learning
• Summary
Introduction
What is sparsity?
• Sparsity implies many zeros in a vector or a matrix
[Figure: a fingerprint patch is mapped by the FFT transform to a sparse FFT response (the sparse representation); the IFFT transform recovers the reconstructed patch.]
• Usage: compression, analysis, denoising, ...
Sparse Representation
[Figure: x ≈ Dα. Finding the sparse code α for a given dictionary D is the sparse coding problem; finding D itself is the dictionary learning problem.]
Application---Denoising [M. Elad, Springer 2010]
[Figure: a noisy source image (PSNR = 22.1 dB) is denoised with a learned dictionary, giving a result with PSNR = 30.829 dB.]
Application---Compression [O. Bryt, M. Elad, 2008]
[Figure: facial images compressed to 550 bytes per image with JPEG, JPEG 2000, PCA, and dictionary-based coding, shown next to the originals; the bottom row lists the RMSE value of each reconstruction.]
Mathematical foundation
Derivatives of vectors
• First order: ∂(a^T x)/∂x = ∂(x^T a)/∂x = a
• Second order: ∂(x^T B x)/∂x = (B + B^T) x
• Exercise: solve
  min_{α ∈ R^m} (1/2)||x - Dα||_2^2 + (λ/2)||α||_2^2,  x ∈ R^n, D ∈ R^{n×m}
  Answer: α = (D^T D + λI)^{-1} D^T x
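The closed-form answer can be checked numerically. The sketch below (plain NumPy; the sizes and λ are arbitrary choices of this sketch) verifies that the gradient vanishes at α = (D^T D + λI)^{-1} D^T x and that random perturbations never lower the objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)
lam = 0.1

def objective(a):
    # (1/2)||x - D a||_2^2 + (lambda/2)||a||_2^2
    return 0.5 * np.sum((x - D @ a) ** 2) + 0.5 * lam * np.sum(a ** 2)

# closed-form answer from the exercise: a = (D^T D + lambda*I)^{-1} D^T x
a_star = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ x)

# the gradient D^T(D a - x) + lambda*a vanishes at a_star ...
grad = D.T @ (D @ a_star - x) + lam * a_star

# ... and random perturbations never improve the objective
no_better = all(objective(a_star + 0.01 * rng.standard_normal(m)) >= objective(a_star)
                for _ in range(200))
```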
Trace of a Matrix
• Definition: Tr(A) = Σ_{i=1}^n a_{ii},  A = (a_{ij}) ∈ R^{n×n}
• Properties:
  ||A||_F^2 = Σ_i Σ_j a_{ij}^2 = Tr(A^T A)
  Tr(A + B) = Tr(A) + Tr(B),  B ∈ R^{n×n}
  Tr(aA) = a Tr(A),  a ∈ R
  Tr(A^T) = Tr(A)
  Tr(AB) = Tr(BA),  B ∈ R^{n×n}
  Tr(ABC) = Tr(BCA) = Tr(CAB),  B, C ∈ R^{n×n}
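These identities are easy to sanity-check numerically; a small NumPy sketch (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
C = rng.standard_normal((4, 4))

linearity = np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))
transpose = np.isclose(np.trace(A.T), np.trace(A))
cyclic = np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))   # Tr(ABC) = Tr(CAB)
frobenius = np.isclose(np.sum(A ** 2), np.trace(A.T @ A))       # ||A||_F^2 = Tr(A^T A)
```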
Derivatives of traces
• First order:
  ∂Tr(XA)/∂X = A^T
  ∂Tr(X^T A)/∂X = A
• Second order:
  ∂Tr(X^T X A)/∂X = X A^T + X A
  ∂Tr(X^T B X)/∂X = B X + B^T X
• Exercise: solve
  min_{A ∈ R^{k×m}} (1/2)||X - DA||_F^2 + (λ/2)||A||_F^2,  X ∈ R^{n×m}, D ∈ R^{n×k}
  Answer: A = (D^T D + λI)^{-1} D^T X
Sparse coding
Sparse linear model
• Let x ∈ R^n be a signal
• Let D = [d_1, d_2, …, d_m] ∈ R^{n×m} be a set of normalized "basis vectors" (the dictionary), with d_i^T d_i = 1
• Sparse representation finds a sparse vector α ∈ R^m such that x ≈ Dα; α is called the sparse code
The sparse coding model
• Objective function:
  min_{α ∈ R^m} (1/2)||x - Dα||_2^2 + λφ(α)
  (data-fitting term + regularization term)
• The regularization term φ can be
  – the l2 norm: φ(α) = ||α||_2^2 = Σ_{i=1}^m α_i^2
  – the l0 norm: φ(α) = ||α||_0 = #{i | α_i ≠ 0} (sparsity inducing)
  – the l1 norm: φ(α) = ||α||_1 = Σ_{i=1}^m |α_i| (sparsity inducing)
  – …
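A small NumPy illustration of the three regularizers evaluated on a sparse code (the vector is an arbitrary example):

```python
import numpy as np

alpha = np.array([0.0, 3.0, 0.0, -4.0, 0.0])   # a sparse code

l0 = int(np.count_nonzero(alpha))    # ||alpha||_0 : number of nonzero entries
l1 = float(np.sum(np.abs(alpha)))    # ||alpha||_1 : sum of absolute values
l2sq = float(np.sum(alpha ** 2))     # ||alpha||_2^2 : sum of squares
```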
Matching pursuit
1. Initialization: α = 0, residual r = x
2. While ||α||_0 < L:
   – select the atom most correlated with the residual: i = argmax_j |d_j^T r|
   – update its coefficient: α_i ← α_i + d_i^T r
   – update the residual: r ← r - (d_i^T r) d_i
An example for matching pursuit
[Figure: a patch x from a latent dictionary with elements d_1, …, d_5.]
• Correlations c_i = d_i^T x: c_1 = -0.039, c_2 = 0.577, c_3 = 0.054, c_4 = -0.031, c_5 = -0.437
• Select d_2 and set r = x - 0.577 d_2
• Correlations c_i = d_i^T r: c_1 = -0.035, c_2 = 0, c_3 = 0.037, c_4 = -0.046, c_5 = -0.289
• Select d_5 and set r = x - 0.577 d_2 - (-0.289) d_5; the coefficient of d_2 does not update!
• Reconstructed patch: x̂ = 0.577 d_2 + (-0.289) d_5, with ||x - x̂||^2 = 0.763
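Matching pursuit is short enough to sketch directly in NumPy. The dictionary and signal below are synthetic stand-ins, not the fingerprint patch from the example:

```python
import numpy as np

def matching_pursuit(x, D, n_iter):
    """Greedy matching pursuit with unit-norm atoms d_i."""
    alpha = np.zeros(D.shape[1])
    r = x.copy()                      # residual starts at the signal
    for _ in range(n_iter):
        c = D.T @ r                   # correlations c_i = d_i^T r
        i = np.argmax(np.abs(c))      # most correlated atom
        alpha[i] += c[i]              # accumulate its coefficient (no re-fit)
        r = r - c[i] * D[:, i]        # remove its contribution from the residual
    return alpha, r

rng = np.random.default_rng(2)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)        # normalize atoms so d_i^T d_i = 1
x = 2.0 * D[:, 3] - 1.5 * D[:, 7]     # a 2-sparse signal
alpha, r = matching_pursuit(x, D, n_iter=10)
```

Each iteration can only shrink the residual, but earlier coefficients are never revisited by a joint fit, which is exactly the weakness the example above illustrates.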
Orthogonal matching pursuit
1. Initialization: α = 0, residual r = x, active set Ω = ∅
2. While ||α||_0 < L:
   – select i = argmax_j |d_j^T r| and add it to the active set: Ω ← Ω ∪ {i}
   – re-fit all active coefficients by least squares: α_Ω = argmin_β ||x - D_Ω β||_2^2
   – update the residual: r = x - D_Ω α_Ω
An example for orthogonal matching pursuit
[Figure: the same patch x and dictionary elements d_1, …, d_5 as before.]
• Correlations c_i = d_i^T x: c_1 = -0.039, c_2 = 0.577, c_3 = 0.054, c_4 = -0.031, c_5 = -0.437
• Select d_2 and set r = x - 0.577 d_2
• Correlations c_i = d_i^T r: c_1 = -0.035, c_2 = 0, c_3 = 0.037, c_4 = -0.046, c_5 = -0.289
• Select d_5 and re-fit both coefficients: r = x - 0.499 d_2 - (-0.309) d_5
• Reconstructed patch: x̂ = 0.499 d_2 + (-0.309) d_5, with ||x - x̂||^2 = 0.759
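The same sketch, changed so that all selected coefficients are re-fitted by least squares on the active set each step; this re-fit is the only difference from plain matching pursuit (the data again are synthetic):

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, L):
    """OMP: grow an active set and re-fit all of its coefficients each step."""
    active = []
    alpha = np.zeros(D.shape[1])
    r = x.copy()
    for _ in range(L):
        c = D.T @ r                                  # correlations with residual
        i = int(np.argmax(np.abs(c)))
        if i not in active:
            active.append(i)                         # Omega <- Omega U {i}
        coef, *_ = np.linalg.lstsq(D[:, active], x, rcond=None)
        alpha[:] = 0.0
        alpha[active] = coef                         # least-squares re-fit
        r = x - D[:, active] @ coef                  # residual orthogonal to D_Omega
    return alpha, r

rng = np.random.default_rng(3)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 7]
alpha, r = orthogonal_matching_pursuit(x, D, L=2)
```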
Why does l1-norm induce sparsity?
• Analysis in 1D (comparison with l2)
  l1: min_{α ∈ R} (1/2)(α - x)^2 + λ|α|
      ⇒ if x ≥ λ, α = x - λ; if x ≤ -λ, α = x + λ; else α = 0 (soft thresholding)
  l2: min_{α ∈ R} (1/2)(α - x)^2 + λα^2
      ⇒ α = x/(1 + 2λ)
[Figure: the l1 solution is exactly zero on [-λ, λ] and shifted by λ outside; the l2 solution is a line through the origin with slope 1/(1 + 2λ), which is never exactly zero.]
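Both 1D solutions are one-liners; the sketch below implements them and makes the contrast concrete (soft thresholding returns exact zeros, the l2 shrinkage never does):

```python
import numpy as np

def soft_threshold(x, lam):
    """Minimizer of (1/2)(a - x)^2 + lam*|a|: shift toward zero, clip at zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ridge_shrink(x, lam):
    """Minimizer of (1/2)(a - x)^2 + lam*a^2: uniform rescaling, never exactly zero."""
    return x / (1.0 + 2.0 * lam)

xs = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
sparse_sol = soft_threshold(xs, 0.5)   # inputs inside [-0.5, 0.5] map exactly to zero
ridge_sol = ridge_shrink(xs, 0.5)      # every nonzero input stays nonzero
```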
Why does l1-norm induce sparsity?
• Analysis in 2D (comparison with l2)
  l1: min_{α ∈ R^2} (1/2)||x - α||_2^2 + λ||α||_1  ⇔  min_{α ∈ R^2} (1/2)||x - α||_2^2 s.t. ||α||_1 ≤ μ
  l2: min_{α ∈ R^2} (1/2)||x - α||_2^2 + λ||α||_2^2  ⇔  min_{α ∈ R^2} (1/2)||x - α||_2^2 s.t. ||α||_2 ≤ μ
[Figure: in the (α_1, α_2) plane, the l1 ball of radius μ is a diamond whose corners lie on the axes, so the constrained solution often lands on a corner, where one coordinate is exactly zero; the l2 ball is a disk with no corners.]
Optimality condition for l1-norm regularization
  min_{α ∈ R^m} J(α) = (1/2)||x - Dα||_2^2 + λ||α||_1
• Directional derivative in the direction u at α:
  ∇J(α, u) = lim_{t→0+} (J(α + tu) - J(α)) / t
• g is a subgradient of J at α if and only if
  ∀β ∈ R^m, J(β) ≥ J(α) + g^T(β - α)
• Proposition 1: g is a subgradient at α ⇔ ∀u ∈ R^m, g^T u ≤ ∇J(α, u)
• Proposition 2: if J is differentiable at α, ∇J(α, u) = ∇J(α)^T u
• Proposition 3: α is optimal if and only if for all u, ∇J(α, u) ≥ 0
Subgradient for l1-norm regularization
• Example: f(x) = |x|
  ∇f(x, u) = |u| if x = 0;  sign(x) u if x ≠ 0
[Figure: f(x) = |x| and its subgradients: sign(x) for x ≠ 0, and the whole interval [-1, 1] at x = 0.]
Subgradient for l1-norm regularization
  min_{α ∈ R^m} J(α) = (1/2)||x - Dα||_2^2 + λ||α||_1
• Directional derivative:
  ∇J(α, u) = Σ_{i: α_i ≠ 0} (-d_i^T(x - Dα) + λ sign(α_i)) u_i + Σ_{i: α_i = 0} (-d_i^T(x - Dα) u_i + λ|u_i|)
• g is a subgradient at α if and only if, for all i,
  g_i = -d_i^T(x - Dα) + λ sign(α_i)  if α_i ≠ 0
  |g_i + d_i^T(x - Dα)| ≤ λ           if α_i = 0
• Hence α is optimal (0 is a subgradient) if and only if, for all i,
  d_i^T(x - Dα) = λ sign(α_i)  if α_i ≠ 0,  and  |d_i^T(x - Dα)| ≤ λ  if α_i = 0
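The optimality condition can be verified numerically. The sketch below solves the lasso by coordinate descent (a solver choice of this sketch, not something the slides prescribe; each coordinate update is the same soft-thresholding operation as in the 1D analysis) and then checks both cases of the condition:

```python
import numpy as np

def lasso_cd(x, D, lam, n_sweeps=1000):
    """Coordinate descent for (1/2)||x - D a||_2^2 + lam||a||_1, unit-norm atoms."""
    a = np.zeros(D.shape[1])
    for _ in range(n_sweeps):
        for i in range(D.shape[1]):
            # correlation of atom i with the residual that excludes atom i itself
            c = D[:, i] @ (x - D @ a) + a[i]
            a[i] = np.sign(c) * max(abs(c) - lam, 0.0)     # soft threshold
    return a

rng = np.random.default_rng(4)
D = rng.standard_normal((10, 15))
D /= np.linalg.norm(D, axis=0)
x = rng.standard_normal(10)
lam = 0.3

a = lasso_cd(x, D, lam)
corr = D.T @ (x - D @ a)                 # d_i^T (x - D a) for every atom
zero_ok = np.all(np.abs(corr[a == 0.0]) <= lam + 1e-4)
active_ok = np.allclose(corr[a != 0.0], lam * np.sign(a[a != 0.0]), atol=1e-4)
```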
First order method for convex optimization
• Differentiable objective: gradient descent
  α_{t+1} = α_t - η_t ∇J(α_t)
  – with line search for a descent step size η_t
  – or a diminishing step size, e.g., η_t = (t + t_0)^{-1}
• Non-differentiable objective: subgradient descent
  α_{t+1} = α_t - η_t g_t, where g_t is a subgradient of J at α_t
  – with line search or a diminishing step size
Reformulation as quadratic program
  min_{α ∈ R^m} (1/2)||x - Dα||_2^2 + λ||α||_1
Split α = α^+ - α^- with α^+, α^- ≥ 0:
  min_{α^+ ≥ 0, α^- ≥ 0} (1/2)||x - D(α^+ - α^-)||_2^2 + λ(1^T α^+ + 1^T α^-)
a smooth quadratic program with nonnegativity constraints.
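Because the split problem is smooth with simple nonnegativity constraints, projected gradient descent can solve it (an illustrative choice of this sketch, not prescribed by the slides). Using an orthonormal dictionary lets the answer be checked against plain soft thresholding:

```python
import numpy as np

def lasso_qp_projected_gradient(x, D, lam, step, n_iter=2000):
    """Projected gradient on the split a = a_plus - a_minus, both >= 0."""
    m = D.shape[1]
    a_plus = np.zeros(m)
    a_minus = np.zeros(m)
    for _ in range(n_iter):
        g = D.T @ (D @ (a_plus - a_minus) - x)       # gradient of the smooth part
        a_plus = np.maximum(a_plus - step * (g + lam), 0.0)    # project onto >= 0
        a_minus = np.maximum(a_minus - step * (-g + lam), 0.0)
    return a_plus - a_minus

# on an orthonormal dictionary the lasso answer is plain soft thresholding,
# which gives an easy correctness check
D = np.eye(5)
x = np.array([2.0, 0.1, -1.0, 0.05, 3.0])
lam = 0.5
a = lasso_qp_projected_gradient(x, D, lam, step=0.5)
expected = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```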
Dictionary Learning
Dictionary selection
• Which D to use?
• A fixed set of bases:
  – steerable wavelets
  – contourlets
  – DCT basis
  – …
• A data-adaptive dictionary, learned from data:
  – K-SVD (l0-norm)
  – online dictionary learning (l1-norm)
The objective function for K-SVD
  X ≈ DA: the examples (columns of X) are linear combinations of atoms from D
  min_{D,A} ||X - DA||_F^2  s.t. ∀j, ||α_j||_0 ≤ L
Each example has a sparse representation with no more than L atoms.
K-SVD: An Overview
[www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt]
• Initialize D
• Sparse coding: use MP or OMP
• Dictionary update: column-by-column by SVD computation
• Iterate the last two steps
K-SVD: Sparse Coding Stage
  min_A ||X - DA||_F^2  s.t. ∀j, ||α_j||_0 ≤ L   (D is fixed)
For the jth example we solve
  min_α ||x_j - Dα||_2^2  s.t. ||α||_0 ≤ L
Ordinary sparse coding!
K-SVD: Dictionary Update Stage
  min_D ||X - DA||_F^2  s.t. ∀j, ||α_j||_0 ≤ L   (A is fixed)
For the kth atom we solve
  min_{d_k, α_T^k} ||E_k - d_k α_T^k||_F^2,  where E_k = X - Σ_{i≠k} d_i α_T^i  (the residual)
Solve with the SVD: E_k = U Λ V^T,  d_k = u_1
K-SVD: Dictionary Update Stage (continued)
• Only some of the examples use column d_k!
• When updating α_T^k, only recompute the coefficients corresponding to those examples: restrict E_k to them and solve the rank-1 approximation with the SVD
• Restricting to those examples also preserves the sparsity pattern of A
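A minimal sketch of one dictionary-update pass, assuming the sparse codes A have already been computed by the sparse coding stage. As on the slides, each atom is updated by a rank-1 SVD of the residual restricted to the examples that use it:

```python
import numpy as np

def ksvd_dictionary_update(X, D, A):
    """One K-SVD pass: update each atom and its coefficient row by a rank-1 SVD."""
    for k in range(D.shape[1]):
        users = np.flatnonzero(A[k])              # examples that use atom k
        if users.size == 0:
            continue                              # unused atom: leave it alone
        # residual of those examples with atom k's own contribution removed
        E_k = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
        U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                         # d_k = u_1 (unit norm)
        A[k, users] = s[0] * Vt[0]                # coefficients only for those examples
    return D, A

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 50))
D = rng.standard_normal((8, 12))
D /= np.linalg.norm(D, axis=0)
A = rng.standard_normal((12, 50)) * (rng.random((12, 50)) < 0.25)  # sparse codes
err_before = np.linalg.norm(X - D @ A)
D, A = ksvd_dictionary_update(X, D, A)
err_after = np.linalg.norm(X - D @ A)
```

Each rank-1 step minimizes the restricted residual, so the overall fitting error never increases across a pass.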
Compare K-SVD with K-means
  K-SVD:   initialize dictionary → sparse coding (use MP or OMP) → dictionary update (column-by-column by SVD computation) → repeat
  K-means: initialize cluster centers → assignment for each vector → cluster centers update (cluster-by-cluster) → repeat
Dictionary learning with l1-norm regularization
• Objective function:
  min_D (1/t) Σ_{i=1}^t [ (1/2)||x_i - Dα_i||_2^2 + λ||α_i||_1 ]
  where α_i = argmin_{α ∈ R^m} (1/2)||x_i - Dα||_2^2 + λ||α||_1
• Advantages of online learning:
  – handles large and dynamic datasets
  – can be much faster than batch algorithms
Dictionary learning with l1-norm regularization
  F_t(D) = (1/t) Σ_{i=1}^t [ (1/2)||x_i - Dα_i||_2^2 + λ||α_i||_1 ]
         = (1/t) [ (1/2)Tr(D^T D A_t) - Tr(D^T B_t) ] + const,
  where A_t = Σ_{i=1}^t α_i α_i^T and B_t = Σ_{i=1}^t x_i α_i^T
  (the constant collects the terms (1/2)x_i^T x_i + λ||α_i||_1, which do not depend on D)
• Gradient:
  ∂F_t(D)/∂D = (1/t)(D A_t - B_t)
• For a new sample x_{t+1}:
  A_{t+1} = A_t + α_{t+1} α_{t+1}^T,  B_{t+1} = B_t + x_{t+1} α_{t+1}^T
On-line dictionary learning
1) Initialization: D_0 ∈ R^{n×m}; A_0 = 0; B_0 = 0
2) For t = 1, …, T
3)   Draw x_t from the training data set
4)   Get the sparse code: α_t = argmin_{α ∈ R^m} (1/2)||x_t - D_{t-1}α||_2^2 + λ||α||_1
5)   Aggregate sufficient statistics: A_t = A_{t-1} + α_t α_t^T,  B_t = B_{t-1} + x_t α_t^T
6)   Dictionary update: D_t = D_{t-1} - ρ ∂F_t(D)/∂D evaluated at D_{t-1}
7) End for
Toolbox - SPAMS
• SPArse Modeling Software:
  – Sparse coding: l0-norm regularization, l1-norm regularization, …
  – Dictionary learning: K-SVD, online dictionary learning, …
• Implemented in C++ with a Matlab interface
• http://spams-devel.gforge.inria.fr/
Summary
• Sparsity and sparse representation
• Sparse coding with l0- and l1-norm regularization
  – matching pursuit / orthogonal matching pursuit
  – subgradients and the optimality condition
• Dictionary learning with l0- and l1-norm regularization
  – K-SVD
  – online dictionary learning
• Try it out!
References
• T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680-4688, 2011.
• B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
• M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, November 2006.
• J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
Thank you for listening