An Online Learning View via a Projections’ Path in the Sparse-land
Sergios Theodoridis1
1Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece.
Workshop on Sparse Signal Processing, Friday, Sep. 16, 2016
Joint work with P. Bouboulis, S. Chouvardas, Y. Kopsinis, G. Papageorgiou, K. Slavakis
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 1/58
Sparsity
Sparse Modeling
• Sparse modeling has been a major focus of research effort over the last decade or so.
• Sparsity-promoting regularization of cost functions copes with:
Ill conditioning and overfitting when solving inverse problems; learning from data is an instance of an inverse problem.
It promotes zeros when the underlying models have many near-to-zero values.
Sparse Modeling
The need for sparse models: Two examples
• Compression
• Echo Cancellation
Sparse Modeling
The Generic Model
OUTPUT = INPUT × SPARSE MODEL + NOISE
Sparse Modeling
The Regression Model
• A generic model that covers a large class of problems (Filtering, Prediction):

y_n = u_n^T a_* + v_n

a_* ∈ R^L is the unknown vector.
u_n ∈ R^L is the incoming signal (sensing vectors).
y_n ∈ R is the observed signal (measurements).
v_n is the additive noise process.

• a_* is assumed to be sparse. That is, only a few, K ≪ L, of its components are nonzero:

a_* = [0, 0, ⋆, 0, …, 0, ⋆, 0, 0, …, 0, ⋆, 0, …, 0]^T,

where the K entries marked ⋆ are the nonzero ones.

• In its simplest formulation, the task comprises the estimation of a_*, based on a set of measurements (y_n, u_n), n = 1, …, N.
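As a minimal numerical sketch of the model above (NumPy; dimensions and names are illustrative, not from the talk), the following generates data from y_n = u_n^T a_* + v_n with a K-sparse a_* and fits a plain least-squares estimate, which is viable here only because N > L:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K, N = 50, 3, 200              # ambient dimension, sparsity level, number of measurements

# K-sparse ground truth a_*: only K of its L components are nonzero.
a_star = np.zeros(L)
support = rng.choice(L, size=K, replace=False)
a_star[support] = rng.standard_normal(K)

U = rng.standard_normal((N, L))   # rows are the sensing vectors u_n^T
v = 0.01 * rng.standard_normal(N) # additive noise v_n
y = U @ a_star + v                # y_n = u_n^T a_* + v_n

# Plain least squares ignores sparsity; it works here only because N > L.
a_ls, *_ = np.linalg.lstsq(U, y, rcond=None)
err = np.max(np.abs(a_ls - a_star))
```

In the underdetermined regime (N < L), which is the interesting one for sparse modeling, this unregularized estimate breaks down and the sparsity-promoting methods of the following slides are needed.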
Sparse Modeling
Dictionary Learning
• This is a powerful tool for analyzing signals in terms of overcomplete basis vectors:

[y_1, …, y_N] = [u_1, …, u_m] [a_1, …, a_N],  m > L,
   (L×N)           (L×m)          (m×N)

Y = UA

y_n ∈ R^L, n = 1, 2, …, N, are the observation vectors.
u_i ∈ R^L, i = 1, 2, …, m, are the unknown atoms of the dictionary.
a_n ∈ R^m, n = 1, 2, …, N, are the vectors of the unknown weights, corresponding to the respective expansion of the nth observation vector:

y_n = Σ_{i=1}^m u_i a_{ni},

where a_n, n = 1, 2, …, N, are sparse vectors.
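A small numerical sketch of the synthesis model Y = UA (illustrative only: the dictionary here is random, whereas in dictionary learning U is itself estimated from the data):

```python
import numpy as np

rng = np.random.default_rng(1)
L, m, N, K = 20, 40, 100, 3       # signal dim, atoms (m > L: overcomplete), signals, nonzeros

U = rng.standard_normal((L, m))   # columns are the atoms u_i
A = np.zeros((m, N))
for n in range(N):                # each weight vector a_n is K-sparse
    idx = rng.choice(m, size=K, replace=False)
    A[idx, n] = rng.standard_normal(K)

Y = U @ A                         # y_n = sum_{i=1}^m u_i * a_{ni}

# Each observation is rebuilt exactly from its K active atoms alone.
active = np.flatnonzero(A[:, 0])
err = np.max(np.abs(Y[:, 0] - U[:, active] @ A[active, 0]))
```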
Sparse Modeling
Low Rank Matrix Factorization
• This task is at the heart of dimensionality reduction.

Y = UA = Σ_{i=1}^r u_i a_i^T,

[y_1, …, y_N] = [u_1, …, u_r] [a_1^T; …; a_r^T].
   (L×N)           (L×r)          (r×N)

• r < N.
• PCA performs low-rank matrix factorization, by imposing sparsity on the singular values as well as orthogonality on U.
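The rank-r factorization can be computed via the truncated SVD; a minimal sketch (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
L, N, r = 30, 40, 4

# An exactly rank-r matrix, Y = U A with U (L x r) and A (r x N).
Y = rng.standard_normal((L, r)) @ rng.standard_normal((r, N))

W, s, Vt = np.linalg.svd(Y, full_matrices=False)
num_rank = int(np.sum(s > 1e-10 * s[0]))   # numerical rank; equals r here

# Keeping the r leading singular triplets gives Y = sum_{i=1}^r u_i a_i^T exactly.
U_r = W[:, :r]
A_r = s[:r, None] * Vt[:r]
err = np.linalg.norm(Y - U_r @ A_r)
```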
Sparse Modeling
Low Rank Matrix Factorization
• Matrix Completion is a special constrained version of low-rank matrix factorization.
• Y has missing elements, and the low-rank matrix factorization

Y = Σ_{i=1}^r u_i a_i^T

is constrained to reproduce the non-missing elements at the respective positions.
Sparse Modeling
Low Rank Matrix Factorization
• Robust PCA is another special constrained version of low-rank matrix factorization:

Y = L + V,

where L is a low-rank matrix and V is a sparse matrix. The latter models OUTLIER NOISE; outliers are, by nature, sparse.

• The goal of the task is to obtain estimates of L and V by imposing sparsity on the singular values of L as well as on the elements of V, constrained so that Y = L + V.
Sparse Modeling
Robust Regression
• Robust Regression is an old problem, with a major impact comingfrom the works of Huber. The revival of interest is due to a newlook via sparsity-aware learning techniques. For example, thenoise may comprise a few large values (outliers) on top of theGaussian component. Since the large values are only a few, theycan be treated via sparse modeling arguments.
Sparse Regression Modeling
There are two paths that lead to the “truth”, i.e., obtaining an estimate a of the unknown a_*.

Batch Learning Problem
Linear Regression Model: y_n = u_n^T a_* + v_n
• U := [u_1, u_2, …, u_N]^T ∈ R^{N×L}
• y := [y_1, y_2, …, y_N]^T ∈ R^N, and v := [v_1, v_2, …, v_N]^T ∈ R^N.
Batch Formulation: y = U a_* + v
Estimating the unknown
There are two paths that lead to the “truth”, i.e., obtaining an estimate a of the unknown a_*.

Batch vs Online Learning
Batch formulation: y = U a_* + v
Online formulation: y_n = u_n^T a_* + v_n;
obtain an estimate, a_n, after (y_n, u_n) has been received.
Sparse Vs Online Learning
Sparsity-promoting batch algorithms (Compressed Sensing)
• Are mobilized after a finite number of data, {(u_n, y_n)}_{n=0}^{N-1}, has been collected.
• For any new datum, the estimation of a_* is repeated from scratch.
• Computational complexity might become prohibitive.
• Excessive storage demands.
• It is a “mature” research field with a diverse number of techniques and applications.

Sparsity-promoting online algorithms
• Infinite number of data.
• For any new datum, the estimate of a_* is updated dynamically.
• Cases of time-varying a_* are “naturally” handled.
• Low complexity, as required for streaming applications.
• Fast convergence / tracking.
• Large potential in Big Data applications.
Sparsity-Promoting Methods
ℓ0-norm constrained minimization
• ℓ0 (pseudo-)norm minimization: an NP-hard, nonconvex task:
  a : min_{a∈R^L} ‖a‖_0, s.t. ‖y − Ua‖_2^2 ≤ ε
• The above is carried out via greedy-type algorithmic arguments.

Constrained Least-Squares Estimation: Three equivalent formulations
• a := argmin_{a∈R^L} { ‖y − Ua‖_2^2 + λ‖a‖_1 }
• a : min_{a∈R^L} ‖y − Ua‖_2^2, s.t. ‖a‖_1 ≤ ρ
• a : min_{a∈R^L} ‖a‖_1, s.t. ‖y − Ua‖_2^2 ≤ ε
• Why the ℓ1 norm: It is the “closest” convex norm to the ℓ0 “norm” (the number of nonzero elements).
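One standard way to solve the first (penalized) formulation, not detailed on the slides, is the iterative soft-thresholding algorithm (ISTA); a sketch with illustrative dimensions and parameter values of my choosing:

```python
import numpy as np

def soft(x, t):
    """Soft thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(U, y, lam, n_iter=2000):
    """Minimize 0.5 * ||y - U a||_2^2 + lam * ||a||_1 by gradient step + shrinkage."""
    step = 1.0 / np.linalg.norm(U, 2) ** 2   # 1 / Lipschitz constant of the gradient
    a = np.zeros(U.shape[1])
    for _ in range(n_iter):
        a = soft(a + step * (U.T @ (y - U @ a)), step * lam)
    return a

rng = np.random.default_rng(3)
L, K, N = 40, 3, 25                          # underdetermined: fewer measurements than unknowns
a_star = np.zeros(L)
a_star[rng.choice(L, K, replace=False)] = np.array([1.5, -2.0, 1.0])
U = rng.standard_normal((N, L)) / np.sqrt(N)
y = U @ a_star                               # noiseless, for a clean sanity check

a_hat = ista(U, y, lam=0.01)
err = np.max(np.abs(a_hat - a_star))
```

Note that even with N < L, the ℓ1 penalty recovers the sparse vector up to a small shrinkage bias, which is exactly the regime where plain least squares fails.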
Sparsity-Promoting Methods
Hard and Soft thresholding
• The ℓ1 norm is associated with a soft thresholding operation on the respective coefficients. Soft thresholding is a continuous function, but it adds bias even to the large values. Hard thresholding, on the other hand, is a discontinuous operation.
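The two operators side by side (a minimal sketch; the threshold value is illustrative):

```python
import numpy as np

def soft_threshold(x, t):
    """Continuous, but shrinks every surviving coefficient by t (bias even for large values)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_threshold(x, t):
    """Discontinuous at |x| = t; surviving coefficients are left untouched (no bias)."""
    return np.where(np.abs(x) > t, x, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
s = soft_threshold(x, 1.0)    # large entries shrink: -3 -> -2, 4 -> 3
h = hard_threshold(x, 1.0)    # large entries survive intact: -3 and 4
```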
Batch Penalized Least-Squares Estimator
Penalized Least-Squares - General Case

min_{a∈R^L} { (1/2)‖y − Ua‖_2^2 + λ Σ_{i=1}^L p(|a_i|) }

• p(·): sparsity-promoting penalty function.
• λ: regularization parameter.

Examples: Penalty functions
• p(|a_i|) := |a_i|^γ, ∀a_i ∈ R
• p(|a_i|) := λ(1 − e^{−β|a_i|})
• p(|a_i|) := (λ / log(γ + 1)) log(γ|a_i| + 1), ∀a_i ∈ R
Online Sparsity-Promoting Methods
Penalized Recursive LS

min_{a∈R^L} { (1/2) Σ_{n=1}^N β^{N−n} e_n^2 + λ Σ_{i=1}^L p(|a_i|) },

r_N := Σ_{n=1}^N β^{N−n} y_n u_n,   R_N := Σ_{n=1}^N β^{N−n} u_n u_n^T

r_{n+1} = β r_n + y_{n+1} u_{n+1},   R_{n+1} = β R_n + u_{n+1} u_{n+1}^T
a_{n+1} = f(r_{n+1}, R_{n+1})

• It works!
• Complexity O(L^2).
• The regularization parameter needs fine tuning.
• [Angelosante, Bazerque and Giannakis, 2010]
• [Eksioglu and Tanc, 2011]
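A sketch of the rank-one recursions for r_n and R_n (the final map a_{n+1} = f(·) depends on the chosen penalty and is left abstract here, as on the slide); it verifies that they reproduce the batch exponentially weighted sums:

```python
import numpy as np

rng = np.random.default_rng(4)
L, N, beta = 5, 50, 0.99
U = rng.standard_normal((N, L))   # row n holds u_n^T
y = rng.standard_normal(N)

# One rank-one update per incoming pair (y_n, u_n): O(L^2) per step.
r = np.zeros(L)
R = np.zeros((L, L))
for n in range(N):
    r = beta * r + y[n] * U[n]
    R = beta * R + np.outer(U[n], U[n])

# Batch check: r_N = sum_n beta^{N-n} y_n u_n and R_N = sum_n beta^{N-n} u_n u_n^T.
w = beta ** (N - 1 - np.arange(N))
err = max(np.max(np.abs(r - (w * y) @ U)),
          np.max(np.abs(R - (U * w[:, None]).T @ U)))
```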
Online Sparsity-Promoting Methods
Penalized stochastic gradient descent: LMS type

min_{a∈R^L} { (1/2) e_n^2 + λ Σ_{i=1}^L p(|a_i|) }

a_{n+1} = a_n + μ e_n(a) u_n − μλ f(a_n),

f(a_n) = [∂p(|a_{n,1}|)/∂a_{n,1}, ∂p(|a_{n,2}|)/∂a_{n,2}, …, ∂p(|a_{n,L}|)/∂a_{n,L}]^T

• Complexity O(L).
• It works! (when compared to standard LMS)
• Slow convergence.
• The regularization parameter needs fine tuning.
• [Chen, Gu and Hero, 2009]
• [Mileounis, Babadi, Kalouptsidis and Tarokh, 2010]
• [Wang and Gu, 2012]
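For the ℓ1 penalty, p(|a_i|) = |a_i|, the gradient vector f(a_n) reduces to sign(a_n), giving the zero-attracting LMS of [Chen, Gu and Hero, 2009]; a sketch with illustrative step sizes (which, as noted above, need fine tuning):

```python
import numpy as np

rng = np.random.default_rng(5)
L, T = 20, 5000
a_star = np.zeros(L)
a_star[[3, 11]] = [1.0, -1.5]      # sparse system to identify

mu, lam = 0.01, 0.05               # step size and regularization parameter
a = np.zeros(L)
for _ in range(T):
    u = rng.standard_normal(L)
    y = u @ a_star + 0.01 * rng.standard_normal()
    e = y - u @ a                  # a-priori error e_n(a)
    a = a + mu * e * u - mu * lam * np.sign(a)   # LMS step plus "zero attractor"

err = np.max(np.abs(a - a_star))
```

The attractor term pulls the inactive coefficients toward zero, at the price of a small shrinkage bias (of the order of λ) on the active ones.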
The Set-Theoretic Estimation Approach
The main concept
A descendant of POCS

Projection onto a Closed Convex Set
Let C be a closed convex set in R^L. Then, for each a ∈ R^L, there exists a unique a_* ∈ C such that

‖a − a_*‖ = min_{g∈C} ‖a − g‖.
Metric Projection Mapping
The metric projection is the mapping P_C : R^L → C : a ↦ P_C(a) := a_*.
Relaxed Projection Mapping
The relaxed projection is the mapping T_C(a) := a + μ(P_C(a) − a), μ ∈ (0, 2), ∀a ∈ R^L.
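Both mappings for the simplest nontrivial C, a closed ball (a sketch; the helper names are mine):

```python
import numpy as np

def project_ball(a, center, radius):
    """Metric projection P_C onto the closed ball C = {g : ||g - center|| <= radius}."""
    d = a - center
    n = np.linalg.norm(d)
    return a.copy() if n <= radius else center + radius * d / n

def relaxed_projection(a, center, radius, mu=1.5):
    """T_C(a) = a + mu * (P_C(a) - a), mu in (0, 2); mu = 1 recovers P_C itself."""
    return a + mu * (project_ball(a, center, radius) - a)

a = np.array([3.0, 4.0])                       # ||a|| = 5, outside the unit ball
p = project_ball(a, np.zeros(2), 1.0)          # lands on the unit sphere
t = relaxed_projection(a, np.zeros(2), 1.0)    # overshoots past P_C(a), since mu > 1
```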
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman ’65], [Gubin, Polyak, Raik ’67]

Given a finite number of closed convex sets C_1, …, C_q, with ⋂_{i=1}^q C_i ≠ ∅, let their associated projection mappings be P_{C_1}, …, P_{C_q}. For any a ∈ R^L, define the sequence of cyclic projections (shown here for q = 2):

· · · P_{C_2} P_{C_1} P_{C_2} P_{C_1}(a).
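The cyclic scheme for q = 2 can be sketched with two hyperplanes in R^2 (an illustration of the classical result; the sets and starting point are my own): the iterates converge to a point of the intersection.

```python
import numpy as np

def project_hyperplane(a, w, b):
    """Metric projection onto the closed convex set {g : w^T g = b}."""
    return a - (w @ a - b) / (w @ w) * w

w1, b1 = np.array([1.0, 0.0]), 1.0   # C1: x = 1
w2, b2 = np.array([1.0, 1.0]), 3.0   # C2: x + y = 3; the intersection is the point (1, 2)

a = np.array([5.0, -4.0])
for _ in range(100):                 # ... P_C2 P_C1 P_C2 P_C1 (a)
    a = project_hyperplane(a, w1, b1)
    a = project_hyperplane(a, w2, b2)
dist = np.linalg.norm(a - np.array([1.0, 2.0]))
```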
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman’65], [Gubin, Polyak, Raik ’67]
Given a finite number of closed convex sets C1, . . . , Cq, with⋂qi=1 Ci 6= ∅, let
their associated projection mappings be PC1, . . . , PCq
. For any a ∈ RL, definethe sequence of projections:
PC1(a).
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman’65], [Gubin, Polyak, Raik ’67]
Given a finite number of closed convex sets C1, . . . , Cq, with⋂qi=1 Ci 6= ∅, let
their associated projection mappings be PC1, . . . , PCq
. For any a ∈ RL, definethe sequence of projections:
PC2PC1
(a).
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman’65], [Gubin, Polyak, Raik ’67]
Given a finite number of closed convex sets C1, . . . , Cq, with⋂qi=1 Ci 6= ∅, let
their associated projection mappings be PC1, . . . , PCq
. For any a ∈ RL, definethe sequence of projections:
PC1PC2
PC1(a).
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman’65], [Gubin, Polyak, Raik ’67]
Given a finite number of closed convex sets C1, . . . , Cq, with⋂qi=1 Ci 6= ∅, let
their associated projection mappings be PC1, . . . , PCq
. For any a ∈ RL, definethe sequence of projections:
PC2PC1
PC2PC1
(a).
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
The Set-Theoretic Estimation Approach
The POCS: Finite number of Convex Sets [Von Neumann ’33], [Bregman’65], [Gubin, Polyak, Raik ’67]
Given a finite number of closed convex sets C1, . . . , Cq, with⋂qi=1 Ci 6= ∅, let
their associated projection mappings be PC1, . . . , PCq
. For any a ∈ RL, definethe sequence of projections:
· · ·PC2PC1
PC2PC1
(a).
Sergios Theodoridis, University of Athens. An Online Learning View via a Projections’ Path in the Sparse-land, 20/58
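As a concrete illustration of the cyclic POCS iteration, the sketch below alternates projections between two halfspaces in $\mathbb{R}^2$. The particular sets and starting point are arbitrary choices for the demo, not taken from the slides:

```python
import numpy as np

# Projection onto the halfspace {a : c^T a <= b}, a closed convex set.
def project_halfspace(a, c, b):
    viol = c @ a - b
    if viol <= 0:
        return a  # already inside the set
    return a - (viol / (c @ c)) * c

# Cyclic POCS: repeatedly apply P_{C2} P_{C1} starting from a.
def pocs(a, sets, iters=200):
    for _ in range(iters):
        for (c, b) in sets:
            a = project_halfspace(a, c, b)
    return a

# Two halfspaces in R^2 with nonempty intersection.
C1 = (np.array([1.0, 1.0]), 1.0)    # x + y <= 1
C2 = (np.array([1.0, -1.0]), 0.0)   # x - y <= 0
a_star = pocs(np.array([5.0, -3.0]), [C1, C2])

# The limit lies in both sets (here it is [0.5, 0.5]).
assert C1[0] @ a_star <= C1[1] + 1e-8
assert C2[0] @ a_star <= C2[1] + 1e-8
```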
The Set-Theoretic Estimation Approach

Convex Combination of Projection Mappings [Pierra '84]

Given a finite number of closed convex sets $C_1, \ldots, C_q$, with $\bigcap_{i=1}^q C_i \neq \emptyset$, let their associated projection mappings be $P_{C_1}, \ldots, P_{C_q}$. Let also a set of positive constants $w_1, \ldots, w_q$ be such that $\sum_{i=1}^q w_i = 1$. Then, for any $a_0$, the sequence

$$a_{n+1} = a_n + \mu_n \Big( \underbrace{\sum_{i=1}^q w_i P_{C_i}(a_n)}_{\text{convex combination of projections}} - a_n \Big), \quad \forall n,$$

converges weakly to a point $a_*$ in $\bigcap_{i=1}^q C_i$, where $\mu_n \in (\varepsilon, M_n)$, for $\varepsilon \in (0, 1)$, and

$$M_n := \frac{\sum_{i=1}^q w_i \| P_{C_i}(a_n) - a_n \|^2}{\| \sum_{i=1}^q w_i P_{C_i}(a_n) - a_n \|^2}.$$
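The parallel (Pierra-style) scheme can be sketched as follows, again with two illustrative halfspaces and uniform weights. Taking $\mu_n$ at the upper end of $(\varepsilon, M_n)$ is a design choice for the demo:

```python
import numpy as np

# Projection onto the halfspace {a : c^T a <= b}.
def project_halfspace(a, c, b):
    viol = c @ a - b
    return a if viol <= 0 else a - (viol / (c @ c)) * c

# Pierra-style iteration: step toward the convex combination of projections,
# with the extrapolated step size M_n from the slide.
def parallel_projections(a, sets, weights, iters=100):
    for _ in range(iters):
        projs = [project_halfspace(a, c, b) for (c, b) in sets]
        combo = sum(w * p for w, p in zip(weights, projs))
        den = float(np.sum((combo - a) ** 2))
        if den < 1e-16:
            break  # a is (numerically) a fixed point
        num = sum(w * float(np.sum((p - a) ** 2)) for w, p in zip(weights, projs))
        a = a + (num / den) * (combo - a)  # mu_n taken at the bound M_n
    return a

C1 = (np.array([1.0, 1.0]), 1.0)    # x + y <= 1
C2 = (np.array([1.0, -1.0]), 0.0)   # x - y <= 0
a_star = parallel_projections(np.array([5.0, -3.0]), [C1, C2], [0.5, 0.5])
```

Note that $M_n \ge 1$ always holds (by convexity of the squared norm), so the extrapolated step is at least as long as the plain averaged-projection step.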
Set-Theoretic Estimation: The Online Case Approach

Constructing the Convex Sets

For each received set of measurements (training pairs) $(u_n, y_n)$, construct a hyperslab:

$$S_n[\varepsilon] := \{ a \in \mathbb{R}^L : |u_n^T a - y_n| \le \varepsilon \}.$$

Solution

Find a point in the intersection of all the hyperslabs.

[Yamada 2001], [Yamada, Slavakis, Yamada 2002], [Yamada, Ogura 2004], [Slavakis, Yamada, Ogura 2006].

[Chouvardas, Slavakis, Theodoridis, Yamada, 2013]: under the assumption of bounded noise, the scheme converges with probability 1 arbitrarily close to the true model.
Adaptive Projection Subgradient Method (APSM)

The Algorithm

$$a_{n+1} := a_n + \mu_n \sum_{i=n-q+1}^{n} \omega_i^{(n)} \big( P_{S_i[\varepsilon]}(a_n) - a_n \big).$$

Projection onto a Hyperslab

$$P_{S_n[\varepsilon]}(a) = a + \begin{cases} \frac{y_n - \varepsilon - u_n^T a}{\|u_n\|^2}\, u_n, & \text{if } u_n^T a < y_n - \varepsilon, \\ 0, & \text{if } |u_n^T a - y_n| \le \varepsilon, \\ \frac{y_n + \varepsilon - u_n^T a}{\|u_n\|^2}\, u_n, & \text{if } u_n^T a > y_n + \varepsilon. \end{cases}$$
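The three-branch hyperslab projection and one APSM update can be sketched as below; the uniform weights $\omega_i^{(n)} = 1/q$ are an illustrative choice, not prescribed by the slide:

```python
import numpy as np

# Projection onto the hyperslab S_n[eps] = {a : |u^T a - y| <= eps},
# implementing the three-branch formula above.
def project_hyperslab(a, u, y, eps):
    r = u @ a
    if r < y - eps:
        return a + ((y - eps - r) / (u @ u)) * u
    if r > y + eps:
        return a + ((y + eps - r) / (u @ u)) * u
    return a  # already inside the slab

# One APSM update over the q most recent hyperslabs (uniform weights here).
def apsm_step(a, U, yvec, eps, mu=1.0):
    projs = [project_hyperslab(a, u, y, eps) for u, y in zip(U, yvec)]
    return a + mu * (np.mean(projs, axis=0) - a)
```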
Adaptive Projection Subgradient Method (APSM)
Geometric illustration example
APSM under the ℓ1-ball constraint

The ℓ1-ball case

Given $(u_n, y_n)$, $n = 0, 1, 2, \ldots$, find $a$ such that

$$|a^T u_n - y_n| \le \varepsilon, \quad n = 0, 1, 2, \ldots, \qquad \|a\|_1 \le \delta.$$

The recursion

$$a_{n+1} := P_{B_{\ell_1}[\delta]} \Big( a_n + \mu_n \Big( \sum_{j=n-q+1}^{n} \omega_j^{(n)} P_{S_j[\varepsilon]}(a_n) - a_n \Big) \Big)$$

converges to

$$a_* \in B_{\ell_1}[\delta] \cap \Big( \bigcap_{n \ge n_0} S_n[\varepsilon] \Big).$$
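The recursion needs the Euclidean projection onto the ℓ1 ball $B_{\ell_1}[\delta]$. A standard way to compute it is the sort-based method of Duchi et al. (2008), sketched here:

```python
import numpy as np

# Euclidean projection onto the l1 ball {a : ||a||_1 <= delta}
# (sort-based method of Duchi et al., 2008).
def project_l1_ball(a, delta):
    if np.abs(a).sum() <= delta:
        return a.copy()               # already inside the ball
    mu = np.sort(np.abs(a))[::-1]     # magnitudes, descending
    cs = np.cumsum(mu)
    j = np.arange(1, len(a) + 1)
    rho = np.max(j[mu - (cs - delta) / j > 0])
    theta = (cs[rho - 1] - delta) / rho
    # Soft-threshold by theta; the result has l1 norm exactly delta.
    return np.sign(a) * np.maximum(np.abs(a) - theta, 0.0)
```

For example, projecting $[3, -1, 0.5]$ onto the ball of radius $\delta = 2$ yields $[2, 0, 0]$, i.e., a soft thresholding by the data-dependent level $\theta = 1$.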
APSM under the ℓ1-ball constraint

Geometric illustration example (figure).
APSM under the weighted ℓ1-ball constraint

The weighted ℓ1-ball case

• Convergence can be significantly sped up if the ℓ1 ball is replaced by the weighted ℓ1 ball.
• Definition:
  $$\|a\|_{1,w} := \sum_{i=1}^{L} w_i |a_i|.$$
• Time-adaptive weighted norm:
  $$w_{n,i} := \frac{1}{|a_{n,i}| + \varepsilon'_n}.$$
• This is a time-varying constraint case.
• The recursion:
  $$a_{n+1} := P_{B_{\ell_1}[w_n, \delta]} \Big( a_n + \mu_n \Big( \sum_{j=n-q+1}^{n} \omega_j^{(n)} P_{S_j[\varepsilon]}(a_n) - a_n \Big) \Big).$$
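The effect of the time-adaptive weights is easy to see numerically: coefficients that are currently near zero receive large weights, so the weighted ℓ1 ball pushes them harder toward zero, while large (active) coefficients are penalized only lightly. A minimal sketch, with an arbitrary example vector:

```python
import numpy as np

# Time-adaptive weights w_{n,i} = 1 / (|a_{n,i}| + eps'), as on the slide.
def adaptive_weights(a_n, eps_prime=1e-2):
    return 1.0 / (np.abs(a_n) + eps_prime)

a_n = np.array([2.0, 0.0, -0.5])
w = adaptive_weights(a_n)
# The zero coefficient receives the largest weight; the largest
# coefficient receives the smallest one.
assert w[1] == w.max()
assert w[0] == w.min()
```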
APSM under the weighted ℓ1-ball constraint

Geometric illustration example (figure).
APSM under the weighted ℓ1-ball constraint

Convergence of the Scheme

• Does this scheme converge? Note that our constraint, the weighted ℓ1 ball, is a time-varying constraint.
  Remark: This case was not covered by the existing theory.
Simulation Examples

Example: Time-invariant signal sparse in the wavelet domain

$L := 1024$, $\|a_*\|_0 := 100$ wavelet coefficients. The radius of the ℓ1 ball is set to $\delta := 101$.

Example: Time-varying signal compressible in the wavelet domain

$L := 4096$. The signal is the sum of two chirp signals.

Example: Time-varying signal compressible in the wavelet domain

$L := 4096$. The radius of the ℓ1 ball is set to $\delta := 40$.

Movies of the OCCD and the APWL1sub.
Generalized Thresholding Rules

Thresholding rules associated with non-convex penalty functions

• Penalized LS thresholding operators:
  $$\min_a \tfrac{1}{2}(a - \bar{a})^2 + \lambda p(|a|).$$
• $p(\cdot)$: a nonnegative, nondecreasing and differentiable function on $(0, \infty)$.
• Under some general conditions, the problem has a unique solution [Antoniadis 2007].
• The penalized LS thresholding operator basically defines a mapping
  $$\bar{a} \mapsto \arg\min_a \tfrac{1}{2}(a - \bar{a})^2 + \lambda p(|a|),$$
  which corresponds to a shrinkage operator.
Generalized Thresholding Rules

Examples: Penalty functions

• $p(|a|) := |a|^\gamma, \ \forall a \in \mathbb{R}$
• $p(|a|) := \lambda \big( 1 - e^{-\beta |a|} \big)$
• $p(|a|) := \frac{\lambda}{\log(\gamma + 1)} \log(\gamma |a| + 1), \ \forall a \in \mathbb{R}$

Examples: Penalized Least-Squares Thresholding Operators (figure).
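The thresholding map can be evaluated numerically for any penalty by a dense grid search over the scalar argmin; this is only a sketch (closed forms exist for several penalties). For $p(|a|) = |a|$ (i.e., $\gamma = 1$ in the first example), the operator reduces to the familiar soft thresholding, which the code checks:

```python
import numpy as np

# Numerically evaluate the penalized-LS thresholding map
#   abar -> argmin_a 0.5*(a - abar)^2 + lam * p(|a|)
# by searching a dense grid (illustrative only; grid spacing ~1e-5 here).
def plst(abar, p, lam):
    grid = np.linspace(-abs(abar) - 1.0, abs(abar) + 1.0, 200001)
    cost = 0.5 * (grid - abar) ** 2 + lam * p(np.abs(grid))
    return grid[np.argmin(cost)]

lam = 0.5
soft = lambda t: np.sign(t) * max(abs(t) - lam, 0.0)  # closed form for p(|a|) = |a|

# For p(|a|) = |a| the numerical operator matches soft thresholding.
for abar in (-2.0, 0.3, 1.7):
    assert abs(plst(abar, lambda t: t, lam) - soft(abar)) < 1e-3
```

The same `plst` routine can be applied to the exponential and logarithmic penalties above to visualize their (non-convex) shrinkage curves.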
Generalized Thresholding Rules

Generalized Thresholding (GT) Operator: Definition

For any $a \in \mathbb{R}^L$, $z := T_{GT}^{(K)}(a)$ is obtained coordinate-wise:

$$\forall l \in \{1, \ldots, L\}, \quad z_l := \begin{cases} a_l, & \text{if } a_l \text{ is one of the } K \text{ largest (in magnitude) components}, \\ \mathrm{shr}(a_l), & \text{otherwise}. \end{cases}$$

Shrinkage Function (shr)

• $\tau \, \mathrm{shr}(\tau) \ge 0, \ \forall \tau \in \mathbb{R}$.
• shr acts as a strict shrinkage operator over all intervals which do not include 0.
• Any arbitrary function in line with the properties above can be used.
• All the penalized least-squares thresholding operators are included.

In words

• Keep the $K$ largest components of the estimate.
• The rest are shrunk according to the shrinkage rule.

Examples: Generalized Thresholding (GT) operator (figure).
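The GT operator is straightforward to sketch in code; soft thresholding is used here as one admissible shrinkage rule (any function with the properties above would do):

```python
import numpy as np

# Generalized Thresholding: keep the K largest-magnitude components unchanged,
# shrink the rest with a user-supplied shrinkage rule shr.
def gt_operator(a, K, shr):
    z = shr(a.copy())
    keep = np.argsort(np.abs(a))[-K:]   # indices of the K largest magnitudes
    z[keep] = a[keep]                   # the K largest pass through untouched
    return z

soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.5, 0.0)
a = np.array([3.0, -0.2, 1.5, 0.1])
z = gt_operator(a, K=2, shr=soft)
# The two largest entries (3.0 and 1.5) are kept; -0.2 and 0.1 shrink to 0,
# so z = [3.0, 0.0, 1.5, 0.0].
```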
Adaptive Projection-Based Algorithm with Generalized Thresholding (APGT)

The Algorithm

$$a_{n+1} := T_n \Big( a_n + \mu_n \sum_{i=n-q+1}^{n} \omega_i^{(n)} \big( P_{S_i[\varepsilon]}(a_n) - a_n \big) \Big).$$

• Each piece of a-priori information is also represented by a set.
• $T_n$: the generalized thresholding operator.
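Putting the pieces together, one APGT step is an averaged hyperslab-projection update followed by generalized thresholding. The sketch below runs it on a tiny synthetic noiseless sparse-identification problem; the uniform weights, soft-threshold shrinkage, and data are illustrative choices:

```python
import numpy as np

def project_hyperslab(a, u, y, eps):
    r = u @ a
    if r < y - eps:
        return a + ((y - eps - r) / (u @ u)) * u
    if r > y + eps:
        return a + ((y + eps - r) / (u @ u)) * u
    return a

def gt(a, K, shr):
    z = shr(a.copy())
    keep = np.argsort(np.abs(a))[-K:]   # K largest-magnitude entries pass through
    z[keep] = a[keep]
    return z

# One APGT step: relaxed combination of projections onto the q most recent
# hyperslabs, then the thresholding operator T_n.
def apgt_step(a, U, yvec, eps, K, shr, mu=1.0):
    projs = [project_hyperslab(a, u, y, eps) for u, y in zip(U, yvec)]
    return gt(a + mu * (np.mean(projs, axis=0) - a), K, shr)

# Tiny noiseless sparse identification run (synthetic, for illustration only).
rng = np.random.default_rng(0)
a_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])     # 2-sparse target
U = rng.standard_normal((200, 5))
y = U @ a_true
soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - 0.05, 0.0)

a, q = np.zeros(5), 5
for n in range(q, 200):
    a = apgt_step(a, U[n - q:n], y[n - q:n], eps=1e-3, K=2, shr=soft)
```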
Adaptive Projection-Based Algorithm with GT (APGT)

Convergence of APGT

• Partially quasi-nonexpansive mapping:
  $$\forall x \in \mathbb{R}^L, \ \exists Y_x \subset \mathrm{Fix}(T) : \ \forall y \in Y_x, \ \|T(x) - y\| \le \|x - y\|.$$
• The fixed point set of GT is a union of subspaces (non-convex).

Examples: Union of subspaces for $s = 2$ (figure).
Adaptive Projection-Based Algorithm with GT (APGT)

Convergence of APGT

• Partially quasi-nonexpansive mapping.
• The fixed point set of GT is a union of subspaces (non-convex).
• It has been shown [Slavakis, Kopsinis, Theodoridis, McLaughlin, 2013] that:
  the algorithm leads to a sequence of estimates $(a_n)_{n \in \mathbb{Z}_{\ge 0}}$ whose set of cluster points is nonempty;
  each one of the cluster points is guaranteed to be, at most, $s$-sparse;
  the solution is located arbitrarily close to an intersection of an infinite number of hyperslabs.
Simulation Examples
Example: Time-varying case exhibiting an abrupt change
L := 1024, s = 100 (up to n = 1500, and s = 110 afterwards)
Per-iteration complexity — APGT: O(qL + qK); OSCD: O(L²); IPAPA: O(q³).
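The time-varying setup of this example can be reproduced with a sketch like the following (hypothetical data generation under the stated dimensions; the Gaussian nonzero model and all names are assumptions, not taken from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_change = 1024, 1500  # system length; abrupt-change instant

def random_sparse_system(s):
    """An s-sparse vector of length L with Gaussian nonzero entries."""
    a = np.zeros(L)
    support = rng.choice(L, size=s, replace=False)
    a[support] = rng.standard_normal(s)
    return a

# Abrupt change at n = 1500: the support size jumps from s = 100 to s = 110.
a_before, a_after = random_sparse_system(100), random_sparse_system(110)
unknown_system = lambda n: a_before if n < n_change else a_after
```

An online algorithm tracking `unknown_system(n)` must both re-estimate the nonzero values and detect the enlarged support after the change, which is what the reconvergence behavior in the figure illustrates.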
Simulation Examples
Example: Sparse system identification with colored input
L := 600, s = 60, AR input (condition number ≈ 100).
Bibliography
• I. Yamada and N. Ogura, "Adaptive projected subgradient method for asymptotic minimization of sequence of nonnegative convex functions," Numerical Functional Analysis and Optimization, vol. 25, no. 7&8, 2004.
• K. Slavakis, I. Yamada, and N. Ogura, "The adaptive projected subgradient method over the fixed point set of strongly attracting nonexpansive mappings," Numerical Functional Analysis and Optimization, vol. 27, no. 7&8, Nov. 2006.
• S. Theodoridis, K. Slavakis, and I. Yamada, "Adaptive learning in a world of projections: a unifying framework for linear and nonlinear classification and regression tasks," IEEE Signal Processing Magazine, vol. 28, Jan. 2011.
• K. Slavakis and I. Yamada, "The adaptive projected subgradient method constrained by families of quasi-nonexpansive mappings and its application to online learning," SIAM Journal on Optimization, vol. 23, no. 1, 2013.
• Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Online sparse system identification and signal reconstruction using projections onto weighted ℓ1 balls," IEEE Trans. Signal Processing, vol. 59, Mar. 2011.
• S. Chouvardas, K. Slavakis, and S. Theodoridis, "Adaptive robust distributed learning in diffusion sensor networks," IEEE Trans. Signal Processing, vol. 59, Oct. 2011.
• S. Chouvardas, K. Slavakis, Y. Kopsinis, and S. Theodoridis, "A sparsity promoting adaptive algorithm for distributed learning," IEEE Trans. Signal Processing, vol. 60, Oct. 2012.
• K. Slavakis, Y. Kopsinis, S. Theodoridis, and S. McLaughlin, "Generalized thresholding and online sparsity-aware learning in a union of subspaces," accepted for publication in IEEE Trans. Signal Processing, 2013.
• S. Chouvardas, K. Slavakis, S. Theodoridis, and I. Yamada, "Stochastic analysis of hyperslab-based adaptive projected subgradient method under bounded noise," IEEE Signal Processing Letters, vol. 20, 2013.
• Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Thresholding-based online algorithms of complexity comparable to sparse LMS methods," under preparation (a part submitted to ISCAS 2013).
• D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: where RLS meets the ℓ1-norm," IEEE Trans. Signal Processing, vol. 58, Jul. 2010.
• E. M. Eksioglu and A. K. Tanc, "RLS algorithm with convex regularization," IEEE Signal Processing Letters, vol. 18, Aug. 2011.
• Y. Chen, Y. Gu, and A. O. Hero, "Sparse LMS for system identification," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009.
• X. Wang and Y. Gu, "Proof of convergence and performance analysis for sparse recovery via zero-point attracting projection," IEEE Trans. Signal Processing, vol. 60, Aug. 2012.
• G. Mileounis, B. Babadi, N. Kalouptsidis, and V. Tarokh, "An adaptive greedy algorithm with application to nonlinear communications," IEEE Trans. Signal Processing, vol. 58, Jun. 2010.
• P. Di Lorenzo and A. H. Sayed, "Sparse distributed learning based on diffusion adaptation," IEEE Trans. Signal Processing, vol. 61, Mar. 2013.
“Machine Learning: A Bayesian and Optimization Perspective”
by
Sergios Theodoridis
Academic Press, 2015
1050 pages
Thank you for your patience...
I hope that there are NO QUESTIONS !!!!!!!!!