
On the Versatility of the Nesterov Acceleration Scheme

Zaid Harchaoui

Courant Institute, NYU

Optimization without Borders, Les Houches


Tribute to Y. Nesterov’s teaching and research

Book

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.


Collaborators

Hongzhou Lin, Julien Mairal

Publication

H. Lin, J. Mairal and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. Adv. NIPS 2015.


Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

\[
\min_{x \in \mathbb{R}^p} \left\{ F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\},
\]

where each f_i is smooth and convex and ψ is a convex but not necessarily differentiable penalty.

Goal of this work

Design accelerated methods for minimizing large finite sums.

Give a generic acceleration scheme which can apply to previously un-accelerated algorithms.


Why do large finite sums matter?

Empirical risk minimization

\[
\min_{x \in \mathbb{R}^p} \left\{ F(x) \triangleq \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\}.
\]

Typically, x represents model parameters.

Each function fi measures the fidelity of x to a data point.

ψ is a regularization function to prevent overfitting.

For instance, given training data (y_i, z_i)_{i=1,...,n} with features z_i in R^p and labels y_i in {−1,+1}, we may want to predict y_i by sign(⟨z_i, x⟩). Each function f_i measures how far the prediction is from the true label.

This would be a classification problem with a linear model.
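To make this concrete, here is a minimal NumPy sketch of such a finite-sum objective with a logistic loss and an ℓ2 penalty (the function names and the regularization parameter `lam` are illustrative choices, not part of the talk):

```python
import numpy as np

def f_i(x, z_i, y_i):
    # Data-fidelity term of one example: logistic loss log(1 + exp(-y_i <z_i, x>))
    return np.log1p(np.exp(-y_i * np.dot(z_i, x)))

def F(x, Z, y, lam):
    # F(x) = (1/n) sum_i f_i(x) + psi(x), with psi(x) = (lam/2) ||x||_2^2 here
    n = Z.shape[0]
    data_fit = sum(f_i(x, Z[i], y[i]) for i in range(n)) / n
    return data_fit + 0.5 * lam * np.dot(x, x)
```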


Why do large finite sums matter?

A few examples

Ridge regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(y_i - \langle x, z_i \rangle\right)^2 + \frac{\lambda}{2}\|x\|_2^2.
\]

Linear SVM:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i \langle x, z_i \rangle\right) + \frac{\lambda}{2}\|x\|_2^2.
\]

Logistic regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x, z_i \rangle}\right) + \frac{\lambda}{2}\|x\|_2^2.
\]


Why does the composite problem matter?

A few examples

Ridge regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(y_i - \langle x, z_i \rangle\right)^2 + \frac{\lambda}{2}\|x\|_2^2.
\]

Linear SVM:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i \langle x, z_i \rangle\right)^2 + \frac{\lambda}{2}\|x\|_2^2.
\]

Logistic regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x, z_i \rangle}\right) + \frac{\lambda}{2}\|x\|_2^2.
\]

The squared ℓ2-norm penalizes large entries in x.


Why does the composite problem matter?

A few examples

Ridge regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2}\left(y_i - \langle x, z_i \rangle\right)^2 + \lambda\|x\|_1.
\]

Linear SVM:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \max\left(0, 1 - y_i \langle x, z_i \rangle\right)^2 + \lambda\|x\|_1.
\]

Logistic regression:
\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \langle x, z_i \rangle}\right) + \lambda\|x\|_1.
\]

When one knows in advance that x should be sparse, one should use a sparsity-inducing regularization such as the ℓ1-norm.

[Chen et al., 1999, Tibshirani, 1996].


How to minimize a large finite-sum composite problem?

Two major challenges

Non-differentiable regularization penalty.

This excludes existing solvers such as MOSEK, CPLEX, etc.

Large scale and high dimensionality.

This excludes higher-order (Newton) methods.

This leads us to first-order gradient-based methods.


Gradient descent methods

Let us consider the composite problem

\[
\min_{x \in \mathbb{R}^p} f(x) + \psi(x),
\]

where f is convex and differentiable with an L-Lipschitz continuous gradient, and ψ is convex but not necessarily differentiable.

The classical forward-backward/ISTA algorithm

\[
x_k \leftarrow \arg\min_{x \in \mathbb{R}^p} \frac{1}{2}\left\| x - \left( x_{k-1} - \frac{1}{L}\nabla f(x_{k-1}) \right) \right\|_2^2 + \frac{1}{L}\psi(x).
\]

f(x_k) − f^⋆ = O(1/k) for convex problems;

f(x_k) − f^⋆ = O((1 − µ/L)^k) for µ-strongly convex problems.

[Nowak and Figueiredo, 2001, Daubechies et al., 2004, Combettes and Wajs, 2006, Beck and Teboulle, 2009, Wright et al., 2009, Nesterov, 2013]...
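For illustration, a minimal NumPy sketch of this forward-backward iteration for ψ(x) = λ‖x‖₁, in which the proximal step reduces to soft-thresholding (the `grad_f` callback and the iteration count are assumptions of the sketch, not from the slides):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (component-wise shrinkage)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(grad_f, x0, L, lam, iters=100):
    # Forward-backward splitting: gradient step on f, proximal step on psi = lam * ||.||_1
    x = x0.copy()
    for _ in range(iters):
        x = soft_threshold(x - grad_f(x) / L, lam / L)
    return x
```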


Accelerated gradient descent methods

Nesterov introduced in 1983 an acceleration scheme for the gradient descent algorithm. It was generalized later to the composite setting [Nesterov, 1983, 2004, 2013].

FISTA [Beck and Teboulle, 2009]

\[
x_k \leftarrow \arg\min_{x \in \mathbb{R}^p} \frac{1}{2}\left\| x - \left( y_{k-1} - \frac{1}{L}\nabla f(y_{k-1}) \right) \right\|_2^2 + \frac{1}{L}\psi(x);
\]

Find α_k > 0 such that
\[
\alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + \frac{\mu}{L}\alpha_k;
\]

\[
y_k \leftarrow x_k + \beta_k(x_k - x_{k-1}) \quad \text{with} \quad \beta_k = \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}.
\]

f(x_k) − f^⋆ = O(1/k²) for convex problems;

f(x_k) − f^⋆ = O((1 − √(µ/L))^k) for µ-strongly convex problems.

Acceleration works in many practical cases.

see also [Nesterov, 1983, 2004, 2013]

What do we mean by “acceleration”?

Complexity analysis for large finite sums

Since f is a sum of n functions, computing ∇f requires computing n gradients ∇f_i. The complexity to reach an ε-solution is given below.

        | µ > 0                   | µ = 0
ISTA    | O(n (L/µ) log(1/ε))     | O(nL/ε)
FISTA   | O(n √(L/µ) log(1/ε))    | O(nL/√ε)

Remarks

An ε-solution means here f(x_k) − f^⋆ ≤ ε. For n = 1, the rates of FISTA are optimal for a "first-order local black box" [Nesterov, 2004].

For n > 1, the sum structure of f is not exploited.


Can we do better for large finite sums?

Several randomized algorithms are designed with one ∇f_i computed per iteration, which yields a better expected computational complexity.

                                      | µ > 0
FISTA                                 | O(n √(L/µ) log(1/ε))
SVRG, SAG, SAGA, SDCA, MISO, Finito   | O(max(n, L/µ) log(1/ε))

SVRG, SAG, SAGA, SDCA, MISO, Finito improve upon FISTA when

\[
\max\left(n, \frac{L}{\mu}\right) \le n\sqrt{\frac{L}{\mu}} \quad \Longleftrightarrow \quad \sqrt{\frac{L}{\mu}} \le n,
\]

but they are not “accelerated” in the sense of Nesterov.

[Schmidt et al., 2013, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]
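A quick back-of-the-envelope comparison of the two complexities, with illustrative numbers that are not from the talk:

```python
# Gradient evaluations per factor log(1/eps), up to constants.
n, cond = 100_000, 1_000_000          # assumed: n data points, condition number L/mu
fista_cost = n * cond ** 0.5          # ~ n * sqrt(L/mu)
incremental_cost = max(n, cond)       # ~ max(n, L/mu) for SVRG, SAG, SAGA, SDCA, MISO, Finito
print(fista_cost, incremental_cost)   # incremental methods win whenever sqrt(L/mu) <= n
```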


Can we do even better for large finite sums?

Without vs with acceleration

                                      | µ > 0
FISTA                                 | O(n √(L/µ) log(1/ε))
SVRG, SAG, SAGA, SDCA, MISO, Finito   | O(max(n, L/µ) log(1/ε))
Acc-SDCA                              | O(max(n, √(n L/µ)) log(1/ε))

Acc-SDCA is due to Shalev-Shwartz and Zhang [2014].

Acceleration occurs when n ≤ L/µ.

see [Agarwal and Bottou, 2015] for discussions about optimality.

Challenge: can we accelerate these algorithms by a universal scheme, for both convex and strongly convex objectives?


Catalyst is coming


Main idea

Catalyst, a meta-algorithm

Given an algorithm M that can solve a convex problem "appropriately".

At iteration k, rather than minimizing F, we use M to minimize a function G_k, defined as follows,
\[
G_k(x) \triangleq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2,
\]
up to accuracy ε_k, i.e., such that G_k(x_k) − G_k^⋆ ≤ ε_k.

Then compute the next prox-center y_k using an extrapolation step
\[
y_k = x_k + \beta_k(x_k - x_{k-1}).
\]

The choices of β_k, ε_k, κ are driven by the theoretical analysis.


Catalyst is a wrapper of M that yields an accelerated algorithm A.


Sources of inspiration

In addition to accelerated proximal algorithms [Beck and Teboulle, 2009, Nesterov, 2013], several works have inspired Catalyst.

The inexact accelerated proximal point algorithm of Guler [1992].

Catalyst is a variant of inexact accelerated PPA.

Complexity analysis for the outer loop only, with a non-practical inexactness criterion.

Accelerated SDCA of Shalev-Shwartz and Zhang [2014].

Accelerated SDCA is an instance of inexact accelerated PPA.

Complexity analysis limited to µ-strongly convex objectives.


Other related work

[Frostig et al., 2015, Schmidt et al., 2011, Salzo and Villa, 2012, He and Yuan, 2012, Lan, 2015], and also Chambolle and Pock, 2015.


This work

Contributions

Generic acceleration scheme, which applies to previously unaccelerated algorithms such as SVRG, SAG, SAGA, SDCA, MISO, or Finito, and which is not tailored to finite sums.

Provides explicit support to non-strongly convex objectives.

Complexity analysis for µ-strongly convex objectives.

Complexity analysis for non-strongly convex objectives.

Example of application

Garber and Hazan [2015] have used Catalyst to accelerate new principal component analysis algorithms based on convex optimization.


Appropriate M = linear convergence rate when µ > 0

Linear convergence rate

Consider a strongly convex minimization problem
\[
\min_{z \in \mathbb{R}^p} H(z).
\]
We say that an algorithm M has a linear convergence rate if M generates a sequence of iterates (z_t)_{t∈ℕ} such that there exist τ_{M,H} in (0, 1) and a constant C_{M,H} in ℝ satisfying
\[
H(z_t) - H^\star \le C_{M,H}\,(1 - \tau_{M,H})^t. \tag{1}
\]

τ_{M,H} usually depends on the condition number L/µ, e.g., τ_{M,H} = µ/L for ISTA and τ_{M,H} = √(µ/L) for FISTA.

C_{M,H} usually depends on H(z_0) − H^⋆.


Important message: we do not make any assumption for non-strongly convex objectives. It is possible that M is not even defined for µ = 0.


Catalyst action

Catalyst action

\[
G_k(x) \triangleq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2,
\]

Gk is always strongly convex as long as F is convex.

When F is strongly convex, the condition number of G_k is better than that of F since (L + κ)/(µ + κ) < L/µ.


Minimizing G_k is easier than minimizing F!


If κ ≫ 1, then minimizing G_k is easy;

If κ ≈ 0, then G_k is a good approximation of F.

We will choose κ to optimize the computational complexity.
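The trade-off can be seen on small illustrative numbers (assumed values, not from the talk):

```python
# How the smoothing parameter kappa trades conditioning of G_k against fidelity to F.
L, mu = 100.0, 0.1
for kappa in (0.0, 1.0, 10.0, 100.0):
    cond_F = L / mu
    cond_G = (L + kappa) / (mu + kappa)
    print(f"kappa={kappa:6.1f}   cond(F)={cond_F:9.1f}   cond(G_k)={cond_G:9.2f}")
```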


Convergence analysis

An analysis in two stages

\[
G_k(x) \triangleq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2,
\]

x_k is an approximate minimizer of G_k such that G_k(x_k) − G_k^* ≤ ε_k.

Outer loop: once we obtain the sequence (x_k)_{k∈ℕ}, what can we say about the convergence rate of F(x_k) − F^*? ⇒ Wisely choose (y_k) and control the accumulation of errors.

Inner loop: how much effort do we need to obtain an x_k with accuracy ε_k? ⇒ Wisely choose the starting point.


Choice of (yk)k∈N

Extrapolation

\[
y_k = x_k + \beta_k(x_k - x_{k-1}) \quad \text{with} \quad \beta_k = \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}.
\]

This update is identical to Nesterov's accelerated gradient descent or FISTA.

Unfortunately, the literature does not provide any simple geometric explanation of why it yields an acceleration...

The construction is purely theoretical, relying on a mechanism introduced by Nesterov called the "estimate sequence".
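For reference, a small Python sketch of the recursion generating the (α_k, β_k) pairs used by this extrapolation step (the helper name is illustrative, not from the talk):

```python
import math

def extrapolation_weights(q, alpha0, iters):
    # alpha_k solves alpha^2 = (1 - alpha) * alpha_{k-1}^2 + q * alpha (positive root),
    # beta_k = alpha_{k-1} (1 - alpha_{k-1}) / (alpha_{k-1}^2 + alpha_k), with q = mu/(mu + kappa).
    alphas, betas = [alpha0], []
    for _ in range(iters):
        a_prev = alphas[-1]
        b = a_prev ** 2 - q
        a = (-b + math.sqrt(b ** 2 + 4 * a_prev ** 2)) / 2.0
        alphas.append(a)
        betas.append(a_prev * (1 - a_prev) / (a_prev ** 2 + a))
    return alphas, betas
```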


How does “acceleration” work?

If f is µ-strongly convex and ∇f is L-Lipschitz continuous

[Figure: quadratic upper bound (with constant L) and quadratic lower bound (with constant µ) of f at x_{k−1}, together with the minimizer x^⋆.]

\[
f(x) \le f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{L}{2}\|x - x_{k-1}\|_2^2;
\]
\[
f(x) \ge f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{\mu}{2}\|x - x_{k-1}\|_2^2.
\]


How does “acceleration” work?

If ∇f is L-Lipschitz continuous

[Figure: quadratic upper bound of f at x_{k−1} and its minimizer x_k, compared with the minimizer x^⋆ of f.]

\[
f(x) \le f(x_{k-1}) + \nabla f(x_{k-1})^\top (x - x_{k-1}) + \frac{L}{2}\|x - x_{k-1}\|_2^2;
\]

x_k = x_{k−1} − (1/L)∇f(x_{k−1}) (gradient descent step).


How does “acceleration” work?

Definition of estimate sequence [Nesterov].

A pair of sequences (φ_k)_{k≥0} and (λ_k)_{k≥0}, with φ_k : ℝ^p → ℝ and λ_k ≥ 0, is called an estimate sequence of the function F if

λ_k → 0;

φ_k(x) ≤ (1 − λ_k)F(x) + λ_k φ_0(x), for any k, x;

there exists a sequence (x_k)_{k≥0} such that
\[
F(x_k) \le \varphi_k^\star \triangleq \min_{x \in \mathbb{R}^p} \varphi_k(x).
\]

Remarks

ϕk is neither an upper-bound, nor a lower-bound;

Finding the right estimate sequence is often nontrivial.


Convergence of outer-loop algorithm

Analysis for µ-strongly convex objective functions

Choose α_0 = √q with q = µ/(µ + κ) and
\[
\varepsilon_k = \frac{2}{9}\left(F(x_0) - F^*\right)(1 - \rho)^k \quad \text{with} \quad \rho < \sqrt{q}.
\]

Then, the algorithm generates iterates (x_k)_{k≥0} such that
\[
F(x_k) - F^* \le C(1 - \rho)^{k+1}\left(F(x_0) - F^*\right) \quad \text{with} \quad C = \frac{8}{(\sqrt{q} - \rho)^2}.
\]

In practice

ρ can safely be set to ρ = 0.9√q.

The choice of (ε_k)_{k≥0} typically follows from a duality gap at x_0. When F is non-negative, we can set ε_k = (2/9)F(x_0)(1 − ρ)^k.
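A minimal sketch of this schedule in Python (assuming the initial gap F(x_0) − F^* is known or replaced by a duality-gap estimate as mentioned above):

```python
import math

def catalyst_eps_schedule(mu, kappa, F0_gap, iters):
    # Strongly convex case: rho = 0.9 * sqrt(q) with q = mu/(mu + kappa),
    # and eps_k = (2/9) * (F(x0) - F*) * (1 - rho)^k
    q = mu / (mu + kappa)
    rho = 0.9 * math.sqrt(q)
    return [(2.0 / 9.0) * F0_gap * (1.0 - rho) ** k for k in range(iters)]
```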


Convergence of outer-loop algorithm

Analysis for non-strongly convex objective functions, µ = 0

Choose α_0 = (√5 − 1)/2 and
\[
\varepsilon_k = \frac{2\left(F(x_0) - F^*\right)}{9(k + 2)^{4+\eta}} \quad \text{with} \quad \eta > 0.
\]

Then, the meta-algorithm generates iterates (x_k)_{k≥0} such that
\[
F(x_k) - F^* \le \frac{8}{(k + 2)^2}\left(\left(1 + \frac{2}{\eta}\right)^2\left(F(x_0) - F^*\right) + \frac{\kappa}{2}\|x_0 - x^*\|^2\right). \tag{2}
\]

In practice

η can typically be set to η = 0.1.


How many iterates of M do we need to obtain x_k?

Control of inner-loop complexity

For minimizing G_k, consider a method M generating iterates (z_t)_{t≥0} with a linear convergence rate
\[
G_k(z_t) - G_k^\star \le A(1 - \tau_{\mathcal{M}})^t \left(G_k(z_0) - G_k^\star\right).
\]

Then by choosing z0 = xk−1, the precision εk is reached with at most

A constant number of iterations T_M when µ > 0;

A logarithmically increasing number of iterations T_M log(k + 2) when µ = 0;

where T_M = O(1/τ_M) is independent of k.


Global computational complexity

Analysis for µ-strongly convex objective functions

The global convergence rate of the accelerated algorithm A is

\[
F_s - F^\star \le C\left(1 - \frac{\rho}{T_{\mathcal{M}}}\right)^s \left(F(x_0) - F^*\right), \tag{3}
\]

where F_s is the objective function value obtained after performing s = kT_M iterations of the method M. As a result,
\[
\tau_{\mathcal{A},F} = \frac{\rho}{T_{\mathcal{M}}} = O\!\left(\tau_{\mathcal{M}}\sqrt{\mu}/\sqrt{\mu + \kappa}\right),
\]
where τ_M typically depends on κ (the greater, the faster).

κ will be chosen to maximize the ratio τ_M/√(µ + κ).


e.g., κ = L − 2µ when τ_M = (µ + κ)/(L + κ), which gives τ_A = O(√(µ/L)).


Global computational complexity

Analysis for non-strongly convex objective functions

The global convergence rate of the accelerated algorithm A is

\[
F_s - F^* \le \frac{8 T_{\mathcal{M}}^2 \log^2(s)}{s^2}\left(\left(1 + \frac{2}{\eta}\right)^2\left(F(x_0) - F^*\right) + \frac{\kappa}{2}\|x_0 - x^*\|^2\right).
\]

If M is a first-order method, this rate is near-optimal, up to a logarithmic factor, when compared to the optimal rate O(1/s²), which may be the price to pay for using a generic acceleration scheme.

κ will be chosen to maximize the ratio τ_M/√(L + κ).


Applications

Expected computational complexity in the regime n ≤ L/µ when µ > 0:

                               | µ > 0                  | µ = 0     | Catalyst, µ > 0         | Catalyst, µ = 0
FG                             | O(n (L/µ) log(1/ε))    | O(nL/ε)   | O(n √(L/µ) log(1/ε))    | O(nL/√ε)
SAG, SAGA, Finito/MISO, SDCA   | O((L/µ) log(1/ε))      | NA        | O(√(n L/µ) log(1/ε))    | —
SVRG                           | O((L′/µ) log(1/ε))     | NA        | O(√(n L′/µ) log(1/ε))   | —
Acc-FG                         | O(n √(L/µ) log(1/ε))   | O(nL/√ε)  | no acceleration         | no acceleration
Acc-SDCA                       | O(√(n L/µ) log(1/ε))   | NA        | no acceleration         | no acceleration


Experiments with MISO/SAG/SAGA

ℓ2-logistic regression formulation

Given some data (yi , zi ), with yi in {−1,+1} and zi in Rp, minimize

\[
\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x^\top z_i}\right) + \frac{\mu}{2}\|x\|_2^2,
\]

µ is the regularization parameter and the strong convexity modulus.

Datasets

name   | rcv1     | real-sim | covtype  | ocr        | alpha
n      | 781 265  | 72 309   | 581 012  | 2 500 000  | 250 000
p      | 47 152   | 20 958   | 54       | 1 155      | 500


Experiments with MISO/SAG/SAGA

The complexity analysis is not just a theoretical exercise since it provides the values of κ, ε_k, β_k, which are required in concrete implementations.

Here, theoretical values match practical ones.

Restarting

The theory tells us to restart M with x_{k−1}. For SDCA/Finito/MISO, the theory tells us to use instead x_{k−1} + κ/(µ + κ)(y_{k−1} − y_{k−2}). We also tried this as a heuristic for SAG and SAGA.

One-pass heuristic

Constrain M to always perform at most n iterations in the inner loop; we call this variant AMISO2 for MISO, whereas AMISO1 refers to the regular "vanilla" accelerated variant; the same applies to the accelerated variants of SAG and SAGA.
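A minimal sketch of the restart rule described above (the function name and the `method` switch are illustrative; the iterates are plain Python lists here):

```python
def restart_point(x_prev, y_prev, y_prev2, mu, kappa, method):
    # Warm start for the inner solver M, following the rule stated above.
    if method in ("SDCA", "Finito", "MISO"):
        return [xi + (kappa / (mu + kappa)) * (yi - yi2)
                for xi, yi, yi2 in zip(x_prev, y_prev, y_prev2)]
    return x_prev  # default: restart with x_{k-1}
```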


Experiments without strong convexity, µ = 0

Figure: Objective function value for different numbers of passes over the data (datasets covtype, real-sim, rcv1, alpha, and ocr; µ = 0).

Conclusions

SAG, SAGA are accelerated when they do not perform well already;

AMISO2 ≥ AMISO1 (vanilla), MISO does not apply.


Experiments with strong convexity, µ = 10^{-1}/n

Figure: Relative duality gap (log scale) for different numbers of passes over the data (datasets covtype, real-sim, rcv1, alpha, and ocr; µ/L = 10^{-1}/n; curves for MISO, AMISO1, AMISO2, SAG, ASAG, SAGA, ASAGA).

Conclusions

SAG, SAGA are not always accelerated, but often.

AMISO2, AMISO1 ≫ MISO.


Experiments with strong convexity, µ = 10^{-3}/n

Figure: Relative duality gap (log scale) for different numbers of passes over the data (datasets covtype, real-sim, rcv1, alpha, and ocr; µ/L = 10^{-3}/n).

Conclusions

Same conclusions as for µ = 10^{-1}/n;

µ is so small that (unaccelerated) MISO becomes numerically unstable.


General conclusions about Catalyst

Summary: lots of nice features

Simple acceleration scheme with broad application range.

Recover near-optimal rates for known algorithms.

Effortless implementation.

... but also lots of unsolved problems

Acceleration occurs when n ≤ L/µ; otherwise, the "unaccelerated" complexity O(n log(1/ε)) seems unbeatable.

µ is an estimate of the true strong convexity parameter µ′ ≥ µ. µ is the global strong convexity parameter, not a local one µ⋆ ≥ µ. When n ≤ L/µ but n ≥ L/µ′ (or n ≥ L/µ⋆), a method M that adapts to the unknown strong convexity may be impossible to accelerate.

The optimal restart for M is not yet fully understood.


Happy birthday!


Catalyst, the algorithm

Algorithm 1: Catalyst

input: initial estimate x_0 ∈ ℝ^p, parameters κ and α_0, sequence (ε_k)_{k≥0}, optimization method M; initialize q = µ/(µ + κ) and y_0 = x_0.
1: while the desired stopping criterion is not satisfied do
2:   Find an approximate solution x_k using M such that G_k(x_k) − G_k^⋆ ≤ ε_k:
\[
x_k \approx \arg\min_{x \in \mathbb{R}^p}\left\{ G_k(x) \triangleq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 \right\}
\]
3:   Compute α_k ∈ (0, 1) from the equation α_k² = (1 − α_k)α_{k−1}² + qα_k;
4:   Compute
\[
y_k = x_k + \beta_k(x_k - x_{k-1}) \quad \text{with} \quad \beta_k = \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}.
\]
5: end while
output: x_k (final estimate).
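A compact Python sketch of Algorithm 1, assuming the user supplies an inner solver `inner_solver(objective, z0, eps)` that returns an approximate minimizer of `objective` up to accuracy `eps`, warm-started at `z0`; this interface and the representation of iterates as Python lists are assumptions of the sketch, not part of the algorithm statement:

```python
import math

def catalyst(F, inner_solver, x0, mu, kappa, eps_schedule, outer_iters):
    # Catalyst meta-algorithm (sketch): smooth F with a quadratic around y, solve inexactly, extrapolate.
    q = mu / (mu + kappa)
    alpha = math.sqrt(q) if mu > 0 else (math.sqrt(5) - 1) / 2
    x_prev = x = y = x0
    for k in range(outer_iters):
        # Smoothed subproblem G_k(x) = F(x) + (kappa/2) ||x - y_{k-1}||^2
        G_k = lambda z, y=y: F(z) + 0.5 * kappa * sum((zi - yi) ** 2 for zi, yi in zip(z, y))
        x_prev, x = x, inner_solver(G_k, x, eps_schedule[k])   # warm start at x_{k-1}
        # alpha_k from alpha^2 = (1 - alpha) alpha_{k-1}^2 + q alpha (positive root)
        b = alpha ** 2 - q
        alpha_new = (-b + math.sqrt(b ** 2 + 4 * alpha ** 2)) / 2.0
        beta = alpha * (1 - alpha) / (alpha ** 2 + alpha_new)
        y = [xi + beta * (xi - xpi) for xi, xpi in zip(x, x_prev)]  # extrapolation step
        alpha = alpha_new
    return x
```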


Ideas of the proofs

Main theorem

Let us denote
\[
\lambda_k = \prod_{i=0}^{k-1}(1 - \alpha_i), \tag{4}
\]
where the α_i's are defined in Catalyst. Then, the sequence (x_k)_{k≥0} satisfies
\[
F(x_k) - F^* \le \lambda_k\left(\sqrt{S_k} + 2\sum_{i=1}^{k}\sqrt{\frac{\varepsilon_i}{\lambda_i}}\right)^2, \tag{5}
\]
where F^⋆ is the minimum value of F,
\[
S_k = F(x_0) - F^* + \frac{\gamma_0}{2}\|x_0 - x^*\|^2 + \sum_{i=1}^{k}\frac{\varepsilon_i}{\lambda_i}
\quad \text{where} \quad
\gamma_0 = \frac{\alpha_0\left((\kappa + \mu)\alpha_0 - \mu\right)}{1 - \alpha_0}, \tag{6}
\]
and x^⋆ is a minimizer of F.


Ideas of the proofs

Then, the theorem will be used with the following lemma to control the convergence rate of the sequence (λ_k)_{k≥0}, whose definition follows the classical use of estimate sequences. This will provide us with convergence rates both for the strongly convex and non-strongly convex cases.

Lemma 2.2.4 from Nesterov [2004]

If the quantity γ_0 defined in (6) satisfies γ_0 ≥ µ, then the sequence (λ_k)_{k≥0} from (4) satisfies
\[
\lambda_k \le \min\left((1 - \sqrt{q})^k, \frac{4}{\left(2 + k\sqrt{\frac{\gamma_0}{\kappa + \mu}}\right)^2}\right), \tag{7}
\]
where q ≜ µ/(µ + κ).


Ideas of proofs

Step 1: build an approximate estimate sequence

Remember that in general, we build φ_k from φ_{k−1} as follows
\[
\varphi_k(x) \triangleq (1 - \alpha_k)\varphi_{k-1}(x) + \alpha_k d_k(x),
\]
where d_k is a lower bound.

Here, a natural lower bound would be
\[
F(x) \ge d_k(x) \triangleq F(x_k^*) + \langle \kappa(y_{k-1} - x_k^*), x - x_k^* \rangle + \frac{\mu}{2}\|x - x_k^*\|^2,
\]
where
\[
x_k^\star \triangleq \arg\min_{x \in \mathbb{R}^p}\left\{ G_k(x) \triangleq F(x) + \frac{\kappa}{2}\|x - y_{k-1}\|_2^2 \right\}.
\]

But x_k^⋆ is unknown! Then, use instead d′_k(x) defined as
\[
d'_k(x) = F(x_k) + \langle \kappa(y_{k-1} - x_k), x - x_k \rangle + \frac{\mu}{2}\|x - x_k\|^2.
\]


Ideas of proofs

Step 2: Relax the condition F(x_k) ≤ φ_k^⋆.

We can show that Catalyst generates iterates (x_k)_{k≥0} such that
\[
F(x_k) \le \varphi_k^* + \xi_k,
\]
where the sequence (ξ_k)_{k≥0} is defined by ξ_0 = 0 and
\[
\xi_k = (1 - \alpha_{k-1})\left(\xi_{k-1} + \varepsilon_k - (\kappa + \mu)\langle x_k - x_k^*, x_{k-1} - x_k \rangle\right).
\]

The sequences (α_k)_{k≥0} and (y_k)_{k≥0} are chosen in such a way that all the terms involving y_{k−1} − x_k are cancelled.

We will later control the quantity x_k − x_k^* by strong convexity of G_k:
\[
\frac{\kappa + \mu}{2}\|x_k - x_k^*\|_2^2 \le G_k(x_k) - G_k^\star \le \varepsilon_k.
\]


Ideas of proofs

Step 3: Control how these errors sum up together.

Do cumbersome calculus.


Catalyst in practice

General strategy and application to randomized algorithms

Calculating the iteration-complexity decomposes into three steps:

1. When F is µ-strongly convex, find κ that maximizes the ratio τ_{M,G_k}/√(µ + κ) for the algorithm M. When F is non-strongly convex, maximize instead the ratio τ_{M,G_k}/√(L + κ).

2. Compute the upper bound k_out on the number of outer iterations using the theorems.

3. Compute the upper bound on the expected number of inner iterations,
\[
\max_{k=1,\ldots,k_{\text{out}}} \mathbb{E}\left[T_{\mathcal{M},G_k}(\varepsilon_k)\right] \le k_{\text{in}}.
\]

Then, the expected iteration-complexity, denoted Comp, is given by
\[
\text{Comp} \le k_{\text{in}} \times k_{\text{out}}.
\]


Applications

Deterministic and Randomized Incremental Gradient methods

Stochastic Average Gradient (SAG and SAGA) [Schmidt et al., 2013, Defazio et al., 2014a];

Finito and MISO [Mairal, 2015, Defazio et al., 2014b];

Semi-Stochastic/Mixed Gradient [Konecny et al., 2014, Zhang et al., 2013];

Stochastic Dual Coordinate Ascent [Shalev-Shwartz and Zhang, 2012];

Stochastic Variance Reduced Gradient [Xiao and Zhang, 2014].

But also randomized coordinate descent methods, and more generally first-order methods with linear convergence rates.


Appendix on proximal MISO


Original motivation

Given some data, learn some model parameters x in Rp by minimizing

\[
\min_{x \in \mathbb{R}^p}\left\{ F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x) \right\},
\]

where each fi may be nonsmooth and nonconvex.

The original MISO algorithm is an incremental extension of the majorization-minimization principle [Lange et al., 2000].

Paper

J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization. 2015.

J. Mairal. Optimization with First-Order Surrogate Functions.ICML. 2013.


Majorization-minimization principle

[Figure: the objective f(θ) and a surrogate g(θ) that is a locally tight upper bound at the current iterate.]

Iteratively minimize locally tight upper bounds of the objective.

The objective monotonically decreases.

Under some assumptions, we get similar convergence rates asgradient-based approaches for convex problems.


Incremental optimization: MISO

Algorithm 2: Incremental scheme MISO

input: x_0 ∈ ℝ^p; K (number of iterations).
1: Choose surrogates g_i^0 of f_i near x_0 for all i;
2: for k = 1, ..., K do
3:   Randomly pick one index i_k and choose a surrogate g_{i_k}^k of f_{i_k} near x_{k−1}. Set g_i^k = g_i^{k−1} for i ≠ i_k;
4:   Update the solution:
\[
x_k \in \arg\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} g_i^k(x).
\]
5: end for
output: x_K (final estimate).


Incremental Optimization: MISO

Update rule with basic upper bounds

We want to minimize (1/n)∑_{i=1}^n f_i(x), where the f_i's are smooth.

\[
x_k \leftarrow \arg\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(y_i^k) + \nabla f_i(y_i^k)^\top (x - y_i^k) + \frac{L}{2}\|x - y_i^k\|_2^2 \right]
= \frac{1}{n}\sum_{i=1}^{n} y_i^k - \frac{1}{Ln}\sum_{i=1}^{n} \nabla f_i(y_i^k).
\]

At iteration k, randomly draw one index i_k and update y_{i_k}^k ← x_k.

Remarks

Replacing (1/n)∑_{i=1}^n y_i^k by x_{k−1} yields SAG [Schmidt et al., 2013].

Replacing (1/L) by (1/µ) for strongly convex problems is close to a variant of SDCA [Shalev-Shwartz and Zhang, 2012].
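A rough NumPy sketch of this update with the basic upper bounds (the bookkeeping via running averages and the `grad_fi(i, x)` callback are implementation choices assumed here, not prescribed by the slide):

```python
import numpy as np

def miso_basic(grad_fi, n, x0, L, iters, seed=0):
    # grad_fi(i, x) is assumed to return the gradient of f_i at x.
    rng = np.random.default_rng(seed)
    y = np.tile(x0, (n, 1))                               # anchor points y_i
    g = np.stack([grad_fi(i, x0) for i in range(n)])      # stored gradients at the anchors
    y_avg, g_avg = y.mean(axis=0), g.mean(axis=0)
    x = x0.copy()
    for _ in range(iters):
        x = y_avg - g_avg / L        # x_k = (1/n) sum_i y_i - (1/(L n)) sum_i grad f_i(y_i)
        i = rng.integers(n)          # draw one index at random
        y_avg = y_avg + (x - y[i]) / n       # maintain the running averages
        new_g = grad_fi(i, x)
        g_avg = g_avg + (new_g - g[i]) / n
        y[i], g[i] = x, new_g
    return x
```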


Incremental Optimization: MISOµ.

Update rule with lower bounds???

We want to minimize (1/n)∑_{i=1}^n f_i(x), where the f_i's are smooth.

\[
x_k = \arg\min_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n}\left[ f_i(y_i^k) + \nabla f_i(y_i^k)^\top (x - y_i^k) + \frac{\mu}{2}\|x - y_i^k\|_2^2 \right]
= \frac{1}{n}\sum_{i=1}^{n} y_i^k - \frac{1}{\mu n}\sum_{i=1}^{n} \nabla f_i(y_i^k).
\]

Remarks

Requires strong convexity.

Uses a counter-intuitive minorization-minimization principle.

Close to a variant of SDCA [Shalev-Shwartz and Zhang, 2012].

Much faster than the basic MISO (faster rate).


Incremental Optimization: MISOµ.

In the first part of this presentation, what we have called MISO is the algorithm that uses 1/(µn) step-sizes (sorry for the confusion).

To minimize F(x) ≜ (1/n)∑_{i=1}^n f_i(x), MISOµ has the following guarantees.

Proposition [Mairal, 2015]

When the functions f_i are µ-strongly convex, differentiable with L-Lipschitz gradient, and non-negative, MISOµ satisfies
\[
\mathbb{E}\left[F(x_k) - F^\star\right] \le \left(1 - \frac{1}{3n}\right)^k n f^\star,
\]
under the condition n ≥ 2L/µ.

Remarks

When n ≤ 2L/µ, the algorithm may diverge;

When µ is very small, numerical stability is an issue.

The condition fi ≥ 0 does not really matter.


Proximal MISO [Lin, Mairal, and Harchaoui, 2015]

Main goals

Remove the condition n ≥ 2L/µ;

Allow a composite term ψ:

\[
\min_{x \in \mathbb{R}^p}\left\{ F(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \psi(x) \right\},
\]

Starting points

MISOµ is iteratively updating/minimizing a lower-bound of F

\[
x_k \leftarrow \arg\min_{x \in \mathbb{R}^p}\left\{ D_k(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} d_i^k(x) \right\},
\]

[Lin, Mairal, and Harchaoui, 2015].


Proximal MISO

Adding the proximal term

\[
x_k \leftarrow \arg\min_{x \in \mathbb{R}^p}\left\{ D_k(x) \triangleq \frac{1}{n}\sum_{i=1}^{n} d_i^k(x) + \psi(x) \right\},
\]

Remove the condition n ≥ 2L/µ

For i = i_k,
\[
d_i^k(x) = (1 - \delta)\,d_i^{k-1}(x) + \delta\left( f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}), x - x_{k-1} \rangle + \frac{\mu}{2}\|x - x_{k-1}\|^2 \right).
\]

Remarks

The original MISOµ uses δ = 1. To get rid of the condition n ≥ 2L/µ, proximal MISO uses instead δ = min(1, µn/(2(L − µ))).

Variant "5" of SDCA [Shalev-Shwartz and Zhang, 2012] is identical, with another value δ = µn/(L + µn) in (0, 1).
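As a tiny illustration, the value of δ used by proximal MISO can be computed as follows (a sketch; the function name is not from the paper):

```python
def miso_prox_delta(mu, L, n):
    # delta = min(1, mu*n / (2*(L - mu))), which removes the n >= 2L/mu requirement
    return min(1.0, mu * n / (2.0 * (L - mu)))
```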


Proximal MISO

Convergence of MISO-Prox

Let (x_k)_{k≥0} be obtained by MISO-Prox, then
\[
\mathbb{E}[F(x_k)] - F^* \le \frac{1}{\tau}(1 - \tau)^{k+1}\left(F(x_0) - D_0(x_0)\right) \quad \text{with} \quad \tau \ge \min\left\{\frac{\mu}{4L}, \frac{1}{2n}\right\}. \tag{8}
\]

Furthermore, we also have fast convergence of the certificate
\[
\mathbb{E}\left[F(x_k) - D_k(x_k)\right] \le \frac{1}{\tau}(1 - \tau)^k\left(F^* - D_0(x_0)\right).
\]

Differences with SDCA

The construction is primal. The proof of convergence and the algorithm do not use duality, while SDCA is a dual ascent technique.

D_k(x_k) is a lower bound of F^⋆; it plays the same role as the dual in SDCA, but is easier to evaluate.


Conclusions

Relatively simple algorithm, with a simple convergence proof and a simple optimality certificate.

Catalyst not only accelerates it, but also stabilizes it numerically, with the parameter δ = 1.

Close to SDCA, but without duality.


References I

A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM Multiscale Modeling and Simulation, 4(4):1168–1200, 2006.

I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.

A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.


References II

A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b.

R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: approximate proximal point algorithms for empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

D. Garber and E. Hazan. Fast and simple PCA via convex optimization. arXiv preprint arXiv:1509.05647, 2015.

O. Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

B. He and X. Yuan. An accelerated inexact proximal point algorithm for convex minimization. Journal of Optimization Theory and Applications, 154(2):536–548, 2012.

J. Konecny, J. Liu, P. Richtarik, and M. Takac. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv preprint arXiv:1410.4744, 2014.


References III

G. Lan. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.

K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1):1–20, 2000.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Y. Nesterov. Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, 2004.

Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1):125–161, 2013.

Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). In Doklady AN SSSR, volume 269, pages 543–547, 1983.


References IV

R. D. Nowak and M. A. T. Figueiredo. Fast wavelet-based image deconvolution using the EM algorithm. In Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, 2001.

S. Salzo and S. Villa. Inexact and accelerated proximal point algorithms. Journal of Convex Analysis, 19(4):1167–1192, 2012.

M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1–41, 2014.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.


References V

S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pages 980–988, 2013.

Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
