On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functions

Michael L. Overton
Courant Institute of Mathematical Sciences
New York University

Les Houches, 8 February 2016

Outline

■ Yurii Nesterov
■ Introduction
■ Some Nonsmooth Analysis
■ Nesterov’s Chebyshev-Rosenbrock Functions
■ Other Examples of Behavior of BFGS on Nonsmooth Functions

Yurii Nesterov

■ It seems we first met in 1988 at the Tokyo ISMP. We don’t have a proof of this, but we do have a proof that we were both at the meeting: we both used the beautiful gray bag with the Samurai warrior design for many years, bringing it to other conferences long after everyone else abandoned theirs!

■ We definitely met in 1994 at the Ann Arbor ISMP, where I learned about the Nesterov-Todd primal-dual interior-point algorithm for SDP.

■ We met again on many subsequent occasions, most notably during very enjoyable extended visits to Louvain-la-Neuve in 2004 and 2008.

■ Always a great pleasure to interact with this brilliant but modest colleague!

Introduction

Nonsmooth, Nonconvex Optimization

Problem: find $x$ that locally minimizes $f$, where $f : \mathbb{R}^n \to \mathbb{R}$ is

■ Continuous
■ Not differentiable everywhere, in particular often not differentiable at local minimizers
■ Not convex
■ Usually, but not always, locally Lipschitz

Lots of interesting applications

Any locally Lipschitz function is differentiable almost everywhere on its domain. So, with high probability, we can evaluate the gradient at any given point.

What happens if we simply use steepest descent (gradient descent) with a standard line search?

Example

$f(x) = 10\,|x_2 - x_1^2| + (1 - x_1)^2$

[Figure: steepest descent iterates; both axes range from −2 to 2]
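The experiment behind this plot can be approximated with a few lines of Python. This is only a minimal sketch under stated assumptions: the starting point, the Armijo constant `c`, the iteration cap, and the halving backtracking rule are illustrative choices that the slide does not specify, and the function names are hypothetical.

```python
def f(x):
    """Example function from the slide: 10*|x2 - x1^2| + (1 - x1)^2."""
    x1, x2 = x
    return 10.0 * abs(x2 - x1 * x1) + (1.0 - x1) ** 2


def grad(x):
    """Gradient of f wherever it exists (away from the curve x2 = x1^2)."""
    x1, x2 = x
    # On the kink itself the sign is an arbitrary choice (assumption).
    sign = 1.0 if x2 - x1 * x1 >= 0 else -1.0
    return (-20.0 * sign * x1 - 2.0 * (1.0 - x1), 10.0 * sign)


def steepest_descent(x0, c=1e-4, max_iter=200):
    """Steepest descent with Armijo backtracking (step halving from t = 1)."""
    x = tuple(x0)
    values = [f(x)]
    for _ in range(max_iter):
        g = grad(x)
        slope = -(g[0] ** 2 + g[1] ** 2)  # directional derivative along -g
        t, accepted = 1.0, False
        while t > 1e-16:
            trial = (x[0] - t * g[0], x[1] - t * g[1])
            if f(trial) <= f(x) + c * t * slope:  # Armijo sufficient decrease
                accepted = True
                break
            t *= 0.5
        if not accepted:
            break  # no acceptable step found: the method has stalled
        x = trial
        values.append(f(x))
    return x, values


# Arbitrary starting point (assumption); the minimizer of f is (1, 1).
x_final, values = steepest_descent((-1.5, 1.0))
print("final point:", x_final, "final value:", values[-1])
```

Comparing the printed final point with the minimizer $(1, 1)$, where $f = 0$, shows whether the run has stalled near the curve $x_2 = x_1^2$, which is the behavior the slide's question is probing.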

Methods Suitable for Nonsmooth Functions

In fact, it’s been known for several decades that at any given iterate, one should exploit the gradient information obtained at several points, not just at one point. Some such methods:

■ Bundle methods (C. Lemaréchal, K.C. Kiwiel, etc.): extensive practical use and theoretical analysis, but complicated in nonconvex case

■ Gradient sampling: an easily stated method with nice convergence theory (J.V. Burke, A.S. Lewis, M.L.O., 2005; K.C. Kiwiel, 2007), but computationally intensive

■ BFGS: traditional workhorse for smooth optimization, works amazingly well for nonsmooth optimization too, but very limited convergence theory

A completely different approach using randomized gradient-free methods: the first complexity result for nonsmooth, nonconvex optimization (Y. Nesterov and V. Spokoiny, JFoCM, 2015).

Failure of Steepest Descent: Simpler Example

Let $f(x) = 6|x_1| + 3x_2$. Note that $f$ is polyhedral and convex.

On this function, using a bisection-based backtracking line search with “Armijo” parameter in $[0, \tfrac{1}{3}]$ and starting at $\begin{bmatrix} 2 \\ 3 \end{bmatrix}$, steepest descent generates the sequence

$$2^{-k} \begin{bmatrix} 2(-1)^k \\ 3 \end{bmatrix}, \quad k = 1, 2, \ldots,$$

converging to $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

In contrast, BFGS with the same line search rapidly reduces the function value towards $-\infty$ (arbitrarily far, in exact arithmetic) (A.S. Lewis and S. Zhang, 2010).
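The closed-form steepest descent sequence can be checked numerically. The sketch below assumes that "bisection-based backtracking" means starting at $t = 1$ and halving until the Armijo condition holds, and it picks the Armijo parameter $c = 1/3$ from the stated range; the function names are illustrative. Since $f$ is unbounded below (let $x_2 \to -\infty$), the limit point $[0, 0]$ is not a minimizer, which is why this counts as a failure.

```python
def f(x):
    """f(x) = 6|x1| + 3*x2: polyhedral, convex, unbounded below."""
    return 6.0 * abs(x[0]) + 3.0 * x[1]


def grad(x):
    """Gradient of f, valid wherever x1 != 0."""
    return (6.0 if x[0] > 0 else -6.0, 3.0)


def steepest_descent(x0, c=1.0 / 3.0, n_steps=10):
    """Steepest descent with halving backtracking from t = 1 (assumed scheme)."""
    x = tuple(x0)
    iterates = []
    for _ in range(n_steps):
        g = grad(x)
        slope = -(g[0] ** 2 + g[1] ** 2)  # always -45 for this function
        t = 1.0
        # Halve t until the Armijo sufficient-decrease condition holds.
        while f((x[0] - t * g[0], x[1] - t * g[1])) > f(x) + c * t * slope:
            t *= 0.5
        x = (x[0] - t * g[0], x[1] - t * g[1])
        iterates.append(x)
    return iterates


iterates = steepest_descent((2.0, 3.0))
for k, x in enumerate(iterates, start=1):
    expected = (2.0 * (-1) ** k / 2 ** k, 3.0 / 2 ** k)
    print(k, x, expected)
```

Each printed iterate should agree with $2^{-k}[2(-1)^k, 3]$: the accepted step length halves at every iteration, so the iterates zigzag across the line $x_1 = 0$ while $x_2$ only shrinks toward 0 instead of decreasing without bound.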

The BFGS Method (“Full” Version)

Broyden, Fletcher, Goldfarb, Shanno independently, 1970

Choose line search parameters $0 < \beta < \gamma < 1$

Initialize iterate $x$ and positive-definite symmetric matrix $H$ (which is supposed to approximate the inverse Hessian of $f$)

Repeat

■ Set $d = -H\nabla f(x)$. Let $\alpha = \nabla f(x)^T d < 0$
■ Armijo-Wolfe line search: find $t$ so that $f(x + td) < f(x) + \beta t \alpha$ and $\nabla f(x + td)^T d > \gamma \alpha$
■ Set $s = td$, $y = \nabla f(x + td) - \nabla f(x)$
■ Replace $x$ by $x + td$
■ Replace $H$ by $V H V^T + \frac{1}{s^T y} s s^T$, where $V = I - \frac{1}{s^T y} s y^T$

Note that $H$ can be computed in $O(n^2)$ operations since $V$ is a rank one perturbation of the identity

The Armijo condition ensures “sufficient decrease” in $f$

The Wolfe condition ensures that the directional derivative along the line increases algebraically, which guarantees that $s^T y > 0$ and that the new $H$ is positive definite. Indeed, $s^T y = t\,(\nabla f(x + td)^T d - \alpha) > t(\gamma - 1)\alpha > 0$, because $t > 0$, $\gamma < 1$ and $\alpha < 0$.
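The iteration above can be sketched in plain Python as follows. Several details are assumptions not given on the slide: the line search finds $t$ by a bracketing scheme (doubling, then bisection), the defaults are `beta=1e-4` and `gamma=0.9`, $H$ starts as the identity, and the loop stops on a small gradient norm, a rule that is a convenience for the smooth test function and not a recommendation for nonsmooth problems. All function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def armijo_wolfe(f, grad, x, d, alpha, beta, gamma, max_steps=60):
    """Bracketing search for t satisfying the Armijo and Wolfe conditions."""
    lo, hi, t = 0.0, math.inf, 1.0
    fx = f(x)
    for _ in range(max_steps):
        xt = [xi + t * di for xi, di in zip(x, d)]
        if f(xt) >= fx + beta * t * alpha:       # Armijo fails: t too large
            hi = t
        elif dot(grad(xt), d) <= gamma * alpha:  # Wolfe fails: t too small
            lo = t
        else:
            return t
        t = 0.5 * (lo + hi) if hi < math.inf else 2.0 * lo
    return None  # no acceptable step found within max_steps


def bfgs(f, grad, x0, beta=1e-4, gamma=0.9, tol=1e-8, max_iter=200):
    n = len(x0)
    x = list(x0)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:  # smooth-case stopping rule (assumption)
            break
        d = [-v for v in matvec(H, g)]
        alpha = dot(g, d)               # negative while H is positive definite
        t = armijo_wolfe(f, grad, x, d, alpha, beta, gamma)
        if t is None:
            break
        s = [t * di for di in d]
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]
        rho = 1.0 / dot(s, y)           # s^T y > 0 by the Wolfe condition
        # H <- V H V^T + rho*s*s^T with V = I - rho*s*y^T, expanded so that
        # the update costs O(n^2) operations (uses symmetry of H).
        Hy = matvec(H, y)
        coef = rho * rho * dot(y, Hy) + rho
        H = [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + coef * s[i] * s[j]
              for j in range(n)] for i in range(n)]
        g = g_new
    return x


# Usage on a smooth convex quadratic (hypothetical test function, minimizer (1, -2)).
def quad(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def quad_grad(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_star = bfgs(quad, quad_grad, [0.0, 0.0])
print(x_star)
```

The update never forms $V$ explicitly: expanding $V H V^T$ into rank-one corrections of $H$ is what keeps the cost at $O(n^2)$ per iteration, as noted above.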

BFGS for Nonsmooth Optimization

In 1982, C. Lemarechal observed that quasi-Newton methods can beeffective for nonsmooth optimization, but dismissed them as there wasno theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject untilA.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues indetail, but our convergence results are limited to very special cases.

Page 39: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

BFGS for Nonsmooth Optimization

Yurii Nesterov

IntroductionNonsmooth,NonconvexOptimization

Example

Methods Suitable forNonsmoothFunctionsFailure of SteepestDescent: SimplerExample

The BFGS Method(“Full” Version)

BFGS forNonsmoothOptimization

With BFGS

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctions

9 / 48

In 1982, C. Lemarechal observed that quasi-Newton methods can beeffective for nonsmooth optimization, but dismissed them as there wasno theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject untilA.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues indetail, but our convergence results are limited to very special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist onreducing the magnitude of the directional derivative along the line!

Page 40: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

BFGS for Nonsmooth Optimization

Yurii Nesterov

IntroductionNonsmooth,NonconvexOptimization

Example

Methods Suitable forNonsmoothFunctionsFailure of SteepestDescent: SimplerExample

The BFGS Method(“Full” Version)

BFGS forNonsmoothOptimization

With BFGS

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctions

9 / 48

In 1982, C. Lemarechal observed that quasi-Newton methods can beeffective for nonsmooth optimization, but dismissed them as there wasno theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject untilA.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues indetail, but our convergence results are limited to very special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist onreducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse“Hessian” approximation, with some tiny eigenvalues converging tozero, corresponding to “infinitely large” curvature in the directionsdefined by the associated eigenvectors.

Page 41: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

BFGS for Nonsmooth Optimization

Yurii Nesterov

IntroductionNonsmooth,NonconvexOptimization

Example

Methods Suitable forNonsmoothFunctionsFailure of SteepestDescent: SimplerExample

The BFGS Method(“Full” Version)

BFGS forNonsmoothOptimization

With BFGS

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctions

9 / 48

In 1982, C. Lemarechal observed that quasi-Newton methods can beeffective for nonsmooth optimization, but dismissed them as there wasno theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject untilA.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues indetail, but our convergence results are limited to very special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist onreducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse“Hessian” approximation, with some tiny eigenvalues converging tozero, corresponding to “infinitely large” curvature in the directionsdefined by the associated eigenvectors.

Remarkably, the condition number of the inverse Hessianapproximation typically reaches 1016 before the method breaks down.

Page 42: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

BFGS for Nonsmooth Optimization

Yurii Nesterov

IntroductionNonsmooth,NonconvexOptimization

Example

Methods Suitable forNonsmoothFunctionsFailure of SteepestDescent: SimplerExample

The BFGS Method(“Full” Version)

BFGS forNonsmoothOptimization

With BFGS

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctions

9 / 48

In 1982, C. Lemarechal observed that quasi-Newton methods can beeffective for nonsmooth optimization, but dismissed them as there wasno theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject untilA.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues indetail, but our convergence results are limited to very special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist onreducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse“Hessian” approximation, with some tiny eigenvalues converging tozero, corresponding to “infinitely large” curvature in the directionsdefined by the associated eigenvectors.

Remarkably, the condition number of the inverse Hessianapproximation typically reaches 1016 before the method breaks down.

We have never seen convergence to non-stationary points that cannotbe explained by numerical difficulties.

Page 43: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

BFGS for Nonsmooth Optimization

In 1982, C. Lemarechal observed that quasi-Newton methods can be effective for nonsmooth optimization, but dismissed them as there was no theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject until A.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues in detail, but our convergence results are limited to very special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist on reducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse “Hessian” approximation, with some tiny eigenvalues converging to zero, corresponding to “infinitely large” curvature in the directions defined by the associated eigenvectors.

Remarkably, the condition number of the inverse Hessian approximation typically reaches $10^{16}$ before the method breaks down.

We have never seen convergence to non-stationary points that cannot be explained by numerical difficulties.

Convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth case.

With BFGS

[Figure: contour plot of $f(x) = 10\,|x_2 - x_1^2| + (1 - x_1)^2$ on $[-2, 2] \times [-2, 2]$, titled “steepest descent, grad sampling and BFGS iterates”. Legend: contours of f, starting point, optimal point, steepest descent, grad samp (1st phase), grad samp (2nd phase), bfgs.]

Some Nonsmooth Analysis

The Clarke Subdifferential

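Assume $f : \mathbb{R}^n \to \mathbb{R}$ is locally Lipschitz, and let $D = \{x \in \mathbb{R}^n : f \text{ is differentiable at } x\}$.

Rademacher’s Theorem: $\mathbb{R}^n \setminus D$ has measure zero.

The Clarke subdifferential of f at $\bar{x}$ is

$$\partial_C f(\bar{x}) = \operatorname{conv} \left\{ \lim_{x \to \bar{x},\ x \in D} \nabla f(x) \right\}.$$

F.H. Clarke, 1973 (he used the name “generalized gradient”).

If f is continuously differentiable at $\bar{x}$, then $\partial_C f(\bar{x}) = \{\nabla f(\bar{x})\}$.

If f is convex, $\partial_C f$ is the subdifferential of convex analysis.

We say $\bar{x}$ is Clarke stationary for f if $0 \in \partial_C f(\bar{x})$.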
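As a quick illustration of the definition (a worked example added here, not taken from the talk), consider the one-dimensional function $f(x) = |x|$, which is differentiable everywhere except at the origin:

```latex
\begin{align*}
D &= \mathbb{R} \setminus \{0\}, \qquad \nabla f(x) = \operatorname{sign}(x) \ \text{ for } x \in D, \\
\lim_{x \downarrow 0} \nabla f(x) &= 1, \qquad \lim_{x \uparrow 0} \nabla f(x) = -1, \\
\partial_C f(0) &= \operatorname{conv}\{-1,\, 1\} = [-1, 1] \ni 0 .
\end{align*}
```

So the origin is Clarke stationary for $|x|$, as expected for its minimizer. Since $|x|$ is convex, $[-1, 1]$ is also its subdifferential at 0 in the sense of convex analysis.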

Note that $0 \in \partial_C f(x)$ at $x = [1; 1]^T$

[Figure: contour plot of $f(x) = 10\,|x_2 - x_1^2| + (1 - x_1)^2$ on $[-2, 2] \times [-2, 2]$, titled “steepest descent, grad sampling and BFGS iterates”. Legend: contours of f, starting point, optimal point, steepest descent, grad samp (1st phase), grad samp (2nd phase), bfgs.]
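The claim in this slide's title can be checked by hand or numerically. The sketch below is an illustration added here, with the helper name `grad_f` and the offset `eps` chosen for this example only. It evaluates the gradient of f just above and just below the curve $x_2 = x_1^2$ at $x_1 = 1$, and confirms that zero is the midpoint of the two limiting gradients, hence lies in their convex hull.

```python
def grad_f(x1, x2):
    """Gradient of f(x) = 10*|x2 - x1**2| + (1 - x1)**2 away from x2 == x1**2."""
    r = x2 - x1 ** 2
    if r == 0.0:
        raise ValueError("f is not differentiable on the curve x2 = x1**2")
    sgn = 1.0 if r > 0.0 else -1.0
    return (-20.0 * sgn * x1 - 2.0 * (1.0 - x1), 10.0 * sgn)

eps = 1e-8                              # small offset off the curve (assumed value)
g_above = grad_f(1.0, 1.0 + eps)        # gradient on the side x2 > x1**2
g_below = grad_f(1.0, 1.0 - eps)        # gradient on the side x2 < x1**2
midpoint = tuple(0.5 * (a + b) for a, b in zip(g_above, g_below))
```

The two gradients are $[-20, 10]^T$ and $[20, -10]^T$, so the Clarke subdifferential at $[1; 1]^T$ is the line segment joining them, which passes through the origin.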

Regularity

A locally Lipschitz, directionally differentiable function f is (Clarke) regular near a point x when its directional derivative $x \mapsto f'(x; d)$ is upper semicontinuous near x for every fixed direction d.

In this case $0 \in \partial_C f(x)$ is equivalent to the first-order optimality condition $f'(x; d) \ge 0$ for all directions d.

■ All convex functions are regular
■ All smooth functions are regular
■ Nonsmooth concave functions are not regular

Example: $f(x) = -|x|$
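To see why this example fails to be regular, it may help to write out the quantities involved at the origin (a worked check added here for illustration):

```latex
\begin{align*}
f(x) &= -|x|, \qquad f'(0; d) = \lim_{t \downarrow 0} \frac{-|td|}{t} = -|d| < 0 \quad \text{for every } d \neq 0, \\
\partial_C f(0) &= \operatorname{conv}\{-1,\, 1\} = [-1, 1] \ni 0, \\
f'(x; 1) &= +1 \ \text{ for } x < 0, \qquad f'(0; 1) = -1
\quad \Longrightarrow \quad \limsup_{x \to 0} f'(x; 1) = 1 > f'(0; 1).
\end{align*}
```

The origin is therefore Clarke stationary even though every direction is a descent direction, and the directional derivative is not upper semicontinuous there. Without regularity, $0 \in \partial_C f(x)$ is a strictly weaker statement than $f'(x; d) \ge 0$ for all d.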

Partly Smooth Functions

A regular function f is partly smooth at x relative to a manifold M containing x (A.S. Lewis 2003) if

■ its restriction to M is twice continuously differentiable near x
■ the Clarke subdifferential $\partial_C f$ is continuous on M near x
■ par $\partial_C f(x)$, the subspace parallel to the affine hull of the subdifferential of f at x, is exactly the subspace normal to M at x.

We refer to par $\partial_C f(x)$ as the V-space for f at x (with respect to M), and to its orthogonal complement, the subspace tangent to M at x, as the U-space for f at x.

For nonzero y in the V-space, the mapping $t \mapsto f(x + ty)$ is necessarily nonsmooth at t = 0, while for nonzero y in the U-space, $t \mapsto f(x + ty)$ is differentiable at t = 0 as long as f is locally Lipschitz.

Illustration of U and V-spaces on Same Example

[Figure: contour plot of $f(x) = 10\,|x_2 - x_1^2| + (1 - x_1)^2$ on $[-2, 2] \times [-2, 2]$, titled “steepest descent, grad sampling and BFGS iterates”. Legend: contours of f, starting point, optimal point, steepest descent, grad samp (1st phase), grad samp (2nd phase), bfgs.]
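For this example the U- and V-spaces at the minimizer can be written down explicitly. The following is a worked computation added here for illustration, taking the manifold to be the parabola on which the absolute-value term vanishes:

```latex
\begin{align*}
M &= \{x : x_2 = x_1^2\}, \qquad x = [1; 1]^T, \\
\partial_C f(x) &= \operatorname{conv}\left\{ [-20;\, 10]^T,\ [20;\, -10]^T \right\}, \\
V &= \operatorname{par} \partial_C f(x) = \operatorname{span}\{[-2;\, 1]^T\}
   \quad (\text{normal to } M \text{ at } x, \text{ since } \nabla (x_2 - x_1^2) = [-2;\, 1]^T), \\
U &= V^{\perp} = \operatorname{span}\{[1;\, 2]^T\} \quad (\text{tangent to } M \text{ at } x), \\
f(x + t\,[-2;\, 1]^T) &= 10\,|5t - 4t^2| + 4t^2 \quad (\text{nonsmooth at } t = 0), \\
f(x + t\,[1;\, 2]^T) &= 11\,t^2 \quad (\text{differentiable at } t = 0).
\end{align*}
```

Along the V-direction the function has a kink at the minimizer, while along the U-direction it behaves like a smooth quadratic, which matches the shape of the contours near $[1; 1]^T$.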

Nesterov’s Chebyshev-Rosenbrock Functions

Nesterov’s First Chebyshev-Rosenbrock Function

Yurii Nesterov

Introduction

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctionsNesterov’s FirstChebyshev-RosenbrockFunctionWhy BFGS Takes SoMany Iterations toMinimize N2

Length of aPiecewise LinearDescent PathNesterov’s First C-RFunction:Nonsmooth CaseNesterov’s SecondNonsmooth C-RFunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond NonsmoothVariant N1

The MordukhovichSubdifferentialRelationship

Between ∂Cf and

18 / 48

Nesterov (2008, private comm.): consider the function

Np(x) =1

4(x1 − 1)2 +

n−1∑

i=1

|xi+1 − 2x2i + 1|p, where p ∈ [1, 2]

The unique minimizer is x∗ = [1, 1, . . . , 1]T with Np(x∗) = 0.

Define x = [−1, 1, 1, . . . , 1]T with Np(x) = 1 and the manifold

MN = {x : xi+1 = 2x2i − 1, i = 1, . . . , n− 1}

For x ∈ MN , e.g. x = x∗ or x = x, the 2nd term of Np is zero.Starting at x, BFGS needs to approximately follow MN to reachx∗ (unless it “gets lucky”).

When p = 2: N2 is smooth but not convex. Starting at x:

■ n = 5: BFGS needs 370 iterations to reduce N2 below 10−15

Page 73: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

Nesterov’s First Chebyshev-Rosenbrock Function

Nesterov (2008, private comm.): consider the function

$$N_p(x) = \frac{1}{4}(x_1 - 1)^2 + \sum_{i=1}^{n-1} |x_{i+1} - 2x_i^2 + 1|^p, \quad \text{where } p \in [1, 2]$$

The unique minimizer is $x^* = [1, 1, \ldots, 1]^T$ with $N_p(x^*) = 0$.

Define $\hat{x} = [-1, 1, 1, \ldots, 1]^T$ with $N_p(\hat{x}) = 1$ and the manifold

$$M_N = \{x : x_{i+1} = 2x_i^2 - 1, \ i = 1, \ldots, n-1\}$$

For $x \in M_N$, e.g. $x = x^*$ or $x = \hat{x}$, the second term of $N_p$ is zero. Starting at $\hat{x}$, BFGS needs to approximately follow $M_N$ to reach $x^*$ (unless it “gets lucky”).

When $p = 2$: $N_2$ is smooth but not convex. Starting at $\hat{x} = [-1, 1, \ldots, 1]^T$:

■ $n = 5$: BFGS needs 370 iterations to reduce $N_2$ below $10^{-15}$

■ $n = 10$: BFGS needs $\sim$ 50,000 iterations to reduce $N_2$ below $10^{-15}$

even though $N_2$ is smooth!
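
As a quick check of the definition above, the following minimal Python sketch evaluates $N_p$ at the minimizer $[1, \ldots, 1]^T$ and at the starting point $[-1, 1, \ldots, 1]^T$, and tests membership in $M_N$. The names `cheb_rosenbrock` and `on_manifold` are illustrative and do not come from the talk.

```python
def cheb_rosenbrock(x, p=2.0):
    """Evaluate N_p(x) = (1/4)(x_1 - 1)^2 + sum_i |x_{i+1} - 2 x_i^2 + 1|^p."""
    first = 0.25 * (x[0] - 1.0) ** 2
    second = sum(abs(x[i + 1] - 2.0 * x[i] ** 2 + 1.0) ** p
                 for i in range(len(x) - 1))
    return first + second


def on_manifold(x, tol=1e-12):
    """True if x_{i+1} = 2 x_i^2 - 1 holds for every i (membership in M_N)."""
    return all(abs(x[i + 1] - (2.0 * x[i] ** 2 - 1.0)) <= tol
               for i in range(len(x) - 1))


n = 5
x_star = [1.0] * n                # the unique minimizer
x_hat = [-1.0] + [1.0] * (n - 1)  # the starting point [-1, 1, ..., 1]

for p in (1.0, 2.0):
    print(p, cheb_rosenbrock(x_star, p), cheb_rosenbrock(x_hat, p))
print(on_manifold(x_star), on_manifold(x_hat))
```

Both points lie on the manifold, so only the first term distinguishes them: it is 0 at the minimizer and $\frac{1}{4}(-2)^2 = 1$ at the starting point, for every $p$.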

Why BFGS Takes So Many Iterations to Minimize $N_2$

Let $T_i(x)$ denote the $i$th Chebyshev polynomial. For $x \in M_N$,

$$x_{i+1} = 2x_i^2 - 1 = T_2(x_i) = T_2(T_2(x_{i-1})) = T_2(T_2(\ldots T_2(x_1) \ldots)) = T_{2^i}(x_1).$$

To move from $\hat{x}$ to $x^*$ along the manifold $M_N$ exactly requires

■ $x_1$ to change from $-1$ to $1$

■ $x_2 = 2x_1^2 - 1$ to trace the graph of $T_2(x_1)$ on $[-1, 1]$

■ $x_3 = T_2(T_2(x_1))$ to trace the graph of $T_4(x_1)$ on $[-1, 1]$

■ $x_n = T_{2^{n-1}}(x_1)$ to trace the graph of $T_{2^{n-1}}(x_1)$ on $[-1, 1]$, which has $2^{n-1} - 1$ extrema in $(-1, 1)$.

Even though BFGS will not track the manifold $M_N$ exactly, it will follow it approximately. So, since the manifold is highly oscillatory, BFGS must take relatively short steps to obtain reduction in $N_2$ in the line search, and hence it takes many iterations!

At the very end, since $N_2$ is smooth, BFGS is superlinearly convergent!

Newton’s method is not much faster, although it converges quadratically at the end.
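
The oscillation count can be checked numerically. The sketch below, with the hypothetical helper names `manifold_point` and `count_interior_extrema`, builds points of $M_N$ by iterating $x_{i+1} = 2x_i^2 - 1$ and counts the turning points of $x_n$ on a uniform grid as $x_1$ sweeps $[-1, 1]$. The grid size is an arbitrary choice, fine enough for the small $n$ used here.

```python
import math


def manifold_point(x1, n):
    """Return the point of M_N whose first coordinate is x1."""
    x = [x1]
    for _ in range(n - 1):
        x.append(2.0 * x[-1] ** 2 - 1.0)
    return x


def count_interior_extrema(n, samples=20000):
    """Count turning points of x_n as a function of x_1 on a grid over [-1, 1]."""
    values = [manifold_point(-1.0 + 2.0 * k / samples, n)[-1]
              for k in range(samples + 1)]
    # Successive differences, skipping exact ties.
    diffs = [b - a for a, b in zip(values, values[1:]) if b != a]
    # Each sign change between neighbouring differences marks one extremum.
    return sum(1 for d0, d1 in zip(diffs, diffs[1:]) if (d0 < 0) != (d1 < 0))


# x_{i+1} agrees with the Chebyshev polynomial T_{2^i}(x_1) = cos(2^i arccos x_1).
x = manifold_point(0.3, 4)
print(abs(x[3] - math.cos(8 * math.acos(0.3))))

for n in (2, 3, 4, 5):
    print(n, count_interior_extrema(n), 2 ** (n - 1) - 1)
```

The last two columns printed for each $n$ should agree, which illustrates how the number of turns a path along $M_N$ must make doubles with each added variable.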

Length of a Piecewise Linear Descent Path

F. Jarre (2013): if the second term (the sum) in Nesterov’s smooth Chebyshev-Rosenbrock function $N_2$ is weighted by 400, any continuous piecewise linear descent path starting at $\hat{x} = [-1, 1, \ldots, 1]^T$ and leading to the global minimizer $x^*$ has

at least $1.618^n$ linear segments.

Nesterov’s First C-R Function: Nonsmooth Case

$$N_1(x) = \frac{1}{4}(x_1 - 1)^2 + \sum_{i=1}^{n-1} |x_{i+1} - 2x_i^2 + 1|$$

$N_1$ is nonsmooth (though locally Lipschitz) as well as nonconvex. The second term is still zero on the manifold $M_N$, but $N_1$ is not differentiable on $M_N$.

However, $N_1$ is regular at $x \in M_N$ and partly smooth at $x$ w.r.t. $M_N$, and $x^* = [1, 1, \ldots, 1]^T$ is its only stationary point.

We cannot initialize BFGS at $\hat{x} = [-1, 1, \ldots, 1]^T$, which lies on $M_N$ where $N_1$ is not differentiable, so starting at normally distributed random points:

■ $n = 5$: BFGS reduces $N_1$ only to about $5 \times 10^{-3}$ in 1000 iterations

■ $n = 10$: BFGS reduces $N_1$ only to about $2 \times 10^{-2}$ in 1000 iterations

The method appears to be converging, very slowly, but may be having numerical difficulties.
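
To see concretely why $N_1$ fails to be differentiable on $M_N$, the sketch below compares one-sided difference quotients along the last coordinate at a point of the manifold. The sample point, the step size and the function names are illustrative choices, not taken from the talk.

```python
def n1(x):
    """First nonsmooth variant: (1/4)(x_1 - 1)^2 + sum_i |x_{i+1} - 2 x_i^2 + 1|."""
    return 0.25 * (x[0] - 1.0) ** 2 + sum(
        abs(x[i + 1] - 2.0 * x[i] ** 2 + 1.0) for i in range(len(x) - 1))


def one_sided_quotient(f, x, direction, t):
    """Difference quotient (f(x + t d) - f(x)) / t for a step t > 0."""
    shifted = [xi + t * di for xi, di in zip(x, direction)]
    return (f(shifted) - f(x)) / t


x = [0.5, -0.5, -0.5]   # on M_N, since 2 * 0.5**2 - 1 = -0.5
e3 = [0.0, 0.0, 1.0]
minus_e3 = [0.0, 0.0, -1.0]

right = one_sided_quotient(n1, x, e3, 1e-3)
left = one_sided_quotient(n1, x, minus_e3, 1e-3)
print(right, left)  # both close to +1
```

For a differentiable function the two quotients would be approximately negatives of each other. Here both are near $+1$, which reflects the kink of the absolute value across $M_N$.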

Nesterov’s Second Nonsmooth C-R Function

$$\hat{N}_1(x) = \frac{1}{4}|x_1 - 1| + \sum_{i=1}^{n-1} \bigl|x_{i+1} - 2|x_i| + 1\bigr|.$$

Again, the unique global minimizer is $x^*$. The second term is zero on the set

$$S = \{x : x_{i+1} = 2|x_i| - 1, \ i = 1, \ldots, n-1\}$$

but $S$ is not a manifold: it has “corners”.

Contour Plots of the Nonsmooth Variants for n = 2

[Figure: two contour plots over $x_1, x_2 \in [-2, 2]$. Left panel: "Nesterov-Chebyshev-Rosenbrock, first variant". Right panel: "Nesterov-Chebyshev-Rosenbrock, second variant".]

Contour plots of the nonsmooth Chebyshev-Rosenbrock functions $N_1$ (left, first variant) and $\hat{N}_1$ (right, second variant), with $n = 2$, with iterates generated by BFGS initialized at 7 different randomly generated points.

On the left, always get convergence to $x^* = [1, 1]^T$. On the right, most runs converge to $[1, 1]^T$ but some go to $x = [0, -1]^T$.

Properties of the Second Nonsmooth Variant $\hat{N}_1$

When $n = 2$, the point $x = [0, -1]^T$ is Clarke stationary for the second nonsmooth variant $\hat{N}_1$. We can see this because zero is in the convex hull of the gradient limits for $\hat{N}_1$ at the point $x$.

However, $x = [0, -1]^T$ is not a local minimizer, because $d = [1, 2]^T$ is a direction of linear descent: $\hat{N}_1'(x, d) < 0$.

These two properties mean that $\hat{N}_1$ is not regular at $[0, -1]^T$.
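
Both claims can be verified by hand for $n = 2$. The sketch below is an illustrative check with hypothetical names. It lists the four gradients of the second nonsmooth variant in the smooth regions adjacent to $[0, -1]^T$ (derived from the formula $\frac{1}{4}|x_1 - 1| + |x_2 - 2|x_1| + 1|$), exhibits convex weights that combine them to zero, and evaluates a one-sided difference quotient along $d = [1, 2]^T$.

```python
def second_variant(x):
    """Second nonsmooth variant for n = 2: (1/4)|x_1 - 1| + |x_2 - 2|x_1| + 1|."""
    return 0.25 * abs(x[0] - 1.0) + abs(x[1] - 2.0 * abs(x[0]) + 1.0)


def gradient_limits():
    """Gradients in the four smooth regions that meet at [0, -1].

    sigma is the sign of x_1 and s is the sign of x_2 - 2|x_1| + 1; near
    x_1 = 0 the first term contributes -1/4 to the first component.
    """
    return [(-0.25 - 2.0 * s * sigma, float(s))
            for s in (1, -1) for sigma in (1, -1)]


grads = gradient_limits()
# Convex weights (nonnegative, summing to 1) chosen by hand for this example.
weights = [0.21875 if g[0] < 0 else 0.28125 for g in grads]
combo = tuple(sum(w * g[k] for w, g in zip(weights, grads)) for k in range(2))
print(grads, combo)

# One-sided difference quotient at [0, -1] along d = [1, 2].
x, d, t = [0.0, -1.0], [1.0, 2.0], 1e-3
quotient = (second_variant([x[0] + t * d[0], x[1] + t * d[1]])
            - second_variant(x)) / t
print(quotient)  # close to -0.25
```

The weighted combination is the zero vector, so zero lies in the convex hull of the gradient limits, while the negative quotient shows the function decreasing linearly along $d$, since the step stays on the set where the second term vanishes.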

The Mordukhovich Subdifferential

B.S. Mordukhovich (1976), R.T. Rockafellar and R. J.-B. Wets (1998)

Consider a continuous function $f : \mathbb{R}^n \to \mathbb{R}$ (not necessarily Lipschitz) and a point $x \in \mathbb{R}^n$. A vector $v \in \mathbb{R}^n$ is a regular subgradient of $f$ at $x$ (written $v \in \partial f(x)$) if

$$\liminf_{z \to x,\; z \neq x} \frac{f(z) - f(x) - \langle v, z - x \rangle}{|z - x|} \geq 0.$$

A vector $v \in \mathbb{R}^n$ is a Mordukhovich subgradient of $f$ at $x$ (written $v \in \partial_M f(x)$) if there exist sequences $\{x_k\}$ and $\{v_k\}$ in $\mathbb{R}^n$ satisfying

$$x_k \to x, \qquad v_k \in \partial f(x_k), \qquad v_k \to v.$$

We say $f$ is Mordukhovich stationary at $x$ if $0 \in \partial_M f(x)$.
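As a small illustration of these definitions (a worked example added here, not taken from the talk), consider the one-variable function $h(x) = -|x|$ at the point $0$. The name $h$ is chosen only for this example. The computation shows that the set of regular subgradients at $0$ is empty, while limits of regular subgradients at nearby points give two Mordukhovich subgradients.

```latex
% Worked example (illustrative): h(x) = -|x| on R, at the point x = 0.
\begin{align*}
\text{For } x \neq 0:&\quad \partial h(x) = \{h'(x)\} = \{-\operatorname{sign}(x)\}. \\
\text{At } x = 0:&\quad \liminf_{z \to 0,\ z \neq 0} \frac{-|z| - vz}{|z|}
   = \min_{s = \pm 1}\,(-1 - vs) = -1 - |v| < 0
   \quad\Longrightarrow\quad \partial h(0) = \emptyset. \\
\text{Limits } x_k \to 0,\ x_k \neq 0:&\quad v_k = -\operatorname{sign}(x_k) \to \pm 1
   \quad\Longrightarrow\quad \partial_M h(0) = \{-1,\, 1\}.
\end{align*}
```

Since $0 \notin \partial_M h(0)$, the function $h$ is not Mordukhovich stationary at $0$, which is consistent with $0$ being a maximizer of $h$ rather than a minimizer.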


Relationship Between $\partial_C f$ and $\partial_M f$


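For a locally Lipschitz function $f$, we have

$$\partial_C f(x) = \operatorname{conv}\, \partial_M f(x),$$

and, if $f$ is regular,

$$\partial_C f(x) = \partial_M f(x).$$

Example: let $g(x) = |x_1| - |x_2|$, $x \in \mathbb{R}^2$. Then

$$\partial_C g(0) = [-1, 1] \times [-1, 1] \quad\text{and}\quad \partial_M g(0) = [-1, 1] \times \{-1, 1\},$$

so $g$ is not regular.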
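The example can be checked numerically in a few lines. The sketch below is an illustration added here, not code from the talk. The helper names (`grad_g`, `in_clarke_at_origin`, `in_mordukhovich_at_origin`) are hypothetical, and the two membership tests simply encode the two sets stated in the example. It shows that the gradients of $g$ at points near the origin average to zero (so $0$ lies in their convex hull), that $0$ is not in the stated Mordukhovich set, and that $g$ decreases along the $x_2$ axis, so the origin is not a local minimizer.

```python
def g(x1, x2):
    """g(x) = |x1| - |x2|."""
    return abs(x1) - abs(x2)

def grad_g(x1, x2):
    """Gradient of g, valid only where x1 != 0 and x2 != 0."""
    sign = lambda t: 1.0 if t > 0 else -1.0
    return (sign(x1), -sign(x2))

def in_clarke_at_origin(v):
    """Membership in [-1, 1] x [-1, 1], the set stated for the Clarke subdifferential at 0."""
    return -1.0 <= v[0] <= 1.0 and -1.0 <= v[1] <= 1.0

def in_mordukhovich_at_origin(v):
    """Membership in [-1, 1] x {-1, 1}, the set stated for the Mordukhovich subdifferential at 0."""
    return -1.0 <= v[0] <= 1.0 and abs(v[1]) == 1.0

eps = 1e-8
# Gradients at four points near the origin, one per open quadrant.
grads = sorted({grad_g(a * eps, b * eps) for a in (-1, 1) for b in (-1, 1)})
# Their average is a convex combination of gradients taken near the origin.
avg = (sum(v[0] for v in grads) / len(grads), sum(v[1] for v in grads) / len(grads))

print(grads)
print(avg, in_clarke_at_origin(avg), in_mordukhovich_at_origin(avg))
print(g(0.0, eps) < g(0.0, 0.0))  # g decreases when moving along the x2 axis
```

The zero vector appears only after taking convex combinations, which is exactly the step that separates the Clarke set from the Mordukhovich set in this example.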


Back to Nesterov’s Second Nonsmooth C-R Function


Theorem. For $n \geq 2$:

■ $N_1$ has $2^{n-1}$ Clarke stationary points
■ $N_1$ has exactly one Mordukhovich stationary point, the global minimizer $x^*$
■ its only local minimizer is the global minimizer $x^*$

M. Gurbuzbalaban and M.L.O., SIOPT, 2012.

Furthermore, starting from enough randomly generated starting points, BFGS finds all $2^{n-1}$ Clarke stationary points!
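To make the count $2^{n-1}$ concrete, here is a minimal sketch added for illustration, not code from the talk. It rests on two assumptions that this part of the talk does not restate: that the second nonsmooth variant is $\tfrac14|x_1 - 1| + \sum_{i=1}^{n-1} |x_{i+1} - 2|x_i| + 1|$, and that the natural candidates for nonminimizing stationary points are the points satisfying $x_{i+1} = 2|x_i| - 1$ for all $i$ with some coordinate $x_j = 0$, $j < n$. The function names are hypothetical. The code only enumerates these candidates (plus the minimizer $(1, \ldots, 1)$) and evaluates the function at them; it does not verify Clarke stationarity, which is the content of the theorem.

```python
from itertools import product

def second_nonsmooth_cr(x):
    """Assumed definition: 1/4*|x_1 - 1| + sum_i |x_{i+1} - 2*|x_i| + 1|."""
    return 0.25 * abs(x[0] - 1.0) + sum(
        abs(x[i + 1] - 2.0 * abs(x[i]) + 1.0) for i in range(len(x) - 1)
    )

def candidate_points(n):
    """Points with x_{i+1} = 2|x_i| - 1 for all i and some x_j = 0 (j < n), plus (1, ..., 1)."""
    points = [tuple([1.0] * n)]
    for j in range(n - 1):  # 0-based index of the coordinate that is set to zero
        for signs in product((-1.0, 1.0), repeat=j):
            x = [0.0] * n
            # Back-substitute |x_i| = (x_{i+1} + 1) / 2, choosing a sign for each i < j.
            for i in range(j - 1, -1, -1):
                x[i] = signs[i] * (x[i + 1] + 1.0) / 2.0
            # Forward-substitute x_{i+1} = 2|x_i| - 1 for i >= j.
            for i in range(j, n - 1):
                x[i + 1] = 2.0 * abs(x[i]) - 1.0
            points.append(tuple(x))
    return points

n = 5
values = sorted(second_nonsmooth_cr(x) for x in candidate_points(n))
print(len(values))             # number of candidate points
print(values[0], values[-1])   # smallest and largest function value among them
```

For $n = 5$ the enumeration yields 16 points with 16 distinct function values, all below 0.5, which matches the number of Clarke stationary points stated in the theorem.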


Behavior of BFGS on the Second Nonsmooth Variant


[Figure: two panels, "Nesterov-Chebyshev-Rosenbrock, n=5" (left) and "Nesterov-Chebyshev-Rosenbrock, n=6" (right). Horizontal axis: different starting points (0 to 1000). Vertical axis: sorted final value of f (0 to 0.5).]

Left: sorted final values of $N_1$ for 1000 randomly generated starting points, when $n = 5$: BFGS finds all 16 Clarke stationary points. Right: same with $n = 6$: BFGS finds all 32 Clarke stationary points.


Convergence to Non-Locally-Minimizing Points


When $f$ is smooth, convergence of methods such as BFGS to non-locally-minimizing stationary points or local maxima is possible but not likely, because of the line search, and such convergence will not be stable under perturbation.

However, this kind of convergence is what we are seeing for the non-regular, non-smooth Nesterov Chebyshev-Rosenbrock example, and it is stable under perturbation. The same behavior occurs for gradient sampling or bundle methods.

Kiwiel (private communication): the Nesterov example is the first he had seen which causes his bundle code to have this behavior.

Nonetheless, we don’t know whether, in exact arithmetic, the methods would actually generate sequences converging to the nonminimizing Clarke stationary points. Experiments by Kaku (2011) suggest that the higher the precision used, the more likely BFGS is to eventually move away from such a point.


Experiments using BFGS with Extended Precision


M.S. thesis by A. Kaku experimenting with Sherry Li’s “double double” C++ package.

“double double” is not the same as quadruple precision: each number is represented as the sum of two ordinary double precision numbers

Thus, $1 + 10^{-30}$ and $1 + 10^{-300}$ are both valid “double double” numbers

In practice, it is just a convenient, inexpensive software implementation that approximates quadruple precision (approximately 32 decimal digits of accuracy instead of 16)

Show plots from Kaku’s thesis.
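The representation as a sum of two doubles can be sketched in a few lines. This is not the C++ package mentioned above; it is a minimal Python illustration, with hypothetical function names, built on the classical error-free addition (Knuth's TwoSum) and a simple, non-rigorous double-double addition. It shows that $1 + 10^{-30}$ and $1 + 10^{-300}$ survive as (high, low) pairs even though each collapses to 1 in ordinary double precision. The literals `1e-30` and `1e-300` are the nearest doubles to those powers of ten.

```python
def two_sum(a, b):
    """Error-free addition: returns (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs (simple variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize so that hi carries the leading part

one = (1.0, 0.0)
a = dd_add(one, (1e-30, 0.0))    # 1 + 1e-30 as a (hi, lo) pair
b = dd_add(one, (1e-300, 0.0))   # 1 + 1e-300 as a (hi, lo) pair

print(a, b)
print(1.0 + 1e-30 == 1.0)        # the small term is lost in plain double precision
print(dd_add(a, (-1.0, 0.0)))    # subtracting 1 recovers the small term
```

The pair $(1, 10^{-300})$ is a legitimate double-double value, whereas a true quadruple precision format with a fixed-length significand could not hold both terms, which is the distinction drawn above.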


An Approach using Automatic Differentiation


Recent work by A. Griewank on automatic differentiation for nonsmooth optimization: leads to a more efficient method for optimization of Nesterov’s second nonsmooth Chebyshev-Rosenbrock function since it is able to efficiently exploit the piecewise-linearity of the function.

Starting at x, it visits all $2^{n-1}$ Clarke stationary points, but it does not get stuck at any of them because it repeatedly solves LPs that define the piecewise linear path leading to the global minimum.


Other Examples of Behavior of BFGS on Nonsmooth Functions


Minimizing a Product of Eigenvalues


Let $S^N$ denote the space of real symmetric $N \times N$ matrices, and

$$\lambda_1(X) \geq \lambda_2(X) \geq \cdots \geq \lambda_N(X)$$

denote the eigenvalues of $X \in S^N$. We wish to minimize

$$f(X) = \log \prod_{i=1}^{N/2} \lambda_i(A \circ X)$$

where $A \in S^N$ is fixed and $\circ$ is the Hadamard (componentwise) matrix product, subject to the constraints that $X$ is positive semidefinite and has diagonal entries equal to 1.

If we replace $\prod$ by $\sum$ we would have a semidefinite program.

Since $f$ is not convex, may as well replace $X$ by $YY^T$ where $Y \in \mathbb{R}^{N \times N}$: eliminates psd constraint, and then also easy to eliminate diagonal constraint.

Application: entropy minimization in an environmental application (K.M. Anstreicher and J. Lee, 2004)
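The substitution $X = YY^T$ can be sketched as follows. This is an illustration added here, not code from the talk. The talk does not say how the diagonal constraint is eliminated; scaling each row of $Y$ to unit Euclidean norm is one natural way to do it and is an assumption of this sketch, as are the function names and the sample matrix. With normalized rows, $X$ has unit diagonal automatically, and $z^T X z = |Y^T z|^2 \geq 0$ for every $z$, so both constraints hold by construction.

```python
import math

def unit_diagonal_psd(Y):
    """Return X = Z Z^T, where Z is Y with each row scaled to unit Euclidean norm."""
    Z = []
    for row in Y:
        norm = math.sqrt(sum(t * t for t in row))  # assumes no row of Y is all zeros
        Z.append([t / norm for t in row])
    N = len(Z)
    return [[sum(Z[i][k] * Z[j][k] for k in range(len(Z[i]))) for j in range(N)]
            for i in range(N)]

def quadratic_form(X, z):
    """Compute z^T X z."""
    return sum(z[i] * X[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))

# An arbitrary unconstrained Y (sample data chosen only for this example).
Y = [[3.0, 4.0, 0.0], [1.0, -2.0, 2.0], [0.0, 5.0, 12.0]]
X = unit_diagonal_psd(Y)

print([round(X[i][i], 12) for i in range(3)])            # diagonal entries
print(quadratic_form(X, [1.0, -2.0, 0.5]) >= -1e-12)     # nonnegative up to rounding
```

An unconstrained method such as BFGS can then work directly on the entries of $Y$, with the objective evaluated at the $X$ built this way.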


BFGS from 10 Randomly Generated Starting Points


[Figure: "Log eigenvalue product, N=20, n=400, fopt = -4.37938e+000". Horizontal axis: iteration (0 to 1400). Vertical axis (log scale, $10^{-15}$ to $10^{5}$): f − fopt (different starting points).]

$f - f_{\mathrm{opt}}$, where $f_{\mathrm{opt}}$ is least value of $f$ found over all runs


Evolution of Eigenvalues of A ◦X


[Figure: "Log eigenvalue product, N=20, n=400, fopt = -4.37938e+000". Horizontal axis: iteration (0 to 1400). Vertical axis: eigenvalues of A ∘ X (0.2 to 1.8).]

Note that $\lambda_6(A \circ X), \ldots, \lambda_{14}(A \circ X)$ coalesce


Evolution of Eigenvalues of H


[Figure: "Log eigenvalue product, N=20, n=400, fopt = -4.37938e+000". Horizontal axis: iteration (0 to 1400). Vertical axis (log scale, $10^{-15}$ to $10^{5}$): eigenvalues of H.]

44 eigenvalues of H converge to zero...why???


Why Did 44 Eigenvalues of H Converge to Zero?

Yurii Nesterov

Introduction

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctionsMinimizing aProduct ofEigenvalues

BFGS from 10Randomly GeneratedStarting Points

Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H

Why Did 44Eigenvalues of HConverge to Zero?

Variation of f fromMinimizer, alongEigVecs of H

Minimizing theSpectral Radius

Nonsmooth Analysisof the Spectral

37 / 48

The eigenvalue product is partly smooth with respect to themanifold of matrices with an eigenvalue with given multiplicity.

Recall that at the computed minimizer,

λ6(A ◦X) ≈ . . . ≈ λ14(A ◦X).

Matrix theory says that imposing multiplicity m on an eigenvaluea matrix ∈ SN is m(m+1)

2 − 1 conditions, or 44 when m = 9, sothe dimension of the V -space at this minimizer is 44.

And tiny eigenvalues of the BFGS matrix H approximating the“inverse Hessian” correspond to “infinite curvature”:nonsmoothness in the V-space

Page 134: On Nesterov’s Nonsmooth Chebyshev-Rosenbrock Functionsaspremon/Houches/talks/lesHouches... · 2016. 3. 6. · Nonsmooth Optimization With BFGS Some Nonsmooth Analysis Nesterov’s

Why Did 44 Eigenvalues of H Converge to Zero?

Yurii Nesterov

Introduction

Some NonsmoothAnalysis

Nesterov’sChebyshev-RosenbrockFunctions

Other Examples ofBehavior of BFGSon NonsmoothFunctionsMinimizing aProduct ofEigenvalues

BFGS from 10Randomly GeneratedStarting Points

Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H

Why Did 44Eigenvalues of HConverge to Zero?

Variation of f fromMinimizer, alongEigVecs of H

Minimizing theSpectral Radius

Nonsmooth Analysisof the Spectral

37 / 48

The eigenvalue product is partly smooth with respect to the manifold of matrices with an eigenvalue with given multiplicity.

Recall that at the computed minimizer,

$$\lambda_6(A \circ X) \approx \dots \approx \lambda_{14}(A \circ X).$$

Matrix theory says that imposing multiplicity $m$ on an eigenvalue of a matrix in $S^N$ is $\frac{m(m+1)}{2} - 1$ conditions, or 44 when $m = 9$, so the dimension of the V-space at this minimizer is 44.

And tiny eigenvalues of the BFGS matrix H approximating the “inverse Hessian” correspond to “infinite curvature”: nonsmoothness in the V-space.

Thus BFGS automatically detected the U and V space partitioning without knowing anything about the mathematical structure of f!
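The count of 44 can be checked with a few lines of arithmetic. The sketch below only evaluates the formula m(m+1)/2 − 1 quoted above; the function name `multiplicity_conditions` is a hypothetical helper introduced for illustration.

```python
def multiplicity_conditions(m: int) -> int:
    """Conditions imposed by requiring an eigenvalue of a symmetric
    matrix to have multiplicity m, using the formula m(m+1)/2 - 1."""
    return m * (m + 1) // 2 - 1


# lambda_6 through lambda_14 coalesce at the computed minimizer.
m = 14 - 6 + 1
print(m, multiplicity_conditions(m))
```

With n = 400 variables in this example, the remaining 400 − 44 directions would make up the U-space.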

Variation of f from Minimizer, along EigVecs of H

[Figure: f(x_opt + t w) − f_opt plotted against t from −10 to 10, for six directions w: the eigenvectors of the final H for eigenvalues 10, 20, 30, 40, 50 and 60. Eigenvalues of H numbered smallest to largest. Log eigenvalue product, N = 20, n = 400, f_opt = −4.37938e+000.]

Minimizing the Spectral Radius

Given the discrete-time dynamical system with control input and measured output

$$z(k+1) = F z(k) + G u(k), \qquad y(k) = H z(k),$$

where $F \in \mathbb{R}^{n \times n}$, $G \in \mathbb{R}^{n \times p}$, $H \in \mathbb{R}^{m \times n}$, the static output feedback problem is to find a controller $X \in \mathbb{R}^{p \times m}$ so that, setting $u(k) = X y(k)$, all solutions of

$$z(k+1) = (F + GXH)\, z(k)$$

converge to zero, that is, all eigenvalues of $F + GXH$ are inside the unit disk (Schur stable), or prove that this is not possible.

Pose as an optimization problem:

$$\min_{X \in \mathbb{R}^{p \times m}} \rho(F + GXH)$$

where ρ is the spectral radius.

NP-hard if bounds are added on the entries of X (V. Blondel and J. Tsitsiklis, 1996).
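As a minimal sketch of the objective, assuming NumPy, the closed-loop spectral radius can be evaluated directly from the eigenvalues of F + GXH. The function names and the small 2 × 2 system below are made up for illustration and are not taken from the talk.

```python
import numpy as np


def spectral_radius(A):
    """Largest modulus among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


def closed_loop(F, G, X, H):
    """Closed-loop matrix obtained by setting u(k) = X y(k)."""
    return F + G @ X @ H


# Illustrative system (hypothetical data): n = 2, p = 1, m = 1.
F = np.array([[1.5, 1.0], [0.0, 0.5]])
G = np.array([[1.0], [0.0]])
H = np.array([[1.0, 0.0]])
X = np.array([[-1.2]])

rho_open = spectral_radius(F)
rho_closed = spectral_radius(closed_loop(F, G, X, H))
print(rho_open, rho_closed, rho_closed < 1.0)
```

Here the open-loop matrix has an eigenvalue at 1.5, and the chosen X moves it to 0.3, so the closed loop is Schur stable.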

Nonsmooth Analysis of the Spectral Radius

The spectral radius ρ is not locally Lipschitz at matrices with multiple active eigenvalues (those attaining the maximal modulus).

Nonsmooth analysis of ρ in this case, deriving ∂Mρ, was given by J.V. Burke and M.L.O. (2001), J.V. Burke, A.S. Lewis and M.L.O. (2005), etc.

But to apply BFGS, we assume that everywhere we evaluate ρ at A(X) = F + GXH, there is just one active real eigenvalue or active conjugate pair with multiplicity one, and break any “ties” arbitrarily.
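A standard small example, not taken from the talk, illustrates the failure of the Lipschitz property: perturbing the 2 × 2 Jordan block with a double eigenvalue at zero by ε in the lower-left corner gives eigenvalues ±√ε, so ρ grows like √ε rather than ε. The sketch assumes NumPy; the function names are hypothetical.

```python
import numpy as np


def spectral_radius(A):
    """Largest modulus among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


def perturbed_jordan_block(eps):
    """Jordan block for the double eigenvalue 0, perturbed by eps."""
    return np.array([[0.0, 1.0], [eps, 0.0]])


for eps in (1e-2, 1e-4, 1e-6):
    rho = spectral_radius(perturbed_jordan_block(eps))
    # The ratio rho / eps behaves like 1 / sqrt(eps): unbounded as eps -> 0.
    print(f"eps={eps:.0e}  rho={rho:.3e}  rho/eps={rho / eps:.3e}")
```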

Gradient of the Spectral Radius

Gradient of the spectral radius in real matrix space:

$$\nabla\rho(A) = \mathrm{Re}\left(\frac{\mu}{|\mu|}\,\frac{1}{v^* u}\; u v^*\right)$$

where v and u are right and left eigenvectors for the relevant active eigenvalue µ of A, which is assumed to be simple and have nonnegative imaginary part.

Gradients may be arbitrarily large for µ nearly a multiple eigenvalue: spectral functions are not locally Lipschitz at an active multiple eigenvalue.

Break ties for active eigenvalue arbitrarily.

Since A is real, take Im µ ≥ 0 wlog.

Defining A(X) = F + GXH, use ordinary chain rule to obtain gradients of ρ(A(X)) in the X space.
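The gradient and the chain rule can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the author's code: it starts from the first-order perturbation identity dµ = u\* dA v / (u\* v) for a simple eigenvalue (v a right eigenvector, u a left eigenvector with u\* A = µ u\*), takes the left eigenvector from a row of the inverse eigenvector matrix (so A is assumed diagonalizable), and assumes µ ≠ 0. For the chain rule, the trace inner product gives ∇_X ρ(A(X)) = Gᵀ ∇ρ(A) Hᵀ. Function names are hypothetical.

```python
import numpy as np


def spectral_radius_gradient(A):
    """Gradient of rho at a real matrix A, assuming the active
    eigenvalue is simple and nonzero and A is diagonalizable."""
    eigvals, V = np.linalg.eig(A)
    k = int(np.argmax(np.abs(eigvals)))  # argmax breaks ties by taking the first hit
    mu = eigvals[k]
    v = V[:, k]                          # right eigenvector: A v = mu v
    u = np.linalg.inv(V)[k, :].conj()    # left eigenvector: u^* A = mu u^*
    scale = mu / (abs(mu) * np.vdot(v, u))  # np.vdot(v, u) computes v^* u
    # Either member of a conjugate pair yields the same real part.
    return np.real(scale * np.outer(u, v.conj()))


def closed_loop_gradient(F, G, X, H):
    """Gradient of X -> rho(F + G X H), by the chain rule."""
    return G.T @ spectral_radius_gradient(F + G @ X @ H) @ H.T


# Small check: the eigenvalue 2 of A depends on A[0, 0] and A[1, 0] only.
A = np.array([[2.0, 1.0], [0.0, 1.0]])
print(spectral_radius_gradient(A))

# A feedback that only touches the (1, 0) entry of the closed-loop matrix.
G = np.array([[0.0], [1.0]])
H = np.array([[1.0, 0.0]])
print(closed_loop_gradient(A, G, np.zeros((1, 1)), H))
```

In the first example the eigenvalues of the matrix with (1, 0) entry ε are (3 ± √(1 + 4ε))/2, so the derivative of ρ with respect to that entry is 1, which is what both printed gradients report.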

Numerical Results for some SOF Problems

Let F be an n × n Toeplitz matrix whose nonzeros are 0.5 on the main diagonal and first three superdiagonals and −0.5 on the first subdiagonal. Not Schur stable.

First set of experiments: set n = 8 and optimize over $X \in \mathbb{R}^{p \times m}$ with p = 1 (setting $G = [1, \dots, 1]^T$), and consider m ranging from 0 to 8 (setting H to the matrix whose rows are the first m rows of the identity matrix).

For each m, run BFGS from 100 randomly generated starting points to search for local minimizers of ρ(F + GXH) over X and plot eigenvalues of F + GXH for the best X found.

Second set of experiments: n = 15, p = 2, with G having a second column $[1, -1, 1, -1, \dots, 1]^T$.
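The test matrices described above can be assembled as in the following NumPy sketch. The helper names (`build_F`, `build_G`, `build_H`) are hypothetical, and drawing the starting X from a standard normal distribution is an assumption, since the talk only says the starting points are randomly generated. The BFGS runs themselves are not reproduced here.

```python
import numpy as np


def build_F(n):
    """Toeplitz F: 0.5 on the main diagonal and the first three
    superdiagonals, -0.5 on the first subdiagonal."""
    F = 0.5 * sum(np.eye(n, k=k) for k in range(4))
    return F - 0.5 * np.eye(n, k=-1)


def build_G(n, p):
    """First column all ones; second column alternating +1, -1 (p is 1 or 2)."""
    columns = [np.ones(n), (-1.0) ** np.arange(n)]
    return np.column_stack(columns[:p])


def build_H(n, m):
    """First m rows of the n x n identity matrix."""
    return np.eye(n)[:m, :]


def spectral_radius(A):
    """Largest modulus among the eigenvalues of A."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


rng = np.random.default_rng(0)
n, p = 8, 1
F, G = build_F(n), build_G(n, p)
for m in range(n + 1):
    X0 = rng.standard_normal((p, m))  # assumed distribution for a random start
    print(m, spectral_radius(F + G @ X0 @ build_H(n, m)))
```

For m = 0 the matrix X is empty, so the product GXH is the zero matrix and the objective reduces to ρ(F).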

Optimized Eigenvalues: n = 8, p = 1

[Figure: nine panels, m = 0, ..., 8, each showing the eigenvalues of F + GXH in the complex plane for the best X found.]

’*’ : known optimal value for m = 7 and m = 8

Sorted Final Values of ρ for 100 Runs of BFGS

[Figure: nine panels, m = 0, ..., 8, each showing the sorted final values of ρ over the 100 BFGS runs, for the first set of experiments (n = 8, p = 1).]

Optimized Eigenvalues: n = 15, p = 2

[Figure: nine panels, m = 0, ..., 8, each showing the eigenvalues of F + GXH in the complex plane for the best X found.]

’*’ : known optimal value for m = 8

Sorted Final Values of ρ for 100 Runs of BFGS

[Figure: nine panels, m = 0, ..., 8, each showing the sorted final values of ρ over the 100 BFGS runs, for the second set of experiments (n = 15, p = 2).]

Challenge: Convergence of BFGS in Nonsmooth Case

Assume f is locally Lipschitz with bounded level sets and is semi-algebraic.

Assume the initial x and H are generated randomly (e.g. from normal and Wishart distributions).

Prove or disprove that the following hold with probability one:

1. BFGS generates an infinite sequence {x} with f differentiable at all iterates
2. Any cluster point $\bar{x}$ is Clarke stationary
3. The sequence of function values generated (including all of the line search iterates) converges to $f(\bar{x})$ R-linearly
4. If {x} converges to $\bar{x}$ where f is “partly smooth” w.r.t. a manifold M then the subspace defined by the eigenvectors corresponding to eigenvalues of H converging to zero converges to the “V-space” of f w.r.t. M at $\bar{x}$

A.S. Lewis and M.L.O., Math Programming, 2013.
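The random initialization in the challenge can be sketched as follows, assuming NumPy. The specific choices are assumptions for illustration: x is standard normal, and H = W Wᵀ with W an n × n standard normal matrix, which is one Wishart-distributed matrix (identity scale, n degrees of freedom) and is positive definite with probability one. The update shown is the standard BFGS formula for the inverse Hessian approximation; the function names are hypothetical and no line search is included.

```python
import numpy as np


def random_start(n, rng):
    """Random initial point and initial inverse Hessian approximation."""
    x0 = rng.standard_normal(n)
    W = rng.standard_normal((n, n))
    H0 = W @ W.T  # Wishart-distributed, positive definite with probability one
    return x0, H0


def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of the inverse Hessian approximation H,
    for a step s and gradient difference y with y^T s > 0."""
    r = 1.0 / float(y @ s)
    I = np.eye(len(s))
    return (I - r * np.outer(s, y)) @ H @ (I - r * np.outer(y, s)) + r * np.outer(s, s)


rng = np.random.default_rng(0)
x0, H0 = random_start(5, rng)
s = rng.standard_normal(5)
y = np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ s  # made-up y chosen so that y^T s > 0
H1 = bfgs_inverse_update(H0, s, y)
print(np.linalg.eigvalsh(H1))
```

The updated matrix satisfies the secant equation H1 y = s and stays positive definite whenever yᵀs > 0, so tiny eigenvalues of H can only arise in the limit, as in item 4.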

And Finally

Happy Birthday Yurii!