
1 / 59

Nonsmooth, Nonconvex Optimization
Algorithms and Examples

Michael L. Overton
Courant Institute of Mathematical Sciences

New York University

Convex and Nonsmooth Optimization Class, Spring 2016, Final Lecture

Mostly based on my research work with Jim Burke and Adrian Lewis

2 / 59

Introduction

Nonsmooth, Nonconvex Optimization
Example
Methods Suitable for Nonsmooth Functions
Failure of Steepest Descent: Simpler Example

Gradient Sampling
Quasi-Newton Methods
Some Difficult Examples
Limited Memory Methods
Concluding Remarks

3 / 59

Nonsmooth, Nonconvex Optimization

Problem: find x that locally minimizes f, where f : Rn → R is

■ Continuous
■ Not differentiable everywhere, in particular often not differentiable at local minimizers
■ Not convex
■ Usually, but not always, locally Lipschitz: for all x there exists Lx such that |f(x + d) − f(x)| ≤ Lx‖d‖ for small ‖d‖

Lots of interesting applications

Any locally Lipschitz function is differentiable almost everywhere on its domain. So, with high probability, we can evaluate the gradient at any given point.

What happens if we simply use steepest descent (gradient descent) with a standard line search?

4 / 59

Example

[Figure: f(x) = 10|x2 − x1^2| + (1 − x1)^2 over [−2, 2] × [−2, 2], with steepest descent iterates]
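The example function is easy to experiment with. Below is a minimal NumPy sketch of f and of a gradient formula that is valid wherever x2 ≠ x1^2 (f is differentiable almost everywhere, as noted on the previous slide); the names f and grad_f are just illustrative:

import numpy as np

def f(x):
    """Example from the slide: f(x) = 10|x2 - x1^2| + (1 - x1)^2."""
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

def grad_f(x):
    """Gradient of f, valid wherever x2 != x1^2 (almost everywhere)."""
    s = np.sign(x[1] - x[0] ** 2)
    return np.array([-20.0 * s * x[0] - 2.0 * (1.0 - x[0]),
                     10.0 * s])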

5 / 59

Methods Suitable for Nonsmooth Functions

In fact, it's been known for several decades that at any given iterate, we need to exploit the gradient information obtained at several points, not just at one point. Some such methods:

■ Bundle methods (C. Lemarechal, K.C. Kiwiel, etc.): extensive practical use and theoretical analysis, but complicated in the nonconvex case
■ Gradient sampling: an easily stated method with nice convergence theory (J.V. Burke, A.S. Lewis, M.L.O., 2005; K.C. Kiwiel, 2007), but computationally intensive
■ BFGS: traditional workhorse for smooth optimization, works amazingly well for nonsmooth optimization too, but very limited convergence theory

6 / 59

Failure of Steepest Descent: Simpler Example

Let f(x) = 6|x1| + 3x2. Note that f is polyhedral and convex.

On this function, using a bisection-based backtracking line search with "Armijo" parameter in [0, 1/3] and starting at [2; 3]^T, steepest descent generates the sequence

    2^{−k} [2(−1)^k ; 3]^T, k = 1, 2, . . . ,

converging to [0; 0]^T.

In contrast, BFGS with the same line search rapidly reduces the function value towards −∞ (arbitrarily far, in exact arithmetic) (A.S. Lewis and S. Zhang, 2010).
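A minimal sketch that reproduces this behavior numerically; the function, the starting point and the Armijo parameter 0.1 ∈ [0, 1/3] come from the slide, while the standard sufficient-decrease form of the Armijo condition used below is an illustrative choice:

import numpy as np

def f(x):
    return 6.0 * abs(x[0]) + 3.0 * x[1]

def grad(x):
    # valid wherever x1 != 0 (f is differentiable there)
    return np.array([6.0 * np.sign(x[0]), 3.0])

x = np.array([2.0, 3.0])
c = 0.1                                   # "Armijo" parameter in [0, 1/3]
for k in range(1, 11):
    g = grad(x)
    d = -g                                # steepest descent direction
    t = 1.0
    # bisection-based backtracking: halve t until sufficient decrease holds
    while f(x + t * d) > f(x) + c * t * (g @ d):
        t *= 0.5
    x = x + t * d
    print(k, x)                           # x = 2^{-k} [2(-1)^k, 3], heading to [0, 0]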

7 / 59

Gradient Sampling

The Gradient Sampling Method
With First Phase of Gradient Sampling
With Second Phase of Gradient Sampling
The Clarke Subdifferential
Note that 0 ∈ ∂f(x) at x = [1; 1]^T
Grad. Samp.: A Stabilized Steepest Descent Method
Convergence of Gradient Sampling Method
Extension to Problems with Nonsmooth Constraints

8 / 59

The Gradient Sampling Method

Fix sample size m ≥ n + 1, line search parameter β ∈ (0, 1), reduction factors µ ∈ (0, 1) and θ ∈ (0, 1).

Initialize sampling radius ǫ > 0, tolerance τ > 0, iterate x.

Repeat (outer loop)

■ Repeat (inner loop: gradient sampling with fixed ǫ):

  ◆ Set G = {∇f(x), ∇f(x + ǫu1), . . . , ∇f(x + ǫum)}, sampling u1, . . . , um from the unit ball
  ◆ Set g = argmin{‖g‖ : g ∈ conv(G)}
  ◆ If ‖g‖ > τ, do a backtracking line search: set d = −g/‖g‖ and replace x by x + td, with t ∈ {1, 1/2, 1/4, . . .} and f(x + td) < f(x) − βt‖g‖

■ until ‖g‖ ≤ τ.

■ New phase: set ǫ = µǫ and τ = θτ.

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.
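A minimal, unoptimized Python sketch of one way to implement this method. The uniform sampling in the ball, the use of SciPy's SLSQP to solve the small QP for the minimum-norm element of conv(G), the safeguard on the line search, and all default parameter values are illustrative assumptions rather than part of the slide:

import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(G):
    """Minimum-norm element of conv{rows of G}: minimize ||lam @ G||^2
    over lam >= 0, sum(lam) = 1 (a small QP, handed here to SLSQP)."""
    k = G.shape[0]
    obj = lambda lam: np.dot(lam @ G, lam @ G)
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    res = minimize(obj, np.full(k, 1.0 / k), bounds=[(0.0, 1.0)] * k, constraints=cons)
    return res.x @ G

def gradient_sampling(f, grad, x, m=None, beta=1e-4, mu=0.1, theta=0.1,
                      eps=0.1, tau=1e-6, n_phases=5, max_inner=200):
    n = x.size
    m = m or n + 1
    for _ in range(n_phases):                          # outer loop
        for _ in range(max_inner):                     # inner loop, fixed eps
            U = np.random.randn(m, n)                  # m points uniform in the unit ball
            U /= np.linalg.norm(U, axis=1, keepdims=True)
            U *= np.random.rand(m, 1) ** (1.0 / n)
            G = np.vstack([grad(x)] + [grad(x + eps * u) for u in U])
            g = min_norm_in_hull(G)
            gnorm = np.linalg.norm(g)
            if gnorm <= tau:                           # inner loop done for this radius
                break
            d = -g / gnorm
            t = 1.0                                    # backtracking line search
            while f(x + t * d) >= f(x) - beta * t * gnorm and t > 1e-12:
                t *= 0.5
            x = x + t * d
        eps *= mu                                      # new phase: shrink radius and tolerance
        tau *= theta
    return x

On the earlier example f(x) = 10|x2 − x1^2| + (1 − x1)^2, this sketch typically makes steady progress toward the nonsmooth minimizer at [1; 1]^T, where steepest descent stalls.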

9 / 59

With First Phase of Gradient Sampling

[Figure: f(x) = 10|x2 − x1^2| + (1 − x1)^2 over [−2, 2] × [−2, 2]; first phase of gradient sampling]

10 / 59

With Second Phase of Gradient Sampling

[Figure: f(x) = 10|x2 − x1^2| + (1 − x1)^2 over [−2, 2] × [−2, 2]; second phase of gradient sampling]

11 / 59

The Clarke Subdifferential

Assume f : Rn → R is locally Lipschitz, and let D = {x ∈ Rn : f is differentiable at x}.

Rademacher's Theorem: Rn\D has measure zero.

The Clarke subdifferential of f at x is

    ∂f(x) = conv { lim ∇f(y) : y → x, y ∈ D }.

F.H. Clarke, 1973 (he used the name “generalized gradient”).

If f is continuously differentiable at x, then ∂f(x) = {∇f(x)}.

If f is convex, ∂f is the subdifferential of convex analysis.
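For example, for f(x) = |x| on R, D = R \ {0} and the gradients near 0 are ±1, so ∂f(0) = conv{−1, +1} = [−1, 1]; in particular 0 ∈ ∂f(0), consistent with 0 being the minimizer.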

We say x is Clarke stationary for f if 0 ∈ ∂f(x).

Key point: the convex hull of the set G generated by Gradient Sampling is a surrogate for ∂f.

12 / 59

Note that 0 ∈ ∂f(x) at x = [1; 1]^T

[Figure: f(x) = 10|x2 − x1^2| + (1 − x1)^2 over [−2, 2] × [−2, 2], near the minimizer x = [1; 1]^T]

13 / 59

Grad. Samp.: A Stabilized Steepest Descent Method

Lemma. Let G be a compact convex set. Then

    −dist(0, G) = min_{‖d‖≤1} max_{g∈G} g^T d.

Proof.

    −dist(0, G) = −min_{g∈G} ‖g‖
                = −min_{g∈G} max_{‖d‖≤1} g^T d
                = −max_{‖d‖≤1} min_{g∈G} g^T d    (min and max may be exchanged: both sets are compact and convex and g^T d is bilinear)
                = −max_{‖d‖≤1} min_{g∈G} g^T (−d)    (the unit ball is symmetric under d ↦ −d)
                = min_{‖d‖≤1} max_{g∈G} g^T d.

Note: the distance is nonnegative, and zero iff 0 ∈ G. Otherwise, equality is attained by g = ΠG(0), d = −g/‖g‖.

Ordinary steepest descent: G = {∇f(x)}.
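A quick worked instance: take G = conv{(1, 0), (0, 1)}, the segment joining the two standard unit vectors in R2. Then ΠG(0) = (1/2, 1/2) and dist(0, G) = 1/√2, and with d = −(1, 1)/√2 every g = (λ, 1 − λ) ∈ G gives g^T d = −1/√2, so the right-hand side attains the value −1/√2 = −dist(0, G) at this d, as claimed.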

14 / 59

Convergence of Gradient Sampling Method

Suppose that f : Rn → R

■ is locally Lipschitz
■ is continuously differentiable on an open dense subset of Rn
■ has bounded level sets

Then, with probability one, the line search always terminates, f is differentiable at every iterate x, and if the sequence of iterates {x} converges to some point x̄, then, with probability one

■ the inner loop always terminates, so the sequences of sampling radii {ǫ} and tolerances {τ} converge to zero, and
■ x̄ is Clarke stationary for f, i.e., 0 ∈ ∂f(x̄).

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.

Drop the assumption that f has bounded level sets. Then, with probability one, either the sequence {f(x)} → −∞, or every cluster point of the sequence of iterates {x} is Clarke stationary.

K.C. Kiwiel, SIOPT, 2007.

15 / 59

Extension to Problems with Nonsmooth Constraints

min f(x)

subject to ci(x) ≤ 0, i = 1, . . . , p

where f and c1, . . . , cp are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming gradient sampling method with convergence theory.

F.E. Curtis and M.L.O., SIOPT, 2012.

16 / 59

Quasi-Newton Methods

Bill Davidon
Fletcher and Powell
BFGS
The BFGS Method ("Full" Version)
BFGS for Nonsmooth Optimization
With BFGS
Example: Minimizing a Product of Eigenvalues
BFGS from 10 Randomly Generated Starting Points
Evolution of Eigenvalues of A ◦ X
Evolution of Eigenvalues of H
Regularity
Partly Smooth Functions
Same Example Again
Relation of Partial Smoothness to Earlier Work

Bill Davidon

17 / 59

W. Davidon, a physicist at Argonne, had the breakthrough idea in 1959: since it is too expensive to compute and factor the Hessian ∇²f(x) at every iteration, update an approximation to its inverse using information from gradient differences, and multiply this onto the negative gradient to approximate Newton's method.

Each inverse Hessian approximation differs from the previous one by a rank-two correction.

Ahead of its time: the paper was rejected by the physics journals, but published 30 years later in the first issue of SIAM J. Optimization.

Davidon was a well-known anti-war activist during the Vietnam War. In December 2013, it was revealed that he was the mastermind behind the break-in at the FBI office in Media, PA, on March 8, 1971, during the Muhammad Ali - Joe Frazier world heavyweight boxing championship.

Fletcher and Powell

18 / 59

In 1963, R. Fletcher and M.J.D. Powell improved Davidon's method and established convergence for convex quadratic functions.

They applied it to solve problems in 100 variables: a lot at the time.

The method became known as the DFP method.

BFGS

19 / 59

In 1970, C.G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno all independently proposed the BFGS method, which is a kind of dual of the DFP method. It was soon recognized that this was a remarkably effective method for smooth optimization.

In 1973, C.G. Broyden, J.E. Dennis and J.J. More proved generic local superlinear convergence of BFGS, DFP and other quasi-Newton methods.

In 1975, M.J.D. Powell established convergence of BFGS with an inexact Armijo-Wolfe line search for a general class of smooth convex functions. In 1987, R.H. Byrd, J. Nocedal and Y.-X. Yuan extended this to the whole "Broyden" class of methods interpolating BFGS and DFP, except for the DFP end point.

Pathological counterexamples to convergence in the smooth, nonconvex case are known to exist (Y.-H. Dai, 2002, 2013; W. Mascarenhas, 2004), but it is widely accepted that the method works well in practice in the smooth, nonconvex case.

The BFGS Method ("Full" Version)

20 / 59

Choose line search parameters 0 < β < γ < 1.

Initialize iterate x and a positive-definite symmetric matrix H (which is supposed to approximate the inverse Hessian of f).

Repeat

■ Set d = −H∇f(x). Let α = ∇f(x)ᵀd < 0
■ Armijo-Wolfe line search: find t so that
  f(x + td) < f(x) + βtα and ∇f(x + td)ᵀd > γα
■ Set s = td, y = ∇f(x + td) − ∇f(x)
■ Replace x by x + td
■ Replace H by V H Vᵀ + (1/(sᵀy)) s sᵀ, where V = I − (1/(sᵀy)) s yᵀ

Note that the new H can be computed in O(n²) operations since V is a rank-one perturbation of the identity.

The Armijo condition ensures "sufficient decrease" in f.

The Wolfe condition ensures that the directional derivative along the line increases algebraically, which guarantees that sᵀy > 0 and that the new H is positive definite.
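For concreteness, a minimal NumPy sketch of the update step (an illustration of the formula above, not code from the lecture); expanding V H Vᵀ into outer products is what keeps the cost at O(n²):

    import numpy as np

    def bfgs_update(H, s, y):
        # One BFGS update of the inverse Hessian approximation:
        #   H <- V H V^T + (1/(s^T y)) s s^T,  with  V = I - (1/(s^T y)) s y^T,
        # written out in outer products so that the cost is O(n^2).
        rho = 1.0 / (s @ y)          # s^T y > 0 is guaranteed by the Wolfe condition
        Hy = H @ y
        return (H
                - rho * (np.outer(s, Hy) + np.outer(Hy, s))
                + rho * (rho * (y @ Hy) + 1.0) * np.outer(s, s))

Here s = td and y = ∇f(x + td) − ∇f(x) are exactly the quantities computed in the loop above.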

BFGS for Nonsmooth Optimization

21 / 59

In 1982, C. Lemarechal observed that quasi-Newton methods can be effective for nonsmooth optimization, but dismissed them as there was no theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject until A.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues in detail, but our convergence results are limited to special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist on reducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse "Hessian" approximation, with some tiny eigenvalues converging to zero, corresponding to "infinitely large" curvature in the directions defined by the associated eigenvectors.

Remarkably, the condition number of the inverse Hessian approximation typically reaches 10^16 before the method breaks down.

We have never seen convergence to non-stationary points that cannot be explained by numerical difficulties.

The convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth case.
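A minimal sketch of such an Armijo-Wolfe line search (a simple doubling-and-bisection bracketing scheme, given only as an illustration, not the authors' actual code):

    import numpy as np

    def armijo_wolfe(f, grad, x, d, beta=1e-4, gamma=0.5, max_iter=50):
        # Find t > 0 with  f(x + t d) < f(x) + beta*t*alpha   (Armijo)
        #            and   grad(x + t d)' d > gamma*alpha     (Wolfe),
        # where alpha = grad(x)' d < 0.  The Wolfe condition only asks that the
        # directional derivative increase algebraically, not that it shrink in size.
        alpha = grad(x) @ d
        f0 = f(x)
        a, b, t = 0.0, np.inf, 1.0
        for _ in range(max_iter):
            if f(x + t * d) >= f0 + beta * t * alpha:
                b = t                                   # Armijo fails: step too long
            elif grad(x + t * d) @ d <= gamma * alpha:
                a = t                                   # Wolfe fails: step too short
            else:
                return t                                # both conditions hold
            t = 2.0 * t if np.isinf(b) else 0.5 * (a + b)
        return t                                        # give up after max_iter trials

Here grad is assumed to return a gradient at the trial point; for locally Lipschitz f this exists almost everywhere, so in practice the trial points land where f is differentiable.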

With BFGS

22 / 59

[Figure: contours of f(x) = 10|x2 − x1^2| + (1 − x1)^2 on [−2, 2] × [−2, 2], showing the starting point, the optimal point, and the iterates of steepest descent, gradient sampling (first and second phase), and BFGS.]
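For reference, a sketch of the plotted test function and a gradient valid off the parabola x2 = x1^2; these can be fed directly to the line-search and update sketches above:

    import numpy as np

    def f(x):
        # f(x) = 10 |x2 - x1^2| + (1 - x1)^2, nonsmooth on the parabola x2 = x1^2
        return 10.0 * abs(x[1] - x[0]**2) + (1.0 - x[0])**2

    def grad(x):
        s = np.sign(x[1] - x[0]**2)   # zero exactly on the parabola, where f is not differentiable
        return np.array([-20.0 * s * x[0] - 2.0 * (1.0 - x[0]), 10.0 * s])

The minimizer x = (1, 1)ᵀ lies on the parabola, so f is not differentiable there.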

Example: Minimizing a Product of Eigenvalues

23 / 59

Let S^N denote the space of real symmetric N × N matrices, and

λ1(X) ≥ λ2(X) ≥ · · · ≥ λN(X)

denote the eigenvalues of X ∈ S^N. We wish to minimize

f(X) = log ∏_{i=1}^{N/2} λi(A ∘ X)

where A ∈ S^N is fixed and ∘ is the Hadamard (componentwise) matrix product, subject to the constraints that X is positive semidefinite and has diagonal entries equal to 1.

If we replace ∏ by ∑, we would have a semidefinite program.

Since f is not convex, we may as well replace X by Y Yᵀ where Y ∈ R^{N×N}: this eliminates the psd constraint, and it is then also easy to eliminate the diagonal constraint.

Application: entropy minimization in an environmental application (K.M. Anstreicher and J. Lee, 2004).
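A sketch of the resulting unconstrained objective in the variable Y (the row normalization used here to enforce the unit diagonal is one common choice, an assumption rather than a detail from the slide):

    import numpy as np

    def eig_product_obj(Y, A):
        # log of the product of the N/2 largest eigenvalues of A o (Y Y^T),
        # with the rows of Y scaled so that X = Y Y^T has unit diagonal.
        Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        X = Yn @ Yn.T                      # positive semidefinite with diag(X) = 1
        lam = np.linalg.eigvalsh(A * X)    # A * X is the Hadamard product; ascending order
        k = A.shape[0] // 2
        return np.sum(np.log(lam[-k:]))    # sum of logs = log of the product of the k largest

The log is well defined as long as those eigenvalues stay positive, as they do in the runs shown on the following slides.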

BFGS from 10 Randomly Generated Starting Points

24 / 59

[Figure: log eigenvalue product, N = 20, n = 400, fopt = −4.37938; semilog plot of f − fopt against iteration number for the 10 runs, vertical axis from 10^−15 to 10^5, horizontal axis up to about 1400 iterations.]

f − fopt, where fopt is the least value of f found over all runs.

Evolution of Eigenvalues of A ∘ X

25 / 59

[Figure: eigenvalues of A ∘ X against iteration number, values between roughly 0.2 and 1.8; title: log eigenvalue product, N = 20, n = 400, fopt = −4.37938.]

Note that λ6(X), . . . , λ14(X) coalesce.

Evolution of Eigenvalues of H

26 / 59

[Figure: eigenvalues of H against iteration number, on a log scale from 10^−15 to 10^5; title: log eigenvalue product, N = 20, n = 400, fopt = −4.37938.]

44 eigenvalues of H converge to zero... why???

Regularity

27 / 59

A locally Lipschitz, directionally differentiable function f is regular (Clarke, 1970s) near a point x when its directional derivative x ↦ f′(x; d) is upper semicontinuous there for every fixed direction d.

In this case 0 ∈ ∂f(x) is equivalent to the first-order optimality condition f′(x; d) ≥ 0 for all directions d.

■ All convex functions are regular
■ All smooth functions are regular
■ Nonsmooth concave functions are not regular
  Example: f(x) = −|x|

Note: this is simpler than the definition of regularity at a point given in last week's lecture.
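A short check of the concave example (a worked computation, not on the slide): for f(x) = −|x|,

\[
f'(x; d) =
\begin{cases}
-d, & x > 0,\\
\phantom{-}d, & x < 0,\\
-|d|, & x = 0,
\end{cases}
\qquad\text{so}\qquad
\limsup_{x \to 0} f'(x; d) = |d| > -|d| = f'(0; d) \quad\text{for } d \neq 0,
\]

so x ↦ f′(x; d) is not upper semicontinuous at 0, and indeed 0 ∈ ∂f(0) = [−1, 1] even though f′(0; d) < 0 for every d ≠ 0: the equivalence above fails without regularity.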

Partly Smooth Functions

28 / 59

A regular function f is partly smooth at x relative to a manifold M containing x (A.S. Lewis, 2003) if

■ its restriction to M is twice continuously differentiable near x
■ the Clarke subdifferential ∂f is continuous on M near x
■ par ∂f(x), the subspace parallel to the affine hull of the subdifferential of f at x, is exactly the subspace normal to M at x.

We refer to par ∂f(x) as the V-space for f at x (with respect to M), and to its orthogonal complement, the subspace tangent to M at x, as the U-space for f at x.

When we refer to the V-space and U-space without reference to a point x, we mean at a minimizer.

For nonzero y in the V-space, the mapping t ↦ f(x + ty) is necessarily nonsmooth at t = 0, while for nonzero y in the U-space, t ↦ f(x + ty) is differentiable at t = 0 as long as f is locally Lipschitz.

Page 118: Nonsmooth, Nonconvex OptimizationNonsmooth, Nonconvex Optimization Introduction Nonsmooth, Nonconvex Optimization Example Methods Suitable for Nonsmooth Functions Failure of Steepest

Same Example Again


[Figure: contours of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2, showing the steepest descent, gradient sampling (1st and 2nd phase) and BFGS iterates from the starting point to the optimal point.]
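Here the V- and U-spaces can be written down by hand: the manifold is M = {x : x_2 = x_1^2}, whose tangent at the minimizer (1, 1) is spanned by (1, 2) and whose normal is spanned by (−2, 1). Below is a minimal numerical sketch of the two one-dimensional slices (assuming only NumPy; the directions and step sizes are illustrative choices, not from the slides):

```python
import numpy as np

# f(x) = 10|x2 - x1^2| + (1 - x1)^2, partly smooth at x* = (1, 1)
# with respect to M = {x : x2 = x1^2}.
def f(x):
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

xstar = np.array([1.0, 1.0])
u = np.array([1.0, 2.0]) / np.sqrt(5.0)    # tangent to M at x*: U-space direction
v = np.array([-2.0, 1.0]) / np.sqrt(5.0)   # normal to M at x*:  V-space direction

for name, d in [("U", u), ("V", v)]:
    one_sided = [(f(xstar + t * d) - f(xstar)) / t for t in (1e-6, -1e-6)]
    print(name, one_sided)
# Along u both one-sided slopes are ~0, so the slice is differentiable at t = 0;
# along v they are roughly +10*sqrt(5) and -10*sqrt(5): a kink at t = 0.
```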


Relation of Partial Smoothness to Earlier Work


Partial smoothness is closely related to earlier work of J.V. Burke and J.J. More (1990, 1994) and S.J. Wright (1993) on identification of constraint structure by algorithms.


When f is convex, the partly smooth nomenclature is consistent with the usage of V-space and U-space by C. Lemarechal, F. Oustry and C. Sagastizabal (2000), but partial smoothness does not imply convexity and convexity does not imply partial smoothness.


Why Did 44 Eigenvalues of H Converge to Zero?


The eigenvalue product is partly smooth with respect to the manifold of matrices with an eigenvalue with given multiplicity.


Recall that at the computed minimizer,

λ_6(A ◦ X) ≈ · · · ≈ λ_14(A ◦ X).

Matrix theory says that imposing multiplicity m on an eigenvalue of a matrix ∈ S^N is m(m+1)/2 − 1 conditions, or 44 when m = 9, so the dimension of the V-space at this minimizer is 44.
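As a quick sanity check on the dimension count (plain Python, not from the slides):

```python
# Forcing an eigenvalue of a symmetric matrix to have multiplicity m
# removes m(m+1)/2 - 1 degrees of freedom, independently of N.
def v_space_dimension(m: int) -> int:
    return m * (m + 1) // 2 - 1

print(v_space_dimension(9))   # 44: the multiplicity-9 cluster seen above
print(v_space_dimension(2))   # 2:  even a double eigenvalue costs two conditions
```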


Thus BFGS automatically detected the U and V space partitioning without knowing anything about the mathematical structure of f!


Variation of f from Minimizer, along EigVecs of H


[Figure: f(x_opt + t w) − f_opt versus t for the log eigenvalue product example (N = 20, n = 400, f_opt ≈ −4.37938), where w is the eigenvector for eigenvalue 10, 20, 30, 40, 50 or 60 of the final H, eigenvalues of H numbered smallest to largest.]


Challenge: Convergence of BFGS in Nonsmooth Case


Assume f is locally Lipschitz with bounded level sets and is semi-algebraic.

Assume the initial x and H are generated randomly (e.g. from normal and Wishart distributions).

Prove or disprove that the following hold with probability one:

1. BFGS generates an infinite sequence {x} with f differentiable at all iterates
2. Any cluster point x̄ is Clarke stationary
3. The sequence of function values generated (including all of the line search iterates) converges to f(x̄) R-linearly
4. If {x} converges to x̄ where f is "partly smooth" w.r.t. a manifold M, then the subspace defined by the eigenvectors corresponding to eigenvalues of H converging to zero converges to the "V-space" of f w.r.t. M at x̄
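There is no proof, but the conjectures are easy to probe numerically. Below is a rough empirical sketch on the earlier two-variable example, whose V-space at the minimizer (1, 1) is one-dimensional (assuming SciPy; note that SciPy's BFGS uses its own line search and finite-difference gradients rather than the line search discussed in this talk, so it may stop with a line-search warning rather than by satisfying the gradient test):

```python
import numpy as np
from scipy.optimize import minimize

# f(x) = 10|x2 - x1^2| + (1 - x1)^2: nonsmooth on the manifold x2 = x1^2.
def f(x):
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

rng = np.random.default_rng(0)
for _ in range(3):
    res = minimize(f, rng.standard_normal(2), method="BFGS",
                   options={"gtol": 1e-12, "maxiter": 5000})
    eigs = np.sort(np.linalg.eigvalsh(res.hess_inv))
    print(res.x, res.fun, eigs)
# Typically the final point is close to (1, 1) and one eigenvalue of the inverse
# Hessian approximation is far smaller than the other, loosely mirroring
# conjecture 4: the small-eigenvalue eigenvector lines up with the V-space.
```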


Extensions of BFGS for Nonsmooth Optimization


A combined BFGS-Gradient Sampling method with convergence theory (F.E. Curtis and X. Que, 2015)

Constrained Problems

min f(x)
subject to c_i(x) ≤ 0,   i = 1, . . . , p

where f and c_1, . . . , c_p are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming (SQP) BFGS method applied to challenging problems in static-output-feedback control design (F.E. Curtis, T. Mitchell and M.L.O., 2015).

Although there are no theoretical results, it is much more efficient and effective than the SQP Gradient Sampling method, which does have convergence results.


Some Difficult Examples



A Rosenbrock Function


Consider a generalization of the Rosenbrock (1960) function:

R_p(x) = (1/4)(x_1 − 1)^2 + ∑_{i=1}^{n−1} |x_{i+1} − x_i^2|^p,   where p > 0.

The unique minimizer is x∗ = [1, 1, . . . , 1]^T with R_p(x∗) = 0.

Define x̄ = [−1, 1, 1, . . . , 1]^T with R_p(x̄) = 1 and the manifold

M_R = {x : x_{i+1} = x_i^2, i = 1, . . . , n − 1}

For x ∈ M_R, e.g. x = x∗ or x = x̄, the 2nd term of R_p is zero. Starting at x̄, BFGS needs to approximately follow M_R to reach x∗ (unless it "gets lucky").

When p = 2: R_2 is smooth but not convex. Starting at x̄:

■ n = 5: BFGS needs 43 iterations to reduce R_2 below 10^−15
■ n = 10: BFGS needs 276 iterations to reduce R_2 below 10^−15.
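A sketch of the p = 2 experiment (assuming SciPy; the iteration counts will not match the 43 and 276 quoted above exactly, since SciPy's BFGS, its line search and its finite-difference gradients differ from the implementation used for these slides):

```python
import numpy as np
from scipy.optimize import minimize

# Generalized Rosenbrock function R_p; p = 2 is the smooth case discussed here,
# p = 1 gives the nonsmooth variant on the next slide.
def R(x, p=2.0):
    x = np.asarray(x, dtype=float)
    return 0.25 * (x[0] - 1.0) ** 2 + np.sum(np.abs(x[1:] - x[:-1] ** 2) ** p)

for n in (5, 10):
    xbar = np.ones(n)
    xbar[0] = -1.0                          # starting point x̄ = (-1, 1, ..., 1)
    res = minimize(R, xbar, method="BFGS",
                   options={"gtol": 1e-10, "maxiter": 10000})
    print(n, res.nit, res.fun)              # iterations used and final value of R_2
```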


The Nonsmooth Variant of the Rosenbrock Function


R_1(x) = (1/4)(x_1 − 1)^2 + ∑_{i=1}^{n−1} |x_{i+1} − x_i^2|

R_1 is nonsmooth (but locally Lipschitz) as well as nonconvex. The second term is still zero on the manifold M_R, but R_1 is not differentiable on M_R.

However, R_1 is regular at x ∈ M_R and partly smooth at x w.r.t. M_R.

We cannot initialize BFGS at x̄, so starting at normally distributed random points:

■ n = 5: BFGS reduces R_1 only to about 1 × 10^−3 in 1000 iterations
■ n = 10: BFGS reduces R_1 only to about 7 × 10^−4 in 1000 iterations

Again the method appears to be converging, very slowly, but may be having numerical difficulties.


An Aside: Chebyshev Polynomials


A sequence of orthogonal polynomials defined on [−1, 1] by

T_0(x) = 1,   T_1(x) = x,   T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).

So T_2(x) = 2x^2 − 1, T_3(x) = 4x^3 − 3x, etc.

Important properties that can be proved easily include

■ T_n(x) = cos(n cos^{−1}(x))
■ T_m(T_n(x)) = T_{mn}(x)
■ ∫_{−1}^{1} (1/√(1 − x^2)) T_i(x) T_j(x) dx = 0 if i ≠ j
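These identities are easy to confirm numerically. A minimal sketch built from the recurrence (assuming NumPy; the grid size and degrees are arbitrary illustrative choices):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 1001)

# Build T_0, ..., T_8 from the three-term recurrence.
T = [np.ones_like(x), x.copy()]            # T_0(x) = 1, T_1(x) = x
for n in range(1, 8):
    T.append(2.0 * x * T[n] - T[n - 1])    # T_{n+1} = 2x T_n - T_{n-1}

# T_n(x) = cos(n arccos x)
for n, Tn in enumerate(T):
    assert np.allclose(Tn, np.cos(n * np.arccos(x)))

# Composition: T_2(T_4(x)) = T_8(x)
assert np.allclose(2.0 * T[4] ** 2 - 1.0, T[8])
print("recurrence, cosine formula and composition identity all agree")
```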


Plots of Chebyshev Polynomials


[Figure. Left: plots of T_0(x), . . . , T_4(x) on [−1, 1]. Right: plot of T_8(x).]


Question: How many extrema does T_n(x) have in [−1, 1]?


Nesterov’s Chebyshev-Rosenbrock Functions


Consider the function

N_p(x) = (1/4)(x_1 − 1)^2 + ∑_{i=1}^{n−1} |x_{i+1} − 2x_i^2 + 1|^p,   where p > 0

The unique minimizer is x∗ = [1, 1, . . . , 1]^T with N_p(x∗) = 0.

Define x̄ = [−1, 1, 1, . . . , 1]^T with N_p(x̄) = 1 and the manifold

M_N = {x : x_{i+1} = 2x_i^2 − 1, i = 1, . . . , n − 1}

For x ∈ M_N, e.g. x = x∗ or x = x̄, the 2nd term of N_p is zero. Starting at x̄, BFGS needs to approximately follow M_N to reach x∗ (unless it "gets lucky").

When p = 2: N_2 is smooth but not convex. Starting at x̄:

■ n = 5: BFGS needs 370 iterations to reduce N_2 below 10^−15
■ n = 10: BFGS needs ∼ 50,000 iterations to reduce N_2 below 10^−15, even though N_2 is smooth!


Why BFGS Takes So Many Iterations to Minimize N2

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

A RosenbrockFunctionThe NonsmoothVariant of theRosenbrock FunctionAn Aside:ChebyshevPolynomials

Plots of ChebyshevPolynomials

Nesterov’sChebyshev-RosenbrockFunctionsWhy BFGS Takes SoMany Iterations toMinimize N2

First NonsmoothVariant ofNesterov’s FunctionSecond NonsmoothVariant ofNesterov’s FunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond Nonsmooth

41 / 59

Let Ti(x) denote the ith Chebyshev polynomial. For x ∈ MN ,

xi+1 = 2x2i − 1 = T2(xi) = T2(T2(xi−1))

Page 169: Nonsmooth, Nonconvex OptimizationNonsmooth, Nonconvex Optimization Introduction Nonsmooth, Nonconvex Optimization Example Methods Suitable for Nonsmooth Functions Failure of Steepest

Why BFGS Takes So Many Iterations to Minimize N2

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

A RosenbrockFunctionThe NonsmoothVariant of theRosenbrock FunctionAn Aside:ChebyshevPolynomials

Plots of ChebyshevPolynomials

Nesterov’sChebyshev-RosenbrockFunctionsWhy BFGS Takes SoMany Iterations toMinimize N2

First NonsmoothVariant ofNesterov’s FunctionSecond NonsmoothVariant ofNesterov’s FunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond Nonsmooth

41 / 59

Let Ti(x) denote the ith Chebyshev polynomial. For x ∈ MN ,

xi+1 = 2x2i − 1 = T2(xi) = T2(T2(xi−1))

= T2(T2(. . . T2(x1) . . .)) = T2i(x1).

Page 170: Nonsmooth, Nonconvex OptimizationNonsmooth, Nonconvex Optimization Introduction Nonsmooth, Nonconvex Optimization Example Methods Suitable for Nonsmooth Functions Failure of Steepest

Why BFGS Takes So Many Iterations to Minimize N2

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

A RosenbrockFunctionThe NonsmoothVariant of theRosenbrock FunctionAn Aside:ChebyshevPolynomials

Plots of ChebyshevPolynomials

Nesterov’sChebyshev-RosenbrockFunctionsWhy BFGS Takes SoMany Iterations toMinimize N2

First NonsmoothVariant ofNesterov’s FunctionSecond NonsmoothVariant ofNesterov’s FunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond Nonsmooth

41 / 59

Let Ti(x) denote the ith Chebyshev polynomial. For x ∈ MN ,

xi+1 = 2x2i − 1 = T2(xi) = T2(T2(xi−1))

= T2(T2(. . . T2(x1) . . .)) = T2i(x1).

To move from x to x∗ along the manifold MN exactly requires

Page 171: Nonsmooth, Nonconvex OptimizationNonsmooth, Nonconvex Optimization Introduction Nonsmooth, Nonconvex Optimization Example Methods Suitable for Nonsmooth Functions Failure of Steepest

Why BFGS Takes So Many Iterations to Minimize N2

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

A RosenbrockFunctionThe NonsmoothVariant of theRosenbrock FunctionAn Aside:ChebyshevPolynomials

Plots of ChebyshevPolynomials

Nesterov’sChebyshev-RosenbrockFunctionsWhy BFGS Takes SoMany Iterations toMinimize N2

First NonsmoothVariant ofNesterov’s FunctionSecond NonsmoothVariant ofNesterov’s FunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond Nonsmooth

41 / 59

Let Ti(x) denote the ith Chebyshev polynomial. For x ∈ MN ,

xi+1 = 2x2i − 1 = T2(xi) = T2(T2(xi−1))

= T2(T2(. . . T2(x1) . . .)) = T2i(x1).

To move from x to x∗ along the manifold MN exactly requires

■ x1 to change from −1 to 1

Page 172: Nonsmooth, Nonconvex OptimizationNonsmooth, Nonconvex Optimization Introduction Nonsmooth, Nonconvex Optimization Example Methods Suitable for Nonsmooth Functions Failure of Steepest

Why BFGS Takes So Many Iterations to Minimize N2


Let T_i(x) denote the ith Chebyshev polynomial. For x ∈ M_N,

x_{i+1} = 2x_i^2 − 1 = T_2(x_i) = T_2(T_2(x_{i−1})) = T_2(T_2(· · · T_2(x_1) · · ·)) = T_{2^i}(x_1).

To move from x to x∗ along the manifold M_N exactly requires

■ x_1 to change from −1 to 1
■ x_2 = 2x_1^2 − 1 to trace the graph of T_2(x_1) on [−1, 1]
■ x_3 = T_2(T_2(x_1)) to trace the graph of T_4(x_1) on [−1, 1]
■ ...
■ x_n = T_{2^{n−1}}(x_1) to trace the graph of T_{2^{n−1}}(x_1) on [−1, 1],

which has 2^{n−1} − 1 extrema in (−1, 1).

Even though BFGS will not track the manifold M_N exactly, it will follow it approximately. So, since the manifold is highly oscillatory, BFGS must take relatively short steps to obtain a reduction in N2 in the line search, and hence it takes many iterations! At the very end, since N2 is smooth, BFGS is superlinearly convergent.
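To see the identity x_n = T_{2^{n−1}}(x_1) concretely, here is a small NumPy check (a sketch added for these notes, not part of the original slides), using the cosine form T_k(x) = cos(k arccos x) on [−1, 1]; the function name chain is ours.

    import numpy as np

    def chain(x1, n):
        # follow the manifold recursion x_{i+1} = 2*x_i**2 - 1, starting from x_1
        x = x1
        for _ in range(n - 1):
            x = 2.0 * x**2 - 1.0
        return x

    n = 6
    t = np.linspace(-1.0, 1.0, 7)
    lhs = chain(t, n)
    rhs = np.cos(2**(n - 1) * np.arccos(t))   # Chebyshev identity T_k(x) = cos(k arccos x)
    print(np.max(np.abs(lhs - rhs)))          # agreement to roundoff: x_n = T_{2^(n-1)}(x_1)

The rapid oscillation of T_{2^{n−1}} on [−1, 1], with its 2^{n−1} − 1 interior extrema, is exactly what forces the short steps described above.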

First Nonsmooth Variant of Nesterov’s Function


N1(x) = (1/4)(x_1 − 1)^2 + Σ_{i=1}^{n−1} |x_{i+1} − 2x_i^2 + 1|

N1 is nonsmooth (though locally Lipschitz) as well as nonconvex. The second term is still zero on the manifold M_N, but N1 is not differentiable on M_N.

However, N1 is regular at x ∈ M_N and partly smooth at x with respect to M_N.

We cannot initialize BFGS at such a point x, so, starting at normally distributed random points:

■ n = 5: BFGS reduces N1 only to about 5 × 10^{−3} in 1000 iterations
■ n = 10: BFGS reduces N1 only to about 2 × 10^{−2} in 1000 iterations

The method appears to be converging, very slowly, but may be having numerical difficulties.
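For experimentation, N1 can be transcribed directly from the formula above. The following NumPy sketch (added for these notes, not from the slides) evaluates it; it can be handed to any BFGS implementation with a line search suitable for nonsmooth functions.

    import numpy as np

    def N1(x):
        # first nonsmooth variant: (1/4)(x_1 - 1)^2 + sum_i |x_{i+1} - 2 x_i^2 + 1|
        x = np.asarray(x, dtype=float)
        return 0.25 * (x[0] - 1.0)**2 + np.sum(np.abs(x[1:] - 2.0 * x[:-1]**2 + 1.0))

    n = 5
    print(N1(np.ones(n)))            # 0.0 at the global minimizer x* = [1, ..., 1]
    print(N1(np.random.randn(n)))    # typical value at a random starting point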


Second Nonsmooth Variant of Nesterov’s Function


N̂1(x) = (1/4)|x_1 − 1| + Σ_{i=1}^{n−1} |x_{i+1} − 2|x_i| + 1|.

Again, the unique global minimizer is x∗. The second term is zero on the set

S = {x : x_{i+1} = 2|x_i| − 1, i = 1, . . . , n − 1},

but S is not a manifold: it has “corners”.
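A matching NumPy sketch for the second variant (again added for these notes; the names N1hat and on_S are ours), together with a membership test for S. The point [0, −1], discussed on a later slide, lies on S but is not the minimizer.

    import numpy as np

    def N1hat(x):
        # second nonsmooth variant: (1/4)|x_1 - 1| + sum_i |x_{i+1} - 2|x_i| + 1|
        x = np.asarray(x, dtype=float)
        return 0.25 * np.abs(x[0] - 1.0) + np.sum(np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0))

    def on_S(x, tol=1e-12):
        # S = {x : x_{i+1} = 2|x_i| - 1, i = 1, ..., n-1}
        x = np.asarray(x, dtype=float)
        return bool(np.all(np.abs(x[1:] - (2.0 * np.abs(x[:-1]) - 1.0)) <= tol))

    print(N1hat([1.0, 1.0]), on_S([1.0, 1.0]))      # 0.0 True   (global minimizer, n = 2)
    print(N1hat([0.0, -1.0]), on_S([0.0, -1.0]))    # 0.25 True  (on S, but not the minimizer)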

Contour Plots of the Nonsmooth Variants for n = 2


[Figure: two contour plots on the square −2 ≤ x_1 ≤ 2, −2 ≤ x_2 ≤ 2. Left panel: Nesterov–Chebyshev–Rosenbrock, first variant. Right panel: Nesterov–Chebyshev–Rosenbrock, second variant.]

Contour plots of the nonsmooth Chebyshev-Rosenbrock functions N1 (left) and N̂1 (right), with n = 2, with iterates generated by BFGS initialized at 7 different randomly generated points. On the left, we always get convergence to x∗ = [1, 1]^T. On the right, most runs converge to [1, 1]^T, but some go to x = [0, −1]^T.
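The contour plots themselves (without the BFGS iterates) are straightforward to reproduce. A minimal matplotlib sketch added for these notes, using the n = 2 formulas from the preceding slides:

    import numpy as np
    import matplotlib.pyplot as plt

    def N1(x1, x2):       # first variant, n = 2
        return 0.25 * (x1 - 1.0)**2 + np.abs(x2 - 2.0 * x1**2 + 1.0)

    def N1hat(x1, x2):    # second variant, n = 2
        return 0.25 * np.abs(x1 - 1.0) + np.abs(x2 - 2.0 * np.abs(x1) + 1.0)

    g = np.linspace(-2.0, 2.0, 401)
    X1, X2 = np.meshgrid(g, g)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, f, title in [(axes[0], N1, "first variant"), (axes[1], N1hat, "second variant")]:
        ax.contour(X1, X2, f(X1, X2), levels=30)
        ax.set_title(title); ax.set_xlabel("x1"); ax.set_ylabel("x2")
    plt.show()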

Properties of the Second Nonsmooth Variant N̂1


When n = 2, the point x = [0, −1]^T is Clarke stationary for the second nonsmooth variant N̂1. We can see this because zero is in the convex hull of the gradient limits for N̂1 at the point x.

However, x = [0, −1]^T is not a local minimizer, because d = [1, 2]^T is a direction of linear descent: N̂1′(x, d) < 0.

These two properties mean that N̂1 is not regular at [0, −1]^T.

In fact, for n ≥ 2:

■ N̂1 has 2^{n−1} Clarke stationary points
■ the only local minimizer is the global minimizer x∗
■ x∗ is the only stationary point in the sense of Mordukhovich (i.e., with 0 ∈ ∂N̂1(x), where ∂ is defined in Rockafellar and Wets, Variational Analysis, 1998).

(M. Gurbuzbalaban and M.L.O., 2012)

Furthermore, starting from enough randomly generated starting points, BFGS finds all 2^{n−1} Clarke stationary points!
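The linear descent along d = [1, 2]^T at x = [0, −1]^T can be checked numerically with one-sided difference quotients; a small sketch added for these notes (reusing the N1hat definition above). The quotients settle at −1/4, so N̂1′(x, d) = −1/4 < 0.

    import numpy as np

    def N1hat(x):
        x = np.asarray(x, dtype=float)
        return 0.25 * np.abs(x[0] - 1.0) + np.sum(np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0))

    x = np.array([0.0, -1.0])     # Clarke stationary for n = 2
    d = np.array([1.0, 2.0])      # claimed descent direction
    for t in [1e-1, 1e-3, 1e-6]:
        print(t, (N1hat(x + t * d) - N1hat(x)) / t)   # -> -0.25 for all small t > 0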


Behavior of BFGS on the Second Nonsmooth Variant


[Figure: two plots of sorted final values of f (vertical axis, 0 to 0.5) against 1000 different starting points (horizontal axis). Left panel: Nesterov–Chebyshev–Rosenbrock, n = 5. Right panel: Nesterov–Chebyshev–Rosenbrock, n = 6.]

Left: sorted final values of N̂1 for 1000 randomly generated starting points, when n = 5: BFGS finds all 16 Clarke stationary points. Right: same with n = 6: BFGS finds all 32 Clarke stationary points.
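The experiment is easy to imitate in outline: minimize N̂1 from many random starts and sort the final values. The sketch below (added for these notes) uses SciPy's generic BFGS purely as a stand-in; the slides' results use the authors' BFGS with a line search designed for nonsmooth functions, so the plateaus need not match exactly.

    import numpy as np
    from scipy.optimize import minimize

    def N1hat(x):
        x = np.asarray(x, dtype=float)
        return 0.25 * np.abs(x[0] - 1.0) + np.sum(np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0))

    n, runs = 5, 200
    rng = np.random.default_rng(0)
    finals = np.sort([minimize(N1hat, rng.standard_normal(n), method="BFGS").fun
                      for _ in range(runs)])
    print(np.round(finals, 3))   # distinct plateaus correspond to distinct stationary values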

Convergence to Non-Locally-Minimizing Points


When f is smooth, convergence of methods such as BFGS to non-locally-minimizing stationary points or local maxima is possible but not likely, because of the line search, and such convergence will not be stable under perturbation.

However, this kind of convergence is what we are seeing for the non-regular, nonsmooth Nesterov Chebyshev-Rosenbrock example, and it is stable under perturbation. The same behavior occurs for gradient sampling or bundle methods.

Kiwiel (private communication): the Nesterov example is the first he had seen which causes his bundle code to have this behavior.

Nonetheless, we don’t know whether, in exact arithmetic, the methods would actually generate sequences converging to the nonminimizing Clarke stationary points. Experiments by Kaku (2011) suggest that the higher the precision used, the more likely BFGS is to eventually move away from such a point.


Limited Memory Methods


Limited Memory BFGS


“Full” BFGS requires storing an n × n matrix and doing matrix-vector multiplies, which is not possible when n is large.

In the 1980s, J. Nocedal and others developed a “limited memory” version of BFGS, with O(n) space and time requirements, which is very widely used for minimizing smooth functions in many variables. It works by saving only the most recent k rank-two updates to an initial inverse Hessian approximation.

There are two variants: with and without “scaling” (usually scaling is preferred).

The convergence rate of limited memory BFGS is linear, not superlinear, on smooth problems.

Question: how effective is it on nonsmooth problems?
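The mechanics behind "saving only the most recent k rank-two updates" are the standard two-loop recursion (as in Nocedal and Wright). The sketch below is a generic illustration added for these notes, not the authors' code; the scale flag corresponds to the "with scaling" variant, which takes the initial matrix H0 = (s^T y / y^T y) I instead of the identity.

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list, scale=True):
        # Two-loop recursion: apply the limited-memory inverse Hessian approximation
        # to grad and return the search direction. s_list, y_list hold the k most
        # recent pairs s_j = x_{j+1} - x_j, y_j = g_{j+1} - g_j (oldest first).
        q = grad.copy()
        alphas = []
        for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
            rho = 1.0 / np.dot(y, s)
            a = rho * np.dot(s, q)
            alphas.append(a)
            q -= a * y
        if scale and s_list:   # "scaling": H0 = (s^T y / y^T y) I; otherwise H0 = I
            gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):   # oldest to newest
            rho = 1.0 / np.dot(y, s)
            b = rho * np.dot(y, r)
            r += (a - b) * s
        return -r   # quasi-Newton search direction

Only the k stored pairs are needed, so each call costs O(kn) work and storage, compared with O(n^2) for full BFGS.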

Limited Memory BFGS on the Eigenvalue Product


[Figure: log eigenvalue product test problem, N = 20, r = 20, K = 10, n = 400, maxit = 400. Median reduction in f − f_opt over 10 runs (log scale, from 10^0 down to 10^{-5}) versus number of vectors k (0 to 50), for Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), and full BFGS (no scaling).]

Limited Memory is not nearly as good as full BFGS.

No significant improvement when k reaches 44.

A More Basic Example


Let x = [y; z; w] ∈ R^{nA+nB+nR} and consider the test function

f(x) = (y − e)^T A (y − e) + {(z − e)^T B (z − e)}^{1/2} + R1(w)

where A = A^T ≻ 0, B = B^T ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex, and the third is the nonsmooth, nonconvex Rosenbrock function.

The optimal value is 0, with x = e. The function f is partly smooth and the dimension of the V-space is nB + nR − 1.

Set A = XX^T, where the x_ij are normally distributed, with condition number about 10^6 when nA = 200. Similarly B, with nB < nA.

Besides limited memory BFGS and full BFGS, we also compare limited memory Gradient Sampling, where we sample k ≪ n gradients per iteration.
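A sketch of how such a test function can be assembled, added for these notes: the names make_spd and basic_example are ours, and R1 (the nonsmooth Rosenbrock-type term defined earlier in the slides) is passed in as a callable rather than re-derived here.

    import numpy as np

    def make_spd(dim, rng):
        # A = X X^T with independent standard normal entries, as described above
        X = rng.standard_normal((dim, dim))
        return X @ X.T

    def basic_example(nA, nB, nR, R1, seed=0):
        # f(x) = (y - e)^T A (y - e) + sqrt((z - e)^T B (z - e)) + R1(w),  x = [y; z; w]
        rng = np.random.default_rng(seed)
        A, B = make_spd(nA, rng), make_spd(nB, rng)
        def f(x):
            assert len(x) == nA + nB + nR
            y, z, w = x[:nA], x[nA:nA + nB], x[nA + nB:]
            ey, ez = y - 1.0, z - 1.0
            return ey @ A @ ey + np.sqrt(ez @ B @ ez) + R1(w)
        return f

    # usage with a *placeholder* R1 (hypothetical; substitute the nonsmooth Rosenbrock
    # variant defined earlier in the slides):
    f = basic_example(nA=20, nB=5, nR=2, R1=lambda w: np.sum(np.abs(w - 1.0)))
    print(f(np.ones(27)))   # 0.0 at x = e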

Smooth, Convex: nA = 200, nB = 0, nR = 1


[Figure: nA = 200, nB = 0, nR = 1, maxit = 201. Median reduction in f over 10 runs (log scale, 10^{-7} to 10^{-5}) versus number of vectors k (0 to 20), for Grad Samp 1e−02, 1e−04, Lim Mem BFGS (scaling), Lim Mem BFGS (no scaling), full BFGS (scaling once), and full BFGS (no scaling).]

LM-BFGS with scaling even better than full BFGS

Nonsmooth, Convex: nA = 200, nB = 10, nR = 1


[Figure: nA = 200, nB = 10, nR = 1, maxit = 211. Median reduction in f over 10 runs (log scale, 10^{-6} to 10^{-2}) versus number of vectors k (0 to 20), for the same five methods.]

LM-BFGS much worse than full BFGS

Nonsmooth, Nonconvex: nA = 200, nB = 10, nR = 5


[Figure: nA = 200, nB = 10, nR = 5, maxit = 215. Median reduction in f over 10 runs (log scale, 10^{-4} to 10^{-1}) versus number of vectors k (0 to 20), for the same five methods.]

LM-BFGS with scaling even worse than LM-Grad-Samp


Limited Effectiveness of Limited Memory BFGS


We see that the addition of nonsmoothness to a problem, convex or nonconvex, creates great difficulties for Limited Memory BFGS, even when the dimension of the V-space is less than the size of the memory, although it helps to turn off scaling. With scaling it may be no better than Limited Memory Gradient Sampling. More investigation of this is needed.

Other Ideas for Large Scale Nonsmooth Optimization


■ Exploit structure! Lots of work on this has been done.
■ Smoothing! Lots of work on this has been done too, most notably by Yu. Nesterov.
■ Bundle methods, pioneered by C. Lemarechal in the convex case and by K. Kiwiel in the 1980s in the nonconvex case, with lots of work done since, e.g. by P. Apkarian and D. Noll in small-scale control applications and by T.M.T. Do and T. Artieres in large-scale machine learning applications.
■ Lots of other recent work on nonconvexity in machine learning.
■ Adaptive Gradient Sampling (F.E. Curtis and X. Que).
■ Automatic Differentiation (AD): (B. Bell, A. Griewank).


Concluding Remarks


Summary


Gradient Sampling is a simple method for nonsmooth, nonconvex optimization for which a convergence theory is known, but it is too expensive to use in most applications.

BFGS (the full version) is remarkably effective on nonsmooth problems, but little theory is known. Our package HIFOO (H-infinity fixed order optimization) for controller design, primarily based on BFGS, has been used successfully in many applications.

Limited Memory BFGS is not so effective on nonsmooth problems, but it seems to help to turn off scaling.

Diabolical nonconvex problems such as Nesterov’s Chebyshev-Rosenbrock problems can be very difficult, especially in the nonsmooth case.

A Final Quote


“Nonconvexity is scary to some, but there are vastly different types of nonconvexity (some of which are really scary!)”

— Yann LeCun

Papers and software are available at www.cs.nyu.edu/overton.