
Nonsmooth, Nonconvex Optimization: Algorithms and Examples

Michael L. Overton
Courant Institute of Mathematical Sciences

New York University

Convex and Nonsmooth Optimization Class, Spring 2016, Final Lecture

Mostly based on my research work with Jim Burke and Adrian Lewis

Introduction

■ Introduction: Nonsmooth, Nonconvex Optimization; Example; Methods Suitable for Nonsmooth Functions; Failure of Steepest Descent: Simpler Example
■ Gradient Sampling
■ Quasi-Newton Methods
■ Some Difficult Examples
■ Limited Memory Methods
■ Concluding Remarks

Nonsmooth, Nonconvex Optimization


Problem: find x that locally minimizes f, where f : R^n → R is

■ Continuous
■ Not differentiable everywhere; in particular, often not differentiable at local minimizers
■ Not convex
■ Usually, but not always, locally Lipschitz: for all x there exists L_x such that |f(x + d) − f(x)| ≤ L_x ‖d‖ for small ‖d‖

Lots of interesting applications.

Any locally Lipschitz function is differentiable almost everywhere on its domain. So, with high probability, we can evaluate the gradient at any given point.

What happens if we simply use steepest descent (gradient descent) with a standard line search?

Example


[Figure: contour plot of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2 on [−2, 2] × [−2, 2], with the steepest descent iterates overlaid.]
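To make the experiment in the figure concrete, here is a minimal NumPy sketch of steepest descent with a backtracking (Armijo) line search on this f. The starting point, the Armijo parameter and the iteration count are choices of this sketch, not taken from the slides; printing or plotting the iterates shows them zigzagging along the curve x_2 = x_1^2, as in the figure.

```python
import numpy as np

def f(x):
    return 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

def gradf(x):
    # f is differentiable wherever x_2 != x_1^2, which holds with probability one
    s = np.sign(x[1] - x[0] ** 2)
    return np.array([-20.0 * s * x[0] - 2.0 * (1.0 - x[0]), 10.0 * s])

def steepest_descent(x, iters=50, beta=1e-4):
    path = [x.copy()]
    for _ in range(iters):
        g = gradf(x)
        d = -g
        alpha = g @ d                       # directional derivative, negative
        t = 1.0
        # backtracking (Armijo) line search: halve t until sufficient decrease
        while f(x + t * d) >= f(x) + beta * t * alpha and t > 1e-16:
            t *= 0.5
        x = x + t * d
        path.append(x.copy())
    return np.array(path)

path = steepest_descent(np.array([-1.0, -1.0]))   # assumed starting point
print(path[-5:])                                  # iterates hug the curve x_2 = x_1^2
```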

Methods Suitable for Nonsmooth Functions


In fact, it's been known for several decades that at any given iterate, we need to exploit the gradient information obtained at several points, not just at one point. Some such methods:

■ Bundle methods (C. Lemaréchal, K.C. Kiwiel, etc.): extensive practical use and theoretical analysis, but complicated in the nonconvex case

■ Gradient sampling: an easily stated method with nice convergence theory (J.V. Burke, A.S. Lewis, M.L.O., 2005; K.C. Kiwiel, 2007), but computationally intensive

■ BFGS: traditional workhorse for smooth optimization, works amazingly well for nonsmooth optimization too, but very limited convergence theory

Failure of Steepest Descent: Simpler Example


Let f(x) = 6|x_1| + 3x_2. Note that f is polyhedral and convex.

On this function, using a bisection-based backtracking line search with “Armijo” parameter in [0, 1/3] and starting at [2, 3]^T, steepest descent generates the sequence

    2^{−k} [2(−1)^k, 3]^T,   k = 1, 2, . . . ,

converging to [0, 0]^T.

In contrast, BFGS with the same line search rapidly reduces the function value towards −∞ (arbitrarily far, in exact arithmetic) (A.S. Lewis and S. Zhang, 2010).
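A few lines of NumPy reproduce this behavior (a sketch: here the direction is d = −∇f(x), the backtracking starts at t = 1 and halves, and the Armijo parameter is taken at the top of the stated range):

```python
import numpy as np

def f(x):
    return 6.0 * abs(x[0]) + 3.0 * x[1]

def gradf(x):
    return np.array([6.0 * np.sign(x[0]), 3.0])

x = np.array([2.0, 3.0])          # starting point [2, 3]^T from the slide
beta = 1.0 / 3.0                  # "Armijo" parameter; any value in [0, 1/3] gives the same iterates here
for k in range(1, 11):
    g = gradf(x)
    d = -g
    alpha = g @ d
    t = 1.0
    # bisection-style backtracking: halve t until f(x + t d) < f(x) + beta * t * alpha
    while f(x + t * d) >= f(x) + beta * t * alpha:
        t *= 0.5
    x = x + t * d
    predicted = 2.0 ** (-k) * np.array([2.0 * (-1) ** k, 3.0])
    print(k, x, predicted)        # the computed iterate matches the slide's formula
```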

Gradient Sampling


The Gradient Sampling Method


Fix sample size m ≥ n + 1, line search parameter β ∈ (0, 1), reduction factors µ ∈ (0, 1) and θ ∈ (0, 1).

Initialize sampling radius ε > 0, tolerance τ > 0, iterate x.

Repeat (outer loop)

■ Repeat (inner loop: gradient sampling with fixed ε):

  ◆ Set G = {∇f(x), ∇f(x + εu_1), . . . , ∇f(x + εu_m)}, sampling u_1, . . . , u_m from the unit ball
  ◆ Set g = argmin{‖g‖ : g ∈ conv(G)}
  ◆ If ‖g‖ > τ, do backtracking line search: set d = −g/‖g‖ and replace x by x + td, with t ∈ {1, 1/2, 1/4, . . .} and f(x + td) < f(x) − βt‖g‖

■ until ‖g‖ ≤ τ.

■ New phase: set ε = µε and τ = θτ.

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.
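A compact NumPy/SciPy sketch of one way to implement this (the small QP for g = argmin{‖g‖ : g ∈ conv(G)} is solved here with SLSQP over simplex weights; the parameter values, the fixed number of outer phases, and the recipe for sampling the unit ball are choices of this sketch, not prescribed by the slide):

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(G):
    # g = argmin{ ||g|| : g in conv(G) }, where the rows of G are the sampled gradients
    m = len(G)
    res = minimize(lambda lam: 0.5 * np.linalg.norm(lam @ G) ** 2,
                   np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
                   method="SLSQP")
    return res.x @ G

def gradient_sampling(f, gradf, x, m=None, beta=1e-4, mu=0.1, theta=0.1,
                      eps=0.1, tau=1e-6, phases=3, max_inner=200):
    n = x.size
    m = m or n + 1                              # sample size m >= n + 1
    for _ in range(phases):                     # outer loop
        for _ in range(max_inner):              # inner loop: fixed sampling radius eps
            U = np.random.randn(m, n)           # m points sampled from the unit ball
            U = (U.T / np.linalg.norm(U, axis=1) * np.random.rand(m) ** (1.0 / n)).T
            G = np.vstack([gradf(x)] + [gradf(x + eps * u) for u in U])
            g = min_norm_in_hull(G)
            if np.linalg.norm(g) <= tau:        # until ||g|| <= tau
                break
            d = -g / np.linalg.norm(g)
            t = 1.0                             # backtracking with t in {1, 1/2, 1/4, ...}
            while f(x + t * d) >= f(x) - beta * t * np.linalg.norm(g) and t > 1e-16:
                t *= 0.5
            x = x + t * d
        eps, tau = mu * eps, theta * tau        # new phase
    return x

# e.g., on the earlier example (f and gradf as in the sketch above), from an assumed start:
# x_final = gradient_sampling(f, gradf, np.array([-1.0, -1.0]))
```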

With First Phase of Gradient Sampling


[Figure: same contour plot of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2, with the iterates of the first phase of gradient sampling.]

With Second Phase of Gradient Sampling


[Figure: same contour plot, with the iterates of the second phase of gradient sampling.]

The Clarke Subdifferential


Assume f : R^n → R is locally Lipschitz, and let D = {x ∈ R^n : f is differentiable at x}.

Rademacher's Theorem: R^n \ D has measure zero.

The Clarke subdifferential of f at x is

    ∂f(x) = conv { lim ∇f(y) : y → x, y ∈ D }.

F.H. Clarke, 1973 (he used the name "generalized gradient").

If f is continuously differentiable at x, then ∂f(x) = {∇f(x)}.

If f is convex, ∂f is the subdifferential of convex analysis.

We say x is Clarke stationary for f if 0 ∈ ∂f(x).

Key point: the convex hull of the set G generated by Gradient Sampling is a surrogate for ∂f.

Note that 0 ∈ ∂f(x) at x = [1, 1]^T


[Figure: same contour plot of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2; its minimizer is at x = [1, 1]^T.]
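To see why (this computation is worked out here, not on the slide): away from the curve x_2 = x_1^2 the example function is smooth, with two smooth pieces:

```latex
f(x) = 10\,|x_2 - x_1^2| + (1 - x_1)^2, \qquad
\nabla f(x) =
\begin{cases}
\bigl(-20x_1 - 2(1 - x_1),\ 10\bigr), & x_2 > x_1^2,\\
\bigl(\ \ 20x_1 - 2(1 - x_1),\ -10\bigr), & x_2 < x_1^2.
\end{cases}
```

Letting x → [1, 1]^T from the two sides gives the limiting gradients (−20, 10) and (20, −10), so ∂f([1, 1]^T) is the segment joining them, and 0 = ½(−20, 10) + ½(20, −10) lies in it. Since f ≥ 0 and f([1, 1]^T) = 0, this point is in fact the global minimizer.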

Grad. Samp.: A Stabilized Steepest Descent Method


Lemma. Let G be a compact convex set. Then

    −dist(0, G) = min_{‖d‖ ≤ 1} max_{g ∈ G} g^T d.

Proof.

    −dist(0, G) = −min_{g ∈ G} ‖g‖
                = −min_{g ∈ G} max_{‖d‖ ≤ 1} g^T d
                = −max_{‖d‖ ≤ 1} min_{g ∈ G} g^T d     (swapping min and max is justified since G and the unit ball are compact convex sets and g^T d is bilinear)
                = −max_{‖d‖ ≤ 1} min_{g ∈ G} g^T (−d)  (the unit ball is symmetric)
                = min_{‖d‖ ≤ 1} max_{g ∈ G} g^T d.

Note: the distance is nonnegative, and zero iff 0 ∈ G. Otherwise, equality is attained by ḡ = Π_G(0), d = −ḡ/‖ḡ‖.

Ordinary steepest descent: G = {∇f(x)}.
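A quick numeric sanity check of the lemma (the set G below is a toy choice for this sketch, not from the slides): compute ḡ = Π_G(0) via a small QP and compare −‖ḡ‖ with the right-hand side approximated over a fine grid of unit directions.

```python
import numpy as np
from scipy.optimize import minimize

V = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0]])   # G = conv of these three points in R^2

# left-hand side: -dist(0, G) via the minimum-norm element of G
res = minimize(lambda lam: 0.5 * np.linalg.norm(lam @ V) ** 2,
               np.full(len(V), 1.0 / len(V)),
               bounds=[(0.0, 1.0)] * len(V),
               constraints=[{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}],
               method="SLSQP")
g_bar = res.x @ V

# right-hand side: min over unit directions d of max_{g in G} g^T d
# (the max over G is attained at a vertex, so checking the rows of V suffices;
#  since 0 is not in G here, the minimizing d lies on the unit sphere)
angles = np.linspace(0.0, 2.0 * np.pi, 20001)
D = np.stack([np.cos(angles), np.sin(angles)], axis=1)
rhs = np.min(np.max(D @ V.T, axis=1))

print(-np.linalg.norm(g_bar), rhs)   # the two values agree up to the grid resolution
```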

Convergence of Gradient Sampling Method


Suppose that f : R^n → R

■ is locally Lipschitz
■ is continuously differentiable on an open dense subset of R^n
■ has bounded level sets

Then, with probability one, the line search always terminates, f is differentiable at every iterate x, and if the sequence of iterates {x} converges to some point x̄, then, with probability one,

■ the inner loop always terminates, so the sequences of sampling radii {ε} and tolerances {τ} converge to zero, and
■ x̄ is Clarke stationary for f, i.e., 0 ∈ ∂f(x̄).

J.V. Burke, A.S. Lewis and M.L.O., SIOPT, 2005.

Drop the assumption that f has bounded level sets. Then, with probability one, either the sequence {f(x)} → −∞ or every cluster point of the sequence of iterates {x} is Clarke stationary.

K.C. Kiwiel, SIOPT, 2007.

Extension to Problems with Nonsmooth Constraints


min f(x)
subject to c_i(x) ≤ 0, i = 1, . . . , p,

where f and c_1, . . . , c_p are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming gradient sampling method with convergence theory.

F.E. Curtis and M.L.O., SIOPT, 2012.

Quasi-Newton Methods


Bill Davidon


W. Davidon, a physicist at Argonne, had the breakthrough idea in 1959: since it's too expensive to compute and factor the Hessian ∇^2 f(x) at every iteration, update an approximation to its inverse using information from gradient differences, and multiply this onto the negative gradient to approximate Newton's method.

Each inverse Hessian approximation differs from the previous one by a rank-two correction.

Ahead of its time: the paper was rejected by the physics journals, but published 30 years later in the first issue of SIAM J. Optimization.

Davidon was a well known active anti-war protester during the Vietnam War. In December 2013, it was revealed that he was the mastermind behind the break-in at the FBI office in Media, PA, on March 8, 1971, during the Muhammad Ali - Joe Frazier world heavyweight boxing championship.

Fletcher and Powell


In 1963, R. Fletcher and M.J.D. Powell improved Davidon's method and established convergence for convex quadratic functions.

They applied it to solve problems in 100 variables: a lot at the time.

The method became known as the DFP method.

BFGS


In 1970, C.G. Broyden, R. Fletcher, D. Goldfarb and D. Shanno all independently proposed the BFGS method, which is a kind of dual of the DFP method. It was soon recognized that this was a remarkably effective method for smooth optimization.

In 1973, C.G. Broyden, J.E. Dennis and J.J. Moré proved generic local superlinear convergence of BFGS, DFP and other quasi-Newton methods.

In 1975, M.J.D. Powell established convergence of BFGS with an inexact Armijo-Wolfe line search for a general class of smooth convex functions. In 1987, this was extended by R.H. Byrd, J. Nocedal and Y.-X. Yuan to include the whole "Broyden" class of methods interpolating BFGS and DFP, except for the DFP end point.

Pathological counterexamples to convergence in the smooth, nonconvex case are known to exist (Y.-H. Dai, 2002, 2013; W. Mascarenhas, 2004), but it is widely accepted that the method works well in practice in the smooth, nonconvex case.

The BFGS Method (”Full” Version)


Choose line search parameters 0 < β < γ < 1

Initialize iterate x and positive-definite symmetric matrix H(which is supposed to approximate the inverse Hessian of f)

Repeat

■ Set d = −H∇f(x). Let α = ∇f(x)Td < 0■ Armijo-Wolfe line search: find t so that

f(x+ td) < f(x) + βtα and ∇f(x+ td)Td > γα■ Set s = td, y = ∇f(x+ td)−∇f(x)■ Replace x by x+ td■ Replace H by V HV T + 1

sT yssT , where V = I − 1

sT ysyT

Note that H can be computed in O(n2) operations since V is arank one perturbation of the identity

The BFGS Method (”Full” Version)

Introduction

Gradient Sampling

Quasi-NewtonMethods

Bill Davidon

Fletcher and Powell

BFGSThe BFGS Method(”Full” Version)

BFGS forNonsmoothOptimization

With BFGSExample:Minimizing aProduct ofEigenvalues

BFGS from 10Randomly GeneratedStarting Points

Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H

Regularity

Partly SmoothFunctionsSame ExampleAgain

Relation of PartialSmoothness toEarlier Work

20 / 59

Choose line search parameters 0 < β < γ < 1

Initialize iterate x and positive-definite symmetric matrix H(which is supposed to approximate the inverse Hessian of f)

Repeat

■ Set d = −H∇f(x). Let α = ∇f(x)Td < 0■ Armijo-Wolfe line search: find t so that

f(x+ td) < f(x) + βtα and ∇f(x+ td)Td > γα■ Set s = td, y = ∇f(x+ td)−∇f(x)■ Replace x by x+ td■ Replace H by V HV T + 1

sT yssT , where V = I − 1

sT ysyT

Note that H can be computed in O(n2) operations since V is arank one perturbation of the identityThe Armijo condition ensures “sufficient decrease” in f

The BFGS Method (”Full” Version)

Introduction

Gradient Sampling

Quasi-NewtonMethods

Bill Davidon

Fletcher and Powell

BFGSThe BFGS Method(”Full” Version)

BFGS forNonsmoothOptimization

With BFGSExample:Minimizing aProduct ofEigenvalues

BFGS from 10Randomly GeneratedStarting Points

Evolution ofEigenvalues ofA ◦ XEvolution ofEigenvalues of H

Regularity

Partly SmoothFunctionsSame ExampleAgain

Relation of PartialSmoothness toEarlier Work

20 / 59

Choose line search parameters 0 < β < γ < 1

Initialize iterate x and positive-definite symmetric matrix H (which is supposed to approximate the inverse Hessian of f)

Repeat

■ Set d = −H∇f(x). Let α = ∇f(x)^T d < 0
■ Armijo-Wolfe line search: find t so that f(x + td) < f(x) + βtα and ∇f(x + td)^T d > γα
■ Set s = td, y = ∇f(x + td) − ∇f(x)
■ Replace x by x + td
■ Replace H by V H V^T + (1/(s^T y)) s s^T, where V = I − (1/(s^T y)) s y^T

Note that H can be computed in O(n^2) operations since V is a rank-one perturbation of the identity.

The Armijo condition ensures "sufficient decrease" in f.

The Wolfe condition ensures that the directional derivative along the line increases algebraically, which guarantees that s^T y > 0 and that the new H is positive definite.
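
For concreteness, here is a minimal Python sketch of the iteration above, with a simple bracketing Armijo-Wolfe line search. It is an illustrative sketch only: the function names, the bracketing strategy, and the parameter values β = 1e-4, γ = 0.9 are choices made here, not taken from the slides, and a robust implementation needs far more careful handling of line search failures and rounding.

    import numpy as np

    def armijo_wolfe(f, grad, x, d, alpha, beta=1e-4, gamma=0.9):
        # Find t > 0 with  f(x+td) < f(x) + beta*t*alpha   (Armijo: sufficient decrease)
        # and  grad(x+td)^T d > gamma*alpha                 (Wolfe: derivative increases)
        # by simple bracketing/bisection; here alpha = grad(x)^T d < 0.
        lo, hi, t, fx = 0.0, np.inf, 1.0, f(x)
        for _ in range(60):
            if f(x + t * d) >= fx + beta * t * alpha:
                hi = t                                   # Armijo fails: step too long
            elif grad(x + t * d) @ d <= gamma * alpha:
                lo = t                                   # Wolfe fails: step too short
            else:
                return t                                 # both conditions hold
            t = 0.5 * (lo + hi) if np.isfinite(hi) else 2.0 * lo
        return t                                         # line search "failure"

    def bfgs(f, grad, x, H=None, gtol=1e-10, maxit=500):
        n = x.size
        H = np.eye(n) if H is None else H                # inverse Hessian approximation
        for _ in range(maxit):
            g = grad(x)
            if np.linalg.norm(g) <= gtol:
                break
            d = -H @ g
            alpha = g @ d                                # directional derivative, < 0
            t = armijo_wolfe(f, grad, x, d, alpha)
            s = t * d
            y = grad(x + s) - g                          # s^T y > 0 by the Wolfe condition
            x = x + s
            rho = 1.0 / (s @ y)
            Hy = H @ y
            # H <- V H V^T + rho s s^T with V = I - rho s y^T, expanded so the update is O(n^2)
            H = (H - rho * np.outer(s, Hy) - rho * np.outer(Hy, s)
                 + (rho * rho * (y @ Hy) + rho) * np.outer(s, s))
        return x, H

With f and grad supplied as Python callables, calling bfgs(f, grad, x0) from a random x0 can be used to experiment with the behavior discussed on the following slides.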

BFGS for Nonsmooth Optimization


In 1982, C. Lemarechal observed that quasi-Newton methods can be effective for nonsmooth optimization, but dismissed them as there was no theory behind them and no good way to terminate them.

Otherwise, there is not much in the literature on the subject until A.S. Lewis and M.L.O. (Math. Prog., 2013): we address both issues in detail, but our convergence results are limited to special cases.

Key point: use the original Armijo-Wolfe line search. Do not insist on reducing the magnitude of the directional derivative along the line!

In the nonsmooth case, BFGS builds a very ill-conditioned inverse "Hessian" approximation, with some tiny eigenvalues converging to zero, corresponding to "infinitely large" curvature in the directions defined by the associated eigenvectors.

Remarkably, the condition number of the inverse Hessian approximation typically reaches 10^16 before the method breaks down.

We have never seen convergence to non-stationary points that cannot be explained by numerical difficulties.

Convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth case.

With BFGS


[Figure: contour plot of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2 on [−2, 2] × [−2, 2], showing the starting point, the optimal point, and the iterates of steepest descent, gradient sampling (1st and 2nd phases), and BFGS.]
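
As a rough numerical illustration of the ill-conditioning described above, one can run an off-the-shelf BFGS code on this plotted example and inspect the final inverse Hessian approximation. Here is a small sketch using SciPy's BFGS; note that SciPy's line search and stopping tests differ from the Armijo-Wolfe strategy advocated here, so on nonsmooth problems it typically stops early with a line-search warning, and the starting point below is an arbitrary choice, not the one used in the figure.

    import numpy as np
    from scipy.optimize import minimize

    # The nonsmooth, nonconvex example plotted above
    f = lambda x: 10.0 * abs(x[1] - x[0] ** 2) + (1.0 - x[0]) ** 2

    res = minimize(f, np.array([-1.5, 0.5]), method="BFGS")   # finite-difference gradients
    H = 0.5 * (res.hess_inv + res.hess_inv.T)                 # final inverse Hessian approx.
    evals = np.linalg.eigvalsh(H)
    print("final point:", res.x)                              # ideally near the minimizer (1, 1)
    print("eigenvalues of H:", evals)                         # how ill-conditioned did H become?
    print("condition number of H:", evals[-1] / evals[0])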

Example: Minimizing a Product of Eigenvalues


Let S^N denote the space of real symmetric N × N matrices, and

λ_1(X) ≥ λ_2(X) ≥ · · · ≥ λ_N(X)

denote the eigenvalues of X ∈ S^N. We wish to minimize

f(X) = log ∏_{i=1}^{N/2} λ_i(A ◦ X)

where A ∈ S^N is fixed and ◦ is the Hadamard (componentwise) matrix product, subject to the constraints that X is positive semidefinite and has diagonal entries equal to 1.

If we replace ∏ by ∑ we would have a semidefinite program.

Since f is not convex, we may as well replace X by Y Y^T where Y ∈ R^{N×N}: this eliminates the psd constraint, and then it is also easy to eliminate the diagonal constraint (e.g. by scaling the rows of Y to unit norm).

Application: entropy minimization in an environmental application (K.M. Anstreicher and J. Lee, 2004)
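
A minimal sketch of how this unconstrained reformulation can be evaluated, which could then be handed (with a gradient, or finite differences) to a BFGS code such as the one sketched earlier. The row normalization of Y and the random test matrix A below are illustrative choices, not the setup used for the experiments that follow.

    import numpy as np

    def f_of_Y(Y, A):
        # f(X) = log prod_{i=1}^{N/2} lambda_i(A o X) with X = Yn Yn^T, where Yn is Y
        # with rows scaled to unit norm, so X is psd with unit diagonal automatically.
        N = A.shape[0]
        Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        X = Yn @ Yn.T
        evals = np.linalg.eigvalsh(A * X)          # A * X is the Hadamard product; ascending order
        return np.sum(np.log(evals[-(N // 2):]))   # log of the product of the N/2 largest

    # Tiny usage example with a random symmetric A (shifted so that A, and hence
    # A o X for any unit-diagonal psd X, is positive definite and the logs are defined)
    rng = np.random.default_rng(0)
    N = 6
    B = rng.standard_normal((N, N))
    A = 0.5 * (B + B.T) + 2 * N * np.eye(N)
    print(f_of_Y(rng.standard_normal((N, N)), A))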

BFGS from 10 Randomly Generated Starting Points


[Figure: log eigenvalue product, N = 20, n = 400, f_opt = −4.37938. Semilog plot of f − f_opt (axis from 10^−15 to 10^5) against iteration (0 to 1400) for the 10 starting points, where f_opt is the least value of f found over all runs.]

Evolution of Eigenvalues of A ◦X


[Figure: log eigenvalue product, N = 20, n = 400, f_opt = −4.37938. Plot of the eigenvalues of A ◦ X (axis from 0.2 to 1.8) against iteration (0 to 1400).]

Note that λ_6(A ◦ X), . . . , λ_14(A ◦ X) coalesce

Evolution of Eigenvalues of H


[Figure: log eigenvalue product, N = 20, n = 400, f_opt = −4.37938. Semilog plot of the eigenvalues of H (axis from 10^−15 to 10^5) against iteration (0 to 1400).]

44 eigenvalues of H converge to zero...why???

Regularity


A locally Lipschitz, directionally differentiable function f is regular (Clarke, 1970s) near a point x when its directional derivative x ↦ f′(x; d) is upper semicontinuous there for every fixed direction d.

In this case 0 ∈ ∂f(x) is equivalent to the first-order optimality condition f′(x; d) ≥ 0 for all directions d.

■ All convex functions are regular
■ All smooth functions are regular
■ Nonsmooth concave functions are not regular

Example: f(x) = −|x|

Note: this is simpler than the definition of regularity at a point given in last week's lecture.
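
To spell the example out: for f(x) = −|x| on R and the fixed direction d = 1, f′(x; 1) = −1 for x > 0 but f′(x; 1) = +1 for x < 0, while f′(0; 1) = −1, so x ↦ f′(x; 1) is not upper semicontinuous at 0. Correspondingly, 0 ∈ ∂f(0) = [−1, 1] even though f′(0; d) = −|d| < 0 for every d ≠ 0, so for this non-regular function 0 ∈ ∂f(0) does not imply the first-order optimality condition.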

Partly Smooth Functions


A regular function f is partly smooth at x relative to a manifold M containing x (A.S. Lewis, 2003) if

■ its restriction to M is twice continuously differentiable near x
■ the Clarke subdifferential ∂f is continuous on M near x
■ par ∂f(x), the subspace parallel to the affine hull of the subdifferential of f at x, is exactly the subspace normal to M at x.

We refer to par ∂f(x) as the V-space for f at x (with respect to M), and to its orthogonal complement, the subspace tangent to M at x, as the U-space for f at x.

When we refer to the V-space and U-space without reference to a point x, we mean at a minimizer.

For nonzero y in the V-space, the mapping t ↦ f(x + ty) is necessarily nonsmooth at t = 0, while for nonzero y in the U-space, t ↦ f(x + ty) is differentiable at t = 0 as long as f is locally Lipschitz.
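
For example, f(x) = |x_1| + x_2^2 on R^2 is partly smooth at the origin relative to the manifold M = {x : x_1 = 0}: its restriction to M is the smooth function x_2^2, the subdifferential ∂f((0, x_2)) = [−1, 1] × {2x_2} varies continuously along M, and its affine hull is parallel to span{e_1}, the normal space to M. So the V-space at the minimizer 0 is span{e_1} (the direction along which f has a nonsmooth, V-shaped graph) and the U-space is span{e_2} (the direction along which f is smooth).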

Same Example Again


[Figure: contour plot of f(x) = 10|x_2 − x_1^2| + (1 − x_1)^2 on [−2, 2] × [−2, 2], showing the starting point, the optimal point, and the iterates of steepest descent, gradient sampling (1st and 2nd phases), and BFGS.]

Relation of Partial Smoothness to Earlier Work


Partial smoothness is closely related to earlier work of J.V. Burke and J.J. More (1990, 1994) and S.J. Wright (1993) on identification of constraint structure by algorithms.

When f is convex, the partly smooth nomenclature is consistent with the usage of V-space and U-space by C. Lemarechal, F. Oustry and C. Sagastizabal (2000), but partial smoothness does not imply convexity and convexity does not imply partial smoothness.

Why Did 44 Eigenvalues of H Converge to Zero?


The eigenvalue product is partly smooth with respect to the manifold of matrices with an eigenvalue of given multiplicity.

Recall that at the computed minimizer,

λ_6(A ◦ X) ≈ . . . ≈ λ_14(A ◦ X).

Matrix theory says that imposing multiplicity m on an eigenvalue of a matrix in S^N amounts to m(m+1)/2 − 1 conditions, or 44 when m = 9 (9·10/2 − 1 = 44), so the dimension of the V-space at this minimizer is 44.

Thus BFGS automatically detected the U and V space partitioning without knowing anything about the mathematical structure of f!

Variation of f from Minimizer, along EigVecs of H


[Figure: log eigenvalue product, N = 20, n = 400, f_opt = −4.37938. Plots of f(x_opt + t w) − f_opt against t ∈ [−10, 10], where w is the eigenvector for eigenvalue 10, 20, 30, 40, 50 or 60 of the final H, with the eigenvalues of H numbered from smallest to largest.]

Challenge: Convergence of BFGS in Nonsmooth Case


Assume f is locally Lipschitz with bounded level sets and is semi-algebraic.

Assume the initial x and H are generated randomly (e.g. from normal and Wishart distributions).

Prove or disprove that the following hold with probability one:

1. BFGS generates an infinite sequence {x} with f differentiable at all iterates
2. Any cluster point x is Clarke stationary
3. The sequence of function values generated (including all of the line search iterates) converges to f(x) R-linearly
4. If {x} converges to x where f is "partly smooth" w.r.t. a manifold M, then the subspace defined by the eigenvectors corresponding to eigenvalues of H converging to zero converges to the "V-space" of f w.r.t. M at x

Extensions of BFGS for Nonsmooth Optimization


A combined BFGS-Gradient Sampling method with convergence theory (F.E. Curtis and X. Que, 2015)

Constrained Problems

min f(x)
subject to c_i(x) ≤ 0, i = 1, . . . , p

where f and c_1, . . . , c_p are locally Lipschitz but may not be differentiable at local minimizers.

A successive quadratic programming (SQP) BFGS method applied to challenging problems in static-output-feedback control design (F.E. Curtis, T. Mitchell and M.L.O., 2015).

Although there are no theoretical results, it is much more efficient and effective than the SQP Gradient Sampling method, which does have convergence results.

Some Difficult Examples


A Rosenbrock Function


Consider a generalization of the Rosenbrock (1960) function:

R_p(x) = (1/4)(x_1 − 1)^2 + Σ_{i=1}^{n−1} |x_{i+1} − x_i^2|^p, where p > 0.

The unique minimizer is x* = [1, 1, . . . , 1]^T with R_p(x*) = 0.

Define x̄ = [−1, 1, 1, . . . , 1]^T with R_p(x̄) = 1 and the manifold

M_R = {x : x_{i+1} = x_i^2, i = 1, . . . , n − 1}

For x ∈ M_R, e.g. x = x* or x = x̄, the 2nd term of R_p is zero. Starting at x̄, BFGS needs to approximately follow M_R to reach x* (unless it "gets lucky").

When p = 2: R_2 is smooth but not convex. Starting at x̄:

■ n = 5: BFGS needs 43 iterations to reduce R_2 below 10^−15
■ n = 10: BFGS needs 276 iterations to reduce R_2 below 10^−15.
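
A small sketch of this experiment; it uses SciPy's BFGS with finite-difference gradients rather than analytic gradients and the Armijo-Wolfe line search behind the counts quoted above, so the iteration numbers it reports will differ (and for the nonsmooth case p = 1, discussed on the next slide, SciPy typically stops early with a line-search warning).

    import numpy as np
    from scipy.optimize import minimize

    def R(x, p):
        # Generalized Rosenbrock function R_p from the slide
        x = np.asarray(x, dtype=float)
        return 0.25 * (x[0] - 1.0) ** 2 + np.sum(np.abs(x[1:] - x[:-1] ** 2) ** p)

    n = 5
    xbar = np.array([-1.0] + [1.0] * (n - 1))        # the starting point on the manifold M_R

    res2 = minimize(lambda x: R(x, 2), xbar, method="BFGS",
                    options={"gtol": 1e-10, "maxiter": 2000})
    print("p = 2:", res2.nit, "iterations, R_2 =", res2.fun)

    x0 = np.random.default_rng(0).standard_normal(n) # random start for the nonsmooth case
    res1 = minimize(lambda x: R(x, 1), x0, method="BFGS", options={"maxiter": 1000})
    print("p = 1:", res1.nit, "iterations, R_1 =", res1.fun)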

The Nonsmooth Variant of the Rosenbrock Function


R1(x) =1

4(x1 − 1)2 +

n−1∑

i=1

|xi+1 − x2i |

R1 is nonsmooth (but locally Lipschitz) as well as nonconvex.The second term is still zero on the manifold MR, but R1 is notdifferentiable on MR.

However, R1 is regular at x ∈ MR and partly smooth at x w.r.t.MR.

We cannot initialize BFGS at x, so starting at normallydistributed random points:

■ n = 5: BFGS reduces R1 only to about 1× 10−3 in 1000iterations

■ n = 10: BFGS reduces R1 only to about 7× 10−4 in 1000iterations

The Nonsmooth Variant of the Rosenbrock Function

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

A RosenbrockFunctionThe NonsmoothVariant of theRosenbrock FunctionAn Aside:ChebyshevPolynomials

Plots of ChebyshevPolynomials

Nesterov’sChebyshev-RosenbrockFunctionsWhy BFGS Takes SoMany Iterations toMinimize N2

First NonsmoothVariant ofNesterov’s FunctionSecond NonsmoothVariant ofNesterov’s FunctionContour Plots of theNonsmooth Variantsfor n = 2Properties of theSecond Nonsmooth

37 / 59

R1(x) =1

4(x1 − 1)2 +

n−1∑

i=1

|xi+1 − x2i |

R1 is nonsmooth (but locally Lipschitz) as well as nonconvex.The second term is still zero on the manifold MR, but R1 is notdifferentiable on MR.

However, R1 is regular at x ∈ MR and partly smooth at x w.r.t.MR.

We cannot initialize BFGS at x, so starting at normallydistributed random points:

■ n = 5: BFGS reduces R1 only to about 1× 10−3 in 1000iterations

■ n = 10: BFGS reduces R1 only to about 7× 10−4 in 1000iterations

Again the method appears to be converging, very slowly, butmay be having numerical difficulties.
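A rough sketch of the random-start experiment follows (hypothetical names and options again). Note that SciPy's BFGS assumes smoothness and typically stops early with a line-search warning on R_1; the experiments reported on the slide use a BFGS with an inexact line search suited to nonsmooth functions, so this is only indicative:

    import numpy as np
    from scipy.optimize import minimize

    def R1(x):
        # nonsmooth Rosenbrock variant R_1
        x = np.asarray(x, dtype=float)
        return 0.25 * (x[0] - 1.0) ** 2 + np.sum(np.abs(x[1:] - x[:-1] ** 2))

    n = 5
    rng = np.random.default_rng(0)
    for trial in range(3):
        x0 = rng.standard_normal(n)             # normally distributed starting point
        res = minimize(R1, x0, method="BFGS", options={"maxiter": 1000})
        print(trial, res.nit, res.fun)          # typically stalls well above the optimal value 0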

An Aside: Chebyshev Polynomials

A sequence of orthogonal polynomials defined on [−1, 1] by

    T_0(x) = 1,  T_1(x) = x,  T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).

So T_2(x) = 2x^2 − 1, T_3(x) = 4x^3 − 3x, etc.

Important properties that can be proved easily include

■ T_n(x) = cos(n cos^{−1}(x))
■ T_m(T_n(x)) = T_{mn}(x)
■ ∫_{−1}^{1} (1/√(1 − x^2)) T_i(x) T_j(x) dx = 0 if i ≠ j
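The recursion and the first two identities are easy to check numerically; a small sketch (the helper name cheb is illustrative):

    import numpy as np

    def cheb(n, x):
        # evaluate T_n(x) by the recursion T_0 = 1, T_1 = x, T_{k+1} = 2x T_k - T_{k-1}
        t_prev, t = np.ones_like(x), x
        if n == 0:
            return t_prev
        for _ in range(n - 1):
            t_prev, t = t, 2.0 * x * t - t_prev
        return t

    x = np.linspace(-1.0, 1.0, 1001)
    assert np.allclose(cheb(7, x), np.cos(7 * np.arccos(x)))   # T_n(x) = cos(n arccos x)
    assert np.allclose(cheb(3, cheb(4, x)), cheb(12, x))       # T_m(T_n(x)) = T_{mn}(x)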

Plots of Chebyshev Polynomials

[Figure: Left: plots of T_0(x), . . . , T_4(x) on [−1, 1]. Right: plot of T_8(x).]

Question: How many extrema does T_n(x) have in [−1, 1]?

Nesterov’s Chebyshev-Rosenbrock Functions

Consider the function

    N_p(x) = (1/4)(x_1 − 1)^2 + Σ_{i=1}^{n−1} |x_{i+1} − 2x_i^2 + 1|^p,  where p > 0.

The unique minimizer is x* = [1, 1, . . . , 1]^T with N_p(x*) = 0.

Define x̄ = [−1, 1, 1, . . . , 1]^T with N_p(x̄) = 1 and the manifold

    M_N = {x : x_{i+1} = 2x_i^2 − 1, i = 1, . . . , n − 1}.

For x ∈ M_N, e.g. x = x* or x = x̄, the second term of N_p is zero. Starting at x̄, BFGS needs to approximately follow M_N to reach x* (unless it "gets lucky").

When p = 2: N_2 is smooth but not convex. Starting at x̄:

■ n = 5: BFGS needs 370 iterations to reduce N_2 below 10^{−15}
■ n = 10: BFGS needs ∼50,000 iterations to reduce N_2 below 10^{−15}

even though N_2 is smooth!
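A small sketch of the p = 2 experiment, again with SciPy's BFGS as a stand-in (iteration counts depend strongly on the gradient accuracy and line search, so they will differ from those on the slide; the names and options are illustrative):

    import numpy as np
    from scipy.optimize import minimize

    def N2(x):
        # Nesterov's Chebyshev-Rosenbrock function with p = 2
        x = np.asarray(x, dtype=float)
        return 0.25 * (x[0] - 1.0) ** 2 + np.sum((x[1:] - 2.0 * x[:-1] ** 2 + 1.0) ** 2)

    for n in range(2, 7):
        xbar = np.array([-1.0] + [1.0] * (n - 1))
        res = minimize(N2, xbar, method="BFGS",
                       options={"gtol": 1e-10, "maxiter": 200000})
        print(n, res.nit, res.fun)              # iteration count grows rapidly with n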

Why BFGS Takes So Many Iterations to Minimize N_2

Let T_i(x) denote the ith Chebyshev polynomial. For x ∈ M_N,

    x_{i+1} = 2x_i^2 − 1 = T_2(x_i) = T_2(T_2(x_{i−1})) = T_2(T_2(· · · T_2(x_1) · · ·)) = T_{2^i}(x_1).

To move from x̄ to x* along the manifold M_N exactly requires

■ x_1 to change from −1 to 1
■ x_2 = 2x_1^2 − 1 to trace the graph of T_2(x_1) on [−1, 1]
■ x_3 = T_2(T_2(x_1)) to trace the graph of T_4(x_1) on [−1, 1]
■ x_n = T_{2^{n−1}}(x_1) to trace the graph of T_{2^{n−1}}(x_1) on [−1, 1]

which has 2^{n−1} − 1 extrema in (−1, 1).

Even though BFGS will not track the manifold M_N exactly, it will follow it approximately. So, since the manifold is highly oscillatory, BFGS must take relatively short steps to obtain reduction in N_2 in the line search, and hence it takes many iterations! At the very end, since N_2 is smooth, BFGS is superlinearly convergent!
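Both claims, x_{i+1} = T_{2^i}(x_1) on M_N and the 2^{n−1} − 1 interior extrema of T_{2^{n−1}}, can be verified numerically; a brief sketch (the helper name cheb, as before, is illustrative):

    import numpy as np

    def cheb(n, x):
        # T_n(x) via the three-term recursion
        t_prev, t = np.ones_like(x), x
        if n == 0:
            return t_prev
        for _ in range(n - 1):
            t_prev, t = t, 2.0 * x * t - t_prev
        return t

    x1 = np.linspace(-1.0, 1.0, 2001)
    xi = x1.copy()
    for i in range(1, 5):
        xi = 2.0 * xi ** 2 - 1.0                       # walk along M_N: x_{i+1} = 2 x_i^2 - 1
        assert np.allclose(xi, cheb(2 ** i, x1))       # so x_{i+1} = T_{2^i}(x_1)

    n = 5
    y = cheb(2 ** (n - 1), np.linspace(-1.0, 1.0, 200001))
    slopes = np.sign(np.diff(y))
    print(int(np.sum(slopes[1:] * slopes[:-1] < 0)))   # 2^(n-1) - 1 = 15 interior extrema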

First Nonsmooth Variant of Nesterov’s Function

    N_1(x) = (1/4)(x_1 − 1)^2 + Σ_{i=1}^{n−1} |x_{i+1} − 2x_i^2 + 1|

N_1 is nonsmooth (though locally Lipschitz) as well as nonconvex. The second term is still zero on the manifold M_N, but N_1 is not differentiable on M_N.

However, N_1 is regular at x ∈ M_N and partly smooth at x with respect to M_N.

We cannot initialize BFGS at x̄ (N_1 is not differentiable there), so starting at normally distributed random points:

■ n = 5: BFGS reduces N_1 only to about 5 × 10^{−3} in 1000 iterations
■ n = 10: BFGS reduces N_1 only to about 2 × 10^{−2} in 1000 iterations

The method appears to be converging, very slowly, but may be having numerical difficulties.

Second Nonsmooth Variant of Nesterov’s Function

To distinguish it from the first variant, write

    N̂_1(x) = (1/4)|x_1 − 1| + Σ_{i=1}^{n−1} |x_{i+1} − 2|x_i| + 1|.

Again, the unique global minimizer is x*. The second term is zero on the set

    S = {x : x_{i+1} = 2|x_i| − 1, i = 1, . . . , n − 1}

but S is not a manifold: it has "corners".

Contour Plots of the Nonsmooth Variants for n = 2

[Figure: contour plots of the nonsmooth Chebyshev-Rosenbrock functions, first variant N_1 (left) and second variant N̂_1 (right), with n = 2, with iterates generated by BFGS initialized at 7 different randomly generated points.]

On the left, we always get convergence to x* = [1, 1]^T. On the right, most runs converge to [1, 1]^T, but some go to the point [0, −1]^T.

Properties of the Second Nonsmooth Variant N̂_1

When n = 2, the point [0, −1]^T is Clarke stationary for the second nonsmooth variant N̂_1. We can see this because zero is in the convex hull of the gradient limits for N̂_1 at that point.

However, [0, −1]^T is not a local minimizer, because d = [1, 2]^T is a direction of linear descent: N̂_1′([0, −1]^T; d) < 0.

These two properties mean that N̂_1 is not regular at [0, −1]^T.

In fact, for n ≥ 2:

■ N̂_1 has 2^{n−1} Clarke stationary points
■ the only local minimizer is the global minimizer x*
■ x* is the only stationary point in the sense of Mordukhovich (i.e., with 0 ∈ ∂N̂_1(x), where ∂ is defined in Rockafellar and Wets, Variational Analysis, 1998)

(M. Gurbuzbalaban and M.L.O., 2012).

Furthermore, starting from enough randomly generated starting points, BFGS finds all 2^{n−1} Clarke stationary points!
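The Clarke stationarity itself requires the gradient limits, but the descent direction is easy to check with a one-sided difference quotient; a short sketch for n = 2 (the function name N1hat is illustrative):

    import numpy as np

    def N1hat(x):
        # second nonsmooth variant: (1/4)|x_1 - 1| + sum |x_{i+1} - 2|x_i| + 1|
        x = np.asarray(x, dtype=float)
        return 0.25 * np.abs(x[0] - 1.0) + np.sum(np.abs(x[1:] - 2.0 * np.abs(x[:-1]) + 1.0))

    xbar = np.array([0.0, -1.0])
    d = np.array([1.0, 2.0])
    for t in [1e-2, 1e-4, 1e-6]:
        print((N1hat(xbar + t * d) - N1hat(xbar)) / t)   # -> -0.25, so d is a descent direction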

Behavior of BFGS on the Second Nonsmooth Variant

[Figure: sorted final values of N̂_1 over 1000 randomly generated starting points. Left: n = 5, BFGS finds all 16 Clarke stationary points. Right: n = 6, BFGS finds all 32 Clarke stationary points.]

Convergence to Non-Locally-Minimizing Points

When f is smooth, convergence of methods such as BFGS to non-locally-minimizing stationary points or local maxima is possible but not likely, because of the line search, and such convergence will not be stable under perturbation.

However, this kind of convergence is what we are seeing for the non-regular, nonsmooth Nesterov Chebyshev-Rosenbrock example, and it is stable under perturbation. The same behavior occurs for gradient sampling or bundle methods.

Kiwiel (private communication): the Nesterov example is the first he had seen which causes his bundle code to have this behavior.

Nonetheless, we don't know whether, in exact arithmetic, the methods would actually generate sequences converging to the nonminimizing Clarke stationary points. Experiments by Kaku (2011) suggest that the higher the precision used, the more likely BFGS is to eventually move away from such a point.

Limited Memory Methods


Limited Memory BFGS

"Full" BFGS requires storing an n × n matrix and doing matrix-vector multiplies, which is not possible when n is large.

In the 1980s, J. Nocedal and others developed a "limited memory" version of BFGS, with O(n) space and time requirements, which is very widely used for minimizing smooth functions in many variables. It works by saving only the most recent k rank-two updates to an initial inverse Hessian approximation.

There are two variants: with and without "scaling" (usually scaling is preferred).

The convergence rate of limited memory BFGS is linear, not superlinear, on smooth problems.

Question: how effective is it on nonsmooth problems?
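For reference, the core of limited memory BFGS is the standard two-loop recursion (Nocedal and Wright, Numerical Optimization), which applies the implicitly stored inverse Hessian approximation to a gradient in O(kn) work. The sketch below shows the scaled variant; it is a generic illustration, not the exact implementation used in the experiments that follow:

    import numpy as np

    def lbfgs_direction(grad, s_list, y_list):
        # Two-loop recursion: apply the limited memory inverse Hessian approximation to grad.
        # s_list, y_list hold the k most recent pairs s_i = x_{i+1} - x_i and y_i = g_{i+1} - g_i,
        # oldest first; the initial matrix is the scaled identity (s^T y / y^T y) I.
        q = np.array(grad, dtype=float)
        rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
        alphas = []
        for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
            a = rho * np.dot(s, q)
            alphas.append(a)
            q -= a * y
        if s_list:
            gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
        else:
            gamma = 1.0
        r = gamma * q
        for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
            b = rho * np.dot(y, r)
            r += (a - b) * s
        return r    # the quasi-Newton search direction is -r

In practice the pairs are kept in a queue of length k, and the direction −r is used in an inexact (Wolfe-type) line search.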

Limited Memory BFGS on the Eigenvalue Product

[Figure: median reduction in f − f_opt over 10 runs versus the number of vectors k, for the log eigenvalue product problem with N = 20, r = 20, K = 10, n = 400, maxit = 400; curves for limited memory BFGS (with and without scaling) and full BFGS (with scaling once, and without scaling).]

Limited memory BFGS is not nearly as good as full BFGS, and there is no significant improvement even when k reaches 44.

A More Basic Example

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product

A More BasicExample

Smooth, Convex:nA = 200, nB =0, nR = 1

Nonsmooth, Convex:nA = 200, nB =10, nR = 1

Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5

LimitedEffectiveness ofLimited MemoryBFGSOther Ideas forLarge ScaleNonsmoothOptimization

51 / 59

Let x = [y; z;w] ∈ RnA+nB+nR and consider the test function

f(x) = (y − e)TA(y − e) + {(z − e)TB(z − e)}1/2 +R1(w)

where A = AT ≻ 0, B = BT ≻ 0, e = [1; 1; . . . ; 1].

A More Basic Example

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product

A More BasicExample

Smooth, Convex:nA = 200, nB =0, nR = 1

Nonsmooth, Convex:nA = 200, nB =10, nR = 1

Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5

LimitedEffectiveness ofLimited MemoryBFGSOther Ideas forLarge ScaleNonsmoothOptimization

51 / 59

Let x = [y; z;w] ∈ RnA+nB+nR and consider the test function

f(x) = (y − e)TA(y − e) + {(z − e)TB(z − e)}1/2 +R1(w)

where A = AT ≻ 0, B = BT ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex,and the third is the nonsmooth, nonconvex Rosenbrock function.

A More Basic Example

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product

A More BasicExample

Smooth, Convex:nA = 200, nB =0, nR = 1

Nonsmooth, Convex:nA = 200, nB =10, nR = 1

Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5

LimitedEffectiveness ofLimited MemoryBFGSOther Ideas forLarge ScaleNonsmoothOptimization

51 / 59

Let x = [y; z;w] ∈ RnA+nB+nR and consider the test function

f(x) = (y − e)TA(y − e) + {(z − e)TB(z − e)}1/2 +R1(w)

where A = AT ≻ 0, B = BT ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex,and the third is the nonsmooth, nonconvex Rosenbrock function.

The optimal value is 0, with x = e. The function f is partlysmooth and the dimension of the V-space is nB + nR − 1.

A More Basic Example

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product

A More BasicExample

Smooth, Convex:nA = 200, nB =0, nR = 1

Nonsmooth, Convex:nA = 200, nB =10, nR = 1

Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5

LimitedEffectiveness ofLimited MemoryBFGSOther Ideas forLarge ScaleNonsmoothOptimization

51 / 59

Let x = [y; z;w] ∈ RnA+nB+nR and consider the test function

f(x) = (y − e)TA(y − e) + {(z − e)TB(z − e)}1/2 +R1(w)

where A = AT ≻ 0, B = BT ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex,and the third is the nonsmooth, nonconvex Rosenbrock function.

The optimal value is 0, with x = e. The function f is partlysmooth and the dimension of the V-space is nB + nR − 1.

Set A = XXT where xij are normally distributed, with conditionnumber about 106 when nA = 200. Similarly B with nB < nA.

A More Basic Example

Introduction

Gradient Sampling

Quasi-NewtonMethods

Some DifficultExamples

Limited MemoryMethodsLimited MemoryBFGSLimited MemoryBFGS on theEigenvalue Product

A More BasicExample

Smooth, Convex:nA = 200, nB =0, nR = 1

Nonsmooth, Convex:nA = 200, nB =10, nR = 1

Nonsmooth,Nonconvex:nA = 200, nB =10, nR = 5

LimitedEffectiveness ofLimited MemoryBFGSOther Ideas forLarge ScaleNonsmoothOptimization

51 / 59

Let x = [y; z; w] ∈ R^{nA+nB+nR} and consider the test function

f(x) = (y − e)^T A (y − e) + {(z − e)^T B (z − e)}^{1/2} + R_1(w)

where A = A^T ≻ 0, B = B^T ≻ 0, e = [1; 1; . . . ; 1].

The first term is quadratic, the second is nonsmooth but convex, and the third is the nonsmooth, nonconvex Rosenbrock function.

The optimal value is 0, attained at x = e. The function f is partly smooth, and the dimension of the V-space is nB + nR − 1.

Set A = X X^T, where the entries x_ij of X are normally distributed; the condition number of A is about 10^6 when nA = 200. B is constructed similarly, with nB < nA.
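To make the setup concrete, here is a minimal NumPy sketch (not the code used for the experiments) that builds A and B as described and evaluates f. The slide does not spell out the Rosenbrock term, so the absolute-value variant rosenbrock_ns below is only an assumed, illustrative choice for R_1.

```python
import numpy as np

def make_spd(n, rng):
    # A = X X^T with normally distributed entries x_ij; for n = 200 the
    # condition number is typically on the order of 10^6, as on the slide.
    X = rng.standard_normal((n, n))
    return X @ X.T

def rosenbrock_ns(w):
    # ASSUMED nonsmooth, nonconvex Rosenbrock-like term (illustrative only):
    # |w_1 - 1| + sum_i |w_{i+1} - w_i^2|, which vanishes at w = e.
    return abs(w[0] - 1.0) + np.sum(np.abs(w[1:] - w[:-1] ** 2))

def make_test_function(nA=200, nB=10, nR=5, seed=0):
    rng = np.random.default_rng(seed)
    A = make_spd(nA, rng)
    B = make_spd(nB, rng) if nB > 0 else None
    eA, eB = np.ones(nA), np.ones(nB)

    def f(x):
        y, z, w = x[:nA], x[nA:nA + nB], x[nA + nB:]
        val = (y - eA) @ A @ (y - eA)                # smooth, convex quadratic
        if nB > 0:
            val += np.sqrt((z - eB) @ B @ (z - eB))  # nonsmooth, convex
        if nR > 0:
            val += rosenbrock_ns(w)                  # nonsmooth, nonconvex
        return val

    return f

f = make_test_function()
print(f(np.zeros(215)), f(np.ones(215)))  # second value is 0: x = e is optimal
```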

Besides limited memory BFGS and full BFGS, we also compare limited memory Gradient Sampling, in which we sample k ≪ n gradients per iteration.
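A rough sketch of one limited memory Gradient Sampling step is given below: it samples k gradients in a small ball around the current point, takes the minimum-norm vector in their convex hull as an approximate steepest descent direction, and backtracks. The sampling radius, line search constants, and QP solver (SciPy's SLSQP) are illustrative choices, not the settings used in the comparison.

```python
import numpy as np
from scipy.optimize import minimize

def gradient_sampling_step(f, grad_f, x, k=10, eps=1e-2, rng=None):
    """One simplified limited memory Gradient Sampling iteration:
    sample k gradients in an eps-ball around x, take the minimum-norm
    vector in their convex hull as the search direction, then backtrack."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    # Gradients at x and at k - 1 randomly sampled nearby points.
    pts = [x] + [x + eps * rng.standard_normal(n) for _ in range(k - 1)]
    G = np.array([grad_f(p) for p in pts])           # k x n matrix of gradients

    # Min-norm element of conv{g_1,...,g_k}: minimize ||G^T lam||^2
    # over the simplex {lam >= 0, sum lam = 1}.
    def obj(lam):
        v = G.T @ lam
        return v @ v
    cons = ({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * k
    lam0 = np.full(k, 1.0 / k)
    res = minimize(obj, lam0, method='SLSQP', bounds=bounds, constraints=cons)
    g = G.T @ res.x                                   # "stabilized" gradient

    d = -g
    t, fx = 1.0, f(x)
    while f(x + t * d) > fx - 1e-4 * t * (g @ g) and t > 1e-12:
        t *= 0.5                                      # simple backtracking line search
    return x + t * d

# Tiny illustrative use on f(x) = ||x||_1 (np.sign is a subgradient a.e.).
f_ex = lambda x: np.sum(np.abs(x))
g_ex = lambda x: np.sign(x)
x = np.ones(5)
for _ in range(20):
    x = gradient_sampling_step(f_ex, g_ex, x, k=5)
print(f_ex(x))
```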

Smooth, Convex: nA = 200, nB = 0, nR = 1


[Figure: nA = 200, nB = 0, nR = 1, maxit = 201. Median reduction in f (over 10 runs) vs. number of vectors k, for Grad Samp 1e−02, 1e−04; Lim Mem BFGS (scaling); Lim Mem BFGS (no scaling); full BFGS (scaling once); full BFGS (no scaling).]


LM-BFGS with scaling even better than full BFGS

Nonsmooth, Convex: nA = 200, nB = 10, nR = 1


[Figure: nA = 200, nB = 10, nR = 1, maxit = 211. Median reduction in f (over 10 runs) vs. number of vectors k, for the same five methods.]


LM-BFGS much worse than full BFGS

Nonsmooth, Nonconvex: nA = 200, nB = 10, nR = 5


[Figure: nA = 200, nB = 10, nR = 5, maxit = 215. Median reduction in f (over 10 runs) vs. number of vectors k, for the same five methods.]


LM-BFGS with scaling even worse than LM-Grad-Samp

Limited Effectiveness of Limited Memory BFGS


We see that the addition of nonsmoothness to a problem, convex or nonconvex, creates great difficulties for Limited Memory BFGS, even when the dimension of the V-space is less than the size of the memory, although turning off scaling helps. With scaling, it may be no better than Limited Memory Gradient Sampling. More investigation of this is needed.
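The "scaling" mentioned above presumably refers to the standard choice of initial inverse Hessian H0 = (s^T y / y^T y) I in the limited memory BFGS two-loop recursion; the sketch below shows where that choice enters and how it can be switched off.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, scaling=True):
    """L-BFGS two-loop recursion: return the search direction -H*grad,
    where H is built from the stored (s, y) correction pairs
    (oldest first). scaling=True uses H0 = (s'y / y'y) I from the most
    recent pair, which is the choice referred to as 'scaling' above;
    scaling=False uses H0 = I."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    # First loop: newest pair to oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    # Initial inverse Hessian approximation.
    if scaling and s_list:
        s, y = s_list[-1], y_list[-1]
        gamma = (s @ y) / (y @ y)
    else:
        gamma = 1.0
    r = gamma * q
    # Second loop: oldest pair to newest.
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r
```

With s_list and y_list holding the most recent correction pairs, the returned vector is used as the search direction inside a line search; with empty lists it reduces to steepest descent.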

Other Ideas for Large Scale Nonsmooth Optimization


■ Exploit structure! Lots of work on this has been done.
■ Smoothing! Lots of work on this has been done too, most notably by Yu. Nesterov (a small illustrative sketch follows this list).
■ Bundle methods, pioneered by C. Lemarechal in the convex case and K. Kiwiel in the 1980s in the nonconvex case, and with lots of work done since, e.g. by P. Apkarian and D. Noll in small-scale control applications and by T.M.T. Do and T. Artieres in large-scale machine learning applications.
■ Lots of other recent work on nonconvexity in machine learning.
■ Adaptive Gradient Sampling (F.E. Curtis and X. Que).
■ Automatic Differentiation (AD): B. Bell, A. Griewank.
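As a small illustration of the smoothing idea flagged in the list above, the nonsmooth convex term {(z − e)^T B (z − e)}^{1/2} of the earlier test function can be replaced by a smooth approximation controlled by a parameter μ > 0. The particular approximation below is one simple choice for illustration; it is not claimed to be the specific construction from Nesterov's work.

```python
import numpy as np

def nonsmooth_term(z, B, e):
    # Original nonsmooth convex term {(z - e)^T B (z - e)}^{1/2}.
    v = z - e
    return np.sqrt(v @ B @ v)

def smoothed_term(z, B, e, mu=1e-3):
    # Smooth approximation: sqrt(v^T B v + mu^2) - mu.
    # Differentiable everywhere (including z = e) and never differs
    # from the original term by more than mu.
    v = z - e
    return np.sqrt(v @ B @ v + mu ** 2) - mu

def smoothed_term_grad(z, B, e, mu=1e-3):
    # Gradient of the smoothed term (B symmetric): B v / sqrt(v^T B v + mu^2).
    v = z - e
    return (B @ v) / np.sqrt(v @ B @ v + mu ** 2)
```

As μ → 0 the approximation tends to the original term, so a standard approach is to minimize the smoothed function for a decreasing sequence of μ values.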

Concluding Remarks


Summary


Gradient Sampling is a simple method for nonsmooth, nonconvex optimization for which a convergence theory is known, but it is too expensive to use in most applications.

BFGS (the full version) is remarkably effective on nonsmooth problems, but little theory is known. Our package HIFOO (H-infinity fixed order optimization) for controller design, primarily based on BFGS, has been used successfully in many applications.

Limited Memory BFGS is not so effective on nonsmooth problems, but it seems to help to turn off scaling.

Diabolical nonconvex problems such as Nesterov's Chebyshev-Rosenbrock problems can be very difficult, especially in the nonsmooth case.

A Final Quote


“Nonconvexity is scary to some, but there are vastly different types of nonconvexity (some of which are really scary!)”

— Yann LeCun

Papers and software are available at www.cs.nyu.edu/overton.
