Numerical Methods I: Numerical optimization
Georg Stadler, Courant Institute, NYU, stadler@cims.nyu.edu
Oct 19, 2017
Optimization problems
Main source: Nocedal/Wright: Numerical Optimization, Springer 2006.

Different optimization problems:

  min_{x ∈ ℝⁿ} f(x)

where f: ℝⁿ → ℝ. Often, one additionally encounters constraints of the form

  g(x) = 0 (equality constraints)
  h(x) ≥ 0 (inequality constraints)

- Often used: "programming" ≡ optimization
- continuous optimization (x ∈ ℝⁿ) versus discrete optimization (e.g., x ∈ ℤⁿ)
- nonsmooth (e.g., f is not differentiable) versus smooth optimization (we assume f ∈ C²)
- convex optimization vs. nonconvex optimization (convexity of f)
Continuous unconstrained optimization
Assumptions

We assume that f(·) ∈ C², and assume unconstrained minimization problems, i.e.:

  min_{x ∈ ℝⁿ} f(x).

A point x* is a global solution if

  f(x) ≥ f(x*)   (1)

for all x ∈ ℝⁿ, and a local solution if (1) holds for all x in a neighborhood of x*.

Strict (local/global) minimizers satisfy (1) with ">" instead of "≥" in a neighborhood of the point.
Continuous unconstrained optimization
Necessary conditions

At a local minimum x*, the first-order necessary condition holds:

  ∇f(x*) = 0 ∈ ℝⁿ,

and the second-order (necessary) sufficient condition:

  ∇²f(x*) ∈ ℝⁿˣⁿ is positive (semi-)definite.
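These two conditions can be checked numerically. A minimal sketch (the candidate function and point are illustrative choices, not from the slides):

```python
import numpy as np

# Check the optimality conditions for f(x) = (x1 - 1)^2 + 2*x2^2
# at the candidate point x* = (1, 0).

def grad(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x_star = np.array([1.0, 0.0])

# First-order necessary condition: the gradient vanishes.
assert np.allclose(grad(x_star), 0.0)

# Second-order sufficient condition: the Hessian is positive definite
# (all eigenvalues strictly positive), so x* is a strict local minimizer.
eigvals = np.linalg.eigvalsh(hess(x_star))
print(eigvals)  # both eigenvalues are positive
```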
Continuous unconstrained optimization
Algorithms

To find a candidate for a minimum, we can thus solve the nonlinear equation for a stationary point:

  G(x) := ∇f(x) = 0,

for instance with Newton's method. Note that the Jacobian of G(x) is ∇²f(x).

In optimization, one often prefers iterative descent algorithms that take into account the optimization structure.
Convex minimization

A function f: ℝⁿ → ℝ is convex if for all x, y and all λ ∈ [0, 1]:

  f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

Theorem: If f is convex, then any local minimizer x* is also a global minimizer. If f is differentiable, then any stationary point x* is a global minimizer.
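The defining inequality can be spot-checked numerically. A small sketch for the convex function f(x) = ‖x‖² (an illustrative choice):

```python
import numpy as np

# Numerically spot-check the convexity inequality
#   f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
# for f(x) = ||x||^2 at random points.

def f(x):
    return float(np.dot(x, x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.standard_normal(3)
    y = rng.standard_normal(3)
    lam = rng.uniform(0.0, 1.0)
    lhs = f(lam * x + (1.0 - lam) * y)
    rhs = lam * f(x) + (1.0 - lam) * f(y)
    assert lhs <= rhs + 1e-12  # inequality holds up to rounding

print("convexity inequality verified on 1000 random samples")
```

Such a check can of course only refute convexity, never prove it; the theorem on the slide is what makes convexity valuable algorithmically.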
Descent algorithm

Basic descent algorithm:

1. Initialize starting point x⁰, set k = 0.
2. Find a descent direction d^k.
3. Find a step length α_k > 0 for the update

  x^{k+1} := x^k + α_k d^k

such that f(x^{k+1}) < f(x^k). Set k := k + 1 and go to step 2.
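The loop above can be sketched as follows — a minimal steepest-descent instance; the fixed initial step length, tolerance, and test function are illustrative choices, not from the slides:

```python
import numpy as np

def descent(f, grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Basic descent loop: x^{k+1} = x^k + alpha_k * d^k with d^k = -grad f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = -grad(x)                 # steepest-descent direction
        if np.linalg.norm(d) < tol:  # (approximate) stationary point reached
            break
        x_new = x + alpha * d
        if f(x_new) >= f(x):         # insist on descent: f(x^{k+1}) < f(x^k)
            alpha *= 0.5             # shrink the step and retry
            continue
        x = x_new
    return x

# Minimize f(x) = (x1 - 1)^2 + 2*x2^2; the minimizer is (1, 0).
f = lambda x: (x[0] - 1.0)**2 + 2.0 * x[1]**2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
print(descent(f, grad, [5.0, -3.0]))  # close to [1, 0]
```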
Descent algorithm
Idea: Instead of solving an n-dimensional minimization problem, (approximately) solve a sequence of 1-dimensional problems:

- Initialization: As close as possible to x*.
- Descent direction: Direction in which the function decreases locally.
- Step length: Want to make large, but not too large, steps.
- Check for descent: Make sure you make progress towards a (local) minimum.
Descent algorithm
Initialization: Ideally close to the minimizer. The solution depends, in general, on the initialization (in the presence of multiple local minima).
Descent algorithm

Directions in which the function decreases (locally) are called descent directions.

- Steepest descent direction:

  d^k = −∇f(x^k)

- When B_k ∈ ℝⁿˣⁿ is positive definite, then

  d^k = −B_k^{-1} ∇f(x^k)

  is the quasi-Newton descent direction.

- When H_k = H(x^k) = ∇²f(x^k) is positive definite, then

  d^k = −H_k^{-1} ∇f(x^k)

  is the Newton descent direction. At a local minimum, H(x*) is positive (semi)definite.
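For a concrete function, the three directions can be computed side by side. In this sketch, B_k is an arbitrary SPD matrix standing in for a quasi-Newton Hessian approximation (function and matrices are illustrative choices):

```python
import numpy as np

# f(x) = (x1 - 1)^2 + 2*x2^2, with gradient and Hessian below.
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])

x = np.array([3.0, 2.0])
g = grad(x)

d_sd = -g                                # steepest descent: -grad f(x)
B = np.array([[1.0, 0.0], [0.0, 2.0]])   # some SPD stand-in for a Hessian approximation
d_qn = -np.linalg.solve(B, g)            # quasi-Newton direction: -B^{-1} grad f(x)
d_newton = -np.linalg.solve(hess(x), g)  # Newton direction: -H^{-1} grad f(x)

# All three are descent directions: grad f(x)^T d < 0.
for d in (d_sd, d_qn, d_newton):
    assert g @ d < 0

print(d_sd, d_qn, d_newton)
```

Since f is quadratic here, the Newton direction points exactly at the minimizer (1, 0) from any point.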
Descent algorithm
Why is the negative gradient the steepest direction?
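A standard way to answer this (the argument is not spelled out on the slide): among all normalized directions, the directional derivative is smallest for the negative gradient, by the Cauchy–Schwarz inequality.

```latex
% Directional derivative of f at x in direction d with \|d\| = 1:
D_d f(x) = \nabla f(x)^T d
  \;\ge\; -\|\nabla f(x)\|\,\|d\|
  \;=\; -\|\nabla f(x)\|
  \qquad \text{(Cauchy--Schwarz)},
% with equality if and only if
d = -\frac{\nabla f(x)}{\|\nabla f(x)\|}.
```

Hence the normalized negative gradient achieves the most negative directional derivative, i.e., the locally steepest decrease.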
Descent algorithm
Newton method for optimization

Idea behind Newton's method in optimization: Instead of finding the minimum of f, find the minimum of the quadratic approximation of f around the current point:

  q_k(d) = f(x^k) + ∇f(x^k)^T d + ½ d^T ∇²f(x^k) d.

Provided ∇²f(x^k) is spd, the minimizer

  d = −∇²f(x^k)^{-1} ∇f(x^k)

is the Newton search direction. Since this is the minimum of the quadratic approximation, α_k = 1 is the "optimal" step length.
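A minimal sketch of this iteration with full steps α_k = 1; the test function f(x) = 2 cosh(x₁) + x₂², with its minimizer at the origin, is an illustrative choice:

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method for optimization with full steps (alpha_k = 1)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton direction: minimizer of the local quadratic model q_k(d)
        d = -np.linalg.solve(hess(x), g)
        x = x + d
    return x

# f(x) = 2*cosh(x1) + x2^2: gradient (2*sinh(x1), 2*x2), Hessian diag(2*cosh(x1), 2).
grad = lambda x: np.array([2.0 * np.sinh(x[0]), 2.0 * x[1]])
hess = lambda x: np.array([[2.0 * np.cosh(x[0]), 0.0], [0.0, 2.0]])

print(newton(grad, hess, [1.0, 2.0]))  # close to [0, 0]
```

Printing the gradient norm per iteration shows the quadratic convergence claimed later: the number of correct digits roughly doubles each step.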
Descent algorithm

Step length: Need to choose step length α_k > 0 in

  x^{k+1} := x^k + α_k d^k.

Ideally: Find the minimum α of the 1-dimensional problem

  min_{α>0} f(x^k + α d^k).

It is not necessary to find the exact minimum.
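For a quadratic f(x) = ½xᵀQx − bᵀx, the exact line search does have a closed form: setting the derivative of φ(α) = f(x + αd) to zero gives α = −∇f(x)ᵀd / (dᵀQd). A quick check with an illustrative Q and b (not from the slides):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])  # spd matrix (illustrative)
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x = np.array([2.0, 2.0])
d = -grad(x)                            # steepest-descent direction

# Exact minimizer of phi(alpha) = f(x + alpha*d):
# phi'(alpha) = grad(x)^T d + alpha * d^T Q d = 0
alpha = -(grad(x) @ d) / (d @ Q @ d)

# Verify: the directional derivative vanishes at the computed alpha.
assert abs(grad(x + alpha * d) @ d) < 1e-9
print(alpha)
```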
Descent algorithm

Step length (continued): Find α_k that satisfies the Armijo condition:

  f(x^k + α_k d^k) ≤ f(x^k) + c_1 α_k ∇f(x^k)^T d^k,   (2)

where c_1 ∈ (0, 1) (usually chosen rather small, e.g., c_1 = 10⁻⁴).

Additionally, one often uses the gradient condition

  ∇f(x^k + α_k d^k)^T d^k ≥ c_2 ∇f(x^k)^T d^k   (3)

with c_2 ∈ (c_1, 1).

The two conditions (2) and (3) are called Wolfe conditions.
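Conditions (2) and (3) translate directly into code. A sketch of a checker, with the default c_1 = 10⁻⁴ from the slide and an illustrative c_2 and test function:

```python
import numpy as np

def wolfe_satisfied(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Wolfe conditions (2) and (3) for step length alpha."""
    g_d = grad(x) @ d                  # directional derivative at x; < 0 for a descent d
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * g_d          # condition (2)
    curvature = grad(x + alpha * d) @ d >= c2 * g_d               # condition (3)
    return armijo and curvature

# Example: f(x) = ||x||^2, steepest descent from x = (1, 1).
f = lambda x: float(x @ x)
grad = lambda x: 2.0 * x
x = np.array([1.0, 1.0])
d = -grad(x)

print(wolfe_satisfied(f, grad, x, d, alpha=0.25))  # True: both conditions hold
print(wolfe_satisfied(f, grad, x, d, alpha=1.0))   # False: no sufficient decrease
```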
Armijo/Wolfe conditions
(figure omitted)
Descent algorithm
Convergence of line search methods

Denote the angle between d^k and −∇f(x^k) by Θ_k:

  cos(Θ_k) = −∇f(x^k)^T d^k / (‖∇f(x^k)‖ ‖d^k‖).

Assumptions on f: ℝⁿ → ℝ: continuously differentiable, derivative is Lipschitz-continuous, f is bounded from below.
Method: descent algorithm with Wolfe conditions.
Then:

  Σ_{k≥0} cos²(Θ_k) ‖∇f(x^k)‖² < ∞.

In particular: If cos(Θ_k) ≥ δ > 0, then lim_{k→∞} ‖∇f(x^k)‖ = 0.

Note that this does not imply that x^k converges.
Descent algorithm

Alternative to the Wolfe step length: Find α_k that satisfies the Armijo condition:

  f(x^k + α_k d^k) ≤ f(x^k) + c_1 α_k ∇f(x^k)^T d^k,   (4)

where c_1 ∈ (0, 1).

Use backtracking line search to find a step length that is large enough:

- Start with a (large) step length α⁰_k > 0.
- If it satisfies (4), accept the step length.
- Else, compute α^{i+1}_k := ρ α^i_k with ρ < 1 (usually, ρ = 0.5) and go back to the previous step.

This also leads to a method that converges globally to a stationary point.
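The three steps above can be sketched as follows (the defaults ρ = 0.5 and c_1 = 10⁻⁴ follow the slides; the test function is an illustrative choice):

```python
import numpy as np

def backtracking(f, grad, x, d, alpha0=1.0, rho=0.5, c1=1e-4, max_backtracks=50):
    """Backtracking line search: shrink alpha until the Armijo condition (4) holds."""
    alpha = alpha0
    g_d = grad(x) @ d                  # directional derivative; < 0 for a descent d
    for _ in range(max_backtracks):
        if f(x + alpha * d) <= f(x) + c1 * alpha * g_d:
            return alpha               # sufficient decrease achieved: accept
        alpha *= rho                   # shrink and try again
    return alpha

# Example: steepest descent on f(x) = x1^4 + x2^2 from (2, 2); alpha0 = 1 overshoots.
f = lambda x: x[0]**4 + x[1]**2
grad = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
x = np.array([2.0, 2.0])
d = -grad(x)
alpha = backtracking(f, grad, x, d)
print(alpha, f(x + alpha * d) < f(x))  # the accepted step strictly decreases f
```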
Backtracking
(figure omitted)
Descent algorithm
Convergence rates

Let us consider a simple case, where f is quadratic:

  f(x) := ½ x^T Q x − b^T x,

where Q is spd. The gradient is ∇f(x) = Qx − b, and the minimizer x* is the solution to Qx = b. Using steepest descent with exact line search, the convergence is:

  ‖x^{k+1} − x*‖²_Q ≤ ( (λ_max − λ_min) / (λ_max + λ_min) )² ‖x^k − x*‖²_Q

(linear convergence with rate depending on the eigenvalues of Q).
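This bound can be checked numerically step by step. A sketch with an illustrative diagonal Q (so λ_min = 1, λ_max = 10) and the closed-form exact line search for quadratics:

```python
import numpy as np

Q = np.diag([1.0, 10.0])              # spd; lambda_min = 1, lambda_max = 10
b = np.zeros(2)
x_star = np.zeros(2)                  # solution of Q x = b

grad = lambda x: Q @ x - b
q_norm2 = lambda e: float(e @ Q @ e)  # squared Q-norm ||e||_Q^2

rate = ((10.0 - 1.0) / (10.0 + 1.0))**2  # predicted contraction factor per step

x = np.array([10.0, 1.0])
for _ in range(20):
    d = -grad(x)
    alpha = (d @ d) / (d @ Q @ d)     # exact line search for a quadratic f
    x_new = x + alpha * d
    # observed contraction never exceeds the predicted bound
    assert q_norm2(x_new - x_star) <= rate * q_norm2(x - x_star) + 1e-9
    x = x_new

print("bound holds; final error:", np.linalg.norm(x - x_star))
```

This starting point is in fact close to the worst case, where the bound is (nearly) attained and the iterates zigzag between the axes.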
Descent algorithms
Convergence of steepest descent
(figure omitted: contour plots of steepest-descent iterates on [−1, 1] × [−1, 1])
Descent algorithm
Convergence rates

Newton's method: Assumptions on f: twice differentiable with Lipschitz-continuous Hessian ∇²f(x). The Hessian is positive definite in a neighborhood of the solution x*.
Assumptions on the starting point: x⁰ sufficiently close to x*.
Then: Quadratic convergence of Newton's method with α_k = 1, and ‖∇f(x^k)‖ → 0 quadratically.

Equivalent to Newton's method for solving ∇f(x) = 0, if the Hessian is positive definite.

How many iterations does Newton need for quadratic problems?
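For a quadratic f with spd Hessian, the local quadratic model is exact, so a single full Newton step lands on the minimizer, regardless of the starting point. A quick check with an illustrative Q and b:

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])  # spd Hessian of f(x) = 0.5 x^T Q x - b^T x
b = np.array([1.0, 2.0])

grad = lambda x: Q @ x - b              # gradient of the quadratic
x0 = np.array([50.0, -20.0])            # arbitrary starting point

# One Newton step with alpha = 1: d = -Hessian^{-1} * gradient
x1 = x0 - np.linalg.solve(Q, grad(x0))

assert np.allclose(grad(x1), 0.0)       # x1 is already the stationary point
print(x1, np.linalg.solve(Q, b))        # both equal the minimizer x* = Q^{-1} b
```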
Summary of Newton methods and variants

- Newton's method to solve a nonlinear equation F(x) = 0.
- Newton's method to solve an optimization problem is equivalent to solving for the stationary point ∇f(x) = 0, provided the Hessian is positive definite and full steps are used (compare also the convergence result).
- The optimization perspective on solving ∇f(x) = 0 provides additional information.
- The Gauss-Newton method for nonlinear least squares problems is a specific quasi-Newton method.