Advanced Statistical Computing
Week 4: Optimization
Aad van der Vaart
Fall 2012
Contents
Error propagation
Root finding
Minimization
Maximum likelihood
Error propagation
Machine precision
Most numbers are not represented exactly on a computer.
> (.3-.1)-.2
[1] -2.775558e-17
> (.3-.1)==.2
[1] FALSE
> abs((.3-.1)-.2)<1e-10
[1] TRUE
> identical(.3-.1,.2)
[1] FALSE
> isTRUE(all.equal(.3-.1,.2))
[1] TRUE
>
> .Machine$double.eps
[1] 2.220446e-16
You cannot test whether a number is exactly 0, only whether it is close to 0.
[ In this case the errors arise because the numbers .3, .1, and .2 can be exactly represented in base 10, but not in base 2, which is used by the computer. The R function identical is a more robust version of “==”; the isTRUE(all.equal) construction tests for approximate equality.]
Error propagation
If you want to compute f(x) for given x, you make two types of errors:
• f(x̃) − f(x), for x̃ the machine representation of x.
• f̃(x̃) − f(x̃), for f̃ the machine algorithm for f.
The first can be made smaller only by higher precision of x̃. The second depends on the algorithm.
In the following example we calculate
f(x) = x(√(x+1) − √x) = x/(√(x+1) + √x).
> x=500009999998
> x*(sqrt(x+1)-sqrt(x))
[1] 353560.4
> x/(sqrt(x+1)+sqrt(x))
[1] 353556.9
[ The two mathematically equivalent expressions suggest two algorithms, which give different outcomes.]
Condition number
The relative error in f(x̃) as approximation of f(x) is
(f(x̃) − f(x))/f(x) ≈ (f'(x)/f(x)) (x̃ − x).
We like the condition number f'(x)/f(x) of f (at x) to be not too large.
An algorithm to compute f(x) might consist of multiple steps:
x ↦ x̃ ↦ f̃_1(x̃) ↦ f̃_2(f̃_1(x̃)) ↦ ··· ↦ f̃(x̃) = f̃_k ◦ f̃_{k−1} ◦ ··· ◦ f̃_1(x̃).
We like each step f_i to be well conditioned.
An example of an ill-conditioned elementary operation is the map
(x_1, x_2) ↦ x_1 − x_2, if x_1 ≈ x_2.
Try to avoid subtracting numbers that are almost equal.
[ This is what went wrong in the example on the preceding slide. For vector-valued x the condition number f'(x)/f(x) can be defined from the vector of partial derivatives as (∂/∂x_i)f(x)/f(x).]
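As a small illustration (not from the slides): in the square-root example above the trouble comes from subtracting two nearly equal numbers, whose relative error is amplified by roughly x_1/(x_1 − x_2). A hedged R sketch of this bookkeeping:

# Illustrative check (not from the slides) of why x*(sqrt(x+1)-sqrt(x)) loses accuracy:
# subtracting nearly equal numbers amplifies relative error by about x1/(x1-x2).
x <- 500009999998
x1 <- sqrt(x+1); x2 <- sqrt(x)
x1/(x1-x2)                              # amplification factor, roughly 1e12
x1/(x1-x2) * .Machine$double.eps        # expected relative error, roughly 2e-4
(x*(x1-x2) - x/(x1+x2))/(x/(x1+x2))     # observed relative discrepancy, roughly 1e-5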
Root finding
Problem statement
For given (nice) function f: R^k → R^k find x̄ such that
f(x̄) = 0.
METHODS:
• Bisection
• Newton-Raphson
• Secant
• Inverse quadratic interpolation
• Brent
Bisection method
GIVEN a, b WITH f(a) < 0, f(b) > 0 REPEAT WHILE |a − b| > eps:
• m = (a + b)/2
• if f(m) < 0 then a := m else b := m
• Distance to solution cut by 1/2 in each step.
• Sure to converge, but relatively slow.
Newton-Raphson
GIVEN x_0 REPEAT:
• draw tangent to f at x_0
• x_1 := zero of tangent.

• Fast if x_0 is close to solution
• May fail to converge
• Need formula for derivative
• Many adaptations
• Generalizes to higher dimensions
Newton-Raphson — derivation
The tangent to f at x_0 is x ↦ f(x_0) + f'(x_0)(x − x_0). Its zero x_1 satisfies f(x_0) + f'(x_0)(x_1 − x_0) = 0, leading to

x_{i+1} = x_i − f(x_i)/f'(x_i).
Higher-dimensional analogue
x_{i+1} = x_i − f'(x_i)^{-1} f(x_i),

for f'(x) the matrix of partial derivatives.
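To make the higher-dimensional update concrete, here is a minimal sketch in R (illustrative, not from the slides); the function name newton_raphson and the small test system are made up for the example.

# Minimal multivariate Newton-Raphson sketch (illustrative; not from the slides).
# f returns a numeric vector, J returns its matrix of partial derivatives.
newton_raphson <- function(f, J, x0, tol = 1e-10, maxit = 50) {
  x <- x0
  for (i in 1:maxit) {
    step <- solve(J(x), f(x))          # solve f'(x) %*% step = f(x)
    x <- x - step
    if (max(abs(step)) < tol) break
  }
  x
}
# Example system: x1^2 + x2^2 = 2 and x1 - x2 = 0, with solution (1, 1).
f <- function(x) c(x[1]^2 + x[2]^2 - 2, x[1] - x[2])
J <- function(x) matrix(c(2*x[1], 2*x[2], 1, -1), nrow = 2, byrow = TRUE)
newton_raphson(f, J, c(2, 0.5))        # approximately c(1, 1)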
Order of convergence
Let x_0, x_1, x_2, . . . be successive approximations for the solution x̄.

The method converges linearly (or at first order) if

|x_{i+1} − x̄| ≤ C |x_i − x̄|, C < 1.

The method converges quadratically (or at second order) if

|x_{i+1} − x̄| ≤ K |x_i − x̄|^2, K ∈ R.

For a linear method |x_i − x̄| ≤ C^i |x_0 − x̄| → 0, but a quadratic method converges much faster provided |x_i − x̄| becomes small eventually.
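For a feel of the difference (an illustrative calculation, not from the slides), take an initial error of 0.1 with C = 0.5 and K = 1:

# Linear (C = 0.5) versus quadratic (K = 1) error bounds, starting from error 0.1.
# Values are chosen only for illustration.
err_lin <- 0.1; err_quad <- 0.1
for (i in 1:5) {
  err_lin  <- 0.5 * err_lin     # linear: error is halved each step
  err_quad <- err_quad^2        # quadratic: error is squared each step
  cat(i, err_lin, err_quad, "\n")
}
# After 5 steps the linear bound is about 3e-3, the quadratic bound about 1e-32.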
R
Bisection (linear)
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> A=1.4; B=2.9
> for (i in 1:20) {m=(A+B)/2
+ if (f(A)*f(m)>0) A=m else B=m
+ print(A,digits=12)}
[1] 1.4
[1] 1.775
[1] 1.9625
[1] 1.9625
[1] 1.9625
[1] 1.9859375
[1] 1.99765625
[1] 1.99765625
[1] 1.99765625
[1] 1.99912109375
[1] 1.99985351562
[1] 1.99985351562
[1] 1.99985351562
[1] 1.99994506836
[1] 1.99999084473
[1] 1.99999084473
[1] 1.99999084473
[1] 1.99999656677
[1] 1.9999994278
[1] 1.9999994278
Newton-Raphson (quadratic)
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> f1=function(x){1+x^2-0.3*(x-1)^3+(x-2)*(2*x-0.9*(x-1)^2)}
> g=function(x){x-f(x)/f1(x)}
> xold=2.9
> for (i in 1:20){ xnew=g(xold); xold=xnew
+ print(xnew,digits=12)}
[1] 2.21416533654
[1] 2.0235925475
[1] 2.00035651707
[1] 2.0000000838
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
Newton-Raphson — has quadratic order
The Newton-Raphson algorithm iterates x_{i+1} = g(x_i) for

g(x) = x − f(x)/f'(x).
This function has the properties
g(x̄) = x̄ − f(x̄)/f'(x̄) = x̄,

g'(x̄) = 1 − (f'(x̄)^2 − f(x̄)f''(x̄))/f'(x̄)^2 = 0.
Therefore Taylor’s theorem can be used to obtain
x_{i+1} − x̄ = g(x_i) − g(x̄) ≈ g'(x̄)(x_i − x̄) + (1/2) g''(x̄)(x_i − x̄)^2 ≈ (1/2) g''(x̄)(x_i − x̄)^2.
Secant method
GIVEN x_0, x_1 REPEAT:
• draw line (“secant”) through the points (x_0, f(x_0)) and (x_1, f(x_1))
• x_2 := zero of secant.
• If x_0 ≈ x_1 very similar to Newton-Raphson
• Does not need formula for derivative
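For concreteness, a minimal secant iteration in R (an illustrative sketch, not code from the slides; the function name secant and the stopping rule are made up):

# Minimal secant-method sketch (illustrative; not from the slides).
secant <- function(f, x0, x1, tol = 1e-10, maxit = 50) {
  for (i in 1:maxit) {
    # zero of the line through (x0, f(x0)) and (x1, f(x1))
    x2 <- x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0))
    if (abs(x2 - x1) < tol) return(x2)
    x0 <- x1; x1 <- x2
  }
  x2
}
f <- function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}   # same test function as in the examples above
secant(f, 1.4, 2.9)                           # approximately 2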
Quadratic interpolation
GIVEN x_0, x_1, x_2 REPEAT:
• draw parabola P_2(y) through (f(x_0), x_0), (f(x_1), x_1), (f(x_2), x_2)
• x_3 := P_2(0)
• With line instead of parabola becomes secant method
• Does not need formula for derivative
[ The parabola approximates f^{-1}, whence P_2(0) ≈ f^{-1}(0) = x̄.]
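One step can be written down directly (an illustrative sketch, not from the slides; the helper name iqi_step is made up): fit the Lagrange interpolating parabola in the (f(x), x) plane and evaluate it at 0.

# One inverse quadratic interpolation step (illustrative; not from the slides).
# Fits the parabola P_2 through (f(x0),x0), (f(x1),x1), (f(x2),x2) and returns P_2(0).
iqi_step <- function(f, x0, x1, x2) {
  y0 <- f(x0); y1 <- f(x1); y2 <- f(x2)
  # Lagrange form of P_2, evaluated at y = 0
  x0*y1*y2/((y0-y1)*(y0-y2)) + x1*y0*y2/((y1-y0)*(y1-y2)) + x2*y0*y1/((y2-y0)*(y2-y1))
}
f <- function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}   # same test function as above
iqi_step(f, 1.4, 2.9, 2.5)                    # about 1.87, one step toward the root at 2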
Brent’s method
This is a hybrid algorithm, trying to borrow speed from Newton and surety from bisection. Its iterations use bisection, secant or quadratic interpolation, depending on the current values.
R implementation: uniroot.
> uniroot(f,c(1,3))   # search in interval [1,3]
$root
[1] 1.999998
$f.root
[1] -8.559716e-06
$iter
[1] 7
$estim.prec
[1] 6.103516e-05
Minimization
Problem statement
For given (nice) function F: R^k → R find x̄ such that

F(x̄) = min_x F(x).
For unconstrained optimization x̄ is a zero of the derivative: F'(x̄) = 0.
METHODS:
• Newton
• Quasi Newton
• Nelder-Mead
• (any root finding algorithm)
• (many specialized algorithms)
Quasi Newton
Newton's algorithm for solving F'(x̄) = 0 gives iterates

x_{i+1} = x_i − F''(x_i)^{-1} F'(x_i),    F''(x) = (∂^2 F(x)/∂x_i ∂x_j).
A Quasi Newton method replaces F''(x_i) by a matrix that is easier to compute or gives a more stable algorithm.
R implementations: nlm and optim with method BFGS.
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> F=function(x){x^2/2+x^4/4-0.3*x*(x-1)^4/4-0.3*(x-1)^5/20-2*x-2*x^3/3+0.3*(x-1)^4/2}
> nlm(F,2.9)
$minimum
[1] -3.351043
$estimate
[1] 2.035899
$code
[1] 1
$iterations
[1] 5
> optim(2.9,F,method="BFGS")
$par
[1] 2.0359
$value
[1] -3.351043
$convergence
[1] 0
[ .... some output deleted....]
> optim(2.9,F,f,method="BFGS")   # pass gradient f=F' to function for increased efficiency
$par
[1] 2.000809
$value
[1] -3.348453
[ .... some output deleted....]
Nelder-Mead algorithm
The Nelder-Mead algorithm keeps a test set of k + 1 points x_i with their function values F(x_i).

It iterates by replacing points x_i by “more promising” points, using a library of possible moves (e.g. reflection in the centre of gravity of the points).

R implementation: optim with (default) method Nelder-Mead.
> F=function(x){x^2/2+x^4/4-0.3*x*(x-1)^4/4-0.3*(x-1)^5/20-2*x-2*x^3/3+0.3*(x-1)^4/2}
> optim(2.9,F,method="Nelder-Mead")
$par
[1] 2.035947
$value
[1] -3.351043
$convergence
[1] 0
Warning message:
one-diml optimization by Nelder-Mead is unreliable: use "Brent" or optimize() directly
[ This may perhaps be viewed as a higher-dimensional version of bisection.]
Maximum likelihood
Maximum likelihood
Observations X_1, . . . , X_n from density x ↦ p_θ(x).
MLE θ̂ maximizes the likelihood

θ ↦ ∏_{j=1}^n p_θ(X_j).
If θ ↦ p_θ(x) is smooth, then the MLE solves

∑_{j=1}^n ℓ̇_θ(X_j) = 0,

for ℓ̇_θ the score function

ℓ̇_θ(x) = (∂/∂θ) log p_θ(x).
Fisher scoring
The Newton-Raphson iterations are
θ_{i+1} = θ_i − (∑_{j=1}^n ℓ̇_{θ_i}(X_j)) / (∑_{j=1}^n ℓ̈_{θ_i}(X_j)),
for
ℓ̇_θ(x) = (∂/∂θ) log p_θ(x),    ℓ̈_θ(x) = (∂^2/∂θ^2) log p_θ(x).
The Hessian −∑_{j=1}^n ℓ̈_θ(X_j) is called the observed information and satisfies, for I_θ the Fisher information of the model,

−E_θ ∑_{j=1}^n ℓ̈_θ(X_j) = n I_θ.
The iterations with −∑_{j=1}^n ℓ̈_θ(X_j) replaced by n I_θ are called Fisher scoring.
The square root of the inverse (n I_θ̂)^{-1} is an estimate of the se of the MLE.
[ Fisher scoring is sensible only if I_θ can be calculated analytically.]
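As a concrete (hedged) illustration, not taken from the slides: for the Cauchy location model with scale 1 the Fisher information per observation is I_θ = 1/2, so Fisher scoring iterates θ_{i+1} = θ_i + (2/n) ∑_j ℓ̇_{θ_i}(X_j). A minimal R sketch, using the same simulated data as on the following slides:

# Fisher scoring for the Cauchy location model (illustrative sketch; not from the slides).
# With scale 1 the Fisher information per observation is 1/2, so n*I_theta = n/2.
set.seed(1244); x <- rcauchy(25)               # same data as on the following slides
score <- function(q, x) sum(2*(x - q)/(1 + (x - q)^2))
theta <- median(x)                             # robust starting value
for (i in 1:20) theta <- theta + (2/length(x))*score(theta, x)
theta                                          # close to the MLE 0.369 found on the next slides
sqrt(2/length(x))                              # Fisher-information-based estimate of the se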
R — Cauchy MLE
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
[Figure: minus log-likelihood (left) and score function (right) plotted against θ.]
R — Cauchy MLE
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
> set.seed(1244); x=rcauchy(25)
> q=seq(min(x),max(x),by=0.01)
> minloglik=function(q,x){apply(log(1+outer(x,q,"-")^2),2,sum)}
> minloglik1=function(q,x){v=outer(x,q,"-"); 2*apply(v/(1+v^2),2,sum)}
> uniroot(minloglik1,median(x)+c(-3,3),x=x)
$root
[1] 0.3691374
> optim(median(x),method="BFGS",minloglik,x=x,hessian=TRUE)
$par
[1] 0.369137
$hessian
         [,1]
[1,] 12.99744     # estimated se is 1/sqrt(12.99744)
> optim(median(x),lower=median(x)-3,upper=median(x)+3,method="Brent",minloglik,x=x)
$par
[1] 0.3691371
[ .... some output deleted....]
R — Cauchy MLE — local optima
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
[Figure: minus log-likelihood (left) and score function (right) for θ between −70 and −60, showing a local optimum.]
R — Cauchy MLE — local optima
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
> optim(-70,method="BFGS",minloglik,x=x)
$par
[1] -68.24447
$convergence
[1] 0
[ .... some output deleted....]
> uniroot(minloglik1,c(-68,-64),x=x)
$root
[1] -66.39614
$f.root
[1] -1.346517e-06
$iter
[1] 5
$estim.prec
[1] 6.103516e-05
[ .... some output deleted....]
R — Cauchy MLE — multivariate
Calculate MLE for sample from Cauchy distribution with center θ and scale σ.
Likelihood function
(θ, σ) ↦ ∏_{j=1}^n (1/(2πσ)) · 1/(1 + (X_j − θ)^2/σ^2).
> set.seed(1244); x=rcauchy(25)
> minloglik=function(par,x){q=par[1]; s=par[2];
+ length(x)*log(s)+apply(log(1+outer(x,q,"-")^2/s^2),2,sum)}
> optim(c(median(x),mad(x)),minloglik,x=x)   # Default method: Nelder-Mead
$par
[1] 0.3668736 1.2556457
$value
[1] 48.00677
[ .... some output deleted....]
> OP=optim(c(median(x),mad(x)),method="BFGS",minloglik,x=x,hessian=TRUE); OP   # Quasi Newton
$par
[1] 0.3670376 1.2552827
$value
[1] 48.00677
$hessian
          [,1]      [,2]
[1,] 9.6044413 0.2627547
[2,] 0.2627547 6.2611816
[ .... some output deleted....]
> sqrt(diag(solve(OP$hessian)))
[1] 0.3228594 0.3998723
[ The inverse of the Hessian is an estimate of the variance matrix of the MLE (with estimated squared se's on the diagonal).]