Advanced Statistical Computing
Week 4: Optimization
Aad van der Vaart
Fall 2012
Contents
Error propagation
Root finding
Minimization
Maximum likelihood
Error propagation
Machine precision
Most numbers are not represented exactly on a computer.
> (.3-.1)-.2
[1] -2.775558e-17
> (.3-.1)==.2
[1] FALSE
> abs((.3-.1)-.2)<1e-10
[1] TRUE
> identical(.3-.1,.2)
[1] FALSE
> isTRUE(all.equal(.3-.1,.2))
[1] TRUE
>
> .Machine$double.eps
[1] 2.220446e-16
You cannot test whether a number is exactly 0, only whether it is close to 0.
[ In this case the errors arise because the numbers .3, .1, and .2 can be exactly represented in base 10, but not in base 2, which is used by the computer. The R function identical is a more robust version of “==”; the isTRUE(all.equal) construction tests for approximate equality.]
Error propagation
If you want to compute f(x) for given x, you make two types of errors:
• f(x̃) − f(x), for x̃ the machine representation of x.
• f̃(x̃) − f(x̃), for f̃ the machine algorithm for f.
The first can be made smaller only by higher precision of x̃. The second depends on the algorithm.
In the following example we calculate
f(x) = x(√(x+1) − √x) = x/(√(x+1) + √x).
> x=500009999998
> x*(sqrt(x+1)-sqrt(x))
[1] 353560.4
> x/(sqrt(x+1)+sqrt(x))
[1] 353556.9
[ The two mathematically equivalent expressions suggest two algorithms, which give different outcomes.]
Condition number
The relative error in f(x̃) as approximation of f(x) is
(f(x̃) − f(x))/f(x) ≈ (f'(x)/f(x)) (x̃ − x).
We like the condition number f'(x)/f(x) of f (at x) to be not too large.
An algorithm to compute f(x) might consist of multiple steps:
x ↦ x̃ ↦ f̃_1(x̃) ↦ f̃_2(f̃_1(x̃)) ↦ ··· ↦ f̃(x̃) = f̃_k ◦ f̃_{k−1} ◦ ··· ◦ f̃_1(x̃).
We like each step f_i to be well conditioned.
An example of an ill-conditioned elementary operation is the map
(x_1, x_2) ↦ x_1 − x_2, if x_1 ≈ x_2.
Try to avoid subtracting numbers that are almost equal.
[ This is what went wrong in the example on the preceding slide. For vector-valued x the condition number f'(x)/f(x) can be defined from the vector of partial derivatives as (∂/∂x_i)f(x)/f(x).]
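As a small illustration (not from the slides): in the square-root example above the trouble comes from subtracting two nearly equal numbers, whose relative error is amplified by roughly x_1/(x_1 − x_2). A hedged R sketch of this bookkeeping:

# Illustrative check (not from the slides) of why x*(sqrt(x+1)-sqrt(x)) loses accuracy:
# subtracting nearly equal numbers amplifies relative error by about x1/(x1-x2).
x <- 500009999998
x1 <- sqrt(x+1); x2 <- sqrt(x)
x1/(x1-x2)                              # amplification factor, roughly 1e12
x1/(x1-x2) * .Machine$double.eps        # expected relative error, roughly 2e-4
(x*(x1-x2) - x/(x1+x2))/(x/(x1+x2))     # observed relative discrepancy, roughly 1e-5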
Root finding
Problem statement
For given (nice) function f: R^k → R^k find x̄ such that
f(x̄) = 0.
METHODS:
• Bisection
• Newton-Raphson
• Secant
• Inverse quadratic interpolation
• Brent
Bisection method
GIVEN a, b WITH f(a) < 0, f(b) > 0 REPEAT WHILE |a − b| > eps:
• m = (a + b)/2
• if f(m) < 0 then a := m else b := m
• Distance to solution cut by 1/2 in each step.
• Sure to converge, but relatively slow.
Newton-Raphson
GIVEN x_0 REPEAT:
• draw tangent to f at x_0
• x_1 := zero of tangent.

• Fast if x_0 is close to solution
• May fail to converge
• Need formula for derivative
• Many adaptations
• Generalizes to higher dimensions
Newton-Raphson — derivation
The tangent to f at x_0 is x ↦ f(x_0) + f'(x_0)(x − x_0). Its zero x_1 satisfies f(x_0) + f'(x_0)(x_1 − x_0) = 0, leading to

x_{i+1} = x_i − f(x_i)/f'(x_i).
Higher-dimensional analogue
x_{i+1} = x_i − f'(x_i)^{-1} f(x_i),

for f'(x) the matrix of partial derivatives.
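To make the higher-dimensional update concrete, here is a minimal sketch in R (illustrative, not from the slides); the function name newton_raphson and the small test system are made up for the example.

# Minimal multivariate Newton-Raphson sketch (illustrative; not from the slides).
# f returns a numeric vector, J returns its matrix of partial derivatives.
newton_raphson <- function(f, J, x0, tol = 1e-10, maxit = 50) {
  x <- x0
  for (i in 1:maxit) {
    step <- solve(J(x), f(x))          # solve f'(x) %*% step = f(x)
    x <- x - step
    if (max(abs(step)) < tol) break
  }
  x
}
# Example system: x1^2 + x2^2 = 2 and x1 - x2 = 0, with solution (1, 1).
f <- function(x) c(x[1]^2 + x[2]^2 - 2, x[1] - x[2])
J <- function(x) matrix(c(2*x[1], 2*x[2], 1, -1), nrow = 2, byrow = TRUE)
newton_raphson(f, J, c(2, 0.5))        # approximately c(1, 1)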
Order of convergence
Let x_0, x_1, x_2, . . . be successive approximations for the solution x̄.

The method converges linearly (or at first order) if

|x_{i+1} − x̄| ≤ C |x_i − x̄|, C < 1.

The method converges quadratically (or at second order) if

|x_{i+1} − x̄| ≤ K |x_i − x̄|^2, K ∈ R.

For a linear method |x_i − x̄| ≤ C^i |x_0 − x̄| → 0, but a quadratic method converges much faster provided |x_i − x̄| becomes small eventually.
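For a feel of the difference (an illustrative calculation, not from the slides), take an initial error of 0.1 with C = 0.5 and K = 1:

# Linear (C = 0.5) versus quadratic (K = 1) error bounds, starting from error 0.1.
# Values are chosen only for illustration.
err_lin <- 0.1; err_quad <- 0.1
for (i in 1:5) {
  err_lin  <- 0.5 * err_lin     # linear: error is halved each step
  err_quad <- err_quad^2        # quadratic: error is squared each step
  cat(i, err_lin, err_quad, "\n")
}
# After 5 steps the linear bound is about 3e-3, the quadratic bound about 1e-32.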
R
Bisection (linear)
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> A=1.4; B=2.9
> for (i in 1:20) {m=(A+B)/2
+ if (f(A)*f(m)>0) A=m else B=m
+ print(A,digits=12)}
[1] 1.4
[1] 1.775
[1] 1.9625
[1] 1.9625
[1] 1.9625
[1] 1.9859375
[1] 1.99765625
[1] 1.99765625
[1] 1.99765625
[1] 1.99912109375
[1] 1.99985351562
[1] 1.99985351562
[1] 1.99985351562
[1] 1.99994506836
[1] 1.99999084473
[1] 1.99999084473
[1] 1.99999084473
[1] 1.99999656677
[1] 1.9999994278
[1] 1.9999994278
Newton-Raphson (quadratic)
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> f1=function(x){1+x^2-0.3*(x-1)^3+(x-2)*(2*x-0.9*(x-1)^2)}
> g=function(x){x-f(x)/f1(x)}
> xold=2.9
> for (i in 1:20){ xnew=g(xold); xold=xnew
+ print(xnew,digits=12)}
[1] 2.21416533654
[1] 2.0235925475
[1] 2.00035651707
[1] 2.0000000838
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
Newton-Raphson — has quadratic order
The Newton-Raphson algorithm iterates x_{i+1} = g(x_i) for

g(x) = x − f(x)/f'(x).
This function has the properties
g(x̄) = x̄ − f(x̄)/f'(x̄) = x̄,

g'(x̄) = 1 − (f'(x̄)^2 − f(x̄)f''(x̄))/f'(x̄)^2 = 0.
Therefore Taylor’s theorem can be used to obtain
x_{i+1} − x̄ = g(x_i) − g(x̄) ≈ g'(x̄)(x_i − x̄) + (1/2) g''(x̄)(x_i − x̄)^2 ≈ (1/2) g''(x̄)(x_i − x̄)^2.
Secant method
GIVEN x_0, x_1 REPEAT:
• draw line (“secant”) through the points (x_0, f(x_0)) and (x_1, f(x_1))
• x_2 := zero of secant.
• If x_0 ≈ x_1 very similar to Newton-Raphson
• Does not need formula for derivative
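For concreteness, a minimal secant iteration in R (an illustrative sketch, not code from the slides; the function name secant and the stopping rule are made up):

# Minimal secant-method sketch (illustrative; not from the slides).
secant <- function(f, x0, x1, tol = 1e-10, maxit = 50) {
  for (i in 1:maxit) {
    # zero of the line through (x0, f(x0)) and (x1, f(x1))
    x2 <- x1 - f(x1)*(x1 - x0)/(f(x1) - f(x0))
    if (abs(x2 - x1) < tol) return(x2)
    x0 <- x1; x1 <- x2
  }
  x2
}
f <- function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}   # same test function as in the examples above
secant(f, 1.4, 2.9)                           # approximately 2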
Quadratic interpolation
GIVEN x_0, x_1, x_2 REPEAT:
• draw parabola P_2(y) through (f(x_0), x_0), (f(x_1), x_1), (f(x_2), x_2)
• x_3 := P_2(0)
• With line instead of parabola becomes secant method
• Does not need formula for derivative
[ The parabola approximates f^{-1}, whence P_2(0) ≈ f^{-1}(0) = x̄.]
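One step can be written down directly (an illustrative sketch, not from the slides; the helper name iqi_step is made up): fit the Lagrange interpolating parabola in the (f(x), x) plane and evaluate it at 0.

# One inverse quadratic interpolation step (illustrative; not from the slides).
# Fits the parabola P_2 through (f(x0),x0), (f(x1),x1), (f(x2),x2) and returns P_2(0).
iqi_step <- function(f, x0, x1, x2) {
  y0 <- f(x0); y1 <- f(x1); y2 <- f(x2)
  # Lagrange form of P_2, evaluated at y = 0
  x0*y1*y2/((y0-y1)*(y0-y2)) + x1*y0*y2/((y1-y0)*(y1-y2)) + x2*y0*y1/((y2-y0)*(y2-y1))
}
f <- function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}   # same test function as above
iqi_step(f, 1.4, 2.9, 2.5)                    # about 1.87, one step toward the root at 2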
Brent’s method
This is a hybrid algorithm, trying to borrow speed from Newton and surety from bisection. Its iterations use bisection, secant or quadratic interpolation, depending on the current values.
R implementation: uniroot.
> uniroot(f,c(1,3))   # search in interval [1,3]
$root
[1] 1.999998
$f.root
[1] -8.559716e-06
$iter
[1] 7
$estim.prec
[1] 6.103516e-05
Minimization
Problem statement
For given (nice) function F: R^k → R find x̄ such that

F(x̄) = min_x F(x).
For unconstrained optimization x̄ is a zero of the derivative: F'(x̄) = 0.
METHODS:
• Newton
• Quasi Newton
• Nelder-Mead
• (any root finding algorithm)
• (many specialized algorithms)
Quasi Newton
Newton's algorithm for solving F'(x̄) = 0 gives iterates

x_{i+1} = x_i − F''(x_i)^{-1} F'(x_i),    F''(x) = (∂^2 F(x)/∂x_i ∂x_j).
A Quasi Newton method replaces F''(x_i) by a matrix that is easier to compute or gives a more stable algorithm.
R implementations: nlm and optim with method BFGS.
> f=function(x){(x-2)*(1+x^2-0.3*(x-1)^3)}
> F=function(x){x^2/2+x^4/4-0.3*x*(x-1)^4/4-0.3*(x-1)^5/20-2*x-2*x^3/3+0.3*(x-1)^4/2}
> nlm(F,2.9)
$minimum
[1] -3.351043
$estimate
[1] 2.035899
$code
[1] 1
$iterations
[1] 5
> optim(2.9,F,method="BFGS")
$par
[1] 2.0359
$value
[1] -3.351043
$convergence
[1] 0
[ .... some output deleted....]
> optim(2.9,F,f,method="BFGS")   # pass gradient f=F' to function for increased efficiency
$par
[1] 2.000809
$value
[1] -3.348453
[ .... some output deleted....]
Nelder-Mead algorithm
The Nelder-Mead algorithm keeps a test set of k + 1 points x_i with their function values F(x_i).

It iterates by replacing points x_i by “more promising” points, using a library of possible moves (e.g. reflection in the centre of gravity of the points).

R implementation: optim with (default) method Nelder-Mead.
> F=function(x){x^2/2+x^4/4-0.3*x*(x-1)^4/4-0.3*(x-1)^5/20-2*x-2*x^3/3+0.3*(x-1)^4/2}
> optim(2.9,F,method="Nelder-Mead")
$par
[1] 2.035947
$value
[1] -3.351043
$convergence
[1] 0
Warning message:
one-diml optimization by Nelder-Mead is unreliable: use "Brent" or optimize() directly
[ This may perhaps be viewed as a higher-dimensional version of bisection.]
Maximum likelihood
Maximum likelihood
Observations X_1, . . . , X_n from density x ↦ p_θ(x).
MLE θ̂ maximizes the likelihood

θ ↦ ∏_{j=1}^n p_θ(X_j).
If θ ↦ p_θ(x) is smooth, then the MLE solves

∑_{j=1}^n ℓ̇_θ(X_j) = 0,

for ℓ̇_θ the score function

ℓ̇_θ(x) = (∂/∂θ) log p_θ(x).
Fisher scoring
The Newton-Raphson iterations are
θ_{i+1} = θ_i − (∑_{j=1}^n ℓ̇_{θ_i}(X_j)) / (∑_{j=1}^n ℓ̈_{θ_i}(X_j)),
for
ℓ̇_θ(x) = (∂/∂θ) log p_θ(x),    ℓ̈_θ(x) = (∂^2/∂θ^2) log p_θ(x).
The Hessian −∑_{j=1}^n ℓ̈_θ(X_j) is called the observed information and satisfies, for I_θ the Fisher information of the model,

−E_θ ∑_{j=1}^n ℓ̈_θ(X_j) = n I_θ.
The iterations with −∑_{j=1}^n ℓ̈_θ(X_j) replaced by n I_θ are called Fisher scoring.
The square root of the inverse (n I_θ̂)^{-1} is an estimate of the se of the MLE.
[ Fisher scoring is sensible only if I_θ can be calculated analytically.]
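As a concrete (hedged) illustration, not taken from the slides: for the Cauchy location model with scale 1 the Fisher information per observation is I_θ = 1/2, so Fisher scoring iterates θ_{i+1} = θ_i + (2/n) ∑_j ℓ̇_{θ_i}(X_j). A minimal R sketch, using the same simulated data as on the following slides:

# Fisher scoring for the Cauchy location model (illustrative sketch; not from the slides).
# With scale 1 the Fisher information per observation is 1/2, so n*I_theta = n/2.
set.seed(1244); x <- rcauchy(25)               # same data as on the following slides
score <- function(q, x) sum(2*(x - q)/(1 + (x - q)^2))
theta <- median(x)                             # robust starting value
for (i in 1:20) theta <- theta + (2/length(x))*score(theta, x)
theta                                          # close to the MLE 0.369 found on the next slides
sqrt(2/length(x))                              # Fisher-information-based estimate of the se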
R — Cauchy MLE
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
[Figure: minus log-likelihood (left) and score function (right) plotted against θ.]
R — Cauchy MLE
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
> set.seed(1244); x=rcauchy(25)
> q=seq(min(x),max(x),by=0.01)
> minloglik=function(q,x){apply(log(1+outer(x,q,"-")^2),2,sum)}
> minloglik1=function(q,x){v=outer(x,q,"-"); 2*apply(v/(1+v^2),2,sum)}
> uniroot(minloglik1,median(x)+c(-3,3),x=x)
$root
[1] 0.3691374
> optim(median(x),method="BFGS",minloglik,x=x,hessian=TRUE)
$par
[1] 0.369137
$hessian
         [,1]
[1,] 12.99744     # estimated se is 1/sqrt(12.99744)
> optim(median(x),lower=median(x)-3,upper=median(x)+3,method="Brent",minloglik,x=x)
$par
[1] 0.3691371
[ .... some output deleted....]
R — Cauchy MLE — local optima
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
[Figure: minus log-likelihood (left) and score function (right) for θ between −70 and −60, showing a local optimum.]
R — Cauchy MLE — local optima
Calculate MLE for sample from Cauchy distribution centered at θ.
Likelihood and score function are proportional to
θ ↦ ∏_{j=1}^n (1/(2π)) · 1/(1 + (X_j − θ)^2),    ∑_{j=1}^n 2(X_j − θ)/(1 + (X_j − θ)^2).
> optim(-70,method="BFGS",minloglik,x=x)
$par
[1] -68.24447
$convergence
[1] 0
[ .... some output deleted....]
> uniroot(minloglik1,c(-68,-64),x=x)
$root
[1] -66.39614
$f.root
[1] -1.346517e-06
$iter
[1] 5
$estim.prec
[1] 6.103516e-05
[ .... some output deleted....]
R — Cauchy MLE — multivariate
Calculate MLE for sample from Cauchy distribution with center θ and scale σ.
Likelihood function
(θ, σ) ↦ ∏_{j=1}^n (1/(2πσ)) · 1/(1 + (X_j − θ)^2/σ^2).
> set.seed(1244); x=rcauchy(25)
> minloglik=function(par,x){q=par[1]; s=par[2];
+ length(x)*log(s)+apply(log(1+outer(x,q,"-")^2/s^2),2,sum)}
> optim(c(median(x),mad(x)),minloglik,x=x)   # Default method: Nelder-Mead
$par
[1] 0.3668736 1.2556457
$value
[1] 48.00677
[ .... some output deleted....]
> OP=optim(c(median(x),mad(x)),method="BFGS",minloglik,x=x,hessian=TRUE); OP   # Quasi Newton
$par
[1] 0.3670376 1.2552827
$value
[1] 48.00677
$hessian
          [,1]      [,2]
[1,] 9.6044413 0.2627547
[2,] 0.2627547 6.2611816
[ .... some output deleted....]
> sqrt(diag(solve(OP$hessian)))
[1] 0.3228594 0.3998723
[ The inverse of the Hessian is an estimate of the variance matrix of the MLE (with estimated squared se's on the diagonal).]