Fast evaluation of mixed derivatives and calculation of optimal weights for integration

Hernan Leovey
Humboldt Universität zu Berlin
[email protected]

MCQMC2012, Tenth International Conference on Monte Carlo and Quasi–Monte Carlo Methods in Scientific Computing
02.14.2012


Fast evaluation of mixed derivatives andcalculation of optimal weights for integration

Hernan LeoveyHumboldt Universitat zu Berlin

02.14.2012

MCQMC2012Tenth International Conference on

Monte Carlo and Quasi–Monte Carlo Methodsin Scientific Computing

[email protected] Leovey Cross–Derivatives and Optimal Weights

Contents

1 Algorithmic Differentiation (AD): Basics; Complexity

2 Cross–Derivatives: Arithmetic operations and nonlinear functions; Complexity; Comparison with other methods: univariate Taylor polynomial expansions

3 High Dimensional Integration: Reproducing Kernel Hilbert Spaces (RKHS); Integrands with low effective dimension

4 Numerical Experiments


Basics

Frequently we have a program that calculates numerical values of a function, and we would like to obtain accurate values for derivatives of that function as well. The usual divided-difference approach is:

D^+_h f(x) ≡ (f(x + h) − f(x)) / h   or   D^±_h f(x) ≡ (f(x + h) − f(x − h)) / (2h)

For h small, truncation and round-off errors reduce the number of significant digits.

If h is not small, normally no good approximation to a derivative is expected.


Typically h = √ε is taken, where ε is the working accuracy.

Expected accuracy: 1/2 of the significant digits of f for D^+_h, and 2/3 of the significant digits of f for D^±_h.

In contrast, AD methods incur no truncation errors at all and usually yield derivatives with working accuracy.

AD §0: Algorithmic Differentiation does not incur truncation errors

AD §1: Difference quotients may sometimes be useful too

AD §2: What is good for function values is good for their derivatives


Standard setting for AD: the vector function F is the composition of a sequence of once continuously differentiable elemental functions ϕ_i.

Basic set of functions (polynomial core):

{+, ∗, − (unary sign op.), c (const. init.)}

A typical example of a library containing “elemental” functions:

{c, +, −, ∗, /, exp, log, sin, cos, tan, tan⁻¹, ..., Φ, Φ⁻¹, ...}


Basic complexity results

Consider the temporal complexity measure TIME,

TIME{task(F)} = w^T WORK{task(F)}        (1)

with w = (w1, w2, w3, w4) a vector of platform-dependent weights, and

WORK{task} ≡ (MOVES, ADDS, MULTS, NLOPS)^T
           = (# of fetches and stores,
              # of additions and subtractions,
              # of multiplications,
              # of nonlinear operations)^T        (2)

Forward mode AD:

TIME{F(x), F′(x)ẋ} ≤ ω_tang TIME{F(x)}

with a constant ω_tang ∈ [2, 5/2].


Reverse mode AD:

“Cheap Gradient Principle”

TIME{F(x), ȳ^T F′(x)} ≤ ω_grad TIME{F(x)}        (3)

for a constant ω_grad ∈ [3, 4].

As a consequence, the cost to evaluate a gradient ∇f is bounded above by a small constant ω_grad ∈ [3, 4] times the cost to evaluate the function itself.

Random Access Memory requirements, in forward and reverse mode, are bounded multiples of those for the function.

The Sequential Access Memory requirement of the basic reverse mode is proportional to the temporal complexity of the function.


Cross–Derivatives

(Automatic Evaluations of Cross-Derivatives. Griewank, L, L, Z)

With the term cross–derivatives we refer to those mixed partial derivatives where differentiation w.r.t. each variable is done at most once.

f_i(x) = ( ∏_{j∈i} ∂/∂x_j ) f(x) = ∂^k f / (∂x_{i1} · · · ∂x_{ik}) (x),   i = {i1, i2, . . . , ik}.

There are 2^d cross–derivatives if we take f_∅(x) = f(x). We store all 2^d cross–derivatives of a function u in a flat array with 2^d entries. We call such a data structure a d–dimensional cube.

d = 3 → [ u, u_{1}, u_{2}, u_{1,2}, u_{3}, u_{1,3}, u_{2,3}, u_{1,2,3} ]


Basic Operations

For a function u we denote by U its cube.

For a constant function u(x) = c we set U[0] = c and zero everywhere else.

For a coordinate function, resp. input variable, u(x) = x_j we initialize its cube by U[0] = x_j and U[2^j] = 1. The rest of the entries are set to zero.

Addition and Subtraction: V[i] = U[i] ± W[i] for all 0 ≤ i < 2^d.

Scalar Multiplication: for v(x) = c·u(x) the propagation rule is V[i] = c*U[i].

Scalar Addition/Subtraction is applied only to U[0].

The complexity of these basic operations is O(2^d).


Nonlinear Operations

Multiplication: the Leibniz formula for the multiplication of two functions v = u · w states that

v_i(x) = Σ_{j⊆i} u_j(x) w_{i−j}(x).

Assume now that n ∉ i. Then the above convolution sum can be split into

v_{i∪{n}}(x) = Σ_{j⊆i} u_{i−j}(x) w_{j∪{n}}(x) + Σ_{j⊆i} u_{j∪{n}}(x) w_{i−j}(x)

Fixing the same subset i, the two sums have the same structure, and they operate inside separate halves of the cubes.


This leads to a possible implementation:

void crossmult(int h, double *U, double *W, double *V) {
    if (h == 1) { V[0] += U[0] * W[0]; return; }
    h /= 2;
    crossmult(h, U, W + h, V + h);
    crossmult(h, U + h, W, V + h);
    crossmult(h, U, W, V);
}

Due to the recursive nature of this procedure, there will be 3^d calls with h = 1 overall, resulting in 3^d multiplications and the same number of additions.


Exponential function: v = exp(u) satisfies a very simple identity for the first partial derivatives, v_k = v·u_k. This generalizes for k ∉ i to:

v_{i∪{k}} = Σ_{j⊆i} v_{i−j}(x) u_{j∪{k}}(x)

The second half-cube of v is thus obtained by multiplying the previously computed first half-cube of v with the second half-cube of u.

void exponent(int h, double *U, double *V) {
    int i;
    for (i = 0; i < h; i++) V[i] = 0.0;
    V[0] = exp(U[0]);
    for (i = 1; i < h; i *= 2)
        crossmult(i, V, U + i, V + i);
}

There are d calls to the multiplication function, and the final relative cost is 1/2 of the cost of a full multiplication.


Complexity:

Nonlinear differentiable functions ϕ(u) included in "math.h" exhibit cost proportional to cross–multiplication.

Given a library exhibiting cost proportional to cross–multiplication, extend it by considering any nonlinear ϕ(u) satisfying a differential equation

ϕ′(u) − a(u)ϕ(u) = b(u)

with functions a(·), b(·) in the original library (ODE extension).

Proposition

The direct computation of all cross–derivatives f_* of a function f given as an evaluation procedure (with elementals in the ODE–extended library) is itself an evaluation procedure, with complexity

OPS(f_*) = O(3^d) · OPS(f)

for the runtime, and with a factor of 2^d in the memory size.

The unit is one multiplication, which is also the cost of an addition or subtraction.


Comparison with other methods:

Proposition

The method of interpolating all cross–derivatives from Taylor coefficients via univariate expansions exhibits complexity

OPS(f_*) = O(d^2 2^d) · (OPS(f) + c),   c ≤ 4,

for the runtime, and with a factor of (d + 1) 2^d in the memory size.

The cross–over between the methods occurs at d ∼ 14. For large dimensions d, the Taylor method will have better runtimes.

Advantages of the direct new method:

more accurate than Taylor univariate method

faster for d ≤ 14


Quasi–Monte Carlo Methods (QMC):

Q_{N,n}(f) := (1/N) Σ_{i=1}^{N} f(x_i) ≈ I(f) := ∫_{[0,1]^n} f(x) dx,

with x_1, · · · , x_N deterministically and cleverly chosen from [0, 1]^n.

Lattice Rules

Q_{N,n,z}(f) := (1/N) Σ_{i=0}^{N−1} f({ (i/N) z })

where N (usually prime) is the number of selected points and z is a carefully selected integer vector in Z^n.

Shifted Lattice Rules

Q_{N,n,z,∆}(f) := (1/N) Σ_{i=0}^{N−1} f({ (i/N) z + ∆ })

for ∆ ∈ [0, 1]^n.


(Weighted) Reproducing Kernel Hilbert Spaces (Sloan & Wozniakowski ’98)

Integration over a particular RKHS F_n of functions over [0, 1]^n. The reproducing kernel K_n(x, t) is a function defined over [0, 1]^n × [0, 1]^n such that

K_n(·, t) ∈ F_n for all t ∈ [0, 1]^n and

f(t) = ⟨f(·), K_n(·, t)⟩_n,   ∀f ∈ F_n, ∀t ∈ [0, 1]^n.

Worst Case Error of a QMC algorithm over F_n:

e(Q_{N,n}) := sup_{f∈F_n : ‖f‖_{F_n} ≤ 1} |I(f) − Q_{N,n}(f)|

Assume the integration functional I(·) is continuous over F_n; then e(Q_{N,n}) is bounded and

|I(f) − Q_{N,n}(f)| ≤ e(Q_{N,n}) · ‖f‖_{F_n}


Weighted Unanchored Sobolev Space F_{n,γ}

Consider weights 0 ≤ γ_{n,i}, for ∅ ≠ i ⊆ {1, · · · , n}.

K_{n,γ}(x, y) = 1 + Σ_{∅≠i⊆{1,··· ,n}} γ_{n,i} ∏_{j∈i} ( (1/2) B_2({x_j − y_j}) + (x_j − 1/2)(y_j − 1/2) )

‖f‖_{F_{n,γ}} = [ Σ_{i⊆{1,··· ,n}} γ_{n,i}^{−1} ∫_{[0,1]^{|i|}} ( ∫_{[0,1]^{n−|i|}} (∂^{|i|}/∂x_i) f(x_i, x_{D−i}) dx_{D−i} )^2 dx_i ]^{1/2}

Product weights γ_{n,i} = ∏_{j∈i} γ_{n,j} → Tensor Product RKHS:

F_{n,γ} = H_{n,γ} := H_{1,γ_1} ⊗ · · · ⊗ H_{1,γ_n} (n times)   →   K_{n,γ}(x, y) = ∏_{j=1}^{n} K_{1,γ_j}(x_j, y_j)


Theorem (Novak & Wozniakowski ’10; Kuo, Sloan, Joe, ...)

Let 0 ≤ γ_{n,i}, ∅ ≠ i ⊆ D, D := {1, · · · , n}, f ∈ F_{n,γ}. Given a prime number N, there exists a shifted rank-1 lattice rule Q_{N,n,z,∆} with generator vector z constructed by the Component-by-Component algorithm (CBC), such that

|I(f) − Q_{N,n,z,∆}(f)| ≤ [ Σ_{∅≠i⊆D} (γ_{n,i})^{1/(2τ)} ( 2ζ(1/τ) / (√(2π))^{1/τ} )^{|i|} ]^τ / (N − 1)^τ · ‖f‖*_{F_{n,γ}}

for any τ ∈ [1/2, 1).

For fixed f, we need the weights to construct a generator vector z for a lattice rule, using the CBC algorithm.

How should we choose the weights in practice?

What is an optimal embedding for a given function f in a practical problem?


Common Approach: choose the weights such that the integration error bound is minimized.

In the general case, there are 2^n − 1 terms inside

‖f‖*_{F_{n,γ}} = [ Σ_{∅≠i⊆D} γ_{n,i}^{−1} ∫_{[0,1]^{|i|}} ( ∫_{[0,1]^{n−|i|}} (∂^{|i|}/∂x_i) f(x_i, x_{D−i}) dx_{D−i} )^2 dx_i ]^{1/2}

Approach:

Very often, problems in applications exhibit low effective dimension d << n. Effective dimension refers to the essential ANOVA part of the function that accumulates most of the variance (≥ 99%).


Assume f is a square integrable function. Then we can write f as the sum of 2^n ANOVA terms:

f(x) = Σ_{i⊆D} f_i(x),   f_i(x) = ∫_{[0,1]^{n−|i|}} f(x_i, x_{D−i}) dx_{D−i} − Σ_{j⊊i} f_j(x)

For a given family T of subsets of D, let us define now

f_T(x) = Σ_{i∈T} f_i(x).

Then the integration error of a QMC algorithm Q_{N,n} is bounded by

|(I − Q_{N,n})(f)| ≤ |(I − Q_{N,n})(f_T)| + |(I − Q_{N,n})( Σ_{i⊆D, i∉T} f_i(x) )|


Theorem

Let T be a given family of subsets of D, and let f_i ∈ F_{n,γ} for i ∈ T. Then for the function f_T defined above it holds that

‖f_T‖*_{F_{n,γ}} = [ Σ_{∅≠i∈T} γ_{n,i}^{−1} ∫_{[0,1]^{|i|}} ( (∂^{|i|}/∂x_i) ∫_{[0,1]^{n−|i|}} f(x_i, x_{D−i}) dx_{D−i} )^2 dx_i ]^{1/2}

Moreover, if f ∈ F_{n,γ}, it holds for i ⊆ D that

b_{f,i} := ∫_{[0,1]^{|i|}} ( (∂^{|i|}/∂x_i) ∫_{[0,1]^{n−|i|}} f(x_i, x_{D−i}) dx_{D−i} )^2 dx_i
        = ∫_{[0,1]^{|i|}} ( ∫_{[0,1]^{n−|i|}} (∂^{|i|}/∂x_i) f(x_i, x_{D−i}) dx_{D−i} )^2 dx_i
        ≤ ∫_{[0,1]^n} ( (∂^{|i|}/∂x_i) f(x) )^2 dx


Integrands with low effective dimension

Remark

Note that any (good) upper bound b̄_{f,i} ∈ R, b̄_{f,i} ≥ b_{f,i}, also leads to an integration error upper bound of the form

|(I − Q_{N,n})(f_T)| ≤ [ Σ_{∅≠i⊆D} (γ_{n,i})^{1/(2τ)} ( 2ζ(1/τ) / (√(2π))^{1/τ} )^{|i|} ]^τ / (N − 1)^τ · [ Σ_{∅≠i∈T} γ_{n,i}^{−1} b̄_{f,i} ]^{1/2}

Product Weights (γ_{n,i} = ∏_{j∈i} γ_{n,{j}}):

Let d denote the effective dimension of f in the truncation sense (say d ≤ 14).

App.-1:

f_{EFFT_d} := Σ_{i⊆{1,...,d}} f_i(x).


Set 0/0 = 0 and c/0 = +∞ for c > 0. Assume that at least one term b_{f,i} > 0 for some ∅ ≠ i ⊆ {1, ..., d}.

For f_{EFFT_d}, consider the bound–objective function ψ : R^n_{≥0} → [0, +∞] (where the variables are the weights). For simplicity set ψ(0) = +∞, and for (x_1, . . . , x_n) ∈ R^n_{≥0} \ {0} define

ψ(x_1, . . . , x_n) = ( ∏_{j=1}^{n} ( 1 + x_j^{1/(2τ)} · 2ζ(1/τ)/(√(2π))^{1/τ} ) − 1 )^τ · ( Σ_{∅≠i⊆{1,...,d}} ( ∏_{j∈i} x_j^{−1} ) b_{f,i} )^{1/2}

Clearly we have:

minimize_{(x_1,...,x_n)∈R^n_{≥0}} ψ(x_1, . . . , x_n)   ←→   minimize_{(x_1,...,x_n)∈R^n_{≥0}} ψ(x_1, . . . , x_n) subject to x_j = 0, d + 1 ≤ j ≤ n.


Choice for non–important weights (in the ANOVA sense):

Lemma

For fixed τ ∈ [1/2, 1), let γ*_n = (γ*_{n,1}, . . . , γ*_{n,d}, 0, . . . , 0) be an optimal feasible solution of the problem above. Let ε_0 > 0, and let a_1, . . . , a_{n−d} be any sequence of nonnegative real numbers with Σ_{i=1}^{n−d} a_i ≤ M. Define

R_0 = (√(2π))^{1/τ} / (M · 2ζ(1/τ)) · log( ε_0^{1/τ} · (P − 1)/P + 1 ),   where P := ∏_{j=1}^{d} ( 1 + (γ*_{n,j})^{1/(2τ)} · 2ζ(1/τ)/(√(2π))^{1/τ} ).

Then it follows for γ_n = (γ*_{n,1}, . . . , γ*_{n,d}, R_0 a_1, . . . , R_0 a_{n−d}) that

ψ(γ_n) ≤ (1 + ε_0) ψ(γ*_n).


Option valuation problem for arithmetic-average Asian options

The asset S_t follows the geometric Brownian motion model:

S_t = S_0 exp( (r − σ²/2) t + σ W_t )

Simulating asset prices reduces to simulating paths W_{t_1}, . . . , W_{t_d}.

V = e^{−rT} / ( (2π)^{d/2} √(det(C)) ) · ∫_{R^d} max( (1/d) Σ_{j=1}^{d} S_j(w) − K, 0 ) e^{−(1/2) w^T C^{−1} w} dw

with w = (W_{t_1}, ..., W_{t_d}). After a factorization C = A A^T of the covariance matrix, transform the integral using Φ^{−1}(·):

V = e^{−rT} ∫_{[0,1]^d} max( (1/d) Σ_{j=1}^{d} S_j(A Φ^{−1}(x)) − K, 0 ) dx,

For the tests, we simplify the problem assuming K = 0. Consider principal components (PCA) and Brownian Bridge (BB) factorizations of C.


Sensitivity tests for effective dimension (Algo. Wang & Fang ’03)

K = 0, S_0 = 100, T = 1
real dimension n = 16, 64, 128
Domain truncation (Kuo & Sloan & Griebel ’10): ε = 0.1, 0.01, 0.001, 0.0001 (|I(f) − I(f_ε)| ≤ ε S_0)
For (σ, r) ∈ [0.05, 0.35] × [0.05, 0.35] (tests on a 7 × 7 uniform grid)

Using 2^16 Sobol points, all tests resulted in an effective dimension in the truncation sense of d ≤ 3 for PCA, and d ≤ 8 for the BB construction.

b_{f,i} ∼ b̄_{f,i} := ∫_{[0,1]^n} ( (∂^{|i|}/∂x_i) f(x) )² dx   for i ⊆ {1, · · · , d}

CrossAD cost for simplified examples without strike (K = 0):

Example \ n =                             8     16    32    64    128   256   512
Runtime(crossPCA)/Runtime(PCA) (d = 4)    2.4   2.4   2.3   2.3   2.3   2.2   2.2
Runtime(crossBB)/Runtime(BB)   (d = 8)    33    34    35    35    36    36    36


Fixed K = 0, S_0 = 100, T = 1, σ = 0.1, r = 0.1, domain truncation ε = 0.1 (b_{f,i} estimates using cross AD for the d first variables).

Table: Weights for τ = 0.9 (runtime for the optimization solver approx. 0.08 seconds)

BB          n8 S11    n8 S14    n8 acc    n16 S11   n16 S14   n16 acc   n128 S11  n128 S14
γ*_{n,1}    0.1188    0.1580    0.1322    0.1371    0.2183    0.1813    0.1477    0.3817
γ*_{n,2}    0.0435    0.0487    0.0486    0.0526    0.0598    0.0697    0.0631    0.0677
γ*_{n,3}    0.0076    0.0115    0.0094    0.0093    0.0162    0.0124    0.0158    0.0280
γ*_{n,4}    0.0121    0.0166    0.0162    0.0123    0.0220    0.0228    0.0103    0.0356
γ*_{n,5}    0.0013    0.0013    0.0014    0.0014    0.0015    0.0018    0.0011    0.0013
γ*_{n,6}    0.0025    0.0055    0.0034    0.0025    0.0089    0.0045    0.0022    0.0169
γ*_{n,7}    0.0034    0.0037    0.0040    0.0046    0.0044    0.0056    0.0122    0.0046
γ*_{n,8}    0.0029    0.0037    0.0035    0.0033    0.0047    0.0046    0.0031    0.0036

ψ(γ*)       2.6e+03   3.1e+03   2.9e+03   4.0e+03   5.5e+03   5.2e+03   1.1e+04   2.6e+04
ψ(1/j, 0)   9.5e+04   9.6e+04   9.6e+04   1.2e+05   1.2e+05   1.2e+05   2.4e+05   3.5e+05
ψ(1/j², 0)  1.0e+04   1.0e+04   1.0e+04   1.3e+04   1.4e+04   1.4e+04   3.6e+04   5.2e+04


Table: Weights for τ = 0.9 (runtime for the optimization solver approx. 0.05 seconds)

PCA         n8 S11    n8 S14    n8 acc    n16 S11   n16 S14   n16 acc   n128 S11  n128 S14
γ*_{n,1}    0.3776    0.3746    0.3507    0.3175    0.4131    0.4692    0.1971    0.4942
γ*_{n,2}    0.0234    0.0233    0.0228    0.0245    0.0340    0.0313    0.0383    0.0446
γ*_{n,3}    0.0160    0.0164    0.0165    0.0181    0.0280    0.0238    0.0346    0.0421
γ*_{n,4}    0.0085    0.0099    0.0095    0.0114    0.0103    0.0134    0.0273    0.0159

ψ(γ*)       6.0e+02   6.1e+02   6.1e+02   8.1e+02   9.2e+02   8.9e+02   2.7e+03   2.6e+03
ψ(1/j, 0)   4.0e+03   4.0e+03   4.0e+03   5.0e+03   5.0e+03   5.0e+03   1.1e+04   1.1e+04
ψ(1/j², 0)  1.6e+03   1.6e+03   1.6e+03   2.0e+03   2.0e+03   2.0e+03   4.5e+03   4.5e+03


Further investigations:

f_{EFF+T_d} := Σ_{i⊆{1,...,d}} f_i(x) + Σ_{d+1≤j≤n} f_{{j}}(x).

Using cross AD + reverse mode AD for cheap gradients (the optimization problem remains n–dimensional)

Product and order–dependent weights for functions with low effective superposition dimension, using forward and reverse AD (no need for numerical optimization)

Good bounds for functions with kinks (K ≠ 0, eff. sup. dim. d = 2, and P.O.D. weights)

Improved sampling strategy for squared mixed derivatives strongly diverging at small sub-cube borders

Domain truncation alternative

Thank you for your attention!
