
Parallel Deterministic and Stochastic Global Minimization of Functions with Very Many Minima

DAVID R. EASTERLING [email protected]

Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA

LAYNE T. WATSON

Departments of Computer Science and Mathematics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA

MICHAEL L. MADIGAN

Department of Engineering Science and Mechanics, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA

BRENT S. CASTLE

School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA

MICHAEL W. TROSSET

Department of Statistics, Indiana University, Bloomington, IN 47405, USA

Abstract. The optimization of three problems with high dimensionality and many local minima is investigated under five different optimization algorithms: DIRECT, simulated annealing, Spall's SPSA algorithm, the KNITRO package, and QNSTOP, a new algorithm developed at Indiana University.

Keywords: stochastic optimization, deterministic optimization, biomechanics, quadratic optimization, simulated annealing, DIRECT, QNSTOP, KNITRO

1 Introduction

The minimization of difficult functions is an important topic in numerical analysis. An optimization

method that solves one problem efficiently and effectively can fail to yield useful results for another

problem, even if the two problems under consideration share broad similarities. Three problems are

considered here: a biomechanical balance problem with an unknown global minimum, a nonconvex

nonsmooth quadratic minimization problem with a global minimum known by construction, and

a smooth problem in reflection reduction with a global minimum that can be directly calculated.

While the problems are of similar dimensionality and all have a large number of local minima, the

character of each problem is distinct.

The balance problem, while deterministic, contains enough modeling noise to cause deterministic

optimization algorithms difficulty. (The biomechanics ‘model’ consists of splicing together

published empirical models over different motion regimes. These models are inconsistent at their

interfaces, and the resulting combined model is thus discontinuous across manifolds in the domain.

The numerical noise (ODEs, integrals) is significant but dominated by the modeling noise.) The

balance problem is a constrained optimization problem of 57 dimensions. The nonconvex

nonsmooth quadratic minimization problem is a reformulation of an integer programming problem,

and the function considered here is carefully constructed to contain a large number of local minima


and a single global optimum point. This minimization problem is an unconstrained problem with

a scalable number of dimensions, chosen here to be 57. The wave annihilation problem is a smooth

minimization problem with an even number of variables, chosen as 56 to keep the size comparable

to that of the other two problems. Together, these three problems provide a useful context in which

to compare the performance of the optimization algorithms on moderately large and qualitatively

different problems.

Five optimization algorithms are considered for each problem: the simultaneous perturbation

stochastic approximation algorithm, two parallel implementations of a simulated annealing scheme,

a parallel implementation of the DIRECT algorithm, the direct interior point method found in

the commercial KNITRO optimization package, and a new quasi-Newton stochastic algorithm,

QNSTOP.

The simultaneous perturbation stochastic approximation algorithm (SPSA) is an

unconstrained stochastic optimization algorithm notable for the small number of objective function

evaluations per iteration [33]. This allows it to scale to higher dimensions better than the finite

difference methods from which it is derived [35]. Like the finite difference methods, it suffers from

a tendency to become trapped at local minima. SPSA is more usually employed as a local

optimization algorithm, but it may function as a global optimization algorithm under certain broad

conditions [26]. The parallel implementation employed here is a naive one, with minimal

interprocessor communication. This parallel SPSA is applied to these three problems to test its suitability

for problems with a high dimensionality and a large number of local minima. For more information

on the SPSA algorithm, see [34] and [36].

Simulated annealing is an unconstrained stochastic global optimization algorithm commonly

used for difficult biomechanics problems. The parallel implementation employed here, simulated

parallel annealing within a neighborhood (SPAN), is designed to minimize interprocessor

communication while maximizing the use of multiple processors to compute objective function values at

the desired points [19]. For more on simulated annealing, see [13], [21], and [25]. Other parallel

simulated annealing algorithms are discussed in [29].

The DIRECT algorithm is a highly parallelizable box-constrained deterministic global

optimization algorithm [22]. The parallel implementation employed here, pVTdirect, is designed to

preserve the deterministic nature of the algorithm while exploiting its natural parallelism [18], [15].

For more information on DIRECT, see [23], [16], and [17].

KNITRO contains a collection of algorithms for local nonlinear optimization developed by

Ziena Optimization, LLC. While all the optimization algorithms in the KNITRO package are

designed for twice continuously differentiable problems, KNITRO nevertheless contains code for

approximation of derivatives and can be used on nonsmooth problems as well, though for such

problems the performance may degrade. As the only truly gradient-driven optimization technique

considered here, KNITRO provides a contrast to the other algorithms and shows the

usefulness of such a technique on these test problems. For more information on KNITRO, see

[3], [4], and [5].

QNSTOP, an algorithm under development at Indiana University, is a local search strategy

for stochastic optimization that synthesizes ideas from numerical optimization (secant updates,

trust regions) and response surface methodology (ridge analysis). Here, stochastic optimization

describes problems in which function evaluation is uncertain, i.e., instead of computing y = µ(x),

y is drawn from a probability distribution P(x). For example, the case of additive normal errors

with equal variance $\sigma^2$ is $y \sim \mathrm{Normal}(\mu(x), \sigma^2)$. Some modifications are therefore necessary to

apply this algorithm to the deterministic problems being considered.


The paper is organized as follows. Sections 2, 3, and 4 describe the three test problems.

The parallel DIRECT is described in Section 5, simulated annealing in Section 6, Spall’s SPSA

algorithm in Section 7, KNITRO in Section 8, and QNSTOP in Section 9. Section 10 contains a

discussion of experimental results and concludes.

2 Biomechanics

The first problem under consideration is a two-dimensional musculoskeletal model utilizing

forward dynamic simulations [2]. The task investigated involves maintaining bipedal balance without

stepping after an abrupt backwards support surface displacement.

Figure 1. Schematic drawing of the three segment sagittal plane model representing the human body.

The musculoskeletal model is a sagittal plane representation of the volunteer including three

rigid segments representing the shanks, thighs, and head-arms-trunk (HAT) connected by

frictionless pin joints (see Figure 1) activated by three joint torques representing the torques of the ankles,

knees, and hips. The joint torques are the sum of the passive and active joint torques T = Tp +Ta

and represent all flexor and extensor contributions to the joints. The feet are neglected in the

model because the volunteer from which experimental data was derived exhibited minimal heel

rise during trials. As such, the joint representing the ankle is assumed to simply connect the distal

end of the shanks to the floor. The inputs to the dynamic model are the joint torques and the

time-varying position of the moving platform.

Equations of motion for the model are derived from Lagrange dynamics, and are uniquely

determined from the segments’ length, mass, center of mass, and moments of inertia. These

constants are calculated from the subject’s height (1.6 m) and weight (60 kg) by the formulas


presented in [27], and are given here for convenience. The segment masses are calculated as 5.39

kg for the shank segment, 13.0 kg for the thigh segment, and 37.3 kg for the trunk segment. The

length of the segments are 0.408 m for the shank, 0.402 m for the thigh, and 0.475 m for the trunk.

The center of mass for each segment, in distance from the proximal joint, is 0.171 m for the shank,

0.157 m for the thigh, and 2.5 · 10−4 m for the trunk. The moments of inertia are calculated as

0.0584 kg ·m2 for the shank, 0.228 kg ·m2 for the thigh, and 2.07 kg ·m2 for the trunk. Finally,

the initial angles for the model are −0.0114 rad for the ankle, 3.152 rad for the knee, and 3.272

rad for the hip.

Passive torque (N·m)

$$T_p = 2\,(T_{p,a},\ T_{p,k},\ -T_{p,h})^T$$

is calculated using equations taken from [30], which generate passive torque with respect to the ankle $T_{p,a}$, knee $T_{p,k}$, and hip $T_{p,h}$. Since these equations are in degrees, joint angles must be converted to degrees in order to use them. Given that $\theta_a$ represents the angle between the ground and the shank, $\theta_k$ represents the angle between the shank and the thigh, and $\theta_h$ represents the angle between the hip and the torso,

$$T_{p,a} = e^{(a - b\theta_a - c\theta_k)} - e^{(-d + f\theta_a + g\theta_k)} - 1.792$$

represents the passive torque associated with a single ankle joint, where a = 2.1016, b = 0.0843, c = 0.0176, d = 7.97634, e is Euler's number (the base of the natural logarithm), f = 0.1949, and g = 0.0008.

$$T_{p,k} = e^{(h - j\theta_a - k\theta_k + l\theta_h)} - e^{(-m + n\theta_a + o\theta_k - p\theta_h)} + e^{(q - r\theta_k)} - 4.820$$

represents the passive torque associated with a single knee joint, where h = 1.8, j = 0.0460, k = 0.0352, l = 0.0217, m = 3.971, n = 0.0004, o = 0.0495, p = 0.0128, q = 2.220, and r = 0.150.

$$T_{p,h} = e^{(s - t\theta_k - u\theta_h)} - e^{(v - w\theta_k + y\theta_h)} - 8.072$$

represents the passive torque associated with a single hip joint, where s = 1.4655, t = 0.0034, u = 0.0750, v = 1.3403, w = 0.0226, and y = 0.0305.

Active torque (N·m)

$$T_a = 4\,(T_{a,a},\ T_{a,k},\ T_{a,h})^T$$

is defined as the maximum isometric torque scaled by three functions that are known to influence

torque production. (The value “4” is that used by Bieryla [2], but should be “2” in a correct model.)

Each active torque (ankle Ta,a, knee Ta,k, and hip Ta,h) is the result of the torques generated by

forces applied by muscles in two directions (extension and flexion). The active torque with respect

to an individual joint j and a force direction f generated at time t (s) with joint angle θj (rad)

and angular velocity ωj (rad/s) is

$$T_{a,j,f}(t, \theta_j, \omega_j) = T_{j,f,\max}\; r_{j,f}(\theta_j)\; h_j(\omega_j)\; A_j(t).$$

Depending on the activation at a given moment in time Aj(t), either the extension or flexion

formulas will be used to calculate Ta,j,f . Positive activation for a joint corresponds to extension

for the ankle and hip and flexion for the knee, and vice versa for negative activation.

$T_{a,e,\max} = 0.125\,hw\,t_s$ is the maximum isometric torque (N·m) for the ankle in the extension

direction, where h is the height of the subject (m), w is the weight of the subject (N), and $t_s = 1.2$

is a (unitless) variable based on the strength of the subject. Similarly, $T_{a,f,\max} = 0.022\,hw\,t_s$,

$T_{k,e,\max} = 0.124\,hw\,t_s$, $T_{k,f,\max} = 0.060\,hw\,t_s$, $T_{h,e,\max} = 0.138\,hw\,t_s$, and $T_{h,f,\max} = 0.081\,hw\,t_s$.

These maximum isometric torques are determined for a single lower extremity from a strength

model of female older adults [1].

The torque-angle relation $r_{j,f}(\theta_j)$ (unitless) is obtained from previously published

experimental data [20] and varies from zero to one. If the values for the angles fall outside the range allowed


by the model during the simulation, they are set to the limit of the model. The angle limits (rad)

and relation formulas are

$$-0.52 < \theta_a < 0.61, \qquad 0.0 < \theta_k < 2.27, \qquad -0.17 < \theta_h < 2.27,$$

$$r_{a,f} = -0.1731\,\theta_a^3 - 0.5882\,\theta_a^2 + 0.3357\,\theta_a + 0.9502,$$

$$r_{a,e} = 2.742\,\theta_a^4 + 1.6115\,\theta_a^3 - 2.8579\,\theta_a^2 - 0.4996\,\theta_a + 0.9699,$$

$$r_{k,f} = -0.2543\,\theta_k^5 + 1.5215\,\theta_k^4 - 2.9033\,\theta_k^3 + 1.4916\,\theta_k^2 + 0.2539\,\theta_k + 0.7643,$$

$$r_{k,e} = 0.2334\,\theta_k^4 - 0.4944\,\theta_k^3 - 1.0148\,\theta_k^2 + 2.051\,\theta_k + 0.1865,$$

$$r_{h,f} = 0.4450\,\theta_h^7 - 3.1958\,\theta_h^6 + 8.5726\,\theta_h^5 - 10.2750\,\theta_h^4 + 4.7283\,\theta_h^3 - 0.1678\,\theta_h^2 + 0.2293\,\theta_h + 0.6449,$$

$$r_{h,e} = 0.2056\,\theta_h^4 - 0.5625\,\theta_h^3 - 0.2723\,\theta_h^2 + 0.9446\,\theta_h + 0.6095.$$

The torque-angular velocity relation

$$h(\omega_j) = \begin{cases} (\omega_0 - \omega_j)/(\omega_0 + \Gamma\,\omega_j), & \omega_j/\omega_0 \le 1,\\ 0, & \omega_j/\omega_0 > 1, \end{cases}$$

also varies from zero to one and is obtained from [32], where ωj is the angular velocity of the joint

j (rad/s), ω0 (±20 rad/s) is the maximum angular velocity for all joints, and Γ = 2.5 is the shape

factor describing the torque-angular velocity curve [32]. If the angular velocity and activation level

have opposite signs, indicative of eccentric muscle contraction, h(ωj) is increased to a maximum

value of 1.5.
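To make this scaling concrete, here is a minimal Python sketch of the torque-angular velocity relation; the sign convention tying $\omega_0$ to the activation direction and the clamping of the eccentric case are illustrative assumptions based on the description above, not code from the model.

```python
GAMMA = 2.5     # shape factor of the torque-angular velocity curve [32]
OMEGA0 = 20.0   # maximum angular velocity magnitude for all joints (rad/s)

def velocity_scale(omega, activation):
    """Torque-angular velocity relation h(omega) for one joint."""
    w0 = OMEGA0 if activation >= 0 else -OMEGA0   # assumed sign convention
    if omega / w0 > 1.0:
        return 0.0                                # joint faster than omega_0
    h = (w0 - omega) / (w0 + GAMMA * omega)
    if omega * activation < 0:                    # eccentric contraction
        return min(h, 1.5)                        # allowed to rise to 1.5
    return min(max(h, 0.0), 1.0)                  # concentric: in [0, 1]
```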

A(t) (unitless) is allowed to vary from −1 to 1 to allow for flexion and extension of each

joint. Because activation dynamics are not instantaneous, joint torque activation rate of change is

limited to a maximum absolute value of 1/0.08 per second [7]. To enforce this rate of change, a

bijective conformal mapping is employed [10]. Nineteen nodes, equally spaced 100 ms apart, are

used to represent the joint torque activation profile of the ankles, knees, and hips combined (57

nodes total). Linear interpolation is used to define the activation levels between consecutive nodes.

These nodes represent the variables for the objective function f .

The optimization algorithms under consideration are used to attempt to determine the

values for the joint activations (57 nodes) that minimize the performance criterion

$$\begin{aligned} f = {} & w_1 \int_{t_0}^{t_f} \bigl(X_C(t) - X_A\bigr)\,dt + w_2 \int_{t_0}^{t_f} e\bigl(\theta(t)\bigr)\,dt + w_3 \int_{t_0}^{t_f} e\bigl(\dot\theta(t)\bigr)\,dt \\ & + w_4 \int_{t_0}^{t_f} \Bigl(\sum_{i=1}^{3} q_i(t)^2\Bigr)^{1/2} dt + w_5 \int_{t_0}^{t_f} \dot X_C(t)\,dt + w_6 \int_{t_0}^{t_f} \ddot X_C(t)\,dt \\ & + w_7 \sum_{i=1}^{3} \int_{t_0}^{t_f} \tau_i(t)^2\,dt \end{aligned}$$


adapted from [39], [40], where $e(s(t)) = \sum_{i=1}^{3} \phi(s_i(t))$ and

$$\phi(s_i(t)) = \begin{cases} s_i(t)^- - s_i(t), & s_i(t) < s_i(t)^-,\\ 0, & s_i(t)^- \le s_i(t) \le s_i(t)^+,\\ s_i(t) - s_i(t)^+, & s_i(t) > s_i(t)^+, \end{cases}$$

with $s_i(t)^-$ and $s_i(t)^+$ representing the lower and upper physical bounds of the joint angles, respectively.
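A minimal Python sketch of this one-sided limit penalty and the corresponding term e(s(t)) (function names are illustrative):

```python
def limit_penalty(s, s_lo, s_hi):
    """phi: amount by which a value s violates the interval [s_lo, s_hi]."""
    if s < s_lo:
        return s_lo - s
    if s > s_hi:
        return s - s_hi
    return 0.0

def e_term(values, lows, highs):
    """e(s(t)): sum of phi over the three joints at one time sample."""
    return sum(limit_penalty(s, lo, hi) for s, lo, hi in zip(values, lows, highs))
```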

The first term in the objective function (unitless) minimizes the maximum horizontal

displacement of the center of mass $X_C(t) - X_A$, where $X_C(t)$ is the center of mass of the body on the

displacement platform (m) and XA is the position of the ankle on the displacement platform (m).

These values are taken from the experimental data. The second and third terms restrict joint

angle θ(t) (rad) and angular velocity $\dot\theta(t)$ (rad/s) to remain within previously published physiologic

limits. The joint angle minimums are −0.873 rad for the ankle, 0 rad for the knee and −0.524 rad

for the hip. The joint angle maximums are 0.524 rad for the ankle, 2.269 rad for the knee and

2.182 rad for the hip. The angular velocity minimums are −6.2 rad/s for the ankle, −7.3 rad/s for

the knee, and −8.5 rad/s for the hip. The angular velocity maximums are 8 rad/s for the ankle, 15

rad/s for the knee, and 10 rad/s for the hip. The fourth, fifth, and sixth terms minimize segment

angular velocity $q_i(t)$ (rad/s), center of mass velocity $\dot X_C(t)$ (m/s), and center of mass

acceleration $\ddot X_C(t)$ (m/s²), respectively, over the entire simulation. The seventh term minimizes the

integral of the square of the joint torques τi(t), the sum of active and passive torques. The weights

for f are w1 = 1000, w2 = 500, w3 = 500, w4 = 50, w5 = 100, w6 = 25, and w7 = 0.025. The initial

joint angle configuration is derived from experimental data, and the initial joint angular velocities

are set to zero. The duration of the simulation time is tf = 1.8 s, allowing for full recovery from

the perturbation.

The minimum time to boundary (TTB) of the center of mass is used to quantify model

performance with respect to balance. TTB is calculated as the instantaneous anterior-posterior

(A/P) distance from the center of mass to the base of support divided by the instantaneous

absolute value of the center of mass A/P velocity [31]. The base of support boundary is defined as

the position of the first metatarsal based on the volunteer’s anthropometry. The minimum TTB

value describes the smallest amount of time for the participant to reach their limit of stability.

Loss of balance occurs for TTB ≤ 0 s. A higher TTB indicates a longer period of time until the

participant reaches their limit of stability. If the participant reaches the base of support a step

occurs. Therefore, a decrease in TTB is seen as a degradation in balance.

In summary, given that x is the torque activation level function A(t) discretized to nineteen

distinct nodes for the three joints, each optimization algorithm attempts to minimize the stated

f(x), subject to the constraints that −1 ≤ x ≤ 1 and that within a single activation profile, the

activation level between two consecutive discrete nodes may differ by no more than 1.25.

The best known solution to this problem is f(x∗) = 1222.05. This solution was found in

the course of this investigation by a parallel run of pVTdirect, centered at a point found by the

QNSTOP algorithm with a single experiment centered on the origin that was granted a function

evaluation budget of $10^5$. This measure f(x) is unitless and only relevant when compared to other

function evaluations to determine the relative fitness of a minimizing point.


Figure 2. f(x) along the line through two points of the biomechanics problem

(top), and a zoomed view (bottom).

The highly variable nature of the biomechanics objective function f(x) is shown in Figure

2, which shows the function evaluated along the line between two widely separated points and a

zoomed view of a “smooth” part of the plot, demonstrating the presence of local noise.

3 Nonconvex Quadratic Minimization

The second problem of interest is the nonconvex box-constrained quadratic minimization problem:

$$(\mathcal{P}_b):\quad \min\Bigl\{ P(x) = \tfrac12 x^T A x - f^T x \;:\; x \in X_b \Bigr\},$$

where

$$X_b = \bigl\{ x \in \mathbb{R}^n \bigm| -1 \le x_i \le 1,\ \forall i = 1, \ldots, n \bigr\}.$$

Replacing the feasible set $X_b$ by its vertices

$$\delta X_b = \bigl\{ x \in \mathbb{R}^n \bigm| x \in \{-1, 1\}^n \bigr\}$$


gives the integer programming problem

$$(\mathcal{P}_{ip}):\quad \min\Bigl\{ P(x) = \tfrac12 x^T A x - x^T f \;:\; x \in \delta X_b \Bigr\}.$$

Using the canonical duality theory of Gao et al. [11], [12], the integer programming problem

(Pip) may be reformulated as

$$(\mathcal{P}_{dip}):\quad \min\Bigl\{ Q(\sigma) = \tfrac12 \sigma^T \sigma - \sum_{i=1}^{n} \bigl| f_i + (B^T\sigma)_i \bigr| \;:\; \sigma \in \mathbb{R}^m \Bigr\},$$

where $\sigma = (\sigma_1, \ldots, \sigma_m)$, $f = (f_1, \ldots, f_n)$, and the m × n real matrix B is related to A. The

reformulation is a nonconvex nonsmooth unconstrained minimization problem. Here m = 57, n = 190,

$$\bar B = \begin{pmatrix} 1 & -1 & 0 & -1 & 2 & 0 & 1 & -2 & 1 & 1\\ 1 & -1 & 1 & -1 & -1 & 0 & -2 & 2 & 0 & 1\\ 2 & 2 & -1 & -1 & 2 & -2 & 0 & 0 & -1 & 1 \end{pmatrix},$$

$$\bar f = 10^{-2}\,\bigl[\,1.491803633709836,\ 3.0717213019723066,\ 5.246230264266409,\ -6.718373452055033,\ 3.969549763760797,\ 7.502845410079123,\ 5.622108089244097,\ -1.9585631018739558,\ -2.729844702016424,\ 8.26721052052138\,\bigr],$$

$$B = I_{19 \times 19} \otimes \bar B, \qquad f = e_{19} \otimes \bar f,$$

where $e_{19} = (1, \ldots, 1) \in \mathbb{R}^{19}$. This problem has exactly $2^{19}$ known local minimum points and the

unique global minimum $Q(\sigma^{(1)})$ is located at

$$\sigma^{(1)} = (\,6\ \ -4\ \ 12\ \ \cdots\ \ 6\ \ -4\ \ 12\,).$$

All the local minima are within 0.5% of the global minimum $Q(\sigma^{(1)}) = -1866.01$.
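A minimal NumPy sketch of this dual objective, assembled directly from the construction above:

```python
import numpy as np

# 3 x 10 building block and 10-vector from the construction above
B_bar = np.array([[1, -1,  0, -1,  2,  0,  1, -2,  1, 1],
                  [1, -1,  1, -1, -1,  0, -2,  2,  0, 1],
                  [2,  2, -1, -1,  2, -2,  0,  0, -1, 1]], dtype=float)
f_bar = 1e-2 * np.array([1.491803633709836, 3.0717213019723066,
                         5.246230264266409, -6.718373452055033,
                         3.969549763760797, 7.502845410079123,
                         5.622108089244097, -1.9585631018739558,
                         -2.729844702016424, 8.26721052052138])

B = np.kron(np.eye(19), B_bar)     # 57 x 190
f = np.kron(np.ones(19), f_bar)    # length 190

def Q(sigma):
    """Nonconvex nonsmooth dual objective Q(sigma) for sigma in R^57."""
    return 0.5 * sigma @ sigma - np.abs(f + B.T @ sigma).sum()

sigma_star = np.tile([6.0, -4.0, 12.0], 19)   # reported global minimizer
print(Q(sigma_star))                          # approximately -1866.01
```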

4 Wave Annihilation Problem

The wave annihilation problem studied here was first presented in [14]. That study developed

a method for producing a coating of total thickness T distributed evenly in n layers of varying

properties between two media to eliminate the reflection of waves over a band of frequencies [Ω0,

Ω1] in one of those media. Reflections are eliminated entirely at n specific frequencies and reduced

significantly for other frequencies within this band; as n approaches infinity, reflections within this

band are eliminated entirely. This process is treated as an acoustic application in [14], but as is

pointed out, it can easily be adapted to electromagnetism or any other phenomena governed by

variants of the linear wave equation.


Given n, a crucial component to this process is to determine the n specific coatings such that

the reflection r = 0 at frequency

$$\omega_k = \Omega_0 + \Bigl(\frac{k-1}{n-1}\Bigr)(\Omega_1 - \Omega_0)$$

for k = 1, 2, . . . , n, where the complex-valued

$$r(n, \gamma, \kappa, \omega_k) = \frac{(\Gamma_-,\ \gamma_1)\,\prod_{j=1}^{n} A_j \begin{pmatrix} -1\\ 1 \end{pmatrix}}{(\Gamma_-,\ \gamma_1)\,\prod_{j=1}^{n} A_j \begin{pmatrix} 1\\ 1 \end{pmatrix}},$$

$$A_j = \begin{pmatrix} \gamma_j e_j^+ & \gamma_{j+1} e_j^- \\ \gamma_j e_j^- & \gamma_{j+1} e_j^+ \end{pmatrix}, \qquad \gamma_{n+1} = \Gamma_+, \qquad e_j^{\pm} = \exp\Bigl(\frac{2\gamma_j\,\Delta x\,\omega_k\,i}{\kappa_j}\Bigr) \pm 1,$$

i =√−1, Γ+ and Γ− are the impedances of the half-spaces surrounding the coating, ∆x = T/n,

and γj and κj are the impedance and stiffness of layer j, respectively. Note that unlike the other

problems studied here, r is differentiable and the nonlinear system of equations can be solved using

a variation of Newton's method [14]. An objective function $f = r^* r$ may be formed by observing

that the inner product of the complex vector $\bigl(r(\omega_1), \ldots, r(\omega_n)\bigr)$ with itself yields a scalar real value

with a known minimum of zero where the original vector is zero. By choosing n = 28, a problem of

with a known minimum of zero where the original vector is zero. By choosing n = 28, a problem of

56 real variables is constructed that can be studied using the algorithms presented here, with both

the impedance and the stiffness of the n layers being used as arguments to r, while the frequencies

ωk are determined directly from n. For our purposes, Γ+ is chosen to be 1 and Γ− is chosen to be

28.14776 (the ratio between the two is the same as the ratio between the reflectivity of water and

steel), T = 1 m, and Ω0 = 0.09091 Hz while Ω1 = 10Ω0.

5 DIRECT

The DIRECT (DIviding RECTangles) algorithm by D.R. Jones [22] is a deterministic global

optimization algorithm. The DIRECT algorithm does not require the computation of the gradient

of the objective function, or even objective function values (ranking information is sufficient). It

performs Lipschitzian optimization, but does not require knowledge of the Lipschitz constant.

The DIRECT algorithm works as follows [18]. The algorithm begins with an initial box

normalized to the unit hypercube. The objective function F (assumed to satisfy a Lipschitz

condition) is evaluated at the center of this box. The current minimum value is initialized to this

value. An evaluation counter m and an iteration counter t are initialized to m = 1 and t = 0. The

following process is repeated until m or t reaches some prespecified limit.

Selection. Identify the set S of “potentially optimal” boxes. Here “potentially optimal”

means that (1) for some Lipschitz constant K, the box potentially contains a point with smaller

objective function value than any other box, and (2) $F(c) - K \cdot L/2 \le f_{\min} - \epsilon\,|f_{\min}|$, where F

is the objective function, c is the center point of the box, K is the same Lipschitz constant, L is

the box diameter, $f_{\min}$ is the current minimum value for the objective function, and $\epsilon$ is a small,

nonnegative, fixed constant.


Sampling. Select one of the potentially optimal boxes B from S. For box B, identify the set

I of dimensions with maximum side length L and let $\delta = \tfrac{1}{3}L$. Sample the function at all points of

the form c ± δei for each i ∈ I, where c is the center of the rectangle and ei is the ith standard

basis vector. Update m.

Division. Divide the box containing the point c into thirds along the dimensions in I,

beginning with the dimension with the least value of $\min\{F(c + \delta e_i),\, F(c - \delta e_i)\}$, and ending with

the dimension with the greatest such value. Update the minimum value.

Iteration. Remove the box B from the set of potentially optimal boxes S. If S = ∅, then

increment t and go to Selection. Otherwise, go to Sampling.

Figure 3. An example of a box scatter plot.

The method of choosing the sub-box according to both objective function value and box

size gives DIRECT its local and global aspects. The DIRECT algorithm performs a convex hull

computation to determine potentially optimal boxes without using the Lipschitz constant directly

(see Figure 3 for an illustration). From Figure 3, it is clear that the convex hull consists of boxes

with objective function values that are minimal amongst all boxes of the same size (notice that

the set of boxes of the same size forms a “box column”). Since every box is ultimately examined,

DIRECT will not get stuck at a local optimum, but will instead perform a global search of the

feasible set. Further details can be found in [22].
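To make the Selection step concrete, the following Python sketch (an illustrative simplification, not the pVTdirect code) keeps the minimal-value box within each box column and then retains the boxes on the lower convex hull of the (diameter, value) scatter; the $\epsilon$ test against $f_{\min}$ is omitted for brevity.

```python
def potentially_optimal(boxes):
    """boxes: list of (diameter, value) pairs.
    Return indices of boxes on the lower convex hull of the box scatter
    plot, i.e. boxes minimizing value for some Lipschitz constant K > 0."""
    best = {}                                   # best box per box column
    for i, (d, v) in enumerate(boxes):
        if d not in best or v < boxes[best[d]][1]:
            best[d] = i
    cand = sorted(best.values(), key=lambda i: boxes[i][0])
    hull = []
    for i in cand:                              # monotone-chain lower hull
        while len(hull) >= 2:
            (d1, v1), (d2, v2) = boxes[hull[-2]], boxes[hull[-1]]
            d3, v3 = boxes[i]
            if (v2 - v1) * (d3 - d1) >= (v3 - v1) * (d2 - d1):
                hull.pop()                      # hull[-1] lies on or above the chord
            else:
                break
        hull.append(i)
    return hull
```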

A parallel implementation of the DIRECT algorithm, pVTdirect, developed at Virginia Tech

[18], is used here. The pVTdirect implementation contains some modifications to the original

DIRECT algorithm in order to meet the needs of various applications and improve the performance

on large scale parallel systems. First, an optional domain decomposition step is added to create

multiple subdomains, each with a starting point for a DIRECT search. Empirical results have

shown that this approach significantly improves load balancing among a large number of processors,

and likely shortens the optimization process for problems with asymmetric or irregular structures.

The Selection step has two new features. The first is an “aggressive” switch that originally

appeared in Watson et al. [38]. The switch generates more function evaluation tasks that may

help balance the workload under the parallel environment. The second is an adjustable $\epsilon$, which is

the minimum improvement required to update the current minimum objective function value. In


general, smaller $\epsilon$ values make the search more local and generate more function evaluation tasks.

On the other hand, larger $\epsilon$ values bias the search toward broader exploration, exhibiting slower

convergence. The value of $\epsilon$ is taken as zero by default, but can be specified by the user depending

on problem characteristics and optimization goals.

In the serial version of DIRECT, Sampling samples one box at a time to conserve memory.

In pVTdirect, more tasks are performed in parallel, so new points are sampled around all boxes in

S along their longest dimensions during Sampling. This obviates the need for the step Iteration

and simplifies the loop. Since box center function values may be identical or nearly so, the parallel

Sampling may yield a different box sequence in each box column (i.e., ordered column of

equal-sized boxes) as the parallel scheme varies. Thus boxes will be subdivided in a different order,

causing pVTdirect to become nondeterministic. To maintain its deterministic property, pVTdirect

performs lexicographical order comparison between box center coordinates, thereby keeping the

boxes in the same sampling sequence on the same platform. However, the computed values can

depend on the particular machine or compiler one uses, so the results for the same problem may

vary slightly from system to system.

Finally, pVTdirect allows more stopping conditions for the termination of the algorithm. First,

the minimum diameter variable MIN_DIA causes an exit when the diameter of the best box reaches

the value MIN_DIA. Second, the objective function convergence tolerance variable OBJ_CONV causes

an exit when the relative change in the optimum objective function value has reached the given

value. These new stopping conditions address a complaint by Jones et al. [22] that the original

stopping criterion for DIRECT was somewhat artificial and unconvincing for many real-world

optimization problems.

6 Simulated Annealing

Simulated annealing is a stochastic algorithm, generally well suited to hard problems in

biomechanics [19]. It begins with an initial feasible guess X0 and the current minimum is set to F(X0).

The algorithm then pseudorandomly generates points in a neighborhood of X0 until it generates a

feasible point X1. If F(X1) ≤ F(X0), then the current point is set to X1. If F(X1) − F(X0) = d > 0,

then the current point is set to X1 with probability $e^{-d/T}$, where T is the temperature. With

probability $1 - e^{-d/T}$, X1 is rejected and the algorithm continues to generate points around X0 until

a new feasible point is generated. This process is repeated until the distance between successive

iterates is less than some small, fixed value.

The temperature begins at some high value T and is continually lowered throughout the search

according to some temperature schedule. Since $e^{-d/T} \approx 1$ for T ≫ d, a relatively large number of

random moves will be made at the beginning of the search, when T is large. Thus the beginning of

the search is primarily global in nature. As T decreases throughout the search, $e^{-d/T}$ gets closer

to zero, and therefore the search becomes increasingly greedy, eventually performing similarly to

a gradient descent method.
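For concreteness, a minimal serial sketch of the acceptance rule and cooling schedule just described (SPAN's neighborhood shrinking and parallel gather are omitted; the loop counts are illustrative):

```python
import math
import random

def anneal(f, x0, neighbor, T=5000.0, cooling=0.85,
           n_temps=100, moves_per_temp=1000):
    """Serial simulated annealing with Metropolis acceptance.

    f: objective function; neighbor(x): random feasible point near x.
    T = 5000 and T_next = 0.85 * T follow the settings quoted below."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    for _ in range(n_temps):
        for _ in range(moves_per_temp):
            y = neighbor(x)
            fy = f(y)
            d = fy - fx
            # accept downhill moves always, uphill moves with prob e^{-d/T}
            if d <= 0 or random.random() < math.exp(-d / T):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        T *= cooling   # temperature schedule
    return best, fbest
```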

Simulated Parallel Annealing within a Neighborhood (SPAN) [19], one of the implementations

used here, was developed for parallel computation. A straightforward serial simulated annealing

algorithm consists of three nested loops: a temperature reduction loop that causes the temperature

to gradually decline as the algorithm proceeds, an inner search radius loop that causes the

neighborhood to gradually shrink while maintaining the same temperature, and an innermost control


loop that handles the perturbation of the variables to find the minimum point. The SPAN

implementation parallelizes the independent search radius loop, so that all processors being utilized

have all the information required to do a function evaluation (the current X and the radius of

the search). The processors then communicate their results in a global communication (gather)

before the search radius is adjusted so that roughly half of all the generated points in the previous

neighborhood are acceptable in the new neighborhood. This communication overhead scales

linearly with the number of processors involved, causing a notable degradation of performance on a

large number of processors, especially with fast function evaluations [19]. For comparison’s sake, a

more traditional naive parallel implementation of simulated annealing (with random X0) was also

constructed, using the same underlying algorithm [13] with only one gather operation at the end

of the optimization to collect all results. Both of the implementations used an initial temperature

of T = 5000 and a cooling schedule of $T_{\text{next}} = 0.85\,T_{\text{current}}$ for all three problems.

7 SPSA

Spall's simultaneous perturbation stochastic approximation (SPSA) algorithm is, like

simulated annealing, a stochastic global optimization algorithm [35]. SPSA is similar to the standard

finite difference stochastic approximation (FDSA) algorithm [24] with one primary difference. The

general form of both SPSA and the FDSA algorithm is $X_{k+1} = X_k - a_k \cdot g(X_k)$, where $X_k$ is the

variable vector, $a_k$ is the kth element of the gain sequence a, and $g(X_k)$ is meant to approximate

the gradient of the objective function at $X_k$. Whereas the FDSA algorithm perturbs the

components of the vector X in the objective function F(X) one at a time, SPSA performs a simultaneous

perturbation of each component of X. This might appear to reduce the ability of the algorithm

to search the problem space effectively when compared to component-by-component perturbation,

but Spall [33] claims that “one properly chosen simultaneous random change in all the variables in

a problem provides as much information for optimization as a full set of one-at-a-time changes of

each variable”.

As stated above, the main difference between SPSA and the FDSA algorithm is the method of

perturbation of the components of X. The FDSA algorithm explores in each dimension around the

point Xk, seeking the steepest descent (negative gradient) direction. Formally, the ith component

of the (two-sided) gradient approximation for FDSA is computed as

$$g_i(X_k) = \frac{\hat F(X_k + c_k e_i) - \hat F(X_k - c_k e_i)}{2 c_k},$$

where $\hat F(X) = F(X) + \text{noise}$, $c_k$ is the kth element of a sequence c that converges monotonically

to zero slowly as k → ∞, and ei is the ith standard basis vector. For an n-dimensional X, 2n

objective function evaluations per iteration are required.

SPSA generates a vector v using a Monte Carlo method [34], evaluates the objective function

at two points Xk + v and Xk − v at each iteration to approximate the gradient at Xk, and

then adjusts Xk based on the resulting estimation of the gradient. Consequently, SPSA uses two

objective function evaluations at each iteration. Formally,

$$g_i(X_k) = \frac{\hat F(X_k + c_k \Delta_k) - \hat F(X_k - c_k \Delta_k)}{2 c_k \Delta_{ki}},$$


where $\Delta_k = (\Delta_{k1}, \Delta_{k2}, \ldots, \Delta_{kn})$ is the user-specified random perturbation vector and $v = c_k \Delta_k$.

The distribution of $\Delta_k$ must satisfy certain conditions in order to guarantee convergence; in

particular, each component of $\Delta_k$ must be nonzero [33].
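A minimal sketch of one SPSA iteration using a Rademacher (±1) perturbation, which satisfies the nonzero-component condition; the gain-sequence form and constants follow Spall's commonly cited guidelines and are illustrative, not the tuned values used in these experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_step(f, x, k, a=0.1, c=0.1, alpha=0.602, gamma=0.101, A=100.0):
    """One SPSA update X_{k+1} = X_k - a_k * g(X_k)."""
    ak = a / (k + 1 + A) ** alpha                   # gain sequence a_k
    ck = c / (k + 1) ** gamma                       # perturbation size c_k -> 0
    delta = rng.choice([-1.0, 1.0], size=x.size)    # Rademacher perturbation
    # two evaluations give the whole gradient estimate (elementwise divide)
    g = (f(x + ck * delta) - f(x - ck * delta)) / (2.0 * ck * delta)
    return x - ak * g
```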

In the implementation of SPSA used for the biomechanics problem, an adaptation called

blocking is employed that requires an extra function evaluation at each iteration. The idea is to

“block” updates to Xk if the update will produce a new objective function value that is significantly

worse than the current objective function value. This requires that the objective function be

evaluated at Xk, as well as at Xk + v and Xk − v. The extra function evaluation at each iteration

increases the total number of evaluations by 33%, but can result in faster convergence of the

algorithm. However, this technique can also reduce the likelihood that the algorithm will achieve

a global minimum [34]. Projection is employed to prevent the algorithm from moving outside the

feasible set.

When attempting to solve the nonconvex quadratic minimization problem dual and the wave

annihilation problem, blocking is not used and instead an adaptation called injected noise is

employed that simulates a random element in the objective function, with the intent of inducing

the algorithm to abandon local minima. While the resulting implementation may take longer to

converge to a minimum, the likelihood of global convergence is increased [26]. This adaptation is

not necessary in the biomechanics problem because the noisiness inherent in the objective function

fills the same role.

8 KNITRO

The KNITRO optimization package contains three optimization algorithms, but only one of

them is utilized here, the direct interior point method [4]. Since the problems here are unconstrained

except for upper and lower bound constraints, the sequential quadratic programming (SQP) method

used by the KNITRO solver should be very efficient. However, the interior point method, like all

the methods in the KNITRO package, assumes that the objective function is twice continuously

differentiable. This is not the case for either the biomechanics problem or the nonconvex quadratic

minimization problem dual, so a central difference method, included in the package, is invoked to

supply second derivative input values; as a result, the efficacy of the direct interior point method

suffers.

It is important to note that the direct interior point method employed by KNITRO, while

very efficient at solving general constrained nonlinear optimization problems, loses some efficiency

compared with other optimization algorithms for problems with only bound constraints [4].

Nevertheless, as a widely used commercial gradient-dependent optimization package, KNITRO represents

a class of local optimization algorithms that apply to these difficult nonlinear problems.

The interior point direct algorithm seeks to find Karush-Kuhn-Tucker (KKT) points using

sequential quadratic programming and trust region methods. As with all nonlinear optimization

algorithms, the goal is to solve problems of the form

$$\min_x\ f(x) \qquad \text{subject to} \quad c_E(x) = 0, \quad c_I(x) \ge 0,$$


where here $f : \mathbb{R}^n \to \mathbb{R}$. The interior point direct algorithm first defines the barrier subproblem

$$\min_{x,s}\ f(x) - \mu \sum_{i=1}^{m} \ln s_i \qquad \text{subject to} \quad c_E(x) = 0, \quad c_I(x) - s = 0,$$

where $c_E : \mathbb{R}^n \to \mathbb{R}^l$, $c_I : \mathbb{R}^n \to \mathbb{R}^m$, the barrier parameter $\mu > 0$, and the vector of slack

variables $s \in \mathbb{R}^m$ is positive.

Following [3], the KKT conditions for the above system can be written as

$$\begin{aligned} \nabla f(x) - A_E^T(x)\,y - A_I^T(x)\,z &= 0\\ -\mu e + S z &= 0\\ c_E(x) &= 0\\ c_I(x) - s &= 0, \end{aligned}$$

where e = (1, . . . , 1)T, S = diag(s1, . . . , sm), AE and AI are the Jacobian matrices corresponding

to the equality and inequality constraint vectors respectively, and y and z represent vectors of

Lagrange multipliers. The line search algorithm, used here, applies Newton’s method to the above

system, backtracking if necessary to ensure that s, z > 0, and ensuring that the merit function

$\psi(x, s) = f(x) - \mu \sum_{i=1}^{m} \ln s_i$ is sufficiently reduced.

Applying Newton's method in the variables x, s, y, z gives

$$\begin{pmatrix} \nabla^2_{xx}\mathcal{L} & 0 & -A_E^T(x) & -A_I^T(x)\\ 0 & Z & 0 & S\\ A_E(x) & 0 & 0 & 0\\ A_I(x) & -I & 0 & 0 \end{pmatrix} \begin{pmatrix} d_x\\ d_s\\ d_y\\ d_z \end{pmatrix} = - \begin{pmatrix} \nabla f(x) - A_E^T(x)\,y - A_I^T(x)\,z\\ S z - \mu e\\ c_E(x)\\ c_I(x) - s \end{pmatrix},$$

where $\mathcal{L}(x, s, y, z) = f(x) - \mu \sum_{i=1}^{m} \ln s_i - y^T c_E(x) - z^T\bigl(c_I(x) - s\bigr)$ is the Lagrangian of the

above problem and $Z = \mathrm{diag}(z_1, \ldots, z_m)$.

If the inertia of the above matrix is (n +m, l +m, 0), then the step d determined above is a

descent direction for the merit function ψ. If so, the scalars

$$\alpha_s^{\max} = \max\{\alpha \in (0, 1] : s + \alpha d_s \ge (1 - \tau)s\}, \qquad \alpha_z^{\max} = \max\{\alpha \in (0, 1] : z + \alpha d_z \ge (1 - \tau)z\}$$

are computed with τ = 0.995. If $\min\{\alpha_s^{\max}, \alpha_z^{\max}\}$ is not too small, a backtracking line search is

used to compute the steplengths $\alpha_s \in (0, \alpha_s^{\max}]$, $\alpha_z \in (0, \alpha_z^{\max}]$ that provide a sufficient decrease

in ψ. The next iterate, with a reduced barrier parameter, is then computed with

$$x^+ = x + \alpha_s d_x, \quad s^+ = s + \alpha_s d_s, \qquad y^+ = y + \alpha_z d_y, \quad z^+ = z + \alpha_z d_z.$$
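A small sketch of the fraction-to-boundary computation of $\alpha^{\max}$ above (illustrative, not KNITRO's implementation):

```python
import numpy as np

def max_step(v, dv, tau=0.995):
    """Largest alpha in (0, 1] with v + alpha*dv >= (1 - tau)*v, for v > 0."""
    # only components moving toward the boundary (dv < 0) limit the step
    mask = dv < 0
    if not mask.any():
        return 1.0
    return min(1.0, float(np.min(-tau * v[mask] / dv[mask])))
```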


However, if the inertia of the matrix is not of the desired form or the steplengths αs or αz are

too small, a trust region method is employed to compute the current step d. This has the benefit

of guaranteeing a successful step even in the presence of negative curvature or singularity.

The trust region method employed, which is also the standard step of the Interior/CG

KNITRO algorithm, takes the following form. First, the normal step $v = (v_x, v_s)$ is computed by

solving the subproblem

$$\min_v\ \|A_E v_x + c_E\|_2^2 + \|A_I v_x - v_s + c_I - s\|_2^2 \tag{8.1}$$

$$\text{subject to}\quad \|(v_x, S^{-1} v_s)\|_2 \le 0.8\,\Delta. \tag{8.2}$$

This subproblem is solved using a dogleg approach that minimizes (8.1) along a piecewise

linear path composed of a steepest descent step in the norm used in (8.2) and a Newton step with

respect to the same norm. The scaling $S^{-1} v_s$ in the norm tends to limit the extent to which the

bounds on the slack variable are violated.

Once the normal step v = (vx, vs) is computed, the tangential subproblem is then defined as

$$\min_{d_x, d_s}\ \nabla f^T d_x - \mu e^T S^{-1} d_s + \tfrac12 \bigl( d_x^T \nabla^2_{xx}\mathcal{L}\, d_x + d_s^T S^{-1} Z\, d_s \bigr) \tag{8.3}$$

$$\text{subject to}\quad A_E d_x = A_E v_x \tag{8.4}$$

$$\phantom{\text{subject to}}\quad\ A_I d_x - d_s = A_I v_x - v_s \tag{8.5}$$

$$\phantom{\text{subject to}}\quad\ \|(d_x, S^{-1} d_s)\|_2 \le \Delta. \tag{8.6}$$

To find the approximate tangential solution d, first the scaling $d_s \leftarrow S^{-1} d_s$ is applied to

convert (8.6) into a sphere, and then a standard projected conjugate gradient method is applied to

this transformed quadratic program, iterating in the linear manifold defined by (8.4)-(8.5). Finally,

the step d is truncated if necessary to preserve s > 0.

9 QNSTOP

QNSTOP is a class of quasi-Newton methods for stochastic optimization. The implementation

considered features several variations specific to global, deterministic optimization. In iteration k,

QNSTOP methods compute the gradient vector $g_k$ and Hessian matrix $H_k$ of a quadratic model

$$m_k(X - X_k) = f_k + g_k^T (X - X_k) + \tfrac12 (X - X_k)^T H_k (X - X_k), \tag{9.1}$$

of the objective function f centered at $X_k$, where $f_k$ is generally not $f(X_k)$.

In the unconstrained context, QNSTOP methods progress by

$$X_{k+1} = X_k - \bigl[ H_k + \mu_k W_k \bigr]^{-1} g_k, \tag{9.2}$$

where $\mu_k$ is the Lagrange multiplier of a trust region subproblem and $W_k$ is a scaling matrix. In

these cases, where the feasible set Θ is a convex subset of $\mathbb{R}^p$, consider an algorithm of the form

$$X_{k+1} = \Bigl( X_k - \bigl[ H_k + \mu_k W_k \bigr]^{-1} g_k \Bigr)_\Theta,$$


where (·)Θ denotes projection onto Θ.

9.1 Estimating the Gradient

Following a response surface methodology approach, QNSTOP designs regression experiments

in a region of interest containing the current iterate. QNSTOP uses an ellipsoidal design region

centered at the current iterate $X_k \in \mathbb{R}^p$. Let

$$\mathcal{W}_\gamma = \bigl\{ W \in \mathbb{R}^{p \times p} : W = W^T,\ \det(W) = 1,\ \gamma^{-1} I_p \preceq W \preceq \gamma I_p \bigr\}$$

for some γ ≥ 1, where $I_p$ is the p × p identity matrix. Here γ is fixed at 20. The elements of

the set $\mathcal{W}_\gamma$ are valid scaling matrices that control the shape of the ellipsoidal design regions with

eccentricity constrained by γ. Let the ellipsoidal design regions be

$$E_k(\tau_k) = \bigl\{ X \in \mathbb{R}^p : (X - X_k)^T W_k (X - X_k) \le \tau_k^2 \bigr\},$$

where $W_k \in \mathcal{W}_\gamma$. For this implementation $\tau_k = \tau > 0$ is fixed at τ = 1 for each iteration.

In each iteration, QNSTOP methods choose a set of $N_k$ design sites $\{X_{k1}, \ldots, X_{kN_k}\} \subset E_k(\tau_k) \cap \Theta$.

In this implementation $N = N_k$ is fixed for each k = 1, 2, . . ., and $X_{k1}, \ldots, X_{kN} \in E_k(\tau_k) \cap \Theta$ are uniformly sampled in each iteration.

Let $Y_k = (y_{k1}, \ldots, y_{kN})^T$ denote the N-vector of responses, where $y_{ki} = F(X_{ki}) + \text{noise}$. The

response surface is modeled by the linear model $y_{ki} = f_k + X_{ki}^T g_k + \epsilon_{ki}$, where $\epsilon_{ki}$ accounts for

lack of fit. Let $\bar X_k = N^{-1} \sum_{i=1}^{N} X_{ki}$. The least squares estimate of the gradient $g_k$, ignoring the

estimate for $f_k$, is obtained by observing the responses and solving

$$\bigl( D_k^T D_k \bigr) g_k = D_k^T Y_k \tag{9.3}$$

where

$$D_k = \begin{pmatrix} (X_{k1} - \bar X_k)^T \\ \vdots \\ (X_{kN} - \bar X_k)^T \end{pmatrix}.$$
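A minimal NumPy sketch of the regression gradient estimate (9.3); the helper name is illustrative:

```python
import numpy as np

def gradient_estimate(X_design, y):
    """Least squares gradient of a linear response model (eq. 9.3).
    X_design: N x p matrix of design sites; y: N responses."""
    D = X_design - X_design.mean(axis=0)   # rows (X_ki - Xbar_k)^T
    # solve (D^T D) g = D^T y; lstsq also handles rank deficiency
    g, *_ = np.linalg.lstsq(D, y, rcond=None)
    return g
```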

9.2 Updating the Model Hessian Matrix

In the stochastic context, QNSTOP methods constrain the Hessian matrix update to satisfy

$$-\eta I_p \preceq H_k - H_{k-1} \preceq \eta I_p \tag{9.4}$$

for some η ≥ 0. Conceptually, this prevents the quadratic model from changing drastically from

one iteration to the next. In [6], a variation of the SR1 (symmetric, rank one) update that satisfies

this constraint is proposed. However, this constraint is simply relaxed in the deterministic case

and the BFGS update is used, i.e., with the Hessian matrix updated according to

$$H_k = H_{k-1} - \frac{H_{k-1} s_k s_k^T H_{k-1}}{s_k^T H_{k-1} s_k} + \frac{\nu_k \nu_k^T}{\nu_k^T s_k},$$

where

$$s_k = X_k - X_{k-1}, \qquad \nu_k = g_k - g_{k-1}.$$
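A minimal sketch of this BFGS update, with a standard curvature guard added as an assumption (the update is skipped when a denominator is not safely positive):

```python
import numpy as np

def bfgs_update(H, s, nu, tol=1e-12):
    """BFGS update of the model Hessian H given step s and gradient change nu."""
    Hs = H @ s
    sHs = s @ Hs
    curv = nu @ s
    if sHs <= tol or curv <= tol:   # guard: keep H if the update is undefined
        return H
    return H - np.outer(Hs, Hs) / sHs + np.outer(nu, nu) / curv
```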

9.3 Step Length Control

QNSTOP methods utilize an ellipsoidal trust region concentric with the design region for

controlling step length. Typically, in the stochastic case, the volume of the ellipsoid is adjusted

from iteration to iteration. Here, the volume of the ellipsoid (controlled by some ρ > 0) is fixed

with ρ = 1, and the following optimization problem is solved:

$$\min_{X \in E_k(\rho)}\ g_k^T (X - X_k) + \tfrac12 (X - X_k)^T H_k (X - X_k). \tag{9.5}$$

The solution to (9.5) is on the arc

$$X(\mu) = X_k - \bigl[ H_k + \mu W_k \bigr]^{-1} g_k. \tag{9.6}$$

It remains to estimate $\mu_k$ such that $X(\mu_k)$ solves (9.5). Using [9, Lemma 6.4.1] and a little

manipulation, it can be established that there is a unique $\mu_k \ge 0$ such that $\|X(\mu_k) - X_k\|_{W_k} = \rho$,

unless $\|X(0) - X_k\|_{W_k} \le \rho$, in which case $\mu_k = 0$. Estimating $\mu_k$ is difficult, but well understood.

Chapter 7 in [8] is a comprehensive treatment. In particular, Algorithm 7.3.6 in [8] is robust and

easily implemented.
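For illustration, a simple safeguarded bisection for $\mu_k$, assuming $H_k + \mu W_k$ is positive definite on the bracket; the cited Algorithm 7.3.6 of [8] is a more sophisticated and robust iteration.

```python
import numpy as np

def step_norm(mu, H, W, g):
    """||X(mu) - X_k||_W for the arc X(mu) = X_k - (H + mu*W)^{-1} g."""
    d = np.linalg.solve(H + mu * W, g)
    return float(np.sqrt(d @ W @ d))

def find_mu(H, W, g, rho, mu_hi=1.0, iters=60):
    """Smallest mu >= 0 with step norm <= rho, by bisection."""
    if step_norm(0.0, H, W, g) <= rho:
        return 0.0                        # unconstrained Newton step fits
    while step_norm(mu_hi, H, W, g) > rho:
        mu_hi *= 2.0                      # grow bracket until the step fits
    mu_lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (mu_lo + mu_hi)
        if step_norm(mid, H, W, g) > rho:
            mu_lo = mid
        else:
            mu_hi = mid
    return mu_hi
```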

9.4 Updating the Experimental Design Region

The QNSTOP approach to constructing the ellipsoidal design regions is considered here.

Reference [37] considers confidence regions for the constrained minimizer of a quadratic model fit by regression.

An early suggestion for the QNSTOP approach was a convenient ellipsoidal approximation of the

confidence set for the minimizer of a quadratic subject to a trust region constraint.

However, if a linear model is fit by least squares and the model Hessian matrix is updated by a

secant update then a different approach is warranted. This implementation uses an approximation

derived in [6]. First, the approximation for the covariance matrix of $\nabla m_k(X_{k+1} - X_k)$,

$$V_k = 4\sigma^2 \bigl( D_k^T D_k \bigr)^{-1}, \tag{9.7}$$

is computed, where $\sigma^2$ is the ordinary least squares estimate of the variance. Then

$$E_{k+1}(\chi_{p,1-\alpha}) = \bigl\{ X \in \mathbb{R}^p : (X - X_{k+1})^T W_{k+1} (X - X_{k+1}) \le \chi^2_{p,1-\alpha} \bigr\}$$

is an ellipsoidal approximation of the $1 - \alpha$ percentile confidence set for the minimizer, where

$$W_{k+1} = \bigl( H_k + \mu_k W_k \bigr)^T V_k^{-1} \bigl( H_k + \mu_k W_k \bigr).$$

Strictly using the updates for $W_{k+1}$ above can lead to degenerate ellipsoids. To obtain useful design

ellipsoids the constraints $\gamma^{-1} I_p \preceq W_{k+1} \preceq \gamma I_p$ and $\det(W_{k+1}) = 1$ are enforced by modifying the

eigenvalues; hence, the definition of $\mathcal{W}_\gamma \ni W_{k+1}$.


9.5 Algorithm Overview

The QNSTOP implementation used in this paper is summarized in Algorithm 1. Each run

of the algorithm in the experiments was terminated when a budget of function evaluations B had

been exhausted.

Algorithm 1. QNSTOP-GLOBAL

Step 0 (initialization): Fix τ = 1, ρ = 1, N = 100, and γ = 20. Fix scaling matrix $W_0 = I_p$ and model Hessian matrix $H_0 = I_p$. Choose an initial iterate $X_0$ and set k = 0.

Step 1 (regression experiment): Uniformly sample $\{X_{k1}, \ldots, X_{kN}\} \subset E_k(\tau) \cap \Theta$. Observe the response vector $Y_k = (y_{k1}, \ldots, y_{kN})^T$. Compute $g_k$ by solving (9.3).

Step 2 (secant update): If k > 0, compute the model Hessian matrix $H_k$ using BFGS.

Step 3 (update iterate): Compute $\mu_k$ using the method described in Section 9.3, solve $[H_k + \mu_k W_k]\, s_k = -g_k$, and compute $X_{k+1} = X_k + s_k$.

Step 4 (update subsequent design ellipsoid): Compute $W_{k+1} \in \mathcal{W}_\gamma$ using the approach described in Section 9.4.

Step 5: If (k + 2)N < B, then increment k by 1 and go to Step 1. Otherwise, the algorithm terminates.
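A compact driver tying the earlier sketches together (it reuses gradient_estimate, bfgs_update, and find_mu from the sketches above, keeps W fixed rather than performing the Step 4 update, and is an illustrative serial simplification of Algorithm 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_ellipsoid(center, W, tau, N):
    """Uniform sample of N points from (X - c)^T W (X - c) <= tau^2."""
    p = center.size
    u = rng.normal(size=(N, p))
    u /= np.linalg.norm(u, axis=1, keepdims=True)        # random directions
    r = tau * rng.uniform(size=(N, 1)) ** (1.0 / p)      # radii for uniformity
    L = np.linalg.cholesky(np.linalg.inv(W))             # maps ball to ellipsoid
    return center + (r * u) @ L.T

def qnstop(f, x0, budget, N=100, tau=1.0, rho=1.0):
    p = len(x0)
    x, H, W = np.asarray(x0, float), np.eye(p), np.eye(p)
    g_prev, k = None, 0
    while (k + 2) * N < budget:
        X = sample_ellipsoid(x, W, tau, N)               # Step 1
        y = np.array([f(xi) for xi in X])
        g = gradient_estimate(X, y)
        if g_prev is not None:                           # Step 2
            H = bfgs_update(H, s, g - g_prev)
        mu = find_mu(H, W, g, rho)                       # Step 3
        s = np.linalg.solve(H + mu * W, -g)
        x = x + s
        g_prev = g
        k += 1                                           # (Step 4 omitted: W fixed)
    return x
```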

Figure 4 shows a typical progression of QNSTOP over 20 iterations. The solid line represents

the lowest value found among 200 design sites for that iteration, while the dotted line represents

the corresponding minimum found by the minimizer of the quadratic model. Note that while at

times the model will give an imperfect minimum, the overall downward trend is significant.

Figure 4. A typical QNSTOP progression.

10 Experimental Results and Discussion

Each of the optimization algorithms here is run 50 times using starting points selected from

a Latin hypercube design based on the calculated bounds for each of the three problems. For

the biomechanics problem, the bounds on each variable run from −1 to 1. While the nonconvex

quadratic minimization problem dual is unconstrained and the reflection annihilation problem

has nonnegative variables, both have bounds that can be calculated a priori that allow the Latin

hypercube design to be constructed. For the nonconvex quadratic minimization problem dual, the


bounds on each variable are ±41.569. For the reflection annihilation problem, the lower and upper

bounds are vectors of 0 and 40, respectively.

DIRECT ($\epsilon = 10^{-4}$), given the constraints of the initial box, is run for each of the problems

centered on a point derived from a variation of the Latin hypercube design used for the starting

points of the other algorithms: the fifty starting points are all divided by 100 to allow as much of

the space to be explored as possible while still differentiating the starting points from each other.

The naive simulated annealing algorithm uses the Latin hypercube starting points for only one of

the eight processors run in parallel. SPAN similarly uses only eight processors per experiment,

since the utilization of more processors results in extreme communication overhead; the number of

processors for the naive implementation was limited to the same number of processors that SPAN

employed to directly see the advantage of one longer annealing versus the eight shorter ones. The

naive parallel implementation of SPSA uses the Latin hypercube starting points in only one of

the 640 processors used in parallel for each experiment. KNITRO 8.0 uses the Latin hypercube

starting point for only one of its randomly generated multistart points per experiment, and the

rest are generated by the built-in pseudorandom generation. Each of the fifty experiments was

run on only a single processor for KNITRO 8.0. In a slight difference from the other algorithms,

QNSTOP uses 10,000 samples for each of 100 Latin hypercube starting points per experiment,

forgoing the standard Latin hypercube starting point in order to provide a more global search

strategy. QNSTOP is currently implemented only as a serial algorithm, so going beyond 100

processors per experiment is not possible.

It must be noted that the SPAN implementation, while using the same algorithm as the naive

implementation, chooses its evaluation points based on a number of different seeds for the

pseudorandom number generator equal to the number of processors, to avoid duplicating values. The

naive implementation, which uses the same parameters as the SPAN implementation and

performs eight annealings (at 125,000 function evaluations each), therefore examines different points

than the single annealing performed over multiple processors for the SPAN implementation. In

fact, without changing the random seed input for the SPAN implementation, simply changing the

number of processors used can change the outcome based on the change to the local seeds. Of

course, given the identical algorithms, the tradeoff here is simply one of time for a single annealing

performed in parallel versus the time for multiple serial annealings performed simultaneously.

Table 1 displays the number of function evaluations used by each optimization algorithm as

it pursues the global minimum over each of the 50 experiments. Each optimization algorithm is

given a function evaluation budget of $10^6$ for each experiment and run until it reaches the function

evaluation budget or terminates according to the rules of the algorithm.

Tables 2–4 display, for each of the problems described, the minimum, maximum, first, second,

and third quartile objective function values for each of the algorithms over the fifty experiments,

along with the best known minimum for each of the problems and the method of discovery. Note

that the experiments that discovered the best known minimum for these problems are not part of

the set of experiments listed here, as they tend to have higher function evaluation limits to allow

for exhaustive searching.

The worst performer was the naive parallel SPSA, for several reasons. First, SPSA is

designed to be a local optimization algorithm, here used in three global optimization applications;

attempts to increase the number of starting points to compensate for this shortcoming were rather

unsuccessful in the face of the number of dimensions involved in each problem space. Second,

the approach taken by SPSA suffers in any case when a large number of variables is encountered,

causing it to be much slower to discover local minima. Third, the naive parallel implementation

Table 1. Average function evaluations per experiment for each problem.

            DIRECT     SPAN   SA (naive)     SPSA   QNSTOP   KNITRO
Biomech    1008497  1000000      1000000  1000000  1000000  1010492
Quad Dual  1000593  1000000      1000000  1000000  1000000  1075451
Wave       1001585  1000000      1000000  1000000  1000000  1000781

Table 2. Results for the biomechanics problem.
Best known minimum value: 1222.05 (QNSTOP/DIRECT).

            Minimum  1st quartile  2nd quartile  3rd quartile  Maximum
DIRECT         8501         24227         28594         34321    77567
SA (naive)    12606         14414         15677         16352    17723
SPAN           3295          4460          4833          5567    82297
SPSA          33447         46040         56503         61939    84676
KNITRO        19545         28180         30688         35390    40234
QNSTOP        10134         13359         14710         17185    45828

Table 3. Results for the nonconvex quadratic minimization problem dual.
Best known minimum value: −1866.01 (DIRECT).

             Minimum  1st quartile  2nd quartile  3rd quartile   Maximum
DIRECT      −1864.32      −1863.03      −1862.21      −1861.79  −1860.00
SA (naive)  −1146.16      −1110.06      −1095.66      −1084.64  −1030.77
SPAN        −1861.53      −1859.93      −1859.20      −1858.69  −1857.43
SPSA          253.30        604.44        688.00        759.65    893.00
KNITRO      −1864.74      −1827.05      −1825.67      −1808.06  −1609.60
QNSTOP      −1863.90      −1862.63      −1862.21      −1861.37  −1860.52

Table 4. Results for the wave annihilation problem.
Best known minimum value: 0 (DIRECT).

             Minimum      1st quartile  2nd quartile  3rd quartile  Maximum
DIRECT      8.19 × 10^-7  1.02 × 10^-3  5.76 × 10^-3  5.74 × 10^-2  2.7 × 10^-1
SA (naive)  26.87         27.26         27.36         27.53         27.76
SPAN         2.71          3.35         25.20         26.25         26.62
SPSA        12.94        523.35       2902.51       8031.26     206193.00
KNITRO      27.09         28.00         28.00         28.00         28.00
QNSTOP      26.64         27.10         27.19         27.30         27.48

Third, the naive parallel implementation utilized here simply divided the number of function evaluations by the number of processors; while this may have made the algorithm more likely to start close to a good minimum, the tradeoff was arguably not favorable in any instance except the wave annihilation problem, the problem most amenable to traditional derivative-based solution methods. Finally, the injected noise technique, while arguably useful for the result obtained on the wave annihilation problem, was certainly not helpful for the quadratic dual. SPSA's best performance was merely mediocre. Determining the best performer, however, requires a look at each problem individually.
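
Since SPSA figures so prominently in these comparisons, a minimal sketch of one SPSA iteration, including an injected-noise term of the kind mentioned above, may help; the gain values a_k and c_k and the noise scale are left as inputs, since the tuned values used in these experiments are not reproduced here.

    import numpy as np

    def spsa_step(f, x, a_k, c_k, rng, noise_scale=0.0):
        """One SPSA iteration: two function evaluations yield a full
        gradient estimate regardless of the problem dimension."""
        delta = rng.choice([-1.0, 1.0], size=x.shape)   # Rademacher perturbation
        g_hat = (f(x + c_k * delta) - f(x - c_k * delta)) / (2.0 * c_k * delta)
        x_new = x - a_k * g_hat                         # stochastic gradient descent step
        if noise_scale > 0.0:                           # injected noise for global search
            x_new += noise_scale * rng.standard_normal(x.shape)
        return x_new

Because each step costs only two evaluations, dividing a fixed evaluation budget across 640 independent SPSA runs, as the naive parallel implementation does, shortens every individual run rather than accelerating any one of them.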

Biomechanics problem. SPAN did very well on the biomechanics problem compared to the naive simulated annealing algorithm; in fact, the vast majority of the solutions found by SPAN beat even the best solutions found by the other algorithms on that problem. The impact of local searching on improving the minima for the biomechanics problem is notable, since the only difference between SPAN and the naive annealing is SPAN's longer local search time, during which the annealing is confined to a fairly small neighborhood and the temperature is quite "cool." DIRECT similarly benefitted from the transition to local searching inherent in its execution, producing a good best result and reasonable results in most of its experiments. The QNSTOP global strategy employed here, Latin hypercube sampling, is perhaps not the best strategy for showing the strengths of this algorithm on this problem; the best known minimum for this problem is the result of a previous QNSTOP experiment. Even so, the admittedly inferior global strategy utilized here yielded reasonable results compared to the multistart KNITRO, SPSA, and naive parallel simulated annealing strategies, and consistently beat the experiments performed by DIRECT. Note, however, that DIRECT found the best known value in one large run, and none of the other methods, in these or other experiments, ever found that value. The deterministic DIRECT is guaranteed to decrease the objective function monotonically with more work, whereas the nondeterministic methods (SA, SPAN, SPSA, QNSTOP) are merely likely to do so.
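
The role temperature plays in that transition can be seen in the textbook Metropolis acceptance rule, sketched below under generic assumptions (this is the standard rule, not SPAN's exact schedule): at a cool temperature, uphill moves are almost never accepted, so the annealing behaves like a local search over a small neighborhood.

    import math
    import numpy as np

    def accept(f_old, f_new, temperature, rng):
        """Metropolis criterion: always accept improvements; accept an
        uphill move with probability exp(-increase / temperature), which
        vanishes as the temperature cools."""
        if f_new <= f_old:
            return True
        return rng.random() < math.exp(-(f_new - f_old) / temperature)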

Quadratic dual problem. Recall that for the nonconvex quadratic minimization problem dual, all the local minima are within 0.5% of the global minimum −1866.01 (0.5% of 1866.01 is about 9.33, so every local minimum lies roughly in [−1866.01, −1856.68], consistent with the DIRECT, SPAN, and QNSTOP rows of Table 3). The layout of this problem is particularly devastating for SPSA, which has difficulty settling into the local minimum points, at which the function is nondifferentiable, resulting in SPSA's worst showing; the injected noise technique, employed in an attempt to escape the vast number of local minima, did not allow SPSA to settle into any minimum within the number of function evaluations allotted. DIRECT, QNSTOP, and KNITRO are nearly tied for the best results, although DIRECT and QNSTOP are much more consistent in their performance, as KNITRO also encounters difficulties with the nondifferentiability of the function at the local minimum points. Finally, the lack of local search in the naive parallel simulated annealing is revealed once again: the naive annealing experiments found the broad basins holding the local minimum points but failed to refine to a local minimum point. Again DIRECT, in a larger run, found the best known value (for this problem, the theoretical global minimum), and comments similar to those for the biomechanics problem apply here as well.

Wave annihilation problem. DIRECT consistently found the optimum solution for the wave annihilation problem, while the other algorithms (with the notable exception of SPSA) consistently found local minima near 28. SPSA benefitted the most on this problem from the "shotgun" approach utilized here, coming in third for the best result found, with the notable downside that its performance was otherwise abysmal. QNSTOP, KNITRO, and the naive simulated annealing all found local minima near 28, while SPAN found good minima in about half of its experiments and minima near 28 in the other half.

The following general conclusions are immediate. First, SPSA is entirely unfit for an optimization problem with a large number of dimensions and a large number of minima; this is not surprising, as this is not the purpose for which SPSA was developed. Second, any approach that relied on local information for gradient approximations, namely SPSA and KNITRO, had an unfortunate tendency to restrict its search prematurely, and thus lost significantly in global exploration compared to the other algorithms presented here, particularly on the biomechanics and wave annihilation problems. While the multistart strategy allowed for some automatic global searching, it was clearly not enough to overcome the difficulties inherent in these problems. Finally, the multistart strategy employed by QNSTOP needs refinement before any conclusions are drawn about the true value of that algorithm.

Some general conclusions are also in order about the two newest algorithms considered here: the massively parallel implementation pVTdirect of DIRECT, and the quasi-Newton stochastic algorithm QNSTOP. Because pVTdirect maintains a history of all samples, it makes more efficient use of samples than highly parallel independent-sampling stochastic algorithms do, and thus is likely to scale better with more processors. Deterministic algorithms like DIRECT may perform very well on noisy functions (like the biomechanics problem here), and local stochastic algorithms like QNSTOP may perform very well on global optimization problems (as here). In the context of ever increasing parallelism, higher dimensions, and global optimization, algorithms like the (deterministic) pVTdirect and the (stochastic) QNSTOP, and hybrids thereof, seem well worth pursuing.

Acknowledgements. This work was supported in part by AFOSR Grant FA9550-09-1-0153, AFRL Grant FA8650-09-2-3938, and the computing facilities at Indiana University and Virginia Polytechnic Institute & State University.

References

1. D. E. Anderson; M. L. Madigan; M. A. Nussbaum. 2007, “Maximum voluntary joint torque as a function of joint angle and angular velocity: model development and application to the lower limb”, Journal of Biomechanics, 40, no. 14, 3105–3113.

2. K. A. Bieryla. 2009, “An investigation of perturbation-based balance training as a fall prevention intervention for older adults”, Ph.D. thesis, Department of Mechanical Engineering, VPI & SU, Blacksburg, VA.

3. R. H. Byrd; J. Nocedal; R. A. Waltz. 2006, Large-Scale Nonlinear Optimization, Springer-Verlag, 35–59.

4. R. H. Byrd; M. E. Hribar; J. Nocedal. 1999, “An interior point method for large scale nonlinear programming”, SIAM Journal on Optimization, 9, no. 4, 877–900.

5. R. H. Byrd; J. C. Gilbert; J. Nocedal. 2000, “A trust region method based on interior point techniques for nonlinear programming”, Mathematical Programming A, 89, 149–185.

6. B. S. Castle. 2012, Quasi-Newton Methods for Stochastic Optimization and Proximity-Based Methods for Disparate Information Fusion, Ph.D. thesis, Indiana University, Bloomington, IN.

7. K. B. Cheng. 2008, “The relationship between joint strength and standing vertical jump performance”, Journal of Applied Biomechanics, 24, no. 3, 224–233.

8. A. R. Conn; N. I. M. Gould; P. L. Toint. 2000, Trust-Region Methods, MPS-SIAM Series on Optimization, SIAM, Philadelphia.

9. J. E. Dennis, Jr.; R. B. Schnabel. 1996, Numerical methods for unconstrained optimization and nonlinear equations (2nd ed.), SIAM, Philadelphia.

10. D. R. Easterling; L. T. Watson; M. L. Madigan. 2010, “The DIRECT algorithm applied to a problem in biomechanics with conformal mapping”, in Proc. 2010 International Conference on Scientific Computing, CSC ‘10, H. Arabnia and G. Grawanis (eds.), CSREA Press, USA, 2010, 128–133.

11. D. Y. Gao. 2000, Duality Principles in Nonconvex Systems: Theory, Methods, and Applications, Kluwer Academic Publishers, 472 pp.

12. D. Y. Gao. 2000, “Canonical dual transformation method and generalized triality theory in nonsmooth global optimization”, Journal of Global Optimization, 17(1/4), 127–160.

13. W. L. Goffe; G. D. Ferrier; J. Rogers. 1994, “Global optimization of statistical functions with simulated annealing”, Journal of Econometrics, 60, 65–100.

14. W. W. Hager; R. Rostamian; D. Wang. 2000, “The wave annihilation technique and the design of nonreflective coatings”, SIAM Journal on Applied Mathematics, 60, no. 4, 1388–1424.

15. J. He; A. Verstak; L. T. Watson; M. Sosonkina. 2008, “Design and implementation of a massively parallel version of DIRECT”, Computational Optimization and Applications, 40, no. 2, 217–245.

16. J. He; A. Verstak; L. T. Watson; M. Sosonkina. 2009, “Performance modeling and analysis of a massively parallel DIRECT: part 1”, International Journal of High Performance Computing Applications, 23, no. 1, 14–28.

17. J. He; L. T. Watson; N. Ramakrishnan; C. A. Shaffer; A. Verstak; J. Jiang; K. Bae; W. H. Tranter. 2002, “Dynamic data structures for a direct search algorithm”, Computational Optimization and Applications, 23, no. 1, 5–25.

18. J. He; L. T. Watson; M. Sosonkina. 2009, “Algorithm 897: VTDIRECT95: serial and parallel codes for the global optimization algorithm DIRECT”, ACM Transactions on Mathematical Software, 36, no. 3, Article 17, 1–24.

19. J. S. Higginson; R. R. Neptune; F. C. Anderson. 2004, “Simulated parallel annealing within a neighborhood for optimization of biomechanical systems”, Journal of Biomechanics, 38, no. 9, 1938–1942.

20. M. G. Hoy; F. E. Zajac; M. E. Gordon. 1990, “A musculoskeletal model of the human lower extremity: the effect of muscle, tendon, and moment arm on the moment-angle relationship of musculotendon actuators at the hip, knee, and ankle”, Journal of Biomechanics, 23, no. 2, 157–169.

21. L. Ingber. 1993, “Simulated annealing: practice versus theory”, Mathematical and Computer Modelling, 18, no. 11, 29–57.

22. D. R. Jones; C. D. Perttunen; B. E. Stuckman. 1993, “Lipschitzian optimization without the Lipschitz constant”, Journal of Optimization Theory and Applications, 79, no. 1, 157–181.

23. D. R. Jones. 2001, “The DIRECT global optimization algorithm”, Encyclopedia of Optimization, Vol. 1, Kluwer Academic Publishers, Dordrecht, 431–440.

24. J. Kiefer; J. Wolfowitz. 1952, “Stochastic estimation of the maximum of a regression function”, Annals of Mathematical Statistics, 23, 462–466.

25. S. Kirkpatrick; C. D. Gelatt; M. P. Vecchi. 1983, “Optimization by simulated annealing”, Science, New Series, 220, no. 4598, 671–680.

26. J. L. Maryak; D. C. Chin. 2008, “Global random optimization by simultaneous perturbation stochastic approximation”, IEEE Transactions on Automatic Control, 53, no. 3, 780–783.

27. M. J. Pavol; T. M. Owings; M. D. Grabiner. 2002, “Body segment inertial parameter estimation for the general population of older adults”, Journal of Biomechanics, 35, 707–712.

28. N. R. Radcliffe; D. R. Easterling; L. T. Watson; M. L. Madigan; K. A. Bieryla. 2010, “Results of two global optimization algorithms applied to a problem in biomechanics”, in Proc. 2010 Spring Simulation Multiconference, High Performance Computing Symp., A. Sandu, L. Watson, and W. Thacker (eds.), Soc. for Modeling and Simulation Internat., Vista, CA, 2010, 117–123.

29. D. J. Ram; T. H. Sreenivas; K. G. Subramaniam. 1996, “Parallel simulated annealing algorithms”, Journal of Parallel and Distributed Computing, 37, 207–212.

30. R. Riener; T. Edrich. 1999, “Identification of passive elastic joint moments in the lower extremities”, Journal of Biomechanics, 32, no. 5, 539–544.

31. B. W. Schulz; J. A. Ashton-Miller; N. B. Alexander. 2006, “Can initial and additional compensatory steps be predicted in young, older, and balance-impaired older females in response to anterior and posterior waist pulls while standing?”, Journal of Biomechanics, 39, no. 8, 1444–1453.

32. W. S. Selbie; G. E. Caldwell. 1996, “A simulation study of vertical jumping from different starting postures”, Journal of Biomechanics, 29, no. 9, 1137–1146.

33. J. C. Spall. 1998, “An overview of the simultaneous perturbation method for efficient optimization”, Johns Hopkins APL Tech. Digest, 19, no. 4, 482–492.

34. J. C. Spall. 1998, “Implementation of the simultaneous perturbation algorithm for stochastic optimization”, IEEE Transactions on Aerospace and Electronic Systems, 34, no. 3, 817–823.

35. J. C. Spall. 1987, “A stochastic approximation technique for generating maximum likelihood parameter estimates”, in Proc. American Control Conference (Minneapolis, MN, June 10–12), 1161–1167.

36. J. C. Spall. 1992, “Multivariate stochastic approximation using a simultaneous perturbation gradient approximation”, IEEE Trans. Autom. Control, 37, no. 3, 332–341.

37. D. M. Stablein; W. H. Carter, Jr.; G. L. Wampler. 1983, “Confidence regions for constrained optima in response-surface experiments”, Biometrics, 39, 759–763.

38. L. T. Watson; C. A. Baker. 2001, “A fully-distributed parallel global search algorithm”, Engineering Computations, 18, no. 1–2, 155–169.

39. F. Yang; F. C. Anderson; Y. C. Pai. 2007, “Predicted threshold against backward balance loss in gait”, Journal of Biomechanics, 40, no. 4, 804–811.

40. F. Yang; F. C. Anderson; Y. C. Pai. 2008, “Predicted threshold against backward balance loss following a slip in gait”, Journal of Biomechanics, 41, no. 9, 1823–1831.