
Accepted Manuscript

A Bregman extension of quasi-Newton updates II: Analysis of robustness properties

Takafumi Kanamori, Atsumi Ohara

PII: S0377-0427(13)00181-7
DOI: http://dx.doi.org/10.1016/j.cam.2013.04.005
Reference: CAM 9108

To appear in: Journal of Computational and Applied Mathematics

Received date: 16 December 2011
Revised date: 25 October 2012

Please cite this article as: T. Kanamori, A. Ohara, A Bregman extension of quasi-Newton updates II: Analysis of robustness properties, Journal of Computational and Applied Mathematics (2013), http://dx.doi.org/10.1016/j.cam.2013.04.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A Bregman extension of quasi-Newton updates II: Analysis of robustness properties

Takafumi Kanamori^a, Atsumi Ohara^b

^a Department of Computer Science and Mathematical Informatics, Nagoya University, A4-2(780) Furocho, Chikusaku, Nagoya 464-8603, Japan.

^b Department of Electrical and Electronics Engineering, University of Fukui, 3-9-1 Bunkyo, Fukui City, Fukui 910-8507, Japan.

Abstract

In Part I of this series of articles, we introduced an information geometric framework for quasi-Newton methods and gave an extension of the Hessian update formulae based on the Bregman divergence. The purpose of this article is to investigate the convergence and robustness properties of the extended Hessian update formulae. Fletcher studied a variational problem which derives the approximate Hessian update formula of the quasi-Newton methods. We point out that this variational problem is identical to optimization of the Kullback-Leibler divergence, which is a discrepancy measure between two probability distributions. We then introduce the Bregman divergence as an extension of the Kullback-Leibler divergence, and derive extended quasi-Newton update formulae based on the variational problem with the Bregman divergence. The proposed update formulae belong to the class of self-scaling quasi-Newton methods. We study the convergence property of the proposed quasi-Newton method. Moreover, we apply tools from robust statistics to analyze the robustness of the Hessian update formulae against numerical rounding errors and against shifts of the tuning parameters included in line search methods for the step length. As the main contribution of this paper, we show that the influence of perturbations in the line search is bounded only for the standard BFGS formula for the Hessian approximation. Numerical studies are conducted to verify the usefulness of the tools borrowed from robust statistics.

Email addresses: [email protected] (Takafumi Kanamori), [email protected] (Atsumi Ohara)


Keywords: quasi-Newton methods, robustness, influence function, sensitivity analysis

1. Introduction

We consider quasi-Newton methods for the unconstrained optimization problem

$$\text{minimize } f(x), \quad x\in\mathbb{R}^n, \tag{1}$$

in which the function $f:\mathbb{R}^n\to\mathbb{R}$ is twice continuously differentiable on $\mathbb{R}^n$. The quasi-Newton method is known to be one of the most successful methods for unconstrained function minimization. Details are given in [14, 17] and the references therein.

This article is a continuation of Part I [13], in which we presented an information geometric framework for quasi-Newton methods and an extended Hessian update formula based on the Bregman divergence. The best-known quasi-Newton methods are the DFP and BFGS methods. Fletcher [8] pointed out that the standard formulae, DFP and BFGS, are obtained as the optimal solutions of a variational problem over the set of positive definite matrices. Along this line, we extended the quasi-Newton update formula.

Among quasi-Newton methods, the BFGS method has generally been considered the most effective for unconstrained optimization problems [16]. The BFGS method has the global convergence property [21] and the superlinear convergence property [4]. These properties also hold for the restricted Broyden class except for DFP. Moreover, Powell discussed the self-correcting properties of the BFGS method [22] and showed that the BFGS method is better at correcting large eigenvalues of the initial Hessian approximation than the DFP method. As a result, the number of iterations the BFGS method needs to obtain an accurate solution is much smaller than that of the DFP method. Powell dealt only with the two-dimensional quadratic convex objective function; nevertheless, the self-correcting property is considered to explain the effectiveness of BFGS over DFP. Since the 1970s, much effort has been expended in trying to find a variable metric method that is more efficient than the BFGS method, and some interesting approaches such as the self-scaling method [20] and the optimally conditioned update [6] have been proposed. However, the practical advantage of these methods is modest, and they do not outperform the standard BFGS method. To our knowledge, there are only a few theoretical works explaining the effectiveness of the BFGS method, e.g. [22], in which the problem setup was restricted to two-dimensional convex quadratic optimization.

In this paper, we present a robustness property of quasi-Newton methods against numerical errors and against shifts of the tuning parameters in line search methods. Line search algorithms usually include some tuning parameters, and a small change of these parameters may significantly affect the computation following the line search. Hence, it is important to investigate the robustness of quasi-Newton update formulae against a small perturbation of the step length computed by the line search. To analyze the robustness of Hessian update formulae, we apply tools from robust statistics [12]. We consider an extended class of the Broyden family, and we show that the influence of numerical errors is bounded only for the standard BFGS formula for the Hessian approximation. For the other quasi-Newton update formulae, a perturbation of the step length has a large impact on the computation of the Hessian approximation. Our result is valid for a wide range of non-linear optimization problems. To our knowledge, this is the first theoretical result showing that the BFGS method has totally different and favorable properties in comparison to the other update formulae in the class of the extended Broyden family for general non-linear optimization problems.

In this series of papers, the variational view of Hessian update formulae was extended, and various quasi-Newton methods were thereby unified under the optimization of Bregman divergences. This extension enables us to proceed with a systematic study of quasi-Newton methods. In this paper, we focus on robustness properties against numerical errors in line search methods. Our unified framework may also contribute to revealing other theoretical features of variable metric methods.

There are various kinds of quasi-Newton methods, such as the standard BFGS and DFP methods, the SR1 method, and limited-memory quasi-Newton methods; see [17]. In this paper, we focus on the family of quasi-Newton methods derived from the variational method discussed in [13]. It includes self-scaling quasi-Newton methods and the Broyden family, and it has a functional degree of freedom to specify an update formula. Hence, many quasi-Newton update formulae can be described by the variational problem. On the other hand, the SR1 method and limited-memory quasi-Newton methods are not included in our analysis. The SR1 method is a rank-one update formula, whereas the variational view above leads only to rank-two update formulae. Among limited-memory quasi-Newton methods, the limited-memory BFGS (L-BFGS) method is especially useful for solving large-scale optimization problems. To estimate the curvature of the objective function, the L-BFGS method maintains only a small number of vectors instead of the full approximate Hessian matrix. This procedure is rather complicated to treat from an information geometric and variational viewpoint. As a result, the theoretical technique used in this paper will not be effective for the SR1 and L-BFGS methods. In this paper, we study the most robust update formulae, and the optimality is investigated within the range of quasi-Newton methods given by the variational method.

Here is a brief outline of the article. In Section 2, we introduce an extended quasi-Newton formula from the viewpoint of the variational approach; the convergence property of the extended quasi-Newton formula is also presented. Section 3 is devoted to discussing the robustness of the Hessian update formulae. Numerical simulations are presented in Section 4. We conclude with a discussion and outlook in Section 5. Proofs of theorems are postponed to the Appendix.

Throughout the paper, we use the following notation. The set of positive real numbers is denoted by $\mathbb{R}_+\subset\mathbb{R}$. The determinant of a square matrix $A$ is denoted by $\det A$. $GL(n)$ is the set of $n\times n$ invertible real matrices. The set of all $n\times n$ real symmetric matrices is denoted by $\mathrm{Sym}(n)$, and $PD(n)\subset GL(n)\cap\mathrm{Sym}(n)$ is the set of $n\times n$ symmetric positive definite matrices. For two square matrices $A,B$, the inner product $\langle A,B\rangle$ is defined by $\mathrm{tr}(AB^\top)$, and $\|A\|_F$ is the Frobenius norm, defined as the square root of $\langle A,A\rangle$. Throughout the paper we deal only with inner products of symmetric matrices, so the transposition in the trace will be dropped. For a vector $x$, $\|x\|$ denotes the Euclidean norm. The first and second order derivatives of a function $f:\mathbb{R}\to\mathbb{R}$ are denoted by $f'$ and $f''$, respectively. The gradient vector of a function $f:\mathbb{R}^n\to\mathbb{R}$ is denoted by $\nabla f(x)\in\mathbb{R}^n$, and the Hessian matrix by $\nabla^2 f(x)\in\mathrm{Sym}(n)$.

2. A Variational View of Quasi-Newton Methods

We briefly introduce the quasi-Newton formulae and their interpretation as a variational problem according to Fletcher [8] and Kanamori and Ohara [13]. In Section 2.4, we present the convergence property of the extended quasi-Newton methods.

2.1. Quasi-Newton Methods

In the quasi-Newton method, a sequence $\{x_k\}_{k=0}^\infty\subset\mathbb{R}^n$ is successively generated in a manner such that

$$x_{k+1} = x_k - \alpha_k B_k^{-1}\nabla f(x_k). \tag{2}$$

The coefficient $\alpha_k\in\mathbb{R}$ is a step size computed by a line search, and $B_k$ is a positive definite matrix approximating the Hessian matrix $\nabla^2 f(x_k)$ at the point $x_k$. Let $s_k$ and $y_k$ be the column vectors

$$s_k = x_{k+1}-x_k = -\alpha_k B_k^{-1}\nabla f(x_k), \qquad y_k = \nabla f(x_{k+1})-\nabla f(x_k).$$

We need a Hessian approximation $B_{k+1}$ for $\nabla^2 f(x_{k+1})$ to compute the next step. In the BFGS method, $B_{k+1}$ is given by

$$B_{k+1} = B_{BFGS}[B_k;s_k,y_k] := B_k - \frac{B_ks_ks_k^\top B_k}{s_k^\top B_ks_k} + \frac{y_ky_k^\top}{s_k^\top y_k}, \tag{3}$$

and the DFP method provides the different formula

$$B_{k+1} = B_{DFP}[B_k;s_k,y_k] := B_k - \frac{B_ks_ky_k^\top + y_ks_k^\top B_k}{s_k^\top y_k} + \frac{s_k^\top B_ks_k\,y_ky_k^\top}{(s_k^\top y_k)^2} + \frac{y_ky_k^\top}{s_k^\top y_k}. \tag{4}$$

When $B_k\in PD(n)$ and $s_k^\top y_k>0$ hold, both $B_{DFP}[B_k;s_k,y_k]$ and $B_{BFGS}[B_k;s_k,y_k]$ are also positive definite matrices. In practice, the Cholesky decomposition of $B_k$ is successively updated in order to compute the search direction $-B_k^{-1}\nabla f(x_k)$ efficiently; the idea of updating Cholesky factors was pioneered by Gill and Murray [10]. Note that the equality

$$B_{DFP}[B_k;s_k,y_k]^{-1} = B_{BFGS}[B_k^{-1};y_k,s_k]$$

holds. Hence, the update formula for the inverse $H_{k+1}=B_{k+1}^{-1}$ is directly derived from $H_k=B_k^{-1}$ without computing a matrix inversion.
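To make formulae (3) and (4) and the duality above concrete, here is a minimal numpy sketch (our code, not part of the paper; the function names are ours) that implements both updates and checks the secant condition $B_{k+1}s_k=y_k$ together with the inverse relation:

```python
import numpy as np

def bfgs_update(B, s, y):
    # BFGS formula (3): B - (B s s^T B)/(s^T B s) + (y y^T)/(s^T y)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)

def dfp_update(B, s, y):
    # DFP formula (4)
    Bs, sy = B @ s, s @ y
    return (B - (np.outer(Bs, y) + np.outer(y, Bs)) / sy
            + (s @ Bs) * np.outer(y, y) / sy**2
            + np.outer(y, y) / sy)

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                 # a positive definite B_k
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)
if s @ y <= 0:                              # enforce the curvature condition
    y = -y
print(np.allclose(bfgs_update(B, s, y) @ s, y))   # secant condition: True
print(np.allclose(dfp_update(B, s, y) @ s, y))    # secant condition: True
print(np.allclose(np.linalg.inv(dfp_update(B, s, y)),
                  bfgs_update(np.linalg.inv(B), y, s)))  # duality: True
```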

2.2. Bregman Divergence

To introduce a variational approach to quasi-Newton methods, we define the Bregman divergence [2]. Let $\varphi:PD(n)\to\mathbb{R}$ be a differentiable, strictly convex function that maps positive definite matrices to real numbers. We define the Bregman divergence of the matrix $P$ from the matrix $Q$ as

$$D(P,Q) = \varphi(P) - \varphi(Q) - \langle\nabla\varphi(Q),\,P-Q\rangle, \tag{5}$$

where $\nabla\varphi(Q)$ is the $n\times n$ matrix whose $(i,j)$ element is given by $\frac{\partial\varphi}{\partial Q_{ij}}(Q)$. The strict convexity of $\varphi$ guarantees that $D(P,Q)$ is non-negative and equals zero if and only if $P=Q$ holds. Hence, the Bregman divergence can be understood as a distance-like measure, even though it is not necessarily symmetric. Note that $D(P,Q)$ is convex in $P$ but not necessarily convex in $Q$. Bregman divergences have been well studied for nearness problems in statistics and machine learning [1, 7, 15].

In particular, we focus on Bregman divergences induced from the potential functions defined below. See [19] for details.

Definition 1 (potential function). Let $V:\mathbb{R}_+\to\mathbb{R}$ be a function which is strictly convex, decreasing, and three times continuously differentiable. Suppose that the functions $\nu$ and $\beta$ defined from $V$ satisfy the conditions

$$\nu(z) := -zV'(z) > 0, \tag{6}$$

$$\beta(z) := \frac{z\nu'(z)}{\nu(z)} < \frac{1}{n}, \tag{7}$$

for all $z>0$, and

$$\lim_{z\to+0}\frac{z}{\nu(z)^{n-1}} = 0. \tag{8}$$

Then $V$ is called a potential function, or a potential for short. For $P\in PD(n)$, the function $V(\det P)$ is also referred to as a potential on $PD(n)$.

As shown in [19], the function $V(\det P)$ is strictly convex in $P\in PD(n)$ if and only if $V$ satisfies (6) and (7). Condition (8) guarantees the existence of the Hessian update formula; see [13] for details.

Given a potential function $V$, the Bregman divergence defined from the potential $\varphi(P)=V(\det P)$ in (5) is denoted by $D_V(P,Q)$ and referred to as the $V$-Bregman divergence. The $V$-Bregman divergence has the form

$$D_V(P,Q) = V(\det P) - V(\det Q) + \nu(\det Q)\langle Q^{-1},P\rangle - n\nu(\det Q).$$

Below we show some examples of $V$-Bregman divergences.

Example 1. For the negative logarithmic function $V(z)=-\log(z)$, we have $\nu(z)=1$. The $V$-divergence is then given as

$$D_V(P,Q) = \langle P,Q^{-1}\rangle - \log\det(PQ^{-1}) - n.$$

This Bregman divergence is called the Kullback-Leibler (KL) divergence and denoted by $KL(P,Q)$. The KL divergence satisfies $KL(P,Q)=KL(Q^{-1},P^{-1})$, and thus $KL(P,Q)$ is convex in both $P$ and $Q^{-1}$.

Example 2. For the power potential $V(z)=(1-z^\gamma)/\gamma$ with $\gamma<1/n$, we have $\nu(z)=z^\gamma$ and $\beta(z)=\gamma$. Then we obtain

$$D_V(P,Q) = (\det Q)^\gamma\left\{\langle P,Q^{-1}\rangle + \frac{1-(\det PQ^{-1})^\gamma}{\gamma} - n\right\}.$$

The KL divergence in Example 1 is recovered by taking the limit $\gamma\to 0$.

Example 3. For $0\le a<b$, let $V(z)=a\log(az+1)-b\log(z)$. Then $V(z)$ is a convex and decreasing function, and we obtain

$$\nu(z) = b-a+\frac{a}{az+1} > 0, \qquad \beta(z) = \frac{-a^2z}{(az+1)(a(b-a)z+b)} \le 0$$

for $z>0$. This potential satisfies the inequality $0<b-a\le\nu(z)\le b$. This bounding condition on $\nu$ will be assumed in the convergence analysis of Section 2.4. The KL divergence is recovered by setting $a=0$, $b=1$.

2.3. A Variational Approach in Quasi-Newton Methods

Fletcher [8] has shown that the BFGS update formula (3) is obtained as the unique solution of the constrained optimization problem

$$\min_{B\in PD(n)} KL(B,B_k) \quad\text{subject to}\quad Bs_k=y_k,$$

where $KL$ is the KL divergence defined in Example 1. Following Part I of this work [13], we apply $V$-Bregman divergences to extend the quasi-Newton update formulae. We define the $V$-BFGS and $V$-DFP formulae as the unique optimal solutions of the following problems:

$$(V\text{-BFGS})\quad \min_{B\in PD(n)} D_V(B,B_k) \quad\text{subject to}\quad Bs_k=y_k, \tag{9}$$

$$(V\text{-DFP})\quad \min_{B\in PD(n)} D_V(B^{-1},B_k^{-1}) \quad\text{subject to}\quad Bs_k=y_k. \tag{10}$$

We mainly consider the $V$-BFGS update formula. The proof of the following theorem is found in [13].

Theorem 1. Let $B_k\in PD(n)$, and suppose $s_k^\top y_k>0$. Then problem (9) has the unique optimal solution $B_{k+1}\in PD(n)$ satisfying

$$B_{k+1} = \frac{\nu(\det B_{k+1})}{\nu(\det B_k)}B_{BFGS}[B_k;s_k,y_k] + \left(1-\frac{\nu(\det B_{k+1})}{\nu(\det B_k)}\right)\frac{y_ky_k^\top}{s_k^\top y_k}. \tag{11}$$

In the same way, we obtain the $V$-DFP update formula

$$B_{k+1} = \frac{\nu((\det B_k)^{-1})}{\nu((\det B_{k+1})^{-1})}B_{DFP}[B_k;s_k,y_k] + \left(1-\frac{\nu((\det B_k)^{-1})}{\nu((\det B_{k+1})^{-1})}\right)\frac{y_ky_k^\top}{s_k^\top y_k}, \tag{12}$$

which is the unique optimal solution of (10).

The Hessian update algorithm of the $V$-BFGS formula exploiting the Cholesky decomposition is presented in [13]. By maintaining the Cholesky decomposition, we can easily compute the search direction $B_k^{-1}\nabla f(x_k)$.

The $V$-BFGS update formula is represented by the affine sum of $B_{BFGS}[B_k;s_k,y_k]$ and $y_ky_k^\top/s_k^\top y_k$. This expression is equivalent to the self-scaling quasi-Newton update [20, 18] defined as

$$B_{k+1} = \theta_k B_{BFGS}[B_k;s_k,y_k] + (1-\theta_k)\frac{y_ky_k^\top}{s_k^\top y_k}, \tag{13}$$

where $\theta_k$ is a positive real number. In the $V$-BFGS update formula, the coefficient $\theta_k$ is determined from the function $\nu$. In the self-scaling update formula (13), the parameter

$$\theta_k = \frac{s_k^\top y_k}{s_k^\top B_ks_k} \tag{14}$$

is often recommended. As analyzed in [18], however, the self-scaling method with an inexact line search for the step length tends to be relatively inefficient compared to the standard BFGS method.

Example 4. We show the $V$-BFGS formula derived from the power potential. Let $V(z)$ be the power potential $V(z)=(1-z^\gamma)/\gamma$ with $\gamma<1/n$. As shown in Example 2, we have $\nu(z)=z^\gamma$. Some calculation then yields

$$\frac{\nu(\det B_{k+1})}{\nu(\det B_k)} = \left(\frac{s_k^\top y_k}{s_k^\top B_ks_k}\right)^\rho, \qquad \rho = \frac{\gamma}{1-(n-1)\gamma}.$$

As a result, the $V$-BFGS update formula is given as

$$B_{k+1} = \left(\frac{s_k^\top y_k}{s_k^\top B_ks_k}\right)^\rho B_{BFGS}[B_k;s_k,y_k] + \left(1-\left(\frac{s_k^\top y_k}{s_k^\top B_ks_k}\right)^\rho\right)\frac{y_ky_k^\top}{s_k^\top y_k}.$$

For $\gamma$ such that $\gamma<1/n$, we have $-1/(n-1)<\rho<1$. The standard self-scaling update formula corresponds to the above update with $\rho=1$; hence, the standard self-scaling update formula is not derived from the power potential. Indeed, the power potential with $\rho=1$, or equivalently $\gamma=1/n$, is convex but not strictly convex.
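Example 4 gives a closed-form update, since the mixing coefficient no longer depends implicitly on $\det B_{k+1}$. The following sketch (ours; it assumes $s^\top y>0$ and $\gamma<1/n$) implements it; setting gamma = 0 gives $\rho=0$ and recovers the standard BFGS update:

```python
import numpy as np

def v_bfgs_power(B, s, y, gamma):
    # Power-potential V-BFGS update of Example 4.
    n = B.shape[0]
    rho = gamma / (1.0 - (n - 1) * gamma)     # rho = gamma / (1 - (n-1) gamma)
    theta = (s @ y / (s @ B @ s)) ** rho      # = nu(det B_{k+1}) / nu(det B_k)
    Bs = B @ s
    b_bfgs = B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (s @ y)
    return theta * b_bfgs + (1.0 - theta) * np.outer(y, y) / (s @ y)

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)
s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)
B1 = v_bfgs_power(B, s, y, gamma=0.1)
print(np.allclose(B1 @ s, y))     # the secant condition holds for any theta
```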

It is straightforward to combine the $V$-BFGS method with the $V$-DFP method in the same way as the standard Broyden family [3]. Let $B^{BFGS}_{V_1,k+1}$ be the Hessian approximation given by the $V$-BFGS update formula with potential $V=V_1$, and $B^{DFP}_{V_2,k+1}$ the Hessian approximation given by the $V$-DFP update formula with potential $V=V_2$. Then the update formula of the $(V_1,V_2)$-Broyden family is defined by

$$B_{k+1} = \vartheta\,B^{BFGS}_{V_1,k+1} + (1-\vartheta)\,B^{DFP}_{V_2,k+1} \tag{15}$$

for $\vartheta\in[0,1]$. The $(V_1,V_2)$-Broyden family lies in the convex hull of $B_{BFGS}[B_k;s_k,y_k]$, $B_{DFP}[B_k;s_k,y_k]$ and $y_ky_k^\top/s_k^\top y_k$. The standard Broyden family is recovered by setting $V_1(z)=V_2(z)=-\log z$.

2.4. Analysis of Convergence Properties

We consider the convergence property of the $V$-BFGS method. Some standard assumptions on the objective function $f$ are stated below; see Section 6.4 of [17] for details.

Assumption 1.
1. The objective function $f$ is twice continuously differentiable.
2. Let $\nabla^2 f(x)$ be the Hessian matrix of $f$ at $x$. For the starting point $x_0$, the level set $L=\{x\in\mathbb{R}^n\mid f(x)\le f(x_0)\}$ is convex, and there exist positive constants $m$ and $M$ such that

$$m\|z\|^2 \le z^\top\nabla^2 f(x)z \le M\|z\|^2 \tag{16}$$

holds for all $z\in\mathbb{R}^n$ and $x\in L$.

The following theorem implies that the sequence $\{x_k\}_{k=0}^\infty$ generated by equation (2) with the $V$-BFGS update formula converges to a local minimizer of $f$ if the function $\nu(z)$ of the potential $V(z)$ satisfies the bounding condition above. A line search is used to determine the parameter $\alpha_k$ in (2); we assume that $\alpha_k$ satisfies the Wolfe condition, the details of which are presented in Section 3.1 of [17]. The proof of the following theorem is given in Appendix A.

Theorem 2. Let $B_0\in PD(n)$ be an initial matrix and $x_0\in\mathbb{R}^n$ a starting point which meets Assumption 1. Suppose that there exist positive constants $L_1,L_2>0$ such that $L_1\le\nu\le L_2$, and that the parameter $\alpha_k$ in (2) satisfies the Wolfe condition. Then the sequence $\{x_k\}$ generated by the $V$-BFGS update formula converges to a local minimizer of $f$.

The potential defined in Example 3 meets the condition on $\nu$ in Theorem 2, while the power potential $V(z)=(1-z^\gamma)/\gamma$ with $\nu(z)=z^\gamma$ does not. As analyzed by Nocedal and Yuan [18], the self-scaling BFGS method defined by (13) and (14) is globally convergent but not superlinearly convergent unless special attention is paid to the choice of the step lengths. In Section 2.3, we showed that the popular self-scaling BFGS method does not belong to the set of extended Hessian update formulae (11). This fact may explain the inefficiency of the popular self-scaling BFGS method.

3. Robustness of Quasi-Newton Update Formulae

Robustness, or stability, is an important feature of numerical computation. In this section we study the robustness of quasi-Newton updates against numerical errors and against the choice of tuning parameters involved in the line search.

There are mainly two types of quasi-Newton updates: the update formula for the approximate Hessian matrix, and the update formula for the approximate inverse Hessian matrix. In the approximate inverse Hessian update, the matrix $H_k=B_k^{-1}$ is directly updated to $H_{k+1}=B_{k+1}^{-1}$ under the secant condition $H_{k+1}y_k=s_k$. As a result, we study four kinds of update formulae: the $V$-BFGS and $V$-DFP methods for the Hessian approximation and for the inverse Hessian approximation.

We consider the robustness of the Hessian approximation formula. Let the updated point $x_{k+1}$ be

$$x_{k+1} = x_k - \alpha_k B_k^{-1}\nabla f(x_k) = x_k + s_k,$$

where the step length $\alpha_k$ is computed by a deterministic line search rule, such as the Armijo rule or the Wolfe rule with fixed tuning parameters. Note that popular line search methods provide an interval of admissible step lengths; the deterministic rule specifies how a step length is chosen from that interval. Under a fixed tuning parameter in the line search algorithm, suppose that the matrix $B_k$ is updated to $B_{k+1}$, the minimizer of $D_V(B,B_k)$ or $D_V(B^{-1},B_k^{-1})$ subject to $Bs_k=y_k$. If the tuning parameter is slightly changed, the step length $\alpha_k$ will be slightly perturbed, and $s_k=x_{k+1}-x_k$ will change to $(1+\varepsilon)s_k$, where $\varepsilon$ is an infinitesimal. The vector $y_k$ will also change, to

$$\tilde y_k = \nabla f(x_k+(1+\varepsilon)s_k) - \nabla f(x_k) = y_k + \varepsilon\nabla^2 f(x_{k+1})s_k + O(\varepsilon^2). \tag{17}$$

The secant condition for the Hessian update then becomes $(1+\varepsilon)Bs_k=\tilde y_k$. If a small shift of the tuning parameters in the line search has a large impact on the computation of the approximate Hessian matrix, the update formula is unstable; a small change of the tuning parameters may then degrade the convergence property of quasi-Newton methods. Since non-linear optimization methods are applied to a wide range of problems, the robustness property is desirable in order to guarantee good convergence even under the worst-case setup.

We study the relation between the perturbation of $s_k$ and the Hessian approximation $B_{k+1}$ or the inverse Hessian approximation $H_{k+1}$. Based on the above argument, we consider the optimization problems

$$(V\text{-BFGS-B})\quad \min_{B\in PD(n)} D_V(B,B_k) \quad\text{subject to}\quad (1+\varepsilon)Bs = y+\varepsilon\bar y, \tag{18}$$

$$(V\text{-DFP-B})\quad \min_{B\in PD(n)} D_V(B^{-1},B_k^{-1}) \quad\text{subject to}\quad (1+\varepsilon)Bs = y+\varepsilon\bar y, \tag{19}$$

for a fixed matrix $B_k\in PD(n)$ and fixed vectors $s,y,\bar y\in\mathbb{R}^n$, where the subscript $k$ of the vectors is dropped for simplicity. In the same way, the update formula for the inverse Hessian under the shift of $s=x_{k+1}-x_k$ is defined as the optimal solution of the following problems:

$$(V\text{-BFGS-H})\quad \min_{H\in PD(n)} D_V(H^{-1},H_k^{-1}) \quad\text{subject to}\quad H(y+\varepsilon\bar y) = (1+\varepsilon)s, \tag{20}$$

$$(V\text{-DFP-H})\quad \min_{H\in PD(n)} D_V(H,H_k) \quad\text{subject to}\quad H(y+\varepsilon\bar y) = (1+\varepsilon)s, \tag{21}$$

for fixed $H_k\in PD(n)$ and $s,y,\bar y\in\mathbb{R}^n$. The update formula given by $V$-BFGS-H (resp. $V$-DFP-H) directly provides the inverse of the matrix $B_{k+1}$ computed by $V$-BFGS-B (resp. $V$-DFP-B). Theorem 1 guarantees that the optimal solution exists and is unique as long as $s^\top(y+\varepsilon\bar y)>0$ holds. Though Theorem 1 deals only with the $V$-BFGS-B formula, the existence and uniqueness of the optimal solution for the other problems can be proved in the same manner.

In order to study the robustness of the update formulae, we borrow tools called the influence function and the gross error sensitivity from robust statistics [12]. Below, the $V$-BFGS-B update formula is considered as an example. Let $B(\varepsilon)$ be the optimal solution of $V$-BFGS-B in (18). The influence function of $B(\varepsilon)$ is defined as the derivative of $B(\varepsilon)$ at $\varepsilon=0$, that is,

$$\dot B(0) = \lim_{\varepsilon\to 0}\frac{B(\varepsilon)-B(0)}{\varepsilon}.$$

The differentiability of $B(\varepsilon)$ is proved later. In the robust statistics literature, the influence function is defined for functionals over an infinite dimensional space, such as the set of all probability distributions. From the definition of the influence function, the optimal solution $B(\varepsilon)$ is asymptotically equal to $B(0)+\varepsilon\dot B(0)$. This implies that, when the norm of $\dot B(0)$ is large, a perturbation of the step length has a large impact on the computation of the Hessian approximation. In the sense of the influence function, a preferable potential is a function $V(z)$ that yields an influence function $\dot B(0)$ with small norm.

The influence function and the gross error sensitivity have been studied in robust statistics [12]; we use these statistical techniques to analyze the stability of numerical computation. In the statistical literature, the "statistical model" $\{B\in PD(n)\mid Bs_k=y_k\}$ or $\{H\in PD(n)\mid Hy_k=s_k\}$ is fixed, and the "observed data" $B_k$ or $H_k$ is contaminated as $B_k+\varepsilon\dot B(0)+O(\varepsilon^2)$, while in the present analysis the matrix $B_k=H_k^{-1}$ is fixed and the model corresponding to the secant condition is perturbed.

Table 1: Gross error sensitivity of the V-BFGS/V-DFP formulae for the Hessian/inverse Hessian approximation. Only the standard BFGS for the Hessian approximation has finite gross error sensitivity.

                                    V-BFGS                  V-DFP
    Hessian approximation           finite only for BFGS    infinite
    inverse Hessian approximation   infinite                infinite

We consider the worst-case evaluation of the influence function. For fixed vectors $s$ and $y$ such that $s^\top y>0$, the influence function $\dot B(0)$ depends on the matrix $B_k$ and the perturbation $\bar y\in\mathbb{R}^n$. For subsets $\mathcal{B}\subset PD(n)$ and $\mathcal{Y}\subset\mathbb{R}^n$, the gross error sensitivity is defined as the largest norm of the influence function, that is,

$$\text{gross error sensitivity} = \sup\{\|\dot B(0)\|_F \mid B_k\in\mathcal{B}\subset PD(n),\ \bar y\in\mathcal{Y}\subset\mathbb{R}^n\}.$$

In many cases, the gross error sensitivity becomes infinite if $\mathcal{B}$ or $\mathcal{Y}$ is unbounded. Our concern is to find the potential function $V(z)$ which leads to a finite gross error sensitivity under a reasonable setup.

A potential function $V(z)$ minimizing the gross error sensitivity is favorable for robust computation. Below we prove that the standard BFGS update for the Hessian approximation is more robust than the other update formulae. This result agrees with empirical observations [5, 17]. The theoretical results are summarized in Table 1.

In the following, we consider the gross error sensitivity with $\mathcal{B}=PD(n)$ and a bounded subset $\mathcal{Y}\subset\mathbb{R}^n$. In principle, one can choose any positive definite matrix as the initial approximate Hessian matrix $B_0$; hence, the choice $\mathcal{B}=PD(n)$ is reasonable for a worst-case analysis. We illustrate an interpretation of the set $\mathcal{Y}$. For example, define

$$\mathcal{Y} = \{\bar y\in\mathbb{R}^n \mid \|\bar y\|\le Ma\}$$

for positive constants $M$ and $a$. Let $C^2$ be the set of all twice continuously differentiable functions; then $\mathcal{Y}$ includes the subset

$$\{(\nabla^2 f(x))s\in\mathbb{R}^n \mid \|s\|\le a,\ f\in C^2,\ \|\nabla^2 f(x)\|_F\le M \text{ for all } x\},$$

i.e., the set of perturbations of $y=\nabla f(x_{k+1})-\nabla f(x_k)$ for any vector $s$ and any objective function $f$ satisfying the bound above. If the gross error sensitivity is bounded for $\mathcal{B}=PD(n)$ and any bounded subset $\mathcal{Y}\subset\mathbb{R}^n$, the corresponding update formula is robust for a wide range of non-linear optimization problems.

We show that the influence function and the gross error sensitivity are informative only for non-quadratic objective functions.

Lemma 3. Suppose that the objective function $f(x)$ is a convex quadratic function. Then the influence function and the gross error sensitivity are equal to zero.

Lemma 3 is clear: for a quadratic objective function, the secant condition $Bs=y$ changes to $(1+\varepsilon)Bs=(1+\varepsilon)y$ under a perturbation of the step length. That is, the secant condition is unchanged, and thus $B(\varepsilon)=B(0)$ holds.
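A quick numerical confirmation of Lemma 3 (our sketch, with a randomly drawn quadratic): perturbing the step only rescales the secant pair, so the constraint set is unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 6, 0.05
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                 # Hessian of the quadratic objective
b = rng.standard_normal(n)
grad = lambda x: Q @ x + b              # f(x) = x^T Q x / 2 + b^T x
x, s = rng.standard_normal(n), rng.standard_normal(n)
y = grad(x + s) - grad(x)
y_perturbed = grad(x + (1 + eps) * s) - grad(x)
print(np.allclose(y_perturbed, (1 + eps) * y))  # True: secant data only rescales
```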

In many theoretical analyses of quasi-Newton methods, the progress of the algorithm is studied for convex quadratic objective functions. By assuming that the objective function is quadratic and convex, one can obtain strong theoretical results such as finite convergence properties [14]. On the other hand, the influence function and the gross error sensitivity are not informative for quadratic objective functions, as shown in Lemma 3. For the non-quadratic objective functions that are common in practice, however, the gross error sensitivity is a useful tool for analyzing the theoretical properties of update formulae. We show that a systematic analysis of quasi-Newton methods in general setups is possible using these tools.

Next, we prove that the influence function is well-defined.

Theorem 4. Suppose that $s^\top y>0$ holds for the vectors $s$ and $y$ in problems (18), (19), (20) and (21). Then, for small $\varepsilon$, the optimal solutions of $V$-BFGS-B, $V$-DFP-B, $V$-BFGS-H and $V$-DFP-H are all uniquely determined. The optimal solutions are twice continuously differentiable with respect to $\varepsilon$ in a vicinity of $\varepsilon=0$.

The proof is deferred to Appendix B. The gross error sensitivity of each update formula is computed in the following theorems; proofs are deferred to Appendix C.

Theorem 5 (gross error sensitivity of $V$-BFGS-B). Suppose $n\ge 3$. Let $s$ and $y$ be fixed vectors such that $s^\top y>0$, and let $\mathcal{Y}$ be a bounded subset of $\mathbb{R}^n$. For small $\varepsilon$, let $B(\varepsilon)$ be the optimal solution of $V$-BFGS-B in (18). Then the optimal potential function of the problem

$$\min_V \max_{B_k,\bar y} \|\dot B(0)\|_F \quad\text{subject to}\quad B_k\in PD(n),\ \bar y\in\mathcal{Y} \tag{22}$$

is given as $V(z)=-\log(z)$, up to a positive constant factor. In the above min-max problem, the function $V$ is sought from among all potentials.

Theorem 6 (gross error sensitivity of $V$-DFP-B). Suppose $n\ge 3$. Let $s$ and $y$ be fixed vectors such that $s^\top y>0$, and let $\mathcal{Y}$ be a bounded subset of $\mathbb{R}^n$ containing an open subset. Let $B(\varepsilon)$ be the optimal solution of $V$-DFP-B in (19). Then, for any potential $V$, the gross error sensitivity is infinite, i.e.,

$$\sup\{\|\dot B(0)\|_F \mid B_k\in PD(n),\ \bar y\in\mathcal{Y}\} = \infty.$$

Theorem 7 (gross error sensitivity of $V$-BFGS-H). Suppose $n\ge 4$. Let $s$ and $y$ be fixed vectors such that $s^\top y>0$, and let $\mathcal{Y}$ be a bounded subset of $\mathbb{R}^n$ containing an open subset. Let $H(\varepsilon)$ be the optimal solution of $V$-BFGS-H in (20). Then, for any potential $V$, the gross error sensitivity is infinite, i.e.,

$$\sup\{\|\dot H(0)\|_F \mid H_k\in PD(n),\ \bar y\in\mathcal{Y}\} = \infty.$$

Theorem 8 (gross error sensitivity of $V$-DFP-H). Suppose $n\ge 3$. Let $s$ and $y$ be fixed vectors such that $s^\top y>0$, and let $\mathcal{Y}$ be a bounded subset of $\mathbb{R}^n$. Let $H(\varepsilon)$ be the optimal solution of $V$-DFP-H in (21). Then, for any potential $V$, the gross error sensitivity is infinite, i.e.,

$$\sup\{\|\dot H(0)\|_F \mid H_k\in PD(n),\ \bar y\in\mathcal{Y}\} = \infty.$$

It is well known that there is a dual relation between the BFGS formula and the DFP formula: the $V$-DFP update for the inverse Hessian approximation is derived from the $V$-BFGS update formula for the Hessian approximation by replacing $B_k,s_k,y_k$ with $H_k,y_k,s_k$. For the robustness analysis, however, the dual relation is violated, as shown in Table 1. The reason is that we focus on the perturbation of the vector $s_k$ rather than that of $y_k$. Powell showed a significant difference between BFGS and DFP for two-dimensional quadratic convex objective functions [22] by considering the behavior of the eigenvalues of the approximate Hessian matrix. In the present paper, we exploit the gross error sensitivity, which is meaningful for non-quadratic objective functions; our approach thus exhibits a significant difference between the BFGS and DFP methods for non-linear, non-quadratic objective functions.

In Section 2, we introduced the $(V_1,V_2)$-Broyden family defined by (15). It is straightforward to prove that only the standard BFGS has finite gross error sensitivity among the $(V_1,V_2)$-Broyden family with a fixed mixing parameter $\vartheta\in[0,1]$. The Broyden family for the approximate inverse Hessian matrix $H_k=B_k^{-1}$ is defined in the same way. We state the following theorem without proof.

Theorem 9 (gross error sensitivity of the extended Broyden family). Suppose that the bounded subset $\mathcal{Y}$ includes an open subset of $\mathbb{R}^n$.

1. Suppose $n\ge 3$. Among the $(V_1,V_2)$-Broyden family (15) with a fixed parameter $\vartheta\in[0,1]$, only the standard BFGS method defined by $V_1(z)=-\log z$ and $\vartheta=1$ has finite gross error sensitivity for $\mathcal{B}=PD(n)$ and the bounded subset $\mathcal{Y}$.

2. Suppose $n\ge 4$. Then any update formula in the $(V_1,V_2)$-Broyden family for the inverse Hessian matrix with a fixed parameter $\vartheta\in[0,1]$ has infinite gross error sensitivity for $\mathcal{B}=PD(n)$ and the bounded subset $\mathcal{Y}$.

4. Numerical Studies

We demonstrate numerical experiments on the robustness of the quasi-Newton update formulae $V$-BFGS-B, $V$-DFP-B, $V$-BFGS-H and $V$-DFP-H studied in Section 3. In particular, the update formula derived from the power potential of Example 2 is examined.

In the first numerical study, we consider the numerical stability of the update formulae. Let $B(\varepsilon)$ be the optimal solution of $V$-BFGS-B (18) or $V$-DFP-B (19), and $H(\varepsilon)$ the optimal solution of $V$-BFGS-H (20) or $V$-DFP-H (21). For each update formula, we numerically compute the approximate influence function $\|(B(\varepsilon)-B(0))/\varepsilon\|_F$ or $\|(H(\varepsilon)-H(0))/\varepsilon\|_F$ for small $\varepsilon$, where the power potential $V(z)=(1-z^\gamma)/\gamma$ is applied. Recall that $V$-BFGS and $V$-DFP reduce to the standard BFGS and DFP, respectively, when $\gamma$ equals zero.

In what follows, we describe the setup of the numerical studies. Let $\mathrm{diag}(a_1,\dots,a_n)$ be the $n\times n$ diagonal matrix with diagonal elements $a_1,\dots,a_n$. For $V$-BFGS-B and $V$-DFP-B, the matrix $B_k\in PD(n)$ is set to one of the following three matrices:

$$B_k = \mathrm{diag}(1,\dots,n)/(n!)^{1/n}, \qquad B_k = \mathrm{diag}(1,\dots,n), \qquad\text{or}\qquad B_k = I+n^3\cdot pp^\top,$$

where $I$ is the identity matrix and $p$ is a column unit vector defined below. The dimension of the matrix $B_k$ is set to $n=10$, $100$, $500$ or $1000$. The first matrix $\mathrm{diag}(1,\dots,n)/(n!)^{1/n}$ has determinant one, and the other two matrices have large determinants. Below we show the procedure for generating the vectors $s$ and $y$ and the contaminated vectors $(1+\varepsilon)s$ and $y+\varepsilon\bar y$ for $V$-BFGS-B and $V$-DFP-B; a code sketch of this procedure follows the list. In the numerical studies for $V$-BFGS-H and $V$-DFP-H, the matrix $B_k$ below is replaced with the approximate inverse Hessian $H_k$.

1. In the case that $B_k$ is $\mathrm{diag}(1,\dots,n)/(n!)^{1/n}$ or $\mathrm{diag}(1,\dots,n)$, the vectors $s$ and $y$ are both generated according to the multivariate normal distribution with mean zero and variance-covariance matrix $10^2\times I$. If the inner product $s^\top y$ is non-positive, the sign of $y$ is flipped. The intensity of the noise involved in the line search is determined by $\varepsilon$, which is generated according to the uniform distribution on the interval $[-0.2,0.2]$. The vector $\bar y$ is generated according to the multivariate standard normal distribution. If the inequality $(1+\varepsilon)s^\top(y+\varepsilon\bar y)>0$ does not hold, $\varepsilon$ and $\bar y$ are generated again until the vectors satisfy the positivity condition.

2. In the case that $B_k$ has the expression $I+n^3\cdot pp^\top$, the vector $s$ is first generated according to the multivariate normal distribution with mean zero and variance-covariance matrix $10^2\times I$, and $y$ is defined by $y=s$. The vector $p$ is a unit vector orthogonal to $y$, that is, $p$ satisfies $p^\top y=0$ and $\|p\|=1$, and $B_k$ is set to $B_k=I+n^3\cdot pp^\top$. The vector $\bar y$ is then defined as $\bar y=p$. The construction of these vectors is used in the proofs of Theorem 7 and Theorem 8.
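The sketch below mirrors the data generation of item 1 above as we read it (our code; the distributional choices follow the text):

```python
import numpy as np

def draw_secant_data(n, rng):
    # Item 1: s, y ~ N(0, 10^2 I); flip y if s^T y <= 0; then draw
    # eps ~ U[-0.2, 0.2] and ybar ~ N(0, I), redrawing until the perturbed
    # curvature condition (1+eps) s^T (y + eps*ybar) > 0 holds.
    s = 10.0 * rng.standard_normal(n)
    y = 10.0 * rng.standard_normal(n)
    if s @ y <= 0:
        y = -y
    while True:
        eps = rng.uniform(-0.2, 0.2)
        ybar = rng.standard_normal(n)
        if (1 + eps) * (s @ (y + eps * ybar)) > 0:
            return s, y, eps, ybar

s, y, eps, ybar = draw_secant_data(100, np.random.default_rng(0))
```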

The Hessian or inverse Hessian update formula is applied to $B_k$ or $H_k$ with the randomly generated secant condition. The updated matrices $B(0)$ and $B(\varepsilon)$ are computed under the constraints $Bs=y$ and $(1+\varepsilon)Bs=y+\varepsilon\bar y$, respectively, using the $V$-BFGS-B or $V$-DFP-B update formula. In the same way, $V$-BFGS-H and $V$-DFP-H are applied to compute $H(0)$ with the constraint $Hy=s$ and $H(\varepsilon)$ with the perturbed secant condition $H(y+\varepsilon\bar y)=(1+\varepsilon)s$. The influence function of each update formula is approximated by $\|(B(\varepsilon)-B(0))/\varepsilon\|_F$ or $\|(H(\varepsilon)-H(0))/\varepsilon\|_F$.

Table 2 shows the average of the approximate influence function over 20 runs for each setup. When $B_k$ or $H_k$ equals the diagonal matrix $\mathrm{diag}(1,\dots,n)/(n!)^{1/n}$, the power $\gamma$ of the power potential does not significantly affect the influence function for either $V$-BFGS or $V$-DFP. In the other setups, $V$-BFGS-B with $\gamma=0$ has, overall, a smaller influence function than the other update formulae. The $V$-DFP-H formula for the inverse Hessian update also has a relatively small influence function when $H_k$ is proportional to $\mathrm{diag}(1,\dots,n)$. For $H_k=I+n^3pp^\top$, however, we find that $V$-DFP-H is sensitive to the perturbation in the line search.

As shown below, these numerical results agree with the theoretical analysis of Section 3.

1. Theorem 5 implies that the standard BFGS method is robust against the perturbation of the step length.

2. As shown in Example 4, the $V$-BFGS-B update with the power potential is close to the standard BFGS update for large $n$ and moderate $\det(B_k)$. That is, the mixing parameter $(s_k^\top y_k/s_k^\top B_ks_k)^\rho$ with $\rho=\gamma/(1-(n-1)\gamma)$ in Example 4 will be close to one if $n$ is large and $s_k^\top y_k/s_k^\top B_ks_k$ does not depend much on the dimension $n$. When $B_k$ has a large determinant which grows with the dimension $n$, the quantity $s_k^\top y_k/s_k^\top B_ks_k$ may depend severely on $n$. Then the mixing parameter $(s_k^\top y_k/s_k^\top B_ks_k)^\rho$ will not be close to one even for large $n$, and in such cases the influence function is affected by the choice of the power $\gamma$. The same argument on the relation between the influence function and the power $\gamma$ holds for the inverse Hessian updates $V$-BFGS-H and $V$-DFP-H.

3. For $B_k=I+n^3pp^\top$, the results for $V$-BFGS-B and $V$-DFP-B are numerically the same. Under this setup, we can theoretically confirm that the influence functions of the two update formulae are identical to each other. On the other hand, some calculation shows that the influence functions of $V$-BFGS-H and $V$-DFP-H are not the same.

The standard BFGS update formula achieves the min-max optimality of the gross error sensitivity. This implies that the BFGS method is not necessarily optimal for each individual setup. In the numerical studies, however, the BFGS method uniformly provides a fairly stable update formula in comparison to the other methods.

Next, we apply the standard BFGS-B and DFP-B to solve the following three optimization problems:

$$(\text{Problem 1 [11]})\quad \min_{x\in\mathbb{R}^n} f(x) = (x_1-1)^2 + \sum_{i=2}^n i\,(x_{i-1}-2x_i)^2,$$

$$(\text{Problem 2})\quad \min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}x^\top Ax - e^\top x,$$

where $e=(1,\dots,1)^\top\in\mathbb{R}^n$ and

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & -1 & 2 & \ddots & \\ & & \ddots & \ddots & -1 \\ & & & -1 & 2 \end{pmatrix}\in\mathbb{R}^{n\times n},$$

and the boundary value problem [9]

$$(\text{Problem 3})\quad \min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}x^\top Ax - e^\top x - \frac{1}{(n+1)^2}\sum_{i=1}^n(2x_i+\cos x_i),$$

where the vector $e$ and the matrix $A$ are the same as in Problem 2. The objective function of Problem 3 is non-linear and non-convex. The initial point $x_0$ is randomly generated from the $n$-dimensional normal distribution with mean zero and variance-covariance matrix $10^2\times I$, and the initial approximate Hessian matrix is set to the identity matrix. The termination criterion

$$\|\nabla f(x_k)\| \le n\times 10^{-5} \quad\text{or}\quad k\ge 50000$$

is employed, which is the same criterion used by Yamashita [23]. Although the second criterion implies that the method has failed to obtain a solution, no trial reached the maximum number of iterations. In each problem, the step length $\alpha_k$ is computed by the matlab command "fminbnd" with the option TolX = $10^{-12}$, the termination tolerance on $x$. In the same way as in the numerical studies on the robustness of the update formulae, the vector $s_k=x_{k+1}-x_k$ is randomly perturbed as $\tilde s_k=(1+\varepsilon)s_k$, where $\varepsilon$ is a random variable distributed uniformly on the interval $[-h,h]$. The level $h$ varies from 0 to 0.3. Accordingly, the vector $y_k=\nabla f(x_{k+1})-\nabla f(x_k)$ is changed to $\tilde y_k=\nabla f(x_k+\tilde s_k)-\nabla f(x_k)$. As a result, at each iteration the secant condition with perturbed step length is given as $B\tilde s_k=\tilde y_k$.
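For reference, here is a sketch of the test problems above (our implementation; the gradient of Problem 1 is derived by us and checked against finite differences):

```python
import numpy as np

def f1(x):
    # Problem 1: (x_1 - 1)^2 + sum_{i=2}^n i (x_{i-1} - 2 x_i)^2
    i = np.arange(2, x.size + 1)
    return (x[0] - 1.0) ** 2 + np.sum(i * (x[:-1] - 2.0 * x[1:]) ** 2)

def grad_f1(x):
    i = np.arange(2, x.size + 1)
    r = x[:-1] - 2.0 * x[1:]
    g = np.zeros_like(x)
    g[0] = 2.0 * (x[0] - 1.0)
    g[:-1] += 2.0 * i * r
    g[1:] -= 4.0 * i * r
    return g

def tridiag(n):
    # Matrix A of Problems 2-4: 2 on the diagonal, -1 off the diagonal.
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def f3(x):
    # Problem 3: (1/2) x^T A x - e^T x - (n+1)^{-2} sum (2 x_i + cos x_i)
    n = x.size
    return (0.5 * x @ tridiag(n) @ x - np.sum(x)
            - np.sum(2.0 * x + np.cos(x)) / (n + 1) ** 2)

x = np.random.default_rng(4).standard_normal(8)
e = 1e-6 * np.eye(8)
fd = np.array([(f1(x + e[j]) - f1(x - e[j])) / 2e-6 for j in range(8)])
print(np.allclose(fd, grad_f1(x), atol=1e-4))   # gradient check: True
```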

For each update formula, the average number of iterations over 20 runs is shown in Table 3. In Problem 1, BFGS and DFP need almost the same number of iterations to solve the problem, and the noise level $h$ affects both update formulae almost equally. In Problems 2 and 3, the BFGS method requires fewer iterations to reach the optimal solution than DFP, and in addition BFGS is stable against the noise level $h$. In contrast, the DFP method is overall sensitive to the contamination of the step length: the number of iterations of the DFP method rises drastically with the intensity of the noise in all problems. Although the sensitivity of an update formula depends on the problem, the numerical results imply that, as a whole, DFP is not robust against random noise in the line search compared to BFGS. As shown in Lemma 3, the perturbation of the step length does not affect the secant condition for a quadratic convex objective function. Hence it is the quality of the descent direction $B_k^{-1}\nabla f(x_k)$ that is easily degraded by numerical errors in the line search in the DFP method.

We also study the relation between the Hessian matrix of the objective function and the number of iterations for the convex quadratic problem

$$(\text{Problem 4})\quad \min_{x\in\mathbb{R}^n} f(x) = \frac{1}{2}x^\top(c\cdot A)x - e^\top x,$$

where $A$ and $e$ are the same as in Problem 2 and $c$ is a positive number. For $n=100$ and $c=0.5$, $1$, $5$, $10$ and $20$, the number of iterations of the quasi-Newton methods is shown in Table 4. The results imply that the noise level $h$ affects the number of iterations when $c$ is rather large. This is consistent with the numerical results in Table 3: indeed, the Hessian matrix of Problem 1 has larger eigenvalues than that of Problem 2. In any case, the number of iterations of the BFGS method tends to be less than or equal to that of the DFP method.

These numerical properties of quasi-Newton methods have been empirically well known [5, 17]. Powell [22] theoretically studied the progression of the eigenvalues of the approximate Hessian matrices in order to illustrate the difference between BFGS and DFP. Through the numerical studies in this section, we found that the theoretical framework based on the influence function and the gross error sensitivity is a useful tool for studying the numerical properties of quasi-Newton methods.

5. Concluding Remarks

Along the line of research started by Fletcher [8], we considered quasi-Newton update formulae based on the Bregman divergence induced from potential functions. The proposed update formulae for the Hessian approximation belong to the class of self-scaling quasi-Newton methods. We studied the convergence property, and we then applied tools from robust statistics to analyze the robustness of the extended Hessian update formulae. As a result, we found that the influence of a perturbation in the line search is bounded only for the standard BFGS formula for the Hessian approximation. Numerical studies support the usefulness of the theoretical framework borrowed from robust statistics.

It will be an interesting future work to investigate the practical advantage of the self-scaling quasi-Newton methods derived from the $V$-Bregman divergence. Nocedal and Yuan proved that the self-scaling quasi-Newton method with the popular scaling parameter (14) has some drawbacks [18]. In our framework, the self-scaling quasi-Newton method with the scaling parameter (14) falls outside the formulae derived from $V$-Bregman divergences. More precisely, the function $V(z)=n(1-z^{1/n})$, which is not a potential, formally leads to the popular self-scaling quasi-Newton formula. For the corresponding Bregman divergence $D_V(P,Q)$, the equality $D_V(P,cP)=0$ holds for any $P\in PD(n)$ and any $c>0$. This property implies that the scale of the Hessian approximation is not fixed, which we believe may cause some of the inefficiency of the self-scaling quasi-Newton method with (14). The self-scaling quasi-Newton method associated with a $V$-Bregman divergence may perform well in practice.

Another research direction is to consider the choice of the potential function $V$. Under the criterion of the gross error sensitivity, we found that the negative logarithmic function $V(z)=-\log z$ is the optimal choice. Other criteria will lead to other optimal potentials. Investigating the relation between the criterion for the update formula and the optimal potential will be beneficial for the design of numerical algorithms.


6. Acknowledgements

The authors are grateful to Dr. Nobuo Yamashita of Kyoto University and Dr. Ichiro Takeuchi of Nagoya Institute of Technology for helpful comments. T. Kanamori was partially supported by a Grant-in-Aid for Young Scientists (20700251).

Appendix A. Proof of Theorem 2

Using the following lemma, we prove Theorem 2 in a manner similar to Section 8.4 of [17].

Lemma 10 (Eq. (6.12) in [17]). Let $G$ be the averaged Hessian

$$G = \int_0^1 \nabla^2 f(x_k+\tau s)\,d\tau, \qquad s = x_{k+1}-x_k\in\mathbb{R}^n.$$

Then the property $y=Gs$ follows from Taylor's theorem, where $y=\nabla f(x_{k+1})-\nabla f(x_k)$.

Proof of Theorem 2. Let $B_k$, $k=0,1,2,\dots$ be the sequence of approximate Hessian matrices generated by the $V$-BFGS update formula. We define $\tilde B_{k+1}$ and $\tilde B_k$ by

$$\tilde B_{k+1} = \frac{1}{\nu(\det B_{k+1})}B_{k+1}, \qquad \tilde B_k = \frac{1}{\nu(\det B_k)}B_k,$$

respectively. Then the update formula shown in Theorem 1 is represented as

$$\tilde B_{k+1} = \tilde B_k - \frac{\tilde B_ks_ks_k^\top\tilde B_k}{s_k^\top\tilde B_ks_k} + \frac{1}{\nu(\det B_{k+1})}\frac{y_ky_k^\top}{s_k^\top y_k}. \tag{A.1}$$

We compute

$$\psi(\tilde B_{k+1}) = \mathrm{tr}(\tilde B_{k+1}) - \log\det\tilde B_{k+1}.$$

The inequality (16) yields

$$\frac{s_k^\top y_k}{\|s_k\|^2} = \frac{s_k^\top Gs_k}{\|s_k\|^2} \ge m, \tag{A.2}$$

$$\frac{\|y_k\|^2}{s_k^\top y_k} = \frac{s_k^\top G^2s_k}{s_k^\top Gs_k} \le M. \tag{A.3}$$

We now define

$$\cos\theta_k = \frac{s_k^\top\tilde B_ks_k}{\|s_k\|\|\tilde B_ks_k\|}, \qquad q_k = \frac{s_k^\top\tilde B_ks_k}{\|s_k\|^2}.$$

Then the trace of $\tilde B_{k+1}$ is bounded above. Indeed, using (A.3), the inequality

$$\mathrm{tr}(\tilde B_{k+1}) = \mathrm{tr}(\tilde B_k) - \frac{\|\tilde B_ks_k\|^2}{s_k^\top\tilde B_ks_k} + \frac{\|y_k\|^2}{\nu(\det B_{k+1})\,s_k^\top y_k} \le \mathrm{tr}(\tilde B_k) - \frac{q_k}{\cos^2\theta_k} + \frac{M}{\nu(\det B_{k+1})}$$

holds. Using the formula $\det(I+xy^\top+uv^\top)=(1+x^\top y)(1+u^\top v)-(x^\top v)(y^\top u)$ for $\tilde B_{k+1}$, we obtain a lower bound on the determinant $\det(\tilde B_{k+1})$:

$$\det(\tilde B_{k+1}) = \det(\tilde B_k)\,\frac{1}{\nu(\det B_{k+1})}\,\frac{\|s_k\|^2}{s_k^\top\tilde B_ks_k}\,\frac{s_k^\top y_k}{\|s_k\|^2} \ge \det(\tilde B_k)\,\frac{m}{q_k\,\nu(\det B_{k+1})}.$$

These inequalities provide an upper bound on $\psi(\tilde B_{k+1})$:

$$\psi(\tilde B_{k+1}) \le \psi(\tilde B_k) + \left(\frac{M}{\nu(\det B_{k+1})} - \log\frac{m}{\nu(\det B_{k+1})} - 1\right) + \left(1 - \frac{q_k}{\cos^2\theta_k} + \log\frac{q_k}{\cos^2\theta_k}\right) + \log\cos^2\theta_k$$
$$\le \psi(\tilde B_k) + \left(\frac{M}{L_1} - \log\frac{m}{L_2} - 1\right) + \log\cos^2\theta_k.$$

The second inequality follows from $L_1\le\nu\le L_2$ and from

$$1 - \frac{q_k}{\cos^2\theta_k} + \log\frac{q_k}{\cos^2\theta_k} \le 0.$$

As a result we obtain

$$0 < \psi(\tilde B_{k+1}) \le \psi(\tilde B_0) + c(k+1) + \sum_{j=1}^k\log\cos^2\theta_j,$$

where $c$ is a positive constant such that $c > \frac{M}{L_1} - \log\frac{m}{L_2} - 1$. Let us then proceed by contradiction and assume that $\cos\theta_j\to 0$. Then there exists $k_1>0$ such that for all $j>k_1$ we have

$$\log\cos^2\theta_j < -2c.$$

Thus the following inequality holds for all $k>k_1$:

$$0 < \psi(\tilde B_0) + c(k+1) + \sum_{j=1}^{k_1}\log\cos^2\theta_j + (k-k_1)(-2c) = \psi(\tilde B_0) + \sum_{j=1}^{k_1}\log\cos^2\theta_j + c(2k_1+1) - ck.$$

The right-hand side is negative for large $k$, giving a contradiction. Therefore there exists a subsequence satisfying $\cos\theta_{j_k}\ge\delta>0$. By Zoutendijk's result (under some conditions, $\sum_{j\ge 0}\cos^2\theta_j\|\nabla f(x_j)\|^2<\infty$ holds; see Theorem 3.2 in [17]) together with the Wolfe condition, this implies that $\liminf_{k\to\infty}\|\nabla f(x_k)\|=0$. The convexity of $f$ on $L$ guarantees that $x_k$ converges to the local optimal solution.

Appendix B. Proof of Theorem 4

We show that the optimal solution of $V$-BFGS-B is twice continuously differentiable; the same proof works for the other update formulae.

Proof. We consider problem (18). Since the inequality $s^\top(y+\varepsilon\bar y)>0$ holds for infinitesimal $\varepsilon$, Theorem 1 guarantees that there exists a unique optimal solution $B(\varepsilon)$ around $\varepsilon=0$. Let the function $F:\mathbb{R}^{n\times n}\times\mathbb{R}\to\mathbb{R}^{n\times n}$ be

$$F(X,\varepsilon) = \frac{1}{\nu(\det X)}X - \frac{1}{\nu(\det B_k)}B_{BFGS}[B_k;(1+\varepsilon)s,\,y+\varepsilon\bar y] - \left(\frac{1}{\nu(\det X)} - \frac{1}{\nu(\det B_k)}\right)\frac{(y+\varepsilon\bar y)(y+\varepsilon\bar y)^\top}{(1+\varepsilon)s^\top(y+\varepsilon\bar y)}$$

for $X\in\mathbb{R}^{n\times n}$ and $\varepsilon\in\mathbb{R}$. For infinitesimal $\varepsilon$, the equality $F(B(\varepsilon),\varepsilon)=O$ holds, where $O$ is the null matrix. We apply the implicit function theorem to prove the differentiability of $B(\varepsilon)$. Since the potential function is three times continuously differentiable, $F(X,\varepsilon)$ is clearly twice continuously differentiable in a vicinity of $(X,\varepsilon)=(B(0),0)$. For any symmetric matrix $A\in\mathrm{Sym}(n)$, the equality

$$\nabla_X\langle F(X,\varepsilon),A\rangle\Big|_{X=B(0),\varepsilon=0} = \frac{1}{\nu(\det B(0))}\left(A - \beta(\det B(0))\left\langle B(0)-\frac{yy^\top}{s^\top y},\,A\right\rangle B(0)^{-1}\right)$$

holds, where $\nabla_X$ denotes the gradient with respect to the variable $X$. We prove that the equality

$$\nabla_X\langle F(X,\varepsilon),A\rangle\Big|_{X=B(0),\varepsilon=0} = O \tag{B.1}$$

leads to $A=O$. This implies that the Jacobian matrix of $F(X,\varepsilon)$ with respect to $X$ at $(X,\varepsilon)=(B(0),0)$ is invertible. When (B.1) holds, we see that $A$ must be of the form $A=\kappa B(0)^{-1}$ with a constant $\kappa$. Substituting $A=\kappa B(0)^{-1}$ into (B.1), we obtain

$$\kappa\{1-(n-1)\beta(\det B(0))\} = 0.$$

The inequality $\beta<1/n$ leads to $\kappa=0$. Therefore, we conclude $A=O$.

Appendix C. Computations of Gross Error Sensitivity

First, a universal formula for the computation of the influence function is proved, and some useful lemmas are prepared. Then the gross error sensitivity of each update formula is computed in Appendix C.1, Appendix C.2, Appendix C.3 and Appendix C.4.

Lemma 11. Let $s,\bar s,y,\bar y$ be column vectors in $\mathbb{R}^n$ such that $s^\top y>0$, and let $B_k$ be a positive definite matrix. For an infinitesimal $\varepsilon$, let $B(\varepsilon)$ be the optimal solution of

$$\min_{B\in PD(n)} D_V(B,B_k) \quad\text{subject to}\quad B(s+\varepsilon\bar s) = y+\varepsilon\bar y, \tag{C.1}$$

and let $\Delta[B_k;s,\bar s,y,\bar y]$ denote the influence function $\dot B(0)$. Then we have

$$\begin{aligned} \dot B(0) &= \Delta[B_k;s,\bar s,y,\bar y] \\ &= \left\{\frac{s^\top\bar y-\bar s^\top y}{s^\top y} + \frac{\nu(\det B(0))}{\nu(\det B_k)}\left(\frac{2\bar s^\top B_ks\cdot s^\top B_k(B(0))^{-1}B_ks}{(s^\top B_ks)^2} - \frac{2\bar s^\top B_k(B(0))^{-1}B_ks}{s^\top B_ks}\right)\right\} \\ &\qquad\times\frac{\beta(\det B(0))}{1-(n-1)\beta(\det B(0))}\left[B(0)-\frac{yy^\top}{s^\top y}\right] + \frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top \\ &\qquad+\frac{\nu(\det B(0))}{\nu(\det B_k)}\left[\frac{2\bar s^\top B_ks}{(s^\top B_ks)^2}B_kss^\top B_k - \frac{B_k(\bar ss^\top+s\bar s^\top)B_k}{s^\top B_ks}\right]. \end{aligned} \tag{C.2}$$

The matrix $\Delta[B_k;s,\bar s,y,\bar y]$ is well-defined, since the inequalities $\nu>0$ and $1-(n-1)\beta>0$ hold for any potential function. Note that $\Delta[B_k;s,s,y,y]=O$ holds; this gives another proof of Lemma 3.

Proof of Lemma 11. In the same way as the proofs of Theorem 1 and Theorem 4, we can prove the existence and the differentiability of $B(\varepsilon)$. Since $B(\varepsilon)$ is twice continuously differentiable around $\varepsilon=0$, the equality

$$B(\varepsilon) = B(0) + \varepsilon\Delta + O(\varepsilon^2)$$

holds, where $\Delta\in\mathrm{Sym}(n)$. Then we have

$$\det(B(\varepsilon)) = \det(B(0)+\varepsilon\Delta+O(\varepsilon^2)) = \det(B(0)) + \varepsilon\det(B(0))\langle\Delta,B(0)^{-1}\rangle + O(\varepsilon^2),$$

and thus we obtain

$$\nu(\det B(\varepsilon)) = \nu(\det B(0)) + \varepsilon\,\nu'(\det B(0))\det(B(0))\langle\Delta,B(0)^{-1}\rangle + O(\varepsilon^2).$$

For simplicity, let

$$\delta = \det(B(0))\langle\Delta,B(0)^{-1}\rangle; \tag{C.3}$$

then the equality

$$\nu(\det B(\varepsilon)) = \nu(\det B(0)) + \varepsilon\,\delta\,\nu'(\det B(0)) + O(\varepsilon^2) \tag{C.4}$$

holds. Some calculation shows that the asymptotic expansions of $B_{BFGS}[B_k;s+\varepsilon\bar s,y+\varepsilon\bar y]$ and $(y+\varepsilon\bar y)(y+\varepsilon\bar y)^\top/(s+\varepsilon\bar s)^\top(y+\varepsilon\bar y)$ are respectively given by

$$B_{BFGS}[B_k;s+\varepsilon\bar s,y+\varepsilon\bar y] = B_{BFGS}[B_k;s,y] + \varepsilon\left(\frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top - \frac{B_k(\bar ss^\top+s\bar s^\top)B_k}{s^\top B_ks} + \frac{2\bar s^\top B_ks}{(s^\top B_ks)^2}B_kss^\top B_k\right) + O(\varepsilon^2) \tag{C.5}$$

and

$$\frac{(y+\varepsilon\bar y)(y+\varepsilon\bar y)^\top}{(s+\varepsilon\bar s)^\top(y+\varepsilon\bar y)} = \frac{yy^\top}{s^\top y} + \varepsilon\left(\frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top\right) + O(\varepsilon^2). \tag{C.6}$$

Substituting (C.4), (C.5) and (C.6) into the equality

$$B(\varepsilon) = \frac{\nu(\det B(\varepsilon))}{\nu(\det B_k)}B_{BFGS}[B_k;s+\varepsilon\bar s,y+\varepsilon\bar y] + \left(1-\frac{\nu(\det B(\varepsilon))}{\nu(\det B_k)}\right)\frac{(y+\varepsilon\bar y)(y+\varepsilon\bar y)^\top}{(s+\varepsilon\bar s)^\top(y+\varepsilon\bar y)},$$

we obtain

$$B(\varepsilon) = B(0) + \varepsilon\cdot\Bigg\{\delta\,\frac{\nu'(\det B(0))}{\nu(\det B_k)}\left(B_{BFGS}[B_k;s,y]-\frac{yy^\top}{s^\top y}\right) + \frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top - \frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{B_k(\bar ss^\top+s\bar s^\top)B_k}{s^\top B_ks} + \frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{2\bar s^\top B_ks}{(s^\top B_ks)^2}B_kss^\top B_k\Bigg\} + O(\varepsilon^2),$$

and thus $\Delta$ is represented as

$$\begin{aligned} \Delta &= \delta\,\frac{\nu'(\det B(0))}{\nu(\det B_k)}\left[B_{BFGS}[B_k;s,y]-\frac{yy^\top}{s^\top y}\right] + \frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top \\ &\qquad-\frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{B_k(\bar ss^\top+s\bar s^\top)B_k}{s^\top B_ks} + \frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{2\bar s^\top B_ks}{(s^\top B_ks)^2}B_kss^\top B_k \\ &= \delta\,\frac{\nu'(\det B(0))}{\nu(\det B(0))}\left[B(0)-\frac{yy^\top}{s^\top y}\right] + \frac{\bar yy^\top+y\bar y^\top}{s^\top y} - \frac{\bar s^\top y+s^\top\bar y}{(s^\top y)^2}yy^\top \\ &\qquad-\frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{B_k(\bar ss^\top+s\bar s^\top)B_k}{s^\top B_ks} + \frac{\nu(\det B(0))}{\nu(\det B_k)}\frac{2\bar s^\top B_ks}{(s^\top B_ks)^2}B_kss^\top B_k, \end{aligned}$$

in which we use the equality

$$\frac{\nu(\det B(0))}{\nu(\det B_k)}\left[B_{BFGS}[B_k;s,y]-\frac{yy^\top}{s^\top y}\right] = B(0)-\frac{yy^\top}{s^\top y}.$$

Substituting the above $\Delta$ into (C.3), we have

$$\delta = \frac{\det B(0)}{1-(n-1)\beta(\det B(0))}\left\{\frac{s^\top\bar y-\bar s^\top y}{s^\top y} + \frac{\nu(\det B(0))}{\nu(\det B_k)}\left(\frac{2\bar s^\top B_ks\cdot s^\top B_k(B(0))^{-1}B_ks}{(s^\top B_ks)^2} - \frac{2\bar s^\top B_k(B(0))^{-1}B_ks}{s^\top B_ks}\right)\right\}.$$

As a result, we obtain

$$\frac{B(\varepsilon)-B(0)}{\varepsilon} = \Delta[B_k;s,\bar s,y,\bar y] + O(\varepsilon),$$

with $\Delta[B_k;s,\bar s,y,\bar y]$ as given in (C.2). Letting $\varepsilon$ tend to zero, we obtain the influence function $\dot B(0)=\Delta[B_k;s,\bar s,y,\bar y]$.

Lemma 12. Let $s,\bar s,y,\bar y$ be column vectors in $\mathbb{R}^n$ such that $s^\top y>0$, and let $B_k$ be a matrix in $PD(n)$. For an infinitesimal $\varepsilon$, let $B(\varepsilon)$ be the optimal solution of

$$\min_{B\in PD(n)} D_V(B^{-1},B_k^{-1}) \quad\text{subject to}\quad B(s+\varepsilon\bar s) = y+\varepsilon\bar y,$$

and let $\Gamma[B_k;s,\bar s,y,\bar y]$ denote $\dot B(0)$. Then we have

$$\Gamma[B_k;s,\bar s,y,\bar y] = -B(0)\,\Delta[B_k^{-1};y,\bar y,s,\bar s]\,B(0), \tag{C.7}$$

where $\Delta$ is the function defined in Lemma 11.

Proof. Let $H(\varepsilon)$ be the optimal solution of

$$\min_{H\in PD(n)} D_V(H,B_k^{-1}) \quad\text{subject to}\quad H(y+\varepsilon\bar y) = s+\varepsilon\bar s;$$

then clearly $B(\varepsilon)=H(\varepsilon)^{-1}$ holds. Thus we have

$$\Gamma[B_k;s,\bar s,y,\bar y] = \dot B(0) = -H(0)^{-1}\dot H(0)H(0)^{-1} = -B(0)\,\Delta[B_k^{-1};y,\bar y,s,\bar s]\,B(0),$$

where $\dot H(0)=\Delta[B_k^{-1};y,\bar y,s,\bar s]$ is applied.

We state another lemma, which is useful for proving that the gross error sensitivity diverges to infinity.

Lemma 13. Suppose $n\ge k+3$ for non-negative integers $n$ and $k$. For any set of vectors $s,y,\bar y_1,\dots,\bar y_k\in\mathbb{R}^n$ such that $s^\top y>0$ and any positive real number $d$, there exists a sequence $\{B_i\}_{i=1}^\infty\subset PD(n)$ satisfying the following three conditions:

1. The equalities $B_is=y$ and $B_i\bar y_m=B_j\bar y_m$ hold for all $i,j\ge 1$ and $m=1,\dots,k$.
2. $\det(B_i)=d$ for all $i\ge 1$.
3. $\lim_{i\to\infty}\|B_i\|_F=\infty$.

Proof. For any $s,y\in\mathbb{R}^n$ such that $s^\top y>0$, there exists $B\in PD(n)$ satisfying $Bs=y$. Indeed, for the $n\times n$ identity matrix $I$, the matrix $B=B_{BFGS}[I;s,y]\in PD(n)$ is well-defined and satisfies $Bs=y$. When $n\ge k+3$ holds, there exist two unit vectors $p_1,p_2\in\mathbb{R}^n$ satisfying $p_1^\top p_2=0$ and

$$p_1^\top(B^{1/2}s)=0, \quad p_1^\top(B^{1/2}\bar y_m)=0, \quad m=1,\dots,k,$$
$$p_2^\top(B^{1/2}s)=0, \quad p_2^\top(B^{1/2}\bar y_m)=0, \quad m=1,\dots,k.$$

We will show that the matrix

$$B(a) = B^{1/2}(I+ap_1p_1^\top+bp_2p_2^\top)B^{1/2}$$

with

$$a>0, \qquad b=\frac{d/\det(B)}{1+a}-1 \tag{C.8}$$

satisfies four conditions: $B(a)s=y$, $B(a)\bar y_m=B\bar y_m$, $\det B(a)=d$ and $B(a)\in PD(n)$ for all $a>0$. The first two equalities are clear from the definition of $p_1$, $p_2$ and $B$. The determinant of $B(a)$ is equal to

$$\det(B(a)) = \det(B)\det(I+ap_1p_1^\top+bp_2p_2^\top) = \det(B)(1+a)(1+b) = d.$$

For any unit vector $x\in\mathbb{R}^n$ we have

$$x^\top(I+ap_1p_1^\top+bp_2p_2^\top)x = 1+a(p_1^\top x)^2+b(p_2^\top x)^2 \ge 1+b(p_2^\top x)^2 \quad (\because a>0)$$
$$\ge 1-(p_2^\top x)^2 \quad (\because b>-1) \qquad\ge 0 \quad\text{(Schwarz inequality)},$$

and in addition the determinant of $I+ap_1p_1^\top+bp_2p_2^\top$ is equal to $d/\det(B)>0$. Thus $B(a)$ is positive definite. Let $x$ be the unit vector defined by $x=B^{-1/2}p_1/\|B^{-1/2}p_1\|$. Then, in terms of the maximum eigenvalue of $B(a)$, we have

$$\|B(a)\|_F \ge x^\top B(a)x = x^\top Bx + \frac{a}{p_1^\top B^{-1}p_1}.$$

Hence, $\|B(a)\|_F$ tends to infinity when $a$ tends to infinity. Thus the sequence defined by

$$B_i = B(i), \quad i=1,2,3,\dots \tag{C.9}$$

satisfies the three conditions of the lemma.
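The construction in the proof of Lemma 13 can be checked numerically. Here is a sketch for $k=0$ (our code; it assumes scipy is available for the matrix square root and the null space):

```python
import numpy as np
from scipy.linalg import sqrtm, null_space   # assumes scipy is available

rng = np.random.default_rng(5)
n, d, a = 5, 3.0, 1e4
s, y = rng.standard_normal(n), rng.standard_normal(n)
if s @ y <= 0:
    y = -y
# B = B_BFGS[I; s, y] is positive definite and satisfies B s = y
B = np.eye(n) - np.outer(s, s) / (s @ s) + np.outer(y, y) / (s @ y)
R = np.real(sqrtm(B))                        # symmetric square root B^{1/2}
P = null_space((R @ s).reshape(1, -1))       # unit vectors orthogonal to B^{1/2} s
p1, p2 = P[:, 0], P[:, 1]                    # requires n >= 3
b = d / np.linalg.det(B) / (1 + a) - 1       # coefficient (C.8)
Ba = R @ (np.eye(n) + a * np.outer(p1, p1) + b * np.outer(p2, p2)) @ R
print(np.allclose(Ba @ s, y))                # B(a) s = y is preserved: True
print(np.isclose(np.linalg.det(Ba), d))      # determinant pinned at d: True
print(np.linalg.norm(Ba, "fro"))             # grows without bound with a
```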

Appendix C.1. Proof of Theorem 5

Let B(ε) be the optimal solution of (18). For a perturbation of the steplength, the influence function B(0) for V -BFGS-B is equal to ∆[Bk; s, s, y, y]which is defined in Lemma 11. Thus we have

B(0) =(y − y)>s

s>y

β(det B(0))

1− (n− 1)β(det B(0))

[B(0)− yy>

s>y

]

+yy> + yy>

s>y− (y + y)>s

(s>y)2yy>. (C.10)

If (y − y)>s = 0 holds for any y ∈ Y , the potential does not affect thenorm of the influence function, because the first term of the above expressionvanishes. Thus, clearly V (z) = − log(z) is an optimal potential. Below weassume (y− y)>s 6= 0 for a vector y ∈ Y . Suppose that Bk satisfies Bks = y.Then B(0) = Bk holds, and the triangle inequality yields that

\[
\|\dot{B}(0)\|_F = \|\Delta[B_k; s, s, y, \bar{y}]\|_F
\geq \left|\frac{(\bar{y} - y)^\top s}{s^\top y}\right|
\left|\frac{\beta(\det B_k)}{1 - (n-1)\beta(\det B_k)}\right|
\Big(\|B_k\|_F - \Big\|\frac{yy^\top}{s^\top y}\Big\|_F\Big)
- \Big\|\frac{\bar{y}y^\top + y\bar{y}^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,yy^\top\Big\|_F.
\]

If $\beta(z)$ is not the null function, there exists $d > 0$ such that $\beta(d) \neq 0$. Lemma 13 with $k = 0$ implies that for $n \geq 3$ there exists a sequence $\{B_i\} \subset \mathrm{PD}(n)$ satisfying $B_i s = y$ and $\det B_i = d$ for all $i$, together with $\lim_{i\to\infty} \|B_i\|_F = \infty$. Hence
\[
\lim_{i\to\infty} \|\Delta[B_i; s, s, y, \bar{y}]\|_F = \infty
\]
holds, and then we obtain
\[
\sup\{\, \|\Delta[B_k; s, s, y, \bar{y}]\|_F \mid B_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]

On the other hand, if $\beta(z) = 0$ for all $z > 0$, we obtain
\[
\max_{B_k, \bar{y}} \|\Delta[B_k; s, s, y, \bar{y}]\|_F
= \max_{\bar{y} \in Y} \Big\|\frac{\bar{y}y^\top + y\bar{y}^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,yy^\top\Big\|_F < \infty,
\]
since $Y$ is bounded. As a result, a potential $V(z)$ whose associated $\beta(z)$ vanishes identically minimizes the gross error sensitivity, and the condition $\beta = 0$ determines $V(z) = -\log(z)$ up to a constant factor.
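For the standard BFGS case $\beta = 0$, the influence function (C.10) reduces to its last two terms, which can be compared with a finite difference of the BFGS update under the step-length perturbation $s(\varepsilon) = (1+\varepsilon)s$, $y(\varepsilon) = y + \varepsilon\bar{y}$. The following sketch (ours; names and data are illustrative) performs this check.

import numpy as np

def bfgs(B, s, y):
    """Standard BFGS update of the Hessian approximation B."""
    return B - np.outer(B @ s, B @ s) / (s @ B @ s) + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
Bk = A @ A.T + n * np.eye(n)
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if s @ y < 0:
    y = -y
ybar = rng.standard_normal(n)

eps = 1e-7
# Finite-difference influence of the step-length perturbation.
infl_fd = (bfgs(Bk, (1 + eps) * s, y + eps * ybar) - bfgs(Bk, s, y)) / eps
# Closed form: the last two terms of (C.10), the only ones for beta = 0.
infl_cf = (np.outer(ybar, y) + np.outer(y, ybar)) / (s @ y) \
          - ((y + ybar) @ s) / (s @ y) ** 2 * np.outer(y, y)
print(np.max(np.abs(infl_fd - infl_cf)))   # agrees up to O(eps)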

Appendix C.2. Proof of Theorem 6

Let $B(\varepsilon)$ be the optimal solution of (19). For a perturbation of the step length, the influence function $\dot{B}(0)$ for $V$-DFP-B is equal to $\Gamma[B_k; s, s, y, \bar{y}]$, which is defined in Lemma 12.

First, we study the case that $\beta(z)$ is not the null function. For a matrix $B_k$ such that $B_k s = y$, we have $B(0) = B_k$. Using Lemma 12 with $B(0) = B_k$, we have

\[
\dot{B}(0) = -B_k\,\Delta[B_k^{-1}; y, \bar{y}, s, s]\,B_k
= \frac{(\bar{y} - y)^\top s}{s^\top y}\cdot
\frac{\beta(\det(B_k)^{-1})}{1 - (n-1)\beta(\det(B_k)^{-1})}
\Big[B_k - \frac{yy^\top}{s^\top y}\Big]
+ \frac{\bar{y}y^\top + y\bar{y}^\top}{s^\top y}
- \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,yy^\top,
\]

in which the equality $B_k s = y$ is used. The above expression is almost the same as (C.10) with $B(0) = B_k$, and thus the same proof works to obtain

\[
\sup\{\, \|\dot{B}(0)\|_F \mid B_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]

Next, we study the case that $\beta$ is the null function, that is, $\beta(z) = 0$. Then $V(z) = -\log(z)$ and $\nu(z) = 1$ hold. Let $B_k$ be a positive definite matrix which does not necessarily satisfy $B_k s = y$. Then we obtain

\[
\dot{B}(0) = -B(0)\,\Delta[B_k^{-1}; y, \bar{y}, s, s]\,B(0)
= -\frac{(y - \bar{y})^\top s}{(s^\top y)^2}\,yy^\top
+ \frac{B(0)B_k^{-1}(\bar{y}y^\top + y\bar{y}^\top)B_k^{-1}B(0)}{y^\top B_k^{-1} y}
- \frac{2\,\bar{y}^\top B_k^{-1} y}{(y^\top B_k^{-1} y)^2}\,B(0)B_k^{-1} y y^\top B_k^{-1} B(0),
\]


in which we used $B(0)s = y$. For $\beta = 0$, the updated matrix $B(0)$ is equal to $B_{\mathrm{DFP}}[B_k; s, y]$, and thus we have

\[
B(0)B_k^{-1} = I - \frac{B_k s y^\top B_k^{-1} + y s^\top}{s^\top y}
+ \frac{s^\top B_k s}{(s^\top y)^2}\,y y^\top B_k^{-1}
+ \frac{1}{s^\top y}\,y y^\top B_k^{-1}. \tag{C.11}
\]
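The identity (C.11) is a finite algebraic rearrangement of the DFP update and can be verified directly; the sketch below (ours, illustrative only) does so for random data.

import numpy as np

rng = np.random.default_rng(5)
n = 5
A = rng.standard_normal((n, n))
Bk = A @ A.T + n * np.eye(n)
s, y = rng.standard_normal((2, n))
if s @ y < 0:
    y = -y

# B(0) = B_DFP[B_k; s, y], the standard DFP update.
W = np.eye(n) - np.outer(y, s) / (s @ y)
B0 = W @ Bk @ W.T + np.outer(y, y) / (s @ y)
Bki = np.linalg.inv(Bk)
rhs = (np.eye(n)
       - (Bk @ np.outer(s, y) @ Bki + np.outer(y, s)) / (s @ y)
       + (s @ Bk @ s) / (s @ y) ** 2 * np.outer(y, y) @ Bki
       + np.outer(y, y) @ Bki / (s @ y))
print(np.allclose(B0 @ Bki, rhs))       # True: matches (C.11)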

Let $B \in \mathrm{PD}(n)$, let $c$ be a positive real number, and define $t = Bs$. Then for $B_k = cB$, some calculation yields

\[
\dot{B}(0) = -B(0)\,\Delta[(cB)^{-1}; y, \bar{y}, s, s]\,B(0)
= -\frac{c}{s^\top y}\,Z - \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,yy^\top
+ \frac{\bar{y}y^\top + y\bar{y}^\top}{s^\top y},
\]

where Z is defined by

\[
Z = \Big(t - \frac{s^\top t}{s^\top y}\,y\Big)\Big(\bar{y} - \frac{s^\top \bar{y}}{s^\top y}\,y\Big)^\top
+ \Big(\bar{y} - \frac{s^\top \bar{y}}{s^\top y}\,y\Big)\Big(t - \frac{s^\top t}{s^\top y}\,y\Big)^\top.
\]

Since $Y$ contains an open subset, there exists a vector $\bar{y} \in Y$ which is linearly independent of $y$. Clearly there exists $B \in \mathrm{PD}(n)$ such that the three vectors $t = Bs$, $y$ and $\bar{y}$ are linearly independent. For such a choice, $Z$ is not the null matrix, and the equality

\[
\lim_{c\to\infty} \|B(0)\,\Delta[(cB)^{-1}; y, \bar{y}, s, s]\,B(0)\|_F = \infty
\]

holds. As a result, even for the standard DFP formula, we have

\[
\sup\{\, \|\dot{B}(0)\|_F \mid B \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]

In summary, for every $V$-DFP update of the Hessian approximation, the gross error sensitivity defined in Theorem 6 is infinite.
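The divergence established above can also be observed numerically: with $B_k = cB$, a finite-difference estimate of the influence of a step-length perturbation grows roughly linearly in $c$. The sketch below is ours; the perturbation model $s(\varepsilon) = (1+\varepsilon)s$, $y(\varepsilon) = y + \varepsilon\bar{y}$ is an assumption matching the analysis above.

import numpy as np

def dfp(B, s, y):
    """Standard DFP update of the Hessian approximation B."""
    n = len(s)
    W = np.eye(n) - np.outer(y, s) / (s @ y)
    return W @ B @ W.T + np.outer(y, y) / (s @ y)

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
B = A @ A.T + n * np.eye(n)
s, y, ybar = rng.standard_normal((3, n))
if s @ y < 0:
    y = -y

eps = 1e-7
for c in [1.0, 10.0, 100.0, 1000.0]:
    Bk = c * B
    infl = (dfp(Bk, (1 + eps) * s, y + eps * ybar) - dfp(Bk, s, y)) / eps
    print(c, np.linalg.norm(infl, "fro"))   # grows roughly linearly in c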

Appendix C.3. Proof of Theorem 7

Let $H(\varepsilon)$ be the optimal solution of (20). For a perturbation of the step length, the influence function $\dot{H}(0)$ for $V$-BFGS-H is equal to $\Gamma[H_k; y, \bar{y}, s, s]$, which is defined in Lemma 12.

First, we study the case that $\beta(z)$ is not the null function. Suppose $\beta(d) \neq 0$. If $H_k$ satisfies $H_k y = s$, then we have $H_k = H(0)$. Using Lemma 11 and Lemma 12 for the matrix $H_k$ such that $H_k y = s$, we obtain

\[
\dot{H}(0) = -H_k\,\Delta[H_k^{-1}; s, s, y, \bar{y}]\,H_k
= \frac{(\bar{y} - y)^\top s}{s^\top y}\,
\frac{\beta(\det(H_k)^{-1})}{1 - (n-1)\beta(\det(H_k)^{-1})}
\Big[H_k - \frac{ss^\top}{s^\top y}\Big]
- \frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y}
+ \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top. \tag{C.12}
\]


Lemma 13 with $k = 1$ (applied with the roles of $s$ and $y$ interchanged) implies that for $n \geq 4$ there exists a sequence $\{H_i\} \subset \mathrm{PD}(n)$ satisfying the following conditions: $H_i y = s$ and $(\det H_i)^{-1} = d$ for all $i \geq 1$; $H_i \bar{y} = H_j \bar{y}$ for all $i, j \geq 1$; and $\lim_{i\to\infty} \|H_i\|_F = \infty$. We define $t = H_i \bar{y}$, which does not depend on $i$. Then for $H_k = H_i$ we have

\[
\|\dot{H}(0)\|_F = \|H_i\,\Delta[H_i^{-1}; s, s, y, \bar{y}]\,H_i\|_F
\geq \left|\frac{(\bar{y} - y)^\top s}{s^\top y}\right|
\left|\frac{\beta(d)}{1 - (n-1)\beta(d)}\right|
\Big(\|H_i\|_F - \Big\|\frac{ss^\top}{s^\top y}\Big\|_F\Big)
- \Big\|\frac{ts^\top + st^\top}{s^\top y} - \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top\Big\|_F.
\]

Hence the equality
\[
\lim_{i\to\infty} \|H_i\,\Delta[H_i^{-1}; s, s, y, \bar{y}]\,H_i\|_F = \infty
\]
holds, and thus we obtain
\[
\sup\{\, \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]

Next, we study the case that $\beta$ is the null function, that is, $\beta(z) = 0$. Then $V(z) = -\log(z)$ and $\nu(z) = 1$ hold. For $H_k$ such that $H_k y = s$, we have

\[
\dot{H}(0) = -\frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y}
+ \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top. \tag{C.13}
\]

Let $H_0 \in \mathrm{PD}(n)$ be a matrix satisfying $H_0 y = s$. Let $p_1 \in \mathbb{R}^n$ and $\bar{y} \in Y$ be vectors satisfying $p_1^\top H_0^{1/2} y = 0$ and $p_1^\top H_0^{1/2} \bar{y} \neq 0$. For $n \geq 4$, the existence of $p_1$ and $\bar{y}$ is guaranteed by the assumption on $Y$: indeed, there exists $\bar{y} \in Y$ such that $y$ and $\bar{y}$ are linearly independent. We now define the matrix $H_i \in \mathrm{PD}(n)$ by
\[
H_i = H_0^{1/2}\big(I + i\,p_1 p_1^\top\big)H_0^{1/2}, \quad i = 0, 1, 2, \ldots
\]

Then we have
\[
H_i y = s, \qquad H_i \bar{y} = z + i\,u,
\]
where $z = H_0 \bar{y}$ and $u = (p_1^\top H_0^{1/2} \bar{y})\,H_0^{1/2} p_1 \neq 0$. Substituting $H_k = H_i$ into (C.13), we obtain

\[
\dot{H}(0) = -i\,\frac{us^\top + su^\top}{s^\top y}
+ \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top
- \frac{zs^\top + sz^\top}{s^\top y}.
\]


This implies that
\[
\lim_{i\to\infty} \|H_i\,\Delta[H_i^{-1}; s, s, y, \bar{y}]\,H_i\|_F = \infty
\]
for $\beta = 0$. Hence we obtain

\[
\sup\{\, \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty
\]
even for the standard BFGS update of the inverse Hessian approximation.
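The construction $H_i = H_0^{1/2}(I + i\,p_1 p_1^\top)H_0^{1/2}$ can likewise be traced numerically: the finite-difference influence of the standard BFGS inverse-Hessian update grows roughly linearly in $i$. The sketch below is ours and follows the proof's construction; all names and the perturbation model are illustrative assumptions.

import numpy as np

def bfgs_inv(H, s, y):
    """Standard BFGS update of the inverse Hessian approximation H."""
    n = len(s)
    W = np.eye(n) - np.outer(s, y) / (s @ y)
    return W @ H @ W.T + np.outer(s, s) / (s @ y)

rng = np.random.default_rng(4)
n = 6
s, y, ybar = rng.standard_normal((3, n))
if s @ y < 0:
    y = -y

# H0 in PD(n) with H0 y = s (BFGS-type update of the identity).
H0 = np.eye(n) - np.outer(y, y) / (y @ y) + np.outer(s, s) / (s @ y)
w, U = np.linalg.eigh(H0)
H0h = U @ np.diag(np.sqrt(w)) @ U.T     # symmetric square root H0^{1/2}

# p1 orthogonal to H0^{1/2} y (generically not orthogonal to H0^{1/2} ybar).
Q, _ = np.linalg.qr(np.column_stack([H0h @ y, rng.standard_normal(n)]))
p1 = Q[:, 1]

eps = 1e-7
for i in [1, 10, 100, 1000]:
    Hi = H0h @ (np.eye(n) + i * np.outer(p1, p1)) @ H0h
    infl = (bfgs_inv(Hi, (1 + eps) * s, y + eps * ybar)
            - bfgs_inv(Hi, s, y)) / eps
    print(i, np.linalg.norm(infl, "fro"))   # grows roughly linearly in i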

Appendix C.4. Proof of Theorem 8

Let $H(\varepsilon)$ be the optimal solution of (21). For a perturbation of the step length, the influence function $\dot{H}(0)$ for $V$-DFP-H is equal to $\Delta[H_k; y, \bar{y}, s, s]$, which is defined in Lemma 11.

First, we study the case that $\beta(z)$ is not the null function. Suppose $\beta(d) \neq 0$ for some $d > 0$. If $H_k$ satisfies $H_k y = s$, we have $H_k = H(0)$. Using Lemma 11 for the matrix $H_k$ such that $H_k y = s$, we obtain

\[
\dot{H}(0) = \Delta[H_k; y, \bar{y}, s, s]
= \frac{(\bar{y} - y)^\top s}{s^\top y}\,
\frac{\beta(\det H_k)}{1 - (n-1)\beta(\det H_k)}
\Big[H_k - \frac{ss^\top}{s^\top y}\Big]
- \frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y}
+ \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top.
\]

The above expression is almost the same as (C.12), and thus the same proof remains valid to obtain

\[
\sup\{\, \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]

Next, we consider the case that $\beta$ is the null function. Then $V(z) = -\log(z)$ and $\nu(z) = 1$ hold. For $H_k$ such that $H_k y = s$, we have

\[
\dot{H}(0) = \Delta[H_k; y, \bar{y}, s, s]
= -\frac{H_k \bar{y} s^\top + s \bar{y}^\top H_k}{s^\top y}
+ \frac{(y + \bar{y})^\top s}{(s^\top y)^2}\,ss^\top.
\]

This is the same as the influence function (C.13), and thus we obtain

\[
\sup\{\, \|\dot{H}(0)\|_F \mid H_k \in \mathrm{PD}(n),\ \bar{y} \in Y \,\} = \infty.
\]



Table 2: Approximate influence functions for the V-BFGS and V-DFP updates. The power potential V(z) = (1 − z^γ)/γ is used for the V-extended quasi-Newton methods; γ = 0 corresponds to the BFGS or DFP method.

V-BFGS-B
B_k        diag(1,...,n)/(n!)^{1/n}      diag(1,...,n)                 I + n^3 pp^T
γ          -2       -1       0           -2       -1       0           -2       -1       0
n = 10     9.5e+00  9.5e+00  9.5e+00     1.5e+01  9.7e+00  9.5e+00     2.0e+02  1.0e+02  5.0e+01
n = 100    2.7e+01  2.7e+01  2.7e+01     2.3e+02  2.8e+01  2.7e+01     1.1e+04  1.0e+04  8.7e+03
n = 500    9.3e+01  9.3e+01  9.3e+01     2.8e+03  9.6e+01  9.3e+01     2.6e+05  2.5e+05  2.4e+05
n = 1000   1.0e+02  1.0e+02  1.0e+02     7.4e+03  1.1e+02  1.0e+02     1.0e+06  9.9e+05  9.7e+05

V-DFP-B
B_k        diag(1,...,n)/(n!)^{1/n}      diag(1,...,n)                 I + n^3 pp^T
γ          -2       -1       0           -2       -1       0           -2       -1       0
n = 10     1.3e+02  1.3e+02  1.3e+02     2.9e+03  6.5e+02  1.5e+02     2.0e+02  1.0e+02  5.0e+01
n = 100    1.7e+03  1.7e+03  1.7e+03     2.5e+06  6.5e+04  1.7e+03     1.1e+04  1.0e+04  8.7e+03
n = 500    4.6e+04  4.6e+04  4.6e+04     1.6e+09  8.7e+06  4.7e+04     2.6e+05  2.5e+05  2.4e+05
n = 1000   3.0e+04  3.0e+04  3.0e+04     4.1e+09  1.1e+07  3.0e+04     1.0e+06  9.9e+05  9.7e+05

V-BFGS-H
H_k        diag(1,...,n)/(n!)^{1/n}      diag(1,...,n)                 I + n^3 pp^T
γ          -2       -1       0           -2       -1       0           -2       -1       0
n = 10     2.1e+02  2.1e+02  2.1e+02     4.8e+03  1.1e+03  2.4e+02     2.2e+02  1.1e+02  5.6e+01
n = 100    1.1e+03  1.1e+03  1.1e+03     1.6e+06  4.1e+04  1.1e+03     2.0e+04  1.7e+04  1.5e+04
n = 500    8.2e+04  8.2e+04  8.2e+04     2.8e+09  1.5e+07  8.3e+04     8.7e+05  8.4e+05  8.1e+05
n = 1000   2.6e+04  2.6e+04  2.6e+04     3.6e+09  9.8e+06  2.7e+04     4.7e+06  4.6e+06  4.5e+06

V-DFP-H
H_k        diag(1,...,n)/(n!)^{1/n}      diag(1,...,n)                 I + n^3 pp^T
γ          -2       -1       0           -2       -1       0           -2       -1       0
n = 10     1.0e+01  1.0e+01  1.0e+01     1.7e+01  1.1e+01  1.0e+01     2.5e+02  1.3e+02  6.4e+01
n = 100    2.1e+01  2.1e+01  2.1e+01     4.5e+02  2.5e+01  2.1e+01     4.1e+06  3.6e+06  3.1e+06
n = 500    9.9e+01  9.9e+01  9.9e+01     9.5e+03  1.2e+02  9.9e+01     1.4e+09  1.4e+09  1.3e+09
n = 1000   1.2e+02  1.2e+02  1.2e+02     3.6e+04  1.7e+02  1.2e+02     1.2e+10  1.2e+10  1.2e+10

Table 3: Number of iterations in BFGS and DFP with inexact computation. The number h corresponds to the noise level in the line search.

                    n = 100           n = 500            n = 1000
             h   BFGS    DFP       BFGS    DFP        BFGS    DFP
Problem 1  0.0    73.4    73.4     185.7   185.7      267.7   267.7
           0.1   301.8   300.9    1016.5  1014.2     1557.4  1545.4
           0.2   392.8   396.9    1417.6  1364.2     2178.1  2087.0
           0.3   487.8   484.3    1786.9  1751.5     2931.5  2809.3
Problem 2  0.0   100.4   110.6     434.6   577.8      682.1  1788.5
           0.1   102.9   166.2     430.6  1165.2      680.9  2628.9
           0.2   104.5   198.6     443.6  1361.8      685.1  3099.2
           0.3   106.0   223.0     444.2  1501.6      687.6  3365.9
Problem 3  0.0   100.9   111.6     428.5   585.7      661.5  2489.8
           0.1   102.8   153.5     443.5  1237.4      672.4  2762.1
           0.2   104.4   177.7     438.3  1419.6      682.7  3301.2
           0.3   106.1   199.4     454.0  1592.8      694.0  3730.8

Table 4: Number of iterations in BFGS and DFP with inexact computation for Problem 4 with n = 100, in which c is set to 0.5, 1, 5, 10 or 20. The number h corresponds to the noise level in the line search.

Hessian    0.5·A            A                5·A              10·A             20·A
  h     BFGS    DFP     BFGS    DFP     BFGS    DFP     BFGS    DFP     BFGS    DFP
 0.0    104.1   113.2   100.1   111.2   100.0   100.5   100.0   100.0   100.0   100.0
 0.1    109.8   193.6   102.5   153.6   112.2   116.0   125.3   126.8   155.1   156.5
 0.2    112.9   223.8   104.5   176.2   115.5   123.6   132.0   135.9   176.3   179.8
 0.3    116.3   255.7   106.5   197.1   117.4   130.6   140.2   146.2   195.7   200.6