
Mathematical Programming 38 (1987) 29-46, North-Holland

UPDATING CONJUGATE DIRECTIONS BY THE BFGS FORMULA

M.J.D. POWELL

Department of Applied Mathematics and Theoretical Physics, University of Cambridge, England

Received 10 October 1985 Revised manuscript received 7 December 1986

Many iterative algorithms for optimization calculations form positive definite second derivative approximations, B say, automatically, but B is not stored explicitly because of the need to solve equations of the form Bd = -g. We consider working with matrices Z whose columns satisfy the conjugacy conditions Z^T B Z = I. Particular attention is given to updating Z in a way that corresponds to revising B by the BFGS formula. A procedure is proposed that seems to be much more stable than the direct use of a product formula [1]. An extension to this procedure provides some automatic rescaling of the columns of Z, which avoids some inefficiencies due to a poor choice of the initial second derivative approximation. Our work is also relevant to active set methods for linear inequality constraints, to updating the Cholesky factorization of B, and to explaining some properties of the BFGS algorithm.

Key words: Nonlinear programming, conjugate directions, updating, variable metric algorithms.

1. Introduction

Many algorithms for unconstrained and constrained optimization calculations update a positive definite approximation to an n × n second derivative matrix by the BFGS formula

    B^* = B - \frac{B\delta\delta^T B}{\delta^T B\delta} + \frac{\gamma\gamma^T}{\delta^T\gamma},   (1.1)

where B and B* are the old and new second derivative approximations respectively, and where δ and γ are vectors in R^n that satisfy δ^Tγ > 0. However, in order to solve equations of the form Bd = -g in O(n²) operations, it is usual to work with the inverses or with the Cholesky factors of B and B*, instead of calculating the elements of B and B* explicitly. Therefore several methods that correspond to

equation (1.1) have been developed already for revising inverses or Cholesky factors directly; for example see [3] and [6].

The equation Bd = -g can also be solved in O(n²) operations if a square matrix Z is available that satisfies the condition

    Z Z^T = B^{-1}.   (1.2)

Dedicated to Martin Beale, whose achievements, advice and encouragement were of great value to my research, especially in the field of conjugate direction methods.
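For a reader who wants to see condition (1.2) at work, the following minimal sketch (Python with NumPy; all names are ours and merely illustrative, not from the paper) solves Bd = -g by two matrix-vector products, so in O(n²) operations, without ever forming B:

    import numpy as np

    def search_direction(Z, g):
        # Solve B d = -g using Z Z^T = B^{-1} (condition (1.2)):
        # d = -B^{-1} g = -Z (Z^T g), two O(n^2) matrix-vector products.
        return -Z @ (Z.T @ g)

    # Illustration on a random positive definite B.
    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))
    B = A @ A.T + n * np.eye(n)                  # positive definite
    Z = np.linalg.cholesky(np.linalg.inv(B))     # one matrix with Z Z^T = B^{-1}
    g = rng.standard_normal(n)
    assert np.allclose(B @ search_direction(Z, g), -g)

Any postmultiplication of Z by an orthogonal matrix gives another valid choice of Z, which is the freedom that the paper exploits below.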


Davidon [2] recommends this kind of factorization for unconstrained minimization, and it occurs in some algorithms for quadratic programming [8, 10, 11] because the preservation of condition (1.2) when Z is postmultiplied by any orthogonal matrix provides a highly convenient way of satisfying active constraints. Moreover, as equation (1.2) can be written in the form

    Z^T B Z = I,   (1.3)

the columns of Z, {z_i; i = 1, 2, ..., n} say, are normalized, mutually conjugate directions with respect to B. Han [9] explains that the availability of such directions may be important to optimization calculations on parallel computers, because it allows separate processors to carry out separate searches in mutually conjugate subspaces.

Therefore this paper addresses the question of updating Z by a method that is

analogous to the BFGS formula (1.1), which means that our task is as follows: given Z, δ and γ such that δ^Tγ > 0, we have to form a matrix Z* such that, if B = (ZZ^T)^{-1} is substituted in equation (1.1), then the resultant matrix is (Z*Z*^T)^{-1}. In other words the columns of Z* are to be normalized, mutually conjugate directions with respect to B*. In general there is no sparsity or symmetry in these directions, so the use of Z requires space for n² elements, which is about twice the storage requirements of most variable metric algorithms for unconstrained optimization.

The freedom to postmultiply Z by an orthogonal matrix is sometimes important to maintain good accuracy in the presence of computer rounding errors. For example, if B has (n-1) eigenvalues of magnitude one and one tiny eigenvalue whose eigenvector is e = (1 1 ... 1)^T, then B^{-1} is dominated by a large multiple of the n × n matrix whose elements are all one. Therefore, if just the elements of B^{-1} are stored, then rounding errors prevent the accurate determination of B from B^{-1}. However, if we employ a matrix Z that satisfies equation (1.2), then all the large elements of Z can be confined to a single column, which can make the determination of B from Z well-conditioned. We will describe an updating algorithm that can keep good accuracy when B is nearly singular, by allowing large differences in the magnitudes of the column norms {‖z_i‖; i = 1, 2, ..., n}. Severe loss of information would occur in this case if Z were replaced by ZΩ, where Ω is any orthogonal matrix that causes the largest column of Z to dominate every column of the product.

Brodlie, Gourlay and Greenstadt [1] show that equation (1.1) implies the formula

    (B^*)^{-1} = (I - \delta p^T)\, B^{-1}\, (I - p\delta^T),   (1.4)

where p is the vector

    p = \frac{B\delta}{\sqrt{(\delta^T\gamma)(\delta^T B\delta)}} + \frac{\gamma}{\delta^T\gamma}.   (1.5)

Therefore the updating formula

    Z^* = (I - \delta p^T) Z,   (1.6)


which is suggested by Han [9], provides a suitable matrix Z*, except that much cancellation may occur if this equation is applied directly, particularly if ‖Z*‖ ≪ ‖Z‖. One can calculate p without knowing B in the usual case when δ is a multiple of B^{-1}g and g is an available gradient. The main purpose of this paper is to present and discuss a new algorithm for computing Z*, that is based on an analytic simplification of equation (1.6) in the special case when δ is a multiple of the first column of Z.
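As a hedged illustration of the preceding remark, the sketch below (our names) applies the product formula (1.4)-(1.6) in the case δ = -αZZ^Tg, where Bδ = -αg is available without forming B:

    import numpy as np

    def bgg_update(Z, delta, gamma, g, alpha):
        # Product-form update (1.6), Z* = (I - delta p^T) Z, with p from (1.5).
        # Assumes delta = -alpha * Z Z^T g, so that B delta = -alpha * g.
        B_delta = -alpha * g
        dty = delta @ gamma                    # delta^T gamma > 0 is required
        dBd = delta @ B_delta                  # delta^T B delta
        p = B_delta / np.sqrt(dty * dBd) + gamma / dty
        return Z - np.outer(delta, p @ Z)

It is exactly this direct application that can lose accuracy through cancellation when ‖Z*‖ ≪ ‖Z‖, which motivates the algorithm of Section 2.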

The new algorithm is described in Section 2. It replaces Z by the matrix

    \bar{Z} = Z\Omega,   (1.7)

where Ω is a lower Hessenberg, orthogonal matrix, chosen so that δ will be a multiple of the first column of Z̄. Then the simplified form of equation (1.6) is applied to Z̄. The resultant matrices Z* are considered when this updating algorithm is applied on every iteration of the minimization of a quadratic function by the BFGS algorithm. We find that the columns of Z* have some useful conjugacy properties with respect to the true second derivative matrix.

Section 3 begins with some numerical results; our algorithm gives excellent accuracy when ‖Z*‖ ≤ ‖Z‖, even if ‖Z*‖ is much smaller than ‖Z‖, but the results are disappointing when ‖Z*‖ ≫ ‖Z‖. Therefore, because Z* becomes the Z of the next application of the updating formula, an extension to the algorithm is proposed that automatically scales up any columns of Z* that seem to be too small. Thus we abandon the condition Z*Z*^T = (B*)^{-1}, where B* is the matrix (1.1), but it is shown that we preserve the conjugacy relations that provide the quadratic termination properties of the BFGS formula. This technique is analogous to rescaling an initial second derivative approximation automatically (except that some rescaling can occur on each iteration), it provides much better accuracy than the algorithm of Section 2 when ‖Z*‖ ≫ ‖Z‖, and it preserves the good numerical results when ‖Z*‖ ≤ ‖Z‖. We even find that, if we let Z be singular initially, then in practice the rounding errors of a sequence of updating calculations remove the singularity very successfully. Because the difficulty when ‖Z*‖ ≫ ‖Z‖ is a property of the BFGS formula and not of our implementation, the new algorithm with column scaling may be highly useful.

Section 4 identifies the analogous updating algorithm when equation (1.1) is replaced by the DFP formula

    B^* = B - \frac{B\delta\gamma^T + \gamma\delta^T B}{\delta^T\gamma} + \left(1 + \frac{\delta^T B\delta}{\delta^T\gamma}\right)\frac{\gamma\gamma^T}{\delta^T\gamma}.   (1.8)

It is simpler than the BFGS procedure, and has the property that, if Z is lower triangular, then Z* is lower Hessenberg. It follows from equation (1.2) that, if one postmultiplies Z* by a sequence of Givens rotations in order to regain lower triangularity, then one has an algorithm that calculates the Cholesky factorization of (B*)^{-1} from the Cholesky factorization of B^{-1}. This procedure is not new, because


it is equivalent to a method of Goldfarb [7] for calculating the Cholesky factorization of B* from the Cholesky factorization of B when the BFGS formula is employed.

Finally, Section 5 discusses briefly the idea of working with matrices of conjugate directions in variable metric algorithms instead of with B^{-1} or with the Cholesky factorization of B. This discussion includes some comments on the stability and on the amount of work of several updating formulae.

2. The basic algorithm

Let the columns of Z, Z̄ and Z* be {z_i}, {z̄_i} and {z*_i} respectively (i = 1, 2, ..., n). If δ is a multiple of z_1, i.e. if δ ∥ z_1, then it follows from equation (1.6) that, for any p, we have z*_1 ∥ z_1. Therefore, using the conjugacy condition Z*^T B* Z* = I, we deduce the value

    z_1^* = \delta/\sqrt{\delta^T B^*\delta} = \delta/\sqrt{\delta^T\gamma},   (2.1)

where the last part is a consequence of equation (1.1). Moreover, the conditions Z^T B Z = I and δ ∥ z_1 imply {(Bδ)^T z_i = 0, i = 2, 3, ..., n}. Thus, when δ ∥ z_1, equations (2.1), (1.5) and (1.6) give the formula

    z_i^* = \begin{cases} \delta/\sqrt{\delta^T\gamma}, & i = 1, \\ z_i - [(\gamma^T z_i)/(\delta^T\gamma)]\,\delta, & i = 2, 3, \ldots, n, \end{cases}   (2.2)

for updating Z. It has two strong advantages. Firstly B is not present, and secondly, if the reduction ‖z*_1‖ ≪ ‖z_1‖ occurs, then it is achieved by a suitable scaling of δ instead of by cancellation.
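In matrix form the special case (2.2) is only a few lines of code; the following sketch (our notation) overwrites the first column after applying the second line of (2.2) to every column:

    import numpy as np

    def update_parallel_case(Z, delta, gamma):
        # Formula (2.2): valid only when delta is parallel to the first column of Z.
        dty = delta @ gamma
        Zstar = Z - np.outer(delta, (gamma @ Z) / dty)   # second line of (2.2)
        Zstar[:, 0] = delta / np.sqrt(dty)               # first line of (2.2)
        return Zstar

Note that B appears nowhere, and that the first column is obtained by scaling δ rather than by subtraction.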

In order to enjoy these advantages for general δ, i.e. when δ is not a multiple of z_1, we require the vector s such that δ = Zs. This vector is available in variable metric algorithms for unconstrained optimization, because δ has the form -αB^{-1}g = -αZZ^Tg, α being a step-length. It is also available when δ is composed as a linear combination of the columns of Z, which is the case in the parallel processing applications that are suggested by Han [9]. The orthogonal matrix Ω of equation (1.7) is chosen to satisfy the condition

    \Omega e_1 = \pm s/\|s\|,   (2.3)

where Ωe_1 is the first column of Ω, e_1 being the first co-ordinate vector. Instead of calculating Ω explicitly, we form the matrix

    \bar{Z} = Z\Omega,   (2.4)

because its first column is the vector

    \bar{z}_1 = \bar{Z}e_1 = Z\Omega e_1 = \pm Zs/\|s\| = \pm\delta/\|s\|   (2.5)


as required. Therefore, remembering the derivation of equation (2.2) and that Z̄Z̄^T = ZZ^T, we complete the calculation of Z* by applying the formula

    z_i^* = \begin{cases} \delta/\sqrt{\delta^T\gamma}, & i = 1, \\ \bar{z}_i - [(\gamma^T\bar{z}_i)/(\delta^T\gamma)]\,\delta, & i = 2, 3, \ldots, n. \end{cases}   (2.6)

There are several choices of Ω that satisfy Ωe_1 ∥ s and that allow Z̄ to be formed in O(n²) computer operations. We prefer a lower Hessenberg matrix of the form

    \Omega = \Omega_{n-1}\Omega_{n-2}\cdots\Omega_1,   (2.7)

where each Ω_t is a Givens rotation that may differ from the unit matrix only in its t-th and (t+1)-th columns. Condition (2.3) is equivalent to Ω^T s ∥ e_1. Therefore, if the last r components of s are zero, we set Ω_t = I for n-1 ≥ t ≥ n-r. Otherwise, for t = n-r-1, n-r-2, ..., 1, Ω_t is defined by making the (t+1)-th and t-th components of the vector (Ω_t^T Ω_{t+1}^T ⋯ Ω_{n-1}^T s) zero and positive respectively. The description of the basic updating algorithm is now complete, except that an efficient implementation of equation (2.4) is given in Section 5.
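A complete sketch of the basic algorithm follows (Python with NumPy; all names are ours). The loop applies the Givens rotations Ω_{n-1}, ..., Ω_1 of (2.7) to the columns of Z while reducing s to a multiple of e_1, and then formula (2.6) is applied; the final assertion checks the conjugacy condition Z*^T B* Z* = I on random data:

    import numpy as np

    def bfgs_update_Z(Z, s, delta, gamma):
        # Section 2 update: requires delta = Z s (s not identically zero)
        # and delta^T gamma > 0.
        n = Z.shape[0]
        Zb = np.array(Z, dtype=float)
        v = np.array(s, dtype=float)
        for t in range(n - 2, -1, -1):            # t = n-1, ..., 1 in 1-based indexing
            if v[t + 1] == 0.0:
                continue                          # Omega_t = I
            r = np.hypot(v[t], v[t + 1])
            c, w = v[t] / r, v[t + 1] / r
            v[t], v[t + 1] = r, 0.0               # Omega_t^T zeros component t+1 of s
            zt = Zb[:, t].copy()
            Zb[:, t] = c * zt + w * Zb[:, t + 1]  # postmultiplication of Z by Omega_t
            Zb[:, t + 1] = -w * zt + c * Zb[:, t + 1]
        dty = delta @ gamma
        Zstar = Zb - np.outer(delta, (gamma @ Zb) / dty)  # second line of (2.6)
        Zstar[:, 0] = delta / np.sqrt(dty)                # first line of (2.6)
        return Zstar

    # Check: the columns of Z* are conjugate with respect to the BFGS matrix B*.
    rng = np.random.default_rng(1)
    n = 4
    A = rng.standard_normal((n, n))
    B = A @ A.T + n * np.eye(n)
    Z = np.linalg.cholesky(np.linalg.inv(B))      # Z Z^T = B^{-1}
    s = rng.standard_normal(n)
    delta = Z @ s
    gamma = rng.standard_normal(n)
    if delta @ gamma <= 0.0:
        gamma = -gamma                            # enforce delta^T gamma > 0
    Bd = B @ delta
    Bstar = B - np.outer(Bd, Bd) / (delta @ Bd) + np.outer(gamma, gamma) / (delta @ gamma)
    Zstar = bfgs_update_Z(Z, s, delta, gamma)
    assert np.allclose(Zstar.T @ Bstar @ Zstar, np.eye(n), atol=1e-8)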

We find below that in exact arithmetic with a quadratic objective function, the first q components of s are zero on the (q+1)-th iteration. In this case Ω_t is a rotation through π/2 for q ≥ t ≥ 1. It follows from the construction of the matrix (2.7) that we have the equations

    \Omega e_{t+1} = \pm e_t, \qquad t = 1, 2, \ldots, q.   (2.8)

Further, the remaining elements in the first q rows and in the second to (q+1)-th columns of Ω are zero, because all rows and columns of orthogonal matrices have length one.

We now consider the minimization of a strictly convex quadratic function by the BFGS algorithm with exact line searches, when each iteration works with a matrix of conjugate directions of the current second derivative approximation, that is nonsingular initially and that is updated in the way that has been described. In this analysis we assume that all arithmetic is exact, and "conjugate" means conjugate with respect to the true second derivative matrix of the objective function, G say. As usual each δ is the change in the variables made by the current iteration, each γ has the value

    \gamma = G\delta,   (2.9)

and it is well-known [3, eq. (2.5.2)] that the gradient of the objective function at the beginning of an iteration, g say, is orthogonal to all the vectors δ that occurred on previous iterations.

Equation (2.6) shows that, at the end of the first iteration, the first column of the updated Z is a multiple of the first search direction that by (2.9) satisfies z_1^{*T} G z_1^* = 1, and the remaining columns are orthogonal to γ, which means by (2.9) that they are conjugate to δ. Our main result is that, after k iterations, the first k columns of the updated Z are multiples of the first k search directions, the i-th normalized search direction being ±z*_{k+1-i} (i = 1, 2, ..., k) with the normalization (z*_{k+1-i})^T G z*_{k+1-i} = 1, while the remaining columns satisfy the conjugacy conditions

    z_i^{*T} G z_j^* = 0, \qquad 1 \le i \le k < j \le n.   (2.10)

We prove these results by induction, knowing they are true when k = 1, and assuming that they hold for k. Therefore we consider the (k+1)-th iteration of the calculation. If the current gradient g is zero, then termination occurs. Otherwise the search direction -ZZ^Tg is calculated, so the vector s in equation (2.3) is a multiple of Z^Tg. It follows from the remark after equation (2.9) and the inductive hypothesis that s_1 = s_2 = ⋯ = s_k = 0. Therefore, in view of the structure of Ω, including condition (2.8), expression (2.4) gives the vectors

    \bar{z}_{i+1} = \pm z_i, \qquad i = 1, 2, \ldots, k;   (2.11)

they are conjugate to δ, because they are multiples of the first k search directions [3, eq. (3.4.11)]. Hence by (2.9) the terms {γ^T z̄_i; i = 2, 3, ..., k+1} are all zero in formula (2.6), so we have {z*_{i+1} = z̄_{i+1} = ±z_i; i = 1, 2, ..., k}. Thus, in view of the first line of formula (2.6), the first (k+1) columns of Z* contain the first (k+1) normalized search directions as required.

normalized search directions as required. In fact the inductive hypothesis (2.10) shows that, not only 6, but every vector

in the linear space spanned by {:i; j = k + l , k + 2 , . . . , n} is conjugate to {z~; i = 1, 2 , . . . , k} (we have removed the stars from condit ion (2.10) because the Z* of the kth iteration is the current Z) . Therefore, because the lower Hessenberg structure

of ~Q causes all the vectors {~j; j = k + 2, k + 3 , . . . , n} to belong to this linear space,

we have ~ - ~< k + 1 {~i G~,_~=0; 2 i<~ <.j n}. Hence, recalling {z~=• ~; 2<~i<~

k + 1} from the previous paragraph, equat ion (2.6) implies the identity

z':~T G: , , = • {~ / - [ ( ~:~j) l ( a '~,) ]~} ~ Gz,,_, (2.12)

=0 , 2<~i<~k+l<j<~n,

where the last term of the first line is zero because the current search direction

is conjugate to the previous search directions {z~_~; 2 ~ < i ~ < k + l}. Now formula (2.6)

not only gives condit ion (2.12), but also the multiplier o f 3 in the second line of

(2.6) is chosen to provide {z* r y = 0 ; j = 2, 3 , . . . , n}, which is the conjugacy condit ion

{z.*rGz * = 0 ; j =2 , 3 , . . . , n}. Therefore the p roof by induct ion is complete.

This theory may explain a useful property of the updating method, that was found experimentally. Our version of the BFGS method was used to calculate the least value of the quadratic function {½x^TGx; x ∈ R⁴}, starting from x = e_1, where G and the initial Z are the matrices

    G = \begin{pmatrix} 0.1 & 0.1 & 0.1 & 0.1 \\ 0.1 & 0.2 & 0.1 & 0.1 \\ 0.1 & 0.1 & 0.3 & 0.1 \\ 0.1 & 0.1 & 0.1 & 0.4 \end{pmatrix}, \qquad Z_0 = \begin{pmatrix} 0 & 1 & 4 & 9 \\ 1 & 0 & 1 & 4 \\ 4 & 1 & 0 & 1 \\ 9 & 4 & 1 & 0 \end{pmatrix}.   (2.13)


This case is interesting because Z is singular (since the functions {φ_j(t) = (t-j)²: t ∈ R, j = 1, 2, 3, 4} span a 3-dimensional space, and so are linearly dependent on {1, 2, 3, 4} ⊂ R). In exact arithmetic, the first 3 iterations of the BFGS method would generate conjugate search directions that span the column space of Z, and at the end of the third iteration equation (2.10) would hold for k = 3 and j = 4. Therefore, because z*_4 would also be in the column space of Z, it would be identically zero. However, the calculation was carried out by a Radio Shack TRS 80 micro-computer to about 7 decimals accuracy, and 3 iterations gave the matrix

    Z^* = \begin{pmatrix} -3.2813 & 2.4616 & -0.8923 & 3.1\times 10^{-7} \\ 1.0529 & 0.9130 & -0.4347 & -1.9\times 10^{-7} \\ 2.0387 & -0.4035 & -0.4347 & 8.2\times 10^{-8} \\ -0.3239 & -1.4878 & -0.8923 & 2.2\times 10^{-8} \end{pmatrix}.   (2.14)

As expected, the first 3 columns of Z* are the normalized third, second and first search directions, and the last column of Z* is small. However, the direction of this last column is remarkable, because it is fairly close to the direction that is conjugate to the first 3 columns, which is any non-zero multiple of (14 -12 3 -2)^T. Thus the rounding errors that have removed the singularity in Z have allowed equation (2.10) to be satisfied roughly when k = 3. We make good use of this property in the next section.

3. An automatic rescaling technique

Several numerical experiments are reported in this section. In all of them we minimize a convex objective function of the form

    F(x) = \tfrac{1}{2} x^T G x, \qquad x \in R^4,   (3.1)

using a version of the BFGS algorithm with exact line searches that works with matrices of conjugate directions. These calculations were performed on the TRS 80 computer to a relative accuracy of about 10⁻⁷. In every case the norm of the initial vector of variables has magnitude one, and the iterations are terminated as soon as variables are found that satisfy ‖x‖ ≤ 10⁻⁶. There are always at least four iterations, so exact arithmetic would make the number

    \eta = \max_{i,j} |(Z^{*T} G Z^* - I)_{ij}|   (3.2)

have the value zero [3, pp. 42-44], where Z* is the final matrix of conjugate directions. The error in Z* is indicated by computing and printing the value of η at the end of each experiment. Each vector γ is calculated directly from equation (2.9), instead of being the difference between two gradients, because the main purpose of our experiments is to find out whether our updating algorithm loses accuracy when good information about changes in first derivatives is available.
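For definiteness, the accuracy measure (3.2) can be computed as in the following fragment (our names, merely illustrative):

    import numpy as np

    def conjugacy_error(Zstar, G):
        # eta of expression (3.2): largest element of |Z*^T G Z* - I|.
        n = Zstar.shape[0]
        return np.abs(Zstar.T @ G @ Zstar - np.eye(n)).max()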

Because one of the aims when developing the algorithm of Section 2 was to obtain good precision when much cancellation would occur in the product updating formula (1.6), we begin by reporting a calculation of this kind. Specifically, G and the initial Z are the matrices

    [matrices of display (3.3): a fixed 4×4 positive definite G, proportional to the parameter θ so that ‖G^{-1/2}‖ ≈ θ^{-1/2}, and a fixed nonsingular Z_0 whose elements have magnitude one]   (3.3)

where θ = 10^{20} and the initial vector of variables is x = (1 1 1 1)^T. After 4 iterations we find ‖x‖ = 3×10⁻⁷ and η = 4.2×10⁻⁶. Good accuracy is achieved because the norms of the columns of Z* are made very small only by the first part of formula (2.6). For example, after 2 iterations we have the matrix

    Z^* = \begin{pmatrix} -0.0082\times 10^{-10} & 0.2805\times 10^{-10} & -0.5525 & -0.2382 \\ 0.5131\times 10^{-10} & 0.2457\times 10^{-10} & 0.1855 & -0.3125 \\ 0.0684\times 10^{-10} & 0.2043\times 10^{-10} & 0.1435 & 0.5072 \\ -0.4901\times 10^{-10} & 0.1549\times 10^{-10} & 0.0694 & -0.0682 \end{pmatrix}.   (3.4)

Similar accuracy was observed for all trial values of θ ≥ 1.

Therefore it was surprising to find that for θ = 0.1 and θ = 0.01 the values of ‖x‖ after 4 iterations were 4.3×10⁻³ and 0.22 respectively. To explain this loss of accuracy, we suppose that θ ≪ 1, and that a numerical computation preserves the magnitudes of all numbers that would occur in exact arithmetic. We give particular attention to the contribution from the errors in x to the new vector of variables x + αd that is calculated by the k-th iteration when k = 2, 3 or 4. If ε is the magnitude of the error in x, then we find errors in g = Gx and d = -ZZ^Tg of magnitudes θε and ε respectively, because the first (k-1) columns of Z have magnitude θ^{-1/2}. Hence the induced error in x + αd is approximately αε. Now αd is independent of θ in exact arithmetic, and, remembering that the first (k-1) columns of Z should be orthogonal to g, we have ‖d‖ ≈ ‖g‖ ≈ θ, giving α ≈ 1/θ. Thus an error of ε in x can induce an error in x + αd of magnitude ε/θ. In most cases, in particular when the error in x at the beginning of the second iteration is a multiple of the first search direction, it can be shown that this growth occurs for k = 2, 3 and 4, if θ³ is substantially greater than the relative precision of the computer arithmetic. Thus, after 4 iterations, some errors can be magnified by about θ⁻³, which accounts for the loss of accuracy that is observed.

This accumulation of errors arises from the computed change in variables of the first iteration. Therefore it is independent of the way in which the BFGS updating formula is applied, but our use of the B^{-1} = ZZ^T factorization helps to explain the loss of accuracy. Table 1 gives the number of iterations of our algorithm to achieve ‖x‖ ≤ 10⁻⁶ for a range of small values of θ.

Table 1
The calculation (3.3)

θ        Iterations    η
1        4             1.7×10⁻⁷
0.1      5             3.6×10⁻⁷
0.01     6             2.6×10⁻⁷
10⁻⁴     7             5.2×10⁻⁷
10⁻⁶     7             1.1×10⁻⁷
10⁻⁸     10            4.8×10⁻⁷

We see that some efficiency is lost if initially ‖Z‖ is less than ‖G^{-1/2}‖ ≈ θ^{-1/2}, but that the final accuracy of Z*^T G Z* ≈ I is unimpaired. Results of such experiments for other versions of the BFGS algorithm are reported in Section 5.

Our results so far suggest that one can save some work by choosing ‖Z_0‖ to be larger than instead of smaller than ‖G^{-1/2}‖. Further, if the matrix (2.14) occurred, then it would be helpful to magnify its last column automatically. Therefore we recommend the following "rescaling" extension to the updating algorithm of Section 2. Set σ = ‖z*_1‖ on the first iteration, and on each subsequent iteration replace σ by min[σ, ‖z*_1‖], where z*_1 is given in the first part of equation (2.6). Then, for i = 2, 3, ..., n, after using formula (2.6) to calculate z*_i, rescale z*_i if necessary so that its Euclidean length is at least σ. Of course, if z*_i has to be changed, then we multiply it by σ/‖z*_i‖.
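In code, the extension amounts to a few lines appended to the update of Section 2; the sketch below (our names; σ must persist between iterations) scales up the columns that have become shorter than σ:

    import numpy as np

    def rescale(Zstar, sigma):
        # Section 3 extension. sigma is the least length so far of the normalized
        # search directions; it is updated from the first column of (2.6).
        sigma = min(sigma, np.linalg.norm(Zstar[:, 0]))
        for i in range(1, Zstar.shape[1]):          # columns i = 2, 3, ..., n
            length = np.linalg.norm(Zstar[:, i])
            if length < sigma:
                Zstar[:, i] *= sigma / length       # restore length sigma
        return Zstar, sigma

On the first iteration the caller would initialize σ from the first column, which is achieved here by passing sigma = np.inf to the first call.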

all the conjugacy properties that are proved in Section 2 for the quadratic case with exact line searches. Thus we still have quadratic termination. Further, 11o normalized

search directions are rescaled, because o- is always the least Euclidean length of all the normalized search directions that have been calculated.

The extended algorithm was applied to the test problems of Table 1. It was found that only 5 iterations are needed for all θ < 1, and there now seems to be no deterioration in efficiency as θ is decreased. Further, the values of η remain small.

The case when Z is singular is interesting and deserves attention for two reasons. Firstly, if singularity occurred, and if all subsequent arithmetic were exact, then usually it would be impossible to complete an unconstrained minimization calculation successfully, because the variable metric search direction d = -B^{-1}g = -ZZ^Tg and the columns of Z* would all be in the column space of Z (which follows from equations (2.4) and (2.6)), so all changes to the vectors of variables would be confined to the column space of the initial singular Z; this remark applies to all objective functions and to all techniques for choosing step-lengths. Secondly, a technique that uses rounding errors successfully to correct singularity is likely to be useful for any highly ill-conditioned cases that may occur in practice. Clearly, our rescaling technique would be suitable in the case (2.14).


The numerical results of Section 2 are one case of a range of calculations, depending on a parameter θ, where x = e_1 initially, and where G and the initial Z have the values

    G = \theta \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 1 & 1 \\ 1 & 1 & 3 & 1 \\ 1 & 1 & 1 & 4 \end{pmatrix}, \qquad Z_0 = \begin{pmatrix} 0 & 1 & 4 & 9 \\ 1 & 0 & 1 & 4 \\ 4 & 1 & 0 & 1 \\ 9 & 4 & 1 & 0 \end{pmatrix},   (3.5)

so that θ = 0.1 gives the matrices (2.13). Table 2 shows the advantages of the rescaling technique in these calculations, not only for achieving ‖x‖ ≤ 10⁻⁶ efficiently, but also for obtaining good conjugate directions with respect to the true second derivative matrix.

The rescaling technique can also be useful when Z is ill-conditioned. To demonstrate this point, we again set x = e_1 initially, we let G be the left hand matrix (3.5) with θ = 1, and we let Z_0 be like a Hilbert matrix, its elements having the values

    (Z_0)_{ij} = 1/(i + j + \mu),   (3.6)

where μ is a parameter. Some results of these calculations are given in Table 3. It seems that, if the given extension to the updating algorithm of Section 2 is employed, then good efficiency can be obtained for a much wider range of initial choices of Z.

Table 2
The calculation (3.5)

         Without rescaling         With rescaling
θ        Iterations    η           Iterations    η
1        8             5.5×10⁻⁷    5             3.2×10⁻⁷
0.1      11            2.5×10⁻⁷    5             2.7×10⁻⁷
0.01     31            8.3×10⁻⁷    6             4.8×10⁻⁷
0.001    18            3.9×10⁻⁷    5             2.4×10⁻⁷
10⁻¹⁰    58            4.4×10⁻⁷    5             4.5×10⁻⁷

Table 3
Ill-conditioned Z

         Without rescaling         With rescaling
μ        Iterations    η           Iterations    η
0        8             1.2×10⁻⁶    …             1.1×10⁻⁶
1        7             7.2×10⁻⁷    …             2.2×10⁻⁷
2        9             6.6×10⁻⁷    …             5.0×10⁻⁷
5        11            5.1×10⁻⁷    …             4.8×10⁻⁷
10       46            1.2×10⁻⁶    …             3.6×10⁻⁷
10⁶      …             6.9×10⁻⁷    …             …

4. The DFP formula

In this section we seek a way of updating the matrix Z, that provides the nice properties of Section 2, in the case when the underlying change to the second derivative approximation is given by the DFP formula (1.8), instead of by the BFGS formula (1.1). Therefore again we require z*_1 to be a multiple of δ. It is relevant to note that the DFP formula is equivalent to the equation

    H^* = H - \frac{H\gamma\gamma^T H}{\gamma^T H\gamma} + \frac{\delta\delta^T}{\delta^T\gamma},   (4.1)

where by (1.2) H = B^{-1} = ZZ^T and H* = (B*)^{-1} = Z*Z*^T.

Suppose that Z is nonsingular, and that its first column is a multiple of Hγ. Then we have the identity

    H - \frac{H\gamma\gamma^T H}{\gamma^T H\gamma} = \sum_{i=2}^{n} z_i z_i^T,   (4.2)

because the right hand side is obtained by subtracting from H = ZZ^T the multiple of z_1 z_1^T that gives singularity. It follows that the formula

    z_i^* = \begin{cases} \delta/\sqrt{\delta^T\gamma}, & i = 1, \\ z_i, & i = 2, 3, \ldots, n, \end{cases}   (4.3)

which is analogous to expression (2.2), makes Z*Z*^T equal to the matrix (4.1) as required. Therefore, for general γ, we wish to apply an orthogonal transformation

    \bar{Z} = Z\Omega   (4.4)

to Z, so that z̄_1 ∥ Hγ, in order that we can replace Z by Z̄ in formula (4.3). Because Hγ = ZZ^Tγ, our condition Z̄e_1 = ZΩe_1 ∥ Hγ is that the first column of Ω shall be a multiple of Z^Tγ, which is condition (2.3), where s is now the vector

    s = Z^T\gamma.   (4.5)

Hence the following updating procedure is suitable. First calculate s = Z^Tγ. Then form the matrix (4.4), where Ω is a lower Hessenberg, orthogonal matrix that satisfies equation (2.3). Finally, let the columns of Z* be the vectors

    z_i^* = \begin{cases} \delta/\sqrt{\delta^T\gamma}, & i = 1, \\ \bar{z}_i, & i = 2, 3, \ldots, n, \end{cases}   (4.6)

where {z̄_i; i = 1, 2, ..., n} are the columns of Z̄.
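The DFP procedure is summarized by the following sketch (our names), which differs from the BFGS code only in forming s = Z^Tγ itself and in leaving the rotated columns unchanged:

    import numpy as np

    def dfp_update_Z(Z, delta, gamma):
        # Section 4 update; requires delta^T gamma > 0.
        n = Z.shape[0]
        s = Z.T @ gamma                           # eq. (4.5)
        Zb = np.array(Z, dtype=float)
        v = s.copy()
        for t in range(n - 2, -1, -1):            # Zbar = Z Omega as in (2.7)
            if v[t + 1] == 0.0:
                continue
            r = np.hypot(v[t], v[t + 1])
            c, w = v[t] / r, v[t + 1] / r
            v[t], v[t + 1] = r, 0.0
            zt = Zb[:, t].copy()
            Zb[:, t] = c * zt + w * Zb[:, t + 1]
            Zb[:, t + 1] = -w * zt + c * Zb[:, t + 1]
        Zb[:, 0] = delta / np.sqrt(delta @ gamma) # first line of (4.6)
        return Zb                                 # other columns are those of Zbar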

where {gi; i = 1 , 2 , . . . , n} are the columns of 5. We compare the work of this version of the DFP formula with the computation

of the algorithm of Section 2 for updating Z by the BFGS method, assuming that s is freely available in the BFGS algorithm. The difference between the two pro- cedures is the calculation of the vector (4.5) versus the work of the second line of

equation (2.6). We see that the DFP procedure is slightly faster; in particular it requires about n 2 fewer multiplications. Moreover, it retains the important properties that B does not occur explicitly and that the length of z* is defined by normalization.


We now suppose that this procedure is used on every iteration of the minimization of a convex quadratic function by the DFP algorithm with exact line searches. We find that, due to the form of Ω, it also provides the properties that are proved by induction in Section 2 for the BFGS algorithm; namely the first k columns of the matrix Z* of the k-th iteration are the first k normalized search directions in reverse order, and equation (2.10) holds.

Again our proof is by induction, so we assume that k = 1 or that the above statements are true when k is reduced by one. Since the search directions are mutually conjugate [3, eq. (3.4.11)], it follows from equation (4.5) that the first (k-1) components of s are all zero. Hence equation (2.8) holds for q = k-1, so the first (k-1) columns of Z become the second to k-th columns of Z*, except that the overall sign of each of these columns may change. Hence, in view of the first line of formula (4.6), the first part of the inductive hypothesis is true.

To prove the other part, we first consider the scalar products {γ^T z*_j; j = 2, 3, ..., n}. Using the identities {z*_j = Z*e_j = ZΩe_j; j = 2, 3, ..., n} and Z^Tγ = s = ±‖s‖Ωe_1, we deduce the values

    \gamma^T z_j^* = (Z^T\gamma)^T \Omega e_j = \pm\|s\|\,(\Omega e_1)^T\Omega e_j = 0, \qquad j = 2, 3, \ldots, n,   (4.7)

because Ω is an orthogonal matrix and e_j is the j-th co-ordinate vector. Therefore equation (2.10) is satisfied when i = 1 and k < j ≤ n, which completes the proof for k = 1. When k > 1, we must consider 2 ≤ i ≤ k too, and the inductive hypothesis states that the first (k-1) columns of Z are conjugate to the last (n-k+1) columns of Z. Because the lower Hessenberg structure of Ω makes the last (n-k) columns of Z* belong to the linear space that is spanned by the last (n-k+1) columns of Z, and because we proved in the previous paragraph that the second to k-th columns of Z* are multiples of the first (k-1) columns of Z, it follows that the last (n-k) columns of Z* are conjugate to the second to k-th columns of Z*, which completes the proof for general k.

The updating procedure of this section has not been tried in practice, because usually a BFGS algorithm is preferred to a DFP algorithm. It is of practical interest, however, that methods for updating a factorization of B^{-1} by the DFP formula are equivalent to methods for updating a factorization of B by the BFGS formula. This remark extends the usefulness of our work, because equations (4.4) and (4.6) and the structure of Ω imply that, if Z is lower triangular, then Z* is lower Hessenberg.

Specifically, our procedure suggests the following algorithm for updating a Cholesky factorization B = LL^T of B, when the second derivative approximation B is revised by the BFGS formula (1.1). Let s be the vector L^Tδ. Calculate the matrix

    \bar{L} = L\Omega,   (4.8)

where Ω is a lower Hessenberg, orthogonal matrix whose first column is a multiple of s. Let L̂ be the matrix whose columns have the values

    \hat{l}_i = \begin{cases} \gamma/\sqrt{\delta^T\gamma}, & i = 1, \\ \bar{l}_i, & i = 2, 3, \ldots, n, \end{cases}   (4.9)

where {l̄_i; i = 1, 2, ..., n} are the columns of L̄. Finally, because L̂ is lower Hessenberg, the required Cholesky factorization B* = L*L*^T is obtained from the formula

    L^* = \hat{L}\hat{\Omega},   (4.10)

where Ω̂ is a product of (n-1) Givens rotations that makes L* lower triangular. The proof that L*L*^T = B* when LL^T = B is similar to the argument that includes

equations (4.2) and (4.3). Specifically, we have L̄L̄^T = B, and that the first column of L̄ is the vector

    \bar{l}_1 = \bar{L}e_1 = L\Omega e_1 = \pm Ls/\|s\| = \pm LL^T\delta/\|s\| = \pm B\delta/\|s\|,   (4.11)

which give the identity

    \sum_{i=2}^{n} \bar{l}_i\bar{l}_i^T = B - \frac{B\delta\delta^T B}{\delta^T B\delta}.   (4.12)

Therefore formula (4.9) implies that L̂L̂^T is the matrix (1.1). Because equation (4.10) provides the lower triangular matrix that satisfies L*L*^T = L̂L̂^T, it follows that L* is the required Cholesky factor of B*.

As mentioned in Section 1, this method of calculating L* is equivalent to an updating algorithm of Goldfarb [7].
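A sketch of the whole Cholesky update (4.8)-(4.10) follows (our names). The first loop forms L̄ = LΩ, the first column is then replaced according to (4.9), and a second sweep of Givens rotations restores lower triangularity as in (4.10); the assertions confirm L*L*^T = B* on random data:

    import numpy as np

    def bfgs_update_cholesky(L, delta, gamma):
        # Requires L L^T = B and delta^T gamma > 0; returns L* with L* L*^T = B*.
        n = L.shape[0]
        Lb = np.array(L, dtype=float)
        v = L.T @ delta                            # the vector s
        for t in range(n - 2, -1, -1):             # eq. (4.8): Lbar = L Omega
            if v[t + 1] == 0.0:
                continue
            r = np.hypot(v[t], v[t + 1])
            c, w = v[t] / r, v[t + 1] / r
            v[t], v[t + 1] = r, 0.0
            lt = Lb[:, t].copy()
            Lb[:, t] = c * lt + w * Lb[:, t + 1]
            Lb[:, t + 1] = -w * lt + c * Lb[:, t + 1]
        Lb[:, 0] = gamma / np.sqrt(delta @ gamma)  # eq. (4.9); Lb is lower Hessenberg
        for t in range(n - 1):                     # eq. (4.10): retriangularize
            a, b = Lb[t, t], Lb[t, t + 1]
            if b == 0.0:
                continue
            r = np.hypot(a, b)
            c, w = a / r, b / r
            lt = Lb[:, t].copy()
            Lb[:, t] = c * lt + w * Lb[:, t + 1]   # zeros the (t, t+1) element
            Lb[:, t + 1] = -w * lt + c * Lb[:, t + 1]
        return Lb              # diagonal signs may be negative; flip columns if wanted

    rng = np.random.default_rng(2)
    n = 4
    A = rng.standard_normal((n, n))
    B = A @ A.T + n * np.eye(n)
    L = np.linalg.cholesky(B)
    delta = rng.standard_normal(n)
    gamma = rng.standard_normal(n)
    if delta @ gamma <= 0.0:
        gamma = -gamma
    Bd = B @ delta
    Bstar = B - np.outer(Bd, Bd) / (delta @ Bd) + np.outer(gamma, gamma) / (delta @ gamma)
    Lstar = bfgs_update_cholesky(L, delta, gamma)
    assert np.allclose(Lstar @ Lstar.T, Bstar)
    assert np.allclose(np.triu(Lstar, 1), 0.0, atol=1e-12)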

5. Discussion

Three main questions are considered in this final section, namely the amount of work of the given updating procedures, the observed accuracy of the numerical experiments, and the possible advantages of using matrices of conjugate directions. Our discussion of the first and second questions includes some comments on other techniques for updating matrix factorizations.

Of course the amount of work depends on the method that is used to calculate the matrix Z̄ of equations (2.4) and (4.4), when s is available and when the first column of the lower Hessenberg, orthogonal matrix Ω is to be a multiple of s. This calculation is studied by Goldfarb [7], but we describe it again because it is quite useful that the required sequence of Givens rotations can be expressed conveniently using the notation of downdating calculations. The following operations provide the last (n-1) columns of Z̄ for use in formula (2.6) or (4.6).

Step 0. Set k to the greatest integer in [1, n] such that s_k ≠ 0. Set auxiliary variables h = s_k z_k and φ = s_k², h being a vector in R^n and φ being a scalar. If k < n, then set z̄_i = z_i for i = k+1, k+2, ..., n.
Step 1. Terminate if k = 1, because the last (n-1) columns of Z̄ have been found.
Step 2. Calculate the vector

    \bar{z}_k = [\phi(s_{k-1}^2 + \phi)]^{-1/2}\,[\phi z_{k-1} - s_{k-1} h].   (5.1)

Then add s_{k-1}z_{k-1} to h and add s_{k-1}² to φ.
Step 3. Reduce k by 1 and go back to Step 1.
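The steps translate directly into code; the sketch below (our names, 0-based indices) also returns the final h, which equals Zs by (5.4) and, as explained after the proof, is reused as the search direction in the BFGS algorithm:

    import numpy as np

    def form_Zbar(Z, s):
        # Steps 0-3; assumes s is not identically zero. Returns Zbar and the final h.
        Z = np.asarray(Z, dtype=float)
        s = np.asarray(s, dtype=float)
        Zbar = Z.copy()                            # columns k+1, ..., n already correct
        k = np.nonzero(s)[0][-1]                   # Step 0 (0-based index of s_k)
        h = s[k] * Z[:, k]
        phi = s[k] ** 2
        while k > 0:                               # Step 1: stop when k = 1
            scale = np.sqrt(phi * (s[k - 1] ** 2 + phi))
            Zbar[:, k] = (phi * Z[:, k - 1] - s[k - 1] * h) / scale   # eq. (5.1)
            h += s[k - 1] * Z[:, k - 1]            # Step 2 updates of h and phi
            phi += s[k - 1] ** 2
            k -= 1                                 # Step 3
        Zbar[:, 0] = h / np.sqrt(phi)              # first column, a multiple of Z s
        return Zbar, h

    # Zbar is Z postmultiplied by an orthogonal matrix, and h = Z s:
    rng = np.random.default_rng(3)
    Z = rng.standard_normal((5, 5))
    s = rng.standard_normal(5)
    Zbar, h = form_Zbar(Z, s)
    assert np.allclose(Zbar @ Zbar.T, Z @ Z.T)
    assert np.allclose(h, Z @ s)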


To prove the efficacy of this procedure, we let Z̄_k be the matrix

    \bar{Z}_k = (z_1 \;\; z_2 \;\; \cdots \;\; z_{k-1} \;\; \phi^{-1/2}h \;\; \bar{z}_{k+1} \;\; \bar{z}_{k+2} \;\; \cdots \;\; \bar{z}_n)   (5.2)

at Step 1, where k, h and φ have their current values. Thus Z̄_k = Z when Step 1 is first reached from Step 0 (except that the overall sign of the k-th column may have altered). In Step 2, Z̄_{k-1} is formed by postmultiplying Z̄_k by the Givens rotation Ω_{k-1}, whose elements {(Ω_{k-1})_{ij}; i, j = k-1, k} have the values

    \begin{pmatrix} s_{k-1}(s_{k-1}^2+\phi)^{-1/2} & \phi^{1/2}(s_{k-1}^2+\phi)^{-1/2} \\ \phi^{1/2}(s_{k-1}^2+\phi)^{-1/2} & -s_{k-1}(s_{k-1}^2+\phi)^{-1/2} \end{pmatrix}.   (5.3)

Hence there is an orthogonal matrix Ω such that Z̄_1 = ZΩ. Because the procedure sets each of the vectors {z̄_k; k = 2, 3, ..., n} to a vector in the linear space spanned by {z_j; j = k-1, k, ..., n}, Ω is lower Hessenberg as required. Moreover, because the first column of Z̄_1 = ZΩ is a multiple of the final h, which is the vector

    h = \sum_{i=1}^{n} s_i z_i = Zs,   (5.4)

the first column of Ω is a multiple of s. Therefore the given procedure provides a suitable Z̄.

By applying this procedure as early as possible when the BFGS formula is used, we can obtain a gain in efficiency that is comparable to the advantage of DFP over BFGS that is mentioned in the paragraph that follows equation (4.6). Specifically, because by (1.2) the search direction is -ZZ^Tg, and because s is a multiple of Z^Tg, we can set s = Z^Tg in the procedure of this section before forming -ZZ^Tg. Thus, not only do we obtain the property that δ will be a multiple of the first column of Z̄, but also the calculation of Z̄ = ZΩ gives the vector (5.4) automatically. Because -h is the required search direction -ZZ^Tg, we save a matrix times vector multiplication.

Thus our version of the BFGS algorithm requires 4n² + O(n) multiplications to calculate both Z̄ and the search direction -ZZ^Tg, and then 2n² + O(n) more multiplications are needed to apply formula (2.6). Alternatively, when the DFP algorithm is preferred, then 2n² multiplications are made to calculate the search direction, n² more multiplications are required to form its s, 3n² + O(n) multiplications occur when Z̄ is calculated, and finally only O(n) multiplications are present in formula (4.6). Hence there are about 6n² multiplications in both cases. This number can be reduced to 5n² by accumulating in a diagonal matrix the contributions from the factor outside the square brackets of expression (5.1), which is analogous to the square-root-free Givens procedure [5]. In this case we would work with a ZDZ^T factorization of B^{-1}, where D would be diagonal. Hence the columns of Z would remain mutually conjugate with respect to the current second derivative approximation, but their scalings would depend on D.

However, Goldfarb [7] suggests some implementations of the BFGS algorithm that require only 5n²/2 + O(n) multiplications per iteration to calculate the search direction and to update the Cholesky factorization of the second derivative approximation. The amount of work is less than in our procedures, because Cholesky factors are triangular, but there is the additional task of maintaining triangularity. Goldfarb's figures are low because he includes two extensions of the techniques that we employ to combine the calculation of -ZZ^Tg with the calculation of Z̄ in the BFGS algorithm. Unfortunately the extensions require several intermediate vectors to be stored for later use, so there is some extra work to set against the reduction in the number of multiplications. Therefore we estimate that the use of conjugate directions instead of Cholesky factorizations increases the amount of routine calculation of an iteration by about 50%.

Many general computer implementations of the BFGS algorithm, however, update Cholesky factorizations by methods that are less efficient than some of Goldfarb's [7] procedures, because it is usual to take the view that the BFGS formula (1.1) is composed of two separate rank one corrections to B. Indeed, this view is stated explicitly by Gill and Murray [6], and it has to be taken if one decides to employ the updating procedure that is recommended by Fletcher and Powell [4]. Further, these two papers give careful attention to the severe loss of accuracy that can occur if rounding errors make some numbers negative that in theory cannot be less than zero. Therefore many researchers believe that the accurate revision of Cholesky factors of second derivative approximations in variable metric algorithms is a difficult calculation, and support for this view is provided by the amount of detail in Goldfarb's paper [7]. It is hoped, therefore, that some readers will find useful our brief description of an updating algorithm that is given in the paragraph that includes equations (4.8)-(4.10).

Several of the numerical results of Section 3 are relevant to a comparison of the accuracy of updating methods. First we address the suitability of our technique when ‖Z*‖ ≪ ‖Z_0‖ by giving further attention to example (3.3) with x = (1 1 1 1)^T initially. Table 4 shows some values of expression (3.2) after four iterations for the new updating formula of Section 2 and the Brodlie-Gourlay-Greenstadt (BGG) formula (1.5)-(1.6). Further, the final column of the table gives the number

    \eta = \max_{i,j} |(HG - I)_{ij}|   (5.5)

after four iterations for the same test problem when H = Z_0 Z_0^T initially, and when on each iteration the estimate H ≈ G^{-1} is revised by the original BFGS formula

    H^* = H - \frac{H\gamma\delta^T + \delta\gamma^T H}{\delta^T\gamma} + \left(1 + \frac{\gamma^T H\gamma}{\delta^T\gamma}\right)\frac{\delta\delta^T}{\delta^T\gamma}.   (5.6)

Table 4
More values of η for the calculation (3.3)

θ        New algorithm    BGG formula    Original BFGS
1        1.7×10⁻⁷         3.1×10⁻⁷       2.4×10⁻⁷
10²      3.3×10⁻⁷         1.9×10⁻⁶       1.8×10⁻⁴
10⁴      2.5×10⁻⁷         9.0×10⁻⁵       1.6×10⁻²
10⁶      6.6×10⁻⁷         2.9×10⁻⁴       2.7×10⁰
10⁸      1.1×10⁻⁶         1.6×10⁻²       8.8×10²

We see that only the new procedure provides good estimates of normalized conjugate directions when θ ≫ 1 (but at the end of each of these calculations we found ‖x‖ ≤ 2×10⁻⁶). The losses of accuracy that occur in "BGG" and "original BFGS" are due to cancellation. Therefore, because Z and H are reduced from magnitude one to magnitudes θ^{-1/2} and θ^{-1} respectively, the errors include the factors θ^{1/2} and θ respectively, which explains the main differences between the columns of the table.

When θ < 1, however, serious cancellation does not occur in any of these updating formulae. Table 5 compares the number of iterations that are needed by three algorithms to achieve ‖x‖ ≤ 10⁻⁶ for the test problem of Tables 1 and 4, and we see little difference from the second column of Table 1. The Cholesky factorization procedure is the updating method that is described at the end of Section 4. These figures support the suggestion in Section 3 that, if step-lengths of line searches exceed one, then any errors in calculated values of x can be magnified in a way that is independent of the implementation of the BFGS formula. But we also recall from Section 3 that automatic rescaling procedures can avoid such inefficiencies.

Table 5
More iteration counts for the calculation (3.3)

θ        BGG formula    Original BFGS    Cholesky method
1        4              5                4
0.1      5              5                5
0.01     6              6                6
10⁻⁴     7              7                7
10⁻⁶     9              7                8
10⁻⁸     11             9                10

Finally we return to the main purpose of this paper, which is to consider the use of matrices of conjugate directions of second derivative approximations. Some of the theory of Section 2 shows that our approach is well suited to active set methods for linear inequality constraints. In order to explain this point, we suppose that there are m independent active constraints, and that Z is chosen so that its first (n-m) columns are orthogonal to the active constraint gradients, which can be achieved by postmultiplying Z by a suitable orthogonal matrix. Then the variable metric search direction has the convenient form -Z_A Z_A^T g, where g is still the current gradient of the objective function, and where the columns of Z_A are the first (n-m) columns of Z. Hence the vector s ∈ R^n in the updating algorithm of Section 2 is composed of the components of -Z_A^T g followed by m zero components. Thus the initial value of k, which is set in Step 0 of our calculation of Z̄, is at most (n-m), so automatically the first (n-m) columns of Z* become orthogonal to the active constraint gradients. Further, Goldfarb and Idnani [8] show that it is straightforward to make suitable changes to Z when the active constraints are altered. Therefore several properties of conjugate direction matrices are useful to linearly constrained optimization calculations.
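The following fragment (our names; the constraint data are only assumed, not constructed) shows the convenient form that the vector s takes with m active constraints, so that Step 0 automatically starts with k ≤ n-m:

    import numpy as np

    rng = np.random.default_rng(4)
    n, m = 6, 2
    Z = rng.standard_normal((n, n))   # first n-m columns assumed orthogonal to the
    g = rng.standard_normal(n)        # active constraint gradients
    ZA = Z[:, :n - m]
    d = -ZA @ (ZA.T @ g)              # variable metric search direction
    s = np.concatenate([-(ZA.T @ g), np.zeros(m)])
    assert np.allclose(Z @ s, d)      # so delta = alpha*d = Z(alpha*s), as required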

In both constrained and unconstrained optimization calculations, the use of conjugate direction matrices allows the rescaling technique of Section 3 to be applied. An alternative way of avoiding the inefficiencies of large step-lengths is to choose Z to be very large initially, which corresponds to a small initial B, but this remedy has a strong disadvantage. To explain it we note that the matrix (3.4), whose first two columns have been suitably scaled automatically, would be typical of Z after two iterations. In this case the third search direction, -ZZ^Tg, would be dominated by the contributions from the unscaled columns of Z. Continuing this argument, it follows that, due to the relatively large columns of Z that are inherited from the initial choice, it is usual for none of the first n search directions to be suitable for correcting any errors in the line searches of the early iterations. Thus one may destroy the highly useful property of variable metric algorithms that, when n is large, it happens frequently that far fewer than n iterations are sufficient to obtain good accuracy. Therefore we recommend that the rescaling technique of Section 3 be combined with an initial second derivative matrix that does not underestimate greatly the curvature of the objective function along any search direction. This suggestion, however, has not been tried in practice when the objective function is not quadratic.

We have considered several properties of variable metric algorithms, and it is instructive to emphasise that two of them are independent of the way in which second derivative information is represented. Specifically, we have in mind the loss of efficiency that is shown in Table 5, and the comments that have just been made on the disadvantages of a small initial second derivative approximation. Our discussions of these two points are helped greatly by referring to conjugate direction matrices, but the two points are relevant even if, for example, one prefers an implementation of the BFGS algorithm that works with Cholesky factors of B. It is possible, therefore, that our studies will make a helpful contribution to both theoretical understanding and practical improvements of algorithms for optimization calculations.

Acknowledgements

I am very grateful to an Associate Editor and two referees for their helpful reports on the original version of this paper. In particular their comments led to Table 5, which shows clearly that, when the new method is disappointing, its competitors give poor accuracy too. One of the referees was so thorough that his 87 suggestions have influenced favourably about half of the given sentences.


References

[1] K.W. Brodlie, A.R. Gourlay and J. Greenstadt, "Rank-one and rank-two corrections to positive definite matrices expressed in product form," Journal of the Institute of Mathematics and its Applications 11 (1973) 73-82.

[2] W.C. Davidon, "Optimally conditioned optimization algorithms without line searches," Mathematical Programming 9 (1975) 1-30.

[3] R. Fletcher, Practical Methods of Optimization, Vol. 1: Unconstrained Optimization (John Wiley & Sons, Chichester, 1980).

[4] R. Fletcher and M.J.D. Powell, "On the modification of LDL^T factorizations," Mathematics of Computation 28 (1974) 1067-1087.

[5] W.M. Gentleman, "Least squares computations by Givens transformations without square roots," Journal of the Institute of Mathematics and its Applications 12 (1973) 329-336.

[6] P.E. Gill and W. Murray, "Quasi-Newton methods for unconstrained optimisation," Journal of the Institute of Mathematics and its Applications 9 (1972) 91-108.

[7] D. Goldfarb, "Factorized variable metric methods for unconstrained optimization," Mathematics of Computation 30 (1976) 796-811.

[8] D. Goldfarb and A. Idnani, "A numerically stable dual method for solving strictly convex quadratic programs," Mathematical Programming 27 (1983) 1-33.

[9] S-P. Han, "Optimization by updated conjugate subspaces," in: D.F. Griffiths and G.A. Watson, eds., Numerical Analysis: Pitman Research Notes in Mathematics Series 140 (Longman Scientific & Technical, Burnt Mill, England) pp. 82-97.

[10] W. Murray, "An algorithm for finding a local minimum of an indefinite quadratic program," Report NAC 1, National Physical Laboratory, Teddington (1971).

[11] M.J.D. Powell, "On the quadratic programming algorithm of Goldfarb and Idnani," Mathematical Programming Study 25 (1985) 46-61.