
Mathematical Programming 10 (1976) 70-90. North-Holland Publishing Company

OPTIMAL CONDITIONING OF SELF-SCALING VARIABLE METRIC ALGORITHMS

Shmuel S. OREN
Xerox Research Center, Palo Alto, Calif., U.S.A.

and

Emilio SPEDICATO*
CISE, Segrate, Milano, Italy

Received 22 July 1974
Revised manuscript received 20 May 1975

Variable Metric Methods are "Newton-Raphson-like" algorithms for unconstrained minimization in which the inverse Hessian is replaced by an approximation, inferred from previous gradients and updated at each iteration. During the past decade various approaches have been used to derive general classes of such algorithms having the common properties of being Conjugate Directions methods and having "quadratic termination". Observed differences in the actual performance of such methods motivated recent attempts to identify variable metric algorithms having additional properties that may be significant in practical situations (e.g. nonquadratic functions, inaccurate line search, etc.). The SSVM algorithms, introduced by the first author, are such methods; among their other properties, they automatically compensate for poor scaling of the objective function. This paper presents some new theoretical results identifying a subclass of SSVM algorithms that have the additional property of minimizing a sharp bound on the condition number of the inverse Hessian approximation at each iteration. Reducing this condition number is important for decreasing the roundoff error. The theoretical properties of this subclass are explored and two of its special cases are tested numerically in comparison with other SSVM algorithms.

1. Introduction

Variable Metric Methods are algorithms for minimizing a function f(x) over x ∈ Eⁿ, given that f ∈ C² and the gradient g(x) is available for any x. These methods are based upon the iteration:

* This work was done while this author was a visiting fellow at the Department of Engineering-Economic Systems, Stanford University.


x_{k+1} = x_k − α_k D_k g_k,   (1)

where D_k is an n × n matrix approximating the inverse Hessian [∇²f(x_k)]⁻¹, and α_k is a step size parameter selected to satisfy prescribed descent conditions. The approximation D_k in these algorithms is inferred from the gradients in previous iterations and is updated at each step, starting with an initial approximation D₀. The updating is done so as to satisfy the equation

D_{k+1} q_k = ρ_k p_k,   (2)

where ρ_k is a scalar parameter, q_k = g_{k+1} − g_k and p_k = x_{k+1} − x_k. Usually ρ_k = 1, in which case eq. (2) is referred to as the "quasi-Newton condition". This condition is motivated by the fact that for a quadratic function it is also satisfied by the true inverse Hessian. Different variable metric algorithms usually vary in their selection procedure for α_k and their updating formula for D_k.
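To make the iteration concrete, the following minimal Python sketch (ours, not from the paper; all names are our own) implements (1)-(2) with a user-supplied update rule. The crude step-halving rule stands in for the descent conditions mentioned above:

```python
import numpy as np

def variable_metric_minimize(f, grad, x0, update, max_iter=200, tol=1e-8):
    """Sketch of iteration (1): x_{k+1} = x_k - alpha_k * D_k * g_k.

    `update(D, p, q)` must return a new inverse-Hessian approximation
    satisfying the quasi-Newton condition (2) with rho_k = 1: D_new @ q = p.
    """
    x = np.asarray(x0, dtype=float)
    D = np.eye(x.size)                  # initial approximation D_0
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        direction = -D @ g              # -D_k g_k
        alpha = 1.0
        while f(x + alpha * direction) >= f(x) and alpha > 1e-12:
            alpha *= 0.5                # crude stand-in for a descent test
        x_new = x + alpha * direction
        g_new = grad(x_new)
        D = update(D, x_new - x, g_new - g)   # p_k and q_k of eq. (2)
        x, g = x_new, g_new
    return x
```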

In this paper we consider the two-parameter class of updates proposed by Oren and Luenberger [5], which can be expressed in the form

D_{k+1} = φ^{θ_k}(D_k, γ_k, p_k, q_k),   (3)

where

φ^θ(D, γ, p, q) = (D − Dqq'D/q'Dq + θvv')γ + pp'/p'q   (4.a)

with

v = (q'Dq)^{1/2} (p/p'q − Dq/q'Dq).   (4.b)

This class of updates, which satisfies the aforementioned quasi-Newton condition, contains most of the formulae used in the past as special cases and is equivalent to Huang's symmetric class [3].
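For illustration, eqs. (4.a)-(4.b) translate directly into code. The sketch below is ours (the function name is arbitrary); the final assertion checks the quasi-Newton condition D*q = p, which holds because v'q = 0:

```python
import numpy as np

def phi_update(D, gamma, p, q, theta):
    """Two-parameter update of eqs. (4.a)-(4.b)."""
    Dq = D @ q
    qDq = q @ Dq                                   # q'Dq
    pq = p @ q                                     # p'q
    v = np.sqrt(qDq) * (p / pq - Dq / qDq)         # eq. (4.b)
    return ((D - np.outer(Dq, Dq) / qDq + theta * np.outer(v, v)) * gamma
            + np.outer(p, p) / pq)                 # eq. (4.a)

# quick check of the quasi-Newton condition (2) with rho_k = 1
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
D = A @ A.T + 4 * np.eye(4)                        # positive definite D
p, q = rng.standard_normal(4), rng.standard_normal(4)
if p @ q < 0:
    q = -q                                         # ensure p'q > 0
assert np.allclose(phi_update(D, 0.9, p, q, 0.5) @ q, p)
```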

A special subset of the updates represented by (4) is the one corresponding to the parameter ranges (p'q/q'Dq) ≤ γ ≤ (p'D⁻¹p/p'q) (assuming p'q > 0) and 0 ≤ θ ≤ 1. This subclass, which will be denoted by O(D, p, q), was introduced by Oren [4], and it forms the basis for the Self-Scaling Variable Metric (SSVM) algorithms described in [5, 6, 7]. Among the features that make the updates belonging to O(D, p, q) attractive are positive definiteness of the inverse Hessian approximations and monotonic decrease of the condition number of H^{1/2}D_kH^{1/2} when f(x) is a quadratic with Hessian H. With exact line search, the last property ensures a monotonic decrease in a bound on the single-step convergence rate of SSVM algorithms; these virtues were extensively discussed by Oren and Luenberger [5]. The freedom available in selecting γ and θ within O(D, p, q), and experimental results reported in [8], suggest that further improvements can be made by proper selection of these parameters. Such an attempt was made by Oren [8], who proposed a selection rule for updates in O(D, p, q). According to this rule, γ is selected as close as possible to unity and θ is chosen so as to offset an estimated bias in det(D_k H) relative to unity. Though the aforementioned selection rule produced reasonably good results, the criteria leading to it are somewhat heuristic.

This paper considers the problem of selecting updates in O(D, p, q) that minimize the condition number of D_k at each step. A low condition number of D_k is desirable from a numerical viewpoint, since it reduces the round-off error in the iteration (1) and thus improves the numerical stability of the algorithm. This criterion was proposed in the past by Shanno and Kettler [9] for selecting a parameter in a different class of updates; however, their implementation of this idea is quite different from the one presented in this paper.

In Section 2 we develop a sharp bound on the condition number of positive definite updates of the form φ^θ(D, γ, p, q). Minimizing this bound with respect to θ and γ in the range corresponding to O(D, p, q) leads to a relation between the optimal values of these parameters. Imposing this relation on O(D, p, q) results in the one-parameter class of Optimally Conditioned Self-Scaling (OCSS) updates Ō(D, p, q), developed in Section 3. Section 4 examines some duality properties of updates in Ō(D, p, q) and their implications. Two strategies for selecting the remaining free parameter γ in Ō(D, p, q) are tested numerically in Section 5, and the results are compared to previous results given in [8].

2. A bound on the condition number of φ^θ(D, γ, p, q)

For the sake of brevity we shall use D* to represent φ^θ(D, γ, p, q) and also introduce the notation:

σ = p'q,   (5.a)

τ = q'Dq,   (5.b)

ε = p'D⁻¹p.   (5.c)

In terms of the above notation, eq. (4) can be rewritten as

D* = γD + ((σ + γθτ)/σ²) pp' + (γ(θ − 1)/τ) Dqq'D − (γθ/σ)(pq'D + Dqp'),   (6)

and for D* ∈ O(D, p, q) we require 0 ≤ θ ≤ 1 and (σ/τ) ≤ γ ≤ (ε/σ).

In the following theorem we derive a bound on the condition number of D*.

This result is equivalent to the one obtained by Spedicato [11]; however, the approach presented here is somewhat different, and consequently the proof is much simpler.

Theorem 1. Let D* be defined by (6) and let κ(·) denote the condition number with respect to the spectral norm (κ(A) = ‖A‖ ‖A⁻¹‖). Then, if the matrices D and D* are positive definite, there holds

κ(D*) ≤ κ(D) · max[ρ + (ρ² − μ)^{1/2}, γ] / min[ρ − (ρ² − μ)^{1/2}, γ],   (7)

where

ρ = (ε + μτ)/2σ   (8)

and

μ = γ[σ² + θ(ετ − σ²)]/τσ.   (9)

Furthermore, eq. (7) becomes an equality if D is the identity or if Dq = p and γ ∈ [σ/τ, ε/σ].

Proof. For all x ∈ Eⁿ we have by (6)

x'D*x = (x'Dx) G(γ, θ, x),   (10)

where

G(γ, θ, x) = γ + ((σ + γθτ)/σ²) (p'x)²/(x'Dx) + (γ(θ − 1)/τ) (x'Dq)²/(x'Dx) − (2γθ/σ) (x'p)(x'Dq)/(x'Dx).   (11)

Let us define

ψ(γ, θ) = max_x G(γ, θ, x) / min_x G(γ, θ, x);   (12)

then, since D and D* are positive definite,

κ(D*) = ‖D*‖ ‖D*⁻¹‖ = max_x [(x'Dx) G(γ, θ, x)/x'x] / min_x [(x'Dx) G(γ, θ, x)/x'x] ≤ [max_x (x'Dx/x'x) / min_x (x'Dx/x'x)] · ψ(γ, θ) = κ(D) ψ(γ, θ).   (13)

The remainder of the proof consists of evaluating ψ(γ, θ). To simplify the algebra we introduce the following notation:

L = D^{1/2},   (14.a)

y = Lx/(x'Dx)^{1/2},   (14.b)

w = L⁻¹p,   (14.c)

u = Lq.   (14.d)

From (14) and (5) we obtain

ε = ‖w‖²,   (15.a)

τ = ‖u‖²,   (15.b)

σ = w'u,   (15.c)

‖y‖ = 1.   (15.d)

Substituting (15) into (11) yields

G(γ, θ, x) = γ + ((σ + γθτ)/σ²)(w'y)² + (γ(θ − 1)/τ)(u'y)² − (2γθ/σ)(u'y)(w'y) ≝ Ḡ(γ, θ, y).   (16)

Consequently,

ψ(γ, θ) = max_{‖y‖=1} Ḡ(γ, θ, y) / min_{‖y‖=1} Ḡ(γ, θ, y).   (17)

The necessary conditions for an extremum of Ḡ(γ, θ, y) over the hypersphere ‖y‖ = 1 are

∇_y Ḡ(γ, θ, y) + 2λy = 0,   (18.a)

y'y = 1,   (18.b)

where λ is a Lagrange multiplier. Calculating the gradient of Ḡ(γ, θ, y) from (16) and substituting in (18.a) yields

((σ + γθτ)/σ²)(w'y)w + (γ(θ − 1)/τ)(u'y)u − (γθ/σ)[(w'y)u + (u'y)w] + λy = 0.   (19)

Taking the inner product of (19) with w, u and y respectively, and using (18.b) to eliminate y'y, results in the following equations:

[((σ + γθτ)/σ²)ε − γθ](w'y) + [γ(θ − 1)σ/τ − γθε/σ](u'y) + λ(w'y) = 0,   (20)

(w'y) − γ(u'y) + λ(u'y) = 0,   (21)

((σ + γθτ)/σ²)(w'y)² + (γ(θ − 1)/τ)(u'y)² − (2γθ/σ)(w'y)(u'y) + λ = 0.   (22)

These equations can be solved for (w'y), (u'y) and λ. Clearly (20)–(22) are necessary conditions for an extremum of Ḡ(γ, θ, y) over the hypersphere ‖y‖ = 1. Thus, by substituting (22) into (16), we obtain that for such an extremum y⁰ with corresponding Lagrange multiplier λ⁰,

Ḡ(γ, θ, y⁰) = γ − λ⁰.   (23)

In view of (23) we have to solve equations (20)–(22) only for λ. If u'y = 0, then from (21) w'y = 0 and consequently λ = 0. If u'y ≠ 0, we

define a = (w'y)/(u'y). Then from (21)

λ = γ − a.   (24)

Dividing (20) by (u'y) and using (24) yields

a² − ((σε + γθ(ετ − σ²) + γσ²)/σ²) a + γ(σ² + θ(ετ − σ²))/τσ = 0.   (25)

By using (9) and (8) one can reduce (25) to the equation

a² − 2ρa + μ = 0,   (26)

whose solutions are

a = ρ ± (ρ² − μ)^{1/2}.   (27)

To show that these solutions are real we have to show that ρ² − μ ≥ 0. For μ ≤ 0 this is clearly true. If, on the other hand, μ > 0, then by the Schwarz inequality ετ ≥ σ², and by (8) we obtain

ρ² − μ = ((ε + μτ)/2σ)² − μ = ((ε − μτ)/2σ)² + μ(ετ − σ²)/σ² ≥ 0.

Since the two solutions given by (27) are real, we can substitute them in (24) to obtain two additional real values of λ satisfying the necessary conditions (20)–(22).

By the Weierstrass theorem, Ḡ(γ, θ, y) must have a minimum and a maximum on the hypersphere ‖y‖ = 1. Since these extrema must satisfy (20)–(22), they correspond to one of the three values of λ obtained above. The assumption that D is positive definite implies by (10) and (16) that Ḡ(γ, θ, y) > 0 over the hypersphere ‖y‖ = 1. It follows by (23), (24) and (27) that ρ > 0, γ > 0, and consequently, by the above argument,

max_{‖y‖=1} Ḡ(γ, θ, y) = max[ρ + (ρ² − μ)^{1/2}, γ],   (28)

min_{‖y‖=1} Ḡ(γ, θ, y) = min[ρ − (ρ² − μ)^{1/2}, γ].   (29)

Equation (7) follows directly from (13), (17), (28) and (29). For D = I, (13) is clearly an equality and thus (7) is an equality.

For Dq = p we have τ = ε = σ; consequently γ = 1, and by (8) and (9), μ = 1 and ρ = 1. Substituting the above in (7) yields κ(D*) ≤ κ(D). This, however, is an equality, since by (6) D* = D.
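The bound (7) is easy to probe numerically. The sketch below (ours; `phi_update` is the illustrative function from Section 1) forms D* for a random positive definite D and random p, q, and compares κ(D*) with the right-hand side of (7):

```python
import numpy as np

def bound_rhs(D, gamma, p, q, theta):
    """Right-hand side of (7), built from the scalars (5) and (8)-(9)."""
    Dq = D @ q
    sigma, tau = p @ q, q @ Dq
    eps = p @ np.linalg.solve(D, p)                # epsilon = p'D^{-1}p
    mu = gamma * (sigma**2 + theta * (eps * tau - sigma**2)) / (tau * sigma)
    rho = (eps + mu * tau) / (2 * sigma)
    root = np.sqrt(max(rho**2 - mu, 0.0))
    return np.linalg.cond(D) * max(rho + root, gamma) / min(rho - root, gamma)

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
D = A @ A.T + 5 * np.eye(5)
p, q = rng.standard_normal(5), rng.standard_normal(5)
if p @ q < 0:
    q = -q
gamma, theta = 1.0, 0.5          # D* positive definite (gamma > 0, sigma > 0, theta >= 0)
Dstar = phi_update(D, gamma, p, q, theta)
assert np.linalg.cond(Dstar) <= bound_rhs(D, gamma, p, q, theta) * (1 + 1e-10)
```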

Corollary 1. Let D* be defined by (6) and D be a positive definite matrix. Then D* is positive definite if and only if γ > 0, σ > 0 and

θ(ετ − σ²) > −σ².   (30)

Proof. In view of (10), D* is positive definite if and only if G(γ, θ, x) > 0 for all x, or equivalently Ḡ(γ, θ, y) > 0 for all y such that ‖y‖ = 1. By (29) this holds if and only if γ > 0, ρ > 0 and μ > 0, where ρ and μ are defined by (8) and (9). Since ε and τ are positive (by the positive definiteness of D) and μ is positive, ρ > 0 if and only if σ > 0. Equation (30) follows directly from (9) and the definitions of γ, μ, τ and σ.

Corollary 2. Let D* be defined by (6) and D be a positive definite matrix. Then D* is positive definite if σ > 0, γ > 0 and θ ≥ 0.

Proof. By the Schwarz inequality ετ − σ² ≥ 0; thus γ > 0 and θ ≥ 0 satisfy the sufficient condition for positive definiteness of D* obtained in Corollary 1.

The result given in Corollary 1 was originally given by Spedicato [10]. The result of Corollary 2 was given by Oren and Luenberger [5], who also proved that for γ > 0, unless θ ≥ 0, there exist a positive definite D and vectors p, q satisfying p'q > 0 that lead to a non-positive definite D*. The conditions given in Corollaries 1 and 2 could now be used to replace the positive definiteness of D* required in Theorem 1.

Theorem 2. Let D* be defined by (6), where D is a positive definite n × n matrix, p, q ∈ Eⁿ satisfy p'q > 0, σ/τ ≤ γ ≤ ε/σ and θ > −σ²/(ετ − σ²). Then there holds

κ(D*) ≤ κ(D) · [ξ + (ξ² − 1)^{1/2}]²,   (31)

where ξ = ρ/μ^{1/2}, while ρ and μ are defined by (8) and (9). Furthermore, (31) becomes an equality if D = I or Dq = p.


Proof. Since D is positive definite, τ > 0 and therefore γ ≥ σ/τ > 0. This, together with the conditions on σ and θ, implies, according to Corollary 1, positivity of μ and positive definiteness of D*, which ensure the validity of Theorem 1.

By the Schwarz inequality, ετ − σ² ≥ 0. Thus, using (8) and (9), we have

ρ + (ρ² − μ)^{1/2} = (ε + μτ)/2σ + [((ε − μτ)/2σ)² + μ(ετ − σ²)/σ²]^{1/2} ≥ (ε + μτ)/2σ + |ε − μτ|/2σ ≥ ε/σ ≥ γ.   (32)

Also,

ρ − (ρ² − μ)^{1/2} = ρ − {ρ² − (2σρ − ε)/τ}^{1/2} = ρ − {(ρ − σ/τ)² + (ετ − σ²)/τ²}^{1/2} ≤ ρ − |ρ − σ/τ| ≤ σ/τ ≤ γ.   (33)

Applying the results of (32) and (33) to Theorem 1 reduces (7) to

κ(D*) ≤ κ(D) · [ρ + (ρ² − μ)^{1/2}]/[ρ − (ρ² − μ)^{1/2}].   (34)

Multiplying the numerator and denominator on the right-hand side of (34) by [ρ + (ρ² − μ)^{1/2}], and replacing ρ in terms of ξ, results in (31). The conditions for equality in (31) follow directly from Theorem 1 and the restriction on the range of γ.
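Under the range conditions of Theorem 2, the simpler bound (31) can be checked the same way (again our own sketch; `phi_update` is the illustrative function from Section 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
D = A @ A.T + 5 * np.eye(5)
p, q = rng.standard_normal(5), rng.standard_normal(5)
if p @ q < 0:
    q = -q

Dq = D @ q
sigma, tau = p @ q, q @ Dq
eps = p @ np.linalg.solve(D, p)
gamma = np.sqrt(eps / tau)   # lies in [sigma/tau, eps/sigma] by the Schwarz inequality
theta = 0.5                  # satisfies theta > -sigma^2/(eps*tau - sigma^2)

mu = gamma * (sigma**2 + theta * (eps * tau - sigma**2)) / (tau * sigma)
rho = (eps + mu * tau) / (2 * sigma)
xi = rho / np.sqrt(mu)
rhs = np.linalg.cond(D) * (xi + np.sqrt(xi**2 - 1.0))**2      # eq. (31)
assert np.linalg.cond(phi_update(D, gamma, p, q, theta)) <= rhs * (1 + 1e-10)
```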

3. Optimally conditioned self-scaling updates

As mentioned in the introduction, it is desirable from a numerical stability standpoint to use an update that minimizes the condition number of D_k at each iteration. Since, however, an explicit expression for this condition number is usually unavailable, it seems reasonable to at least minimize its upper bound by proper selection of θ and γ. Theorem 1 provides such a bound depending on θ and γ, which motivates the following definition.

Definition 1. Let D be a positive definite n × n matrix and p, q ∈ Eⁿ satisfy p'q > 0; then D* = φ^θ(D, γ, p, q) is said to be optimally conditioned if γ and θ are such that D* is positive definite and the right-hand side in (7) is minimized.

In the following theorem we restrict ourselves to updates corresponding to γ ∈ [σ/τ, ε/σ], to which the result of Theorem 2 applies. Cases outside this range are discussed by Spedicato [12].

Theorem 3. Let D*, D, p, q, γ, ρ, μ, θ and ξ be as in Theorem 2. Then D* is optimally conditioned if and only if either ετ = σ² or

θ = σ(ε − γσ)/γ(ετ − σ²).   (35)

Furthermore, if D* is optimally conditioned, then

κ(D*) ≤ κ(D)[ω + (ω² − 1)^{1/2}]²,   (36)

where

ω = (ετ/σ²)^{1/2}.   (37)

Proof. Clearly, for the case considered in this theorem and in view of (31), D* is optimally conditioned if and only if θ and γ minimize Ψ(γ, θ), which is defined as

Ψ(γ, θ) = [ξ + (ξ² − 1)^{1/2}]².   (38)

If p = Dq, then ε = τ = σ. Thus Ψ(γ, θ) = 1 and (36) is satisfied for any γ and θ. If p ≠ Dq, then by the Schwarz inequality σ² < ετ. For any given γ, Ψ(γ, θ) will have a stationary point with respect to θ only if

dΨ/dθ = (dΨ/dξ)·(dξ/dμ)·(dμ/dθ) = 0.   (39)

By (38),

dΨ/dξ = 2Ψ^{1/2}·(1 + ξ/(ξ² − 1)^{1/2}) > 0.   (40)

From (9),

dμ/dθ = γ(ετ − σ²)/τσ > 0.   (41)

Also, since

ξ = (ε + μτ)/2σμ^{1/2},   (42)

we obtain

dξ/dμ = (τμ − ε)/4σμ^{3/2}.   (43)

The assumptions on γ, θ and σ imply μ > 0. Thus by (39)–(43), dΨ/dθ = 0 implies μ = ε/τ, which by (9) yields (35). Examining the second derivative of Ψ(γ, θ) with respect to θ indicates that (35) corresponds to a minimum. Furthermore, substituting μ = ε/τ into (42), and using (37) in (38), yields

Ψ(γ, θ) = [ω + (ω² − 1)^{1/2}]² = min_{θ,γ} Ψ(γ, θ).   (44)

Equation (36) follows from (44), (38) and (31).

Corollary 3. The optimal parameter θ defined by (35) satisfies:

0 ≤ θ ≤ 1 for σ/τ ≤ γ ≤ ε/σ,

θ = 0 for γ = ε/σ,

θ = 1 for γ = σ/τ.

Proof. Follows directly from (35) and the Schwarz inequality.

Let us denote by Ō(D, p, q) the class of updates of the form φ^θ(D, γ, p, q) with γ ∈ [σ/τ, ε/σ] and θ defined in terms of γ by (35). Then it follows from Corollary 3 that Ō(D, p, q) is contained in O(D, p, q) and hence possesses all the properties of the latter class. Furthermore, the fact that both (dependent) parameters θ and γ in Ō(D, p, q) vary over the entire range allowed in O(D, p, q) indicates that Ō(D, p, q) is in some sense a "complete" one-parameter subclass of O(D, p, q). In view of the above observations, Ō(D, p, q), referred to as the class of Optimally Conditioned Self-Scaling (OCSS) updates, seems to be a natural restriction of O(D, p, q). Substituting θ from (35) into (6) yields the following general form of OCSS updates:

φ̄(D, γ, p, q) = φ^θ(D, γ, p, q), with θ given by (35),

= γD + (1/(ετ − σ²)) { [(2ετ − γτσ − σ²)/σ] pp' − [ε(γτ − σ)/τ] Dqq'D − (ε − γσ)(pq'D + Dqp') },   (45)

where γ ∈ [σ/τ, ε/σ].

It should be noted that though (45) calls for evaluation of ε, which is defined as p'D⁻¹p, it does not require evaluation of D⁻¹, since usually p_k = −α_k D_k g_k and therefore ε_k = (p'_k g_k)²/g'_k D_k g_k.
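As an illustration (our code, with arbitrary names), an OCSS update can be realized by computing θ from (35) and feeding it to the generic update (4):

```python
import numpy as np

def ocss_update(D, p, q, gamma):
    """OCSS update (45): theta set by (35), then update (4) applied.
    Assumes gamma lies in the interval [sigma/tau, eps/sigma]."""
    Dq = D @ q
    sigma, tau = p @ q, q @ Dq
    eps = p @ np.linalg.solve(D, p)     # in an algorithm, (p'g)^2/g'Dg is cheaper
    if np.isclose(eps * tau, sigma**2):
        theta = 0.0                     # p proportional to Dq: any theta will do
    else:
        theta = sigma * (eps - gamma * sigma) / (gamma * (eps * tau - sigma**2))
    return phi_update(D, gamma, p, q, theta)   # phi_update: sketch from Section 1
```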

The following proposition specializes some of the earlier results to the quadratic case.

Proposition 1. Let H be a positive definite n × n matrix, {p_k, q_k} a sequence of vector pairs in Eⁿ satisfying q_k = Hp_k for all k, and {D_k} a sequence of matrices such that D_{k+1} ∈ O(D_k, p_k, q_k), starting with a positive definite matrix D₀. Then the condition number of D_k is bounded such that

κ(D_k) ≤ κ(H^{1/2}D₀H^{1/2}) · κ(H).   (46)

If, in addition, D_{k+1} ∈ Ō(D_k, p_k, q_k), the rate of deterioration of this condition number is bounded such that

κ(D_{k+1}) ≤ κ(D_k) · κ(H^{1/2}D₀H^{1/2}).   (47)

Proof. It was shown by Oren and Luenberger [5] that for updates in O(D_k, p_k, q_k),

κ(R_{k+1}) ≤ κ(R_k) ≤ κ(R₀),   (48)

where

R_k = H^{1/2}D_kH^{1/2}.   (49)

Thus

κ(D_k) ≤ κ(R_k) · κ(H) ≤ κ(R₀) · κ(H),   (50)

proving (46).

For D_{k+1} ∈ Ō(D_k, p_k, q_k) we have, by Theorem 3,

κ(D_{k+1}) ≤ κ(D_k)[ω + (ω² − 1)^{1/2}]²,   (51)

where

ω = [(p'_k D_k⁻¹ p_k)(q'_k D_k q_k)/(p'_k q_k)²]^{1/2}.

By the Kantorovich inequality we obtain

ω² ≤ [1 + κ(R_k)]²/4κ(R_k).   (52)

Substituting (52) into (51) and using (48) yields

κ(D_{k+1}) ≤ κ(D_k) · κ(R_k) ≤ κ(D_k) · κ(R₀),   (53)

proving (47).

The condition stated in Proposition 1 applies when we minimize a quadratic function with Hessian H. The proposition indicates that in such a case, if D₀ is the identity matrix and an OCSS update is used, then the condition number of the approximations D_k never exceeds [κ(H)]² and its deterioration rate is no higher than κ(H). To illustrate the virtue of this property we consider an extreme case where the objective function is a hypersphere with Hessian h·I, the initial inverse Hessian approximation is the identity, and no line search is performed (so that the minimum is not found on the first iteration). To satisfy eq. (46), the condition number of D_k would in this case have to remain unity for all k. On the other hand, it can easily be verified that any update of the form given by eq. (4) that uses γ = 1 (as the DFP and BFGS updates do) will generate a D₁ that has one eigenvalue 1/h and the others unity, so that κ(D₁) = h. Even if exact line searches were performed on the subsequent iterations, it would take n steps before κ(D_k) recovers back to unity. If h and the number of variables are large, such behaviour is clearly undesirable.
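The extreme case above is easy to reproduce numerically (our sketch, reusing `phi_update` and `ocss_update` from the earlier sketches; h and n are arbitrary):

```python
import numpy as np

n, h = 10, 100.0
D0 = np.eye(n)                       # initial inverse Hessian approximation
p = np.ones(n)                       # some step p_k
q = h * p                            # q_k = H p_k for the hypersphere H = h*I

D_dfp = phi_update(D0, 1.0, p, q, 0.0)                # DFP: gamma = 1, theta = 0
tau = q @ (D0 @ q)
eps = p @ np.linalg.solve(D0, p)
D_ocss = ocss_update(D0, p, q, np.sqrt(eps / tau))    # self-scaling gamma = 1/h

print(np.linalg.cond(D_dfp))    # h: one eigenvalue 1/h, the rest unity
print(np.linalg.cond(D_ocss))   # 1.0: here D_1 = (1/h) * I exactly
```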

It should be emphasized that though the above example is quite artificial, it illustrates a phenomenon that is typical of situations where commonly used variable metric methods are applied to poorly scaled problems of moderate size. More realistic examples illustrating such phenomena were given in [5] and [6].

4. Duality properties of updates

The concept of duality was first introduced by Fletcher [1], who showed that for γ = 1, φ⁰(·) is the dual of φ¹(·) in the sense that φ⁰(D, 1, p, q) = [φ¹(D⁻¹, 1, q, p)]⁻¹. This result was extended in [5] to an arbitrary γ, in which case φ⁰(D, γ, p, q) = [φ¹(D⁻¹, 1/γ, q, p)]⁻¹. The mapping defined by inverting the update after changing D → D⁻¹, γ → 1/γ, p → q and q → p is usually referred to as a duality transformation.

The following theorem generalizes the above results to the case of an arbitrary θ.

Theorem 4. Given a non-singular n × n matrix D, nonzero vectors p, q ∈ Eⁿ and scalars γ and θ such that γ ≠ 0 and θ ≠ −σ²/(ετ − σ²), there exists a θ̄ such that φ^θ̄(·) is the dual of φ^θ(·), i.e.,

φ^θ(D, γ, p, q) = [φ^θ̄(D⁻¹, 1/γ, q, p)]⁻¹.   (54)

Furthermore, this θ̄ is related to θ through the relation

θ̄ = σ²(1 − θ)/[σ²(1 − θ) + θετ].   (55)

Proof. Transforming the right-hand side of (6) by changing D, γ, p, q, ε, τ, θ into D⁻¹, 1/γ, q, p, τ, ε, θ̄ respectively, and then multiplying the result by the original right-hand side of (6), yields (after some laborious algebra) the identity matrix, proving (54).
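Relations (54)-(55) can be verified numerically; the sketch below (ours, reusing `phi_update`) builds θ̄ from (55) and checks (54) on random data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
D = A @ A.T + 5 * np.eye(5)
p, q = rng.standard_normal(5), rng.standard_normal(5)
if p @ q < 0:
    q = -q

gamma, theta = 0.8, 0.3
sigma, tau = p @ q, q @ (D @ q)
eps = p @ np.linalg.solve(D, p)
theta_bar = sigma**2 * (1 - theta) / (sigma**2 * (1 - theta) + theta * eps * tau)  # (55)

lhs = phi_update(D, gamma, p, q, theta)
rhs = np.linalg.inv(phi_update(np.linalg.inv(D), 1 / gamma, q, p, theta_bar))
assert np.allclose(lhs, rhs)        # eq. (54)
```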

Corollary 4. Let D be a positive definite n × n matrix and p, q ∈ Eⁿ satisfy p'q > 0. If the update φ^θ(D, γ, p, q) is in O(D, p, q), then its dual

φ^θ̄(D⁻¹, 1/γ, q, p) ∈ O(D⁻¹, q, p)   (56)

and its inverse

[φ^θ(D, γ, p, q)]⁻¹ ∈ O(D⁻¹, q, p).   (57)

Proof. Equation (56) follows from the fact that, by (55), θ ∈ [0, 1] implies θ̄ ∈ [0, 1]. To prove (57) we note that changing D, p, q to D⁻¹, q, p also interchanges the values of ε and τ. Thus, (57) follows from (54) and the fact that θ ∈ [0, 1] and γ ∈ [σ/τ, ε/σ] imply θ̄ ∈ [0, 1] and 1/γ ∈ [σ/ε, τ/σ].

Corollary 5. Let D, p, q be as in Corollary 4, and let

θ* = 1/(1 + ω),   (58)

where ω is defined by (37). Then

[φ^θ*(D⁻¹, 1/γ, q, p)]⁻¹ = φ^θ*(D, γ, p, q).   (59)

Proof. Substituting θ = θ* in (55) and using (58) and (37) yields θ̄ = θ*. Thus (59) follows from (54).

According to Corollary 5, the update φ^θ*(D, γ, p, q) is self-dual in the sense that the duality transformation maps this update into itself. Furthermore, since ω > 0, θ* ∈ [0, 1], so that φ^θ*(D, γ, p, q) is in O(D, p, q) for γ ∈ [σ/τ, ε/σ], and is in Fletcher's class [1] if γ = 1.

Proposition 2. OCSS updates are self-dual; that is, for any update φ̄(D, γ, p, q) ∈ Ō(D, p, q) there holds

φ̄(D, γ, p, q) = [φ̄(D⁻¹, 1/γ, q, p)]⁻¹.   (60)

Proof. By (45) and (35), φ̄(D, γ, p, q) = φ^θ(D, γ, p, q), where

θ = σ(ε − γσ)/γ(ετ − σ²).   (61)

Substituting this θ in (55) yields

θ̄ = γσ(τ − σ/γ)/(ετ − σ²).   (62)

Clearly the same θ̄ could have been obtained by applying the duality transformation to the right-hand side of (61).

In view of Theorem 4, updates of D⁻¹ have the same general form as those of D, with D, γ, p, q replaced by D⁻¹, 1/γ, q, p respectively. Thus, the rationale used in obtaining optimally conditioned updates for D can also be applied to obtain such updates for D⁻¹. The resulting class of OCSS updates for D⁻¹ will be Ō(D⁻¹, q, p), and the corresponding bound on the condition number of D is the same as in Theorem 3. Proposition 2 assures us that for any given γ the OCSS update of D⁻¹ is just the inverse of the OCSS update of D. This kind of symmetry is clearly desirable, since it would follow automatically if we actually minimized the condition number of D*. It was not obvious, however, that minimizing a bound on this condition number preserves this symmetry.


A special OCSS update is obtained by choosing θ = θ*, where θ* is defined by (58). The properties of this update are summarized in the following proposition.

Proposition 3. Let D be a positive definite n × n matrix, let p, q ∈ Eⁿ satisfy p'q > 0, and let

Φ(D, p, q) = φ^θ*(D, γ*, p, q),   (63)

where

θ* = 1/[1 + (ετ/σ²)^{1/2}]   (64)

and

γ* = (ε/τ)^{1/2}.   (65)

Then

Φ(D, p, q) ∈ Ō(D, p, q)   (66)

and

Φ(D, p, q) = [Φ(D⁻¹, q, p)]⁻¹.   (67)

Proof. One can easily verify that γ* and θ* satisfy eq. (35), and clearly θ* ∈ [0, 1]. Hence, by Corollary 3, γ* ∈ [σ/τ, ε/σ], proving (66).

Equation (67) follows from (60) and the fact that changing D, p, q to D⁻¹, q, p respectively changes γ* to 1/γ*.

The explicit form of the update Φ(D, p, q) is obtained by substituting γ = γ* in (45). This yields

Φ(D, p, q) = (ε/τ)^{1/2} D + (1/σ(1 + ω)) { (1 + 2ω) pp' − (ε/τ) Dqq'D − (ε/τ)^{1/2} (pq'D + Dqp') },   (68)

where

ω = (ετ/σ²)^{1/2}.


The special feature of this formula is that its inverse is obtained by merely substituting D⁻¹ for D and interchanging ε with τ and p with q.
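A direct implementation of (68), together with a numerical check of the self-duality property (67), might look as follows (our sketch, with arbitrary names):

```python
import numpy as np

def self_dual_update(D, p, q):
    """The self-dual OCSS update Phi(D, p, q) of eq. (68)."""
    Dq = D @ q
    sigma, tau = p @ q, q @ Dq
    eps = p @ np.linalg.solve(D, p)
    omega = np.sqrt(eps * tau) / sigma          # eq. (37)
    c = 1.0 / (sigma * (1.0 + omega))
    return (np.sqrt(eps / tau) * D
            + c * ((1 + 2 * omega) * np.outer(p, p)
                   - (eps / tau) * np.outer(Dq, Dq)
                   - np.sqrt(eps / tau) * (np.outer(p, Dq) + np.outer(Dq, p))))

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 5))
D = A @ A.T + 5 * np.eye(5)
p, q = rng.standard_normal(5), rng.standard_normal(5)
if p @ q < 0:
    q = -q
lhs = self_dual_update(D, p, q)
rhs = np.linalg.inv(self_dual_update(np.linalg.inv(D), q, p))
assert np.allclose(lhs, rhs)                    # eq. (67)
```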

5. Switching rules and numerical experiments

Restricting ourselves to the class of OCSS updates still leaves the freedom to select any value of γ ∈ [σ/τ, ε/σ]. One reasonable strategy for choosing γ, first used in [8], is to select the γ of maximum proximity to unity within the acceptable interval. Using this strategy in the update defined by eq. (45) is equivalent to selecting the parameters in (4) according to the following switching rule.

Switch 1. If ε/σ ≤ 1, choose γ = ε/σ, θ = 0; if σ/τ ≥ 1, choose γ = σ/τ, θ = 1; if σ/τ ≤ 1 ≤ ε/σ, choose γ = 1, θ = σ(ε − σ)/(ετ − σ²).

An unpleasing property of the above switch, which is also shared by Oren's [8] and Fletcher's [1] switches, is that when applied to the update of D⁻¹ these switches yield values of θ that differ from the ones obtained for the update of D. In fact, one can easily verify that any switch based on the values ε, τ, σ that selects θ and γ as the parameters for the update of D would choose θ̄ and 1/γ for the corresponding update of D⁻¹, where θ̄ is related to θ through (55). This follows from Theorem 4 and the fact that the update of the inverse equals the inverse of the update.

The only update that avoids the aforementioned deficiency is Φ(D, p, q), defined by eq. (68). This update is equivalent to using (4) with the following switch:

Switch 2. γ = (ε/τ)^{1/2}, θ = 1/[1 + (τε/σ²)^{1/2}].

It is interesting to note that this choice results in values of θ which never exceed ½ (since ω ≥ 1 by the Schwarz inequality), and approach this value asymptotically.
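For completeness, Switches 1 and 2 can be expressed as a single parameter-selection routine (our sketch; it returns the pair (γ, θ) to be used in update (4)):

```python
import numpy as np

def select_parameters(D, p, q, switch=1):
    """Switches 1 and 2 for choosing (gamma, theta) in update (4)."""
    Dq = D @ q
    sigma, tau = p @ q, q @ Dq
    eps = p @ np.linalg.solve(D, p)
    if switch == 1:                   # gamma as close to 1 as the interval allows
        if eps / sigma <= 1.0:
            return eps / sigma, 0.0
        if sigma / tau >= 1.0:
            return sigma / tau, 1.0
        return 1.0, sigma * (eps - sigma) / (eps * tau - sigma**2)
    # Switch 2: the self-dual choice of Proposition 3
    return np.sqrt(eps / tau), 1.0 / (1.0 + np.sqrt(tau * eps) / sigma)
```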

In our numerical experiments we used update (4) with Switches 1 and 2. The results, which are given in Table 1, were compared with those reported in [8] for the following two switches.


Switch 3. If ε/σ ≤ 1, choose γ = ε/σ, θ = 0; if σ/τ ≥ 1, choose γ = σ/τ, θ = 1; if σ/τ ≤ 1 ≤ ε/σ, choose γ = 1, θ = σ(τ − σ)/(ετ − σ²).

Switch 4. γ = ε/τ, θ = ½.

It is interesting to note that though Switches 1 and 3 are based on different rationales, they are almost the same, the only difference being in the parameter θ corresponding to γ = 1.

The step size selection procedure used in these experiments was the same as the one described in [8]. In this procedure the initial step size at every iteration k is chosen as min[1, β/‖p_k‖] and modified by a "cubic bracketing" line search procedure if Goldstein's [2] criterion is not satisfied. The parameter β is used to bound the initial step size whenever this is essential. Further details of these procedures are described in [8]. The results in Table 1 are given for three values of the parameter δ in the Goldstein test. The value δ = 0.5 enforces a line search at every iteration, since with this value the Goldstein test cannot be met. On the other hand, with δ = 0.01 the test will be satisfied by the initial unit step size on most of the iterations, resulting in very few line searches. The parameter β was chosen very large (10⁵) for most of the test problems, so that it had no effect; however, for the function EXP 5 and Powell's function it became necessary to restrict the step size in order to avoid failure of the algorithm. The results given in Table 1 for these two functions were obtained with β = 10 for EXP 5 and β = 0.1 for Powell's function. Most of the test problems used here are the same ones used in [8] and their details will hence be omitted. The only new function tested is the Hilbert quadratic, defined as

f(x) = Σ_{i=1}^{N} Σ_{j=1}^{N} (x_i − 1)(x_j − 1)/(i + j − 1),

whose minimum is at x* = (1, 1, 1, ..., 1) with f(x*) = 0. In our test we used N = 16 and starting values x⁰_k = −4/k.
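For reference, the Hilbert quadratic and the starting point just described can be set up as follows (our sketch):

```python
import numpy as np

N = 16
idx = np.arange(1, N + 1)
H = 1.0 / (idx[:, None] + idx[None, :] - 1)    # Hilbert matrix, H_ij = 1/(i+j-1)

def f(x):
    """f(x) = (x - 1)' H (x - 1); minimum 0 at x* = (1, ..., 1)."""
    r = x - 1.0
    return r @ H @ r

def grad(x):
    return 2.0 * H @ (x - 1.0)

x0 = -4.0 / idx                                # starting values x_k = -4/k
```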

The results given in Table 1 indicate that there is no significant dif- ference in the performance of the four switches.


[Table 1. Comparative numerical results for Switches 1–4; the entries are not legible in this copy.]


6. Conclusions

In this paper we introduced a theoretical bound on the rate of deterioration of the condition number of the inverse Hessian approximations generated by a wide class of rank-two updates. Minimizing this bound leads to a natural selection of parameters in the class of self-scaling updates. It was shown that with this selection the resulting updates are "self-dual".

Two switches based on these theoretical considerations have been tested in comparison with two other switches that produced the most favorable results in [8]. The numerical results do not indicate any significant difference in the performance of these four switches. This does not imply that the performance of SSVM algorithms is insensitive to the choice of parameters. Indeed, it was demonstrated in [8] that different parameter selections can cause wide variations in performance. It is conceivable, however, that the performance has a "flat" optimum and the four switches mentioned above produce values of the parameters within this flat region.

References

[1] R. Fletcher, "A new approach to variable metric algorithms", The Computer Journal 13 (1970) 317–322.

[2] A.A. Goldstein, "On steepest descent", SIAM Journal on Control 3 (1965) 147–151.

[3] H.Y. Huang, "Unified approach to quadratically convergent algorithms for function minimization", Journal of Optimization Theory and Applications 5 (1970) 405–423.

[4] S.S. Oren, "Self-scaling variable metric algorithms for unconstrained minimization", Ph.D. thesis, Department of Engineering-Economic Systems, Stanford University, Stanford, Calif., 1972.

[5] S.S. Oren and D.G. Luenberger, "Self-scaling variable metric (SSVM) algorithms I: criteria and sufficient conditions for scaling a class of algorithms", Management Science 20 (1974) 845–862.

[6] S.S. Oren, "Self-scaling variable metric (SSVM) algorithms II: implementation and experiments", Management Science 20 (1974) 863–874.

[7] S.S. Oren, "Self-scaling variable metric algorithms without line search for unconstrained minimization", Mathematics of Computation 27 (1973) 873–885.

[8] S.S. Oren, "On the selection of parameters in self-scaling variable metric algorithms", PARC Memo Rept. ARG MR# 73-8 (presented at the 8th International Symposium on Mathematical Programming, Stanford, August 1973).

[9] D.F. Shanno and P.C. Kettler, "Optimal conditioning of quasi-Newton methods", Mathematics of Computation 24 (1970) 657–667.

[10] E. Spedicato, "Stability of Huang's update for the conjugate gradient method", Journal of Optimization Theory and Applications 11 (1973) 469–479.

[11] E. Spedicato, "On condition numbers of matrices in rank two minimization algorithms", CISE Rept., CISE, Segrate, Italy.

[12] E. Spedicato, "A bound on the condition number of rank-two corrections and applications to the variable metric methods", Mathematics of Computation, to appear.