
Appendix C

Parameter Optimization Methods

In this appendix classical necessary and sufficient conditions for the solution of unconstrained and equality-constrained parameter optimization problems are summarized. We also summarize two iterative techniques for unconstrained minimization, and discuss the relative merits of these approaches.

    C.1 Unconstrained Extrema

Suppose we wish to determine a vector x that minimizes (or maximizes) the following loss function:

$J(\mathbf{x})$  (C.1)

with $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T$. Without loss of generality, we assume our task is to minimize eqn. (C.1). It is evident that a simple change of sign converts a maximization problem to a minimization problem. To obtain the most fundamental classical results, we restrict initial attention to $J$ and $\mathbf{x}$ of class $C^2$ (smooth, continuous functions having two continuous derivatives with respect to all arguments). Using the matrix calculus differentiation rules developed in §A.5, it follows that a stationary or critical point can be determined by solving the following necessary condition:

$\left.\dfrac{\partial J}{\partial \mathbf{x}}\right|_{\mathbf{x}^*} = \mathbf{0}$  (C.2)

where $\partial J/\partial \mathbf{x}$ is the Jacobian (see Appendix A). Unfortunately, satisfying the condition in eqn. (C.2) does not guarantee a local minimum in general. If $x$ is a scalar, then the classic test for a local minimum is to check the second derivative of $J$, which must be positive. This concept can be expanded to a vector of unknown variables by using a matrix check.^{1,2} The sufficiency condition requires that one determine the definiteness of the matrix of second partial derivatives, known as the Hessian matrix (see Appendix A). Suppose we have a stationary point, denoted by $\mathbf{x}^*$. This point is a local minimum if the following sufficient condition is satisfied:

$\nabla^2_{\mathbf{xx}} J \equiv \left.\dfrac{\partial^2 J}{\partial \mathbf{x}\,\partial \mathbf{x}^T}\right|_{\mathbf{x}^*}$ must be positive definite  (C.3)

© 2004 by CRC Press LLC

Optimal Estimation of Dynamic Systems

where $\nabla^2_{\mathbf{xx}} J$ is the Hessian (see Appendix A). If this matrix is negative definite, then the point is a maximum. If the matrix is indefinite, then a saddle point exists, which corresponds to a relative minimum or maximum with respect to the individual components of x. A global minimum is much more difficult to establish, though. Consider the minimization of the following function (known as Himmelblau's function):^3

$J(\mathbf{x}) = \big(x_1^2 + x_2 - 11\big)^2 + \big(x_1 + x_2^2 - 7\big)^2$  (C.4)

A plot of the contours (lines of constant $J$) is shown in Figure C.1. Also shown in this plot are the numerical iteration points for the method of gradients (see §C.3.2) from various starting guesses. There is a set of four stationary points which provide local minimums, each of approximately the same importance:

$\mathbf{x}_1^* = [3 \;\; 2]^T$, with $J(\mathbf{x}_1^*) = 0$.

$\mathbf{x}_2^* = [-3.7792 \;\; -3.2831]^T$, with $J(\mathbf{x}_2^*) = 0.0054$.

$\mathbf{x}_3^* = [-2.8051 \;\; 3.1313]^T$, with $J(\mathbf{x}_3^*) = 0.0085$.

$\mathbf{x}_4^* = [3.5843 \;\; -1.8483]^T$, with $J(\mathbf{x}_4^*) = 0.0011$.

Clearly, a numerical technique such as the method of gradients can converge to any one of these four points from various starting guesses. Fortunately, a resourceful analyst can often achieve a high degree of confidence that a stationary point is a global minimum through intimate knowledge of the loss function (e.g., the Hessian matrix for a quadratic loss function is constant).
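These stationary points can be checked numerically. The sketch below (not part of the original text) evaluates eqn. (C.4) and its analytic gradient at the four quoted points; the gradient nearly vanishes at each, with the small residuals due to the rounded coordinates.

```python
def himmelblau(x1, x2):
    """Himmelblau's function, eqn. (C.4)."""
    return (x1**2 + x2 - 11)**2 + (x1 + x2**2 - 7)**2

def himmelblau_grad(x1, x2):
    """Analytic gradient of eqn. (C.4) via the chain rule."""
    dJ_dx1 = 4*x1*(x1**2 + x2 - 11) + 2*(x1 + x2**2 - 7)
    dJ_dx2 = 2*(x1**2 + x2 - 11) + 4*x2*(x1 + x2**2 - 7)
    return dJ_dx1, dJ_dx2

# the four local minima quoted in the text (rounded coordinates)
minima = [(3.0, 2.0), (-3.7792, -3.2831), (-2.8051, 3.1313), (3.5843, -1.8483)]
for p in minima:
    g1, g2 = himmelblau_grad(*p)
    assert abs(g1) < 0.05 and abs(g2) < 0.05   # stationary to rounding accuracy
    assert himmelblau(*p) < 0.01               # loss is essentially zero
```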

Example C.1: In this example we consider finding the extreme points of the following loss function:^1

$J(\mathbf{x}) = x_1^3 + x_2^3 + 2x_1^2 + 4x_2^2 + 6$

The necessary conditions for $x_1$ and $x_2$, given by eqn. (C.2), are

$\dfrac{\partial J}{\partial x_1} = x_1(3x_1 + 4) = 0$, $\quad \dfrac{\partial J}{\partial x_2} = x_2(3x_2 + 8) = 0$

These equations are satisfied at the following stationary points:

$\mathbf{x}_1^* = [0 \;\; 0]^T$, $\quad \mathbf{x}_2^* = [0 \;\; -8/3]^T$

$\mathbf{x}_3^* = [-4/3 \;\; 0]^T$, $\quad \mathbf{x}_4^* = [-4/3 \;\; -8/3]^T$

The Hessian matrix is given by

$\nabla^2_{\mathbf{xx}} J = \begin{bmatrix} 6x_1 + 4 & 0 \\ 0 & 6x_2 + 8 \end{bmatrix}$


Figure C.1: Himmelblau's Function

Table C.1 gives the nature of the Hessian and the value of the loss function at the stationary points. The first point gives a local minimum, the next two points are saddle points, and the last point gives a local maximum.
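Because the Hessian in this example is diagonal, the classification in Table C.1 follows directly from the signs of its two diagonal entries. A minimal sketch of that check (not from the text):

```python
def classify(x1, x2):
    """Classify a stationary point of Example C.1 from the diagonal Hessian."""
    h11, h22 = 6*x1 + 4, 6*x2 + 8   # diagonal entries of the Hessian
    if h11 > 0 and h22 > 0:
        return "relative minimum"
    if h11 < 0 and h22 < 0:
        return "relative maximum"
    return "saddle point"

def loss(x1, x2):
    """Loss function of Example C.1."""
    return x1**3 + x2**3 + 2*x1**2 + 4*x2**2 + 6

assert classify(0.0, 0.0) == "relative minimum"
assert classify(0.0, -8/3) == "saddle point"
assert classify(-4/3, 0.0) == "saddle point"
assert classify(-4/3, -8/3) == "relative maximum"
```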

    C.2 Equality Constrained Extrema

One often encounters problems that require extremizing

$J(\mathbf{x})$  (C.5)

subject to the following set of $m \times 1$ equality constraints:

$\boldsymbol{\psi}(\mathbf{x}) = \mathbf{0}$  (C.6)


Table C.1: Nature of the Hessian and Values for the Loss Function

  Point $\mathbf{x}_i^*$                  Nature of $\nabla^2_{\mathbf{xx}} J|_{\mathbf{x}_i^*}$   Nature of $\mathbf{x}_i^*$   $J(\mathbf{x}_i^*)$
  $\mathbf{x}_1^* = [0 \;\; 0]^T$         Positive Definite                                        Relative Minimum             6
  $\mathbf{x}_2^* = [0 \;\; -8/3]^T$      Indefinite                                               Saddle Point                 418/27
  $\mathbf{x}_3^* = [-4/3 \;\; 0]^T$      Indefinite                                               Saddle Point                 194/27
  $\mathbf{x}_4^* = [-4/3 \;\; -8/3]^T$   Negative Definite                                        Relative Maximum             50/3

where $m < n$. Let us consider the case where $n = 2$ and $m = 1$. Suppose $(x_1^*, x_2^*)$ locally minimizes eqn. (C.5) while satisfying eqn. (C.6). If this is true, then arbitrary admissible differential variations $(\delta x_1, \delta x_2)$ in the differential neighborhood of $(x_1^*, x_2^*)$, in the sense $(x_1, x_2) = (x_1^* + \delta x_1, x_2^* + \delta x_2)$, result in a stationary value of $J$:

$dJ = \dfrac{\partial J}{\partial x_1}\,\delta x_1 + \dfrac{\partial J}{\partial x_2}\,\delta x_2 = 0$  (C.7)

Since we restrict attention to neighboring points that satisfy the constraint given by eqn. (C.6), we also require the first variation of the constraint to vanish as a condition on the admissibility of $(\delta x_1, \delta x_2)$:

$d\psi = \dfrac{\partial \psi}{\partial x_1}\,\delta x_1 + \dfrac{\partial \psi}{\partial x_2}\,\delta x_2 = 0$  (C.8)

For notational convenience, we suppress the truth that all partials in eqns. (C.7) and (C.8) are evaluated at $(x_1^*, x_2^*)$. Since eqn. (C.8) constrains the admissible variations, we can solve for either variable and eliminate the constraint equation. The two solutions of the constraint equation are obviously

$\delta x_1 = -\dfrac{\partial \psi/\partial x_2}{\partial \psi/\partial x_1}\,\delta x_2 \quad\text{and}\quad \delta x_2 = -\dfrac{\partial \psi/\partial x_1}{\partial \psi/\partial x_2}\,\delta x_1$  (C.9)

Substitution of the differential eliminations into the differential of the loss function allows us to locally constrain the variations of $J$ and reduce the dimensionality in either of two ways. The first way is given by using

$dJ = \left[\dfrac{\partial J}{\partial x_2} - \dfrac{\partial J}{\partial x_1}\dfrac{\partial \psi/\partial x_2}{\partial \psi/\partial x_1}\right]\delta x_2 = 0$  (C.10)


The second way is given by using

$dJ = \left[\dfrac{\partial J}{\partial x_1} - \dfrac{\partial J}{\partial x_2}\dfrac{\partial \psi/\partial x_1}{\partial \psi/\partial x_2}\right]\delta x_1 = 0$  (C.11)

It is evident that either of eqns. (C.10) or (C.11) can be used to argue that the local variations are arbitrary and the coefficient within the brackets must vanish as a necessary condition for a local minimum at $(x_1^*, x_2^*)$. The first form of the necessary conditions is given by

$\dfrac{\partial J}{\partial x_1} - \dfrac{\partial J}{\partial x_2}\dfrac{\partial \psi/\partial x_1}{\partial \psi/\partial x_2} = 0$  (C.12a)

$\psi(x_1, x_2) = 0$  (C.12b)

The second form of the necessary conditions is given by

$\dfrac{\partial J}{\partial x_2} - \dfrac{\partial J}{\partial x_1}\dfrac{\partial \psi/\partial x_2}{\partial \psi/\partial x_1} = 0$  (C.13a)

$\psi(x_1, x_2) = 0$  (C.13b)

When this approach is carried to higher dimensions, the number of differential elimination possibilities is obviously much greater, and some of these forms of the necessary conditions may be poorly conditioned if the partial derivative in the denominator approaches zero.

Lagrange noticed a pattern in the above and decided to automate all possible differential eliminations by linearly combining eqns. (C.7) and (C.8) with an unspecified scalar Lagrange multiplier $\lambda$ as

$dJ + \lambda\, d\psi = \left[\dfrac{\partial J}{\partial x_1} + \lambda\dfrac{\partial \psi}{\partial x_1}\right]\delta x_1 + \left[\dfrac{\partial J}{\partial x_2} + \lambda\dfrac{\partial \psi}{\partial x_2}\right]\delta x_2 = 0$  (C.14)

While it isn't legal to set the two brackets to zero using the argument that $(\delta x_1, \delta x_2)$ are independent, we can set either one of the brackets to zero to determine $\lambda$. Notice that setting the first bracket to zero and substituting the resulting equation $\lambda = -(\partial J/\partial x_1)/(\partial \psi/\partial x_1)$ into the second bracket renders the second bracket equal to eqn. (C.13a), whereas setting the second bracket to zero, solving for $\lambda$, and substituting renders the first bracket equal to eqn. (C.12a). Thus the following generalized Lagrange form of the necessary conditions captures all possible differential constraint eliminations (only two in this case):

$\dfrac{\partial J}{\partial x_1} + \lambda\dfrac{\partial \psi}{\partial x_1} = 0$  (C.15a)

$\dfrac{\partial J}{\partial x_2} + \lambda\dfrac{\partial \psi}{\partial x_2} = 0$  (C.15b)

$\psi(x_1, x_2) = 0$  (C.15c)

It is apparent by inspection of eqn. (C.15) that these equations are the gradient of the augmented function $\phi = J + \lambda\psi$ with respect to $(x_1, x_2, \lambda)$, and thus the Lagrange multiplier rule is validated. The necessary conditions for a constrained minimum of eqn. (C.5) subject to eqn. (C.6) have the form of an unconstrained minimum of the augmented function:

$\dfrac{\partial \phi}{\partial x_1} = \dfrac{\partial J}{\partial x_1} + \lambda\dfrac{\partial \psi}{\partial x_1} = 0$  (C.16a)

$\dfrac{\partial \phi}{\partial x_2} = \dfrac{\partial J}{\partial x_2} + \lambda\dfrac{\partial \psi}{\partial x_2} = 0$  (C.16b)

$\psi(x_1, x_2) = 0$  (C.16c)

Equations (C.16) provide three equations; all points $(x_1, x_2, \lambda)$ satisfying these equations are constrained stationary points.

Expanding this concept to the general case results in the necessary conditions for a stationary point, obtained by applying the unconstrained necessary condition of eqn. (C.2) to the following augmented function:

$\phi(\mathbf{x}, \boldsymbol{\lambda}) = J(\mathbf{x}) + \boldsymbol{\lambda}^T \boldsymbol{\psi}(\mathbf{x})$  (C.17)

The necessary conditions are now given by

$\dfrac{\partial \phi}{\partial \mathbf{x}} = \dfrac{\partial J}{\partial \mathbf{x}} + \boldsymbol{\lambda}^T \dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{x}} = \mathbf{0}$  (C.18a)

$\dfrac{\partial \phi}{\partial \boldsymbol{\lambda}} = \boldsymbol{\psi}(\mathbf{x}) = \mathbf{0}$  (C.18b)

where $\boldsymbol{\lambda}$ is an $m \times 1$ vector of Lagrange multipliers. The $(n + m)$ equations shown in eqn. (C.18), which define the Lagrange multiplier rule, are solved for the $(n + m)$ unknowns $\mathbf{x}$ and $\boldsymbol{\lambda}$. Suppose we have a stationary point, denoted by $\mathbf{x}^*$, with a corresponding Lagrange multiplier $\boldsymbol{\lambda}^*$. The point $\mathbf{x}^*$ is a local minimum if the following sufficient condition is satisfied:

$\nabla^2_{\mathbf{xx}} \phi \equiv \left.\dfrac{\partial^2 \phi}{\partial \mathbf{x}\,\partial \mathbf{x}^T}\right|_{(\mathbf{x}^*, \boldsymbol{\lambda}^*)}$ must be positive definite.  (C.19)

The sufficient condition can be simplified by checking the positive definiteness of a matrix that is always smaller than the $n \times n$ matrix shown in eqn. (C.19). Let us rewrite the loss function in eqn. (C.5) as

$J(x_1, \ldots, x_m, x_{m+1}, \ldots, x_n) \equiv J(\mathbf{y}, \mathbf{z})$  (C.20)


where $\mathbf{y}$ is an $m \times 1$ vector and $\mathbf{z}$ is a $p \times 1$ vector (with $p = n - m$). The necessary conditions are still given by eqn. (C.18) with $\mathbf{x} = [\mathbf{y}^T \;\; \mathbf{z}^T]^T$. But the sufficient condition can now be determined by checking the definiteness of the following $p \times p$ matrix:^2

$Q = \left[ \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{z}}\right)^T \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{y}}\right)^{-T} \dfrac{\partial^2 \phi}{\partial \mathbf{y}\,\partial \mathbf{y}^T} \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{y}}\right)^{-1} \dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{z}} + \dfrac{\partial^2 \phi}{\partial \mathbf{z}\,\partial \mathbf{z}^T} - \dfrac{\partial^2 \phi}{\partial \mathbf{z}\,\partial \mathbf{y}^T} \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{y}}\right)^{-1} \dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{z}} - \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{z}}\right)^T \left(\dfrac{\partial \boldsymbol{\psi}}{\partial \mathbf{y}}\right)^{-T} \dfrac{\partial^2 \phi}{\partial \mathbf{y}\,\partial \mathbf{z}^T} \right]\Bigg|_{(\mathbf{y}^*, \mathbf{z}^*, \boldsymbol{\lambda}^*)}$  (C.21)

where $\partial^2\phi/\partial\mathbf{z}\,\partial\mathbf{y}^T$ and $\partial^2\phi/\partial\mathbf{y}\,\partial\mathbf{z}^T$ are $p \times m$ and $m \times p$ matrices, respectively, made up of the second partial derivatives with respect to $\mathbf{y}$ and $\mathbf{z}$. A stationary point is a local minimum (maximum) if $Q$ is positive (negative) definite. Note that the inverse of an $m \times m$ matrix must be taken. Still, the matrix in eqn. (C.21) is usually simpler to check than the $n \times n$ matrix in eqn. (C.19).

Example C.2: In this example we consider finding the extreme points of the following loss function, which represents a plane:

$J = 6 - \dfrac{y}{2} - \dfrac{z}{3}$

subject to a constraint represented by an elliptic cylinder:

$\psi(\mathbf{x}) = 9(y-4)^2 + 4(z-5)^2 - 36 = 0$

where $\mathbf{x} = [y \;\; z]^T$. The augmented function of eqn. (C.17) for this problem is given by

$\phi(\mathbf{x}, \lambda) = 6 - \dfrac{y}{2} - \dfrac{z}{3} + \lambda\left[9(y-4)^2 + 4(z-5)^2 - 36\right]$

From the necessary conditions of eqn. (C.18) we have

$\dfrac{\partial \phi}{\partial y} = -\dfrac{1}{2} + 18\lambda(y-4) = 0$

$\dfrac{\partial \phi}{\partial z} = -\dfrac{1}{3} + 8\lambda(z-5) = 0$

$\psi(\mathbf{x}) = 9(y-4)^2 + 4(z-5)^2 - 36 = 0$

Solving these equations for $\lambda$ gives $\lambda = \pm 1/(36\sqrt{2})$. Therefore, the stationary points are given by

$y^* = 4 + \dfrac{1}{36\lambda} = 4 \pm \sqrt{2}$

$z^* = 5 + \dfrac{1}{24\lambda} = 5 \pm \dfrac{3}{2}\sqrt{2}$

$\lambda = \pm\dfrac{1}{36\sqrt{2}}$


Figure C.2: The Directional Derivative Concept

The sufficient condition of eqn. (C.19) for this problem is given by

$\nabla^2_{\mathbf{xx}}\phi = \begin{bmatrix} 18\lambda & 0 \\ 0 & 8\lambda \end{bmatrix}$

Also, eqn. (C.21) gives

$Q = q = 8\lambda\left[\dfrac{4(z-5)^2}{9(y-4)^2} + 1\right]$

Clearly, if $\lambda = +1/(36\sqrt{2})$ then the stationary point given by $y^* = 4 + \sqrt{2}$ and $z^* = 5 + \frac{3}{2}\sqrt{2}$ is a local minimum with $J = \frac{7}{3} - \sqrt{2}$. Likewise, if $\lambda = -1/(36\sqrt{2})$ then the stationary point given by $y^* = 4 - \sqrt{2}$ and $z^* = 5 - \frac{3}{2}\sqrt{2}$ is a local maximum with $J = \frac{7}{3} + \sqrt{2}$.
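The algebra of Example C.2 can be verified numerically. The following sketch (not from the text) checks that both candidate points satisfy the three necessary conditions of eqn. (C.18) and attain the quoted loss values:

```python
import math

def J(y, z):
    """Loss function of Example C.2 (a plane)."""
    return 6 - y/2 - z/3

def necessary_conditions(y, z, lam):
    """Residuals of the necessary conditions, eqn. (C.18), for Example C.2."""
    dphi_dy = -0.5 + 18*lam*(y - 4)
    dphi_dz = -1/3 + 8*lam*(z - 5)
    psi = 9*(y - 4)**2 + 4*(z - 5)**2 - 36
    return dphi_dy, dphi_dz, psi

s2 = math.sqrt(2)
for sign in (+1, -1):
    lam = sign/(36*s2)
    y, z = 4 + sign*s2, 5 + sign*1.5*s2
    # all three residuals vanish (to rounding) at each stationary point
    assert all(abs(r) < 1e-12 for r in necessary_conditions(y, z, lam))

# loss values at the constrained minimum and maximum
assert abs(J(4 + s2, 5 + 1.5*s2) - (7/3 - s2)) < 1e-12
assert abs(J(4 - s2, 5 - 1.5*s2) - (7/3 + s2)) < 1e-12
```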

    C.3 Nonlinear Unconstrained Optimization

In this section two iterative methods are shown that can be used to solve nonlinear unconstrained optimization problems. Several other approaches exist for solving these problems numerically, but they are beyond the scope of the present text. The interested reader is encouraged to pursue other approaches in the open literature (e.g., see Refs. [1] and [3]).


    C.3.1 Some Geometrical Insights

Consider the function $J(x, y)$ of two variables whose contours are sketched in Figure C.2. From the geometry of Figure C.2 it is evident that

$\tan\theta = \dfrac{dy}{dx}$  (C.22a)

$\sin\theta = \dfrac{dy}{ds}$  (C.22b)

$\cos\theta = \dfrac{dx}{ds}$  (C.22c)

For arbitrary small displacements $(dx, dy)$ away from the current point $(x_c, y_c)$, the differential change in $J$ is given by

$dJ = \left.\dfrac{\partial J}{\partial x}\right|_c dx + \left.\dfrac{\partial J}{\partial y}\right|_c dy$  (C.23)

If $s$ is the distance measured along an arbitrary line through $c$, then the rate of change (directional derivative) of $J$ in the direction of the line is

$\left.\dfrac{dJ}{ds}\right|_c = \left.\dfrac{\partial J}{\partial x}\right|_c \left.\dfrac{dx}{ds}\right|_c + \left.\dfrac{\partial J}{\partial y}\right|_c \left.\dfrac{dy}{ds}\right|_c$  (C.24)

Making use of eqns. (C.22b) and (C.22c), we have

$\left.\dfrac{dJ}{ds}\right|_c = \left.\dfrac{\partial J}{\partial x}\right|_c \cos\theta + \left.\dfrac{\partial J}{\partial y}\right|_c \sin\theta$  (C.25)

Now, let's look at a couple of particularly interesting cases. Suppose we wish to select the particular line for which $dJ/ds|_c = 0$. Equation (C.25) tells us that the angle $\theta_1 = \theta$ orienting this line is given by

$\tan\theta_1 = -\dfrac{\partial J/\partial x|_c}{\partial J/\partial y|_c}$  (C.26)

which gives the contour direction. Now let's also find the particular direction $\theta$ which results in the minimum or maximum $dJ/ds|_c$. The necessary condition for the extremum of $dJ/ds|_c$ requires

$\dfrac{d}{d\theta}\left(\left.\dfrac{dJ}{ds}\right|_c\right) = -\left.\dfrac{\partial J}{\partial x}\right|_c \sin\theta + \left.\dfrac{\partial J}{\partial y}\right|_c \cos\theta = 0$  (C.27)

From eqn. (C.27) the angle $\theta_2 = \theta$ which orients the direction of steepest descent or steepest ascent is given by

$\tan\theta_2 = \dfrac{\partial J/\partial y|_c}{\partial J/\partial x|_c}$  (C.28)


Figure C.3: Geometrical Interpretation of the Gradient Line

which gives the gradient direction. Notice that $(\tan\theta_1)(\tan\theta_2) = -1$. Therefore, $\theta_1$ and $\theta_2$ orient lines that are perpendicular. So, the contour line is perpendicular to the gradient line, as shown in Figure C.3. These geometrical concepts are difficult to conceptualize rigorously in higher-dimensional spaces, but fortunately the mathematics does generalize rigorously and in a straightforward fashion.
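This perpendicularity is easy to check numerically. The sketch below (not from the text, using a hypothetical quadratic test function) computes tan θ₁ from eqn. (C.26) and tan θ₂ from eqn. (C.28) and confirms their product is −1 wherever both partials are nonzero:

```python
def grad(f, x, y, h=1e-6):
    """Central-difference gradient of a scalar function of two variables."""
    fx = (f(x + h, y) - f(x - h, y)) / (2*h)
    fy = (f(x, y + h) - f(x, y - h)) / (2*h)
    return fx, fy

def slopes(f, x, y):
    """tan(theta1) from eqn. (C.26) and tan(theta2) from eqn. (C.28)."""
    fx, fy = grad(f, x, y)
    return -fx/fy, fy/fx

# hypothetical test function with elliptical contours
f = lambda x, y: (x - 1)**2 + 3*(y + 2)**2
t1, t2 = slopes(f, 0.0, 0.0)
assert abs(t1*t2 + 1) < 1e-9   # perpendicular lines: product of slopes is -1
```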

C.3.2 Methods of Gradients

One immediate conclusion of the foregoing is that, based only upon first-derivative information, the most favorable direction in which to take a small step toward minimizing (or maximizing) the function is down (or up) the locally evaluated gradient of $J$. The method of gradients (also known as the method of steepest descent for minimizing $J$, or the method of steepest ascent for maximizing $J$) is a sequence of one-dimensional searches along the lines established by successively evaluated local gradients of $J$. Consider $J$ to be a function of the $n$ variables which are the elements of $\mathbf{x}$. Let the local evaluations be denoted by superscripts. For example,

$J^{(k)} = J\big(\mathbf{x}^{(k)}\big)$  (C.29)

denotes $J(\mathbf{x})$ evaluated at the $k$th set of $\mathbf{x}$-values. The $k$th one-dimensional search determines a scalar $\alpha^{(k)}$ such that

$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha^{(k)}\left[\dfrac{\partial J}{\partial \mathbf{x}}\right]^{T(k)}$  (C.30)

results in

$J^{(k+1)} = J\big(\mathbf{x}^{(k+1)}\big)$  (C.31)

being a local minimum or maximum. The one-dimensional search for $\alpha^{(k)}$ can be determined analytically or numerically using various methods (see Refs. [1] and [3]).


It is important to develop a geometrical feel for the method of gradients to understand the circumstances under which it works best, to anticipate failures, and to decide upon remedial action when failure occurs. Sequences of iterations from various starting guesses for Himmelblau's function are shown in Figure C.1. Observe the orthogonality of successive gradients. The successive gradients will be exactly orthogonal only if the one-dimensional minima or maxima are perfectly located. Note, for the case of two unknowns only one gradient calculation may be necessary, since all successive gradients are either parallel or perpendicular to the first. However, this orthogonality condition is obviously insufficient to establish the gradient directions for the case of three or more unknowns (e.g., for three unknowns there exists a plane that is perpendicular to the gradient vector).

The convergence of the gradient method is heavily dependent upon the circularity of the contours (see Figure C.5 for a function with nonlinear trenches). As an aside, in 3-space the contours most desired are spherical surfaces; in $n$-space the contours most desired are hyperspheres. Also, the gradient method often converges rapidly for the first few iterations (far from the solution), but is usually a very poor algorithm during the final iterations. For any function with non-spherical contours, the number of iterations to converge exactly is generally unbounded. Satisfactory convergence accuracy often requires an unacceptably large number of one-dimensional searches. This can be overcome by using the Levenberg-Marquardt algorithm shown in §1.6.3, which combines the least squares differential correction process with a gradient search.

Example C.3: In this example the method of gradients is used to determine the minimum of the following quadratic function:

$J(\mathbf{x}) = 4x_1^2 + 3x_2^2 - 4x_1x_2 + x_1$

The starting guess is given by $\mathbf{x}^{(0)} = [-1 \;\; 3]^T$. A plot of the iterations superimposed on the contours is shown in Figure C.4. This function has low-eccentricity contours, with the minimum at $\mathbf{x}^* = [-3/16 \;\; -1/8]^T$. The Hessian matrix is constant and symmetric for this function:

$\nabla^2_{\mathbf{xx}} J = \begin{bmatrix} 8 & -4 \\ -4 & 6 \end{bmatrix}$

The eigenvalues of this matrix are all positive, which indicates that the function is well behaved. The iterations are given by

$\mathbf{x}^{(1)} = [0.7576 \;\; 0.9649]^T$

$\mathbf{x}^{(2)} = [-0.2456 \;\; 0.1003]^T$

$\mathbf{x}^{(3)} = [-0.1192 \;\; -0.0462]^T$

$\mathbf{x}^{(4)} = [-0.1917 \;\; -0.1088]^T$

$\mathbf{x}^{(5)} = [-0.1826 \;\; -0.1194]^T$


Figure C.4: Minimization of a Quadratic Loss Function

$\mathbf{x}^{(6)} = [-0.1878 \;\; -0.1238]^T$

$\mathbf{x}^{(7)} = [-0.1871 \;\; -0.1246]^T$

$\mathbf{x}^{(8)} = [-0.1875 \;\; -0.1250]^T$

This clearly shows the typical performance of the gradient method: rapid convergence far from the minimum, but slow progress near the minimum. Still, the algorithm converges to the true minimum. This behavior is also seen from various other starting guesses.
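The iterates above can be reproduced with a short routine (a sketch, not the authors' code). For a quadratic loss the one-dimensional search has the closed-form solution α = gᵀg / (gᵀHg), which is used here in place of a numerical search:

```python
def grad_J(x1, x2):
    """Gradient of J = 4*x1**2 + 3*x2**2 - 4*x1*x2 + x1 from Example C.3."""
    return 8*x1 - 4*x2 + 1, 6*x2 - 4*x1

def steepest_descent(x1, x2, iters):
    """Method of gradients, eqn. (C.30), with an exact line search."""
    for _ in range(iters):
        g1, g2 = grad_J(x1, x2)
        # H*g for the constant Hessian H = [[8, -4], [-4, 6]]
        Hg1, Hg2 = 8*g1 - 4*g2, -4*g1 + 6*g2
        alpha = (g1*g1 + g2*g2) / (g1*Hg1 + g2*Hg2)  # exact minimizing step
        x1, x2 = x1 - alpha*g1, x2 - alpha*g2
    return x1, x2

# first iterate matches the printed x(1) to about three decimal places
x1, x2 = steepest_descent(-1.0, 3.0, 1)
assert abs(x1 - 0.7576) < 1e-3 and abs(x2 - 0.9649) < 1e-3
```

After eight iterations the routine is within printing precision of the true minimum [-3/16, -1/8], matching the slow terminal convergence described in the text.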

C.3.3 Second-Order (Gauss-Newton) Algorithm

The Gauss-Newton algorithm is probably the most powerful unconstrained optimization method. We will discuss a curvature pitfall that necessitates care in applying this algorithm, however. Say a loss function is evaluated at a local point


$\mathbf{x}^{(k)}$. It is desired to modify $\mathbf{x}^{(k)}$ by $\Delta\mathbf{x}^{(k)}$ according to

$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \Delta\mathbf{x}^{(k)}$  (C.32)

in such a fashion that $J$ is decreased or increased. The behavior of $J$ near $\mathbf{x}^{(k)}$ can be approximated by a second-order Taylor series:

$J = J\big(\mathbf{x}^{(k)}\big) + \Delta\mathbf{x}^T \mathbf{g}^{(k)} + \dfrac{1}{2}\Delta\mathbf{x}^T H^{(k)} \Delta\mathbf{x}$  (C.33)

where $\mathbf{g}^{(k)} \equiv \nabla_{\mathbf{x}} J\big(\mathbf{x}^{(k)}\big)$ (the gradient of $J$) and $H^{(k)} \equiv \nabla^2_{\mathbf{xx}} J\big(\mathbf{x}^{(k)}\big)$ (the Hessian of $J$). The local strategy is to determine the particular correction vector $\Delta\mathbf{x}^{(k)}$ which minimizes (maximizes) the second-order prediction of $J$. Investigating eqn. (C.33) for an extremum leads to the following:

necessary condition: $\dfrac{\partial J}{\partial \Delta\mathbf{x}} = \mathbf{g}^{(k)} + H^{(k)}\Delta\mathbf{x} = \mathbf{0}$  (C.34)

sufficient condition: $\dfrac{\partial^2 J}{\partial \Delta\mathbf{x}\,\partial \Delta\mathbf{x}^T} = H^{(k)}$ must be positive definite for a minimum, negative definite for a maximum, and indefinite for a saddle.  (C.35)

From the necessary condition of eqn. (C.34), the local corrections are then given by

$\Delta\mathbf{x}^{(k)} = -\big[H^{(k)}\big]^{-1}\mathbf{g}^{(k)}$  (C.36)

Substituting eqn. (C.36) into eqn. (C.32) gives the Gauss-Newton second-order optimization algorithm:

$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \big[H^{(k)}\big]^{-1}\mathbf{g}^{(k)}$  (C.37)

It is important to note that this algorithm converges in exactly one iteration for a quadratic loss function, regardless of the starting guess used. For example, the second-order correction for the loss function shown in Example C.3 is given by

$\mathbf{x}^{(1)} = \begin{bmatrix} x_1^{(0)} \\ x_2^{(0)} \end{bmatrix} - \begin{bmatrix} 3/16 & 1/8 \\ 1/8 & 1/4 \end{bmatrix}\begin{bmatrix} 8x_1^{(0)} - 4x_2^{(0)} + 1 \\ 6x_2^{(0)} - 4x_1^{(0)} \end{bmatrix} = \begin{bmatrix} -3/16 \\ -1/8 \end{bmatrix}$  (C.38)

which gives the optimal solution in one iteration! In many (probably most) solvable unconstrained optimization problems, the second-order approximation underlying eqn. (C.37) becomes valid during the final iterations; the terminal convergence of eqn. (C.37) is usually exceptionally rapid.
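This one-step property is easy to check in code (a sketch, not from the text), using the inverse Hessian written out in eqn. (C.38):

```python
def newton_step(x1, x2):
    """One second-order correction, eqn. (C.37), for the quadratic of Example C.3.
    H = [[8, -4], [-4, 6]], so H^{-1} = [[3/16, 1/8], [1/8, 1/4]]."""
    g1, g2 = 8*x1 - 4*x2 + 1, 6*x2 - 4*x1
    return x1 - (3/16*g1 + 1/8*g2), x2 - (1/8*g1 + 1/4*g2)

# any starting guess lands on the minimum [-3/16, -1/8] in one step
for start in [(-1.0, 3.0), (0.0, 0.0), (5.0, -7.0)]:
    x1, x2 = newton_step(*start)
    assert abs(x1 + 3/16) < 1e-12 and abs(x2 + 1/8) < 1e-12
```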

There is a pitfall, though! If the sufficient condition of eqn. (C.35) is not satisfied, then the correction will be in the wrong direction. It is difficult to attempt minimizing


Figure C.5: Minimization of Rosenbrock's Loss Function

a function by solving for local maxima. This pitfall can be circumvented by using a gradient algorithm until the neighborhood of the solution is reached, then testing the sufficient condition of eqn. (C.35) and employing the second-order algorithm if it is satisfied.

Example C.4: In this example the Gauss-Newton algorithm is used to determine the minimum of Rosenbrock's loss function, which has been devised as a specific challenge to gradient-based approaches:

$J(\mathbf{x}) = 100\big(x_2 - x_1^2\big)^2 + (1 - x_1)^2$

A plot of the contours for this function is shown in Figure C.5. Note the highly nonlinear trenches for this function. The starting guess is given by $\mathbf{x}^{(0)} = [-1.2 \;\; 1]^T$. For this particular problem, the gradient method of §C.3.2 does not converge to the true minimum of $\mathbf{x}^* = [1 \;\; 1]^T$ even after 1,000 iterations. However, the second-order algorithm converges in just two iterations, as shown in Figure C.5. The iterations are given by

$\mathbf{x}^{(1)} = [1.0000 \;\; -3.8400]^T$


$\mathbf{x}^{(2)} = [1.0000 \;\; 1.0000]^T$

The Hessian matrix evaluated for this function is given by

$\nabla^2_{\mathbf{xx}} J = \begin{bmatrix} -400\big(x_2 - x_1^2\big) + 800x_1^2 + 2 & -400x_1 \\ -400x_1 & 200 \end{bmatrix}$

which is positive definite at all of the iterations. This example clearly shows the advantages of using a second-order correction in the optimization process.

The overwhelmingly most significant drawback of the second-order correction is the necessity of calculating the matrix of second derivatives. For complicated loss function models, it is usually expensive simply to determine the $n$ elements of the gradient vector. One is thus motivated to ask: Is it possible to approximate quadratic convergence without the expense of calculating second partial derivatives? The answer turns out to be yes! Observe that some second-order information is contained in the sequence of local function and gradient calculations. Two such techniques have been developed that are in common use today (the Fletcher-Powell^4 and Fletcher-Reeves^5 algorithms). These algorithms are not developed here due to space limitations; the interested reader should see Refs. [1] and [3] for theoretical development and numerical examples of these important algorithms.

It is also significant to note that when the loss function is the sum of squares of a set of functions whose first derivatives are available, second-order convergence can be approximated by linearizing the functions before squaring. The result is a local quadratic approximation of $J$; this local approximation can be minimized rigorously. The classical example of this approach is the Gaussian least squares differential correction, which is also known as nonlinear least squares. This algorithm is developed in §1.4 and is applied to numerous examples in this text.

    References

[1] Rao, S.S., Engineering Optimization: Theory and Practice, John Wiley & Sons, New York, NY, 3rd ed., 1996.

[2] Bryson, A.E. and Ho, Y.C., Applied Optimal Control, Taylor & Francis, London, England, 1975.

[3] Reklaitis, G.V., Ravindran, A., and Ragsdell, K.M., Engineering Optimization: Methods and Applications, John Wiley & Sons, New York, NY, 1983.

[4] Fletcher, R. and Powell, M., "A Rapidly Convergent Descent Method for Minimization," Computer Journal, Vol. 6, No. 2, July 1963, pp. 163-168.

[5] Fletcher, R. and Reeves, C.M., "Function Minimization by Conjugate Gradients," Computer Journal, Vol. 7, No. 2, July 1964, pp. 149-154.