7/27/2019 Learning Hessian matrix.pdf
1/100
Derivatives, Higher Derivatives, Hessian Matrix
And its Application in Numerical Optimization
Dr M Zulfikar Ali
Professor
Department of Mathematics, University of Rajshahi
1. First-, Second- and Higher-Order Derivatives

1.1. Differentiable Real-Valued Function

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. $f$ is said to be differentiable at $x$ if for all $\Delta x \in R^n$ such that $x + \Delta x \in \Gamma$ we have that

$f(x+\Delta x) = f(x) + t(x)\,\Delta x + \alpha(x, \Delta x)\,\|\Delta x\|,$

where $t(x)$ is an n-dimensional bounded vector, and $\alpha$ is a real-valued function of $\Delta x$ such that $\lim_{\Delta x \to 0} \alpha(x, \Delta x) = 0$.

$f$ is said to be differentiable on $\Gamma$ if it is differentiable at each $x$ in $\Gamma$. [Obviously, if $f$ is differentiable on the open set $\Gamma$, it is also differentiable on any subset $\Gamma_1$ (open or not) of $\Gamma$. Hence when we say that $f$ is differentiable on some set $\Gamma_1$ (open or not), we shall mean that $f$ is differentiable on some open set containing $\Gamma_1$.]

Theorem 1. Let $f(x)$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$.

(i) If $f(x)$ is differentiable at $x$, then $f(x)$ is continuous at $x$, and $\nabla f(x)$ exists (but not conversely), and

$f(x+\Delta x) = f(x) + \nabla f(x)\,\Delta x + \alpha(x, \Delta x)\,\|\Delta x\|, \qquad \lim_{\Delta x \to 0} \alpha(x, \Delta x) = 0,$

for $x + \Delta x \in \Gamma$.

(ii) If $f(x)$ has continuous partial derivatives at $x$ with respect to $x_1, x_2, \ldots, x_n$, that is, $\nabla f(x)$ exists and $\nabla f$ is continuous at $x$, then $f$ is differentiable at $x$.
1.2. Twice Differentiable Real-Valued Function

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. $f$ is said to be twice differentiable at $x$ if for all $\Delta x \in R^n$ such that $x + \Delta x \in \Gamma$ we have that

$f(x+\Delta x) = f(x) + \nabla f(x)\,\Delta x + \frac{1}{2}\,\Delta x\,\nabla^2 f(x)\,\Delta x + \beta(x, \Delta x)\,\|\Delta x\|^2,$

where $\nabla^2 f(x)$ is an $n \times n$ matrix of bounded elements, and $\beta$ is a real-valued function of $\Delta x$ such that $\lim_{\Delta x \to 0} \beta(x, \Delta x) = 0$.

The $n \times n$ matrix $\nabla^2 f(x)$ is called the Hessian (matrix) of $f$ at $x$, and its $ij$-th element is written as

$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, 2, \ldots, n.$

Obviously, if $f$ is twice differentiable at $x$, it must also be differentiable at $x$.
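The Hessian defined above can be approximated numerically. The following is a minimal sketch (mine, not from the notes), using central differences and, as a test function, the polynomial $x^2 + 3xy + y^3$ used later in these notes, whose Hessian is $\begin{pmatrix} 2 & 3 \\ 3 & 6y \end{pmatrix}$:

```python
def hessian(f, x, h=1e-5):
    """Approximate the Hessian of f: R^n -> R at x by central differences
    applied to each pair of coordinates (step 2h on the diagonal)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shift(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            H[i][j] = (shift(h, h) - shift(h, -h)
                       - shift(-h, h) + shift(-h, -h)) / (4 * h * h)
    return H

# f(x, y) = x^2 + 3xy + y^3 has Hessian [[2, 3], [3, 6y]].
f = lambda v: v[0]**2 + 3*v[0]*v[1] + v[1]**3
H = hessian(f, [2.0, 1.0])
```

The returned matrix is symmetric up to discretization error, matching the symmetry result proved later for continuous second derivatives.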
1.3. First Partial Derivative

Suppose $f(x, y)$ is a real-valued function of two independent variables $x$ and $y$. Then the partial derivative of $f(x, y)$ with respect to $x$ is defined as

$\left(\frac{\partial f}{\partial x}\right)_y = \lim_{\Delta x \to 0} \frac{f(x+\Delta x, y) - f(x, y)}{\Delta x}.$  (1)

Similarly, the partial derivative of $f(x, y)$ with respect to $y$ is defined as

$\left(\frac{\partial f}{\partial y}\right)_x = \lim_{\Delta y \to 0} \frac{f(x, y+\Delta y) - f(x, y)}{\Delta y}.$  (2)

Example 1. If $f(x, y) = x^2 - 2y^2$, then

$\left(\frac{\partial f}{\partial x}\right)_y = f_x = \lim_{\Delta x \to 0} \frac{[(x+\Delta x)^2 - 2y^2] - (x^2 - 2y^2)}{\Delta x} = 2x.$

Similarly

$\left(\frac{\partial f}{\partial y}\right)_x = f_y = \lim_{\Delta y \to 0} \frac{[x^2 - 2(y+\Delta y)^2] - (x^2 - 2y^2)}{\Delta y} = -4y.$
1.4. Gradient of Real-Valued Functions

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. The n-dimensional vector of the partial derivatives of $f$ with respect to $x_1, x_2, \ldots, x_n$ at $x$ is called the gradient of $f$ at $x$ and is denoted by $\nabla f(x)$, that is,

$\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right).$

Example 2. Let $f(x, y) = (x+y)^2 + y^2$. Then the gradient of $f$ is

$\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}\right) = (2x+2y,\; 2x+4y).$
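As a quick numerical check of Example 2 (this sketch is mine, not part of the notes), a central-difference gradient agrees with the analytic gradient $(2x+2y,\; 2x+4y)$:

```python
def grad(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f: R^n -> R at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

# Example 2: f(x, y) = (x + y)^2 + y^2, gradient (2x + 2y, 2x + 4y).
f = lambda v: (v[0] + v[1])**2 + v[1]**2
g = grad(f, [1.0, 2.0])   # analytic gradient at (1, 2) is (6, 10)
```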
1.5. Function of a Function

It is a well-known property of functions of one independent variable that if $f$ is a function of a variable $u$, and $u$ is a function of a variable $x$, then

$\frac{df}{dx} = \frac{df}{du}\,\frac{du}{dx}.$  (3)

This result may be immediately extended to the case when $f$ is a function of two or more independent variables. Suppose $f = f(u)$ and $u = u(x, y)$. Then, by the definition of a partial derivative,

$\left(\frac{\partial f}{\partial x}\right)_y = \frac{df}{du}\left(\frac{\partial u}{\partial x}\right)_y,$  (4)

$\left(\frac{\partial f}{\partial y}\right)_x = \frac{df}{du}\left(\frac{\partial u}{\partial y}\right)_x.$  (5)

Example 3. If

$f(x, y) = \tan^{-1}\frac{y}{x},$

then putting $u = y/x$ we have

$\left(\frac{\partial f}{\partial x}\right)_y = \frac{d}{du}(\tan^{-1} u)\left(\frac{\partial u}{\partial x}\right)_y = \frac{1}{1+u^2}\left(-\frac{y}{x^2}\right) = -\frac{y}{x^2+y^2},$

$\left(\frac{\partial f}{\partial y}\right)_x = \frac{d}{du}(\tan^{-1} u)\left(\frac{\partial u}{\partial y}\right)_x = \frac{1}{1+u^2}\cdot\frac{1}{x} = \frac{x}{x^2+y^2}.$
1.6. Higher Partial Derivatives

Provided the first partial derivatives of a function are differentiable, we may differentiate them partially to obtain the second partial derivatives. The four second partial derivatives of $f(x, y)$ are therefore

$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = f_{xx},$  (6)

$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) = f_{yy},$  (7)

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = f_{yx},$  (8)

and

$\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = f_{xy}.$  (9)

Higher partial derivatives than the second may be obtained in a similar way.
Example 4. If $f(x, y) = \tan^{-1}\dfrac{y}{x}$, then

$\frac{\partial f}{\partial x} = -\frac{y}{x^2+y^2}, \qquad \frac{\partial f}{\partial y} = \frac{x}{x^2+y^2}.$

Hence, differentiating these first derivatives partially, we obtain

$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(-\frac{y}{x^2+y^2}\right) = \frac{2xy}{(x^2+y^2)^2}$

and

$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{x}{x^2+y^2}\right) = -\frac{2xy}{(x^2+y^2)^2};$

also

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{x}{x^2+y^2}\right) = \frac{(x^2+y^2) - 2x^2}{(x^2+y^2)^2} = \frac{y^2-x^2}{(x^2+y^2)^2}$

and

$\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(-\frac{y}{x^2+y^2}\right) = \frac{y^2-x^2}{(x^2+y^2)^2}.$

We see that

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x},$  (10)

which shows that the operators $\dfrac{\partial}{\partial x}$ and $\dfrac{\partial}{\partial y}$ are commutative. We also see that

$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0.$  (11)

The above equation is called the Laplace equation in two variables. In general, any function satisfying this equation is called a harmonic function.
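Both facts in Example 4, the equality of the mixed partials (10) and the Laplace equation (11), can be checked numerically. A small sketch (mine, not from the notes) using central second differences:

```python
import math

def d2(f, x, y, h=1e-4):
    """Second partials of f(x, y) at (x, y) by central differences."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fyy, fxy

f = lambda x, y: math.atan(y / x)      # Example 4, on the half-plane x > 0
fxx, fyy, fxy = d2(f, 1.0, 2.0)
# Analytically: fxx = 2xy/(x^2+y^2)^2 = 0.16, fxy = (y^2-x^2)/(x^2+y^2)^2 = 0.12,
# and fxx + fyy = 0 at (1, 2).
```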
1.7. Total Derivatives

Suppose $f(x, y)$ is a continuous function defined in a region R of the xy-plane, and that both $\left(\frac{\partial f}{\partial x}\right)_y$ and $\left(\frac{\partial f}{\partial y}\right)_x$ are continuous in this region. We now consider the change in the value of the function brought about by allowing small changes in $x$ and $y$.

If $\delta f$ is the change in $f(x, y)$ due to changes $\delta x$ and $\delta y$ in $x$ and $y$, then

$\delta f = f(x+\delta x, y+\delta y) - f(x, y)$  (12)

$\phantom{\delta f} = [f(x+\delta x, y+\delta y) - f(x, y+\delta y)] + [f(x, y+\delta y) - f(x, y)].$  (13)

Now, by definitions (1) and (2),

$\frac{\partial f(x, y+\delta y)}{\partial x} = \lim_{\delta x \to 0} \frac{f(x+\delta x, y+\delta y) - f(x, y+\delta y)}{\delta x}$  (14)

and

$\frac{\partial f(x, y)}{\partial y} = \lim_{\delta y \to 0} \frac{f(x, y+\delta y) - f(x, y)}{\delta y}.$  (15)

Consequently,

$f(x+\delta x, y+\delta y) - f(x, y+\delta y) = \left[\frac{\partial f(x, y+\delta y)}{\partial x} + \epsilon_1\right]\delta x$  (16)

and

$f(x, y+\delta y) - f(x, y) = \left[\frac{\partial f(x, y)}{\partial y} + \epsilon_2\right]\delta y,$  (17)

where $\epsilon_1$ and $\epsilon_2$ satisfy the conditions

$\lim_{\delta x \to 0} \epsilon_1 = 0 \quad\text{and}\quad \lim_{\delta y \to 0} \epsilon_2 = 0.$  (18)

Using (16) and (17) in (13) we now find

$\delta f = \left[\frac{\partial f(x, y+\delta y)}{\partial x} + \epsilon_1\right]\delta x + \left[\frac{\partial f(x, y)}{\partial y} + \epsilon_2\right]\delta y.$  (19)

Furthermore, since all first derivatives are continuous by assumption, the first term of (19) may be written as

$\frac{\partial f(x, y+\delta y)}{\partial x} = \frac{\partial f(x, y)}{\partial x} + \eta,$  (20)

where $\eta$ satisfies the condition

$\lim_{\delta y \to 0} \eta = 0.$  (21)

Hence, using (20), (19) now becomes

$\delta f = \frac{\partial f(x, y)}{\partial x}\,\delta x + \frac{\partial f(x, y)}{\partial y}\,\delta y + (\epsilon_1 + \eta)\,\delta x + \epsilon_2\,\delta y.$  (22)

The expression

$df = \frac{\partial f(x, y)}{\partial x}\,\delta x + \frac{\partial f(x, y)}{\partial y}\,\delta y,$  (23)

obtained by neglecting the small terms $(\epsilon_1 + \eta)\,\delta x$ and $\epsilon_2\,\delta y$ in (22), represents, to the first order in $\delta x$ and $\delta y$, the change $\delta f$ in $f(x, y)$ due to changes $\delta x$ and $\delta y$ in $x$ and $y$ respectively; it is called the total differential of $f$.

In the case of a function of n independent variables $f(x_1, x_2, \ldots, x_n)$ we have

$df = \frac{\partial f}{\partial x_1}\,\delta x_1 + \frac{\partial f}{\partial x_2}\,\delta x_2 + \cdots + \frac{\partial f}{\partial x_n}\,\delta x_n = \sum_{r=1}^{n} \frac{\partial f}{\partial x_r}\,\delta x_r.$  (24)
Example 5. To find the change in

$f(x, y) = x\,e^{xy}$

when the values of $x$ and $y$ are slightly changed from 1 and 0 to $1+\delta x$ and $\delta y$ respectively, we first use (23) to obtain

$\delta f \approx (e^{xy} + xy\,e^{xy})\,\delta x + x^2 e^{xy}\,\delta y.$

Hence, putting $x = 1$, $y = 0$ in the above expression, we have

$\delta f \approx \delta x + \delta y.$

For example, if $\delta x = 0.10$ and $\delta y = 0.05$, then $\delta f \approx 0.15$.

We now return to the exact expression for $\delta f$ given in (22). Suppose $u = f(x, y)$ and that both $x$ and $y$ are differentiable functions of a variable $t$, so that

$x = x(t), \quad y = y(t)$  (25)
and

$u = u(t).$  (26)

Hence, dividing (22) by $\delta t$ and proceeding to the limit $\delta t \to 0$ (which implies $\delta x \to 0$, $\delta y \to 0$ and consequently $\epsilon_1, \epsilon_2, \eta \to 0$), we have

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}.$  (27)

This expression is called the total derivative of $u(t)$ with respect to $t$.

It is easily seen that if

$u = f(x_1, x_2, \ldots, x_n),$  (28)

where $x_1, x_2, \ldots, x_n$ are all differentiable functions of a variable $t$, then $u = u(t)$ and

$\frac{du}{dt} = \frac{\partial f}{\partial x_1}\,\frac{dx_1}{dt} + \frac{\partial f}{\partial x_2}\,\frac{dx_2}{dt} + \cdots + \frac{\partial f}{\partial x_n}\,\frac{dx_n}{dt} = \sum_{r=1}^{n} \frac{\partial f}{\partial x_r}\,\frac{dx_r}{dt}.$  (29)

Example 6. Suppose $u = f(x, y) = x^2 + y^2$ and $x = \sinh t$, $y = t^2$. Then

$\frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 2y, \qquad \frac{dx}{dt} = \cosh t, \quad \frac{dy}{dt} = 2t.$

Hence

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt} = 2x\cosh t + 4ty = 2\sinh t\cosh t + 4t^3.$
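Example 6 lends itself to a direct numerical check (this sketch is mine, not part of the notes): substitute $x(t)$ and $y(t)$ into $u$, and compare a difference quotient of $u(t)$ with the total-derivative formula.

```python
import math

# Example 6: u = x^2 + y^2 with x = sinh t, y = t^2, so
# du/dt = 2 sinh t cosh t + 4 t^3.  Compare with a difference quotient.
def u(t):
    return math.sinh(t)**2 + (t**2)**2

t, h = 0.7, 1e-6
numeric = (u(t + h) - u(t - h)) / (2 * h)
analytic = 2 * math.sinh(t) * math.cosh(t) + 4 * t**3
```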
1.8. Implicit Differentiation

A special case of the total derivative (27) arises when $y$ is itself a function of $x$ (i.e. $t = x$). Consequently $u$ is a function of $x$ only and

$\frac{du}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\,\frac{dy}{dx}.$  (30)

Example 7. Suppose

$u = f(x, y) = \tan^{-1}\frac{x}{y}$

and $y = \sin x$. Then by (30), we have

$\frac{du}{dx} = \frac{y}{x^2+y^2} - \frac{x}{x^2+y^2}\cos x = \frac{\sin x - x\cos x}{x^2 + \sin^2 x}.$

When $y$ is defined as a function of $x$ by the equation

$u = f(x, y) = 0,$  (31)

$y$ is called an implicit function of $x$. Since $u$ is identically zero, its total derivative must vanish, and consequently from (30)

$\frac{dy}{dx} = -\frac{\partial f/\partial x}{\partial f/\partial y}.$  (32)

Example 8. Suppose

$f(x, y) = ax^2 + 2hxy + by^2 + 2gx + 2fy + c = 0$

(where $a$, $h$, $b$, $g$, $f$ and $c$ are constants). Then

$\frac{dy}{dx} = -\frac{2ax + 2hy + 2g}{2hx + 2by + 2f} = -\frac{ax + hy + g}{hx + by + f}.$
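Formula (32) is easy to sanity-check on a special case of Example 8 (a sketch of mine, not from the notes): for the circle $x^2 + y^2 - 25 = 0$ (that is $a = b = 1$ and the other constants zero except $c = -25$), (32) gives $dy/dx = -x/y$, which can be compared against differentiating the explicit branch $y = \sqrt{25 - x^2}$.

```python
import math

# Circle x^2 + y^2 - 25 = 0: formula (32) gives dy/dx = -x/y.
x0 = 3.0
y0 = math.sqrt(25 - x0**2)           # y0 = 4 on the upper branch
slope_formula = -x0 / y0             # dy/dx from (32): -0.75
h = 1e-6
slope_numeric = (math.sqrt(25 - (x0 + h)**2)
                 - math.sqrt(25 - (x0 - h)**2)) / (2 * h)
```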
1.9. Higher Total Derivatives

We have already seen that if $u = f(x, y)$ and $x$ and $y$ are differentiable functions of $t$, then

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}.$  (33)

To find $\dfrac{d^2u}{dt^2}$ we note from (33) that the operator $\dfrac{d}{dt}$ can be written as

$\frac{d}{dt} \equiv \frac{dx}{dt}\,\frac{\partial}{\partial x} + \frac{dy}{dt}\,\frac{\partial}{\partial y}.$  (34)

Hence

$\frac{d^2u}{dt^2} = \frac{d}{dt}\left(\frac{du}{dt}\right) = \left(\frac{dx}{dt}\,\frac{\partial}{\partial x} + \frac{dy}{dt}\,\frac{\partial}{\partial y}\right)\left(\frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}\right)$

$= \frac{\partial^2 f}{\partial x^2}\left(\frac{dx}{dt}\right)^2 + 2\,\frac{\partial^2 f}{\partial x \partial y}\,\frac{dx}{dt}\,\frac{dy}{dt} + \frac{\partial^2 f}{\partial y^2}\left(\frac{dy}{dt}\right)^2 + \frac{\partial f}{\partial x}\,\frac{d^2x}{dt^2} + \frac{\partial f}{\partial y}\,\frac{d^2y}{dt^2},$  (35)

where we have assumed that $f_{xy} = f_{yx}$. Higher total derivatives may be obtained in a similar way.

A special case of (35) is when

$\frac{dx}{dt} = h, \quad \frac{dy}{dt} = k,$

where $h$ and $k$ are constants. We then have

$\frac{d^2u}{dt^2} = h^2\,\frac{\partial^2 f}{\partial x^2} + 2hk\,\frac{\partial^2 f}{\partial x \partial y} + k^2\,\frac{\partial^2 f}{\partial y^2},$

which, if we define the differential operator $D^*$ by

$D^* = h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y},$

may be written symbolically as

$\frac{d^2u}{dt^2} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^2 f = D^{*2} f.$

Similarly we find

$\frac{d^3u}{dt^3} = h^3\,\frac{\partial^3 f}{\partial x^3} + 3h^2k\,\frac{\partial^3 f}{\partial x^2 \partial y} + 3hk^2\,\frac{\partial^3 f}{\partial x \partial y^2} + k^3\,\frac{\partial^3 f}{\partial y^3} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^3 f = D^{*3} f,$

assuming the commutative properties of partial differentiation. In general,

$\frac{d^nu}{dt^n} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^n f = D^{*n} f,$

where the operator $\left(h\,\dfrac{\partial}{\partial x} + k\,\dfrac{\partial}{\partial y}\right)^n$ is to be expanded by means of the binomial theorem.
1.10. Taylor's Theorem for Functions of Two Independent Variables

Theorem 3 (Taylor's Theorem). If $f(x, y)$ is defined in a region R of the xy-plane and all its partial derivatives of orders up to and including the (n+1)th are continuous in R, then for any point $(a, b)$ in this region

$f(a+h, b+k) = f(a, b) + D^*f(a, b) + \frac{1}{2!}\,D^{*2}f(a, b) + \cdots + \frac{1}{n!}\,D^{*n}f(a, b) + E_n,$

where $D^*$ is the differential operator defined by

$D^* = h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}$

and $D^{*r}f(a, b)$ means $\left(h\,\dfrac{\partial}{\partial x} + k\,\dfrac{\partial}{\partial y}\right)^r f(x, y)$ evaluated at the point $(a, b)$. The Lagrange error term $E_n$ is given by

$E_n = \frac{1}{(n+1)!}\,D^{*(n+1)} f(a+\theta h, b+\theta k),$

where $0 < \theta < 1$.
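The first-order case of Theorem 3, $f(a+h, b+k) \approx f(a, b) + D^*f(a, b)$, can be tested numerically. A sketch of mine (not from the notes), reusing $f(x, y) = x\,e^{xy}$ from Example 5 at $(a, b) = (1, 0)$:

```python
import math

# First-order Taylor approximation f(a+h, b+k) ≈ f(a,b) + h f_x(a,b) + k f_y(a,b)
# for f(x, y) = x * exp(x*y) near (a, b) = (1, 0), as in Example 5.
f = lambda x, y: x * math.exp(x * y)
a, b = 1.0, 0.0
fx = math.exp(a * b) + a * b * math.exp(a * b)   # e^{xy} + xy e^{xy} = 1
fy = a**2 * math.exp(a * b)                      # x^2 e^{xy} = 1
h, k = 0.10, 0.05
approx = f(a, b) + h * fx + k * fy               # 1 + 0.10 + 0.05 = 1.15
exact = f(a + h, b + k)
```

The gap between `exact` and `approx` is the higher-order remainder $E_1$, which is small for these step sizes.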
2. Derivatives by Linearization Technique

2.1. Linear Part of a Function

Consider the function $f(x, y) = x^2 + 3xy + y^3$, and try to approximate $f(x, y)$ near the point $(x, y) = (2, 1)$ by a simpler function. To do this, set $x = 2 + \xi$ and $y = 1 + \eta$. Then

$f(x, y) = f(2+\xi, 1+\eta)$

$= (4 + 4\xi + \xi^2) + 3(2 + 2\eta + \xi + \xi\eta) + (1 + 3\eta + 3\eta^2 + \eta^3)$

$= 11 + (7\xi + 9\eta) + (\xi^2 + 3\xi\eta + 3\eta^2 + \eta^3).$

Here $11 = f(2, 1)$; $7\xi + 9\eta$ is a linear function of the variables $\xi$ and $\eta$; and $v = \xi^2 + 3\xi\eta + 3\eta^2 + \eta^3$ is small, compared to the linear function $7\xi + 9\eta$, if $\xi$ and $\eta$ are both small enough. This means that, if $\xi$ and $\eta$ are both small enough, then

$f(x, y) \approx 11 + (7\xi + 9\eta)$

is a good approximation, the terms omitted being small in comparison to the terms retained.

To make the idea of "small enough" precise, denote by $d$ the distance from $(2, 1)$ to $(2+\xi, 1+\eta)$; then $d = [\xi^2 + \eta^2]^{1/2}$. Then $|\xi| \le d$ and $|\eta| \le d$, so that, as $d \to 0$,

$|\xi^2/d| \le d^2/d = d \to 0; \qquad |3\xi\eta/d| \le 3d^2/d = 3d \to 0;$

and similarly for the remaining terms in $v$. This motivates the following definition.

Definition 1. The function $f$ of two variables $x, y$ is differentiable at the point $(a, b)$ if

$f(x, y) = f(a, b) + p(x-a) + q(y-b) + \epsilon d,$

where $p$ and $q$ are constants, and $\epsilon \to 0$ as $d = [(x-a)^2 + (y-b)^2]^{1/2} \to 0$. The linear function $p(x-a) + q(y-b)$ will be called the linear part of $f$ at $(a, b)$. (Some books call it the differential.)

The numbers $p$ and $q$ can be directly calculated. If $y$ is fixed at the value $b$, then $d = |x - a|$, and

$\frac{f(x, b) - f(a, b)}{x - a} = \frac{p(x-a) + \epsilon d}{x - a} \to p + 0$
as $d = |x-a| \to 0$, thus as $x \to a$. Thus $p$ equals the partial derivative of $f$ with respect to $x$ at $a$, holding $y$ fixed at $b$. This will be written

$p = \frac{\partial f}{\partial x}(a, b) = f_x(a, b) = D_x f(a, b),$

or as $\dfrac{\partial f}{\partial x}$ or $f_x$ or $D_x f$ for short. Similarly $q$ equals $\dfrac{\partial f}{\partial y}(a, b) = f_y(a, b)$, the partial derivative of $f$ with respect to $y$ at $b$, holding $x$ fixed at $a$.

Thus, in the above example, $f_x = 2x + 3y$, and so, at $(x, y) = (2, 1)$, $f_x = 2(2) + 3(1) = 7 = p$. Similarly, $f_y = 3x + 3y^2 = 9 = q$ (at $(2, 1)$).
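The quality of the linear part as an approximation can be seen directly (a sketch of mine, not part of the notes): perturb $(2, 1)$ by small $\xi$, $\eta$ and compare $f$ against $11 + 7\xi + 9\eta$.

```python
# Linear part of f(x, y) = x^2 + 3xy + y^3 at (2, 1): 7*(x-2) + 9*(y-1),
# so f(2+xi, 1+eta) ≈ 11 + 7*xi + 9*eta for small xi, eta.
f = lambda x, y: x**2 + 3*x*y + y**3
xi, eta = 0.01, -0.02
approx = 11 + 7*xi + 9*eta
exact = f(2 + xi, 1 + eta)
# The gap equals v = xi^2 + 3*xi*eta + 3*eta^2 + eta^3, which is O(d^2).
```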
Example 1. Calculate the linear part of $f$ at $\left(\frac{\pi}{4}, 0\right)$, where $f(x, y) = \cos(2x + 3y)$.

Solution. $f(x, y) = \cos(2x + 3y)$,

$f_x(x, y) = -2\sin(2x+3y) = -2 \;\text{ at } \left(\tfrac{\pi}{4}, 0\right),$

$f_y(x, y) = -3\sin(2x+3y) = -3 \;\text{ at } \left(\tfrac{\pi}{4}, 0\right).$

Hence the linear part of $f$ at $\left(\frac{\pi}{4}, 0\right)$ is

$-2\left(x - \frac{\pi}{4}\right) - 3(y - 0).$
Example 2. Calculate the linear part of $g$ at $\left(0, \frac{\sqrt{\pi}}{2}\right)$, where $g(x, y) = \sin(x^2 + y^2)$.

Solution. $g_x\left(0, \tfrac{\sqrt{\pi}}{2}\right) = 0$, $g_y\left(0, \tfrac{\sqrt{\pi}}{2}\right) = \sqrt{\pi}\cos\left(\tfrac{\pi}{4}\right)$.

Hence the linear part of $g$ at $\left(0, \frac{\sqrt{\pi}}{2}\right)$ is

$0\cdot(x - 0) + \sqrt{\pi}\cos\left(\frac{\pi}{4}\right)\left(y - \frac{\sqrt{\pi}}{2}\right) = \sqrt{\pi}\cos\left(\frac{\pi}{4}\right)\left(y - \frac{\sqrt{\pi}}{2}\right).$
2.2. Vector Viewpoint

It will be convenient to regard $f$ as a function of a vector variable $w$ whose components are $x$ and $y$. Denote $w = \begin{pmatrix} x \\ y \end{pmatrix}$, as a column vector, in matrix language; denote also $c = \begin{pmatrix} a \\ b \end{pmatrix}$. Define also the row vector

$f'(c) = (p, q) = (f_x(a, b),\; f_y(a, b)).$

(The notation $f'(c)$ suggests a derivative; we have here a sort of vector derivative.) Then, since $f$ is differentiable,

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w),$

where the product

$f'(c)(w - c) = (p, q)\begin{pmatrix} x-a \\ y-b \end{pmatrix} = p(x-a) + q(y-b)$

is the usual matrix product. Now let $\|w - c\|$ denote the length of the vector $w - c$; then the previous $d = \|w - c\|$, and so $\epsilon(w)/\|w - c\| \to 0$.

Example 3. For $f(x, y) = x^2 + 3xy + y^3$, $a = 2$, $b = 1$, set $c = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$; then

$f'(c) = (7, 9).$
2.3. Directional Derivative

Suppose that the point $(x, y)$ moves along a straight line through the point $(a, b)$; thus $x - a = \lambda l$, $y - b = \lambda m$, where $\lambda$ is the distance from $(a, b)$ to $(x, y)$ and $l$ and $m$ are constants specifying the direction of the line (with $l^2 + m^2 = 1$). In vector language, let

$w = \begin{pmatrix} x \\ y \end{pmatrix}, \quad c = \begin{pmatrix} a \\ b \end{pmatrix}, \quad t = \begin{pmatrix} l \\ m \end{pmatrix};$

then $\|w - c\| = \lambda$, and the line has equation $w = c + \lambda t$.

The rate of increase of $f(w)$, measured at $w = c$, as $w$ moves along the line $w = c + \lambda t$, is called the directional derivative of $f$ at $c$ in the direction $t$. This may be calculated, assuming $f$ differentiable, as follows:

$\frac{f(c + \lambda t) - f(c)}{\lambda} = \frac{f'(c)(\lambda t) + \epsilon(c + \lambda t)}{\lambda} = f'(c)\,t + \frac{\epsilon(c + \lambda t)}{\lambda}.$

The required directional derivative is the limit of this ratio as $\lambda \to 0$, namely

$f'(c)\,t = (p, q)\begin{pmatrix} l \\ m \end{pmatrix} = pl + qm$

(since $\epsilon(c+\lambda t)/\lambda \to 0$ as $\lambda \to 0$). Note that $t$ is here a unit vector.

Example 4(a). Let $f(x, y) = x^2 + 3xy + y^3$. The directional derivative of $f$ at $\begin{pmatrix} 2 \\ 1 \end{pmatrix}$ in the direction $\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}$ is

$f'\begin{pmatrix} 2 \\ 1 \end{pmatrix}\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} = (7, 9)\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} = 7\cos\theta + 9\sin\theta.$

(b) Find the directional derivative of $f(x, y, z) = 2x^2 + xy + yz^2$ at $(1, -1, 2)$ in the direction of the vector $A = (1, -2, 2)$.

Solution. $\nabla f(x, y, z) = (4x + y,\; x + z^2,\; 2yz)$, so $\nabla f(1, -1, 2) = (3, 5, -4)$. The unit vector in the direction of $A = (1, -2, 2)$ is $\left(\frac{1}{3}, -\frac{2}{3}, \frac{2}{3}\right)$. Therefore the directional derivative at $(1, -1, 2)$ in the direction of $A$ is

$(3, 5, -4)\begin{pmatrix} 1/3 \\ -2/3 \\ 2/3 \end{pmatrix} = 1 - \frac{10}{3} - \frac{8}{3} = -5.$
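A directional derivative can also be estimated straight from its definition as a one-dimensional difference quotient along the unit direction. A sketch of mine (not part of the notes) for a three-variable polynomial of the form used in Example 4(b):

```python
# Directional derivative of f(x, y, z) = 2x^2 + xy + yz^2 at (1, -1, 2)
# along A = (1, -2, 2), via a central difference along the unit vector.
f = lambda x, y, z: 2*x**2 + x*y + y*z**2

def directional(f, p, a, h=1e-6):
    """(f(p + h*u) - f(p - h*u)) / (2h) with u = a / |a|."""
    norm = sum(ai * ai for ai in a) ** 0.5
    u = [ai / norm for ai in a]
    fp = f(*(pi + h*ui for pi, ui in zip(p, u)))
    fm = f(*(pi - h*ui for pi, ui in zip(p, u)))
    return (fp - fm) / (2 * h)

d = directional(f, (1.0, -1.0, 2.0), (1.0, -2.0, 2.0))
# grad f = (4x + y, x + z^2, 2yz) = (3, 5, -4) here, so d should be near -5.
```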
2.4. Vector Functions

Let $\phi$ and $\psi$ each be differentiable real functions of the two real variables $x$ and $y$. The pair of equations

$u = \phi(x, y), \quad v = \psi(x, y)$

defines a mapping from the point $(x, y)$ to the point $(u, v)$. If, instead of considering points, we consider a vector $w$, with components $x, y$, and a vector $s$, with components $u, v$, then the two equations define a mapping from the vector $w$ to the vector $s$. This mapping is then specified by the vector function

$f = \begin{pmatrix} \phi \\ \psi \end{pmatrix}.$

Definition. The vector function $f$ is differentiable at $c$ if there is a matrix $f'(c)$ such that

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w)$  (1)

holds, with $\epsilon(w)/\|w - c\| \to 0$ as $w \to c$. The term $f'(c)(w - c)$ is called the linear part of $f$ at $c$.

Example 5. Let $\phi(x, y) = x^2 + y^2$ and $\psi(x, y) = 2xy$. These functions are differentiable at $(1, 2)$, and calculation shows that

$u = 5 + 2(x-1) + 4(y-2) + \left((x-1)^2 + (y-2)^2\right);$

$v = 4 + 4(x-1) + 2(y-2) + \left(2(x-1)(y-2)\right).$

This pair of equations combines into a single matrix equation:

$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \end{pmatrix} + \begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix}\begin{pmatrix} x-1 \\ y-2 \end{pmatrix} + \begin{pmatrix} (x-1)^2 + (y-2)^2 \\ 2(x-1)(y-2) \end{pmatrix}.$

In vector notation, this may be written as

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w),$

where now $f'(c)$ is a $2 \times 2$ matrix. Since the components $\epsilon_1, \epsilon_2$ of $\epsilon$ satisfy $\epsilon_1(w)/\|w - c\| \to 0$ and $\epsilon_2(w)/\|w - c\| \to 0$ as $w \to c$, it follows that $\|\epsilon(w)\|/\|w - c\| \to 0$ as $w \to c$.
Example 6. For the function

$f = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix},$

the linear part at $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$ is $\begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix}\begin{pmatrix} x-1 \\ y-2 \end{pmatrix}$.

Example 7. Let $f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x^2 - y^2 \\ 2xy \end{pmatrix}$. Calculate $f'\begin{pmatrix} 1 \\ 2 \end{pmatrix}$.

Here

$f'\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2x & -2y \\ 2y & 2x \end{pmatrix}; \qquad f'\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 & -4 \\ 4 & 2 \end{pmatrix}.$
2.5. Functions of Functions

Let the differentiable function $f$ map the vector $w$ to the vector $s$; let the differentiable function $g$ map the vector $s$ to the vector $t$. Diagrammatically,

$w \xrightarrow{\;f\;} s \xrightarrow{\;g\;} t.$

Then the composition $h = g \circ f$ of the functions $g$ and $f$ maps $w$ to $t$. Since $f$ and $g$ are differentiable,

$f(w) = f(c) + A(w - c) + \epsilon; \qquad g(f(w)) = g(f(c)) + B\,(f(w) - f(c)) + \delta;$

where $A$ and $B$ are suitable matrices, and $\epsilon$ and $\delta$ can be neglected; then approximately

$g(f(w)) - g(f(c)) \approx BA(w - c).$

The linear part of $h$ is, in fact, $BA(w - c)$. From the chain rule we have

$h'(c) = g'(f(c))\,f'(c).$

Example 8. Let

$f(x, y) = \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix}, \qquad g(u, v) = 3u + v^2, \qquad c = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.$

Then

$f(c) = f\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1^2 + 2^2 \\ 2\cdot 1\cdot 2 \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \end{pmatrix},$

$f'(x, y) = \begin{pmatrix} 2x & 2y \\ 2y & 2x \end{pmatrix}, \qquad f'(c) = \begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix},$

$g'(u, v) = (3,\; 2v), \qquad g'(f(c)) = (3, 8).$

The chain rule then gives

$h'\begin{pmatrix} 1 \\ 2 \end{pmatrix} = (3\;\; 8)\begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix} = (38\;\; 28).$

Example 9. $z = x^2 + y^2$, $x = \cos t$, $y = \sin t$. Here

$f(t) = \begin{pmatrix} \cos t \\ \sin t \end{pmatrix}, \qquad g(x, y) = x^2 + y^2.$

Taking partial derivatives,

$f'(t) = \begin{pmatrix} -\sin t \\ \cos t \end{pmatrix}; \qquad g'(x, y) = (2x\;\; 2y).$

Hence

$\frac{dz}{dt} = (2x\;\; 2y)\begin{pmatrix} -\sin t \\ \cos t \end{pmatrix} = -2\cos t\sin t + 2\sin t\cos t = 0,$

as it should.
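The matrix chain rule of Example 8 can be cross-checked against plain partial derivatives of the composed scalar function $h(x, y) = g(f(x, y))$. A sketch of mine (not from the notes):

```python
# Chain rule h'(c) = g'(f(c)) f'(c) for f(x, y) = (x^2 + y^2, 2xy)
# and g(u, v) = 3u + v^2 at c = (1, 2).
def h(x, y):
    u, v = x**2 + y**2, 2*x*y
    return 3*u + v**2

# Chain-rule value: (3, 8) @ [[2, 4], [4, 2]] = (38, 28).
hx_chain, hy_chain = 3*2 + 8*4, 3*4 + 8*2

eps = 1e-6
hx_num = (h(1 + eps, 2) - h(1 - eps, 2)) / (2 * eps)
hy_num = (h(1, 2 + eps) - h(1, 2 - eps)) / (2 * eps)
```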
3. Gateaux and Frechet Derivatives

Nonlinear operators can be investigated by establishing a connection between them and linear operators, more precisely, by the technique of local approximation to the nonlinear operator by a linear one. The differential calculus for nonlinear operators is needed for this purpose. Differentiation is a technique that helps us to approximate a nonlinear operator locally. In Banach spaces there are different types of derivatives. Among them, the Gateaux and Frechet derivatives are very important for applications. One of them is more general than the other, but in some special circumstances they are equivalent.

3.1. Gateaux Derivative

The Gateaux derivative is the generalization of the directional derivative.

Definition. Let X and Y be Banach spaces and let P be an operator such that $P : X \to Y$. Then P is said to be Gateaux differentiable at $x_0 \in X$ if there exists a continuous linear operator $U : X \to Y$ (in general depending on $x_0$) such that

$\lim_{t \to 0} \frac{P(x_0 + tx) - P(x_0)}{t} = U(x)$

for every $x \in X$. The above is clearly equivalent to

$\lim_{t \to 0} \frac{1}{|t|}\,\|P(x_0 + tx) - P(x_0) - tU(x)\| = 0$  (1)

for every $x \in X$.

In the above situation, U is called the Gateaux derivative of P at $x_0$, written $U = P'(x_0)$, and its value at $x \in X$ is denoted by $P'(x_0)(x)$ or simply $P'(x_0)x$.
Theorem 1. The Gateaux derivative, if it exists, is unique.

Proof. Let U and V be two Gateaux derivatives of P at $x_0$. From the relation, for $t \ne 0$,

$V(x) - U(x) = \frac{1}{t}\,[P(x_0 + tx) - P(x_0) - tU(x)] - \frac{1}{t}\,[P(x_0 + tx) - P(x_0) - tV(x)],$

we obtain for $t > 0$

$\|V(x) - U(x)\| \le \frac{1}{t}\,\|P(x_0 + tx) - P(x_0) - tU(x)\| + \frac{1}{t}\,\|P(x_0 + tx) - P(x_0) - tV(x)\|,$

and as $t \to 0$, because of (1), both the expressions on the right-hand side tend to zero. Since this is true for each $x \in X$, we see that $U = V$ and the theorem is proved.
We now suppose that X and Y are finite dimensional, say $X = R^n$ and $Y = R^m$. Let us analyse the representation of the Gateaux derivative of an operator from X into Y. We know that if U is a linear operator from X into Y then U is given by the matrix

$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$

where $y = U(x)$, $x = (\xi_1, \xi_2, \ldots, \xi_n) \in X$, $y = (\eta_1, \eta_2, \ldots, \eta_m) \in Y$ and

$\eta_i = \sum_{k=1}^{n} a_{ik}\,\xi_k, \qquad i = 1, 2, \ldots, m.$  (1)

Let P be an operator mapping an open subset G of X into Y, and let $x = (\xi_1, \xi_2, \ldots, \xi_n) \in G$, $y = (\eta_1, \eta_2, \ldots, \eta_m) \in Y$ and $y = P(x)$. Then we see that there exist numerical functions $\varphi_1, \varphi_2, \ldots, \varphi_m$ such that

$\eta_i = \varphi_i(\xi_1, \xi_2, \ldots, \xi_n), \qquad i = 1, 2, \ldots, m.$  (2)

Suppose that the Gateaux derivative of P exists at $x_0 = (\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)})$ and $P'(x_0) = U$. Let U be given by the above matrix. If the equation

$\lim_{t \to 0} \frac{P(x_0 + tx) - P(x_0)}{t} = U(x)$

is written in full, then with the help of the relations (1) and (2) we obtain the m relations

$\lim_{t \to 0} \frac{\varphi_i(\xi_1^{(0)} + t\xi_1, \ldots, \xi_n^{(0)} + t\xi_n) - \varphi_i(\xi_1^{(0)}, \ldots, \xi_n^{(0)})}{t} = \sum_{k=1}^{n} a_{ik}\,\xi_k, \qquad i = 1, 2, \ldots, m.$  (3)

The relation (3) holds for all $x = (\xi_1, \xi_2, \ldots, \xi_n) \in X$, and therefore, taking in turn an x whose coordinates are all zero except one which is equal to unity, we see that the functions $\varphi_1, \varphi_2, \ldots, \varphi_m$ have partial derivatives with respect to $\xi_1, \xi_2, \ldots, \xi_n$ and

$\frac{\partial \varphi_i}{\partial \xi_k}(\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)}) = a_{ik},$

where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$. The derivative $P'(x_0) = U$ is therefore given by the matrix of partial derivatives of the functions $\varphi_1, \varphi_2, \ldots, \varphi_m$:

$P'(x_0) = \begin{pmatrix} \partial\varphi_1/\partial\xi_1 & \partial\varphi_1/\partial\xi_2 & \cdots & \partial\varphi_1/\partial\xi_n \\ \partial\varphi_2/\partial\xi_1 & \partial\varphi_2/\partial\xi_2 & \cdots & \partial\varphi_2/\partial\xi_n \\ \cdots & \cdots & \cdots & \cdots \\ \partial\varphi_m/\partial\xi_1 & \partial\varphi_m/\partial\xi_2 & \cdots & \partial\varphi_m/\partial\xi_n \end{pmatrix},$

which is known as the Jacobian matrix and is denoted by $J(x_0)$.
Example 1. In this example, we show that the existence of the partial derivatives of $\varphi_1, \ldots, \varphi_m$ need not guarantee the existence of $P'(x_0)$. Let m = 1 and n = 2 and

$\varphi_1(\xi_1, \xi_2) = \frac{\xi_1 \xi_2}{\xi_1^2 + \xi_2^2} \;\text{ for } (\xi_1, \xi_2) \ne (0, 0), \qquad \varphi_1(0, 0) = 0, \qquad x_0 = (0, 0).$

Then

$\frac{\partial \varphi_1}{\partial \xi_1}(0, 0) = \lim_{h \to 0} \frac{\varphi_1(h, 0) - \varphi_1(0, 0)}{h} = \lim_{h \to 0} \frac{h\cdot 0/(h^2 + 0)}{h} = 0,$

and similarly $\dfrac{\partial \varphi_1}{\partial \xi_2}(0, 0) = 0$.

Therefore, if the derivative $P'(x_0)$ were to exist, then it must be the zero operator, and then (3) should give

$\lim_{t \to 0} \frac{\varphi_1(t\xi_1, t\xi_2)}{t} = 0.$

But

$\lim_{t \to 0} \frac{\varphi_1(t\xi_1, t\xi_2)}{t} = \lim_{t \to 0} \frac{1}{t}\cdot\frac{(t\xi_1)(t\xi_2)}{(t\xi_1)^2 + (t\xi_2)^2} = \lim_{t \to 0} \frac{\xi_1\xi_2}{t\,(\xi_1^2 + \xi_2^2)},$

which does not exist.

(b) Let m = n = 2 and $P(x_1, x_2) = (x_1^3, x_2^2)$. Let $z = (z_1, z_2)$ be any point; then we see that

$P'(z) = \begin{pmatrix} 3z_1^2 & 0 \\ 0 & 2z_2 \end{pmatrix}.$
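In the finite-dimensional case, the Gateaux derivative, when it exists, is the Jacobian, so the defining limit can be approximated for small $t$ and compared with the Jacobian applied to $x$. A sketch of mine (not from the notes) for the map of Example (b):

```python
# Example (b): P(x1, x2) = (x1^3, x2^2) has Gateaux derivative (Jacobian)
# [[3*z1^2, 0], [0, 2*z2]] at z.  Approximate (P(z + t*x) - P(z)) / t.
def P(x1, x2):
    return (x1**3, x2**2)

z = (2.0, 3.0)
x = (1.0, -1.0)
t = 1e-6
diff = [(P(z[0] + t*x[0], z[1] + t*x[1])[i] - P(*z)[i]) / t for i in range(2)]
J = [[3*z[0]**2, 0.0], [0.0, 2*z[1]]]
Jx = [J[0][0]*x[0] + J[0][1]*x[1],
      J[1][0]*x[0] + J[1][1]*x[1]]          # expect (12, -6) at z = (2, 3)
```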
3.2. Frechet Derivative

The derivative of a real function of a real variable is defined by

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$  (1)

This definition cannot be used in the case of mappings defined in Banach spaces, because h is then a vector and division by a vector is meaningless. On the other hand, the division by a vector can be easily avoided by rewriting (1) in the form

$f(x+h) = f(x) + f'(x)h + \epsilon(h)\,h,$  (2)

where $\epsilon$ is a function of h such that $\epsilon(h) \to 0$ as $h \to 0$. Equivalently, we can now say that $f'(x)$ is the derivative of f at x if

$f(x+h) = f(x) + f'(x)h + \epsilon(h),$  (3)

where $\epsilon(h)/|h| \to 0$ as $h \to 0$.

The definition based on (3) can be generalized to include mappings from a Banach space into a Banach space. This leads to the concept of Frechet differentiability and the Frechet derivative.

Definition. Let f map a ball $\{x : \|x - x_0\| < r\}$ in a Banach space X into a Banach space Y. f is said to be Frechet differentiable at $x_0$ if there exists a continuous linear operator $A : X \to Y$ such that

$f(x_0 + h) = f(x_0) + Ah + \epsilon(h), \qquad \frac{\|\epsilon(h)\|}{\|h\|} \to 0 \;\text{ as } h \to 0.$
Here $D_i f(a)$ denotes $\dfrac{\partial f}{\partial x_i}(a)$. If $f \in C^2(U, R)$ then, for each $v \in R^n$ with $a + v \in U$, the linear part of

$[f'(a+v) - f'(a)]u = \sum_{i=1}^{n} [D_i f(a+v) - D_i f(a)]\,u_i$

is

$(f''(a))(u, v) = \sum_{i,j=1}^{n} D_{ij} f(a)\,u_i v_j,$

where $D_{ij} f(a)$ denotes $\dfrac{\partial^2 f}{\partial x_i \partial x_j}(a)$.

This process may be continued. If $f \in C^k(U, R)$, denote

$D_{i_1 i_2 \ldots i_k} f(a) = \frac{\partial^k f}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_k}}(a).$

Define then, for $w_1, w_2, \ldots, w_k \in R^n$,

$(f^{(k)}(a))(w_1, w_2, \ldots, w_k) = \sum_{i_1, i_2, \ldots, i_k = 1}^{n} D_{i_1 i_2 \ldots i_k} f(a)\, w_{1, i_1}\, w_{2, i_2} \cdots w_{k, i_k},$

where $w_{1, i_1}$ denotes the $i_1$ component of $w_1$. If all $w_i = w$, we abbreviate $(f^{(k)}(a))(w, w, \ldots, w)$ to $(f^{(k)}(a))(w)^k$.

The derivative $f'(a)$ is represented by a $1 \times n$ matrix whose components are $D_i f(a)$. Also $f''(a)$ is represented by an $n \times n$ matrix, M say, whose $i, j$ element is $D_{ij} f(a)$; if u and v are regarded as columns, then

$(f''(a))(u, v) = v^T M u.$

It is shown below that M is a symmetric matrix.

Example. Define $f : R^2 \to R$ by the polynomial

$f(x, y) = 3 + 7x + 4y + x^2 + 3xy + 2y^2 + x^3 + 2y^3.$

Then

$f'(0 + v) - f'(0) = [\,2v_1 + 3v_2 + 3v_1^2,\;\; 3v_1 + 4v_2 + 6v_2^2\,].$

Taking the linear part of this expression, and applying it to $u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}$,

$(f''(0))(u, v) = (2v_1 + 3v_2,\; 3v_1 + 4v_2)\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = (v_1\;\; v_2)\begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$

where the $2 \times 2$ matrix consists of second partial derivatives.

In more abstract terms, let $L(R^n, R)$ denote the vector space of continuous linear maps from $R^n$ into $R$. Then, in terms of continuous linear maps, $f'(a) \in L(R^n, R)$, $f''(a) \in L(R^n, L(R^n, R))$, and so on. Also $f''(a)$ is a bilinear map from $R^n \times R^n$ into $R$; this means that $(f''(a))(u, v)$ is linear in u for each fixed v, and also linear in v for each fixed u.

Theorem 2 (Taylor's). Let $f \in C^k(U; R)$, let $a \in U$ and let $a + x \in U$. Then

$f(a+x) = f(a) + \frac{1}{1!}\,f'(a)x + \frac{1}{2!}\,(f''(a))(x)^2 + \cdots + \frac{1}{(k-1)!}\,(f^{(k-1)}(a))(x)^{k-1} + \frac{1}{k!}\,(f^{(k)}(c))(x)^k,$

where $c = a + \theta x$ for some $\theta$ in $0 < \theta < 1$.
To establish the symmetry of the second derivative, write

$\Delta(x, y) = f(x, y) - f(0, y) - f(x, 0) + f(0, 0)$  (i)

and, for fixed y, let $\varphi(x) = f(x, y) - f(x, 0)$. Then the mean-value theorem shows that, for fixed y,

$\Delta(x, y) = \varphi(x) - \varphi(0) = x\,\varphi'(\theta x) = x\left[\frac{\partial f}{\partial x}(\theta x, y) - \frac{\partial f}{\partial x}(\theta x, 0)\right]$

for some $\theta$ in $0 < \theta < 1$.
3.4. Example of a Bilinear Operator

Assume that $c = (c_i) \in R^m$ is a real m-vector, that $A = (a_{ij})$ is a real (m, m) matrix and that $B = (b_{ijk})$ is a bilinear operator from $R^m \times R^m$ to $R^m$. Then the mapping $f : R^m \to R^m$ defined by

$f(z) = c + Az + Bz^2, \quad z \in R^m \quad (\text{where } Bz^2 = Bzz)$  (b1)

is called a quadratic operator. The equation $f(z) = 0$, that is,

$Bz^2 + Az + c = 0,$  (b2)

is called a quadratic equation in $R^m$.

As a simple but very important example of the quadratic equation (b2), we consider the algebraic eigenvalue problem

$Tx = \lambda x,$  (b3)

where $T = (t_{ij})$ is a real (n, n) matrix. We assume that the eigenvector $x = (x_i)$ has Euclidean length one:

$\sum_{i=1}^{n} x_i^2 = 1.$  (b4)

If we set $z = (x_1, x_2, \ldots, x_n, \lambda)^T$, then (b3) and (b4) can be written as a system of nonlinear equations, namely

$f(z) = \begin{pmatrix} (T - \lambda I)x \\ \frac{1}{2}\left(1 - \|x\|^2\right) \end{pmatrix} = 0.$  (b5)

It is well known that (b5) is a quadratic equation of the form (b1), where m = n + 1 and

$c = \begin{pmatrix} 0 & 0 & \cdots & \frac{1}{2} \end{pmatrix}^T \in R^m,$  (b6)

$A = \begin{pmatrix} T & 0 \\ 0 & 0 \end{pmatrix},$  (b7)

and B is the bilinear operator collecting the quadratic terms, so that

$Bz^2 = \begin{pmatrix} -\lambda x \\ -\frac{1}{2}\|x\|^2 \end{pmatrix}.$  (b8)

For the mapping (b5), we get by using (b6), (b7) and (b8) that

$f(z) = c + Az + Bz^2, \qquad f'(z) = A + 2Bz, \qquad f''(z) = 2B.$

Therefore $f'(z)$ has the matrix representation

$f'(z) = \begin{pmatrix} T - \lambda I & -x \\ -x^T & 0 \end{pmatrix}$

and $f''(z)$ is the bilinear operator defined in (b8), multiplied by the factor two.
For $n = 2$,

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad z = \begin{pmatrix} x_1 \\ x_2 \\ \lambda \end{pmatrix}, \qquad c = \begin{pmatrix} 0 \\ 0 \\ \frac{1}{2} \end{pmatrix},$

$f(z) = \begin{pmatrix} (T - \lambda I)x \\ \frac{1}{2}(1 - \|x\|^2) \end{pmatrix} = \begin{pmatrix} t_{11}x_1 + t_{12}x_2 - \lambda x_1 \\ t_{21}x_1 + t_{22}x_2 - \lambda x_2 \\ \frac{1}{2}\left(1 - x_1^2 - x_2^2\right) \end{pmatrix},$

$f'(z) = \begin{pmatrix} t_{11} - \lambda & t_{12} & -x_1 \\ t_{21} & t_{22} - \lambda & -x_2 \\ -x_1 & -x_2 & 0 \end{pmatrix} = \begin{pmatrix} T - \lambda I & -x \\ -x^T & 0 \end{pmatrix},$

and $f''(z)$ is the constant bilinear operator whose three component matrices (one for each component of f) are

$\begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix};$

and for $n = 3$ the four component matrices of $f''(z)$ are, analogously, the $4 \times 4$ matrices

$\begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$
4. Hessian Matrix and Unconstrained Optimization

In mathematics, the Hessian matrix (or simply the Hessian) is the square matrix of second-order partial derivatives of a function; that is, it describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named after him. Hesse himself had used the term "functional determinants".

Given the real-valued function $f(x_1, x_2, \ldots, x_n)$, if all second partial derivatives of f exist, then the Hessian matrix of f is the matrix

$H(f)_{ij}(x) = D_i D_j f(x),$

where $x = (x_1, x_2, \ldots, x_n)$ and $D_i$ is the differentiation operator with respect to the ith argument, and the Hessian becomes

$H(f) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}.$

$H(f(x))$ is frequently shortened to simply $H(x)$. Some mathematicians define the Hessian as the determinant of the above matrix.

Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function. That is,

$y = f(x + \Delta x) \approx f(x) + J(x)\,\Delta x + \frac{1}{2}\,\Delta x^T H(x)\,\Delta x,$

where J is the Jacobian matrix, which is a vector (the gradient) for a scalar-valued function. The full Hessian matrix can be difficult to compute in practice; in such situations, quasi-Newton algorithms have been developed that use approximations to the Hessian. The best known quasi-Newton algorithm is the BFGS algorithm.
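The Newton-type use of the Hessian mentioned above can be illustrated in a few lines (this sketch is mine, not from the notes; the test function is a convenient quadratic of my own choosing): each step solves $H(x)\,\Delta x = -\nabla f(x)$ and updates $x \leftarrow x + \Delta x$.

```python
# One Newton step on f(x, y) = (x - 1)^2 + 10*(y + 2)^2, minimizer (1, -2).
# Gradient: (2(x-1), 20(y+2)); Hessian: [[2, 0], [0, 20]] (constant).
def newton_step(x, y):
    gx, gy = 2*(x - 1), 20*(y + 2)          # gradient
    hxx, hxy, hyy = 2.0, 0.0, 20.0          # constant Hessian
    det = hxx*hyy - hxy*hxy
    dx = -( hyy*gx - hxy*gy) / det          # solve the 2x2 system H d = -g
    dy = -(-hxy*gx + hxx*gy) / det
    return x + dx, y + dy

x, y = 5.0, 5.0
x, y = newton_step(x, y)   # exact in one step, since f is quadratic
```

For non-quadratic f the step is repeated until the gradient is small; quasi-Newton methods such as BFGS replace the exact Hessian solve with an updated approximation.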
4.1. Mixed Derivatives and Symmetry of the Hessian

The mixed derivatives of f are the entries off the main diagonal in the Hessian. Assuming that they are continuous, the order of differentiation does not matter (Clairaut's theorem). For example,

$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right).$

This can also be written as $f_{xy} = f_{yx}$. In a formal statement: if the second derivatives of f are all continuous in a neighborhood D, then the Hessian of f is a symmetric matrix throughout D.

Example 1. Consider the real-valued function $f(x, y) = 5xy^3$. Then $\nabla f(x, y) = (5y^3, 15xy^2)$ and the Hessian matrix is

$H(f(x, y)) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 0 & 15y^2 \\ 15y^2 & 30xy \end{pmatrix}.$
Example 2. Let $f(x, y) = x^2 + 3xy - y^3$. Then

$\nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (2x + 3y,\; 3x - 3y^2).$

The Hessian matrix is

$H(f(x, y)) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 3 & -6y \end{pmatrix}.$

Example 3. Let a function $f : R^2 \to R^2$ be given by

$f(x, y) = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix}.$

Then

$f'(x, y) = \begin{pmatrix} 2x & 2y \\ 2y & 2x \end{pmatrix} \quad\text{and}\quad f''(x, y) = \left(\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},\; \begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix}\right),$

which is not a Hessian matrix but a bilinear operator.
4.2. Critical Points and Discriminant

If the gradient of f is zero at some point x, then f has a critical point (or a stationary point) at x. The determinant of the Hessian at x is then called the discriminant. If this determinant is zero then x is called a degenerate critical point of f; this is also called a non-Morse critical point of f. Otherwise, it is non-degenerate; this is called a Morse critical point of f.

For a real-valued function $f(x, y)$ of two variables x and y, let $\lambda_1$ and $\lambda_2$ be the eigenvalues of the corresponding Hessian matrix of f. Then

$\lambda_1 + \lambda_2 = \operatorname{trace}(H) \quad\text{and}\quad \lambda_1 \lambda_2 = \det(H).$
Example 4. For the previous example, the critical points of f are given by

$\nabla f(x, y) = (2x + 3y,\; 3x - 3y^2) = (0, 0),$

which is satisfied at $(x, y) = (0, 0)$ (and also at $(9/4, -3/2)$). At $(0, 0)$,

$H = \begin{pmatrix} 2 & 3 \\ 3 & 0 \end{pmatrix}.$

The eigenvalues of H are

$\lambda_1, \lambda_2 = 1 + \sqrt{10},\; 1 - \sqrt{10}.$

Hence $\lambda_1 + \lambda_2 = 2 = \operatorname{trace}(H)$ and $\lambda_1 \lambda_2 = -9 = \det(H)$.
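The eigenvalue identities in Example 4 can be confirmed directly from the characteristic polynomial of the $2 \times 2$ matrix (a small sketch of mine, not part of the notes):

```python
import math

# Example 4: H = [[2, 3], [3, 0]] has eigenvalues 1 ± sqrt(10);
# their sum is trace(H) = 2 and their product is det(H) = -9.
a, b, c, d = 2.0, 3.0, 3.0, 0.0
tr, det = a + d, a*d - b*c
disc = math.sqrt(tr*tr - 4*det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
```

Since $\det(H) = -9 < 0$, the eigenvalues have opposite signs, so $(0, 0)$ is a (non-degenerate) saddle-type critical point.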
4.3. Functions of one Variable.
Definitions. Suppose )(xf is a real-valued function defined
on some interval (The interval I may be finite or infinite,
open or closed, or half-open.). A point *x in I is:(a) a global minimizer for )(xf on I if
)()( * xfxf for all x in I;(b) a stritct global minimizer for )(xf on I if
)()( * xfxf < for all x in I such that *xx ;(c) a local minimizer for )(xf if there is a positive
number such that )()( * xfxf for all x in I for
which +
7/27/2019 Learning Hessian matrix.pdf
45/100
(e) a critical point of )(xf if )( *xf exists and isequal to zero.
The Taylors formula (single variable)
Theorem1. Suppose that )(,)(,)( xfxfxf exist on the
closed interval [ ] { }bxaRxba = :, .If xx ,* are anytwo different points of ]ba, , then there exists a point zstrictly between *x andx such that
2**** )(2
)())(()()( xxzfxxxfxfxf ++= .
If *x is the critical point of )(xf then the above formula
reduces to
2** )(
2
)(0)()( xx
zfxfxf
++= ,
or
2** )(2
)()()( xx
zfxfxf
=
for all *xx .
Theorem 2. Suppose that )(,)(,)( xfxfxf are all
continuous on an interval Iand that Ix
*
is a criticalpoint of )(xf .
(a) If 0)( xf for all Ix , then *x is a globalminimizer of )(xf on I.
7/27/2019 Learning Hessian matrix.pdf
46/100
(b) If 0)( > xf for all Ix , such that *xx , then*
x is a strict global minimizer of )(xf on I.
(c) If 0)( > xf , then *x is a strict local minimizerof )(xf .
Example5. Consider 143)( 34 += xxxf .Since ),1(121212)( 223 == xxxxxf the only critical
points of )(xf are 0=x and 1=x . Also, since)23(122436)( 2 == xxxxxf we see that
0)0( =f and 12)1( =f .Therefore, 1=x is a strict local minimizer of )(xf Definition. Suppose that )(xf be a numerical function
defined on a subset D of nR . A point x in D is
(i) a global minimizer for )(xf on D if )()( xfxf forall Dx ;
(ii) a strictly global minimizer for )(xf on D if)()( xfxf < for all Dx such that xx ;
(iii) a local minimizer for )(xf if there is a positivenumbersuch that )()( xfxf for all Dx forwhich ),( xBx ;
(iv) a strictly local minimizer for )(xf if there is apositive numbersuch that )()( xfxf < for all
Dx for which ),( xBx and xx ;(v) a critical point for )(xf if the first partial derivatives
of )(xf exist at x and
7/27/2019 Learning Hessian matrix.pdf
47/100
0)( =i
xxf for ni ,,2,1 K= ,
that is0)( = xf
Example 6. Consider f(x, y) = 40 + x³(4 − x) + 3(y − 5)². Then
∇f(x, y) = (12x² − 4x³, 6(y − 5)) = (4x²(3 − x), 6(y − 5)).
∇f = 0 gives the two critical points (x, y) ∈ {(3, 5), (0, 5)}.
4.4. Taylor's Formula (several variables).

Theorem 5. Suppose that x*, x are points in Rⁿ and that f(x) is a function of n variables with continuous first and second partial derivatives on some open set containing the line segment

[x*, x] = {w ∈ Rⁿ : w = x* + t(x − x*); 0 ≤ t ≤ 1}

joining x* and x. Then there exists a z ∈ [x*, x] such that

f(x) = f(x*) + ∇f(x*)·(x − x*) + ½ (x − x*)·Hf(z)(x − x*).

Here H denotes the Hessian matrix. This formula is used to develop tests for maximizers and minimizers among the critical points of a function.
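For a quadratic function the second-order Taylor formula above is exact (the Hessian is constant, so any z works); this makes it easy to verify numerically. A minimal sketch, with a test function of our own choosing:

```python
import numpy as np

# For f(x) = x.Ax + b.x with A symmetric, grad f = 2Ax + b and Hf = 2A
# (constant), so Theorem 5 holds exactly with any z on the segment.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -1.0])

def f(x):    return x @ A @ x + b @ x
def grad(x): return 2 * A @ x + b

H = 2 * A                                # constant Hessian

xstar = np.array([0.5, -0.2])
x = np.array([1.3, 0.7])
d = x - xstar
taylor = f(xstar) + grad(xstar) @ d + 0.5 * d @ H @ d
assert np.isclose(taylor, f(x))          # expansion is exact here
```

For non-quadratic f the same expansion holds with the Hessian evaluated at some intermediate point z, which is exactly what the mean-value form of Theorem 5 asserts.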
Theorem 6. Suppose that x* is a critical point of a function f(x) with continuous first and second partial derivatives on Rⁿ. Then:
(a) x* is a global minimizer for f(x) if
(x − x*)·Hf(z)(x − x*) ≥ 0 for all x ∈ Rⁿ and all z ∈ [x*, x];
(b) x* is a strict global minimizer for f(x) if
(x − x*)·Hf(z)(x − x*) > 0 for all x ∈ Rⁿ such that x ≠ x* and all z ∈ [x*, x];
(c) x* is a global maximizer for f(x) if
(x − x*)·Hf(z)(x − x*) ≤ 0 for all x ∈ Rⁿ and all z ∈ [x*, x];
(d) x* is a strict global maximizer for f(x) if
(x − x*)·Hf(z)(x − x*) < 0 for all x ∈ Rⁿ such that x ≠ x* and all z ∈ [x*, x].
In general, Q_A(y) is a sum of terms of the form c_ij y_i y_j, where i, j = 1, …, n and c_ij is a coefficient which may be zero; that is, every term in Q_A(y) is of second degree in the variables y₁, y₂, …, y_n. On the other hand, any function q(y₁, …, y_n) that is a sum of second-degree terms in y₁, y₂, …, y_n can be expressed as the quadratic form associated with an n×n symmetric matrix A by splitting the coefficient of y_i y_j between the (i, j) and (j, i) entries of A.
Example 7(a). The function
q(y₁, y₂, y₃) = y₁² + y₂² + 4y₃² + 2y₁y₂ + 4y₂y₃
is the sum of second-degree terms in y₁, y₂, y₃. Splitting the coefficients of yᵢyⱼ, we get

q(y₁, y₂, y₃) = y₁² + y₂² + 4y₃² + y₁y₂ + y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 2y₂y₃ + 2y₃y₂

             = (y₁ y₂ y₃) [ 1  1  0 ] (y₁)
                          [ 1  1  2 ] (y₂)
                          [ 0  2  4 ] (y₃)

where

A = [ d₁₁ d₁₂ d₁₃ ]   [ 1  1  0 ]
    [ d₂₁ d₂₂ d₂₃ ] = [ 1  1  2 ]
    [ d₃₁ d₃₂ d₃₃ ]   [ 0  2  4 ]

with
dᵢᵢ = coefficient of yᵢ²,
dᵢⱼ or dⱼᵢ = ½ (coefficient of yᵢyⱼ or yⱼyᵢ), i ≠ j.
(b) q(y₁, y₂, y₃) = y₁² + 3y₂² + y₃²

= y₁² + 3y₂² + y₃² + 0·y₁y₂ + 0·y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 0·y₂y₃ + 0·y₃y₂

= (y₁ y₂ y₃) [ 1  0  0 ] (y₁)
             [ 0  3  0 ] (y₂)
             [ 0  0  1 ] (y₃)
(c) q(y₁, y₂, y₃) = (y₁ + 2y₂)² − y₃²

= y₁² + 4y₂² − y₃² + 4y₁y₂

= y₁² + 4y₂² − y₃² + 2y₁y₂ + 2y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 0·y₂y₃ + 0·y₃y₂

= (y₁ y₂ y₃) [ 1  2  0 ] (y₁)
             [ 2  4  0 ] (y₂)
             [ 0  0 −1 ] (y₃)
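The coefficient-splitting rule can be verified mechanically: the symmetric matrix built this way always reproduces the quadratic form. A sketch, using the matrix reconstructed for Example 7(a) above:

```python
import numpy as np

# Example 7(a): q(y) = y1^2 + y2^2 + 4 y3^2 + 2 y1 y2 + 4 y2 y3.
# Each cross coefficient is split evenly between the (i,j) and (j,i) entries.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 2.0],
              [0.0, 2.0, 4.0]])

def q(y):
    y1, y2, y3 = y
    return y1**2 + y2**2 + 4*y3**2 + 2*y1*y2 + 4*y2*y3

rng = np.random.default_rng(0)
for _ in range(100):
    y = rng.standard_normal(3)
    assert np.isclose(q(y), y @ A @ y)   # q(y) = y.Ay for every y
```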
The Hessian H = Hf(z) of f(x) evaluated at a point z is an n×n symmetric matrix. For x, x* ∈ Rⁿ, the quadratic form Q_H associated with H, evaluated at x − x*, is
Q_H(x − x*) = (x − x*)·Hf(z)(x − x*).
4.6. Positive and Negative Semidefinite and Definite Matrices

Definitions. Suppose that A is an n×n symmetric matrix and that Q_A(y) = y·Ay is the quadratic form associated with A. Then A and Q_A are called:
(a) positive semidefinite if Q_A(y) = y·Ay ≥ 0 for all y ∈ Rⁿ;
(b) positive definite if Q_A(y) = y·Ay > 0 for all y ∈ Rⁿ, y ≠ 0;
(c) negative semidefinite if Q_A(y) = y·Ay ≤ 0 for all y ∈ Rⁿ;
(d) negative definite if Q_A(y) = y·Ay < 0 for all y ∈ Rⁿ, y ≠ 0;
(e) indefinite if Q_A(y) > 0 for some y ∈ Rⁿ and Q_A(y) < 0 for some other y ∈ Rⁿ.
(d) x* is a strict global maximizer for f(x) if Hf(x) is negative definite on Rⁿ.
Here are some examples.

Example 8.

(a) A symmetric matrix whose entries are all positive need not be positive definite. For example, the matrix

A = [ 1  4 ]
    [ 4  1 ]

is not positive definite. For if x = (1, −1), then

Q_A(x) = (1, −1) [ 1  4 ] ( 1)  = −3 − 3 = −6 < 0.
                 [ 4  1 ] (−1)
(c) The diagonal matrix

A = [ 1  0  0 ]
    [ 0  3  0 ]
    [ 0  0  2 ]

is positive definite because the associated quadratic form Q_A(x) is
Q_A(x) = x·Ax = x₁² + 3x₂² + 2x₃²
and so Q_A(x) > 0 unless x₁ = x₂ = x₃ = 0.

(d) A 3×3 diagonal matrix

A = [ d₁  0  0 ]
    [ 0  d₂  0 ]
    [ 0   0 d₃ ]

is
(1) positive definite if and only if dᵢ > 0 for i = 1, 2, 3;
(2) positive semidefinite if and only if dᵢ ≥ 0 for i = 1, 2, 3;
(3) negative definite if and only if dᵢ < 0 for i = 1, 2, 3;
(4) negative semidefinite if and only if dᵢ ≤ 0 for i = 1, 2, 3.
For example, if d₁ > 0, d₂ > 0 and d₃ = 0, then
Q_A(x) = d₁x₁² + d₂x₂² ≥ 0
for all x since d₁ > 0, d₂ > 0; but if x = (0, 0, 1), then Q_A(x) = 0 even though x ≠ 0, so A is positive semidefinite but not positive definite.
(e) If a 2×2 symmetric matrix

A = [ a  b ]
    [ b  c ]

is positive definite, then a > 0 and c > 0. For if x = (1, 0), then x ≠ 0 and so
0 < Q_A(x) = a·1² + 2b·1·0 + c·0² = a.
Similarly, if x = (0, 1), then 0 < Q_A(x) = c. However, (a) shows that there are 2×2 symmetric matrices with a > 0, c > 0 that are not positive definite. We can see that the size of b relative to the size of the product ac is the determining factor for positive definiteness.

The examples show that for general symmetric matrices there is little relationship between the signs of the matrix entries and the positive or negative definite features of the matrix. They also show that for diagonal matrices, these features are completely transparent.
Here are some examples of positive definite, positive semidefinite, negative definite and negative semidefinite matrices over the real field:

Example 9.

(a) A = [  2 −1  0 ]
        [ −1  2 −1 ]
        [  0 −1  2 ],   X = (x₁ x₂ x₃)ᵀ ≠ 0:

XᵀAX = x₁² + (x₁ − x₂)² + (x₂ − x₃)² + x₃² > 0.

The matrix A is PD.

(b) A = [ 25 15  5 ]
        [ 15 18  0 ]
        [  5  0 11 ],   X = (x₁ x₂ x₃)ᵀ ≠ 0:

XᵀAX = 25x₁² + 18x₂² + 11x₃² + 30x₁x₂ + 10x₁x₃ > 0.

The matrix A is PD.

(c) A = [ 1  1  1  1 ]
        [ 1  5  5  5 ]
        [ 1  5 14 14 ]
        [ 1  5 14 15 ],   X = (x₁ x₂ x₃ x₄)ᵀ ≠ 0:

XᵀAX = (x₁ + x₂ + x₃ + x₄)² + 4(x₂ + x₃ + x₄)² + 9(x₃ + x₄)² + x₄² > 0.

The matrix A is PD.

(d) A = [  2 −1  0  0 ]
        [ −1  2 −1  0 ]
        [  0 −1  2 −1 ]
        [  0  0 −1  1 ],   X = (x₁ x₂ x₃ x₄)ᵀ ≠ 0:

XᵀAX = x₁² + (x₁ − x₂)² + (x₂ − x₃)² + (x₃ − x₄)² > 0.

The matrix A is PD.
(e) q(x) = x₁² + x₂², x ∈ R³. For any nonzero x₃ ∈ R and x₁ = x₂ = 0, the vector x = (0, 0, x₃) ≠ 0 gives q(x) = x₁² + x₂² = 0. The matrix

A = [ 1  0  0 ]
    [ 0  1  0 ]
    [ 0  0  0 ]

is PSD (but not PD).

(f) q(x) = −(x₁² + 3x₂² + x₃²)

= (x₁ x₂ x₃) [ −1  0  0 ] (x₁)
             [  0 −3  0 ] (x₂)
             [  0  0 −1 ] (x₃).

The matrix A = diag(−1, −3, −1) is ND.

(g) q(x) = −((x₁ − 2x₂)² + x₃²)

= (x₁ x₂ x₃) [ −1  2  0 ] (x₁)
             [  2 −4  0 ] (x₂)
             [  0  0 −1 ] (x₃).

The matrix

A = [ −1  2  0 ]
    [  2 −4  0 ]
    [  0  0 −1 ]

is NSD.
We will develop two basic tests for positive and negative definiteness: one in terms of determinants, and another in terms of eigenvalues.
4.7. Determinant Approach

We begin by looking at functions of two variables. If A is a 2×2 symmetric matrix

A = [ a₁₁  a₁₂ ]
    [ a₁₂  a₂₂ ]

then the associated quadratic form is
Q_A(x) = x·Ax = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂².
For any x ≠ 0 in R², either x = (x₁, 0) with x₁ ≠ 0 or x = (x₁, x₂) with x₂ ≠ 0. Let us analyze the sign of Q_A(x) in terms of the entries of A in each of these two cases.

Case 1. x = (x₁, 0) with x₁ ≠ 0.
In this case Q_A(x) = a₁₁x₁², so Q_A(x) > 0 if and only if a₁₁ > 0, while Q_A(x) < 0 if and only if a₁₁ < 0.

Case 2. x = (x₁, x₂) with x₂ ≠ 0.
Setting t = x₁/x₂, we have Q_A(x) = x₂²(a₁₁t² + 2a₁₂t + a₂₂) = x₂²φ(t), so Q_A(x) > 0 for all such x if and only if
φ(t) = a₁₁t² + 2a₁₂t + a₂₂ > 0 for all t ∈ R.
Note that φ′(t) = 2a₁₁t + 2a₁₂ and φ″(t) = 2a₁₁, so that t* = −a₁₂/a₁₁ is a critical point of φ(t), and this critical point is a strict minimizer if a₁₁ > 0 and a strict maximizer if a₁₁ < 0. If a₁₁ > 0 and t ∈ R, then
φ(t) ≥ φ(t*) = a₂₂ − a₁₂²/a₁₁ = (1/a₁₁)(a₁₁a₂₂ − a₁₂²) = (1/a₁₁) det [ a₁₁  a₁₂ ]
                                                                     [ a₁₂  a₂₂ ].

Thus, if a₁₁ > 0 and

det [ a₁₁  a₁₂ ] > 0,
    [ a₁₂  a₂₂ ]

then φ(t) > 0 for all t ∈ R and so Q_A(x) > 0 for all x = (x₁, x₂) with x₂ ≠ 0. On the other hand, if Q_A(x) > 0 for all such x, then φ(t) > 0 for all t ∈ R, and so a₁₁ > 0 and the discriminant of φ(t),

4a₁₂² − 4a₁₁a₂₂ = −4 det [ a₁₁  a₁₂ ],
                         [ a₁₂  a₂₂ ]

is negative; that is, a₁₁ > 0 and det A > 0. An entirely similar analysis shows that Q_A(x) < 0 for all x ≠ 0 if and only if a₁₁ < 0 and det A > 0. Combining these observations, we obtain:

Theorem 8. The 2×2 symmetric matrix A is:
(a) positive definite if and only if a₁₁ > 0 and det A > 0;
(b) negative definite if and only if a₁₁ < 0 and det A > 0.

The 2×2 case and a little imagination suggest the correct formulation of the general case.
Suppose A is an n×n symmetric matrix. Define Δ_k to be the determinant of the upper left-hand k×k submatrix of A, for 1 ≤ k ≤ n. The determinant Δ_k is called the kth principal minor of A. Thus, for

A = [ a₁₁  a₁₂  a₁₃  …  a₁ₙ ]
    [ a₁₂  a₂₂  a₂₃  …  a₂ₙ ]
    [ a₁₃  a₂₃  a₃₃  …  a₃ₙ ]
    [  ⋮    ⋮    ⋮        ⋮  ]
    [ a₁ₙ  a₂ₙ  a₃ₙ  …  aₙₙ ],

Δ₁ = a₁₁,  Δ₂ = det [ a₁₁  a₁₂ ],  …,  Δₙ = det A.
                    [ a₁₂  a₂₂ ]
The general theorem can be formulated as follows:

Theorem 9. If A is an n×n symmetric matrix and if Δ_k is the kth principal minor of A for 1 ≤ k ≤ n, then:
(a) A is positive definite if and only if Δ_k > 0 for k = 1, 2, …, n;
(b) A is negative definite if and only if (−1)ᵏ Δ_k > 0 for k = 1, 2, …, n (that is, the principal minors alternate in sign, with Δ₁ < 0).
We verify the theorem for n = 3. By Theorem 8 applied to the upper left-hand 2×2 submatrix:
(a) x·Ax > 0 for all x ≠ 0 such that x₃ = 0 if and only if Δ₁ > 0, Δ₂ > 0;
(b) x·Ax < 0 for all x ≠ 0 such that x₃ = 0 if and only if Δ₁ < 0, Δ₂ > 0.
For x with x₃ ≠ 0, put s = x₁/x₃ and t = x₂/x₃; then x·Ax = x₃² φ(s, t), where
φ(s, t) = a₁₁s² + a₂₂t² + 2a₁₂st + 2a₁₃s + 2a₂₃t + a₃₃ .
In addition, x·Ax > 0 for all such x if and only if φ(s, t) > 0 for all real numbers s, t.
If

Δ₂ = det [ a₁₁  a₁₂ ] ≠ 0,
         [ a₁₂  a₂₂ ]

the critical point (s*, t*) of φ(s, t) is unique, and this unique solution is given by Cramer's Rule as

s* = −(1/Δ₂) det [ a₁₃  a₁₂ ],   t* = −(1/Δ₂) det [ a₁₁  a₁₃ ].   (1)
                 [ a₂₃  a₂₂ ]                     [ a₁₂  a₂₃ ]
If we multiply the equation
a₁₁s* + a₁₂t* + a₁₃ = 0
by s*, and multiply the equation
a₁₂s* + a₂₂t* + a₂₃ = 0
by t*, and add the results, we obtain
a₁₁(s*)² + a₂₂(t*)² + 2a₁₂s*t* + a₁₃s* + a₂₃t* = 0.
Consequently,
φ(s*, t*) = a₁₃s* + a₂₃t* + a₃₃,
and so (1) implies that if Δ₂ ≠ 0, then

φ(s*, t*) = (1/Δ₂) det [ a₁₁  a₁₂  a₁₃ ] = (det A)/Δ₂ = Δ₃/Δ₂.   (2)
                       [ a₁₂  a₂₂  a₂₃ ]
                       [ a₁₃  a₂₃  a₃₃ ]
Since

Hφ(s, t) = [ 2a₁₁  2a₁₂ ],   det Hφ(s, t) = 4Δ₂,
           [ 2a₁₂  2a₂₂ ]

it follows from Theorem 8 and Theorem 7 that (s*, t*) is a strict global minimizer for φ(s, t) if and only if Δ₁ > 0, Δ₂ > 0. Similarly, (s*, t*) is a strict global maximizer for φ(s, t) if and only if Δ₁ < 0, Δ₂ > 0.

Now if Δ₁ > 0, Δ₂ > 0, Δ₃ > 0, then the conclusion (a) of Case 1 shows that if x ≠ 0 and x₃ = 0, then x·Ax > 0; on the other hand, the considerations in Case 2 show that if x ≠ 0, x₃ ≠ 0, s = x₁/x₃, t = x₂/x₃, then
x·Ax = x₃² φ(s, t) ≥ x₃² φ(s*, t*) = x₃² Δ₃/Δ₂ > 0.
Therefore x·Ax > 0 for all x ≠ 0 if Δ₁ > 0, Δ₂ > 0, Δ₃ > 0.
On the other hand, if x·Ax > 0 for all x ≠ 0, then the conclusion (a) of Case 1 shows that Δ₁ > 0, Δ₂ > 0. Also, if x* = (s*, t*, 1), then (2) yields
Δ₃/Δ₂ = φ(s*, t*) = x*·Ax* > 0, so Δ₃ > 0. This proves part (a) of the theorem for n = 3.
Example 9.
(a) Minimize the function
f(x₁, x₂, x₃) = x₁² + x₂² + x₃² + x₁x₂ + x₂x₃ + x₃x₁.
The critical points of f(x₁, x₂, x₃) are the solutions of the system
2x₁ + x₂ + x₃ = 0,
x₁ + 2x₂ + x₃ = 0,
x₁ + x₂ + 2x₃ = 0,
or

[ 2  1  1 ] (x₁)   (0)
[ 1  2  1 ] (x₂) = (0).
[ 1  1  2 ] (x₃)   (0)

Since

det [ 2  1  1 ] = 4 ≠ 0,
    [ 1  2  1 ]
    [ 1  1  2 ]

x₁ = 0, x₂ = 0, x₃ = 0 is the one and only solution.
The Hessian of f(x₁, x₂, x₃) is the constant matrix

Hf(x₁, x₂, x₃) = [ 2  1  1 ]
                 [ 1  2  1 ]
                 [ 1  1  2 ].

Note that Δ₁ = 2, Δ₂ = 3, Δ₃ = 4, so Hf(x₁, x₂, x₃) is positive definite everywhere on R³.
It follows from Theorem 7 that the critical point (0, 0, 0) is a strict global minimizer for f(x₁, x₂, x₃), and this is the only critical point of f(x₁, x₂, x₃).
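The minimization example above can be checked mechanically: the gradient system is linear and the minors of the constant Hessian are easy to compute. A brief sketch:

```python
import numpy as np

# f = x1^2 + x2^2 + x3^2 + x1 x2 + x2 x3 + x3 x1; grad f(x) = M x with:
M = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])

assert abs(np.linalg.det(M)) > 1e-12     # nonsingular: x = 0 is the only
x = np.linalg.solve(M, np.zeros(3))      # solution of grad f = 0
assert np.allclose(x, 0.0)

# Leading principal minors of the Hessian (which equals M here):
minors = [np.linalg.det(M[:k, :k]) for k in (1, 2, 3)]
print(minors)  # 2, 3, 4 -> positive definite, strict global minimizer at 0
```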
(b) Find the global minimizer of
f(x, y, z) = e^(x−y) + e^(y−x) + e^(x²) + z².
To this end, compute
∇f(x, y, z) = ( e^(x−y) − e^(y−x) + 2x e^(x²),  −e^(x−y) + e^(y−x),  2z ),
and

Hf(x, y, z) = [ e^(x−y) + e^(y−x) + (4x² + 2)e^(x²)   −(e^(x−y) + e^(y−x))   0 ]
              [ −(e^(x−y) + e^(y−x))                   e^(x−y) + e^(y−x)     0 ]
              [ 0                                      0                     2 ].

Clearly, Δ₁ > 0 for all x, y, z because all of its terms are positive. Also
Δ₂ = (e^(x−y) + e^(y−x) + (4x² + 2)e^(x²))(e^(x−y) + e^(y−x)) − (e^(x−y) + e^(y−x))²
   = (4x² + 2)e^(x²)(e^(x−y) + e^(y−x)) > 0
because both factors are always positive. Finally, Δ₃ = 2Δ₂ > 0. Hence Hf(x, y, z) is positive definite at all points. Therefore by Theorem 7, f(x, y, z) is strictly globally minimized at any critical point (x*, y*, z*). To find (x*, y*, z*), solve
∇f(x*, y*, z*) = ( e^(x*−y*) − e^(y*−x*) + 2x* e^((x*)²),  −e^(x*−y*) + e^(y*−x*),  2z* ) = 0.
This leads to z* = 0 and e^(x*−y*) = e^(y*−x*), hence 2x* e^((x*)²) = 0. Accordingly, x* − y* = y* − x*, that is, x* = y*, and x* = 0. Therefore (x*, y*, z*) = (0, 0, 0) is the global minimizer of f(x, y, z).
(c) Find the global minimizers of
f(x, y) = e^(x−y) + e^(y−x).
To this end, compute
∇f(x, y) = ( e^(x−y) − e^(y−x),  −e^(x−y) + e^(y−x) )
and

Hf(x, y) = [ e^(x−y) + e^(y−x)      −(e^(x−y) + e^(y−x)) ]
           [ −(e^(x−y) + e^(y−x))   e^(x−y) + e^(y−x)    ].

Since e^(x−y) + e^(y−x) > 0 for all x, y and det Hf(x, y) = 0, the Hessian Hf(x, y) is positive semidefinite for all x, y. Therefore by Theorem 7, f(x, y) is minimized at any critical point (x*, y*) of f(x, y).
To find (x*, y*), solve
0 = ∇f(x*, y*) = ( e^(x*−y*) − e^(y*−x*),  −e^(x*−y*) + e^(y*−x*) ).
This gives
e^(x*−y*) = e^(y*−x*), or x* − y* = y* − x*;
that is, 2x* = 2y*. This shows that all points of the line y = x are global minimizers of f(x, y).
(d) Find the global minimizers of
f(x, y) = e^(x+y) + e^(x−y).
In this case,
∇f(x, y) = ( e^(x+y) + e^(x−y),  e^(x+y) − e^(x−y) ),

Hf(x, y) = [ e^(x+y) + e^(x−y)   e^(x+y) − e^(x−y) ]
           [ e^(x+y) − e^(x−y)   e^(x+y) + e^(x−y) ].

Since e^(x+y) + e^(x−y) > 0 for all x, y and det Hf(x, y) = 4e^(2x) > 0, Hf(x, y) is positive definite for all x, y. Therefore by Theorem 7, f(x, y) is minimized at any critical point (x*, y*). To find (x*, y*),
0 = ∇f(x*, y*) = ( e^(x*+y*) + e^(x*−y*),  e^(x*+y*) − e^(x*−y*) ).
Thus
e^(x*+y*) + e^(x*−y*) = 0  and  e^(x*+y*) − e^(x*−y*) = 0.
But e^(x*+y*) > 0 and e^(x*−y*) > 0 for all x*, y*. Therefore the equality e^(x*+y*) + e^(x*−y*) = 0 is impossible. Thus f(x, y) has no critical points, and hence f(x, y) has no global minimizers.
There is no disputing that global minimization is far moreimportant than mere local minimization. Still there are
certain situations in which scientists want knowledge of
local minimizers of a function. Since we are in an excellent
position to understand local minimization, let us get on
with it. The basic fact to understand is the next theorem.
Theorem 10. Suppose that f(x) is a function with continuous first and second partial derivatives on some set D in Rⁿ. Suppose x* is an interior point of D and that x* is a critical point of f(x). Then x* is:
(a) a strict local minimizer of f(x) if Hf(x*) is positive definite;
(b) a strict local maximizer of f(x) if Hf(x*) is negative definite.
Now we briefly investigate the meaning of an indefinite Hessian at a critical point of a function. Suppose that f(x) has continuous second partial derivatives on a set D in Rⁿ, that x* is an interior point of D which is a critical point of f(x), and that Hf(x*) is indefinite. This means that there are nonzero vectors y, w in Rⁿ such that
y·Hf(x*)y > 0,  w·Hf(x*)w < 0.
Since f(x) has continuous second partial derivatives on D, there is a δ > 0 such that for all t with |t| < δ,
y·Hf(x* + ty)y > 0  and  w·Hf(x* + tw)w < 0.
Set Y(t) = f(x* + ty) and W(t) = f(x* + tw) for |t| < δ. Therefore, t = 0 is a strict local minimizer for Y(t) and a strict local maximizer for W(t).
Thus, if we move from x* in the direction of y or −y, the values of f(x) increase, but if we move from x* in the direction of w or −w, the values of f(x) decrease. For this reason, we call the critical point x* a saddle point.
The following result summarizes this little discussion:

Theorem 11. If f(x) is a function with continuous second partial derivatives on a set D in Rⁿ, if x* is an interior point of D that is a critical point of f(x), and if the Hessian Hf(x*) is indefinite, then x* is a saddle point for f(x).
Example 10. Let us look for the global and local minimizers and maximizers (if any) of the function
f(x₁, x₂) = x₁³ − 12x₁x₂ + 8x₂³.
In this case, the critical points are the solutions of the system
0 = ∂f/∂x₁ = 3x₁² − 12x₂,
0 = ∂f/∂x₂ = −12x₁ + 24x₂².
This system can be readily solved to identify the critical points (2, 1) and (0, 0).
The Hessian of f(x₁, x₂) is

Hf(x₁, x₂) = [ 6x₁   −12  ]
             [ −12   48x₂ ].

Since

Hf(2, 1) = [ 12  −12 ]
           [ −12  48 ]

and since Δ₁ = 12 and Δ₂ = 432, it follows that the critical point (2, 1) is a strict local minimizer.
Now let us see whether (2, 1) is a global minimizer. Observe that Hf(x₁, x₂) is not positive definite for all (x₁, x₂); for example,

Hf(0, 1) = [ 0   −12 ]
           [ −12  48 ]

is indefinite. In view of Theorem 7, this leads us to suspect that (2, 1) may not be a global minimizer. The fact that
lim_{x₁→−∞} f(x₁, 0) = −∞
shows conclusively that f(x₁, x₂) has no global minimizer. Moreover, since
lim_{x₁→+∞} f(x₁, 0) = +∞,
we see that there are no global maximizers or global minimizers.
How about the critical point (0, 0)? Well, since

Hf(0, 0) = [ 0   −12 ]
           [ −12   0 ],

this matrix miserably fails the tests for positive definiteness, and this leads us to expect trouble at (0, 0). Theorem 11 tells us that there is a saddle point at (0, 0).
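The classification of the two critical points of Example 10 can be confirmed numerically through the Hessian's eigenvalues (a sketch; `hess` is our own helper):

```python
import numpy as np

# Hessian of f(x1, x2) = x1^3 - 12 x1 x2 + 8 x2^3.
def hess(x1, x2):
    return np.array([[6.0 * x1, -12.0],
                     [-12.0, 48.0 * x2]])

w_min = np.linalg.eigvalsh(hess(2.0, 1.0))   # at the critical point (2, 1)
w_sad = np.linalg.eigvalsh(hess(0.0, 0.0))   # at the critical point (0, 0)

assert (w_min > 0).all()          # positive definite -> strict local minimizer
assert w_sad[0] < 0 < w_sad[1]    # one negative, one positive -> saddle point
```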
4.8. Eigenvalues and Positive Definite Matrices

If A is an n×n matrix and if x is a nonzero vector in Rⁿ such that Ax = λx for some real or complex number λ, then λ is called an eigenvalue of A. If λ is an eigenvalue of A, then any nonzero vector x that satisfies the equation Ax = λx is an eigenvector of A corresponding to λ. Since λ is an eigenvalue of an n×n matrix A if and only if the homogeneous system (A − λI)x = 0 of n equations in n unknowns has a nonzero solution x, it follows that the eigenvalues of A are just the roots of the characteristic equation
det(A − λI) = 0.
Since det(A − λI) is a polynomial of degree n in λ, the characteristic equation has n real or complex roots if we count the multiple roots according to their multiplicities, so an n×n matrix A has n real or complex eigenvalues counting their multiplicities.

Symmetric matrices have the following special properties with respect to eigenvalues and eigenvectors:
(1) All of the eigenvalues of a symmetric matrix are real numbers.
(2) Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal.
(3) If λ is an eigenvalue of multiplicity k for a symmetric matrix A, there are k linearly independent eigenvectors corresponding to λ. By applying the Gram-Schmidt Orthogonalization Process, we can always replace these k linearly independent eigenvectors with a set of k mutually orthogonal eigenvectors of unit length.
Combining (2) and (3), we see that if A is an n×n symmetric matrix, there are n mutually orthogonal unit eigenvectors u⁽¹⁾, …, u⁽ⁿ⁾ corresponding to the n eigenvalues λ₁, …, λₙ. If P is the n×n matrix whose ith column is the unit eigenvector u⁽ⁱ⁾ corresponding to λᵢ, and if D is the diagonal matrix with the eigenvalues λ₁, …, λₙ down the main diagonal, then the following matrix equation holds:
AP = PD
because Au⁽ⁱ⁾ = λᵢu⁽ⁱ⁾ for i = 1, …, n. Since the matrix P is orthogonal (that is, its columns are mutually orthogonal unit vectors), P is invertible and the inverse P⁻¹ of P is just the transpose Pᵀ of P. It follows that
PᵀAP = D,
that is, the orthogonal matrix P diagonalizes A. If Q_A(x) = x·Ax is the quadratic form associated with the symmetric matrix A and if x = Py, then
Q_A(x) = x·Ax = (Py)·A(Py) = y·(PᵀAP)y = y·Dy = λ₁y₁² + λ₂y₂² + … + λₙyₙ².
Moreover, since P is invertible, x ≠ 0 if and only if y ≠ 0. Also, if y⁽ⁱ⁾ is the vector in Rⁿ with the ith component equal to 1 and all other components equal to zero, and if x⁽ⁱ⁾ = Py⁽ⁱ⁾, then
Q_A(x⁽ⁱ⁾) = λᵢ
for all i = 1, …, n. These considerations yield the following eigenvalue test for definite, semidefinite, and indefinite matrices.
Theorem 12. If A is a symmetric matrix, then:
(a) the matrix A is positive definite (resp. negativedefinite) if and only if all the eigenvalues of A are
positive (resp. negative);
(b) the matrix A is positive semidefinite (resp.negative semidefinite) if and only if all the
eigenvalues of A are nonnegative (resp. nonpositive);
(c) the matrix A is indefinite if and only if A has atleast one positive eigenvalue and at least one negative
eigenvalue.
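The eigenvalue test of Theorem 12 is straightforward to implement for symmetric matrices; a sketch (the `classify` helper and its tolerance handling are ours), applied to the matrices of Example 9(e)-(g):

```python
import numpy as np

# Eigenvalue test of Theorem 12 for a symmetric matrix A.
def classify(A, tol=1e-12):
    w = np.linalg.eigvalsh(A)      # real eigenvalues of a symmetric matrix
    if (w > tol).all():   return "PD"
    if (w >= -tol).all(): return "PSD"
    if (w < -tol).all():  return "ND"
    if (w <= tol).all():  return "NSD"
    return "indefinite"

assert classify(np.diag([1.0, 1.0, 0.0])) == "PSD"     # Example 9(e)
assert classify(np.diag([-1.0, -3.0, -1.0])) == "ND"   # Example 9(f)
assert classify(np.array([[-1.0, 2.0, 0.0],
                          [2.0, -4.0, 0.0],
                          [0.0, 0.0, -1.0]])) == "NSD" # Example 9(g)
```

Note that strict checks are placed before the semidefinite ones, so a definite matrix is never reported as merely semidefinite.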
Example 11. Let us locate all maximizers, minimizers, and saddle points of
f(x₁, x₂, x₃) = x₁² + x₂² + x₃² + 4x₁x₂.
The critical points of f(x₁, x₂, x₃) are solutions of the system of equations
0 = ∂f/∂x₁ = 2x₁ + 4x₂,
0 = ∂f/∂x₂ = 4x₁ + 2x₂,
0 = ∂f/∂x₃ = 2x₃.
(0, 0, 0) is the one and only solution of this system.
The Hessian of f(x₁, x₂, x₃) is the constant matrix

Hf(x₁, x₂, x₃) = [ 2  4  0 ]
                 [ 4  2  0 ]
                 [ 0  0  2 ].

The eigenvalues of the Hessian matrix are λ = 2, 6, −2, so the Hessian is indefinite at the critical point (0, 0, 0), and hence it is a saddle point for f(x₁, x₂, x₃).
4.9. Problems posed as minimization problems.

(a) Consider the system of equations
Ax = b
with A of order (m, n) and m > n. The problem becomes that of finding the x which minimizes ‖Ax − b‖₂²; this is a quadratic function in x, and hence the vector g = ∇f(x) = ∇‖Ax − b‖₂² containing the first-order partial derivatives is linear in x. The solution is found to be that of the normal equations
AᵀAx = Aᵀb.
Note that the Hessian of f is 2AᵀA, which is seen to be a positive definite matrix when the rank of A is n, meaning that the problem is posed as a minimization problem.
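The normal-equations solution can be checked against a library least-squares solver; a brief sketch on randomly generated data of our own:

```python
import numpy as np

# Overdetermined system Ax = b (m > n), solved via the normal equations
# A^T A x = A^T b.  The Hessian of ||Ax - b||^2 is 2 A^T A, positive
# definite when rank(A) = n.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))         # m = 10 > n = 3
b = rng.standard_normal(10)

x = np.linalg.solve(A.T @ A, A.T @ b)    # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)           # both give the same minimizer
```

In floating-point practice, solving the normal equations squares the condition number of A, so library routines (QR- or SVD-based, like `lstsq`) are preferred for ill-conditioned problems; here the construction merely illustrates the theory.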
(b) Another example is the least squares method for fitting the straight line y = ax + b to the set of points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ). This problem becomes that of finding the constants a and b which minimize the function
f(a, b) = Σᵢ (yᵢ − axᵢ − b)².
They are given by
0 = ∂f/∂a = 2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)(−xᵢ)
and
0 = ∂f/∂b = 2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)(−1).
The above equations can be organized in the convenient form

[ n          Σᵢ₌₁ⁿ xᵢ  ] (b)   ( Σᵢ₌₁ⁿ yᵢ   )
[ Σᵢ₌₁ⁿ xᵢ   Σᵢ₌₁ⁿ xᵢ² ] (a) = ( Σᵢ₌₁ⁿ xᵢyᵢ ).
We can easily justify that the matrix on the left-hand side is nonsingular (when the xᵢ are not all equal) because
n Σᵢ₌₁ⁿ xᵢ² > ( Σᵢ₌₁ⁿ xᵢ )².
[Hölder's inequality:
Σᵢ₌₁ⁿ |aᵢbᵢ| ≤ ( Σᵢ₌₁ⁿ |aᵢ|ᵖ )^(1/p) ( Σᵢ₌₁ⁿ |bᵢ|^q )^(1/q),  1/p + 1/q = 1.
Letting aᵢ = xᵢ, bᵢ = 1 and p = q = 2, we get
Σᵢ₌₁ⁿ xᵢ ≤ ( Σᵢ₌₁ⁿ xᵢ² )^(1/2) n^(1/2),
i.e.,
( Σᵢ₌₁ⁿ xᵢ )² ≤ n Σᵢ₌₁ⁿ xᵢ². ]
The matrix on the left-hand side is also the Hessian matrix of the function f(a, b) (up to the factor 2); this being positive definite means that we are again dealing with a minimization problem.
Many other nonlinear functions which seem difficult to deal with, except by nonlinear techniques, can be dealt with using the same technique explained above if they can be formulated as a quadratic objective function. The procedure is to take the logarithm of both sides to formulate the problem as follows:
min_{a,b} Σᵢ₌₁ⁿ ( ln yᵢ − ln a − bxᵢ )²
(this corresponds to fitting a model of the form y = a e^(bx)).
If the function is not quadratic, the above method fails, for g(x) = ∇f(x) = 0 generates a set of n nonlinear simultaneous equations. The procedure will then be to choose a guess point x⁰ and improve on it until the solution is reached. The procedure is as follows.
Expand f(x) around x⁰ by Taylor's series:
f(x) = f(x⁰) + g(x⁰)ᵀ(x − x⁰) + ½ (x − x⁰)ᵀ H(x⁰)(x − x⁰) + O(‖x − x⁰‖³).
Now x̂ is defined as the point at which ∇f(x̂) = 0. Differentiating the Taylor expansion with respect to x (or x − x⁰) gives
0 ≈ g(x⁰) + H(x⁰)(x̂ − x⁰),
where the third term is neglected if x⁰ is rightly chosen near x̂. Hence
x̂ ≈ x⁰ − H⁻¹(x⁰) g(x⁰),
where the vector y = H⁻¹g is obtained by solving the linear equations
Hy = g.
Because the above correction for x̂ is only approximate, the solution x̂ can be improved by iteration.
The convergence of the above method is guaranteed if H after every iteration is found positive definite for a minimization problem, or negative definite for a maximization problem.
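The iteration just described (Newton's method) can be sketched in a few lines. The test function below is our own, modeled loosely on Example (b) above so that its Hessian is positive definite everywhere and the minimizer is the origin:

```python
import numpy as np

# Newton iteration: x_{k+1} = x_k - y, where H(x_k) y = g(x_k).
# Test function (ours): f(x) = e^{x1-x2} + e^{x2-x1} + x1^2,
# strictly convex with minimizer x1 = x2 = 0.
def grad(x):
    u, v = np.exp(x[0] - x[1]), np.exp(x[1] - x[0])
    return np.array([u - v + 2.0 * x[0], -u + v])

def hess(x):
    u, v = np.exp(x[0] - x[1]), np.exp(x[1] - x[0])
    return np.array([[u + v + 2.0, -(u + v)],
                     [-(u + v), u + v]])

x = np.array([1.0, 2.0])                 # initial guess x^0
for _ in range(20):
    y = np.linalg.solve(hess(x), grad(x))   # solve H y = g
    x = x - y                               # Newton correction

assert np.allclose(grad(x), 0.0)         # converged to a critical point
assert np.allclose(x, 0.0)               # which is the minimizer (0, 0)
```

Since the Hessian here is positive definite at every iterate, the convergence condition stated above is satisfied and each step decreases f.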
For example, for a minimization problem,
f(x̂) ≈ f(x⁰) + g(x⁰)ᵀ(x̂ − x⁰) + ½ (x̂ − x⁰)ᵀ H(x⁰)(x̂ − x⁰)
     = f(x⁰) + gᵀ{−H⁻¹g} + ½ {−H⁻¹g}ᵀ H {−H⁻¹g}
     = f(x⁰) − gᵀH⁻¹g + ½ gᵀH⁻¹g
     = f(x⁰) − ½ gᵀH⁻¹g
(all quantities evaluated at x⁰), and since H is positive definite, H⁻¹ is also positive definite. Hence
gᵀH⁻¹g > 0.
From this, we obtain f(x̂) < f(x⁰).

⟨g₃, d₂⟩ = ⟨g₃, −g₂ + (⟨g₂, g₂⟩/⟨g₁, g₁⟩) d₁⟩
         = −⟨g₃, g₂⟩ + (⟨g₂, g₂⟩/⟨g₁, g₁⟩)⟨g₃, d₁⟩
         = 0        (since ⟨g₁, g₂⟩ = 0).

And minimizing in the direction of d₂, we obtain
⟨g₃, g₂⟩ = 0.
The above property holds because g₂ is a linear combination of d₂ and d₁. But d₂ is orthogonal to g₃, and so is d₁, for
⟨g₃, d₁⟩ = 0.
Example. Consider
f(x, y) = 4x² − 4xy + 2y².
Solution. The function can be rewritten as

f(x, y) = (x, y) [  4 −2 ] (x)
                 [ −2  2 ] (y).

∇f(x, y) = (8x − 4y, −4x + 4y),

H = ∇²f(x, y) = [  8 −4 ]
                [ −4  4 ].

Let us start with
x₁ = (2, 3)ᵀ.
Then
d₁ = −g₁ = −(16 − 12, −8 + 12) = −(4, 4),
and the exact line-search step is

λ₁ = ⟨g₁, g₁⟩ / ⟨d₁, Hd₁⟩ = 32 / ( (4, 4) [  8 −4 ] (4) ) = 32/64 = 1/2.
                                          [ −4  4 ] (4)
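The first steepest-descent step of the example above can be reproduced numerically (a sketch; the starting point x₁ = (2, 3)ᵀ follows the example):

```python
import numpy as np

# First steepest-descent step for f(x,y) = 4x^2 - 4xy + 2y^2.
H = np.array([[8.0, -4.0],
              [-4.0, 4.0]])              # constant Hessian
def grad(x): return H @ x                # since f(x) = 0.5 * x.Hx

x1 = np.array([2.0, 3.0])
g1 = grad(x1)
assert np.allclose(g1, [4.0, 4.0])       # gradient at the start

d1 = -g1                                 # steepest-descent direction
lam = (g1 @ g1) / (d1 @ H @ d1)          # exact line-search step length
assert np.isclose(lam, 0.5)              # 32 / 64, as in the example

x2 = x1 + lam * d1                       # next iterate: (0, 1)
```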