7/27/2019 Learning Hessian matrix.pdf
1/100
Derivatives, Higher Derivatives, Hessian Matrix
And its Application in Numerical Optimization
Dr M Zulfikar Ali
Professor
Department of Mathematics, University of Rajshahi
1. First-, Second- and Higher-Order Derivatives

1.1. Differentiable Real-Valued Function

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. $f$ is said to be differentiable at $x$ if for all $\Delta x \in R^n$ such that $x + \Delta x \in \Gamma$ we have that

$f(x+\Delta x) = f(x) + t(x)\,\Delta x + \alpha(x, \Delta x)\,\|\Delta x\|,$

where $t(x)$ is an n-dimensional bounded vector, and $\alpha$ is a real-valued function of $\Delta x$ such that $\lim_{\Delta x \to 0} \alpha(x, \Delta x) = 0$.

$f$ is said to be differentiable on $\Gamma$ if it is differentiable at each $x$ in $\Gamma$. [Obviously, if $f$ is differentiable on the open set $\Gamma$, it is also differentiable on any subset $\Gamma_1$ (open or not) of $\Gamma$. Hence when we say that $f$ is differentiable on some set $\Gamma_1$ (open or not), we shall mean that $f$ is differentiable on some open set containing $\Gamma_1$.]

Theorem 1. Let $f(x)$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$.

(i) If $f(x)$ is differentiable at $x$, then $f(x)$ is continuous at $x$, and $\nabla f(x)$ exists (but not conversely), and

$f(x+\Delta x) = f(x) + \nabla f(x)\,\Delta x + \alpha(x, \Delta x)\,\|\Delta x\|, \qquad \lim_{\Delta x \to 0} \alpha(x, \Delta x) = 0,$

for $x + \Delta x \in \Gamma$.

(ii) If $f(x)$ has continuous partial derivatives at $x$ with respect to $x_1, x_2, \ldots, x_n$, that is, $\nabla f(x)$ exists and $\nabla f$ is continuous at $x$, then $f$ is differentiable at $x$.
1.2. Twice Differentiable Real-Valued Function

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. $f$ is said to be twice differentiable at $x$ if for all $\Delta x \in R^n$ such that $x + \Delta x \in \Gamma$ we have that

$f(x+\Delta x) = f(x) + \nabla f(x)\,\Delta x + \frac{1}{2}\,\Delta x\,\nabla^2 f(x)\,\Delta x + \beta(x, \Delta x)\,\|\Delta x\|^2,$

where $\nabla^2 f(x)$ is an $n \times n$ matrix of bounded elements, and $\beta$ is a real-valued function of $\Delta x$ such that $\lim_{\Delta x \to 0} \beta(x, \Delta x) = 0$.

The $n \times n$ matrix $\nabla^2 f(x)$ is called the Hessian (matrix) of $f$ at $x$, and its $ij$-th element is written as

$[\nabla^2 f(x)]_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}, \qquad i, j = 1, 2, \ldots, n.$

Obviously, if $f$ is twice differentiable at $x$, it must also be differentiable at $x$.
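The Hessian defined above can be approximated numerically. The following is a minimal sketch (mine, not from the notes), using central differences and, as a test function, the polynomial $x^2 + 3xy + y^3$ used later in these notes, whose Hessian is $\begin{pmatrix} 2 & 3 \\ 3 & 6y \end{pmatrix}$:

```python
def hessian(f, x, h=1e-5):
    """Approximate the Hessian of f: R^n -> R at x by central differences
    applied to each pair of coordinates (step 2h on the diagonal)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shift(di, dj):
                y = list(x)
                y[i] += di
                y[j] += dj
                return f(y)
            H[i][j] = (shift(h, h) - shift(h, -h)
                       - shift(-h, h) + shift(-h, -h)) / (4 * h * h)
    return H

# f(x, y) = x^2 + 3xy + y^3 has Hessian [[2, 3], [3, 6y]].
f = lambda v: v[0]**2 + 3*v[0]*v[1] + v[1]**3
H = hessian(f, [2.0, 1.0])
```

The returned matrix is symmetric up to discretization error, matching the symmetry result proved later for continuous second derivatives.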
1.3. First Partial Derivative

Suppose $f(x, y)$ is a real-valued function of two independent variables $x$ and $y$. Then the partial derivative of $f(x, y)$ with respect to $x$ is defined as

$\left(\frac{\partial f}{\partial x}\right)_y = \lim_{\Delta x \to 0} \frac{f(x+\Delta x, y) - f(x, y)}{\Delta x}.$  (1)

Similarly, the partial derivative of $f(x, y)$ with respect to $y$ is defined as

$\left(\frac{\partial f}{\partial y}\right)_x = \lim_{\Delta y \to 0} \frac{f(x, y+\Delta y) - f(x, y)}{\Delta y}.$  (2)

Example 1. If $f(x, y) = x^2 - 2y^2$, then

$\left(\frac{\partial f}{\partial x}\right)_y = f_x = \lim_{\Delta x \to 0} \frac{[(x+\Delta x)^2 - 2y^2] - (x^2 - 2y^2)}{\Delta x} = 2x.$

Similarly

$\left(\frac{\partial f}{\partial y}\right)_x = f_y = \lim_{\Delta y \to 0} \frac{[x^2 - 2(y+\Delta y)^2] - (x^2 - 2y^2)}{\Delta y} = -4y.$
1.4. Gradient of Real-Valued Functions

Let $f$ be a real-valued function defined on an open set $\Gamma$ in $R^n$, and let $x$ be in $\Gamma$. The n-dimensional vector of the partial derivatives of $f$ with respect to $x_1, x_2, \ldots, x_n$ at $x$ is called the gradient of $f$ at $x$ and is denoted by $\nabla f(x)$, that is,

$\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right).$

Example 2. Let $f(x, y) = (x+y)^2 + y^2$. Then the gradient of $f$ is

$\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}\right) = (2x+2y,\; 2x+4y).$
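As a quick numerical check of Example 2 (this sketch is mine, not part of the notes), a central-difference gradient agrees with the analytic gradient $(2x+2y,\; 2x+4y)$:

```python
def grad(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f: R^n -> R at x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

# Example 2: f(x, y) = (x + y)^2 + y^2, gradient (2x + 2y, 2x + 4y).
f = lambda v: (v[0] + v[1])**2 + v[1]**2
g = grad(f, [1.0, 2.0])   # analytic gradient at (1, 2) is (6, 10)
```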
1.5. Function of a Function

It is a well-known property of functions of one independent variable that if $f$ is a function of a variable $u$, and $u$ is a function of a variable $x$, then

$\frac{df}{dx} = \frac{df}{du}\,\frac{du}{dx}.$  (3)

This result may be immediately extended to the case when $f$ is a function of two or more independent variables. Suppose $f = f(u)$ and $u = u(x, y)$. Then, by the definition of a partial derivative,

$\left(\frac{\partial f}{\partial x}\right)_y = \frac{df}{du}\left(\frac{\partial u}{\partial x}\right)_y,$  (4)

$\left(\frac{\partial f}{\partial y}\right)_x = \frac{df}{du}\left(\frac{\partial u}{\partial y}\right)_x.$  (5)

Example 3. If

$f(x, y) = \tan^{-1}\frac{y}{x},$

then putting $u = y/x$ we have

$\left(\frac{\partial f}{\partial x}\right)_y = \frac{d}{du}(\tan^{-1} u)\left(\frac{\partial u}{\partial x}\right)_y = \frac{1}{1+u^2}\left(-\frac{y}{x^2}\right) = -\frac{y}{x^2+y^2},$

$\left(\frac{\partial f}{\partial y}\right)_x = \frac{d}{du}(\tan^{-1} u)\left(\frac{\partial u}{\partial y}\right)_x = \frac{1}{1+u^2}\cdot\frac{1}{x} = \frac{x}{x^2+y^2}.$
1.6. Higher Partial Derivatives

Provided the first partial derivatives of a function are differentiable, we may differentiate them partially to obtain the second partial derivatives. The four second partial derivatives of $f(x, y)$ are therefore

$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = f_{xx},$  (6)

$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial y}\right) = f_{yy},$  (7)

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = f_{yx},$  (8)

and

$\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right) = f_{xy}.$  (9)

Higher partial derivatives than the second may be obtained in a similar way.
Example 4. If $f(x, y) = \tan^{-1}\dfrac{y}{x}$, then

$\frac{\partial f}{\partial x} = -\frac{y}{x^2+y^2}, \qquad \frac{\partial f}{\partial y} = \frac{x}{x^2+y^2}.$

Hence, differentiating these first derivatives partially, we obtain

$\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(-\frac{y}{x^2+y^2}\right) = \frac{2xy}{(x^2+y^2)^2}$

and

$\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{x}{x^2+y^2}\right) = -\frac{2xy}{(x^2+y^2)^2};$

also

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{x}{x^2+y^2}\right) = \frac{(x^2+y^2) - 2x^2}{(x^2+y^2)^2} = \frac{y^2-x^2}{(x^2+y^2)^2}$

and

$\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(-\frac{y}{x^2+y^2}\right) = \frac{y^2-x^2}{(x^2+y^2)^2}.$

We see that

$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x},$  (10)

which shows that the operators $\dfrac{\partial}{\partial x}$ and $\dfrac{\partial}{\partial y}$ are commutative. We also see that

$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0.$  (11)

The above equation is called the Laplace equation in two variables. In general, any function satisfying this equation is called a harmonic function.
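Both facts in Example 4, the equality of the mixed partials (10) and the Laplace equation (11), can be checked numerically. A small sketch (mine, not from the notes) using central second differences:

```python
import math

def d2(f, x, y, h=1e-4):
    """Second partials of f(x, y) at (x, y) by central differences."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fyy, fxy

f = lambda x, y: math.atan(y / x)      # Example 4, on the half-plane x > 0
fxx, fyy, fxy = d2(f, 1.0, 2.0)
# Analytically: fxx = 2xy/(x^2+y^2)^2 = 0.16, fxy = (y^2-x^2)/(x^2+y^2)^2 = 0.12,
# and fxx + fyy = 0 at (1, 2).
```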
1.7. Total Derivatives

Suppose $f(x, y)$ is a continuous function defined in a region R of the xy-plane, and that both $\left(\frac{\partial f}{\partial x}\right)_y$ and $\left(\frac{\partial f}{\partial y}\right)_x$ are continuous in this region. We now consider the change in the value of the function brought about by allowing small changes in $x$ and $y$.

If $\delta f$ is the change in $f(x, y)$ due to changes $\delta x$ and $\delta y$ in $x$ and $y$, then

$\delta f = f(x+\delta x, y+\delta y) - f(x, y)$  (12)

$\phantom{\delta f} = [f(x+\delta x, y+\delta y) - f(x, y+\delta y)] + [f(x, y+\delta y) - f(x, y)].$  (13)

Now, by definitions (1) and (2),

$\frac{\partial f(x, y+\delta y)}{\partial x} = \lim_{\delta x \to 0} \frac{f(x+\delta x, y+\delta y) - f(x, y+\delta y)}{\delta x}$  (14)

and

$\frac{\partial f(x, y)}{\partial y} = \lim_{\delta y \to 0} \frac{f(x, y+\delta y) - f(x, y)}{\delta y}.$  (15)

Consequently,

$f(x+\delta x, y+\delta y) - f(x, y+\delta y) = \left[\frac{\partial f(x, y+\delta y)}{\partial x} + \epsilon_1\right]\delta x$  (16)

and

$f(x, y+\delta y) - f(x, y) = \left[\frac{\partial f(x, y)}{\partial y} + \epsilon_2\right]\delta y,$  (17)

where $\epsilon_1$ and $\epsilon_2$ satisfy the conditions

$\lim_{\delta x \to 0} \epsilon_1 = 0 \quad\text{and}\quad \lim_{\delta y \to 0} \epsilon_2 = 0.$  (18)

Using (16) and (17) in (13) we now find

$\delta f = \left[\frac{\partial f(x, y+\delta y)}{\partial x} + \epsilon_1\right]\delta x + \left[\frac{\partial f(x, y)}{\partial y} + \epsilon_2\right]\delta y.$  (19)

Furthermore, since all first derivatives are continuous by assumption, the first term of (19) may be written as

$\frac{\partial f(x, y+\delta y)}{\partial x} = \frac{\partial f(x, y)}{\partial x} + \eta,$  (20)

where $\eta$ satisfies the condition

$\lim_{\delta y \to 0} \eta = 0.$  (21)

Hence, using (20), (19) now becomes

$\delta f = \frac{\partial f(x, y)}{\partial x}\,\delta x + \frac{\partial f(x, y)}{\partial y}\,\delta y + (\epsilon_1 + \eta)\,\delta x + \epsilon_2\,\delta y.$  (22)

The expression

$df = \frac{\partial f(x, y)}{\partial x}\,\delta x + \frac{\partial f(x, y)}{\partial y}\,\delta y,$  (23)

obtained by neglecting the small terms $(\epsilon_1 + \eta)\,\delta x$ and $\epsilon_2\,\delta y$ in (22), represents, to the first order in $\delta x$ and $\delta y$, the change $\delta f$ in $f(x, y)$ due to changes $\delta x$ and $\delta y$ in $x$ and $y$ respectively; it is called the total differential of $f$.

In the case of a function of n independent variables $f(x_1, x_2, \ldots, x_n)$ we have

$df = \frac{\partial f}{\partial x_1}\,\delta x_1 + \frac{\partial f}{\partial x_2}\,\delta x_2 + \cdots + \frac{\partial f}{\partial x_n}\,\delta x_n = \sum_{r=1}^{n} \frac{\partial f}{\partial x_r}\,\delta x_r.$  (24)
Example 5. To find the change in

$f(x, y) = x\,e^{xy}$

when the values of $x$ and $y$ are slightly changed from 1 and 0 to $1+\delta x$ and $\delta y$ respectively, we first use (23) to obtain

$\delta f \approx (e^{xy} + xy\,e^{xy})\,\delta x + x^2 e^{xy}\,\delta y.$

Hence, putting $x = 1$, $y = 0$ in the above expression, we have

$\delta f \approx \delta x + \delta y.$

For example, if $\delta x = 0.10$ and $\delta y = 0.05$, then $\delta f \approx 0.15$.

We now return to the exact expression for $\delta f$ given in (22). Suppose $u = f(x, y)$ and that both $x$ and $y$ are differentiable functions of a variable $t$, so that

$x = x(t), \quad y = y(t)$  (25)
and

$u = u(t).$  (26)

Hence, dividing (22) by $\delta t$ and proceeding to the limit $\delta t \to 0$ (which implies $\delta x \to 0$, $\delta y \to 0$ and consequently $\epsilon_1, \epsilon_2, \eta \to 0$), we have

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}.$  (27)

This expression is called the total derivative of $u(t)$ with respect to $t$.

It is easily seen that if

$u = f(x_1, x_2, \ldots, x_n),$  (28)

where $x_1, x_2, \ldots, x_n$ are all differentiable functions of a variable $t$, then $u = u(t)$ and

$\frac{du}{dt} = \frac{\partial f}{\partial x_1}\,\frac{dx_1}{dt} + \frac{\partial f}{\partial x_2}\,\frac{dx_2}{dt} + \cdots + \frac{\partial f}{\partial x_n}\,\frac{dx_n}{dt} = \sum_{r=1}^{n} \frac{\partial f}{\partial x_r}\,\frac{dx_r}{dt}.$  (29)

Example 6. Suppose $u = f(x, y) = x^2 + y^2$ and $x = \sinh t$, $y = t^2$. Then

$\frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 2y, \qquad \frac{dx}{dt} = \cosh t, \quad \frac{dy}{dt} = 2t.$

Hence

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt} = 2x\cosh t + 4ty = 2\sinh t\cosh t + 4t^3.$
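Example 6 lends itself to a direct numerical check (this sketch is mine, not part of the notes): substitute $x(t)$ and $y(t)$ into $u$, and compare a difference quotient of $u(t)$ with the total-derivative formula.

```python
import math

# Example 6: u = x^2 + y^2 with x = sinh t, y = t^2, so
# du/dt = 2 sinh t cosh t + 4 t^3.  Compare with a difference quotient.
def u(t):
    return math.sinh(t)**2 + (t**2)**2

t, h = 0.7, 1e-6
numeric = (u(t + h) - u(t - h)) / (2 * h)
analytic = 2 * math.sinh(t) * math.cosh(t) + 4 * t**3
```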
1.8. Implicit Differentiation

A special case of the total derivative (27) arises when $y$ is itself a function of $x$ (i.e. $t = x$). Consequently $u$ is a function of $x$ only and

$\frac{du}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\,\frac{dy}{dx}.$  (30)

Example 7. Suppose

$u = f(x, y) = \tan^{-1}\frac{x}{y}$

and $y = \sin x$. Then by (30), we have

$\frac{du}{dx} = \frac{y}{x^2+y^2} - \frac{x}{x^2+y^2}\cos x = \frac{\sin x - x\cos x}{x^2 + \sin^2 x}.$

When $y$ is defined as a function of $x$ by the equation

$u = f(x, y) = 0,$  (31)

$y$ is called an implicit function of $x$. Since $u$ is identically zero, its total derivative must vanish, and consequently from (30)

$\frac{dy}{dx} = -\frac{\partial f/\partial x}{\partial f/\partial y}.$  (32)

Example 8. Suppose

$f(x, y) = ax^2 + 2hxy + by^2 + 2gx + 2fy + c = 0$

(where $a$, $h$, $b$, $g$, $f$ and $c$ are constants). Then

$\frac{dy}{dx} = -\frac{2ax + 2hy + 2g}{2hx + 2by + 2f} = -\frac{ax + hy + g}{hx + by + f}.$
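Formula (32) is easy to sanity-check on a special case of Example 8 (a sketch of mine, not from the notes): for the circle $x^2 + y^2 - 25 = 0$ (that is $a = b = 1$ and the other constants zero except $c = -25$), (32) gives $dy/dx = -x/y$, which can be compared against differentiating the explicit branch $y = \sqrt{25 - x^2}$.

```python
import math

# Circle x^2 + y^2 - 25 = 0: formula (32) gives dy/dx = -x/y.
x0 = 3.0
y0 = math.sqrt(25 - x0**2)           # y0 = 4 on the upper branch
slope_formula = -x0 / y0             # dy/dx from (32): -0.75
h = 1e-6
slope_numeric = (math.sqrt(25 - (x0 + h)**2)
                 - math.sqrt(25 - (x0 - h)**2)) / (2 * h)
```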
1.9. Higher Total Derivatives

We have already seen that if $u = f(x, y)$ and $x$ and $y$ are differentiable functions of $t$, then

$\frac{du}{dt} = \frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}.$  (33)

To find $\dfrac{d^2u}{dt^2}$ we note from (33) that the operator $\dfrac{d}{dt}$ can be written as

$\frac{d}{dt} \equiv \frac{dx}{dt}\,\frac{\partial}{\partial x} + \frac{dy}{dt}\,\frac{\partial}{\partial y}.$  (34)

Hence

$\frac{d^2u}{dt^2} = \frac{d}{dt}\left(\frac{du}{dt}\right) = \left(\frac{dx}{dt}\,\frac{\partial}{\partial x} + \frac{dy}{dt}\,\frac{\partial}{\partial y}\right)\left(\frac{\partial f}{\partial x}\,\frac{dx}{dt} + \frac{\partial f}{\partial y}\,\frac{dy}{dt}\right)$

$= \frac{\partial^2 f}{\partial x^2}\left(\frac{dx}{dt}\right)^2 + 2\,\frac{\partial^2 f}{\partial x \partial y}\,\frac{dx}{dt}\,\frac{dy}{dt} + \frac{\partial^2 f}{\partial y^2}\left(\frac{dy}{dt}\right)^2 + \frac{\partial f}{\partial x}\,\frac{d^2x}{dt^2} + \frac{\partial f}{\partial y}\,\frac{d^2y}{dt^2},$  (35)

where we have assumed that $f_{xy} = f_{yx}$. Higher total derivatives may be obtained in a similar way.

A special case of (35) is when

$\frac{dx}{dt} = h, \quad \frac{dy}{dt} = k,$

where $h$ and $k$ are constants. We then have

$\frac{d^2u}{dt^2} = h^2\,\frac{\partial^2 f}{\partial x^2} + 2hk\,\frac{\partial^2 f}{\partial x \partial y} + k^2\,\frac{\partial^2 f}{\partial y^2},$

which, if we define the differential operator $D^*$ by

$D^* = h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y},$

may be written symbolically as

$\frac{d^2u}{dt^2} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^2 f = D^{*2} f.$

Similarly we find

$\frac{d^3u}{dt^3} = h^3\,\frac{\partial^3 f}{\partial x^3} + 3h^2k\,\frac{\partial^3 f}{\partial x^2 \partial y} + 3hk^2\,\frac{\partial^3 f}{\partial x \partial y^2} + k^3\,\frac{\partial^3 f}{\partial y^3} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^3 f = D^{*3} f,$

assuming the commutative properties of partial differentiation. In general,

$\frac{d^nu}{dt^n} = \left(h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}\right)^n f = D^{*n} f,$

where the operator $\left(h\,\dfrac{\partial}{\partial x} + k\,\dfrac{\partial}{\partial y}\right)^n$ is to be expanded by means of the binomial theorem.
1.10. Taylor's Theorem for Functions of Two Independent Variables

Theorem 3 (Taylor's Theorem). If $f(x, y)$ is defined in a region R of the xy-plane and all its partial derivatives of orders up to and including the (n+1)th are continuous in R, then for any point $(a, b)$ in this region

$f(a+h, b+k) = f(a, b) + D^*f(a, b) + \frac{1}{2!}\,D^{*2}f(a, b) + \cdots + \frac{1}{n!}\,D^{*n}f(a, b) + E_n,$

where $D^*$ is the differential operator defined by

$D^* = h\,\frac{\partial}{\partial x} + k\,\frac{\partial}{\partial y}$

and $D^{*r}f(a, b)$ means $\left(h\,\dfrac{\partial}{\partial x} + k\,\dfrac{\partial}{\partial y}\right)^r f(x, y)$ evaluated at the point $(a, b)$. The Lagrange error term $E_n$ is given by

$E_n = \frac{1}{(n+1)!}\,D^{*(n+1)} f(a+\theta h, b+\theta k),$

where $0 < \theta < 1$.
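The first-order case of Theorem 3, $f(a+h, b+k) \approx f(a, b) + D^*f(a, b)$, can be tested numerically. A sketch of mine (not from the notes), reusing $f(x, y) = x\,e^{xy}$ from Example 5 at $(a, b) = (1, 0)$:

```python
import math

# First-order Taylor approximation f(a+h, b+k) ≈ f(a,b) + h f_x(a,b) + k f_y(a,b)
# for f(x, y) = x * exp(x*y) near (a, b) = (1, 0), as in Example 5.
f = lambda x, y: x * math.exp(x * y)
a, b = 1.0, 0.0
fx = math.exp(a * b) + a * b * math.exp(a * b)   # e^{xy} + xy e^{xy} = 1
fy = a**2 * math.exp(a * b)                      # x^2 e^{xy} = 1
h, k = 0.10, 0.05
approx = f(a, b) + h * fx + k * fy               # 1 + 0.10 + 0.05 = 1.15
exact = f(a + h, b + k)
```

The gap between `exact` and `approx` is the higher-order remainder $E_1$, which is small for these step sizes.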
2. Derivatives by Linearization Technique

2.1. Linear Part of a Function

Consider the function $f(x, y) = x^2 + 3xy + y^3$, and try to approximate $f(x, y)$ near the point $(x, y) = (2, 1)$ by a simpler function. To do this, set $x = 2 + \xi$ and $y = 1 + \eta$. Then

$f(x, y) = f(2+\xi, 1+\eta)$

$= (4 + 4\xi + \xi^2) + 3(2 + 2\eta + \xi + \xi\eta) + (1 + 3\eta + 3\eta^2 + \eta^3)$

$= 11 + (7\xi + 9\eta) + (\xi^2 + 3\xi\eta + 3\eta^2 + \eta^3).$

Here $11 = f(2, 1)$; $7\xi + 9\eta$ is a linear function of the variables $\xi$ and $\eta$; and $v = \xi^2 + 3\xi\eta + 3\eta^2 + \eta^3$ is small, compared to the linear function $7\xi + 9\eta$, if $\xi$ and $\eta$ are both small enough. This means that, if $\xi$ and $\eta$ are both small enough, then

$f(x, y) \approx 11 + (7\xi + 9\eta)$

is a good approximation, the terms omitted being small in comparison to the terms retained.

To make the idea of "small enough" precise, denote by $d$ the distance from $(2, 1)$ to $(2+\xi, 1+\eta)$; then $d = [\xi^2 + \eta^2]^{1/2}$. Then $|\xi| \le d$ and $|\eta| \le d$, so that, as $d \to 0$,

$|\xi^2/d| \le d^2/d = d \to 0; \qquad |3\xi\eta/d| \le 3d^2/d = 3d \to 0;$

and similarly for the remaining terms in $v$. This motivates the following definition.

Definition 1. The function $f$ of two variables $x, y$ is differentiable at the point $(a, b)$ if

$f(x, y) = f(a, b) + p(x-a) + q(y-b) + \epsilon d,$

where $p$ and $q$ are constants, and $\epsilon \to 0$ as $d = [(x-a)^2 + (y-b)^2]^{1/2} \to 0$. The linear function $p(x-a) + q(y-b)$ will be called the linear part of $f$ at $(a, b)$. (Some books call it the differential.)

The numbers $p$ and $q$ can be directly calculated. If $y$ is fixed at the value $b$, then $d = |x - a|$, and

$\frac{f(x, b) - f(a, b)}{x - a} = \frac{p(x-a) + \epsilon d}{x - a} \to p + 0$
as $d = |x-a| \to 0$, thus as $x \to a$. Thus $p$ equals the partial derivative of $f$ with respect to $x$ at $a$, holding $y$ fixed at $b$. This will be written

$p = \frac{\partial f}{\partial x}(a, b) = f_x(a, b) = D_x f(a, b),$

or as $\dfrac{\partial f}{\partial x}$ or $f_x$ or $D_x f$ for short. Similarly $q$ equals $\dfrac{\partial f}{\partial y}(a, b) = f_y(a, b)$, the partial derivative of $f$ with respect to $y$ at $b$, holding $x$ fixed at $a$.

Thus, in the above example, $f_x = 2x + 3y$, and so, at $(x, y) = (2, 1)$, $f_x = 2(2) + 3(1) = 7 = p$. Similarly, $f_y = 3x + 3y^2 = 9 = q$ (at $(2, 1)$).
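The quality of the linear part as an approximation can be seen directly (a sketch of mine, not part of the notes): perturb $(2, 1)$ by small $\xi$, $\eta$ and compare $f$ against $11 + 7\xi + 9\eta$.

```python
# Linear part of f(x, y) = x^2 + 3xy + y^3 at (2, 1): 7*(x-2) + 9*(y-1),
# so f(2+xi, 1+eta) ≈ 11 + 7*xi + 9*eta for small xi, eta.
f = lambda x, y: x**2 + 3*x*y + y**3
xi, eta = 0.01, -0.02
approx = 11 + 7*xi + 9*eta
exact = f(2 + xi, 1 + eta)
# The gap equals v = xi^2 + 3*xi*eta + 3*eta^2 + eta^3, which is O(d^2).
```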
Example 1. Calculate the linear part of $f$ at $\left(\frac{\pi}{4}, 0\right)$, where $f(x, y) = \cos(2x + 3y)$.

Solution. $f(x, y) = \cos(2x + 3y)$,

$f_x(x, y) = -2\sin(2x+3y) = -2 \;\text{ at } \left(\tfrac{\pi}{4}, 0\right),$

$f_y(x, y) = -3\sin(2x+3y) = -3 \;\text{ at } \left(\tfrac{\pi}{4}, 0\right).$

Hence the linear part of $f$ at $\left(\frac{\pi}{4}, 0\right)$ is

$-2\left(x - \frac{\pi}{4}\right) - 3(y - 0).$
Example 2. Calculate the linear part of $g$ at $\left(0, \frac{\sqrt{\pi}}{2}\right)$, where $g(x, y) = \sin(x^2 + y^2)$.

Solution. $g_x\left(0, \tfrac{\sqrt{\pi}}{2}\right) = 0$, $g_y\left(0, \tfrac{\sqrt{\pi}}{2}\right) = \sqrt{\pi}\cos\left(\tfrac{\pi}{4}\right)$.

Hence the linear part of $g$ at $\left(0, \frac{\sqrt{\pi}}{2}\right)$ is

$0\cdot(x - 0) + \sqrt{\pi}\cos\left(\frac{\pi}{4}\right)\left(y - \frac{\sqrt{\pi}}{2}\right) = \sqrt{\pi}\cos\left(\frac{\pi}{4}\right)\left(y - \frac{\sqrt{\pi}}{2}\right).$
2.2. Vector Viewpoint

It will be convenient to regard $f$ as a function of a vector variable $w$ whose components are $x$ and $y$. Denote $w = \begin{pmatrix} x \\ y \end{pmatrix}$, as a column vector, in matrix language; denote also $c = \begin{pmatrix} a \\ b \end{pmatrix}$. Define also the row vector

$f'(c) = (p, q) = (f_x(a, b),\; f_y(a, b)).$

(The notation $f'(c)$ suggests a derivative; we have here a sort of vector derivative.) Then, since $f$ is differentiable,

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w),$

where the product

$f'(c)(w - c) = (p, q)\begin{pmatrix} x-a \\ y-b \end{pmatrix} = p(x-a) + q(y-b)$

is the usual matrix product. Now let $\|w - c\|$ denote the length of the vector $w - c$; then the previous $d = \|w - c\|$, and so $\epsilon(w)/\|w - c\| \to 0$.

Example 3. For $f(x, y) = x^2 + 3xy + y^3$, $a = 2$, $b = 1$, set $c = \begin{pmatrix} 2 \\ 1 \end{pmatrix}$; then

$f'(c) = (7, 9).$
2.3. Directional Derivative

Suppose that the point $(x, y)$ moves along a straight line through the point $(a, b)$; thus $x - a = \lambda l$, $y - b = \lambda m$, where $\lambda$ is the distance from $(a, b)$ to $(x, y)$ and $l$ and $m$ are constants specifying the direction of the line (with $l^2 + m^2 = 1$). In vector language, let

$w = \begin{pmatrix} x \\ y \end{pmatrix}, \quad c = \begin{pmatrix} a \\ b \end{pmatrix}, \quad t = \begin{pmatrix} l \\ m \end{pmatrix};$

then $\|w - c\| = \lambda$, and the line has equation $w = c + \lambda t$.

The rate of increase of $f(w)$, measured at $w = c$, as $w$ moves along the line $w = c + \lambda t$, is called the directional derivative of $f$ at $c$ in the direction $t$. This may be calculated, assuming $f$ differentiable, as follows:

$\frac{f(c + \lambda t) - f(c)}{\lambda} = \frac{f'(c)(\lambda t) + \epsilon(c + \lambda t)}{\lambda} = f'(c)\,t + \frac{\epsilon(c + \lambda t)}{\lambda}.$

The required directional derivative is the limit of this ratio as $\lambda \to 0$, namely

$f'(c)\,t = (p, q)\begin{pmatrix} l \\ m \end{pmatrix} = pl + qm$

(since $\epsilon(c+\lambda t)/\lambda \to 0$ as $\lambda \to 0$). Note that $t$ is here a unit vector.

Example 4(a). Let $f(x, y) = x^2 + 3xy + y^3$. The directional derivative of $f$ at $\begin{pmatrix} 2 \\ 1 \end{pmatrix}$ in the direction $\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix}$ is

$f'\begin{pmatrix} 2 \\ 1 \end{pmatrix}\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} = (7, 9)\begin{pmatrix} \cos\theta \\ \sin\theta \end{pmatrix} = 7\cos\theta + 9\sin\theta.$

(b) Find the directional derivative of $f(x, y, z) = 2x^2 + xy + yz^2$ at $(1, -1, 2)$ in the direction of the vector $A = (1, -2, 2)$.

Solution. $\nabla f(x, y, z) = (4x + y,\; x + z^2,\; 2yz)$, so $\nabla f(1, -1, 2) = (3, 5, -4)$. The unit vector in the direction of $A = (1, -2, 2)$ is $\left(\frac{1}{3}, -\frac{2}{3}, \frac{2}{3}\right)$. Therefore the directional derivative at $(1, -1, 2)$ in the direction of $A$ is

$(3, 5, -4)\begin{pmatrix} 1/3 \\ -2/3 \\ 2/3 \end{pmatrix} = 1 - \frac{10}{3} - \frac{8}{3} = -5.$
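A directional derivative can also be estimated straight from its definition as a one-dimensional difference quotient along the unit direction. A sketch of mine (not part of the notes) for a three-variable polynomial of the form used in Example 4(b):

```python
# Directional derivative of f(x, y, z) = 2x^2 + xy + yz^2 at (1, -1, 2)
# along A = (1, -2, 2), via a central difference along the unit vector.
f = lambda x, y, z: 2*x**2 + x*y + y*z**2

def directional(f, p, a, h=1e-6):
    """(f(p + h*u) - f(p - h*u)) / (2h) with u = a / |a|."""
    norm = sum(ai * ai for ai in a) ** 0.5
    u = [ai / norm for ai in a]
    fp = f(*(pi + h*ui for pi, ui in zip(p, u)))
    fm = f(*(pi - h*ui for pi, ui in zip(p, u)))
    return (fp - fm) / (2 * h)

d = directional(f, (1.0, -1.0, 2.0), (1.0, -2.0, 2.0))
# grad f = (4x + y, x + z^2, 2yz) = (3, 5, -4) here, so d should be near -5.
```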
2.4. Vector Functions

Let $\phi$ and $\psi$ each be differentiable real functions of the two real variables $x$ and $y$. The pair of equations

$u = \phi(x, y), \quad v = \psi(x, y)$

defines a mapping from the point $(x, y)$ to the point $(u, v)$. If, instead of considering points, we consider a vector $w$, with components $x, y$, and a vector $s$, with components $u, v$, then the two equations define a mapping from the vector $w$ to the vector $s$. This mapping is then specified by the vector function

$f = \begin{pmatrix} \phi \\ \psi \end{pmatrix}.$

Definition. The vector function $f$ is differentiable at $c$ if there is a matrix $f'(c)$ such that

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w)$  (1)

holds, with $\epsilon(w)/\|w - c\| \to 0$ as $w \to c$. The term $f'(c)(w - c)$ is called the linear part of $f$ at $c$.

Example 5. Let $\phi(x, y) = x^2 + y^2$ and $\psi(x, y) = 2xy$. These functions are differentiable at $(1, 2)$, and calculation shows that

$u = 5 + 2(x-1) + 4(y-2) + \left((x-1)^2 + (y-2)^2\right);$

$v = 4 + 4(x-1) + 2(y-2) + \left(2(x-1)(y-2)\right).$

This pair of equations combines into a single matrix equation:

$\begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \end{pmatrix} + \begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix}\begin{pmatrix} x-1 \\ y-2 \end{pmatrix} + \begin{pmatrix} (x-1)^2 + (y-2)^2 \\ 2(x-1)(y-2) \end{pmatrix}.$

In vector notation, this may be written as

$f(w) = f(c) + f'(c)(w - c) + \epsilon(w),$

where now $f'(c)$ is a $2 \times 2$ matrix. Since the components $\epsilon_1, \epsilon_2$ of $\epsilon$ satisfy $\epsilon_1(w)/\|w - c\| \to 0$ and $\epsilon_2(w)/\|w - c\| \to 0$ as $w \to c$, it follows that $\|\epsilon(w)\|/\|w - c\| \to 0$ as $w \to c$.
Example 6. For the function

$f = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix},$

the linear part at $\begin{pmatrix} 1 \\ 2 \end{pmatrix}$ is $\begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix}\begin{pmatrix} x-1 \\ y-2 \end{pmatrix}$.

Example 7. Let $f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x^2 - y^2 \\ 2xy \end{pmatrix}$. Calculate $f'\begin{pmatrix} 1 \\ 2 \end{pmatrix}$.

Here

$f'\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2x & -2y \\ 2y & 2x \end{pmatrix}; \qquad f'\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 2 & -4 \\ 4 & 2 \end{pmatrix}.$
2.5. Functions of Functions

Let the differentiable function $f$ map the vector $w$ to the vector $s$; let the differentiable function $g$ map the vector $s$ to the vector $t$. Diagrammatically,

$w \xrightarrow{\;f\;} s \xrightarrow{\;g\;} t.$

Then the composition $h = g \circ f$ of the functions $g$ and $f$ maps $w$ to $t$. Since $f$ and $g$ are differentiable,

$f(w) = f(c) + A(w - c) + \epsilon; \qquad g(f(w)) = g(f(c)) + B\,(f(w) - f(c)) + \delta;$

where $A$ and $B$ are suitable matrices, and $\epsilon$ and $\delta$ can be neglected; then approximately

$g(f(w)) - g(f(c)) \approx BA(w - c).$

The linear part of $h$ is, in fact, $BA(w - c)$. From the chain rule we have

$h'(c) = g'(f(c))\,f'(c).$

Example 8. Let

$f(x, y) = \begin{pmatrix} u \\ v \end{pmatrix} = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix}, \qquad g(u, v) = 3u + v^2, \qquad c = \begin{pmatrix} 1 \\ 2 \end{pmatrix}.$

Then

$f(c) = f\begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 1^2 + 2^2 \\ 2\cdot 1\cdot 2 \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \end{pmatrix},$

$f'(x, y) = \begin{pmatrix} 2x & 2y \\ 2y & 2x \end{pmatrix}, \qquad f'(c) = \begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix},$

$g'(u, v) = (3,\; 2v), \qquad g'(f(c)) = (3, 8).$

The chain rule then gives

$h'\begin{pmatrix} 1 \\ 2 \end{pmatrix} = (3\;\; 8)\begin{pmatrix} 2 & 4 \\ 4 & 2 \end{pmatrix} = (38\;\; 28).$

Example 9. $z = x^2 + y^2$, $x = \cos t$, $y = \sin t$. Here

$f(t) = \begin{pmatrix} \cos t \\ \sin t \end{pmatrix}, \qquad g(x, y) = x^2 + y^2.$

Taking partial derivatives,

$f'(t) = \begin{pmatrix} -\sin t \\ \cos t \end{pmatrix}; \qquad g'(x, y) = (2x\;\; 2y).$

Hence

$\frac{dz}{dt} = (2x\;\; 2y)\begin{pmatrix} -\sin t \\ \cos t \end{pmatrix} = -2\cos t\sin t + 2\sin t\cos t = 0,$

as it should.
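The matrix chain rule of Example 8 can be cross-checked against plain partial derivatives of the composed scalar function $h(x, y) = g(f(x, y))$. A sketch of mine (not from the notes):

```python
# Chain rule h'(c) = g'(f(c)) f'(c) for f(x, y) = (x^2 + y^2, 2xy)
# and g(u, v) = 3u + v^2 at c = (1, 2).
def h(x, y):
    u, v = x**2 + y**2, 2*x*y
    return 3*u + v**2

# Chain-rule value: (3, 8) @ [[2, 4], [4, 2]] = (38, 28).
hx_chain, hy_chain = 3*2 + 8*4, 3*4 + 8*2

eps = 1e-6
hx_num = (h(1 + eps, 2) - h(1 - eps, 2)) / (2 * eps)
hy_num = (h(1, 2 + eps) - h(1, 2 - eps)) / (2 * eps)
```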
3. Gateaux and Frechet Derivatives

Nonlinear operators can be investigated by establishing a connection between them and linear operators, more precisely, by the technique of local approximation to the nonlinear operator by a linear one. The differential calculus for nonlinear operators is needed for this purpose. Differentiation is a technique that helps us to approximate a nonlinear operator locally. In Banach spaces there are different types of derivatives. Among them, the Gateaux and Frechet derivatives are very important for applications. One of them is more general than the other, but in some special circumstances they are equivalent.

3.1. Gateaux Derivative

The Gateaux derivative is the generalization of the directional derivative.

Definition. Let X and Y be Banach spaces and let P be an operator such that $P : X \to Y$. Then P is said to be Gateaux differentiable at $x_0 \in X$ if there exists a continuous linear operator $U : X \to Y$ (in general depending on $x_0$) such that

$\lim_{t \to 0} \frac{P(x_0 + tx) - P(x_0)}{t} = U(x)$

for every $x \in X$. The above is clearly equivalent to

$\lim_{t \to 0} \frac{1}{|t|}\,\|P(x_0 + tx) - P(x_0) - tU(x)\| = 0$  (1)

for every $x \in X$.

In the above situation, U is called the Gateaux derivative of P at $x_0$, written $U = P'(x_0)$, and its value at $x \in X$ is denoted by $P'(x_0)(x)$ or simply $P'(x_0)x$.
Theorem 1. The Gateaux derivative, if it exists, is unique.

Proof. Let U and V be two Gateaux derivatives of P at $x_0$. From the relation, for $t \ne 0$,

$V(x) - U(x) = \frac{1}{t}\,[P(x_0 + tx) - P(x_0) - tU(x)] - \frac{1}{t}\,[P(x_0 + tx) - P(x_0) - tV(x)],$

we obtain for $t > 0$

$\|V(x) - U(x)\| \le \frac{1}{t}\,\|P(x_0 + tx) - P(x_0) - tU(x)\| + \frac{1}{t}\,\|P(x_0 + tx) - P(x_0) - tV(x)\|,$

and as $t \to 0$, because of (1), both the expressions on the right-hand side tend to zero. Since this is true for each $x \in X$, we see that $U = V$ and the theorem is proved.
We now suppose that X and Y are finite dimensional, say $X = R^n$ and $Y = R^m$. Let us analyse the representation of the Gateaux derivative of an operator from X into Y. We know that if U is a linear operator from X into Y then U is given by the matrix

$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \cdots & \cdots & \cdots & \cdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$

where $y = U(x)$, $x = (\xi_1, \xi_2, \ldots, \xi_n) \in X$, $y = (\eta_1, \eta_2, \ldots, \eta_m) \in Y$ and

$\eta_i = \sum_{k=1}^{n} a_{ik}\,\xi_k, \qquad i = 1, 2, \ldots, m.$  (1)

Let P be an operator mapping an open subset G of X into Y, and let $x = (\xi_1, \xi_2, \ldots, \xi_n) \in G$, $y = (\eta_1, \eta_2, \ldots, \eta_m) \in Y$ and $y = P(x)$. Then we see that there exist numerical functions $\varphi_1, \varphi_2, \ldots, \varphi_m$ such that

$\eta_i = \varphi_i(\xi_1, \xi_2, \ldots, \xi_n), \qquad i = 1, 2, \ldots, m.$  (2)

Suppose that the Gateaux derivative of P exists at $x_0 = (\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)})$ and $P'(x_0) = U$. Let U be given by the above matrix. If the equation

$\lim_{t \to 0} \frac{P(x_0 + tx) - P(x_0)}{t} = U(x)$

is written in full, then with the help of the relations (1) and (2) we obtain the m relations

$\lim_{t \to 0} \frac{\varphi_i(\xi_1^{(0)} + t\xi_1, \ldots, \xi_n^{(0)} + t\xi_n) - \varphi_i(\xi_1^{(0)}, \ldots, \xi_n^{(0)})}{t} = \sum_{k=1}^{n} a_{ik}\,\xi_k, \qquad i = 1, 2, \ldots, m.$  (3)

The relation (3) holds for all $x = (\xi_1, \xi_2, \ldots, \xi_n) \in X$, and therefore, taking in turn an x whose coordinates are all zero except one which is equal to unity, we see that the functions $\varphi_1, \varphi_2, \ldots, \varphi_m$ have partial derivatives with respect to $\xi_1, \xi_2, \ldots, \xi_n$ and

$\frac{\partial \varphi_i}{\partial \xi_k}(\xi_1^{(0)}, \xi_2^{(0)}, \ldots, \xi_n^{(0)}) = a_{ik},$

where $i = 1, 2, \ldots, m$ and $k = 1, 2, \ldots, n$. The derivative $P'(x_0) = U$ is therefore given by the matrix of partial derivatives of the functions $\varphi_1, \varphi_2, \ldots, \varphi_m$:

$P'(x_0) = \begin{pmatrix} \partial\varphi_1/\partial\xi_1 & \partial\varphi_1/\partial\xi_2 & \cdots & \partial\varphi_1/\partial\xi_n \\ \partial\varphi_2/\partial\xi_1 & \partial\varphi_2/\partial\xi_2 & \cdots & \partial\varphi_2/\partial\xi_n \\ \cdots & \cdots & \cdots & \cdots \\ \partial\varphi_m/\partial\xi_1 & \partial\varphi_m/\partial\xi_2 & \cdots & \partial\varphi_m/\partial\xi_n \end{pmatrix},$

which is known as the Jacobian matrix and is denoted by $J(x_0)$.
Example 1. In this example, we show that the existence of the partial derivatives of $\varphi_1, \ldots, \varphi_m$ need not guarantee the existence of $P'(x_0)$. Let m = 1 and n = 2 and

$\varphi_1(\xi_1, \xi_2) = \frac{\xi_1 \xi_2}{\xi_1^2 + \xi_2^2} \;\text{ for } (\xi_1, \xi_2) \ne (0, 0), \qquad \varphi_1(0, 0) = 0, \qquad x_0 = (0, 0).$

Then

$\frac{\partial \varphi_1}{\partial \xi_1}(0, 0) = \lim_{h \to 0} \frac{\varphi_1(h, 0) - \varphi_1(0, 0)}{h} = \lim_{h \to 0} \frac{h\cdot 0/(h^2 + 0)}{h} = 0,$

and similarly $\dfrac{\partial \varphi_1}{\partial \xi_2}(0, 0) = 0$.

Therefore, if the derivative $P'(x_0)$ were to exist, then it must be the zero operator, and then (3) should give

$\lim_{t \to 0} \frac{\varphi_1(t\xi_1, t\xi_2)}{t} = 0.$

But

$\lim_{t \to 0} \frac{\varphi_1(t\xi_1, t\xi_2)}{t} = \lim_{t \to 0} \frac{1}{t}\cdot\frac{(t\xi_1)(t\xi_2)}{(t\xi_1)^2 + (t\xi_2)^2} = \lim_{t \to 0} \frac{\xi_1\xi_2}{t\,(\xi_1^2 + \xi_2^2)},$

which does not exist.

(b) Let m = n = 2 and $P(x_1, x_2) = (x_1^3, x_2^2)$. Let $z = (z_1, z_2)$ be any point; then we see that

$P'(z) = \begin{pmatrix} 3z_1^2 & 0 \\ 0 & 2z_2 \end{pmatrix}.$
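In the finite-dimensional case, the Gateaux derivative, when it exists, is the Jacobian, so the defining limit can be approximated for small $t$ and compared with the Jacobian applied to $x$. A sketch of mine (not from the notes) for the map of Example (b):

```python
# Example (b): P(x1, x2) = (x1^3, x2^2) has Gateaux derivative (Jacobian)
# [[3*z1^2, 0], [0, 2*z2]] at z.  Approximate (P(z + t*x) - P(z)) / t.
def P(x1, x2):
    return (x1**3, x2**2)

z = (2.0, 3.0)
x = (1.0, -1.0)
t = 1e-6
diff = [(P(z[0] + t*x[0], z[1] + t*x[1])[i] - P(*z)[i]) / t for i in range(2)]
J = [[3*z[0]**2, 0.0], [0.0, 2*z[1]]]
Jx = [J[0][0]*x[0] + J[0][1]*x[1],
      J[1][0]*x[0] + J[1][1]*x[1]]          # expect (12, -6) at z = (2, 3)
```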
3.2. Frechet Derivative

The derivative of a real function of a real variable is defined by

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$  (1)

This definition cannot be used in the case of mappings defined in Banach spaces, because h is then a vector and division by a vector is meaningless. On the other hand, the division by a vector can be easily avoided by rewriting (1) in the form

$f(x+h) = f(x) + f'(x)h + \epsilon(h)\,h,$  (2)

where $\epsilon$ is a function of h such that $\epsilon(h) \to 0$ as $h \to 0$. Equivalently, we can now say that $f'(x)$ is the derivative of f at x if

$f(x+h) = f(x) + f'(x)h + \epsilon(h),$  (3)

where $\epsilon(h)/|h| \to 0$ as $h \to 0$.

The definition based on (3) can be generalized to include mappings from a Banach space into a Banach space. This leads to the concept of Frechet differentiability and the Frechet derivative.

Definition. Let f map a ball $\{x : \|x - x_0\| < r\}$ in a Banach space X into a Banach space Y. f is said to be Frechet differentiable at $x_0$ if there exists a continuous linear operator $A : X \to Y$ such that

$f(x_0 + h) = f(x_0) + Ah + \epsilon(h), \qquad \frac{\|\epsilon(h)\|}{\|h\|} \to 0 \;\text{ as } h \to 0.$
Here $D_i f(a)$ denotes $\dfrac{\partial f}{\partial x_i}(a)$. If $f \in C^2(U, R)$ then, for each $v \in R^n$ with $a + v \in U$, the linear part of

$[f'(a+v) - f'(a)]u = \sum_{i=1}^{n} [D_i f(a+v) - D_i f(a)]\,u_i$

is

$(f''(a))(u, v) = \sum_{i,j=1}^{n} D_{ij} f(a)\,u_i v_j,$

where $D_{ij} f(a)$ denotes $\dfrac{\partial^2 f}{\partial x_i \partial x_j}(a)$.

This process may be continued. If $f \in C^k(U, R)$, denote

$D_{i_1 i_2 \ldots i_k} f(a) = \frac{\partial^k f}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_k}}(a).$

Define then, for $w_1, w_2, \ldots, w_k \in R^n$,

$(f^{(k)}(a))(w_1, w_2, \ldots, w_k) = \sum_{i_1, i_2, \ldots, i_k = 1}^{n} D_{i_1 i_2 \ldots i_k} f(a)\, w_{1, i_1}\, w_{2, i_2} \cdots w_{k, i_k},$

where $w_{1, i_1}$ denotes the $i_1$ component of $w_1$. If all $w_i = w$, we abbreviate $(f^{(k)}(a))(w, w, \ldots, w)$ to $(f^{(k)}(a))(w)^k$.

The derivative $f'(a)$ is represented by a $1 \times n$ matrix whose components are $D_i f(a)$. Also $f''(a)$ is represented by an $n \times n$ matrix, M say, whose $i, j$ element is $D_{ij} f(a)$; if u and v are regarded as columns, then

$(f''(a))(u, v) = v^T M u.$

It is shown below that M is a symmetric matrix.

Example. Define $f : R^2 \to R$ by the polynomial

$f(x, y) = 3 + 7x + 4y + x^2 + 3xy + 2y^2 + x^3 + 2y^3.$

Then

$f'(0 + v) - f'(0) = [\,2v_1 + 3v_2 + 3v_1^2,\;\; 3v_1 + 4v_2 + 6v_2^2\,].$

Taking the linear part of this expression, and applying it to $u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}$,

$(f''(0))(u, v) = (2v_1 + 3v_2,\; 3v_1 + 4v_2)\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = (v_1\;\; v_2)\begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$

where the $2 \times 2$ matrix consists of second partial derivatives.

In more abstract terms, let $L(R^n, R)$ denote the vector space of continuous linear maps from $R^n$ into $R$. Then, in terms of continuous linear maps, $f'(a) \in L(R^n, R)$, $f''(a) \in L(R^n, L(R^n, R))$, and so on. Also $f''(a)$ is a bilinear map from $R^n \times R^n$ into $R$; this means that $(f''(a))(u, v)$ is linear in u for each fixed v, and also linear in v for each fixed u.

Theorem 2 (Taylor's). Let $f \in C^k(U; R)$, let $a \in U$ and let $a + x \in U$. Then

$f(a+x) = f(a) + \frac{1}{1!}\,f'(a)x + \frac{1}{2!}\,(f''(a))(x)^2 + \cdots + \frac{1}{(k-1)!}\,(f^{(k-1)}(a))(x)^{k-1} + \frac{1}{k!}\,(f^{(k)}(c))(x)^k,$

where $c = a + \theta x$ for some $\theta$ in $0 < \theta < 1$.
To establish the symmetry of the second derivative, write

$\Delta(x, y) = f(x, y) - f(0, y) - f(x, 0) + f(0, 0)$  (i)

and, for fixed y, let $\varphi(x) = f(x, y) - f(x, 0)$. Then the mean-value theorem shows that, for fixed y,

$\Delta(x, y) = \varphi(x) - \varphi(0) = x\,\varphi'(\theta x) = x\left[\frac{\partial f}{\partial x}(\theta x, y) - \frac{\partial f}{\partial x}(\theta x, 0)\right]$

for some $\theta$ in $0 < \theta < 1$.
3.4. Example of a Bilinear Operator

Assume that $c = (c_i) \in R^m$ is a real m-vector, that $A = (a_{ij})$ is a real (m, m) matrix and that $B = (b_{ijk})$ is a bilinear operator from $R^m \times R^m$ to $R^m$. Then the mapping $f : R^m \to R^m$ defined by

$f(z) = c + Az + Bz^2, \quad z \in R^m \quad (\text{where } Bz^2 = Bzz)$  (b1)

is called a quadratic operator. The equation $f(z) = 0$, that is,

$Bz^2 + Az + c = 0,$  (b2)

is called a quadratic equation in $R^m$.

As a simple but very important example of the quadratic equation (b2), we consider the algebraic eigenvalue problem

$Tx = \lambda x,$  (b3)

where $T = (t_{ij})$ is a real (n, n) matrix. We assume that the eigenvector $x = (x_i)$ has Euclidean length one:

$\sum_{i=1}^{n} x_i^2 = 1.$  (b4)

If we set $z = (x_1, x_2, \ldots, x_n, \lambda)^T$, then (b3) and (b4) can be written as a system of nonlinear equations, namely

$f(z) = \begin{pmatrix} (T - \lambda I)x \\ \frac{1}{2}\left(1 - \|x\|^2\right) \end{pmatrix} = 0.$  (b5)

It is well known that (b5) is a quadratic equation of the form (b1), where m = n + 1 and

$c = \begin{pmatrix} 0 & 0 & \cdots & \frac{1}{2} \end{pmatrix}^T \in R^m,$  (b6)

$A = \begin{pmatrix} T & 0 \\ 0 & 0 \end{pmatrix},$  (b7)

and B is the bilinear operator collecting the quadratic terms, so that

$Bz^2 = \begin{pmatrix} -\lambda x \\ -\frac{1}{2}\|x\|^2 \end{pmatrix}.$  (b8)

For the mapping (b5), we get by using (b6), (b7) and (b8) that

$f(z) = c + Az + Bz^2, \qquad f'(z) = A + 2Bz, \qquad f''(z) = 2B.$

Therefore $f'(z)$ has the matrix representation

$f'(z) = \begin{pmatrix} T - \lambda I & -x \\ -x^T & 0 \end{pmatrix}$

and $f''(z)$ is the bilinear operator defined in (b8), multiplied by the factor two.
For $n = 2$,

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad z = \begin{pmatrix} x_1 \\ x_2 \\ \lambda \end{pmatrix}, \qquad c = \begin{pmatrix} 0 \\ 0 \\ \frac{1}{2} \end{pmatrix},$

$f(z) = \begin{pmatrix} (T - \lambda I)x \\ \frac{1}{2}(1 - \|x\|^2) \end{pmatrix} = \begin{pmatrix} t_{11}x_1 + t_{12}x_2 - \lambda x_1 \\ t_{21}x_1 + t_{22}x_2 - \lambda x_2 \\ \frac{1}{2}\left(1 - x_1^2 - x_2^2\right) \end{pmatrix},$

$f'(z) = \begin{pmatrix} t_{11} - \lambda & t_{12} & -x_1 \\ t_{21} & t_{22} - \lambda & -x_2 \\ -x_1 & -x_2 & 0 \end{pmatrix} = \begin{pmatrix} T - \lambda I & -x \\ -x^T & 0 \end{pmatrix},$

and $f''(z)$ is the constant bilinear operator whose three component matrices (one for each component of f) are

$\begin{pmatrix} 0 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & -1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix};$

and for $n = 3$ the four component matrices of $f''(z)$ are, analogously, the $4 \times 4$ matrices

$\begin{pmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{pmatrix}, \qquad \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 \end{pmatrix}, \qquad \begin{pmatrix} -1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$
4. Hessian Matrix and Unconstrained Optimization

In mathematics, the Hessian matrix (or simply the Hessian) is the square matrix of second-order partial derivatives of a function; that is, it describes the local curvature of a function of many variables. The Hessian matrix was developed in the 19th century by the German mathematician Ludwig Otto Hesse and later named after him. Hesse himself had used the term "functional determinants".

Given the real-valued function $f(x_1, x_2, \ldots, x_n)$, if all second partial derivatives of f exist, then the Hessian matrix of f is the matrix

$H(f)_{ij}(x) = D_i D_j f(x),$

where $x = (x_1, x_2, \ldots, x_n)$ and $D_i$ is the differentiation operator with respect to the ith argument, and the Hessian becomes

$H(f) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}.$

$H(f(x))$ is frequently shortened to simply $H(x)$. Some mathematicians define the Hessian as the determinant of the above matrix.

Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function. That is,

$y = f(x + \Delta x) \approx f(x) + J(x)\,\Delta x + \frac{1}{2}\,\Delta x^T H(x)\,\Delta x,$

where J is the Jacobian matrix, which is a vector (the gradient) for a scalar-valued function. The full Hessian matrix can be difficult to compute in practice; in such situations, quasi-Newton algorithms have been developed that use approximations to the Hessian. The best known quasi-Newton algorithm is the BFGS algorithm.
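The Newton-type use of the Hessian mentioned above can be illustrated in a few lines (this sketch is mine, not from the notes; the test function is a convenient quadratic of my own choosing): each step solves $H(x)\,\Delta x = -\nabla f(x)$ and updates $x \leftarrow x + \Delta x$.

```python
# One Newton step on f(x, y) = (x - 1)^2 + 10*(y + 2)^2, minimizer (1, -2).
# Gradient: (2(x-1), 20(y+2)); Hessian: [[2, 0], [0, 20]] (constant).
def newton_step(x, y):
    gx, gy = 2*(x - 1), 20*(y + 2)          # gradient
    hxx, hxy, hyy = 2.0, 0.0, 20.0          # constant Hessian
    det = hxx*hyy - hxy*hxy
    dx = -( hyy*gx - hxy*gy) / det          # solve the 2x2 system H d = -g
    dy = -(-hxy*gx + hxx*gy) / det
    return x + dx, y + dy

x, y = 5.0, 5.0
x, y = newton_step(x, y)   # exact in one step, since f is quadratic
```

For non-quadratic f the step is repeated until the gradient is small; quasi-Newton methods such as BFGS replace the exact Hessian solve with an updated approximation.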
4.1. Mixed Derivatives and Symmetry of the Hessian

The mixed derivatives of f are the entries off the main diagonal in the Hessian. Assuming that they are continuous, the order of differentiation does not matter (Clairaut's theorem). For example,

$\frac{\partial}{\partial x}\left(\frac{\partial f}{\partial y}\right) = \frac{\partial}{\partial y}\left(\frac{\partial f}{\partial x}\right).$

This can also be written as $f_{xy} = f_{yx}$. In a formal statement: if the second derivatives of f are all continuous in a neighborhood D, then the Hessian of f is a symmetric matrix throughout D.

Example 1. Consider the real-valued function $f(x, y) = 5xy^3$. Then $\nabla f(x, y) = (5y^3, 15xy^2)$ and the Hessian matrix is

$H(f(x, y)) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 0 & 15y^2 \\ 15y^2 & 30xy \end{pmatrix}.$
Example 2. Let $f(x, y) = x^2 + 3xy - y^3$. Then

$\nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right) = (2x + 3y,\; 3x - 3y^2).$

The Hessian matrix is

$H(f(x, y)) = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 3 & -6y \end{pmatrix}.$

Example 3. Let a function $f : R^2 \to R^2$ be given by

$f(x, y) = \begin{pmatrix} x^2 + y^2 \\ 2xy \end{pmatrix}.$

Then

$f'(x, y) = \begin{pmatrix} 2x & 2y \\ 2y & 2x \end{pmatrix} \quad\text{and}\quad f''(x, y) = \left(\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix},\; \begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix}\right),$

which is not a Hessian matrix but a bilinear operator.
4.2. Critical Points and Discriminant

If the gradient of f is zero at some point x, then f has a critical point (or a stationary point) at x. The determinant of the Hessian at x is then called the discriminant. If this determinant is zero then x is called a degenerate critical point of f; this is also called a non-Morse critical point of f. Otherwise, it is non-degenerate; this is called a Morse critical point of f.

For a real-valued function $f(x, y)$ of two variables x and y, let $\lambda_1$ and $\lambda_2$ be the eigenvalues of the corresponding Hessian matrix of f. Then

$\lambda_1 + \lambda_2 = \operatorname{trace}(H) \quad\text{and}\quad \lambda_1 \lambda_2 = \det(H).$
Example 4. For the previous example, the critical points of f are given by

$\nabla f(x, y) = (2x + 3y,\; 3x - 3y^2) = (0, 0),$

which is satisfied at $(x, y) = (0, 0)$ (and also at $(9/4, -3/2)$). At $(0, 0)$,

$H = \begin{pmatrix} 2 & 3 \\ 3 & 0 \end{pmatrix}.$

The eigenvalues of H are

$\lambda_1, \lambda_2 = 1 + \sqrt{10},\; 1 - \sqrt{10}.$

Hence $\lambda_1 + \lambda_2 = 2 = \operatorname{trace}(H)$ and $\lambda_1 \lambda_2 = -9 = \det(H)$.
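The eigenvalue identities in Example 4 can be confirmed directly from the characteristic polynomial of the $2 \times 2$ matrix (a small sketch of mine, not part of the notes):

```python
import math

# Example 4: H = [[2, 3], [3, 0]] has eigenvalues 1 ± sqrt(10);
# their sum is trace(H) = 2 and their product is det(H) = -9.
a, b, c, d = 2.0, 3.0, 3.0, 0.0
tr, det = a + d, a*d - b*c
disc = math.sqrt(tr*tr - 4*det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
```

Since $\det(H) = -9 < 0$, the eigenvalues have opposite signs, so $(0, 0)$ is a (non-degenerate) saddle-type critical point.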
4.3. Functions of one Variable.
Definitions. Suppose )(xf is a real-valued function defined
on some interval (The interval I may be finite or infinite,
open or closed, or half-open.). A point *x in I is:(a) a global minimizer for )(xf on I if
)()( * xfxf for all x in I;(b) a stritct global minimizer for )(xf on I if
)()( * xfxf < for all x in I such that *xx ;(c) a local minimizer for )(xf if there is a positive
number such that )()( * xfxf for all x in I for
which +
7/27/2019 Learning Hessian matrix.pdf
45/100
(e) a critical point of )(xf if )( *xf exists and isequal to zero.
The Taylors formula (single variable)
Theorem1. Suppose that )(,)(,)( xfxfxf exist on the
closed interval [ ] { }bxaRxba = :, .If xx ,* are anytwo different points of ]ba, , then there exists a point zstrictly between *x andx such that
2**** )(2
)())(()()( xxzfxxxfxfxf ++= .
If *x is the critical point of )(xf then the above formula
reduces to
2** )(
2
)(0)()( xx
zfxfxf
++= ,
or
2** )(2
)()()( xx
zfxfxf
=
for all *xx .
Theorem 2. Suppose that )(,)(,)( xfxfxf are all
continuous on an interval Iand that Ix
*
is a criticalpoint of )(xf .
(a) If 0)( xf for all Ix , then *x is a globalminimizer of )(xf on I.
7/27/2019 Learning Hessian matrix.pdf
46/100
(b) If 0)( > xf for all Ix , such that *xx , then*
x is a strict global minimizer of )(xf on I.
(c) If 0)( > xf , then *x is a strict local minimizerof )(xf .
Example5. Consider 143)( 34 += xxxf .Since ),1(121212)( 223 == xxxxxf the only critical
points of )(xf are 0=x and 1=x . Also, since)23(122436)( 2 == xxxxxf we see that
0)0( =f and 12)1( =f .Therefore, 1=x is a strict local minimizer of )(xf Definition. Suppose that )(xf be a numerical function
defined on a subset D of nR . A point x in D is
(i) a global minimizer for )(xf on D if )()( xfxf forall Dx ;
(ii) a strictly global minimizer for )(xf on D if)()( xfxf < for all Dx such that xx ;
(iii) a local minimizer for )(xf if there is a positivenumbersuch that )()( xfxf for all Dx forwhich ),( xBx ;
(iv) a strictly local minimizer for )(xf if there is apositive numbersuch that )()( xfxf < for all
Dx for which ),( xBx and xx ;(v) a critical point for )(xf if the first partial derivatives
of )(xf exist at x and
7/27/2019 Learning Hessian matrix.pdf
47/100
0)( =i
xxf for ni ,,2,1 K= ,
that is0)( = xf
Example 6. Consider f(x, y) = 40 + x³(4 − x) + 3(y − 5)². Then
∇f(x, y) = (12x² − 4x³, 6(y − 5)) = (4x²(3 − x), 6(y − 5)).
∇f = 0 gives the two critical points (x, y) ∈ {(3, 5), (0, 5)}.
4.4. Taylor's Formula (several variables).

Theorem 5. Suppose that x*, x are points in Rⁿ and that f(x) is a function of n variables with continuous first and second partial derivatives on some open set containing the line segment

[x*, x] = {w ∈ Rⁿ : w = x* + t(x − x*); 0 ≤ t ≤ 1}

joining x* and x. Then there exists a z ∈ [x*, x] such that

f(x) = f(x*) + ∇f(x*)·(x − x*) + ½ (x − x*)·Hf(z)(x − x*).

Here H denotes the Hessian matrix. This formula is used to develop tests for maximizers and minimizers among the critical points of a function.
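For a quadratic function the second-order Taylor formula above is exact (the Hessian is constant, so any z works); this makes it easy to verify numerically. A minimal sketch, with a test function of our own choosing:

```python
import numpy as np

# For f(x) = x.Ax + b.x with A symmetric, grad f = 2Ax + b and Hf = 2A
# (constant), so Theorem 5 holds exactly with any z on the segment.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, -1.0])

def f(x):    return x @ A @ x + b @ x
def grad(x): return 2 * A @ x + b

H = 2 * A                                # constant Hessian

xstar = np.array([0.5, -0.2])
x = np.array([1.3, 0.7])
d = x - xstar
taylor = f(xstar) + grad(xstar) @ d + 0.5 * d @ H @ d
assert np.isclose(taylor, f(x))          # expansion is exact here
```

For non-quadratic f the same expansion holds with the Hessian evaluated at some intermediate point z, which is exactly what the mean-value form of Theorem 5 asserts.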
Theorem 6. Suppose that x* is a critical point of a function f(x) with continuous first and second partial derivatives on Rⁿ. Then:
(a) x* is a global minimizer for f(x) if
(x − x*)·Hf(z)(x − x*) ≥ 0 for all x ∈ Rⁿ and all z ∈ [x*, x];
(b) x* is a strict global minimizer for f(x) if
(x − x*)·Hf(z)(x − x*) > 0 for all x ∈ Rⁿ such that x ≠ x* and all z ∈ [x*, x];
(c) x* is a global maximizer for f(x) if
(x − x*)·Hf(z)(x − x*) ≤ 0 for all x ∈ Rⁿ and all z ∈ [x*, x];
(d) x* is a strict global maximizer for f(x) if
(x − x*)·Hf(z)(x − x*) < 0 for all x ∈ Rⁿ such that x ≠ x* and all z ∈ [x*, x].
In general, Q_A(y) is a sum of terms of the form c_ij y_i y_j, where i, j = 1, …, n and c_ij is a coefficient which may be zero; that is, every term in Q_A(y) is of second degree in the variables y₁, y₂, …, y_n. On the other hand, any function q(y₁, …, y_n) that is a sum of second-degree terms in y₁, y₂, …, y_n can be expressed as the quadratic form associated with an n×n symmetric matrix A by splitting the coefficient of y_i y_j between the (i, j) and (j, i) entries of A.
Example 7(a). The function
q(y₁, y₂, y₃) = y₁² + y₂² + 4y₃² + 2y₁y₂ + 4y₂y₃
is the sum of second-degree terms in y₁, y₂, y₃. Splitting the coefficients of yᵢyⱼ, we get

q(y₁, y₂, y₃) = y₁² + y₂² + 4y₃² + y₁y₂ + y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 2y₂y₃ + 2y₃y₂

             = (y₁ y₂ y₃) [ 1  1  0 ] (y₁)
                          [ 1  1  2 ] (y₂)
                          [ 0  2  4 ] (y₃)

where

A = [ d₁₁ d₁₂ d₁₃ ]   [ 1  1  0 ]
    [ d₂₁ d₂₂ d₂₃ ] = [ 1  1  2 ]
    [ d₃₁ d₃₂ d₃₃ ]   [ 0  2  4 ]

with
dᵢᵢ = coefficient of yᵢ²,
dᵢⱼ or dⱼᵢ = ½ (coefficient of yᵢyⱼ or yⱼyᵢ), i ≠ j.
(b) q(y₁, y₂, y₃) = y₁² + 3y₂² + y₃²

= y₁² + 3y₂² + y₃² + 0·y₁y₂ + 0·y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 0·y₂y₃ + 0·y₃y₂

= (y₁ y₂ y₃) [ 1  0  0 ] (y₁)
             [ 0  3  0 ] (y₂)
             [ 0  0  1 ] (y₃)
(c) q(y₁, y₂, y₃) = (y₁ + 2y₂)² − y₃²

= y₁² + 4y₂² − y₃² + 4y₁y₂

= y₁² + 4y₂² − y₃² + 2y₁y₂ + 2y₂y₁ + 0·y₁y₃ + 0·y₃y₁ + 0·y₂y₃ + 0·y₃y₂

= (y₁ y₂ y₃) [ 1  2  0 ] (y₁)
             [ 2  4  0 ] (y₂)
             [ 0  0 −1 ] (y₃)
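The coefficient-splitting rule can be verified mechanically: the symmetric matrix built this way always reproduces the quadratic form. A sketch, using the matrix reconstructed for Example 7(a) above:

```python
import numpy as np

# Example 7(a): q(y) = y1^2 + y2^2 + 4 y3^2 + 2 y1 y2 + 4 y2 y3.
# Each cross coefficient is split evenly between the (i,j) and (j,i) entries.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 2.0],
              [0.0, 2.0, 4.0]])

def q(y):
    y1, y2, y3 = y
    return y1**2 + y2**2 + 4*y3**2 + 2*y1*y2 + 4*y2*y3

rng = np.random.default_rng(0)
for _ in range(100):
    y = rng.standard_normal(3)
    assert np.isclose(q(y), y @ A @ y)   # q(y) = y.Ay for every y
```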
The Hessian H = Hf(z) of f(x) evaluated at a point z is an n×n symmetric matrix. For x, x* ∈ Rⁿ, the quadratic form Q_H associated with H, evaluated at x − x*, is
Q_H(x − x*) = (x − x*)·Hf(z)(x − x*).
4.6. Positive and Negative Semidefinite and Definite Matrices

Definitions. Suppose that A is an n×n symmetric matrix and that Q_A(y) = y·Ay is the quadratic form associated with A. Then A and Q_A are called:
(a) positive semidefinite if Q_A(y) = y·Ay ≥ 0 for all y ∈ Rⁿ;
(b) positive definite if Q_A(y) = y·Ay > 0 for all y ∈ Rⁿ, y ≠ 0;
(c) negative semidefinite if Q_A(y) = y·Ay ≤ 0 for all y ∈ Rⁿ;
(d) negative definite if Q_A(y) = y·Ay < 0 for all y ∈ Rⁿ, y ≠ 0;
(e) indefinite if Q_A(y) > 0 for some y ∈ Rⁿ and Q_A(y) < 0 for some other y ∈ Rⁿ.
(d) x* is a strict global maximizer for f(x) if Hf(x) is negative definite on Rⁿ.
Here are some examples.

Example 8.

(a) A symmetric matrix whose entries are all positive need not be positive definite. For example, the matrix

A = [ 1  4 ]
    [ 4  1 ]

is not positive definite. For if x = (1, −1), then

Q_A(x) = (1, −1) [ 1  4 ] ( 1)  = −3 − 3 = −6 < 0.
                 [ 4  1 ] (−1)
(c) The diagonal matrix

A = [ 1  0  0 ]
    [ 0  3  0 ]
    [ 0  0  2 ]

is positive definite because the associated quadratic form Q_A(x) is
Q_A(x) = x·Ax = x₁² + 3x₂² + 2x₃²
and so Q_A(x) > 0 unless x₁ = x₂ = x₃ = 0.

(d) A 3×3 diagonal matrix

A = [ d₁  0  0 ]
    [ 0  d₂  0 ]
    [ 0   0 d₃ ]

is
(1) positive definite if and only if dᵢ > 0 for i = 1, 2, 3;
(2) positive semidefinite if and only if dᵢ ≥ 0 for i = 1, 2, 3;
(3) negative definite if and only if dᵢ < 0 for i = 1, 2, 3;
(4) negative semidefinite if and only if dᵢ ≤ 0 for i = 1, 2, 3.
For example, if d₁ > 0, d₂ > 0 and d₃ = 0, then
Q_A(x) = d₁x₁² + d₂x₂² ≥ 0
for all x since d₁ > 0, d₂ > 0; but if x = (0, 0, 1), then Q_A(x) = 0 even though x ≠ 0, so A is positive semidefinite but not positive definite.
(e) If a 2×2 symmetric matrix

A = [ a  b ]
    [ b  c ]

is positive definite, then a > 0 and c > 0. For if x = (1, 0), then x ≠ 0 and so
0 < Q_A(x) = a·1² + 2b·1·0 + c·0² = a.
Similarly, if x = (0, 1), then 0 < Q_A(x) = c. However, (a) shows that there are 2×2 symmetric matrices with a > 0, c > 0 that are not positive definite. We can see that the size of b relative to the size of the product ac is the determining factor for positive definiteness.

The examples show that for general symmetric matrices there is little relationship between the signs of the matrix entries and the positive or negative definite features of the matrix. They also show that for diagonal matrices, these features are completely transparent.
Here are some examples of positive definite, positive semidefinite, negative definite and negative semidefinite matrices over the real field:

Example 9.

(a) A = [  2 −1  0 ]
        [ −1  2 −1 ]
        [  0 −1  2 ],   X = (x₁ x₂ x₃)ᵀ ≠ 0:

XᵀAX = x₁² + (x₁ − x₂)² + (x₂ − x₃)² + x₃² > 0.

The matrix A is PD.

(b) A = [ 25 15  5 ]
        [ 15 18  0 ]
        [  5  0 11 ],   X = (x₁ x₂ x₃)ᵀ ≠ 0:

XᵀAX = 25x₁² + 18x₂² + 11x₃² + 30x₁x₂ + 10x₁x₃ > 0.

The matrix A is PD.

(c) A = [ 1  1  1  1 ]
        [ 1  5  5  5 ]
        [ 1  5 14 14 ]
        [ 1  5 14 15 ],   X = (x₁ x₂ x₃ x₄)ᵀ ≠ 0:

XᵀAX = (x₁ + x₂ + x₃ + x₄)² + 4(x₂ + x₃ + x₄)² + 9(x₃ + x₄)² + x₄² > 0.

The matrix A is PD.

(d) A = [  2 −1  0  0 ]
        [ −1  2 −1  0 ]
        [  0 −1  2 −1 ]
        [  0  0 −1  1 ],   X = (x₁ x₂ x₃ x₄)ᵀ ≠ 0:

XᵀAX = x₁² + (x₁ − x₂)² + (x₂ − x₃)² + (x₃ − x₄)² > 0.

The matrix A is PD.
(e) q(x) = x₁² + x₂², x ∈ R³. For any nonzero x₃ ∈ R and x₁ = x₂ = 0, the vector x = (0, 0, x₃) ≠ 0 gives q(x) = x₁² + x₂² = 0. The matrix

A = [ 1  0  0 ]
    [ 0  1  0 ]
    [ 0  0  0 ]

is PSD (but not PD).

(f) q(x) = −(x₁² + 3x₂² + x₃²)

= (x₁ x₂ x₃) [ −1  0  0 ] (x₁)
             [  0 −3  0 ] (x₂)
             [  0  0 −1 ] (x₃).

The matrix A = diag(−1, −3, −1) is ND.

(g) q(x) = −((x₁ − 2x₂)² + x₃²)

= (x₁ x₂ x₃) [ −1  2  0 ] (x₁)
             [  2 −4  0 ] (x₂)
             [  0  0 −1 ] (x₃).

The matrix

A = [ −1  2  0 ]
    [  2 −4  0 ]
    [  0  0 −1 ]

is NSD.
We will develop two basic tests for positive and negative definiteness: one in terms of determinants, and another in terms of eigenvalues.
4.7. Determinant Approach

We begin by looking at functions of two variables. If A is a 2×2 symmetric matrix

A = [ a₁₁  a₁₂ ]
    [ a₁₂  a₂₂ ]

then the associated quadratic form is
Q_A(x) = x·Ax = a₁₁x₁² + 2a₁₂x₁x₂ + a₂₂x₂².
For any x ≠ 0 in R², either x = (x₁, 0) with x₁ ≠ 0 or x = (x₁, x₂) with x₂ ≠ 0. Let us analyze the sign of Q_A(x) in terms of the entries of A in each of these two cases.

Case 1. x = (x₁, 0) with x₁ ≠ 0.
In this case Q_A(x) = a₁₁x₁², so Q_A(x) > 0 if and only if a₁₁ > 0, while Q_A(x) < 0 if and only if a₁₁ < 0.

Case 2. x = (x₁, x₂) with x₂ ≠ 0.
Setting t = x₁/x₂, we have Q_A(x) = x₂²(a₁₁t² + 2a₁₂t + a₂₂) = x₂²φ(t), so Q_A(x) > 0 for all such x if and only if
φ(t) = a₁₁t² + 2a₁₂t + a₂₂ > 0 for all t ∈ R.
Note that φ′(t) = 2a₁₁t + 2a₁₂ and φ″(t) = 2a₁₁, so that t* = −a₁₂/a₁₁ is a critical point of φ(t), and this critical point is a strict minimizer if a₁₁ > 0 and a strict maximizer if a₁₁ < 0. If a₁₁ > 0 and t ∈ R, then
φ(t) ≥ φ(t*) = a₂₂ − a₁₂²/a₁₁ = (1/a₁₁)(a₁₁a₂₂ − a₁₂²) = (1/a₁₁) det [ a₁₁  a₁₂ ]
                                                                     [ a₁₂  a₂₂ ].

Thus, if a₁₁ > 0 and

det [ a₁₁  a₁₂ ] > 0,
    [ a₁₂  a₂₂ ]

then φ(t) > 0 for all t ∈ R and so Q_A(x) > 0 for all x = (x₁, x₂) with x₂ ≠ 0. On the other hand, if Q_A(x) > 0 for all such x, then φ(t) > 0 for all t ∈ R, and so a₁₁ > 0 and the discriminant of φ(t),

4a₁₂² − 4a₁₁a₂₂ = −4 det [ a₁₁  a₁₂ ],
                         [ a₁₂  a₂₂ ]

is negative; that is, a₁₁ > 0 and det A > 0. An entirely similar analysis shows that Q_A(x) < 0 for all x ≠ 0 if and only if a₁₁ < 0 and det A > 0. Combining these observations, we obtain:

Theorem 8. The 2×2 symmetric matrix A is:
(a) positive definite if and only if a₁₁ > 0 and det A > 0;
(b) negative definite if and only if a₁₁ < 0 and det A > 0.

The 2×2 case and a little imagination suggest the correct formulation of the general case.
Suppose A is an n×n symmetric matrix. Define Δ_k to be the determinant of the upper left-hand k×k submatrix of A, for 1 ≤ k ≤ n. The determinant Δ_k is called the kth principal minor of A. Thus, for

A = [ a₁₁  a₁₂  a₁₃  …  a₁ₙ ]
    [ a₁₂  a₂₂  a₂₃  …  a₂ₙ ]
    [ a₁₃  a₂₃  a₃₃  …  a₃ₙ ]
    [  ⋮    ⋮    ⋮        ⋮  ]
    [ a₁ₙ  a₂ₙ  a₃ₙ  …  aₙₙ ],

Δ₁ = a₁₁,  Δ₂ = det [ a₁₁  a₁₂ ],  …,  Δₙ = det A.
                    [ a₁₂  a₂₂ ]
The general theorem can be formulated as follows:

Theorem 9. If A is an n×n symmetric matrix and if Δ_k is the kth principal minor of A for 1 ≤ k ≤ n, then:
(a) A is positive definite if and only if Δ_k > 0 for k = 1, 2, …, n;
(b) A is negative definite if and only if (−1)ᵏ Δ_k > 0 for k = 1, 2, …, n (that is, the principal minors alternate in sign, with Δ₁ < 0).
We verify the theorem for n = 3. By Theorem 8 applied to the upper left-hand 2×2 submatrix:
(a) x·Ax > 0 for all x ≠ 0 such that x₃ = 0 if and only if Δ₁ > 0, Δ₂ > 0;
(b) x·Ax < 0 for all x ≠ 0 such that x₃ = 0 if and only if Δ₁ < 0, Δ₂ > 0.
For x with x₃ ≠ 0, put s = x₁/x₃ and t = x₂/x₃; then x·Ax = x₃² φ(s, t), where
φ(s, t) = a₁₁s² + a₂₂t² + 2a₁₂st + 2a₁₃s + 2a₂₃t + a₃₃ .
In addition, x·Ax > 0 for all such x if and only if φ(s, t) > 0 for all real numbers s, t.
If

Δ₂ = det [ a₁₁  a₁₂ ] ≠ 0,
         [ a₁₂  a₂₂ ]

the critical point (s*, t*) of φ(s, t) is unique, and this unique solution is given by Cramer's Rule as

s* = −(1/Δ₂) det [ a₁₃  a₁₂ ],   t* = −(1/Δ₂) det [ a₁₁  a₁₃ ].   (1)
                 [ a₂₃  a₂₂ ]                     [ a₁₂  a₂₃ ]
If we multiply the equation
a₁₁s* + a₁₂t* + a₁₃ = 0
by s*, and multiply the equation
a₁₂s* + a₂₂t* + a₂₃ = 0
by t*, and add the results, we obtain
a₁₁(s*)² + a₂₂(t*)² + 2a₁₂s*t* + a₁₃s* + a₂₃t* = 0.
Consequently,
φ(s*, t*) = a₁₃s* + a₂₃t* + a₃₃,
and so (1) implies that if Δ₂ ≠ 0, then

φ(s*, t*) = (1/Δ₂) det [ a₁₁  a₁₂  a₁₃ ] = (det A)/Δ₂ = Δ₃/Δ₂.   (2)
                       [ a₁₂  a₂₂  a₂₃ ]
                       [ a₁₃  a₂₃  a₃₃ ]
Since

Hφ(s, t) = [ 2a₁₁  2a₁₂ ],   det Hφ(s, t) = 4Δ₂,
           [ 2a₁₂  2a₂₂ ]

it follows from Theorem 8 and Theorem 7 that (s*, t*) is a strict global minimizer for φ(s, t) if and only if Δ₁ > 0, Δ₂ > 0. Similarly, (s*, t*) is a strict global maximizer for φ(s, t) if and only if Δ₁ < 0, Δ₂ > 0.

Now if Δ₁ > 0, Δ₂ > 0, Δ₃ > 0, then the conclusion (a) of Case 1 shows that if x ≠ 0 and x₃ = 0, then x·Ax > 0; on the other hand, the considerations in Case 2 show that if x ≠ 0, x₃ ≠ 0, s = x₁/x₃, t = x₂/x₃, then
x·Ax = x₃² φ(s, t) ≥ x₃² φ(s*, t*) = x₃² Δ₃/Δ₂ > 0.
Therefore x·Ax > 0 for all x ≠ 0 if Δ₁ > 0, Δ₂ > 0, Δ₃ > 0.
On the other hand, if x·Ax > 0 for all x ≠ 0, then the conclusion (a) of Case 1 shows that Δ₁ > 0, Δ₂ > 0. Also, if x* = (s*, t*, 1), then (2) yields
Δ₃/Δ₂ = φ(s*, t*) = x*·Ax* > 0, so Δ₃ > 0. This proves part (a) of the theorem for n = 3.
Example 9.
(a) Minimize the function
f(x₁, x₂, x₃) = x₁² + x₂² + x₃² + x₁x₂ + x₂x₃ + x₃x₁.
The critical points of f(x₁, x₂, x₃) are the solutions of the system
2x₁ + x₂ + x₃ = 0,
x₁ + 2x₂ + x₃ = 0,
x₁ + x₂ + 2x₃ = 0,
or

[ 2  1  1 ] (x₁)   (0)
[ 1  2  1 ] (x₂) = (0).
[ 1  1  2 ] (x₃)   (0)

Since

det [ 2  1  1 ] = 4 ≠ 0,
    [ 1  2  1 ]
    [ 1  1  2 ]

x₁ = 0, x₂ = 0, x₃ = 0 is the one and only solution.
The Hessian of f(x₁, x₂, x₃) is the constant matrix

Hf(x₁, x₂, x₃) = [ 2  1  1 ]
                 [ 1  2  1 ]
                 [ 1  1  2 ].

Note that Δ₁ = 2, Δ₂ = 3, Δ₃ = 4, so Hf(x₁, x₂, x₃) is positive definite everywhere on R³.
It follows from Theorem 7 that the critical point (0, 0, 0) is a strict global minimizer for f(x₁, x₂, x₃), and this is the only critical point of f(x₁, x₂, x₃).
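The minimization example above can be checked mechanically: the gradient system is linear and the minors of the constant Hessian are easy to compute. A brief sketch:

```python
import numpy as np

# f = x1^2 + x2^2 + x3^2 + x1 x2 + x2 x3 + x3 x1; grad f(x) = M x with:
M = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 1.0],
              [1.0, 1.0, 2.0]])

assert abs(np.linalg.det(M)) > 1e-12     # nonsingular: x = 0 is the only
x = np.linalg.solve(M, np.zeros(3))      # solution of grad f = 0
assert np.allclose(x, 0.0)

# Leading principal minors of the Hessian (which equals M here):
minors = [np.linalg.det(M[:k, :k]) for k in (1, 2, 3)]
print(minors)  # 2, 3, 4 -> positive definite, strict global minimizer at 0
```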
(b) Find the global minimizer of
f(x, y, z) = e^(x−y) + e^(y−x) + e^(x²) + z².
To this end, compute
∇f(x, y, z) = ( e^(x−y) − e^(y−x) + 2x e^(x²),  −e^(x−y) + e^(y−x),  2z ),
and

Hf(x, y, z) = [ e^(x−y) + e^(y−x) + (4x² + 2)e^(x²)   −(e^(x−y) + e^(y−x))   0 ]
              [ −(e^(x−y) + e^(y−x))                   e^(x−y) + e^(y−x)     0 ]
              [ 0                                      0                     2 ].

Clearly, Δ₁ > 0 for all x, y, z because all of its terms are positive. Also
Δ₂ = (e^(x−y) + e^(y−x) + (4x² + 2)e^(x²))(e^(x−y) + e^(y−x)) − (e^(x−y) + e^(y−x))²
   = (4x² + 2)e^(x²)(e^(x−y) + e^(y−x)) > 0
because both factors are always positive. Finally, Δ₃ = 2Δ₂ > 0. Hence Hf(x, y, z) is positive definite at all points. Therefore by Theorem 7, f(x, y, z) is strictly globally minimized at any critical point (x*, y*, z*). To find (x*, y*, z*), solve
∇f(x*, y*, z*) = ( e^(x*−y*) − e^(y*−x*) + 2x* e^((x*)²),  −e^(x*−y*) + e^(y*−x*),  2z* ) = 0.
This leads to z* = 0 and e^(x*−y*) = e^(y*−x*), hence 2x* e^((x*)²) = 0. Accordingly, x* − y* = y* − x*, that is, x* = y*, and x* = 0. Therefore (x*, y*, z*) = (0, 0, 0) is the global minimizer of f(x, y, z).
(c) Find the global minimizers of
f(x, y) = e^(x−y) + e^(y−x).
To this end, compute
∇f(x, y) = ( e^(x−y) − e^(y−x),  −e^(x−y) + e^(y−x) )
and

Hf(x, y) = [ e^(x−y) + e^(y−x)      −(e^(x−y) + e^(y−x)) ]
           [ −(e^(x−y) + e^(y−x))   e^(x−y) + e^(y−x)    ].

Since e^(x−y) + e^(y−x) > 0 for all x, y and det Hf(x, y) = 0, the Hessian Hf(x, y) is positive semidefinite for all x, y. Therefore by Theorem 7, f(x, y) is minimized at any critical point (x*, y*) of f(x, y).
To find (x*, y*), solve
0 = ∇f(x*, y*) = ( e^(x*−y*) − e^(y*−x*),  −e^(x*−y*) + e^(y*−x*) ).
This gives
e^(x*−y*) = e^(y*−x*), or x* − y* = y* − x*;
that is, 2x* = 2y*. This shows that all points of the line y = x are global minimizers of f(x, y).
(d) Find the global minimizers of
f(x, y) = e^(x+y) + e^(x−y).
In this case,
∇f(x, y) = ( e^(x+y) + e^(x−y),  e^(x+y) − e^(x−y) ),

Hf(x, y) = [ e^(x+y) + e^(x−y)   e^(x+y) − e^(x−y) ]
           [ e^(x+y) − e^(x−y)   e^(x+y) + e^(x−y) ].

Since e^(x+y) + e^(x−y) > 0 for all x, y and det Hf(x, y) = 4e^(2x) > 0, Hf(x, y) is positive definite for all x, y. Therefore by Theorem 7, f(x, y) is minimized at any critical point (x*, y*). To find (x*, y*),
0 = ∇f(x*, y*) = ( e^(x*+y*) + e^(x*−y*),  e^(x*+y*) − e^(x*−y*) ).
Thus
e^(x*+y*) + e^(x*−y*) = 0  and  e^(x*+y*) − e^(x*−y*) = 0.
But e^(x*+y*) > 0 and e^(x*−y*) > 0 for all x*, y*. Therefore the equality e^(x*+y*) + e^(x*−y*) = 0 is impossible. Thus f(x, y) has no critical points, and hence f(x, y) has no global minimizers.
There is no disputing that global minimization is far moreimportant than mere local minimization. Still there are
certain situations in which scientists want knowledge of
local minimizers of a function. Since we are in an excellent
position to understand local minimization, let us get on
with it. The basic fact to understand is the next theorem.
Theorem 10. Suppose that f(x) is a function with continuous first and second partial derivatives on some set D in Rⁿ. Suppose x* is an interior point of D and that x* is a critical point of f(x). Then x* is:
(a) a strict local minimizer of f(x) if Hf(x*) is positive definite;
(b) a strict local maximizer of f(x) if Hf(x*) is negative definite.
Now we briefly investigate the meaning of an indefinite Hessian at a critical point of a function. Suppose that f(x) has continuous second partial derivatives on a set D in Rⁿ, that x* is an interior point of D which is a critical point of f(x), and that Hf(x*) is indefinite. This means that there are nonzero vectors y, w in Rⁿ such that
y·Hf(x*)y > 0,  w·Hf(x*)w < 0.
Since f(x) has continuous second partial derivatives on D, there is a δ > 0 such that for all t with |t| < δ,
y·Hf(x* + ty)y > 0  and  w·Hf(x* + tw)w < 0.
Set Y(t) = f(x* + ty) and W(t) = f(x* + tw) for |t| < δ. Therefore, t = 0 is a strict local minimizer for Y(t) and a strict local maximizer for W(t).
Thus, if we move from x* in the direction of y or −y, the values of f(x) increase, but if we move from x* in the direction of w or −w, the values of f(x) decrease. For this reason, we call the critical point x* a saddle point.
The following result summarizes this little discussion:

Theorem 11. If f(x) is a function with continuous second partial derivatives on a set D in Rⁿ, if x* is an interior point of D that is a critical point of f(x), and if the Hessian Hf(x*) is indefinite, then x* is a saddle point for f(x).
Example 10. Let us look for the global and local minimizers and maximizers (if any) of the function
f(x₁, x₂) = x₁³ − 12x₁x₂ + 8x₂³.
In this case, the critical points are the solutions of the system
0 = ∂f/∂x₁ = 3x₁² − 12x₂,
0 = ∂f/∂x₂ = −12x₁ + 24x₂².
This system can be readily solved to identify the critical points (2, 1) and (0, 0).
The Hessian of f(x₁, x₂) is

Hf(x₁, x₂) = [ 6x₁   −12  ]
             [ −12   48x₂ ].

Since

Hf(2, 1) = [ 12  −12 ]
           [ −12  48 ]

and since Δ₁ = 12 and Δ₂ = 432, it follows that the critical point (2, 1) is a strict local minimizer.
Now let us see whether (2, 1) is a global minimizer. Observe that Hf(x₁, x₂) is not positive definite for all (x₁, x₂); for example,

Hf(0, 1) = [ 0   −12 ]
           [ −12  48 ]

is indefinite. In view of Theorem 7, this leads us to suspect that (2, 1) may not be a global minimizer. The fact that
lim_{x₁→−∞} f(x₁, 0) = −∞
shows conclusively that f(x₁, x₂) has no global minimizer. Moreover, since
lim_{x₁→+∞} f(x₁, 0) = +∞,
we see that there are no global maximizers or global minimizers.
How about the critical point (0, 0)? Well, since

Hf(0, 0) = [ 0   −12 ]
           [ −12   0 ],

this matrix miserably fails the tests for positive definiteness, and this leads us to expect trouble at (0, 0). Theorem 11 tells us that there is a saddle point at (0, 0).
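The classification of the two critical points of Example 10 can be confirmed numerically through the Hessian's eigenvalues (a sketch; `hess` is our own helper):

```python
import numpy as np

# Hessian of f(x1, x2) = x1^3 - 12 x1 x2 + 8 x2^3.
def hess(x1, x2):
    return np.array([[6.0 * x1, -12.0],
                     [-12.0, 48.0 * x2]])

w_min = np.linalg.eigvalsh(hess(2.0, 1.0))   # at the critical point (2, 1)
w_sad = np.linalg.eigvalsh(hess(0.0, 0.0))   # at the critical point (0, 0)

assert (w_min > 0).all()          # positive definite -> strict local minimizer
assert w_sad[0] < 0 < w_sad[1]    # one negative, one positive -> saddle point
```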
4.8. Eigenvalues and Positive Definite Matrices

If A is an n×n matrix and if x is a nonzero vector in Rⁿ such that Ax = λx for some real or complex number λ, then λ is called an eigenvalue of A. If λ is an eigenvalue of A, then any nonzero vector x that satisfies the equation Ax = λx is an eigenvector of A corresponding to λ. Since λ is an eigenvalue of an n×n matrix A if and only if the homogeneous system (A − λI)x = 0 of n equations in n unknowns has a nonzero solution x, it follows that the eigenvalues of A are just the roots of the characteristic equation
det(A − λI) = 0.
Since det(A − λI) is a polynomial of degree n in λ, the characteristic equation has n real or complex roots if we count the multiple roots according to their multiplicities, so an n×n matrix A has n real or complex eigenvalues counting their multiplicities.

Symmetric matrices have the following special properties with respect to eigenvalues and eigenvectors:
(1) All of the eigenvalues of a symmetric matrix are real numbers.
(2) Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal.
(3) If λ is an eigenvalue of multiplicity k for a symmetric matrix A, there are k linearly independent eigenvectors corresponding to λ. By applying the Gram-Schmidt Orthogonalization Process, we can always replace these k linearly independent eigenvectors with a set of k mutually orthogonal eigenvectors of unit length.
Combining (2) and (3), we see that if A is an n×n symmetric matrix, there are n mutually orthogonal unit eigenvectors u⁽¹⁾, …, u⁽ⁿ⁾ corresponding to the n eigenvalues λ₁, …, λₙ. If P is the n×n matrix whose ith column is the unit eigenvector u⁽ⁱ⁾ corresponding to λᵢ, and if D is the diagonal matrix with the eigenvalues λ₁, …, λₙ down the main diagonal, then the following matrix equation holds:
AP = PD
because Au⁽ⁱ⁾ = λᵢu⁽ⁱ⁾ for i = 1, …, n. Since the matrix P is orthogonal (that is, its columns are mutually orthogonal unit vectors), P is invertible and the inverse P⁻¹ of P is just the transpose Pᵀ of P. It follows that
PᵀAP = D,
that is, the orthogonal matrix P diagonalizes A. If Q_A(x) = x·Ax is the quadratic form associated with the symmetric matrix A and if x = Py, then
Q_A(x) = x·Ax = (Py)·A(Py) = y·(PᵀAP)y = y·Dy = λ₁y₁² + λ₂y₂² + … + λₙyₙ².
Moreover, since P is invertible, x ≠ 0 if and only if y ≠ 0. Also, if y⁽ⁱ⁾ is the vector in Rⁿ with the ith component equal to 1 and all other components equal to zero, and if x⁽ⁱ⁾ = Py⁽ⁱ⁾, then
Q_A(x⁽ⁱ⁾) = λᵢ
for all i = 1, …, n. These considerations yield the following eigenvalue test for definite, semidefinite, and indefinite matrices.
Theorem 12. If A is a symmetric matrix, then:
(a) the matrix A is positive definite (resp. negativedefinite) if and only if all the eigenvalues of A are
positive (resp. negative);
(b) the matrix A is positive semidefinite (resp.negative semidefinite) if and only if all the
eigenvalues of A are nonnegative (resp. nonpositive);
(c) the matrix A is indefinite if and only if A has atleast one positive eigenvalue and at least one negative
eigenvalue.
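The eigenvalue test of Theorem 12 is straightforward to implement for symmetric matrices; a sketch (the `classify` helper and its tolerance handling are ours), applied to the matrices of Example 9(e)-(g):

```python
import numpy as np

# Eigenvalue test of Theorem 12 for a symmetric matrix A.
def classify(A, tol=1e-12):
    w = np.linalg.eigvalsh(A)      # real eigenvalues of a symmetric matrix
    if (w > tol).all():   return "PD"
    if (w >= -tol).all(): return "PSD"
    if (w < -tol).all():  return "ND"
    if (w <= tol).all():  return "NSD"
    return "indefinite"

assert classify(np.diag([1.0, 1.0, 0.0])) == "PSD"     # Example 9(e)
assert classify(np.diag([-1.0, -3.0, -1.0])) == "ND"   # Example 9(f)
assert classify(np.array([[-1.0, 2.0, 0.0],
                          [2.0, -4.0, 0.0],
                          [0.0, 0.0, -1.0]])) == "NSD" # Example 9(g)
```

Note that strict checks are placed before the semidefinite ones, so a definite matrix is never reported as merely semidefinite.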
Example 11. Let us locate all maximizers, minimizers, and saddle points of
f(x₁, x₂, x₃) = x₁² + x₂² + x₃² + 4x₁x₂.
The critical points of f(x₁, x₂, x₃) are solutions of the system of equations
0 = ∂f/∂x₁ = 2x₁ + 4x₂,
0 = ∂f/∂x₂ = 4x₁ + 2x₂,
0 = ∂f/∂x₃ = 2x₃.
(0, 0, 0) is the one and only solution of this system.
The Hessian of f(x₁, x₂, x₃) is the constant matrix

Hf(x₁, x₂, x₃) = [ 2  4  0 ]
                 [ 4  2  0 ]
                 [ 0  0  2 ].

The eigenvalues of the Hessian matrix are λ = 2, 6, −2, so the Hessian is indefinite at the critical point (0, 0, 0), and hence it is a saddle point for f(x₁, x₂, x₃).
4.9. Problems posed as minimization problems.

(a) Consider the system of equations
Ax = b
with A of order (m, n) and m > n. The problem becomes that of finding the x which minimizes ‖Ax − b‖₂²; this is a quadratic function in x, and hence the vector g = ∇f(x) = ∇‖Ax − b‖₂² containing the first-order partial derivatives is linear in x. The solution is found to be that of the normal equations
AᵀAx = Aᵀb.
Note that the Hessian of f is 2AᵀA, which is seen to be a positive definite matrix when the rank of A is n, meaning that the problem is posed as a minimization problem.
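The normal-equations solution can be checked against a library least-squares solver; a brief sketch on randomly generated data of our own:

```python
import numpy as np

# Overdetermined system Ax = b (m > n), solved via the normal equations
# A^T A x = A^T b.  The Hessian of ||Ax - b||^2 is 2 A^T A, positive
# definite when rank(A) = n.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))         # m = 10 > n = 3
b = rng.standard_normal(10)

x = np.linalg.solve(A.T @ A, A.T @ b)    # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)           # both give the same minimizer
```

In floating-point practice, solving the normal equations squares the condition number of A, so library routines (QR- or SVD-based, like `lstsq`) are preferred for ill-conditioned problems; here the construction merely illustrates the theory.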
(b) Another example is the least squares method for fitting the straight line y = ax + b to the set of points (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ). This problem becomes that of finding the constants a and b which minimize the function
f(a, b) = Σᵢ (yᵢ − axᵢ − b)².
They are given by
0 = ∂f/∂a = 2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)(−xᵢ)
and
0 = ∂f/∂b = 2 Σᵢ₌₁ⁿ (yᵢ − axᵢ − b)(−1).
The above equations can be organized in the convenient form

[ n          Σᵢ₌₁ⁿ xᵢ  ] (b)   ( Σᵢ₌₁ⁿ yᵢ   )
[ Σᵢ₌₁ⁿ xᵢ   Σᵢ₌₁ⁿ xᵢ² ] (a) = ( Σᵢ₌₁ⁿ xᵢyᵢ ).
We can easily justify that the matrix on the left-hand side is nonsingular (when the xᵢ are not all equal) because
n Σᵢ₌₁ⁿ xᵢ² > ( Σᵢ₌₁ⁿ xᵢ )².
[Hölder's inequality:
Σᵢ₌₁ⁿ |aᵢbᵢ| ≤ ( Σᵢ₌₁ⁿ |aᵢ|ᵖ )^(1/p) ( Σᵢ₌₁ⁿ |bᵢ|^q )^(1/q),  1/p + 1/q = 1.
Letting aᵢ = xᵢ, bᵢ = 1 and p = q = 2, we get
Σᵢ₌₁ⁿ xᵢ ≤ ( Σᵢ₌₁ⁿ xᵢ² )^(1/2) n^(1/2),
i.e.,
( Σᵢ₌₁ⁿ xᵢ )² ≤ n Σᵢ₌₁ⁿ xᵢ². ]
The matrix on the left-hand side is also the Hessian matrix of the function f(a, b) (up to the factor 2); this being positive definite means that we are again dealing with a minimization problem.
Many other nonlinear functions which seem difficult to deal with, except by nonlinear techniques, can be dealt with using the same technique explained above if they can be formulated as a quadratic objective function. The procedure is to take the logarithm of both sides to formulate the problem as follows:
min_{a,b} Σᵢ₌₁ⁿ ( ln yᵢ − ln a − bxᵢ )²
(this corresponds to fitting a model of the form y = a e^(bx)).
If the function is not quadratic, the above method fails, for g(x) = ∇f(x) = 0 generates a set of n nonlinear simultaneous equations. The procedure will then be to choose a guess point x⁰ and improve on it until the solution is reached. The procedure is as follows.
Expand f(x) around x⁰ by Taylor's series:
f(x) = f(x⁰) + g(x⁰)ᵀ(x − x⁰) + ½ (x − x⁰)ᵀ H(x⁰)(x − x⁰) + O(‖x − x⁰‖³).
Now x̂ is defined as the point at which ∇f(x̂) = 0. Differentiating the Taylor expansion with respect to x (or x − x⁰) gives
0 ≈ g(x⁰) + H(x⁰)(x̂ − x⁰),
where the third term is neglected if x⁰ is rightly chosen near x̂. Hence
x̂ ≈ x⁰ − H⁻¹(x⁰) g(x⁰),
where the vector y = H⁻¹g is obtained by solving the linear equations
Hy = g.
Because the above correction for x̂ is only approximate, the solution x̂ can be improved by iteration.
The convergence of the above method is guaranteed if H after every iteration is found positive definite for a minimization problem, or negative definite for a maximization problem.
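The iteration just described (Newton's method) can be sketched in a few lines. The test function below is our own, modeled loosely on Example (b) above so that its Hessian is positive definite everywhere and the minimizer is the origin:

```python
import numpy as np

# Newton iteration: x_{k+1} = x_k - y, where H(x_k) y = g(x_k).
# Test function (ours): f(x) = e^{x1-x2} + e^{x2-x1} + x1^2,
# strictly convex with minimizer x1 = x2 = 0.
def grad(x):
    u, v = np.exp(x[0] - x[1]), np.exp(x[1] - x[0])
    return np.array([u - v + 2.0 * x[0], -u + v])

def hess(x):
    u, v = np.exp(x[0] - x[1]), np.exp(x[1] - x[0])
    return np.array([[u + v + 2.0, -(u + v)],
                     [-(u + v), u + v]])

x = np.array([1.0, 2.0])                 # initial guess x^0
for _ in range(20):
    y = np.linalg.solve(hess(x), grad(x))   # solve H y = g
    x = x - y                               # Newton correction

assert np.allclose(grad(x), 0.0)         # converged to a critical point
assert np.allclose(x, 0.0)               # which is the minimizer (0, 0)
```

Since the Hessian here is positive definite at every iterate, the convergence condition stated above is satisfied and each step decreases f.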
For example, for a minimization problem,
f(x̂) ≈ f(x⁰) + g(x⁰)ᵀ(x̂ − x⁰) + ½ (x̂ − x⁰)ᵀ H(x⁰)(x̂ − x⁰)
     = f(x⁰) + gᵀ{−H⁻¹g} + ½ {−H⁻¹g}ᵀ H {−H⁻¹g}
     = f(x⁰) − gᵀH⁻¹g + ½ gᵀH⁻¹g
     = f(x⁰) − ½ gᵀH⁻¹g
(all quantities evaluated at x⁰), and since H is positive definite, H⁻¹ is also positive definite. Hence
gᵀH⁻¹g > 0.
From this, we obtain f(x̂) < f(x⁰).

⟨g₃, d₂⟩ = ⟨g₃, −g₂ + (⟨g₂, g₂⟩/⟨g₁, g₁⟩) d₁⟩
         = −⟨g₃, g₂⟩ + (⟨g₂, g₂⟩/⟨g₁, g₁⟩)⟨g₃, d₁⟩
         = 0        (since ⟨g₁, g₂⟩ = 0).

And minimizing in the direction of d₂, we obtain
⟨g₃, g₂⟩ = 0.
The above property holds because g₂ is a linear combination of d₂ and d₁. But d₂ is orthogonal to g₃, and so is d₁, for
⟨g₃, d₁⟩ = 0.
Example. Consider
f(x, y) = 4x² − 4xy + 2y².
Solution. The function can be rewritten as

f(x, y) = (x, y) [  4 −2 ] (x)
                 [ −2  2 ] (y).

∇f(x, y) = (8x − 4y, −4x + 4y),

H = ∇²f(x, y) = [  8 −4 ]
                [ −4  4 ].

Let us start with
x₁ = (2, 3)ᵀ.
Then
d₁ = −g₁ = −(16 − 12, −8 + 12) = −(4, 4),
and the exact line-search step is

λ₁ = ⟨g₁, g₁⟩ / ⟨d₁, Hd₁⟩ = 32 / ( (4, 4) [  8 −4 ] (4) ) = 32/64 = 1/2.
                                          [ −4  4 ] (4)
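The first steepest-descent step of the example above can be reproduced numerically (a sketch; the starting point x₁ = (2, 3)ᵀ follows the example):

```python
import numpy as np

# First steepest-descent step for f(x,y) = 4x^2 - 4xy + 2y^2.
H = np.array([[8.0, -4.0],
              [-4.0, 4.0]])              # constant Hessian
def grad(x): return H @ x                # since f(x) = 0.5 * x.Hx

x1 = np.array([2.0, 3.0])
g1 = grad(x1)
assert np.allclose(g1, [4.0, 4.0])       # gradient at the start

d1 = -g1                                 # steepest-descent direction
lam = (g1 @ g1) / (d1 @ H @ d1)          # exact line-search step length
assert np.isclose(lam, 0.5)              # 32 / 64, as in the example

x2 = x1 + lam * d1                       # next iterate: (0, 1)
```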