
2016/10/18 Lecture:Supplement 1

Optimization Theory and Applications

Supplemental Lecture for Mathematical Background

Prof. Chun-Hung Liu

Dept. of Electrical and Computer Engineering, National Chiao Tung University

Fall 2016

2016/10/18 Lecture:Supplement 2

Function Notation

• When we write $f : A \to B$, we mean that f is a function on the set $\operatorname{dom} f \subseteq A$ mapping into the set B. Thus the notation $f : \mathbb{R}^n \to \mathbb{R}^m$ means that f maps (some) n-vectors into m-vectors.

• As an example, consider the function $f : \mathbf{S}^n \to \mathbb{R}$ with $\operatorname{dom} f = \mathbf{S}^n_{++}$. The notation $f : \mathbf{S}^n \to \mathbb{R}$ specifies the syntax of f: it takes as argument a symmetric $n \times n$ matrix and returns a real number. The notation $\operatorname{dom} f = \mathbf{S}^n_{++}$ specifies which symmetric $n \times n$ matrices are valid input arguments for f (i.e., only positive definite ones).

2016/10/18 Lecture:Supplement 3

Gradient, Jacobian, Hessian

• A gradient is the derivative of a scalar with respect to a vector:

$$\nabla_x f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n}\right]^T$$

• Example: if $f(x) = 2x_1 x_2 + x_2^2 + x_1 x_3^2$, then its gradient is

$$\nabla_x f(x) = \begin{bmatrix} 2x_2 + x_3^2 \\ 2x_1 + 2x_2 \\ 2x_1 x_3 \end{bmatrix}$$

2016/10/18 Lecture:Supplement 4

Gradient, Jacobian, Hessian

• A Jacobian is the derivative of a vector with respect to a transposed vector.

• Example: If we have the function $f(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T$, then its Jacobian is the $k \times n$ matrix

$$\frac{\partial f(x)}{\partial x^T} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_k(x)}{\partial x_1} & \cdots & \frac{\partial f_k(x)}{\partial x_n} \end{bmatrix}$$
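A small numeric illustration (added here, not from the slides; NumPy assumed, and the function below is a made-up example): a forward-difference Jacobian built column by column.

```python
import numpy as np

def f(x):
    # hypothetical example mapping R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def numerical_jacobian(f, x, h=1e-6):
    # J[i, j] approximates d f_i / d x_j
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (f(x + e) - fx) / h
    return J

x = np.array([1.0, 2.0])
print(numerical_jacobian(f, x))
# rows approximate [x2, x1], [cos(x1), 0], [0, 2*x2]
```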

2016/10/18 Lecture:Supplement 5

Gradient, Jacobian, Hessian

• The Hessian is the derivative of a gradient with respect to a transposed vector:

$$\nabla_x^2 f(x) = \frac{\partial}{\partial x^T}\big(\nabla_x f(x)\big) = \left[\frac{\partial^2 f(x)}{\partial x_i\,\partial x_j}\right]_{i,j}$$

• Recall that our gradient above is $\nabla_x f(x) = [2x_2 + x_3^2,\; 2x_1 + 2x_2,\; 2x_1 x_3]^T$. The Hessian is

$$\nabla_x^2 f(x) = \begin{bmatrix} 0 & 2 & 2x_3 \\ 2 & 2 & 0 \\ 2x_3 & 0 & 2x_1 \end{bmatrix}$$
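To double-check these Hessian entries, the short sketch below (an addition, assuming NumPy) differences the analytic gradient coordinate by coordinate.

```python
import numpy as np

def grad_f(x):
    # gradient of f(x) = 2*x1*x2 + x2^2 + x1*x3^2
    return np.array([2 * x[1] + x[2] ** 2,
                     2 * x[0] + 2 * x[1],
                     2 * x[0] * x[2]])

def hess_f(x):
    # analytic Hessian derived above
    return np.array([[0.0,      2.0, 2 * x[2]],
                     [2.0,      2.0, 0.0],
                     [2 * x[2], 0.0, 2 * x[0]]])

def numerical_hessian(grad, x, h=1e-6):
    # difference the gradient one coordinate at a time
    n = len(x)
    H = np.zeros((n, n))
    for j in range(n):
        e = np.zeros_like(x); e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

x = np.array([1.0, -2.0, 0.5])
print(np.allclose(hess_f(x), numerical_hessian(grad_f, x)))  # expect True
```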

2016/10/18 Lecture:Supplement 6

Trace Derivative of Matrix

• If $X \in \mathbb{R}^{n \times m}$ and $f(Y) = \operatorname{tr}(Y)$, then

$$\frac{\partial f(Y)}{\partial X} = \left[\frac{\partial \operatorname{tr}(Y)}{\partial x_{ij}}\right],$$

where $\frac{\partial \operatorname{tr}(Y)}{\partial x_{ij}}$ is what we put in the $ij$th place in our derivative matrix. Thus,

$$\left[\frac{\partial f(Y)}{\partial X}\right]_{ji} = \frac{\partial \operatorname{tr}(Y)}{\partial x_{ji}},$$

because $\frac{\partial \operatorname{tr}(Y)}{\partial x_{ji}}$ is what we put in the $ji$th place in our derivative matrix.
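A minimal sketch of this entry-by-entry convention (an addition, assuming NumPy), specialized to the simplest case Y = X, so that f(X) = tr(X) and the derivative matrix is the identity.

```python
import numpy as np

def matrix_derivative(f, X, h=1e-6):
    # build df/dX entry by entry: df/dx_ij goes in the (i, j) place
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = h
            D[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return D

X = np.random.default_rng(0).standard_normal((4, 4))
# for f(X) = tr(X), only the diagonal entries x_ii contribute, so df/dX = I
print(np.allclose(matrix_derivative(np.trace, X), np.eye(4)))  # expect True
```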

2016/10/19 Lecture:Supplement 7

Product Rule for Vector and Matrix

• Suppose u(x) and v(x) are scalar functions of x. Recall the product rule from your calculus class:

$$\frac{d\,\big(u(x)v(x)\big)}{dx} = \frac{du(x)}{dx}\,v(x) + u(x)\,\frac{dv(x)}{dx}$$

• Now we want to translate this to matrices and traces of matrices.

• If we take the derivative of the matrix product $U(x)V(x)$ with respect to a scalar x:

2016/10/19 Lecture:Supplement 8

Product Rule for Vector and Matrix

• Then we find that the i,jth place in our new derivative matrix is

$$\frac{d\,(UV)_{ij}}{dx} = \sum_{k}\left(\frac{du_{ik}}{dx}\,v_{kj} + u_{ik}\,\frac{dv_{kj}}{dx}\right)$$

• Since the picked element is arbitrary, we can generalize:

$$\frac{d\,(UV)}{dx} = \frac{dU}{dx}\,V + U\,\frac{dV}{dx}$$
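A quick numerical check of the matrix product rule (an addition, assuming NumPy; U(t) and V(t) below are arbitrary made-up matrix functions of a scalar t).

```python
import numpy as np

# hypothetical matrix-valued functions of a scalar t
U = lambda t: np.array([[t, t ** 2], [1.0, np.sin(t)]])
V = lambda t: np.array([[np.cos(t), 2.0], [t ** 3, 1.0]])

def d_dt(F, t, h=1e-6):
    # central-difference derivative of a matrix-valued function of t
    return (F(t + h) - F(t - h)) / (2 * h)

t = 0.7
lhs = d_dt(lambda s: U(s) @ V(s), t)          # d(UV)/dt
rhs = d_dt(U, t) @ V(t) + U(t) @ d_dt(V, t)   # (dU/dt) V + U (dV/dt)
print(np.allclose(lhs, rhs))                   # expect True
```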

2016/10/19 Lecture:Supplement 9

Chain Rule for Vector and Matrix

• Let us first see an example. Suppose f is a scalar function of a vector x, and each component $x_i$ of x is in turn a function of a scalar t. Now, if we want $df/dt$, we can differentiate the composite function with respect to t directly.

2016/10/19 Lecture:Supplement 10

Chain Rule for Vector and Matrix

• But we know each partial derivative $\partial f(x)/\partial x_i$ and each derivative $\partial x_i/\partial t$. So, another way of getting the same result:

• In other words, if $f : \mathbb{R}^n \to \mathbb{R}$, then

$$df(x) = \sum_{i=1}^{n} \frac{\partial f(x)}{\partial x_i}\cdot\frac{\partial x_i}{\partial t}\,dt = \nabla f(x)^T\,\frac{\partial x}{\partial t}\,dt$$
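A numeric check of this chain rule (an addition, assuming NumPy; the particular f and x(t) below are made-up examples).

```python
import numpy as np

# hypothetical example: f(x) = x1^2 * x2, with x(t) = (sin t, t^2)
f      = lambda x: x[0] ** 2 * x[1]
grad_f = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2])
x_of_t = lambda t: np.array([np.sin(t), t ** 2])
dx_dt  = lambda t: np.array([np.cos(t), 2 * t])

def d_dt(g, t, h=1e-6):
    return (g(t + h) - g(t - h)) / (2 * h)

t = 1.3
direct     = d_dt(lambda s: f(x_of_t(s)), t)   # d f(x(t)) / dt, computed directly
chain_rule = grad_f(x_of_t(t)) @ dx_dt(t)      # grad f(x)^T dx/dt
print(np.isclose(direct, chain_rule))           # expect True
```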

2016/10/19 Lecture:Supplement 11

Directional Derivative of Vectors

• The standard derivative definition:

$$f'(x) = \lim_{t \to 0}\frac{f(x + t) - f(x)}{t}$$

• In vector calculus, we have the directional derivative, defined as

$$D_w f(x) = \lim_{t \to 0}\frac{f(x + tw) - f(x)}{t}$$

• Now our function is a surface (a scalar function of as many input dimensions as there are elements in x), so the derivative at a multidimensional point x changes with the direction in which we travel from that point.

• Think of standing on a mountain and turning around in a circle: in some directions the slope will be very steep (and you might fall off the mountain), but in other directions there will barely be any slope at all.

2016/10/19 Lecture:Supplement 12

Directional Derivative of Vectors

• Actually, we can write

$$D_w f(x) = w^T\,\frac{\partial f(x)}{\partial x} = w^T\,\nabla_x f(x)$$

• Example: Let $f(x) = x^T Q x$ with Q symmetric. Find $\nabla_x f(x) = ?$

$$D_w f(x) = \lim_{t\to 0}\frac{f(x+tw) - f(x)}{t} = \lim_{t\to 0}\frac{(x+tw)^T Q (x+tw) - x^T Q x}{t}$$
$$= w^T Q x + x^T Q w + \lim_{t\to 0} t\,w^T Q w = 2\,w^T Q x$$

Since $D_w f(x) = w^T \nabla_x f(x)$ for every direction w, it follows that

$$\nabla_x f(x) = 2Qx, \qquad \nabla_x^2 f(x) = 2Q$$
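A short numeric confirmation (an addition, assuming NumPy): for a random symmetric Q, the directional derivative of $x^T Q x$ from the limit definition agrees with $w^T \nabla_x f(x)$ using $\nabla_x f(x) = 2Qx$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
Q = rng.standard_normal((n, n)); Q = (Q + Q.T) / 2   # make Q symmetric
x = rng.standard_normal(n)
w = rng.standard_normal(n)

f      = lambda x: x @ Q @ x
grad_f = lambda x: 2 * Q @ x                          # gradient derived above

t = 1e-6
directional = (f(x + t * w) - f(x)) / t               # limit definition, small t
print(np.isclose(directional, w @ grad_f(x), atol=1e-4))  # expect True
```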

2016/10/19 Lecture:Supplement 13

Table of Gradients

• For $x \in \mathbb{R}^n$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$, we have the following table of gradients.

2016/10/19 Lecture:Supplement 14

Table of Gradients

• For $x, a, b \in \mathbb{R}^n$, we have the following table of gradients.
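The table entries themselves did not survive in this transcript. As a partial stand-in (an addition, assuming NumPy; the specific identities below are standard ones that such tables typically list, not a copy of the slide), the sketch numerically checks a few common gradient formulas.

```python
import numpy as np

def num_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(2)
n, m = 4, 3
x, a = rng.standard_normal(n), rng.standard_normal(n)
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

# grad_x (a^T x) = a
print(np.allclose(num_grad(lambda x: a @ x, x), a))
# grad_x ||Ax - b||_2^2 = 2 A^T (Ax - b)
print(np.allclose(num_grad(lambda x: np.sum((A @ x - b) ** 2), x),
                  2 * A.T @ (A @ x - b)))
# grad_x (x^T M x) = (M + M^T) x for a square (not necessarily symmetric) M
M = rng.standard_normal((n, n))
print(np.allclose(num_grad(lambda x: x @ M @ x, x), (M + M.T) @ x))
```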

2016/10/19 Lecture:Supplement 15

Directional Derivative of Matrices

• Now we can extend our directional derivative definition to matrices:

$$D_Y f(X) = \lim_{t \to 0}\frac{f(X + tY) - f(X)}{t},$$

where Y is a matrix with the same dimension as X.

• Similarly, we have

$$D_Y f(X) = \operatorname{tr}\!\left(Y^T\,\frac{\partial f(X)}{\partial X}\right).$$

(We can use this definition to find $\frac{\partial f(X)}{\partial X}$.)

2016/10/19 Lecture:Supplement 16

Directional Derivative of Matrices

• Example: Suppose $f(X) = \operatorname{tr}(AX)$ and we want to find

$$\frac{\partial f(X)}{\partial X} = ?$$

By definition, we know

$$D_Y f(X) = \lim_{t\to 0}\frac{\operatorname{tr}\!\big(A(X+tY)\big) - \operatorname{tr}(AX)}{t} = \operatorname{tr}(AY)$$

2016/10/19 Lecture:Supplement 17

Directional Derivative of Matrices

• So we can have

$$D_Y f(X) = \operatorname{tr}(AY) = \operatorname{tr}(Y^T A^T) = \operatorname{tr}\!\left(Y^T\,\frac{\partial f(X)}{\partial X}\right),$$

which gives

$$\frac{\partial f(X)}{\partial X} = A^T$$
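A quick numeric sanity check of this example (an addition, assuming NumPy): the directional derivative of tr(AX) from the limit definition matches $\operatorname{tr}(Y^T A^T)$, consistent with $\partial\operatorname{tr}(AX)/\partial X = A^T$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((3, 5))
X = rng.standard_normal((5, 3))
Y = rng.standard_normal((5, 3))          # direction, same shape as X

f = lambda X: np.trace(A @ X)            # f(X) = tr(AX)

t = 1e-6
D_Y = (f(X + t * Y) - f(X)) / t          # limit definition (f is linear, so this is exact)
print(np.isclose(D_Y, np.trace(Y.T @ A.T)))   # D_Y f(X) = tr(Y^T A^T), expect True
```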

• Example: Let $f(X) = \ldots$ . Find $\frac{\partial f(X)}{\partial X} = ?$

2016/10/19 Lecture:Supplement 18

Directional Derivative of Matrices

(Because $\operatorname{tr}(UV) = \operatorname{tr}(VU)$.)

2016/10/19 Lecture:Supplement 19

Directional Derivative of Matrices

2016/10/19 Lecture:Supplement 20

Directional Derivative of Vectors

• Example: Consider the function $f(X) = \log\det(X)$ with $f : \mathbf{S}^n_{++} \to \mathbb{R}$.

• One (tedious) way to find the gradient of f is to follow the directional derivative introduced above. Instead, we will directly find the first-order approximation of f at $X \in \mathbf{S}^n_{++}$.

• Let $Z \in \mathbf{S}^n_{++}$ be close to X, and let $\Delta X = Z - X$ (which is assumed to be small). Thus we have

$$\log\det Z = \log\det\!\big(X^{1/2}(I + X^{-1/2}\,\Delta X\,X^{-1/2})X^{1/2}\big) = \log\det X + \sum_{i=1}^{n}\log(1 + \lambda_i),$$

where $\lambda_i$ is the ith eigenvalue of $X^{-1/2}\,\Delta X\,X^{-1/2}$.

• Now we would like to show $\nabla f(X) = X^{-1}$.

2016/10/19 Lecture:Supplement 21

Directional Derivative of Vectors

• Now we use the fact that $\Delta X$ is small, which implies the $\lambda_i$ are small, so to first order we have $\log(1 + \lambda_i) \approx \lambda_i$. Using this first-order approximation in the expression above, we get

$$\log\det Z \approx \log\det X + \sum_{i=1}^{n}\lambda_i = \log\det X + \operatorname{tr}\!\big(X^{-1/2}\,\Delta X\,X^{-1/2}\big) = \log\det X + \operatorname{tr}\!\big(X^{-1}(Z - X)\big).$$

Thus, the first-order approximation of f at X is the affine function of Z given by

$$f(Z) \approx \log\det X + \operatorname{tr}\!\big(X^{-1}(Z - X)\big).$$

• Noting that the second term on the right-hand side is the standard inner product of $X^{-1}$ and $Z - X$, we can identify $X^{-1}$ as the gradient of f at X.
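A numeric check of this result (an addition, assuming NumPy): for a symmetric positive definite test matrix, the entrywise derivative of $\log\det X$ equals $X^{-1}$.

```python
import numpy as np

def matrix_derivative(f, X, h=1e-6):
    # entrywise central-difference derivative of a scalar function of a matrix
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X); E[i, j] = h
            D[i, j] = (f(X + E) - f(X - E)) / (2 * h)
    return D

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)              # symmetric positive definite test matrix

f = lambda X: np.log(np.linalg.det(X))
# since X is symmetric, the entrywise derivative equals X^{-1}
print(np.allclose(matrix_derivative(f, X), np.linalg.inv(X), atol=1e-5))  # expect True
```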

2016/10/19 Lecture:Supplement 22

Table of Gradients

2016/10/19 Lecture:Supplement 23

Table of Gradients

2016/10/18 Lecture:Supplement 24

Symmetric Eigenvalue Decomposition

• Suppose $A \in \mathbf{S}^n$, i.e., A is a real symmetric $n \times n$ matrix. Then A can be factored as

$$A = Q \Lambda Q^T,$$

where $Q \in \mathbb{R}^{n \times n}$ is an orthogonal matrix (also called a unitary matrix), i.e., it satisfies $Q^T Q = I$, and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$. The (real) numbers $\lambda_i$ are the eigenvalues of A, and the columns of Q form an orthonormal set of eigenvectors of A.

• The determinant and trace can be expressed in terms of the eigenvalues,

$$\det A = \prod_{i=1}^{n}\lambda_i, \qquad \operatorname{tr} A = \sum_{i=1}^{n}\lambda_i,$$

as can the spectral and Frobenius norms,

$$\|A\|_2 = \max_i |\lambda_i|, \qquad \|A\|_F = \left(\sum_{i=1}^{n}\lambda_i^2\right)^{1/2}.$$
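The sketch below (an addition, assuming NumPy) verifies the factorization and these eigenvalue expressions on a random symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                          # a random symmetric matrix

lam, Q = np.linalg.eigh(A)                 # A = Q diag(lam) Q^T with Q orthogonal
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))                     # factorization
print(np.isclose(np.linalg.det(A), np.prod(lam)))                 # det = product of eigenvalues
print(np.isclose(np.trace(A), np.sum(lam)))                       # trace = sum of eigenvalues
print(np.isclose(np.linalg.norm(A, 2), np.max(np.abs(lam))))      # spectral norm
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(lam ** 2))))  # Frobenius norm
```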

2016/10/18 Lecture:Supplement 25

Definiteness and Matrix Inequalities

• The largest and smallest eigenvalues satisfy

$$\lambda_{\max}(A) = \sup_{x \neq 0}\frac{x^T A x}{x^T x}, \qquad \lambda_{\min}(A) = \inf_{x \neq 0}\frac{x^T A x}{x^T x}.$$

In particular, for any x, we have

$$\lambda_{\min}(A)\,x^T x \;\le\; x^T A x \;\le\; \lambda_{\max}(A)\,x^T x.$$

• A matrix $A \in \mathbf{S}^n$ is called positive definite if $x^T A x > 0$ for all $x \neq 0$. We denote this as $A \succ 0$. By the inequality above, we see that $A \succ 0$ if and only if all its eigenvalues are positive, i.e., $\lambda_{\min}(A) > 0$. We use $\mathbf{S}^n_{++}$ to denote the set of positive definite matrices in $\mathbf{S}^n$.

• For $A, B \in \mathbf{S}^n$, we use $A \preceq B$ to mean $B - A \succeq 0$, and so on.
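A small illustration (an addition, assuming NumPy): positive definiteness checked through the smallest eigenvalue, and the two-sided quadratic-form bound verified on a random vector.

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = B @ B.T + 0.1 * np.eye(4)              # symmetric and positive definite by construction

lam = np.linalg.eigvalsh(A)
print(lam.min() > 0)                        # A > 0  iff  lambda_min(A) > 0

x = rng.standard_normal(4)
q = x @ A @ x
print(lam.min() * (x @ x) <= q <= lam.max() * (x @ x))   # two-sided bound, expect True
```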

2016/10/18 Lecture:Supplement 26

Singular Value Decomposition (SVD)

• Symmetric squareroot: Let $A \in \mathbf{S}^n_+$, with eigenvalue decomposition $A = Q\,\operatorname{diag}(\lambda_1,\ldots,\lambda_n)\,Q^T$. We define the (symmetric) squareroot of A as

$$A^{1/2} = Q\,\operatorname{diag}(\lambda_1^{1/2},\ldots,\lambda_n^{1/2})\,Q^T.$$

The squareroot $A^{1/2}$ is the unique symmetric positive semidefinite solution of the equation $X^2 = A$.

• Suppose $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank} A = r$. Then A can be factored as

$$A = U \Sigma V^T \qquad \text{(SVD)},$$

where $U \in \mathbb{R}^{m \times r}$ satisfies $U^T U = I$, $V \in \mathbb{R}^{n \times r}$ satisfies $V^T V = I$, and $\Sigma = \operatorname{diag}(\sigma_1,\ldots,\sigma_r)$ with $\sigma_1 \ge \cdots \ge \sigma_r > 0$.

2016/10/18 Lecture:Supplement 27

Singular Value Decomposition (SVD)

• The columns of U are called left singular vectors of A, the columns of V are right singular vectors, and the numbers $\sigma_i$ are the singular values. The singular value decomposition can be written

$$A = \sum_{i=1}^{r}\sigma_i u_i v_i^T,$$

where $u_i \in \mathbb{R}^m$ are the left singular vectors and $v_i \in \mathbb{R}^n$ are the right singular vectors.

• The singular value decomposition of a matrix A is closely related to the eigenvalue decomposition of the (symmetric, nonnegative definite) matrix $A^T A$. So we can write

$$A^T A = V \Sigma^2 V^T,$$

and we conclude that the nonzero eigenvalues of $A^T A$ are the singular values of A squared, and the associated eigenvectors of $A^T A$ are the right singular vectors of A.
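A short numeric illustration (an addition, assuming NumPy) of the SVD, its relation to the eigenvalues of $A^T A$, and the symmetric squareroot defined above.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # compact SVD: A = U diag(s) V^T
print(np.allclose(U @ np.diag(s) @ Vt, A))

# eigenvalues of A^T A are the squared singular values
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(eigvals, s ** 2))

# symmetric squareroot of A^T A via its eigendecomposition; S @ S recovers A^T A
lam, Q = np.linalg.eigh(A.T @ A)
S = Q @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ Q.T
print(np.allclose(S @ S, A.T @ A))
```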

2016/10/18 Lecture:Supplement 28

Pseudo-inverse of Matrices

• Let $A = U\Sigma V^T$ be the singular value decomposition of $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank} A = r$. We define the pseudo-inverse or Moore-Penrose inverse of A as

$$A^\dagger = V \Sigma^{-1} U^T \in \mathbb{R}^{n \times m}.$$

• If $\operatorname{rank} A = n$, then $A^\dagger = (A^T A)^{-1} A^T$. If $\operatorname{rank} A = m$, then $A^\dagger = A^T (A A^T)^{-1}$. If A is square and nonsingular, then $A^\dagger = A^{-1}$.

• The pseudo-inverse comes up in problems involving least-squares, minimum norm, quadratic minimization, and (Euclidean) projection. For example, $A^\dagger b$ is a solution of the least-squares problem

$$\text{minimize} \quad \|Ax - b\|_2^2.$$

When the solution is not unique, $A^\dagger b$ gives the solution with minimum (Euclidean) norm.
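A quick check (an addition, assuming NumPy) of the full-column-rank formula and the least-squares property on a random tall matrix.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 3))            # tall matrix, full column rank (rank A = n)
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)
print(np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T))   # A_dagger = (A^T A)^{-1} A^T

# A_dagger b solves: minimize ||Ax - b||_2
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(A_pinv @ b, x_ls))
```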

2016/10/18 Lecture:Supplement 29

Schur Complement

• Consider a matrix $X \in \mathbf{S}^n$ partitioned as

$$X = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},$$

where $A \in \mathbf{S}^k$. If $\det(A) \neq 0$, the matrix

$$S = C - B^T A^{-1} B$$

is called the Schur complement of A in X. Schur complements arise in several contexts, and appear in many important formulas and theorems. For example, we have

$$\det X = \det A \,\det S.$$

2016/10/19 Lecture:Supplement 30

Schur Complement

• The Schur complement comes up in solving linear equations by eliminating one block of variables. For example, consider

$$\begin{bmatrix} A & B \\ B^T & C \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix},$$

assuming $\det(A) \neq 0$. Eliminating x from the top block equation and substituting it into the bottom block equation yields

$$v = B^T A^{-1} u + S y, \qquad \text{so} \qquad y = S^{-1}\big(v - B^T A^{-1} u\big).$$

Substituting this into the first equation yields

$$x = A^{-1}\big(u - B S^{-1}(v - B^T A^{-1} u)\big).$$

2016/10/19 Lecture:Supplement 31

Schur Complement

• The Schur complement arises when you minimize a quadratic form over some of the variables. Suppose $A \succ 0$, and consider the minimization problem

$$\text{minimize}_{\,u} \quad \begin{bmatrix} u \\ v \end{bmatrix}^T \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = u^T A u + 2 u^T B v + v^T C v,$$

with variable u. The solution is $u = -A^{-1}Bv$ and the optimal value is

$$\inf_u \begin{bmatrix} u \\ v \end{bmatrix}^T X \begin{bmatrix} u \\ v \end{bmatrix} = v^T S v = v^T\big(C - B^T A^{-1} B\big)v.$$

• From this we can derive the following characterizations of positive definiteness or semidefiniteness of the block matrix X:

  – $X \succ 0$ if and only if $A \succ 0$ and $S \succ 0$;
  – if $A \succ 0$, then $X \succeq 0$ if and only if $S \succeq 0$.
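A compact numeric check of these Schur-complement facts (an addition, assuming NumPy; the blocks below are random test matrices).

```python
import numpy as np

rng = np.random.default_rng(9)
k, m = 3, 2
A = rng.standard_normal((k, k)); A = A @ A.T + np.eye(k)    # A > 0
B = rng.standard_normal((k, m))
C = rng.standard_normal((m, m)); C = C @ C.T + np.eye(m)    # symmetric C

X = np.block([[A, B], [B.T, C]])
S = C - B.T @ np.linalg.inv(A) @ B                           # Schur complement of A in X

# det X = det A * det S
print(np.isclose(np.linalg.det(X), np.linalg.det(A) * np.linalg.det(S)))

# minimizing the quadratic form over u at u* = -A^{-1} B v gives value v^T S v
v = rng.standard_normal(m)
u_star = -np.linalg.inv(A) @ B @ v
z = np.concatenate([u_star, v])
print(np.isclose(z @ X @ z, v @ S @ v))

# X > 0  iff  A > 0 and S > 0
is_pd = lambda M: np.all(np.linalg.eigvalsh((M + M.T) / 2) > 0)
print(is_pd(X) == (is_pd(A) and is_pd(S)))
```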