Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University

Canberra, February – June 2018

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

© 2018 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University


Part III

Linear Algebra


Linear Curve Fitting - Input Specification

N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N

[Figure: scatter plot of the N = 10 data points, with input x on the horizontal axis and target t on the vertical axis.]


Strategy in this course

Estimate best predictor = training = learning.
Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results



Linear Curve Fitting - Least Squares

N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N

y(x, w) = w_1 x + w_0
X ≡ [x 1]
w* = (X^T X)^{-1} X^T t

[Figure: the N = 10 data points together with the fitted line y(x, w*).]
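The closed-form solution above maps directly to a few lines of NumPy. This is a minimal sketch, not the course code: the data values below are made up for illustration, and np.linalg.lstsq is used alongside the normal equations because it is the numerically safer route.

import numpy as np

# Hypothetical data standing in for the N = 10 points on the slide.
x = np.array([0.5, 1.0, 2.2, 3.1, 4.0, 5.3, 6.1, 7.4, 8.2, 9.0])
t = np.array([0.8, 1.1, 2.0, 2.7, 3.2, 4.1, 4.8, 5.9, 6.5, 7.1])

# Design matrix X = [x 1]: one column for the slope w1, one for the offset w0.
X = np.column_stack([x, np.ones_like(x)])

# Normal-equation solution w* = (X^T X)^{-1} X^T t ...
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# ... which agrees with the least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

assert np.allclose(w_normal, w_lstsq)
print(w_normal)   # [w1, w0]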


Intuition

Arithmetic: zero, positive/negative numbers; addition and multiplication.
Geometry: points and lines; vector addition and scaling. Humans have experience with 3 dimensions (less so with 1 or 2).
Generalisation to N dimensions (possibly N → ∞): a line becomes a vector space V, a point becomes a vector x ∈ V.
Example: X ∈ R^{n×m}. The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic.


Vector Space V, underlying Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following holds:

Associativity of addition: u + (v + w) = (u + v) + w.
Commutativity of addition: v + w = w + v.
Identity element of addition: there exists 0 ∈ V such that v + 0 = v for all v ∈ V.
Inverse of addition: for all v ∈ V there exists w ∈ V such that v + w = 0; the additive inverse is denoted −v.
Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.


Matrix-Vector Multiplication

for i in range(m):
    R[i] = 0.0
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]

Listing 1: Row-by-row (element-by-element) computation of the matrix-vector product R = A V.

A V = R:

[ a11 a12 ... a1n ] [ v1 ]   [ a11 v1 + a12 v2 + ... + a1n vn ]
[ a21 a22 ... a2n ] [ v2 ] = [ a21 v1 + a22 v2 + ... + a2n vn ]
[ ...             ] [ .. ]   [ ...                            ]
[ am1 am2 ... amn ] [ vn ]   [ am1 v1 + am2 v2 + ... + amn vn ]



Matrix-Vector Multiplication

A V = R:

[ a11 a12 ... a1n ] [ v1 ]   [ a11 v1 + a12 v2 + ... + a1n vn ]
[ a21 a22 ... a2n ] [ v2 ] = [ a21 v1 + a22 v2 + ... + a2n vn ]
[ ...             ] [ .. ]   [ ...                            ]
[ am1 am2 ... amn ] [ vn ]   [ am1 v1 + am2 v2 + ... + amn vn ]

R = A[:, 0] * V[0]
for j in range(1, n):
    R += A[:, j] * V[j]

Listing 2: Column-wise computation of the matrix-vector product R = A V, accumulating a weighted sum of the columns of A.



Matrix-Vector Multiplication

Denote the n columns of A by a_1, a_2, ..., a_n. Each a_i is a (column) vector. Then

A V = [ a_1  a_2  ...  a_n ] (v1, v2, ..., vn)^T = a_1 v1 + a_2 v2 + ... + a_n vn = R
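As a sanity check, the three views of the product (row-wise loop, column-wise loop, and the library routine) can be compared directly. A minimal sketch using NumPy on a small random example; the variable names are ours, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.normal(size=(m, n))
V = rng.normal(size=n)

# Row-wise (element-by-element) computation, as in Listing 1.
R_row = np.zeros(m)
for i in range(m):
    for j in range(n):
        R_row[i] += A[i, j] * V[j]

# Column-wise computation, as in Listing 2: a weighted sum of the columns of A.
R_col = A[:, 0] * V[0]
for j in range(1, n):
    R_col += A[:, j] * V[j]

# Built-in matrix-vector product.
R_lib = A @ V

assert np.allclose(R_row, R_col) and np.allclose(R_col, R_lib)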


Transpose of R

Given R = A V,

R = a_1 v1 + a_2 v2 + ... + a_n vn

What is R^T?

R^T = v1 a_1^T + v2 a_2^T + ... + vn a_n^T

NOT equal to A^T V^T! (In fact, A^T V^T is not even defined.)



Transpose

Reverse order rule: (A V)^T = V^T A^T

V^T A^T = R^T:

                   [ a_1^T ]
[ v1 v2 ... vn ]   [ a_2^T ]  =  v1 a_1^T + v2 a_2^T + ... + vn a_n^T
                   [  ...  ]
                   [ a_n^T ]
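A quick numerical confirmation of the reverse-order rule. A minimal sketch: to show both sides of the comparison, V is taken here as a (small, randomly chosen) matrix rather than a vector, so that A^T V^T has genuinely incompatible shapes.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # m x n
V = rng.normal(size=(3, 2))

# (A V)^T equals V^T A^T ...
assert np.allclose((A @ V).T, V.T @ A.T)

# ... whereas A^T V^T has incompatible shapes and is not defined.
try:
    A.T @ V.T
except ValueError:
    print("A^T V^T is not defined for these shapes, as expected.")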



Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

tr{A} = Σ_{k=1}^{n} A_kk

Example:

A = [ 1 2 3 ]
    [ 4 5 6 ]
    [ 7 8 9 ]

tr{A} = 1 + 5 + 9 = 15

The trace is not defined for a non-square matrix.


vec(X) operator

Define vec(X) as the vector which results from stacking all columns of a matrix X on top of each other.
Example:

A = [ 1 2 3 ]
    [ 4 5 6 ]
    [ 7 8 9 ]

vec(A) = (1, 4, 7, 2, 5, 8, 3, 6, 9)^T

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

tr{X^T Y} = vec(X)^T vec(Y)
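The identity tr{X^T Y} = vec(X)^T vec(Y) is easy to verify numerically. A minimal sketch; note that column-stacking corresponds to Fortran ("F") order in NumPy's flatten, an implementation detail not mentioned on the slide.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))

def vec(M):
    # Stack the columns of M on top of each other (column-major order).
    return M.flatten(order="F")

lhs = np.trace(X.T @ Y)
rhs = vec(X) @ vec(Y)
assert np.isclose(lhs, rhs)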


Inner Product

Mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

〈·, ·〉 : V × V → F

Conjugate symmetry: 〈x, y〉 is the complex conjugate of 〈y, x〉 (for F = R this is simply 〈x, y〉 = 〈y, x〉).
Linearity: 〈ax, y〉 = a〈x, y〉, and 〈x + y, z〉 = 〈x, z〉 + 〈y, z〉.
Positive-definiteness: 〈x, x〉 ≥ 0, and 〈x, x〉 = 0 only for x = 0.


Canonical Inner Products

Inner product for real numbers x, y ∈ R:

〈x, y〉 = x y

Dot product between two vectors x, y ∈ R^n:

〈x, y〉 = x^T y ≡ Σ_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + ... + x_n y_n

Canonical inner product for matrices X, Y ∈ R^{n×p}:

〈X, Y〉 = tr{X^T Y} = Σ_{k=1}^{p} (X^T Y)_kk = Σ_{k=1}^{p} Σ_{l=1}^{n} (X^T)_kl (Y)_lk = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Y_lk



Calculations with Matrices

Denote the (i, j)th element of matrix X by X_ij.

Transpose: (X^T)_ij = X_ji
Product: (XY)_ij = Σ_k X_ik Y_kj (sum over the inner dimension)

Proof of the linearity of the canonical matrix inner product:

〈X + Y, Z〉 = tr{(X + Y)^T Z}
           = Σ_{k=1}^{p} ((X + Y)^T Z)_kk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} ((X + Y)^T)_kl Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (X_lk + Y_lk) Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Z_lk + Y_lk Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (X^T)_kl Z_lk + (Y^T)_kl Z_lk
           = Σ_{k=1}^{p} (X^T Z)_kk + (Y^T Z)_kk
           = tr{X^T Z} + tr{Y^T Z}
           = 〈X, Z〉 + 〈Y, Z〉


Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

P^2 = P

[Figure: the point (a, b) is mapped to (a − b, 0) on the x-axis.]

Example:

A [ a ]   [ a − b ]            [ 1  −1 ]
  [ b ] = [   0   ]   with A = [ 0   0 ]
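A two-line check that the example matrix is indeed idempotent, and that it is not an orthogonal projection since it is not symmetric. A minimal sketch in NumPy.

import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  0.0]])

# Idempotent: applying the projection twice changes nothing.
assert np.allclose(A @ A, A)

# Not symmetric, so this is an oblique (non-orthogonal) projection.
print(np.allclose(A, A.T))   # False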




Orthogonal Projection

[Figure: two vectors x and y, their projections Px and Py, and the residual y − Py.]

Orthogonality: need an inner product 〈x, y〉.
Choose two arbitrary vectors x and y. Then Px and y − Py are orthogonal:

0 = 〈Px, y − Py〉 = (Px)^T (y − Py) = x^T (P^T − P^T P) y


Orthogonal Projection

An orthogonal projection is a projection that additionally satisfies P = P^T:

P^2 = P,   P = P^T

Example: given some unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

P = u u^T

Proof: P = P^T, and

P^2 x = (u u^T)(u u^T) x = u (u^T u) u^T x = u u^T x = P x

(using u^T u = 1 since u is a unit vector).
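A small numerical illustration of projecting onto a line through the origin; a minimal sketch, with the direction vector and test point chosen by us.

import numpy as np

u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)          # unit vector along the line

P = np.outer(u, u)                 # P = u u^T

x = np.array([2.0, 1.0])
Px = P @ x

assert np.allclose(P @ P, P)            # idempotent
assert np.allclose(P, P.T)              # symmetric, hence an orthogonal projection
assert np.isclose((x - Px) @ Px, 0.0)   # residual is orthogonal to the projection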


Orthogonal Projection

Given a matrix A ∈ R^{n×p} and a vector x. What is the closest point x̃ to x in the column space of A?

Orthogonal projection onto the column space of A:

x̃ = A (A^T A)^{-1} A^T x

Projection matrix (assuming A has linearly independent columns, so that A^T A is invertible):

P = A (A^T A)^{-1} A^T

Proof:

P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P

Orthogonal projection?

P^T = (A (A^T A)^{-1} A^T)^T = A (A^T A)^{-1} A^T = P
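The projection matrix connects back to least squares: P x is exactly the fitted value A w*, where w* is the least-squares solution of A w ≈ x. A minimal sketch, assuming a small random A whose columns are (almost surely) linearly independent.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2))        # tall matrix with independent columns
x = rng.normal(size=6)

P = A @ np.linalg.inv(A.T @ A) @ A.T
x_tilde = P @ x

assert np.allclose(P @ P, P)       # idempotent
assert np.allclose(P, P.T)         # symmetric

# The projection equals A w*, where w* solves the least-squares problem A w = x.
w_star, *_ = np.linalg.lstsq(A, x, rcond=None)
assert np.allclose(x_tilde, A @ w_star)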



Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, k vectors v_1, ..., v_k ∈ V and scalars a_1, ..., a_k. The vectors are linearly independent if and only if

a_1 v_1 + a_2 v_2 + ... + a_k v_k = 0 (the zero vector)

has only the trivial solution

a_1 = a_2 = ... = a_k = 0


Rank

The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal; their common value is called the rank of A.
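A minimal NumPy sketch: the third column of the (made-up) matrix below is the sum of the first two, so the rank is 2 rather than 3, and the rank of the transpose is the same.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])   # third column = first column + second column

print(np.linalg.matrix_rank(A))    # 2
print(np.linalg.matrix_rank(A.T))  # 2, row rank equals column rank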


Determinant

Let S_n be the set of all permutations of the numbers {1, ..., n}. The determinant of the square matrix A is then given by

det{A} = Σ_{σ ∈ S_n} sgn(σ) Π_{i=1}^{n} A_{i,σ(i)}

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

det{AB} = det{A} det{B} for A, B square matrices.
det{A^{-1}} = det{A}^{-1}.
det{A^T} = det{A}.
det{c A} = c^n det{A} for A ∈ R^{n×n} and c ∈ R.


Matrix Inverse

Identity matrix I:

I = A A^{-1} = A^{-1} A

The matrix inverse A^{-1} exists only for square matrices which are NOT singular.

Singular matrix: at least one eigenvalue is zero; equivalently, the determinant |A| = 0.

Inversion 'reverses the order' (like transposition):

(AB)^{-1} = B^{-1} A^{-1}

(Only then is (AB)(AB)^{-1} = (AB) B^{-1} A^{-1} = A A^{-1} = I.)


Matrix Inverse and Transpose

What is (A^{-1})^T?

We need to assume that A^{-1} exists.
Rule: (A^{-1})^T = (A^T)^{-1}



Useful Identity

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

How to analyse and prove such an equation?
A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
Same for C^{-1}, so let's assume C ∈ R^{m×m}.
Therefore, B ∈ R^{m×n}.



Prove the Identity

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

Multiply both sides from the right by (B A B^T + C):

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T

Simplify the left-hand side:

(A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B (A B^T) + B^T]
  = (A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B + A^{-1}] A B^T
  = A B^T


Woodbury Identity

(A + B D^{-1} C)^{-1} = A^{-1} − A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns) and C is very wide (few rows, many columns).
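A numerical spot check of the Woodbury identity on a small random example. A minimal sketch: the dimensions and matrices are our own choices, and in practice the point is to avoid forming the large inverse on the left at all when A is diagonal and B, C are thin.

import numpy as np

rng = np.random.default_rng(4)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal, easy to invert
B = rng.normal(size=(n, k))                  # tall
C = rng.normal(size=(k, n))                  # wide
D = np.eye(k) + 0.1 * rng.normal(size=(k, k))

lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)

A_inv = np.diag(1.0 / np.diag(A))
rhs = A_inv - A_inv @ B @ np.linalg.inv(D + C @ A_inv @ B) @ C @ A_inv

assert np.allclose(lhs, rhs)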


Manipulating Matrix Equations

Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by nonsquare matrices.
Don't assume a matrix inverse exists just because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a·0 = b·0.


Eigenvectors

Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying

A x = λ x

where λ ∈ C and, in general, x ∈ C^n.

Example:

[  0  1 ] x = λ x
[ −1  0 ]

λ ∈ {−i, i}, with corresponding eigenvectors x = (i, 1)^T and x = (−i, 1)^T.
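The matrix from the example can be fed straight to NumPy's eigensolver; a minimal sketch. NumPy returns the eigenvectors as the columns of the second output, normalised to unit length, so they are scalar multiples of the vectors on the slide.

import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)            # approximately [i, -i] (order may vary)

# Check A x = lambda x for each eigenpair (eigenvectors are the columns of eigvecs).
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)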


Eigenvectors

How many eigenvalue/eigenvector pairs?

A x = λ x

is equivalent to

(A − λI) x = 0

This has a non-trivial solution only for det{A − λI} = 0, a polynomial of nth order in λ; hence there are at most n distinct eigenvalues.


Real Eigenvalues

How can we enforce real eigenvalues?
Let's look at matrices with complex entries, A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

[ 1 + 2i  3 + 4i ]^H     [ 1 − 2i  5 − 6i ]
[ 5 + 6i  7 + 8i ]     = [ 3 − 4i  7 − 8i ]

Denote the complex conjugate of a complex number λ by λ̄.


Real Eigenvalues

How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A).
Calculate

x^H A x = λ x^H x

for an eigenvector x ∈ C^n of A.
Another possibility to calculate x^H A x:

x^H A x = x^H A^H x      (A is Hermitian)
        = (x^H A x)^H    (reverse order)
        = (λ x^H x)^H    (eigenvalue)
        = λ̄ x^H x

and therefore λ = λ̄, i.e. λ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.


Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

A = U Σ V^T

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
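NumPy's SVD returns the singular values as a vector rather than the full n×p matrix Σ, so the sketch below rebuilds Σ explicitly before checking the decomposition; the random matrix and its shape are our own choices.

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4x4, Vt is 3x3, s holds the singular values

Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(U @ Sigma @ Vt, A)
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(s >= 0)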



Inverse of a matrix

Assume a full rank symmetric real matrix A. Then A = U^T Λ U, where
Λ is a diagonal matrix containing the (real) eigenvalues, and
U contains the eigenvectors.

A^{-1} = (U^T Λ U)^{-1}
       = U^{-1} Λ^{-1} U^{-T}     (inverse reverses the order)
       = U^T Λ^{-1} U             (U^T U = I)

The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its elements.
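A numerical check with NumPy's symmetric eigensolver; a minimal sketch on a random positive definite matrix. Note that np.linalg.eigh returns the eigenvectors as the columns of Q, so A = Q Λ Q^T, which matches the slide's A = U^T Λ U with U = Q^T.

import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)        # symmetric and full rank (positive definite)

eigvals, Q = np.linalg.eigh(A)     # A = Q diag(eigvals) Q^T

A_inv = Q @ np.diag(1.0 / eigvals) @ Q.T

assert np.allclose(A_inv, np.linalg.inv(A))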


How to calculate a gradient?

Recall the gradient of a univariate function f : R → R.
Example:

f(x) = x^4 + 7x^3 + 5x^2 − 17x + 3

whose gradient is given by

df/dx = 4x^3 + 21x^2 + 10x − 17

[Figure: plot of the objective f(x) = x^4 + 7x^3 + 5x^2 − 17x + 3 against the parameter value.]

Gradients and optima

We find the stationary points by calculating the derivative, setting it to zero and solving the resulting equation.
Example:

df/dx = 4x^3 + 21x^2 + 10x − 17 = 0

has three stationary points (the derivative is a cubic polynomial, and here all three roots are real).
We calculate the second derivative d²f/dx² to check whether a stationary point is a minimum (d²f/dx² > 0) or a maximum (d²f/dx² < 0).
The example has two minima and one maximum.


Gradient descent

We may not be able to solve the problem analytically and have to resort to gradient descent.
Idea: take a step in the negative gradient direction.
There are many different approaches to gradient descent.

[Figure: a one-dimensional objective f(x) used to illustrate gradient descent.]
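A minimal gradient descent sketch on the quartic from the previous slides; the starting point and fixed step size are our own choices and are not taken from the course.

import numpy as np

def f(x):
    return x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3

def df(x):
    return 4 * x**3 + 21 * x**2 + 10 * x - 17

x = 2.0            # starting point (assumption)
step = 0.01        # fixed step size (assumption)

for _ in range(200):
    x = x - step * df(x)   # step in the negative gradient direction

print(x, f(x))     # converges to a stationary point
print(df(x))       # gradient close to 0 at convergence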


Rules of differentiation

We denote the derivative of f as f'.

Sum rule: (f(x) + g(x))' = f'(x) + g'(x)

Product rule: (f(x) g(x))' = f'(x) g(x) + f(x) g'(x)

Quotient rule: (f(x) / g(x))' = (f'(x) g(x) − f(x) g'(x)) / (g(x))^2

Chain rule: (g(f(x)))' = (g ∘ f)'(x) = g'(f(x)) f'(x)


How to calculate a gradient?

Given a multivariate function f : R^D → R.
Compute the partial derivative with respect to each element, and collect the results in the appropriate shape.
Use knowledge of scalar differentiation.
We recommend the convention that gradients are row vectors.
Example:

f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3

has partial derivatives

∂f(x_1, x_2)/∂x_1 = 2 x_1 x_2 + x_2^3
∂f(x_1, x_2)/∂x_2 = x_1^2 + 3 x_1 x_2^2

and therefore the gradient is (a row vector in R^{1×2})

∇_x f(x) = df/dx = [ ∂f(x_1, x_2)/∂x_1   ∂f(x_1, x_2)/∂x_2 ] = [ 2 x_1 x_2 + x_2^3   x_1^2 + 3 x_1 x_2^2 ]
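Partial derivatives computed by hand can be checked against finite differences; a minimal sketch for the example above, with an arbitrary test point chosen by us.

import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 * x2 + x1 * x2**3

def grad_f(x):
    x1, x2 = x
    # Gradient as a row vector, following the convention on the slide.
    return np.array([2 * x1 * x2 + x2**3, x1**2 + 3 * x1 * x2**2])

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central differences
    return g

x = np.array([1.3, -0.7])
assert np.allclose(grad_f(x), numerical_grad(f, x), atol=1e-4)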


Gradient with trace

We can combine matrix multiplication rules with partial differentiation.
Example: ∇_A tr{AB} = B^T

Recall (writing the rows of A as a_1, ..., a_n and the columns of B as b_1, ..., b_n, with A ∈ R^{n×m} and B ∈ R^{m×n}):

tr{AB} = Σ_{i=1}^{m} a_{1i} b_{i1} + Σ_{i=1}^{m} a_{2i} b_{i2} + ... + Σ_{i=1}^{m} a_{ni} b_{in}

Therefore:

∂ tr{AB} / ∂ a_ij = b_ji
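The claim ∇_A tr{AB} = B^T can be verified entry by entry with finite differences; a minimal sketch on random matrices of our own choosing.

import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 4
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, n))

def objective(A):
    return np.trace(A @ B)

# Finite-difference gradient with respect to each entry A[i, j].
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (objective(A + E) - objective(A - E)) / (2 * eps)

assert np.allclose(num_grad, B.T, atol=1e-5)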


Books

Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, Steven R. Lay, Judi J. McDonald, "Linear Algebra and Its Applications", Pearson, 2015.
Jonathan S. Golan, "The Linear Algebra a Beginning Graduate Student Ought to Know", Springer, 2012.
Kaare B. Petersen, Michael S. Pedersen, "The Matrix Cookbook", 2012.