Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University

Canberra, February – June 2018

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")

© 2018 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University


Part III

Linear Algebra


Linear Curve Fitting - Input Specification

N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N

[Figure: scatter plot of the N = 10 data points, with input x on the horizontal axis and target t on the vertical axis.]


Strategy in this course

Estimate best predictor = training = learning.
Given data (x_1, y_1), ..., (x_n, y_n), find a predictor f_w(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for f_w
3. Design an objective function or likelihood
4. Calculate the optimal parameter (w)
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results



Linear Curve Fitting - Least Squares

N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ R, i = 1, ..., N
t_i ∈ R, i = 1, ..., N

y(x, w) = w_1 x + w_0
X ≡ [x 1]
w* = (X^T X)^{-1} X^T t

[Figure: the N = 10 data points together with the fitted line y(x, w*).]
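The closed-form solution above maps directly to a few lines of NumPy. This is a minimal sketch, not the course code: the data values below are made up for illustration, and np.linalg.lstsq is used alongside the normal equations because it is the numerically safer route.

import numpy as np

# Hypothetical data standing in for the N = 10 points on the slide.
x = np.array([0.5, 1.0, 2.2, 3.1, 4.0, 5.3, 6.1, 7.4, 8.2, 9.0])
t = np.array([0.8, 1.1, 2.0, 2.7, 3.2, 4.1, 4.8, 5.9, 6.5, 7.1])

# Design matrix X = [x 1]: one column for the slope w1, one for the offset w0.
X = np.column_stack([x, np.ones_like(x)])

# Normal-equation solution w* = (X^T X)^{-1} X^T t ...
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# ... which agrees with the least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

assert np.allclose(w_normal, w_lstsq)
print(w_normal)   # [w1, w0]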


Intuition

Arithmetic: zero, positive/negative numbers; addition and multiplication.
Geometry: points and lines; vector addition and scaling. Humans have experience with 3 dimensions (less so with 1 or 2).
Generalisation to N dimensions (possibly N → ∞): a line becomes a vector space V, a point becomes a vector x ∈ V.
Example: X ∈ R^{n×m}. The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic.


Vector Space V, underlying Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following holds:

Associativity of addition: u + (v + w) = (u + v) + w.
Commutativity of addition: v + w = w + v.
Identity element of addition: there exists 0 ∈ V such that v + 0 = v for all v ∈ V.
Inverse of addition: for all v ∈ V there exists w ∈ V such that v + w = 0; the additive inverse is denoted −v.
Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.


Matrix-Vector Multiplication

for i in range(m):
    R[i] = 0.0
    for j in range(n):
        R[i] = R[i] + A[i, j] * V[j]

Listing 1: Row-by-row (element-by-element) computation of the matrix-vector product R = A V.

A V = R:

[ a11 a12 ... a1n ] [ v1 ]   [ a11 v1 + a12 v2 + ... + a1n vn ]
[ a21 a22 ... a2n ] [ v2 ] = [ a21 v1 + a22 v2 + ... + a2n vn ]
[ ...             ] [ .. ]   [ ...                            ]
[ am1 am2 ... amn ] [ vn ]   [ am1 v1 + am2 v2 + ... + amn vn ]



Matrix-Vector Multiplication

A V = R:

[ a11 a12 ... a1n ] [ v1 ]   [ a11 v1 + a12 v2 + ... + a1n vn ]
[ a21 a22 ... a2n ] [ v2 ] = [ a21 v1 + a22 v2 + ... + a2n vn ]
[ ...             ] [ .. ]   [ ...                            ]
[ am1 am2 ... amn ] [ vn ]   [ am1 v1 + am2 v2 + ... + amn vn ]

R = A[:, 0] * V[0]
for j in range(1, n):
    R += A[:, j] * V[j]

Listing 2: Column-wise computation of the matrix-vector product R = A V, accumulating a weighted sum of the columns of A.



Matrix-Vector Multiplication

Denote the n columns of A by a_1, a_2, ..., a_n. Each a_i is a (column) vector. Then

A V = [ a_1  a_2  ...  a_n ] (v1, v2, ..., vn)^T = a_1 v1 + a_2 v2 + ... + a_n vn = R
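As a sanity check, the three views of the product (row-wise loop, column-wise loop, and the library routine) can be compared directly. A minimal sketch using NumPy on a small random example; the variable names are ours, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.normal(size=(m, n))
V = rng.normal(size=n)

# Row-wise (element-by-element) computation, as in Listing 1.
R_row = np.zeros(m)
for i in range(m):
    for j in range(n):
        R_row[i] += A[i, j] * V[j]

# Column-wise computation, as in Listing 2: a weighted sum of the columns of A.
R_col = A[:, 0] * V[0]
for j in range(1, n):
    R_col += A[:, j] * V[j]

# Built-in matrix-vector product.
R_lib = A @ V

assert np.allclose(R_row, R_col) and np.allclose(R_col, R_lib)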


Transpose of R

Given R = A V,

R = a_1 v1 + a_2 v2 + ... + a_n vn

What is R^T?

R^T = v1 a_1^T + v2 a_2^T + ... + vn a_n^T

NOT equal to A^T V^T! (In fact, A^T V^T is not even defined.)



Transpose

Reverse order rule: (A V)^T = V^T A^T

V^T A^T = R^T:

                   [ a_1^T ]
[ v1 v2 ... vn ]   [ a_2^T ]  =  v1 a_1^T + v2 a_2^T + ... + vn a_n^T
                   [  ...  ]
                   [ a_n^T ]
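A quick numerical confirmation of the reverse-order rule. A minimal sketch: to show both sides of the comparison, V is taken here as a (small, randomly chosen) matrix rather than a vector, so that A^T V^T has genuinely incompatible shapes.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))   # m x n
V = rng.normal(size=(3, 2))

# (A V)^T equals V^T A^T ...
assert np.allclose((A @ V).T, V.T @ A.T)

# ... whereas A^T V^T has incompatible shapes and is not defined.
try:
    A.T @ V.T
except ValueError:
    print("A^T V^T is not defined for these shapes, as expected.")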



Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

tr{A} = Σ_{k=1}^{n} A_kk

Example:

A = [ 1 2 3 ]
    [ 4 5 6 ]
    [ 7 8 9 ]

tr{A} = 1 + 5 + 9 = 15

The trace is not defined for a non-square matrix.


vec(X) operator

Define vec(X) as the vector which results from stacking all columns of a matrix X on top of each other.
Example:

A = [ 1 2 3 ]
    [ 4 5 6 ]
    [ 7 8 9 ]

vec(A) = (1, 4, 7, 2, 5, 8, 3, 6, 9)^T

Given two matrices X, Y ∈ R^{n×p}, the trace of the product X^T Y can be written as

tr{X^T Y} = vec(X)^T vec(Y)
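The identity tr{X^T Y} = vec(X)^T vec(Y) is easy to verify numerically. A minimal sketch; note that column-stacking corresponds to Fortran ("F") order in NumPy's flatten, an implementation detail not mentioned on the slide.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))

def vec(M):
    # Stack the columns of M on top of each other (column-major order).
    return M.flatten(order="F")

lhs = np.trace(X.T @ Y)
rhs = vec(X) @ vec(Y)
assert np.isclose(lhs, rhs)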


Inner Product

Mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

〈·, ·〉 : V × V → F

Conjugate symmetry: 〈x, y〉 is the complex conjugate of 〈y, x〉 (for F = R this is simply 〈x, y〉 = 〈y, x〉).
Linearity: 〈ax, y〉 = a〈x, y〉, and 〈x + y, z〉 = 〈x, z〉 + 〈y, z〉.
Positive-definiteness: 〈x, x〉 ≥ 0, and 〈x, x〉 = 0 only for x = 0.


Canonical Inner Products

Inner product for real numbers x, y ∈ R:

〈x, y〉 = x y

Dot product between two vectors x, y ∈ R^n:

〈x, y〉 = x^T y ≡ Σ_{k=1}^{n} x_k y_k = x_1 y_1 + x_2 y_2 + ... + x_n y_n

Canonical inner product for matrices X, Y ∈ R^{n×p}:

〈X, Y〉 = tr{X^T Y} = Σ_{k=1}^{p} (X^T Y)_kk = Σ_{k=1}^{p} Σ_{l=1}^{n} (X^T)_kl (Y)_lk = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Y_lk



Calculations with Matrices

Denote the (i, j)th element of matrix X by X_ij.

Transpose: (X^T)_ij = X_ji
Product: (XY)_ij = Σ_k X_ik Y_kj (sum over the inner dimension)

Proof of the linearity of the canonical matrix inner product:

〈X + Y, Z〉 = tr{(X + Y)^T Z}
           = Σ_{k=1}^{p} ((X + Y)^T Z)_kk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} ((X + Y)^T)_kl Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (X_lk + Y_lk) Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Z_lk + Y_lk Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (X^T)_kl Z_lk + (Y^T)_kl Z_lk
           = Σ_{k=1}^{p} (X^T Z)_kk + (Y^T Z)_kk
           = tr{X^T Z} + tr{Y^T Z}
           = 〈X, Z〉 + 〈Y, Z〉


Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

P^2 = P

[Figure: the point (a, b) is mapped to (a − b, 0) on the x-axis.]

Example:

A [ a ]   [ a − b ]            [ 1  −1 ]
  [ b ] = [   0   ]   with A = [ 0   0 ]
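A two-line check that the example matrix is indeed idempotent, and that it is not an orthogonal projection since it is not symmetric. A minimal sketch in NumPy.

import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  0.0]])

# Idempotent: applying the projection twice changes nothing.
assert np.allclose(A @ A, A)

# Not symmetric, so this is an oblique (non-orthogonal) projection.
print(np.allclose(A, A.T))   # False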




Orthogonal Projection

[Figure: two vectors x and y, their projections Px and Py, and the residual y − Py.]

Orthogonality: need an inner product 〈x, y〉.
Choose two arbitrary vectors x and y. Then Px and y − Py are orthogonal:

0 = 〈Px, y − Py〉 = (Px)^T (y − Py) = x^T (P^T − P^T P) y


Orthogonal Projection

An orthogonal projection is a projection that additionally satisfies P = P^T:

P^2 = P,   P = P^T

Example: given some unit vector u ∈ R^n characterising a line through the origin in R^n, project an arbitrary vector x ∈ R^n onto this line by

P = u u^T

Proof: P = P^T, and

P^2 x = (u u^T)(u u^T) x = u (u^T u) u^T x = u u^T x = P x

(using u^T u = 1 since u is a unit vector).
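A small numerical illustration of projecting onto a line through the origin; a minimal sketch, with the direction vector and test point chosen by us.

import numpy as np

u = np.array([3.0, 4.0])
u = u / np.linalg.norm(u)          # unit vector along the line

P = np.outer(u, u)                 # P = u u^T

x = np.array([2.0, 1.0])
Px = P @ x

assert np.allclose(P @ P, P)            # idempotent
assert np.allclose(P, P.T)              # symmetric, hence an orthogonal projection
assert np.isclose((x - Px) @ Px, 0.0)   # residual is orthogonal to the projection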


Orthogonal Projection

Given a matrix A ∈ R^{n×p} and a vector x. What is the closest point x̃ to x in the column space of A?

Orthogonal projection onto the column space of A:

x̃ = A (A^T A)^{-1} A^T x

Projection matrix (assuming A has linearly independent columns, so that A^T A is invertible):

P = A (A^T A)^{-1} A^T

Proof:

P^2 = A (A^T A)^{-1} A^T A (A^T A)^{-1} A^T = A (A^T A)^{-1} A^T = P

Orthogonal projection?

P^T = (A (A^T A)^{-1} A^T)^T = A (A^T A)^{-1} A^T = P
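The projection matrix connects back to least squares: P x is exactly the fitted value A w*, where w* is the least-squares solution of A w ≈ x. A minimal sketch, assuming a small random A whose columns are (almost surely) linearly independent.

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 2))        # tall matrix with independent columns
x = rng.normal(size=6)

P = A @ np.linalg.inv(A.T @ A) @ A.T
x_tilde = P @ x

assert np.allclose(P @ P, P)       # idempotent
assert np.allclose(P, P.T)         # symmetric

# The projection equals A w*, where w* solves the least-squares problem A w = x.
w_star, *_ = np.linalg.lstsq(A, x, rcond=None)
assert np.allclose(x_tilde, A @ w_star)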



Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, k vectors v_1, ..., v_k ∈ V and scalars a_1, ..., a_k. The vectors are linearly independent if and only if

a_1 v_1 + a_2 v_2 + ... + a_k v_k = 0 (the zero vector)

has only the trivial solution

a_1 = a_2 = ... = a_k = 0


Rank

The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal; their common value is called the rank of A.
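A minimal NumPy sketch: the third column of the (made-up) matrix below is the sum of the first two, so the rank is 2 rather than 3, and the rank of the transpose is the same.

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])   # third column = first column + second column

print(np.linalg.matrix_rank(A))    # 2
print(np.linalg.matrix_rank(A.T))  # 2, row rank equals column rank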


Determinant

Let S_n be the set of all permutations of the numbers {1, ..., n}. The determinant of the square matrix A is then given by

det{A} = Σ_{σ ∈ S_n} sgn(σ) Π_{i=1}^{n} A_{i,σ(i)}

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

det{AB} = det{A} det{B} for A, B square matrices.
det{A^{-1}} = det{A}^{-1}.
det{A^T} = det{A}.
det{c A} = c^n det{A} for A ∈ R^{n×n} and c ∈ R.


Matrix Inverse

Identity matrix I:

I = A A^{-1} = A^{-1} A

The matrix inverse A^{-1} exists only for square matrices which are NOT singular.

Singular matrix: at least one eigenvalue is zero; equivalently, the determinant |A| = 0.

Inversion 'reverses the order' (like transposition):

(AB)^{-1} = B^{-1} A^{-1}

(Only then is (AB)(AB)^{-1} = (AB) B^{-1} A^{-1} = A A^{-1} = I.)


Matrix Inverse and Transpose

What is (A^{-1})^T?

We need to assume that A^{-1} exists.
Rule: (A^{-1})^T = (A^T)^{-1}



Useful Identity

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

How to analyse and prove such an equation?
A^{-1} must exist, therefore A must be square, say A ∈ R^{n×n}.
Same for C^{-1}, so let's assume C ∈ R^{m×m}.
Therefore, B ∈ R^{m×n}.



Prove the Identity

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}

Multiply both sides from the right by (B A B^T + C):

(A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} (B A B^T + C) = A B^T

Simplify the left-hand side:

(A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B (A B^T) + B^T]
  = (A^{-1} + B^T C^{-1} B)^{-1} [B^T C^{-1} B + A^{-1}] A B^T
  = A B^T


Woodbury Identity

(A + B D^{-1} C)^{-1} = A^{-1} − A^{-1} B (D + C A^{-1} B)^{-1} C A^{-1}

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns) and C is very wide (few rows, many columns).
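A numerical spot check of the Woodbury identity on a small random example. A minimal sketch: the dimensions and matrices are our own choices, and in practice the point is to avoid forming the large inverse on the left at all when A is diagonal and B, C are thin.

import numpy as np

rng = np.random.default_rng(4)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal, easy to invert
B = rng.normal(size=(n, k))                  # tall
C = rng.normal(size=(k, n))                  # wide
D = np.eye(k) + 0.1 * rng.normal(size=(k, k))

lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)

A_inv = np.diag(1.0 / np.diag(A))
rhs = A_inv - A_inv @ B @ np.linalg.inv(D + C @ A_inv @ B) @ C @ A_inv

assert np.allclose(lhs, rhs)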


Manipulating Matrix Equations

Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
Don't multiply by nonsquare matrices.
Don't assume a matrix inverse exists just because a matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a·0 = b·0.


Eigenvectors

Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying

A x = λ x

where λ ∈ C and, in general, x ∈ C^n.

Example:

[  0  1 ] x = λ x
[ −1  0 ]

λ ∈ {−i, i}, with corresponding eigenvectors x = (i, 1)^T and x = (−i, 1)^T.
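The matrix from the example can be fed straight to NumPy's eigensolver; a minimal sketch. NumPy returns the eigenvectors as the columns of the second output, normalised to unit length, so they are scalar multiples of the vectors on the slide.

import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)            # approximately [i, -i] (order may vary)

# Check A x = lambda x for each eigenpair (eigenvectors are the columns of eigvecs).
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)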


Eigenvectors

How many eigenvalue/eigenvector pairs?

A x = λ x

is equivalent to

(A − λI) x = 0

This has a non-trivial solution only for det{A − λI} = 0, a polynomial of nth order in λ; hence there are at most n distinct eigenvalues.


Real Eigenvalues

How can we enforce real eigenvalues?
Let's look at matrices with complex entries, A ∈ C^{n×n}.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

[ 1 + 2i  3 + 4i ]^H     [ 1 − 2i  5 − 6i ]
[ 5 + 6i  7 + 8i ]     = [ 3 − 4i  7 − 8i ]

Denote the complex conjugate of a complex number λ by λ̄.


Real Eigenvalues

How can we enforce real eigenvalues?
Let's assume A ∈ C^{n×n} is Hermitian (A^H = A).
Calculate

x^H A x = λ x^H x

for an eigenvector x ∈ C^n of A.
Another possibility to calculate x^H A x:

x^H A x = x^H A^H x      (A is Hermitian)
        = (x^H A x)^H    (reverse order)
        = (λ x^H x)^H    (eigenvalue)
        = λ̄ x^H x

and therefore λ = λ̄, i.e. λ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.


Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

A = U Σ V^T

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
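NumPy's SVD returns the singular values as a vector rather than the full n×p matrix Σ, so the sketch below rebuilds Σ explicitly before checking the decomposition; the random matrix and its shape are our own choices.

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 3))

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 4x4, Vt is 3x3, s holds the singular values

Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)

assert np.allclose(U @ Sigma @ Vt, A)
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(s >= 0)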



Inverse of a matrix

Assume a full rank symmetric real matrix A. Then A = U^T Λ U, where
Λ is a diagonal matrix containing the (real) eigenvalues, and
U contains the eigenvectors.

A^{-1} = (U^T Λ U)^{-1}
       = U^{-1} Λ^{-1} U^{-T}     (inverse reverses the order)
       = U^T Λ^{-1} U             (U^T U = I)

The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its elements.
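A numerical check with NumPy's symmetric eigensolver; a minimal sketch on a random positive definite matrix. Note that np.linalg.eigh returns the eigenvectors as the columns of Q, so A = Q Λ Q^T, which matches the slide's A = U^T Λ U with U = Q^T.

import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)        # symmetric and full rank (positive definite)

eigvals, Q = np.linalg.eigh(A)     # A = Q diag(eigvals) Q^T

A_inv = Q @ np.diag(1.0 / eigvals) @ Q.T

assert np.allclose(A_inv, np.linalg.inv(A))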


How to calculate a gradient?

Recall the gradient of a univariate function f : R → R.
Example:

f(x) = x^4 + 7x^3 + 5x^2 − 17x + 3

whose gradient is given by

df/dx = 4x^3 + 21x^2 + 10x − 17

[Figure: plot of the objective f(x) = x^4 + 7x^3 + 5x^2 − 17x + 3 against the parameter value.]

Gradients and optima

We find the stationary points by calculating the derivative, setting it to zero and solving the resulting equation.
Example:

df/dx = 4x^3 + 21x^2 + 10x − 17 = 0

has three stationary points (the derivative is a cubic polynomial, and here all three roots are real).
We calculate the second derivative d²f/dx² to check whether a stationary point is a minimum (d²f/dx² > 0) or a maximum (d²f/dx² < 0).
The example has two minima and one maximum.


Gradient descent

We may not be able to solve the problem analytically and have to resort to gradient descent.
Idea: take a step in the negative gradient direction.
There are many different approaches to gradient descent.

[Figure: a one-dimensional objective f(x) used to illustrate gradient descent.]
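A minimal gradient descent sketch on the quartic from the previous slides; the starting point and fixed step size are our own choices and are not taken from the course.

import numpy as np

def f(x):
    return x**4 + 7 * x**3 + 5 * x**2 - 17 * x + 3

def df(x):
    return 4 * x**3 + 21 * x**2 + 10 * x - 17

x = 2.0            # starting point (assumption)
step = 0.01        # fixed step size (assumption)

for _ in range(200):
    x = x - step * df(x)   # step in the negative gradient direction

print(x, f(x))     # converges to a stationary point
print(df(x))       # gradient close to 0 at convergence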


Rules of differentiation

We denote the derivative of f as f'.

Sum rule: (f(x) + g(x))' = f'(x) + g'(x)

Product rule: (f(x) g(x))' = f'(x) g(x) + f(x) g'(x)

Quotient rule: (f(x) / g(x))' = (f'(x) g(x) − f(x) g'(x)) / (g(x))^2

Chain rule: (g(f(x)))' = (g ∘ f)'(x) = g'(f(x)) f'(x)


How to calculate a gradient?

Given a multivariate function f : R^D → R.
Compute the partial derivative with respect to each element, and collect the results in the appropriate shape.
Use knowledge of scalar differentiation.
We recommend the convention that gradients are row vectors.
Example:

f(x_1, x_2) = x_1^2 x_2 + x_1 x_2^3

has partial derivatives

∂f(x_1, x_2)/∂x_1 = 2 x_1 x_2 + x_2^3
∂f(x_1, x_2)/∂x_2 = x_1^2 + 3 x_1 x_2^2

and therefore the gradient is (a row vector in R^{1×2})

∇_x f(x) = df/dx = [ ∂f(x_1, x_2)/∂x_1   ∂f(x_1, x_2)/∂x_2 ] = [ 2 x_1 x_2 + x_2^3   x_1^2 + 3 x_1 x_2^2 ]
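Partial derivatives computed by hand can be checked against finite differences; a minimal sketch for the example above, with an arbitrary test point chosen by us.

import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 * x2 + x1 * x2**3

def grad_f(x):
    x1, x2 = x
    # Gradient as a row vector, following the convention on the slide.
    return np.array([2 * x1 * x2 + x2**3, x1**2 + 3 * x1 * x2**2])

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)   # central differences
    return g

x = np.array([1.3, -0.7])
assert np.allclose(grad_f(x), numerical_grad(f, x), atol=1e-4)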


Gradient with trace

We can combine matrix multiplication rules with partial differentiation.
Example: ∇_A tr{AB} = B^T

Recall (writing the rows of A as a_1, ..., a_n and the columns of B as b_1, ..., b_n, with A ∈ R^{n×m} and B ∈ R^{m×n}):

tr{AB} = Σ_{i=1}^{m} a_{1i} b_{i1} + Σ_{i=1}^{m} a_{2i} b_{i2} + ... + Σ_{i=1}^{m} a_{ni} b_{in}

Therefore:

∂ tr{AB} / ∂ a_ij = b_ji
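The claim ∇_A tr{AB} = B^T can be verified entry by entry with finite differences; a minimal sketch on random matrices of our own choosing.

import numpy as np

rng = np.random.default_rng(7)
n, m = 3, 4
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, n))

def objective(A):
    return np.trace(A @ B)

# Finite-difference gradient with respect to each entry A[i, j].
eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (objective(A + E) - objective(A - E)) / (2 * eps)

assert np.allclose(num_grad, B.T, atol=1e-5)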


Books

Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge Press, 2009.
David C. Lay, Steven R. Lay, Judi J. McDonald, "Linear Algebra and Its Applications", Pearson, 2015.
Jonathan S. Golan, "The Linear Algebra a Beginning Graduate Student Ought to Know", Springer, 2012.
Kaare B. Petersen, Michael S. Pedersen, "The Matrix Cookbook", 2012.