Introduction to Statistical Machine Learning

Cheng Soon Ong & Christian Walder

Machine Learning Research Group, Data61 | CSIRO
and
College of Engineering and Computer Science, The Australian National University

Canberra, February – June 2018

©2018 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Outline

Motivation
Basic Concepts
Linear Transformations
Trace
Inner Product
Projection
Rank, Determinant, Trace
Matrix Inverse
Eigenvectors
Singular Value Decomposition
Gradient
Books
Part III
Linear Algebra
Linear Curve Fitting - Input Specification

N = 10
x ≡ (x1, …, xN)ᵀ
t ≡ (t1, …, tN)ᵀ
xi ∈ R, i = 1, …, N
ti ∈ R, i = 1, …, N

[Plot: the N = 10 data points (x, t).]
Strategy in this course

Estimate best predictor = training = learning.
Given data (x1, y1), …, (xn, yn), find a predictor fw(·).

1. Identify the type of input x and output y data
2. Propose a (linear) mathematical model for fw
3. Design an objective function or likelihood
4. Calculate the optimal parameter w
5. Model uncertainty using the Bayesian approach
6. Implement and compute (the algorithm in Python)
7. Interpret and diagnose results
Linear Curve Fitting - Least Squares

N = 10
x ≡ (x1, …, xN)ᵀ
t ≡ (t1, …, tN)ᵀ
xi ∈ R, i = 1, …, N
ti ∈ R, i = 1, …, N

y(x, w) = w1 x + w0
X ≡ [x 1]
w* = (XᵀX)⁻¹Xᵀt

[Plot: the N = 10 data points (x, t) with the fitted line y(x, w).]
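The closed-form solution w* = (XᵀX)⁻¹Xᵀt can be sketched in a few lines of numpy. The data here is synthetic (the slide's actual 10 points are not given), so the recovered w1, w0 are only illustrative:

```python
import numpy as np

# Synthetic stand-in for the slide's N = 10 data points (assumption).
rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 10, N)
t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=N)

# Design matrix X = [x 1]; solve the normal equations (X^T X) w = X^T t
# rather than forming the inverse explicitly.
X = np.column_stack([x, np.ones(N)])
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)  # [w1, w0]
```

Solving the normal equations leaves the residual t − Xw orthogonal to the columns of X, which is the projection view of least squares developed later in this part.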
Intuition

Arithmetic
- zero, positive/negative numbers
- addition and multiplication

Geometry
- points and lines
- vector addition and scaling
- humans have experience with 3 dimensions (less so with 1 and 2)

Generalisation to N dimensions (possibly N → ∞)
- line → vector space V
- point → vector x ∈ V
- example: X ∈ R^{n×m}

The space of matrices R^{n×m} and the space of vectors R^{n·m} are isomorphic.
Vector Space V over a Field F

Given vectors u, v, w ∈ V and scalars α, β ∈ F, the following hold:

- Associativity of addition: u + (v + w) = (u + v) + w.
- Commutativity of addition: v + w = w + v.
- Identity element of addition: there exists 0 ∈ V such that v + 0 = v for all v ∈ V.
- Inverse of addition: for all v ∈ V there exists w ∈ V such that v + w = 0; the additive inverse is denoted −v.
- Distributivity of scalar multiplication with respect to vector addition: α(v + w) = αv + αw.
- Distributivity of scalar multiplication with respect to field addition: (α + β)v = αv + βv.
- Compatibility of scalar multiplication with field multiplication: α(βv) = (αβ)v.
- Identity element of scalar multiplication: 1v = v, where 1 is the identity in F.
Matrix-Vector Multiplication
� �1 for i in range (m) :
R[ i ] = 0 . 0 ;for j in range ( n ) :
R[ i ] = R[ i ] + A [ i , j ] ∗ V[ j ]� �Listing 1: Code for elementwise matrix multiplication.
A V = Ra11 a12 . . . a1n
a21 a22 . . . a2n
. . . . . . . . . . . .am1 am2 . . . amn
v1v2. . .vn
=
a11v1 + a12v2 + · · ·+ a1nvn
a21v1 + a22v2 + · · ·+ a2nvn
. . .am1v1 + am2v2 + · · ·+ amnvn
Matrix-Vector Multiplication

A V = R:

[a11 a12 … a1n] [v1]   [a11 v1 + a12 v2 + … + a1n vn]
[a21 a22 … a2n] [v2] = [a21 v1 + a22 v2 + … + a2n vn]
[ …   …  …   …] [… ]   [              …              ]
[am1 am2 … amn] [vn]   [am1 v1 + am2 v2 + … + amn vn]

R = A[:, 0] * V[0]
for j in range(1, n):
    R += A[:, j] * V[j]

Listing 3: Code for columnwise matrix-vector multiplication.
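Both loop formulations compute the same product; a runnable sketch comparing them against numpy's built-in `A @ V` (with a small example matrix chosen here for illustration):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # m x n
V = np.array([1., 0., -1.])
m, n = A.shape

# Elementwise (row by row), as in Listing 1.
R1 = np.zeros(m)
for i in range(m):
    for j in range(n):
        R1[i] += A[i, j] * V[j]

# Columnwise (linear combination of the columns), as in Listing 3.
R2 = A[:, 0] * V[0]
for j in range(1, n):
    R2 = R2 + A[:, j] * V[j]

print(R1, R2)  # both equal A @ V
```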
Matrix-Vector Multiplication

Denote the n columns of A by a1, a2, …, an. Each ai is now a (column) vector.

A V = R:

[a1 a2 … an] [v1; v2; …; vn] = a1 v1 + a2 v2 + … + an vn
Transpose of R

Given R = A V,

R = a1 v1 + a2 v2 + … + an vn

What is Rᵀ?

Rᵀ = v1 a1ᵀ + v2 a2ᵀ + … + vn anᵀ

This is NOT equal to AᵀVᵀ! (In fact, AᵀVᵀ is not even defined.)
Transpose

Reverse-order rule: (A V)ᵀ = VᵀAᵀ

VᵀAᵀ = Rᵀ:

[v1 v2 … vn] [a1ᵀ; a2ᵀ; …; anᵀ] = v1 a1ᵀ + v2 a2ᵀ + … + vn anᵀ
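The reverse-order rule is easy to check numerically (random shapes chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
V = rng.normal(size=(4, 2))

# (A V)^T = V^T A^T
print(np.allclose((A @ V).T, V.T @ A.T))  # True
```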
Trace

The trace of a square matrix A is the sum of all diagonal elements of A:

tr{A} = Σ_{k=1}^{n} A_kk

Example:

A = [1 2 3; 4 5 6; 7 8 9]   ⇒   tr{A} = 1 + 5 + 9 = 15

The trace is not defined for a non-square matrix.
vec(·) operator

Define vec(A) as the vector obtained by stacking all columns of the matrix A on top of each other.

Example:

A = [1 2 3; 4 5 6; 7 8 9]   ⇒   vec(A) = (1, 4, 7, 2, 5, 8, 3, 6, 9)ᵀ

Given two matrices X, Y ∈ R^{n×p}, the trace of the product XᵀY can be written as

tr{XᵀY} = vec(X)ᵀ vec(Y)
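A quick numerical check of both the vec(A) example and the trace identity; `flatten(order="F")` is numpy's column-major (column-stacking) flattening:

```python
import numpy as np

# The slide's example: stacking the columns of A.
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
vecA = A.flatten(order="F")  # column-major stacking
print(vecA)  # [1 4 7 2 5 8 3 6 9]

# tr{X^T Y} = vec(X)^T vec(Y) for X, Y of the same shape.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 3))
lhs = np.trace(X.T @ Y)
rhs = X.flatten(order="F") @ Y.flatten(order="F")
print(np.isclose(lhs, rhs))  # True
```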
Inner Product

A mapping from two vectors x, y ∈ V to a field of scalars F (F = R or F = C):

⟨·, ·⟩ : V × V → F

- Conjugate symmetry: ⟨x, y⟩ equals the complex conjugate of ⟨y, x⟩.
- Linearity: ⟨ax, y⟩ = a⟨x, y⟩, and ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩.
- Positive-definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 only for x = 0.
Canonical Inner Products

Inner product for real numbers x, y ∈ R:

⟨x, y⟩ = x y

Dot product between two vectors x, y ∈ Rⁿ:

⟨x, y⟩ = xᵀy ≡ Σ_{k=1}^{n} xk yk = x1 y1 + x2 y2 + … + xn yn

Canonical inner product for matrices X, Y ∈ R^{n×p}:

⟨X, Y⟩ = tr{XᵀY} = Σ_{k=1}^{p} (XᵀY)_kk = Σ_{k=1}^{p} Σ_{l=1}^{n} (Xᵀ)_kl (Y)_lk = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Y_lk
Calculations with Matrices

Denote the (i, j)th element of a matrix X by X_ij.

Transpose: (Xᵀ)_ij = X_ji
Product: (XY)_ij = Σ_k X_ik Y_kj, where k runs over the inner dimension.

Proof of the linearity of the canonical matrix inner product:

⟨X + Y, Z⟩ = tr{(X + Y)ᵀZ}
           = Σ_{k=1}^{p} ((X + Y)ᵀZ)_kk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} ((X + Y)ᵀ)_kl Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (X_lk + Y_lk) Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} X_lk Z_lk + Y_lk Z_lk
           = Σ_{k=1}^{p} Σ_{l=1}^{n} (Xᵀ)_kl Z_lk + (Yᵀ)_kl Z_lk
           = Σ_{k=1}^{p} (XᵀZ)_kk + (YᵀZ)_kk
           = tr{XᵀZ} + tr{YᵀZ}
           = ⟨X, Z⟩ + ⟨Y, Z⟩
Projection

In linear algebra and functional analysis, a projection is a linear transformation P from a vector space V to itself such that

P² = P

Example in R²: the map A sends (a, b) to (a − b, 0) on the x-axis:

A [a; b] = [a − b; 0],   A = [1 −1; 0 0]

[Figure: the point (a, b) is mapped to (a − b, 0) on the x-axis.]
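A small sketch verifying that the example matrix is idempotent, using the point (a, b) = (3, 1) purely for illustration:

```python
import numpy as np

A = np.array([[1., -1.],
              [0.,  0.]])

# Idempotence: applying the projection twice changes nothing.
print(np.allclose(A @ A, A))     # True
print(A @ np.array([3., 1.]))    # (a, b) = (3, 1) maps to (a - b, 0) = (2, 0)
```

Note that this A is idempotent but not symmetric, so it is a projection but not an orthogonal one.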
Orthogonal Projection

[Figure: vectors x and y with their projections Px and Py; the residual y − Py is orthogonal to the projections.]

Orthogonality needs an inner product ⟨x, y⟩. Choose two arbitrary vectors x and y. Then Px and y − Py are orthogonal:

0 = ⟨Px, y − Py⟩ = (Px)ᵀ(y − Py) = xᵀ(Pᵀ − PᵀP)y
Orthogonal Projection

An orthogonal projection satisfies

P² = P,   P = Pᵀ

Example: given a unit vector u ∈ Rⁿ characterising a line through the origin in Rⁿ, project an arbitrary vector x ∈ Rⁿ onto this line by

P = u uᵀ

Proof: P = Pᵀ, and

P²x = (u uᵀ)(u uᵀ)x = u uᵀ x = P x
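A sketch of the line projection P = u uᵀ, with a unit vector and test point chosen here for illustration:

```python
import numpy as np

u = np.array([3., 4.]) / 5.0       # unit vector spanning the line
P = np.outer(u, u)                 # P = u u^T

x = np.array([2., 1.])
Px = P @ x
print(np.allclose(P @ P, P))            # idempotent: True
print(np.allclose(P, P.T))              # symmetric: True
print(np.allclose(u @ (x - Px), 0.0))   # residual orthogonal to the line: True
```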
Orthogonal Projection

Given a matrix A ∈ R^{n×p} and a vector x, what is the closest point x̃ to x in the column space of A?

Orthogonal projection onto the column space of A (requires A to have full column rank, so that AᵀA is invertible):

x̃ = A(AᵀA)⁻¹Aᵀx

Projection matrix:

P = A(AᵀA)⁻¹Aᵀ

Proof of idempotence:

P² = A(AᵀA)⁻¹AᵀA(AᵀA)⁻¹Aᵀ = A(AᵀA)⁻¹Aᵀ = P

Orthogonal projection?

Pᵀ = (A(AᵀA)⁻¹Aᵀ)ᵀ = A(AᵀA)⁻¹Aᵀ = P
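A numerical sketch of the column-space projection, with random data standing in for A and x:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 2))        # full column rank with probability 1
x = rng.normal(size=5)

# P = A (A^T A)^{-1} A^T, computed via a linear solve.
P = A @ np.linalg.solve(A.T @ A, A.T)
xt = P @ x                         # closest point to x in the column space

print(np.allclose(P @ P, P))            # P^2 = P: True
print(np.allclose(P, P.T))              # P = P^T: True
print(np.allclose(A.T @ (x - xt), 0))   # residual orthogonal to col(A): True
```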
Linear Independence of Vectors

A set of vectors is linearly independent if none of them can be written as a linear combination of finitely many other vectors in the set.

Given a vector space V, k vectors v1, …, vk ∈ V, and scalars a1, …, ak: the vectors are linearly independent if and only if

a1 v1 + a2 v2 + … + ak vk = 0   (the zero vector)

has only the trivial solution

a1 = a2 = … = ak = 0
Rank

The column rank of a matrix A is the maximal number of linearly independent columns of A.
The row rank of a matrix A is the maximal number of linearly independent rows of A.
For every matrix A, the row rank and column rank are equal; this common value is called the rank of A.
Determinant

Let Sn be the set of all permutations of the numbers {1, …, n}. The determinant of a square matrix A is then given by

det{A} = Σ_{σ∈Sn} sgn(σ) Π_{i=1}^{n} A_{i,σ(i)}

where sgn is the signature of the permutation (+1 if the permutation is even, −1 if the permutation is odd).

Properties:
- det{AB} = det{A} det{B} for square matrices A, B.
- det{A⁻¹} = det{A}⁻¹.
- det{Aᵀ} = det{A}.
- det{cA} = cⁿ det{A} for A ∈ R^{n×n} and c ∈ R.
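The four properties can be checked numerically on random matrices (sizes chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
c = 2.5

print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A)))
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))
print(np.isclose(np.linalg.det(c * A), c**n * np.linalg.det(A)))
```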
Matrix Inverse

Identity matrix I:

I = AA⁻¹ = A⁻¹A

The matrix inverse A⁻¹ exists only for square matrices which are NOT singular.

A matrix is singular if
- at least one eigenvalue is zero, or equivalently
- the determinant det{A} = 0.

Inversion 'reverses the order' (like transposition):

(AB)⁻¹ = B⁻¹A⁻¹

(Only then is (AB)(AB)⁻¹ = (AB)B⁻¹A⁻¹ = AA⁻¹ = I.)
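A quick check of the reverse-order rule on random invertible matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))   # invertible with probability 1
B = rng.normal(size=(3, 3))

lhs = np.linalg.inv(A @ B)
rhs = np.linalg.inv(B) @ np.linalg.inv(A)
print(np.allclose(lhs, rhs))  # True
```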
Matrix Inverse and Transpose

What is (A⁻¹)ᵀ?
- We need to assume that A⁻¹ exists.
- Rule: (A⁻¹)ᵀ = (Aᵀ)⁻¹
Useful Identity

(A⁻¹ + BᵀC⁻¹B)⁻¹BᵀC⁻¹ = ABᵀ(BABᵀ + C)⁻¹

How to analyse and prove such an equation?
- A⁻¹ must exist, therefore A must be square, say A ∈ R^{n×n}.
- Same for C⁻¹, so let's assume C ∈ R^{m×m}.
- Therefore, B ∈ R^{m×n}.
Prove the Identity

(A⁻¹ + BᵀC⁻¹B)⁻¹BᵀC⁻¹ = ABᵀ(BABᵀ + C)⁻¹

Multiply both sides from the right by (BABᵀ + C):

(A⁻¹ + BᵀC⁻¹B)⁻¹BᵀC⁻¹(BABᵀ + C) = ABᵀ

Simplify the left-hand side:

(A⁻¹ + BᵀC⁻¹B)⁻¹[BᵀC⁻¹B(ABᵀ) + Bᵀ]
  = (A⁻¹ + BᵀC⁻¹B)⁻¹[BᵀC⁻¹B + A⁻¹]ABᵀ
  = ABᵀ
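A numerical sanity check of the identity; the well-conditioned random A, B, C below are assumptions chosen so that all the required inverses exist:

```python
import numpy as np

# Check (A^{-1} + B^T C^{-1} B)^{-1} B^T C^{-1} = A B^T (B A B^T + C)^{-1}.
rng = np.random.default_rng(5)
n, m = 4, 3
A = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # invertible by construction
C = np.eye(m) + 0.1 * rng.normal(size=(m, m))
B = rng.normal(size=(m, n))

Ainv, Cinv = np.linalg.inv(A), np.linalg.inv(C)
lhs = np.linalg.inv(Ainv + B.T @ Cinv @ B) @ B.T @ Cinv
rhs = A @ B.T @ np.linalg.inv(B @ A @ B.T + C)
print(np.allclose(lhs, rhs))  # True
```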
Woodbury Identity

(A + BD⁻¹C)⁻¹ = A⁻¹ − A⁻¹B(D + CA⁻¹B)⁻¹CA⁻¹

Useful if A is diagonal and easy to invert, B is very tall (many rows, but only a few columns), and C is very wide (few rows, many columns).
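A sketch checking the Woodbury identity in exactly that regime (diagonal A, tall B, wide C; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))   # diagonal, easy to invert
B = rng.normal(size=(n, k))                  # tall
C = rng.normal(size=(k, n))                  # wide
D = np.eye(k)

Ainv = np.diag(1.0 / np.diag(A))             # inverting A is elementwise
lhs = np.linalg.inv(A + B @ np.linalg.inv(D) @ C)
rhs = Ainv - Ainv @ B @ np.linalg.inv(D + C @ Ainv @ B) @ C @ Ainv
print(np.allclose(lhs, rhs))  # True
```

The right-hand side only inverts the small k×k matrix D + CA⁻¹B, which is the whole point when k ≪ n.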
Manipulating Matrix Equations

- Don't multiply by a matrix which does not have full rank. Why? In the general case, you will lose solutions.
- Don't multiply by non-square matrices.
- Don't assume a matrix inverse exists just because the matrix is square. The matrix might be singular, and you would be making the same mistake as deducing a = b from a·0 = b·0.
Eigenvectors

Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying

A x = λ x

where x ∈ Cⁿ, x ≠ 0, and λ ∈ C.

Example:

[0 1; −1 0] x = λ x

λ ∈ {−ı, ı}   with eigenvectors   x ∈ { (ı, 1)ᵀ, (−ı, 1)ᵀ }
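The example can be reproduced with numpy's eigensolver; note the eigenvalues are complex even though the matrix is real:

```python
import numpy as np

A = np.array([[0., 1.],
              [-1., 0.]])
lam, X = np.linalg.eig(A)
print(np.sort_complex(lam))          # the eigenvalues -1j and +1j

# Each column of X is an eigenvector: A x = lambda x.
for l, x in zip(lam, X.T):
    print(np.allclose(A @ x, l * x))  # True
```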
Eigenvectors

How many eigenvalue/eigenvector pairs?

A x = λ x   is equivalent to   (A − λI) x = 0

This has a non-trivial solution only for det{A − λI} = 0, a polynomial of nth order, so there are at most n distinct eigenvalues.
Real Eigenvalues

How can we enforce real eigenvalues?

Let's look at matrices with complex entries, A ∈ C^{n×n}. Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.

[1+ı2  3+ı4; 5+ı6  7+ı8]ᴴ = [1−ı2  5−ı6; 3−ı4  7−ı8]

Denote the complex conjugate of a complex number λ by λ̄.
Real Eigenvalues

How can we enforce real eigenvalues?

Let's assume A ∈ C^{n×n} is Hermitian (Aᴴ = A), and let x ∈ Cⁿ be an eigenvector of A with eigenvalue λ. Calculate

xᴴAx = λ xᴴx

Another way to calculate xᴴAx:

xᴴAx = xᴴAᴴx       (A is Hermitian)
     = (xᴴAx)ᴴ     (reverse order)
     = (λ xᴴx)ᴴ    (eigenvalue)
     = λ̄ xᴴx

and therefore λ = λ̄, i.e. λ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: if A has only real entries and is symmetric, then all eigenvalues are real.
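The real-symmetric special case is easy to confirm numerically (random symmetric matrix built here for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.normal(size=(4, 4))
S = M + M.T                          # real symmetric matrix

lam = np.linalg.eigvals(S)
print(np.allclose(np.imag(lam), 0))  # True: all eigenvalues are real
```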
Singular Value Decomposition

Every matrix A ∈ R^{n×p} can be decomposed into a product of three matrices

A = UΣVᵀ

where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (UᵀU = I and VᵀV = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal (and zeros elsewhere).
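A sketch of the full SVD in numpy, reconstructing A from the three factors (random 5×3 matrix chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # s holds the diagonal of Sigma
Sigma = np.zeros_like(A)                         # Sigma is n x p, zero off-diagonal
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ Sigma @ Vt, A))     # A = U Sigma V^T: True
print(np.allclose(U.T @ U, np.eye(5)))    # U orthogonal: True
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # V orthogonal: True
print(np.all(s >= 0))                     # singular values nonnegative: True
```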
Inverse of a matrix

Assume a full-rank symmetric real matrix A. Then A = UᵀΛU, where
- Λ is a diagonal matrix holding the (real) eigenvalues, and
- U contains the eigenvectors, with UᵀU = I.

A⁻¹ = (UᵀΛU)⁻¹
    = U⁻¹Λ⁻¹U⁻ᵀ    (inverse reverses the order)
    = UᵀΛ⁻¹U       (UᵀU = I)

The inverse of a diagonal matrix is the diagonal matrix of the elementwise inverses of its entries.
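A sketch inverting a symmetric matrix through its eigendecomposition; `np.linalg.eigh` returns A = Q Λ Qᵀ, so U = Qᵀ matches the slide's A = UᵀΛU convention:

```python
import numpy as np

rng = np.random.default_rng(9)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)               # symmetric and full rank by construction

lam, Q = np.linalg.eigh(A)            # A = Q diag(lam) Q^T
U = Q.T                               # match the slide's A = U^T Lambda U
Ainv = U.T @ np.diag(1.0 / lam) @ U   # invert by inverting the eigenvalues

print(np.allclose(Ainv @ A, np.eye(4)))  # True
```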
How to calculate a gradient?

Recall the gradient of a univariate function f : R → R.

Example:

f(x) = x⁴ + 7x³ + 5x² − 17x + 3

whose gradient is given by

df/dx = 4x³ + 21x² + 10x − 17

[Plot: the objective x⁴ + 7x³ + 5x² − 17x + 3 over the parameter range −6 to 2.]
Gradients and optima

We find the stationary points by calculating the derivative, setting it to zero, and solving the resulting equation.

Example:

df/dx = 4x³ + 21x² + 10x − 17 = 0

has three stationary points (the derivative is a cubic polynomial, and here all three roots are real).

We calculate the second derivative d²f/dx² to check whether a stationary point is a minimum (d²f/dx² > 0) or a maximum (d²f/dx² < 0).

The example has two minima and one maximum.
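The stationary points of the running example can be found and classified numerically with `np.roots`:

```python
import numpy as np

fprime = np.array([4., 21., 10., -17.])   # coefficients of f'(x)
roots = np.roots(fprime)                  # the three stationary points of f
fsecond = np.polyder(fprime)              # f''(x) = 12x^2 + 42x + 10

for r in sorted(roots.real):
    kind = "minimum" if np.polyval(fsecond, r) > 0 else "maximum"
    print(f"x = {r:+.3f}: {kind}")
```

The printed classification matches the slide: two minima and one maximum.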
Gradient descent

We may not be able to solve the problem analytically and have to resort to gradient descent.
Idea: take a step in the negative gradient direction.
There are many different approaches to gradient descent.

[Plot: gradient descent steps on a function f(x) over x ∈ [−1, 3].]
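A minimal sketch of plain gradient descent on the running example; the starting point and learning rate are assumptions chosen by hand (a step size that is too large diverges):

```python
def f(x):
    return x**4 + 7*x**3 + 5*x**2 - 17*x + 3

def fprime(x):
    return 4*x**3 + 21*x**2 + 10*x - 17

x = 0.0       # starting point (assumed)
eta = 0.01    # learning rate (assumed)
for _ in range(200):
    x = x - eta * fprime(x)   # step in the negative gradient direction

print(x)  # converges to the stationary point of f near the start
```

From this start the iterates settle at the minimum between 0 and 1; starting further left would find a different stationary point, which is why gradient descent only gives local optima.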
Rules of differentiation

We denote the derivative of f as f′.

Sum rule: (f(x) + g(x))′ = f′(x) + g′(x)
Product rule: (f(x) g(x))′ = f′(x) g(x) + f(x) g′(x)
Quotient rule: (f(x)/g(x))′ = (f′(x) g(x) − f(x) g′(x)) / (g(x))²
Chain rule: (g(f(x)))′ = (g ∘ f)′(x) = g′(f(x)) f′(x)
How to calculate a gradient?

Given a multivariate function f : R^D → R:
- Compute the partial derivative with respect to each element, and collect the results in the appropriate shape.
- Use knowledge of scalar differentiation.
- We recommend the convention that gradients are row vectors.

Example:

f(x1, x2) = x1² x2 + x1 x2³

has partial derivatives

∂f(x1, x2)/∂x1 = 2 x1 x2 + x2³
∂f(x1, x2)/∂x2 = x1² + 3 x1 x2²

and therefore the gradient is (∈ R^{1×2})

∇ₓ f(x) = df/dx = [∂f(x1, x2)/∂x1   ∂f(x1, x2)/∂x2] = [2 x1 x2 + x2³   x1² + 3 x1 x2²].
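The hand-derived gradient can be verified against central finite differences (test point chosen for illustration):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return x1**2 * x2 + x1 * x2**3

def grad_f(x):
    x1, x2 = x
    return np.array([2*x1*x2 + x2**3, x1**2 + 3*x1*x2**2])  # row vector

x = np.array([1.5, -0.5])
h = 1e-6
num = np.array([
    (f(x + h*np.eye(2)[i]) - f(x - h*np.eye(2)[i])) / (2*h) for i in range(2)
])
print(np.allclose(grad_f(x), num, atol=1e-5))  # True
```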
Gradient with trace

We can combine matrix multiplication rules with partial differentiation.

Example: ∇_A tr{AB} = Bᵀ

Recall, for A ∈ R^{n×m} with rows a1, …, an and B ∈ R^{m×n} with columns b1, …, bn:

tr{AB} = Σ_{i=1}^{m} a_{1i} b_{i1} + Σ_{i=1}^{m} a_{2i} b_{i2} + … + Σ_{i=1}^{m} a_{ni} b_{in}

Therefore:

∂ tr{AB} / ∂a_{ij} = b_{ji}
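A finite-difference check that the gradient of tr{AB} with respect to A is indeed Bᵀ (random shapes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 3))

# Perturb each entry A[i, j] and measure the change in tr{AB}.
h = 1e-6
G = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = h
        G[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2*h)

print(np.allclose(G, B.T, atol=1e-5))  # True
```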
Books

- Gilbert Strang, "Introduction to Linear Algebra", Wellesley-Cambridge, 2009.
- David C. Lay, Steven R. Lay, Judi J. McDonald, "Linear Algebra and Its Applications", Pearson, 2015.
- Jonathan S. Golan, "The Linear Algebra a Beginning Graduate Student Ought to Know", Springer, 2012.
- Kaare B. Petersen, Michael S. Pedersen, "The Matrix Cookbook", 2012.