
Lecture 25: The Covariance Matrix and Principal Component Analysis
Anup Rao

May 31, 2019

The covariance of two variables X, Y is a number, but it is often very convenient to view this number as an entry of a bigger matrix, called the covariance matrix. Suppose X = (X1, X2, . . . , Xk) is a random vector. Then the covariance matrix K is the matrix with Ki,j = Cov [Xi, Xj]. It is a k × k matrix, and the entries on the diagonal are the variances of all the variables. If X − E [X] is viewed as a column vector, the covariance matrix is

K = E [(X − E [X])(X − E [X])ᵀ] .

(If X is a random matrix, then its expectation is the matrix whose entries are the expectations of the corresponding random entries of X. So, E [X] = ∑x p(x) · x, where here x is a matrix.)

The matrix representation is particularly nice for working with data. Suppose X = (X1, . . . , Xk) is a random vector sampled by setting X to be a uniformly random vector from the set {v1, v2, . . . , vn}. Let v be the average of these vectors. Then let V be the k × n matrix whose columns are v1 − v, v2 − v, . . . , vn − v. Then it is easy to check that the covariance matrix of X is

K = (1/n) · VVᵀ.
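To make this concrete, here is a minimal sketch in Python with numpy (the data and variable names are mine, chosen just for illustration) that builds K from a set of column vectors exactly this way and compares it against numpy's built-in covariance estimator:

    import numpy as np

    # n sample vectors of dimension k, stored as the columns of a k x n array
    rng = np.random.default_rng(0)
    k, n = 3, 100
    data = rng.normal(size=(k, n))

    vbar = data.mean(axis=1, keepdims=True)   # the average vector v
    V = data - vbar                           # columns are v_1 - v, ..., v_n - v
    K = (V @ V.T) / n                         # K = (1/n) * V V^T

    # np.cov divides by n - 1 by default; bias=True divides by n, matching the formula
    assert np.allclose(K, np.cov(data, bias=True))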

The matrix representation allows us to easily generalize linear regression to the case that the inputs themselves are vectors. Suppose we are given data in the form of pairs (v1, y1), (v2, y2), . . . , (vn, yn), where v1, . . . , vn are each k-dimensional column vectors, and y1, . . . , yn are numbers. For simplicity, let us assume again that the mean of the v's is 0, and the mean of the y's is 0. Let X be a random one of the n vectors, and Y be the corresponding number. Then we seek the vector a such that E [(aᵀX − Y)²] is minimized. We have:

∇E [(aᵀX − Y)²] = ∇E [Y² − 2YaᵀX + aᵀXXᵀa] = 2aᵀ E [XXᵀ] − 2 E [YXᵀ] ,

but here E [XXᵀ] is exactly the covariance matrix. So, it turns out that the solution is to set aᵀ = E [YXᵀ] · K⁻¹. (K may not be invertible in general, but you can still solve for a using the pseudoinverse of K.)
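As a sketch of how this looks in practice (again Python/numpy, on synthetic centered data of my own), one can estimate K and E [YX] from the samples and solve for a with the pseudoinverse; on full-rank data this agrees with ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(1)
    k, n = 4, 200

    # synthetic centered inputs (as columns) and centered outputs
    V = rng.normal(size=(k, n))
    V = V - V.mean(axis=1, keepdims=True)
    y = rng.normal(size=n)
    y = y - y.mean()

    K = (V @ V.T) / n             # covariance matrix, E[X X^T]
    EYX = (V @ y) / n             # the vector E[Y X]

    a = np.linalg.pinv(K) @ EYX   # a = K^+ E[YX]; the pseudoinverse handles singular K

    # on full-rank data this is the ordinary least-squares solution
    a_ls, *_ = np.linalg.lstsq(V.T, y, rcond=None)
    assert np.allclose(a, a_ls)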

Principal Component Analysis

The covariance matrix has an extremely important application in machine learning, where we seek to find the key features of large sets of data: the features that are most informative about the underlying data set.


Figure 1: PCA applied to two-dimensional data. Image credit: Nicoguaro, from Wikipedia.

For example, if I were to try and figure out what kinds of movies a particular student in 312 likes, I might ask the student to tell me how much they like movies from the following categories: action, romantic comedies, horror, . . . . But are these the right questions to ask? Should I be asking whether they like animated movies or not, or whether they like slapstick humor? One of the key developments in machine learning is the use of mathematics to eliminate the subjectivity here: data is used to generate the most informative questions.

How can we bring the math to bear in this scenario? Let us associate every student in the class with a column vector, say a k-dimensional vector x, where k is the number of movies on Netflix, and each coordinate indicates whether or not the student likes the movie. Let X be the vector corresponding to a uniformly random student in the class. If I want to boil down a student's preferences to a single number, it is natural to consider a number that is a linear function of their vector. So, I want a unit-length vector a such that the inner product

aᵀx = a1x1 + a2x2 + · · ·+ akxk

is the most informative. Then a natural approach is to pick a to maximize Var [aᵀX]. Geometrically, this corresponds to projecting all the k-dimensional vectors to a single direction, in such a way that the variance is maximized. There is a very nice web demo here: http://setosa.io/ev/principal-component-analysis/.

To understand how to find this direction a, we need to use the concept of eigenvalues and eigenvectors from linear algebra.


Given a square matrix M, a number λ ∈ R is an eigenvalue of M if there is a vector v such that Mv = λv. In general, eigenvalues can be complex numbers, even if the original matrix is only real-valued. However, we have:

Fact 1. If M is a k × k symmetric matrix, then it has exactly k real-valued eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λk, and the corresponding eigenvectors form an orthonormal basis v1, . . . , vk for Rᵏ.
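For a symmetric matrix, this is the decomposition that numpy's np.linalg.eigh computes (it returns eigenvalues in ascending order, the reverse of the ordering in Fact 1). A small sketch, using a random symmetric matrix of my own, that checks the fact numerically:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.normal(size=(5, 5))
    M = (A + A.T) / 2                              # a random symmetric matrix

    # eigh is for symmetric (Hermitian) matrices; eigenvalues come back real,
    # in ascending order, with orthonormal eigenvectors as the columns of vecs
    lam, vecs = np.linalg.eigh(M)

    assert np.allclose(vecs.T @ vecs, np.eye(5))   # orthonormal basis of R^5
    assert np.allclose(M @ vecs, vecs * lam)       # M v_i = lambda_i v_i for each i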

Returning to our problem above, we seek the vector a that maximizes Var [aᵀX]. For simplicity, let us as usual assume that E [X] = 0, since if this is not the case, we can consider the random variable X′ = X − E [X] instead, and this does not change the choice of a. The assumption that the mean is 0 ensures that E [aᵀX] = a1 E [X1] + · · · + ak E [Xk] = 0. Then we have

Var [aᵀX] = E [(aᵀX)²] = E [aᵀXXᵀa] = aᵀ E [XXᵀ] a = aᵀKa,

where K is the covariance matrix of X.

Since K is a symmetric matrix, Fact 1 applies. Let λ1, . . . , λk and v1, . . . , vk be as in the fact. Then we can express a = α1v1 + α2v2 + · · · + αkvk. The fact that a has length 1 corresponds to

√(α1² + · · · + αk²) = 1 ⇒ α1² + · · · + αk² = 1.

We have:

aᵀKa = (α1v1 + · · · + αkvk)ᵀK(α1v1 + · · · + αkvk)
     = (α1v1 + · · · + αkvk)ᵀ(λ1α1v1 + · · · + λkαkvk)
     = λ1α1² + · · · + λkαk².

(Here we used that the vi are orthonormal: for i ≠ j, we have viᵀvj = 0, and viᵀvi = 1.)

Since λ1 is the largest eigenvalue, and α1² + · · · + αk² = 1, this quantity is maximized when α1 = 1 and α2 = α3 = · · · = αk = 0. So, the largest variance is achieved when a = v1, the eigenvector with the largest eigenvalue! This eigenvector is called the principal component. (In general, if you want the 5 most informative numbers, you should pick the eigenvectors corresponding to the top 5 eigenvalues, and ask for the corresponding linear combinations.)
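Putting the pieces together, here is a minimal sketch (Python/numpy, on synthetic two-dimensional data chosen just for illustration) that computes the principal component as the top eigenvector of the covariance matrix and checks that no other unit-length direction gives a larger variance:

    import numpy as np

    rng = np.random.default_rng(3)
    k, n = 2, 500

    # synthetic centered data with correlated coordinates, stored as columns of V
    C = np.array([[3.0, 1.2],
                  [1.2, 1.0]])
    V = np.linalg.cholesky(C) @ rng.normal(size=(k, n))
    V = V - V.mean(axis=1, keepdims=True)

    K = (V @ V.T) / n                   # covariance matrix

    lam, vecs = np.linalg.eigh(K)       # eigenvalues in ascending order
    principal = vecs[:, -1]             # eigenvector with the largest eigenvalue
    var_pc = principal @ K @ principal  # variance along the principal component

    # no other unit-length direction a gives a larger value of a^T K a
    for _ in range(1000):
        a = rng.normal(size=k)
        a /= np.linalg.norm(a)
        assert a @ K @ a <= var_pc + 1e-9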