PCA Slides


    Principal Component Analysis

    Frank Wood

    December 8, 2009

    This lecture borrows and quotes from Jolliffe's Principal Component Analysis book. Go buy it!


    Principal Component Analysis

    The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. [Jolliffe, Principal Component Analysis, 2nd edition]


    Data distribution (inputs in regression analysis)

    Figure: Gaussian PDF


    Uncorrelated projections of principal variation

    Figure: Gaussian PDF with PC eigenvectors


    PCA rotation

    Figure: PCA Projected Gaussian PDF


    PCA in a nutshell

    Notation

    x is a vector of p random variables

    \alpha_k is a vector of p constants

    \alpha_k' x = \sum_{j=1}^{p} \alpha_{kj} x_j

    Procedural description

    Find a linear function of x, \alpha_1' x, with maximum variance.

    Next find another linear function of x, \alpha_2' x, uncorrelated with \alpha_1' x and with maximum variance.

    Iterate.

    Goal

    It is hoped, in general, that most of the variation in x will be accounted for by m PCs, where m \ll p.
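
    As a rough illustration of this procedure (a sketch, not slide code: the toy covariance matrix and the helper name leading_direction are our own), power iteration finds the direction of maximum variance, and deflation removes that variation before the next pass:

        import numpy as np

        def leading_direction(C, n_iter=1000):
            """Unit vector maximizing v' C v, found by power iteration."""
            v = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
            for _ in range(n_iter):
                v = C @ v
                v /= np.linalg.norm(v)
            return v

        # Toy covariance matrix, assumed for illustration only.
        C = np.array([[4.5, 1.5],
                      [1.5, 1.0]])

        alpha1 = leading_direction(C)
        var1 = alpha1 @ C @ alpha1              # variance of the first PC

        # Deflate: remove the variation already explained, then repeat.
        C2 = C - var1 * np.outer(alpha1, alpha1)
        alpha2 = leading_direction(C2)

        print(alpha1, var1)                     # first PC direction and its variance
        print(alpha2, alpha1 @ alpha2)          # second direction, orthogonal to the first

    The following slides show that this procedure amounts to an eigendecomposition of the covariance matrix, so in practice one calls an eigensolver rather than iterating by hand.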


    Derivation of PCA

    Assumption and More Notation

    \Sigma is the known covariance matrix for the random variable x.

    Foreshadowing: \Sigma will be replaced with S, the sample covariance matrix, when \Sigma is unknown.

    Shortcut to solution

    For k = 1, 2, \ldots, p the kth PC is given by z_k = \alpha_k' x, where \alpha_k is an eigenvector of \Sigma corresponding to its kth largest eigenvalue \lambda_k.

    If \alpha_k is chosen to have unit length (i.e. \alpha_k' \alpha_k = 1) then Var(z_k) = \lambda_k.
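
    As a quick numerical check of this shortcut (a sketch with an assumed 2-dimensional \Sigma; the sample size and seed are arbitrary):

        import numpy as np

        rng = np.random.default_rng(1)
        Sigma = np.array([[4.5, 1.5],
                          [1.5, 1.0]])            # assumed known covariance

        # Eigendecomposition; eigh returns eigenvalues in ascending order.
        lam, alpha = np.linalg.eigh(Sigma)
        lam, alpha = lam[::-1], alpha[:, ::-1]    # reorder to descending

        # Sample x, form z_k = alpha_k' x, and compare Var(z_k) with lambda_k.
        x = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=200_000)
        z = x @ alpha
        print(lam)                                # eigenvalues of Sigma
        print(z.var(axis=0, ddof=1))              # sample variances, close to the eigenvalues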


    Derivation of PCA

    First Step

    Find \alpha_k' x that maximizes Var(\alpha_k' x) = \alpha_k' \Sigma \alpha_k. Without a constraint we could pick a very big \alpha_k.

    Choose a normalization constraint, namely \alpha_k' \alpha_k = 1 (unit length vector).

    Constrained maximization - method of Lagrange multipliers

    To maximize \alpha_k' \Sigma \alpha_k subject to \alpha_k' \alpha_k = 1 we use the technique of Lagrange multipliers. We maximize the function

    \alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1)

    w.r.t. \alpha_k by differentiating w.r.t. \alpha_k.


    Derivation of PCA

    Constrained maximization - method of Lagrange multipliers

    This results in

    \frac{d}{d\alpha_k} \left[ \alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1) \right] = 0

    \Sigma \alpha_k - \lambda_k \alpha_k = 0

    \Sigma \alpha_k = \lambda_k \alpha_k

    This should be recognizable as an eigenvector equation where \alpha_k is an eigenvector of \Sigma and \lambda_k is the associated eigenvalue.

    Which eigenvector should we choose?


    Derivation of PCA

    Constrained maximization - method of Lagrange multipliers

    If we recognize that the quantity to be maximized is

    \alpha_k' \Sigma \alpha_k = \alpha_k' \lambda_k \alpha_k = \lambda_k \alpha_k' \alpha_k = \lambda_k

    then we should choose \lambda_k to be as big as possible. So, calling \lambda_1 the largest eigenvalue of \Sigma and \alpha_1 the corresponding eigenvector, the solution to

    \Sigma \alpha_1 = \lambda_1 \alpha_1

    gives \alpha_1' x, the 1st principal component of x.

    In general \alpha_k' x will be the kth PC of x and Var(\alpha_k' x) = \lambda_k.

    We will demonstrate this for k = 2; k > 2 is more involved but similar.


    Derivation of PCA

    Constrained maximization - more constraints

    The second PC, \alpha_2' x, maximizes \alpha_2' \Sigma \alpha_2 subject to being uncorrelated with \alpha_1' x.

    The uncorrelation constraint can be expressed using any of these equations

    cov(\alpha_1' x, \alpha_2' x) = \alpha_1' \Sigma \alpha_2 = \alpha_2' \Sigma \alpha_1 = \alpha_2' \lambda_1 \alpha_1 = \lambda_1 \alpha_2' \alpha_1 = \lambda_1 \alpha_1' \alpha_2 = 0

    Of these, if we choose the last, we can write a Lagrangian to maximize \alpha_2' \Sigma \alpha_2:

    \alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1


    Derivation of PCA

    Constrained maximization - more constraints

    Differentiation of this quantity w.r.t. \alpha_2 (and setting the result equal to zero) yields

    \frac{d}{d\alpha_2} \left[ \alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1 \right] = 0

    \Sigma \alpha_2 - \lambda_2 \alpha_2 - \phi \alpha_1 = 0

    If we left multiply \alpha_1' into this expression

    \alpha_1' \Sigma \alpha_2 - \lambda_2 \alpha_1' \alpha_2 - \phi \alpha_1' \alpha_1 = 0

    0 - 0 - \phi \cdot 1 = 0

    then we can see that \phi must be zero, and that when this is true we are left with

    \Sigma \alpha_2 - \lambda_2 \alpha_2 = 0


    Derivation of PCA

    Clearly

    \Sigma \alpha_2 - \lambda_2 \alpha_2 = 0

    is another eigenvalue equation, and the same strategy of choosing \alpha_2 to be the eigenvector associated with the second largest eigenvalue yields the second PC of x, namely \alpha_2' x.

    This process can be repeated for k = 1, \ldots, p, yielding up to p different eigenvectors of \Sigma along with the corresponding eigenvalues \lambda_1, \ldots, \lambda_p.

    Furthermore, the variance of each of the PCs is given by

    Var[\alpha_k' x] = \lambda_k,  k = 1, 2, \ldots, p


    Properties of PCA

    For any integer q, 1 \le q \le p, consider the orthonormal linear transformation

    y = B' x

    where y is a q-element vector and B' is a (q \times p) matrix, and let \Sigma_y = B' \Sigma B be the variance-covariance matrix for y. Then the trace of \Sigma_y, denoted tr(\Sigma_y), is maximized by taking B = A_q, where A_q consists of the first q columns of A.

    What this means is that if you want to choose a lower dimensional projection of x, the choice of B described here is probably a good one. It maximizes the (retained) variance of the resulting variables.

    In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first q PCs is given by

    \frac{\sum_{k=1}^{q} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \times 100
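
    For example, the retained-variance percentage is just a ratio of eigenvalue sums (a sketch; the eigenvalues here are assumed, not computed from the slides' data):

        import numpy as np

        lam = np.array([5.05, 0.45])    # eigenvalues of an assumed Sigma, largest first
        q = 1
        pct = 100 * lam[:q].sum() / lam.sum()
        print(f"first {q} PC(s) retain {pct:.1f}% of the variance")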


    PCA using the sample covariance matrix

    If we recall that the sample covariance matrix (an unbiased estimator for the covariance matrix of x) is given by

    S = \frac{1}{n-1} X' X

    where X is an (n \times p) matrix with (i, j)th element (x_{ij} - \bar{x}_j) (in other words, X is a zero mean design matrix), and we construct the matrix A by combining the p eigenvectors of S (or the eigenvectors of X'X; they are the same), then we can define a matrix of PC scores

    Z = XA

    Of course, if we instead form Z by selecting the q eigenvectors corresponding to the q largest eigenvalues of S when forming A, then we can achieve an optimal (in some senses) q-dimensional projection of x.
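
    A compact NumPy sketch of this slide, with simulated data standing in for the observations (the sample size, seed, and names are our own):

        import numpy as np

        rng = np.random.default_rng(2)
        raw = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=1000)

        # Zero mean design matrix X and sample covariance S = X'X / (n - 1).
        X = raw - raw.mean(axis=0)
        n = X.shape[0]
        S = X.T @ X / (n - 1)

        # Columns of A are the eigenvectors of S, ordered by decreasing eigenvalue.
        lam, A = np.linalg.eigh(S)
        order = np.argsort(lam)[::-1]
        lam, A = lam[order], A[:, order]

        Z = X @ A              # matrix of PC scores
        Zq = X @ A[:, :1]      # q = 1: the optimal 1-d projection in the retained-variance sense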


    Sample Covariance Matrix PCA

    Figure: Gaussian Samples


    Sample Covariance Matrix PCA

    Figure: Gaussian Samples with eigenvectors of sample covariance matrix


    Sample Covariance Matrix PCA

    Figure: PC projected samples


    Sample Covariance Matrix PCA

    Figure: PC dimensionality reduction step


    Sample Covariance Matrix PCA

    Figure: PC dimensionality reduction step


    PCA in linear regression

    PCA is useful in linear regression in several ways

    Identification and elimination of multicolinearities in the data.

    Reduction in the dimension of the input space, leading to fewer parameters and easier regression.

    Related to the last point, the variance of the regression coefficient estimator is minimized by the PCA choice of basis.

    We will consider the following example.

    x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)

    y = X [1\ 2]' when no colinearities are present (no noise)

    x_{i3} = 0.8\, x_{i1} + 0.5\, x_{i2} imposed colinearity
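
    A simulation sketch of this setup (the sample size, seed, and variable names are assumptions; the slides do not give them):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 200

        # Two informative inputs.
        X2 = rng.multivariate_normal(mean=[2, 5], cov=[[4.5, 1.5], [1.5, 1.0]], size=n)

        # Noiseless linear relationship in the no-colinearity case.
        y = X2 @ np.array([1.0, 2.0])

        # Impose the colinearity: a third column that is an exact linear
        # combination of the first two.
        x3 = 0.8 * X2[:, 0] + 0.5 * X2[:, 1]
        X3 = np.column_stack([X2, x3])    # 3-column design matrix of rank 2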


    Noiseless Linear Relationship with No Colinearity

    Figure: y = x [1\ 2]' + 5, x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)


    Noiseless Planar Relationship

    Figure: y = x [1\ 2]' + 5, x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)


    Projection of colinear data

    The figures before showed the data without the third colinear design matrix column. Plotting such data is not possible, but its colinearity is obvious by design.

    When PCA is applied to a design matrix of rank q less than p, the number of positive eigenvalues discovered is equal to q, the true rank of the design matrix.

    If the number of PCs retained is at least q (and the data is perfectly colinear, etc.), all of the variance of the data is retained in the low dimensional projection.

    In this example, when PCA is run on the design matrix of rank 2, the resulting projection back into two dimensions has exactly the same distribution as before.
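
    A numerical illustration of the rank argument, re-creating the 3-column, rank-2 design from the example above (a sketch; simulation details assumed as before):

        import numpy as np

        rng = np.random.default_rng(3)
        X2 = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=200)
        X3 = np.column_stack([X2, 0.8 * X2[:, 0] + 0.5 * X2[:, 1]])

        # Center and eigendecompose the rank-2 design.
        Xc = X3 - X3.mean(axis=0)
        lam, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
        lam, A = lam[::-1], A[:, ::-1]

        print(lam)                         # two positive eigenvalues, the third is ~0
        Z2 = Xc @ A[:, :2]                 # projection onto the first two PCs
        print(lam[:2].sum() / lam.sum())   # ~1.0: essentially all variance is retained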


    Projection of colinear data

    Figure: Projection of multi-colinear data onto first two PCs


    Reduction in regression coefficient estimator variance

    If we take the standard regression model

    y = X\beta + \epsilon

    and consider instead the PCA rotation of X given by

    Z = XA

    then we can rewrite the regression model in terms of the PCs

    y = Z\gamma + \epsilon.

    We can also consider the reduced model

    y = Z_q \gamma_q + \epsilon_q

    where only the first q PCs are retained.


    Reduction in regression coefficient estimator variance

    If we rewrite the regression relation as

    y = Z\gamma + \epsilon,

    then we can, because A is orthogonal, rewrite

    X\beta = XAA'\beta = Z\gamma

    where \gamma = A'\beta.

    Clearly, using least squares (or ML) to learn \hat{\gamma} = A'\hat{\beta} is equivalent to learning \hat{\beta} directly.

    And, as usual,

    \hat{\gamma} = (Z'Z)^{-1} Z' y

    so

    \hat{\beta} = A (Z'Z)^{-1} Z' y

    Reduction in regression coefficient estimator variance

    Without derivation we note that the variance-covariance matrix of \hat{\beta} is given by

    Var(\hat{\beta}) = \sigma^2 \sum_{k=1}^{p} l_k^{-1} a_k a_k'

    where l_k is the kth largest eigenvalue of X'X, a_k is the kth column of A, and \sigma^2 is the observation noise variance, i.e. \epsilon \sim N(0, \sigma^2 I).

    This sheds light on how multicolinearities produce large variances for the elements of \hat{\beta}. If an eigenvalue l_k is small then the resulting variance of the estimator will be large.


    Reduction in regression coefficient estimator variance

    One way to avoid this is to ignore those PCs that are associated with small eigenvalues, namely, use the biased estimator

    \tilde{\beta} = \sum_{k=1}^{m} l_k^{-1} a_k a_k' X' y

    where l_{1:m} are the large eigenvalues of X'X and l_{m+1:p} are the small ones.

    Var(\tilde{\beta}) = \sigma^2 \sum_{k=1}^{m} l_k^{-1} a_k a_k'

    This is a biased estimator but, since its variance is smaller, it is possible that this could be an advantage.

    Homework: find the bias of this estimator. Hint: use the spectral decomposition of X'X.
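
    A sketch of this truncated estimator on nearly colinear simulated data (m, the noise levels, and all names are our own choices):

        import numpy as np

        rng = np.random.default_rng(5)
        n = 200
        X2 = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=n)
        # Nearly colinear third column -> one tiny eigenvalue of X'X.
        x3 = 0.8 * X2[:, 0] + 0.5 * X2[:, 1] + rng.normal(scale=0.01, size=n)
        X = np.column_stack([X2, x3])
        X = X - X.mean(axis=0)
        y = X[:, :2] @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

        # Eigenvalues l_k and eigenvectors a_k of X'X, largest first.
        l, A = np.linalg.eigh(X.T @ X)
        order = np.argsort(l)[::-1]
        l, A = l[order], A[:, order]

        m = 2    # keep only the PCs with large eigenvalues
        beta_trunc = sum(np.outer(A[:, k], A[:, k]) / l[k] for k in range(m)) @ X.T @ y

        beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
        print(beta_trunc)    # biased but lower-variance
        print(beta_ols)      # unbiased but unstable under near-colinearity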


    Problems with PCA

    PCA is not without its problems and limitations.

    PCA assumes approximate normality of the input space distribution. (PCA may still be able to produce a good low dimensional projection of the data even if the data isn't normally distributed.)

    PCA may fail if the data lies on a complicated manifold.

    PCA assumes that the input data is real and continuous.

    Extensions to consider:

    Collins et al., A generalization of principal components analysis to the exponential family.

    Hyvarinen, A. and Oja, E., Independent component analysis: algorithms and applications.

    ISOMAP, LLE, maximum variance unfolding, etc.


    Non-normal data

    Figure: 2d Beta(.1, .1) Samples with PCs

    Non-normal data


    Figure: PCA Projected
