7/28/2019 Pca Slides
Principal Component Analysis
Frank Wood
December 8, 2009
This lecture borrows and quotes from Jolliffe's Principal Component Analysis book. Go buy it!
Principal Component Analysis
The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. [Jolliffe, Principal Component Analysis, 2nd edition]
Data distribution (inputs in regression analysis)
Figure: Gaussian PDF
Uncorrelated projections of principal variation
Figure: Gaussian PDF with PC eigenvectors
PCA rotation
Figure: PCA Projected Gaussian PDF
PCA in a nutshell
Notation
x is a vector of p random variables
α_k is a vector of p constants
α_k'x = Σ_{j=1}^p α_{kj} x_j

Procedural description
Find a linear function of x, α_1'x, with maximum variance.
Next find another linear function of x, α_2'x, uncorrelated with α_1'x, with maximum variance.
Iterate.

Goal
It is hoped, in general, that most of the variation in x will be accounted for by m PCs, where m ≪ p.
Derivation of PCA
Assumption and More Notation
Σ is the known covariance matrix for the random variable x
Foreshadowing: Σ will be replaced with S, the sample covariance matrix, when Σ is unknown.

Shortcut to solution
For k = 1, 2, . . . , p the kth PC is given by z_k = α_k'x where α_k is an eigenvector of Σ corresponding to its kth largest eigenvalue λ_k.
If α_k is chosen to have unit length (i.e. α_k'α_k = 1) then Var(z_k) = λ_k.
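This shortcut is easy to check numerically. The sketch below is an illustrative NumPy example (not from the slides; the 2×2 Σ is arbitrary) that eigendecomposes a known covariance matrix and confirms Var(z_k) = α_k'Σα_k = λ_k for each unit eigenvector:

```python
import numpy as np

# An arbitrary symmetric, positive-definite covariance matrix Sigma
# (the numbers are made up for illustration).
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])

# eigh returns eigenvalues in ascending order for symmetric matrices,
# so flip both outputs to put the largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Each unit-length eigenvector alpha_k gives
# Var(z_k) = alpha_k' Sigma alpha_k = lambda_k, the kth largest eigenvalue.
for k in range(2):
    alpha_k = eigvecs[:, k]
    assert np.isclose(alpha_k @ Sigma @ alpha_k, eigvals[k])
```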
Derivation of PCA
First Step
Find α_k'x that maximizes Var(α_k'x) = α_k'Σα_k.
Without a constraint we could pick a very big α_k.
Choose a normalization constraint, namely α_k'α_k = 1 (unit-length vector).

Constrained maximization - method of Lagrange multipliers
To maximize α_k'Σα_k subject to α_k'α_k = 1 we use the technique of Lagrange multipliers. We maximize the function

α_k'Σα_k − λ_k(α_k'α_k − 1)

w.r.t. α_k by differentiating w.r.t. α_k.
Derivation of PCA
Constrained maximization - method of Lagrange multipliers
This results in

d/dα_k [ α_k'Σα_k − λ_k(α_k'α_k − 1) ] = 0
Σα_k − λ_kα_k = 0
Σα_k = λ_kα_k

This should be recognizable as an eigenvector equation where α_k is an eigenvector of Σ and λ_k is the associated eigenvalue.
Which eigenvector should we choose?
Derivation of PCA
Constrained maximization - method of Lagrange multipliers
If we recognize that the quantity to be maximized

α_k'Σα_k = α_k'λ_kα_k = λ_kα_k'α_k = λ_k

then we should choose λ_k to be as big as possible. So, calling λ_1 the largest eigenvalue of Σ and α_1 the corresponding eigenvector, the solution to

Σα_1 = λ_1α_1

is the 1st principal component of x. In general α_k'x will be the kth PC of x and Var(α_k'x) = λ_k.
We will demonstrate this for k = 2; the case k > 2 is more involved but similar.
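One way to see that λ_1 really is the maximum is by brute force. This check is not part of the original slides; it is a hypothetical NumPy experiment (arbitrary Σ) comparing α_1'Σα_1 against many random unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])  # arbitrary illustrative covariance

# alpha_1: unit eigenvector for the largest eigenvalue lambda_1.
eigvals, eigvecs = np.linalg.eigh(Sigma)
lambda_1 = eigvals.max()
alpha_1 = eigvecs[:, np.argmax(eigvals)]

# No random unit vector attains a larger variance alpha' Sigma alpha
# than alpha_1 does (the Rayleigh quotient is maximized by alpha_1).
for _ in range(1000):
    a = rng.normal(size=2)
    a /= np.linalg.norm(a)
    assert a @ Sigma @ a <= lambda_1 + 1e-12
```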
Derivation of PCA
Constrained maximization - more constraints
The second PC, α_2'x, maximizes α_2'Σα_2 subject to being uncorrelated with α_1'x.
The uncorrelation constraint can be expressed using any of these equations

cov(α_1'x, α_2'x) = α_1'Σα_2 = α_2'Σα_1 = λ_1α_2'α_1 = λ_1α_1'α_2 = 0

Of these, if we choose the last we can write a Lagrangian to maximize α_2

α_2'Σα_2 − λ_2(α_2'α_2 − 1) − φα_2'α_1
Derivation of PCA
Constrained maximization - more constraints
Differentiation of this quantity w.r.t. α_2 (and setting the result equal to zero) yields

d/dα_2 [ α_2'Σα_2 − λ_2(α_2'α_2 − 1) − φα_2'α_1 ] = 0
Σα_2 − λ_2α_2 − φα_1 = 0

If we left-multiply α_1' into this expression

α_1'Σα_2 − λ_2α_1'α_2 − φα_1'α_1 = 0
0 − 0 − φ · 1 = 0

then we can see that φ must be zero and that when this is true we are left with

Σα_2 − λ_2α_2 = 0
Derivation of PCA
Clearly

Σα_2 − λ_2α_2 = 0

is another eigenvalue equation, and the same strategy of choosing α_2 to be the eigenvector associated with the second largest eigenvalue yields the second PC of x, namely α_2'x.
This process can be repeated for k = 1 . . . p, yielding up to p different eigenvectors of Σ along with the corresponding eigenvalues λ_1, . . . , λ_p.
Furthermore, the variances of the PCs are given by

Var[α_k'x] = λ_k, k = 1, 2, . . . , p
Properties of PCA
For any integer q, 1 ≤ q ≤ p, consider the orthonormal linear transformation

y = B'x

where y is a q-element vector and B' is a q × p matrix, and let Σ_y = B'ΣB be the variance-covariance matrix for y. Then the trace of Σ_y, denoted tr(Σ_y), is maximized by taking B = A_q, where A_q consists of the first q columns of A.
What this means is that if you want to choose a lower-dimensional projection of x, the choice of B described here is probably a good one. It maximizes the (retained) variance of the resulting variables.
In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first q PCs is given by

(Σ_{k=1}^q λ_k / Σ_{k=1}^p λ_k) × 100
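As a quick worked example (the eigenvalues here are made up for illustration, not from the slides), the retained-variance percentage can be computed directly from the λ_k:

```python
import numpy as np

# Illustrative eigenvalues lambda_1 >= ... >= lambda_p of Sigma
# (these values are invented for the example).
lam = np.array([4.0, 2.0, 1.0, 0.5])

def pct_variance_retained(lam, q):
    """Percentage of total variance retained by the first q PCs."""
    return 100.0 * lam[:q].sum() / lam.sum()

pct = pct_variance_retained(lam, 2)  # (4 + 2) / 7.5 * 100 = 80.0
```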
PCA using the sample covariance matrix
If we recall that the sample covariance matrix (an unbiased estimator for the covariance matrix of x) is given by

S = (1/(n − 1)) X'X

where X is an (n × p) matrix with (i, j)th element (x_ij − x̄_j) (in other words, X is a zero-mean design matrix).
We construct the matrix A by combining the p eigenvectors of S (or eigenvectors of X'X; they're the same), then we can define a matrix of PC scores

Z = XA

Of course, if we instead form Z by selecting the q eigenvectors corresponding to the q largest eigenvalues of S when forming A, then we can achieve an optimal (in some senses) q-dimensional projection of x.
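A minimal NumPy sketch of this whole recipe, assuming synthetic Gaussian data (the mean, covariance, and sample size are invented for illustration): center the data, form S, eigendecompose, and take Z = XA. The sample covariance of Z comes out diagonal, confirming the PC scores are uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

# Synthetic data; the parameters are made up purely for illustration.
rng = np.random.default_rng(1)
n, p = 500, 3
raw = rng.multivariate_normal(
    mean=[2.0, 5.0, 0.0],
    cov=[[4.5, 1.5, 0.0],
         [1.5, 1.0, 0.0],
         [0.0, 0.0, 0.2]],
    size=n)

# X: zero-mean design matrix; S: unbiased sample covariance.
X = raw - raw.mean(axis=0)
S = X.T @ X / (n - 1)

# A: eigenvectors of S as columns, sorted by descending eigenvalue.
eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Matrix of PC scores; its sample covariance is diagonal, i.e. the
# PC scores are uncorrelated, with variances equal to the eigenvalues.
Z = X @ A
S_z = Z.T @ Z / (n - 1)

# Retaining only the first q columns of A gives the q-dim projection.
q = 2
Z_q = X @ A[:, :q]
```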
Sample Covariance Matrix PCA
Figure: Gaussian Samples
Sample Covariance Matrix PCA
Figure: Gaussian Samples with eigenvectors of sample covariance matrix
Sample Covariance Matrix PCA
Figure: PC projected samples
Sample Covariance Matrix PCA
Figure: PC dimensionality reduction step
Sample Covariance Matrix PCA
Figure: PC dimensionality reduction step
PCA in linear regression
PCA is useful in linear regression in several ways
Identification and elimination of multicollinearities in the data.
Reduction in the dimension of the input space, leading to fewer parameters and easier regression.
Related to the last point, the variance of the regression coefficient estimator is minimized by the PCA choice of basis.

We will consider the following example.
x ~ N([2 5]', [4.5 1.5; 1.5 1.0])
y = X[1 2]' when no collinearities are present (no noise)
x_i3 = .8x_i1 + .5x_i2 imposed collinearity
Noiseless Linear Relationship with No Colinearity
Figure: y = x[1 2]' + 5, x ~ N([2 5]', [4.5 1.5; 1.5 1.0])
Noiseless Planar Relationship
Figure: y = x[1 2]' + 5, x ~ N([2 5]', [4.5 1.5; 1.5 1.0])
Projection of colinear data
The figures before showed the data without the third collinear design matrix column. Plotting such data is not possible, but its collinearity is obvious by design.
When PCA is applied to a design matrix of rank q less than p, the number of positive eigenvalues discovered is equal to q, the true rank of the design matrix.
If the number of PCs retained is at least q (and the data is perfectly collinear, etc.), all of the variance of the data is retained in the low-dimensional projection.
In this example, when PCA is run on the design matrix of rank 2, the resulting projection back into two dimensions has exactly the same distribution as before.
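This rank argument can be verified numerically. The following sketch (the sample size is invented; the distribution and collinearity coefficients are the ones assumed in the example above) imposes x_i3 = .8x_i1 + .5x_i2 and checks that the sample covariance of the rank-2 design has only two positive eigenvalues:

```python
import numpy as np

# Simulated inputs matching the example's (assumed) parameters.
rng = np.random.default_rng(2)
n = 1000
x12 = rng.multivariate_normal([2.0, 5.0], [[4.5, 1.5], [1.5, 1.0]], size=n)

# Impose the collinearity x_i3 = .8 x_i1 + .5 x_i2.
x3 = 0.8 * x12[:, 0] + 0.5 * x12[:, 1]
full = np.column_stack([x12, x3])

# Sample covariance of the centered rank-2 design matrix.
X = full - full.mean(axis=0)
S = X.T @ X / (n - 1)

# Only two eigenvalues are (numerically) positive: the third is zero
# up to round-off, matching the true rank q = 2.
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
```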
Projection of colinear data
Figure: Projection of multicollinear data onto first two PCs
Reduction in regression coefficient estimator variance
If we take the standard regression model

y = Xβ + ε

and consider instead the PCA rotation of X given by

Z = XA

then we can rewrite the regression model in terms of the PCs

y = Zγ + ε.

We can also consider the reduced model

y = Z_qγ_q + ε_q

where only the first q PCs are retained.
Reduction in regression coefficient estimator variance
If we rewrite the regression relation as

y = Zγ + ε

then we can, because A is orthogonal, rewrite

Xβ = XAA'β = Zγ

where γ = A'β.
Clearly, using least squares (or ML) to learn γ = A'β is equivalent to learning β directly.
And, as usual,

γ̂ = (Z'Z)^{-1}Z'y

so

β̂ = A(Z'Z)^{-1}Z'y
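The claimed equivalence is easy to confirm numerically. In this hypothetical NumPy example (data simulated for illustration), β̂ = Aγ̂ matches ordinary least squares on X to machine precision:

```python
import numpy as np

# Simulated regression data (invented for illustration).
rng = np.random.default_rng(3)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

# A: orthogonal eigenvector matrix of X'X; Z = XA is the PC rotation.
_, A = np.linalg.eigh(X.T @ X)
Z = X @ A

# Least squares on Z, then rotate back: beta_hat = A gamma_hat.
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
beta_hat = A @ gamma_hat

# Identical (up to round-off) to least squares on X directly.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```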
Reduction in regression coefficient estimator variance
Without derivation we note that the variance-covariance matrix of β̂ is given by

Var(β̂) = σ² Σ_{k=1}^p l_k^{-1} a_k a_k'

where l_k is the kth largest eigenvalue of X'X, a_k is the kth column of A, and σ² is the observation noise variance, i.e. ε ~ N(0, σ²I).
This sheds light on how multicollinearities produce large variances for the elements of β̂. If an eigenvalue l_k is small then the resulting variance of the estimator will be large.
Reduction in regression coefficient estimator variance
One way to avoid this is to ignore those PCs that are associated with small eigenvalues, namely, to use the biased estimator

β̃ = Σ_{k=1}^m l_k^{-1} a_k a_k' X'y

where l_1, . . . , l_m are the large eigenvalues of X'X and l_{m+1}, . . . , l_p are the small ones. Its variance is

Var(β̃) = σ² Σ_{k=1}^m l_k^{-1} a_k a_k'

This is a biased estimator, but, since its variance is smaller, it is possible that this could be an advantage.
Homework: find the bias of this estimator. Hint: use the spectral decomposition of X'X.
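A sketch of this estimator in NumPy, under an assumed nearly-collinear design (all parameters invented for illustration): keeping all m = p terms reproduces OLS, while dropping the small-eigenvalue term gives the truncated estimator β̃.

```python
import numpy as np

# Nearly collinear design, invented for illustration: the second
# column is almost a copy of the first.
rng = np.random.default_rng(4)
n = 300
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

# Eigenvalues l_k (descending) and eigenvectors a_k of X'X.
l, A = np.linalg.eigh(X.T @ X)
order = np.argsort(l)[::-1]
l, A = l[order], A[:, order]

def pc_estimator(m):
    """Estimate of beta keeping only the m largest-eigenvalue terms."""
    M = sum((1.0 / l[k]) * np.outer(A[:, k], A[:, k]) for k in range(m))
    return M @ X.T @ y

beta_full = pc_estimator(2)   # m = p: recovers OLS, with its large variance
beta_trunc = pc_estimator(1)  # biased, but drops the small-eigenvalue term
```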
Problems with PCA
PCA is not without its problems and limitations
PCA assumes approximate normality of the input space distribution.
PCA may still be able to produce a good low-dimensional projection of the data even if the data isn't normally distributed.
PCA may fail if the data lies on a complicated manifold.
PCA assumes that the input data is real and continuous.
Extensions to consider
Collins et al., A generalization of principal components analysis to the exponential family.
Hyvärinen, A. and Oja, E., Independent component analysis: algorithms and applications.
ISOMAP, LLE, maximum variance unfolding, etc.
Non-normal data
Figure: 2d Beta(.1, .1) Samples with PCs
Figure: PCA Projected