PCA Slides


    Principal Component Analysis

    Frank Wood

    December 8, 2009

    This lecture borrows and quotes from Jolliffe's Principal Component Analysis book. Go buy it!


    Principal Component Analysis

    The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. [Jolliffe, Principal Component Analysis, 2nd edition]


    Data distribution (inputs in regression analysis)

    Figure: Gaussian PDF


    Uncorrelated projections of principal variation

    Figure: Gaussian PDF with PC eigenvectors


    PCA rotation

    Figure: PCA Projected Gaussian PDF


    PCA in a nutshell

    Notation

    x is a vector of p random variables

    \alpha_k is a vector of p constants

    \alpha_k' x = \sum_{j=1}^{p} \alpha_{kj} x_j

    Procedural description

    Find a linear function of x, \alpha_1' x, with maximum variance.

    Next find another linear function of x, \alpha_2' x, uncorrelated with \alpha_1' x and with maximum variance.

    Iterate.

    Goal

    It is hoped, in general, that most of the variation in x will be accounted for by m PCs, where m \ll p.
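
    As a rough illustration of this procedure (a sketch, not slide code: the toy covariance matrix and the helper name leading_direction are our own), power iteration finds the direction of maximum variance, and deflation removes that variation before the next pass:

        import numpy as np

        def leading_direction(C, n_iter=1000):
            """Unit vector maximizing v' C v, found by power iteration."""
            v = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
            for _ in range(n_iter):
                v = C @ v
                v /= np.linalg.norm(v)
            return v

        # Toy covariance matrix, assumed for illustration only.
        C = np.array([[4.5, 1.5],
                      [1.5, 1.0]])

        alpha1 = leading_direction(C)
        var1 = alpha1 @ C @ alpha1              # variance of the first PC

        # Deflate: remove the variation already explained, then repeat.
        C2 = C - var1 * np.outer(alpha1, alpha1)
        alpha2 = leading_direction(C2)

        print(alpha1, var1)                     # first PC direction and its variance
        print(alpha2, alpha1 @ alpha2)          # second direction, orthogonal to the first

    The following slides show that this procedure amounts to an eigendecomposition of the covariance matrix, so in practice one calls an eigensolver rather than iterating by hand.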


    Derivation of PCA

    Assumption and More Notation

    \Sigma is the known covariance matrix for the random variable x.

    Foreshadowing: \Sigma will be replaced with S, the sample covariance matrix, when \Sigma is unknown.

    Shortcut to solution

    For k = 1, 2, \ldots, p the kth PC is given by z_k = \alpha_k' x, where \alpha_k is an eigenvector of \Sigma corresponding to its kth largest eigenvalue \lambda_k.

    If \alpha_k is chosen to have unit length (i.e. \alpha_k' \alpha_k = 1) then Var(z_k) = \lambda_k.
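
    As a quick numerical check of this shortcut (a sketch with an assumed 2-dimensional \Sigma; the sample size and seed are arbitrary):

        import numpy as np

        rng = np.random.default_rng(1)
        Sigma = np.array([[4.5, 1.5],
                          [1.5, 1.0]])            # assumed known covariance

        # Eigendecomposition; eigh returns eigenvalues in ascending order.
        lam, alpha = np.linalg.eigh(Sigma)
        lam, alpha = lam[::-1], alpha[:, ::-1]    # reorder to descending

        # Sample x, form z_k = alpha_k' x, and compare Var(z_k) with lambda_k.
        x = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=200_000)
        z = x @ alpha
        print(lam)                                # eigenvalues of Sigma
        print(z.var(axis=0, ddof=1))              # sample variances, close to the eigenvalues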


    Derivation of PCA

    First Step

    Find \alpha_k' x that maximizes Var(\alpha_k' x) = \alpha_k' \Sigma \alpha_k. Without a constraint we could pick a very big \alpha_k.

    Choose a normalization constraint, namely \alpha_k' \alpha_k = 1 (unit length vector).

    Constrained maximization - method of Lagrange multipliers

    To maximize \alpha_k' \Sigma \alpha_k subject to \alpha_k' \alpha_k = 1 we use the technique of Lagrange multipliers. We maximize the function

    \alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1)

    w.r.t. \alpha_k by differentiating w.r.t. \alpha_k.


    Derivation of PCA

    Constrained maximization - method of Lagrange multipliers

    This results in

    \frac{d}{d\alpha_k} \left[ \alpha_k' \Sigma \alpha_k - \lambda_k (\alpha_k' \alpha_k - 1) \right] = 0

    \Sigma \alpha_k - \lambda_k \alpha_k = 0

    \Sigma \alpha_k = \lambda_k \alpha_k

    This should be recognizable as an eigenvector equation where \alpha_k is an eigenvector of \Sigma and \lambda_k is the associated eigenvalue.

    Which eigenvector should we choose?


    Derivation of PCA

    Constrained maximization - method of Lagrange multipliers

    If we recognize that the quantity to be maximized is

    \alpha_k' \Sigma \alpha_k = \alpha_k' \lambda_k \alpha_k = \lambda_k \alpha_k' \alpha_k = \lambda_k

    then we should choose \lambda_k to be as big as possible. So, calling \lambda_1 the largest eigenvalue of \Sigma and \alpha_1 the corresponding eigenvector, the solution to

    \Sigma \alpha_1 = \lambda_1 \alpha_1

    gives \alpha_1' x, the 1st principal component of x.

    In general \alpha_k' x will be the kth PC of x and Var(\alpha_k' x) = \lambda_k.

    We will demonstrate this for k = 2; k > 2 is more involved but similar.


    Derivation of PCA

    Constrained maximization - more constraints

    The second PC, \alpha_2' x, maximizes \alpha_2' \Sigma \alpha_2 subject to being uncorrelated with \alpha_1' x.

    The uncorrelation constraint can be expressed using any of these equations

    cov(\alpha_1' x, \alpha_2' x) = \alpha_1' \Sigma \alpha_2 = \alpha_2' \Sigma \alpha_1 = \alpha_2' \lambda_1 \alpha_1 = \lambda_1 \alpha_2' \alpha_1 = \lambda_1 \alpha_1' \alpha_2 = 0

    Of these, if we choose the last, we can write a Lagrangian to maximize \alpha_2' \Sigma \alpha_2:

    \alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1


    Derivation of PCA

    Constrained maximization - more constraints

    Differentiation of this quantity w.r.t. \alpha_2 (and setting the result equal to zero) yields

    \frac{d}{d\alpha_2} \left[ \alpha_2' \Sigma \alpha_2 - \lambda_2 (\alpha_2' \alpha_2 - 1) - \phi \alpha_2' \alpha_1 \right] = 0

    \Sigma \alpha_2 - \lambda_2 \alpha_2 - \phi \alpha_1 = 0

    If we left multiply \alpha_1' into this expression

    \alpha_1' \Sigma \alpha_2 - \lambda_2 \alpha_1' \alpha_2 - \phi \alpha_1' \alpha_1 = 0

    0 - 0 - \phi \cdot 1 = 0

    then we can see that \phi must be zero, and that when this is true we are left with

    \Sigma \alpha_2 - \lambda_2 \alpha_2 = 0


    Derivation of PCA

    Clearly

    \Sigma \alpha_2 - \lambda_2 \alpha_2 = 0

    is another eigenvalue equation, and the same strategy of choosing \alpha_2 to be the eigenvector associated with the second largest eigenvalue yields the second PC of x, namely \alpha_2' x.

    This process can be repeated for k = 1, \ldots, p, yielding up to p different eigenvectors of \Sigma along with the corresponding eigenvalues \lambda_1, \ldots, \lambda_p.

    Furthermore, the variance of each of the PCs is given by

    Var[\alpha_k' x] = \lambda_k,  k = 1, 2, \ldots, p


    Properties of PCA

    For any integer q, 1 \le q \le p, consider the orthonormal linear transformation

    y = B' x

    where y is a q-element vector and B' is a (q \times p) matrix, and let \Sigma_y = B' \Sigma B be the variance-covariance matrix for y. Then the trace of \Sigma_y, denoted tr(\Sigma_y), is maximized by taking B = A_q, where A_q consists of the first q columns of A.

    What this means is that if you want to choose a lower dimensional projection of x, the choice of B described here is probably a good one. It maximizes the (retained) variance of the resulting variables.

    In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first q PCs is given by

    \frac{\sum_{k=1}^{q} \lambda_k}{\sum_{k=1}^{p} \lambda_k} \times 100
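
    For example, the retained-variance percentage is just a ratio of eigenvalue sums (a sketch; the eigenvalues here are assumed, not computed from the slides' data):

        import numpy as np

        lam = np.array([5.05, 0.45])    # eigenvalues of an assumed Sigma, largest first
        q = 1
        pct = 100 * lam[:q].sum() / lam.sum()
        print(f"first {q} PC(s) retain {pct:.1f}% of the variance")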


    PCA using the sample covariance matrix

    If we recall that the sample covariance matrix (an unbiased estimator for the covariance matrix of x) is given by

    S = \frac{1}{n-1} X' X

    where X is an (n \times p) matrix with (i, j)th element (x_{ij} - \bar{x}_j) (in other words, X is a zero mean design matrix), and we construct the matrix A by combining the p eigenvectors of S (or the eigenvectors of X'X; they are the same), then we can define a matrix of PC scores

    Z = XA

    Of course, if we instead form Z by selecting the q eigenvectors corresponding to the q largest eigenvalues of S when forming A, then we can achieve an optimal (in some senses) q-dimensional projection of x.
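
    A compact NumPy sketch of this slide, with simulated data standing in for the observations (the sample size, seed, and names are our own):

        import numpy as np

        rng = np.random.default_rng(2)
        raw = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=1000)

        # Zero mean design matrix X and sample covariance S = X'X / (n - 1).
        X = raw - raw.mean(axis=0)
        n = X.shape[0]
        S = X.T @ X / (n - 1)

        # Columns of A are the eigenvectors of S, ordered by decreasing eigenvalue.
        lam, A = np.linalg.eigh(S)
        order = np.argsort(lam)[::-1]
        lam, A = lam[order], A[:, order]

        Z = X @ A              # matrix of PC scores
        Zq = X @ A[:, :1]      # q = 1: the optimal 1-d projection in the retained-variance sense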


    Sample Covariance Matrix PCA

    Figure: Gaussian Samples


    Sample Covariance Matrix PCA

    Figure: Gaussian Samples with eigenvectors of sample covariance matrix


    Sample Covariance Matrix PCA

    Figure: PC projected samples


    Sample Covariance Matrix PCA

    Figure: PC dimensionality reduction step


    Sample Covariance Matrix PCA

    Figure: PC dimensionality reduction step


    PCA in linear regression

    PCA is useful in linear regression in several ways

    Identification and elimination of multicolinearities in the data.

    Reduction in the dimension of the input space, leading to fewer parameters and easier regression.

    Related to the last point, the variance of the regression coefficient estimator is minimized by the PCA choice of basis.

    We will consider the following example.

    x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)

    y = X [1\ 2]' when no colinearities are present (no noise)

    x_{i3} = 0.8\, x_{i1} + 0.5\, x_{i2} imposed colinearity
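
    A simulation sketch of this setup (the sample size, seed, and variable names are assumptions; the slides do not give them):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 200

        # Two informative inputs.
        X2 = rng.multivariate_normal(mean=[2, 5], cov=[[4.5, 1.5], [1.5, 1.0]], size=n)

        # Noiseless linear relationship in the no-colinearity case.
        y = X2 @ np.array([1.0, 2.0])

        # Impose the colinearity: a third column that is an exact linear
        # combination of the first two.
        x3 = 0.8 * X2[:, 0] + 0.5 * X2[:, 1]
        X3 = np.column_stack([X2, x3])    # 3-column design matrix of rank 2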


    Noiseless Linear Relationship with No Colinearity

    Figure: y = x [1\ 2]' + 5, x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)


    Noiseless Planar Relationship

    Figure: y = x [1\ 2]' + 5, x \sim N\!\left( [2\ 5]',\ \begin{bmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{bmatrix} \right)


    Projection of colinear data

    The figures before showed the data without the third colinear design matrix column. Plotting such data is not possible, but its colinearity is obvious by design.

    When PCA is applied to a design matrix of rank q less than p, the number of positive eigenvalues discovered is equal to q, the true rank of the design matrix.

    If the number of PCs retained is at least q (and the data is perfectly colinear, etc.), all of the variance of the data is retained in the low dimensional projection.

    In this example, when PCA is run on the design matrix of rank 2, the resulting projection back into two dimensions has exactly the same distribution as before.
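
    A numerical illustration of the rank argument, re-creating the 3-column, rank-2 design from the example above (a sketch; simulation details assumed as before):

        import numpy as np

        rng = np.random.default_rng(3)
        X2 = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=200)
        X3 = np.column_stack([X2, 0.8 * X2[:, 0] + 0.5 * X2[:, 1]])

        # Center and eigendecompose the rank-2 design.
        Xc = X3 - X3.mean(axis=0)
        lam, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
        lam, A = lam[::-1], A[:, ::-1]

        print(lam)                         # two positive eigenvalues, the third is ~0
        Z2 = Xc @ A[:, :2]                 # projection onto the first two PCs
        print(lam[:2].sum() / lam.sum())   # ~1.0: essentially all variance is retained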


    Projection of colinear data

    Figure: Projection of multi-colinear data onto first two PCs


    Reduction in regression coefficient estimator variance

    If we take the standard regression model

    y = X\beta + \epsilon

    and consider instead the PCA rotation of X given by

    Z = XA

    then we can rewrite the regression model in terms of the PCs

    y = Z\gamma + \epsilon.

    We can also consider the reduced model

    y = Z_q \gamma_q + \epsilon_q

    where only the first q PCs are retained.


    Reduction in regression coefficient estimator variance

    If we rewrite the regression relation as

    y = Z\gamma + \epsilon,

    then we can, because A is orthogonal, rewrite

    X\beta = XAA'\beta = Z\gamma

    where \gamma = A'\beta.

    Clearly, using least squares (or ML) to learn \hat{\gamma} = A'\hat{\beta} is equivalent to learning \hat{\beta} directly.

    And, as usual,

    \hat{\gamma} = (Z'Z)^{-1} Z' y

    so

    \hat{\beta} = A (Z'Z)^{-1} Z' y

    Reduction in regression coefficient estimator variance

    Without derivation we note that the variance-covariance matrix of \hat{\beta} is given by

    Var(\hat{\beta}) = \sigma^2 \sum_{k=1}^{p} l_k^{-1} a_k a_k'

    where l_k is the kth largest eigenvalue of X'X, a_k is the kth column of A, and \sigma^2 is the observation noise variance, i.e. \epsilon \sim N(0, \sigma^2 I).

    This sheds light on how multicolinearities produce large variances for the elements of \hat{\beta}. If an eigenvalue l_k is small then the resulting variance of the estimator will be large.


    Reduction in regression coefficient estimator variance

    One way to avoid this is to ignore those PCs that are associated with small eigenvalues, namely, use the biased estimator

    \tilde{\beta} = \sum_{k=1}^{m} l_k^{-1} a_k a_k' X' y

    where l_{1:m} are the large eigenvalues of X'X and l_{m+1:p} are the small ones.

    Var(\tilde{\beta}) = \sigma^2 \sum_{k=1}^{m} l_k^{-1} a_k a_k'

    This is a biased estimator but, since its variance is smaller, it is possible that this could be an advantage.

    Homework: find the bias of this estimator. Hint: use the spectral decomposition of X'X.
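
    A sketch of this truncated estimator on nearly colinear simulated data (m, the noise levels, and all names are our own choices):

        import numpy as np

        rng = np.random.default_rng(5)
        n = 200
        X2 = rng.multivariate_normal([2, 5], [[4.5, 1.5], [1.5, 1.0]], size=n)
        # Nearly colinear third column -> one tiny eigenvalue of X'X.
        x3 = 0.8 * X2[:, 0] + 0.5 * X2[:, 1] + rng.normal(scale=0.01, size=n)
        X = np.column_stack([X2, x3])
        X = X - X.mean(axis=0)
        y = X[:, :2] @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)

        # Eigenvalues l_k and eigenvectors a_k of X'X, largest first.
        l, A = np.linalg.eigh(X.T @ X)
        order = np.argsort(l)[::-1]
        l, A = l[order], A[:, order]

        m = 2    # keep only the PCs with large eigenvalues
        beta_trunc = sum(np.outer(A[:, k], A[:, k]) / l[k] for k in range(m)) @ X.T @ y

        beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
        print(beta_trunc)    # biased but lower-variance
        print(beta_ols)      # unbiased but unstable under near-colinearity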


    Problems with PCA

    PCA is not without its problems and limitations.

    PCA assumes approximate normality of the input space distribution. (PCA may still be able to produce a good low dimensional projection of the data even if the data isn't normally distributed.)

    PCA may fail if the data lies on a complicated manifold.

    PCA assumes that the input data is real and continuous.

    Extensions to consider:

    Collins et al., A generalization of principal components analysis to the exponential family.

    Hyvarinen, A. and Oja, E., Independent component analysis: algorithms and applications.

    ISOMAP, LLE, maximum variance unfolding, etc.


    Non-normal data

    Figure: 2d Beta(.1, .1) Samples with PCs

    Non-normal data


    Figure: PCA Projected
