6. Principal Component Analysis

1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006

Page 1: 6.Principal Component Analysis

Principal component analysis (PCA) is a technique that is useful for the compression and classification of data.

The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.

Page 2: 6.Principal Component Analysis

By "information" we mean the variation present in the sample, given by the correlations between the original variables.

The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

Page 3: 6.Principal Component Analysis

Principal component:

• direction of maximum variance in the input space

• principal eigenvector of the covariance matrix

Goal: relate these two definitions

Page 4: 6.Principal Component Analysis

• Variance

A random variable x fluctuates about its mean value $\langle x \rangle$. The variance is the average of the square of the fluctuations:

$\delta x = x - \langle x \rangle$

$\langle \delta x^2 \rangle = \langle (x - \langle x \rangle)^2 \rangle = \langle x^2 \rangle - \langle x \rangle^2$

Page 5: 6.Principal Component Analysis

• Covariance

A pair of random variables, each fluctuating about its mean value. The covariance is the average of the product of the fluctuations:

$\operatorname{cov}(x_1, x_2) = \langle (x_1 - \langle x_1 \rangle)(x_2 - \langle x_2 \rangle) \rangle = \langle x_1 x_2 \rangle - \langle x_1 \rangle \langle x_2 \rangle$
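The slides leave the check to the reader; here is a minimal NumPy sketch (the synthetic sample and coefficients are purely illustrative, not from the slides) confirming that the two forms of the covariance agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated random variables (illustrative sample).
x1 = rng.normal(size=10_000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=10_000)

# Covariance as the average product of fluctuations about the means.
cov_fluct = np.mean((x1 - x1.mean()) * (x2 - x2.mean()))

# Equivalent form: <x1 x2> - <x1><x2>.
cov_moments = np.mean(x1 * x2) - x1.mean() * x2.mean()

print(cov_fluct, cov_moments)  # the two forms agree
```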

Page 6: 6.Principal Component Analysis

[Figure slide; no text survived extraction]

Page 7: 6.Principal Component Analysis

• Covariance matrix

For N random variables $x_1, \dots, x_N$, the covariances form an N×N symmetric matrix whose diagonal elements are the variances:

$C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$
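A minimal sketch of assembling this matrix directly from samples; the data is synthetic and illustrative, and np.cov is used only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(1)

# n samples of N = 3 random variables; rows are observations.
X = rng.normal(size=(5_000, 3))
X[:, 2] = 0.5 * X[:, 0] + X[:, 2]          # introduce correlation

mu = X.mean(axis=0)
C = (X.T @ X) / len(X) - np.outer(mu, mu)  # C_ij = <x_i x_j> - <x_i><x_j>

# Symmetric, with variances on the diagonal; matches np.cov up to
# the 1/n vs 1/(n-1) normalization.
print(np.allclose(C, C.T))
print(np.allclose(C, np.cov(X.T, bias=True)))
```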

Page 8: 6.Principal Component Analysis

Principal components

• the eigenvectors with the k largest eigenvalues of the covariance matrix

Now you can calculate them, but what do they mean?

Page 9: 6.Principal Component Analysis

• Covariance to variance

From the covariance matrix, the variance of any projection can be calculated. Let w be a unit vector; then the variance of the projection $w^T x$ is

$\langle (w^T x)^2 \rangle - \langle w^T x \rangle^2 = w^T C w = \sum_{ij} w_i C_{ij} w_j$
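Continuing the same kind of illustrative setup, a quick numerical check that the variance of the projected data equals the quadratic form $w^T C w$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 3))
X[:, 2] = 0.5 * X[:, 0] + X[:, 2]

C = np.cov(X.T, bias=True)

w = np.array([1.0, 2.0, -1.0])
w /= np.linalg.norm(w)            # unit vector

proj = X @ w                      # w^T x for every observation
var_direct = proj.var()           # <(w^T x)^2> - <w^T x>^2
var_quadratic = w @ C @ w         # w^T C w

print(np.allclose(var_direct, var_quadratic))
```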

Page 10: 6.Principal Component Analysis

• Maximizing parallel variance

The maximizer is the principal eigenvector of C (the one with the largest eigenvalue):

$w^{*} = \operatorname*{arg\,max}_{\|w\|=1} w^T C w$

$\lambda_{\max}(C) = \max_{\|w\|=1} w^T C w = w^{*T} C w^{*}$
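The slides do not show code for this step; a minimal sketch using np.linalg.eigh (valid because C is symmetric), with a random search included only to illustrate that no unit vector beats the principal eigenvector:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5_000, 3))
X[:, 2] = 0.5 * X[:, 0] + X[:, 2]
C = np.cov(X.T, bias=True)

# eigh returns eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
w_star, lam_max = eigvecs[:, -1], eigvals[-1]

# No random unit vector achieves a larger projected variance.
W = rng.normal(size=(1_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(np.max(np.einsum('ki,ij,kj->k', W, C, W)) <= lam_max + 1e-12)
print(np.isclose(w_star @ C @ w_star, lam_max))
```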

Page 11: 6.Principal Component Analysis

Geometric picture of principal components

• the 1st PC $z_1$ is a minimum-distance fit to a line in X space

• the 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC

Pages 12-26: 6.Principal Component Analysis

[Figure slides; no text survived extraction]

Page 27: 6.Principal Component Analysis

Algebraic definition of PCs

Given a sample of n observations on a vector of p variables

$x = (x_1, x_2, \dots, x_p)$

define the first principal component of the sample by the linear transformation

$z_1 = a_1^T x = \sum_{i=1}^{p} a_{i1} x_i$

where the vector $a_1 = (a_{11}, a_{21}, \dots, a_{p1})$ is chosen such that $\operatorname{var}[z_1]$ is maximum.

Page 28: 6.Principal Component Analysis

Likewise, define the kth PC of the sample by the linear transformation

$z_k = a_k^T x, \qquad k = 1, \dots, p$

where the vector $a_k = (a_{1k}, a_{2k}, \dots, a_{pk})$ is chosen such that $\operatorname{var}[z_k]$ is maximum, subject to

$\operatorname{cov}[z_k, z_l] = 0$ for $l < k$, and to $a_k^T a_k = 1$.

Page 29: 6.Principal Component Analysis

To find $a_1$, first note that

$\operatorname{var}[z_1] = \langle z_1^2 \rangle - \langle z_1 \rangle^2 = a_1^T S a_1$

where

$S = \langle x x^T \rangle - \langle x \rangle \langle x \rangle^T$

is the covariance matrix for the variables $x$.

Page 30: 6.Principal Component Analysis

To find $a_1$, maximize $\operatorname{var}[z_1] = a_1^T S a_1$ subject to $a_1^T a_1 = 1$.

Let λ be a Lagrange multiplier, and maximize

$a_1^T S a_1 - \lambda\,(a_1^T a_1 - 1)$

By differentiating with respect to $a_1$ and setting the derivative to zero:

$S a_1 - \lambda a_1 = 0$

therefore $a_1$ is an eigenvector of $S$ corresponding to the eigenvalue $\lambda = \lambda_1$.

Page 31: 6.Principal Component Analysis

Algebraic derivation of $\lambda_1$

We have maximized

$\operatorname{var}[z_1] = a_1^T S a_1 = \lambda_1 a_1^T a_1 = \lambda_1$

So $\lambda_1$ is the largest eigenvalue of $S$.

The first PC $z_1$ retains the greatest amount of variation in the sample.

Page 32: 6.Principal Component Analysis

To find the vector $a_2$, maximize $\operatorname{var}[z_2] = a_2^T S a_2$ subject to $\operatorname{cov}[z_2, z_1] = 0$ and $a_2^T a_2 = 1$.

First note that

$\operatorname{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2$

Then let λ and φ be Lagrange multipliers, and maximize

$a_2^T S a_2 - \lambda\,(a_2^T a_2 - 1) - \phi\, a_2^T a_1$

Page 33: 6.Principal Component Analysis

We find that $a_2$ is also an eigenvector of $S$, the one whose eigenvalue $\lambda_2$ is the second largest. In general,

$\operatorname{var}[z_k] = a_k^T S a_k = \lambda_k$

The kth largest eigenvalue of $S$ is the variance of the kth PC.

The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
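A numerical sketch of this result on synthetic data (the setup is mine, purely illustrative): the per-component variances of the scores reproduce the eigenvalues in descending order:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5_000, 4)) @ rng.normal(size=(4, 4))  # correlated variables
X -= X.mean(axis=0)                                        # measure about the mean

S = np.cov(X.T, bias=True)
eigvals, A = np.linalg.eigh(S)
eigvals, A = eigvals[::-1], A[:, ::-1]      # descending order: λ1 ≥ λ2 ≥ ...

Z = X @ A                                   # PC scores z_k for every observation
print(np.allclose(Z.var(axis=0), eigvals))  # var[z_k] = λ_k, in order
```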

Page 34: 6.Principal Component Analysis

Algebraic formulation of PCA

Given a sample of n observations on a vector of p variables $x$, define a vector of p PCs

$z = A^T x$

where $A$ is an orthogonal p × p matrix whose kth column is the kth eigenvector $a_k$ of $S$.

Page 35: 6.Principal Component Analysis

Then

$\Lambda = \operatorname{var}[z] = A^T S A$

is the covariance matrix of the PCs, which is diagonal with elements $\lambda_k$.

Page 36: 6.Principal Component Analysis

In general it is useful to define standardized variables by

$x_k^* = x_k \big/ \sqrt{\operatorname{var}(x_k)}$

If the $x_k$ are each measured about their sample mean, then the covariance matrix of $x^*$ will be equal to the correlation matrix of $x$, and the PCs will be dimensionless.
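A minimal sketch of the standardization step on synthetic data with deliberately mixed scales, using np.corrcoef only as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5_000, 3)) * np.array([1.0, 10.0, 100.0])  # mixed scales

Xc = X - X.mean(axis=0)          # measure about the sample mean
Xs = Xc / Xc.std(axis=0)         # standardize: divide by sqrt(var)

# Covariance of the standardized variables = correlation of the originals.
print(np.allclose(np.cov(Xs.T, bias=True), np.corrcoef(X.T)))
```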

Page 37: 6.Principal Component Analysis

Practical computation of PCs

Given a sample of n observations on a vector of p variables $x$ (each measured about its sample mean), compute the covariance matrix

$S = \frac{1}{n} X^T X$

where $X$ is the n × p matrix whose ith row is the ith observation $x_i^T$.
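A minimal sketch of this computation (the 1/n normalization follows the slide; np.cov with bias=True matches it):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1_000, 5))  # n = 1000 observations of p = 5 variables
X -= X.mean(axis=0)              # each variable measured about its sample mean

n = len(X)
S = (X.T @ X) / n                # S = (1/n) X^T X, a p x p matrix
print(np.allclose(S, np.cov(X.T, bias=True)))
```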

Page 38: 6.Principal Component Analysis

Then, compute the n × p matrix

$Z = X A$

whose ith row is the PC score $z_i^T$ for the ith observation.

Write $X = Z A^T$ to decompose each observation into PCs:

$x_i = \sum_{k=1}^{p} z_{ik}\, a_k$
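Continuing the sketch above, the decomposition is exact because $A$ is orthogonal:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1_000, 5))
X -= X.mean(axis=0)

S = (X.T @ X) / len(X)
eigvals, A = np.linalg.eigh(S)
A = A[:, ::-1]                   # columns = eigenvectors, descending eigenvalue

Z = X @ A                        # PC scores, one row per observation
print(np.allclose(X, Z @ A.T))   # X = Z A^T exactly (A orthogonal)
```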

Page 39: 6.Principal Component Analysis

Usage of PCA: data compression

Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation $x_i$ by truncating the sum at the first m < p PCs:

$\tilde{x}_i = \sum_{k=1}^{m} z_{ik}\, a_k$

Page 40: 6.Principal Component Analysis

This reduces the dimensionality of the data from p to m < p by approximating

$X \approx \tilde{X} = Z_m A_m^T$

where $Z_m$ is the n × m portion of $Z$ and $A_m$ is the p × m portion of $A$.
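A closing sketch of the compression step on synthetic data: the mean squared reconstruction error is governed by the discarded eigenvalues, so it shrinks as m grows:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(1_000, 5)) @ rng.normal(size=(5, 5))  # correlated data
X -= X.mean(axis=0)

S = (X.T @ X) / len(X)
eigvals, A = np.linalg.eigh(S)
eigvals, A = eigvals[::-1], A[:, ::-1]

Z = X @ A
for m in range(1, 6):
    X_m = Z[:, :m] @ A[:, :m].T  # Z_m A_m^T: rank-m approximation of X
    mse = np.mean((X - X_m) ** 2)
    # The MSE equals the sum of the discarded eigenvalues divided by p.
    print(m, mse, eigvals[m:].sum() / X.shape[1])
```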