
Principal Component Analysis

CS5240 Theoretical Foundations in Multimedia

Leow Wee Kheng

Department of Computer Science

School of Computing

National University of Singapore


Motivation

How wide is the widest part of NGC 1300?


How thick is the thickest part of NGC 4594?

Use principal component analysis.

Maximum Variance Estimate

Consider a set of points xi, i = 1, . . . , n, in an m-dimensional space such that their mean µ = 0, i.e., the centroid is at the origin.

[Figure: data points x1, . . . , x4 and a line l through the origin O with unit vector q; yi is the projection of xi on l and di is the perpendicular distance of xi from l.]

We want to find a line l through the origin that maximizes the variance of the projections yi of the points xi on l.


Let q denote the unit vector along line l. Then, the projection yi of xi on l is

$$y_i = x_i^\top q. \tag{1}$$

The mean squared projection, which is the variance V over all points, is

$$V = \frac{1}{n}\sum_{i=1}^{n} y_i^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i^\top q\right)^2 \ge 0. \tag{2}$$

Because $x_i^\top q = q^\top x_i$, expanding Eq. 2 gives

$$V = \frac{1}{n}\sum_{i=1}^{n} \left(q^\top x_i\right)\left(x_i^\top q\right) = q^\top \left[\frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top\right] q. \tag{3}$$

The middle factor is the covariance matrix C of the data points

$$C = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top. \tag{4}$$


We want to find a unit vector q that maximizes the variance V , i.e.,

maximize V = q⊤Cq subject to ‖q‖ = 1. (5)

This is a constrained optimization problem.

Use the Lagrange multiplier method: combine V and the constraint using a Lagrange multiplier λ

maximize V ′ = q⊤Cq− λ(q⊤q− 1). (6)

Lagrange Multiplier Method

The Lagrange multiplier method is a method for solving constrained optimization problems.

Consider this problem:

maximize f(x) subject to g(x) = c. (7)

The Lagrange multiplier method introduces a Lagrange multiplier λ to combine f(x) and g(x):

L(x, λ) = f(x) + λ (g(x)− c). (8)

The sign of λ can be positive or negative.

Then, solve for the stationary point of L(x, λ):

$$\frac{\partial L(x, \lambda)}{\partial x} = 0. \tag{9}$$


◮ If x0 is a solution of the original problem, then there is a λ0 such that (x0, λ0) is a stationary point of L.

◮ Not all stationary points yield solutions of the original problem.

◮ The same method applies to minimization problems.

◮ Multiple constraints can be combined by adding multiple terms.

minimize f(x) subject to g1(x) = c1, g2(x) = c2 (10)

is solved with

L(x) = f(x) + λ1(g1(x)− c1) + λ2(g2(x)− c2). (11)
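A quick worked example (my own, not from the slides) may help: maximize f(x, y) = x + y subject to g(x, y) = x² + y² = 1.

$$L(x, y, \lambda) = x + y + \lambda\left(x^2 + y^2 - 1\right), \qquad \frac{\partial L}{\partial x} = 1 + 2\lambda x = 0, \quad \frac{\partial L}{\partial y} = 1 + 2\lambda y = 0.$$

So x = y = −1/(2λ). Substituting into the constraint gives 1/(2λ²) = 1, i.e., λ² = 1/2, and the stationary points are (x, y) = ±(1/√2, 1/√2). The point (1/√2, 1/√2) attains the maximum f = √2; the other stationary point is the minimum, which illustrates that not every stationary point of L solves the original maximization.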


Now, we differentiate V ′ with respect to q and set to 0:

$$\frac{\partial V'}{\partial q} = 2 q^\top C - 2 \lambda q^\top = 0. \tag{12}$$

Rearranging the terms gives

q⊤C = λq⊤. (13)

Since the covariance matrix C is symmetric (homework), C⊤ = C. So, transposing both sides of Eq. 13 gives

Cq = λq. (14)

◮ Eq. 14 is called an eigenvector equation.

◮ q is the eigenvector.

◮ λ is the eigenvalue.

Thus, the eigenvector q of C gives the line that maximizes variance V .
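A quick numerical check of this result (a sketch of mine, not part of the slides): the variance q⊤Cq attained by the eigenvector of the largest eigenvalue should be at least that of any other unit direction.

import numpy as np

rng = np.random.default_rng(0)
# Zero-mean 2-D points, stretched along one axis so the answer is known.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 1.0]])
X -= X.mean(axis=0)

C = X.T @ X / len(X)                       # covariance matrix, Eq. 4
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
q = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

def proj_var(u):
    u = u / np.linalg.norm(u)
    return float(u @ C @ u)                # q^T C q, the objective of Eq. 5

# The variance along q is at least the variance along random unit directions.
others = rng.normal(size=(100, 2))
assert all(proj_var(q) >= proj_var(u) - 1e-12 for u in others)
print(proj_var(q), eigvals[-1])            # both equal the largest eigenvalue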


The perpendicular distance di of xi from the line l is

$$d_i = \left\| x_i - y_i q \right\| = \left\| x_i - \left(x_i^\top q\right) q \right\|. \tag{15}$$


The squared distance is

$$d_i^2 = \left\| x_i - \left(x_i^\top q\right) q \right\|^2 = \left(x_i - \left(x_i^\top q\right) q\right)^\top \left(x_i - \left(x_i^\top q\right) q\right) = x_i^\top x_i - x_i^\top \left(x_i^\top q\right) q - \left(x_i^\top q\right) q^\top x_i + \left(x_i^\top q\right) q^\top \left(x_i^\top q\right) q.$$

With $x_i^\top q$ being a scalar and $x_i^\top q = q^\top x_i$, we obtain

$$d_i^2 = x_i^\top x_i - \left(x_i^\top q\right)^2 - \left(x_i^\top q\right)^2 + \left(x_i^\top q\right)^2 \left(q^\top q\right) = x_i^\top x_i - \left(x_i^\top q\right)^2.$$

Averaging over all i gives

$$D = \frac{1}{n}\sum_{i=1}^{n} d_i^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^\top x_i - V. \tag{16}$$

So, maximizing V means minimizing D.

The eigenvalue λ = V (homework); so λ ≥ 0.


Name q as the first eigenvector q1. The component of xi orthogonal to q1 is $x_i' = x_i - \left(x_i^\top q_1\right) q_1$.

Repeat the previous method on $x_i'$ in the (m−1)-D subspace orthogonal to q1 to get q2. Then, repeat to get q3, . . . , qm.

[Figure: the first eigenvector q1 and the subspace orthogonal to q1 in which the search is repeated.]

Eigendecomposition

In general, an m×m covariance matrix C has m eigenvectors:

Cqj = λj qj , j = 1, . . . ,m. (17)

Transpose both sides of the equations to get

$$q_j^\top C = \lambda_j q_j^\top, \quad j = 1, \ldots, m, \tag{18}$$

which are row vectors. Stack the row vectors vertically to get

$$\begin{bmatrix} q_1^\top \\ \vdots \\ q_m^\top \end{bmatrix} C =
\begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_m \end{bmatrix}
\begin{bmatrix} q_1^\top \\ \vdots \\ q_m^\top \end{bmatrix}. \tag{19}$$

Denote the matrices Q and Λ as

Q = [q1 · · · qm], Λ = diag(λ1, . . . , λm). (20)


Then, Eq. 19 becomes

$$Q^\top C = \Lambda Q^\top \tag{21}$$

or

$$C = Q \Lambda Q^\top. \tag{22}$$

This is the matrix equation for eigendecomposition.

Properties:

◮ The eigenvectors are orthonormal:

$$q_j^\top q_j = 1, \qquad q_j^\top q_k = 0 \ \text{ for } k \ne j. \tag{23}$$

So, the eigenmatrix Q is orthogonal:

Q⊤Q = QQ⊤ = I. (24)

◮ The eigenvalues are arranged in descending order: λj ≥ λj+1.
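As a small sanity check (a sketch of mine, not from the slides), numpy's eigendecomposition can be used to confirm Eq. 22–24 on a random covariance matrix:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X -= X.mean(axis=0)
C = X.T @ X / len(X)                       # covariance of zero-mean data

eigvals, Q = np.linalg.eigh(C)             # eigh suits symmetric matrices; ascending order
order = np.argsort(eigvals)[::-1]          # re-sort so that λ_1 ≥ ... ≥ λ_m
eigvals, Q = eigvals[order], Q[:, order]
Lam = np.diag(eigvals)

assert np.allclose(C, Q @ Lam @ Q.T)       # C = QΛQ^T   (Eq. 22)
assert np.allclose(Q.T @ Q, np.eye(4))     # Q is orthogonal  (Eq. 24)
assert np.all(np.diff(eigvals) <= 1e-12)   # eigenvalues in descending order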

General PCA

In general, the mean µ of the data points xi is not at the origin. In this case, we subtract µ from each xi, obtaining shifted or zero-mean data points xi − µ.

PCA transforms xi into a new vector yi through Q as follows:

$$y_i = Q^\top (x_i - \mu) = \begin{bmatrix} q_1^\top (x_i - \mu) \\ \vdots \\ q_m^\top (x_i - \mu) \end{bmatrix} = \sum_{j=1}^{m} (x_i - \mu)^\top q_j \, q_j, \tag{25}$$

Each component of yi is

yij = (xi − µ)⊤qj . (26)

This is the projection of xi − µ on qj .


The original xi can be recovered from yi:

xi = Qyi + µ. (27)

Caution!

◮ xi ≠ yi + µ.

◮ xi is in the original input space but yi is in the eigenspace. Let xj denote the unit vectors that form the input space; qj are the eigenvectors that form the eigenspace. Then,

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) = \sum_{j=1}^{m} x_{ij}\, x_j, \qquad y_i = (y_{i1}, y_{i2}, \ldots, y_{im}) = \sum_{j=1}^{m} y_{ij}\, q_j.$$
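The caution above is easy to exercise in code. A small sketch (my own, using the eigendecomposition from the previous slides) showing that x is recovered by Qy + µ, not by y + µ:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) + np.array([5.0, -2.0, 1.0])   # data with non-zero mean
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)
eigvals, Q = np.linalg.eigh(C)
Q = Q[:, np.argsort(eigvals)[::-1]]        # columns q_1, ..., q_m, largest variance first

x = X[0]
y = Q.T @ (x - mu)                         # Eq. 25: coordinates in the eigenspace
x_back = Q @ y + mu                        # Eq. 27: reconstruction in the input space

assert np.allclose(x_back, x)              # exact recovery when all m eigenvectors are kept
assert not np.allclose(y + mu, x)          # y + µ is NOT the original point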

Properties of PCA

◮ The mean µy over all yi is 0 (homework).

◮ The variance σj² along qj is λj (homework).

◮ Since λ1 ≥ · · · ≥ λm ≥ 0, we have σ1 ≥ · · · ≥ σm ≥ 0.

◮ q1 gives the orientation of the largest variance.

◮ q2 gives the orientation of the largest variance orthogonal to q1 (2nd largest variance).

◮ qj gives the orientation of the largest variance orthogonal to q1, . . . , qj−1 (j-th largest variance).

◮ qm is orthogonal to all other eigenvectors (least variance).

[Figure sequence: a set of data points; the same points with the centroid shifted to the origin; the principal components overlaid. The sequence is repeated for another example data set.]

Special Cases: Points on a rectangle.

[Plots: four configurations of points on rectangles.]

Blue line: direction of 1st principal component.

To verify the results, run the program in Programming Homework 3 on these data.


[Plots: four more configurations of points on rectangles.]

Blue line: direction of 1st principal component.


Special Cases: Points on a square / diamond.

[Plots: four points at the corners of a square and of a diamond.]

Blue line: direction of 1st principal component.

This is a surprise! But it is easy to see why: the 4 points are equidistant from the origin, so they actually lie on a sphere!

A sphere has an infinite number of possible first principal componentsalong any diameter.


Special Cases: Points on a diamond.

[Plot: points on a diamond.]

What about this case? Try it out yourself!
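One way to try these special cases (a sketch of mine, not the Programming Homework 3 code) is to run PCA on the corner points directly and compare the eigenvalues:

import numpy as np

def pca(X):
    """Return eigenvalues (descending) and eigenvectors (columns) of the covariance of X."""
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu) / len(X)
    eigvals, Q = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], Q[:, order]

square = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])
rect = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, -1.0], [-2.0, 1.0]])

print(pca(rect)[0])     # distinct eigenvalues: the 1st principal component is well defined
print(pca(square)[0])   # equal eigenvalues: any direction is a valid 1st principal component

Equal eigenvalues reproduce the degenerate case above: the variance is the same in every direction, so whichever eigenvector the solver returns is an arbitrary but valid choice.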

PCA Algorithm 1

Let xi denote m-dimensional vectors (data points), i = 1, ..., n.

1. Compute the mean vector of the data points

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{28}$$

2. Compute the covariance matrix

$$C = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top. \tag{29}$$

3. Perform eigendecomposition of C

C = QΛQ⊤. (30)
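A direct numpy sketch of these three steps (my own illustration of the algorithm above; variable names are chosen for readability):

import numpy as np

def pca_algorithm_1(X):
    """X: n x m array, one data point per row. Returns (mu, eigenvalues, Q)."""
    n = len(X)
    mu = X.mean(axis=0)                          # step 1, Eq. 28
    A = X - mu
    C = A.T @ A / n                              # step 2, Eq. 29
    eigvals, Q = np.linalg.eigh(C)               # step 3, Eq. 30 (eigh suits symmetric C)
    order = np.argsort(eigvals)[::-1]            # sort so that λ_1 ≥ ... ≥ λ_m
    return mu, eigvals[order], Q[:, order]

# Example: 200 points in 5-D.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
mu, lam, Q = pca_algorithm_1(X)
C = (X - mu).T @ (X - mu) / len(X)
assert np.allclose(C, Q @ np.diag(lam) @ Q.T)    # C = QΛQ^T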


◮ Some books and papers use sample covariance

$$C = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^\top. \tag{31}$$

Eq. 31 differs from Eq. 29 by only a constant factor, which scales the eigenvalues but leaves the eigenvectors unchanged.


Notes:

◮ The covariance matrix C is an m×m matrix.

◮ Eigendecomposition of a very large matrix is inefficient.

◮ Example:
  ◮ A 256×256 colour image has 256×256×3 values.
  ◮ Number of dimensions m = 196608.
  ◮ C has a size of 196608×196608!!
  ◮ Number of images n is usually ≪ m, e.g., 1000.

PCA Algorithm 2

1. Compute mean µ of data points xi.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

2. Form an m×n matrix A, n ≪ m:

A = [(x1 − µ) · · · (xn − µ)]. (32)

3. Compute A⊤A, which is just an n×n matrix.

4. Apply eigendecomposition on A⊤A:

(A⊤A)qj = λjqj . (33)

5. Pre-multiply Eq. 33 by A, giving

AA⊤Aqj = Aλjqj . (34)


6. Recover eigenvectors and eigenvalues.

Since AA⊤ = nC (homework), Eq. 34 is

$$n C\,(A q_j) = \lambda_j (A q_j) \tag{35}$$

$$C\,(A q_j) = \frac{\lambda_j}{n} (A q_j). \tag{36}$$

Therefore, the eigenvectors of C are Aqj (normalized to unit length), and the eigenvalues of C are λj/n.
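A numpy sketch of this trick (my own illustration; it assumes n ≪ m so that the n×n eigendecomposition is the cheap step):

import numpy as np

def pca_algorithm_2(X):
    """X: n x m data array with n << m. Returns (mu, eigenvalues of C, eigenvectors of C)."""
    n, m = X.shape
    mu = X.mean(axis=0)
    A = (X - mu).T                             # m x n matrix, Eq. 32
    lam, V = np.linalg.eigh(A.T @ A)           # eigendecomposition of the small n x n matrix, Eq. 33
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    keep = lam > 1e-10 * lam[0]                # drop near-zero eigenvalues (at most n-1 remain)
    lam, V = lam[keep], V[:, keep]
    Q = A @ V                                  # Aq_j are eigenvectors of AA^T = nC, Eq. 35-36
    Q /= np.linalg.norm(Q, axis=0)             # normalize each column to unit length
    return mu, lam / n, Q                      # eigenvalues of C are λ_j / n

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 1000))                # n = 20 points in m = 1000 dimensions
mu, lam, Q = pca_algorithm_2(X)
C = (X - mu).T @ (X - mu) / len(X)
assert np.allclose(C @ Q[:, 0], lam[0] * Q[:, 0])   # Cq_1 = λ_1 q_1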

PCA Algorithm 3

1. Compute mean µ of data points xi.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

2. Form an m×n matrix A:

A = [(x1 − µ) · · · (xn − µ)].

3. Apply singular value decomposition (SVD) on A:

A = UΣV⊤.

4. Recover eigenvectors and eigenvalues (see below).

Singular Value Decomposition

Singular value decomposition (SVD) decomposes a matrix A into

A = UΣV⊤. (37)

◮ The column vectors of U are the left singular vectors uj; the uj are orthonormal.

◮ The column vectors of V are the right singular vectors vj; the vj are orthonormal.

◮ Σ is diagonal and contains singular values sj .

◮ Rank of A = number of non-zero singular values.

Refer to Appendix for more details.


Notice that

$$A A^\top = U \Sigma V^\top \left(U \Sigma V^\top\right)^\top = U \Sigma V^\top V \Sigma^\top U^\top = U \Sigma \Sigma^\top U^\top$$

and eigendecomposition of AA⊤ is

AA⊤ = QΛQ⊤.

So, the eigenvectors of AA⊤ are the uj, and the eigenvalues of AA⊤ are sj².

On the other hand,

$$A^\top A = \left(U \Sigma V^\top\right)^\top U \Sigma V^\top = V \Sigma^\top U^\top U \Sigma V^\top = V \Sigma^\top \Sigma V^\top.$$

Compare with the eigendecomposition of A⊤A:

A⊤A = QΛQ⊤.

So, the eigenvectors of A⊤A are the vj, and the eigenvalues of A⊤A are sj².


4. Recover eigenvectors and eigenvalues.

Compare AA⊤ = nC with eigendecomposition of C:

$$A A^\top = U \Sigma \Sigma^\top U^\top, \qquad C = Q \Lambda Q^\top.$$

Therefore, the eigenvectors qj of C are the uj in U, and the eigenvalues λj of C are sj²/n, for sj in Σ.
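A numpy sketch of Algorithm 3 (my own illustration; numpy.linalg.svd returns singular values in descending order, so no re-sorting is needed):

import numpy as np

def pca_algorithm_3(X):
    """X: n x m array. Returns (mu, eigenvalues of C, eigenvectors of C as columns of U)."""
    n = len(X)
    mu = X.mean(axis=0)
    A = (X - mu).T                                    # m x n matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U Σ V^T
    return mu, s**2 / n, U                            # λ_j = s_j^2 / n, q_j = u_j

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 8))
mu, lam, Q = pca_algorithm_3(X)
C = (X - mu).T @ (X - mu) / len(X)
assert np.allclose(C, Q @ np.diag(lam) @ Q.T)         # C = QΛQ^T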

Application Examples

Apply PCA for line fitting in 2-D.

Compute PCA of the points. Then,

◮ mean of points = a point on line

◮ 1st eigenvector = unit vector along line
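A small sketch of PCA line fitting (my own illustration, not from the slides):

import numpy as np

rng = np.random.default_rng(6)
t = rng.uniform(-5, 5, size=100)
direction = np.array([2.0, 1.0]) / np.sqrt(5.0)               # true line direction
points = np.array([1.0, 3.0]) + np.outer(t, direction)        # points on the line
points += rng.normal(scale=0.05, size=points.shape)           # small noise

mu = points.mean(axis=0)                                      # a point on the fitted line
C = (points - mu).T @ (points - mu) / len(points)
eigvals, Q = np.linalg.eigh(C)
line_dir = Q[:, np.argmax(eigvals)]                           # 1st eigenvector = unit vector along the line

print(mu, line_dir)    # line_dir is parallel (up to sign) to the true direction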


Apply PCA for plane fitting in 3-D.

Compute PCA of the points. Then,

◮ mean of points = a point on plane

◮ 3rd eigenvector = unit normal vector of plane
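Analogously for a plane in 3-D (again a hedged sketch of mine): the eigenvector with the smallest eigenvalue is the plane normal, and projections onto it measure how far points lie from the plane.

import numpy as np

rng = np.random.default_rng(7)
uv = rng.uniform(-3, 3, size=(200, 2))
basis = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]])         # two directions spanning the plane
points = np.array([0.5, -1.0, 2.0]) + uv @ basis
points += rng.normal(scale=0.02, size=points.shape)           # slight off-plane noise

mu = points.mean(axis=0)                                      # a point on the fitted plane
C = (points - mu).T @ (points - mu) / len(points)
eigvals, Q = np.linalg.eigh(C)
normal = Q[:, np.argmin(eigvals)]                             # 3rd eigenvector = plane normal

distances = (points - mu) @ normal                            # signed point-to-plane distances
print(normal, np.abs(distances).max())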

Dimensionality Reduction

If a point represents a 256×256 colour image, then the dimensionality is 256×256×3 = 196608!!

But, not all 196608 eigenvalues are non-zero.


Case 1: Number of data vectors n ≤ number of dimensions m.

◮ Data vectors are all independent. Then, the number of non-zero eigenvalues (eigenvectors) = n − 1, i.e., the rank of the covariance matrix = n − 1. Why not = n?

◮ Data vectors are not independent. Then, the rank of the covariance matrix < n − 1.

Case 2: Number of data vectors n > m.

◮ m or more independent data vectors. Then, the rank of the covariance matrix = m.

◮ Fewer than m data vectors are independent. In this case, what is the rank of the covariance matrix?

In practice, it is often possible to reduce the dimensionality.


The eigenmatrix Q is

Q = [q1 · · · qm]. (38)

PCA maps a data point x to a vector y in the eigenspace as

$$y = Q^\top (x - \mu) = \sum_{j=1}^{m} (x - \mu)^\top q_j \, q_j, \tag{39}$$

which has m dimensions.

Pick the l eigenvectors with the largest eigenvalues to form the truncated eigenmatrix $\tilde{Q}$:

$$\tilde{Q} = [q_1 \ \cdots \ q_l], \tag{40}$$

which spans a subspace of the eigenspace.

Then, $\tilde{Q}$ maps x to $\tilde{y}$, an estimate of y:

$$\tilde{y} = \tilde{Q}^\top (x - \mu) = \sum_{j=1}^{l} (x - \mu)^\top q_j \, q_j, \tag{41}$$

which has l < m dimensions.
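A hedged numpy sketch of the truncation (my own; l and the data are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(8)
# 3-D data that is almost flat along the third direction.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.05]) + np.array([1.0, 2.0, 3.0])

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)
eigvals, Q = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], Q[:, order]

l = 2
Q_trunc = Q[:, :l]                          # truncated eigenmatrix, Eq. 40
x = X[0]
y_full = Q.T @ (x - mu)                     # Eq. 39: m-D coordinates
y_trunc = Q_trunc.T @ (x - mu)              # Eq. 41: l-D coordinates
x_approx = Q_trunc @ y_trunc + mu           # approximate reconstruction (cf. Eq. 43)

print(np.linalg.norm(x - x_approx))         # small, because the discarded variance is small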


[Diagram: $Q^\top$ maps x = (x1, . . . , xm) to y = (y1, . . . , ym) and Q maps y back to x; keeping only the first l components gives $\tilde{y} = (y_1, \ldots, y_l)$, and $\tilde{Q}$ maps $\tilde{y}$ back to an approximation $\tilde{x}$ (dimensionality reduction).]

The difference between y and $\tilde{y}$ is

$$y - \tilde{y} = \sum_{j=l+1}^{m} (x - \mu)^\top q_j \, q_j. \tag{42}$$


Beware:

$$y = Q^\top (x - \mu)$$

and, therefore,

$$x = Q y + \mu.$$

But,

$$\tilde{y} = \tilde{Q}^\top (x - \mu),$$

whereas

$$\tilde{x} = \tilde{Q}\, \tilde{y} + \mu \ne x. \tag{43}$$

Why?

With n data points xi, the sum-squared error E between xi and its estimate $\tilde{x}_i$ is

$$E = \sum_{i=1}^{n} \left\| x_i - \tilde{x}_i \right\|^2. \tag{44}$$


When all m dimensions are kept, the total variance is

$$\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m} \lambda_j. \tag{45}$$

When l dimensions are used, the ratio R′ of unaccounted variance is:

$$R'(l) = \frac{\sum_{j=l+1}^{m} \sigma_j^2}{\sum_{j=1}^{m} \sigma_j^2} = \frac{\sum_{j=l+1}^{m} \lambda_j}{\sum_{j=1}^{m} \lambda_j}. \tag{46}$$


A sample plot of R′ vs. number of dimensions used:

[Plot: R′ starts at 1 and decreases toward 0 as the number of dimensions grows; too few dimensions are "not enough", and beyond some l the curve flattens ("enough").]

How to choose an appropriate l? Choose l such that using more than l dimensions does not reduce R′ significantly:

$$R'(l) - R'(l+1) < \epsilon. \tag{47}$$


Alternatively, compute ratio R of accounted variance:

$$R(l) = \frac{\sum_{j=1}^{l} \sigma_j^2}{\sum_{j=1}^{m} \sigma_j^2} = \frac{\sum_{j=1}^{l} \lambda_j}{\sum_{j=1}^{m} \lambda_j}. \tag{48}$$

l is large enough if

$$R(l+1) - R(l) < \epsilon.$$

[Plot: R starts at 0 and increases toward 1 with the number of dimensions; l is large enough once the curve flattens.]
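A small sketch (mine, with an arbitrary 95% threshold) of choosing l from the eigenvalues using the accounted-variance ratio R(l):

import numpy as np

def choose_l(eigvals, threshold=0.95):
    """Smallest l such that the first l eigenvalues account for at least `threshold` of the variance."""
    lam = np.sort(np.asarray(eigvals))[::-1]      # λ_1 ≥ ... ≥ λ_m
    R = np.cumsum(lam) / lam.sum()                # R(l) for l = 1, ..., m  (Eq. 48)
    return int(np.searchsorted(R, threshold) + 1)

eigvals = [9.0, 4.5, 0.3, 0.1, 0.1]
print(choose_l(eigvals))   # 2: the first two components account for more than 95% of the variance

This is the cumulative-threshold variant; the marginal criterion R(l + 1) − R(l) < ε above can be implemented the same way from the cumulative sums.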

Summary

◮ Eigendecomposition of covariance matrix = PCA.

◮ In practice, PCA is computed using SVD.

◮ PCA maximizes variances along the eigenvectors.

◮ Eigenvalues = variances along eigenvectors.

◮ Best fitting plane computed by PCA minimizes distance to points.

◮ PCA can be used for dimensionality reduction.

Probing Questions

◮ When applying PCA to plane fitting, how to check whether the points really lie close to the plane?

◮ If you apply PCA to a set of points on a curved surface, where do you expect the eigenvectors to point?

◮ In dimensionality reduction, the m-D vector y is reduced to an l-D vector $\tilde{y}$. Since l ≠ m, vector subtraction is undefined. But, subtracting Eq. 41 from Eq. 39 gives

$$y - \tilde{y} = \sum_{j=l+1}^{m} (x - \mu)^\top q_j \, q_j.$$

How is this possible? Is there a contradiction?

◮ Show that when n ≤ m, there are at most n− 1 eigenvectors.

Homework I

1. Describe the essence of PCA in one sentence.

2. Show that the covariance matrix C of a set of data points xi, i = 1, . . . , n, is symmetric about its diagonal.

3. Show that the eigenvalue λ of covariance matrix C equals thevariance V as defined in Eq. 2.

4. Show that the mean µy over all yi in the eigenspace is 0.

5. Show that the variance σj² along eigenvector qj is λj.

6. Show that AA⊤ = nC.

7. Derive the difference $Qy - \tilde{Q}\tilde{y}$ in dimensionality reduction.

Homework II

8. Suppose n ≤ m, and the truncated eigenmatrix $\tilde{Q}$ contains all the eigenvectors with non-zero eigenvalues, $\tilde{Q} = [q_1 \ \cdots \ q_{n-1}]$.

Now, map an input vector x to y and $\tilde{y}$ respectively by the complete Q (Eq. 39) and the truncated $\tilde{Q}$ (Eq. 41). What is the error $y - \tilde{y}$?

What is the result of mapping $\tilde{y}$ back to the input space by $\tilde{Q}$ (Eq. 43)?

9. Q3 of AY2015/16 Final Evaluation.

Appendix

Consider a general m×n matrix A, where m > n. The SVD of A gives

A = UΣV⊤, (49)

◮ U is an m×m orthogonal matrix,

◮ Σ is an m×n diagonal matrix, and

◮ V is an n×n orthogonal matrix.

This is called the full SVD.

Since m > n, Σ has the structure

$$\Sigma = \begin{bmatrix} \Sigma_n \\ 0_{m-n} \end{bmatrix},$$

◮ Σn is an n×n diagonal matrix,

◮ 0m−n is the (m−n)×n zero matrix.


So, Eq. 49 can be written in this form

$$A = \begin{bmatrix} U_n & U_{m-n} \end{bmatrix} \begin{bmatrix} \Sigma_n \\ 0_{m-n} \end{bmatrix} V^\top. \tag{50}$$

That is,

$$A = U_n \Sigma_n V^\top. \tag{51}$$

This is called Thin SVD.

In practice, we usually compute the thin SVD instead of the full SVD because there is no need to compute Um−n.

If the rank r of A is less than n, then we can also compute the compact SVD as follows:

$$A = U_r \Sigma_r V_r^\top, \tag{52}$$

where Ur is m×r, Σr is r×r diagonal, and Vr is n×r.
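In numpy, the thin SVD is obtained with full_matrices=False (a quick sketch of mine illustrating the shapes):

import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(1000, 20))                       # m x n with m > n

U_full, s, Vt = np.linalg.svd(A)                      # full SVD: U is 1000 x 1000
U_thin, s_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)   # thin SVD: U_n is 1000 x 20

print(U_full.shape, U_thin.shape)                     # (1000, 1000) (1000, 20)
assert np.allclose(A, U_thin @ np.diag(s_thin) @ Vt_thin)   # A = U_n Σ_n V^T, Eq. 51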
