Timo Asikainen
ReSoLVE Centre of Excellence, Space Climate Research Unit
University of Oulu
[email protected]
Finding and analysing patterns in multivariate datasets
I4Future Summer School 13-17.5.2019
• Introduction to Principal Component Analysis as a method to find and analyse patterns
• Some mathematics
• A few real world examples
• Detailed extra material for self study
Contents
• Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of partly correlated variables into a set of values of linearly uncorrelated variables called principal components, which explain as much common variance as possible.
What is PCA?
• Two correlated variables
• Largest variation in the points is along the principal axis
• ➔ There is a principal component in the data, which consists of the combined correlated variability seen in X and Y
Example of correlated variables
Example of correlated variables
• We can project the data into these principal axes to obtain principal components PC1 and PC2
• Here the same data is represented with the principal components, i.e., transformed to the new system
• Same method can be applied in higher dimensions, e.g., 3D, 4D,…, ND
• ➔ PCA finds the directions of maximum variance and principal components are projections of the data to these directions
Example of a 3D dataset
Mathematics of Principal Component Analysis...
• It is convenient to handle data in matrix form
• Data matrix 𝑿 holds individual datasets in its columns
𝑛 data points, 𝑝 variables
\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}
• Data samples = Observations in time, individuals, pixels, material samples etc.
• Key assumption: Centered data
\mathbf{X} = \begin{pmatrix} X_{1,1} - \bar{X}_1 & X_{1,2} - \bar{X}_2 & \cdots & X_{1,p} - \bar{X}_p \\ X_{2,1} - \bar{X}_1 & X_{2,2} - \bar{X}_2 & \cdots & X_{2,p} - \bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1} - \bar{X}_1 & X_{n,2} - \bar{X}_2 & \cdots & X_{n,p} - \bar{X}_p \end{pmatrix}
Basics: data matrix
Rows = Data samples
Columns = Data series (variables)
• 𝑷=matrix with principal components as columns
• 𝑾=matrix with principal axes (direction vectors) as columns
• Axes are orthogonal!
𝑾𝑇𝑾 = 𝑰
• Principal axes are also called EOFs (Empirical Orthogonal Functions)
• By definition we have 𝑷 = 𝑿𝑾
• I.e., each column of 𝑷 is a projection of the data to the corresponding axis (column of 𝑾)
• ➔ 𝑷 represents the data in the new coordinates
• The original data matrix is thus a weighted sum of orthogonal modes (principal axes), with the PCs as the weights
Principal components as projections of data
• PCA decomposes original data matrix into a sum of orthogonal modes
𝑿 = 𝑷𝑾𝑇
• Number of modes (EOFs)/PCs = number of data series (variables)
• For example, each data sample (row 𝑖 of 𝑿) is a sum of the EOFs weighted by the PC values (a short MATLAB sketch follows below):
𝑿ᵢ = 𝑃𝐶₁(𝑖) × 𝑬𝑶𝑭₁ + 𝑃𝐶₂(𝑖) × 𝑬𝑶𝑭₂ + ⋯
Original data matrix decomposed into modes
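• A minimal MATLAB sketch of these two relations (illustrative only; the synthetic two-variable data, random seed and variable names are assumptions, and the principal axes are obtained with the built-in svd, introduced later in the slides):
% Illustrative synthetic two-variable dataset (not from the slides)
rng(1);                                   % reproducible example
x1 = randn(500,1);
x2 = 0.8*x1 + 0.3*randn(500,1);           % correlated with x1
X  = [x1 x2];
X  = X - repmat(mean(X), size(X,1), 1);   % center each column

[~, ~, W] = svd(X, 'econ');               % principal axes as columns of W
P    = X*W;                               % projection: P = X W
Xrec = P*W';                              % reconstruction: X = P W'
max(abs(X(:) - Xrec(:)))                  % ~0 (up to rounding error)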
Principal components as projections of data
[Figure: the data plotted with the 1st and 2nd principal axes; projections onto these axes give PC1 and PC2]
• We can obtain the following equation for the unknown principal axes (𝑤ᵢ = columns of 𝑾) (see details in Extra material):
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
• This equation is an eigenvalue equation
• ➔ The vectors 𝑤ᵢ are eigenvectors of the data covariance matrix, and the corresponding eigenvalues σᵢ² are the variances of the principal components
Calculating principal components
• The data matrix can be decomposed (SVD) into parts as
𝑿 = 𝑼𝑺𝑽𝑇
• Where
– 𝑼 is an 𝑛 × 𝑘 matrix with orthogonal columns
– 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺𝑇)
– 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns
• It can be shown that (see Extra material)
– 𝑾 = 𝑽
– 𝑷 = 𝑼𝑺
– Diagonal values of 𝑺 are σᵢ, i.e., the standard deviations of the principal components (their squares σᵢ² are the PC variances)
PCA by Singular Value Decomposition (SVD)
• Relative PC variances reveal how important they are, i.e., what fraction of data variance they contain
• In correlated data, typically only a small subset of the PCs explains most of the variance
• ➔ Can be used to compress the data!
• I.e., reduce the dimension of the data
➔ Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced (approximate) data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)ᵀ
• This also helps in analysis and interpretation
Dimensional reduction
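• A short MATLAB sketch of this compression idea (a sketch, not the lecture's code; it assumes a centered data matrix X already exists in the workspace and uses k = 3 as an illustrative choice):
[U, S, V] = svd(X, 'econ');
P  = U*S;                                  % principal components
k  = 3;                                    % number of modes to keep (illustrative choice)
Xk = P(:,1:k)*V(:,1:k)';                   % rank-k approximation of the data
explained = sum(diag(S(1:k,1:k)).^2) / sum(diag(S).^2)   % fraction of variance retained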
How to analyse and use the output of PCA?
- Example of satellite data
• Data from 24 latitudes, 38 years in daily resolution, about 14000 samples
• Let’s do PCA for this data set
• 𝑿 = 𝑼𝑺𝑽𝑇
• 𝑼𝑺 gives the PCs, 𝑺 the PC variances, and 𝑽 the EOFs
Dataset: Latitude distribution of electrons at low-altitude Earth orbit
• The first 3-4 PCs contain the majority of the data variance (over 90%)
• The remaining PCs may contain either some physically interesting signal or just random noise
PC variances
• PCs depict how the modes vary with time
• EOFs depict by what weight each mode affects different data series (latitudes)
PCs and corresponding EOFs
Climate example
• NASA/GISS monthly ground temperature dataset
• 5° x 5° grid ➔ 2592 grid points in total
• Each grid point has a monthly temperature time series
• Temperature anomaly = T − Tmean for each grid point separately
Ground temperature dataset
• In many cases one or more of the variables may have data gaps
• Simplest way to deal with gaps: exclude all rows that contain any missing values (a short MATLAB sketch follows below)
• Note: there are more sophisticated ways to deal with gaps (e.g., DINEOF analysis) ...
Note on data gaps
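• A minimal sketch of the row-exclusion approach in MATLAB, assuming the missing values are marked as NaN (an assumption; the slides do not specify the marker):
good = ~any(isnan(X), 2);                 % rows with no missing values
Xcomplete = X(good, :);                   % keep only complete samples
fprintf('Kept %d of %d samples\n', sum(good), size(X,1));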
• It is hard to see global correlated patterns from individual time series
• Here three randomly chosen spots from different parts of the world (blue is Oulu, Finland)
Examples of January temperatures
[Figure: temperature anomaly (°C) time series at the three locations]
• First PC explains about 25% of variance in Northern Hemisphere January temperatures.
• This mode is called Northern Annular Mode, NAM
January ground temperature: PCA
[Map of EOF 1]
January ground temperature: PCA
• The first three EOFs
• Representation of the EOFs in terms of their geographical location gives them a physical interpretation
• The positive phase of the NAM acts to produce warm conditions in Eurasia
[Maps of EOF 1, EOF 2 and EOF 3]
Ground temperature dataset: PCA
[Maps of EOF 2 and EOF 3]
• PCA decomposes multivariate dataset into orthogonal modes (EOFs) and their sample to sample variation (PCs)
• Instead of analysing all individual data series one can analyse a much smaller number of modes
• ➔ Reduction of dimensionality
• ➔ Often easier to interpret
• Simple to calculate:
• SVD decomposition: 𝑿 = 𝑼𝑺𝑽𝑇 and 𝑷 = 𝑼𝑺
Take home points
Extra in depth material for self study
Some basics first...
• Covariance of two data sets:
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
• If 𝑐𝑜𝑣(𝑋, 𝑌) > 0 positive values of X tend to appear when Y has a positive value too
• If 𝑐𝑜𝑣(𝑋, 𝑌) < 0, positive values of X tend to appear when Y is negative (and vice versa)
• If 𝑐𝑜𝑣(𝑋, 𝑌) = 0, there is no linear relationship between X and Y
Basics: Covariance
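• As a quick numerical check of the definition above, a MATLAB sketch with made-up vectors x and y (the data and seed are purely illustrative):
rng(3);
x = randn(1000,1);
y = 0.5*x + randn(1000,1);                       % partly correlated with x
n = numel(x);
c = sum((x - mean(x)).*(y - mean(y))) / (n - 1); % covariance from the definition above
C = cov(x, y);                                   % built-in: 2x2 covariance matrix
[c, C(1,2)]                                      % the two values agree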
• It is convenient to handle data in matrix form
• Data matrix 𝑿 holds individual datasets in its columns: 𝑛 data points, 𝑝 variables
\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}
• Centered data:
\mathbf{X} = \begin{pmatrix} X_{1,1} - \bar{X}_1 & X_{1,2} - \bar{X}_2 & \cdots & X_{1,p} - \bar{X}_p \\ X_{2,1} - \bar{X}_1 & X_{2,2} - \bar{X}_2 & \cdots & X_{2,p} - \bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1} - \bar{X}_1 & X_{n,2} - \bar{X}_2 & \cdots & X_{n,p} - \bar{X}_p \end{pmatrix}
• From here on we also assume data is centered, i.e., subtract from each column the mean of that column!
• In MATLAB: Xcentered=X-repmat(mean(X),size(X,1),1)
Basics: data matrix
• We can collect the covariances of the dataset into a covariance matrix
\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}
• For simplicity we often say that 𝑿𝑇𝑿 is the data covariance matrix
• \mathbf{C} = \begin{pmatrix} \mathrm{cov}(X_1,X_1) & \mathrm{cov}(X_1,X_2) & \cdots & \mathrm{cov}(X_1,X_p) \\ \mathrm{cov}(X_2,X_1) & \mathrm{cov}(X_2,X_2) & \cdots & \mathrm{cov}(X_2,X_p) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(X_p,X_1) & \mathrm{cov}(X_p,X_2) & \cdots & \mathrm{cov}(X_p,X_p) \end{pmatrix}
Basics: covariance matrix
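• The same relation in MATLAB (a sketch; X is assumed to be an n-by-p data matrix in the workspace):
n  = size(X,1);
Xc = X - repmat(mean(X), size(X,1), 1);   % make sure the columns are centered
C  = (Xc'*Xc) / (n - 1);                  % covariance matrix from the data matrix
norm(C - cov(X))                          % ~0: the built-in cov() gives the same result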
• Transpose of 𝑿 is 𝑿ᵀ: interchange rows and columns; note (𝑨𝑩)ᵀ = 𝑩ᵀ𝑨ᵀ
• Identity matrix 𝑰: diagonal values are 1, all else 0, e.g.,
\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
• Inverse matrix 𝑿⁻¹ defined by 𝑿⁻¹𝑿 = 𝑿𝑿⁻¹ = 𝑰
• Orthogonal matrix: 𝑿ᵀ𝑿 = 𝑰 ➔ 𝑿⁻¹ = 𝑿ᵀ
– Columns of an orthogonal matrix are orthogonal vectors
Basics: matrix formulae and properties
Mathematics of Principal Component Analysis...
• How to obtain 𝑷 and 𝑾 from the data?
• First note that
– the principal axes are orthogonal
– Principal components (new coordinates) are uncorrelated ➔ their covariance matrix is diagonal
• The covariance matrix is proportional to 𝑿𝑇𝑿
• Also since 𝑷 = 𝑿𝑾 ➔ 𝑿 = 𝑷𝑾𝑇
• Covariance matrix can be written as
• 𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇
Calculating principal components
• Because principal components are uncorrelated
𝑷ᵀ𝑷 = diag(σ₁², σ₂², …, σₚ²), where σᵢ² = cov(𝑃ᵢ, 𝑃ᵢ)
• Covariance matrix can be written as
𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇
• Multiplying from the right by 𝑾 gives (𝑿ᵀ𝑿)𝑾 = 𝑾𝑷ᵀ𝑷
• If 𝑤𝑖 is the 𝑖:th column vector of 𝑾, then for each 𝑤𝑖 we have
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
Calculating principal components
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
• This equation is an eigenvalue equation
• ➔ The vectors 𝑤ᵢ are eigenvectors of the covariance matrix, and the corresponding eigenvalues σᵢ² are the variances of the principal components
Calculating principal components
• The covariance of principal components and the original data is
𝑷𝑇𝑿 = 𝑷𝑇𝑷𝑾𝑇
• It can easily be shown that covariance of 𝑖:th PC with the j:th variable is
cov(𝑋ⱼ, 𝑃ᵢ) = σᵢ²𝑊ⱼ,ᵢ ➔ 𝑊ⱼ,ᵢ = cov(𝑋ⱼ, 𝑃ᵢ) / σᵢ²
• ➔ Components of the eigenvectors (or EOFs) describe the covariance of the data series with corresponding principal components ➔ They reveal the covariance structure of the data related to each component!
Meaning of Eigenvectors
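• A small MATLAB check of this relation (a sketch, assuming a centered data matrix X in the workspace):
[U, S, V] = svd(X, 'econ');
P      = U*S;                                 % principal components
sigma2 = diag(S).^2;                          % eigenvalues of X'*X (PC variances)
lhs    = X'*P;                                % element (j,i) = cov(X_j, P_i) in this convention
rhs    = V .* repmat(sigma2', size(V,1), 1);  % sigma_i^2 * W_{j,i}
max(abs(lhs(:) - rhs(:)))                     % ~0 up to rounding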
• By these results PCA analysis can be computed as follows:
1. Compute covariance matrix 𝑿𝑇𝑿
2. Compute the eigenvectors and eigenvalues of the covariance matrix, e.g., [W,D]=eig(X'*X) in MATLAB (the eigenvalues are on the diagonal of D)
3. Compute the matrix of principal components as 𝑷 = 𝑿𝑾
Computing PCA from covariance matrix
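• The three steps as a runnable MATLAB sketch (assuming a centered data matrix X; note that eig does not return the eigenvalues ordered from largest to smallest, so they are sorted explicitly here):
C = X'*X;                                   % (unnormalized) covariance matrix of the centered data
[W, D] = eig(C);                            % eigenvectors W, eigenvalues on the diagonal of D
[sigma2, order] = sort(diag(D), 'descend'); % order the modes by their variance
W = W(:, order);
P = X*W;                                    % principal components
relvar = sigma2 / sum(sigma2);              % relative variance of each PC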
• The data matrix can be decomposed into parts as
𝑿 = 𝑼𝑺𝑽𝑇
• Where
– 𝑼 is an 𝑛 × 𝑘 matrix with orthogonal columns
– 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺ᵀ)
– 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns
• The covariance matrix is now
𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇 = 𝑽𝑺𝟐𝑽𝑇
Singular Value Decomposition (SVD)
• We can compare the two expressions for covariance matrix
𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇
𝑿𝑇𝑿 = 𝑾𝑷𝑇𝑷𝑾𝑇
• Thus clearly
– 𝑾 = 𝑽
– 𝑷 = 𝑼𝑺
– Diagonal values of 𝑺 are σᵢ, i.e., the standard deviations of the principal components (their squares σᵢ² are the PC variances)
Singular Value Decomposition (SVD)
• Principal component analysis can now be easily calculated directly from the Singular Value Decomposition of the data matrix
1. Compute the SVD. MATLAB: [U,S,V]=svd(X)
2. Compute the principal components: 𝑷 = 𝑼𝑺. MATLAB: P=U*S
3. Eigenvectors depict the principal axes and are the columns of 𝑽. The columns are also often called Empirical Orthogonal Functions (EOFs)
4. Variances of the PCs are the squared diagonal values of 𝑺. MATLAB: diag(S).^2
Computing PCA by SVD
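• The same steps as a self-contained MATLAB sketch (the synthetic data is purely illustrative; for tall data matrices svd(X,'econ') is faster than the full SVD):
rng(2);                                         % reproducible illustration
M = [1 0.8 0.5 0 0; 0 1 0.3 0 0; 0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1];
X = randn(1000,5) * M;                          % 1000 samples of 5 partly correlated variables
X = X - repmat(mean(X), size(X,1), 1);          % center the columns

[U, S, V] = svd(X, 'econ');                     % step 1: SVD of the data matrix
P    = U*S;                                     % step 2: principal components
EOFs = V;                                       % step 3: principal axes / EOFs as columns
vars = diag(S).^2;                              % step 4: PC variances
relvar = vars / sum(vars)                       % relative variances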
• In MATLAB you can compute the PCA directly (part of the Statistics Toolbox)
• [coeff,score,latent] = pca(X)
– coeff = matrix with the principal axes (eigenvectors 𝑽) as columns; MATLAB calls these loadings
– score = matrix with principal components as columns
– latent = variances of principal components (eigenvalues of covariance matrix)
– You can also get lots of other outputs, check the MATLAB help…
PCA in MATLAB
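• A short sketch relating pca() to the SVD approach above: pca centers the data internally, so its outputs match the SVD of the centered matrix up to possible sign flips of individual components (X is assumed to be in the workspace):
[coeff, score, latent] = pca(X);                % Statistics Toolbox; centers X internally
Xc = X - repmat(mean(X), size(X,1), 1);
[U, S, V] = svd(Xc, 'econ');
% coeff corresponds to V and score to U*S (up to sign flips of individual columns);
% latent equals the PC variances normalized by (n-1):
n = size(X,1);
max(abs(latent - diag(S).^2/(n - 1)))           % ~0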
How to analyse and use the output of PCA?
• From the SVD or from the eigenvalues of the covariance matrix we get the variances σᵢ² for each PC
• From the previous example:
Analysing the variance
• MATLAB: [U,S,V]=svd(X)
• Variances are:
– σ₁² = S(1,1)^2 = 2681.4
– σ₂² = S(2,2)^2 = 88.181
• A better indicator is the relative variance: \sigma_i^2 / \sum_{j=1}^{p} \sigma_j^2
• Relative variances here are 0.968 and 0.032
• Relative variance indicates what fraction of variance in the data is captured by the principalcomponents
• In correlated data, typically only a small subset of the PCs explains most of the variance
• ➔ Can be used to compress the data!
• I.e., reduce the dimension of the data
• This also helps in analysis and interpretation
Dimensional reduction (1)
• Reduction can be done by taking only the first 𝑘 PCs that explain most of the variance (e.g., 95%)
• The procedure would be, e.g., with [U,S,V]=svd(X):
1. Look at the variances in 𝑺 and determine how many PCs you want
2. PCs 𝑷 = 𝑼𝑺 and eigenvectors 𝑽
3. Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)ᵀ
In MATLAB: P=U*S; Xreduced=P(:,1:k)*V(:,1:k)';
Dimensional reduction (2)
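• A sketch of steps 1-3 with the number of retained PCs chosen automatically from the cumulative variance (the 95% threshold is an illustrative choice; X is assumed centered):
[U, S, V] = svd(X, 'econ');
vars   = diag(S).^2;
relvar = vars / sum(vars);
k = find(cumsum(relvar) >= 0.95, 1);       % smallest k that explains >= 95% of the variance
P = U*S;
Xreduced = P(:,1:k)*V(:,1:k)';             % reduced (rank-k) data matrix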
• PCA finds the components on the basis of the total variance they explain
• The variable with the largest variance DOMINATES the result!
• ➔ Possibly a problem in cases where variables have different units or ranges of variation
• In most (but not all) cases it makes sense to study standardized variations in the data
➔ standardize each column of the data matrix: 𝑍ᵢ = (𝑋ᵢ − μ)/σ, where μ and σ are the mean and standard deviation of that column
• MATLAB: Xstandard=Xcentered./repmat(std(Xcentered),size(Xcentered,1),1)
To standardize or not?
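• The Statistics Toolbox function zscore does the same standardization in one call; PCA of the standardized data then corresponds to an analysis of the correlation matrix:
Xstandard = zscore(X);                  % subtract column means, divide by column std
[U, S, V] = svd(Xstandard, 'econ');     % PCA of the standardized data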
• Take a dataset of 24 stock prices from Helsinki stock market
– Each is a daily dataset from 2010 to 2016
– ➔ In total 24 x 1531 = 36744 data points
• Prices are at very different levels
➔ Standardize the prices first
Stock market example
• Standardized prices clearly show some correlated behavior
• From 24 data series it is difficult to see the common structure
• ➔ Do PCA on the data
Stock market example
• Relative variances of the PCs (the so-called Scree plot) show that
– PC1 explains about 55% of the variance
– PC2 about 20%
– PC3 about 11% ...
– ➔ The first 6 PCs explain about 95% of the variance
Stock market example
• Time series of principal components
Stock market example
• Consider the eigenvectors (EOFs) of the principal components
• Stock #9 = Kone Oyj: positive loading for PC1 ➔ positive price trend over the last 5 years
• Stock #16 = Outokumpu Oyj: negative loading for PC1 ➔ negative price trend
• Full interpretation of these factors is very complicated and requires understanding of the markets and the economic system
Stock market example
• We can reduce the dimension of the dataset from 24 to 6 and retain 95% of the variance!
• ➔ Recompute the data by taking the first 6 components
• To store the data with 95% of the original variance we need the 6 PC time series and the 6 x 24 eigenvector components: 6 x 1531 + 6 x 24 = 9330 points, i.e., about 25% of the original points
Stock market example
Rotations of PCA results
• PCA typically gives a few components which explain most of the variation and have a large effect on a large fraction of variables
• Because of this PCA results can be difficult to interpret in physical context
• For example, the temperature modes have a significant response all over the northern hemisphere
• Rotation of PCA results can often help...
Problem with PCA interpretation
[Maps of EOF 1, EOF 2 and EOF 3]
• With SVD the data matrix is 𝑿 = 𝑼𝑺𝑽𝑇
• This decomposition can be viewed in two ways:
1. The PCA interpretation:
– Principal component scores: 𝑷 = 𝑼𝑺 (incorporates the variance)
– Eigenvectors 𝑽 (orthogonal) ➔ 𝑿 = 𝑷𝑽ᵀ
2. The Factor Analysis (FA) interpretation:
– Standardized principal component scores: 𝑼
– Factor loadings: 𝑳 = 𝑽𝑺 (non-orthogonal and incorporates the variance) ➔ 𝑿 = 𝑼𝑳ᵀ
Two ways to view PCA
• Assume an orthogonal (rotation) matrix 𝑹 so that 𝑹𝑹ᵀ = 𝑹ᵀ𝑹 = 𝑰
• Data matrix can be written as
𝑿 = 𝑼𝑺𝑽𝑇 = 𝑼𝑺𝑹𝑹𝑇𝑽𝑇 = 𝑷𝑟𝑜𝑡𝑽𝑟𝑜𝑡𝑇
• From this form we have
– Rotated principal components: 𝑷𝑟𝑜𝑡 = 𝑼𝑺𝑹 = 𝑷𝑹
– Rotated eigenvectors: 𝑽𝑟𝑜𝑡 = 𝑽𝑹
• Rotation this way:
– Preserves orthogonality of eigenvectors, but
– Rotated principal components are not uncorrelated anymore!
– Variance is redistributed among rotated components!
Rotation: PCA interpretation
• The data matrix can also be written as
𝑿 = 𝑼𝑺𝑽ᵀ = 𝑼𝑹𝑹ᵀ𝑺𝑽ᵀ = 𝑼𝑟𝑜𝑡𝑳𝑟𝑜𝑡ᵀ
• From this form we have
– Rotated standardized principal components: 𝑼𝑟𝑜𝑡 = 𝑼𝑹
– Rotated factor loadings: 𝑳𝑟𝑜𝑡 = 𝑽𝑺𝑹 = 𝑳𝑹
• Rotation this way (more common):
– Rotated factor loadings and eigenvectors are non-orthogonal
– Rotated standardized scores remain uncorrelated
– Variance is redistributed among the rotated components!
Rotation: FA interpretation
• There are several ways a rotation matrix can be chosen
• Most common is the VARIMAX rotation method
• R_{VARIMAX} = \arg\max_{R}\, \sum_{j=1}^{p}\left[\frac{1}{n}\sum_{i=1}^{n}(\mathbf{L}\mathbf{R})_{ij}^{4}-\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{L}\mathbf{R})_{ij}^{2}\right)^{2}\right]
• This criterion maximizes the variance of the squared factor loadings, summed over all factors
VARIMAX rotation
• ➔ It produces a solution where each factor only affects a few variables ➔ a simple solution
• BUT: There is no guarantee that any rotated (or un-rotated) solution correctly represents any physical mode affecting the system!
• Typically only the most important principal components, which contain most of the variance, are included in the rotation
VARIMAX rotation
• Factor loading matrix from the surface temperature example
VARIMAX rotation
[Figures: unrotated factor loadings and VARIMAX-rotated factor loadings]
• Other rotation methods
– QUARTIMAX: minimizes the number of factors needed to explain each variable (maximizes the variance of factor loadings within each row)
– EQUIMAX: Compromise between VARIMAX and QUARTIMAX
– Orthomax, Promax, Procrustes... Etc.
• A more interesting method, which is effectively also a rotation method, is Independent Component Analysis (ICA)
• ICA is perhaps the best method to look for physically independent signals in the data.
Other rotations
• In MATLAB one can do the rotation with the function rotatefactors() (part of the Statistics Toolbox)
• [Lrot,R]=rotatefactors(L)
or, including only a subset (the first 𝑘) of PCs:
[Lrot,R]=rotatefactors(L(:,1:k))
• Different rotation options are available (see MATLAB help), e.g., [Lrot,R]=rotatefactors(L,'Method','quartimax')
• One can rotate either factor loadings or eigenvectors
How to rotate in MATLAB
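• Putting the pieces together, an end-to-end sketch (a sketch only; k = 3 and the variable names are illustrative assumptions, and a data matrix X is assumed to be in the workspace):
X = X - repmat(mean(X), size(X,1), 1);     % center the data
[U, S, V] = svd(X, 'econ');
L = V*S;                                   % factor loadings (FA interpretation)
k = 3;                                     % number of components to rotate (illustrative)
[Lrot, R] = rotatefactors(L(:,1:k));       % VARIMAX rotation by default
Urot = U(:,1:k)*R;                         % rotated standardized scores
Xk   = Urot*Lrot';                         % rank-k data reconstructed from the rotated factors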