
  • Timo Asikainen
    ReSoLVE Centre of Excellence, Space Climate Research Unit
    University of Oulu
    [email protected]

    Finding and analysing patterns in multivariate datasets

    I4Future Summer School 13-17.5.2019

  • • Introduction to Principal Component Analysis as a method to find and analyse patterns

    • Some mathematics

    • A few real world examples

    • Detailed extra material for self study

    Contents

  • • Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of partly correlated variables into a set of linearly uncorrelated variables called principal components, ordered so that each component explains as much of the remaining variance as possible.

    What is PCA?

  • • Two correlated variables

    • Largest variation in the points is along the principal axis

    • ➔ There is a principal component in the data, which consists of the combined correlated variability seen in X and Y

    Example of correlated variables

  • • We can project the data into these principal axes to obtain principal components PC1 and PC2

    • Here the same data is represented with the principal components, i.e., transformed to the new system

    Example of correlated variables

  • • Same method can be applied in higher dimensions, e.g., 3D, 4D,…, ND

    • ➔ PCA finds the directions of maximum variance and principal components are projections of the data to these directions

    Example of a 3D dataset

  • Mathematics of Principal Component Analysis...

  • • It is convenient to handle data in matrix form

    • Data matrix 𝑿 holds the individual data series in its columns (𝑛 data points, 𝑝 variables):

    $$\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}$$

    • Data samples = observations in time, individuals, pixels, material samples, etc.

    • Key assumption: centered data

    $$\mathbf{X} = \begin{pmatrix} X_{1,1}-\bar{X}_1 & X_{1,2}-\bar{X}_2 & \cdots & X_{1,p}-\bar{X}_p \\ X_{2,1}-\bar{X}_1 & X_{2,2}-\bar{X}_2 & \cdots & X_{2,p}-\bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1}-\bar{X}_1 & X_{n,2}-\bar{X}_2 & \cdots & X_{n,p}-\bar{X}_p \end{pmatrix}$$

    Basics: data matrix

    Rows = Data samples

    Columns = Data series (variables)

  • • 𝑷=matrix with principal components as columns

    • 𝑾=matrix with principal axes (direction vectors) as columns

    • Axes are orthogonal!

    𝑾𝑇𝑾 = 𝑰

    • Principal axes = EOFs (Empirical Orthogonal Functions)

    • By definition we have 𝑷 = 𝑿𝑾

    • I.e., each column of 𝑷 is a projection of the data to the corresponding axis (column of 𝑾)

    • ➔ 𝑷 represents the data in the new coordinates

    • The original data matrix is a weighted sum of orthogonal modes (principal axes), with the PCs as the weights

    Principal components as projections of data

  • • PCA decomposes original data matrix into a sum of orthogonal modes

    𝑿 = 𝑷𝑾𝑇

    • Number of modes (EOFs)/PCs = number of data series (variables)

    • For example, each data sample (row 𝑖 of 𝑿) is a sum of the EOFs weighted by the corresponding PC values (see the sketch after this slide):

    𝑿𝑖 = 𝑃𝐶1(𝑖) × 𝑬𝑶𝑭𝟏 + 𝑃𝐶2(𝑖) × 𝑬𝑶𝑭𝟐 + ⋯

    Original data matrix decomposed into modes
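
    As a minimal sketch of this decomposition (a hypothetical two-variable example; the svd call used here to obtain the principal axes 𝑾 is introduced later in these slides):

    % Hypothetical, centered two-variable dataset with strongly correlated columns
    rng(1);                                   % fix the random seed for reproducibility
    t = randn(1000,1);
    X = [t + 0.1*randn(1000,1), 2*t + 0.1*randn(1000,1)];
    X = X - repmat(mean(X), size(X,1), 1);    % center each column

    [U,S,W] = svd(X,'econ');                  % principal axes W (see the SVD slides)
    P = X*W;                                  % principal components = projections

    Xrec = P*W';                              % sum of modes: PC_i times EOF_i
    max(abs(X(:) - Xrec(:)))                  % ~0, i.e., X = P*W' up to rounding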

  • Principal components as projections of data

    [Figure: data scatter with the 1st and 2nd principal axes; projecting the data onto these axes gives PC1 and PC2]

  • • We can obtain the following equation for the unknown principal axes (𝑤𝑖 = columns of 𝑾) (see details in Extra material):

    (𝑿𝑇𝑿)𝑤𝑖 = 𝜎𝑖2𝑤𝑖

    • This equation is an eigenvalue equation

    • ➔ The vectors 𝑤𝑖 are eigenvectors of the data covariance matrix, and the corresponding eigenvalues 𝜎𝑖2 are the variances of the principal components

    Calculating principal components

  • • The data matrix can be decomposed (SVD) into parts as

    𝑿 = 𝑼𝑺𝑽𝑇

    • Where

    – 𝑼 is an 𝑛 × 𝑘 matrix with orthogonal columns

    – 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺𝑇)

    – 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns

    • It can be shown that (see Extra material)

    – 𝑾 = 𝑽

    – 𝑷 = 𝑼𝑺

    – The diagonal values of 𝑺 are 𝜎𝑖, i.e., the standard deviations of the principal components

    PCA by Singular Value Decomposition (SVD)

  • • Relative PC variances reveal how important they are, i.e., what fraction of data variance they contain

    • In correlated data, typically only a small subset of the PCs explains most of the variance

    • ➔ Can be used to compress the data!

    • I.e., reduce the dimension of the data

    ➔ Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced (approximate) data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)𝑇

    • This also helps in analysis and interpretation

    Dimensional reduction

  • How to analyse and use the output of PCA?

    - Example of satellite data

  • • Data from 24 latitudes, 38 years in daily resolution, about 14000 samples

    • Let’s do PCA for this data set

    • 𝑿 = 𝑼𝑺𝑽𝑇

    [Figure: the resulting PCs, PC variances and EOFs]

    Dataset: Latitude distribution of electrons at low-altitude Earth orbit

  • • The first 3-4 PCs contain the majority of the data variance (over 90%)

    • The remaining PCs may contain either some physically interesting signal or just random noise

    PC variances

  • • PCs depict how the modes vary with time

    • EOFs depict by what weight each mode affects different data series (latitudes)

    PCs and corresponding EOFs

  • Climate example

  • • NASA/GISS monthly ground temperature dataset

    • 5° x 5° grid ➔ 2592 grid points in total

    • Each grid point has a monthly temperature time series

    • Temperature anomaly = T − Tmean, computed for each grid point separately

    Ground temperature dataset

  • • In many cases one or more of the variables may have data gaps

    • The simplest way to deal with gaps: exclude all rows that contain any missing values (see the sketch below)

    • Note: there are more sophisticated ways to deal with gaps (e.g., DINEOF analysis)...

    Note on data gaps
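
    A minimal sketch of this row-exclusion step (assuming the gaps are stored as NaN values in the data matrix X):

    rowHasGap = any(isnan(X), 2);    % logical index: rows with at least one missing value
    Xclean    = X(~rowHasGap, :);    % keep only the complete rows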

  • • It is hard to see global correlated patterns from individual time series

    • Here three randomly chosen spots from different parts of the world (blue is Oulu, Finland)

    Examples of January temperatures

    [Figure: time series of January temperature anomaly [°C] at the three locations]

  • • First PC explains about 25% of variance in Northern Hemisphere January temperatures.

    • This mode is called Northern Annular Mode, NAM

    January ground temperature: PCA

  • January ground temperature: PCA

    • The three first EOFs

    • Representing the EOFs in terms of their geographical location gives them a physical interpretation

    • The positive phase of the NAM acts to produce warm conditions in Eurasia

    [Figure: maps of EOF 1, EOF 2 and EOF 3]

  • Ground temperature dataset: PCA

    [Figure: maps of EOF 1, EOF 2 and EOF 3]

  • • PCA decomposes multivariate dataset into orthogonal modes (EOFs) and their sample to sample variation (PCs)

    • Instead of analysing all individual data series one can analyse a much smaller number of modes

    • ➔ Reduction of dimensionality

    • ➔ Often easier to interpret

    • Simple to calculate:

    • SVD decomposition: 𝑿 = 𝑼𝑺𝑽𝑇 and 𝑷 = 𝑼𝑺

    Take home points

  • Extra in depth material for self study

  • Some basics first...

  • • Covariance of two data sets:

    $$\mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{n-1}$$

    • If 𝑐𝑜𝑣(𝑋, 𝑌) > 0, positive values of X tend to appear when Y has a positive value too

    • If 𝑐𝑜𝑣(𝑋, 𝑌) < 0, positive values of X tend to appear when Y has a negative value

    • If 𝑐𝑜𝑣(𝑋, 𝑌) = 0, there is no (linear) covariation between the data

    Basics: Covariance

  • • It is convenient to handle data in matrix form

    • Data matrix 𝑿 holds the individual data series in its columns (𝑛 data points, 𝑝 variables):

    $$\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}$$

    • Centered data:

    $$\mathbf{X} = \begin{pmatrix} X_{1,1}-\bar{X}_1 & X_{1,2}-\bar{X}_2 & \cdots & X_{1,p}-\bar{X}_p \\ X_{2,1}-\bar{X}_1 & X_{2,2}-\bar{X}_2 & \cdots & X_{2,p}-\bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1}-\bar{X}_1 & X_{n,2}-\bar{X}_2 & \cdots & X_{n,p}-\bar{X}_p \end{pmatrix}$$

    • From here on we also assume the data is centered, i.e., subtract from each column the mean of that column!

    • In MATLAB: Xcentered=X-repmat(mean(X),size(X,1),1)

    Basics: data matrix

  • • We can collect the covariances of the dataset into a covariance matrix

    $$\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$$

    • For simplicity we often say that 𝑿𝑇𝑿 is the data covariance matrix (see the sketch below)

    $$\mathbf{C} = \begin{pmatrix} \mathrm{cov}(X_1,X_1) & \mathrm{cov}(X_1,X_2) & \cdots & \mathrm{cov}(X_1,X_p) \\ \mathrm{cov}(X_2,X_1) & \mathrm{cov}(X_2,X_2) & \cdots & \mathrm{cov}(X_2,X_p) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(X_p,X_1) & \mathrm{cov}(X_p,X_2) & \cdots & \mathrm{cov}(X_p,X_p) \end{pmatrix}$$

    Basics: covariance matrix
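
    A minimal sketch of this relation (assuming X is already centered; cov is MATLAB's built-in covariance function, which uses the same 1/(n−1) normalization):

    n = size(X,1);
    C = (X'*X)/(n-1);             % covariance matrix from the centered data matrix
    max(max(abs(C - cov(X))))     % ~0: agrees with MATLAB's cov()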

  • • Transpose of 𝑿 is 𝑿𝑇: interchange rows and columns; note that (𝑨𝑩)𝑇 = 𝑩𝑇𝑨𝑇

    • Identity matrix 𝑰: diagonal values are 1, all else 0, e.g.,

    $$\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

    • Inverse matrix 𝑿−1 defined by 𝑿−1𝑿 = 𝑿𝑿−1 = 𝑰

    • Orthogonal matrix: 𝑿𝑇𝑿 = 𝑰 ➔ 𝑿−1 = 𝑿𝑇

    – Columns of an orthogonal matrix are orthogonal unit vectors

    Basics: matrix formulae and properties

  • Mathematics of Principal Component Analysis...

  • • How to obtain 𝑷 and 𝑾 from the data?

    • First note that

    – the principal axes are orthogonal

    – the principal components (new coordinates) are uncorrelated ➔ their covariance matrix is diagonal

    • The covariance matrix is proportional to 𝑿𝑇𝑿

    • Also since 𝑷 = 𝑿𝑾 ➔ 𝑿 = 𝑷𝑾𝑇

    • Covariance matrix can be written as

    • 𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇

    Calculating principal components

  • • Because the principal components are uncorrelated

    $$\mathbf{P}^T\mathbf{P} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2), \quad \text{where } \sigma_i^2 = \mathrm{cov}(P_i, P_i)$$

    • The covariance matrix can be written as

    𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇

    • Multiplying from the right with 𝑾 gives (𝑿𝑇𝑿)𝑾 = 𝑾𝑷𝑇𝑷

    • If 𝑤𝑖 is the 𝑖:th column vector of 𝑾, then for each 𝑤𝑖 we have

    (𝑿𝑇𝑿)𝑤𝑖 = 𝜎𝑖2𝑤𝑖

    Calculating principal components

  • (𝑿𝑇𝑿)𝑤𝑖 = 𝜎𝑖2𝑤𝑖

    • This equation is an eigenvalue equation

    • ➔ The vectors 𝑤𝑖 are eigenvectors of the covariance matrix, and the corresponding eigenvalues 𝜎𝑖2 are the variances of the principal components

    Calculating principal components

  • • The covariance of principal components and the original data is

    𝑷𝑇𝑿 = 𝑷𝑇𝑷𝑾𝑇

    • It can easily be shown that the covariance of the 𝑖:th PC with the 𝑗:th variable is

    $$\mathrm{cov}(X_j, P_i) = \sigma_i^2\, W_{j,i} \;\;\Rightarrow\;\; W_{j,i} = \frac{1}{\sigma_i^2}\,\mathrm{cov}(X_j, P_i)$$

    • ➔ Components of the eigenvectors (or EOFs) describe the covariance of the data series with corresponding principal components ➔ They reveal the covariance structure of the data related to each component!

    Meaning of Eigenvectors

  • • By these results PCA analysis can be computed as follows:

    1. Compute the covariance matrix 𝑿𝑇𝑿

    2. Compute the eigenvectors and eigenvalues of the covariance matrix. E.g., [W,s]=eig(X'*X) in MATLAB

    3. Compute the matrix of principal components as 𝑷 = 𝑿𝑾 (see the sketch below)

    Computing PCA from covariance matrix
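
    A minimal sketch of these three steps (assuming X is centered; note that eig does not return the eigenvalues sorted from largest to smallest, so the sorting step is an addition to what the slide shows):

    C = X'*X;                                    % (unnormalized) covariance matrix
    [W, D] = eig(C);                             % eigenvectors W, eigenvalues on diag(D)

    [pcVar, order] = sort(diag(D), 'descend');   % largest variance first
    W = W(:, order);                             % reorder the principal axes accordingly

    P = X*W;                                     % principal components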

  • • The data matrix can be decomposed into parts as

    𝑿 = 𝑼𝑺𝑽𝑇

    • Where

    – 𝑼 is a 𝑛 × 𝑘 matrix, with orthogonal columns

    – 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺𝑇)

    – 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns

    • The covariance matrix is now

    𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇 = 𝑽𝑺𝟐𝑽𝑇

    Singular Value Decomposition (SVD)

  • • We can compare the two expressions for covariance matrix

    𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇

    𝑿𝑇𝑿 = 𝑾𝑷𝑇𝑷𝑾𝑇

    • Thus clearly

    – 𝑾 = 𝑽

    – 𝑷 = 𝑼𝑺

    – The diagonal values of 𝑺 are 𝜎𝑖, i.e., the standard deviations of the principal components

    Singular Value Decomposition (SVD)

  • • Principal component analysis can now be easily calculated directly from the Singular Value Decomposition of the data matrix (see the sketch below)

    1. Compute the SVD. MATLAB: [U,S,V]=svd(X)

    2. Compute the principal components: 𝑷 = 𝑼𝑺. MATLAB: P=U*S

    3. The eigenvectors depict the principal axes and are the columns of 𝑽. The columns are also often called Empirical Orthogonal Functions (EOFs)

    4. The variances of the PCs are the squared diagonal values of 𝑺. MATLAB: diag(S).^2

    Computing PCA by SVD
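
    Putting the steps together, a minimal sketch (assuming X is centered; the 'econ' option, not mentioned on the slide, asks for the economy-size SVD and avoids a huge 𝑛 × 𝑛 matrix 𝑼 when 𝑛 ≫ 𝑝):

    [U, S, V] = svd(X, 'econ');      % economy-size SVD of the centered data matrix
    P     = U*S;                     % principal components (columns)
    EOFs  = V;                       % principal axes / EOFs (columns)
    pcVar = diag(S).^2;              % variances of the principal components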

  • • In MATLAB you can compute the PCA directly (part of the Statistics Toolbox; see the usage sketch below)

    • [coeff,score,latent] = pca(X)

    – coeff = matrix with factor loadings as columns

    – score = matrix with principal components as columns

    – latent = variances of principal components (eigenvalues of covariance matrix)

    – You can also get lots of other outputs, check the MATLAB help…

    PCA in MATLAB
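
    A usage sketch (pca centers the data internally, so the scores should match the manual projection of the centered data onto the loadings):

    [coeff, score, latent] = pca(X);           % PCA via the Statistics Toolbox

    Xc = X - repmat(mean(X), size(X,1), 1);    % center the columns manually
    Pmanual = Xc*coeff;                        % project onto the loadings
    max(abs(score(:) - Pmanual(:)))            % ~0: score = Xcentered*coeff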

  • How to analyse and use the output of PCA?

  • • From the SVD, or from the eigenvalues of the covariance matrix, we get the variance 𝜎𝑖2 of each PC

    • From the previous example:

    • MATLAB: [U,S,V]=svd(X)

    • The variances are:

    – 𝜎1², i.e., S(1)^2 = 2681.4

    – 𝜎2², i.e., S(2)^2 = 88.181

    • A better indicator is the relative variance (see the sketch below):

    $$\frac{\sigma_i^2}{\sum_{i=1}^{p}\sigma_i^2}$$

    • The relative variances here are 0.968 and 0.032

    Analysing the variance
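
    A minimal sketch of the relative-variance computation (assuming S comes from the svd call above):

    pcVar  = diag(S).^2;             % variances of the principal components
    relVar = pcVar / sum(pcVar);     % fraction of the total variance in each PC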

  • • The relative variance indicates what fraction of the variance in the data is captured by each principal component

    • In correlated data, typically only a small subset of the PCs explains most of the variance

    • ➔ Can be used to compress the data!

    • I.e., reduce the dimension of the data

    • This also helps in analysis and interpretation

    Dimensional reduction (1)

  • • The reduction can be done by taking only the first 𝑘 PCs that explain most of the variance (e.g., 95%)

    • The procedure would be, e.g., with [U,S,V]=svd(X):

    1. Look at the variances in 𝑺 and determine how many PCs you want (see the sketch below)

    2. PCs 𝑷 = 𝑼𝑺 and eigenvectors 𝑽

    3. Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced (approximate) data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)𝑇

    In MATLAB: P=U*S; X=P(:,1:k)*V(:,1:k)';

    Dimensional reduction (2)
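
    A minimal sketch of step 1, choosing 𝑘 from the cumulative relative variance (the 95% threshold is just the example figure from the slide):

    relVar = diag(S).^2 / sum(diag(S).^2);        % relative variance of each PC
    k = find(cumsum(relVar) >= 0.95, 1, 'first'); % smallest k reaching 95% of variance

    P = U*S;
    Xreduced = P(:,1:k)*V(:,1:k)';                % rank-k approximation of the data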

  • • PCA finds the components on the basis of the total variance they explain

    • ➔ The variable with the largest variance DOMINATES the result!

    • ➔ Possibly a problem in cases where variables have different units or ranges of variation

    • In most (but not all) cases it makes sense to study standardized variations in the data (see also the sketch below)

    ➔ For each column of the data matrix: 𝑍𝑖 = (𝑋𝑖 − 𝜇)/𝜎

    • MATLAB: Xstandard=Xcentered./repmat(std(Xcentered),size(Xcentered,1),1)

    To standardize or not?
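
    In MATLAB releases from R2016b onward (which support implicit expansion), the same centering and standardization can be written more compactly; a small sketch, equivalent to the repmat forms above:

    Xcentered = X - mean(X);                   % subtract the column means
    Xstandard = Xcentered ./ std(Xcentered);   % divide each column by its standard deviation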

  • • Take a dataset of 24 stock prices from Helsinki stock market

    – Each is a daily dataset from 2010 to 2016

    – ➔ In total 24 x 1531 = 36744 data points

    • Prices are at very different levels

    ➔ Standardize the prices first

    Stock market example

  • • Standardized prices clearly show some correlated behavior

    • From 24 data series it is difficult to see the common structure

    • ➔ Do PCA on the data

    Stock market example

  • • The relative variances of the PCs (the so-called scree plot) show that

    – PC1 explains about 55% of the variance

    – PC2 about 20%

    – PC3 about 11%...

    – ➔ The 6 first PCs explain about 95% of the variance

    Stock market example

  • • Time series of principal components

    Stock market example

  • • Consider the eigenvectors (EOFs) of the principal components

    • Stock #9 = Kone Oyj: positive loading for PC1 ➔ positive price trend over the last 5 years

    • Stock #16 = Outokumpu Oyj: negative loading for PC1 ➔ negative price trend

    • Full interpretation of these factors is very complicated and requires understanding of the markets and the economic system

    Stock market example

  • • We can reduce the dimension of the dataset from 24 to 6 and retain 95% of the variance!

    • ➔ Recompute the data by taking the first 6 components

    • To store the data with 95% of the original variance we need the 6 PC time series and the 6 x 24 eigenvector components: 6 x 1531 + 6 x 24 = 9330 points, i.e., about 25% of the original points

    Stock market example

  • Rotations of PCA results

  • • PCA typically gives a few components which explain most of the variation and have a large effect on a large fraction of variables

    • Because of this PCA results can be difficult to interpret in physical context

    • For example, the temperature modes have a significant response all over the Northern Hemisphere

    • Rotation of the PCA results can often help...

    Problem with PCA interpretation

    [Figure: maps of EOF 1, EOF 2 and EOF 3]

  • • With SVD the data matrix is 𝑿 = 𝑼𝑺𝑽𝑇

    • This decomposition can be viewed in two ways:

    1. The PCA interpretation:

    – Principal component scores: 𝑷 = 𝑼𝑺 (incorporates the variance)

    – Eigenvectors 𝑽 (orthogonal) ➔ 𝑿 = 𝑷𝑽𝑇

    2. The Factor Analysis (FA) interpretation:

    – Standardized principal component scores: 𝑼

    – Factor loadings: 𝑳 = 𝑽𝑺 (non-orthogonal and incorporates the variance) ➔ 𝑿 = 𝑼𝑳𝑇

    Two ways to view PCA

  • • Assume an orthogonal (rotation) matrix 𝑹 so that 𝑹𝑹𝑇 = 𝑹𝑇𝑹 = 𝑰

    • Data matrix can be written as

    𝑿 = 𝑼𝑺𝑽𝑇 = 𝑼𝑺𝑹𝑹𝑇𝑽𝑇 = 𝑷𝑟𝑜𝑡𝑽𝑟𝑜𝑡𝑇

    • From this form we have

    – Rotated principal components: 𝑷𝑟𝑜𝑡 = 𝑼𝑺𝑹 = 𝑷𝑹

    – Rotated eigenvectors: 𝑽𝑟𝑜𝑡 = 𝑽𝑹

    • Rotation this way:

    – Preserves the orthogonality of the eigenvectors, but

    – The rotated principal components are not uncorrelated anymore!

    – Variance is redistributed among rotated components!

    Rotation: PCA interpretation

  • • The data matrix can also be written as 𝑿 = 𝑼𝑺𝑽𝑇 = 𝑼𝑹𝑹𝑇𝑺𝑽𝑇 = 𝑼𝑟𝑜𝑡𝑳𝑟𝑜𝑡𝑇

    • From this form we have

    – Rotated standardized principal components: 𝑼𝑟𝑜𝑡 = 𝑼𝑹

    – Rotated factor loadings: 𝑳𝑟𝑜𝑡 = 𝑽𝑺𝑹 = 𝑳𝑹

    • Rotation this way (more common):

    – Rotated factor loadings and eigenvectors are non-orthogonal

    – Rotated standardized scores remain uncorrelated

    – Variance is redistributed among the rotated components!

    Rotation: FA interpretation

  • • There are several ways a rotation matrix can be chosen

    • The most common is the VARIMAX rotation method

    $$R_{\mathrm{VARIMAX}} = \arg\max_{R}\;\sum_{j=1}^{p}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{L}\mathbf{R}\right)_{ij}^{4} - \left(\frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{L}\mathbf{R}\right)_{ij}^{2}\right)^{2}\right]$$

    • This criterion maximizes the variance of the squared factor loadings summed over all factors

    VARIMAX rotation

  • • ➔ It produces a solution where each factor only affects a few variables

    ➔ A simple solution

    • BUT: There is no guarantee that any rotated (or unrotated) solution correctly represents any physical mode affecting the system!

    • Typically we include only the most important principal components (which contain most of the variance) in the rotation

    VARIMAX rotation

  • • Factor loading matrix from the surface temperature example

    VARIMAX rotation

    [Figure: unrotated vs. rotated factor loadings]

  • VARIMAX rotation

    [Figure: maps of the unrotated and rotated factor loadings]

  • • Other rotation methods:

    – QUARTIMAX: minimizes the number of factors needed to explain each variable (maximizes the variance of the factor loadings within each row)

    – EQUIMAX: a compromise between VARIMAX and QUARTIMAX

    – Orthomax, Promax, Procrustes... etc.

    • A more interesting method, which is effectively also a rotation method, is Independent Component Analysis (ICA)

    • ICA is perhaps the best method to look for physically independent signals in the data.

    Other rotations

  • • In MATLAB one can do the rotation with the function rotatefactors() (part of the Statistics Toolbox)

    • [Lrot,R]=rotatefactors(L)

    or, including only a subset (the first 𝑘) of PCs,

    [Lrot,R]=rotatefactors(L(:,1:k))

    • There are different rotation options (see the MATLAB help), e.g., [Lrot,R]=rotatefactors(L,'Method','quartimax')

    • One can rotate either the factor loadings or the eigenvectors (a combined sketch follows below)

    How to rotate in MATLAB
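
    Putting the FA-interpretation rotation together, a minimal sketch (assuming X has been centered/standardized and 𝑘 chosen from the scree plot; VARIMAX is the default method of rotatefactors):

    % k = number of retained components, chosen beforehand from the scree plot
    [U, S, V] = svd(X, 'econ');              % SVD of the (standardized) data matrix
    L = V*S;                                 % factor loadings (FA interpretation)

    [Lrot, R] = rotatefactors(L(:,1:k));     % VARIMAX-rotate the first k loadings
    Urot = U(:,1:k)*R;                       % correspondingly rotated (standardized) scores

    Xapprox = Urot*Lrot';                    % rank-k reconstruction, unchanged by the rotation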