Timo Asikainen
ReSoLVE Centre of Excellence, Space Climate Research Unit
University of Oulu
[email protected]
Finding and analysing patterns in multivariate datasets
I4Future Summer School 13-17.5.2019
• Introduction to Principal Component Analysis as a method to find and analyse patterns
• Some mathematics
• A few real world examples
• Detailed extra material for self study
Contents
• Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of partly correlated variables into a set of values of linearly uncorrelated variables called principal components, which explain as much common variance as possible.
What is PCA?
• Two correlated variables
• Largest variation in the points is along the principal axis
• ➔ There is a principal component in the data, which consists of the combined correlated variability seen in X and Y
Example of correlated variables
Example of correlated variables
• We can project the data into these principal axes to obtain principal components PC1 and PC2
• Here the same data is represented with the principal components, i.e., transformed to the new system
• Same method can be applied in higher dimensions, e.g., 3D, 4D,…, ND
• ➔ PCA finds the directions of maximum variance and principal components are projections of the data to these directions
Example of a 3D dataset
Mathematics of Principal Component Analysis...
• It is convenient to handle data in matrix form
• Data matrix 𝑿 holds individual datasets in its columns
𝑛 data points, 𝑝 variables
\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}
• Data samples = Observations in time, individuals, pixels, material samples etc.
• Key assumption: Centered data
\mathbf{X} = \begin{pmatrix} X_{1,1} - \bar{X}_1 & X_{1,2} - \bar{X}_2 & \cdots & X_{1,p} - \bar{X}_p \\ X_{2,1} - \bar{X}_1 & X_{2,2} - \bar{X}_2 & \cdots & X_{2,p} - \bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1} - \bar{X}_1 & X_{n,2} - \bar{X}_2 & \cdots & X_{n,p} - \bar{X}_p \end{pmatrix}
Basics: data matrix
Rows = Data samples
Columns = Data series (variables)
• 𝑷=matrix with principal components as columns
• 𝑾=matrix with principal axes (direction vectors) as columns
• Axes are orthogonal!
𝑾𝑇𝑾 = 𝑰
• Principal axes are also called EOFs (Empirical Orthogonal Functions)
• By definition we have 𝑷 = 𝑿𝑾
• I.e., each column of 𝑷 is a projection of the data to the corresponding axis (column of 𝑾)
• ➔ 𝑷 represents the data in the new coordinates
• The original data matrix is thus a weighted sum of orthogonal modes (principal axes), with the PCs as the weights
Principal components as projections of data
• PCA decomposes original data matrix into a sum of orthogonal modes
𝑿 = 𝑷𝑾𝑇
• Number of modes (EOFs)/PCs = number of data series (variables)
• For example, each data sample (row 𝑖 of 𝑿) is a sum of the EOFs weighted by the PC values (a short MATLAB sketch follows below):
𝑿ᵢ = 𝑃𝐶₁(𝑖) × 𝑬𝑶𝑭₁ + 𝑃𝐶₂(𝑖) × 𝑬𝑶𝑭₂ + ⋯
Original data matrix decomposed into modes
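• A minimal MATLAB sketch of these two relations (illustrative only; the synthetic two-variable data, random seed and variable names are assumptions, and the principal axes are obtained with the built-in svd, introduced later in the slides):
% Illustrative synthetic two-variable dataset (not from the slides)
rng(1);                                   % reproducible example
x1 = randn(500,1);
x2 = 0.8*x1 + 0.3*randn(500,1);           % correlated with x1
X  = [x1 x2];
X  = X - repmat(mean(X), size(X,1), 1);   % center each column

[~, ~, W] = svd(X, 'econ');               % principal axes as columns of W
P    = X*W;                               % projection: P = X W
Xrec = P*W';                              % reconstruction: X = P W'
max(abs(X(:) - Xrec(:)))                  % ~0 (up to rounding error)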
Principal components as projections of data
[Figure: the data plotted with the 1st and 2nd principal axes; projections onto these axes give PC1 and PC2]
• We can obtain the following equation for the unknown principal axes (𝑤ᵢ = columns of 𝑾) (see details in Extra material):
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
• This equation is an eigenvalue equation
• ➔ The vectors 𝑤ᵢ are eigenvectors of the data covariance matrix, and the corresponding eigenvalues σᵢ² are the variances of the principal components
Calculating principal components
• The data matrix can be decomposed (SVD) into parts as
𝑿 = 𝑼𝑺𝑽𝑇
• Where
– 𝑼 is an 𝑛 × 𝑘 matrix with orthogonal columns
– 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺𝑇)
– 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns
• It can be shown that (see Extra material)
– 𝑾 = 𝑽
– 𝑷 = 𝑼𝑺
– Diagonal values of 𝑺 are σᵢ, i.e., the standard deviations of the principal components (their squares σᵢ² are the PC variances)
PCA by Singular Value Decomposition (SVD)
• Relative PC variances reveal how important they are, i.e., what fraction of data variance they contain
• In correlated data, typically only a small subset of the PCs explains most of the variance
• ➔ Can be used to compress the data!
• I.e., reduce the dimension of the data
➔ Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced (approximate) data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)ᵀ
• This also helps in analysis and interpretation
Dimensional reduction
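• A short MATLAB sketch of this compression idea (a sketch, not the lecture's code; it assumes a centered data matrix X already exists in the workspace and uses k = 3 as an illustrative choice):
[U, S, V] = svd(X, 'econ');
P  = U*S;                                  % principal components
k  = 3;                                    % number of modes to keep (illustrative choice)
Xk = P(:,1:k)*V(:,1:k)';                   % rank-k approximation of the data
explained = sum(diag(S(1:k,1:k)).^2) / sum(diag(S).^2)   % fraction of variance retained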
How to analyse and use the output of PCA?
- Example of satellite data
• Data from 24 latitudes, 38 years in daily resolution, about 14000 samples
• Let’s do PCA for this data set
• 𝑿 = 𝑼𝑺𝑽𝑇
• 𝑼𝑺 gives the PCs, 𝑺 the PC variances, and 𝑽 the EOFs
Dataset: Latitude distribution of electrons at low-altitude Earth orbit
• The first 3-4 PCs contain the majority of the data variance (over 90%)
• The remaining PCs may contain either some physically interesting signal or just random noise
PC variances
• PCs depict how the modes vary with time
• EOFs depict by what weight each mode affects different data series (latitudes)
PCs and corresponding EOFs
Climate example
• NASA/GISS monthly ground temperature dataset
• 5° x 5° grid ➔ 2592 grid points in total
• Each grid point has a monthly temperature time series
• Temperature anomaly = T − Tmean for each grid point separately
Ground temperature dataset
• In many cases one or more of the variables may have data gaps
• Simplest way to deal with gaps: exclude all rows that contain any missing values (a short MATLAB sketch follows below)
• Note: there are more sophisticated ways to deal with gaps (e.g., DINEOF analysis) ...
Note on data gaps
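• A minimal sketch of the row-exclusion approach in MATLAB, assuming the missing values are marked as NaN (an assumption; the slides do not specify the marker):
good = ~any(isnan(X), 2);                 % rows with no missing values
Xcomplete = X(good, :);                   % keep only complete samples
fprintf('Kept %d of %d samples\n', sum(good), size(X,1));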
• It is hard to see global correlated patterns from individual time series
• Here three randomly chosen spots from different parts of the world (blue is Oulu, Finland)
Examples of January temperatures
[Figure: temperature anomaly (°C) time series at the three locations]
• First PC explains about 25% of variance in Northern Hemisphere January temperatures.
• This mode is called Northern Annular Mode, NAM
January ground temperature: PCA
[Map of EOF 1]
January ground temperature: PCA
• The first three EOFs
• Representation of the EOFs in terms of their geographical location gives them a physical interpretation
• The positive phase of the NAM acts to produce warm conditions in Eurasia
[Maps of EOF 1, EOF 2 and EOF 3]
Ground temperature dataset: PCA
[Maps of EOF 2 and EOF 3]
• PCA decomposes multivariate dataset into orthogonal modes (EOFs) and their sample to sample variation (PCs)
• Instead of analysing all individual data series one can analyse a much smaller number of modes
• ➔ Reduction of dimensionality
• ➔ Often easier to interpret
• Simple to calculate:
• SVD decomposition: 𝑿 = 𝑼𝑺𝑽𝑇 and 𝑷 = 𝑼𝑺
Take home points
Extra in depth material for self study
Some basics first...
• Covariance of two data sets:
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
• If 𝑐𝑜𝑣(𝑋, 𝑌) > 0 positive values of X tend to appear when Y has a positive value too
• If 𝑐𝑜𝑣(𝑋, 𝑌) < 0, positive values of X tend to appear when Y is negative (and vice versa)
• If 𝑐𝑜𝑣(𝑋, 𝑌) = 0, there is no linear relationship between X and Y
Basics: Covariance
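• As a quick numerical check of the definition above, a MATLAB sketch with made-up vectors x and y (the data and seed are purely illustrative):
rng(3);
x = randn(1000,1);
y = 0.5*x + randn(1000,1);                       % partly correlated with x
n = numel(x);
c = sum((x - mean(x)).*(y - mean(y))) / (n - 1); % covariance from the definition above
C = cov(x, y);                                   % built-in: 2x2 covariance matrix
[c, C(1,2)]                                      % the two values agree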
• It is convenient to handle data in matrix form
• Data matrix 𝑿 holds individual datasets in its columns: 𝑛 data points, 𝑝 variables
\mathbf{X} = \begin{pmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,p} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,p} \\ \vdots & \vdots & & \vdots \\ X_{n,1} & X_{n,2} & \cdots & X_{n,p} \end{pmatrix}
• Centered data:
\mathbf{X} = \begin{pmatrix} X_{1,1} - \bar{X}_1 & X_{1,2} - \bar{X}_2 & \cdots & X_{1,p} - \bar{X}_p \\ X_{2,1} - \bar{X}_1 & X_{2,2} - \bar{X}_2 & \cdots & X_{2,p} - \bar{X}_p \\ \vdots & \vdots & & \vdots \\ X_{n,1} - \bar{X}_1 & X_{n,2} - \bar{X}_2 & \cdots & X_{n,p} - \bar{X}_p \end{pmatrix}
• From here on we also assume data is centered, i.e., subtract from each column the mean of that column!
• In MATLAB: Xcentered=X-repmat(mean(X),size(X,1),1)
Basics: data matrix
• We can collect the covariances of the dataset into a covariance matrix
\mathbf{C} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}
• For simplicity we often say that 𝑿𝑇𝑿 is the data covariance matrix
• \mathbf{C} = \begin{pmatrix} \mathrm{cov}(X_1,X_1) & \mathrm{cov}(X_1,X_2) & \cdots & \mathrm{cov}(X_1,X_p) \\ \mathrm{cov}(X_2,X_1) & \mathrm{cov}(X_2,X_2) & \cdots & \mathrm{cov}(X_2,X_p) \\ \vdots & \vdots & & \vdots \\ \mathrm{cov}(X_p,X_1) & \mathrm{cov}(X_p,X_2) & \cdots & \mathrm{cov}(X_p,X_p) \end{pmatrix}
Basics: covariance matrix
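• The same relation in MATLAB (a sketch; X is assumed to be an n-by-p data matrix in the workspace):
n  = size(X,1);
Xc = X - repmat(mean(X), size(X,1), 1);   % make sure the columns are centered
C  = (Xc'*Xc) / (n - 1);                  % covariance matrix from the data matrix
norm(C - cov(X))                          % ~0: the built-in cov() gives the same result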
• Transpose of 𝑿 is 𝑿ᵀ: interchange rows and columns; note (𝑨𝑩)ᵀ = 𝑩ᵀ𝑨ᵀ
• Identity matrix 𝑰: diagonal values are 1, all else 0, e.g.,
\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
• Inverse matrix 𝑿⁻¹ defined by 𝑿⁻¹𝑿 = 𝑿𝑿⁻¹ = 𝑰
• Orthogonal matrix: 𝑿ᵀ𝑿 = 𝑰 ➔ 𝑿⁻¹ = 𝑿ᵀ
– Columns of an orthogonal matrix are orthogonal vectors
Basics: matrix formulae and properties
Mathematics of Principal Component Analysis...
• How to obtain 𝑷 and 𝑾 from the data?
• First note that
– the principal axes are orthogonal
– Principal components (new coordinates) are uncorrelated ➔ their covariance matrix is diagonal
• The covariance matrix is proportional to 𝑿𝑇𝑿
• Also since 𝑷 = 𝑿𝑾 ➔ 𝑿 = 𝑷𝑾𝑇
• Covariance matrix can be written as
• 𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇
Calculating principal components
• Because principal components are uncorrelated
𝑷ᵀ𝑷 = diag(σ₁², σ₂², …, σₚ²), where σᵢ² = cov(𝑃ᵢ, 𝑃ᵢ)
• Covariance matrix can be written as
𝑿𝑇𝑿 = (𝑷𝑾𝑇)𝑇𝑷𝑾𝑇 = 𝑾𝑷𝑇𝑷𝑾𝑇
• Multiplying from the right by 𝑾 gives (𝑿ᵀ𝑿)𝑾 = 𝑾𝑷ᵀ𝑷
• If 𝑤𝑖 is the 𝑖:th column vector of 𝑾, then for each 𝑤𝑖 we have
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
Calculating principal components
(𝑿ᵀ𝑿)𝑤ᵢ = σᵢ²𝑤ᵢ
• This equation is an eigenvalue equation
• ➔ The vectors 𝑤ᵢ are eigenvectors of the covariance matrix, and the corresponding eigenvalues σᵢ² are the variances of the principal components
Calculating principal components
• The covariance of principal components and the original data is
𝑷𝑇𝑿 = 𝑷𝑇𝑷𝑾𝑇
• It can easily be shown that covariance of 𝑖:th PC with the j:th variable is
cov(𝑋ⱼ, 𝑃ᵢ) = σᵢ²𝑊ⱼ,ᵢ ➔ 𝑊ⱼ,ᵢ = cov(𝑋ⱼ, 𝑃ᵢ) / σᵢ²
• ➔ Components of the eigenvectors (or EOFs) describe the covariance of the data series with corresponding principal components ➔ They reveal the covariance structure of the data related to each component!
Meaning of Eigenvectors
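• A small MATLAB check of this relation (a sketch, assuming a centered data matrix X in the workspace):
[U, S, V] = svd(X, 'econ');
P      = U*S;                                 % principal components
sigma2 = diag(S).^2;                          % eigenvalues of X'*X (PC variances)
lhs    = X'*P;                                % element (j,i) = cov(X_j, P_i) in this convention
rhs    = V .* repmat(sigma2', size(V,1), 1);  % sigma_i^2 * W_{j,i}
max(abs(lhs(:) - rhs(:)))                     % ~0 up to rounding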
• By these results PCA analysis can be computed as follows:
1. Compute covariance matrix 𝑿𝑇𝑿
2. Compute the eigenvectors and eigenvalues of the covariance matrix, e.g., [W,D]=eig(X'*X) in MATLAB (the eigenvalues are on the diagonal of D)
3. Compute the matrix of principal components as 𝑷 = 𝑿𝑾
Computing PCA from covariance matrix
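• The three steps as a runnable MATLAB sketch (assuming a centered data matrix X; note that eig does not return the eigenvalues ordered from largest to smallest, so they are sorted explicitly here):
C = X'*X;                                   % (unnormalized) covariance matrix of the centered data
[W, D] = eig(C);                            % eigenvectors W, eigenvalues on the diagonal of D
[sigma2, order] = sort(diag(D), 'descend'); % order the modes by their variance
W = W(:, order);
P = X*W;                                    % principal components
relvar = sigma2 / sum(sigma2);              % relative variance of each PC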
• The data matrix can be decomposed into parts as
𝑿 = 𝑼𝑺𝑽𝑇
• Where
– 𝑼 is an 𝑛 × 𝑘 matrix with orthogonal columns
– 𝑺 is a 𝑘 × 𝑘 diagonal matrix whose elements are called the singular values (note that 𝑺 = 𝑺ᵀ)
– 𝑽 is a 𝑝 × 𝑘 matrix with orthogonal columns
• The covariance matrix is now
𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇 = 𝑽𝑺𝟐𝑽𝑇
Singular Value Decomposition (SVD)
• We can compare the two expressions for covariance matrix
𝑿𝑇𝑿 = 𝑽𝑺𝑼𝑇𝑼𝑺𝑽𝑇
𝑿𝑇𝑿 = 𝑾𝑷𝑇𝑷𝑾𝑇
• Thus clearly
– 𝑾 = 𝑽
– 𝑷 = 𝑼𝑺
– Diagonal values of 𝑺 are σᵢ, i.e., the standard deviations of the principal components (their squares σᵢ² are the PC variances)
Singular Value Decomposition (SVD)
• Principal component analysis can now be easily calculated directly from the Singular Value Decomposition of the data matrix
1. Compute the SVD. MATLAB: [U,S,V]=svd(X)
2. Compute the principal components: 𝑷 = 𝑼𝑺. MATLAB: P=U*S
3. Eigenvectors depict the principal axes and are the columns of 𝑽. The columns are also often called Empirical Orthogonal Functions (EOFs)
4. Variances of the PCs are the squared diagonal values of 𝑺. MATLAB: diag(S).^2
Computing PCA by SVD
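• The same steps as a self-contained MATLAB sketch (the synthetic data is purely illustrative; for tall data matrices svd(X,'econ') is faster than the full SVD):
rng(2);                                         % reproducible illustration
M = [1 0.8 0.5 0 0; 0 1 0.3 0 0; 0 0 1 0 0; 0 0 0 1 0; 0 0 0 0 1];
X = randn(1000,5) * M;                          % 1000 samples of 5 partly correlated variables
X = X - repmat(mean(X), size(X,1), 1);          % center the columns

[U, S, V] = svd(X, 'econ');                     % step 1: SVD of the data matrix
P    = U*S;                                     % step 2: principal components
EOFs = V;                                       % step 3: principal axes / EOFs as columns
vars = diag(S).^2;                              % step 4: PC variances
relvar = vars / sum(vars)                       % relative variances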
• In MATLAB you can compute the PCA directly (part of the Statistics Toolbox)
• [coeff,score,latent] = pca(X)
– coeff = matrix with the principal axes (eigenvectors 𝑽) as columns; MATLAB calls these loadings
– score = matrix with principal components as columns
– latent = variances of principal components (eigenvalues of covariance matrix)
– You can also get lots of other outputs, check the MATLAB help…
PCA in MATLAB
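• A short sketch relating pca() to the SVD approach above: pca centers the data internally, so its outputs match the SVD of the centered matrix up to possible sign flips of individual components (X is assumed to be in the workspace):
[coeff, score, latent] = pca(X);                % Statistics Toolbox; centers X internally
Xc = X - repmat(mean(X), size(X,1), 1);
[U, S, V] = svd(Xc, 'econ');
% coeff corresponds to V and score to U*S (up to sign flips of individual columns);
% latent equals the PC variances normalized by (n-1):
n = size(X,1);
max(abs(latent - diag(S).^2/(n - 1)))           % ~0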
How to analyse and use the output of PCA?
• From the SVD or from the eigenvalues of the covariance matrix we get the variances σᵢ² for each PC
• From the previous example:
Analysing the variance
• MATLAB: [U,S,V]=svd(X)
• Variances are:
– σ₁² = S(1,1)^2 = 2681.4
– σ₂² = S(2,2)^2 = 88.181
• A better indicator is the relative variance: \sigma_i^2 / \sum_{j=1}^{p} \sigma_j^2
• Relative variances here are 0.968 and 0.032
• Relative variance indicates what fraction of variance in the data is captured by the principalcomponents
• In correlated data, typically only a small subset of the PCs explains most of the variance
• ➔ Can be used to compress the data!
• I.e., reduce the dimension of the data
• This also helps in analysis and interpretation
Dimensional reduction (1)
• Reduction can be done by taking only the first 𝑘 PCs that explain most of the variance (e.g., 95%)
• The procedure would be, e.g., with [U,S,V]=svd(X):
1. Look at the variances in 𝑺 and determine how many PCs you want
2. PCs 𝑷 = 𝑼𝑺 and eigenvectors 𝑽
3. Take only the first 𝑘 columns of 𝑷 and 𝑽 and compute the reduced data matrix as 𝑿 ≈ 𝑷(𝑘)𝑽(𝑘)ᵀ
In MATLAB: P=U*S; Xreduced=P(:,1:k)*V(:,1:k)';
Dimensional reduction (2)
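• A sketch of steps 1-3 with the number of retained PCs chosen automatically from the cumulative variance (the 95% threshold is an illustrative choice; X is assumed centered):
[U, S, V] = svd(X, 'econ');
vars   = diag(S).^2;
relvar = vars / sum(vars);
k = find(cumsum(relvar) >= 0.95, 1);       % smallest k that explains >= 95% of the variance
P = U*S;
Xreduced = P(:,1:k)*V(:,1:k)';             % reduced (rank-k) data matrix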
• PCA finds the components on the basis of the total variance they explain
• The variable with the largest variance DOMINATES the result!
• ➔ Possibly a problem in cases where variables have different units or ranges of variation
• In most (but not all) cases it makes sense to study standardized variations in the data
➔ standardize each column of the data matrix: 𝑍ᵢ = (𝑋ᵢ − μ)/σ, where μ and σ are the mean and standard deviation of that column
• MATLAB: Xstandard=Xcentered./repmat(std(Xcentered),size(Xcentered,1),1)
To standardize or not?
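• The Statistics Toolbox function zscore does the same standardization in one call; PCA of the standardized data then corresponds to an analysis of the correlation matrix:
Xstandard = zscore(X);                  % subtract column means, divide by column std
[U, S, V] = svd(Xstandard, 'econ');     % PCA of the standardized data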
• Take a dataset of 24 stock prices from Helsinki stock market
– Each is a daily dataset from 2010 to 2016
– ➔ In total 24 x 1531 = 36744 data points
• Prices are at very different levels
➔ Standardize the prices first
Stock market example
• Standardized prices clearly show some correlated behavior
• From 24 data series it is difficult to see the common structure
• ➔ Do PCA on the data
Stock market example
• Relative variances of the PCs (the so-called Scree plot) show that
– PC1 explains about 55% of the variance
– PC2 about 20%
– PC3 about 11% ...
– ➔ The first 6 PCs explain about 95% of the variance
Stock market example
• Time series of principal components
Stock market example
• Consider the eigenvectors (EOFs) of the principal components
• Stock #9 = Kone Oyj: positive loading for PC1 ➔ positive price trend over the last 5 years
• Stock #16 = Outokumpu Oyj: negative loading for PC1 ➔ negative price trend
• Full interpretation of these factors is very complicated and requires understanding of the markets and the economic system
Stock market example
• We can reduce the dimension of the dataset from 24 to 6 and retain 95% of the variance!
• ➔ Recompute the data by taking the first 6 components
• To store the data with 95% of the original variance we need the 6 PC time series and the 6 x 24 eigenvector components: 6 x 1531 + 6 x 24 = 9330 points, i.e., about 25% of the original points
Stock market example
Rotations of PCA results
• PCA typically gives a few components which explain most of the variation and have a large effect on a large fraction of variables
• Because of this PCA results can be difficult to interpret in physical context
• For example, the temperature modes have a significant response all over the northern hemisphere
• Rotation of PCA results can often help...
Problem with PCA interpretation
[Maps of EOF 1, EOF 2 and EOF 3]
• With SVD the data matrix is 𝑿 = 𝑼𝑺𝑽𝑇
• This decomposition can be viewed in two ways:
1. The PCA interpretation:
– Principal component scores: 𝑷 = 𝑼𝑺 (incorporates the variance)
– Eigenvectors 𝑽 (orthogonal) ➔ 𝑿 = 𝑷𝑽ᵀ
2. The Factor Analysis (FA) interpretation:
– Standardized principal component scores: 𝑼
– Factor loadings: 𝑳 = 𝑽𝑺 (non-orthogonal and incorporates the variance) ➔ 𝑿 = 𝑼𝑳ᵀ
Two ways to view PCA
• Assume an orthogonal (rotation) matrix 𝑹 so that 𝑹𝑹ᵀ = 𝑹ᵀ𝑹 = 𝑰
• Data matrix can be written as
𝑿 = 𝑼𝑺𝑽𝑇 = 𝑼𝑺𝑹𝑹𝑇𝑽𝑇 = 𝑷𝑟𝑜𝑡𝑽𝑟𝑜𝑡𝑇
• From this form we have
– Rotated principal components: 𝑷𝑟𝑜𝑡 = 𝑼𝑺𝑹 = 𝑷𝑹
– Rotated eigenvectors: 𝑽𝑟𝑜𝑡 = 𝑽𝑹
• Rotation this way:
– Preserves orthogonality of eigenvectors, but
– Rotated principal components are not uncorrelated anymore!
– Variance is redistributed among rotated components!
Rotation: PCA interpretation
• The data matrix can also be written as
𝑿 = 𝑼𝑺𝑽ᵀ = 𝑼𝑹𝑹ᵀ𝑺𝑽ᵀ = 𝑼𝑟𝑜𝑡𝑳𝑟𝑜𝑡ᵀ
• From this form we have
– Rotated standardized principal components: 𝑼𝑟𝑜𝑡 = 𝑼𝑹
– Rotated factor loadings: 𝑳𝑟𝑜𝑡 = 𝑽𝑺𝑹 = 𝑳𝑹
• Rotation this way (more common):
– Rotated factor loadings and eigenvectors are non-orthogonal
– Rotated standardized scores remain uncorrelated
– Variance is redistributed among the rotated components!
Rotation: FA interpretation
• There are several ways a rotation matrix can be chosen
• Most common is the VARIMAX rotation method
• R_{VARIMAX} = \arg\max_{R}\, \sum_{j=1}^{p}\left[\frac{1}{n}\sum_{i=1}^{n}(\mathbf{L}\mathbf{R})_{ij}^{4}-\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{L}\mathbf{R})_{ij}^{2}\right)^{2}\right]
• This criterion maximizes the variance of the squared factor loadings, summed over all factors
VARIMAX rotation
• ➔ It produces a solution where each factor only affects a few variables ➔ a simple solution
• BUT: There is no guarantee that any rotated (or un-rotated) solution correctly represents any physical mode affecting the system!
• Typically only the most important principal components, which contain most of the variance, are included in the rotation
VARIMAX rotation
• Factor loading matrix from the surface temperature example
VARIMAX rotation
[Figures: unrotated factor loadings and VARIMAX-rotated factor loadings]
• Other rotation methods
– QUARTIMAX: minimizes the number of factors needed to explain each variable (maximizes the variance of factor loadings within each row)
– EQUIMAX: Compromise between VARIMAX and QUARTIMAX
– Orthomax, Promax, Procrustes... Etc.
• A more interesting method, which is effectively also a rotation method, is Independent Component Analysis (ICA)
• ICA is perhaps the best method to look for physically independent signals in the data.
Other rotations
• In MATLAB one can do the rotation with the function rotatefactors() (part of the Statistics Toolbox)
• [Lrot,R]=rotatefactors(L)
or, including only a subset (the first 𝑘) of PCs:
[Lrot,R]=rotatefactors(L(:,1:k))
• Different rotation options are available (see MATLAB help), e.g., [Lrot,R]=rotatefactors(L,'Method','quartimax')
• One can rotate either factor loadings or eigenvectors
How to rotate in MATLAB
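• Putting the pieces together, an end-to-end sketch (a sketch only; k = 3 and the variable names are illustrative assumptions, and a data matrix X is assumed to be in the workspace):
X = X - repmat(mean(X), size(X,1), 1);     % center the data
[U, S, V] = svd(X, 'econ');
L = V*S;                                   % factor loadings (FA interpretation)
k = 3;                                     % number of components to rotate (illustrative)
[Lrot, R] = rotatefactors(L(:,1:k));       % VARIMAX rotation by default
Urot = U(:,1:k)*R;                         % rotated standardized scores
Xk   = Urot*Lrot';                         % rank-k data reconstructed from the rotated factors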