Biology
Chemistry
Informatics
Principal Component Analysis (PCA) of metabolomic sample processing methods
Prin
cipa
l Com
pone
nts
Anal
ysis
Goal:Use PCA to identify the major modes of variance
(Used DATA: Pumpkin data 1.csv)
Topics: 1. Principal component number selection2. Data pretreatment3. PCA results visualization
Biology
Chemistry
Informatics
Principal Components Analysis Pr
inci
pal C
ompo
nent
s An
alys
is
Steps1. Calculate a PCA model2. Select optimal model principal component (PC) 3. Overview PCA scores and loadings plots4. Repeat steps 1-2 using data centering and scaling
Visualize:5. Sample scores annotated by extraction and treatment6. Leverage and DmodX (distance from model plane)7. Variable loadings and biplots
Exercise:8. How many PCs are needed to capture 80% variance for raw data and
scaled data?9. Are their any moderate or extreme outliers?10.What variables contribute most to the variance for raw and scaled data?
Biology
Chemistry
Informatics
PCA Variance Explained(raw data)
Pri
nci
pal
Co
mp
on
ents
An
alys
is
• PCs can be selected to explain a minimum %variance in the data (~80%)
• PCs explaining below 1% variance can be excluded
• q2 is the cross-validated PCA prediction of left out data
Biology
Chemistry
Informatics
PCA Scores (raw data)P
rin
cip
al C
om
po
nen
ts A
nal
ysis
• Hotelling's T2 ellipse shows 95% CI for bivariate normal distribution
• Samples lying outside of the ellipse could be outliers
Biology
Chemistry
Informatics
PCA Loadings (raw data, centered)P
rin
cip
al C
om
po
nen
ts A
nal
ysis
• Unscaled data PCA loadings are highly correlated with magnitude
Biology
Chemistry
Informatics
PCA Biplot (raw data)P
rin
cip
al C
om
po
nen
ts A
nal
ysis
• Biplots can be used to rapidly overview the correlation between sample scores and variable loadings
Sample contains high maleic acid
Sample contains high sucrose (low maleic acid)
Biology
Chemistry
Informatics
PCA Leverage and DmodX(raw data)
Pri
nci
pal
Co
mp
on
ents
An
alys
is
Leverage is the distance to samples center in the PCA plane (extreme outliers)
Distance to model X (DmodX) is the orthogonal distance to the PCA plane (moderate outliers)
Biology
Chemistry
Informatics
PCA Variance Explained(autoscaled)
Pri
nci
pal
Co
mp
on
ents
An
alys
is
Biology
Chemistry
Informatics
PCA Scores (autoscaled)P
rin
cip
al C
om
po
nen
ts A
nal
ysis
• Loadings on PC1 describe differences due to extraction
• Loadings on PC2 describe differences due to treatment
Biology
Chemistry
Informatics
PCA Leverage and DmodX(autoscaled)
Pri
nci
pal
Co
mp
on
ents
An
alys
is
• Samples with both high leverage and DomdX are likely outliers
• Evaluate PCA results after their removal
Biology
Chemistry
Informatics
PCA Loadings (autoscaled)P
rin
cip
al C
om
po
nen
ts A
nal
ysis
• Scaled loadings are independent of variable magnitude and show a rich variance structure of the data
Biology
Chemistry
Informatics
Pri
nci
pal
Co
mp
on
ents
An
alys
isRelationship between scores and loadings (autoscaled)
Higher in 100% MeOH
ExtractionLower in 100% MeOH
Biology
Chemistry
Informatics
Loadings and ScoresP
rin
cip
al C
om
po
nen
ts A
nal
ysis
Highest positive loading on PC1
Highest negative loading on PC1