1
Implementation of ANOVA-PCA in R for Multivariate Data Exploration Matthew J. Keinsley & Bryan A. Hanson Dept. of Chemistry & Biochemistry DePauw University, Greencastle Indiana USA Analysis of Variance Analysis of Variance (ANOVA) is a significance test which considers whether or not two samples come from the same population. In the diagram below, each curve represents a series of measurements on two samples, with each described by a mean and standard deviation. A null hypothesis is defined which typically states that the two samples are from the same population. ANOVA can be thought of as measuring how much the two samples can drift apart and still come from the same population (i.e. the null hypothesis is true). The key values from an ANOVA calculation are the test statistic and the p-value. The p-value represents the probability of achieving a test statistic more extreme than the one observed if the two samples are from the same pop- ulation. The smaller the p-value, the more confident one is about rejecting the null hypothesis at a se- lected confidence level, usually 95%. Internally, the calculation compares the between sample variation to the within sample variation to generate the test statistic. This is referred to as partioning the vari- ance.0 5 10 15 0.0 0.1 0.2 0.3 0.4 sample values Principal Component Analysis Principal Component Analysis (PCA) is a common multivariate data analysis method which eliminates uninformative variables (noise) and re-expresses the data in terms of abstract principal components.The diagram below illustrates the matrix algebra used in the computation. The PCA results are scores and loadings. Scores represent the samples in the new data space, while loadings are the weights which each variable should be multiplied to obtain the score. While score plots are usually color-coded by group membership, PCA is blind to group membership. 1 ... j ... m n ... i ... 1 objects or specimens measured variables n ... i ... 1 T 1 ... k ... a m ... j ... 1 P 1 ... k ... a T = X • P or XP PC1 scores for each object ... PC2 scores for each object ... etc loadings for PC1 ... loadings for PC2 ... etc loadings scores original data X Data Reduction method of PCA P T ANOVA-PCA ANOVA-PCA is a combination of both methods de- veloped by Harrington. The data is partitioned into submatrices (shown below) corresponding to each experimental factor in a manner reminiscent of ANOVA. The submatrices are then separately sub- jected to PCA after adding back the residual error. If the effect of a factor is large compared to the residual error, separation along the 1st PC in the score plot should be evident. With this method, the significance of a factor can be visually determined. ANOVA-PCA is not blind to group membership. Raw Data = Grand Mean Means for each level of factor 1 Means for each level of factor 2 Means for each level of factor 1 x factor 2 Residual Error + + + + ANOVA-PCA con’t The method to create the submatrices is shown be- low. Data for all samples belonging to each level of a particular factor are replaced by their group averages. ############################# ############################# ############################# ############################# ############################# ############################# ############################# ############################# ############################# ############################# ############################# ############################# original data factor matrix group membership Computational Strategy We implemented the ANOVA-PCA concept using a series of R functions as shown below. The functions were integrated into the package ChemoSpec, which uses a Spectra object to store the spectral data and associated information. aovPCA(spectra, facs, ...) avgFacLvls(spectra, facs) aovPCAscores(spectra, LM, plot, type, ...) Spectra object factors to be analyzed aovPCAloadings(spectra, LM, pca, plot, ...) Simulated Data For testing purposes, a simulated data set was cre- ated by generating data which simulate UV-Vis spec- tra or chromatograms. Four different sample types were created in such a way as to represent group membership and the effects of the factors. The lev- els of the first factor are A and B; the second factor has levels X and Y. The entire data set contains 5 of each sample type, and noise was added to the spectra in order to make them more realistic. The prototype samples are shown below; type BY represents a inter- action factor as a peak is missing entirely compared to the other sample types. 0 5 10 15 0.0 0.2 0.4 Sample Type AX Wavelength or Time Peak intensity 0 5 10 15 0.0 1.0 2.0 Sample Type AY Wavelength or Time Peak intensity 0 5 10 15 0.0 1.0 2.0 Sample Type BX Wavelength or Time Peak intensity 0 5 10 15 0.0 1.0 2.0 Sample Type BY Wavelength or Time Peak intensity Standard PCA The results of a traditional PCA analysis on these simulated data are shown below. As expected, this data set does not present a challenge and each group is clustered away from the others. -10 -5 0 5 10 -5 0 5 10 Original Data: PCA Score Plot robust ellipses by group PC1 score (30%) PC2 score (8.6%) centered/noscale/classical Key AX AY BX BY aovPCA Score Plots The following figures show the results of aovPCA on the simulated data. Factor 1 has levels A and B. Because the 1st factor is significant, there is separation between levels along PC1. -5 0 5 10 -10 -5 0 5 10 15 FactorAB: PCA Score Plot robust ellipses by group PC1 score (32%) PC2 score (5.5%) centered/noscale/classical Key AX AY BX BY Factor 2 has levels X and Y. As before, separation is observed along PC1, but it is not as good as for Fac- tor 1. Note the groups here are reversed compared to the 1st factor. -6 -4 -2 0 2 4 6 -10 0 10 20 FactorXY: PCA Score Plot robust ellipses by group PC1 score (9.1%) PC2 score (7.4%) centered/noscale/classical Key AX AY BX BY The following figure shows the interaction between Factors 1 and 2. In this case, AX and BY are grouped separately from AY and BX. This sep- aration therefore shows that the interaction be- tween the two factors is significant as expected. -6 -4 -2 0 2 4 6 -10 0 10 20 FactorAB x FactorXY: PCA Score Plot robust ellipses by group PC1 score (15%) PC2 score (6.9%) centered/noscale/classical Key AX AY BX BY Finally, aovPCA on the residual error is shown. The residual error is the unexplained variance, so there should be no separation on PC1 nor any clustering. This is exactly what is observed. -20 -10 0 10 20 30 -40 -20 0 20 40 Res.Error: PCA Score Plot robust ellipses by group PC1 score (8.1%) PC2 score (7.5%) centered/noscale/classical Key AX AY BX BY aovPCA Loadings Plots The loading for Factor AB is shown below. Peaks on the PC1 loadings show which wavelengths contribute the most to the separation in the data. The reference spectrum clearly shows the noise that was added to the data. FactorAB: Loadings Plot centered/noscale/classical -0.5 0.0 0.5 1.0 0 5 10 15 Reference Spectrum -0.05 0.00 0.05 0.10 PC 1 Loadings aovPCA of Real Data The data in this example is composed of infrared (IR) spectra of the leaf surface of the common weedpurslane (Portulaca oleracea). Varieties collected in South Carolina and Wisconsin were grown in low and high water conditions as a part of a larger study on climate change. The aovPCA score plot with Treat- ment as the factor is shown below; aovPCA is not able to separate the groups based upon this factor. -0.02 -0.01 0.00 0.01 0.02 0.03 -0.010 -0.005 0.000 0.005 0.010 Treatment: PCA Score Plot robust ellipses by group PC1 score (77%) PC2 score (13%) centered/noscale/classical Key SC.H SC.L WI.H WI.L However, if the first 3 PCs from standard PCA are plotted, we can see separation be- tween the treatments. With this difficult data set standard PCA outperforms aovPCA. Cuticle IR Spectra: PCA Score Plot PC1 (83%) PC2 (14%) PC3 (1.1%) SC.H SC.L WI.H WI.L EB water trial week 3 IR Spectra Summer 2010 References Pinto, Bosc, Nocairi, Barros, and Rutledge. Using ANOVA-PCA for Discriminant Analysis. . . Ana- lytica Chimica Acta 629.1-2 (2008): 47-55. Harrington, Vieira, Espinoza, Nien, Romero, and Yergey. Analysis of Variance–Principal Compo- nent Analysis. . . Analytica Chimica Acta 544.1-2 (2005): 118-27. Software: R packages ChemoSpec and HandyStuff github.com/bryanhanson Acknowledgements Support was provided by the Science Research Fel- lows program and Chemistry Department research funds. Summer 2011

Implementation of ANOVA-PCA in R FactorAB: PCA Score Plot ...academic.depauw.edu/~hanson/MiscPages/aovPCAuseR2012Poster… · original data scores X Data Reduction method of PCA P

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Implementation of ANOVA-PCA in R FactorAB: PCA Score Plot ...academic.depauw.edu/~hanson/MiscPages/aovPCAuseR2012Poster… · original data scores X Data Reduction method of PCA P

Implementation of ANOVA-PCA in R forMultivariate Data Exploration

Matthew J. Keinsley & Bryan A. Hanson

Dept. of Chemistry & BiochemistryDePauw University, Greencastle Indiana USA

Analysis of Variance

Analysis of Variance (ANOVA) is a significance testwhich considers whether or not two samples comefrom the same population. In the diagram below,each curve represents a series of measurements ontwo samples, with each described by a mean andstandard deviation. A null hypothesis is definedwhich typically states that the two samples are fromthe same population. ANOVA can be thought ofas measuring how much the two samples can driftapart and still come from the same population (i.e.the null hypothesis is true). The key values froman ANOVA calculation are the test statistic and thep-value. The p-value represents the probability ofachieving a test statistic more extreme than the oneobserved if the two samples are from the same pop-ulation. The smaller the p-value, the more confidentone is about rejecting the null hypothesis at a se-lected confidence level, usually 95%. Internally, thecalculation compares the between sample variationto the within sample variation to generate the teststatistic. This is referred to as ”partioning the vari-ance.”

0 5 10 15

0.0

0.1

0.2

0.3

0.4

sample values

Principal Component Analysis

Principal Component Analysis (PCA) is a commonmultivariate data analysis method which eliminatesuninformative variables (noise) and re-expresses thedata in terms of abstract ”principal components.”Thediagram below illustrates the matrix algebra used inthe computation. The PCA results are scores andloadings. Scores represent the samples in the newdata space, while loadings are the weights which eachvariable should be multiplied to obtain the score.While score plots are usually color-coded by groupmembership, PCA is blind to group membership.

1 ... j ... m

n

.

..

i

...

1

obje

cts

orsp

ecim

ens

measured variables

n

.

..

i

...

1

T

1 ... k ... a

m

...

j

...

1

P

1 ... k ... a

T = X • P or XP

PC1

scor

es fo

r eac

h ob

ject

...

PC2

scor

es fo

r eac

h ob

ject

...

etc

load

ings

for P

C1 ..

.lo

adin

gs fo

r PC2

...

etc

loadings

scoresoriginal data

X

Data Reduction method of PCA

P

T

ANOVA-PCA

ANOVA-PCA is a combination of both methods de-veloped by Harrington. The data is partitionedinto submatrices (shown below) corresponding toeach experimental factor in a manner reminiscent ofANOVA. The submatrices are then separately sub-jected to PCA after adding back the residual error. Ifthe effect of a factor is large compared to the residualerror, separation along the 1st PC in the score plotshould be evident. With this method, the significanceof a factor can be visually determined. ANOVA-PCAis not blind to group membership.

RawData = Grand

MeanMeans foreach levelof factor 1

Means foreach levelof factor 2

Means foreach level

offactor 1 x factor 2

ResidualError+

+

+

+

ANOVA-PCA con’t

The method to create the submatrices is shown be-low. Data for all samples belonging to each level of aparticular factor are replaced by their group averages.

##########################################################

#############################

#############################

#############################

#############################

##########################################################

#############################

#############################

#############################

#############################

original data factor matrix

group membership

rows replaced byaverage of allsamples of a givenfactor level

##########################################################

#############################

#############################

#############################

#############################

##########################################################

#############################

#############################

#############################

#############################

original data factor matrix

group membership

Computational Strategy

We implemented the ANOVA-PCA concept using aseries of R functions as shown below. The functionswere integrated into the package ChemoSpec, whichuses a Spectra object to store the spectral data andassociated information.

aovPCA(spectra, facs, ...)

avgFacLvls(spectra, facs)aovPCAscores(spectra, LM, plot, type, ...)

Spectra object factors to beanalyzed

aovPCAloadings(spectra, LM, pca, plot, ...)

Simulated Data

For testing purposes, a simulated data set was cre-ated by generating data which simulate UV-Vis spec-tra or chromatograms. Four different sample typeswere created in such a way as to represent groupmembership and the effects of the factors. The lev-els of the first factor are A and B; the second factorhas levels X and Y. The entire data set contains 5 ofeach sample type, and noise was added to the spectrain order to make them more realistic. The prototypesamples are shown below; type BY represents a inter-action factor as a peak is missing entirely comparedto the other sample types.

0 5 10 15

0.0

0.2

0.4

Sample Type AX

Wavelength or Time

Pea

k in

tens

ity

0 5 10 15

0.0

1.0

2.0

Sample Type AY

Wavelength or Time

Pea

k in

tens

ity

0 5 10 15

0.0

1.0

2.0

Sample Type BX

Wavelength or Time

Pea

k in

tens

ity

0 5 10 15

0.0

1.0

2.0

Sample Type BY

Wavelength or Time

Pea

k in

tens

ity

Standard PCA

The results of a traditional PCA analysis on thesesimulated data are shown below. As expected,this data set does not present a challenge andeach group is clustered away from the others.

−10 −5 0 5 10

−5

05

10

Original Data: PCA Score Plot

robust ellipses by group

PC1 score (30%)

PC

2 sc

ore

(8.6

%)

centered/noscale/classical KeyAXAYBXBY

aovPCA Score Plots

The following figures show the results of aovPCA

on the simulated data. Factor 1 has levelsA and B. Because the 1st factor is significant,there is separation between levels along PC1.

● ●

−5 0 5 10

−10

−5

05

1015

FactorAB: PCA Score Plot

robust ellipses by group

PC1 score (32%)

PC

2 sc

ore

(5.5

%)

centered/noscale/classical KeyAXAYBXBY

Factor 2 has levels X and Y. As before, separation isobserved along PC1, but it is not as good as for Fac-tor 1. Note the groups here are reversed comparedto the 1st factor.

●●

−6 −4 −2 0 2 4 6

−10

010

20

FactorXY: PCA Score Plot

robust ellipses by group

PC1 score (9.1%)

PC

2 sc

ore

(7.4

%)

centered/noscale/classical KeyAXAYBXBY

The following figure shows the interaction betweenFactors 1 and 2. In this case, AX and BY aregrouped separately from AY and BX. This sep-aration therefore shows that the interaction be-tween the two factors is significant as expected.

●●

−6 −4 −2 0 2 4 6

−10

010

20

FactorAB x FactorXY: PCA Score Plot

robust ellipses by group

PC1 score (15%)

PC

2 sc

ore

(6.9

%)

centered/noscale/classical KeyAXAYBXBY

Finally, aovPCA on the residual error is shown. Theresidual error is the unexplained variance, so thereshould be no separation on PC1 nor any clustering.This is exactly what is observed.

●●

●●● ●

●●●

●●●

−20 −10 0 10 20 30

−40

−20

020

40

Res.Error: PCA Score Plot

robust ellipses by group

PC1 score (8.1%)

PC

2 sc

ore

(7.5

%)

centered/noscale/classical KeyAXAYBXBY

aovPCA Loadings Plots

The loading for Factor AB is shown below. Peaks onthe PC1 loadings show which wavelengths contributethe most to the separation in the data. The referencespectrum clearly shows the noise that was added tothe data.

FactorAB: Loadings Plot

centered/noscale/classical

−0.

50.

00.

51.

0

0 5 10 15

Ref

eren

ce S

pect

rum

−0.

050.

000.

050.

10

PC

1 L

oadi

ngs

aovPCA of Real Data

The data in this example is composed of infrared (IR)spectra of the leaf surface of the common ”weed”purslane (Portulaca oleracea). Varieties collected inSouth Carolina and Wisconsin were grown in low andhigh water conditions as a part of a larger study onclimate change. The aovPCA score plot with Treat-ment as the factor is shown below; aovPCA is notable to separate the groups based upon this factor.

●●

●●

●●

●●

● ●●

●●

●●

●●●

●●

● ●

●●

−0.02 −0.01 0.00 0.01 0.02 0.03

−0.

010

−0.

005

0.00

00.

005

0.01

0

Treatment: PCA Score Plot

robust ellipses by group

PC1 score (77%)

PC

2 sc

ore

(13%

)

centered/noscale/classical KeySC.HSC.LWI.HWI.L

However, if the first 3 PCs from standardPCA are plotted, we can see separation be-tween the treatments. With this difficultdata set standard PCA outperforms aovPCA.

Cuticle IR Spectra: PCA Score Plot

●●

● ●●

●●

●●

●●

●●

●●

● ●

● ●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

● ● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ●

● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

PC1 (83%)

PC2 (14%)

PC3 (1.1%)

SC.H SC.L WI.H WI.L

EB water trial week 3 IR Spectra Summer 2010

References

Pinto, Bosc, Nocairi, Barros, and Rutledge. ”UsingANOVA-PCA for Discriminant Analysis. . . ” Ana-lytica Chimica Acta 629.1-2 (2008): 47-55.

Harrington, Vieira, Espinoza, Nien, Romero, andYergey. ”Analysis of Variance–Principal Compo-nent Analysis. . . ”Analytica Chimica Acta 544.1-2(2005): 118-27.

Software: R packages ChemoSpec and HandyStuff

github.com/bryanhanson

Acknowledgements

Support was provided by the Science Research Fel-lows program and Chemistry Department researchfunds.

Summer 2011