26
ropls : PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data Etienne A. Thevenot 30 April 2018 Package ropls 1.12.0 Contents 1 The ropls package . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Context. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.1 Orthogonal Partial Least-Squares . . . . . . . . . . . . . . . . . 3 2.2 OPLS software . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3 The sacurine metabolomics dataset . . . . . . . . . . . . . . . . 4 3.1 Study objective . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Pre-processing and annotation . . . . . . . . . . . . . . . . . . 4 3.3 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4 Hands-on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4.1 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 4.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . 5 4.3 Partial least-squares: PLS and PLS-DA . . . . . . . . . . . . . . 8 4.4 Orthogonal partial least squares: OPLS and OPLS-DA . . . . . . 10 4.5 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.6 Working on ExpressionSet omics objects from bioconductor . . . . 16 4.7 Importing/exporting data from/to the Workflow4metabolomics infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 5 Pre-processing and annotation of mass spectrometry data . . . 18 6 Other datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 7 Session info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) formultivariate analysis and feature selectionof omics data

Etienne A. Thevenot

30 April 2018

Package

ropls 1.12.0

Contents

1 The ropls package . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Context. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 Orthogonal Partial Least-Squares . . . . . . . . . . . . . . . . . 3

2.2 OPLS software . . . . . . . . . . . . . . . . . . . . . . . . . . 3

3 The sacurine metabolomics dataset . . . . . . . . . . . . . . . . 4

3.1 Study objective . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3.2 Pre-processing and annotation . . . . . . . . . . . . . . . . . . 4

3.3 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4 Hands-on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.1 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

4.2 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . 5

4.3 Partial least-squares: PLS and PLS-DA . . . . . . . . . . . . . . 8

4.4 Orthogonal partial least squares: OPLS and OPLS-DA . . . . . . 10

4.5 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.6 Working on ExpressionSet omics objects from bioconductor . . . . 16

4.7 Importing/exporting data from/to the Workflow4metabolomicsinfrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Pre-processing and annotation of mass spectrometry data . . . 18

6 Other datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

7 Session info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Page 2: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2

Page 3: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

1 The ropls package

The ropls R package implements the PCA, PLS(-DA) and OPLS(-DA) approaches withthe original, NIPALS-based, versions of the algorithms (Wold, Sjostrom, and Eriksson 2001,Trygg and Wold (2002)). It includes the R2 and Q2 quality metrics (Eriksson et al. 2001,Tenenhaus (1998)), the permutation diagnostics (Szymanska et al. 2012), the computationof the VIP values (Wold, Sjostrom, and Eriksson 2001), the score and orthogonal distances todetect outliers (Hubert, Rousseeuw, and Vanden Branden 2005), as well as many graphics(scores, loadings, predictions, diagnostics, outliers, etc).The functionalities from ropls can also be accessed via a graphical user interface in theMultivariate module from the Workflow4Metabolomics.org online resource for computa-tional metabolomics, which provides a user-friendly, Galaxy-based environment for datapre-processing, statistical analysis, and annotation (Giacomoni et al. 2015).

2 Context

2.1 Orthogonal Partial Least-Squares

Partial Least-Squares (PLS), which is a latent variable regression method based on covari-ance between the predictors and the response, has been shown to efficiently handle datasetswith multi-collinear predictors, as in the case of spectrometry measurements (Wold, Sjostrom,and Eriksson 2001). More recently, Trygg and Wold (2002) introduced the Orthogonal Par-tial Least-Squares (OPLS) algorithm to model separately the variations of the predictorscorrelated and orthogonal to the response.OPLS has a similar predictive capacity compared to PLS and improves the interpretation ofthe predictive components and of the systematic variation (Pinto, Trygg, and Gottfries 2012).In particular, OPLS modeling of single responses only requires one predictive component.Diagnostics such as the Q2Y metrics and permutation testing are of high importance to avoidoverfitting and assess the statistical significance of the model. The Variable Importancein Projection (VIP), which reflects both the loading weights for each component and thevariability of the response explained by this component (Pinto, Trygg, and Gottfries 2012;Mehmood et al. 2012), can be used for feature selection (Trygg and Wold 2002; Pinto, Trygg,and Gottfries 2012).

2.2 OPLS software

OPLS is available in the SIMCA-P commercial software (Umetrics, Umea, Sweden; Erikssonet al. (2001)). In addition, the kernel-based version of OPLS (Bylesjo et al. 2008) is availablein the open-source R statistical environment (R Development Core Team 2008), and a singleimplementation of the linear algorithm in R has been described recently (Gaude et al. 2013).

3

Page 4: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

3 The sacurine metabolomics dataset

3.1 Study objective

The objective was to study the influence of age, body mass index (bmi), and gender onmetabolite concentrations in urine, by analysing 183 samples from a cohort of adults withliquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS; Thevenotet al. (2015)).

3.2 Pre-processing and annotation

Urine samples were analyzed by using an LTQ Orbitrap in the negative ionization mode. Atotal of 109 metabolites were identified or annotated at the MSI level 1 or 2. After retentiontime alignment with XCMS, peaks were integrated with Quan Browser. Signal drift and batcheffect were corrected, and each urine profile was normalized to the osmolality of the sample.Finally, the data were log10 transformed (Thevenot et al. 2015).

3.3 Covariates

The volunteers’ age, body mass index (bmi), and gender were recorded.

4 Hands-on

4.1 Loading

We first load the ropls package:library(ropls)

We then load the sacurine dataset which contains:1. The dataMatrix matrix of numeric type containing the intensity profiles (log10 trans-

formed),2. The sampleMetadata data frame containg sample metadata,3. The variableMetadata data frame containg variable metadata

data(sacurine)

names(sacurine)

## [1] "dataMatrix" "sampleMetadata" "variableMetadata"

We attach sacurine to the search path and display a summary of the content of the dataMa-trix, sampleMetadata and variableMetadata with the strF Function of the ropls package(see also str):

4

Page 5: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

attach(sacurine)

strF(dataMatrix)

## dim class mode typeof size NAs min mean median max

## 183 x 109 matrix numeric double 0.2 Mb 0 -0.3 4.2 4.3 6

## (2-methoxyethoxy)propanoic acid isomer (gamma)Glu-Leu/Ile ...

## HU_011 3.019766011 3.888479324 ...

## HU_014 3.81433889 4.277148905 ...

## ... ... ... ...

## HU_208 3.748127215 4.523763202 ...

## HU_209 4.208859398 4.675880567 ...

## Valerylglycine isomer 2 Xanthosine

## HU_011 3.889078716 4.075879575

## HU_014 4.181765852 4.195761901

## ... ... ...

## HU_208 4.634338821 4.487781609

## HU_209 4.47194762 4.222953354

strF(sampleMetadata)

## age bmi gender

## numeric numeric factor

## nRow nCol size NAs

## 183 3 0 Mb 0

## age bmi gender

## HU_011 29 19.75 M

## HU_014 59 22.64 F

## ... ... ... ...

## HU_208 27 18.61 F

## HU_209 17.5 21.48 F

strF(variableMetadata)

## msiLevel hmdb chemicalClass

## numeric character character

## nRow nCol size NAs

## 109 3 0 Mb 0

## msiLevel hmdb chemicalClass

## (2-methoxyethoxy)propanoic acid isomer 2 Organi

## (gamma)Glu-Leu/Ile 2 AA-pep

## ... ... ... ...

## Valerylglycine isomer 2 2 AA-pep:AcyGly

## Xanthosine 1 HMDB00299 Nucleo

4.2 Principal Component Analysis (PCA)

We perform a PCA on the dataMatrix matrix (samples as rows, variables as columns), withthe opls method:sacurine.pca <- opls(dataMatrix)

A summary of the model (8 components were selected) is printed:

5

Page 6: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

## PCA

## 183 samples x 109 variables

## standard scaling of predictors

## R2X(cum) pre ort

## Total 0.501 8 0

In addition the default summary figure is displayed:

p1 p3 p5 p7

Variance explained

PC

% o

f tot

al v

aria

nce

02

46

810

14

0 1 2 3 4

05

1015

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

)

HU_038

HU_044HU_047

HU_053HU_055

HU_084HU_155

HU_167

HU_171 HU_173

HU_026HU_031

HU_051

HU_102HU_121 HU_150

HU_160

HU_187

−15 −5 0 5 10 15

−10

−50

510

Scores (PCA)

t1 (15%)

t2 (

10%

)

HU_011 HU_014HU_015HU_017HU_018

HU_019HU_020

HU_021

HU_022

HU_023HU_024HU_025

HU_026HU_027

HU_028HU_029HU_030

HU_031

HU_032HU_033HU_034

HU_035

HU_036

HU_037

HU_038

HU_039

HU_040HU_041HU_042HU_043

HU_044

HU_045

HU_046HU_047

HU_048HU_049

HU_050HU_051

HU_052

HU_053

HU_054HU_055 HU_056HU_057

HU_058

HU_060HU_061

HU_062

HU_063

HU_064HU_065

HU_066

HU_067HU_068HU_069

HU_070HU_072

HU_073HU_074HU_075

HU_076

HU_077HU_078HU_079HU_080

HU_081

HU_082HU_083

HU_084

HU_085

HU_086HU_087

HU_088HU_089HU_090HU_091

HU_092

HU_093

HU_094HU_095

HU_097HU_098

HU_099

HU_100

HU_101

HU_102HU_103

HU_105HU_106

HU_107HU_108

HU_109HU_110HU_112

HU_113

HU_114HU_115

HU_116

HU_117

HU_118

HU_119

HU_120HU_121HU_122HU_123

HU_124HU_125HU_126HU_127

HU_129HU_130HU_131 HU_132

HU_133

HU_134

HU_135

HU_136

HU_137

HU_138

HU_139

HU_140

HU_142

HU_143

HU_144HU_145HU_146

HU_147

HU_148

HU_149

HU_150

HU_152

HU_154

HU_155

HU_156HU_157

HU_158HU_159

HU_160

HU_162

HU_163HU_164

HU_166

HU_167

HU_168

HU_169

HU_170

HU_171HU_172

HU_173

HU_174

HU_175HU_177

HU_179

HU_180 HU_181HU_182

HU_183

HU_184

HU_185

HU_186

HU_187

HU_188HU_189

HU_190

HU_191HU_192HU_193

HU_194 HU_195

HU_196

HU_197

HU_198

HU_199

HU_200

HU_201HU_202HU_203

HU_204

HU_205

HU_206HU_207

HU_208

HU_209

R2X0.501

0.00 0.05 0.10 0.15

−0.1

0.0

0.1

0.2

Loadings

p1 (15%)

p2 (

10%

) Salicylic acid

N−AcetylleucineChenodeoxycholic acid isomer

PyridylacetylglycineDimethylguanosine

4−Acetamidobutanoic acid isomer 2

FMNH2Testosterone glucuronide6−(carboxymethoxy)−hexanoic acid

Pyrocatechol sulfateFumaric acidPentose

Figure 1: PCA summary plot. Top left overview: the scree plot (i.e., inertia barplot)suggests that 3 components may be sufficient to capture most of the inertia; Top rightoutlier: this graphics shows the distances within and orthogonal to the projection plane(Hubert, Rousseeuw, and Vanden Branden 2005): the name of the samples with a highvalue for at least one of the distances are indicated; Bottom left x-score: the variancealong each axis equals the variance captured by each component: it therefore depends onthe total variance of the dataMatrix X and of the percentage of this variance captured bythe component (indicated in the labels); it decreases when going from one component to acomponent with higher indice; Bottom right x-loading: the 3 variables with most extremevalues (positive and negative) for each loading are black colored and labeled.Note:

6

Page 7: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

1. Since dataMatrix does not contain missing value, the singular value decompositionwas used by default; NIPALS can be selected with the algoC argument specifying thealgorithm (Character),

2. The predI = NA default number of pred ictive components (Integer) for PCA meansthat components (up to 10) will be computed until the cumulative variance exceeds50%. If the 50% have not been reached at the 10th component, a warning messagewill be issued (you can still compute the following components by specifying the predI

value).Let us see if we notice any partition according to gender, by labeling/coloring the samplesaccording to gender (parAsColFcVn) and drawing the Mahalanobis ellipses for the male andfemale subgroups (parEllipseL).genderFc <- sampleMetadata[, "gender"]

plot(sacurine.pca, typeVc = "x-score",

parAsColFcVn = genderFc, parEllipsesL = TRUE)

−15 −10 −5 0 5 10 15

−10

−50

510

Scores (PCA)

t1 (15%)

t2 (

10%

)

F

M

HU_011 HU_014HU_015HU_017

HU_018

HU_019HU_020

HU_021

HU_022

HU_023 HU_024

HU_025

HU_026

HU_027

HU_028

HU_029HU_030

HU_031

HU_032HU_033HU_034

HU_035

HU_036

HU_037

HU_038

HU_039

HU_040HU_041

HU_042HU_043

HU_044

HU_045

HU_046

HU_047

HU_048

HU_049

HU_050

HU_051

HU_052

HU_053

HU_054

HU_055HU_056

HU_057

HU_058

HU_060HU_061

HU_062

HU_063

HU_064HU_065

HU_066

HU_067 HU_068HU_069

HU_070HU_072

HU_073HU_074

HU_075

HU_076

HU_077HU_078HU_079

HU_080

HU_081

HU_082HU_083

HU_084

HU_085

HU_086

HU_087

HU_088HU_089

HU_090

HU_091

HU_092

HU_093

HU_094HU_095

HU_097

HU_098

HU_099

HU_100

HU_101

HU_102HU_103

HU_105HU_106

HU_107

HU_108

HU_109HU_110 HU_112

HU_113

HU_114HU_115

HU_116

HU_117

HU_118

HU_119

HU_120HU_121

HU_122HU_123

HU_124HU_125HU_126

HU_127

HU_129HU_130

HU_131HU_132

HU_133

HU_134

HU_135

HU_136

HU_137

HU_138

HU_139

HU_140

HU_142

HU_143

HU_144HU_145

HU_146

HU_147

HU_148

HU_149

HU_150

HU_152

HU_154

HU_155

HU_156

HU_157HU_158

HU_159

HU_160

HU_162

HU_163HU_164

HU_166

HU_167

HU_168

HU_169

HU_170

HU_171

HU_172

HU_173

HU_174

HU_175

HU_177

HU_179

HU_180HU_181

HU_182

HU_183

HU_184

HU_185

HU_186

HU_187

HU_188

HU_189

HU_190

HU_191HU_192

HU_193

HU_194HU_195

HU_196

HU_197

HU_198

HU_199

HU_200

HU_201HU_202HU_203

HU_204

HU_205

HU_206

HU_207

HU_208

HU_209

R2X0.501

Figure 2: PCA score plot colored according to gender.

Note:

7

Page 8: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

1. The plotting parameter to be used As Colors (Factor of character type or V ectorof numeric type) has a length equal to the number of rows of the dataMatrix (ie ofsamples) and that this qualitative or quantitative variable is converted into colors (byusing an internal palette or color scale, respectively). We could have visualized the ageof the individuals by specifying parAsColFcVn = sampleMetadata[, "age"].

2. The displayed components can be specified with parCompVi (plotting parameter speci-fying the Components: V ector of 2 integers)

4.3 Partial least-squares: PLS and PLS-DA

For PLS (and OPLS), the Y response(s) must be provided to the opls method. Y canbe either a numeric vector (respectively matrix) for single (respectively multiple) (O)PLSregression, or a character factor for (O)PLS-DA classification as in the following examplewith the gender qualitative response:sacurine.plsda <- opls(dataMatrix, genderFc)

## PLS-DA

## 183 samples x 109 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2

## Total 0.275 0.73 0.584 0.262 3 0 0.05 0.05

8

Page 9: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

0.5 0.6 0.7 0.8 0.9 1.0

−0.5

0.0

0.5

1.0

pR2Y = 0.05, pQ2 = 0.05

Similarity(y, yperm)

R2Y

Q2Y

p1 p2 p3

Model overview

0.0

0.2

0.4

0.6

0.0

0.2

0.4

0.6

0.0

0.2

0.4

0.6

R2Y

Q2Y

0 1 2 3 4

05

1015

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

)

HU_047

HU_051

HU_053

HU_055

HU_171HU_173

HU_187HU_026

HU_038

HU_102HU_121

HU_150HU_160

HU_167

−10 −5 0 5 10

−10

−50

510

Scores (PLS−DA)

t1 (10%)

t2 (

9%)

F

M

HU_011HU_014

HU_015HU_017

HU_018HU_019

HU_020

HU_021

HU_022HU_023HU_024HU_025

HU_026

HU_027HU_028

HU_029HU_030

HU_031

HU_032HU_033

HU_034

HU_035HU_036

HU_037HU_038

HU_039HU_040HU_041

HU_042HU_043

HU_044HU_045

HU_046

HU_047

HU_048HU_049

HU_050

HU_051

HU_052

HU_053

HU_054

HU_055

HU_056HU_057HU_058

HU_060HU_061

HU_062

HU_063

HU_064HU_065

HU_066HU_067

HU_068

HU_069HU_070HU_072

HU_073

HU_074

HU_075

HU_076

HU_077

HU_078

HU_079HU_080HU_081

HU_082

HU_083

HU_084

HU_085

HU_086

HU_087HU_088

HU_089

HU_090HU_091HU_092

HU_093HU_094

HU_095

HU_097HU_098HU_099

HU_100

HU_101

HU_102

HU_103HU_105HU_106HU_107HU_108HU_109

HU_110

HU_112

HU_113HU_114

HU_115HU_116

HU_117

HU_118

HU_119

HU_120

HU_121

HU_122

HU_123

HU_124HU_125

HU_126HU_127HU_129

HU_130HU_131HU_132HU_133HU_134HU_135

HU_136HU_137

HU_138HU_139

HU_140

HU_142 HU_143

HU_144HU_145HU_146HU_147

HU_148HU_149HU_150

HU_152HU_154HU_155HU_156

HU_157

HU_158

HU_159

HU_160

HU_162HU_163HU_164

HU_166

HU_167HU_168HU_169HU_170

HU_171

HU_172

HU_173

HU_174

HU_175

HU_177

HU_179HU_180

HU_181HU_182

HU_183

HU_184HU_185

HU_186HU_187

HU_188HU_189HU_190HU_191

HU_192

HU_193

HU_194HU_195

HU_196HU_197HU_198HU_199HU_200HU_201

HU_202HU_203

HU_204

HU_205HU_206HU_207

HU_208HU_209

R2X0.275

R2Y0.73

Q2Y0.584

RMSEE0.262

pre3

Figure 3: PLS-DA model of the gender response. Top left: significance diagnostic:the R2Y and Q2Y of the model are compared with the corresponding values obtained afterrandom permutation of the y response; Top right: inertia barplot: the graphic here suggeststhat 3 orthogonal components may be sufficient to capture most of the inertia; Bottomleft: outlier diagnostics; Bottom right: x-score plot: the number of components and thecumulative R2X, R2Y and Q2Y are indicated below the plot.Note:

1. When set to NA (as in the default), the number of components predI is determinedautomatically as follows (Eriksson et al. 2001): A new component h is added to themodel if:

• R2Yh ≥ 0.01, i.e., the percentage of Y dispersion (i.e., sum of squares) explained bycomponent h is more than 1 percent, and

• Q2Yh = 1− PRESSh/RSSh−1 ≥ 0 for PLS (or 5% when the number of samples isless than 100) or 1% for OPLS: Q2Yh ≥ 0 means that the predicted residual sum ofsquares (PRESSh) of the model including the new component h estimated by 7-foldcross-validation is less than the residual sum of squares (RSSh−1) of the model withthe previous components only (with RSS0 being the sum of squared Y values).

9

Page 10: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

2. The predictive performance of the full model is assessed by the cumulative Q2Ymetric: Q2Y = 1−

r∏h=1

(1−Q2Yh). We have Q2Y ∈ [0, 1], and the higher the Q2Y,the better the performance. Models trained on datasets with a larger number of featurescompared with the number of samples can be prone to overfitting: in that case, highQ2Y values are obtained by chance only. To estimate the significance of Q2Y (andR2Y) for single response models, permutation testing (Szymanska et al. 2012) canbe used: models are built after random permutation of the Y values, and Q2Yperm

are computed. The p-value is equal to the proportion of Q2Yperm above Q2Y (thedefault number of permutations is 20 as a compromise between quality control andcomputation speed; it can be increased with the permI parameter, e.g. to 1,000, toassess if the model is significant at the 0.05 level),

3. The NIPALS algorithm is used for PLS (and OPLS); dataMatrix matrices with (amoderate amount of) missing values can thus be analysed.

We see that our model with 3 predictive (pre) components has significant and quite high R2Yand Q2Y values.

4.4 Orthogonal partial least squares: OPLS and OPLS-DA

To perform OPLS(-DA), we set orthoI (number of components which are orthogonal;Integer) to either a specific number of orthogonal components, or to NA. Let us build anOPLS-DA model of the gender response.sacurine.oplsda <- opls(dataMatrix, genderFc,

predI = 1, orthoI = NA)

## OPLS-DA

## 183 samples x 109 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2

## Total 0.275 0.73 0.602 0.262 1 2 0.05 0.05

10

Page 11: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

0.4 0.5 0.6 0.7 0.8 0.9 1.0

−0.5

0.0

0.5

1.0

pR2Y = 0.05, pQ2 = 0.05

Similarity(y, yperm)

R2Y

Q2Y

p1 o1 o2

Model overview

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

R2Y

Q2Y

0 1 2 3 4

05

1015

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

)

HU_047

HU_051

HU_053

HU_055HU_120

HU_171

HU_173

HU_187HU_026

HU_038

HU_044HU_102HU_121HU_150

HU_155HU_160HU_167

−6 −4 −2 0 2 4 6

−10

−50

510

Scores (OPLS−DA)

t1 (5%)

to1

F

M

HU_011

HU_014HU_015

HU_017HU_018HU_019

HU_020

HU_021HU_022

HU_023

HU_024HU_025

HU_026

HU_027HU_028HU_029

HU_030HU_031 HU_032

HU_033HU_034 HU_035HU_036

HU_037

HU_038HU_039

HU_040HU_041

HU_042HU_043

HU_044HU_045HU_046

HU_047

HU_048HU_049

HU_050

HU_051

HU_052

HU_053

HU_054

HU_055

HU_056HU_057

HU_058

HU_060HU_061HU_062

HU_063

HU_064HU_065

HU_066HU_067

HU_068HU_069HU_070HU_072

HU_073HU_074

HU_075

HU_076

HU_077

HU_078

HU_079HU_080HU_081

HU_082

HU_083

HU_084HU_085

HU_086

HU_087HU_088

HU_089

HU_090HU_091HU_092

HU_093

HU_094 HU_095

HU_097

HU_098HU_099

HU_100

HU_101HU_102

HU_103

HU_105HU_106

HU_107

HU_108HU_109

HU_110

HU_112

HU_113

HU_114

HU_115

HU_116

HU_117HU_118

HU_119HU_120

HU_121HU_122

HU_123HU_124

HU_125

HU_126HU_127

HU_129

HU_130

HU_131

HU_132HU_133

HU_134HU_135

HU_136

HU_137

HU_138HU_139

HU_140

HU_142

HU_143HU_144HU_145HU_146HU_147

HU_148HU_149HU_150

HU_152

HU_154HU_155HU_156HU_157

HU_158

HU_159

HU_160

HU_162HU_163

HU_164

HU_166

HU_167HU_168

HU_169

HU_170

HU_171

HU_172

HU_173

HU_174

HU_175

HU_177

HU_179

HU_180

HU_181HU_182

HU_183

HU_184HU_185

HU_186 HU_187

HU_188HU_189HU_190

HU_191HU_192

HU_193

HU_194

HU_195

HU_196

HU_197

HU_198HU_199

HU_200HU_201

HU_202 HU_203

HU_204

HU_205HU_206

HU_207

HU_208

HU_209

R2X0.275

R2Y0.73

Q2Y0.602

RMSEE0.262

pre1

ort2

Figure 4: OPLS-DA model of the gender response.

Note:1. For OPLS modeling of a single response, the number of predictive component is 1,2. In the (x-score plot), the predictive component is displayed as abscissa and the (selected;

default = 1) orthogonal component as ordinate.Let us assess the predictive performance of our model. We first train the model on a subsetof the samples (here we use the odd subset value which splits the data set into two halveswith similar proportions of samples for each class; alternatively, we could have used a specificsubset of indices for training):sacurine.oplsda <- opls(dataMatrix, genderFc, predI = 1, orthoI = NA,

subset = "odd")

## OPLS-DA

## 92 samples x 109 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE RMSEP pre ort

## Total 0.26 0.825 0.608 0.213 0.341 1 2

We first check the predictions on the training subset:

11

Page 12: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

trainVi <- getSubsetVi(sacurine.oplsda)

table(genderFc[trainVi], fitted(sacurine.oplsda))

##

## M F

## M 50 0

## F 0 42

We then compute the performances on the test subset:table(genderFc[-trainVi],

predict(sacurine.oplsda, dataMatrix[-trainVi, ]))

##

## M F

## M 43 7

## F 7 34

As expected, the predictions on the test subset are (slightly) lower. The classifier however stillachieves 91% of correct predictions.

4.5 Comments

4.5.1 Overfitting

Overfitting (i.e., building a model with good performances on the training set but poorperformances on a new test set) is a major caveat of machine learning techniques applied todata sets with more variables than samples. A simple simulation of a random X data set anda y response shows that perfect PLS-DA classification can be achieved as soon as the numberof variables exceeds the number of samples, as detailed in the example below, adapted fromWehrens (2011):

12

Page 13: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

−2 −1 0 1 2

−3−2

−10

12

3

t1 (42%)

t2 (

58%

)

00.250.50.751

s1s2

s3

s4

s5s6 s7s8

s9s10

s11s12

s13s14

s15

s16s17

s18s19 s20

R2X1

R2Y0.094

Q2Y−0.818

RMSEE0.516

pre2

0.1

−2 0 2

−3−2

−10

12

3

t1 (12%)

t2 (

9%)

00.250.50.751

s1s2s3

s4

s5s6

s7

s8s9s10

s11

s12

s13s14s15s16

s17

s18s19s20

R2X0.207

R2Y0.594

Q2Y−2.79

RMSEE0.346

pre2

1

−5 0 5

−50

5

t1 (5%)

t2 (

5%)

00.25

0.50.75

1

s1

s2

s3s4

s5

s6

s7s8 s9s10

s11

s12

s13s14

s15

s16

s17

s18s19

s20

R2X0.098

R2Y0.994

Q2Y−0.732

RMSEE0.04

pre2

10

−0.4 0.0 0.4 0.8

−1.0

−0.5

0.0

0.5

1.0

pR2Y = 0.48, pQ2 = 1

Similarity(y, yperm)

R2Y

Q2Y

obs./feat. ratio:

Figure 5: Risk of PLS overfitting. In the simulation above, a random matrix X of 20observations x 200 features was generated by sampling from the uniform distribution U(0, 1).A random y response was obtained by sampling (without replacement) from a vector of 10zeros and 10 ones. Top left, top right, and bottom left: the X-score plots of the PLSmodeling of y by the (sub)matrix of X restricted to the first 2, 20, or 200 features, aredisplayed (i.e., the observation/feature ratios are 0.1, 1, and 10, respectively). Despite thegood separation obtained on the bottom left score plot, we see that the Q2Y estimationof predictive performance is low (negative); Bottom right: a significant proportion of themodels (in fact here all models) trained after random permutations of the labels have ahigher Q2Y value than the model trained with the true labels, confirming that PLS cannotspecifically model the y response with the X predictors, as expected.This simple simulation illustrates that PLS overfit can occur, in particular when the number offeatures exceeds the number of observations. It is therefore essential to check that theQ2Y value of the model is significant by random permutation of the labels.

4.5.2 VIP from OPLS models

The classical VIP metric is not useful for OPLS modeling of a single response since (Galindo-Prieto, Eriksson, and Trygg 2014, Thevenot et al. (2015)): 1. VIP values remain identicalwhatever the number of orthogonal components selected, 2. VIP values are univariate (i.e.,

13

Page 14: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

they do not provide information about interactions between variables). In fact, when featuresare standardized, we can demonstrate a mathematical relationship between VIP and p-valuesfrom a Pearson correlation test (Thevenot et al. 2015), as illustrated by the figure below:## OPLS

## 183 samples x 109 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2

## Total 0.212 0.476 0.31 7.53 1 1 0.05 0.05

0.2 0.4 0.6 0.8

0.5

1.0

1.5

2.0

p−value

VIP

Figure 6: Relationship between VIP from one-predictive PLS or OPLS models withstandardized variables, and p-values from Pearson correlation test. The (pj , V IPj)pairs corresponding respectively to the VIP values from OPLS modelling of the age responsewith the sacurine dataset, and the p-values from the Pearson correlation test are shown asred dots. The y = Φ−1(1− x/2)/zrms curve is shown in red (where Φ−1 is the inverse ofthe probability density function of the standard normal distribution, and zrms is the quadraticmean of the zj quantiles from the standard normal distribution; zrms = 2.6 for the sacurinedataset and the age response). The vertical (resp. horizontal) blue line corresponds tounivariate (resp. multivariate) thresholds of p = 0.05 and V IP = 1, respectively (Thevenotet al. 2015).The VIP properties above result from:

1. OPLS models of a single response have a single predictive component,

14

Page 15: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

2. in the case of one-predictive component (O)PLS models, the general formula for VIPscan be simplified to V IPj =

√m× |wj | for each feature j, were m is the total number

of features and w is the vector of loading weights,3. in OPLS, w remains identical whatever the number of extracted orthogonal components,4. for a single-response model, w is proportional to X’y (where ’ denotes the matrix

transposition),5. if X and y are standardized, X’y is the vector of the correlations between the features

and the response.Galindo-Prieto, Eriksson, and Trygg (2014) have recently suggested new VIP metrics forOPLS, VIP_pred and VIP_ortho, to separately measure the influence of the features inthe modeling of the dispersion correlated to, and orthogonal to the response, respectively(Galindo-Prieto, Eriksson, and Trygg 2014).For OPLS(-DA) models, you can therefore get from the model generated with opls:

1. the predictive VIP vector (which corresponds to the V IP4,pred metric measuring thevariable importance in prediction) with getVipVn(model),

2. the orthogonal VIP vector which is the V IP4,ortho metric measuring the variableimportance in orthogonal modeling with getVipVn(model, orthoL = TRUE). As for theclassical VIP, we still have the mean of V IP 2

pred (and of V IP 2ortho) which, each, equals

1.

4.5.3 (Orthogonal) Partial Least Squares Discriminant Analysis: (O)PLS-DA

4.5.3.1 Two classes

When the y response is a factor of 2 levels (character vectors are also allowed), it is internallytransformed into a vector of values ∈ {0, 1} encoding the classes. The vector is centered andunit-variance scaled, and the (O)PLS analysis is performed.Brereton and Lloyd (2014) have demonstrated that when the sizes of the 2 classes areunbalanced, a bias is introduced in the computation of the decision rule, which penalizesthe class with the highest size (Brereton and Lloyd 2014). In this case, an external procedureusing resampling (to balance the classes) and taking into account the class sizes should beused for optimal results.

4.5.3.2 Multiclass

In the case of more than 2 levels, the y response is internally transformed into a matrix (eachclass is encoded by one column of values ∈ {0, 1}). The matrix is centered and unit-variancescaled, and the PLS analysis is performed.In this so-called PLS2 implementation, the proportions of 0 and 1 in the columns is usuallyunbalanced (even in the case of balanced size of the classes) and the bias described previouslyoccurs (Brereton and Lloyd 2014). The multiclass PLS-DA results from ropls are thereforeindicative only, and we recommend to set an external procedure where each column of thematrix is modeled separately (as described above) and the resulting probabilities are aggregated(see for instance Bylesjo et al. (2006)).

15

Page 16: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

4.6 Working on ExpressionSet omics objects from bioconductor

The ExpressionSet class from the Biobase bioconductor package has been developed toconveniently handle preprocessed omics objects, including the variables x samples matrixof intensities, and data frames containing the sample and variable metadata (Huber et al.2015). The matrix and the two data frames can be accessed by the exprs, pData and fData

respectively (note that the data matrix is stored in the object with samples in columns).The opls method can be applied to an ExpressionSet object, by using the object as thex argument, and, for (O)PLS(-DA), by indicating as the y argument the name of thesampleMetadata to be used as the response.In the example below, we will first build a minimal ExpressionSet object from the sacurinedata set, and we subsequently perform an OPLS-DA.library(Biobase)

sacSet <- ExpressionSet(assayData = t(dataMatrix),

phenoData = new("AnnotatedDataFrame", data = sampleMetadata))

opls(sacSet, "gender", orthoI = NA)

## OPLS-DA

## 183 samples x 109 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2

## Total 0.275 0.73 0.602 0.262 1 2 0.05 0.05

16

Page 17: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

p1 o1 o2

Model overview

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

0.0

0.1

0.2

0.3

0.4

R2Y

Q2Y

0 1 2 3 4

05

1015

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

)

HU_047

HU_051

HU_053

HU_055HU_120

HU_171

HU_173

HU_187HU_026

HU_038

HU_044HU_102HU_121HU_150

HU_155HU_160HU_167

−6 −4 −2 0 2 4 6

−10

−50

510

Scores (OPLS−DA)

t1 (5%)

to1

F

M

HU_011

HU_014HU_015

HU_017HU_018HU_019

HU_020

HU_021HU_022

HU_023

HU_024HU_025

HU_026

HU_027HU_028HU_029

HU_030HU_031 HU_032

HU_033HU_034 HU_035HU_036

HU_037

HU_038HU_039

HU_040HU_041

HU_042HU_043

HU_044HU_045HU_046

HU_047

HU_048HU_049

HU_050

HU_051

HU_052

HU_053

HU_054

HU_055

HU_056HU_057

HU_058

HU_060HU_061HU_062

HU_063

HU_064HU_065

HU_066HU_067

HU_068HU_069HU_070HU_072

HU_073HU_074

HU_075

HU_076

HU_077

HU_078

HU_079HU_080HU_081

HU_082

HU_083

HU_084HU_085

HU_086

HU_087HU_088

HU_089

HU_090HU_091HU_092

HU_093

HU_094 HU_095

HU_097

HU_098HU_099

HU_100

HU_101HU_102

HU_103

HU_105HU_106

HU_107

HU_108HU_109

HU_110

HU_112

HU_113

HU_114

HU_115

HU_116

HU_117HU_118

HU_119HU_120

HU_121HU_122

HU_123HU_124

HU_125

HU_126HU_127

HU_129

HU_130

HU_131

HU_132HU_133

HU_134HU_135

HU_136

HU_137

HU_138HU_139

HU_140

HU_142

HU_143HU_144HU_145HU_146HU_147

HU_148HU_149HU_150

HU_152

HU_154HU_155HU_156HU_157

HU_158

HU_159

HU_160

HU_162HU_163

HU_164

HU_166

HU_167HU_168

HU_169

HU_170

HU_171

HU_172

HU_173

HU_174

HU_175

HU_177

HU_179

HU_180

HU_181HU_182

HU_183

HU_184HU_185

HU_186 HU_187

HU_188HU_189HU_190

HU_191HU_192

HU_193

HU_194

HU_195

HU_196

HU_197

HU_198HU_199

HU_200HU_201

HU_202 HU_203

HU_204

HU_205HU_206

HU_207

HU_208

HU_209

R2X0.275

R2Y0.73

Q2Y0.602

RMSEE0.262

pre1

ort2

−0.2 0.0 0.1 0.2

0.00

0.10

0.20

Loadings

p1 (5%)

pOrt

ho1

(5%

)

Testosterone glucuronideAsp−Leu/Ile isomer 1

(gamma)Glu−Leu/IlePantothenic acid

Malic acidp−Anisic acid

Acetaminophen glucuronide

N−Acetylleucine(2−methoxyethoxy)propanoic acid isomer

Threonic acid/Erythronic acid1−Methyluric acidPyrocatechol sulfate

4.7 Importing/exporting data from/to the Workflow4metabolomicsinfrastructure

Galaxy is a web-based environment providing powerful graphical user interface and workflowmanagement functionalities for omics data analysis (Goecks et al. (2010); Boekel et al.(2015)). Wrapping an R code into a Galaxy module is quite straight-forward: examples canbe found on the toolshed central repository and in the RGalaxy bioconductor package.Workflow4metabolomics (W4M) is the online infrastructure for computational metabolomicsbased on the Galaxy environment (Giacomoni et al. 2015). W4M enables to build, run,save and share workflows efficiently. In addition, workflows and input/output data (calledhistories) can be referenced, thus enabling fully reproducible research. More than 30 modulesare currently available for LC-MS, GC-MS and NMR data preprocessing, statistical analysis,and annotation, including wrappers of xcms, CAMERA, metaMS, ropls, and biosigner, and isopen to new contributions.In order to facilitate data import from/to W4M, the fromW4M function (respectively thetoW4M method) enables import from (respectively export to) the W4M 3 tabular file format(dataMatrix.tsv, sampleMetadata.tsv, variableMetadata.tsv) into (respectively from) anExpressionSet object, as shown in the following example which uses the 3 .tsv files stored inthe extdata repository of the package to create a sacSet ExpressionSet object:

17

Page 18: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

sacSet <- fromW4M(file.path(path.package("ropls"), "extdata"))

sacSet

## ExpressionSet (storageMode: lockedEnvironment)

## assayData: 109 features, 183 samples

## element names: exprs

## protocolData: none

## phenoData

## sampleNames: HU_011 HU_014 ... HU_209 (183 total)

## varLabels: age bmi gender

## varMetadata: labelDescription

## featureData

## featureNames: X.2.methoxyethoxy.propanoic.acid.isomer

## X.gamma.Glu.Leu.Ile ... Xanthosine (109 total)

## fvarLabels: msiLevel hmdb chemicalClass

## fvarMetadata: labelDescription

## experimentData: use 'experimentData(object)'

## Annotation:

The generated sacSet ExpressionSet object can be used with the opls method as describedin the previous section.Conversely, an ExpressionSet (with filled assayData, phenoData and featureData slots)can be exported to the 3 table W4M format:toW4M(sacSet, paste0(getwd(), "/out_"))

Before moving to the next session whith another example dataset, we detach sacurine fromthe search path:detach(sacurine)

5 Pre-processing and annotation of mass spectrom-etry data

To illustrate how dataMatrix, sampleMetadata and variableMetadata can be obtained fromraw mass spectra file, we use the LC-MS data from the faahKO package (Saghatelian etal. 2004). We will pre-process the raw files with the xcms package (Smith et al. 2006) andannotate isotopes and adducts with the CAMERA package (Kuhl et al. 2012), as describedin the corresponding vignettes (all these packages are from bioconductor).Let us start by getting the paths to the 12 raw files (6 KO and 6 WT mice) in the .cdf openformat. The files are grouped in two sub-directories (KO and WT) since xcms can use sampleclass information when grouping the peaks and correcting retention times.library(faahKO)

cdfpath <- system.file("cdf", package = "faahKO")

cdffiles <- list.files(cdfpath, recursive = TRUE, full.names = TRUE)

basename(cdffiles)

## [1] "ko15.CDF" "ko16.CDF" "ko18.CDF" "ko19.CDF" "ko21.CDF" "ko22.CDF"

## [7] "wt15.CDF" "wt16.CDF" "wt18.CDF" "wt19.CDF" "wt21.CDF" "wt22.CDF"

18

Page 19: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

Next, xcms is used to pre-process the individual raw files, as described in the vignette.library(xcms)

xset <- xcmsSet(cdffiles)

xset

## An "xcmsSet" object with 12 samples

##

## Time range: 2506.1-4147.7 seconds (41.8-69.1 minutes)

## Mass range: 200.1-599.3338 m/z

## Peaks: 4721 (about 393 per sample)

## Peak Groups: 0

## Sample classes: KO, WT

##

## Feature detection:

## o Peak picking performed on MS1.

## Profile settings: method = bin

## step = 0.1

##

## Memory usage: 0.744 MB

xset <- group(xset)

## Processing 3195 mz slices ... OK

xset2 <- retcor(xset, family = "symmetric", plottype = "mdevden")

## Performing retention time correction using 133 peak groups.

xset2 <- group(xset2, bw = 10)

## Processing 3195 mz slices ... OK

xset3 <- fillPeaks(xset2)

Finally, the annotateDiffreport from CAMERA annotates isotopes and adducts and builds apeak table containing the peak intensities and the variable metadata.library(CAMERA)

diffreport <- annotateDiffreport(xset3, quick=TRUE)

## Start grouping after retention time.

## Created 128 pseudospectra.

## Generating peak matrix!

## Run isotope peak annotation

## % finished: 10 20 30 40 50 60 70 80 90 100

## Found isotopes: 81

diffreport[1:4, ]

## name fold tstat pvalue mzmed mzmin

## 300.2/3390 M300T3390 5.693594 -14.44368 5.026336e-08 300.1898 300.1706

## 301.2/3390 M301T3390 5.876588 -15.57570 6.705719e-08 301.1879 301.1659

## 298.2/3187 M298T3187 3.870918 -11.93891 3.310025e-07 298.1508 298.1054

## 491.2/3397 M491T3397 24.975703 -16.83986 4.463361e-06 491.2000 491.1877

## mzmax rtmed rtmin rtmax npeaks KO WT ko15

19

Page 20: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

## 300.2/3390 300.2000 3390.324 3386.765 3396.335 12 6 6 4534353.6

## 301.2/3390 301.1949 3389.627 3386.765 3392.101 7 6 1 962353.4

## 298.2/3187 298.1592 3186.803 3184.124 3191.312 4 4 0 180780.8

## 491.2/3397 491.2063 3397.160 3367.123 3424.681 6 6 0 432037.0

## ko16 ko18 ko19 ko21 ko22 wt15

## 300.2/3390 4980914.5 5290739.1 4564262.9 4733236.1 3931592.6 349660.885

## 301.2/3390 1047934.1 1109303.0 946943.4 984787.2 806171.5 86450.412

## 298.2/3187 203927.0 191015.9 190626.8 156869.1 220288.6 16269.096

## 491.2/3397 332159.1 386966.8 334951.5 294816.2 373577.6 7643.138

## wt16 wt18 wt19 wt21 wt22 isotopes

## 300.2/3390 491793.18 645526.70 634108.85 1438254.446 1364627.84 [9][M]+

## 301.2/3390 120096.52 143007.95 137319.69 218483.143 291392.97 [9][M+1]+

## 298.2/3187 43677.78 54739.13 76318.01 54726.115 49679.94

## 491.2/3397 10519.94 26472.29 33598.32 8030.467 0.00

## adduct pcgroup

## 300.2/3390 20

## 301.2/3390 20

## 298.2/3187 103

## 491.2/3397 28

We then build the dataMatrix, sampleMetadata and variableMetadata matrix and dataframesas follows:sampleVc <- grep("^ko|^wt", colnames(diffreport), value = TRUE)

dataMatrix <- t(as.matrix(diffreport[, sampleVc]))

dimnames(dataMatrix) <- list(sampleVc, diffreport[, "name"])

sampleMetadata <- data.frame(row.names = sampleVc,

genotypeFc = substr(sampleVc, 1, 2))

variableMetadata <- diffreport[, !(colnames(diffreport) %in% c("name", sampleVc))]

rownames(variableMetadata) <- diffreport[, "name"]

The data can now be analysed with the ropls package as described in the previous section(i.e. by performing a PCA and an OPLS-DA):library(ropls)

opls(dataMatrix)

## PCA

## 12 samples x 398 variables

## standard scaling of predictors

## R2X(cum) pre ort

## Total 0.588 2 0

20

Page 21: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

p1 p2

Variance explained

PC

% o

f tot

al v

aria

nce

010

2030

40

0.0 0.5 1.0 1.5 2.0 2.5

05

1015

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

) wt16

−40 −20 0 20 40

−20

010

20

Scores (PCA)

t1 (43%)

t2 (

16%

) ko15

ko16

ko18 ko19

ko21ko22

wt15

wt16 wt18 wt19wt21wt22

R2X0.588

−0.05 0.00 0.05

−0.1

00.

000.

050.

10

Loadings

p1 (43%)

p2 (

16%

)M438T3457M523T3376M440T3477

M398T3313M430T2688

M344T2894

M382T3247M264T3155M362T2918

M459T3044M566T3580M272T2732

opls(dataMatrix, sampleMetadata[, "genotypeFc"], orthoI = NA)

## Warning: OPLS: number of predictive components ('predI' argument) set to 1

## OPLS-DA

## 12 samples x 398 variables and 1 response

## standard scaling of predictors and response(s)

## R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2

## Total 0.737 0.993 0.822 0.0557 1 3 0.05 0.05

21

Page 22: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

p1 o1 o2 o3

Model overview

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

0.8

R2Y

Q2Y

0.0 0.5 1.0 1.5 2.0 2.5

05

1015

20

Observation diagnostics

Score distance (SD)

Ort

hgon

al d

ista

nce

(OD

)

−20 −10 0 10 20

−30

−10

1030

Scores (OPLS−DA)

t1 (11%)

to1

ko

wt

ko15

ko16

ko18

ko19ko21ko22

wt15

wt16

wt18

wt19wt21wt22

R2X0.737

R2Y0.993

Q2Y0.822

RMSEE0.056

pre1

ort3

−0.15 −0.05 0.05

−0.0

50.

000.

05

Loadings

p1 (11%)

pOrt

ho1

(11%

)

M491T3397M348T3288M301T3390

M522T3553M580T3389

M594T3395

M361T3170M440T3477M438T3457

M430T2688M344T2894M366T2792

Note that the warning message is just a reminder that OPLS(-DA) models of a single responsehave only 1 predictive component, and could have been avoided by specifying predI = 1 inthe opls call.

6 Other datasets

In addition to the sacurine dataset presented above, the package contains the followingdatasets to illustrate the functionalities of PCA, PLS and OPLS (see the examples in thedocumentation of the opls function):

• aminoacids Amino-Acids Dataset. Quantitative structure property relationship (QSPR)(Wold, Sjostrom, and Eriksson 2001).

• cellulose NIR-Viscosity example data set to illustrate multivariate calibration using PLS,spectral filtering and OPLS (Multivariate calibration using spectral data. Simca tutorial.Umetrics, Sweden).

• cornell Octane of various blends of gasoline: Twelve mixture component proportions ofthe blend are analysed (Tenenhaus 1998).

22

Page 23: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

• foods Food consumption patterns accross European countries (FOODS). The relativeconsumption of 20 food items was compiled for 16 countries. The values range between0 and 100 percent and a high value corresponds to a high consumption. The datasetcontains 3 missing data (Eriksson et al. 2001).

• linnerud Three physiological and three exercise variables are measured on twentymiddle-aged men in a fitness club (Tenenhaus 1998).

• lowarp A multi response optimization data set (LOWARP) (Eriksson et al. 2001).• mark Marks obtained by french students in mathematics, physics, french and english.

Toy example to illustrate the potentialities of PCA (Baccini 2010).

7 Session info

Here is the output of sessionInfo() on the system on which this document was compiled:## R version 3.5.0 (2018-04-23)

## Platform: x86_64-pc-linux-gnu (64-bit)

## Running under: Ubuntu 16.04.4 LTS

##

## Matrix products: default

## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so

## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so

##

## locale:

## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C

## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C

## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8

## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C

## [9] LC_ADDRESS=C LC_TELEPHONE=C

## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

##

## attached base packages:

## [1] parallel stats graphics grDevices utils datasets methods

## [8] base

##

## other attached packages:

## [1] CAMERA_1.36.0 faahKO_1.19.0 xcms_3.2.0

## [4] MSnbase_2.6.0 ProtGenerics_1.12.0 mzR_2.14.0

## [7] Rcpp_0.12.16 BiocParallel_1.14.0 Biobase_2.40.0

## [10] BiocGenerics_0.26.0 ropls_1.12.0 BiocStyle_2.8.0

##

## loaded via a namespace (and not attached):

## [1] vsn_3.48.0 splines_3.5.0 foreach_1.4.4

## [4] Formula_1.2-2 affy_1.58.0 stats4_3.5.0

## [7] latticeExtra_0.6-28 RBGL_1.56.0 yaml_2.1.18

## [10] impute_1.54.0 pillar_1.2.2 backports_1.1.2

## [13] lattice_0.20-35 limma_3.36.0 digest_0.6.15

## [16] RColorBrewer_1.1-2 checkmate_1.8.5 colorspace_1.3-2

## [19] htmltools_0.3.6 preprocessCore_1.42.0 Matrix_1.2-14

23

Page 24: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

## [22] plyr_1.8.4 MALDIquant_1.17 XML_3.98-1.11

## [25] pkgconfig_2.0.1 bookdown_0.7 zlibbioc_1.26.0

## [28] scales_0.5.0 RANN_2.5.1 affyio_1.50.0

## [31] tibble_1.4.2 htmlTable_1.11.2 IRanges_2.14.0

## [34] ggplot2_2.2.1 nnet_7.3-12 lazyeval_0.2.1

## [37] MassSpecWavelet_1.46.0 survival_2.42-3 magrittr_1.5

## [40] evaluate_0.10.1 doParallel_1.0.11 MASS_7.3-50

## [43] foreign_0.8-70 graph_1.58.0 BiocInstaller_1.30.0

## [46] tools_3.5.0 data.table_1.10.4-3 stringr_1.3.0

## [49] S4Vectors_0.18.0 munsell_0.4.3 cluster_2.0.7-1

## [52] pcaMethods_1.72.0 compiler_3.5.0 mzID_1.18.0

## [55] rlang_0.2.0 grid_3.5.0 iterators_1.0.9

## [58] rstudioapi_0.7 htmlwidgets_1.2 igraph_1.2.1

## [61] base64enc_0.1-3 rmarkdown_1.9 gtable_0.2.0

## [64] codetools_0.2-15 multtest_2.36.0 gridExtra_2.3

## [67] knitr_1.20 Hmisc_4.1-1 rprojroot_1.3-2

## [70] stringi_1.1.7 rpart_4.1-13 acepack_1.4.1

## [73] xfun_0.1

References

Baccini, A. 2010. “Statistique Descriptive Multidimensionnelle (Pour Les Nuls).”Boekel, J., JM. Chilton, IR. Cooke, PL. Horvatovich, PD. Jagtap, L. Kall, J. Lehtio, P.Lukasse, PD. Moerland, and TJ. Griffin. 2015. “Multi-Omic Data Analysis Using Galaxy.”Nature Biotechnology 33 (2):137–39. https://doi.org/10.1038/nbt.3134.Brereton, Richard G., and Gavin R. Lloyd. 2014. “Partial Least Squares Discriminant Analysis:Taking the Magic Away.” Journal of Chemometrics 28 (4):213–25. http://dx.doi.org/10.1002/cem.2609.Bylesjo, M., M. Rantalainen, J. Nicholson, E. Holmes, and J. Trygg. 2008. “K-OPLS Package:Kernel-Based Orthogonal Projections to Latent Structures for Prediction and Interpretation inFeature Space.” BMC Bioinformatics 9 (1):106. http://dx.doi.org/10.1186/1471-2105-9-106.Bylesjo, M, M Rantalainen, O Cloarec, J Nicholson, E Holmes, and J Trygg. 2006. “OPLSDiscriminant Analysis: Combining the Strengths of PLS-DA and SIMCA Classification.”Journal of Chemometrics 20:341–51. http://dx.doi.org/10.1002/cem.1006.Eriksson, L., E. Johansson, N. Kettaneh-Wold, and S. Wold. 2001. Multi- and MegavariateData Analysis. Principles and Applications. Umetrics Academy.Galindo-Prieto, B., L. Eriksson, and J. Trygg. 2014. “Variable Influence on Projection(VIP) for Orthogonal Projections to Latent Structures (OPLS).” Journal of Chemometrics 28(8):623–32. http://dx.doi.org/10.1002/cem.2627.Gaude, R., F. Chignola, D. Spiliotopoulos, A. Spitaleri, M. Ghitti, JM. Garcia-Manteiga,S. Mari, and G. Musco. 2013. “Muma, an R Package for Metabolomics Univariate andMultivariate Statistical Analysis.” Current Metabolomics 1:180–89. http://dx.doi.org/10.2174/2213235X11301020005.

24

Page 25: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

Giacomoni, F., G. Le Corguille, M. Monsoor, M. Landi, P. Pericard, M. Petera,C. Duperier, et al. 2015. “Workflow4Metabolomics: A Collaborative ResearchInfrastructure for Computational Metabolomics.” Bioinformatics 31 (9):1493–5.http://dx.doi.org/10.1093/bioinformatics/btu813.Goecks, J., A. Nekrutenko, J. Taylor, and The Galaxy Team. 2010. “Galaxy: A ComprehensiveApproach for Supporting Accessible, Reproducible, and Transparent Computational Research inthe Life Sciences.” Genome Biology 11 (8):R86. https://doi.org/10.1186/gb-2010-11-8-r86.Huber, W., VJ. Carey, R. Gentleman, S. Anders, M. Carlson, BS. Carvalho, HC. Bravo, etal. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” NatureMethods 12 (2):115–21. https://doi.org/10.1038/nmeth.3252.Hubert, M., PJ. Rousseeuw, and K. Vanden Branden. 2005. “ROBPCA: A New Approach toRobust Principal Component Analysis.” Technometrics 47:64–79. http://dx.doi.org/10.1198/004017004000000563.Kuhl, C., R. Tautenhahn, C Bottcher, TR. Larson, and S. Neumann. 2012. “CAM-ERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liq-uid Chromatography/Mass Spectrometry Data Sets.” Analytical Chemistry 84 (1):283–89.http://dx.doi.org/10.1021/ac202450g.Mehmood, T., KH. Liland, L. Snipen, and S. Saebo. 2012. “A Review of Variable SelectionMethods in Partial Least Squares Regression.” Chemometrics and Intelligent LaboratorySystems 118 (0):62–69. http://dx.doi.org/10.1016/j.chemolab.2012.07.010.Pinto, RC., J. Trygg, and J. Gottfries. 2012. “Advantages of Orthogonal Inspection inChemometrics.” Journal of Chemometrics 26 (6):231–35. http://dx.doi.org/10.1002/cem.2441.R Development Core Team. 2008. R: A Language and Environment for Statistical Computing.Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org.Saghatelian, A., SA. Trauger, EJ. Want, EG. Hawkins, G. Siuzdak, and BF. Cravatt. 2004.“Assignment of Endogenous Substrates to Enzymes by Global Metabolite Profiling.” Biochem-istry 43 (45):14332–9. http://dx.doi.org/10.1021/bi0480335.Smith, CA., EJ. Want, G. O’Maille, R. Abagyan, and G. Siuzdak. 2006. “XCMS: ProcessingMass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Match-ing, and Identification.” Analytical Chemistry 78 (3):779–87. http://dx.doi.org/10.1021/ac051437y.Szymanska, E., E. Saccenti, AK. Smilde, and JA. Westerhuis. 2012. “Double-Check:Validation of Diagnostic Statistics for PLS-DA Models in Metabolomics Studies.” Metabolomics8 (1, 1):3–16. http://dx.doi.org/10.1007/s11306-011-0330-3.Tenenhaus, M. 1998. La Regression PLS : Theorie et Pratique. Editions Technip.Thevenot, EA., A. Roux, X. Ying, E. Ezan, and C. Junot. 2015. “Analysis of the Human AdultUrinary Metabolome Variations with Age, Body Mass Index and Gender by Implementing aComprehensive Workflow for Univariate and OPLS Statistical Analyses.” Journal of ProteomeResearch 14 (8):3322–35. http://dx.doi.org/10.1021/acs.jproteome.5b00354.Trygg, J., and S. Wold. 2002. “Orthogonal Projection to Latent Structures (O-PLS).” Journalof Chemometrics 16:119–28. http://dx.doi.org/10.1002/cem.695.Wehrens, R. 2011. Chemometrics with R: Multivariate Data Analysis in the Natural Sciencesand Life Sciences. Springer.

25

Page 26: ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and …bioconductor.riken.jp/packages/3.7/bioc/vignettes/ropls/... · 2018. 5. 1. · plot(sacurine.pca,typeVc ="x-score",

ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data

Wold, S., M. Sjostrom, and L. Eriksson. 2001. “PLS-Regression: A Basic Tool of Chemo-metrics.” Chemometrics and Intelligent Laboratory Systems 58:109–30. http://dx.doi.org/10.1016/S0169-7439(01)00155-1.

26