
http://workflow4metabolomics.org

HOW TO PERFORM UNIVARIATE ANALYSES?

W4M Core Team


The "Univariate" module

The "Univariate" module on W4M allows you to perform:

• The Student t-test to compare two population means
• The Wilcoxon test to compare two population medians (non-parametric)
• The Analysis of Variance (and subsequent pairwise comparisons with Student t-tests)
• The Kruskal-Wallis test to compare more than two population medians (non-parametric; followed by pairwise comparisons with Wilcoxon tests)
• The correlation test with the Pearson or the Spearman (non-parametric) method

The Univariate module is a wrapper around the corresponding tests from the R software.


Chaining the statistical modules

The Univariate module can be chained with the Multivariate module, and also with the Filters module (either to filter out pool or blank samples before the statistics, or to filter out variables according to a statistical threshold after the analysis).


Preparing your files (1/9)

Your data must be split into 3 files:

• dataMatrix.tsv

• sampleMetadata.tsv

• variableMetadata.tsv




Each file can be prepared in Excel and saved as tab-delimited text (the "Text (Tab delimited)" file type):



You can then rename your file with the .tsv extension (instead of .txt) by right-clicking on the file.

.tsv files (i.e. tab-separated values) can be handled correctly by both Excel and Galaxy.



Decimal separator must be "."

Missing values must be indicated as "NA"




Note: you can switch your default language in Excel to English in order to have your decimal separator automatically set to "."



The dataMatrix.tsv file must contain:

• the names of your samples in the first row
• the names of your variables in the first column
• numbers (or NA) in all the other cells

Note: the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility (Excel may mistake a text file starting with "ID" for a SYLK file).


Preparing your files (7/9)

The sampleMetadata.tsv file must contain:

• the names of the factors to be used in statistical analyses in the first row
• columns containing characters for qualitative factors and numbers for quantitative factors
• the names of your samples in the first column, which must exactly match those of the dataMatrix.tsv file

Note:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
• 2) you can add columns for storing metadata about your samples even if they are not used in your Galaxy analysis
• 3) results from statistical analyses (e.g. scores) will be added as supplementary columns in this file



The variableMetadata.tsv file must contain:

• the names of the metadata (e.g. mzmed, rtmed) in the first row (there must be at least one column in addition to the variable names)
• the names of your variables in the first column, which must exactly match those of the dataMatrix.tsv file

Note:

• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel compatibility
• 2) you can add columns for storing metadata about your variables even if they are not used in your Galaxy analysis
• 3) results from the statistical analyses (e.g. loadings, VIPs) will be added as new columns in this file
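The requirements on the three files can be checked before uploading. The following is a minimal sketch in Python (standard library only), not part of W4M: the function names `read_tsv`, `is_valid_cell` and `check_tables` are hypothetical, and the sketch assumes that "exactly match" refers to the set of names, not to their order.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of non-empty rows (lists of strings)."""
    return [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]


def is_valid_cell(cell):
    """A dataMatrix cell must be 'NA' or a number using '.' as decimal separator."""
    if cell == "NA":
        return True
    try:
        float(cell)  # "4,2" (comma as decimal separator) raises ValueError
    except ValueError:
        return False
    return True


def check_tables(data_matrix, sample_metadata, variable_metadata):
    """Return a list of problems found in the three tables (empty if none)."""
    dm = read_tsv(data_matrix)
    sm = read_tsv(sample_metadata)
    vm = read_tsv(variable_metadata)
    problems = []

    # Samples: first row of dataMatrix vs first column of sampleMetadata.
    dm_samples = dm[0][1:]
    sm_samples = [row[0] for row in sm[1:]]
    if sorted(dm_samples) != sorted(sm_samples):
        problems.append("sample names differ between dataMatrix and sampleMetadata")

    # Variables: first column of dataMatrix vs first column of variableMetadata.
    dm_variables = [row[0] for row in dm[1:]]
    vm_variables = [row[0] for row in vm[1:]]
    if sorted(dm_variables) != sorted(vm_variables):
        problems.append("variable names differ between dataMatrix and variableMetadata")

    # variableMetadata needs at least one column besides the variable names.
    if len(vm[0]) < 2:
        problems.append("variableMetadata has no metadata column")

    # Every remaining dataMatrix cell must be a number or NA.
    for row in dm[1:]:
        for sample, cell in zip(dm_samples, row[1:]):
            if not is_valid_cell(cell):
                problems.append(f"invalid value {cell!r} for variable {row[0]} in sample {sample}")
    return problems


# Toy tables (made-up values); the last cell uses "," as decimal separator.
data_matrix = "dataMatrix\ts1\ts2\nv1\t1.5\tNA\nv2\t3\t4,2\n"
sample_metadata = "sampleMetadata\tage\ns1\tjunior\ns2\tsenior\n"
variable_metadata = "variableMetadata\tmzmed\nv1\t100.1\nv2\t200.2\n"
print(check_tables(data_matrix, sample_metadata, variable_metadata))
```

With these toy tables, the only reported problem is the cell written with a comma as decimal separator.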




Sample and variable names:

• should not start with a digit
• should contain only:
  • lowercase letters: a b c d e f g h i j k l m n o p q r s t u v w x y z
  • uppercase letters: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • digits: 0 1 2 3 4 5 6 7 8 9
  • , [comma]
  • - [dash]
  • _ [underscore]
  •   [blank]
• should not contain other punctuation marks or accented characters
• should not contain any duplicates
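The naming rules above can be expressed as a regular expression. The sketch below is only an illustration in Python, not the check performed by W4M; `VALID_NAME`, `invalid_names` and `duplicated_names` are hypothetical names, and the example names are made up.

```python
import re
from collections import Counter

# Allowed characters: ASCII letters, digits, comma, dash, underscore and blank.
# The first character may be any allowed character except a digit.
VALID_NAME = re.compile(r"[A-Za-z,\-_ ][A-Za-z0-9,\-_ ]*")


def invalid_names(names):
    """Return the names that break the character rules, in their original order."""
    return [name for name in names if VALID_NAME.fullmatch(name) is None]


def duplicated_names(names):
    """Return each name that appears more than once."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


names = ["QC_1", "1_blank", "métabolite", "M123.45", "pool-2", "QC_1"]
print(invalid_names(names))     # starts with a digit, accent, "." not allowed
print(duplicated_names(names))  # names used more than once
```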


Loading your files into Galaxy (1/2)

Upload your three files (dataMatrix.tsv, sampleMetadata.tsv and variableMetadata.tsv)

• either by clicking the icon and dragging and dropping the files:


Loading your files into Galaxy (2/2)

• or with the Get Data / Upload File



Check that your data have been uploaded correctly


Rename your history (optional)



Open the "Univariate" module and select your 3 files of interest: you are now ready to start your univariate analyses!


Select:

• the factor of interest (name of the corresponding column of the sampleMetadata.tsv file)
• the test to be performed
• the correction for multiple testing
• the significance threshold

and launch the computation.


Notes: tests available

The choice of the test depends on:

• whether your factor of interest is quantitative or qualitative (and, in the latter case, whether the number of levels is 2 or > 2)
• whether you wish to perform parametric or non-parametric testing: non-parametric tests do not assume that the values are normally distributed; they can be useful in case of skewed distributions or a small number of samples; when the normality assumption holds, however, non-parametric tests are generally less powerful than their parametric counterparts

Factor of interest         Parametric                  Non-parametric
Qualitative, 2 levels      Student's t-test            Wilcoxon test
Qualitative, > 2 levels    Analysis of Variance        Kruskal-Wallis test
Quantitative               Pearson correlation test    Spearman correlation test
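The decision table can also be read as a small lookup. The function below is a hypothetical Python helper written for illustration (it is not part of the Univariate module); it simply returns the name of the test given in the table for a factor type, a number of levels, and the choice of parametric or non-parametric testing.

```python
def choose_test(factor_type, n_levels=None, parametric=True):
    """Return the test suggested by the table for the given factor of interest.

    factor_type: "quantitative" or "qualitative"
    n_levels: number of levels of a qualitative factor (ignored otherwise)
    parametric: True for parametric testing, False for non-parametric testing
    """
    if factor_type == "quantitative":
        return "Pearson correlation test" if parametric else "Spearman correlation test"
    if factor_type != "qualitative":
        raise ValueError("factor_type must be 'quantitative' or 'qualitative'")
    if n_levels is None or n_levels < 2:
        raise ValueError("a qualitative factor needs at least 2 levels")
    if n_levels == 2:
        return "Student's t-test" if parametric else "Wilcoxon test"
    return "Analysis of Variance" if parametric else "Kruskal-Wallis test"


# A qualitative factor with three levels, non-parametric testing.
print(choose_test("qualitative", n_levels=3, parametric=False))
```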


Notes: correction for multiple testing

The 7 methods implemented in the 'p.adjust' R function are available. The R documentation describes the methods as follows:

• Bonferroni correction ("bonferroni") in which the p-values are multiplied by the number of comparisons.
• Less conservative corrections are also included by Holm (1979) ("holm"), Hochberg (1988) ("hochberg"), Hommel (1988) ("hommel"), Benjamini and Hochberg (1995) ("BH" or its alias "fdr"), and Benjamini and Yekutieli (2001) ("BY"), respectively.
• A pass-through option ("none") is also included.
• The first four methods are designed to give strong control of the family-wise error rate. There seems no reason to use the unmodified Bonferroni correction because it is dominated by Holm's method, which is also valid under arbitrary assumptions. Hochberg's and Hommel's methods are valid when the hypothesis tests are independent or when they are non-negatively associated (Sarkar, 1998; Sarkar and Chang, 1997). Hommel's method is more powerful than Hochberg's, but the difference is usually small and the Hochberg p-values are faster to compute. The "BH" (aka "fdr") and "BY" methods of Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition than the family-wise error rate, so these methods are more powerful than the others.

The p-values of the test for each variable will be given after correction by the selected method, as an additional column in the variableMetadata.tsv file.
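To make the corrections more concrete, here is a minimal Python sketch of three of them (Bonferroni, Holm, and Benjamini-Hochberg). It is a re-implementation written for illustration, following the standard definitions of these procedures; it is not the R code called by the module, it covers only a subset of the methods, and it does not handle missing p-values. The function name `p_adjust` and the example p-values are assumptions.

```python
def p_adjust(p_values, method="BH"):
    """Adjust a list of p-values with the 'bonferroni', 'holm', 'BH'/'fdr' or 'none' method."""
    n = len(p_values)
    # Indices of the p-values, from the smallest to the largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n

    if method == "none":
        return list(p_values)
    if method == "bonferroni":
        # Multiply by the number of comparisons, capped at 1.
        return [min(1.0, p * n) for p in p_values]
    if method == "holm":
        # Step-down: multiply the k-th smallest p-value by (n - k + 1),
        # then enforce monotonicity with a running maximum.
        running_max = 0.0
        for rank, i in enumerate(order):  # rank 0 is the smallest p-value
            running_max = max(running_max, (n - rank) * p_values[i])
            adjusted[i] = min(1.0, running_max)
        return adjusted
    if method in ("BH", "fdr"):
        # Step-up: multiply the k-th smallest p-value by n / k,
        # then enforce monotonicity with a running minimum from the largest down.
        running_min = 1.0
        for rank in range(n - 1, -1, -1):
            i = order[rank]
            running_min = min(running_min, p_values[i] * n / (rank + 1))
            adjusted[i] = running_min
        return adjusted
    raise ValueError(f"unsupported method: {method}")


raw = [0.01, 0.04, 0.03, 0.5]  # made-up raw p-values for four variables
for method in ("bonferroni", "holm", "BH"):
    print(method, [round(p, 4) for p in p_adjust(raw, method)])
```

On this example the Bonferroni-adjusted values are the largest and the Benjamini-Hochberg values the smallest, in line with the ordering of stringency described in the R documentation.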



Notes: significance threshold

The selected threshold does not modify the (corrected) p-values, which are returned in any case as an additional column of the variableMetadata.tsv file.

It is only used to add another column with 0/1 values indicating which variables are below the threshold (encoded as 1), in order to facilitate their subsequent filtering.
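The role of the threshold can be sketched as follows in Python. This is an illustration only: the function `add_significance_column`, the column names `age_kruskal_fdr` and `age_kruskal_sig`, and the p-values are hypothetical, and the sketch assumes a strict "below" comparison.

```python
def add_significance_column(rows, p_column, sig_column, threshold=0.05):
    """Return copies of the rows with an extra 0/1 column; p-values are left unchanged."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy, so that the input rows are not modified
        new_row[sig_column] = 1 if row[p_column] < threshold else 0
        flagged.append(new_row)
    return flagged


# Made-up variableMetadata rows holding corrected p-values.
variable_metadata = [
    {"variable": "v1", "age_kruskal_fdr": 0.003},
    {"variable": "v2", "age_kruskal_fdr": 0.21},
]
flagged = add_significance_column(variable_metadata, "age_kruskal_fdr", "age_kruskal_sig")

# The 0/1 column makes the subsequent filtering a simple selection.
selected = [row["variable"] for row in flagged if row["age_kruskal_sig"] == 1]
print(selected)
```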



Results (1/2)

The dataMatrix.tsv and sampleMetadata.tsv files have not been modified.

Two columns have been added to the variableMetadata.tsv file with the p-values of the Kruskal-Wallis test (corrected with the False Discovery Rate approach) and a 0/1 encoding corresponding to the threshold given as argument.


Results (2/2)

Since the p-value of the first variable is below the threshold, pairwise Wilcoxon tests have been performed to compare the three groups:

• junior vs experienced
• experienced vs senior
• junior vs senior

The corresponding p-values have been corrected for the number of pairwise tests (n = 3).

After correction, the junior vs senior p-value is above the threshold and the comparison is consequently displayed in the last column.
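The number of pairwise tests follows directly from the number of groups: k groups give k(k - 1)/2 pairs, hence 3 pairs for 3 groups. A short Python sketch (the helper name `pairwise_comparisons` is hypothetical) enumerates them:

```python
from itertools import combinations


def pairwise_comparisons(levels):
    """Return every unordered pair of factor levels."""
    return list(combinations(levels, 2))


levels = ["junior", "experienced", "senior"]
pairs = pairwise_comparisons(levels)
for first, second in pairs:
    print(f"{first} vs {second}")

# Number of pairwise tests that the p-values are corrected for.
n_tests = len(levels) * (len(levels) - 1) // 2
```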



