http://workflow4metabolomics.org
HOW TO PERFORM
UNIVARIATE ANALYSES?
W4M Core Team
The "Univariate" module
The "Univariate" module on W4M allows you to
perform:
• The Student t-test in order to compare two
population means
• The Wilcoxon test to compare two population
medians (non-parametric)
• The Analysis of Variance (and subsequent
pairwise comparisons with Student t-tests)
• The Kruskal-Wallis test to compare more than
two population medians (non-parametric; followed
by subsequent pairwise comparisons with
Wilcoxon tests)
• The correlation test with the Pearson or the
Spearman (non-parametric) methods
The Univariate module is a wrapper around the
corresponding tests from the R software.
Chaining the statistical modules
The Univariate module can be chained with the Multivariate module
and with the Filters module (either to filter out pool or blank samples
before the statistics, or to filter out variables according to a statistical
threshold after the analysis).
Preparing your files (1/9)
Your data must be split into 3 files:
• dataMatrix.tsv
• sampleMetadata.tsv
• variableMetadata.tsv
Preparing your files (2/9)
Each file can be prepared in Excel and saved as tab-delimited
text (.txt).
Preparing your files (3/9)
You can then rename your file with the .tsv extension (instead of .txt),
for example by right-clicking on the file.
.tsv files (i.e. tab-separated values) can be handled correctly by both
Excel and Galaxy.
Preparing your files (4/9)
Decimal separator must be "."
Missing values must be indicated as "NA"
Preparing your files (5/9)
Note: you can switch your default language in Excel to English in order
to have your decimal separator automatically set to "."
Preparing your files (6/9)
The dataMatrix.tsv file must contain:
• the names of your samples in the first row
• the names of your variables in the first column
• numbers (or NA) in all the other cells
Note: the name in the top-left (A1) cell does not matter; avoid using "ID"
for Excel compatibility
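The rules above for the dataMatrix.tsv file (numbers with "." as decimal separator, or NA) can be checked before uploading. The following is a minimal sketch in Python using only the standard library; the function name check_data_matrix and the example table are hypothetical and not part of W4M. Note that Python's float() is slightly more permissive than the rules above (for instance it also accepts "nan"), so this is only a first-pass check.

```python
import csv
import io


def check_data_matrix(text):
    """Return a list of problems found in a tab-separated dataMatrix (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    for row in body:
        if not row:
            continue  # skip blank lines
        variable, cells = row[0], row[1:]
        if len(cells) != len(header) - 1:
            problems.append(
                f"{variable}: expected {len(header) - 1} values, found {len(cells)}"
            )
        for sample, cell in zip(header[1:], cells):
            if cell == "NA":
                continue  # missing value, allowed
            try:
                float(cell)  # "1.5" parses; "1,5" raises ValueError
            except ValueError:
                problems.append(
                    f"{variable}/{sample}: '{cell}' is neither a number nor NA"
                )
    return problems


# Illustrative table: the cell "2,7" uses a comma as decimal separator.
example = "dataMatrix\ts1\ts2\nv1\t1.5\tNA\nv2\t2,7\t3.0\n"
print(check_data_matrix(example))
# prints ["v2/s1: '2,7' is neither a number nor NA"]
```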
Preparing your files (7/9)
The sampleMetadata.tsv file must contain:
• the names of the factors to be used in statistical analyses in the first row
• columns containing characters for qualitative factors and numbers for
quantitative factors
• the names of your samples in the first column, which must exactly match
those of the dataMatrix.tsv file
Note:
• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel
compatibility
• 2) you can add columns for storing metadata about your samples even
if they are not used in your Galaxy analysis
• 3) results from statistical analyses (e.g. scores) will be added as
supplementary columns in this file
Preparing your files (8/9)
The variableMetadata.tsv file must contain:
• the names of the metadata (e.g. mzmed, rtmed) in the first row (there must
be at least one column in addition to the variable names)
• the names of your variables in the first column, which must exactly match
those of the dataMatrix.tsv file
Note:
• 1) the name in the top-left (A1) cell does not matter; avoid using "ID" for Excel
compatibility
• 2) you can add columns for storing metadata about your variables even
if they are not used in your Galaxy analysis
• 3) results from the statistical analyses (e.g. loadings, VIPs) will be added as
new columns in this file
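Since the sample and variable names must match across the three files, a quick consistency check can save a failed run. The sketch below is an illustration only: the helper names (header_names, first_column, names_match) and the example tables are hypothetical, and it compares names irrespective of their order, which is an assumption because the required ordering is not stated above.

```python
def header_names(text):
    """Return the first-row entries of a tab-separated table, skipping the top-left cell."""
    return text.splitlines()[0].split("\t")[1:]


def first_column(text):
    """Return the first-column entries of a tab-separated table, skipping the header row."""
    lines = [line for line in text.splitlines() if line]
    return [line.split("\t")[0] for line in lines[1:]]


def names_match(data_matrix, sample_metadata, variable_metadata):
    """Return (samples_ok, variables_ok) for the three tables (order is ignored)."""
    samples_ok = sorted(header_names(data_matrix)) == sorted(first_column(sample_metadata))
    variables_ok = sorted(first_column(data_matrix)) == sorted(first_column(variable_metadata))
    return samples_ok, variables_ok


# Illustrative tables: variable "v2" is missing from the variable metadata.
dm = "dataMatrix\ts1\ts2\nv1\t1.0\t2.0\nv2\t3.0\tNA\n"
sm = "sampleMetadata\tage\ns1\tjunior\ns2\tsenior\n"
vm = "variableMetadata\tmzmed\nv1\t100.1\nv3\t200.2\n"
print(names_match(dm, sm, vm))
# prints (True, False)
```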
Preparing your files (9/9)
Sample and variable names:
• should not start with a digit
• should contain only the following characters:
  • a b c d e f g h i j k l m n o p q r s t u v w x y z
  • A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  • 0 1 2 3 4 5 6 7 8 9
  • , [comma]
  • - [dash]
  • _ [underscore]
  • [blank]
• should not contain any other punctuation or accented characters
• should not contain any duplicates
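The naming rules above can be expressed as a regular expression. This is a minimal sketch, not the check performed by W4M itself; the names VALID_NAME and invalid_names and the example names are hypothetical.

```python
import re

# Letters, digits, comma, dash, underscore and blank are allowed;
# the first character may be any of these except a digit.
VALID_NAME = re.compile(r"[A-Za-z,_ -][A-Za-z0-9,_ -]*")


def invalid_names(names):
    """Return the names that break the character rules or repeat an earlier name."""
    seen = set()
    bad = []
    for name in names:
        if VALID_NAME.fullmatch(name) is None or name in seen:
            bad.append(name)
        seen.add(name)
    return bad


print(invalid_names(["QC_1", "1sample", "pool-2", "m/z_100", "QC_1"]))
# prints ['1sample', 'm/z_100', 'QC_1']
```

In this example "1sample" starts with a digit, "m/z_100" contains a slash, and the second "QC_1" is a duplicate.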
Loading your files into Galaxy (1/2)
Upload your three files (dataMatrix.tsv, sampleMetadata.tsv and
variableMetadata.tsv)
• either by using the upload icon
and dragging and dropping the files
Loading your files into Galaxy (2/2)
• or with the Get Data / Upload File tool
Check that your data have been
uploaded correctly
Rename your history (optional)
Open the "Univariate" module
and select your 3 files of interest:
you are now ready to start your univariate analyses!
Select
• the factor of interest (name of the corresponding column of the
sampleMetadata.tsv file)
• the test to be performed
• the correction for multiple testing
• the significance threshold
• and launch the computation
Notes: tests available
The choice of the test depends on:
• whether your factor of interest is quantitative or qualitative (and, in the
latter case, if the number of levels is 2 or > 2)
• whether you wish to perform parametric or non-parametric testing:
• non-parametric tests do not assume that the values are normally
distributed; they can be useful in case of skewed distributions or a small
number of samples; when the parametric assumptions hold, however,
non-parametric tests are less powerful than their parametric counterparts
Factor of interest        Parametric                 Non-parametric
Qualitative, 2 levels     Student's t-test           Wilcoxon test
Qualitative, > 2 levels   Analysis of Variance       Kruskal-Wallis test
Quantitative              Pearson correlation test   Spearman correlation test
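The decision table above can be written as a small function. This is only a sketch of the logic: the function name choose_test is hypothetical, and it assumes that the factor values are given as strings and that a column in which every value parses as a number is a quantitative factor.

```python
def choose_test(factor_values, parametric=True):
    """Suggest a test from the factor type and its number of levels (hypothetical helper)."""

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    # Assumption: an all-numeric column is a quantitative factor.
    if all(is_number(v) for v in factor_values):
        return "Pearson correlation test" if parametric else "Spearman correlation test"

    levels = set(factor_values)
    if len(levels) == 2:
        return "Student's t-test" if parametric else "Wilcoxon test"
    if len(levels) > 2:
        return "Analysis of Variance" if parametric else "Kruskal-Wallis test"
    raise ValueError("a qualitative factor needs at least 2 levels")


print(choose_test(["junior", "experienced", "senior"], parametric=False))
# prints Kruskal-Wallis test
```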
Notes: correction for multiple testing
The 7 methods implemented in the 'p.adjust' R function are available. The R documentation
describes the methods as follows:
• Bonferroni correction ("bonferroni") in which the p-values are multiplied by the number of comparisons.
• Less conservative corrections are also included by Holm (1979) ("holm"), Hochberg (1988)
("hochberg"), Hommel (1988) ("hommel"), Benjamini and Hochberg (1995) ("BH" or its alias "fdr"), and
Benjamini and Yekutieli (2001) ("BY"), respectively.
• A pass-through option ("none") is also included
• The first four methods are designed to give strong control of the family-wise error rate. There seems no
reason to use the unmodified Bonferroni correction because it is dominated by Holm's method, which
is also valid under arbitrary assumptions. Hochberg's and Hommel's methods are valid when the
hypothesis tests are independent or when they are non-negatively associated (Sarkar, 1998; Sarkar
and Chang, 1997). Hommel's method is more powerful than Hochberg's, but the difference is usually
small and the Hochberg p-values are faster to compute. The "BH" (aka "fdr") and "BY" methods of
Benjamini, Hochberg, and Yekutieli control the false discovery rate, the expected proportion of false
discoveries amongst the rejected hypotheses. The false discovery rate is a less stringent condition
than the family-wise error rate, so these methods are more powerful than the others.
The p-values of the test for each variable will be given after correction by the selected method,
as an additional column in the variableMetadata.tsv file
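To make the corrections more concrete, here is a minimal Python sketch of three of the methods ("bonferroni", "holm" and "BH"), written to follow the definitions used by the R 'p.adjust' function. It is an illustration, not the code run by the module; the Python function name p_adjust and the example p-values are assumptions, and the remaining methods are not covered.

```python
def p_adjust(pvalues, method="BH"):
    """Adjust p-values for multiple testing (sketch of three p.adjust methods)."""
    n = len(pvalues)
    if method == "none":
        return list(pvalues)
    if method == "bonferroni":
        # Multiply each p-value by the number of comparisons, capped at 1.
        return [min(1.0, n * p) for p in pvalues]

    # Indices of the p-values sorted in increasing order.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n

    if method == "holm":
        # Step-down: the k-th smallest p-value is multiplied by (n - k + 1),
        # then a running maximum keeps the adjusted values monotone.
        running_max = 0.0
        for rank, idx in enumerate(order):
            running_max = max(running_max, (n - rank) * pvalues[idx])
            adjusted[idx] = min(1.0, running_max)
        return adjusted

    if method == "BH":
        # Step-up: the k-th smallest p-value is multiplied by n / k,
        # then a running minimum is taken from the largest p-value down.
        running_min = 1.0
        for rank in range(n - 1, -1, -1):
            idx = order[rank]
            running_min = min(running_min, n / (rank + 1) * pvalues[idx])
            adjusted[idx] = running_min
        return adjusted

    raise ValueError(f"method not covered by this sketch: {method}")


raw = [0.01, 0.04, 0.03, 0.005]  # illustrative p-values
for name in ("bonferroni", "holm", "BH"):
    print(name, p_adjust(raw, name))
```

On these four p-values, the Bonferroni values are the largest and the "BH" values the smallest, which illustrates why the false discovery rate methods are described as more powerful.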
Notes: significance threshold
The selected threshold does not modify the (corrected) p-values, which
are returned in any case as an additional column of the
variableMetadata.tsv file.
It is merely used to add another column with 0/1 values indicating
which variables are below the threshold (encoded as 1), to facilitate
their subsequent filtering
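The 0/1 encoding described above amounts to a simple comparison. In this sketch the function name significance_flags, the default threshold of 0.05 and the example values are assumptions made for illustration.

```python
def significance_flags(adjusted_pvalues, threshold=0.05):
    """Encode each corrected p-value as 1 if it is below the threshold, else 0."""
    return [1 if p < threshold else 0 for p in adjusted_pvalues]


corrected = [0.02, 0.04, 0.2]  # illustrative corrected p-values
print(significance_flags(corrected))
# prints [1, 1, 0]
```

The p-values themselves are left untouched; only the extra 0/1 column depends on the threshold.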
Results (1/2)
The dataMatrix.tsv and sampleMetadata.tsv files have not been modified.
Two columns have been added to the variableMetadata.tsv file with the
p-values of the Kruskal-Wallis test (corrected with the False Discovery
Rate approach) and a 0/1 encoding corresponding to the threshold given
as argument.
Results (2/2)
Since the p-value of the first variable is below the threshold, a pairwise
Wilcoxon test has been performed to compare the three groups:
• junior vs experienced
• experienced vs senior
• junior vs senior
The corresponding p-values have been corrected for the number of
pairwise tests (n = 3)
After correction, the junior vs senior p-value is above the threshold and
the comparison is consequently displayed in the last column
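The number of pairwise tests follows from the number of levels: three levels give three pairs. The sketch below simply enumerates them; the function name pairwise_comparisons is hypothetical, and the order of the pairs is the one produced by Python's itertools, not necessarily the order used in the module's output.

```python
from itertools import combinations


def pairwise_comparisons(levels):
    """List every unordered pair of factor levels as 'a vs b'."""
    return [f"{a} vs {b}" for a, b in combinations(levels, 2)]


pairs = pairwise_comparisons(["junior", "experienced", "senior"])
print(len(pairs), pairs)
# prints 3 ['junior vs experienced', 'junior vs senior', 'experienced vs senior']
```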