Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Data integration in proteomic and biomarker
discovery pipelines: a case study using the
ProteINSIDE webservice.
M Bonnet, M. Reichstadt, J. Bazile, M.P Ellies
UMR Herbivores – Clermont-Ferrand - France
2
Why proteomics ?
Monti et al. 2019, J proteomics
Technological improvements aided the proteome depiction to have a pivotal role in the study of biological systems
3
MW
Gel-based proteomics
pH4 7
Proteomics is the large-scale study of proteinsin cells, tissues, organs, fluids
Proteinextraction
2-DEImage
Analysis
MS or MS/MS analysis
ProteinIdentification
Protein A
• Identification of proteins in a medium range of pH and molecular weight• Time-consuming and expensive (one mass spectrometry analysis = 1 spot)
Ruiz et al., 2016 ; Zargar et al., 2016
4
Gel-free proteomics
Proteinextraction
Proteinconcentration
Label free, chemical/
isotopical labelling
LC-MS/MS analysis
ProteinIdentification
Identification of the most abundant proteins => a depletion step can beincluded for the very highly abundant proteins
Ruiz et al., 2016 ; Zargar et al., 2016
Protein A
5
Proteomics : data analysis and mining
Gel-based or Gel-free proteomics provide lists of protein IDs with theirabundances to be analysed => this talk focus on the data integration of
protein IDs and abudances
6
Descriptive and differentialstatistics
Data analysis and data mining of lists of
protein IDs and abundances
7
http://prabig-prostar.univ-lyon1.fr/Articles/fdrtuto.pdf
Descriptive statistics : a very useful tutorial
8
Descriptive statistics : PCA, clustering, heatmap
• Observation of the data to assess data distribution, overall variation,
variability within each treatment group, outliers…
• Comparison of means and variabilities from those means
• Are the data normally distributed ? => to be transformed for parametrics
approaches or analyze by non-parametric methods.
• Correlation plots to compare treatment groups
9Karimpour-Fard et al., 2015
The major classification methods
Methods What does it do? Sample size/data characteristics
Principal Component Analysis
Separates features into groups based on commonality and reports the weight of each component’s contribution to the separation
Unlimited sample size, data normally distributed
Independent Component Analysis
Separates features into groups by eliminating correlation and reports the weight of each component’s contribution to the separation
Unlimited sample size; data non-normally distributed
Random Forest
Separates features into groups based on commonality; identifies important predictors
Performs well on small sample size ;resistant to over-fitting
Partial Least Square
Separates features into groups based on maximal covariation; reports the contribution of each variable
Unlimited sample size; sensitive to outliers
Support Vector Machine
Uses a user-specified kernel function to quantify the similarity between any pair of instances and create a classifier
Performs well on small sample size and resistant to over-fitting
10
The major clustering methods
Methods What does it do? Sample size/data characteristics
K-meansclustering
Separates features into clusters of similar expression patterns
performs best with a limited dataset, i.e., ~20 to 300 features
Hierarchicalclustering
Clusters treatment groups, features, or samples into a dendrogram
Performs best with limited dataset, i.e., ~20 to 300 features or samples
See the publication by Karimpour-Fard et al., 2015 for more information on the strengths, weaknesses and mathematical mechanisms of each
classification and clustering methods
Karimpour-Fard et al., 2015
11
Differential statistics : concepts of p-values (1/2)
• This is an answer to 2 questions : Whether the putative discovery is
true or false, by relying on the null hypothesis ? What is the
probability that a given false discovery is included in the set of
selected discovery
The null hypothesis and the null distribution refer to the “standard” behavior”
that is the non-differential abundance
Putative discovery : any
protein quantified in the
experiment
True discovery : a protein that
is differentially abundant
between the biological
conditions
False discovery :
a protein that is not differentially
abundant between the biological
conditions
Selected discovery : a protein
that has passed some user-
defined statistical threshold
Adapted from Burger, http://prabig-prostar.univ-lyon1.fr/Articles/fdrtuto.pdf
12
• This is an answer to 2 questions : Whether the putative discovery is
true or false, by relying on the null hypothesis ? What is the
probability that a given false discovery is included in the set of
selected discovery
• To answer => statistical tests quantify the similarity between the
putative discovery and the null hypothesis
• a p-value of 0.05 means that there is 5% chance of getting the observed
result, if the null hypothesis (no difference among groups) were true.
if 100 statistical tests are performed and for all of them the null
hypothesis is actually true, it is expected that 5 of them will be
significant at the p < 0.05 level, by chance. In this case, five
statistically significant results are obtained, all of them being false
positives (type I error)
Differential statistics : concept of p-values (2/2)
A p-value is applied to one protein (individual test)
13
Differential statistics : available methods
t-tests :
Students (the first version),
Welch (numerous variations)
Limma
ANOVA, mixt models...
Likelihood-ratio tests
compare two models provided the simpler model is a special case of the more
complex model (i.e., “nested"). LRTs can be presented as a difference in the log-
likelihoods (recall that log(A/B) = logA – logB) and this is often handy as they can
be expressed in terms of deviance.
Non parametric tests
do not assume anything about the underlying distribution and have great statistical
power, which means they are likely to find a true significant effect : Mann-Whitney,
Kruskal-Wallis, Log-rank..tests
14
Some tools for statistical proteomics
P Packages R 'mixOmics‘ to analyse and integrate several types of
“omics data” http://mixomics.org/a-prop os/publications/
Several software packages for analyzing proteomic data include
statistical tools : MaxQuant (Perseus), Skyline, Progenesis
Packages R specifically developed for the analysis of proteomic data:
• MSstats (Choietal.,2014,Bioinformatics,30:2524-2526)
• DAPAR & ProStar (Wieczoreketal.,2017,Bioinformatics,33,13513)
• SafeQuant (Ahrné, unpublished, https://github.com/eahrne/SafeQuant/)
• GiaPronto (Weineretal.,2017,Mol.Cell.Prot.doi:10.1074/mcp.TIR117.000438)
• MCQ (PAPPSO platform, Paris, unpublished)
• Bazile et al., 2019
https://github.com/jane-bzl/Differential_abundance_and_correlation_to_one_factor)
15
DOI : 10.5281/zenodo.2539329
https://github.com/jane-bzl/Differential_abundance_and_correlation_to_one_factor
Used in the publication Pathways and biomarkers of marbling and carcass fat
deposition in bovine revealed by a combination of gel-based and gel-free
proteomic analyses. Bazile et al. 2019, meat science
Management of proteomic data to identify differentially
abundant proteins according to one discriminant factor
and to correlate their abundance with the value of this factor
Jeanne BAZILE, Ioana MOLNAR, Brigitte PICARD, Muriel BONNET
An available R script by Bazile and al.
16
group id adiposity Protein1 Protein2 Protein3 Protein4
group1 1 6.3 230795072 72392559 -1308293 54238
group1 6 3.7 267670438 49956496 -1099355 23984
group1 3 4.1 1736366 57744002 -1188752 3574699
group1 4 5.1 72392559 65141951 -1234186 368571
group1 5 6.8 203517615 64672505 -1551479 65781
group2 2 3.0 275028504 64089037 -1125215 5348673
group2 7 2.1 2043358 41658946 -1132811 687253
group2 8 2.2 64089037 50876533 -1049944 68357
group2 9 2.3 231769965 37805488 -1134805 573521
group2 10 2.2 305818430 39682550 -1086689 68967
Groups made accordingto the biological
conditions
Identification of the samples (number,
letters or both)
Quantitative values related to the
biological conditions
Values of proteins abundances
R script by J. Bazile : the input table
17
Shapiro test reports the normal (or not) distribution of an
abundance (TRUE=normal distribution)
p-values from Student-t-test for
normally or Kruskal-Wallis test for non-normally
distributed (TRUE= significant pval)
p-values from Student-t-test whatever the distribution of
the data (TRUE= significant pval)
pval.shapiro normalitypval.testTorKW
signif.testTorKW
pval.Ttest signif.Ttestpval.KW
signif.KW
Protein1 0.089 TRUE 0.801 FALSE 0.801 FALSE 0.465 FALSE
Protein2 0.403 TRUE 0.040 TRUE 0.040 TRUE 0.047 TRUE
Protein3 0.018 FALSE 0.047 TRUE 0.061 TRUE 0.047 TRUE
Protein4 0.000 FALSE 0.175 FALSE 0.675 FALSE 0.175 FALSE
R script by J. Bazile : the output table
• Results saved in your directory as an Excel file (.csv)• No correction for multiple tests for the selected discoveries because the size of
the set is below than 150 proteins in most of our proteomic experiments
p-values from Kruskal-Wallis -
test whatever the distribution of the
data (TRUE= significant pval)
18
Correlations computed either by the Pearson test (for normally distributed data) or
Kendall/Spearman test (for non normally distributed data)
adiposity Protein1 Protein2 Protein3 Protein4
adiposity1 0,06723532 0,82398759 -0,8845269
-0,15145677
Protein10,06723532 1 -0,0077549 -0,04223508
-0,04956437
Protein2 0,82398759 -0,0077549 1 -0,60280322 0,245925
Protein3 -0,8845269 -0,04223508 -0,60280322 1 0,16626722
Protein4 -0,15145677 -0,04956437 0,245925 0,16626722 1
R script by J. Bazile : correlation results
• A figure to review the proteins significantly correlated with one parameter• 2 tables produced and save to your directory :
one with the correlation values one with the p-value of the correlations
19
Descriptive and differentialstatistics
Data mining : when proteomics turns functional
Data analysis and data mining of lists of
protein IDs and abundances
20
Data mining : when proteomics turns functional
List of proteins
Gene Ontology analysis
=> GO
Prediction of protein secretion,
=> secretome
Protein-Protein
Interaction PPI,
=> interactomics
Interpretation of the data
FUNCTION
21
GO “annotations”
• An annotation is a statement linking a gene to some aspect of its function
=> GO term
• Each annotation is based on some evidence, recorded as part of an
annotation :
• Evidence code (type of evidence)
• Reference (published data, journal article)
22
Semantics and structure of GO annotations
The association of a gene product with a class of GO gives statements about the
•Molecular function, MF : molecular activities of a gene product
•Cellular Component CC: where the gene product is active
•Biological process BP : pathways and larger processes made up of the activities of
multiple gene products
GO annotations represent :• the normal, in vivo, biological role of gene products• Current knowledge that is Human understandable and machine computable
Ontology is a structured representation of biology, composed of
•classes
•relations
•definitions
For more information, http://geneontology.org/docs/go-annotations/
23
One of the main uses of the GO is to perform
enrichment analysis on gene/protein sets. For example,
given a set of genes/proteins that are up-regulated
under certain conditions, an enrichment analysis will
find which GO terms are over-represented (or
under-represented) using annotations for that gene set
GO enrichment analysis
The output of GO enrichment analysis is a GO term with a p-value most of the time corrected for multiple testing (Benjamini & Holsberg, Bonferroni…)
24
Tool Description
DAVIDIt provides a set of functional annotation tools to understand biological meaning, such as identification of enriched GO terms, clustering of redundant annotation terms and visualization of genes using BioCarta & KEGG pathway maps (https://david.ncifcrf.gov/).
EnrichrIt contains a large collection of gene set libraries available for analysis and download. It is a source of curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries (http://amp.pharm.mssm.edu/Enrichr/).
KOBAS
A web server for functional gene set enrichment that can accept either gene lists or gene expression data as input. It generates enriched gene sets with the corresponding name, the p-value or a probability of enrichment and an enrichment score based on results of multiple methods (http://kobas.cbi.pku.edu.cn/).
PANTHERIt classifies proteins according to family (evolutionarily related proteins) and subfamily (proteins with the same function), molecular function, biological process and pathway (http://www.pantherdb.org/).
PathVisioIt provides a basic set of features for pathway drawing, analysis and visualization(https://www.pathvisio.org/).
WebGestaltA functional enrichment analysis web tool. It supports several organisms gene identifiers and three methods for enrichment analysis, including over-representation analysis, Gene Set Enrichment Analysis (GSEA), and Network Topology-based Analysis (NTA).
AmigoInteractively search the Gene Ontology data for annotations, gene products, and terms using a powerful search syntax and filters (http://amigo.geneontology.org/amigo/search/annotation/)
Main GO “annotations” software tools
25
Database Description
ReactomeIt collects the so called “reactions”, where molecules (e.g., proteins and metabolites) form a network of biological interactions and are grouped into pathways (https://reactome.org/).
KEGGIt integrates molecular building blocks with the knowledge on molecular wiring diagrams of interaction, reaction and relation networks (https://www.genome.jp/kegg/).
GO ConsortiumIt is based on the Gene Ontology (GO) classification, which defines concepts used to describe gene function (http://geneontology.org/).
WIKIPATHWAYIt is an open, collaborative platform for capturing and disseminating models of biological pathways for data visualization and analysis (https://www.wikipathways.org/index.php/WikiPathways).
PathwayCommons
It provides a web-based interface that enables biologists to browse and search a collection of pathways from multiple sources (http://www.pathwaycommons.org/pc/).
Main GO “annotations” databases
26
Prediction of secreted protein
Adapted from Nickel, 2003. European
Journal of Biochemistry, 270 (10), 2109-2119
Prediction of
proteins
secreted thanks
to signal
peptide
Prediction of
proteins
secreted
thanks to
other
pathways1
2
3
4
27
Bioinformatics tools to predict secreted proteins (1/2)
Caccia et al., 2013, Biochim Biophys Acta. 2013 Nov;1834(11):2442-53
28
Bioinformatics tools to predict secreted proteins (2/2)
29
Protein-Protein Interaction concept
PPI are considered as physical contacts with molecular docking between proteins
that occur in a cell or in a living organism in vivo
The methods to determine PPI are based on two alternative approaches:
•binary :techniques that demonstrate the direct physical interaction in proteins
pairs, yeast two-hybrid (Y2H)
•co-complex :approaches that study the direct and indirect interactions among
groups of proteins, (tandem) affinity purification coupled to mass spectrometry
((T) AP-MS)
De Las Rivas et al., PLoS Comput Biol. 2010 Jun; 6(6): e1000807
Focus on specific pathways
Node
Edge
30
Database Description
PSICQUIC
Proteomics Standards Initiative Common QUery InterfaCe, is a Community effort to standardise the way to access and retrieve data from Molecular Interaction databases.http://www.ebi.ac.uk/Tools/webservices/psicquic/view/main.xhtml
IMEXIt contains data of major public interaction data providers such as IntAct, BioGRID, Uniprot (https://www.imexconsortium.org/) [68]
IntActIt is an open source PPI database and it provides also analysis tools. All interactions are derived from literature curation or direct user submissions (https://www.ebi.ac.uk/intact/) [64]
STRING
It collects and integrates data obtained by consolidating known and predicted protein-protein association data for a large number of organisms. The associations include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful (https://string-db.org/) [62]
Pathguide
A compendium of databases for molecular interaction/species.Contains information about 702 biological pathway related resources and molecular interaction related resources.http://www.pathguide.org/
Main PPI databases
31
Examples of complete and curated
databases
32
Protein-Protein Interaction analysis + GO
Adapted from Miryala et al., 2018,Gene
PPI search
Analysis of network topology
GO related to subnetwork
Data mining using
ProteINSIDE online tool
www.proteinside.org
34
Why ?
Generation of large amount of genomic and proteomic data
How can we extract the most relevant biological information ?
Transcriptomics and proteomics largely used to•identify biomarkers of ruminant traits•understand mechanisms behind traits
HOW ?
Main biological
information
Biological Function
Secreted proteins
Protein Interactions
HOW TO ANALYSE AND SUMMARISE DATA ?
An all in one system to automatically query and analyse the results from many tools and databases, and for a list of proteins / genes from three ruminant
species (bovine, ovine caprine) and well-annotated species (Human, mice, rat)
37
datbase
Web interface
Workflow
ProteINSIDE uses many resources that have been chosen according to the probability of their maintenance. ProteINSIDE is constantly updated.
Gathering of biological
information
STRUCTURE OF ProteINSIDE
GOannotation
Prediction of secreted
proteins
BlastP
PPI
38
Basic Analysis
- Paste identifiers- Automatic settings
Custom Analysis
- Paste identifiers- Select your own settings- Use preset buttons
1
2
4
Example of 133 proteins :- Glycolysis (33)- Respiratory chain (12)- Hormones (72)One protein in duplicate
3
To launch a new analysis
39
Pro
teIN
SID
EBiological
information
Ontologie de gène (GO)
Processus biologiques
Localisations cellulaires
Fonctions moléculaires
Protéines sécrétées
Interactions protéine-protéine
Example with 133 proteins :- Glycolysis- Respiratory chain- Hormones
ProteINSIDE automatically summarizes the available information in biological databases for a gene /
protein.
RESULTS of « ID resume » MODULE (1/2)
40
ProteINSIDE automatically summarizes the available information for a gene / protein in the NCBI and UniProtKB databases.
Identifiers
Function
Specificities
NCBI - Coordinators NR 2013 UniProtKB - Magrane et al. 2011
RESULTS of « ID resume » MODULE (2/2)
41
Pro
teIN
SID
EBiological
information
Gene Onotlogy (GO)
Biological Process
Cellular Component
Molecular Function
Protéines sécrétées
Interactions protéine-protéine
ProteINSIDE researches terms (called annotations) for proteins or genes under the "Gene Ontology (GO)"
Hierarchical dictionary of GO terms
Example with 133 proteins :- Glycolysis- Respiratory chain- Hormones
RESULTS of « GO » MODULE (1/2)
42
ProteINSIDE annotated genes / proteins according to the terms of the "Gene Ontology (GO)" and calculated the representativeness of a GO term in the dataset
Function Expected Identified (no IEA)
Identified (with IEA)
Glycolysis 33 15 32
Hormone activity 79 34 78
Validation with the sample dataset:
QuickGO - Binns et al. 2009GO.org - Ashburner et al. 2000
RESULTS of « GO » MODULE (2/2)
43
Pro
teIN
SID
EBiological
information
Gene Ontology (GO)
Biological Process
Cellular Component
Molecular Function
Secreted Proteins
Interactions protéine-protéineProteINSIDE predicts the secretion of a protein with two
complementary analyses: - Looking for a signal peptide on the amino acid sequence - Looking for GO terms related to secretion
Example of 133 proteins :- Glycolysis- Respiratory chain- Hormones
RESULTS of « SECRETED PROTEINS » MODULE (1/2)
44
ProteINSIDE predicts the secretion of a protein by using the algorithm SignalP and GO terms related to secretion
+
Expected Identified(only peptide prediction)
Identified(peptide + GO)
Identified(peptide + GO + IEA)
79 85 63 78
Validation with the sample dataset:
SignalP - Petersen et al. 2011
+
RESULTS of « SECRETED PROTEINS » MODULE (2/2)
45
Pro
teIN
SID
EBiological
Information
Gene Ontology (GO)
Biological Process
Cellular Component
Molecular Function
Secreted Proteins
Protein-Protein Interactions
Analysis of interaction networks (interactome) to identify the function and / or the central role of
proteins in biological mechanisms
Example of 133 proteins :- Glycolysis- Respiratory chain- Hormones
RESULTS of « PPI » MODULE (1/2)
46
ProteINSIDE creates protein networks to reveal biological
mechanisms
Between proteins within the dataset (36 interactions)
Between proteins within and external to the dataset
(616 interactions)
PsicQuic - Aranda et al. 2011 Cytoscape web - Lopes et al. 2010
RESULTS of « PPI » MODULE (2/2)
47
BioMyn
6 “synthetic” datasets of 1000 random proteins=> 1dataset /species
Database for Agricultural species (plants and animals, including bovine and ovine ; McCarthy et al., 2006)
Human database with GO and PPI analyses (Ramírez et al., 2012)
Multi-species database, intensively used by biologists (Huang
et al., 2009)
Bench test of ProteINSIDE relatively to others tools
Predictions of secreted
proteinsAlgorithms PrediSi (Hiller et al., 2004 ) and Phobius (Lukas et al., 2004)
48
BENCH TEST
Kaspirc et al., 2015, PlosOne
49
Wo
rkfl
ow
Biological Information
Gene Ontology (GO)
Secreted Proteins
Protein-Protein Interactions
ProteINSIDE offers great support to the analysis of the large amount of data generated by genomics and proteomics. ProteINSIDE is also a
unique tool that performs these analyses using ruminant IDs.
CONCLUSION
Target species:
- Bovine
- Sheep
- Goat
Model species:
- Human
- Mouse
- Rat
ProteINSIDE works using 6 species
ProteINSIDEDatabase
ProteINSIDEWeb interface
ProteINSIDE searches homologous proteins
Using Bovine IDs to search the homologous proteins in Humans
Q3Y5Z3 Q15848
Prediction of the Secretome of tissues
24 publications + 8 datasets (ruminants) proteins secreted : 1749 by bothmuscle & adipose tissue, 357 muscle-specific and 188 adipose-specific
Cross-talk between these tissues for their functioning and the construction of animal phenotype?
51
Descriptive and differentialstatistics
Data mining : when proteomics turns functional
Looking for biomarkercandidates
Data analysis and data mining of lists of
proteins ID and abundances
Proteomics to identify biomarkers
of marbling in bovineDiscovery
In groups of ruminantsdivergent
for a phenotype
QualificationIn ruminants with a range
of variability for a phenotype
ValidationNew
population
VerificationOn large cohorts
< 10 samples> 100 candidates
> 1000 samples< 10 candidates
2-DE or Shotgunproteomics
OR ?
Timeline of biomarker discovery
Trageted
proteomics
Immunological
tests
Step 1 Step 2 Step 3
How to select candidate biomarkers ?
Biomarker selection uses a variety of models to identify a subset of biomarkers that
best differentiate the control from test samples.
Recommended group sizes => from 10 samples and above. The models that are
generally used include:
• Logistic regression
• Linear discriminant analysis (LDA)
• Support vector machine (SVM)
• Random forest
• Other models that may be used
Biomarker Selection
https://www.tebu-bio.com/blog/2019/01/31/proteomics-biostats-get-the-most-out-of-your-array-data/
• MYDATAMODELS (12000 €/y)
https://www.mydatamodels.com/
• Free R package
ModVarSel (Ellies Oury et al., 2019) and one package in construction for predictive models (Molnar et al.)
Some « Biomarker Selection »
A new computational methodology that simultaneously selects the
best regression model and the most interesting covariates
Different regression models (including variable selection) :
1- parametric (multiple linear regression (MLR)),
𝑌 = 𝛽0 +
𝑗=1
𝑝
𝛽𝑗𝑋𝑗 + 𝜀
2- semiparametric (sliced inverse regression (SIR))
𝑌 = 𝑓(
𝑗=1
𝑝
𝛽𝑗𝑋𝑗) + 𝜀
and nonparametric (random forests (RF))
𝑌 = 𝑓 𝑋 + 𝜀
What is ModVarSel
Model choice using MSEtest
How to choose a model ?
How to select variables/proteins ?
• To install the current version of the R package modvarsel from github, use : https://github.com/chavent/modvarsel.
• DOI : https://zenodo.org/record/1445554#.W7YrYPY69PY
• More details are given here : https://chavent.github.io/modvarsel/modvarsel-intro.html
• Example of application in Ellies-Oury et al., 2019 Sci Reports
Supporting information
Descriptive and differentialstatistics
Data mining : when proteomics turns functional
Looking for biomarkercandidates
Take Home message
Data analysis and data mining of lists of
protein IDs and abundances
60
Before statistical analysis, try to answer to two simple questions: i) how many
features, and ii) how many samples.
Carefully consider the final goal of the proteomics analysis
• for classification purpose => multivariate supervised methods provide the best
results.
• to find some hidden patterns within features => unsupervised methods should be
chosen and batch effects must be carefully taken into account and avoided when
feasible.
Combine different pathways analysis tools to take advantages of their power of
analysis regarding your objectives, try ProteINSIDE and send me feedback
Take Home message
Challenges remain for the development of mathematical methods to select
biomarkers and predict a trait of interest from proteomics data.
www.proteinside.org
Thank you for your attention