Slides of my doctoral thesis dissertation talk, given on 20 March 2014 at Politecnico di Milano. Title: "Computational prediction of gene functions through machine learning methods and multiple validation procedures"
Computational Prediction of Gene Functions through Machine Learning methods and Multiple Validation Procedures
candidate: Davide Chicco [email protected]
supervisor: Marco Masseroli
PhD Thesis Defense Dissertation, 20 March 2014
1) Analyzed scientific problem
2) Machine learning methods used
3) Validation procedures
4) Main results
5) Annotation list correlation measures
6) Novelty indicator
7) Final list of likely predicted annotations
8) Conclusions
Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• The association of a gene and an information feature term
corresponds to a biomolecular annotation
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies (e.g. Gene
Ontology), where every controlled term of the vocabulary is
associated with a unique alphanumeric code
[Diagram: a gene is linked to a biological function feature by an annotation (gene2bff)]
Biomolecular annotations
• The association of an information/feature with a gene ID
constitutes an annotation
• Annotation example:
• Scientific fact: “the gene GD4 is present in the
mitochondrial membrane”
• Corresponds to the coupling:
<GD4, mitochondrial membrane>
The problem
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human–curated
information
• In vitro experiments are expensive (e.g. 1,000 € and 3 weeks)
• To support and quicken the time–consuming curation process,
prioritized lists of computationally predicted annotations are
extremely useful
• These lists can be generated by software based on
machine learning algorithms
The problem
• Other scientists and researchers dealt with the problem in the
past by using:
• Support Vector Machines (SVM) [Barutcuoglu et al., 2006]
• k-nearest neighbor algorithm (kNN) [Tao et al., 2007]
• Decision trees [King et al., 2003]
• Hidden Markov models (HMM) [Mi et al. 2013]
• …
• These methods all performed well at stating whether a predicted
annotation was correct, but could not extrapolate, that is,
suggest new annotations absent from the input dataset
The software
[Pipeline diagram: A input matrix → Data reading → Statistical method → A~ output matrix → Predicted annotation lists]
BioAnnotationPredictor:
A pipeline of steps and tools to predict,
validate and analyze biomolecular
annotation lists
• The software reads the data from the db GPDW
• The software creates the input matrix:
Input annotation matrix A ∈ {0, 1}^(m × n)
m rows: genes
n columns: annotation features
A(i,j) = 1 if gene i is annotated to feature j or to
any descendant of j in the considered ontology
structure (true path rule)
A(i,j) = 0 otherwise (it is unknown)
         feat 1  feat 2  feat 3  feat 4  …  feat N
gene 1     0       0       0       0    …    0
gene 2     0       1       1       0    …    1
…          …       …       …       …    …    …
gene M     0       0       0       0    …    0
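The construction above can be sketched as follows. This is a toy re-implementation, not the thesis code: the gene names, term names, and the two-term ontology are invented, and the ancestor walk implements the true path rule described in the text.

```python
# Sketch: building the binary annotation matrix A, propagating each
# annotation to its ontology ancestors (true path rule).
# Genes, terms, and the toy ontology below are invented examples.
import numpy as np

def build_annotation_matrix(genes, terms, annotations, parents):
    """A[i, j] = 1 if gene i is annotated to term j or to any
    descendant of j in the ontology; 0 otherwise (unknown)."""
    term_index = {t: k for k, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=int)
    for i, gene in enumerate(genes):
        for term in annotations.get(gene, []):
            # walk up the ontology, marking the term and all its ancestors
            stack = [term]
            while stack:
                t = stack.pop()
                A[i, term_index[t]] = 1
                stack.extend(parents.get(t, []))
    return A

# Toy example: "membrane" is the parent of "mitochondrial membrane"
genes = ["GD4", "GX1"]
terms = ["membrane", "mitochondrial membrane"]
parents = {"mitochondrial membrane": ["membrane"]}
annotations = {"GD4": ["mitochondrial membrane"]}
A = build_annotation_matrix(genes, terms, annotations, parents)
```

Annotating GD4 to "mitochondrial membrane" also sets its ancestor "membrane", as the true path rule requires.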
• The software applies a statistical method
(Truncated Singular Value Decomposition,
Semantically Improved SVD with gene
clustering, Semantically Improved SVD with
clustering and term-term similarity weights) to
a binary A input matrix
• Returns a real output A~ matrix
• After the computation, every element Aij of the input matrix A is
compared to its corresponding element A~ij of the output matrix A~
Input A (binary):
0    0    0    0    …    0
0    1    1    0    …    1
…    …    …    …    …    …
0    0    0    0    …    0

Output A~ (real-valued):
0.1  0.3  0.6  0.5  …  0.2
0.6  0.8  0.1  0.9  …  0.8
…    …    …    …    …  …
0.3  0.2  0.4  0.6  …  0.8
if Aij = 1 and A~ij > τ: AC (TP)
if Aij = 1 and A~ij ≤ τ: AR (FN)
if Aij = 0 and A~ij ≤ τ: NAC (TN)
if Aij = 0 and A~ij > τ: AP (FP)
AC: Annotation Confirmed; AR: Annotation to be Reviewed
NAC: No Annotation Confirmed; AP: Annotation Predicted
τ: the threshold that minimizes the sum APs + ARs
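The four-way thresholding above can be sketched directly. This is an illustrative re-implementation under the stated rules, with an invented 2×2 example and τ = 0.5.

```python
# Sketch of the thresholding step: each entry of the input matrix A and
# the reconstructed matrix A~ is mapped to one of the four classes.
import numpy as np

def classify(A, A_hat, tau):
    """Return a matrix of labels: AC (TP), AR (FN), NAC (TN), AP (FP)."""
    labels = np.empty(A.shape, dtype=object)
    labels[(A == 1) & (A_hat > tau)] = "AC"    # annotation confirmed
    labels[(A == 1) & (A_hat <= tau)] = "AR"   # annotation to be reviewed
    labels[(A == 0) & (A_hat <= tau)] = "NAC"  # no annotation confirmed
    labels[(A == 0) & (A_hat > tau)] = "AP"    # annotation predicted
    return labels

A = np.array([[1, 0], [0, 1]])
A_hat = np.array([[0.9, 0.8], [0.1, 0.2]])
labels = classify(A, A_hat, tau=0.5)
```

Here the entry at (0, 1) is absent in input but scores 0.8 > τ, so it becomes an AP: a predicted annotation.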
• The Annotations Predicted, AP (FP), are the annotations
absent in the input and predicted by our software: we
suggest them as present
• We record them in ranked lists:
Rank   Annotation ID   Likelihood value
1      218405          0.9742584
2      222571          0.8545574
…      …               …
n      203145          0.1673128
Truncated Singular Value Decomposition (tSVD)
• An annotation prediction is performed by computing a reduced
rank-k approximation A~ = Uk Σk Vk^T of the annotation matrix
A = U Σ V^T
(where 0 < k < r, with r the number of non-zero singular
values of A, i.e. the rank of A)
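The rank-k reconstruction can be sketched in a few lines of NumPy. This is a minimal illustration of the standard truncated SVD, not the thesis implementation; the 3×3 matrix is an invented example.

```python
# Minimal sketch of the truncated SVD reconstruction A~ = Uk Σk Vk^T
# (k is the truncation level; in the thesis it is chosen via ROC analysis).
import numpy as np

def truncated_svd(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep only the first k singular triplets
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
A_hat = truncated_svd(A, k=2)  # rank-2 approximation: entries become real-valued
```

With k equal to the rank of A, the reconstruction is exact; truncating below the rank produces the real-valued scores that are then thresholded.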
• Only the k most «important» singular components of A are
used for the reconstruction
(where 0 < k < r, with r the number of non-zero
singular values of A, i.e. the rank of A)
• In [P. Khatri et al., "A semantic analysis of the annotations of the
human genome", Bioinformatics, 2005], the authors argued
that the study of the matrix A shows the semantic
relationships of the gene-function associations.
• A large value of a~ij suggests that gene i should be
annotated to term j, whereas a value close to zero
suggests the opposite.
• We started from this method, developed by Khatri
et al. (2005) at Wayne State University, Detroit, and
implemented it
• Improvement:
• Khatri et al. used a fixed SVD truncation level
k = 500
• We developed a method for automated, data-driven
selection of k based on the Receiver
Operating Characteristic (ROC) curve
• We obtained better results, shown in several
publications
Truncated SVD with gene clustering (SIM1)
• Semantically improved (SIM1) version of the
Truncated SVD, based on gene clustering [P. Drineas et al.,
"Clustering large graphs via the singular value decomposition",
Machine Learning, 2004]
• Underlying idea: similar genes can be grouped into
clusters, which have different weights
1. We choose a number C of clusters, and completely
discard the columns of matrix U with index j = C+1, ..., n
(we have an algorithm for the choice of C).
2. Each column uc of the SVD matrix U represents a cluster,
and the value U(i,c) indicates the membership of
gene i in the c-th cluster.
3. For each cluster, we first generate Wc = diag(uc), and
then the modified gene-to-term matrix Ac = Wc A, in
which the i-th row of A is weighted by the
membership score of the corresponding gene in the
c-th cluster.
4. Then, we compute Tc = Ac^T Ac and its SVD.
5. Then, every element of the A~ matrix is computed
considering the c-th cluster that minimizes the
Euclidean-norm distance to the original vector:
a~i = ai V(k,c) V(k,c)^T
6. The output matrix is produced.
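The steps above can be sketched as a toy re-implementation. This is a hedged reading of the slides, not the thesis code: cluster memberships are taken from the first C left singular vectors of A, and each row is reconstructed with the projector of the cluster whose reconstruction stays closest to the original row. The 4×3 matrix is invented.

```python
# Illustrative SIM1 sketch (assumptions: memberships = columns of U;
# per-cluster projector from the first k right singular vectors of Tc).
import numpy as np

def sim1_reconstruct(A, C, k):
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    projectors = []
    for c in range(C):
        Wc = np.diag(U[:, c])        # membership weights of cluster c
        Ac = Wc @ A                  # weighted gene-to-term matrix
        Tc = Ac.T @ Ac
        _, _, Vt = np.linalg.svd(Tc)
        Vk = Vt[:k].T                # first k right singular vectors of Tc
        projectors.append(Vk @ Vk.T)
    A_hat = np.zeros(A.shape)
    for i, row in enumerate(A):
        # keep the cluster whose projection stays closest to the original row
        best = min((row @ P for P in projectors),
                   key=lambda v: np.linalg.norm(row - v))
        A_hat[i] = best
    return A_hat

A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.],
              [0., 1., 1.]])
A_hat = sim1_reconstruct(A, C=2, k=2)
```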
Truncated SVD with gene clustering and term-similarity weights (SIM2)
• Semantically improved (SIM2) version of the
Truncated SVD, based on gene clustering and term-
term similarity weights [P. Resnik, "Using information content to
evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
• Underlying idea: functionally similar terms should be
annotated to the same genes
In the algorithm shown before, we add the following step:
6a. Furthermore, to obtain more accurate clustering, we
compute the eigenvectors of the matrix G~ = A S A^T,
where the real n × n matrix S is the term-similarity matrix.
Starting from a pair of ontology terms, j1 and j2, the
term functional similarity S(j1, j2) can be calculated
using different methods; here, similarity is based on the
Resnik measure [P. Resnik, "Using information content to
evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
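The Resnik measure used for S can be illustrated on a toy ontology. This is an illustrative sketch, not the thesis code: the similarity of two terms is the information content, IC(a) = -log p(a), of their most informative common ancestor, where p(a) is assumed here to be given (the fraction of annotations involving a or its descendants). The term names and probabilities are invented.

```python
# Illustrative Resnik similarity for the term-similarity matrix S.
# Toy ontology: root -> {b, c}, b -> {d, e}; probabilities are invented.
import math

def ancestors(term, parents):
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen  # includes the term itself

def resnik(j1, j2, parents, prob):
    common = ancestors(j1, parents) & ancestors(j2, parents)
    # IC of the most informative common ancestor
    return max((-math.log(prob[a]) for a in common), default=0.0)

parents = {"b": ["root"], "c": ["root"], "d": ["b"], "e": ["b"]}
prob = {"root": 1.0, "b": 0.5, "c": 0.5, "d": 0.25, "e": 0.25}
sim = resnik("d", "e", parents, prob)  # IC of "b", their best common ancestor
```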
Other methods
With some colleagues at Politecnico di Milano we also
implemented other methods (not included in this thesis):
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation with Gibbs sampling (LDA)
And with some colleagues at the University of California,
Irvine, we have been trying to design and implement
other models:
• Auto-Encoder Deep Neural Network
• These four result classes can be considered analogous to
TP, FN, TN, FP:
AC: Annotation Confirmed (TP)
AR: Annotation to be Reviewed (FN)
NAC: No Annotation Confirmed (TN)
AP: Annotation Predicted (FP)
• The software depicts ROC curves:
AC rate = AC / (AC + AR)
AP rate = AP / (AP + NAC)
ROC Analysis Validation
• Ten-fold cross validation
• The software depicts the ROC curve:
AC rate = AC / (AC + AR)
AP rate = AP / (AP + NAC)
• Compute the Area Under the Curve (AUC)
• If AUC ≥ 2/3 (66.67%), the matrix reconstruction is
considered good; otherwise, bad
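The ROC construction above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the thesis code: the threshold τ is swept over the distinct scores in A~ (using ≥ at the threshold, a slightly different convention from the strict > in the class rules), and the AUC is computed with the trapezoidal rule. The 2×2 example is invented.

```python
# Sketch of the ROC validation: sweep τ over the scores in A~, plot
# AC rate = AC/(AC+AR) against AP rate = AP/(AP+NAC), integrate for AUC.
import numpy as np

def roc_auc(A, A_hat):
    taus = np.sort(np.unique(A_hat))[::-1]
    points = [(0.0, 0.0)]
    for tau in taus:
        AC = np.sum((A == 1) & (A_hat >= tau))   # confirmed
        AR = np.sum((A == 1) & (A_hat < tau))    # to be reviewed
        AP = np.sum((A == 0) & (A_hat >= tau))   # predicted
        NAC = np.sum((A == 0) & (A_hat < tau))   # not confirmed
        points.append((AP / (AP + NAC), AC / (AC + AR)))
    points.append((1.0, 1.0))
    points.sort()
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2         # trapezoidal rule
    return auc

A = np.array([[1, 0], [0, 1]])
A_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
auc = roc_auc(A, A_hat)   # perfectly separated scores give AUC = 1.0
```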
Database Validation
Since more recent database versions contain better
data and information:
• Compute the prediction of annotations on an older
database version (e.g. July 2009)
• Compare these predictions to a newer version
of that database (e.g. March 2013)
• The more Annotations Predicted found in the new
version, the better the predictions
• Measure the percentage of accuracy (July 2009 → March 2013)
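The database-version check can be sketched as a simple set comparison. This is an illustrative sketch, not the thesis code; the gene/term pairs below are invented examples.

```python
# Sketch of the database-version validation: the predicted annotations
# (APs) computed on the older release are checked against the
# annotation set of the newer release.
def database_validation(predicted, newer_annotations):
    """Fraction of predicted annotations confirmed by the newer release."""
    confirmed = [p for p in predicted if p in newer_annotations]
    return len(confirmed) / len(predicted) if predicted else 0.0

predicted_2009 = [("GD4", "mitochondrial membrane"), ("GX1", "nucleus")]
annotations_2013 = {("GD4", "mitochondrial membrane"), ("GD4", "membrane")}
rate = database_validation(predicted_2009, annotations_2013)  # 1 of 2 confirmed
```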
Two main issues:
• Retrieving the annotation IDs of the former database
version to be used in the updated database version
• Managing duplicate annotations (i.e. annotations
having different evidence codes)
Text Mining and Web Tool Validation
Since databanks may not be up to date, we manually
searched for the predicted annotations through:
• literature resources such as PubMed
• web tools such as AmiGO and GeneCards
Results
ROC Curves
[Figure: validation ROC curves for the Homo sapiens CC dataset. SVD-Khatri has k = 500; SVD-us, SIM1, and SIM2 have k = 378; SIM1 and SIM2 use C = 2, and SIM2 uses the Resnik measure.]
Results on the following annotation datasets:
• Homo sapiens genes and CC feature terms
• Homo sapiens genes and MF feature terms
• Homo sapiens genes and BP feature terms
• Homo sapiens genes and CC+MF+BP feature terms
The literature review allowed us to confirm some
additional predicted annotations
List Comparison Measures
Comparing methods and parameters
• When we have different lists of predicted annotations,
we want to know how similar or different they are
• Answering this question helps us understand how
the method parameters behave
Example: two predicted-annotation lists to compare (annotation IDs):
List 1: 10,000; 20,000; …; 90,000
List 2: 40,000; 10,000; …; 90,000
How similar are these lists?
• Spearman's rank correlation coefficient: based on the
total sum of the position differences of each element
between the two lists (e.g. 3rd position – 1st position = 2)
List 1: 10,000; 20,000; 30,000; …
List 2: 30,000; 10,000; 40,000; …
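The position-difference measure described above can be sketched as a Spearman-footrule-style distance; this is an illustrative reading of the slide, not the thesis code, and it only counts elements shared by both lists.

```python
# Sketch: sum over shared elements of the absolute difference between
# their positions in the two lists (Spearman footrule style).
def spearman_footrule(list_a, list_b):
    pos_b = {x: i for i, x in enumerate(list_b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(list_a) if x in pos_b)

# The example from the slide:
a = [10_000, 20_000, 30_000]
b = [30_000, 10_000, 40_000]
d = spearman_footrule(a, b)  # |0 - 1| for 10,000 plus |2 - 0| for 30,000 = 3
```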
How similar are these lists?
• Kendall tau distance:
the total number of bubble-sort swaps
needed to make one list equal to the other
List 1: 10,000; 20,000; …; 90,000
List 2: 20,000; 10,000; …; 90,000
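The bubble-sort definition above can be sketched directly. This toy implementation assumes both lists contain the same elements (possibly in a different order); it is an illustration, not the thesis code.

```python
# Sketch of the Kendall tau distance: the number of adjacent swaps
# (bubble-sort moves) needed to turn one list into the other.
def kendall_tau_distance(list_a, list_b):
    # assumes both lists hold the same elements, possibly reordered
    pos = {x: i for i, x in enumerate(list_a)}
    seq = [pos[x] for x in list_b]   # list_b rewritten as positions in list_a
    swaps = 0
    for i in range(len(seq)):        # plain bubble sort, counting swaps
        for j in range(len(seq) - 1 - i):
            if seq[j] > seq[j + 1]:
                seq[j], seq[j + 1] = seq[j + 1], seq[j]
                swaps += 1
    return swaps

d = kendall_tau_distance([10_000, 20_000, 90_000],
                         [20_000, 10_000, 90_000])  # one adjacent swap
```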
• Extended Kendall distance
• Extended Spearman coefficient
Example (annotation IDs):
List 1: AP list: 10,000; 20,000; 30,000; …  NAC list: 70,000; 80,000; 90,000; …
List 2: AP list: 30,000; 10,000; 40,000; …  NAC list: 70,000; 20,000; 90,000; …
• We assign a high penalty if an element is absent from
one of the lists, and a low penalty if an element is
absent from one of the AP lists but present in the
corresponding NAC list
Significant patterns:
• Extended Kendall distances show that the more similar
the SVD truncation levels are, the lower the Extended
Kendall distance is, and thus the more similar the
lists are.
• Lists generated by predictions that produced
similar AUCs have similar, low Extended Spearman
coefficients: lists from predictions with similar AUC
percentages differ in very few elements.
[Figure: DAG of the Molecular Function terms predicted for the Homo sapiens gene P2RY14. Black balls: terms already present in the database. Blue hexagons: predicted terms.]
Novelty Indicator
Schlicker rate based on the DAG: an indicator to express
the "novelty" rate of a prediction in a gene's annotation tree
• Statistical rate
• Visual DAG viewer
[Figure: DAG of the Molecular Function terms predicted for the Homo sapiens gene CCR2. Black balls: terms already present in the database. Blue hexagons: predicted terms.]
Final predictions
We finally get a list of the most likely predicted
annotations, with the following characteristics:
• predicted by all three methods (tSVD, SIM1, SIM2)
• prediction ranked in the first 50% of the list
• having at least one validated parent term
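The three criteria above can be sketched as a filter. This is an illustrative sketch, not the thesis code: the data structures (ranked lists per method, a set of validated terms, a parent map) and all names below are invented.

```python
# Sketch of the final filtering: keep predictions that appear in the top
# half of every method's ranked list and whose term has at least one
# validated parent. All identifiers here are invented examples.
def final_predictions(lists, validated, parents):
    """lists: method name -> ranked list of (gene, term) predictions."""
    keep = None
    for ranked in lists.values():
        top_half = set(ranked[: max(1, len(ranked) // 2)])
        keep = top_half if keep is None else keep & top_half
    # keep only predictions whose term has at least one validated parent
    return sorted(p for p in keep
                  if any(a in validated for a in parents.get(p[1], [])))

lists = {
    "tSVD": [("PPME1", "child term"), ("PPME1", "other term")],
    "SIM1": [("PPME1", "child term"), ("PPME1", "term x")],
    "SIM2": [("PPME1", "child term"), ("PPME1", "term y")],
}
parents = {"child term": ["parent term"]}
validated = {"parent term"}
result = final_predictions(lists, validated, parents)
```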
Gene symbol   Feature term
PPME1         Organelle organization [BP]
CHST14        Chondroitin sulfate proteoglycan biosynthetic process [BP]
CHST14        Biopolymer biosynthetic process [BP]
ROPN1B        Microtubule-based flagellum [CC]
CHST14        Dermatan sulfate proteoglycan biosynthetic process [BP]
CPA2          Proteolysis involved in cellular protein catabolic process [BP]
PPME1         Chromosome organization [BP]
CNOT2         Positive regulation of cellular metabolic process [BP]
Recap
• Truncated SVD with the automatically chosen truncation
level showed better results (percentage of predicted
annotations found in the updated database version)
than the previous method version with fixed parameters.
• The new methods (SIM1 and SIM2) outperformed
Truncated SVD.
• The ROC analysis, database version, and text mining and
web tool validation procedures proved very effective.
• The Extended Kendall and Spearman
coefficients revealed interesting patterns,
otherwise invisible.
• The novelty indicator rate proved very
useful in identifying the most interesting
prediction trees, highlighting relevant research paths.
Future
Future developments:
• Integrate the software as a web application into
the Search Computing platform
• Implement and test the Auto-Encoder Deep
Neural Network algorithm
• Develop an automated text mining validation
procedure
• Add statistical tools to analyze the ROC curves