Slides of my doctoral thesis dissertation talk, given on 20 March 2014 at Politecnico di Milano. Title: "Computational prediction of gene functions through machine learning methods and multiple validation procedures"
Computational Prediction of Gene Functions through Machine Learning methods and Multiple Validation Procedures
candidate: Davide Chicco [email protected]
supervisor: Marco Masseroli
PhD Thesis Defense Dissertation, 20 March 2014
1) Analyzed scientific problem
2) Machine learning methods used
3) Validation procedures
4) Main results
5) Annotation list correlation measures
6) Novelty indicator
7) Final list of likely predicted annotations
8) Conclusions
Biomolecular annotations
• The concept of annotation: association of nucleotide or amino
acid sequences with useful information describing their features
• The association of a gene and an information feature term
corresponds to a biomolecular annotation
• This information is expressed through controlled
vocabularies, sometimes structured as ontologies (e.g. Gene
Ontology), where every controlled term of the vocabulary is
associated with a unique alphanumeric code
[Diagram: a gene is linked to a biological function feature by an annotation (gene2bff)]
Biomolecular annotations
• The association of an information/feature with a gene ID
constitutes an annotation
• Annotation example:
• Scientific fact: “the gene GD4 is present in the
mitochondrial membrane”
• Corresponds to the coupling:
<GD4, mitochondrial membrane>
The problem
• Many available annotations in different databanks
• However, available annotations are incomplete
• Only a few of them represent highly reliable, human–curated
information
• In vitro experiments are expensive (e.g. 1,000 € and 3 weeks)
• To support and quicken the time–consuming curation process,
prioritized lists of computationally predicted annotations are
extremely useful
• These lists can be generated by software based on
machine learning algorithms
The problem
• Other scientists and researchers dealt with the problem in the
past by using:
• Support Vector Machines (SVM) [Barutcuoglu et al., 2006]
• k-nearest neighbor algorithm (kNN) [Tao et al., 2007]
• Decision trees [King et al., 2003]
• Hidden Markov models (HMM) [Mi et al. 2013]
• …
• These methods all performed well at stating whether a predicted
annotation was correct, but could not extrapolate, that is,
suggest new annotations absent from the input dataset
The software
[Pipeline diagram: A input matrix → Data reading → Statistical method → A~ output matrix → Predicted annotation lists]
BioAnnotationPredictor:
A pipeline of steps and tools to predict,
validate and analyze biomolecular
annotation lists
• The software reads the data from the db GPDW
• The software creates the input matrix:
Input annotation matrix A ∈ {0, 1}^(m × n)
m rows: genes
n columns: annotation features
A(i,j) = 1 if gene i is annotated to feature j or to
any descendant of j in the considered ontology
structure (true path rule)
A(i,j) = 0 otherwise (it is unknown)
         feat 1  feat 2  feat 3  feat 4  …  feat N
gene 1     0       0       0       0    …    0
gene 2     0       1       1       0    …    1
…          …       …       …       …    …    …
gene M     0       0       0       0    …    0
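The construction above can be sketched as follows. This is a toy re-implementation, not the thesis code: the gene names, term names, and the two-term ontology are invented, and the ancestor walk implements the true path rule described in the text.

```python
# Sketch: building the binary annotation matrix A, propagating each
# annotation to its ontology ancestors (true path rule).
# Genes, terms, and the toy ontology below are invented examples.
import numpy as np

def build_annotation_matrix(genes, terms, annotations, parents):
    """A[i, j] = 1 if gene i is annotated to term j or to any
    descendant of j in the ontology; 0 otherwise (unknown)."""
    term_index = {t: k for k, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)), dtype=int)
    for i, gene in enumerate(genes):
        for term in annotations.get(gene, []):
            # walk up the ontology, marking the term and all its ancestors
            stack = [term]
            while stack:
                t = stack.pop()
                A[i, term_index[t]] = 1
                stack.extend(parents.get(t, []))
    return A

# Toy example: "membrane" is the parent of "mitochondrial membrane"
genes = ["GD4", "GX1"]
terms = ["membrane", "mitochondrial membrane"]
parents = {"mitochondrial membrane": ["membrane"]}
annotations = {"GD4": ["mitochondrial membrane"]}
A = build_annotation_matrix(genes, terms, annotations, parents)
```

Annotating GD4 to "mitochondrial membrane" also sets its ancestor "membrane", as the true path rule requires.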
• The software applies a statistical method
(Truncated Singular Value Decomposition,
Semantically Improved SVD with gene
clustering, Semantically Improved SVD with
clustering and term-term similarity weights) to
a binary A input matrix
• Returns a real output A~ matrix
• After the computation, every element Aij of the input matrix A is
compared to its corresponding element A~ij of the output matrix A~
Input A (binary):
0    0    0    0    …    0
0    1    1    0    …    1
…    …    …    …    …    …
0    0    0    0    …    0

Output A~ (real-valued):
0.1  0.3  0.6  0.5  …  0.2
0.6  0.8  0.1  0.9  …  0.8
…    …    …    …    …  …
0.3  0.2  0.4  0.6  …  0.8
if Aij = 1 and A~ij > τ: AC (TP)
if Aij = 1 and A~ij ≤ τ: AR (FN)
if Aij = 0 and A~ij ≤ τ: NAC (TN)
if Aij = 0 and A~ij > τ: AP (FP)
AC: Annotation Confirmed; AR: Annotation to be Reviewed
NAC: No Annotation Confirmed; AP: Annotation Predicted
τ: the threshold that minimizes the sum APs + ARs
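The four-way thresholding above can be sketched directly. This is an illustrative re-implementation under the stated rules, with an invented 2×2 example and τ = 0.5.

```python
# Sketch of the thresholding step: each entry of the input matrix A and
# the reconstructed matrix A~ is mapped to one of the four classes.
import numpy as np

def classify(A, A_hat, tau):
    """Return a matrix of labels: AC (TP), AR (FN), NAC (TN), AP (FP)."""
    labels = np.empty(A.shape, dtype=object)
    labels[(A == 1) & (A_hat > tau)] = "AC"    # annotation confirmed
    labels[(A == 1) & (A_hat <= tau)] = "AR"   # annotation to be reviewed
    labels[(A == 0) & (A_hat <= tau)] = "NAC"  # no annotation confirmed
    labels[(A == 0) & (A_hat > tau)] = "AP"    # annotation predicted
    return labels

A = np.array([[1, 0], [0, 1]])
A_hat = np.array([[0.9, 0.8], [0.1, 0.2]])
labels = classify(A, A_hat, tau=0.5)
```

Here the entry at (0, 1) is absent in input but scores 0.8 > τ, so it becomes an AP: a predicted annotation.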
• The Annotations Predicted, AP (FP), are the annotations
absent in the input and predicted by our software: we
suggest them as present
• We record them in ranked lists:
Rank   Annotation ID   Likelihood value
1      218405          0.9742584
2      222571          0.8545574
…      …               …
n      203145          0.1673128
Truncated Singular Value Decomposition (tSVD)
• An annotation prediction is performed by computing a reduced
rank-k approximation A~ = Uk Σk Vk^T of the annotation matrix
A = U Σ V^T
(where 0 < k < r, with r the number of non-zero singular
values of A, i.e. the rank of A)
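The rank-k reconstruction can be sketched in a few lines of NumPy. This is a minimal illustration of the standard truncated SVD, not the thesis implementation; the 3×3 matrix is an invented example.

```python
# Minimal sketch of the truncated SVD reconstruction A~ = Uk Σk Vk^T
# (k is the truncation level; in the thesis it is chosen via ROC analysis).
import numpy as np

def truncated_svd(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep only the first k singular triplets
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
A_hat = truncated_svd(A, k=2)  # rank-2 approximation: entries become real-valued
```

With k equal to the rank of A, the reconstruction is exact; truncating below the rank produces the real-valued scores that are then thresholded.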
• Only the k most «important» singular components of A are
used for the reconstruction
(where 0 < k < r, with r the number of non-zero
singular values of A, i.e. the rank of A)
• In [P. Khatri et al., "A semantic analysis of the annotations of the
human genome", Bioinformatics, 2005], the authors argued
that the study of the matrix A shows the semantic
relationships of the gene-function associations.
• A large value of a~ij suggests that gene i should be
annotated to term j, whereas a value close to zero
suggests the opposite.
• We started from this method, developed by Khatri
et al. (2005) at Wayne State University, Detroit, and
implemented it
• Improvement:
• Khatri et al. used a fixed SVD truncation level
k = 500
• We developed a method for automated, data-driven
selection of k based on the Receiver
Operating Characteristic (ROC) curve
• We obtained better results, shown in several
publications
Truncated SVD with gene clustering (SIM1)
• Semantically improved (SIM1) version of the
Truncated SVD, based on gene clustering [P. Drineas et al.,
"Clustering large graphs via the singular value decomposition",
Machine Learning, 2004]
• Underlying idea: similar genes can be grouped into
clusters, which have different weights
1. We choose a number C of clusters, and completely
discard the columns of matrix U with index j = C+1, ..., n
(we have an algorithm for the choice of C).
2. Each column uc of the SVD matrix U represents a cluster,
and the value U(i,c) indicates the membership of
gene i in the c-th cluster.
3. For each cluster, we first generate Wc = diag(uc), and
then the modified gene-to-term matrix Ac = Wc A, in
which the i-th row of A is weighted by the
membership score of the corresponding gene in the
c-th cluster.
4. Then, we compute Tc = Ac^T Ac and its SVD.
5. Then, every element of the A~ matrix is computed
considering the c-th cluster that minimizes the
Euclidean-norm distance to the original vector:
a~i = ai V(k,c) V(k,c)^T
6. The output matrix is produced.
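The steps above can be sketched as a toy re-implementation. This is a hedged reading of the slides, not the thesis code: cluster memberships are taken from the first C left singular vectors of A, and each row is reconstructed with the projector of the cluster whose reconstruction stays closest to the original row. The 4×3 matrix is invented.

```python
# Illustrative SIM1 sketch (assumptions: memberships = columns of U;
# per-cluster projector from the first k right singular vectors of Tc).
import numpy as np

def sim1_reconstruct(A, C, k):
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    projectors = []
    for c in range(C):
        Wc = np.diag(U[:, c])        # membership weights of cluster c
        Ac = Wc @ A                  # weighted gene-to-term matrix
        Tc = Ac.T @ Ac
        _, _, Vt = np.linalg.svd(Tc)
        Vk = Vt[:k].T                # first k right singular vectors of Tc
        projectors.append(Vk @ Vk.T)
    A_hat = np.zeros(A.shape)
    for i, row in enumerate(A):
        # keep the cluster whose projection stays closest to the original row
        best = min((row @ P for P in projectors),
                   key=lambda v: np.linalg.norm(row - v))
        A_hat[i] = best
    return A_hat

A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.],
              [0., 1., 1.]])
A_hat = sim1_reconstruct(A, C=2, k=2)
```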
Truncated SVD with gene clustering and term-similarity weights (SIM2)
• Semantically improved (SIM2) version of the
Truncated SVD, based on gene clustering and term-
term similarity weights [P. Resnik, "Using information content to
evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
• Underlying idea: functionally similar terms should be
annotated to the same genes
In the algorithm shown before, we add the following step:
6a. Furthermore, to obtain more accurate clustering, we
compute the eigenvectors of the matrix G~ = A S A^T,
where the real n × n matrix S is the term-similarity matrix.
Starting from a pair of ontology terms, j1 and j2, the
term functional similarity S(j1, j2) can be calculated
using different methods; here, similarity is based on the
Resnik measure [P. Resnik, "Using information content to
evaluate semantic similarity in a taxonomy", arXiv.org, 1995]
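The Resnik measure used for S can be illustrated on a toy ontology. This is an illustrative sketch, not the thesis code: the similarity of two terms is the information content, IC(a) = -log p(a), of their most informative common ancestor, where p(a) is assumed here to be given (the fraction of annotations involving a or its descendants). The term names and probabilities are invented.

```python
# Illustrative Resnik similarity for the term-similarity matrix S.
# Toy ontology: root -> {b, c}, b -> {d, e}; probabilities are invented.
import math

def ancestors(term, parents):
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen  # includes the term itself

def resnik(j1, j2, parents, prob):
    common = ancestors(j1, parents) & ancestors(j2, parents)
    # IC of the most informative common ancestor
    return max((-math.log(prob[a]) for a in common), default=0.0)

parents = {"b": ["root"], "c": ["root"], "d": ["b"], "e": ["b"]}
prob = {"root": 1.0, "b": 0.5, "c": 0.5, "d": 0.25, "e": 0.25}
sim = resnik("d", "e", parents, prob)  # IC of "b", their best common ancestor
```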
Other methods
With some colleagues at Politecnico di Milano we also
implemented other methods (not included in this thesis):
• Probabilistic Latent Semantic Analysis (pLSA)
• Latent Dirichlet Allocation with Gibbs sampling (LDA)
And with some colleagues at the University of California,
Irvine, we have been trying to design and implement
other models:
• Auto-Encoder Deep Neural Network
• These four result classes can be considered analogous to
TP, FN, TN, FP:
AC: Annotation Confirmed (TP)
AR: Annotation to be Reviewed (FN)
NAC: No Annotation Confirmed (TN)
AP: Annotation Predicted (FP)
• The software depicts ROC curves:
AC rate = AC / (AC + AR)
AP rate = AP / (AP + NAC)
ROC Analysis Validation
• Ten-fold cross validation
• The software depicts the ROC curve:
AC rate = AC / (AC + AR)
AP rate = AP / (AP + NAC)
• Compute the Area Under the Curve (AUC)
• If AUC ≥ 2/3 (66.67%), the matrix reconstruction is
considered good; otherwise, bad
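The ROC construction above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the thesis code: the threshold τ is swept over the distinct scores in A~ (using ≥ at the threshold, a slightly different convention from the strict > in the class rules), and the AUC is computed with the trapezoidal rule. The 2×2 example is invented.

```python
# Sketch of the ROC validation: sweep τ over the scores in A~, plot
# AC rate = AC/(AC+AR) against AP rate = AP/(AP+NAC), integrate for AUC.
import numpy as np

def roc_auc(A, A_hat):
    taus = np.sort(np.unique(A_hat))[::-1]
    points = [(0.0, 0.0)]
    for tau in taus:
        AC = np.sum((A == 1) & (A_hat >= tau))   # confirmed
        AR = np.sum((A == 1) & (A_hat < tau))    # to be reviewed
        AP = np.sum((A == 0) & (A_hat >= tau))   # predicted
        NAC = np.sum((A == 0) & (A_hat < tau))   # not confirmed
        points.append((AP / (AP + NAC), AC / (AC + AR)))
    points.append((1.0, 1.0))
    points.sort()
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2         # trapezoidal rule
    return auc

A = np.array([[1, 0], [0, 1]])
A_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
auc = roc_auc(A, A_hat)   # perfectly separated scores give AUC = 1.0
```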
Database Validation
Since more recent database versions contain better
data and information:
• Compute the prediction of annotations on an older
database version (e.g. July 2009)
• Compare these predictions to a newer version
of that database (e.g. March 2013)
• The more Annotations Predicted found in the new
version, the better the predictions
• Measure the percentage of accuracy (July 2009 → March 2013)
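The database-version check can be sketched as a simple set comparison. This is an illustrative sketch, not the thesis code; the gene/term pairs below are invented examples.

```python
# Sketch of the database-version validation: the predicted annotations
# (APs) computed on the older release are checked against the
# annotation set of the newer release.
def database_validation(predicted, newer_annotations):
    """Fraction of predicted annotations confirmed by the newer release."""
    confirmed = [p for p in predicted if p in newer_annotations]
    return len(confirmed) / len(predicted) if predicted else 0.0

predicted_2009 = [("GD4", "mitochondrial membrane"), ("GX1", "nucleus")]
annotations_2013 = {("GD4", "mitochondrial membrane"), ("GD4", "membrane")}
rate = database_validation(predicted_2009, annotations_2013)  # 1 of 2 confirmed
```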
Two main issues:
• Retrieving the annotation IDs of the former database
version to be used in the updated database version
• Managing duplicate annotations (i.e. annotations
having different evidence codes)
Text Mining and Web Tool Validation
Since databanks may not be up to date, we manually
searched for the predicted annotations through:
• literature resources such as PubMed
• web tools such as AmiGO and GeneCards
Results
ROC Curves
[Figure: validation ROC curves for the Homo sapiens CC dataset. SVD-Khatri has k = 500; SVD-us, SIM1, and SIM2 have k = 378; SIM1 and SIM2 use C = 2, and SIM2 uses the Resnik measure.]
Results on the following annotation datasets:
• Homo sapiens genes and CC feature terms
• Homo sapiens genes and MF feature terms
• Homo sapiens genes and BP feature terms
• Homo sapiens genes and CC+MF+BP feature terms
The literature review allowed us to confirm some
additional predicted annotations
List Comparison Measures
Comparing methods and parameters
• When we have different lists of predicted annotations,
we want to know how similar or different they are
• Answering this question helps us understand how
the method parameters behave
Example: two predicted-annotation lists to compare (annotation IDs):
List 1: 10,000; 20,000; …; 90,000
List 2: 40,000; 10,000; …; 90,000
How similar are these lists?
• Spearman's rank correlation coefficient: based on the
total sum of the position differences of each element
between the two lists (e.g. 3rd position – 1st position = 2)
List 1: 10,000; 20,000; 30,000; …
List 2: 30,000; 10,000; 40,000; …
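The position-difference measure described above can be sketched as a Spearman-footrule-style distance; this is an illustrative reading of the slide, not the thesis code, and it only counts elements shared by both lists.

```python
# Sketch: sum over shared elements of the absolute difference between
# their positions in the two lists (Spearman footrule style).
def spearman_footrule(list_a, list_b):
    pos_b = {x: i for i, x in enumerate(list_b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(list_a) if x in pos_b)

# The example from the slide:
a = [10_000, 20_000, 30_000]
b = [30_000, 10_000, 40_000]
d = spearman_footrule(a, b)  # |0 - 1| for 10,000 plus |2 - 0| for 30,000 = 3
```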
How similar are these lists?
• Kendall tau distance:
the total number of bubble-sort swaps
needed to make one list equal to the other
List 1: 10,000; 20,000; …; 90,000
List 2: 20,000; 10,000; …; 90,000
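The bubble-sort definition above can be sketched directly. This toy implementation assumes both lists contain the same elements (possibly in a different order); it is an illustration, not the thesis code.

```python
# Sketch of the Kendall tau distance: the number of adjacent swaps
# (bubble-sort moves) needed to turn one list into the other.
def kendall_tau_distance(list_a, list_b):
    # assumes both lists hold the same elements, possibly reordered
    pos = {x: i for i, x in enumerate(list_a)}
    seq = [pos[x] for x in list_b]   # list_b rewritten as positions in list_a
    swaps = 0
    for i in range(len(seq)):        # plain bubble sort, counting swaps
        for j in range(len(seq) - 1 - i):
            if seq[j] > seq[j + 1]:
                seq[j], seq[j + 1] = seq[j + 1], seq[j]
                swaps += 1
    return swaps

d = kendall_tau_distance([10_000, 20_000, 90_000],
                         [20_000, 10_000, 90_000])  # one adjacent swap
```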
• Extended Kendall distance
• Extended Spearman coefficient
Example (annotation IDs):
List 1: AP list: 10,000; 20,000; 30,000; …  NAC list: 70,000; 80,000; 90,000; …
List 2: AP list: 30,000; 10,000; 40,000; …  NAC list: 70,000; 20,000; 90,000; …
• We assign a high penalty if an element is absent from
one of the lists, and a low penalty if an element is
absent from one of the AP lists but present in the
corresponding NAC list
Significant patterns:
• Extended Kendall distances show that the more similar
the SVD truncation levels are, the lower the Extended
Kendall distance is, and thus the more similar the
lists are.
• Lists generated by predictions that produced
similar AUCs have similar, low Extended Spearman
coefficients: lists from predictions with similar AUC
percentages differ in very few elements.
[Figure: DAG of the Molecular Function terms predicted for the Homo sapiens gene P2RY14. Black balls: terms already present in the database. Blue hexagons: predicted terms.]
Novelty Indicator
Schlicker rate based on the DAG: an indicator to express
the "novelty" rate of a prediction in a gene's annotation tree
• Statistical rate
• Visual DAG viewer
[Figure: DAG of the Molecular Function terms predicted for the Homo sapiens gene CCR2. Black balls: terms already present in the database. Blue hexagons: predicted terms.]
Final predictions
We finally get a list of the most likely predicted
annotations, with the following characteristics:
• predicted by all three methods (tSVD, SIM1, SIM2)
• prediction ranked in the first 50% of the list
• having at least one validated parent term
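The three criteria above can be sketched as a filter. This is an illustrative sketch, not the thesis code: the data structures (ranked lists per method, a set of validated terms, a parent map) and all names below are invented.

```python
# Sketch of the final filtering: keep predictions that appear in the top
# half of every method's ranked list and whose term has at least one
# validated parent. All identifiers here are invented examples.
def final_predictions(lists, validated, parents):
    """lists: method name -> ranked list of (gene, term) predictions."""
    keep = None
    for ranked in lists.values():
        top_half = set(ranked[: max(1, len(ranked) // 2)])
        keep = top_half if keep is None else keep & top_half
    # keep only predictions whose term has at least one validated parent
    return sorted(p for p in keep
                  if any(a in validated for a in parents.get(p[1], [])))

lists = {
    "tSVD": [("PPME1", "child term"), ("PPME1", "other term")],
    "SIM1": [("PPME1", "child term"), ("PPME1", "term x")],
    "SIM2": [("PPME1", "child term"), ("PPME1", "term y")],
}
parents = {"child term": ["parent term"]}
validated = {"parent term"}
result = final_predictions(lists, validated, parents)
```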
Gene symbol   Feature term
PPME1         Organelle organization [BP]
CHST14        Chondroitin sulfate proteoglycan biosynthetic process [BP]
CHST14        Biopolymer biosynthetic process [BP]
ROPN1B        Microtubule-based flagellum [CC]
CHST14        Dermatan sulfate proteoglycan biosynthetic process [BP]
CPA2          Proteolysis involved in cellular protein catabolic process [BP]
PPME1         Chromosome organization [BP]
CNOT2         Positive regulation of cellular metabolic process [BP]
Recap
• Truncated SVD with the automatically chosen truncation
level showed better results (percentage of predicted
annotations found in the updated database version)
than the previous method version with fixed parameters.
• The new methods (SIM1 and SIM2) outperformed
Truncated SVD.
• The ROC analysis, database version, and text mining and
web tool validation procedures proved very effective.
• The Extended Kendall and Spearman
coefficients revealed interesting patterns,
otherwise invisible.
• The novelty indicator rate proved very
useful in identifying the most interesting
prediction trees, highlighting relevant research paths.
Future
Future developments:
• Integrate the software as a web application into
the Search Computing platform
• Implement and test the Auto-Encoder Deep
Neural Network algorithm
• Develop an automated text mining validation
procedure
• Add statistical tools to analyze the ROC curves