Research Proposal

Bioinformatics Approach to the Evaluation of Transcription Factor Genes and Diseases (Cancer)

Brijesh Singh Yadav (Senior Research Associate, URC, Allahabad)

E-Mail: [email protected]

Problem Statement:

The purpose of the proposed research is the development of a computational approach to

quantitatively evaluate associations between transcription factor encoding genes and

human diseases, based on available literature evidence. The approach will analyze a set

of candidate genes and determine which genes are linked to human diseases, which

properties are involved in these gene-disease linkages, and which clusters of similar

genes are involved in particular diseases.

During the course of the research, I shall explore methods for recapitulating existing

associations and predicting novel associations based on diverse forms of data pertaining

to genes and diseases. These methods will evaluate the resulting associations in a

quantitative manner, and the resulting analyses will be validated to determine the efficacy

of the methods.

Background:

Identification of functional causes and contributing mechanisms of disease is a principal

aim of biomedical research. In many cases, the term “disease” broadly applies to a

heterogeneous set of observable properties, which may arise from multiple molecular

processes. Disease is often characterized by symptoms and a pattern of progression over

time. Cancer is a particularly broad category, encompassing a wide range of

complex, abnormal phenotypes. Compared to diseases associated with other organs,

many types of cancer, such as brain cancer, tend to be poorly understood: many are difficult to

characterize and have complex genetic components involving multiple genes.

Transcription factors are key regulators of gene expression, acting through processes such

as the recruitment of transcription initiation factors and conformational change of DNA,

working alone or as part of protein complexes.

GeneSeeker[1] can find genes within a chromosomal location that are localized in

particular tissues, by looking at human and mouse expression data. Another method of

associating disease genes to anatomical locations[2] performed text mining of PubMed

abstracts to associate eVOC anatomical ontology terms to gene names.

Machine learning approaches can be used when a representative set of disease genes is

available to use as training data. In DGP[3], a decision tree classification approach is used

to find features common to disease genes based on a training set composed of sample

disease and control proteins. Features were protein length, BLASTP ratios (conservation

score) between a protein and its highest scoring homologue within taxonomic groups

(representing phylogenetic conservation and extent) and the conservation score with the

closest paralogue. The study indicates that, on average, hereditary disease genes (genes

taken from OMIM) in comparison to randomly selected genes are longer, more

conserved, phylogenetically extended and without close paralogues.
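
The core step of a decision tree classifier, choosing the feature threshold that best separates disease proteins from controls, can be illustrated with a minimal sketch. This is a single split (a decision stump) scored by Gini impurity, not the DGP implementation. The Protein class, the toy values, and the conservation score in [0, 1] are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    name: str            # hypothetical identifier
    length: int          # protein length in amino acids
    conservation: float  # assumed conservation score in [0, 1]
    is_disease: bool     # True for disease proteins, False for controls


def gini(labels):
    """Gini impurity of a list of boolean class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(proteins, feature):
    """Return (threshold, weighted impurity) of the best single split on one feature."""
    values = sorted({getattr(p, feature) for p in proteins})
    best_threshold, best_impurity = None, float("inf")
    # Candidate thresholds are midpoints between consecutive observed values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [p.is_disease for p in proteins if getattr(p, feature) <= threshold]
        right = [p.is_disease for p in proteins if getattr(p, feature) > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(proteins)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Toy training set (invented values, not taken from OMIM or any database).
TRAINING = [
    Protein("D1", 900, 0.90, True),
    Protein("D2", 750, 0.80, True),
    Protein("D3", 820, 0.85, True),
    Protein("C1", 300, 0.40, False),
    Protein("C2", 410, 0.50, False),
    Protein("C3", 350, 0.30, False),
]

if __name__ == "__main__":
    for feature in ("length", "conservation"):
        print(feature, best_split(TRAINING, feature))
```

In this toy set the disease proteins are both longer and more conserved than the controls, so either feature separates the two classes with zero impurity. A full decision tree would apply such splits recursively to each resulting subset.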

PROSPECTR[4] uses a wider variety of features, including the length of the gene, the

length of its coding sequence, the length of its cDNA, length of the protein, GC content

and percentage protein identity with its nearest homologue in various species (mouse,

worm, fly). The investigators used an alternating decision tree, taking genes from OMIM

and comparing against genes not found in OMIM. They also generated two independent

test sets – one using genes from the Human Gene Mutation Database with randomly

selected control genes, and another set of 54 genes not in OMIM, again with a set of

randomly selected control genes.

POCUS[5] takes another machine learning approach, using a selected training set of genes

linked to the target disease. POCUS identifies common features between all the training

genes (InterPro domains, GO annotations, similar expression profiles) and assesses the

probability that such features would be shared by chance. This method depends on

a carefully selected training set of genes, and focuses on the likelihood of these genes all

sharing common, disease-related properties, in contrast to methods that focus on

overrepresentation of properties among the training genes.
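
One simple way to quantify the chance that a feature is shared by coincidence is a hypergeometric tail probability. The sketch below is an illustration under that assumption and is not the scoring scheme used by POCUS. The function name and the example counts are hypothetical.

```python
from math import comb


def prob_shared_by_chance(genome_size, genes_with_feature, sample_size, observed):
    """Probability that at least `observed` of `sample_size` randomly drawn genes
    carry a feature present in `genes_with_feature` of `genome_size` genes."""
    total = comb(genome_size, sample_size)
    upper = min(genes_with_feature, sample_size)
    tail = sum(
        comb(genes_with_feature, k)
        * comb(genome_size - genes_with_feature, sample_size - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Invented counts: 20,000 genes, 150 carrying a given domain,
    # 5 training genes, 3 of which carry that domain.
    print(prob_shared_by_chance(20_000, 150, 5, 3))
```

A small probability indicates that the shared feature is unlikely to be a coincidence of sampling, which is the kind of signal a method based on common features relies on.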

Proposed Method:

Most of the existing methods for the computational prediction of linkages between genes

and disease take as input a preliminary list of candidate genes (e.g. genes in a genomic

region linked in a genetic study to a disease), and return as output either a reduced or a

ranked list. The underlying approaches differ substantively between methods. Examples

of characteristics used in the methods include numerical features derived from the raw

sequence of genes and/or encoded proteins, existing annotations of proteins and genes,

and abstracts or articles directly referring to the gene. The current methods focus on using

properties from a representative set of genes to identify similar genes from the candidate

set.

We propose a method of extracting gene-disease associations that will emphasise

verifiable supporting evidence for the predicted associations, and a quantitative

evaluation of the strength of the association. We shall investigate both the associations

between genes and disease and the properties of those associations.

We shall consider three base entities – Genes, Diseases, Evidence – and the relationships

between these entities.

Goal of Research:

Our goal will be to predict Gene-Disease relationships based on the existence of

relationships between other entity pairings. After initial study of mammalian gene-

disease relationships, we will broaden the approach to incorporate entity relationships

involving orthologous genes in model organisms or related diseases. These paths of

supporting evidence will be quantitatively evaluated, making it possible to both extract

strongly supported gene-disease linkages and to rank these linkages.
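
The idea of scoring paths of supporting evidence can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: each piece of evidence is given a hypothetical weight in (0, 1], a path is scored by the product of its weights, and independent paths are combined with a noisy-OR rule. All entity names and references are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Evidence:
    source: str     # gene or orthologous gene the link starts from
    target: str     # gene or disease the link points to
    reference: str  # placeholder identifier for the supporting record
    weight: float   # assumed confidence in (0, 1]


def path_scores(evidence, gene, disease, max_hops=2):
    """Score every path of at most `max_hops` links from gene to disease."""
    outgoing = defaultdict(list)
    for item in evidence:
        outgoing[item.source].append(item)
    scores = []
    stack = [(gene, 1.0, 0)]
    while stack:
        node, score, hops = stack.pop()
        if hops == max_hops:
            continue  # path length limit reached
        for item in outgoing[node]:
            new_score = score * item.weight
            if item.target == disease:
                scores.append(new_score)
            else:
                stack.append((item.target, new_score, hops + 1))
    return scores


def association_strength(evidence, gene, disease):
    """Combine all path scores with noisy-OR: 1 - product of (1 - score)."""
    return 1.0 - prod(1.0 - s for s in path_scores(evidence, gene, disease))


def rank_genes(evidence, genes, disease):
    """Return (gene, strength) pairs, strongest association first."""
    scored = [(g, association_strength(evidence, g, disease)) for g in genes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy evidence (invented): two direct links and one link via an ortholog.
EVIDENCE = [
    Evidence("TF_A", "disease_X", "ref-1", 0.5),
    Evidence("TF_A", "disease_X", "ref-2", 0.5),
    Evidence("TF_B", "mouse_TF_B", "orthology", 0.8),
    Evidence("mouse_TF_B", "disease_X", "ref-3", 0.5),
]

if __name__ == "__main__":
    print(rank_genes(EVIDENCE, ["TF_A", "TF_B", "TF_C"], "disease_X"))
```

Because every score is traceable to the Evidence records on its path, each ranked linkage can be reported together with the references that support it. A gene with no path to the disease receives a strength of zero.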

Although the thesis itself will investigate properties of transcription factor genes in

cancer, the methods and analysis will be designed for general application. For

the initial analysis of the main gene-disease associations.

References:

1. Van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG, et al. (2005)

GeneSeeker: extraction and integration of human disease-related information

from web-based genetic databases. Nucleic Acids Research 33: 758.

2. Tiffin N, Kelso J, Powell A, Pan H, Bajic V, et al. (2005) Integration of text- and data-

mining using ontologies successfully selects disease gene candidates. Nucleic

Acids Research 33: 1544-1552.

3. López-Bigas N, Ouzounis C (2004) Genome-wide identification of genes likely to be

involved in human genetic disease. Nucleic Acids Research 32: 3108.

4. Adie E, Adams R, Evans K, Porteous D, Pickard B (2005) Speeding disease gene

discovery by sequence based candidate prioritization. BMC Bioinformatics 6: 55.

5. Turner F, Clutterbuck D, Semple C (2003) POCUS: mining genomic sequence

annotation to predict disease genes. Genome Biology 4: 75.
