Disease Gene Candidate Prioritization by Integrative Biology

Table of contents:

Where are we in the course pipeline?

Maturation of the project and the project description

Background

Networks – deducing functional relationships from PPI data networks

Protein interaction networks
Functional modules / network clusters

Phenotype association

Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.

Method and examples

Integrating protein interaction data and phenotype associations in an automated large scale disease gene finding platform

The DNA Array Analysis Pipeline

[Figure: pipeline stages shown: Question / Experimental Design; Array design / Probe design; Buy Chip/Array; Sample Preparation / Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis / Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network)]

Project description

Søren Brunak, Professor, center director, Dr. Phil, PhD

Niels Tommerup Professor, Centre Director, Dr. Med.

Master's thesis, Sept. 2003

Project description

Project:

Find disease genes

Using Bioinformatics

Disease gene candidate prioritization by integrating protein interaction and phenotype association data.

Project description

Abstract

The availability of the first draft of the human genome in 2001 (Venter, Adams et al. 2001) led to an increase in the number of methods for disease gene identification. However, the number of candidates in most loci linked to a particular phenotype is in the hundreds (McCarthy, Smedley et al. 2003; van Driel, Cuelenaere et al. 2003), and the underlying genes in over 900 of the ~2550 loci associated with a phenotype in the “Online Mendelian Inheritance in Man” (OMIM) database have not yet been identified (Hamosh, Scott et al. 2005). Disease gene identification evidently remains a strenuous challenge, since mutational analysis of hundreds of candidates in a critical interval with currently available methods is extremely resource demanding. Prioritising the candidates on different criteria, followed by extensive investigation of the promising ones, is therefore a logical step in the disease gene finding process. With the advent of proteomics, we can now retrieve information on gene function on a large scale, thus bridging the gap between genotype and phenotype, a possibility of significant interest for disease gene candidate prioritization.

We propose that automated correlation of phenotype association networks with interolog data (the transfer of protein interactions between orthologous protein pairs in different organisms) is a powerful way of identifying good disease gene candidates in a large list of genes in loci associated with a phenotype. Our method automatically identifies potential functional modules consisting of protein components, where at least one of the components is a disease-related protein. When such incriminated modules are identified, the remaining protein components of the module are correlated with loci in the genome associated with a similar phenotype. A hit is reported if other protein components of the incriminated module are the products of genes in loci associated with an identical or overlapping phenotype. Using this large-scale approach, we show that a gene in a locus is a heavily incriminated candidate if its protein product interacts with a protein involved in a similar or identical phenotype, and we publish a list of 60 likely candidates in various disorders.
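The prioritization idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `prioritize`, the cutoff `min_sim`, the gene and protein labels, and the use of direct interaction partners in place of full functional modules are all assumptions made for illustration. Only the two similarity values are taken from the phenotype similarity slide later in this deck.

```python
from collections import defaultdict

def prioritize(candidates, interactions, protein_phenotypes,
               locus_phenotype, similarity, min_sim=0.6):
    """Rank candidate genes in a locus by the best phenotype similarity
    found among the disease proteins they interact with.
    min_sim is a hypothetical cutoff, not a value from the slides."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)

    scores = {}
    for gene in candidates:
        best = 0.0
        for partner in partners[gene]:
            for phenotype in protein_phenotypes.get(partner, ()):
                best = max(best, similarity(locus_phenotype, phenotype))
        if best >= min_sim:
            scores[gene] = best
    # Highest-scoring (most incriminated) candidates first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data: gene and protein names are hypothetical; the two scores are
# the ones shown for MIM 214100 on the phenotype similarity slide.
SIM = {("214100", "202370"): 0.781, ("214100", "266510"): 0.609}

def similarity(p, q):
    if p == q:
        return 1.0
    return SIM.get((p, q), SIM.get((q, p), 0.0))

interactions = [("GENE_A", "DIS_1"), ("GENE_B", "DIS_2"), ("GENE_C", "GENE_D")]
protein_phenotypes = {"DIS_1": ["202370"], "DIS_2": ["266510"]}

hits = prioritize(["GENE_A", "GENE_B", "GENE_C"], interactions,
                  protein_phenotypes, "214100", similarity)
print(hits)
```

In this toy run GENE_C is dropped because its only partner carries no disease annotation, which mirrors the "hit" criterion described in the abstract.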

Background

Finding genes responsible for major genetic disorders can lead to diagnostics, potential drug targets, treatments and large amounts of information about molecular cell biology in general.

Methods for disease gene finding in the post-genome era (>2001):

Microdeletions, translocations

http://www.med.cmu.ac.th/dept/pediatrics/06-interest-cases/ic-39/case39.html

http://www.rscbayarea.com/images/reciprocal_translocation.gif

Linkage analysis

Fagerheim et al 1996.

1q21-1q23.1

chr1:141,600,000-155,900,000

Automated methods for disease gene finding in the post-genome era (>2001):

?

(Perez-Iratxeta, Bork et al. 2002) (Freudenberg and Propping 2002) (van Driel, Cuelenaere et al. 2005) (Hristovski, Peterlin et al. 2005)

Grouping:

Tissues, Gene Ontology, Gene Expression, MeSH terms, ...

Disease Gene Finding.

Summary

Background

Why do we want to find disease genes, and how has it been done until now?

Networks – deducing functional relationships from network theory

Protein interaction networks
Functional modules / network clusters

Phenotype association

Grouping disorders based on their phenotype.
Biological implications of phenotype clusters.

Method and examples

Combining network theory and phenotype associations in an automated large scale disease gene finding platform: proof of concept.
Status of pipeline / infrastructure

Networks and functional modules

Deducing functional relationships from protein interaction networks and network theory

Network theory is boooooooooring

Networks

Text mining of full text corpora, e.g. PubMed Central

http://www.biosolveit.de/ToPNet/screenshots/fig1.html

Protein interaction networks of physical interactions.

(Barabasi and Oltvai 2004).

Networks

[Figure: Social Networks, The CBS interactome, with edges labelled daily, weekly and monthly (de Lichtenberg et al.)]

Networks

Extracting functional data from protein interaction networks

InWeb

Homo sapiens

The ACh receptor involved in Myasthenic Syndrome.

Dynamic functional module:

E.g.:

Cell cycle regulation
Metabolism

Trans-organism protein interaction network

Orthologs?

Orthologous genes are direct descendants of a gene in a common ancestor:

(O'Brien K, Remm et al. 2005)

[Figure: orthologs across S. cerevisiae, D. melanogaster and H. sapiens. Experimental interactions from D. melanogaster, C. elegans and S. cerevisiae feed into an H. sapiens mosaic.]
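The transfer of interactions between orthologous protein pairs (interologs) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: `transfer_interologs`, the protein labels and the ortholog table are hypothetical, and skipping self-pairs is a simplification rather than something the slides specify.

```python
def transfer_interologs(model_interactions, orthologs):
    """Map interactions observed in a model organism onto human protein
    pairs when both partners have at least one human ortholog."""
    inferred = set()
    for a, b in model_interactions:
        for human_a in orthologs.get(a, ()):
            for human_b in orthologs.get(b, ()):
                if human_a != human_b:  # self-pairs skipped for simplicity
                    inferred.add(tuple(sorted((human_a, human_b))))
    return inferred

# Hypothetical yeast interactions and ortholog assignments.
yeast_interactions = [("yA", "yB"), ("yB", "yC")]
orthologs = {"yA": ["hA"], "yB": ["hB1", "hB2"]}  # yC has no human ortholog

human_pairs = transfer_interologs(yeast_interactions, orthologs)
print(sorted(human_pairs))
```

The interaction involving yC is not transferred because one partner lacks a human ortholog; a one-to-many ortholog assignment (yB) yields several inferred human pairs.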

Trans-organism protein interaction network

Infrastructure status

Source databases: BIND, IntAct, DIP, MINT, HPRD, GRID, hand-curated sets, PPI predictions

Trans-organism PPI pipeline

InWeb (Homo sapiens): >122,000 interactions, >22,000 genes

Scoring:
A) Topological
B) No. of publications

Extraction:
Perl modules
Direct SQL access
XML or SIF output

Web server: Opis

Command line: Inweb.pl

CBS Datawarehouse: download/reformat db's
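For readers unfamiliar with SIF (the Simple Interaction Format read by Cytoscape), the sketch below writes a small interaction list as tab-separated `source`, `interaction type`, `target` lines. It is an illustration only: the slides state that extraction is done with Perl modules, and the helper `to_sif`, the `pp` type label and the protein names here are assumptions.

```python
def to_sif(interactions, interaction_type="pp"):
    """Serialise undirected protein pairs as SIF lines:
    <source> TAB <interaction type> TAB <target>."""
    lines = [f"{a}\t{interaction_type}\t{b}" for a, b in sorted(interactions)]
    return "\n".join(lines) + "\n"

# Hypothetical protein identifiers.
sif_text = to_sif([("P2", "P3"), ("P1", "P2")])
print(sif_text, end="")
```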

Phenotype association

Zellweger syndrome

Absent liver peroxisomes; Hepatomegaly; Intrahepatic biliary dysgenesis; Prolonged neonatal jaundice; Pyloric hypertrophy; Patent ductus arteriosus; Ventricular septal defects; Bell-shaped thorax; Small adrenal glands; Absent renal peroxisomes; Clitoromegaly; Cryptorchidism; Hydronephrosis; Hypospadias; Renal cortical microcysts; Failure to thrive; Abnormal electroretinogram; Abnormal helices; Anteverted nares; Brushfield spots; Cataracts; Corneal clouding

Epicanthal folds; Flat facies; Flat occiput; Glaucoma; High arched palate; High forehead; Hypertelorism; Large fontanelles; Macrocephaly; Micrognathia; Nystagmus; Pale optic disk; Pigmentary retinopathy; Posteriorly rotated ears; Protruding tongue; Redundant skin folds of neck; Round facies; Sensorineural deafness; Turribrachycephaly; Upward slanting palpebral fissures; Hyporeflexia or areflexia; Hypotonia

Polymicrogyria; Seizures; Severe mental retardation; Subependymal cysts; Pulmonary hypoplasia; Cubitus valgus; Delayed bone age; Metatarsus adductus; Rocker-bottom feet; Stippled epiphyses (especially patellar and acetabular regions); Talipes equinovarus; Transverse palmar crease; Ulnar deviation of hands; Wide cranial sutures; Heterotopias/abnormal migration; Hypoplastic olfactory lobes

Autosomal recessive; Albuminuria; Aminoaciduria; Decreased dihydroxyacetonephosphate acyltransferase (DHAP-AT) activity; Decreased plasmalogen; Elevated long chain fatty acids; Elevated serum iron and iron binding capacity; Increased phytanic acid; Pipecolic acidemia; Breech presentation; Death usually in first year of life; Genetic heterogeneity; Infants occasionally mistaken as having Down syndrome; Agenesis/hypoplastic corpus callosum

Phenotype association

Word vectors

Reference: Zellweger Syndrome (214100)

Phenotype                             Sim. Score
Adrenoleukodystrophy (202370)         0.781
Hyperpipecolatemia (239400)           0.703
Cerebrohepatorenal Syndr. (214110)    0.682
Refsum Disease (266510)               0.609

A relationship between the infantile form of Refsum disease and Zellweger syndrome was suggested by the observations of Poulos et al. (1984) in 2 patients. In the infantile form of Refsum disease, as in Zellweger syndrome, peroxisomes are deficient and peroxisomal functions are impaired (Schram et al., 1986). Clinically, infantile Refsum disease, ZWS, and adrenoleukodystrophy have several overlapping features (Stokke et al., 1984). (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=266510)

214100 202370
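A phenotype similarity score based on word vectors can be sketched as follows. The slides do not state which similarity measure or term weighting is used, so cosine similarity over raw word counts is an assumption here, as are the function names and the two toy descriptions. The sketch will not reproduce the scores in the table above.

```python
import math
from collections import Counter

def word_vector(description):
    """Bag-of-words vector: word -> count in the phenotype description."""
    return Counter(description.lower().split())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Two made-up four-word descriptions sharing two terms.
a = word_vector("hepatomegaly hypotonia seizures cataracts")
b = word_vector("hypotonia seizures deafness ataxia")
score = cosine_similarity(a, b)
print(round(score, 3))  # 0.5: two shared words, both vectors have norm 2
```

In practice one would probably weight rare clinical terms more heavily than common ones, but that choice is outside what the slides specify.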

Phenotype association

Word vectors

Phenotype association network

[Figure: network linking Zellweger, Cerebro-Hepato-renal, Refsum and Adrenoleukodystrophy]
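One way to turn pairwise similarity scores into a phenotype association network is to keep only pairs above a cutoff. The sketch below uses the four scores listed for Zellweger syndrome (MIM 214100) on the similarity slide. The cutoff of 0.65 and the helper name `build_phenotype_network` are assumptions chosen for illustration, not values from the slides.

```python
from collections import defaultdict

def build_phenotype_network(scored_pairs, threshold):
    """Undirected network: an edge joins two MIM entries when their
    similarity score reaches the threshold."""
    network = defaultdict(dict)
    for a, b, score in scored_pairs:
        if score >= threshold:
            network[a][b] = score
            network[b][a] = score
    return dict(network)

# Scores for reference phenotype 214100, as listed on the slide.
scored_pairs = [
    ("214100", "202370", 0.781),  # Adrenoleukodystrophy
    ("214100", "239400", 0.703),  # Hyperpipecolatemia
    ("214100", "214110", 0.682),  # Cerebrohepatorenal syndrome
    ("214100", "266510", 0.609),  # Refsum disease
]

network = build_phenotype_network(scored_pairs, threshold=0.65)  # hypothetical cutoff
print(sorted(network["214100"]))
```

With this cutoff Refsum disease falls outside the network, which shows how sensitive the resulting clusters are to the chosen threshold.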

Method

[Figure: the method combines the InWeb Homo sapiens interaction network with word vectors and phenotype clustering]

Method - status

In vitro:

Blinded tests on 15 linkage intervals with one novel angiogenesis gene per interval.

12 predictions, 7 correct, 2 novel.

In patient material:

Usher

Septooptic dysplasia syndrome

Dilated cardiomyopathy

Benchmarking:

Benchmarking by unbiased prioritization of genes in ~1404 critical intervals where the actual disease gene is known.
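A benchmark of this kind is commonly summarised as the fraction of intervals in which the known disease gene appears among the top k ranked candidates. The sketch below shows that calculation on made-up data. The function name, interval labels and gene labels are hypothetical, and the slides do not say which summary statistic was actually used.

```python
def top_k_recovery(ranked_intervals, true_genes, k=1):
    """Fraction of critical intervals whose known disease gene is among
    the first k candidates of the ranking (ties are not handled)."""
    if not ranked_intervals:
        raise ValueError("no intervals to evaluate")
    hits = 0
    for interval_id, ranking in ranked_intervals.items():
        if true_genes[interval_id] in ranking[:k]:
            hits += 1
    return hits / len(ranked_intervals)

# Four hypothetical intervals with ranked candidates, best first.
ranked_intervals = {
    "interval_1": ["g1", "g2", "g3"],
    "interval_2": ["g4", "g5"],
    "interval_3": ["g6", "g7", "g8"],
    "interval_4": ["g9", "g10"],
}
true_genes = {"interval_1": "g1", "interval_2": "g4",
              "interval_3": "g7", "interval_4": "g11"}

top1 = top_k_recovery(ranked_intervals, true_genes, k=1)
top2 = top_k_recovery(ranked_intervals, true_genes, k=2)
print(top1, top2)
```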

How well does the method work?

Is it unbiased?

Reveals novel global aspect of human diseases

Results - Benchmark

MIM     RANK  GENE             P-value               TRUE
278800  1     ENSG00000032514  0.300326793109544     *
278800  2     ENSG00000188611  0.0125655342047565
278800  2     ENSG00000138297  0.0125655342047565
278800  2     ENSG00000165406  0.0125655342047565
278800  3     ENSG00000196693  0.0121357313793756
278800  3     ENSG00000185532  0.0121357313793756
278800  4     ENSG00000197910  0.00680983722337082
278800  4     ENSG00000165383  0.00680983722337082
278800  4     ENSG00000172538  0.00680983722337082
...     ...   ...              ...
278800  4     ENSG00000165511  0.00680983722337082
278800  4     ENSG00000182354  0.00680983722337082
278800  4     ENSG00000172661  0.00680983722337082
278800  4     ENSG00000165507  0.00680983722337082
278800  4     ENSG00000178440  0.00680983722337082
278800  4     ENSG00000138299  0.00680983722337082
278800  4     ENSG00000197704  0.00680983722337082
278800  4     ENSG00000012779  0.00680983722337082
278800  4     ENSG00000197354  0.00680983722337082
278800  4     ENSG00000189090  0.00680983722337082
278800  4     ENSG00000107551  0.00680983722337082
278800  4     ENSG00000126542  0.00680983722337082
278800  4     ENSG00000198364  0.00680983722337082
278800  4     ENSG00000185849  0.00680983722337082
278800  4     ENSG00000150165  0.00680983722337082
278800  4     ENSG00000128815  0.00680983722337082
278800  4     ENSG00000178645  0.00680983722337082
278800  4     ENSG00000138293  0.00680983722337082
278800  4     ENSG00000176833  0.00680983722337082
278800  4     ENSG00000179251  0.00680983722337082
278800  4     ENSG00000169826  0.00680983722337082
278800  4     ENSG00000172678  0.00680983722337082
278800  4     ENSG00000197752  0.00680983722337082
278800  5     ENSG00000107643  0.00412573091718715
278800  6     ENSG00000165733  0.000263885640603109
278800  7     ENSG00000169813  6,63E+07
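In the benchmark listing, genes with identical values share a rank and the next distinct value receives the next consecutive rank, a scheme usually called dense ranking. The sketch below reproduces the first few ranks from the listing, assuming that candidates are ordered by descending value of the "P-value" column as the listing suggests. The helper name `dense_rank` is hypothetical and this is not the authors' code.

```python
def dense_rank(scores):
    """Assign rank 1 to the highest value; ties share a rank and the
    next distinct value gets the next consecutive rank."""
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {gene: rank_of[value] for gene, value in scores.items()}

# A subset of rows from the benchmark listing for MIM 278800.
scores = {
    "ENSG00000032514": 0.300326793109544,
    "ENSG00000188611": 0.0125655342047565,
    "ENSG00000138297": 0.0125655342047565,
    "ENSG00000196693": 0.0121357313793756,
    "ENSG00000197910": 0.00680983722337082,
}

ranks = dense_rank(scores)
for gene, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(rank, gene)
```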

DE SANCTIS-CACCHIONE SYNDROME

Gene map locus 10q11, >12 Mb area, 103 ranked genes

CLINICAL FEATURES

De Sanctis and Cacchione (1932) reported a condition, which they called 'xerodermic idiocy,' in which patients had xeroderma pigmentosum, mental deficiency, progressive neurologic deterioration, dwarfism, and gonadal hypoplasia.

http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=278800

Results – Benchmarking

DE SANCTIS-CACCHIONE SYNDROME

Ranked 1 P-value 0.300326793109544

DNA excision repair protein ERCC-6

Eukaryotic translation initiation factor 4E (eIF4E)

DNA excision repair protein ERCC-2

Eukaryotic initiation factor 4A-I (eIF4A-I)

*126340 DNA REPAIR DEFECT EM9 OF CHINESE HAMSTER OVARY CELLS, COMPLEMENTATION OF; EM9

#133540 COCKAYNE SYNDROME CKN2

#278730 XERODERMA PIGMENTOSUM, COMPLEMENTATION GROUP D

#278800 DE SANCTIS-CACCHIONE SYNDROME

#601675 TRICHOTHIODYSTROPHY

Results – Benchmarking

DE SANCTIS-CACCHIONE SYNDROME

Ranked 2 P-value 0.0125655342047565

Acknowledgments

Disease Gene Finding:

Olga Rigina
Olof Karlberg
Zenia M. Størling
Páll Ísólfur Ólason
Kasper Lage
Anders Gorm
Anders Hinsby
Yves Moreau
Søren Brunak
