Gene Annotation Databases
[Figure: Genes, Diseases, Anatomy and Physiology annotations, repeated across gene annotation databases]
Medical Informatics; Genomics and Bioinformatics
Novel relationships & Deeper insights
Identification and Prioritization of Novel Disease Candidate Genes
Systems Biology Based Integrative Approaches
Anil Jegga
Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati
Cincinnati, Ohio
http://anil.cchmc.org
Bioinformatics to Systems Biology
November 16, 2007
Acknowledgements
• Jing Chen
• Eric Bardes
• Bruce Aronow
• All the publicly available gene annotation resources, especially NCBI, MGI and UCSC
Support
• Cincinnati Children’s Hospital Medical Center
• Computational Medical Center, Cincinnati
• Mouse Models of Human Cancers Consortium
• University of Cincinnati College of Medicine
Medical Informatics | Bioinformatics & the “omes”
Two Separate Worlds… With Some Data Exchange…
Medical Informatics (Disease World):
• Patient Records
• Disease Database: Name → Synonyms → Related/Similar Diseases → Subtypes → Etiology → Predisposing Causes → Pathogenesis → Molecular Basis → Population Genetics → Clinical findings → System(s) involved → Lesions → Diagnosis → Prognosis → Treatment → Clinical Trials…
• PubMed
• Clinical Trials
• OMIM Clinical Synopsis
Bioinformatics & the “omes”:
• Genome
• Transcriptome
• Proteome
• miRNAome
• Interactome
• Metabolome
• Physiome
• Regulome
• Variome
• Pathome
• Pharmacogenome
382 “omes” so far… and there is “UNKNOME” too: genes with no known function
http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as on November 15, 2007)
Integrative Genomics - Biomedical Informatics: the Ultimate Goal…
Medical Informatics (Disease World):
• Patient Records
• Disease Database
• Clinical Trials
• OMIM
Bioinformatics:
• Genome
• Transcriptome
• Proteome
• miRNAome
• Interactome
• Metabolome
• Physiome
• Regulome
• Variome
• Pathome
• Pharmacogenome
Outcomes of integration:
► Personalized Medicine
► Decision Support System
► Course/Outcome Predictor
► Diagnostic Test Selector
► Clinical Trials Design
► Hypothesis Generator
► Novel Gene/Drug Targets…
No Integrative Genomics is Complete without Ontologies
• Gene Ontology (GO): Gene World
• Unified Medical Language System (UMLS): Biomedical World
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
  – what a product ‘does’, precise activity
• Biological Process = biological goal or objective
  – broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  – biological objective, accomplished via one or more ordered assemblies of functions
• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
  – ‘is located in’ (‘is a subcomponent of’)
http://www.geneontology.org
Example: Gene Product = hammer
Function (what) | Process (why)
Drive a nail into wood | Carpentry
Drive a stake into soil | Gardening
Smash a bug | Pest Control
A performer’s juggling object | Entertainment
Unified Medical Language System Knowledge Server (UMLSKS)
• The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.
http://umlsks.nlm.nih.gov/kss
Unified Medical Language System Metathesaurus
• More than 1 million biomedical concepts
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
  – linking between different clinical or biomedical vocabularies
  – information retrieval from databases with human-assigned subject index terms and from free-text information sources
  – linking patient records to related information in bibliographic, full-text, or factual databases
  – natural language processing and automated indexing research
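The concept-centred organization described above (alternate names linked to one concept) can be pictured with a small lookup structure. This is only an illustrative sketch: the concept identifiers and rows are hypothetical placeholders, not real Metathesaurus records, and the real resource allows one name to map to several concepts, which this sketch ignores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_concept_index(
    rows: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, str]]:
    """Group names by concept and index each lower-cased name by its concept."""
    names_by_concept: Dict[str, Set[str]] = defaultdict(set)
    concept_by_name: Dict[str, str] = {}
    for concept_id, name in rows:
        names_by_concept[concept_id].add(name)
        concept_by_name[name.lower()] = concept_id
    return dict(names_by_concept), concept_by_name


# Hypothetical (concept id, name) rows; the ids are placeholders.
rows = [
    ("CONCEPT_1", "Myocardial infarction"),
    ("CONCEPT_1", "Heart attack"),
    ("CONCEPT_1", "Infarto de miocardio"),
    ("CONCEPT_2", "Dilated cardiomyopathy"),
]
names_by_concept, concept_by_name = build_concept_index(rows)
```

Looking up "heart attack" and "myocardial infarction" in `concept_by_name` returns the same placeholder concept, which is the property that lets different vocabularies be linked.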
Open biomedical ontologies
http://obo.sourceforge.net/
Mammalian Phenotype Ontology
1. The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease.
2. Each node in the MPO represents a category of phenotypes, and each MP ontology term has a unique identifier, a definition, synonyms, and is associated with gene variants causing these phenotypes in genetically engineered or mutagenesis experiments.
3. In the current version of MPO, there are >4250 terms associated with >4300 unique Entrez mouse genes (extrapolated to ~4300 orthologous human genes).
http://www.informatics.jax.org
Disease Gene Identification and Prioritization
Hypothesis: The majority of genes that impact or cause disease share membership in any of several functional relationships, OR functionally similar or related genes cause similar phenotypes.
Functional Similarity – Common/shared
• Gene Ontology term
• Pathway
• Phenotype
• Chromosomal location
• Expression
• Cis regulatory elements (transcription factor binding sites)
• miRNA regulators
• Interactions
• Other features…
Background, Problems & Issues
1. Most of the common diseases are multi-factorial and modified by genetically and mechanistically complex polygenic interactions and environmental factors.
2. High-throughput genome-wide studies, like linkage analysis and gene expression profiling, tend to be most useful for classification and characterization but do not provide sufficient information to identify or prioritize specific disease causal genes.
3. Since multiple genes are associated with the same or similar disease phenotypes, it is reasonable to expect the underlying genes to be functionally related.
4. Such functional relatedness (common pathway, interaction, biological process, etc.) can be exploited to aid in finding novel disease genes. For example, genetically heterogeneous hereditary diseases such as Hermansky-Pudlak syndrome and Fanconi anaemia have been shown to be caused by mutations in different interacting proteins.
Disease candidate gene studies
Linkage, gene expression → Potential candidate genes (too many!) → Fine mapping, hand/cherry picking, or prioritization approach → Biological experiments (expensive, time consuming)
Example: dilated cardiomyopathy
• Linkage analysis: locus region 10q25-26 (Ellinor et al. J Am Coll Cardiol 2006.)
• ~9.5 Mb with 68 genes
• 7 candidates selected by experts
• ADRB1 missing
Assumption: genes involved in the same complex disease will have similar functions
Current candidate gene prioritization tools (example: dilated cardiomyopathy)
Approach without training
• Input: multiple locus regions
• Enriched functions
• Prioritize genes based on the functions
Approach with training
• Training: known disease genes (10 from OMIM)
• Test: 68 genes at 10q25-26
• Score test genes based on their similarity to the training set
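The "approach with training" can be sketched as scoring each test gene by how much its annotations overlap with those of the training genes. The sketch below uses a mean Jaccard overlap purely as a stand-in similarity measure; it is an assumption for illustration, not the scoring used by any of the tools named here, and the gene names and annotation terms are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def similarity_to_training(test_terms: Set[str], training: Dict[str, Set[str]]) -> float:
    """Mean Jaccard overlap between one test gene's terms and each training gene's terms."""
    if not training:
        return 0.0
    total = 0.0
    for terms in training.values():
        union = test_terms | terms
        total += len(test_terms & terms) / len(union) if union else 0.0
    return total / len(training)


def prioritize(test: Dict[str, Set[str]], training: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Return test genes sorted from most to least similar to the training set."""
    scored = [(gene, similarity_to_training(terms, training)) for gene, terms in test.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical annotations (placeholders, not real data).
training = {
    "known_gene_1": {"heart contraction", "adrenergic signaling"},
    "known_gene_2": {"heart contraction", "sarcomere"},
}
test = {
    "candidate_A": {"heart contraction", "adrenergic signaling"},
    "candidate_B": {"keratinization"},
    "candidate_C": {"sarcomere"},
}
ranking = prioritize(test, training)
```

With these placeholder annotations, candidate_A ranks first because it shares terms with both training genes, and candidate_B ranks last because it shares none.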
TOPPGene
Transcriptome Ontology Pathway based Prioritization of Genes
http://toppgene.cchmc.org
Chen J, Xu H, Aronow BJ, Jegga AG. 2007. Improved human disease candidate gene prioritization using mouse phenotype. BMC Bioinformatics 8(1): 392 [Epub ahead of print]
Applications:
1. For functional enrichment
2. For candidate gene prioritization
Why another gene prioritization method?
Tools compared: POCUS (2003), Prospectr (2005), SUSPECTS (2006), ENDEAVOUR (2006) and ToppGene (2007).
Feature types compared: sequence features, GO annotations, transcript features, protein features, literature, phenotype annotations, and use of a training set.
Comparison with other related approaches
Feature type | POCUS | Prospectr | SUSPECTS | ENDEAVOUR | ToppGene
Year | 2003 | 2005 | 2006 | 2006 | 2007
Sequence Features & Annotations | - | Gene length, homology, base composition | Gene length, homology, base composition | Blast, cis-element | Cytoband, cis-element, miRNA targets, GeneSets
Gene Annotations | Gene Ontology | - | Gene Ontology | Gene Ontology | Gene Ontology, Mouse Phenotype
Transcript Features | Gene expression | - | Gene expression | EST expression | Gene expression
Protein Features | Domains | - | Protein domains | Domains, interactions, pathways | Domains, interactions, pathways
Literature | - | - | - | Keywords | Co-citation
Training set | No | No | Yes | Yes | Yes
Feature Details
Mammalian Phenotype Ontology
We do not check whether the human ortholog of a mouse gene causes a similar phenotype. Rather, we assume that orthologous genes cause an “orthologous phenotype” and test the potential of the extrapolated mouse phenotype terms as a similarity measure to prioritize human disease candidate genes.
Example: 77 human genes are explicitly associated with “heart development” (GO:0007507); their mouse orthologs cause various types of cardiac phenotype (MPO).
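The extrapolation step can be pictured as copying phenotype terms from mouse genes to their human orthologs. The following is a minimal sketch under that assumption; the gene symbols, ortholog pairs and phenotype labels are hypothetical placeholders rather than MGI records.

```python
from typing import Dict, Set


def extrapolate_phenotypes(
    human_to_mouse: Dict[str, Set[str]],
    mouse_phenotypes: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Assign to each human gene the union of phenotype terms of its mouse orthologs."""
    human_terms: Dict[str, Set[str]] = {}
    for human_gene, mouse_genes in human_to_mouse.items():
        terms: Set[str] = set()
        for mouse_gene in mouse_genes:
            terms |= mouse_phenotypes.get(mouse_gene, set())
        human_terms[human_gene] = terms
    return human_terms


# Hypothetical ortholog pairs and phenotype annotations.
human_to_mouse = {
    "HUMAN_GENE_1": {"mouse_gene_1"},
    "HUMAN_GENE_2": {"mouse_gene_2a", "mouse_gene_2b"},
    "HUMAN_GENE_3": set(),  # no mouse ortholog known
}
mouse_phenotypes = {
    "mouse_gene_1": {"abnormal heart development"},
    "mouse_gene_2a": {"enlarged heart"},
    "mouse_gene_2b": {"enlarged heart", "abnormal cardiac muscle contractility"},
}
human_terms = extrapolate_phenotypes(human_to_mouse, mouse_phenotypes)
```

Human genes without an annotated ortholog end up with an empty term set, so this feature simply contributes nothing for them.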
ToppGene – General Schema
TOPPGene - Data Sources
1. Gene Ontology: GO and NCBI Entrez Gene
2. Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
3. Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
4. Domains: UniProt (Pfam, InterPro, etc.)
5. Interactions: NCBI Entrez Gene (BioGrid, Reactome, BIND, HPRD, etc.)
6. PubMed IDs: NCBI Entrez Gene
7. Expression: GEO
8. Cytoband: MSigDB
9. Cis-Elements: MSigDB
10. miRNA Targets: MSigDB
New features added
TOPPGene - Validation
• Random-gene cross-validation
  – Disease-gene relations from OMIM and GAD databases
  – Training set: disease genes with one gene (“target”) removed
  – Test set: 100 genes = “target” gene + 99 random genes
  – Rank of “target” gene
  – Control: random training sets
  – AUC and Sensitivity/Specificity
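The validation loop above can be sketched as follows. The scoring function is passed in as a parameter because the slides do not spell it out; `toy_score`, the gene names and the background list are hypothetical and exist only to make the sketch runnable.

```python
import random
from typing import Callable, Dict, List, Sequence, Set

Scorer = Callable[[str, Set[str]], float]


def leave_one_out_ranks(
    disease_genes: Sequence[str],
    background: Sequence[str],
    score: Scorer,
    n_random: int = 99,
    seed: int = 0,
) -> Dict[str, int]:
    """For each disease gene, hide it, mix it with random genes, and record its rank."""
    rng = random.Random(seed)
    disease_set = set(disease_genes)
    pool: List[str] = [g for g in background if g not in disease_set]
    ranks: Dict[str, int] = {}
    for target in disease_genes:
        training = disease_set - {target}             # disease genes minus the target
        test_set = rng.sample(pool, n_random) + [target]  # 99 random genes + target
        ordered = sorted(test_set, key=lambda g: score(g, training), reverse=True)
        ranks[target] = ordered.index(target) + 1     # 1-based rank of the target
    return ranks


def toy_score(gene: str, training: Set[str]) -> float:
    """Hypothetical scorer: pretends disease-like genes (prefix 'd') always score high."""
    return 1.0 if gene.startswith("d") else 0.0


background = [f"random_{i}" for i in range(500)]
ranks = leave_one_out_ranks(["d1", "d2", "d3"], background, toy_score)
```

The random-training-set control mentioned on the slide could reuse the same function by passing a randomly drawn gene list in place of the disease genes.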
Random-gene cross-validation: breast cancer example
Disease genes: ATM, BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Training set: BARD1, BRCA1, BRCA2, BRIP1, CASP8, CHEK2, KRAS, PALB2, PIK3CA, PPM1D, RAD51, RB1CC1, SLC22A18, TP53
Test set (ATM + 99 random genes): KIAA1333, PQLC3, RBMY2OP, ZNF133, LOC402643, FBL, SLEB4, FAM32A, AACSL, ATM, NDUFB5, DENND4A, C14orf106, …, KCNJ16
Ranked list after prioritization:
1. ATM
2. KIAA1333
3. PQLC3
4. RBMY2OP
5. ZNF133
6. LOC402643
7. FBL
8. SLEB4
9. FAM32A
10. AACSL
11. NDUFB5
12. DENND4A
13. C14orf106
…
100. KCNJ16
Random-gene cross-validation result
• Training: 19 diseases with 693 genes
• Control: 20 random sets of 35 genes each
• Sensitivity/Specificity: 77/90
• AUC: 0.916
Sensitivity: frequency of “target” genes that are ranked above a particular threshold position
Specificity: the percentage of genes ranked below the threshold
[Figure: ROC curves; x-axis: false positive rate (1 - specificity); y-axis: true positive rate (sensitivity)]
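Given the rank of each “target” gene in its 100-gene test set, the sensitivity and specificity definitions above can be turned into numbers, and an AUC obtained by sweeping the threshold. This is a sketch of one straightforward reading of those definitions (trapezoidal area under sensitivity versus 1 - specificity); the ranks used are hypothetical.

```python
from typing import List, Sequence, Tuple


def sensitivity(target_ranks: Sequence[int], threshold: int) -> float:
    """Fraction of target genes ranked at or above the threshold position."""
    return sum(rank <= threshold for rank in target_ranks) / len(target_ranks)


def specificity(n_genes: int, threshold: int) -> float:
    """Fraction of the test set ranked below the threshold position."""
    return (n_genes - threshold) / n_genes


def auc(target_ranks: Sequence[int], n_genes: int) -> float:
    """Trapezoidal area under sensitivity vs (1 - specificity) over all thresholds."""
    points: List[Tuple[float, float]] = [
        (1.0 - specificity(n_genes, t), sensitivity(target_ranks, t))
        for t in range(n_genes + 1)
    ]
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical target ranks from six validation runs with 100 test genes each.
example_ranks = [1, 1, 2, 5, 10, 40]
sens_at_10 = sensitivity(example_ranks, 10)
spec_at_10 = specificity(100, 10)
example_auc = auc(example_ranks, 100)
```

Reporting sensitivity at a fixed specificity then amounts to choosing the threshold position, for example the top 10 of 100 genes for a specificity of 90%.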
Random-gene cross-validation with only one feature
[Figure: AUC of different feature sets. Feature sets: All, GO:MF, GO:BP, MP, Pathway, Domain, Pubmed, Interaction, Expression. Left axis: AUC (0 to 1); right axis: Coverage (0% to 100%). Series: AUC (random control), AUC (p-value score), Coverage]
Using Mouse Phenotype as a feature of similarity measure improves human disease gene prioritization
Random-gene cross-validation by leaving one feature out
[Figure: ROC curves for All, All – MP, and All – MP – Pubmed; x-axis: false positive rate (1 - specificity), 0.00 to 0.20; y-axis: true positive rate (sensitivity), 0.0 to 1.0]
Overall performance
• All features: 0.913
• All – MP: 0.893
• All – MP – PubMed: 0.888
Sensitivity: true positive rate at a cutoff score
Specificity: true negative rate at the same cutoff
Locus-region cross-validation using different feature sets
Features | Average rank ratio of “target” genes | Number of times “target” genes were ranked top 5% | Number of times “target” genes were ranked top 10%
All | 7.39% | 118 | 125
GO + MP + PubMed | 7.50% | 118 | 126
MP + PubMed | 7.08% | 121 | 126
Without GO | 6.84% | 117 | 123
Without Pathway | 7.66% | 118 | 124
Without Domain | 6.71% | 118 | 124
Without Interaction | 7.17% | 120 | 124
Without Expression | 7.28% | 118 | 128
Without MP | 9.77% | 110 | 117
Without Pubmed | 9.91% | 100 | 111
Without MP & Pubmed | 22.61% | 71 | 80
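The three columns of this table can be computed from per-run results. The sketch below assumes that "rank ratio" means the rank of the “target” gene divided by the number of genes in its test region, which the slide does not state explicitly; the input data are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def summarize_runs(runs: Sequence[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (target rank, number of test genes) pairs from validation runs."""
    ratios = [rank / n_test for rank, n_test in runs]
    return {
        "average_rank_ratio": sum(ratios) / len(ratios),
        # Integer comparisons avoid floating-point edge cases at the cutoffs.
        "top_5_percent": sum(rank * 20 <= n_test for rank, n_test in runs),
        "top_10_percent": sum(rank * 10 <= n_test for rank, n_test in runs),
    }


# Hypothetical runs: (rank of the target gene, genes in the locus region).
runs = [(1, 100), (3, 100), (8, 100), (20, 200), (50, 100)]
summary = summarize_runs(runs)
```

A lower average rank ratio and higher top-5% and top-10% counts both indicate better prioritization, which is how the rows of the table are compared.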
ToppGene web server (http://toppgene.cchmc.org)
For functional enrichment analysis
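Functional enrichment asks whether an annotation term occurs in a gene list more often than expected by chance. A common way to test this is the hypergeometric tail probability, sketched below; it is shown as a generic illustration, the slides do not state which test the web server applies, and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(
    n_background: int, n_annotated: int, n_query: int, n_overlap: int
) -> float:
    """P(overlap >= n_overlap) when n_query genes are drawn from n_background genes,
    of which n_annotated carry the term."""
    upper = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20000 background genes, 200 annotated with a term,
# 50 genes in the query list, 5 of which carry the term.
p_value = hypergeom_enrichment_p(20000, 200, 50, 5)
```

When many terms are tested at once, the resulting p-values would normally need a multiple-testing correction, which this sketch leaves out.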
PPI - Predicting Disease Genes
1. Direct protein–protein interactions (PPI) are one of the strongest manifestations of a functional relation between genes.
2. Hypothesis: Interacting proteins lead to the same or similar disease phenotypes when mutated.
3. Several genetically heterogeneous hereditary diseases have been shown to be caused by mutations in different interacting proteins, for example Hermansky-Pudlak syndrome and Fanconi anaemia. Hence, protein–protein interactions might in principle be used to identify potentially interesting disease gene candidates.
Mining human interactome (HPRD, BioGrid)
• Known disease genes: 7
• Direct interactants of disease genes: 66
• Indirect interactants of disease genes: 778
Which of these interactants are potential new candidates?
Prioritize candidate genes in the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes
Example: Breast cancer
• OMIM genes (level 0): 15
• Directly interacting genes (level 1): 342
• Indirectly interacting genes (level 2): 2469!
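The level 0, level 1 and level 2 gene sets can be derived from an interaction list with a breadth-first search starting at the known disease genes. The sketch below assumes undirected interactions given as gene pairs; the gene names and interactions are hypothetical placeholders, not HPRD or BioGrid records.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def interactant_levels(
    interactions: Iterable[Tuple[str, str]],
    seeds: List[str],
    max_level: int = 2,
) -> Dict[str, int]:
    """Map each gene within max_level interaction steps of a seed to its level."""
    adjacency: Dict[str, Set[str]] = {}
    for a, b in interactions:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    level: Dict[str, int] = {gene: 0 for gene in seeds}  # level 0: known disease genes
    queue = deque(seeds)
    while queue:
        gene = queue.popleft()
        if level[gene] == max_level:
            continue  # do not expand beyond the requested depth
        for partner in adjacency.get(gene, ()):
            if partner not in level:
                level[partner] = level[gene] + 1
                queue.append(partner)
    return level


# Hypothetical interaction pairs and seed (disease) genes.
interactions = [("S1", "A"), ("S2", "A"), ("A", "B"), ("B", "C"), ("S2", "D")]
levels = interactant_levels(interactions, ["S1", "S2"])
level_1 = sorted(g for g, lvl in levels.items() if lvl == 1)
```

The level 1 genes would form the test set and the seeds the training set; the rapid growth from level 1 to level 2 is what makes prioritization necessary.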
ToppGene web server (http://toppgene.cchmc.org)
For candidate gene prioritization
Example: Breast cancer study. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007 May 27.
rs id | Location | Gene | Training set | Test set
rs2981582 | 10q26 | FGFR2 | 15 OMIM genes | 83 genes in the region
Prioritization result:
Rank | Gene | Description | P-value
1 | BUB3 | budding uninhibited by benzimidazoles 3 homolog | 0.003865
2 | FGFR2 | fibroblast growth factor receptor 2 | 0.018906
3 | BCCIP | BRCA2 and CDKN1A interacting protein | 0.04784
ToppGene Prioritization
Example: Breast cancer
Training set: 15 OMIM genes
Test set: 342 interacting genes
Ranked Interactants
Rank | Gene | Description
1 | ATR | ataxia telangiectasia and Rad3 related
2 | FANCD2 | Fanconi anemia, complementation group D2
3 | NBN (NBS1) | Nibrin
Limitations
General limitations of any training-test strategy:
• Prior knowledge of disease-gene associations is required.
• Assumption that the disease genes yet to be discovered will be consistent with what is already known about a disease.
• Dependence on the accuracy and completeness of the functional annotations.
  – Only one-fifth of the known human genes have pathway or phenotype annotations, and the functions of more than 40% of genes are still not defined!
Chen et al., 2007; BMC Bioinformatics
Mouse Phenotype - Limitations
1. MP is not a disease-centric ontology, and the phenotype of the same gene mutation can vary depending on specific mouse strains or their genetic backgrounds.
2. Orthologous genes need not necessarily result in orthologous phenotypes.
Possible Solutions - Future Directions
More efficient cross-species phenome extrapolation, wherein the mouse phenotype terms are mapped semantically to human phenotype concepts from UMLS (“orthologous phenotype”) and the resultant orthologous genes associated with an orthologous phenotype are identified.
Chen et al., 2007; BMC Bioinformatics
PPIs for disease gene identification
Limitations
1. Noisy interactome data
  • In vitro vs. in vivo (e.g., only 5.8% of yeast two-hybrid predicted interactions were confirmed by HPRD)
  • Extrapolation of interactions from one species to another
  • Bias towards “well-studied” genes/proteins
2. Too many interactants! Hub proteins
3. Two interacting proteins need not lead to similar phenotypes when mutated
4. Disease proteins may lie at different points in a pathway and need not interact directly
5. Lastly, disease mutations need not always involve proteins
Oti et al., 2006; J Med Gen
http://sbw.kgi.edu/
http://anil.cchmc.org (under presentations)
Thank You!
And PRIORITIZATION too!