Ontologies and Biomedicine
What is the "right" amount of semantics?
The “right” amount of semantics depends on what you want to do with it
Research is based on inference from what is known, and therefore it demands rigor
Without rigor, we won’t know what we know, or where to find it, or what to infer from it.
Semantic Spectrum
Natural language: highly expressive, ambiguous
Computable ontology: less expressive, logical and precise
Ad hoc tagging approach
Lets the users define words and phrases
Forgoes the use of an expertly curated vocabulary or ontology
Fast and distributed approach yields a vast amount of content
No recruitment and training of people to maintain the ontology is required
No recruitment and training of annotators to interpret the material is required
Ad hoc tagging approach
Tagging approach places the burden of interpretation and classification on every end user
Overall this is more costly and wasteful
Is inappropriate in the scientific domain
The problem is not about people communicating. It is about computers and HCI.
Build, apply, and use
Ontology captures current scientific theory that seeks to explain all of the existing evidence and is used to draw inferences and make predictions
Acts like a review
Requires curators who are experts in both the science and logic
Ontology application is the real bottleneck
But overall is less costly and wasteful
1. Univocity: Terms should have the same meanings on every occasion of use
2. Positivity: Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
3. Objectivity: Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
4. Single Inheritance: No class in a classification hierarchy should have more than one is_a parent on the immediate higher level
5. Intelligible Definitions: The terms used in a definition should be simpler (more intelligible) than the term to be defined
6. Reality Based: When building or maintaining an ontology, always think carefully about how classes relate to instances in reality
7. Distinguish Classes and Instances: What is necessarily true for instances is not necessarily true for classes
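Some of these principles can be checked mechanically. The following is a minimal sketch, not part of the original slides, that tests principles 2, 3 and 4 on a toy hierarchy. The term names, the `IS_A` table and the `check_terms` helper are all hypothetical, and the string rules are deliberately naive.

```python
# Toy is_a hierarchy: term -> list of immediate is_a parents.
# Hypothetical data, constructed to break principles 2, 3 and 4 once each.
IS_A = {
    "cellular component": [],
    "membrane": ["cellular component"],
    "non-membrane": ["cellular component"],                   # negative term
    "unlocalized": ["cellular component"],                    # placeholder term
    "plasma membrane": ["membrane", "cellular component"],    # two is_a parents
}

NEGATIVE_PREFIXES = ("non-",)
PLACEHOLDER_WORDS = ("unknown", "unclassified", "unlocalized")


def check_terms(is_a):
    """Return (term, principle) pairs for every violation found."""
    problems = []
    for term, parents in is_a.items():
        if term.startswith(NEGATIVE_PREFIXES):
            problems.append((term, "positivity"))
        if any(word in term.split() for word in PLACEHOLDER_WORDS):
            problems.append((term, "objectivity"))
        if len(parents) > 1:
            problems.append((term, "single inheritance"))
    return problems


violations = check_terms(IS_A)
for term, principle in violations:
    print(f"{term}: violates {principle}")
```

The remaining principles (univocity, intelligible definitions, reality based, classes versus instances) concern meaning and are not captured by checks of this kind; they still need a curator.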
Annotation bottleneck
An active lab can easily generate 10-100GB of data per month, and it is very difficult to manage data on this scale.
Even the best analytic schemes will be for naught if we cannot find our data.
And the data is complex.
Yet the annotation effort required will be utterly wasted if it cannot be reliably computed upon.
Implies numerous “light” ontologies
3-dimensions
Protein function
Cell type
Tissue
Stage
Cellular component
Organism
And more…
Or it implies a single complex one
3-dimensions
Protein function
Cell type
Tissue
Stage
Cellular anatomy
Organism
And more…
Plus all of the relations between these elements
Practicalities
1. The ontology should be robust or the annotator’s time is wasted
2. Research won’t wait: data must be annotated at the rate at which it is generated
3. Complex ontologies are much more difficult to get right than lighter ones
4. Light ontologies are easier to build and maintain
5. Complex ontologies can be built from lighter ones
A “successful” case study
Gene Ontology
The aims of GO
1. To develop comprehensive shared vocabularies of terms describing aspects of molecular biology.
2. To describe the gene products held in each contributing model organism database.
3. To provide a scientific resource for access to the vocabularies, the annotations, and associated data.
4. To provide a software resource to assist in curation of GO term assignments to biological objects.
The primary strength of the GO
The GO covers three domains of biology: Molecular Function, Biological Process, Cellular Component
These are “precisely defined” axes of classification
The breakdown of work
Task 1: Building the ontology, a computable description of the biological world
Task 2: Describing your gene product (annotation): biological process, molecular function, cellular localization
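The split between the two tasks can be sketched in code. This is an illustrative sketch only, not the GO project's actual data model: the `Annotation` class, the `ONTOLOGY` table, the `annotate` helper and the gene identifier `geneX` are hypothetical. The three GO IDs and names are taken from the AmiGO list in this deck.

```python
from dataclasses import dataclass

# The three GO axes named in the slides.
ASPECTS = ("molecular function", "biological process", "cellular component")


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # identifier in the contributing database (hypothetical here)
    go_id: str         # e.g. "GO:0005524"
    aspect: str        # must be one of ASPECTS

    def __post_init__(self):
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")


# Task 1 output: a tiny vocabulary, GO ID -> (name, aspect).
ONTOLOGY = {
    "GO:0005524": ("ATP binding", "molecular function"),
    "GO:0006810": ("transport", "biological process"),
    "GO:0019012": ("virion", "cellular component"),
}


def annotate(gene_product, go_id):
    """Task 2: attach an already defined ontology term to a gene product."""
    name, aspect = ONTOLOGY[go_id]  # raises KeyError if the term was never defined
    return Annotation(gene_product, go_id, aspect)


a = annotate("geneX", "GO:0005524")
print(a)
```

Annotation here can only refer to terms that Task 1 has already defined, which is the dependency between the two tasks.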
The early key decisions
The vocabulary itself requires a serious and ongoing effort.
Carefully define every concept.
Initially keep things as simple as possible and only use a minimally sufficient data representation.
Focus initially on molecular aspects that are shared between many organisms.
GO databases: distributed and centralized
Support cross-database queries by having a mutual understanding of the definition and meaning of any word used to describe a gene product
Provide database access to a common repository of annotations by submitting a summary of gene products that have been annotated
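A minimal sketch of why shared term identifiers make cross-database queries possible. It assumes each database submits a summary of (gene product, GO ID) pairs; the gene identifiers, the `SUBMISSIONS` table and the `build_repository` helper are invented for illustration and do not reflect the real submission format.

```python
from collections import defaultdict

# Hypothetical annotation summaries from two contributing databases.
# Gene identifiers are invented; the GO IDs appear in the AmiGO list in this deck.
SUBMISSIONS = {
    "FlyBase": [("fly_gene_1", "GO:0003677"), ("fly_gene_2", "GO:0005524")],
    "SGD": [("yeast_gene_1", "GO:0003677"), ("yeast_gene_2", "GO:0006810")],
}


def build_repository(submissions):
    """Merge per-database summaries into one index keyed by GO ID."""
    index = defaultdict(list)
    for database, rows in submissions.items():
        for gene_product, go_id in rows:
            index[go_id].append((database, gene_product))
    return index


repository = build_repository(SUBMISSIONS)

# One query answered across databases, because both use the same term ID.
dna_binding = repository["GO:0003677"]  # GO:0003677 = DNA binding
print(dna_binding)
```

If each database used its own wording for "DNA binding", the merge step would need a hand-built mapping for every pair of databases.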
[Diagram: GO data distribution. Labels: GO CVS, anonymous CVS, FTP, HTTPD, scripts, GO data, GO Database, AmiGO, GODatabase.org]
Hits = 77,012
Visits = 14,063
Sites = 6,638
Averages per week
Number of links to a site, as reported by Google
www.geneontology.org    7,240
www.godatabase.org      33
obo.sourceforge.net     10
song.sourceforge.net    6
genome.ucsc.edu         3,670
www.ncbi.nih.gov        12,000
www.ebi.ac.uk           14,900
sciencemag.org          14,900
www.ncbi.nlm.nih.gov    34,500
Most Common GOIDs accessed via AmiGO
72020  GO:0006810  transport
56862  GO:0005524  ATP binding
53622  GO:0019012  virion
47773  GO:0006955  immune response
46943  GO:0003677  DNA binding
41474  GO:0006508  proteolysis and peptidolysis
41126  GO:0006355  regulation of transcription, DNA-dependent
40427  GO:0004872  receptor activity
34943  GO:0005215  transporter activity
30890  GO:0007186  G-protein coupled receptor protein signaling pathway
30001  GO:0003700  transcription factor activity
28127  GO:0006118  electron transport
26636  GO:0005509  calcium ion binding
24007  GO:0006968  cellular defense response
21250  GO:0016486  peptide hormone processing
20440  GO:0008152  metabolism
19742  GO:0005515  protein binding
19316  GO:0007155  cell adhesion
18254  GO:0005198  structural molecule activity
Taxa covered by the GO (some)
Arabidopsis: TAIR, taxon:3702
Caenorhabditis: WormBase, taxon:6239
Candida albicans: CGD, taxon:5476
Danio: ZFIN, taxon:7955
Dictyostelium: DictyBase, taxon:5782
Drosophila: FlyBase, taxon:7227
Mus: MGI, taxon:10090
Oryza sativa: Gramene, taxon:39947 = Oryza sativa (japonica cultivar-group)
Rattus: RGD, taxon:10116
Saccharomyces: SGD, taxon:4932
Leishmania major: GeneDB, taxon:5664
Plasmodium falciparum: GeneDB, taxon:5833
Schizosaccharomyces pombe: GeneDB, taxon:4896
Trypanosoma brucei: GeneDB, taxon:185431
Bacillus anthracis: TIGR, taxon:198094
Coxiella burnetii: TIGR, taxon:227377
Geobacter sulfurreducens: TIGR, taxon:243231
Listeria monocytogenes: TIGR, taxon:265669
Methylococcus capsulatus: TIGR, taxon:243233
Pseudomonas syringae: TIGR, taxon:223283
Shewanella oneidensis: TIGR, taxon:211586
Vibrio cholerae: TIGR, taxon:686
NIH-funded experimental research that uses the GO
National Institute on Aging (NIA)
National Institute of Allergy and Infectious Diseases (NIAID)
National Cancer Institute (NCI)
National Institute on Drug Abuse (NIDA)
National Institute on Deafness and Other Communication Disorders (NIDCD)
National Institute of Dental & Craniofacial Research (NIDCR)
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
National Institute of Environmental Health Sciences (NIEHS)
National Eye Institute (NEI)
National Institute of General Medical Sciences (NIGMS)
National Institute of Child Health and Human Development (NICHD)
National Human Genome Research Institute (NHGRI)
National Heart, Lung and Blood Institute (NHLBI)
National Library of Medicine (NLM)
National Institute of Neurological Disorders and Stroke (NINDS)
National Center for Research Resources (NCRR)
Other funded experimental projects that use the GO
Public Health Service
Walter Reed Army Medical Center
United States Department of Agriculture
Department of Defense
USAID
National Science Foundation
A “successful” case study
There are still challenges to meet
Building upon (sharing) light, axiomatic ontologies eliminates:
1. Spelling mistakes or differences: oesinophil vs. eosinophil
2. Differences in synonyms, names or naming conventions: spermatozoon, sperm cell, spermatozoid, sperm
3. Differences in definitions: “pericardial cell develops_from mesodermal cell” vs. “Nothing develops_from pericardial cell”
4. Inconsistent structure
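The first two items can be illustrated with a small lookup table. This is a sketch under stated assumptions: `CELL:0001` and `CELL:0002` are made-up identifiers, not real accessions, and `TERMS`, `build_lookup` and `resolve` are hypothetical names.

```python
# Hypothetical shared vocabulary: one ID, one preferred name, explicit synonyms.
# "CELL:0001" and "CELL:0002" are invented identifiers for illustration only.
TERMS = {
    "CELL:0001": {
        "name": "sperm",
        "synonyms": ["sperm cell", "spermatozoon", "spermatozoid"],
    },
    "CELL:0002": {"name": "eosinophil", "synonyms": []},
}


def build_lookup(terms):
    """Map every accepted label (name or synonym, lower-cased) to its term ID."""
    lookup = {}
    for term_id, entry in terms.items():
        for label in [entry["name"], *entry["synonyms"]]:
            lookup[label.lower()] = term_id
    return lookup


LOOKUP = build_lookup(TERMS)


def resolve(label):
    """Return the shared term ID for a label, or None if it is not in the vocabulary."""
    return LOOKUP.get(label.strip().lower())


print(resolve("Spermatozoid"))  # a known synonym resolves to the shared ID
print(resolve("oesinophil"))    # a misspelling is rejected instead of becoming a new tag
```

Annotators who pick terms from the shared vocabulary store the ID rather than free text, so a misspelling is caught at entry time.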
Inconsistent structure
GO: hemocyte differentiation (sensu Arthropoda), lamellocyte differentiation, plasmatocyte differentiation
CL: hemocyte, lamellocyte, plasmocyte
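A comparison like this one can be automated. The sketch below uses only the term names that appear on the slide and a naive string rule to pull the cell type out of a GO term name; `cell_type_of` and `missing_in_cl` are hypothetical helpers, and a real comparison would use term definitions and relations rather than names.

```python
# Term names exactly as they appear on the slide.
GO_TERMS = [
    "hemocyte differentiation (sensu Arthropoda)",
    "lamellocyte differentiation",
    "plasmatocyte differentiation",
]
CL_TERMS = {"hemocyte", "lamellocyte", "plasmocyte"}


def cell_type_of(go_term):
    """Extract X from an 'X differentiation ...' term name, or None if the pattern is absent."""
    cell_type, separator, _ = go_term.partition(" differentiation")
    return cell_type if separator else None


def missing_in_cl(go_terms, cl_terms):
    """List GO terms whose implied cell type has no identically named CL term."""
    result = []
    for go_term in go_terms:
        cell_type = cell_type_of(go_term)
        if cell_type is not None and cell_type not in cl_terms:
            result.append((go_term, cell_type))
    return result


mismatches = missing_in_cl(GO_TERMS, CL_TERMS)
print(mismatches)
```

With the names on the slide, the check flags the plasmatocyte term, whose closest CL name is spelled "plasmocyte". Name matching does not compare the two hierarchies themselves.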
Finer granularity in the GO
GO: immune cell activation, migration, chemotaxis…
GO: “erythrocyte differentiation is_a myeloid blood cell differentiation”
CL: no such term “immune cell”
CL: no such term “myeloid blood cell”
Coarser granularity in the GO
GO: neuroblast proliferation is_a cell proliferation
CL: neuroblast is_a neuronal stem cell is_a stem cell is_a cell
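The difference in granularity shows up when the is_a links are followed upward. A minimal sketch using only the edges written on the slide; the `ancestors` helper is hypothetical, and it assumes at most one parent per term, which holds for these edges.

```python
# is_a edges as written on the slide (child -> parent).
CL_IS_A = {
    "neuroblast": "neuronal stem cell",
    "neuronal stem cell": "stem cell",
    "stem cell": "cell",
}
GO_IS_A = {
    "neuroblast proliferation": "cell proliferation",
}


def ancestors(term, is_a):
    """Follow is_a links upward and return all ancestors, nearest first."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:  # guard against cycles in malformed input
            break
        seen.add(term)
        chain.append(term)
    return chain


cl_chain = ancestors("neuroblast", CL_IS_A)
go_chain = ancestors("neuroblast proliferation", GO_IS_A)
print(cl_chain)
print(go_chain)
```

CL reaches "cell" in three steps through two intermediate stem cell classes, while the GO edge goes straight to "cell proliferation". Using only these edges, a query phrased at the stem cell level has nothing in the GO to match.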
Even a “light” ontology like the GO is difficult enough
A methodology that enforces clear, coherent definitions:
Promotes quality assurance: intent is not hard-coded into software; meaning of relationships is defined, not inferred
Guarantees automatic reasoning across ontologies and across data at different granularities
Consequences of inconsistencies:
Hard to synchronize manually
Inconsistent user-search results
Meeting the goal: Drawing inferences
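[Figure: gene products A, B, C and D compared across human, toad and yeast. Entries are identified by SP accessions (SP:1234, SP:8723, SP:19345, SP:48392, SP:48291, SP:38921) and supported by literature references (PMID:5555, PMID:4444, PMID:8976, PMID:9550, PMID:3924). Links are labeled as direct evidence or indirect evidence, and one human entry is marked “?”. Species labels: Human, Xenopus, Drosophila.]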
Thank you
Chris Mungall
Sima Misra
NCBO
Reactome
GO
SO