April 10, 2007
Principles for Building Biomedical Ontologies
ISMB 2005
Ontology (as a branch of philosophy)
The science of what is: of the kinds and structures of the objects, and their properties and relations in every area of reality.
In simple terms, it seeks the classification of entities.
Defined by a scientific field's vocabulary and by the canonical formulations of its theories.
Seeks to solve problems which arise in these domains.
In computer science, there is an information handling problem
Different groups of data-gatherers develop their own idiosyncratic terms and concepts in terms of which they represent information.
To put this information together, methods must be found to resolve terminological and conceptual incompatibilities.
Again, and again, and again…
The Solution to this Tower of Babel problem
A shared, common, backbone taxonomy of relevant entities, and the relationships between them within an application domain
This is referred to by information scientists as an 'Ontology'.
Which means… Instances are not included!
It is the generalizations that are important
Please keep this in mind; it is crucial to understanding the tutorial
Principles for Building Biomedical Ontologies
Barry Smith
http://ifomis.de
Ontologies as Controlled Vocabularies
expressing discoveries in the life sciences in a uniform way
providing a uniform framework for managing annotation data deriving from different sources and with varying types and degrees of evidence
Overview
Following basic rules helps make better ontologies
We will work through the principles-based treatment of relations in ontologies, to show how ontologies can become more reliable and more powerful
Why do we need rules for good ontology?
Ontologies must be intelligible both to humans (for annotation) and to machines (for reasoning and error-checking)
Unintuitive rules for classification lead to entry errors (problematic links)
Facilitate training of curators
Overcome obstacles to alignment with other ontology and terminology systems
Enhance harvesting of content through automatic reasoning systems
First Rule: Univocity
Terms (including those describing relations) should have the same meanings on every occasion of use.
In other words, they should refer to the same kinds of entities in reality
Example of univocity problem in case of part_of relation
(Old) Gene Ontology:
‘part_of’ = ‘may be part of’
flagellum part_of cell
‘part_of’ = ‘is at times part of’
replication fork part_of the nucleoplasm
‘part_of’ = ‘is included as a sub-list in’
Second Rule: Positivity
Complements of classes are not themselves classes.
Terms such as ‘non-mammal’ or ‘non-membrane’ do not designate genuine classes.
Third Rule: Objectivity
Which classes exist is not a function of our biological knowledge.
Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
Fourth Rule: Single Inheritance
No class in a classificatory hierarchy should have more than one is_a parent on the immediate higher level
Rule of Single Inheritance
no diamonds:
[Diagram: A is linked to B by is_a1 and to C by is_a2]
Problems with multiple inheritance
B C
is_a1 is_a2
A
‘is_a’ no longer univocal
‘is_a’ is pressed into service to mean a variety of different things
shortfalls from single inheritance are often clues to incorrect entry of terms and relations
the resulting ambiguities make the rules for correct entry difficult to communicate to human curators
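Shortfalls from single inheritance can be listed mechanically so that a curator can review them. The following is a minimal sketch, assuming the hierarchy is available as (child, parent) is_a pairs; the function name `multiple_parents` and the toy classes A, B, C, D are illustrative and not part of any GO tooling.

```python
from collections import defaultdict

def multiple_parents(is_a_edges):
    """Return {child: sorted parents} for every class with more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

# Toy diamond: A is_a B, A is_a C, and both B and C are children of D.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
violations = multiple_parents(edges)
print(violations)
```

Each reported class is only a candidate for review; whether the extra parent reflects an entry error still has to be decided by someone with biological knowledge.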
is_a Overloading
serves as obstacle to integration with neighboring ontologies
The success of ontology alignment depends crucially on the degree to which basic ontological relations such as is_a and part_of can be relied on as having the same meanings in the different ontologies to be aligned.
Use of multiple inheritance
The resultant mélange makes coherent integration across ontologies achievable (at best) only under the guidance of human beings with relevant biological knowledge
How much should reasoning systems be forced to rely on human guidance?
Fifth Rule: Intelligibility of Definitions
The terms used in a definition should be simpler (more intelligible) than the term to be defined
otherwise the definition provides no assistance to human understanding or to machine processing
To the degree that the above rules are not satisfied, error checking and ontology alignment will be achievable, at best, only with human intervention and via brute force
Some rules are Rules of Thumb
The world of biomedical research is a world of difficult trade-offs
The benefits of formal (logical and ontological) rigor need to be balanced
Against the constraints of computer tractability,
Against the needs of biomedical practitioners.
BUT alignment and integration of biomedical information resources will be achieved only to the degree that such resources conform to these standard principles of classification and definition
Definitions should be intelligible to both machines and humans
Machines can cope with the full formal representation
Humans need to use modularity
Plasma membrane
is a cell part [immediate parent]
that surrounds the cytoplasm [differentia]
Terms and relations should have clear definitions
These tell us how the ontology relates to the world of biological instances, meaning the actual particulars in reality: actual cells, actual portions of cytoplasm, and so on…
Sixth Rule: Basis in Reality
When building or maintaining an ontology, always think carefully about how classes (types, kinds, species) relate to instances in reality
Axioms governing instances
Every class has at least one instance
Every genus (parent class) has an instantiated species (differentia + genus)
Each species (child class) has a smaller class of instances than its genus (parent class)
Axioms governing Instances
Distinct classes on the same level never share instances
Distinct leaf classes within a classification never share instances
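Two of these axioms lend themselves to an automatic check: every leaf class has at least one instance, and distinct leaf classes never share instances. The sketch below assumes a child-to-parent map and a map from leaf classes to sets of instance identifiers; the helper `axiom_violations`, the placement of frog directly under animal, and the instance names are all hypothetical.

```python
from itertools import combinations

def axiom_violations(parent_of, instances_of):
    """List leaf classes without instances and pairs of leaf classes that share instances."""
    classes = set(parent_of) | set(parent_of.values())
    # A leaf class is one that is nobody's parent.
    leaves = sorted(classes - set(parent_of.values()))
    problems = []
    for leaf in leaves:
        if not instances_of.get(leaf):
            problems.append(f"{leaf} has no instances")
    for a, b in combinations(leaves, 2):
        shared = instances_of.get(a, set()) & instances_of.get(b, set())
        if shared:
            problems.append(f"{a} and {b} share {sorted(shared)}")
    return problems

# Illustrative hierarchy; the instance identifiers are made up.
parent_of = {"siamese": "cat", "cat": "mammal", "mammal": "animal", "frog": "animal"}
good = {"siamese": {"tibbles"}, "frog": {"kermit"}}
bad = {"siamese": {"tibbles"}, "frog": {"tibbles"}}

print(axiom_violations(parent_of, good))
print(axiom_violations(parent_of, bad))
```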
[Diagram: a classification hierarchy of species and genera (substance, organism, animal, mammal, cat, siamese, frog), with a leaf class and the instances marked]
Interoperability
Ontologies should work together
ways should be found to avoid redundancy in ontology building and to support reuse
ontologies should be capable of being used by other ontologies (cumulation)
Main obstacle to integration
Current ontologies do not deal well with Time and Space and Instances (particulars)
Our definitions should link the terms in the ontology to instances in spatio-temporal reality
Benefits of well-defined relationships
If the relations in an ontology are well-defined, then reasoning can cascade from one relational assertion (A R1 B) to the next (B R2 C). Relations used in ontologies thus far have not been well defined in this sense.
A query to find all DNA binding proteins should also find all transcription factor proteins, because transcription factor is_a DNA binding protein
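The DNA binding protein query can be sketched as a walk down the is_a hierarchy. This is a toy illustration under stated assumptions: annotations are a plain dictionary from a gene product to a single term, and the names `descendants`, `find_annotated` and `protein_1` to `protein_3` are hypothetical.

```python
def descendants(term, is_a_edges):
    """Collect every term reachable from `term` by following is_a links downwards."""
    children = {}
    for child, parent in is_a_edges:
        children.setdefault(parent, set()).add(child)
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def find_annotated(term, is_a_edges, annotations):
    """Return products annotated to `term` or to any of its is_a descendants."""
    terms = {term} | descendants(term, is_a_edges)
    return sorted(p for p, t in annotations.items() if t in terms)

is_a_edges = [("transcription factor", "DNA binding protein")]
annotations = {
    "protein_1": "DNA binding protein",
    "protein_2": "transcription factor",
    "protein_3": "kinase",
}
hits = find_annotated("DNA binding protein", is_a_edges, annotations)
print(hits)
```

The inference is only as reliable as the is_a links it follows, which is why the relation needs a single, well-defined meaning.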
How to define A is_a B
A is_a B =def.
1. A and B are names of universals (natural kinds, types) in reality
2. all instances of A are as a matter of biological science also instances of B
Biomedical ontology integration / interoperability
Will never be achieved through integration of meanings or concepts
The problem is precisely that different user communities use different concepts
What’s really needed is to have well-defined commonly used relationships
Idea:
Move from associative relations between meanings to strictly defined relations between the entities themselves.
The relations can then be used computationally in the way required
Key idea: To define ontological relations
For example: part_of, develops_from
Definitions will enable computation
It is not enough to look just at classes or types.
We need also to take account of instances and time
Kinds of relations
Between classes: is_a, part_of, ...
Between an instance and a class: this explosion instance_of the class explosion
Between instances: Mary’s heart part_of Mary
Seventh Rule: Distinguish Universals and Instances
A good ontology must distinguish clearly between
universals (types, kinds, classes)
and instances (tokens, individuals, particulars)
Don’t forget instances when defining relations
part_of as a relation between classes versus part_of as a relation between instances
nucleus part_of cell
your heart part_of you
Part_of as a relation between classes is more problematic than is standardly supposed
testis part_of human being ?
heart part_of human being ?
human being has_part human testis ?
Why distinguish universals from instances?
What holds on the level of instances may not hold on the level of universals
nucleus adjacent_to cytoplasm
Not: cytoplasm adjacent_to nucleus
seminal vesicle adjacent_to urinary bladder
Not: urinary bladder adjacent_to seminal vesicle
part_of
part_of must be time-indexed for spatial universals
A part_of B is defined as:
Given any instance a and any time t,
If a is an instance of the universal A at t,
then there is some instance b of the universal B
such that
a is an instance-level part_of b at t
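The definition can be read as a check over recorded instance-level facts. The sketch below is one possible reading, not an official implementation: it assumes facts are stored as (instance, universal, time) and (part, whole, time) triples, and it additionally assumes that b must instantiate B at the same time t. The function name and the identifiers n1, n2, c1 are hypothetical.

```python
def class_level_part_of(A, B, instance_of, part_of_at):
    """Return True if every instance of A at each time t is part of some instance of B at t.

    instance_of: set of (instance, universal, time) triples
    part_of_at:  set of (part, whole, time) triples on the instance level
    """
    for a, universal, t in instance_of:
        if universal != A:
            continue
        # Assumption: the whole b must itself be an instance of B at time t.
        wholes_at_t = {b for b, u, t2 in instance_of if u == B and t2 == t}
        if not any((a, b, t) in part_of_at for b in wholes_at_t):
            return False
    return True

instance_of = {
    ("n1", "nucleus", 1), ("c1", "cell", 1),
    ("n1", "nucleus", 2), ("c1", "cell", 2),
}
part_of_at = {("n1", "c1", 1), ("n1", "c1", 2)}

print(class_level_part_of("nucleus", "cell", instance_of, part_of_at))
# A nucleus recorded at time 2 without any containing cell breaks the class-level claim.
with_stray = instance_of | {("n2", "nucleus", 2)}
print(class_level_part_of("nucleus", "cell", with_stray, part_of_at))
```

Note that the check is asymmetric: it starts from the instances of A, so it says nothing about whether every cell has a nucleus. It is also vacuously true when no instance of A has been recorded.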
Principles for Building Biomedical Ontologies: A GO Perspective
David Hill
Mouse Genome Informatics
The Jackson Laboratory
How has GO dealt with some specific aspects of ontology development?
Univocity
Positivity
Objectivity
Single Inheritance
Definitions
Formal definitions
Written definitions
Basis in Reality
Universals & Instances
Ontology Alignment
The Challenge of Univocity: People call the same thing by different names
Tactile sense
Taction
Tactition
?
Tactile sense
Taction
Tactition
perception of touch ; GO:0050975
Univocity: GO uses 1 term and many characterized synonyms
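One way to picture "1 term and many synonyms" is a lookup table that resolves every label to a single identifier. This is a minimal sketch with a hand-built dictionary; the helper names `build_index` and `resolve` are illustrative, and the structure is not the GO file format.

```python
TERMS = {
    "GO:0050975": {
        "name": "perception of touch",
        "synonyms": ["tactile sense", "taction", "tactition"],
    },
}

def build_index(terms):
    """Map every name and synonym (lower-cased) to its term identifier."""
    index = {}
    for term_id, record in terms.items():
        for label in [record["name"], *record["synonyms"]]:
            index[label.lower()] = term_id
    return index

def resolve(label, index):
    """Return the term identifier for a label, or None if the label is unknown."""
    return index.get(label.strip().lower())

index = build_index(TERMS)
print(resolve("Taction", index))
print(resolve("Tactile sense", index))
```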
The Challenge of Univocity: People use the same words to describe different things
[Three images, each labeled "= bud initiation"]
Bud initiation? How is a computer to know?
= bud initiation sensu Metazoa
= bud initiation sensu Saccharomyces
= bud initiation sensu Viridiplantae
Univocity: GO adds “sensu” descriptors to discriminate among organisms
The Importance of synonyms for utility: How do we represent the function of tRNA?
Biologically, what does the tRNA do?
Identifies the codon and inserts the amino acid in the growing polypeptide
Molecular_function
Triplet_codon amino acid adaptor activity
GO Definition: Mediates the insertion of an amino acid at the correct point in the sequence of a nascent polypeptide chain during protein synthesis.
Synonym: tRNA
But Univocity is also Dependent on a User’s Perspective
Development (The biological process whose specific outcome is the progression of an organism over time from an initial condition to a later condition)
--part_of hepatocyte differentiation
----part_of hepatocyte fate commitment
------part_of hepatocyte fate specification
------part_of hepatocyte fate determination
----part_of hepatocyte development
But Univocity is also Dependent on a User’s Perspective
So from the perspective of GO a hepatocyte begins development after it is committed to its fate. Its initial condition is after cell fate commitment.
But! A user may ask: show me things that have to do with hepatocyte development.
Do they mean show me things that have to do with ‘hepatocyte development’ or do they mean show me things that have to do with ‘development’ and a ‘hepatocyte’?
The Challenge of Positivity
Some organelles are membrane-bound.
A centrosome is not a membrane-bound organelle, but it still may be considered an organelle.
The Challenge of Positivity: Sometimes absence is a distinction in a Biologist’s mind
non-membrane-bound organelle GO:0043228
membrane-bound organelle GO:0043227
Positivity
Note the logical difference between “non-membrane-bound organelle” and “not a membrane-bound organelle”
The latter includes everything that is not a membrane-bound organelle!
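The difference is the difference between a set difference inside a parent class and a complement taken over everything. The following toy sketch uses Python sets; the small universe of entities is made up for illustration, with centrosome taken from the example above.

```python
# A deliberately tiny, illustrative universe of entities.
universe = {"nucleus", "mitochondrion", "centrosome", "ribosome", "cytosol", "hexokinase"}
organelles = {"nucleus", "mitochondrion", "centrosome", "ribosome"}
membrane_bound_organelles = {"nucleus", "mitochondrion"}

# "non-membrane-bound organelle": still an organelle, restricted within its parent class.
non_membrane_bound_organelle = organelles - membrane_bound_organelles

# "not a membrane-bound organelle": the bare complement, which sweeps in everything else.
not_a_membrane_bound_organelle = universe - membrane_bound_organelles

print(sorted(non_membrane_bound_organelle))
print(sorted(not_a_membrane_bound_organelle))
```

The first set stays within organelle, so it can sit under that parent in a hierarchy; the second also contains things like cytosol that are not organelles at all.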
The Challenge of Objectivity: Database users want to know if we don’t know anything (Exhaustiveness with respect to knowledge)
We don’t know anything about a gene product with respect to these
We don’t know anything about the ligand that binds this type of GPCR
Objectivity
How can we use GO to annotate gene products when we know that we don’t have any information about them?
Currently GO has terms in each ontology to describe unknown
An alternative might be to annotate genes to root nodes and use an evidence code to describe that we have no data.
Similar strategies could be used for things like receptors where the ligand is unknown.
GPCRs with unknown ligands
We could annotate to this
Single Inheritance
GO has a lot of is_a diamonds
Some are due to incompleteness/inaccuracies within the graph
Some are due to a mixture of dissimilar entities within the graph at the same level
Is_a diamond in GO Process
behavior
locomotory behavior larval behavior
larval locomotory behavior
Is_a diamond in GO Function
enzyme regulator activity
GTPase regulator activity    enzyme activator activity
GTPase activator activity
Is_a diamond in GO Cellular Component
organelle
non-membrane bound organelle    intracellular organelle
non-membrane bound intracellular organelle
Technically the diamonds are correct, but could be eliminated
locomotory behavior    larval behavior
GTPase regulator activity    enzyme activator activity
non-membrane bound organelle    intracellular organelle
What do these pairs have in common?
What do the middle pairs of terms all have in common?
locomotory behavior    larval behavior
GTPase regulator activity    enzyme activator activity
non-membrane bound organelle    intracellular organelle
They are all differentiated from the parent term by a different factor
locomotory behavior    larval behavior
GTPase regulator activity    enzyme activator activity
non-membrane bound organelle    intracellular organelle
Type of behavior vs. what is behaving
What is regulated vs. type of regulator
Type of organelle vs. location of organelle
Insert an intermediate grouping term
behavior
locomotory behavior larval behavior
larval locomotory behavior
behavior of a thing
descriptive behavior
These actually describe two different relationships
Create a new relationship type
behavior
locomotory behavior larval behavior
larval locomotory behavior
behavior of a thing
descriptive behavior
Why insert terms that no one would use?
behavior
locomotory behavior    larval behavior    rhythmic behavior    adult behavior
By the structure of this graph, locomotory behavior has the same relationship to larval behavior as to rhythmic behavior
Why insert terms (types) that no one would use?
behavior
locomotory behavior    larval behavior    rhythmic behavior    adult behavior
But actually, locomotory behavior/rhythmic behavior and larval behavior/adult behavior group naturally
Descriptive behavior
Behavior of a thing
This type of single-step differentiation of types between levels would allow us to use distances between nodes and levels to compare similarity.
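The effect of the grouping terms on node distance can be shown with a shortest-path count. This is a sketch under assumptions: distance is taken as the number of is_a links on the shortest path, ignoring direction, and the function `distance` and the two edge lists are illustrative reconstructions of the flat and grouped graphs.

```python
from collections import deque

def distance(a, b, is_a_edges):
    """Shortest number of is_a links between two terms, ignoring link direction."""
    neighbors = {}
    for child, parent in is_a_edges:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the two terms are not connected

kinds = ["locomotory behavior", "rhythmic behavior"]
things = ["larval behavior", "adult behavior"]

# Flat graph: all four terms are direct children of behavior.
flat = [(t, "behavior") for t in kinds + things]

# Grouped graph: intermediate terms separate the two differentiating factors.
grouped = (
    [(t, "descriptive behavior") for t in kinds]
    + [(t, "behavior of a thing") for t in things]
    + [("descriptive behavior", "behavior"), ("behavior of a thing", "behavior")]
)

print(distance("locomotory behavior", "larval behavior", flat))
print(distance("locomotory behavior", "larval behavior", grouped))
print(distance("locomotory behavior", "rhythmic behavior", grouped))
```

In the flat graph every pair is equally far apart; with the grouping terms, terms differentiated by the same factor end up closer than terms differentiated by different factors.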
GO Definitions
A definition written by a biologist: necessary & sufficient conditions; written definition (not computable)
Graph structure: necessary conditions; formal (computable)
Relationships and definitions
The set of necessary conditions is determined by the graph
This can be considered a partial definition
Important considerations:
Placement in the graph - selecting parents
Appropriate relationships to different parents
True path violation
Placement in the graph
Example - Proteasome complex
The importance of relationships
Cyclin dependent protein kinase
Complex has a catalytic and a regulatory subunit
How do we represent these activities (function) in the ontology?
Do we need a new relationship type (regulates)?
Catalytic activity
protein kinase activity
protein Ser/Thr kinase activity
Cyclin dependent protein kinase activity
Cyclin dependent protein kinase regulator activity
Molecular_function
Enzyme regulator activity
Protein kinase regulator activity
We must avoid true path violations: "the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Diagram: mitochondrial chromosome is linked to chromosome by an is_a relationship; chromosome is linked to nucleus by a part_of relationship]
We must avoid true path violations: "the pathway from a child term all the way up to its top-level parent(s) must always be true".
[Diagram: nuclear chromosome and mitochondrial chromosome are each linked to chromosome by is_a relationships; nuclear chromosome is linked to nucleus by a part_of relationship]
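A program cannot know by itself that a mitochondrial chromosome is not part of a nucleus, but it can list what the graph implies so that a curator can spot a false path. The sketch below assumes is_a and part_of are stored as maps from a term to a set of targets; the function `implied_wholes` is hypothetical, and it only follows part_of links that are inherited through is_a, not chains of part_of.

```python
def implied_wholes(term, is_a, part_of):
    """Collect everything `term` is implied to be part_of, via itself or any is_a ancestor."""
    wholes, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        wholes |= part_of.get(current, set())
        stack.extend(is_a.get(current, ()))
    return wholes

# Problematic graph: part_of nucleus is asserted on the general term chromosome.
is_a_v1 = {"mitochondrial chromosome": {"chromosome"}}
part_of_v1 = {"chromosome": {"nucleus"}}

# Repaired graph: part_of nucleus is asserted only on nuclear chromosome.
is_a_v2 = {
    "mitochondrial chromosome": {"chromosome"},
    "nuclear chromosome": {"chromosome"},
}
part_of_v2 = {"nuclear chromosome": {"nucleus"}}

print(implied_wholes("mitochondrial chromosome", is_a_v1, part_of_v1))
print(implied_wholes("mitochondrial chromosome", is_a_v2, part_of_v2))
```

In the first graph the child term inherits "part_of nucleus", which is false for a mitochondrial chromosome; in the second the path from the child term to the top stays true.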
GO textual definitions: Related GO terms have similarly structured (normalized) definitions
Structured definitions contain both genus and differentiae
Essence = Genus + Differentiae
neuron cell differentiation =
Genus: differentiation (processes whereby a relatively unspecialized cell acquires the specialized features of..)
Differentiae: acquires features of a neuron
Basis in Reality
GO is designed by a consortium
As long as egos don’t get in the way, GO represents types rather than concepts
Large-scale developments of the GO are a result of compromise
Gene Annotators have a large say in GO content
Annotators are experts in their fields
Annotators constantly read the scientific literature
But, since GO is representing a science, GO actually represents paradigms.
Therefore, it is essential that GO is able to change!
Types and Instances
For the sake of GO, types are the terms and instances are the gene product attributes that are annotated to them.
Types and Instances
When should we create a new type as opposed to multiple annotations?
When the biology represents a universal principle. Receptor signaling protein tyrosine kinase activity does not represent receptor signaling protein activity and tyrosine kinase activity independently.
Ontology alignment
One of the current goals of GO is to align:
Cell Types in GO                      with   Cell Types in the Cell Ontology
cone cell fate commitment                    retinal_cone_cell
keratinocyte differentiation                 keratinocyte
adipocyte differentiation                    fat_cell
dendritic cell activation                    dendritic_cell
lymphocyte proliferation                     lymphocyte
T-cell homeostasis                           T_lymphocyte
garland cell differentiation                 garland_cell
heterocyst cell differentiation              heterocyst
Alignment of the Two Ontologies will permit the generation of consistent and complete definitions
id: CL:0000062
name: osteoblast
def: "A bone-forming cell which secretes an extracellular matrix. Hydroxyapatite crystals are then deposited into the matrix to form bone." [MESH:A.11.329.629]
is_a: CL:0000055
relationship: develops_from CL:0000008
relationship: develops_from CL:0000375
GO + Cell type = New Definition
Osteoblast differentiation: Processes whereby an osteoprogenitor cell or a cranial neural crest cell acquires the specialized features of an osteoblast, a bone-forming cell which secretes extracellular matrix.
Alignment of the Two Ontologies will permit the generation of consistent and complete definitions
id: GO:0001649
name: osteoblast differentiation
synonym: osteoblast cell differentiation
genus: differentiation GO:0030154 (differentiation)
differentium: acquires_features_of CL:0000062 (osteoblast)
definition (text): Processes whereby a relatively unspecialized cell acquires the specialized features of an osteoblast, the mesodermal cell that gives rise to bone
Formal definitions with necessary and sufficient conditions, in both human readable and computer readable forms
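Once the genus and differentium are recorded formally, the human-readable text can be generated from them. The following is a minimal sketch, assuming a fixed sentence template for differentiation terms and a small hand-built table of cell types; the function names and dictionary layout are illustrative and are not the GO or Cell Ontology file formats.

```python
def article(noun):
    """Pick 'a' or 'an' with a simple first-letter rule."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def differentiation_definition(cell_name, cell_gloss):
    """Fill the differentiation template with a cell type name and its short description."""
    return (
        "Processes whereby a relatively unspecialized cell acquires the "
        f"specialized features of {article(cell_name)} {cell_name}, {cell_gloss}"
    )

# Hand-built lookup standing in for the Cell Ontology.
cell_types = {
    "CL:0000062": ("osteoblast", "the mesodermal cell that gives rise to bone"),
}

# Formal (computable) part of the GO term.
term = {
    "id": "GO:0001649",
    "genus": "GO:0030154",
    "differentium": ("acquires_features_of", "CL:0000062"),
}

name, gloss = cell_types[term["differentium"][1]]
text = differentiation_definition(name, gloss)
print(text)
```

Because every differentiation term is filled from the same template, related terms end up with similarly structured definitions.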
Other Ontologies that can be aligned with GO
Chemical ontologies: 3,4-dihydroxy-2-butanone-4-phosphate synthase activity
Anatomy ontologies: metanephros development
GO itself: mitochondrial inner membrane peptidase activity
But Eventually…
Building Ontology
Improve
Collaborate and Learn
A tribute to Lewis Carroll
Once master the machinery of Symbolic Logic, and you have a mental occupation always at hand, of absorbing interest, and one that will be of real use to you in any subject you may take up. It will give you clearness of thought - the ability to see your way through a puzzle - the habit of arranging your ideas in an orderly and get-at-able form - and, more valuable than all, the power to detect fallacies, and to tear to pieces the flimsy illogical arguments, which you will so continually encounter in books, in newspapers, in speeches, and even in sermons, and which so easily delude those who have never taken the trouble to master this fascinating Art.
Lewis Carroll
(a) All babies are illogical.
(b) Nobody is despised who can manage a crocodile.
(c) Illogical persons are despised.
Can a baby manage a crocodile?