Semantic tools for aggregation of morphological characters across studies James Balhoff, Alex Dececchi, Paula Mabee, Hilmar Lapp, & Phenoscape team

DESCRIPTION

Presented by Hilmar Lapp at the TDWG 2013 conference in Florence, Italy, on Nov 1, 2013.

Page 1: Semantic tools for aggregation of morphological characters across studies

Semantic tools for aggregation of morphological characters across studies

James Balhoff, Alex Dececchi, Paula Mabee, Hilmar Lapp, & Phenoscape team

Page 2: Semantic tools for aggregation of morphological characters across studies

Rich body of morphological observations – mostly locked up

[Slide image: page 4 of Harris et al. 2008, "Zebrafish Model of Human Ectodermal Dysplasia", PLoS Genetics 4(10): e1000206 (www.plosgenetics.org), shown as an example of free-text phenotype description. The page includes Figure 2 (doi:10.1371/journal.pgen.1000206.g002) and Table 1 (doi:10.1371/journal.pgen.1000206.t001).]

Page 3: Semantic tools for aggregation of morphological characters across studies

Free text is a barrier to machine-based integration

OMIM query                   # of records
“large bone”                 1083
“enlarged bone”              224
“big bones”                  21
“huge bones”                 4
“massive bones”              41
“hyperplastic bones”         12
“hyperplastic bone”          45
“bone hyperplasia”           181
“increased bone growth”      879

Lundberg & Akama 2005

Phylogenetic systematics Human genetics

http://www.ncbi.nlm.nih.gov/omim

Page 4: Semantic tools for aggregation of morphological characters across studies

Integration is key for knowledge synthesis

The Tree of Life and a New Classification of Bony Fishes—Betancur-R. et al. 2013. PLoS Currents Tree of Life

Page 5: Semantic tools for aggregation of morphological characters across studies

Integration is key for discovery

Page 6: Semantic tools for aggregation of morphological characters across studies

Phenoscape: making evolutionary morphology computable

Comparative studies + Model organism datasets = Phenoscape Knowledgebase

Page 7: Semantic tools for aggregation of morphological characters across studies

How it works: shared ontologies, rich semantics, OWL reasoning

Page 8: Semantic tools for aggregation of morphological characters across studies

16,000 character states from >120 comparative morphological datasets, linked to 4,000 vertebrate taxa.

Imported genetic phenotype and expression data from ZFIN, Xenbase, MGI, and Human Phenotype project.

Shared semantics: Uberon (anatomy), PATO (phenotypic qualities), Entity–Quality (EQ) OWL axioms (phenotype observations)

Plus a dozen other ontologies ...

Phenoscape KB content

Page 9: Semantic tools for aggregation of morphological characters across studies

Integrative querying with the Phenoscape KB: scale, absent

Ictalurus punctatus eda gene in Danio rerio

edadt3S243X/dt3S243X: Harris, M.P., Rohner, N., Schwarz, H., Perathoner, S., Konstantinidis, P., and Nüsslein-Volhard, C. 2008. Zebrafish eda and edar mutants reveal conserved and ancestral roles of ectodysplasin signaling in vertebrates. PLoS Genetics 4(10):e1000206.

“body: naked”: Kailola, P. J. 2004. A phylogenetic exploration of the catfish family Ariidae (Otophysi; Siluriformes). The Beagle, Records of the Museums and Art Galleries of the Northern Territory 20:87-166.

Page 10: Semantic tools for aggregation of morphological characters across studies

Can we use reasoning to integrate character matrices across studies?

Would make the wealth of single-study character analysis methods applicable to any integrated matrix.

Including tree-based comparative phylogenetic methods

Integrating phylogenetic studies

Page 11: Semantic tools for aggregation of morphological characters across studies

Combined matrix of any character states related to presence/absence of limb/fin structures from studies in Phenoscape KB

Evolution of Sarcopterygian Limb/Fin

Clack, J. A. (2009). The Fin to Limb Transition: New Data, Interpretations, and Hypotheses from Paleontology and Developmental Biology. Annual Review of Earth and Planetary Sciences, 37(1), 163-179

Page 12: Semantic tools for aggregation of morphological characters across studies

EQ supermatrix synthesis: workflow

1. Use OWL reasoner to group character states by anatomy and quality axes, based on EQ annotations.

2. Export groupings as character matrix, with taxon assignments to states from original data.

3. Supplement presence/absence character state assertions with reasoner-inferred information.

4. Use Phenex data editor to manually consolidate character states where appropriate.
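
Steps 1 and 2 of this workflow can be pictured with a small sketch. It is plain Python standing in for the OWL-based tooling, and everything in it is an assumption made for illustration: the taxon and study names, the tuple layout, and the function name `synthesize_matrix` are hypothetical, and the grouping key is a simple (anatomy term, quality attribute) pair rather than a reasoner-classified OWL class.

```python
from collections import defaultdict

# Hypothetical EQ-annotated character states:
# (study, taxon, anatomy term, quality attribute, state label)
annotated_states = [
    ("study_A", "taxon_1", "humerus", "presence", "present"),
    ("study_B", "taxon_1", "humerus", "shape", "slender"),
    ("study_B", "taxon_2", "humerus", "presence", "present"),
    ("study_C", "taxon_3", "humerus", "presence", "absent"),
]


def synthesize_matrix(states):
    """Group states into characters keyed by (anatomy, attribute).

    Returns a nested mapping: taxon -> character -> set of state labels,
    keeping the taxon assignments from the original studies.
    """
    matrix = defaultdict(lambda: defaultdict(set))
    for _study, taxon, anatomy, attribute, state in states:
        matrix[taxon][(anatomy, attribute)].add(state)
    return matrix


matrix = synthesize_matrix(annotated_states)
characters = sorted({c for row in matrix.values() for c in row})

# Export as a simple taxon-by-character table; "?" marks a missing state.
for taxon in sorted(matrix):
    cells = ["/".join(sorted(matrix[taxon].get(c, {"?"}))) for c in characters]
    print(taxon, cells)
```

Steps 3 and 4 (reasoner-inferred presence/absence and manual consolidation in Phenex) are not covered by this sketch.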

Page 13: Semantic tools for aggregation of morphological characters across studies

Synthesized limb/fin character matrix

1055 Sarcopterygian taxa

494 characters

2-7 states per character

from 55 original studies

Developed several tools for automated character matrix synthesis to make this happen.

EQ supermatrix synthesis: Results

Page 14: Semantic tools for aggregation of morphological characters across studies

Ontologies and phenotype observation data in OWL

ELK, an OWL-EL reasoner

OWL-DL reasoners are too slow for this

OWL API (Java), programmed primarily using Scala

Bigdata™ RDF triplestore (~ 25 million triples)

Technology stack

Page 15: Semantic tools for aggregation of morphological characters across studies

For every pair of anatomical term X and quality attribute Y, generate a “character expression” OWL class: (involves some X and involves some Y)

Done programmatically via property chain axioms and OWL reasoning (ELK)

Classify character states to most relevant character expression

Done by OWL reasoner (ELK)

Inferred relationships materialized to triple store

Using reasoning to group character states
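
A rough illustration of the grouping idea, assuming toy hierarchies in place of Uberon and PATO and ordinary Python lookups in place of ELK. The terms, the `ATTRIBUTES` tuple, and the function names are hypothetical; "most relevant" is approximated here as the matching expression with the most specific anatomical term.

```python
# Toy is_a hierarchies (child -> parent). Hypothetical, not real ontology content.
ANATOMY = {"humerus": "forelimb bone", "forelimb bone": "bone", "bone": None}
QUALITY = {"slender": "shape", "elongated": "shape", "shape": None,
           "large": "size", "size": None}
ATTRIBUTES = ("shape", "size")  # quality attributes Y


def ancestors(term, parents):
    """Return the term followed by all of its superclasses, nearest first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = parents[term]
    return chain


def character_expressions():
    """One expression per (anatomical term X, quality attribute Y) pair,
    read as: (involves some X) and (involves some Y)."""
    return [(x, y) for x in ANATOMY for y in ATTRIBUTES]


def is_subsumed(anatomy, quality, expression):
    """True if a state annotated with (anatomy, quality) falls under the expression."""
    x, y = expression
    return x in ancestors(anatomy, ANATOMY) and y in ancestors(quality, QUALITY)


def most_relevant_expression(anatomy, quality):
    """Pick the matching expression whose anatomical term is most specific."""
    lineage = ancestors(anatomy, ANATOMY)
    candidates = [e for e in character_expressions()
                  if is_subsumed(anatomy, quality, e)]
    return min(candidates, key=lambda e: lineage.index(e[0]), default=None)


print(most_relevant_expression("humerus", "slender"))
```

In the real pipeline the subsumption test is what the OWL reasoner computes; the sketch only shows the shape of the classification step.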

Page 16: Semantic tools for aggregation of morphological characters across studies

Anatomy ontologies and EQ annotation employ rich OWL semantics → best used with a DL reasoner

Classifying and querying over large dataset (~25 million RDF triples) does not scale well

Presently, the only feasible OWL reasoner is ELK

constrained to OWL EL profile → limits kinds of expressions we use

best performance over class axioms only → data must be modeled so as to avoid need for classifying instances

Challenge: scalable reasoning

Page 17: Semantic tools for aggregation of morphological characters across studies

Want to allow arbitrary selection of structures of interest, using rich semantics: (part_of some (limb/fin or girdle skeleton)) or (connected_to some girdle skeleton)

RDF triplestores provide very limited reasoning expressivity, and scale poorly with large ontologies.

However, ELK can answer class expression queries within seconds.

Challenge: Querying complex expressions
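
To make the example expression concrete, here is a minimal sketch of how such a query selects terms. It assumes a tiny hand-written ontology; the axioms in `IS_A`, `PART_OF`, and `CONNECTED_TO` are invented for illustration and the evaluation ignores property transitivity and inherited restrictions, which a real OWL reasoner such as ELK would take into account.

```python
# Toy axioms, invented for illustration only.
IS_A = {
    "forelimb": {"limb/fin"},
    "pectoral girdle skeleton": {"girdle skeleton"},
}
PART_OF = {
    "humerus": {"forelimb"},
    "scapula": {"pectoral girdle skeleton"},
    "frontal bone": {"skull"},
}
CONNECTED_TO = {
    "sternum": {"pectoral girdle skeleton"},
}


def superclasses(term):
    """Reflexive, transitive closure of is_a for one term."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A.get(current, ()))
    return seen


def some(relation, filler_classes):
    """Terms related by `relation` to something classified under any filler class."""
    return {term for term, targets in relation.items()
            if any(superclasses(target) & filler_classes for target in targets)}


# (part_of some (limb/fin or girdle skeleton)) or (connected_to some girdle skeleton)
selected = (some(PART_OF, {"limb/fin", "girdle skeleton"})
            | some(CONNECTED_TO, {"girdle skeleton"}))
print(sorted(selected))
```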

Page 18: Semantic tools for aggregation of morphological characters across studies

Instead of something like this (*):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ao: <http://purl.obolibrary.org/obo/my-anatomy-ontology/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT DISTINCT ?gene
    WHERE {
      ?gene ao:expressed_in ?structure .
      ?structure rdf:type ?structure_class .
      # Triple pattern selecting structure:
      ?structure_class rdfs:subClassOf ao:muscle .
      ?structure_class rdfs:subClassOf ?restriction .
      ?restriction owl:onProperty ao:part_of .
      ?restriction owl:someValuesFrom ao:head .
    }

We would really like to do this:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ao: <http://purl.obolibrary.org/obo/my-anatomy-ontology/>
    PREFIX ow: <http://purl.org/phenoscape/owlet/syntax#>
    SELECT DISTINCT ?gene
    WHERE {
      ?gene ao:expressed_in ?structure .
      ?structure rdf:type ?structure_class .
      # Triple pattern containing an OWL expression:
      ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
    }

Page 19: Semantic tools for aggregation of morphological characters across studies

owlet interprets OWL class expressions embedded within SPARQL queries

Uses any OWL API-based reasoner to preprocess query.

We use ELK, which holds the terminology in memory.

Replaces OWL expression with FILTER statement listing matching terms

https://github.com/phenoscape/owlet

owlet: SPARQL query expansion with in-memory OWL reasoner

Page 20: Semantic tools for aggregation of morphological characters across studies

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ao: <http://purl.obolibrary.org/obo/my-anatomy-ontology/>
    PREFIX ow: <http://purl.org/phenoscape/owlet/syntax#>
    SELECT DISTINCT ?gene
    WHERE {
      ?gene ao:expressed_in ?structure .
      ?structure rdf:type ?structure_class .
      # Triple pattern containing an OWL expression:
      ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
    }

⬇ owlet ⬇

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ao: <http://purl.obolibrary.org/obo/my-anatomy-ontology/>
    PREFIX ow: <http://purl.org/phenoscape/owlet/syntax#>
    SELECT DISTINCT ?gene
    WHERE {
      ?gene ao:expressed_in ?structure .
      ?structure rdf:type ?structure_class .
      # Filter constraining ?structure_class to the terms returned by the OWL query:
      FILTER(?structure_class IN (ao:adductor_mandibulae, ao:constrictor_dorsalis, ...))
    }
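
The rewrite shown on this slide can be imitated in a few lines. This is not owlet's implementation (owlet is built on the OWL API and a real reasoner); it is a minimal sketch that assumes a regular expression is good enough to find the typed literal, and it uses a hypothetical `subclasses_of` function with a hard-coded answer in place of the reasoner query.

```python
import re


def subclasses_of(expression):
    """Hypothetical stand-in for asking a reasoner for subclasses of an expression."""
    canned_answers = {
        "ao:muscle and (ao:part_of some ao:head)":
            ["ao:adductor_mandibulae", "ao:constrictor_dorsalis"],
    }
    return canned_answers[expression]


# Matches: ?var rdfs:subClassOf "<expression>"^^ow:omn .
OWL_PATTERN = re.compile(r'(\?\w+)\s+rdfs:subClassOf\s+"([^"]+)"\^\^ow:omn\s*\.')


def expand(query):
    """Replace each embedded OWL expression with a FILTER listing matching terms."""
    def to_filter(match):
        variable, expression = match.groups()
        terms = ", ".join(subclasses_of(expression))
        return f"FILTER({variable} IN ({terms}))"
    return OWL_PATTERN.sub(to_filter, query)


query = """SELECT DISTINCT ?gene
WHERE {
  ?gene ao:expressed_in ?structure .
  ?structure rdf:type ?structure_class .
  ?structure_class rdfs:subClassOf "ao:muscle and (ao:part_of some ao:head)"^^ow:omn .
}"""

expanded = expand(query)
print(expanded)
```

The design point is that the triplestore only ever sees plain SPARQL; all OWL semantics are resolved before the query is sent.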

Page 21: Semantic tools for aggregation of morphological characters across studies

Inferring presence/absence

Character states often do not directly assert, but imply presence or absence.

Most phenotypic descriptions of some feature of a structure imply its presence or absence:

“Humerus slender and elongate: with length more than three times the diameter of its distal end” → humerus must be present

Partonomy axioms in the ontology allow inferring presence or absence:

‘all humerus part_of some forelimb’ → forelimb must be present if humerus is; humerus must be absent if forelimb is
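
The two inference directions can be sketched as follows. This is an assumption-laden toy: `PART_OF` holds one parent per part, the "radius" entry is added only for illustration, and the function `infer_presence_absence` is hypothetical. In the project itself these inferences come from OWL reasoning over the ontology's partonomy axioms.

```python
# 'all humerus part_of some forelimb'; the radius entry is a hypothetical extra.
PART_OF = {"humerus": "forelimb", "radius": "forelimb"}


def infer_presence_absence(asserted):
    """Extend asserted states (structure -> 'present' or 'absent') with inferences.

    Presence of a part implies presence of its whole; absence of a whole
    implies absence of its parts. Asserted values are never overwritten.
    """
    inferred = dict(asserted)

    # Presence propagates upward along part_of.
    for structure, state in asserted.items():
        if state == "present":
            whole = PART_OF.get(structure)
            while whole is not None:
                inferred.setdefault(whole, "present")
                whole = PART_OF.get(whole)

    # Absence propagates downward; repeat until nothing new is added.
    changed = True
    while changed:
        changed = False
        for part, whole in PART_OF.items():
            if inferred.get(whole) == "absent" and part not in inferred:
                inferred[part] = "absent"
                changed = True
    return inferred


print(infer_presence_absence({"humerus": "present"}))
print(infer_presence_absence({"forelimb": "absent"}))
```

Note that a present forelimb says nothing about the radius, so that cell stays empty in the first call.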

Page 22: Semantic tools for aggregation of morphological characters across studies

Absence is typically modeled using negation → not (has_part some forelimb)

Negation not part of OWL EL (and thus ELK reasoner)

Solution: programmatic assertion of “absence hierarchy” via classification of negated expressions

Challenge: absence reasoning with OWL EL

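A = has_part some forelimb

B = has_part some limb

C = has_part some appendage

Classified hierarchy: A is a subclass of B, and B is a subclass of C.

absentA = not A

absentB = not B

absentC = not C

Absence hierarchy is the reverse: absentC is a subclass of absentB, and absentB is a subclass of absentA.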

Requires precomputation, constraints for on-the-fly use
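
A minimal sketch of the reversal step, assuming the classified subclass pairs for the positive expressions are already available (here hard-coded with the slide's names A, B, C). The function name and the pair representation are hypothetical; the actual tools assert the resulting axioms into the ontology.

```python
# (subclass, superclass) pairs as a reasoner would classify the positive expressions.
classified = [("A", "B"), ("B", "C")]


def absence_hierarchy(subclass_pairs):
    """Reverse each subsumption: if X is a subclass of Y, then not Y is a subclass of not X."""
    return [(f"absent{sup}", f"absent{sub}") for sub, sup in subclass_pairs]


for sub, sup in absence_hierarchy(classified):
    print(f"{sub} SubClassOf {sup}")
```

Because the reversed axioms are asserted rather than derived by the EL reasoner, they have to be computed ahead of time for the expressions of interest.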

Page 23: Semantic tools for aggregation of morphological characters across studies

Challenge: Character state consolidation

Page 24: Semantic tools for aggregation of morphological characters across studies

Challenge: Character state consolidation

Reduced 1-297 states per character to 2-7.

Page 25: Semantic tools for aggregation of morphological characters across studies

Result: Reasoning fills in many missing character states

asserted presence/absence with inference

Mesquite “birds-eye view”

Page 26: Semantic tools for aggregation of morphological characters across studies

Unified matrix enables candidate gene view

Linking evolutionary phenotypes to genes through ontologies, via Phenoscape KB or similarity

Page 27: Semantic tools for aggregation of morphological characters across studies

Conflicting interpretations in studies: supinator process of humerus is both absent and present in Strepsodus (Zhu et al. 1999 vs. Ruta 2011)

Gaps in knowledge: acetabulum present or absent?

Same term, different meaning? Acanthostega: “radials, jointed” (Swartz 2012), but Acanthostega doesn’t have radials...

Uneven taxon sampling

Integrated data highlight conflict and gaps

Acetabulum of pelvic girdle: present/absent

figure from Parker et al., 2005

http://characterdesignnotes.blogspot.com/2011/04/proper-use-of-reference-and-anatomy-in.html
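
Once states from different studies sit in one matrix, conflicts and gaps can be listed mechanically. The sketch below assumes a flat list of observations with hypothetical study and taxon names; it does not reproduce the actual scorings from the studies cited on this slide.

```python
from collections import defaultdict

# Hypothetical observations: (study, taxon, character, state)
observations = [
    ("study_1", "taxon_X", "supinator process of humerus", "present"),
    ("study_2", "taxon_X", "supinator process of humerus", "absent"),
    ("study_1", "taxon_Y", "acetabulum", "present"),
]


def find_conflicts(obs):
    """Return (taxon, character) cells scored differently by different studies."""
    by_cell = defaultdict(dict)
    for study, taxon, character, state in obs:
        by_cell[(taxon, character)][study] = state
    return {cell: studies for cell, studies in by_cell.items()
            if len(set(studies.values())) > 1}


def find_gaps(obs):
    """Return (taxon, character) cells that no study has scored."""
    taxa = {taxon for _, taxon, _, _ in obs}
    characters = {character for _, _, character, _ in obs}
    scored = {(taxon, character) for _, taxon, character, _ in obs}
    return sorted((t, c) for t in taxa for c in characters if (t, c) not in scored)


conflicts = find_conflicts(observations)
gaps = find_gaps(observations)
print(conflicts)
print(gaps)
```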

Page 28: Semantic tools for aggregation of morphological characters across studies

https://github.com/phenoscape

owlet (SPARQL processor), Phenex (semantic data editor), phenoscape-owl-tools (KB build), others

http://phenoscape.org/wiki/Software

Phenoscape software

Page 29: Semantic tools for aggregation of morphological characters across studies

Phenoscape project team

National Evolutionary Synthesis Center (NESCent): Todd Vision (also University of North Carolina at Chapel Hill), Hilmar Lapp, Jim Balhoff, Prashanti Manda

University of South Dakota: Paula Mabee, Wasila Dahdul, Alex Dececchi

University of Chicago: Paul Sereno, Nizar Ibrahim

Mouse Genome Informatics: Judith Blake, Terry Hayamizu

University of Oregon (Zebrafish Information Network): Monte Westerfield, Yvonne Bradford, Ceri Van Slyke

Cincinnati Children's Hospital (Xenbase): Aaron Zorn, Christina James-Zorn, Virgilio Ponferrada

California Academy of Sciences: David Blackburn

University of Arizona: Hong Cui

Oregon Health & Science University: Melissa Haendel

Lawrence Berkeley National Labs: Chris Mungall