Presentation "AmiGO2: document-oriented approach to ontology software and escaping heartache of SQL" by Seth Carbon at BOSC2012
Introduction Data as a document What it gets you Development and maintainability Acknowledgments
AmiGO 2: a document-oriented approach to ontology software and escaping the heartache of SQL.
Seth Carbon (with Chris Mungall and Heiko Dietze)
Berkeley BOP (http://berkeleybop.org), Lawrence Berkeley National Lab
13 July 2012
Outline
1 Introduction
2 Data as a document
3 What it gets you
4 Development and maintainability
5 Acknowledgments
Introduction
Where we were at: The Software
AmiGO (http://amigo.geneontology.org) is an open-source web application that allows users to query, browse, and visualize ontologies and related gene product annotation data.
The basic things that we have to do:
Get information about gene products and terms.
Search by text in various fields.
Find direct annotations for a term, filtered by ...
Find all inferred annotations and/or genes to a term, filtered by ...
AmiGO 1.8 (term details page)
Where we were at: The Problem
More, more, more.
After years of clinging to a core SQL backend, with an increasing number of tricks, extensions, and caches to keep the performance at an acceptable level, things had to change ...
Complicated queries: enrichment, subsets, search, reports, etc.
Data: ~1,500,000 -> ~13,000,000 -> ~80,000,000 -> ???
Provided services
How the graph was in SQL
"Give me all of the genes in Drosophila annotated to 'neurogenesis'."
[Chado users should be familiar with our ontology model.]
Our Solution: Solr
Solr: a specialized HTTP server over the Lucene document store.
AmiGO 2 has greatly increased in flexibility, speed, reliability, and development turnaround time over its SQL predecessor.
For example: a deep text search went from ~30 s down to ~0.3 s.
It has also made things that were previously not feasible easy to do.
Data as a document
One minute overview of Solr
It is a document store.
Each document can have any number of named fields.
These field names do not need to be unique: having multiple behaves like a list.
The values of these fields can be any number of atomic types (if they are a string).
Example to this point
document_category: ontology_class
is_obsolete: false
label: neurogenesis
label_searchable: neurogenesis
id: GO:0022008
source: biological_process
description: Generation of cells within the nervous system.
description_searchable: Generation of cells within the nervous system.
comment: This term was added by GO_REF:0000021.
comment_searchable: This term was added by GO_REF:0000021.
synonym: nervous system cell generation
synonym: neural cell differentiation
synonym_searchable: nervous system cell generation
synonym_searchable: neural cell differentiation
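The repeated-field behaviour can be sketched in Python: a client-side document is a plain dictionary, and a field that occurs more than once is held as a list. This is only an illustration of the data shape, not AmiGO 2 code; the helper name `flatten_fields` is hypothetical.

```python
from typing import Dict, List, Tuple, Union

FieldValue = Union[str, List[str]]


def flatten_fields(doc: Dict[str, FieldValue]) -> List[Tuple[str, str]]:
    """Expand a document into (field name, value) pairs.

    A list-valued field yields one pair per element, which mirrors
    the way a repeated field name behaves like a list.
    """
    pairs: List[Tuple[str, str]] = []
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            pairs.append((name, item))
    return pairs


# A cut-down version of the neurogenesis document from the slide.
doc: Dict[str, FieldValue] = {
    "id": "GO:0022008",
    "label": "neurogenesis",
    "synonym": [
        "nervous system cell generation",
        "neural cell differentiation",
    ],
}

pairs = flatten_fields(doc)
for name, value in pairs:
    print(f"{name}: {value}")
```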
Conversion: how the graph was in SQL
"Give me all of the genes in Drosophila annotated to 'neurogenesis'."
Conversion: the graph with Solr
A graph aspect in Solr:
[GO:0048770 GO:0031988 GO:0005623 GO:0031410 GO:0005575 GO:0031982 GO:0043231 GO:0005622 GO:0044444 GO:0005737 GO:0043226 GO:0043227 GO:0044464 GO:0016023 GO:0043229 GO:0044424]
More on this later ...
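One way to picture how such a list could be produced at load time is a transitive closure over parent links. The sketch below is a generic graph walk and not the OWLTools implementation; the term IDs and edges are toy values invented for illustration, and whether the real closure includes the term itself is assumed here.

```python
from typing import Dict, List, Set


def closure(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term plus every ancestor reachable through parent links."""
    seen: Set[str] = {term}  # assumption: the closure is reflexive
    stack: List[str] = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy graph (hypothetical IDs): each term maps to its direct parents.
toy_parents: Dict[str, List[str]] = {
    "T:leaf": ["T:mid_a", "T:mid_b"],
    "T:mid_a": ["T:root"],
    "T:mid_b": ["T:root"],
}

# The sorted closure is what would be stored as a multi-valued field.
closure_field = sorted(closure("T:leaf", toy_parents))
print(closure_field)
```

Because the closure is stored on every document, "everything under term X" becomes a plain field match at query time, with no graph traversal.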
In for a penny, in for a pound
In fact, why not cram in everything we can? Lucene is designed for data rather larger than what we have.
JSON maps of ids to labels and labels to ids.
Rich graph segments as non-indexed JSON blobs.
Anything that might have been cached.
Want just direct annotations? Add another field.
A final schema (annotation)
document_category annotation
id MGI:MGI:107940_:_GO:0014013
bioentity MGI:MGI:107940
bioentity_label Ezh2
bioentity_label_searchable Ezh2
source MGI
date 20110523
taxon NCBITaxon:10090
taxon_label Mus musculus
taxon_label_searchable Mus musculus
reference MGI:MGI:4833736|PMID:20798045
evidence_type IMP
evidence_with MGI:MGI:2661097
annotation_class GO:0014013
annotation_class_label regulation of gliogenesis
annotation_class_label_searchable regulation of gliogenesis
isa_partof_closure_map {"GO:0051239":"regulation of multicellular organismal process","GO:0009987":"cellular process","GO:2000026":"regulation of multicellular organismal development","GO:0048699":"generation of neurons","GO:0065007":"biological regulation","GO:0048869":"cellular developmental process","GO:0007275":"multicellular organismal development","GO:0030154":"cell differentiation","GO:0007399":"nervous system development","GO:0051960":"regulation of nervous system development","GO:0042063":"gliogenesis","GO:0032502":"developmental process","GO:0008150":"biological_process","GO:0032501":"multicellular organismal process","GO:0050767":"regulation of neurogenesis","GO:0050794":"regulation of cellular process","GO:0060284":"regulation of cell development","GO:0050789":"regulation of biological process","GO:0050793":"regulation of developmental process","GO:0014013":"regulation of gliogenesis","GO:0045595":"regulation of cell differentiation","GO:0048468":"cell development","GO:0022008":"neurogenesis"}
isa_partof_closure GO:0051239
isa_partof_closure GO:0009987
... ...
isa_partof_closure_label regulation of multicellular organismal process
isa_partof_closure_label cellular process
isa_partof_closure_label regulation of multicellular organismal development
... ...
isa_partof_closure_label_searchable regulation of multicellular organismal process
isa_partof_closure_label_searchable cellular process
isa_partof_closure_label_searchable regulation of multicellular organismal development
... ...
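A client that receives a JSON blob field such as isa_partof_closure_map only has to decode it to get the id-to-label map, and can invert it for the label-to-id direction. The following is a minimal sketch using three entries copied from the schema above; the function name `invert` is hypothetical and the sketch assumes labels are unique within one map.

```python
import json
from typing import Dict

# Excerpt of the isa_partof_closure_map value, stored as a JSON string.
closure_map_json = (
    '{"GO:0042063": "gliogenesis",'
    ' "GO:0022008": "neurogenesis",'
    ' "GO:0014013": "regulation of gliogenesis"}'
)


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Swap keys and values (assumes the values are unique)."""
    return {label: term_id for term_id, label in mapping.items()}


id_to_label: Dict[str, str] = json.loads(closure_map_json)
label_to_id = invert(id_to_label)

print(id_to_label["GO:0022008"])
print(label_to_id["gliogenesis"])
```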
What it gets you
Graph example in SQL
"Give me all of the genes in Drosophila annotated to 'neurogenesis'."
SELECT term.name AS superterm_name,
       term.acc AS superterm_acc,
       term.term_type AS superterm_type,
       association.*,
       gene_product.symbol AS gp_symbol,
       gene_product.full_name AS gp_full_name,
       dbxref.xref_dbname AS gp_dbname,
       dbxref.xref_key AS gp_acc,
       species.genus,
       species.species,
       species.ncbi_taxa_id,
       species.common_name
FROM term
  INNER JOIN graph_path ON (term.id = graph_path.term1_id)
  INNER JOIN association ON (graph_path.term2_id = association.term_id)
  INNER JOIN gene_product ON (association.gene_product_id = gene_product.id)
  INNER JOIN species ON (gene_product.species_id = species.id)
  INNER JOIN dbxref ON (gene_product.dbxref_id = dbxref.id)
WHERE term.name = 'neurogenesis'
  AND species.genus = 'Drosophila';
Graph example in Solr
"Give me all of the genes in Drosophila annotated to 'neurogenesis'."
Add to URL:
what we want    query arg
any doc         q=*:*
in genes        fq=document_category:"bioentity"
in closure      fq=isa_partof_closure_label:"neurogenesis"
with fly        fq=taxon_label_searchable:"Drosophila"
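As a rough sketch of what "add to URL" means for a client, the query arguments can be assembled with Python's standard library. The base URL is a hypothetical local Solr endpoint and the helper name `build_select_url` is invented here; `q`, `fq`, and `wt` are standard Solr request parameters.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical endpoint; substitute the address of a real Solr server.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(base: str, q: str, filters: List[str]) -> str:
    """Build a select URL with one fq parameter per filter."""
    params = [("q", q)]
    params += [("fq", f) for f in filters]
    params.append(("wt", "json"))  # ask Solr for a JSON response
    return base + "?" + urlencode(params)


url = build_select_url(
    SOLR_SELECT,
    "*:*",
    [
        'document_category:"bioentity"',
        'isa_partof_closure_label:"neurogenesis"',
        'taxon_label_searchable:"Drosophila"',
    ],
)
print(url)
```

Each filter is an independent `fq`, so filters can be added or dropped without rewriting the rest of the query, which is the contrast with the six-way join.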
Text search example in SQL
"Any ontology term that contains a reference to 'pigment', giving certain fields more weight than others, scored and ordered by relevance, transitively related to 'organelle', with the relevant parts highlighted if I want them."
... I'll get back to you.
Text search example in Solr
"Any ontology term that contains a reference to 'pigment', giving certain fields more weight than others, scored and ordered by relevance, transitively related to 'organelle', with the relevant parts highlighted if I want them."
Add to URL:
what we want    query arg
only term       fq=document_category:"ontology_class"
has "pigment"   defType=edismax & q=pigment
weights         qf=[...] label_searchable^2 id^2 [...]
in closure      fq=isa_partof_closure_label:"organelle"
highlighting    hl.simple.pre=<em class="hilite"> & hl=true
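The same arguments can be written as a parameter dictionary and encoded in one call. This is a sketch and not the AmiGO 2 client: the `qf` value holds only the two weights visible on the slide (the rest are elided there), and `hl.simple.post` is an assumption added so that the highlight tag is closed.

```python
from urllib.parse import parse_qs, urlencode

params = {
    "defType": "edismax",                   # extended DisMax query parser
    "q": "pigment",
    "qf": "label_searchable^2 id^2",        # only the weights shown on the slide
    "fq": [
        'document_category:"ontology_class"',
        'isa_partof_closure_label:"organelle"',
    ],
    "hl": "true",
    "hl.simple.pre": '<em class="hilite">',
    "hl.simple.post": "</em>",              # assumption: closing tag for highlights
}

# doseq=True emits one fq=... pair per list element.
query_string = urlencode(params, doseq=True)
print(query_string)
```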
Ease of Exploration I
Facets can make life a lot easier.
From the "neurogenesis" example, a couple of the facets we get are:
IMP: 1370
ISO: 495
IGI: 356
IDA: 290
IBA: 104
ISS: 36
TAS: 7
NAS: 6
ISA: 4
IEP: 3
IEA: 2
IRD: 2
... ...
neuron projection morphogenesis: 636
regulation of neuron differentiation: 533
locomotion: 447
central nervous system development: 408
response to stimulus: 355
... ...
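Counts like these come back when faceting is switched on (`facet=true` with a `facet.field` such as evidence_type). In Solr's default JSON layout each facet field is a flat list that alternates value and count, so a client needs a small unpacking step. The sketch below works on a hand-written response fragment using counts from the slide; the helper name `facet_counts` is hypothetical.

```python
from typing import Dict, List, Union


def facet_counts(flat: List[Union[str, int]]) -> Dict[str, int]:
    """Turn ["IMP", 1370, "ISO", 495, ...] into {"IMP": 1370, "ISO": 495, ...}."""
    return {
        str(flat[i]): int(flat[i + 1])
        for i in range(0, len(flat) - 1, 2)
    }


# Hand-written fragment shaped like the facet part of a Solr JSON response.
response = {
    "facet_counts": {
        "facet_fields": {
            "evidence_type": ["IMP", 1370, "ISO", 495, "IGI", 356, "IDA", 290],
        }
    }
}

counts = facet_counts(response["facet_counts"]["facet_fields"]["evidence_type"])
for code, n in counts.items():
    print(f"{code}: {n}")
```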
A user interface:
Ease of Exploration II
Also, with a small amount of work, we get calculations like information content.
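The slide does not say how information content is computed. A common definition, assumed here, is IC(t) = -log2(n_t / N), where n_t is the number of annotations in the closure of term t (a facet count) and N is the total number of annotations. The numbers below are toy values chosen for illustration.

```python
import math


def information_content(term_count: int, total: int) -> float:
    """IC = -log2(term_count / total); assumes 0 < term_count <= total."""
    if term_count <= 0 or term_count > total:
        raise ValueError("term_count must be in the range 1..total")
    return -math.log2(term_count / total)


# Toy numbers: a term covering a quarter of all annotations carries 2 bits,
# while a term covering every annotation carries none.
ic_quarter = information_content(250, 1000)
ic_root = information_content(1000, 1000)
print(ic_quarter, ic_root)
```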
Caching, speed, and data as a resource
Liberal field creation alleviates the need for a lot of caches of queries.
UI seeding data (facets).
All over HTTP: very easy to add a reverse proxy server in front.
Easy and direct data access for third parties (HTTP clients).
Reusable components such as term completion and spellcheck.
All folded into AmiGO 2
Development and maintainability
Cons
Some contortions to get at the data that we want (e.g. no joins).
We can get by without them with a little thought. Also, more Solr features are coming.
Loading is enabled through a parallel software stack.
But we can leverage a lot of the stuff already out there (e.g. OWL API, SolrJ, etc.).
There is more overhead in the creation and maintenance of the various fields necessary to make this all work out.
We use a configuration and loading manager in OWLTools and AmiGO 2.
OWLTools FlexLoader (YAML-based)
id: bbop_ont
description: Test mapping of ontology class for GO.
display_name: Ontology
document_category: ontology_class
weight: 40
boost_weights: id^2.0 label^2.0 description^1.0 comment^0.5 synonym^1.0 alternate_id^1.0
result_weights: label^10.0 id^8.0 description^6.0 source^4.0 synonym^3.0 alternate_id^2.0 comment^1.0
fields:
  - id: label
    description: Common term name.
    display_name: Term
    type: string
    property: [getAnnotationPropertyValues, label]
    searchable: true
  - id: ...
    ...
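The boost_weights line uses the same field^weight notation as Solr's `qf` parameter, so a client could plausibly parse it once and reuse it when building queries. How AmiGO 2 actually consumes this value is not shown on the slide; the sketch below is an assumption-laden illustration, and the names `parse_weights` and `to_qf` are hypothetical.

```python
from typing import Dict


def parse_weights(spec: str) -> Dict[str, float]:
    """Parse "id^2.0 label^2.0" into {"id": 2.0, "label": 2.0}.

    A field listed without "^weight" is given a weight of 1.0
    (an assumption, matching Solr's default boost).
    """
    weights: Dict[str, float] = {}
    for token in spec.split():
        field, _, weight = token.partition("^")
        weights[field] = float(weight) if weight else 1.0
    return weights


def to_qf(weights: Dict[str, float]) -> str:
    """Serialize the weights back into field^weight notation."""
    return " ".join(f"{field}^{weight}" for field, weight in weights.items())


spec = ("id^2.0 label^2.0 description^1.0 comment^0.5 "
        "synonym^1.0 alternate_id^1.0")
boost = parse_weights(spec)
print(boost["comment"])
print(to_qf(boost))
```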
Pros
It's very very fast.
No more long running queries.
No ORMs (or SQL).
Development and debugging are easier for clients: everything is in a uniform return schema/type.
Decreased number of layers necessary to complete a client program: you just need an HTTP client.
New classes of features like JavaScript APIs, autocomplete, and spellcheck ...
Trivial to offer web APIs.
Scales nicely (for example, can chop up the store between different servers).
The software involved
Solr over Jetty (the store)
License: Apache 2
https://lucene.apache.org/solr/
AmiGO 2 (the clients)
Seth Carbon, Chris Mungall, Shahid Manzoor, Heiko Dietze, Gene Ontology Consortium
License: Modified BSD
http://wiki.geneontology.org/index.php/AmiGO_2
http://amigo2.berkeleybop.org
OWLTools (the loader)
Chris Mungall, Heiko Dietze, Gene Ontology Consortium
License: BSD 2-Clause
https://code.google.com/p/owltools/
Acknowledgments
Berkeley Bioinformatics Open-source Projects
The Gene Ontology Consortium
Saccharomyces Genome Database
All the users of AmiGO
All the future users of AmiGO 2
AmiGO 2