NCBO BioPortal SPARQL Endpoint - The Quad Economy of a Semantic Web Ontology Repository

Poster presentation of the NCBO SPARQL Endpoint at CSHALS 2012.

Trish Whetzel, Manuel Salvadores, Paul R. Alexander, Mark A. Musen, Natalya F. Noy
Stanford University, Stanford, CA

Acknowledgements

The National Center for Biomedical Ontology is one of the National Centers for Biomedical Computing supported by the NHGRI, the NHLBI, and the NIH Common Fund under grant U54-HG004028.

Contact

For more information on the NCBO, visit http://www.bioontology.org or email support@bioontology.org

Abstract

The NCBO Web services provide a common output (XML/JSON) for ontology content regardless of the ontology representation format (OWL, OBO, Protégé frames, RRF); however, there is no single uniform storage for the ontologies and their metadata. As the amount of information and the number of hits to the Web services increase, a more scalable solution is needed. To address these issues, we analyzed the use of a quad store, since quad stores easily scale to millions of triples and provide SPARQL query access to the ontologies.

Currently each ontology in BioPortal includes the materialization of all owl:imports. Thus, if a small ontology imports a large ontology, the former becomes a large ontology. Because BioPortal stores multiple versions of an ontology, the problem is reproduced for every version. Our hypothesis was that we could optimize the number of quads in the system using a more granular model in which owl:imports are not materialized and every ontology graph contains only its own RDF triples, without the triples from the imported ontologies.

One of the questions to be answered is the optimization ratio, in number of triples, when using an ontology-per-graph model versus a closure-materialized model. Of the 149 OWL ontologies reviewed, there are 299 ontologies in the import closure (i.e., if we follow all the owl:imports links from the 149 ontologies, we obtain a set of 299 ontologies). These 299 OWL ontologies contain 303 owl:imports statements; the materialized import closure is a set of 495 owl:imports. We also reviewed the number of re-used triples. Ontologies with no imports account for 5.4M triples in the system; ontologies with one import, 1.7M; ontologies with 2-9 imports, 0.5M; and ontologies with more than 10 imports, 2.1M.

To conclude, our analysis shows that while ontology reuse is still far from being the norm, effective reuse is a goal worth pursuing, and the level of reuse can have significant implications for the scalability of ontology storage systems.
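
The difference between the two storage models can be illustrated with a minimal sketch. The three ontologies, their triple counts and their import links below are hypothetical toy data, not BioPortal figures; the sketch only shows how following owl:imports transitively inflates the stored triple count when every graph materializes its import closure.

```python
from collections import deque

# Hypothetical toy data: ontology -> (own triple count, directly imported ontologies).
ONTOLOGIES = {
    "small": (1_000, ["large"]),
    "large": (500_000, ["upper"]),
    "upper": (200, []),
}


def import_closure(name, ontologies):
    """Return every ontology reachable from `name` via owl:imports, including itself."""
    seen = {name}
    queue = deque([name])
    while queue:
        current = queue.popleft()
        for imported in ontologies[current][1]:
            if imported not in seen:  # also guards against import cycles
                seen.add(imported)
                queue.append(imported)
    return seen


def materialized_size(ontologies):
    """Triples stored when each ontology graph also holds its full import closure."""
    return sum(
        sum(ontologies[member][0] for member in import_closure(name, ontologies))
        for name in ontologies
    )


def per_graph_size(ontologies):
    """Triples stored when each ontology graph holds only its own triples."""
    return sum(own for own, _imports in ontologies.values())


if __name__ == "__main__":
    ratio = materialized_size(ONTOLOGIES) / per_graph_size(ONTOLOGIES)
    print(f"closure-materialized / ontology-per-graph = {ratio:.2f}")
```

In this toy setup the "small" ontology stores the triples of "large" and "upper" as well as its own under the closure-materialized model, which is the effect the abstract describes; storing several versions of "small" would repeat that cost per version.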

BioPortal SPARQL Endpoint Features

• Open library of biomedical ontologies
• Each ontology is materialized in a single graph to facilitate query articulation
• Ontology content is synchronized daily with BioPortal
• Only the latest version of each ontology can be accessed, but metadata for all versions is available
• Access control for SPARQL named graphs restricts access to private and licensed ontologies based on the BioPortal user API key
• rdfs:subPropertyOf reasoning for preferred names, synonyms and definitions allows queries to bind the top-level predicates of the property hierarchy, so that they work consistently across ontologies using the graph "globals"
• UMLS ontologies can be generated at the CUI or CODE level
• To ensure fair usage of the triple store, some queries are not permitted, for example SELECT * WHERE { ?s ?p ?o }
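
Several of the features above (one graph per ontology, the "globals" graph, API-key access, and disallowed queries) can be sketched together from the client side. Everything specific here is an assumption for illustration and is not stated on the poster: the `/sparql/` path, the `query` and `apikey` parameter names, the URI of the "globals" graph, and the use of skos:prefLabel as the top-level preferred-name predicate. The `is_unbounded_scan` helper is hypothetical and only mirrors the single disallowed example; the server's actual rules are not described.

```python
import re
from urllib.parse import urlencode

# Assumed values (not given on the poster): endpoint path and "globals" graph URI.
ENDPOINT = "http://alphasparql.bioontology.org/sparql/"
GLOBALS_GRAPH = "http://bioportal.bioontology.org/ontologies/globals"


def label_query(ontology_graph, limit=10):
    """Build a preferred-name query scoped to one ontology graph plus "globals".

    skos:prefLabel is assumed to be the top-level predicate that the
    rdfs:subPropertyOf reasoning maps ontology-specific label properties to.
    """
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term ?label
FROM <{ontology_graph}>
FROM <{GLOBALS_GRAPH}>
WHERE {{ ?term skos:prefLabel ?label }}
LIMIT {int(limit)}"""


def request_url(query, api_key):
    """Encode the query and API key as GET parameters (parameter names are assumed)."""
    return ENDPOINT + "?" + urlencode({"query": query, "apikey": api_key})


def is_unbounded_scan(query):
    """Rough client-side check for the shape SELECT * WHERE { ?s ?p ?o }."""
    normalized = " ".join(query.split()).lower()
    pattern = r"select \* where \{ ?\?\w+ \?\w+ \?\w+ ?\.? ?\}"
    return re.fullmatch(pattern, normalized) is not None
```

A caller would build the query for a specific ontology graph, check it with `is_unbounded_scan`, and only then pass it to `request_url` together with a BioPortal API key.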

Sample Code

Examples [1] are provided for the following platforms/languages:
- Java:
  * Java with no third-party libraries (SimpleTest.java)
  * Java with Jena ARQ (JenaARQTest.java)
  * Java with OpenRDF [*] (OpenRDFAlibabaTest.java)
- Python:
  * Python with no third-party libraries (sparql1.py)
  * Python with SPARQLWrapper [2] (sparql2.py)
- JavaScript:
  * JavaScript with the SPARQLClient [3] library (index.html)
  * JavaScript with node.js (node_test.js)
- Perl using sparql.pm from [4] (test.pl)
- TODO:
  * Ruby, C#, Scala

[1] https://github.com/ncbo/sparql-code-examples
[2] http://sparql-wrapper.sourceforge.net
[3] http://thefigtrees.net/lee/sw/sparql.js (slightly modified to allow API keys)
[4] https://github.com/swh/Perl-SPARQL-client-library (slightly modified to allow API keys)
[*] The jar file alibaba-repository-sparql-2.0-beta9-patched.jar has been patched to allow API keys and GET HTTP requests.
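
For the "no third-party libraries" case, the result-handling step can be sketched with the Python standard library alone. This is not the content of sparql1.py; it is a minimal sketch that assumes the endpoint can return the W3C SPARQL JSON results format (the supported output formats are not listed on the poster). The `SAMPLE` document is hypothetical data, and the HTTP request itself is left out.

```python
import json

# Hypothetical response body in the W3C SPARQL JSON results format.
SAMPLE = """
{
  "head": {"vars": ["name", "acronym"]},
  "results": {"bindings": [
    {"name": {"type": "literal", "value": "Example Ontology"},
     "acronym": {"type": "literal", "value": "EXO"}},
    {"name": {"type": "literal", "value": "Unnamed Draft"}}
  ]}
}
"""


def bindings_to_rows(results_json):
    """Flatten a SPARQL JSON results document into a list of {variable: value} dicts."""
    document = json.loads(results_json)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable left unbound (for example by an OPTIONAL clause) is
        # absent from the binding, so it is reported as None here.
        rows.append(
            {var: binding[var]["value"] if var in binding else None for var in variables}
        )
    return rows


if __name__ == "__main__":
    for row in bindings_to_rows(SAMPLE):
        print(row)
```

Reporting unbound variables as None keeps every row the same shape, which is convenient for queries that rely on OPTIONAL clauses.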

Example Queries

• Select names and acronyms for all ontologies
• List all ontologies sorted by creation date including contact name and number of terms only if these exist (OPTIONAL clause)
• List ontology categories
• Ontology domains with number of ontologies per domain
• Get all versions of views from a virtual ontology ID
• Find all the ontologies that contain SNOMED in their name (case-sensitive)
• Get all root terms for an ontology version ID
• Get all ontology terms (owl:Class)
• Get all distinct predicates from a single ontology
• Get term direct sub-classes with labels, e.g. SNOMED example
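
Three of the queries listed above can be sketched as query templates. These are not the queries published on the examples page; they are plausible formulations written against standard OWL and RDFS vocabulary. The graph and term URIs are caller-supplied placeholders, and the use of rdfs:label for labels is an assumption.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def all_classes(graph):
    """Get all ontology terms (owl:Class) from one ontology graph."""
    return f"""PREFIX owl: <{OWL}>
SELECT DISTINCT ?term
FROM <{graph}>
WHERE {{ ?term a owl:Class }}"""


def distinct_predicates(graph):
    """Get all distinct predicates used in a single ontology graph."""
    return f"""SELECT DISTINCT ?p
FROM <{graph}>
WHERE {{ ?s ?p ?o }}"""


def direct_subclasses(graph, parent):
    """Get the direct sub-classes of `parent`, with labels where available."""
    return f"""PREFIX rdfs: <{RDFS}>
SELECT ?child ?label
FROM <{graph}>
WHERE {{
  ?child rdfs:subClassOf <{parent}> .
  OPTIONAL {{ ?child rdfs:label ?label }}
}}"""


if __name__ == "__main__":
    # Placeholder URIs for illustration only.
    print(direct_subclasses("http://example.org/graph/DEMO",
                            "http://example.org/term/Parent"))
```

Scoping each template with FROM reflects the one-graph-per-ontology layout described in the features list; the OPTIONAL clause keeps sub-classes that have no label in the result.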

http://alphasparql.bioontology.org/examples

http://bioportal.bioontology.org
http://alphasparql.bioontology.org