Bio2RDF cloud of Virtuoso SPARQL endpoints
Life Science Raw Data Now
François Belleau, Marc-Alexandre Nolin, Peter Ansell, Michel Dumontier
30th April 2009, W3C-HCLS F2F Meeting, Cambridge, MA
Agenda
● Why did we do Bio2RDF ?
● How did we do it ?
● What is known about hexokinase ?
● Where are we going ?
The problem
According to the NAR 2009 Database collection, 1170 public databases exist.
How can they be integrated to behave like a global coherent resource ?
Public map of 1744 namespaces according to BioMoby, NAR, SRS, GO, NCBI, UniProt
Bio2RDF vision in 2007
Johanne Luciano's vision for knowledge integration in 2005
W3C vision of semantic web in 2006
Bio2RDF Mouse and Human Atlas map in 2008: 65 million triples
Bio2RDF current contribution to the Linked Data cloud
http://linkeddata.org/
http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
Linked data cloud in March 2009
Linked data cloud in 2007
Bio2RDF cloud map of 2.3 billion triples in 2009
Why do it ?
Not to replace HTML or XML with yet another format (RDF and OWL), but to answer scientific questions by submitting SPARQL queries over the global knowledge base, accessible through the Internet via the cloud of Life Science SPARQL endpoints.
Solution
Bio2RDF approach to the data integration problem in bioinformatics: apply the semantic web approach based on RDF, OWL and SPARQL technologies.
How did we do it ?
Bio2RDF architecture
Our design principles
http://www.w3.org/DesignIssues/LinkedData
http://bio2rdf.wiki.sourceforge.net/Banff%20Manifesto
YeastHub design in 2005
● Conversion of datasets to RDF
● Use of the Sesame triplestore
● SeRQL query interface
http://www.ncbi.nlm.nih.gov/pubmed/15961502
Bio2RDF at ISMB 2005: the beginning
Thanks to Kei Cheung, Johanne Luciano, Eric Neumann and Christopher Baker, who drew the lines.
Bio2RDF realtime rdfiser in 2007
Current architecture
● Offline rdfising process
● Virtuoso SPARQL endpoints network
● Namespace resolution through DNS subdomain
Main REST services
● Describe a resource by a dereferenceable URI
  ● http://bio2rdf.org/ns:id
● Global services over federated endpoints
  ● http://bio2rdf.org/links/ns:id
  ● http://bio2rdf.org/search/searchedTerm
● Targeted services to a specific endpoint
  ● http://bio2rdf.org/linksns/ns2/ns1:id
  ● http://bio2rdf.org/searchns/ns/searchedTerm
● Other services are available.
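As a rough illustration of the URL patterns listed above, the following Python sketch assembles each service URL from its parts. The function names, the percent-encoding of identifiers and search terms, and the sample namespace and identifier are assumptions made for this example; only the URL shapes come from the slide.

```python
from urllib.parse import quote

BASE = "http://bio2rdf.org"  # host shown on the slide


def describe_url(ns: str, identifier: str) -> str:
    """Dereferenceable URI describing one resource: /ns:id"""
    return f"{BASE}/{ns}:{quote(identifier, safe='')}"


def links_url(ns: str, identifier: str) -> str:
    """Global service over federated endpoints: /links/ns:id"""
    return f"{BASE}/links/{ns}:{quote(identifier, safe='')}"


def search_url(term: str) -> str:
    """Global service over federated endpoints: /search/searchedTerm"""
    return f"{BASE}/search/{quote(term, safe='')}"


def targeted_links_url(ns2: str, ns1: str, identifier: str) -> str:
    """Targeted service: /linksns/ns2/ns1:id (parameter names as on the slide)"""
    return f"{BASE}/linksns/{ns2}/{ns1}:{quote(identifier, safe='')}"


def targeted_search_url(ns: str, term: str) -> str:
    """Targeted service: /searchns/ns/searchedTerm"""
    return f"{BASE}/searchns/{ns}/{quote(term, safe='')}"


# Illustrative values only; "geneid" and "15275" are placeholders.
print(describe_url("geneid", "15275"))
print(search_url("hexokinase"))
```

Keeping one small function per service keeps the mapping between a slide bullet and its URL easy to check by eye.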
Describe service implementation
● http://bio2rdf.org/ns:id
● Corresponding SPARQL query:
  CONSTRUCT { ?s ?p ?o . }
  WHERE { ?s ?p ?o . FILTER(?s = <http://bio2rdf.org/ns:id>) . }
● Submitted at this URL:
  ● http://ns.bio2rdf.org/sparql?query=...
  – Based on the DNS subdomain resolution service
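The describe service can be sketched as a small translation step: take an `ns:id` identifier, build the CONSTRUCT query shown on the slide, and address it to the endpoint whose DNS subdomain is the namespace. This is a minimal sketch under stated assumptions, not the project's actual server code; the function names and the sample identifier are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode


def parse_curie(curie: str) -> tuple[str, str]:
    """Split 'ns:id' into (namespace, identifier)."""
    ns, sep, identifier = curie.partition(":")
    if not sep or not ns or not identifier:
        raise ValueError(f"expected 'ns:id', got {curie!r}")
    return ns, identifier


def describe_query(curie: str) -> str:
    """CONSTRUCT query from the slide, with the resource URI filled in."""
    parse_curie(curie)  # validate the shape before building the URI
    uri = f"http://bio2rdf.org/{curie}"
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . FILTER(?s = <%s>) . }" % uri
    )


def endpoint_request_url(curie: str) -> str:
    """URL of the namespace endpoint (ns.bio2rdf.org) carrying the query."""
    ns, _ = parse_curie(curie)
    return f"http://{ns}.bio2rdf.org/sparql?" + urlencode({"query": describe_query(curie)})


# Illustrative identifier only.
print(endpoint_request_url("geneid:15275"))
```

Because the namespace doubles as the subdomain, routing a request needs no lookup table: the identifier itself says which endpoint to ask.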
Bio2RDF JSP server software
http://sourceforge.net/projects/bio2rdf/
Peter Ansell is writing the Bio2RDF JSP server
● The software transforms Bio2RDF URIs into SPARQL queries in real time.
● Its aim is to access normalised RDF information located in multiple endpoints, using the concepts of Public Namespaces and Private Record Identifiers together with distributed SPARQL queries that are matched to the content of each endpoint.
● Each of the following databases has normalisation rules that map its URIs back to bio2rdf.org URIs: DBpedia, DrugBank, LinkedCT, HCLS KB/Neurocommons, Diseasome, DailyMed, Bioguid DOI
Bio2RDF.war package future
● Provide more pipes to perform integrated actions, without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query more efficiently in a distributed and normalised manner
● Bring together the current distributed efforts to provide a complete HTML redirection registry, so that a large percentage of Bio2RDF namespaces can be redirected with http://bio2rdf.org/html/namespace:identifier
● Form ontologies describing the query type, provider, RDF normalisation rule and namespace paradigm
● Integrate http://rdf.myexperiment.org/sparql and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly
Bio2RDF.owl
http://quebec.bio2rdf.org/download/bio2rdf-2008.owl
Michel Dumontier will design the next version of the Bio2RDF.owl ontology
What is known about hexokinase ?
Submit your query...
● To a web search engine
● To an existing public web site offering data integration services
● Using Bio2RDF SPARQL endpoints
  ● Submitting a SPARQL query
  ● Using the facet browser interface of the Virtuoso 6.0 server
  ● Dereferencing a Bio2RDF search URI
  ● Using a Taverna workflow composed of SPARQL queries to obtain federated results from KEGG, Entrez Gene and GO
The usual unsemantic way
Existing integrated search services
NCBI/Entrez EBI/EB-eye
KEGG/DBGET GoPubmed
By submitting a SPARQL query
http://atlas.bio2rdf.org/sparql
What is known about « hexokinase » with semantics ?
select ?t1 ?p2 count(*)
where {
  ?s1 ?p1 ?o1 .
  FILTER( bif:contains(?o1, "hexokinase") ) .
  ?s1 a ?t1 .
  ?s1 ?p2 ?o2 .
}
ORDER BY ?t1 ?p2
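To make the intent of this query concrete, here is a small in-memory Python sketch of what it counts: for every triple whose object mentions the keyword, it pairs each type of the subject with each predicate used on that subject, then tallies the pairs. The sample triples and the `ex:` names are invented for illustration, and a plain substring test stands in for Virtuoso's `bif:contains` full-text matching, so this is an approximation of the query's behaviour rather than a replacement for it.

```python
from collections import Counter

RDF_TYPE = "rdf:type"  # stands for the SPARQL keyword "a"


def type_predicate_counts(triples, keyword):
    """Count (type, predicate) pairs for subjects with a literal matching keyword."""
    kw = keyword.lower()
    counts = Counter()
    for s1, _p1, o1 in triples:
        if kw not in str(o1).lower():  # approximates bif:contains(?o1, keyword)
            continue
        types = [o for s, p, o in triples if s == s1 and p == RDF_TYPE]  # ?s1 a ?t1
        for t1 in types:
            for s, p2, _o2 in triples:  # ?s1 ?p2 ?o2
                if s == s1:
                    counts[(t1, p2)] += 1
    return sorted(counts.items())  # ORDER BY ?t1 ?p2


# Made-up sample data, not taken from any Bio2RDF endpoint.
SAMPLE = [
    ("geneid:1", "rdf:type", "ex:Gene"),
    ("geneid:1", "rdfs:label", "hexokinase 1"),
    ("geneid:1", "ex:xRef", "kegg:x"),
    ("geneid:2", "rdf:type", "ex:Gene"),
    ("geneid:2", "rdfs:label", "actin"),
]

for (t1, p2), n in type_predicate_counts(SAMPLE, "hexokinase"):
    print(t1, p2, n)
```

The result reads as a summary of which kinds of resources mention the keyword and which properties are available on them, which is a useful first look before writing a more specific query.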
Use Virtuoso 6.0 facet browser
http://lod.openlinksw.com/
Dereferencing search URL
http://bio2rdf.org/search/hexokinase
How can we submit a complex query over the network of SPARQL endpoints ?
By building a mashup with Taverna
1) Write your complex SPARQL query as if a global graph were available
2) Identify the needed namespaces and split the query to fetch each data source separately
3) Build a mashup using a Taverna workflow that instantiates a local triplestore
4) Execute your complex query locally on the mashup
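The four steps above can be sketched in plain Python, with a set of tuples standing in for the local triplestore and stub functions standing in for the per-source fetches that the Taverna workflow performs. Everything named here (the fetch functions, the `ex:` predicates, the placeholder identifiers) is hypothetical and chosen only to show the shape of the approach.

```python
def needed_namespaces(identifiers):
    """Step 2: read the namespaces to fetch from Bio2RDF-style ns:id identifiers."""
    return sorted({i.split(":", 1)[0] for i in identifiers})


def fetch_pathways():
    """Stub for the pathway source; a real workflow would query its endpoint."""
    return {("path:P1", "ex:hasGene", "geneid:G1"),
            ("path:P1", "ex:hasGene", "geneid:G2")}


def fetch_annotations():
    """Stub for the annotation source."""
    return {("geneid:G1", "ex:hasGoAnnotation", "go:T1"),
            ("geneid:G2", "ex:hasGoAnnotation", "go:T2"),
            ("geneid:G3", "ex:hasGoAnnotation", "go:T3")}


def build_mashup(fetchers):
    """Step 3: load every source into one local store."""
    store = set()
    for fetch in fetchers:
        store.update(fetch())
    return store


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def go_terms_for_pathway(store, pathway):
    """Step 4: a join across sources, run locally on the mashup."""
    genes = [o for _, _, o in match(store, s=pathway, p="ex:hasGene")]
    return sorted({o for g in genes
                   for _, _, o in match(store, s=g, p="ex:hasGoAnnotation")})


STORE = build_mashup([fetch_pathways, fetch_annotations])
print(needed_namespaces(["path:P1", "geneid:G1", "go:T1"]))
print(go_terms_for_pathway(STORE, "path:P1"))
```

The join in step 4 is only possible because both sources were first copied into the same store; no single remote endpoint holds all the triples it touches.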
The SPARQL query needed
(don't try this at home, do it on the web !)
Get the list of genes from KEGG pathways of a specified taxon
http://www.myexperiment.org/workflows/747
● Clear graph
● Get KEGG pathways list for a specific taxon
● For each pathway get genes list and import instances
● Count the number of genes found
Insert into local triplestore GeneID genes and KEGG pathways
http://www.myexperiment.org/workflows/748
● Get the list of genes
● Get the list of pathways
● Insert into local triplestore each corresponding graph
Insert into local triplestore the needed GO annotations
● Get the GO annotations for each gene
Finally, the needed query merging KEGG, Entrez Gene and GO together
Bio2RDF resources
Bio2RDF's mirrors
http://quebec.bio2rdf.org/
http://qut.bio2rdf.org/
Bio2RDF SPARQL endpoints
http://www.freebase.com/view/user/bio2rdf/public/sparql
Life Science Raw Data Now
http://quebec.bio2rdf.org/download
Visit our Wiki rdfiser cookbook
http://bio2rdf.wiki.sourceforge.net/
Bio2RDF news
http://bio2rdf.blogspot.com/
http://groups.google.ca/group/bio2rdf
http://www.slideshare.net/search/slideshow?q=bio2rdf
http://scholar.google.com/scholar?q=bio2rdf
Our 2009 objectives
● Get approval from data providers to distribute RDF dumps and publish SPARQL endpoints (UniProt, BioCyc, Pathway Commons, Bind are in)
● Start using a Virtuoso 6 cluster
● Design more services accessible with the REST protocol via our JSP package
● Recruit mirror servers
● Develop new rdfiser programs in a community effort
Thanks
Jean Morissette, Nicole Tourigny
● The Bio2RDF community
● Centre de recherche du CHUL
● Université Laval
● Dumontier Lab
● QUT eResearch Center
● OpenLink Virtuoso