Bio2RDF @ W3C HCLS2009

Bio2RDF cloud of Virtuoso SPARQL endpoints

Life Science Raw Data Now

François Belleau, Marc-Alexandre Nolin, Peter Ansell, Michel Dumontier

30th April 2009, W3C-HCLS F2F Meeting, Cambridge, MA

Agenda

● Why did we do Bio2RDF?
● How did we do it?
● What is known about hexokinase?
● Where are we going?

The problem

According to the NAR 2009 Database collection, 1170 public databases exist.

How can they be integrated to behave like a single, globally coherent resource?

Public map of 1744 namespaces according to BioMoby, NAR, SRS, GO, NCBI, UniProt

Bio2RDF vision in 2007

Johanne Luciano's vision for knowledge integration in 2005

W3C vision of the semantic web in 2006

Bio2RDF Mouse and Human Atlas map in 2008: 65 million triples

Bio2RDF's current contribution to the Linked Data cloud

http://linkeddata.org/

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics

Linked data cloud in March 2009

Linked data cloud in 2007

Bio2RDF cloud map of 2.3 billion triples in 2009

Why do it?

Not to replace HTML or XML with yet another format (RDF and OWL), but to answer scientific questions by submitting SPARQL queries over the global knowledge base, accessible through the Internet via the cloud of Life Science SPARQL endpoints.

Solution

The Bio2RDF approach to the data integration problem in bioinformatics: apply the semantic web approach based on RDF, OWL and SPARQL technologies.

How did we do it? Bio2RDF architecture

Our design principles

http://www.w3.org/DesignIssues/LinkedData

http://bio2rdf.wiki.sourceforge.net/Banff%20Manifesto

YeastHub design in 2005

● Conversion of datasets to RDF
● Use of the Sesame triplestore
● SeRQL query interface

http://www.ncbi.nlm.nih.gov/pubmed/15961502

Bio2RDF at ISMB 2005: the beginning

Thanks to Kei Cheung, Johanne Luciano, Eric Neumann and Christopher Baker, who drew the lines.

Bio2RDF realtime rdfiser in 2007

Current architecture

● Offline rdfising process
● Network of Virtuoso SPARQL endpoints
● Namespace resolution through DNS subdomains

Main REST services

● Describe a resource by a dereferenceable URI
– http://bio2rdf.org/ns:id
● Global services over federated endpoints
– http://bio2rdf.org/links/ns:id
– http://bio2rdf.org/search/searchedTerm
● Targeted services to a specific endpoint
– http://bio2rdf.org/linksns/ns2/ns1:id
– http://bio2rdf.org/searchns/ns/searchedTerm
● Other services are available.
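The URL patterns of these REST services can be sketched as small helper functions. This is only an illustration: the function names are hypothetical, the identifier "12345" is a placeholder, and reading ns2 as the targeted endpoint in the linksns pattern is an assumption drawn from the slide wording.

```python
from urllib.parse import quote

BASE = "http://bio2rdf.org"  # base URL as shown on the slides


def describe_uri(ns: str, ident: str) -> str:
    """Dereferenceable URI describing one record: /ns:id"""
    return f"{BASE}/{ns}:{ident}"


def links_uri(ns: str, ident: str) -> str:
    """Global links service over the federated endpoints: /links/ns:id"""
    return f"{BASE}/links/{ns}:{ident}"


def search_uri(term: str) -> str:
    """Global search service: /search/searchedTerm"""
    return f"{BASE}/search/{quote(term)}"


def links_ns_uri(ns2: str, ns1: str, ident: str) -> str:
    """Targeted links service: /linksns/ns2/ns1:id
    (assumption: ns2 names the endpoint that is queried)."""
    return f"{BASE}/linksns/{ns2}/{ns1}:{ident}"


def search_ns_uri(ns: str, term: str) -> str:
    """Targeted search service: /searchns/ns/searchedTerm"""
    return f"{BASE}/searchns/{ns}/{quote(term)}"


if __name__ == "__main__":
    print(describe_uri("geneid", "12345"))
    print(search_uri("hexokinase"))
```

Percent-encoding the search term is a design choice of this sketch, so that terms containing spaces still yield a valid URL.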

Describe service implementation

● http://bio2rdf.org/ns:id
● Corresponding SPARQL query:

    CONSTRUCT { ?s ?p ?o . }
    WHERE {
      ?s ?p ?o .
      FILTER(?s = <http://bio2rdf.org/ns:id>) .
    }

● Submitted at this URL:
● http://ns.bio2rdf.org/sparql?query=...
– Based on the DNS subdomain resolution service
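The translation from a Bio2RDF URI to a SPARQL request can be sketched as follows. This is a minimal sketch, not the actual server code: the function names are hypothetical, "12345" is a placeholder identifier, and it assumes that the endpoint host is simply the namespace used as a subdomain of bio2rdf.org and that the query is passed in a `query` parameter, as the slide suggests.

```python
from urllib.parse import urlencode


def describe_query(ns: str, ident: str) -> str:
    """Build the CONSTRUCT query shown on the slide for one record URI."""
    uri = f"http://bio2rdf.org/{ns}:{ident}"
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . "
        f"FILTER(?s = <{uri}>) . }}"
    )


def endpoint_request_url(ns: str, ident: str) -> str:
    """Build the request URL on the namespace-specific endpoint.

    Assumption: the endpoint for namespace `ns` lives at
    http://<ns>.bio2rdf.org/sparql (DNS subdomain resolution).
    """
    params = urlencode({"query": describe_query(ns, ident)})
    return f"http://{ns}.bio2rdf.org/sparql?{params}"


if __name__ == "__main__":
    print(describe_query("geneid", "12345"))
    print(endpoint_request_url("geneid", "12345"))
```

Nothing is sent over the network here; the sketch only shows how the namespace selects the endpoint and how the query is URL-encoded.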

Bio2RDF JSP server software

http://sourceforge.net/projects/bio2rdf/

Peter Ansell is writing the Bio2RDF JSP server

● The software transforms Bio2RDF URIs into SPARQL queries in real time.

● It aims to give access to normalised RDF information located in multiple endpoints, using the concepts of Public Namespaces and Private Record Identifiers together with distributed SPARQL queries that are matched to the content of each endpoint.

● Each of the following databases has normalisation rules that map its URIs back to bio2rdf.org URIs: DBpedia, DrugBank, LinkedCT, HCLS KB/Neurocommons, Diseasome, DailyMed, Bioguid DOI
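What a normalisation rule might look like can be sketched with regular expressions. This is purely illustrative: the example.org source prefixes, the target namespaces and the rule table are hypothetical and are not the rules shipped with the Bio2RDF server.

```python
import re

# Hypothetical rules: (pattern matching a provider URI, Bio2RDF namespace).
# The example.org prefixes stand in for the real provider URI schemes.
RULES = [
    (re.compile(r"^http://example\.org/drugbank/resource/drugs/(?P<id>[^/#]+)$"), "drugbank"),
    (re.compile(r"^http://example\.org/diseasome/resource/diseases/(?P<id>[^/#]+)$"), "diseasome"),
]


def normalise(uri: str) -> str:
    """Rewrite a provider URI to the http://bio2rdf.org/ns:id form.

    URIs that match no rule are returned unchanged.
    """
    for pattern, ns in RULES:
        match = pattern.match(uri)
        if match:
            return f"http://bio2rdf.org/{ns}:{match.group('id')}"
    return uri


if __name__ == "__main__":
    print(normalise("http://example.org/drugbank/resource/drugs/DB00001"))
```

Applying such rules to every URI in the results of a remote endpoint is what lets records from different providers join on the same namespace:identifier pair.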

Bio2RDF.war package future

● Provide more pipes to perform integrated actions without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query more efficiently, in a distributed and normalised manner

● Bring together the current distributed efforts to provide a complete HTML redirection registry so that a large percentage of Bio2RDF namespaces can be redirected with http://bio2rdf.org/html/namespace:identifier

● Create ontologies describing the query type, provider, RDF normalisation rule and namespace paradigm

● Integrate http://rdf.myexperiment.org/sparql and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly

Bio2RDF.owl

http://quebec.bio2rdf.org/download/bio2rdf-2008.owl

Michel Dumontier will design the next version of the Bio2RDF.owl ontology

What is known about hexokinase?

Submit your query...

● To the web search engine
● To existing public web sites offering data integration services
● Using Bio2RDF SPARQL endpoints
– Submitting a SPARQL query
– Using the facet browser interface from Virtuoso 6.0 server
– Dereferencing a Bio2RDF search URI
– Using a Taverna workflow composed of SPARQL queries to obtain federated results from KEGG, Entrez Gene and GO

The usual non-semantic way

Existing integrated search services

● NCBI/Entrez
● EBI/EB-eye
● KEGG/DBGET
● GoPubmed

By submitting a SPARQL query

http://atlas.bio2rdf.org/sparql

What is known about "hexokinase" with semantics?

    select ?t1 ?p2 count(*)
    where {
      ?s1 ?p1 ?o1 .
      FILTER( bif:contains(?o1, "hexokinase")) .
      ?s1 a ?t1 .
      ?s1 ?p2 ?o2 .
    }
    ORDER BY ?t1 ?p2
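The same query can be generated for any search term and turned into a request URL for the endpoint above. This is a sketch under assumptions: the function names are hypothetical, the `query` request parameter is assumed, only single-word terms are handled (multi-word phrases need extra quoting inside the Virtuoso full-text expression), and the `bif:contains` function and the implicit grouping of `count(*)` are Virtuoso-specific.

```python
from urllib.parse import urlencode

ATLAS_ENDPOINT = "http://atlas.bio2rdf.org/sparql"  # endpoint from the slide


def type_predicate_counts_query(term: str) -> str:
    """Count matches of a full-text term, grouped by type and predicate."""
    # Escape backslashes and double quotes so the term stays inside the literal.
    safe = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "select ?t1 ?p2 count(*)\n"
        "where {\n"
        "  ?s1 ?p1 ?o1 .\n"
        f'  FILTER( bif:contains(?o1, "{safe}")) .\n'
        "  ?s1 a ?t1 .\n"
        "  ?s1 ?p2 ?o2 .\n"
        "}\n"
        "ORDER BY ?t1 ?p2"
    )


def request_url(term: str, endpoint: str = ATLAS_ENDPOINT) -> str:
    """URL-encode the query for an HTTP GET on the SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": type_predicate_counts_query(term)})


if __name__ == "__main__":
    print(type_predicate_counts_query("hexokinase"))
    print(request_url("hexokinase"))
```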

Use the Virtuoso 6.0 facet browser

http://lod.openlinksw.com/

Dereferencing a search URL

http://bio2rdf.org/search/hexokinase

How can we submit a complex query over the network of SPARQL endpoints?

By building a mashup with Taverna

1) Write your complex SPARQL query as if a global graph were available

2) Identify the needed namespaces and split the query to fetch each data source separately

3) Build a mashup using a Taverna workflow that instantiates a local triplestore

4) Execute your complex query locally on the mashup
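The four steps can be mimicked in a few lines of plain Python, with an in-memory set of triples standing in for the local triplestore. Everything here is a toy under stated assumptions: the `fetch_*` functions are stubs that would each be a SPARQL request to one endpoint in a real Taverna workflow, and the identifiers and predicate names are made-up placeholders, not real KEGG, GeneID or GO records.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class LocalStore:
    """Minimal stand-in for the local triplestore of step 3."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def load(self, triples: Iterable[Triple]) -> None:
        self.triples.update(triples)

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


# Step 2: one fetch per namespace (stubs returning placeholder data).
def fetch_kegg_pathway_genes() -> List[Triple]:
    return [("path:P1", "hasGene", "geneid:G1"),
            ("path:P1", "hasGene", "geneid:G2")]


def fetch_go_annotations() -> List[Triple]:
    return [("geneid:G1", "hasAnnotation", "go:T1"),
            ("geneid:G2", "hasAnnotation", "go:T2"),
            ("geneid:G3", "hasAnnotation", "go:T3")]


# Step 4: the "complex query", run locally as a join over the merged graph.
def go_terms_per_pathway(store: LocalStore) -> Dict[str, Set[str]]:
    result: Dict[str, Set[str]] = {}
    for pathway, _, gene in store.match(p="hasGene"):
        for _, _, term in store.match(s=gene, p="hasAnnotation"):
            result.setdefault(pathway, set()).add(term)
    return result


if __name__ == "__main__":
    store = LocalStore()                    # step 3: build the mashup
    store.load(fetch_kegg_pathway_genes())
    store.load(fetch_go_annotations())
    print(go_terms_per_pathway(store))
```

The join in `go_terms_per_pathway` only works because both sources use the same gene identifiers, which is the role of the normalised namespace:identifier URIs.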

The SPARQL query needed (don't try this at home, do it on the web!)

Get the list of genes from KEGG pathways of a specified taxon

http://www.myexperiment.org/workflows/747

● Clear graph
● Get KEGG pathways list for a specific taxon
● For each pathway get genes list and import instances
● Count the number of genes found


Insert into local triplestore GeneID genes and KEGG pathways


● Get the list of genes
● Get the list of pathways
● Insert into local triplestore each corresponding graph


Insert into local triplestore the needed GO annotations

● Get the GO annotations for each gene

Finally, the needed query merging KEGG, Entrez Gene and GO together

Bio2RDF resources

Bio2RDF's mirrors

http://quebec.bio2rdf.org/

http://qut.bio2rdf.org/

Bio2RDF SPARQL endpoints

http://www.freebase.com/view/user/bio2rdf/public/sparql

Life Science Raw Data Now

http://quebec.bio2rdf.org/download

Visit our Wiki rdfiser cookbook

http://bio2rdf.wiki.sourceforge.net/

Bio2RDF news

http://bio2rdf.blogspot.com/

http://groups.google.ca/group/bio2rdf

http://www.slideshare.net/search/slideshow?q=bio2rdf

http://scholar.google.com/scholar?q=bio2rdf

Our 2009 objectives

● Get approval from data providers to distribute RDF dumps and publish SPARQL endpoints (UniProt, BioCyc, Pathway Commons and BIND are in)
● Start using a Virtuoso 6 cluster
● Design more services accessible with the REST protocol via our JSP package
● Recruit mirror servers
● Develop new rdfiser programs in a community effort

Thanks

Jean Morissette, Nicole Tourigny

● The Bio2RDF community
● Centre de recherche du CHUL
● Université Laval
● Dumontier Lab
● QUT eResearch Center
● Openlink Virtuoso Technology
