Semantic Web Technologies in Biosciences Kei Cheung, Ph.D. Yale Center for Medical Informatics


Semantic Web Technologies in Biosciences

Kei Cheung, Ph.D.

Yale Center for Medical Informatics


Outline

• Introduction

• Past and current Web (Syntactic Web)

• Future Web (Semantic Web)

• Semantic Web technologies with examples in the biosciences


Data Growth

• The Human Genome Project created a paradigm shift in biology (experimental -> computational) due to the flood of DNA sequence data produced.

• Since the HGP, other types of high-throughput biotechnologies have emerged and produced vast quantities of data of diverse types (transcript profiling, protein profiling, genotyping, next-generation sequencing, etc.).

• An increasing number of bio-data providers have made their data available through the Web.


Problems and Issues

• Each database represents a data silo accessed by local applications written in specific languages

• The web pages display data but they do not expose the structure of data in a machine-readable format

• Different user/query interfaces

• No uniform/global data schema

• Lack of standard IDs, terminology, vocabulary, data formats, etc.


Available Tools/Approaches

• Web search engines (e.g., Google, Yahoo)

• One-stop shopping (e.g., NCBI)

• Gateway or directory listing (e.g., Neuroscience Database Gateway)

• Use screen scraping methods to extract data from web content (e.g., Perl scripts)


Find the most recent image of the person “Kei Hoi Cheung”

[Image search results, with the slide's captions:]

– Kei (Hoi) Cheung (15 years ago)

– Kei (Hoi) Cheung (more recent)

– Kei (Hui) Cheung: Not me!

– I’m NOT a company!


Semantic Web = Brilliant Web!


Knowledge-driven bioscience data integration on the Semantic Web

[Figure: layered diagram]

• Data layer: KEGG, NeuronDB, PDB, DrugBank, GenBank, CCDB, GeneCards, Gene Expression Omnibus

• Knowledge layer: the concepts Cell, Drug, Protein, Gene, Disease, Sequence, Image, Receptor and Pathway, connected by the relations targets, treats, is-a, has-image, has-sequence, encodes, underlies, has-part and is_involved_in

• Knowledge-based applications (top layer)


Semantic Web Stack


Problems with XML

• DTDs offer only limited expressiveness for describing an XML language

• XML is designed as a language for message encoding

• XML is only self-descriptive about the following structural relationships: containment, adjacency, co-occurrence, attribute and opaque reference.

– All these relationships are useful for serialization, but are not optimal for modeling objects of a problem domain

• For example, in AGML the relationship between the <spot> and <coord_*> tags is structurally no different from that between <spot> and <dia_*>.

– A computer algorithm must nevertheless treat them differently to develop meaningful applications. To calculate the distance between two <spot>s, an algorithm must use the values of <coord_*>, but to calculate the area of each <spot>, it must retrieve the values of <dia_*> instead
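The point about <spot>, <coord_*> and <dia_*> can be made concrete with a small Python sketch. The document fragment and the tag names coord_x, coord_y, dia_x and dia_y are hypothetical stand-ins for the <coord_*> and <dia_*> families, and treating a spot as an ellipse is an assumption made only for the arithmetic. Nothing in the XML itself says which child holds a position and which holds a size; that knowledge lives only in the application code.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical AGML-like fragment; tag names are illustrative assumptions.
AGML_LIKE = """
<gel>
  <spot id="s1"><coord_x>0</coord_x><coord_y>0</coord_y><dia_x>2</dia_x><dia_y>4</dia_y></spot>
  <spot id="s2"><coord_x>3</coord_x><coord_y>4</coord_y><dia_x>2</dia_x><dia_y>2</dia_y></spot>
</gel>
"""

def number(spot, tag):
    """Read the numeric text of a child element of a spot."""
    return float(spot.findtext(tag))

def distance(a, b):
    """The programmer must know that coord_* children are positions."""
    return math.hypot(number(a, "coord_x") - number(b, "coord_x"),
                      number(a, "coord_y") - number(b, "coord_y"))

def area(spot):
    """The programmer must know that dia_* children are diameters (ellipse assumed)."""
    return math.pi * (number(spot, "dia_x") / 2) * (number(spot, "dia_y") / 2)

s1, s2 = ET.fromstring(AGML_LIKE).findall("spot")
print(distance(s1, s2))
print(area(s1))
```

Both functions read children of <spot> in exactly the same structural way; only the hard-coded tag names distinguish a position from a diameter.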


Proliferation of Bio-XML Formats

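[Figure: Bio-XML formats grouped by data type, followed by the move beyond XML]

• Sequence: BSML, AGAVE

• Microarray gene expression: GEML, MAML, MAGE-ML

• Pathway: BIND, SBML, PSI-MI

• Beyond XML: RDF (e.g., BioPAX), semantically rich ontologies, reasoning (machine intelligence)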


From XML to RDF


Semantic Web

• The Semantic Web provides a common machine-readable framework that allows data to be shared and reused across application, enterprise, and community boundaries

– The Semantic Web is a web of data

• The Semantic Web is about two things

– It is about common formats for identification, integration and combination of data drawn from diverse sources

– It is also about languages for recording how the data relates to real world objects


RDF

• The foundational Semantic Web technology is the Resource Description Framework (RDF)

• RDF is a system to describe resources

• RDF has a very simple and yet elegant data model (a directed labeled graph)

– everything is a resource that connects with other resources via properties

• A resource is anything that is identifiable by a uniform resource identifier (URI)


Characteristics of RDF

• The graph structure offered by RDF makes it extensible and evolvable. Adding nodes and edges to a graph doesn’t change the structure of any existing subgraph.

• RDF has an open-world assumption in that it allows anyone to make statements about any resource

• RDF is monotonic in that new statements neither change nor negate the validity of previous assertions, making it particularly suitable in an academic environment, in which consensus and disagreement about the same resources have a useful coexistence that needs to be formally recorded.

• All RDF terms share a global naming scheme in URIs, making distributed data and ontologies possible

• The combined effect of global naming, universal data structure and open-world assumption is that resources exist independently but can be readily linked with little precoordination.
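The extensibility and monotonicity described on this slide can be illustrated with a minimal Python sketch. It models an RDF graph as a plain set of (subject, property, object) tuples rather than using an RDF library, and the two provider graphs and every URI under http://example.org/ are hypothetical.

```python
# Toy model: an RDF graph is a set of (subject, property, object) triples.
# All URIs below are hypothetical and used only for illustration.
EX = "http://example.org/"

provider_a = {
    (EX + "proteinX", EX + "located_in", EX + "nucleus"),
    (EX + "proteinX", EX + "encoded_by", EX + "geneX"),
}
provider_b = {
    (EX + "drugY", EX + "targets", EX + "proteinX"),
    # The same statement made independently by a second provider.
    (EX + "proteinX", EX + "located_in", EX + "nucleus"),
}

def merge(*graphs):
    """Merge graphs by set union; no existing statement is altered or removed."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged

merged = merge(provider_a, provider_b)

# Monotonic: each original graph is still a subgraph of the merged graph.
print(provider_a <= merged and provider_b <= merged)
print(len(merged))
```

Because both providers use the same global identifier for proteinX, the union links the drug to the protein's location and gene without any prior coordination between them.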


Linked Data

• Linked Data is about using the Web to connect related data that wasn't previously linked

• Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF."

• In addition to providers and consumers of linked data, there are link creators who create semantic links between different RDF datasets (e.g., links can be created between protein kinases and drugs)


Linked Data Cloud (linkeddata.org)


RDFS and OWL

• RDF Schema (RDFS) – it supports classes and class hierarchies

• Web Ontology Language (OWL): OWL Lite, OWL DL, OWL Full

• RDFS and OWL are layered on top of RDF and add support for axioms and inference, making the Semantic Web capable of supporting knowledge-based querying and inferencing
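As a rough illustration of the kind of inference RDFS adds, the sketch below applies two RDFS entailment rules to a toy triple set in plain Python: rdfs:subClassOf is transitive (rule rdfs11), and an instance of a class is also an instance of its superclasses (rule rdfs9). The ex: class and instance names are hypothetical, and a real reasoner covers far more than these two rules.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical class hierarchy and one hypothetical instance.
asserted = {
    ("ex:Kinase", SUBCLASS, "ex:Protein"),
    ("ex:Protein", SUBCLASS, "ex:Molecule"),
    ("ex:proteinA", TYPE, "ex:Kinase"),
}

def rdfs_closure(triples):
    """Apply rdfs9 and rdfs11 repeatedly until no new triple is produced."""
    inferred = set(triples)
    while True:
        new = set()
        for (s, p, o) in inferred:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in inferred:
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))   # rdfs11: transitivity
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))       # rdfs9: type inheritance
        if new <= inferred:
            return inferred
        inferred |= new

closure = rdfs_closure(asserted)
for triple in sorted(closure - asserted):
    print(triple)
```

A query for all instances of ex:Molecule over the closure now returns ex:proteinA, although that statement was never asserted directly.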


Uniform Resource Identifiers (URIs)

• A URI is a string of characters used to identify or name a resource on the Internet.

• URLs (Uniform Resource Locators) are a particular type of URI, used for resources that can be accessed on the WWW (e.g., web pages)

• In RDF, URIs typically look like “normal” URLs, often with fragment identifiers to point at specific parts of a document:

– http://www.semantic-systems-biology.org/SSB#CCO_B0000000 (id for “core cell cycle protein” in Cell Cycle Ontology)
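The structure of such a URI can be inspected with Python's standard library. This is only a sketch: the helper names split_uri and to_prefixed_name are made up for illustration, while the ssb prefix is the one used in the SPARQL queries later in this deck.

```python
from urllib.parse import urldefrag

# Prefix table; "ssb" matches the PREFIX declaration used in the SPARQL slides.
PREFIXES = {"ssb": "http://www.semantic-systems-biology.org/SSB#"}

def split_uri(uri):
    """Split a URI into the document part and the fragment identifier."""
    base, fragment = urldefrag(uri)
    return base, fragment

def to_prefixed_name(uri, prefixes=PREFIXES):
    """Abbreviate a URI with a known prefix, or fall back to <uri> form."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    return "<" + uri + ">"

uri = "http://www.semantic-systems-biology.org/SSB#CCO_B0000000"
print(split_uri(uri))
print(to_prefixed_name(uri))
```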


RDF Triple/Graph

• The basic information unit in RDF is an RDF statement in the form of

– (subject, property, object)

• Each RDF statement can be modeled as a graph comprising two nodes connected by a directed arc

• A triple example

• A set of such triples can jointly form a directed labeled graph (DLG) that can in theory model a significant part of domain knowledge.

• An RDF graph can be represented in different formats (XML, Turtle, N3…)
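A minimal sketch of triples and one of their serializations, in plain Python. It writes N-Triples, a line-based format that is a subset of Turtle, chosen here because it is simple to emit by hand. The identifier protein_example is hypothetical; the class URI and its label come from the URI slide above.

```python
SSB = "http://www.semantic-systems-biology.org/SSB#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

class Literal(str):
    """Marks a plain string value, as opposed to a URI."""

# Each statement is a (subject, property, object) tuple.
triples = [
    # protein_example is a hypothetical identifier used only for illustration.
    (SSB + "protein_example", SSB + "is_a", SSB + "CCO_B0000000"),
    (SSB + "CCO_B0000000", RDFS + "label", Literal("core cell cycle protein")),
]

def term(value):
    """Render a URI as <uri> and a literal as a quoted, escaped string."""
    if isinstance(value, Literal):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return "<" + value + ">"

def to_ntriples(statements):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(" ".join(term(t) for t in s) + " ." for s in statements)

print(to_ntriples(triples))
```

The two statements share the node for CCO_B0000000, so together they already form a small directed labeled graph with three nodes and two arcs.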


Cell Cycle Ontology (CCO) (Antezana et al., 2009, Genome Biology)

http://genomebiology.com/2009/10/5/R58


Named Graph

• RDF graphs are nameable by URIs

• This enables RDF statements to be created to describe graphs

• This helps establish provenance and trust

• Representation formats: TriX and TriG

:G2 {
    :G1 swp:assertedBy _:w1 .
    _:w1 swp:authority :Erick .
    _:w1 dc:date "2009-05-29"^^xsd:date .
    _:w1 dc:license "Creative Commons Attribution License"^^xsd:string .
    :Erick rdf:type ex:Person .
    :Erick ex:email <mailto:[email protected]> .
}
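One way to picture named graphs is as quads: each triple carries the name of the graph it belongs to. The Python sketch below mirrors part of the TriG example above; the content of :G1 (a single ex: statement) is hypothetical, since the slide only shows :G2, and the helper names are made up for illustration.

```python
# Each quad is (graph, subject, property, object).
quads = [
    # Hypothetical content of :G1; the slide does not show it.
    (":G1", "ex:proteinX", "ex:located_in", "ex:nucleus"),
    # Statements in :G2 that describe :G1, taken from the TriG example.
    (":G2", ":G1", "swp:assertedBy", "_:w1"),
    (":G2", "_:w1", "swp:authority", ":Erick"),
    (":G2", "_:w1", "dc:date", "2009-05-29"),
]

def triples_in(graph_name, quads):
    """Return the triples that belong to one named graph."""
    return [(s, p, o) for (g, s, p, o) in quads if g == graph_name]

def provenance_of(graph_name, quads):
    """Collect the properties of every warrant that asserts the given graph."""
    warrants = {o for (_, s, p, o) in quads
                if s == graph_name and p == "swp:assertedBy"}
    return {p: o for (_, s, p, o) in quads if s in warrants}

print(triples_in(":G1", quads))
print(provenance_of(":G1", quads))
```

Because :G1 is itself a subject in :G2, an application can decide whether to trust the statements in :G1 based on who asserted them and when.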


SPARQL

• It is a standard query language for RDF

• It can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

• It contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions.

• The results of SPARQL queries can be result sets or RDF graphs.


RDF Graph Match (SPARQL)

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <http://www.semantic-systems-biology.org/SSB#>
SELECT ?protein_label
WHERE {
   GRAPH <cco_S_pombe> {
      ?protein ssb:is_a ssb:CCO_B0000000 .
      ?protein rdfs:label ?protein_label
   }
}

ssb:CCO_B0000000 = core cell cycle protein
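The graph matching behind such a query can be sketched in a few lines of Python. This is a toy matcher for basic graph patterns only, not how a SPARQL engine is actually implemented, and the two proteins and their labels are hypothetical data. Terms starting with "?" are variables; each pattern extends the variable bindings found so far.

```python
# Hypothetical data, written with prefixed names for brevity.
triples = [
    ("ssb:protein_1", "ssb:is_a", "ssb:CCO_B0000000"),
    ("ssb:protein_1", "rdfs:label", "hypothetical protein 1"),
    ("ssb:protein_2", "ssb:is_a", "ssb:some_other_class"),
    ("ssb:protein_2", "rdfs:label", "hypothetical protein 2"),
]

def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    candidate = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if candidate.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return candidate

def query(patterns, triples):
    """Return all bindings that satisfy every pattern (a conjunction)."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings

patterns = [
    ("?protein", "ssb:is_a", "ssb:CCO_B0000000"),
    ("?protein", "rdfs:label", "?protein_label"),
]
labels = [b["?protein_label"] for b in query(patterns, triples)]
print(labels)
```

The shared variable ?protein is what joins the two patterns: only a protein that satisfies the first pattern contributes its label to the result.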


SPARQL (Cont’d)

• The following SPARQL query on the A. thaliana graph allows users to infer a putative location for proteins with no documented cellular locations. The assumption behind such a query is that two proteins that participate in the same interaction are likely to share the same cellular location, here the 'nucleus' (CCO_C0000252):

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <http://www.semantic-systems-biology.org/SSB#>
SELECT ?prot_in_the_nucleus ?prot_to_study ?interaction_label
WHERE {
   GRAPH <cco_A_thaliana> {
      ?interaction a ssb:interaction .
      ?interaction rdfs:label ?interaction_label .
      ?prot_A ssb:participates_in ?interaction .
      ?prot_B ssb:participates_in ?interaction .
      ?prot_A rdfs:label ?prot_in_the_nucleus .
      ?prot_B rdfs:label ?prot_to_study .
      ?prot_A ssb:located_in ssb:CCO_C0000252 .
      OPTIONAL {
         ?prot_B ssb:located_in ?location_B .
      }
      FILTER (!bound(?location_B))
   }
}
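The OPTIONAL plus FILTER (!bound(...)) idiom in this query expresses "has no documented location". A rough Python equivalent over a toy triple set is sketched below. The ex: proteins, the interaction and the cytoplasm term are hypothetical, and for brevity the sketch returns identifiers instead of the rdfs:label values selected by the real query.

```python
NUCLEUS = "ssb:CCO_C0000252"

# Hypothetical data: three proteins taking part in one interaction.
triples = {
    ("ex:int1", "rdf:type", "ssb:interaction"),
    ("ex:protA", "ssb:participates_in", "ex:int1"),
    ("ex:protB", "ssb:participates_in", "ex:int1"),
    ("ex:protC", "ssb:participates_in", "ex:int1"),
    ("ex:protA", "ssb:located_in", NUCLEUS),
    ("ex:protC", "ssb:located_in", "ex:cytoplasm"),
}

def objects(triples, subject, prop):
    return {o for (s, p, o) in triples if s == subject and p == prop}

def subjects(triples, prop, obj):
    return {s for (s, p, o) in triples if p == prop and o == obj}

def putative_nuclear_proteins(triples):
    """Pair each nuclear protein with interaction partners lacking any location."""
    results = set()
    for interaction in subjects(triples, "rdf:type", "ssb:interaction"):
        participants = subjects(triples, "ssb:participates_in", interaction)
        for prot_a in participants:
            if NUCLEUS not in objects(triples, prot_a, "ssb:located_in"):
                continue
            for prot_b in participants:
                # Same effect as OPTIONAL + FILTER(!bound(?location_B)):
                # keep prot_b only if no located_in statement exists for it.
                if not objects(triples, prot_b, "ssb:located_in"):
                    results.add((prot_a, prot_b, interaction))
    return results

print(putative_nuclear_proteins(triples))
```

Here ex:protC is excluded because it already has a documented location, while ex:protB is reported as a candidate for the nucleus.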


OWL DL Representation

:Nucleus a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty :part_of ;
        owl:someValuesFrom :Cell
    ] .

Necessary but not sufficient condition: every nucleus is part of some cell, but not everything that is part of a cell is a nucleus


OWL Reasoning

• Which proteins participate in “mitosis”?

:Protein a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty :participates_in ;
        owl:someValuesFrom :Mitosis
    ] .


Visualization Application


Semantic Web Rules

• Semantic Web Rule Language (SWRL) – it combines the OWL DL and OWL Lite sublanguages of the OWL Web Ontology Language with the Unary/Binary Datalog RuleML sublanguages of the Rule Markup Language

• It can help increase the expressivity of OWL ontologies by augmenting such ontologies with rules

• Rules are often easier to read and write than description logic axioms.


SWRL Examples

• Protein(?p1) ∧ cellularLocation(?p1, Nucleus) → NuclearProtein(?p1)

• participatesInteraction(?protein1, ?interaction1) ∧ participatesInteraction(?protein2, ?interaction1) ∧ participatesInteraction(?protein2, ?interaction2) ∧ participatesInteraction(?protein3, ?interaction2) → proteinInteraction(?protein1, ?protein3)
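The effect of these two rules can be sketched with a naive rule application in Python. This is an illustration over hypothetical facts, not a SWRL engine: facts are plain tuples, and a single pass suffices because neither rule consumes what the other produces.

```python
# Hypothetical facts: (predicate, argument, ...) tuples.
facts = {
    ("Protein", "a"),
    ("cellularLocation", "a", "Nucleus"),
    ("participatesInteraction", "a", "i1"),
    ("participatesInteraction", "b", "i1"),
    ("participatesInteraction", "b", "i2"),
    ("participatesInteraction", "c", "i2"),
}

def apply_rules(facts):
    """Return the facts plus everything the two rules derive from them."""
    derived = set()

    # Rule 1: Protein(?p1) and cellularLocation(?p1, Nucleus) -> NuclearProtein(?p1)
    proteins = {f[1] for f in facts if f[0] == "Protein"}
    for f in facts:
        if f[0] == "cellularLocation" and f[2] == "Nucleus" and f[1] in proteins:
            derived.add(("NuclearProtein", f[1]))

    # Rule 2: a chain protein1 - interaction1 - protein2 - interaction2 - protein3
    # implies proteinInteraction(protein1, protein3).
    # As written, the rule does not require the variables to be distinct,
    # so trivial pairs such as (a, a) are derived as well.
    parts = [(f[1], f[2]) for f in facts if f[0] == "participatesInteraction"]
    for (p1, i1) in parts:
        for (p2, i1_again) in parts:
            if i1_again != i1:
                continue
            for (p2_again, i2) in parts:
                if p2_again != p2:
                    continue
                for (p3, i2_again) in parts:
                    if i2_again == i2:
                        derived.add(("proteinInteraction", p1, p3))
    return set(facts) | derived

result = apply_rules(facts)
print(("NuclearProtein", "a") in result)
print(("proteinInteraction", "a", "c") in result)
```

Proteins a and c never share an interaction directly; the second rule links them through the intermediate protein b.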


Enabling Technologies/Tools

• Triplestores (e.g., Virtuoso, Oracle, AllegroGraph, …) – SPARQL Endpoint

• Ontology editors (e.g., Protégé, SWOOP, OBO-Edit, …)

• OWL reasoners (e.g., Pellet, RacerPro, FaCT++, …)


Semantic Web Related Communities

• National Center for Biomedical Ontology

• OBO Foundry

• BioPAX

• Semantic Web Activity of the World Wide Web Consortium

– Semantic Web for Health Care and Life Sciences Interest Group: BioRDF, COI, LODD, Sci. Discourse, Terminology, Translational Medicine Ontology


Roads to Semantic Web

• Provide data in RDF format (data providers)

– UniProt, Gene Ontology, NCI Metathesaurus

• Convert non-RDF data to RDF data (third party efforts)

– YeastHub

– D2RQ, Triplify

• Mix RDF data with non-RDF data

– RDFa (e.g., Fuzz Firefox extension)

– GRDDL


Merging Web 2.0 and the Semantic Web

• People (FOAF)

• Yahoo! Pipes (Semantic Web Pipes developed at DERI)

• Dapper (Semantify Dapper)

• MediaWiki (Semantic MediaWiki)

• Google Maps (Semantic Google Map)


The End