CURRENT ADVANCES TO BRIDGE THE USABILITY-EXPRESSIVITY GAP IN BIOMEDICAL SEMANTIC SEARCH (AND VISUALIZING LINKED DATA)
Maulik R. Kamdar Biomedical Informatics PhD Program
3rd April 2015
QUERYING HETEROGENEOUS DATASETS ON THE LINKED DATA WEB André Freitas, Edward Curry, João Gabriel Oliveira and Seán O'Riain
Internet Computing February 2012
EVALUATING THE USABILITY OF NATURAL LANGUAGE QUERY LANGUAGES AND INTERFACES TO SEMANTIC WEB KNOWLEDGE BASES Esther Kaufmann and Abraham Bernstein
Journal Of Web Semantics
November 2010
INTRODUCTION
• Opportunities
  - Builds on existing Web infrastructure (URIs and HTTP) and Semantic Web standards (RDF, RDFS, vocabularies)
  - Reduces barriers to data publication, consumption, reuse and availability, adding a fine-grained structure
  - Exposes previously siloed databases as data graphs (D2R, Google Refine) to be interlinked and integrated with other datasets to create a global-scale interlinked dataspace
• Challenges
  - Awareness of which exposed datasets potentially contain the data users want, their location and their data model
  - Syntax of structured query languages like SPARQL
  - Heterogeneous, loosely connected (as yet) and distributed data sources, with different descriptors for the same entity
EXISTING APPROACHES
• Information Retrieval Approaches
  - Entity-centric Search (SWSE, Sindice)
  - Structure Search (Semplore) – use of inverted indexes and user feedback strategies
• Natural Language Queries
  - Question Answering (PowerAqua, FREyA)
  - Difficult to expand across domains
  - Best-effort Natural Language Interfaces (Treo)
  - Habitability Problem - users need guidance and support
  - WordNet/Wikipedia semantic approximation techniques
• Structured SPARQL Queries
CHALLENGE DIMENSIONS
• Query expressivity
  - Query datasets by referencing elements in the data model, operate over the data (aggregate results, express conditional statements).
• Usability
  - An easy-to-operate, intuitive, and task-efficient query interface.
• Vocabulary-level semantic matching
  - Semantically match query terms to dataset vocabulary-level terms.
• Entity reconciliation
  - Match entities expressed in the query to semantically equivalent dataset entities.
• Semantic tractability mechanisms
  - Answer queries not supported by explicit dataset statements (for example, “Is Natalie Portman an Actress?” can be supported by the statement “Natalie Portman starred Star Wars”).
BIOMEDICAL MOTIVATION
[Figure: compound screening funnel. Compound counts shown: ~300 000 compounds, ~300 interesting compounds, ~10 interesting compounds, ~5 compounds. Stage labels: Literature, Virtual Screening, Query databases, Hypothesis Generation, (Linked) Data]
“Are there Drugs with molecular weight under 400 tested against ‘Colon Cancer’?”
“Do any Publications refer to assays using ‘Aspirin’ as the primary Drug in treatment of ‘Prostate Cancer’?”
REVEALD: A USER-DRIVEN DOMAIN-SPECIFIC INTERACTIVE SEARCH PLATFORM FOR BIOMEDICAL RESEARCH
Maulik R. Kamdar, Dimitris Zeginis, Ali Hasnain, Stefan Decker and Helena F. Deus
Journal of Biomedical Informatics February 2014
CHALLENGES
• Awareness of which exposed datasets potentially contain the data users want, and of their data model.
• Large, heterogeneous biomedical data sources, which are too dynamic for reliable data centralization.
• Assembling SPARQL queries to create the aggregated information for bioinformatics analysis still poses a high cognitive entry barrier.
• Human-readable, and more specifically domain-specific, representation of query results is required.
• None of the previous systems has been tested in biomedical domains, except DistilBio, VIQUEN and Cuebee.
• Trade-off between expressivity and usability.
BACKGROUND: CANCO DOMAIN-SPECIFIC MODEL
Zeginis, Dimitris, et al. "A collaborative methodology for developing a semantic model for interlinking Cancer Chemoprevention linked-data sources." Semantic Web 5.2 (2014): 127-142.
LIFE SCIENCES LINKED OPEN DATA CLOUD
~3 Billion Triples Life Sciences 53 datasets
Cyganiak,R. and Jentzsch,A. (2014) The Linking Open Data cloud diagram. http://lod-cloud.net/ [Accessed: March 23, 2013]
BACKGROUND: CATALOGUING & LINKING
1248 concepts and 1255 properties were harvested from more than 53 Linked Biomedical Data Sources (LBDS) (Life Sciences Linked Open Data – LSLOD catalogue) and linked to the CanCO Query Elements.
Hasnain, Ali, et al. "Cataloguing and linking life sciences LOD cloud." 1st International Workshop on Ontology Engineering in a Data-driven World (OEDW 2012).
BACKGROUND: FEDERATED ARCHITECTURE
Catalogue links:
Chebi:Compound void-ext:subClassOf Granatum:Molecule
Pubchem:Compound void-ext:subClassOf Granatum:Molecule
Query transformation:
?molec a Granatum:Molecule
becomes
?molec a Chebi:Compound
?molec a Pubchem:Compound
[Figure: federated architecture. Component labels: SPARQL Query; Transformed Query; Chebi, DrugBank, UniProt, Others; Life Sciences Linked Open Data (LSLOD); LSLOD Catalogue; CanCO; Saved Queries; Rule Templates; Experimental Datasets; Query Engine; Query Logging; Transformation; Cataloguing & Links Creation; RDFization; Social Collaborative Workspace]
Hasnain, Ali, et al. "A Roadmap for navigating the Life Sciences Linked Open Data Cloud." International Semantic Technology (JIST2014) conference. 2014.
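The rewriting step on this slide, where a pattern over a CanCO concept is expanded into patterns over the linked source classes, can be sketched roughly as follows. This is only an illustration in Python, not the platform's actual query engine: the `SUBCLASS_OF` table and the `transform` function are hypothetical names, and the sketch assumes the catalogue is a plain mapping and that only simple `?var a Class` patterns are rewritten.

```python
import re

# Subclass links copied from the slide; the dict itself is a hypothetical
# stand-in for the LSLOD catalogue.
SUBCLASS_OF = {
    "Chebi:Compound": "Granatum:Molecule",
    "Pubchem:Compound": "Granatum:Molecule",
}


def transform(pattern: str) -> list[str]:
    """Expand a '?var a Class' pattern into one pattern per catalogued subclass.

    Patterns that are not simple type patterns, or whose class has no
    catalogued subclass, are returned unchanged.
    """
    match = re.fullmatch(r"(\?\w+) a (\S+)", pattern.strip())
    if match is None:
        return [pattern]
    variable, concept = match.groups()
    subclasses = [sub for sub, sup in SUBCLASS_OF.items() if sup == concept]
    if not subclasses:
        return [pattern]
    return [f"{variable} a {sub}" for sub in subclasses]


if __name__ == "__main__":
    for rewritten in transform("?molec a Granatum:Molecule"):
        print(rewritten)
```

In a federated setting each rewritten pattern would presumably be sent to the endpoint that hosts the corresponding class; that routing is not shown here.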
BACKGROUND: FEDERATED ARCHITECTURE
• Non-intuitive
• SPARQL, RDF, Schema knowledge required
• Domain-specific visualization of results is not possible
REVEALD SEARCH PLATFORM
• ReVeaLD (Real-Time Visual Explorer and Aggregator of Linked Data) is a user-driven, domain-specific search platform.
• Intuitively formulate advanced search queries using a click-input-select mechanism.
• Visualize the results in a domain-suitable format.
• Entity-centric and Visual Query Search System.
• Assembly of the query is governed by a Domain-specific Language (DSL), which in this case is the Cancer Chemoprevention Ontology (CanCO).
VISUAL QUERY MODEL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX granatum: <http://chem.deri.ie/granatum/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT * WHERE {
  ?x0_Assay a granatum:Assay ;
    granatum:hasInput ?x1_Target ;
    granatum:identify ?x2_ChemopreventiveAgent ;
    granatum:outcome_method ?x3_outcome_method .
  ?x1_Target granatum:title ?x4_title .
  ?x2_ChemopreventiveAgent granatum:molecularWeight ?x10_molecularWeight ;
    granatum:SMILESnotation ?x9_SMILESnotation ;
    granatum:hasFormula ?x7_hasFormula ;
    granatum:HBD ?x5_Hydrogen_Bond_Donors ;
    granatum:HBA ?x6_Hydrogen_Bond_Acceptors ;
    granatum:TPSA ?x8_Topological_Polar_Surface_Area .
  FILTER regex(xsd:string(?x4_title), "estrogen receptor", "is")
  FILTER ( xsd:double(?x10_molecularWeight) < 300 )
}
LIMIT 100
[Figure: visual query model with its SPARQL translation; data sources shown: Pubchem, ChEBI, Uniprot]
All Assays, which Target Estrogen Receptors present in Human (Organism), and which identify potential Chemopreventive Agents with Molecular Weight < 300
http://srvgal78.deri.ie:8080/explorer?type=sampleQuery&nodes=17-1-30-33-73-78-91-81-82-92-98-63&links=17.1-17.30-1.33-17.73-17.78-1.91-30.81-30.82-30.92-30.98-33.63&filters=1.91.c.estrogen%20receptor|30.98.lt.300|33.63.c.human&flexible=1
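The explorer URL above appears to carry the whole visual query as parameters. The following Python sketch decodes it under several assumptions that the slides do not confirm: that `nodes` is a dash-separated list of element IDs, that `links` holds `source.target` pairs, that each filter has the shape `source.target.operator.value`, and that the operator codes `c` and `lt` mean "contains" and "less than". The function name `parse_query_state` is hypothetical, and the whitespace inside the URL on the slide is treated as a line-wrapping artifact.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed meanings of the operator codes; the slides do not define them.
OPERATORS = {"c": "contains", "lt": "less than"}

SAMPLE_URL = (
    "http://srvgal78.deri.ie:8080/explorer?type=sampleQuery"
    "&nodes=17-1-30-33-73-78-91-81-82-92-98-63"
    "&links=17.1-17.30-1.33-17.73-17.78-1.91-30.81-30.82-30.92-30.98-33.63"
    "&filters=1.91.c.estrogen%20receptor|30.98.lt.300|33.63.c.human"
    "&flexible=1"
)


def parse_query_state(url):
    """Decode nodes, links and filters from an explorer URL (assumed layout)."""
    params = parse_qs(urlsplit(url).query)
    nodes = [int(n) for n in params["nodes"][0].split("-")]
    links = [
        tuple(int(part) for part in pair.split("."))
        for pair in params["links"][0].split("-")
    ]
    filters = []
    for item in params["filters"][0].split("|"):
        # Split at most three times so that values containing dots stay whole.
        source, target, op, value = item.split(".", 3)
        filters.append((int(source), int(target), OPERATORS.get(op, op), value))
    return nodes, links, filters


if __name__ == "__main__":
    nodes, links, filters = parse_query_state(SAMPLE_URL)
    for entry in filters:
        print(entry)
```

The three decoded filters line up with the textual description of the query (estrogen receptor, molecular weight below 300, human), which is the main reason for reading the parameter layout this way.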
GRAPHIC RULES
• Query: SELECT * WHERE {<clickedURI> ?p ?o}
• Results are subjected to a set of Graphic Rules, which follow the Event-Condition-Action (ECA) paradigm and provide visual representations using the Fresnel Display Vocabulary.
• Example:
  - Event: each triple retrieved as a query execution result, such as
    <http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/844> <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/pdbIdPage> “http://www.pdb.org/pdb/explore/explore.do?structureId=1IVO”
  - Condition: sdf_file or pdbIdPage (Predicate) + http (Object)
  - Action: HTTP GET and invoke a specific Resource Renderer
  - Resource Renderer: GLMol Molecular Viewer
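The Event-Condition-Action example above can be approximated in a few lines of Python. This is a hedged sketch rather than the platform's rule engine: `GraphicRule` and `choose_renderer` are hypothetical names, the condition is read as "the predicate URI ends with `sdf_file` or `pdbIdPage` and the object starts with `http`", the plain-text fallback for unmatched triples is an assumption, and the HTTP GET itself is left out so the example runs offline.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class GraphicRule:
    """One ECA rule: condition on predicate and object, action as a renderer name."""

    predicate_suffixes: tuple[str, ...]
    object_prefix: str
    renderer: str

    def matches(self, triple: Triple) -> bool:
        _, predicate, obj = triple
        # str.endswith accepts a tuple and checks each suffix in turn.
        return predicate.endswith(self.predicate_suffixes) and obj.startswith(
            self.object_prefix
        )


# Rule taken from the slide's example; the representation is an assumption.
RULES = [GraphicRule(("sdf_file", "pdbIdPage"), "http", "GLMol Molecular Viewer")]


def choose_renderer(triple: Triple, rules=RULES) -> str:
    """Event: a retrieved triple. Return the renderer of the first matching rule."""
    for rule in rules:
        if rule.matches(triple):
            return rule.renderer
    return "text"  # assumed fallback: show the triple as plain text


if __name__ == "__main__":
    example = (
        "http://www4.wiwiss.fu-berlin.de/drugbank/resource/targets/844",
        "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/pdbIdPage",
        "http://www.pdb.org/pdb/explore/explore.do?structureId=1IVO",
    )
    print(choose_renderer(example))
```

Keeping rules as data rather than code is what would let users supply their own graphic rules without touching the matching logic.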
EVALUATION
• Tracking Real-time User Experience (TRUE) methodology, widely used in the HCI community to evaluate computer games.
• Game-based evaluation where domain users are given tasks to complete, and time and interactions are tracked using Google Analytics.
• Subjectivistic evaluation where users were asked to fill out a survey.
• The evaluation focused on two usability concerns:
  - Does familiarity of the users with the DSL affect the time needed to formulate the query?
  - Does a constrained (smaller) DSL lead to less time needed for query formulation?
OTHER IMPLEMENTATIONS: LINKED TCGA
Saleem, M., Kamdar, M. R., et al. (2014). Big linked cancer data: Integrating linked TCGA and PubMed. Web Semantics: Science, Services and Agents on the World Wide Web, 27, 34-41.
http://srvgal78.deri.ie/tcga-pubmed/
OTHER IMPLEMENTATIONS: LINKEDPPI
Kazemzadeh, L., Kamdar, M. R., et al. LinkedPPI: Enabling Intuitive, Integrative Protein-Protein Interaction Discovery. Linked Science, 48.
DISCUSSION
• DSL Incrementation Mechanism
  - Extend the current model represented in the Visual Query Builder by adding new concepts and properties.
  - Use or merge publicly available extensions of the DSL.
• No reliance on the Federated Query Engine, SPARQL Endpoint, underlying DSL and Graphic Rules.
• Corrupt Graphic Rules result in the textual representation of the relevant triple.
• Domain-specific Languages increase usability and enable abstraction of underlying data models.
Query expressivity: Medium (SELECT, FILTER, OPTIONAL)
Usability: Medium (Entity-centric Search, VQS)
Vocabulary-level semantic matching: Low (Indexed Term URI to Concept)
Entity reconciliation: Low (owl:sameAs for same unique keys)
Semantic tractability mechanisms: None
FUTURE WORK
• Ontologies, indexed term labels and catalogue as elements in a Controlled Natural Language to increase usability
• Results pipelined to any problem-solving method (such as Autodock Vina, visualization, or an ML algorithm)
• Faceted Search, Related Entity Recognition based on Feature-based Similarity Measures
• Allowing users of the platform to provide their own DSL, data sources, and graphic rules
• SPARQL Endpoint availability and latency
• Ontology Reuse instead of Ontology Alignment!