From a Genome Database to a Semantic Knowledge Base MS Thesis Defense July 18th, 2008 Bobby E. McKnight Committee: I. Budak Arpinar (Major Professor) John A. Miller Liming Cai


From a Genome Database to a Semantic Knowledge Base

MS Thesis Defense
July 18th, 2008

Bobby E. McKnight

Committee:
I. Budak Arpinar (Major Professor)
John A. Miller
Liming Cai


Contents

Introduction
Motivation
Example Scenario
Data Inventory and Knowledge Engineering
Visual Query Building
  Guided query building
  Natural Language
Data Exploration
Evaluation
Related Works
Future Work
Conclusion


Introduction

Trypanosoma cruzi
  Responsible for Chagas disease
  Chagas is the third most serious parasitic disease worldwide (World Bank, 1993; Schofield and Dias, 1999)

TcruziDB.org
  Online Trypanosoma cruzi database resource
  Provides genome exploration for researchers

Semantic Web
  Provides rich formats for expressing data
  Many advantages over traditional relational database systems


The Big Picture

[Figure: architecture overview. Labels: tcruzidb.org, Outside Genomic Resources, TcruziKB, and Ontologies (ComGO, GO, SO, EnzyO, GlycO, PropreO, RO, Taxonomy, EC)]


Motivation

“Over most of my career, people could plan their experiments over a weekend, spend six months doing them, and then interpret the results over a weekend. Now, people can do an experiment over a weekend and spend six months thinking about what the results mean.”

Gerald M. Rubin

Vice President for Biomedical Research
Howard Hughes Medical Institute (HHMI)


Why Semantics?

Interoperability: Seamless Integration
  Use known ontologies

Knowledge/Domain Centered
  As opposed to database tables

Automation for Knowledge Exploration
  Inferencing

Re-Usable
  Standardization


Seamless Integration

Ontology naturally recognizes and maps between different external data sources

GeneXYZ
  has_genbank_index_identifier 12345
  has_accession ENAxxx.1
  has_kegg_identifier TCKxxx
  has_genedb_identifier Tc00.xxxx.30
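
A minimal sketch of how one gene record carrying several external identifiers can act as the join point between sources. The tuple layout and the `find_gene` and `identifiers_of` helpers are illustrative assumptions, not TcruziKB's actual storage; the identifier values are the placeholders from the slide.

```python
# Triples for one gene, using the placeholder identifiers from the slide.
TRIPLES = [
    ("GeneXYZ", "has_genbank_index_identifier", "12345"),
    ("GeneXYZ", "has_accession", "ENAxxx.1"),
    ("GeneXYZ", "has_kegg_identifier", "TCKxxx"),
    ("GeneXYZ", "has_genedb_identifier", "Tc00.xxxx.30"),
]


def find_gene(triples, external_id):
    """Return the subjects that carry the given identifier under any property."""
    return sorted({s for s, _p, o in triples if o == external_id})


def identifiers_of(triples, gene):
    """Return a property -> identifier mapping for one gene."""
    return {p: o for s, p, o in triples if s == gene}


# A KEGG identifier and a GeneDB identifier resolve to the same gene.
print(find_gene(TRIPLES, "TCKxxx"))
print(find_gene(TRIPLES, "Tc00.xxxx.30"))
print(identifiers_of(TRIPLES, "GeneXYZ")["has_accession"])
```

Because every identifier hangs off the same subject, a lookup that starts from any one source can reach the others without a per-source mapping table.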


Knowledge Centered

View concepts, not tables
  Focus on the real-world concept instead of the table where it is stored
  More natural way to access data

Make our data reusable and interoperable
  Using widely adopted standards
    RDF
    OWL


Example Scenario – Querying 1

With TcruziDB, if a user wants to find a specific group of genes, they must conduct multiple searches and combine the results


Example Scenario - Querying 2


Example Scenario – Querying 3


Example Scenario – Querying 4

This requires a great deal of backtracking

TcruziKB uses a semantic query building system and a natural language query system
  Queries such as this one can be built and executed from one screen
  Eliminates the backtracking
  Still supports keyword search


Example Scenario - Results

TcruziDB only gives results in tabular format

TcruziKB gives a multi-perspective data view
  Tables
  Statistics
  Graphs
  Related Publications


Example Scenario - Summary

With TcruziKB, a user can enter a complex query without backtracking by using the query builder or the natural language query interface

Instead of simple tabular results, which require a great deal of human effort to find significant information, multiple result perspectives can be used to view the query results along with related publications


Data Inventory and Knowledge Engineering


Knowledge Engineering

System Ontology
  Several popular ontologies exist with classes and properties of interest
  Reuse highly desirable

Ontology Engineering
  List keywords that appear in TcruziDB
    These become the ontology concepts
  Find related classes/properties in existing biological ontologies
    GO, SO, NCBI Taxonomy, etc.


Ontology Schema


Data Collection

TcruziDB
  Relational database using the GUS schema
  Mapped to RDF using D2R and a custom-built map
  The annotated data can be queried via a SPARQL endpoint

Enhance with outside data
  Pfam
    Flat files, converted to RDF
  InterPro
    XML, converted to RDF
  Others, such as ortholog groups from OrthoMCL
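
As a rough illustration of the flat-file-to-RDF step, the sketch below turns tab-separated lines into N-Triples statements. The two-column layout (`gene_id<TAB>pfam_id`), the `http://example.org/tcruzikb/` namespace, the `has_pfam_domain` property, and the sample row are all assumptions made for the example; the slides do not describe the real file layout or vocabulary.

```python
BASE = "http://example.org/tcruzikb/"  # hypothetical namespace, not the real one


def line_to_ntriple(line):
    """Convert one 'gene_id<TAB>pfam_id' line into one N-Triples statement."""
    gene_id, pfam_id = line.rstrip("\n").split("\t")
    return "<{0}gene/{1}> <{0}has_pfam_domain> <{0}pfam/{2}> .".format(
        BASE, gene_id, pfam_id
    )


def convert(lines):
    """Skip blank lines and '#' comments, convert everything else."""
    return [
        line_to_ntriple(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Made-up sample input: a comment header followed by one data row.
sample = ["# gene\tpfam\n", "GeneX\tPF00001\n"]
for statement in convert(sample):
    print(statement)
```

Identifiers containing characters that are not legal in an IRI would need escaping before being placed between the angle brackets; the sketch leaves that out.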


Visual Query Building


Visual Query Building

We would like to allow the researcher to ask complex questions

Use SPARQL directly
  TcruziKB supports this

Problem
  You can't expect every biologist to know the language

Solution
  Guided query building [1]
  Natural language querying

[1] Pablo N. Mendes, Bobby McKnight, Amit P. Sheth, Jessica C. Kissinger. "Enabling Complex Queries For Genome Data Exploration." IEEE Second International Conference on Semantic Computing (ICSC) 2008, Santa Clara, California. (To appear)


Query Building

The ontology schema represents all types of information in the system

By allowing the user to select a class from the schema to begin the query, the system can guide them in building a more complex query

As the user types, the system can suggest relevant knowledge from the ontology
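
One way to picture the guidance described above is a lookup over property domains: once a class is chosen, only properties whose domain is that class are offered, filtered by what the user has typed so far. The schema fragment, property names, and function names below are hypothetical and only illustrate the idea; they are not taken from the TcruziKB ontology.

```python
# Hypothetical schema fragment: property -> (domain class, range class).
SCHEMA = {
    "life_cycle_stage": ("Gene", "LifeCycleStage"),
    "has_ortholog_group": ("Gene", "OrthologGroup"),
    "translated_to": ("Gene", "AminoAcidSequence"),
    "has_pfam_domain": ("AminoAcidSequence", "PfamDomain"),
}


def suggest_properties(schema, selected_class, typed=""):
    """Properties whose domain is the selected class and whose name contains the typed text."""
    typed = typed.lower()
    return sorted(
        prop
        for prop, (domain, _range) in schema.items()
        if domain == selected_class and typed in prop.lower()
    )


def next_class(schema, prop):
    """The class the query moves to after following a property (its range)."""
    return schema[prop][1]


print(suggest_properties(SCHEMA, "Gene"))
print(suggest_properties(SCHEMA, "Gene", typed="life"))
print(next_class(SCHEMA, "translated_to"))
```

Following the range of the chosen property gives the class for the next step, which is how a guided builder can keep extending the query one triple at a time.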


Query Building – Stage 1 – Picking a Class


Query Builder – Stage 2 – Picking a Property


Query Builder – Stage 3 – Complete the Triple


Query Builder – Stage 4 – Continue Building Triples


Query Builder – Stage 5 – Finish The Triple


Query Builder – Stage 6


Query Builder – Stage 7 – New Line (AND)


Query Builder – Stage 9


Query Builder Summary

A user can conduct a search on a single class
  Simply selecting “AminoAcidSequence” and pressing search will describe the AminoAcidSequence class
  Selecting “SequenceX” gets all information for the instance SequenceX

The user can build as many triples as needed or can stop after one

Builds SPARQL for the user
  The user also has the option of altering the generated SPARQL


Natural Language Querying

To allow for complex queries, users can enter queries in natural English

Use NLP to find ontology concepts in the user's query and form SPARQL

Which genes are expressed in the Epimastigote stage?

SELECT ?gene WHERE { ?gene :life_cycle_stage :Epimastigote }


NLP – Question Entry

The user enters a question in plain English

Suggestions are presented to the user in a similar fashion to the query builder
  These suggestions are based on ontology words
  The classes, instances, and properties previously entered by the user help determine the priority of the suggestions

What genes are expressed in the
  Metacyclic
  Epimastigote
  Trypomastigote


NLP – Parse Tree and Part of Speech Tagging

The user's question is converted into a parse tree

Stanford Parser
  Constructs parse tree
  Part of speech tagging

What is the life cycle stage of GeneX?

(ROOT
  (SBARQ
    (WHNP (WP What))
    (SQ (VBZ is)
      (NP
        (NP (DT the) (NN life cycle stage))
        (PP (IN of)
          (NP (CD GeneX)))))
    (. ?)))


NLP – Tree Traversal

- 2 pre-order traversals
- 1st pass looks for matches to properties (labels, IDs, and descriptions)
- If a match is found, a triple is formed
- 2nd pass looks for classes and instances (labels, IDs, and descriptions)
- Matches are placed in the triples found in pass 1
- Synonyms are also used during the matching (WordNet, VerbNet)

[Figure: parse tree with nodes root, What, is, the life cycle stage, of, GeneX]


Tree Traversal – Stage 1

1. Root is first. The string literal matches nothing
2. “What” is a stop word so it's ignored
3. “is” is a stop word
4. “the life cycle stage”: “the” is removed because it's a stop word; the rest matches a property, so a triple is formed: empty -> life cycle stage -> empty
5. “of” is ignored
6. “GeneX” doesn't match a property so it's ignored



Tree Traversal – Stage 2

1. Root is first. The string literal matches nothing
2. “What” is a stop word so it's ignored
3. “is” is a stop word
4. “the life cycle stage”: “the” is removed because it's a stop word; the rest matches a property, but now we are looking for classes/instances
5. “of” is ignored
6. “GeneX” matches an instance, so we need to add it to an existing triple. Looking at the domain and range of the “life cycle stage” property, we can tell where it goes
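
The two passes walked through above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the tree is flattened to phrase-level nodes, the stop-word list, the `PROPERTIES` and `INSTANCES` lookup tables, and the class names `Gene` and `LifeCycleStage` are made up for the example, and synonym matching via WordNet or VerbNet is left out.

```python
# Each node is (label, children); leaves hold the phrase text.
TREE = ("root", [
    ("What", []),
    ("is", []),
    ("the life cycle stage", []),
    ("of", []),
    ("GeneX", []),
])

STOP_WORDS = {"what", "is", "the", "of"}
# Hypothetical ontology lookups: property label -> (domain, range); instance label -> class.
PROPERTIES = {"life cycle stage": ("Gene", "LifeCycleStage")}
INSTANCES = {"genex": "Gene"}


def preorder(node):
    """Yield node labels in pre-order (node first, then its children)."""
    label, children = node
    yield label
    for child in children:
        yield from preorder(child)


def strip_stop_words(phrase):
    return " ".join(w for w in phrase.lower().split() if w not in STOP_WORDS)


def to_triples(tree):
    triples = []
    # Pass 1: a phrase that matches a property opens a triple with empty ends.
    for phrase in preorder(tree):
        key = strip_stop_words(phrase)
        if key in PROPERTIES:
            triples.append([None, key, None])
    # Pass 2: an instance fills the subject or object slot, chosen by
    # comparing its class with the property's domain and range.
    for phrase in preorder(tree):
        key = strip_stop_words(phrase)
        if key not in INSTANCES:
            continue
        cls = INSTANCES[key]
        for triple in triples:
            domain, range_ = PROPERTIES[triple[1]]
            if cls == domain and triple[0] is None:
                triple[0] = phrase
                break
            if cls == range_ and triple[2] is None:
                triple[2] = phrase
                break
    return [tuple(t) for t in triples]


print(to_triples(TREE))
```

The slot left as `None` is the unknown the question asks about; it is the part that later becomes a query variable.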



NLP – To SPARQL

After the tree traversals are finished, the triples are converted to SPARQL

Any missing entities in the triples are populated with variables
  ?gene, ?stage

rdfs:label values are added to the SPARQL to make the result set more human readable
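
A sketch of this conversion step, assuming triples arrive as `(subject, property, object)` tuples with `None` marking a missing entity. The `http://example.org/tcruzikb#` prefix and the generic variable names (`?v0`, `?v1`, ...) are assumptions; the slide's own examples use descriptive names such as `?gene` and `?stage`.

```python
def to_sparql(triples, prefix="http://example.org/tcruzikb#"):
    """Build a SELECT query; None entries become fresh variables with labels."""
    variables, patterns = [], []

    def term(value):
        # A missing entity is replaced by a new variable.
        if value is None:
            var = "?v{}".format(len(variables))
            variables.append(var)
            return var
        return ":" + value

    for s, p, o in triples:
        patterns.append("{} :{} {} .".format(term(s), p, term(o)))

    # Ask for an optional human-readable label for every variable.
    for var in list(variables):
        label = var + "_label"
        patterns.append("OPTIONAL {{ {} rdfs:label {} }}".format(var, label))
        variables.append(label)

    lines = [
        "PREFIX : <{}>".format(prefix),
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT {} WHERE {{".format(" ".join(variables)),
    ]
    lines += ["  " + pattern for pattern in patterns]
    lines.append("}")
    return "\n".join(lines)


print(to_sparql([("GeneX", "life_cycle_stage", None)]))
```

Wrapping the label pattern in `OPTIONAL` keeps results that have no label from being dropped; whether TcruziKB does the same is not stated on the slide.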


Data Exploration


Data Exploration

Most systems only offer a single method of results visualization
  Little support is provided for analytical tasks that prioritize summarization and finding relationships between entities

TcruziKB uses a variety of results exploration tools
  Tabular
  Graph
  Statistical
  Publications


Tabular Explorer

TcruziKB provides support for the familiar and popular tabular results view

Rico Live Grid provides enhanced features
  Search within results
  Sorting


Graph Explorer

Ontologies define relationships between data, so the data lends itself naturally to a directed graph representation

The query results can be displayed on a graph with classes/instances corresponding to nodes and properties corresponding to edges in the graph

This graph could give a biologist additional insight on the data by looking for clusters or paths between classes
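
A minimal sketch of the mapping described above: result triples become a directed graph with instances as nodes and properties as labeled edges, and a breadth-first search looks for a path between two nodes. The sample triples and the function names are hypothetical.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Adjacency list: node -> list of (property, target node)."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns [node, property, node, ...] or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for prop, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(path + [prop, target])
    return None


# Made-up query results.
results = [
    ("GeneX", "translated_to", "ProteinX"),
    ("ProteinX", "life_cycle_stage", "Epimastigote"),
]
graph = build_graph(results)
print(find_path(graph, "GeneX", "Epimastigote"))
```

Edges are followed only in their stated direction here; a viewer that also wants to show incoming relationships would add the reverse edge for each triple.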


Graph Explorer – Screen Shot


Graph Expansion

By right clicking on a node, the results can be extended by adding additional classes and properties

This could reveal more relationships between the results


Graph Expansion - Example

Original Query Results

User selects to expand graph based on organism property

Expanded Graph


Feature Selection

A common problem with graph based results is that they can become too complex to navigate through

TcruziKB has the option to run feature selection on the graph to hide nodes and properties that are not statistically important

Edge importance is calculated during a preprocessing step using entropy and gain formulas from information theory
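
The slide does not give the exact formulas, so the sketch below uses the standard Shannon entropy, H(S) = -sum(p_i * log2(p_i)), and information gain, H(S) minus the weighted entropy of the subsets produced by splitting on one feature. The row layout, the `in_result` target column, and the sample rows are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a non-empty list of discrete values."""
    total = len(values)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(values).values()
    )


def information_gain(rows, feature, target):
    """Entropy of the target minus its weighted entropy after splitting on feature."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Made-up rows: 'stage' separates the target perfectly, 'organism' not at all.
rows = [
    {"stage": "Amastigote", "organism": "T. cruzi", "in_result": "yes"},
    {"stage": "Amastigote", "organism": "T. cruzi", "in_result": "yes"},
    {"stage": "Epimastigote", "organism": "T. cruzi", "in_result": "no"},
    {"stage": "Epimastigote", "organism": "T. cruzi", "in_result": "no"},
]
print(information_gain(rows, "stage", "in_result"))
print(information_gain(rows, "organism", "in_result"))
```

A property with gain near zero tells the viewer nothing about the result set, which makes it a candidate for hiding; the threshold used to decide that is not stated in the slides.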


Feature Selection - Example


Statistical Explorer

Allows for an overview of a result set

For each variable in the query, the system offers a chart per property

For each class-property pair, the chart shows the proportion of instances that assume each possible value

Shows how the instances in the result set compare to the overall distribution
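
The proportions behind such a chart can be computed with a simple count. The sketch below assumes each instance is a dictionary of property values; the data and the function names are invented for illustration.

```python
from collections import Counter


def distribution(instances, prop):
    """Share of instances taking each value of prop (instances lacking it are skipped)."""
    counts = Counter(inst[prop] for inst in instances if prop in inst)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def compare(result_set, everything, prop):
    """Map each value to (share in the result set, share overall)."""
    in_results = distribution(result_set, prop)
    overall = distribution(everything, prop)
    return {value: (in_results.get(value, 0.0), overall[value]) for value in overall}


# Made-up data: four proteins overall, two of them in the query result.
everything = [
    {"id": "P1", "life_cycle_stage": "Amastigote"},
    {"id": "P2", "life_cycle_stage": "Epimastigote"},
    {"id": "P3", "life_cycle_stage": "Epimastigote"},
    {"id": "P4", "life_cycle_stage": "Epimastigote"},
]
result_set = everything[:2]
print(compare(result_set, everything, "life_cycle_stage"))
```

A value whose share in the result set is well above its overall share is the kind of over-representation the side-by-side charts are meant to surface.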


Statistical Explorer - Example

For a query for all protein expression results, the system would present one pie chart for each property of the class Protein
  life cycle stage, ortholog group, etc.

From the graph you can see the distribution of the values of the different properties
  23% have value “Amastigote” for the property “life_cycle_stage”

This distribution can be compared to the distribution of the result set


Statistical Explorer – Screen Shot


Publication Explorer

In the field of Genomics, a researcher would commonly execute queries, visualize results and then look for publications that would confirm or complete her knowledge about the results she obtained for a given query

Time consuming process

TcruziKB integrates with PubMed to automatically retrieve documents related to the query


Publication Explorer - Continued

Improved PubMed search by using ontology knowledge

The top features are used to weight the results of the simple keyword based query

Other words are added from the neighborhood of the instances
  labels, parent class

Document score is computed by multiplying the frequency of the term in the paper by the weight calculated by feature selection and ontology distance
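
A rough sketch of the scoring rule just described. It assumes the per-term weights have already been computed (how feature selection and ontology distance are combined into one weight is not given on the slide), that terms are single words, and that per-term products are summed into one document score; the summation, the tokenizer, and the sample texts are assumptions.

```python
import re
from collections import Counter


def score(document, weights):
    """Sum over terms of (term frequency in the document) * (term weight)."""
    words = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return sum(words[term.lower()] * weight for term, weight in weights.items())


def rank(documents, weights):
    """Order documents from highest to lowest score."""
    return sorted(documents, key=lambda doc: score(doc, weights), reverse=True)


# Made-up weights and abstracts.
weights = {"epimastigote": 2.0, "kinase": 0.5}
doc_a = "A kinase survey."
doc_b = "Kinase activity in the epimastigote stage. Epimastigote forms were cultured."
print(score(doc_a, weights), score(doc_b, weights))
print(rank([doc_a, doc_b], weights)[0] == doc_b)
```

Multi-word ontology labels would need phrase matching rather than the single-word counting used here.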


Publication Explorer - Example

Suppose a query yielded the results A, B, C

PubMed could be searched with “A^B^C” or “AvBvC”
- Problems?

Neighboring classes (D, E) can be added to the query.

PubMed can be searched using the original terms with the new additions.

The results from PubMed can be ranked according to the frequency of each term and its weight (computed from information gain)


Evaluation


Usability Evaluation

Subjective Evaluation
  System Usability Scale (SUS)

Empirical Metrics
  Time needed to complete queries
  Number of interactions needed to complete queries
  Natural Language Query Accuracy


SUS

System Usability Scale
  Published method of evaluating user interfaces

Panel of 30 university members
  Performed the same set of queries on TcruziDB and TcruziKB
  Recorded their experience on SUS evaluation forms
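
For reference, a SUS form is conventionally scored as sketched below: ten items answered on a 1 to 5 scale, odd-numbered items contributing (answer - 1), even-numbered items contributing (5 - answer), and the sum multiplied by 2.5 to give a 0 to 100 score. This is the standard scoring rule published with the scale; whether the forms in this evaluation were scored exactly this way is not stated on the slide, and the function name is an assumption.

```python
def sus_score(responses):
    """Score one SUS form: ten answers on a 1-5 scale, in questionnaire order."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected ten answers, each between 1 and 5")
    total = 0
    for index, answer in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded.
            total += answer - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded.
            total += 5 - answer
    return total * 2.5


print(sus_score([5, 1] * 5))  # best possible answers on every item
print(sus_score([3] * 10))    # neutral answers throughout
```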


SUS - Results


Empirical Evaluation

The time and number of computer interactions needed to execute a set of queries were also recorded
  The number of interactions is simply the number of keystrokes and mouse clicks

TcruziKB Interactions (Avg): 21.33
TcruziKB Time (Avg): 117.33 seconds
TcruziDB Interactions (Avg): 53.33
TcruziDB Time (Avg): 311.33 seconds


Natural Language Evaluation

Panel members were asked to write 3 questions (in their own words) based on the gene finding section of the TcruziDB homepage

Users would look to see what type of query is possible then write it in English

These questions are used to test the Natural Language Query interface


Natural Language Evaluation - Results

50 total questions used
  After removing duplicates
  Varying complexity

The questions were entered into the system to see if the correct SPARQL was generated

Recall: 90%
Precision: 83%


Related Work


Comparison to Existing Work

Ontology Based Query Building Systems
  GRQL, SEWASIE
  Show a visualized ontology that the user can select classes and properties from
  Large ontologies present a problem
  Do not support multiple query and result exploration mechanisms


Comparison to Existing Work - Continued

iSPARQL, SDS
  Allow the user to build a graph by drawing nodes and edges
  Very different from traditional search systems
  Rely solely on graphical query construction


Comparison to Existing Work - Continued

GINSENG
  Natural language query system
  No real NLP, just query building with a dictionary of “rule” words
  No support for synonyms, exact match required

ONLI
  Another natural language query system
  Again, does not support synonyms
  Uses an underlying query language that is non-standard


Future Work and Conclusion


Future Work

Extend query builder for SPARQLER support
  Allow for more complex path-based queries

AI-assisted natural language query
  Cypher

Template-based natural language query

Combine semantic querying with web search
  If a query cannot be answered with the knowledge base alone, use information retrieval methods to query the web
  Complete missing triples in the knowledge base


Conclusion

Semantics allow for a variety of improvements over relational database systems
  Standardization, interoperability, inferencing

Query building is a way to allow users to ask difficult questions easily
  TcruziKB vs TcruziDB
  Similar for natural language querying

Ontologies can be used to express result sets in more meaningful ways