Semantic Interpretation of User Queries for Question Answering on Interlinked Data
Saeedeh Shekarpour
Supervisor: Prof. Dr. Sören Auer
EIS research group - Bonn University
7 January 2015


Search engines can answer queries that match certain templates

Search engines still lack the ability to answer more complex queries

Evolution of the Web

• Web of Documents
• Semantic Web / Web of Data

RDF model

RDF is a standard for describing Web resources.

The RDF data model expresses statements about Web resources in the form of subject-predicate-object triples.

The statement “Jack knows Alice” is represented as:

Jack --knows--> Alice
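
As a minimal sketch of the triple idea, the statement can be held as a plain Python tuple. The `Triple` type, the `objects_of` helper and the `ex:` names are illustrative assumptions, not part of the slides.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# "Jack knows Alice" as a single triple; the ex: identifiers are hypothetical.
statement = Triple("ex:Jack", "ex:knows", "ex:Alice")


def objects_of(triples, subject, predicate):
    """Return every object linked to `subject` via `predicate`."""
    return [t.object for t in triples
            if t.subject == subject and t.predicate == predicate]


graph = [statement, Triple("ex:Alice", "ex:knows", "ex:Bob")]
print(objects_of(graph, "ex:Jack", "ex:knows"))
```

A set of such triples forms a graph, which is what the later slides query.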

The growth of Linked Open Data

May 2007: 12 datasets
August 2014: 570 datasets, more than 74 billion triples

How to retrieve data from Linked Data?

Linked Data characteristics:
• Wide range of topical domains
• Variety in vocabularies
• Interlinked data

SPARQL queries:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics

Text queries (either keyword or natural language):
• Simple retrieval approach
• Implicit and ambiguous semantics
• Popular

Comparison of search approaches

Dimensions: data-semantic unaware vs. data-semantic aware; keyword-based query vs. natural language query.

Approaches compared: Information Retrieval Systems, Question Answering Systems, and our approach: SINA.

Objective: transformation from textual query to formal query

Which television shows were created by Walt Disney?

SELECT * WHERE {
  ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney .
}

Test bed datasets

One single dataset: DBpedia.

Three interlinked datasets from life-science:

1. Drugbank: contains information about drugs, drug target (i.e. protein) information, interactions and enzymes.
2. Diseasome: contains information about diseases and genes associated with these diseases.
3. Sider: contains information about drugs and their side effects.

The addressed challenges

• Query Segmentation
• Resource Disambiguation
• Query Expansion
• Formal Query Construction
• Data Fusion on Linked Data

Query Segmentation

Definition: query segmentation is the process of identifying the right segments of data items that occur in the keyword queries.

Input query: What are the side effects of drugs used for Tuberculosis?

Sequence of keywords: (side, effect, drug, Tuberculosis)

Two segmentations:
• side effect | drug | Tuberculosis
• side effect drug | Tuberculosis
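
To make the search space concrete, the sketch below enumerates every contiguous segmentation of the keyword sequence. It only lists candidates; choosing the right one is the job of the model described on the following slides. The function name `segmentations` is an assumption for illustration.

```python
from itertools import product


def segmentations(keywords):
    """Yield every way to split the keyword sequence into contiguous segments."""
    n = len(keywords)
    if n == 0:
        return
    # Each of the n-1 gaps between keywords is either a boundary or not.
    for cuts in product([False, True], repeat=n - 1):
        segments, current = [], [keywords[0]]
        for word, cut in zip(keywords[1:], cuts):
            if cut:
                segments.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segments.append(" ".join(current))
        yield segments


keywords = ["side", "effect", "drug", "Tuberculosis"]
all_segs = list(segmentations(keywords))
for seg in all_segs:
    print(" | ".join(seg))
```

For n keywords there are 2^(n-1) candidate segmentations, so four keywords give eight candidates, including the two shown on the slide.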

Resource Disambiguation

Definition: resource disambiguation is the process of recognizing the suitable resources in the underlying knowledge base.

Input query:
• What are the side effects of drugs used for Tuberculosis?

Ambiguous resources:
• diseasome:Tuberculosis
• sider:Tuberculosis

Input query:
• Who produced films starring Natalie Portman?

Ambiguous resources:
• dbpedia/ontology/film
• dbpedia/property/film

Concurrent approach

• Query Segmentation
• Resource Disambiguation

Modeling using hidden Markov model

[Figure: hidden Markov model with a Start state, hidden states 1 to 9 plus an Unknown Entity state, and the observed sequence Keyword 1, Keyword 2, Keyword 3, Keyword 4.]

Bootstrapping the model parameters

1. Emission probability is defined based on the similarity of the label of each state with a segment. This similarity is computed using string similarity and Jaccard similarity.

2. Semantic relatedness is the basis for the transition probability and the initial probability. Intuitively, it is based on two values: distance and connectivity degree. We transform these two values into hub and authority values using the weighted HITS algorithm.

3. The HITS algorithm is a link analysis algorithm that was originally developed for ranking Web pages. It assigns a hub and an authority value to each Web page.

4. Initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
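
The two ingredients named above can be sketched as follows: a word-level Jaccard similarity between a state label and a segment, and a weighted HITS iteration. The slide does not say how edge weights are derived from distance and connectivity degree, so the weights in the example are placeholders, and the function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def weighted_hits(edges, iterations=50):
    """edges: dict mapping (source, target) -> positive weight.
    Returns (hub, authority) dicts, each normalised to sum to 1."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: weighted sum of the hub scores of nodes pointing in.
        auth = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            auth[dst] += w * hub[src]
        total = sum(auth.values()) or 1.0
        auth = {n: v / total for n, v in auth.items()}
        # Hub: weighted sum of the authority scores of nodes pointed to.
        hub = {n: 0.0 for n in nodes}
        for (src, dst), w in edges.items():
            hub[src] += w * auth[dst]
        total = sum(hub.values()) or 1.0
        hub = {n: v / total for n, v in hub.items()}
    return hub, auth


# Placeholder graph: node names and weights are made up for illustration.
edges = {("a", "c"): 1.0, ("b", "c"): 1.0, ("a", "d"): 0.5}
hub, auth = weighted_hits(edges)
print(jaccard("side effect", "side effect of drug"))
print(hub, auth)
```

Normalising after every step keeps the scores bounded, which makes them easy to reuse as probability-like values.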

Evaluation of bootstrapping

The accuracy of the bootstrapped transition probability using different distribution functions, i.e., Normal, Zipfian and uniform distributions.

Output of the model after running the Viterbi algorithm

Sequence of keywords: (television show creat Walt Disney)

Paths:
0.0023     dbo:TelevisionShow dbo:creator dbr:Walt_Disney
0.0014     dbo:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.000589   dbr:TelevisionShow dbo:creator dbr:Walt_Disney
0.000353   dbr:TelevisionShow dbo:creator dbr:Category:Walt_Disney
0.0000376  dbp:television dbp:show dbo:creator dbr:Category:Walt_Disney
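
For readers unfamiliar with the Viterbi algorithm, here is a minimal sketch on a toy model. It assumes the query is already split into three segments, whereas the model on the slides segments and disambiguates at the same time. All probabilities are invented and do not reproduce the scores listed above.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, state path) of the most likely hidden sequence."""
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
            for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev].get(s, 0.0), best[prev][1])
                 for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s].get(obs, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Toy parameters (assumptions): states are candidate resources.
states = ["dbo:TelevisionShow", "dbo:creator",
          "dbr:Walt_Disney", "dbr:Category:Walt_Disney"]
start_p = {"dbo:TelevisionShow": 0.4, "dbo:creator": 0.2,
           "dbr:Walt_Disney": 0.2, "dbr:Category:Walt_Disney": 0.2}
trans_p = {
    "dbo:TelevisionShow": {"dbo:creator": 0.7},
    "dbo:creator": {"dbr:Walt_Disney": 0.6, "dbr:Category:Walt_Disney": 0.3},
    "dbr:Walt_Disney": {},
    "dbr:Category:Walt_Disney": {},
}
emit_p = {
    "dbo:TelevisionShow": {"television show": 0.9},
    "dbo:creator": {"creat": 0.8},
    "dbr:Walt_Disney": {"Walt Disney": 0.9},
    "dbr:Category:Walt_Disney": {"Walt Disney": 0.6},
}

prob, path = viterbi(["television show", "creat", "Walt Disney"],
                     states, start_p, trans_p, emit_p)
print(prob, path)
```

Keeping the best few paths per state instead of only the single best one would yield a ranked list like the one on the slide.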

Query Expansion

Definition: query expansion is a way of reformulating the input query in order to overcome the vocabulary mismatch problem.

Input query:
• Wife of Barack Obama

Reformulated query:
• Spouse of Barack Obama

Query Expansion

• Analysis of linguistic features vs. semantic features
• A method for automatic query expansion

Linguistic features

WordNet is a popular data source for expansion.

Linguistic features extracted from WordNet are:

1. Synonyms: words having a similar meaning to the input keyword.
2. Hyponyms: words representing a specialization of the input keyword.
3. Hypernyms: words representing a generalization of the input keyword.

Semantic features from Linked Data

1. SameAs: deriving resources using owl:sameAs.
2. SeeAlso: deriving resources using rdfs:seeAlso.
3. Equivalence class/property: deriving classes or properties using owl:equivalentClass and owl:equivalentProperty.
4. Super class/property: deriving all super classes/properties by following the rdfs:subClassOf or rdfs:subPropertyOf property.
5. Sub class/property: deriving resources by following the rdfs:subClassOf or rdfs:subPropertyOf property paths ending with the input resource.
6. Broader concepts: deriving concepts using the SKOS vocabulary properties skos:broader and skos:broadMatch.
7. Narrower concepts: deriving concepts using skos:narrower and skos:narrowMatch.
8. Related concepts: deriving concepts using skos:closeMatch, skos:mappingRelation and skos:exactMatch.
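
Features 4 and 5 amount to walking a property transitively in one direction or the other. The sketch below does this over an in-memory list of triples. The `derive` helper and the `ex:` class names are hypothetical; a real system would query a triple store instead.

```python
def derive(triples, resource, predicate, inverse=False):
    """Follow `predicate` transitively from `resource` over (s, p, o) triples.
    inverse=False walks subject -> object (e.g. super classes);
    inverse=True walks object -> subject (e.g. sub classes)."""
    found, frontier = set(), [resource]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != predicate:
                continue
            if not inverse and s == current:
                nxt = o
            elif inverse and o == current:
                nxt = s
            else:
                continue
            if nxt not in found and nxt != resource:
                found.add(nxt)
                frontier.append(nxt)
    return found


# Hypothetical class hierarchy used only for illustration.
triples = [
    ("ex:TelevisionShow", "rdfs:subClassOf", "ex:Work"),
    ("ex:Film", "rdfs:subClassOf", "ex:Work"),
    ("ex:Work", "rdfs:subClassOf", "ex:Thing"),
]
supers = derive(triples, "ex:TelevisionShow", "rdfs:subClassOf")
subs = derive(triples, "ex:Work", "rdfs:subClassOf", inverse=True)
print(supers, subs)
```

The same helper covers rdfs:subPropertyOf and the SKOS broader/narrower properties by changing the `predicate` argument.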

Exemplary expansion graph of the word movie

[Figure: expansion graph rooted at "movie" with the derived words home movie, production, film, motion picture show, video and telefilm.]

Objective of experiment

How effective are linguistic as well as semantic features?

How well does a linear weighted combination of features perform?

Benchmark creation

We created a benchmark extracted from QALD1 and QALD2.

The benchmark contains all keywords having a vocabulary mismatch problem and their corresponding matches.

Accuracy results of prediction function based on linguistic as well as semantic features

Features    Weighting Mechanism              Precision  Recall  F-score
Linguistic  SVM                              0.730      0.650   0.620
Semantic    SVM                              0.680      0.630   0.600
Linguistic  Decision Tree/Information Gain   0.588      0.579   0.568
Semantic    Decision Tree/Information Gain   0.755      0.684   0.661

Statistics over the number of the derived words and matches

Feature               #derived words  #matches
synonym               503             23
hyponym               2703            10
hypernym              657             14
sameAs                2332            12
seeAlso               49              2
equivalence           2               0
super class/property  267             4
sub class/property    2166            4

Automatic query expansion

[Figure: pipeline from the input query and an external data source, through data extraction and preparation and a heuristic method, to the reformulated query.]

Expansion set for each segment

The expansion set contains:
• Original segment
• Lemmatized segment
• Derived words from WordNet: synonyms, hyponyms, hypernyms
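
A rough sketch of how such an expansion set could be assembled is shown below. The hand-written `TOY_LEXICON` stands in for WordNet lookups and the `lemmatize` function is a deliberately naive placeholder; both are assumptions made so the example runs without external libraries.

```python
# Hypothetical stand-in for WordNet; a real system would query WordNet itself.
TOY_LEXICON = {
    "wife": {
        "synonyms": ["married woman"],
        "hyponyms": ["first lady"],
        "hypernyms": ["spouse", "woman"],
    },
}


def lemmatize(segment):
    """Naive lemmatizer (assumption): lowercase and strip a plural 's'."""
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w
             for w in segment.lower().split()]
    return " ".join(words)


def expansion_set(segment, lexicon=TOY_LEXICON):
    """Original segment, its lemma, then synonyms, hyponyms and hypernyms."""
    lemma = lemmatize(segment)
    expanded = [segment, lemma]
    entry = lexicon.get(lemma, {})
    for feature in ("synonyms", "hyponyms", "hypernyms"):
        expanded.extend(entry.get(feature, []))
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(expanded))


print(expansion_set("wife"))
print(expansion_set("Movies"))
```

Segments missing from the lexicon still yield the original and lemmatized forms, so every segment has a non-empty expansion set.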

Reformulating query using hidden Markov model

Input query: wife of Barack Obama

[Figure: hidden Markov model with a Start state and candidate states for the query segments, including Barack, Obama, Barack Obama, wife, Obama wife and Barack Obama wife, together with the derived words spouse, first lady and woman.]

Triple-based co-occurrence

In a given triple t = (s, p, o), two words w1 and w2 are co-occurring if they appear in the labels (rdfs:label) of at least two resources.
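
One possible reading of this definition is that w1 occurs in the label of one resource of the triple and w2 in the label of a different one. The sketch below implements that reading; the interpretation, the function names and the example labels are assumptions.

```python
from itertools import permutations


def co_occur(triple_labels, w1, w2):
    """triple_labels: the rdfs:label strings of the (s, p, o) of one triple.
    True if w1 and w2 appear in the labels of two different resources."""
    w1, w2 = w1.lower(), w2.lower()
    token_sets = [set(label.lower().split()) for label in triple_labels]
    # permutations(..., 2) pairs up labels from two distinct positions.
    return any(w1 in a and w2 in b for a, b in permutations(token_sets, 2))


def co_occurrence_count(all_triple_labels, w1, w2):
    """Number of triples in which w1 and w2 co-occur."""
    return sum(co_occur(labels, w1, w2) for labels in all_triple_labels)


# Made-up labels for illustration.
labelled_triples = [
    ("Barack Obama", "spouse", "Michelle Obama"),
    ("Tom Cruise", "starring", "Top Gun"),
]
print(co_occurrence_count(labelled_triples, "Obama", "spouse"))
```

Counting over many triples gives a frequency that can act as a relatedness signal between a keyword and a candidate expansion word.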

Goals of evaluation

How effective is our method with regard to a correct reformulation of queries which have a vocabulary mismatch problem?

How robust is the method for queries which do not have a vocabulary mismatch problem?

Sample of our benchmark:

Query                   Mismatch word  Match word  #derived words
Movies with Tom Cruise  movie          film        77
Altitude of Everest     altitude       elevation   16
Soccer clubs in Spain   -              -           19
Employees of Google     -              -           10

Rank (R) and Cumulative rank (CR) for the test queries

Formal Query Construction

Definition: once the resources are detected, a connected subgraph of the knowledge base graph, called the query graph, has to be determined which fully covers the set of mapped resources.

Disambiguated resources:
• sider:sideEffect
• diseasome:possibleDrug
• diseasome:1154

SPARQL query:

SELECT ?v3 WHERE {
  diseasome:1154 diseasome:possibleDrug ?v1 .
  ?v1 owl:sameAs ?v2 .
  ?v2 sider:sideEffect ?v3 .
}
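
The two requirements in the definition, connectedness and rendering as a formal query, can be sketched as follows. The helper names are assumptions, the instance identifier diseasome:1154 is taken from the list of disambiguated resources, and prefix declarations are omitted for brevity.

```python
def is_connected(patterns):
    """Check that the subjects and objects of the patterns form one connected graph."""
    if not patterns:
        return False
    adjacency = {}
    for s, _, o in patterns:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == set(adjacency)


def build_select(patterns, target):
    """Render triple patterns as a SPARQL SELECT query string."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT {target} WHERE {{\n{body}\n}}"


patterns = [
    ("diseasome:1154", "diseasome:possibleDrug", "?v1"),
    ("?v1", "owl:sameAs", "?v2"),
    ("?v2", "sider:sideEffect", "?v3"),
]
query = build_select(patterns, "?v3")
print(is_connected(patterns))
print(query)
```

Without the owl:sameAs pattern the two remaining patterns share no node, which is why the interlinking triple is needed to obtain a single query graph.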

Data Fusion on Linked Data

The answer to a question may be spread among different datasets employing heterogeneous schemas.

Constructing a federated query needs to exploit links between the different datasets on the schema and instance levels.

Two different approaches

Federated Query Construction:
• Template-based query construction
• Forward chaining based query construction

Forward chaining based query construction

Query: What are the side effects of drugs used for Tuberculosis?

1. Set of resources
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)
Graph 1: 1154 --possibleDrug--> ?v0
Graph 2: ?v1 --sideEffect--> ?v2

2. Incomplete query graph
Template 1: 1154 --possibleDrug--> ?v0
Template 2: ?v1 --sideEffect--> ?v2

3. Query graph
1154 --possibleDrug--> ?v0
?v1 --sideEffect--> ?v2

Evaluation

Goals of the experiment:
1. Performance of the disambiguation method using Mean Reciprocal Rank (MRR).
2. Performance of the forward chaining query construction method using precision and recall.

Benchmarks:
1. 25 queries on the 3 interlinked datasets from life-science.
2. QALD1 and QALD3 benchmarks over DBpedia.
3. QALD2 was used for bootstrapping.
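
For reference, the three measures named above can be computed as in the sketch below, using their standard definitions. The input values in the example are made up and are not results from the benchmarks.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct answer per query, or None if absent."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r) / len(ranks)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved answer set against the gold answers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up inputs for illustration.
mrr = mean_reciprocal_rank([1, 2, None, 1])
precision, recall = precision_recall(["a", "b", "c"], ["a", "c", "d", "e"])
print(mrr, precision, recall)
```

A query whose correct resource never appears in the ranking contributes zero to the MRR, so missing answers lower the score.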

Runtime

Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction

SINA architecture

[Figure: client-server architecture. The client sends keywords and receives the query result. Server components: Query Preprocessing, Segment Validation, Query Expansion, Resource Retrieval, Disambiguation, Query Construction and Representation, running over the underlying interlinked knowledge bases. Data passed between components: keywords, valid segments, reformulated query, mapped resources, tuple of resources, SPARQL queries. Libraries shown: OWL API, http client, Stanford CoreNLP.]

Demo

Conclusion

We researched and addressed a number of challenges.

The results of the evaluation confirm the feasibility and high accuracy, for instance:

1. Query segmentation and resource disambiguation achieved an MRR from 86% to 96%.
2. Query construction achieved a precision of 32% on the DBpedia QALD3 benchmark and 95% on the life-science datasets.

We learnt that:

1. In Linked Data, structure as well as topology of data can be leveraged for any inference and heuristic.
2. Using structure as well as topology without any deep text analysis, Linked Data can enhance the power of question answering.

Future work

This was the first step in a long agenda.

We plan to:

1. use supervised learning to enhance the parameters of the model.
2. extend our benchmark to make further evaluations.
3. employ a larger number of interlinked datasets to figure out the challenges of scalability.
4. extend different aspects of each approach.

We are going to target new challenges, e.g. query cleaning.

We will continue this work with students joining us for the Marie Curie ITN network.


Questions?