+ Question Answering on Interlinked Data
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5 2013, IBM Research Center
+ Motivation: Retrieving information from LOD
AKSW group - Question Answering on Interlinked Data (published in www2013)
+ Motivation
Text queries (either keyword or natural language) are:
• Simple retrieval approach
• Popular
• Implicit and ambiguous semantics.
SPARQL queries require:
• Knowledge about the ontology
• Proficiency in formulating formal queries
• Explicit and unambiguous semantics.
+ Comparison of Search Approaches
[Figure: search approaches arranged by query type (keyword-based query vs. natural language query) and by data-semantic awareness (unaware vs. aware). Labels shown: Information Retrieval, Question Answering Systems, and our approach, SINA.]
+ Example
• Which television shows were created by Walt Disney?
select * where { ?v0 a dbo:TelevisionShow . ?v0 dbo:creator dbr:Walt_Disney . }
+ Aim and Challenges
Aim: Question answering over a set of interlinked data sources.
• Query segmentation.
• Resource disambiguation.
• Construction of a formal query (expressed in SPARQL).
+ Further Challenges over Interlinked Data
1. Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.
2. Constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels.
+ SINA Architecture
+ Test bed datasets
• One single dataset: DBpedia.
• Three interlinked datasets from life science:
  • Drugbank: a comprehensive knowledge base containing information about drugs, drug target (i.e., protein) information, interactions, and enzymes.
  • Diseasome: contains information about diseases and genes associated with these diseases.
  • Sider: contains information about drugs and their side effects.
+ Main characteristics of federated queries
1. Queries requiring fused information, e.g. side effects of drugs used for Tuberculosis.
2. Queries targeting combined information, e.g. side effects and enzymes of drugs used for asthma.
3. Queries requiring keyword expansion, e.g. side effects of Valdecoxib.
[Figure: query graph for the asthma example spanning Diseasome, Sider, and DrugBank. Vertices: Asthma, ?v0, ?v1, ?v2, ?v3. Classes: Disease, Drug, Side Effect, Enzymes. Edges: side effect, enzyme, sameAs.]
+ Challenge 1: Query Segmentation and Resource Disambiguation
• Sample question: What are the side effects of drugs used for Tuberculosis?
• Transformed to a 4-tuple: (side # effect # drug # Tuberculosis)
• Different segmentations are possible:
  1. (side effect # drug # Tuberculosis)
  2. (side effect drug # Tuberculosis)
• Resource disambiguation: mapping each valid segment to the resources in the underlying knowledge bases.
Segment validation
• Original tuple: (side # effect # drug # Tuberculosis).
• A naive approach is used to find all valid segments.

Valid segment    Samples of candidate resources
side effect      1. sider:class:sideeffect  2. sider:property:side_effects
drug             1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
tuberculosis     1. diseases:1154  2. side_effects:C0041296
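
The naive validation step can be sketched as follows. This is only an illustration: the label set KNOWN_LABELS and the function name valid_segments are assumptions standing in for a lookup against the real knowledge bases.

```python
# Hypothetical label index; a real system would query the knowledge bases.
KNOWN_LABELS = {"side effect", "drug", "tuberculosis"}


def valid_segments(keywords, known_labels=KNOWN_LABELS):
    """Return every contiguous keyword run whose text matches a known label."""
    found = []
    n = len(keywords)
    for start in range(n):
        for end in range(start + 1, n + 1):
            segment = " ".join(keywords[start:end]).lower()
            if segment in known_labels:
                found.append(segment)
    return found


segments = valid_segments(["side", "effect", "drug", "Tuberculosis"])
```

Enumerating all contiguous runs is quadratic in the number of keywords, which is acceptable for short queries.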
+ Concurrent Segmentation and Disambiguation
Hidden Markov Model
• A statistical model containing a set of states.
• Moving from one state to another generates a sequence of observations.
• The probability of entering a state depends only on the previous state.
• The output is the most likely sequence of states generating the observed sequence.
State Space
• A state represents a knowledge base resource.
• The full state space contains all resources in the knowledge base.
• In practice, we prune the state space by excluding irrelevant states.
• We add an unknown entity state comprising all resources that are not available (anymore) in the pruned state space.
• Extension of the state space with reasoning: the state space is extended with resources inferred from lightweight owl:sameAs reasoning.
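
One way to picture the owl:sameAs extension is as a reachability closure over sameAs links. The sketch below assumes the links are available as plain pairs; the function name and all resource identifiers are hypothetical.

```python
def extend_with_same_as(states, same_as_links):
    """Add every resource reachable from a state via owl:sameAs links."""
    neighbours = {}
    for a, b in same_as_links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)  # owl:sameAs is symmetric
    extended, frontier = set(states), list(states)
    while frontier:
        for other in neighbours.get(frontier.pop(), ()):
            if other not in extended:
                extended.add(other)
                frontier.append(other)
    return extended


# Hypothetical identifiers, for illustration only.
links = [
    ("drugbank:drugA", "sider:drugA"),
    ("sider:drugA", "dbpedia:DrugA"),
    ("drugbank:drugB", "sider:drugB"),
]
extended_states = extend_with_same_as({"drugbank:drugA"}, links)
```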
Bootstrapping the Model Parameters: Emission Probability
• The set-similarity level measures the difference between the label and the segment in terms of the number of words using the Jaccard similarity.
• The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label using the Levenshtein distance.
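
The two similarity levels can be sketched as below. The slides do not say how the two scores are combined or normalised, so the functions are kept separate, and dividing the edit distance by the longer word length is an assumption made here for illustration.

```python
def jaccard(segment, label):
    """Set-similarity level: word overlap between segment and label."""
    a, b = set(segment.lower().split()), set(label.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(segment, label):
    """String-similarity level: each segment word vs. its closest label word.

    Assumes both strings contain at least one word. The normalisation by
    the longer word length is an assumption, not taken from the slides.
    """
    label_words = label.lower().split()
    scores = []
    for word in segment.lower().split():
        best = max(
            1 - levenshtein(word, lw) / max(len(word), len(lw))
            for lw in label_words
        )
        scores.append(best)
    return sum(scores) / len(scores)
```

For the segment "side effect" and the label "side effects", the word sets share one of three distinct words, while the string-level score stays high because "effect" and "effects" differ by a single character.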
Bootstrapping the Model Parameters: Transition Probability & Initial Probability
• The transition probability and initial probability are computed based on the semantic relatedness of two resources.
• Semantic relatedness is based on two values: distance and connectivity degree.
• We transform these two values to hub and authority values using the HITS algorithm.
• Initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
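
For readers unfamiliar with HITS, a plain power-iteration version is sketched below. It runs on an unweighted directed graph with hypothetical node names; how distance and connectivity degree feed into the graph in SINA is not specified on the slide, so that mapping is deliberately left out.

```python
def hits(edges, iterations=50):
    """Plain HITS power iteration on a directed graph.

    edges: iterable of (source, target) pairs.
    Returns (hub, authority) dicts, each L2-normalised.
    """
    edges = list(edges)
    nodes = sorted({n for e in edges for n in e})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing to it.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes it points to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical toy graph, for illustration only.
hub, auth = hits([("drug", "sideEffect"), ("drug", "enzyme"), ("disease", "drug")])
```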
Evaluation of Bootstrapping
• We compared the accuracy of different distribution functions, i.e., normal, Zipfian, and uniform distributions, for the transition probability.
• We ran the distribution functions with two different inputs, i.e., distance and connectivity degree values as well as hub and authority values.
+ Viterbi Algorithm
Aim: The most likely path generating the sequence of input keywords.
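
A textbook Viterbi implementation is sketched below on a toy model. All probabilities are invented for illustration, each observation is treated as an already segmented keyword group, and only the single best path is returned, whereas the HMM output in this talk is a ranked list of paths.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (probability, path) of the best path ending in state s.
    best = {
        s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
        for s in states
    }
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * trans_p[p].get(s, 0.0), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s].get(obs, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Toy model with made-up probabilities.
TV, CR, WD = "dbo:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"
states = [TV, CR, WD]
start_p = {TV: 0.6, CR: 0.2, WD: 0.2}
trans_p = {
    TV: {TV: 0.1, CR: 0.7, WD: 0.2},
    CR: {TV: 0.1, CR: 0.1, WD: 0.8},
    WD: {TV: 0.3, CR: 0.4, WD: 0.3},
}
emit_p = {
    TV: {"television show": 0.9},
    CR: {"created": 0.8},
    WD: {"walt disney": 0.9},
}
prob, path = viterbi(
    ["television show", "created", "walt disney"], states, start_p, trans_p, emit_p
)
```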
+ Output of the HMM for the following query: Which television shows were created by Walt Disney?
Probability    Path of states
0.0023         dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014         dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4        dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4        dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5        dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney
+ Query Construction
Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.
Forward chaining:
1. CT: Comprehensive type.
2. CD: Comprehensive domain.
3. CR: Comprehensive range.
Query Construction Method
Input: a set of resources R = {r1, r2, ..., rn}
Output: a query graph QG = (V, E), which is a directed, connected multi-graph.
Generating the Incomplete Query Graph (IQG): initializing vertices and primary edges.
• A vertex is added to the IQG if r is (1) an instance or (2) a class.
• Properties are added along with zero, one, or two vertices.
Query Construction Method
Example: What are the side effects of drugs used for Tuberculosis?
• diseasome:1154 (type: instance)
• diseasome:possibleDrug (type: property)
• sider:sideEffect (type: property)
Graph 1: 1154 --possibleDrug--> ?v0
Graph 2: ?v1 --sideEffect--> ?v2
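
The two disjoint sub-graphs of this example can be reproduced with a small sketch. It assumes, purely for illustration, that an instance vertex is reused as the subject of a property when the instance type equals the property domain; the type and domain names below (diseasome:diseases, sider:drugs) and the function name build_iqg are hypothetical.

```python
from itertools import count


def build_iqg(instances, properties):
    """Build triple patterns for an incomplete query graph (IQG).

    instances: {resource: type}; properties: {property: domain}.
    Both mappings are hypothetical stand-ins for schema lookups.
    """
    fresh = (f"?v{i}" for i in count())
    triples = []
    for prop, domain in properties.items():
        # Reuse an instance vertex when its type matches the property domain;
        # otherwise the property brings its own subject variable.
        subject = next((r for r, t in instances.items() if t == domain), None)
        if subject is None:
            subject = next(fresh)
        triples.append((subject, prop, next(fresh)))
    return triples


iqg = build_iqg(
    {"diseasome:1154": "diseasome:diseases"},
    {
        "diseasome:possibleDrug": "diseasome:diseases",
        "sider:sideEffect": "sider:drugs",
    },
)
```

The first property attaches to the existing instance and adds one new vertex, while the second adds two, which mirrors the "zero, one, or two vertices" rule.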
Query Construction Method
Connecting sub-graphs of an IQG:
1. Minimum spanning tree: a minimum set of edges (i.e., properties) that spans a set of disjoint graphs.
2. Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.
• Direct properties: ?v0 ?p ?v1.
• Properties via an owl:sameAs link:
  (1) ?v0 owl:sameAs ?x. ?x ?p ?v1.
  (2) ?v0 ?p ?x. ?x owl:sameAs ?v1.
  (3) ?v0 owl:sameAs ?x. ?x ?p ?y. ?y owl:sameAs ?v1.
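
The four connection templates can be enumerated mechanically. The sketch below only generates the candidate triple patterns and serialises them into a SPARQL string; it does not implement the spanning-tree selection, and the helper names connection_templates and to_sparql are assumptions.

```python
def connection_templates(a, b):
    """Candidate triple-pattern groups linking vertex a to vertex b."""
    return [
        [(a, "?p", b)],
        [(a, "owl:sameAs", "?x"), ("?x", "?p", b)],
        [(a, "?p", "?x"), ("?x", "owl:sameAs", b)],
        [(a, "owl:sameAs", "?x"), ("?x", "?p", "?y"), ("?y", "owl:sameAs", b)],
    ]


def to_sparql(triples):
    """Serialise triple patterns into a SELECT query string."""
    body = " ".join(f"{s} {p} {o} ." for s, p, o in triples)
    return f"SELECT * WHERE {{ {body} }}"


# Join the two sub-graphs of the Tuberculosis example with template (1).
query = to_sparql(
    [("diseasome:1154", "diseasome:possibleDrug", "?v0")]
    + connection_templates("?v0", "?v1")[1]
    + [("?v1", "sider:sideEffect", "?v2")]
)
```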
[Figure: Template 1 and Template 2, each connecting the sub-graph over 1154, ?v0, and possibleDrug with the sub-graph over ?v1, ?v2, and sideEffect.]
Evaluation
Goal of the experiment: how well do (1) resource disambiguation and (2) query construction perform?
Measurement of the performance:
1. Disambiguation: Mean Reciprocal Rank (MRR).
2. Query construction: precision and recall.
Benchmark:
1. Each entry pairs a natural-language query with the equivalent conjunctive SPARQL query.
2. 25 queries on the 3 interlinked datasets Drugbank, Sider, and Diseasome.
3. QALD1 and QALD3 benchmarks for DBpedia.
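
The three measures follow their standard definitions, sketched below. The input conventions (1-based ranks with None for a miss, answer sets as iterables) are assumptions of this sketch rather than details of the evaluation setup.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the first correct answer per query, or None."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)


def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against a gold set."""
    returned, expected = set(returned), set(expected)
    correct = len(returned & expected)
    precision = correct / len(returned) if returned else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


mrr = mean_reciprocal_rank([1, 2, None, 4])
precision, recall = precision_recall(["a", "b", "c"], ["a", "b", "d", "e"])
```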
Evaluation using life-science datasets
Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90
+ Evaluation using DBpedia
• QALD3 benchmark:
  • contains 100 questions.
  • 32 original questions can be answered correctly.
• QALD1 benchmark:
  • contains 50 questions.
  • 7 complex questions.
  • 13 questions requiring information beyond DBpedia, i.e., from YAGO and FOAF.
  • 14 were slightly modified to remove expansion and cleaning problems.
  • MRR of disambiguation = 96%
  • Query construction accuracy = 83%
Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
+ Related work
Thank you
Saeedeh Shekarpour [email protected] [email protected]