Cooperative XML (CoXML) Query Answering
2
Motivation
XML has become the standard format for information representation and data exchange
An explosive increase in the amount of XML data available on the web, e.g.,
  Bills at the Library of Congress
  IEEE Computer Society’s publication
  SwissProt – protein sequence databases
  XMark – online auction data
  …
Effective XML search methods are needed!
3
Challenges
XML schemas are usually very complex
  E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths
It is often unrealistic for users to fully understand a schema before asking queries
Exact query answering is inadequate; approximate query answering is more appropriate!
4
Approach: CoXML
[Figure: system diagram with the labels Query, Approximate Answers, Cooperative XML Query Answering, XML Database Engine, and XML Documents]
Derive approximate answers by relaxing query conditions, i.e., query relaxation
5
Roadmap Introduction Background CoXML Related Work Conclusion
6
XML Data Model
XML data is often modeled as an ordered labeled tree
  Tree nodes: elements
  Tree edges: element-nesting relationships
[Figure: sample data tree; numbers are node IDs (elements), quoted text is content]
article (1)
  title (2): "Search engine spam detection"
  author (3)
    name (4): "XYZ"
    title (5): "IEEE Fellow"
  year (6): "2003"
  body (7)
    section (8): "..a spam detection technique by content analysis…"
7
XML Query Model
XML queries are often modeled as trees
Structure conditions: a set of query nodes connected by
  Parent-to-child (‘/’): directly connected
  Ancestor-to-descendant (‘//’): connected (either directly or indirectly)
Content conditions: either value predicates or keyword constraints on query nodes
Example
[Figure: query tree with root article and the nodes title (keywords "search engine"), year (value 2003), and section (keywords "spam detection")]
8
XML Query Answer
An answer for a query is a set of nodes in a data tree that satisfies both structure and content conditions
Example
[Figure: Data Tree (article 1 with title 2, author 3 (name 4, title 5), year 6, and body 7 (section 8)) shown beside the Query Tree (article with title "search engine", year 2003, and section "spam detection")]
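The answer definition above can be sketched in a few lines of Python. This is a minimal illustration, not the CoXML implementation: the class names, the substring test standing in for keyword and value conditions, and the choice of a '//' edge between article and section are all assumptions made for the example. It only reports whether a match exists rather than returning the matching node set.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    label: str
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class QueryNode:
    label: str
    keywords: str = ""   # content condition; empty string means none
    edge: str = "/"      # edge from the parent: "/" (child) or "//" (descendant)
    children: list = field(default_factory=list)


def descendants(node):
    """Yield all proper descendants of a data node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def subtree_text(node):
    """Concatenate the text of a node and of everything below it."""
    return " ".join([node.text] + [subtree_text(c) for c in node.children])


def matches(q, d):
    """True if data node d satisfies query node q and, recursively, q's subtree."""
    if q.label != d.label:
        return False
    # Assumption: a content condition holds if the keywords occur as a substring.
    if q.keywords and q.keywords.lower() not in subtree_text(d).lower():
        return False
    for qc in q.children:
        candidates = d.children if qc.edge == "/" else descendants(d)
        if not any(matches(qc, c) for c in candidates):
            return False
    return True


# The sample data tree from the slides (node IDs omitted).
data = DataNode("article", children=[
    DataNode("title", "Search engine spam detection"),
    DataNode("author", children=[DataNode("name", "XYZ"),
                                 DataNode("title", "IEEE Fellow")]),
    DataNode("year", "2003"),
    DataNode("body", children=[
        DataNode("section", "a spam detection technique by content analysis")]),
])


def make_query(section_edge):
    """Sample query tree; the edge type into section is an assumption."""
    return QueryNode("article", children=[
        QueryNode("title", "search engine"),
        QueryNode("year", "2003"),
        QueryNode("section", "spam detection", edge=section_edge),
    ])


loose = matches(make_query("//"), data)   # section may be any descendant
strict = matches(make_query("/"), data)   # section must be a direct child
```

With the '//' edge the article matches; with a '/' edge it does not, because section sits under body. That gap is what query relaxation is meant to bridge.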
9
XML Query Relaxation Types
Value relaxation: enlarging a value condition’s search scope
  [Figure: the condition year = 2003 is relaxed to year in 2000-2005]
Node relabel: changing the label of a node to a similar or more general label based on domain knowledge
  [Figure: the root label article is relabeled to document]
[1] Tree Pattern Relaxation (S. Amer-Yahia, et al., 2000)
10
XML Query Relaxation Types
Edge generalization: relaxing a ‘/’ edge to a ‘//’ edge
  [Figure: the query tree before and after an edge generalization]
Node deletion: dropping a node from a query tree
  [Figure: the title node is dropped and its keywords "search engine" attach to article]
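The two structure relaxations on this slide can be sketched as small tree transformations. The code below is illustrative only: `QNode`, `find_with_parent`, and both function names are hypothetical, and two behaviors are assumptions rather than facts from the slides: the children of a deleted node are reattached to its parent with '//' edges, and the deleted node's keywords move to the parent (the latter is read off the node-deletion figure).

```python
import copy
from dataclasses import dataclass, field


@dataclass
class QNode:
    label: str
    keywords: tuple = ()
    edge: str = "/"      # edge from the parent: "/" or "//"
    children: list = field(default_factory=list)


def find_with_parent(node, label):
    """Return (parent, child) for the first node below `node` with this label."""
    for child in node.children:
        if child.label == label:
            return node, child
        found = find_with_parent(child, label)
        if found:
            return found
    return None


def edge_generalization(root, label):
    """Copy the query and relax the edge into the node `label` from '/' to '//'."""
    relaxed = copy.deepcopy(root)
    _, node = find_with_parent(relaxed, label)
    node.edge = "//"
    return relaxed


def node_deletion(root, label):
    """Copy the query and drop the node `label`, keeping its content condition."""
    relaxed = copy.deepcopy(root)
    parent, node = find_with_parent(relaxed, label)
    position = next(i for i, c in enumerate(parent.children) if c is node)
    for child in node.children:
        child.edge = "//"            # assumption: orphans become descendants
    parent.children[position:position + 1] = node.children
    parent.keywords += node.keywords  # assumption: keywords move to the parent
    return relaxed


query = QNode("article", children=[
    QNode("title", ("search engine",)),
    QNode("year", ("2003",)),
    QNode("section", ("spam detection",)),
])

generalized = edge_generalization(query, "section")
without_title = node_deletion(query, "title")
```

Both functions return a relaxed copy and leave the original query untouched, so several relaxations can be tried from the same starting tree.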
11
XML Relaxation Properties
Definition
  Relaxation operation: an application of a relaxation type to a specific query node or edge
Lemma
  Given a query tree with n applicable relaxation operations, there are potentially up to $2^n$ relaxed trees
  Possible combinations: $\binom{n}{1} + \cdots + \binom{n}{n}$
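The count in the lemma can be checked by brute force. The sketch below enumerates every non-empty combination of a small, made-up list of relaxation operations (written in the gen/del notation used later in the deck). It ignores the fact that some combinations may not be applicable together, which is why the lemma says "potentially up to".

```python
from itertools import combinations


def relaxation_combinations(operations):
    """Yield every non-empty subset of the applicable relaxation operations."""
    for k in range(1, len(operations) + 1):
        yield from combinations(operations, k)


# Hypothetical operation list for illustration.
ops = ["gen(e$1,$2)", "gen(e$3,$4)", "del($2)"]
combos = list(relaxation_combinations(ops))

# C(n,1) + ... + C(n,n) = 2**n - 1; counting the unrelaxed query as well
# gives the 2**n upper bound stated in the lemma.
count = len(combos)
```

The exponential growth in n is the reason the later slides build an index over relaxed trees instead of enumerating them per query.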
12
Roadmap Introduction Background CoXML Related Work Conclusion
13
Challenges
Query relaxation is often user-specific
  Different users may have different approximate matching specifications for a given query tree
  How to provide user-specific approximate query answering?
A query with n relaxation operations has potentially up to $2^n$ relaxed queries
  How to systematically relax a query?
Query relaxation generates a set of approximate answers
  How to effectively rank the returned approximate answers?
14
CoXML System Overview
[Figure: CoXML system architecture. Components: Relaxation Engine, Ranking Module, Relaxation Index Builder, XTAH, and the XML Database Engine over XML Documents. Flow labels: RLXQuery, query, relaxed query, exact answers, results, ranked results. Annotations: relaxation language, relaxation indexes, similarity metrics]
15
Roadmap Introduction Background CoXML
Relaxation Language Relaxation Indexes Ranking Evaluation Testbed
Related Work Conclusion
16
Relaxation Language
Motivation
  Enabling users to specify approximate conditions in queries and to control the approximate matching process
RLXQuery - relaxation-enabled XQuery
  Extends the standard XML query language (XQuery) with relaxation constructs & controls, such as
    ~ : approximate conditions
    ! : non-relaxable conditions
    REJECT : unacceptable relaxations
    AT-LEAST : minimum # of answers to be returned
    RELAX-ORDER : relaxation orders among multiple conditions
    USE : allowable relaxation types
17
RLXQuery Example
FOR $a in doc("bib.xml")//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
      ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20
[Figure: query tree for the example: article with year (2003, condition t1), title (marked ! as non-relaxable, keywords "search engine"), and body/section (keywords "spam detection"); the structure condition is labeled t2]
18
Roadmap Introduction Background CoXML
Relaxation Language Relaxation Indexes Ranking Evaluation Testbed
Related Work Conclusion
19
Relaxation Index
Naïve approach
  Generate all possible relaxed queries & iteratively select the best relaxed query to derive approximate answers
  Exhaustive, but not scalable
Observation
  Many queries share the same (or similar) tree structures
Our approach: relaxation index
  Consider the structure of a query tree T as a template
  Build indexes on the relaxed trees of T
  Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T
20
Relaxation Index - XTAH
XTAH
  A hierarchical multi-level labeled cluster of relaxed trees
Building an XTAH
  Given a query structure template T, generate all possible relaxed trees
    Each relaxed tree uses a unique set of relaxation operations
  Cluster relaxed trees into groups based on relaxation operations and distances, similar to “suffix-tree” clustering
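As a rough illustration of the building step, the sketch below identifies each relaxed tree with the set of operations that produced it and groups the trees under a label made of the relaxation types they use. This is a deliberately simplified, two-level stand-in: it does not implement the distance-based, suffix-tree-like clustering the slide mentions, and the operation list and the name `build_index` are assumptions for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical operations on a template, each tagged with its relaxation type.
OPERATIONS = {
    "gen(e$1,$2)": "edge_generalization",
    "gen(e$3,$4)": "edge_generalization",
    "del($2)": "node_deletion",
}


def build_index(operations):
    """Group relaxed trees (sets of operations) by the relaxation types they use."""
    index = defaultdict(list)
    names = sorted(operations)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            label = frozenset(operations[op] for op in combo)
            index[label].append(frozenset(combo))
    return dict(index)


index = build_index(OPERATIONS)
edge_only = index[frozenset({"edge_generalization"})]
total_trees = sum(len(trees) for trees in index.values())
```

Because every group carries a label describing the operations inside it, a whole group can be accepted or skipped without looking at its individual trees.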
21
XTAH Example
[Figure: template structure T: article ($1) with children title ($2) and body ($3); section ($4) is a child of body]
[Figure: a sample XTAH for the template structure T. The root "relax" has the children node_relabel, edge_generalization, and node_deletion. Groups shown include {gen(e$1,$2)}, {gen(e$3,$4)}, {gen(e$1,$2), gen(e$3,$4)}, {gen(e$3,$4), gen(e$1,$3)}, {del($2)}, and {del($2), del($3)}. The leaves are relaxed trees T1, T2, T3, T4, T6, T7, …; e.g., T6 (article, body, section) falls under {del($2)} and T7 (article, section) under {del($2), del($3)}]
gen(e$u, $v) – relaxing the edge between $u and $v
del($u) – deleting the node $u
22
XTAH Properties
Each group consists of a set of relaxed trees obtained by using similar relaxation operations
  Efficient location of relaxed trees based on relaxation operations
The higher the level of a group, the less relaxed the trees in the group
  Relaxing queries at different granularities by traversing up and down the XTAH
23
XTAH-Guided Query Relaxation
Problem
  Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specification?
Approach
  First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query
    This step can be efficiently achieved by utilizing internal node labels
  Then, iteratively search the XTAH for the best relaxed query
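The prune-then-search approach can be sketched as follows. Everything here is a toy under stated assumptions: groups are keyed by the set of relaxation types they use, "best" is approximated by the fewest operations, `run_query` is a stand-in for the database engine, and `rel($1)` is an invented name for a relabel operation. The real system searches a multi-level hierarchy rather than a flat dictionary.

```python
def prune(groups, allowed_types):
    """Keep only groups whose label uses allowed relaxation types (cf. USE)."""
    return {label: trees for label, trees in groups.items()
            if label <= allowed_types}


def relax_until(groups, allowed_types, at_least, cost, run_query):
    """Try relaxed queries in increasing cost until `at_least` answers are found."""
    candidates = [tree
                  for trees in prune(groups, allowed_types).values()
                  for tree in trees]
    answers = []
    for tree in sorted(candidates, key=cost):
        for answer in run_query(tree):
            if answer not in answers:
                answers.append(answer)
        if len(answers) >= at_least:   # cf. AT-LEAST
            break
    return answers


# Toy index: group label -> relaxed trees (each a set of operations).
groups = {
    frozenset({"edge_generalization"}): [
        frozenset({"gen(e$1,$2)"}),
        frozenset({"gen(e$1,$2)", "gen(e$3,$4)"}),
    ],
    frozenset({"node_deletion"}): [frozenset({"del($2)"})],
    frozenset({"node_relabel"}): [frozenset({"rel($1)"})],
}

# Made-up answers returned by each relaxed query.
fake_results = {
    frozenset({"gen(e$1,$2)"}): ["a1"],
    frozenset({"gen(e$1,$2)", "gen(e$3,$4)"}): ["a3"],
    frozenset({"del($2)"}): ["a1", "a2"],
    frozenset({"rel($1)"}): ["a9"],
}

answers = relax_until(
    groups,
    allowed_types={"edge_generalization", "node_deletion"},
    at_least=2,
    cost=len,                       # assumption: fewer operations is better
    run_query=lambda tree: fake_results[tree],
)
```

The node_relabel group is discarded from its label alone, and the search stops as soon as two distinct answers are collected, so the two-operation relaxation is never executed.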
24
Query Relaxation Process Example
[Figure: the sample RLXQuery (article with year 2003 labeled t1, non-relaxable title "search engine", and body/section "spam detection" labeled t2; relaxation control: USE (edge generalization, node deletion), AT-LEAST 20) is relaxed against the sample XTAH for the template structure T (article $1, title $2, body $3, section $4). The XTAH root "relax" has the children node_relabel, edge_generalization, and node_deletion, with groups such as {gen(e$1,$2)}, {gen(e$3,$4)}, {gen(e$1,$2), gen(e$3,$4)}, {gen(e$3,$4), gen(e$1,$3)}, {del($2)}, and {del($2), del($3)} over the relaxed trees T1, T2, T3, T4, T6, T7, …]
25
XTAH-Guided Query Relaxation
Problem
  Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?
Approach: M-tree
  Assign representatives to internal groups
  Representatives summarize distance properties of the trees within groups
  Use representatives to guide the search path to the best relaxation candidate
[Figure: hierarchy of representatives (R0; R1, R2, R3; R5, R8, R11) leading to relaxed tree j]
[2] M-tree: An efficient access method for similarity search in metric spaces (P. Ciaccia et al., VLDB 97)
26
Roadmap Introduction Background CoXML
Relaxation Language Relaxation Indexes Ranking Evaluation Testbed
Related Work Conclusion
27
Ranking
Ranking criteria
  Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes
Approach
  Content similarity – extended vector space model
  Structure similarity – tree editing distance with a model for assigning operation costs
  Overall relevancy – a ranking model combining both content and structure similarities
28
Content Similarity
Traditional IR ranking: content similarity between a query and a document
  Term Frequency + Inverse Document Frequency: Vector Space Model
XML content ranking: content similarity between a query and an answer (i.e., a set of data nodes)
  Weighted Term Frequency + Inverse Element Frequency: Extended Vector Space Model
29
Weighted Term Frequency
Terms under different paths of a node are weighted differently
Example
  Query: section with keywords "spam detection"
  Data: section (5) with title (6) "Spam Detection By Content Analysis", paragraph (8) "…an approach to detect spam by …", and reference (12) "Spam detection taxonomy"
The weighted term frequency for a term t in a node v is:
  $tf_w(v, t) = \sum_{i=1}^{m} w(p_i) * tf(p_i, t)$
  pi: a path under the node v to a term t;
  m: # of different paths under the node v that contain the term t
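A small numeric sketch of the weighted term frequency, using the section example from this slide. The path weights are invented for illustration (the slides do not give values); only the structure of the sum, a per-path weight times a per-path term count, follows the definition.

```python
def weighted_tf(path_counts, path_weight):
    """Sum w(p) * tf(p, t) over the paths p under a node that contain term t.

    path_counts maps each path to the number of occurrences of t under it.
    """
    return sum(path_weight[path] * count for path, count in path_counts.items())


# Occurrences of the term "spam" under section (5) in the slide's example.
spam_counts = {"title": 1, "paragraph": 1, "reference": 1}

# Hypothetical weights: a hit in a title counts more than one in a reference.
weights = {"title": 2.0, "paragraph": 1.0, "reference": 0.5}

tf_w_spam = weighted_tf(spam_counts, weights)
plain_tf = sum(spam_counts.values())   # unweighted count, for comparison
```

With these made-up weights the term scores 3.5 instead of the flat count of 3, because the title occurrence contributes more than the reference occurrence.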
30
Inverse Element Frequency
The more XML elements contain a term, the less disambiguating power the term has
  E.g., the term “spam” is less disambiguating than the term “detection”
The inverse element frequency for a query term t is
  $ief(\$u, t) = \log \frac{N_1}{N_2}$
  $u: a query node whose content condition contains the term t
  N1: # of data nodes that match the structure condition related to $u
  N2: # of data nodes that match the structure condition related to $u and contain t
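The inverse element frequency is a one-line computation once N1 and N2 are known. In the sketch below the counts are made up, and the natural logarithm is an assumption, since the slide does not state the base.

```python
import math


def ief(n1, n2):
    """Inverse element frequency log(N1 / N2).

    n1: data nodes matching the structure condition related to the query node
    n2: those among them that also contain the term (must be positive)
    """
    if n2 <= 0 or n1 < n2:
        raise ValueError("expected 0 < n2 <= n1")
    return math.log(n1 / n2)


# Hypothetical counts: of 1000 matching elements, a common term appears
# in 100 and a rare term in 10.
common = ief(1000, 100)
rare = ief(1000, 10)
everywhere = ief(1000, 1000)
```

A term found in every matching element gets a weight of zero, and the weight grows as the term becomes rarer, mirroring inverse document frequency in text retrieval.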
31
Extended Vector Space Model
The content similarity between an answer A and a query Q is
  $cont\_sim(A, Q) = \sum_{i=1}^{n} \sum_{j=1}^{|\$u_i.cont|} tf_w(v_i, t_{ij}) * ief(\$u_i, t_{ij})$
  n: # of nodes in Q
  {$u1, …, $un}: the set of query nodes in Q
  {v1, …, vn}: the set of data nodes in A, where vi matches $ui (1 ≤ i ≤ n)
  |$ui.cont|: the number of terms in the content conditions on the node $ui
  tij: a term in the content condition on the query node $ui
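The double sum in the content similarity can be written directly as nested iteration over matched node pairs and their query terms. The lookup tables below hold invented values standing in for the weighted term frequency and inverse element frequency; the node names are placeholders.

```python
def cont_sim(matched, tf_w, ief):
    """Sum tf_w(v_i, t_ij) * ief($u_i, t_ij) over query nodes and their terms.

    matched: list of (data_node, query_node, terms), one entry per query node.
    """
    return sum(tf_w[(v, t)] * ief[(u, t)]
               for v, u, terms in matched
               for t in terms)


# One matched pair: data node v1 answers query node $u1, whose content
# condition has the two terms "spam" and "detection".
matched = [("v1", "$u1", ["spam", "detection"])]

# Hypothetical precomputed values.
tf_w = {("v1", "spam"): 3.5, ("v1", "detection"): 2.0}
ief = {("$u1", "spam"): 2.0, ("$u1", "detection"): 1.0}

score = cont_sim(matched, tf_w, ief)   # 3.5 * 2.0 + 2.0 * 1.0
```

Query nodes without a content condition simply contribute an empty term list and add nothing to the sum.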
32
Structure Distance Function
Both XML data and queries are modeled as trees
Similarities between trees are often computed by editing distances, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other
The structure distance between an answer A and a query Q can be measured as the total cost of the relaxation operations used to derive A
  $struct\_dist(A, Q) = \sum_{i=1}^{k} cost(r_i)$
  {r1, …, rk}: the set of relaxation operations used to derive A
  cost(ri): the cost of ri (0 ≤ cost(ri) ≤ 1)
33
Relaxation Operation Cost
Naïve approach
  Assign uniform cost to all relaxation operations
  Simple but ineffective
Our approach
  Assign an operation cost based on the similarity between the two nodes being approximated by the operation
  The closer the two nodes, the less the operation costs
  $cost(r_i) = 1 - similarity(\$u, \$v)$
  ri: a relaxation operation
  $u, $v: the two nodes that are being approximated by ri
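The cost model and the structure distance from the previous slide fit together as shown in this sketch. The similarity values are made up; how node similarity is actually computed is not specified on these slides, so it is passed in as plain numbers.

```python
import math


def operation_cost(similarity):
    """cost(r) = 1 - similarity($u, $v), with similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def struct_dist(similarities):
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)


# Hypothetical similarities for two operations:
# relabeling article to document (close labels) and deleting section.
relabel_similarity = 0.8
deletion_similarity = 0.5

distance = struct_dist([relabel_similarity, deletion_similarity])
uniform = 2 * 1.0   # a uniform-cost baseline would charge each operation equally
```

Under this model an answer reached through two mild relaxations can end up closer to the query than one reached through a single drastic relaxation, which a uniform cost cannot express.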
34
Nodes Approximated By Relaxation Operations
Relaxation operation | Nodes being approximated by the operation: ($u, $v) | Example
Node relabel | (a node with the old label, a node with the new label) | (article, document)
Node deletion | (a child node, the parent node) | (section, body)
Edge generalization | (a child node, a descendant node) | (article/title, article//title)
[Figure: T1, the query tree (article with title and body; section under body); T2, node relabel (article becomes document); T3, node deletion (section is dropped); T4, edge generalization]
35
[Figure: overall relevancy, derived from content similarity and structure distance]
36
Overall Relevancy Function
The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of cont_sim(A, Q) and struct_dist(A, Q)
Properties
  sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  sim(A, Q) increases as cont_sim(A, Q) increases
  sim(A, Q) decreases as struct_dist(A, Q) increases
Implementation
  $sim(A, Q) = \alpha^{struct\_dist(A, Q)} * cont\_sim(A, Q)$
  α is a small constant between 0 and 1
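The relevancy function and its stated properties can be checked numerically. The inputs below are arbitrary illustration values; only the formula, alpha raised to the structure distance times the content similarity, comes from the slide.

```python
def sim(cont_sim, struct_dist, alpha=0.5):
    """Overall relevancy: alpha ** struct_dist * cont_sim, with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha ** struct_dist * cont_sim


exact = sim(9.0, 0.0)            # no relaxation: relevancy equals cont_sim
one_step = sim(9.0, 1.0)         # one full-cost relaxation halves the score
two_steps = sim(9.0, 2.0)
richer_content = sim(12.0, 1.0)
```

A smaller alpha penalizes structural relaxation more heavily, so the constant controls how much structure matters relative to content.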
37
Roadmap Introduction Background CoXML
Relaxation Indexes Relaxation Language Ranking Evaluation Testbed
Related Work Conclusion
38
Evaluation Studies
INEX (Initiative for the Evaluation of XML Retrieval)
  Similar to TREC for text retrieval
Document collections
  Scientific articles from the IEEE Computer Society, 1995 – 2002
  About 500 MB
  Each article consists of 1500 XML nodes on average
Queries
  Strict content and structure (SCAS)
  Vague content and structure (VCAS)
Gold standard
  Relevance assessments provided by INEX
39
Evaluation of Content Similarity
Dataset: INEX 03 test collection
Query set: 30 SCAS queries
Comparisons: 38 submissions in INEX 03
[Figure: precision-recall curve (x-axis: Recall, 0 to 1; y-axis: Precision, 0 to 1)]
Avg. Precision 0.3309
40
Evaluation of the Cost Model
Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative gain (nxCG)
  The official evaluation metric used in INEX 05
  Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users accumulate up to rank i
  E.g., nxCG@10, nxCG@25, nxCG@50, …
Cost models:
  UCost: uniform cost for each relaxation operation (baseline)
  SCost: our proposed cost model
41
Retrieval performance improvements with semantic cost model
Query set: all content-and-structure queries in INEX 05
nxCG@10 for each pair (α, cost model):
Cost model | α = 0.1 | α = 0.3 | α = 0.5 | α = 0.7 | α = 0.9
Uniform | 0.2584 | 0.2616 | 0.2828 | 0.2894 | 0.2916
Semantic | 0.3319 (+28.44%) | 0.3190 (+21.94%) | 0.3196 (+13.04%) | 0.3068 (+6%) | 0.2957 (+4.08%)
$sim(A, Q) = \alpha^{struct\_dist(A, Q)} * cont\_sim(A, Q)$
Assigning relaxation operations different costs based on the similarities of the nodes being operated on improves retrieval performance!
nxCG@25 and nxCG@50 yield similar results
42
Evaluation of the Cost Model
Result
Each cell: nxCG@10 for a given pair (α, cost model) (% of improvement over the baseline)
Cost model | α = 0.1 | α = 0.3 | α = 0.5 | α = 0.7 | α = 0.9
UCost | 0.2584 | 0.2616 | 0.2828 | 0.2894 | 0.2916
SCost | 0.3319 (+28.44%) | 0.3190 (+21.94%) | 0.3196 (+13.04%) | 0.3068 (+6%) | 0.2957 (+4.08%)
$sim(A, Q) = \alpha^{struct\_dist(A, Q)} * cont\_sim(A, Q)$
Utilizing node similarities to distinguish costs of different operations improves retrieval performance!
Similar results are observed using nxCG@25 and nxCG@50
43
Expressiveness of the Relaxation Language
INEX 05 Topic 267
<inex_topic topic_id="267" query_type="CAS">
  <castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
  <description>Articles containing "digital libraries" in their title.</description>
  <narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>
Expressing Topic 267 using RLXQuery
FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
44
Expressing Topic 267 with RLXQuery
FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b
Results: Effectiveness of the Relaxation Control
Method | nxCG@10 | nxCG@25
No relaxation control | 0.1013 | 0.2365
With relaxation control | 1.0 (perfect accuracy) | 0.8986
Relaxation control enables the system to provide answers with greater relevancy!
45
Evaluation of the Ranking Function
Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available relevance assessments
Comparison: top-1 submission in INEX 05
Results
Topic | nxCG@10 Top-1 | nxCG@10 CoXML | nxCG@25 Top-1 | nxCG@25 CoXML
256 | 0.4293 | 0.4248 | 0.4733 | 0.5555
264 | 0.0 | 0.0069 | 0.0 | 0.0033
275 | 0.7715 | 0.638 | 0.589 | 0.5922
284 | 0.0 | 0.1259 | 0.0 | 0.1233
Average | 0.3002 (+0.4%) | 0.2989 | 0.2656 | 0.3186 (+20%)
The systematic relaxation approach enables our system to derive more approximate answers!
Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!
46
Roadmap Introduction Background CoXML
Relaxation Indexes – XTAH Relaxation Language – RLXQuery Ranking Evaluation Testbed
Related Work Conclusion
47
CoXML Testbed
Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian
[Figure: testbed architecture. Components: RLXQuery Preprocessor, RLXQuery Parser, Relaxation Manager, Relaxation Controller, Database Manager, Ranking Module, Relaxation Index Builder, XTAH, XML Database Engine, XML Documents. Input: RLXQuery; output: Approximate Answers]
48
Relaxation Examples using the Testbed
49
Relaxation Examples using the Testbed
50
Roadmap Introduction Background CoXML Related Work Conclusion
51
Related Work: Query Relaxation
Relaxation based on schema conversions ([LC01, LMC01], [LMC03])
  No structure relaxation
Native XML relaxation
  Propose structure relaxation types [e.g., KS01, ACS02]
    We use the relaxation types introduced in [ACS02]
  Investigate efficient algorithms for deriving top-K answers based on relaxation types supported [e.g., Sch02, ACS02, ALP04, AKM05]
    No relaxation control
52
Related Work: XML Ranking
Content ranking
  Most extend ranking models for text retrieval to the XML scenario, e.g., HyRex, XXL, JuruXML, XSearch
  We utilize structure to distinguish terms of different weights occurring in different parts of a document
Structure ranking
  Based on tree editing distance algorithms without considering operation cost [NJ02]
  Based on the occurrence frequency of the query trees, paths, or predicates in data [MAK05, AKM05]
  Our structure ranking is similar to editing distance, but we consider operation cost
53
Conclusion Cooperative XML (CoXML) query answering
RLXQuery enables users to effectively express approximate query conditions and to control the approximate matching process
XTAH provides systematic query relaxation guidance
Both content and structure similarity metrics for evaluating the relevancy of approximate answers
Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology