Cooperative XML (CoXML) Query Answering

Motivation

XML has become the standard format for information representation and data exchange

An explosive increase in the amount of XML data available on the web, e.g.,
  Bills at the Library of Congress
  IEEE Computer Society's publications
  SwissProt: protein sequence databases
  XMark: online auction data
  ...

Effective XML search methods are needed!

Challenges

XML schema is usually very complex

E.g., the schema for the IEEE Computer Society publication dataset contains about 170 distinct tags and more than 1000 distinct paths

It is often unrealistic for users to fully understand a schema before asking queries

Exact query answering is inadequate and approximate query answering is more appropriate!

Approach: CoXML

[Figure: a Cooperative XML Query Answering layer on top of an XML Database Engine over XML Documents, returning Approximate Answers]

Derive approximate answers by relaxing query conditions, i.e., query relaxation

Roadmap Introduction Background CoXML Related Work Conclusion

XML Data Model

XML data is often modeled as an ordered labeled tree
  Tree nodes: elements
  Tree edges: element-nesting relationships

Example data tree (node IDs precede element labels; quoted strings are content):

1 article
  2 title: "Search engine spam detection"
  3 author
    4 name: "XYZ"
    5 title: "IEEE Fellow"
  7 body
    8 section: "..a spam detection technique by content analysis…"

XML Query Model

XML queries are often modeled as trees

Structure conditions: a set of query nodes connected by
  Parent-to-child ('/') edges: directly connected
  Ancestor-to-descendant ('//') edges: connected (either directly or indirectly)

Content conditions: either value predicates or keyword constraints on query nodes

Example query tree:

article
  title: "search engine"
  section: "spam detection"

XML Query Answer

An answer for a query is a set of nodes in a data tree that satisfies both structure and content conditions

[Figure: the example data tree (article 1 with title 2, author 3, name 4, title 5, body 7, section 8) shown next to the example query tree (article with title "search engine" and section "spam detection")]
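The matching of a query tree against a data tree can be sketched in Python as follows. This is only an illustrative sketch, not the CoXML implementation: the class and function names are hypothetical, keyword constraints are approximated by case-insensitive substring matching, the article-to-section edge is assumed to be '//', and only the first match found is returned.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class DataNode:
    node_id: int
    label: str
    text: str = ""
    children: List["DataNode"] = field(default_factory=list)


@dataclass
class QueryNode:
    label: str
    keywords: str = ""   # keyword constraint; empty string means none
    axis: str = "/"      # edge from the parent: "/" (child) or "//" (descendant)
    children: List["QueryNode"] = field(default_factory=list)


def descendants(node: DataNode) -> Iterator[DataNode]:
    """Yield all proper descendants of node in document order."""
    for child in node.children:
        yield child
        yield from descendants(child)


def subtree_text(node: DataNode) -> str:
    """Lower-cased text of node and of everything nested under it."""
    parts = [node.text] + [subtree_text(c) for c in node.children]
    return " ".join(parts).lower()


def match(q: QueryNode, d: DataNode) -> Optional[List[int]]:
    """Return the ids of data nodes matching q rooted at d, or None."""
    if q.label != d.label:
        return None
    if q.keywords and q.keywords.lower() not in subtree_text(d):
        return None
    matched = [d.node_id]
    for qc in q.children:
        candidates = d.children if qc.axis == "/" else descendants(d)
        for c in candidates:
            sub = match(qc, c)
            if sub is not None:
                matched.extend(sub)
                break
        else:
            return None  # no data node satisfies this query child
    return matched


# Data tree from the slide.
article = DataNode(1, "article", children=[
    DataNode(2, "title", "Search engine spam detection"),
    DataNode(3, "author", children=[
        DataNode(4, "name", "XYZ"),
        DataNode(5, "title", "IEEE Fellow"),
    ]),
    DataNode(7, "body", children=[
        DataNode(8, "section", "..a spam detection technique by content analysis…"),
    ]),
])

# Query tree; the '//' edge to section is an assumption.
query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("section", "spam detection", axis="//"),
])
answer = match(query, article)

# The same query with a strict '/' edge to section has no exact answer.
strict_query = QueryNode("article", children=[
    QueryNode("title", "search engine"),
    QueryNode("section", "spam detection"),
])
strict_answer = match(strict_query, article)
```

With the '//' edge the answer is the node set {1, 2, 8}; with a strict '/' edge there is no exact answer, which is the situation query relaxation is meant to address.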

XML Query Relaxation Types

Value relaxation: enlarging a value condition's search scope

Node relabel: changing the label of a node to a similar or more general label, based on domain knowledge

[Figure: on a query tree with nodes article, title ("search engine"), year, and section ("spam detection"), value relaxation enlarges the year condition to the range 2000-2005, and node relabel changes article to document]

[1] Tree Pattern Relaxation (S. Amer-Yahia et al., 2002)

XML Query Relaxation Types

Edge generalization: relaxing a '/' edge to a '//' edge

Node deletion: dropping a node from a query tree

[Figure: the same query tree (article, title ("search engine"), year, section ("spam detection")) before and after an edge generalization, and before and after deleting the title node, with the keyword condition "search engine" kept under article]

XML Relaxation Properties

Definition
  Relaxation operation: an application of a relaxation type to a specific query node or edge

Lemma
  Given a query tree with n applicable relaxation operations, there are potentially up to $2^n$ relaxed trees
  Possible combinations: $\binom{n}{1} + \cdots + \binom{n}{n}$
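The count in the lemma can be checked with a short Python sketch that enumerates every non-empty subset of a set of relaxation operations. The four operation names are hypothetical examples; in practice some combinations may not be valid, so the count is an upper bound.

```python
from itertools import combinations
from math import comb


def relaxation_sets(operations):
    """Yield every non-empty subset of the applicable relaxation operations."""
    for k in range(1, len(operations) + 1):
        for subset in combinations(operations, k):
            yield frozenset(subset)


# Hypothetical operations on a four-node query tree.
ops = ["gen(e$1,$2)", "gen(e$1,$3)", "gen(e$3,$4)", "del($2)"]
n = len(ops)
subsets = list(relaxation_sets(ops))

# C(n,1) + ... + C(n,n) = 2^n - 1; adding the unrelaxed tree gives 2^n.
total = sum(comb(n, k) for k in range(1, n + 1))
```

Each subset corresponds to one candidate relaxed tree, so the number of candidates doubles with every additional applicable operation.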

Challenges

Query relaxation is often user-specific
  Different users may have different approximate matching specifications for a given query tree
  How to provide user-specific approximate query answering?

A query with n relaxation operations has potentially up to $2^n$ relaxed queries
  How to systematically relax a query?

Query relaxation generates a set of approximate answers
  How to effectively rank the returned approximate answers?

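CoXML System Overview

[Figure: system architecture. Components: Relaxation Engine, Ranking Module, Relaxation Index Builder, XML Database Engine, XML Documents. Labeled flows: RLXQuery in and ranked results out; query, relaxed query, exact answers, and results between the engines. Annotations: relaxation language, relaxation indexes, similarity metrics]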

Roadmap Introduction Background CoXML

Relaxation Language Relaxation Indexes Ranking Evaluation Testbed

Related Work Conclusion

Relaxation Language

Motivation
  Enabling users to specify approximate conditions in queries and to control the approximate matching process

RLXQuery: relaxation-enabled XQuery
  Extends the standard XML query language (XQuery) with relaxation constructs and controls, such as
    ~ : approximate conditions
    ! : non-relaxable conditions
    REJECT : unacceptable relaxations
    AT-LEAST : minimum number of answers to be returned
    RELAX-ORDER : relaxation order among multiple conditions
    USE : allowable relaxation types

RLXQuery Example

FOR $a in doc("bib.xml")//article
WHERE $a/year = ~2003 V-COND-LABEL t1 and
      ~($a[about(./!title, "search engine")]/body/section)[about(., "spam detection")] S-COND-LABEL t2
RETURN $a
RELAX-ORDER (t1, t2)
USE (edge generalization, node deletion)
AT-LEAST 20

[Figure: query tree for the example: article with year, title ("search engine"), and section ("spam detection")]

Relaxation Index

Naïve approach
  Generate all possible relaxed queries and iteratively select the best relaxed query to derive approximate answers
  Exhaustive, but not scalable

Observation
  Many queries share the same (or similar) tree structures

Our approach: relaxation index
  Consider the structure of a query tree T as a template
  Build indexes on the relaxed trees of T
  Use the index to guide the relaxations of any query with the same (or similar) tree structure as that of T

Relaxation Index - XTAH

XTAH
  A hierarchical multi-level labeled cluster of relaxed trees

Building an XTAH
  Given a query structure template T, generate all possible relaxed trees
  Each relaxed tree uses a unique set of relaxation operations
  Cluster relaxed trees into groups based on relaxation operations and distances, similar to "suffix-tree" clustering

XTAH Example

[Figure: a sample XTAH for the template structure T (article with children title and body; body with child section). Top-level groups are labeled node_relabel, edge_generalization, and node_deletion. Lower-level groups are labeled with sets of relaxation operations: {gen(e$1,$2)}, {gen(e$3,$4)}, {del($2)}, {gen(e$3,$4), gen(e$1,$3)}, {gen(e$1,$2), gen(e$3,$4)}, {del($2), del($3)}, and others. The leaves are relaxed trees, including T4 and T6]

gen(e$u, $v): relaxing the edge between $u and $v

del($u): deleting the node $u

XTAH Properties

Each group consists of a set of relaxed trees obtained by using similar relaxation operations
  Efficient location of relaxed trees based on relaxation operations

The higher the level of a group, the less relaxed the trees in the group
  Relaxing queries at different granularities by traversing up and down the XTAH

XTAH-Guided Query Relaxation

Problem
  Given a query with relaxation specifications (constructs and controls), how to search an XTAH for relaxed queries that satisfy the specifications?

Approach
  First, prune XTAH groups containing trees that use unacceptable relaxations as specified in the query
    This step can be efficiently achieved by utilizing internal node labels
  Then, iteratively search the XTAH for the best relaxed query
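The two steps above (prune by group label, then search for the best relaxed query) can be sketched in Python as follows. This is a simplified, flat stand-in for an XTAH, not the actual index: the Group and RelaxedTree classes, the tree names, the relabel operation name, and all distance values are hypothetical, and each group is assumed to carry a single relaxation-type label.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class RelaxedTree:
    name: str
    operations: FrozenSet[str]
    distance: float  # hypothetical distance to the original query


@dataclass(frozen=True)
class Group:
    label: str  # relaxation type shared by all trees in the group
    trees: Tuple[RelaxedTree, ...]


def relaxation_candidates(groups, allowed_types, rejected_ops=frozenset()):
    """Return acceptable relaxed trees, closest to the original query first."""
    # Step 1: prune whole groups by their label (the USE control).
    kept = [g for g in groups if g.label in allowed_types]
    # Drop trees that use an explicitly rejected operation (the REJECT control).
    trees = [t for g in kept for t in g.trees
             if not (t.operations & rejected_ops)]
    # Step 2: order the survivors so the best candidate is tried first.
    return sorted(trees, key=lambda t: t.distance)


xtah = [
    Group("edge_generalization", (
        RelaxedTree("Ta", frozenset({"gen(e$1,$2)"}), 0.2),
        RelaxedTree("Tb", frozenset({"gen(e$1,$2)", "gen(e$3,$4)"}), 0.5),
    )),
    Group("node_deletion", (
        RelaxedTree("Tc", frozenset({"del($2)"}), 0.4),
    )),
    Group("node_relabel", (
        RelaxedTree("Td", frozenset({"relabel($1)"}), 0.1),
    )),
]

# USE (edge generalization, node deletion)
allowed = {"edge_generalization", "node_deletion"}
order = [t.name for t in relaxation_candidates(xtah, allowed)]
order_without_del = [
    t.name
    for t in relaxation_candidates(xtah, allowed, frozenset({"del($2)"}))
]
```

A caller would evaluate the candidates in this order and stop once the AT-LEAST threshold on the number of answers is met.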

Query Relaxation Process Example

[Figure: the sample RLXQuery (query tree: article with year, title ("search engine"), and section ("spam detection"); relaxation control: USE (edge generalization, node deletion) AT-LEAST 20) is relaxed using the sample XTAH for the template structure T (article with title and body; body with section). The XTAH shows the groups node_relabel, edge_generalization, and node_deletion, the operation sets {gen(e$1,$2)}, {gen(e$3,$4)}, {gen(e$3,$4), gen(e$1,$3)}, {gen(e$1,$2), gen(e$3,$4)}, {del($2)}, {del($2), del($3)}, and the relaxed trees T4 and T6]

XTAH-Guided Query Relaxation

Problem
  Given a query and an XTAH, how to efficiently locate the best relaxation candidate at the leaf level?

Approach: M-tree
  Assign representatives to internal groups
  Representatives summarize distance properties of the trees within groups
  Use representatives to guide the search path to the best relaxation candidate

[Figure: XTAH groups with representatives R1, R2, R3, R5, R8, R11 guiding the search toward relaxed tree j]

[2] M-tree: An efficient access method for similarity search in metric spaces (P. Ciaccia et al., VLDB 97)

Ranking

Ranking criteria
  Based on both content and structure similarities between a query and an answer, i.e., a set of data nodes

Approach
  Content similarity: extended vector space model
  Structure similarity: tree editing distance with a model for assigning operation costs
  Overall relevancy: a ranking model combining both content and structure similarities

Content Similarity

Traditional IR ranking: content similarity between a query and a document
  Vector Space Model
  Term Frequency
  Inverse Document Frequency

XML content ranking: content similarity between a query and an answer (i.e., a set of data nodes)
  Extended Vector Space Model
  Weighted Term Frequency
  Inverse Element Frequency

Weighted Term Frequency

Terms under different paths of a node are weighted differently

The weighted term frequency for a term t in a node v is:

$$\mathrm{tf}_w(v, t) = \sum_{i=1}^{m} w(p_i) \ast \mathrm{tf}(p_i, t)$$

p_i: a path under the node v to a term t
m: number of different paths under the node v that contain the term t

Example

Query: section with the keyword constraint "spam detection"

Data:
5 section
  6 title: "Spam Detection By Content Analysis"
  8 paragraph: "…an approach to detect spam by …"
  12 reference: "Spam detection taxonomy"
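The weighted term frequency can be illustrated on the section example with a small Python sketch. The path weights are hypothetical (the slides do not give values), and term occurrences are counted by simple whitespace tokenization, which is an assumption.

```python
def term_count(text: str, term: str) -> int:
    """Number of whitespace-separated tokens equal to term (case-insensitive)."""
    return text.lower().split().count(term.lower())


def weighted_tf(content_by_path, path_weight, term):
    """Sum of w(p_i) * tf(p_i, t) over the paths under a node.

    Paths that do not contain the term contribute zero, so summing over all
    paths equals summing over the m paths that contain it.
    """
    return sum(path_weight[path] * term_count(text, term)
               for path, text in content_by_path.items())


# Content found under the data node "section" (node 5 in the slide).
section = {
    "title": "Spam Detection By Content Analysis",
    "paragraph": "…an approach to detect spam by …",
    "reference": "Spam detection taxonomy",
}

# Hypothetical weights: a hit in the title counts more than one in a reference.
weights = {"title": 1.0, "paragraph": 0.5, "reference": 0.25}

tf_spam = weighted_tf(section, weights, "spam")            # 1.0 + 0.5 + 0.25
tf_detection = weighted_tf(section, weights, "detection")  # 1.0 + 0.25
```

Under these weights "spam" scores 1.75 and "detection" scores 1.25, because "detection" does not occur in the paragraph.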

Inverse Element Frequency

The more XML elements contain a term, the less disambiguating power the term has
  E.g., the term "spam" is less disambiguating than the term "detection"

The inverse element frequency for a query term t is

$$\mathrm{ief}(\$u, t) = \log \frac{N_1}{N_2}$$

$u: a query node whose content condition contains the term t
N_1: number of data nodes that match the structure condition related to $u
N_2: number of data nodes that match the structure condition related to $u and contain t
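A minimal Python sketch of the inverse element frequency follows. The logarithm base is not specified in the slides, so the natural logarithm is assumed, and the node counts are hypothetical.

```python
import math


def ief(n1: int, n2: int) -> float:
    """log(N1 / N2), where N2 of the N1 structurally matching nodes contain the term."""
    if n2 <= 0 or n2 > n1:
        raise ValueError("expected 0 < n2 <= n1")
    return math.log(n1 / n2)


# Hypothetical counts: 1000 section nodes match the structure condition.
common_term = ief(1000, 500)   # term found in half of them
rare_term = ief(1000, 10)      # term found in only ten of them
everywhere = ief(1000, 1000)   # term found in all of them
```

A term that occurs in every matching element gets a weight of zero, and rarer terms get larger weights.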

Extended Vector Space Model

The content similarity between an answer A and a query Q is

$$\mathrm{cont\_sim}(A, Q) = \sum_{i=1}^{n} \sum_{j=1}^{|\$u_i.\mathrm{cont}|} \mathrm{tf}_w(v_i, t_{ij}) \ast \mathrm{ief}(\$u_i, t_{ij})$$

n: number of nodes in Q
{$u_1, …, $u_n}: the set of query nodes in Q
{v_1, …, v_n}: the set of data nodes in A, where v_i matches $u_i (1 ≤ i ≤ n)
|$u_i.cont|: the number of terms in the content condition on the node $u_i
t_ij: a term in the content condition on the query node $u_i
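The double sum can be written as a short Python sketch. The lookup tables stand in for the weighted term frequency and inverse element frequency; all numeric values and the node names are hypothetical.

```python
def cont_sim(pairs, tf_w, ief):
    """Sum tf_w(v_i, t_ij) * ief($u_i, t_ij) over matched nodes and their terms.

    pairs: iterable of (data_node, query_node, terms) triples, one per query node.
    """
    return sum(tf_w(v, t) * ief(u, t)
               for v, u, terms in pairs
               for t in terms)


# Hypothetical precomputed weights for the answer {1, 2, 8}.
TF_W = {(2, "search"): 1.0, (2, "engine"): 1.0,
        (8, "spam"): 1.5, (8, "detection"): 0.5}
IEF = {("$title", "search"): 2.0, ("$title", "engine"): 3.0,
       ("$section", "spam"): 1.0, ("$section", "detection"): 4.0}

pairs = [
    (1, "$article", []),                      # no content condition
    (2, "$title", ["search", "engine"]),
    (8, "$section", ["spam", "detection"]),
]

score = cont_sim(pairs,
                 tf_w=lambda v, t: TF_W[(v, t)],
                 ief=lambda u, t: IEF[(u, t)])
```

Query nodes without a content condition, such as the article node here, contribute nothing to the score.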

Structure Distance Function

Both XML data and queries are modeled as trees

Similarities between trees are often computed by editing distances, i.e., the cost of the cheapest sequence of editing operations that transforms one tree into the other

The structure distance between an answer A and a query Q can be measured as the total cost of the relaxation operations used to derive A

$$\mathrm{struct\_dist}(A, Q) = \sum_{i=1}^{k} \mathrm{cost}(r_i)$$

{r_1, …, r_k}: the set of relaxation operations used to derive A
cost(r_i): the cost of r_i (0 ≤ cost(r_i) ≤ 1)

Relaxation Operation Cost

Naïve approach
  Assign a uniform cost to all relaxation operations
  Simple but ineffective

Our approach
  Assign an operation cost based on the similarity between the two nodes being approximated by the operation
  The closer the two nodes, the less the operation costs

$$\mathrm{cost}(r_i) = 1 - \mathrm{similarity}(\$u, \$v)$$

r_i: a relaxation operation
$u, $v: the two nodes that are being approximated by r_i
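The similarity-based cost and the resulting structure distance (the total cost of the relaxation operations used) can be sketched in Python as follows. How node similarity is computed is not given in the slides, so the similarity values below are hypothetical inputs.

```python
def operation_cost(similarity: float) -> float:
    """cost(r) = 1 - similarity($u, $v), with similarity in [0, 1]."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - similarity


def struct_dist(similarities) -> float:
    """Total cost of the relaxation operations used to derive an answer."""
    return sum(operation_cost(s) for s in similarities)


# Hypothetical similarities of the node pairs approximated by two operations.
used_operations = {
    "relabel: (article, document)": 0.75,
    "delete: (section, body)": 0.5,
}
distance = struct_dist(used_operations.values())  # 0.25 + 0.5
uniform = 2 * 1.0                                 # a uniform cost of 1 per operation
```

Under a uniform cost every operation counts the same, whereas here relabeling article to the similar label document is cheaper than deleting section.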

Nodes Approximated By Relaxation Operations

Relaxation operation   Nodes being approximated ($u, $v)                         Example
Node relabel           (a node with the old label, a node with the new label)    (article, document)
Node deletion          (a child node, the parent node)                           (section, body)
Edge generalization    (a child node, a descendant node)                         (article/title, article//title)

[Figure: query tree T1 (article with children title and body; body with child section) and three relaxed trees: T2, node relabel (article becomes document); T3, node deletion (section is dropped); T4, edge generalization]

Overall Relevancy Function

The overall relevancy of an answer A to a query Q, sim(A, Q), is a function of the content similarity cont_sim(A, Q) and the structure distance struct_dist(A, Q)

Properties
  sim(A, Q) = cont_sim(A, Q) if struct_dist(A, Q) = 0
  sim(A, Q) increases as cont_sim(A, Q) increases
  sim(A, Q) decreases as struct_dist(A, Q) increases

Implementation

$$\mathrm{sim}(A, Q) = \alpha^{\mathrm{struct\_dist}(A, Q)} \ast \mathrm{cont\_sim}(A, Q)$$

α is a small constant between 0 and 1
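A minimal Python sketch of the relevancy function and its three properties follows. The default α = 0.5 and the input scores are hypothetical values chosen for illustration.

```python
def overall_relevancy(cont_sim: float, struct_dist: float, alpha: float = 0.5) -> float:
    """sim(A, Q) = alpha ** struct_dist(A, Q) * cont_sim(A, Q), with 0 < alpha < 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (alpha ** struct_dist) * cont_sim


exact = overall_relevancy(8.5, 0.0)        # no relaxation: equals cont_sim
relaxed = overall_relevancy(8.5, 1.0)      # one full-cost relaxation halves it
more_relaxed = overall_relevancy(8.5, 2.0)
richer = overall_relevancy(10.0, 1.0)      # same distance, better content match
```

Because α is below 1, each additional unit of structure distance scales the content score down by a factor of α, so a smaller α penalizes relaxation more heavily.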

Relaxation Indexes Relaxation Language Ranking Evaluation Testbed

Evaluation Studies

INEX (Initiative for the Evaluation of XML Retrieval)
  Similar to TREC for text retrieval

Document collections
  Scientific articles from the IEEE Computer Society, 1995-2002
  About 500 MB
  Each article consists of 1500 XML nodes on average

Queries
  Strict content and structure (SCAS)
  Vague content and structure (VCAS)

Gold standard
  Relevance assessments provided by INEX

Evaluation of Content Similarity

Dataset: INEX 03 test collection
Query set: 30 SCAS queries
Comparisons: 38 submissions in INEX 03

[Figure: precision-recall curve (x-axis: Recall, from 0 to 1); average precision: 0.3309]

Evaluation of the Cost Model

Dataset: INEX 05 test collection
Query set: 22 simple VCAS queries
Evaluation metric: normalized extended cumulative gain (nxCG)
  The official evaluation metric used in INEX 05
  Given a number i (i ≥ 1), nxCG@i, similar to precision@i, measures the relative gain users have accumulated up to rank i
  E.g., nxCG@10, nxCG@25, nxCG@50, …

Cost models
  UCost: uniform cost for each relaxation operation (baseline)
  SCost: our proposed cost model
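The nxCG@i metric described above can be sketched in Python as follows, assuming the usual form: the gain a run accumulates up to rank i divided by the gain an ideal ranking would accumulate up to the same rank. The official INEX 05 definition involves further details, such as how gains are derived from the relevance assessments, that the slides do not spell out and this sketch does not cover. The gain values are hypothetical.

```python
def nxcg(run_gains, pool_gains, i: int) -> float:
    """Cumulated gain of the run up to rank i, relative to an ideal ranking.

    run_gains:  gain of each returned result, in rank order.
    pool_gains: gains of all assessed results for the topic, in any order.
    """
    if i < 1:
        raise ValueError("rank i must be at least 1")
    achieved = sum(run_gains[:i])
    ideal = sum(sorted(pool_gains, reverse=True)[:i])
    return achieved / ideal if ideal > 0 else 0.0


# Hypothetical topic with three relevant results in the assessment pool.
pool = [1.0, 1.0, 1.0, 0.0, 0.0]
run = [1.0, 0.0, 1.0, 0.0]

at_2 = nxcg(run, pool, 2)                    # 1 of an ideal 2
at_4 = nxcg(run, pool, 4)                    # 2 of an ideal 3
perfect = nxcg([1.0, 1.0, 1.0], pool, 3)     # ideal ranking
```

A value of 1.0 at rank i means the run has accumulated as much gain by rank i as the ideal ranking would.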

Evaluation of the Cost Model: Results

Retrieval performance improvements with the semantic cost model

Query set: all content-and-structure queries in INEX 05

Each cell: nxCG@10 for a given pair (α, cost model), with the percentage improvement over the baseline in parentheses

Cost model \ α     0.1               0.3               0.5               0.7            0.9
UCost (uniform)    0.2584            0.2616            0.2828            0.2894         0.2916
SCost (semantic)   0.3319 (+28.44%)  0.3190 (+21.94%)  0.3196 (+13.04%)  0.3068 (+6%)   0.2957 (+4.08%)

$$\mathrm{sim}(A, Q) = \alpha^{\mathrm{struct\_dist}(A, Q)} \ast \mathrm{cont\_sim}(A, Q)$$

Utilizing node similarities to distinguish the costs of different operations improves retrieval performance!

Similar results are observed using nxCG@25 and nxCG@50

Expressiveness of the Relaxation Language

INEX 05 Topic 267

<inex_topic topic_id="267" query_type="CAS">
  <castitle>//article//fm//atl[about(., "digital libraries")]</castitle>
  <description>Articles containing "digital libraries" in their title.</description>
  <narrative>I'm interested in articles discussing Digital Libraries as their main subject. Therefore I require that the title of any relevant article mentions "digital library" explicitly. Documents that mention digital libraries only under the bibliography are not relevant, as well as documents that do not have the phrase "digital library" in their title.</narrative>
</inex_topic>

Expressing Topic 267 using RLXQuery

FOR $a in doc("inex.xml")//article
LET $b = $a//fm//!atl REJECT(fm, bb)
WHERE $b[about(., "digital libraries")]
RETURN $b

Effectiveness of the Relaxation Control

Method                    nxCG@10   nxCG@25
No relaxation control     0.1013    0.2365
With relaxation control   1.0       0.8986

Relaxation control enables the system to provide answers with greater relevancy: with relaxation control, nxCG@10 reaches 1.0 (perfect accuracy)

Evaluation of the Ranking Function

Dataset: INEX 05 test collection
Query set: 4 official VCAS queries with available relevance assessments
Comparison: top-1 submission in INEX 05

Results

Topic     nxCG@10 (Top-1)   nxCG@10 (CoXML)   nxCG@25 (Top-1)   nxCG@25 (CoXML)
256       0.4293            0.4248            0.4733            0.5555
264       0.0               0.0069            0.0               0.0033
275       0.7715            0.638             0.589             0.5922
284       0.0               0.1259            0.0               0.1233
Average   0.3002 (+0.4%)    0.2989            0.2656            0.3186 (+20%)

The systematic relaxation approach enables our system to derive more approximate answers!

Our ranking function, based on both content and structure relevancy, outperforms other ranking functions using content similarities only!

Relaxation Indexes – XTAH Relaxation Language – RLXQuery Ranking Evaluation Testbed

CoXML Testbed

Team Members: Prof. Chu, S. Liu, T. Lee, E. Sung, C. Cardenas, A. Putnam, J. Chen, R. Shahinian

[Figure: testbed architecture. Components: RLXQuery Preprocessor, RLXQuery Parser, Relaxation Controller, Relaxation Manager, Database Manager, Ranking Module, Relaxation Index Builder (XTAH), XML Database Engine, XML Document. Input: RLXQuery; output: Approximate Answers]

Relaxation Examples using the Testbed

Related Work: Query Relaxation

Relaxation based on schema conversions ([LC01, LMC01], [LMC03])
  No structure relaxation

Native XML relaxation
  Propose structure relaxation types [e.g., KS01, ACS02]
  We use the relaxation types introduced in [ACS02]
  Investigate efficient algorithms for deriving top-K answers based on the relaxation types supported [e.g., Sch02, ACS02, ALP04, AKM05]
  No relaxation control

Related Work: XML Ranking

Content ranking
  Most extend ranking models for text retrieval to the XML scenario, e.g., HyRex, XXL, JuruXML, XSearch
  We utilize structure to distinguish terms of different weights occurring in different parts of a document

Structure ranking
  Based on tree editing distance algorithms without considering operation cost [NJ02]
  Based on the occurrence frequency of the query trees, paths, or predicates in data [MAK05, AKM05]
  Our structure ranking is similar to editing distance, but we consider operation cost

Conclusion

Cooperative XML (CoXML) query answering
  RLXQuery enables users to effectively express approximate query conditions and to control the approximate matching process
  XTAH provides systematic query relaxation guidance
  Both content and structure similarity metrics are used to evaluate the relevancy of approximate answers
  Evaluation studies with the INEX test collections demonstrate the effectiveness of our methodology
