Extending PRIX for Similarity-based XML Query

Group Members: Yan Qi, Jicheng Zhao, Dan Situ, Ning Liao

Agenda
- System Architecture Introduction
- Semantic-based Similarity Search
  - Query Expansion
  - Semantic Similarity Computation
- Structural-based Similarity Search
  - Adapting PRIX algorithm
    - Indexing
    - Query Processing
  - Structural Similarity Computation
- Similarity Computation and Ranking
- Discussion & Conclusion

System Architecture Introduction

[Figure: system architecture. Components: Data Parser, Data Storage Manager, Index Manager, Metadata Manager, Query Parser, Query Extending, Query Processing, Results Ranking. Data: XML data, XQuery query, query extensions, query result. Two flows are shown: the XML loading flow and the XML query flow (steps 1 - 8).]

Query Expansion (I)

An Example:

Tags in a sample query

{title, Praveen Rao, information retrieval}

Keywords

{title, Praveen, Rao, information, retrieval}

Keyword Extensions

{{title, status title, deed, claim, entity, style}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

Valid Keyword Extensions

{{title, claim, entity}, {Praveen}, {Rao}, {data, entropy, information}, {retrieval, recovery}}

(Continued on next page)

[Flowchart: query expansion pipeline, starting from the XML query]

1. Extract all the tags from the query -> Tag sequence
2. Extract meaningful keywords for each tag in the tag sequence (ignoring stop words) -> Keyword sequence
3. Get keyword extensions for each keyword in the keyword sequence based upon WordNet -> Keyword extensions
4. Remove the keywords that do not exist in the database -> Valid keyword extensions
5. Take the full combination of the keyword extensions in each tag -> Tag extensions
6. Remove a tag extension if not all of its keywords appear at least once in a tag of the database; replace each remaining tag extension with the database tags whose keyword set is a superset of the tag extension's keyword set -> Valid tag extensions
7. Take the full combination of all the tags in the original query -> Query extensions
8. Remove the query extensions whose tags do not appear in the same XML document of the database -> Valid query extensions
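The expansion steps in the flowchart above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SYNONYMS` is a hand-written stand-in for WordNet, `DOCUMENTS` is a toy database in which each document is just a list of tag texts, and `STOP_WORDS`, `keywords` and `expand_query` are hypothetical names that do not come from the slides.

```python
from itertools import product

# Hypothetical stand-ins: a tiny synonym table instead of WordNet,
# and a toy database where each document is a list of tag texts.
STOP_WORDS = {"a", "on", "of", "the"}
SYNONYMS = {
    "title": ["deed", "claim", "entity", "style"],
    "information": ["data", "entropy"],
    "retrieval": ["recovery"],
}
DOCUMENTS = [
    ["title", "Praveen Rao", "modern information retrieval"],
    ["A claim on theory of computation", "entity",
     "A survey on information retrieval"],
    ["information recovery"],
]


def keywords(tag):
    """Lower-cased keywords of a tag text, with stop words removed."""
    return [w for w in tag.lower().split() if w not in STOP_WORDS]


def expand_query(tags, documents=DOCUMENTS, synonyms=SYNONYMS):
    """Return the valid query extensions for a list of query tag texts."""
    db_tags = [t for doc in documents for t in doc]
    vocabulary = {k for t in db_tags for k in keywords(t)}

    valid_tag_extensions = []
    for tag in tags:
        # Keyword extensions, keeping only keywords present in the database.
        per_keyword = [
            [k for k in [kw] + synonyms.get(kw, []) if k in vocabulary]
            for kw in keywords(tag)
        ]
        # Full combination per tag; each combination is replaced by the
        # database tags whose keyword set is a superset of it.
        matches = []
        for combo in product(*per_keyword):
            for db_tag in db_tags:
                if set(combo) <= set(keywords(db_tag)) and db_tag not in matches:
                    matches.append(db_tag)
        valid_tag_extensions.append(matches)

    # Full combination over tags, keeping only the query extensions whose
    # tags all occur in the same document.
    return [
        q for q in product(*valid_tag_extensions)
        if any(all(t in doc for t in q) for doc in documents)
    ]


print(expand_query(["title", "Praveen Rao", "information retrieval"]))
```

With this toy data only one of the nine candidate query extensions survives the same-document check.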

Query Expansion (II)

Tag Extensions

{{title}, {claim}, {entity}, {Praveen}, {Rao}, {data, retrieval}, {data, recovery}, {information, retrieval}, {information, recovery}, {entropy, retrieval}, {entropy, recovery}}

Valid Tag Extensions

{{title}, {A claim on theory of computation}, {entity}, {Praveen Rao}, {modern information retrieval}, {A survey on information retrieval}, {information recovery}}

Query Expansions

1. { {title}, {Praveen Rao}, {modern information retrieval} }

2. { {A claim on theory of computation}, {Praveen Rao}, {modern information retrieval} } ...

Valid Queries

{ {title}, {Praveen Rao}, {modern information retrieval} }

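Semantic Similarity Computation

Similarity between query q and one of its extensions q':

\[ sim_{query}(q, q') = \frac{1}{n} \sum_{t \in q,\ t' \in q'} sim(t, t') \]

\[ sim(t, t') = \frac{1}{m} \sum_{i=1}^{m} x_i \]

\[ x_i = \begin{cases} 1, & \text{if } k_i = k_i' \\ \alpha \ (0 \le \alpha < 1), & \text{if } k_i \ne k_i' \end{cases} \]

t: tag in query q

t': tag in query q'

n: number of tags in q

m: number of keywords in tag t

k_i, k_i': the i-th keywords of t and t'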
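A minimal sketch of the semantic similarity computation, read as follows: each keyword pair scores 1 if the two keywords are equal and α otherwise, a tag's similarity is the mean over its m keywords, and the query similarity is the mean over the n tags. The positional pairing of tags and keywords, the value `alpha=0.5`, and the function names are assumptions made for illustration.

```python
def tag_similarity(tag_keywords, ext_keywords, alpha=0.5):
    """Mean keyword score: 1 for an unchanged keyword, alpha for a replaced one."""
    m = len(tag_keywords)
    scores = [1.0 if k == k2 else alpha
              for k, k2 in zip(tag_keywords, ext_keywords)]
    return sum(scores) / m


def query_similarity(query, extension, alpha=0.5):
    """Mean tag similarity over the n tags of the original query."""
    n = len(query)
    return sum(tag_similarity(t, t2, alpha)
               for t, t2 in zip(query, extension)) / n


# Each query is a list of tags; each tag is a list of keywords.
q = [["title"], ["praveen", "rao"], ["information", "retrieval"]]
q_ext = [["claim"], ["praveen", "rao"], ["information", "recovery"]]
print(query_similarity(q, q_ext))  # tag scores 0.5, 1.0 and 0.75
```

An extension identical to the original query scores 1, and every replaced keyword pulls the score towards α.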

Indexing: PRIX (PRüfer sequences for Indexing XML)

[Figure: a document tree and a query tree pattern, each converted to sequences by the node removal method. Nodes are labeled (label, number); the labels visible in the figure are (b,3), (d,5), (c,2), (f,4), (b,1), (d,2).]

Document tree:

LPS: b, c, b, a, f, d, a

NPS: 1, 2, 3, 6, 4, 5, 6

Query tree pattern:

LPS: b, a, d, a

NPS: 1, 3, 2, 3
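The LPS and NPS values listed above can be reproduced with a short sketch of the node removal method. Several things here are assumptions: the tree shapes are inferred from the sequences (document tree a(b(c(b)), d(f)), query tree a(b, d)), nodes are numbered in postorder, and each leaf is given an unnumbered dummy child so that leaves also appear in the sequences. The function name and tree encoding are hypothetical.

```python
def prufer_sequences(tree):
    """Return (LPS, NPS) for a tree encoded as (label, [subtrees]).

    Nodes are numbered in postorder. Removing nodes in increasing number
    order records the label and number of each removed node's parent.
    Every leaf is treated as having one unnumbered dummy child, so the
    leaf itself is recorded just before the leaf is removed.
    """
    nodes = []  # [label, parent_number, is_leaf], index = number - 1

    def number(node):
        label, children = node
        child_numbers = [number(child) for child in children]
        nodes.append([label, None, not children])
        my_number = len(nodes)  # postorder number, 1-based
        for c in child_numbers:
            nodes[c - 1][1] = my_number
        return my_number

    number(tree)

    lps, nps = [], []
    for num, (label, parent, is_leaf) in enumerate(nodes, start=1):
        if is_leaf:  # removing the dummy child records the leaf
            lps.append(label)
            nps.append(num)
        if parent is not None:  # removing the node records its parent
            lps.append(nodes[parent - 1][0])
            nps.append(parent)
    return lps, nps


document = ("a", [("b", [("c", [("b", [])])]), ("d", [("f", [])])])
query = ("a", [("b", []), ("d", [])])
print(prufer_sequences(document))
print(prufer_sequences(query))
```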

Indexing: PRIX (PRüfer sequences for Indexing XML)

AD-Label (Ancestor-Descendant)

Indexing structure in DB

[Figure: index structure. Labels: "name" tag B+-tree; node B+-tree.]

Query Processing

Procedure

- Filtering
  - Based on subsequence matching
  - O(n*n*m): n is the number of nodes in the document; m is the number of nodes in the query
- Refinement
  - Connectivity
  - Gap Consistency
  - Frequency Consistency

Subsequence Matching

Definition

- Example:

* Good results: media, mult, mm, ted, tia, etc…

Why does it work? Subsequence matching alone is not enough; more refinements are needed…
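A minimal sketch of the subsequence test used by the filtering step. The source string "multimedia" is an assumption: the slide's example text did not survive, but every listed good result is a subsequence of that word. The same test applies to label sequences, shown here with the document and query LPS values from the indexing slide.

```python
def is_subsequence(pattern, sequence):
    """True if pattern can be obtained from sequence by deleting elements."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match,
    # so later elements must appear after earlier ones.
    return all(item in remaining for item in pattern)


# Assumed example word; the listed good results are all subsequences of it.
for candidate in ["media", "mult", "mm", "ted", "tia"]:
    print(candidate, is_subsequence(candidate, "multimedia"))

# Query LPS against document LPS (values from the indexing slide).
document_lps = ["b", "c", "b", "a", "f", "d", "a"]
query_lps = ["b", "a", "d", "a"]
print(is_subsequence(query_lps, document_lps))
```

A match here only shows that the labels occur in the right order, which is why the refinement steps are still needed afterwards.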

Refinement #1

Concept of Dummy Nodes

- PRIX offers only partial match

- Solution: extend PRIX to the leaf level

- Example:

Refinement #2

Connection vs Connectionless

- Definition

- How to check it?

- If not connected, then what?

- Solution: apply penalty

- Example (Disconnected By Gap):

- Example (Disconnected By Unknown):

Refinement #3

Checking for Gap Consistency

- Gap Consistency depends on the gaps of the Prüfer sequence

- How to check it?

- Determines whether the query tree is a subset of the search domain

Refinement #4

Checking for Frequency Consistency

- Frequency consistency depends on Gap Consistency and occurrences of NPS

- How to check it?

- Determines whether the query tree is an exact match in the search domain

- If not frequency consistent, then what?

- Solution: apply penalty

Structure Similarity

Calculations are based on edit distances, which are transformed into penalty values

Each mismatched node in the structure has a penalty equal to the size of its subtree + 1

Overall penalty is the dot product of all mismatches

All results are normalized with respect to the worst-case penalty

Structural Similarity #1: Connectivity

\[ NPS(S_i) = \{ s_{i,1}, s_{i,2}, \ldots, s_{i,n} \} \]

Terms: $Size_{children}(x)$; $Last(s_{i,j})$; $Re_{parent\&child}(s_i, s_j)$ (tested against 0)

$O(m \cdot n)$, where m is the number of subsequences from the filter.

\[ sim_{connection}(S_k) = \sum_{i=1}^{n} Last(s_{k,i}) \cdot Re_{parent\&child}(s_{k,i}, s_{k,i+1}) \cdot Size_{children}(s_{k,i}) \]

Structural Similarity #2: Gap Similarity

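\[ NPS(Q) = \{ q_1, q_2, \ldots, q_n \} \]

\[ NPS(S_i) = \{ s_{i,1}, s_{i,2}, \ldots, s_{i,n} \} \]

Terms: $sgn_{gap}(\cdot)$ (tested against 0)

$O(n \cdot m)$

\[ sim_{gap}(Q, S_t) = \sum_{i=1}^{n} sgn_{gap}\big( (s_{t,i+1} - s_{t,i}) - (q_{i+1} - q_i) \big) \]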

Structural Similarity #3: Frequency Similarity

$Pos(Q, i) = (b_{i,1}, b_{i,2}, \ldots, b_{i,n})$, with $b_{i,j} \in \{0, 1\}$ and $1 \le j \le n$, describes the positional information of the i-th element in the NPS of Q. When $b_{i,j} = 1$, it represents that $q_i$ equals $q_j$.

$num(Pos(Q, i))$ represents the number of '1's in $Pos(Q, i)$.

$O(n \cdot m)$

\[ sim_{frequency}(Q, S_t) = \sum_{i=1}^{n} num\big( Pos(S_t, i) \oplus Pos(Q, i) \big) \]

Rank returned XML patterns

Similarity(q, q'') = Semantic_sim(q, q') * Structure_sim(q', q'')
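A minimal sketch of the ranking step: each returned pattern q'' is scored by the product of the semantic similarity of the query extension q' that produced it and the structural similarity between q' and q'', and the patterns are sorted by that score. The candidate tuples, their values, and the function name are hypothetical.

```python
def rank_patterns(candidates):
    """Sort (pattern, semantic_sim, structure_sim) tuples by combined score.

    The combined score is semantic_sim * structure_sim, highest first.
    """
    scored = [(semantic * structure, pattern)
              for pattern, semantic, structure in candidates]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical candidates: (pattern id, Semantic_sim, Structure_sim).
candidates = [
    ("pattern-1", 1.0, 0.5),
    ("pattern-2", 0.75, 0.8),
    ("pattern-3", 0.5, 0.5),
]
for score, pattern in rank_patterns(candidates):
    print(pattern, round(score, 3))
```

Because the two similarities are multiplied, a pattern has to do reasonably well on both to rank highly.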

Advantages of the approach

PRIX indexing
- Faster
- Captures all structural information

Similarity based
- Structure similarity
- Semantic similarity

Limitations and Extensions

Limitation of PRIX: ordering of nodes
- We need to handle it in query extension
- Example: the same nodes in a different sibling order give different sequences (baca vs. caba)

Limitations and Extensions

More limitations of PRIX:
- It is difficult to map intuitive structural similarities in trees to sequence similarities in PRIX sequences, and thus difficult to give accurate definitions of the similarity

However:
- Translating tree structures into equivalent sequences, and then doing data mining or similarity matching on the sequences, is a promising direction

Limitations and Extensions

Limitations of semantic similarity:
- Too many similar results

However:
- We consider semantic similarity together with structure information

In a broad sense:
- Structure similarity
- Semantic similarity
- Syntax similarity
- Similarity information from co-occurrences of keywords
- Similarity information from user feedback
- Similarity information from metadata (DTD, data source, region, language, link structure of XML files, etc.)
