Page 1: Indexing and Searching XML Documents based on Content and Structure Synopses

BNCOD07 Indexing & Searching XML Documents based on Content and Structure Synopses 1

Indexing and Searching XML Documents based on Content and Structure Synopses

Weimin He, Leonidas Fegaras, David Levine

University of Texas at Arlington

http://lambda.uta.edu

Page 2: Indexing and Searching XML Documents based on Content and Structure Synopses

Outline

• Motivation
• Key Contributions
• Related Work
• Data Synopses Indexing
• Query Processing
• Experimental Results
• Conclusion

Page 3: Indexing and Searching XML Documents based on Content and Structure Synopses

Why not Google?

• Need to query both structure and content
  – an opportunity for more precise search
• Keyword queries are NOT adequate for XML search
  – An example query beyond Google: Find the price of the book whose author’s lastname is “Smith” and whose title contains “XML” and “SAX”
  – Semantic search using an XPath query: //book[author/lastname ~ “Smith”][title ~ “XML” and “SAX”]/price
• Simpler query formats cannot express complex containment relationships: [ (lastname, Smith), (title, XML & SAX), price ]
• Fully indexing XML data is neither efficient nor scalable

Page 4: Indexing and Searching XML Documents based on Content and Structure Synopses

Key Contributions

• A framework for indexing and searching schema-less XML documents based on data synopses extracted from documents

• Two novel data synopsis structures that can achieve higher query precision and scalability

• A hash-based processing algorithm to speed up searching
• A prototype implementation to evaluate the performance of the indexing scheme and to validate the data synopsis precision

Page 5: Indexing and Searching XML Documents based on Content and Structure Synopses

Related Work

• Extend keyword queries to XML
  – XRank
  – XKSearch
• Integrate IR constructs and scoring into XQuery
  – TIX
  – TeXQuery
• XML Summarization Techniques
  – XSketch
  – XCluster

Page 6: Indexing and Searching XML Documents based on Content and Structure Synopses

System Architecture

[Figure: system architecture diagram. Labeled boxes: Query Client, Server, XML Document Repository, Meta-Data Indexer, Query Footprint Extractor, Structural Summary Matcher, Query Optimizer, Query Processor. Labeled data: Structural Summaries, Data Synopses, Document Synopses, Query Footprint, Matching Structural Summaries & Label Paths. Input: Full-Text XPath Query. Output: A List of Document Locations.]

Page 7: Indexing and Searching XML Documents based on Content and Structure Synopses

Specification of Search Queries

• XPath is extended with a simple IR syntax:
  Queries may contain predicates of the form: e ~ S
  – e is an XPath expression
  – S is a search predicate that takes the form: “term” | S1 and S2 | S1 or S2 | (S)
• A running query example: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
• Query result: a list of document locations (path names) that satisfy the query
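The search-predicate grammar can be read as a small boolean expression language over the terms of an element's text. A minimal Python sketch, assuming the predicate has already been parsed into nested tuples; the tuple encoding and the `matches` helper are illustrative choices, not part of the slides:

```python
from typing import Set, Tuple, Union

# Hypothetical encoding of S: a bare string is a "term";
# ("and", S1, S2) and ("or", S1, S2) are the compound forms.
Pred = Union[str, Tuple[str, "Pred", "Pred"]]

def matches(pred: Pred, terms: Set[str]) -> bool:
    """Return True if the element text (given as a set of terms) satisfies pred."""
    if isinstance(pred, str):
        return pred.lower() in terms
    op, left, right = pred
    if op == "and":
        return matches(left, terms) and matches(right, terms)
    if op == "or":
        return matches(left, terms) or matches(right, terms)
    raise ValueError(f"unknown operator: {op}")

# The description predicate of the running query: "mountain" and "bicycle".
description = set("a sturdy mountain bicycle with new tires".split())
running = ("and", "mountain", "bicycle")
print(matches(running, description))
```

This evaluates a predicate exactly against the text; the synopses described in the slides approximate the same test with bit matrices.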

Page 8: Indexing and Searching XML Documents based on Content and Structure Synopses

Data Indexing

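• Structural Summary (SS)
  – A tree that captures all unique paths in an XML document
  – It is constructed from XML data incrementally
  – Each SSnode# corresponds to a unique full label path, for example: 9: /auction/sponsor/address

[Figure: example structural summary tree with nodes numbered 1 to 10. Root: auction (1), with children item (2) and sponsor (8). The item subtree (nodes 3 to 7) carries the labels description, location, name, payment, price. sponsor has children address (9) and name (10).]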
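The incremental construction of a structural summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `add_document` helper, the dictionary representation, and the sample document are hypothetical, and nodes are numbered in order of first appearance (which coincides with preorder for a single document, but a real system would have to renumber when later documents add paths).

```python
import xml.etree.ElementTree as ET
from typing import Dict

def add_document(summary: Dict[str, int], xml_text: str) -> None:
    """Add every unique full label path of one document to the summary.

    `summary` maps a full label path to its node number; unseen paths get
    the next free number, so the summary grows incrementally.
    """
    def visit(elem: ET.Element, prefix: str) -> None:
        path = f"{prefix}/{elem.tag}"
        if path not in summary:
            summary[path] = len(summary) + 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), "")

# Hypothetical sample document shaped like the auction example.
doc = """<auction>
  <item><description>mountain bicycle</description><location>Dallas</location>
        <name>bike</name><payment>Cash</payment><price>100</price></item>
  <item><location>Austin</location><price>80</price></item>
  <sponsor><address>Main St</address><name>ACME</name></sponsor>
</auction>"""

summary: Dict[str, int] = {}
add_document(summary, doc)
for path, number in summary.items():
    print(number, path)
```

The second `item` element adds nothing new, because its paths are already present: the summary records each distinct label path once, regardless of how often it occurs.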

Page 9: Indexing and Searching XML Documents based on Content and Structure Synopses

Data Indexing (cont.)

• Content Synopsis (CS)
  – Summarizes the text associated with an SS node in an XML document
  – Approximated as a bit matrix of size W×L
    • L is fixed but W may depend on the document size
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in evaluating search predicates in the query
• Positional Filter (PF)
  – Captures the position spans of all XML elements associated with an SS node in an XML document
  – Represented as a bit matrix of size M×L, where M ≥ 2
  – Stored as a B+-tree that implements the mapping (SSnode#, doc#) → bit-matrix
  – Used in enforcing containment constraints among query predicates
• Do we need the positional dimension?
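A content synopsis can be pictured as a two-dimensional Bloom-filter-like structure: one axis for hashed terms, one for document positions. The sketch below is a rough illustration, not the authors' implementation; the dimensions, the use of CRC32 as the term hash, and the linear scaling of positions into L buckets are all assumptions.

```python
import zlib
from typing import Iterable, List, Tuple

W, L = 16, 8  # hypothetical sizes: W term buckets, L position buckets

def term_bucket(term: str, width: int = W) -> int:
    # Deterministic stand-in for the slides' hash(term); CRC32 is an assumption.
    return zlib.crc32(term.lower().encode("utf-8")) % width

def build_content_synopsis(terms_with_pos: Iterable[Tuple[str, int]],
                           doc_length: int) -> List[int]:
    """Return W rows, each an L-bit integer; bit j of row i is set when some
    term hashing to bucket i occurs in the j-th slice of the document."""
    rows = [0] * W
    for term, pos in terms_with_pos:
        col = min(pos * L // doc_length, L - 1)
        rows[term_bucket(term)] |= 1 << col
    return rows

def may_contain(rows: List[int], term: str) -> int:
    """Bit vector of position buckets where `term` may occur (0 means absent)."""
    return rows[term_bucket(term)]

# Text of one /auction/item/location node: terms with their document positions.
cs = build_content_synopsis([("Dallas", 120), ("Texas", 128)], doc_length=1000)
print(bin(may_contain(cs, "Dallas")))
```

As with any Bloom-filter-style structure, a set bit only means the term may occur there; two terms that hash to the same row are indistinguishable, which is the source of false positives.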

Page 10: Indexing and Searching XML Documents based on Content and Structure Synopses

Data Synopsis Example

Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price

[Figure: three bit matrices for the running query.
 Content Synopsis for /auction/item/location (axes: Term, Document Position), with hash(Dallas) = 13.
 Positional Filter for /auction/item (axis: Document Position).
 Content Synopsis for /auction/item/description (axes: Term, Document Position), with hash(mountain) = 2 and hash(bicycle) = 11.]

Page 11: Indexing and Searching XML Documents based on Content and Structure Synopses

Containment Filtering

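Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price

[Figure: testing the running query using data synopses. The query tree has item (F2) at the root, with location (“Dallas”) and description (“mountain”, “bicycle”) below it. Filtering steps shown: CF(F2, H4[Dallas]) and CF(A, and(H3[mountain], H3[bicycle])), with intermediate results labeled A and B.]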
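The slides do not spell out how CF operates on the bit matrices. The Python sketch below shows one plausible reading, offered purely as an assumption: each run of 1-bits in a positional-filter row stands for one element's position span, and a run survives only if the term bit vector has a 1-bit somewhere inside it. Treating `and` as a bitwise AND of two term vectors is likewise an assumption, and all matrices are made-up toy data.

```python
from typing import Iterator, List, Tuple

def runs(row: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs of maximal runs of 1-bits in a row."""
    start = None
    for i, bit in enumerate(row + [0]):  # sentinel 0 closes a trailing run
        if bit and start is None:
            start = i
        elif not bit and start is not None:
            yield start, i
            start = None

def containment_filter(pf: List[List[int]], vec: List[int]) -> List[List[int]]:
    """Keep each run of 1-bits in the positional filter only if the term
    vector has at least one 1-bit inside that run (assumed semantics)."""
    out = [[0] * len(row) for row in pf]
    for r, row in enumerate(pf):
        for s, e in runs(row):
            if any(vec[s:e]):
                out[r][s:e] = [1] * (e - s)
    return out

def and_vectors(a: List[int], b: List[int]) -> List[int]:
    return [x & y for x, y in zip(a, b)]

# Toy positional filter for item (M = 2, L = 8): two item elements.
F2 = [[1, 1, 1, 0, 0, 0, 0, 0],
      [0, 0, 0, 1, 1, 1, 0, 0]]
dallas   = [0, 1, 0, 0, 0, 0, 0, 0]   # term row for "Dallas" under location
mountain = [0, 0, 1, 0, 1, 0, 0, 0]   # term rows under description
bicycle  = [0, 0, 1, 0, 0, 0, 0, 0]

A = containment_filter(F2, dallas)
B = containment_filter(A, and_vectors(mountain, bicycle))
print(B)
```

In this toy run only the first item survives both steps; a result with no bits set would mean the document does not qualify.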

Page 12: Indexing and Searching XML Documents based on Content and Structure Synopses

Query Processing Overview

• Query Footprint (QF) Extraction
  – Query: //auction//item[location ~ “Dallas”][description ~ “mountain” and “bicycle”]/price
  – QF: //auction//item:0[location: 1][description: 2]/price
• Structural Summary Matching
  – Retrieve all structural summaries that match the QF
    • We use the standard preorder numbering scheme to represent an SS
    • An SS is stored as a B+-tree that implements the mapping: tag → {(SS#, SSnode#, begin, end, level)}
    • We use containment joins to retrieve the qualified full label paths that match the entry points in the QF: [ /auction/item, /auction/item/location, /auction/item/description ]
• Containment Filtering
  – The unit of query processing is a mapping from a doc# to a bit matrix of size M×L (positions)
  – An empty bit matrix means an unqualified document
• Qualified document locations are collected and returned
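Structural summary matching relies on the (begin, end, level) encoding: one node contains another when its interval encloses the other's. A minimal Python sketch of such a containment join follows; the `Entry` type, the in-memory dictionary standing in for the B+-tree, the interval values, and the node number used for location are all illustrative assumptions, and the nested loop stands in for the merge-style joins a real system would use.

```python
from typing import Dict, List, NamedTuple

class Entry(NamedTuple):
    ss: int      # SS#
    node: int    # SSnode#
    begin: int   # preorder number of the node
    end: int     # largest preorder number in the node's subtree (assumed encoding)
    level: int   # depth in the summary tree, root = 1

def contains(anc: Entry, desc: Entry, parent_only: bool = False) -> bool:
    """True if anc is an ancestor (or, with parent_only, the parent) of desc."""
    inside = anc.ss == desc.ss and anc.begin < desc.begin and desc.end <= anc.end
    return inside and (not parent_only or desc.level == anc.level + 1)

def containment_join(ancs: List[Entry], descs: List[Entry],
                     parent_only: bool = False) -> List[Entry]:
    """Return the descendants that have at least one qualifying ancestor."""
    return [d for d in descs if any(contains(a, d, parent_only) for a in ancs)]

# Hypothetical tag index for one structural summary (SS# 1).
index: Dict[str, List[Entry]] = {
    "auction":  [Entry(1, 1, 1, 10, 1)],
    "item":     [Entry(1, 2, 2, 7, 2)],
    "location": [Entry(1, 4, 4, 4, 3)],
    "name":     [Entry(1, 5, 5, 5, 3), Entry(1, 10, 10, 10, 3)],
    "sponsor":  [Entry(1, 8, 8, 10, 2)],
}

items = containment_join(index["auction"], index["item"])        # //auction//item
locations = containment_join(items, index["location"], True)     # item/location
item_names = containment_join(items, index["name"], True)        # item/name
print([e.node for e in items], [e.node for e in locations], [e.node for e in item_names])
```

Note how the join distinguishes the two `name` nodes: only the one inside the item interval qualifies, while the one under sponsor is rejected.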

Page 13: Indexing and Searching XML Documents based on Content and Structure Synopses

Two-Phase Containment Filtering

• Many sources of inefficiency:
  – A large number of full label paths may match a single generic XPath query
  – A long list of data synopses has to be retrieved for each label path in a QF
  – The retrieved lists of data synopses have to be correlated at each step during containment filtering
• Solution:
  – Aggregate data synopses lists from multiple documents into a single bit matrix of size W×D, called a Document Synopsis, with the mapping path → bit-matrix, so that, given a term t and a full label path p, the document doc# is a candidate if the document synopsis for p is set at [hash(t), hash(doc#)]
  – Need a two-phase containment filtering algorithm to prune unqualified document locations before the actual containment filtering

Page 14: Indexing and Searching XML Documents based on Content and Structure Synopses

Document Synopsis

[Figure: the document synopsis for /biblio/book/paragraph, a bit matrix with axes Term and Document ID. Annotations: hash(“XML”) = 2; hash(“science”) = 11; hash(“computer”) = 11; doc 12 mod 16 = 12; doc 105 mod 16 = 9; doc 121 mod 16 = 9; doc 137 mod 16 = 9.]
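The pruning role of a document synopsis can be sketched as follows. This is a simplified illustration under assumptions: the class and method names are hypothetical, CRC32 stands in for the term hash, and the document hash is taken to be doc# mod 16, as the annotations on this slide suggest.

```python
import zlib
from typing import List

W, D = 16, 16  # hypothetical sizes: W term buckets, D document buckets

def term_hash(term: str) -> int:
    # Stand-in for hash(t); CRC32 is an assumption.
    return zlib.crc32(term.lower().encode("utf-8")) % W

def doc_hash(doc_id: int) -> int:
    return doc_id % D

class DocumentSynopsis:
    """One W x D bit matrix per full label path (here: a single path)."""

    def __init__(self) -> None:
        self.rows: List[int] = [0] * W

    def add(self, term: str, doc_id: int) -> None:
        self.rows[term_hash(term)] |= 1 << doc_hash(doc_id)

    def is_candidate(self, term: str, doc_id: int) -> bool:
        return bool((self.rows[term_hash(term)] >> doc_hash(doc_id)) & 1)

# Synopsis for one path; "XML" occurs in documents 12 and 105 only.
ds = DocumentSynopsis()
ds.add("XML", 12)
ds.add("XML", 105)

# Phase one: prune documents before the expensive containment filtering.
candidates = [d for d in (12, 40, 105, 121) if ds.is_candidate("XML", d)]
print(candidates)
```

Document 121 survives the first phase although it was never added, because 121 and 105 share a bucket (both are 9 mod 16). Such false positives are left for the second phase, the actual containment filtering, to eliminate; document 40 is pruned immediately.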

Page 15: Indexing and Searching XML Documents based on Content and Structure Synopses

Experimental Setup

• A prototype system is implemented in Java
• Employed Berkeley DB Java Edition 3.2.13 as a storage manager
• Datasets
  – XMark
  – XBench

Data Set   Data Size (MB)   Files   Avg. File Size (KB)   Avg. SS Size (Byte)   Avg. CS Size (Byte)   Avg. PF Size (Byte)
XBench     1050             2666    394                   432                   20564                 178
XMark      55.8             11500   5                     417                   306                   16

Page 16: Indexing and Searching XML Documents based on Content and Structure Synopses

Query Workload

Dataset  Query  Query Expression
XMark    Q1     /site//item[location ~ "United"][payment ~ "Creditcard" and "Check"]/description
XMark    Q2     //regions//item[location ~ "States"][payment ~ "Creditcard" or "Cash"]/name
XMark    Q3     /site//item[location ~ "United"][payment ~ "Creditcard"]/description
XMark    Q4     //regions//item[location ~ "States"][payment ~ "Check"]/quantity
XMark    Q5     /site//item[description//text ~ "gold"]/name
XMark    Q6     /regions//item[description//text ~ "character "]/payment
XMark    Q7     //closed_auction[type ~ "Regular"][annotation//text ~ "heat"]/date
XMark    Q8     //closed_auction[annotation//text ~ "heat" or "country"]/seller
XMark    Q9     //closed_auction[annotation//text ~ "heat" and "country"]/buyer
XMark    Q10    //closed_auction[annotation//text ~ "country"]/type
XBench   Q11    /article//body[abstract/p ~ "hockey"][section/p ~ "hockey" and "patterns"]/section
XBench   Q12    //article//body[section/p ~ "regular"][abstract/p ~ "hockey" or "patterns"]/abstract
XBench   Q13    /article//body[section/subsec/p ~ "hockey"][abstract/p ~ "hockey"]/abstract
XBench   Q14    /article//body[section/subsec/p ~ "regular"][abstract/p ~ "patterns"]/section
XBench   Q15    /article//body[section/p ~ "patterns"][abstract/p ~ "patterns"]/abstract
XBench   Q16    /article//body[section/p ~ "hockey"][abstract/p ~ "patterns"]/abstract
XBench   Q17    //prolog[keywords/keyword ~ "bold" or "regular"][title ~ "regular"]/authors
XBench   Q18    //prolog[keywords/keyword ~ "bold"][title ~ "bold"]/title
XBench   Q19    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold" or "stealth"]//author/name
XBench   Q20    //prolog[genre ~ "Travel"][keywords/keyword ~ "bold"]/title

Page 17: Indexing and Searching XML Documents based on Content and Structure Synopses

Indexing Scheme Comparison

[Figure: three bar charts comparing ILI and DSI on the XBench dataset: Index Build Time (s), Index Size (MB), and AVG Query Response Time (s).]

ILI: using a standard XML indexing scheme based on full Inverted Lists

DSI: using our indexing scheme based on Data Synopses

Page 18: Indexing and Searching XML Documents based on Content and Structure Synopses

Query Precision Measurement

[Figure: bar chart of False Positive Rate (y-axis, 0 to 0.7) per query, Q1 to Q10, comparing ODBF and TDBF.]

ODBF: using one-dimensional Bloom Filters

TDBF: using two-dimensional Bloom Filters

Page 19: Indexing and Searching XML Documents based on Content and Structure Synopses

Efficiency of Optimization Algorithm

[Figure: bar chart of Query Response Time (s) (y-axis, 0 to 200) per query, Q1 to Q10, comparing OPCF and TPCF.]

OPCF: using one-phase containment filtering

TPCF: using two-phase containment filtering

Page 20: Indexing and Searching XML Documents based on Content and Structure Synopses

Future Research Directions

• Develop an effective ranking function
• Adopt top-k algorithms to improve search efficiency
• Apply our framework to structured P2P networks
• Evaluate our framework over INEX data