XSEarch: A Semantic Search Engine for XML
Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv
The Hebrew University of Jerusalem
Presented by Deniz Kasap & Sarp Baran Özkan
XSEarch an XML Search Engine
XSEarch an XML Search Engine
Goal:
Find the “relevant” XML fragments,
given tag names and keywords
Goal:
Find the “relevant” XML fragments,
given tag names and keywords
Introduction
• It is becoming increasingly popular to publish data on the Web in the form of XML documents.
• Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents. – It is not possible to pose queries that explicitly refer to XML tags. – Search engines return references (i.e. links) to documents and not
specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.
Excerpt from the XML Version of DBLP
<proceedings>
<inproceedings>
<author>Moshe Y. Vardi</author>
<title>Querying Logical Databases</title>
</inproceedings>
<inproceedings>
<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
</inproceedings>
</proceedings>
A Search Example
Find papers by Vianu on the topic of “logical databases”
How can we find such papers?
Attempt 1: Standard Search Engine
A document containing some of the three query terms is considered as a result.
The document is not relevant to the query.
This does not work!!!
The document is returned BUT it does not
contain any paper on “logical databases” by VianuThis fragment does not represent
a paper by Vianu
This fragment does not represent a paper about logical databases
<proceedings> <inproceedings> <author>Moshe Y. Vardi</author> <title>Querying Logical Databases</title> </inproceedings> <inproceedings> <author>Victor Vianu</author> <title>A Web Odyssey: From Codd to XML</title> </inproceedings></proceedings>
The document contains the three query terms. Hence, it is returned by a standard search engine. BUT
• Since a reference to whole XML document is usually not a useful answer, the granularity of the search should be refined.
• Instead of returning entire document, an XML search engine should return fragments of XML documents.
• A query language for XML, such as XQuery, can
be used to extract data from XML documents.
• However, such a query language is not an alternative to an XML search engine for several reasons. – The syntax of XQuery is more complicated than the
syntax of a standart search query. Hence, it is not appropriate for a naive user.
– Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per document basis.
– XQuery lacks any mechanism for ranking answers.
Attempt 2: XML Query Language
FOR $i IN document(“bib.xml”)//inproceedingsWHERE $i/author contains ‘Vianu’
AND $i/title contains ‘Logical’ AND $i/title contains ‘Databases’
RETURN <result> <author> $i/author </author>
<title> $i/title </title> </result>
FOR $i IN document(“bib.xml”)//inproceedingsWHERE $i/author contains ‘Vianu’
AND $i/title contains ‘Logical’ AND $i/title contains ‘Databases’
RETURN <result> <author> $i/author </author>
<title> $i/title </title> </result>
• Complicated syntax• Extensive knowledge of the document structure required to
write the query• No mechanism for ranking results
This does work, BUT
Our Requirements from the Search Tool
• A simple syntax that can be used by naive users• Search results should include XML fragments and
not necessarily full documents• The XML fragments in an answer, should be
semantically related– For example, a paper and an author should be in an
answer only if the paper was written by this author
• Search results should be ranked• Search results should be returned in “reasonable”
time
• The design and implementation of XSEarch involved several challenges.
– A syntax is suitable for a naive user.
– The theoretical results were adapted so that XSEarch always returns as answers.
– Answers are highly relevant to the keywords of the query.
– Suitable ranking mechanism that takes into account both the degree of the semantic relationship and the relevance of the keywords have been developed.
– Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed.
– The implemantation of XSEarch is extensible in the sense that it can easily accommodate different type of semantic relationships.
Query Syntax
• The query language of a standart search engine is simply a list of keywords.
• Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document. (but the appearance of such keywords is desirable)
• The query language of XSEarch is a simple extension of the language described below. In addition to specify labels and keyword-label combinations that must or may appear in a satisfying document.
• A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term.
• We use t, t1, t2, etc., as an abstract notation for required and optional term.
• A query has the form Q(S) where S = t1,...,tm is a sequence of required and optional search terms.
Formally, a search term has the form;
l:k, l:, :k
where
l is a label and k is a keyword.
Appearance of logical in thefragment increases the rank ofthis fragment
Note that the different document fragments matching these query terms must be “semantically related”
Example
• Find papers by Vianu on the topic of “logical databases”
logical +database inproceedings: author:Vianulogical +database inproceedings: author:Vianu
Appearance of the tag inproceedings, in the fragment, increases the rank of this fragment
Appearance of Vianu under the tag author, in the fragment, increases the rank of this fragment
The keyword databasemust appear in the fragment
Query Semantics
• This section presents the semantics of our queries.
• In order to satisfy a query Q, each of the required terms in Q must be satisfied.
• In addition, the elements satisfying Q must be meaningfully related.
XSEarch:
<proceedings>
<inproceedings>
<author>Moshe Y. Vardi</author>
<title>Querying Logical Databases</title>
</inproceedings>
<inproceedings>
<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
</inproceedings>
</proceedings>
>title>A Web Odyssey: From Codd to XML</title<
>author>Victor Vianu</author<
Good Result!
title and author elements ARE semantically related
author:Vianu title:
<proceedings>
<inproceedings>
<author>Moshe Y. Vardi</author>
<title>Querying Logical Databases</title>
</inproceedings>
<inproceedings>
<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
</inproceedings>
</proceedings>
<title>Querying Logical Databases</title>
>author>Victor Vianu</author<
Bad Result!
title and author elements ARE NOT semantically related
XSEarch: author:Vianu title:
Satisfaction of a Search Term
• XML documents are modeled as trees in the standard fashion.– Each interior node is associated with a label and each leaf
node is associated with the sequence of keywords.– If k is a keyword in the sequence associated with n, n
contains k is said.
• In Figure 1 there is a tree that represents a small portion of the Sigmod Record.
• We will refer to this tree as Tsr
• Let n be an interior node in a tree T.
• We say that n satisfies the search term;– l:k if n is labeled with l and a descendent that contains the keyword
k.– l: if n is labeled with l.– :k if n has a leaf child that contains the keyword k.
• Example:– In the tree Tsr,
• node number 14 satisfies :Kempster • node number 9 satisfies authors:Kempster. • node 9 does not satisfy :Kempster, position: or :position.
Meaningfully Related Sets of Nodes
• Let T be a tree and R be a binary, reflexive and symmetric relationship on the nodes in T.
• We assume that R contains pairs of nodes that are meaningfully related.
• We present two different way to extend R to arbitrary sets of nodes
• A set of nodes N is all-pairs R-related, if (n1,n2) is in R, for every pair of nodes n1, n2.
• This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related.
• N is star R-related, if there is a node n* N such that the pair (n*,n) is in R, for all nodes n N.
• This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a node in the set.
• Depending on the structure of the documents, either the all-pairs relationship or star relation-ship may be more appropriate.
Query Answers• Let Q(t1,…,tm) be a query.
• A sequence N = n1,…,nm of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and for all 1 i m:– ni is not the null value if ti is a required term;
– ni satisfies ti if it is not the null value.
• Similarly, N is star R-answer, when the nodes in N are star R-related.
• We use;
– Ansa,R(Q) to denote the set of all-pairs R-answer for the query Q over a tree T and
– Ansts,R(Q) to denote the set of star R-answers for Q over T.
– MaxAnsa,R to denote the set of maximal answers in Ansa,R(Q)
The Interconnection Relationship
• We present a relation which can be used to determine whether a pair of nodes is meaningfully related.
• Let T be tree an n1 and n2 be nodes in T.
• The shortest undirected path between n1 and n2 consists of the paths from the lowest common ancestor of n1 and n2 to n1 and n2.
• We denote the tree consisting of these two paths as T|n1,n2.
• This tree describes the relationship between the nodes n1 and n2.
• For example in Tsr, the tree T|8,13 consists of the nodes 7, 8, 9, 12 and 13.
Relationship Trees
Relationship tree of n1, n2, …, nk
n1 n2nk…
Lowest common ancestor of n1, n2, …, nk
Our “Semantic Relation”: Interconnection
• n1,..., nk are interconnected if either
– relationship tree of n1,..., nk does not contain two nodes with the same label, or
– the only nodes with the same label in the relationship tree of n1,..., nk, are among n1,..., nk
proceedings
Moshe Y. Vardi
inproceedings
author title
Querying Logical Databases
authortitle
Victor Vianu
A Web Odyssey: From Codd to XML
inproceedings
Circled nodes belong to different inproceedings entities. They ARE NOT interconnected!
Relationship tree
Lowest common ancestor of circled
nodes
Example (1)
proceedings
Moshe Y. Vardi
inproceedings
author title
Querying Logical Databases
authortitle
Victor Vianu
A Web Odyssey: From Codd to XML
inproceedings
Circled nodes belong to the same inproceedings entity. They ARE interconnected!
Relationship tree
Lowest common ancestor of circled
nodes
Example (2)
proceedings
Moshe Y. Vardi
inproceedings
author title
Querying Logical Databases
authortitle
Victor Vianu
Queries and Computation on the Web
inproceedings
Circled nodes belong to the same inproceedings entity, but are labeled with the same tag.
They ARE interconnected.
Relationship tree
Lowest common ancestor of circled
nodes
Example (3)
author
Serge Abiteboul
Example 1 of Query Semantics
• Consider the query Q1 defined as;
Q1(+title:, author:).
– The query Q1 finds pairs of titles and authors, belonging to the same article.
– Only tuples where the title is non-null will be returned.
– The answers created for Tsr are;
(8,10) , (8,12) , (8,14) , (17,18) and (25, )
Example 2 of Query Semantics
The answers for Q1 over this document would consists of;
(6,3) and (6,4)
Query Processing
• Document fragments are extracted using the interconnection index and other indices
• Extracted fragments are returned ranked by the estimated relevance
Ranker
User
Query Processor Ranker
XML FilesIndexerIndices
Index Repository
1 2 3 4
L1 L2 L3 L4
Ranking Factors
Several factors increase the rank of a result
• Similarity between query and result
• Weight of labels appearing in the result
• Characteristics of result tree
Query and Result Similarity
TFILF– Extension of TFIDF, classical in IR
– Term Frequency: number of occurrences of a query term in a fragment
– Inverse Leaf Frequency: number of leaves containing a query term divided by number of leaves in the corpus
TFILF• Term frequency of keyword k in a leaf node nl
• Inverse leaf frequency
TFILF is the product between tf and ilf
Weight of Labels
• Some labels are considered more important than others– Text under an element labeled with title is
more “important” than text under element labeled with section
• Label weights can be– system generated– user defined
Relationship between Nodes• Size of the relationship tree: small fragment
indicates that its nodes are closer, and thus, probably, “more related”
article: title:XML
2 nodes
3 nodes
This fragment will obtain an higher rankarticle
title
XML
article
section
title
XML
Relationship between Nodes• Ancestor-descendant relationships between
a pair of nodes in a fragment, indicates “strong relation” between these nodes
section: title:XML
article
title
XML
section
section node is an ancestor of title node
This fragment will obtain an higher rank
section
title
article
XML
Combining the Factors
Given a query Q and an answer N, we use the measures• sim(Q,N), • tsize(N) • and anc-des(N)
to determine the ranking of the answer. We experimented with the following combination of factors by varying the values of α , β and γ
sim(Q,N)α / tsize(N)β x (1+ γ x anc-des(N))
System Implementation
The architecture of the XSEarch system is depicted in the following figure:
User Interface
Query Processor Ranker
XML Files
1 2 3 4
L1 L2 L3 L4
IndexerIndices
Index Repository
• The basic follow of information is as follows:– The user enters a query using a browser.
– The Search-Query Processor parses the query into a list of search terms.
– The Index Repository is used to find nodes that satisfy that satisfy the search terms and to find whether pairs of nodes are interconnected.
• It responds by checking the stored indices.
• If these indices do not contain sufficient information, the Indexer is used to augment the current indices.
• Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned.
– The Indexer creates several different indices in the Index Repository based on a set of XML documents.
• We focus on the most important and novel index structures:– The interconnection index– Path index
• The interconnection index allows for rapid checking of the interconnection relationship.
• Path index allow us to create first answers with higher estimated ranking.
Dynamic Offline Interconnection Indexing
• Checking for interconnection of nodes online is expensive.
– Hence, it is decided that at the first to create a node-interconnection index that would store information about the interconnection relationship between each pair of nodes.
– This requires solving the following problem:
• Given a document T, for all pairs of nodes n and n’ in T, determine whether n and n’ are interconnected.
– The algorithm which is the solution of this problem, is based on the following Lemma:
• Lemma (Interconnection Characterization)
– Let T be a document and let n and n’ be nodes in T.
– If n is ancestor of n’, then n and n’ are interconnected if and only if the following hold:
• The parent of n’ is strongly-interconnected with n;
• The child of n on the path to n’ is strongly-interconnected with n’.
– If n is not an ancestor of n’ and n’ is not an ancestor of n, then n and n’ are interconnected if and only if the following hold:
• The parent of n’ is strongly-interconnected with n;
• The parent of n is strongly-interconnected with n‘.
• In the XSearch system, we have explored the possibilities of storing the node-interconnection index in either a hashtable or a symmetric matrix.
• When implemented as a hastable, the node-interconnection index contains pairs of ids of interconnected nodes.
• When implemented as a symmetric matrix, the node-interconnection index contains a boolean value for each pairs of nodes, indicating whether they are interconnected or not.
• A comparison of time and space efficiency of these structures will be explained.
Dynamic Online Interconnection Indexing
• Offline computation of the node-interconnection index may be expensive.
• In order to amortize the cost of computing this index over the queries received, we have also considered an online indexing method.
• When indexing online, for each pair of nodes n and n’, we compute the section of the node interconnection index corresponding to T|n,n’
• We use a hashtable to store the part of the part of index that has already been computed at any given moment.
• The hashtable contains a boolean value for each pair of nodes whose interconnection has already been checked.
• The boolean value indicates whether the nodes are interconnected or not.
• During query processing, usually only a small part of the node-interconnection index will be created, thus the slowdown in response time is not large.
• In addition, queries tend to be similar in the parts of the document that they must access.
• Therefore, even after many queries have been evaluated, it is likely for the node-interconnection index to be only partially computed.
• This speeds up execution time when loading the index into main memory.
Experimental ResultsExperimental Results
Hardware and Software Used
• Language: Java
• Processor: 1.6 GHZ Pentium 4
• RAM: 2 GB (limited to 1.46 GB by JVM)
• OS: Windows XP
Interconnection Index
• Is built offline• Allows for checking interconnection
between two nodes, during query processing, in O(1) time
• We have two implementations – as a hash table– as a symmetric matrix
• The Indexer is responsible for building the Interconnection Index
Choosing the Implementation for the Interconnection Index
• We have experimented the two implementations of the interconnection index1. IIH: the index is an hash table2. IIM: the index is a symmetric matrix
• We compare the two implementations – Cost of building the index– Cost of query processing, i.e., using the index
• IIM is better than IIH, because of the additional overhead of hashing
Time For Building Indices
XML corpus
Size (KB)
Number of nodes
IIM time (ms)
IIH time (ms)
Dream1463,3602936
Hamlet2816,635114185
Sigmod70421,2461,5521,729
Mondial1,19849,4226,2317,837
• Both implementations are reasonable
On the Fly Indexing (OFI)
• Fully building the indices as a preprocess of querying is expensive in memory for “huge” corpuses!– Also expensive in time because of the additional
overhead of using virtual memory
• Instead, compute interconnection index incrementally on-the-fly during query processing for each pair that must be checked– By how much will query processing be slowed
down?
Time For Building Indices: Comparing IIH, IIM, OFI
XML corpus
Size (KB)
Number of nodes
OFI time (ms)
IIH time (ms)
IIM time (ms)
Dream1463,3600.63629
Hamlet2816,6351.1185114
Sigmod70421,2462.21,7291,552
Mondial1,19849,42210.07,8376,231
For these corpuses, OFI time is less than 10 ms.
Actually it is the time to build all the indices other than the interconnection index.
Query Execution Time
• We generated 1000 random queries for the Sigmod Record corpus
• Each query had:– At most 3 optional search terms– At most 3 required search terms
• We checked time with IIH, IIM and OFI
0.1
1
10
100
1000
Time (milliseconds)
Nu
mb
er
of
Qu
eri
es
IIH
IIM
IIH/IIM: Query Processing Time
• Note: Logarithmic scale
• Both approaches lead to similar results
• Average run time for queries: 35 ms
• After processing the 1000 queries, 0.75% of all pairs of nodes were checked for interconnection.•Space saved in main memory
Slowdown in response time not too large!Locality property: queries tend to be similar in the parts of the document that they may access
• More than 50% of the queries processed in under 10 ms
513
144199
96
2317
7
11
10
100
1000
0.01 0.1 1 10 100 1000 10000 100000
Time (sec)
Nu
mb
er
of
Qu
eri
es
OFI: Query Processing Time
How Good are the Results?
• We measured recall & precision for the query:– Find papers written by Buneman that contain the keyword
database in the title
• We tried two different queries that reflect different amounts of user knowledge– Kw: +Buneman +database (classical search engine
query)– Tag-kw: +author:Buneman +title:database
• Corpus: Sigmod, DBLP
• We computed the "correct answers" using XQuery
• Recall
Perfect recall, i.e., XSEarch returns all the correct answers
• Precision at n
Precision and Recall
correct returned answers
correct answers
correct answers in the first n returned answers
n
Precision at 5, 10 and 20
0.5
0.6
0.7
0.8
0.9
1
P@5: KW P@10:KW
P@20:KW
P@5:KW+TAG
P@10:KW+TAG
P@20:KW+TAG
DBLP
Sigmod
Sigmod: Perfect precision
DBLP: 0.8/0.9 for query containing only keywords
Combining tags and keywords leads to perfect precision
Related Work
• Numerous query languages for XML have been developed.– For example, the XQuery working group is considering
how to add full-text search features and ranking to XQuery. Such capabilities have already been added to various XML query languages. But these languages are not suitable for naïve user, since the query syntax is always complex.
– A recent related work is the XRANK system for keyword searching in XML documents
Conclusions
• The main contribution of this paper is in laying the foundations for a semantic search engine over XML documents.
• XSearch returns semantically related fragmants, ranked by estimated relevance.
• This system is extensible and can easily accommodate different types of relationships between nodes.
• We have shown that it is possible to combine these qualities with an efficient, scalable and modular system.
• Thus, XSearch can be seen as a general framework for semantic searching in XML documents.
• Efficient index structures– IIM/IIH for “small” documents– OFI for “big” documents
• Efficient evaluation algorithms– Dynamic algorithm for computing
interconnection
• Extensible implementation– The system can easily accommodate different
types of semantic relations between nodes, other than interconnection
Thank You.
Questions?