XML RETRIEVAL
Tarık Teksen Tutal
21.07.2011
INFORMATION RETRIEVAL
XML (Extensible Markup Language)
XQuery
Text Centric vs Data Centric
BASIC XML CONCEPTS
XML
Ordered, Labeled Tree
XML Element
XML Attribute
XML DOM (Document Object Model): Standard for accessing and processing XML documents.
XML STRUCTURE
An Example:
XML DOM OBJECT
XML DOM Object of the Sample in the Previous Slide
Nodes in a Tree
Parse the Tree Top Down
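A top-down walk over a DOM tree can be sketched with Python's standard `xml.dom.minidom`. The sample document below is a hypothetical stand-in, not the sample from the slide, and `walk` is an illustrative helper name.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document (an assumption, not the slide's example).
SAMPLE = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail ...</verse>
    </scene>
  </act>
</play>"""


def walk(node, depth=0, out=None):
    """Visit element nodes top down (pre-order); record (depth, tag, attributes)."""
    if out is None:
        out = []
    if node.nodeType == node.ELEMENT_NODE:
        attrs = {name: value for name, value in node.attributes.items()}
        out.append((depth, node.tagName, attrs))
        # Children are visited after their parent, left to right.
        for child in node.childNodes:
            walk(child, depth + 1, out)
    return out


dom = parseString(SAMPLE)
visited = walk(dom.documentElement)
for depth, tag, attrs in visited:
    print("  " * depth + tag, attrs)
```

Text nodes (including the whitespace between elements) are skipped here, so only the labeled element nodes of the tree are listed.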
XPATH
Standard for enumerating paths in an XML document collection
Query language for selecting nodes from an XML document
Defined by the World Wide Web Consortium (W3C)
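As a small illustration of selecting nodes by path, the sketch below uses `xml.etree.ElementTree` from the Python standard library, which supports only a limited subset of XPath. The document string is a hypothetical example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
DOC = (
    "<play><title>Macbeth</title>"
    "<act number='I'><scene number='vii'>"
    "<title>Macbeth's castle</title>"
    "</scene></act></play>"
)
root = ET.fromstring(DOC)

# ".//title" selects every title element at any depth below the root.
all_titles = [e.text for e in root.findall(".//title")]

# "./act/scene/title" selects only titles reached through act and scene.
scene_titles = [e.text for e in root.findall("./act/scene/title")]

# A predicate restricts the selection by attribute value.
first_act = root.findall("./act[@number='I']")

print(all_titles)
print(scene_titles)
print(len(first_act))
```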
SCHEMA
Puts Constraints on the Structure of Allowable XML
Two Standards for Schemas:
XML DTD
XML Schema
CHALLENGES IN XML RETRIEVAL
STRUCTURED DOCUMENT RETRIEVAL PRINCIPLE
A system should always retrieve the most specific part of a document answering the query
In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant chapter, «Apple Pie», of the book «Apple Desserts», not the entire book.
In the same example, however, if the user queries «Apple», the book should be returned instead of a chapter.
INDEXING UNIT
Unstructured:
Files on PC, Pages on the Web, E-Mail Messages etc.
Structured
Non-Overlapping Pseudodocuments
Top-Down
Bottom-Up
All
INDEXING UNIT
Non-Overlapping Pseudodocuments
Not Coherent
INDEXING UNIT
Top-Down
Start with one of the largest units (e.g. book in a book collection)
Postprocess search results to find, for each book, the subelement that is the best hit.
May fail to return the best subelement, since the relevance of a whole book is often not a good predictor of the relevance of its subelements.
INDEXING UNIT
Bottom-Up
Search all leaves and select the most relevant ones
Extend them to larger units in postprocessing
May fail to return the best element, since the relevance of a leaf element is often not a good predictor of the relevance of the larger elements that contain it.
INDEXING UNIT
Index All the Elements
Not Useful to Index Some Elements (e.g. ISBN)
Creates redundancy (Deeper Level Elements are Returned Several Times)
NESTED ELEMENTS
To Get Rid of Redundancy,
Discard All Small Elements
Discard All Element Types that Users do not Look at (Requires Logs from a Working XML Retrieval System)
Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available)
Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results
NESTED ELEMENTS
Remove Nested Elements in a Postprocessing Step
Collapse Several Nested Elements in the Results List and then Highlight Results
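One possible way to remove nested elements in a postprocessing step is sketched below. It assumes, hypothetically, that each result is a pair of an element path such as `/book[1]/chapter[2]` and a score, and it keeps the highest-scoring element among any set of nested hits. The names `is_nested` and `remove_nested` are illustrative only.

```python
def is_nested(a, b):
    """True if element path a is an ancestor or a descendant of element path b."""
    return a.startswith(b + "/") or b.startswith(a + "/")


def remove_nested(results):
    """Keep a result only if no better-scored kept result contains it or lies inside it."""
    kept = []
    for path, score in sorted(results, key=lambda r: r[1], reverse=True):
        if not any(is_nested(path, kept_path) for kept_path, _ in kept):
            kept.append((path, score))
    return kept


# Hypothetical ranked hits: (element path, retrieval score).
hits = [
    ("/book[1]", 0.40),
    ("/book[1]/chapter[2]", 0.90),
    ("/book[1]/chapter[2]/section[1]", 0.70),
    ("/book[2]/chapter[1]", 0.55),
]
flat = remove_nested(hits)
print(flat)
```

Preferring the higher-scored element is only one policy; a system could instead prefer the larger or the smaller element of a nested pair.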
VECTOR SPACE MODEL FOR XML RETRIEVAL
LEXICALIZED SUBTREES
Goal: each dimension of the vector space encodes a word together with its position within the XML tree
Map XML documents to lexicalized subtrees
Take each text node (leaf) and break it into multiple nodes, one for each word.
E.g. split Bill Gates into Bill and Gates
Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term
LEXICALIZED SUBTREES
Queries and documents can be represented as vectors in this space of lexicalized subtrees
Matches can then be computed for example by using the Vector Space Formalism
Vector Space Formalism: Unstructured vs Structured Retrieval
Dimensions: Vocabulary Terms vs Lexicalized Subtrees
DIMENSIONS: TRADEOFF
Dimensionality of Space vs Accuracy of Results
Restrict Dimensions to Vocabulary Terms: Standard Vector Space Retrieval System that Retrieves Many Documents that Do Not Match the Structure of the Query
Separate Lexicalized Dimension for Each Subtree: Dimensionality of Space Becomes too Large
DIMENSIONS: COMPROMISE
Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs)
Structural Term <c, t>: a pair of XML-context c and vocabulary term t
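Extracting structural terms from a document might look like the sketch below. It is a simplification that only emits root-anchored paths, represents a context c as a slash-joined tag path, and lowercases each word. The function name `structural_terms` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def structural_terms(elem, path=()):
    """Return <c, t> pairs: c is the tag path to a text node, t is one word of its text."""
    path = path + (elem.tag,)
    pairs = []
    # Split the element's text into one term per word, e.g. "Bill Gates" -> bill, gates.
    for word in re.findall(r"\w+", elem.text or ""):
        pairs.append(("/".join(path), word.lower()))
    # Recurse into child elements, extending the context path.
    for child in elem:
        pairs.extend(structural_terms(child, path))
    return pairs


# Hypothetical document used only for illustration.
doc = ET.fromstring(
    "<book><title>Microsoft</title><author>Bill Gates</author></book>"
)
pairs = structural_terms(doc)
print(pairs)
```

Text that follows a child element inside its parent (the `tail` in ElementTree terms) is ignored in this sketch.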
CONTEXT RESEMBLANCE
To measure the similarity between a path in a query and a path in a document
|cq| and |cd| are the number of nodes in the query path and document path respectively
cq matches cd if and only if we can transform cq into cd by inserting additional nodes
CONTEXT RESEMBLANCE
CR(cq4 , cd2) = 3/4 = 0.75
CR(cq4 , cd3) = 3/5 = 0.6
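The usual definition is CR(cq, cd) = (1 + |cq|) / (1 + |cd|) when cq matches cd, and 0 otherwise, which agrees with the two values above for a query path of 2 nodes and document paths of 3 and 4 nodes. A minimal sketch, representing paths as tuples of node labels; the concrete paths below are hypothetical and chosen only to have those lengths:

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting nodes (cq is a subsequence of cd)."""
    remaining = iter(cd)
    # "node in remaining" consumes the iterator, so order is respected.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    if not matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


# Hypothetical paths with 2, 3 and 4 nodes.
cq = ("book", "Gates")
cd_short = ("book", "author", "Gates")
cd_long = ("book", "chapter", "title", "Gates")

print(context_resemblance(cq, cd_short))
print(context_resemblance(cq, cd_long))
print(context_resemblance(cq, cq))
```

Identical paths give a resemblance of 1.0, and a query path that cannot be embedded in the document path gives 0.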
DOCUMENT SIMILARITY MEASURE
Final Score for a Document
Variant of the Cosine Measure
Also called «SimNoMerge»
Not a True Cosine Measure Since Its Value can be Larger than 1.0
DOCUMENT SIMILARITY MEASURE
V is the vocabulary of non-structural terms
B is the set of all XML contexts
weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively
Standard weighting, e.g. idf_t × wf_(t,d), where idf_t depends on which elements we use to compute df_t.
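With these definitions, the measure is commonly written as follows. This is the standard textbook formulation, given here on the assumption that it is the formula this slide refers to; CR is the context resemblance function.

```latex
\[
\mathrm{SimNoMerge}(q, d) =
\sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
\sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
\frac{\mathrm{weight}(d, t, c_l)}
     {\sqrt{\sum_{c \in B,\; t \in V} \mathrm{weight}^2(d, t, c)}}
\]
```

Only the document vector is length-normalized, which is why the value can exceed 1.0.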
SIMNOMERGE ALGORITHM
SCOREDOCUMENTSWITHSIMNOMERGE(q, B, V, N, normalizer)
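A sketch of how such a scoring routine might be written in Python, following the usual textbook description rather than any particular implementation. The data layout is an assumption: the query is a dict from (context, term) to weight, the index is a dict from (context, term) to a postings list of (doc_id, weight), and the vocabulary V is left implicit in the index keys.

```python
def matches(cq, cd):
    """True if cq is a subsequence of cd (cd is cq with nodes inserted)."""
    remaining = iter(cd)
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0


def score_documents_with_simnomerge(query, contexts, index, n_docs, normalizer):
    """Accumulate CR * query weight * document weight per document, then normalize."""
    score = [0.0] * n_docs
    for (cq, term), wq in query.items():
        for c in contexts:
            cr = context_resemblance(cq, c)
            if cr > 0:
                # Postings for the structural term <c, term>: (doc_id, weight) pairs.
                for doc_id, wd in index.get((c, term), []):
                    score[doc_id] += cr * wq * wd
    return [s / normalizer[n] for n, s in enumerate(score)]


# Hypothetical toy data: three contexts, three documents.
contexts = [("author",), ("book", "author"), ("book", "title")]
index = {
    (("author",), "gates"): [(0, 0.5)],
    (("book", "author"), "gates"): [(1, 0.8)],
    (("book", "title"), "gates"): [(2, 0.9)],
}
query = {(("author",), "gates"): 1.0}
scores = score_documents_with_simnomerge(query, contexts, index, 3, [1.0, 1.0, 1.0])
print(scores)
```

Document 2 receives no score because the query context `author` cannot be extended into `book/title`, so the context resemblance is 0.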
EVALUATION OF XML RETRIEVAL
INEX
Initiative for the Evaluation of XML Retrieval
Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments)
Originally based on a collection of IEEE journal articles; since 2006 INEX has used the much larger English Wikipedia test collection
The relevance of documents is judged by human assessors.
INEX TOPICS
Content Only (CO): Regular Keyword Queries Like in Unstructured IR
Content and Structure (CAS): Structural Constraints in Addition to Keywords; Relevance Assessments are More Complicated
INEX RELEVANCE ASSESSMENTS
INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance
Component Coverage: Evaluates Whether the Element Retrieved is «Structurally» Correct
Topical Relevance
INEX RELEVANCE ASSESSMENTS
Component Coverage:
Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information
Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information
Too large (L): The information sought is present in the component, but is not the main topic
No coverage (N): The information sought is not a topic of the component
Topical Relevance: Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)
COMBINING THE RELEVANCE DIMENSIONS
Not all combinations are possible, e.g. 3N (highly relevant but no coverage) cannot occur
Quantization:
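The quantization function maps each relevance-coverage code, such as 3E or 2S, to a single grade between 0 and 1. The sketch below uses the values commonly quoted for INEX 2002 in textbook treatments, given here on the assumption that they are the ones this slide showed. Combinations not listed are treated as impossible, and `quantize` is an illustrative name.

```python
# Grade per digit-letter code: topical relevance (0-3) + component coverage (E, S, L, N).
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}


def quantize(relevance, coverage):
    """Return the grade for a (relevance, coverage) pair, e.g. quantize(2, "S")."""
    code = f"{relevance}{coverage}"
    if code not in QUANTIZATION:
        raise ValueError(f"combination {code} is not assigned a grade")
    return QUANTIZATION[code]


print(quantize(3, "E"))
print(quantize(2, "S"))
```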
INEX EVALUATION MEASURES
Precision and Recall can be applied
Sum Grades vs Binary Relevance
Overlap is not accounted for: Nested elements can appear in the same search result
Recent INEX focus: Develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
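The difference between summing grades and counting binary relevance can be sketched as follows. The input is assumed to be the list of quantized grades (values between 0 and 1) of the retrieved components; both function names are hypothetical, and neither measure corrects for overlap between nested elements.

```python
def graded_precision(grades):
    """Precision where the number of relevant items retrieved is the sum of the grades."""
    if not grades:
        return 0.0
    return sum(grades) / len(grades)


def binary_precision(grades):
    """Precision where any component with a nonzero grade counts as fully relevant."""
    if not grades:
        return 0.0
    return sum(1 for g in grades if g > 0) / len(grades)


# Hypothetical grades of four retrieved components.
retrieved = [1.0, 0.5, 0.0, 0.25]
print(graded_precision(retrieved))
print(binary_precision(retrieved))
```

With graded relevance, partially relevant components contribute only part of a hit, so the graded value is never higher than the binary one.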