
XML RETRIEVAL

Tarık Teksen Tutal

21.07.2011

INFORMATION RETRIEVAL

XML (Extensible Markup Language)

XQuery

Text Centric vs Data Centric

BASIC XML CONCEPTS

XML

Ordered, Labeled Tree

XML Element

XML Attribute

XML DOM (Document Object Model): Standard for accessing and processing XML documents.

XML STRUCTURE

An Example:

XML DOM OBJECT

XML DOM Object of the Sample in the Previous Slide

Nodes in a Tree

Parse the Tree Top-Down
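A minimal sketch of a top-down walk over a DOM tree, using `xml.dom.minidom` from the Python standard library. The sample document is a hypothetical stand-in, not the example shown on the slide, and `walk_top_down` is an illustrative helper name.

```python
from xml.dom import minidom

# Hypothetical sample document (assumption: not the one from the slide).
SAMPLE = """<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>Will I with wine and wassail</verse>
    </scene>
  </act>
</play>"""


def walk_top_down(node, depth=0):
    """Yield (depth, kind, value, attributes), visiting parents before children."""
    if node.nodeType == node.ELEMENT_NODE:
        attrs = {name: value for name, value in node.attributes.items()}
        yield depth, "element", node.tagName, attrs
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Whitespace-only text nodes (indentation) are skipped.
        yield depth, "text", node.data.strip(), {}
    for child in node.childNodes:
        yield from walk_top_down(child, depth + 1)


document = minidom.parseString(SAMPLE)
nodes = list(walk_top_down(document.documentElement))
for depth, kind, value, attrs in nodes:
    print("  " * depth + f"{kind}: {value} {attrs if attrs else ''}")
```

Elements become inner nodes, attributes hang off their element, and the words of the document sit in the text leaves.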

XPATH

Standard for enumerating paths in an XML document collection

Query language for selecting nodes from an XML document

Defined by the World Wide Web Consortium (W3C)
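As a rough illustration of selecting nodes by path, the sketch below uses `xml.etree.ElementTree`, which supports only a limited subset of XPath. The sample document and variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumption: not taken from the slides).
SAMPLE = """<play>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
    </scene>
  </act>
</play>"""

root = ET.fromstring(SAMPLE)

# "./title" selects title elements that are direct children of the root.
direct_titles = [e.text for e in root.findall("./title")]

# ".//title" selects title elements at any depth below the root.
all_titles = [e.text for e in root.findall(".//title")]

# A predicate restricts the selection by attribute value.
scenes = root.findall(".//scene[@number='vii']")

print(direct_titles)
print(all_titles)
print(len(scenes))
```

Full XPath 1.0 offers more axes and functions than this subset; a third-party library would be needed for those.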

SCHEMA

Puts Constraints on the Structure of Allowable XML

Two Standards for Schemas:

XML DTD

XML Schema

CHALLENGES IN XML RETRIEVAL

STRUCTURED DOCUMENT RETRIEVAL PRINCIPLE

A system should always retrieve the most specific part of a document answering the query

In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant chapter, «Apple Pie», of the book «Apple Desserts», not the entire book.

In the same example, however, if the user queries «Apple», the book should be returned instead of a chapter.

INDEXING UNIT

Unstructured:

Files on PC, Pages on the Web, E-Mail Messages etc.

Structured:

Non-Overlapping Pseudodocuments

Top-Down

Bottom-Up

Index All Elements

INDEXING UNIT

Non-Overlapping Pseudodocuments

Disadvantage: Pseudodocuments May Not Be Coherent Units for the User

INDEXING UNIT

Top-Down

Start with one of the largest units (e.g., book in a book collection)

Postprocess search results to find, for each book, the subelement that is the best hit.

May fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements.

INDEXING UNIT

Bottom-Up

Search all leaves and select the relevant ones

Extend them to larger units in postprocessing

May fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units.

INDEXING UNIT

Index All the Elements

Some Elements are Not Useful to Index (e.g., ISBN)

Creates redundancy (Deeper Level Elements are Returned Several Times)
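The redundancy can be made concrete with a small sketch that treats every element as its own indexing unit. The sample book and the helper name `all_element_units` are hypothetical; paths here omit sibling positions, which a real index would need.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumption for illustration only).
SAMPLE = (
    "<book><isbn>000</isbn><chapter><title>Apple Pie</title>"
    "<section>Peel the apples</section></chapter></book>"
)


def all_element_units(root):
    """Return one (path, text) indexing unit per element, including nested ones."""
    units = []

    def visit(element, path):
        path = path + "/" + element.tag
        # The unit's text is all text below the element, whitespace-normalized.
        text = " ".join(" ".join(element.itertext()).split())
        units.append((path, text))
        for child in element:
            visit(child, path)

    visit(root, "")
    return units


units = all_element_units(ET.fromstring(SAMPLE))
hits = [path for path, text in units if "Peel" in text]
for path, text in units:
    print(path, "->", text)
print(hits)
```

The word in the deepest section is indexed once for the section, once for its chapter, and once for the book, so a query matching it can return all three units. The `isbn` unit is indexed as well, although it is unlikely to be a useful search result.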

NESTED ELEMENTS

To Get Rid of Redundancy,

Discard All Small Elements

Discard All Element Types that Users do not Look at (Requires Logs from a Working XML Retrieval System)

Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available)

Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results

NESTED ELEMENTS

Remove Nested Elements in a Postprocessing Step

Collapse Several Nested Elements in the Results List and then Highlight Results
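One possible postprocessing policy, sketched under assumptions: results are (path, score) pairs, paths are tuples of hypothetical step names, and the highest-scoring element wins over its ancestors and descendants. Other policies, such as always preferring the smallest or the largest element, are equally plausible.

```python
def is_nested(a, b):
    """True if path a is equal to, an ancestor of, or a descendant of path b."""
    shorter, longer = sorted((a, b), key=len)
    return longer[: len(shorter)] == shorter


def remove_nested(results):
    """Keep results in descending score order, dropping any that overlap a kept one."""
    kept = []
    for path, score in sorted(results, key=lambda r: r[1], reverse=True):
        if not any(is_nested(path, kept_path) for kept_path, _ in kept):
            kept.append((path, score))
    return kept


# Hypothetical result list with overlapping elements.
results = [
    (("book",), 0.4),
    (("book", "chapter[2]"), 0.9),
    (("book", "chapter[2]", "section[1]"), 0.7),
    (("book", "chapter[5]"), 0.3),
]
flat = remove_nested(results)
print(flat)
```

Here the second chapter suppresses both its section and the enclosing book, while the unrelated fifth chapter survives.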

VECTOR SPACE MODEL FOR XML RETRIEVAL

LEXICALIZED SUBTREES

Goal: Encode each word together with its position within the XML tree as a dimension of the vector space

Map XML documents to lexicalized subtrees

Take each text node (leaf) and break it into multiple nodes, one for each word.

E.g. split Bill Gates into Bill and Gates

Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term



Queries and documents can be represented as vectors in this lexicalized subtree context

Matches can then be computed for example by using the Vector Space Formalism

V.S. Formalism -> Unstructured vs Structured

Dimensions: Vocabulary Terms vs Lexicalized Subtrees

DIMENSIONS: TRADEOFF

Dimensionality of Space vs Accuracy of Results

Restrict Dimensions to Vocabulary Terms: Standard Vector Space Retrieval System; Does Not Match the Structure of the Query

Separate Lexicalized Dimension for Each Subtree: Dimensionality of the Space Becomes Too Large

DIMENSIONS: COMPROMISE

Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs)

Structural Term <c, t>: a pair of XML-context c and vocabulary term t
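A minimal sketch of extracting structural terms from a document. It assumes that the XML context is the full tag path from the root to the element holding the text, and that terms are lowercased whitespace tokens; both are simplifications chosen for the example, and `structural_terms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (assumption for illustration only).
SAMPLE = "<book><title>Microsoft</title><author>Bill Gates</author></book>"


def structural_terms(root):
    """Return a list of (context, term) pairs, one per word occurrence."""
    pairs = []

    def visit(element, context):
        context = context + (element.tag,)
        label = "/".join(context)
        # Each word of a text node becomes its own structural term.
        for term in (element.text or "").lower().split():
            pairs.append((label, term))
        for child in element:
            visit(child, context)
            # Text following a child (its tail) belongs to the current element.
            for term in (child.tail or "").lower().split():
                pairs.append((label, term))

    visit(root, ())
    return pairs


pairs = structural_terms(ET.fromstring(SAMPLE))
print(pairs)
```

Each distinct pair is one dimension of the vector space, so "Bill Gates" under `author` contributes two dimensions rather than one subtree per possible combination.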

CONTEXT RESEMBLANCE

To measure the similarity between a path in a query and a path in a document

|cq| and |cd| are the number of nodes in the query path and document path respectively

cq matches cd if and only if we can transform cq into cd by inserting additional nodes

CONTEXT RESEMBLANCE

CR(cq4 , cd2) = 3/4 = 0.75

CR(cq4 , cd3) = 3/5 = 0.6
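The formula itself did not survive on the slide. The two values above are consistent with the standard textbook definition (Manning, Raghavan and Schütze, Introduction to Information Retrieval), CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd and 0 otherwise, which the sketch below assumes. The paths are hypothetical stand-ins with the same lengths as the slide's examples, not the actual cq4, cd2 and cd3.

```python
def matches(cq, cd):
    """True if cq can be turned into cd by inserting nodes (cq is a subsequence of cd)."""
    remaining = iter(cd)
    # 'node in remaining' advances the iterator, so order is respected.
    return all(node in remaining for node in cq)


def context_resemblance(cq, cd):
    """Assumed definition: (1 + |cq|) / (1 + |cd|) on a match, else 0."""
    if not matches(cq, cd):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


# Hypothetical paths: a 2-node query path and document paths of 3 and 4 nodes.
cq = ("book", "title")
cd_a = ("book", "chapter", "title")
cd_b = ("book", "chapter", "section", "title")

print(context_resemblance(cq, cd_a))  # 3/4
print(context_resemblance(cq, cd_b))  # 3/5
print(context_resemblance(("title", "book"), cd_a))  # no match
```

Identical paths get a resemblance of 1, and the value shrinks as more nodes have to be inserted into the query path.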

DOCUMENT SIMILARITY MEASURE

Final Score for a Document

Variant of the Cosine Measure

Also called «SimNoMerge»

Not a True Cosine Measure Since Its Value can be Larger than 1.0

DOCUMENT SIMILARITY MEASURE

V is the vocabulary of non-structural terms

B is the set of all XML contexts

weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively

Standard weighting, e.g. idf_t × wf_{t,d}, where idf_t depends on which elements we use to compute df_t

SIMNOMERGE ALGORITHM

SCOREDOCUMENTSWITHSIMNOMERGE(q, B, V, N, normalizer)
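The body of the algorithm is missing from the slide. The sketch below follows the usual textbook formulation: for every structural term of the query, look at every context that resembles the query context, and add CR × query weight × document weight to each document in the postings list, then divide by the document's normalizer. The dictionary-based data layout and all names are assumptions; V and N are implicit in the dictionaries rather than passed as parameters.

```python
from collections import defaultdict


def context_resemblance(cq, cd):
    """Assumed CR: (1 + |cq|) / (1 + |cd|) if cq is a subsequence of cd, else 0."""
    remaining = iter(cd)
    if not all(node in remaining for node in cq):
        return 0.0
    return (1 + len(cq)) / (1 + len(cd))


def score_documents_with_simnomerge(query, contexts, postings, normalizer):
    """query: {(context, term): weight}
    contexts: iterable of all XML contexts B
    postings: {(context, term): [(doc_id, weight), ...]}
    normalizer: {doc_id: length normalization factor}
    """
    scores = defaultdict(float)
    for (cq, term), wq in query.items():
        for c in contexts:
            resemblance = context_resemblance(cq, c)
            if resemblance > 0:
                for doc_id, wd in postings.get((c, term), []):
                    scores[doc_id] += resemblance * wq * wd
    return {doc_id: s / normalizer[doc_id] for doc_id, s in scores.items()}


# Hypothetical toy index with two contexts and two documents.
B = [("book", "title"), ("book", "chapter", "title")]
postings = {
    (("book", "title"), "apple"): [(1, 1.0)],
    (("book", "chapter", "title"), "apple"): [(2, 1.0)],
}
query = {(("book", "title"), "apple"): 1.0}
scores = score_documents_with_simnomerge(query, B, postings, {1: 1.0, 2: 1.0})
print(scores)
```

Document 1 matches the query context exactly, while document 2 matches only through a longer path and is discounted by the context resemblance.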

EVALUATION OF XML RETRIEVAL

INEX

Initiative for the Evaluation of XML Retrieval

Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments)

Based on a collection of IEEE journal articles (since 2006, INEX has used the much larger English Wikipedia test collection)

The relevance of documents is judged by human assessors.

INEX TOPICS

Content Only (CO): Regular Keyword Queries, as in Unstructured IR

Content and Structure (CAS): Structural Constraints in Addition to Keywords; Relevance Assessments are More Complicated

INEX RELEVANCE ASSESSMENTS

INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance

Component Coverage: Evaluates Whether the Element Retrieved is «Structurally» Correct

Topical Relevance

INEX RELEVANCE ASSESSMENTS

Component Coverage:

Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information

Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information

Too large (L): The information sought is present in the component, but is not the main topic

No coverage (N): The information sought is not a topic of the component

Topical Relevance: Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)

COMBINING THE RELEVANCE DIMENSIONS

Not all combinations are possible, e.g. 3N (highly relevant, but no coverage)

Quantization:
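The quantization function is missing from the slide. The sketch below assumes the function commonly given in textbook treatments of INEX 2002, which maps each relevance-coverage combination to a single grade: 1.00 for 3E; 0.75 for 2E and 3L; 0.50 for 1E, 2L and 2S; 0.25 for 1S and 1L; 0.00 for 0N. The function and variable names are hypothetical.

```python
# Assumed quantization table: topical relevance digit + coverage letter -> grade.
QUANTIZATION = {
    "3E": 1.00,
    "2E": 0.75, "3L": 0.75,
    "1E": 0.50, "2L": 0.50, "2S": 0.50,
    "1S": 0.25, "1L": 0.25,
    "0N": 0.00,
}


def quantize(relevance, coverage):
    """Map a (topical relevance, component coverage) assessment to one grade."""
    key = f"{relevance}{coverage}"
    if key not in QUANTIZATION:
        raise ValueError(f"combination {key} is not covered by this table")
    return QUANTIZATION[key]


def graded_relevant_retrieved(assessments):
    """Sum of grades over retrieved components, replacing a binary relevant count."""
    return sum(quantize(relevance, coverage) for relevance, coverage in assessments)


# Hypothetical assessments for three retrieved components.
retrieved = [(3, "E"), (2, "L"), (0, "N")]
total = graded_relevant_retrieved(retrieved)
print(total)
```

With a single grade per component, the number of relevant items retrieved becomes a sum of grades, and precision and recall can be computed from it in the usual way.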

INEX EVALUATION MEASURES

Precision and Recall can be applied

Sum Grades vs Binary Relevance

Overlap is Not Accounted for: Nested Elements can Appear in the Same Search Result

Recent INEX focus: Develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly.
