
Generative Probabilistic Models for Retrieval of Documents with Structure and Annotations

Ph.D. Thesis Proposal

Paul Ogilvie

April 25, 2005

Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Pittsburgh, PA 15213

Thesis Committee:

Jamie Callan (Chair)
Christos Faloutsos
Yiming Yang
Bruce Croft (University of Massachusetts at Amherst)


Table of Contents

1 Introduction
  1.1 Some History of Structured Document Retrieval
  1.2 New Directions for Structured Retrieval
    1.2.1 Generative Probabilistic Models
    1.2.2 Overview of the Model and Proposal

2 Background in Generative Probabilistic Models for Text Retrieval
  2.1 Generative Models for Text Retrieval
    2.1.1 Kullback-Leibler Divergence
    2.1.2 The Generative Query-Likelihood Language Model
    2.1.3 Generative Relevance Models
    2.1.4 The Maximum-Likelihood Estimate of a Language Model
    2.1.5 Smoothing
  2.2 Summary

3 Modeling Document Structure
  3.1 Hierarchically Structured Documents
    3.1.1 Example
  3.2 Multiple Representations
    3.2.1 Example
    3.2.2 Non-Textual Data
  3.3 Ranking Items for Queries
    3.3.1 Queries
  3.4 Experimental Results
    3.4.1 Hierarchical Structure
    3.4.2 Multiple Representations
  3.5 Chapter Summary

4 Modeling Characteristics of Relevant Documents
  4.1 Estimating the Prior Probabilities from Desired Posteriors
  4.2 Example
  4.3 Length Priors for XML Component Retrieval
  4.4 URL Priors for Named-Page Finding
    4.4.1 Modeling URL Text
    4.4.2 Ranking Documents
    4.4.3 Estimating the Priors
    4.4.4 Experiments
    4.4.5 Discussion
  4.5 Summary

5 Smoothing and Parameter Estimation
  5.1 Training and Estimation of Language Models
    5.1.1 Next-Word Prediction (Speech Recognition)
    5.1.2 Document Classification
    5.1.3 Parsing
    5.1.4 Document Retrieval
    5.1.5 Discussion
  5.2 Combination of Evidence
    5.2.1 Related Work
    5.2.2 Analysis Approach

6 Modeling Annotations
  6.1 Offset Annotation
  6.2 A Reading Tutor System
    6.2.1 Adapting the Model
  6.3 Question Answering
    6.3.1 Richer Queries for Richer Documents
    6.3.2 Adapting the Model
  6.4 Summary

7 Summary of Proposal
  7.1 Contributions
  7.2 Time Table


Chapter 1

Introduction

Structure in documents has been a part of Information Retrieval as long as the field has existed. There have always been documents with metadata information such as authors and creation dates. Library catalogs store extensive fielded information about documents, such as titles, authors, abstracts, publication, and keywords.

1.1 Some History of Structured Document Retrieval

Researchers have always recognized that document structure is important to effective retrieval. The earliest systems supporting structure focused on providing support for querying the fielded information. This has strong connections to research in library science, where fielded search of citations is important. A common approach to handling fielded queries was to treat constraints literally, perhaps providing a ranking corresponding to how well the multiple constraints are met.

A typical example of an early citation search system was the Norton Cotton Cancer Center (NCCC) On-Line Personal Bibliographic Retrieval System [10]. The bibliographic system stored citations of documents (articles, books, journals, etc.) in a fielded database. Citations were indexed with keywords in a controlled vocabulary or general language. Titles, authors, publication information, and call numbers were also indexed in fields. Searches could query over the entire citation or over the controlled vocabulary alone. The searches were either conjunctions or disjunctions of the clauses; more sophisticated search capabilities were not deemed necessary. Results were ordered by the author field.

A more sophisticated example of citation search is the SCAT-IR system [37]. SCAT-IR indexed similar fields to the NCCC bibliographic system, but allowed additional query structures. Each field could be queried in a clause, and query clauses could be combined with Boolean ANDs and ORs. Yet result sets were not ordered by any estimate of relevance.

Fox summarized early work with structured documents in [11]. Much work to that date was empirical, with no analysis of the conditions in which structure is informative. Most tasks limited the retrieval unit to documents, although some early work with passage retrieval had been performed. Fox described some experiments that found a vector space system performed better on sections than on passages, but performed no analysis as to why. Fox further described how soft Boolean matching functions such as the P-Norm formalism can be extended to matching in complex documents. Up to this point there was little consideration of what the best unit of retrieval is, and almost no consideration of the relationships between fields of information in a document.

The introduction of Inference Networks by Turtle and Croft [50] provided one of the first retrieval models explicitly designed to handle multiple representations of information within a document and combine this evidence during retrieval. While the model directly accommodates fielded information and allows querying of fields, few experiments have been done leveraging these advantages, and document retrieval was still the primary task considered. This work was among the first of many works with greater sophistication in handling document structure.


Shortly after that, Burkowski [5] presented a model for searching and retrieving components of a hierarchically structured text database. The query language presented allowed the user to specify the unit of retrieval and supported enforcing that components be contained within others. Fuller et al. [15] presented a variety of factors for leveraging the structure in document retrieval, such as component proximity, the location of the component within the document, and how well the children components matched the query. Wilkinson [53] performed an evaluation showing that consideration of component-level evidence may be beneficial to document retrieval and that document information could be helpful to component retrieval. Navarro and Baeza-Yates [36] considered the problem of querying structured data and suggested that query structures for comparing the proximity of document components were necessary for expressing rich information needs.

These works are but a sample of prior work in structured document retrieval. However, there are some important characteristics shared by all of these works. First and foremost, there has been a lack of large test collections with relevance judgments for the task of component retrieval. This lack of test collections has hindered scientific research in this area, and these models have not been rigorously tested. A lack of component-level judgments has made it almost impossible to test hypotheses about the nature of evidence within these components. In the few cases where there have been some experiments, results have been on small, suboptimally constructed testbeds and cannot be considered conclusive. Typical experiments have considered the effects of structure on ad-hoc document retrieval only, and it is unclear whether structure is a crucial factor in this task.

In addition, this prior work focused on using document structure only when there are explicit structural constraints in the query. The strict interpretation of query constraints in these works assumed that information does not cross the structural boundaries present in the document. We know this to be false; sections, paragraphs, and other components are often intimately related.

1.2 New Directions for Structured Retrieval

However, recent shifts in research directions for structured retrieval have opened the door for new explorations of the benefits of structure. Along with these shifts come new testbeds for evaluation, and an opportunity to test hypotheses about the use of structure in retrieval scenarios. These new tasks have either found definite improvements through the use of structure or have moved toward component retrieval as an important task in and of itself.

There has been increasing interest in research leveraging document structure for retrieval. Commercial web search engines such as Google leverage link information, titles, and other structural elements of documents, despite the lack of structure present in most web search engine queries. The TREC Web track has provided the opportunity for the research community to examine the structure of HTML documents. While this is similar to traditional document retrieval, these systems leverage document structure to produce better rankings.

A newer evaluation initiative, INEX [13], has examined the use of structure in XML documents. Structure has been an integral part of INEX: content-only queries may return any document component, and content-and-structure queries provide hints about where query terms may occur in the document structure, as well as hints about which document components may be relevant. Work in this area moves beyond traditional document retrieval, as systems must choose which document components are relevant and make effective use of structural query constraints.

Parallel to the investigations on HTML and XML, there has been a growing number of cases where another system is the user of an information retrieval system. Information Retrieval has long focused on humans as the users of retrieval. This has biased research toward tasks such as ad-hoc retrieval of documents or abstracts, text filtering and classification, and known-item retrieval on the Web. An example of a system as a user of a retrieval system is a Question Answering system. Question Answering systems often use text retrieval systems to retrieve passages that may contain the answer to a question [47]. Another example of a non-human user is a reading tutor system. The reading tutor may issue requests to a search engine for documents on a topic, but with additional constraints about the language used in the document. In both cases, it is likely that the documents will be annotated with additional information, such as named-entities in a QA system and reading difficulty levels in the reading tutor. These annotations are a new type of structural information that could be leveraged to improve retrieval effectiveness.

In all of the above examples, the retrieval system must make use of two forms of markup within the documents: annotations and document structure. Annotations typically encode linguistic or semantic information, and document structure is commonly used to represent the general layout of documents. However, there is very little difference between how one can represent structure and annotations within an information retrieval system. Document structure may also encode linguistic or semantic information, and annotations can be used to describe the layout of a document. The difference between structure and annotations can be viewed as the source of the mark-up. For the purposes of this work, document structure and annotations are defined as:

• document structure - explicitly encoded information about the content of the document present within the document. This information is encoded during document creation, and the author of the document creates this structure. It is certain information, as one can assume the author placed the structure there willingly. The structure may encode information about layout, as in title tags or paragraph tags, citations and hyperlinks, or linguistic or semantic information such as part-of-speech information or named-entities. However, as the linguistic and semantic information is costly for a human to provide, structure will usually refer to document layout or meta-data information.

• annotations - information about the document assigned by a process or human external to the document, such as information provided by a named-entity tagger or an automatic layout analysis tool. This information may be less certain than document structure, as the process or person creating the annotations may not know the author's original intent. Annotations are usually added for linguistic or semantic information such as part-of-speech tags, parse trees of sentences, named-entities, reading level, language, etc.

It is often perceived that annotations and structure represent fundamentally different types of information; layout is on a different scale than part-of-speech, because layout is about the organization of a document, while linguistic or semantic annotations are about the meaning or structure of a sentence. This may be true at some level, but those differences may be ignored; the only difference of impact between structure and annotations is that document structure is encoded within the document at the time of its creation, while annotations are mark-up of information added by some program or human sometime after the document has already been created.

1.2.1 Generative Probabilistic Models

Generative probabilistic models provide a natural way to model this structured data. Generative models are used to estimate the probability that a model of an object “generated” some observations. A model is a probability distribution that characterizes how the object may have been created. This may be a simple probability distribution such as a multinomial over words in a vocabulary (the classic unigram model), or a sophisticated graphical structure. One example of a graphical model is latent Dirichlet allocation [2], where some nodes of the graph correspond to semantic content.

In text retrieval, the object is generally the text, and the observations used in ranking are the query. Texts may be ranked by:

$$P(Q \mid \theta_T) \qquad (1.1)$$


where Q is the query, T is the text (usually a document), and θ_T is the model of T. This is referred to as a query-likelihood approach to ranking. Much work in text retrieval using this approach is summarized in Chapter 2.

In this work, we model documents and their components (such as paragraphs, sections, and titles) using multiple hierarchies of models, with the document structure and/or annotations as the nodes in the hierarchies. For each of these nodes, we estimate a simple generative model that may be used in ranking. This allows documents and components to be ranked for retrieval using the query-likelihood approach in Equation 1.1.

1.2.2 Overview of the Model and Proposal

The two modeling approaches presented in this work are novel adaptations of the generative model for a single representation using hierarchical structure (Model SR) and for multiple document representations (Model MR) (Chapter 3). Generative models (of which language models are a subset) have some key advantages over other approaches for the task of structured retrieval:

• Generative models are founded on statistical methods, which provide some guidance on estimation methods.

• The basic generative model is quite simple, which will ease analysis of the model.

• Modeling structured documents fits naturally within the generative framework (Chapter 3).

• Combination of evidence within the generative framework can leverage existing methods from statistics, such as linear interpolation or smoothing using Dirichlet priors.

The model provides a framework for the combination of evidence from structure within a document, external to the document, and from knowledge about the information request, as in structured queries or knowledge about the characteristics of relevant components. For example, when providing a search service for the location of homepages on the Internet, the knowledge that homepages' URLs very often fit the pattern http://www.xxx.com/ is extremely useful. In a reading tutor system, the knowledge that a student is in fifth grade may indicate that the system would prefer documents approximately at the fifth grade reading level.

Given other tasks where retrievable items may be modeled as multiple hierarchically structured representations, one must define:

• which representations to use,

• the parameters for combination and representation of information in the hierarchy of one representation and the parameters for combination across representations, and

• which characteristics of relevant components to leverage and how to parameterize this information.

At this point in the research, there is little guidance for the three items in the list above. However, Chapter 5 focuses on how the first and second problems relate to the general combination of evidence problem, prior work in that area, and some possibilities for the investigation of solutions to the above. The chapter also discusses possible approaches to automatically setting parameters given that the representations have already been chosen. Chapter 4 describes a basic approach to the modeling of characteristics of relevant components.

The models described in Chapter 3 do not allow for multiple hierarchies of mark-up for a single representation. For example, in documents marked with parse trees for sentences, there may be multiple parses for a sentence, regardless of whether this is structure explicitly encoded by a human or an annotation provided by an automatic parser. There may also be overlapping hierarchies of inherently different types of mark-up, such as named-entities, parse trees, and semantic role labels. Additional difficulties arise when queries must express relationships among items across these mark-ups. Chapter 6 describes the challenges to the model through the discussion of a reading tutor system and a hypothetical question answering system. It also discusses directions for possible solutions to modeling these overlapping structures and annotations.

The proposal concludes with a summary (Chapter 7). The generative query-likelihood model for retrieval of documents with structure and annotations will pave the way for increasing interactivity of other systems with information retrieval systems. Up to this point, the needs of other systems as users of an IR system have not been explored in great depth, and this dissertation will address some of these concerns. The complexities of working with multiple structures and annotations of text must be explored in order to best serve other systems. The contributions of the proposed work are presented in more detail in the final chapter, along with a plan for the completion of the dissertation.


Chapter 2

Background in Generative Probabilistic Models for Text Retrieval

There has been a recent revolution in probabilistic models for text retrieval. Prior probabilistic models such as the classical probabilistic model of Sparck Jones and Robertson [41] and the 2-Poisson Model of Harter [17] focused on the direct estimation of relevance given evidence from the document. The introduction of statistical language models to Information Retrieval by Ponte and Croft [40] and Hiemstra [19] made a new set of statistical estimation methods available. Instead of directly estimating the probability of relevance, these approaches estimate the probability that the query was “generated” by the document, which, in a sense, measures the representativeness of the document with respect to the query. This chapter is a survey of existing work that is related to the model that Chapter 3 introduces.

Generally, a unigram language model is estimated from the text of the document. A unigram language model specifies a probability distribution over words in a vocabulary. Unigram language models assume independence among terms in the text. While this is not a realistic model of what happens in text, it is a classic example of the bias/variance trade-off commonly found in statistical estimation and an assumption made by many of the popular retrieval models. In this model each document is assumed to be created by a latent language model unique to the document. Documents are then ranked by the probability that the latent language model generated the query. This ranking approach is referred to as query-likelihood.

2.1 Generative Models for Text Retrieval

Language modeling was extensively researched by the speech recognition community as a means of estimating the probability of a word sequence (such as a sentence) given a sequence of phonemes recognized from an audio signal. The speech recognition community has developed sophisticated methods for estimating these probabilities. Their most important contributions to the use of language models in information retrieval are smoothing and methods for combining language models.

In information retrieval, texts and sometimes queries are represented using language models. These are typically unigram language models, which are much like bags-of-words: word order is ignored, and the model simply estimates the probability of a word given a chunk of text. Text ranking is done in one of two ways: by measuring how much a query language model diverges from text language models [23][39], or by estimating the probability that each text generated the query string [40][20][52][56].

2.1.1 Kullback-Leibler Divergence

Kullback-Leibler divergence has been used extensively in recent language modeling for retrieval. Xu [54] introduced the use of KL-divergence to the problem of information retrieval in the context of federated search. Also, Zhai [55] showed a derivation of the KL-divergence ranking function for ad-hoc retrieval within the risk minimization framework. Kullback-Leibler divergence is an information-theoretic measure of how much one probability distribution “diverges” from another. It specifically measures how much information is lost if we assume one distribution is true when the other is the true distribution [31]. In retrieval, texts may be ranked by the negative of the Kullback-Leibler (KL) divergence of the query from each text [23]:

$$-\mathrm{KL}(\theta_Q \,\|\, \theta_T) = -\sum_w P(w \mid \theta_Q)\,\log\frac{P(w \mid \theta_Q)}{P(w \mid \theta_T)} \;\overset{rank}{\equiv}\; \sum_w P(w \mid \theta_Q)\,\log P(w \mid \theta_T) \qquad (2.1)$$

where θ_T is the language model estimated from the text of the document, θ_Q is the language model estimated from the query, and P(w | θ) estimates the probability of the word w given the language model θ. The P(w | θ_Q) within the logarithm can be dropped in ranking because, for a given query, it contributes only a constant that does not depend on the document. Documents whose models the query's model diverges from less are ranked higher.
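As a rough illustration of Equation 2.1 (not part of the original proposal), the following Python sketch scores one text by the negative KL divergence of its model from a query model. The dictionary-based models and the small probability floor are illustrative assumptions, standing in for whatever smoothing the system actually uses.

import math

def neg_kl_score(query_model, text_model, floor=1e-10):
    """Negative KL divergence of the text model from the query model (Equation 2.1).

    query_model, text_model: dicts mapping word -> probability; the text model is
    assumed to be smoothed, with `floor` standing in for unseen-word probability.
    """
    score = 0.0
    for w, p_q in query_model.items():
        if p_q > 0.0:
            score -= p_q * math.log(p_q / text_model.get(w, floor))
    return score

# Texts whose models the query model diverges from least receive the highest scores.
query_model = {"jack": 0.5, "horner": 0.5}
text_model = {"jack": 0.02, "horner": 0.015, "corner": 0.01}
print(neg_kl_score(query_model, text_model))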

2.1.2 The Generative Query-Likelihood Language Model

The query-likelihood approach was first introduced by Ponte and Croft [40]. Documents are ranked according to the probability that the document produced the query vector. In this case the query takes the form of a binary vector in {0, 1}^{|V|}, where the ith entry in the vector indicates the presence or absence of the ith term of the vocabulary in the query string. In this approach, documents are ranked by:

$$P(Q \mid \theta_T) = \prod_{w \in Q} P(w \mid \theta_T) \prod_{w \notin Q} \left(1 - P(w \mid \theta_T)\right) \qquad (2.2)$$

More recent work by Hiemstra [20], Westerveld et al. [52], and Zhai [57] treats queries as a string rather than a binary vector. This allows for placing more weight on frequently occurring query terms, as in Zhai [57]:

$$P(Q \mid \theta_T) = \prod_{w \in Q} P(w \mid \theta_T)^{tf(w,Q)} \;\overset{rank}{\equiv}\; \sum_{w \in Q} tf(w,Q)\,\log P(w \mid \theta_T) \qquad (2.3)$$

where Q is the query and tf(w, Q) is the term frequency (count) of the term w in the query Q. Hiemstra also explicitly includes a document prior when ranking, which will be discussed more in Chapter 4.

Documents more likely to have produced the query are ranked higher. Under the assumptions that query terms are generated independently and that the query language model used in KL-divergence is the maximum-likelihood estimate, the generative model and KL divergence produce the same rankings [39].
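The rank-equivalent log form of Equation 2.3 is straightforward to compute. The sketch below is an illustration only, not the proposal's implementation; it assumes a callable that returns the smoothed probability P(w | θ_T) for a word, and documents would then be sorted by this score.

import math
from collections import Counter

def query_likelihood_score(query_terms, p_w_given_text):
    """sum_{w in Q} tf(w, Q) * log P(w | theta_T), the log form of Equation 2.3.

    query_terms: the query as a list of tokens.
    p_w_given_text: callable returning the smoothed probability of a word under
    the text's language model (a hypothetical interface for illustration).
    """
    tf_q = Counter(query_terms)
    return sum(tf * math.log(p_w_given_text(w)) for w, tf in tf_q.items())

# Example with a toy smoothed model held in a dict:
model = {"jack": 0.02, "horner": 0.015}
print(query_likelihood_score(["jack", "horner", "horner"],
                             lambda w: model.get(w, 1e-10)))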

2.1.3 Generative Relevance Models

In [24], Lavrenko presents an alternative to direct comparison of the query and document models or the query-likelihood model. Lavrenko hypothesizes that relevant documents and queries are generated from a relevance sub-space that lies in a common representation space. Relevant documents are generated from a point in the relevant sub-space via a deterministic document generator. Queries are generated similarly, but the query generator is different from the document generator. The query is treated as evidence of where the relevant sub-space lies in the common representation space. Lavrenko makes the assumption that the random variables in the latent representation in the common space are exchangeable, rather than independent. For ad-hoc retrieval, Lavrenko argues for a model-comparison ranking, which is very similar to the KL-divergence approach:

$$-\mathrm{KL}(E_Q \,\|\, \theta_T) \;\overset{rank}{\equiv}\; \sum_w E_Q(w)\,\log P(w \mid \theta_T) \qquad (2.4)$$

where E_Q is the expected parameter vector of the relevance model and is:

$$E_Q(w) = \frac{\sum_T P(w \mid \theta_T) \prod_{i=1}^{|Q|} P(q_i \mid \theta_T)}{\sum_T \prod_{i=1}^{|Q|} P(q_i \mid \theta_T)} \qquad (2.5)$$


E_Q estimates the relevance model based on the evidence present in the query. In a crude way, it can be viewed as a smoothed query model based on co-occurrence information found in the text.
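A minimal sketch of Equation 2.5 (not from the proposal), assuming the sum over T is restricted to a small set of already-smoothed candidate text models, as is common in practice; the probability floor is an illustrative stand-in for smoothing.

def relevance_model(query_terms, text_models, vocabulary, floor=1e-10):
    """Expected parameter vector E_Q(w) of Equation 2.5.

    text_models: list of dicts mapping word -> P(w | theta_T), one per candidate text.
    Each text is weighted by the likelihood it assigns to the query string.
    """
    weights = []
    for model in text_models:
        p = 1.0
        for q in query_terms:
            p *= model.get(q, floor)
        weights.append(p)
    norm = sum(weights)
    return {w: sum(wt * m.get(w, 0.0) for wt, m in zip(weights, text_models)) / norm
            for w in vocabulary}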

2.1.4 The Maximum-Likelihood Estimate of a Language Model

The most direct way to estimate a language model given some observed text is to use the maximum-likelihood estimate (MLE), assuming an underlying multinomial model. In this case, the maximum-likelihood estimate is also the empirical distribution or the histogram distribution. An advantage of this estimate is that it is easy to compute. It is very good at estimating the probability distribution for the language model when the size of the observed text is very large. It is given by:

$$P(w \mid \mathrm{MLE}(T)) = \frac{tf(w,T)}{|T|} \qquad (2.6)$$

where |T| is the length of the observed text in terms. Note that |T| = ∑_w tf(w, T). The maximum-likelihood estimate is not good at estimating low-frequency terms for short texts. For example, the MLE will assign zero probability to words not occurring in the text T. This creates a serious problem for estimating text language models in both the KL-divergence and generative language model approaches to ranking texts, as the log of zero is negative infinity. The solution to this problem is smoothing.
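A small sketch (not from the proposal) that makes the zero-probability problem of Equation 2.6 concrete:

from collections import Counter

def mle_language_model(tokens):
    """P(w | MLE(T)) = tf(w, T) / |T| (Equation 2.6)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

model = mle_language_model("little jack horner sat in the corner".split())
print(model["jack"])          # 1/7
print(model.get("pie", 0.0))  # 0.0 -- log(0) would be negative infinity in ranking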

2.1.5 Smoothing

Smoothing is the re-estimation of the probabilities in a language model. Smoothing is motivated by the fact that many of the language models we estimate are based on a small sample of the “true” probability distribution. It improves the estimates by leveraging known patterns of word usage in language and can leverage other language models based on larger samples. In Information Retrieval, smoothing is very important [56], because the language models tend to be constructed from very small amounts of text. How we estimate the probabilities for low-frequency words can have large effects on the text scores. In both approaches to ranking texts, the score assigned to a text in ranking is a sum of logarithms of the probability of a word given the text's model. In addition to the problem of zero probabilities presented above for maximum-likelihood estimates, much care is required if this probability is close to zero. Small changes in the probability will have large effects on the logarithm of the probability, in turn having large effects on the text scores.

The smoothing technique most commonly used is linear interpolation. Linear interpolation is a simple approach to combining estimates from different language models:

$$P(w \mid \theta) = \sum_{m=1}^{M} \lambda_m\, P(w \mid \theta_m), \qquad \sum_{m=1}^{M} \lambda_m = 1 \qquad (2.7)$$

where M is the number of language models we are combining, and λ_m is the weight on the model θ_m. To ensure that this is a valid probability distribution, the λ parameters must sum to one.

One use of linear interpolation is to smooth the estimates of the text's language model with a collection language model:

$$P(w \mid \theta_T) = \lambda_T\, P(w \mid \mathrm{MLE}(T)) + \lambda_C\, P(w \mid \theta_C) \qquad (2.8)$$

where θ_T is the text language model and θ_C is the collection language model. This new model would then be used as the smoothed text language model in either the generative or KL-divergence ranking approach. The following paragraphs describe ways commonly used in IR to choose the weight on the text model and the collection model.
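A sketch of Equation 2.8 (illustrative, not the proposal's code), under the assumption that both models are available as word-to-probability dicts; how λ_T is chosen is left to the strategies described next.

def interpolate(text_model, collection_model, lambda_t):
    """P(w | theta_T) = lambda_T * P(w | MLE(T)) + lambda_C * P(w | theta_C) (Equation 2.8)."""
    lambda_c = 1.0 - lambda_t
    vocab = set(text_model) | set(collection_model)
    return {w: lambda_t * text_model.get(w, 0.0) + lambda_c * collection_model.get(w, 0.0)
            for w in vocab}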


Jelinek-Mercer Smoothing

The most basic approach to setting the linear interpolation parameters is Jelinek-Mercer smoothing. Text and collection models are constructed using the maximum-likelihood estimate. The λ parameters are set using a constant λ parameter across all documents in the collection: λ_T is set to 1 − λ and λ_C = λ. This is a simple estimate and makes no assumptions about the ability to estimate the information content found in the text.

Smoothing Using Dirichlet Priors

Another common approach to linearly interpolating a text and a collection language model is Bayesian smoothing using Dirichlet priors [56]. The interpolation is done as in Equation 2.8, but the linear interpolation parameters are now set as:

$$\lambda_T = \frac{|T|}{|T| + \mu} \qquad\qquad \lambda_C = \frac{\mu}{|T| + \mu} \qquad (2.9)$$

where the parameter µ controls the effect of the collection model for short documents. Zhai describes a leave-one-out approach to estimating µ in [57]. This smoothing technique has been found effective for ad-hoc text retrieval on several collections by Ogilvie et al. [39], Westerveld et al. [52], and Zhai and Lafferty [56].

This approach to setting λ_T and λ_C has the nice benefit of placing more emphasis on the text model as the text's length increases. This fits nicely with the intuition that the quality of the estimated language model depends on the length of the text: as we see more text, we are more confident in the text model. Another reason for smoothing using Dirichlet priors is that the Dirichlet prior is the conjugate prior of the multinomial distribution and as such fits nicely into Bayesian statistics.

It is useful to characterize how λ_C varies as a function of |T|. When |T| is 0, λ_C = 1. As |T| gets very large, λ_C asymptotes at 0. The parameter µ controls the rate at which λ_C approaches 0.

Two-Stage Smoothing

Zhai and Lafferty [57] motivate two-stage smoothing using the risk minimization framework and argue that the approach separates the influence of the query and the text collection on the choice of parameters. The first stage of smoothing interpolates the text model with the collection model using Dirichlet prior smoothing. The second stage combines the smoothed text model with a background query model using Jelinek-Mercer smoothing. In their work, they use the collection model as the background query model, although they do suggest that there may be other options for the background query model.

The usage of the collection model as the background query model results in a simple combination of Dirichlet prior smoothing with Jelinek-Mercer smoothing. A rewrite of the ranking equation in [57] results in interpolation parameters set by:

$$\lambda_T = \frac{(1-\lambda)\,|T|}{|T| + \mu} \qquad\qquad \lambda_C = \frac{\lambda\,|T| + \mu}{|T| + \mu} \qquad (2.10)$$

This approach is an interesting hybrid of the two previously discussed smoothing strategies. Instead of λ_C approaching 0 when |T| gets very large, λ_C now asymptotes at λ. This can be viewed as addressing a potential problem with Dirichlet prior smoothing, which results in placing almost no emphasis on the collection model for very large texts.
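The three parameter-setting strategies differ only in how λ_T and λ_C vary with the text length |T|. The sketch below (with illustrative values for λ and µ, not the proposal's settings) makes the contrast concrete: Jelinek-Mercer weights are constant, Dirichlet weights drive λ_C toward 0 for long texts, and two-stage weights level off at λ.

def jelinek_mercer_weights(text_length, lam=0.5):
    # lambda_T = 1 - lambda, lambda_C = lambda, independent of |T|
    return 1.0 - lam, lam

def dirichlet_weights(text_length, mu=1000.0):
    # Equation 2.9: lambda_C = 1 for an empty text and approaches 0 as |T| grows
    return text_length / (text_length + mu), mu / (text_length + mu)

def two_stage_weights(text_length, lam=0.3, mu=1000.0):
    # Equation 2.10: like Dirichlet priors, but lambda_C asymptotes at lambda
    lambda_t = (1.0 - lam) * text_length / (text_length + mu)
    lambda_c = (lam * text_length + mu) / (text_length + mu)
    return lambda_t, lambda_c

for n in (0, 100, 10000):
    print(n, dirichlet_weights(n), two_stage_weights(n))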

2.2 Summary

This chapter presented the basic generative framework used for much recent work within Information Retrieval. This provides the background for the structured generative model that is presented in the next chapter.


Chapter 3

Modeling Document Structure

With the recent shifts in research activities for structured document retrieval outlined in the introduction, it is becoming more practical to work on retrieval and understand techniques that leverage document structure. Whether the goal is to retrieve components of a document or to create better models of text for comparison to the query, generative probabilistic models can naturally and simply model structure.

Before presenting the model, a definition is required. The retrieval task is not always document retrieval, but may include retrieval of document components, such as paragraphs or sections. We introduce the notion of a rankable item. A rankable item is any text, document, image, etc. that may be returned by the system for the specific retrieval task at hand. Sets of rankable items for the documents in a collection must be defined prior to retrieval. For example, ad-hoc retrieval on the Web may have only the complete documents themselves as the rankable items. For retrieval of XML document components, any component of a document in the XML structure may be treated as a rankable item. A rankable item may be described by one or more representations. For example, in Web retrieval it is common to model the documents using multiple representations: the content of the document, the in-link text of the document, and meta-data are all common representations used to model a document. The rankable item in this case is described by the composite of the representations of the document.

This chapter presents a new model for representing and retrieving structured documents with multiple representations. Section 3.1 presents a model for hierarchically structured documents and Section 3.2 extends it to support multiple representations. Following that, ranking documents for retrieval is described. Experiments with the basic hierarchical model, the extended model with multiple representations, and also using trends of relevant documents across queries are described in Section 3.4. Finally, the model and experiments are reviewed and open problems are presented. The following section considers the case where a single hierarchical representation is used to model a document.

3.1 Hierarchically Structured Documents

Hierarchically structured documents may be represented as a tree, where nodes in the tree correspond to components of the document. From the content of the document, a generative distribution may be estimated for each node in the tree. The distribution at a node may be estimated using evidence from the text of the node, the children's distributions, or the parent's distribution, and so on.

Representing hierarchically structured documents in this manner is simple and flexible. In this approach we combine the evidence from the various components in the document using the document structure as guidance. We propose to investigate the use of linear interpolation to combine the evidence from the component, its children components, and its parent component. This is natural, as structural boundaries such as paragraphs, sections, chapters, and so on often correspond with natural boundaries in the semantics of the content.

Figure 3.1 presents Model SR, the single representation model of a hierarchically structured document. The hierarchical graph structure of the representation is characterized by a graph G. The three estimation steps given in Equations 3.2 through 3.4 specify how the model estimated directly from the component is smoothed with a collection model, how the children's models are included, and how context is given by interpolating the parent model.

The first step (Equation 3.2) interpolates the maximum likelihood estimate of the component with a background model. The background model is sometimes referred to as a “universal” model, hence the u in λ^u. The type function may be used to specify document-component-specific models, as the language in titles may be different from other text, or it may simply return one language model for all components, which would provide larger amounts of text for the estimation of the corpus model. This model does not include evidence found in the component's children; this is done in the next step.

The second step (Equation 3.3) interpolates the model estimated by Equation 3.2 with the models estimated for the component's children. If the λ^c parameters are set proportional to the length of the text in the nodes, as in

$$\lambda^{c'}_{v_i} = \frac{|v_i|}{|v_i|_a}, \qquad \lambda^{c}_{v_j} = \frac{|v_j|_a}{|v_i|_a - |v_i|}, \qquad |v_i|_a = |v_i| + \sum_{v_k \in children(v_i)} |v_k|_a \qquad (3.1)$$

where |v_i| is the length in tokens of the text in node v_i not including tokens in its children and |v_i|_a is the accumulated length in tokens of the text in node v_i (including tokens in its descendants), then θ′_{v_i} is equivalent to a flat text model estimated from the text in node v_i interpolated with a background model. However, this choice of parameters is not required, and the weight placed on a child node may be dependent on the node's type. For example, the text in title nodes may be more representative of queries than the body of a document, so a higher weight on the title model may improve retrieval performance.

The final step (Equation 3.4) includes evidence from the parent's model. Incorporating the parent model in this way allows the context of the component within the document to influence the language model. This is referred to as shrinkage. In [33], McCallum and Nigam introduced shrinkage to information retrieval in the context of text classification. Classes were modeled hierarchically in this work. Class language models were estimated from the text in the documents assigned to the class, and all ancestor language models in the class hierarchy. The hierarchical model for classes used in [33] is very similar to the model of documents presented in this proposal. The difference in model estimation in this proposal is the application of shrinkage to document models, rather than class models. In general, the term “shrinkage” refers to any smoothing of a model estimated from a small sample with a model estimated from a larger sample, but we will specifically use shrinkage in this document to refer to hierarchical shrinkage.

The choice of linear interpolation parameters λ may depend upon the task and corpus. Ideally, these parameters would be set to maximize some measure of retrieval performance through automated learning. A set of rankable items R must also be defined for the document. For the hierarchical model here, R may be any subset of V.

3.1.1 Example

The estimation process described above may be clarified by describing the process for an example document. The example document is a well-known children's poem encoded in XML:

<poem id=‘poem1’>
  <title> Little Jack Horner </title>
  <body>
    Little Jack Horner
    Sat in the corner,
    Eating of Christmas pie;
    He put in his thumb
    And pulled out a plumb,


Model SR

The hierarchical structure of the document is represented by a tree, each vertex v ∈ V in the tree corresponding to a document component. Directed edges in the tree are represented as a list of vertex pairs (v_i, v_j) ∈ E, where v_i is the parent of v_j. The tree is defined in graph notation as:

G = (V, E)
V = {v_1, v_2, v_3, ..., v_{|V|}}
E = {(v_i, v_j) such that v_i, v_j ∈ V and there is an edge from v_i to v_j}

Parent and children functions are defined as:

parent(v_j) = v_i : (v_i, v_j) ∈ E
children(v_i) = {v_j : (v_i, v_j) ∈ E}

Estimation of the generative models for the components of a document is a three-step process:

1. A smoothed generative model θ_{v_i} is estimated from the observations of the component v_i in the document; it does not include evidence of children components:

$$P(w \mid \theta_{v_i}) = \left(1 - \lambda^{u}_{v_i}\right) P(w \mid \mathrm{MLE}(v_i)) + \lambda^{u}_{v_i}\, P\!\left(w \mid \theta_{type(v_i)}\right) \qquad (3.2)$$

The θ_{v_i} model estimates a distribution directly from observed text within the document component. The θ_{type(v_i)} model is a collection-level background model for smoothing these estimates.

2. The next step is to estimate the intermediate θ′_{v_i} model, from the bottom of the tree up to the top:

$$P(w \mid \theta'_{v_i}) = \lambda^{c'}_{v_i}\, P(w \mid \theta_{v_i}) + \left(1 - \lambda^{c'}_{v_i}\right) \sum_{v_j \in children(v_i)} \lambda^{c}_{v_j}\, P(w \mid \theta'_{v_j}) \qquad (3.3)$$

As required for a proper linear interpolation, ∑_{v_j ∈ children(v_i)} λ^c_{v_j} = 1. The θ′_{v_i} model incorporates the evidence from the children nodes.

3. After the estimation of θ′_{v_i}, the θ″_{v_i} models used for ranking the components are estimated from the root of the tree down:

$$P(w \mid \theta''_{v_i}) = \left(1 - \lambda^{p}_{parent(v_i)}\right) P(w \mid \theta'_{v_i}) + \lambda^{p}_{parent(v_i)}\, P\!\left(w \mid \theta''_{parent(v_i)}\right) \qquad (3.4)$$

The θ″_{v_i} model incorporates evidence from the parent node, allowing context to be incorporated into the model.

Figure 3.1: Model SR describes a single representation model of a hierarchically structured document.


[Figure: a tree with root v_1 (poem), children v_2 (title) and v_3 (body), and v_4 (quote) as a child of v_3.]

Figure 3.2: The tree representing document structure for “Little Jack Horner”.

    And cried, <quote> What a good boy am I! </quote>
  </body>
</poem>

There are four components of this document, corresponding to the poem, title, body, and quote tags. Let us now assign labels to these components: v_1 to the poem, v_2 to the title component, v_3 to the body component, and v_4 to the quote. The structure of the document may be drawn, as in Figure 3.2, or be described as a set of vertices and edges:

G = (V, E)
V = {v_1, v_2, v_3, v_4}
E = {(v_1, v_2), (v_1, v_3), (v_3, v_4)}

Not all components of the document must be rankable items. A set of rankable items must be defined. In our example, perhaps only the poem and the body of the poem are considered rankable items: R = {v_1, v_3}.

The estimation process is illustrated in Figure 3.3. First, smoothed θ_{v_i} models are estimated for each vertex using the text occurring in the document component corresponding to the vertex. Note that “What a good boy am I!” is not used for the estimation of θ_{v_3}; it is only used in the estimation of the model for v_3's child node v_4:

$$P(w \mid \theta_{v_i}) = \left(1 - \lambda^{u}_{v_i}\right) P(w \mid \mathrm{MLE}(v_i)) + \lambda^{u}_{v_i}\, P\!\left(w \mid \theta_{type(v_i)}\right) \qquad (3.5)$$

Next, the θ′_{v_i} models are estimated by combination of the θ_{v_i} model and the θ′ models of v_i's children. For example, θ′_{v_3} is an interpolation of θ_{v_3} and θ′_{v_4}. Similarly, θ′_{v_1} is an interpolation of θ_{v_1}, θ′_{v_2}, and θ′_{v_3}:

$$\begin{aligned}
P(w \mid \theta'_{v_1}) &= \lambda^{c'}_{v_1} P(w \mid \theta_{v_1}) + \left(1 - \lambda^{c'}_{v_1}\right)\left(\lambda^{c}_{v_2} P(w \mid \theta'_{v_2}) + \lambda^{c}_{v_3} P(w \mid \theta'_{v_3})\right) \\
P(w \mid \theta'_{v_2}) &= P(w \mid \theta_{v_2}) \\
P(w \mid \theta'_{v_3}) &= \lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \left(1 - \lambda^{c'}_{v_3}\right)\lambda^{c}_{v_4} P(w \mid \theta'_{v_4}) \\
P(w \mid \theta'_{v_4}) &= P(w \mid \theta_{v_4})
\end{aligned} \qquad (3.6)$$

Finally, the θ″_{v_i} models used in ranking are estimated by interpolating the θ′_{v_i} model with the θ″_{parent(v_i)} model. In the example, θ″_{v_1} is simply taken as θ′_{v_1}, as v_1 has no parent. The other vertices do have parents, and θ″_{v_3} is an interpolation of θ′_{v_3} and θ″_{v_1}:

$$\begin{aligned}
P(w \mid \theta''_{v_1}) &= P(w \mid \theta'_{v_1}) \\
P(w \mid \theta''_{v_3}) &= \left(1 - \lambda^{p}_{v_1}\right) P(w \mid \theta'_{v_3}) + \lambda^{p}_{v_1} P(w \mid \theta''_{v_1})
\end{aligned} \qquad (3.7)$$
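To make the three estimation steps concrete, here is a minimal Python sketch of Model SR applied to the example tree. It is an illustration only, not the proposal's implementation: the λ values are arbitrary, every child of a node receives an equal weight rather than the length-proportional weights of Equation 3.1, and a single collection model serves as the background model for all component types.

from collections import Counter

class Node:
    def __init__(self, name, text, children=()):
        self.name, self.children = name, list(children)
        self.counts = Counter(text.lower().split())     # text excluding descendants
        self.theta = self.theta1 = self.theta2 = None   # theta, theta', theta''

def mix(models_and_weights, vocab):
    """Linear interpolation of several word -> probability dicts."""
    return {w: sum(lam * m.get(w, 0.0) for m, lam in models_and_weights) for w in vocab}

def estimate_up(node, collection, vocab, lam_u=0.2, lam_c1=0.5):
    """Steps 1 and 2 (Equations 3.2 and 3.3), applied bottom-up."""
    total = sum(node.counts.values())
    if total:
        mle = {w: c / total for w, c in node.counts.items()}
        node.theta = mix([(mle, 1 - lam_u), (collection, lam_u)], vocab)
    else:
        node.theta = dict(collection)   # no text of its own: use the background model
    for child in node.children:
        estimate_up(child, collection, vocab, lam_u, lam_c1)
    if node.children:
        child_w = (1 - lam_c1) / len(node.children)     # equal child weights (toy choice)
        node.theta1 = mix([(node.theta, lam_c1)] +
                          [(c.theta1, child_w) for c in node.children], vocab)
    else:
        node.theta1 = node.theta

def shrink_down(node, vocab, parent_theta2=None, lam_p=0.3):
    """Step 3 (Equation 3.4), applied top-down from the root."""
    if parent_theta2 is None:
        node.theta2 = node.theta1
    else:
        node.theta2 = mix([(node.theta1, 1 - lam_p), (parent_theta2, lam_p)], vocab)
    for child in node.children:
        shrink_down(child, vocab, node.theta2, lam_p)

# The example tree; token counts are toy values.
quote = Node("v4", "what a good boy am I")
body  = Node("v3", "little jack horner sat in the corner eating of christmas pie "
                   "he put in his thumb and pulled out a plumb and cried", [quote])
title = Node("v2", "little jack horner")
poem  = Node("v1", "", [title, body])

all_counts = Counter()
for n in (poem, title, body, quote):
    all_counts.update(n.counts)
total = sum(all_counts.values())
collection = {w: c / total for w, c in all_counts.items()}
vocab = set(all_counts)

estimate_up(poem, collection, vocab)
shrink_down(poem, vocab)
print(body.theta2["horner"])   # P(horner | theta''_{v3}), usable for ranking the body component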

[Figure: the estimation stages applied to the example tree: (1) direct estimates and smoothing with the background model, (2) smoothing up the tree, (3) smoothing down the tree.]

Figure 3.3: The estimation process for “Little Jack Horner”.

We can expand the equations for these rankable items to a longer version as follows:

$$\begin{aligned}
P(w \mid \theta''_{v_1}) ={}& \lambda^{c'}_{v_1} P(w \mid \theta_{v_1}) + \left(1 - \lambda^{c'}_{v_1}\right)\Big(\lambda^{c}_{v_2} P(w \mid \theta_{v_2}) + \lambda^{c}_{v_3}\big(\lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \big(1 - \lambda^{c'}_{v_3}\big)\lambda^{c}_{v_4} P(w \mid \theta_{v_4})\big)\Big) \\[4pt]
P(w \mid \theta''_{v_3}) ={}& \left(1 - \lambda^{p}_{v_1}\right)\big(\lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \big(1 - \lambda^{c'}_{v_3}\big)\lambda^{c}_{v_4} P(w \mid \theta_{v_4})\big) + \lambda^{p}_{v_1} P(w \mid \theta''_{v_1}) \\
={}& \left(1 - \lambda^{p}_{v_1}\right)\big(\lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \big(1 - \lambda^{c'}_{v_3}\big)\lambda^{c}_{v_4} P(w \mid \theta_{v_4})\big) \\
&+ \lambda^{p}_{v_1}\Big(\lambda^{c'}_{v_1} P(w \mid \theta_{v_1}) + \left(1 - \lambda^{c'}_{v_1}\right)\Big(\lambda^{c}_{v_2} P(w \mid \theta_{v_2}) + \lambda^{c}_{v_3}\big(\lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \big(1 - \lambda^{c'}_{v_3}\big)\lambda^{c}_{v_4} P(w \mid \theta_{v_4})\big)\Big)\Big) \\
={}& \left(1 - \lambda^{p}_{v_1} + \big(1 - \lambda^{c'}_{v_1}\big)\lambda^{p}_{v_1}\lambda^{c}_{v_3}\right)\big(\lambda^{c'}_{v_3} P(w \mid \theta_{v_3}) + \big(1 - \lambda^{c'}_{v_3}\big)\lambda^{c}_{v_4} P(w \mid \theta_{v_4})\big) \\
&+ \lambda^{p}_{v_1}\big(\lambda^{c'}_{v_1} P(w \mid \theta_{v_1}) + \big(1 - \lambda^{c'}_{v_1}\big)\lambda^{c}_{v_2} P(w \mid \theta_{v_2})\big)
\end{aligned} \qquad (3.8)$$

As an aside, we note that the model does not handle attributes of document components, such as id=‘poem1’ in our example. These attributes are common in structured documents encoded in XML and HTML, and it is important to represent and provide the capability to query these attributes. Modeling attributes in this model is still an open question, and is discussed in more detail in Chapter 6.

Related Work

Some related work has already been described in Section 1.1. The most related recent work is found in the proceedings of the INEX workshops [13][14]. The INEX workshops are a forum for the evaluation of XML component retrieval. The collection consists of flat text queries and structured queries, and any document component may be a rankable item. This section will highlight works of particular interest.

In [44], a generative language modeling approach for content-only queries is described where a document component's language model is estimated by taking a linear interpolation of the maximum likelihood model from the text of the node and its descendants and a collection model. This corresponds to a special case of our approach. They also experiment with blind feedback, length priors, and mixing the document and component scores.

The authors of [27] also present a generative language model for content-only queries in structured document retrieval. They estimate the collection model in a different way, using document frequencies instead of collection term frequencies. As with [44], their basic model can be viewed as a special case of the language modeling approach presented here. They also experiment with mixing in the scores of nearby components, which is similar to shrinkage. Further experiments included length priors, which they found beneficial.

Neither of these works explicitly models the hierarchy of document structure during ranking. The model presented here is a generalization of these models and allows for more sophisticated estimation techniques such as shrinkage with the parent model and different weights on the children components.

3.2 Multiple Representations

Section 3.1 presented Model SR, which models documents where a clear hierarchy is expressed within a single representation of a document. However, there are environments where there exist alternative representations of a rankable document component. For example, Web documents link to other documents. When ranking a document, one valid representation of the document may be the text of other documents' links to it. This text exists outside of the document and its hierarchical structure, so it cannot be represented in the hierarchical model described above. This section presents Model MR, which extends Model SR to allow multiple representations of components.

Model MR is the approach to including multiple representations, and it is presented in Figure 3.4. It is a simple extension of the hierarchical model presented above. For any rankable item one defines a set of representations. Each representation may be a component in a hierarchically structured representation of a document. Model SR described how we estimate the generative model of a document component, so all that is left now is a description of how the distributions are combined. As before, a natural choice for the combination is linear interpolation.

Model MR is a simple extension of Model SR. It essentially connects multiple document representations modeled using Model SR. The set R defines a set of rankable items and C defines which components are representations of the rankable items. Equation 3.9 combines the representations using a simple linear interpolation. As with other interpolation parameters, the choice of values for the λ^r parameters would depend on the task and the representations.


Model MR

The structure of a document is captured using a set of graphs in the multiple representation model (Model MR). Each graph is for a representation of the document. Additionally, there is a set of references across graphs that specify the different representations for a rankable item:

D = (G, R, C)
G = {G_1, G_2, ..., G_{|G|}}
R = {r_1, r_2, ..., r_{|R|}}
C = {c_1, c_2, ..., c_{|C|}}, where c_l is of the form (r_i, v^j_k)

In the above definition, R is the set of rankable items, C links the representations to the rankable components, and v^j_k denotes the kth vertex of representation graph G_j.

For convenience, we define a function that returns the representation set for a rankable item:

reps(r_i) = {v^j_k : (r_i, v^j_k) ∈ C}

The generative model used for ranking retrievable items is then defined as:

$$P(w \mid \theta_{r_i}) = \sum_{v^j_k \in reps(r_i)} \lambda^{r}_{v^j_k}\, P\!\left(w \mid \theta''_{v^j_k}\right) \qquad (3.9)$$

where θ″_{v^j_k} is as defined in Model SR.

Figure 3.4: Model MR describes a multiple representation model where each representation may be hierarchically structured.


3.2.1 Example

The discussion of hierarchically structured representations included an example of “Little Jack Horner” marked up in XML. There could be many additional representations, such as variations on the poem or translations into other languages. For example, a controlled vocabulary can be used to label documents with medical terms in the MeSH corpus. In this case we will consider an environment similar to the Web, with links or cross-references across texts. Suppose that there are several documents that reference the poem with some text associated with the reference. For example, consider the following text snippets:

<document id=‘doc1’>... the poem <link id=‘poem1’> Little Jack Horner </link> is rumored to be based on ...</document>

<document id=‘doc2’>... a nursery rhyme about <link id=‘poem1’> the stealing of a deed </link> ...</document>

<document id=‘doc3’>... <link id=‘poem1’> 16th century nursery rhyme </link> ...</document>

<document id=‘doc4’>... a poem rumored about <link id=‘poem1’> Thomas Horner and the Abbot of Glastonbury </link> ...</document>

The links to the poem can be collected to form a new document representation:

<links id=‘poem1’>
  <link source=‘doc1’> Little Jack Horner </link>
  <link source=‘doc2’> the stealing of a deed </link>
  <link source=‘doc3’> 16th century nursery rhyme </link>
  <link source=‘doc4’> Thomas Horner and the Abbot of Glastonbury </link>
</links>

The document now has two graph representations: the graph of the poem in the prior discussion, which we will take as G_1, and a new graph G_2 for the link-based representation. For the link-based representation, the links component will be assigned v^2_1, and the four link components will be assigned to vertices v^2_2, v^2_3, v^2_4, and v^2_5:

    G = {G_1, G_2}

    G_1 = (V_1, E_1)
    V_1 = {v^1_1, v^1_2, v^1_3, v^1_4}
    E_1 = {(v^1_1, v^1_2), (v^1_1, v^1_3), (v^1_3, v^1_4)}

    G_2 = (V_2, E_2)
    V_2 = {v^2_1, v^2_2, v^2_3, v^2_4, v^2_5}
    E_2 = {(v^2_1, v^2_2), (v^2_1, v^2_3), (v^2_1, v^2_4), (v^2_1, v^2_5)}

For this example, we will assume that there is only one rankable item representing the poem. This need not be the case. For example, a link may refer to a component within a representation. The structure of the document is then specified by:

    D = (G, R, C)
    R = {r_1}
    C = {(r_1, v^1_1), (r_1, v^2_1)}

Estimation of the θ_{r_1} distribution is then taken as an interpolation of θ''_{v^1_1} and θ''_{v^2_1}:

    P(w | θ_{r_1}) = λ_{r_{v^1_1}} P(w | θ''_{v^1_1}) + λ_{r_{v^2_1}} P(w | θ''_{v^2_1})    (3.10)

Since neither v^1_1 nor v^2_1 has a parent, θ''_{v^1_1} = θ'_{v^1_1} and θ''_{v^2_1} = θ'_{v^2_1}. By further expanding using Equation 3.3, we can rewrite the estimation of probabilities for words of the poem as:

    P(w | θ_{r_1}) =
        λ_{r_{v^1_1}} [ λ_{c'_{v^1_1}} P(w | θ_{v^1_1})
            + (1 − λ_{c'_{v^1_1}}) ( λ_{c_{v^1_2}} P(w | θ_{v^1_2})
                + λ_{c_{v^1_3}} ( λ_{c'_{v^1_3}} P(w | θ_{v^1_3}) + λ_{c_{v^1_4}} P(w | θ_{v^1_4}) ) ) ]
      + λ_{r_{v^2_1}} [ λ_{c'_{v^2_1}} P(w | θ_{v^2_1})
            + (1 − λ_{c'_{v^2_1}}) ( λ_{c_{v^2_2}} P(w | θ_{v^2_2}) + λ_{c_{v^2_3}} P(w | θ_{v^2_3})
                + λ_{c_{v^2_4}} P(w | θ_{v^2_4}) + λ_{c_{v^2_5}} P(w | θ_{v^2_5}) ) ]    (3.11)

3.2.2 Non-Textual Data

The model as described assumes that all leaf nodes contain textual data only. However, it is common to have non-text data in a document, such as dates, numbers, and pictures. Although the prior discussion considers a probability distribution over a vocabulary of words, there is nothing preventing modeling non-text data in a generative model. The primary difference for the inclusion of non-textual information is that the estimation of generative probabilities may no longer be a discrete distribution over vocabulary items. For example, P(w | θ_{v_i}) of Equation 3.2 may be defined over a continuous space, as in numeric fields. When non-textual data is included, an appropriate approach for estimating the generative probabilities must be defined for P(w | θ_{v_i}). The smoothing of P(w | θ_{v_i}) may also be different than the simple linear interpolations used with text. For example, we may assume that a number is distributed as a Gaussian with the mean taken to be the observed value and a reasonable estimate of the variance.

Images may also be modeled in this setting, though the approach may be more complex. Westerveld [52] proposes a method for modeling images using a Gaussian Mixture Model, which he argues provides a framework for combining image retrieval with text-based language modeling. Lavrenko's [24] generative relevance models also provide a generative framework for retrieval that allows non-textual data, such as images. The work presented here differs from previous work in that it provides an explicit model of structure in addition to the capability of handling non-textual data.

In addition, the multiple representation model would allow link-based anchor text to be used when ranking image components (as in Google's image search). Model MR can flexibly incorporate multiple representations of different natures.

3.3 Ranking Items for Queries

This section describes how rankable items are ordered for queries. Ordering of items is based on the query-likelihood model, where items are ranked by the probability of generating the query. For flat text queries, ranking is analogous to the query-likelihood models described in Chapter 2. However, it is desirable to provide support for structured queries as well. In addition to the traditional Boolean query constraints, it is also desirable to support constraints on where terms occur within document structure. This section describes how these types of queries are handled within the model.

3.3.1 Queries

Flat text queries may simply be ordered by P(Q | θ_r), where

    P(Q | θ_r) = Π_{w∈Q} P(w | θ_r)^{tf(w,Q)}    (3.12)

as before. This is the natural adaptation of the query-likelihood model to the model described in this chapter. If there is a single hierarchical representation, then θ_r = θ_{v_i}. For multiple representations, θ_r is estimated as specified in Equation 3.9.

Constraints on Document Structure

There are many cases where it is desirable to place constraints on where the query terms appear in the document structure of a representation. This can be done by constraining which θ_{v_i} distribution generates the terms. For our examples, we will use the NEXI [48] query language. NEXI is a sensible simplification of XPath that adds support for information retrieval needs. Consider the NEXI query

//poem[about(.//title, Horner)]

which requests poem components whose title component is about ‘Horner’. For our example document with a single representation, instead of measuring P(‘Horner’ | θ''_{poem}), corresponding to the probability that the poem component generated the query term ‘Horner’, P(‘Horner’ | θ''_{title}) is used. P(‘Horner’ | θ''_{title}) measures the probability that the title component generated ‘Horner’.

Constraining the generating distribution in this manner is a more strict interpretation of the constraints expressed in the query. Ranking items as described above requires that only poems be returned as results and that they contain titles about ‘Horner’. Note that through the use of smoothing, ‘Horner’ is not required to be present in the title.

There are cases where a structural constraint in the query may be applicable to multiple components. Consider the query:

//document[about(.//paragraph, plum)]

Most documents contain many paragraphs. Which distribution is chosen to generate ‘plum’? Many reasonable options are applicable. One approach may create a new distribution estimated as a combination of all of the θ''_{v_i} distributions corresponding to paragraph components. Another approach may take the θ''_{v_i} distribution corresponding to the paragraph node that maximizes the probability of generating ‘plum’. As a part of the dissertation, we will investigate which distribution or combination of distributions is best.
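The following sketch illustrates the two candidate interpretations just described, using hypothetical paragraph models and uniform mixture weights; it is a sketch of the options under discussion, not a committed design choice.

# Sketch of two interpretations of a constraint such as about(.//paragraph, plum):
# combine all paragraph distributions into one mixture, or take the paragraph
# that best generates the term.  Models and weights are illustrative.

def mixture_probability(term, paragraph_models):
    weight = 1.0 / len(paragraph_models)
    return sum(weight * model.get(term, 0.0) for model in paragraph_models)

def max_probability(term, paragraph_models):
    return max(model.get(term, 0.0) for model in paragraph_models)

paragraphs = [{"plum": 0.08, "pie": 0.02},
              {"plum": 0.01, "deed": 0.05},
              {"thumb": 0.04}]
print(mixture_probability("plum", paragraphs))   # 0.03
print(max_probability("plum", paragraphs))       # 0.08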

Boolean Operators

Boolean logic operators are commonly supported features in query languages. They allow the simple conjunction, disjunction and negation of query clauses. We will refer to these operations as ‘AND’, ‘OR’ and ‘NOT’, respectively. Though we refer to logical definitions of conjunction, disjunction and negation, we will treat these operators probabilistically. Basic support for these operators is relatively straightforward.

The ‘AND’ operator is the default operator when working with generative models. For example, ranking our example document against the query

//poem[about(.//title, Horner) AND about(., plum)]

is done by measuring

    P(‘Horner’ | θ''_{title}) P(‘plum’ | θ''_{poem})

This is a direct estimate of the probability that the title generated ‘Horner’ AND that the poem generated ‘plum’. The ‘NOT’ operator is also simple to estimate. In NEXI, a ‘NOT’ operator is expressed as a ‘-’ in front of a term:

//poem[about(., plum -Horner)]

This query expresses an information need for poems about ‘plum’ but not about ‘Horner’. The ‘NOT’ constraint may be interpreted as ‘the poem component did not generate “Horner”’. This can be estimated by taking one minus the probability that the poem component did generate ‘Horner’:

    P(‘plum’ | θ''_{v_1}) (1 − P(‘Horner’ | θ''_{v_1}))

Supporting ‘OR’ operators is done similarly. Due to term independence assumptions, the probabilities may simply be added up. If the query is:

//poem[about(., plum) OR about(., pear)]

which asks the retrieval engine to return poems that are about ‘plum’ or ‘pear’, then the probability for our sample document may be estimated as:

    P(‘plum’ | θ''_{v_1}) + P(‘pear’ | θ''_{v_1})

However, for all operators there are complications when they are used in conjunction with repeated query terms across clauses. The nesting of operators within NEXI poses some difficulties. There may be dependencies among clauses that need to be understood to get accurate probability estimates. For example, consider the query:


//poem[about(., plum thumb) OR about(., plum pear)]

In this case, the clauses share the term ‘plum’, and there is an overlap that must be subtracted away. The probabilities for the separate clauses can no longer be added. How the operators should be handled when there are shared query terms across clauses, and how structural constraints may additionally confound this problem, is an open question and will be investigated as a part of the dissertation.
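The sketch below illustrates the probabilistic treatment of ‘AND’, ‘NOT’, and ‘OR’ described above, under the simplifying assumption that clauses share no query terms; the clause probabilities are illustrative placeholders.

# Sketch of the probabilistic Boolean operators: AND multiplies clause
# probabilities, NOT uses one minus the clause probability, and OR adds
# clause probabilities (valid only when clauses share no terms).

def and_score(clause_probs):
    score = 1.0
    for p in clause_probs:
        score *= p
    return score

def not_score(clause_prob):
    return 1.0 - clause_prob

def or_score(clause_probs):
    return sum(clause_probs)

p_title_horner = 0.05    # P('Horner' | theta''_title), illustrative
p_poem_plum = 0.02       # P('plum'   | theta''_poem), illustrative
print(and_score([p_title_horner, p_poem_plum]))      # title Horner AND poem plum
print(p_poem_plum * not_score(p_title_horner))       # plum AND NOT Horner
print(or_score([0.02, 0.03]))                        # plum OR pear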

Proximity Operators

Proximity and phrase operators may also be supported within the model. Phrase operators can be modeled naturally using bigrams, as introduced to information retrieval by Song and Croft [46]. That is, one would estimate P(‘a b’ | θ) as P(‘a’ | θ) P(‘b’ | θ_a), where θ is the unigram model estimated from the text sample and θ_a is the bigram model estimated from the text sample of words occurring directly after ‘a’.

How best to handle proximity operators such as ‘a WITHIN 20 WORDS OF b’ within the generative framework has not been established. However, one approach may be to take a crude density estimate of ‘a’s distance from the term ‘b’. The generative probability P(‘a WITHIN 20 WORDS OF b’ | θ) could then be estimated as

    P(‘b’ | θ) Σ_{d=0}^{20} P(‘a’ at distance d from ‘b’ | θ_b)

where θ is the unigram model estimated from the text and P(‘a’ at distance d from ‘b’ | θ_b) is a density estimate of the probability that the term ‘a’ occurs d tokens away from ‘b’.
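A small sketch of the bigram phrase estimate follows; the token sample is illustrative and the maximum likelihood estimates are unsmoothed for brevity, so this is not the exact estimator used in the experiments below.

from collections import Counter

# Sketch of the bigram phrase estimate:
# P('a b' | theta) = P('a' | theta) * P('b' | theta_a), where theta_a models
# the words that directly follow 'a'.

def phrase_probability(first, second, tokens):
    unigrams = Counter(tokens)
    followers = Counter(tokens[i + 1] for i, tok in enumerate(tokens[:-1])
                        if tok == first)
    p_first = unigrams[first] / len(tokens)
    p_second_after = followers[second] / max(sum(followers.values()), 1)
    return p_first * p_second_after

tokens = "little jack horner sat in a corner eating a christmas pie".split()
print(phrase_probability("jack", "horner", tokens))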

3.4 Experimental Results

This section describes preliminary results with the model for document structure and multiple representations. They are provided as evidence that the model is a viable approach to handling structure. Experiments in retrieval of document components from a single hierarchical representation (Model SR) are described in Section 3.4.1. Experiments using multiple hierarchical representations (Model MR) are described in Section 3.4.2. Finally, experiments using prior probabilities to leverage trend information are described in Section 4.4.

3.4.1 Hierarchical Structure

The example of hierarchical structure provided earlier in the chapter centered around the poem “Little Jack Horner” marked up in XML. This section describes some preliminary experiments using Model SR to demonstrate its effectiveness.

The context of this discussion will be the INEX workshop [13][14]. This workshop has for the last three years studied XML retrieval. There are two tasks: retrieval of document components for flat text queries and retrieval of documents for structured queries. Any component in the XML structure of a document may be retrieved as a result for either task. The structural queries may also have constraints or hints to relevance, as described in the examples of the NEXI query language given previously. The model presented in Section 3.1 is directly applied to the XML structure of documents for the retrieval experiments described here.

Methodology

INEX has produced two sets of relevance judgments of interest here: judgments for 36 unstructured flat text queries and 30 structured queries written in an adaptation of the XPath query language. For flat text queries (content-only), any document component may be returned. Structured queries may specify a desired result type. In the SCAS (strict content-and-structure) relevance assessments, the structural constraints are interpreted literally and only those document components satisfying the constraints expressed in the query may be considered as relevant. There are also VCAS assessments, where the structural constraints are considered vague or as clues to relevance; these are not considered here. The corpus is approximately 500 MB of Computer Science journals marked up in XML.

The relevance judgments in INEX are multi-valued in two dimensions. There is the exhaustiveness dimension, which may take four values:

• The document component is highly exhaustive and discusses most or all aspects of the query.

• The component is fairly exhaustive and discusses several aspects of the query.

• The component is marginally exhaustive and mentions the subject of the query in passing.

• The component is not exhaustive at all.

The other dimension considers the specificity of the component. That is, how much of the document component is about the subject of the query. The values are:

• The component is highly specific and discusses only the topic of the query.

• The component is fairly specific and the topic is a major theme of the component.

• The component is marginally specific and the topic is only a minor theme of the component.

• The component is not specific, as the topic is not a theme of the component.

The evaluation of the experiments presented in this chapter will treat as relevant only those components that are highly specific and highly exhaustive. This choice is due to concerns about the quality of evaluation measures with respect to overlap of document components when including less relevant and less exact components. It is important to note that using only highly exhaustive and highly specific components as relevant components is a very strict evaluation measure, and as such baselines may appear much lower than commonly found within ad-hoc retrieval experiments in the TREC conference.

The experiments in this chapter were performed on an unstemmed corpus and used the InQuery stopword list.

The rest of the chapter is structured as follows. Section 3.4.1 discusses how components are ranked for the queries. The following two sections describe experiments on the INEX testbed. Section 3.5 summarizes the chapter and discusses future work.

Ranking XML Document Components

The basic model for ranking XML components against queries is described in Sections 3.1 and 3.3.1. The child nodes are combined to estimate a node's distribution using a linear interpolation where the weights are proportional to the length of the text in the child. Two variations of the model are presented in the experiments:

• single corpus model - a single collection language model is used for smoothing. That is, all θ_{type(v_i)} models are the same model, which is estimated using the maximum likelihood estimator from all text observed in the collection. This allows some efficiency improvements for ranking and gives a larger amount of text for estimating the corpus model. It is valid when the language in all components is sufficiently similar to provide good background estimates or when separate component-type-specific background models would be too small to provide a good background model. It makes sense in this corpus as many of the document components consist mostly of text.


<article>
  <fno>A1004</fno><doi>10.1041/A1004s-1996</doi>
  <fm>
    <hdr>
      <hdr1>
        <ti>IEEE ANNALS OF THE HISTORY OF COMPUTING</ti>
        <crt><issn>1058-6180</issn>/96/$5.00 <cci><onm>&copy; 1996 IEEE</onm></cci></crt>
      </hdr1>
      <hdr2>
        <obi><volno>Vol. 18</volno><issno>No. 1</issno></obi>
        <pdt><mo>SPRING</mo><yr>1996</yr></pdt><pp>p. 4</pp>
      </hdr2>
    </hdr>
    <tig><atl>About this Issue</atl><pn>p. 4</pn></tig>
    <au sequence="first"><fnm>Michael R.</fnm><snm>Williams</snm></au>
  </fm>
  <bdy>
    <sec><st></st>
      <ip1>This issue of <it>Annals</it> is the first under my editorship but people
      have been working hard for over a year preparing it. Peter Patton and Steven
      Brown from The University of Pennsylvania have been overseeing the creation ...
      </ip1>
      <p>We have a wonderful collection of articles to help celebrate the occasion. Dylis
      Winegrad introduces us to the time by describing the Moore School and the ...
      of their work&mdash;Pres Eckert died only a few months ago.</p>
      ...
    </sec>
  </bdy>
</article>

Figure 3.5: A Sample of an INEX Document


• shrinkage - the parent of a component is taken into consideration when ranking. This is done by (recursively) interpolating the distribution estimated for a component with its parent distribution. The interpolation weights are set using Jelinek-Mercer smoothing with a constant non-zero λ_p. This allows the context of a component in a document to have an effect on how it is ranked. The shrinkage parameter λ_p controls how much weight is placed on the parent component. When shrinkage is not used, λ_p = 0.

The effects of these variations of the basic model on performance are discussed in the following sections.
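The following sketch illustrates the shrinkage variant: a component's distribution is recursively interpolated with its parent's distribution. The component models, tree structure, and λ_p value are hypothetical placeholders, not the estimates used in the experiments below.

# Sketch of shrinkage: mix a component's distribution with its parent's,
# recursively, using a constant weight lambda_p.

def shrink(node_models, parent_of, node, lambda_p):
    """Return the shrunken model for `node`, mixing in ancestors recursively."""
    own = node_models[node]
    parent = parent_of.get(node)
    if parent is None or lambda_p == 0.0:
        return dict(own)
    parent_model = shrink(node_models, parent_of, parent, lambda_p)
    vocab = set(own) | set(parent_model)
    return {w: (1.0 - lambda_p) * own.get(w, 0.0)
               + lambda_p * parent_model.get(w, 0.0) for w in vocab}

node_models = {"article": {"history": 0.02, "computing": 0.03},
               "section": {"eniac": 0.06, "computing": 0.01}}
parent_of = {"section": "article"}
print(shrink(node_models, parent_of, "section", lambda_p=0.1))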

Unstructured Queries

This section presents results for searching for components relevant to an information need expressed in a standard text query. The results presented here use the title field of INEX content-only (CO) topics (91-126). These are flat text queries. The weight placed on the collection model λ_u is 0.5 and, when using shrinkage, λ_p is set to 0.1. The λ_{c_{v_i}} parameters are set proportional to the aggregated length of text in the component corresponding to v_i, and λ_{c'_{v_j}} is proportional to the length of text in the component corresponding to v_j but not in any of its children (Equation 3.1). These parameters were chosen conservatively, with some fine tuning on a separate training corpus to provide reasonable results. As mentioned before, automatic selection of the λ parameters is an open question and a part of future work.

Figure 3.6 shows results when using a single corpus model. For content-only queries, shrinkage gives an 11% improvement in the quality of the ranking as measured by mean average precision. Examining the figures reveals that this improvement comes in the mid-recall range. This is not surprising as the shrinkage parameter is set very conservatively. While a mean average precision of 0.075 may seem low, keep in mind the strict measure of relevance. This performance is competitive with the best systems at INEX [13].

For comparison, Figure 3.6 also shows results using article retrieval. We see that early precision of the article retrieval run is much higher, but the precision quickly drops off, and Model SR with all components as rankable items achieves higher recall. In terms of mean average precision, the article retrieval run only scores around 0.049, as the recall is so low.

Ideally, we would be able to leverage both the high precision of the article-only run and the recall of the component retrieval run. One solution to this problem will be presented in Chapter 4.

Structured Queries

This section presents some preliminary work on retrieval using structured queries. A taste of what the structured queries look like may be found in the sample queries above, but the queries of topics 64 and 66 are also provided:

64 = //article[about(., hollerith)]//sec[about(., DEHOMAG)]

66 = //article[.//fm//yr < 2000]//sec[about(., ‘search engines’)]

Topic 64 has the narrative of

Relevant sections deal with DEHOMAG (Deutsche Hollerith Maschinen Gesellschaft) in documents that discuss work or life of Herman Hollerith.

We can see that the structured query corresponding to topic 64 is a reasonable representation of the user's information need. The request of topic 66 is for sections about ‘search engines’ in articles published prior to the year 2000 (fm is the ‘front matter’ component).

The structured queries may have numeric constraints on some fields, which are ignored by the retrieval system. ‘OR’ clauses are handled by summing the probabilities of the child clauses. ‘NOT’ clauses are dropped from the query, and for terms with a ‘-’ in front of them, one minus the probability of the clause/term is taken. Different clauses of the queries are given equal weight. Phrase constraints are dropped, but the words are kept. A ‘+’ in front of a query term is ignored. A literally interpreted year filter is applied post-retrieval when there are constraints on the publication year.

Figure 3.6: Precision-recall curve for content-only queries using a single corpus model with a weight of 0.5. (Plot omitted; it compares precision versus recall on the INEX 2003 content-only topics for the single corpus model, shrinkage, and article retrieval runs.)

Figure 3.9 reports results when using a single collection model for smoothing. Interestingly, shrinkage does not change the performance of the system much for structured query retrieval. Some analysis as to why shrinkage does not help for strict content-and-structure queries is required.

In addition, Figure 3.9 includes a line for an element-based unstructured run. In this run, all structural constraints except for the desired result type were removed from the query. All text query terms contained in the about clauses were placed into one about clause for the desired element type. Terms with a ‘-’ are dropped, and year constraints are also dropped. For example, queries 64 and 66 were converted to:

64 = //article//sec[about(., hollerith DEHOMAG)]

66 = //article//sec[about(., search engines)]

It is interesting to note that the element-based unstructured run performed very similarly to the structured retrieval run. However, they did not perform the same for all queries. Figure 3.10 shows how both the unstructured run and the structured run performed on each of the topics. Some topics performed noticeably worse when using the structured queries (Figure 3.7).

In topic 63, the narrative requests documents that “include one or more paragraphs discussing security, authorization or access control in digital libraries”. This corresponds to the second about clause of the query (the title field of the topic):

about(.//p, +authorization +"access control" +security)


We argue that this query is poorly formed given the statement of the information need in the narrative. The topic author is not looking for a single paragraph discussing security issues in digital libraries, but is in search of at least one paragraph discussing these issues. We argue that the element-based unstructured query formed for this topic is a better representation of the information need:

//article[about(.,"digital library" +authorization +"access control" +security)]

This query removes the single-paragraph constraint of the original query, and would rank documents that contain more text about security issues in digital libraries above those containing less.

The author of this proposal is also the author and assessor of topic 75. In this case, the author was required to create a reasonable structured query, but had difficulty creating structured queries that did not feel artificial on this corpus. The topic requests that there exist a section containing a formal definition of a Petri net and another section describing algorithms, their efficiency, or approximation techniques used in Petri nets. While the section constraints exist in the topic title (the query), in reality the author did not really care whether these things existed in sections or not; the author merely desired the presence of these things in the article.

The third topic where the query using all structural constraints performed significantly worse was topic 82. We suspect that the strict constraint requiring the author field to contain ‘kim’ is too strict a match for our retrieval system to handle appropriately given the narrative. The narrative mentions that ‘kimh’ or ‘kimm’ would be acceptable in this case. To test whether this is the case, we simply removed the corresponding about clause. This resulted in an average precision of 0.717, which is comparable to the average precision of the element-based unstructured query, which achieved an average precision of 0.672.

One possible way to alleviate the mismatch problem between the narrative and the query in topic 82 within Model SR would be to substitute a different generative model for the author fields, such as using a character-based trigram instead of a word-based unigram.
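A hedged sketch of this idea follows: the author field is scored with a smoothed character trigram model so that near matches such as ‘kimh’ still receive credit. The smoothing constant and field strings are illustrative assumptions, not part of the system described above.

import math
from collections import Counter

# Sketch of a character trigram model for an author field: near matches still
# score reasonably well, unlike an exact word-unigram match.

def char_trigram_log_prob(query, field_text, alpha=0.1):
    """Log probability of the query's character trigrams under the field model."""
    field = "  " + field_text.lower()
    trigrams = Counter(field[i:i + 3] for i in range(len(field) - 2))
    total = sum(trigrams.values())
    vocab = 27 ** 3                      # rough trigram vocabulary size for smoothing
    padded = "  " + query.lower()
    score = 0.0
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        score += math.log((trigrams[gram] + alpha) / (total + alpha * vocab))
    return score

print(char_trigram_log_prob("kim", "kimh, j."))    # partial match scores higher ...
print(char_trigram_log_prob("kim", "smith, a."))   # ... than an unrelated name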

There were two topics where the structured query performed noticeably better than the element-based unstructured query (Figure 3.8).

For topic 65, the extra structure was a constraint on the year of publication. The system performed reasonably well without a year filter. The inclusion of the year filter raised performance from 0.414 to 0.714.

For topic 81, there are only two relevant articles given the strict quantization. Both relevant articles contained the word ‘Databases’ in the title, so it seems that the ‘about(.//fm//atl, databases)’ clause was helpful in ordering documents.

This examination paints a somewhat bland picture of the usefulness of structured queries in XML component retrieval. The use of structure did improve performance modestly (0.225 to 0.232), but most of that was a result of one topic (65). It is unclear whether human users are good at creating structured queries, although our examination suggests that humans can sometimes give good hints, as in topics 65 and 81.

The unclear benefits may be a result of several factors: poor use of structural constraints within the model, poorly formed structural queries, or a corpus that does not have natural structured queries. In our future work, we would like to identify why our system has failed to leverage structure for this task.

It is also important to note that we did not leverage phrasal suggestions in the queries, as our index does not currently support term locations. In our future work we will also implement support for phrases. We believe that phrase information may be helpful to improve performance.

Summary

The experiments presented here demonstrated the viability of the hierarchical generative model proposed in this chapter. Future experiments in this area will include the incorporation of information about trends of relevant components using the length of the component and the type of the document component.


Figure 3.7: Topics where structural constraints seemed to harm performance. Topic descriptions and keywords are omitted to save space.

<inex_topic topic_id="63"><title>

//article[about(.,"digital library") ANDabout(.//p, +authorization +"access control" +security)]

</title><narrative>

Relevant documents are about digital libraries and includeone or more paragraphs discussing security, authorization or accesscontrol in digital libraries.

</narrative></inex_topic>

<inex_topic topic_id="74"><title>

//article[about(., video streaming applications)]//sec[about(., media stream synchronization) OR

about(., stream delivery protocol)]</title><narrative>

Documents/document components describing delivery controlfor a specific setting are considered relevant. Document/documentcomponents describing audio and video stream synchronization (if thisis necessary due to the architecture) are consideredrelevant. Documents/document components describing architectures orservers for stream delivery (media servers) are relevant, if a part ofthe document component describes a communication or delivery protocol(even if lowlevel).

</narrative></inex_topic>

<inex_topic topic_id="82"><title>

//article[about(., handwriting recognition) AND about(.//fm//au, kim)]</title><narrative>

To be relevant, a document must be an article, havehandwriting recognition or an example of handwriting recognition as amajor subject. One of the authors must have a name similar to ‘kim’(kimh or kimm would be accepted, kimbelnatenbaum wouldn’t). Kim doesnot have to be the first author. A bibliographic reference to arelevant article is not a relevant element. An author index or a callfor paper is not a relevant element. A correction to a relevantarticle IS a relevant article, as it is important that the retrievedinformation comes with its eventual corrections.

</narrative></inex_topic>


Figure 3.8: The structural constraints in queries 65 and 81 improved retrieval effectiveness. Topic descriptions and keywords are omitted to save space.

<inex_topic topic_id="65"><title>

//article[.//fm//yr > ’1998’ ANDabout(., "image retrieval")]

</title><narrative>

Relevant documents are research articles about imageretrieval written after 1998. Editorials, news or volume indices arenot relevant.

</narrative></inex_topic>

<inex_topic topic_id="81"><title>

//article[about(.//p, "multi concurrency control") ANDabout(.//p, algorithm) ANDabout(.//fm//atl, databases)]

</title><narrative>

An article is considered relevant if it providesinformation about a concurrency control algorithm(s)/protocol(s) basedon keeping multiple versions of data items in a database. It can beeither a description of the algorithm/protocol or a reference to apaper on the subject in its "Related work" section.

</narrative></inex_topic>


Figure 3.9: Precision-recall curve for structured queries using a single corpus model with a weight of 0.5. An element-based unstructured run performs similarly. (Plot omitted; it compares precision versus recall on the INEX 2003 content-and-structure topics for the single corpus model, shrinkage, and unstructured runs.)

For structured queries, the dissertation will also explore the role of context-sensitive smoothing, incorporation of the dropped and modified query constraints, and alternative weighting strategies for query clauses. The dissertation will also explore adaptations of the model for vague interpretations of structured queries.

3.4.2 Multiple Representations

While the previous section described experiments using a single hierarchical representation of a document (Model SR), this section focuses on the use of multiple document representations (Model MR). In particular, we motivate this by considering HTML documents found on the Web. HTML documents have structure within them through the mark-up of titles, large fonts, etc. They can also have multiple document representations, such as in-link text. The model we take for Web documents is a reorganization of the Web documents so that they all have uniform structure. Each web document is redescribed as a document with three representations: one for the text of the document, one for the in-link text, and one representing the image alternate text (the values of the ALT attribute of the IMG tag).

The in-link text is a flat text representation, as is the image alternate representation, while the text of the document is represented hierarchically. The text of the document has a ‘title’ child that contains the title of the document, a ‘fonts’ child that contains the large fonts found in the document, and a ‘meta’ tag that contains the keywords and descriptions of the document. Figure 3.11 illustrates this structure.

Why use this reorganization of the document? This work was originally framed as a combination of flat text document representations. The document representations were chosen using knowledge of which representations tended to convey useful information about the web documents. In order to better understand how it relates to the model presented in this proposal, it has been redescribed with a hierarchical representation for the text of the document. This organization may not be what we would do if we started with the model, as we would more explicitly preserve the original structure of the documents. Reinvestigation of this work starting with the model will be a part of the dissertation.

Figure 3.10: The element-based unstructured and structured runs did not perform identically for all queries on the SCAS topics. The extra structure in the queries did not help all queries, but for some, the extra structure proved beneficial. (Plot omitted.)

Figure 3.11: Document representations and graph structure used for experiments with web items: G_1 is the in-link text, G_2 is the document text with title, meta, and font children, and G_3 is the image alternate text. Note that v^1_1 could also be hierarchically represented as in Section 3.2.1.

In order to evaluate the model, this section presents experiments for known-item finding. Known-item finding of Web documents is an important information-seeking activity. In this task the user knows of a particular document, but does not know where it is. Queries for this task tend to be very good, as finding the correct page at a high rank over many queries is quite common. Queries may be a request for the IRS 1040 form, the homepage of the IRS, or any other known item to be found on the Web. Very often, these documents have structural markup, such as HTML. This task is important for the dissertation as it is one example where leveraging structure and representations can greatly improve performance, as will be seen below.

The section continues with a description of how the model is adapted for known-item search of Web documents, followed by a description of the meta-search techniques used as baselines for comparison. The text then presents details of the experimental context and methodology, followed by the experiments.

Model

The model used here for combining the document representations and hierarchical structure is as described in Section 3.2. Figure 3.11 gives a pictorial representation of the graphs, but we also include text notation for completeness:

    D = (G, R, C)
    G = {G_1, G_2, G_3}
    R = {r_1}
    C = {(r_1, v^1_1), (r_1, v^2_1), (r_1, v^3_1)}

    G_1 = (V_1, E_1)
    V_1 = {v^1_1}
    E_1 = {}

    G_2 = (V_2, E_2)
    V_2 = {v^2_1, v^2_2, v^2_3, v^2_4}
    E_2 = {(v^2_1, v^2_2), (v^2_1, v^2_3), (v^2_1, v^2_4)}

    G_3 = (V_3, E_3)
    V_3 = {v^3_1}
    E_3 = {}

In these experiments, the λ parameters were chosen as follows:

    λ_{u_{v^i_j}} = μ_{type(v^i_j)} / (|v^i_j| + μ_{type(v^i_j)})

    λ_{c'_{v^2_1}} = 1/4        λ_p = 0
    λ_{c_{v^2_2}} = 1/3         λ_{r_{v^1_1}} = 1/6
    λ_{c_{v^2_3}} = 1/3         λ_{r_{v^2_1}} = 4/6
    λ_{c_{v^2_4}} = 1/3         λ_{r_{v^3_1}} = 1/6

The parameter choices for λ_c and λ_r may seem arbitrary, but they were chosen to place equal weight on each vertex in the graphs. Changing the shrinkage parameter λ_p would have no effect in this case, as the rankable item's only representations are roots of trees. The λ_u parameters are set according to the Dirichlet prior approach to smoothing, with μ_{type(v^i_j)} taken to be twice the average length of v^i_j across all documents, as this is a conservative parameter choice that should be effective for retrieval [57].
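For illustration, the sketch below computes the Dirichlet-style collection weight λ_u = μ / (|v| + μ), with μ set to twice an average component length; the component lengths used are hypothetical.

# Sketch of the Dirichlet-style collection weight: mu / (|v| + mu),
# with mu set to twice the average length of the component type.

def lambda_u(component_length, avg_type_length):
    mu = 2.0 * avg_type_length           # conservative choice described above
    return mu / (component_length + mu)

print(lambda_u(component_length=12, avg_type_length=9))     # short title: heavy smoothing
print(lambda_u(component_length=800, avg_type_length=400))  # long body: lighter smoothing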

Meta-search for Combining Document Representations

Meta-search algorithms are a common approach to combining document representations and will serve as baseline systems in the following experiments. The goal in meta-search is to combine the results from different search engines to produce a single ranked list that is better than the results of any single search engine. This is sometimes referred to as data fusion. Combining document representations is similar to creating a search engine for each of the document representations and performing meta-search to combine the result lists.

Meta-search fusion algorithms were developed to address the combination of retrieval results from many retrieval systems. In general, they combine the result lists containing the top n documents from each retrieval system. This subsection describes the meta-search algorithms to which we will compare the language model approach, and discusses other related work where appropriate.

There are two main classes of meta-search fusion algorithms: ones that use scores from the systems and ones that do not. The algorithms investigated here are combMNZ, combSUM, Condorcet fuse, Borda fuse, and a reciprocal rank fusion strategy. Montague and Aslam compare most of these approaches for meta-search in [3] and [35]. Within these classes are variants that do and do not use training data.

Condorcet fuse, Borda fuse, and the reciprocal rank fusion strategy do not use scores from the systems. These approaches assume that the scores from the different retrieval systems are not directly comparable. In a meta-search environment, this can be a good assumption, as the different search systems may use different ranking formulas or have different corpus statistics. The corpus statistics may still be different in document representation fusion, but we can assume that the search engines use the same ranking function. This may or may not lead to comparable scores, depending on the ranking function.

Condorcet fuse [35] and Borda fuse [3] were originally developed to address elections and have been adapted to meta-search fusion. The Condorcet fuse algorithm is not described in detail here, as the description is beyond the scope of this review; it can be viewed as an approach that maximizes agreement of partial order preferences across voters. Borda fuse sums n − r across all systems, where r is the rank of the document in each system, and sorts the documents in descending order (documents with higher scores are higher in the merged list). System weights can be incorporated by multiplying the weight by the scores for the documents. The reciprocal rank strategy [58] sums 1/r across all search engines. Documents are sorted in descending order. The reciprocal rank strategy gives much higher weight than Borda fuse to documents that occur near the top of a list. System weights can be incorporated by multiplying each document's inverse rank by the weight.

Both combMNZ and combSUM are variants of an algorithm developed by Fox [12]. combSUM ranks each document using the sum of the scores returned from the individual retrieval systems, and combMNZ ranks by the sum multiplied by the number of systems that returned the document in the top n results. It is common to perform a linear normalization of the scores in the result lists for both algorithms so that the first document in each list has a score of one and the nth document has a score of zero.
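The sketch below gives minimal implementations of the fusion baselines just described (Borda fuse, the reciprocal rank strategy, combSUM, and combMNZ); the ranked lists, document IDs, and the assumption of pre-normalized scores are illustrative, and system weights are omitted for brevity.

# Sketch of rank-based and score-based fusion baselines.

def borda_fuse(ranked_lists, n=1000):
    scores = {}
    for ranking in ranked_lists:
        for r, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + (n - r)
    return sorted(scores, key=scores.get, reverse=True)

def reciprocal_rank_fuse(ranked_lists):
    scores = {}
    for ranking in ranked_lists:
        for r, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / r
    return sorted(scores, key=scores.get, reverse=True)

def comb_sum(score_maps, mnz=False):
    """combSUM; with mnz=True, multiply by the number of lists containing the doc."""
    totals, hits = {}, {}
    for scores in score_maps:               # assumes scores already normalized to [0, 1]
        for doc, s in scores.items():
            totals[doc] = totals.get(doc, 0.0) + s
            hits[doc] = hits.get(doc, 0) + 1
    if mnz:
        totals = {doc: s * hits[doc] for doc, s in totals.items()}
    return sorted(totals, key=totals.get, reverse=True)

runs = [["d1", "d2", "d3"], ["d2", "d1", "d4"]]
print(borda_fuse(runs), reciprocal_rank_fuse(runs))
print(comb_sum([{"d1": 1.0, "d2": 0.5}, {"d2": 1.0, "d4": 0.2}], mnz=True))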

Methodology

The past few TREC Web Tracks have included a known-item task. TREC 10 included a homepage finding task, where the user is searching for a homepage. Some examples of homepages are those for individuals, corporations, departments of institutions, etc. TREC 11 included a named-page finding task, which is more general than homepage finding. An example of a named page may be the “IRS 1040 form” or a specific report. TREC 12 had a mixed task, where half of the queries were named-page queries, and the other half were homepage queries. The relevance judgments provided by these TREC Web Track tasks will enable evaluation of the approach.

Table 3.1: Document representations used for meta-search algorithms. These are chosen to correspond with the components that may be leveraged by Model MR (Figure 3.11).

    Representation   Description
    Alt              Image alternate text
    Font             Changed font sizes and headings
    Full             Full document text
    Link             In-link text
    Meta             Meta tags (keyword, description)
    Title            Document title

Table 3.2: Known-item finding testbeds

    Task                 Corpus   Topics   Documents    Size    Document types
    Homepage finding     WT10G    145      1,692,096    10 GB   html
    Named-page finding   .GOV     150      1,247,753    18 GB   html, doc, pdf, ps

We performed experiments on two TREC tasks and corpora. Table 3.2 summarizes the corpora and test topics. Our experiments were on the Homepage Finding task from TREC-10 [18] and the Named-Page Finding task from TREC-11 [8]. Homepage finding and named-page finding are both kinds of known-item finding tasks. The Homepage Finding task uses the WT10G corpus, and the Named-Page Finding task uses the .GOV corpus (Table 3.2). Both corpora consist of over 10 GB of text, and each task has around 150 topics. The WT10G is a subset of a general crawl of the Web, while the .GOV corpus is focused on the .gov domain. To evaluate the Homepage Finding task and the Named-Page Finding task, we use mean reciprocal rank (MRR) and the number of topics where the correct page was found by rank 10, as they emphasize finding the right page high in the result list. MRR is the average of one over the rank at which the correct page was found. These measures were chosen as they are official measures reported at TREC [18][8] and they emphasize early success.
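For reference, the sketch below computes these two measures from a list of ranks at which the correct page was found; the ranks shown are illustrative, not experimental results.

# Sketch of the evaluation measures: mean reciprocal rank and the number of
# topics answered by rank 10.  Ranks are 1-based; rank 0 marks a topic whose
# correct page was not retrieved.

def mrr_and_success_at_10(ranks_of_correct_page):
    reciprocal_sum = sum(1.0 / r for r in ranks_of_correct_page if r > 0)
    mrr = reciprocal_sum / len(ranks_of_correct_page)
    by_ten = sum(1 for r in ranks_of_correct_page if 0 < r <= 10)
    return mrr, by_ten

print(mrr_and_success_at_10([1, 2, 10, 0, 4]))   # (0.37, 4)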

Our experiments were performed using the Lemur toolkit [39]. A separate index was created for each document representation. We use the Porter stemmer and InQuery's stopword list.

Six different document representations were used for the meta-search systems; they are listed in Table 3.1. These six representations correspond to the six components used by Model MR as presented in Figure 3.11. Table 3.3 summarizes their performance using Dirichlet prior smoothed language models. The best three document representations for both tasks are the full text, the in-link text, and the title text. These tables demonstrate that some of the document representations tend to be better than other representations. These representations were chosen as they are the components and representations used by the multiple-representation hierarchical structure model to which they were compared.

Table 3.3: Performance of individual document representations using language models

                Homepage                     Named-page
          MRR     # by 10 (out of 145)   MRR     # by 10 (out of 150)
    Alt   0.186   35                     0.194   42
    Font  0.155   44                     0.191   38
    Full  0.300   77                     0.469   100
    Link  0.515   95                     0.455   87
    Meta  0.115   20                     0.144   32
    Title 0.332   82                     0.406   84

Figure 3.12: The mixture language model performs at least as well as meta-search techniques on the WT10G Homepage Finding testbed. Models are added in order of decreasing quality (link, +title, +full, +alt, +font, +meta). (Plot omitted; it compares mean reciprocal rank for combMNZ, combSUM, reciprocal rank, Borda fuse, Condorcet fuse, and the mixture language model.)

Experimental Results

Figures 3.12 and 3.13 demonstrate the effectiveness of using the generative model described in Section 3.4.2 on the Homepage Finding and Named-Page Finding tasks. The mixture model performs at least as well as the meta-search techniques applied to the same document representations. This result suggests that generative models are an appropriate framework for the combination of evidence from multiple document representations.

For completeness, this section contains a brief comparison to results presented during and after the TREC conference by other research groups. For Homepage Finding, our results using the mixture language modeling system yielded an MRR of 0.658, failing to find the correct page in the top 100 results for only 5.5% of the queries. For 82.8% of the topics, the correct document was somewhere in the top 10 results. We did not enter a submission to the TREC-10 Homepage Finding task, but this performance would have been among the best three groups that participated (out of 16). All of the submissions with higher performance made use of the length of the URL text. In particular, Kraaij et al. [22] used a prior based on the depth of the URL. A detailed investigation of the use of priors within our model is presented in Section 4.4.

Our group did participate in the Named-Page Finding task of TREC 11. Our official submission used the same system, but included an additional language model created using the URL text. With this additional model, our system had a mean reciprocal rank of 0.676. This was the second best official submission from a unique group. Without the URL model, the language model approach gives a mean reciprocal rank of 0.667. Zhang et al. [58] had a mean reciprocal rank of 0.719 for their best submission.

Summary

Figure 3.13: The mixture language model also performs at least as well as meta-search techniques on the .GOV Named-Page Finding testbed. (Plot omitted; representations are added in the order full, +link, +title, +alt, +font, +meta, and the plot compares mean reciprocal rank for combMNZ, combSUM, reciprocal rank, Borda fuse, Condorcet fuse, and the mixture language model.)

This section explored the use of a generative language modeling approach to the combination of document representations. The experiments were carried out on two known-item finding tasks on HTML documents: Homepage Finding and Named-Page Finding. The mixture-based language model gave results at least as good as the best known meta-search techniques. This lends empirical merit to the argument that generative models are an appropriate means for combination of evidence. Future work in this area will be to research the role of smoothing in the combination of document representations.

3.5 Chapter Summary

This chapter presented two generative models for structured documents. Model SR provides a simple generative model for single-representation hierarchical documents and Model MR provides a multiple representation model of documents. This chapter demonstrated their applicability to XML component retrieval and known-item finding of Web documents.

Performing retrieval using Model SR on XML components for flat text queries recalled more relevant components than document retrieval. When using constraints on document structure in structured queries, Model SR did not perform much differently from a baseline system that simply performed flat text search and filtered out results that did not match the desired result type.

Future experiments in XML component retrieval will look to find the reasons behind the poor performance when evaluating structured queries. Future experiments will also include weighting strategies for query clauses, context-sensitive smoothing for structured queries, and implementation of phrasal constraints.

Model MR performed better than comparable meta-search algorithms for the Homepage Finding and Named-Page Finding tasks. This demonstrates the viability of the generative model for combination of evidence. Future work in this area will investigate different weighting techniques for the models and representations.


Chapter 4

Modeling Characteristics of Relevant Documents

The previous chapter described modeling of text within a hierarchically structured document with multiple representations. It also described how text components may be ranked against queries. However, it did not describe how to leverage knowledge of characteristics of relevant documents. One may have query-independent knowledge of characteristics of relevant documents. For example, there may be a tendency for longer documents to be more likely to be relevant than short documents.

Ranking in Models SR and MR took the form P(Q | θ_r), where θ_r was a generative model estimated for the rankable item. In order to leverage the characteristics of relevant documents, we instead turn the equation around and rank according to an estimate of

    P(r is Relevant | Q)    (4.1)

In order to leverage these characteristics of rankable components during ranking, we can condition this probability based on observations of the components:

    P(r is Relevant | Q, g(r) = a)    (4.2)

where g is a function on rankable items corresponding to some aspect of components we wish to leverage during ranking. Bayes' rule is then applied:

    P(r is Relevant | Q, g(r) = a) = P(Q | r is Relevant, g(r) = a) P(r is Relevant | g(r) = a) / P(Q | g(r) = a)    (4.3)

where a and Q are independent and P(Q | r is Relevant) is estimated using the rankable item r's probability of generating the query Q. This reduces the ranking function to:

    P(r is Relevant | Q, g(r) = a) ∝ P(Q | θ_r) P(r is Relevant | g(r) = a)    (4.4)

where P(Q | θ_r) is estimated using either Model SR or Model MR. P(r is Relevant | g(r) = a) is referred to as the prior probability. This is estimated using knowledge about trends of relevance observed in relevance assessments, which will be described as posteriors we wish to observe in the rankings.
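A minimal sketch of ranking by Equation 4.4 follows; the query likelihoods and prior values are hypothetical placeholders (the priors loosely echo the URL-type magnitudes derived later in this chapter).

import math

# Sketch of Equation 4.4: rank by the query likelihood multiplied by a
# query-independent prior P(r is Relevant | g(r) = a), computed in log space.

def rank_with_prior(items):
    """items: list of (item_id, query_likelihood, prior)."""
    scored = [(item_id, math.log(likelihood) + math.log(prior))
              for item_id, likelihood, prior in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

items = [("doc1", 1e-9, 104.0),     # strong prior (e.g. a ROOT URL)
         ("doc2", 5e-9, 0.061)]     # weak prior (e.g. a FILE URL)
print(rank_with_prior(items))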

If g is a non-deterministic function whose output can be expressed as a probability distribution over the range A of possible output values, then ranking can be done using:

    P(r is Relevant | Q) = Σ_{a∈A} P(r is Relevant | Q, a) P(a)    (4.5)

If A is continuous, then ranking can be done by:

    P(r is Relevant | Q) = ∫_{a∈A} P(r is Relevant | Q, a) P(a) da    (4.6)

Estimation of priors from desirable posteriors is discussed in the next section. A detailed example is presented in Section 4.2. Experiments using length priors for XML component retrieval are presented in Section 4.3, and experiments using URL type priors for known-item retrieval are presented in Section 4.4.


4.1 Estimating the Prior Probabilities from Desired Posteriors

An example worthy of a detailed examination is homepage finding. In homepage finding, pages with short URLs are more likely to be the correct page than pages with long URLs. One example of a homepage finding query would be ‘find the homepage of the IRS’. The correct page for this query is ‘http://www.irs.gov’, which has a very short URL. This behavior can usually be quantified by examining relevance judgments of a training set and learning some function that maps aspects of the documents to a probability of relevance.

The characteristics of relevant documents this chapter considers are those that may be described by a posterior distribution that is desirable in the rankings returned by the engine:

    P(g(R) = a | R is Relevant)    (4.7)

where g is a function of the rankable item R, and a is some value in the range of g. R is used to denote a random variable. For example, in web text retrieval, g may produce a number corresponding to the number of in-links. Alternatively, g may return one of (ROOT, SUBROOT, PATH, FILE) as in [52]. The posterior distributions may be estimated from observations in relevance assessments using simple statistical techniques, such as maximum likelihood estimates for discrete ranges of g or kernel density estimates where the range of g is continuous.

These posterior distributions could be helpful in parameter estimation. One use of this information could be to learn parameters that produce these distributions in the rankings. This will be discussed in Chapter 5. The approach we will focus on in this chapter is a more direct usage of the characteristics of relevant documents: using the desirable posteriors to estimate prior probabilities of relevance to be used in ranking (Equation 4.4).

Additionally, the information we wish to leverage using the prior probability P(r is Relevant | g(r) = a) does not need to be independent of the information need. For example, a user model of a student may include information about the reading ability of the student. It may be highly desirable to return documents that are readable by the student.

The posterior distributions may be used to estimate a prior probability of relevance for a text through the use of Bayes' rule:

    P(R is Relevant | g(R) = a) = P(g(R) = a | R is Relevant) P(R is Relevant) / P(g(R) = a)    (4.8)

Typically, P(R is Relevant) is assumed constant across all documents and P(g(R) = a) is estimated by observing the occurrence of a in the collection.

4.2 Example

The task of homepage finding on the Internet is one where there is a heavy bias toward short URLs being homepages [52]. This can be leveraged by specifying a function g = urlType that considers the length of the URL. In [52], Westerveld et al. specified urlType(r) such that it would return:

• ROOT if the URL of r matched the pattern http://www.xxx.com/ possibly followed by index.html

• SUBROOT if the URL of r matched http://www.xxx.com/yyy/ possibly followed by index.html

• PATH if the URL matched http://www.xxx.com/yyy/.../zzz/ possibly followed by index.html

• FILE if the URL of r did not match any of the prior cases.


They estimated P (urlType (R) |R is relevant) using the maximum likelihood estimator on training data,resulting in:

P (urlType (R) = ROOT |R is relevant) = 0.731P (urlType (R) = SUBROOT |R is relevant) = 0.139

P (urlType (R) = PATH |R is relevant) = 0.074P (urlType (R) = FILE |R is relevant) = 0.056

They then observed the frequency of each URL type in the collection, again using a maximum likelihoodestimator:

P(urlType(R) = ROOT) = 0.007
P(urlType(R) = SUBROOT) = 0.022
P(urlType(R) = PATH) = 0.049
P(urlType(R) = FILE) = 0.921

While the derivation of the prior P(r is Relevant | urlType(r) = a) described in [52] is a little different from what is described here, they do estimate prior probabilities directly proportional to the ones described here. Using Equation 4.8, we see that:

P(R is Relevant | urlType(R) = ROOT) = c · P(urlType(R) = ROOT | R is relevant) / P(urlType(R) = ROOT)

P(R is Relevant | urlType(R) = SUBROOT) = c · P(urlType(R) = SUBROOT | R is relevant) / P(urlType(R) = SUBROOT)

and so on for PATH and FILE. Note that c = P(R is relevant) is a constant across all equations and not necessary for ranking. From this, it can be observed that:

P(R is Relevant | urlType(R) = ROOT) ∝ 0.731/0.007 ≈ 104
P(R is Relevant | urlType(R) = SUBROOT) ∝ 0.139/0.022 ≈ 6.32
P(R is Relevant | urlType(R) = PATH) ∝ 0.074/0.049 ≈ 1.51
P(R is Relevant | urlType(R) = FILE) ∝ 0.056/0.921 ≈ 0.061

They used a query-likelihood approach to provide a baseline run, which resulted in a mean reciprocal rank of 0.49, which means that the first relevant page was found on average a little below rank 2 in the result list. Applying the prior probabilities described above, they found that performance improved to 0.77, which means that the first relevant page was found on average at about rank 1.5.

4.3 Length Priors for XML Component Retrieval

Section 3.4.1 presented experiments using flat text queries for XML component retrieval in Model SR. The discussion of the experiments concluded with an observation that component retrieval resulted in higher recall but lower precision at low recall levels than document retrieval.

One could hypothesize that the higher precision of the document retrieval was a result of the length of the documents. We leveraged this within the component retrieval run by introducing a prior probability of relevance that was proportional to the length of the component:

P(v_i is relevant | |v_i|_a = x) ∝ |v_i|_a   (4.9)

Thus, components were ranked by:

P(v_i is relevant | |v_i|_a = x) P(Q | θ''_{v_i})  rank≡  |v_i|_a P(Q | θ''_{v_i})   (4.10)
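In implementation terms, Equation 4.10 is a rescoring step over an existing component ranking. A minimal sketch, assuming log query-likelihood scores and accumulated lengths |v_i|_a are already available (the data layout and names are ours):

```python
import math

def rerank_with_length_prior(components):
    """components: list of (component_id, length_a, log_query_likelihood) tuples.

    Returns component ids ordered by |v_i|_a * P(Q | theta''_{v_i}), computed in log
    space for numerical stability (Equation 4.10).
    """
    scored = [(cid, math.log(length_a) + log_ql) for cid, length_a, log_ql in components]
    return [cid for cid, score in sorted(scored, key=lambda pair: pair[1], reverse=True)]
```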


[Figure 4.1 plot: precision versus recall for the Single Corpus Model, Shrinkage, Article Retrieval, and Linear Prior runs on the INEX 2003 content-only topics.]

Figure 4.1: Precision-recall curve for content only queries using a single corpus model with a weight of 0.5. Including a linear prior greatly improves performance of the component retrieval approach used in Model SR.

where P(Q | θ''_{v_i}) is estimated using Equation 3.12 and θ''_{v_i} is estimated as defined by Model SR (Figure 3.1) with the parameters used for the “Single Corpus Model” run in Section 3.4.1.

The results of this experiment are in Figure 4.1. The results are quite astounding, as the run with the prior achieves precision near the article retrieval run at low recall levels while maintaining the higher levels of precision at larger recall levels. Using the linear prior results in a MAP of 0.163, which is a 117% increase in performance over the run with no prior, and much better than the article-only retrieval run. This use of priors for component retrieval demonstrates the ability of Model SR to find relevant document components.

4.4 URL Priors for Named-Page Finding

This section continues the investigation of the use of the prior probabilities introduced above. Prior probabilities have been found effective for known-item retrieval by others [22]; this section extends that work and examines the behavior of prior probabilities in more detail. While the results described in Section 3.4.2 are quite good, the use of prior probabilities can further improve performance.

In these experiments, documents are ranked by Equation 4.4. The basic model for estimating the probability of generating the query is the same as in Section 3.4.2 but includes an additional document representation which provides a language model for the text of the URL.


4.4.1 Modeling URL Text

Rather than ranking the URL text using a unigram word-based language model, these experiments use a character-based trigram model from the tokenized character strings in the URL. A character-based trigram model predicts the next character given the previous two characters. This probability is computed for each character in the query terms.

The use of a trigram model is motivated by the desire to allow near matches of the query terms in the URL, as file names or domain names may be similar but not exactly the same as the query terms. For example, the correct page for a query of “Nissan” might be “http://www.nissanusa.com”. Using a character-based trigram model for “nissanusa” in the URL will give a fairly high probability of generating the query term “Nissan”. The ability to use different generative probability models was discussed briefly in Section 3.2.2, but this provides a concrete example of its usage. This is an important strength of Model MR that should not be overlooked.

The probability of a word given the document’s URL can be computed by treating the URL and word as a character sequence, then computing a character-based trigram generative probability:

P(w | θ_URL) = ∏_{c_i ∈ c_1 c_2 ... c_{|w|} = w} P(c_i | θ_{c_{i-1}, c_{i-2}})
P(c | θ_{a,b}) = (1 − λ^u_URL) P(c | MLE(a, b)) + λ^u_URL P(c | θ^u_{a,b})
P(c | MLE(a, b)) = tf(abc, URL) / tf(ab, URL)   (4.11)

One way to estimate the trigram models is to use a Jelinek-Mercer linear interpolation of the MLE with the URL collection model.

In the experiments presented below, λ^u_URL was set to 0.4. The URL collection model was estimated from all trigrams in the URLs corresponding to documents in the collection. A shorter stopword list was used for URLs than for other document representations (http, www, com, gov, html, etc.).
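A sketch of the smoothed character trigram model of Equation 4.11 is given below. It assumes the URL has already been tokenized and lower-cased into a single character string and that the URL collection trigram model is supplied precomputed; the class name, the small probability floor for unseen collection trigrams, and the skipping of the first two characters of a word are our assumptions.

```python
from collections import Counter

class UrlTrigramModel:
    """Character trigram model of a URL, smoothed with a URL collection model."""

    def __init__(self, url_text, collection_prob, lambda_u=0.4):
        self.collection_prob = collection_prob   # dict: (a, b, c) -> P(c | theta^u_{a,b})
        self.lambda_u = lambda_u                 # 0.4, as in the experiments above
        self.trigrams = Counter(zip(url_text, url_text[1:], url_text[2:]))
        self.bigrams = Counter(zip(url_text, url_text[1:]))

    def char_prob(self, a, b, c):
        # (1 - lambda) * MLE + lambda * collection model, following Equation 4.11
        bigram_count = self.bigrams[(a, b)]
        mle = self.trigrams[(a, b, c)] / bigram_count if bigram_count else 0.0
        background = self.collection_prob.get((a, b, c), 1e-9)   # assumed floor for unseen trigrams
        return (1 - self.lambda_u) * mle + self.lambda_u * background

    def word_prob(self, word):
        # P(w | theta_URL): product over characters conditioned on the previous two.
        # The first two characters are skipped here, which is a simplification.
        prob = 1.0
        for i in range(2, len(word)):
            prob *= self.char_prob(word[i - 2], word[i - 1], word[i])
        return prob
```

For the “Nissan” example, a model built from “nissanusa” assigns relatively high probabilities to the trigrams of “nissan”, which is exactly the near-match behavior the URL representation is intended to provide.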

4.4.2 Ranking Documents

The final document scores were computed as the generative probability of the query given the document, where the graph of the document structure is as in the prior experiments for web retrieval using Model MR (Section 3.4.2), but also includes the URL representation:

D = (G, R, C)
G = {G_1, G_2, G_3, G_4}
R = {r_1}
C = {(r_1, v^1_1), (r_1, v^2_1), (r_1, v^3_1), (r_1, v^4_1)}
G_4 = (V_4, E_4)
V_4 = (v^4_1)
E_4 = {}

where G_1, G_2, and G_3 are as before and v^4_1 corresponds to the URL representation.


Table 4.1: Estimation of prior probabilities and posterior distribution expectations

URL Type    NP      HP      Posterior   .GOV    Prior
ROOT        0.029   0.600   0.315       0.005   58.0
SUBROOT     0.083   0.208   0.145       0.097   1.50
PATH        0.059   0.081   0.070       0.053   1.32
FILE        0.829   0.111   0.470       0.845   0.557

The NP, HP, and Posterior columns give P(urlType(T) | T is a correct page); the .GOV column gives P(urlType(T)); the Prior column is proportional to P(T is a correct page | urlType(T)). NP and HP were computed by MLE from training data, Posterior as (NP + HP)/2, .GOV by MLE from the corpus, and Prior as Posterior / .GOV.

In these experiments, the λ parameters were chosen in a manner similar to what was done before:

λ^u_{type(v^i_j)} = μ_{type(v^i_j)} / (|v^i_j| + μ_{type(v^i_j)})

λ^{c'}_{v^2_1} = 1/4    λ^c_{v^2_2} = 1/4    λ^c_{v^2_3} = 1/4    λ^c_{v^2_4} = 1/4

λ^r_{v^1_1} = 1/7    λ^r_{v^2_1} = 4/7    λ^r_{v^3_1} = 1/7    λ^r_{v^4_1} = 1/7

λ^p = 0

The methodology is the same as described in Section 3.4.2, but it is important to note that the test topics are different. The test topics used here are from the mixed homepage/named-page finding task of TREC 12, and the TREC 10 homepage topics and TREC 11 named-page topics can be used as training data to estimate the prior parameters.

4.4.3 Estimating the Priors

Prior probabilities were estimated from training data according to Equation 4.8, where g = urlType returns the URL type of the document, which is in the set of URL types (ROOT, SUBROOT, PATH, FILE) as in the example provided in Section 4.2. The training data used for estimation of the priors was TREC 10 data and TREC 11 data. We made the assumption that homepages in the .GOV corpus would have similar characteristics to those in the WT10G corpus. In addition to using the test topics for the TREC 10 Homepage Finding task, we also used the 80 training topics provided that year. We leveraged the knowledge that there would be an equal number of homepage topics and named-page topics in the test set by scaling our estimates of the posterior P(urlType(R) | R is a correct page) to the same number of topics for each task. We use the language “correct page” rather than “relevant” because there is only one correct page for each query, rather than many relevant documents. It is reasonable to believe that this training data could exist for a “real” Web collection, as relevance judgments for this task are inexpensive to create and realistic queries could be taken from query logs.

We note that this method of parameter estimation for the posterior distribution expectation was very accurate given the training data. Figure 4.2 shows pie charts corresponding to the distribution we estimated and the actual distribution observed in the relevance judgments. We believe it is reasonable to achieve such good estimation of this distribution in practice, as it is a relatively low cost activity to provide assessments for known-item tasks. This is an important point that highlights the effectiveness of this technique. It is very reasonable to expect that we can very accurately estimate trends of relevance across queries given some training data.


Figure 4.2: Estimated posterior distribution of page types for correct documents (left) and actual distribution of correct documents (right).

Table 4.2: Summary of runs and their performance

Run                   Structure   URL Prior   Not found   Found by 10   MRR
Structured            YES         NO          8.3 %       83.3 %        0.652
Unstructured          NO          NO          11.0 %      78.7 %        0.612
Structured+Prior      YES         YES         5.3 %       88.0 %        0.713
Unstructured+Prior    NO          YES         9.3 %       50.7 %        0.315

4.4.4 Experiments

The experiments in this section examine several sets of runs: Structured, Unstructured, and their variants. The Structured run uses the model described above, while the Unstructured runs used a simple unstructured language model estimated by combining together all of the document representations into a single representation and forming a single language model from that representation. This MLE model was interpolated with a collection MLE model for ranking.

The results of the experiments are shown in Table 4.2. It is interesting to note that the URL priors did not help the different runs uniformly, despite the fact that they had similar initial performance. Applying the URL priors to the Unstructured run severely degraded performance, which led us to question why accurate URL priors do not reliably improve performance.

4.4.5 Discussion

When we apply the priors to the scores returned in a ranking, we do so with the hope that the distribution of documents according to the urlType found in the reranked lists will be close to the P(urlType(R) | R is a correct page) observed in the training data. The fact that the URL priors do help retrieval accuracy in some cases suggests that we may indeed be observing this behavior. To test this hypothesis, we compared the distribution of urlType in the rankings to the desired posterior given by P(urlType(R) | R is a correct page).



Figure 4.3: Posterior distribution for Structured+Prior. Applying the URL prior to Structured gives result lists that are more biased toward the FILE and SUBROOT types than desired, and less biased toward the ROOT type than desired.

For a given run, we estimated the distribution of urlType in the result lists up to rank t across the set of queries 𝐐∗ in the evaluation set. This was done by counting the number of times a document of each URL class occurred in the top t results for a query Q ∈ 𝐐 and then dividing by the number of queries times t. For example,

P(ROOT | Structured+Prior, t) = Σ_{Q ∈ 𝐐} |{r : urlType(r) = ROOT, rank(r, Q, Structured+Prior) ≤ t}| / (t · |𝐐|)   (4.12)

where rank(r, Q, Structured+Prior) is the rank of item r in the result list for query Q of the Structured+Prior run. This equation is the number of times a ROOT page is observed in the Structured+Prior run in ranks 1 to t in the query set, divided by the number of items in the run across all queries up to and including rank t. Figures 4.3 and 4.4 show these estimated distributions. The lines without the points are the desired posterior distribution we hoped to achieve by applying the prior probabilities. For the Structured+Prior run, we see that the FILE, PATH, and SUBROOT types are favored more than desired, but the ROOT type is returned less often than we would like. However, we know that performance is increased by applying the priors to the Structured run, so we hypothesize that the URL prior brought the results closer to the desired distribution of documents in the ranking. On the other hand, Unstructured+Prior does not match our expected posterior distribution well. Applying the URL priors to Unstructured results in a heavy bias towards ROOT pages at early ranks which decays rapidly. There is an increasing bias toward the SUBROOT type. The FILE type has a much stronger bias against it than desired, and the PATH class has a stronger bias towards it than desired.
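The quantity in Equation 4.12 is straightforward to compute from a run. A sketch, assuming a run is represented as a mapping from query to a ranked list of document ids and that each document's URL type has been precomputed (all names are ours):

```python
from collections import Counter

def type_distribution_at_rank(run, doc_type, t):
    """Estimate P(urlType | run, t): fraction of the top-t results of each URL type.

    run:      dict mapping query id -> ranked list of document ids
    doc_type: dict mapping document id -> URL type (ROOT, SUBROOT, PATH, or FILE)
    t:        rank threshold
    """
    counts = Counter()
    for query, ranking in run.items():
        counts.update(doc_type[doc] for doc in ranking[:t])
    total = t * len(run)   # t items considered per query, as in Equation 4.12
    return {url_class: n / total for url_class, n in counts.items()}
```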

One hypothesis for this behavior is that the rankings produced by Structured closely match P(urlType(R)) = P(g(R)) in Equation 4.8. To test this hypothesis, we applied the same method we used to generate Figures 4.3 and 4.4 to the Structured and Unstructured runs. Refer to Figures 4.5 and 4.6. There are stronger biases present in the Structured run than in the Unstructured run. The FILE type was omitted from these figures in order to provide a closer look at the other URL types. Both runs have a flat line for the observed probability of the FILE type that is below the observations of the FILE type in the general collection. These plots don’t provide any clear answers, as both runs have very similar behavior, which is otherwise interesting and encouraging.

Another way to test this hypothesis is to recompute the priors based on estimated P(urlType(R)) values specific to a run. To do this, we used the class specific probabilities observed in the top 20 results for a run across queries:

P(urlType(R)) = P(urlType(R) | run, t)   (4.13)

∗ Bold face is used here to denote a set.



Figure 4.4: Posterior distribution for Unstructured+Prior. Applying the priors to Unstructured results in heavy biases.


Figure 4.5: Distribution of documents for Structured.


Figure 4.6: Distribution of documents for Unstructured.


Table 4.3: Run specific priors

            Structured+Prior                         Unstructured+Prior
URL Type    P(urlType(R) | Structured, t)   Prior    P(urlType(R) | Unstructured, t)   Prior
ROOT        0.017                           18.9     0.017                             23.5
SUBROOT     0.137                           1.06     0.117                             1.55
PATH        0.053                           1.31     0.050                             1.23
FILE        0.793                           0.594    0.817                             0.56

Table 4.4: Performance of run specific priors

Run                   Structure   Run Specific Prior   Not Found   Found by 10   MRR
Structured            YES         NO                   8.3 %       83.3 %        0.652
Structured+Prior      YES         NO                   5.3 %       88.0 %        0.713
Structured+Prior      YES         YES                  3.7 %       89.0 %        0.727
Unstructured          NO          NO                   11.0 %      78.7 %        0.612
Unstructured+Prior    NO          NO                   9.3 %       50.7 %        0.315
Unstructured+Prior    NO          YES                  9.3 %       56.7 %        0.367

We then recomputed the priors as before in Equation 4.8, but instead used the run specific estimates for the probability of observing a document in the class. This is not training on the test set, as it does not use relevance judgments or knowledge of the correct answer. It simply uses the results observed in the original rankings as a calibration. In a very simple way it is analogous to automatic relevance feedback, which adjusts the query by examining characteristics of returned documents. This approach adjusts the probability estimates of observing a particular characteristic given the returned documents.

Table 4.3 contains the run-specific priors and Table 4.4 contains the performance achieved by using these priors. Using the run-specific priors improves performance marginally over the original priors for all runs. However, we see that this did not solve the problem when applying the priors to the flat text run. Applying the run-specific prior to Unstructured still results in an abysmal degradation of performance. This suggests that good estimation of the probability of the class P(class(R)) and the posterior P(class(R) | R is a correct page) is not always enough. It suggests there is some bias in the scores resulting from either the ranking algorithm or characteristics of the data that cannot be accounted for by a simple prior. Further evidence of this is provided in the posterior distribution plots for the runs using the run-specific priors (Figures 4.7 and 4.8). These figures demonstrate that the estimation of priors using Bayes rule does not result in the desired posterior distributions.

4.5 Summary

This chapter presented the use of information about relevance within the generative framework. A natural extension of the model was presented that places prior probabilities on documents based on the information about relevance. The use of a linear prior for XML component retrieval demonstrated that a component retrieval system can improve recall over the baseline of document retrieval. The use of priors was also successfully applied to improve performance of the generative model within the domain of named-page finding on the Internet.

The method is adaptable to any situation where there may be characteristics of relevant documents that can be leveraged. The estimation process is robust; given some training data we can reliably estimate the appropriate desirable posterior distributions.



Figure 4.7: Posterior distribution of Structured after applying the run specific URL priors.


Figure 4.8: Posterior distribution of Unstructured after applying the run specific URL priors.


This chapter also demonstrated that the prior probabilities do not always produce the desired expected posterior distributions. The run-specific prior probabilities help performance, but still do not produce the desired posterior distributions. We believe this is a result of biases in the scores produced by the ranking algorithm. One potential solution would be to learn prior probabilities that do come closer to producing the desired posterior distributions.

On the other hand, if there are biases in the scores, a simple prior probability may not be enough to fix the problem. Another approach would be to learn some more general class-based score transformation that produces the desired posterior distributions. Care must be taken here, as there is important and useful information in the scores produced by the original ranking, and an overly extreme score transformation may lose this information.

We believe an effective score transformation will minimize the difference between the observed posterior distribution and the expected distribution while also minimizing the difference between the new ranking and the original ranking. These transformations may be smoothing methods.

There is a danger in directly estimating a transformation of the scores. This attempts to bypass the cause of the bias, which may be predictable and something we can effectively leverage for better results. There may be some hidden variable causing these biases, and it would behoove us to try to find it.

Therefore, we propose to investigate the source of biases in these scores by examining characteristics in the document classes. If these sources can be quantified, we will explore the possibility of a smoothing method that helps correct the biases. As an alternative approach, we will also look into methods for learning prior probabilities that help correct the biases.


Chapter 5

Smoothing and Parameter Estimation

Chapter 3 presented a model for retrieval of documents with structure. Experiments were also presented to demonstrate that the model can achieve good results. However, the model gives very little guidance on how the λ parameters used in smoothing should be estimated. This chapter first considers the estimation problem for language models applied to the tasks of speech recognition, parsing, document classification, and retrieval. This provides context and distinguishes the parameter estimation for document and component retrieval from the other tasks. The chapter continues with a discussion of combination of evidence and presents some hypotheses for combination of distributions within the generative framework.

5.1 Training and Estimation of Language Models

Language models and other generative models have been used for many tasks, such as speech recognition, natural language parsing, classification, and ad-hoc retrieval. This section presents examples of language modeling applied to these research problems and how parameters of the model may be estimated. These examples are not chosen to be representative of the most recent state-of-the-art techniques, but are simple and illustrative examples that are comparable to the methods presented in this proposal.

For each of the tasks, this section details the following characteristics:

• the basic problem,

• typical training and test data,

• evaluation measures,

• a language model approach to the problem,

• a characterization of the parameters to be estimated,

• simplifying assumptions that reduce the size of the parameter space, and

• the parameter estimation process.

In this discussion, we consider all values that must be estimated as parameters of the model; the probabilities in the multinomial of a language model are considered parameters as well as linear interpolation parameters. Sometimes within Information Retrieval “parameters” are defined as only those values that remain “unspecified” by the model, such as the linear interpolation parameters used in smoothing. In reality, the multinomial probabilities are also parameters of the model; the method of estimation and the amount of sample data may be important to the effectiveness of the approach.

In order to facilitate discussion, this section considers adaptations of each of these tasks to the INEX corpus described in Section 3.4.1. Table 5.1 contains a summary of additional statistics about the corpus.


Table 5.1: Some characteristics of the INEX collection

Size                               494 MB
Number Documents                   12,107
Number Magazines / Transactions    18
Unique Words (|W|)                 270,000
Number Words                       65 Million
Number Document Component Types    192
Number Document Components         18.5 Million

5.1.1 Next-Word Prediction (Speech Recognition)

Speech recognition systems often produce a list of possible sentences as output for an utterance. In producing a transcript, the system must choose the most likely sentence from the list. Training data for this task are usually a corpus of sentences that are assumed representative of the test data. Test data are a set of sentences. In general, language modeling for speech recognition recasts the problem of choosing the best sentence as prediction of the next word given the previous words. This is facilitated by the chain rule:

P (S) = P (w1) P (w2 |w1 )P (w3 |w1w2 ) · · ·P (wn |w1w2 . . . wn−1 ) (5.1)

where the sentence S is equal to the string of words w_1 w_2 ... w_n. If sentences are ordered by P(S), which maximizes the probability of the real sentences, this should be a good ordering of sentences produced by the output of a speech recognition system. Using the above equation, we can see that maximizing P(S) on the training data is the same as maximizing P(w_i | w_1 w_2 ... w_{i-1}). A language model predicts the next word given the previous words by taking the word w'_i that has the maximum estimated probability given the previous words:

w'_i = argmax_{w* ∈ W} P(w* | w_1 w_2 ... w_{i-1})   (5.2)

where W is the vocabulary. The models used to estimate P(w* | w_1 w_2 ... w_{i-1}) are evaluated on their ability to predict w_i, rather than on their ability to order sentences. This is done by measuring the word-error-rate (WER) on the test data. WER is the percentage of incorrectly predicted words given the previous words in the sentence across all sentences. As there is such a strong correlation between optimizing word prediction and sentence selection, measuring WER is a good indication of the language model’s capability to select the correct sentence.

One way to estimate P(w_i | w_1 ... w_{i-1}) is to use a Markov assumption to limit the history that must be examined. For example, one may assume it is reasonable to estimate using only the previous two words. This assumption results in a second order Markov model, which is also called a trigram. We can characterize the trigram by noting the previous two words w_{i-2} w_{i-1} can be used to index a language model, as P(w | w_{i-2} w_{i-1}) specifies a multinomial probability distribution. We will denote this language model θ_{w_{i-2} w_{i-1}}. It is common that many word pairs w_a w_b occur infrequently in the text, so many of the θ_{w_a w_b} models will be poorly estimated. One solution to this is to smooth the trigrams with bigrams and unigrams:

P(w_i | w_1 ... w_{i-1}) = λ_3 P(w_i | θ_{w_{i-2} w_{i-1}}) + λ_2 P(w_i | θ_{w_{i-1}}) + λ_1 P(w_i | θ_U)   (5.3)

where θ_U is the unigram model. The bigrams and unigrams have progressively larger amounts of data that can be used for estimation. This leaves the estimation of |W|^2 θ_{w_a w_b} language models, |W| θ_{w_a} language models, the θ_U language model, and λ_1 and λ_2 (λ_3 = 1 − λ_1 − λ_2). The total number of parameters to be estimated is (|W|^2 + |W| + 1)(|W| − 1) + 2, or approximately |W|^3.


The θ language models may be estimated using many techniques. To keep the discussion simple, we will assume a maximum-likelihood estimator is used. The corpus used in this section is around 65 million words, which is not as large a corpus as most language modeling systems would be trained on. Given the somewhat small corpus size, it is reasonable to use a smaller vocabulary of 10,000 words. Under this assumption, there are 100,000,000 trigram language models, 10,000 bigram models, and one unigram model to estimate. Except for the most frequently occurring w_a w_b word pairs, the trigram estimates will be practically useless as they are estimated on only a few training samples. This is not generally a problem, as the bigram and unigram models help significantly. One can estimate that the bigram models will be estimated on an average of 6,500 training examples (number of words in the corpus / number of bigram models). Some bigram models will have many more training examples, but most will have fewer. Considering the 9,999 parameters to be estimated for a single bigram language model, many of these models will be poorly estimated, and the unigram model may be relied on quite often.

A more detailed analysis reveals that with a 10,000 word vocabulary chosen by taking the most frequent terms in the INEX collection, there are 4,393,058 unique bigrams occurring in the collection. That means that more than 95 percent of the trigram models will have no data for estimation. Over half of these w_a w_b bigrams occur only once, which means that the θ_{w_a w_b} trigram model will only have one training example to estimate the 9,999 parameters of the model. In fact, only 7,298 bigrams occur 1,000 or more times, meaning that 4,385,760 trigram models are estimated on less than 1,000 examples. The case is not as bad for estimating the θ_{w_a} bigram models. The least number of training examples for a bigram model is 308, and 53 percent have 1,000 or more examples.

While it may seem that there are far too few training examples to estimate the language models, it is important to remember the task. In next word prediction, the most frequently occurring words are the most important. The rare trigram models in the training set will be used rarely in the test set. In a sense, most of the time there will be plenty of training data for the estimates. For example, Seymore and Rosenfeld [43] report a 19 percent error rate on a 58,000 word vocabulary with 45 million words for training. That is, they were able to correctly predict four out of five words. Not bad considering the sheer magnitude of parameters in the model (around 195 million million). Similar to our observations of the INEX collection, they found around 4.6 million unique bigrams in the text.

There is no closed-form maximum likelihood estimate of the λ parameters. Instead, a local maximum for the parameters may be estimated using the expectation-maximization (EM) algorithm. In this case, the process of estimation using EM is:

1. Initialize the λ0 parameters to random values that sum to 1.

2. Update λ^{t+1}_i using

λ^{t+1}_i = (1/|W|) Σ_{j=1}^{|W|} [ λ^t_i P(w_j | θ_{w_{j-i+1} ... w_{j-1}}) / Σ_{i'} λ^t_{i'} P(w_j | θ_{w_{j-i'+1} ... w_{j-1}}) ]

3. Check the log-likelihood of held-out data using the λ^{t+1} parameters and see if they have converged. If they have not, set t = t + 1 and repeat step 2. If they have, take λ_i = λ^{t+1}_i.

Estimation of the λ parameters in this manner will converge on a local optimum for the held-out data and works quite well in practice.
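The EM recipe above is short to state in code. The sketch below estimates the interpolation weights from held-out data that has already been reduced to component probabilities for each word occurrence; the uniform initialization and the function name are our assumptions (the text above suggests random initial values).

```python
def em_interpolation_weights(heldout_probs, max_iters=100, tol=1e-6):
    """Estimate linear interpolation weights by EM on held-out data.

    heldout_probs: one tuple per held-out word occurrence, giving
        (P(w | component_1), ..., P(w | component_n)), e.g. unigram/bigram/trigram.
    """
    n = len(heldout_probs[0])
    lambdas = [1.0 / n] * n                       # step 1 (uniform rather than random)
    for _ in range(max_iters):
        expected = [0.0] * n
        for probs in heldout_probs:               # step 2: posterior responsibility of each component
            denom = sum(l * p for l, p in zip(lambdas, probs))
            if denom > 0.0:
                for i in range(n):
                    expected[i] += lambdas[i] * probs[i] / denom
        new_lambdas = [e / len(heldout_probs) for e in expected]
        if max(abs(a - b) for a, b in zip(new_lambdas, lambdas)) < tol:   # step 3: convergence check
            return new_lambdas
        lambdas = new_lambdas
    return lambdas
```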

5.1.2 Document Classification

The task of document classification is to assign documents to a class based on the text of the document. These classes typically correspond to general topics, such as “sports”, “economics”, and “world politics” in a news corpus.


Table 5.2: Contingency table for evaluation measures in document classification.

                                   In class   Not in class
System labeled as in class         A          B
System labeled as not in class     C          D

The system tries to replicate the classes that a user would give to the documents. A typical training and test corpus is a labeled collection of documents. Training and test subcollections are usually created for use with cross-validation. Success is measured either by accuracy or F1, the harmonic mean of recall and precision:

recall = A / (A + C)        accuracy = (A + D) / (A + B + C + D)
precision = A / (A + B)     F1 = 2 / (1/recall + 1/precision)   (5.4)

The naive Bayes classifier is a common approach to text classification [34]. The task of the classifier is to select the class c* with the maximum probability given the terms in the document:

c* = argmax_{c_i ∈ C} P(c_i | w_1 w_2 ... w_{|D|})   (5.5)

where the document D is a string of words w_1 w_2 ... w_{|D|}. This can be rewritten using Bayes rule:

c* = argmax_{c_i ∈ C} P(c_i) P(w_1 w_2 ... w_{|D|} | c_i) / P(w_1 w_2 ... w_{|D|})
   = argmax_{c_i ∈ C} P(c_i) P(w_1 w_2 ... w_{|D|} | c_i)   (5.6)

The prior probability of the class c_i, P(c_i), may be estimated using a maximum likelihood estimator from the training data. Estimating P(w_1 w_2 ... w_{|D|} | c_i) is usually simplified by assuming independence among words in the document:

P(w_1 w_2 ... w_{|D|} | c_i) = ∏_{j=1}^{|D|} P(w_j | c_i) = ∏_{w ∈ D} P(w | c_i)^{tf(w,D)}   (5.7)

The probabilities P(w | c_i) can be viewed as specifying a unigram language model. To be consistent with previous notation, we refer to this language model as θ_{c_i}:

c* = argmax_{c_i ∈ C} P(c_i) ∏_{w ∈ D} P(w | θ_{c_i})^{tf(w,D)}   (5.8)

One way to estimate θ_{c_i} would be using the maximum likelihood estimator. However, this should be additionally smoothed so that zero counts will not pose a problem. The Laplace estimator is designed to overcome these difficulties by assigning some probability mass to unseen terms:

P(w | θ_{c_i}) = P(w | LAP(c_i)) = (tf(w, c_i) + 1) / (Σ_{w' ∈ W} tf(w', c_i) + |W|)   (5.9)

where tf(w, c_i) is the term frequency (count) of the word summed over all training documents in the class c_i.
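A minimal naive Bayes classifier following Equations 5.8 and 5.9 is sketched below; the training interface (a list of token-list and label pairs) and the function names are our assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(training_docs):
    """training_docs: list of (tokens, class_label) pairs."""
    class_counts = Counter(label for _, label in training_docs)
    term_counts = defaultdict(Counter)     # tf(w, c_i) summed over documents in the class
    vocabulary = set()
    for tokens, label in training_docs:
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    priors = {c: n / len(training_docs) for c, n in class_counts.items()}
    return priors, term_counts, vocabulary

def classify(tokens, priors, term_counts, vocabulary):
    """Return argmax_c P(c) * prod_w P(w | LAP(c))^tf(w,D), computed in log space."""
    best_class, best_score = None, float('-inf')
    for c, prior in priors.items():
        denominator = sum(term_counts[c].values()) + len(vocabulary)
        score = math.log(prior)
        for w, tf in Counter(tokens).items():
            score += tf * math.log((term_counts[c][w] + 1) / denominator)   # Laplace estimate
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```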


Assuming this model for document classification requires the estimation of |C| − 1 parameters for the class priors P(c) and |C| (|W| − 1) for the |C| θ_c language models. As discussed above, the estimation of these parameters is fairly simple and can be done using the MLE and Laplace estimators.

For the purposes of discussion we assume 18 classes in the INEX collection, as there are 18 journals. Assuming the entire collection is available for training, there are 12,107 training samples for the 17 parameters required for the estimation of the class priors, which should be more than enough. Assuming no feature selection beyond the document tokenization process, each class language model θ_c has on average 3.8 million training samples for the approximately 270,000 parameters. This should also be plenty of examples. It is clear that there are far fewer parameters than in next word prediction, and the parameters can be estimated with much greater confidence.

Document Retrieval as Classification

Document retrieval may be viewed as document classification. For a given query, the system should try to separate documents into the relevant and non-relevant classes. However, the training data for document retrieval is quite different. In document classification, example documents from each class are available to the system, but text retrieval only has a short query as a representative of the relevant class. The class priors cannot be estimated from the data available in document retrieval, and there is little data available to estimate the relevant class language model and none available for the non-relevant document class language model. If one were to estimate a smoothed query model as the estimate of the relevant class language model and the collection model as the non-relevant class language model, ranking according to the naive Bayes document classification model presented above would result in a model similar to the document-likelihood model:

P(rel | D) / P(\overline{rel} | D)  rank≡  [P(rel) / P(\overline{rel})] [P(D | θ_{rel}) / P(D | θ_{\overline{rel}})]
  rank≡  P(D | θ_{rel}) / P(D | θ_{\overline{rel}})
  rank≡  [∏_{w ∈ D} (λ P(w | MLE(Q)) + (1 − λ) P(w | θ_C))^{tf(w,D)}] / [∏_{w ∈ D} P(w | θ_C)^{tf(w,D)}]   (5.10)

where the denominator can be viewed as a controlling factor for various document lengths. This model would rank by how much more likely the document was created by the query rather than the collection.

5.1.3 Parsing

The problem of natural language parsing considered here is to select the most probable parse tree given a set of possible parse trees for a sentence. The set of possible parse trees for a given grammar may be determined by a separate algorithm, such as the chart parser (Chapter 3 of Allen [1]). In the model presented here, which trees are deemed most probable is determined by a probabilistic context-free grammar (PCFG) (Chapter 11 of Manning and Schutze [31]). In a PCFG, probabilities are associated with grammar rules of the form l → r, where l is a single non-terminal and r is a string of non-terminals or a word in the vocabulary. These probabilities are denoted P(r | l), where l denotes the left-hand side of the rule and r denotes the right-hand side of the rule.

These rules can be separated into two sets: ones where the right-hand side is a terminal (a word in the vocabulary) and ones that consist of a string of non-terminals. We will denote the left and right hand sides of the rules where the right-hand side of the rule is a terminal as L_W and R_W. Similarly, we denote the set


of left and right hand sides of rules where the right-hand side of the rule is a string of non-terminals as L_NT and R_NT.

Using this notation, the task of parsing is to find the most likely tree T ∗ given a sentence:

T* = argmax_{T ∈ parses(S)} P(T | S)   (5.11)

where T is a set of l → r rules that can produce the sentence S and parses(S) returns the set of possible parse trees that can produce S. The context-free attribute of the grammar breaks the probability of a tree into simpler components:

P(T | S) = ∏_{i=1}^{|T|} P(r_i | l_i)   (5.12)

When the right-hand side is a word (r_i ∈ R_W), P(r_i | l_i) may be viewed as a unigram language model θ_{l_i}. Otherwise, P(r_i | l_i) is a multinomial distribution over possible right-hand side strings of non-terminals as specified by the grammar. There are |L_W| language models to estimate. If G_NT is the subset of the grammar rule set where the right-hand side of the rules is a string of non-terminals, then there are |G_NT| − |L_NT| parameters to estimate for the non-terminal multinomial distributions. This results in a total of |L_W| (|W| − 1) + (|G_NT| − |L_NT|) parameters to estimate in the model.

There are two cases for training data for parsers. In one case, only sentences are provided and the system must induce a grammar. This can be done using the Inside-Outside algorithm (Section 11.3.4 of [31]), which results in low entropy models but grammars that do not correspond well with linguistic knowledge of language. The other case is where some sample of parse trees are provided for sentences, as in the Penn Treebank [32]. In this case, a simple estimator such as the maximum-likelihood estimator would be appropriate:

P_MLE(r | l) = count(l → r, T) / count(l, T)   (5.13)

Any smoothed estimator such as the Laplace estimator would also be appropriate.

5.1.4 Document Retrieval

For our discussion of document retrieval, we will assume a query-likelihood model (Section 2.1.2) where the document’s model is smoothed with a collection model using Dirichlet priors (Section 2.1.5). This means documents are ranked according to:

P(Q | θ_T) = ∏_{w ∈ Q} P(w | θ_T)^{tf(w,Q)}
  rank≡ Σ_{w ∈ Q} tf(w, Q) log P(w | θ_T)
  = Σ_{w ∈ Q} tf(w, Q) log( (|T| / (|T| + μ)) P(w | MLE(T)) + (μ / (|T| + μ)) P(w | θ_C) )   (5.14)

where θ_C is a collection language model and μ is an unspecified parameter of the model. In this approach there are |D| + 1 language models to estimate in addition to μ, resulting in a total of (|D| + 1)(|W| − 1) + 1 parameters to estimate.
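A sketch of the Dirichlet-smoothed query-likelihood score of Equation 5.14, with document and collection statistics passed in as simple counters; the default value of μ and the floor for out-of-vocabulary words are our assumptions.

```python
import math
from collections import Counter

def dirichlet_query_likelihood(query_tokens, doc_tokens, collection_prob, mu=2500.0):
    """Rank-equivalent log P(Q | theta_T) with Dirichlet-prior smoothing (Equation 5.14).

    collection_prob: dict mapping word -> P(w | theta_C)
    mu:              Dirichlet smoothing parameter (2500 is an assumed default)
    """
    doc_tf, doc_len = Counter(doc_tokens), len(doc_tokens)
    score = 0.0
    for w, qtf in Counter(query_tokens).items():
        p_mle = doc_tf[w] / doc_len if doc_len else 0.0
        p_coll = collection_prob.get(w, 1e-9)    # assumed floor for unseen words
        p = (doc_len / (doc_len + mu)) * p_mle + (mu / (doc_len + mu)) * p_coll
        score += qtf * math.log(p)
    return score
```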

In document retrieval, the document collection is available for training. Often a set of past queries with relevance judgments is also available. Test data consist of new, unseen queries and relevance judgments for some documents in the collection. The training queries and judgments may be used to estimate parameters that work well across many queries.


When considering the INEX collection, each document’s model is estimated with an average of approximately 5,400 words. The collection model is well estimated with 65 million words, and provides a suitably large model to use for smoothing with document models. Both the document models and the collection model may be estimated with the maximum-likelihood estimator. Estimation of the μ parameter is another matter.

In [55], Zhai presents a leave-one-word-out method for estimating μ. The method finds the μ parameter that maximizes the sum over all words of the log-likelihood of the left out word given a model estimated from the collection without that word. Zhai also presents how the μ that maximizes the likelihood may be estimated using Newton’s method. However, Zhai reports that in practice, the μ parameter may be underestimated by the method for short queries. The underlying assumption not made explicit is that single words in the collection are representative of queries, and can thus be used for estimation of the μ parameter. Another disadvantage of this approach is that it cannot leverage available training data. An alternative approach would be to find the μ parameter that maximizes the likelihood of example queries, rather than the left out word. Another approach could be to find the μ parameter that empirically maximizes an evaluation measure such as average precision on the training queries, perhaps using a leave-one-query-out method on the training data.

5.1.5 Discussion

The previous sections described basic examples of how language models are used across a variety of tasks. Given this context, it is informative to contrast the training and test data available for the task of document retrieval with the other tasks presented. In all of the other tasks, the test data is remarkably similar to the training data:

• In document classification, the goal is to label a document with a class label. The training data consist of example documents for each of the classes. In a sense, the task is to compare documents to documents. The classes documents can take remain fixed, but new documents are presented.

• In parsing, the goal is to identify the most likely parse tree of a sentence. The training data consist of example parse trees for sentences. In some way, the task is to compare trees to trees. The grammar remains fixed, but new sentences are presented.

• In next word prediction, the goal is to predict a word given a history. The training data consist of words in their histories. The task can be crudely viewed as comparing word-history pairs with word-history pairs. The words in the vocabulary remain fixed, but new histories are presented.

In retrieval, the goal is to label documents as relevant or non-relevant to a specific query. The training examples consist of relevant and non-relevant class labels for documents with respect to other queries. Unlike the other tasks, the classes change while the documents remain fixed. The test documents are the same as the training documents, and a document’s relevance to one query may not be indicative of the document’s relevance to a new query.

The marked difference in the nature of the training data for document retrieval makes the methods used to estimate the parameters of these other tasks less directly applicable to the task of document retrieval. In the other tasks, parameters for specific class labels, histories, or grammar rules can be learned, while in document retrieval parameters that work well in general, across previously unseen queries or classes, must be estimated. Estimating a model of the relevant class from only the query is essentially trying to estimate |W| − 1 parameters of a model from a handful of words. Any additional information about relevance may make a big difference:

• With relevance feedback, examples of relevant and non-relevant documents can greatly increase the amount of training data for modeling queries, relevance models, or models of non-relevance.


• Pseudo-relevance feedback tries to approximate relevance feedback by assuming highly ranked documents under the initial model of the query are relevant.

• Estimating models of relevance through other techniques, such as an estimation of a relevance model using co-occurrence statistics of terms with the query terms, allows leveraging larger amounts of data for model estimation.

Unfortunately, these scrambles for information do not help one understand how to best use the information that is available for parameters that may be estimable from previous queries and their corresponding relevance judgments. In the case of one parameter, such as the μ in smoothing using Dirichlet priors, there may be simple techniques to do this estimation. However, the hierarchically structured model presented in Chapter 3 has many parameters that remain unspecified (the λs).

It would be futile to assume that the λ parameters in the structured document retrieval model presented in Chapter 3 can be estimated on a per-document basis from training queries. As such, some simplifying assumptions to reduce the number of unspecified parameters must be made. For example, in the experiments described in Section 3.4.1, a word token was considered to hold one “unit” of information. Weights on the collection and parent models were assumed to be best set as constants. These assumptions resulted in a constant λ^u and λ^p across all document components in the collection, with a λ^c_{v_j} proportional to the length of the text in node v_j. A list of possible assumptions is below.

1. λ^c_{v_i} does not depend on the ancestors or descendants of v_i. This assumption could greatly reduce the parameter space if used with other assumptions.

2. λ^p_{v_i} does not depend on ancestors or descendants of v_i. This could result in λ^p_{v_i} as a function of the component type of v_i or the length of v_i.

3. How representative a word is of the query depends on the type of the document component in which the word occurs.

4. The weight used when interpolating children should depend only on the types of the children and the type of the node.

A model using assumptions 1 and 3 could result in λ^c_{v_i} as a value proportional to a scaled length of the node. We will refer to this as the component type weighted length approach. For example:

Let v_j = parent(v_i).

λ^c_{v_i} = α(type(v_i)) |v_i|_a / ( α(type(v_j)) |v_j| + Σ_{v_k ∈ children(v_j)} α(type(v_k)) |v_k|_a )

λ^{c'}_{v_j} = α(type(v_j)) |v_j| / ( α(type(v_j)) |v_j| + Σ_{v_k ∈ children(v_j)} α(type(v_k)) |v_k|_a )   (5.15)

If λ^p and λ^u are additionally taken to be constants, then there are 192 α parameters plus the λ^p and λ^u parameters within the INEX collection to estimate to fully specify the λ parameters of the model. One way to estimate these parameters for ad-hoc retrieval would be to use the expectation-maximization algorithm to maximize the query-likelihood of relevant query-document pairs in the training data.
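Equation 5.15 reduces to a weighted-length normalization over a node and its children. A sketch of that computation, assuming each node record carries its component type, its own text length |v_j|, and its accumulated length |v_k|_a, and that the α weights are supplied as a dictionary keyed by component type (the data layout and names are ours):

```python
def weighted_length_lambdas(parent, children, alpha):
    """Compute lambda^c for each child and lambda^{c'} for the parent (Equation 5.15).

    parent:   dict with keys 'type' and 'length' (the parent's own text length |v_j|)
    children: list of dicts with keys 'id', 'type', and 'length_a' (accumulated length |v_k|_a)
    alpha:    dict mapping component type -> alpha(type)
    """
    denom = alpha[parent['type']] * parent['length'] + sum(
        alpha[child['type']] * child['length_a'] for child in children)
    child_lambdas = {child['id']: alpha[child['type']] * child['length_a'] / denom
                     for child in children}
    parent_lambda = alpha[parent['type']] * parent['length'] / denom
    return child_lambdas, parent_lambda
```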

Using assumption 4 could result in a model where the document structure is viewed as produced by a context-free grammar. For each grammar rule, a set of linear interpolation parameters must be learned. Data sparsity could be a problem with this approach, as many of the grammar rules would not occur in the training data.


The above assumptions are just a sample of assumptions that could be made in order to simplify the parameter estimation process. However, the model gives no guidance about which assumptions are reasonable, and the parameter estimation process may be difficult. This is a hindrance to the use of the model, and it is important to analyze some reasonable assumptions so that others may have guidance when adapting the model to retrieval scenarios. The problem of how best to interpolate the generative models of the document structure can be viewed as a combination of evidence problem. The following section considers this view.

5.2 Combination of Evidence

This section considers the combination of evidence problem for structured document retrieval. The goal is that through considering the combination of evidence from structural information, some useful intuitions and guidance may be uncovered. This section first discusses related work on the general combination of evidence problem within the Information Retrieval community. Hypotheses for successful combination of evidence for structured document retrieval within the generative probabilistic framework are then presented.

5.2.1 Related Work

There has been extensive work in the combination of evidence in the IR community. Much of this work is summarized by Croft [9]. Croft separates this work into several areas: combining representations, combining queries, combining ranking algorithms and search systems, and combining belief. In order to keep this discussion brief, I will only address work where a concerted effort was made to understand the conditions for successful retrieval.

As a result of early studies observing that different searchers forming queries for an information need created widely different queries, Belkin et al. [4] examined the conditions for successful combination of queries and found that bad query representations needed to be weighted lower than good query representations. They related this to the data fusion problem, which has been extensively studied for combining ranking algorithms and search systems.

Croft observed that combining ranking algorithms can be cast as a problem of combining classifiers, which is well studied within the machine learning community. Tumer and Ghosh [49] provided a detailed analysis of using either linear combinations of classifiers or order statistics. Of particular interest to IR research is the linear combination of classifiers, as this is closely related to many of the combination techniques used within IR. Tumer and Ghosh showed that the combination of classifiers reduces the variance in boundary locations around the optimal boundary. They observed that classifiers of roughly similar quality combine the best, and including poorly performing classifiers can be detrimental to performance. They did caution to add that weighting classifiers, which they do not investigate, can accommodate the inclusion of these poorly performing classifiers. Tumer and Ghosh also showed that the gain given by correlated classifiers is related to the amount of independent information expressed by the different classifiers. This work suggested that the success of the combination is related to the quality of the classifiers, the number of classifiers, and the independence of the classifiers.

An error in classification corresponds roughly to failing to retrieve a relevant document or retrieving a non-relevant document. However, combining rankings is not a simple matter of assigning a “retrieved” or “not-retrieved” label. The decision boundary varies based on thresholding of the ranking function or rank. Systems are evaluated with respect to multiple thresholds. How Tumer and Ghosh’s analysis can be generalized to the ranking problem is not entirely clear and has not been investigated. Additionally, Croft pointed out that Tumer and Ghosh’s work assumes the compatibility of the classifiers’ outputs, which is not always true for combining ranking algorithms.

Additional recent research has been done on the metasearch problem. Aslam and Montague [3] interpreted Croft’s statements about combination of evidence by stating:


“The systems being combined should (1) have compatible outputs, (2) each produce accurate estimates of relevance, and (3) be independent of each other.”

However, these three hypotheses have not been fully investigated. A hypothesis similar to the independence hypothesis posed in metasearch research by Lee [25] is that there should be higher overlap of relevant documents (across the top n results given by each algorithm) than the overlap of non-relevant documents. However, Chowdhury et al. [6] found that this is not a sufficient condition for effective combination.

Manmatha et al. [30] took a different approach to metasearch. They explicitly modeled the distributions of relevant and non-relevant documents and used these as a guide for combination of rankings. While we do not expect that an explicit modeling of distributions will be necessary to do well in our own work, we believe that modeling and understanding these distributions will be crucial to understanding the conditions for successful combination of evidence.

In his discussion of combining belief, Croft described the INQUERY retrieval system, which uses Bayesian inference networks. The inference network framework allows for multiple document and query representations (which may be structured queries). However, the framework gives little guidance on how to make sure the multiple forms of evidence can be combined successfully.

Greiff [16] described another probabilistic model that can incorporate multiple forms of evidence. Central to this work was the use of Exploratory Data Analysis (EDA) to model and understand factors important to relevance and successful combination of evidence. While the model presented in this proposal is not the same probabilistic model as used by Greiff, we expect that EDA will be crucial within this work to developing language models that can more successfully combine evidence.

5.2.2 Analysis Approach

Successful combination of evidence will be dependent on the generative probability distributions being compatible and appropriately weighted. These analysis tools will draw from the hypotheses of previous work, but are tailored specifically for analysis of the generative models for structured document retrieval presented in Chapter 3. I propose that power and compatibility are important characteristics for analysis of a particular parameter estimation method. In the presentation of concepts below, consider using Model MR to rank poems against the query “plum Horner” as in the example provided in Section 3.2.1.

• power - Power measures the performance of some aspect of the estimation method. This is a measure with respect to a particular test collection. For example, one measure of power could be average precision. Power should measure the ability of the estimation method to distinguish relevant components from other components. The estimation method refers to how the generative models are estimated. For example, we may wish to measure the power of the θ_{v^i_j}, θ'_{v^i_j}, or θ''_{v^i_j} models in a collection. We will refer to these measures of power as power(θ, χ), power(θ', χ), and power(θ'', χ) respectively, as they refer to the power of these models over some set of rankable items in a corpus and some additional parameters, which are denoted by χ. χ is used as a shorthand to represent any additional parameters such as a specific query or query term and relevance assessments.

– term power - The term power is the power (defined above, e.g. average precision) of the rankings produced by ordering the rankable items in a corpus by the generative probability of a single query term. For example, we may wish to measure the term power of the maximum likelihood estimator for the terms “plum” and “Horner”. We would denote this as term power (MLE, “plum”, χ) and term power (MLE, “Horner”, χ), where χ would in this case refer to a specific query and its relevance assessments (e.g. a query of “plum Horner” and some user’s judgments of relevance on the corpus with respect to the query).


– query power - The query power of a generative model with respect to a query is the power (e.g. average precision) of the ranking produced by evaluating the model on the query. For example, the query power of the maximum likelihood estimator with respect to the query “plum Horner” would be denoted query power (MLE, “plum Horner”, χ) and estimated by taking the chosen measure of power (e.g. average precision) of the rankings produced by ranking components in the corpus using the maximum likelihood estimation method. Similarly, one could measure the query power of the θ models, which would be denoted query power (θ, “plum Horner”, χ).

• compatibility - Compatibility is a measure of how well the distributions combine with themselves and with each other. It compares two measures of power for an estimation method. One possible compatibility measure could be the relative difference of one measure of power from another, e.g. $\text{compatibility}(\theta_1, \theta_2, \chi) = \frac{\text{power}(\theta_1, \chi) - \text{power}(\theta_2, \chi)}{\text{power}(\theta_2, \chi)}$.

– term compatibility - The term compatibility is a measure of how well a distribution combines with itself during ranking across multiple query terms. It is a function of term power and query power. For example, one estimate may be:

\[
\text{term compatibility}(\theta, \mathcal{Q}, \chi) = \frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{\text{query power}(\theta, Q, \chi) - \frac{1}{|Q|}\sum_{w \in Q} \text{term power}(\theta, w, \chi)}{\frac{1}{|Q|}\sum_{w \in Q} \text{term power}(\theta, w, \chi)} \tag{5.16}
\]

where w is a query term in the query Q in some set of queries $\mathcal{Q}$ that have relevance judgments on some corpus. The estimate in Equation 5.16 compares the query power over a set of queries to the average term power of the query terms.

– cross distribution term compatibility - Cross distribution term compatibility measures how well a distribution combines with other distributions. It specifically measures how the combination affects the term power of these distributions. It is a function of term power and combined distribution term power. One estimate may be:

\[
\frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{1}{|Q|} \sum_{w \in Q} \frac{\text{term power}(\theta, w, \chi) - \frac{1}{n}\sum_{i=1}^{n} \text{term power}(\theta_i, w, \chi)}{\frac{1}{n}\sum_{i=1}^{n} \text{term power}(\theta_i, w, \chi)} \tag{5.17}
\]

where θ1, θ2, . . . , θn are being combined to form θ.

– cross distribution query compatibility - Cross distribution query compatibility measures how well a distribution combines with other distributions. It specifically measures how the combination affects the query power of these distributions. It is a function of query power and combined distribution query power. One estimate may be:

\[
\frac{1}{|\mathcal{Q}|} \sum_{Q \in \mathcal{Q}} \frac{\text{query power}(\theta, Q, \chi) - \frac{1}{n}\sum_{i=1}^{n} \text{query power}(\theta_i, Q, \chi)}{\frac{1}{n}\sum_{i=1}^{n} \text{query power}(\theta_i, Q, \chi)} \tag{5.18}
\]

where, again, θ1, θ2, . . . , θn are the distributions being combined to form θ.

Given the definitions of power and compatibility, we further propose some hypotheses about the conditions for successful combination of distributions:

1. The average query power should be greater than the average term power for the model in a successful estimation method. This would correspond to a “positive” average cross term compatibility. Positive cross term compatibility does not correspond to a system where the performance of the combined model is greater than the sum of the parts; it merely asserts that the modeling technique provides better rankings using entire queries than the average performance of a query term. A “robust” method would, for all queries, have query power greater than the average term power for that query. A “highly robust” method would, for all queries, have query power at least as great as the maximum term power of any term in the query. A method with low variance in cross term compatibility would be “predictable”.

2. For distributions that are combinations of other distributions, the average term power for a distribution estimation method should be greater than the average term power of the distributions combined to form the new distribution. That is, we would like to see that combining two distributions does not hurt performance more than we might expect; we would like the combined distributions to perform at least as well on average as the uncombined distributions. This would be called a “positive” average cross distribution term compatibility. The notions of “robust”, “highly robust”, and “predictable” are defined similarly to their definitions in hypothesis 1.

3. For distributions that are combinations of other distributions, the average query power should similarly be greater than the average query power of the distributions being combined. This is similar to the previous point. We would like to see that combining distributions does not hurt the performance of queries compared to the average performance of the queries on the uncombined models. Cross distribution query compatibility measures the correspondence between cross distribution query power and average query power.

4. The best weight to place on a distribution when combining it with others is a function of its cross distribution query compatibility with the other distributions and its combined distribution query power.

5. The net achievable increase in combined distribution powers is a function of the cross distribution compatibilities of the distributions and the independent query power of the distributions.

These hypotheses provide a framework for the analysis of smoothing methods and assumptions about the estimation methods that can be used within the framework. Smoothing methods may be mechanisms for setting linear interpolation parameters when combining two models or direct transformations of the probabilities of one distribution. Using the above characteristics, one may be able to describe common smoothing methods:

• Jelinek-Mercer smoothing with a text and corpus model is a simple linear interpolation of the maximum likelihood model of the text and a corpus model. The interpolation parameter λ is set as a constant across all documents. This method of combining two distributions does not change the term power of the distribution with respect to the text model, as the orderings produced for a single term do not change when including the corpus model. However, it does increase the cross term compatibility of the combined distribution with respect to the original text model, corresponding to an increased query power.

• Smoothing using Dirichlet priors to combine a text and corpus model relies on a linear interpolation where the λ parameter depends on the length of the text sample. As a result, this form of smoothing is a transformation of the distributions that does change their term power, as texts are reordered due to the length factor. A short sketch contrasting the two methods follows.
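The sketch below is a minimal illustration with made-up counts and parameter values, not the estimation code of this proposal. It shows why a fixed λ preserves single-term orderings of texts while the length-dependent Dirichlet interpolation can reverse them.

from collections import Counter

def jelinek_mercer(term, text_counts, corpus_prob, lam=0.5):
    # P(term) = (1 - lam) * P_ML(term | text) + lam * P(term | corpus)
    text_len = sum(text_counts.values())
    return (1 - lam) * text_counts[term] / text_len + lam * corpus_prob(term)

def dirichlet(term, text_counts, corpus_prob, mu=1000):
    # P(term) = (count(term) + mu * P(term | corpus)) / (text length + mu)
    text_len = sum(text_counts.values())
    return (text_counts[term] + mu * corpus_prob(term)) / (text_len + mu)

# Hypothetical texts: the short one has the higher maximum likelihood estimate
# for "plum" (1/10 > 5/100), so it ranks first under the unsmoothed text model.
short_text = Counter({"plum": 1, "pie": 9})
long_text = Counter({"plum": 5, "corner": 95})
corpus_prob = lambda term: {"plum": 0.001}.get(term, 1e-6)

# Jelinek-Mercer preserves the single-term ordering; Dirichlet can reverse it,
# because the effective interpolation weight depends on text length.
for name, smooth in (("Jelinek-Mercer", jelinek_mercer), ("Dirichlet", dirichlet)):
    print(name,
          smooth("plum", short_text, corpus_prob),
          smooth("plum", long_text, corpus_prob))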

As a part of this thesis, we propose to investigate these hypotheses on existing test corpora, such as the known-item finding tasks of the TREC conference used in experiments in Section 3.4.2. We will investigate the use of some measures of power, such as average precision. We hope that this investigation will result in a better understanding of smoothing and combination methods for linear interpolations. At a minimum, we expect to have a better understanding of which smoothing and combination methods work for the specific tasks that will be investigated in the thesis.


Chapter 6

Modeling Annotations

The previous chapters presented and evaluated a model capable of representing common cases of document structure and multiple representations. As structure encoded by humans is often clean and easy to represent hierarchically, the model was directly able to capture and represent relationships in document structure. When annotations to documents are provided, the model may break down.

Models SR and MR presented in Chapter 3 may fail to adequately represent annotations for a variety of reasons. Documents may have many forms of annotations. While different annotations could be modeled using additional representations within Model MR, the underlying text from which the annotations are formed may be the same. For example, we may wish to add annotations for part-of-speech and also named-entities. The underlying text is the same, and forming different representations within Model MR would ignore that connection. Additionally, Model MR does not have adequate query language support for the types of questions we would like to ask. It would be difficult to search for named-entities matching “Abe Lincoln” that are contained within the subject of a sentence. Annotations may be assigned probabilistically, and the model should be able to represent alternative annotations and confidence values. Leveraging confidence values and using alternative annotations may be important for effective retrieval.

To help clarify the problem with Models SR and MR, consider annotating the sentence

S1:
<sentence>
  John Wilkes Booth killed Abraham Lincoln.
</sentence>

with semantic role labels and named-entities. Suppose the semantic role labeler annotates the text as

S2:
<sentence>
  <agent> John Wilkes </agent> Booth
  <event> killed </event>
  <patient> Abraham Lincoln </patient>.
</sentence>

making an error in the tagging of the agent, while the named-entity tagger incorrectly identifies the first person:

S3:
<sentence>
  John <person> Wilkes Booth </person>
  killed
  <person> Abraham Lincoln </person>.
</sentence>

We cannot combine the annotations provided in S2 and S3 within Model SR as it assumes a hierarchical structure. Within Model MR, we could create separate representations for S2 and S3, but could not answer the query “find person components matching ‘Abe Lincoln’ contained within a patient component”.


To first handle the representation problems associated with hierarchical structures, we will use offset annotation (Section 6.1). This chapter then discusses additional difficulties by motivating the need for annotations in the context of a reading tutor system (Section 6.2) and a question answering system (Section 6.3).

6.1 Offset Annotation

Offset annotation can be used to represent the alternative and perhaps overlapping structures for text. In offset annotation, an index of structure separate from the original text of the document is created that contains cross-references to the original text. The cross-references are called offsets, and may be in terms of byte locations or token positions.

To keep our example simple, we will use token positions and assume the original text has been tokenized in a uniform manner for all processing components. In this case, we will place each token on a separate line, with the term location in parentheses:

(1) John
(2) Wilkes
(3) Booth
(4) killed
(5) Abraham
(6) Lincoln

An example of annotations for this text is:

<named-entities>
  <person begin=‘1’ length=‘3’/>
  <person begin=‘5’ length=‘2’/>
</named-entities>
<parse-trees>
  <sentence begin=‘1’ length=‘6’>
    <noun-phrase begin=‘1’ length=‘3’/>
    <verb-phrase begin=‘4’ length=‘3’>
      <noun-phrase begin=‘5’ length=‘2’/>
    </verb-phrase>
  </sentence>
</parse-trees>
<semantic-roles>
  <predicate begin=‘1’ length=‘6’>
    <event begin=‘4’ length=‘1’/>
    <agent begin=‘1’ length=‘3’/>
    <patient begin=‘5’ length=‘2’/>
  </predicate>
</semantic-roles>
<pos-labels>
  <nnp begin=‘1’ length=‘1’/>
  <nnp begin=‘2’ length=‘1’/>
  <nnp begin=‘3’ length=‘1’/>
  <vbd begin=‘4’ length=‘1’/>
  <nnp begin=‘5’ length=‘1’/>
  <nnp begin=‘6’ length=‘1’/>
</pos-labels>
<opv-triples>
  <opv-triple>
    <object begin=‘1’ length=‘3’/>
    <property begin=‘4’ length=‘1’/>
    <value begin=‘5’ length=‘2’/>
  </opv-triple>
</opv-triples>

The example above is somewhat more verbose than it would need to be in practice. This annotation structure allows for natural representation of the different hierarchies of the sentence. It also allows for overlapping structures and alternative structures. For example, a sentence with multiple parse trees could simply have two sentence entries in the parse-trees component.
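As a small illustration of how offset annotations cross-reference the original text, the following sketch (hypothetical data structures, not the proposed index format) stores each layer as (label, begin, length) tuples over token positions and resolves them back to the underlying tokens. Overlapping and alternative structures coexist without forcing a single hierarchy.

# Token positions are 1-based, matching the example above.
tokens = {1: "John", 2: "Wilkes", 3: "Booth", 4: "killed", 5: "Abraham", 6: "Lincoln"}

# Each annotation layer is just a list of (label, begin, length) offsets.
annotations = {
    "named-entities": [("person", 1, 3), ("person", 5, 2)],
    "semantic-roles": [("agent", 1, 3), ("event", 4, 1), ("patient", 5, 2)],
}

def span_text(begin, length):
    # Resolve an offset annotation back to the original tokens.
    return " ".join(tokens[i] for i in range(begin, begin + length))

def contains(outer, inner):
    # True if the inner span lies completely within the outer span.
    (_, ob, ol), (_, ib, il) = outer, inner
    return ob <= ib and ib + il <= ob + ol

# Find person entities contained in a patient role, as in the example query.
for role in annotations["semantic-roles"]:
    if role[0] != "patient":
        continue
    for entity in annotations["named-entities"]:
        if entity[0] == "person" and contains(role, entity):
            print("person within patient:", span_text(entity[1], entity[2]))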

6.2 A Reading Tutor System

The REAP project considers the task of building a reading tutor system for grade school students [7]. Two of its goals are to make observations of how people learn and to assist learning through the system. The learning will be facilitated by generating test questions automatically from the text presented to a user, such as vocabulary questions and reading comprehension questions. The reading tutor system queries the supporting retrieval system to find good documents to present to the user. These documents should be about a topic, as in standard ad-hoc retrieval, but should also have good qualities for improving the reading comprehension of the user and for the generation of test questions.

In order to support the reading tutor system, REAP queries reference part-of-speech annotations and a document level attribute for reading difficulty level. Neither of these annotations can be directly handled with Models SR and MR presented in Chapter 3.

Part-of-speech annotations are assigned on a per-word basis. One way to handle the annotations would be to assign a document component to each word, where the label of the component is the part-of-speech assigned by the annotator. Let us set aside offset annotation for now and consider the poem used in Chapter 3 for Model SR:

<poem id=‘poem1’>
  <title> <nnp>Little</nnp> <nnp>Jack</nnp> <nnp>Horner</nnp> </title>
  <body>
    <nnp>Little</nnp> <nnp>Jack</nnp> <nnp>Horner</nnp>
    <vbd>Sat</vbd> <in>in</in> <dt>the</dt> <nn>corner</nn> <punc>,</punc>
    <vbg>Eating</vbg> <in>of</in> <nnp>Christmas</nnp> <nn>pie</nn> <punc>;</punc>
    <prp>He</prp> <vbd>put</vbd> <in>in</in> <prp-s>his</prp-s> <nn>thumb</nn>
    <cc>And</cc> <vbd>pulled</vbd> <rp>out</rp>
    <dt>a</dt> <rb>plumb</rb> <punc>,</punc>
    <cc>And</cc> <vbd>cried</vbd> <punc>,</punc>
    <quote>
      <wp>What</wp> <dt>a</dt> <jj>good</jj> <nn>boy</nn>
      <vbp>am</vbp> <prp>I</prp> <punc>!</punc>
    </quote>
  </body>
</poem>

This approach could certainly work within the model, and may have conceptual advantages. For instance, smoothing at the term level could use a language model corresponding to the part-of-speech (POS) of the word. However, the model we have described does not allow for adjacency or sequence constraints in the query, which we would need to provide for the reading tutor. One cannot express “find components containing the adjective ‘good’ followed by a noun.” One could extend NEXI to request adjacency of components. For example:

//poem[sequence(., .//jj=‘good’ .//nn)]

where the sequence operator takes two arguments and filters out components where there does not exist a pair of components matching the arguments that are directly adjacent. The “.//jj=‘good’” matches ‘jj’ components that contain the word ‘good’.
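A rough sketch of one way such a sequence filter could be evaluated over per-word components follows (a hypothetical representation with made-up token positions; the operator is not yet part of the model): check whether a ‘jj’ component containing ‘good’ is immediately followed by an ‘nn’ component.

# Per-word components as (position, part-of-speech tag, word); positions are
# illustrative only.
components = [(25, "wp", "What"), (26, "dt", "a"), (27, "jj", "good"),
              (28, "nn", "boy"), (29, "vbp", "am"), (30, "prp", "I")]

def sequence_match(components, first, second):
    # True if some component matching `first` is directly followed by one
    # matching `second` (the filtering behavior of the sequence operator).
    by_pos = {pos: (tag, word) for pos, tag, word in components}
    for pos, tag, word in components:
        if first((tag, word)) and (pos + 1) in by_pos and second(by_pos[pos + 1]):
            return True
    return False

# "find components containing the adjective 'good' followed by a noun"
print(sequence_match(components,
                     first=lambda c: c[0] == "jj" and c[1] == "good",
                     second=lambda c: c[0] == "nn"))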

6.2.1 Adapting the Model

The previous discussion outlined several problems with the model presented in Chapter 3. This section outlines several small extensions to the model to address these problems.

The structural constraints could be enforced absolutely or probabilistically. For instance, the constraints could be smoothed with a background model. More interestingly, if the POS annotator outputs probabilities for different tags, then these probabilities could be used. Matching the query component “//jj=‘good’” can be done by taking:

P(‘good’ | θr) P(pos = ‘jj’ | ‘good’, r)    (6.1)

which would allow the use of the probabilities given by the annotator in the estimation of P(pos = ‘jj’ | ‘good’, r). Attributes for document components could be handled similarly.

Future extensions of the reading tutor may also have additional constraints in the queries. For example, the query may contain a user profile of known vocabulary items and a distribution over ratios of known/unknown vocabulary that is desirable in the results. This type of constraint may be expressed as a desired posterior over the rankings, as in Chapter 4. In this case, one could take g(r) = unknown(r), where the unknown function is the percentage of unknown vocabulary terms in the document. A distribution specified over the range [0, 1] would be used to specify the range of acceptable percentages of unknown terms. For example, guidance from psychological studies may indicate that a truncated and rescaled Gaussian distribution centered at 0.1 with a standard deviation of 0.05 facilitates learning. This distribution would specify P(unknown(R) = a | R is relevant). A non-parametric density estimate such as a kernel density estimator (see Chapter 20 of Wasserman [51]) could be used to estimate P(unknown(R) = a). These two components provide the values necessary to estimate values proportional to prior probabilities and rank by Equation 4.4.
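A minimal sketch of how these two pieces might be combined into a ranking factor appears below. The observed unknown-vocabulary ratios and the likelihood-ratio style combination are illustrative assumptions; the exact form would follow Equation 4.4 of the proposal, and the use of scipy is a convenience, not a requirement of the model.

import numpy as np
from scipy.stats import norm, gaussian_kde

def desired_density(a, center=0.1, sd=0.05):
    # Gaussian over the unknown-vocabulary ratio, truncated to [0, 1] and
    # rescaled; stands in for P(unknown(R) = a | R is relevant).
    mass = norm.cdf(1, center, sd) - norm.cdf(0, center, sd)
    return norm.pdf(a, center, sd) / mass if 0 <= a <= 1 else 0.0

# Unknown-vocabulary ratios observed over the corpus (made-up numbers), used to
# estimate P(unknown(R) = a) non-parametrically.
observed = np.array([0.02, 0.05, 0.08, 0.12, 0.2, 0.35, 0.5])
background = gaussian_kde(observed)

def prior_factor(ratio):
    # A value proportional to a prior favoring the desired ratio of unknown terms.
    return desired_density(ratio) / background(ratio)[0]

for r in (0.05, 0.10, 0.40):
    print(r, prior_factor(r))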

While Models SR and MR as specified in Chapter 3 were not adequate to handle the reading tutor application, simple extensions to the models fill the needs of the reading tutor.

6.3 Question Answering

There are two general approaches to QA: knowledge-based and extraction-based. Knowledge-based QA uses a database of typically domain-specific knowledge. Natural language questions are analyzed and converted into database queries that return the answers. More recent QA has been extraction-based. Questions are asked over an open-domain text corpus and answers are extracted from the text. This process roughly consists of three steps: question analysis, passage retrieval, and answer extraction. This section will focus on extraction-based QA approaches, but with the goal of moving some of the processing usually done by the extraction component into the retrieval component.


Of particular note is the work of Litkowski [28][29]. Litkowski has been developing an XML-based QA system. The markup employed is both syntactic and semantic in nature. Litkowski first parses passages to produce parse trees, which are then processed to annotate them with semantic role and relationship information. Question answering is performed by creating an appropriate XPath query to search the passages. The XPath queries are created using a rule-based mechanism, which can be manually intensive. In Litkowski’s evaluations, it is observed that while many answers are found for questions, they are not always easy to find in the result lists as the system does not rank items. This stresses the need for structured retrieval systems to rank results.

We argue that ranking is important, that query term weighting and proximity operations are needed, and that the structural hierarchies of document representations may overlap. Overlapping structure will require new mechanisms for indexing and retrieval.

Section 6.3.2 presents challenges and limitations for the use of structured retrieval in QA. It continues by describing some potential solutions to these problems, and Section 6.4 concludes the chapter and discusses future work.

6.3.1 Richer Queries for Richer Documents

In most current extraction-based QA there is little interaction between answer extraction components and retrieval components. With a richer index that stores the output of some system preprocessing, a greater degree of interaction between extraction and retrieval would be enabled.

In order to enable a greater interaction between answer extraction and retrieval, we may need to index syntactic and semantic structure contained within the documents. Typically, knowledge bases are assumed perfect, but the automatic document markup used in a QA system cannot be assumed perfect. Knowledge encoded in text may not be accurate, and the natural language processing methods used will not provide perfect results. This has two primary implications for this approach:

• It is appropriate to provide rankings of results, rather than a set. The order of the ranking should reflect the confidence of the passage matching the query. Effective retrieval should rank the most exactly matching results highest, but gracefully back off to approximate matches.

• Additional processing or reasoning over text or using external knowledge bases will still be appropriate. A system may use an external resource containing knowledge about the world or language to verify answers. It may also be too computationally expensive to do all desired forms of predictive annotation of the text prior to indexing and retrieval.

The main idea is that by increasing interaction between the extraction and retrieval components, more specific result lists can be passed to the extraction component. Returning passages that are more likely to satisfy the answer extraction component can enable some benefits:

• The extraction component then has fewer results to process, therefore decreasing processing time during the extraction phase.

• The extraction component can process more results, possibly resulting in higher recall.

In order to gain these benefits, the retrieval system must index documents with structure. Some of the options for which reasonable adaptations to the model have already been discussed are document structure markup, named entities, part-of-speech labels, and semantic role labels.

One unaddressed challenge is overlapping annotations within an annotation type. This could result from annotating parse trees, as there may be multiple correct parses of an ambiguous sentence. This could also result from taggers that output alternative annotations of a text along with confidence estimates.
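For instance, alternative parses could be stored as parallel entries within the same annotation layer, each carrying the confidence reported by the parser (a hypothetical encoding; how such confidences would enter the ranking model remains to be worked out in the thesis):

# Two alternative parses of the same sentence stored side by side in the
# parse-trees layer; "confidence" is the probability reported by the parser.
parse_trees = [
    {"confidence": 0.7,
     "spans": [("sentence", 1, 6), ("noun-phrase", 1, 3), ("verb-phrase", 4, 3)]},
    {"confidence": 0.3,
     "spans": [("sentence", 1, 6), ("noun-phrase", 1, 2), ("verb-phrase", 3, 4)]},
]

def label_probability(label, begin, length):
    # Probability mass of the parses that contain this exact labeled span.
    return sum(parse["confidence"] for parse in parse_trees
               if (label, begin, length) in parse["spans"])

print(label_probability("noun-phrase", 1, 3))  # 0.7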


Example Query

In this section we examine the different types of queries that could be asked of an XML search system to find text that contains the answer to the question ‘Who killed Abraham Lincoln?’. Why use XML? We have already presented how the query-likelihood model adapts naturally to XML. These examples will motivate and highlight issues with the current model. The example queries will also be based loosely on the NEXI query language. The discussion starts with simple queries typical for retrieval components of QA systems and builds up to more complex queries that could be issued to more structured retrieval systems. Later, the section will examine how these queries may be represented in an XML retrieval system and how corresponding text in documents that match the queries could be represented. Note that we never assume the system must match the structural requirements in the query exactly, but should instead rank highly those that match the query best. That is, we wish to provide a best match approach rather than an exact match approach.

Most basic information retrieval systems for QA would search for a simple query such as:

Q1:kill Abraham Lincoln

In the above query, ‘killed’ has had its suffix removed, a common practice in Information Retrieval. Question answering systems may or may not remove suffixes; this distinction is not important here. A simple marked-up document may have sentence boundaries:

Q2://sentence[about(., kill Abraham Lincoln)]

With the annotation of named entities in the corpus, a search may look like:

Q3://sentence[about(., kill) AND

.//person=‘Abraham Lincoln’ AND

.//person]

This search would look for the word kill in sentences that have a component of the person type that matched ‘Abraham Lincoln’ and contained another person component. The second person component in the query is there to indicate that another person should be present in the results, but we do not know who the person is. It is likely that the extraction component would extract that matching component as the answer. There is no natural way to represent that the two person components should not be the same in NEXI, but there is in XPath. It is somewhat complicated, so we will not use the notation. For the sake of the example, we will assume that the ‘.//person’ does not match the same person as in ‘.//person=‘Abraham Lincoln’’.

Adding part of speech information allows a simple constraint on the verb:

Q4://sentence[about(.//vbd, kill) AND

.//person=‘Abraham Lincoln’ AND

.//person]

This is very similar to the examples presented in the previous section describing the reading tutor system. Using a semantic tagger on the corpus would allow even richer queries:

Q5://sentence[about(.//event//vbd, kill) AND

.//patient//person=‘Abraham Lincoln’ AND

.//agent//person]


The usefulness of semantic role labeling should be very apparent with this query. Enforcing that Abraham Lincoln be the person that is killed in the sentence and ensuring that there also be a person that did the killing would give us high confidence that the sentence contains a good candidate answer. Even without the named entity tagging and part-of-speech tagging, semantic role labeling would be powerful. Consider:

Q6://sentence[about(.//event, kill) AND
              about(.//patient, Abraham Lincoln) AND
              .//agent]

Document components satisfying this query would still provide good candidate answers.

Despite all of the nice aspects of queries Q5 and Q6, they still require that the verb be ‘kill’ and that the named entity match ‘Abraham Lincoln’. It would be nice to represent a set of words or a concept within a query. For example, the kill concept could be represented as a multinomial probability distribution:

Q7://sentence[.//event//vbd=model(0.4 kill, 0.3 assassinate,
                                  0.2 murder, 0.1 shoot) AND
              .//patient//person=‘Abraham Lincoln’ AND
              .//agent//person]

where the model operator is used to represent a weighted set of terms rather than a simple word list. It is not uncommon for information retrieval systems to provide some form of query term weighting. Query weighting is very useful for automatic term expansion. The probability distribution for ‘kill’ in query Q7 could be constructed from outside resources such as WordNet or a thesaurus. We may additionally wish to expand on variants of ‘Abraham Lincoln’ using some outside resource.

Q8://sentence[.//event//vbd=model(0.4 kill, 0.3 assassinate,
                                  0.2 murder, 0.1 shoot) AND
              .//patient//person=model(0.4 ‘Abraham Lincoln’,
                                       0.4 ‘President Lincoln’,
                                       0.1 ‘honest Abe’, 0.1 ‘Lincoln’) AND
              .//agent[.//person]]

Representation of queries as in Q8 has several strengths. Flexibility on term matching is allowed through the use of weighted query components. The term weighting allows the use of outside resources to incorporate world or domain knowledge to improve recall. The semantic role labels ensure that the sentence will have a good candidate answer, which the answer extraction component may verify using external resources. The use of part-of-speech and named-entities provides redundancy over the semantic role labels and may allow the system to still find a good candidate answer where the semantic role labeler failed to process the sentence properly.
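One plausible way the model operator could be scored under the query-likelihood view is to treat the weighted term set as a small query-side distribution and take its weighted log generative probability under a component’s language model. This is a hedged sketch with made-up probabilities, not the final formulation of the thesis:

import math

# Weighted concept from query Q7, playing the role of the model(...) operator.
kill_concept = {"kill": 0.4, "assassinate": 0.3, "murder": 0.2, "shoot": 0.1}

def weighted_log_likelihood(concept, component_model):
    # score = sum_w q(w) * log P(w | component), the weighted-term analog of
    # query likelihood; terms with no probability mass contribute -inf.
    score = 0.0
    for word, weight in concept.items():
        p = component_model.get(word, 0.0)
        score += weight * (math.log(p) if p > 0 else float("-inf"))
    return score

# Hypothetical smoothed event-component language models for two sentences.
sentence_a = {"kill": 0.05, "assassinate": 0.002, "murder": 0.001, "shoot": 0.001}
sentence_b = {"kill": 0.001, "assassinate": 0.06, "murder": 0.002, "shoot": 0.001}

for name, model in (("A", sentence_a), ("B", sentence_b)):
    print(name, weighted_log_likelihood(kill_concept, model))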

It is not necessary to do all of the preprocessing of corpora needed to handle query Q8 to gain benefits from structure. For example, with just document structure and named entities, one could formulate the query:

Q9://sentence[about(., model(0.4 kill, 0.3 assassinate, 0.2 murder, 0.1 shoot)) AND

.//person=model(0.4 ‘Abraham Lincoln’, 0.4 ‘President Lincoln’,0.1 ‘honest Abe’, 0.1 ‘Lincoln’) AND

.//person]


This query has many of the benefits of Q8, but additional text processing would still be required to ensure that the other person in the sentence did the action of killing Abraham Lincoln.

The addition of extracted knowledge in the index may take several forms. Consider the case where (object, property, value) triples were extracted as in [21][26]. The role of the retrieval engine would then be to match the ‘assassinated’ property and the ‘Abraham Lincoln’ value and return the object:

Q10://opv-triple[.//object AND

.//property=model(0.4 kill, 0.3 assassinate,
0.2 murder, 0.1 shoot) AND

.//value=model(0.4 ‘Abraham Lincoln’,0.4 ‘President Lincoln’,0.1 ‘honest Abe’,0.1 ‘Lincoln’)]

Many QA systems use template-based methods. One template that may be generated for the query would be:

____ killed Abraham Lincoln.

where the blank indicates the answer to be extracted. To effectively support these queries, the queries issued to the retrieval component would need to support sequences of terms. Some queries for the above template might look like:

Q11://sentence[contains(.,‘kill Abraham Lincoln’)]

Q12://sentence[sequence(., .//person ‘kill’ .//person=‘Abraham Lincoln’)]

Query Q11 represents a query for a system that has only sentence boundaries marked, while Q12 represents a query for a system with both sentence boundaries and named entities. For QA systems that use templates for candidate answer extraction, the inclusion of sequences in query representations would be quite useful. Queries could be created and issued for specific templates, and the highly ranked components in the returned component lists should match the templates well.

6.3.2 Adapting the Model

A detailed description of the challenges for the use of structured retrieval in question answering may be found in [38]. Some of these challenges are pertinent to retrieval systems:

1. There are multiple overlapping hierarchies of structure for text that we would like to index.

2. It is important to be able to express relationships between components in these hierarchies when querying.

3. The relationships between hierarchies may not be exact due to errors in language processing, so approximate matching on structure and terms is important and should be reflected in the result rankings.


The challenges presented above require a somewhat different view of how to represent structure in documents. The single hierarchy or multiple representation solutions provided by Models SR and MR are not appropriate for all tasks. Additionally, it is important to recognize the role of proximity, order, weighted query terms, and approximate matching of structure.

The importance of multiple annotations cannot be stressed enough. In order for a structured retrieval system to be flexible for a number of tasks, it must be capable of supporting arbitrary structures. In limited natural language tasks, it may be possible to impose a single strict hierarchy on the structure and embed it into the text. Even in these simple cases, this modifies the original text, which may be undesirable. We saw in the examples above some complications of this in a very simple sentence. Language is ambiguous and multiple correct parses are not uncommon. Even if a sentence is not ambiguous, natural language processing tools are not perfect. It is not desirable to lose the structures represented in an alternative parse, as the alternative may be the correct parse. In other cases, one may wish to work with the output of multiple parsers and index these structures.

Offset annotation is a convenient way to represent these structures. It can cleanly accommodate multiple parses of a sentence. It can also cleanly represent and separate different kinds of structure, such as extracted knowledge, syntactic structures, semantic role labels, and named entities.

However, representing structure using offset annotation would have implications for the query language. Existing query languages such as XPath and NEXI for searching XML would not satisfy the needs for querying this document and its structure in a simple manner. In particular, a useful query language would still look similar to those in the example queries given above. The query language should have the power to suggest relationships between document components across types. That is, it should be able to simply express that a person named-entity should be contained within an agent. It should also be easy to refer to the text of the original document in a simple manner. We don’t want to have to look up the begin and length of a component, then cross-reference that with the text of the document. This should be a simple and intuitive mechanism, as it will be performed often.

While query languages that resemble XPath and NEXI do not allow the desired simple expression discussed above, query languages that resemble these languages are not out of the question. The semantics of some of the underlying notations may be changed to refer to their natural analogs in this setting. Hierarchical relationships expressed between components in the query should be treated as suggestions. Explicit nestings found in the structural annotations should be preferred, and other hierarchical relationships should be given a score based on how much of the ‘child’ is contained by the ‘parent’.

The equivalent of ‘contains’ and ‘about’ operators used in XPath and NEXI should automatically look up the contents of the original text. Additionally, a ‘weight’ operator could be included as a term weighting strategy. Loose interpretation of the structural requirements and the ‘AND’ clauses would allow for reasonable rankings of document components in such a structured retrieval system supporting QA.

Returning to the first challenge, the generative probability distributions described in this proposal could be easily adapted to this context. Matching the ‘weight’ and ‘about’ clauses could be done by estimating the probability that the model generated the description. Loose structural requirements for the containment relationships could be ranked by estimating probabilities using observations of the document structure. The exact details of the model have not yet been worked out and will be a theoretical contribution of the thesis.
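To make the idea concrete, a rough sketch follows (the scoring choices are hypothetical, since the exact model is left for the thesis): an ‘about’ clause is scored by the component model’s generative log probability of the clause terms, and a suggested containment is scored by the fraction of the ‘child’ span covered by the ‘parent’ span, so that exact nestings receive the highest score.

import math

def about_score(terms, component_model):
    # Log probability that the component's language model generates the clause terms.
    return sum(math.log(component_model.get(t, 1e-6)) for t in terms)

def containment_score(parent, child):
    # Fraction of the child span covered by the parent span; 1.0 for exact nesting.
    (pb, pl), (cb, cl) = parent, child
    overlap = max(0, min(pb + pl, cb + cl) - max(pb, cb))
    return overlap / cl

# Hypothetical spans: a person entity mostly, but not exactly, inside an agent role.
agent_span, person_span = (1, 3), (2, 3)
print(containment_score(agent_span, person_span))  # 2/3: partial containment

# Hypothetical smoothed language model for an event component.
event_model = {"kill": 0.05, "lincoln": 0.002}
print(about_score(["kill"], event_model))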

6.4 Summary

This chapter examined the use of structured retrieval for assisting reading tutor systems and extraction-based QA systems. The need for retrieval systems that can model annotations and provide query capability over them is growing, as demonstrated by the discussions here.

The discussion of the reading tutor system highlighted the need for term level attributes in queries. In addition, it is desirable to express relationships between terms and structure beyond containment, such as adjacency of terms with structural components. As a part of the dissertation work, the model will be adapted and implemented for these needs as a part of the REAP project. Further extensions to the REAP system, such as leveraging reading difficulty attributes assigned to documents, will also be incorporated into the model.

The discussion in this chapter argued that structured retrieval could allow for a tighter integration of the answer extraction and passage retrieval components of typical QA systems. This chapter also looked forward beyond what is currently done in most QA systems and presented a vision of QA that is closer to treating an annotated open-domain corpus as a knowledge base.

The process of examining structured retrieval databases and QA together presented some challenges for structured retrieval. The structures of real documents may be overlapping and messy, so mechanisms for indexing and retrieval should be adapted to work with these idiosyncrasies. These difficulties will be addressed by theoretically adapting the model presented in this proposal to handle them as a part of the dissertation.

This chapter also proposed alternative representations for the documents and queries in order to neatly support and represent them for QA related tasks. We believe that the use of generative probability distributions is one effective way to rank document components for structured queries, and we will work out the details of the model for the thesis.


Chapter 7

Summary of Proposal

It is well known that structure is important for many exact match retrieval scenarios, such as Boolean search. Research in Web retrieval has shown the usefulness of structure for best match search. Title fields, link anchor text, and URL lengths have all proved useful structural features for document retrieval.

The availability of the INEX collection of XML documents has also provided the opportunity for the examination of relevance in document components. This is the first collection to have large scale assessments on document components, which has enabled experiments in the task of component retrieval where the component type is not specified by the query. In addition, relevance judgments on queries with structural constraints have also provided the opportunity to investigate structured retrieval more carefully.

In addition to the directions of XML and Web retrieval, there is a growing trend for retrieval systems to be used by other information processing systems. These systems are capable of creating rich structured queries that work on annotations of documents. These automatically created annotations may have tangled structure within the documents and may have associated uncertainties or confidence values. It is important to model these characteristics well to provide good rankings for these information systems. It is now routine for these systems to use multiple representations and annotation schemes.

More so now than ever before, the time is ripe to improve our understanding of document structure and its relationship with relevance. Through the experiments and theoretical development presented in this proposal, we have already made some progress toward this goal.

Chapter 3 presented a generative model for documents with multiple hierarchical structural representations. Experiments using this model have demonstrated its practicality for XML component retrieval and the evaluation of structured queries. Further experiments on Web retrieval have successfully combined multiple document representations with structure and the use of trend information.

The proposal identified open questions with the generative model. Selection of the representations and the methods for estimation and combination of the representations remain incompletely specified. These questions are central to the combination of structural evidence. Chapter 5 provided perspective on the model by comparing it to additional language processing tasks using language models. It further discussed possibilities for the investigation of these issues, proposing hypotheses based on the notions of power and compatibility.

The proposal then outlined challenges with the modeling of annotations (Chapter 6). The model as previously presented in Chapter 3 allowed for multiple hierarchical representations. However, annotations of a document may provide multiple hierarchical structures for a single representation. An appropriate query language for asking questions across the different hierarchical structures is also necessary for work with annotations.

We will continue experiments with smoothing and estimation methods for the model. Some basic hypotheses and methods for estimation and data analysis were outlined in Chapter 5. We will examine parameter learning through the use of machine learning or statistical techniques such as expectation maximization. During this process, we will also investigate the use of simplifying assumptions with respect to component type in order to reduce the parameter space to be learned. Ideally, these investigations will be driven by observations from data analysis gained while investigating the hypotheses about power and compatibility. We will strive to characterize smoothing, estimation methods, and assumptions by their behavior described in terms of power and compatibility.

We will also explore the use of desired posterior distributions in more detail. Some explorations of the problem will include analysis of characteristics of the texts belonging to specific classes in the known-item retrieval problem discussed in Chapter 3. If such biases can be quantified, then we will look into smoothing methods or other approaches to score transformations that help correct the biases. As a fall-back, some guidance in this work may be found in Singhal et al. [45], where ad-hoc corrections for length bias in the cosine similarity model were presented. We will also explore the approach of learning priors that will return the desired posterior rankings in the results.

We will further work with structured queries, providing better modeling of structural and Boolean constraints. We will adapt the model for vague interpretation of structural constraints, and evaluate the results on the INEX vague content-and-structure queries.

We will extend the theoretical work describing the adaptation of the model to better handle annotations. As motivating examples we will explore question answering passage retrieval and support for a reading tutor system. We will adapt the model to handle multiple hierarchical structures over a single representation in order to provide support for these and other applications.

7.1 Contributions

This thesis will have several major contributions to the field of Information Retrieval. The thesis will provide a structured document retrieval model that is adaptable to numerous retrieval tasks. In particular, this research will open the door to retrieval systems that serve results to other applications that can generate rich structured queries. Discussed in the thesis will be applications to passage retrieval for question answering systems and a reading tutor system. This work will also be extendable to many other systems, for example information extraction, text mining systems, and retrieval systems supporting text with biological information.

The research in the thesis will enable a better understanding of the role of smoothing in generative models. This will guide other researchers in their choice of smoothing when working on their own retrieval problems. Additionally, the work in this thesis will also provide researchers with more principled methods for using information about characteristics of relevant documents. The research will enable the routine use of multiple forms of evidence.

The work in this thesis will also provide a method for analysis of combination of evidence in structured retrieval. A principled approach to the analysis of combination of evidence within a retrieval model will enable other researchers to better understand their own models. This will help enable researchers to identify failures in their models and correct them.

7.2 Time Table

• June 05 - Theoretical incorporation of support for multiple annotations of a representation into the model.

• June 05 - August 05 - Develop training procedure for the component type weighted length approach and evaluate on test collections.

• September 05 - November 05 - Extend support for structured queries and evaluate.

• December 05 - April 06 - Investigate measures of power and the combination of evidence hypotheses. Use insights gained to identify or develop good methods of evidence combination for document components in structured retrieval.


• May 06 - September 06 - Investigate biases, priors, and posteriors. First perform data analysis to identify sources of bias, then develop an approach to correct the bias.

• October 06 - January 07 - Write thesis.

• January 07 - Defense.


Bibliography

[1] J. Allen. Natural Language Understanding. Benjamin/Cummings Publishing, 2nd edition, 1995.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1012, January 2003.

[3] J. A. Aslam and M. Montague. Models for metasearch. In Proc. of the 24th annual int. ACM SIGIRconf. on Research and development in information retrieval, pages 276–284. ACM Press, 2001.

[4] N. J. Belkin, P. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple queryrepresentations for information retrieval. Inf. Process. and Management, 31(3):431–448, 1995.

[5] F. J. Burkowski. Retrieval activities in a database consisting of heterogeneous collections of structuredtext. In Proceedings of the Fifteenth Annual International SIGIR Conference on Research andDevelopment in Information Retrieval, pages 112–125, 1992.

[6] A. Chowdhury, O. Frieder, D. Grossman, and C. McCabe. Analyses of multiple-evidence combinationsfor retrieval strategies. In Proc. of the 24th annual int. ACM SIGIR conf. on Research anddevelopment in information retrieval, pages 394–395. ACM Press, 2001.

[7] K. Collins-Thompson and J. Callan. Information retrieval for language tutoring: an overview of thereap project. In Proceedings of the Twenty-Seventh Annual International SIGIR Conference onResearch and Development in Information Retrieval, pages 544–545. ACM, 2004.

[8] N. Craswell and D. Hawking. Overview of the trec-2002 web track. In The Eleventh Text REtrieval Conference, pages 86–95. NIST Special Publication 500-251, 2002.

[9] W. B. Croft. Combining approaches to information retrieval. In Advances in Information Retrieval,pages 1–36. Kluwer, 2000.

[10] E. G. Fayen and S. B. Baird. On-line personal bibliographic retrieval system. In Proceedings of theSecond Annual International SIGIR Conference on Research and Development in InformationRetrieval, pages 83–86, 1979.

[11] E. A. Fox. Composite document extended retrieval. In Proceedings of the Eighth Annual InternationalSIGIR Conference on Research and Development in Information Retrieval, pages 42–53, 1985.

[12] E. A. Fox and J. A. Shaw. Combination of multiple searches. In Proceedings of the Second TextREtrieval Conference (TREC), pages 243–249, 1994.

[13] N. Fuhr, N. Goevert, G. Kazai, and M. Lalmas, editors. Proceedings of the First Workshop of theINitiative for the Evaluation of XML Retrieval (INEX). ERCIM, 2003.

[14] N. Fuhr, S. Malik, and M. Lalmas, editors. Proc. of the Second Annual Workshop of the Initiative for the Evaluation of XML retrieval (INEX), Dagstuhl, Germany, Dec. 2003.


[15] M. Fuller, E. Mackie, R. Sacks-Davis, and R. Wilkinson. Structured answers for a large structured document collection. In Proceedings of the Sixteenth Annual International SIGIR Conference on Research and Development in Information Retrieval, pages 204–213, 1993.

[16] W. R. Greiff. The use of exploratory data analysis in information retrieval research. In Advances inInformation Retrieval, pages 37–72. Kluwer, 2000.

[17] S. Harter. A probabilistic approach to automatic keyword indexing. Journal of the American Societyfor Information Science, 26:197–206, 280–289, 1975.

[18] D. Hawking and N. Craswell. Overview of the trec-2001 web track. In The Tenth Text REtrievalConference, pages 61–67. NIST Special Publication 500-250, 2001.

[19] D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. In Proceedings ofthe Second European Conference on Research and Advanced Technologies for Digital Libraries, pages569–584, 1998.

[20] D. Hiemstra. Using language models for information retrieval. PhD thesis, University of Twente, 2001.

[21] B. Katz. From sentence processing to information access on the World Wide Web. In AAAI SpringSymposium on Natural Language Processing for the World Wide Web, 1997.

[22] W. Kraaij, T. Westerveld, and D. Hiemstra. The importance of prior probabilities for entry pagesearch. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research andDevelopment in Information Retrieval, pages 27–34. ACM, 2002.

[23] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization forinformation retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval, pages 111–119. ACM Press, 2001.

[24] V. Lavrenko. A Generative Theory of Relevance. PhD thesis, University of Massachusetts at Amherst,2004.

[25] J. H. Lee. Analyses of multiple evidence combination. In Proc. of the 20th annual int. ACM SIGIRconf. on Research and development in information retrieval, pages 267–276. ACM Press, 1997.

[26] J. Lin and B. Katz. Question answering from the web using knowledge annotation and knowledgemining techniques. In The Twelfth International Conference on Information and KnowledgeManagement, pages 116–123, 2003.

[27] J. List, V. Mihajlovic, G. Ramirez, and D. Hiemstra. The tijah xml-ir system at inex 2003. In INEX2003 Workshop Proceedings, pages 102–109, 2003.

[28] K. C. Litkowski. Question answering using xml-tagged documents. In The Eleventh Text REtrievalConf. (TREC-11), NIST SP 500-251, pages 156–165, 2003.

[29] K. C. Litkowski. Use of metadata for question answering and novelty tasks. In The Twelfth TextREtrieval Conf., 2003.

[30] R. Manmatha, T. Rath, and F. Feng. Modeling score distributions for combining the outputs of searchengines. In Proc. of the 24th annual int. ACM SIGIR conf. on Research and development ininformation retrieval, pages 267–275. ACM Press, 2001.


[31] C. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. The MIT Press,1999.

[32] M. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: thepenn treebank. Computational Linguistics, 19, 1993.

[33] A. McCallum and K. Nigam. Text classification by bootstrapping with keywords, em and shrinkage.In Proceedings of the ACL 99 Workshop for Unsupervised Learning in Natural Language Processing,pages 52–58, 1999.

[34] T. Mitchell. Machine Learning. WCB/McGraw-Hill, 1997.

[35] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In Proc. of the 11thInternational Conference on Information Knowledge and Mangement (CIKM), pages 538–548. ACMPress, 2002.

[36] G. Navarro and R. Baeza-Yates. A language for queries on structure and contents of textualdatabases. In Proceedings of the Eighteenth Annual International SIGIR Conference on Research andDevelopment in Information Retrieval, pages 93–101, 1995.

[37] M. Nikkuni and H. Tanaka. Document contents representation model of sentence retrieval systemscat-ir. In Proceedings of the Fourth Annual International SIGIR Conference on Research andDevelopment in Information Retrieval, pages 106–112, 1981.

[38] P. Ogilvie. Retrieval using structure for question answering. In Proceedings of the First Twente Data Management Workshop, pages 15–23, 2004.

[39] P. Ogilvie and J. P. Callan. Experiments using the lemur toolkit. In The Tenth Text REtrieval Conf.(TREC-10), NIST SP 500-250, pages 103–108, 2002.

[40] J. Ponte and W. Croft. A language modeling approach for information retrieval. In Proceedings of the21st Annual International ACM SIGIR Conference on Research and Development in InformationRetrieval, pages 275–281. ACM Press, 1998.

[41] S. Robertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27:129–146, 1976.

[42] G. Salton, E. Fox, and H. Wu. Extended boolean information retrieval. Communications of the ACM,26:1002–1036, November 1983.

[43] K. Seymore and R. Rosenfeld. Scalable backoff language models. In Proceedings of ICSLP ‘96, 1996.

[44] B. Sigurbjornsson, J. Kamps, and M. de Rijke. Processing content-and-structure queries for xml retrieval. In Proceedings of the First Twente Data Management Workshop, pages 31–38, 2004.

[45] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of theNineteenth Annual International SIGIR Conference on Research and Development in InformationRetrieval, pages 21–29, 1996.

[46] F. Song and W. Croft. A general language model for information retrieval. In Proceedings of theEighth International Conference on Information and Knowledge Management, 1999.


[47] S. Tellex, B. Katz, J. Lin, A. Fernandes, and G. Marton. Quantitative evaluation of passage retrieval algorithms for question answering. In Proceedings of the Twenty-Sixth Annual International SIGIR Conference on Research and Development in Information Retrieval, pages 41–47. ACM, 2003.

[48] A. Trotman and B. Sigurbjornsson. Narrow Extended XPath I. Technical report, 2004. Available athttp://inex.is.informatik.uni-duisburg.de:2004/.

[49] K. Tumer and J. Ghosh. Linear and order statistics combiners for pattern classification. In A. Sharkey, editor, Combining Artificial Neural Networks, pages 127–162. Springer-Verlag, 1999.

[50] H. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of theThirteenth Annual International SIGIR Conference on Research and Development in InformationRetrieval, pages 1–24, 1990.

[51] L. Wasserman. All of Statistics. Springer, 2004.

[52] T. Westerveld, W. Kraaij, and D. Hiemstra. Retrieving web pages using content, links, URLs, andanchors. In The Tenth Text REtrieval Conf. (TREC-10), NIST SP 500-250, pages 663–672, 2002.

[53] R. Wilkinson. Effective retrieval of structured documents. In Proceedings of the Seventeenth Annual International SIGIR Conference on Research and Development in Information Retrieval, pages 311–317, 1994.

[54] J. Xu and J. Callan. Cluster based language models for distributed retrieval. In Proceedings of the22nd Annual International ACM SIGIR Conference on Research and Development in InformationRetrieval, pages 254–261. ACM Press, 1999.

[55] C. Zhai. Risk Minimization and Language Modeling in Text Retrieval. PhD thesis, Carnegie Mellon University, 2002.

[56] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hocinformation retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference onResearch and Development in Information Retrieval, pages 334–342. ACM Press, 2001.

[57] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to informationretrieval. ACM Transactions on Information Systems, 2(2), April 2004.

[58] M. Zhang, R. Song, C. Lin, L. Ma, Z. Jiang, Y. Jin, Y. Liu, L. Zhao, and S. Ma. Thu at trec 2002:Web track experiments. In The Eleventh Text REtrieval Conf. (TREC-11), NIST SP 500-251, pages591–594, 2003.