Information Retrieval: A Primer
Ellen Voorhees

IR Primer (Parts based on an outline by James Allan, UMass)
• Basic IR processing
  – bag of words
  – alternate approaches
• Web searching
  – differences from non-web searching
  – “advanced” features
• Available systems
  – SMART, MG
  – Lemur, Lucene

IR Problem Definition
• Find “documents” that are relevant to a user’s information need
  – unstructured, natural language text
  – lack of structure precludes use of traditional database technologies
  – a document is the unit of text of interest, traditionally at least a paragraph in length

IR Issues
• How to represent text
  – indexing
• How to represent the information need
  – free text vs. formal query language
• How to compare representations
  – retrieval models
• How to evaluate the quality of the search
  – recall/precision variants

Original Approach
• Manually assign descriptors
  – expensive
  – human agreement on descriptors is poor
  – controlled vocabulary increases annotator consistency but requires searchers to use the same vocabulary
• Still used
  – MEDLINE
  – motivation for many semantic web proposals

Current Standard Approach
• Statistical: compute degree of match between query and document
  – weighted “bag of words”
  – rank documents by how well they match
• Assume the top documents are good & modify the query based on their words
• Rerun the search with the modified query

Single biggest factor in IR Effectiveness
• The query!
  – many studies have shown the topic effect is bigger than the system effect or interaction effects
  – the TREC query track demonstrated the effect of different queries for the same topic
• Creating good queries
  – select the correct document set to ask
  – use discriminative, high-content query words
  – include informative alternate expressions
  – prefer specific examples over general concepts

How to Represent Text
• Tokenization
  – might include identifying phrases
  – might include identifying other lexical structures such as names, amounts, etc.
  – increased importance for (factoid) QA
• Remove “stop words”
• Perform stemming
• Weight terms
  – might include mining other data sources
Original Text
Czechs Play Indoor Soccer for More Than Four Days Straight for Record
Twenty Czechs broke the world record for indoor soccer last month, playing the game continuously for 107 hours and 15 minutes, the official Czechoslovak news agency CTK reported.
Two teams began the endeavor in the West Bohemian town of Holysov on Dec. 13 and ended with the new world record on Dec. 17, CTK said in the dispatch Monday.
According to the news agency, the previous record of 106 hours 10 minutes was held by English players. The Czechs new record is to be recorded in the Guinness Book of World Records, CTK said.
Bag of Tokens
agency(2); began; bohemian; book; broke; continuously; ctk(3); czechoslovak; czechs(3); days; dec(2); dispatch; ended; endeavor; english; game; guinness; held; holysov; hour(2); indoor(2); minutes(2); monday; month; news(2); official; play(3); previous; record(7); reported; soccer(2); straight; teams; town; twenty; west; world(3)
official czechoslovak; previous record; world record(3); days straight; news agency(2); ctk reported; guinness book; ctk agency; teams began
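
The bag above can be reproduced in miniature. A sketch, assuming a toy stop list and no stemming (the real pipeline also stems and mines phrases):

```python
# Minimal bag-of-words construction: tokenize, lowercase, drop stop
# words, and count. The stop list is an illustrative stand-in, not the
# one used to produce the output on this slide.
import re
from collections import Counter

STOP_WORDS = {"the", "for", "and", "of", "in", "on", "a", "an", "to",
              "by", "with", "was", "is", "said", "than", "more"}

def bag_of_words(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag = bag_of_words("Twenty Czechs broke the world record for indoor "
                   "soccer, playing the game continuously for 107 hours "
                   "and 15 minutes.")
# bag counts content words ("czechs", "record", ...); "the", "for",
# "and" are gone
```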
Final Document Representation
agent 2.80; begin 1.55; bohem 6.01; book 2.63; brok 2.60; continu 1.55; ctk 13.35; czechoslovak 4.36; czech 11.65; day 1.38; dec 4.34; dispatch 4.12; end 1.36; endeavor 5.03; engl 3.51; game 3.14; guin 5.40; held 2.05; holysov 10.75; hour 3.44; indoor 8.13; minut 4.38; monday 2.12; month 1.34; new 2.62; offic .98; play 4.82; prev 1.89; record 5.80; report 1.04; socc 9.16; straight 3.61; team 2.86; town 2.86; twent 3.91; west 2.14; world 3.85
czechoslovak offic 8.64; prev record 6.09; record world 13.69; day straight 6.41; agent new 6.69; ctk report 8.51; book guin 7.15; agent ctk 7.67; begin team 8.20

How to Represent & Compare Information Need
• Formal query language
  – query is a pattern to be matched by the document
  – Boolean systems are the best known
  – others
    • density measures such as Waterloo’s MultiText
    • Inquery’s structured query operators

Information Need
• Free text: query is a (short) document
• Vector-space derivatives:
  – compute a similarity function between document and query vectors
    • length normalization of vectors is key
    • cosine traditional, but highly biased toward short docs; pivoted length normalization now standard
• Language models
  – select the document most likely to have generated the query
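
The vector-space comparison can be sketched as plain cosine similarity over sparse term-weight vectors (pivoted normalization and language models omitted):

```python
# Cosine similarity between sparse term->weight vectors (dicts).
# The weights are hypothetical, in the style of the earlier
# "Final Document Representation" slide.
import math

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

doc = {"record": 5.80, "soccer": 9.16, "indoor": 8.13}
query = {"indoor": 1.0, "soccer": 1.0}
score = cosine(query, doc)   # in [0, 1]; higher means a better match
```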
Beneficial techniques for automatic IR
• Term weighting
• Query expansion
• Phrasing
• Passages

Term weighting
• Largest single system factor that affects retrieval effectiveness
• Current best weights are a combination of three factors:
  – term frequency: how often the term occurs in a text
  – collection frequency: how many documents the term occurs in
  – length normalization: compensating factor for widely varying document lengths
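
A sketch of the three-factor combination, using one common variant (log tf × idf, with cosine length normalization); the exact formula behind the weights shown earlier is not specified here:

```python
# Combine term frequency, collection frequency (as idf), and length
# normalization. This ltc-style formula is one common instantiation,
# not necessarily the one used to produce the slide's weights.
import math

def tf_idf_vector(term_counts, doc_freq, n_docs):
    # tf component dampened by log; idf from document frequency
    raw = {t: (1 + math.log(tf)) * math.log(n_docs / doc_freq[t])
           for t, tf in term_counts.items() if t in doc_freq}
    # cosine length normalization: scale to a unit vector
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw

# hypothetical counts: "record" occurs 7 times in the doc and in 50 of
# 1000 collection documents; "soccer" is rarer, so it weighs more
weights = tf_idf_vector({"record": 7, "soccer": 2},
                        {"record": 50, "soccer": 5}, 1000)
```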

Query expansion
• A good query is essential, yet users tend to provide only a few keywords
• Variety of query expansion techniques, both interactive and automatic
• Most often used automatic technique is blind feedback:
  – assume the top-ranked documents are relevant and apply feedback to produce a new query
  – run the new query and present its results to the user
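
Blind feedback can be sketched as follows; real systems (e.g., Rocchio-style feedback) reweight terms rather than simply appending them, and the cutoffs k and m here are arbitrary:

```python
# Blind (pseudo-relevance) feedback: assume the top-k ranked documents
# are relevant and add their most frequent non-query terms to the query.
from collections import Counter

def blind_feedback(query_terms, ranked_docs, k=3, m=5):
    pool = Counter()
    for doc in ranked_docs[:k]:          # each doc is a list of tokens
        pool.update(doc)
    new_terms = [t for t, _ in pool.most_common()
                 if t not in query_terms][:m]
    return list(query_terms) + new_terms

expanded = blind_feedback(["soccer"],
                          [["soccer", "record", "record", "indoor"],
                           ["soccer", "record"]])
# expanded now contains "record" and "indoor" in addition to "soccer"
```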

Phrasing
• Creation of compound index terms (i.e., terms that correspond to more than one word stem in the original text)
• Usually found statistically by searching for word pairs that co-occur (much) more frequently in the corpus than expected by chance
• Lots of work on linguistically motivated phrasing
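
One way to make "more frequently than expected by chance" concrete is pointwise mutual information over adjacent word pairs; the thresholds below are illustrative:

```python
# Statistical phrase finding: score adjacent pairs by PMI, keeping
# pairs whose observed frequency far exceeds the chance expectation.
import math
from collections import Counter

def find_phrases(tokens, min_pmi=2.0, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    phrases = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        # PMI = log( P(a,b) / (P(a) * P(b)) )
        pmi = math.log((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= min_pmi:
            phrases.append((a, b))
    return phrases

filler = ["w%d" % i for i in range(20)]
corpus = filler[:10] + ["world", "record"] + filler[10:] + ["world", "record"]
pairs = find_phrases(corpus)   # -> [("world", "record")]
```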

Passages
• A passage is a document subpart
• Useful for breaking long, multi-subject documents into areas of homogeneous content
• Also useful for mitigating the effects of widely varying document lengths if the weighting scheme is sub-par
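
One simple passage definition is a fixed-size overlapping window of tokens; the window and overlap sizes below are arbitrary choices:

```python
# Split a token list into fixed-size overlapping passages, one simple
# way to carve a long document into roughly homogeneous chunks.
def passages(tokens, size=100, overlap=50):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

p = passages(list(range(250)))
# four windows: [0:100], [50:150], [100:200], [150:250]
```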

(Factoid) QA System Applications
• Predictive Annotation (IBM)
  – index entity types as well as content words
• Controlled query expansion (LCC)
  – use progressively more aggressive term expansion in different iterations of querying
  – a new iteration is invoked if predetermined conditions are not met

IR Evaluation
• Current IR evaluation assumes ranked retrieval
  – some methods (e.g., Boolean matching) don’t easily produce ranked results
• Measures
  – variations on recall, precision
  – MAP most common measure reported

recall = (# relevant retrieved) / (# relevant)

precision = (# relevant retrieved) / (# retrieved)

Relevance
• Fundamental concept in IR evaluation
• Restricted to topical relevance
  – document and query discuss the same topic
• Operational definition used in TREC: if you were writing a report on the subject of the query and would use any information included in the document in that report, mark the document relevant
• Known that different assessors judge documents differently
• Assessor differences affect absolute scores of systems, but generally not relative scores

[Figure: Average Precision by Qrel — average precision (0 to 0.4) for each system evaluated against the Original, Union, and Intersection qrels, with the mean marked]

Recall-Precision Graph
[Figure: interpolated, extrapolated, averaged recall-precision curves (precision vs. recall, 0 to 1) for the runs ok7ax, att98atdc, INQ502, mds98td, bbn1, tno7exp1, pirc8Aa2, and Cor7A3rrf]

Averages Do Hide Variability
[Figure: recall-precision curves (precision vs. recall, 0 to 1) illustrating the variability hidden by averaging]

Mean Average Precision
• Most frequently used summary measure of a ranked retrieval run
  – average precision of a single query is the mean of the precision scores after each relevant document retrieved
  – value for a run is the mean of the individual average precision scores
• Contains both recall- and precision-oriented aspects; sensitive to the entire ranking
• Interpretation is less obvious than for other measures (e.g., P(20))
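
Average precision per the definition above can be computed directly (this sketch divides by the total number of relevant documents, as trec_eval does, so unretrieved relevant documents count against the score):

```python
# AP for one query: mean of the precision values at each rank where a
# relevant document appears; MAP is the mean of AP over queries.
def average_precision(ranked_ids, relevant_ids):
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d4"})
# precision after d1 = 1/1, after d4 = 2/4 -> AP = (1.0 + 0.5) / 2 = 0.75
```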

IR Evaluation for (factoid) QA
• Lots of interest currently
  – SIGIR 2004 workshop
  – Christof Monz thesis
  – papers by MIT group
• Suggested measures
  – p@n, r@n, a@n (Monz)
  – coverage, redundancy (Roberts & Gaizauskas)
    • coverage(n) = % of questions with an answer in the top n docs
    • redundancy(n) = average number of answers per question in the top n docs
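
The coverage and redundancy measures can be sketched directly from their definitions:

```python
# coverage(n): fraction of questions with at least one answer-bearing
# document in the top n; redundancy(n): mean number of answer-bearing
# documents per question in the top n.
def coverage_redundancy(runs, n):
    # runs: list of (ranked_doc_ids, answer_bearing_ids) per question
    covered, total_answers = 0, 0
    for ranked, answers in runs:
        hits = sum(1 for d in ranked[:n] if d in answers)
        covered += hits > 0
        total_answers += hits
    return covered / len(runs), total_answers / len(runs)

cov, red = coverage_redundancy(
    [(["a", "b", "c"], {"a", "c"}), (["x", "y", "z"], {"q"})], n=2)
# cov = 0.5 (one of two questions covered); red = 0.5 (1 hit / 2 questions)
```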

Issues with IR Eval for QA
• Increased retrieval effectiveness can harm overall QA system performance
• Optimizing average effectiveness for a single retrieval strategy unlikely to be as effective as iterative strategies
• Beware! TREC QA track data sets never intended for this purpose
  – see the Bilotti, Katz, Lin SIGIR 2004 workshop paper for examples

Web Searching
• Differences from standard ad hoc searching
  – quality of the spider affects overall effectiveness
  – web search engines exploit link structure
  – web search engines must defend against deliberate spamming
  – web search engines operate under severe efficiency constraints
  – web search engines usually optimized for precision

Quality of Spider
• Spider sets the upper bound on coverage, recency of retrieval
• A bigger index is not necessarily better
  – eliminating “junk pages” at this stage is a big win for efficiency, possible win for effectiveness
  – but need to carefully define junk
• Affects user queries to the extent that retrieval suffers if not well done

Exploiting Link Structure
• Key to effective web retrieval
  – web retrieval tasks not always “Find docs about…”
  – in-links treated as recommendations for a page
  – anchor text is a form of manual keyword assignment
• Web engines generally need that structure
  – e.g., Google-in-a-Box for newswire unlikely to work well
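
PageRank (Brin & Page) is the best-known published way of turning in-links into recommendations; a minimal power-iteration sketch (this simplified version drops the rank mass of dangling pages, which real implementations redistribute):

```python
# Power-iteration PageRank: a page's rank is the damped sum of the
# rank shares of the pages linking to it. links maps page -> out-links.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:
                continue                      # dangling page: mass dropped
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

r = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
# "a" ends up with the highest rank: it has the most in-links
```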

Defending Against Spam
• IR systems generally trust input text, but can’t on the web
  – link exploitation a major way of coping
• Assuming the engine does a reasonable job, not a big impact on user queries

Efficiency Constraints
• Massive size of the web and volume of queries mean query processing must be very fast
  – stemming used rarely, if at all
  – no complicated NLP
• Economics supports caching/special processing (even manual) for extremely frequent queries
Optimized for Early Precision
• Default user model assumes precision is only measure of interest
• “I’m feeling lucky” is Prec(1)

Advanced Search
• “Phrase operator”
  – requires all words in the phrase to appear contiguously in the document
  – close to turning search into a giant grep
    • provides a type of pattern matching, but be sure that’s what you really want
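
With a positional index, the phrase operator reduces to checking for consecutive positions, which is why it behaves like a restricted grep:

```python
# Evaluate a phrase query against one document's positional index:
# the phrase matches only if its words occupy consecutive positions.
def phrase_match(positions, phrase):
    # positions: term -> sorted list of positions in the document
    if any(t not in positions for t in phrase):
        return False
    return any(all(p + i in positions[t] for i, t in enumerate(phrase))
               for p in positions[phrase[0]])

idx = {"world": [4, 17], "record": [5, 30]}
hit = phrase_match(idx, ["world", "record"])   # positions 4 and 5 are adjacent
```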

Advanced Search
• -forbidden words
  – same semantics as the Boolean NOT operator, with the same pitfalls

Available Systems (with input from Ian Soboroff of NIST)
• SMART
• MG
• Lucene
• Lemur

SMART
• Vector-space system written by Chris Buckley
  – available from the Cornell ftp site
• Designed to support experimentation
  – extremely easy to switch components of indexing/search
  – hard to change the fundamental search process (from ad hoc retrieval to filtering, say)
• Freely available version is old, but well-known
  – many different groups have used it in TREC
• Little user documentation
  – not supported

MG
• Written largely at RMIT using research of Bell, Moffat, Witten, Zobel
  – retrieval system accompanying the book “Managing Gigabytes”
• Intended to be used as-is, as a black box
  – implements a tf*idf vector model
  – research emphasis on efficiency
• Not officially supported
  – documentation essentially limited to the book
  – Zettair is a newer system by the same group

Lucene
• Retrieval toolkit written in Java
  – open source from Jakarta Apache
  – Doug Cutting is the main architect
• Target audience is people looking to put search on a web site
  – while a toolkit, and so extensible, general use is more out-of-the-box
  – vector-space, tf*idf system
  – no relevance feedback
• Well-documented with an active user community

Lemur
• Retrieval toolkit written largely in C++
  – available from the CMU website
  – written at CMU (Jamie Callan) and UMass (Bruce Croft, James Allan) with support from ARDA
• Target audience is the IR research community
  – primary retrieval model is the language-modeling approach; also supports tf*idf, Okapi weighting
  – has components for major areas of IR research including relevance feedback, distributed IR, etc.
• Active user community; good documentation