Information Retrieval: A Primer
Ellen Voorhees

IR Primer (Parts based on an outline by James Allan, UMass)
• Basic IR processing
  – bag of words
  – alternate approaches
• Web searching
  – differences from non-web searching
  – “advanced” features
• Available systems
  – SMART, MG
  – Lemur, Lucene

IR Problem Definition
• Find “documents” that are relevant to a user’s information need
  – unstructured, natural language text
  – lack of structure precludes use of traditional database technologies
  – a document is the unit of text of interest, traditionally at least a paragraph in length

IR Issues
• How to represent text
  – indexing
• How to represent the information need
  – free text vs. formal query language
• How to compare representations
  – retrieval models
• How to evaluate the quality of the search
  – recall/precision variants

Original Approach
• Manually assign descriptors
  – expensive
  – human agreement on descriptors is poor
  – controlled vocabulary increases annotator consistency but requires searchers to use the same vocabulary
• Still used
  – MEDLINE
  – motivation for many semantic web proposals

Current Standard Approach
• Statistical: compute degree of match between query and document
  – weighted “bag of words”
  – rank documents by how well they match
• Assume the top documents are good & modify the query based on their words
• Rerun the search with the modified query

Single biggest factor in IR Effectiveness
• The query!
  – many studies have shown the topic effect is bigger than the system effect or interaction effects
  – the TREC query track demonstrated the effect of different queries for the same topic
• Creating good queries
  – select the correct document set to ask
  – use discriminative, high-content query words
  – include informative alternate expressions
  – prefer specific examples over general concepts

How to Represent Text
• Tokenization
  – might include identifying phrases
  – might include identifying other lexical structures such as names, amounts, etc.
  – increased importance for (factoid) QA
• Remove “stop words”
• Perform stemming
• Weight terms
  – might include mining other data sources
Original Text
Czechs Play Indoor Soccer for More Than Four Days Straight for Record
Twenty Czechs broke the world record for indoor soccer last month, playing the game continuously for 107 hours and 15 minutes, the official Czechoslovak news agency CTK reported.
Two teams began the endeavor in the West Bohemian town of Holysov on Dec. 13 and ended with the new world record on Dec. 17, CTK said in the dispatch Monday.
According to the news agency, the previous record of 106 hours 10 minutes was held by English players. The Czechs new record is to be recorded in the Guinness Book of World Records, CTK said.
Bag of Tokens
agency(2); began; bohemian; book; broke; continuously; ctk(3); czechoslovak; czechs(3); days; dec(2); dispatch; ended; endeavor; english; game; guinness; held; holysov; hour(2); indoor(2); minutes(2); monday; month; news(2); official; play(3); previous; record(7); reported; soccer(2); straight; teams; town; twenty; west; world(3)
official czechoslovak; previous record; world record(3); days straight; news agency(2); ctk reported; guinness book; ctk agency; teams began
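
The bag above can be reproduced in miniature. A sketch, assuming a toy stop list and no stemming (the real pipeline also stems and mines phrases):

```python
# Minimal bag-of-words construction: tokenize, lowercase, drop stop
# words, and count. The stop list is an illustrative stand-in, not the
# one used to produce the output on this slide.
import re
from collections import Counter

STOP_WORDS = {"the", "for", "and", "of", "in", "on", "a", "an", "to",
              "by", "with", "was", "is", "said", "than", "more"}

def bag_of_words(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag = bag_of_words("Twenty Czechs broke the world record for indoor "
                   "soccer, playing the game continuously for 107 hours "
                   "and 15 minutes.")
# bag counts content words ("czechs", "record", ...); "the", "for",
# "and" are gone
```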
Final Document Representation
agent 2.80; begin 1.55; bohem 6.01; book 2.63; brok 2.60; continu 1.55; ctk 13.35; czechoslovak 4.36; czech 11.65; day 1.38; dec 4.34; dispatch 4.12; end 1.36; endeavor 5.03; engl 3.51; game 3.14; guin 5.40; held 2.05; holysov 10.75; hour 3.44; indoor 8.13; minut 4.38; monday 2.12; month 1.34; new 2.62; offic .98; play 4.82; prev 1.89; record 5.80; report 1.04; socc 9.16; straight 3.61; team 2.86; town 2.86; twent 3.91; west 2.14; world 3.85
czechoslovak offic 8.64; prev record 6.09; record world 13.69; day straight 6.41; agent new 6.69; ctk report 8.51; book guin 7.15; agent ctk 7.67; begin team 8.20

How to Represent & Compare Information Need
• Formal query language
  – query is a pattern to be matched by the document
  – Boolean systems are the best known
  – others
    • density measures such as Waterloo’s MultiText
    • Inquery’s structured query operators

Information Need
• Free text: query is a (short) document
• Vector-space derivatives:
  – compute a similarity function between document and query vectors
    • length normalization of vectors is key
    • cosine traditional, but highly biased toward short docs; pivoted length normalization now standard
• Language models
  – select the document most likely to have generated the query
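
The vector-space comparison can be sketched as plain cosine similarity over sparse term-weight vectors (pivoted normalization and language models omitted):

```python
# Cosine similarity between sparse term->weight vectors (dicts).
# The weights are hypothetical, in the style of the earlier
# "Final Document Representation" slide.
import math

def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

doc = {"record": 5.80, "soccer": 9.16, "indoor": 8.13}
query = {"indoor": 1.0, "soccer": 1.0}
score = cosine(query, doc)   # in [0, 1]; higher means a better match
```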
Beneficial techniques for automatic IR
• Term weighting
• Query expansion
• Phrasing
• Passages

Term weighting
• Largest single system factor that affects retrieval effectiveness
• Current best weights are a combination of three factors:
  – term frequency: how often the term occurs in a text
  – collection frequency: how many documents the term occurs in
  – length normalization: compensating factor for widely varying document lengths
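
A sketch of the three-factor combination, using one common variant (log tf × idf, with cosine length normalization); the exact formula behind the weights shown earlier is not specified here:

```python
# Combine term frequency, collection frequency (as idf), and length
# normalization. This ltc-style formula is one common instantiation,
# not necessarily the one used to produce the slide's weights.
import math

def tf_idf_vector(term_counts, doc_freq, n_docs):
    # tf component dampened by log; idf from document frequency
    raw = {t: (1 + math.log(tf)) * math.log(n_docs / doc_freq[t])
           for t, tf in term_counts.items() if t in doc_freq}
    # cosine length normalization: scale to a unit vector
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw

# hypothetical counts: "record" occurs 7 times in the doc and in 50 of
# 1000 collection documents; "soccer" is rarer, so it weighs more
weights = tf_idf_vector({"record": 7, "soccer": 2},
                        {"record": 50, "soccer": 5}, 1000)
```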

Query expansion
• A good query is essential, yet users tend to provide only a few keywords
• Variety of query expansion techniques, both interactive and automatic
• Most often used automatic technique is blind feedback:
  – assume the top-ranked documents are relevant and apply feedback to produce a new query
  – run the new query and present its results to the user
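
Blind feedback can be sketched as follows; real systems (e.g., Rocchio-style feedback) reweight terms rather than simply appending them, and the cutoffs k and m here are arbitrary:

```python
# Blind (pseudo-relevance) feedback: assume the top-k ranked documents
# are relevant and add their most frequent non-query terms to the query.
from collections import Counter

def blind_feedback(query_terms, ranked_docs, k=3, m=5):
    pool = Counter()
    for doc in ranked_docs[:k]:          # each doc is a list of tokens
        pool.update(doc)
    new_terms = [t for t, _ in pool.most_common()
                 if t not in query_terms][:m]
    return list(query_terms) + new_terms

expanded = blind_feedback(["soccer"],
                          [["soccer", "record", "record", "indoor"],
                           ["soccer", "record"]])
# expanded now contains "record" and "indoor" in addition to "soccer"
```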

Phrasing
• Creation of compound index terms (i.e., terms that correspond to more than one word stem in the original text)
• Usually found statistically by searching for word pairs that co-occur (much) more frequently in the corpus than expected by chance
• Lots of work on linguistically motivated phrasing
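
One way to make "more frequently than expected by chance" concrete is pointwise mutual information over adjacent word pairs; the thresholds below are illustrative:

```python
# Statistical phrase finding: score adjacent pairs by PMI, keeping
# pairs whose observed frequency far exceeds the chance expectation.
import math
from collections import Counter

def find_phrases(tokens, min_pmi=2.0, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    phrases = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        # PMI = log( P(a,b) / (P(a) * P(b)) )
        pmi = math.log((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= min_pmi:
            phrases.append((a, b))
    return phrases

filler = ["w%d" % i for i in range(20)]
corpus = filler[:10] + ["world", "record"] + filler[10:] + ["world", "record"]
pairs = find_phrases(corpus)   # -> [("world", "record")]
```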

Passages
• A passage is a document subpart
• Useful for breaking long, multi-subject documents into areas of homogeneous content
• Also useful for mitigating the effects of widely varying document lengths if the weighting scheme is sub-par
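
One simple passage definition is a fixed-size overlapping window of tokens; the window and overlap sizes below are arbitrary choices:

```python
# Split a token list into fixed-size overlapping passages, one simple
# way to carve a long document into roughly homogeneous chunks.
def passages(tokens, size=100, overlap=50):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

p = passages(list(range(250)))
# four windows: [0:100], [50:150], [100:200], [150:250]
```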

(Factoid) QA System Applications
• Predictive Annotation (IBM)
  – index entity types as well as content words
• Controlled query expansion (LCC)
  – use progressively more aggressive term expansion in different iterations of querying
  – a new iteration is invoked if predetermined conditions are not met

IR Evaluation
• Current IR evaluation assumes ranked retrieval
  – some methods (e.g., Boolean matching) don’t easily produce ranked results
• Measures
  – variations on recall, precision
  – MAP most common measure reported

recall = (# relevant retrieved) / (# relevant)

precision = (# relevant retrieved) / (# retrieved)

Relevance
• Fundamental concept in IR evaluation
• Restricted to topical relevance
  – document and query discuss the same topic
• Operational definition used in TREC: if you were writing a report on the subject of the query and would use any information included in the document in that report, mark the document relevant
• Known that different assessors judge documents differently
• Assessor differences affect absolute scores of systems, but generally not relative scores

[Figure: Average Precision by Qrel — average precision (0 to 0.4) for each system evaluated against the Original, Union, and Intersection qrels, with the mean marked]

Recall-Precision Graph
[Figure: interpolated, extrapolated, averaged recall-precision curves (precision vs. recall, 0 to 1) for the runs ok7ax, att98atdc, INQ502, mds98td, bbn1, tno7exp1, pirc8Aa2, and Cor7A3rrf]

Averages Do Hide Variability
[Figure: recall-precision curves (precision vs. recall, 0 to 1) illustrating the variability hidden by averaging]

Mean Average Precision
• Most frequently used summary measure of a ranked retrieval run
  – average precision of a single query is the mean of the precision scores after each relevant document retrieved
  – value for a run is the mean of the individual average precision scores
• Contains both recall- and precision-oriented aspects; sensitive to the entire ranking
• Interpretation is less obvious than for other measures (e.g., P(20))
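
Average precision per the definition above can be computed directly (this sketch divides by the total number of relevant documents, as trec_eval does, so unretrieved relevant documents count against the score):

```python
# AP for one query: mean of the precision values at each rank where a
# relevant document appears; MAP is the mean of AP over queries.
def average_precision(ranked_ids, relevant_ids):
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d4"})
# precision after d1 = 1/1, after d4 = 2/4 -> AP = (1.0 + 0.5) / 2 = 0.75
```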

IR Evaluation for (factoid) QA
• Lots of interest currently
  – SIGIR 2004 workshop
  – Christof Monz thesis
  – papers by MIT group
• Suggested measures
  – p@n, r@n, a@n (Monz)
  – coverage, redundancy (Roberts & Gaizauskas)
    • coverage(n) = % of questions with an answer in the top n docs
    • redundancy(n) = average number of answers per question in the top n docs
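
The coverage and redundancy measures can be sketched directly from their definitions:

```python
# coverage(n): fraction of questions with at least one answer-bearing
# document in the top n; redundancy(n): mean number of answer-bearing
# documents per question in the top n.
def coverage_redundancy(runs, n):
    # runs: list of (ranked_doc_ids, answer_bearing_ids) per question
    covered, total_answers = 0, 0
    for ranked, answers in runs:
        hits = sum(1 for d in ranked[:n] if d in answers)
        covered += hits > 0
        total_answers += hits
    return covered / len(runs), total_answers / len(runs)

cov, red = coverage_redundancy(
    [(["a", "b", "c"], {"a", "c"}), (["x", "y", "z"], {"q"})], n=2)
# cov = 0.5 (one of two questions covered); red = 0.5 (1 hit / 2 questions)
```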

Issues with IR Eval for QA
• Increased retrieval effectiveness can harm overall QA system performance
• Optimizing average effectiveness for a single retrieval strategy unlikely to be as effective as iterative strategies
• Beware! TREC QA track data sets never intended for this purpose
  – see the Bilotti, Katz, Lin SIGIR 2004 workshop paper for examples

Web Searching
• Differences from standard ad hoc searching
  – quality of the spider affects overall effectiveness
  – web search engines exploit link structure
  – web search engines must defend against deliberate spamming
  – web search engines operate under severe efficiency constraints
  – web search engines usually optimized for precision

Quality of Spider
• Spider sets the upper bound on coverage, recency of retrieval
• A bigger index is not necessarily better
  – eliminating “junk pages” at this stage is a big win for efficiency, possible win for effectiveness
  – but need to carefully define junk
• Affects user queries to the extent that retrieval suffers if not well done

Exploiting Link Structure
• Key to effective web retrieval
  – web retrieval tasks not always “Find docs about…”
  – in-links treated as recommendations for a page
  – anchor text is a form of manual keyword assignment
• Web engines generally need that structure
  – e.g., Google-in-a-Box for newswire unlikely to work well
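
PageRank (Brin & Page) is the best-known published way of turning in-links into recommendations; a minimal power-iteration sketch (this simplified version drops the rank mass of dangling pages, which real implementations redistribute):

```python
# Power-iteration PageRank: a page's rank is the damped sum of the
# rank shares of the pages linking to it. links maps page -> out-links.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:
                continue                      # dangling page: mass dropped
            share = damping * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new
    return rank

r = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
# "a" ends up with the highest rank: it has the most in-links
```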

Defending Against Spam
• IR systems generally trust input text, but can’t on the web
  – link exploitation a major way of coping
• Assuming the engine does a reasonable job, not a big impact on user queries

Efficiency Constraints
• Massive size of the web and volume of queries mean query processing must be very fast
  – stemming used rarely, if at all
  – no complicated NLP
• Economics supports caching/special processing (even manual) for extremely frequent queries
Optimized for Early Precision
• Default user model assumes precision is only measure of interest
• “I’m feeling lucky” is Prec(1)

Advanced Search
• “Phrase operator”
  – requires all words in the phrase to appear contiguously in the document
  – close to turning search into a giant grep
    • provides a type of pattern matching, but be sure that’s what you really want
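
With a positional index, the phrase operator reduces to checking for consecutive positions, which is why it behaves like a restricted grep:

```python
# Evaluate a phrase query against one document's positional index:
# the phrase matches only if its words occupy consecutive positions.
def phrase_match(positions, phrase):
    # positions: term -> sorted list of positions in the document
    if any(t not in positions for t in phrase):
        return False
    return any(all(p + i in positions[t] for i, t in enumerate(phrase))
               for p in positions[phrase[0]])

idx = {"world": [4, 17], "record": [5, 30]}
hit = phrase_match(idx, ["world", "record"])   # positions 4 and 5 are adjacent
```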

Advanced Search
• -forbidden words
  – same semantics as the Boolean NOT operator, with the same pitfalls

Available Systems (with input from Ian Soboroff of NIST)
• SMART
• MG
• Lucene
• Lemur

SMART
• Vector-space system written by Chris Buckley
  – available from the Cornell ftp site
• Designed to support experimentation
  – extremely easy to switch components of indexing/search
  – hard to change the fundamental search process (from ad hoc retrieval to filtering, say)
• Freely available version is old, but well-known
  – many different groups have used it in TREC
• Little user documentation
  – not supported

MG
• Written largely at RMIT using research of Bell, Moffat, Witten, Zobel
  – retrieval system accompanying the book “Managing Gigabytes”
• Intended to be used as-is, as a black box
  – implements a tf*idf vector model
  – research emphasis on efficiency
• Not officially supported
  – documentation essentially limited to the book
  – Zettair is a newer system by the same group

Lucene
• Retrieval toolkit written in Java
  – open source from Jakarta Apache
  – Doug Cutting is the main architect
• Target audience is people looking to put search on a web site
  – while a toolkit, and so extensible, general use is more out-of-the-box
  – vector-space, tf*idf system
  – no relevance feedback
• Well-documented with an active user community

Lemur
• Retrieval toolkit written largely in C++
  – available from the CMU website
  – written at CMU (Jamie Callan) and UMass (Bruce Croft, James Allan) with support from ARDA
• Target audience is the IR research community
  – primary retrieval model is the language-modeling approach; also supports tf*idf, Okapi weighting
  – has components for major areas of IR research including relevance feedback, distributed IR, etc.
• Active user community; good documentation