Evaluation
Information Retrieval / Web
Purposes of Evaluation
- System performance evaluation: efficiency of data structures and methods; operational profile
- Retrieval evaluation: how well can the system "guess" what is needed, over a range of queries (including ill-specified and vague ones)?
- Comparison (who's best?): how does a new algorithm or data structure compare to its predecessors; what is its contribution?
Canonical Retrieval & Comparison Experiment
1. For each method M being compared:
   a. For each parameter setting P of M:
      1) Train M(P) on a training data set D'.
      2) For each testing data set D:
         a) Run M on D
         b) Compare actual results to expected results
         c) Compute performance metrics
   b. Compare performance across settings P for method M
2. Compare performance across methods M using statistical tests for significance
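The nested loops above can be sketched in a few lines. This is a minimal illustration, not any particular IR toolkit's API: the `methods`, `param_grids`, and `evaluate` interfaces are hypothetical placeholders.

```python
# Sketch of the canonical retrieval & comparison experiment loop.
# methods: {name: train_fn}, where train_fn(train_set, **params) returns
# a run_fn(test_set) -> actual results (hypothetical interfaces).

def run_experiment(methods, param_grids, train_set, test_sets, evaluate):
    """Return metrics keyed by (method name, parameter setting)."""
    results = {}
    for name, train in methods.items():                 # 1. each method M
        for params in param_grids[name]:                # a. each setting P
            model = train(train_set, **params)          # 1) train M(P) on D'
            key = (name, tuple(sorted(params.items())))
            results[key] = [evaluate(model(d), d["expected"])  # 2) run M on each D,
                            for d in test_sets]         #    compare, compute metrics
    return results
```

The statistical comparison across methods (step 2) would then be run over the lists of per-collection scores this returns.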
Key Questions
- To what methods should you compare?
- What parameters should be checked?
- What data sets should be used?
- How should you divide into training and testing?
- What metrics should be used?
- What statistical tests should be run?
Choosing Methods and Parameters
Depends on your hypothesis. Experiments are based on hypotheses such as:
- My system has significantly better performance than the state of the art.
- Heuristics significantly improve performance.
- Negative feedback makes little difference to performance.
IR Data Sets
TREC (Text Retrieval Conference) http://trec.nist.gov/
- yearly conference organized by NIST
- groups compare IR systems on designated tasks and data
- data sets are large and include human relevance and topic judgments
- tasks: usually IR (e.g., retrieval, filtering), recently Web
- largest, most widely used collections
TREC Data Collections (>800,000 docs)
1. Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and Information from the Computer Select disks (1989, 1990)
2. Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988) and Information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis
3. San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and Information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis
4. Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).
TREC Relevance Judgments
Definition of relevance: “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant.”
- binary judgments; a document is relevant if any part of it is relevant
- pooled (based on returns of actual systems)
- a subset of docs is judged by the topic author
TREC Topics
<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
<narr> Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.
<con> Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications
<fac> Factor(s):
<nat> Nationality: U.S.
</fac>
<def> Definition(s):
</top>
TREC Tasks
- ad hoc retrieval
- routing (query stream from profile)
- confusion (docs with scanning errors)
- database merging (docs from separate collections)
- spoken documents
- filtering (construct profile to filter incoming stream)
- question and answer
- web ad hoc
- web homepage finding
TREC 2003 Tasks
- Cross-language: retrieve documents in multiple languages
- Filtering: user's information need is stable; check incoming stream for which docs should be disseminated
- Genome: gene sequences as well as supporting docs
- High Accuracy Retrieval from Documents (HARD): leverage knowledge about user and/or goals
- Interactive
- Novelty: retrieve new, non-redundant data
- Question answering
- Video: content-based retrieval of digital video
- Web: search on a snapshot of the web
Web Topics
<top>
<num> Number: 501
<title> deduction and induction in English?
<desc> Description:
What is the difference between deduction and induction in the process of reasoning?
<narr> Narrative:
A relevant document will contrast inductive and deductive reasoning. A document that discusses only one or the other is not relevant.
</top>
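Because TREC topics are SGML-like with unclosed field tags, each field simply runs from its tag to the next tag. A minimal sketch of extracting the fields of a topic like the one above (the tag list and `parse_topic` name are illustrative, not a standard tool):

```python
import re

# Pull named fields out of a TREC-style topic string. Each field runs
# from its opening tag to the next '<' (tags are not closed in TREC
# topics), so a lazy match with a lookahead suffices.

def parse_topic(text):
    fields = {}
    for tag in ("num", "title", "desc", "narr"):
        m = re.search(rf"<{tag}>(.*?)(?=<|\Z)", text, re.S)
        if m:
            fields[tag] = " ".join(m.group(1).split())  # collapse whitespace
    return fields
```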
Metrics: Precision and Recall
Precision: were all retrieved docs relevant? hits/(hits+fp)
Recall: were all relevant docs retrieved? hits/(hits+fn)
              Relevant                ~Relevant
Retrieved     hit                     false positive
~Retrieved    miss (false negative)
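The two definitions above follow directly from the contingency counts:

```python
# Precision and recall from the contingency table: hits = relevant docs
# retrieved, fp = irrelevant docs retrieved, fn = relevant docs missed.

def precision(hits, fp):
    return hits / (hits + fp) if hits + fp else 0.0

def recall(hits, fn):
    return hits / (hits + fn) if hits + fn else 0.0
```

For example, retrieving 10 docs of which 6 are relevant, when 14 docs are relevant overall, gives precision 6/10 and recall 6/14.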
Precision/Recall Tradeoff
[Figure: precision (%) vs. recall (%) curves for sys1 and sys2, both axes 0–100]
Performance for Each Query
- Hold recall fixed (through parameters) and allow precision to vary
- Average precision at seen relevant docs: running average
- R-precision: precision at the rth ranking, where r = |relevant docs|
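Sketches of these two per-query ranking metrics, assuming `ranking` is the system's ordered list of doc ids and `relevant` the set of relevant ids:

```python
# Average precision at seen relevant docs, and R-precision.

def average_precision(ranking, relevant):
    """Running average of precision at each seen relevant doc."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank          # precision at this relevant doc
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank r, where r = |relevant docs|."""
    r = len(relevant)
    return len(set(ranking[:r]) & relevant) / r if r else 0.0
```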
Metrics: P&R Combination
Precision and recall tend to balance each other: improving one typically comes at the cost of the other.
F is the harmonic mean, combining recall and precision in a single metric:

F(i,j) = 2 · Precision(i,j) · Recall(i,j) / (Precision(i,j) + Recall(i,j))

The weighted form F_β trades the two off (β > 1 emphasizes recall, β < 1 emphasizes precision):

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
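A minimal sketch of the combined F measure, assuming the standard weighted F_β definition (β = 1 gives the plain harmonic mean):

```python
# Weighted F measure: beta > 1 weights recall more heavily,
# beta < 1 weights precision; beta = 1 is the harmonic mean F1.

def f_beta(precision, recall, beta=1.0):
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```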
Experiment Pitfalls
- Ill-specified hypothesis
- Reliance on flaky and/or too few users
- Bias in the user base
- Poorly constructed user instructions
- Inappropriate comparison methods
- Too much varying simultaneously
- Biased evaluation metric (confounding) or data selection
- Too many or too few statistical tests
Strategies for Avoiding Pitfalls
- Fully specify the experiment before running it
- Run a pilot experiment
- Put serious thought into the "right" hypothesis
- Think about what you hope to say when the experiment concludes… will the experiment support your saying it?
Web Experiment Example
[O. Zamir & O. Etzioni, "Web Document Clustering", Proc. of 21st International SIGIR Conference on Research and Development in Information Retrieval, 1998.]
Developed an incremental, linear-time clustering algorithm, Suffix Tree Clustering (STC), which clusters based on shared phrases in document snippets.
Hypotheses:
- STC has higher precision than standard clustering methods for this domain.
- Clustering based on snippets does not significantly degrade precision for this domain.
- STC is faster than standard clustering methods for this domain.
Zamir & Etzioni Experiment
Data set:
- Thought up 10 queries
- Generated 10 collections of 200 snippets from MetaCrawler query results (retrieved original docs as well)
- Manually assigned relevance (mean = 40 relevant docs/query)
Methods: Single-Pass, K-Means, Buckshot, Fractionation, GAHC
Z&E Exp: STC has higher precision.
Z&E Exp: Snippet Clustering does not unduly affect precision.
Z&E Exp: STC is faster.