Evaluation
Information Retrieval / Web
Purposes of Evaluation
- System performance evaluation: efficiency of data structures and methods; operational profile
- Retrieval evaluation: how well can the system "guess" what is needed, over a range of queries (including ill-specified and vague ones)?
- Comparison (who's best?): how does a new algorithm or data structure compare to its predecessors; what is its contribution?
Canonical Retrieval & Comparison Experiment
1. For each method M being compared:
   a. For each parameter setting P of M:
      1) Train M(P) on a training data set D'.
      2) For each testing data set D:
         a) Run M on D
         b) Compare actual results to expected results
         c) Compute performance metrics
   b. Compare performance across settings P for method M
2. Compare performance across methods M using statistical tests for significance
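The nested loops above can be sketched in a few lines. This is a minimal illustration, not any particular IR toolkit's API: the `methods`, `param_grids`, and `evaluate` interfaces are hypothetical placeholders.

```python
# Sketch of the canonical retrieval & comparison experiment loop.
# methods: {name: train_fn}, where train_fn(train_set, **params) returns
# a run_fn(test_set) -> actual results (hypothetical interfaces).

def run_experiment(methods, param_grids, train_set, test_sets, evaluate):
    """Return metrics keyed by (method name, parameter setting)."""
    results = {}
    for name, train in methods.items():                 # 1. each method M
        for params in param_grids[name]:                # a. each setting P
            model = train(train_set, **params)          # 1) train M(P) on D'
            key = (name, tuple(sorted(params.items())))
            results[key] = [evaluate(model(d), d["expected"])  # 2) run M on each D,
                            for d in test_sets]         #    compare, compute metrics
    return results
```

The statistical comparison across methods (step 2) would then be run over the lists of per-collection scores this returns.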
Key Questions
- To what methods should you compare?
- What parameters should be checked?
- What data sets should be used?
- How should you divide into training and testing?
- What metrics should be used?
- What statistical tests should be run?
Choosing Methods and Parameters
Depends on your hypothesis. Experiments are based on hypotheses such as:
- My system has significantly better performance than the state of the art.
- Heuristics significantly improve performance.
- Negative feedback makes little difference to performance.
IR Data Sets
TREC (Text Retrieval Conference) http://trec.nist.gov/
- yearly conference organized by NIST
- groups compare IR systems on designated tasks and data
- data sets are large and include human relevance and topic judgments
- tasks: usually IR (e.g., retrieval, filtering), recently Web
- largest, most widely used collections
TREC Data Collections (>800,000 docs)
1. Wall Street Journal (1987, 1988, 1989), the Federal Register (1989), Associated Press (1989), Department of Energy abstracts, and Information from the Computer Select disks (1989, 1990)
2. Wall Street Journal (1990, 1991, 1992), the Federal Register (1988), Associated Press (1988) and Information from the Computer Select disks (1989, 1990) copyrighted by Ziff-Davis
3. San Jose Mercury News (1991), the Associated Press (1990), U.S. Patents (1983-1991), and Information from the Computer Select disks (1991, 1992) copyrighted by Ziff-Davis
4. Financial Times Limited (1991, 1992, 1993, 1994), the Congressional Record of the 103rd Congress (1993), and the Federal Register (1994).
5. Foreign Broadcast Information Service (1996) and the Los Angeles Times (1989, 1990).
TREC Relevance Judgments
Definition of relevance: “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant.”
- binary judgments; a document is relevant if any part of it is relevant
- pooled (based on returns of actual systems)
- a subset of docs is judged by the topic author
TREC Topics
<top>
<head> Tipster Topic Description
<num> Number: 066
<dom> Domain: Science and Technology
<title> Topic: Natural Language Processing
<desc> Description: Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
<narr> Narrative: A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.
<con> Concept(s): 1. natural language processing 2. translation, language, dictionary, font 3. software applications
<fac> Factor(s):
<nat> Nationality: U.S.
</fac>
<def> Definition(s):
</top>
TREC Tasks
- ad hoc retrieval
- routing (query stream from profile)
- confusion (docs with scanning errors)
- database merging (docs from separate collections)
- spoken documents
- filtering (construct profile to filter incoming stream)
- question and answer
- web ad hoc
- web homepage finding
TREC 2003 Tasks
- Cross-language: retrieve documents in multiple languages
- Filtering: user's information need is stable; check incoming stream for which docs should be disseminated
- Genome: gene sequences as well as supporting docs
- High Accuracy Retrieval from Documents (HARD): leverage knowledge about user and/or goals
- Interactive
- Novelty: retrieve new, non-redundant data
- Question answering
- Video: content-based retrieval of digital video
- Web: search on a snapshot of the web
Web Topics
<top>
<num> Number: 501
<title> deduction and induction in English?
<desc> Description:
What is the difference between deduction and induction in the process of reasoning?
<narr> Narrative:
A relevant document will contrast inductive and deductive reasoning. A document that discusses only one or the other is not relevant.
</top>
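Because TREC topics are SGML-like with unclosed field tags, each field simply runs from its tag to the next tag. A minimal sketch of extracting the fields of a topic like the one above (the tag list and `parse_topic` name are illustrative, not a standard tool):

```python
import re

# Pull named fields out of a TREC-style topic string. Each field runs
# from its opening tag to the next '<' (tags are not closed in TREC
# topics), so a lazy match with a lookahead suffices.

def parse_topic(text):
    fields = {}
    for tag in ("num", "title", "desc", "narr"):
        m = re.search(rf"<{tag}>(.*?)(?=<|\Z)", text, re.S)
        if m:
            fields[tag] = " ".join(m.group(1).split())  # collapse whitespace
    return fields
```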
Metrics: Precision and Recall
Precision: were all retrieved docs relevant? hits/(hits+fp)
Recall: were all relevant docs retrieved? hits/(hits+fn)
              Relevant                ~Relevant
Retrieved     hit                     false positive
~Retrieved    miss (false negative)
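The two definitions above follow directly from the contingency counts:

```python
# Precision and recall from the contingency table: hits = relevant docs
# retrieved, fp = irrelevant docs retrieved, fn = relevant docs missed.

def precision(hits, fp):
    return hits / (hits + fp) if hits + fp else 0.0

def recall(hits, fn):
    return hits / (hits + fn) if hits + fn else 0.0
```

For example, retrieving 10 docs of which 6 are relevant, when 14 docs are relevant overall, gives precision 6/10 and recall 6/14.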
Precision/Recall Tradeoff
[Figure: precision (%) vs. recall (%) curves for sys1 and sys2, both axes 0–100]
Performance for Each Query
- Hold recall fixed (through parameters) and allow precision to vary
- Average precision at seen relevant docs: running average
- R-precision: precision at the rth ranking, where r = |relevant docs|
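Sketches of these two per-query ranking metrics, assuming `ranking` is the system's ordered list of doc ids and `relevant` the set of relevant ids:

```python
# Average precision at seen relevant docs, and R-precision.

def average_precision(ranking, relevant):
    """Running average of precision at each seen relevant doc."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank          # precision at this relevant doc
    return total / len(relevant) if relevant else 0.0

def r_precision(ranking, relevant):
    """Precision at rank r, where r = |relevant docs|."""
    r = len(relevant)
    return len(set(ranking[:r]) & relevant) / r if r else 0.0
```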
Metrics: P&R Combination
Precision and recall tend to balance each other: improving one typically comes at the cost of the other.
F is the harmonic mean, combining recall and precision in a single metric:

F(i,j) = 2 · Precision(i,j) · Recall(i,j) / (Precision(i,j) + Recall(i,j))

The weighted form F_β trades the two off (β > 1 emphasizes recall, β < 1 emphasizes precision):

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
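A minimal sketch of the combined F measure, assuming the standard weighted F_β definition (β = 1 gives the plain harmonic mean):

```python
# Weighted F measure: beta > 1 weights recall more heavily,
# beta < 1 weights precision; beta = 1 is the harmonic mean F1.

def f_beta(precision, recall, beta=1.0):
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```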
Experiment Pitfalls
- Ill-specified hypothesis
- Reliance on flaky and/or too few users
- Bias in the user base
- Poorly constructed user instructions
- Inappropriate comparison methods
- Too much varying simultaneously
- Biased evaluation metric (confounding) or data selection
- Too many or too few statistical tests
Strategies for Avoiding Pitfalls
- Fully specify the experiment before running it
- Run a pilot experiment
- Put serious thought into the "right" hypothesis
- Think about what you hope to say when the experiment concludes… will the experiment support your saying it?
Web Experiment Example
[O. Zamir & O. Etzioni, "Web Document Clustering", Proc. of 21st International SIGIR Conference on Research and Development in Information Retrieval, 1998.]
Developed an incremental, linear-time clustering algorithm, Suffix Tree Clustering (STC), which clusters based on shared phrases in document snippets.
Hypotheses:
- STC has higher precision than standard clustering methods for this domain.
- Clustering based on snippets does not significantly degrade precision for this domain.
- STC is faster than standard clustering methods for this domain.
Zamir & Etzioni Experiment
Data set:
- Thought up 10 queries
- Generated 10 collections of 200 snippets from MetaCrawler query results (retrieved original docs as well)
- Manually assigned relevance (mean = 40 relevant docs/query)
Methods: Single-Pass, K-Means, Buckshot, Fractionation, GAHC
Z&E Exp: STC has higher precision.
Z&E Exp: Snippet Clustering does not unduly affect precision.
Z&E Exp: STC is faster.