

1

CS 430: Information Discovery

Lecture 8

Evaluation of Retrieval Effectiveness II


2

Course administration


3

The Cranfield methodology

Recall and precision depend on the concept of relevance

-> Is relevance a context- and task-independent property of documents?

"Relevance is the correspondence in context between an information requirement statement (a query) and an article (a document), that is, the extent to which the article covers the material that is appropriate to the requirement statement."

F. W. Lancaster, 1979


4

Relevance

• Recall and precision values are for a specific set of documents and a specific set of queries

• Relevance is subjective, but experimental evidence suggests that, for textual documents, different experts have similar judgments about relevance

• Estimates of relevance level are less consistent

• The type of query is important; queries vary in specificity:

-> subject-heading queries

-> title queries

-> paragraphs or free text

Tests should use realistic queries


5

Text Retrieval Conferences (TREC)

• Led by Donna Harman (NIST), with DARPA support

• Annual since 1992 (initial experiment ended 1999)

• Corpus of several million textual documents, total of more than five gigabytes of data

• Researchers attempt a standard set of tasks

-> search the corpus for topics provided by surrogate users

-> match a stream of incoming documents against standard queries

• Participants include large commercial companies, small information retrieval vendors, and university research groups.  


6

The TREC Corpus (Examples)

Source                           Size (Mbytes)   # Docs    Median words/doc
Wall Street Journal, 87-89            267         98,732         245
Associated Press newswire, 89         254         84,678         446
Computer Selects articles             242         75,180         200
Federal Register, 89                  260         25,960         391
Abstracts of DOE publications         184        226,087         111

Wall Street Journal, 90-92            242         74,520         301
Associated Press newswire, 88         237         79,919         438
Computer Selects articles             175         56,920         182
Federal Register, 88                  209         19,860         396


7

The TREC Corpus (continued)

Source                           Size (Mbytes)   # Docs    Median words/doc
San Jose Mercury News, 91             287         90,257         379
Associated Press newswire, 90         237         78,321         451
Computer Selects articles             345        161,021         122
U.S. patents, 93                      243          6,711       4,445

Financial Times, 91-94                564        210,158         316
Federal Register, 94                  395         55,630         588
Congressional Record, 93              235         27,922         288

Foreign Broadcast Information         470        130,471         322
LA Times                              475        131,896         351


8

The TREC Corpus (continued)

Notes:

1. The TREC corpus consists mainly of general articles. The Cranfield data was in a specialized engineering domain.

2. The TREC data is raw data:

-> No stop words are removed; no stemming

-> Words are alphanumeric strings

-> No attempt made to correct spelling, sentence fragments, etc.
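A minimal sketch of the tokenization these notes imply, assuming a word is simply a maximal alphanumeric string, lower-cased, with no stop list and no stemmer (the function name and regular expression are illustrative, not part of TREC):

    import re

    def tokenize(text):
        # Treat words as maximal alphanumeric strings; no stop-word removal,
        # no stemming, no spelling correction.
        return re.findall(r"[A-Za-z0-9]+", text.lower())

    print(tokenize("Pan Am Flight 103, Lockerbie (21 December 1988)"))
    # ['pan', 'am', 'flight', '103', 'lockerbie', '21', 'december', '1988']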


9

TREC Experiments

1. NIST provides text corpus on CD-ROM

Participant builds index using own technology

2. NIST provides 50 natural language topic statements

Participant converts to queries (automatically or manually)

3. Participant runs the searches and returns up to 1,000 hits to NIST.

NIST analyzes the results for recall and precision (all TREC participants use rank-based methods of searching)
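As a sketch of step 3, the hits are commonly submitted in the trec_eval run format (topic id, the literal Q0, document id, rank, score, run tag); the function below is illustrative and assumes the participant's own search system has already produced (doc_id, score) pairs per topic:

    def write_trec_run(results, run_tag, path):
        # results: {topic_id: [(doc_id, score), ...]}, sorted by decreasing score.
        # Writes at most 1,000 hits per topic, one line per hit:
        #   topic_id  Q0  doc_id  rank  score  run_tag
        with open(path, "w") as f:
            for topic_id, hits in results.items():
                for rank, (doc_id, score) in enumerate(hits[:1000], start=1):
                    f.write(f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")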


10

TREC Topic Statement

<num> Number: 409

<title> legal, Pan Am, 103

<desc> Description:

What legal actions have resulted from the destruction of Pan Am Flight 103 over Lockerbie, Scotland, on December 21, 1988?

<narr> Narrative:

Documents describing any charges, claims, or fines presented to or imposed by any court or tribunal are relevant, but documents that discuss charges made in diplomatic jousting are not relevant.

A sample TREC topic statement
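A rough sketch of pulling the fields out of a topic statement like this one, assuming the usual field order and simple pattern matching rather than a full SGML parser (the names are illustrative):

    import re

    def parse_topic(text):
        # Extract the number, title, description, and narrative fields.
        num   = re.search(r"<num>\s*Number:\s*(\d+)", text).group(1)
        title = re.search(r"<title>\s*(.+)", text).group(1).strip()
        desc  = re.search(r"<desc>\s*Description:\s*(.*?)<narr>", text, re.S).group(1).strip()
        narr  = re.search(r"<narr>\s*Narrative:\s*(.*)", text, re.S).group(1).strip()
        return {"num": num, "title": title, "desc": desc, "narr": narr}

The title field, tokenized as above, could serve as an automatically constructed query; the description and narrative support longer, manually constructed queries.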


11

Relevance Assessment

For each query, a pool of potentially relevant documents is assembled, using the top 100 ranked documents from each participant.

The human expert who set the query looks at every document in the pool and determines whether it is relevant.

Documents outside the pool are not examined.

In a TREC-8 example, with 71 participants:

-> 7,100 documents in the pool

-> 1,736 unique documents (eliminating duplicates)

-> 94 judged relevant
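A sketch of the pooling step, assuming each run is represented as a ranked list of document ids per topic (the data structure is illustrative):

    def build_pool(runs, topic_id, depth=100):
        # Union of the top `depth` documents from every participating run,
        # with duplicates removed; every document in the pool is then judged.
        pool = set()
        for run in runs:
            pool.update(run.get(topic_id, [])[:depth])
        return pool

With 71 runs and a depth of 100 this gives at most 7,100 candidates; in the TREC-8 example above, overlap between runs reduced the pool to 1,736 unique documents.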


12

A Cornell Footnote

The TREC analysis uses a program developed by Chris Buckley, who spent 17 years at Cornell before completing his Ph.D. in 1995.

Buckley has continued to maintain the SMART software and has been a participant at every TREC conference. SMART is used as the basis against which other systems are compared.

During the early TREC conferences, the tuning of SMART with the TREC corpus led to steady improvements in retrieval effectiveness, but after about TREC-5 a plateau was reached.

TREC-8, in 1999, was the final year for this experiment.


13

Measures based on relevance

                  relevant                        not relevant

retrieved         RR  (retrieved, relevant)       RN  (retrieved, not relevant)
not retrieved     NR  (not retrieved, relevant)   NN  (not retrieved, not relevant)


14

Measures based on relevance

recall = retrieved relevant / relevant

precision = retrieved relevant / retrieved

fallout = retrieved not relevant / not relevant
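The three measures, written as a sketch over sets of document ids (the function and variable names are illustrative):

    def relevance_measures(retrieved, relevant, collection_size):
        # retrieved, relevant: sets of document ids; collection_size: N.
        rr = len(retrieved & relevant)                 # retrieved relevant
        recall    = rr / len(relevant)
        precision = rr / len(retrieved)
        fallout   = (len(retrieved) - rr) / (collection_size - len(relevant))
        return recall, precision, fallout

    # Example: N = 20, relevant = {1, 2, 3, 4}, retrieved = {1, 2, 5, 6}
    # gives recall = 0.5, precision = 0.5, fallout = 2/16 = 0.125.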


15

Estimates of Recall

Pooling, as used by TREC, depends on the pool of nominated documents. Are there relevant documents that never enter the pool?

An Example of Estimating Recall

• Litigation support system built on IBM's STAIRS software

• Corpus of 40,000 documents

• 51 queries

• Random samples of documents examined by lawyers in a blind sampling experiment

• Estimated that only about 20% of the relevant documents were found by STAIRS

Blair and Maron, 1985


16

Recall-precision after retrieval of n documents

 n   relevant   recall   precision
 1   yes        0.2      1.0
 2   yes        0.4      1.0
 3   no         0.4      0.67
 4   yes        0.6      0.75
 5   no         0.6      0.60
 6   yes        0.8      0.67
 7   no         0.8      0.57
 8   no         0.8      0.50
 9   no         0.8      0.44
10   no         0.8      0.40
11   no         0.8      0.36
12   no         0.8      0.33
13   yes        1.0      0.38
14   no         1.0      0.36

SMART system using Cranfield data, 200 documents in aeronautics of which 5 are relevant
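The table can be reproduced from the ranked relevance judgments; a sketch (judgment list taken from the table, 5 relevant documents in total):

    def recall_precision_by_rank(judgments, total_relevant):
        # judgments[k] is 1 if the document at rank k+1 is relevant, else 0.
        rows, found = [], 0
        for n, rel in enumerate(judgments, start=1):
            found += rel
            rows.append((n, found / total_relevant, found / n))
        return rows

    judgments = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
    for n, r, p in recall_precision_by_rank(judgments, 5):
        print(f"{n:2d}  recall = {r:.1f}  precision = {p:.2f}")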


17

Recall-precision graph

[Figure: recall-precision graph for the table above; precision plotted against recall, with points labelled by the rank at which they occur (1, 2, 3, 4, 5, 6, 12, 13, ..., 200).]


18

Typical recall-precision graph

[Figure: typical recall-precision graphs (precision against recall) for a broad, general query and for a narrow, specific query.]


19

Normalized recall measure

[Figure: recall plotted against the ranks of the retrieved documents (1 to 200), comparing the ideal ranks, the actual ranks, and the worst ranks.]


20

Normalized recall

Normalized recall = (area between actual and worst) / (area between best and worst)

R_norm = 1 - (Σ_{i=1}^{n} r_i - Σ_{i=1}^{n} i) / (n(N - n))

where r_i is the rank at which the i-th relevant document is retrieved, n is the number of relevant documents, and N is the number of documents in the collection.
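A sketch of the calculation, with the variable names taken from the formula above:

    def normalized_recall(relevant_ranks, N):
        # relevant_ranks: the ranks r_i at which the n relevant documents
        # were retrieved; N: number of documents in the collection.
        n = len(relevant_ranks)
        ideal = n * (n + 1) // 2                  # sum of the ideal ranks 1..n
        return 1 - (sum(relevant_ranks) - ideal) / (n * (N - n))

    # For the earlier ranking (relevant at ranks 1, 2, 4, 6, 13; N = 200):
    # R_norm = 1 - (26 - 15) / (5 * 195) ≈ 0.989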


21

Normalized Symmetric Difference

[Figure: Venn diagram of all documents, with the set A of retrieved documents overlapping the set B of relevant documents.]

Symmetric difference, S = (A ∪ B) - (A ∩ B)

Normalized symmetric difference = |S| / (|A| + |B|)

= 1 - 1 / [½ (1/recall + 1/precision)]

i.e. one minus the harmonic mean of recall and precision.
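A sketch of the same measure computed directly from the two sets, using the |A| + |B| normalization given above:

    def normalized_symmetric_difference(retrieved, relevant):
        # A = retrieved set, B = relevant set; |A Δ B| / (|A| + |B|),
        # which equals 1 minus the harmonic mean of recall and precision.
        A, B = set(retrieved), set(relevant)
        return len(A ^ B) / (len(A) + len(B))

    # retrieved = {1, 2, 5, 6}, relevant = {1, 2, 3, 4}:
    # recall = precision = 0.5, so the measure is 1 - 0.5 = 0.5.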


22

Statistical tests

Suppose that a search is carried out on systems i and j. System i is superior to system j if, for all test cases,

recall(i) >= recall(j) and precision(i) >= precision(j)


23

Recall-precision graph

[Figure: recall-precision graph showing curves for two systems, one plotted in red and one in black.]

The red system appears better than the black, but is the difference statistically significant?


24

Statistical tests

• The t-test is the standard statistical test for comparing two tables of numbers, but it depends on statistical assumptions of independence and normal distributions that do not apply to this data.

• The sign test makes no assumption of normality and uses only the sign (not the magnitude) of the differences in the sample values, but it assumes independent samples.

• The Wilcoxon signed rank test uses the ranks of the differences, not their magnitudes; it makes no assumption of normality but assumes independent samples.
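A sketch of applying the three tests to paired per-query scores for two systems (the scores are made-up illustrative numbers, and scipy is assumed to be available):

    from scipy.stats import ttest_rel, wilcoxon, binomtest

    # One score per test query (e.g. average precision) for each system.
    system_i = [0.42, 0.55, 0.31, 0.60, 0.48, 0.29, 0.51, 0.44]
    system_j = [0.38, 0.50, 0.33, 0.52, 0.41, 0.27, 0.49, 0.40]

    print(ttest_rel(system_i, system_j))       # paired t-test
    print(wilcoxon(system_i, system_j))        # Wilcoxon signed rank test

    # Sign test: count positive differences among the non-zero differences.
    diffs = [a - b for a, b in zip(system_i, system_j)]
    nonzero = [d for d in diffs if d != 0]
    print(binomtest(sum(d > 0 for d in nonzero), n=len(nonzero), p=0.5))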


25

User criteria

System-centered and user-centered evaluation

-> Is user satisfied?

-> Is user successful?

System efficiency

-> What efforts are involved in carrying out the search?

Suggested criteria

• recall and precision

• response time

• user effort

• form of presentation

• content coverage


26

System factors that affect user satisfaction

Collection

• Input policies -- coverage, error rates, timeliness

• Document characteristics -- title, abstract, summary, full text

Indexing

• Rules for assigning terms -- specificity, exhaustivity

Query

• Formulation, operators