
Information Retrieval

Lecture 3: Introduction to Information Retrieval (Manning et al. 2007), Chapter 8

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London


Evaluating an IR System

Expressiveness: the ability of the query language to express complex information needs, e.g., Boolean operators, wildcards, phrases, proximity, etc.

Efficiency: How fast does it index? How large is the index? How fast does it search?

Effectiveness – the key measure: How effectively does it find relevant documents?

Is this search engine good? Which search engine is better?


Relevance

How do we quantify relevance?

A benchmark set of docs (corpus)
A benchmark set of queries
A binary assessment for each query-doc pair: either relevant or irrelevant


Relevance

Relevance should be evaluated according to the information need (which is translated into a query).

[information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

[query] wine red white heart attack effective

We judge whether the document addresses the information need, not whether it has those words.


Benchmarks

Common Test Corpora

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters: Reuters-21578, RCV1
20 Newsgroups
……

Relevance judgements are given by human experts.


TREC

The TREC Ad Hoc tasks from the first 8 TRECs are standard IR tasks.
50 detailed information needs per year
Human evaluation of pooled results returned
More recently, other related tracks: QA, Web, Genomics, etc.


TREC

A Query from TREC5

<top>
<num> Number: 225
<desc> Description:
What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
</top>


Ranking-Ignorant Measures

The IR system returns a certain number of documents.

The retrieved documents are regarded as a set.

Retrieval can thus be viewed as classification – each doc is classified/predicted to be either ‘relevant’ or ‘irrelevant’.


Contingency Table

                 Relevant    Not Relevant
Retrieved        tp          fp
Not Retrieved    fn          tn

p = positive; n = negative; t = true; f = false.
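As an illustration (not from the original slides), here is a minimal Python sketch that builds these four counts from a hypothetical corpus, a retrieved set, and a relevant set of document IDs:

```python
# A minimal sketch (hypothetical data): the contingency-table counts from
# a corpus, the set of retrieved doc IDs, and the set of relevant doc IDs.

corpus = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}   # all docs
retrieved = {"d1", "d2", "d3"}                               # returned by the system
relevant = {"d1", "d3", "d6"}                                # human judgements

tp = len(retrieved & relevant)              # relevant and retrieved
fp = len(retrieved - relevant)              # retrieved but not relevant
fn = len(relevant - retrieved)              # relevant but not retrieved
tn = len(corpus - retrieved - relevant)     # neither retrieved nor relevant

print(tp, fp, fn, tn)   # 2 1 1 4
```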


Accuracy

Accuracy = (tp+tn) / (tp+fp+tn+fn)
The fraction of correct classifications.

Not a very useful evaluation measure in IR. Why?
Accuracy puts equal weights on relevant and irrelevant documents.
It is common that the number of relevant documents is very small compared to the total number of documents.
People doing information retrieval want to find something and have a certain tolerance for junk.


Accuracy

Search for: “0 matching results found.”

This Web search engine returns 0 matching results for all queries.
How much time do you need to build it? 1 minute!
How accurate is it? 99.9999%


Precision and Recall

Precision P = tp/(tp+fp)
The fraction of retrieved docs that are relevant: Pr[relevant|retrieved]

Recall R = tp/(tp+fn)
The fraction of relevant docs that are retrieved: Pr[retrieved|relevant]

Recall is a non-decreasing function of the number of docs retrieved.
You can get a perfect recall (but low precision) by retrieving all docs for all queries!
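A minimal sketch of these two definitions (not from the slides), reusing the hypothetical counts from the contingency-table example above:

```python
# Minimal sketch: precision and recall from the contingency-table counts.

def precision(tp: int, fp: int) -> float:
    # fraction of retrieved docs that are relevant
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # fraction of relevant docs that are retrieved
    return tp / (tp + fn) if (tp + fn) else 0.0

tp, fp, fn = 2, 1, 1                         # hypothetical counts from above
print(precision(tp, fp), recall(tp, fn))     # 0.666..., 0.666...
```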


Precision and Recall

Precision/Recall Tradeoff
In a good IR system, precision decreases as recall increases, and vice versa.


F measure

F : weighted harmonic mean of P and R
A combined measure that assesses the precision/recall tradeoff.
The harmonic mean is a conservative average.

F = 1 / ( α(1/P) + (1−α)(1/R) ) = (β²+1)PR / (β²P + R),  where β² = (1−α)/α


F measure

F1 : balanced F measure (with β = 1, i.e. α = ½)
The most popular IR evaluation measure.

F1 = 1 / ( ½(1/P) + ½(1/R) ) = 2PR / (P + R)
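A small illustrative sketch (not from the slides) of the general F measure and its F1 special case, assuming P and R have already been computed:

```python
# Minimal sketch: the weighted harmonic mean of precision P and recall R.
# beta = 1 gives the balanced F1; beta > 1 weights recall more heavily.

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0.0 and r == 0.0:
        return 0.0                           # avoid division by zero
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.4))                   # F1 = 2*0.5*0.4 / (0.5+0.4) ≈ 0.444
print(f_measure(0.5, 0.4, beta=2.0))         # recall-heavy F2 ≈ 0.417
```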


Combined Measures: F1 and Other Averages

[Figure: minimum, maximum, arithmetic mean, geometric mean, and harmonic mean plotted against precision, with recall fixed at 70%.]


F measure – Exercise

[Figure: IR result for query q over documents d1–d5, with the retrieved set and the relevant/irrelevant documents marked. Exercise: compute F1.]


Ranking-Aware Measures

The IR system ranks all docs in decreasing order of their relevance to the query.

Returning different numbers of the top-ranked docs leads to different recall levels (and accordingly different precision values).


Precision-Recall Curve

[Figure: a precision-recall curve; recall (0.0–1.0) on the x-axis, precision (0.0–1.0) on the y-axis.]


Precision-Recall Curve

The interpolated precision at a recall level R is the highest precision found at any recall level at or above R.
Interpolation removes the jiggles in the precision-recall curve.


11-Point Interpolated Average Precision

For each information need, the interpolated precision is measured at 11 recall levels: 0.0, 0.1, 0.2, …, 1.0.
The measured interpolated precisions are averaged (i.e., arithmetic mean) over the set of queries in the benchmark.
A composite precision-recall curve showing 11 points can be graphed.
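As a sketch of the interpolation step for a single information need (hypothetical data; a real evaluation would then average these 11 values over all benchmark queries):

```python
# Minimal sketch (hypothetical ranking): interpolated precision at the 11
# standard recall levels 0.0, 0.1, ..., 1.0 for one information need.
# `ranking` lists relevance judgements by rank; True = relevant.

def eleven_point_interpolated(ranking, num_relevant):
    points = []                                  # (recall, precision) after each rank
    hits = 0
    for k, is_rel in enumerate(ranking, start=1):
        hits += is_rel
        points.append((hits / num_relevant, hits / k))
    levels = [i / 10 for i in range(11)]
    # interpolated precision at level r = highest precision at any recall >= r
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

ranking = [True, False, True, False, False, True]
print(eleven_point_interpolated(ranking, num_relevant=3))
# [1.0, 1.0, 1.0, 1.0, 0.667, 0.667, 0.667, 0.5, 0.5, 0.5, 0.5] (approximately)
```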


11-Point Interpolated Average Precision

[Figure: 11-point interpolated average precision curve for a representative (good) TREC system; recall (0–1) on the x-axis, precision (0–1) on the y-axis.]


Mean Average Precision (MAP)

For one information need, the average precision is the mean of the precision values obtained for the top k docs each time a relevant doc is retrieved.
No use of fixed recall levels. No interpolation.
When a relevant doc is not retrieved at all, its precision value is taken to be 0.
The MAP value for a test collection is then the arithmetic mean of the average precision values for the individual information needs.
Macro-averaging: each query counts equally.
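A minimal sketch of average precision and MAP under these definitions, using hypothetical rankings of boolean relevance judgements:

```python
# Minimal sketch (hypothetical rankings): average precision (AP) for one
# information need and MAP over a set of needs.

def average_precision(ranking, num_relevant):
    # ranking: relevance judgements by rank; num_relevant: total relevant docs
    hits, total = 0, 0.0
    for k, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            total += hits / k                # precision at this relevant doc's rank
    # relevant docs never retrieved contribute 0 to the sum
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    # runs: one (ranking, num_relevant) pair per information need (macro-average)
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

runs = [([True, False, True], 2), ([False, True, False], 1)]
print(mean_average_precision(runs))          # (0.833... + 0.5) / 2 ≈ 0.667
```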


Precision/Recall at k

Prec@k: precision on the top k retrieved docs.
Appropriate for Web search engines: most users scan only the first few (e.g., 10) hyperlinks that are presented.

Rec@k: recall on the top k retrieved docs.
Appropriate for archival retrieval systems: what fraction of the total number of relevant docs did a user find after scanning the first (say 100) docs?
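A small sketch of Prec@k and Rec@k on a hypothetical ranked list of boolean judgements:

```python
# Minimal sketch (hypothetical data): precision and recall at a cutoff k.

def precision_at_k(ranking, k):
    return sum(ranking[:k]) / k                  # relevant docs among the top k

def recall_at_k(ranking, k, num_relevant):
    return sum(ranking[:k]) / num_relevant if num_relevant else 0.0

ranking = [True, False, True, False, True]       # judgements by rank
print(precision_at_k(ranking, 3))                # 2/3
print(recall_at_k(ranking, 3, num_relevant=4))   # 2/4 = 0.5
```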


R-Precision

Precision on the top Rel retrieved docs, where Rel is the size of the set of relevant documents (though perhaps incomplete).
A perfect IR system could score 1 on this metric for each query.
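R-precision reduces to precision at cutoff Rel; a minimal sketch with hypothetical data:

```python
# Minimal sketch (hypothetical data): R-precision = precision at cutoff Rel,
# where Rel is the number of known relevant docs for the query.

def r_precision(ranking, num_relevant):
    if num_relevant == 0:
        return 0.0
    return sum(ranking[:num_relevant]) / num_relevant

print(r_precision([True, False, True, False, True], num_relevant=3))   # 2/3
```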


PRBEP

Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which precision equals recall.
As is obvious from the definitions of precision and recall, the equality is achieved for contingency tables with tp+fp = tp+fn, i.e., when the number of retrieved docs equals the number of relevant docs.
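One way to locate the break-even point on a ranked result list is to sweep the cutoff k and take the point where precision and recall coincide; a minimal sketch with hypothetical data:

```python
# Minimal sketch (hypothetical data): sweep the cutoff k and return the
# precision at the point where precision and recall are closest; they are
# exactly equal when k equals the number of relevant docs (tp+fp = tp+fn).

def breakeven_point(ranking, num_relevant):
    best_gap, best_p = float("inf"), 0.0
    hits = 0
    for k, is_rel in enumerate(ranking, start=1):
        hits += is_rel
        p, r = hits / k, hits / num_relevant
        if abs(p - r) < best_gap:
            best_gap, best_p = abs(p - r), p
    return best_p

print(breakeven_point([True, False, True, False, True], num_relevant=3))   # 2/3
```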


ROC Curve

An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity).
true positive rate = sensitivity = recall = tp/(tp+fn)
false positive rate = fp/(fp+tn) = 1 − specificity
specificity = tn/(fp+tn)
The area under the ROC curve (AUC) summarizes the curve in a single number.
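A minimal sketch (hypothetical judgements) of how the points of an ROC curve can be computed by sweeping a cutoff down a ranking of the whole collection:

```python
# Minimal sketch (hypothetical judgements): the points of an ROC curve,
# obtained by sweeping a cutoff down a ranking of the whole collection.

def roc_points(ranking, num_relevant, num_irrelevant):
    points = [(0.0, 0.0)]                        # (false positive rate, true positive rate)
    tp = fp = 0
    for is_rel in ranking:
        if is_rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_irrelevant,      # FPR = 1 - specificity
                       tp / num_relevant))       # TPR = sensitivity = recall
    return points

print(roc_points([True, False, True, True, False], num_relevant=3, num_irrelevant=2))
```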


ROC Curve

[Figure: an example ROC curve.]


Variance in Performance

It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query.
For a test collection, an IR system may perform terribly on some information needs (e.g., MAP = 0.1) but excellently on others (e.g., MAP = 0.7).
There are easy information needs and hard ones!


Take Home Messages

Evaluation of effectiveness is based on relevance.

Ranking-Ignorant Measures
Accuracy; Precision & Recall; F measure (especially F1)

Ranking-Aware Measures
Precision-Recall curve; 11-Point Interpolated Average Precision; MAP; Precision/Recall at k; R-Precision; PRBEP; ROC curve