
Information Retrieval

Lecture 3: Introduction to Information Retrieval (Manning et al. 2007), Chapter 8

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London


Evaluating an IR System

Expressiveness: the ability of the query language to express complex information needs, e.g., Boolean operators, wildcards, phrases, proximity, etc.

Efficiency: How fast does it index? How large is the index? How fast does it search?

Effectiveness – the key measure: How effectively does it find relevant documents?

Is this search engine good? Which search engine is better?


Relevance

How do we quantify relevance?

A benchmark set of docs (corpus)
A benchmark set of queries
A binary assessment for each query-doc pair: either relevant or irrelevant


Relevance

Relevance should be evaluated according to the information need (which is translated into a query).

[information need] I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.

[query] wine red white heart attack effective

We judge whether the document addresses the information need, not whether it has those words.


Benchmarks

Common Test Corpora

TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Reuters: Reuters-21578, RCV1
20 Newsgroups
……

Relevance judgements are given by human experts.


TREC

The TREC Ad Hoc tasks from the first 8 TRECs are standard IR tasks.
50 detailed information needs per year
Human evaluation of pooled results returned
More recently, other related tracks: QA, Web, Genomics, etc.


TREC

A Query from TREC5

<top>
<num> Number: 225
<desc> Description:
What is the main function of the Federal Emergency Management Agency (FEMA) and the funding level provided to meet emergencies? Also, what resources are available to FEMA such as people, equipment, facilities?
</top>


Ranking-Ignorant Measures

The IR system returns a certain number of documents.

The retrieved documents are regarded as a set.

Retrieval can thus be viewed as classification – each doc is classified/predicted to be either ‘relevant’ or ‘irrelevant’.


Contingency Table

                 Relevant    Not Relevant
Retrieved        tp          fp
Not Retrieved    fn          tn

p = positive; n = negative; t = true; f = false.
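As an illustration (not from the original slides), here is a minimal Python sketch that builds these four counts from a hypothetical corpus, a retrieved set, and a relevant set of document IDs:

```python
# A minimal sketch (hypothetical data): the contingency-table counts from
# a corpus, the set of retrieved doc IDs, and the set of relevant doc IDs.

corpus = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}   # all docs
retrieved = {"d1", "d2", "d3"}                               # returned by the system
relevant = {"d1", "d3", "d6"}                                # human judgements

tp = len(retrieved & relevant)              # relevant and retrieved
fp = len(retrieved - relevant)              # retrieved but not relevant
fn = len(relevant - retrieved)              # relevant but not retrieved
tn = len(corpus - retrieved - relevant)     # neither retrieved nor relevant

print(tp, fp, fn, tn)   # 2 1 1 4
```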


Accuracy

Accuracy = (tp+tn) / (tp+fp+tn+fn)
The fraction of correct classifications.

Not a very useful evaluation measure in IR. Why?
Accuracy puts equal weights on relevant and irrelevant documents.
It is common that the number of relevant documents is very small compared to the total number of documents.
People doing information retrieval want to find something and have a certain tolerance for junk.


Accuracy

Search for: “0 matching results found.”

This Web search engine returns 0 matching results for all queries.
How much time do you need to build it? 1 minute!
How accurate is it? 99.9999%


Precision and Recall

Precision P = tp/(tp+fp)
The fraction of retrieved docs that are relevant: Pr[relevant|retrieved]

Recall R = tp/(tp+fn)
The fraction of relevant docs that are retrieved: Pr[retrieved|relevant]

Recall is a non-decreasing function of the number of docs retrieved.
You can get a perfect recall (but low precision) by retrieving all docs for all queries!
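A minimal sketch of these two definitions (not from the slides), reusing the hypothetical counts from the contingency-table example above:

```python
# Minimal sketch: precision and recall from the contingency-table counts.

def precision(tp: int, fp: int) -> float:
    # fraction of retrieved docs that are relevant
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # fraction of relevant docs that are retrieved
    return tp / (tp + fn) if (tp + fn) else 0.0

tp, fp, fn = 2, 1, 1                         # hypothetical counts from above
print(precision(tp, fp), recall(tp, fn))     # 0.666..., 0.666...
```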


Precision and Recall

Precision/Recall Tradeoff
In a good IR system, precision decreases as recall increases, and vice versa.


F measure

F : weighted harmonic mean of P and R
A combined measure that assesses the precision/recall tradeoff.
The harmonic mean is a conservative average.

F = 1 / ( α(1/P) + (1−α)(1/R) ) = (β²+1)PR / (β²P + R),  where β² = (1−α)/α


F measure

F1 : balanced F measure (with β = 1, i.e. α = ½)
The most popular IR evaluation measure.

F1 = 1 / ( ½(1/P) + ½(1/R) ) = 2PR / (P + R)
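A small illustrative sketch (not from the slides) of the general F measure and its F1 special case, assuming P and R have already been computed:

```python
# Minimal sketch: the weighted harmonic mean of precision P and recall R.
# beta = 1 gives the balanced F1; beta > 1 weights recall more heavily.

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0.0 and r == 0.0:
        return 0.0                           # avoid division by zero
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.4))                   # F1 = 2*0.5*0.4 / (0.5+0.4) ≈ 0.444
print(f_measure(0.5, 0.4, beta=2.0))         # recall-heavy F2 ≈ 0.417
```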


Combined Measures: F1 and Other Averages

[Figure: minimum, maximum, arithmetic mean, geometric mean, and harmonic mean plotted against precision, with recall fixed at 70%.]


F measure – Exercise

[Figure: IR result for query q over documents d1–d5, with the retrieved set and the relevant/irrelevant documents marked. Exercise: compute F1.]


Ranking-Aware Measures

The IR system ranks all docs in decreasing order of their relevance to the query.

Returning different numbers of the top-ranked docs leads to different recall levels (and accordingly different precision values).


Precision-Recall Curve

[Figure: a precision-recall curve; recall (0.0–1.0) on the x-axis, precision (0.0–1.0) on the y-axis.]


Precision-Recall Curve

The interpolated precision at a recall level R is the highest precision found at any recall level at or above R.
Interpolation removes the jiggles in the precision-recall curve.


11-Point Interpolated Average Precision

For each information need, the interpolated precision is measured at 11 recall levels: 0.0, 0.1, 0.2, …, 1.0.
The measured interpolated precisions are averaged (i.e., arithmetic mean) over the set of queries in the benchmark.
A composite precision-recall curve showing 11 points can be graphed.
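As a sketch of the interpolation step for a single information need (hypothetical data; a real evaluation would then average these 11 values over all benchmark queries):

```python
# Minimal sketch (hypothetical ranking): interpolated precision at the 11
# standard recall levels 0.0, 0.1, ..., 1.0 for one information need.
# `ranking` lists relevance judgements by rank; True = relevant.

def eleven_point_interpolated(ranking, num_relevant):
    points = []                                  # (recall, precision) after each rank
    hits = 0
    for k, is_rel in enumerate(ranking, start=1):
        hits += is_rel
        points.append((hits / num_relevant, hits / k))
    levels = [i / 10 for i in range(11)]
    # interpolated precision at level r = highest precision at any recall >= r
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

ranking = [True, False, True, False, False, True]
print(eleven_point_interpolated(ranking, num_relevant=3))
# [1.0, 1.0, 1.0, 1.0, 0.667, 0.667, 0.667, 0.5, 0.5, 0.5, 0.5] (approximately)
```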


11-Point Interpolated Average Precision

[Figure: 11-point interpolated average precision curve for a representative (good) TREC system; recall (0–1) on the x-axis, precision (0–1) on the y-axis.]


Mean Average Precision (MAP)

For one information need, the average precision is the mean of the precision values obtained for the top k docs each time a relevant doc is retrieved.
No use of fixed recall levels. No interpolation.
When a relevant doc is not retrieved at all, its precision value is taken to be 0.
The MAP value for a test collection is then the arithmetic mean of the average precision values for the individual information needs.
Macro-averaging: each query counts equally.
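A minimal sketch of average precision and MAP under these definitions, using hypothetical rankings of boolean relevance judgements:

```python
# Minimal sketch (hypothetical rankings): average precision (AP) for one
# information need and MAP over a set of needs.

def average_precision(ranking, num_relevant):
    # ranking: relevance judgements by rank; num_relevant: total relevant docs
    hits, total = 0, 0.0
    for k, is_rel in enumerate(ranking, start=1):
        if is_rel:
            hits += 1
            total += hits / k                # precision at this relevant doc's rank
    # relevant docs never retrieved contribute 0 to the sum
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    # runs: one (ranking, num_relevant) pair per information need (macro-average)
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

runs = [([True, False, True], 2), ([False, True, False], 1)]
print(mean_average_precision(runs))          # (0.833... + 0.5) / 2 ≈ 0.667
```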


Precision/Recall at k

Prec@k: precision on the top k retrieved docs.
Appropriate for Web search engines: most users scan only the first few (e.g., 10) hyperlinks that are presented.

Rec@k: recall on the top k retrieved docs.
Appropriate for archival retrieval systems: what fraction of the total number of relevant docs did a user find after scanning the first (say 100) docs?
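A small sketch of Prec@k and Rec@k on a hypothetical ranked list of boolean judgements:

```python
# Minimal sketch (hypothetical data): precision and recall at a cutoff k.

def precision_at_k(ranking, k):
    return sum(ranking[:k]) / k                  # relevant docs among the top k

def recall_at_k(ranking, k, num_relevant):
    return sum(ranking[:k]) / num_relevant if num_relevant else 0.0

ranking = [True, False, True, False, True]       # judgements by rank
print(precision_at_k(ranking, 3))                # 2/3
print(recall_at_k(ranking, 3, num_relevant=4))   # 2/4 = 0.5
```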


R-Precision

Precision on the top Rel retrieved docs, where Rel is the size of the set of relevant documents (though perhaps incomplete).
A perfect IR system could score 1 on this metric for each query.
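R-precision reduces to precision at cutoff Rel; a minimal sketch with hypothetical data:

```python
# Minimal sketch (hypothetical data): R-precision = precision at cutoff Rel,
# where Rel is the number of known relevant docs for the query.

def r_precision(ranking, num_relevant):
    if num_relevant == 0:
        return 0.0
    return sum(ranking[:num_relevant]) / num_relevant

print(r_precision([True, False, True, False, True], num_relevant=3))   # 2/3
```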


PRBEP

Given a precision-recall curve, the Precision/Recall Break-Even Point (PRBEP) is the value at which precision equals recall.
As is obvious from the definitions of precision and recall, the equality is achieved for contingency tables with tp+fp = tp+fn, i.e., when the number of retrieved docs equals the number of relevant docs.
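One way to locate the break-even point on a ranked result list is to sweep the cutoff k and take the point where precision and recall coincide; a minimal sketch with hypothetical data:

```python
# Minimal sketch (hypothetical data): sweep the cutoff k and return the
# precision at the point where precision and recall are closest; they are
# exactly equal when k equals the number of relevant docs (tp+fp = tp+fn).

def breakeven_point(ranking, num_relevant):
    best_gap, best_p = float("inf"), 0.0
    hits = 0
    for k, is_rel in enumerate(ranking, start=1):
        hits += is_rel
        p, r = hits / k, hits / num_relevant
        if abs(p - r) < best_gap:
            best_gap, best_p = abs(p - r), p
    return best_p

print(breakeven_point([True, False, True, False, True], num_relevant=3))   # 2/3
```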


ROC Curve

An ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity).
true positive rate = sensitivity = recall = tp/(tp+fn)
false positive rate = fp/(fp+tn) = 1 − specificity
specificity = tn/(fp+tn)
The area under the ROC curve (AUC) summarizes the curve in a single number.
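A minimal sketch (hypothetical judgements) of how the points of an ROC curve can be computed by sweeping a cutoff down a ranking of the whole collection:

```python
# Minimal sketch (hypothetical judgements): the points of an ROC curve,
# obtained by sweeping a cutoff down a ranking of the whole collection.

def roc_points(ranking, num_relevant, num_irrelevant):
    points = [(0.0, 0.0)]                        # (false positive rate, true positive rate)
    tp = fp = 0
    for is_rel in ranking:
        if is_rel:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_irrelevant,      # FPR = 1 - specificity
                       tp / num_relevant))       # TPR = sensitivity = recall
    return points

print(roc_points([True, False, True, True, False], num_relevant=3, num_irrelevant=2))
```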


ROC Curve

[Figure: an example ROC curve.]


Variance in Performance

It is normally the case that the variance in performance of the same system across different queries is much greater than the variance in performance of different systems on the same query.
For a test collection, an IR system may perform terribly on some information needs (e.g., MAP = 0.1) but excellently on others (e.g., MAP = 0.7).
There are easy information needs and hard ones!


Take Home Messages

Evaluation of effectiveness is based on relevance.

Ranking-Ignorant Measures
Accuracy; Precision & Recall; F measure (especially F1)

Ranking-Aware Measures
Precision-Recall curve; 11-Point Interpolated Average Precision; MAP; Precision/Recall at k; R-Precision; PRBEP; ROC curve