Allan, Ballesteros, Croft, and/or Turtle
Types of Evaluation
• Might evaluate several aspects
– Assistance in formulating queries
– Speed of retrieval
– Resources required
– Presentation of documents
– Ability to find relevant documents
• Evaluation generally comparative
– System A vs. B
– System A vs. A´
• Most common evaluation - retrieval effectiveness
The Concept of Relevance
• Relevance of a document D to a query Q is subjective
– Different users will have different judgments
– Same users may judge differently at different times
– Degree of relevance of different documents may vary
The Concept of Relevance
• In evaluating IR systems it is assumed that:
– A subset of the documents of the database (DB) is relevant
– A document is either relevant or not
Relevance
• In a small collection - the relevance of each document can be checked
• With real collections, never know full set of relevant documents
• Any retrieval model includes an implicit definition of relevance
– Satisfiability of a FOL expression
– Distance
– P(Relevance | query, document)
– P(query | document)
Evaluation

• Set of queries
• Collection of documents (corpus)
• Relevance judgements: which documents are correct and incorrect for each query
[Figure: an example query ("Potato farming and nutritional value of potatoes") judged against documents such as "Mr. Potato Head", "nutritional info for spuds", "potato blight", and "growing potatoes", with each document marked relevant or not]
• If small collection, can review all documents
• Not practical for large collections

Any ideas about how we might approach collecting relevance judgments for very large collections?
Finding Relevant Documents
• Pooling (sketched below)
– Retrieve documents using several automatic techniques
– Judge top n documents for each technique
– Relevant set is union
– Subset of true relevant set
• Possible to estimate size of relevant set by sampling
• When testing:
– How should un-judged documents be treated?
– How might this affect results?
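To make pooling concrete, here is a minimal Python sketch under the assumptions that each run is a ranked list of document IDs and that the pool depth n is fixed; the run names and IDs are hypothetical.

```python
# Pooling sketch: the set of documents to judge is the union of the
# top-n documents retrieved by each technique (run).

def build_pool(runs, n):
    """runs: list of ranked doc-id lists, one per technique; n: pool depth."""
    pool = set()
    for run in runs:
        pool.update(run[:n])   # judge only the top-n of each run
    return pool

# Hypothetical example with three runs and pool depth 2:
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d1"]
run_c = ["d5", "d1", "d2"]
print(build_pool([run_a, run_b, run_c], 2))   # {'d1', 'd2', 'd4', 'd5'}
```

Documents outside the pool are typically treated as non-relevant when scoring, which is exactly the un-judged-documents question raised above.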
Test Collections
• To compare the performance of two techniques:
– each technique used to evaluate same queries
– results (set or ranked list) compared using metric
– most common measures - precision and recall
• Usually use multiple measures to get different views of performance
• Usually test with multiple collections
– performance is collection dependent
Evaluation

[Figure: Venn diagram of the retrieved documents and the relevant documents; the overlap is the Rel & Ret documents]

Recall = |Rel & Ret| / |Relevant|
Ability to return ALL relevant items.

Precision = |Rel & Ret| / |Retrieved|
Ability to return ONLY relevant items.
Let retrieved = 100, relevant = 25, rel & ret = 10
Recall = 10/25 = .40
Precision = 10/100 = .10
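A minimal sketch of these set definitions in Python, with synthetic document sets chosen only to reproduce the worked numbers above:

```python
# Precision and recall from sets, reproducing the slide's example.

def precision_recall(retrieved, relevant):
    rel_and_ret = len(retrieved & relevant)
    return rel_and_ret / len(retrieved), rel_and_ret / len(relevant)

# Synthetic sets: 100 retrieved, 25 relevant, 10 in the intersection.
retrieved = {f"d{i}" for i in range(100)}        # d0..d99
relevant  = {f"d{i}" for i in range(90, 115)}    # 25 docs, 10 overlap
p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.1 0.4
```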
Precision and Recall
• Precision and recall well-defined for sets
• For ranked retrieval
– Compute value at fixed recall points (e.g. precision at 20% recall)
– Compute a P/R point for each relevant document, interpolate
– Compute value at fixed rank cutoffs (e.g. precision at rank 20)
Average Precision for a Query
• Often want a single-number effectiveness measure
• Average precision is widely used in IR
• Calculated by averaging the precision values obtained each time recall increases (i.e., each time a relevant document is retrieved)
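A minimal sketch of this calculation, assuming a ranked list of doc IDs and a set of relevant IDs (both hypothetical); relevant documents never retrieved contribute a precision of 0.

```python
# Average precision for one query: average the precision values at the
# ranks where relevant documents appear (i.e., where recall increases).

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank       # precision at this recall point
    return total / len(relevant)       # unretrieved relevant docs count as 0

# e.g. relevant docs at ranks 1 and 3 of a 4-doc ranking:
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
# (1/1 + 2/3) / 2 ≈ 0.83
```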
Averaging Across Queries
• Hard to compare P/R graphs or tables for individual queries (too much data)
– Need to average over many queries
• Two main types of averaging
– Micro-average - each relevant document is a point in the average (most common)
– Macro-average - each query is a point in the average
• Also done with average precision value (sketched below)
– Average of many queries’ average precision values
– Called mean average precision (MAP)
• “Average average precision” sounds weird
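A short sketch of MAP, reusing the average_precision function from the sketch above; each query contributes exactly one point to the mean, so this is a macro-average.

```python
# MAP sketch: macro-average of per-query average precision values.

def mean_average_precision(queries):
    """queries: list of (ranking, relevant_set) pairs, one per query."""
    aps = [average_precision(ranking, relevant) for ranking, relevant in queries]
    return sum(aps) / len(aps)
```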
Averaging and Interpolation
• Interpolation– actual recall levels of individual queries are seldom equal to
standard levels
– interpolation estimates the best possible performance value between two known values
• e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20
• their precision at actual recall is .25, .22, and .15
– On average, as recall increases, precision decreases
Averaging and Interpolation
• Actual recall levels of individual queries are seldom equal to standard levels
• Interpolated precision at the ith recall level, Ri, is the maximum precision at all points p such that Ri ≤ p ≤ Ri+1
– assume only 3 relevant docs retrieved at ranks 4, 9, 20
– their actual recall points are: .33, .67, and 1.0
– their precision is .25, .22, and .15
– what is interpolated precision at standard recall points?
Recall level         Interpolated Precision
0.0, 0.1, 0.2, 0.3   0.25
0.4, 0.5, 0.6        0.22
0.7, 0.8, 0.9, 1.0   0.15
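A sketch that reproduces the table above, using the common rule that interpolated precision at a standard level is the maximum precision at any actual recall point at or above that level (0 if that recall is never reached):

```python
# Interpolated precision at standard recall levels.

def interpolate(pr_points, levels):
    """pr_points: list of (recall, precision) pairs; levels: standard levels."""
    out = {}
    for level in levels:
        candidates = [p for r, p in pr_points if r >= level]
        out[level] = max(candidates) if candidates else 0.0
    return out

pr_points = [(0.33, 0.25), (0.67, 0.22), (1.0, 0.15)]
levels = [i / 10 for i in range(11)]       # 11-point: 0.0, 0.1, ..., 1.0
print(interpolate(pr_points, levels))
# levels 0.0-0.3 -> 0.25, 0.4-0.6 -> 0.22, 0.7-1.0 -> 0.15
```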
Interpolated Average Precision
• Average precision at standard recall points
• For a given query, compute P/R point for every relevant doc.
• Interpolate precision at standard recall levels
– 11-pt is usually 100%, 90, 80, …, 10, 0% (yes, 0% recall)
– 3-pt is usually 75%, 50%, 25%
• Average over all queries to get average precision at each recall level
• Average the interpolated values across recall levels to get a single result
– Called “interpolated average precision”
• Not used much anymore; “mean average precision” more common
• Values at specific interpolated points still commonly used
Micro-averaging: 1 Qry

Find precision given total number of docs retrieved at given recall value.

• Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
• |Rq| = 10, no. of relevant docs for q
• Ranking of retrieved docs in the answer set of q:

1. d123 (*)    6. d9 (*)     11. d38
2. d84         7. d511       12. d48
3. d56 (*)     8. d129       13. d250
4. d6          9. d187       14. d113
5. d8         10. d25 (*)    15. d3 (*)

10% Recall => .1 * 10 rel docs = 1 rel doc retrieved
One doc retrieved to get 1 rel doc: precision = 1/1 = 100%
20% Recall: .2 * 10 rel docs = 2 rel docs retrieved
3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667
30% Recall: .3 * 10 rel docs = 3 rel docs retrieved
6 docs retrieved to get 3 rel docs: precision = 3/6 = 0.5

What is precision at recall values from 40-100%? (see the sketch below)
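A sketch that computes the whole table for this ranking and answers the 40-100% question; wherever a recall level is never reached, precision at that level is 0.

```python
# Precision at each 10% recall level for the slide's ranking.
# The relevant docs that appear in the ranking are d123, d56, d9, d25, d3.

ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

hits = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        recall = hits / len(relevant)
        print(f"recall {recall:.1f}: precision = {hits}/{rank} = {hits/rank:.2f}")
# recall 0.1: 1/1, 0.2: 2/3, 0.3: 3/6, 0.4: 4/10, 0.5: 5/15;
# recall 0.6-1.0 is never reached, so precision there is 0.
```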
Recall/Precision Curve

• |Rq| = 10, no. of relevant docs for q
• Ranking of retrieved docs in the answer set of q:
[Figure: recall/precision curve for this query, precision (%) vs. recall (%); precision drops as recall increases]
1. d123 (*)    5. d8         9. d187      13. d250
2. d84         6. d9 (*)    10. d25 (*)   14. d113
3. d56 (*)     7. d511      11. d38       15. d3 (*)
4. d6          8. d129      12. d48
Recall   Precision
0.1      1/1  = 100%
0.2      2/3  = 67%
0.3      3/6  = 50%
0.4      4/10 = 40%
0.5      5/15 = 33%
0.6      0%
…        …
1.0      0%
Averaging and Interpolation
• macroaverage - each query is a point in the avg
– can be independent of any parameter
– average of precision values across several queries at standard recall levels
e.g.) assume 3 relevant docs retrieved at ranks 4, 9, 20
– their actual recall points are: .33, .67, and 1.0 (why?)
– their precision is .25, .22, and .15 (why?)
• Average over all relevant docs
– rewards systems that retrieve relevant docs at the top
– (.25 + .22 + .15)/3 ≈ 0.21
Recall-Precision Tables & Graphs

Precision – 44 queries

Recall    Terms   Phrases
0         88.2    90.8 (+2.9)
10        82.4    86.1 (+4.5)
20        77.0    79.8 (+3.6)
30        71.1    75.6 (+5.4)
40        65.1    68.7 (+5.4)
50        60.3    64.1 (+6.2)
60        53.3    55.6 (+4.4)
70        44.0    47.3 (+7.5)
80        37.2    39.0 (+4.6)
90        23.1    26.6 (+15.1)
100       12.7    14.2 (+11.4)
average   55.9    58.9 (+5.3)
[Figure: recall-precision graph of the same data, precision vs. recall, with one curve for Terms and one for Phrases]
Document Level Averages
• Precision after a given number of docs retrieved
– e.g.) 5, 10, 15, 20, 30, 100, 200, 500, & 1000 documents
• Reflects the actual system performance as a user might see it
• Each precision avg is computed by summing precisions at the specified doc cut-off and dividing by the number of queries
– e.g. average precision for all queries at the point where n docs have been retrieved
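A minimal sketch of document-level averages; the (ranking, relevant_set) inputs are hypothetical.

```python
# Precision at a fixed document cutoff, and its average over queries.

def precision_at(ranking, relevant, k):
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def avg_precision_at(queries, k):
    """queries: list of (ranking, relevant_set) pairs, one per query."""
    return sum(precision_at(r, rel, k) for r, rel in queries) / len(queries)
```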
R-Precision
• Precision after R documents are retrieved
– R = number of relevant docs for the query
• Average R-Precision
– mean of the R-Precisions across all queries
e.g.) Assume 2 qrys having 50 & 10 relevant docs; system retrieves 17 and 7 relevant docs in the top 50 and 10 documents retrieved, respectively

R-Precision = (17/50 + 7/10) / 2 = 0.52
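A sketch of R-Precision for a single query; averaging it across queries gives the example value above.

```python
# R-Precision: precision at rank R, where R = number of relevant docs.

def r_precision(ranking, relevant):
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

# Averaging across the two example queries:
# (17/50 + 7/10) / 2 = (0.34 + 0.70) / 2 = 0.52
```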
Evaluation
• Recall-Precision value pairs may co-vary in ways that are hard to understand
• Would like to find composite measures
– A single-number measure of effectiveness
– primarily ad hoc and not theoretically justifiable
• Some attempts invent measures that combine parts of the contingency table into a single-number measure
Contingency Table
                Relevant    Not Relevant
Retrieved          A             B
Not Retrieved      C             D

Relevant = A + C
Retrieved = A + B
Collection size = A + B + C + D

Precision = A/(A+B)
Recall = A/(A+C)
Fallout = B/(B+D) = P(retrieved | not relevant)
Miss = C/(A+C)
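The table's measures as a small sketch over the four cell counts:

```python
# Measures defined from the contingency table cells A, B, C, D.

def table_measures(a, b, c, d):
    return {
        "precision": a / (a + b),
        "recall":    a / (a + c),
        "fallout":   b / (b + d),   # P(retrieved | not relevant)
        "miss":      c / (a + c),   # P(not retrieved | relevant)
    }
```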
Symmetric Difference
A is the retrieved set of documents
B is the relevant set of documents

A Δ B (the symmetric difference) is the shaded area

[Figure: Venn diagram of A and B with the symmetric difference shaded]
E measure (van Rijsbergen)
• used to emphasize precision or recall
– like a weighted average of precision and recall
• large α increases importance of precision
– can transform by α = 1/(β² + 1), β = P/R
– when α = 1/2, β = 1; precision and recall are equally important
• E = normalized symmetric difference of retrieved and relevant sets
– E(β=1) = |A Δ B| / (|A| + |B|)
• F = 1 − E is typical (good results mean larger values of F)
E = 1 − 1 / ( α(1/P) + (1 − α)(1/R) )

F = (β² + 1)PR / (β²P + R)
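A sketch of both formulas; the example checks that F = 1 − E when α = 1/2 (i.e., β = 1), where F reduces to the harmonic mean of precision and recall.

```python
# van Rijsbergen's E measure and the F measure, as defined above.

def e_measure(p, r, alpha=0.5):
    return 1 - 1 / (alpha * (1 / p) + (1 - alpha) * (1 / r))

def f_measure(p, r, beta=1.0):
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# With alpha = 1/2 (beta = 1), F = 1 - E:
print(f_measure(0.5, 0.25))        # 0.333...
print(1 - e_measure(0.5, 0.25))    # same value
```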
Expected Search Length

• Evaluation is based on type of information need, e.g.)
1. only one relevant document required
2. some arbitrary number n
3. all relevant documents
4. a given proportion of relevant documents…..
• Search strategy output assumed to be weak ordering
• Simple ordering means never have two or more documents at the same level of the ordering
• Search length in a simple ordering is the number of non-relevant documents a user must scan before the information need is satisfied
• Expected search length appropriate for weak ordering
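A minimal sketch of search length for the simple-ordering case and the information need "find n relevant documents"; extending this to expected search length over a weak ordering would average over the possible positions of documents tied at the same level.

```python
# Search length in a simple ordering: number of non-relevant docs a user
# scans before the need (here: find n relevant docs) is satisfied.

def search_length(ranking, relevant, n):
    non_rel, found = 0, 0
    for doc in ranking:
        if doc in relevant:
            found += 1
            if found == n:
                return non_rel
        else:
            non_rel += 1
    return non_rel   # need not satisfied within this ranking

# Hypothetical example, need 2 relevant docs:
print(search_length(["d1", "d2", "d3", "d4"], {"d2", "d4"}, 2))   # 2
```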
Other Single-Valued Measures
• Breakeven point
– point at which precision = recall
• Swets model
– use statistical decision theory to express recall, precision, and fallout in terms of conditional probabilities
• Utility measures
– assign costs to each cell in the contingency table
– sum (or average) costs for all queries
• Many others...