COSC431 Information Retrieval Evaluation


Page 1

COSC431 Information Retrieval

Evaluation

Page 2

Outline
• Evaluation
  – Efficiency
  – Effectiveness
  – Usability
• Data, information, and knowledge
• Information representation
• Information quality
• Precision and recall
• XML
• Web
• Usability

Page 3

Data
• Datum
  – A fact or number
• Data
  – A collection of facts or numbers; the plural of datum
• Computer data
  – Sequences of 1s and 0s
  – To the editor, your source code is data
  – To the operating system, your program is data
  – To the user, your program’s input and output are data

Page 4

Information
• Information
  – The meaning of data
    • Interpreted by someone
    • Displayed in a meaningful way
• Knowledge
  – Comprehension of information
  – The meaning of information
• E.g. a dictionary
  – Each character on the page is a datum
  – Each definition is information
  – Understanding a definition is knowledge
  – A dictionary in a foreign language (one you do not understand)
    • Is this data, information, or knowledge?

Page 5

Information
• Users depend on information every day
  – Help files, man pages
  – Newspapers, magazines, books
  – Road signs
  – The expiry date on the milk
• Information is needed:
  – To survive each day
  – To accomplish goals
  – To solve problems

Page 6

Knowledge
• Knowledge
  – Information once comprehended
  – Extracted by us from information
  – “What people have”
• To control information is to control knowledge
  – Knowledge is power
• Information retrieval sits between data and knowledge
  – How much is Google worth?
  – Why?
• Search sits between wants and gets
  – How much is eBay worth?

Page 7

Text
• A social contract of representation
  – A recording method for words
• Not common across the whole world or across time
  – English, Russian, and other conventions
  – Old English conventions vs. modern English conventions
• Not common across media
  – TXT vs. written vs. notes
  – Formal vs. informal
  – Emoticons vs. grammatical inflection
  – eBay titles vs. eBay descriptions

Page 8

Diagrams
• An abstract representation of information
• What is this?
  – A benzene ring with a side group
    • The side group contains CH2NH2
• What is this?
  – An algorithm
• What is this?
  – Who wrote it?
• Shown here are three different conventions for the graphical representation of information

Page 9

Information Quality
• Legibility
  – The ability to see the information
• Readability
  – The ability to read the information
    • HWÆT, WE GAR-DEna in geardagum, þeodcyninga þrym gefrunon, hu ða æþelingas ellen fremedon!
• Comprehensibility
  – The ability to understand the information
    • LO, praise of the prowess of people-kings of spear-armed Danes, in days long sped, we have heard, and what honor the athelings won!

Page 10

Information Quality
• Electronically
  – Few legibility issues, but readability and comprehension issues remain
• Deliberate attempts to hide information
  – Patent applications
  – Academic papers
  – Government reports
• Deliberate attempts to be clear
  – Legal documents
  – Government laws
• Deliberate attempts to be comprehensible
  – Newspapers
  – Encyclopedias
• Are there any quantitative measures of quality?

Page 11

Information Quality
• Collections of information
  – Dictionary, thesaurus
  – Encyclopedia
  – Libraries
  – Digital libraries
    • CiteSeer, ScienceDirect, Google Scholar
  – Web search engines
    • Google, Bing, etc.
• Are there any quantitative measures of quality?
• Given a research topic and some sources:
  – Coverage
    • The proportion of the topic discussed by the sources
  – Density
    • The proportion of relevant to irrelevant discussion in the sources

Page 12

User Behavior
• Users do NOT:
  – Formulate a search
  – Convert it into a Boolean query
  – Consult the search engine
  – Print all the results
  – Read them all from beginning to end
• Users:
  – Type a few words into Google (~2.5 on average)
  – Click a few links
  – Re-write their query
  – Keep doing so until an idea jells
  – Keep going until an answer is found
• Users have “information needs”
  – A query is a request to an IR system

Page 13

Precision and Recall
• For a query:
  – Recall
    • The proportion of relevant documents that are found
    • Recall = F / R, where F = the found and relevant documents, R = the relevant documents in the collection
  – Precision
    • The proportion of found documents that are relevant
    • Precision = F / I, where I = the documents identified by the search engine (both measures are sketched in code below)
• How can we determine R?
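A minimal sketch of these two set-based measures, assuming binary judgments are available as a set of document ids; all names here are illustrative, not part of any particular toolkit:

```python
def precision_recall(results, relevant):
    """Set-based precision and recall for one query.

    results  -- list of document ids returned by the search engine (I)
    relevant -- set of document ids judged relevant for the query (R)
    """
    found = [d for d in results if d in relevant]          # F: found and relevant
    precision = len(found) / len(results) if results else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 8 results returned, 4 relevant documents exist, all 4 are found
p, r = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"],
                        {"d5", "d6", "d7", "d8"})
print(p, r)   # 0.5 1.0
```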

Page 14

Judgments
• Given
  – A collection of documents
  – And an information need
• Judgment:
  – A decision as to whether or not the document is relevant to the information need
• Judgments are made by humans
  – Each document is read and comprehended, and the decision is made
  – Judgments for a single information need may take many hours to collect, especially if the collection is large
• It is infeasible for one person to do this
  – Collaborations are essential and include:
    • TREC, INEX, CLEF, NTCIR, FIRE, etc.

Page 15

TREC
• Text REtrieval Conference
  – Run each year at NIST, Gaithersburg, USA (near Washington DC)
• Documents
  – Text (WSJ, 500MB), Web (WT10g, 10GB & .GOV2, 500GB), etc.
  – ClueWeb09 (1 billion web pages, 25TB)
• Topics
  – Title: a few words (generally considered to be the query)
  – Description: one sentence
  – Narrative: a paragraph
  – The IR system makes a query from a topic

Page 16

TREC Relevance
• Relevance
  – “If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant”, even if you’ve seen the same information before
• Pooling method, for each topic (see the sketch below):
  – Merge the top (typically) 100 results from each system
  – Remove duplicates
  – Read and determine binary relevance
• See the works of Zobel et al. on the validity of this method
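A rough sketch of the pooling step, assuming each run is simply an ordered list of document ids; the run names and pool depth are illustrative:

```python
def build_pool(runs, depth=100):
    """Form a judging pool: the union of the top-`depth` documents from each run.

    runs -- dict mapping a run name to its ranked list of document ids
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])   # take the top `depth` results of this run
    return pool                        # set membership removes the duplicates

pool = build_pool({"bm25": ["d3", "d7", "d1"],
                   "boolean": ["d1", "d9"]}, depth=2)
print(sorted(pool))   # ['d1', 'd3', 'd7', 'd9']
```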

Page 17

Boolean Searching
• Recap:
  – Parse the query
  – Build an abstract syntax tree for (anaconda and cat) or (monkey not baboon)
  – Evaluate the query
    • Merge the postings lists (an AND-merge sketch follows the tree below)
  – Result
    • A list of postings, one for each matched document
    • In the order of the postings (database order)

                 or
               /    \
            and      not
           /   \    /    \
    anaconda  cat monkey  baboon
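A minimal sketch of the AND merge over two sorted postings lists (document ids in ascending order); the function and variable names are illustrative:

```python
def intersect(postings_a, postings_b):
    """AND-merge two postings lists, each sorted by document id."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])    # document contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1                          # advance the list with the smaller id
        else:
            j += 1
    return result

print(intersect([1, 3, 5, 9], [3, 4, 5, 10]))   # [3, 5]
```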

Page 18

Ranked Searching
• Recap:
  – Process the query the same way as a Boolean query
  – Sort the results on rsv (retrieval status value), as in the snippet below
  – Result:
    • An ordered document list in expected-relevance order
    • By the probabilistic ranking principle
• How do precision and recall compare
  – For Boolean and ranked retrieval?
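A tiny sketch of that final ranking step, assuming each matched document already carries an rsv score (the scores here are made up):

```python
# results: (document_id, rsv) pairs produced by evaluating the query
results = [("d7", 2.1), ("d2", 5.4), ("d9", 3.3)]

# Ranked retrieval orders the matches by descending retrieval status value
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
print([doc for doc, _ in ranked])   # ['d2', 'd9', 'd7']
```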

Page 19

Precision and Recall
• Boolean and Ranked output are usually different
• In this example, Relevant = 4
• Recall = FoundRelevant / Relevant
  – Recall = 4/4 = 1
• Precision = FoundRelevant / Identified
  – Precision = 4/8 = 0.5

Relevant documents (R) in the top 8 results of each system:

Rank   Boolean   BM25   Probability   Vector
 1        -       R         R           R
 2        -       R         R           -
 3        -       R         -           R
 4        -       R         -           -
 5        R       -         -           R
 6        R       -         R           -
 7        R       -         -           R
 8        R       -         R           -

Page 20

Average Precision
• Average Precision
  – Sum(P-at-reldoc-n) / Relevant
  – Boolean AP = (0.2 + 0.3 + 0.4 + 0.5) / 4 = 0.35
  – BM25 AP = (1.0 + 1.0 + 1.0 + 1.0) / 4 = 1.00
  – Probability AP = (1.0 + 1.0 + 0.5 + 0.5) / 4 = 0.75
  – Vector AP = (1.0 + 0.7 + 0.6 + 0.6) / 4 = 0.73

• Precision at each relevant document (sketched in code below):
  – Boolean: 1/5, 2/6, 3/7, 4/8
  – BM25: 1/1, 2/2, 3/3, 4/4
  – Probability: 1/1, 2/2, 3/6, 4/8
  – Vector: 1/1, 2/3, 3/5, 4/7
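A small sketch of average precision over a single ranked list, assuming binary judgments; the document ids are illustrative:

```python
def average_precision(results, relevant):
    """AP for one query: mean of the precision taken at each relevant document.

    results  -- ranked list of document ids
    relevant -- set of document ids judged relevant (R in the slides)
    """
    found = 0
    precisions = []
    for rank, doc in enumerate(results, start=1):
        if doc in relevant:
            found += 1
            precisions.append(found / rank)   # precision at this relevant document
    return sum(precisions) / len(relevant) if relevant else 0.0

# The Boolean run above: relevant documents appear at ranks 5-8
print(average_precision(["n1", "n2", "n3", "n4", "r1", "r2", "r3", "r4"],
                        {"r1", "r2", "r3", "r4"}))
# (1/5 + 2/6 + 3/7 + 4/8) / 4 ≈ 0.365
```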

Page 21

Graphs
• Graph precision at each relevant document
• Average precision varies greatly from topic to topic
• Results from TREC:
  – Average precision is the integral of the precision/recall graph
  – Wide variation across topics
  – Wide variation across systems

[Figure: “BM25 Precision / Recall Graph” - precision (y-axis) plotted against recall (x-axis)]
[Figure: “Topic by Topic Comparison” - average precision (y-axis) for each topic, topics 129-150 (x-axis)]

Page 22

Mean Average Precision
• We need a quantitative way to compare two systems
  – Average precision only describes the performance of one “ranking algorithm” with respect to one query
  – What about average precision for a set of queries?
• Mean Average Precision (MAP)
  – The mean of the average precision over many queries (see the pseudocode and sketch below)

sum = 0
for each query in a set
    compute average precision
    sum = sum + average precision
map = sum / NumQueries

• With MAP it’s possible to meaningfully compare systems (if the number of queries is large enough)
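A direct Python rendering of the pseudocode, here taking precomputed per-query AP values (which would come from an average_precision routine like the sketch earlier); the topic ids and scores are made up:

```python
def mean_average_precision(ap_by_query):
    """MAP: the mean of the per-query average precision values."""
    return sum(ap_by_query.values()) / len(ap_by_query)

# Illustrative per-topic AP scores for one system
print(mean_average_precision({"topic-129": 0.35, "topic-130": 1.00,
                              "topic-131": 0.75, "topic-132": 0.73}))  # 0.7075
```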

Page 23

Graded Judgments
• But not all information is equally useful
  – How relevant is the information?
  – Different judges disagree on relevance!
• Cystic Fibrosis Collection
  – Judgment of relevance: 2 = Highly, 1 = Marginally, 0 = Not
  – Each document judged by 4 assessors
    • Example relevance: {1, 1, 2, 0}
  – Three medical researchers (M), 98% agreement
  – Compare the “consensus” to a medical bibliographer (B):

                 B.Highly   B.Marginally     B.Not
M.Highly              465             63        72
M.Marginally           38             29        17
M.Not                 272          1,361   119,081

Page 24

Graded Relevance
• Generalized Recall
  – Normalize relevance to (0..1)
  – Before: Recall = FoundRelevant / Relevant
  – r(d) is the relevance of document d
• Generalized Precision
  – Normalize relevance to (0..1)
  – Before: Precision = FoundRelevant / Identified
• Compared to precision and recall:
  – If r(d) always equals one
    • gR = Recall
    • gP = Precision

• The generalized measures (F = found, I = identified, D = the whole collection; a code sketch follows):

  gR = Σ_{d ∈ F} r(d) / Σ_{d ∈ D} r(d)

  gP = Σ_{d ∈ F} r(d) / |I|
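A brief Python sketch of the generalized measures as reconstructed above; treat it as illustrative rather than a reference implementation, and note that all names and data structures are assumptions:

```python
def generalized_pr(found, identified, grades):
    """Generalized precision and recall with graded relevance.

    found      -- set of retrieved documents counted as found (F)
    identified -- number of documents returned by the engine (|I|)
    grades     -- dict: document id -> relevance in [0, 1] over the collection (D)
    """
    gained = sum(grades.get(d, 0.0) for d in found)          # sum of r(d) over F
    g_precision = gained / identified if identified else 0.0
    g_recall = gained / sum(grades.values()) if grades else 0.0
    return g_precision, g_recall

# Two of four graded-relevant documents retrieved in a list of 5 results
print(generalized_pr({"d1", "d3"}, 5, {"d1": 1.0, "d2": 0.5, "d3": 0.5, "d4": 1.0}))
# (0.3, 0.5)
```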

Page 25

Cumulative Gain (CG)
• The sum of the “gain” in information at point p in the results list:

  CG_p = Σ_{i=1}^{p} rel_i

• Where rel_i is the relevance score at point i in the results
  – Typically rel_i is a graded relevance score

Page 26

Discounted Cumulative Gain (DCG)
• But the user is less likely to read a result further down the result list
  – So the score is “discounted” at each point in the list:

  DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i)

  – In this case the discount function is log2
• Alternatively (apparently used in industry):

  DCG_p = Σ_{i=1}^{p} (2^rel_i - 1) / log2(i + 1)

Page 27

Normalized DCG (nDCG)
• But not all queries have the same rel_i scores
  – Some might not have “highly” relevant results
• So it is meaningless to compare DCG across queries
  – Or to average it across queries
    • As some might not have “highly” relevant documents
• However, if the “ideal” results list is known then it is possible to compute the deviation from that:

  nDCG_p = DCG_p / iDCG_p

• Where the ideal results list (iDCG) is a simple ranking of the assessed documents from most to least relevant
  – In other words, nDCG_p is how close we are to the ideal results list (see the sketch below)
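A compact sketch of DCG and nDCG under the classic log2 formulation above (only that variant, not the industry one); it is a sketch, not the official evaluation tooling, and the gain values are made up:

```python
import math

def dcg(relevances):
    """DCG with the log2 discount: rel_1 + sum of rel_i / log2(i) for i >= 2."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """nDCG: DCG of the run divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded gains for the top 5 results of some run (illustrative)
print(round(ndcg([2, 0, 1, 2, 0]), 3))   # ≈ 0.784
```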

Page 28

eBay
• eBay measures “defects” rather than successes
  – Normally measured over the top n results (where n is small):

  defects@n = 1 - P@n

• It also uses a plethora of other measurements
  – DCG measures
  – Gross Merchandise Bought (GMB)
  – Change in various measures (“lift”)

Page 29

XML
• INEX
  – Documents
    • 50GB of Wikipedia articles (a dump taken in 2009)
  – Topics
    • Title, description, narrative
    • The IR system makes a query from a topic
  – Results are:
    • XML elements
    • Passages of text
  – Judging
    • Pooling (top 500)
    • The TREC relevance definition
    • Documents are presented to assessors, who highlight the relevant text
  – Performance
    • Comparison of result elements (or passages) with the highlighted text

Page 30

Web
• More problems
  – How is it possible to get a list of all relevant documents?
  – The web isn’t a closed collection!
• Additional measures
  – Precision at n documents
    • P@10 - the first page
    • P@30 - the first few pages
    • Rank of the first relevant document
  – These measures do not average well
    • There is a large change in P@10 when a non-relevant document in the 10th position is swapped with a relevant one from the 11th position (see the sketch below)
    • So they are not-so-good measures of performance
  – But P@10 is easy to interpret
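A quick sketch of precision at n that also illustrates the swap sensitivity mentioned above; the run and judgments are made up:

```python
def precision_at(n, results, relevant):
    """P@n: the fraction of the top-n results that are relevant."""
    return sum(1 for doc in results[:n] if doc in relevant) / n

relevant = {"r1", "r2", "r3"}
run = ["r1", "n1", "r2", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "r3"]
print(precision_at(10, run, relevant))   # 0.2 (r3 sits just outside, at rank 11)

# Swap ranks 10 and 11: one relevant document moves onto the first page
run[9], run[10] = run[10], run[9]
print(precision_at(10, run, relevant))   # 0.3 - a 50% relative jump
```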

Page 31

Quantitative Analysis
• Assumptions
  – Queries
    • Independent, diverse, representative
  – Documents
    • Static
  – Judgments are
    • Sound - each judgment is correct
    • Complete - there are none missing
  – The same information remains relevant each time it’s seen
    • Even if seen before
  – If a document is partly relevant then it is fully relevant
    • But graded relevance assessments are possible
    • But focused (“yellow highlighter”) assessments are possible

Page 32

Quantitative Analysis
• The Process
  – Choose an appropriate metric
  – Determine the performance of a baseline
  – Determine the performance of a “new” system
  – Compute the difference (hopefully an improvement)
  – Measure significance
• Two ordered lists of per-topic scores (see the sketch after this list):
  – Student’s t-test (Excel / R)
    » Used to determine the probability that the mean and standard deviation of one sample differ from the other by chance
    » Assumes a normal distribution
  – Wilcoxon’s test (R)
    » Assumes a symmetric distribution, but not a normal distribution
• Significant at 5% or 1%
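A minimal sketch of both tests in Python with SciPy (the slide mentions Excel and R; this is just one convenient alternative), using made-up per-topic average precision scores:

```python
from scipy import stats

# Per-topic average precision for the baseline and the "new" system (illustrative)
baseline = [0.21, 0.35, 0.10, 0.44, 0.29, 0.51, 0.18, 0.40]
new_run  = [0.25, 0.38, 0.12, 0.47, 0.27, 0.55, 0.22, 0.45]

# Paired t-test: assumes the per-topic differences are normally distributed
t_stat, t_p = stats.ttest_rel(new_run, baseline)

# Wilcoxon signed-rank test: assumes a symmetric, but not normal, distribution
w_stat, w_p = stats.wilcoxon(new_run, baseline)

print(f"t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
print("significant at 5%" if t_p < 0.05 else "not significant at 5%")
```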

Page 33

Usability
• GUIs
  – There is a plethora of interfaces
  – Which is best?
  – Assume the result order is fixed
• User Studies
  – Video camera
  – Eye tracking
  – Think aloud
  – Questionnaires (satisfaction)
  – Mean time to answer
• It is hard to get statistically significant results
• How is it possible to compare systems like these:

Page 34

Text Interfaces

TileBars, Google

Page 35

Graphical Interfaces

Concept Spaces, Cat-a-cone

Page 36

Self-Organizing Maps

SPIRE, Self-Organizing Maps

Page 37

eBay

Results List, Results Tiles

Page 38

Summary
• This week
  – What is information
  – Information quality
  – IR effectiveness
  – TREC, INEX
  – Usability
• Readings:
  – Järvelin, K., Kekäläinen, J. (2002) Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4):422-446.
  – Moffat, A., Zobel, J. (2008) Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems 27(1): Article 2.