
IR Metrics

Metrics Galore

Why Do Users Search?

Audience Exercise

A Model for User Search Behavior

Query Variation in Action

Recall-Based Metrics

Summary

What Would We Like IR Metrics to Measure?

Alistair Moffat

with (most recently) thanks to: Peter Bailey, Falk Scholer, Paul Thomas

NTCIR 2016


Overview

Metrics Galore

Why Do Users Search?

Audience Exercise (you will need your smartphone)

A Model for User Search Behavior

Query Variation in Action

Recall-Based Metrics

Summary


The Library Catalog


Online Search

During the 1970s and 1980s, online “search” was described by Boolean expressions, was guarded by librarians, and cost real money (international phone call to US at ≈ $5/minute).

Students and staff were allowed one search session per year, typically 30 mins of query formulation and trial-and-error, seeking out a “goldilocks” query, determined primarily by answer set size.

Then a “print abstracts” command, and a ten-day wait for airmail from the States. Then library interloan requests.


How Was Search Measured?

To measure search quality, binary relevance labels were assumed.

Precision indicated how satisfied the user was (or, should have been) with what they were given; recall indicated how disappointed the user would be if they could somehow know about the documents they had missed.

Plus, the two could be combined into F1 by taking their harmonic mean.
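As a rough illustration of the set-based measures on this slide, the sketch below computes precision, recall, and F1 for one answer set under binary relevance. The function name and the document identifiers are hypothetical, chosen only for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 under binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical answer set of four documents, two of which are relevant;
# three documents are relevant in total.
p, r, f1 = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d2", "d4", "d7"})
print(p, r, f1)
```

Here precision is 2/4 and recall is 2/3, so F1 sits between them, closer to the smaller value.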


Enter the 1990s

Ranked retrieval emerged as a serious tool in the late 1980s; the first TREC collaboration was in 1992.

Computers and implementation techniques reached the stage where skilled amateurs (that is, academics) could index hundreds of megabytes or even (gasp) small numbers of gigabytes.


Enter the 1990s (1994)


Measuring Ranked Lists

Now there were problems:

- no longer an answer set being generated, and

- no longer possible to just “know” what the correct answers should have been.

On the assumption rankings are (always?) truncated at depth k, could measure using precision@k and (maybe?) recall@k.

A bigger problem is that human nature is to seek just a single number. Fastest. Tallest. Richest.

Best retrieval effectiveness...


Measuring Ranked Lists

Combinations were developed:

- 3-point average precision

- 11-point average precision (interpolated or not)

- (all-relevant-points) average precision (AP).

Plus shallower measures with a lighter judgment load:

- reciprocal rank (RR)

In TREC experimentation through the 1990s AP emerged as the main metric used to compare systems.
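For concreteness, here is a minimal sketch of two of the measures named above, AP and RR, computed over a ranked list of binary relevance values. The function names and the example ranking are illustrative assumptions, and AP is normalized by the number of relevant documents supplied by the caller (or, if omitted, by the number found in the list).

```python
def reciprocal_rank(rels):
    """1 / rank of the first relevant document, or 0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r:
            return 1.0 / rank
    return 0.0


def average_precision(rels, num_relevant=None):
    """Mean of precision@rank taken at each relevant document.

    `num_relevant` is R, the total number of relevant documents for the
    topic; it defaults to the number of relevant documents in `rels`.
    """
    R = sum(rels) if num_relevant is None else num_relevant
    if R == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank     # precision at this rank
    return total / R


ranking = [0, 1, 0, 1, 1]            # hypothetical binary judgments, rank 1 first
print(reciprocal_rank(ranking), average_precision(ranking))
```

RR needs judgments only down to the first relevant document, which is why it carries the lighter judgment load; AP needs R, and so depends on judgments well beyond the documents the user sees.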


Enter the 2000s (1999)

By the late 1990s, free commercial web search was a rapidly growing industry.

And academics were comfortable working with multi-gigabyte text files.

But even so...


Enter the 2000s

More metrics arrived:

- 2002: discounted cumulative gain (DCG), normalized discounted cumulative gain (NDCG) [Järvelin & Kekäläinen]

- 2004: BPref [Buckley & Voorhees]

- 2008: Q-Measure [Sakai & Kando]

- 2008: Rank-biased precision (RBP) [Moffat & Zobel]

- 2009: Expected reciprocal rank (ERR) [Chapelle et al.]

- 2012: Time-biased gain (TBG) [Smucker & Clarke]

Plus variants for: faceted retrieval; inferred relevance based on sampling; etc.



Economics says people act when they can exchange effort for utility; and that if they have a choice of alternatives and all other factors are equal, they will favor the option with the best conversion rate.

For search, utility is measured as relevance, or gain; possibly fractional, possibly context dependent, and possibly personal.

Effort is measured in seconds or minutes (or perhaps brain-Watts); or approximated by surrogate units called documents inspected.



If effort can be represented by documents inspected,

and if all other things are equal,

then users will prefer the search service with the greatest expected gain per document inspected.

Because that is the conversion rate between effort and utility.


Audience Exercise (1)

Get a notepad on your smartphone ready (or find a pen!)...

While visiting Thailand for a beach holiday last year, you decided to visit some local museums to learn more about Thailand's history. You learned many interesting things about the country, including that it was not always called Thailand. What was it called originally?

(a) How many useful web pages do you think you would need to complete the search task?

(b) How many different queries do you think you would need to enter to find that many useful pages?

(c) What would your first query be?



You recently heard a commercial about the health benefits of eating algae, seaweed and kelp. This made you interested in finding out about the positive uses of marine vegetation, both as a source of food, and as a potentially useful drug.


(a) How many useful web pages do you think you would need to complete the search task?

(b) How many different queries do you think you would need to enter to find that many useful pages?

(c) What would your first query be?

Audience Exercise

The point?

For (a) useful documents, put up your hand if you had more documents expected for the “history of thailand” scenario than for the “marine vegetation” one.

For (b) number of queries, put up your hand if you had more queries expected for the “history of thailand” scenario than for the “marine vegetation” one.

Show your two (c) first queries to the person sitting next to you. Both of you put up your hand if both queries were exactly the same.


First Queries, 42 Crowd-Workers

history of thailand (×2), original name of thailand (×2), thailand (×3), thailand first name, thailand former name, thailand name, thailand original name (×3), thailand s history (×3), thailand s original name, thailand wiki (×2), thailands first name, thailands former name, what thailand was called originally, what was thailand called, what was thailand called originally (×9), what was thailand originally called (×6), what was thailand s original name, what was thailands original name (×2), what was the original name of thailand



algae health benefits, algae seaweed kelp nutrition medicine, benefits of eating algae seaweed and kelp, benefits of marine vegetables, benefits to eating algae seaweed and kelp, different application of marine vegetation, edible seaweeds, finding out about the positive uses of marine vegetation, health benefits, health benefits of algae seaweed and kelp, health benefits of marine vegetation, health benefits of seaweed algae kelp food supply medical benefits, is sea veggies really good for you, marine vegetation, marine vegetation algae seaweed kelp, marine vegetation as food or drugs, marine vegetation benefits, marine vegetation food and a drug use, marine vegetation food and drugs, marine vegetation good for health, marine vegetation health benefits, marine vegetation positive effects, . . .



. . . , marine vegetation positive uses, marine vegetation uses, positive uses of marine vegetation (×8), positive uses of marine vegetation as source of food, positive uses of marine vegetation both as a source of food and as a potentially useful drug (×2), research into health benefits of algae seaweed and kelp, the positive uses of marine vegetation, the uses of marine vegetation in food and drugs, uses of algae seaweed and kelp, uses of marine vegetation (×2), what are some good uses of maritime vegetation, what are the benefits of eating algae seaweed and kelp, what are the health benefits of eating algae seaweed and kelp, what are the health benefits of seaweed, what are the positive uses of marine vegetation as a food and a medical treatment, what is the benefit of eating algae seaweed and kelp.


Result Expectations

Information need descriptions – backstories – were written for 180 Q02, R03, and T04 topics (70, 60, and 50, respectively).

Backstories were categorized as one of [Kelly et al., 2015]:

- Remember, tasks that primarily involve factoid-style answers.

- Understand, tasks that involve the construction of meaning, for example through interpreting or exemplifying.

- Analyze, tasks that involve breaking material into parts, and making overall decisions based on how these facets relate to one another.


Result Expectations

After cleansing, 7,969 responses, averaging 44 per backstory.

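Topic                 Workers   Distinct queries   Avg. T   Avg. Q   Type
history of thailand        42                 19     1.77     1.29   R
marine vegetation          47                 38     3.89     2.94   U

Type is the backstory category (R = Remember, U = Understand). T is the number of useful pages a worker expected to need, and Q is the number of queries they expected to enter.

(a) There is unpredictable variation in first queries.

(b) There is predictable variation in the T and Q estimates.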


Distribution of T

[Figure: three panels (Remember, Understand, Analyze). x-axis: estimate of T, binned as 0, 1, 2, 3−5, 6−10, 11−100, 101+; y-axis: proportion of cases, 0.0 to 0.4.]


Distribution of Q

[Figure: three panels (Remember, Understand, Analyze). x-axis: estimate of Q, binned as 1, 2, 3−5, 6−10, 11+; y-axis: proportion of cases, 0.0 to 0.8.]


Goal-Sensitive Evaluation

Cooper [1968]:

“Most measures do not take into account a crucial variable: the amount of material relevant . . . which the user actually needs”.

“A search request is therefore to be conceived in the abstract as involving two parts: a relevance description (normally a subject specification) and a quantity specification. To put it another way, every search request has a definite quantification”.



Back to the expected rate at which gain is accrued.

To compute an expectation, need a probability distribution.

Let W(i) be the probability that the user examines snippet/document i in the ranking while viewing the SERP.

Can we take it as axiomatic that $W(i) \ge W(i+1)$?

A range of evidence says “yes, for the most part”. And that’s why we generate ranked lists, after all.


Measuring Fixations


First Fixations, By Rank

User experiment: 34 users, 6 search topics each.

[Figure: x-axis: rank of first fixation, 1 to 10; y-axis: proportion of cases, 0.0 to 0.5.]


All Fixations, By Rank


[Figure: x-axis: rank of fixation, 1 to 20; y-axis: proportion of cases, 0.00 to 0.20.]


Click-Throughs, By Rank


[Figure: x-axis: rank of click, 1 to 20; y-axis: proportion of cases, 0.0 to 0.3.]


Fixation Progressions – Zero Order

Observed jump probabilities, expressed as fractions of a total of 2,633 overlapping two-fixation observations.

Jump           < −4     −4     −3     −2     −1     +1     +2     +3     +4   > +4
Probability   0.047  0.033  0.049  0.069  0.230  0.347  0.104  0.046  0.032  0.043

Backward jumps total 0.427; forward jumps total 0.573.

Median 1.0, mean 0.15.


Fixation Progressions – First Order

[Figure: heat map of probability (0.0 to 0.5) by first jump and next jump, each ranging from −9 to +9.]



Let W(i) be the probability that the user examines snippet/document i in the ranking while viewing the SERP.

Assume that the user is seeking a gain of T in regard to the current information need.

Assume that the user starts at rank 1, and proceeds through the SERP until they exit, obtaining gain of $r_i$ from rank i.

And let $T_i = T - \sum_{j=1}^{i} r_j$ be the gain still required after i documents have been viewed.



[Figure: flowchart of the user model, with nodes: Info need ($T = ??$, $T_0 = T$); Run query ($i = 0$); view item i ($i = i+1$, $T_i = T_{i-1} - r_i$); Leave SERP.]



[Figure: extended flowchart of the user model, with nodes: Info need ($T = ??$, $T_0 = T$); Run query ($i = 0$); view item i ($i = i+1$, $T_i = T_{i-1} - r_i$); Leave SERP; Next page, or reformulate, or switch; Exit search.]



The metric value is then the weighted relevance:

$$M(r) = \sum_{i=1}^{\infty} W_M(i) \cdot r_i ,$$

where $r_i$ is the real-valued (or binary) utility/gain at rank i.

The units for M(r) are “expected gain per document inspected”.
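A minimal sketch of the weighted-relevance sum, assuming a finite ranking and a weight vector that has already been chosen. The function name is hypothetical; the uniform weights of 1/k over the first k ranks are used only as a simple example, in which case the score reduces to precision@k.

```python
def weighted_precision(weights, gains):
    """M(r) = sum over ranks of W(i) * r_i, for a finite prefix of the ranking."""
    return sum(w * r for w, r in zip(weights, gains))


k = 5
weights = [1.0 / k] * k              # example weights: uniform over the top k ranks
gains = [1, 0, 1, 1, 0, 1, 1]        # hypothetical gains r_1, r_2, ...; ranks past k get no weight

print(weighted_precision(weights, gains))   # three of the top five are relevant
```

Because the weights sum to one, the score can be read as the gain expected from a single document drawn according to where the user looks.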



An alternative way of thinking about the situation:

$$C_M(i) = \frac{W_M(i+1)}{W_M(i)} ,$$

the conditional continuation probability at rank i.
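The relationship between C and W can be run in both directions. The sketch below assumes a finite ranking and assumes that the weights are normalized to sum to one over that prefix (that is, the user is taken never to go deeper); both function names are hypothetical.

```python
def weights_from_continuation(cont):
    """Given C(1), ..., C(n-1), return normalized weights W(1), ..., W(n).

    Uses W(i+1) = W(i) * C(i), starting from an unnormalized W(1) = 1,
    then rescales so the weights sum to one over the n ranks considered.
    """
    raw = [1.0]
    for c in cont:
        raw.append(raw[-1] * c)
    total = sum(raw)
    return [w / total for w in raw]


def continuation_from_weights(weights):
    """Recover C(i) = W(i+1) / W(i); taken as 0 once the weight has reached 0."""
    return [
        weights[i + 1] / weights[i] if weights[i] > 0 else 0.0
        for i in range(len(weights) - 1)
    ]


cont = [0.8, 0.5, 0.5]                       # hypothetical continuation probabilities
weights = weights_from_continuation(cont)
print(weights)
print(continuation_from_weights(weights))    # recovers cont, up to rounding
```

Specifying C(i) is often the more natural choice, since it describes a single decision (continue or stop) rather than a whole distribution.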



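[Figure: flowchart of the user model, with nodes: Info need ($T = ??$, $T_0 = T$); Run query ($i = 0$); view item i ($i = i+1$, $T_i = T_{i-1} - r_i$); Leave SERP. Transitions are labeled with $C(i) = W(i+1)/W(i)$ and $1 - C(i)$.]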


User Models – Examples

(1) Precision@k: $C_{\mathrm{Prec}}(i) = 1$ for $i < k$ and 0 otherwise.

(2) Scaled DCG at k, SDCG@k:

$$C_{\mathrm{SDCG}}(i) = \frac{\log(i+1)}{\log(i+2)}$$

when $1 \le i < k$, and 0 when $i \ge k$. Must be truncated to a fixed depth k if $W_{\mathrm{SDCG}}(i)$ is to be a probability distribution.

(3) Rank-biased precision: $C_{\mathrm{RBP}}(i) = p$.
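The three static user models above can be turned into weight vectors by chaining the continuation probabilities. This is a sketch under stated assumptions: the helper `weights` is hypothetical, it normalizes over a finite `depth`, and for RBP that finite normalization only approximates the infinite-depth weights $(1-p)\,p^{i-1}$.

```python
import math


def c_prec(i, k):
    """Precision@k: always continue until rank k, then stop."""
    return 1.0 if i < k else 0.0


def c_sdcg(i, k):
    """Scaled DCG@k: continue with probability log(i+1)/log(i+2) before rank k."""
    return math.log(i + 1) / math.log(i + 2) if i < k else 0.0


def c_rbp(i, p):
    """Rank-biased precision: constant continuation probability p."""
    return p


def weights(cont_fn, depth):
    """Normalized W(1..depth) from W(i+1) = W(i) * C(i), unnormalized W(1) = 1."""
    raw = [1.0]
    for i in range(1, depth):
        raw.append(raw[-1] * cont_fn(i))
    total = sum(raw)
    return [w / total for w in raw]


print(weights(lambda i: c_prec(i, 3), depth=6))     # uniform over the top 3, then zero
print(weights(lambda i: c_sdcg(i, 4), depth=4))     # proportional to 1 / log(i + 1)
print(weights(lambda i: c_rbp(i, 0.8), depth=500)[:3])
```

Chaining the SDCG continuations telescopes to $\log 2 / \log(i+1)$, which is the familiar DCG discount.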


Adaptive User Models

But, hang on...

Do users really have the same continuation probability, regardless of what they have already seen?

Wouldn’t it be better to take $r_i$ into account when determining C(i)?


Adaptive User Models – Examples

(4) Reciprocal rank is defined for $r_i \in \{0, 1\}$:

$$C_{\mathrm{RR}}(i) = 1 - r_i .$$

(5) Expected reciprocal rank is defined for $r_i \in [0 \ldots 1]$:

$$C_{\mathrm{ERR}}(i) = 1 - r_i .$$

(6) Average precision is defined by

$$C_{\mathrm{AP}}(i) = \frac{\sum_{j=i+1}^{\infty} (r_j/j)}{\sum_{j=i}^{\infty} (r_j/j)} .$$
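To see that the adaptive continuation functions for RR and AP above do reproduce the familiar scores, the sketch below builds the weights from C(i) and evaluates the weighted sum. It assumes binary gains, a finite ranking that contains every relevant document for the topic, and weights normalized over that ranking; the helper names are hypothetical.

```python
def score_from_continuation(gains, cont_fn):
    """Weighted-precision score, with W derived from an adaptive C(i)."""
    raw = [1.0]                                  # unnormalized W(1)
    for i in range(1, len(gains)):
        raw.append(raw[-1] * cont_fn(i, gains))  # W(i+1) = W(i) * C(i)
    total = sum(raw)
    return sum(w / total * r for w, r in zip(raw, gains))


def c_rr(i, gains):
    """Reciprocal rank: stop as soon as a relevant document is seen."""
    return 1.0 - gains[i - 1]


def c_ap(i, gains):
    """Average precision: ratio of the r_j / j tail sums after and from rank i."""
    from_i = sum(r / j for j, r in enumerate(gains, start=1) if j >= i)
    after_i = sum(r / j for j, r in enumerate(gains, start=1) if j > i)
    return after_i / from_i if from_i > 0 else 0.0


gains = [0, 1, 0, 1, 1]                          # hypothetical binary judgments
print(score_from_continuation(gains, c_rr))      # first relevant document is at rank 2
print(score_from_continuation(gains, c_ap))      # matches AP with R = 3
```

Note what the AP user model implies: the decision to continue at rank i depends on relevance at ranks the user has not yet seen.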


Sensitive User Models

But, hang on...

Do users really have the same continuation probability, regardless of what they had initially hoped to find?

Wouldn’t it be better to take T or $T_i$ (or both) into account when determining C(i)?


Adaptive and Sensitive User Models

Is there a formulation for C(i) that:

- is adaptive, and reacts to relevance found

- is sensitive, and can be adjusted to user goals

- is computationally tractable, so that upper and lower bounds can be calculated over prefixes of the ranking

- is complete, and can be computed even when R = 0

- is plausible in terms of the user behavior it models?


Adaptive and Sensitive User Models

Define INST by

$$C_{\mathrm{INST}}(i) = \left( \frac{i + T + T_i - 1}{i + T + T_i} \right)^{2} .$$

Where does this come from? And what does it achieve?
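As a minimal sketch of the definition, the function below walks down a ranking, tracks the remaining target $T_i$, and evaluates $C_{\mathrm{INST}}(i)$ at each rank. The function name and example ranking are hypothetical, and gains are assumed to lie in [0, 1] with T > 0, which keeps the denominator positive.

```python
def inst_continuations(gains, T):
    """C_INST(i) for i = 1..len(gains), with T_i = T - (r_1 + ... + r_i)."""
    remaining = T                     # T_0 = T
    out = []
    for i, r in enumerate(gains, start=1):
        remaining -= r                # T_i = T_{i-1} - r_i
        out.append(((i + T + remaining - 1) / (i + T + remaining)) ** 2)
    return out


# Hypothetical ranking: a relevant document at rank 2, non-relevant elsewhere.
for i, c in enumerate(inst_continuations([0, 1, 0, 0], T=2), start=1):
    print(i, round(c, 4))
```

In this example the value is unchanged at rank 2, where a relevant document is found, and rises at ranks 3 and 4, where nothing useful turns up.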


INST – Adaptive and Sensitive

$$W_{\mathrm{INST}}(i) \;\propto\; \frac{1}{(i + T + T_i - 1)^{2}} \quad \text{(kind of)}$$

The weight of each item is inversely proportional to all of:

- the depth in the ranking

- the number of useful items initially sought

- the number of useful items still being sought.

Squared to give a convergent sequence.



Consequences of the definition:

- If $r_i = 0$, then $C(i) > C(i-1)$. The more the user has invested in a search, the more likely they are to continue it.

- If $r_i = 1$, then $C(i) = C(i-1)$. As users make progress toward their goal, their status remains constant.

- It is always the case that $0 < C(i) < 1$. The user might end the search at any point.



Expected search length for INST, for $r_i \equiv 0$ and $r_i \equiv 1$.

T     Upper ($r_i \equiv 0$)   Lower ($r_i \equiv 1$)
1                       2.58                     1.33
3                       6.53                     3.27
10                     20.51                    10.26
30                     60.50                    30.25

For any given value of T, all other rankings fall between the two limits in terms of expected search depth.
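The two limits can be checked numerically. The sketch below sums the probability of reaching each rank under $C_{\mathrm{INST}}(i) = ((i + T + T_i - 1)/(i + T + T_i))^2$, with the infinite sum truncated at a fixed depth; the function names and the truncation depth are assumptions of this sketch, so results agree with the tabulated values only to about two decimal places.

```python
def inst_continuation(i, T, T_i):
    """C_INST(i) = ((i + T + T_i - 1) / (i + T + T_i)) ** 2."""
    return ((i + T + T_i - 1) / (i + T + T_i)) ** 2


def expected_search_length(T, gain, depth=200_000):
    """Expected number of items viewed when every rank yields the same gain.

    Adds up the probability that each rank is reached, stopping at `depth`
    or once that probability is negligible.
    """
    reach = 1.0        # probability that rank i is viewed; rank 1 always is
    remaining = T      # T_i, the gain still required
    total = 0.0
    for i in range(1, depth + 1):
        total += reach
        remaining -= gain
        reach *= inst_continuation(i, T, remaining)
        if reach < 1e-15:
            break
    return total


for T in (1, 3, 10):
    upper = expected_search_length(T, gain=0)   # nothing relevant is ever found
    lower = expected_search_length(T, gain=1)   # every document is relevant
    print(T, round(upper, 2), round(lower, 2))
```

When every document is relevant, C(i) is the constant $((2T-1)/2T)^2$, so the lower limit is a geometric series; when nothing is relevant the series decays only quadratically, which is why a deep truncation is needed.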



$C_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 3.

[Figure: conditional probability C(i), 0.60 to 1.00, against rank i (axis ticks at 1, 10, 100); two curves, T=3 with $r_i = 0$ and T=3 with $r_i = 1$.]



$C_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 10.

[Figure: conditional probability C(i), 0.60 to 1.00, against rank i (axis ticks at 1, 10, 100); two curves, T=10 with $r_i = 0$ and T=10 with $r_i = 1$.]



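$W_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 3.

[Figure: weight W(i) (axis ticks at 0.01, 0.10, 1.00) against rank i (axis ticks at 1, 10, 100); two curves, T=3 with $r_i = 0$ and T=3 with $r_i = 1$.]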



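$W_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 10.

[Figure: weight W(i) (axis ticks at 0.01, 0.10, 1.00) against rank i (axis ticks at 1, 10, 100); two curves, T=10 with $r_i = 0$ and T=10 with $r_i = 1$.]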


Residuals

A benefit of weighted-precision metrics is that residuals can be computed – the sum of the W(i) values for which $r_i$ is unknown.

The residual associated with a score represents the gap between an “all unjudged are non-relevant ($r_i = 0$)” assessment, and an “all unjudged are relevant ($r_i = 1$)” assessment.

When residuals are large, the evaluation must be regarded as having low credibility.
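A short sketch of a score-plus-residual computation, using rank-biased precision as the example metric because its weights have the simple closed form $W(i) = (1-p)\,p^{i-1}$. The function name and the use of `None` to mark an unjudged document are assumptions made for illustration; everything beyond the end of the supplied list is also treated as unjudged.

```python
def rbp_with_residual(judgments, p=0.8):
    """Return (score, residual) for RBP with persistence p.

    `judgments` holds one entry per rank: a gain in [0, 1], or None if the
    document at that rank has not been judged.
    """
    score = 0.0
    residual = 0.0
    for i, r in enumerate(judgments, start=1):
        w = (1 - p) * p ** (i - 1)    # RBP weight at rank i
        if r is None:
            residual += w             # unknown gain: could add anything up to w
        else:
            score += w * r
    residual += p ** len(judgments)   # total weight of all ranks beyond the list
    return score, residual


score, residual = rbp_with_residual([1, None, 0, 1], p=0.5)
print(score, residual)                # lower bound and the width of the uncertainty
```

The true score then lies somewhere between `score` and `score + residual`.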



Topic R03.356, “postmenopausal estrogen Britain”, 46 queries, 29 distinct, Indri-SDM similarity, NIST qrels, INST.

[Figure: score, 0.0 to 1.0, for each query variant, 0 to 50; legend: residual, score, title query.]



Topic T04.734, “recycling successes”, 44 queries, 25 distinct, Indri-SDM similarity, NIST qrels, INST.

[Figure: score, 0.0 to 1.0, for each query variant, 0 to 50.]



Total of 110 R03 and T04 topics, showing the fraction of user-generated queries that outperform the “title” query.

[Figure: best user query (INST) against TREC title-only query (INST), both 0.0 to 1.0; legend: > 75%, > 50%, > 25%, > 0%, <= 0%.]



Pool growth as users are added (60 R03 topics): documents per topic to be judged ≈ $d\,n^{0.7}$ versus ≈ $d\,n^{0.5}$ for systems.

[Figure: pool size (axis ticks at 10, 100, 1000) against number of users included (axis ticks at 1, 10, 100); one curve for each of d=100, d=50, d=20, d=10.]


Recall-Based Metrics

For some purposes may wish to measure retrieval performance relative to what a perfect system would attain.

Metrics AP, NDCG, and Q-Measure normalize by R, the number of relevant documents for the topic.

By definition, recall-based metrics are divorced from the user experience. Scores based on partial judgments are not lower bounds; nor can residuals be calculated.


Summary

There is a relationship between information seeking task complexity, and anticipated values of T and Q.

Making reasonable assumptions, it is possible to construct a weighted-precision user model – and hence evaluation metric, INST – that is both goal-sensitive and adaptive.

There is vast variation in user queries that is not catered for by current test collections. That means that we are not (yet) able to determine if INST (with per-query T) genuinely offers more refined evaluations than, say, RBP (single p across topics) or ERR (no parameter at all).


Future Work

Things we would love to do:

- Build a test collection that incorporates query variants: see SIGIR 2016.

- Develop an evaluation using either crowd-workers or laboratory subjects to distinguish between the search behaviors modeled by, say, RBP and INST.

- Develop an understanding of query variants and their underlying “potency”, as a step towards an effective query rewriting tool.


Contributors

The majority of the work described here has been undertaken in collaboration with Peter Bailey (Microsoft), Falk Scholer (RMIT University), and Paul Thomas (CSIRO and now Microsoft).

Previous collaborators in this area include William Webber and Justin Zobel (University of Melbourne), and Shane Culpepper (RMIT University).

Xiaolu Lu provided technical assistance with some of the query processing.

Funded by the Australian Research Council (DP110101934 and DP140102655).


Questions (Who? When? Where?)


Sources

This talk was based on a suite of published work:

- Models and Metrics: IR Evaluation as a User Process, ADCS’12

- Users Versus Models: What Observation Tells Us About Effectiveness Metrics, CIKM’13

- What Users Do: The Eyes Have It, AIRS’13

- User Variability and IR System Evaluation, SIGIR’15

- Pooled Evaluation Over Query Variations: Users are as Diverse as Systems, CIKM’15

- INST: An Adaptive Metric for Information Retrieval Evaluation, ADCS’15

- UQV: A Test Collection with Query Variability, SIGIR’16