IR Metrics
Metrics Galore
Why Do Users Search?
Audience Exercise
A Model for User Search Behavior
Query Variation in Action
Recall-Based Metrics
Summary
What Would We Like IR Metrics to Measure?
Alistair Moffat
with (most recently) thanks to: Peter Bailey, Falk Scholer, Paul Thomas
NTCIR 2016
Overview
Metrics Galore
Why Do Users Search?
Audience Exercise (you will need your smartphone)
A Model for User Search Behavior
Query Variation in Action
Recall-Based Metrics
Summary
The Library Catalog
Online Search
During the 1970s and 1980s, online “search” was described by Boolean expressions, was guarded by librarians, and cost real money (international phone call to US at ≈ $5/minute).
Students and staff were allowed one search session per year, typically 30 mins of query formulation and trial-and-error, seeking out a “goldilocks” query, determined primarily by answer set size.
Then a “print abstracts” command, and a ten-day wait for airmail from the States. Then library interloan requests.
How Was Search Measured?
To measure search quality, binary relevance labels were assumed.
Precision indicated how satisfied the user was (or, should have been) with what they were given; recall indicated how disappointed the user would be if they could somehow know about the documents they had missed.
Plus, the two could be combined into F1 by taking their harmonic mean.
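As a minimal sketch of these set-based measures, assuming a Boolean answer set and a known set of relevant documents (the function name and the document identifiers are hypothetical, chosen only for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one Boolean answer set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)            # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall (0 if both are 0).
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical answer set of four documents, two of the three relevant ones found.
p, r, f1 = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
```

Note that recall needs the full set of relevant documents, which is the part a searcher can never observe directly.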
Enter the 1990s
Ranked retrieval emerged as a serious tool in the late 1980s; the first TREC collaboration was in 1992.
Computers and implementation techniques reached the stage where skilled amateurs (that is, academics) could index hundreds of megabytes or even (gasp) small numbers of gigabytes.
Enter the 1990s (1994)
Measuring Ranked Lists
Now there were problems:
- no longer an answer set being generated, and
- no longer possible to just “know” what the correct answers should have been.
On the assumption rankings are (always?) truncated at depth k, could measure using precision@k and (maybe?) recall@k.
A bigger problem is that human nature is to seek just a single number. Fastest. Tallest. Richest.
Best retrieval effectiveness...
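The truncated measures mentioned above can be sketched as follows, assuming binary judgments listed from rank 1 downward. The function names, the `num_relevant` parameter, and the example judgments are hypothetical.

```python
def precision_at_k(rels, k):
    """Fraction of the top k results that are relevant (binary rels, rank 1 first)."""
    return sum(rels[:k]) / k


def recall_at_k(rels, k, num_relevant):
    """Fraction of the topic's relevant documents that appear in the top k.

    num_relevant (R) has to come from outside the ranking, which is the
    quantity that is hard to "know" for a large collection.
    """
    return sum(rels[:k]) / num_relevant if num_relevant else 0.0


rels = [1, 0, 1, 0, 0, 1]            # hypothetical judgments for one ranking
p3 = precision_at_k(rels, 3)         # two of the top three are relevant
r3 = recall_at_k(rels, 3, 4)         # assuming four relevant documents exist
```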
Measuring Ranked Lists
Combinations were developed:
- 3-point average precision
- 11-point average precision (interpolated or not)
- (all-relevant-points) average precision (AP).
Plus shallower measures with a lighter judgment load:
- reciprocal rank (RR)
In TREC experimentation through the 1990s AP emerged as the main metric used to compare systems.
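For reference, a minimal sketch of the two measures that recur later in the talk, AP and RR, computed from binary judgments listed from rank 1. The function names are hypothetical, and `num_relevant` defaults to the count found in the ranking, which is an assumption rather than how R is obtained in a pooled evaluation.

```python
def average_precision(rels, num_relevant=None):
    """AP: mean of precision@j over the ranks j that hold a relevant document."""
    R = sum(rels) if num_relevant is None else num_relevant
    if R == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank      # precision at this relevant rank
    return total / R


def reciprocal_rank(rels):
    """RR: one over the rank of the first relevant document, or 0 if none."""
    for rank, r in enumerate(rels, start=1):
        if r:
            return 1.0 / rank
    return 0.0


ap = average_precision([1, 0, 1, 0, 0])   # (1/1 + 2/3) / 2
rr = reciprocal_rank([0, 0, 1])           # first relevant document at rank 3
```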
Enter the 2000s (1999)
By the late 1990s, free commercial web search was a rapidly growing industry.
And academics were comfortable working with multi-gigabyte text files.
But even so...
Enter the 2000s
More metrics arrived:
- 2002: discounted cumulative gain (DCG), normalized discounted cumulative gain (NDCG) [Järvelin & Kekäläinen]
- 2004: BPref [Buckley & Voorhees]
- 2008: Q-Measure [Sakai & Kando]
- 2008: Rank-biased precision (RBP) [Moffat & Zobel]
- 2009: Expected reciprocal rank (ERR) [Chapelle et al.]
- 2012: Time-biased gain (TBG) [Smucker & Clarke]
Plus variants for: faceted retrieval; inferred relevance based on sampling; etc.
Why Do Users Search?
Economics says people act when they can exchange effort for utility; and that if they have a choice of alternatives and all other factors are equal, they will favor the option with the best conversion rate.
For search, utility is measured as relevance, or gain; possibly fractional, possibly context dependent, and possibly personal.
Effort is measured in seconds or minutes (or perhaps brain-Watts); or approximated by surrogate units called documents inspected.
Why Do Users Search?
If effort can be represented by documents inspected,
and if all other things are equal,
then users will prefer the search service with the greatest expected gain per document inspected.
Because that is the conversion rate between effort and utility.
Audience Exercise (1)
Get a notepad on your smartphone ready (or find a pen!)...
While visiting Thailand for a beach holiday last year, you decided to visit some local museums to learn more about Thailand's history. You learned many interesting things about the country, including that it was not always called Thailand. What was it called originally?
(a) How many useful web pages do you think you would need to complete the search task?
(b) How many different queries do you think you would need to enter to find that many useful pages?
(c) What would your first query be?
Audience Exercise (2)
You recently heard a commercial about the health benefits of eating algae, seaweed and kelp. This made you interested in finding out about the positive uses of marine vegetation, both as a source of food, and as a potentially useful drug.
(a) How many useful web pages do you think you would need to complete the search task?
(b) How many different queries do you think you would need to enter to find that many useful pages?
(c) What would your first query be?
Audience Exercise
The point?
For (a) useful documents, put up your hand if you had more documents expected for the “history of thailand” scenario than for the “marine vegetation” one.
For (b) number of queries, put up your hand if you had more queries expected for the “history of thailand” scenario than for the “marine vegetation” one.
Show your two (c) first queries to the person sitting next to you. Both of you put up your hand if both queries were exactly the same.
First Queries, 42 Crowd-Workers
history of thailand (×2), original name of thailand (×2), thailand
(×3), thailand first name, thailand former name, thailand name,
thailand original name (×3), thailand s history (×3), thailand s
original name, thailand wiki (×2), thailands first name, thailands
former name, what thailand was called originally, what was
thailand called, what was thailand called originally (×9), what was
thailand originally called (×6), what was thailand s original name,
what was thailands original name (×2), what was the original
name of thailand
First Queries, 47 Crowd-Workers
algae health benefits, algae seaweed kelp nutrition medicine,
benefits of eating algae seaweed and kelp, benefits of marine
vegetables, benefits to eating algae seaweed and kelp, different
application of marine vegetation, edible seaweeds, finding out
about the positive uses of marine vegetation, health benefits,
health benefits of algae seaweed and kelp, health benefits of
marine vegetation, health benefits of seaweed algae kelp food
supply medical benefits, is sea veggies really good for you,
marine vegetation, marine vegetation algae seaweed kelp, marine
vegetation as food or drugs, marine vegetation benefits, marine
vegetation food and a drug use, marine vegetation food and drugs,
marine vegetation good for health, marine vegetation health
benefits, marine vegetation positive effects, . . .
First Queries, 47 Crowd-Workers
. . . , marine vegetation positive uses, marine vegetation uses,
positive uses of marine vegetation (×8), positive uses of marine
vegetation as source of food, positive uses of marine vegetation
both as a source of food and as a potentially useful drug (×2),
research into health benefits of algae seaweed and kelp, the
positive uses of marine vegetation, the uses of marine vegetation
in food and drugs, uses of algae seaweed and kelp, uses of marine
vegetation (×2), what are some good uses of maritime vegetation,
what are the benefits of eating algae seaweed and kelp, what are
the health benefits of eating algae seaweed and kelp, what are the
health benefits of seaweed, what are the positive uses of marine
vegetation as a food and a medical treatment, what is the benefit
of eating algae seaweed and kelp.
Result Expectations
Information need descriptions – backstories – were written for 180 Q02, R03, and T04 topics (70, 60, 50 respectively).
Backstories were categorized as one of [Kelly et al., 2015]:
- Remember, tasks that primarily involve factoid-style answers.
- Understand, tasks that involve the construction of meaning, for example through interpreting or exemplifying.
- Analyze, tasks that involve breaking material into parts, and making overall decisions based on how these facets relate to one another.
Result Expectations
After cleansing, 7,969 responses, averaging 44 per backstory.

Topic                Workers  Queries  Avg. T  Avg. Q  Type
history of thailand       42       19    1.77    1.29  R
marine vegetation         47       38    3.89    2.94  U

Type: R = Remember, U = Understand.
(a) There is unpredictable variation in first queries.
(b) There is predictable variation in the T and Q estimates.
Distribution of T
[Figure: histograms of the estimate of T (bins 0, 1, 2, 3−5, 6−10, 11−100, 101+) against proportion of cases (axis 0.0 to 0.4), one panel each for Remember, Understand, and Analyze.]
Distribution of Q
[Figure: histograms of the estimate of Q (bins 1, 2, 3−5, 6−10, 11+) against proportion of cases (axis 0.0 to 0.8), one panel each for Remember, Understand, and Analyze.]
Goal-Sensitive Evaluation
Cooper [1968]:
“Most measures do not take into account a crucial variable: the amount of material relevant . . . which the user actually needs”.
“A search request is therefore to be conceived in the abstract as involving two parts: a relevance description (normally a subject specification) and a quantity specification. To put it another way, every search request has a definite quantification”.
A Model for User Search Behavior
Back to the expected rate at which gain is accrued.
To compute an expectation, need a probability distribution.
Let W(i) be the probability that the user examines snippet/document i in the ranking while viewing the SERP.
Can we take it as axiomatic that W(i) ≥ W(i + 1)?
A range of evidence says “yes, for the most part”. And that’s why we generate ranked lists, after all.
Measuring Fixations
First Fixations, By Rank
User experiment: 34 users, 6 search topics each.
[Figure: proportion of cases (axis 0.0 to 0.5) against rank of first fixation, ranks 1 to 10.]
All Fixations, By Rank
User experiment: 34 users, 6 search topics each.
[Figure: proportion of cases (axis 0.00 to 0.20) against rank of fixation, ranks 1 to 20.]
Click-Throughs, By Rank
User experiment: 34 users, 6 search topics each.
[Figure: proportion of cases (axis 0.0 to 0.3) against rank of click, ranks 1 to 20.]
Fixation Progressions – Zero Order
Observed jump probabilities, expressed as fractions of a total of 2,633 overlapping two-fixation observations.

Jump       < −4    −4     −3     −2     −1     +1     +2     +3     +4    > +4
Fraction   0.047  0.033  0.049  0.069  0.230  0.347  0.104  0.046  0.032  0.043

Backward jumps total 0.427; forward jumps total 0.573.
Median 1.0, mean 0.15.
Fixation Progressions – First Order
[Figure: heat map over pairs of consecutive jumps; axes are first jump and next jump, each from −9 to +9; shading shows probability, 0.0 to 0.5.]
A Model for User Search Behavior
Let W(i) be the probability that the user examines snippet/document i in the ranking while viewing the SERP.
Assume that the user is seeking a gain of T in regard to the current information need.
Assume that the user starts at rank 1, and proceeds through the SERP until they exit, obtaining gain of $r_i$ from rank i.
And let $T_i = T - \sum_{j=1}^{i} r_j$ be the gain still required after i documents have been viewed.
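The bookkeeping for the gain still required is a running subtraction. A small sketch, with a hypothetical function name and example gains:

```python
from itertools import accumulate


def remaining_gain(T, gains):
    """Return [T_1, T_2, ...] where T_i = T - (r_1 + ... + r_i)."""
    return [T - seen for seen in accumulate(gains)]


# Hypothetical user seeking T = 3 units of gain, binary gains by rank.
still_needed = remaining_gain(3, [1, 0, 1, 1, 0])
```

Nothing in the definition stops the remaining gain from going negative if the ranking supplies more than the user set out to find.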
A Model for User Search Behavior
[Flowchart: Info need, T = ??, $T_0 = T$ → Run query, i = 0 → i = i+1, view item i, $T_i = T_{i-1} - r_i$ (repeated) → Leave SERP.]
A Model for User Search Behavior
[Flowchart: Info need, T = ??, $T_0 = T$ → Run query, i = 0 → i = i+1, view item i, $T_i = T_{i-1} - r_i$ (repeated) → Leave SERP; on leaving the SERP the user either chooses “Next page, or reformulate, or switch”, or reaches “Exit search”.]
A Model for User Search Behavior
The metric value is then the weighted relevance:
$$M(r) = \sum_{i=1}^{\infty} W_M(i) \cdot r_i ,$$
where $r_i$ is the real-valued (or binary) utility/gain at rank i.
The units for M(r) are “expected gain per document inspected”.
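A minimal sketch of the weighted sum, using Precision@k as the simplest instance: its weights are 1/k for the first k ranks and zero beyond. The function name and the gain vector are hypothetical.

```python
def weighted_precision(weights, gains):
    """M(r) = sum over ranks of W(i) * r_i, for the ranks that carry weight."""
    return sum(w * r for w, r in zip(weights, gains))


k = 5
prec_weights = [1.0 / k] * k          # W(i) = 1/k for i <= k, nothing below rank k
gains = [1, 0, 1, 1, 0, 1, 1]         # hypothetical gains, rank 1 first
score = weighted_precision(prec_weights, gains)   # equals precision@5 here
```

Because the weights sum to one, the score stays on the same scale as the gains themselves.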
A Model for User Search Behavior
An alternative way of thinking about the situation:
$$C_M(i) = \frac{W_M(i + 1)}{W_M(i)} ,$$
the conditional continuation probability at rank i.
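The relationship can also be read in the other direction. As a sketch, assuming the weights form a probability distribution and the sum converges, unrolling the ratio gives the weights back from the continuation probabilities:

```latex
W_M(i) = W_M(1) \prod_{j=1}^{i-1} C_M(j),
\qquad
W_M(1) = \left( \sum_{i=1}^{\infty} \prod_{j=1}^{i-1} C_M(j) \right)^{-1}
```

The product is the probability that the user reaches rank i, and the second expression is the normalization that makes the weights sum to one.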
A Model for User Search Behavior
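[Flowchart: Info need, T = ??, $T_0 = T$ → Run query, i = 0 → i = i+1, view item i, $T_i = T_{i-1} - r_i$; after each item the user continues to the next item with probability C(i) = W(i+1)/W(i), or leaves the SERP with probability 1 − C(i).]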
User Models – Examples
(1) Precision@k: $C_{\mathrm{Prec}}(i) = 1$ for i < k and 0 otherwise.
(2) Scaled DCG at k, SDCG@k:
$$C_{\mathrm{SDCG}}(i) = \frac{\log(i + 1)}{\log(i + 2)}$$
when 1 ≤ i < k, and 0 when i ≥ k. Must be truncated to a fixed depth k if $W_{\mathrm{SDCG}}(i)$ is to be a probability distribution.
(3) Rank-biased precision: $C_{\mathrm{RBP}}(i) = p$.
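The three static models can be sketched as continuation functions, with weights obtained by multiplying the continuation probabilities together. All function names, and the choices k = 5 and p = 0.8, are hypothetical. For RBP the normalizing constant 1 − p comes from the geometric series over an unbounded ranking.

```python
import math


def c_prec(i, k):
    """Precision@k: always continue until rank k, then stop."""
    return 1.0 if i < k else 0.0


def c_sdcg(i, k):
    """Scaled DCG@k: log-ratio continuation, truncated at rank k."""
    return math.log(i + 1) / math.log(i + 2) if i < k else 0.0


def c_rbp(i, p):
    """Rank-biased precision: the same continuation probability at every rank."""
    return p


def viewing_probabilities(C, depth):
    """V(i) = C(1) * ... * C(i-1), the chance that the user reaches rank i."""
    view = [1.0]
    for i in range(1, depth):
        view.append(view[-1] * C(i))
    return view


def normalize(view):
    """Scale viewing probabilities so that the weights sum to one."""
    total = sum(view)
    return [v / total for v in view]


k, p = 5, 0.8
w_prec = normalize(viewing_probabilities(lambda i: c_prec(i, k), k))
w_sdcg = normalize(viewing_probabilities(lambda i: c_sdcg(i, k), k))
# RBP is not truncated, so normalize with the closed form (1 - p) instead.
w_rbp = [(1 - p) * v for v in viewing_probabilities(lambda i: c_rbp(i, p), 50)]
```

The SDCG weights come out proportional to 1/log(i + 1), the usual DCG discount, which is a quick check that the continuation function has been transcribed correctly.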
Adaptive User Models
But, hang on...
Do users really have the same continuation probability, regardless of what they have already seen?
Wouldn’t it be better to take $r_i$ into account when determining C(i)?
Adaptive User Models – Examples
(4) Reciprocal rank is defined for $r_i \in \{0, 1\}$:
$$C_{\mathrm{RR}}(i) = 1 - r_i .$$
(5) Expected reciprocal rank is defined for $r_i \in [0 \ldots 1]$:
$$C_{\mathrm{ERR}}(i) = 1 - r_i .$$
(6) Average precision is defined by
$$C_{\mathrm{AP}}(i) = \frac{\sum_{j=i+1}^{\infty} (r_j / j)}{\sum_{j=i}^{\infty} (r_j / j)} .$$
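As a check on definitions (4) and (6), the sketch below scores a ranking by following an adaptive continuation function and normalizing by the total viewing probability. It assumes binary gains and that every relevant document appears in the evaluated ranking; the function names are hypothetical. Under those assumptions the two user models reproduce the familiar RR and AP values.

```python
def c_rr(i, rels):
    """C_RR(i): stop as soon as a relevant document has been seen."""
    return 1.0 - rels[i - 1]


def c_ap(i, rels):
    """C_AP(i): ratio of the r_j/j mass below rank i to the mass from rank i on."""
    def tail(start):
        return sum(r / j for j, r in enumerate(rels, start=1) if j >= start)
    denom = tail(i)
    return tail(i + 1) / denom if denom > 0 else 0.0


def score(C, rels):
    """Expected gain per document inspected under continuation function C."""
    view, total_view, total_gain = 1.0, 0.0, 0.0
    for i, r in enumerate(rels, start=1):
        total_view += view            # accumulate V(i), the normalizer
        total_gain += view * r        # accumulate V(i) * r_i
        view *= C(i, rels)            # probability of going on to rank i + 1
    return total_gain / total_view


rels = [0, 1, 0, 1]                   # hypothetical binary judgments
rr_model = score(c_rr, rels)          # first relevant at rank 2
ap_model = score(c_ap, rels)          # (1/2 + 2/4) / 2
```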
Sensitive User Models
But, hang on...
Do users really have the same continuation probability, regardless of what they had initially hoped to find?
Wouldn’t it be better to take T or $T_i$ (or both) into account when determining C(i)?
Adaptive and Sensitive User Models
Is there a formulation for C(i) that:
- is adaptive, and reacts to relevance found
- is sensitive, and can be adjusted to user goals
- is computationally tractable, so that upper and lower bounds can be calculated over prefixes of the ranking
- is complete, and can be computed even when R = 0
- is plausible in terms of the user behavior it models?
Adaptive and Sensitive User Models
Define INST by
$$C_{\mathrm{INST}}(i) = \left( \frac{i + T + T_i - 1}{i + T + T_i} \right)^2 .$$
Where does this come from? And what does it achieve?
INST – Adaptive and Sensitive
$$W_{\mathrm{INST}}(i) \;\propto\; \text{(kind of)} \quad \frac{1}{(i + T + T_i - 1)^2}$$
The weight of each item is inversely proportional to all of:
- the depth in the ranking
- the number of useful items initially sought
- the number of useful items still being sought.
Squared to give a convergent sequence.
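A rough sketch of how an INST score might be computed from the continuation function, assuming gains in [0, 1] and T ≥ 1. The function name and the `horizon` parameter are hypothetical: ranks beyond the supplied gains are treated as gain 0 and followed to a fixed depth purely to approximate the normalizing sum, which is a simplification rather than the authors' implementation.

```python
import math


def inst_score(gains, T, horizon=100_000):
    """Approximate INST: expected gain per document inspected under C_INST.

    gains   -- r_1, r_2, ... for the evaluated prefix, rank 1 first
    T       -- amount of gain the user is assumed to be seeking
    horizon -- hypothetical cutoff used to approximate the infinite normalizer
    """
    view = 1.0           # V(i): probability that the user reaches rank i
    remaining = T        # holds T_{i-1} at the top of each iteration
    total_view = 0.0     # running sum of V(i)
    total_gain = 0.0     # running sum of V(i) * r_i
    for i in range(1, horizon + 1):
        r = gains[i - 1] if i <= len(gains) else 0.0
        total_view += view
        total_gain += view * r
        remaining -= r                        # T_i = T_{i-1} - r_i
        denom = i + T + remaining
        view *= ((denom - 1) / denom) ** 2    # multiply in C_INST(i)
    return total_gain / total_view


early = inst_score([1, 0, 0], T=3)    # relevant document at rank 1
late = inst_score([0, 0, 1], T=3)     # same document pushed down to rank 3
```

With T = 1 and a single relevant document at rank 1, the viewing probabilities work out to 1/i², so the score should approach 6/π², which gives a convenient sanity check.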
INST – Adaptive and Sensitive
Consequences of the definition:
- If $r_i = 0$, then C(i) > C(i − 1). The more the user has invested in a search, the more likely they are to continue it.
- If $r_i = 1$, then C(i) = C(i − 1). As users make progress toward their goal, their status remains constant.
- It is always the case that 0 < C(i) < 1. The user might end the search at any point.
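The first two consequences follow from how the quantity $i + T + T_i$ in the definition of INST changes from one rank to the next. A short worked step, using only the definitions of $T_i$ and of the INST continuation probability:

```latex
\begin{align*}
r_i = 0 &: \quad T_i = T_{i-1}
  \;\Rightarrow\; i + T + T_i = \bigl( (i-1) + T + T_{i-1} \bigr) + 1 \\
r_i = 1 &: \quad T_i = T_{i-1} - 1
  \;\Rightarrow\; i + T + T_i = (i-1) + T + T_{i-1}
\end{align*}
```

Since $\left( (x-1)/x \right)^2$ increases with $x$ for $x > 1$, a non-relevant document raises the continuation probability and a fully relevant one leaves it unchanged.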
INST – Adaptive and Sensitive
Expected search length for INST, for $r_i \equiv 0$ (upper) and $r_i \equiv 1$ (lower).

  T   Upper   Lower
  1    2.58    1.33
  3    6.53    3.27
 10   20.51   10.26
 30   60.50   30.25

For any given value of T, all other rankings fall between the two limits in terms of expected search depth.
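The table can be approximated numerically by summing the probability of reaching each rank. This is a sketch under stated assumptions: the function name and the `horizon` cutoff are hypothetical, and the all-zero case converges slowly, so the result is only accurate to about two decimal places.

```python
def expected_search_length(T, gain, horizon=1_000_000):
    """Expected number of items viewed under C_INST when every r_i equals gain."""
    view = 1.0               # V(i): probability that the user reaches rank i
    remaining = float(T)     # holds T_{i-1} at the top of each iteration
    length = 0.0
    for i in range(1, horizon + 1):
        length += view
        remaining -= gain                     # T_i = T_{i-1} - r_i
        denom = i + T + remaining
        view *= ((denom - 1) / denom) ** 2    # multiply in C_INST(i)
        if view < 1e-15:                      # nothing left worth summing
            break
    return length


# (upper, lower) bounds: every document non-relevant, every document relevant.
table = {T: (expected_search_length(T, 0), expected_search_length(T, 1))
         for T in (1, 3, 10, 30)}
```

When every document is relevant the continuation probability is constant, so the lower bound is simply 1 / (1 − C); for T = 1 that is 1 / (1 − 1/4) = 4/3.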
INST – Adaptive and Sensitive
$C_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 3.
[Figure: conditional probability C(i) (axis 0.60 to 1.00) against rank i (ticks at 1, 10, 100); two curves, T = 3 with $r_i = 0$ and T = 3 with $r_i = 1$.]
INST – Adaptive and Sensitive
$C_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 10.
[Figure: conditional probability C(i) (axis 0.60 to 1.00) against rank i (ticks at 1, 10, 100); two curves, T = 10 with $r_i = 0$ and T = 10 with $r_i = 1$.]
INST – Adaptive and Sensitive
$W_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 3.
[Figure: weight W(i) (axis 0.01 to 1.00) against rank i (ticks at 1, 10, 100); two curves, T = 3 with $r_i = 0$ and T = 3 with $r_i = 1$.]
INST – Adaptive and Sensitive
$W_{\mathrm{INST}}(i)$ with $r_i$ always 0 and always 1, for T = 10.
[Figure: weight W(i) (axis 0.01 to 1.00) against rank i (ticks at 1, 10, 100); two curves, T = 10 with $r_i = 0$ and T = 10 with $r_i = 1$.]
Residuals
A benefit of weighted-precision metrics is that residuals can be computed – the sum of the W(i) values for which $r_i$ is unknown.
The residual associated with a score represents the gap between an “all unjudged are non-relevant ($r_i = 0$)” assessment, and an “all unjudged are relevant ($r_i = 1$)” assessment.
When residuals are large, the evaluation must be regarded as having low credibility.
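A small sketch of a residual computation, using RBP because its weights do not depend on the judgments. The function name, the use of `None` for an unjudged document, and the example values are all assumptions. The weight of everything below the evaluated prefix is counted as unjudged too.

```python
def rbp_with_residual(judgments, p=0.8):
    """Return (score, residual) for RBP with persistence p.

    judgments -- gains by rank; None marks a document with no judgment
    """
    score, residual = 0.0, 0.0
    weight = 1.0 - p                  # W(1) = 1 - p
    for r in judgments:
        if r is None:
            residual += weight        # weight whose gain is unknown
        else:
            score += weight * r       # lower bound: judged gain only
        weight *= p                   # W(i + 1) = p * W(i)
    residual += weight / (1.0 - p)    # all weight below the prefix, p ** n
    return score, residual


# Hypothetical ranking: rank 2 unjudged, nothing known below rank 4.
low, gap = rbp_with_residual([1, None, 0, 1], p=0.5)
```

The true score then lies somewhere between `low` and `low + gap`, which is the gap the slide describes.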
Query Variation in Action
Topic R03.356, “postmenopausal estrogen Britain”, 46 queries, 29 distinct, Indri-SDM similarity, NIST qrels, INST.
[Figure: score and residual (axis 0.0 to 1.0) for each query variant (axis 0 to 50), with the title query marked.]
Query Variation in Action
Topic T04.734, “recycling successes”, 44 queries, 25 distinct, Indri-SDM similarity, NIST qrels, INST.
[Figure: score (0.0 to 1.0) against query variant (0 to 50).]
Query Variation in Action
Total of 110 R03 and T04 topics, showing the fraction of user-generated queries that outperform the “title” query.
[Figure: best user query (INST) against TREC title-only query (INST), both axes 0.0 to 1.0; legend: > 75%, > 50%, > 25%, > 0%, <= 0%.]
Query Variation in Action
Pool growth as users are added (60 R03 topics): documents per topic to be judged ≈ d·n^0.7 versus ≈ d·n^0.5 for systems.
[Figure: pool size (10 to 1000, log scale) against number of users included (1 to 100, log scale); curves for d=100, d=50, d=20, d=10.]
Recall-Based Metrics
For some purposes we may wish to measure retrieval performance relative to what a perfect system would attain.
Metrics AP, NDCG, and Q-Measure normalize by R, the number of relevant documents for the topic.
By definition, recall-based metrics are divorced from the user experience. Scores based on partial judgments are not lower bounds; nor can residuals be calculated.
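A small sketch shows why a recall-based score computed from partial judgments is not a lower bound. It uses a textbook definition of average precision over binary judgments; the function name and the three-document ranking are illustrative only.

```python
def average_precision(ranking, R):
    """AP over a binary ranking (1 = relevant, 0 = not relevant or unjudged),
    normalized by R, the number of relevant documents known for the topic."""
    hits = 0
    total = 0.0
    for i, r in enumerate(ranking, start=1):
        if r:
            hits += 1
            total += hits / i  # precision at each relevant rank
    return total / R if R else 0.0


ranking = [1, 0, 0]

# With the judgments available so far, one relevant document is known.
ap_partial = average_precision(ranking, R=1)

# Further judging finds a second relevant document that this run never retrieved.
ap_fuller = average_precision(ranking, R=2)

print(ap_partial, ap_fuller)
```

The ranking is unchanged, yet the score falls once R grows, so the earlier figure could not have been a lower bound. A weighted-precision score, by contrast, can only rise as unjudged documents are found to be relevant.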
Summary
There is a relationship between information seeking task complexity, and anticipated values of T and Q.
Making reasonable assumptions, it is possible to construct a weighted-precision user model (and hence evaluation metric, INST) that is both goal-sensitive and adaptive.
There is vast variation in user queries that is not catered for by current test collections. That means that we are not (yet) able to determine if INST (with per-query T) genuinely offers more refined evaluations than, say, RBP (single p across topics) or ERR (no parameter at all).
Future Work
Things we would love to do:
- Build a test collection that incorporates query variants: see SIGIR 2016.
- Develop an evaluation using either crowd-workers or laboratory subjects to distinguish between the search behaviors modeled by, say, RBP and INST.
- Develop an understanding of query variants and their underlying “potency”, as a step towards an effective query rewriting tool.
Contributors
The majority of the work described here has been undertaken in collaboration with Peter Bailey (Microsoft), Falk Scholer (RMIT University), and Paul Thomas (CSIRO and now Microsoft).
Previous collaborators in this area include William Webber and Justin Zobel (University of Melbourne), and Shane Culpepper (RMIT University).
Xiaolu Lu provided technical assistance with some of the query processing.
Funded by the Australian Research Council (DP110101934 and DP140102655).
Questions (Who? When? Where?)
Sources
This talk was based on a suite of published work:
- Models and Metrics: IR Evaluation as a User Process, ADCS’12
- Users Versus Models: What Observation Tells Us About Effectiveness Metrics, CIKM’13
- What Users Do: The Eyes Have It, AIRS’13
- User Variability and IR System Evaluation, SIGIR’15
- Pooled Evaluation Over Query Variations: Users are as Diverse as Systems, CIKM’15
- INST: An Adaptive Metric for Information Retrieval Evaluation, ADCS’15
- UQV: A Test Collection with Query Variability, SIGIR’16