Web Search Engine Metrics for Measuring
User Satisfaction
Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu
{dasdan, kostas, emrev}@yahoo-inc.com
Yahoo! Inc., 20 Apr 2009
Tutorial @ 18th International World Wide Web Conference
http://www2009.org/, April 20-24, 2009
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.
Disclaimers
• This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo! Inc.
• This talk does not imply that these metrics are or should be used by Yahoo!; even if they are used, they may not be used in the way described in this talk.
• The examples are just that – examples. Please do not generalize them to the level of comparing search engines.
Acknowledgments for presentation material (in alphabetical order of last names in each category)
• Coverage – Paolo D’Alberto, Amit Sasturkar
• Discovery – Chris Drome, Kaori Drome
• Freshness – Xinh Huynh
• Presentation – Rob Aseron, Youssef Billawala, Prasad Kantamneni, Diane Yip
• General – Stanford U. presentation audience (organized by Aneesh Sharma and Panagiotis Papadimitriou), Yahoo! presentation audience (organized by Pavel Dmitriev)
Learning objectives
• To learn about user satisfaction metrics
• To learn about how to interpret metrics results
• To get the relevant bibliography
• To learn about the open problems
Scope
• Web “textual” search
• Users’ point of view
• Analysis rather than synthesis
• Intuitive rather than formal
• Not exhaustive coverage (including references)
Outline
• Introduction (30min) – Ali
• Relevance metrics (50min) – Emre
• Break (15min)
• Coverage metrics (15min) – Ali
• Diversity metrics (15min) – Ali
• Discovery metrics (15min) – Ali
• Freshness metrics (15min) – Ali
• Presentation metrics (50min) – Kostas
• Conclusions (5min) – Kostas
Introduction PART “0”
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
“To measure is to know”
“If you cannot measure it, you cannot improve it”
Lord Kelvin (1824-1907)
Why measure? Why metrics?
Search engine pipeline: Simplified architecture
• Serving system: serves user queries and search results
• Content system: acquires and processes content
[Diagram: WWW → Content system (crawlers, web graphs, indexers) → Serving system (search tiers, front-end tiers) → User]
Search engine pipeline: Content selection
[Diagram: content catalogs along the pipeline – the Web, Accessed, Crawled, Graphed, Indexed, Served]
How do you select content to pass to the next catalog?
User view of metrics: Example with coverage metrics (SE #1)
User view of metrics: Example with coverage metrics (SE #2)
User view of metrics: Example with coverage metrics (SE #3)
System view of metrics: Example with coverage metrics
Check for coverage of expected URL http://rain.stanford.edu/schedule/ (if missing from SRP)
Ideal vs. reality
• Ideal view
  – crawl all content
  – discover all changes instantaneously
  – serve all content instantaneously
  – store all content indefinitely
  – meet user’s information need perfectly
• Practical view
  – constraints on the above aspects due to market focus, long tails, cost, resources, complexity
• Moral of the story
  – Cannot make all the users happy all the time!
Sampling methods for metrics
• Random sampling of queries
  – from search engine’s query logs
  – from third-party logs (e.g., ComScore)
• Random sampling of URLs
  – from random walking the Web (see a review in Baykan et al., WWW’06)
  – from directories and similar hubs
  – from RSS feeds and sitemaps
  – from third-party feeds
  – from search engine’s catalogs
  – from competitor’s indices using queries
• Customer-selected samples
Different dimensions for metrics
• Content types and sources
  – news, blogs, wikipedia, forums, scholar; regions, languages; adult, spam, etc.
• Site types
  – small vs. large, region, language
• Document formats
  – html, pdf, etc.
• Query types
  – head, torso, tail; #terms; informational, navigational, transactional; celebrity, adult, business, research, etc.
• Open web vs. hidden web
• Organic vs. commercial
• Dynamic vs. static content
• New content vs. existing content
Further issues to consider
• Rate limitations
  – search engine blocking, hence difficulty of large competitive testing
  – internal bandwidth usage limitations
• Intrusiveness
  – How can metrics queries affect what’s observed?
• Statistical soundness
  – in methods used and guarantees provided
  – accumulation of errors
  – the “value” question, e.g., what is “random”? Is “random” good enough?
• Undesired positive feedback, or the chicken-and-egg problem
  – Focus on popular queries may make them more popular at the expense of what’s potentially good for the future.
• Controlled feedback, or labeled training and testing data
  – Paid human judges (or editors), crowdsourcing (e.g., Amazon’s Mechanical Turk), Games with a Purpose (e.g., Dasdan et al., WWW’09), bucket testing on live traffic, etc.
Key problems for metrics
• Measure user satisfaction
• Compare two search engines
• Optimize for user satisfaction in each component of the pipeline
• Automate all metrics
• Discover anomalies
• Visualize, mine, and summarize metrics data
• Debug problems automatically
Also see: Yahoo! Research list at http://research.yahoo.com/ksc
Relevance Metrics PART I
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on relevance
[Screenshots with callouts: “Ad for gear. OK if I will go to the game.”, “No schedule here.”, “There is a schedule.”, “A different schedule?”, “A different Real Madrid!”]
What is relevance?
• User issues a query to a search engine and receives an ordered list of results…
• Relevance: How effectively was the user’s information need met?
  – How useful were the results?
  – How many of the retrieved results were useful?
  – Were there any useful pages not retrieved?
  – Did the order of the results make the user’s search easier or harder?
  – How successfully did the search engine handle the ambiguity and the subjectivity of the query?
Evaluating relevance
• Set based evaluation
  – basic but fundamental
• Rank based evaluation with explicit absolute judgments
  – binary vs. graded judgments
• Rank based evaluation with explicit preference judgments
  – binary vs. graded judgments
  – practical system testing and incomplete judgments
• Rank based evaluation with implicit judgments
  – direct and indirect evaluation by clicks
• User satisfaction
• More notes
Relevance Metrics: Set Based Evaluation
Precision
– True Positive (TP): a retrieved document is relevant
– False Positive (FP): a retrieved document is not relevant
• Kent et al. (1955)

Precision = (# relevant items retrieved) / (# retrieved items) = TP / (TP + FP) = Prob(relevant | retrieved)
• How many of the retrieved results were useful?
Recall
Recall = (# relevant items retrieved) / (# relevant items) = TP / (TP + FN) = Prob(retrieved | relevant)

– True Positive (TP): a retrieved document is relevant
– False Negative (FN): a relevant document is not retrieved
• Kent et al. (1955)
• Were there any useful pages left not retrieved?
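The two set-based definitions above reduce to a few set operations; a minimal sketch in Python (the function name and example document ids are illustrative, not from the tutorial):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall from binary relevance judgments.

    retrieved: set of retrieved document ids
    relevant:  set of truly relevant document ids
    """
    tp = len(retrieved & relevant)   # relevant items retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of 5 retrieved documents are relevant, out of 4 relevant overall.
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 9})
# p = 3/5 = 0.6, r = 3/4 = 0.75
```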
Properties of precision and recall
• Precision decreases when false positives increase
• False positives
  – are also known as false alarms in signal processing
  – correspond to Type I errors in statistical hypothesis testing
• Recall decreases when false negatives increase
• False negatives
  – are also known as missed opportunities
  – correspond to Type II errors in statistical hypothesis testing
F-measure
• Inconvenient to have two numbers
• F-measure: harmonic mean of precision and recall
  – related to van Rijsbergen’s effectiveness measure
  – reflects the user’s willingness to trade precision for recall, controlled by a parameter selected by the system designer

F = 1 / (α(1/P) + (1 - α)(1/R)) = (β² + 1)PR / (β²P + R), where α = 1 / (β² + 1)

F(β = 1) = 2PR / (P + R)
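The F-measure formula above can be sketched directly; β > 1 weighs recall higher and β < 1 weighs precision higher (a hypothetical helper, not from the tutorial):

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.

    beta > 1 favors recall; beta < 1 favors precision; beta = 1 gives F1.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * p * r / (b2 * p + r)

f1 = f_measure(0.6, 0.75)  # 2 * 0.6 * 0.75 / (0.6 + 0.75) ≈ 0.667
```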
Various means of precision and recall
[Figure: arithmetic mean, geometric mean, F1, F2, and F0.5 of precision and recall as a function of precision, at a fixed recall of 70%]
Relevance Metrics: Rank Based Evaluation with Explicit Absolute Judgments
Extending precision and recall
• So far, considered:
  – How many of the retrieved results were useful?
  – Were there any useful pages left not retrieved?
• Next, consider:
  – Did the order of the results make the user’s search for information easier or harder?
• Extending set based precision/recall to a ranked list
  – It is possible to define many sets over a ranked list.
  – E.g., start with a set including the first result and progressively increase the size of the set by adding the next result.
• Precision-recall curve:
  – Calculate precision at standard recall levels and interpolate.
Precision-recall curve example
rank | relevance | TP | FP | FN | recall | precision | interpolated precision
  1  |     1     |  1 |  0 |  3 |  0.25  |   1.00    |   1.00
  2  |     1     |  2 |  0 |  2 |  0.50  |   1.00    |   1.00
  3  |     0     |  2 |  1 |  2 |  0.50  |   0.67    |   0.75
  4  |     1     |  3 |  1 |  1 |  0.75  |   0.75    |   0.75
  5  |     0     |  3 |  2 |  1 |  0.75  |   0.60    |   0.60
  6  |     0     |  3 |  3 |  1 |  0.75  |   0.50    |   0.57
  7  |     1     |  4 |  3 |  0 |  1.00  |   0.57    |   0.57
  8  |     0     |  4 |  4 |  0 |  1.00  |   0.50    |   0.50
  9  |     0     |  4 |  5 |  0 |  1.00  |   0.44    |   0.44
 10  |     0     |  4 |  6 |  0 |  1.00  |   0.40    |   0.40
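The interpolated precision column takes, at each rank, the maximum precision observed at that rank or any deeper rank; a sketch of both computations (function names are illustrative):

```python
def pr_points(relevance, num_relevant):
    """(recall, precision) after each rank of a binary relevance list."""
    points, tp = [], 0
    for k, rel in enumerate(relevance, start=1):
        tp += rel
        points.append((tp / num_relevant, tp / k))
    return points

def interpolate(points):
    """Interpolated precision: max precision at this rank or any deeper rank."""
    interp, best = [], 0.0
    for recall, prec in reversed(points):
        best = max(best, prec)
        interp.append((recall, best))
    return list(reversed(interp))

pts = pr_points([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], num_relevant=4)
interp = interpolate(pts)
# At rank 3, precision 0.67 is interpolated up to 0.75; at rank 6, up to 0.57.
```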
Precision-recall curve example
[Figure: precision and interpolated precision plotted against recall for the example table]
Average precision-recall curve
• A precision-recall curve is for one ranked list (i.e., one query).
• To evaluate relevance of a search engine:
  – Calculate interpolated precision-recall curves for a sample of queries at 11 points (Recall = 0.0:0.1:1.0).
  – Average over the test sample of queries.

[Figure: averaged 11-point interpolated precision-recall curve]
Mean average precision (MAP)
• Single number instead of a graph
• Measure of quality at all recall levels
• Average precision for a single query:

AP = (1 / # relevant) × Σ_{k=1}^{# relevant} (Precision at rank of kth relevant document)

• MAP: mean of average precision over all queries
  – Most frequently, the arithmetic mean is used over the query sample.
  – Sometimes, the geometric mean can be useful by putting emphasis on low performing queries.
Average precision example
rank | relevance | TP | FP | FN |  R   |  P   | P@rel(k)
  1  |     1     |  1 |  0 |  3 | 0.25 | 1.00 |   1.00
  2  |     1     |  2 |  0 |  2 | 0.50 | 1.00 |   1.00
  3  |     0     |  2 |  1 |  2 | 0.50 | 0.67 |   0
  4  |     1     |  3 |  1 |  1 | 0.75 | 0.75 |   0.75
  5  |     0     |  3 |  2 |  1 | 0.75 | 0.60 |   0
  6  |     0     |  3 |  3 |  1 | 0.75 | 0.50 |   0
  7  |     1     |  4 |  3 |  0 | 1.00 | 0.57 |   0.57
  8  |     0     |  4 |  4 |  0 | 1.00 | 0.50 |   0
  9  |     0     |  4 |  5 |  0 | 1.00 | 0.44 |   0
 10  |     0     |  4 |  6 |  0 | 1.00 | 0.40 |   0

# relevant = 4; average precision = (1.00 + 1.00 + 0.75 + 0.57) / 4 ≈ 0.83
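The AP column sums precision only at the ranks of relevant documents; a sketch (function names are illustrative):

```python
def average_precision(relevance, num_relevant):
    """AP: mean of precision measured at the rank of each relevant document."""
    tp, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            tp += 1
            total += tp / k
    return total / num_relevant

def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over a sample of queries."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)

ap = average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], num_relevant=4)
# (1.00 + 1.00 + 0.75 + 4/7) / 4 ≈ 0.83
```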
Precision @ k
• MAP evaluates precision at all recall levels.
• In web search, the top portion of a result set is more important.
• A natural alternative is to report precision at top-k (e.g., top-10).
• Problem:
  – Not all queries will have more than k relevant results, so even a perfect system may score less than 1.0 for some queries.
R-precision
• Allan (2005)
• Use a variable result set cut-off for each query based on the number of its relevant results.
• In this case, a perfect system can score 1.0 over all queries.
• Official evaluation metric of the TREC HARD track
• Highly correlated with MAP
Mean reciprocal rank (MRR)
• Voorhees (1999)
• Reciprocal of the rank of the first relevant result, averaged over a population of queries
• Possible to define it for entities other than explicit absolute relevance judgments (e.g., clicks; see implicit judgments later on)

MRR = (1 / #queries) × Σ_{q=1}^{#queries} 1 / rank(1st relevant result of query q)
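The MRR formula can be sketched from the rank of each query's first relevant result (a hypothetical helper; queries with no relevant result are scored 0 here, which is one common convention):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over a query sample, given each query's rank of the first relevant
    result (None when no relevant result was returned)."""
    rr = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# First relevant results at ranks 1, 3, and 2 for three queries:
mrr = mean_reciprocal_rank([1, 3, 2])  # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```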
Graded Relevance
• So far, the evaluation methods did not measure satisfaction in the following aspects:
  – How useful were the results?
    • Do documents have grades of usefulness in meeting an information need?
  – How successfully did the search engine handle the ambiguity and the subjectivity of the query?
    • Is the information need of the user clear in the query?
    • Do different users mean different things with the same query?
• Can we cover these aspects by using graded relevance judgments instead of binary?
  – very useful
  – somewhat useful
  – not useful
Precision-recall curves
• If we have grades of relevance, how can we modify some of the binary relevance measures?
• Calculate precision-recall curves at each grade level (Järvelin and Kekäläinen (2000)).
• Informative, but too many curves to compare
Discounted cumulative gain (DCG)
• Järvelin and Kekäläinen (2002)
• Gain adjustable for importance of different relevance grades for user satisfaction
• Discounting desirable for web ranking
  – Most users don’t browse deep.
  – Search engines truncate the list of results returned.

DCG = Σ_{r=1}^{R} Gain(result@r) / log_b(r + 1)

  – The discount is proportional to the effort to reach the result at rank r; the gain is proportional to the utility of the result at rank r.
DCG example
• Gain for various grades
  – Very useful (V): 3
  – Somewhat useful (S): 1
  – Not useful (N): 0
• E.g., results ordered as VSN:

DCG = 3/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 3.63

• E.g., results ordered as VNS:

DCG = 3/log2(1+1) + 0/log2(2+1) + 1/log2(3+1) = 3.50
Normalized DCG (nDCG)
• DCG yields unbounded scores. It is desirable for the best possible result set to have a score of 1.
• For each query, divide the DCG by the best attainable DCG for that query.
• E.g., VSN:

nDCG = 3.63 / 3.63 = 1.00

• E.g., VNS:

nDCG = 3.50 / 3.63 = 0.96
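DCG and nDCG with base-2 discounting can be sketched as follows (function names are illustrative; the ideal DCG is obtained by sorting the gains in decreasing order):

```python
import math

def dcg(gains, base=2):
    """DCG = sum over ranks r of gain(r) / log_b(r + 1)."""
    return sum(g / math.log(r + 1, base) for r, g in enumerate(gains, start=1))

def ndcg(gains, base=2):
    """nDCG: DCG divided by the best attainable DCG for the same gains."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0

# Gains 3 (very useful), 1 (somewhat useful), 0 (not useful):
perfect = ndcg([3, 1, 0])  # already in the ideal order
buried = ndcg([1, 0, 3])   # the best result at rank 3 drags the score down
```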
Relevance Metrics: Rank Based Evaluation with Explicit Preference Judgments
Kendall tau coefficient
• Based on counts of preferences
  – Preference judgments are cheaper and easier/cleaner than absolute judgments.
  – But one may need to deal with circular preferences.
• Range in [-1, 1]
  – τ = 1 when all preferences are in agreement
  – τ = -1 when all disagree
• Robust for incomplete judgments
  – Just use the known set of preferences.

τ = (A - D) / (A + D), where A = # preferences in agreement and D = # preferences in disagreement
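Given a set of known preference pairs and a system ranking, τ follows directly (a hypothetical helper that assumes every judged item appears in the ranking):

```python
def kendall_tau(prefs, ranking):
    """Kendall tau over a known set of preference judgments.

    prefs:   iterable of (x, y) pairs meaning "x is preferred over y"
    ranking: list of items, best first, produced by the system under test
    """
    pos = {item: i for i, item in enumerate(ranking)}
    agree = sum(1 for x, y in prefs if pos[x] < pos[y])
    disagree = len(prefs) - agree
    return (agree - disagree) / (agree + disagree)

# The ranking honors two of three known preferences:
tau = kendall_tau([("a", "b"), ("a", "c"), ("c", "b")], ["a", "b", "c"])
# (2 - 1) / (2 + 1) ≈ 0.33
```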
Binary preference (bpref)
• Buckley and Voorhees (2004)
• Designed in particular for incomplete judgments
• Similar to some other relevance metrics (MAP)
• Can be generalized to graded judgments

For a query with R relevant results:

bpref = (1/R) × Σ_{r ∈ relevant} (1 - N_r / R)  ∝  A / (A + D)

where N_r is the number of judged non-relevant docs ranked above relevant doc r, counted among the first R judged non-relevant docs.
Bpref example
rank | relevance | N_r | R | 1 - N_r/R
  1  |     0     |     |   |
  2  |     1     |  1  | 3 |   0.67
  3  |    NA     |     |   |
  4  |     1     |  1  | 3 |   0.67
  5  |    NA     |     |   |
  6  |     0     |     |   |
  7  |     0     |     |   |
  8  |     0     |     |   |
  9  |     1     |  3  | 3 |   0
 10  |     0     |     |   |

# relevant = 3; # non-relevant = 5; # unjudged = 2; bpref = (0.67 + 0.67 + 0) / 3 ≈ 0.44
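The computation above can be sketched as follows (a hypothetical helper; unjudged results are simply skipped, and the count of non-relevant documents above a relevant one is capped at R):

```python
def bpref(judgments):
    """bpref over a ranked list: 1 = relevant, 0 = judged non-relevant,
    None = unjudged (unjudged results are skipped entirely)."""
    R = sum(1 for j in judgments if j == 1)
    score, nonrel_above = 0.0, 0
    for j in judgments:
        if j == 1:
            score += 1.0 - min(nonrel_above, R) / R
        elif j == 0:
            nonrel_above += 1
    return score / R

# The ten-result example: relevant at ranks 2, 4, 9; unjudged at ranks 3, 5.
b = bpref([0, 1, None, 1, None, 0, 0, 0, 1, 0])  # (2/3 + 2/3 + 0) / 3 ≈ 0.44
```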
Generalization of bpref to graded judgments - rpref
• De Beer and Moens (2006)
• Graded relevance version of bpref
• Sakai (2007) gives a corrected version expressed in terms of cumulative gain.

rpref_relative(R) = (1 / CG_ideal(R)) × Σ_{r ≤ R, g(r) > 0} g(r) × (1 - penalty(r) / N_r)

penalty(r) = Σ_{i < r, g(i) < g(r)} (g(r) - g(i)) / g(r)   (a soft count of out-of-order pairs)

where g(r) is the relevance gain of the result at rank r, N_r is the number of judged docs above rank r, and CG_ideal(R) is the ideal cumulative gain at depth R.
Practical system testing with incomplete judgments
• Comparing two search engines in practice
  – Scrape top-k result sets for a sample of queries.
  – Calculate any of the metrics above for each engine and compare using a statistical test (e.g., a paired t-test).
• Need judgments
  – Use existing judgments.
  – If judgments are missing, use a metric robust to missing judgments.
Comparing various metrics under incomplete judgment scenario
• Sakai (2007) simulates incomplete judgments by sampling from pooled judgments.
  – Stratified sampling yields various levels of completeness, from 100% down to 10%.
• He then tests bpref, rpref, MAP, Q-measure, and normalized DCG (nDCG).
  – Q-measure is similar to rpref (see Sakai (2007)).
  – Since all but the first two are originally designed for complete judgments, he tests two versions of them:
    • one based on assuming results with missing judgments are non-relevant,
    • and another computed on condensed lists obtained by removing results with missing judgments.
• nDCG with incomplete absolute judgments
  – As in average precision based measures, one can ignore the unjudged documents when using normalized DCG.
Robustness of evaluation with incomplete judgments
• Among the original methods, only bpref and rpref stay stable with increasing incompleteness.
• nDCG, Q-measure, and MAP computed on condensed lists also perform well.
  – Furthermore, they have more discriminative power.
• Graded relevance metrics are more robust to incompleteness than binary metrics.
• nDCG and Q-measure on condensed lists are the best metrics.
Average precision based rank correlation
• Yilmaz, Aslam, and Robertson (2008)
• Kendall tau rank correlation as a random variable
  – Pick a pair of items at random.
  – Define p: return 1 if the pair is in the same order in both lists, 0 otherwise.
• Rank correlation based on average precision as a random variable
  – Pick an item at random from the 1st list (other than the top item).
  – Pick another document at random above the current one.
  – Define p': return 1 if this pair is in the same relevance order in the 2nd list, 0 otherwise.
• Agreement at the top of the list is rewarded.

τ = (A - D) / (A + D) = p - (1 - p) = 2p - 1
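The AP-based correlation can be sketched from its probabilistic definition (a hypothetical helper; both lists are assumed to rank the same items):

```python
def tau_ap(list1, list2):
    """AP-based rank correlation (tau_ap) of Yilmaz, Aslam, and Robertson.

    For each item below the top of list1, take the fraction of items ranked
    above it in list1 that are also above it in list2; average these
    fractions into a probability p and map it to tau_ap = 2p - 1.
    """
    pos2 = {item: i for i, item in enumerate(list2)}
    fractions = []
    for i, item in enumerate(list1[1:], start=1):
        above_in_both = sum(1 for other in list1[:i] if pos2[other] < pos2[item])
        fractions.append(above_in_both / i)
    p = sum(fractions) / len(fractions)
    return 2.0 * p - 1.0

same = tau_ap(["a", "b", "c", "d"], ["a", "b", "c", "d"])      # 1.0
opposite = tau_ap(["a", "b", "c", "d"], ["d", "c", "b", "a"])  # -1.0
```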
Relevance Metrics: Rank Based Evaluation with Implicit Judgments
Implicit judgments from clicks
• Explicit judgments are expensive.
• A search engine has lots of user interaction data:
  – which results were viewed for a query, and
  – which of those received clicks.
• Can we obtain implicit judgments of satisfaction or relevance from clicks?
  – Clicks are highly biased:
    • presentation details (order of results, attractiveness of abstracts)
    • trust and other subtle aspects of the user’s need
  – Not impossible: some innovative methods are emerging.
• Pros: cheap; better model of ambiguity and subjectivity
• Cons: noisy and retroactive (may expose poor quality search engines to live traffic)
Performance metrics from user logs
• A naïve way to utilize user interaction data is to compute basic statistics from raw observations:
  – abandonment rate
  – reformulation rate
  – number of queries per session
  – clicks per query
  – mean reciprocal rank of clicked results
  – time to first or last click
• Intuitive, but it is not clear how sensitive these metrics are to what we want to measure.
Implicit preference judgments from clicks
• Joachims (2002)
• Radlinski and Joachims (2005)
• These are document level preference judgments and have not been used in evaluation.

[Example: for results A, B, C in order, skipping A and B and clicking C implies C>A and C>B; clicking A and skipping B and C implies A>B]
Direct evaluation by clicks
• Randomly interleave the two result sets to be compared.
  – Have the same number of links from the top of each result set.
  – More clicks on links from one result set indicate a preference for it.
• Balanced interleaving (Joachims (2003))
  – Determine randomly which side goes first at the start.
  – Pick the next available result from the side that has the turn, while removing duplicates.
  – Caution: biased when the two result sets are nearly identical.
• Team draft interleaving (Radlinski et al. (2008))
  – Determine randomly which side goes first at each round.
  – Pick the next available result from the side that has the turn, while removing duplicates.
• Effectively removes the rank bias, but not directly applicable to evaluation of multi-page sessions.
Interleaving example
[Table: two rankings to compare, A = (a, b, c, d, e, f, g, h, i, j) and B = (b, e, a, f, g, h, k, c, d, i), with the resulting balanced interleavings (A first and B first) and two example team draft interleavings showing which side captained each round]
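Team draft interleaving, as described above, can be sketched as follows (a simplified sketch; the drafting details follow Radlinski et al. only approximately):

```python
import random

def team_draft_interleave(list_a, list_b, seed=None):
    """Team draft interleaving: in each round a coin flip decides which side
    drafts first; each side then takes its highest-ranked result not already
    in the interleaved list, and remembers it as a member of its team."""
    rng = random.Random(seed)
    interleaved = []
    teams = {"A": [], "B": []}
    pools = {"A": list(list_a), "B": list(list_b)}
    while pools["A"] or pools["B"]:
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for side in order:
            # Drop results already shown, then draft the side's best remaining.
            pools[side] = [d for d in pools[side] if d not in interleaved]
            if pools[side]:
                doc = pools[side].pop(0)
                interleaved.append(doc)
                teams[side].append(doc)
    return interleaved, teams

ranking, teams = team_draft_interleave(list("abcd"), list("bcde"), seed=0)
# Clicks on documents credited to teams["A"] vs. teams["B"] indicate preference.
```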
Indirect evaluation by clicks
• Carterette and Jones (2007)
• Relevance as a multinomial random variable: P{R_i = grade_j}
• Model absolute judgments by clicks c:

log [ P(R > g_j | q, c) / P(R ≤ g_j | q, c) ] = α_j + β_q + Σ_{i=1}^{N} β_i c_i + Σ_{i<k} β_ik c_i c_k

p(R | q, c) = Π_{i=1}^{N} p(R_i | q, c)

• Expected DCG (incomplete judgments are OK):

E[DCG_N] = E[R_1] + Σ_{i=2}^{N} E[R_i] / log2(i)
Indirect evaluation by clicks (cont’d)
• Comparing two search engines:

E[ΔDCG] = E[DCG_A] - E[DCG_B]

• Predict whether the difference is statistically significant, e.g., P(ΔDCG < 0) ≥ 0.95
  – use Monte Carlo simulation
• Can improve confidence by asking for labels on the result i maximizing |E[G_i^A] - E[G_i^B]|, where G_i = R_i if rank(i) = 1, and G_i = R_i / log2(rank(i)) otherwise.
• Efficient, but effectiveness depends on the quality of the relevance model obtained from the clicks.
Relevance Metrics: User Satisfaction
Relevance evaluation and user satisfaction
• So far, we focused on the evaluation method rather than the entity (i.e., user satisfaction) to be evaluated.
• Subtle and salient aspects of user satisfaction are difficult for traditional relevance:
  – e.g., trust, expectation, patience, ambiguity, subjectivity
  – Explicit absolute or preference judgments are not very successful in addressing all aspects at once.
  – Implicit judgment models get one step closer to user satisfaction by incorporating user feedback.
• The popular IR relevance metrics are not strongly based on user tasks and experiences.
  – Turpin and Scholer (2006): precision based metrics such as MAP fail to assess user satisfaction on tasks targeting recall.
Modeling user satisfaction
• Huffman and Hochster (2007)
• Obtain explicit judgments of true satisfaction over a sample of sessions or any other grain.
• Develop a predictive model based on observable statistics:
  – explicit absolute relevance judgments
  – number of user actions in a session
  – query classification
• Carry out correlation analysis.
• Pros: more direct than many other evaluation metrics
• Cons: more exploratory than a usable metric at this stage
Relevance Metrics: More Notes
Relevance through search system components
• Relevance can explicitly be measured for each search system component (Dasdan and Drome (2009)).
  – Use set based evaluation for WWW, catalog, and database tiers.
    • Rank based evaluation can be used if the sampled subset is ordered by explicit judgments or by using an order inferred from a downstream component.
  – Yields approximate upper bounds.
  – Use rank based evaluation for candidate documents and the result set.
• Useful for quantifying and monitoring the relevance gap:
  – intra-system relevance gap, by comparing different system stages
  – inter-system relevance gap, by comparing against external benchmarks

[Diagram: WWW → crawl → catalog → index tiers 1..N; a query drives selection of a candidate doc list, which ranking turns into the result set]
Where to find more
• Traditional relevance metrics have deep roots in information retrieval:
  – Cranfield experiments (Cleverdon (1991))
  – SMART (Salton (1991))
  – TREC (Voorhees and Harman (2005))
• Modern metrics address cost and noise by using statistical inference in more advanced ways.
• For more on relevance evaluation, see:
  – Manning, Raghavan, and Schütze (2008)
  – Croft, Metzler, and Strohman (2009)
• For more on the user dimension, see:
  – Baeza-Yates and Ribeiro-Neto (1999)
  – Spink and Cole (2005)
References 1/2
• J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.
• R. Baeza-Yates and B. Ribeiro-Neto (1999), Modern Information Retrieval, Addison-Wesley.
• C. Buckley and E.M. Voorhees (2004), Retrieval Evaluation with Incomplete Information, SIGIR’04.
• B. Carterette and R. Jones (2007), Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks, NIPS’07.
• C.W. Cleverdon (1991), The significance of the Cranfield tests on index languages, SIGIR’91.
• B. Croft, D. Metzler, and T. Strohman (2009), Search Engines: Information Retrieval in Practice, Addison-Wesley.
• A. Dasdan and C. Drome (2008), Measuring Relevance Loss of Search Engine Components, submitted.
• J. De Beer and M.-F. Moens (2006), Rpref - A Generalization of Bpref towards Graded Relevance Judgments, SIGIR’06.
• S.B. Huffman and M. Hochster (2007), How Well does Result Relevance Predict Session Satisfaction? SIGIR’07.
• K. Järvelin and J. Kekäläinen (2000), IR evaluation methods for retrieving highly relevant documents, SIGIR’00.
• K. Järvelin and J. Kekäläinen (2002), Cumulated Gain-Based Evaluation of IR Techniques, ACM Trans. IS 20(4):422-446.
• T. Joachims (2002), Optimizing Search Engines using Clickthrough Data, SIGKDD’02.
• T. Joachims (2003), Evaluating Retrieval Performance using Clickthrough Data, in J. Franke et al. (eds.), Text Mining, Physica Verlag.
References 2/2
• A. Kent, M.M. Berry, F.U. Luehrs Jr., and J.W. Perry (1955), Machine literature searching VIII. Operational criteria for designing information retrieval systems, American Documentation 6(2):93-101.
• C. Manning, P. Raghavan, and H. Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
• F. Radlinski, M. Kurup, and T. Joachims (2008), How Does Clickthrough Data Reflect Retrieval Quality? CIKM’08.
• F. Radlinski and T. Joachims (2005), Evaluating the Robustness of Learning from Implicit Feedback, ICML’05.
• T. Sakai (2007), Alternatives to Bpref, SIGIR’07.
• G. Salton (1991), The SMART project in automatic document retrieval, SIGIR’91.
• A. Spink and C. Cole (eds.) (2005), New Directions in Cognitive Information Retrieval, Springer.
• A. Turpin and F. Scholer (2006), User performance versus precision measures for simple search tasks, SIGIR’06.
• C.J. van Rijsbergen (1979), Information Retrieval (2nd ed.), Butterworth.
• E.M. Voorhees and D. Harman (eds.) (2005), TREC: Experiment and Evaluation in Information Retrieval, MIT Press.
• E.M. Voorhees (1999), TREC-8 Question Answering Track Report.
• E. Yilmaz and J. Aslam (2006), Estimating Average Precision with Incomplete and Imperfect Information, CIKM’06.
• E. Yilmaz, J. Aslam, and S. Robertson (2008), A New Rank Correlation Coefficient for Information Retrieval, SIGIR’08.
Coverage Metrics PART II
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on coverage: Heard some interesting news; decided to search
Example on coverage: URL was not found
Example on coverage: But content was found under different URLs
Example on coverage: URL was also found after some time
Definitions for coverage
• Coverage refers to the presence of content of interest in a catalog.
• Coverage ratio
  – defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested
  – can be represented as a distribution when many document attributes are considered together
Some background: Shingling and Jaccard Index
Doc = (a b c d e) (5 terms); 2-grams: (a b, b c, c d, d e)
Shingles for the 2-grams (after hashing them): 10, 3, 7, 16; min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e), Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e), Doc1 ∪ Doc2 = (a b c d e f g)

Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 30% (shingling estimates this index)
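The shingling estimate rests on a property of min-hashing: under a random hash function, the probability that two documents share the same minimum shingle equals the Jaccard index of their shingle sets. A hypothetical sketch, using salted md5 hashes for illustration:

```python
import hashlib

def jaccard(s1, s2):
    """Exact Jaccard index of two sets."""
    return len(s1 & s2) / len(s1 | s2)

def min_shingle(terms, k=2, salt=""):
    """Minimum hashed k-gram shingle; a compact signature of the document."""
    grams = (" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1))
    return min(int(hashlib.md5((salt + g).encode()).hexdigest(), 16)
               for g in grams)

def minhash_estimate(t1, t2, k=2, num_hashes=64):
    """Fraction of salted hash functions under which both documents have the
    same minimum shingle; estimates the Jaccard index of the shingle sets."""
    matches = sum(min_shingle(t1, k, str(i)) == min_shingle(t2, k, str(i))
                  for i in range(num_hashes))
    return matches / num_hashes

doc1, doc2 = "a b c d e".split(), "a e f g".split()
exact = jaccard(set(doc1), set(doc2))   # |{a,e}| / |{a,...,g}| = 2/7 ≈ 0.29
approx = minhash_estimate(doc1, doc2)   # Jaccard of the 2-gram shingle sets
```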
How to measure coverage
• Given an input document with its URL
• Query by URL (QBU)
  – Enter the URL at the target search engine’s query interface.
  – If the URL is not found, then iterate using “normalized” forms of the same URL.
• Query by content (QBC)
  – If the URL is not given or the URL search has failed, then perform this search.
  – Generate a set of queries (called strong queries) from the document.
  – Submit the queries to the target search engine’s query interface.
  – Combine the returned results.
  – Perform a more thorough similarity check between the returned documents and the input document.
• Compute the coverage ratio over multiple documents.
Query-by-Content flowchart
[Flowchart: terms from the page form a string signature; the strings are combined into queries; search results are extracted; a similarity check using shingles compares them to the input page]
Query by content: How to generate queries
• Select sequences of terms randomly
  – Find the document’s shingles signature.
  – Find the corresponding sequences of terms.
  – This method can produce the same query signature for the same document, as opposed to the method of just selecting random sequences of terms from the document.
• Select sequences of terms by frequency
  – terms with the lowest frequency or highest TF-IDF
• Select sequences of terms by position
  – e.g., +/- two terms at every 5th term
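A frequency-based strong-query generator can be sketched as follows (a hypothetical sketch: it anchors fixed-length queries at the document's rarest terms, using in-document frequency as a stand-in for corpus frequency or TF-IDF):

```python
from collections import Counter

def strong_queries(terms, num_queries=3, query_len=4):
    """Generate strong queries by anchoring fixed-length term sequences at
    the document's rarest terms (lowest in-document frequency first)."""
    freq = Counter(terms)
    # Rarest terms first; ties broken by position for determinism.
    anchors = sorted(range(len(terms)), key=lambda i: (freq[terms[i]], i))
    queries = []
    for i in anchors:
        window = terms[i:i + query_len]
        query = " ".join(window)
        if len(window) == query_len and query not in queries:
            queries.append(query)
        if len(queries) == num_queries:
            break
    return queries

queries = strong_queries("the quick brown fox jumps over the lazy dog".split())
# ['quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the']
```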
Further issues to consider
• URL normalization
  – see Dasgupta, Kumar, and Sasturkar (2008)
• Page templates and ads
  – or how to avoid undesired matches
• Search for non-textual content
  – images, mathematical formulas, tables, and other similar structures
• Definition of content similarity
• Syntactic vs. semantic match
• How to balance coverage against other objectives
Key problems
• Measure web growth in general and along any dimension
• Compare search engines automatically and reliably
• Improve content-based search, including semantic-similarity search
• Improve copy detection methods for quality and performance, including URL based copy detection
Reference review on coverage metrics
• Luhn (1957)
– summarizes an input document by selecting terms or sentences by frequency
– Bharat and Broder (1998) discovered the same method independently for a different purpose
• Bar-Yossef and Gurevich (2008)
– introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)
• Dasdan et al. (2008), Pereira and Ziviani (2004)
– represents an input document by selecting (sequences of) terms randomly or by frequency
– uses the term-based document signature as queries (called strong queries) for similarity search
– Yang et al. (2009) proposes similar methods for blog search
83
References
• Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).
• K. Bharat, A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.
• S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.
• A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2008), Automating retrieval for similar content using search engine query interface, submitted.
• A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.
• H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.
• H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).
• A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247-261.
• Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, D. Papadias (2009), Query by document, WSDM’09.
84
85
Diversity Metrics PART III
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on diversity: Long query
86
Every result is about the same news.
Example on diversity: Long query
87
More diverse
Example on diversity: Ambiguous query [stanford]
88
See http://en.wikipedia.org/wiki/Stanford_(disambiguation)
Example on diversity: Ambiguous query [stanford]
89
See http://en.wikipedia.org/wiki/Stanford_(disambiguation)
Definitions for diversity
• Diversity
– related to the breadth of the content
– also related to the quantification of “concepts” in a set of documents, or the quantification of query disambiguation or query intent
• Closely tied to relevance and redundancy
– excluding near-duplicate results
• May have implications for search engine interfaces too
– e.g., clustered or faceted presentations
90
How to measure diversity
• Method #1:
– get editorial judgments on the degree of diversity in a catalog
• Method #2:
– use the number of content or source types for the documents in a catalog
– find the set of concepts in a catalog and measure diversity based on their relationships
• e.g., cluster using document similarity and assign a concept to each cluster
• Method #3 (with a given relevance metric):
– iterate over each intent of the input query
– consider the sets of documents relevant to each intent
– weight the given relevance metric by the probability of each intent
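Method #3 amounts to taking the expectation of a relevance metric over query intents. A sketch with precision@k as the underlying metric (function and variable names are illustrative):

```python
def precision_at_k(rels, k):
    """Precision@k for a binary relevance list."""
    return sum(rels[:k]) / k

def intent_aware(metric, intent_probs, rels_by_intent, k=5):
    """Expectation of `metric` over query intents (Method #3 above):
    score each intent's relevance judgments separately, then weight
    the scores by the intent probabilities."""
    return sum(p * metric(rels_by_intent[i], k)
               for i, p in intent_probs.items())
```

For an ambiguous query like [stanford], a result list that serves only the dominant intent scores well on that intent but zero on the others, so the intent-weighted metric rewards covering several intents.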
91
How to measure diversity: Example
92
• Types: news, organic, rich, ads
• Sources for 10 organic results:
• 4 domains
• Themes for organic results:
• 6 Stanford University related
• 1 Stanford’s restaurant related
• 1 Stanford, MT related
• 1 Stanford, KY related
• Detailed themes for organic results:
• 2 general Stanford U. intro
• 1 Stanford athletics
• 1 Stanford medical school
• 1 Stanford business school
• 1 Stanford news
• 1 Stanford green buildings
• 1 Stanford’s restaurant
• 1 Stanford, MT high school
• 1 Stanford, KY fire department
Further issues to consider
• Categorization and similarity methods
– for documents, queries, sites
• Presentation issues
– single page, clusters, facets, term cloud
• Summarizing diversity
• How to balance diversity against other objectives
– diversity vs. relevance in particular
93
Key problems
• Measure and summarize diversity better
• Measure tradeoffs between diversity and relevance better
• Determine the best presentation of diversity
94
Reference review on diversity metrics
• Goldstein and Carbonell (1998)
– defines maximal marginal relevance (MMR) as a parameterized linear combination of novelty and relevance
• novelty: measured via the similarity among documents (to avoid redundancy)
• relevance: measured via the similarity between documents and the query
• Jain, Sarda, and Haritsa (2003); Chen and Karger (2006); Joachims et al. (2008); and Swaminathan et al. (2008)
– iteratively expand a document set to maximize marginal gain – each time add a new relevant document that is least similar to the existing set – Joachims et al. (2008) address the learning aspect.
• Radlinski and Dumais (2006) – diversifies search results using relevant results to the input query and queries related to it
• Agrawal et al. (2009) – diversifies search results using a taxonomy for classifying queries and documents – also reviews diversity metrics and proposes new ones
• Gollapudi and Sharma (2009) – proposes an axiomatization of result diversification (similar to similar recent efforts for ranking
and clustering) and proves the impossibility of satisfying all properties – enumerates a set of diversification functions satisfying different subsets of properties
• Metrics to measure diversity of a given set of results are proposed by Chen and Karger (2006), Clarke et al. (2008), and Agrawal et al. (2009).
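The MMR-style greedy selection reviewed above can be sketched as follows; `rel` and `sim` are caller-supplied query-relevance and document-similarity functions, and `lam` is the relevance/novelty trade-off parameter (names are my own, not from the papers).

```python
def mmr_rerank(candidates, rel, sim, lam=0.7, k=10):
    """Greedy MMR: at each step pick the document with the best trade-off
    between relevance to the query (rel) and maximum similarity to the
    already-selected documents (sim)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel(d) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam=1.0` this degenerates to ranking purely by relevance; lowering `lam` trades relevance for novelty, which is exactly the diversity/relevance balance flagged as a key problem above.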
95
References
• R. Agrawal, S. Gollapudi, A. Halverson, and S. Leong (2009), Diversifying search results, WSDM’09.
• H. Chen and D.R. Karger (2006), Less is more: Probabilistic models for retrieving fewer relevant documents, SIGIR’06.
• C.L.A. Clarke, M. Kolla, G.V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon (2008), Novelty and diversity in information retrieval evaluation, SIGIR’08.
• J. Goldstein and J. Carbonell (1998), Summarization: (1) Using MMR for Diversity-based Reranking and (2) Evaluating Summaries, SIGIR’98.
• S. Gollapudi and A. Sharma (2009), An axiomatic approach for result diversification, WWW’09.
• A. Jain, P. Sarda, and J.R. Haritsa (2003), Providing Diversity in K-Nearest Neighbor Query Results, CoRR’03.
• F. Radlinski, R. Kleinberg, and T. Joachims (2008), Learning Diverse Rankings with Multi-armed Bandits, ICML’08.
• F. Radlinski and S.T. Dumais (2006), Improving personalized web search using result diversification, SIGIR’06.
• A. Swaminathan, C. Mathew, and D. Kirovski (2008), Essential pages, MSR-TR-2008-015, Microsoft Research.
• Y. Yue, and T. Joachims (2008), Predicting Diverse Subsets Using Structural SVMs, ICML’08.
• C. Zhai and J.D. Lafferty (2006), A risk minimization framework for information retrieval, Info. Proc. and Management, 42(1):31-55.
96
97
Discovery and Latency Metrics
PART IV of
WWW’09 Tutorial on Web Search Engine Metrics by
A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on discovery: Page was born ~30 minutes before
98
Example on discovery: URL of page was not found
99
Example on discovery: But content existed under different URLs
100
Example on discovery: URL was also found after ~1 hr
101
Life of a URL
102
[Timeline figure: a URL is BORN, later DISCOVERED, persists until NOW, and eventually EXPIRES. LATENCY is the time from birth to discovery; AGE is the elapsed time since birth.]
Lives of many URLs
103
[Figure: the same timeline repeated for many URLs; each URL has its own LATENCY from birth to discovery.]
How to measure discovery and latency
• Consider a sample of new pages on the Web
– feeds at regular intervals
– each sample monitored for a period (e.g., 15 days)
• User view
– Discovery: measure how many of these new pages appear in the search results
• using the coverage ratio formula
– Latency: measure how long it took for these new pages to appear in the search results
• System view
– Discovery: measure how many of these new pages are in a catalog
– Latency: measure how long it took to get these new pages into a catalog
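Given a monitored sample with birth and discovery timestamps, the discovery ratio and the latency distribution can be computed as in this sketch (timestamps and the monitoring horizon are in arbitrary time units; names are illustrative):

```python
def discovery_and_latency(births, discoveries, horizon):
    """Discovery ratio and latencies for a monitored sample of new pages.
    `births` maps url -> time the page appeared on the Web; `discoveries`
    maps url -> time the engine found it (absent if never discovered).
    A page counts as discovered only within `horizon` of its birth."""
    latencies = [discoveries[u] - born
                 for u, born in births.items()
                 if u in discoveries and discoveries[u] - born <= horizon]
    ratio = len(latencies) / len(births) if births else 0.0
    return ratio, latencies
```

The list of latencies is what the latency profiles on the following slides summarize (e.g., their skew and how close they are to zero for crawlers).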
104
Discovery profile of a search engine component: Overview
105
Time to reach a certain coverage percentage
No expiration yet
Content expired
Convergence
Over many URLs, per search engine component
Other behaviors
Discovery profiles and monitoring: Examples
106
Profiles; monitoring of profile parameters
Latency profiles of a search engine component: Overview
107
Over many URLs, per search engine component
Desired skewness direction; close to zero for crawlers
Latency profiles and monitoring: Examples
108
Profiles; monitoring of profile parameters
Further issues to consider
• How to discover samples to measure discovery and latency
• How to beat crawlers to acquire samples
• Discovery of top-level pages
• Discovery of deep links
• Discovery of hidden web content
• How to balance discovery against other objectives
109
Key problems
• Predict content changes on the Web
• Discover new content almost instantaneously
• Reduce latency per search engine component and overall
110
Reference review on discovery metrics
• Cho, Garcia-Molina, & Page (1998)
– discusses how to order URL accesses based on importance scores
• importance: PageRank (best), link count, similarity to the query in anchortext or the URL string, attributes of the URL string
• Dasgupta et al. (2007)
– formulates the problem of discoverability (discovering new content from the fewest known pages) and proposes approximation algorithms
• Kim and Kang (2007)
– compares the top three search engines for discovery (called “timeliness”), freshness, and latency
• Lewandowski (2008)
– compares the top three search engines for freshness and latency
• Dasdan and Drome (2009)
– proposes discovery metrics along the lines discussed in this section
111
References
• J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.
• A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.
• A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.
• J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
• N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
• D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst.
112
113
Freshness Metrics PART V
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on freshness: Stale abstract in Search Results Page
114
Example on freshness: Actual page content
115
http://en.wikipedia.org/wiki/John_Yoo:
Example on freshness: Fresh abstract now
116
Definitions illustrated for a page
117
(Dasdan and Huynh, WWW’09)
[Timeline figure, times 0–6: the page is CRAWLED and INDEXED at time 0 (last sync), MODIFIED at times 3 and 5, and CLICKED at time 6. The page is up-to-date (fresh) until time 3; its age at time 6 is AGE = 3.]
Definitions illustrated for a page
118
[Same timeline, with freshness and age plotted over time: freshness is 1 until the first modification after the last sync (time 3) and 0 afterwards; age is 0 until time 3, then grows linearly, reaching 3 at time 6.]
(Dasdan and Huynh, WWW’09)
Freshness and age of a page
• The freshness F(p,t) of a local page p at time t is
– 1 if p is up-to-date at time t
– 0 otherwise
• The age A(p,t) of a local page p at time t is
– 0 if p is up-to-date at time t
– t − tmod otherwise, where tmod is the time of the first modification after the last sync of p
119
Freshness and age of a catalog
• S: catalog of documents
• Sc: catalog of clicked documents
• Basic freshness and age
• Unweighted freshness and age
• Weighted freshness and age (c(·): #clicks)
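The equations themselves are missing here; following the page-level definitions above and the cited work of Cho & Garcia-Molina (2003) and Dasdan and Huynh (2009), the catalog-level metrics take the following form (a reconstruction, not the slide’s verbatim formulas):

```latex
% Basic freshness and age of a catalog S: averages of the page-level metrics.
F(S,t) = \frac{1}{|S|}\sum_{p \in S} F(p,t), \qquad
A(S,t) = \frac{1}{|S|}\sum_{p \in S} A(p,t)

% Unweighted forms: the same averages taken over the clicked catalog S_c.
% Weighted forms: averages over S_c weighted by click counts c(p), e.g.
F_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\, F(p,t)}{\sum_{p \in S_c} c(p)},
\qquad
A_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\, A(p,t)}{\sum_{p \in S_c} c(p)}
```

The weighted forms are what make the metric user-centric: staleness on heavily clicked pages hurts the score more than staleness on pages nobody views.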
120
How to measure freshness
• Find the true refresh history of each page in the sample
– needs independent crawling
• Compare with the history in the search engine
• Determine freshness and age
– basic form: averaged over all documents in the catalog
• Consider clicked or viewed documents
– unweighted form: averaged over all clicked or viewed documents in the catalog
– weighted form: the unweighted form weighted by #clicks or #views (or any other weight function)
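For a single page, the definitions from the preceding slides can be evaluated against the true modification history as in this sketch; `mod_times` would come from the independent monitoring crawl, and the numeric timestamps are illustrative.

```python
def page_freshness_and_age(t, last_sync, mod_times):
    """F(p,t) and A(p,t): the page is fresh at time t if no true
    modification occurred after the engine's last sync; otherwise its
    age is measured from the first such modification."""
    missed = [m for m in mod_times if last_sync < m <= t]
    if not missed:
        return 1, 0.0                 # up-to-date: fresh, zero age
    t_mod = min(missed)               # first modification after last sync
    return 0, t - t_mod
```

With the timeline example above (sync at 0, modifications at 3 and 5, evaluated at 6), this yields freshness 0 and age 6 − 3 = 3; averaging these per-page values over the (clicked) catalog gives the catalog-level metrics.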
121
How to measure freshness: Example
122
Further issues to consider
• Sampling pages
– random, from DMOZ, revisited, popular
• Classifying pages
– topical, importance, change period, refresh period
• Refresh period for monitoring
– daily, hourly, minutely
• Measuring change
– hashing (MD5, Broder’s shingles, Charikar’s SimHash), Jaccard index, Dice coefficient, word frequency distribution similarity, structural similarity via DOM trees
• What is change?
– content, “information”, structure, status, links, features, ads
• How to balance freshness against other objectives
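As one concrete instance of the change measures listed above: word-shingle sets compared with the Jaccard index (a small sketch, assuming documents of at least `w` terms; parameters are illustrative).

```python
def shingles(text, w=3):
    """The set of w-term word shingles of a document."""
    terms = text.lower().split()
    return {tuple(terms[i:i + w]) for i in range(len(terms) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 1 - jaccard(old, new)
    is one way to quantify how much a page changed between crawls."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Shingling is robust to small local edits, which helps separate substantive content changes from template or ad churn.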
123
Key problems
• Measure the evolution of the content on the Web
• Design refresh policies to adapt to the changes on the Web
• Reduce latency from discovery to serving
• Improve freshness metrics
124
Reference review on web page change patterns
• Cho & Garcia-Molina (2000): Crawled 720K pages once a day for 4 months.
• Ntoulas, Cho, & Olston (2004): Crawled 150 sites once a week for a year.
– found: most pages didn’t change; changes were minor; frequency of change couldn’t predict degree of change, but degree of change could predict future degree of change
• Fetterly, Manasse, Najork, & Wiener (2003): Crawled 150M pages once a week for 11 weeks.
– found: past change could predict future change; page length & top level domain name were correlated with change;
• Olston & Pandey (2008): Crawled 10K random pages and 10K pages sampled from DMOZ every two days for several months.
– found: moderate correlation between change frequency and information longevity
• Adar, Teevan, Dumais, & Elsas (2009): Crawled 55K revisited pages (sub)hourly for 5 weeks.
– found: higher change rates compared to random pages; large portions of pages changing more than hourly; focus on pages with important static or dynamic content;
125
Reference review on predicting refresh rates
• Grimes, Ford & Tassone (2008)
– determines optimal crawl rates under a set of scenarios:
• while still estimating change rates, and while fairly sure of the estimate
• when crawls are expensive, and when they are cheap
• Matloff (2005)
– derives estimators similar to Cho & Garcia-Molina’s but with lower variance (and improved theory)
– also derives estimators for the non-Poisson case
– finds that the Poisson model is not very good for its data
• but the estimators seem accurate (bias around 10%)
• Singh (2007)
– non-homogeneous Poisson, localized windows, piecewise, Weibull, experimental evaluation
• No work seems to consider the non-periodical case.
126
Reference review on freshness metrics
• Cho & Garcia-Molina (2003)
– freshness & age of one page
– average/expected freshness & age of one page & of a corpus
– freshness & age w.r.t. a Poisson model of change
– weighted freshness & age
– sync policies
• uniform (better): all pages at the same rate
• nonuniform: rates proportional to change rates
– sync order
• fixed order (better), random order
– to improve freshness, penalize pages that change too often
– to improve age, sync proportionally to change frequency, but uniform is not far from optimal
• Han et al. (2004) and Dasdan and Huynh (2009) add the user perspective with weights.
• Lewandowski (2008) and Kim and Kang (2007) compare the top three search engines for freshness.
127
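The Poisson-model analysis reviewed above has a closed form worth recalling. This is my reconstruction of the standard result from Cho & Garcia-Molina’s analysis, not text from the slide:

```latex
% If a page changes according to a Poisson process with rate \lambda and
% is synced every I time units, its time-averaged expected freshness is
\bar{F} = \frac{1 - e^{-\lambda I}}{\lambda I}
```

Since $\bar{F}$ decays roughly like $1/(\lambda I)$ for large $\lambda I$, syncing a very rapidly changing page buys little freshness per crawl, which motivates the “penalize pages that change too often” policy.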
References 1/2
• E. Adar, J. Teevan, S. Dumais, and J.L. Elsas (2009), The Web changes everything: Understanding the dynamics of Web content, WSDM’09.
• J. Cho and H. Garcia-Molina (2000), The evolution of the Web and implications for an incremental crawler, VLDB’00.
• D. Fetterly, M. Manasse, M. Najork, and J. Wiener (2003), A Large scale study of the evolution of Web pages, WWW’03.
• F. Grandi (2000), Introducing an annotated bibliography on temporal and evolution aspects in the World Wide Web, SIGMOD Records, 33(2):84-86.
• A. Ntoulas, J. Cho, and C. Olston (2004), What’s new on the Web? The evolution of the Web from a search engine perspective, WWW’04.
128
References 2/2
• J. Cho and H. Garcia-Molina (2003), Effective page refresh policies for web crawlers, ACM Trans. Database Syst., 28(4):390-426.
• J. Cho and H. Garcia-Molina (2003), Estimating frequency of change, ACM Trans. Inter. Tech., 3(3):256-290.
• A. Dasdan and X. Huynh (2009), User-centric content freshness metrics for search engines, WWW’09.
• J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
• J. Han, N. Cercone, and X. Hu (2004), A Weighted freshness metric for maintaining a search engine local repository, WI’04.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
• D. Lewandowski, H. Wahlig, and G. Meyer-Bautor (2006), The freshness of web search engine databases, J. Info. Syst., 32(2):131-148.
• D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst.
• N. Matloff (2005), Estimation of internet file-access/modification rates from indirect data, ACM Trans. Model. Comput. Simul., 15(3):233-253.
• C. Olston and S. Pandey (2008), Recrawl scheduling based on information longevity, WWW’08.
• S.R. Singh (2007), Estimating the rate of web page changes, IJCAI’07.
129