Web Search Engine Metrics for Measuring
User Satisfaction
Ali Dasdan, Kostas Tsioutsiouliklis, Emre Velipasaoglu
{dasdan, kostas, emrev}@yahoo-inc.com
Yahoo! Inc., 20 Apr 2009
Tutorial @ 18th International World Wide Web Conference
http://www2009.org/, April 20-24, 2009
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009.
Disclaimers
• This talk presents the opinions of the authors. It does not necessarily reflect the views of Yahoo! Inc.
• This talk does not imply that these metrics are or should be used by Yahoo!; even if they are used, they may not be used in the way described in this talk.
• The examples are just that – examples. Please do not generalize them to the level of comparing search engines.
Acknowledgments for presentation material (in alphabetical order of last names in each category)
• Coverage – Paolo D’Alberto, Amit Sasturkar
• Discovery – Chris Drome, Kaori Drome
• Freshness – Xinh Huynh
• Presentation – Rob Aseron, Youssef Billawala, Prasad Kantamneni, Diane Yip
• General – Stanford U. presentation audience (organized by Aneesh Sharma and Panagiotis Papadimitriou), Yahoo! presentation audience (organized by Pavel Dmitriev)
Learning objectives
• To learn about user satisfaction metrics
• To learn about how to interpret metrics results
• To get the relevant bibliography
• To learn about the open problems
Scope
• Web “textual” search
• Users’ point of view
• Analysis rather than synthesis
• Intuitive rather than formal
• Not exhaustive coverage (including references)
Outline
• Introduction (30min) – Ali
• Relevance metrics (50min) – Emre
• Break (15min)
• Coverage metrics (15min) – Ali
• Diversity metrics (15min) – Ali
• Discovery metrics (15min) – Ali
• Freshness metrics (15min) – Ali
• Presentation metrics (50min) – Kostas
• Conclusions (5min) – Kostas
Introduction PART “0”
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
“To measure is to know”
“If you cannot measure it, you cannot improve it”
Lord Kelvin (1824-1907)
Why measure? Why metrics?
Search engine pipeline: Simplified architecture
• Serving system: serves user queries and search results
• Content system: acquires and processes content
[Diagram: WWW → Content system (crawlers, web graphs, indexers) → Serving system (search tiers, front-end tiers) → User]
Search engine pipeline: Content selection
[Diagram: content catalogs along the pipeline – the Web, Accessed, Crawled, Graphed, Indexed, Served]
How do you select content to pass to the next catalog?
User view of metrics: Example with coverage metrics (SE #1)
User view of metrics: Example with coverage metrics (SE #2)
User view of metrics: Example with coverage metrics (SE #3)
System view of metrics: Example with coverage metrics
Check for coverage of expected URL http://rain.stanford.edu/schedule/ (if missing from SRP)
Ideal vs. reality
• Ideal view
  – crawl all content
  – discover all changes instantaneously
  – serve all content instantaneously
  – store all content indefinitely
  – meet user’s information need perfectly
• Practical view
  – constraints on the above aspects due to market focus, long tails, cost, resources, complexity
• Moral of the story
  – Cannot make all the users happy all the time!
Sampling methods for metrics
• Random sampling of queries
  – from search engine’s query logs
  – from third-party logs (e.g., ComScore)
• Random sampling of URLs
  – from random walking the Web (see a review in Baykan et al., WWW’06)
  – from directories and similar hubs
  – from RSS feeds and sitemaps
  – from third-party feeds
  – from search engine’s catalogs
  – from competitor’s indices using queries
• Customer-selected samples
Different dimensions for metrics
• Content types and sources
  – news, blogs, wikipedia, forums, scholar; regions, languages; adult, spam, etc.
• Site types
  – small vs. large, region, language
• Document formats
  – html, pdf, etc.
• Query types
  – head, torso, tail; #terms; informational, navigational, transactional; celebrity, adult, business, research, etc.
• Open web vs. hidden web
• Organic vs. commercial
• Dynamic vs. static content
• New content vs. existing content
Further issues to consider
• Rate limitations
  – search engine blocking, hence difficulty of large competitive testing
  – internal bandwidth usage limitations
• Intrusiveness
  – How can metrics queries affect what’s observed?
• Statistical soundness
  – in methods used and guarantees provided
  – accumulation of errors
  – the “value” question, e.g., what is “random”? Is “random” good enough?
• Undesired positive feedback, or the chicken-and-egg problem
  – Focus on popular queries may make them more popular at the expense of what’s potentially good for the future.
• Controlled feedback, or labeled training and testing data
  – Paid human judges (or editors), crowdsourcing (e.g., Amazon’s Mechanical Turk), Games with a Purpose (e.g., Dasdan et al., WWW’09), bucket testing on live traffic, etc.
Key problems for metrics
• Measure user satisfaction
• Compare two search engines
• Optimize for user satisfaction in each component of the pipeline
• Automate all metrics
• Discover anomalies
• Visualize, mine, and summarize metrics data
• Debug problems automatically
Also see: Yahoo! Research list at http://research.yahoo.com/ksc
Relevance Metrics PART I
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on relevance
[Screenshots with callouts: “Ad for gear. OK if I will go to the game.”, “No schedule here.”, “There is a schedule.”, “A different schedule?”, “A different Real Madrid!”]
What is relevance?
• User issues a query to a search engine and receives an ordered list of results…
• Relevance: How effectively was the user’s information need met?
  – How useful were the results?
  – How many of the retrieved results were useful?
  – Were there any useful pages not retrieved?
  – Did the order of the results make the user’s search easier or harder?
  – How successfully did the search engine handle the ambiguity and the subjectivity of the query?
Evaluating relevance
• Set based evaluation
  – basic but fundamental
• Rank based evaluation with explicit absolute judgments
  – binary vs. graded judgments
• Rank based evaluation with explicit preference judgments
  – binary vs. graded judgments
  – practical system testing and incomplete judgments
• Rank based evaluation with implicit judgments
  – direct and indirect evaluation by clicks
• User satisfaction
• More notes
Relevance Metrics: Set Based Evaluation
Precision
– True Positive (TP): a retrieved document is relevant
– False Positive (FP): a retrieved document is not relevant
• Kent et al. (1955)

Precision = (# relevant items retrieved) / (# retrieved items) = TP / (TP + FP) = Prob(relevant | retrieved)
• How many of the retrieved results were useful?
Recall
Recall = (# relevant items retrieved) / (# relevant items) = TP / (TP + FN) = Prob(retrieved | relevant)

– True Positive (TP): a retrieved document is relevant
– False Negative (FN): a relevant document is not retrieved
• Kent et al. (1955)
• Were there any useful pages left not retrieved?
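The two set-based definitions above reduce to a few set operations; a minimal sketch in Python (the function name and example document ids are illustrative, not from the tutorial):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall from binary relevance judgments.

    retrieved: set of retrieved document ids
    relevant:  set of truly relevant document ids
    """
    tp = len(retrieved & relevant)   # relevant items retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Example: 3 of 5 retrieved documents are relevant, out of 4 relevant overall.
p, r = precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 9})
# p = 3/5 = 0.6, r = 3/4 = 0.75
```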
Properties of precision and recall
• Precision decreases when false positives increase
• False positives
  – are also known as false alarms in signal processing
  – correspond to Type I errors in statistical hypothesis testing
• Recall decreases when false negatives increase
• False negatives
  – are also known as missed opportunities
  – correspond to Type II errors in statistical hypothesis testing
F-measure
• Inconvenient to have two numbers
• F-measure: harmonic mean of precision and recall
  – related to van Rijsbergen’s effectiveness measure
  – reflects the user’s willingness to trade precision for recall, controlled by a parameter selected by the system designer

F = 1 / (α(1/P) + (1 - α)(1/R)) = (β² + 1)PR / (β²P + R), where α = 1 / (β² + 1)

F(β = 1) = 2PR / (P + R)
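The F-measure formula above can be sketched directly; β > 1 weighs recall higher and β < 1 weighs precision higher (a hypothetical helper, not from the tutorial):

```python
def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r.

    beta > 1 favors recall; beta < 1 favors precision; beta = 1 gives F1.
    """
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1.0) * p * r / (b2 * p + r)

f1 = f_measure(0.6, 0.75)  # 2 * 0.6 * 0.75 / (0.6 + 0.75) ≈ 0.667
```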
Various means of precision and recall
[Figure: arithmetic mean, geometric mean, F1, F2, and F0.5 of precision and recall as a function of precision, at a fixed recall of 70%]
Relevance Metrics: Rank Based Evaluation with Explicit Absolute Judgments
Extending precision and recall
• So far, considered:
  – How many of the retrieved results were useful?
  – Were there any useful pages left not retrieved?
• Next, consider:
  – Did the order of the results make the user’s search for information easier or harder?
• Extending set based precision/recall to a ranked list
  – It is possible to define many sets over a ranked list.
  – E.g., start with a set including the first result and progressively increase the size of the set by adding the next result.
• Precision-recall curve:
  – Calculate precision at standard recall levels and interpolate.
Precision-recall curve example
rank | relevance | TP | FP | FN | recall | precision | interpolated precision
  1  |     1     |  1 |  0 |  3 |  0.25  |   1.00    |   1.00
  2  |     1     |  2 |  0 |  2 |  0.50  |   1.00    |   1.00
  3  |     0     |  2 |  1 |  2 |  0.50  |   0.67    |   0.75
  4  |     1     |  3 |  1 |  1 |  0.75  |   0.75    |   0.75
  5  |     0     |  3 |  2 |  1 |  0.75  |   0.60    |   0.60
  6  |     0     |  3 |  3 |  1 |  0.75  |   0.50    |   0.57
  7  |     1     |  4 |  3 |  0 |  1.00  |   0.57    |   0.57
  8  |     0     |  4 |  4 |  0 |  1.00  |   0.50    |   0.50
  9  |     0     |  4 |  5 |  0 |  1.00  |   0.44    |   0.44
 10  |     0     |  4 |  6 |  0 |  1.00  |   0.40    |   0.40
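The interpolated precision column takes, at each rank, the maximum precision observed at that rank or any deeper rank; a sketch of both computations (function names are illustrative):

```python
def pr_points(relevance, num_relevant):
    """(recall, precision) after each rank of a binary relevance list."""
    points, tp = [], 0
    for k, rel in enumerate(relevance, start=1):
        tp += rel
        points.append((tp / num_relevant, tp / k))
    return points

def interpolate(points):
    """Interpolated precision: max precision at this rank or any deeper rank."""
    interp, best = [], 0.0
    for recall, prec in reversed(points):
        best = max(best, prec)
        interp.append((recall, best))
    return list(reversed(interp))

pts = pr_points([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], num_relevant=4)
interp = interpolate(pts)
# At rank 3, precision 0.67 is interpolated up to 0.75; at rank 6, up to 0.57.
```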
Precision-recall curve example
[Figure: precision and interpolated precision plotted against recall for the example table]
Average precision-recall curve
• A precision-recall curve is for one ranked list (i.e., one query).
• To evaluate relevance of a search engine:
  – Calculate interpolated precision-recall curves for a sample of queries at 11 points (Recall = 0.0:0.1:1.0).
  – Average over the test sample of queries.

[Figure: averaged 11-point interpolated precision-recall curve]
Mean average precision (MAP)
• Single number instead of a graph
• Measure of quality at all recall levels
• Average precision for a single query:

AP = (1 / # relevant) × Σ_{k=1}^{# relevant} (Precision at rank of kth relevant document)

• MAP: mean of average precision over all queries
  – Most frequently, the arithmetic mean is used over the query sample.
  – Sometimes, the geometric mean can be useful by putting emphasis on low performing queries.
Average precision example
rank | relevance | TP | FP | FN |  R   |  P   | P@rel(k)
  1  |     1     |  1 |  0 |  3 | 0.25 | 1.00 |   1.00
  2  |     1     |  2 |  0 |  2 | 0.50 | 1.00 |   1.00
  3  |     0     |  2 |  1 |  2 | 0.50 | 0.67 |   0
  4  |     1     |  3 |  1 |  1 | 0.75 | 0.75 |   0.75
  5  |     0     |  3 |  2 |  1 | 0.75 | 0.60 |   0
  6  |     0     |  3 |  3 |  1 | 0.75 | 0.50 |   0
  7  |     1     |  4 |  3 |  0 | 1.00 | 0.57 |   0.57
  8  |     0     |  4 |  4 |  0 | 1.00 | 0.50 |   0
  9  |     0     |  4 |  5 |  0 | 1.00 | 0.44 |   0
 10  |     0     |  4 |  6 |  0 | 1.00 | 0.40 |   0

# relevant = 4; average precision = (1.00 + 1.00 + 0.75 + 0.57) / 4 ≈ 0.83
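The AP column sums precision only at the ranks of relevant documents; a sketch (function names are illustrative):

```python
def average_precision(relevance, num_relevant):
    """AP: mean of precision measured at the rank of each relevant document."""
    tp, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            tp += 1
            total += tp / k
    return total / num_relevant

def mean_average_precision(runs):
    """MAP: arithmetic mean of AP over a sample of queries."""
    return sum(average_precision(rel, n) for rel, n in runs) / len(runs)

ap = average_precision([1, 1, 0, 1, 0, 0, 1, 0, 0, 0], num_relevant=4)
# (1.00 + 1.00 + 0.75 + 4/7) / 4 ≈ 0.83
```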
Precision @ k
• MAP evaluates precision at all recall levels.
• In web search, the top portion of a result set is more important.
• A natural alternative is to report precision at top-k (e.g., top-10).
• Problem:
  – Not all queries will have more than k relevant results, so even a perfect system may score less than 1.0 for some queries.
R-precision
• Allan (2005)
• Use a variable result set cut-off for each query based on the number of its relevant results.
• In this case, a perfect system can score 1.0 over all queries.
• Official evaluation metric of the TREC HARD track
• Highly correlated with MAP
Mean reciprocal rank (MRR)
• Voorhees (1999)
• Reciprocal of the rank of the first relevant result, averaged over a population of queries
• Possible to define it for entities other than explicit absolute relevance judgments (e.g., clicks; see implicit judgments later on)

MRR = (1 / #queries) × Σ_{q=1}^{#queries} 1 / rank(1st relevant result of query q)
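The MRR formula can be sketched from the rank of each query's first relevant result (a hypothetical helper; queries with no relevant result are scored 0 here, which is one common convention):

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over a query sample, given each query's rank of the first relevant
    result (None when no relevant result was returned)."""
    rr = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)

# First relevant results at ranks 1, 3, and 2 for three queries:
mrr = mean_reciprocal_rank([1, 3, 2])  # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```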
Graded Relevance
• So far, the evaluation methods did not measure satisfaction in the following aspects:
  – How useful were the results?
    • Do documents have grades of usefulness in meeting an information need?
  – How successfully did the search engine handle the ambiguity and the subjectivity of the query?
    • Is the information need of the user clear in the query?
    • Do different users mean different things with the same query?
• Can we cover these aspects by using graded relevance judgments instead of binary?
  – very useful
  – somewhat useful
  – not useful
Precision-recall curves
• If we have grades of relevance, how can we modify some of the binary relevance measures?
• Calculate precision-recall curves at each grade level (Järvelin and Kekäläinen (2000)).
• Informative, but too many curves to compare
Discounted cumulative gain (DCG)
• Järvelin and Kekäläinen (2002)
• Gain adjustable for importance of different relevance grades for user satisfaction
• Discounting desirable for web ranking
  – Most users don’t browse deep.
  – Search engines truncate the list of results returned.

DCG = Σ_{r=1}^{R} Gain(result@r) / log_b(r + 1)

  – The discount is proportional to the effort to reach the result at rank r; the gain is proportional to the utility of the result at rank r.
DCG example
• Gain for various grades
  – Very useful (V): 3
  – Somewhat useful (S): 1
  – Not useful (N): 0
• E.g., results ordered as VSN:

DCG = 3/log2(1+1) + 1/log2(2+1) + 0/log2(3+1) = 3.63

• E.g., results ordered as VNS:

DCG = 3/log2(1+1) + 0/log2(2+1) + 1/log2(3+1) = 3.50
Normalized DCG (nDCG)
• DCG yields unbounded scores. It is desirable for the best possible result set to have a score of 1.
• For each query, divide the DCG by the best attainable DCG for that query.
• E.g., VSN:

nDCG = 3.63 / 3.63 = 1.00

• E.g., VNS:

nDCG = 3.50 / 3.63 = 0.96
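DCG and nDCG with base-2 discounting can be sketched as follows (function names are illustrative; the ideal DCG is obtained by sorting the gains in decreasing order):

```python
import math

def dcg(gains, base=2):
    """DCG = sum over ranks r of gain(r) / log_b(r + 1)."""
    return sum(g / math.log(r + 1, base) for r, g in enumerate(gains, start=1))

def ndcg(gains, base=2):
    """nDCG: DCG divided by the best attainable DCG for the same gains."""
    ideal = dcg(sorted(gains, reverse=True), base)
    return dcg(gains, base) / ideal if ideal > 0 else 0.0

# Gains 3 (very useful), 1 (somewhat useful), 0 (not useful):
perfect = ndcg([3, 1, 0])  # already in the ideal order
buried = ndcg([1, 0, 3])   # the best result at rank 3 drags the score down
```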
Relevance Metrics: Rank Based Evaluation with Explicit Preference Judgments
Kendall tau coefficient
• Based on counts of preferences
  – Preference judgments are cheaper and easier/cleaner than absolute judgments.
  – But one may need to deal with circular preferences.
• Range in [-1, 1]
  – τ = 1 when all preferences are in agreement
  – τ = -1 when all disagree
• Robust for incomplete judgments
  – Just use the known set of preferences.

τ = (A - D) / (A + D), where A = # preferences in agreement and D = # preferences in disagreement
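Given a set of known preference pairs and a system ranking, τ follows directly (a hypothetical helper that assumes every judged item appears in the ranking):

```python
def kendall_tau(prefs, ranking):
    """Kendall tau over a known set of preference judgments.

    prefs:   iterable of (x, y) pairs meaning "x is preferred over y"
    ranking: list of items, best first, produced by the system under test
    """
    pos = {item: i for i, item in enumerate(ranking)}
    agree = sum(1 for x, y in prefs if pos[x] < pos[y])
    disagree = len(prefs) - agree
    return (agree - disagree) / (agree + disagree)

# The ranking honors two of three known preferences:
tau = kendall_tau([("a", "b"), ("a", "c"), ("c", "b")], ["a", "b", "c"])
# (2 - 1) / (2 + 1) ≈ 0.33
```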
Binary preference (bpref)
• Buckley and Voorhees (2004)
• Designed in particular for incomplete judgments
• Similar to some other relevance metrics (MAP)
• Can be generalized to graded judgments

For a query with R relevant results:

bpref = (1/R) × Σ_{r ∈ relevant} (1 - N_r / R)  ∝  A / (A + D)

where N_r is the number of judged non-relevant docs ranked above relevant doc r, counted among the first R judged non-relevant docs.
Bpref example
rank | relevance | N_r | R | 1 - N_r/R
  1  |     0     |     |   |
  2  |     1     |  1  | 3 |   0.67
  3  |    NA     |     |   |
  4  |     1     |  1  | 3 |   0.67
  5  |    NA     |     |   |
  6  |     0     |     |   |
  7  |     0     |     |   |
  8  |     0     |     |   |
  9  |     1     |  3  | 3 |   0
 10  |     0     |     |   |

# relevant = 3; # non-relevant = 5; # unjudged = 2; bpref = (0.67 + 0.67 + 0) / 3 ≈ 0.44
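The computation above can be sketched as follows (a hypothetical helper; unjudged results are simply skipped, and the count of non-relevant documents above a relevant one is capped at R):

```python
def bpref(judgments):
    """bpref over a ranked list: 1 = relevant, 0 = judged non-relevant,
    None = unjudged (unjudged results are skipped entirely)."""
    R = sum(1 for j in judgments if j == 1)
    score, nonrel_above = 0.0, 0
    for j in judgments:
        if j == 1:
            score += 1.0 - min(nonrel_above, R) / R
        elif j == 0:
            nonrel_above += 1
    return score / R

# The ten-result example: relevant at ranks 2, 4, 9; unjudged at ranks 3, 5.
b = bpref([0, 1, None, 1, None, 0, 0, 0, 1, 0])  # (2/3 + 2/3 + 0) / 3 ≈ 0.44
```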
Generalization of bpref to graded judgments - rpref
• De Beer and Moens (2006)
• Graded relevance version of bpref
• Sakai (2007) gives a corrected version expressed in terms of cumulative gain.

rpref_relative(R) = (1 / CG_ideal(R)) × Σ_{r ≤ R, g(r) > 0} g(r) × (1 - penalty(r) / N_r)

penalty(r) = Σ_{i < r, g(i) < g(r)} (g(r) - g(i)) / g(r)   (a soft count of out-of-order pairs)

where g(r) is the relevance gain of the result at rank r, N_r is the number of judged docs above rank r, and CG_ideal(R) is the ideal cumulative gain at depth R.
Practical system testing with incomplete judgments
• Comparing two search engines in practice
  – Scrape top-k result sets for a sample of queries.
  – Calculate any of the metrics above for each engine and compare using a statistical test (e.g., a paired t-test).
• Need judgments
  – Use existing judgments.
  – If judgments are missing, use a metric robust to missing judgments.
Comparing various metrics under incomplete judgment scenario
• Sakai (2007) simulates incomplete judgments by sampling from pooled judgments.
  – Stratified sampling yields various levels of completeness, from 100% down to 10%.
• He then tests bpref, rpref, MAP, Q-measure, and normalized DCG (nDCG).
  – Q-measure is similar to rpref (see Sakai (2007)).
  – Since all but the first two are originally designed for complete judgments, he tests two versions of them:
    • one based on assuming results with missing judgments are non-relevant,
    • and another computed on condensed lists obtained by removing results with missing judgments.
• nDCG with incomplete absolute judgments
  – As in average precision based measures, one can ignore the unjudged documents when using normalized DCG.
Robustness of evaluation with incomplete judgments
• Among the original methods, only bpref and rpref stay stable with increasing incompleteness.
• nDCG, Q-measure, and MAP computed on condensed lists also perform well.
  – Furthermore, they have more discriminative power.
• Graded relevance metrics are more robust to incompleteness than binary metrics.
• nDCG and Q-measure on condensed lists are the best metrics.
Average precision based rank correlation
• Yilmaz, Aslam, and Robertson (2008)
• Kendall tau rank correlation as a random variable
  – Pick a pair of items at random.
  – Define p: return 1 if the pair is in the same order in both lists, 0 otherwise.
• Rank correlation based on average precision as a random variable
  – Pick an item at random from the 1st list (other than the top item).
  – Pick another document at random above the current one.
  – Define p': return 1 if this pair is in the same relevance order in the 2nd list, 0 otherwise.
• Agreement at the top of the list is rewarded.

τ = (A - D) / (A + D) = p - (1 - p) = 2p - 1
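The AP-based correlation can be sketched from its probabilistic definition (a hypothetical helper; both lists are assumed to rank the same items):

```python
def tau_ap(list1, list2):
    """AP-based rank correlation (tau_ap) of Yilmaz, Aslam, and Robertson.

    For each item below the top of list1, take the fraction of items ranked
    above it in list1 that are also above it in list2; average these
    fractions into a probability p and map it to tau_ap = 2p - 1.
    """
    pos2 = {item: i for i, item in enumerate(list2)}
    fractions = []
    for i, item in enumerate(list1[1:], start=1):
        above_in_both = sum(1 for other in list1[:i] if pos2[other] < pos2[item])
        fractions.append(above_in_both / i)
    p = sum(fractions) / len(fractions)
    return 2.0 * p - 1.0

same = tau_ap(["a", "b", "c", "d"], ["a", "b", "c", "d"])      # 1.0
opposite = tau_ap(["a", "b", "c", "d"], ["d", "c", "b", "a"])  # -1.0
```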
Relevance Metrics: Rank Based Evaluation with Implicit Judgments
Implicit judgments from clicks
• Explicit judgments are expensive.
• A search engine has lots of user interaction data:
  – which results were viewed for a query, and
  – which of those received clicks.
• Can we obtain implicit judgments of satisfaction or relevance from clicks?
  – Clicks are highly biased:
    • presentation details (order of results, attractiveness of abstracts)
    • trust and other subtle aspects of the user’s need
  – Not impossible: some innovative methods are emerging.
• Pros: cheap; better model of ambiguity and subjectivity
• Cons: noisy and retroactive (may expose poor quality search engines to live traffic)
Performance metrics from user logs
• A naïve way to utilize user interaction data is to compute basic statistics from raw observations:
  – abandonment rate
  – reformulation rate
  – number of queries per session
  – clicks per query
  – mean reciprocal rank of clicked results
  – time to first or last click
• Intuitive, but it is not clear how sensitive these metrics are to what we want to measure.
Implicit preference judgments from clicks
• Joachims (2002)
• Radlinski and Joachims (2005)
• These are document level preference judgments and have not been used in evaluation.

[Example: for results A, B, C in order, skipping A and B and clicking C implies C>A and C>B; clicking A and skipping B and C implies A>B]
Direct evaluation by clicks
• Randomly interleave the two result sets to be compared.
  – Have the same number of links from the top of each result set.
  – More clicks on links from one result set indicate a preference for it.
• Balanced interleaving (Joachims (2003))
  – Determine randomly which side goes first at the start.
  – Pick the next available result from the side that has the turn, while removing duplicates.
  – Caution: biased when the two result sets are nearly identical.
• Team draft interleaving (Radlinski et al. (2008))
  – Determine randomly which side goes first at each round.
  – Pick the next available result from the side that has the turn, while removing duplicates.
• Effectively removes the rank bias, but not directly applicable to evaluation of multi-page sessions.
Interleaving example
[Table: two rankings to compare, A = (a, b, c, d, e, f, g, h, i, j) and B = (b, e, a, f, g, h, k, c, d, i), with the resulting balanced interleavings (A first and B first) and two example team draft interleavings showing which side captained each round]
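Team draft interleaving, as described above, can be sketched as follows (a simplified sketch; the drafting details follow Radlinski et al. only approximately):

```python
import random

def team_draft_interleave(list_a, list_b, seed=None):
    """Team draft interleaving: in each round a coin flip decides which side
    drafts first; each side then takes its highest-ranked result not already
    in the interleaved list, and remembers it as a member of its team."""
    rng = random.Random(seed)
    interleaved = []
    teams = {"A": [], "B": []}
    pools = {"A": list(list_a), "B": list(list_b)}
    while pools["A"] or pools["B"]:
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for side in order:
            # Drop results already shown, then draft the side's best remaining.
            pools[side] = [d for d in pools[side] if d not in interleaved]
            if pools[side]:
                doc = pools[side].pop(0)
                interleaved.append(doc)
                teams[side].append(doc)
    return interleaved, teams

ranking, teams = team_draft_interleave(list("abcd"), list("bcde"), seed=0)
# Clicks on documents credited to teams["A"] vs. teams["B"] indicate preference.
```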
Indirect evaluation by clicks
• Carterette and Jones (2007)
• Relevance as a multinomial random variable: P{R_i = grade_j}
• Model absolute judgments by clicks c:

log [ P(R > g_j | q, c) / P(R ≤ g_j | q, c) ] = α_j + β_q + Σ_{i=1}^{N} β_i c_i + Σ_{i<k} β_ik c_i c_k

p(R | q, c) = Π_{i=1}^{N} p(R_i | q, c)

• Expected DCG (incomplete judgments are OK):

E[DCG_N] = E[R_1] + Σ_{i=2}^{N} E[R_i] / log2(i)
Indirect evaluation by clicks (cont’d)
• Comparing two search engines:

E[ΔDCG] = E[DCG_A] - E[DCG_B]

• Predict whether the difference is statistically significant, e.g., P(ΔDCG < 0) ≥ 0.95
  – use Monte Carlo simulation
• Can improve confidence by asking for labels on the result i maximizing |E[G_i^A] - E[G_i^B]|, where G_i = R_i if rank(i) = 1, and G_i = R_i / log2(rank(i)) otherwise.
• Efficient, but effectiveness depends on the quality of the relevance model obtained from the clicks.
Relevance Metrics: User Satisfaction
Relevance evaluation and user satisfaction
• So far, we focused on the evaluation method rather than the entity (i.e., user satisfaction) to be evaluated.
• Subtle and salient aspects of user satisfaction are difficult for traditional relevance:
  – e.g., trust, expectation, patience, ambiguity, subjectivity
  – Explicit absolute or preference judgments are not very successful in addressing all aspects at once.
  – Implicit judgment models get one step closer to user satisfaction by incorporating user feedback.
• The popular IR relevance metrics are not strongly based on user tasks and experiences.
  – Turpin and Scholer (2006): precision based metrics such as MAP fail to assess user satisfaction on tasks targeting recall.
Modeling user satisfaction
• Huffman and Hochster (2007)
• Obtain explicit judgments of true satisfaction over a sample of sessions or any other grain.
• Develop a predictive model based on observable statistics:
  – explicit absolute relevance judgments
  – number of user actions in a session
  – query classification
• Carry out correlation analysis.
• Pros: more direct than many other evaluation metrics
• Cons: more exploratory than a usable metric at this stage
Relevance Metrics: More Notes
Relevance through search system components
• Relevance can explicitly be measured for each search system component (Dasdan and Drome (2009)).
  – Use set based evaluation for WWW, catalog, and database tiers.
    • Rank based evaluation can be used if the sampled subset is ordered by explicit judgments or by using an order inferred from a downstream component.
  – Yields approximate upper bounds.
  – Use rank based evaluation for candidate documents and the result set.
• Useful for quantifying and monitoring the relevance gap:
  – intra-system relevance gap, by comparing different system stages
  – inter-system relevance gap, by comparing against external benchmarks

[Diagram: WWW → crawl → catalog → index tiers 1..N; a query drives selection of a candidate doc list, which ranking turns into the result set]
Where to find more
• Traditional relevance metrics have deep roots in information retrieval:
  – Cranfield experiments (Cleverdon (1991))
  – SMART (Salton (1991))
  – TREC (Voorhees and Harman (2005))
• Modern metrics address cost and noise by using statistical inference in more advanced ways.
• For more on relevance evaluation, see:
  – Manning, Raghavan, and Schütze (2008)
  – Croft, Metzler, and Strohman (2009)
• For more on the user dimension, see:
  – Baeza-Yates and Ribeiro-Neto (1999)
  – Spink and Cole (2005)
References 1/2
• J. Allan (2005), HARD track overview in TREC 2005: High accuracy retrieval from documents.
• R. Baeza-Yates and B. Ribeiro-Neto (1999), Modern Information Retrieval, Addison-Wesley.
• C. Buckley and E.M. Voorhees (2004), Retrieval Evaluation with Incomplete Information, SIGIR’04.
• B. Carterette and R. Jones (2007), Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks, NIPS’07.
• C.W. Cleverdon (1991), The significance of the Cranfield tests on index languages, SIGIR’91.
• B. Croft, D. Metzler, and T. Strohman (2009), Search Engines: Information Retrieval in Practice, Addison-Wesley.
• A. Dasdan and C. Drome (2008), Measuring Relevance Loss of Search Engine Components, submitted.
• J. De Beer and M.-F. Moens (2006), Rpref - A Generalization of Bpref towards Graded Relevance Judgments, SIGIR’06.
• S.B. Huffman and M. Hochster (2007), How Well does Result Relevance Predict Session Satisfaction? SIGIR’07.
• K. Järvelin and J. Kekäläinen (2000), IR evaluation methods for retrieving highly relevant documents, SIGIR’00.
• K. Järvelin and J. Kekäläinen (2002), Cumulated Gain-Based Evaluation of IR Techniques, ACM Trans. IS 20(4):422-446.
• T. Joachims (2002), Optimizing Search Engines using Clickthrough Data, SIGKDD’02.
• T. Joachims (2003), Evaluating Retrieval Performance using Clickthrough Data, in J. Franke et al. (eds.), Text Mining, Physica Verlag.
References 2/2
• A. Kent, M.M. Berry, F.U. Luehrs Jr., and J.W. Perry (1955), Machine literature searching VIII. Operational criteria for designing information retrieval systems, American Documentation 6(2):93-101.
• C. Manning, P. Raghavan, and H. Schütze (2008), Introduction to Information Retrieval, Cambridge University Press.
• F. Radlinski, M. Kurup, and T. Joachims (2008), How Does Clickthrough Data Reflect Retrieval Quality? CIKM’08.
• F. Radlinski and T. Joachims (2005), Evaluating the Robustness of Learning from Implicit Feedback, ICML’05.
• T. Sakai (2007), Alternatives to Bpref, SIGIR’07.
• G. Salton (1991), The SMART project in automatic document retrieval, SIGIR’91.
• A. Spink and C. Cole (eds.) (2005), New Directions in Cognitive Information Retrieval, Springer.
• A. Turpin and F. Scholer (2006), User performance versus precision measures for simple search tasks, SIGIR’06.
• C.J. van Rijsbergen (1979), Information Retrieval (2nd ed.), Butterworth.
• E.M. Voorhees and D. Harman (eds.) (2005), TREC: Experiment and Evaluation in Information Retrieval, MIT Press.
• E.M. Voorhees (1999), TREC-8 Question Answering Track Report.
• E. Yilmaz and J. Aslam (2006), Estimating Average Precision with Incomplete and Imperfect Information, CIKM’06.
• E. Yilmaz, J. Aslam, and S. Robertson (2008), A New Rank Correlation Coefficient for Information Retrieval, SIGIR’08.
Coverage Metrics PART II
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on coverage: Heard some interesting news; decided to search
Example on coverage: URL was not found
Example on coverage: But content was found under different URLs
Example on coverage: URL was also found after some time
Definitions for coverage
• Coverage refers to the presence of content of interest in a catalog.
• Coverage ratio
  – defined as the ratio of the number of documents (pages) found to the number of documents (pages) tested
  – can be represented as a distribution when many document attributes are considered together
Some background: Shingling and Jaccard Index
Doc = (a b c d e) (5 terms); 2-grams: (a b, b c, c d, d e)
Shingles for the 2-grams (after hashing them): 10, 3, 7, 16; min shingle: 3 (used as a signature of Doc)

Doc1 = (a b c d e), Doc2 = (a e f g)
Doc1 ∩ Doc2 = (a e), Doc1 ∪ Doc2 = (a b c d e f g)

Jaccard index = |Doc1 ∩ Doc2| / |Doc1 ∪ Doc2| = 2 / 7 ≈ 30% (shingling estimates this index)
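The shingling estimate rests on a property of min-hashing: under a random hash function, the probability that two documents share the same minimum shingle equals the Jaccard index of their shingle sets. A hypothetical sketch, using salted md5 hashes for illustration:

```python
import hashlib

def jaccard(s1, s2):
    """Exact Jaccard index of two sets."""
    return len(s1 & s2) / len(s1 | s2)

def min_shingle(terms, k=2, salt=""):
    """Minimum hashed k-gram shingle; a compact signature of the document."""
    grams = (" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1))
    return min(int(hashlib.md5((salt + g).encode()).hexdigest(), 16)
               for g in grams)

def minhash_estimate(t1, t2, k=2, num_hashes=64):
    """Fraction of salted hash functions under which both documents have the
    same minimum shingle; estimates the Jaccard index of the shingle sets."""
    matches = sum(min_shingle(t1, k, str(i)) == min_shingle(t2, k, str(i))
                  for i in range(num_hashes))
    return matches / num_hashes

doc1, doc2 = "a b c d e".split(), "a e f g".split()
exact = jaccard(set(doc1), set(doc2))   # |{a,e}| / |{a,...,g}| = 2/7 ≈ 0.29
approx = minhash_estimate(doc1, doc2)   # Jaccard of the 2-gram shingle sets
```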
How to measure coverage
• Given an input document with its URL
• Query by URL (QBU)
  – Enter the URL at the target search engine’s query interface.
  – If the URL is not found, then iterate using “normalized” forms of the same URL.
• Query by content (QBC)
  – If the URL is not given or the URL search has failed, then perform this search.
  – Generate a set of queries (called strong queries) from the document.
  – Submit the queries to the target search engine’s query interface.
  – Combine the returned results.
  – Perform a more thorough similarity check between the returned documents and the input document.
• Compute the coverage ratio over multiple documents.
Query-by-Content flowchart
[Flowchart: terms from the page form a string signature; the strings are combined into queries; search results are extracted; a similarity check using shingles compares them to the input page]
Query by content: How to generate queries
• Select sequences of terms randomly
  – Find the document’s shingles signature.
  – Find the corresponding sequences of terms.
  – This method can produce the same query signature for the same document, as opposed to the method of just selecting random sequences of terms from the document.
• Select sequences of terms by frequency
  – terms with the lowest frequency or highest TF-IDF
• Select sequences of terms by position
  – e.g., +/- two terms at every 5th term
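A frequency-based strong-query generator can be sketched as follows (a hypothetical sketch: it anchors fixed-length queries at the document's rarest terms, using in-document frequency as a stand-in for corpus frequency or TF-IDF):

```python
from collections import Counter

def strong_queries(terms, num_queries=3, query_len=4):
    """Generate strong queries by anchoring fixed-length term sequences at
    the document's rarest terms (lowest in-document frequency first)."""
    freq = Counter(terms)
    # Rarest terms first; ties broken by position for determinism.
    anchors = sorted(range(len(terms)), key=lambda i: (freq[terms[i]], i))
    queries = []
    for i in anchors:
        window = terms[i:i + query_len]
        query = " ".join(window)
        if len(window) == query_len and query not in queries:
            queries.append(query)
        if len(queries) == num_queries:
            break
    return queries

queries = strong_queries("the quick brown fox jumps over the lazy dog".split())
# ['quick brown fox jumps', 'brown fox jumps over', 'fox jumps over the']
```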
Further issues to consider
• URL normalization
  – see Dasgupta, Kumar, and Sasturkar (2008)
• Page templates and ads
  – or how to avoid undesired matches
• Search for non-textual content
  – images, mathematical formulas, tables, and other similar structures
• Definition of content similarity
• Syntactic vs. semantic match
• How to balance coverage against other objectives
Key problems
• Measure web growth in general and along any dimension
• Compare search engines automatically and reliably
• Improve content-based search, including semantic-similarity search
• Improve copy detection methods for quality and performance, including URL based copy detection
Reference review on coverage metrics
• Luhn (1957)
– summarizes an input document by selecting terms or sentences by frequency
– Bharat and Broder (1998) discovered the same method independently for a different purpose
• Bar-Yossef and Gurevich (2008)
– introduces improved methods to randomly sample pages from a search engine’s index using its public query interface, a problem introduced by Bharat and Broder (1998)
• Dasdan et al. (2008), Pereira and Ziviani (2004)
– represents an input document by selecting (sequences of) terms randomly or by frequency
– uses the term-based document signature as queries (called strong queries) for similarity search
– Yang et al. (2009) proposes similar methods for blog search
83
References
• Z. Bar-Yossef and M. Gurevich (2008), Random sampling from a search engine’s index, J. ACM, 55(5).
• K. Bharat, A. Broder (1998), A technique for measuring the relative size and overlap of public Web search engines, WWW’98.
• S. Brin, J. Davis, and H. Garcia-Molina (1995), Copy detection mechanisms for digital documents, SIGMOD’95.
• A. Dasdan, P. D’Alberto, C. Drome, and S. Kolay (2008), Automating retrieval for similar content using search engine query interface, submitted.
• A. Dasgupta, R. Kumar, and A. Sasturkar (2008), De-duping URLs via Rewrite Rules, KDD’08.
• H. Luhn (1957), A statistical approach to mechanized encoding and searching of literary information, IBM J. Research and Dev., 1(4):309–317.
• H. P. Luhn (1958), The automatic creation of literature abstracts, IBM J. Research and Dev., 2(2).
• A.R. Pereira Jr. and N. Ziviani (2004), Retrieving similar documents from the Web, J. Web Engineering, 2(4):247-261.
• Y. Yang, N. Bansal, W. Dakka, P. Ipeirotis, N. Koudas, D. Papadias (2009), Query by document, WSDM’09.
84
85
Diversity Metrics PART III
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on diversity: Long query
86
Every result is about the same news.
Example on diversity: Long query
87
More diverse
Example on diversity: Ambiguous query [stanford]
88
See http://en.wikipedia.org/wiki/Stanford_(disambiguation)
Example on diversity: Ambiguous query [stanford]
89
See http://en.wikipedia.org/wiki/Stanford_(disambiguation)
Definitions for diversity
• Diversity
– related to the breadth of the content
– also related to the quantification of “concepts” in a set of documents, or the quantification of query disambiguation or query intent
• Closely tied to relevance and redundancy
– excluding near-duplicate results
• May have implications for search engine interfaces too
– e.g., clustered or faceted presentations
90
How to measure diversity
• Method #1:
– get editorial judgments on the degree of diversity in a catalog
• Method #2:
– use the number of content or source types for the documents in a catalog
– find the set of concepts in a catalog and measure diversity based on their relationships
• e.g., cluster using document similarity and assign a concept to each cluster
• Method #3 (with a given relevance metric):
– iterate over each intent of the input query
– consider the sets of documents relevant to each intent
– weight the given relevance metric by the probability of each intent
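Method #3 amounts to taking the expectation of a relevance metric over query intents. A sketch with precision@k as the underlying metric (function and variable names are illustrative):

```python
def precision_at_k(rels, k):
    """Precision@k for a binary relevance list."""
    return sum(rels[:k]) / k

def intent_aware(metric, intent_probs, rels_by_intent, k=5):
    """Expectation of `metric` over query intents (Method #3 above):
    score each intent's relevance judgments separately, then weight
    the scores by the intent probabilities."""
    return sum(p * metric(rels_by_intent[i], k)
               for i, p in intent_probs.items())
```

For an ambiguous query like [stanford], a result list that serves only the dominant intent scores well on that intent but zero on the others, so the intent-weighted metric rewards covering several intents.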
91
How to measure diversity: Example
92
• Types: news, organic, rich, ads
• Sources for 10 organic results:
• 4 domains
• Themes for organic results:
• 6 Stanford University related
• 1 Stanford’s restaurant related
• 1 Stanford, MT related
• 1 Stanford, KY related
• Detailed themes for organic results:
• 2 general Stanford U. intro
• 1 Stanford athletics
• 1 Stanford medical school
• 1 Stanford business school
• 1 Stanford news
• 1 Stanford green buildings
• 1 Stanford’s restaurant
• 1 Stanford, MT high school
• 1 Stanford, KY fire department
Further issues to consider
• Categorization and similarity methods
– for documents, queries, sites
• Presentation issues
– single page, clusters, facets, term cloud
• Summarizing diversity
• How to balance diversity against other objectives
– diversity vs. relevance in particular
93
Key problems
• Measure and summarize diversity better
• Measure tradeoffs between diversity and relevance better
• Determine the best presentation of diversity
94
Reference review on diversity metrics
• Goldstein and Carbonell (1998)
– defines maximal marginal relevance (MMR) as a parameterized linear combination of novelty and relevance
• novelty: measured via the similarity among documents (to avoid redundancy)
• relevance: measured via the similarity between documents and the query
• Jain, Sarda, and Haritsa (2003); Chen and Karger (2006); Joachims et al. (2008); and Swaminathan et al. (2008)
– iteratively expand a document set to maximize marginal gain – each time add a new relevant document that is least similar to the existing set – Joachims et al. (2008) address the learning aspect.
• Radlinski and Dumais (2006) – diversifies search results using relevant results to the input query and queries related to it
• Agrawal et al. (2009) – diversifies search results using a taxonomy for classifying queries and documents – also reviews diversity metrics and proposes new ones
• Gollapudi and Sharma (2009) – proposes an axiomatization of result diversification (similar to similar recent efforts for ranking
and clustering) and proves the impossibility of satisfying all properties – enumerates a set of diversification functions satisfying different subsets of properties
• Metrics to measure diversity of a given set of results are proposed by Chen and Karger (2006), Clarke et al. (2008), and Agrawal et al. (2009).
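The MMR-style greedy selection reviewed above can be sketched as follows; `rel` and `sim` are caller-supplied query-relevance and document-similarity functions, and `lam` is the relevance/novelty trade-off parameter (names are my own, not from the papers).

```python
def mmr_rerank(candidates, rel, sim, lam=0.7, k=10):
    """Greedy MMR: at each step pick the document with the best trade-off
    between relevance to the query (rel) and maximum similarity to the
    already-selected documents (sim)."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel(d) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam=1.0` this degenerates to ranking purely by relevance; lowering `lam` trades relevance for novelty, which is exactly the diversity/relevance balance flagged as a key problem above.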
95
References
• R. Agrawal, S. Gollapudi, A. Halverson, and S. Leong (2009), Diversifying search results, WSDM’09.
• H. Chen and D.R. Karger (2006), Less is more: Probabilistic models for retrieving fewer relevant documents, SIGIR’06.
• C.L.A. Clarke, M. Kolla, G.V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon (2008), Novelty and diversity in information retrieval evaluation, SIGIR’08.
• J. Goldstein and J. Carbonell (1998), Summarization: (1) Using MMR for Diversity-based Reranking and (2) Evaluating Summaries, SIGIR’98.
• S. Gollapudi and A. Sharma (2009), An axiomatic approach for result diversification, WWW’09.
• A. Jain, P. Sarda, and J.R. Haritsa (2003), Providing Diversity in K-Nearest Neighbor Query Results, CoRR’03.
• F. Radlinski, R. Kleinberg, and T. Joachims (2008), Learning Diverse Rankings with Multi-armed Bandits, ICML’08.
• F. Radlinski and S.T. Dumais (2006), Improving personalized web search using result diversification, SIGIR’06.
• A. Swaminathan, C. Mathew, and D. Kirovski (2008), Essential pages, MSR-TR-2008-015, Microsoft Research.
• Y. Yue, and T. Joachims (2008), Predicting Diverse Subsets Using Structural SVMs, ICML’08.
• C. Zhai and J.D. Lafferty (2006), A risk minimization framework for information retrieval, Info. Proc. and Management, 42(1):31-55.
96
97
Discovery and Latency Metrics
PART IV of
WWW’09 Tutorial on Web Search Engine Metrics by
A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on discovery: Page was born ~30 minutes before
98
Example on discovery: URL of page was not found
99
Example on discovery: But content existed under different URLs
100
Example on discovery: URL was also found after ~1 hr
101
Life of a URL
102
[Timeline figure: a URL is BORN, later DISCOVERED, persists until NOW, and eventually EXPIRES. LATENCY is the time from birth to discovery; AGE is the elapsed time since birth.]
Lives of many URLs
103
[Figure: the same timeline repeated for many URLs; each URL has its own LATENCY from birth to discovery.]
How to measure discovery and latency
• Consider a sample of new pages on the Web
– feeds at regular intervals
– each sample monitored for a period (e.g., 15 days)
• User view
– Discovery: measure how many of these new pages appear in the search results
• using the coverage ratio formula
– Latency: measure how long it took for these new pages to appear in the search results
• System view
– Discovery: measure how many of these new pages are in a catalog
– Latency: measure how long it took to get these new pages into a catalog
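Given a monitored sample with birth and discovery timestamps, the discovery ratio and the latency distribution can be computed as in this sketch (timestamps and the monitoring horizon are in arbitrary time units; names are illustrative):

```python
def discovery_and_latency(births, discoveries, horizon):
    """Discovery ratio and latencies for a monitored sample of new pages.
    `births` maps url -> time the page appeared on the Web; `discoveries`
    maps url -> time the engine found it (absent if never discovered).
    A page counts as discovered only within `horizon` of its birth."""
    latencies = [discoveries[u] - born
                 for u, born in births.items()
                 if u in discoveries and discoveries[u] - born <= horizon]
    ratio = len(latencies) / len(births) if births else 0.0
    return ratio, latencies
```

The list of latencies is what the latency profiles on the following slides summarize (e.g., their skew and how close they are to zero for crawlers).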
104
Discovery profile of a search engine component: Overview
105
Time to reach a certain coverage percentage
No expiration yet
Content expired
Convergence
Over many URLs, per search engine component
Other behaviors
Discovery profiles and monitoring: Examples
106
Profiles; monitoring of profile parameters
Latency profiles of a search engine component: Overview
107
Over many URLs, per search engine component
Desired skewness direction; close to zero for crawlers
Latency profiles and monitoring: Examples
108
Profiles; monitoring of profile parameters
Further issues to consider
• How to discover samples to measure discovery and latency
• How to beat crawlers to acquire samples
• Discovery of top-level pages
• Discovery of deep links
• Discovery of hidden web content
• How to balance discovery against other objectives
109
Key problems
• Predict content changes on the Web
• Discover new content almost instantaneously
• Reduce latency per search engine component and overall
110
Reference review on discovery metrics
• Cho, Garcia-Molina, & Page (1998)
– discusses how to order URL accesses based on importance scores
• importance: PageRank (best), link count, similarity to the query in anchortext or the URL string, attributes of the URL string
• Dasgupta et al. (2007)
– formulates the problem of discoverability (discovering new content from the fewest known pages) and proposes approximation algorithms
• Kim and Kang (2007)
– compares the top three search engines for discovery (called “timeliness”), freshness, and latency
• Lewandowski (2008)
– compares the top three search engines for freshness and latency
• Dasdan and Drome (2009)
– proposes discovery metrics along the lines discussed in this section
111
References
• J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.
• A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.
• A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.
• J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
• N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
• D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst.
112
113
Freshness Metrics PART V
of WWW’09 Tutorial on Web Search Engine Metrics
by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on freshness: Stale abstract in Search Results Page
114
Example on freshness: Actual page content
115
http://en.wikipedia.org/wiki/John_Yoo:
Example on freshness: Fresh abstract now
116
Definitions illustrated for a page
117
(Dasdan and Huynh, WWW’09)
[Timeline figure, times 0–6: the page is CRAWLED and INDEXED at time 0 (last sync), MODIFIED at times 3 and 5, and CLICKED at time 6. The page is up-to-date (fresh) until time 3; its age at time 6 is AGE = 3.]
Definitions illustrated for a page
118
[Same timeline, with freshness and age plotted over time: freshness is 1 until the first modification after the last sync (time 3) and 0 afterwards; age is 0 until time 3, then grows linearly, reaching 3 at time 6.]
(Dasdan and Huynh, WWW’09)
Freshness and age of a page
• The freshness F(p,t) of a local page p at time t is
– 1 if p is up-to-date at time t
– 0 otherwise
• The age A(p,t) of a local page p at time t is
– 0 if p is up-to-date at time t
– t − tmod otherwise, where tmod is the time of the first modification after the last sync of p
119
Freshness and age of a catalog
• S: catalog of documents
• Sc: catalog of clicked documents
• Basic freshness and age
• Unweighted freshness and age
• Weighted freshness and age (c(·): #clicks)
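The equations themselves are missing here; following the page-level definitions above and the cited work of Cho & Garcia-Molina (2003) and Dasdan and Huynh (2009), the catalog-level metrics take the following form (a reconstruction, not the slide’s verbatim formulas):

```latex
% Basic freshness and age of a catalog S: averages of the page-level metrics.
F(S,t) = \frac{1}{|S|}\sum_{p \in S} F(p,t), \qquad
A(S,t) = \frac{1}{|S|}\sum_{p \in S} A(p,t)

% Unweighted forms: the same averages taken over the clicked catalog S_c.
% Weighted forms: averages over S_c weighted by click counts c(p), e.g.
F_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\, F(p,t)}{\sum_{p \in S_c} c(p)},
\qquad
A_w(S_c,t) = \frac{\sum_{p \in S_c} c(p)\, A(p,t)}{\sum_{p \in S_c} c(p)}
```

The weighted forms are what make the metric user-centric: staleness on heavily clicked pages hurts the score more than staleness on pages nobody views.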
120
How to measure freshness
• Find the true refresh history of each page in the sample
– needs independent crawling
• Compare with the history in the search engine
• Determine freshness and age
– basic form: averaged over all documents in the catalog
• Consider clicked or viewed documents
– unweighted form: averaged over all clicked or viewed documents in the catalog
– weighted form: the unweighted form weighted by #clicks or #views (or any other weight function)
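For a single page, the definitions from the preceding slides can be evaluated against the true modification history as in this sketch; `mod_times` would come from the independent monitoring crawl, and the numeric timestamps are illustrative.

```python
def page_freshness_and_age(t, last_sync, mod_times):
    """F(p,t) and A(p,t): the page is fresh at time t if no true
    modification occurred after the engine's last sync; otherwise its
    age is measured from the first such modification."""
    missed = [m for m in mod_times if last_sync < m <= t]
    if not missed:
        return 1, 0.0                 # up-to-date: fresh, zero age
    t_mod = min(missed)               # first modification after last sync
    return 0, t - t_mod
```

With the timeline example above (sync at 0, modifications at 3 and 5, evaluated at 6), this yields freshness 0 and age 6 − 3 = 3; averaging these per-page values over the (clicked) catalog gives the catalog-level metrics.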
121
How to measure freshness: Example
122
Further issues to consider
• Sampling pages
– random, from DMOZ, revisited, popular
• Classifying pages
– topical, importance, change period, refresh period
• Refresh period for monitoring
– daily, hourly, minutely
• Measuring change
– hashing (MD5, Broder’s shingles, Charikar’s SimHash), Jaccard index, Dice coefficient, word frequency distribution similarity, structural similarity via DOM trees
• What is change?
– content, “information”, structure, status, links, features, ads
• How to balance freshness against other objectives
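As one concrete instance of the change measures listed above: word-shingle sets compared with the Jaccard index (a small sketch, assuming documents of at least `w` terms; parameters are illustrative).

```python
def shingles(text, w=3):
    """The set of w-term word shingles of a document."""
    terms = text.lower().split()
    return {tuple(terms[i:i + w]) for i in range(len(terms) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 1 - jaccard(old, new)
    is one way to quantify how much a page changed between crawls."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Shingling is robust to small local edits, which helps separate substantive content changes from template or ad churn.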
123
Key problems
• Measure the evolution of the content on the Web
• Design refresh policies to adapt to the changes on the Web
• Reduce latency from discovery to serving
• Improve freshness metrics
124
Reference review on web page change patterns
• Cho & Garcia-Molina (2000): Crawled 720K pages once a day for 4 months.
• Ntoulas, Cho, & Olston (2004): Crawled 150 sites once a week for a year.
– found: most pages didn’t change; changes were minor; frequency of change couldn’t predict degree of change, but degree of change could predict future degree of change
• Fetterly, Manasse, Najork, & Wiener (2003): Crawled 150M pages once a week for 11 weeks.
– found: past change could predict future change; page length & top level domain name were correlated with change;
• Olston & Pandey (2008): Crawled 10K random pages and 10K pages sampled from DMOZ every two days for several months.
– found: moderate correlation between change frequency and information longevity
• Adar, Teevan, Dumais, & Elsas (2009): Crawled 55K revisited pages (sub)hourly for 5 weeks.
– found: higher change rates compared to random pages; large portions of pages changing more than hourly; focus on pages with important static or dynamic content;
125
Reference review on predicting refresh rates
• Grimes, Ford & Tassone (2008)
– determines optimal crawl rates under a set of scenarios:
• while still estimating change rates, and while fairly sure of the estimate
• when crawls are expensive, and when they are cheap
• Matloff (2005)
– derives estimators similar to Cho & Garcia-Molina’s but with lower variance (and improved theory)
– also derives estimators for the non-Poisson case
– finds that the Poisson model is not very good for its data
• but the estimators seem accurate (bias around 10%)
• Singh (2007)
– non-homogeneous Poisson, localized windows, piecewise, Weibull, experimental evaluation
• No work seems to consider the non-periodical case.
126
Reference review on freshness metrics
• Cho & Garcia-Molina (2003)
– freshness & age of one page
– average/expected freshness & age of one page & of a corpus
– freshness & age w.r.t. a Poisson model of change
– weighted freshness & age
– sync policies
• uniform (better): all pages at the same rate
• nonuniform: rates proportional to change rates
– sync order
• fixed order (better), random order
– to improve freshness, penalize pages that change too often
– to improve age, sync proportionally to change frequency, but uniform is not far from optimal
• Han et al. (2004) and Dasdan and Huynh (2009) add the user perspective with weights.
• Lewandowski (2008) and Kim and Kang (2007) compare the top three search engines for freshness.
127
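The Poisson-model analysis reviewed above has a closed form worth recalling. This is my reconstruction of the standard result from Cho & Garcia-Molina’s analysis, not text from the slide:

```latex
% If a page changes according to a Poisson process with rate \lambda and
% is synced every I time units, its time-averaged expected freshness is
\bar{F} = \frac{1 - e^{-\lambda I}}{\lambda I}
```

Since $\bar{F}$ decays roughly like $1/(\lambda I)$ for large $\lambda I$, syncing a very rapidly changing page buys little freshness per crawl, which motivates the “penalize pages that change too often” policy.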
References 1/2
• E. Adar, J. Teevan, S. Dumais, and J.L. Elsas (2009), The Web changes everything: Understanding the dynamics of Web content, WSDM’09.
• J. Cho and H. Garcia-Molina (2000), The evolution of the Web and implications for an incremental crawler, VLDB’00.
• D. Fetterly, M. Manasse, M. Najork, and J. Wiener (2003), A Large scale study of the evolution of Web pages, WWW’03.
• F. Grandi (2000), Introducing an annotated bibliography on temporal and evolution aspects in the World Wide Web, SIGMOD Records, 33(2):84-86.
• A. Ntoulas, J. Cho, and C. Olston (2004), What’s new on the Web? The evolution of the Web from a search engine perspective, WWW’04.
128
References 2/2
• J. Cho and H. Garcia-Molina (2003), Effective page refresh policies for web crawlers, ACM Trans. Database Syst., 28(4):390-426.
• J. Cho and H. Garcia-Molina (2003), Estimating frequency of change, ACM Trans. Inter. Tech., 3(3):256-290.
• A. Dasdan and X. Huynh (2009), User-centric content freshness metrics for search engines, WWW’09.
• J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
• J. Han, N. Cercone, and X. Hu (2004), A Weighted freshness metric for maintaining a search engine local repository, WI’04.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
• D. Lewandowski, H. Wahlig, and G. Meyer-Bautor (2006), The freshness of web search engine databases, J. Info. Syst., 32(2):131-148.
• D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst.
• N. Matloff (2005), Estimation of internet file-access/modification rates from indirect data, ACM Trans. Model. Comput. Simul., 15(3):233-253.
• C. Olston and S. Pandey (2008), Recrawl scheduling based on information longevity, WWW’08.
• S.R. Singh (2007), Estimating the rate of web page changes, IJCAI’07.
129