Web Search Engine Metrics for Measuring User
Satisfaction [Section 5 of 7: Discovery]
Ali Dasdan, eBay
Kostas Tsioutsiouliklis, Yahoo!
Emre Velipasaoglu, Yahoo!
With contributions from Prasad Kantamneni, Yahoo!
27 Apr 2010
(Update in Aug 2015: The authors work in different companies now.)
© Dasdan, Tsioutsiouliklis, Velipasaoglu, 2009-2010.
Disclaimers
• This talk presents the opinions of the authors. It does not necessarily reflect the views of our employers.
• This talk does not imply that these metrics are used by our employers; if they are used, they may not be used in the way described in this talk.
• The examples are just that – examples. Please do not generalize them to the level of comparing search engines.
Discovery and Latency Metrics
Section 5/7 of the WWW’10 Tutorial on Web Search Engine Metrics by A. Dasdan, K. Tsioutsiouliklis, E. Velipasaoglu
Example on discovery: Page was born ~30 minutes before
Example on discovery: URL of page was not found
Example on discovery: But content existed under different URLs
Example on discovery: URL was also found after ~1 hr
Life of a URL
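[Figure: timeline of one URL along the TIME axis, with the events BORN, DISCOVERED, NOW, and EXPIRED, and the intervals LATENCY and AGE marked on it.]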
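The timeline can be written down as a small record, which makes the two intervals explicit. This is only a sketch: the class and field names are hypothetical, latency is taken as the delay from BORN to DISCOVERED, and age is assumed to run from BORN to NOW (the original figure is not recoverable, so the exact span of AGE is an assumption).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UrlLife:
    """Timestamps in the life of one URL (names are illustrative only)."""
    born: datetime
    discovered: Optional[datetime] = None  # None: not discovered yet
    expired: Optional[datetime] = None     # None: content has not expired

    def age(self, now: datetime) -> timedelta:
        # Assumption: age is measured from birth up to "now".
        return now - self.born

    def latency(self) -> Optional[timedelta]:
        # Latency is the delay between birth and discovery.
        if self.discovered is None:
            return None
        return self.discovered - self.born


born = datetime(2010, 4, 27, 9, 0)
url = UrlLife(born=born, discovered=born + timedelta(hours=1))
print(url.latency())                         # 1:00:00
print(url.age(born + timedelta(hours=3)))    # 3:00:00
```

A URL that has not been discovered yet has no latency, which is why `latency()` returns `None` rather than a placeholder duration.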
Lives of many URLs
[Figure: the same timeline (BORN, DISCOVERED, NOW, EXPIRED along the TIME axis) drawn for many URLs, each with its own LATENCY interval.]
How to measure discovery and latency
• Consider a sample of new pages on the Web
  – Taken from feeds at regular intervals
  – Each sample monitored for a period (e.g., 15 days)
• User view
  – Discovery: measure how many of these new pages are in the search results
    • using the coverage ratio formula
  – Latency: measure how long it took for these new pages to appear in the search results
    • variants are the ‘Time-To-First-* (TTF*)’ metrics, e.g., Time-To-First-Click and Time-To-First-View
• System view
  – Discovery: measure how many of these new pages are in a catalog
  – Latency: measure how long it took for these new pages to get into a catalog
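The two measurements above can be sketched on a toy sample. The coverage ratio formula itself is not given in this section, so the code assumes the simplest reading: the fraction of sampled new pages that were found. The sample data, the function names, and the use of hours as the unit are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional

# Hypothetical sample: URL -> hours from birth until the page was first
# seen (in search results or in a catalog); None means it was never seen
# during the monitoring period.
Sample = Dict[str, Optional[float]]


def coverage_ratio(sample: Sample) -> float:
    """Assumed discovery measure: found pages divided by sampled pages."""
    if not sample:
        return 0.0
    found = sum(1 for hours in sample.values() if hours is not None)
    return found / len(sample)


def observed_latencies(sample: Sample) -> List[float]:
    """Latencies of the pages that were found, in ascending order."""
    return sorted(hours for hours in sample.values() if hours is not None)


sample: Sample = {"a": 0.5, "b": 1.0, "c": None, "d": 6.0}
print(coverage_ratio(sample))               # 0.75
print(median(observed_latencies(sample)))   # 1.0
```

The same functions serve the user view and the system view; only the source of the "first seen" timestamps differs (search results versus a catalog).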
Discovery profile of a search engine component: Overview
[Figure: discovery profile, computed over many URLs, per search engine component. Annotations: time to reach a certain coverage percentage; convergence; no expiration yet; content expired; other behaviors.]
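A discovery profile of this kind can be approximated from per-URL latencies. The sketch below is an assumption-laden illustration, not the authors' method: it computes the fraction of a sample discovered by each checkpoint and the time needed to reach a target coverage, and it ignores expiration, so the curve it produces never falls. Function names and data are hypothetical.

```python
from typing import List, Optional, Sequence


def discovery_profile(latencies: Sequence[Optional[float]],
                      checkpoints: Sequence[float]) -> List[float]:
    """Fraction of sampled URLs discovered within each checkpoint.

    Latencies and checkpoints are in hours since birth; None marks a URL
    that was never discovered during monitoring.
    """
    n = len(latencies)
    if n == 0:
        return [0.0 for _ in checkpoints]
    return [
        sum(1 for v in latencies if v is not None and v <= t) / n
        for t in checkpoints
    ]


def time_to_coverage(latencies: Sequence[Optional[float]],
                     target: float) -> Optional[float]:
    """Smallest latency at which at least `target` of the sample is found."""
    n = len(latencies)
    found = sorted(v for v in latencies if v is not None)
    for count, v in enumerate(found, start=1):
        if count / n >= target:
            return v
    return None  # the target coverage was never reached


latencies = [0.5, 1.0, 2.0, 24.0, None]
print(discovery_profile(latencies, [1, 6, 48]))  # [0.4, 0.6, 0.8]
print(time_to_coverage(latencies, 0.6))          # 2.0
print(time_to_coverage(latencies, 0.9))          # None
```

The value the curve levels off at (0.8 here) plays the role of the convergence level, and `time_to_coverage` corresponds to the "time to reach a certain coverage percentage" annotation.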
Discovery profiles and monitoring: Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]
Latency profiles of a search engine component: Overview
[Figure: latency profile, computed over many URLs, per search engine component. Annotations: desired skewness direction; close to zero for crawlers.]
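A latency profile can be summarized by a few parameters of the latency distribution. The sketch below is a hypothetical summary, not the one used in the talk: it reports the mean, the median, and the moment-based sample skewness, on made-up data in hours. Under this definition a positive skewness means most latencies sit near zero with a long tail of slow discoveries.

```python
from statistics import mean, median
from typing import Dict, Sequence


def latency_profile(latencies: Sequence[float]) -> Dict[str, float]:
    """Summarize a latency distribution (illustrative parameters only)."""
    n = len(latencies)
    if n < 2:
        raise ValueError("need at least two latencies")
    mu = mean(latencies)
    # Second and third central moments of the sample.
    m2 = sum((x - mu) ** 2 for x in latencies) / n
    m3 = sum((x - mu) ** 3 for x in latencies) / n
    # Moment-based skewness; defined as 0.0 when all values are equal.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {"mean": mu, "median": median(latencies), "skewness": skewness}


profile = latency_profile([0.5, 1.0, 1.0, 2.0, 24.0])
print(profile["median"])        # 1.0
print(profile["skewness"] > 0)  # True: long tail of slow discoveries
```

Tracking these parameters over successive samples is one plausible way to do the kind of monitoring the following slide refers to.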
Latency profiles and monitoring: Examples
[Figure: two panels, "Profiles" and "Monitoring of profile parameters".]
Further issues to consider
• How to discover samples to measure discovery and latency
• How to beat crawlers to acquire samples
• Discovery of top-level pages
• Discovery of deep links
• Discovery of hidden web content
• How to balance discovery against other objectives
Key problems
• Predict content changes on the Web
• Discover new content almost instantaneously
• Reduce latency per search engine component and overall
Reference review on discovery metrics
• Cho, Garcia-Molina, & Page (1998)
  – discusses how to order URL accesses based on importance scores
    • importance: PageRank (best), link count, similarity to query in anchor text or URL string, attributes of URL string
• Dasgupta et al. (2007)
  – formulates the problem of discoverability (discover new content from the fewest number of known pages) and proposes approximation algorithms
• Kim and Kang (2007)
  – compares top three search engines for discovery (called “timeliness”), freshness, and latency
• Lewandowski (2008)
  – compares top three search engines for freshness and latency
• Dasdan and Drome (2009)
  – proposes discovery metrics along the lines discussed in this section
• Olston and Najork (2010)
  – gives a detailed survey of web crawling, including how crawlers discover URLs
  – discusses how to optimize for both coverage and freshness in a web crawler
References
• J. Cho, H. Garcia-Molina, and L. Page (1998), Efficient Crawling Through URL Ordering, Computer Networks and ISDN Systems, 30(1-7):161-172.
• A. Dasdan and C. Drome (2009), Discovery coverage: Measuring how fast content is discovered by search engines, submitted.
• A. Dasgupta, A. Ghosh, R. Kumar, C. Olston, S. Pandey, and A. Tomkins (2007), The discoverability of the Web, WWW’07.
• J. Dean (2009), Challenges in building large-scale information retrieval systems, WSDM’09.
• N. Eiron, K.S. McCurley, and J.A. Tomlin (2004), Ranking the Web frontier, WWW’04.
• C. Grimes, D. Ford, and E. Tassone (2008), Keeping a search engine index fresh: Risk and optimality in estimating refresh rates for web pages, INTERFACE’08.
• Y.S. Kim and B.H. Kang (2007), Coverage and timeliness analysis of search engines with webpage monitoring results, WISE’07.
• D. Lewandowski (2008), A three-year study on the freshness of Web search engine databases, to appear in J. Info. Syst., 2008.
• C. Olston and M. Najork (2010), Web crawling, Foundations and Trends in Information Retrieval, 4(3):175-246.