
Random Sampling of Search Engine’s Index Using Monte Carlo Simulation Method

Sajib Kumer Sinha

University of Windsor

Obtaining uniform random samples from a search engine’s index is a challenging problem first identified by Bharat and Broder [1998]. An accurate solution to this problem can help solve many problems related to web data analysis, especially deep web analysis, such as size estimation and determining the data distribution. This survey reviews research that tries to clarify: (i) the different problem definitions related to the considered problem, (ii) the difficulties encountered in this field of research, (iii) the varying assumptions, heuristics, and intuitions forming the basis of different approaches, and (iv) how several prominent solutions tackle different problems.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Measurement, Algorithms

Additional Key Words and Phrases: search engines, benchmarks, sampling, size estimation

Contents

1 INTRODUCTION

2 APPROACHES
  2.1 Lexicon based approaches
    2.1.1 Query based sampling
    2.1.2 Query pool based approach with rejection sampling
    2.1.3 Query pool based approach with rejection sampling and basic estimator
    2.1.4 Query based approach using importance sampling
    2.1.5 Summary
  2.2 Random walk approaches
    2.2.1 Random walk with multi thread crawler
    2.2.2 Random walk with web walker
    2.2.3 Weighted random walk approach
    2.2.4 Random walk approach with Metropolis-Hastings sampling
    2.2.5 Random walk approach with rejection sampling
    2.2.6 Random walk approach with importance sampling
    2.2.7 Summary
  2.3 Random sampling based on DAAT model

3 CONCLUDING COMMENTS

4 ACKNOWLEDGEMENT

5 ANNOTATIONS
  5.1 Anagnostopoulos et al. 2005
  5.2 Bar-Yossef et al. 2000
  5.3 Bar-Yossef and Gurevich 2006
  5.4 Bar-Yossef and Gurevich 2007
  5.5 Bar-Yossef and Gurevich 2008a
  5.6 Broder et al. 2006
  5.7 Callan and Connell 2001
  5.8 Dasgupta et al. 2007
  5.9 Henzinger et al. 2000
  5.10 Rusmevichientong et al. 2001

6 REFERENCES

1. INTRODUCTION

As a prolific research area, obtaining random samples from a search engine’s interface has encouraged a vast number of proposed solutions. This survey covers different approaches that have been taken, and could be taken, for obtaining random samples from a search engine’s index using Monte Carlo simulation methods such as rejection sampling, importance sampling and Metropolis-Hastings sampling. Here we assume black-box access to the search engine; that is, we do not have any direct access to the search engine’s index and can access the data only through its publicly available interface.

For conducting this survey, Google Scholar and the ACM Digital Library were used to identify published papers.

Papers that were published in conference proceedings are Hastings [1970], Olken and Rotem [1990], Bharat and Broder [1998], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001], Anagnostopoulos et al. [2005], Bar-Yossef and Gurevich [2006], Broder et al. [2006], Hedley et al. [2006], Bar-Yossef and Gurevich [2007], Dasgupta et al. [2007], Mccown and Nelson [2007], Bar-Yossef and Gurevich [2008a], and Ribeiro and Towsley [2010]. Papers that were published in journals are Liu [1996], Lawrence and Giles [1998], Lawrence and Giles [1999] and Callan and Connell [2001].

This survey consists of a discussion of the different approaches, concluding comments, and annotations of the ten selected papers. Three main approaches were identified for solving the considered problem, and they are described in three subsections of the Approaches section. For two of those approaches more than one method has been proposed, so those two subsections are further divided into sub-subsections. Finally, there is a concluding part containing the concluding statements, followed by the annotations.

From the ten selected papers it is observed that most of the authors used the rejection sampling algorithm for the sampling step. Two papers propose a method based on importance sampling. However, apart from one random walk method, little attention has been paid to another major Monte Carlo algorithm, the Metropolis-Hastings algorithm, which could be an effective choice for the sampling step.


2. APPROACHES

As mentioned earlier, many different approaches have been taken to obtain uniform random samples from a search engine’s index. This survey focuses on those approaches which use Monte Carlo simulation methods. It is observed that all approaches can be categorized into three main classes: one uses a query pool and is lexicon based, another uses a random walk on the web graph, and the third uses an information retrieval technique called the DAAT model.

2.1 Lexicon based approaches

The research papers presented in this section use lexicon based approaches to solve the considered problem. All of this research uses a lexicon to formulate random queries to the search interface and performs the random sampling based on the results returned by the interface. It is observed that researchers have used two different Monte Carlo simulation methods, namely rejection sampling and importance sampling. The first lexicon based approach was described by Callan and Connell [2001]. After that, Bar-Yossef and Gurevich [2006], Broder et al. [2006] and Bar-Yossef and Gurevich [2007] took the same approach with some modifications to improve sampling accuracy and minimize cost.

2.1.1 Query based sampling. Callan and Connell [2001] first introduced this approach, obtaining uniform samples from text databases in order to acquire a resource description of a given database. The authors state that no efficient method exists to acquire the resource description of a database, which is very important for information retrieval techniques.

The authors do not refer to any of the papers in the bibliography as related work.

The authors propose a new query based sampling algorithm for acquiring resource descriptions. Initially, one query term is selected at random and run on the database. Based on the top N returned documents, the resource description is updated. This process is repeated with different query terms until a stop condition is reached. The algorithm also provides options for choosing the number of query terms, how many documents to examine per query, and the stop condition of the sampling.
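To make the procedure concrete, the following is a minimal sketch of such a query-based sampling loop. The run_query function, the seed vocabulary, and the fixed query budget used as a stop condition are illustrative assumptions, not details of Callan and Connell’s implementation.

import random
from collections import Counter

def query_based_sampling(run_query, seed_terms, docs_per_query=4, max_queries=100):
    """Build an approximate resource description (vocabulary plus term frequencies)
    by repeatedly issuing randomly chosen query terms.

    run_query(term, n) is assumed to return the text of the top-n matching
    documents; the stop condition (a fixed query budget) is illustrative."""
    description = Counter()               # term -> observed frequency
    candidate_terms = list(seed_terms)
    for _ in range(max_queries):
        term = random.choice(candidate_terms)        # pick a query term at random
        for doc_text in run_query(term, docs_per_query):
            words = doc_text.lower().split()
            description.update(words)                # update vocabulary and frequencies
            candidate_terms.extend(words)            # later queries come from sampled docs
    return description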

To evaluate their proposed algorithm, the authors conducted three separate experiments with separate datasets, measuring description accuracy (consisting of vocabulary and frequency information), resource selection accuracy, and retrieval accuracy.

The authors state that their experimental results show that the resource description created by query based sampling is sufficiently similar to the resource description generated from complete information. The results also demonstrate that a small partial description of a resource is as effective as its complete description.

The authors claim that their method avoids many limitations of cooperative protocols such as STARTS and that it is also applicable to legacy databases. They also claim that the cost of their query based sampling method is comparatively low and that it is not as easily defeated by intentional misrepresentation as cooperative protocols are.

This paper is cited by Broder et al. [2006].


2.1.2 Query pool based approach with rejection sampling. Bar-Yossef and Gurevich [2006] first introduced the term “query pool”, a collection of lexical queries, and used it to obtain random samples from a search engine’s index. The authors state that no method exists which can produce unbiased samples from a search engine using its public interface. Extraction of random unbiased samples from a search engine’s index is needed for search engine evaluation and is also a prerequisite for size estimation of large databases, deep web sites, library records, etc.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1999], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001] and Anagnostopoulos et al. [2005].

The authors state that the approach of Bharat and Broder [1998] is biased towards long, content-rich documents, since large documents match many more queries than short documents, and it is also biased towards highly ranked documents, since search engines return only the top K results. Another noted shortcoming is that the accuracy of this method depends directly on term frequency estimation: if the estimation is biased from the beginning, the bias remains in the sampled documents. Henzinger et al. [1999] and Lawrence and Giles [1999] also have several biases in their sampled documents. The approaches of Henzinger et al. [2000] and Bar-Yossef et al. [2000] are criticized because their sampled documents are biased towards documents with high in-degree. Finally, Rusmevichientong et al. [2001] are also criticized for their biased samples.

The authors propose two new methods for getting uniform samples from a search engine’s index. The first one, called the Pool-based sampler, is lexicon based but, unlike the Bharat and Broder [1998] approach, does not require term frequencies. In this approach sampling is carried out in two steps: first a query is selected from the query pool, and then one of the top K returned results is chosen at random. For sampling, the authors use rejection sampling for its simplicity.
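The two-step procedure can be illustrated with a minimal sketch of a pool-based sampler that uses rejection sampling. The search function, the query pool, and the matches_in_pool estimate of a document’s “degree” are assumptions made for illustration; the sketch does not reproduce the exact procedure of Bar-Yossef and Gurevich [2006].

import random

def pool_based_sample(pool, search, matches_in_pool, k=100, max_trials=10000):
    """Draw one near-uniform document using rejection sampling.

    A query is picked from the pool, a document is picked from its top-k results,
    and the document is accepted with probability inversely proportional to the
    number of pool queries it matches, which corrects the bias toward documents
    that match many queries."""
    for _ in range(max_trials):
        q = random.choice(pool)                 # step 1: random query from the pool
        results = search(q)[:k]
        if not results:
            continue
        d = random.choice(results)              # step 2: random document from the top-k
        degree = max(1, matches_in_pool(d))     # how many pool queries match d
        if random.random() < 1.0 / degree:      # rejection step removes the degree bias
            return d
    return None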

The authors state that they conducted three sets of experiments: pool measurement, evaluation experiments and exploration experiments. They crawled 2.4 million English pages from ODP to build a search engine and created the query pool. From that query pool they issued queries to the Google, Yahoo and MSN search engines and used the returned samples to estimate their relative sizes.

The authors state that their Pool-based sampler had no bias at all, and that there was a small negative bias towards short documents in their Random walk sampler. They also found that the Bharat and Broder [1998] sampler had significant bias.

Bar-Yossef and Gurevich [2006] claim that their first model, the Pool-based sampler, provides samples with weights which can be used with stochastic simulation techniques to produce near-uniform samples. The bias of this model could be eliminated using a more comprehensive experimental setup.

This paper is cited by Bar-Yossef and Gurevich [2007], Bar-Yossef and Gurevich [2008a], Broder et al. [2006] and Dasgupta et al. [2007].

2.1.3 Query pool based approach with rejection sampling and basic estimator. Broder et al. [2006] first solved the considered problem using a “query pool” with a basic estimator. The problem considered by the authors is estimating the size of a data corpus using only its standard query interface. This would enable researchers


to evaluate the quality of a search engine, estimate the size of the deep web, etc. It is also important for owners to understand different characteristics and aspects of the quality of their corpus, such as corpus freshness, over/under-represented topics, presence of spam, and the ability of the corpus to answer narrow or rare queries.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Callan and Connell [2001], Rusmevichientong et al. [2001] and Bar-Yossef and Gurevich [2006].

The authors state that the approach taken by Bharat and Broder [1998] is biased toward content-rich pages with higher rank. The method of Lawrence and Giles [1998] accepts only those queries for which all matched results are returned by the search engine; as a result, the returned sample is not uniform. The following year, Lawrence and Giles [1999] fixed that problem, but their proposed technique is highly dependent on the underlying IR technique. The authors also state that the method of Bar-Yossef and Gurevich [2006] is laborious.

The authors propose two approaches based on a basic low-variance, unbiased estimator. Their first method requires a uniform random sample from the underlying corpus; after obtaining the uniform sample, the corpus size is computed by the basic estimator. For random sampling they use the rejection sampling method. The second approach is based on two query pools that are uncorrelated with respect to a set of query terms; using these two pools, the corpus size is estimated with the basic estimator.
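The following sketch illustrates the general principle of sizing a corpus from a uniform sample; it is not the authors’ basic estimator itself, and all names are illustrative. If the total number of documents matching the query pool is known, the fraction of uniformly sampled documents that match the pool can be inverted to estimate the corpus size.

def estimate_corpus_size(uniform_sample, in_pool, pool_match_count):
    """Estimate corpus size |D| from a uniform sample of its documents.

    in_pool(d) says whether document d matches at least one pool query, and
    pool_match_count is the total number of corpus documents matching the pool
    (assumed known or measured separately). Since the sample is uniform, the
    matching fraction estimates |A| / |D|, so |D| is roughly |A| / fraction."""
    hits = sum(1 for d in uniform_sample if in_pool(d))
    if hits == 0:
        raise ValueError("no sampled document matched the pool; enlarge the sample")
    fraction = hits / len(uniform_sample)
    return pool_match_count / fraction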

The authors evaluated their methods with two experiments. One experiment was conducted on TREC data, with a dataset consisting of 1,246,390 HTML files from the TREC .GOV test collection. Another experiment was conducted on the web with three popular search engines. For both experiments they built a query pool and computed 11 estimators, finally taking the median of the 11 estimates as their final estimate.

The authors report the estimated sizes of the three search engines, in terms of the number of pages indexed, as follows:

Table I. Broder et al. [2006], page 602

Search Engine (SE)    SE1    SE2    SE3
Size (Billion)        1.5    1.0    0.95

Next, using the uncorrelated query pool approach, they estimated the sizes of those search engines as follows:

Table II. Broder et al. [2006], page 602

Search Engine (SE)    SE1    SE2    SE3
Size (Billion)        2.8    1.9    1.1

The authors claim that they constructed a basic estimator for estimating the size of a corpus using black-box access. They propose two algorithms for size estimation, one with random sampling and one without. They also claim that they


developed a novel method to reduce the variance of the estimator when the distribution of the random variable is monotonically decreasing. They also state that by building the uncorrelated query pools more carefully, the size of the corpus can be estimated more accurately.

This paper is cited by Bar-Yossef and Gurevich [2007], and Bar-Yossef and Gurevich [2008a].

2.1.4 Query based approach using importance sampling. Bar-Yossef and Gurevich [2007] first used importance sampling to obtain uniform random samples from a search engine’s index. For automated and neutral evaluation of a search engine we need to measure some quality metrics of the search engine, such as corpus size, index freshness, and density of spam and duplicate pages. Such an evaluation method can serve as an objective benchmark of a search engine, which could also be useful for the owners of those search engines. As search engines are highly dynamic, measuring those quality metrics is challenging. In this paper the authors focus on the external evaluation of a search engine using its public interface.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Henzinger et al. [1999], Lawrence and Giles [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001], Bar-Yossef and Gurevich [2006] and Broder et al. [2006].

The authors state that the methods proposed by Lawrence and Giles [1999], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000] and Rusmevichientong et al. [2001] sample documents from the whole web, which is more difficult and suffers from bias. They also report that the recently proposed techniques of Bar-Yossef and Gurevich [2006] and Broder et al. [2006] have significant bias due to inaccurate approximation of document degrees.

The authors define two new estimators, called the Accurate Estimator and the Efficient Estimator, to estimate the target distribution of the sampled documents based on importance sampling. The Accurate Estimator uses approximate weights and requires sending some queries to a search engine; the Efficient Estimator uses deterministic approximate weights and does not need any queries. They use the Rao-Blackwell theorem with importance sampling to reduce estimation variance.
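Both estimators build on importance sampling, whose generic self-normalized form is sketched below; the target and trial weights are supplied by the caller, and the sketch does not reproduce the Accurate or Efficient Estimator.

def importance_sampling_estimate(samples, f, target_weight, trial_weight):
    """Estimate the expectation of f under the target distribution from samples
    drawn from a trial distribution.

    Each sample x is reweighted by w(x) = target_weight(x) / trial_weight(x);
    dividing by the sum of weights (self-normalization) allows both densities
    to be known only up to a constant factor."""
    weights = [target_weight(x) / trial_weight(x) for x in samples]
    total = sum(weights)
    return sum(w * f(x) for w, x in zip(weights, samples)) / total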

To evaluate their newly proposed estimators they conducted two sets of experiments. Their first experiment was on an ODP-based search engine they built themselves, which consists of text, HTML and PDF files. For the second experiment they used three major commercial search engines. In both experiments they compared their proposed estimators with the estimators proposed by Bar-Yossef and Gurevich [2006] and Broder et al. [2006].

According to the authors, both experiments clearly show that the Accurate Estimator is unbiased, as expected, and that the Efficient Estimator has a small bias, caused by the weak correlation between the function value and the validity density of pages.

The authors claim that their new estimators outperform the recently proposed estimators of Bar-Yossef and Gurevich [2006] and Broder et al. [2006] in producing unbiased random samples, and that this is shown both analytically and empirically. They also state that their methods are more accurate and efficient than any other


existing methods, since they overcome the degree mismatch problem and do not rely on how the search engine works internally. They also claim that Rao-Blackwellization of the importance sampling estimator reduces the bias significantly and reduces the cost compared to other existing techniques.

This paper is cited by Bar-Yossef and Gurevich [2008a].

2.1.5 Summary.

Year | Author | Title of Paper | Major Contribution
2001 | J. Callan and M. Connell | Query-based sampling of text databases | Novel query based sampling algorithm for acquiring a resource description of a text database.
2006 | Z. Bar-Yossef and M. Gurevich | Random sampling from a search engine's index | Two new algorithms for obtaining random samples from a search engine's index.
2006 | Broder et al. | Estimating corpus size via queries | Two new algorithms for obtaining random samples using a basic low-variance estimator.
2007 | Z. Bar-Yossef and M. Gurevich | Efficient search engine measurements | Two new estimators, the Accurate Estimator and the Efficient Estimator, for estimating the target distribution of the sampled documents based on importance sampling.

2.2 Random walk approaches

The web can be naturally described as a graph with pages as vertices and hyperlinks as edges. A random walk is then a stochastic process that iteratively visits the vertices of the graph: the next vertex of the walk is chosen by following a randomly chosen outgoing edge from the current one. This section contains research papers which solve the problem based on a random walk of the web graph. It is observed that researchers have used three different Monte Carlo simulation methods to obtain uniform random samples with this approach. The first random walk approach was taken by Henzinger et al. [2000]. After that, Bar-Yossef et al. [2000], Rusmevichientong et al. [2001], Bar-Yossef and Gurevich [2006], Dasgupta et al. [2007] and Bar-Yossef and Gurevich [2008a] took the same approach with some modifications to improve sampling accuracy and minimize cost.
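A minimal sketch of such a walk over an in-memory adjacency list is shown below; real implementations instead follow the outlinks of live pages fetched over the network.

import random

def random_walk(graph, start, steps):
    """Simple random walk on a directed graph given as an adjacency list
    {page: [outlinked pages]}. At each step the next page is a uniformly chosen
    outlink of the current page; a page with no outlinks ends the walk early."""
    path = [start]
    current = start
    for _ in range(steps):
        outlinks = graph.get(current, [])
        if not outlinks:
            break
        current = random.choice(outlinks)
        path.append(current)
    return path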

2.2.1 Random walk with multi thread crawler. Henzinger et al. [2000] propose a multi-threaded crawler to estimate various properties of web pages. The authors state that no accurate method exists to get a uniform random sample from the Web, which is needed to estimate various properties of web pages such as the fraction of pages in various internet domains, the coverage of a search engine, etc.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999] and Henzinger et al. [1999].

Henzinger et al. [2000] state that the approach of Bharat and Broder [1998] suffered


from various biases: samples produced by Bharat and Broder [1998] are biased towards longer web pages (those with more words). The method proposed by Lawrence and Giles [1998] is useful for determining the characteristics of web hosts but is not accurate for estimating the size of the Web. Besides, the work of Lawrence and Giles [1998] is uncertain for IPv6 addresses.

In this research work the authors do not introduce any new method; rather, they give some suggestions to improve sampling based on random walks. Instead of using a normal crawler, they suggest using Mercator, a multi-threaded web crawler. Each thread begins with a randomly chosen starting point, and for random selection they suggest making a random jump to a page that has been visited by at least one thread, instead of always following the hyperlinks.
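A minimal single-threaded sketch of a walk that mixes link following with random jumps back to already visited pages is given below, in the spirit of the suggestion above; the jump probability and the get_outlinks helper are illustrative assumptions rather than details of Mercator.

import random

def jump_walk(get_outlinks, seeds, steps, jump_prob=0.15):
    """Random walk that, with probability jump_prob, jumps to a uniformly chosen
    page among those visited so far instead of following an outgoing link.
    get_outlinks(url) is assumed to fetch and return the outlinks of a page."""
    visited = list(dict.fromkeys(seeds))          # ordered, duplicate-free visit list
    current = random.choice(visited)
    for _ in range(steps):
        outlinks = get_outlinks(current)
        if outlinks and random.random() > jump_prob:
            current = random.choice(outlinks)     # follow a random hyperlink
        else:
            current = random.choice(visited)      # random jump to a visited page
        if current not in visited:
            visited.append(current)
    return visited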

The authors state that they built a test bed based on a class of random graphs designed to share important properties with the Web. To collect uniform URLs they performed three random walks, each started from a seed set containing 10,258 URLs discovered by a previous crawl.

Henzinger et al. [2000] state that 83.2 percent of all visited URLs were visited by only one walk, which they take as evidence that their walks were quite random.

The authors claim that they have described a method to generate a near-uniform sample of URLs. They also claim that the samples produced by their method are more uniform than the samples produced by a random walk over the pages themselves. According to Henzinger et al. [2000], their method can be improved significantly using additional knowledge of the Web.

This paper is cited by Bar-Yossef et al. [2000], Bar-Yossef and Gurevich [2006], Bar-Yossef and Gurevich [2007], Broder et al. [2006] and Rusmevichientong et al. [2001].

2.2.2 Random walk with web walker. Bar-Yossef et al. [2000] introduce the Web Walker to approximate certain aggregate queries about web pages. The authors state that no accurate method exists to approximate aggregate queries about web pages, such as the average size of web pages, the coverage of a search engine, the proportion of pages belonging to .com and other domains, etc. These queries are very important for both academic and commercial use.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Henzinger et al. [1999] and Henzinger et al. [2000].

The authors state that the approach of Bharat and Broder [1998] for getting a uniform sample uses dictionaries to construct its queries; as a result, the random sample is biased towards English-language pages. In both Lawrence and Giles [1998] and Lawrence and Giles [1999] the size of the web is estimated based on the number of web servers instead of the total number of web pages. The authors also state that the method of Henzinger et al. [2000] requires a huge number of samples for approximation and is biased towards high-degree nodes.

The authors state that for approximation they use random sampling. To produce the uniform samples they use the theory of random walks. They propose a new random walking process called Web Walker, which performs a regular undirected random walk and picks pages randomly from its traversed pages. The starting page of the Web Walker is an arbitrary page from the strongly connected component of the web.
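One standard way to realize a regular undirected random walk on an irregular graph is to pad every vertex with implicit self-loops up to a common maximum degree, which makes the walk’s stationary distribution uniform; a minimal sketch of this idea is given below. The graph representation and the degree bound are illustrative assumptions, not the paper’s exact construction.

import random

def regular_undirected_walk(neighbors, start, steps, max_degree):
    """Random walk whose stationary distribution is uniform.

    neighbors[v] lists the undirected neighbors of v. At each step a neighbor is
    followed with probability deg(v)/max_degree; otherwise the walk stays in place
    (an implicit self-loop), so every vertex behaves as if it had degree max_degree."""
    visited = [start]
    current = start
    for _ in range(steps):
        nbrs = neighbors[current]
        if nbrs and random.random() < len(nbrs) / max_degree:
            current = random.choice(nbrs)
        visited.append(current)
    return visited

# A near-uniform sample can then be drawn by picking pages from `visited`
# after a burn-in period.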


Bar-Yossef et al. [2000] state that to evaluate their Web Walker’s effectiveness and bias they performed a 100,000-page walk on the 1996 web graph. To check how close the samples were to uniform, they compared their resultant set with a number of known sets, some of which were ordered by degree and some in breadth-first-search order. They tried to approximate the relative size of the pages indexed by the search engine AltaVista, the pages indexed by another search engine, FAST, the overlap between AltaVista and FAST, and the pages that contain inline images.

According to the reports of AltaVista and FAST, their indexes contained about 300 million and 250 million pages respectively, so the ratio should be around 1.2. The authors report that their experiment found a ratio of 1.14, which is close.

The authors claim that they have presented an efficient and accurate method for estimating the results of aggregate queries. They also claim that their method can produce uniformly distributed web pages without significant computation and network resources, and that their estimates can be reproduced using a single PC with a modest internet connection.

This paper is cited by Bar-Yossef and Gurevich [2006], Bar-Yossef and Gurevich [2007] and Broder et al. [2006].

2.2.3 Weighted random walk approach. Rusmevichientong et al. [2001] propose a method to obtain a uniform random sample from the web. The authors state that no accurate method exists to get a uniform random sample from the web, which is needed to estimate various properties of web pages such as the fraction of pages in various internet domains, the coverage of a search engine, etc.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Bar-Yossef et al. [2000] and Henzinger et al. [2000].

The authors state that the methods of Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999] and Henzinger et al. [2000] are highly biased towards pages with large numbers of inbound links. The approach of Bar-Yossef et al. [2000] assumes that the Web is an undirected graph, whereas the Web is not truly undirected; besides, their approach also has some biases.

The authors propose two new algorithms based on the approaches of Henzinger et al. [2000] and Bar-Yossef et al. [2000]. The first algorithm, called Directed-Sample, works on an arbitrary directed graph; the other, called Undirected-Sample, works on the undirected graph with some additional knowledge about inbound links. Both algorithms are based on a weighted random-walk methodology.

The authors state that for evaluation they conducted their experiments on both directed and undirected graphs with 100,000 nodes, 71,189 of which belong to the primary strongly connected component.

The authors state that the Directed-Sample algorithm produced 2,057 samples and the Undirected-Sample algorithm produced 10,000 samples, both of which are close to uniform.

The authors claim that Directed-Sample is naturally suited to the Web without any assumptions, whereas Undirected-Sample assumes that any hyperlink can be followed in both the backward and forward directions. They also claim that their


algorithms perform as well as the Regular-Sample of Bar-Yossef et al. [2000] and better than the Pagerank-Sample of Henzinger et al. [2000].

This paper is cited by Bar-Yossef and Gurevich [2006], Bar-Yossef and Gurevich [2007] and Broder et al. [2006].

2.2.4 Random walk approach with Metropolis-Hastings sampling. Bar-Yossef and Gurevich [2006] propose an algorithm to obtain random samples from a search engine’s index using Metropolis-Hastings sampling. The authors state that no method exists which can produce unbiased samples from a search engine using its public interface. Extraction of random unbiased samples from a search engine’s index is needed for search engine evaluation and is also a prerequisite for size estimation of large databases, deep web sites, library records, etc.

The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1999], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001] and Anagnostopoulos et al. [2005].

The authors state that the approach of Bharat and Broder [1998] is biased towards long, content-rich documents, since large documents match many more queries than short documents, and it is also biased towards highly ranked documents, since search engines return only the top K results. Another noted shortcoming is that the accuracy of this method depends directly on term frequency estimation: if the estimation is biased from the beginning, the bias remains in the sampled documents. Henzinger et al. [1999] and Lawrence and Giles [1999] also have several biases in their sampled documents. The approaches of Henzinger et al. [2000] and Bar-Yossef et al. [2000] are criticized because their sampled documents are biased towards documents with high in-degree. Finally, Rusmevichientong et al. [2001] are also criticized for their biased samples.

The authors propose a technique based on a random walk over a virtual graph defined on the documents indexed by the search engine, where two documents are connected by an edge if they share the same words or phrases. In this approach they apply the Metropolis-Hastings algorithm for getting uniform samples.
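A minimal sketch of Metropolis-Hastings targeting the uniform distribution over such a document graph is given below: the proposal moves to a random neighbor and is accepted with probability min(1, deg(current)/deg(proposed)), which removes the bias of a plain walk toward high-degree documents. The neighbors function (for example, documents reachable through a shared-phrase query) is an illustrative assumption, not the authors’ exact implementation.

import random

def metropolis_hastings_uniform(neighbors, start, steps):
    """Metropolis-Hastings random walk whose target is the uniform distribution.

    Proposal: move to a uniformly chosen neighbor of the current document.
    Acceptance: min(1, deg(current) / deg(proposed)), which corrects the tendency
    of an unadjusted walk to visit high-degree documents more often."""
    current = start
    samples = []
    for _ in range(steps):
        nbrs_cur = neighbors(current)
        proposed = random.choice(nbrs_cur)
        nbrs_prop = neighbors(proposed)
        accept = min(1.0, len(nbrs_cur) / len(nbrs_prop))
        if random.random() < accept:
            current = proposed
        samples.append(current)
    return samples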

The authors state that they conducted three sets of experiments: pool measurement, evaluation experiments and exploration experiments. They crawled 2.4 million English pages from ODP to build a search engine.

The authors state that their Pool-based sampler had no bias at all, and that there was a small negative bias towards short documents in their Random walk sampler. They also found that the Bharat and Broder [1998] sampler had significant bias.

Bar-Yossef and Gurevich [2006] claim that the main advantage of their random walk model is that it does not need a query pool. The bias of this model could be eliminated using a more comprehensive experimental setup.

This paper is cited by Bar-Yossef and Gurevich [2007], Bar-Yossef and Gurevich [2008a], Broder et al. [2006] and Dasgupta et al. [2007].

2.2.5 Random walk approach with rejection sampling. Dasgupta et al. [2007] used rejection sampling to obtain uniform random samples from the deep web. A large part of World Wide Web data is hidden behind form-like interfaces and is called the hidden web. Owing to limitations on user access, we cannot analyze hidden web data in a trivial way using only its publicly available interfaces. But, by generating


a uniform sample from those hidden databases, we can perform some analysis of the underlying data, such as the form of the data distribution and size estimation, even where direct access is restricted.

The authors refer to previous work by Olken [1993], Bharat and Broder [1998] and Bar-Yossef and Gurevich [2006].

Dasgupta et al. [2007] state that the approach of Olken [1993] is applicable only to fully accessible databases: Olken [1993] does not consider the restriction on direct access to the underlying database. Bharat and Broder [1998] and Bar-Yossef and Gurevich [2006] both address the same problem for search engines only, not for deep web databases in general. Bar-Yossef and Gurevich [2006] introduced the concept of a random walk over the World Wide Web using the top-K result documents from a search engine, but this document-based model is not directly applicable to hidden databases.

In this paper the authors propose a new method for sampling documents from hidden web databases. They propose a new algorithm called HIDDEN-DB-SAMPLER, which is based on a random walk over the query space provided by the public user interface. They modify BRUTE-FORCE-SAMPLER, a simple algorithm for random sampling, and adopt this technique in their new algorithm. The three new ideas proposed by the authors are early detection of underflow and valid tuples, random reordering of attributes, and boosting the acceptance probability via a scaling factor. To reduce skew, they use the rejection sampling method in their random walk.
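A heavily simplified sketch of such a random drill-down over boolean attributes is given below; the query interface, the acceptance formula and the scaling factor are illustrative assumptions and do not reproduce HIDDEN-DB-SAMPLER exactly.

import random

def drill_down_sample(query, attributes, k, scale=1.0):
    """One attempt to sample a tuple from a form-based interface by a random walk
    over the query space of boolean attributes.

    query(assignment) is assumed to return the (at most k) tuples matching the
    partial assignment, or None on overflow (more than k matches). The acceptance
    probability approximately equalizes the otherwise depth-dependent chance of
    reaching a tuple; scale plays the role of a boosting factor and must be kept
    small enough that the probability stays at most 1."""
    order = random.sample(attributes, len(attributes))   # random attribute reordering
    assignment = {}
    for depth, attr in enumerate(order, start=1):
        assignment[attr] = random.choice([False, True])
        result = query(assignment)
        if result is not None:                 # the query no longer overflows
            if not result:                     # underflow: dead end, give up early
                return None
            t = random.choice(result)
            # A tuple reached at this depth was selected with probability
            # (1/2)**depth * 1/len(result); accept so that the product is constant.
            accept = scale * (2 ** depth) * len(result) / (2 ** len(attributes) * k)
            return t if random.random() < min(1.0, accept) else None
    return None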

Dasgupta et al. [2007] state that they ran their algorithm on two groups of databases. The first group comprised small Boolean datasets with 500 tuples and 15 attributes each; it also included 500 tuples retrieved from the Yahoo Autos web site. The databases used for the performance experiments comprised 300,000 to 500,000 tuples each. All experiments were run on a machine with a 1.8 GHz P4 processor and 1.5 GB of RAM running Fedora Core 3.

According to the authors, random orderings outperform fixed orderings for both of their data groups. They also state that the performance of the fixed-ordering technique depends on the specific ordering used.

The authors claim that their new random walk over the query space provided by the public user interface outperforms the earlier approach of Bar-Yossef and Gurevich [2006]. They evaluated the approach of Bar-Yossef and Gurevich [2006], compared it with their own method, and found that it has inherent skew and does not work well for hidden databases.

This paper is cited by Bar-Yossef and Gurevich [2008a].

2.2.6 Random walk approach with importance sampling. Bar-Yossef and Gurevich [2008a] used importance sampling to obtain uniform random samples. This paper is concerned with sampling a search engine’s suggestion database through its public user interface. Suggestion samples can be useful for online advertising and for studying user behaviour. Random unbiased samples from a search engine’s index are needed for creating objective benchmarks for search engine evaluation, and would also be useful for size estimation of search engines, large databases, deep web sites, library records, etc.


The authors refer to previous work by Olken and Rotem [1995], Liu [1996], Bharat and Broder [1998], Lawrence and Giles [1998], Bar-Yossef and Gurevich [2006], Broder et al. [2006], Bar-Yossef and Gurevich [2007] and Dasgupta et al. [2007].

The authors state that the work of Bar-Yossef and Gurevich [2006], Bar-Yossef and Gurevich [2007], Bharat and Broder [1998] and Broder et al. [2006] is based on a query pool, and that for suggestion sampling building a query pool is impractical and difficult. Olken and Rotem [1995] used B-trees, but the suggestion database is not balanced like a B-tree.

The authors define two algorithms, a uniform sampler and a score-induced sampler, for sampling suggestions using the public interface. In their method they use an easy-to-sample trial distribution to simulate samples from the target distribution using a Monte Carlo simulation method named importance sampling. They also return samples based on a random walk on the suggestion tree: when the random walk reaches a marked node it stops and returns that node as a sample.
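A minimal sketch of one trial of such a tree descent is given below; the children and is_marked helpers, the stopping coin flip and the uniform target are illustrative assumptions rather than the authors’ exact sampler.

import random

def sample_suggestion(children, is_marked, root="", max_depth=60):
    """One trial of a random descent over a suggestion tree.

    children(prefix) is assumed to return the child nodes the suggestion service
    exposes for a prefix, and is_marked(node) whether the node is itself a
    suggestion. The probability of the particular descent is tracked so that an
    importance weight (for a uniform target: 1 / trial probability) can correct
    the fact that shallow suggestions are reached more often."""
    node, trial_prob = root, 1.0
    for _ in range(max_depth):
        kids = children(node)
        if is_marked(node):
            if not kids:                          # leaf suggestion: must stop here
                return node, 1.0 / trial_prob
            if random.random() < 0.5:             # coin flip: stop at this marked node
                return node, 1.0 / (trial_prob * 0.5)
            trial_prob *= 0.5                     # chose to continue past it
        if not kids:
            return None, 0.0                      # dead end: failed trial, weight zero
        node = random.choice(kids)
        trial_prob /= len(kids)
    return None, 0.0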

To evaluate their algorithms, they built a suggestion server based on the AOL query logs, which consist of 20 million web queries. From these they extracted unique query strings and their popularity, built the suggestion server from those unique queries, and used their methods to sample the suggestions.

They state that the uniform sampler is practically unbiased and that the score-induced sampler has a minor bias towards medium-scored queries.

The authors claim that the uniform sampler is unbiased and efficient, while the score-induced sampler is slightly biased but still effective for producing useful measurements. They also claim that their methods do not compromise user privacy. Moreover, they note that their methods require only a few thousand queries, whereas this number is much higher for other contemporary methods.

There are no specific references to this paper by other papers in the bibliography.

2.2.7 Summary.


Year | Author | Title of Paper | Major Contribution
2000 | Henzinger et al. | On near-uniform URL sampling | Multi-threaded Web crawler to improve sampling.
2000 | Bar-Yossef et al. | Approximating aggregate queries about Web pages via random walks | A new random walking process called Web Walker.
2001 | Rusmevichientong et al. | Methods for sampling pages uniformly from the World Wide Web | New algorithms based on a weighted random-walk methodology to obtain uniform samples.
2006 | Bar-Yossef and Gurevich | Random sampling from a search engine's index | Two new algorithms for obtaining random samples from a search engine's index.
2007 | Dasgupta et al. | A random walk approach to sampling hidden databases | New method for sampling documents from hidden web databases.
2008 | Bar-Yossef and Gurevich | Mining search engine query logs via suggestion sampling | Two new algorithms, a uniform sampler and a score-induced sampler, for sampling suggestions using the public interface.

2.3 Random sampling based on DAAT model

Document-at-a-time (DAAT) is a traditional information retrieval technique. Anagnostopoulos et al. [2005] first used this model to obtain uniform samples; it is observed that the DAAT model is used only by Anagnostopoulos et al. [2005], and no further research has been conducted based on this approach. The authors state that no method exists which can produce unbiased samples from a search engine using its public interface. Extraction of random unbiased samples from a search engine’s index is needed for search engine evaluation and is also a prerequisite for size estimation of large databases, deep web sites, library records, etc.

The authors did not refer to any of my selected papers as their related work.

The authors propose a new model based on inverted indices and the traditional document-at-a-time (DAAT) model for IR systems. They represent each document by a unique DID and each term by an associated posting list supporting the standard operations (loc, next, jump). Their main idea is to create a pruned posting list from the original posting list: the pruned list contains documents from the original list, obtained by skipping a random number of documents with the jump operation, so that each document is included with equal probability. Next they query for those documents which contain at least one term from the pruned posting list and insert the returned documents into the random sample. In this technique they use a method called reservoir sampling for probability adjustment and implement the WAND operator for query optimization. They also give a new idea for random indexing of a search engine without static ranking.
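Reservoir sampling, mentioned above for probability adjustment, is a standard technique for keeping a fixed-size uniform sample from a stream whose length is unknown in advance; a minimal generic version is sketched below (it is not the authors’ WAND-specific implementation).

import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length.

    The first k items fill the reservoir; the i-th item (1-based) thereafter
    replaces a random reservoir slot with probability k / i, which keeps every
    item seen so far equally likely to remain in the reservoir."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randint(1, i)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir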

They stated that they implemented the sampling mechanism for the WAND


operator and used the JURU search engine developed by IBM. Their data set consisted of 1.8 million pages and 1.1 billion words. For document category classification they used IBM’s Eureka classifier.

The authors state that their sampling technique is justified when the difference between the actual size and the sample size is two orders of magnitude. They also note that the total runtime can be reduced by a factor of 10, 100 or even more, depending on the ratio of actual size to sample size. They state that the error rate of the estimator never exceeds 15 percent, which they claim is negligible for sample sizes greater than 200.

The authors claim that they have developed a general scheme for efficient sampling and that its efficiency can be further increased for particular implementations. They also state that their WAND sampling can be improved by optimal selection of its attributes.

This paper is cited by Bar-Yossef and Gurevich [2006].

3. CONCLUDING COMMENTS

This survey revisits a problem introduced by Bharat and Broder [1998] almost one and a half decades ago. They were the first to recognize the need to obtain random pages from a search engine’s index in order to calculate the size and overlap of different search engines.

It is observed that until 2001 most of the research considering this problem focused on random walks over the web graph to obtain uniform pages. In 1998, Bharat and Broder [1998] first introduced the lexicon based approach, in which queries are generated randomly and run against the public interface of the search engine. Later, Callan and Connell [2001] proposed their own algorithm with some improvements, following the concept of Bharat and Broder [1998].

Next, in 2006, Bar-Yossef and Gurevich [2006] introduced the concept of the query pool and used one of the Monte Carlo simulation methods, rejection sampling; however, this method is not fully accurate. After that, Broder et al. [2006] used the pool based concept of Bar-Yossef and Gurevich [2006] and introduced a new algorithm with a new low variance estimator. Next, Bar-Yossef and Gurevich [2007] improved their previous methods with another Monte Carlo simulation method, importance sampling.

Many solutions exist within the random walk approach. First, Bar-Yossef and Gurevich [2006] used the Monte Carlo simulation method called Metropolis-Hastings sampling in their random walk approach. Later, the rejection sampling method was used by Dasgupta et al. [2007] and the importance sampling method by Bar-Yossef and Gurevich [2008a].

Callan and Connell [2001] mention that their research can be extended in several directions to provide a more complete environment for searching and browsing among many databases. For example, the documents obtained by query-based sampling could be used to provide query expansion for database selection, or to drive a summarization or visualization interface showing the range of information available in a multidatabase environment. On the other hand, Bar-Yossef and Gurevich [2006] mention that as future work they want to use the Metropolis-Hastings algorithm instead of rejection sampling.


It is observed that mainly two groups of researchers, Bar-Yossef and Gurevich, and Broder et al., have made tremendous improvements toward solving the considered problem. Bar-Yossef and Gurevich have suggested using Metropolis-Hastings sampling in their lexicon based approach, but so far no one has taken that approach.

4. ACKNOWLEDGEMENT

I have the honour to acknowledge the knowledge and support I received from Dr. Richard Frost throughout this survey. I convey my deep regards to my supervisor, Dr. Lu, for encouraging me and helping me unconditionally. I would also like to thank my family and friends for providing me immense strength and support in completing this survey report.

5. ANNOTATIONS

5.1 Anagnostopoulos et al. 2005

Citation:

Anagnostopoulos, A., Broder, A., and Carmel, D. 2005. Sampling search-engine results. In Proceedings of the 14th International World Wide Web Conference (WWW), 245–256.

Problem: The authors state that no method exists which can produce unbiased samples from a search engine using its public interface. Extraction of random unbiased samples from a search engine’s index is needed for search engine evaluation and is also a prerequisite for size estimation of large databases, deep web sites, library records, etc.

Previous Work: The authors did not refer to any of my selected papers as their related work.

Shortcomings of Previous Work: No shortcomings of previous work were mentioned by the authors.

New Idea/Algorithm/Architecture: The authors propose a new model based on inverted indices and the traditional document-at-a-time (DAAT) model for IR systems. They represent each document by a unique DID and each term by an associated posting list supporting the standard operations (loc, next, jump). Their main idea is to create a pruned posting list from the original posting list: the pruned list contains documents from the original list, obtained by skipping a random number of documents with the jump operation, so that each document is included with equal probability. Next they query for those documents which contain at least one term from the pruned posting list and insert the returned documents into the random sample. In this technique they use a method called reservoir sampling for probability adjustment and implement the WAND operator for query optimization. They also give a new idea for random indexing of a search engine without static ranking.

Experiments Conducted: They stated that they implemented the sampling mechanism for the WAND operator and used the JURU search engine developed by IBM. Their data set consisted of 1.8 million pages and 1.1 billion words. For document category classification they used IBM’s Eureka classifier.

Results: The authors state that their sampling technique is justified when the difference between the actual size and the sample size is two orders of magnitude. They also note that the total runtime can be reduced by a factor of 10, 100 or even more, depending on the ratio of actual size to sample size. They state that the error rate of the estimator never exceeds 15 percent, which they claim is negligible for sample sizes greater than 200.

Claims: The authors claim that they have developed a general scheme for efficient sampling and that its efficiency can be further increased for particular implementations. They also state that their WAND sampling can be improved by optimal selection of its attributes.

Citations by Others: Bar-Yossef and Gurevich [2006].

5.2 Bar-Yossef et al. 2000

Citation:

Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535–544.

Problem: The authors state that no accurate method exists to approximate certain aggregate queries about web pages, such as the average size of web pages, the coverage of a search engine, the proportion of pages belonging to .com and other domains, etc. These queries are very important for both academic and commercial use.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Henzinger et al. [1999] and Henzinger et al. [2000].

Shortcomings of Previous Work: The authors state that the approach of Bharat and Broder [1998] for getting a uniform sample uses dictionaries to construct its queries; as a result, the random sample is biased towards English-language pages. In both Lawrence and Giles [1998] and Lawrence and Giles [1999] the size of the web is estimated based on the number of web servers instead of the total number of web pages. The authors also state that the method of Henzinger et al. [2000] requires a huge number of samples for approximation and is biased towards high-degree nodes.

New Idea/Algorithm/Architecture: The authors state that for approximation they use random sampling. To produce the uniform samples they use the theory of random walks. They propose a new random walking process called Web Walker, which performs a regular undirected random walk and picks pages randomly from its traversed pages. The starting page of the Web Walker is an arbitrary page from the strongly connected component of the web.

Experiments Conducted: Bar-Yossef et al. [2000] state that to evaluate their Web Walker’s effectiveness and bias they performed a 100,000-page walk on the 1996 web graph. To check how close the samples were to uniform, they compared their resultant set with a number of known sets, some of which were ordered by degree and some in breadth-first-search order. They tried to approximate the relative size of the pages indexed by the search engine AltaVista, the pages indexed by another search engine, FAST, the overlap between AltaVista and FAST, and the pages that contain inline images.

Results: According to the reports of AltaVista and FAST, their indexes contained about 300 million and 250 million pages respectively, so the ratio should be around 1.2. The authors report that their experiment found a ratio of 1.14, which is close.

Claims: The authors claim that they have presented an efficient and accurate method for estimating the results of aggregate queries. They also claim that their method can produce uniformly distributed web pages without significant computation and network resources, and that their estimates can be reproduced using a single PC with a modest internet connection.

Citations by Others: Bar-Yossef and Gurevich [2006], Broder et al. [2006] and Bar-Yossef and Gurevich [2007].

5.3 Bar-Yossef and Gurevich 2006

Citation:

Bar-Yossef, Z. and Gurevich, M. 2006. Random sampling from a search engine's index. In Proceedings of WWW '06, 367–376.

Problem: The authors state that no method exists which can produce unbiased samples from a search engine using its public interface. Extraction of random unbiased samples from a search engine’s index is needed for search engine evaluation and is also a prerequisite for size estimation of large databases, deep web sites, library records, etc.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1999], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001] and Anagnostopoulos et al. [2005].

Shortcomings of Previous Work: The authors state that the approach of Bharat and Broder [1998] is biased towards long, content-rich documents, since large documents match many more queries than short documents, and it is also biased towards highly ranked documents, since search engines return only the top K results. Another noted shortcoming is that the accuracy of this method depends directly on term frequency estimation: if the estimation is biased from the beginning, the bias remains in the sampled documents. Henzinger et al. [1999] and Lawrence and Giles [1999] also have several biases in their sampled documents. The approaches of Henzinger et al. [2000] and Bar-Yossef et al. [2000] are criticized because their sampled documents are biased towards documents with high in-degree. Finally, Rusmevichientong et al. [2001] are also criticized for their biased samples.

New Idea/Algorithm/Architecture: The authors propose two new methods for getting uniform samples from a search engine’s index. The first one, called the Pool-based sampler, is lexicon based but, unlike the Bharat and Broder [1998] approach, does not require term frequencies. In this approach sampling is carried out in two steps. For sampling, the authors use rejection sampling for its simplicity. Their second proposed technique is based on a random walk over a virtual graph defined on the documents indexed by the search engine, where two documents are connected by an edge if they share the same words or phrases. In this approach they apply the Metropolis-Hastings algorithm for getting uniform samples.

Experiments Conducted: The authors state that they conducted three sets of experiments: pool measurement, evaluation experiments and exploration experiments. They crawled 2.4 million English pages from ODP to build a search engine and created the query pool. From that query pool they issued queries to the Google, Yahoo and MSN search engines and used the returned samples to estimate their relative sizes.

Results: The authors state that their Pool-based sampler had no bias at all, and that there was a small negative bias towards short documents in their Random walk sampler. They also found that the Bharat and Broder [1998] sampler had significant bias.

Claims: Bar-Yossef and Gurevich [2006] claim that their first model, the Pool-based sampler, provides samples with weights which can be used with stochastic simulation techniques to produce near-uniform samples. They also state that the main advantage of their random walk model is that it does not need a query pool. The bias of this model could be eliminated using a more comprehensive experimental setup.

Citations by Others: Broder et al. [2006], Bar-Yossef and Gurevich [2007], Dasgupta et al. [2007] and Bar-Yossef and Gurevich [2008a].

5.4 Bar-Yossef and Gurevich 2007

Citation:

Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401–410.

Problem: For automated and neutral evaluation of a search engine we need to measure some quality metrics of the engine, such as corpus size, index freshness, and the density of spam and duplicate pages. Such an evaluation method can serve as an objective benchmark for a search engine, which could also be useful for the owners of those search engines. Because search engines are highly dynamic, measuring these quality metrics is challenging. In this paper the authors focus on the external evaluation of a search engine using its public interface.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Henzinger et al. [1999], Lawrence and Giles [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Rusmevichientong et al. [2001], Bar-Yossef and Gurevich [2006], Broder et al. [2006] and Bar-Yossef and Gurevich [2007].

Shortcomings of Previous Work: The authors state that the methods proposed by Lawrence and Giles [1999], Henzinger et al. [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000] and Rusmevichientong et al. [2001] sample documents from the whole web, which is more difficult and suffers from bias. They also report that the recently proposed techniques of Bar-Yossef and Gurevich [2006] and Broder et al. [2006] have significant bias due to inaccurate approximation of document degrees.

New Idea/Algorithm/Architecture: The authors define two new estimators, called the Accurate Estimator and the Efficient Estimator, to estimate the target distribution of the sampled documents based on importance sampling. The Accurate Estimator uses approximate weights and requires sending some queries to the search engine, while the Efficient Estimator uses deterministic approximate weights and does not need any queries. They use the Rao-Blackwell theorem with importance sampling to reduce estimation variance.
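The following sketch illustrates the underlying importance-sampling idea in isolation; it is not the authors' Accurate or Efficient Estimator, and the trial distribution, target distribution and measured function are all assumptions chosen for the example. Samples drawn from an easy trial distribution are reweighted by the ratio of target to trial densities, so that a quantity defined under the hard-to-sample target distribution can still be estimated.

import random

rng = random.Random(0)

# Illustrative setup (an assumption): the "index" is the integers 1..N, the
# target distribution is uniform over the index, and the easy-to-sample trial
# distribution over-samples small ids, mimicking a biased sampler.
N = 1000
population = list(range(1, N + 1))
trial_weights = [1.0 / x for x in population]          # trial(x) proportional to 1/x

def f(x):
    # Quantity measured under the uniform target, e.g. "is this a long document?"
    return 1.0 if x > N // 2 else 0.0

samples = rng.choices(population, weights=trial_weights, k=20000)

# Self-normalized importance sampling: weight = target(x) / trial(x); since the
# target is uniform this is proportional to x (constants cancel below).
weights = [float(x) for x in samples]
estimate = sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)
print(round(estimate, 3))                               # approximately 0.5, the true value

Rao-Blackwellization, which the authors apply on top of importance sampling, roughly corresponds to averaging out part of the sampling randomness to reduce the variance of such an estimator.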



Experiments Conducted: To evaluate their newly proposed estimators they conducted two sets of experiments. Their first experiment was on an ODP search engine they built themselves, which consists of text, HTML and PDF files. For the second experiment they used three major commercial search engines. In both experiments they compared their proposed estimators against the estimators proposed by Bar-Yossef and Gurevich [2006] and Broder et al. [2006].

Results: According to the authors, both experiments clearly show that the Accurate Estimator is unbiased, as expected, and that the Efficient Estimator has a small bias, which is due to the weak correlation between the function value and the validity density of pages.

Claims: The authors claim that their new estimators outperform the recently proposed estimators of Bar-Yossef and Gurevich [2006] and Broder et al. [2006] in producing unbiased random samples, and that this is shown both analytically and empirically. They also state that their methods are more accurate and efficient than other existing methods, because they overcome the degree mismatch problem and do not rely on how the search engine works internally. They further claim that applying Rao-Blackwellization to the importance sampling estimators significantly reduces estimation variance and reduces the cost compared to other existing techniques.

Citations by Others: Bar-Yossef and Gurevich [2008a].

5.5 Bar-Yossef and Gurevich 2008a

Citation:

Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment, 1, 1, 54–65.

Problem: This paper is concerned with sampling a search engine's suggestion database through its public user interface. Suggestion samples can be useful for online advertising and for studying user behaviour. Random, unbiased samples from a search engine's index are needed for creating objective benchmarks for search engine evaluation, and would also be useful for size estimation of search engines, large databases, deep web sites, library records, etc.

Previous Work: The authors refer to previous work by Olken and Rotem [1995], Liu [1996], Bharat and Broder [1998], Lawrence and Giles [1998], Bar-Yossef and Gurevich [2006], Broder et al. [2006], Bar-Yossef and Gurevich [2007] and Dasgupta et al. [2007].

Shortcomings of Previous Work: The authors state that the work of Bar-Yossef and Gurevich [2006], Bar-Yossef and Gurevich [2007], Bharat and Broder [1998] and Broder et al. [2006] is based on a query pool, and building a query pool for suggestion sampling is impractical and difficult. Olken and Rotem [1995] used B-trees, but the suggestion database is not balanced like a B-tree.

New Idea/Algorithm/Architecture: The authors define two algorithms, a uniform sampler and a score-induced sampler, for sampling suggestions through the public interface. In their method they use an easy-to-sample trial distribution to simulate samples from the target distribution using a Monte Carlo simulation method, namely importance sampling. Samples are obtained by a random walk on the suggestion tree: when the random walk reaches a marked node it stops and returns the reached node as a sample.
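As a rough illustration of this tree-walk idea (a toy version, not the authors' algorithm), the sketch below descends a prefix tree over a small, fully known suggestion set, stops when the current prefix has few enough completions to be returned in full (a "marked" node), and then corrects for the unequal reach probabilities with importance-style weights. The suggestion list, the cut-off K and the weighting scheme are assumptions made only for the example.

import random
from collections import defaultdict

rng = random.Random(0)

# Toy suggestion database (an assumption): a real sampler would only see it
# through a public suggest-style prefix interface.
suggestions = ["car", "cart", "care", "cat", "dog", "dot", "door", "deer"]

def completions(prefix):
    return [s for s in suggestions if s.startswith(prefix)]

K = 3  # a node is "marked" when the interface could return all its completions

def random_walk():
    """Walk down the prefix tree; return (reached suggestions, reach probability)."""
    prefix, prob = "", 1.0
    while len(completions(prefix)) > K:
        # children = next characters that extend the prefix towards a suggestion
        children = sorted({s[len(prefix)] for s in completions(prefix) if len(s) > len(prefix)})
        c = rng.choice(children)
        prob *= 1.0 / len(children)
        prefix += c
    return completions(prefix), prob

# Importance-style correction: weight each drawn suggestion by the inverse of
# the probability with which the walk actually samples it.
counts = defaultdict(float)
for _ in range(20000):
    bucket, prob = random_walk()
    s = rng.choice(bucket)
    counts[s] += 1.0 / (prob * (1.0 / len(bucket)))

total = sum(counts.values())
print({s: round(counts[s] / total, 3) for s in suggestions})  # roughly uniform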



Experiments Conducted: To evaluate their algorithms, they built a suggestion server based on the AOL query logs, which consist of 20 million web queries. From these logs they extracted the unique query strings and their popularity, built the suggestion server from those unique queries, and used their methods to sample its suggestions.

Results: They state that the uniform sampler is practically unbiased and the score-induced sampler has a minor bias towards medium-scored queries.

Claims: The authors claim that the uniform sampler is unbiased and efficient, while the score-induced sampler is slightly biased but still effective for producing useful measurements. They also claim that their methods do not compromise user privacy. Moreover, they note that their methods require only a few thousand queries, whereas this number is much higher in other contemporary methods.

Citations by Others: There are no specific references to this paper by other researchers in this survey.

5.6 Broder et al. 2006

Citation:

Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594–603.

Problem: The problem considered by the authors is estimating the size of a data corpus using only its standard query interface. This enables researchers to evaluate the quality of a search engine, estimate the size of the deep web, etc. It is also important for owners to understand the characteristics and quality of their corpus, such as corpus freshness, over- or under-represented topics, the presence of spam, and the ability of the corpus to answer narrow or rare queries.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Bar-Yossef et al. [2000], Henzinger et al. [2000], Callan and Connell [2001], Rusmevichientong et al. [2001] and Bar-Yossef and Gurevich [2006].

Shortcomings of Previous Work: The authors state that the approach of Bharat and Broder [1998] is biased toward content-rich pages with higher rank. The method of Lawrence and Giles [1998] accepts only those queries for which all matched results are returned by the search engine, so the resulting sample is not uniform. Lawrence and Giles [1999] fixed that problem the following year, but their technique is highly dependent on the underlying IR technique. They also state that the method of Bar-Yossef and Gurevich [2006] is laborious.

New Idea/Algorithm/Architecture: The authors propose two approaches based on a basic low-variance, unbiased estimator. Their first method requires a uniform random sample from the underlying corpus; after obtaining the uniform sample, the corpus size is computed with the basic estimator. For random sampling they use the rejection sampling method. The second approach is based on two query pools that are uncorrelated with respect to a set of query terms; using these two pools, the corpus size is estimated with the basic estimator, taking the pools' lack of correlation into account.
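The toy sketch below illustrates one way a pool-based size estimator of this general kind can work; it is a simplified stand-in, not the authors' basic estimator, and the corpus, query pool and helper functions (results, degree) are assumptions. It rests on the identity that summing result-set sizes over the pool equals summing, over documents, the number of pool queries each document matches, so the corpus size equals that sum divided by the mean per-document degree, which can be estimated from a (near-)uniform document sample.

import random

rng = random.Random(0)

# Toy corpus and query pool (assumptions for illustration). Each document is a
# set of terms; the pool is a list of single-term queries.
corpus = {i: {rng.choice("abcdefg") for _ in range(3)} for i in range(10000)}
pool = list("abcdefg")

def results(q):                      # documents matched by query q
    return [d for d, terms in corpus.items() if q in terms]

def degree(d):                       # number of pool queries matching document d
    return sum(1 for q in pool if q in corpus[d])

# Identity behind a pool-based size estimator:
#   sum_q |results(q)| = sum_d degree(d) = |corpus| * mean_degree
# so |corpus| can be estimated as S / mean_degree, with mean_degree estimated
# from a near-uniform document sample (here we cheat and sample directly; in
# practice the sample would come from a sampler like those described above).
S = sum(len(results(q)) for q in pool)
sample = [rng.choice(list(corpus)) for _ in range(500)]
mean_degree = sum(degree(d) for d in sample) / len(sample)
print("estimated size:", round(S / mean_degree), "true size:", len(corpus))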



Experiments Conducted: In [Broder et al. 2006] the authors evaluated their methods in two experiments. One experiment was conducted on TREC, with a dataset consisting of 1,246,390 HTML files from the TREC.gov test collection. The other experiment was conducted on the web with three popular search engines. For both experiments they built query pools and computed 11 estimators, taking the median of those 11 estimators as the final estimate.

Results: The authors state that, in their experiment with the three search engines, the estimated sizes in billions of pages indexed were as follows:

Search Engine (SE)    SE1    SE2    SE3
Size (billions)       1.5    1.0    0.95

Next, using the uncorrelated query pool approach, they estimated the sizes of those search engines as follows:

Search Engine (SE)    SE1    SE2    SE3
Size (billions)       2.8    1.9    1.1

Claims: The authors claim to have constructed a basic estimator for estimating the size of a corpus using only black-box access, and they propose two algorithms for size estimation, one with random sampling and one without. They also claim to have developed a novel method for reducing the variance of the random samples when the distribution of the random variable is monotonically decreasing. In addition, they state that by building the uncorrelated query pools more carefully, the size of the corpus can be estimated more accurately.

Citations by Others: Bar-Yossef and Gurevich [2007] and Bar-Yossef and Gurevich [2008a].

5.7 Callan and Connell 2001

Citation:

Callan, J. and Connell, M. 2001. Query-based sampling of text databases. ACM Transactions on Information Systems, 19, 2, 97–130.

Problem: The authors state that no efficient method exists to acquire the resource description of a given database. Acquiring a database's resource description is very important for information retrieval techniques.

Previous Work: The authors do not refer to any of my selected papers as related work.

Shortcomings of Previous Work: No shortcomings of previous work were mentioned by the authors.

New Idea/Algorithm/Architecture: The authors propose a new query-based sampling algorithm for acquiring resource descriptions. Initially a single query term is selected and run against the database, and the resource description is updated based on the top N returned documents. This process is repeated many times with different query terms until a stopping condition is reached. The algorithm also provides options for choosing the number of query terms, how many documents to examine per query, and the stopping condition of the sampling.
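The sketch below is a minimal interpretation of this sampling loop, with a toy in-memory database standing in for an uncooperative text database; the term set, ranking, query budget and other parameters are assumptions rather than the authors' settings.

import random
from collections import Counter

rng = random.Random(0)

# Toy "uncooperative" text database (an assumption): doc id -> list of terms.
database = {i: [rng.choice(["web", "data", "index", "query", "page", "rank"])
                for _ in range(50)] for i in range(500)}

def run_query(term, n=4):
    """Return the ids of up to n documents containing the term (toy ranking)."""
    hits = [d for d, terms in database.items() if term in terms]
    return rng.sample(hits, min(n, len(hits)))

def query_based_sampling(seed_term, max_queries=100, docs_per_query=4):
    """Build an approximate resource description from sampled documents only."""
    description = Counter()                      # term -> frequency in sampled docs
    seen_docs = set()
    term = seed_term
    for _ in range(max_queries):                 # stopping condition: query budget
        for d in run_query(term, docs_per_query):
            if d not in seen_docs:
                seen_docs.add(d)
                description.update(database[d])
        term = rng.choice(list(description))     # next query from learned vocabulary
    return description, seen_docs

description, seen = query_based_sampling("web")
print(len(seen), "documents sampled;", description.most_common(3))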

Experiments Conducted: To evaluate the proposed algorithm the authors conducted three separate experiments on separate datasets, measuring description accuracy (vocabulary and frequency information), resource selection accuracy, and retrieval accuracy.

Results: The authors state that their experimental results show that resource descriptions created by query-based sampling are sufficiently similar to resource descriptions generated from complete information. The results also demonstrate that a small partial description of a resource can be as effective as its complete description.

Claims: The authors claim that their method avoids many limitations of cooperative protocols such as STARTS and is also applicable to legacy databases. They also claim that the cost of their query-based sampling method is comparatively low and that it is not as easily defeated by intentional misrepresentation as cooperative protocols.

Citations by Others: Broder et al. [2006].

5.8 Dasgupta et al. 2007

Citation:

Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD07). ACM, New York, 629–640.

Problem: A large part of World Wide Web data is hidden behind form-like interfaces; this is called the hidden web. Owing to access limitations we cannot analyze hidden web data in a trivial way using only its publicly available interfaces. By generating a uniform sample from these hidden databases, however, we can analyze the underlying data, for example its distribution or its size, even where direct access is restricted.

Previous Work: The authors refer to previous work by Olken [1993], Bharat and Broder [1998] and Bar-Yossef and Gurevich [2006].

Shortcomings of Previous Work: Dasgupta et al. [2007] state that the approach of Olken [1993] applies to directly accessible databases; Olken [1993] did not consider restrictions on direct access to the underlying data. Bharat and Broder [1998] and Bar-Yossef and Gurevich [2006] both address the same problem only for search engines, not for deep web databases in general. Bar-Yossef and Gurevich [2006] introduced the concept of a random walk over the World Wide Web using the top-k documents returned by a search engine, but this document-based model is not directly applicable to hidden databases.

New Idea/Algorithm/Architecture: In this paper the authors propose a new method for sampling documents from hidden web databases. They propose an algorithm called HIDDEN-DB-SAMPLER, which is based on a random walk over the query space exposed by the public user interface. They modify BRUTE-FORCE-SAMPLER, a simple algorithm for random sampling, and adopt this technique in their new algorithm. The three new ideas proposed by the authors are early detection of underflow and valid tuples, random reordering of attributes, and boosting the acceptance probability via a scaling factor. To reduce skew, they use rejection sampling within the random walk.
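The sketch below is a heavily simplified variant in the spirit of this random walk over the query space (not the published HIDDEN-DB-SAMPLER): attributes are visited in a random order, underflowing partial queries are abandoned early, and a depth-dependent acceptance probability removes the skew towards tuples reachable by short walks. The Boolean database, the "matches exactly one tuple" stopping rule and the 2^(depth - K) acceptance factor are all assumptions of this toy version.

import random
from itertools import product

rng = random.Random(0)

K = 6                                      # number of Boolean attributes
# Toy hidden database (an assumption): distinct Boolean tuples reachable only
# through conjunctive queries on attribute values.
tuples = rng.sample(list(product([0, 1], repeat=K)), 20)

def matches(partial):                      # tuples consistent with a partial query
    return [t for t in tuples if all(t[a] == v for a, v in partial.items())]

def hidden_db_walk():
    """One random walk over the query space; returns a tuple or None (rejected)."""
    order = rng.sample(range(K), K)        # random reordering of attributes
    partial = {}
    for depth, attr in enumerate(order, start=1):
        partial[attr] = rng.randint(0, 1)
        hits = matches(partial)
        if not hits:                       # early detection of underflow
            return None
        if len(hits) == 1:                 # query now identifies a single tuple
            # Accept with probability 2^(depth - K), so every tuple is returned
            # with the same overall probability 2^(-K), removing the skew
            # towards tuples reachable by short walks.
            return hits[0] if rng.random() < 2 ** (depth - K) else None
    return None

samples = [t for t in (hidden_db_walk() for _ in range(20000)) if t is not None]
counts = {t: samples.count(t) for t in tuples}
print(min(counts.values()), max(counts.values()))   # counts are roughly equal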

Experiments Conducted: Dasgupta et al. [2007] state that they ran their algorithm on two groups of databases. The first group comprised small Boolean datasets of 500 tuples with 15 attributes each, and also included 500 tuples retrieved from the Yahoo auto web site. The databases used for the performance experiments comprised 300,000 to 500,000 tuples each. All experiments were run on a machine with a 1.8 GHz P4 processor and 1.5 GB of RAM running Fedora Core 3.

Results: According to the authors, random orderings outperform fixed orderings for both of their data groups. They also state that performance with the fixed ordering technique depends on the specific ordering used.

Claims: The authors claim that their new random walk over the query space provided by the public user interface outperforms the earlier approach of Bar-Yossef and Gurevich [2006]. They state that they evaluated the approach of Bar-Yossef and Gurevich [2006], compared it with their method, and found that it has inherent skew and does not work well for hidden databases.

Citations by Others: Bar-Yossef and Gurevich [2008a].

5.9 Henzinger et al 2000

Citation:

Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295–308.

Problem: The authors state that no accurate method exists to obtain a uniform random sample from the Web, which is needed to estimate various properties of web pages such as the fraction of pages in various internet domains, the coverage of a search engine, etc.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999] and Henzinger et al. [1999].

Shortcomings of Previous Work: Henzinger et al. [2000] state that the approach of Bharat and Broder [1998] suffers from various biases; samples produced by [Bharat and Broder 1998] are biased towards longer web pages (those with more words). The method proposed by Lawrence and Giles [1998] is useful for determining the characteristics of web hosts but is not accurate for estimating the size of the Web; moreover, its behaviour for IPv6 addresses is uncertain.

New Idea/Algorithm/Architecture: In this work the authors do not introduce a new method; rather, they give suggestions for improving sampling based on random walks. Instead of using a normal crawler they suggest using Mercator, a multi-threaded web crawler. Each thread begins at a randomly chosen starting point, and for random selection they suggest making a random jump to a page already visited by at least one thread instead of following the hyperlinks.
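The toy sketch below captures the gist of this suggestion (it is not Mercator and runs only a single thread): a walk over a small random link graph that, with some probability, jumps to a uniformly chosen already-visited page instead of following an out-link. The graph, the jump probability and the seed set are assumptions for illustration.

import random

rng = random.Random(0)

# Toy web graph (an assumption): page id -> list of out-links.
graph = {i: [rng.randrange(200) for _ in range(5)] for i in range(200)}

def random_walk_with_jumps(seeds, steps=10000, jump_prob=1.0 / 7):
    """Walk the graph; occasionally jump to a random already-visited page."""
    visited = list(seeds)
    visited_set = set(seeds)
    current = rng.choice(visited)
    samples = []
    for _ in range(steps):
        if rng.random() < jump_prob or not graph[current]:
            current = rng.choice(visited)      # random jump, not a hyperlink
        else:
            current = rng.choice(graph[current])
        if current not in visited_set:
            visited_set.add(current)
            visited.append(current)
        samples.append(current)
    return samples, visited

samples, visited = random_walk_with_jumps(seeds=[0, 1, 2])
print(len(visited), "pages visited;", len(set(samples)), "distinct pages sampled")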



Experiments Conducted: The authors state that they made a test bed based on a class of random graphs designed to share important properties with the Web. To collect uniform URLs they performed three random walks, each started from a seed set containing 10,258 URLs discovered by a previous crawl.

Results: Henzinger et al. [2000] state that 83.2 percent of all visited URLs were visited by only one walk, which, they argue, shows that their walks were quite random.

Claims: The authors claim to have described a method for generating a near-uniform sample of URLs, and that the samples produced by their method are more uniform than samples produced by a random walk over all pages. According to Henzinger et al. [2000], their method can be improved significantly using additional knowledge of the Web.

Citations by Others: Bar-Yossef et al. [2000], Rusmevichientong et al. [2001], Bar-Yossef and Gurevich [2006], Broder et al. [2006] and Bar-Yossef and Gurevich [2007].

5.10 Rusmevichientong et al. 2001

Citation:

Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty within Computation. 121–128.

Problem: The authors state that no accurate method exists to obtain a uniform random sample from the Web, which is needed to estimate various properties of web pages such as the fraction of pages in various internet domains, the coverage of a search engine, etc.

Previous Work: The authors refer to previous work by Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999], Bar-Yossef et al. [2000] and Henzinger et al. [2000].

Shortcomings of Previous Work: The authors state that the methods of Bharat and Broder [1998], Lawrence and Giles [1998], Lawrence and Giles [1999] and Henzinger et al. [2000] are highly biased towards pages with large numbers of inbound links. The approach of Bar-Yossef et al. [2000] assumes that the Web is an undirected graph, whereas the Web is not truly undirected; moreover, their approach also has some biases.

New Idea/Algorithm/Architecture: The authors propose two new algorithms based on the approaches of Henzinger et al. [2000] and Bar-Yossef et al. [2000]. The first algorithm, called Directed-Sample, works on an arbitrary directed graph, and the other, called Undirected-Sample, works on an undirected graph with some additional knowledge about inbound links. Both algorithms are based on a weighted random-walk methodology.
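As a rough illustration of weighted-random-walk sampling (not the authors' Directed-Sample or Undirected-Sample algorithms), the sketch below walks an undirected toy graph and keeps a visited node with probability inversely proportional to its degree; since a plain walk visits nodes roughly in proportion to their degree, the kept nodes are approximately uniform over the walk's component. The random graph, walk length and acceptance rule are assumptions.

import random

rng = random.Random(0)

# Toy undirected graph (an assumption): node -> set of neighbours.
n = 300
adj = {v: set() for v in range(n)}
for _ in range(900):
    u, v = rng.randrange(n), rng.randrange(n)
    if u != v:
        adj[u].add(v)
        adj[v].add(u)

def weighted_walk_samples(start, steps):
    """Simple random walk that keeps a visited node with probability 1/degree."""
    samples, current = [], start
    for _ in range(steps):
        if not adj[current]:
            current = rng.randrange(n)        # jump away from an isolated node
            continue
        current = rng.choice(sorted(adj[current]))
        # The plain walk is biased towards high-degree nodes; keeping a node
        # with probability 1/degree approximately cancels that bias.
        if rng.random() < 1.0 / len(adj[current]):
            samples.append(current)
    return samples

samples = weighted_walk_samples(start=0, steps=50000)
print(len(samples), "near-uniform node samples,", len(set(samples)), "distinct nodes")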

Experiments Conducted: The authors state that for evaluation they conducted their experiments on both directed and undirected graphs with 100,000 nodes, 71,189 of which belong to the primary strongly connected component.

Results: The authors state that the Directed-Sample algorithm produced 2,057 samples and the Undirected-Sample algorithm produced 10,000 samples; both sets of results are close to uniform samples.

Claims: The authors claim that Directed-Sample is naturally suited to the Web without any additional assumptions, whereas Undirected-Sample assumes that any hyperlink can be followed in both the backward and forward directions. They also claim that their algorithms perform as well as the Regular-Sample of Bar-Yossef et al. [2000] and better than the Pagerank-Sample of Henzinger et al. [2000].

Citations by Others: Bar-Yossef and Gurevich [2006], Broder et al. [2006] and Bar-Yossef and Gurevich [2007].

6. REFERENCES


Anagnostopoulos, A., Broder, A., and Carmel, D. 2005. Sampling search-engine results. In Proceedings of the 14th International World Wide Web Conference (WWW). 245–256. (Exactly on topic)

Bar-Yossef, Z. and Gurevich, M. 2006. Random sampling from a search engine's index. In Proceedings of WWW 06. 367–376. (Exactly on topic)

Bar-Yossef, Z. and Gurevich, M. 2008a. Mining search engine query logs via suggestion sampling. Proceedings of the VLDB Endowment, 1, 1, 54–65. (Exactly on topic)

Bar-Yossef, Z. and Gurevich, M. 2007. Efficient search engine measurements. In Proceedings of the 16th International World Wide Web Conference (WWW). ACM, New York, 401–410. (Part of this paper is exactly on topic)

Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J., and Weitz, D. 2000. Approximating aggregate queries about Web pages via random walks. In Proceedings of the 26th International Conference on Very Large Databases (VLDB). ACM, New York, 535–544. (Different approach to the concerned problem; most of my selected papers refer to this paper)

Bharat, K. and Broder, A. 1998. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the 7th International World Wide Web Conference (WWW7). 379–388. (This paper introduced my concerned problem; most of my selected papers refer to it)

Broder, A., Fontoura, M., Josifovski, V., Kumar, R., Motwani, R., Nabar, S., Panigrahy, R., Tomkins, A., and Xu, Y. 2006. Estimating corpus size via queries. In Proceedings of the 15th International Conference on Information and Knowledge Management (CIKM). ACM, New York, 594–603. (Part of this paper is exactly on topic)

Callan, J. and Connell, M. 2001. Query-based sampling of text databases. ACM Transactions on Information Systems, 19, 2, 97–130. (Different approach to the concerned problem; most of my selected papers refer to this paper)

Dasgupta, A., Das, G., and Mannila, H. 2007. A random walk approach to sampling hidden databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD07). ACM, New York, 629–640. (Exactly on topic)

Hastings, W. K. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 1, 97–109. (About the basic idea of Monte Carlo simulation; referred to by many selected papers)

Hedley, Y.-L., Younas, M., James, A. E., and Sanderson, M. 2006. Sampling, information extraction and summarisation of hidden web databases. Data and Knowledge Engineering, 59, 2, 213–230. (Different approach to my concerned problem; referred to by another of my selected papers)

Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 1999. Measuring index quality using random walks on the Web. In Proceedings of the 8th International World Wide Web Conference. 213–225. (Different approach to the concerned problem; most of my selected papers refer to this paper)

Henzinger, M. R., Heydon, A., Mitzenmacher, M., and Najork, M. 2000. On near-uniform URL sampling. In Proceedings of the 9th International World Wide Web Conference. 295–308. (Different approach to the concerned problem; most of my selected papers refer to this paper)

Lawrence, S. and Giles, C. L. 1998. Searching the World Wide Web. Science 280, 5360, 98–100. (About the basic idea of a search engine's index and returned documents; most of my selected papers refer to this paper)

Lawrence, S. and Giles, C. L. 1999. Accessibility of information on the Web. Nature 400, 107–109. (About the basic idea of a search engine's index and returned documents; most of my selected papers refer to this paper)

Liu, J. S. 1996. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing 6, 113–119. (About the basic ideas of different Monte Carlo methods and their comparison; some of my selected papers refer to this paper)

McCown, F. and Nelson, M. L. 2007. Agreeing to disagree: Search engines and their public interfaces. In JCDL 07: Proceedings of the 2007 Conference on Digital Libraries. 309–318. (About search engine result comparison; part of this paper is useful)

Olken, F. and Rotem, D. 1990. Random sampling from database files: A survey. In Proceedings of the Fifth International Conference on Statistical and Scientific Database Management. 92–111. (Similar work to my topic in a different environment)

Ribeiro, B. and Towsley, D. 2010. Estimating and sampling graphs with multidimensional random walks. In Proceedings of the ACM Internet Measurement Conference (IMC). 390–403. (Different approach to the concerned problem)

Rusmevichientong, P., Pennock, D., Lawrence, S., and Giles, C. L. 2001. Methods for sampling pages uniformly from the World Wide Web. In Proceedings of the AAAI Fall Symposium on Using Uncertainty within Computation. 121–128. (Different approach to the concerned problem; most of my selected papers refer to this paper)
