
SOFTWARE TESTING, VERIFICATION AND RELIABILITY
Softw. Test. Verif. Reliab. 2012; 22:221–243
Published online 24 August 2010 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/stvr.437

Automated functional testing of online search services

    SUMMARY

Search services are the main interface through which people discover information on the Internet. A fundamental challenge in testing search services is the lack of oracles. The sheer volume of data on the Internet prohibits testers from verifying the results. Furthermore, it is difficult to objectively assess the ranking quality because different assessors can have very different opinions on the relevance of a Web page to a query. This paper presents a novel method for automatically testing search services without the need of a human oracle. The experimental findings reveal that some commonly used search engines, including Google, Yahoo!, and Live Search, are not as reliable as what most users would expect. For example, they may fail to find pages that exist in their own repositories, or rank pages in a way that is logically inconsistent. Suggestions are made for search service providers to improve their service quality. Copyright © 2010 John Wiley & Sons, Ltd.

    Received 17 April 2009; Revised 12 April 2010; Accepted 23 April 2010

KEY WORDS: software testing; metamorphic testing; verification and validation; Internet search

    1. INTRODUCTION

The World Wide Web is the largest repository of digital information ever produced by the human race. Owing to its size, the finding or retrieval of specific information from the Web is an increasingly difficult task. Finding information on the Web is made possible through search services such as Google, Yahoo!, Live Search, and numerous others.

Among these service providers, Google is currently the largest one, providing the most comprehensive coverage of the Web. Even then, Google only has partial information about the contents of the Web. For example, it is known that Google indexed 25 billion Web pages in 2006, whereas the estimated size of the Web had already exceeded 200 billion in 2005 [1] and reached 447.98 billion by early 2007 [2]. As a consequence, any attempt to search for information on the Web will only yield the results covered by the search service provider.

    A large number of users rely on Web search services to retrieve reliable information. While

    it is known that search service providers cannot provide full Web coverage, the extent to which

*Correspondence to: Zhi Quan Zhou, School of Computer Science and Software Engineering, University of Wollongong, Wollongong, NSW 2522, Australia. E-mail: zhiquan@uow.edu.au
The paper will use the term Web when referring to the World Wide Web.


several interesting issues, such as the impact of different languages on ranking quality, the impact of commercial interests on ranking quality, the change of ranking quality with time, and the relationship between ranking quality and page quality. Section 5 provides further discussion of the presented testing method and findings. Section 6 concludes the paper and discusses future work.

2. TESTING SEARCH SERVICES IN TERMS OF THE QUALITY OF RETURNED COUNTS AND CONTENTS

As an example of the inaccuracy of search results, consider an advanced search in the ACM Digital Library (http://portal.acm.org) for the following paper: In black and white: an integrated approach to class-level testing of object-oriented programs (search by title, see Figure 1(a)). The system returned no results (Figure 1(b)). Surrounding the keyphrase with double quotation marks (as suggested by the system) did not change this outcome. However, a general search on http://portal.acm.org using the same keyphrase (Figure 1(c)) produces a find. Obviously, the paper does exist in the database, but it cannot be located in the first two attempts. This indicates a potential fault in the system.

Apart from general users, researchers from many fields have been doing research by consulting generic Web search engines for some time now. It has become appealing for some researchers to use the Web as a data source, and to use a Web search engine to find frequencies and probabilities of some phenomena of interest [16]. In the field of language analysis and generation, for instance, some researchers have been using the counts returned by Web search engines for their research [17–21]. The assumptions that Web search engines are trustworthy and that the counts are reliable, however, remain to be validated. There are numerous other areas that can be affected by inaccuracies in the functionality of search services. User acceptance is reduced if search services are perceived to be unreliable. Apart from obvious economic impacts, the acceptance of search services, and therefore of the Web as a medium for information distribution, may be at stake.

As a second example, consider a search on www.google.com for leakless dearer (according to the specification of Google, spaces among words imply the AND-relation). The system returns about 59 results (Figure 2(a)). How can testers know whether Google correctly counted all the relevant Web pages indexed in its database? This paper proposes that, instead of checking the correctness of each individual response, many necessary relations among multiple responses (executions) of the search engine can be identified.

Let X be a search criterion (such as the condition "Page contains string S"), let Pages(X) be the set of pages satisfying X, and let Count(X) be |Pages(X)|, that is, the number of pages satisfying X. A useful general relation can be identified as R_COUNT: if Pages(X2) ⊆ Pages(X1), then Count(X2) ≤ Count(X1). A special case is R_OR: if A2 = (A1 OR B), then Count(A1) ≤ Count(A2), which means that the number of pages satisfying condition A1 should be less than or equal to the number of pages satisfying (A1 OR B).
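To make the relation concrete, the following is a minimal sketch of how R_COUNT and R_OR could be checked automatically. The result_count() helper is a hypothetical wrapper around whichever search API is being tested, not an interface of any particular engine.

```python
# Minimal sketch of automated consistency checks on result counts.
# result_count() is a hypothetical wrapper around the search API under test;
# it must return the result count the engine reports for a query.

def result_count(query: str) -> int:
    raise NotImplementedError("wrap a concrete search API here")

def check_r_or(a1: str, b: str) -> bool:
    """R_OR: Count(A1) should not exceed Count(A1 OR B)."""
    return result_count(a1) <= result_count(f"{a1} OR {b}")

def check_r_count(narrower: str, broader: str) -> bool:
    """R_COUNT for any query pair where Pages(narrower) is a subset of
    Pages(broader), e.g. narrower = 'leakless dearer', broader = 'leakless'."""
    return result_count(narrower) <= result_count(broader)
```

A violation (either function returning False) is an anomaly in the sense used throughout this section.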

Figure 1. Anomaly in searching the ACM Digital Library.


Figure 3. Anomaly in Live Search.

engines return approximate results. However, this argument would not hold for very large differences. In fact, for all the search engines tested and all the logical consistency properties employed, a great number of anomalies with even larger differences have been detected, such as the one in Microsoft's Live Search shown in Figure 3: a query for GLIF gave 11 783 results, as shown in Figure 3(a). An immediate second query for GLIF OR 5Y4W (Figure 3(b)) gave, surprisingly, zero results (Figure 3(c)). This obviously violated the expected R_OR property. All the anomalies reported in this paper were repeatable at the time of experiment, and all the queries used correct syntax. (4) For large search engines, the result counts may be estimated from statistics or samples about the query terms. This estimate may change between result pages. The last entries in the result pages, as shown in Figures 2(c) and (d), were therefore browsed. It is interesting to compare Figure 2(a) against Figure 2(c): the former reports about 59 results, whereas the latter reports 58. Note that the word "about" in the former disappeared in the latter. This means that Figure 2(c) can be seen as the accurate result given by Google. This final count of 58 differs only slightly from the initial estimate. A similar observation is made for Figures 2(b) and (d). It is further found that, out of the 165 URLs returned by Google for leakless dearer negative, 119 did not appear in the set of 58 URLs returned for leakless dearer.

Can the violations of the consistency properties be caused by intentional design decisions rather than design flaws or software faults? For example, is it possible that such a violation was caused by an internal leniency mechanism of the search engine? That is, when the search engine finds that the number of pages containing term A is too small, it becomes more lenient when searching for pages containing both A and B, to avoid returning too few or zero results. Such an assumption is not supported by the findings made in this research, because significant inconsistencies were observed even when millions of pages matched a query. Another possible cause is that there might be a threshold defined by the ranking algorithm of the search engine: a Web page is counted only when it both contains the keywords and has a relevance score above the threshold. Therefore, it is possible that when searching for leakless dearer, only 58 pages had relevance scores above the threshold, whereas when searching for leakless dearer negative, the additional keyword negative revealed more information about the user's intention and, as a result, more than 58 Web pages received relevance scores above the threshold. If this were indeed the case, then the rate of violations of the R_AND property should be quite high. However, as will be shown shortly, experiments show that for all the search engines tested, their R_AND anomaly rates were quite low. Hence, it is unlikely that the anomalies were caused by such a threshold mechanism. Furthermore, if there were indeed such a threshold mechanism, users should have been informed of the mechanism rather than being misled to believe that only XXX pages were found.

The validity of the method can also be discussed from the perspectives of software verification and validation. Verification is defined in [22] as:

Verification is checking the consistency of an implementation with a specification. Here, specification and implementation are roles, not particular artifacts. For example, an overall design could play


the role of specification and a more detailed design could play the role of implementation; checking whether the detailed design is consistent with the overall design would then be verification of the detailed design. Later, the same detailed design could play the role of specification with respect to source code, which would be verified against the design.

For all the search engines tested, their specifications are available online, and all these specifications state that the search engine will count/return all pages (indexed in the database) that meet the user's search criteria. The logical consistency properties used in the presented method are necessary properties identified from these specifications. Any violation of any logical consistency property, therefore, means a software failure in the verification, where software failure is defined in [23] as "external, incorrect behavior with respect to the requirements or other description of the expected behavior". It should be noted, however, that there are various other possible reasons for the failure apart from software faults. For instance, though unlikely, it is theoretically possible that the two searches involved in the violation of the logical consistency property were conducted at different servers with different databases and, hence, the two search results were not consistent. Nevertheless, this kind of lower-level design decision cannot replace the top-level specifications during the verification of the entire system.

Next, consider the problem from the perspective of software validation, which is defined in [22] as:

Assessing the degree to which a software system actually fulfills its requirements, in the sense of meeting the users' real needs, is called validation.

Note the difference between verification and validation [22]:

In every case, though, verification is a check of consistency between two descriptions, in contrast to validation, which compares a description (whether a requirements specification, a design, or a running system) against actual needs.

The logical consistency properties are desirable properties that reflect users' actual needs, as it is natural for users to expect that a good search engine should behave in a logically consistent way. Therefore, if a user realizes that responses from a search service are often logically inconsistent, then the user's perceived quality of the search service can be affected in a negative way. The logical consistency properties are thus a measure of users' perceived quality of search.||

Another question is whether some query terms (such as GLIF and 5Y4W shown in Figure 3) really reflect users' real information needs. First, it must be noted that different users are interested in different kinds of information. For example, the search term 5Y4W may not make sense to most users, but it is of particular importance to some users who are looking for a product having the substring 5Y4W in its model number, or who are looking for a postal code containing 5Y4W in certain countries. Indeed, it is hard to say which query terms reflect users' real information needs and which do not. Users should have the right to expect that the search engine will work reliably for all queries, regardless of whether the query terms involve hot words. Furthermore, research in software testing reveals that faults in programs are often exposed by unexpected inputs, and this is an advantage of random testing. Therefore, debuggers should check the system after serious violations of the logical consistency properties are detected.

To further investigate the relationship between users' information needs and the logical consistency of search engines, the AOL search data (released on the AOL Research Web site research.aol.com in 2006 but no longer online) were analysed. The search data contain Web search queries and user IDs. The data set contains a total of 39 389 567 queries collected from 630 554 users over a period of three months. It is found that a total of 266 142 (42.21%) of the users issued strongly logically connected (AND-related) queries, such as a query for A followed by another query for A AND B. In fact, 6 263 944 queries (that is, 15.90% of all queries) were involved in such a search pattern alone.

|| The first author wishes to thank Microsoft researchers for telling him that the following logical consistency property is also used within Microsoft to check the ranking robustness of Web search services: a robust search service should return similar results for similar queries. For instance, although a search for today's movies in Redmond and a search for Redmond movies today may return different results, the two result sets should share a large intersection if the search service is robust.


Given that those 15.90% of queries were issued by 42.21% of the users, the percentage of AND-related queries among the queries issued by these 42.21% of users was much higher, namely 44.56%. This finding indicates that logically consistent search results are highly desired by many users.
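As an illustration of this kind of query-log analysis, the sketch below counts users who issue a query and later an AND-related refinement of it (all original terms plus at least one more). The (user_id, query) record format here is an assumption made for the example, not the actual AOL data schema.

```python
# Minimal sketch: find users who issued a query A and, later, a query that is
# a strict superset of A's terms (an AND-related refinement).
from collections import defaultdict
from typing import Iterable, Tuple, Set

def and_related_users(log: Iterable[Tuple[str, str]]) -> Set[str]:
    history = defaultdict(list)          # user_id -> list of earlier term sets
    flagged = set()
    for user_id, query in log:
        terms = frozenset(query.lower().split())
        for earlier in history[user_id]:
            if earlier < terms:          # strict subset -> AND-related refinement
                flagged.add(user_id)
                break
        history[user_id].append(terms)
    return flagged

# Example: a user refines "leakless dearer" to "leakless dearer negative".
sample_log = [("u1", "leakless dearer"), ("u1", "leakless dearer negative")]
print(and_related_users(sample_log))     # {'u1'}
```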

Despite the relationship between the logical consistency of search engines and users' information needs, it should be pointed out that the focus of the testing methods presented in this paper is not on users' information needs (which involve users' subjective judgment), but on the objective assessment of search engine behaviour. It would be very interesting to know the relationship between search engine logical consistency and the conventional measures of precision and recall. However, because of the difficulties in measuring the latter on the live Web, such a study was not conducted.

    2.2. The experiments

Experiments have been conducted to test three Web search engines, namely Google, Yahoo!, and Live Search, against the logical consistency properties. The aim of the experiments was not to find the best search engine or the best properties for testing search engines. Instead, the aim was to see how some commonly used search engines compare in terms of consistency.

As mentioned in Section 2.1, the result counts returned by a search engine may vary between result pages. Normally, the count in the first result page is less accurate, and that in the last result page is the most accurate. Therefore, the original definitions of R_AND, R_OR, and R_EXCLUDE are slightly modified. R_AND is revised to R_AND^MultiPage: if A2 = (A1 AND B), then min(Count(A2)) ≤ max(Count(A1)), where min(Count(A2)) denotes the minimum of the count in the first result page and that in the last result page for query A2, and max(Count(A1)) denotes the maximum of the count in the first result page and that in the last result page for query A1. Similarly, R_OR and R_EXCLUDE are revised to R_OR^MultiPage: if A2 = (A1 OR B), then min(Count(A1)) ≤ max(Count(A2)); and R_EXCLUDE^MultiPage: if A2 = (A1 EXCLUDE B), then min(Count(A2)) ≤ max(Count(A1)).
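A minimal sketch of how the multi-page variants could be checked is given below. Each query's count is represented as a (first-page count, last-page count) pair reported by the engine under test; how those two counts are harvested from a particular search interface is left open.

```python
# Minimal sketch of the multi-page relation checks. Each argument is a
# (first_page_count, last_page_count) pair observed for the corresponding query.

def check_and_multipage(counts_a2, counts_a1) -> bool:
    """R_AND^MultiPage: A2 = (A1 AND B) implies min(Count(A2)) <= max(Count(A1))."""
    return min(counts_a2) <= max(counts_a1)

def check_or_multipage(counts_a1, counts_a2) -> bool:
    """R_OR^MultiPage: A2 = (A1 OR B) implies min(Count(A1)) <= max(Count(A2))."""
    return min(counts_a1) <= max(counts_a2)

def check_exclude_multipage(counts_a2, counts_a1) -> bool:
    """R_EXCLUDE^MultiPage: A2 = (A1 EXCLUDE B) implies min(Count(A2)) <= max(Count(A1))."""
    return min(counts_a2) <= max(counts_a1)

# Example with the counts reported for "leakless dearer" in Figure 2 (59 on the
# first result page, 58 on the last) and hypothetical counts for "leakless":
print(check_and_multipage((59, 58), (1200, 1150)))  # True if no anomaly
```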

Three series of experiments were conducted to test the search engines against the properties R_AND^MultiPage, R_OR^MultiPage, and R_EXCLUDE^MultiPage. In the first series, each search keyword was randomly selected from an English dictionary that contains 80 368 entries. In the second series, each search keyword was a randomly generated string containing a combination of any three characters from the set {A, B, ..., Z, a, b, ..., z, 0, 1, ..., 9}. In both series of experiments, words that are too common (such as "is" and "the") were excluded because they are often ignored by Web search engines. In the third series, each search keyword was randomly selected from a hot query set that contains the 1000 latest hot queries obtained from Google Zeitgeist, Yahoo! Buzz, and MSN Insider (accessible via http://www.google.com/zeitgeist, http://buzz.yahoo.com, and http://www.imagine-msn.com/insider, respectively).

To obtain the most accurate results, quotation marks were used to enclose the keywords in each search, because quotation marks mean exact-word search in all three engines. It should also be noted that, for the same reason, it is important to disable all filters.

In each series of experiments, for each search engine under test, and for each of the three logical consistency properties, 200 tests were run, and each test involved two searches. The experimental results are summarized in Table I. First, consider the column English Dictionary. In this series of experiments, no AND-anomaly was detected for Google and Yahoo!, but Live Search had an AND-anomaly rate of 1.5%. Note that the anomaly shown in Figure 2 is not captured by these experiments because A1 and B both contain only one English word, whereas in Figure 2, A1 contains two English words, namely leakless and dearer. (Readers who are interested in AND-relations involving more than two keywords may refer to Zhou et al. [24] for experimental results.) It can be noted that different consistency properties produced very different anomaly rates, and the anomaly rate for R_EXCLUDE^MultiPage was the highest for all search engines, namely 4.5% for Google and Yahoo!, and 37.5% for Live Search. All anomalies reported in this paper were repeatable at the time of experiment.


Table I. Anomaly rates on result counts (unit: %).

                 English dictionary        Random strings           Hot queries
                 AND    OR     EXCLUDE     AND    OR     EXCLUDE    AND    OR     EXCLUDE
Google           0.0    0.0    4.5         0.0    0.5    13.5       0.0    11.0   15.5
Live Search      1.5    15.0   37.5        0.5    5.0    33.0       0.0    5.0    14.0
Yahoo!           0.0    0.5    4.5         0.0    0.0    0.5        0.0    0.0    6.0

A look at the results for Random Strings in Table I reveals a notable phenomenon: many of the anomaly rates changed significantly as compared with the first series of experiments. Google's OR-anomaly rate increased from 0 to 0.5%, and its EXCLUDE-anomaly rate increased from 4.5% to 13.5%. On the other hand, the anomaly rates of Live Search and Yahoo! decreased. When considering the column Hot Queries in Table I, Google produced its highest anomaly rates of the three series. Live Search, on the contrary, reached its best performance. Yahoo!, however, yielded its highest EXCLUDE-anomaly rate. Overall, Yahoo! produced the most consistent search results.

Such observations may also help to improve users' perceived quality of search by providing hints on where flaws may or may not exist in the search system. Table I shows that, for each search engine tested, its anomaly rates for different logical consistency properties were significantly different. This suggests that the anomalies may be due to different causes or deficiencies in the system. Furthermore, consider each individual search engine. For Google, its anomaly rates increased significantly from English Dictionary to Random Strings. When compared with dictionary words, random strings can be assumed to occur less frequently as search terms, and they do not have much semantic meaning. By looking at Google's anomaly rates in Table I, however, it is clear that such properties are not the cause of the anomalies, since Google reached its highest anomaly rates on the Hot Queries. This analysis also applies to the experimental results for Live Search and Yahoo!. In the absence of more information on the design of the search engines, it is not possible to find the causes of the anomalies. The above analysis, however, can provide a useful metric for search service providers to debug or tune their systems.

    3. TESTING SEARCH SERVICES IN TERMS OF THE QUALITY OF RANKING

The metric proposed earlier is search-engine focused. A user-focused metric is desirable, since it can be assumed that the vast majority of general users are more concerned about what appears on the first few pages of the search results. In other words, users are not often concerned about the actual number of returned pages as long as the initial results (which are often limited to the top 10 results) displayed on screen match their expectations. The first few results are therefore more critical to the overall perceived quality of ranking. Conventional methods for evaluating ranking quality can be both expensive and subjective because they entail a great amount of human judgment and manual labour. This section introduces an approach to assessing the ranking quality automatically using the concept of logical consistency properties among related responses.

Let ⟨term⟩ represent a search term submitted to a search engine, and let RS0 represent the set of results returned. RS0 can contain links to many kinds of files, such as HTML, PDF, TXT, and so on. Suppose that only the top 20 TXT files are of interest. Let RS1 represent the subset of RS0 containing the top 20 TXT files in their original order. Most search engines also allow users to search for a specific type of file directly, either through a command or through the Advanced Search option. Suppose a search engine is queried using the command ⟨term⟩ filetype:txt, which denotes searching for files of type TXT containing ⟨term⟩. Let RS2 represent the set of the top 20 TXT files returned. (See Figure 4 for an illustration of RS0, RS1, and RS2. In case of duplicate URLs, only the first one is taken and the duplicates are ignored.)

For an ideal search engine, RS1 and RS2 should be identical. That is, RS1 and RS2 should consist of the same set of files in the same order, because the same keyword ⟨term⟩ is used in both queries.


Figure 4. The ordered sets RS0 (results for ⟨term⟩), RS1 (the top TXT files within RS0), and RS2 (results for ⟨term⟩ filetype:txt).

If a search engine does not produce RS1 = RS2, then the smaller the difference between RS1 and RS2, the higher the ranking quality in terms of logical consistency, as it is natural for users to expect that good ranking algorithms should give logically consistent results.

The above discussion also applies to other file types. In total, four file types were selected for the experiments, namely TXT, PDF, HTML, and HTM files. These types of files are most common on the Internet, and all three search engines support type-specific searches for them. Furthermore, they are also representative of files with different degrees of inter-document structure: TXT files are least structured; PDF files support structure through (hyper-)links, but these are not often exploited; whereas HTML and HTM files are the most structured among the four.

Again, there are various other logical consistency properties that can be used to test the ranking quality of search services. Many of these properties can be identified by simply looking at the Advanced Search options of the search services. The aim of this research is not to find a comprehensive list of such properties or to compare their effectiveness. Instead, this research aims to propose and empirically study a new testing method. Under this methodology, many different logical consistency properties can be identified by practitioners for different kinds of search services.

    3.1. Validity of the methodology

Is it reasonable to assume that a search engine uses one ranking scheme for the query ⟨term⟩ and a different ranking scheme for the query ⟨term⟩ filetype:txt? From the user validation perspective, the filetype:txt command is only a filter according to the specifications of the search engines and, hence, should not affect the ranking of pages. Regardless of how the search engines implement the command, this is a desirable property. In other words, if users find that this property does not hold, their perceived quality of ranking can be affected. Further discussion of the relationship between this property and the users' perceived quality of ranking will be given in Section 3.4.

A search engine can rank Web pages according to many factors, which can be broadly classified as on-page and off-page factors. For example, according to keyphrase-specific findings reported by Fortune Interactive [25], the main factors used by Google, MSN (Live Search), and Yahoo! to rank Web pages are: (1) In-Bound Link Quality. This is a measurement of key elements on the page containing an in-bound link which, in combination, influence the link reputation for the target of the link; this is the only factor that had the same level of relative influence across the search engines, and it happened to be the most influential in all cases. (2) In-Bound Link Relevance. This is a measurement of the topic/keyphrase relevancy of the content on the page containing the in-bound link. It can be further divided into In-Bound Link Title Keyword Density and In-Bound Link Anchor Keyword Density. (3) In-Bound Link Quantity. That is, the number of links pointing to the current page. This factor is of least relative importance among the off-page factors across the board. (4) Title Keyword Density. This is an on-page factor.

Let F be a file of type ⟨T⟩ on the Web. Suppose that F is ranked using scheme S1 for the query ⟨term⟩, and ranked using scheme S2 for the query ⟨term⟩ filetype:⟨T⟩.


Figure 5. A simplified topological structure.

If S1 and S2 are different ranking schemes, then it must be because S1 and S2 adopt different off-page factors for F. To illustrate this, consider Figure 5 (left), which shows a simplified topological structure of the files indexed in a search engine's database. For ease of discussion, it is assumed that all of the files are relevant to the user's query ⟨term⟩. Among these 10 files, four are PDF files, four are HTM files, and two are DOC files. The arrows in the figure represent hyperlinks between files. All other factors being equal, b.pdf will be ranked the highest among the four PDF files because b.pdf has the highest in-degree (five incoming arrows). Consider a second search with the query ⟨term⟩ filetype:pdf. Theoretically speaking, this time the search engine may rank the PDF files using either the global view (Figure 5 (left)) or the local view (that is, the PDF-only view shown in Figure 5 (right)), since the search is now confined to PDF files only. In the former case, b.pdf should still be ranked the highest; in the latter case, however, b.pdf should be ranked last because no PDF file points to b.pdf. Although the view actually adopted by the search engines is not known to external users, the following analysis indicates that, for all three search engines under test, the global view is most likely to be adopted. First, implementing the local view is resource-consuming and inefficient. Second, the local view is not applicable to TXT files, because plain text (TXT) files do not have an inter-document structure and are normally not subject to hyperlink analysis. Next, consider what happens if ⟨T⟩ is a non-TXT file type that involves hyperlink structures, such as the PDF, HTM, and HTML file types. If the local view (Figure 5 (right)) is adopted to generate RS2 in Figure 4 (in this case the TXT files in the figure should be changed to PDF, HTM, or HTML files), then the difference between RS1 and RS2 for PDF, HTM, or HTML files must be larger than that for TXT files, because the former uses two different ranking schemes, whereas the latter uses only one. However, as will be shown in Section 3.3, for all three search engines tested, the difference between RS1 and RS2 (measured by the common line rate (CLR), to be defined shortly) for PDF, HTM, and HTML files is much smaller than that for TXT files. As a result, it can be concluded that the global view is used for all file types in all queries. Hence, the proposed logical consistency property is valid.

    3.2. Design of the experiments

A series of experiments has been conducted to test the three Web search engines against the consistency property, using the search APIs provided.

The experiments consist of two phases. In phase 1, the search engines are queried and their responses are recorded. In phase 2, the data collected in phase 1 are analysed. The following procedure describes the basic flow of a test in phase 1. For ease of presentation, only a test for TXT files is described; other file types are treated in the same way.

Step 1: Randomly select a word w from the English dictionary and query the search engine with w (without the filetype command). The response of the search engine is RS0.


Step 2: Collect the top 20 URLs that refer to TXT files from RS0. That is, build RS1 as illustrated in Figure 4. If the number of TXT files is smaller than 20, then collect all TXT files in RS0. If this number is smaller than 10, then discard RS0 and RS1, and go to step 1.**

Step 3: Query the search engine for w filetype:TXT and collect the top 20 results to build RS2. If the number of files is smaller than 20, then collect all files. If this number is smaller than 10, then discard RS0, RS1, and RS2, and go to step 1.

Step 4: If RS1 and RS2 have different sizes, then truncate the larger set by removing its tail elements, so that RS1 and RS2 have the same size. (For instance, if RS1 has 17 elements and RS2 has 20 elements, then the last three elements in RS2 will be removed.)
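A minimal sketch of steps 1–4 for one file type is given below. The search() helper is a hypothetical stand-in for whichever search API is used, and duplicate-URL handling is omitted for brevity.

```python
# Minimal sketch of the phase-1 procedure (steps 1-4) for a single test.
import random

def search(query: str) -> list:
    """Hypothetical wrapper around a search API; returns a ranked list of URLs."""
    raise NotImplementedError("wrap a concrete search API here")

def run_one_test(dictionary, filetype="txt", top_n=20, min_n=10):
    """Returns the keyword and the truncated RS1/RS2 for one test."""
    while True:
        w = random.choice(dictionary)                                        # step 1
        rs0 = search(w)
        rs1 = [u for u in rs0 if u.lower().endswith("." + filetype)][:top_n]  # step 2
        if len(rs1) < min_n:
            continue                                                         # discard and retry
        rs2 = search(f"{w} filetype:{filetype}")[:top_n]                     # step 3
        if len(rs2) < min_n:
            continue
        k = min(len(rs1), len(rs2))                                          # step 4: truncate the larger set
        return w, rs1[:k], rs2[:k]
```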

For each search engine and for each of the four file types (TXT, PDF, HTML, and HTM), 120 tests have been run. The dictionary that contains 80 368 English words was used to generate the search keywords. Again, common words such as "of" were excluded.

The data collected in phase 1 were then analysed against two measures: the CLR and the offset, as defined below.

Common Line Rate (CLR): Let RS1′ and RS2′ be the subsets of RS1 and RS2, respectively, that consist only of the common URLs of RS1 and RS2 in their original orders. Therefore, RS1′ and RS2′ have the same elements. The orders of these elements in the two sets, however, may not be the same. The CLR is defined as |RS1′| / |RS1|, that is, the ratio of the number of common URLs of RS1 and RS2 to the number of URLs in RS1. Note that |RS1| = |RS2| and |RS1′| = |RS2′|.

Offset: In addition to measuring the CLR, it is also important to know whether the elements in RS1′ and RS2′ have been ranked consistently, that is, whether they have the same order and, if not, how much they differ. The smaller the difference, the higher the ranking quality. Let RS1′ = (e1, e2, ..., ek), where the ranking of ei in RS1′ is i, i = 1, 2, ..., k, and k ≤ 20. Here, only the cases where k > 1 are considered, because the intention is to measure how differently the elements of RS1′ and RS2′ are ranked. Let position_RS2′(ei) denote the ranking of element ei in RS2′, and let offset(ei) = |i − position_RS2′(ei)|. For instance, if RS1′ = (a, b, c) and RS2′ = (c, a, b), then offset(a) = 1, offset(b) = 1, and offset(c) = 2.

For an ideal search engine and for each ei, offset(ei) should be 0, which means that all elements (URLs) are ranked consistently. For real-world search engines, the smaller the offset values, the higher the ranking quality in terms of logical consistency. Based on the notions introduced above, let the average ranking offset (ARO) be ARO = (Σ_{i=1}^{k} offset(ei)) / k, and let the maximum ranking offset (MRO) be MRO = max{offset(ei) | i ∈ [1, k]}.
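These measures can be computed directly from the two ranked URL lists; the sketch below follows the definitions just given (degenerate cases with fewer than two common URLs, which the experiments skip, simply yield 0 here).

```python
# Minimal sketch of CLR, offset, ARO, and MRO on two equally sized ranked lists.

def clr(rs1: list, rs2: list) -> float:
    """Common line rate: fraction of RS1's URLs that also appear in RS2."""
    common = set(rs1) & set(rs2)
    return len(common) / len(rs1) if rs1 else 0.0

def offsets(rs1: list, rs2: list) -> list:
    """Rank offsets of the common URLs (RS1' vs RS2'), in RS1' order."""
    common = set(rs1) & set(rs2)
    rs1p = [u for u in rs1 if u in common]        # RS1'
    rs2p = [u for u in rs2 if u in common]        # RS2'
    pos2 = {u: i + 1 for i, u in enumerate(rs2p)}
    return [abs((i + 1) - pos2[u]) for i, u in enumerate(rs1p)]

def aro(rs1: list, rs2: list) -> float:
    offs = offsets(rs1, rs2)
    return sum(offs) / len(offs) if len(offs) > 1 else 0.0

def mro(rs1: list, rs2: list) -> int:
    offs = offsets(rs1, rs2)
    return max(offs) if offs else 0

# The worked example from the text: RS1' = (a, b, c), RS2' = (c, a, b).
print(offsets(["a", "b", "c"], ["c", "a", "b"]))   # [1, 1, 2]
```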

Weighting: For the evaluation of experimental results, it is desirable to weight the elements when calculating the offset values. This is to accommodate the properties of some common ranking algorithms, and the common user perception that the topmost returned results are more significant than the later results. It is hence appropriate to differentiate pages in different ranking positions when analysing the experimental results. In order to achieve this, a weight value is added to each element (URL) indicating its level of importance in response to a particular query. A non-linear weighting scheme is used, based on a well-defined Web page ranking algorithm, PageRank. PageRank [26] is arguably the most successful of Web page ranking algorithms. The distribution of rank values becomes evident when plotting the pages by PageRank value in descending order [27]. This is shown in Figure 6. The figure shows a logarithmic plot of PageRank values for a data set containing 1.67 million pages (taken from a well-known snapshot of a portion of the Web called WT10G). It is found that a reciprocal-squared logarithmic function follows the curvature of the descending ordered PageRank values (the original values before applying the logarithm) quite closely. In other words, the distribution of PageRank values falls sharply, at a rate of approximately f(x) = 1/log²(x). This is why Figure 6 shows a logarithmic rather than the original plot. It is thus observed that a change of order among highly ranked pages has a significantly higher impact on the overall perceived quality of ranking than does a change among lowly ranked pages.
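The exact definition of the weighted offset measure (AWRO) appears on pages not reproduced above. The sketch below therefore only illustrates one plausible way to weight the offsets by a 1/log² decay; the weight function and the normalization are assumptions for illustration, not the paper's formula.

```python
# Illustrative weighted averaging of rank offsets. The weight function
# 1/log^2(rank + 1) and the normalization below are assumptions, not the
# paper's AWRO definition.
import math

def weighted_offset_average(offsets_list,
                            assumed_weight=lambda rank: 1.0 / (math.log(rank + 1) ** 2)):
    if len(offsets_list) <= 1:
        return 0.0
    weights = [assumed_weight(i + 1) for i in range(len(offsets_list))]
    return sum(w * o for w, o in zip(weights, offsets_list)) / sum(weights)
```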

** Most commercial search engines impose some limitations on the use of their search APIs. For instance, Google imposes a quota of 1000 queries per day. For each query, only the top 1000 results can be viewed. Therefore, it is sometimes not possible to collect 20 TXT files out of the top 1000 pages, but the procedure ensures that at least 10 TXT files are collected.


Figure 7. Average AWRO (above) and ARO (below) for files of type TXT, plotted against the number of tests for Google, Live Search, and Yahoo!.

in k ≤ 1 were skipped. This is explained previously in the paper. In the end, there were 116 valid tests for Google, 114 for Yahoo!, and 118 for Live Search.) It can be seen that, as the number of tests increases, all the curves become more and more stable. Fluctuations at 60 tests are already very small. This illustrates that a sample size of 120 has yielded representative performance indicators for all three search engines. Similar patterns have been observed in all the other experiments.

Tables II and III summarize the results of the 120 tests based on the common line rate and the offset, respectively. Consider Table II first. A higher CLR value means a higher ranking quality in terms of logical consistency. For Google, its ranking quality improved significantly when the file type changed from TXT to PDF, HTM, and HTML. This can be seen not only from the average figures (increased from 41% to 78%) and the minimum figures (increased from 5% to 20%), but also from the standard deviations (decreased from 25% to 17%, indicating more stable performance). For Yahoo!, its average and minimum CLRs and standard deviations for TXT and PDF files were at a similar level, but improved dramatically for HTM and HTML files. Live Search also showed a similar pattern, except for the standard deviations. Among the four file types, HTM and HTML files are the most structured, followed by PDF files, and TXT files are structureless. Therefore, the above experimental results suggest that all three search engines are sensitive to file structures (including hyperlink structures) and vulnerable to the topology of Web components and off-page factors (as the ranking of TXT files relies more on off-page factors). For all three search engines, the ranking quality for less structured files needs to be improved. Among the three search engines, Yahoo! outperformed the other two in terms of CLR.


Table IV. Using Google to search for HTML pages written in different languages.

            Avg. CLR    Avg. ARO    Avg. AWRO
English     77.8%       0.7931      0.0336
Chinese     74.7%       0.7951      0.0387
French      76.0%       0.8622      0.0410

4. EXAMPLES OF FURTHER APPLICATIONS

The proposed method is simple in concept, easy to implement, and completely automatic. In addition to serving the purpose of conventional testing and debugging, it can be used for various other purposes. A few examples of such applications are given in the following, where several interesting issues are investigated.

    4.1. A case study on language bias

This subsection checks whether there is a bias in ranking quality when searching for Web pages written in different languages. A case study was conducted with Google. In the experiment, HTML pages written in the English, Chinese, and French languages were searched. For each language, 200 tests were run. The Google search API was used, which allows the language preference to be set by invoking the function setLanguageRestricts(). The results are summarized in Table IV, which shows that Google had the best performance on all three measurements for English Web pages. The differences among the languages, however, were quite small. It can be concluded, therefore, that no language bias was found in Google.

    4.2. The impact of commercial interests on ranking

This subsection attempts to find out whether commercial interests could have an impact on the ranking quality, because many search keywords (such as travel, hotel, and laptop) have commercial/advertising value. Will the search engines change the order of pages when presenting them to users for commercial reasons? As a pilot study, 200 tests were conducted on Google. Each test used commercial query terms instead of dictionary words. As Google returns both normal search results and sponsored links, the following items were collected in each test: (1) the number of sponsored links (complete sponsored links were collected via the URL http://www.google.com/sponsoredlinks), (2) the number of commercial sites that appeared in the top 20 search results, and (3) the CLR and offset values of the search results when considering HTML pages only.

The correlations between the number of sponsored links/commercial sites and the CLR/offset values were analysed. Because of space limitations, the details of the analyses are not presented in this paper. The result indicates that there was no correlation. This means that the ranking quality/consistency of Google was not affected by commercial interests.

    4.3. Change of ranking quality with time

According to the experimental results reported in the previous sections, all of Google, Live Search, and Yahoo! yielded inconsistencies in their search results. This further study focused on whether and how the ranking quality changes over time. If changes are detected, can a model be built to describe them? To what degree is the quality of Web search engines affected by the dynamics of the Web?

It is postulated that the quality of the search results may change with time, because all Web search systems need to update their databases and caches based on certain rules. When updating is in progress, the consistency of the search results may be affected in a negative way.

A Web site is considered to be a commercial site when it is from the .com or .com.xx (such as .com.au) domain. This is, however, only a rough classification in order to automate the testing process.


Figure 8. Hourly observations of mean CLR and ARO values for Google (left), Live Search (middle), and Yahoo! (right).

Furthermore, there are many other time-related factors that may affect ranking consistency, such as network congestion, suddenly emerging searches, and so on. At different times, a client's requests might also be directed to different servers according to certain server selection rules.

A time-based experiment was then designed as follows: over a duration of four weeks, for each of the three Web search engines, 100 tests were run every hour and the hourly mean CLR and ARO values for HTML pages were recorded. In order to make a meaningful comparison, the same set of queries was reused every hour. To generate this query set, the English dictionary used in the previous experiments was again used. The experiment was designed to show whether and how the ranking consistency of Web search engines changes hourly, daily, and weekly. A large number of observations and analytical results were made; an extract of these is presented in the following.

Overall results: Figure 8 shows the hourly changes of the mean CLR and mean ARO values of Google, Live Search, and Yahoo!. The x-axis indicates the hours at which the results were collected. The y-axis indicates the hourly mean CLR or ARO values (of the 100 tests per hour). The experiment started at 10:00 a.m., 12 June 2008, Sydney time, and finished at the 647th hour, at 9:00 a.m., 9 July 2008. Because of network connection issues, the results for a few hours are missing. For Google, one day's results (14 June 2008) are missing. Where comparisons are made, these missing data are avoided. Furthermore, to avoid the influence of causes that could result in an incorrect measurement, when calculating each hourly result the two maximum and the two minimum values were eliminated (hence, 100 − 4 = 96 tests per hour contributed to the calculation). The horizontal reference line represents the overall mean value of all the hourly mean values. The vertical reference lines represent the first hour of Monday in every week.
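For clarity, the hourly aggregation just described, dropping the two largest and two smallest of the 100 per-hour values before averaging, can be sketched as follows.

```python
# Minimal sketch of the trimmed hourly mean: remove the `trim` largest and
# `trim` smallest values, then average the rest (100 values -> 96 kept).

def trimmed_hourly_mean(values, trim=2):
    if len(values) <= 2 * trim:
        raise ValueError("not enough values to trim")
    kept = sorted(values)[trim:-trim]
    return sum(kept) / len(kept)
```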

It can be observed that, even for the same set of queries, both the CLR and ARO values vary over time. For example, Google's hourly mean CLRs fluctuated within the range of 70% to 76% during the 647 hours, with an overall mean CLR of 73.5%. For ARO, the overall mean value for Google is about 0.59, and the change was within 0.35. In comparison, Live Search mostly produced a comparatively higher mean CLR than Google, with an average of 80%, where most observations were within the range of 74% to 83%. There was one significant drop on the curve, which corresponds to 18 June 2008 (a drop to 69%). This could have been caused by an unexpected surge in visitors to the search engine system, which is typically due to some newsworthy event that has just taken place; such events are called flash crowds. The curve of the Live Search hourly mean ARO presents roughly the opposite trend to that of the hourly mean CLR.


Figure 9. A typical weekly observation of hourly mean CLR for Google (left: Week 1, AU time 16–22 June 2008) and Live Search (right: Week 2, AU time 23–29 June 2008), plotted over the hours from Monday to Sunday.

The higher the CLR value, or the lower the ARO value, the better the ranking consistency. Thus, in this case, either the CLR or the ARO values can approximately indicate the general ranking consistency of Live Search.

Yahoo! yielded the best results, with the highest overall mean CLR, close to 100%. There were fewer than 10 (minor) deviations during the observation period of 647 hours. Yahoo! also produced a perfect offset result, which was consistently 0 for the HTML pages. Note that the three search engines were tested at the same time.

The comparative performance of the three search engines shown in Figure 8 agrees with that indicated in Tables II and III, despite the fact that the two sets of experiments were conducted using different query sets and on different dates, and that the results were calculated using slightly different methods (two maximum and two minimum values were excluded when calculating each mean value for Figure 8).

Weekly results: For the weekly results, only the results of Google and Live Search are displayed, since Yahoo! produced near-perfect results during the observation period and, hence, does not require further analytical discussion. Figure 9 (left) shows a typical week's results for Google. Similar to the previous plots, the y-axis represents the hourly mean CLR, and vertical dashed lines separate the days of the week. The curve of this typical week reveals few similarities between daily patterns, and it is difficult to find any regularity in them. The erratic and frequent changes of the mean CLRs indicate that the Google search system is a dynamic system.

Figure 9 (right) shows that Live Search yielded comparatively more stable CLR results than Google, as the observations produced a somewhat smoother curve. The hourly mean CLRs of the week show a decreasing trend from Monday to Wednesday. More general is the observation that the lowest values are often observed from Wednesday midnight into the early morning hours of Thursday. The results on Friday, Saturday, and Sunday were roughly consistent and changed within a relatively small range during the daytime hours. Overall, the observations (see also the daily results shown in Figure 10) reveal a pattern in which the CLR of Live Search is generally at its best and also stable during the daytime hours, whereas it can drop quite significantly at around midnight. This is presumably the result of a scheduled event at Live Search, such as a synchronization of caches.

Overall, it appears that the CLR results observed for Live Search follow an apparent daily and weekly pattern that can be used to describe and predict the changes of ranking consistency over time. In contrast, Google produced patterns that are much more dynamic and change very frequently. Among the three search engines, Yahoo! yielded an almost continuously perfect ranking consistency during the observation period. This demonstrates a remarkable advantage of Yahoo! in terms of user-centric reliability of ranking. The observed differences among the results of the three search engines can be explained by their different focuses.


Figure 10. A typical daily observation for Live Search.

Google pays great attention to enlarging its Web coverage and, for this purpose, continuously runs crawlers to keep its database up to date. In addition, owing to the popularity of Google, it has to dynamically adjust its ranking algorithm to counter problems associated with spam. Some of these measures involve human intervention. In contrast, it can be stated that Yahoo! focuses on the relevancy and accuracy of results in a largely automated fashion, involving exact methods rather than approximate methods. Live Search is placed somewhat in between Google and Yahoo! in terms of consistency and, hence, exhibits methods that balance the two prior directions. However, the experimental results do not provide absolute evidence that fully supports the reasons behind the changes of ranking consistency with time for each search engine. From another point of view, even though the precise cause of the observed phenomenon may not be identified, the results do provide hints to search engine developers, who have actual knowledge of the underlying techniques. This provides a means for the developers to verify whether the search engine behaviour complies with its design goals.

4.4. Users' perceived quality of Web pages

Recent work by Kc et al. [28] proposed a ranking scheme based on users' perceived quality of documents on the Web. The automation of the quality assessment procedure [28] was made possible through an algorithmic interpretation of documented cognitive processes that lead to the assessment of the perceived quality of a Web document. An interesting observation by Kc et al. is that there is no correlation between the popularity of a Web document and the quality of the same document. In other words, Web documents that are commonly hyperlinked by other pages are as likely to be of high quality as documents that are rarely or never hyperlinked by other pages.

It is interesting to know whether there is a relation between the logical consistency of a search engine and the quality of the pages that it returns to users. A set of experiments has, therefore, been carried out using the procedure described in Section 3.2 for HTML pages. For each of the three search engines, namely Google, Live Search, and Yahoo!, 1000 tests were run. Query keywords were generated from the English dictionary. In each test, in addition to calculating the CLR and offset values, all the HTML pages contained in RS1 and RS2 were downloaded, and the quality score of each of these HTML pages was calculated using the method described in Kc et al. [28]. A higher quality score indicates a higher quality of the Web page. The calculation was based on the following quality indicators [28]: spelling accuracy, grammatical correctness, explicit indication of authorship, existence of references, non-spam probability, correctness of content, timeliness, the probability of the Web page's content being biased, and whether the document is too short to contain sufficient information.

For test i (i = 1, 2, ..., 1000), the total quality score and the mean quality score of all the HTML pages contained in RS1 were calculated, denoted by total(QS1^i) and mean(QS1^i), respectively. The total and mean quality scores for RS2 were also calculated, denoted by total(QS2^i) and mean(QS2^i), respectively.
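A minimal sketch of this per-test aggregation is shown below. The per-page scorer, based on Kc et al. [28], is treated as a given black box; quality_score() is a hypothetical stand-in for it.

```python
# Minimal sketch of the per-test quality aggregation. quality_score() is a
# hypothetical stand-in for the page-quality scoring method of Kc et al. [28].

def quality_score(url: str) -> float:
    raise NotImplementedError("plug in the page-quality scoring method here")

def test_quality(rs1: list, rs2: list):
    """Return (total(QS1^i), mean(QS1^i), total(QS2^i), mean(QS2^i)) for one test."""
    qs1 = [quality_score(u) for u in rs1]
    qs2 = [quality_score(u) for u in rs2]
    return sum(qs1), sum(qs1) / len(qs1), sum(qs2), sum(qs2) / len(qs2)

def overall_means(per_test_means_1, per_test_means_2):
    """Mean QS1 and Mean QS2 over all tests (1000 in the experiments)."""
    return (sum(per_test_means_1) / len(per_test_means_1),
            sum(per_test_means_2) / len(per_test_means_2))
```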

The results of the 1000 tests are summarized in Table V, where Mean QS1 = (Σ_{i=1}^{1000} mean(QS1^i)) / 1000 and Mean QS2 = (Σ_{i=1}^{1000} mean(QS2^i)) / 1000. The comparative performance of the three search engines in terms of the mean CLR, ARO, and AWRO for HTML pages agrees with the


even except that there are a few far apart points below 2. This means that in about 50% of the situations where a ranking inconsistency in terms of CLR occurred, the algorithm that generated RS1 outperformed the algorithm that generated RS2 in terms of page quality, and vice versa. This indicates that there was not much difference in the way that the page quality indicators were treated by these two algorithms. A similar observation is also made for Figure 11 (lower left). Hence, it can be concluded that the treatment of the page quality indicators in the ranking algorithms was not a cause for the ranking inconsistencies measured by CLR and ARO. Similar observations are also made for Live Search. Figure 11 (upper right) shows the Yahoo! results, where only three points can be seen. This is because 998 points overlap at CLR = 1 and diff = 0. For the other two points, although both of their diff values are smaller than 0, one cannot conclude that RS2 gives better quality pages than RS1 because the sample size is trivial. Both of these two points have a CLR value of 0 and, hence, are excluded from the calculation of ARO. This explains why in Figure 11 (lower right) all the points overlap at ARO = 0 and diff = 0.

Further analysis has also been conducted by applying the weighting scheme introduced in Section 3.2, and similar observations have been made. Therefore, it can be concluded that page quality was not a cause for the ranking inconsistencies.
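The "about 50%" argument above can be made concrete with a simple tally. The sketch below assumes, as the surrounding discussion suggests, that diff denotes the difference between the mean quality scores of RS1 and RS2 for a test and that a test is treated as inconsistent when its CLR is below 1; both assumptions are illustrative rather than taken from the original experimental scripts.

def quality_advantage_split(tests):
    """tests: one (clr, diff) pair per test, where diff is assumed to be
    mean(QS_i^1) - mean(QS_i^2) for that test.

    Among tests showing a ranking inconsistency (CLR < 1), count how often the
    RS1 side had better page quality (diff > 0) and how often the RS2 side did
    (diff < 0). A roughly even split suggests that page quality is not what is
    driving the inconsistencies."""
    inconsistent = [(clr, diff) for clr, diff in tests if clr < 1.0]
    rs1_better = sum(1 for _, diff in inconsistent if diff > 0)
    rs2_better = sum(1 for _, diff in inconsistent if diff < 0)
    return rs1_better, rs2_better, len(inconsistent)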

    5. DISCUSSIONS

The search engines considered in this paper are very large-scale distributed systems that operate within a very dynamic environment. It is well known that it can be difficult to maintain large-scale distributed systems in a consistent state at all times [29, 30]. Designers may allow a certain level of temporary inconsistency in the system in order to achieve better system performance during periods of peak system load. Such inconsistencies are transient in nature, as the system will strive to return to a consistent state when the load is eased. This paper has provided evidence that many of the observed inconsistencies cannot be explained by such system variables, hence indicating possibly faulty software features.

A main problem with inconsistency is how to measure it. In distributed system design this is always application specific, and is often based on some form of numerical deviation or order deviation [29, 30]. However, these measures do not express the impact on users' perception of the system's behaviour. This paper introduces a user-centric inconsistency measure. The proposed measure is computed based on what users actually see when interacting with a search engine.
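To illustrate the distinction, the following sketch contrasts a conventional order-deviation measure computed over full result lists with a user-centric variant that considers only the top-k results a user actually sees. It is an illustration of the idea only; the precise definitions of CLR, ARO, and AWRO are given earlier in the paper and are not repeated here.

def order_deviation(rs1, rs2):
    """Deviation in the style of distributed-systems consistency metrics:
    average absolute rank difference over the URLs common to both result lists."""
    pos2 = {url: rank for rank, url in enumerate(rs2)}
    diffs = [abs(rank - pos2[url]) for rank, url in enumerate(rs1) if url in pos2]
    return sum(diffs) / len(diffs) if diffs else None

def user_centric_inconsistency(rs1, rs2, k=10):
    """User-centric variant: the fraction of the top-k results of rs1 that do not
    appear in the top-k of rs2, i.e. a change the user would actually notice."""
    top1, top2 = set(rs1[:k]), set(rs2[:k])
    return len(top1 - top2) / len(top1) if top1 else 0.0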

    6. CONCLUSION AND FUTURE WORK

Online search services are the main interface between users and the Internet. They are of major economic and political significance. A fundamental challenge in the quality assurance of these important services is the oracle problem. Not only does the sheer volume of data on the Internet prohibit testers from verifying the search results, but it is also difficult to objectively compare the ranking quality of different search algorithms on the live Web, owing to the fact that different assessors often have very different opinions on the relevance of a Web page to a query. If the quality cannot be measured, it cannot be controlled or improved.

The first contribution of this paper is the proposal of a novel method, which employs logical consistency properties to alleviate the oracle and automation problems in testing online search services. The presented method automatically tests search services in terms of (a) the quality of returned counts and contents and (b) the ranking consistency. Hence, this method can perform user-centric as well as provider-centric tests of search services. The logical consistency properties are necessary, but not sufficient, conditions for good search services, and this is a limitation of all testing methods.
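As an illustration of the kind of logical consistency property involved, the sketch below checks one plausible relation between a query and a more restrictive follow-up query. The search_web helper and the specific relation are illustrative placeholders rather than the exact metamorphic relations evaluated in the experiments.

def check_conjunction_consistency(search_web, term_a, term_b, top_k=50):
    """search_web(query) is a hypothetical helper returning
    (estimated_result_count, list_of_result_urls) for a query.

    Consistency property checked (illustrative): every page returned for the
    restricted query "A B" also satisfies query "A", so it should appear among
    the results for "A", and the reported count for "A B" should not exceed the
    count for "A". In practice the engine truncates the result list for "A",
    which a full testing method has to account for."""
    count_a, urls_a = search_web(term_a)
    count_ab, urls_ab = search_web(f"{term_a} {term_b}")

    anomalies = []
    if count_ab > count_a:
        anomalies.append("restricted query reports a larger estimated count")
    missing = [url for url in urls_ab[:top_k] if url not in set(urls_a)]
    if missing:
        anomalies.append(f"{len(missing)} pages returned for the restricted query "
                         "are absent from the results of the broader query")
    return anomalies  # an empty list means no inconsistency was observed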

The second contribution of this paper is that significant experimental findings have been obtained. It has been shown that failures in search engines can be detected effectively and efficiently. The experiments reveal that three commonly used Web search engines, namely Google, Yahoo!, and Live Search, are not as reliable as many users would expect. It is found that Yahoo! generally outperformed the other two in both its count quality and ranking quality. It is also observed that the accuracy of any of these search engines varies over time. These variations are predictable in the


29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006). ACM Press: New York, NY, 2006; 276–283.


14. Soboroff I, Nicholas C, Cahan P. Ranking retrieval systems without relevance judgments. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001). ACM Press: New York, NY, 2001; 66–73.
15. Chen TY, Tse TH, Zhou ZQ. Semi-proving: An integrated method for program proving, testing, and debugging. IEEE Transactions on Software Engineering 2010. DOI: 10.1109/TSE.2010.23.
16. Kilgarriff A. Googleology is bad science. Computational Linguistics 2007; 33(1):147–151.
17. Grefenstette G. The WWW as a resource for example-based MT tasks. Proceedings of the 21st International Conference on Translating and the Computer. Aslib Publications: London, U.K., 1999.
18. Keller F, Lapata M. Using the Web to obtain frequencies for unseen bigrams. Computational Linguistics 2003; 29(3):459–484.
19. Lapata M, Keller F. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing 2005; 2(1):Article No. 3.
20. Nakov P, Hearst M. Search engine statistics beyond the n-gram: Application to noun compound bracketing. Proceedings of the 9th Conference on Computational Natural Language Learning (CONLL 2005). Association for Computational Linguistics: Morristown, NJ, 2005; 17–24.
21. Turney PD. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the 12th European Conference on Machine Learning (EMCL 2001). Springer: London, U.K., 2001; 491–502.
22. Pezzè M, Young M. Software Testing and Analysis: Process, Principles, and Techniques. Wiley: New York, NY, 2008.
23. Ammann P, Offutt J. Introduction to Software Testing. Cambridge University Press: New York, NY, 2008.
24. Zhou ZQ, Tse TH, Kuo F-C, Chen TY. Automated functional testing of Web search engines in the absence of an oracle. Technical Report TR-2007-06, Department of Computer Science, The University of Hong Kong: Pokfulam, Hong Kong, 2007.
25. Available at: http://www.fortuneinteractive.com/laptop.php [18 April 2009].
26. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the Web. Stanford InfoLab Publication 1999-66. Stanford University: Palo Alto, CA, 1999. Available at: http://dbpubs.stanford.edu/pub/1999-66 [22 May 2010].
27. Tsoi AC, Morini G, Scarselli F, Hagenbuchner M, Maggini M. Adaptive ranking of Web pages. Proceedings of the 12th International Conference on World Wide Web (WWW 2003). ACM Press: New York, NY, 2003; 356–365.
28. Kc M, Hagenbuchner M, Tsoi AC. Quality information retrieval for the World Wide Web. 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2008). IEEE Computer Society Press: Los Alamitos, CA, 2008; 655–661.
29. Yu H, Vahdat A. The costs and limits of availability for replicated services. Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01). ACM Press: New York, NY, 2001; 29–42.
30. Yu H, Vahdat A. Minimal replication cost for availability. Proceedings of the 21st Annual Symposium on Principles of Distributed Computing (PODC '02). ACM Press: New York, NY, 2002; 98–107.
31. Tsoi AC, Hagenbuchner M, Chau R, Lee V. Unsupervised and supervised learning of graph domains. Innovations in Neural Information Paradigms and Applications (Studies in Computational Intelligence, vol. 247). Springer: Berlin, Germany, 2009; 43–65.
32. Yong SL, Hagenbuchner M, Tsoi AC. Ranking Web pages using machine learning approaches. Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2008). IEEE Computer Society Press: Los Alamitos, CA, 2008; 677–680.
