
Test Different Ranking Functions at Terabyte 2006

Zhiyuan Liu
New York University
Email: [email protected]

    ABSTRACT

The ranking function is an essential part of a web search engine: it plays a central role in the retrieval process and largely determines the accuracy of the results and the user experience. In this work I tested different ranking functions, both term-based and link-based, belonging to different models (the Lucene baseline, BM25, PageRank, HITS, etc.), as well as their combinations, with the goal of achieving higher performance on the TREC GOV2 dataset. Several modifications intended to improve these ranking functions are also applied. I used the Lucene 2.4.0 and Lucene 3.0.2 application programming interfaces to run the searches and measured performance with trec_eval, the evaluation tool provided by NIST. The queries come from the TREC Terabyte Track 2004-2006; 150 topics are provided in total and each was tested. The measurements are reported as MAP, P@N, and related metrics, which reveal both the average performance over all queries and the performance on a specific query or a small subset of queries, where new observations can emerge. This paper reports my experiments and results, describes how the relevant components were combined or modified and how those changes influence the ranking of the results, and draws conclusions from the experimental data, including some curious behavior observed during the experiments. Future work and possible revisions are discussed at the end of the paper.

    1. INTRODUCTION

The primary goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. Terabyte 2006 is the third year of the track, which was introduced as part of TREC 2004 with a single ad hoc retrieval task. The track uses a collection of web data crawled from websites in the .gov domain during early 2004. This collection ("GOV2") contains a large proportion of the crawlable pages in .gov, including HTML and text files, plus the extracted text of PDF, Word,

and PostScript files. The collection is 426 GB in size and contains about 25 million documents. Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java; it is suitable for nearly any application that requires full-text search, especially cross-platform ones. I used Lucene to test ranking functions on GOV2 with the Terabyte 2006 queries in the ad hoc automatic run. I tested two versions of Lucene, 2.4.0 and 3.0.2, a relatively old version and the latest version at the time, respectively. They have different index structures and, because of that, quite different performance with some of the same ranking functions; the details are discussed later. Since the main objective is to test different ranking functions rather than different search engines or TREC tracks, I focus only on the changes caused by different ranking functions, only on Terabyte 2006, and only on ad hoc automatic runs. The system used for this project is listed in Table 1.1 below:

CPU | Intel Xeon 2.27 GHz x 16
Bus speed | 1600 MHz
Operating system | Linux 2.6.32 (Ubuntu 4.4.3)
Work volume | 5.2 TB
Memory | 72 GB
JDK | OpenJDK (IcedTea6 1.8)

Table 1.1

Some inevitable issues that arose throughout the project, such as how to integrate BM25 with Lucene, how to combine PageRank with Lucene, and how to combine HITS with Lucene, are described in the relevant sections. Where there were alternatives, I compare them and explain which one I chose and why. The results are reported as tables and figures that allow a clear comparison among the different ranking functions, and I note meaningful issues found during testing next to the corresponding table or figure. This paper is organized in the order of the workflow used to accomplish the task. I first discuss how to read the data in and how to build the web graph, one of the most important prerequisites for the link-based ranking functions. Next I discuss building the index, searching, and testing the different ranking functions. Finally, I report the experimental results and conclusions, followed by some discussion and outlook.

2. READING GOV2 AND BUILDING THE WEB GRAPH

The first task in this work is, of course, reading the data from the dataset and extracting the information needed by the following tasks, and also building a web graph if link-based ranking is to be applied.

    2.1 About GOV2

The GOV2 dataset I used is stored in 273 directories, each holding 100 gzipped bundle files, with each bundle containing roughly 1,000 documents, for a total of 25,205,179 documents. It occupies 80.7 GB in gzipped bundles and 426 GB as plain text. The average document size is 17.7 KB.

    2.2 Anatomy of GOV2 documents

1. <DOC>: the start of one document.
2. <DOCNO>GX000-00-0000000</DOCNO>: the official document number, which identifies the specific document and is used in the TREC relevance evaluation.
3. <DOCHDR> ... </DOCHDR>: the header of the document, for example

   http://sgra.jpl.nasa.gov
   HTTP/1.1 200 OK
   Date: Tue, 09 Dec 2003 21:21:33 GMT
   Server: Apache/1.3.27 (UNIX)
   Last-Modified: Tue, 26 Mar 2002 19:24:25 GMT
   ETag: "6361e-266-3ca0cae9"
   Accept-Ranges: bytes
   Content-Length: 614
   Connection: close
   Content-Type: text/html

   It contains information such as Date, which could be stored in the index if necessary, and Content-Length, which could be used to compute the length of the document. In my work, however, I use Lucene's setSimilarity() method to plug in a small program segment that counts the number of terms in each document.
4. The HTML part of the document (in this example a page titled "JPL Sgra Web Site"): this is the core part, the part I need to extract and index, and of course the part that the end user of a web search engine actually searches.

    2.3 Reading GOV2

I read GOV2 twice. In the first pass I extract the URL of each HTML document; in the second pass I extract the plain text of the HTML and build the web graph. Starting with the first pass: because the GOV2 dataset is stored as gzip files, I use Java's GZIPInputStream together with a FileInputStream, and because the documents are organized line by line, I wrap the stream in a BufferedReader and read line by line with readLine(). Whenever a "<DOCHDR>" tag is met, the next line is read and trim() is applied; I also check whether the URL ends with "/", because some document URLs end with a slash and some do not, so they must be normalized. I created a static HashMap to store the document URLs; after trimming, the URL is put into the HashMap, and I update docnum, the serial number of the document, which is also used as the value in the HashMap. Then the second pass over the dataset begins. When the <DOCNO> tag is met, the document number is recorded. When the "<DOCHDR>" tag is met, the URL of the web page, located on the line just below "<DOCHDR>", is recorded. This URL is a very important variable; I call it currentdoc and use it in many places later.
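As a concrete illustration of the first pass, a minimal sketch of scanning one gzipped bundle line by line and collecting each document's URL; the file handling and the map name urlToId are illustrative, not the original code:

import java.io.*;
import java.util.HashMap;
import java.util.zip.GZIPInputStream;

public class FirstPassReader {
    // Maps a normalized document URL to its serial number (docnum in the text).
    static HashMap<String, Integer> urlToId = new HashMap<String, Integer>();
    static int docnum = 0;

    static void readBundle(File gz) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gz))));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("<DOCHDR>")) {
                // The document URL is on the line right after <DOCHDR>.
                String url = in.readLine().trim();
                // Normalize: drop a trailing slash so equivalent URLs compare equal.
                if (url.endsWith("/")) url = url.substring(0, url.length() - 1);
                urlToId.put(url, docnum++);
            }
        }
        in.close();
    }
}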

Anything other than the "</DOC>" tag, which I discuss next, is simply appended to a StringBuffer that holds the whole content of the HTML page until "</DOC>" is met. Meeting "</DOC>" is the key moment: it marks the end of the document, so at that point I can start building the web graph, described later. That is also why I keep the whole HTML page in a StringBuffer: not only will the page be indexed later, but I also parse it to extract the useful links, as discussed in the Building Web Graph section. Here I first discuss extracting the text from the HTML page. I use the HTML Parser API to extract the plain text of the body and also the title of the page, which is indexed as a separate field in Lucene in part of my experiments. The extraction quality is not entirely satisfying, though; I return to this in a moment.

2.4 Building Web Graph

Building the web graph is a crucial part of realizing the link-based ranking functions. First, let me clarify the data structures I adopted for this task. I use a HashMap to store the URLs of the documents, which is used to check whether a link points to a document inside the dataset, and an adjacency structure to store the web graph itself. The adjacency list consists of an array of HashSets, since a Set is unordered but does not allow duplicate elements: the index of the array serves as the source node, and each HashSet holds that node's out-links. Because there are too many kinds of tags in HTML, it is not a good idea to try to extract URLs by identifying every type of link; doing it that way would require handling hundreds of regular expressions. Classifying links simply as absolute or relative URL addresses takes more time, but its accuracy and completeness are something the former approach cannot match. It can be seen as trading time for accuracy.

2.4.1 Absolute URL Address

Processing again starts when the end-of-document tag is reached. I use a regular expression to extract every URL-like link regardless of its kind and sort the links out afterwards. This guarantees that no link is missed; it causes some overhead, but it is necessary because of the complexity of web content. I first test the suffix of any link matching the regular expression and filter out anything ending with .jpg, .gif, .pdf, and so on. Then I check whether it starts with http:// or HTTP:// or even Http://, which shows that web content is not case sensitive. I also normalize the URL so that it has the same format as the currentdoc values stored during the first pass. Then I call a crucial method named searchinhm() with the parsed url as its argument. This method checks whether the url is in the HashMap built earlier and whether it equals currentdoc. If it is currentdoc, I ignore it, because a document's link to itself should not be put into its own web graph entry. If it is not currentdoc, I put it into the HashSet defined before.
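A rough sketch of this kind of extraction; the regular expression and helper names are illustrative, not the exact ones used in the experiments:

import java.util.HashMap;
import java.util.HashSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Grab anything that looks like an absolute http URL inside the page.
    private static final Pattern ABS_URL =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    static void extractAbsolute(String html, String currentdoc,
                                HashMap<String, Integer> urlToId,
                                HashSet<Integer> outLinks) {
        Matcher m = ABS_URL.matcher(html);
        while (m.find()) {
            String url = m.group();
            String lower = url.toLowerCase();
            // Skip non-page resources such as images and PDFs.
            if (lower.endsWith(".jpg") || lower.endsWith(".gif") || lower.endsWith(".pdf")) continue;
            // Normalize to the same form as the stored document URLs.
            if (url.endsWith("/")) url = url.substring(0, url.length() - 1);
            if (url.equals(currentdoc)) continue;      // no self links
            Integer target = urlToId.get(url);         // the lookup done by searchinhm() in the text
            if (target != null) outLinks.add(target);  // add an edge to the web graph
        }
    }
}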

    2.4.2 Relative URL Address

The above handles absolute URLs, but quite a lot of URLs appear in the documents only as relative URLs. We cannot ignore them, because they have a strong influence on the web graph. I tested both situations, one with relative URLs and one without them, and

there is a huge difference between the two. The relative URL addresses are also very messy: some start with ../, some with ./, some sit inside JavaScript, some are just a page name without any slash, and some are empty, which causes a NullPointerException. So the relative URL addresses must be treated very carefully. I classify them by their prefix into four kinds: ../, ./, /, and anything else. After this categorization I again use the searchinhm() method to look them up in the URL dictionary built earlier; if a link is found, the same manipulation is applied as for absolute URL addresses.
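The experiments classify relative links by their prefix by hand, as described above. Purely for illustration, java.net.URI.resolve performs an equivalent normalization against currentdoc; this is a substitute technique, not the code used in the experiments:

import java.net.URI;

public class RelativeLinks {
    // Resolve a relative href against the URL of the page that contains it.
    static String resolve(String currentdoc, String href) {
        if (href == null || href.trim().isEmpty()) return null;   // guards the empty/null cases
        try {
            URI base = new URI(currentdoc.endsWith("/") ? currentdoc : currentdoc + "/");
            String abs = base.resolve(href.trim()).toString();
            if (abs.endsWith("/")) abs = abs.substring(0, abs.length() - 1);
            return abs;                 // compare this against the URL dictionary as before
        } catch (Exception e) {
            return null;                // messy hrefs (e.g. fragments inside JavaScript)
        }
    }
}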

    3 INDEXING

Building the index is of course an essential part of this project, and because the dataset is relatively large, I was careful to minimize the number of times the index had to be rebuilt.

3.1 Preparation for indexing

Select the dataset directory and the index directory. I use the abstract class FSDirectory to open the index directory, with the default factory method. The most important step is creating an IndexWriter. IndexWriter has many constructors; I pass in the Directory created before and a StandardAnalyzer, which will analyze the plain text and, later, the query. The same analyzer must be used in both places or the search engine will not work correctly. The third parameter is a boolean indicating whether to overwrite an existing index; here I chose true. The last parameter is MaxFieldLength, which limits the length of a field; here I chose UNLIMITED, because some documents in GOV2 contain very long text.

3.2 Importing the data from HTML Parser into Lucene

I store the plain text of the web page in a StringBuffer, so I create a StringReader to read its contents and use this Reader to build a field called contents, whose name is self-explanatory. This is not the usual way of building a Field; it is a special constructor.

3.3 Different Fields

I need to differentiate the different contents using Fields. There are four or five fields in my project (depending on whether the title is included): docname, which is the

URL of the document; contents, which is the plain text of the HTML; docID, a serial number I defined to identify the document in some cases (note that this docID is not the internal document id defined by Lucene); and docno, the official document number provided by TREC GOV2, which is used for the relevance evaluation later. The title is a tough decision: intuitively the influence of the title should be boosted, but according to the experiments it is better to retrieve from one field than two. Here I only note that the title field stores the title of the HTML page; I return to the question of the number of fields when discussing search performance.

There are quite a few constructors for Field; I describe the ones I used. The first argument is the name of the Field, which all the constructors share. The second is the content you want to store, which can be a String, a Reader, or a variable of one of those types. The most interesting case in my work is the Reader, a StringReader reading from a StringBuffer. The constructor I used here takes only these two parameters; it creates a tokenized and indexed field that is not stored, and term vectors are not stored. The Reader is only consumed when the Document is added to the index, i.e. you may not close the Reader until IndexWriter.addDocument(Document) has been called. The other two common options are as follows. Field.Store specifies whether and how a field should be stored: for docname, docID, and docno I chose YES because I use them later for multiple tasks; for title (if used) I chose NO; for contents the Reader constructor implies NO. Field.Index specifies whether and how a field should be indexed: for docname, docID, and docno I chose NOT_ANALYZED_NO_NORMS, and for title (if there is such a field) and contents the option is ANALYZED_NO_NORMS (not this option when I use Sweet Spot Similarity or set a boost on some field, etc.). I will discuss this when I get there.

3.4 Adding Fields and Documents

After the work described above, it is time to add the Fields to a Document and add the Document to the index. First I create a Document to hold the Fields and call its add() method to add them. Then I call IndexWriter's addDocument() method to add the Document to the index. After that, everything is cleaned up: the StringBuffer, the StringReader, and whatever else will be needed for the next document. Do not forget to close the BufferedReader, too.
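A condensed sketch of the indexing flow from Sections 3.1 to 3.4, written against the Lucene 3.0 API described above; the index path and the example field values are placeholders, and the loop over documents is omitted:

import java.io.File;
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class GovIndexer {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/index"));        // placeholder path
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                true,                                   // overwrite any existing index
                IndexWriter.MaxFieldLength.UNLIMITED);  // GOV2 contains some very long documents

        StringBuffer pageText = new StringBuffer("text extracted by HTML Parser ...");
        Document doc = new Document();
        doc.add(new Field("docname", "http://sgra.jpl.nasa.gov",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        doc.add(new Field("docno", "GX000-00-0000000",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        doc.add(new Field("docID", "0",
                Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
        // Field(String, Reader): tokenized and indexed but not stored.
        doc.add(new Field("contents", new StringReader(pageText.toString())));
        writer.addDocument(doc);

        writer.close();
    }
}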

3.5 Index-time Similarity and lengthNorm

There is an abstract class called Similarity which defines the components of

Lucene scoring. Overriding the computation of these components is a convenient way to alter Lucene scoring. If we want to use the BM25 API to change the Lucene scoring function to BM25, something must be done at indexing time: set the Similarity to the BM25 Similarity, which records the average document length (measured in number of terms). lengthNorm is also an index-time attribute of the Lucene scoring system, and it is changed during the Sweet Spot Similarity test. All of this is discussed later in the corresponding sections.

    3.6 Setting the Boost

One can use the setBoost() method to set the boost of a Document or a Field. Here, title is obviously a Field worth boosting. All Fields have a default boost of 1.0. I set the boost of the title Field to 1.25f, which gave the best retrieval performance compared with any factor larger or smaller than 1.25f.

    4 SEARCHING

After indexing comes searching, the part that belongs to the end user and the ultimate goal of a web search engine. First, a brief introduction to searching in Lucene. There are various ways to search in Lucene; here I only describe the ones used in my experiments. The first step, as in indexing, is to identify the index directory. Next I create an IndexSearcher object, which is responsible for searching the index created during indexing. Then I create a very important tool, a QueryParser, which parses the query into the desired form depending on the Analyzer selected; the important point is to keep it the same as the Analyzer used at indexing time, so I chose StandardAnalyzer again. I also choose the fields to be searched, which varies with the number of fields; I discuss this in the MultiFieldQueryParser section. The next step is creating a query, which also varies with the chosen strategy, then calling the parse() method of the QueryParser with the query string as an argument. After that, I call the search() method of IndexSearcher, passing the query and the number of results to return. The results are stored in an object called TopDocs, which has many useful attributes such as scoreDocs. I then create a Document, as mentioned above, to hold the object returned by the doc() method of IndexSearcher, and use the Document's get() method to retrieve the contents of the fields stored while building the index. The score field of each ScoreDoc gives the score of the document. After searching, the IndexSearcher must be closed.
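A minimal sketch of this search flow, again in Lucene 3.0 style; the index path, query string, and output handling are illustrative:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class GovSearcher {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher(FSDirectory.open(new File("/path/to/index")));
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));   // same analyzer as at indexing time
        Query query = parser.parse("term1 term2 term3");     // a keyword topic (placeholder terms)
        TopDocs top = searcher.search(query, 10000);          // top 10,000 results per topic
        for (ScoreDoc sd : top.scoreDocs) {
            Document d = searcher.doc(sd.doc);
            // docno is what trec_eval needs; sd.score is the ranking score.
            System.out.println(d.get("docno") + " " + sd.score);
        }
        searcher.close();
    }
}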

4.1 Boolean Query

There are two main kinds of searches we usually do: intersection and union. Let us walk through intersection first. Because a query in the Terabyte 2006 ad hoc task consists of a few terms, I used a BooleanQuery to build a query that performs the intersection search. Assume I want to retrieve from the two fields title and contents and perform an intersection search, which means each of those fields must contain all the terms the end user wants to retrieve. Below is a practical example from my experiments with three words to search. I create six TermQuery objects; a TermQuery is a query that matches documents containing a term and may be combined with other terms in a BooleanQuery. The first argument is the field to retrieve from and the second is, of course, the term:

Query query1 = new TermQuery(new Term("title", word1));
Query query2 = new TermQuery(new Term("title", word2));
Query query3 = new TermQuery(new Term("title", word3));
Query query4 = new TermQuery(new Term("contents", word1));
Query query5 = new TermQuery(new Term("contents", word2));
Query query6 = new TermQuery(new Term("contents", word3));

Next I use BooleanClause to make the intersection happen. First, I create a BooleanQuery:

BooleanQuery bquery1 = new BooleanQuery();

Then I add the queries created above into the BooleanQuery; the point here is the Occur option chosen for each BooleanClause. MUST means the term must appear, SHOULD means it may or may not appear, and MUST_NOT means it must not appear.

bquery1.add(query1, BooleanClause.Occur.MUST);
bquery1.add(query2, BooleanClause.Occur.MUST);
bquery1.add(query3, BooleanClause.Occur.MUST);

For the other field I do the same thing:

BooleanQuery bquery2 = new BooleanQuery();
bquery2.add(query4, BooleanClause.Occur.MUST);
bquery2.add(query5, BooleanClause.Occur.MUST);

bquery2.add(query6, BooleanClause.Occur.MUST);

The last step is to combine bquery1 and bquery2:

BooleanQuery bquerymain = new BooleanQuery();
bquerymain.add(bquery1, BooleanClause.Occur.MUST);
bquerymain.add(bquery2, BooleanClause.Occur.MUST);

One can tell the rules above are a little strict; they can be changed depending on our needs, and this is obviously a very flexible way to express intersection or union.

4.2 Intersection vs. Union

Next, the search performance of intersection versus union. For the Terabyte dataset and the ad hoc task, the Lucene baseline performed much better with union than with intersection: the MAP is only about 0.04 with intersection but about 0.11 with union. That is obviously a huge improvement from changing just a few BooleanClause options; the detailed numbers are given later together with the other runs in a table. The reason union does better than intersection is that the judgment standard defined by NIST is not simply "a document containing all the query words is relevant." TREC uses the following working definition of relevance: if you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant. Sometimes a relevant document does not contain all the query words and, on the other hand, an irrelevant one may contain all of them. In short, relevance does not depend only on the number of query terms a document contains. Because the dataset is rather large, only about 5,000 documents are relevant, and the relevance standard is relatively strict, intersection is simply too strict to retrieve the relevant documents.

4.3 Phrase Query

A PhraseQuery is a query that matches documents containing a particular sequence of terms; it is built by QueryParser for input like "new york" and may be combined with other terms or queries in a BooleanQuery. This kind of query can be used for phrase expansion, which expands the query to include the query text as a phrase and therefore makes it harder to find a match in the index. In my experiments, using Lucene's PhraseQuery gave quite poor results.

4.4 MultiFieldQueryParser

MultiFieldQueryParser is a QueryParser that constructs queries searching multiple fields. It can be used for title and contents at the same time. For example,

QueryParser parser = new MultiFieldQueryParser(Version.LUCENE_30, new String[]{"contents", "title"}, new StandardAnalyzer(Version.LUCENE_30));

The experiments showed that a single-field index performed better than multiple fields, so in the end I used a single field to store title and contents together. The title still has to be emphasized because of its importance, so I boosted it in another way: by duplicating the content of the title field. I duplicate it only once, which is roughly equivalent to a boost of 1.5 to 1.75; duplicating it twice performs worse.

4.5 QueryParser

This class is generated by JavaCC. Its most important method is parse(String); the syntax for query strings is a series of clauses. It had the best performance among all kinds of queries.

4.6 SpanNearQuery

SpanNearQuery matches spans that are near one another. One can specify the slop, the maximum number of intervening unmatched positions, as well as whether matches are required to be in order. It did not perform very well on GOV2, although it works reasonably well on some special kinds of text. (A short sketch of PhraseQuery and SpanNearQuery appears at the end of this section.)

4.7 Query Expansion

Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance. It is not an indexing-time action; it is done during searching. Since I focused on automatic runs and this is more related to manual runs, research on this solution is left for the future.
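For completeness, a small sketch of the PhraseQuery and SpanNearQuery variants mentioned in Sections 4.3 and 4.6; the field name, terms, and slop are illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class PhraseAndSpanExamples {
    static PhraseQuery phrase() {
        PhraseQuery pq = new PhraseQuery();             // matches "new york" as a sequence
        pq.add(new Term("contents", "new"));
        pq.add(new Term("contents", "york"));
        pq.setSlop(0);                                  // exact adjacency
        return pq;
    }

    static SpanNearQuery spanNear() {
        // Terms must occur within 5 positions of each other, in order.
        return new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("contents", "new")),
                new SpanTermQuery(new Term("contents", "york"))
        }, 5, true);
    }
}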

5 DIFFERENT RANKING FUNCTIONS

5.1 Lucene Scoring Formula

Lucene combines the Boolean model (BM) of information retrieval with the vector space model (VSM): documents "approved" by BM are scored by VSM.

score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot \sum_{t \in q} \big( tf(t,d) \cdot idf(t)^2 \cdot t.getBoost() \cdot norm(t,d) \big)

(Lucene practical scoring function)

where

    1. tf(t,d) correlates to the term's frequency, defined as the number of times term t

appears in the currently scored document d.

2. idf(t) stands for inverse document frequency. This value correlates with the inverse of docFreq (the number of documents in which the term t appears), so rarer terms contribute more to the total score. idf(t) appears for t in both the query and the document, hence it is squared in the equation. The default computation for idf(t) in DefaultSimilarity is:

idf(t) = 1 + \log\left(\frac{numDocs}{docFreq + 1}\right)

3. coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms receives a higher score than another document with fewer query terms. This is a search-time factor computed in coord(q,d) by the Similarity in effect at search time.

4. queryNorm(q) is a normalizing factor used to make scores between queries comparable. It does not affect document ranking (all ranked documents are multiplied by the same factor); it merely attempts to make scores from different queries (or even different indexes) comparable. This is a search-time factor computed by the Similarity in effect at search time. The default computation in DefaultSimilarity produces a Euclidean norm:

queryNorm(q) = \frac{1}{\sqrt{sumOfSquaredWeights}}

The sum of squared weights (of the query terms) is computed by the query's Weight object. For example, a BooleanQuery computes this value as:

sumOfSquaredWeights = q.getBoost()^2 \cdot \sum_{t \in q} \big( idf(t) \cdot t.getBoost() \big)^2

    5. norm(t,d) encapsulates a few (indexing time) boost and length factors:

    a) Document boost - set by calling doc.setBoost() before adding the document to the index.

    b) Field boost - set by calling field.setBoost() before adding the field to a document.

    c) lengthNorm(field) - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.

    When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:

norm(t,d) = doc.getBoost() \cdot lengthNorm(field) \cdot \prod_{\text{field } f \text{ in } d \text{ named as } t} f.getBoost()

5.2 Sweet Spot Similarity

The default lengthNorm computation is:

lengthNorm(L) = \frac{1}{\sqrt{L}}

Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large and larger values when numTokens is small. But perhaps this is a little unfair to long documents and too loose a standard for favoring short ones. According to my experiments, lengthNorm plays a crucial role in search performance: changing the lengthNorm formula can change the search results dramatically. I therefore used Sweet Spot Similarity, a Similarity whose lengthNorm provides a "plateau" of equally good lengths, together with its tf helper functions, to replace Lucene's DefaultSimilarity.

lengthNorm(L) = \frac{1}{\sqrt{steepness \cdot \big( |L - min| + |L - max| - (max - min) \big) + 1}}

(Sweet Spot Similarity lengthNorm)
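A minimal sketch of plugging SweetSpotSimilarity (from Lucene's contrib/misc package) into indexing and searching; the concrete parameter values are discussed just below, and the exact setLengthNormFactors signature should be treated as an assumption to check against your Lucene version:

import org.apache.lucene.misc.SweetSpotSimilarity;
import org.apache.lucene.search.Similarity;

public class SweetSpotSetup {
    static Similarity sweetSpot() {
        SweetSpotSimilarity sim = new SweetSpotSimilarity();
        // Plateau of "equally good" lengths between min and max tokens,
        // chosen from the average document length of GOV2.
        sim.setLengthNormFactors(170, 1500, 0.5f);   // assumed signature: (min, max, steepness)
        return sim;
    }
    // This must be installed both at indexing time (IndexWriter.setSimilarity) and at
    // search time (IndexSearcher.setSimilarity), since lengthNorm is baked into the index.
}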

I set steepness = 0.5, min = 170, and max = 1500 according to the average document length of the GOV2 dataset. Different values lead to very different results here, so it is important to choose proper factors based on the average length of your dataset; I obtained this statistic from a helper class, CSI, which extends DefaultSimilarity, and stored it on the local drive ready for use. The results show that Sweet Spot Similarity gives a very strong improvement on the same dataset, which proves once again that lengthNorm plays a very important role in search quality. Please refer to the table and figures for details.

5.3 Boost

Document boost does not make sense in this project, so I ignore it and keep its default value of 1.0f. Field boost is relevant, because the title and the body of a document should be weighted differently; it is obviously not proper for them to carry the same weight during searching. So I first tried setting a boost of 1.5f on the title field. I also found an interesting subtlety about multiple fields: the weight of the title depends on the other field, contents. The effective weight of the title becomes smaller if there are lots

of terms in the body. One cannot set the boost too large, because the title should not be rewarded too much; on the other hand, if the boost is too small it has almost no influence on the final results. After quite a few experiments the best factor turned out to be 1.25f, quite a proper value for the title boost here. However, the performance on the entire dataset was not much better than the single-field run, and after another try I found that Lucene performs best when keeping one single field and boosting the title by duplicating its content into the StringReader.

5.4 Okapi BM25

It seems that Lucene does not do as well on TREC datasets as other search engines such as Juru or Indri, and its Similarity is often blamed by search engine researchers. So I tried to replace Lucene's scoring system with BM25, a widely known and widely used term-based ranking function. Given a query Q containing keywords q1, ..., qn, the BM25 score of a document D is:

Score(D,Q) = \sum_{i=1}^{n} idf(q_i) \cdot \frac{f(q_i,D) \cdot (k_1 + 1)}{f(q_i,D) + k_1 \cdot \big(1 - b + b \cdot \frac{|D|}{avgdl}\big)}

Integrating BM25 with Lucene is not as hard as one might imagine: almost all the quantities can be obtained without additional work. I used the widely used API provided by Joaquin Perez-Iglesias [5] and made some modifications to its code. The important quantities are declared below. avgdl stands for the average document length, measured in number of terms. This is the only factor I cannot get directly from Lucene, so I extended Lucene's DefaultSimilarity with a new Similarity class I call CSI; it adds just one hook to record the number of terms per document and store it in a file on the local drive, to be read during searching. Be aware that Lucene's setSimilarity() method must be called before indexing; it is an index-time change, so it takes a long time to change anything related to lengthNorm. By the way, the average document length of the TREC GOV2 dataset is 424.0, which was also a hint for setting the Sweet Spot Similarity factors and avoiding too large a max and min in that formula.

f(q_i, D) is q_i's term frequency in the document D.

|D| is the length of the document D in terms.

k_1 and b are free parameters, usually chosen as k_1 = 2.0 and b = 0.75; both can be set by the setParameters method of that API. Different values have already been tested by other researchers with meaningful results in their projects, but I think this may vary with the dataset and the query topics.

idf(q_i) is the inverse document frequency weight of the query term q_i.
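Since avgdl is the one quantity not exposed directly, a sketch of the kind of helper Similarity described above: it piggybacks on the index-time lengthNorm callback to record the token count of each document's contents field. The file handling here is only illustrative, not the original code:

import java.io.FileWriter;
import java.io.IOException;
import org.apache.lucene.search.DefaultSimilarity;

// Behaves like DefaultSimilarity but also logs field lengths so that the
// average document length (avgdl) for BM25 can be computed afterwards.
public class CSI extends DefaultSimilarity {
    private final FileWriter lengths;

    public CSI(String lengthFile) throws IOException {
        lengths = new FileWriter(lengthFile);
    }

    @Override
    public float lengthNorm(String fieldName, int numTokens) {
        if ("contents".equals(fieldName)) {
            try {
                lengths.write(numTokens + "\n");       // one length per indexed document
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        return super.lengthNorm(fieldName, numTokens); // keep Lucene's default norm
    }
}

At indexing time this would be installed with writer.setSimilarity(new CSI(...)) before any documents are added, which is why it is an index-time change.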

I found a very interesting thing while integrating BM25 with Lucene 2.4.0 and Lucene 3.0.2: there is a huge difference in performance between the two. Lucene 2.4.0 performed much better than Lucene 3.0.2. BM25 with Lucene 2.4.0 returns much better results over Terabyte 2006, raising the MAP from about 0.16 to about 0.2, while with Lucene 3.0.2 the MAP even decreased instead of increasing. Two reasons may cause this difference:

1. Different index structures. Lucene 3.0.2 has a very different index structure from Lucene 2.4.0; one cannot even use Luke to inspect a Lucene 3.0.2 index, while one can with Lucene 2.4.0. This underlying difference can change the scoring behavior and the effect of changing the same parameter.

2. Different BM25 API. I made some modifications and recompiled the BM25 API to fit Lucene 3.0.2, adding new methods such as Advanced(), and the different methods can add to or change the computation of the document score.

In any case, BM25 did better than the Lucene baseline scoring formula, because avgdl is considered in BM25 and this parameter refines the length normalization; once again, lengthNorm is very important to the whole computation. BM25 also has several auxiliary parameters that prevent over-weighting some factors. Another advantage of BM25 is speed: it finished searching about 10% faster than the Lucene scoring system. Please refer to the table and figures for details.

5.5 Link-based Ranking Functions

Link-based ranking functions are widely used by the major search engines, and how to combine term-based ranking functions with link-based ones has become a major issue in the web search engine area. It is also a major task of this project. I tested two very famous ranking functions, PageRank and HITS, with Lucene on GOV2 over the Terabyte 2006 query topics.

    5.5.1 PageRank

    PageRank is a link analysis algorithm, named after Larry Page, used by the Google Internet search engine that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose

of "measuring" its relative importance within the set. Integrating PageRank with Lucene is not easy, since you have to post-process the pages you have indexed, compute how they link together, and then update the index (or the scoring) with this information. I first give a brief overview of the integration and then details on some important points. (A) Build a web graph, as described above. (B) Compute PageRank with the formula below:

PR(A) = \frac{1-d}{N} + d \cdot \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} \right)

d: the damping factor, set to 0.85 here. N: the total number of documents. This is obviously an iterative computation; in practice I ran 30 iterations. For the details of PageRank itself, please refer to the relevant literature; I will not repeat them here. (C) Store the PageRank values on the local drive. (D) Read the PageRank values from disk and use them in the TermScorer class to change the order of the returned documents.

Next I will discuss four issues.

How are the PageRank values computed from the web graph generated before? I use an array of HashSets to store the web graph and an iterator to walk each set. The size of a set gives the out-degree of that node, and hasNext() is used to visit each out-link and add the current node's contribution to the linked documents. A code segment is provided below:

int outd = al[i].size();                 // out-degree of document i
Iterator<Integer> it = al[i].iterator(); // al is the HashSet<Integer>[] web graph
while (it.hasNext()) {
    pr[it.next()] += df * pr[i] / outd;  // df is the damping factor d
}
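For context, a sketch of the complete iteration over the adjacency-set web graph as I read the description above; the two-array update and the handling of dangling pages are my own simplifications, not the original code:

import java.util.Arrays;
import java.util.HashSet;

public class PageRankComputer {
    // al[i] holds the internal ids of the pages that document i links to.
    static double[] compute(HashSet<Integer>[] al, double d, int iterations) {
        int n = al.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                        // uniform start

        for (int iter = 0; iter < iterations; iter++) {  // 30 iterations in the experiments
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - d) / n);            // teleport term (1 - d) / N
            for (int i = 0; i < n; i++) {
                int outd = al[i].size();
                if (outd == 0) continue;                 // dangling page: no out-links to spread
                for (int target : al[i]) {
                    next[target] += d * pr[i] / outd;    // spread rank along out-links
                }
            }
            pr = next;
        }
        return pr;                                       // written to disk afterwards
    }
}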

How is this score combined with the original term-based score? After computing the PageRank values, the next question is how to store and use them, and the two are directly related: how to store them depends on how they will be used. There are two basic ways to store the PageRank values. The first is to store them in payloads, a newer feature of Lucene. This choice is tied to when PageRank is computed: the computation happens at indexing time, after the web graph has been drawn, so every time a PageRank value is obtained it could be stored in the payload of a term. The problem is that a payload is attached to every position, so deciding which position should hold the value is a tough question; one could store it at an arbitrary position, or in the payloads of every position, but that is clearly not a good idea because it wastes disk space. The second way is to store the values in a separate array on the local disk. This is clearer and easier to implement, and the only downside is some extra I/O time rather than space. I chose the second way.

The next question is how to use the values. To change the scoring function, one has to change the computation inside the TermScorer class. The first step is to bypass the scoring computation in TermScorer's constructor; the second is to bring the PageRank values in from outside. There is a variable called doc in this class which represents Lucene's internal document number. This number may change whenever Lucene merges index segments, so it cannot be used as the docID mentioned before; that is exactly why I created the extra docID field and stored it when building the index. I then created another class called DocId with a method returnUrl() that returns the PageRank values, which were written to the local drive by a PrintWriter as an array. It is a static method with a static variable, so the whole array is held in memory, and each time the score() method of TermScorer is called it simply looks up the relevant PageRank value. The key point is to use the doc variable to obtain the corresponding docID that I created myself. That naturally leads to another question: how to build an accurate mapping between the internal doc number and my external docID. The solution is to call the doc() method of the IndexSearcher created before, pass the internal doc number as an argument, then call the get() method of the returned Document object to read the contents of the docID field and store it in a variable called did; thus the connection is built.

Once the PageRank value of the current document is available, the remaining question is how to use it to modify the scoring function and recompute the raw score. The original expression is getSimilarity().tf(f) * weight, and I added a sigmoid transform of the PageRank value to it. Here I set w = 1.8, k = 1, and a = 1.6.
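The exact sigmoid is not reproduced above. The sketch below assumes the saturating transform w * x^a / (k^a + x^a) used for query-independent evidence by Craswell et al. [7], which matches the parameters w, k, and a quoted here; the way it is added to the raw term score is likewise my reading of the description, not the original code:

public class PageRankBoost {
    static final double W = 1.8, K = 1.0, A = 1.6;

    // Saturating (sigmoid-like) transform of a rescaled PageRank value.
    static double sigmoid(double pr) {
        double prA = Math.pow(pr, A);
        return W * prA / (Math.pow(K, A) + prA);
    }

    // Inside a modified TermScorer.score(): raw term score plus the PageRank bonus.
    // multiplier is the rescaling factor discussed next (PageRank values are ~1e-7).
    static float combine(float rawTermScore, double pageRank, double multiplier) {
        return rawTermScore + (float) sigmoid(pageRank * multiplier);
    }
}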

The improvement after this change is not dramatic: it increased the MAP by about 0.05. I also tried to amplify the effect of the PageRank value by multiplying the original value by the fifth power of ten, because the raw PageRank values are quite small, around 10^-7. After that, the performance changes only slightly as the multiplier varies from 10^2 to 10^7. The comparison of the different factors is shown in Figure 5.5.1 below.

[Figure: MAP for different PageRank multipliers (x-axis: multiplier, y-axis: MAP)]

Figure 5.5.1

Combine BM25 and PageRank

I got an embarrassing result when I tried to combine BM25 and PageRank: BM25 performs better with Lucene 2.4.0 and PageRank performs better with Lucene 3.0.2, and the aggregate performances were almost the same.

Ignore the top-N results

The top-N results returned by the Lucene baseline may already have high PageRank values, so adding the PageRank value boosts those documents only delicately. I therefore tried ignoring the first 1000 results, i.e. not applying PageRank values to them, but the result was not very satisfying. The point, I think, is to find the proper number of documents to ignore; further research is needed here.

5.5.2 HITS

Hyperlink-Induced Topic Search (HITS), also known as hubs and authorities, is a link analysis algorithm that rates web pages, developed by Jon Kleinberg. It determines two values for a page: its authority, which estimates the value of the page's content, and its hub value, which estimates the value of its links to other pages. I combined HITS with the Lucene baseline scoring function to test the performance. Compared with PageRank, combining HITS is even harder, because Lucene scores all the matching documents and then uses a stack to output the results, so the documents with higher scores come out first; obviously one cannot search only the top

N documents to shorten the search time, yet we want only the returned results to be involved in the computation of the HITS values. Below I describe the flow and some issues together.

Build the connection and initialize hubs. HITS requires computing the values only among the returned search results, and each document has two values, hub and authority (auth), which must be computed separately; the process is a little complicated. First I create a HashMap that I call bridge, a self-explanatory name: it stores the docID as the key and an index number, ordered from 1, as the value, and the relevant docID and index number are put into the HashMap. The last step in this part is to initialize the hub values to 1/numOfDocs.

Compute auth. I create four arrays to store the relevant values: the original score, the hub values, the auth values, and the corresponding docIDs. For each returned document, I check whether it links to other documents in the result set (tracked through the structures above); if so, its hub value is added to those documents' auth values.

Compute hub. I create an iterator and call hasNext() to walk the HashSet named al, mentioned earlier in this paper, for the current document. Then I use the containsKey() method of bridge to check whether each element obtained from the HashSet exists in bridge; if it does, its auth value is added to the hub value of the current document.

Normalize. Add the hub values together into a variable called hubsum, then divide each hub value by hubsum.

Combine the hub value with the original Lucene score. At first I simply added them up and, of course, got a disappointing result. Then I transformed the hub value in the same way as in the PageRank section, which returned much better results than the former method.

Re-sort the documents. The index and the score are stored in two extra variables named topId and topScore; the only thing to be careful about is to store the corresponding official GOV2 document numbers in an array named docno beforehand, so the official document number can be recovered at this point.
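A compact sketch of the hub/authority computation over the returned result set, following the steps above; the array layout, the bridge map, and the iteration count are simplifications of the description rather than the original code:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;

public class HitsOnResults {
    // docIds: external docIDs of the returned results, in rank order.
    // al: web graph from Section 2.4 (out-links per docID).
    static double[] hubScores(int[] docIds, HashSet<Integer>[] al, int iterations) {
        int n = docIds.length;
        HashMap<Integer, Integer> bridge = new HashMap<Integer, Integer>();
        for (int i = 0; i < n; i++) bridge.put(docIds[i], i);   // docID -> position in results

        double[] hub = new double[n], auth = new double[n];
        Arrays.fill(hub, 1.0 / n);

        for (int iter = 0; iter < iterations; iter++) {
            // auth(p) = sum of hub(q) over returned documents q that link to p
            Arrays.fill(auth, 0.0);
            for (int q = 0; q < n; q++)
                for (int target : al[docIds[q]]) {
                    Integer p = bridge.get(target);
                    if (p != null) auth[p] += hub[q];
                }
            // hub(q) = sum of auth(p) over returned documents p that q links to
            double hubSum = 0;
            for (int q = 0; q < n; q++) {
                double h = 0;
                for (int target : al[docIds[q]]) {
                    Integer p = bridge.get(target);
                    if (p != null) h += auth[p];
                }
                hub[q] = h;
                hubSum += h;
            }
            if (hubSum > 0) for (int q = 0; q < n; q++) hub[q] /= hubSum;  // normalize by hubsum
        }
        return hub;   // combined with the Lucene score via the same sigmoid as for PageRank
    }
}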

    6. RESULTS

The experimental results are provided in the search quality comparison table (Table 6.1). They are the results over the Terabyte 2006 query topics of the ad hoc automatic task. Because the main goal of this project is to test different ranking functions, not differences between the Terabyte query topics from different years, I focused on Terabyte 2006 rather than all three years of topics. The top 10,000 results were returned for each topic. The table shows a comparison of the search quality of the different ranking functions and the different Lucene versions.

Run | Relevant Documents | Relevant Documents Retrieved | MAP | bpref | P@5 | P@10 | P@20
Lucene Base (Double Field) | 5893 | 2733 | 0.1101 | 0.2038 | 0.3170 | 0.2933 | 0.2899
Lucene Base (Single Field) (1) | 5893 | 2790 | 0.1212 | 0.2133 | 0.3208 | 0.3099 | 0.3021
Sweet Spot Similarity (2) | 5893 | 3122 | 0.1826 | 0.2746 | 0.4208 | 0.4020 | 0.3870
BM25 (Lucene 2.4.0) + 1 + 2 | 5893 | 2771 | 0.1990 | 0.2864 | 0.6280 | 0.5640 | 0.4530
BM25 + 1 + 2 (3) | 5893 | 2770 | 0.1612 | 0.2746 | 0.4280 | 0.4020 | 0.3870
PageRank + 1 + 2 | 5893 | 3593 | 0.1925 | 0.3262 | 0.5200 | 0.4560 | 0.4280
HITS + 1 + 2 | 5893 | 3600 | 0.2023 | 0.3266 | 0.5200 | 0.4820 | 0.4370
PageRank (sigmoid) + 1 + 2 | 5893 | 3640 | 0.2070 | 0.3269 | 0.5480 | 0.5040 | 0.4510
HITS (sigmoid) + 1 + 2 | 5893 | 3640 | 0.2065 | 0.3268 | 0.5480 | 0.5000 | 0.4520
PageRank (sigmoid) + BM25 + 1 + 2 | 5893 | 3640 | 0.2075 | 0.3401 | 0.5492 | 0.5076 | 0.4533
HITS (sigmoid) + BM25 + 1 + 2 | 5893 | 3640 | 0.2070 | 0.3275 | 0.5490 | 0.5033 | 0.4527
PageRank (sigmoid + ig1000) + 3 | 5893 | 3640 | 0.2023 | 0.3401 | 0.5492 | 0.5076 | 0.4533

Table 6.1

Search Quality Comparison Table

(Note: "Lucene" means Lucene 3.0.2 unless stated otherwise; "ig1000" means the PageRank value is not added to the first 1000 results.)

Table 6.1 reveals three interesting phenomena. Here I discuss only what can be observed directly from the table, leaving the internal explanations, such as the changes the sigmoid function makes to PageRank, for the conclusion and discussion section.

The first observation is that the metrics do not move in unison: MAP may increase while P@5 decreases, and a run can have a high P@5 with a relatively low MAP, such as BM25 (Lucene 2.4.0). That is caused by the way MAP is computed: precision is calculated after each relevant document is retrieved, all those precision values are averaged to give a single number per query, and the per-query values are then averaged over all queries. P@X, by contrast, is the fraction of relevant documents among the first X retrieved (relevant or not), averaged over all queries; if fewer than X documents were retrieved for a query, the missing documents are counted as non-relevant. In one sentence, MAP depends on the order in which the relevant documents are returned: a run may return highly relevant documents in the top 20 but do much worse after that, and then its MAP will not be high either. In my opinion, by the way, the P@ values matter more than MAP to end users, because they generally only need the top 20 results, so the relevance of the top 20 may be a more important standard for evaluating search engines.

The second observation is that the number of relevant documents retrieved does not by itself determine MAP. MAP really focuses on the order of the relevant documents your search engine returns, not only on how many are returned, and it gives large weight to the top results. Take BM25 (Lucene 2.4.0) again: it has high P@ scores, so its MAP is not very low even though the number of relevant documents retrieved is small.

The last observation is that the spread of the P@ values, or how quickly P@ decays, is related to the number of relevant documents retrieved. This is of course not an absolute relation, but widely spread P@ values suggest the performance is not very stable, and more returned relevant documents make the performance more stable. Below I give some figures comparing the different ranking functions on specific measures.
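Before those figures, a small sketch of how average precision and P@k are computed for one topic, which makes the behavior just described concrete; these are the standard trec_eval definitions, and the ranked list and judgment set are hypothetical inputs:

import java.util.List;
import java.util.Set;

public class EvalMeasures {
    // Average precision for one topic: precision is sampled at each relevant
    // document retrieved, then averaged over the total number of relevant docs.
    static double averagePrecision(List<String> rankedDocnos, Set<String> relevant) {
        double sum = 0;
        int hits = 0;
        for (int i = 0; i < rankedDocnos.size(); i++) {
            if (relevant.contains(rankedDocnos.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0 : sum / relevant.size();
    }

    // Precision after the top k documents, regardless of what is retrieved below rank k.
    static double precisionAt(int k, List<String> rankedDocnos, Set<String> relevant) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, rankedDocnos.size()); i++)
            if (relevant.contains(rankedDocnos.get(i))) hits++;
        return (double) hits / k;
    }
    // MAP is the mean of averagePrecision over all topics.
}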

[Figure: MAP by run (Lucene Base, Sweet Spot Similarity, BM25, PageRank, HITS, PageRank+BM25, HITS+BM25)]

Figure 6.1

Comparison on MAP

[Figure: P@5 by run (Lucene Base, Sweet Spot Similarity, BM25, PageRank, HITS, PageRank+BM25, HITS+BM25)]

Figure 6.2

Comparison on P@5

[Figure: P@10 by run (Lucene Base, Sweet Spot Similarity, BM25, PageRank, HITS, PageRank+BM25, HITS+BM25)]

Figure 6.3

Comparison on P@10

[Figure: P@20 by run (Lucene Base, Sweet Spot Similarity, BM25, PageRank, HITS, PageRank+BM25, HITS+BM25)]

Figure 6.4

Comparison on P@20

[Figure: Relevant documents retrieved by run (Lucene Base, Sweet Spot Similarity, BM25, PageRank, HITS, PageRank+BM25, HITS+BM25)]

Figure 6.5

Comparison on Relevant Documents

    7. CONCLUSION

I drew four major conclusions from all the testing; for the other interim conclusions, please refer to the corresponding sections.

1. PageRank (sigmoid) + BM25 + Single Field (title duplicated once) + Sweet Spot Similarity performs best, for four reasons. (1) Proper modification of Lucene's lengthNorm: the Lucene baseline lengthNorm punishes long documents too much. I want to stress again that lengthNorm is a very important factor; the ultimate goal is to give every document, long or short, a fair evaluation, without punishing or rewarding too much. (2) Lucene performs better with a single field in this task, and duplicating the title field's content exactly once gives the best results. (3) BM25 is a stronger term-based ranking function than Lucene's baseline because it treats the average document length as an important factor. (4) The sigmoid transform works better on PageRank than any other solution I tried. (Note that ignoring the top-N results returned by the Lucene baseline may help further, but a proper parameter is needed.)

2. Different versions of Lucene perform very differently with different ranking methods or combinations of them. For example, Lucene 2.4.0 works well with BM25 while Lucene 3.0.2 shows no dramatic improvement, and Lucene 3.0.2 performs better with PageRank or HITS.

3. How term-based and link-based ranking functions are combined leads to very different results. The sigmoid function applied to PageRank and ignoring the top-N results for PageRank are good examples: the change after applying the sigmoid to PageRank is dramatic, and for ignoring the top-N results, in my opinion, a proper number of documents to ignore is still needed. The same happened when testing HITS, though not as dramatically as with PageRank.

4. Index structure and conflicts. Conflicts can arise when combining different alternatives. For instance, the performance of the combination of Lucene 2.4.0, BM25, PageRank, and Sweet Spot Similarity was terrible. This shows that conflicts can occur if the alternatives are not combined, or the parameters not tuned, properly. A further conclusion is that different alternatives suit different index structures, so putting the right solutions on the right index structures is necessary to optimize the performance of a search engine.

    8. ACKNOWLEDGMENTS

I would like to thank Professor Torsten Suel for sharing his GOV2 dataset, for providing me with a high-performance machine, and for his help throughout this project.

    9. REFERENCES

    [1] Erik Hatcher, Otis Gospodnetic. Lucene in Action Second Edition. Manning Publications Co., 2010.

[2] Doron Cohen, Einat Amitay, David Carmel. Lucene and Juru at TREC 2007: 1-Million Queries Track. In Proceedings of the Sixteenth Text REtrieval Conference (TREC 2007). NIST, 2007.

[3] Trevor Strohman, Howard Turtle, W. Bruce Croft. Optimization Strategies for Complex Queries. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 219-225. ACM, 2005.

[4] Yun Zhou, W. Bruce Croft. Document Quality Models for Web Ad Hoc Retrieval. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 331-332. ACM, 2005.

[5] Joaquín Pérez-Iglesias, José R. Pérez-Agüera, Víctor Fresno, Yuval Z. Feinstein. Integrating the Probabilistic Models BM25/BM25F into Lucene. Website, 2009. http://nlp.uned.es/~jperezi/Lucene-BM25/

[6] Stefan Büttcher, Charles L. A. Clarke, Ian Soboroff. The TREC 2006 Terabyte Track. In Proceedings of the Fifteenth Text REtrieval Conference (TREC 2006). NIST, 2006.

[7] Nick Craswell, Stephen Robertson, Hugo Zaragoza, Michael Taylor. Relevance Weighting for Query Independent Evidence. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 416-423. ACM, 2005.

[8] Yuval Feinstein. BM25 Scoring for Lucene: From Academia to Industry. Apache Lucene EuroCon 2010 meetup, Prague, May 2010.

[9] Donald Metzler, Trevor Strohman, Yun Zhou, W. B. Croft. Indri at TREC 2005: Terabyte Track. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347-354. ACM, 2008.

[10] Xian Jue. Analysis of the Theory and Codes of Lucene. Website, 2010. http://forfuture1978.javaeye.com/

[11] Bob Carpenter. Phrasal Queries with LingPipe and Lucene: Ad Hoc Genomics Text Retrieval. In Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), NIST Special Publication 500-278. NIST, 2009.