27

Click here to load reader

file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

Embed Size (px)

Citation preview

Page 1: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

A Method to Improve Effectiveness of Information Retrieval using

Link Information

Yong Kim*, Won-Kyun Joo**

*Department of Library & Information Science, Chonbuk National University, 567 Barkje-daero, Deokjin-gu, Jeonju-si, Jeollabukdo, South Korea**Department of R&D System Development, KISTI, 245 Daehak-ro, Yuseong-gu, Daejeon, South Korea (Corresponding Author)

Abstract

A scheme for using various types of hyperlinks for the purpose of improving information retrieval

effectiveness. The proposed scheme is based on the distinction between incoming and outgoing links by

directionality as well as the distinction between direct and indirect links by directness of links. various types of

hyperlinks have different impact on enhancement of retrieval effectiveness and that the improvement can be as

much as 44.8% in 11-point average precision when right combinations of weights are given to the different link

types The proposed method serves as a useful reference for electronic libraries or IR system considering the

improvement of IR service.

Keywords: Boolean Model, Information Retrieval, Directionality and directness of links, Hyperlink, Role of

Link, Retrieval effectiveness.

I. INTRODUCTIONInformation retrieval has dealt with a collection of documents whose contents are predominantly texts, linear

sequence of characters, words, and sentences. With the growing popularity of World-Wide-Web (www) where

texts are inter-linked in a complex way to form a hypertext, however, it has become a common practice to

access texts in a non-sequential way by means of navigation or browsing. Few search engines on the web

employ new methods, whereas, the others do traditional information retrieval techniques that work under the

assumption that documents are independent units. Word frequencies in individual documents and in the entire

collection are the basis for determining the degree of association between a word and a document.

In this paper, we take a stance that the meaning of a document is not just restricted to the sequential text itself

but can be affected by and derived from its relationship with others. We hypothesize that when a source text (or

Page 2: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

a node in hypertext) is linked to a destination text by means of an anchor in the source and a link pointing to the

destination, they each reveal the content of the other to some extent. Besides, hypertext can enhance notions of

relevance and hence the performance of retrieval heuristics (Frisse, 1998). In most cases, hyperlinks encode a

considerable amount of latent human judgment. In fact, the creators of web pages create link to other pages,

usually with an idea in mind that the linked pages are relevant to the linking pages. Therefore, a hyperlink, if

reasonably created, reflects the human semantic judgment and this judgment is objective and independent of the

synonymy and polysemy of the words in the pages. This latent semantics, once revealed, could be used to find

deeper relationships among the pages, as well as to find the relevant pages for a given page (Hou, 2003).

The main thrust of this paper is to present a way in which this link information can be used for retrieval

purposes and show that this information in fact improves retrieval effectiveness and reveals the role of each type

of links defined in continuing sections. This paper presents a generalization and an extension of these works.

To accomplish these goals, this paper focuses on the followings.

0) Classification of Links: Links in hypertext have many different characteristics and can be associated with

many different meanings. Depending on the nature of the objects at the source and destination points,

first of all, links can connect a document with another document or a term with a document. On the other

hand, the directionality of a link can determine whether it is an outgoing or incoming link with respect to

a given text. Although not common, it is also possible to build a one-to-many link so that more than one

destination can be associated with the source. Furthermore, an indirect link between two objects can be

defined when they are pointed to by a single source, or when they point to a single destination. Finally,

an explicit meaning such as ’example-of’ or ’definition-of’, or a numerical weight can be attached along

a link to indicate its strength.

1) Investigation on the roles of link types: While the long-term goal here is to investigate the roles of

various types of links and their properties in improving retrieval effectiveness, the current work has a

focus on directionality and the distinction between direct and indirect links.

This paper gives suggestion on ways to make use of such link information in computing the retrieval status

value (RSV) or similarity value between a document and query. Results from the experiments carried out to

test the value and impact of different link information is also reported herein.

II. RELATED WORKSResearch on the usage of hypertext links in information retrieval started from the study on scientific citations.

The most well-known measure in this field is Garfield’s impact factor (Garfield, 1972), which is a ranking

measure based fundamentally on pure counting of the in-degrees of nodes in the network. Pinski and Narin

(1976) proposed a more subtle citation-based measure of standing, stemming from the observation that not all

citations are equally important.

Hypertexts allows for browsing, an effective way of acquiring information from a structured information

space and of finding unexpected items in serendipity. However, since browsing is not an efficient means with a

large search space, researchers have attempted to combine browsing and searching functions in a single retrieval

Page 3: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

environment. Lucarella (1990), for example, proposed a model for retrieval of hypertexts and its

implementation. More recently, researchers have attempted to integrate browsing and searching for hypermedia

documents. Croft and Turtle (1989) developed a method by which hypertext links are regarded as evidence in

answering a query. Savoy (1996) suggested a scheme where a vector space model was extended to include

citation links to improve overall retrieval effectiveness. Brin and Page (1998) have proposed a ranking measure

based on a node-to-node weight propagation scheme and its analysis via eigenvectors. In order to measure the

relative importance of web pages, they proposed PageRank, a method for computing a ranking for every web

page based on the graph of the web. It is the first application of these approaches to www search. Kleinberg

(1999), developed a model of the web as Hubs and Authorities, based on eigenvector calculation on the co-

citation matrix of the web.

As the amount of information on the World Wide Web (WWW) increases rapidly, search engines such as

Google, Yahoo, MS Live Search and so on are becoming powerful tools for retrieving information on the web.

Among the typically huge amount of relevant web pages satisfying a query, users only want to browse the most

important ones (Bar-Ilan et al, 2006). Thus, web search engines need to use effective ranking methods for

ordering the search results according to their “importance” (Arasu et al, 2001). There have been a number of

methods proposed to effectively rank the relevant web pages (Devanshu, 2002). These methods are classified

into two categories: content-based methods and link-based methods. Content-based methods exploit the

similarity between contents of web pages and given queries whereas link-based methods exploit the link

structure of web pages. Currently, link-based methods are prevalent in prominent web search engines such as

Google, Yahoo, and MS Live Search (Yoshida, 2008). Among link-based methods, PageRank, which was first

introduced by Google, has been known as one of the most popular method. As PageRank is an important

component for ranking web pages in Google and other web search engines (Berkhin, 2005), many variations of

the PageRank algorithm have been proposed in order to improve the ranking quality (Haveliwala, 2002).

However, it is unclear which variations (or their combinations) provide the “best” ranked results since existing

research neither explicitly explains which ranking algorithm are used (Can et al., 2004) nor provides sufficient

experimental results for comparing the ranking quality among those variations (Yi et al., 2006). As one of the

several studies, Eiron et al. (2004) proposed four algorithms ― jump-weighting, push-back, self-loop, and

BHITS ― in order to demote the ranks of bad pages. The jump-weighting algorithm in particular demotes the

rank of a bad page by distributing the PageRank value of a dangling page to all non-dangling pages in

proportion to the fraction of the number of its good links over its out degree. Here, bad links are the links

pointing to the dangling pages that cannot be crawled because they are not available or not downloadable.

Acharyya and Ghosh (2004) also proposed three algorithms ― dangling link estimation, naive link estimation,

and clustered link estimation ― in order to estimate the outlink distributions from dangling pages to non-

dangling pages. The dangling link estimation algorithm exploits the link information of dangling pages

estimated.

Meanwhile, some studies focused on investigating influence between web sites according to the frequency of

inlinks and outlinks. Adekannbi (2012) revealed that African universities have more interest in linking to the

Page 4: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

world class universities as analyzed web links to investigate interrelationship between top ten African

universities and world universities based on the frequency and percentage distributions of outlinks and inlinks.

Islam and Alam (2011) revealed that some private universities in Bangladesh have higher number of web pages

but their link pages are very small in number, thus the websites fall behind in their Overall WIF, self link,

external links and Absolute WIF. The study showed that these universities did not have much impact factor on

the web and were not known internationally as examined in the 44 private university websites in Bangladesh. It

also identified the number of web pages and link pages, and calculated the Overall Web Impact Factor (WIF)

and Absolute Web Impact Factor (WIF).

III. DEFINITIONS ON LINK TYPES

3.1. Directionality and Directness of Links

Among many different types of link characteristics mentioned above, only two aspects are focused upon in

this section: directionality and directness of links. It appears on general hypertext document.

Directionality: This notion explains the relation between two documents actually connected with links. For

directionality, a distinction is made between outgoing and incoming links. When there is a link from document

A to B, it serves as an outgoing link for A and as an incoming link for B. These links can be expressed as out-

degree and in-degree in graph theory. Outgoing and incoming links described below in Figure 1 are all examples

of a direct link.

Directness: In this case, an indirect link that is assumed to exist between documents A and B is defined when

they each have an outgoing link to a common document C, or when they each have an incoming link from the

same document C. So, indirect link is not an actual connection but a virtual connection between two documents.

The definition of indirect links is similar to co-citing and co-cited reference (Savoy 1996) in the science

citations, but the usage of links differ. Description on the directness is explained on the right side of Figure 1.

From the two points of view, thus, we derive three basic types of link: incoming link, outgoing link and

indirect link. Based on these basic link types, we can define more link types applied to some circumstance. More

specified link types will be shown in the following section.

Figure 1. Directionality and directness of links

Page 5: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

3.2. Considering Search Environment

Considering search environment constructed by query, documents, links and sets, the types of direct links

including incoming links and outgoing links can be more specified. For this, two additional distinctions are

made on links according to the relations of the above four objects; 1) Whether the anchor terms on the links

have appeared on query terms or not. 2) The location of a current link node in the set. Figure 2 illustrates such a

specified environment.

Figure 2. The links considering search environment

In Figure 2 above, the rectangles and arrowed lines each represent documents and links. Thus, there are 5

documents and 10 kinds of links. Q and N represents query and non-query term anchors respectively. All links

are attached with circled number indicating link types.

Query term link vs. Non-Query term link: The links in a document can be separated into query anchor and

non-query anchor. If all terms in the anchor appeared on query terms, it is referred to as query term link,

otherwise, it would be a non-query term link. For example, document A has one query term link and one non-

query term link.

Internal vs. External: There are 3 kinds of sets in the search environment. The initial set consists of search

results for a query, which are the results obtained by basic search methods like vector model (Witten et al.,

1999; Frakes & Baeza-Yates, 1997). The extended set is a growing set of the initial set using outgoing links

circled 1 and 5. Lastly, the total set consists of a group of all the documents. For example, there are 2, 4, 5

documents for each of the three sets. For these sets, the initial set was redefined to internal set while the

extended set to external set which have documents except internal ones.

Finally, eight different types of links were derived based on the two classes above. The description is shown

in Table I below. Each case matches with the circled number in Figure 2. The extended set is the basic unit for

the re-ranking process. Therefore, case 9 and 10 is out of our consideration hence excluded.

Page 6: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

Table 1. Eight different types of links considered

Source Page Internal Internal External ExternalTarget Page External Internal Internal External

Query Term Type

Query Case1 Case2 Case3 Case4Non-Query Case5 Case6 Case7 Case8

IV. ALGORITHM FOR LINK BASED RETRIEVAL

The overall algorithm for link-based retrieval works as follows.

1. A set of documents (internal documents) is retrieved using a method based on the vector space model to

form an initial retrieval set.

2. By using link information, the set is expanded to include additional documents (external documents),

forming an extended set.

3. The documents in the extended set are re-evaluated for their relevancy based on the link information that

exists among the documents.

The purpose of the second step is to enhance recall by retrieving those documents that would not be retrieved

by the given query but are linked with the initial retrieval set. The new external documents as well as the

internal documents are re-evaluated at the third step, where the locations of the source and the destination of the

links between any two documents in the extended set are taken into account. Table I shows 8 different types of

links, depending on whether the source of a link is from a query term or from a non-query term and on where

the source and the destination of the link are located. For example, case 3 refers to a link whose source is a

query term in an external document, and whose destination is an internal document. It should be noted that an

external document may contain a query term when its rank is lower than the cut-off point and therefore excluded

from the initial set. Each of the three steps is explained in detail in the following subsections.

4.1. The First Step: Construction Of The Initial Retrieval Set

This step is basically the same as the ordinary process of retrieving documents using a vector space model.

We employed the following formula to calculate a weight wij for a term j appearing in document I:

(1)

Where nt is the total frequency of a term normalized by the maximum term frequency in the document, nidf is

the normalized inverse document frequency for the term, and n is the total number of documents in the

collection. The retrieval status value (RSV) for document Di and query Q is defined to be a cosine similarity ㅍ

value:

(2)

Where t means the total term frequency in document Di, Wd means the specific term weight in document Di,

Page 7: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

and Wq means the specific term weight of the query term Q. |Di| and |Q| is defined as the length of the

document and query respectively.

When this step is completed, the documents that contain at least one of the query terms are listed above a

certain cut-off point.

4.2. The Second Step: Determination Of The Extended Set

This step handles cases 1 and 5 in Table I. The documents in the initial retrieval set are examined for their

association with those outside the set, via an outgoing link from either a query or non-query term. Each

document connected to a link from a term in one of the internal documents becomes an element in the extended

set. This process resembles a form of blind relevance feedback, where top-ranked documents in the initial

retrieval set are combined with the original query to form an extended query, in the sense that additional

documents are retrieved based on their connection to some of the initially retrieved documents. The difference

lies in the fact that human-generated connections (i.e. links) are used to retrieve additional documents as

opposed to word occurrences in initially retrieved documents, which are the basis for adding new terms in the

extended query.

Since external documents do not ordinarily contain query terms, it is not possible to calculate RSV for them

by simply applying the formula as in (2). The strategy used in assigning RSV for an external document is to let it

inherit the RSV from the source document containing the anchor term of the link, reflecting the similarity

between them. In other words, it is computed by first calculating the similarity between the two (i.e. an external

document and the corresponding internal document) and propagating the RSV of the internal document i

accordingly based on the similarity measure.

More formally, the RSV for the external More formally:

(3)

Where 0<= Sim(Din, Dex) <= 1. When there are more than one incoming links to the external document, the

RSV’s for individual links is calculated using the formula in (3) and the maximum value chosen. It is also

possible to combine them with the Dempster-Shafer combination rule (Rich & Knight, 1991) under the

assumption that the more the incoming links from the initial retrieval set, the more likely the external document

would satisfy the query (i.e. higher RSV). Since this aspect of accumulating evidence is incorporated when the

final RSV is calculated as described below, we take a conservative approach of giving a RSV that is not larger

than that of the most relevant document among those that point to the external document at hand.

4.3. The Third Step: Incorporation Of Link Information

After generating all the candidate documents for the extended set, the next step is to re-rank them using all

the link information across documents in the set. Not only can new documents (i.e. external documents) be

inserted at various places in the list of retrieved internal documents, but also the original ranks of the internal

documents can change with newly calculated RSV’s. From this step on, the distinction between internal and

Page 8: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

external documents is no longer necessary. The basic scheme is to make the documents influence each other

when they are connected with links. It is conjectured that the more links candidate documents share, the more

likely they form a coherent body of documents for the given query since the candidates are supposed to be more

or less relevant to the query in the first place. Since different link types are likely to have different impact on the

bondage between documents (Pinski, 1976), they have been analyzed separately based on direct and indirect

links.

4.3.1. Impact By Direct Links

A direct link can be further classified based on the directionality (i.e. whether a link is an incoming or

outgoing link) and whether the anchor is a query term or non-query term. Considering the four cases, the

formula (4) below is used to calculate the effect of direct links on a given extended document D. Here the four

terms represent the following, respectively: 1) the effect of outgoing links starting from query terms in D, 2)

outgoing links starting from non-query terms in D, 3) incoming links from query terms in other documents, and

4) incoming links from non-query terms in other documents. They are combined with the Dempster-Shafer

combination rule represented by the operator.

Evidence Edl for an extended document ei by direct links is defined as follows:

(4)

with:

Where t are the parameters indicating the strength or the importance of the four types of links, and r, s,

t, and w are the numbers of the four different types of links starting from or pointing to D, respectively. Also, k,

l, m, and n are the documents corresponding to the four terms represented above.

It should be noted that the symbol does not have the usual meaning of summation but the meaning of the

operation based on Dempster-Shafer combination rule. This not only ensures that the value of each term

never exceeds 1, but also reflects the intuition that the link information provides additional evidence that the

document satisfies the query, which should be gathered in a principled way. The new RSV value for Considering

both the direct and indirect

w

nnin

t

mmim

s

llil

r

kkik

QDRSVE

QDRSVE

QDRSVE

QDRSVE

1

1

1

1

,

,

,

,

Page 9: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

(5)

extended document ei can be computed in the following way:

4.3.2. Impact By Indirect Links

When two documents A and B have links to the same document, an association between them is assumed.

Likewise, to a document that has separate links to the two documents (see Figure 1). In other words, a bi-

directional indirect link can be established between two documents A and B by a common destination from the

links starting from them, or by a common source of two (or more) links starting from another document C,

which point to A and B, respectively. In order to compute the strength of the indirect link between Di and Dj, two

aspects are taken into account. The first is to do with how many links leave from Di and Dj separately and how

many of them point to the same destination Dk. The second aspect is about how many links arrive in Di and Dj

starting from Dk at the same time. The evidence Eil for an extended document ei by indirect links can be

calculated as follows:calculated as follows:

(6)

with:

Where gives the number of links. Li and Lj reflects each link starting from document i and j. Lij indicates a

pair of two links leaving from Di and Dj and arriving at the same destination document k. Indirect links(Mij)

created by a document k pointing to Di and Dj can be calculated in a similar way. M is the same as L except that

it arrives at Di and Dj instead of leaving from them.

(7)

Considering both the direct and indirect effect from links, the final RSV for D is calculated as follows:

It should be noted that while the parameter 4 is explicitly shown in this formula, other parameters are

hidden in Edl(Dei) as in the formula (4). The exact values for t ’s determined by experiments.

V. EXPERIEMENTS AND EVALUATIONS

5.1. Test Collections

The re-ranking scheme described above was implemented and a series of experiments run to understand the

effects of using link information in document retrieval. There are a myriad of hypertext documents available in

the World Wide Web, which contain interesting links, but it is not possible to use them for the experiments

because there are no relevant judgments for them. Although the CACM collection could be used by treating

references as links to other documents as in (Savoy, 1996), they would probably not serve the purpose of this

research because the nature of the links in CACM are quite different from typical hypertext links.

Page 10: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

The choice made here was a test collection developed by ETRI (Electronics and Telecommunications

Research Institute) in Korea, which consists of encyclopedia documents containing a large number of hypertext

links published by Kyemong Publishing Co. The collection cosists of 23,113 documents (9.4 MB) in Korean,

which are basically explanations of titles such as ”comet” and ”gene”, and 46 natural language queries. There

are many hypertext links whose anchor words are the titles of other documents. In other words, following a link

by clicking on an anchor word will lead the user to a new document whose title is the same as the anchor word.

Considering that one of the main purposes of this paper is to improve retrieval performance by taking advantage

of link information, a survey on link status for data sets is strongly recommended. The general status of link

information in KYEMONG data set is shown in Table 2 below.

Table 2. Different link status information by link occurrence

Information on minimum, maximum, whole and average link frequency per document can be obtained from

Table II above. The frequencies follow the way like this (outgoing link incominglink≪indirect link). In

direct link, the difference of frequency between minimum and maximum value, relatively, is not larger than

indirect link.

Link frequency by document id and document frequency by each link frequency will be shown for direct and

indirect link in Figure 3, 4 and 5.

Figure 3. Link frequency by documents and document frequency by link frequency for outgoing links

After arranging each document by link frequency in an increasing order, a maximum link frequency is

Page 11: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

obtained from among all documents corresponding to high percentage and the value plotted. For example, it is 6

at 40% in outgoing links, as will be seen in Figure 6.

By capturing link information for data set through the above Figures and Tables, roughly, the conclusions below

can be made;

- In the frequency level, the direct links have a lot of resemblances (see Table 2). In the case of direct link,

there is a similar link frequency distribution for about 90% of documents, but that of indirect link vary from

0 to about 800,000 (see Figure 6).

- When we take into account that the size of average document is 411KB, the number of indirect link is so

high. It may cause the opposite effects.

- There’s no relation on frequency in the three types of links. For example, the document id having

maximum frequency for outgoing link is 5018 named title ”korea”, which has the incoming frequency of

66 much lower than maximum value 1,918, same as in the case of indirect link.

Figure 4. Link frequency by documents and document frequency by link frequency for in-coming links

Figure 5. Link frequency by documents and document frequency by link frequency for in-direct links

Page 12: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

Figure 6. Maximum link frequency for the top percentage of whole document

5.2. Experiments And Results

Getting better search performance by using link information is simply not the final purpose of this experiment

rather, it is finding the best parameter values for the algorithm and showing its superiority. The final goal was to

demonstrate that utilization of newly identified 5-types link information improves retrieval effectiveness and

understand the roles of 5 different types of links considered in the retrieval scheme.

In order to understand the roles played by each of the link types, different numbers for the five parameters

were plugged in and the 11-point average precision measured. The performance of the information retrieval can

be divided into recall and precision. You can use a wide variety of assessment methods depending on your point

of view. 11-point average precision is used for a global evaluation of the entire test documents. It was calculated

by two steps, thus measuring the precision based on the recall to each document and computing an average

value of the entire documents set. This method is widely used to evaluate the general-purpose information

retrieval performance in a reasonable way.

In Table 3, the results are summarized for different numbers of link types used in formula (4), (7).

α t Here’s indicates different link types considered: outgoing links from query and non-query terms,

incoming links from query and non-query terms, and indirect links. The first group of rows from the upper part

of the Table shows a case where only one of the link types was used. For example, the very first row shows the

11-poin taverage precision when only outgoing links from a query term are considered in re-ranking documents.

The values shown in the Table (e.g.0.2 in the first case) are the ones with the best performance in the particular

case. For example, when only α0 and α1 were given a non-zero value(i.e. when only outgoing links were

considered), the combination of 0.6 and 0.8 gave the best results among all the other combinations of values

considered between 0.1 and 1.0 with an interval 0.1.

As can be seen in Table III, the effectiveness increased as the right combination of link types was

progressively added. When only one link type was considered, which is the incoming link from a query term,

Page 13: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

the 11-point average precision was 0.4591, but when another link type (outgoing links from a query term) was

added, the value increased to 0.5034. The best value in each category increased to 0.5086, 0.5123, and 0.5126,

showing steady improvements. This indicates that when the right parameter values are used, the five link types

together contribute to enhancement of retrieval effectiveness. However, it was found that the extent of its

influence is different.

A close examination of the results indicates that different contribution is made by different link types,

however. Incoming links from a query term (signified by a non-zero value for 2) are shown to have the strongest

impact in all the five groups. This seems to be due to the general characteristics of the links indicating that the

target page is important. Outgoing links from a query term are the next most important type. This can be seen

most clearly in the second group, where the best score was obtained when α0 was given a non-zero value in

addition to α2. These results exemplify that both link characteristics and query words have a great influence on

the search result.

In the other groups with three, four, and five non-zero parameters used, the value for α0 is the second biggest.

Similarly, we can see the next one is incoming links with non-query terms. In summary, links leaving from a

query term are more important than those leaving from a non-query term, and incoming links are more

important than outgoing links. That is, it is much safer to increase RSV for a document that is pointed to by

links whose anchor is a query term. Indirect links seem to have the least impact on retrieval effectiveness,

because of having a lot of links as explained in Table 2, which gives misleading effects on ranking value.

Table 3. Comparisons among contributions made by different link types

Page 14: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

There are some idiosyncratic aspects in these experiments steps. Because of the characteristics of the

hypertexts in the encyclopedia collection, a link leaving from a query term seldom retrieves a new external

document. This is because most of them should have been retrieved by the title word that is the same as the

query word. The only documents that have been left out would be those below the cutoff point (30 in the

experiments). This observation indicates that most of the improvement we obtained seems to stem from the re-

ranking scheme rather than from newly retrieved documents. Nonetheless, many cases were observed where

some relevant documents that would never be retrieved by query terms only were actually retrieved because of

the link information.

To get the results for the baseline, KRISTAL-2002 retrieval system using morphological analyzer for Korean

text retrieval is used. This system developed by KISTI(Korea Institute of Science and Technology Information)

isb ased on the vector space model and includes most of the well-tested weighting schemes. The search

performance of KRISTAL is proven through man yresearches (Kim,2005). We can use that result as a baseline.

Table 4 shows the comparison between the baseline case and the best case that took into account the entire five

different link types at various recall levels. As shown in the Table, the average improvement obtained over the

base line was about 44.8% when an appropriate combination of hypertext links wa sused.

Page 15: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

Table 4. Comparison between the best case and the baseline

This study is focused on understanding the characteristics of each link. To understand the characteristics of

the algorithm proposed in this study, a comparison with PageRank methods was performed. The experiment was

carried out from the perspective of both literature and algorithm implementation. The algorithm and parameter

values used by Brin and Page (1998) were adopted in this study.

Table 5. Comparison between the proposed method and PageRank

Division The proposed method PageRank

Processing method Create and expand the result setRe-ranking the expanded result set. Re-ranking for the result set

Link characteristics

Consideration of the 4 kinds of link characteristics – Link direction (input / output), Direct or indirect query word,

Division of expanded set (internal / external)

Link direction (input / output)

Link type used

5 kinds of link(Outgoing query word, outgoing non-

query word, incoming query word, incoming non-query word, indirect link)

1 kind of link (input link)

Calculation of link relationships

Considering the relationship between the number of links and the linked document Depends on the number of links

Ranking priority Reflect the relevance between documents and link impact

The importance of a page based on the number of input links

(PageRank value)

Calculation complexity Slightly higher normal

Applicable target Small web Massive web

The maximum performance of the 11-

point average value0.5126 0.4730

In this study, effectiveness of five types of links was measured depending on the nature of the link. Because

extended result set was used in re-ranking process and the weight of the extended result set is computed

beforehand like equation (2), link types were not distinguished as extended set. Indirect links are related to each

other too much and have low degree of influence. The kinds of indirect link were not classified because it was

Page 16: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

meaningless to distinguish between the query word or indirect links. PageRank shows the maximum

performance of the 11-point average value of 0.4730. This is similar to the value of the maximum performance

when using only the incoming links proposed algorithm in this paper. The incoming links is important in the

influence of the link. But how to distinguish the types of links and application seems to be more important. In

particular, as shown in Table 3, whether or not containing the query words in link title have a big impact on the

search performance.

VI. CONCLUSION AND FUTURE WORKSWith the growing popularity of web documents, hypertext documents are ubiquitous. As such, it becomes

more and more important to make use of link information existing in such documents for information retrieval.

The main goal of this paper is to present a way in which this link information can be used for retrieval purposes,

and show that this information in fact improves retrieval effectiveness and reveals the role of 5 types of links

defined in previous sections. This study emphasizes that direct rather than indirect links are more important, and

incoming links are more important that outgoing links in direct links, and separation of the link title using the

words of the query is more important. For more details, refer to the experimental section. To accomplish the

goal, this paper proposes a scheme for using link information to improve retrieval effectiveness and

experimental results that proved the hypothesis that such information would be useful. The scheme has been

implemented first in the context of the efforts to build a fully-distributed, agent-based digital library prototype

(Joo, 1998), which can be modified continuously.

In the experiments using the ETRI-Kyemong test collection containing a large number of hyperlinks, as much

as 44.8 % increase was obtained in 11-point average precision, over the baseline where no link information was

used. It was observed that by using both incoming and outgoing links, we could get documents that would have

never been retrieved by matching them against the query terms. At the same time, it was possible to re-rank

those that had already been retrieved so that relevant ones could be moved toward the top of the list. This

method was compared with PageRank and results of a similar level as that used in the input link were

confirmed.

While the results show that this approach to using link information is promising, there are some limitations.

First of all, the size of the documents in the collection used was small, and all the anchors are important key

terms that correspond directly to the titles of other documents. Additional valuations should be done with more

typical document sets with hypertext, but such has not yet been found. Even with the same collection, it would

be desirable to compare these results with the blind relevance feedback strategy where a set of top-ranked

documents from the initial retrieval are combined with the original query to generate an expanded one. Still

another extension to the current work is to devise a variety of schemes for indirect links and test their efficacy.

REFERENCES

Acharyya, S, and Ghosh, J. (2004). Outlink Estimation For PageRank Computation Under Missing Data.

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters,

Page 17: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

486-487.

Adekannbi, Janet. (2012). Motivation and Disciplinary Distribution of Web Links Between African And World

Universities: An Exploratory Study. Journal of Emerging Trends in Computing and Information Sciences,

3(3), 373-383.

Arasu, A. et al. (2001), Searching the Web. ACM Transaction on Internet Technology (TOIT), 1(1), 2-43, Aug.

2001.

Bar-Ilan, J., Mat-Hassan, M., and Levene, M. (2006). Methods For Comparing Rankings of Search Engine

Results. Computer Networks, 50(10), 1448-1463.

Berkhin, P. (2005), A Survey on PageRank Computing. Internet Mathematics, 2(1), 73-120.

Brin, S. and Page, L. (1998). Anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th

international World Wide Web Conference, Brisbane, Australlia, 107-117.

Can, F., Nuray, R., and Sevdik, A. (2004). Automatic Performance Evaluation of Web Search Engines.

Information Processing and Management, 40(3), 495-514.

Croft, W. B. and Turtle, H. (1989). A retrieval model for incorporating hypertext links. Proceedings of the 1989

the second annual ACM conference on Hypertext, 213-224.

Dhyani, D., Keong N. W., and Bhowmick, S. (2002). A Survey of Web Metrics. ACM Computing Surveys,

34(4), 469-503.

Dunlop, M. D. and Van Rijsbergen, C. J. (1993). Hypermedia and free text retrieval. Information Processing and

Management, 29(3), 287-298.

Eiron, N., McCurley, K., and Tomlin, J. (2004). Ranking the Web Frontier. Proceeding of the 13th Int'l Conf. on

World Wide Web (WWW), 309 - 318.

Frakes, William B. and Baeza-Yates, Ricardo. (1997). Information Retrieval: Data Structures and Algorithms.

Prentice Hall, Upper Saddle River, New Jersey, 363-392.

Frisse, M. E. (1998). Searching for informations in a hypertext medical handbook. Communications of ACM.

31(7), 880-886.

Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science, No. 178, 471-479.

Google, http://www.google.com.

Haveliwala, T. H. (2003). Topic-sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search.

IEEE Transactions on Knowledge and Data Engineering, 15(4), 784-796.

Hou, J. and Zhang, Y. (2003). Effectively Finding Relevant Web Pages from Linkage Information. IEEE

Transaction on knowledge and data engineering, 15(4), 940–951.

Islam, Anwarul and Md. Saiful Alam. 2011. Webometric study of private universities in Bangladesh. Malaysian

Journal of Library & Information Science, 16(2), 115-126.

Joo, W. K. et al. (1998). Separating Link Information from Documents in a Digital Library. Proceedings of the

25th KISS Spring Conference, Seoul, Korea.

Kleinberg, Jon M.. (1999). Authoritative Sources in a Hypertlinked Environment. Journal of the ACM, 46(5),

604-632.

Page 18: file · Web viewNTRODUCTION. Information ... Word frequencies in individual documents and in the entire collection are the basis for determining the degree of association between

KRISTAL, http://giis.kisti.re.kr.

Kim, KwangYoung et al. (2005). An Improvement of Information Retrieval System using Word Location

Information. Proceedings of the IEEE International Conference on Intelligent Agents, Web Technology

and Internet Commerce (IAWTIC-05), 90-95.

Lee, Kyung-Soon, Park, Y. C. and Choi, K. S. (2001). Re-ranking model based on document clusters.

Information Processing and Management, 37(1), 1-14.

Lucarella, Dario. (1990). A model for hypertext-based information retrieval. in Hypertext Concepts, Systems,

and Applications, (ed) Rizk, Streitz, and Andrie, 81-94.

Pinski, G. and Narin, F. (1976). Citation influence for journal aggregates of scientific publications: Theory, with

application to the literature of physics. Information Processing and Management, 12(5), 297-312.

Rich, Elaine and Knight, K. (1991). Artificial Intelligence 2nd edition, McGraw-Hill, New York, NY.

Savoy, Jacques. (1996). An extended vector-processing scheme for searching information in hypertext systems.

Information Processing and Management, 32(2), 155-170.

Witten, Ian H., Moffat, A. and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Document

and Images. Morgan Kaufmann Publishers, San Francisco, CA.

Zhang, Yi, et al. (2006). XRank: Learning More from Web User Behaviors. Proceedings of The Sixth IEEE

International Conference on Computer and Information Technology, 36.

Yoshida, Y. et al. (2008). What's Going on in Search Engine Rankings. Proceeding of the 22nd International

Conference on Advanced Information Networking and Applications - Workshops, 1199-1204.

KRISTAL, http://giis.kisti.re.kr