Improving Rank Algorithm of Search Engine with Ontology and Categorization


Qiaowei Dai

Supervisor: Jiuyong Li


This presentation

• Background

• Motivation

• Research strategy

• Modelling and evaluation process

• Results and analysis

• Goals I have achieved


Background

• What is a search engine?

-An important tool that helps people find information on the huge Web.

• Shortcomings of information retrieval:

-Low query quality

-Low query update speed

-Lack of effective information categorization

-Keyword-based Web query lacks understanding of user behavior

-Low index coverage rate


PageRank Algorithm

• PageRank is a global link analysis algorithm proposed by S. Brin and L. Page (Brin, S & Page, L 1998). It analyzes the link structure of the whole Web and assigns every URL a weight, called the PageRank value of the page, based on factors such as how many times the page is linked to.

• The PageRank algorithm is recursive: the value of a page depends on how many times it is linked to and on the PageRank values of the pages that link to it (Brin, S & Page, L 1998). Let the pages on the Web be 1, 2, …, m. Then

$$r(i) = d * \sum_{j \in B(i)} \frac{r(j)}{N(j)} + \frac{1 - d}{m}$$

where r(i) is the PageRank value of page i, N(i) is the number of outbound links of page i, B(i) is the set of pages that refer to page i, and d (0 < d < 1) is the damping factor.
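A minimal Python sketch of the recursive computation, assuming the standard form r(i) = d * Σ_{j∈B(i)} r(j)/N(j) + (1 - d)/m. The function name `pagerank`, the default d = 0.85, the fixed iteration count, and the three-page graph are illustrative assumptions, not values from the presentation. The sketch also assumes every page appears as a key and has at least one outbound link.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Iterate r(i) = d * sum_{j in B(i)} r(j) / N(j) + (1 - d) / m."""
    pages = list(out_links)
    m = len(pages)
    # B(i): the pages that link to page i, derived from the outbound-link map.
    in_links = {page: [] for page in pages}
    for j, targets in out_links.items():
        for i in targets:
            in_links[i].append(j)
    r = {page: 1.0 / m for page in pages}  # uniform starting values
    for _ in range(iterations):
        r = {
            i: d * sum(r[j] / len(out_links[j]) for j in in_links[i]) + (1 - d) / m
            for i in pages
        }
    return r


# Hypothetical three-page Web: A -> B, C; B -> C; C -> A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph page C collects links from both A and B, so it ends up with the largest value, and the values sum to 1 because no page is a dead end.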


PageRank Algorithm

• Problems:

-Favors old pages over new pages

-Topic drift

-Nepotistic links


PageRank Algorithm

• Improvements:

-Ling & Fanyuan (2004) proposed an accelerated evaluation algorithm, which introduces an acceleration factor.

-Beijing University developed the WebGather search engine (Ming, L, Jianyong, W & Baojue, C 2001), which compensates new pages.

-Haveliwala (2002) proposed a topic-sensitive PageRank algorithm.


HITS Algorithm

• HITS (Hypertext Induced Topic Search) is a rank algorithm that analyzes Web resources based on local links. It was proposed by Kleinberg in 1998 (Kleinberg, J 1999).

• An authority page is a page that is highly relevant to the query keywords or their combination (Kleinberg, J 1999).

• A hub page is a page that links to multiple authority pages (Kleinberg, J 1999).

• Each iteration of the HITS algorithm has two steps, step I and step O. In step I, the authority value of each page is the sum of the hub values of the pages that refer to it. In step O, the hub value of each page is the sum of the authority values of the pages it refers to. That is:

$$a_i = \sum_{j \in B(i)} h_j$$

$$h_i = \sum_{j \in F(i)} a_j$$

where a_i and h_i are the authority value and hub value of page i, B(i) is the set of pages that refer to page i, and F(i) is the set of pages referred to by page i.
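The two steps can be sketched in Python as follows, assuming a_i = Σ_{j∈B(i)} h_j for step I and h_i = Σ_{j∈F(i)} a_j for step O. The L2 normalization after each round is an assumption added to keep the values bounded; the function name `hits` and the four-page graph are hypothetical.

```python
import math


def hits(out_links, iterations=50):
    """Alternate step I (authority update) and step O (hub update)."""
    pages = list(out_links)
    # B(i): pages referring to i. F(i) is out_links[i] itself.
    in_links = {page: [] for page in pages}
    for i, targets in out_links.items():
        for j in targets:
            in_links[j].append(i)
    a = {page: 1.0 for page in pages}
    h = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Step I: authority of i is the sum of hub values of pages in B(i).
        a = {i: sum(h[j] for j in in_links[i]) for i in pages}
        # Step O: hub value of i is the sum of authority values of pages in F(i).
        h = {i: sum(a[j] for j in out_links[i]) for i in pages}
        # Normalization (an assumption of this sketch) keeps values bounded.
        norm_a = math.sqrt(sum(x * x for x in a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {page: x / norm_a for page, x in a.items()}
        h = {page: x / norm_h for page, x in h.items()}
    return a, h


# Hypothetical graph: two hub pages pointing at two authority pages.
graph = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"], "auth1": [], "auth2": []}
authority, hub = hits(graph)
```

Here `auth1` is referred to by both hubs and therefore receives the higher authority value, while `hub1` refers to both authorities and receives the higher hub value.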


HITS Algorithm

• Problems:

-Faction attacks

-Mixed hub page

-Topic pollution

-Topic drift

-Nepotistic links


HITS Algorithm

• Improvements:

-Bharat and Henzinger (1998) improved the computation of authority and hub weights by introducing relevance weights on hyperlinks.

-Chakrabarti, Dom and Gibson (1999) proposed splitting a big hub page into smaller units.

-Haveliwala (2002) proposed a topic-sensitive PageRank algorithm.

-The Clever system (Chakrabarti, S, Dom, B & Raghavan, P 1998).

-Xuelong, Xuemei and Xiangwei (2006) proposed a time parameter control model.


Motivation

• According to the Intelligent Surfer Model, user behavior in browsing Web pages is not absolutely random or blind, but related to topic.

• Among the numerous outbound links of each Web page, those that lead to pages of the same or a similar category have a higher click rate.

• Both the PageRank algorithm and the HITS algorithm objectively describe the essential link relations between Web pages, but they rarely consider the topic relevance of users' surfing habits.

• Link structure-based algorithms integrate well with other technologies, which improves their adaptability.


Motivation

• Categorization technology can simulate users' topic-related habits and thereby improve link structure-based algorithms.

• Categorization technology removes the unreliability introduced by the assumption that users visit Web pages absolutely at random, and it distinguishes the direction relations between Web pages according to category attributes. It can therefore be regarded as an important supplement to traditional algorithms.

• To implement categorization, a mechanism for comparing categories needs to be introduced. A domain ontology-based concept semantic similarity computation model is an ideal choice.


Research strategy

[Figure: research strategy flowchart. Recoverable box labels:]

-Research on traditional domain ontology-based concept semantic similarity computation models (distance-based, content-based, attribute-based)

-Improvement of decision factors (category, depth, density, strength, attribute)

-Modeling and evaluating the improved domain ontology-based concept semantic similarity computation model

-Constructing the categorization similarity table according to the improved model

-Performing pre-categorization processes according to the categorization similarity table (pre-categorization of Web pages, pre-categorization of keywords)

-Research on combination of PageRank and categorization: define the basic idea of categorization, how to implement categorization

-Selective mechanism

-Combine HITS with categorization: modeling, evaluation


Research strategy

• Phase 1:

-Improve domain ontology-based concept semantic similarity computation model.

• Phase 2:

-Improve HITS algorithm with categorization based on the computation model generated by phase 1.


Phase 1 - Traditional computation models

• Traditional computation models

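1. Distance-based:

The basic idea is to quantify the semantic distance between two concepts by their geometric distance in the hierarchical network (Qun, L & Sujian, L 2002).

$$\mathrm{sim}(w_1, w_2) = \frac{2(\mathrm{MaxLength} - 1) - \mathrm{min}}{2(\mathrm{MaxLength} - 1)}$$

(The minus signs in this equation are inferred; the source does not define MaxLength and min.)

Leacock (2005) proposed an improvement:

$$\mathrm{Dist}(w_1, w_2) = N_{links}[w_1, \mathrm{Anc}(w_1, w_2)] + N_{links}[w_2, \mathrm{Anc}(w_1, w_2)]$$

[Equation only partly recoverable: sim(w1, w2) is defined from the logarithm of Dist(w1, w2) and d_max.]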


Phase 1

2. Content-based

The basic principle is that the more information two concepts share, the higher their semantic similarity; conversely, the less information they share, the lower their semantic similarity (Xiaofeng, Z, Xinting, T & Yongsheng, Z 2006).

$$\mathrm{IC}(w) = -\log[P(w)]$$

$$P(w) = \frac{\text{the times that concept } w \text{ appears in the training material}}{\text{the total amount of training material}}$$

$$\mathrm{sim}(w_1, w_2) = \frac{2 * \mathrm{IC}[\mathrm{Anc}(w_1, w_2)]}{\mathrm{IC}(w_1) + \mathrm{IC}(w_2)}$$
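A short Python sketch of the content-based measure, assuming IC(w) = -log P(w) and sim(w1, w2) = 2 * IC[Anc(w1, w2)] / (IC(w1) + IC(w2)). The occurrence counts, the training-material size, and the choice of "Linear Structure" as the common ancestor are hypothetical values picked for illustration.

```python
import math


def information_content(count, total):
    """IC(w) = -log P(w), with P(w) estimated as count / total."""
    return -math.log(count / total)


def content_similarity(w1, w2, ancestor, counts, total):
    """2 * IC of the common ancestor over the sum of the two concepts' IC."""
    ic_ancestor = information_content(counts[ancestor], total)
    ic_sum = (information_content(counts[w1], total)
              + information_content(counts[w2], total))
    return 2 * ic_ancestor / ic_sum


# Hypothetical occurrence counts in a training corpus of 100 items.
counts = {"Linear Structure": 40, "Stack": 10, "Queue": 10}
sim = content_similarity("Stack", "Queue", "Linear Structure", counts, total=100)
```

A frequent, general ancestor carries little information, so two rare concepts that only share such an ancestor score low.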


Phase 1

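3. Attribute-based

The basic principle is to judge how similar the attribute sets of the two concepts are.

[Equation only partly recoverable: sim(w1, w2) combines three terms of a function f, two over the pair (w1, w2) and one over the pair (w2, w1); the set operators and coefficients are lost.]

Rips (2002) proposed a multi-dimensional computation model:

$$\mathrm{Attr}(w_1) = \{A_{w_1,0}, A_{w_1,1}, \ldots, A_{w_1,n}\}$$

$$\mathrm{Attr}(w_2) = \{A_{w_2,0}, A_{w_2,1}, \ldots, A_{w_2,n}\}$$

$$\mathrm{Dist}(w_1, w_2) = \sum_{k=0}^{n} (A_{w_1,k} - A_{w_2,k})^2$$

(A square root around this sum, if the original had one, is not recoverable.)

[Equation only partly recoverable: sim(w1, w2) is then defined from Dist(w1, w2).]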


Phase 1

• Decision factors

-In the directed, loop-free hierarchical network formed by a domain ontology, directed edges may carry different weights; that is, the semantic similarity between the parent node and the child node at the two ends of a directed edge differs from edge to edge.

-The influence of weight therefore needs to be considered when computing the distance between concepts. According to my research, five main factors influence the weight of a directed edge in the ontology hierarchical network.


Phase 1

1. Category

-Inheritance relation

-Entirety-part relation

-Synonymous relation

$$\mathrm{weight}(c, p) = \begin{cases} 1 * \mathrm{type}(c, p), & \text{synonymous relation} \\ \frac{1}{2} * \mathrm{type}(c, p), & \text{inheritance relation} \\ \frac{1}{3} * \mathrm{type}(c, p), & \text{entirety-part relation} \end{cases}$$

where weight(c, p) is the weight of the directed edge formed by child node c and its parent node p.


Phase 1

2. Depth

Concepts at lower levels have concrete meanings; conversely, concepts at higher levels have abstract meanings. Thus, the weight of a directed edge is related to its depth in the hierarchical network.

$$\mathrm{weight}(c, p) = \underbrace{\left(\frac{1}{2} * \frac{1}{2} * \ldots * \frac{1}{2}\right)}_{\mathrm{depth}(p)} = \prod_{n=1}^{\mathrm{depth}(p)} \frac{1}{2} = \frac{1}{2^{\mathrm{depth}(p)}}$$

where depth(p) is the depth of node p in the hierarchical network.


Phase 1

3. Density

If the node density of a local area of the hierarchical network is larger, the concepts there are more finely refined, and the weight of the corresponding edge is larger.

$$\mathrm{weight}(c, p) = \frac{\mathrm{indegree}(p) + \mathrm{indegree}(c) + \mathrm{outdegree}(p) + \mathrm{outdegree}(c)}{2 * (\mathrm{indegree}(G) + \mathrm{outdegree}(G))}$$

where indegree(p) and indegree(c) are the in-degrees of parent node p and child node c in the hierarchical network, outdegree(p) and outdegree(c) are their out-degrees, and indegree(G) and outdegree(G) are the in-degree and out-degree of the hierarchical network graph.


Phase 1

4. Strength

In the hierarchical network formed by a domain ontology, a parent node may have multiple child nodes. If one child node is more important to the domain than the others, the weight of the directed edge formed by this child node and its parent node should be larger.

$$P(c_i|p) > P(c_j|p) \Rightarrow \mathrm{important}(\mathrm{link}(c_i, p), \mathrm{link}(c_j, p))$$

where important() indicates that the former link is more important than the latter.

$$P(c_i|p) = \frac{P(c_i \cap p)}{P(p)}$$

Because c_i is a child of p, this reduces to

$$P(c_i|p) = \frac{P(c_i)}{P(p)}$$

$$\mathrm{IC}(w) = -\log[P(w)]$$

$$\mathrm{LS}(c_i, p) = -\log(P(c_i|p)) = -\log\left(\frac{P(c_i)}{P(p)}\right) = |\mathrm{IC}(p) - \mathrm{IC}(c_i)|$$

where LS(c_i, p) is the strength of the directed edge formed by child node c_i and parent node p.

[Equation only partly recoverable: weight(c_i, p) is a ratio built from LS(c_i, p) and an adjustment factor whose symbol is lost.]
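A worked example of the link strength, assuming the reading LS(c_i, p) = -log P(c_i|p) with P(c_i|p) = P(c_i)/P(p) and natural logarithms. The probabilities P(p) = 0.2 and P(c_i) = 0.05 are hypothetical.

$$P(c_i|p) = \frac{0.05}{0.2} = 0.25$$

$$\mathrm{LS}(c_i, p) = -\log(0.25) = \log 4 \approx 1.386$$

$$|\mathrm{IC}(p) - \mathrm{IC}(c_i)| = |-\log(0.2) + \log(0.05)| = \log 4 \approx 1.386$$

Both routes give the same value, which is why the strength can be read either from the conditional probability or from the difference in information content.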


Phase 1

5. Attribute

If the concepts of the child node and the parent node at the two ends of a directed edge share more attributes, the relation between them is closer, and the weight of the directed edge they form is larger.

$$\mathrm{weight}(c, p) = \frac{\mathrm{count}(\mathrm{Attr}(c) \cap \mathrm{Attr}(p))}{\mathrm{count}(\mathrm{Attr}(c) \cup \mathrm{Attr}(p))}$$

where Attr(c) and Attr(p) are the attribute sets of concept c and concept p, Attr(c) ∩ Attr(p) is their intersection, Attr(c) ∪ Attr(p) is their union, and count() is the number of attributes in a set.
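The attribute factor is the ratio of shared attributes to all attributes of the two concepts. A minimal Python sketch, in which the function name `attribute_weight` and the attribute names are hypothetical:

```python
def attribute_weight(attr_c, attr_p):
    """count(Attr(c) & Attr(p)) / count(Attr(c) | Attr(p))."""
    union = attr_c | attr_p
    if not union:
        # Neither concept has attributes; 0.0 is a choice made by this sketch.
        return 0.0
    return len(attr_c & attr_p) / len(union)


# Hypothetical attribute sets for a child concept and its parent concept.
child_attrs = {"ordered", "push", "pop"}
parent_attrs = {"ordered"}
w = attribute_weight(child_attrs, parent_attrs)
```

One attribute is shared out of three distinct attributes, so the weight is 1/3.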


Phase 1

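• Procedure of modeling

1. The domain ontology completed by domain specialists can be considered a hierarchical, directed, loop-free graph:

$$G = G(N, L), \quad N = (n_1, n_2, \ldots, n_i, \ldots, n_{max})$$

[Equation only partly recoverable: L is defined over node pairs (c_i, c_j), with i and j bounded by 0 and max.]

where N is the set of all nodes in the graph, each node n_i represents a concept in the domain together with its attributes, L is the set of all directed edges in the graph, and each directed edge represents some relation between nodes.

2. The factors influencing weight need to be fully considered when quantifying the weight of a directed edge:

$$\mathrm{weight}(c, p) = a * \mathrm{category} + b * \mathrm{depth} + c * \mathrm{density} + d * \mathrm{strength} + e * \mathrm{attribute}$$

$$\mathrm{weight}(c, p) = a * \mathrm{type}(c, p) + b * \prod_{n=1}^{\mathrm{depth}(p)} \frac{1}{2} + c * \frac{\mathrm{indegree}(p) + \mathrm{indegree}(c) + \mathrm{outdegree}(p) + \mathrm{outdegree}(c)}{2 * (\mathrm{indegree}(G) + \mathrm{outdegree}(G))} + d * \mathrm{strength} + e * \frac{\mathrm{count}(\mathrm{Attr}(c) \cap \mathrm{Attr}(p))}{\mathrm{count}(\mathrm{Attr}(c) \cup \mathrm{Attr}(p))}$$

where the strength term is the ratio of LS(c_i, p) terms from factor 4, k is the adjustment factor, k ∈ (0, 1], and a + b + c + d + e = 1.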
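The combination step is a weighted sum of the five factor values with coefficients that add up to 1. A minimal Python sketch, assuming each factor has already been computed as a number; the function name `edge_weight`, the coefficient values, and the factor values are hypothetical.

```python
def edge_weight(category, depth, density, strength, attribute,
                coefficients=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """weight(c, p) = a*category + b*depth + c*density + d*strength + e*attribute."""
    if abs(sum(coefficients) - 1.0) > 1e-9:
        raise ValueError("coefficients a + b + c + d + e must equal 1")
    a, b, c, d, e = coefficients
    return a * category + b * depth + c * density + d * strength + e * attribute


# Hypothetical factor values for one directed edge of the ontology.
w = edge_weight(category=0.5, depth=0.25, density=0.1, strength=0.6,
                attribute=1.0 / 3.0, coefficients=(0.3, 0.2, 0.1, 0.2, 0.2))
```

Choosing the coefficients is a tuning decision: a larger coefficient lets the corresponding factor dominate the edge weight.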


Phase 1

3. The length of a unit directed edge is inversely proportional to the weight of the directed edge:

$$\mathrm{Dist}(c, p) = \frac{\eta}{\mathrm{weight}(c, p)}$$

where η is an adjustable factor.

4. Use the Leacock computation model to compute the distance between two concepts in the domain ontology:

$$\mathrm{Dist}(w_1, w_2) = N_{links}[w_1, \mathrm{Anc}(w_1, w_2)] + N_{links}[w_2, \mathrm{Anc}(w_1, w_2)]$$

$$N_{links}[w_1, \mathrm{Anc}(w_1, w_2)] = \sum_{n \in \mathrm{path}(w_1, \mathrm{Anc}(w_1, w_2))} \mathrm{Dist}(n, \mathrm{parent}(n))$$

where Anc(w1, w2) is the closest common ancestor of nodes w1 and w2, and path(w1, w2) is the set of all nodes on the shortest path between nodes w1 and w2 in the hierarchical network.

5. Obtain the improved concept semantic similarity computation model.

[Equation only partly recoverable: sim(w1, w2) is defined from Dist(w1, w2) and θ.]

where θ is the amplification factor.
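Steps 3 and 4 can be sketched in Python under two assumptions: the length of an edge is η / weight(c, p), and the distance between two concepts is the sum of edge lengths from each concept up to their closest common ancestor. The sketch further assumes each node has a single parent (a tree), which is a simplification of a general ontology graph. The node names and weights are hypothetical.

```python
def ancestors(node, parent):
    """Return the chain node, parent(node), ... up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def weighted_distance(w1, w2, parent, weight, eta=1.0):
    """Sum eta / weight(n, parent(n)) along both paths to the common ancestor."""
    chain1 = ancestors(w1, parent)
    chain2 = ancestors(w2, parent)
    in_chain2 = set(chain2)
    # Closest common ancestor: first node of chain1 that also lies on chain2.
    common = next(n for n in chain1 if n in in_chain2)

    def climb(chain):
        total = 0.0
        for n in chain:
            if n == common:
                break
            total += eta / weight[(n, parent[n])]
        return total

    return climb(chain1) + climb(chain2)


# Hypothetical fragment of an ontology with hypothetical edge weights.
parent = {"Linked Stack": "Stack", "Stack": "Linear Structure",
          "Queue": "Linear Structure"}
weight = {("Linked Stack", "Stack"): 0.5,
          ("Stack", "Linear Structure"): 0.25,
          ("Queue", "Linear Structure"): 0.5}
dist = weighted_distance("Linked Stack", "Queue", parent, weight)
```

With η = 1 the three edges on the path have lengths 2, 4 and 2, so the distance is 8; a low-weight edge stretches the path and lowers the resulting similarity.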


Phase 1

• Evaluation

Tools: Protégé 3.4

[Figure: main interface of Protégé 3.4]


Phase 1

[Figure: ontology of linear structure. Node labels: Linear Structure, Linear List, Stack, Sequential Stack, Linked Stack, Queue, Circular Queue, Linked Queue, Sequential storage structure, Linked storage structure, String, Static Allocation, Dynamic Allocation, Circular Linked List, Single Linked List, Double Linked List, Double Linked Circular List. Numeric labels as extracted, with their positions in the figure lost: 7.03, 9.02, 8.92, 8.41, 8.5, 8.61, 9.03, 9.31, 9.81, 9.5, 9.61, 10.46, 11.46, 13.46, 10.36, 12.46, 10.86.]


Phase 1

Experimental result

| Concept | Improved computation model | Traditional model: content-based | Traditional model: distance-based | Expert experience |
|---|---|---|---|---|
| Sim(Linear structure, Linear List) | 91.3% | 62.9% | 87.6% | 90% |
| Sim(Linear structure, Stack) | 90.5% | 81.2% | 83.2% | 88% |
| Sim(Linear structure, String) | 89.9% | 81.1% | 78.1% | 85% |
| Sim(Linear structure, Queue) | 84.2% | 80.5% | 78.3% | 83% |
| Sim(Linear structure, Linked Stack) | 80.3% | 86.5% | 87.6% | 78% |
| Sim(Linear structure, Sequential Stack) | 77.4% | 63.3% | 78.5% | 77% |
| Sim(Linear structure, Sequential storage) | 76.3% | 80.1% | 64.6% | 75% |
| Sim(Linear structure, Linked storage) | 68.2% | 73.6% | 62.8% | 70% |
| Sim(Linear structure, Circular Queue) | 67.8% | 62.7% | 70.3% | 68% |
| Sim(Linear structure, Linked Queue) | 65.3% | 59.3% | 63.6% | 66% |
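One way to summarize these results is the mean absolute deviation of each model from the expert values. The Python sketch below is an illustrative reading of the data, not an analysis from the presentation; it assumes the column order improved model, content-based, distance-based, expert experience, and the names `rows` and `mean_abs_error` are hypothetical.

```python
# Each row: (concept, improved, content-based, distance-based, expert), in percent.
rows = [
    ("Linear List", 91.3, 62.9, 87.6, 90),
    ("Stack", 90.5, 81.2, 83.2, 88),
    ("String", 89.9, 81.1, 78.1, 85),
    ("Queue", 84.2, 80.5, 78.3, 83),
    ("Linked Stack", 80.3, 86.5, 87.6, 78),
    ("Sequential Stack", 77.4, 63.3, 78.5, 77),
    ("Sequential storage", 76.3, 80.1, 64.6, 75),
    ("Linked storage", 68.2, 73.6, 62.8, 70),
    ("Circular Queue", 67.8, 62.7, 70.3, 68),
    ("Linked Queue", 65.3, 59.3, 63.6, 66),
]


def mean_abs_error(column):
    """Mean absolute difference between a model column and the expert column."""
    return sum(abs(row[column] - row[4]) for row in rows) / len(rows)


mae = {
    "improved": mean_abs_error(1),
    "content-based": mean_abs_error(2),
    "distance-based": mean_abs_error(3),
}
```

On these ten concept pairs the improved model deviates from expert experience by about 1.66 percentage points on average, against about 8.32 for the content-based model and about 5.22 for the distance-based model.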


Phase 2 - Basic idea

• Basic idea

-Assumptions:

1. Every Web page can be categorized and marked with a main category, denoted by C.

2. Categories are related to one another to different degrees, expressed by the category similarity degree S.

-Every Web page has a related content topic, which belongs to a certain category of knowledge or has a certain category of features.

-These categories can be further subdivided into more specialized and specific subcategories. Categories are divided according to the content topic of Web pages, not their properties.


Phase 2 - Basic idea

-The topic of one Web page may be related to category A, and also to category B and category C to different degrees. This relevance can be divided into:

1. Directly topic-related: Rd{a, b, c, …}

-The content of a Web page may include several categories of topics, so there is a directly topic-related relation between this Web page and those topics.

2. Indirectly topic-related: Ri{a, b, c, …}

-Relations already exist between categories. If Web page W is determined to be related to category A, and the relevance between category A and category B is very high, then W can be considered to have indirect topic relevance to category B as well. Its relevance to B is determined jointly by the relevance between category A and category B and the relevance between Web page W and category A.

The vector sum of the direct and indirect topic categories of a Web page is its topic category vector:

R{a, b, c, …} = Rd{a, b, c, …} + Ri{a, b, c, …}
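A Python sketch of how the topic category vector could be assembled as the sum of the direct and indirect vectors. The presentation only says that the indirect relevance to B depends jointly on the page's relevance to A and on the similarity between A and B; the product rule and the choice of the strongest path used below are assumptions of this sketch, as are the category names and values.

```python
def topic_vector(direct, similarity):
    """Return R = Rd + Ri for one Web page, component by component."""
    indirect = {}
    for b in direct:
        # Assumed rule: relevance to B inherited through category A is
        # Rd[A] * S[A][B]; only the strongest such path is kept.
        indirect[b] = max(
            (direct[a] * similarity[a][b] for a in direct if a != b),
            default=0.0,
        )
    return {c: direct[c] + indirect[c] for c in direct}


# Hypothetical page that is directly related to category A only.
direct = {"A": 0.8, "B": 0.0, "C": 0.0}
similarity = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.9, "C": 0.2},
    "C": {"A": 0.1, "B": 0.2},
}
R = topic_vector(direct, similarity)
```

Category B is highly similar to A, so the page gains a sizeable indirect component for B (0.8 * 0.9) and only a small one for C (0.8 * 0.1).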


Phase 2 - Pre-categorization

• Pre-categorization

[Figure: process of pre-categorization]


Phase 2 - Category Selective Mechanism

• Category Selective Mechanism

-Because the number of Web pages is huge and growing explosively, pre-processing for the search process is particularly important. If the scope of candidate results can be narrowed before all Web pages are searched, retrieval efficiency improves.

-Idea of selective mechanism:

Select a keyword similarity coefficient k1 and a category similarity coefficient k2 (0 < k1 < k2 < 1). When users retrieve a keyword, the search engine first selects the categories whose similarity lies between k1 and 1 according to the category vector of the retrieval keywords. It then selects the Web pages whose category similarity lies between k2 and 1 according to the category vector of the Web pages.


Phase 2 - Category Selective Mechanism

-The two-level categorization selective mechanism compares the search scope derived from the keywords with the category similarity of Web pages, and discards the Web pages whose category differs greatly.

-Because users hardly pay attention to low-ranked results when browsing what a search engine returns, discarding these Web pages does not affect search performance from the users' point of view.

-The number of Web pages is greatly reduced, so the overhead of the iterative rank computation is reduced as well.
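The two-level selection can be sketched in Python as a pair of threshold filters. The sketch assumes category vectors are stored as dictionaries from category name to similarity, and that a page passes the second level when at least one selected category reaches k2 in its vector; that matching rule, the function name `select_pages`, the threshold values, and the sample data are assumptions.

```python
def select_pages(keyword_vector, page_vectors, k1=0.3, k2=0.6):
    """Keep pages whose category vector matches the keyword's categories."""
    if not 0 < k1 < k2 < 1:
        raise ValueError("thresholds must satisfy 0 < k1 < k2 < 1")
    # Level 1: categories whose similarity to the keyword is at least k1.
    selected = {c for c, s in keyword_vector.items() if s >= k1}
    # Level 2: pages whose similarity to a selected category is at least k2.
    return [
        page
        for page, vector in page_vectors.items()
        if any(vector.get(c, 0.0) >= k2 for c in selected)
    ]


# Hypothetical category vectors produced by pre-categorization.
keyword_vector = {"data structures": 0.9, "cooking": 0.05}
page_vectors = {
    "p1": {"data structures": 0.8},
    "p2": {"cooking": 0.9},
    "p3": {"data structures": 0.4},
}
candidates = select_pages(keyword_vector, page_vectors)
```

Only `p1` survives both levels here, so the iterative rank computation would run on one page instead of three.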


Phase 2 - Modeling

• Consider the degree to which the links of Web pages relate to topic categories: links whose target belongs to the same or a similar topic category are considered more important.


Phase 2 - Modeling

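• Procedure of modeling

1. Take the sum of the relative differences between Web pages uj and v over every component of the category vector as the difference degree of the two Web page categories.

Compute the difference degree between the category attributes of each Web page uj that refers to Authority Web page v and those of v:

$$D_{u_j} = \frac{|C_{u_j 1} - C_{v 1}|}{C_{v 1}} + \frac{|C_{u_j 2} - C_{v 2}|}{C_{v 2}} + \frac{|C_{u_j 3} - C_{v 3}|}{C_{v 3}} + \ldots + \frac{|C_{u_j n} - C_{v n}|}{C_{v n}} = \sum_{i=1}^{n} \frac{|C_{u_j i} - C_{v i}|}{C_{v i}}$$

Compute the difference degree between the category attributes of each Web page vj that is referred to by Hub Web page u and those of u:

$$D_{v_j} = \frac{|C_{v_j 1} - C_{u 1}|}{C_{u 1}} + \frac{|C_{v_j 2} - C_{u 2}|}{C_{u 2}} + \frac{|C_{v_j 3} - C_{u 3}|}{C_{u 3}} + \ldots + \frac{|C_{v_j n} - C_{u n}|}{C_{u n}} = \sum_{i=1}^{n} \frac{|C_{v_j i} - C_{u i}|}{C_{u i}}$$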


Phase 2 - Modeling

where n is the total number of categories;

Cuji and Cvji are the vector attributes of Web pages uj and vj on category i;

Cvi and Cui are the vector attributes of Web pages v and u on category i;

the values of Cuji, Cvji, Cui and Cvi are generated in the pre-categorization process of Web pages;

|Cuji-Cvi| is the difference degree on category component i between a Web page uj that refers to Authority Web page v and Web page v;

|Cvji-Cui| is the difference degree on category component i between a Web page vj that is referred to by Hub Web page u and Web page u;

|Cuji-Cvi|/Cvi and |Cvji-Cui|/Cui are the difference degrees expressed as ratios;

∑|Cuji-Cvi|/Cvi and ∑|Cvji-Cui|/Cui are the sums of the difference degree ratios over all category components.

2. Normalization:

$$\sum_{j} D_{u_j} = 1, \qquad \sum_{j} D_{v_j} = 1$$

3. Compute the category similarities of the Web pages connected by links according to the category difference degrees:

$$C_{u_j} = 1 - D_{u_j}, \qquad C_{v_j} = 1 - D_{v_j}$$

4. Obtain the improved HITS algorithm:

$$a(v) = h(u_1) C_{u_1} + h(u_2) C_{u_2} + h(u_3) C_{u_3} + \ldots + h(u_n) C_{u_n} = \sum_{j} h(u_j) C_{u_j}$$

$$h(u) = a(v_1) C_{v_1} + a(v_2) C_{v_2} + a(v_3) C_{v_3} + \ldots + a(v_n) C_{v_n} = \sum_{j} a(v_j) C_{v_j}$$
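A Python sketch of the category-weighted HITS iteration, under these assumptions: the difference degree is D = Σ_i |C_ref,i - C_page,i| / C_page,i, the D values of one page's neighbours are normalized to sum to 1, the category similarity is C = 1 - D, and the updates are a(v) = Σ_j h(u_j) C_uj and h(u) = Σ_j a(v_j) C_vj. The L2 normalization of scores, the function names, and the sample category vectors are hypothetical additions. All category components are assumed to be positive.

```python
import math


def difference_degree(c_other, c_page):
    """Sum of |C_other_i - C_page_i| / C_page_i over the category components."""
    return sum(abs(o - p) / p for o, p in zip(c_other, c_page))


def link_similarities(neighbours, page, vectors):
    """Category similarity C = 1 - normalized D for each neighbour of page."""
    diffs = [difference_degree(vectors[n], vectors[page]) for n in neighbours]
    total = sum(diffs)
    if total == 0:
        # All neighbours have the same categories as the page (or none exist).
        return {n: 1.0 for n in neighbours}
    # Note: a single neighbour with D > 0 normalizes to D = 1, hence C = 0.
    return {n: 1.0 - d / total for n, d in zip(neighbours, diffs)}


def categorized_hits(out_links, vectors, iterations=30):
    pages = list(out_links)
    in_links = {p: [] for p in pages}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)
    c_in = {v: link_similarities(in_links[v], v, vectors) for v in pages}
    c_out = {u: link_similarities(out_links[u], u, vectors) for u in pages}
    a = dict.fromkeys(pages, 1.0)
    h = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        a = {v: sum(h[u] * c_in[v][u] for u in in_links[v]) for v in pages}
        h = {u: sum(a[v] * c_out[u][v] for v in out_links[u]) for u in pages}
        norm_a = math.sqrt(sum(x * x for x in a.values())) or 1.0
        norm_h = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {p: x / norm_a for p, x in a.items()}
        h = {p: x / norm_h for p, x in h.items()}
    return a, h


# Hypothetical two-category vectors: u1 resembles v1 more than u2 does.
vectors = {"u1": (0.7, 0.3), "u2": (0.5, 0.5), "v1": (0.8, 0.2), "v2": (0.2, 0.8)}
out_links = {"u1": ["v1", "v2"], "u2": ["v1", "v2"], "v1": [], "v2": []}
weights_v1 = link_similarities(["u1", "u2"], "v1", vectors)
authority, hub = categorized_hits(out_links, vectors)
```

For page v1 the link from u1 receives weight 0.75 and the link from u2 receives 0.25, so a hub of a similar category contributes three times as much authority as a dissimilar one.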


Phase 2 - Evaluation

• The category information of each Web page is integrated into the HITS algorithm, and the category relevance of links is considered when computing the link structure of Web pages.

• This overcomes the shortcoming of the traditional HITS algorithm, which treats every link between Web pages equally.

• The category information of the Web pages themselves is combined with link information as a correcting parameter of the computation, which improves the accuracy of the HITS algorithm.


Phase 2 - Evaluation

• Categorization information also modifies the base set of the HITS algorithm through the pre-categorization mechanism.

• Because the HITS algorithm itself needs to perform iterative computation, pre-categorization can greatly reduce computation overhead by reducing the number of Web pages, which improves the performance of the HITS algorithm.

• Meanwhile, the computation overhead of the categorization process is O(n), where n is the total number of categories. Because n is always smaller than the number of Web pages, the improved categorization-based HITS algorithm has lower algorithmic complexity than the HITS algorithm.


Goals I have achieved

• Gained an insight into the principles, problems and improvements of traditional link structure-based rank algorithms for search engines.

• Gained a full understanding of the principles and weaknesses of the three traditional domain ontology-based semantic similarity computation models, and developed an improved computation model based on five decision factors.

• Developed a category-based HITS algorithm that takes users' surfing behavior into account.


References

• Berners-Lee, T, Hendler, J & Lassila, O 2001, ‘The semantic web’, Scientific American, 284(5):34-43.

• Bharat, K & Henzinger, MR 1998, ‘Improved Algorithm for Topic Distillation in a Hyperlinked Environment’, In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval.

• Brin, S & Page, L 1998, ‘The anatomy of a large-scale hypertextual web search engine’, In Proceedings of the WWW7 Conference, pages 107-117.

• Broder, A, Kumar, R & Maghoul, F 2000, ‘Graph structure in the web: experiments and models’, In Proceedings of the Ninth International World-Wide Web Conference.

• Budanitsky, A & Hirst, G 2004, ‘Evaluating WordNet-based measures of lexical semantic relatedness’, Computational Linguistics, 1(1):1-49.

• Chakrabarti, S, Dom, B & Gibson, D 1999, ‘Mining the Link Structure of the World Wide Web’, IEEE Computer, 32(8).

• Chakrabarti, S, Dom, B & Raghavan, P 1998, ‘Automatic resource compilation by analyzing hyperlink structure and associated text’, In Proceedings of the Seventh International World-Wide Web Conference.

• Chakrabarti, S, van den Berg, M & Dom, B 1999, ‘Focused crawling: A new approach to topic-specific web resource discovery’, In Proceedings of the Eighth International World-Wide Web Conference.

• Dean, J & Henzinger, MR 1999, ‘Finding related pages on the Web’, In Proceedings of the WWW8 Conference, pages 389-401.

• Gan, KW & Wong, PW 2000, ‘Annotation information structures in Chinese texts using HowNet’, Hong Kong: Second Chinese Language Processing Workshop, 85-92.

• Gruber, T 1993, ‘Ontolingua: A translation approach to portable ontology specification’, Knowledge Acquisition, 5(2), pp. 199-200.

• Haveliwala, TH 1999, ‘Efficient computing of PageRank’, Stanford Database Group Technical Report.

• Haveliwala, TH 2002, ‘Topic-sensitive PageRank’, Proceedings of the Eleventh International World Wide Web Conference.

• Kleinberg, J 1999, ‘Authoritative sources in a hyperlinked environment’, Journal of the ACM, 46(5):604-632.

• Lawrence, S & Giles, CL 1998, ‘Context and page analysis for improved Web search’, IEEE Internet Computing, pages 38-46.

• Ling, Z & Fanyuan, M 2004, ‘Accelerated evaluation algorithm: a new method to improve Web structure mining quality’, Computer Research and Development, 41(1):98-103.

• Mendelzon, AO & Rafiei, D 2000, ‘What do the neighbors think? Computing web page reputations’, IEEE Data Engineering Bulletin, pages 9-16.

• Ming, L, Jianyong, W & Baojue, C 2001, ‘Improved Relevance Ranking in WebGather’, J. Comput. Sci. & Technol., Vol. 16, No. 5.

• Motwani, R & Raghavan, P 1995, ‘Randomized Algorithms’, Cambridge University Press.

• Neches, R, Fikes, R, Finin, T, Gruber, T, Patil, R, Senator, T & Swartout, WR 1991, ‘Enabling technology for knowledge sharing’, AI Magazine, 12(3), pp. 36-56.

• Page, L, Brin, S & Motwani, R 1998, ‘The PageRank citation ranking: Bringing order to the Web’, Technical report, Computer Science Department, Stanford University.

• Qianhong, P & Ju, W 1999, ‘Attribute theory-based text similarity computation’, Computer Journal, 22(6):651-655.

• Qun, L & Sujian, L 2002, ‘CNKI-based word semantic similarity computation’, Computer Linguistics and Chinese Information Processing, 2002(7):59-76.

• Steichen, O & Daniel-Le Bozec, C 2005, ‘Computation of semantic similarity within an ontology of breast pathology to assist inter-observer consensus’, Computers in Biology and Medicine, (4):1-21.

• Studer, R, Benjamins, VR & Fensel, D 1998, ‘Knowledge engineering: principles and methods’, Data and Knowledge Engineering, 25, pp. 161-197.

• Sujian, L 2002, ‘Research on semantic computation-based sentence similarity’, Computer Engineering and Application, 38(7):75-76.

• Xiaofeng, Z, Xinting, T & Yongsheng, Z 2006, ‘Ontology technology-based Internet intelligent search research’, Computer Engineering and Design, 27(7):1194-1197.

• Xuelong, W, Xuemei, Z & Xiangwei, L 2006, ‘Application and improvement of time parameter in Hits algorithm’, Modern Computer.

• Zhihong, D & Shiwei, T 2002, ‘Ontology research review’, Beijing University Journal (Natural Science Edition), (5):730-738.


Questions?

Thank you for listening!
