
Hyperlink-based search algorithms: PageRank and HITS

Shatakirti
MT2011096

Contents

1 Link Analysis and Web Search
2 Hyperlink-Induced Topic Search (HITS)
   2.1 Introduction
   2.2 Motivation behind developing the HITS algorithm
   2.3 Authorities and Hubs
   2.4 HITS Algorithm
   2.5 HITS Implementation
   2.6 Advantages and Disadvantages of HITS
3 PageRank
   3.1 Introduction
   3.2 PageRank algorithm
   3.3 PageRank Implementation
   3.4 Problems with the algorithm and their modifications
      3.4.1 Rank Sink
      3.4.2 Dangling Links
   3.5 Advantages and Disadvantages of PageRank
References

List of Figures

1 Hubs and Authorities
2 PageRank Example
3 Rank Sink/Page Cycles
4 Rank Sink/Page Cycles
5 Link-Farms


1 Link Analysis and Web Search

Back in the 1990s, web search was based purely on the number of occurrences of a word in a document; that is, only on the relevance of a document to the query. Over time, however, the number of webpages grew at an enormous rate, and simply retrieving the relevant documents was no longer sufficient, since the relevant documents for a query may number in the millions. Instead, results had to be ordered by decreasing importance, with the most important document first. Content similarity was another major issue: it is easily spammed, since a page owner can repeat certain words to boost the page's ranking and make it appear relevant to a large number of queries. These problems were addressed by analyzing the link structure of the web. Hyperlinks in a document provide a valuable source of information for information retrieval. Link analysis has been successful for deciding which webpages are important; it has also been used for categorizing webpages, finding pages related to a given page, and detecting duplicated websites. During 1997-1998, the two most famous and influential link-analysis algorithms were proposed: PageRank and HITS. Both exploit the hyperlink structure of the web to rank pages. We will look at these algorithms in detail here.

2 Hyperlink-Induced Topic Search (HITS)

2.1 Introduction

Around the same time that the PageRank algorithm was being developed by Sergey Brin and Larry Page, Jon Kleinberg, a professor in the Department of Computer Science at Cornell, came up with his own solution to the web search problem. He developed an algorithm that made use of the link structure of the web in order to discover and rank pages relevant to a particular topic. HITS (Hyperlink-Induced Topic Search) is now part of the Ask search engine (www.Ask.com).

2.2 Motivation behind developing the HITS algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm was developed by looking at the way humans analyze a search, rather than at the way a machine answers a query by scanning a set of documents and returning the matches. For example, if we want to buy a car, we might type in "top automobile makers in the world", intending to get the list of top-ranked car makers and their official websites. When we ask a person this question, we


probably expect them to understand that by "automobile" we actually mean cars, even though in general the word covers other vehicles too. If we instead submit this query to a computer, the results would be quite different: the computer simply counts all the occurrences of the given words in a set of documents and applies no such intelligence. Hence the search results match what we typed, but not what we were expecting. The conclusion is that, even if finding pages that contain the query words is the right starting point, a different ranking system is needed in order to find the pages that are authoritative for a given query.

2.3 Authorities and Hubs

A page i is called an authority for the query "automobile makers" if it contains valuable information on the subject. Official websites of car manufacturers, such as www.bmw.com, HyundaiUSA.com and www.mercedes-benz.com, would be authorities for this search. Commercial websites selling cars might be authorities on the subject as well. These are the pages that are truly relevant to the given query, and the ones the user expects back from the search engine. However, there is a second category of pages relevant to the process of finding the authoritative pages, called hubs. Their role is to advertise the authoritative pages: they contain useful links towards them. In other words, hubs point the search engine in the "right direction". In real life, when you buy a car, you are more inclined to purchase it from a dealer that your friend recommends. Following the analogy, the authority in this case would be the car dealer, and the hub would be your friend: you trust your friend, therefore you trust what your friend recommends. On the World Wide Web, hubs for our query about automobiles might be pages that contain rankings of cars, blogs where people discuss the cars they purchased, and so on.


Figure 1: Hubs and Authorities

2.4 HITS Algorithm

Let us assume that a webpage i has an authority score a_i and a hub score h_i. Let \xi denote the set of all directed edges in the web graph, and let e_{ij} represent the directed edge from webpage i to webpage j. Initially, every authority score a_i and every hub score h_i is set to 1. The HITS algorithm then updates the scores with the following summations:

a_i^{(k)} = \sum_{j \,:\, e_{ji} \in \xi} h_j^{(k-1)} \quad \text{and} \quad h_i^{(k)} = \sum_{j \,:\, e_{ij} \in \xi} a_j^{(k)} \qquad (1)

that is, a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The above equations can be represented with the adjacency matrix L defined by:

L_{ij} = 1 if e_{ij} \in \xi, and L_{ij} = 0 otherwise.

The authority and hub score summations can then be written as

\vec{a}^{(k)} = L^T \vec{h}^{(k-1)} \quad \text{and} \quad \vec{h}^{(k)} = L\,\vec{a}^{(k)} \qquad (2)

where \vec{a}^{(k)} and \vec{h}^{(k)} are n x 1 vectors comprising the authority and hub scores, respectively, for each of the n nodes (webpages) in the graph. These updates are repeated until \vec{a}^{(k)} and \vec{h}^{(k)} converge for some k, with both vectors normalized every time the scores are updated. By substituting the expressions for \vec{h}^{(k)} and \vec{a}^{(k)} from each other's equation in (2), we get:

\vec{a}^{(k)} = L^T L\,\vec{a}^{(k-1)} \quad \text{and} \quad \vec{h}^{(k)} = L L^T\,\vec{h}^{(k-1)} \qquad (3)

The iteration in equation (3) is the power iteration method for finding the dominant eigenvectors of L^T L and L L^T. The matrix L^T L is called the authority matrix, as it is used to find the authority scores, and L L^T is called the hub matrix, as its dominant eigenvector gives the final hub scores. Computing the values of \vec{a} and \vec{h} thus amounts to solving the equations:

L^T L\,\vec{a} = \lambda_{max}\vec{a} \quad \text{and} \quad L L^T\,\vec{h} = \lambda_{max}\vec{h} \qquad (4)

where \lambda_{max} is the largest eigenvalue of L^T L and L L^T. Each power iteration step involving \vec{a}^{(k)} or \vec{h}^{(k)} must be normalized. One way to normalize is:

\vec{a}^{(k)} \leftarrow \vec{a}^{(k)} / m(\vec{a}^{(k)}) \quad \text{and} \quad \vec{h}^{(k)} \leftarrow \vec{h}^{(k)} / m(\vec{h}^{(k)}) \qquad (5)

where m(\vec{x}) is the signed element of \vec{x} having the maximum magnitude. A small sketch of this power iteration is given below.
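To make the iteration concrete, here is a minimal Python sketch of the power iteration in equations (2)-(5). The graph representation, the function name hits, the tolerance and the toy example are illustrative assumptions, not part of the original algorithm description.

```python
from typing import Dict, List, Tuple

def hits(graph: Dict[str, List[str]],
         max_iter: int = 100,
         tol: float = 1e-8) -> Tuple[Dict[str, float], Dict[str, float]]:
    """graph maps each page to the list of pages it links to (the edges e_ij)."""
    nodes = list(graph)
    auth = {n: 1.0 for n in nodes}  # a_i^(0) = 1
    hub = {n: 1.0 for n in nodes}   # h_i^(0) = 1

    for _ in range(max_iter):
        # a_i^(k) = sum of h_j^(k-1) over pages j that link to i  (a = L^T h)
        new_auth = {n: 0.0 for n in nodes}
        for j, outlinks in graph.items():
            for i in outlinks:
                new_auth[i] += hub[j]
        # h_j^(k) = sum of a_i^(k) over pages i that j links to  (h = L a)
        new_hub = {j: sum(new_auth[i] for i in graph[j]) for j in nodes}

        # Normalize by the element of maximum magnitude, as in equation (5).
        a_m = max(new_auth.values(), key=abs) or 1.0
        h_m = max(new_hub.values(), key=abs) or 1.0
        new_auth = {n: v / a_m for n, v in new_auth.items()}
        new_hub = {n: v / h_m for n, v in new_hub.items()}

        converged = max(abs(new_auth[n] - auth[n]) for n in nodes) < tol
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub

# Hypothetical toy graph: "h1" and "h2" act as hubs, "a1" and "a2" as authorities.
toy = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
print(hits(toy))
```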

2.5 HITS Implementation

Once the user enters a query, the algorithm first constructs a neighborhood graph N associated with the terms present in the query; the authority and hub scores for the documents in N are then computed as described in equation (3) and returned to the user. N can be constructed using the inverted index. The graph N is then extended by adding the nodes that either point to nodes in N or are pointed to by nodes in N. This graph expansion also brings in related documents containing synonyms of the query terms. Sometimes, however, a node may have a very large in-degree or out-degree; in such cases one can restrict the number of nodes brought in through a node that contains the query terms or their synonyms. Once the graph N has been constructed for the given query, the adjacency matrix L is built. The dominant eigenvectors of L^T L and L L^T then give the authority and hub scores, from which the most relevant webpages are shown to the user first. For ranking webpages, HITS does not use the entire web; instead it uses the much smaller neighborhood graph, which reduces the computational cost. A further reduction in cost is achieved by computing only one dominant eigenvector, of either L^T L or L L^T, and obtaining the other score vector by a single multiplication with L or L^T; for example, the authority vector \vec{a} can be calculated from the hub vector \vec{h} by \vec{a} = L^T \vec{h}. A small sketch of the construction of N is given below.
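The construction of the neighborhood graph N can be sketched as follows. The inverted index, the in-link and out-link indexes, and the per-node expansion cap are assumptions made purely for illustration; a real engine would obtain these from its own index structures.

```python
from typing import Dict, List, Set

def neighborhood_graph(query_terms: List[str],
                       inverted_index: Dict[str, Set[str]],  # term -> pages containing it
                       outlinks: Dict[str, List[str]],       # page -> pages it links to
                       inlinks: Dict[str, List[str]],        # page -> pages linking to it
                       max_per_node: int = 50) -> Set[str]:
    # Root set: pages that contain at least one of the query terms.
    root: Set[str] = set()
    for term in query_terms:
        root |= inverted_index.get(term, set())

    # Expansion: add pages pointed to by, or pointing to, the root set,
    # capping very large in- or out-degrees as described above.
    expanded = set(root)
    for page in root:
        expanded |= set(outlinks.get(page, [])[:max_per_node])
        expanded |= set(inlinks.get(page, [])[:max_per_node])
    return expanded
```

The adjacency matrix L is then formed over this node set and handed to the power iteration sketched in section 2.4.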


2.6 Advantages and Disadvantages of HITS

Advantages

The main strength of HITS is its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages. Building the adjacency matrix from the relatively small neighborhood graph and applying power iterations does not present a significant computational burden.

Disadvantages

1. The main disadvantage of HITS is that the neighborhood graph must be built "on the fly", i.e. the authority and hub rankings are query dependent. Minor changes to the web can significantly change the authority and hub scores.

2. Another major disadvantage is that it suffers from topic drift, i.e. the neighborhood graph N may contain nodes that have high authority scores for a topic unrelated to the original query.

3. HITS cannot detect advertisements. Many sites have commercial advertising sponsors, and friendly exchanges of links between sites likewise reduce the accuracy of the HITS algorithm.

4. The algorithm can easily be spammed, since it is quite easy to add out-links to one's own page.

5. The query-time evaluation is slow: collecting the root set, expanding it and performing the eigenvector computation are all expensive tasks.

3 PageRank

3.1 Introduction

PageRank, the second link-analysis algorithm from 1998, is at the heart of Google. Both PageRank and Google were conceived by Sergey Brin and Larry Page while they were computer science graduate students at Stanford University. Brin and Page use a recursive scheme similar to Kleinberg's HITS algorithm, but the PageRank algorithm produces a ranking that is independent of a user's query. Their original idea was that a page is important if it is pointed to by other important pages. That is, they decided that the importance of


your page (its PageRank score) is determined by summing the PageRanks of all the pages that point to yours. So, basically, PageRank is an algorithm that assigns a weight value to each webpage based on how many pages link to it and how important those links are.

3.2 PageRank algorithm

Let the rank of a webpage p_i be given by PR(p_i). Suppose that the page p_i has a set of pages M(p_i) linking to it; these are essentially its citations. Let L(p_j) be the number of outbound links on page p_j, and let N be the total number of pages. The PageRank of a page p_i is then given by:

PR(p_i) = \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

The above equation simply sums up, over all pages pointing to our page, the PageRank of each such page divided by its number of outbound links. A direct translation of this summation is sketched below.
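As a direct, illustrative translation of the summation (the page names, scores and link counts below are hypothetical):

```python
def pagerank_of(page, scores, inlinks, outdegree):
    """PR(p_i) = sum over p_j in M(p_i) of PR(p_j) / L(p_j)."""
    return sum(scores[q] / outdegree[q] for q in inlinks[page])

# Hypothetical data: two pages link to "mysite".
scores = {"bigportal": 1.0, "blog": 0.2}     # current PR values of the citing pages
inlinks = {"mysite": ["bigportal", "blog"]}  # M(mysite)
outdegree = {"bigportal": 1000, "blog": 4}   # L(p_j): outbound link counts
print(pagerank_of("mysite", scores, inlinks, outdegree))  # 1.0/1000 + 0.2/4 = 0.051
```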

3.3 PageRank Implementation

In the mathematical model of PageRank, if an important page points to some pages, the PageRank of this important page is distributed proportionally to all the pages it points to. In other words, if YAHOO! points to your web page, that's good, but you shouldn't receive the full weight of YAHOO! because it points to many other places: if YAHOO! points to 999 other pages in addition to yours, then you should only get credit for 1/1000 of YAHOO!'s PageRank. Intuitively, then, a page can have a high PageRank if many pages point to it, or if a few pages with high PageRank point to it.


Figure 2: PageRank Example (courtesy: Wikipedia)

The above figure shows the mathematical PageRanks of a simple network, expressed as percentages (Google uses a logarithmic scale). Page C has a higher PageRank than Page E, even though there are fewer links to C; the one link to C comes from an important page and hence is of high value. If web surfers who start on a random page have an 85% likelihood of choosing a random link from the page they are currently visiting, and a 15% likelihood of jumping to a page chosen at random from the entire web, they will reach Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 0.85.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. In the presence of damping, Page A effectively links to all pages in the web, even though it has no outgoing links of its own. This random-surfer reading of PageRank can also be checked by simulation, as sketched below.
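A minimal Monte-Carlo sketch of the random surfer described above: with probability 0.85 follow a random outlink, otherwise jump to a page chosen uniformly at random, and take the visit frequencies as approximate PageRanks. The toy graph, step count and seed are hypothetical and do not reproduce the exact network of Figure 2.

```python
import random
from collections import Counter

def simulate_surfer(links, steps=200_000, d=0.85, seed=0):
    """links[p] is the list of pages p links to; returns visit frequencies."""
    rng = random.Random(seed)
    pages = list(links)
    visits = Counter()
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])  # follow a random outlink (probability d)
        else:
            page = rng.choice(pages)        # get bored and jump anywhere (probability 1 - d)
    return {p: visits[p] / steps for p in pages}

# Hypothetical toy graph, not the network from Figure 2.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "C"]}
print(simulate_surfer(toy))
```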

The PageRank is updated once every month and does not require any analysis of the actual (semantic) content of the web or of a user's queries. Google first finds the webpages that semantically match the user's query


and then orders the resulting set of pages by their PageRanks. The computation of PageRank is itself a hugely challenging task. Computing PageRank with the power iteration method may involve the following:

1. Parallelization of the sparse matrix-vector multiplications.

2. Partitioning of the iteration matrix into a block of webpages with outlinks and another block of those without outlinks.

3. Speeding up the convergence of the distribution vector.

Updating the PageRank is a big implementation concern, since the web is not static: the PageRank calculated now may not be valid a few days later. Extensive research is being done on using the old scores of a page to calculate its new PageRank without having to recompute everything from scratch. The changing link structure and the addition and deletion of webpages must also be taken care of.

3.4 Problems with the algorithm and their modifications

3.4.1 Rank Sink

Problem Description:

Take the following link structure for example, where pages 2, 3, 4 and 5 form a path and an external site (page 1) contributes a PageRank of 1 to the first page of the path (page 2).

Figure 3: Rank Sink/Page Cycles

In the above case, all the pages in the path would have a PageRank of 1. Now, what if we connect the last page of the path (page 5) back to the first page (page 2) and construct a cycle, as shown in the figure below?


Figure 4: Rank Sink/Page Cycles

In this case, PageRank keeps accumulating in the cycle but is never distributed out of it. Because of this, all the pages in the cycle end up with a PageRank of ∞.

Solution to Rank Sink - Random Surfer Model

The PageRank algorithm assumes that there is a "random surfer" who starts at a web page chosen at random and keeps clicking on links, moving from one page to another and never clicking "back". Eventually the surfer gets bored and starts again at another random page. The probability that this random surfer visits a given page is its PageRank. The probability that the surfer keeps following links rather than getting bored and requesting another random page is called the damping factor, denoted by d (so the surfer jumps to a random page with probability 1 - d).

The damping factor is added to the equation, which now becomes:

PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where N is the total number of pages. Usually the damping factor d is set to 0.85. A small power-iteration sketch of this equation is given below.
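A minimal power-iteration sketch of the damped equation above; the convergence tolerance and the toy graph are illustrative assumptions.

```python
def pagerank(outlinks, d=0.85, max_iter=100, tol=1e-10):
    """outlinks[p] is the list of pages p links to; returns PR(p) for every page."""
    pages = list(outlinks)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}             # start from a uniform distribution
    for _ in range(max_iter):
        new = {p: (1 - d) / n for p in pages}    # the (1 - d) / N term
        for p, targets in outlinks.items():
            if targets:
                share = d * pr[p] / len(targets)  # d * PR(p_j) / L(p_j)
                for q in targets:
                    new[q] += share
        if sum(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr

# Hypothetical toy graph.
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy))
```

Note that a page with no outlinks simply loses its rank mass in this version; the dangling-link modification in section 3.4.2 addresses exactly that.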

3.4.2 Dangling Links

Problem Description

Suppose that there are some pages that do not have any outlinks; we call these dangling nodes. In the random surfer model above, we can see that when such dangling links appear, the random surfer gets stuck on these pages and their importance cannot be passed on to any other page.


In another case, if the main web graph has some disconnected components, a random surfer who started in one component can never reach the other component, so all pages in that other component would get an importance of 0.

Solution to Dangling Links

The damping factor d largely takes care of this situation; however, the algorithm is further modified slightly to solve the issue, as shown below:

PR(p_i) = \frac{1-d}{N} + d \left( \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} + \sum_{p_j \in D} \frac{PR(p_j)}{N} \right)

where D is the set of dangling pages, i.e. pages with no outlinks. The term \sum_{p_j \in D} PR(p_j)/N redistributes the rank of the pages p_j that have no outlinks. The factor 1 - d is the probability that the surfer quits the current page and "teleports" to a new one; since every page can be teleported to with equal probability, each page has a probability of 1/N of being chosen, and hence no page ends up with a PageRank of 0. Dangling links do not affect the ranking of any other page directly, so they can also simply be removed until all the PageRanks have been calculated. A small variation on the earlier sketch that implements this redistribution is given below.
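A small variation on the previous power-iteration sketch that implements this redistribution of dangling-page rank; again, the toy graph is hypothetical.

```python
def pagerank_with_dangling(outlinks, d=0.85, max_iter=100, tol=1e-10):
    """Like pagerank() above, but the rank of dangling pages is shared by all pages."""
    pages = list(outlinks)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Total rank currently sitting on dangling (no-outlink) pages.
        dangling_mass = sum(pr[p] for p in pages if not outlinks[p])
        new = {p: (1 - d) / n + d * dangling_mass / n for p in pages}
        for p, targets in outlinks.items():
            for q in targets:
                new[q] += d * pr[p] / len(targets)
        if sum(abs(new[p] - pr[p]) for p in pages) < tol:
            return new
        pr = new
    return pr

toy = {"A": ["B"], "B": ["C"], "C": [], "D": ["A", "B"]}  # page C is dangling
print(pagerank_with_dangling(toy))
```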

3.5 Advantages and Disadvantages of PageRank

Advantages of PageRank

1. The algorithm is robust against spam, since it is not easy for a webpage owner to add inlinks to his/her page from other important pages.

2. PageRank is a global measure and is query independent.

Disadvantages of PageRank

1. The major disadvantage of PageRank is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site.

2. PageRank can easily be increased by means of "link farms", as shown below. However, while indexing, the search engine actively tries to detect these schemes.

Link farms: 99 vertices point to vertex 1. As discussed above, the damping term guarantees every page a small minimum PageRank. In the two scenarios shown in the figure below, the PageRank of the main page (page 1) is very good even though the average quality of the pages is very poor.


Figure 5: Link-Farms

3. Another very effective way to raise one's own PageRank is to 'buy' a link on a page with a high PageRank. However, Google has publicly warned webmasters that if they are discovered doing either of the above, their links might be ignored in the future, or they might even be removed from Google's index.

References

[1] Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine".

[2] David Degen, "Google Page Rank - Algorithms for Data Base Systems (Fachseminar)", May 12, 2007.

[3] Monika Henzinger, "Link Analysis in Web Information Retrieval", Google Incorporated, Mountain View, California.

[4] Michael W. Berry and Murray Browne, "Understanding Search Engines: Mathematical Modeling and Text Retrieval".

[5] en.wikipedia.org/wiki/PageRank

[6] en.wikipedia.org/wiki/HITS_algorithm
