Authoritative Sources in a Hyperlinked Environment
Jon M. Kleinberg, ACM-SIAM Symposium, 1998
Krishna Venkateswaran
Basic Idea
R is grown into a set S so that it contains a rich collection of authoritative pages.
Include in S any page that is pointed to by a page in R, plus a limited number of the pages that point to a page in R.
R: root set, contains t results.
S: base set, generated by the algorithm.
S is used to determine the hubs and authorities.
Get a set of results for a query string from a text-based search engine.
Take the top t results and put them in a set R. Start S as a copy of R.
For every page p in R:
◦ Add all the pages that p points to into S.
◦ Add at most d of the pages that point to p into S.
The resulting set is S.
Result returned: the base set S, from which we compute the top authorities and hubs.
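The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `out_links` and `in_links` dictionaries are hypothetical stand-ins for whatever link index is available, and picking the first d in-linking pages in sorted order is an arbitrary choice made so the example is deterministic.

```python
def build_base_set(root, out_links, in_links, d):
    """Grow the root set R into the base set S.

    root      -- the top-t search results (the set R)
    out_links -- hypothetical dict: page -> pages it points to
    in_links  -- hypothetical dict: page -> pages that point to it
    d         -- maximum number of in-linking pages added per root page
    """
    base = set(root)  # S starts out as a copy of R
    for page in root:
        # Add every page that this root page points to.
        base.update(out_links.get(page, ()))
        # Add at most d pages that point to this root page.
        # sorted() only makes the choice of "any d pages" reproducible here.
        base.update(sorted(in_links.get(page, ()))[:d])
    return base


# Toy data (made up for illustration).
out_links = {"r1": ["a", "b"], "r2": ["b"]}
in_links = {"r1": ["h1", "h2", "h3"], "r2": ["h1"]}
S = build_base_set(["r1", "r2"], out_links, in_links, d=2)
```

With d=2, only two of the three pages pointing to r1 are admitted, which is how the parameter keeps a very popular root page from flooding S.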
Heuristics
To determine what pages to add to the set S.
Heuristic 1: Avoiding navigational links.
◦ Transverse links: links between pages with different domain names.
◦ Intrinsic links (navigational links): links between pages within a domain.
◦ Delete all intrinsic links.
Heuristic 2: Avoiding mass endorsements.
◦ Mass endorsements: a large number of pages in a domain pointing to a single page.
◦ Example: “This site is designed by …” and a link.
◦ Eliminate this by setting a parameter m and allowing only m pages from a single domain to point to a page.
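Both heuristics can be applied as a single pass over the link list. The sketch below is a minimal illustration, not the paper's implementation: it assumes links are given as (source URL, target URL) pairs, and it treats the host part of the URL as the "domain name", which is a simplification (for example, `www.example.com` and `example.com` would count as different domains).

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain(url):
    # Assumption: the URL's host stands in for the page's domain name.
    return urlparse(url).netloc.lower()


def filter_links(links, m):
    """Drop intrinsic links and cap links per source domain at m per target."""
    kept = []
    seen = defaultdict(int)  # (source domain, target URL) -> links kept so far
    for src, dst in links:
        if domain(src) == domain(dst):
            continue  # Heuristic 1: intrinsic (navigational) link
        key = (domain(src), dst)
        if seen[key] >= m:
            continue  # Heuristic 2: mass endorsement from one domain
        seen[key] += 1
        kept.append((src, dst))
    return kept


# Toy data (made up for illustration).
links = [
    ("http://a.com/1", "http://a.com/2"),  # intrinsic, dropped
    ("http://b.com/1", "http://a.com/2"),
    ("http://b.com/2", "http://a.com/2"),
    ("http://b.com/3", "http://a.com/2"),  # third link from b.com, dropped when m=2
    ("http://c.com/1", "http://a.com/2"),
]
kept = filter_links(links, m=2)
```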
Extracting authorities from the overall collection of pages, through an analysis of the link structure of G, the graph formed by the pages of S and the links among them.
A good hub points to many good authorities, and a good authority is pointed to by many good hubs.
[Figure: hubs pointing to authorities, alongside an unrelated page of large in-degree]
Basic Idea
Each page p has a non-negative authority weight and a non-negative hub weight.
If p points to pages with large authority weights, then p has a large hub weight.
If p is pointed to by pages with large hub weights, then p has a large authority weight.
Pages with higher weights are better authorities and hubs.
I operation:
◦ Authority weight of a page = sum of the hub weights of all pages pointing to the page.
O operation:
◦ Hub weight of a page = sum of the authority weights of all pages this page points to.
I and O reinforce each other.
Normalization: the hub weights and the authority weights are each divided by a constant so that the sum of their squares equals 1.
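As a small worked instance of the normalization step, assuming the convention from Kleinberg's paper that each weight vector is scaled so its squares sum to 1 (the numbers are made up for illustration):

```latex
x = (3,\ 4), \qquad
\lVert x \rVert = \sqrt{3^2 + 4^2} = 5, \qquad
\frac{x}{\lVert x \rVert} = (0.6,\ 0.8), \qquad
0.6^2 + 0.8^2 = 1
```

Only the relative sizes of the weights matter for ranking, so the scaling changes nothing about which pages come out on top; it just keeps the values from growing without bound across iterations.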
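Contd...
Operation I: x[p] = sum of y[q] over all pages q that point to page p (q1, q2, q3 in the figure).
Operation O: y[p] = sum of x[q] over all pages q that page p points to (q1, q2, q3 in the figure).
Here x is the authority weight and y is the hub weight.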
Deciding when to stop the reinforcing process:
1) Apply the I and O operations alternately until a fixed point is reached.
2) Choose a parameter k and iterate the process only k times.
Iterate algorithm
Given the set of pages in the form of a graph, set an integer value for the parameter k.
k is the number of iterations. Repeat the following process k times:
◦ Apply the I operation to every page and update its authority weight.
◦ Apply the O operation to every page and update its hub weight.
◦ Normalize both the authority weights and the hub weights.
Return the graph with the new authority weight and hub weight for each page.
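A minimal Python sketch of this loop follows. It assumes the graph is given as a list of (source, target) edges, that all weights start at 1, and that the O operation uses the authority weights just produced by the I operation; the function and variable names are illustrative only.

```python
import math


def normalize(weights):
    # Scale the weights so that their squares sum to 1.
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {p: v / norm for p, v in weights.items()} if norm else dict(weights)


def iterate(pages, edges, k):
    """Run k rounds of the I and O operations; return (authority, hub) weights."""
    x = {p: 1.0 for p in pages}  # authority weights
    y = {p: 1.0 for p in pages}  # hub weights
    for _ in range(k):
        # I operation: authority of p = sum of hub weights of pages pointing to p.
        new_x = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_x[dst] += y[src]
        # O operation: hub of p = sum of (updated) authority weights p points to.
        new_y = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_y[src] += new_x[dst]
        x, y = normalize(new_x), normalize(new_y)
    return x, y


# Toy graph (made up): h1 points to a1 and a2, h2 points to a1 only.
pages = ["h1", "h2", "a1", "a2"]
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
x, y = iterate(pages, edges, k=20)
best_authority = max(x, key=x.get)
best_hub = max(y, key=y.get)
```

In this toy graph a1 is pointed to by both hubs and h1 points to both authorities, so they end up with the largest authority and hub weights respectively.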
Observations
The top authorities and hubs are the pages with the c largest values of x and y, respectively, in the graph returned by the Iterate algorithm.
The Iterate procedure converges to fixed points x* and y* as k increases arbitrarily.
◦ Proved using principal eigenvectors.
The Iterate algorithm results in a densely linked collection of pages that is rich in relevant pages.
◦ The most relevant collection of pages is the densest graph.
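The eigenvector argument can be summarized as follows, writing A for the adjacency matrix of the graph (A_{ij} = 1 when page i points to page j) and z for the all-ones starting vector. This matrix notation is introduced here for illustration and does not appear on the slides.

```latex
\text{I operation: } x \leftarrow A^{T} y, \qquad
\text{O operation: } y \leftarrow A x
\\[4pt]
x_k \;\propto\; (A^{T} A)^{k-1} A^{T} z, \qquad
y_k \;\propto\; (A A^{T})^{k} z
\\[4pt]
x^{*} = \text{principal eigenvector of } A^{T} A, \qquad
y^{*} = \text{principal eigenvector of } A A^{T}
```

Each round multiplies the authority vector by A^T A and the hub vector by A A^T, so the procedure is a power iteration on these symmetric matrices, which is why the normalized weights settle toward the principal eigenvectors as k grows.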
Results
(java) Authorities
.328 http://www.gamelan.com/ Gamelan
.251 http://java.sun.com/ JavaSoft Home Page
.190 http://www.digitalfocus.com/digitalfocus/faq/howdoi.html The Java Developer: HowDoI
.190 http://lightyear.ncsa.uiuc.edu/srp/java/javabooks.html The Java Book
("search engines") Authorities
.346 http://www.yahoo.com/ Yahoo!
.291 http://www.excite.com/ Excite
.231 http://www.lycos.com/ Lycos Home Page
.231 http://www.altavista.digital.com/ AltaVista: Main Page
(Gates) Authorities
.643 http://www.roadahead.com/ Bill Gates: The Road Ahead
.458 http://www.microsoft.com/ Welcome to Microsoft
.440 http://www.microsoft.com/corpinfo/bill-g.htm
It was observed that www.roadahead.com was the only one of these sites present in R initially.
This supports the algorithm: it surfaces authoritative pages even though many of them do not contain the query string.