PRESENTED BY ASHISH CHAWLA AND VINIT ASHER The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page and Sergey Brin, Stanford University. 1998.

Agenda

Introduction

Background

Link Structure

Propagation of Ranking

Simplified Page Rank Calculation

Problems in Ranking

Page Rank Definition

Computing Page Rank

Mathematical Basics

Implementation Details

Convergence

Searching with PageRank

Introduction

Challenges in information retrieval on the Web:

  • Large number of documents
  • Heterogeneous and unstructured

The WWW is hypertext, which provides auxiliary information beyond the text of the web pages themselves.

Objective: take advantage of this link structure.

Background

Academic citations:

  • Link to other well-known papers
  • Are peer reviewed
  • Have quality control

Web pages:

  • Unlike academic papers, are not homogeneous in their quality, usage, citations, and length
  • Quality is a subjective measure that depends on the user
  • The importance of a page is a quantity that is hard to capture intuitively

What does a user want?

  • The most applicable documents first

What is the job of a retrieval system?

  • Present the more relevant documents up front

Notion: quality/importance of web pages

  • Difficult to classify (depends on the user)
  • We deal with the overall importance of a page, rather than individual sections of the page.

Link Structure

  • Forward links
  • Back links
  • The Web has 150 million pages and 1.7 billion links (probably more now)
  • Use the concept of citation analysis
  • Highly linked pages are more “important” than pages with few links
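The relationship between forward links and back links can be sketched in a few lines of Python. This is only an illustration: the function name `invert_links` and the three-page graph are assumptions, not something taken from the slides.

```python
def invert_links(forward_links):
    """Build the back-link sets from the forward-link sets.

    forward_links maps each page to the set of pages it points to.
    """
    back_links = {page: set() for page in forward_links}
    for source, targets in forward_links.items():
        for target in targets:
            # setdefault also covers targets that never appear as a source
            back_links.setdefault(target, set()).add(source)
    return back_links


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
forward_links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
back_links = invert_links(forward_links)
```

Here page C ends up with two back links (from A and B), which is the kind of count that citation analysis would use as a first signal of importance.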

Propagation of Ranking

PageRank: a page has high rank if the sum of the ranks of its backlinks is high.

Some notation:

  • u: a web page
  • F_u: set of pages u points to (forward links)
  • B_u: set of pages that point to u (backlinks)
  • N_u = |F_u|: number of links from u
  • c: normalization factor

Simple ranking function:

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$

Simplified Page Rank Calculation
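A minimal sketch of the simplified calculation, assuming a small hypothetical graph in which every page has at least one outgoing link. Each page splits its rank evenly among its forward links, and the normalization factor c is applied by rescaling the ranks to sum to 1 after every pass. The function name and the graph are illustrative assumptions.

```python
def simplified_pagerank(forward_links, iterations=100):
    """Iterate R(u) = c * sum over backlinks v of R(v) / N_v.

    Assumes every page has at least one forward link.
    """
    pages = list(forward_links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for v in pages:
            share = rank[v] / len(forward_links[v])  # R(v) / N_v
            for u in forward_links[v]:
                new_rank[u] += share
        total = sum(new_rank.values())
        # Rescaling plays the role of the normalization factor c.
        rank = {page: value / total for page, value in new_rank.items()}
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
ranks = simplified_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this graph the ranks settle at roughly 0.4 for A, 0.2 for B, and 0.4 for C: C collects rank from both A and B and passes all of it back to A.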

Problem in Ranking?

Rank sink:

  • Two web pages point to each other but to no other page.
  • A third page points to one of them.
  • The loop will accumulate rank but never distribute it (since there are no outedges).
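The rank sink can be reproduced with the three pages described above. In this sketch (page names and the helper `propagate` are assumptions), A and B point only to each other and C points to A. Rank is propagated without any escape term.

```python
def propagate(forward_links, rank):
    """One pass of the simple ranking function, with c = 1."""
    new_rank = {page: 0.0 for page in forward_links}
    for v, targets in forward_links.items():
        for u in targets:
            new_rank[u] += rank[v] / len(targets)
    return new_rank


# A and B form the loop; C points into the loop but nothing points to C.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
rank = {page: 1.0 / 3 for page in links}
for _ in range(10):
    rank = propagate(links, rank)
```

After the first pass C has no rank left, and everything it had is trapped in the A/B loop, which keeps passing the same rank back and forth.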

Page Rank Definition

Let E(u) be some vector over the Web pages that corresponds to a source of rank. Then, the PageRank of a set of Web pages is an assignment, R’, to the Web pages which satisfies

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$

such that c is maximized and ||R’||_1 = 1 (||R’||_1 denotes the L1 norm of R’).

Computing Page Rank

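$$
\begin{aligned}
R_0 &\leftarrow S && \text{initialize vector over web pages} \\
\text{loop:} \quad R_{i+1} &\leftarrow A R_i && \text{new ranks: sum of normalized backlink ranks} \\
d &\leftarrow \lVert R_i \rVert_1 - \lVert R_{i+1} \rVert_1 && \text{compute normalizing factor} \\
R_{i+1} &\leftarrow R_{i+1} + dE && \text{add escape term} \\
\delta &\leftarrow \lVert R_{i+1} - R_i \rVert_1 && \text{control parameter} \\
\text{while } \delta &> \epsilon && \text{stop when converged}
\end{aligned}
$$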
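The loop above might be sketched in Python as follows. Two things here are assumptions rather than part of the slides: the `damping` factor that scales the propagated rank (without it, a graph with no dangling pages loses no rank, so d would always be 0), and the default uniform escape vector E. The function and parameter names are hypothetical.

```python
def pagerank(forward_links, escape=None, damping=0.85, eps=1e-12, max_iter=1000):
    """Iterate R <- A R, add d * E, and stop when the L1 change is below eps.

    forward_links maps each page to a list of pages it links to; every
    link target must also be a key. damping is an assumed parameter.
    """
    pages = sorted(forward_links)
    if escape is None:
        escape = {page: 1.0 / len(pages) for page in pages}  # uniform E
    rank = dict(escape)  # R_0 <- S (here S = E)
    for _ in range(max_iter):
        new_rank = {page: 0.0 for page in pages}
        for v in pages:
            targets = forward_links[v]
            for u in targets:
                # new ranks: sum of normalized backlink ranks
                new_rank[u] += damping * rank[v] / len(targets)
        d = sum(rank.values()) - sum(new_rank.values())  # rank lost this pass
        for page in pages:
            new_rank[page] += d * escape[page]  # add escape term
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta <= eps:  # stop when converged
            break
    return rank


ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

On the rank-sink graph used here, page C no longer drops to zero: it keeps the share of rank it receives from the escape term, while the ranks still sum to 1.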

Random Surfer Model

  • A random surfer clicks on links at random.
  • The “surfer” periodically gets bored.

Solution to Random Surfer Model

Escape term: E(u) can be thought of as modeling a random surfer who periodically gets bored and jumps to a different page, rather than staying in a loop forever.

E is a vector over all the web pages that accounts for this escape: it weights how likely each page is to be the one the bored surfer jumps to (a user-defined parameter).

$$R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u)$$
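The random surfer can also be simulated directly. The sketch below assumes a uniform E and a 0.15 chance of getting bored at each step; both values, along with the function name and the fixed random seed, are illustrative assumptions.

```python
import random


def simulate_surfer(forward_links, steps=200_000, bored=0.15, seed=0):
    """Return the fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(forward_links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        targets = forward_links[current]
        if not targets or rng.random() < bored:
            current = rng.choice(pages)    # bored: jump to a page drawn from E
        else:
            current = rng.choice(targets)  # otherwise follow a random link
    return {page: count / steps for page, count in visits.items()}


frequencies = simulate_surfer({"A": ["B"], "B": ["A"], "C": ["A"]})
```

Page C has no back links, so the surfer reaches it only by getting bored, which happens on about 0.15 × 1/3 = 5% of steps. The visit frequencies approximate the ranks that the escape-term formula assigns under the same assumptions.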

Another Problem – Dangling Links

What are dangling links?

  • Links that point to any page with no outgoing links
  • Links to pages that have not been downloaded yet

Why is this a problem?

  • We don’t know how to distribute their weight.

What do we do?

  • Remove them from the system.
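One possible way to remove dangling links is sketched below. Removing a page can leave other pages without outgoing links, so the sketch repeats until nothing changes. The function name and the sample graph are assumptions; link targets that are not keys of the dictionary stand in for pages that have not been downloaded.

```python
def prune_dangling(forward_links):
    """Repeatedly drop dangling links and the pages left without outlinks."""
    graph = {page: list(targets) for page, targets in forward_links.items()}
    while True:
        # Drop links whose target is unknown or has already been removed.
        graph = {
            page: [t for t in targets if t in graph]
            for page, targets in graph.items()
        }
        dangling = [page for page, targets in graph.items() if not targets]
        if not dangling:
            return graph
        for page in dangling:
            del graph[page]


# C has no outgoing links; once C is removed, D has none either.
pruned = prune_dangling({"A": ["B", "D"], "B": ["A", "C"], "C": [], "D": ["C"]})
```

In this example only the A/B pair survives, since D's single link pointed at the removed page C.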

Mathematical Basics

What are eigenvectors and eigenvalues?

Given a vector v in an n-dimensional vector space, we can linearly transform it into another vector using a transformation matrix A. The transformed vector is Av.

An eigenvector is a vector that is scaled by a linear transformation, but not turned. The scaling factor is the eigenvalue. Eigenvalues and eigenvectors are not unique: a matrix can have several, and any nonzero multiple of an eigenvector is again an eigenvector. They satisfy Ax = λx, where λ is an eigenvalue of A and x is the corresponding eigenvector.

An eigenvector is a vector that 'points' in the same direction (has invariant direction cosines) under some transform. The eigenvalue is a number that describes how the magnitude of the eigenvector is scaled by the transform.

Mathematical Basics

A is a square matrix whose rows and columns correspond to web pages u and v.

With A as this matrix and R treated as a vector over all the Web pages, the PageRank vector is the dominant eigenvector of A, that is, the eigenvector associated with the maximal eigenvalue.
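A dominant eigenvector can be approximated by power iteration: multiply a vector by the matrix over and over and renormalize. The sketch below assumes a small matrix with positive entries and uses the L1 norm, matching the normalization used for PageRank. The function name and the 2×2 matrix are illustrative assumptions.

```python
def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvalue and eigenvector of a positive matrix."""
    n = len(matrix)
    x = [1.0 / n] * n  # start from a uniform vector with L1 norm 1
    eigenvalue = 0.0
    for _ in range(iterations):
        y = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Because x has L1 norm 1 and all entries stay positive,
        # the L1 norm of A x estimates the dominant eigenvalue.
        eigenvalue = sum(abs(value) for value in y)
        x = [value / eigenvalue for value in y]
    return eigenvalue, x


eigenvalue, eigenvector = power_iteration([[2.0, 1.0], [1.0, 3.0]])
```

For this matrix the eigenvalues are (5 ± √5)/2, and the iteration settles on the larger one, about 3.618.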

Example

$A^T$ = [matrix shown on the slide]

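Example (contd.)

R = c A R = M R

  • c: scaling factor tied to the eigenvalue, since A R = (1/c) R
  • R: eigenvector of A

A = [matrix shown on the slide]

R = [vector shown on the slide], normalized = [vector shown on the slide]

A x = λ x, so (A − λI) x = 0, where λ is found from |A − λI| = 0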

Implementation

Web crawler keeps a database of URLs so that it can discover all URLs on the web

To implement PageRank, the web crawler builds an index of the URLs as it crawls

Problems:

  • Infinitely large sites
  • Incorrect HTML
  • Sites that are down
  • The Web is always changing

PageRank Implementation

  • Convert each URL into a unique integer ID
  • Sort the link structure by the IDs
  • Remove dangling links
  • Make an initial assignment of ranks and iterate until convergence
  • Add the dangling links back
  • Iterate the process again to assign weights to all dangling links
  • The link database, A, is normally kept in RAM
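The first two steps (integer IDs and a sorted link structure) could look like the sketch below. The function name, the ID assignment order, and the example URLs are all assumptions made for illustration.

```python
def build_link_table(url_links):
    """Assign each URL an integer ID and return the links sorted by ID."""
    ids = {}

    def id_of(url):
        # IDs are handed out in order of first appearance.
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    edges = sorted(
        (id_of(source), id_of(target))
        for source, targets in url_links.items()
        for target in targets
    )
    return ids, edges


# Hypothetical crawl result.
ids, edges = build_link_table({
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/"],
})
```

Sorting the (source ID, target ID) pairs means the link database can be read linearly during each iteration.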

Convergence Properties

PageRank will scale very well for large collections, as the scaling factor (the growth in the work needed to converge as the collection grows) is roughly linear in log n.

Convergence Properties

Here we interpret the web as an expander-like graph.

A graph is said to be an expander if every subset of nodes S has a neighborhood that is larger than some factor α times |S|.

Mathematically, this holds if the largest eigenvalue is sufficiently larger than the second-largest eigenvalue.
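The link between the eigenvalue gap and fast convergence can be made concrete with the standard power-iteration estimate. This is a general fact about power iteration on a diagonalizable matrix rather than something stated on the slide; $\lambda_1$ and $\lambda_2$ denote the largest and second-largest eigenvalues in magnitude, and $R^{*}$ the limiting rank vector.

$$\lVert R_k - R^{*} \rVert = O\!\left( \left| \frac{\lambda_2}{\lambda_1} \right|^{k} \right), \qquad k \approx \frac{\log(1/\epsilon)}{\log\left| \lambda_1 / \lambda_2 \right|} \text{ iterations to reach error } \epsilon$$

The larger the gap between the two eigenvalues, the fewer iterations are needed.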

Searching with PageRank

Two search engines were implemented using PageRank.

Title-based search engine:

  • Matches titles of web pages with the given query
  • Ranks the results using PageRank
  • Works well for general queries having a large result set

Full-text search engine (Google):

  • Scans the entire document for a match with the given query
  • Performs rank merging
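A title-based search of this kind might be sketched as follows. The matching rule (every query word must appear in the title), the function name, and the sample data are assumptions for illustration only.

```python
def title_search(query, titles, pagerank):
    """Return URLs whose title contains every query word, best PageRank first."""
    terms = query.lower().split()
    matches = [
        url
        for url, title in titles.items()
        if all(term in title.lower().split() for term in terms)
    ]
    return sorted(matches, key=lambda url: pagerank.get(url, 0.0), reverse=True)


# Hypothetical titles and precomputed ranks.
titles = {
    "http://a.example/": "Stanford University",
    "http://b.example/": "Stanford Daily news",
    "http://c.example/": "Web search",
}
ranks = {"http://a.example/": 0.5, "http://b.example/": 0.2, "http://c.example/": 0.3}
results = title_search("stanford", titles, ranks)
```

The text match only selects candidates; the ordering of the results comes entirely from PageRank.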

Types of Results

Information-based result:

  • Finds a site which contains a great deal of information
  • Propagates the textual matching score through the link structure

Common-case result:

  • The most commonly used site (often commercial) relevant to the search query
  • PageRank gives a good representation of the common case

Personalized PageRank

E vector:

  • Corresponds to a distribution over web pages
  • Provides flexibility in adjusting PageRanks

Choices of E:

  • A uniform E causes highly linked web pages to achieve a very high ranking.
  • A single-page E causes important pages that are not related to that homepage to receive a low PageRank.
  • An E consisting of the root-level pages of all web servers is a good compromise between uniform E and single-page E.
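The effect of the E vector can be tried out with a compact sketch that takes E as a parameter. The `damping` value, the fixed iteration count, the function name, and the three-page graph are assumptions, not details from the slides.

```python
def personalized_pagerank(forward_links, escape, damping=0.85, iterations=200):
    """PageRank with a caller-supplied escape vector E (entries sum to 1)."""
    rank = dict(escape)
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in forward_links}
        for v, targets in forward_links.items():
            for u in targets:
                new_rank[u] += damping * rank[v] / len(targets)
        lost = sum(rank.values()) - sum(new_rank.values())
        # Redistribute the lost rank according to E.
        rank = {page: new_rank[page] + lost * escape[page] for page in forward_links}
    return rank


links = {"A": ["B"], "B": ["A"], "C": ["A"]}
uniform = personalized_pagerank(links, {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3})
single_page = personalized_pagerank(links, {"A": 0.0, "B": 0.0, "C": 1.0})
```

With a uniform E, page C (which nobody links to) gets a rank of about 0.05. With E concentrated on C, its rank rises to about 0.15, and the pages it links to benefit as well.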

Applications

Estimating web traffic:

  • By looking at differences between PageRank and actual usage statistics, it is possible to find things that people often look at but do not want to link to from their web pages.

Backlink predictor:

  • Citation counts tend to get stuck in local collections of web pages.
  • Using the random surfer model, PageRank quickly finds the site homepage and gives preference to its children, resulting in an efficient, broad search.
  • Hence PageRank potentially acts as a better backlink predictor, since it builds up information about the entire website faster.

Other Applications

Spam detection and prevention

Sort the backlinks based on their importance

Issues

  • Users are not random walkers.
  • Starting point distribution (actual usage data as starting vector).
  • Bias towards main pages.
  • Linkage spam.
  • No query-specific rank.

Conclusion

PageRank is a global ranking of all web pages, regardless of their content, based solely on their location in the Web’s graph structure

PageRank can be used to separate a small set of commonly used documents

Full database is consulted only when small database is not adequate to answer the queries

Personalized PageRank can be used to create a view of Web from a particular user’s perspective

Google Architecture