TANA07: Data Mining using Matrix Methods - PageRank
webstaff.itn.liu.se/~bersa48/tana07/slides2011/chp-12-pagerank.pdf

TANA07: Data Mining using Matrix Methods
PageRank

Lars Eldén and Berkant Savas

Department of Mathematics
Linköping University, Sweden

2012

Copyright Lars Eldén, Berkant Savas 2012 TANA07: Data Mining using Matrix Methods

Google search: university

A Google search conducted on September 29, 2005, using the search phrase university, gave as a result:
• Harvard
• Stanford
• Cambridge
• Yale
• Cornell
• Oxford
The total number of web pages relevant to the search phrase was more than 2 billion.
The result (ordering) changes over time.


Google indexes billions of web pages

[Screenshot: Google Sweden home page, captured 2004-03-02 11:28. Footer text: "©2004 Google - Searching 4,285,199,774 web pages".]


Google Mathematics?

• Google ranks about $45 \times 10^9$ (45 billion) web pages (2011).
• The ranking is based on a mathematical model of the link structure of the Internet.
• The ranking is recomputed often (each month?) and the computation takes 1-2 weeks(?): the world's largest matrix computation.
• When Google started in 1998 there was some mathematical theory, but not enough. It was found experimentally(?) that the method works.
• There is now a substantial literature on PageRank and its computation, and the method is used in a wide range of areas.


Search Engines

Web crawler: a computer program that downloads web pages and scans their contents.
Search engine:

1 Scan the page and collect key words (indexing)
2 Find the link structure graph of the whole Internet and store it as a sparse matrix (dimension $n \approx 45 \times 10^9$?)
3 Rank all web pages


Links

[Figure: page i with its inlinks $I_i$ (pages that link to i) and its outlinks $O_i$ (pages that i links to).]

All web pages are ordered: 1, 2, . . . , n.
PageRank of page i: a number $r_i$ between 0 and 1 that indicates how important the page is.


Links, cont.

[Figure: page i with its inlinks $I_i$ and outlinks $O_i$.]

Provisional definition of PageRank: the more inlinks your page has, the more important it is.
Easy to manipulate: create pages that point to yours.


New provisional definition

The more inlinks from important pages your page has, the more important it is:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n, $$

where $N_j$ is the number of outlinks of page j.

[Figure: page i with its inlinks $I_i$ and outlinks $O_i$.]

In the sum, the rank of page j is divided equally between its outlinks. (Example)
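As a small illustration of how the sum divides rank between outlinks, the following Python sketch evaluates $r_i$ for one page. The function name `rank_from_inlinks` and all numbers are hypothetical and chosen only for this example; exact fractions are used so the arithmetic is easy to follow.

```python
from fractions import Fraction

def rank_from_inlinks(i, inlinks, outdegree, rank):
    """Evaluate r_i = sum over j in I_i of r_j / N_j for one page i."""
    return sum(rank[j] / outdegree[j] for j in inlinks[i])

# Hypothetical data: page 3 is linked to by pages 1 and 2.
inlinks = {3: [1, 2]}
outdegree = {1: 3, 2: 2}                          # N_1 = 3, N_2 = 2
rank = {1: Fraction(3, 10), 2: Fraction(1, 5)}    # assumed ranks r_1, r_2

r3 = rank_from_inlinks(3, inlinks, outdegree, rank)
print(r3)  # 1/10 + 1/10 = 1/5
```

Page 1 passes a third of its rank to page 3 and page 2 passes half of its rank, so each contributes 1/10.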


Questions

Recursive definition:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n. $$

1 Does there exist a solution such that $0 \le r_i \le 1$?
2 If it exists, how to compute it?


Suggested Algorithm: Power method

1 Guess starting values $r_i^{(0)}$, $i = 1, \ldots, n$.
2 Iterate until convergence:

$$ r_i^{(k)} = \sum_{j \in I_i} \frac{r_j^{(k-1)}}{N_j}, \qquad i = 1, 2, \ldots, n. $$

3 If the change is small enough,

$$ \| r^{(k)} - r^{(k-1)} \| \le \mathrm{TOL}, $$

then stop. Otherwise go to 2.
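The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the name `power_iteration`, the uniform starting guess, the choice of the 1-norm, and the three-page graph `links` are all assumptions made for the example. The graph is chosen so that every page has an outlink and the iteration converges.

```python
def power_iteration(outlinks, tol=1e-10, max_iter=1000):
    """Iterate r_i <- sum_{j in I_i} r_j / N_j until the 1-norm change <= tol."""
    pages = sorted(outlinks)
    r = {p: 1.0 / len(pages) for p in pages}    # step 1: uniform starting guess
    for k in range(1, max_iter + 1):
        new = {p: 0.0 for p in pages}
        for j, targets in outlinks.items():     # page j gives r_j / N_j to each outlink
            for i in targets:
                new[i] += r[j] / len(targets)
        change = sum(abs(new[p] - r[p]) for p in pages)
        r = new
        if change <= tol:                       # step 3: stopping test
            return r, k
    return r, max_iter

# Hypothetical graph: every page has an outlink, so no rank is lost.
links = {1: [2, 3], 2: [3], 3: [1]}
ranks, iterations = power_iteration(links)
print({p: round(v, 6) for p, v in ranks.items()}, iterations)
```

For this graph the equations $r_1 = r_3$, $r_2 = r_1/2$, $r_3 = r_1/2 + r_2$ with $\sum r_i = 1$ give $r = (0.4, 0.2, 0.4)$, which the iteration approaches.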


Matrix Formulation

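$$ Q_{ij} = \begin{cases} 1/N_j & \text{if there is a link from } j \text{ to } i, \\ 0 & \text{otherwise.} \end{cases} $$

Row i of Q has nonzero elements in the positions of the inlinks of page i; column j has nonzero elements in the positions of the outlinks of page j.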


Example

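[Figure: link graph with six pages.]

$$ Q = \begin{pmatrix}
0 & \tfrac13 & 0 & 0 & 0 & 0 \\
\tfrac13 & 0 & 0 & 0 & 0 & 0 \\
0 & \tfrac13 & 0 & 0 & \tfrac13 & \tfrac12 \\
\tfrac13 & 0 & 0 & 0 & \tfrac13 & 0 \\
\tfrac13 & \tfrac13 & 0 & 0 & 0 & \tfrac12 \\
0 & 0 & 1 & 0 & \tfrac13 & 0
\end{pmatrix} $$

No outlinks from page 4: the corresponding column is equal to zero.
All other columns: the elements sum up to 1.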
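The link matrix of this example can be rebuilt from the outlink lists, which are read off the nonzero entries in each column of Q. The following Python sketch does that and checks the column sums; the helper name `link_matrix` is an assumption for illustration, and the matrix is stored densely only because the example is tiny.

```python
from fractions import Fraction

# Outlinks read off the columns of Q in the example (page 4 has none).
outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}

def link_matrix(outlinks):
    """Return Q as a list of rows, with Q[i-1][j-1] = 1/N_j if page j links to page i."""
    n = len(outlinks)
    Q = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in targets:
            Q[i - 1][j - 1] = Fraction(1, len(targets))
    return Q

Q = link_matrix(outlinks)
column_sums = [sum(Q[i][j] for i in range(6)) for j in range(6)]
print(column_sums)  # column 4 sums to 0, all other columns sum to 1
```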


Random Walk Model

Random surfer: on each page, choose one of the outlinks with equal probability. Observe the surfer for a long time and count the frequency of visits to all pages.

PageRank $r_i$: the asymptotic probability that the surfer is at page i.

Markov chain: a random process in which the next state is determined completely from the current state. Discrete time and no memory: the state at time t + 1 depends on the state at time t only. $Q^T$ is the matrix of transition probabilities.
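The random surfer can be simulated directly. The sketch below is an illustration under stated assumptions: the three-page graph `links` is hypothetical (chosen so that every page has an outlink), the function name `visit_frequencies` is invented for the example, and a fixed seed is used only to make the run repeatable.

```python
import random

def visit_frequencies(outlinks, steps, seed=0):
    """Follow a uniformly random outlink at each step and count visits per page."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    counts = {p: 0 for p in outlinks}
    page = next(iter(outlinks))          # start at the first page listed
    for _ in range(steps):
        page = rng.choice(outlinks[page])
        counts[page] += 1
    return {p: c / steps for p, c in counts.items()}

# Hypothetical graph in which every page has at least one outlink.
links = {1: [2, 3], 2: [3], 3: [1]}
freq = visit_frequencies(links, steps=200_000)
print(freq)  # close to (0.4, 0.2, 0.4), the solution of r_i = sum r_j / N_j
```

The observed visit frequencies approximate the solution of the recursive definition for this graph, which is the sense in which PageRank is the asymptotic probability of finding the surfer at a page.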


PageRank definition

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n. $$

[Figure: page i with its inlinks $I_i$ and outlinks $O_i$.]

$r_i$: relative frequency of visits at page i.

Example, cont.

[Figure: the link graph with six pages from the previous example.]

The model does not work: the surfer gets stuck at page 4.
Remedy: introduce (artificially) links from page 4 to all the others. The random walk can then continue forever.
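The problem can be observed by simulating the walk on the six-page example. In this Python sketch the outlink lists are read off the columns of Q; the function name `walk_until_stuck`, the starting page, the step limit and the seed are assumptions for illustration.

```python
import random

# Outlinks of the six-page example, read off the columns of Q; page 4 has none.
outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}

def walk_until_stuck(outlinks, start, max_steps=10_000, seed=0):
    """Follow random outlinks from `start`; stop at a page without outlinks."""
    rng = random.Random(seed)
    page, steps = start, 0
    while outlinks[page] and steps < max_steps:
        page = rng.choice(outlinks[page])
        steps += 1
    return page, steps

final_page, steps_taken = walk_until_stuck(outlinks, start=1)
print(final_page, steps_taken)
```

Page 4 is the only page without outlinks and it can be reached from every other page, so the walk ends there after a modest number of steps.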


Modified Transition Matrix

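Define

$$ d_j = \begin{cases} 1 & \text{if } N_j = 0, \\ 0 & \text{otherwise,} \end{cases} $$

for $j = 1, \ldots, n$, and

$$ e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n, \qquad
P = Q + \frac{1}{n} e d^T = \begin{pmatrix}
0 & \tfrac13 & 0 & \tfrac16 & 0 & 0 \\
\tfrac13 & 0 & 0 & \tfrac16 & 0 & 0 \\
0 & \tfrac13 & 0 & \tfrac16 & \tfrac13 & \tfrac12 \\
\tfrac13 & 0 & 0 & \tfrac16 & \tfrac13 & 0 \\
\tfrac13 & \tfrac13 & 0 & \tfrac16 & 0 & \tfrac12 \\
0 & 0 & 1 & \tfrac16 & \tfrac13 & 0
\end{pmatrix}. $$

P is a proper column-stochastic matrix: non-negative elements, and the elements of each column sum up to 1.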
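The modification can be checked numerically on the example. The following Python sketch forms $P = Q + \frac{1}{n} e d^T$ from Q with exact fractions; the function name `add_dangling_columns` is an assumption for illustration, and the dense storage is only sensible for such a small matrix.

```python
from fractions import Fraction

def add_dangling_columns(Q):
    """Return P = Q + (1/n) e d^T: each zero column of Q is replaced by 1/n."""
    n = len(Q)
    d = [1 if all(Q[i][j] == 0 for i in range(n)) else 0 for j in range(n)]
    return [[Q[i][j] + Fraction(d[j], n) for j in range(n)] for i in range(n)]

t, h = Fraction(1, 3), Fraction(1, 2)
Q = [[0, t, 0, 0, 0, 0],
     [t, 0, 0, 0, 0, 0],
     [0, t, 0, 0, t, h],
     [t, 0, 0, 0, t, 0],
     [t, t, 0, 0, 0, h],
     [0, 0, 1, 0, t, 0]]

P = add_dangling_columns(Q)
column_sums = [sum(row[j] for row in P) for j in range(6)]
print(column_sums)  # every column now sums to 1
```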


Column-Stochastic Matrix

Lemma
A column-stochastic matrix P satisfies

$$ e^T P = e^T, \qquad e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. $$

Equivalently: $P^T e = e$.

e is an eigenvector of $P^T$ with the eigenvalue 1.
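The lemma follows directly from the definition of a column-stochastic matrix. Written out componentwise (a short derivation added for completeness), element j of $e^T P$ is the sum of column j of P:

$$ (e^T P)_j = \sum_{i=1}^{n} P_{ij} = 1 = (e^T)_j, \qquad j = 1, \ldots, n. $$

Since P and $P^T$ have the same eigenvalues, 1 is also an eigenvalue of P.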


Eigenvector of P

Recursive definition of PageRank:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n. $$

Now equivalent to $Pr = r$.

We want to find this eigenvector of P (not $P^T$).
Questions:

1 Is the eigenvector r unique?
2 Is it a probability vector?


Some Perron-Frobenius Theory

Theorem
If P is an irreducible column-stochastic matrix, then the largest eigenvalue is equal to 1, and the corresponding eigenvector r has non-negative elements.

The Google matrix is reducible: the random walk can get stuck in subgraphs of the Internet.

[Figure: reducible link graph with six pages.]


Example

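Corresponding matrix:

$$ P = \begin{pmatrix}
0 & \tfrac12 & \tfrac12 & \tfrac12 & 0 & 0 \\
\tfrac12 & 0 & \tfrac12 & 0 & 0 & 0 \\
\tfrac12 & \tfrac12 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \tfrac12 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix} $$

Eigenvalues: 1, 1, −1, −0.5, −0.5, 0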
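The double eigenvalue 1 means that the eigenvector is not unique. As a sketch, the following Python code verifies two different probability vectors that both satisfy $Pr = r$ for this matrix; the vectors `r_a` and `r_b` and the helper `matvec` are choices made for this illustration.

```python
from fractions import Fraction

h = Fraction(1, 2)
P = [[0, h, h, h, 0, 0],
     [h, 0, h, 0, 0, 0],
     [h, h, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, h, 0, 1],
     [0, 0, 0, 0, 1, 0]]

def matvec(M, x):
    """Plain matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

t = Fraction(1, 3)
r_a = [t, t, t, 0, 0, 0]     # all probability on pages 1-3
r_b = [0, 0, 0, 0, h, h]     # all probability on pages 5-6

print(matvec(P, r_a) == r_a, matvec(P, r_b) == r_b)  # True True
```

Each vector is concentrated on one subgraph that the random walk cannot leave, so any convex combination of the two is also a solution.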


Irreducible Matrix

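$$ A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T, \qquad 0 \le \alpha \le 1 $$

A is column-stochastic:

$$ e^T A = \alpha e^T P + (1 - \alpha) \frac{1}{n} e^T e e^T = \alpha e^T + (1 - \alpha) e^T = e^T. $$

Random walk interpretation: the surfer will jump to a random page with probability $1 - \alpha$ (teleportation).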
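For a small example the matrix A can be formed explicitly. The Python sketch below builds A from the reducible 6 × 6 matrix P shown earlier and checks that it is column-stochastic and strictly positive. The name `google_matrix` and the value α = 0.85 are assumptions for illustration, and forming A densely is only feasible because n is tiny here.

```python
from fractions import Fraction

def google_matrix(P, alpha):
    """Return A = alpha*P + (1 - alpha)/n * e e^T as a dense list of rows."""
    n = len(P)
    shift = (1 - alpha) / n              # every entry receives (1 - alpha)/n
    return [[alpha * p + shift for p in row] for row in P]

h = Fraction(1, 2)
P = [[0, h, h, h, 0, 0],
     [h, 0, h, 0, 0, 0],
     [h, h, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, h, 0, 1],
     [0, 0, 0, 0, 1, 0]]

alpha = Fraction(85, 100)                # assumed value of alpha
A = google_matrix(P, alpha)
column_sums = [sum(row[j] for row in A) for j in range(6)]
smallest = min(min(row) for row in A)
print(column_sums, smallest)             # all sums equal 1; smallest entry is 1/40
```

Every entry of A is at least $(1 - \alpha)/n > 0$ when $\alpha < 1$, which is what makes the matrix positive and hence irreducible.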


Eigenvalues of A

Theorem
The positive column-stochastic matrix $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ has a unique eigenvector $r > 0$ with the eigenvalue 1.

Proof.
Perron-Frobenius theory of positive matrices.


Power Method

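The power method for $Ar = \lambda r$:

for k = 1, 2, . . .
    $q^{(k)} = A r^{(k-1)}$
    $r^{(k)} = q^{(k)} / \| q^{(k)} \|_1$
end

Expand the initial approximation $r^{(0)}$ in terms of eigenvectors:

$$ r^{(0)} = c_1 r_1 + c_2 r_2 + \cdots + c_n r_n, $$

$$ A^k r^{(0)} = \lambda_1^k \left( c_1 r_1 + \sum_{j=2}^{n} c_j \left( \frac{\lambda_j}{\lambda_1} \right)^k r_j \right). $$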
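The loop above can be sketched in Python for a small dense matrix. This is an illustration only: the function name `power_method`, the uniform start vector, the stopping tolerance and the value α = 0.85 are assumptions, and the test matrix is the reducible 6 × 6 example P from these slides combined into $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$.

```python
def power_method(A, tol=1e-12, max_iter=10_000):
    """Power iteration with 1-norm normalization; returns (r, iterations)."""
    n = len(A)
    r = [1.0 / n] * n                                          # r^(0): uniform start
    for k in range(1, max_iter + 1):
        q = [sum(a * x for a, x in zip(row, r)) for row in A]  # q = A r
        norm1 = sum(abs(x) for x in q)
        q = [x / norm1 for x in q]                             # r^(k) = q / ||q||_1
        change = sum(abs(x - y) for x, y in zip(q, r))
        r = q
        if change <= tol:
            return r, k
    return r, max_iter

# Reducible example matrix P and A = alpha*P + (1 - alpha)/n * e e^T.
P = [[0, .5, .5, .5, 0, 0],
     [.5, 0, .5, 0, 0, 0],
     [.5, .5, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, .5, 0, 1],
     [0, 0, 0, 0, 1, 0]]
alpha, n = 0.85, 6
A = [[alpha * p + (1 - alpha) / n for p in row] for row in P]

r, iterations = power_method(A)
print([round(x, 3) for x in r], iterations)
```

Page 4 has no inlinks in this graph, so its rank comes from teleportation alone and equals $(1 - \alpha)/n = 0.025$.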


Second Eigenvalue of A I

Theorem
Given that the eigenvalues of the column-stochastic matrix P are $\{1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$, the eigenvalues of $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n\}$.
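One consequence for the power method can be spelled out as a rough estimate that ignores the constants $c_j$ in the eigenvector expansion. Every eigenvalue of a column-stochastic matrix satisfies $|\lambda_j| \le 1$, so the theorem gives $|\alpha \lambda_j| \le \alpha$ for $j \ge 2$, and the error terms decay at least like $\alpha^k$:

$$ |\alpha \lambda_j|^k \le \alpha^k, \qquad \alpha^k \le \mathrm{TOL} \iff k \ge \frac{\log \mathrm{TOL}}{\log \alpha}. $$

As an illustration with assumed values, $\alpha = 0.85$ and $\mathrm{TOL} = 10^{-7}$ give $k \ge 99.2$, that is, about 100 iterations.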


Example: $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$

LP=eig(P)'; e=ones(6,1);
A=0.85*P + 0.15/6*e*e'; [R,L]=eig(A)

gives

LP = -0.5 1.0 -0.5 1.0 -1.0 0

R =
 0.447 -0.365 -0.354  0.000  0.817  0.101
 0.430 -0.365  0.354 -0.000 -0.408 -0.752
 0.430 -0.365  0.354  0.000 -0.408  0.651
 0.057 -0.000 -0.707  0.000  0.000 -0.000
 0.469  0.548 -0.000 -0.707  0.000  0.000
 0.456  0.548  0.354  0.707 -0.000 -0.000

diag(L) = 1.0 0.85 -0.0 -0.85 -0.425 -0.425


Computing y = Az I

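$$ A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T \in \mathbb{R}^{n \times n} $$

Recall: $n \approx 10^{10}$.

• Cannot form A (dense).
• Cannot form $P = Q + \frac{1}{n} e d^T$ (many dense columns).
• Q is the actual link matrix.

1) If $\|z\|_1 = e^T z = 1$, then $\|y\|_1 = e^T y = e^T A z = e^T z = 1$, since A is column-stochastic ($e^T A = e^T$). Normalization in the power iteration is unnecessary.

2) With $P = Q + \frac{1}{n} e d^T$,

$$ y = \alpha \left( Q + \frac{1}{n} e d^T \right) z + \frac{(1 - \alpha)}{n} e (e^T z) = \alpha Q z + \beta \frac{1}{n} e, $$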


Computing y = Az II

where $\beta = \alpha d^T z + (1 - \alpha) e^T z$. β is computed from

$$ 1 = e^T (\alpha Q z) + \beta e^T \left( \frac{1}{n} e \right) = e^T (\alpha Q z) + \beta. $$

Thus, $\beta = 1 - \|\alpha Q z\|_1$.
Extra bonus: we do not need to know which pages lack outlinks.


Matlab Code

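y=alpha*Q*z;            % sparse product alpha*Q*z
beta=1-norm(y,1);       % beta = 1 - ||alpha*Q*z||_1
y=y+beta*v;             % v = (1/n)*e in the formula above
residual=norm(y-z,1);   % 1-norm change between iterates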
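The same step can be sketched in Python with the link matrix stored only as outlink lists, so that neither A nor P is ever formed. This is a minimal illustration, not the authors' implementation: the function name `pagerank_sparse`, the default α = 0.85, the tolerance and the uniform start vector are assumptions, and the graph is the six-page example in which page 4 has no outlinks.

```python
def pagerank_sparse(outlinks, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate y = alpha*Q*z + beta*(1/n)*e using only the link lists."""
    pages = sorted(outlinks)
    n = len(pages)
    z = {p: 1.0 / n for p in pages}
    for k in range(1, max_iter + 1):
        y = {p: 0.0 for p in pages}
        for j, targets in outlinks.items():      # y = alpha * Q * z
            for i in targets:
                y[i] += alpha * z[j] / len(targets)
        beta = 1.0 - sum(y.values())             # beta = 1 - ||alpha*Q*z||_1
        for p in pages:
            y[p] += beta / n                     # add beta * (1/n) * e
        residual = sum(abs(y[p] - z[p]) for p in pages)
        z = y
        if residual <= tol:
            return z, k
    return z, max_iter

outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
r, iterations = pagerank_sparse(outlinks)
print([round(r[p], 4) for p in sorted(r)], iterations)
```

The code never tests whether a page lacks outlinks: the rank that page 4 fails to pass on shows up as a larger β and is redistributed uniformly, which is the "extra bonus" noted above.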


Example

Link structure for the domain stanford.edu.
Web pages: 281903, links: 2312497.

[Figure: sparsity pattern of part of the link matrix; both axes run from 0 to $2 \times 10^4$, nz = 16283.]


Convergence and PageRank

[Figure, left panel: residual (logarithmic scale, from $10^0$ down to $10^{-7}$) versus iterations (0 to 70). Right panel: PageRank values (vertical axis 0 to 0.012) over the pages (horizontal axis 0 to $3 \times 10^5$).]


What happens when α→ 1?

• α → 0: completely random, all pages get the same rank
• α → 1: the links determine the rank
• BUT: counterintuitive effects

[Figure: link graph with seven pages.]


α→ 1

[Figure: link graph with seven pages.]

PageRank of each node for three values of α:

Node   α = 0.5   α = 0.85   α = 0.9999
1      0.15      0.07       0.0
2      0.10      0.04       0.0
3      0.10      0.04       0.0
4      0.15      0.08       0.0
5      0.24      0.39       0.5
6      0.19      0.36       0.5
7      0.07      0.02       0.0


Conclusions

• The power method is generally obsolete, but there are situations where there are no serious alternatives
• Further work on acceleration of the power method (classical methods)
• In the case of PageRank for local, smaller domains, other methods should be much better (??)

Perhaps the Google problem as such is not so important. There are other applications where similar methods can be applied: graph similarity, social networks, semantic graphs, etc.



