TANA07: Data Mining using Matrix Methods PageRank Lars Eldén and Berkant Savas Department of Mathematics Linköping University, Sweden 2012 Copyright Lars Eldén, Berkant Savas 2012 TANA07: Data Mining using Matrix Methods

Source: webstaff.itn.liu.se/~bersa48/tana07/slides2011/chp-12-pagerank.pdf


Google search: university

A Google search conducted on September 29, 2005, using the search phrase university, gave as a result:
• Harvard
• Stanford
• Cambridge
• Yale
• Cornell
• Oxford
The total number of web pages relevant to the search phrase was more than 2 billion.
The result (ordering) changes over time.


Google indexes billions of web pages

[Screenshot: Google Sweden start page, printed 2004-03-02 11:28. Footer text: "©2004 Google - Searching 4,285,199,774 web pages".]


Google Mathematics?

• Google ranks about $45 \times 10^9$ (45 billion) web pages (2011).
• The ranking is based on a mathematical model of the link structure of the Internet.
• The ranking is recomputed often (each month?) and the computation takes 1-2 weeks(?): the world's largest matrix computation.
• When Google started in 1998 there was some mathematical theory, but not enough. It was found experimentally(?) that the method works.
• Now there is a substantial literature on PageRank and its computation, and it is used in a wide range of areas.


Search Engines

Web crawler: a computer program that downloads web pages and scans their contents.

Search engine:
1. Scan the pages and collect keywords (indexing).
2. Find the link structure graph of the whole Internet and store it as a sparse matrix (dimension $n \approx 45 \times 10^9$?).
3. Rank all web pages.


Links

[Figure: page $i$ with its set of inlinks $I_i$ (pages that link to $i$) and its set of outlinks $O_i$ (pages that $i$ links to).]

All web pages are ordered: $1, 2, \ldots, n$.
PageRank of page $i$: a number $r_i$ between 0 and 1 that indicates how important the page is.


Links, cont.

[Figure: page $i$ with its inlinks $I_i$ and outlinks $O_i$.]

Provisional definition of PageRank: the more inlinks your page has, the more important it is.
Easy to manipulate: create pages that point to yours.


New provisional definition

The more inlinks from important pages your page has, the more important it is:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \quad i = 1, 2, \ldots, n, $$

where $N_j$ is the number of outlinks of page $j$.

[Figure: page $i$ with its inlinks $I_i$ and outlinks $O_i$.]

In the sum: the rank of page $j$ is divided equally between its outlinks. (Example)
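As a small worked instance of the sum (the numbers are hypothetical, chosen only for illustration): suppose page $i$ has inlinks from two pages, $I_i = \{1, 2\}$, where page 1 has $N_1 = 3$ outlinks and rank $r_1 = 0.3$, and page 2 has $N_2 = 2$ outlinks and rank $r_2 = 0.2$. Then

```latex
r_i = \frac{r_1}{N_1} + \frac{r_2}{N_2}
    = \frac{0.3}{3} + \frac{0.2}{2}
    = 0.1 + 0.1
    = 0.2 .
```

Page 1 passes a third of its rank along each of its three outlinks, and page 2 passes half of its rank along each of its two.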


Questions

Recursive definition:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \quad i = 1, 2, \ldots, n. $$

1. Does there exist a solution such that $0 \le r_i \le 1$?
2. If it exists, how do we compute it?


Suggested Algorithm: Power method

1. Guess starting values $r_i^{(0)}$, $i = 1, \ldots, n$.
2. Iterate until convergence:

$$ r_i^{(k)} = \sum_{j \in I_i} \frac{r_j^{(k-1)}}{N_j}, \quad i = 1, 2, \ldots, n. $$

3. If the change is small enough,

$$ \| r^{(k)} - r^{(k-1)} \| \le \mathrm{TOL}, $$

then stop. Otherwise go to 2.
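The three steps can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name `power_method`, the 0-based page numbering, and the three-page test graph are assumptions made here. The sketch also assumes every page has at least one outlink and that the iteration converges for the given graph, which the slides have not yet established in general.

```python
def power_method(outlinks, tol=1e-10, max_iter=1000):
    """Iterate r_i = sum over inlinks j of r_j / N_j until the 1-norm change <= tol.

    outlinks[j] lists the pages that page j links to (pages numbered from 0).
    Assumes every page has at least one outlink.
    """
    n = len(outlinks)
    r = [1.0 / n] * n                      # step 1: starting guess
    for k in range(1, max_iter + 1):
        new = [0.0] * n
        for j, targets in enumerate(outlinks):
            share = r[j] / len(targets)    # rank of j split equally over its outlinks
            for i in targets:
                new[i] += share            # step 2: accumulate over the inlinks of i
        change = sum(abs(a - b) for a, b in zip(new, r))
        r = new
        if change <= tol:                  # step 3: stopping test
            return r, k
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical test graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
ranks, iterations = power_method([[1, 2], [2], [0]])
```

For this small graph the equations can be solved by hand: $r_1 = r_0/2$ and $r_2 = r_0$, so with the ranks summing to 1 the limit is $(0.4, 0.2, 0.4)$.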


Matrix Formulation

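$$ Q_{ij} = \begin{cases} 1/N_j & \text{if there is a link from } j \text{ to } i, \\ 0 & \text{otherwise.} \end{cases} $$

[Figure: sketch of the nonzero pattern of $Q$. The nonzeros in row $i$ correspond to the inlinks of page $i$; the nonzeros in column $j$ correspond to the outlinks of page $j$.]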


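Example

[Figure: link graph with six pages, numbered 1 to 6.]

$$
Q = \begin{pmatrix}
0 & \frac{1}{3} & 0 & 0 & 0 & 0 \\
\frac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 1 & 0 & \frac{1}{3} & 0
\end{pmatrix}
$$

No outlinks from page 4: the corresponding column is equal to zero.
All other columns: the elements sum up to 1.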
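The construction of $Q$ from a link list can be sketched as below. The dictionary `links` is an assumption of this sketch: it is read off the nonzero pattern of the matrix in the example (column $j$ lists the pages that $j$ links to). Exact fractions are used so the column sums can be compared without rounding.

```python
from fractions import Fraction

# Outlinks per page, read off the columns of Q in the example (assumption of this sketch)
links = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
n = len(links)

# Q[i][j] = 1/N_j if page j+1 links to page i+1, else 0
Q = [[Fraction(0)] * n for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        Q[i - 1][j - 1] = Fraction(1, len(targets))

# Column sums: 1 for pages with outlinks, 0 for the page without outlinks
col_sums = [sum(Q[i][j] for i in range(n)) for j in range(n)]
```

Storing only the link lists, rather than the full array, is what makes the sparse formulation feasible for large $n$; the dense array here is for illustration only.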


Random Walk Model

Random surfer: on each page, choose one of the outlinks with equal probability. Observe the surfer for a long time and count the frequency of visits to all pages.

PageRank $r_i$: the asymptotic probability that the surfer is at page $i$.

Markov chain: a random process in which the probability distribution of the next state is determined completely by the current state. Discrete time and no memory: the state at time $t + 1$ depends on the state at time $t$ only.

$Q^T$ is the matrix of transition probabilities.


PageRank definition

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \quad i = 1, 2, \ldots, n. $$

[Figure: page $i$ with its inlinks $I_i$ and outlinks $O_i$.]

$r_i$: relative frequency of visits at page $i$.


Example, cont.

[Figure: the link graph with six pages, numbered 1 to 6.]

The model does not work: the surfer gets stuck at page 4.
Remedy: introduce (artificially) links from page 4 to all the others.
The random walk can then continue forever.
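The remedy can be tried out with a short simulation of the random surfer. This is a sketch under stated assumptions: the link list is the one read off the example's matrix $Q$, the function name `surf` and the fixed seed are choices made here, and at a page without outlinks the surfer jumps to a page drawn uniformly from all six pages (including page 4 itself).

```python
import random

# Outlinks per page in the six-page example (assumption: read off the matrix Q)
links = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}


def surf(links, steps, seed=0):
    """Return the relative frequency of visits to each page after `steps` steps."""
    rng = random.Random(seed)              # fixed seed so the run is reproducible
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    page = pages[0]
    for _ in range(steps):
        visits[page] += 1
        targets = links[page]
        # Follow a random outlink; at a page without outlinks, jump anywhere
        page = rng.choice(targets) if targets else rng.choice(pages)
    return {p: visits[p] / steps for p in pages}


freq = surf(links, 100_000)
```

Without the artificial jump the loop would have nowhere to go once it reaches page 4; with it, every page keeps being visited and the frequencies settle down.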


Modified Transition Matrix

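Define

$$ d_j = \begin{cases} 1 & \text{if } N_j = 0, \\ 0 & \text{otherwise,} \end{cases} $$

for $j = 1, \ldots, n$, and

$$
e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n, \qquad
P = Q + \frac{1}{n} e d^T = \begin{pmatrix}
0 & \frac{1}{3} & 0 & \frac{1}{6} & 0 & 0 \\
\frac{1}{3} & 0 & 0 & \frac{1}{6} & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{6} & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & \frac{1}{6} & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{6} & 0 & \frac{1}{2} \\
0 & 0 & 1 & \frac{1}{6} & \frac{1}{3} & 0
\end{pmatrix}.
$$

$P$ is a proper column-stochastic matrix: non-negative elements, and the elements of each column sum up to 1.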
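The rank-one correction can be checked on the six-page example with exact fractions. As before, the dictionary `links` is an assumption of this sketch, read off the example's link matrix; the variable names are illustrative.

```python
from fractions import Fraction

# Outlinks per page in the six-page example (assumption of this sketch)
links = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
n = len(links)

# Link matrix Q: Q[i][j] = 1/N_j if page j+1 links to page i+1
Q = [[Fraction(0)] * n for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        Q[i - 1][j - 1] = Fraction(1, len(targets))

# d_j = 1 exactly for the pages without outlinks
d = [1 if not links[j] else 0 for j in range(1, n + 1)]

# P = Q + (1/n) e d^T: every zero column of Q is replaced by a column of 1/n
P = [[Q[i][j] + Fraction(d[j], n) for j in range(n)] for i in range(n)]

col_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]
```

In practice $P$ is never formed like this for large $n$, since each page without outlinks contributes a full dense column.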


Column-Stochastic Matrix

Lemma. A column-stochastic matrix $P$ satisfies

$$ e^T P = e^T, \qquad e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. $$

Equivalently: $P^T e = e$.

$e$ is an eigenvector of $P^T$ with the eigenvalue 1.


Eigenvector of P

Recursive definition of PageRank:

$$ r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \quad i = 1, 2, \ldots, n. $$

Now equivalent to

$$ P r = r. $$

We want to find this eigenvector of $P$ (not $P^T$).

Questions:
1. Is the eigenvector $r$ unique?
2. Is it a probability vector?


Some Perron-Frobenius Theory

Theorem. If $P$ is an irreducible column-stochastic matrix, then the largest eigenvalue is equal to 1, and the corresponding eigenvector $r$ has non-negative elements.

The Google matrix is reducible: the random walk can get stuck in subgraphs of the Internet.

[Figure: link graph with six pages, numbered 1 to 6, containing subgraphs that the surfer cannot leave.]


Example

Corresponding matrix:

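$$
P = \begin{pmatrix}
0 & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & 0 & 0 \\
\frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 \\
\frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{2} & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
$$

Eigenvalues: $1, 1, -1, -0.5, -0.5, 0$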


Irreducible Matrix

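$$ A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T, \quad 0 \le \alpha \le 1 $$

$A$ is column-stochastic:

$$ e^T A = \alpha e^T P + (1 - \alpha) \frac{1}{n} e^T e e^T = \alpha e^T + (1 - \alpha) e^T = e^T. $$

Random walk interpretation: the surfer will jump to a random page with probability $1 - \alpha$ (teleportation).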


Eigenvalues of A

Theorem. The positive column-stochastic matrix $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ has a unique eigenvector $r > 0$ with the eigenvalue 1.

Proof. Perron-Frobenius theory of positive matrices.


Power Method

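The power method for $A r = \lambda r$:

    for $k = 1, 2, \ldots$
        $q^{(k)} = A r^{(k-1)}$
        $r^{(k)} = q^{(k)} / \| q^{(k)} \|_1$
    end

Expand the initial approximation $r^{(0)}$ in terms of the eigenvectors:

$$ r^{(0)} = c_1 r_1 + c_2 r_2 + \cdots + c_n r_n, $$

$$ A^k r^{(0)} = \lambda_1^k \left( c_1 r_1 + \sum_{j=2}^{n} c_j \left( \frac{\lambda_j}{\lambda_1} \right)^k r_j \right). $$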


Second Eigenvalue of A I

Theorem. Given that the eigenvalues of the column-stochastic matrix $P$ are $\{1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$, the eigenvalues of $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n\}$.
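One consequence for the power method: the eigenvalues of a column-stochastic matrix have modulus at most 1, so every eigenvalue of $A$ other than 1 has modulus at most $\alpha$, and the error terms in the eigenvector expansion decay at least like $\alpha^k$. The sketch below turns this into a rough iteration estimate. The function name `iterations_needed` is hypothetical, and the estimate ignores the constants $c_j$, so it indicates an order of magnitude only.

```python
import math


def iterations_needed(alpha, tol):
    """Smallest integer k with alpha**k <= tol (constants in the error are ignored)."""
    return math.ceil(math.log(tol) / math.log(alpha))


# With alpha = 0.85, how many iterations until alpha**k drops below 1e-6?
k = iterations_needed(0.85, 1e-6)
```

A smaller $\alpha$ gives faster convergence but lets the random jumps, rather than the links, dominate the ranking.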


Example: $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$

    LP = eig(P)';  e = ones(6,1);
    A = 0.85*P + 0.15/6*e*e';  [R,L] = eig(A)

gives

    LP = -0.5   1.0  -0.5   1.0  -1.0   0

    R =  0.447  -0.365  -0.354   0.000   0.817   0.101
         0.430  -0.365   0.354  -0.000  -0.408  -0.752
         0.430  -0.365   0.354   0.000  -0.408   0.651
         0.057  -0.000  -0.707   0.000   0.000  -0.000
         0.469   0.548  -0.000  -0.707   0.000   0.000
         0.456   0.548   0.354   0.707  -0.000  -0.000

    diag(L) = 1.0   0.85  -0.0  -0.85  -0.425  -0.425


Computing y = Az I

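$$ A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T \in \mathbb{R}^{n \times n} $$

Recall: $n \approx 10^{10}$.
• Cannot form $A$ (dense).
• Cannot form $P = Q + \frac{1}{n} e d^T$ (many dense columns).
• $Q$ is the actual link matrix.

1) If $\|z\|_1 = e^T z = 1$, then $\|y\|_1 = e^T y = e^T A z = e^T z = 1$ since $A$ is column-stochastic ($e^T A = e^T$). Normalization in the power iteration is unnecessary.

2) With $P = Q + \frac{1}{n} e d^T$,

$$ y = \alpha \left( Q + \frac{1}{n} e d^T \right) z + \frac{1 - \alpha}{n} e (e^T z) = \alpha Q z + \beta \frac{1}{n} e, $$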


Computing y = Az II

where $\beta = \alpha d^T z + (1 - \alpha) e^T z$. $\beta$ is computed from

$$ 1 = e^T (\alpha Q z) + \beta e^T \left( \frac{1}{n} e \right) = e^T (\alpha Q z) + \beta. $$

Thus, $\beta = 1 - \|\alpha Q z\|_1$.
Extra bonus: we do not need to know which pages lack outlinks.
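The identity can be checked numerically on the six-page example by computing $y = Az$ two ways: once from the link lists with the $\beta$ shortcut, and once by forming the dense matrix $A$ explicitly. This is a sketch under assumptions: the link list is read off the example's matrix $Q$, $\alpha = 0.85$ as in the earlier example, and the function names and the test vector `z` are made up here. The dense version exists only for comparison and would be infeasible for large $n$.

```python
# Outlinks per page in the six-page example (assumption of this sketch)
links = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
n = len(links)
alpha = 0.85


def sparse_step(z):
    """y = alpha*Q*z + beta*(1/n)*e with beta = 1 - ||alpha*Q*z||_1."""
    y = [0.0] * n
    for j, targets in links.items():
        for i in targets:
            y[i - 1] += alpha * z[j - 1] / len(targets)
    beta = 1.0 - sum(y)                    # entries are non-negative, so sum = 1-norm
    return [yi + beta / n for yi in y]


def dense_step(z):
    """y = A*z with A = alpha*P + (1 - alpha)/n * e*e^T formed explicitly."""
    A = [[(1.0 - alpha) / n] * n for _ in range(n)]
    for j, targets in links.items():
        if targets:
            for i in targets:
                A[i - 1][j - 1] += alpha / len(targets)
        else:                              # page without outlinks: column of 1/n in P
            for i in range(n):
                A[i][j - 1] += alpha / n
    return [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]


z = [0.1, 0.2, 0.3, 0.1, 0.1, 0.2]         # hypothetical probability vector, sums to 1
y_sparse = sparse_step(z)
y_dense = dense_step(z)
```

Note that `sparse_step` never tests which pages lack outlinks: their missing contribution shows up automatically in `beta`.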


Matlab Code

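    y = alpha*Q*z;             % sparse matrix-vector product alpha*Q*z
    beta = 1 - norm(y,1);      % beta = 1 - ||alpha*Q*z||_1
    y = y + beta*v;            % v = (1/n)*e, i.e. v = ones(n,1)/n
    residual = norm(y-z,1);    % 1-norm of the change from z to y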
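For readers without Matlab, the same four statements can be wrapped in a complete iteration in Python. This is a sketch, not the authors' code: the function name `pagerank`, the default tolerance, the uniform starting vector and the use of plain link lists instead of a sparse matrix type are all assumptions made here. The six-page link list is the one read off the earlier example.

```python
def pagerank(links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration y = alpha*Q*z + beta*(1/n)*e, stopped when ||y - z||_1 <= tol."""
    pages = sorted(links)
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    z = [1.0 / n] * n                      # uniform start, sums to 1
    for it in range(1, max_iter + 1):
        y = [0.0] * n
        for j, targets in links.items():   # y = alpha*Q*z from the link lists
            for i in targets:
                y[index[i]] += alpha * z[index[j]] / len(targets)
        beta = 1.0 - sum(y)                # beta = 1 - norm(y, 1)
        y = [yi + beta / n for yi in y]    # y = y + beta*v with v = e/n
        residual = sum(abs(a - b) for a, b in zip(y, z))
        z = y
        if residual <= tol:
            break
    return dict(zip(pages, z)), it


# Six-page example (assumption: links read off the example's matrix Q)
links = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
r, iterations = pagerank(links)
```

No normalization step appears in the loop, because each iterate already sums to 1 when the starting vector does.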


Example

Link structure for the domain stanford.edu.
Web pages: 281903, links: 2312497.

[Figure: sparsity plot of part of the link matrix, both axes from 0 to $2 \times 10^4$; nz = 16283.]


Convergence and PageRank

[Figure: residual versus iterations (0 to 70 iterations, logarithmic residual axis from $10^0$ down to $10^{-7}$), and the computed PageRank values (horizontal axis 0 to $3 \times 10^5$, vertical axis 0 to 0.012).]


What happens when α→ 1?

• $\alpha \to 0$: completely random, all pages get the same rank.
• $\alpha \to 1$: the links determine the rank.
• BUT: counterintuitive effects.

[Figure: link graph with seven pages, numbered 1 to 7.]


α→ 1

[Figure: link graph with seven pages, numbered 1 to 7.]

    Node   α = 0.5   α = 0.85   α = 0.9999
    1      0.15      0.07       0.0
    2      0.10      0.04       0.0
    3      0.10      0.04       0.0
    4      0.15      0.08       0.0
    5      0.24      0.39       0.5
    6      0.19      0.36       0.5
    7      0.07      0.02       0.0


Conclusions

• The power method is generally obsolete, but there are situations where there are no serious alternatives.
• Further work on acceleration of the power method (classical methods).
• In the case of PageRank for local, smaller domains, other methods should be much better (??).

Perhaps the Google problem as such is not so important. There are other applications where similar methods can be applied: graph similarity, social networks, semantic graphs, etc.
