TANA07: Data Mining using Matrix Methods
PageRank
Lars Eldén and Berkant Savas
Department of Mathematics
Linköping University, Sweden
2012
Copyright Lars Eldén, Berkant Savas 2012 TANA07: Data Mining using Matrix Methods
Google search: university
A Google search conducted on September 29, 2005, using the search phrase university, gave as a result:
Harvard
Stanford
Cambridge
Yale
Cornell
Oxford
The total number of web pages relevant to the search phrase was more than 2 billion.
The result (ordering) changes over time.
Google indexes billions of web pages
[Screenshot: Google Sweden start page, printed 2004-03-02, with the footer "©2004 Google - Searching 4,285,199,774 web pages".]
Google Mathematics?
Google ranks about $45 \cdot 10^9$ (45 billion) web pages (2011)
The ranking is based on a mathematical model of the link structure of the Internet
The ranking is recomputed often (each month?) and the computation takes 1-2 weeks(?): the world's largest matrix computation
When Google started in 1998 there was some mathematical theory but not enough. It was found experimentally(?) that the method works.
Now there is quite some literature on PageRank and its computation, and it is used in a wide range of areas.
Search Engines
Web crawler: Computer program that downloads web pages and scans their contents
Search engine:
1. Scan the page and collect key words (indexing)
2. Find the link structure graph of the whole Internet and store it as a sparse matrix (dimension $n \approx 45 \cdot 10^9$?)
3. Rank all web pages
Links
[Figure: page $i$ with its inlinks $I_i$ (pages that link to $i$) and its outlinks $O_i$ (pages that $i$ links to).]
All web pages are ordered: $1, 2, \ldots, n$
PageRank of page $i$: a number $r_i$ between 0 and 1 that indicates how important the page is.
Links, cont.
[Figure: page $i$ with its inlinks $I_i$ and its outlinks $O_i$.]
Provisional definition of PageRank: The more inlinks your page has, the more important it is.
Easy to manipulate: create pages that point to yours
New provisional definition
The more inlinks from important pages your page has, the more important it is:
$$r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n,$$
where $N_j$ is the number of outlinks of page $j$.
[Figure: page $i$ with its inlinks $I_i$ and its outlinks $O_i$.]
In the sum: The rank of page $j$ is divided equally between its outlinks. (Example)
Questions
Recursive definition:
$$r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n.$$
1. Does there exist a solution such that $0 \le r_i \le 1$?
2. If it exists, how to compute it?
Suggested Algorithm: Power method
1. Guess starting values $r_i^{(0)}$, $i = 1, \ldots, n$.
2. Iterate until convergence:
$$r_i^{(k)} = \sum_{j \in I_i} \frac{r_j^{(k-1)}}{N_j}, \qquad i = 1, 2, \ldots, n.$$
3. If the change is small enough,
$$\| r^{(k)} - r^{(k-1)} \| \le \mathrm{TOL},$$
then stop. Otherwise go to 2.
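The three steps can be sketched in Python (the slides themselves use Matlab). This is a minimal illustration, not the authors' code: the three-page link structure, the function name `power_iteration`, the uniform starting guess and the use of the 1-norm in the stopping test are all assumptions made for the example, and every page is assumed to have at least one outlink.

```python
def power_iteration(outlinks, tol=1e-12, max_iter=1000):
    """Iterate r_i = sum over inlinks j of r_j / N_j until the change is <= tol.

    Assumes every page has at least one outlink (N_j > 0).
    """
    pages = sorted(outlinks)
    r = {p: 1.0 / len(pages) for p in pages}      # step 1: starting guess
    for k in range(1, max_iter + 1):
        new_r = {p: 0.0 for p in pages}
        for j in pages:                            # step 2: one sweep
            share = r[j] / len(outlinks[j])        # r_j / N_j
            for i in outlinks[j]:                  # j is an inlink of i
                new_r[i] += share
        change = sum(abs(new_r[p] - r[p]) for p in pages)   # 1-norm of the change
        r = new_r
        if change <= tol:                          # step 3: stopping test
            return r, k
    raise RuntimeError("no convergence within max_iter sweeps")


# Hypothetical three-page web: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1.
links = {1: [2, 3], 2: [3], 3: [1]}
rank, sweeps = power_iteration(links)
print(rank, sweeps)
```

For this small graph the recursion can be solved by hand, giving $r = (0.4, 0.2, 0.4)$, which the iteration approaches.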
Matrix Formulation
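$$Q_{ij} = \begin{cases} 1/N_j & \text{if there is a link from } j \text{ to } i, \\ 0 & \text{otherwise.} \end{cases}$$
[Figure: sketch of the nonzero pattern of $Q$. The nonzeros in row $i$ correspond to the inlinks of page $i$; the nonzeros in column $j$ correspond to the outlinks of page $j$.]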
Example
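[Figure: link graph of the six-page example.]
$$Q = \begin{pmatrix}
0 & \frac{1}{3} & 0 & 0 & 0 & 0 \\
\frac{1}{3} & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{1}{3} & 0 & 0 & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & 0 & 0 & \frac{1}{2} \\
0 & 0 & 1 & 0 & \frac{1}{3} & 0
\end{pmatrix}$$
No outlinks from page 4: the corresponding column is equal to zero.
All other columns: the elements sum up to 1.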
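As a cross-check of the example, the matrix can be rebuilt from link lists. The sketch below is in Python with exact fractions; the dictionary `outlinks` is read off from the columns of $Q$ above, and the helper name `link_matrix` is an assumption made for illustration.

```python
from fractions import Fraction

# Outlinks read off from the columns of Q (page -> pages it links to).
outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}


def link_matrix(outlinks):
    """Return Q with Q[i][j] = 1/N_j if page j+1 links to page i+1, else 0."""
    n = len(outlinks)
    Q = [[Fraction(0)] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in targets:
            Q[i - 1][j - 1] = Fraction(1, len(targets))
    return Q


Q = link_matrix(outlinks)
column_sums = [sum(Q[i][j] for i in range(6)) for j in range(6)]
print(column_sums)
```

The column sums come out as 1 everywhere except for column 4, which is zero because page 4 has no outlinks.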
Random Walk Model
Random surfer: On each page, choose one of the outlinks with equal probability. Observe the surfer for a long time and count the frequency of visits to all pages.
PageRank $r_i$: the asymptotic probability that the surfer is at page $i$.
Markov chain: A random process in which the probabilities for the next state are determined completely by the current state. Discrete time and no memory. The state at time $t + 1$ depends on the state at time $t$ only.
$Q^T$ is the matrix of transition probabilities
PageRank definition
$$r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n.$$
[Figure: page $i$ with its inlinks $I_i$ and its outlinks $O_i$.]
$r_i$: relative frequency of visits at page $i$
Example, cont.
[Figure: link graph of the six-page example.]
The model does not work: The surfer gets stuck at page 4.
Remedy: Introduce (artificially) links from 4 to all the others. The random walk can continue forever.
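The random surfer with this remedy can be simulated directly. The Python sketch below is only an illustration: the outlink lists are read off from the example matrix $Q$, the function name, step count and seed are arbitrary choices, and a surfer on a page without outlinks is assumed to jump to a page chosen uniformly among all $n$ pages (including the current one), which matches the modified matrix $P = Q + \frac{1}{n} e d^T$ used later in the slides.

```python
import random

# Outlinks of the six-page example (page 4 has none).
outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}


def simulate_surfer(outlinks, steps, seed=0):
    """Return the relative frequency of visits to each page."""
    rng = random.Random(seed)
    pages = sorted(outlinks)
    visits = {p: 0 for p in pages}
    page = pages[0]
    for _ in range(steps):
        # Follow a random outlink; with no outlinks, jump to any page.
        targets = outlinks[page] or pages
        page = rng.choice(targets)
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}


freq = simulate_surfer(outlinks, steps=200_000)
print(freq)
```

Solving $Pr = r$ by hand for this graph gives $r = (3, 3, 35, 12, 27, 46)/126$, so the observed frequencies should settle near those values, with pages 3 and 6 visited most often.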
Modified Transition Matrix
Define
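$$d_j = \begin{cases} 1 & \text{if } N_j = 0, \\ 0 & \text{otherwise,} \end{cases}$$
for $j = 1, \ldots, n$, and
$$e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^n, \qquad
P = Q + \frac{1}{n} e d^T = \begin{pmatrix}
0 & \frac{1}{3} & 0 & \frac{1}{6} & 0 & 0 \\
\frac{1}{3} & 0 & 0 & \frac{1}{6} & 0 & 0 \\
0 & \frac{1}{3} & 0 & \frac{1}{6} & \frac{1}{3} & \frac{1}{2} \\
\frac{1}{3} & 0 & 0 & \frac{1}{6} & \frac{1}{3} & 0 \\
\frac{1}{3} & \frac{1}{3} & 0 & \frac{1}{6} & 0 & \frac{1}{2} \\
0 & 0 & 1 & \frac{1}{6} & \frac{1}{3} & 0
\end{pmatrix}.$$
$P$ is a proper column-stochastic matrix: non-negative elements, and the elements of each column sum up to 1.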
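The construction of $P$ can be checked with exact arithmetic. The following Python sketch rebuilds $Q$, $d$ and $P$ for the six-page example; the variable names are chosen for illustration, and the vector `r` is a hand-computed candidate for the stationary vector that the code verifies rather than derives.

```python
from fractions import Fraction

outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
n = len(outlinks)

# Link matrix Q: Q[i][j] = 1/N_j if page j+1 links to page i+1.
Q = [[Fraction(0)] * n for _ in range(n)]
for j, targets in outlinks.items():
    for i in targets:
        Q[i - 1][j - 1] = Fraction(1, len(targets))

# d marks the pages without outlinks; P = Q + (1/n) e d^T fills their columns.
d = [1 if not outlinks[j] else 0 for j in range(1, n + 1)]
P = [[Q[i][j] + Fraction(d[j], n) for j in range(n)] for i in range(n)]

column_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]

# Candidate stationary vector, verified below to satisfy P r = r.
r = [Fraction(k, 126) for k in (3, 3, 35, 12, 27, 46)]
Pr = [sum(P[i][j] * r[j] for j in range(n)) for i in range(n)]
print(column_sums == [1] * n, Pr == r)
```

Because fractions are exact, the two comparisons are true identities and not floating-point approximations.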
Column-Stochastic Matrix
Lemma
A column-stochastic matrix $P$ satisfies
$$e^T P = e^T, \qquad e = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}.$$
Equivalently: $P^T e = e$
$e$ is an eigenvector of $P^T$ with the eigenvalue 1.
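The lemma is a restatement of the column sums. Written out componentwise (a short check, using only the definition of a column-stochastic matrix):

```latex
(e^T P)_j = \sum_{i=1}^{n} e_i P_{ij} = \sum_{i=1}^{n} P_{ij} = 1 = (e^T)_j,
\qquad j = 1, \ldots, n.
```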
Eigenvector of P
Recursive definition of PageRank:
$$r_i = \sum_{j \in I_i} \frac{r_j}{N_j}, \qquad i = 1, 2, \ldots, n.$$
Now equivalent to
$$Pr = r.$$
We want to find this eigenvector of $P$ (not $P^T$)
Questions:
1. Is the eigenvector $r$ unique?
2. Is it a probability vector?
Some Perron-Frobenius Theory
Theorem
If $P$ is an irreducible column-stochastic matrix, then the largest eigenvalue is equal to 1, and the corresponding eigenvector $r$ has non-negative elements.
The Google matrix is reducible: the random walk can get stuck in subgraphs of the Internet
[Figure: link graph on six pages in which the random walk can get stuck in a subgraph.]
Example
Corresponding matrix:
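$$P = \begin{pmatrix}
0 & \frac{1}{2} & \frac{1}{2} & \frac{1}{2} & 0 & 0 \\
\frac{1}{2} & 0 & \frac{1}{2} & 0 & 0 & 0 \\
\frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \frac{1}{2} & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$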
Eigenvalues: $1, 1, -1, -0.5, -0.5, 0$
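The double eigenvalue 1 means that $Pr = r$ no longer has a unique probability vector as solution. A small Python check with exact fractions, using the matrix $P$ of this example; the two vectors `r_a` and `r_b` are chosen by inspection of the two closed subgraphs and are not taken from the slides.

```python
from fractions import Fraction

h = Fraction(1, 2)
zero = Fraction(0)
P = [
    [0, h, h, h, 0, 0],
    [h, 0, h, 0, 0, 0],
    [h, h, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, h, 0, 1],
    [0, 0, 0, 0, 1, 0],
]


def matvec(M, x):
    """Matrix-vector product with exact arithmetic."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


third = Fraction(1, 3)
r_a = [third, third, third, zero, zero, zero]   # surfer stays among pages 1, 2, 3
r_b = [zero, zero, zero, zero, h, h]            # surfer stays among pages 5, 6
mix = [(a + b) / 2 for a, b in zip(r_a, r_b)]   # any convex combination also works
print(matvec(P, r_a) == r_a, matvec(P, r_b) == r_b, matvec(P, mix) == mix)
```

All three vectors are probability vectors and all three satisfy $Pr = r$, so the ranking would depend on where the surfer happens to start.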
Irreducible Matrix
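$$A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T, \qquad 0 \le \alpha \le 1$$
$A$ is column-stochastic:
$$e^T A = \alpha e^T P + (1 - \alpha) \frac{1}{n} e^T e e^T = \alpha e^T + (1 - \alpha) e^T = e^T.$$
Random walk interpretation: the surfer will jump to a random page with probability $1 - \alpha$ (teleportation)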
Eigenvalues of A
Theorem
The positive column-stochastic matrix $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ has a unique eigenvector $r > 0$, normalized so that $\|r\|_1 = 1$, with the eigenvalue 1.
Proof. Perron-Frobenius theory of positive matrices
Power Method
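The power method for $Ar = \lambda r$
for k = 1, 2, . . .
    $q^{(k)} = A r^{(k-1)}$
    $r^{(k)} = q^{(k)} / \|q^{(k)}\|_1$
end
Expand the initial approximation $r^{(0)}$ in terms of eigenvectors:
$$r^{(0)} = c_1 r_1 + c_2 r_2 + \cdots + c_n r_n,$$
$$A^k r^{(0)} = \lambda_1^k \left( c_1 r_1 + \sum_{j=2}^{n} c_j \left( \frac{\lambda_j}{\lambda_1} \right)^k r_j \right).$$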
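The loop above translates almost line by line into Python. The sketch below forms $A$ densely, which is only reasonable for a toy problem; it uses the reducible six-page matrix $P$ from the earlier example with $\alpha = 0.85$, and the function name, starting vector and tolerance are assumptions made for illustration.

```python
def power_method(A, tol=1e-12, max_iter=10_000):
    """Power method with 1-norm normalization; returns r and the residual history."""
    n = len(A)
    r = [1.0 / n] * n                                   # r^(0): uniform start
    residuals = []
    for _ in range(max_iter):
        q = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]   # q = A r
        norm1 = sum(abs(x) for x in q)
        q = [x / norm1 for x in q]                      # r^(k) = q / ||q||_1
        residuals.append(sum(abs(a - b) for a, b in zip(q, r)))
        r = q
        if residuals[-1] <= tol:
            break
    return r, residuals


P = [
    [0.0, 0.5, 0.5, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
]
alpha, n = 0.85, 6
# A = alpha*P + (1 - alpha)/n * e*e^T, formed explicitly (small n only).
A = [[alpha * P[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]
r, residuals = power_method(A)
print(r, len(residuals))
```

Page 4 has no inlinks, so its rank comes from teleportation only and equals $(1 - \alpha)/n = 0.025$. The recorded residuals shrink by at least the factor $\alpha$ in every step.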
Second Eigenvalue of A I
Theorem
Given that the eigenvalues of the column-stochastic matrix $P$ are $\{1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$, the eigenvalues of $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$ are $\{1, \alpha\lambda_2, \alpha\lambda_3, \ldots, \alpha\lambda_n\}$.
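Since $|\lambda_j| \le 1$ for a column-stochastic $P$, all eigenvalues of $A$ other than 1 have modulus at most $\alpha$, so the error of the power method is expected to decay roughly like $\alpha^k$. A rough iteration estimate in Python, assuming the error is bounded by $\alpha^k$ with constant 1 (a simplification; the actual convergence may be faster, and the function name is hypothetical):

```python
import math


def iterations_needed(alpha, tol):
    """Smallest k with alpha**k <= tol, assuming the error decays like alpha**k."""
    return math.ceil(math.log(tol) / math.log(alpha))


k = iterations_needed(0.85, 1e-7)
print(k)  # 100
```

The estimate grows quickly as $\alpha$ approaches 1, which is one reason why $\alpha$ cannot be chosen arbitrarily close to 1 in practice.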
Example: $A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T$
LP=eig(P)'; e=ones(6,1);
A=0.85*P + 0.15/6*e*e'; [R,L]=eig(A)
gives
LP = -0.5 1.0 -0.5 1.0 -1.0 0
R =
  0.447  -0.365  -0.354   0.000   0.817   0.101
  0.430  -0.365   0.354  -0.000  -0.408  -0.752
  0.430  -0.365   0.354   0.000  -0.408   0.651
  0.057  -0.000  -0.707   0.000   0.000  -0.000
  0.469   0.548  -0.000  -0.707   0.000   0.000
  0.456   0.548   0.354   0.707  -0.000  -0.000
diag(L) = 1.0 0.85 -0.0 -0.85 -0.425 -0.425
Computing y = Az I
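$$A = \alpha P + (1 - \alpha) \frac{1}{n} e e^T \in \mathbb{R}^{n \times n}$$
Recall: $n \approx 10^{10}$
Cannot form $A$ (dense)
Cannot form $P = Q + \frac{1}{n} e d^T$ (many dense columns)
$Q$ is the actual link matrix
1) If $\|z\|_1 = e^T z = 1$, then $\|y\|_1 = e^T y = e^T A z = e^T z = 1$ since $A$ is column-stochastic ($e^T A = e^T$). Normalization in the power iteration is unnecessary.
2) With $P = Q + \frac{1}{n} e d^T$,
$$y = \alpha \left( Q + \frac{1}{n} e d^T \right) z + \frac{(1 - \alpha)}{n} e (e^T z) = \alpha Q z + \beta \frac{1}{n} e,$$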
Computing y = Az II
where $\beta = \alpha d^T z + (1 - \alpha) e^T z$. $\beta$ is computed from
$$1 = e^T (\alpha Q z) + \beta e^T \left( \frac{1}{n} e \right) = e^T (\alpha Q z) + \beta.$$
Thus, $\beta = 1 - \|\alpha Q z\|_1$.
Extra bonus: we do not need to know which pages lack outlinks.
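The identity can be verified on the first six-page example, where page 4 has no outlinks. The Python sketch below compares one step computed from the link lists only with the same step computed from an explicitly formed $A$; exact fractions are used so the comparison is an identity. The function names and the test vector `z` are assumptions made for illustration.

```python
from fractions import Fraction

outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
n = len(outlinks)
alpha = Fraction(85, 100)


def sparse_step(z):
    """y = alpha*Q*z + beta*e/n with beta = 1 - ||alpha*Q*z||_1, using only the links."""
    y = [Fraction(0)] * n
    for j, targets in outlinks.items():
        for i in targets:
            y[i - 1] += alpha * z[j - 1] / len(targets)
    beta = 1 - sum(y)            # entries are non-negative, so sum(y) is the 1-norm
    return [yi + beta / n for yi in y]


def dense_step(z):
    """Reference: form A = alpha*P + (1 - alpha)/n * e*e^T explicitly (small n only)."""
    A = [[(1 - alpha) / n] * n for _ in range(n)]
    for j, targets in outlinks.items():
        for i in range(1, n + 1):
            if targets:
                p = Fraction(1, len(targets)) if i in targets else Fraction(0)
            else:
                p = Fraction(1, n)   # column of P for a page without outlinks
            A[i - 1][j - 1] += alpha * p
    return [sum(A[i][j] * z[j] for j in range(n)) for i in range(n)]


z = [Fraction(k, 21) for k in range(1, 7)]   # some probability vector
print(sparse_step(z) == dense_step(z))
```

Note that `sparse_step` never looks at which pages lack outlinks; the missing mass shows up in `beta` automatically.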
Matlab Code
% One step of the power iteration; v is taken to be the uniform vector e/n
y=alpha*Q*z;
beta=1-norm(y,1);
y=y+beta*v;
residual=norm(y-z,1);
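For readers without Matlab, the same step can be wrapped in a complete iteration in Python. This is a sketch under stated assumptions: the link matrix is stored as outlink lists instead of a sparse matrix, `v` is taken to be the uniform vector $e/n$, and the function name, tolerance and test graph (the first six-page example) are chosen for illustration.

```python
def pagerank(outlinks, alpha=0.85, tol=1e-10, max_iter=1000):
    """PageRank by the power method, touching only the links (never forming A)."""
    pages = sorted(outlinks)
    n = len(pages)
    index = {p: k for k, p in enumerate(pages)}
    v = [1.0 / n] * n                                   # uniform vector e/n
    z = v[:]
    for it in range(1, max_iter + 1):
        y = [0.0] * n
        for j, targets in outlinks.items():             # y = alpha*Q*z
            for i in targets:
                y[index[i]] += alpha * z[index[j]] / len(targets)
        beta = 1.0 - sum(y)                             # beta = 1 - norm(y,1)
        y = [yi + beta * vi for yi, vi in zip(y, v)]    # y = y + beta*v
        residual = sum(abs(a - b) for a, b in zip(y, z))
        z = y
        if residual <= tol:
            return z, it
    raise RuntimeError("no convergence within max_iter iterations")


outlinks = {1: [2, 4, 5], 2: [1, 3, 5], 3: [6], 4: [], 5: [3, 4, 6], 6: [3, 5]}
ranks, iterations = pagerank(outlinks)
print(ranks, iterations)
```

The work per iteration is proportional to the number of links, which is what makes the method feasible for very large $n$.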
Example
Link structure for the domain stanford.edu
Web pages: 281903, links: 2312497.
[Figure: sparsity pattern of the link matrix, both axes from 0 to $2 \cdot 10^4$; nz = 16283.]
Convergence and PageRank
[Figure, two panels. Left: residual versus iterations (0 to 70) on a logarithmic scale from $10^0$ down to $10^{-7}$. Right: PageRank values (between 0 and 0.012) plotted against page index (0 to $3 \cdot 10^5$).]
What happens when α→ 1?
$\alpha \to 0$: completely random, all pages get the same rank
$\alpha \to 1$: The links determine the rank
BUT: counterintuitive effects
[Figure: link graph on seven pages.]
α→ 1
[Figure: link graph on seven pages.]
Node   α = 0.5   α = 0.85   α = 0.9999
1      0.15      0.07       0.0
2      0.10      0.04       0.0
3      0.10      0.04       0.0
4      0.15      0.08       0.0
5      0.24      0.39       0.5
6      0.19      0.36       0.5
7      0.07      0.02       0.0
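The dependence on $\alpha$ can be explored exactly on a small graph. The links of the seven-page graph are not recoverable from the text here, so the Python sketch below uses the reducible six-page matrix $P$ from the earlier example instead. It does not use the power method: since $e^T r = 1$, the equation $Ar = r$ is equivalent to the linear system $(I - \alpha P) r = \frac{1 - \alpha}{n} e$, which is solved by Gauss-Jordan elimination with fractions. The function names are assumptions made for illustration.

```python
from fractions import Fraction


def solve(M, b):
    """Solve M x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]      # scale pivot row to 1
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def pagerank_exact(P, alpha):
    """Solve (I - alpha*P) r = (1 - alpha)/n * e; alpha must be a Fraction below 1."""
    n = len(P)
    M = [[(1 if i == j else 0) - alpha * P[i][j] for j in range(n)] for i in range(n)]
    b = [(1 - alpha) / n] * n
    return solve(M, b)


h = Fraction(1, 2)
P = [
    [0, h, h, h, 0, 0],
    [h, 0, h, 0, 0, 0],
    [h, h, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, h, 0, 1],
    [0, 0, 0, 0, 1, 0],
]
for alpha in (Fraction(0), Fraction(1, 2), Fraction(85, 100), Fraction(9999, 10000)):
    r = pagerank_exact(P, alpha)
    print(float(alpha), [round(float(x), 4) for x in r])
```

With $\alpha = 0$ every page gets rank $1/6$. Page 4, which has no inlinks, always gets exactly $(1 - \alpha)/n$, so its rank vanishes as $\alpha \to 1$.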
Conclusions
The power method is generally obsolete, but there are situations where there are no serious alternatives
Further work on acceleration of the power method (classical methods)
In the case of PageRank for local, smaller domains, other methods should be much better (??)
Perhaps the Google problem as such is not so important. There are other applications where similar methods can be applied: graph similarity, social networks, semantic graphs, etc.