MATH36032 Problem Solving by Computer
Ranking algorithms and PageRank (the algorithm behind Google)
Ranking/Rating is ubiquitous
Some daily examples:
- recommender systems: the items recommended on your Amazon page
- league tables for universities
- shortlisting job applicants
More complicated examples:
- drug discovery: how do we identify the most promising drug structures from millions of candidates?
- bioinformatics: which gene is responsible for a particular symptom or disease?
Rules of Ranking could change ...
Different ranking rules apply at different stages of the UEFA Champions League:
- Three qualifying rounds (candidate teams are chosen by UEFA coefficients)
- Play-off round
- Group stage (ranking 4 teams in each group, from 32 teams to 16)
- Knockout phase (ranking 2 teams at a time, from 16 teams to the Champion)
Web search engine market share (2015)
What do web search engines do?
Three basic tasks:
- Locate all web pages with public access
- Index all web pages (so that they can be searched efficiently by keywords)
- Rate the importance of each page with respect to the keyword(s) searched
Search engine optimization: the process of improving the importance of a web page
Google’s PageRank Algorithm
- One way to calculate the importance of a node in a network (the PageRank score)
- Developed by Google’s founders, Larry Page and Sergey Brin*
- Based entirely on the link structure of the WWW, NOT the actual content of the web pages (think about the Kardashian sisters)
- Only requires basic knowledge of graph theory, probability (Markov chains) and linear algebra
* The “Page” in “PageRank” is the last name of Larry Page.
Web Link Structure: from graph to matrix
[Figure: a directed graph on four pages, 1-4]
Connectivity matrix H:

    h_ij = 1 if there is a hyperlink from page j to page i, and 0 otherwise,

with in-degree r_i = Σ_j h_ij and out-degree c_j = Σ_i h_ij.
The matrix can also be constructed from two index vectors (with sparse):
I = [2 3 1 3 4 4 1];
J = [1 1 2 2 2 3 4];
H = sparse(I, J, 1, 4, 4);   % 4x4 connectivity matrix from (row, column) pairs
Web Link Structure: from graph to matrix
[Figure: the same directed graph on four pages, 1-4]
Adjacency matrix (or connectivity matrix) H:

    H = [ 0  1  0  1 ]
        [ 1  0  0  0 ]
        [ 1  1  0  0 ]
        [ 0  1  1  0 ]

with in-degrees (row sums) [2 1 2 2] and out-degrees (column sums) [2 3 1 1].
What is the problem with just counting in-degrees? Think about “Zombie fans” (fake followers) on social networks.
Modified Algorithm: the rescaled in-degree
Basic idea: the total out-degree of each page is rescaled to one, h_ij → h_ij/c_j:

    H = [ 0  1  0  1 ]        H̃ = [ 0    1/3  0  1 ]
        [ 1  0  0  0 ]   →         [ 1/2  0    0  0 ]
        [ 1  1  0  0 ]             [ 1/2  1/3  0  0 ]
        [ 0  1  1  0 ]             [ 0    1/3  1  0 ]

Starting from the matrix H and the row vector c of out-degrees, how can we find H̃ efficiently in MATLAB? Multiply H by the diagonal matrix of reciprocal out-degrees:

    H̃ = [ 0    1/3  0  1 ]   [ 0  1  0  1 ] [ 1/2  0    0  0 ]
         [ 1/2  0    0  0 ] = [ 1  0  0  0 ] [ 0    1/3  0  0 ]
         [ 1/2  1/3  0  0 ]   [ 1  1  0  0 ] [ 0    0    1  0 ]
         [ 0    1/3  1  0 ]   [ 0  1  1  0 ] [ 0    0    0  1 ]

How can this algorithm be manipulated?
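As a sketch of this rescaling step (in Python/NumPy rather than the MATLAB of the slides, using the four-page example above), dividing each column of H by its column sum produces H̃ in one vectorised operation:

```python
import numpy as np

# Connectivity matrix of the four-page example:
# H[i, j] = 1 if page j links to page i.
H = np.array([[0, 1, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)

c = H.sum(axis=0)      # out-degrees (column sums): [2, 3, 1, 1]
Ht = H / c             # broadcasting divides column j by c[j]

print(Ht.sum(axis=0))  # each column of Ht now sums to 1
```

The equivalent MATLAB would be `Ht = H ./ c` (or `H * diag(1./c)` as above); this assumes no column sum is zero — dangling nodes are handled on a later slide.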
Modified Algorithm: eigenvalue problem for H̃
Looking for a PageRank vector p,

    p = [p_1, p_2, ..., p_n]^T,  p_i ≥ 0,  Σ_i p_i = 1.

With great power comes great responsibility: in equilibrium, node i has the same total weighted in-degree and out-degree, namely p_i, so each of its outgoing links carries weight p_i/c_i.
[Figure: the four-page graph with edges labelled p_2/c_2 = p_2/3, p_3/c_3 = p_3 and p_4/c_4 = p_4]
One reasonable condition: p_4 = p_3/c_3 + p_2/c_2, or, for all pages at once, H̃p = p.
Stochastic Matrices
(Column) stochastic matrices: nonnegative entries and unit column sums (i.e. e^T S = e^T with e = [1, 1, ..., 1]^T).
Theorem. Every column stochastic matrix S has 1 as an eigenvalue, and there is an associated nonnegative eigenvector p such that Sp = p.
Ht=[0 1/3 0 1; 1/2 0 0 0; 1/2 1/3 0 0 ; 0 1/3 1 0];
[V,D] = eig(Ht)
V =
0.6470 + 0.0000i 0.6234 + 0.0000i 0.6234 + 0.0000i -0.4413 + 0.0000i
0.3235 + 0.0000i -0.1799 - 0.3453i -0.1799 + 0.3453i 0.8484 + 0.0000i
0.4313 + 0.0000i -0.2728 - 0.2124i -0.2728 + 0.2124i -0.2391 + 0.0000i
0.5392 + 0.0000i -0.1707 + 0.5577i -0.1707 - 0.5577i -0.1681 + 0.0000i
D =
1.0000 + 0.0000i 0.0000 + 0.0000i 0.0000 + 0.0000i 0.0000 + 0.0000i
0.0000 + 0.0000i -0.3700 + 0.7099i 0.0000 + 0.0000i 0.0000 + 0.0000i
0.0000 + 0.0000i 0.0000 + 0.0000i -0.3700 - 0.7099i 0.0000 + 0.0000i
0.0000 + 0.0000i 0.0000 + 0.0000i 0.0000 + 0.0000i -0.2600 + 0.0000i
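The PageRank vector itself is the eigenvector for eigenvalue 1, rescaled so its entries sum to one. A Python/NumPy sketch of the same computation as the MATLAB snippet above:

```python
import numpy as np

Ht = np.array([[0, 1/3, 0, 1],
               [1/2, 0, 0, 0],
               [1/2, 1/3, 0, 0],
               [0, 1/3, 1, 0]])

vals, vecs = np.linalg.eig(Ht)
k = np.argmax(vals.real)      # locate the eigenvalue 1
p = vecs[:, k].real
p = p / p.sum()               # rescale: nonnegative entries summing to 1

print(p)   # PageRank scores [1/3, 1/6, 2/9, 5/18]; page 1 ranks highest
```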
More general theorem in advanced linear algebra
Theorem (Perron-Frobenius). If a matrix is nonnegative and irreducible (it cannot be reduced to block diagonal form), then it has a positive eigenvalue. This eigenvalue has the largest modulus among all eigenvalues, and its associated eigenvector has positive entries.
If the sum of each column is one (a stochastic matrix), then this special eigenvalue is one: if e^T S = e^T with e = [1, 1, ..., 1]^T, then 1 is an eigenvalue of S^T, and hence an eigenvalue of S, the transpose of S^T, because

    det(λI − S) = det(λI − S^T).
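This transpose argument is easy to check numerically; a small Python/NumPy sketch on the four-page matrix:

```python
import numpy as np

S = np.array([[0, 1/3, 0, 1],
              [1/2, 0, 0, 0],
              [1/2, 1/3, 0, 0],
              [0, 1/3, 1, 0]])

# S and S^T share the same characteristic polynomial, hence the same
# eigenvalues (sorted so the two lists line up entry by entry).
ev_S = np.sort_complex(np.linalg.eigvals(S))
ev_St = np.sort_complex(np.linalg.eigvals(S.T))
print(np.allclose(ev_S, ev_St))   # True
```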
Dangling node (no outgoing links)
[Figure: the four-page graph extended by a page 5 with incoming links only]

    H = [ 0  1  0  1  0 ]        H̃ = [ 0    1/4  0    1  ? ]
        [ 1  0  0  0  0 ]   →         [ 1/2  0    0    0  ? ]
        [ 1  1  0  0  0 ]             [ 1/2  1/4  0    0  ? ]
        [ 0  1  1  0  0 ]             [ 0    1/4  1/2  0  ? ]
        [ 0  1  1  0  0 ]             [ 0    1/4  1/2  0  ? ]
Ht = [0 1/4 0 1 0; 1/2 0 0 0 0; 1/2 1/4 0 0 0; ...
0 1/4 1/2 0 0; 0 1/4 1/2 0 0];
[V,D] = eig(Ht);
>> diag(D)'
ans =
0.000 0.8212 -0.1743+0.0000i -0.3234-0.5762i -0.3234+0.5762i
No eigenvalue 1!
Fix for Dangling node
[Figure: the same five-page graph]
Fill the dangling column with the uniform vector e/n = [1/5, ..., 1/5]^T:

    H = [ 0  1  0  1  0 ]        H̃ = [ 0    1/4  0    1  1/5 ]
        [ 1  0  0  0  0 ]   →         [ 1/2  0    0    0  1/5 ]
        [ 1  1  0  0  0 ]             [ 1/2  1/4  0    0  1/5 ]
        [ 0  1  1  0  0 ]             [ 0    1/4  1/2  0  1/5 ]
        [ 0  1  1  0  0 ]             [ 0    1/4  1/2  0  1/5 ]
Ht = [0 1/4 0 1 1/5; 1/2 0 0 0 1/5; 1/2 1/4 0 0 1/5; ...
0 1/4 1/2 0 1/5; 0 1/4 1/2 0 1/5];
[V,D] = eig(Ht);
>> diag(D)'
ans =
1.0000 -0.3091-0.5631i -0.3091+0.5631i -0.1818+0.0000i -0.0000
>> V(:,1)'
ans =
-0.5745 -0.3677 -0.4596 -0.4022 -0.4022
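The fix is mechanical: rescale ordinary columns by their out-degree and replace each all-zero column by the uniform vector. A Python/NumPy sketch mirroring the MATLAB matrices above:

```python
import numpy as np

# Five-page example; page 5 is a dangling node (an all-zero column).
H = np.array([[0, 1, 0, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 1, 1, 0, 0]], dtype=float)

n = H.shape[0]
c = H.sum(axis=0)                   # out-degrees: [2, 4, 2, 1, 0]
Ht = np.empty_like(H)
for j in range(n):
    if c[j] > 0:
        Ht[:, j] = H[:, j] / c[j]   # ordinary column: rescale by out-degree
    else:
        Ht[:, j] = 1 / n            # dangling column: uniform 1/n

vals = np.linalg.eigvals(Ht)
print(np.isclose(vals, 1).any())    # True: eigenvalue 1 is restored
```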
Markov Chain Interpretation: surfing the web
- Go from page to page by randomly choosing an outgoing link, each with probability 1/out-degree.
- At dead ends (pages without outgoing links), randomly choose one page from all web pages.
The matrix H̃ is the transition probability matrix of this Markov chain; the vector p is its equilibrium probability.
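This interpretation can be tested by simulation: walk the chain for many steps and compare visit frequencies with the equilibrium vector. A Python/NumPy sketch on the five-page example (the step count and seed are arbitrary choices):

```python
import numpy as np

# Transition matrix of the five-page example (dangling column = 1/5).
Ht = np.array([[0, 1/4, 0, 1, 1/5],
               [1/2, 0, 0, 0, 1/5],
               [1/2, 1/4, 0, 0, 1/5],
               [0, 1/4, 1/2, 0, 1/5],
               [0, 1/4, 1/2, 0, 1/5]])

rng = np.random.default_rng(0)
n = Ht.shape[0]
visits = np.zeros(n)
page = 0
for _ in range(100_000):
    page = rng.choice(n, p=Ht[:, page])   # follow a random outgoing link
    visits[page] += 1

freq = visits / visits.sum()
print(freq)   # close to the equilibrium vector p
```

The frequencies should approach [50, 32, 40, 35, 35]/192 ≈ [0.260, 0.167, 0.208, 0.182, 0.182], which is V(:,1) from the previous slide after normalisation.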
One final problem: uniqueness of PageRank Score
[Figure: a seven-page web consisting of two disconnected components, pages 1-4 and pages 5-7]
The matrix has the block diagonal form

    H = [ H1  0  ]        H̃ = [ H̃1  0  ]
        [ 0   H2 ]             [ 0   H̃2 ]

and there are two eigenvectors associated with eigenvalue one:

    P1 = [ p1 ]        P2 = [ 0  ]
         [ 0  ]             [ p2 ]

where H̃1 p1 = p1 and H̃2 p2 = p2. Which one do we choose?
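The non-uniqueness is easy to reproduce: for a reducible (block diagonal) stochastic matrix the eigenvalue 1 appears once per component. A Python/NumPy sketch using two hypothetical two-page components (not the seven-page graph drawn above):

```python
import numpy as np

H1t = np.array([[0, 1], [1, 0]])        # component 1 (hypothetical)
H2t = np.array([[0, 1/2], [1, 1/2]])    # component 2 (hypothetical)
Ht = np.block([[H1t, np.zeros((2, 2))],
               [np.zeros((2, 2)), H2t]])

vals = np.linalg.eigvals(Ht)
print(np.sum(np.isclose(vals, 1)))      # 2: one eigenvector per block
```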
The Google Matrix
    G(α) = αH̃ + (1 − α)ve^T,  0 < α < 1,

where e = [1, 1, ..., 1]^T and the nonnegative personalization vector v is chosen such that e^T v = Σ_i v_i = 1.
Markov chain interpretation: with probability α the surfer follows a link at random; with probability 1 − α the surfer jumps to a random page on the Web (chosen according to v).
Typical values: α = 0.85 and v = e/n.
The PageRank vector is usually computed iteratively for large matrices:

    p^(0) → p^(1) = G(α)p^(0) → p^(2) = G(α)p^(1) → ... → p*.
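A minimal power-iteration sketch in Python/NumPy, with v = e/n so that G(α)p = αH̃p + (1 − α)/n can be applied without ever forming the dense matrix G (the function name and tolerances are illustrative choices):

```python
import numpy as np

def pagerank(Ht, alpha=0.85, tol=1e-12, maxit=1000):
    """Iterate p -> G(alpha) p with v = e/n until p stops changing."""
    n = Ht.shape[0]
    p = np.full(n, 1 / n)                  # start from the uniform vector
    for _ in range(maxit):
        p_new = alpha * (Ht @ p) + (1 - alpha) / n
        if np.abs(p_new - p).sum() < tol:  # converged in the 1-norm
            break
        p = p_new
    return p_new

Ht = np.array([[0, 1/3, 0, 1],
               [1/2, 0, 0, 0],
               [1/2, 1/3, 0, 0],
               [0, 1/3, 1, 0]])
p = pagerank(Ht)
print(p)   # sums to 1; a fixed point of G(0.85)
```

Since e^T p = 1 is preserved at every step, the rank-one term (1 − α)ve^T p collapses to the constant vector (1 − α)/n; this is what keeps each iteration cheap for large, sparse H̃.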
References
[1] Brin S, Page L. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks. 2012;56(18):3825-3833.
[2] Langville AN, Meyer CD. Google’s PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press; 2011.
There are many theorems related to stochastic matrices (existence of an equilibrium) and, more generally, to nonnegative matrices (the Perron-Frobenius theorem). Iterative methods are also very popular for large, sparse matrices.