MATH36032 Problem Solving by Computer

Ranking algorithms and PageRank (the algorithm behind Google)

Ranking/Rating is ubiquitous

Some daily examples:

- recommender systems: the items recommended on your Amazon page
- league tables for universities
- shortlisting job applicants

More complicated examples:

- drug discovery: how do we identify the most promising drug structures from millions of candidates?
- bioinformatics: which gene is responsible for a particular symptom or disease?

Rules of Ranking could change ...

Different ranking rules at different stages of the UEFA Champions League:

- Three qualifying rounds (candidate teams are based on UEFA coefficients)
- Play-off round
- Group stage (ranking 4 teams in each group, from 32 teams to 16)
- Knockout phase (ranking 2 teams, from 16 teams to the Champion)

[Figure: web search engine market share (2015)]

What does a web search engine do?

Three basic tasks:

- Locate all web pages with public access
- Index all web pages (so that they can be searched efficiently by keywords)
- Rate the importance of each page for the keyword(s) searched

Search engine optimization: the process of improving the rated importance of a web page

Google’s PageRank Algorithm

- One way to calculate the importance of a node in a network (the PageRank score)
- Developed by Google’s founders, Larry Page and Sergey Brin¹
- Based entirely on the link structure of the WWW, NOT the actual content of the web pages (think about the Kardashian sisters)
- Only requires basic knowledge of Graph Theory, Probability (Markov Chains) and Linear Algebra

¹ The "Page" in "PageRank" is the last name of Larry Page

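Web Link Structure: from graph to matrix

[Figure: directed graph of hyperlinks between four pages, labelled 1 to 4]

Connectivity matrix H:

$$h_{ij} = \begin{cases} 1 & \text{if there is a hyperlink from page } j \text{ to page } i, \\ 0 & \text{otherwise,} \end{cases}$$

with in-degree $r_i = \sum_j h_{ij}$ and out-degree $c_j = \sum_i h_{ij}$.

The matrix can also be constructed from two index vectors (with sparse):

I = [2 3 1 3 4 4 1];
J = [1 1 2 2 2 3 4];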
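
The construction with sparse can be sketched as follows. This is a minimal illustration that assumes the number of pages n = 4 is known in advance; the variable name n is not from the slides.

```matlab
% Row indices I and column indices J of the nonzero entries of H:
% entry k records a hyperlink from page J(k) to page I(k).
I = [2 3 1 3 4 4 1];
J = [1 1 2 2 2 3 4];
n = 4;                       % number of pages (assumed known)

% sparse(i, j, v, m, n) puts the value v at each position (i(k), j(k));
% a scalar v is reused for every entry.
H = sparse(I, J, 1, n, n);

disp(full(H))                % dense view, only sensible for small n
```

For a real web graph only the index vectors are stored, so the memory cost grows with the number of links rather than with n^2.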

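Web Link Structure: from graph to matrix

[Figure: directed graph on the four pages 1 to 4]

Adjacency matrix (or connectivity matrix) H:

$$H = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$

with in-degree (row sum) [2 1 2 2] and out-degree (column sum) [2 3 1 1].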
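
As a quick check, the two degree vectors can be read off with sum; the names r and c follow the notation $r_i$ and $c_j$ used for the in-degree and out-degree.

```matlab
H = [0 1 0 1; 1 0 0 0; 1 1 0 0; 0 1 1 0];

r = sum(H, 2)';   % in-degree: sum along each row, stored as a row vector
c = sum(H, 1);    % out-degree: sum down each column

disp(r)
disp(c)
```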

What is the problem with just counting the in-degree? Think about "zombie fans" (fake followers) on social networks.

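Modified Algorithm: the rescaled in-degree

Basic idea: the total out-degree of each page is rescaled to one, $h_{ij} \to h_{ij}/c_j$.

$$H = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix} \;\to\; \tilde{H} = \begin{pmatrix} 0 & 1/3 & 0 & 1 \\ 1/2 & 0 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/3 & 1 & 0 \end{pmatrix}.$$

Starting from the matrix H and the row vector c, how can we find H̃ efficiently in MATLAB?

$$\tilde{H} = \begin{pmatrix} 0 & 1/3 & 0 & 1 \\ 1/2 & 0 & 0 & 0 \\ 1/2 & 1/3 & 0 & 0 \\ 0 & 1/3 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1/2 & 0 & 0 & 0 \\ 0 & 1/3 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$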
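
The factorisation above translates directly into MATLAB. The sketch below assumes every page has at least one out-going link, so that no entry of c is zero; the sparse variant and the names Hs and n are illustrative additions, not taken from the slides.

```matlab
H = [0 1 0 1; 1 0 0 0; 1 1 0 0; 0 1 1 0];
c = sum(H, 1);                   % out-degrees, here [2 3 1 1]

% Right-multiplying by a diagonal matrix scales column j by 1/c(j).
Ht = H * diag(1 ./ c);

% For a large sparse H, a sparse diagonal factor avoids building a
% dense n-by-n matrix.
n  = size(H, 1);
Hs = sparse(H) * spdiags(1 ./ c(:), 0, n, n);

disp(Ht)
```

Each column of Ht now sums to one, which is the property the later slides rely on.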

How can this algorithm be manipulated?

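Modified Algorithm: eigenvalue problem for H̃

Looking for a PageRank vector p,

$$p = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}, \qquad p_i \ge 0, \qquad \sum_i p_i = 1.$$

With great power comes great responsibility: node i passes on a total weight $p_i$ through its out-going links, and receives the same total weight $p_i$ through its in-coming links.

[Figure: the four-page graph with weights on the links at page 4: in-coming $p_3/c_3 = p_3$ from page 3 and $p_2/c_2 = p_2/3$ from page 2, out-going $p_4/c_4 = p_4$]

One reasonable condition at page 4 is $p_4 = p_3/c_3 + p_2/c_2$; imposing the same balance at every page gives $\tilde{H}p = p$.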
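
Written out for the four-page example, with out-degrees c = [2 3 1 1], the balance condition at every page gives the system below. The solution on the right was worked out by hand for this illustration and normalised so that its entries sum to one.

```latex
\begin{aligned}
p_1 &= \tfrac{1}{3}\,p_2 + p_4, \\
p_2 &= \tfrac{1}{2}\,p_1, \\
p_3 &= \tfrac{1}{2}\,p_1 + \tfrac{1}{3}\,p_2, \\
p_4 &= \tfrac{1}{3}\,p_2 + p_3,
\end{aligned}
\qquad\Longrightarrow\qquad
p = \tfrac{1}{18}\,(6,\ 3,\ 4,\ 5)^{T}.
```

Page 1 gets the highest score even though pages 1, 3 and 4 all have in-degree 2, because page 1 receives the whole weight of page 4.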

Stochastic Matrices

(Column) stochastic matrices: nonnegative entries and unit column sums (i.e. $e^T S = e^T$ with $e = [1, 1, \dots, 1]^T$).

Theorem. Every column stochastic matrix S has 1 as an eigenvalue, and there is an associated nonnegative eigenvector p such that Sp = p.

Ht = [0 1/3 0 1; 1/2 0 0 0; 1/2 1/3 0 0; 0 1/3 1 0];
[V,D] = eig(Ht)

V =
   0.6470 + 0.0000i   0.6234 + 0.0000i   0.6234 + 0.0000i  -0.4413 + 0.0000i
   0.3235 + 0.0000i  -0.1799 - 0.3453i  -0.1799 + 0.3453i   0.8484 + 0.0000i
   0.4313 + 0.0000i  -0.2728 - 0.2124i  -0.2728 + 0.2124i  -0.2391 + 0.0000i
   0.5392 + 0.0000i  -0.1707 + 0.5577i  -0.1707 - 0.5577i  -0.1681 + 0.0000i

D =
   1.0000 + 0.0000i   0.0000 + 0.0000i   0.0000 + 0.0000i   0.0000 + 0.0000i
   0.0000 + 0.0000i  -0.3700 + 0.7099i   0.0000 + 0.0000i   0.0000 + 0.0000i
   0.0000 + 0.0000i   0.0000 + 0.0000i  -0.3700 - 0.7099i   0.0000 + 0.0000i
   0.0000 + 0.0000i   0.0000 + 0.0000i   0.0000 + 0.0000i  -0.2600 + 0.0000i
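
The first column of V is scaled to unit Euclidean length, not to unit sum, so it is not yet a PageRank vector. A minimal sketch of the extra normalisation step follows; the index k and the search for the eigenvalue closest to 1 are additions for robustness, since the ordering returned by eig should not be relied on.

```matlab
Ht = [0 1/3 0 1; 1/2 0 0 0; 1/2 1/3 0 0; 0 1/3 1 0];
[V, D] = eig(Ht);

% Locate the eigenvalue closest to 1 rather than assuming it comes first.
[~, k] = min(abs(diag(D) - 1));

% eig fixes eigenvectors only up to a scalar factor; dividing by the sum
% removes any sign flip and makes the entries add up to one.
p = real(V(:, k));
p = p / sum(p);

disp(p')
```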

More general theorem in advanced linear algebra

Theorem (Perron-Frobenius). If a matrix has non-negative entries and is irreducible (it cannot be brought to block triangular form by applying the same permutation to its rows and columns), then it has a positive eigenvalue. This eigenvalue has the largest modulus among all eigenvalues, and its associated eigenvector has positive entries.

If each column sums to one (stochastic matrix), then this special eigenvalue is one: if $e^T S = e^T$ with $e = [1, 1, \dots, 1]^T$, then $S^T e = e$, so 1 is an eigenvalue of $S^T$, and hence an eigenvalue of $S$, the transpose of $S^T$, because

$$\det(\lambda I - S) = \det(\lambda I - S^T).$$

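Dangling node (no out-going links)

[Figure: directed graph on five pages 1 to 5; page 5 has no out-going links]

$$H = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix} \;\to\; \tilde{H} = \begin{pmatrix} 0 & 1/4 & 0 & 1 & ? \\ 1/2 & 0 & 0 & 0 & ? \\ 1/2 & 1/4 & 0 & 0 & ? \\ 0 & 1/4 & 1/2 & 0 & ? \\ 0 & 1/4 & 1/2 & 0 & ? \end{pmatrix}$$

Ht = [0 1/4 0 1 0; 1/2 0 0 0 0; 1/2 1/4 0 0 0; ...
      0 1/4 1/2 0 0; 0 1/4 1/2 0 0];
[V,D] = eig(Ht);
>> diag(D)'
ans =
   0.000   0.8212  -0.1743+0.0000i  -0.3234-0.5762i  -0.3234+0.5762i

No eigenvalue 1!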

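Fix for Dangling node

[Figure: directed graph on five pages 1 to 5; page 5 has no out-going links]

$$H = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix} \;\to\; \tilde{H} = \begin{pmatrix} 0 & 1/4 & 0 & 1 & 1/5 \\ 1/2 & 0 & 0 & 0 & 1/5 \\ 1/2 & 1/4 & 0 & 0 & 1/5 \\ 0 & 1/4 & 1/2 & 0 & 1/5 \\ 0 & 1/4 & 1/2 & 0 & 1/5 \end{pmatrix}$$

Ht = [0 1/4 0 1 1/5; 1/2 0 0 0 1/5; 1/2 1/4 0 0 1/5; ...
      0 1/4 1/2 0 1/5; 0 1/4 1/2 0 1/5];
[V,D] = eig(Ht);
>> diag(D)'
ans =
   1.0000  -0.3091-0.5631i  -0.3091+0.5631i  -0.1818+0.0000i  -0.0000
>> V(:,1)'
ans =
  -0.5745  -0.3677  -0.4596  -0.4022  -0.4022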
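
The same fix can be applied programmatically instead of typing H̃ by hand. This is a sketch under the assumption that dangling pages are exactly the columns of H with zero sum; the names dangling, n and k are illustrative. The final division by the sum also removes the negative sign seen in V(:,1) above, which is an arbitrary choice made by eig.

```matlab
H = [0 1 0 1 0; 1 0 0 0 0; 1 1 0 0 0; 0 1 1 0 0; 0 1 1 0 0];
n = size(H, 1);
c = sum(H, 1);                  % out-degrees; c(5) is 0 here
dangling = (c == 0);            % logical mask of dangling pages

Ht = zeros(n);
% Ordinary pages: scale each column by 1/out-degree.
Ht(:, ~dangling) = H(:, ~dangling) * diag(1 ./ c(~dangling));
% Dangling pages: link to every page with equal weight 1/n.
Ht(:, dangling) = 1 / n;

[V, D] = eig(Ht);
[~, k] = min(abs(diag(D) - 1)); % eigenvalue closest to 1
p = real(V(:, k));
p = p / sum(p);                 % unit sum, positive entries

disp(p')
```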

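Markov Chain Interpretation: surfing the web

[Figure: directed graph on five pages 1 to 5; page 5 has no out-going links]

$$H = \begin{pmatrix} 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \end{pmatrix} \;\to\; \tilde{H} = \begin{pmatrix} 0 & 1/4 & 0 & 1 & 1/5 \\ 1/2 & 0 & 0 & 0 & 1/5 \\ 1/2 & 1/4 & 0 & 0 & 1/5 \\ 0 & 1/4 & 1/2 & 0 & 1/5 \\ 0 & 1/4 & 1/2 & 0 & 1/5 \end{pmatrix}$$

- Go from page to page by randomly choosing an out-going link, each with probability 1/out-degree.
- At dead ends (pages without out-going links), randomly choose one page from all web pages.

The matrix H̃ is the transition probability matrix of this Markov chain; the vector p is the equilibrium probability distribution.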
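
The random surfer can be simulated directly. The sketch below is only an illustration: the walk length nsteps and the starting page are arbitrary assumptions, and the visit frequencies it prints vary from run to run and only approximate the equilibrium vector p.

```matlab
Ht = [0 1/4 0 1 1/5; 1/2 0 0 0 1/5; 1/2 1/4 0 0 1/5; ...
      0 1/4 1/2 0 1/5; 0 1/4 1/2 0 1/5];
n = size(Ht, 1);
nsteps = 20000;                 % assumed walk length
visits = zeros(n, 1);
cur = 1;                        % assumed starting page

for k = 1:nsteps
    % Column cur holds the probabilities of moving from page cur.
    cdf = cumsum(Ht(:, cur));
    % Inverse-CDF sampling: count how many cdf entries lie below a
    % uniform random number; min guards against round-off in cdf(end).
    cur = min(sum(rand > cdf) + 1, n);
    visits(cur) = visits(cur) + 1;
end

freq = visits / nsteps;         % fraction of time spent on each page
disp(freq')
```

With a longer walk the frequencies should settle closer to the eigenvector of H̃ for eigenvalue 1, normalised to unit sum.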

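One final problem: uniqueness of PageRank Score

[Figure: directed graph on seven pages 1 to 7, split into two groups with no links between them]

The matrix has the block diagonal form

$$H = \begin{pmatrix} H_1 & \\ & H_2 \end{pmatrix}, \qquad \tilde{H} = \begin{pmatrix} \tilde{H}_1 & \\ & \tilde{H}_2 \end{pmatrix},$$

and there are two eigenvectors associated with the eigenvalue one:

$$P_1 = \begin{pmatrix} p_1 \\ 0 \end{pmatrix}, \qquad P_2 = \begin{pmatrix} 0 \\ p_2 \end{pmatrix},$$

where $\tilde{H}_1 p_1 = p_1$ and $\tilde{H}_2 p_2 = p_2$. Which one do we choose?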
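
The loss of uniqueness can be seen on a small made-up example. The blocks H1 and H2 below are hypothetical (they are not the seven-page graph from the slide) and were chosen only so that each block is column stochastic.

```matlab
% Hypothetical blocks: two groups of pages with no links between them.
H1 = [0 1/2 1; 1/2 0 0; 1/2 1/2 0];     % pages 1-3
H2 = [0 1; 1 0];                        % pages 4-5
Ht = blkdiag(H1, H2);

% Eigenvalue 1 now appears once per block.
lambda = eig(Ht);
nOnes = sum(abs(lambda - 1) < 1e-8);

% One unit-sum eigenvector per block, padded with zeros.
P1 = [4; 2; 3; 0; 0] / 9;
P2 = [0; 0; 0; 1; 1] / 2;

% Any convex combination is again a valid PageRank vector.
t = 0.3;                                % arbitrary weight in [0, 1]
p = t * P1 + (1 - t) * P2;

disp(nOnes)
disp(norm(Ht * p - p))
```

Since every choice of t gives a different ranking between the two groups, the eigenvalue problem alone does not determine the scores.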

The Google Matrix

$$G(\alpha) = \alpha \tilde{H} + (1 - \alpha)\, v e^T, \qquad 0 < \alpha < 1,$$

where $e = [1, 1, \dots, 1]^T$, and the non-negative personalization vector $v$ is chosen such that $e^T v = \sum_i v_i = 1$.

Markov chain interpretation: with probability α the surfer chooses a link at random; with probability 1 − α the surfer chooses a random page on the Web.

Typical value: α = 0.85 and v = e/n.

The PageRank vector is usually computed iteratively for large matrices:

$$p^{(0)} \to p^{(1)} = G(\alpha)\,p^{(0)} \to p^{(2)} = G(\alpha)\,p^{(1)} \to \cdots \to p^{*}.$$
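
The iteration can be sketched as a plain power method on the five-page example. The starting vector, the iteration cap of 200 and the tolerance 1e-12 are assumptions made for this illustration, not values from the slides.

```matlab
Ht = [0 1/4 0 1 1/5; 1/2 0 0 0 1/5; 1/2 1/4 0 0 1/5; ...
      0 1/4 1/2 0 1/5; 0 1/4 1/2 0 1/5];
n = size(Ht, 1);
alpha = 0.85;                   % typical value quoted above
e = ones(n, 1);
v = e / n;                      % uniform vector v = e/n

p = v;                          % assumed starting vector p^(0)
for k = 1:200                   % assumed iteration cap
    % G*p = alpha*Ht*p + (1-alpha)*v*(e'*p), without forming G.
    pnew = alpha * (Ht * p) + (1 - alpha) * v * (e' * p);
    done = norm(pnew - p, 1) < 1e-12;   % assumed tolerance
    p = pnew;
    if done, break; end
end

% Dense Google matrix, built here only to check the result.
G = alpha * Ht + (1 - alpha) * (v * e');
disp(p')
disp(norm(G * p - p, 1))
```

Applying G through its two terms keeps the cost of each step proportional to the number of links, since the dense rank-one part is never stored.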

References

[1] Brin S, Page L. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks. 2012 Dec 17;56(18):3825-33.

[2] Langville AN, Meyer CD. Google’s PageRank and beyond: The science of search engine rankings. Princeton University Press; 2011 Jul 1.

There are many theorems related to stochastic matrices (existence of equilibrium), and more generally non-negative matrices (Perron-Frobenius Theorem). Iterative methods are also very popular for large, sparse matrices.