Numerical computing & Google’s PageRank DAVID F. GLEICH, CS 197 PRESENTATION

A history of PageRank from the numerical computing perspective


DESCRIPTION

We'll survey some of the underlying ideas from Google's PageRank algorithm along the lines of Massimo Franceschet's CACM history. There are some slight liberties I've taken to make it more accessible.


Page 1: A history of PageRank from the numerical computing perspective

Numerical computing & Google’s PageRank

DAVID F. GLEICH, CS 197 PRESENTATION

Page 2: A history of PageRank from the numerical computing perspective

Hey Katie, do you have a date for Valentine’s Day?

It was 1234567890 in 2009.

Page 3: A history of PageRank from the numerical computing perspective

Thanks Internet!

http://school.discoveryeducation.com/clipart/clip/stk-fgr6.html
http://listsoplenty.com/pix/tag/cartoon
https://www.facebook.com/ProgrammersJokes
http://www.feld.com/wp/archives/2009/02/unix-time-1234567890-on-valentines-day.html

Page 4: A history of PageRank from the numerical computing perspective


Thanks Google

Page 5: A history of PageRank from the numerical computing perspective

How did Google get started?

Page 6: A history of PageRank from the numerical computing perspective

How did Google get started? … with an idea … … on the shoulders of giants!

Page 7: A history of PageRank from the numerical computing perspective

LEO KATZ

David F. Gleich (Purdue) Emory Math/CS Seminar

Page 8: A history of PageRank from the numerical computing perspective

Vannevar Bush “wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified”

-- “As We May Think”, The Atlantic, July 1945

Page 9: A history of PageRank from the numerical computing perspective

Sir Tim Berners-Lee “We should work towards a universal linked information system … to allow a place for any information or reference one felt was important and a way of finding it afterwards.”

-- Founding proposal for “the mesh”, 1989

Page 10: A history of PageRank from the numerical computing perspective

… the mesh became the web … the web became a mess ... “finding it afterwards”? Hah!

Page 11: A history of PageRank from the numerical computing perspective

Larry Page and Sergey Brin
•  Grad students at Stanford
•  Worked with Terry Winograd (artificial intelligence)
•  Created a web-search algorithm called “BackRub”
•  Spun off a company, “Google” (a play on “googol”)
•  Worth about $20 billion each

Page 12: A history of PageRank from the numerical computing perspective

A cartoon websearch primer
1. Crawl webpages
2. Analyze webpage text (information retrieval)
3. Analyze webpage links
4. Fit measures to human evaluations
5. Produce rankings
6. Continuously update

Page 13: A history of PageRank from the numerical computing perspective

SportsIllustrated.com BobsPortsIllustrated.com

Page 14: A history of PageRank from the numerical computing perspective

[Figure: diagram with labels 1, 2, 3]

Gleich (Stanford) PageRank intro Ph.D. Defense

Page 15: A history of PageRank from the numerical computing perspective

What pages are important? Those that people visit a lot! How do we check? Create a model of how people visit the web.

Page 16: A history of PageRank from the numerical computing perspective

What pages are important? The Google random surfer
•  Follows a random link with probability alpha: “random clicks”
•  Goes anywhere with probability (1-alpha): “random jumps”
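Written out as a formula, the random surfer's two rules give one transition probability per pair of pages. This is a sketch under assumptions the slide does not state: A_{ij} is 1 when page i links to page j and 0 otherwise, deg(i) is the number of links leaving page i, n is the total number of pages, and every page has at least one outgoing link.

```latex
\[
\Pr(\text{next} = j \mid \text{current} = i)
  \;=\; \alpha \,\frac{A_{ij}}{\deg(i)} \;+\; \frac{1-\alpha}{n}
\]
```

The first term is the “random click” and the second is the “random jump”. Summing over all pages j gives alpha + (1-alpha) = 1, so each row is a valid probability distribution.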

Page 17: A history of PageRank from the numerical computing perspective

This is a Markov chain!

Page 18: A history of PageRank from the numerical computing perspective

Andrei Markov
•  Studied sequences of random variables.
•  The probability that the random variable takes a particular next value depends only on its current value.
•  The “page id” is the “random variable” in the Markov chain!

Page 19: A history of PageRank from the numerical computing perspective

Oskar Perron and Georg Frobenius
•  Discovered when a Markov chain has an “average”
•  The “average” of the web? It’s the probability of finding the random surfer at a page.
•  In 1907 (Perron), extended in 1912 (Frobenius)

Page 20: A history of PageRank from the numerical computing perspective

What pages are important? Perron and Frobenius proved the following algorithm always converges to a solution…

set prob[i] = 0 for all pages
set p to a random page
for t = 1 to ...
    increment prob[p]
    if rand() < alpha, set p to a random neighbor of p
    else, set p to a random page
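The simulation above can be tried on a toy graph. The sketch below is one possible Python rendering, not the presenter's code. The function name, the dictionary input format, alpha = 0.85, the step count, and the rule that a page with no links always jumps are all assumptions added for illustration. It also divides the visit counts by the number of steps so that the result reads as a probability.

```python
import random


def simulate_random_surfer(links, alpha=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer visits each page.

    links: dict mapping every page to the list of pages it links to
           (hypothetical input format; every link target must be a key).
    alpha: probability of following a link (0.85 is an assumed value).
    """
    rng = random.Random(seed)
    pages = list(links)
    counts = {page: 0 for page in pages}
    p = rng.choice(pages)                # set p to a random page
    for _ in range(steps):
        counts[p] += 1                   # increment prob[p]
        neighbors = links[p]
        # Assumption: a page without links always makes a random jump.
        if neighbors and rng.random() < alpha:
            p = rng.choice(neighbors)    # "random click"
        else:
            p = rng.choice(pages)        # "random jump"
    # Normalize counts into visit frequencies.
    return {page: count / steps for page, count in counts.items()}


# Toy three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
freq = simulate_random_surfer(links)
for page in sorted(freq, key=freq.get, reverse=True):
    print(page, round(freq[page], 3))
```

Page b has only one incoming link, carrying half of a's clicks, so it should come out as the least visited page. The estimates are noisy and only settle down slowly as the number of steps grows.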

Page 21: A history of PageRank from the numerical computing perspective

Richard von Mises
•  Created “the power method”
•  An efficient algorithm to “average” a Markov chain
•  It updated the probabilities of all pages at once.

“Praktische Verfahren der Gleichungsauflösung”
R. von Mises and H. Pollaczek-Geiringer, 1929

Page 22: A history of PageRank from the numerical computing perspective

What pages are important? Using the von Mises method …

set prob[i] = 1/n for all pages
for t = 1 to about 80
    set newprob[i] = 0 for all pages
    for all links from page i to page j
        set newprob[j] += prob[i]/deg[i]
    for all pages i
        set prob[i] = alpha*newprob[i] + (1-alpha)/n
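Here is a minimal Python sketch of the same update, for readers who want to run it. The function name, the dictionary input format, and alpha = 0.85 are illustrative assumptions. The handling of pages with no outgoing links (spreading their probability evenly over all pages) is also an assumption, since the pseudocode does not say what to do with them.

```python
def pagerank_power_method(links, alpha=0.85, iterations=80):
    """Power-method update from the slide, on a dict-of-lists graph.

    links: dict mapping every page to the list of pages it links to
           (hypothetical input format; every link target must be a key).
    alpha: probability of following a link (0.85 is an assumed value).
    """
    pages = list(links)
    n = len(pages)
    prob = {page: 1.0 / n for page in pages}      # prob[i] = 1/n
    for _ in range(iterations):                   # "about 80" sweeps
        newprob = {page: 0.0 for page in pages}
        for i in pages:
            out = links[i]
            if out:
                share = prob[i] / len(out)        # prob[i]/deg[i]
                for j in out:
                    newprob[j] += share
            else:
                # Assumption: a page with no links spreads its
                # probability evenly over all pages.
                for j in pages:
                    newprob[j] += prob[i] / n
        prob = {page: alpha * newprob[page] + (1 - alpha) / n
                for page in pages}
    return prob


# Toy three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_method(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each sweep touches every link once and updates all pages together. The leftover error shrinks by roughly a factor of alpha per sweep, which is why a fixed count of about 80 sweeps is already accurate for alpha near 0.85.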

Page 23: A history of PageRank from the numerical computing perspective

The algorithm underlying Google’s analysis of the web is from 1929!

Page 24: A history of PageRank from the numerical computing perspective

Leo Katz

Page 25: A history of PageRank from the numerical computing perspective

Leo Katz

That’s not quite right, Wikipedia!

Page 26: A history of PageRank from the numerical computing perspective

A new status index (1953)
Leo Katz

A paper about how information spreads in groups …

“For example, the information that the new high-school principal is unmarried and handsome might occasion a violent reaction in a ladies' garden club and hardly a ripple of interest in a luncheon group of the local chamber of commerce. On the other hand, the luncheon group might be anything but apathetic in its response to information concerning a fractional change in credit buying restrictions announced by the federal government.”

Page 27: A history of PageRank from the numerical computing perspective

… there were many other shoulders too …

Page 28: A history of PageRank from the numerical computing perspective

Gene Golub

Popularized numerical computing with matrices via the informal “Golub thesis”: “anything worth computing can be stated as a matrix problem”

William Kahan

Formalized IEEE-754 floating-point arithmetic.

Made it possible to compute with probabilities as “real numbers” instead of discrete counts.

Page 29: A history of PageRank from the numerical computing perspective

Credits

Most pictures taken from Google image search. Original idea from Massimo Franceschet, “PageRank: Standing on the shoulders of giants”.