A HADOOP IMPLEMENTATION OF PAGERANK
CHENGENG MA
2016/02/02
HOW DOES GOOGLE FIGHT SPAMMERS?
• Early search engines relied mainly on the information (e.g., word frequency) shown on each page itself.
• A spammer who wants to sell T-shirts may create his own web page containing a word like "movie" 1,000 times, and he can make the words invisible by setting their color to match the background.
• When you search for "movie", the old search engine ranks this page as unbelievably important, so you click it and find only an ad for T-shirts.
• “While Google was not the first search
engine, it was the first able to defeat the
spammers who had made search almost
useless. ”
• The key innovation Google introduced is a measure of web-page importance called PageRank.
PAGERANK IS ABOUT WEB LINKS. WHY WEB LINKS?
• People usually add a tag or a link to a page they think is correct, useful, or reliable.
• Spammers can fill their own pages with whatever they like, but it is usually hard for them to get other pages to link to those pages.
• Even if a spammer builds a link farm in which thousands of pages link to one particular page he wants to promote, the thousands of pages under his control are still not linked to by the billions of pages in the outside world.
For example, a Chinese web user who sees the picture at left online will probably tag it "Milk Tea Beauty" (a young Chinese celebrity whose reputation is disputed).
WHAT IS PAGERANK?
• PageRank is a vector whose j-th element is the probability that a random surfer is visiting the j-th web page in the final steady state.
• At the beginning, you can assign every page the same value (Vj = 1/N). Then you multiply the PageRank vector V by the transition matrix M to get the next moment's probability distribution X.
• In the final state PageRank converges, and vector X is the same as vector V. For a web that contains no dead ends or spider traps, vector V now represents the PageRank.
[Figure: the transition matrix M of a four-page example web A, B, C, D; column j is the "from" page and row i is the "to" page.]
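As a minimal numpy sketch (not part of the original slides), the idealized iteration on the four-page example web looks like this; the edge list matches the example used later in this deck:

import numpy as np

# Four-page example web; edges are (from, to) with A=0, B=1, C=2, D=3.
pages = ["A", "B", "C", "D"]
edges = [(0, 1), (0, 2), (0, 3), (1, 0), (1, 3), (2, 0), (3, 1), (3, 2)]

N = len(pages)
M = np.zeros((N, N))
for j, i in edges:        # column j = "from" page, row i = "to" page
    M[i, j] = 1.0
M /= M.sum(axis=0)        # normalize: each column sums to 1

V = np.full(N, 1.0 / N)   # start from the uniform distribution Vj = 1/N
for _ in range(50):       # repeated multiplication: X = M * V
    V = M @ V
print(dict(zip(pages, V.round(4))))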
SPIDER TRAP
• Once you come to page C, you have no way to leave it. The random surfer gets trapped at page C, and nothing is random anymore.
• Eventually all of the PageRank is captured by page C.
DEAD END
• In the real web, a page can be a dead end (it links to no other pages). Once the random surfer reaches a dead end, it stops traveling and has no chance to move on to other pages, so the randomness assumption is violated.
• Under the previous definition, the column corresponding to a dead end in the transition matrix is empty (all zeros).
• Repeatedly multiplying by this matrix drains the PageRank away until nothing is left.
TAXATION
The modified version of the algorithm:
• The modification that solves the above two problems is to add a probability ρ that the surfer keeps following the links, so there is a (1 − ρ) probability that the surfer teleports to a random page.
• This method is called taxation.
For-loop iterations:
V1 = ρ · M · V0
V1 = V1 + (1 − sum(V1)) / N
V0 = V1
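A minimal sketch of this loop, assuming numpy and a dense in-memory matrix (which the full web would not allow):

import numpy as np

def pagerank_taxation(M, rho=0.85, n_iter=75):
    # Follow links with probability rho; the remaining mass, including any
    # PageRank leaked through dead ends, is spread uniformly over all pages.
    N = M.shape[0]
    V0 = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        V1 = rho * (M @ V0)          # V1 = rho * M * V0
        V1 += (1.0 - V1.sum()) / N   # V1 = V1 + (1 - sum(V1)) / N
        V0 = V1                      # V0 = V1
    return V0

Writing the shift as (1 − sum(V1))/N rather than a fixed (1 − ρ)/N also restores the mass lost at dead ends, so the one formula handles both problems.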
HOWEVER, THE REAL WEB HAS BILLIONS OF PAGES, AND THE MATRIX-VECTOR MULTIPLICATION BECOMES THE BOTTLENECK
• By partitioning the matrix and the vector into blocks, the calculation can be parallelized over a computing cluster with thousands of nodes.
• Computation at this magnitude is usually managed by a MapReduce system such as Hadoop.
[Figure: the matrix M partitioned into 5 × 5 blocks indexed by (α, β), α, β ∈ {0, …, 4}, with the vector V partitioned into 5 corresponding stripes indexed by β.]
MAPREDUCE
• 1st mapper:
M(i, j, Mij) → { (α, β); ("M", i, j, Mij) }, where α = ⌊i/Δ⌋, β = ⌊j/Δ⌋, and Δ is the block width.
V(j, Vj) → { (α, β); ("V", j, Vj) } for every α ∈ [0, G − 1], with β = ⌊j/Δ⌋ and G = ceil(N/Δ) the number of groups.
• 1st reducer gets input:
{ (α, β); [ ("M", i, j, Mij), ("V", j, Vj) ] } for all i in partition α and all j in partition β.
• 1st reducer outputs:
{ i ; S_β = Σ over j in partition β of Mij · Vj }
• 2nd mapper: identity (pass through).
• 2nd reducer gets input:
{ i ; [S_0, S_1, S_2, …, S_{G−1}] }
• 2nd reducer outputs:
{ i ; Σ over β = 0 … G−1 of S_β }
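The two jobs can be simulated locally in Python. The sketch below is hypothetical code, not the attached Hadoop program; DELTA, G, and the toy data are made up, but the key/value flow follows the scheme just described:

from collections import defaultdict

DELTA = 2   # block width (toy value for this sketch)
G = 2       # number of groups, ceil(N / DELTA)

def mapper1(matrix_entries, vector_entries):
    # Send M[i][j] to block (i//DELTA, j//DELTA); replicate V[j] to
    # (alpha, j//DELTA) for every row block alpha.
    for i, j, mij in matrix_entries:
        yield (i // DELTA, j // DELTA), ("M", i, j, mij)
    for j, vj in vector_entries:
        for alpha in range(G):
            yield (alpha, j // DELTA), ("V", j, vj)

def reducer1(block, records):
    # Within one (alpha, beta) block: S_beta(i) = sum_j Mij * Vj.
    v = {r[1]: r[2] for r in records if r[0] == "V"}
    sums = defaultdict(float)
    for r in records:
        if r[0] == "M":
            _, i, j, mij = r
            sums[i] += mij * v[j]
    for i, s in sums.items():
        yield i, s

def shuffle(pairs):
    # Group values by key, as the MapReduce framework would do.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

# The 2nd mapper is the identity; the 2nd reducer sums the partial
# results [S_0, ..., S_{G-1}] arriving for each row index i.
matrix = [(0, 1, 0.5), (0, 2, 1.0), (1, 0, 1/3), (3, 0, 1/3)]
vector = [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]

stage1 = shuffle(mapper1(matrix, vector))
partials = [kv for blk, recs in stage1 for kv in reducer1(blk, recs)]
result = {i: sum(vals) for i, vals in shuffle(partials)}
print(result)   # the next-step vector X = M * V, one entry per nonzero row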
BEFORE THE PAGERANK CALCULATION: TRANSLATING THE WEB TO NUMBERS
Links (From Node ID → To Node ID):
• A → B
• A → C
• A → D
• B → A
• B → D
• C → A
• D → B
• D → C
IDs (web node ID in data → index used in program):
• A 0
• B 1
• C 2
• D 3
• Perform an inner join twice: the 1st time the key is FromNodeID, the 2nd time the key is ToNodeID (a sketch follows below).
After 1st inner join:
• (A, B, 0)
• (A, C, 0)
• (A, D, 0)
• (B, A, 1)
• (B, D, 1)
• (C, A, 2)
• (D, B, 3)
• (D, C, 3)
After 2nd inner join:
• (A, B, 0, 1)
• (A, C, 0, 2)
• (A, D, 0, 3)
• (B, A, 1, 0)
• (B, D, 1, 3)
• (C, A, 2, 0)
• (D, B, 3, 1)
• (D, C, 3, 2)
• After the PageRank is calculated, the same joins translate the indices back to node names.
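A minimal local sketch of the two joins (plain Python dictionaries standing in for the actual join jobs):

# Edge list and the node-to-index table from this slide.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"),
         ("B", "D"), ("C", "A"), ("D", "B"), ("D", "C")]
index = {"A": 0, "B": 1, "C": 2, "D": 3}

# 1st inner join, keyed on FromNodeID: attach the "from" index.
step1 = [(f, t, index[f]) for f, t in edges]
# 2nd inner join, keyed on ToNodeID: attach the "to" index.
step2 = [(f, t, fi, index[t]) for f, t, fi in step1]

# Inverting the table translates indices back to node names afterwards.
name = {i: n for n, i in index.items()}
print(step2[0], name[0])   # ('A', 'B', 0, 1) A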
2002 GOOGLE PROGRAMMING CONTEST WEB GRAPH DATA
• 875,713 pages, 5,105,039 edges
• a 72 MB text file
• The Hadoop program iterates 75 times ("For the web itself, 50-75 iterations are sufficient to converge to within the error limits of double precision").
• ρ = 0.85 is the probability of following the web links, leaving a 0.15 probability of teleporting.
• The program is structured as a for loop, each iteration of which runs 4 MapReduce jobs.
• The first 2 MR jobs perform the matrix-vector multiplication.
• The 3rd MR job calculates the sum of the components of the product vector ρ·M·V.
• The final MR job does the shift, adding (1 − sum)/N to every component (a sketch of the last two jobs follows below).
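As a hypothetical local sketch of what the last two jobs compute (the real jobs distribute this work across the cluster):

def job3_sum(rho_MV):
    # 3rd MR job: one global sum over the components of rho*M*V,
    # given as (index, value) pairs from the first two jobs.
    return sum(v for _, v in rho_MV)

def job4_shift(rho_MV, total, N):
    # 4th MR job: the shift, adding the teleport and dead-end mass
    # back to every component.
    shift = (1.0 - total) / N
    return [(i, v + shift) for i, v in rho_MV]

# One loop iteration, once jobs 1-2 have produced rho_MV:
#   total = job3_sum(rho_MV)
#   V_next = job4_shift(rho_MV, total, N)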
PAGERANK RESULT
• A Python program was written to compare against the result from Hadoop:
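The original comparison script is not reproduced in this deck; a hypothetical sketch of such a check (both file names are assumptions) could look like:

def load(path):
    # Read "index<TAB>pagerank" lines into a dictionary.
    out = {}
    with open(path) as f:
        for line in f:
            i, v = line.split("\t")
            out[int(i)] = float(v)
    return out

hadoop = load("part-r-00000")           # Hadoop job output (assumed name)
reference = load("pagerank_numpy.txt")  # local Python result (assumed name)
worst = max(abs(hadoop[i] - reference[i]) for i in hadoop)
print("largest absolute difference:", worst)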
RESULT ANALYSIS
• The unsorted values are noisy and hard to read.
• But sorting by PageRank value and plotting on a log-log scale yields a straight line.
RESULT ANALYSIS
• The histogram shows exponentially decaying counts for large PageRank values.
• The top 1/9 of web pages holds 60% of the PageRank importance of the whole dataset.
REFERENCE
• Mining of Massive Datasets, Chapter 5. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman.
The code will be attached as the following files.
FINALLY, A TOP-K PROGRAM IN HADOOP
The table at right shows the top 15 PageRank values.
• The 1st column is the index used in this program;
• the 2nd column is the web node ID within the original data;
• the 3rd column is the PageRank value.
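A minimal sketch of the Top-K idea behind such a program (hypothetical code, not the attached Hadoop source): each mapper keeps only its k best records in a min-heap, and a single reducer merges the candidates the same way.

import heapq

def top_k(records, k=15):
    # records are (rank, index, node_id) triples; a min-heap of size k
    # keeps the k largest ranks seen so far.
    heap = []
    for rec in records:
        if len(heap) < k:
            heapq.heappush(heap, rec)
        else:
            heapq.heappushpop(heap, rec)
    return sorted(heap, reverse=True)

# A single reducer merges by running top_k over all mappers' outputs:
#   top15 = top_k(candidates_from_all_mappers, k=15)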