A HADOOP IMPLEMENTATION OF PAGERANK
CHENGENG MA
2016/02/02
HOW DOES GOOGLE FIGHT SPAMMERS?
• Early search engines relied mainly on the information (e.g., word frequency) shown on each page itself.
• A spammer who wants to sell T-shirts may create his own web page containing a word like "movie" 1,000 times, and he can make the words invisible by setting their color to match the background.
• When you search for "movie", the old search engine ranks this page as unbelievably important, so you click it and find only an ad for T-shirts.
• “While Google was not the first search
engine, it was the first able to defeat the
spammers who had made search almost
useless. ”
• The key innovation Google introduced is a measure of web-page importance called PageRank.
PAGERANK IS ABOUT WEB LINKS. WHY WEB LINKS?
• People usually add a tag or a link to a page they think is correct, useful, or reliable.
• Spammers can fill their own pages with whatever they like, but it is usually hard for them to get other pages to link to those pages.
• Even if a spammer builds a link farm in which thousands of pages link to one particular page he wants to promote, the thousands of pages under his control are still not linked to by the billions of pages in the outside world.
For example, a Chinese web user who sees the picture at left online will probably tag it "Milk Tea Beauty" (a young Chinese celebrity whose reputation is disputed).
WHAT IS PAGERANK?
• PageRank is a vector whose j-th element is the probability that a random surfer is visiting the j-th web page in the final steady state.
• At the beginning, you can assign every page the same value (Vj = 1/N). Then you multiply the PageRank vector V by the transition matrix M to get the next moment's probability distribution X.
• In the final state PageRank converges, and vector X is the same as vector V. For a web that contains no dead ends or spider traps, vector V now represents the PageRank.
[Figure: the transition matrix M of a four-page example web A, B, C, D; column j is the "from" page and row i is the "to" page.]
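As a minimal numpy sketch (not part of the original slides), the idealized iteration on the four-page example web looks like this; the edge list matches the example used later in this deck:

import numpy as np

# Four-page example web; edges are (from, to) with A=0, B=1, C=2, D=3.
pages = ["A", "B", "C", "D"]
edges = [(0, 1), (0, 2), (0, 3), (1, 0), (1, 3), (2, 0), (3, 1), (3, 2)]

N = len(pages)
M = np.zeros((N, N))
for j, i in edges:        # column j = "from" page, row i = "to" page
    M[i, j] = 1.0
M /= M.sum(axis=0)        # normalize: each column sums to 1

V = np.full(N, 1.0 / N)   # start from the uniform distribution Vj = 1/N
for _ in range(50):       # repeated multiplication: X = M * V
    V = M @ V
print(dict(zip(pages, V.round(4))))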
SPIDER TRAP
• Once you come to page C, you have no way to leave it. The random surfer gets trapped at page C, and nothing is random anymore.
• Eventually all of the PageRank is captured by page C.
DEAD END
• In the real web, a page can be a dead end (it links to no other pages). Once the random surfer reaches a dead end, it stops traveling and has no chance to move on to other pages, so the randomness assumption is violated.
• Under the previous definition, the column corresponding to a dead end in the transition matrix is empty (all zeros).
• Repeatedly multiplying by this matrix drains the PageRank away until nothing is left.
TAXATION
The modified version of the algorithm:
• The modification that solves the above two problems is to add a probability ρ that the surfer keeps following the links, so there is a (1 − ρ) probability that the surfer teleports to a random page.
• This method is called taxation.
For-loop iterations:
V1 = ρ · M · V0
V1 = V1 + (1 − sum(V1)) / N
V0 = V1
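A minimal sketch of this loop, assuming numpy and a dense in-memory matrix (which the full web would not allow):

import numpy as np

def pagerank_taxation(M, rho=0.85, n_iter=75):
    # Follow links with probability rho; the remaining mass, including any
    # PageRank leaked through dead ends, is spread uniformly over all pages.
    N = M.shape[0]
    V0 = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        V1 = rho * (M @ V0)          # V1 = rho * M * V0
        V1 += (1.0 - V1.sum()) / N   # V1 = V1 + (1 - sum(V1)) / N
        V0 = V1                      # V0 = V1
    return V0

Writing the shift as (1 − sum(V1))/N rather than a fixed (1 − ρ)/N also restores the mass lost at dead ends, so the one formula handles both problems.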
HOWEVER, THE REAL WEB HAS BILLIONS OF PAGES, AND THE MATRIX-VECTOR MULTIPLICATION BECOMES THE BOTTLENECK
• By partitioning the matrix and the vector into blocks, the calculation can be parallelized over a computing cluster with thousands of nodes.
• Computation at this magnitude is usually managed by a MapReduce system such as Hadoop.
[Figure: the matrix M partitioned into 5 × 5 blocks indexed by (α, β), α, β ∈ {0, …, 4}, with the vector V partitioned into 5 corresponding stripes indexed by β.]
MAPREDUCE
• 1st mapper:
M(i, j, Mij) → { (α, β); ("M", i, j, Mij) }, where α = ⌊i/Δ⌋, β = ⌊j/Δ⌋, and Δ is the block width.
V(j, Vj) → { (α, β); ("V", j, Vj) } for every α ∈ [0, G − 1], with β = ⌊j/Δ⌋ and G = ceil(N/Δ) the number of groups.
• 1st reducer gets input:
{ (α, β); [ ("M", i, j, Mij), ("V", j, Vj) ] } for all i in partition α and all j in partition β.
• 1st reducer outputs:
{ i ; S_β = Σ over j in partition β of Mij · Vj }
• 2nd mapper: identity (pass through).
• 2nd reducer gets input:
{ i ; [S_0, S_1, S_2, …, S_{G−1}] }
• 2nd reducer outputs:
{ i ; Σ over β = 0 … G−1 of S_β }
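The two jobs can be simulated locally in Python. The sketch below is hypothetical code, not the attached Hadoop program; DELTA, G, and the toy data are made up, but the key/value flow follows the scheme just described:

from collections import defaultdict

DELTA = 2   # block width (toy value for this sketch)
G = 2       # number of groups, ceil(N / DELTA)

def mapper1(matrix_entries, vector_entries):
    # Send M[i][j] to block (i//DELTA, j//DELTA); replicate V[j] to
    # (alpha, j//DELTA) for every row block alpha.
    for i, j, mij in matrix_entries:
        yield (i // DELTA, j // DELTA), ("M", i, j, mij)
    for j, vj in vector_entries:
        for alpha in range(G):
            yield (alpha, j // DELTA), ("V", j, vj)

def reducer1(block, records):
    # Within one (alpha, beta) block: S_beta(i) = sum_j Mij * Vj.
    v = {r[1]: r[2] for r in records if r[0] == "V"}
    sums = defaultdict(float)
    for r in records:
        if r[0] == "M":
            _, i, j, mij = r
            sums[i] += mij * v[j]
    for i, s in sums.items():
        yield i, s

def shuffle(pairs):
    # Group values by key, as the MapReduce framework would do.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

# The 2nd mapper is the identity; the 2nd reducer sums the partial
# results [S_0, ..., S_{G-1}] arriving for each row index i.
matrix = [(0, 1, 0.5), (0, 2, 1.0), (1, 0, 1/3), (3, 0, 1/3)]
vector = [(0, 0.25), (1, 0.25), (2, 0.25), (3, 0.25)]

stage1 = shuffle(mapper1(matrix, vector))
partials = [kv for blk, recs in stage1 for kv in reducer1(blk, recs)]
result = {i: sum(vals) for i, vals in shuffle(partials)}
print(result)   # the next-step vector X = M * V, one entry per nonzero row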
BEFORE THE PAGERANK CALCULATION: TRANSLATING THE WEB TO NUMBERS
Links (From Node ID → To Node ID):
• A → B
• A → C
• A → D
• B → A
• B → D
• C → A
• D → B
• D → C
IDs (web node ID in data → index used in program):
• A 0
• B 1
• C 2
• D 3
• Perform an inner join twice: the 1st time the key is FromNodeID, the 2nd time the key is ToNodeID (a sketch follows below).
After 1st inner join:
• (A, B, 0)
• (A, C, 0)
• (A, D, 0)
• (B, A, 1)
• (B, D, 1)
• (C, A, 2)
• (D, B, 3)
• (D, C, 3)
After 2nd inner join:
• (A, B, 0, 1)
• (A, C, 0, 2)
• (A, D, 0, 3)
• (B, A, 1, 0)
• (B, D, 1, 3)
• (C, A, 2, 0)
• (D, B, 3, 1)
• (D, C, 3, 2)
• After the PageRank is calculated, the same joins translate the indices back to node names.
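A minimal local sketch of the two joins (plain Python dictionaries standing in for the actual join jobs):

# Edge list and the node-to-index table from this slide.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A"),
         ("B", "D"), ("C", "A"), ("D", "B"), ("D", "C")]
index = {"A": 0, "B": 1, "C": 2, "D": 3}

# 1st inner join, keyed on FromNodeID: attach the "from" index.
step1 = [(f, t, index[f]) for f, t in edges]
# 2nd inner join, keyed on ToNodeID: attach the "to" index.
step2 = [(f, t, fi, index[t]) for f, t, fi in step1]

# Inverting the table translates indices back to node names afterwards.
name = {i: n for n, i in index.items()}
print(step2[0], name[0])   # ('A', 'B', 0, 1) A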
2002 GOOGLE PROGRAMMING CONTEST WEB GRAPH DATA
• 875,713 pages, 5,105,039 edges
• a 72 MB text file
• The Hadoop program iterates 75 times ("For the web itself, 50-75 iterations are sufficient to converge to within the error limits of double precision").
• ρ = 0.85 is the probability of following the web links, leaving a 0.15 probability of teleporting.
• The program is structured as a for loop, each iteration of which runs 4 MapReduce jobs.
• The first 2 MR jobs perform the matrix-vector multiplication.
• The 3rd MR job calculates the sum of the components of the product vector ρ·M·V.
• The final MR job does the shift, adding (1 − sum)/N to every component (a sketch of the last two jobs follows below).
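As a hypothetical local sketch of what the last two jobs compute (the real jobs distribute this work across the cluster):

def job3_sum(rho_MV):
    # 3rd MR job: one global sum over the components of rho*M*V,
    # given as (index, value) pairs from the first two jobs.
    return sum(v for _, v in rho_MV)

def job4_shift(rho_MV, total, N):
    # 4th MR job: the shift, adding the teleport and dead-end mass
    # back to every component.
    shift = (1.0 - total) / N
    return [(i, v + shift) for i, v in rho_MV]

# One loop iteration, once jobs 1-2 have produced rho_MV:
#   total = job3_sum(rho_MV)
#   V_next = job4_shift(rho_MV, total, N)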
PAGERANK RESULT
• A Python program was written to compare against the result from Hadoop:
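The original comparison script is not reproduced in this deck; a hypothetical sketch of such a check (both file names are assumptions) could look like:

def load(path):
    # Read "index<TAB>pagerank" lines into a dictionary.
    out = {}
    with open(path) as f:
        for line in f:
            i, v = line.split("\t")
            out[int(i)] = float(v)
    return out

hadoop = load("part-r-00000")           # Hadoop job output (assumed name)
reference = load("pagerank_numpy.txt")  # local Python result (assumed name)
worst = max(abs(hadoop[i] - reference[i]) for i in hadoop)
print("largest absolute difference:", worst)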
RESULT ANALYSIS
• The unsorted values are noisy and hard to read.
• But sorting by PageRank value and plotting on a log-log scale yields a straight line.
RESULT ANALYSIS
• The histogram shows exponentially decaying counts for large PageRank values.
• The top 1/9 of web pages holds 60% of the PageRank importance of the whole dataset.
REFERENCE
• Mining of Massive Datasets, Chapter 5. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman.
The code will be attached as the following files.
FINALLY, A TOP-K PROGRAM IN HADOOP
The table at right shows the top 15 PageRank values.
• The 1st column is the index used in this program;
• the 2nd column is the web node ID within the original data;
• the 3rd column is the PageRank value.
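A minimal sketch of the Top-K idea behind such a program (hypothetical code, not the attached Hadoop source): each mapper keeps only its k best records in a min-heap, and a single reducer merges the candidates the same way.

import heapq

def top_k(records, k=15):
    # records are (rank, index, node_id) triples; a min-heap of size k
    # keeps the k largest ranks seen so far.
    heap = []
    for rec in records:
        if len(heap) < k:
            heapq.heappush(heap, rec)
        else:
            heapq.heappushpop(heap, rec)
    return sorted(heap, reverse=True)

# A single reducer merges by running top_k over all mappers' outputs:
#   top15 = top_k(candidates_from_all_mappers, k=15)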