Big Data Analytics
Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science
University of Hildesheim, Germany
Lucas Rego Drumond, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Outline
1. Introduction
2. PageRank
3. PageRank and MapReduce
4. Pregel
1. Introduction
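Overview
[Figure: course overview as a stack of layers, listed top to bottom: Machine Learning Algorithms, Large Scale Computational Models, Distributed Database, Distributed File System; side labels, top to bottom: Part III, Part II, Part I]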
MapReduce - Review
1. Each mapper transforms a set of key-value pairs into a list of intermediate (output key, value) pairs
2. All intermediate values are grouped according to their output keys
3. Each reducer receives all the intermediate values associated with a given key
4. Each reducer associates one final value with each key
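The four steps above can be sketched as a single in-memory function. This is only an illustration, not a distributed implementation; the names `map_reduce`, `count_mapper` and `count_reducer`, as well as the word-count job, are assumptions made for the example.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce round over (key, value) records."""
    # Steps 1 and 2: apply the mapper, then group values by output key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Steps 3 and 4: each reducer call sees all values of one key
    # and returns one final value for it.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Hypothetical example job: count word occurrences.
def count_mapper(doc_id, text):
    return [(word, 1) for word in text.split()]

def count_reducer(word, counts):
    return sum(counts)

docs = [("d1", "big data"), ("d2", "big graphs")]
word_counts = map_reduce(docs, count_mapper, count_reducer)
```

Here the grouping dictionary plays the role of the shuffle phase; in a real cluster this grouping happens across machines.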
2. PageRank
Google’s Pagerank
Problem:
- Measure the importance of websites in Google’s search engine results
- Relevant websites are likely to have more links pointing to them
- It is important to consider the quality of the websites that link to a given page
Web graph
The Web can be represented as a graph containing:
- A set of webpages W
- A set of hyperlinks between webpages H ⊆ W × W
Task:
- Assign a numerical weight r : W → R to each element of W
- The weight r(w) of a webpage w ∈ W should reflect its importance relative to the other webpages
[Figure: example web graph with four pages w1, w2, w3, w4]
Pagerank - The Random Surfer
Imagine a web surfer that jumps randomly from one web page to another:
- The next link to follow is chosen with uniform probability
- The surfer follows the chosen link with some probability β
- Occasionally, with some small probability 1 − β, the surfer jumps to a random page instead
The Pagerank PR(w) of a web page w models how likely it is that the random surfer will visit it
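The random surfer can be simulated directly. The sketch below is an illustration under assumptions: the four-page link structure in `links` is not given in the text, and the function name, step count and seed are arbitrary choices.

```python
import random

# Assumed example graph (not given in the text): page -> pages it links to.
links = {
    "w1": ["w2", "w3"],
    "w2": ["w3"],
    "w3": ["w1"],
    "w4": ["w3"],
}

def random_surfer(links, beta=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer visits each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        if rng.random() < beta and links[page]:
            page = rng.choice(links[page])   # follow a uniformly chosen link
        else:
            page = rng.choice(pages)         # jump to a random page
        visits[page] += 1
    return {p: visits[p] / steps for p in pages}

freq = random_surfer(links)
```

Pages that are visited more often receive a higher Pagerank. The frequencies sum to 1, so they differ from Pagerank values that are not normalized by a constant factor.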
Pagerank - The Random Surfer
Let:
- in(w) := {v | (v, w) ∈ H} be the set of pages that link to w
- out(w) := {v | (w, v) ∈ H} be the set of pages linked by w
- 1 − β be the probability that the surfer jumps to a random page
Pagerank of a webpage w_i:
$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$
[Figure: example web graph with four pages w1, w2, w3, w4]
Example:
in(w1) = {w3}
out(w1) = {w2,w3}
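The sets in(w) and out(w) and one evaluation of the Pagerank formula can be written down as follows. This is a minimal sketch: the edge set `H` is an assumption chosen to agree with in(w1) = {w3} and out(w1) = {w2, w3}, since the figure with the actual graph is not available, and the function names are hypothetical.

```python
# Hyperlinks H as (source, target) pairs; this edge set is an assumption
# chosen to agree with in(w1) = {w3} and out(w1) = {w2, w3}.
H = {("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")}

def in_links(w, H):
    """in(w): pages v with a hyperlink (v, w) in H."""
    return {v for (v, t) in H if t == w}

def out_links(w, H):
    """out(w): pages v with a hyperlink (w, v) in H."""
    return {v for (s, v) in H if s == w}

def pagerank_update(w, PR, H, beta=0.85):
    """Evaluate the right-hand side of the Pagerank formula for page w."""
    total = sum(PR[v] / len(out_links(v, H)) for v in in_links(w, H))
    return (1 - beta) + beta * total

PR = {"w1": 1.0, "w2": 1.0, "w3": 1.0, "w4": 1.0}  # arbitrary current values
new_w3 = pagerank_update("w3", PR, H)
```

With these assumed values, w3 receives half of PR(w1) and all of PR(w2) and PR(w4), so the update gives 0.15 + 0.85 · 2.5 = 2.275.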
Pagerank - Algorithm
procedure Pagerank
    input: a web graph (W, H), hyperparameter β
    output: Pagerank values PR ∈ R^|W|
    PR ← {0}^|W|
    for w ∈ W do
        PR(w) ← random value
    end for
    repeat
        for w ∈ W do
            PR(w) ← (1 − β) + β ∑_{wj ∈ in(w)} PR(wj) / |out(wj)|
        end for
    until convergence
    return PR
end procedure
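The procedure can be sketched in Python as follows. Several details are assumptions rather than part of the slides: the convergence test (largest change below `tol`), the fixed starting value 1.0 in place of random values, and the four-page graph, whose links are not given in the text.

```python
def pagerank(out, beta=0.85, tol=1e-10, max_iter=1000):
    """Iterate the Pagerank update in place until the values stop changing."""
    # Build in(w) from the out-link lists.
    incoming = {w: [] for w in out}
    for v, targets in out.items():
        for w in targets:
            incoming[w].append(v)
    PR = {w: 1.0 for w in out}  # fixed start instead of random values
    for _ in range(max_iter):
        delta = 0.0
        for w in out:
            new = (1 - beta) + beta * sum(PR[v] / len(out[v]) for v in incoming[w])
            delta = max(delta, abs(new - PR[w]))
            PR[w] = new  # later pages in this sweep already see the new value
        if delta < tol:
            break
    return PR

# Assumed four-page example graph: page -> pages it links to.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = pagerank(graph)
```

Because each update shrinks differences by at most a factor β < 1, the starting values do not affect the result, only the number of sweeps needed.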
Pagerank - Algorithm
[Figure: two pages w1 and w2 that link to each other]
β = 0.85
$$PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|out(w_2)|}$$
$$PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|out(w_1)|}$$
Current values: PR(w1) = 0.9, PR(w2) = 1.1
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.1}{1} = 1.085$$
$$PR(w_2) = 0.15 + 0.85 \cdot \frac{1.085}{1} = 1.07225$$
Current values: PR(w1) = 1.085, PR(w2) = 1.07225
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.07225}{1} = 1.061412$$
$$PR(w_2) = 0.15 + 0.85 \cdot \frac{1.061412}{1} = 1.0522$$
Current values: PR(w1) = 1.061412, PR(w2) = 1.0522
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.0522}{1} = 1.04437$$
$$PR(w_2) = 0.15 + 0.85 \cdot \frac{1.04437}{1} = 1.037714$$
Current values: PR(w1) = 1.04437, PR(w2) = 1.037714
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.037714}{1} = 1.032057$$
$$PR(w_2) = 0.15 + 0.85 \cdot \frac{1.032057}{1} = 1.027248$$
After a couple of iterations...
Current values: PR(w1) = 1, PR(w2) = 1
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1}{1} = 1$$
$$PR(w_2) = 0.15 + 0.85 \cdot \frac{1}{1} = 1$$
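The two-page example can be replayed with a few lines of Python. The sketch uses the starting values 0.9 and 1.1 from the example and updates the values in place, so PR(w2) already uses the new PR(w1); the iteration count of 100 is an arbitrary choice.

```python
beta = 0.85
pr = {"w1": 0.9, "w2": 1.1}   # starting values from the example
history = []
for _ in range(100):
    # Each page has exactly one out-link, so |out(w)| = 1.
    pr["w1"] = (1 - beta) + beta * pr["w2"] / 1
    pr["w2"] = (1 - beta) + beta * pr["w1"] / 1   # uses the fresh PR(w1)
    history.append((pr["w1"], pr["w2"]))
```

The first entries of `history` reproduce the hand-computed iterations, and the values approach the fixed point PR(w1) = PR(w2) = 1.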
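Pagerank Algorithm
Resulting Pagerank values for the four-page example graph:
- PR(w1) = 1.49
- PR(w2) = 0.78
- PR(w3) = 1.58
- PR(w4) = 0.15
[Figure: example web graph with four pages w1, w2, w3, w4]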
3. PageRank and MapReduce
MapReduce Implementation of Pagerank
How can Pagerank be implemented on MapReduce?
MapReduce Limitations
- Iterations:
  - Pipelined sequence of a Map phase and a Reduce phase
  - MapReduce does not naturally support iterative algorithms
- Dependency between data points:
  - Data points should be independent of each other
  - Not suitable for graph-like data
Pagerank with MapReduce
Two phases:
- Initialize: one complete MapReduce iteration that initializes the pages with random Pagerank values
- Compute Pagerank: a repeated series of Pagerank computations
Pagerank with MapReduce - Initialize
Map:
- Input: (u, page content)
- Output: (u, (init, out(u)))
  - init: initial Pagerank value
  - u: URL of a webpage
  - out(u): pages linked by u
Reduce:
- Input: (u, (init, out(u)))
- Output: (u, (init, out(u)))
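A minimal sketch of the initialization phase, run in memory. The regular expression for link extraction, the sample pages and the function names are assumptions made for illustration; a real job would use a proper HTML parser and a MapReduce framework.

```python
import random
import re

def init_map(url, content, rng):
    """Map: (u, page content) -> (u, (init, out(u)))."""
    # Crude link extraction for illustration only.
    out = sorted(set(re.findall(r'href="([^"]+)"', content)))
    return [(url, (rng.random(), out))]

def init_reduce(url, values):
    """Reduce: identity, passes (init, out(u)) through unchanged."""
    return (url, values[0])

# Hypothetical crawled pages.
pages = {
    "w1": '<a href="w2">a</a> <a href="w3">b</a>',
    "w2": '<a href="w3">c</a>',
}
rng = random.Random(42)
state = {}
for url, content in pages.items():
    for key, value in init_map(url, content, rng):
        state[key] = init_reduce(key, [value])[1]
```

The reducer does no work here; it only completes the MapReduce round so that the output has the form expected by the rank computation phase.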
Pagerank with MapReduce - Compute Ranks
Map:
- Input: (u, (PR(u), out(u)))
- Output:
  - For each v ∈ out(u): (v, PR(u) / |out(u)|)
  - (u, out(u))
Reduce:
- Input: (u, out(u)), {(u, val)}
- Sums up all val’s and computes the new Pagerank PR(u) = (1 − β) + β · (sum of the val’s) for u
- Output: (u, (PR(u), out(u)))
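One rank computation round, repeated in a driver loop, might look as follows in memory. This is a sketch under assumptions: the function names, the use of a Python list to mark the (u, out(u)) record, the four-page graph and the fixed count of 100 rounds are all choices made for the example.

```python
from collections import defaultdict

def rank_map(u, value):
    """Map: emit a rank share per out-link, plus the link list itself."""
    pr, out = value
    pairs = [(v, pr / len(out)) for v in out]
    pairs.append((u, out))  # keep the graph structure for the next iteration
    return pairs

def rank_reduce(u, values, beta=0.85):
    """Reduce: separate the link list from the shares and sum the shares."""
    out, total = [], 0.0
    for val in values:
        if isinstance(val, list):
            out = val
        else:
            total += val
    return (u, ((1 - beta) + beta * total, out))

def iterate(state):
    """One MapReduce round: map, group by key, reduce."""
    groups = defaultdict(list)
    for u, value in state.items():
        for key, val in rank_map(u, value):
            groups[key].append(val)
    return dict(rank_reduce(u, vals) for u, vals in groups.items())

# Assumed four-page graph, every page starting with rank 1.0.
state = {"w1": (1.0, ["w2", "w3"]), "w2": (1.0, ["w3"]),
         "w3": (1.0, ["w1"]), "w4": (1.0, ["w3"])}
for _ in range(100):
    state = iterate(state)
```

The driver loop sits outside MapReduce: every iteration is a separate job that reads the previous output, which is the overhead that motivates iterative and graph-oriented extensions.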
Extensions of MapReduce
- Support for repeated MapReduce: HaLoop, Twister
- Graph models: Giraph, Pregel, GraphLab
4. Pregel
Pregel
Distributed graph processing programming model from Google
Bulk Synchronous Parallel computation:
- Synchronous iterations (called supersteps)
- In a superstep:
  - Each vertex asynchronously executes some user-defined function in parallel
Message passing:
- Vertices exchange messages with neighbors
Pregel Programming Model
A data graph is defined where:
1. each vertex performs some computation
2. a vertex sends messages to neighboring vertices
3. each vertex processes incoming messages
The graph is distributed across computing nodes.
Pregel Programming Model
We have two types of entities:
Vertex:
1. Unique identifier
2. Has a modifiable, user-defined value
Edge:
1. Source and target vertex identifiers
2. Has a modifiable, user-defined value
Pregel Program
When the program is loaded, each vertex executes its computation and sends messages to its out-neighbors
Each vertex:
1. receives messages from in-neighbors
2. processes messages
3. decides whether or not to send new messages
4. decides whether or not to halt
The process is repeated until all vertices have halted
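Pregel - State of a vertex
[Figure: state machine of a vertex with two states, Active and Inactive; "Vote to halt" moves a vertex from Active to Inactive, "Message received" moves it from Inactive back to Active]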
Pregel - Aggregators
Pregel supports global communication through Aggregators
- Each vertex can provide a value to the Aggregator in each superstep
- The Aggregator combines these values into a single aggregated value
- Each vertex has access to the aggregated value of the previous superstep
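The one-superstep delay of an Aggregator can be illustrated with a toy loop. Everything here is hypothetical: the function name, the vertex program (each vertex contributes its value plus the superstep number) and the choice of `max` as the aggregation function.

```python
def run_with_aggregator(values, supersteps, aggregate=max):
    """Toy aggregator: vertices contribute values, and the reduced result
    becomes visible to every vertex in the following superstep."""
    previous = None          # nothing has been aggregated before superstep 0
    seen = []                # what the vertices could read in each superstep
    for step in range(supersteps):
        seen.append(previous)
        # Hypothetical vertex program: contribute value + superstep number.
        contributions = [v + step for v in values.values()]
        previous = aggregate(contributions)
    return seen

seen = run_with_aggregator({"a": 3, "b": 6, "c": 2}, supersteps=3)
```

In superstep 0 no aggregate is available yet; in superstep 1 the vertices read the maximum contributed in superstep 0, and so on.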
Pregel Example: finding the max value in the graph
Vertex values in successive supersteps (one row per superstep, one column per vertex):
3 6 2 1
6 6 2 6
6 6 6 6
6 6 6 6
procedure VertexUpdate
    input: current value v, messages M
    send v on all outgoing edges
    flag ← true
    for m ∈ M do
        if m > v then
            v ← m
            flag ← false
        end if
    end for
    if flag then
        halt
    end if
    return v
end procedure
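The max-value example can be replayed with a small single-machine simulation of supersteps. This is a sketch under assumptions: the edge list is a guess that is consistent with the four rows of values, since the figure is not available, and the vertex program here sends its value only in superstep 0 or after the value changed, which differs slightly from the pseudocode, where the value is sent in every call.

```python
def pregel_max(values, edges):
    """Simulate supersteps of max-value propagation; return value history."""
    out = {v: [] for v in values}
    for src, dst in edges:
        out[src].append(dst)
    values = dict(values)
    active = set(values)
    inbox = {v: [] for v in values}
    history, superstep = [], 0
    while active or any(inbox.values()):
        outbox = {v: [] for v in values}
        for v in values:
            if v not in active and not inbox[v]:
                continue                      # halted and no mail: skip
            active.add(v)                     # a message reactivates a vertex
            best = max(inbox[v], default=values[v])
            if superstep == 0 or best > values[v]:
                values[v] = max(values[v], best)
                for dst in out[v]:
                    outbox[dst].append(values[v])
            else:
                active.discard(v)             # nothing new learned: vote to halt
        inbox = outbox                        # barrier between supersteps
        history.append([values[v] for v in sorted(values)])
        superstep += 1
    return history

# Assumed edges between four vertices a, b, c, d with values 3, 6, 2, 1.
edges = [("a", "b"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "c")]
history = pregel_max({"a": 3, "b": 6, "c": 2, "d": 1}, edges)
```

The loop ends when every vertex has voted to halt and no messages are in transit, which is the termination condition of the programming model.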
PageRank using Pregel
PageRank iteration:
\[ PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|} \]
Let
- m_j: the incoming message from node j
- v: the current node
Vertex update function:
1. send \( \frac{PR(w_v)}{|out(w_v)|} \) to all outgoing edges
2. collect the incoming messages m_j in M
3. update
\[ PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j \]
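The three steps of the vertex update can be sketched for a whole graph at once. The function name `pagerank_superstep`, the three-node graph, the initial ranks of 1.0 and the choice β = 0.5 are all assumptions made for illustration; the sketch also assumes that every node has at least one outgoing edge.

```python
from typing import Dict, List


def pagerank_superstep(
    pr: Dict[str, float], out_edges: Dict[str, List[str]], beta: float
) -> Dict[str, float]:
    """One synchronous round: every node sends, then every node updates."""
    inbox: Dict[str, List[float]] = {v: [] for v in pr}
    # Step 1: each node sends PR(w_v) / |out(w_v)| along its outgoing edges.
    for v, targets in out_edges.items():
        for target in targets:
            inbox[target].append(pr[v] / len(targets))
    # Steps 2 and 3: collect the messages and apply the update rule.
    return {v: (1 - beta) + beta * sum(inbox[v]) for v in pr}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
new_pr = pagerank_superstep({"a": 1.0, "b": 1.0, "c": 1.0}, graph, beta=0.5)
# a receives 1.0 from c, b receives 0.5 from a, c receives 0.5 + 1.0:
# new_pr == {"a": 1.0, "b": 0.75, "c": 1.25}
```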
PageRank using Pregel
The algorithm can stop if
- it reaches a predefined maximum number of supersteps, or
- an aggregator indicates that the values have converged
Aggregator:
- Gathers from all nodes the difference between the value in the current superstep and in the last one
- The aggregated value is the sum of all the differences
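As a small illustration of the two stopping criteria, the aggregated value and the stop decision could be computed as below. The names `aggregate_differences` and `should_stop` are hypothetical, and the tolerance `tol` is an assumption: with floating-point ranks, a sum of differences rarely reaches exactly zero, so a small threshold is a common substitute.

```python
from typing import Dict


def aggregate_differences(current: Dict[str, float], previous: Dict[str, float]) -> float:
    """Sum over all nodes of |value in this superstep - value in the last one|."""
    return sum(abs(current[v] - previous[v]) for v in current)


def should_stop(superstep: int, aggregated: float, max_supersteps: int, tol: float) -> bool:
    """Stop on the superstep limit or when the aggregated change is small enough."""
    return superstep >= max_supersteps or aggregated <= tol


previous = {"a": 1.0, "b": 1.0, "c": 1.0}
current = {"a": 1.0, "b": 0.75, "c": 1.25}
aggregated = aggregate_differences(current, previous)   # 0.0 + 0.25 + 0.25 = 0.5

keep_going = should_stop(1, aggregated, max_supersteps=30, tol=1e-6)   # False
hit_limit = should_stop(30, aggregated, max_supersteps=30, tol=1e-6)   # True
converged = should_stop(5, 0.0, max_supersteps=30, tol=1e-6)           # True
```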
PageRank using Pregel
[Figure: example graph with nodes w1, w2, w3, w4]
procedure VertexUpdate
    input: current value PR(v), number of outgoing edges out(v), messages M, aggregated value a
    if a > 0 then
        send PR(v) / out(v) on all outgoing edges
        old ← PR(v)
        PR(v) ← (1 − β) + β ∑_{m_j ∈ M} m_j
        send |PR(v) − old| to the aggregator
    else
        halt
    end if
end procedure
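A sequential Python sketch of this vertex program is given below. It keeps the order of the pseudocode: a node first sends its current rank share, then updates from the messages of the previous superstep. Several details are assumptions: the edges between w1 to w4 (the edges of the figure are not recoverable), the initial rank of 1.0, the treatment of the aggregated value as infinite before the first superstep, and the tolerance `tol`, which replaces the test a > 0. The function name `run_pagerank` is hypothetical, and every node is assumed to have at least one outgoing edge.

```python
from typing import Dict, List


def run_pagerank(
    out_edges: Dict[str, List[str]],
    beta: float = 0.5,
    tol: float = 1e-6,
    max_supersteps: int = 200,
) -> Dict[str, float]:
    """Iterate the vertex update until the aggregated change is at most tol."""
    pr = {v: 1.0 for v in out_edges}             # assumed initial rank
    inbox: Dict[str, List[float]] = {v: [] for v in out_edges}
    aggregated = float("inf")                    # assumed value before superstep 0
    superstep = 0
    while aggregated > tol and superstep < max_supersteps:
        outbox: Dict[str, List[float]] = {v: [] for v in out_edges}
        total_change = 0.0
        for v, targets in out_edges.items():
            for target in targets:               # send PR(v) / out(v)
                outbox[target].append(pr[v] / len(targets))
            old = pr[v]
            pr[v] = (1 - beta) + beta * sum(inbox[v])
            total_change += abs(pr[v] - old)     # contribution to the aggregator
        aggregated = total_change                # visible in the next superstep
        inbox = outbox
        superstep += 1
    return pr


# Hypothetical edges for the four nodes of the example graph.
graph = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1", "w4"], "w4": ["w1"]}
ranks = run_pagerank(graph)
```

Because each node sends before it updates, a message always carries the rank from before the sender's update, so the ranks take roughly twice as many supersteps to settle as in a send-after-update variant.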