Big Data Analytics

Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science

University of Hildesheim, Germany

Outline

1. Introduction
2. PageRank
3. PageRank and MapReduce
4. Pregel

1. Introduction

Overview

[Figure: course overview drawn as a layered stack. From top to bottom, the layers are Machine Learning Algorithms, Large Scale Computational Models, Distributed Database, and Distributed File System; the side labels, also from top to bottom, are Part III, Part II, and Part I.]

MapReduce - Review

1. Each mapper transforms a set of key-value pairs into a list of (output key, intermediate value) pairs
2. All intermediate values are grouped according to their output keys
3. Each reducer receives all the intermediate values associated with a given key
4. Each reducer associates one final value with each key
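
The four MapReduce steps can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not how a MapReduce framework is implemented: the function names `map_phase`, `shuffle`, and `reduce_phase`, and the word-count job itself, are hypothetical choices made for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Step 1: apply the mapper to every input (key, value) pair."""
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    return intermediate


def shuffle(intermediate):
    """Step 2: group all intermediate values by their output key."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Steps 3 and 4: each key's values are reduced to one final value."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in text.split()]


def word_count_reducer(word, counts):
    return sum(counts)


documents = [("d1", "big data analytics"), ("d2", "big graphs")]
counts = reduce_phase(shuffle(map_phase(documents, word_count_mapper)),
                      word_count_reducer)
print(counts)
```

For the two toy documents, the result is `{'big': 2, 'data': 1, 'analytics': 1, 'graphs': 1}`.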

2. PageRank

Google's PageRank

Problem:

• Measure the importance of websites in Google's search engine results
• Relevant websites are more likely to have more links pointing to them
• It is important to consider the quality of the websites that link to a given page

Web graph

The Web can be represented as a graph containing:

• A set of webpages W
• A set of hyperlinks between webpages H ⊆ W × W

Task:

• Assign a numerical weight r : W → R to each element of W
• The weight r(w) of a webpage w ∈ W should reflect its importance relative to the other webpages

[Figure: example web graph with four pages w1, w2, w3, and w4]

PageRank - The Random Surfer

Imagine a web surfer that jumps from one web page to another at random:

• The next link to follow is chosen with uniform probability
• The surfer follows the chosen link with some probability β
• Occasionally, the surfer instead jumps to a random page, with some small probability 1 − β

The PageRank PR(w) of a web page w models how likely it is that the random surfer will visit it.
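
The random surfer can be simulated directly. The sketch below is a Monte Carlo illustration under stated assumptions: the four-page link structure `example_graph` is hypothetical (the slide's figure is not recoverable from the text), β = 0.85 is assumed, and the function name `simulate_surfer` is made up for this example.

```python
import random


def simulate_surfer(out_links, beta=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer visits each page."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {page: 0 for page in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if links and rng.random() < beta:
            current = rng.choice(links)   # follow one outgoing link, chosen uniformly
        else:
            current = rng.choice(pages)   # jump to a uniformly random page
    return {page: count / steps for page, count in visits.items()}


# Hypothetical link structure, used only for illustration.
example_graph = {
    "w1": ["w2", "w3"],
    "w2": ["w3"],
    "w3": ["w1"],
    "w4": ["w3"],
}
frequencies = simulate_surfer(example_graph)
for page, frequency in sorted(frequencies.items()):
    print(page, round(frequency, 3))
```

Pages that receive many links from frequently visited pages end up with high visit frequencies, while a page without incoming links (here `w4`) is reached only through random jumps.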

PageRank - The Random Surfer

Let:

• in(w) := {v | (v, w) ∈ H} be the set of pages that link to w
• out(w) := {v | (w, v) ∈ H} be the set of pages linked by w
• 1 − β be the probability that the surfer jumps to a random page

Examples: in(w1) = {w3} and out(w1) = {w2, w3}

PageRank of a webpage w_i:

$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$

[Figure: example web graph with four pages w1, w2, w3, and w4]
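
The sets in(w) and out(w) and a single evaluation of the PageRank formula can be written down directly from the definitions. In this sketch the edge set `H` is an assumption: it is chosen to agree with the two examples in(w1) = {w3} and out(w1) = {w2, w3}, but the remaining edges of the figure cannot be recovered from the text. The helper names are hypothetical.

```python
def in_links(H, w):
    """in(w): all pages v with a hyperlink (v, w) in H."""
    return {v for (v, target) in H if target == w}


def out_links(H, w):
    """out(w): all pages v with a hyperlink (w, v) in H."""
    return {v for (source, v) in H if source == w}


def pagerank_update(H, PR, w, beta=0.85):
    """Evaluate the right-hand side of the PageRank formula for page w."""
    incoming = sum(PR[v] / len(out_links(H, v)) for v in in_links(H, w))
    return (1 - beta) + beta * incoming


# Assumed edge set, consistent with in(w1) = {w3} and out(w1) = {w2, w3}.
H = {("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")}
PR = {"w1": 1.0, "w2": 1.0, "w3": 1.0, "w4": 1.0}

print(in_links(H, "w1"))
print(sorted(out_links(H, "w1")))
print(pagerank_update(H, PR, "w2"))
```

With all values set to 1, the update for `w2` is 0.15 + 0.85 · (1 / 2), because `w1` splits its rank over two outgoing links.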

PageRank - Algorithm

procedure PageRank
    input: a web graph (W, H), hyperparameter β
    output: PageRank values PR ∈ R^|W|
    PR ← {0}^|W|
    for w ∈ W do
        PR(w) ← random value
    end for
    repeat
        for w_i ∈ W do
            PR(w_i) ← (1 − β) + β · Σ_{w_j ∈ in(w_i)} PR(w_j) / |out(w_j)|
        end for
    until convergence
    return PR
end procedure
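
A minimal Python sketch of this procedure follows. Several details are assumptions, since the pseudocode leaves them open: convergence is taken to mean that the largest change in one sweep falls below `tol`, `max_iter` is a safety cap, and the example graph is a hypothetical four-page edge set rather than the one in the slides' figure.

```python
import random


def pagerank(W, H, beta=0.85, tol=1e-10, max_iter=1000, seed=0):
    """Iterate the PageRank update in place until the values stop changing."""
    rng = random.Random(seed)
    incoming = {w: [] for w in W}
    out_degree = {w: 0 for w in W}
    for source, target in H:
        incoming[target].append(source)
        out_degree[source] += 1

    PR = {w: rng.random() for w in W}      # random initial values
    for _ in range(max_iter):
        largest_change = 0.0
        for w in W:
            new_value = (1 - beta) + beta * sum(
                PR[v] / out_degree[v] for v in incoming[w])
            largest_change = max(largest_change, abs(new_value - PR[w]))
            PR[w] = new_value              # later updates in this sweep see the new value
        if largest_change < tol:           # assumed convergence criterion
            break
    return PR


# Hypothetical four-page graph used for illustration.
W = ["w1", "w2", "w3", "w4"]
H = [("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")]
ranks = pagerank(W, H)
for w in W:
    print(w, round(ranks[w], 2))
```

With this formulation the values sum to |W| at the fixed point when every page has at least one outgoing link, so they are not probabilities; dividing by |W| gives the surfer's visit frequencies.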

PageRank - Algorithm

Example: two pages w1 and w2 with β = 0.85, where each page's only outgoing link points to the other page, so |out(w1)| = |out(w2)| = 1.

$$PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|out(w_2)|} \qquad PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|out(w_1)|}$$

Starting from PR(w1) = 0.9 and PR(w2) = 1.1, each update uses the most recently computed value of the other page:

Iteration 1:
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.1}{1} = 1.085 \qquad PR(w_2) = 0.15 + 0.85 \cdot \frac{1.085}{1} = 1.07225$$

Iteration 2:
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.07225}{1} = 1.061412 \qquad PR(w_2) = 0.15 + 0.85 \cdot \frac{1.061412}{1} = 1.0522$$

Iteration 3:
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.0522}{1} = 1.04437 \qquad PR(w_2) = 0.15 + 0.85 \cdot \frac{1.04437}{1} = 1.037714$$

Iteration 4:
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1.037714}{1} = 1.032057 \qquad PR(w_2) = 0.15 + 0.85 \cdot \frac{1.032057}{1} = 1.027248$$

After a couple of iterations, the values reach the fixed point PR(w1) = 1 and PR(w2) = 1:
$$PR(w_1) = 0.15 + 0.85 \cdot \frac{1}{1} = 1 \qquad PR(w_2) = 0.15 + 0.85 \cdot \frac{1}{1} = 1$$
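
The iterations of the two-page example can be replayed with a short loop. The sketch assumes the setup read off the formulas (β = 0.85, one outgoing link per page, start values 0.9 and 1.1); the variable names are chosen for this illustration. Small differences in the last digit relative to the slide values come from the slides rounding intermediate results.

```python
beta = 0.85
pr_w1, pr_w2 = 0.9, 1.1        # start values of the example
history = []

for iteration in range(1, 5):
    pr_w1 = (1 - beta) + beta * pr_w2 / 1   # |out(w2)| = 1
    pr_w2 = (1 - beta) + beta * pr_w1 / 1   # uses the PR(w1) just computed
    history.append((pr_w1, pr_w2))
    print(iteration, round(pr_w1, 6), round(pr_w2, 6))

# Keep iterating to see the values approach the fixed point PR = 1.
for _ in range(200):
    pr_w1 = (1 - beta) + beta * pr_w2
    pr_w2 = (1 - beta) + beta * pr_w1
print(round(pr_w1, 6), round(pr_w2, 6))
```

Each sweep shrinks the distance to the fixed point by a factor of β² here, which is why only a handful of iterations are needed.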

PageRank - Algorithm

PageRank values for the example graph:

• PR(w1) = 1.49
• PR(w2) = 0.78
• PR(w3) = 1.58
• PR(w4) = 0.15

[Figure: example web graph with four pages w1, w2, w3, and w4]
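
One way to sanity-check such values is to plug them back into the PageRank formula and see whether they reproduce themselves. The sketch below does this under two loudly stated assumptions: β = 0.85, and a hypothetical edge set `assumed_links` that is consistent with in(w1) = {w3} and out(w1) = {w2, w3}. The actual edges of the figure are not recoverable from the text, so this is a plausibility check rather than a reconstruction.

```python
beta = 0.85  # assumed damping value

reported = {"w1": 1.49, "w2": 0.78, "w3": 1.58, "w4": 0.15}

# Hypothetical edge set (page -> pages it links to); an assumption.
assumed_links = {
    "w1": ["w2", "w3"],
    "w2": ["w3"],
    "w3": ["w1"],
    "w4": ["w3"],
}


def right_hand_side(page):
    """Evaluate (1 - beta) + beta * sum over in-neighbours of PR / out-degree."""
    total = sum(reported[source] / len(targets)
                for source, targets in assumed_links.items()
                if page in targets)
    return (1 - beta) + beta * total


residuals = {page: abs(right_hand_side(page) - value)
             for page, value in reported.items()}
for page in sorted(residuals):
    print(page, residuals[page] < 0.01)
```

Under these assumptions every residual stays below 0.01, which is the size of the rounding in the two-decimal values.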

3. PageRank and MapReduce

MapReduce Implementation of PageRank

How can PageRank be implemented on MapReduce?

MapReduce Limitations

• Iterations:
  ◦ Pipeline: a sequence of one Map phase and one Reduce phase
  ◦ MapReduce does not naturally support iterative algorithms
• Dependency between data points:
  ◦ Data points should be independent of each other
  ◦ Not suitable for graph-like data

PageRank with MapReduce

Two phases:

• Initialize: a complete MapReduce iteration that initializes the pages with random PageRank values
• Compute PageRank: a repeated series of PageRank computations

PageRank with MapReduce - Initialize

Map:

• Input: (u, page content)
• Output: (u, (init, out(u)))
  ◦ init: initial PageRank value
  ◦ u: URL of a webpage
  ◦ out(u): pages linked by u

Reduce:

• Input: (u, (init, out(u)))
• Output: (u, (init, out(u)))
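
The initialization phase can be sketched as a map function that extracts links and an identity reduce. Everything concrete here is an assumption made for illustration: the regular expression is a deliberately simplistic stand-in for real HTML link extraction, the `*.example` URLs and page contents are invented, and the function names `init_map` and `init_reduce` are hypothetical.

```python
import random
import re

# Simplistic link extraction, for illustration only.
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def init_map(url, page_content, rng):
    """(u, page content) -> (u, (init, out(u)))"""
    out_links = sorted(set(HREF_PATTERN.findall(page_content)))
    init = rng.random()                 # random initial PageRank value
    return [(url, (init, out_links))]


def init_reduce(url, values):
    """Identity reduce: each URL has exactly one (init, out(u)) value."""
    init, out_links = values[0]
    return (url, (init, out_links))


rng = random.Random(0)
pages = {
    "a.example": '<a href="b.example">B</a> <a href="c.example">C</a>',
    "b.example": '<a href="c.example">C</a>',
    "c.example": '<a href="a.example">A</a>',
}

state = {}
for url, content in pages.items():
    for key, value in init_map(url, content, rng):
        reduced_key, reduced_value = init_reduce(key, [value])
        state[reduced_key] = reduced_value

for url, (init, out_links) in state.items():
    print(url, out_links)
```

The output records have the same shape, (u, (PR(u), out(u))), that the rank computation phase expects as input.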

PageRank with MapReduce - Compute Ranks

Map:

• Input: (u, (PR(u), out(u)))
• Output:
  ◦ For each v ∈ out(u): (v, PR(u) / |out(u)|)
  ◦ (u, out(u))

Reduce:

• Input: (u, out(u)), {(u, val)}
• Sums up all the val's and computes the new PageRank PR(u) for u
• Output: (u, (PR(u), out(u)))
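
One rank-computation job, plus the driver loop that reruns it, might look as follows. This is a single-process sketch under stated assumptions: the `"links"` and `"rank"` tags used to tell the two kinds of map output apart, the function names, and β = 0.85 are choices made for the example and are not prescribed by the slides.

```python
from collections import defaultdict

BETA = 0.85  # assumed damping value


def rank_map(u, value):
    """(u, (PR(u), out(u))) -> rank shares for each v in out(u), plus (u, out(u))."""
    pr_u, out_u = value
    emitted = [(u, ("links", out_u))]      # pass the graph structure along
    for v in out_u:
        emitted.append((v, ("rank", pr_u / len(out_u))))
    return emitted


def rank_reduce(u, values):
    """Sum the rank shares sent to u and compute its new PageRank."""
    out_u, total = [], 0.0
    for tag, payload in values:
        if tag == "links":
            out_u = payload
        else:
            total += payload
    return (u, ((1 - BETA) + BETA * total, out_u))


def pagerank_iteration(state):
    """Run one Map phase, group by key, and run one Reduce phase."""
    groups = defaultdict(list)
    for u, value in state.items():
        for key, emitted_value in rank_map(u, value):
            groups[key].append(emitted_value)
    return dict(rank_reduce(u, values) for u, values in groups.items())


# Driver: MapReduce has no loop construct, so the job is resubmitted each round.
state = {"w1": (0.9, ["w2"]), "w2": (1.1, ["w1"])}
for _ in range(100):
    state = pagerank_iteration(state)
print({u: round(pr, 6) for u, (pr, _) in state.items()})
```

Unlike the in-place loop of the sequential algorithm, every reducer here sees only the values from the previous job, so all pages are updated simultaneously. The link lists have to be re-emitted in every round because nothing else carries the graph from one job to the next.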

Extensions of MapReduce

• Support for repeated MapReduce
  ◦ HaLoop, Twister
• Graph models: Giraph, Pregel, GraphLab

4. Pregel

Pregel

Distributed graph processing programming model from Google

Bulk Synchronous Parallel computation:

• Synchronous iterations (called supersteps)
• In a superstep:
  ◦ Each vertex asynchronously executes some user-defined function in parallel

Message passing:

• Vertices exchange messages with neighbors

Pregel Programming Model

A data graph is defined where:

1. Each vertex performs some computation
2. A vertex sends messages to neighboring vertices
3. Each vertex processes incoming messages

The graph is distributed across computing nodes.

Pregel Programming Model

We have two types of entities:

Vertex:

1. Has a unique identifier
2. Has a modifiable, user-defined value

Edge:

1. Has source and target vertex identifiers
2. Has a modifiable, user-defined value

Pregel Program

When the program is loaded, each vertex executes its computation and sends messages to its out-neighbors.

Each vertex:

1. Receives messages from its in-neighbors
2. Processes the messages
3. Decides whether or not to send new messages
4. Decides whether or not to halt

The process is repeated until all vertices have halted.

Pregel - State of a vertex

[Figure: state machine with two states, Active and Inactive. A vertex moves from Active to Inactive when it votes to halt, and from Inactive back to Active when it receives a message.]

Pregel Model

[Figure: illustration of the Pregel model; its content is not recoverable from the extracted text]

Pregel - Aggregators

Pregel supports global communication through Aggregators:

• Each vertex can provide a value to the Aggregator in each superstep
• The Aggregator produces a single aggregated value
• Each vertex has access to the aggregated value of the previous superstep

Big Data Analytics 4. Pregel

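Pregel Example: finding the max value in the graph

Vertex values in successive supersteps (one row per superstep, one column per vertex):

3 6 2 1
6 6 2 6
6 6 6 6
6 6 6 6

procedure VertexUpdate
    input: current value v, messages M, superstep s
    changed ← (s = 0)
    for m ∈ M do
        if m > v then
            v ← m
            changed ← true
        end if
    end for
    if changed then
        send v on all outgoing edges
    else
        halt
    end if
    return v
end procedure

A vertex sends its value only in the first superstep or after the value has changed; a halted vertex is reactivated when it receives a message. Sending unconditionally would keep reactivating neighbours, so the computation would never terminate.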
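
A small simulation can make the superstep mechanics concrete. The sketch below assumes that a vertex sends its value only in the first superstep or after the value changed, and that a halted vertex wakes up when a message arrives. The edge list is hypothetical (the figure's edges are not recoverable from the text); it was chosen so that the run reproduces the four rows of values shown above. The function name `pregel_max` is also an assumption.

```python
from collections import defaultdict


def pregel_max(values, edges):
    """Simulate the max-value vertex program; return the values after each superstep."""
    values = dict(values)
    active = set(values)               # vertices that have not voted to halt
    inbox = defaultdict(list)          # messages delivered in the current superstep
    trace = []
    superstep = 0
    while active:
        outbox = defaultdict(list)
        still_active = set()
        for v in sorted(active):
            changed = superstep == 0   # everybody sends once at the start
            for m in inbox[v]:
                if m > values[v]:
                    values[v] = m
                    changed = True
            if changed:
                for target in edges.get(v, []):
                    outbox[target].append(values[v])
                still_active.add(v)
            # otherwise the vertex votes to halt
        trace.append([values[v] for v in sorted(values)])
        # Superstep barrier: deliver messages; a message reactivates its receiver.
        inbox = outbox
        active = still_active | set(outbox)
        superstep += 1
    return trace


# Hypothetical graph on four vertices with the initial values from the slide.
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
edges = {"a": ["b"], "b": ["a", "d"], "c": ["d"], "d": ["c"]}
trace = pregel_max(initial, edges)
```

The run ends once every vertex has halted and no messages are in flight.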


PageRank using Pregel

PageRank iteration:

$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$

Let

- $m_j$: the incoming message from node $j$

- $v$: the current node

Vertex update function:

1. send $\frac{PR(w_v)}{|out(w_v)|}$ to all outgoing edges

2. collect incoming messages $m_j$

3. $PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j$
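
The three steps of the vertex update can be checked on a single node. The sketch below is only an illustration: the helper names `outgoing_message` and `updated_rank`, the value β = 0.85, and the three in-neighbours with their ranks and out-degrees are all assumptions made for the example.

```python
def outgoing_message(pr: float, out_degree: int) -> float:
    # Step 1: a node sends PR(w_v) / |out(w_v)| along each outgoing edge.
    return pr / out_degree


def updated_rank(messages, beta: float) -> float:
    # Step 3: PR(w_v) = (1 - beta) + beta * (sum of incoming messages).
    return (1 - beta) + beta * sum(messages)


beta = 0.85                                   # assumed value
# Hypothetical in-neighbours of v as (PR, out-degree) pairs.
in_neighbours = [(1.0, 2), (1.0, 1), (0.5, 4)]

# Step 2: the messages that v collects are 0.5, 1.0 and 0.125.
messages = [outgoing_message(pr, d) for pr, d in in_neighbours]
new_pr = updated_rank(messages, beta)         # 0.15 + 0.85 * 1.625
```

Summing the messages gives the same result as evaluating the PageRank iteration directly over the in-neighbours, because each message already carries the division by the sender's out-degree.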


PageRank using Pregel

The algorithm can stop if

- it reaches a predefined maximum number of supersteps

- an aggregator determines that it has converged

Aggregator:

- Gathers from all nodes the difference between the value in the current superstep and in the last one

- The aggregated value is the sum of all the differences


PageRank using Pregel

[Figure: example graph with nodes w1, w2, w3, w4]

procedure VertexUpdate
    input: current value PR(v), number of outgoing edges |out(v)|, messages M, aggregated value a
    if a > 0 then
        send PR(v) / |out(v)| on all outgoing edges
        old ← PR(v)
        PR(v) ← (1 − β) + β Σ_{m_j ∈ M} m_j
        send |PR(v) − old| to the aggregator
    else
        halt
    end if
end procedure
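
The procedure above can be simulated on one machine to see how the aggregator drives termination. The following is a sketch under several assumptions that are not in the slides: the link structure between w1..w4 is hypothetical, every node has at least one outgoing edge, all ranks start at 1, the aggregated value starts at infinity so that the first superstep runs, and the test `a > 0` is relaxed to `a > tol` because floating-point differences rarely reach exactly zero.

```python
def pagerank_pregel(links, beta=0.85, tol=1e-9, max_supersteps=1000):
    """Simulate the vertex program; return the ranks and the number of supersteps."""
    pr = {v: 1.0 for v in links}            # assumed initial rank
    inbox = {v: [] for v in links}          # messages from the previous superstep
    aggregated = float("inf")               # assumed start value of the aggregator
    supersteps = 0
    # Every vertex sees the same aggregated value, so all halt together.
    while supersteps < max_supersteps and aggregated > tol:
        outbox = {v: [] for v in links}
        differences = []
        for v, targets in links.items():
            share = pr[v] / len(targets)    # send PR(v) / |out(v)| on all outgoing edges
            for t in targets:
                outbox[t].append(share)
            old = pr[v]
            pr[v] = (1 - beta) + beta * sum(inbox[v])
            differences.append(abs(pr[v] - old))   # contribution to the aggregator
        # Superstep barrier: deliver messages and publish the aggregated sum.
        inbox = outbox
        aggregated = sum(differences)
        supersteps += 1
    return pr, supersteps


# Hypothetical link structure for the four nodes.
links = {"w1": ["w2", "w3"], "w2": ["w4"], "w3": ["w1", "w4"], "w4": ["w1"]}
ranks, steps = pagerank_pregel(links)
```

Both stopping rules appear in the loop condition: the cap `max_supersteps` and the aggregated sum of differences falling to the tolerance.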
