Big Data Analytics

Lucas Rego Drumond

Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science

University of Hildesheim, Germany

Outline

1. Introduction

2. PageRank

3. PageRank and MapReduce

4. Pregel



1. Introduction




Overview

Part III: Machine Learning Algorithms

Part II: Large Scale Computational Models

Part I: Distributed Database, Distributed File System




MapReduce - Review

1. Each mapper transforms a set of key-value pairs into a list of output keys and intermediate values

2. All intermediate values are grouped according to their output keys

3. Each reducer receives all the intermediate values associated with a given key

4. Each reducer associates one final value with each key
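
The four steps can be tried out in a single process. The following is a minimal sketch, not a distributed implementation; the helper `map_reduce` and the word-count mapper and reducer are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one in-memory MapReduce pass (illustrative helper, not a real framework)."""
    groups = defaultdict(list)
    # Step 1: each mapper call turns one input pair into (output key, value) pairs.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Step 2: intermediate values are grouped by output key.
            groups[out_key].append(out_value)
    # Steps 3 and 4: the reducer sees all values of one key and returns one final value.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc_id, text):
    for word in text.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    return sum(counts)


docs = [("d1", "big data analytics"), ("d2", "Big graphs need big models")]
word_counts = map_reduce(docs, word_mapper, count_reducer)
print(word_counts)
```

Grouping by key is the only step in which records interact, which is why a real framework can spread mappers and reducers over many machines.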



2. PageRank




Google’s Pagerank

Problem:

- Measure the importance of websites in Google’s search engine results

- Relevant websites are more likely to have more links pointing to them

- It is important to consider the quality of the websites that link to a given page




Web graph

The Web can be represented as a graph containing:

- A set of webpages W

- A set of hyperlinks between webpages H ⊆ W × W

Task:

- Assign a numerical weight r : W → R to each element of W

- The weight r(w) of a webpage w ∈ W should reflect its importance relative to the other webpages

[Figure: example web graph with four pages w1, w2, w3, w4]








Pagerank - The Random Surfer

Imagine a web surfer that:

- Jumps from one web page to another at random:
    - The next link to follow is chosen with uniform probability
    - The surfer follows the chosen link with some probability β
    - It occasionally jumps to a random page with some small probability 1 − β

The Pagerank PR(w) of a web page w models how likely it is that the random surfer will visit it
































Let:

- in(w) := {v | (v, w) ∈ H} be the set of pages that link to w

- out(w) := {v | (w, v) ∈ H} be the set of pages linked by w

- 1 − β be the probability that the surfer jumps to a random page

Pagerank of a webpage w_i:

$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$

Example:

- in(w1) = {w3}

- out(w1) = {w2, w3}

[Figure: example web graph with four pages w1, w2, w3, w4]
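
The sets in(w) and out(w) can be derived directly from the hyperlink set H. The sketch below assumes a concrete edge list for the four-page graph. Only in(w1) and out(w1) are stated in the slides; the remaining edges are an assumption, chosen to be consistent with the Pagerank values reported for this graph further on.

```python
pages = ["w1", "w2", "w3", "w4"]

# Assumed hyperlink set H as (source, target) pairs.
# Only in(w1) = {w3} and out(w1) = {w2, w3} come from the slides.
H = [("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")]


def in_out_sets(pages, links):
    """Return the dictionaries in(w) and out(w) for every page w."""
    in_sets = {w: set() for w in pages}
    out_sets = {w: set() for w in pages}
    for v, w in links:
        out_sets[v].add(w)  # v links to w
        in_sets[w].add(v)   # w is linked from v
    return in_sets, out_sets


in_sets, out_sets = in_out_sets(pages, H)
print(in_sets["w1"], out_sets["w1"])
```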




Pagerank - Algorithm

procedure Pagerank
    input: a web graph (W, H), hyperparameter β
    output: Pagerank values PR ∈ R^{|W|}
    PR ← {0}^{|W|}
    for w ∈ W do
        PR(w) ← random value
    end for
    repeat
        for w ∈ W do
            PR(w) ← (1 − β) + β ∑_{w_j ∈ in(w)} PR(w_j) / |out(w_j)|
        end for
    until convergence
    return PR
end procedure
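
A minimal Python sketch of this procedure follows. The convergence test (largest change below `tol`), the seeded random initialization, and the edge list of the four-page graph are assumptions that the slides do not specify.

```python
import random


def pagerank(pages, links, beta=0.85, tol=1e-10, max_iter=1000, seed=0):
    """Iterate the Pagerank update in place until values stop changing."""
    in_sets = {w: [] for w in pages}
    out_degree = {w: 0 for w in pages}
    for v, w in links:
        in_sets[w].append(v)
        out_degree[v] += 1

    rng = random.Random(seed)  # seeded only to make runs repeatable
    pr = {w: rng.random() for w in pages}

    for _ in range(max_iter):
        largest_change = 0.0
        for w in pages:
            incoming = sum(pr[v] / out_degree[v] for v in in_sets[w])
            new_value = (1 - beta) + beta * incoming
            largest_change = max(largest_change, abs(new_value - pr[w]))
            pr[w] = new_value  # updated in place, as in the pseudocode
        if largest_change < tol:  # assumed convergence criterion
            break
    return pr


pages = ["w1", "w2", "w3", "w4"]
# Assumed edges of the four-page example graph.
links = [("w1", "w2"), ("w1", "w3"), ("w2", "w3"), ("w3", "w1"), ("w4", "w3")]
ranks = pagerank(pages, links)
print({w: round(value, 2) for w, value in ranks.items()})
```

Under the assumed edges, the result does not depend on the random starting values, because the update has a single fixed point.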





[Figure: two web pages w1 and w2 that link to each other]

β = 0.85

$$PR(w_1) = (1 - \beta) + \beta \frac{PR(w_2)}{|out(w_2)|}$$

$$PR(w_2) = (1 - \beta) + \beta \frac{PR(w_1)}{|out(w_1)|}$$

Starting values:

PR(w1) = 0.9

PR(w2) = 1.1

Iteration 1:

PR(w1) = 0.15 + 0.85 ∗ 1.1/1 = 1.085

PR(w2) = 0.15 + 0.85 ∗ 1.085/1 = 1.07225

Iteration 2:

PR(w1) = 0.15 + 0.85 ∗ 1.07225/1 = 1.061412

PR(w2) = 0.15 + 0.85 ∗ 1.061412/1 = 1.0522

Iteration 3:

PR(w1) = 0.15 + 0.85 ∗ 1.0522/1 = 1.04437

PR(w2) = 0.15 + 0.85 ∗ 1.04437/1 = 1.037714

Iteration 4:

PR(w1) = 0.15 + 0.85 ∗ 1.037714/1 = 1.032057

PR(w2) = 0.15 + 0.85 ∗ 1.032057/1 = 1.027248

After a couple of iterations...

PR(w1) = 1

PR(w2) = 1

PR(w1) = 0.15 + 0.85 ∗ 1/1 = 1

PR(w2) = 0.15 + 0.85 ∗ 1/1 = 1
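
The two-page iteration can be replayed with a few lines of Python. This is a sketch; the function name `two_page_iterations` is made up for illustration. Each page has exactly one outgoing link, so the divisor is 1.

```python
def two_page_iterations(pr1, pr2, beta=0.85, steps=4):
    """Replay the in-place updates for two pages that link to each other."""
    history = []
    for _ in range(steps):
        pr1 = (1 - beta) + beta * pr2 / 1  # |out(w2)| = 1
        pr2 = (1 - beta) + beta * pr1 / 1  # uses the freshly updated PR(w1)
        history.append((pr1, pr2))
    return history


history = two_page_iterations(0.9, 1.1)
for step, (pr1, pr2) in enumerate(history, start=1):
    print(f"iteration {step}: PR(w1) = {pr1:.6f}, PR(w2) = {pr2:.6f}")
```

Running it with more steps shows both values approaching 1, the fixed point of the two equations.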




Pagerank Algorithm

- PR(w1) = 1.49

- PR(w2) = 0.78

- PR(w3) = 1.58

- PR(w4) = 0.15

[Figure: example web graph with four pages w1, w2, w3, w4]



3. PageRank and MapReduce




MapReduce Implementation of Pagerank

How can Pagerank be implemented on MapReduce?




MapReduce Limitations

- Iterations:
    - A job is a pipeline of one Map phase followed by one Reduce phase
    - MapReduce does not naturally support iterative algorithms

- Dependency between data points:
    - Data points should be independent of each other
    - Not suitable for graph-like data




Pagerank with MapReduce

Two phases:

- Initialize: a complete MapReduce iteration that initializes the pages with random pagerank values

- Compute pagerank: a repeated series of pagerank computations




Pagerank with MapReduce - Initialize

Map:

- Input: (u, page content)
- Output: (u, (init, out(u)))
    - init: initial pagerank value
    - u: url of a webpage
    - out(u): pages linked by u

Reduce:

- Input: (u, (init, out(u)))

- Output: (u, (init, out(u)))
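
The initialization phase could look roughly as follows in Python. This is a sketch under assumptions: page content is taken to be HTML, links are extracted with a simple `href` regular expression, and the names `extract_links`, `init_map`, `init_reduce`, and `initialize` are hypothetical.

```python
import random
import re
from collections import defaultdict


def extract_links(page_content):
    """Assumed link extraction: every href="..." target counts as an outgoing link."""
    return re.findall(r'href="([^"]+)"', page_content)


def init_map(url, page_content, rng):
    # Emit (u, (init, out(u))) with a random initial pagerank value.
    yield url, (rng.random(), extract_links(page_content))


def init_reduce(url, values):
    # Identity reducer: exactly one record per url is expected.
    return url, values[0]


def initialize(pages, seed=0):
    rng = random.Random(seed)  # seeded only to make runs repeatable
    groups = defaultdict(list)
    for url, content in pages:
        for key, value in init_map(url, content, rng):
            groups[key].append(value)
    return dict(init_reduce(url, values) for url, values in groups.items())


pages = [
    ("w1", '<a href="w2">a</a> <a href="w3">b</a>'),
    ("w2", '<a href="w3">c</a>'),
]
state = initialize(pages)
print(state)
```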






Pagerank with MapReduce - Compute Ranks

Map:

- Input: (u, (PR(u), out(u)))
- Output:
    - For each v ∈ out(u): (v, PR(u) / |out(u)|)
    - (u, out(u))

Reduce:

- Input: (u, out(u)), {(u, val)}
- Sums up all val's and computes the new pagerank PR(u) for u
- Output: (u, (PR(u), out(u)))
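
One way to sketch the rank computation in Python is shown below. The tags `"rank"` and `"links"`, used to tell the two kinds of map output apart in the reducer, are an assumption, as are the function names and the edges of the four-page graph.

```python
from collections import defaultdict

BETA = 0.85


def rank_map(u, value):
    pr_u, out_u = value
    for v in out_u:
        # Share of u's pagerank that flows to each linked page v.
        yield v, ("rank", pr_u / len(out_u))
    # Pass the link structure on so the next iteration can reuse it.
    yield u, ("links", out_u)


def rank_reduce(u, values):
    out_u, total = [], 0.0
    for kind, payload in values:
        if kind == "links":
            out_u = payload
        else:
            total += payload
    return u, ((1 - BETA) + BETA * total, out_u)


def pagerank_iteration(state):
    """One Map phase plus one Reduce phase over the whole graph."""
    groups = defaultdict(list)
    for u, value in state.items():
        for key, out_value in rank_map(u, value):
            groups[key].append(out_value)
    return dict(rank_reduce(u, values) for u, values in groups.items())


# Assumed four-page graph; every page starts with pagerank 1.0.
state = {
    "w1": (1.0, ["w2", "w3"]),
    "w2": (1.0, ["w3"]),
    "w3": (1.0, ["w1"]),
    "w4": (1.0, ["w3"]),
}
for _ in range(200):  # each pass stands for one complete MapReduce job
    state = pagerank_iteration(state)
print({u: round(pr, 2) for u, (pr, _) in state.items()})
```

Every pass is a separate job that rereads and rewrites the whole graph, which illustrates the iteration limitation of MapReduce noted earlier.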






Extensions of MapReduce

- Support for repeated MapReduce:
    - HaLoop, Twister

- Graph models: Giraph, Pregel, GraphLab



4. Pregel




Pregel

Distributed graph processing programming model from Google

Bulk Synchronous Parallel Computation:

- Synchronous iterations (called supersteps)
- In a superstep:
    - Each vertex asynchronously executes some user-defined function in parallel

Message passing:

- Vertices exchange messages with neighbors




Pregel Programming Model

A data graph is defined where:

1. each vertex performs some computation

2. a vertex sends messages to neighboring vertices

3. each vertex processes incoming messages

The graph is distributed across computing nodes.




























Pregel Programming Model

We have two types of entities:

Vertex:

1. Unique identifier

2. Has a modifiable, user defined value

Edge:

1. Source and target vertex identifiers








Pregel Program

Program is loaded: each vertex executes its computation and sends messages to its out-neighbors

Each vertex:

1. receives messages from in-neighbors

2. processes messages

3. decides whether or not to send new messages

4. decides whether or not to halt

The process is repeated until all vertices have halted





Pregel - State of a vertex

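[Figure: vertex state machine. An Active vertex becomes Inactive when it votes to halt; an Inactive vertex becomes Active again when it receives a message.]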





Pregel - Aggregators

Pregel supports global communication through Aggregators

- Each vertex can provide a value to the Aggregator in each superstep

- The Aggregator produces a single aggregated value

- Each vertex has access to the aggregated value of the previous superstep




























Pregel Example finding the max value in the graph

Vertex values in each superstep:

Superstep 0: 3 6 2 1

Superstep 1: 6 6 2 6

Superstep 2: 6 6 6 6

Superstep 3: 6 6 6 6

procedure VertexUpdate
    input: current value v, messages M
    Send v on all outgoing edges
    flag ← true
    for m ∈ M do
        if m > v then
            v ← m
            flag ← false
        end if
    end for
    if flag then
        halt
    end if
    return v
end procedure
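
The max-value example can be simulated in a single process. The sketch below is not Pregel itself: the loop in `run_max_value` stands in for the framework, and the edges are an assumption chosen so that the values evolve as in the figure. Unlike a literal reading of the pseudocode, which sends v in every step, a vertex here sends its value only in superstep 0 or after the value changed; this is an assumption made so that halted vertices are not woken up forever by repeated messages.

```python
from collections import defaultdict


def run_max_value(values, out_edges, max_supersteps=50):
    """Simulate supersteps; returns the list of vertex values after each superstep."""
    values = dict(values)
    active = set(values)          # all vertices start active
    inbox = defaultdict(list)
    history = []
    for superstep in range(max_supersteps):
        # A halted vertex becomes active again when a message is waiting for it.
        to_run = active | {v for v in inbox if inbox[v]}
        if not to_run:
            break                 # all vertices halted and no messages in flight
        outbox = defaultdict(list)
        active = set()            # every vertex votes to halt at the end of its step
        for v in sorted(to_run):
            changed = superstep == 0
            for m in inbox.get(v, []):
                if m > values[v]:
                    values[v] = m
                    changed = True
            if changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(values[v])
        inbox = outbox            # messages are delivered in the next superstep
        history.append([values[v] for v in sorted(values)])
    return history


initial = {0: 3, 1: 6, 2: 2, 3: 1}
# Assumed directed edges: 0 <-> 1, 1 -> 3, 2 <-> 3.
out_edges = {0: [1], 1: [0, 3], 2: [3], 3: [2]}
history = run_max_value(initial, out_edges)
for superstep, row in enumerate(history):
    print(superstep, row)
```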












Pagerank using Pregel

Pagerank iteration:

$$PR(w_i) = (1 - \beta) + \beta \sum_{w_j \in in(w_i)} \frac{PR(w_j)}{|out(w_j)|}$$

Let:

- m_j be the incoming message from node j
- v be the current node

Vertex update function:

1. send $\frac{PR(w_v)}{|out(w_v)|}$ to all outgoing edges

2. collect incoming messages m_j

3. $PR(w_v) = (1 - \beta) + \beta \sum_{m_j \in M} m_j$









Pagerank using Pregel

The algorithm can stop when

- it reaches a predefined maximum number of supersteps

- an aggregator indicates convergence

Aggregator:

- Gathers from all nodes the difference between the value in the current superstep and in the last one

- The aggregated value is the sum of all the differences















[Figure: example web graph with four pages w1, w2, w3, w4]

procedure VertexUpdate
    input: current value PR(v), number of outgoing edges out(v), messages M, aggregated value a
    if a > 0 then
        Send PR(v) / out(v) on all outgoing edges
        old ← PR(v)
        PR(v) ← (1 − β) + β ∑_{m_j ∈ M} m_j
        Send |PR(v) − old| to the aggregator
    else
        halt
    end if
end procedure
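
The vertex update with an aggregator can be simulated in plain Python. The sketch makes several assumptions that the pseudocode leaves open: the test a > 0 is replaced by a small threshold `eps`, every page starts with value 1.0, the update is skipped in superstep 0 because no messages have arrived yet, a vertex sends its value after updating it, and the edges of the four-page graph are assumed. The function `run` stands in for the framework.

```python
import math

BETA = 0.85


def vertex_update(pr_v, out_degree, messages, aggregated, superstep, eps):
    """One vertex step: returns (new value, outgoing message, change), or None to halt."""
    if aggregated <= eps:         # assumed threshold in place of "a > 0"
        return None               # vote to halt
    change = 0.0
    if superstep > 0:             # assumption: no update before messages exist
        new_pr = (1 - BETA) + BETA * sum(messages)
        change = abs(new_pr - pr_v)
        pr_v = new_pr
    message = pr_v / out_degree if out_degree else None
    return pr_v, message, change


def run(out_edges, eps=1e-9, max_supersteps=500):
    pr = {v: 1.0 for v in out_edges}
    inbox = {v: [] for v in out_edges}
    aggregated = math.inf         # no aggregate exists before the first superstep
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in out_edges}
        contributions = []
        for v, targets in out_edges.items():
            result = vertex_update(pr[v], len(targets), inbox[v],
                                   aggregated, superstep, eps)
            if result is None:
                continue
            pr[v], message, change = result
            contributions.append(change)
            if message is not None:
                for target in targets:
                    outbox[target].append(message)
        if not contributions:     # every vertex voted to halt
            break
        inbox = outbox            # delivered in the next superstep
        # The aggregator sums the differences; vertices see it one superstep later.
        aggregated = sum(contributions) if superstep > 0 else math.inf
    return pr


# Assumed edges of the four-page example graph.
out_edges = {"w1": ["w2", "w3"], "w2": ["w3"], "w3": ["w1"], "w4": ["w3"]}
ranks = run(out_edges)
print({v: round(value, 2) for v, value in ranks.items()})
```

A threshold is used because a sum of floating-point differences rarely reaches exactly zero.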


