Tree and Graph Processing On Hadoop


Ted Malaska


Schedule

• Intro
• Overview of Hadoop and Eco-System
• Summarize Tree Rooting
• MR Overview/Implementation Options
• HBase Overview/Implementation Options
• Giraph Overview/Implementation Options
• Spark Overview/Implementation Options
• Summary
• Questions


Intro

• Hi there


Overview of Hadoop and Eco-System

[Diagram: Hadoop ecosystem map]

• Categories: Batch, Ingestion, Streaming, RTQ, LFP, Machine Learning, NoSql, Search
• Components: MapReduce, Pig, Crunch, Hive, Giraph, Sqoop, Flume, Kafka, Storm, Spark Streaming, Spark, Impala, Mahout, Oryx, R Python Streaming, SAS, HBase, Accumulo, NFS, Search Solr
• Across the stack: HDFS; Security and Access Controls; Auditing and Monitoring


In Scope for Tonight

[Diagram: the ecosystem map from the previous slide, repeated to mark what is in scope. Per the schedule, tonight covers MapReduce, HBase, Giraph, and Spark.]


Summarize Tree Rooting

• Basic Tree

[Diagram: a tree with each vertex labeled by its depth, from 0 at the true root down to 3. Callouts: True Root, Branches, Leaves, Vertex, Edge, Depth]



• More Complex Tree

[Diagram: a depth-labeled tree with two complications called out: Circular Link, Multiple Parents]



• Merging Trees
• Borderline True Graph Problem

[Diagram: trees with separate depth-0 roots that share vertices. Callouts: True Root, True Root, Multi Rooted Vertex]



• Know your data


Basic Storage Format

• <NodeID>|<EdgeID>

• Example
  • 101
  • 101|201
  • 101|202
  • 201
  • 202|301
  • 301
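As a rough illustration of the format above, here is a minimal Python sketch that loads such records into an adjacency map. It assumes that a line with no `|` declares a vertex, and that the second field of a `<NodeID>|<EdgeID>` line is the ID of the vertex the edge points to; the name `parse_records` is hypothetical and not from the talk.

```python
from collections import defaultdict

def parse_records(lines):
    """Parse '<NodeID>' and '<NodeID>|<EdgeID>' records into an adjacency map."""
    adjacency = defaultdict(list)   # vertex id -> ids its edges point to
    for line in lines:
        record = line.strip()
        if not record:
            continue
        parts = record.split("|")
        node_id = parts[0]
        adjacency[node_id]          # touch the key so vertices without edges are kept
        if len(parts) > 1:
            # Assumption: the second field names the vertex at the far end of the edge.
            adjacency[node_id].append(parts[1])
    return dict(adjacency)

records = ["101", "101|201", "101|202", "201", "202|301", "301"]
graph = parse_records(records)
```

For the example records, `graph` maps 101 to its two edges (201 and 202), 202 to 301, and 201 and 301 to empty lists.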


Preprocessing

• Trimming Data
  • Nodes and edges have data
  • Data has weight
  • Normally linkage information is under 10% of true data size

• Organize Data by Partitioning


Basic Solution

• Step 1: Identify Roots
  • Echo to all edges
  • Vertices that receive no echoes are roots
  • Root the root

• Step 2: Walk the tree
  • Echo from the last newly rooted vertices to all edges
  • If a vertex is not already rooted, then root it

• Input: 101, 101|201, 101|202, 201, 202|301, 301

• After Step 1: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:Null, 202|301|R:Null, 301|R:Null

• After Step 2, first pass: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:101, 202|301|R:101, 301|R:Null

• After Step 2, second pass: 101|R:101, 101|201|R:101, 101|202|R:101, 201|R:101, 202|301|R:101, 301|R:101
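The two steps can be sketched in plain, single-machine Python to make the echo idea concrete. This is only an illustration under assumptions, not the speaker's implementation: the function name `root_tree` is hypothetical, and the second field of each record is assumed to be the vertex an edge points to.

```python
def root_tree(records):
    """Assign a root to every reachable vertex by echoing along edges."""
    vertices, edges = set(), []
    for record in records:
        parts = record.split("|")
        vertices.add(parts[0])
        if len(parts) > 1:
            vertices.add(parts[1])
            edges.append((parts[0], parts[1]))

    # Step 1: every edge echoes to its target; vertices that hear nothing are roots.
    echoed = {dst for _, dst in edges}
    roots = {v: v for v in vertices if v not in echoed}   # "root the root"

    # Step 2: echo from the newly rooted vertices until a pass roots nothing new.
    frontier = set(roots)
    while frontier:
        newly_rooted = set()
        for src, dst in edges:
            if src in frontier and dst not in roots:      # skip already rooted vertices
                roots[dst] = roots[src]
                newly_rooted.add(dst)
        frontier = newly_rooted
    return roots

roots = root_tree(["101", "101|201", "101|202", "201", "202|301", "301"])
```

Because a vertex is rooted only once, a circular link cannot cause an endless loop, and a vertex with multiple parents keeps the first root that reaches it. Vertices on a cycle that no root can reach are simply left out of the result.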


Map Reduce

• Massively parallel processing on Hadoop
• Based on Google's 2004 MapReduce white paper
• Able to process petabytes of data


Map Reduce

[Diagram: Data Blocks → Mappers → Sort & Shuffle → Reducers → Data Blocks]


Map Reduce

• Self Joins
• Always emitting two outputs:
  • Newly Rooted
  • Still Un-Rooted

[Diagram: chained MR stages and their outputs]

• MR Stage0 (Root Identifying): All Data → Un-Rooted + Newly Rooted
• MR Stage1 (Rooting): → Un-Rooted + Newly Rooted + Old Rooted 0
• MR Stage2 (Rooting): → Un-Rooted + Newly Rooted + Old Rooted 0 + Old Rooted 1
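To make the staged self-join more concrete, the following is a small in-memory simulation in plain Python rather than real Hadoop code. All names (`shuffle`, `identify_roots`, `rooting_stage`) and the record layout are assumptions made for illustration. Each stage keys its records by vertex ID, groups them as sort & shuffle would, and emits a newly rooted set and a still un-rooted set.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group mapper output by key, standing in for MapReduce's sort & shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def identify_roots(children):
    """Stage 0: every vertex echoes to its children; vertices with no echo are roots."""
    pairs = []
    for vertex, kids in children.items():
        pairs.append((vertex, ("SELF", kids)))
        pairs.extend((kid, ("ECHO", vertex)) for kid in kids)
    newly_rooted, unrooted = {}, {}
    for vertex, values in shuffle(pairs).items():
        kids = next((v[1] for v in values if v[0] == "SELF"), [])
        if any(v[0] == "ECHO" for v in values):
            unrooted[vertex] = kids
        else:
            newly_rooted[vertex] = (vertex, kids)     # (root id, children)
    return newly_rooted, unrooted

def rooting_stage(newly_rooted, unrooted):
    """Stages 1..n: join the newly rooted set against the un-rooted set."""
    pairs = []
    for root, kids in newly_rooted.values():
        pairs.extend((kid, ("ECHO", root)) for kid in kids)
    for vertex, kids in unrooted.items():
        pairs.append((vertex, ("SELF", kids)))
    next_new, still_unrooted = {}, {}
    for vertex, values in shuffle(pairs).items():
        selfs = [v[1] for v in values if v[0] == "SELF"]
        echoes = [v[1] for v in values if v[0] == "ECHO"]
        if not selfs:
            continue                                  # echo hit an already rooted vertex
        if echoes:
            next_new[vertex] = (echoes[0], selfs[0])  # first echo wins
        else:
            still_unrooted[vertex] = selfs[0]
    return next_new, still_unrooted

children = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
newly, unrooted = identify_roots(children)
old_rooted = []                                       # one "Old Rooted N" output per stage
while newly:
    old_rooted.append(newly)
    newly, unrooted = rooting_stage(newly, unrooted)
```

The driver loop stops as soon as a stage roots nothing new, so whatever remains in `unrooted` at that point has no reachable root. In a real job each dictionary would be a separate output directory, and only the two smaller sets would be read by the next stage.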


Map Reduce

• Great for large batch operations
• No memory limit
• Not good at iterations


HBase

• Among the largest and most widely used NoSql implementations in the world
• Based on Google's 2006 Bigtable white paper
• Imagine it like a giant HashMap with keys and values
• Handles 100k operations a second even on a small 10-node cluster


HBase Getting

[Diagram: Client, HBase Master, and three HBase Region Servers, each with its own Block Cache]


HBase Putting

[Diagram: Client, HBase Master, and three HBase Region Servers, each shown with a WAL, a MemStore, and an HFile]


HBase

• Good for graph traversing
• Bad for large batch processing

• Scan rate about 8x slower than HDFS
• Good for end of a long tail
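To illustrate why a key-value store suits traversal, here is a minimal Python sketch in which a plain dict stands in for an HBase table. It does not use the HBase client API. The row layout (row key = vertex ID, a `parent` field per row) and the name `find_root` are assumptions made for this example.

```python
def find_root(table, vertex_id):
    """Follow parent pointers with one point lookup per hop until a root is reached."""
    visited = set()
    current = vertex_id
    while current not in visited:
        visited.add(current)
        row = table.get(current)          # stands in for a single Get by row key
        if row is None or row.get("parent") is None:
            return current                # no parent recorded: treat as the root
        current = row["parent"]
    return None                           # circular link: no true root on this path

# Hypothetical table contents: row key -> row, with the parent stored per vertex.
table = {
    "101": {"parent": None},
    "201": {"parent": "101"},
    "202": {"parent": "101"},
    "301": {"parent": "202"},
}
root_of_301 = find_root(table, "301")
```

Each hop costs one keyed lookup regardless of table size, which is the access pattern HBase handles well. Rooting every vertex this way would issue one chain of lookups per vertex, which is where batch-oriented approaches tend to win.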


Giraph

• System built for large batch graph processing
• Based on the Pregel 2009 white paper
• Hardened by LinkedIn and Facebook
• Reported to handle up to a trillion edges


Giraph Loading

[Diagram: Data Blocks are loaded into four Workers, coordinated by one Master]


Giraph (Bulk Synchronous Parallel)

[Diagram: three Workers run supersteps in parallel. In each superstep every Worker performs Local vertex computing, then Communication, then waits at a Barrier synchronization before the next superstep begins]
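The superstep cycle in the diagram can be mimicked in a few lines of single-process Python. This is a hedged sketch of the bulk synchronous parallel idea applied to tree rooting, not Giraph's Java API. The function name `run_supersteps` and the message layout are assumptions.

```python
def run_supersteps(children):
    """Root a forest with BSP supersteps: compute locally, send messages, then barrier."""
    has_parent = {kid for kids in children.values() for kid in kids}
    vertices = set(children) | has_parent
    root_of = {v: None for v in vertices}
    # Superstep 0 input: vertices without an incoming edge receive their own id.
    inbox = {v: [v] for v in vertices if v not in has_parent}
    supersteps = 0
    while inbox:                                   # halt when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():     # local vertex computing
            if root_of[vertex] is not None:
                continue                           # already rooted: ignore late messages
            root_of[vertex] = messages[0]
            for kid in children.get(vertex, []):   # communication
                outbox.setdefault(kid, []).append(root_of[vertex])
        inbox = outbox                             # barrier: delivered in the next superstep
        supersteps += 1
    return root_of, supersteps

children = {"101": ["201", "202"], "201": [], "202": ["301"], "301": []}
root_of, supersteps = run_supersteps(children)
```

On the example tree this takes three supersteps, one per level of depth. Messages sent during a superstep become visible only after the swap at the end of the loop body, which plays the role of the barrier.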


Giraph

• Most mature bulk graph processing system available
• Of all the solutions, the most graph focused


Spark

• At Berkeley around 2011, some asked if we could do better than MR
• Take advantage of lower cost memory
• Building on everything before


Spark

[Diagram: RDD Objects, e.g. rdd1.join(rdd2).groupBy(…).filter(…), feed the DAG Scheduler (like a queue planner), then the Task Scheduler, which goes through the Cluster Manager to the Spark Workers. Each Worker has Task Threads and a Block Manager]


Spark

• Implementations
  • Onion MR approach with basic Spark
  • Pregel approach with Bagel or GraphX

• Bagel is a façade over generic Spark functionality
• GraphX is an effort to extend Spark

• Less code
• Learning curve
• It's raw: it will be changing a lot in the next year
