Large scale graphs containing O(billion) vertices are becoming increasingly common in various applications. With graphs of such proportion, an efficient querying infrastructure becomes crucial. In this paper, we propose WOOster, a hosted querying infrastructure designed specifically for large graphs. We make two key contributions: a) the design of the WOOster framework; b) scalable map-reduce algorithms for two popular graph queries: sub-graph match and reachability. Our experiments show that the proposed map-reduce algorithms scale well on large synthetic datasets.
WOOster: A Map-Reduce based Platform for Graph Mining
Aravindan Raghuveer, Yahoo! Inc, Bangalore.
Yahoo! Confidential
Introduction
“If you squint the right way, graphs are everywhere” [1]
@ Yahoo!:
• The WOO Graph: all knowledge assimilated from the web.
  http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
[1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
The What and Why?

What?
• Family of graph query algorithms.
• Framework:
  • For graph storage and invoking the query algorithms.
  • Hosted solution on Hadoop.

Why?
• Family of graph query algorithms: present-day algorithms do not scale to billion-edge, billion-vertex graphs.
• Framework:
  • Optimizes storage layout to suit graph query algorithms.
  • Improves throughput of the queries.
Outline of the talk
• MapReduce 101
• Graph mining approaches
• Brief overview of WOOster architecture
• Graph query algorithms in WOOster:
  • Sub-graph matching
  • Reachability query
• Experiments
• Conclusion
Map Reduce 101
Switch to slides from Cloud Computing with MapReduce and Hadoop
www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt
MapReduce Programming Model
• Data type: key-value records
• Map function:
(Kin, Vin) → list(Kinter, Vinter)
• Reduce function:
(Kinter, list(Vinter)) → list(Kout, Vout)
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
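The word-count mapper/reducer above can be exercised with a minimal in-memory driver that simulates the shuffle & sort phase. This is an illustrative sketch, not Hadoop's API: `run_mapreduce` and its dict-based shuffle are assumptions, and the functions emit via `yield` instead of the framework-provided `output`.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the partial counts for one word.
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle & sort: group every mapped (key, value) pair by key.
    groups = defaultdict(list)
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    # Reduce each key group independently, as separate reduce tasks would.
    out = {}
    for k in sorted(groups):
        for out_k, out_v in reduce_fn(k, groups[k]):
            out[out_k] = out_v
    return out

counts = run_mapreduce(
    ["the quick brown fox", "the fox ate the mouse", "how now brown cow"],
    mapper, reducer)
```

Running this over the three example lines from the next slide yields the same counts the execution diagram shows.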
Word Count Execution
the quick
brown fox
the fox ate
the mouse
how now
brown cow
MapMap
MapMap
MapMap
Reduce
Reduce
Reduce
Reduce
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1
the, 1brown, 1
fox, 1
quick, 1
the, 1fox, 1the, 1
how, 1now, 1
brown, 1
ate, 1mouse, 1
cow, 1
Input Map Shuffle & Sort Reduce Output
Graph Mining Approaches: Two Schools

School-1: Invent a new platform.
- Map-reduce is not best suited for graph mining.
- BSP, PRAM models: circa 1980s.
- Pregel from Google [1]; HaLoop.

School-2: Ride on Map-Reduce.
- MR has wide adoption, open source tools, industry support.
- Avoids investing in one more computing infrastructure.
- Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop).
- Efforts in open source / academia on the same lines:
  • Pegasus, CMU [2]
  • Graph mining in Apache Mahout [3]
  • Raytheon's graph mining [4]

[1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
[2] http://www.cs.cmu.edu/~pegasus/
[3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
[4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
WOOster Architecture

• User submits a query.
• Planner periodically scans for newly arrived queries.
• Planner creates a M-R plan that re-uses computation / IO across queries (batching).
• Executor executes the M-R plan.
• Result is notified to the user (hosted solution).

[Figure: components — WOOster Web UI & WebService APIs, Planner, Executor, Grid, Jobs D/B, Graph Indices, WOO Graph.]
The Sub-Graph Match Query

Why sub-graph match (exact graph isomorphism)?
• A popular and expressive graph query, useful to mine patterns.
• To our knowledge, no large scale algorithm exists that operates on a billion-vertex graph.

Find all instances of query Q in graph G.
• Vertices have attributes (ex: age:31).
• Edges have relationship labels.
• Vertices and edges have constraints (ex: age < 40).

[Figure: notation — query vertex, graph vertex, matched graph vertex.]
Overview of the Solution

Step-0. Data Layout on HDFS
Step-1. Query Graph Partitioning
Step-2. Edge Selection
Step-3. Query Partition Matching
Step-4. Query Partition Merging
Data Layout on HDFS

• How to store a large scale graph? An adjacency-list-like solution:
  • Each row/line has information about a vertex:
    • Vertex attributes.
    • Vertex neighbors and the labels associated with each edge.

Implications:
• Enables early pruning of non-matching edges and vertices.
• Each vertex has information about itself and its immediate neighbors only.
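To make the one-vertex-per-line layout concrete, here is a minimal sketch of serializing and parsing such a row. The tab/comma/colon field format and the helper names `format_vertex_row` / `parse_vertex_row` are hypothetical illustrations, not WOOster's actual on-disk format.

```python
def format_vertex_row(vid, attrs, neighbors):
    # One HDFS line per vertex (hypothetical format):
    #   vid <TAB> k=v,... <TAB> neighbor:edge_label,...
    attr_str = ",".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    nbr_str = ",".join(f"{n}:{label}" for n, label in neighbors)
    return f"{vid}\t{attr_str}\t{nbr_str}"

def parse_vertex_row(line):
    # Inverse of format_vertex_row: recover id, attributes, labeled neighbors.
    vid, attr_str, nbr_str = line.split("\t")
    attrs = dict(kv.split("=") for kv in attr_str.split(",")) if attr_str else {}
    neighbors = [tuple(n.split(":")) for n in nbr_str.split(",")] if nbr_str else []
    return vid, attrs, neighbors

row = format_vertex_row("g1", {"age": "31"},
                        [("g2", "friend"), ("g3", "colleague")])
```

Because a mapper sees one such line at a time, it can check vertex- and edge-level constraints locally, which is what enables the early pruning mentioned above.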
Step-1: Query Graph Partitioning

Why? Parallelized solving of independent sub-problems.

How? Find the minimum number of partitions such that the diameter of each partition is 2.

Intuition:
• In a spanning tree of diameter 2, there is one vertex connected to all other vertices: the pivot vertex.
• We will use this property in steps 2 and 3.

[Figure: query graph partitioned into diameter-2 stars around pivot vertices.]
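The diameter-2 partitioning step can be approximated with a greedy cover by pivot-centered stars. This `partition_query` helper is an illustrative heuristic under an assumed adjacency-dict input; it does not guarantee the minimum number of partitions that the slide asks for.

```python
def partition_query(adj):
    # Greedy cover of the query graph by diameter-2 stars.
    # adj: dict vertex -> set of neighbors.
    # Returns a list of (pivot, star) pairs, where each star is the pivot's
    # closed neighborhood (pivot plus its neighbors), so its diameter <= 2.
    uncovered = set(adj)
    partitions = []
    while uncovered:
        # Pick the pivot whose closed neighborhood covers the most
        # still-uncovered vertices.
        pivot = max(adj, key=lambda v: len(({v} | adj[v]) & uncovered))
        star = {pivot} | adj[pivot]
        partitions.append((pivot, star))
        uncovered -= star
    return partitions

# Toy example: a path a-b-c-d-e is covered by two stars.
parts = partition_query({
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
})
```

Stars may overlap; what matters for the later steps is that every query vertex lands in some partition whose pivot reaches it in one hop.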
Step-2: Edge Selection

• What: Select a subset of edges from G that match at least one edge in Q.
• How (g1 is the current vertex in the mapper):
  1. The mapper emits all incident edges if the vertex and edge constraints are met.
  2a. Edge g1-g2 is emitted when g1 is mapped to a query vertex.
  2b. For every neighbor of q1, there must exist a corresponding neighbor of g1.
  3. Edge g1-g2 is also emitted from g2's mapper.
  4. The reducer emits an edge only if such a pair is found.

[Figure: map logic over vertices g1..g4 feeding reduce logic, which emits the matched edge g1-g2.]
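The edge-selection logic above can be sketched as a mapper/reducer pair with an in-memory shuffle. Everything here is an illustrative assumption rather than the paper's exact algorithm: the query constraints (`QUERY_VERTEX_OK`, `QUERY_EDGES`), the toy graph, and the "both endpoints must emit the edge" reducer check.

```python
from collections import defaultdict

# Hypothetical query: per-query-vertex constraints and labeled query edges.
QUERY_VERTEX_OK = {
    "q1": lambda attrs: int(attrs.get("age", 0)) < 40,
    "q2": lambda attrs: attrs.get("role") == "eng",
}
QUERY_EDGES = {("q1", "q2"): "reports_to"}

def edge_selection_mapper(vid, attrs, neighbors):
    # Emit each incident edge, keyed by the (undirected) edge, whenever this
    # endpoint satisfies some query vertex and the edge label matches a
    # query edge (steps 1, 2a, 3 on the slide).
    for qv, ok in QUERY_VERTEX_OK.items():
        if not ok(attrs):
            continue
        for nbr, label in neighbors:
            for (qa, qb), qlabel in QUERY_EDGES.items():
                if qv in (qa, qb) and label == qlabel:
                    yield (tuple(sorted((vid, nbr))), (vid, qv))

def edge_selection_reducer(edge, endorsements):
    # Keep the edge only if both endpoints emitted it (step 4 on the slide).
    if len({vid for vid, _ in endorsements}) == 2:
        yield edge

# Toy graph in the adjacency-list layout: g3 fails every constraint,
# so the edge g2-g3 is endorsed by only one endpoint and gets dropped.
GRAPH = {
    "g1": ({"age": "31"}, [("g2", "reports_to")]),
    "g2": ({"role": "eng"}, [("g1", "reports_to"), ("g3", "reports_to")]),
    "g3": ({"age": "50"}, [("g2", "reports_to")]),
}
groups = defaultdict(list)
for vid, (attrs, nbrs) in GRAPH.items():
    for key, val in edge_selection_mapper(vid, attrs, nbrs):
        groups[key].append(val)
selected = [e for k, vals in groups.items()
            for e in edge_selection_reducer(k, vals)]
```

The key design point matches the take-away slide: the shuffle key is the candidate edge itself, so the number of keys tracks the number of candidate matches, not the size of G.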
Step-3: Query Partition Matching

Edge selection output:
• Associates a graph vertex with the possible query vertices it could map to.
• Associates the graph vertex with its "pivot" graph vertex.
• A pivot graph vertex is a graph vertex mapped to a pivot query vertex: g1 in this example.

Map-reduce logic:
  1. The mapper emits the pivot graph vertex as key and the edge as value.
  2. The reducer receives all edges with the same pivot graph vertex.
  3. The reducer forms the partition.

[Figure: edges g1-g2, g1-g3, g1-g4 flow from map logic to reduce logic, which assembles the partition around pivot g1.]
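The three numbered steps above can be sketched as follows. The `required_degree` check, the helper names, and the toy input are illustrative assumptions: a real reducer would verify full vertex/edge constraints against the query partition, not just a neighbor count.

```python
from collections import defaultdict

def partition_mapper(edge, pivot):
    # Step 1: key every selected edge by its pivot graph vertex.
    yield (pivot, edge)

def partition_reducer(pivot, edges, required_degree):
    # Step 2: all edges of one candidate partition meet at the same pivot.
    # Step 3: keep the star only if the pivot has enough matched neighbors
    # to cover every neighbor of the pivot query vertex (simplified check).
    neighbors = {v for e in edges for v in e if v != pivot}
    if len(neighbors) >= required_degree:
        yield (pivot, sorted(neighbors))

# Toy edge-selection output: (edge, pivot graph vertex) pairs.
selected = [(("g1", "g2"), "g1"), (("g1", "g3"), "g1"),
            (("g1", "g4"), "g1"), (("g5", "g6"), "g5")]
groups = defaultdict(list)
for edge, pivot in selected:
    for k, v in partition_mapper(edge, pivot):
        groups[k].append(v)
# Suppose the pivot query vertex has 3 neighbors in its query partition.
partitions = [p for k, vals in groups.items()
              for p in partition_reducer(k, vals, 3)]
```

Here g1's star has the three neighbors the query partition demands, so it survives; g5's star does not.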
Step-4: Query Partition Merging

• Merges partitions one after another to form a query match.
• More details in the paper.

Take-away from Steps 1-4 (and for any scalable map-reduce program):
The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph. Hence the algorithm scales well for large graphs and complex queries.
Results

• Graph of 10 million vertices and 50 million edges.
• Complex query of 24 vertices.
• Note that the edge selection time reduces with increasing number of reducers.

[Figure: time (sec, 0-160) vs. number of reducers (100-250) for Edge Selection, Query Partition Matching, and Query Partition Merging.]
In the paper…

• Detailed map-reduce algorithms for sub-graph match and reachability.
• Theoretical analysis for scalability.
• Construction of the synthetic dataset.
• Methodology and more experiments.
• Reachability query: examples, map-reduce algorithm.
• Related work.
Future Work

• Indexing structure for graphs suited for M-R jobs.
• Compare with a Giraph-based approach.
• Better batching strategies.
• The right interface for custom graph algorithms to be plugged in, while WOOster provides automatic batching.
• More graph mining algorithms implemented.
Questions / Comments