Large scale graphs containing O(billion) vertices are becoming increasingly common in various applications. With graphs of such proportion, an efficient querying infrastructure becomes crucial. In this paper, we propose WOOster, a hosted querying infrastructure designed specifically for large graphs. We make two key contributions: a) the design of the WOOster framework; b) scalable map-reduce algorithms for two popular graph queries: sub-graph match and reachability. Our experiments show that the proposed map-reduce algorithms scale well on large synthetic datasets.
WOOster: A Map-Reduce based Platform for Graph Mining
Aravindan Raghuveer, Yahoo! Inc, Bangalore.
Yahoo! Confidential
Introduction
“If you squint the right way, graphs are everywhere” [1]
@ Yahoo!:
• The WOO Graph: all knowledge assimilated from the web.
  http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Industry/WOO_ISWC.pptx
[1] http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
The What and Why?

What?
• Family of graph query algorithms.
• Framework:
  • For graph storage and invoking the query algorithms.
  • Hosted solution on Hadoop.

Why?
• Family of graph query algorithms: present-day algorithms do not scale to billion-edge, billion-vertex graphs.
• Framework:
  • Optimizes storage layout to suit graph query algorithms.
  • Improves throughput of the queries.
Outline of the talk
• MapReduce 101
• Graph mining approaches
• Brief overview of WOOster architecture
• Graph query algorithms in WOOster:
  • Sub-graph matching
  • Reachability query
• Experiments
• Conclusion
Map Reduce 101
Switch to slides from Cloud Computing with MapReduce and Hadoop
www.cs.berkeley.edu/~matei/talks/2009/parlab_bootcamp_clouds.ppt
MapReduce Programming Model
• Data type: key-value records
• Map function:
(Kin, Vin) → list(Kinter, Vinter)
• Reduce function:
(Kinter, list(Vinter)) → list(Kout, Vout)
Example: Word Count
def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
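The word-count mapper/reducer above can be exercised with a minimal in-memory driver that simulates the shuffle & sort phase. This is an illustrative sketch, not Hadoop's API: `run_mapreduce` and its dict-based shuffle are assumptions, and the functions emit via `yield` instead of the framework-provided `output`.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Sum the partial counts for one word.
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle & sort: group every mapped (key, value) pair by key.
    groups = defaultdict(list)
    for record in records:
        for k, v in map_fn(record):
            groups[k].append(v)
    # Reduce each key group independently, as separate reduce tasks would.
    out = {}
    for k in sorted(groups):
        for out_k, out_v in reduce_fn(k, groups[k]):
            out[out_k] = out_v
    return out

counts = run_mapreduce(
    ["the quick brown fox", "the fox ate the mouse", "how now brown cow"],
    mapper, reducer)
```

Running this over the three example lines from the next slide yields the same counts the execution diagram shows.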
Word Count Execution
the quick
brown fox
the fox ate
the mouse
how now
brown cow
MapMap
MapMap
MapMap
Reduce
Reduce
Reduce
Reduce
brown, 2
fox, 2
how, 1
now, 1
the, 3
ate, 1
cow, 1
mouse, 1
quick, 1
the, 1brown, 1
fox, 1
quick, 1
the, 1fox, 1the, 1
how, 1now, 1
brown, 1
ate, 1mouse, 1
cow, 1
Input Map Shuffle & Sort Reduce Output
Graph Mining Approaches: Two Schools

School-1: Invent a new platform.
- Map-reduce is not best suited for graph mining.
- BSP, PRAM models: circa 1980s.
- Pregel from Google [1]; HaLoop.

School-2: Ride on Map-Reduce.
- MR has wide adoption, open source tools, industry support.
- Avoids investing in one more computing infrastructure.
- Apache Giraph: http://incubator.apache.org/giraph/ (BSP on Hadoop).
- Efforts in open source / academia on the same lines:
  • Pegasus, CMU [2]
  • Graph mining in Apache Mahout [3]
  • Raytheon's graph mining [4]

[1] SIGMOD 2010, http://dl.acm.org/citation.cfm?id=1807184
[2] http://www.cs.cmu.edu/~pegasus/
[3] http://www.robust-project.eu/news/robust-project-pushes-large-scale-graph-mining-with-hadoop-apache
[4] http://www.cloudera.com/blog/2010/03/how-raytheon-researchers-are-using-hadoop-to-build-a-scalable-distributed-triple-store/
WOOster Architecture

• User submits a query.
• Planner periodically scans for newly arrived queries.
• Planner creates a M-R plan that re-uses computation / IO across queries (batching).
• Executor executes the M-R plan.
• Result is notified to the user (hosted solution).

[Figure: components — WOOster Web UI & WebService APIs, Planner, Executor, Grid, Jobs D/B, Graph Indices, WOO Graph.]
The Sub-Graph Match Query

Why sub-graph match (exact graph isomorphism)?
• A popular and expressive graph query, useful to mine patterns.
• To our knowledge, no large scale algorithm exists that operates on a billion-vertex graph.

Find all instances of query Q in graph G.
• Vertices have attributes (ex: age:31).
• Edges have relationship labels.
• Vertices and edges have constraints (ex: age < 40).

[Figure: notation — query vertex, graph vertex, matched graph vertex.]
Overview of the Solution

Step-0. Data Layout on HDFS
Step-1. Query Graph Partitioning
Step-2. Edge Selection
Step-3. Query Partition Matching
Step-4. Query Partition Merging
Data Layout on HDFS

• How to store a large scale graph? An adjacency-list-like solution:
  • Each row/line has information about a vertex:
    • Vertex attributes.
    • Vertex neighbors and the labels associated with each edge.

Implications:
• Enables early pruning of non-matching edges and vertices.
• Each vertex has information about itself and its immediate neighbors only.
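To make the one-vertex-per-line layout concrete, here is a minimal sketch of serializing and parsing such a row. The tab/comma/colon field format and the helper names `format_vertex_row` / `parse_vertex_row` are hypothetical illustrations, not WOOster's actual on-disk format.

```python
def format_vertex_row(vid, attrs, neighbors):
    # One HDFS line per vertex (hypothetical format):
    #   vid <TAB> k=v,... <TAB> neighbor:edge_label,...
    attr_str = ",".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    nbr_str = ",".join(f"{n}:{label}" for n, label in neighbors)
    return f"{vid}\t{attr_str}\t{nbr_str}"

def parse_vertex_row(line):
    # Inverse of format_vertex_row: recover id, attributes, labeled neighbors.
    vid, attr_str, nbr_str = line.split("\t")
    attrs = dict(kv.split("=") for kv in attr_str.split(",")) if attr_str else {}
    neighbors = [tuple(n.split(":")) for n in nbr_str.split(",")] if nbr_str else []
    return vid, attrs, neighbors

row = format_vertex_row("g1", {"age": "31"},
                        [("g2", "friend"), ("g3", "colleague")])
```

Because a mapper sees one such line at a time, it can check vertex- and edge-level constraints locally, which is what enables the early pruning mentioned above.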
Step-1: Query Graph Partitioning

Why? Parallelized solving of independent sub-problems.

How? Find the minimum number of partitions such that the diameter of each partition is 2.

Intuition:
• In a spanning tree of diameter 2, there is one vertex connected to all other vertices: the pivot vertex.
• We will use this property in steps 2 and 3.

[Figure: query graph partitioned into diameter-2 stars around pivot vertices.]
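The diameter-2 partitioning step can be approximated with a greedy cover by pivot-centered stars. This `partition_query` helper is an illustrative heuristic under an assumed adjacency-dict input; it does not guarantee the minimum number of partitions that the slide asks for.

```python
def partition_query(adj):
    # Greedy cover of the query graph by diameter-2 stars.
    # adj: dict vertex -> set of neighbors.
    # Returns a list of (pivot, star) pairs, where each star is the pivot's
    # closed neighborhood (pivot plus its neighbors), so its diameter <= 2.
    uncovered = set(adj)
    partitions = []
    while uncovered:
        # Pick the pivot whose closed neighborhood covers the most
        # still-uncovered vertices.
        pivot = max(adj, key=lambda v: len(({v} | adj[v]) & uncovered))
        star = {pivot} | adj[pivot]
        partitions.append((pivot, star))
        uncovered -= star
    return partitions

# Toy example: a path a-b-c-d-e is covered by two stars.
parts = partition_query({
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d"},
})
```

Stars may overlap; what matters for the later steps is that every query vertex lands in some partition whose pivot reaches it in one hop.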
Step-2: Edge Selection

• What: Select a subset of edges from G that match at least one edge in Q.
• How (g1 is the current vertex in the mapper):
  1. The mapper emits all incident edges if the vertex and edge constraints are met.
  2a. Edge g1-g2 is emitted when g1 is mapped to a query vertex.
  2b. For every neighbor of q1, there must exist a corresponding neighbor of g1.
  3. Edge g1-g2 is also emitted from g2's mapper.
  4. The reducer emits an edge only if such a pair is found.

[Figure: map logic over vertices g1..g4 feeding reduce logic, which emits the matched edge g1-g2.]
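The edge-selection logic above can be sketched as a mapper/reducer pair with an in-memory shuffle. Everything here is an illustrative assumption rather than the paper's exact algorithm: the query constraints (`QUERY_VERTEX_OK`, `QUERY_EDGES`), the toy graph, and the "both endpoints must emit the edge" reducer check.

```python
from collections import defaultdict

# Hypothetical query: per-query-vertex constraints and labeled query edges.
QUERY_VERTEX_OK = {
    "q1": lambda attrs: int(attrs.get("age", 0)) < 40,
    "q2": lambda attrs: attrs.get("role") == "eng",
}
QUERY_EDGES = {("q1", "q2"): "reports_to"}

def edge_selection_mapper(vid, attrs, neighbors):
    # Emit each incident edge, keyed by the (undirected) edge, whenever this
    # endpoint satisfies some query vertex and the edge label matches a
    # query edge (steps 1, 2a, 3 on the slide).
    for qv, ok in QUERY_VERTEX_OK.items():
        if not ok(attrs):
            continue
        for nbr, label in neighbors:
            for (qa, qb), qlabel in QUERY_EDGES.items():
                if qv in (qa, qb) and label == qlabel:
                    yield (tuple(sorted((vid, nbr))), (vid, qv))

def edge_selection_reducer(edge, endorsements):
    # Keep the edge only if both endpoints emitted it (step 4 on the slide).
    if len({vid for vid, _ in endorsements}) == 2:
        yield edge

# Toy graph in the adjacency-list layout: g3 fails every constraint,
# so the edge g2-g3 is endorsed by only one endpoint and gets dropped.
GRAPH = {
    "g1": ({"age": "31"}, [("g2", "reports_to")]),
    "g2": ({"role": "eng"}, [("g1", "reports_to"), ("g3", "reports_to")]),
    "g3": ({"age": "50"}, [("g2", "reports_to")]),
}
groups = defaultdict(list)
for vid, (attrs, nbrs) in GRAPH.items():
    for key, val in edge_selection_mapper(vid, attrs, nbrs):
        groups[key].append(val)
selected = [e for k, vals in groups.items()
            for e in edge_selection_reducer(k, vals)]
```

The key design point matches the take-away slide: the shuffle key is the candidate edge itself, so the number of keys tracks the number of candidate matches, not the size of G.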
Step-3: Query Partition Matching

Edge selection output:
• Associates a graph vertex with the possible query vertices it could map to.
• Associates the graph vertex with its "pivot" graph vertex.
• A pivot graph vertex is a graph vertex mapped to a pivot query vertex: g1 in this example.

Map-reduce logic:
  1. The mapper emits the pivot graph vertex as key and the edge as value.
  2. The reducer receives all edges with the same pivot graph vertex.
  3. The reducer forms the partition.

[Figure: edges g1-g2, g1-g3, g1-g4 flow from map logic to reduce logic, which assembles the partition around pivot g1.]
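The three numbered steps above can be sketched as follows. The `required_degree` check, the helper names, and the toy input are illustrative assumptions: a real reducer would verify full vertex/edge constraints against the query partition, not just a neighbor count.

```python
from collections import defaultdict

def partition_mapper(edge, pivot):
    # Step 1: key every selected edge by its pivot graph vertex.
    yield (pivot, edge)

def partition_reducer(pivot, edges, required_degree):
    # Step 2: all edges of one candidate partition meet at the same pivot.
    # Step 3: keep the star only if the pivot has enough matched neighbors
    # to cover every neighbor of the pivot query vertex (simplified check).
    neighbors = {v for e in edges for v in e if v != pivot}
    if len(neighbors) >= required_degree:
        yield (pivot, sorted(neighbors))

# Toy edge-selection output: (edge, pivot graph vertex) pairs.
selected = [(("g1", "g2"), "g1"), (("g1", "g3"), "g1"),
            (("g1", "g4"), "g1"), (("g5", "g6"), "g5")]
groups = defaultdict(list)
for edge, pivot in selected:
    for k, v in partition_mapper(edge, pivot):
        groups[k].append(v)
# Suppose the pivot query vertex has 3 neighbors in its query partition.
partitions = [p for k, vals in groups.items()
              for p in partition_reducer(k, vals, 3)]
```

Here g1's star has the three neighbors the query partition demands, so it survives; g5's star does not.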
Step-4: Query Partition Merging

• Merges partitions one after another to form a query match.
• More details in the paper.

Take-away from Steps 1-4 (and for any scalable map-reduce program):
The mapper/reducer keys are chosen such that the number of keys is proportional to the number of matches of query Q in the graph. Hence the algorithm scales well for large graphs and complex queries.
Results

• Graph of 10 million vertices and 50 million edges.
• Complex query of 24 vertices.
• Note that the edge selection time reduces with increasing number of reducers.

[Figure: time (sec, 0-160) vs. number of reducers (100-250) for Edge Selection, Query Partition Matching, and Query Partition Merging.]
In the paper…

• Detailed map-reduce algorithms for sub-graph match and reachability.
• Theoretical analysis for scalability.
• Construction of the synthetic dataset.
• Methodology and more experiments.
• Reachability query: examples, map-reduce algorithm.
• Related work.
Future Work

• Indexing structure for graphs suited for M-R jobs.
• Compare with a Giraph-based approach.
• Better batching strategies.
• The right interface for custom graph algorithms to be plugged in, while WOOster provides automatic batching.
• More graph mining algorithms implemented.
Questions / Comments