A simple presentation based on the paper [ARIDHI 2013]
Distributed Graph Mining
Presented By
Sayeed Mahmud
Motivation
• The reason Big Data technology is here
– To make it feasible to process data that is impossible or overwhelming to handle with our existing tools
• Some graph databases may be too big for a single machine
– Easier for a distributed system, which shares the load
• A graph database may itself be scattered around the globe
– e.g. Google search records
Distributed Graph Mining
• Partition-based
• Divide the problem into independent sub-problems
– Each node of the system can process its sub-problem independently
– Parallel processing
– Speeds up computation
– Enhances the scalability of solutions
Techniques
• MRPF
• MapReduce
– We are mainly interested in this
MapReduce
• A programming model for distributed platforms
• Proposed by Google
• Abundant open-source implementations
– Hadoop
• Divides the problem into sub-problems to be processed on the nodes
– Map
• Combines the processing results
– Reduce
MapReduce Example
• Problem: find the frequency of a word in documents stored across a distributed system.
• Map: each node scans its local documents and emits <word, count> pairs, e.g. <word, 2>, <word, 1> and <word, 2> from three documents.
• Reduce: the partial counts for the same word are summed: <word, 2 + 1 + 2 = 5>.
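A minimal sketch of this word-count flow in sequential Python (illustrative only: the names below are mine and no distributed framework is involved). Each mapper emits <word, count> pairs for its local documents and the reducer sums the partial counts per word.

```python
from collections import Counter
from itertools import chain

def map_phase(document):
    """Mapper: emit <word, count> pairs for one locally stored document."""
    return Counter(document.split()).items()

def reduce_phase(pairs):
    """Reducer: sum the partial counts emitted by all mappers."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

# Three "documents", each held on a different node of the distributed system.
documents = ["word foo word", "word bar", "word baz word"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
print(reduce_phase(mapped)["word"])   # 2 + 1 + 2 = 5
```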
Graph Mining using MapReduce
• Problem: find the frequent sub-graphs of a graph database using the MapReduce programming model (local support 2).
• Map: the graph dataset is partitioned across the distributed system and each node runs gSpan on its own partition, producing local counts (e.g. 3 and 2 for the same sub-graph).
• Reduce: the local counts are combined (3 + 2 = 5) to obtain the global support. A sketch of this decomposition follows.
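A hedged sketch of that decomposition, assuming a helper run_gspan(partition, min_support) that wraps some existing gSpan implementation and returns a dict mapping a canonical sub-graph code to its local count; the helper and all other names here are placeholders, not the paper's code.

```python
from collections import defaultdict

def run_gspan(partition, min_support):
    """Placeholder: call a real gSpan implementation on one partition and
    return {canonical_subgraph_code: local_count} for its frequent sub-graphs."""
    raise NotImplementedError("plug in an actual gSpan binding here")

def map_phase(partition, local_support):
    # Each mapper mines only its own partition of the graph dataset.
    return run_gspan(partition, local_support).items()

def reduce_phase(all_pairs, global_support):
    # Sum the local counts per sub-graph and keep the globally frequent ones.
    totals = defaultdict(int)
    for subgraph, count in all_pairs:
        totals[subgraph] += count
    return {g: c for g, c in totals.items() if c >= global_support}
```

In the toy example above, two mappers emit counts 3 and 2 for the same sub-graph and the reducer reports a global count of 5.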
Data Partitioning
• Performance and load balancing depend on the mapping step
– Termed "partitioning"
– Decides which portion of the graph dataset goes to which node
– Loss of data and load balancing depend directly on the partitioning
• Two approaches
– MRGP (MapReduce Graph Partitioning)
– DGP (Density-based Graph Partitioning)
MRGP
• Followed in common MapReduce problems
• Graphs are assigned to partitions sequentially, in input order
• Simple

Graph Size (KB) Density
G1 1 0.25
G2 2 0.5
G3 2 0.6
G4 1 0.25
G5 2 0.5
G6 2 0.5
G7 2 0.5
G8 2 0.6
G9 2 0.6
G10 2 0.7
G11 3 0.7
G12 3 0.8
4 partitions of 6 KB each:
G1, G2, G3, G4
G5, G6, G7
G8, G9, G10
G11, G12
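A small Python sketch of this sequential assignment (the function and variable names are mine, not the paper's): graphs are appended to the current partition in input order until the size quota is reached. With the table above it reproduces the four 6 KB partitions.

```python
def mrgp_partition(graphs, sizes_kb, num_partitions):
    """Sequentially fill partitions until each reaches its size quota."""
    quota = sum(sizes_kb) / num_partitions
    partitions, current, current_size = [], [], 0
    for graph, size in zip(graphs, sizes_kb):
        current.append(graph)
        current_size += size
        if current_size >= quota and len(partitions) < num_partitions - 1:
            partitions.append(current)
            current, current_size = [], 0
    partitions.append(current)
    return partitions

graphs = [f"G{i}" for i in range(1, 13)]
sizes  = [1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 3, 3]   # KB, from the table above
print(mrgp_partition(graphs, sizes, 4))
# -> [['G1', 'G2', 'G3', 'G4'], ['G5', 'G6', 'G7'], ['G8', 'G9', 'G10'], ['G11', 'G12']]
```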
DGP
• Aims for a balanced distribution
• Uses intermediary buckets
• First, graphs are sorted according to their densities (same dataset as in the MRGP example):
G1 (0.25), G4 (0.25), G2 (0.5), G5 (0.5), G6 (0.5), G7 (0.5), G3 (0.6), G8 (0.6), G9 (0.6), G10 (0.7), G11 (0.7), G12 (0.8)
DGP cont.
• Let's say the bucket count for this demo is 2.
• Next, we distribute the sorted list equally between the two buckets.
Bucket 1: G1, G4, G2, G5, G6, G7
Bucket 2: G3, G8, G9, G10, G11, G12
• To make 4 partitions in total, divide each bucket into 4 non-empty sub-buckets.
DGP cont.
• Now take one sub-bucket from each bucket and combine them to form the final partitions:
Partition 1: G1, G2, G3, G8
Partition 2: G4, G5, G9, G10
Partition 3: G6, G11
Partition 4: G7, G12
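A Python sketch of the whole DGP procedure under the same assumptions (illustrative names; contiguous, near-equal splits): sort by density, split into buckets, split each bucket into non-empty sub-buckets, and build partition i from the i-th sub-bucket of every bucket. Because several graphs share the same density, the exact membership of the first two partitions can differ slightly from the slide's grouping, but the density-balancing idea is the same.

```python
def dgp_partition(graphs, densities, num_buckets, num_partitions):
    """Density-based partitioning: sort by density, bucket, then recombine."""
    ordered = [g for g, _ in sorted(zip(graphs, densities), key=lambda x: x[1])]

    def split(items, k):
        # Split a list into k contiguous, non-empty chunks of near-equal size.
        step, rem = divmod(len(items), k)
        chunks, start = [], 0
        for i in range(k):
            end = start + step + (1 if i < rem else 0)
            chunks.append(items[start:end])
            start = end
        return chunks

    buckets = split(ordered, num_buckets)
    sub_buckets = [split(bucket, num_partitions) for bucket in buckets]
    # Partition i takes the i-th sub-bucket of every bucket.
    return [sum((sb[i] for sb in sub_buckets), []) for i in range(num_partitions)]

graphs    = [f"G{i}" for i in range(1, 13)]
densities = [0.25, 0.5, 0.6, 0.25, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8]
print(dgp_partition(graphs, densities, num_buckets=2, num_partitions=4))
# -> [['G1', 'G4', 'G3', 'G8'], ['G2', 'G5', 'G9', 'G10'], ['G6', 'G11'], ['G7', 'G12']]
```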
Support Count
• There are two types of support counts to be considered in distributed graph mining
– Global Support Count
– Local Support Count
• Global Support is the same as in normal graph mining
• When each mapper runs its individual job, it uses the local support count.
Local Support Count
• Each individual node holds only part of the graph dataset.
• The support threshold therefore needs to be adjusted relative to the original dataset.
• This adjusted threshold is the local support count.
• Local support = (1 - tolerance rate) * global support [the tolerance rate is between 0 and 1]
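A tiny illustration of this adjustment, following the relation above (the numbers are arbitrary, not taken from the paper):

```python
def local_support(global_support, tolerance_rate):
    """Local threshold each mapper uses, per the relation above."""
    assert 0.0 <= tolerance_rate <= 1.0
    return (1.0 - tolerance_rate) * global_support

# A 30% global support relaxed with a tolerance rate of 0.4:
print(local_support(0.30, 0.4))   # ~0.18, i.e. an 18% local threshold
```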
Loss of Data
• Some frequent sub-graphs are lost.
• The loss can be mitigated by choosing a suitable tolerance rate.
– Theoretically, a tolerance rate of 1 means there is no loss of data.
– But it usually implies a higher run time.
Experiment Environment
• Language: Perl
• MapReduce framework: Hadoop (0.20.1)
• Cluster size: 5 nodes
• Node specification:
– Processor: AMD Opteron Quad Core, 2.4 GHz
– Main memory: 4 GB
Data Sets
• Synthetic (sizes ranging from 18 MB to 69 GB)
• Real
– Chemical compound dataset from the National Cancer Institute
Loss Rate for gSpan (support 30%)
Loss Rate for Gaston and FSG (support 30%)
Runtime
Thank You