Distributed Graph Mining
Presented by Sayeed Mahmud


A simple presentation based on the paper [ARIDHI 2013].


Page 1: Distributed graph mining

Distributed Graph Mining

Presented By

Sayeed Mahmud

Page 2: Distributed graph mining

Motivation

Page 3: Distributed graph mining

Motivation

• The reason Big Data platforms exist

– To process data that would be impossible or overwhelming to handle with our existing tools.

• Some graph databases may be too big for a single machine

– Easier for a distributed system, which shares the load

• The graph database may itself be scattered around the globe

– e.g. Google search records.

Page 4: Distributed graph mining

Distributed Graph Mining

• Partition-based

• Divides the problem into independent sub-problems

– Each node of the system can process its sub-problem independently

– Parallel processing

– Speeds up computation

– Enhances the scalability of solutions

Page 5: Distributed graph mining

Techniques

• MRPF

• MapReduce

– We are mainly interested in this

Page 6: Distributed graph mining

MapReduce

• A programming model for distributed platforms.

• Proposed by Google.

• Abundant open-source implementations
– Hadoop

• Divides the problem into sub-problems to be processed on the nodes
– Mapping

• Combines the processing results
– Reduce

Page 7: Distributed graph mining

MapReduce Example

• Problem: Find the frequency of a word in documents available on a distributed system.

• Map: each node scans its share of the documents and emits <word, count> pairs, e.g. <word, 2>, <word, 1>, <word, 2>.

• Reduce: the partial counts are combined into <word, 2 + 1 + 2 = 5>.
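The same job, simulated end to end in a few lines of Python (illustrative only; the experiments later in this deck used Perl scripts on Hadoop, so the names map_phase and reduce_phase are ours, not the paper's):

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: count the words in one document split and emit <word, count> pairs."""
    counts = defaultdict(int)
    for word in document.split():
        counts[word] += 1
    return list(counts.items())

def reduce_phase(pairs):
    """Reducer: sum the partial counts emitted by all mappers, per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Three document splits handled by three mappers, as in the figure:
# they emit <word, 2>, <word, 1> and <word, 2>, which reduce to <word, 5>.
documents = ["word foo word", "word bar", "word baz word"]
emitted = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(emitted)["word"])   # 2 + 1 + 2 = 5
```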

Page 8: Distributed graph mining

Graph Mining Using MapReduce

• Problem: Find the frequent sub-graphs of a graph database in the MapReduce programming model (local support 2).

• Map: the graph dataset is partitioned across the distributed system and each node runs gSpan on its partition, emitting local support counts (e.g. 3 and 2 for the same sub-graph).

• Reduce: the local counts are combined into the global support, 3 + 2 = 5.
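A minimal sketch of these two phases, with the serial miner passed in as a parameter (a real gSpan implementation would go there; the fake_miner below is only a stand-in so the snippet runs as written):

```python
from collections import defaultdict

def map_phase(partition, local_support, mine):
    """Mapper: run the serial miner (e.g. gSpan) on one partition and
    emit <subgraph, local support> pairs for its locally frequent subgraphs."""
    return list(mine(partition, local_support).items())

def reduce_phase(emitted_pairs, global_support):
    """Reducer: sum the per-partition supports and keep the subgraphs whose
    aggregated support reaches the global threshold."""
    totals = defaultdict(int)
    for subgraph, count in emitted_pairs:
        totals[subgraph] += count
    return {sg: c for sg, c in totals.items() if c >= global_support}

# Toy run matching the slide: two mappers report supports 3 and 2 for the
# same sub-graph, and the reducer combines them into a global support of 5.
fake_miner = lambda partition, s: dict(partition)      # stand-in for gSpan
pairs  = map_phase([("A-B", 3)], 2, fake_miner)
pairs += map_phase([("A-B", 2)], 2, fake_miner)
print(reduce_phase(pairs, 2))                           # {'A-B': 5}
```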

Page 9: Distributed graph mining

Data Partitioning

• Performance and load balancing depend on the mapping step

– Termed "partitioning"

– It decides which portion of the graph dataset goes to which node

– Loss of data and load balancing depend directly on the partitioning.

• Two approaches

– MRGP (MapReduce Graph Partitioning)

– DGP (Density-Based Graph Partitioning)

Page 10: Distributed graph mining

MRGP

• The partitioning followed in common MapReduce problems.

• Graphs are assigned to partitions sequentially.

• Simple.

Graph   Size (KB)   Density
G1      1           0.25
G2      2           0.5
G3      2           0.6
G4      1           0.25
G5      2           0.5
G6      2           0.5
G7      2           0.5
G8      2           0.6
G9      2           0.6
G10     2           0.7
G11     3           0.7
G12     3           0.8

4 partitions of 6 KB each:

– G1, G2, G3, G4
– G5, G6, G7
– G8, G9, G10
– G11, G12
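A short sketch of this sequential, size-based chunking (our reconstruction of how the 6 KB partitions above are formed; the exact boundary rule in [ARIDHI 2013] may differ in detail):

```python
def mrgp_partition(graphs, chunk_size_kb):
    """MRGP-style partitioning: walk the dataset in its original order and
    start a new partition whenever the current one would exceed the chunk size."""
    partitions, current, current_size = [], [], 0
    for name, size_kb, _density in graphs:
        if current and current_size + size_kb > chunk_size_kb:
            partitions.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size_kb
    if current:
        partitions.append(current)
    return partitions

# The 12 graphs from the table: (name, size in KB, density).
dataset = [("G1", 1, 0.25), ("G2", 2, 0.5), ("G3", 2, 0.6), ("G4", 1, 0.25),
           ("G5", 2, 0.5), ("G6", 2, 0.5), ("G7", 2, 0.5), ("G8", 2, 0.6),
           ("G9", 2, 0.6), ("G10", 2, 0.7), ("G11", 3, 0.7), ("G12", 3, 0.8)]

print(mrgp_partition(dataset, 6))
# [['G1', 'G2', 'G3', 'G4'], ['G5', 'G6', 'G7'], ['G8', 'G9', 'G10'], ['G11', 'G12']]
```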

Page 11: Distributed graph mining

DGP

• Aims for a balanced distribution.

• Uses intermediary buckets.

• First, the graphs are sorted by density (same graph dataset as on the MRGP slide).

Sorted order: G1 (0.25), G4 (0.25), G2 (0.5), G5 (0.5), G6 (0.5), G7 (0.5), G3 (0.6), G8 (0.6), G9 (0.6), G10 (0.7), G11 (0.7), G12 (0.8)

Page 12: Distributed graph mining

DGP cont.

• Let's say the bucket count for this demo is 2.

• Next, we distribute the sorted list equally over the two buckets.

Bucket 1: G1, G4, G2, G5, G6, G7
Bucket 2: G3, G8, G9, G10, G11, G12

• To make 4 partitions in total, divide each bucket into 4 non-empty sub-buckets.

Page 13: Distributed graph mining

DGP cont.

• Now take one sub-bucket from each bucket and combine them to form the final partitions.

Bucket 1 sub-buckets: {G1, G2}, {G4, G5}, {G6}, {G7}
Bucket 2 sub-buckets: {G3, G8}, {G9, G10}, {G11}, {G12}

Final partitions:

– G1, G2, G3, G8
– G4, G5, G9, G10
– G6, G11
– G7, G12
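A compact sketch of the whole DGP procedure on the same 12 graphs. The rule for cutting a bucket into sub-buckets is our assumption (contiguous chunks of the density-sorted bucket), so graphs of equal size and density may land in a different partition than on the slide, but the partition sizes and density mix come out the same:

```python
def dgp_partition(graphs, num_buckets, num_partitions):
    """DGP sketch: sort by density, cut the sorted list into buckets,
    split every bucket into sub-buckets, then build partition i from
    sub-bucket i of every bucket."""
    ordered = sorted(graphs, key=lambda g: g[2])           # g = (name, size_kb, density)
    bucket_len = len(ordered) // num_buckets
    buckets = [ordered[i * bucket_len:(i + 1) * bucket_len] for i in range(num_buckets)]

    def split(bucket, parts):
        """Cut one bucket into `parts` contiguous sub-buckets (all non-empty
        here, since each bucket holds at least `parts` graphs)."""
        base, extra = divmod(len(bucket), parts)
        sub, start = [], 0
        for i in range(parts):
            end = start + base + (1 if i < extra else 0)
            sub.append(bucket[start:end])
            start = end
        return sub

    sub_buckets = [split(b, num_partitions) for b in buckets]
    return [[g[0] for b in sub_buckets for g in b[i]] for i in range(num_partitions)]

# Same 12 graphs as in the MRGP sketch: (name, size in KB, density).
dataset = [("G1", 1, 0.25), ("G2", 2, 0.5), ("G3", 2, 0.6), ("G4", 1, 0.25),
           ("G5", 2, 0.5), ("G6", 2, 0.5), ("G7", 2, 0.5), ("G8", 2, 0.6),
           ("G9", 2, 0.6), ("G10", 2, 0.7), ("G11", 3, 0.7), ("G12", 3, 0.8)]

print(dgp_partition(dataset, num_buckets=2, num_partitions=4))
# [['G1', 'G4', 'G3', 'G8'], ['G2', 'G5', 'G9', 'G10'], ['G6', 'G11'], ['G7', 'G12']]
```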

Page 14: Distributed graph mining

Support Count

• There are two types of support counts to be considered in distributed graph mining

– Global Support Count

– Local Support Count

• Global support is the same as in ordinary (single-machine) graph mining.

• When a mapper runs its individual job, it uses the local support count.

Page 15: Distributed graph mining

Local Support Count

• Each individual node holds only a partial graph dataset.

• The support count therefore needs to be adjusted relative to the original dataset.

• This adjusted support count is the local support count.

• Local Support = (1 − Tolerance Rate) × Global Support, with the tolerance rate between 0 and 1 (a tolerance rate of 0 keeps the global threshold unchanged; higher values lower the per-partition threshold).
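A tiny worked example of this adjustment (the function name and the numbers are ours, purely illustrative):

```python
def local_support_threshold(global_support, tolerance_rate):
    """Relative support threshold each mapper applies to its own partition.
    tolerance_rate = 0 keeps the global threshold; tolerance_rate = 1
    accepts every subgraph found locally."""
    return (1 - tolerance_rate) * global_support

# A 30% global threshold with a tolerance rate of 0.5 becomes a 15% local threshold.
print(local_support_threshold(0.30, 0.5))   # 0.15
```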

Page 16: Distributed graph mining

Loss of Data

• Some frequent sub-graphs are lost in the distributed setting.

• The loss can be mitigated by choosing a suitable tolerance rate.

– Theoretically, a tolerance rate of 1 means there is no loss of data.

– But it usually comes with a higher runtime.
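The result slides below report a loss rate. A minimal sketch of how such a rate could be computed, assuming it is the share of the frequent sub-graphs found by a single-machine run that the distributed run misses (this definition is our assumption; [ARIDHI 2013] gives the exact one):

```python
def loss_rate(centralized_result, distributed_result):
    """Assumed definition: fraction of the centrally mined frequent subgraphs
    that the distributed run fails to report."""
    missed = set(centralized_result) - set(distributed_result)
    return len(missed) / len(centralized_result)

# If the centralized run finds 3 frequent subgraphs and the distributed run
# reports only 2 of them, one third of the patterns is lost.
print(loss_rate({"A-B", "B-C", "A-C"}, {"A-B", "B-C"}))   # 0.333...
```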

Page 17: Distributed graph mining

Experiment Environment

• Language: Perl

• MapReduce framework: Hadoop (0.20.1)

• Cluster size: 5 nodes

• Node specification:

– Processor: AMD Opteron quad-core, 2.4 GHz

– 4 GB main memory

Page 18: Distributed graph mining

Data Sets

• Synthetic (sizes ranging from 18 MB to 69 GB)

• Real

– A chemical compound dataset from the National Cancer Institute.

Page 19: Distributed graph mining

Loss Rate for gSpan (Support 30%)

Page 20: Distributed graph mining

Loss Rate for Gaston and FSG (Support 30%)

Page 21: Distributed graph mining

Runtime

Page 22: Distributed graph mining

Thank You