Master Thesis
Thesis Advisor: Sebastian Schelter, Research Associate
Thesis Supervisor: Prof. Dr. Volker Markl
Large Scale Centrality Measures in
Apache Flink and Apache Giraph
Submitted by
Janani Chakkaradhari
5 September 2014
Centrality measures identify the most central
nodes in a network
Targeted advertising minimizes resources and effort
required for marketing
Centrality measures were used to identify the head of the terrorist
network behind the 9/11 attacks
Krebs, 2002
Different notions of the “most central nodes”
Freeman, 1977
[Figure: three notions of centrality: Degree, Closeness, Betweenness]
Real-world networks are very large and
sparse
Barabási and Oltvai, 2004
Big Data platforms to analyze large networks
Related work on parallel computation of
centrality measures
Kang et al. propose two novel algorithms for computing closeness and
betweenness in the paper “Centralities in Large Networks: Algorithms
and Observations” and implement them in Hadoop.
• Effective Closeness algorithm
- an approximate technique for closeness
• LineRank algorithm
- random walk betweenness
Comparison of parallel computing platforms by
implementing and evaluating centrality measures
• How well does the
programming model of these
data processing platforms fit
Effective Closeness and
LineRank algorithms?
• Evaluating the performance
of each of these two
platforms
[Figure: Closeness & Betweenness of large networks; Parallel data processing platforms; Apache Flink]
Programming model of Apache Giraph & Apache Flink
for iterative graph processing
• Apache Giraph: a vertex-centric model for iterative graph processing
• Apache Flink: offers special iteration operators
[Figure: vertices V1, V2, V3 exchanging messages across superstep i, superstep i+1 and superstep i+2 (Sebastian Schelter, 2012)]
[Figure: Bulk Iteration (initial dataset, iterative function, result) and Delta Iteration (initial solution set, initial workset, result) (Stephan Ewen, 2014)]
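The superstep picture can be made concrete with a small simulation. The following is only a sketch of the vertex-centric idea in plain Python, not Giraph code: the graph, the function name run_supersteps and the choice of "propagate the smallest vertex id" as the computation are all assumptions made for illustration.

```python
def run_supersteps(neighbors, max_supersteps=30):
    """Propagate the smallest vertex id through each connected component.

    neighbors maps a vertex id to the list of its neighbour ids.
    Returns the final vertex values and the number of supersteps executed.
    """
    value = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own value to all neighbours.
    inbox = {v: [] for v in neighbors}
    for v, nbrs in neighbors.items():
        for n in nbrs:
            inbox[n].append(value[v])
    supersteps = 1

    # Later supersteps: a vertex is only active if it received messages,
    # and it only sends again if its value actually changed.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < value[v]:
                value[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(value[v])
        inbox = outbox
        supersteps += 1
    return value, supersteps


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
values, steps = run_supersteps(graph)
print(values)
```

Messages sent in one superstep are only visible in the next one, which is the synchronisation barrier the figure depicts.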
Comparison of parallel computing platforms by
implementing and evaluating centrality measures
• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
1. Computation logic
2. Implementation
• Evaluating the performance of each of these two platforms
Iterative computation of the Effective Closeness algorithm
• Shortest paths between nodes: at each step the algorithm counts the nodes that are newly reached, so step i yields the number of nodes at shortest-path distance i
• Example: 2, 2 and 1 nodes are reached in Step 1, Step 2 and Step 3
• Sum of the shortest paths: (2 x 1) + (2 x 2) + (1 x 3) = 9
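The arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch; the function name sum_of_shortest_paths is chosen here for illustration and does not come from the thesis code.

```python
def sum_of_shortest_paths(newly_reached_per_step):
    """Sum of shortest-path lengths from one node to all others.

    newly_reached_per_step[i] is the number of nodes first reached
    in step i + 1, i.e. the nodes at distance i + 1.
    """
    return sum(count * distance
               for distance, count in enumerate(newly_reached_per_step, start=1))


counts = [2, 2, 1]  # Step 1, Step 2, Step 3 from the slide
total = sum_of_shortest_paths(counts)
print(total)  # 9
```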
Computation of Effective Closeness in Apache Giraph
[Figure: step-by-step computation of Effective Closeness on an example graph in Apache Giraph]
Illustration of Effective Closeness in Apache Flink using Delta iteration (1/4)
• Vertices – initial workset and solution set
• Edges – pairs of source and destination ids

Vertices
vid   bit_string
1     1 0 0 0 0 0
2     0 1 0 0 0 0
3     0 0 1 0 0 0
4     0 0 0 1 0 0
5     0 0 0 0 1 0
6     0 0 0 0 0 1

Edges
src   des
1     2
1     5
2     3
5     4
4     6
2     1
5     1
3     2
4     5
6     4
2     5
3     4
4     3
5     2
Illustration of Effective Closeness in Apache Flink using Delta iteration (2/4)
• Join (⋈) the vertices with the edges on vid = src
• For every edge, emit the bit string of src to des

Emitted records
des   bit_string
2     1 0 0 0 0 0
5     1 0 0 0 0 0
3     0 1 0 0 0 0
4     0 0 0 0 1 0
6     0 0 0 1 0 0
1     0 1 0 0 0 0
1     0 0 0 0 1 0
2     0 0 1 0 0 0
5     0 0 0 1 0 0
4     0 0 0 0 0 1
5     0 1 0 0 0 0
4     0 0 1 0 0 0
3     0 0 0 1 0 0
2     0 0 0 0 1 0
Illustration of Effective Closeness in Apache Flink using Delta iteration (3/4)
• Group (𝛾) the emitted records by vertex id and merge the bit strings with Bit-OR

Grouped records
des   bit_string
1     0 1 0 0 0 0
1     0 0 0 0 1 0
2     0 0 1 0 0 0
2     1 0 0 0 0 0
2     0 0 0 0 1 0
5     0 0 0 1 0 0
5     1 0 0 0 0 0
5     0 1 0 0 0 0
3     0 1 0 0 0 0
3     0 0 0 1 0 0
4     0 0 0 0 0 1
4     0 0 0 0 1 0
4     0 0 1 0 0 0
6     0 0 0 1 0 0

Updated result in current iteration (after Bit-OR)
des   bit_string
1     1 1 0 0 1 0
2     1 1 1 0 1 0
3     0 1 1 1 0 0
4     0 0 1 1 1 1
5     1 1 0 1 1 0
6     0 0 0 1 0 1
Illustration of Effective Closeness in Apache Flink using Delta iteration (4/4)
• Join (⋈) the solution set (previous iteration's result) with the updated result of the current iteration on vid = des
• Termination condition check: if (prev count != current count), emit the updated node into the next workset; otherwise emit nothing

vid   solution set bit_string   prev count   updated bit_string   current count
1     1 0 0 0 0 0               0            1 1 0 0 1 0          2
2     0 1 0 0 0 0               0            1 1 1 0 1 0          3
3     0 0 1 0 0 0               0            0 1 1 1 0 0          2
4     0 0 0 1 0 0               0            0 0 1 1 1 1          3
5     0 0 0 0 1 0               0            1 1 0 1 1 0          3
6     0 0 0 0 0 1               0            0 0 0 1 0 1          1

Next workset (emitted)
vid   bit_string    count
1     1 1 0 0 1 0   2
2     1 1 1 0 1 0   3
3     0 1 1 1 0 0   2
4     0 0 1 1 1 1   3
5     1 1 0 1 1 0   3
6     0 0 0 1 0 1   1
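The four illustration steps can be replayed on the example graph with the sketch below. It is plain Python rather than Flink code, and it uses exact bit strings (stored as integers) as in the illustration instead of an approximate counting structure. The function name effective_closeness and the dictionary-based workset are assumptions made for this sketch.

```python
def effective_closeness(edges, num_vertices):
    """Return {vid: sum of shortest-path lengths} using exact bit strings."""
    # Bit (v - 1) set in bits[u] means: vertex v is known to be reachable from u.
    bits = {v: 1 << (v - 1) for v in range(1, num_vertices + 1)}  # solution set
    closeness = {v: 0 for v in bits}
    workset = dict(bits)  # vertices whose bit string changed in the last step
    step = 0
    while workset:
        step += 1
        # Join workset with edges on vid = src, emit (des, bit_string),
        # group by des and combine with bitwise OR.
        incoming = {}
        for src, des in edges:
            if src in workset:
                incoming[des] = incoming.get(des, 0) | workset[src]
        # Join with the solution set; only changed vertices enter the next workset.
        next_workset = {}
        for v, received in incoming.items():
            updated = bits[v] | received
            newly_reached = bin(updated).count("1") - bin(bits[v]).count("1")
            if newly_reached:
                closeness[v] += step * newly_reached
                next_workset[v] = updated
        bits.update(next_workset)
        workset = next_workset
    return closeness


# Edge list (src, des) of the example graph from the illustration.
example_edges = [(1, 2), (1, 5), (2, 3), (5, 4), (4, 6), (2, 1), (5, 1),
                 (3, 2), (4, 5), (6, 4), (2, 5), (3, 4), (4, 3), (5, 2)]
result = effective_closeness(example_edges, 6)
print(result)
```

The loop ends when the workset is empty, which mirrors the termination check: a vertex whose count did not change emits nothing, so later iterations process less data.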
Illustration of Effective Closeness in Apache Flink using delta iteration
[Figure: delta iteration dataflow: Step Function (JOIN, REDUCE) and Update Function (JOIN)]
Summary of Effective Closeness implementation
• Both implementations reduce the amount of data to be processed
in the successive iterations
• Hence both computing models for finding Effective Closeness exploit the sparse nature of real-world graphs
Idea behind LineRank Algorithm
• Betweenness is computed by finding the importance score of
incident edges of a node
[Figure: example graph G with nodes 1, 2, 3 and edges a, b, c, d, e, and its line graph L(G) whose nodes are the edges a, b, c, d, e (Kang et al., 2011)]
• Power iteration (as in PageRank) gives the eigenvector, i.e. the rank of the nodes in L(G): r = T^k r_0
• Problem: the line graph L(G) is larger than the original graph
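The power iteration r = T^k r_0 can be sketched as follows. The 2 x 2 matrix is a made-up column-stochastic example, not data from the thesis, and renormalising after every step is an assumption of this sketch (it only rescales r).

```python
def power_iteration(T, r0, k):
    """Apply the matrix T to the vector r0 k times, renormalising to sum 1."""
    r = list(r0)
    for _ in range(k):
        r = [sum(T[i][j] * r[j] for j in range(len(r))) for i in range(len(T))]
        total = sum(r)
        r = [x / total for x in r]
    return r


# Hypothetical column-stochastic transition matrix (each column sums to 1).
T_example = [[0.9, 0.5],
             [0.1, 0.5]]
rank = power_iteration(T_example, [0.5, 0.5], 50)
print(rank)
```

For this matrix the iteration approaches the stationary vector (5/6, 1/6), since 0.9 * 5/6 + 0.5 * 1/6 = 5/6.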
Challenges in implementing LineRank in Apache Giraph
• The power iteration needs a two-step matrix-vector multiplication using two
sparse matrices (incoming and outgoing edges)
• The vertex state value in LineRank is an edge score, which conflicts
with the vertex-centric computation model
• How to achieve the two-stage matrix-vector multiplication in Giraph?
v ← L(G) v  ⟺  v2 ← S(G)^T v1,  v3 ← T(G) v2
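The equivalence between the single multiplication with L(G) and the two-stage multiplication can be checked numerically. The sketch below uses the S(G) and T(G) tables from the backup slides and a uniform start vector. The helper names are chosen here for illustration, and the normalisation used by the actual algorithm is left out.

```python
# Rows are the edges a, b, c, d, e; columns are the nodes 1, 2, 3.
S = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]  # source incidence S(G)
T = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]  # target incidence T(G)


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, v):
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


v1 = [0.2] * 5                    # one score per edge
v2 = matvec(transpose(S), v1)     # stage 1: one value per node
v3 = matvec(T, v2)                # stage 2: one score per edge again

L = matmul(T, transpose(S))       # explicit line graph matrix, 5 x 5
direct = matvec(L, v1)            # single multiplication with L(G)
print(v3, direct)
```

The two-stage form only touches the n x m incidence matrices, so the larger m x m matrix L(G) never has to be built.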
Proposed solution for implementing LineRank algorithm in
Apache Giraph (1/2)
• Illustration of “think like a vertex”
• Let us compute the step v2 in the first iteration for our example graph
v2 ← S(G)^T v1,  v3 ← T(G) v2
[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e, with the worked computation of v2]
Proposed solution for implementing LineRank algorithm in
Apache Giraph (2/2)
[Figure: pseudo-code]
• The current state of the vertex is assigned the computation result v2
• The messages that are distributed or exchanged in the iteration are
considered to be the edge scores v3
v2 ← S(G)^T v1,  v3 ← T(G) v2
Illustration of proposed solution to implement LineRank in Apache
Giraph
[Figure: input graph with nodes 1, 2, 3, showing the vertex values and edge scores at superstep 0 and superstep 1]
v2 ← S(G)^T v1,  v3 ← T(G) v2
Implementation of LineRank in Apache Flink
Summary of LineRank implementation
• Two-step matrix-vector multiplication is hard to implement in
Apache Giraph
• Remodeling the LineRank computation in Apache Giraph requires
in-depth knowledge at both the platform level and the algorithmic level
• Programming computationally intensive iterative algorithms with
Apache Flink is simple and flexible
Comparison of parallel computing platforms by implementing and
evaluating centrality measures
• How well does the programming model of these data processing platforms fit the Effective Closeness and LineRank algorithms?
1. Computation logic
2. Implementation
• Evaluating the performance of each of these two platforms
Evaluation – Dataset
Evaluating scalability of Effective Closeness in Apache Giraph &
Apache Flink (Runtime vs Edges)
*Fixed number of parallel tasks
Evaluating scalability of LineRank in Apache Giraph & Apache Flink
(Runtime vs Edges)
*Fixed graph data
LineRank in Flink: Runtime vs Number of cores
LineRank in Giraph: Runtime vs Number of cores
Evaluation – Comparing the performance of Apache
Giraph and Apache Flink
[Figure: runtime comparison for LineRank and Effective Closeness; No. of cores = 15]
Evaluation – Comparing the performance of
Apache Giraph and Apache Flink
• Apache Giraph incorporates hash based aggregations
• Apache Flink uses sorting technique for aggregations
• Efficient mechanism for estimating memory
requirements in Apache Giraph
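The difference between the two aggregation strategies can be sketched in plain Python. This is a conceptual illustration only and does not reflect the internal implementation of either system; the function names are chosen here.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(records):
    """Sum values per key with a hash table: one pass, no ordering needed."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


def sort_aggregate(records):
    """Sum values per key by sorting first, then reducing each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


records = [(2, 1), (1, 5), (2, 3), (1, 1)]
print(hash_aggregate(records), sort_aggregate(records))
```

Both functions return the same totals. The hash variant avoids the sorting cost, while the sort variant needs less random-access memory per key.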
Conclusion
• The implementations of Effective Closeness exploit the sparse nature of real-world graphs
• The programming model of Apache Giraph is not flexible for
computations that involve multi-step matrix-vector
multiplication, whereas Apache Flink is more flexible for these
computations
• Efficient optimizations in Apache Giraph make it perform better
than Apache Flink
• The implementations of these algorithms are intended as a
contribution to the Apache Flink open source community
Future Work
• This work can be extended to evaluate computation-intensive
centrality algorithms on other parallel data processing systems
such as Apache Spark and Distributed GraphLab
References
[1] Kang, U., et al. "Centralities in Large Networks: Algorithms and Observations." SDM, 2011.
[2] Schelter, Sebastian. "Introducing Apache Giraph for Large Scale Graph Processing." slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing, 2012.
[3] Krebs, Valdis E. "Mapping networks of terrorist cells." Connections 24.3 (2002): 43-52.
[4] Ewen, Stephan, et al. "Spinning fast iterative data flows." Proceedings of the VLDB Endowment 5.11 (2012): 1268-1279.
[5] Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph processing." Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[6] Freeman, Linton C. "A set of measures of centrality based on betweenness." Sociometry (1977): 35-41.
[8] Barabasi, Albert-Laszlo, and Zoltan N. Oltvai. "Network biology: understanding the cell's functional organization." Nature Reviews Genetics 5.2 (2004): 101-113.
[9] Ewen, Stephan. "Stratosphere, Next-Gen Data Analytics Platform." Hadoop Summit Europe, 2014.
Backup Slides
Summary of Effective Closeness implementation
• Both implementations reduce the amount of data to be processed
in the successive iterations
• Hence both computing models for finding Effective Closeness exploit the sparse nature of real-world graphs
[Figure: highly connected vs. less connected graph]
LineRank Dataflow in Apache Flink
Final step in the proposed solution to implement LineRank in
Apache Giraph
• Aggregating the computed edge scores
(incoming and outgoing edges)
• The computation of v2 represents the aggregation of
incoming edge scores
[Figure: example graph G with nodes 1, 2, 3 and edges a, b, c, d, e]
Bet(1) = R(a)+R(b)+R(c)+R(d)
Bet(2) = R(a)+R(b)+R(e)
Bet(3) = R(c)+R(d)+R(e)
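The aggregation in the three equations above can be written as a short sketch. The edge endpoints follow the S(G) and T(G) tables of the example graph; the edge scores R are made-up values for illustration, not results from the thesis.

```python
# Edge label -> (source node, target node) for the example graph G.
edge_endpoints = {"a": (1, 2), "b": (2, 1), "c": (1, 3), "d": (3, 1), "e": (2, 3)}

# Hypothetical edge scores R(e); real values come from the power iteration.
R = {"a": 0.25, "b": 0.25, "c": 0.125, "d": 0.25, "e": 0.125}


def betweenness(edge_endpoints, scores):
    """Bet(v) = sum of the scores of all edges incident to v (incoming and outgoing)."""
    bet = {}
    for edge, (src, des) in edge_endpoints.items():
        bet[src] = bet.get(src, 0.0) + scores[edge]
        bet[des] = bet.get(des, 0.0) + scores[edge]
    return bet


bet = betweenness(edge_endpoints, R)
print(bet)
```

With these scores, Bet(1) = R(a)+R(b)+R(c)+R(d) = 0.875, Bet(2) = 0.625 and Bet(3) = 0.5.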
LineRank algorithm computes the random-walk
betweenness without constructing the line graph
• L(G) is decomposed into two sparse matrices
– Source Incidence Matrix S(G) [Outgoing edges]
– Target Incidence Matrix T(G) [Incoming edges]
– L(G) = T(G) S(G)^T
[Figure: example graph with nodes 1, 2, 3 and edges a, b, c, d, e]

S(G)              T(G)
    1  2  3           1  2  3
a   1  0  0       a   0  1  0
b   0  1  0       b   1  0  0
c   1  0  0       c   0  0  1
d   0  0  1       d   1  0  0
e   0  1  0       e   0  0  1

[Figure: worked numeric example of the product T(G) S(G)^T = L(G) for the example graph]
Power iteration in LineRank
Adapted from Kang et al. [1]