Time-Evolving Graph
Processing at Scale
Anand Iyer#, Li Erran Li+,
Tathagata Das*, Ion Stoica#*
#UC Berkeley +Uber Technologies *Databricks
Motivation
Dynamically evolving graphs prevalent in many domains
– Social networks (e.g., Twitter, Facebook)
– Communication networks (e.g. cellular networks)
– Internet-of-Things
Motivation
Many applications need to leverage the evolution characteristics
– Product recommendations
– Network troubleshooting
– Real-time ad placement
Motivation
Lots of interest in distributed graph processing…
– GraphX, Giraph, PowerGraph, GraphLab, GraphChi, Chaos, …
…but existing graph processing engines offer little support for dynamic graphs
– Some specialized systems exist (e.g., Kineograph, Chronos), but they are not generic enough
Challenges
• Consistent & fault-tolerant snapshot generation
• Co-ordinate snapshot generation and computation
• Window operations on snapshots
• Mix data and graph parallel computations
Existing solutions do not satisfy all the requirements
GraphTau
– Abstraction
– Computational Model
[Figure: snapshots of an evolving graph over vertices a to f at times t1, t2, t3]
GraphTau represents time-evolving graphs as a series of consistent graph snapshots
New Computational Models
Two new models for processing time-evolving graphs
– Pause-Shift-Resume
– Online Rectification
Pause-Shift-Resume
Many graph algorithms are robust to changes in the graph before convergence
– E.g., PageRank: pause iterating, update the snapshot, continue iterating
Pause-Shift-Resume
[Figure: PageRank values on a six-vertex graph (A to F) before and after the transition to a second snapshot. In each pair (X, Y), X is the 10-iteration PageRank and Y is the 23-iteration PageRank.]
After 11 iterations on graph 2, both converge to 3-digit precision
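The pause-shift-resume idea can be sketched in plain Scala, without Spark. This is a minimal illustration, not GraphTau's implementation: the `Snapshot` type, the `step` and `run` helpers, the damping factor 0.85, and the fixed per-snapshot iteration budget are all assumptions made for the example.

```scala
object PauseShiftResumeSketch {
  // Hypothetical snapshot type: a directed graph as an adjacency map.
  type Snapshot = Map[String, Seq[String]]

  // One synchronous PageRank iteration with damping factor d.
  // Vertices missing from `ranks` (newly arrived ones) start at 1.0.
  def step(g: Snapshot, ranks: Map[String, Double], d: Double = 0.85): Map[String, Double] = {
    val vertices = g.keySet ++ g.values.flatten
    val contribs = g.toSeq.flatMap { case (src, dsts) =>
      // Each vertex splits its current rank evenly across its out-edges.
      dsts.map(dst => dst -> (ranks.getOrElse(src, 1.0) / dsts.size))
    }.groupMapReduce(_._1)(_._2)(_ + _)
    vertices.map(v => v -> ((1 - d) + d * contribs.getOrElse(v, 0.0))).toMap
  }

  // Pause after a fixed budget of iterations, shift to the next snapshot,
  // and resume from the ranks computed so far instead of restarting.
  def run(snapshots: Seq[Snapshot], itersPerSnapshot: Int): Map[String, Double] =
    snapshots.foldLeft(Map.empty[String, Double]) { (ranks, g) =>
      (1 to itersPerSnapshot).foldLeft(ranks)((r, _) => step(g, r))
    }
}
```

Calling `PauseShiftResumeSketch.run(Seq(g1, g2), 10)` iterates 10 times on `g1`, then carries those ranks into `g2`. Given enough iterations on the last snapshot, the resumed run and a from-scratch run reach the same fixed point; the resumed run simply starts closer to it.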
Online Rectification Model
Many graph algorithms are not resilient to changes
– Need to keep per-vertex state to handle changes
– E.g., connected components on an evolving graph can be computed if each vertex stores its component
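A minimal sketch of the per-vertex-state idea for connected components, in plain Scala. It handles edge additions only (deletions would need more bookkeeping) and is not GraphTau's implementation; the `Components` type and the `addEdge` / `addEdges` helpers are names assumed for the example.

```scala
object OnlineRectificationSketch {
  // Per-vertex state: the component label each vertex currently stores.
  // By construction a label is the smallest vertex id in that component.
  type Components = Map[Int, Int]

  // Rectify the stored labels for one newly added undirected edge (u, v).
  def addEdge(comp: Components, u: Int, v: Int): Components = {
    val cu = comp.getOrElse(u, u) // unseen vertices start in their own component
    val cv = comp.getOrElse(v, v)
    val keep = math.min(cu, cv)
    val drop = math.max(cu, cv)
    val withEndpoints = comp + (u -> cu) + (v -> cv)
    if (keep == drop) withEndpoints // already in the same component
    else withEndpoints.map { case (x, c) => x -> (if (c == drop) keep else c) }
  }

  // Apply a batch of new edges (e.g. the delta between two snapshots).
  def addEdges(comp: Components, edges: Seq[(Int, Int)]): Components =
    edges.foldLeft(comp) { case (c, (u, v)) => addEdge(c, u, v) }
}
```

Only the vertices of the merged component change their label; the stored state for every other vertex is reused as-is when the next snapshot arrives.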
Abstraction
GraphStream[V,E]: Represents a series of Graph[V,E] snapshots where V = vertices, E = edges
[Figure: a GraphStream[V,E] made up of Graph[V,E] snapshots at T = 1, T = 2, T = 3, T = 4]
Operations: transform
class GraphStream { def transform(func: Graph => Graph): GraphStream }
func: User-provided function that does bulk operations on vertices and edges to create a new graph; allows aggregations over vertices and edges
transform: Applies func to each snapshot Graph in a GraphStream
Operations: transform
[Figure: func is applied to each snapshot (T = 1 to T = 4) of the original GraphStream to produce the transformed GraphStream]
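A toy in-memory model of transform, to make the per-snapshot semantics concrete. The `Graph` and `GraphStream` case classes here are hypothetical stand-ins, not the Spark-backed types from the talk, and `outDegrees` is just one example of a user-provided func.

```scala
object TransformSketch {
  // Hypothetical stand-ins: vertex attributes by id, edges as (src, dst, attr).
  final case class Graph[V, E](vertices: Map[Long, V], edges: Seq[(Long, Long, E)])

  final case class GraphStream[V, E](snapshots: Seq[Graph[V, E]]) {
    // Apply func independently to every snapshot, preserving snapshot order.
    def transform[V2, E2](func: Graph[V, E] => Graph[V2, E2]): GraphStream[V2, E2] =
      GraphStream(snapshots.map(func))
  }

  // Example func: replace each vertex attribute with the vertex's out-degree.
  def outDegrees[V, E](g: Graph[V, E]): Graph[Int, E] = {
    val deg = g.edges.groupMapReduce(_._1)(_ => 1)(_ + _)
    Graph(g.vertices.map { case (id, _) => id -> deg.getOrElse(id, 0) }, g.edges)
  }
}
```

With these definitions, `stream.transform(g => TransformSketch.outDegrees(g))` yields a new stream with one degree-annotated graph per input snapshot.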
Operations: sliding windows
class GraphStream { def mergeWindows( aggregationFuncs, windowLength, slidingInterval): GraphStream }
[Figure: snapshots T = 1 to T = 4 of the original GraphStream are combined by aggregationFuncs into a windowed GraphStream; the window length and the sliding interval are marked]
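A rough sketch of the windowing semantics over an in-memory sequence of snapshots. Everything here is an assumption for illustration: the `Graph` type, the `unionSum` aggregation, counting the window length and sliding interval in snapshots rather than time, and dropping incomplete trailing windows. The slide does not specify the shape of aggregationFuncs, so a single `(Graph, Graph) => Graph` merge function is used instead.

```scala
object WindowSketch {
  // Hypothetical snapshot type: vertex ids plus edge weights keyed by (src, dst).
  final case class Graph(vertices: Set[Long], edges: Map[(Long, Long), Int])

  // One possible aggregation: union the vertices and sum the weights of
  // edges that appear in both snapshots.
  def unionSum(a: Graph, b: Graph): Graph = {
    val edges = (a.edges.toSeq ++ b.edges.toSeq).groupMapReduce(_._1)(_._2)(_ + _)
    Graph(a.vertices ++ b.vertices, edges)
  }

  // Merge each run of windowLength consecutive snapshots into one graph,
  // advancing slidingInterval snapshots at a time. Incomplete trailing
  // windows are dropped.
  def mergeWindows(snapshots: Seq[Graph], merge: (Graph, Graph) => Graph,
                   windowLength: Int, slidingInterval: Int): Seq[Graph] =
    snapshots.sliding(windowLength, slidingInterval)
      .filter(_.size == windowLength)
      .map(_.reduce(merge))
      .toSeq
}
```

For four snapshots, a window length of 2 and a sliding interval of 1 give three windowed graphs; a sliding interval of 2 gives two non-overlapping ones.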
Differential Computation:
Pause-Shift-Resume and Online Rectification are incorporated into an efficient Pregel-style computation implementation. This is effectively an extension of the Pregel iterative processing model to time-evolving graphs.
Operations: StreamingBSP
class GraphStream { def StreamingBSP(..., iterationFunc, ...): GraphStream }
– T = 1: Apply Pregel iterationFunc until next snapshot is available
– T = 2, T = 3: Combine previous results with new snapshot, continue iterating
– Continue until convergence
PageRank using StreamingBSP
PageRank computation on streaming graphs easily achieved by a simple call
Faster convergence than running PageRank from scratch on every snapshot
Operations: updateLocalState
class GraphStream {
def updateLocalState (stateUpdateFunc, initialState): LocalStateStream }
[Figure: starting from initialState, stateUpdateFunc is applied at each GraphStream snapshot (T = 1, T = 2, T = 3) to produce the next state]
Keep updating non-graph "state" as graph evolves
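The state-threading pattern can be sketched as a scan over the snapshots. This is an illustration only: the `Graph` type is a hypothetical stand-in, the result is a plain `Seq[S]` rather than a LocalStateStream, and the argument order of `stateUpdateFunc` is an assumption.

```scala
object LocalStateSketch {
  // Hypothetical snapshot type: just an edge list.
  final case class Graph(edges: Seq[(Long, Long)])

  // Thread a non-graph state through the snapshots in order, emitting the
  // updated state after each snapshot (the initial state itself is not emitted).
  def updateLocalState[S](snapshots: Seq[Graph],
                          stateUpdateFunc: (S, Graph) => S,
                          initialState: S): Seq[S] =
    snapshots.scanLeft(initialState)(stateUpdateFunc).tail
}
```

For example, `updateLocalState(snaps, (m: Int, g: Graph) => math.max(m, g.edges.size), 0)` tracks the largest edge count seen so far as the graph evolves.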
Implementation
Implemented on the Apache Spark platform
– Spark Streaming: stream processing engine
– GraphX: graph processing engine
GraphTau implemented by combining Spark Streaming and GraphX
– Novel optimizations to implement the GraphStream abstraction
Other Benefits
Spark Streaming and GraphX are built on Spark's RDDs
– RDDs guarantee fault-tolerance and consistency of datasets
– In addition, they allow mixing data and graph parallel computations in GraphStream
Preliminary Results
• Algorithms:
– PageRank
– Connected Components
• Setup: 16 Amazon EC2 instances
• Datasets:
– Twitter follow graph: 41M vertices, ~1.5B edges
– Live LTE network: 2M vertices, variable edges
Preliminary Results: PageRank
Dataset: Twitter graph broken into parts:
– 1 part = full graph
– 5 parts = 20% of graph in each part
Comparison:
– Time to complete PageRank in GraphX on full graph
– Time to complete streaming PageRank in GraphTau when the graph is streamed in parts
Preliminary Results: PageRank
GraphX on whole graph could not converge!
GraphTau converged fast when 20% of the graph is streamed at a time
Smaller batches lead to faster convergence
Preliminary Results: CellIQ
CellIQ (NSDI 2015): Prior work
– Detection of persistent hotspots using incremental connected components
– Built specialized system to do temporal analysis
Re-implemented on general system GraphTau
– Uses mergeByWindow for sliding window analysis
– Strawman (baseline) runs non-incremental connected components on whole window of snapshots
Preliminary Results: CellIQ
[Figure: analysis time (s) versus window size (m) for Strawman, GraphTau, and CellIQ]
GraphTau achieves performance comparable to the specialized system, without domain-specific optimizations
Takeaways
GraphTau
General purpose processing engine for time-evolving graphs
GraphStream abstraction that provides
– Consistent & fault-tolerant snapshot generation
– Co-ordinated snapshotting and computation
– Sliding window operations
– Mixing of data and graph parallel computations