Scaling Apache Giraph
Nitay Joffe, Data Infrastructure Engineer
@nitayj
September 10, 2013
Agenda
1 Background
2 Scaling
3 Results
4 Questions
Background
What is Giraph?
• Apache open source graph computation engine based on Google’s Pregel.
• Support for Hadoop, Hive, HBase, and Accumulo.
• BSP model with a simple “think like a vertex” API.
• Combiners, Aggregators, Mutability, and more.
• Configurable Graph<I,V,E,M>, where each type implements Writable:
– I: Vertex ID
– V: Vertex Value
– E: Edge Value
– M: Message data
What is Giraph NOT?
• A graph database. See Neo4j.
• A completely asynchronous generic MPI system.
• A slow tool.
Why not Hive?
[Diagram: MapReduce dataflow. Input 0 and Input 1 pass through the input format to map tasks, intermediate files, and reduce tasks, then through the output format to Output 0 and Output 1. Iterate!]
• Too much disk. Limited in-memory caching.
• Each iteration becomes a MapReduce job!
Giraph components
Master – Application coordinator
• Synchronizes supersteps
• Assigns partitions to workers before superstep begins
Workers – Computation & messaging
• Handle I/O – reading and writing the graph
• Computation/messaging of assigned partitions
ZooKeeper
• Maintains global application state
Giraph Dataflow
[Diagram: Giraph dataflow between the master and workers 0 and 1, in three phases.
1. Loading the graph: workers read input splits (Split 0 to Split 4) through the input format, then load and send the graph.
2. Compute/Iterate: workers hold partitions (Part 0 to Part 3) of the in-memory graph, compute and send messages, then send stats to the master and iterate.
3. Storing the graph: workers write their partitions (Part 0 to Part 3) through the output format.]
Giraph Job Lifetime
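[Diagram: Vertex Lifecycle. An Active vertex becomes Inactive when it votes to halt; an Inactive vertex becomes Active when it receives a message.]
[Diagram: job flow. Input, then Compute Superstep. All vertices halted? If no, compute another superstep. If yes: master halted? If no, compute another superstep; if yes, Output.]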
Simple Example – Compute the maximum value
[Figure: vertex values on Processor 1 and Processor 2 changing over time.]
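The maximum-value example can be sketched as a small superstep loop. This is a minimal Python simulation under stated assumptions, not Giraph's Java API: the function name `max_value_supersteps`, the undirected edge list, and the rule "send in superstep 0 or when the value changed" are illustrative choices. Every vertex votes to halt after computing and is reactivated only by an incoming message, matching the vertex lifecycle above.

```python
def max_value_supersteps(values, edges):
    """Propagate the maximum value through an undirected graph, BSP style.

    values: dict vertex_id -> initial value
    edges:  list of (u, v) pairs, treated as undirected
    Returns (final values, number of supersteps run).
    """
    neighbors = {v: [] for v in values}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    values = dict(values)
    active = set(values)               # every vertex starts active
    inbox = {v: [] for v in values}
    superstep = 0

    while active:
        outbox = {v: [] for v in values}
        for v in active:
            best = max([values[v]] + inbox[v])
            changed = best > values[v]
            values[v] = best
            # Send in superstep 0, or whenever the value just changed.
            if superstep == 0 or changed:
                for n in neighbors[v]:
                    outbox[n].append(best)
            # The vertex then votes to halt (implicit: it is dropped from
            # `active` unless a message arrives for the next superstep).
        inbox = outbox
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1

    return values, superstep


final, steps = max_value_supersteps({1: 3, 2: 6, 3: 2, 4: 1},
                                    [(1, 2), (2, 3), (3, 4)])
```

The loop ends when no vertex has pending messages, which is the "all vertices halted" condition.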
Connected Components, e.g. Finding Communities
PageRank – ranking websites
Mahout (Hadoop): 854 lines
Giraph: < 30 lines
• Send neighbors an equal fraction of your page rank.
• New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
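The two PageRank rules above can be sketched in a few lines. This is a Python simulation of the update, not the Giraph implementation the slide counts lines for; the function name `pagerank`, the adjacency-dict input, and the fixed superstep count are assumptions made for illustration.

```python
def pagerank(out_edges, supersteps=30, damping=0.85):
    """out_edges: dict vertex -> list of target vertices (all targets are keys)."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        # "Messages": each vertex sends rank / out_degree to every neighbor.
        incoming = {v: 0.0 for v in out_edges}
        for v, targets in out_edges.items():
            if targets:                      # vertices with no out-edges send nothing
                share = rank[v] / len(targets)
                for t in targets:
                    incoming[t] += share
        # New page rank = 0.15 / (# of vertices) + 0.85 * (messages sum)
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in out_edges}
    return rank


ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

On a simple cycle every vertex receives exactly what it sends, so the ranks stay equal.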
Scaling
Problem: Worker Crash.
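[Diagram: two checkpoint timelines.
1. Superstep i (no checkpoint), superstep i+1 (checkpoint), superstep i+2 (no checkpoint): worker failure! Execution returns to superstep i+1 (checkpoint).
2. Superstep i+1 (checkpoint), superstep i+2 (no checkpoint), superstep i+3 (checkpoint): worker failure after checkpoint complete! Execution continues from superstep i+3 (no checkpoint) until the application completes.]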
Solution: Checkpointing.
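A rough sketch of the recovery idea: save state every few supersteps and, when a worker fails, roll back to the last saved state instead of restarting the job. This is a single-process Python simulation; `run_with_checkpoints`, its parameters, and the choice to checkpoint after a superstep completes are assumptions for illustration, not Giraph's actual checkpoint mechanism.

```python
import copy


def run_with_checkpoints(state, step, num_supersteps, checkpoint_every, fail_at=()):
    """Run `step` for num_supersteps, saving state every `checkpoint_every` supersteps.

    fail_at: superstep numbers at which a simulated worker failure occurs (once each).
    Returns (final state, list of supersteps actually executed, including re-runs).
    """
    pending_failures = set(fail_at)
    checkpoint = (0, copy.deepcopy(state))     # (superstep to resume at, saved state)
    executed = []
    superstep = 0
    while superstep < num_supersteps:
        if superstep in pending_failures:
            # Simulated worker failure: roll back to the last checkpoint.
            pending_failures.discard(superstep)
            superstep, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        state = step(state, superstep)
        executed.append(superstep)
        superstep += 1
        if superstep % checkpoint_every == 0:
            checkpoint = (superstep, copy.deepcopy(state))
    return state, executed


final, executed = run_with_checkpoints(0, lambda s, i: s + 1, 6, 2, fail_at=(3,))
```

Only the supersteps since the last checkpoint are repeated, so the checkpoint interval trades checkpoint I/O against the amount of recomputation after a failure.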
Problem: Master Crash.
Solution: ZooKeeper Master Queue.
[Diagram: the active master state is kept in ZooKeeper. Before failure of active master 0: master 0 is “Active”, masters 1 and 2 are “Spare”. After failure of active master 0: master 1 is “Active”, master 2 is “Spare”.]
Problem: Primitive Collections.
• Graphs often parameterized with {Null,Int,Long,Float,Double}.
• Boxing/unboxing. Objects have internal overhead.
Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.
fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion
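The boxing overhead can be illustrated outside Java as well. The sketch below is a Python analogue, not fastutil: a list holds a pointer to one heap object per value, while `array("d")` stores raw 8-byte doubles contiguously. The helper name `boxed_vs_packed` is made up for this example, and the byte counts depend on the interpreter.

```python
import sys
from array import array


def boxed_vs_packed(n):
    """Return (bytes used by a list of float objects, bytes used by a packed array)."""
    boxed = [float(i) for i in range(n)]       # one heap object per value, plus pointers
    packed = array("d", boxed)                 # contiguous unboxed doubles
    boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
    packed_bytes = sys.getsizeof(packed)
    return boxed_bytes, packed_bytes


boxed_bytes, packed_bytes = boxed_vs_packed(1000)
```

A type-specific collection such as Long2DoubleOpenHashMap applies the same idea to hash maps: keys and values live in primitive arrays rather than as separate objects.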
[Figures: three example graphs that use primitive types: Single Source Shortest Path (weighted edges), Network Flow (source s, sink t, weighted edges), and Count In-Degree.]
Problem: Too many objects. Lots of time spent in GC.
Graph: 1B Vertices, 200B Edges, 200 Workers.
• 1B Edges per Worker. 1 object per edge value.
– List<Edge<I, E>> ~ 10B objects
• 5M Vertices per Worker. 10 objects per vertex value.
– Map<I, Vertex<I, V, E>> ~ 50M objects
• 1 Message per Edge. 10 objects per message data.
– Map<I, List<M>> ~ 10B objects
• Objects used ~= O(E*e + V*v + M*m) => O(E*e)
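The per-worker figures follow from dividing the graph evenly across workers. A quick arithmetic check, using only the numbers on the slide (variable names are illustrative, and an even split across workers is assumed):

```python
vertices = 1_000_000_000          # 1B vertices
edges = 200_000_000_000           # 200B edges
workers = 200

vertices_per_worker = vertices // workers          # 5M
edges_per_worker = edges // workers                # 1B

objects_per_vertex_value = 10
objects_per_message = 10
messages_per_edge = 1

vertex_objects = vertices_per_worker * objects_per_vertex_value              # 50M
message_objects = edges_per_worker * messages_per_edge * objects_per_message  # 10B
```

Because edges outnumber vertices by 200 to 1, the edge and message terms dominate, which is why the total collapses to O(E*e).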
Label Propagation, e.g. Who’s sleeping?
[Figure: five-vertex graph for label propagation. Q: What did he think? Labels: Boring, Amazing, Confusing. Values shown: 0.5, 0.2, 0.8, 0.36, 0.17, 0.41.]
Problem: Too many objects. Lots of time spent in GC.
Solution: byte[]
• Serialize messages, edges, and vertices.
• Iterable interface with representative object.
[Diagram: each next() call reads the next serialized input into the same representative object.]
Objects per worker ~= O(V)
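A minimal sketch of the byte[] idea with a representative object, written in Python rather than Java. The class names `Edge` and `ByteArrayEdges` and the fixed layout (8-byte integer target id, 8-byte double value) are assumptions for illustration. All edges live in one byte buffer, and iteration refills a single reusable `Edge` instead of allocating one object per edge.

```python
import struct

_EDGE = struct.Struct("<qd")      # assumed layout: int64 target id, float64 edge value


class Edge:
    """Representative object: one instance is reused for every edge during iteration."""
    __slots__ = ("target", "value")

    def __init__(self):
        self.target = 0
        self.value = 0.0


class ByteArrayEdges:
    """Stores edges serialized back to back in a single bytearray."""

    def __init__(self):
        self._data = bytearray()
        self._count = 0

    def add(self, target, value):
        self._data += _EDGE.pack(target, value)
        self._count += 1

    def __len__(self):
        return self._count

    def __iter__(self):
        edge = Edge()             # the only Edge object created per iteration
        for offset in range(0, len(self._data), _EDGE.size):
            edge.target, edge.value = _EDGE.unpack_from(self._data, offset)
            yield edge            # callers must copy fields they want to keep


edges = ByteArrayEdges()
edges.add(7, 0.5)
edges.add(9, 0.25)
edges.add(11, 1.0)
total = sum(e.value for e in edges)
```

The trade-off is the usual one for this pattern: a reference to the yielded object is only valid until the next call to next(), so code that stores edges must copy the fields out.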
Problem: Serialization of byte[]
• DataInput? Kryo? Custom?
Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);
Problem: Large Aggregations.
Solution: Sharded Aggregators.
[Diagram: workers and the master, in three steps.
1. Workers own aggregators.
2. Aggregator owners communicate with Master.
3. Aggregator owners distribute values.]
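The three steps can be sketched as follows. This is a single-process Python simulation under stated assumptions: `owner_of` picks an owner by hashing the aggregator name, and `aggregate_sharded` uses one reduce function for all aggregators. Neither name comes from Giraph, and the real ownership assignment may differ.

```python
import operator
import zlib


def owner_of(name, num_workers):
    """Pick the worker that owns an aggregator (assumed: stable hash of its name)."""
    return zlib.crc32(name.encode("utf-8")) % num_workers


def aggregate_sharded(partials_by_worker, reduce_fn):
    """partials_by_worker: one dict per worker, aggregator name -> partial value.

    Returns (values held by each owner, values as seen by the master).
    """
    num_workers = len(partials_by_worker)
    # Step 1: every worker sends each partial value to that aggregator's owner,
    # so the reduction work is spread across workers instead of hitting the master.
    owned = [dict() for _ in range(num_workers)]
    for partials in partials_by_worker:
        for name, value in partials.items():
            bucket = owned[owner_of(name, num_workers)]
            bucket[name] = reduce_fn(bucket[name], value) if name in bucket else value
    # Step 2: owners report only the final, already-reduced values to the master.
    master_view = {}
    for bucket in owned:
        master_view.update(bucket)
    # Step 3 (not simulated): owners distribute the final values to all workers.
    return owned, master_view


owned, master_view = aggregate_sharded(
    [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}], operator.add)
```

The master receives one value per aggregator rather than one per worker per aggregator, which is what keeps large aggregations from overwhelming it.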
K-Means Clustering, e.g. Similar Emails
Problem: Network Wait.
• RPC doesn’t fit model.
• Synchronous calls no good.
Solution: Netty
• Tune queue sizes & threads.
[Diagram: two superstep timelines between barriers, each running from begin superstep to end superstep and showing compute, network, and wait phases with the end of compute marked; the second also marks the time to first message.]
Results
Scalability Graphs
[Plot: Increasing Workers. 2B Vertices, 200B Edges, 20 Compute Threads. x-axis: Workers (50 to 300); y-axis: Iteration Time (sec), 0 to 450.]
[Plot: Increasing Data Size. 50 Workers, 20 Compute Threads. x-axis: Edges; y-axis: Iteration Time (sec), 0 to 450.]
Lessons Learned
• Coordinating is a zoo. Be resilient with ZooKeeper.
• Efficient networking is hard. Let Netty help.
• Primitive collections, primitive performance. Use fastutil.
• byte[] is simple yet powerful.
• Being Unsafe can be a good thing.
• Have a graph? Use Giraph.
What’s the final result?
Comparison with Hive:
• 20x CPU speedup
• 100x Elapsed time speedup. 15 hours => 9 minutes.
Computations on entire Facebook graph no longer “weekend jobs”. Now they’re coffee breaks.
Questions?
Problem: Measurements.
• Need tools to gain visibility into the system.
• Problems with connecting to Hadoop sub-processes.
Solution: Do it all.
• YourKit – see YourKitProfiler
• jmap – see JMapHistoDumper
• VisualVM – with jstatd & SSH SOCKS proxy
• Yammer Metrics
• Hadoop Counters
• Logging & GC prints
Problem: Mutations
• Synchronization.
• Load balancing.
Solution: Reshuffle resources
• Mutations handled at barrier between supersteps.
• Master rebalances vertex assignments to optimize distribution.
• Handle mutations in batches.
• Avoid if using byte[].
• Favor algorithms which don’t mutate graph.