Alex Pothen
Purdue University
CSCAPES Institute
www.cs.purdue.edu/homes/apothen/
Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)
CSC’11 Workshop May 2011
Multithreaded Algorithms for Graph Coloring
References
Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, 40 pp., Submitted to Parallel Computing.
New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin, Pothen, 12 pp., EuroPar 2011.
Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., Submitted to TOMS.
Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10 pp., IPDPS Workshop PCO, 2011.
[Figure: diagram relating Graph, Architecture, and Algorithm to Performance, with labels Thread Granularity, Synchronization, Concurrency, and Latency Tolerance]
Outline
The many-core and multi-threaded world
◦ Intel Nehalem
◦ Sun Niagara
◦ Cray XMT
A case study on multithreaded graph coloring
◦ An Iterative and Speculative Coloring Algorithm
◦ A Dataflow algorithm
RMAT graphs: ER, G, and B
Experimental results
Conclusions
Architectural Features
O(|E|)-time implementations possible for all four
• B = max back degree over the entire sequence.
• B + 1 colors suffice to color G.
Proc.         | Threads/Core | Cores/Socket | Threads | Cache     | Clock   | Multithreading, Other Detail
Intel Nehalem | 2            | 4            | 16      | Shared L3 | 2.5 GHz | Simultaneous, cache coherence protocol
Sun Niagara 2 | 8            | 2            | 128     | Shared L2 | 1.2 GHz | Simultaneous
Cray XMT      | 128          | 128 Procs.   | 16,384  | None      | 500 MHz | Interleaved, fine-grained synchronization
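The ordering note on this slide (B is the maximum back degree, and B + 1 colors suffice) can be made concrete with a small sketch. It assumes that the back degree of a vertex is the number of its neighbors that appear earlier in the ordering, and that "SL" elsewhere in the deck means smallest-last ordering. The function names and the dictionary-based adjacency format are hypothetical, and the ordering routine is a simple quadratic version, not the O(|E|)-time implementation the slide refers to.

```python
def max_back_degree(adj, order):
    """Return B: the largest number of earlier-ordered neighbors of any vertex."""
    position = {v: i for i, v in enumerate(order)}
    return max(
        sum(1 for w in adj[v] if position[w] < position[v])
        for v in order
    )


def smallest_last_order(adj):
    """Simple O(|V|^2) smallest-last ordering (illustrative, not O(|E|))."""
    degree = {v: len(adj[v]) for v in adj}
    removed = set()
    removal = []
    while len(removal) < len(adj):
        # Pick a remaining vertex of minimum degree in the remaining graph.
        v = min((u for u in adj if u not in removed), key=lambda u: degree[u])
        removed.add(v)
        removal.append(v)
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
    # Vertices removed last are placed first in the ordering.
    return removal[::-1]


# Star graph: center 0 joined to leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
leaves_first = [1, 2, 3, 4, 0]
print(max_back_degree(star, leaves_first))               # B for a poor ordering
print(max_back_degree(star, smallest_last_order(star)))  # B for smallest-last
```

On the star graph, placing the center last gives B = 4, while the smallest-last ordering gives B = 1, so the bound B + 1 depends strongly on the ordering chosen.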
Multithreaded: Iterative Greedy Algorithm
[Figure: a vertex v, the Forbidden Colors array, and color c_i]
Multi-threaded: Data Flow Algorithm
[Figure: a vertex v, the Forbidden Colors array, and color c_i]
RMAT Graphs
R-MAT: Recursive MATrix method
Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)
Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
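As an illustration of how the four R-MAT parameters are used, here is a minimal generator sketch. It assumes the usual reading of the parameters as probabilities (a, b, c, d) of descending into the top-left, top-right, bottom-left, and bottom-right quadrant of the adjacency matrix at each level of recursion. The function name rmat_edges, the scale argument, and the seed are hypothetical; duplicate edges and self-loops are not removed, and this is not the generator used for the experiments.

```python
import random


def rmat_edges(scale, num_edges, probs, seed=0):
    """Generate num_edges (row, col) pairs on 2**scale vertices via R-MAT."""
    a, b, c, _d = probs  # d is implied: the four probabilities sum to 1
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):
            # Each level picks one quadrant and fixes one bit of row and col.
            r = rng.random()
            row <<= 1
            col <<= 1
            if r < a:
                pass          # top-left: both bits stay 0
            elif r < a + b:
                col |= 1      # top-right
            elif r < a + b + c:
                row |= 1      # bottom-left
            else:
                row |= 1      # bottom-right
                col |= 1
        edges.append((row, col))
    return edges


# RMAT-G parameters from the slide, on a tiny 16-vertex instance.
sample = rmat_edges(scale=4, num_edges=20, probs=(0.45, 0.15, 0.15, 0.25))
print(sample[:5])
```

With equal probabilities (RMAT-ER) every matrix cell is equally likely, while the skewed settings (RMAT-G, RMAT-B) concentrate edges on a subset of the vertices.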
[Figure: R-MAT quadrants a, b, c, d]
Nehalem: Strong Scaling (Niagara)
[Figure panels: RMAT-ER, RMAT-G, RMAT-B]
Cray XMT: Strong and Weak Scaling
[Figure panels: Iter-G, Iter-B, DF-G, DF-B]
Comparing Three Platforms
[Figure panels: a) ER, c) Good, e) Bad]
No. Colors in Parallel Algorithms
[Figure panels: a) ER, b) G, c) B]
Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)
[Figure panels: SL Ordering, Relaxed SL Ordering]
Our contributions: Multithreaded Coloring
Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms easier to implement than distributed memory versions
◦ Thread concurrency ameliorates lack of caches, and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance
Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decomposition into tasks at a finer grain than distributed-memory version, and relax synchronization to enhance concurrency
◦ Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed
Multi-threaded Parallelism
• Memory access times determine performance
• By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. Simultaneous multithreading (IMT or SMT)
[Figure from Robert Golla, Sun; axis label: Time]
Multi-core: Sun Niagara 2
• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline
Multicore: Intel Nehalem
• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution
Massive Multithreading: Cray XMT
Latency tolerance via massive multi-threading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory, average latency 600 cycles
Memory request doesn’t stall processor
◦ Other threads work while the request is fulfilled
Light-weight, word-level synchronization (full/empty bits)
Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
◦ Interleaved multithreading
Multithreaded Algorithms for Graph Coloring
◦ We developed two kinds of multithreaded algorithms for graph coloring:
  • An iterative, coarse-grained method for generic shared-memory architectures
  • A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT
◦ Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem
◦ Excellent speedup observed on all three platforms
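The slides do not spell out the dataflow algorithm, so the following is only a sketch under a stated assumption: each vertex waits until all of its lower-numbered neighbors have been colored, then takes the smallest color not used by them. Python's threading.Event stands in for the XMT's full/empty bits, one thread per vertex is used purely for illustration, and the name dataflow_coloring and the dictionary adjacency format are hypothetical.

```python
import threading


def dataflow_coloring(adj):
    """Color vertices with one thread each; waits follow vertex numbering."""
    color = {v: None for v in adj}
    # One event per vertex plays the role of a full/empty bit on color[v].
    ready = {v: threading.Event() for v in adj}

    def color_vertex(v):
        used = set()
        for w in adj[v]:
            if w < v:
                ready[w].wait()      # block until color[w] has been written
                used.add(color[w])
        c = 0
        while c in used:
            c += 1
        color[v] = c
        ready[v].set()               # mark color[v] as "full"

    threads = [threading.Thread(target=color_vertex, args=(v,)) for v in adj]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return color


# 5-cycle: 0-1-2-3-4-0
cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(dataflow_coloring(cycle))
```

Because a vertex only ever waits on lower-numbered neighbors, the waits cannot form a cycle, and the result matches a serial greedy pass over the vertices in increasing order regardless of thread scheduling.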
Coloring Algorithms
Greedy coloring algorithms
Distance-k, star, and acyclic coloring are NP-hard
Approximating coloring to within O(n^{1-ε}) is NP-hard for any ε > 0

GREEDY(G=(V,E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for
A greedy heuristic usually gives a near-optimal solution
The key is to find good orderings for coloring, and many have been developed
Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
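The GREEDY pseudocode on this slide can be sketched in a few lines. The version below assumes an adjacency-list dictionary, 0-based colors, and a "forbidden" map stamped with the current vertex so it never needs to be cleared between vertices; the function name and data layout are illustrative, not the authors' implementation.

```python
def greedy_coloring(adj, order=None):
    """Serial greedy distance-1 coloring over the given vertex order."""
    order = list(adj) if order is None else order
    color = {}
    # forbidden[c] == v means color c is taken by some neighbor of v.
    forbidden = {}
    for v in order:
        # Determine the colors forbidden to v.
        for w in adj[v]:
            if w in color:
                forbidden[color[w]] = v
        # Assign v the smallest permissible color.
        c = 0
        while forbidden.get(c) == v:
            c += 1
        color[v] = c
    return color


# 5-cycle: an odd cycle needs three colors.
cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = greedy_coloring(cycle)
print(coloring)
```

Each vertex scans its adjacency list once and tries at most d(v) + 1 colors, so the whole pass takes time proportional to the number of edges once the order is fixed.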
Distance-1 Coloring, Greedy Alg.
[Figure: two panels showing vertices a and v]
Many-core greedy coloring
Given a graph, parallelize greedy coloring on many-core machines such that
◦ Speedup is attained, and
◦ Number of colors is roughly the same as in serial
Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
Distance-1 (D1) coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success
Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines
◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
◦ Number of conflicts bounded, so this approach yields an effective algorithm
◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
We adapt this approach to implement the greedy algorithm for many-core computing
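The speculate-then-repair idea described above can be simulated serially. This sketch assumes a round-based scheme: vertices are split into blocks (one per simulated thread), each block colors its vertices greedily without seeing the other blocks' writes from the same round, and any vertex that ends up sharing a color with a lower-numbered neighbor is recolored in the next round. The block count, the tie-breaking rule, and all names are assumptions for illustration; real threads and the randomized partitioning are omitted.

```python
def smallest_free_color(v, adj, color):
    """Smallest color not used by any already-colored neighbor of v."""
    used = {color[w] for w in adj[v] if w in color}
    c = 0
    while c in used:
        c += 1
    return c


def iterative_coloring(adj, num_blocks=2):
    """Speculative coloring with conflict detection, simulated serially."""
    color = {}
    worklist = list(adj)
    rounds = 0
    while worklist:
        rounds += 1
        # Phase 1: each block colors speculatively from the same snapshot,
        # so blocks do not see one another's writes within this round.
        snapshot = dict(color)
        tentative = {}
        for i in range(num_blocks):
            local = dict(snapshot)
            for v in worklist[i::num_blocks]:
                local[v] = smallest_free_color(v, adj, local)
                tentative[v] = local[v]
        color.update(tentative)
        # Phase 2: a vertex is recolored if a lower-numbered neighbor
        # ended up with the same color.
        worklist = [
            v for v in worklist
            if any(w < v and color[w] == color[v] for w in adj[v])
        ]
    return color, rounds


# Complete graph on 4 vertices: every pair conflicts if colored alike.
k4 = {v: [w for w in range(4) if w != v] for v in range(4)}
coloring, rounds = iterative_coloring(k4, num_blocks=2)
print(coloring, rounds)
```

The lowest-numbered vertex in each round's worklist can never be flagged, so the loop terminates, and the final coloring is conflict-free because every vertex's last recoloring saw the settled colors of its neighbors.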
Parallel Coloring
Parallel Coloring: Speculation
[Figure: two panels showing vertices a, v, and w]
Experimental results
[Figure panels: Iterative, Dataflow]
Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges
Experimental results
Niagara 2, Iterative
Performance with doubling threads on a core = doubling cores!
Experimental results
RMAT-G with 2^24 = 16M vertices and 134M edges
All Platforms
RMAT-B, 2^24 vertices, 134M edges
Iterative Greedy Coloring: Multithreaded Algorithm
Adj(v), color(w), forbidden(v): d(v) reads each
forbidden(v): d(v) writes
Adj(v), color(w): d(v) reads each
Tentative Conclusions, Future Work
Future Plans: Multithreaded Coloring
Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations…
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms
Many-core machines
◦ Four items as above
◦ Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5
Thanks
Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke
Further reading
www.cscapes.org
Gebremedhin and Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12: 1131-1146, 2000.
Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002.
Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008.
Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.