Alex Pothen Purdue University CSCAPES Institute Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit

Alex Pothen
Purdue University
CSCAPES Institute
www.cs.purdue.edu/homes/apothen/

Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)

CSC’11 Workshop May 2011

Multithreaded Algorithms for Graph Coloring


References

Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen. Multithreaded algorithms for graph coloring. 40 pp., submitted to Parallel Computing.

Patwary, Gebremedhin and Pothen. New multithreaded ordering and coloring algorithms for multicore architectures. 12 pp., EuroPar 2011.

Gebremedhin, Nguyen, Pothen and Patwary. Graph coloring for derivative computation and beyond: Algorithms, software and analysis. 32 pp., submitted to TOMS.

Catalyurek, Dobrian, Gebremedhin, Halappanavar and Pothen. Distributed memory parallel algorithms for matching and coloring. 10 pp., IPDPS Workshop PCO, 2011.

[Figure: Graph, Architecture, Algorithm; labels: Thread Granularity, Synchronization, Concurrency, Latency Tolerance, Performance]

Outline

The many-core and multi-threaded world
◦ Intel Nehalem
◦ Sun Niagara
◦ Cray XMT

A case study on multithreaded graph coloring
◦ An Iterative and Speculative Coloring Algorithm
◦ A Dataflow algorithm

RMAT graphs: ER, G, and B

Experimental results

Conclusions

Architectural Features

O(|E|)-time implementations possible for all four

• B = max back degree over the entire sequence.
• B+1 colors suffice to color G.

Proc.          | Threads/Core | Cores/Socket | Threads | Cache     | Clock   | Multithreading, Other Detail
Intel Nehalem  | 2            | 4            | 16      | Shared L3 | 2.5 GHz | Simultaneous, cache coherence protocol
Sun Niagara 2  | 8            | 2            | 128     | Shared L2 | 1.2 GHz | Simultaneous
Cray XMT       | 128          | 128 Procs.   | 16,384  | None      | 500 MHz | Interleaved, fine-grained synchronization

Multithreaded: Iterative Greedy Algorithm

[Figure: vertex v, Forbidden Colors array, v ∈ V, color c_i]

Multi-threaded: Data Flow Algorithm

[Figure: vertex v, Forbidden Colors array, v ∈ V, color c_i]

RMAT Graphs

R-MAT: Recursive MATrix method

Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)

Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
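A minimal sketch of how an R-MAT edge can be drawn, assuming the usual reading of the four parameters (a, b, c, d) as the probabilities of recursing into the top-left, top-right, bottom-left, and bottom-right quadrant of the adjacency matrix. The function name `rmat_edges`, the `scale` and `seed` parameters, and the `RMAT_PARAMS` table are illustrative, not taken from the talk; the generator does not remove duplicate edges or self-loops.

```python
import random

# Parameter sets from the slide above (keys are illustrative labels).
RMAT_PARAMS = {
    "ER": (0.25, 0.25, 0.25, 0.25),
    "G": (0.45, 0.15, 0.15, 0.25),
    "B": (0.55, 0.15, 0.15, 0.15),
}

def rmat_edges(scale, num_edges, probs, seed=0):
    """Return num_edges (row, col) pairs on 2**scale vertices.

    At each of the `scale` recursion levels one quadrant is chosen with
    probabilities probs = (a, b, c, d), fixing one more bit of the row
    and column index.
    """
    a, b, c, d = probs
    assert abs(a + b + c + d - 1.0) < 1e-9, "probabilities must sum to 1"
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):
            row <<= 1
            col <<= 1
            r = rng.random()
            if r < a:
                pass                # top-left quadrant
            elif r < a + b:
                col |= 1            # top-right quadrant
            elif r < a + b + c:
                row |= 1            # bottom-left quadrant
            else:
                row |= 1            # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges
```

With equal probabilities every matrix cell is equally likely, which is why the first parameter set behaves like an Erdős–Rényi graph; the more skewed the parameters, the more edges concentrate on a few vertices.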

RMAT Graphs

[Figure: adjacency matrix quadrants a, b, c, d]

Nehalem: Strong Scaling (Niagara)

[Figure panels: RMAT-ER, RMAT-G, RMAT-B]

Cray XMT: Strong and Weak Scaling

[Figure panels: Iter-G, Iter-B, DF-G, DF-B]

Comparing Three Platforms

[Figure panels: a) ER, c) Good, e) Bad]

No. Colors in Parallel Algorithms

[Figure panels: a) ER, b) G, c) B]

Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)

[Figure panels: SL Ordering, Relaxed SL Ordering]

Our contributions: Multithreaded Coloring

Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms easier to implement than distributed memory versions
◦ Thread concurrency ameliorates lack of caches, and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance

Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decomposition into tasks at a finer grain than distributed-memory version, and relax synchronization to enhance concurrency
◦ Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed

Multi-threaded Parallelism


• Memory access times determine performance
• By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. Simultaneous multithreading (IMT or SMT)

[Figure from Robert Golla, Sun; axis label: Time]

Multi-core: Sun Niagara 2

• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline

Multicore: Intel Nehalem

• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution

Massive Multithreading: Cray XMT

Latency tolerance via massive multithreading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory, average latency 600 cycles

Memory request doesn’t stall processor
◦ Other threads work while the request is fulfilled

Light-weight, word-level synchronization (full/empty bits)

Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
◦ Interleaved multithreading

Multithreaded Algorithms for Graph Coloring

◦ We developed two kinds of multithreaded algorithms for graph coloring:
  • An iterative, coarse-grained method for generic shared-memory architectures
  • A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT
◦ Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem
◦ Excellent speedup observed on all three platforms
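A rough sketch of the dataflow idea, under stated assumptions: the slides do not spell out the dependency rule, so this version assumes each vertex waits only for its lower-numbered neighbors, which makes the dependencies acyclic. A `threading.Event` per vertex stands in for the XMT's full/empty bit on the color word, and one Python thread per vertex stands in for a hardware thread stream. The name `dataflow_coloring` is illustrative.

```python
import threading

def dataflow_coloring(adj):
    """Color vertices 0..n-1; adj[v] lists the neighbors of v.

    Assumption: vertex v depends only on neighbors w < v, so vertex 0
    can always start and no cycle of waits can form.
    """
    n = len(adj)
    color = [None] * n
    # ready[v] plays the role of the full/empty bit on color[v].
    ready = [threading.Event() for _ in range(n)]

    def color_vertex(v):
        forbidden = set()
        for w in adj[v]:
            if w < v:
                ready[w].wait()          # block until color[w] is "full"
                forbidden.add(color[w])
        c = 0
        while c in forbidden:            # smallest permissible color
            c += 1
        color[v] = c
        ready[v].set()                   # mark color[v] as "full"

    threads = [threading.Thread(target=color_vertex, args=(v,))
               for v in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return color
```

Because every vertex sees the final colors of all the neighbors it waits for, this sketch needs no conflict-resolution step and yields the same coloring as serial greedy in the order 0, 1, ..., n-1.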

Coloring Algorithms


Greedy coloring algorithms

Distance-k, star, and acyclic coloring are NP-hard

Approximating coloring to within O(n^(1-e)) is NP-hard for any e > 0

GREEDY(G=(V,E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for

A greedy heuristic usually gives a near-optimal solution

The key is to find good orderings for coloring, and many have been developed

Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
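The GREEDY pseudocode above can be sketched in Python as follows. This is an illustrative distance-1 version, not the authors' implementation; the function name `greedy_coloring` and the dictionary-based graph representation are assumptions. The `forbidden` map is stamped with the current vertex so it never has to be cleared, which keeps the work proportional to the degree of each vertex.

```python
def greedy_coloring(adj, order=None):
    """Distance-1 greedy coloring.

    adj   : dict mapping each vertex to an iterable of its neighbors
    order : vertex ordering to use (defaults to the dict's own order)
    Returns a dict mapping each vertex to a color 0, 1, 2, ...
    """
    order = list(adj) if order is None else list(order)
    color = {}
    # forbidden[c] == v means color c is used by some neighbor of v.
    forbidden = {}
    for v in order:
        # Determine the colors forbidden to v.
        for w in adj[v]:
            if w in color:
                forbidden[color[w]] = v
        # Assign v the smallest permissible color.
        c = 0
        while forbidden.get(c) == v:
            c += 1
        color[v] = c
    return color
```

The ordering is passed in rather than computed here because, as the slide notes, the choice of ordering is the main lever on the number of colors.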

Distance-1 Coloring, Greedy Alg.

[Figure: two panels, each labeled a and v]

Many-core greedy coloring

Given a graph, parallelize greedy coloring on many-core machines such that
◦ Speedup is attained, and
◦ Number of colors is roughly the same as in serial

Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular

D1 coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success

Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines
◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
◦ Number of conflicts bounded, so this approach yields an effective algorithm
◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)

We adapt this approach to implement the greedy algorithm for many-core computing
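A minimal sketch of speculate-then-repair coloring in rounds, written as a sequential simulation so that it runs deterministically. Several choices here are assumptions rather than details from the slides: the strided partition of the work list among `num_threads` simulated threads, the rule that the higher-numbered endpoint of a conflicting edge is recolored, and the name `speculative_coloring`. Each simulated thread sees colors fixed in earlier rounds plus its own choices, but not the choices other threads make in the same round, which is what produces conflicts.

```python
def speculative_coloring(adj, num_threads=4):
    """Iterative speculative coloring; adj[v] lists neighbors of vertex v.

    Returns (color, rounds). Sequential simulation of the parallel rounds.
    """
    n = len(adj)
    color = [-1] * n
    work = list(range(n))
    rounds = 0
    while work:
        rounds += 1
        for v in work:
            color[v] = -1                 # discard stale tentative colors
        snapshot = list(color)            # colors fixed in earlier rounds
        # Phase 1: each simulated thread colors its share speculatively.
        for t in range(num_threads):
            local = list(snapshot)        # this thread's view of the colors
            for v in work[t::num_threads]:
                forbidden = {local[w] for w in adj[v] if local[w] != -1}
                c = 0
                while c in forbidden:
                    c += 1
                local[v] = c
                color[v] = c
        # Phase 2: detect conflicts; the higher-numbered endpoint retries.
        work = [v for v in work
                if any(w < v and color[w] == color[v] for w in adj[v])]
    return color, rounds
```

The lowest-numbered vertex in each work list is never scheduled for recoloring, so the work list shrinks every round and the loop terminates. With `num_threads=1` the sketch reduces to serial greedy and finishes in one round.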

Parallel Coloring


Parallel Coloring: Speculation

[Figure: two panels, each labeled a, v, and w]

Experimental results

[Figure panels: Iterative, Dataflow]

Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges

Experimental results: Niagara 2, Iterative

Doubling the threads on a core improves performance as much as doubling the cores!

Experimental results: All Platforms

RMAT-G with 2^24 = 16M vertices and 134M edges

RMAT-B, 2^24 vertices, 134M edges

Iterative Greedy Coloring: Multithreaded Algorithm

Adj(v), color(w), forbidden(v): d(v) reads each
forbidden(v): d(v) writes

Adj(v), color(w): d(v) reads each
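As a rough tally, and reading the annotations above as per-vertex memory-access counts (an interpretation, since the figure they annotate did not survive), the accesses can be summed over all vertices using the identity that the degrees d(v) add up to 2|E|. The assignment of the first annotation to the tentative-coloring phase and the second to the conflict-detection phase is an assumption.

```latex
\begin{align*}
\text{tentative coloring, reads:}  \quad & \sum_{v \in V} 3\,d(v) = 6|E| \\
\text{tentative coloring, writes:} \quad & \sum_{v \in V} d(v) = 2|E| \\
\text{conflict detection, reads:}  \quad & \sum_{v \in V} 2\,d(v) = 4|E|
\end{align*}
```

These totals apply to a round in which every vertex is processed; later rounds touch only the vertices that were found in conflict.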


Tentative Conclusions, Future Work


Future Plans: Multithreaded Coloring

Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations…
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms

Many-core machines
◦ Four items as above
◦ Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5

Thanks

Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke

Further reading: www.cscapes.org

Gebremedhin and Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12: 1131-1146, 2000.

Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002.

Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008.

Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.
