Multithreaded Algorithms for Graph Coloring

Alex Pothen
Purdue University
CSCAPES Institute
www.cs.purdue.edu/homes/apothen/

Assefaw Gebremedhin, Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State)

CSC’11 Workshop, May 2011


References

Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, 40 pp., submitted to Parallel Computing.

New multithreaded ordering and coloring algorithms for multicore architectures. Patwary, Gebremedhin and Pothen, 12 pp., EuroPar 2011.

Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen and Patwary, 32 pp., submitted to TOMS.

Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar and Pothen, 10 pp., IPDPS Workshop PCO, 2011.


[Figure: diagram relating Graph, Architecture, and Algorithm; labels: Thread Granularity, Synchronization; Concurrency; Latency Tolerance; Performance]


Outline

• The many-core and multi-threaded world
  ◦ Intel Nehalem
  ◦ Sun Niagara
  ◦ Cray XMT
• A case study on multithreaded graph coloring
  ◦ An Iterative and Speculative Coloring Algorithm
  ◦ A Dataflow algorithm
• RMAT graphs: ER, G, and B
• Experimental results
• Conclusions


Architectural Features

O(|E|)-time implementations possible for all four

• B = max back degree over entire seq.
• B+1 colors suffice to color G.

Proc.          Threads/Core  Cores/Socket  Threads  Cache      Clock    Multithreading, Other Detail
Intel Nehalem  2             4             16       Shared L3  2.5 GHz  Simultaneous, cache coherence protocol
Sun Niagara 2  8             2             128      Shared L2  1.2 GHz  Simultaneous
Cray XMT       128           128 Procs.    16,384   None       500 MHz  Interleaved, fine-grained synchronization


Multithreaded: Iterative Greedy Algorithm

[Figure: algorithm illustration; labels: v, Forbidden Colors, v ∈ V, c_i]


Multi-threaded: Data Flow Algorithm


Multi-threaded: Data Flow Algorithm

[Figure: algorithm illustration; labels: v, Forbidden Colors, v ∈ V, c_i]


RMAT Graphs

R-MAT: Recursive MATrix method

Experiments
◦ RMAT-ER (0.25, 0.25, 0.25, 0.25)
◦ RMAT-G (0.45, 0.15, 0.15, 0.25)
◦ RMAT-B (0.55, 0.15, 0.15, 0.15)

Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.
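
The recursive construction can be sketched in a few lines of Python. This is an illustrative sketch, not the generator used for the experiments: the function name `rmat_edges`, the `seed` parameter, and the mapping of the four probabilities to the top-left, top-right, bottom-left, and bottom-right quadrants of the adjacency matrix are assumptions. Each edge is placed by choosing one quadrant per level, `scale` times, which fixes one bit of the row and column index at each level.

```python
import random


def rmat_edges(scale, num_edges, probs, seed=0):
    """Return num_edges (row, col) pairs on 2**scale vertices, R-MAT style.

    probs = (a, b, c, d) are the quadrant probabilities; the quadrant
    layout (a = top-left, b = top-right, c = bottom-left, d = bottom-right)
    is an assumption of this sketch.
    """
    a, b, c, _d = probs
    rng = random.Random(seed)
    edges = []
    for _ in range(num_edges):
        row = col = 0
        for _ in range(scale):
            # Descend one level: shift in one more bit of each index.
            row <<= 1
            col <<= 1
            r = rng.random()
            if r < a:
                pass              # top-left quadrant
            elif r < a + b:
                col |= 1          # top-right quadrant
            elif r < a + b + c:
                row |= 1          # bottom-left quadrant
            else:
                row |= 1          # bottom-right quadrant
                col |= 1
        edges.append((row, col))
    return edges


# RMAT-G parameters from the slide, on a tiny 16-vertex instance.
sample = rmat_edges(scale=4, num_edges=100, probs=(0.45, 0.15, 0.15, 0.25))
```

The sketch keeps duplicate edges and self-loops; a real generator would typically filter them before coloring.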


RMAT Graphs

[Figure: R-MAT quadrant labels a, b, c, d]


Nehalem: Strong Scaling (Niagara)

[Figure: panels RMAT-ER, RMAT-G, RMAT-B]


Cray XMT: Strong and Weak Scaling

[Figure: panels Iter-G, Iter-B, DF-G, DF-B]


Comparing Three Platforms

[Figure: panels a) ER, c) Good, e) Bad]


No. Colors in Parallel Algorithms

[Figure: panels a) ER, b) G, c) B]


Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem)

[Figure: panels SL Ordering and Relaxed SL Ordering]


Our contributions: Multithreaded Coloring

Massive multithreading
◦ Can tolerate memory latency for graphs/sparse matrices
◦ Dataflow algorithms easier to implement than distributed memory versions
◦ Thread concurrency ameliorates lack of caches, and lower clock speeds
◦ Thread parallelism can be exploited at fine grain if supported by lightweight synchronization
◦ Graph structure critically influences performance

Many-core machines
◦ Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines
◦ Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores)
◦ Decomposition into tasks at a finer grain than distributed-memory version, and relax synchronization to enhance concurrency
◦ Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed


Multi-threaded Parallelism


• Memory access times determine performance
• By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free
• Interleaved vs. Simultaneous multithreading (IMT or SMT)

[Figure from Robert Golla, Sun; axis: Time]


Multi-core: Sun Niagara 2

• Two 8-core sockets
• 8 hw threads per core
• 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks
• Simultaneous multithreading
• Two threads from a core can be issued in a cycle
• Shallow pipeline


Multicore: Intel Nehalem

• Two quad-core sockets, 2.5 GHz
• Two hyperthreads per core support SMT
• Off-chip data latency 106 cycles
• Advanced architectural features: cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, out-of-order execution


Massive Multithreading: Cray XMT

Latency tolerance via massive multi-threading
◦ Context switch between threads in a single clock cycle
◦ Global address space, hashed to memory banks to reduce hot-spots
◦ No cache or local memory, average latency 600 cycles

Memory request doesn’t stall processor
◦ Other threads work while the request is fulfilled

Light-weight, word-level synchronization (full/empty bits)

Notes:
◦ 500 MHz clock
◦ 128 hardware thread streams/proc.
◦ Interleaved multithreading
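
A back-of-the-envelope model helps show why many thread streams can hide a 600-cycle latency. The sketch below is a simplified textbook-style model and not something stated on the slides: it assumes every thread alternates between `work_cycles` of computation and one memory request of `latency_cycles`, with free context switches. The function names and the value `work_cycles=10` are hypothetical choices made for illustration.

```python
import math


def utilization(threads, work_cycles, latency_cycles):
    """Fraction of cycles the processor is busy under the simple model.

    Each thread computes for work_cycles, then waits latency_cycles for
    memory; other threads run during the wait (assumed zero switch cost).
    """
    return min(1.0, threads * work_cycles / (work_cycles + latency_cycles))


def threads_to_saturate(work_cycles, latency_cycles):
    """Smallest thread count that keeps the processor fully busy."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)


# Slide values: 600-cycle average latency, 128 streams per processor.
# work_cycles=10 is an assumed value, not a measurement.
one_thread = utilization(1, 10, 600)      # about 1.6% busy
all_streams = utilization(128, 10, 600)   # saturated under this model
needed = threads_to_saturate(10, 600)
```

Under these assumptions a single thread leaves the processor almost idle, while the 128 hardware streams are more than enough to cover the latency. Real kernels with less work between memory requests would need more threads.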


Multithreaded Algorithms for Graph Coloring

• We developed two kinds of multithreaded algorithms for graph coloring:
  ◦ An iterative, coarse-grained method for generic shared-memory architectures
  ◦ A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT
• Benchmarked the algorithms on three systems: Cray XMT, Sun Niagara 2 and Intel Nehalem
• Excellent speedup observed on all three platforms
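
The dataflow idea can be imitated on an ordinary shared-memory machine. The sketch below is only an analogy under stated assumptions: it assumes each vertex waits for the colors of its lower-numbered neighbors before choosing its own, and it uses one `threading.Event` per vertex as a stand-in for the XMT's full/empty bit on a memory word. The name `dataflow_coloring` and the `workers` parameter are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def dataflow_coloring(adj, workers=4):
    """Color vertices 0..n-1; each waits on its lower-numbered neighbors."""
    n = len(adj)
    color = [None] * n
    # One event per vertex plays the role of a full/empty bit:
    # "set" means the color word has been written.
    ready = [threading.Event() for _ in range(n)]

    def color_vertex(v):
        forbidden = set()
        for w in adj[v]:
            if w < v:
                ready[w].wait()          # block until color[w] is "full"
                forbidden.add(color[w])
        c = 0
        while c in forbidden:
            c += 1
        color[v] = c
        ready[v].set()                   # mark color[v] as "full"

    # Tasks are submitted in vertex order, so the lowest unfinished vertex
    # is always running and never blocked: the pool cannot deadlock.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(color_vertex, range(n)))
    return color


# A 5-cycle: 0-1-2-3-4-0.
cycle = [[1, 4], [0, 2], [1, 3], [2, 4], [0, 3]]
colors = dataflow_coloring(cycle)
```

Because every vertex sees the final colors of all lower-numbered neighbors, this scheme produces no conflicts and returns the same coloring as serial greedy in the natural vertex order.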


Coloring Algorithms


Greedy coloring algorithms

• Distance-k, star, and acyclic coloring are NP-hard
• Approximating coloring to within O(n^(1-ε)) is NP-hard for any ε > 0

GREEDY(G = (V, E))
  Order the vertices in V
  for i = 1 to |V| do
    Determine colors forbidden to v_i
    Assign v_i the smallest permissible color
  end-for

• A greedy heuristic usually gives a near-optimal solution
• The key is to find good orderings for coloring, and many have been developed

Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Comput. 29:1042-1072, 2007.
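
The GREEDY pseudocode translates almost line for line into Python. The sketch below is one possible rendering, not the authors' implementation: the function name `greedy_coloring`, the adjacency-list input, and the optional `order` argument are assumptions. It uses a forbidden-colors array stamped with the current vertex, so the array never needs clearing and the total work is proportional to the number of edges.

```python
def greedy_coloring(adj, order=None):
    """Color a graph given as adjacency lists; colors are 0, 1, 2, ...

    order is the vertex ordering (natural order if omitted).
    """
    n = len(adj)
    order = list(range(n)) if order is None else list(order)
    max_deg = max((len(nbrs) for nbrs in adj), default=0)
    color = [-1] * n
    # forbidden[c] == v means color c is taken by some neighbor of v.
    # A vertex of degree d always finds a free color among 0..d.
    forbidden = [-1] * (max_deg + 1)
    for v in order:
        for w in adj[v]:
            if color[w] >= 0:
                forbidden[color[w]] = v
        c = 0
        while forbidden[c] == v:
            c += 1
        color[v] = c
    return color


# A 5-cycle: 0-1-2-3-4-0.
cycle = [[1, 4], [0, 2], [1, 3], [2, 4], [0, 3]]
colors = greedy_coloring(cycle)
```

On the 5-cycle the sketch uses three colors, which is optimal for an odd cycle. Passing a different `order` is how the ordering heuristics mentioned on the slide would plug in.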


Distance-1 Coloring, Greedy Alg.

[Figure: two panels with vertex labels a and v]


Many-core greedy coloring

• Given a graph, parallelize greedy coloring on many-core machines such that
  ◦ Speedup is attained, and
  ◦ Number of colors is roughly the same as in serial
• Difficult task since greedy is inherently sequential, computation is small relative to communication, and data accesses are irregular
• D1 coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success
• Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines
  ◦ Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution
  ◦ Number of conflicts bounded, so this approach yields an effective algorithm
  ◦ Extended to distance-2 coloring by Gebremedhin, Manne and Pothen (2002)
• We adapt this approach to implement the greedy algorithm for many-core computing
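
The speculate-then-repair idea can be simulated deterministically in Python. The sketch below is an illustration under stated assumptions, not the authors' code: concurrency is modeled by coloring vertices in batches of size `batch`, where every vertex in a batch reads the same snapshot of the color array (as if the batch ran simultaneously). Conflicts are then detected, the higher-numbered endpoint of each conflicting edge is queued for recoloring, and the process repeats until no conflicts remain. The names `speculative_coloring` and `batch` are hypothetical.

```python
def speculative_coloring(adj, batch=4):
    """Iterative speculative greedy coloring; returns (colors, rounds)."""
    n = len(adj)
    color = [-1] * n
    pending = list(range(n))
    rounds = 0
    while pending:
        rounds += 1
        # Phase 1: speculative coloring. Vertices in one batch all see
        # the same snapshot, so adjacent ones may pick the same color.
        for start in range(0, len(pending), batch):
            group = pending[start:start + batch]
            snapshot = list(color)
            for v in group:
                used = {snapshot[w] for w in adj[v]}
                c = 0
                while c in used:
                    c += 1
                color[v] = c
        # Phase 2: conflict detection. The higher-numbered endpoint of a
        # conflicting edge is recolored in the next round.
        pending = [v for v in pending
                   if any(w < v and color[w] == color[v] for w in adj[v])]
    return color, rounds


# Complete graph on 4 vertices: the worst case for speculation.
k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
colors, rounds = speculative_coloring(k4, batch=4)
```

The lowest-numbered pending vertex never loses a conflict, so the pending set shrinks every round and the loop terminates. With `batch=1` the sketch degenerates to serial greedy and finishes in one round.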


Parallel Coloring


Parallel Coloring: Speculation

[Figure: two panels with vertex labels a, v, and w]


Experimental results

[Figure: panels Iterative and Dataflow]

Cray XMT: RMAT-G with 2^24, …, 2^27 vertices and 134M, …, 1B edges


Experimental results

Niagara 2, Iterative

Performance with doubling threads on a core = doubling cores!


Experimental results

RMAT-G with 2^24 = 16M vertices and 134M edges

All Platforms

RMAT-B, 2^24 vertices, 134M edges


Iterative Greedy Coloring: Multithreaded Algorithm

Adj(v), color(w), forbidden(v): d(v) reads each
forbidden(v): d(v) writes

Adj(v), color(w): d(v) reads each


Experimental results

RMAT-G with 2^24 = 16M vertices and 134M edges

All Platforms


Tentative Conclusions, Future Work


Future Plans: Multithreaded Coloring

Massive multithreading
◦ Microbenchmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations…
◦ Develop a performance model of the computation
◦ Experiment with other graph classes
◦ Consider new algorithmic paradigms

Many-core machines
◦ Four items as above
◦ Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (Mostofa Patwary and Assefaw Gebremedhin)
◦ Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5


Thanks

Rob Bisseling, Erik Boman, Ümit Çatalyürek, Karen Devine, Florin Dobrian, John Feo, Assefaw Gebremedhin, Mahantesh Halappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, Ali Pınar, Sivan Toledo, Jean Utke


Further reading

www.cscapes.org

Gebremedhin and Manne, Scalable parallel graph coloring algorithms, Concurrency: Practice and Experience, 12: 1131-1146, 2000.

Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002.

Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008.

Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.