Advances in PUMI for High Core Count Machines

Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard
2/27/2013
Scientific Computation Research Center, Rensselaer Polytechnic Institute
Outline
1. Distributed Mesh Data Structure
2. Phased Message Passing
3. Hybrid (MPI/thread) Programming Model
4. Hybrid Phased Message Passing
5. Hybrid Partitioning
6. Hybrid Mesh Migration
Unstructured Mesh Data Structure
[Figure: mesh data structure. A Mesh contains Parts; each Part holds Regions, Faces, Edges, and Vertices, connected by pointers in the data structure.]
Distributed Mesh Representation
Mesh elements are assigned to parts and uniquely identified by handle or global ID. Each part is treated as a serial mesh with the addition of part boundaries.

- Part boundary: groups of mesh entities on shared links between parts
- Remote copy: duplicate entity copy on a non-local part
- Resident part set: list of parts where the entity exists

A process can have multiple parts.
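The resident part set and the part-boundary test can be sketched as a small C struct. The layout and names below are illustrative assumptions, not FMDB/PUMI's actual data structure; a fixed-size resident set stands in for whatever container the real code uses.

```c
#include <assert.h>

#define MAX_PARTS 8

/* Sketch of a part-boundary-aware mesh entity (illustrative only). */
typedef struct {
    int id;                  /* global ID                          */
    int resident[MAX_PARTS]; /* resident part set: parts where the */
    int nresident;           /* entity exists                      */
} Entity;

/* An entity lies on a part boundary iff it resides on > 1 part. */
int on_part_boundary(const Entity *e) {
    return e->nresident > 1;
}

/* Add a part to the resident set, ignoring duplicates. */
void add_resident(Entity *e, int part) {
    for (int i = 0; i < e->nresident; ++i)
        if (e->resident[i] == part)
            return;
    e->resident[e->nresident++] = part;
}
```

With one resident part the entity behaves like a serial-mesh entity; adding a second part is what turns it into a part boundary entity.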
Message Passing
Primitive functional set:
- Size: members in group
- Rank: ID of self in group
- Send: non-blocking synchronous send
- Probe: non-blocking probe
- Receive: blocking receive

Non-blocking barrier (ibarrier) API:
- Call 1: begin ibarrier
- Call 2: wait for ibarrier termination
Used for phased message passing. It will be available in MPI 3; for now a custom solution is used.
ibarrier Implementation
Built entirely from non-blocking point-to-point calls. For N ranks, messages travel lg(N) hops to and from rank 0 (a reduce to rank 0 followed by a broadcast). A separate MPI communicator keeps the barrier traffic isolated.

[Figure: ranks 0-4; a reduce tree converges on rank 0, then a broadcast fans back out.]
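The lg(N) hop pattern can be made concrete with a binomial tree rooted at rank 0. The two helpers below compute a rank's parent and children in that tree; the function names and the exact tree shape are illustrative assumptions, not PUMI's actual implementation.

```c
#include <assert.h>

/* Parent of `rank` in a binomial tree rooted at 0: clear the
 * lowest set bit. Rank 0 has no parent (-1). */
int ibarrier_parent(int rank) {
    if (rank == 0)
        return -1;
    return rank & (rank - 1);
}

/* Fill children[] with the child ranks of `rank` among `size`
 * ranks; return the child count. Children of r are r + 2^k for
 * each power of two below r's lowest set bit (any power for 0). */
int ibarrier_children(int rank, int size, int *children) {
    int n = 0;
    int low = (rank == 0) ? size : (rank & -rank);
    for (int bit = 1; bit < low && rank + bit < size; bit <<= 1)
        children[n++] = rank + bit;
    return n;
}
```

A reduce sends from children toward the parent until rank 0 is reached; the broadcast retraces the same edges outward, giving the lg(N) round trips the slide describes.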
Phased Message Passing
Similar to Bulk Synchronous Parallel; uses the non-blocking barrier.

1. Begin phase
2. Send all messages
3. Receive any messages sent this phase
4. End phase

Benefits:
- Efficient termination detection when neighbors are unknown
- Phases are implicit barriers, which simplifies algorithms
- Allows buffering all messages per rank per phase
Phased Message Passing
Implementation:
1. Post all sends for this phase
2. While local sends are incomplete: receive any message
   (local sends are now complete; remember they are synchronous)
3. Begin the "stopped sending" ibarrier
4. While the ibarrier is incomplete: receive any message
   (all sends are now complete, so receiving can stop)
5. Begin the "stopped receiving" ibarrier
6. While the ibarrier is incomplete: compute
   (all ranks have stopped receiving; it is safe to send the next phase)
7. Repeat

[Figure: timeline of alternating send and receive intervals per phase; the ibarriers are the signal edges between phases.]
Hybrid System
[Figure: on a Blue Gene/Q node, a 4x4 grid of cores maps to a process holding a 4x4 grid of threads.]

*Processes per node and threads per core are variable.
Hybrid Programming System
1. Message passing is the de facto standard programming model for distributed-memory architectures.
2. The classic shared-memory programming model: mutexes, atomic operations, lock-free structures.

Most massively parallel code currently uses model 1. The models are very different, and it is hard to convert from 1 to 2.
Hybrid Programming System
We will try message passing between threads. Threads can send to other threads in the same process and to threads in a different process. This is the same model as MPI, with "process" replaced by "thread", so porting is faster: only the message passing API changes. Shared memory is still exploited; locking is replaced with messages:
Thread 1:          Thread 2:
Write(A)           Lock(lockA)
Release(lockA)     Write(A)

becomes

Thread 1:          Thread 2:
Write(A)           ReceiveFrom(1)
SendTo(2)          Write(A)
Parallel Control Utility
Multi-threading API for hybrid MPI/thread mode:
- Launch a function pointer on N threads
- Get thread ID and number of threads in the process
- Uses pthreads directly

Phased communication API:
- Send messages in batches per phase; detect the end of a phase

Hybrid MPI/thread communication API:
- Uses hybrid ranks and size
- Same phased API; automatically switches to hybrid mode when called within threads

Future: hardware queries by wrapping hwloc (Portable Hardware Locality, http://www.open-mpi.org/projects/hwloc/)
Hybrid Message Passing
Everything is built from primitives, so we need hybrid primitives:
- Size: number of threads on the whole machine
- Rank: machine-unique ID of the thread
- Send, Probe, and Receive using hybrid ranks

[Figure: two processes with process ranks 0 and 1, each holding thread ranks 0-3; the hybrid ranks run 0-3 on the first process and 4-7 on the second.]
Hybrid Message Passing
Initial simple hybrid primitives just wrap the MPI primitives:
- MPI_Init_thread with MPI_THREAD_MULTIPLE
- MPI rank = floor(hybrid rank / threads per process)
- MPI tag bit fields: | from thread | to thread | hybrid tag |

[Figure: software stacks. MPI mode: phased layer and ibarrier on top of the MPI primitives. Hybrid mode: phased layer and ibarrier on top of the hybrid primitives, which wrap the MPI primitives.]
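The rank mapping and tag packing can be sketched directly in C. The field widths chosen here (8 + 8 + 12 bits) are an illustrative assumption, not the widths PUMI uses; a real implementation must also respect MPI_TAG_UB.

```c
#include <assert.h>

/* Hybrid rank <-> (MPI rank, thread rank), for a fixed number of
 * threads per process. */
int mpi_rank_of(int hybrid, int threads_per_proc) {
    return hybrid / threads_per_proc;
}
int thread_rank_of(int hybrid, int threads_per_proc) {
    return hybrid % threads_per_proc;
}

/* Pack (from thread, to thread, user tag) into one MPI tag.
 * Widths are illustrative: 8-bit thread fields, 12-bit tag. */
int pack_tag(int from_thread, int to_thread, int tag) {
    return (from_thread << 20) | (to_thread << 12) | tag;
}
int unpack_from_thread(int packed) { return (packed >> 20) & 0xff; }
int unpack_to_thread(int packed)   { return (packed >> 12) & 0xff; }
int unpack_tag(int packed)         { return packed & 0xfff; }
```

For example, with 4 threads per process, hybrid rank 6 maps to MPI rank 1, thread rank 2, matching the figure above.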
Hybrid Partitioning
Partition the mesh to processes, then partition within each process to threads:
- Map parts to threads, 1-to-1
- Share entities on inter-thread part boundaries

[Figure: four MPI processes, each running four pthreads, with one part per thread.]
Hybrid Partitioning
- Entities are shared within a process: a part boundary entity is created once per process and shared by all local parts
- Only the owning part can modify an entity (this avoids almost all contention)
- Remote copy: duplicate entity copy on another process
- The parallel control utility can provide architecture information to the mesh, which is distributed accordingly

[Figure: partition model. Parts on processes i and j meet at an inter-process boundary carrying remote copies; part boundaries between parts of the same process are implicit.]
Mesh Migration
Moving mesh entities between parts:
- Input: local mesh elements to send to other parts
- Other entities to move are determined by adjacencies

Complex subtasks:
- Reconstructing mesh adjacencies
- Restructuring the partition model
- Recomputing remote copies

Considerations:
- Neighborhoods change: try to maintain scalability despite the loss of communication locality
- How to benefit from shared memory
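The rule that "other entities to move are determined by adjacencies" can be sketched for the simplest case: marking every vertex downward-adjacent to a migrating tetrahedron. This is a simplified illustration only; the real migration also closes over faces and edges and works one dimension at a time.

```c
#include <assert.h>
#include <string.h>

/* Mark every vertex used by a migrating tet. tets[t] lists the
 * 4 vertex indices of tet t; migrate[t] is 1 if tet t moves.
 * Returns how many vertices were marked. */
int mark_downward(int (*tets)[4], int ntets,
                  const int *migrate,
                  int *vert_marked, int nverts) {
    int count = 0;
    memset(vert_marked, 0, nverts * sizeof(int));
    for (int t = 0; t < ntets; ++t) {
        if (!migrate[t])
            continue;
        for (int i = 0; i < 4; ++i) {
            int v = tets[t][i];
            if (!vert_marked[v]) {
                vert_marked[v] = 1;
                ++count;
            }
        }
    }
    return count;
}
```

A vertex shared by a migrating and a non-migrating tet still gets marked: it must exist on the destination part, which is exactly why migration grows part boundaries.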
Mesh Migration: Migration Steps

[Figure: a mesh partitioned among parts P0, P1, P2, with destination part IDs 1 and 2 marked on elements.
(A) Mark destination part IDs.
(B) Get affected entities and compute post-migration residence parts.
(C) Exchange entities and update the part boundary.
(D) Delete migrated entities.]
Hybrid Migration
Shared memory optimizations:
- Thread-to-part matching: use the partition model for concurrency
- Threads handle the part boundary entities they own; other entities are "released"
- Inter-process entity movement: send the entity to one thread per process
- Intra-process entity movement: send a message containing a pointer

[Figure: four threads release their shared entities before movement and grab them back afterward.]
Hybrid Migration
1. Release shared entities
2. Update entity resident part sets
3. Move entities between processes
4. Move entities between threads
5. Grab shared entities

Two-level temporary ownership: Master and Process Master.
- Master: smallest resident part ID
- Process Master: smallest on-process resident part ID
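The two ownership rules are simple minimum computations over the resident part set. The sketch below assumes parts map to processes by part ID divided by parts per process (the 1-part-per-thread layout); the function names are illustrative.

```c
#include <assert.h>

/* Master: the smallest resident part ID overall. Assumes n >= 1. */
int master_part(const int *resident, int n) {
    int m = resident[0];
    for (int i = 1; i < n; ++i)
        if (resident[i] < m)
            m = resident[i];
    return m;
}

/* Process Master: the smallest resident part ID on `process`,
 * or -1 if that process holds no copy. Parts map to processes
 * by part / parts_per_proc (illustrative layout assumption). */
int process_master_part(const int *resident, int n,
                        int process, int parts_per_proc) {
    int m = -1;
    for (int i = 0; i < n; ++i)
        if (resident[i] / parts_per_proc == process &&
            (m == -1 || resident[i] < m))
            m = resident[i];
    return m;
}
```

For the resident part set {5,6,7} with 4 parts per process, the Master is part 5, which is also the Process Master on process 1, while process 0 holds no copy at all.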
Representative phase:
1. The old Master Part sends the entity to the new Process Master Parts (the data to create the copy)
2. Receivers bounce back the addresses of the created entities (the address of the local copy)
3. Senders broadcast the union of all addresses (the addresses of all copies)

[Figure: an entity with old resident parts {1,2,3} and new resident parts {5,6,7}, moving among hybrid ranks 0-7.]
Many subtle complexities:
1. Most steps have to be done one dimension at a time
2. Assigning upward adjacencies causes thread contention
   - Use a separate phase of communication to create them
   - Use another phase to remove them when entities are deleted
3. Assigning downward adjacencies requires addresses on the new process
   - Use a separate phase to gather remote copies
Preliminary Results
- Model: bi-unit cube
- Mesh: 260K tets, 16 parts
- Migration: sort by X coordinate
First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q.

Case 1: 16 MPI ranks, 1 thread per rank
- 18.36 seconds for migration
- 433 MB mesh memory use (sum over all MPI ranks)

Case 2: 1 MPI rank, 16 threads per rank
- 9.62 seconds for migration plus thread create/join
- 157 MB mesh memory use (sum over all threads)
Thank You
Seegyoung Seol: FMDB architect, part boundary sharing
Micah Corah: SCOREC undergraduate, threaded part loading