Advances in PUMI for High Core Count Machines

Dan Ibanez, Micah Corah, Seegyoung Seol, Mark Shephard
2/27/2013
Scientific Computation Research Center, Rensselaer Polytechnic Institute
Outline
1. Distributed Mesh Data Structure
2. Phased Message Passing
3. Hybrid (MPI/thread) Programming Model
4. Hybrid Phased Message Passing
5. Hybrid Partitioning
6. Hybrid Mesh Migration
Unstructured Mesh Data Structure
[Figure: mesh data structure. A Mesh contains Parts; each Part holds Regions, Faces, Edges, and Vertices, connected by pointers in the data structure.]
Distributed Mesh Representation
Mesh elements are assigned to parts and uniquely identified by handle or global ID. Each part is treated as a serial mesh with the addition of part boundaries.

- Part boundary: groups of mesh entities on shared links between parts
- Remote copy: duplicate entity copy on a non-local part
- Resident part set: list of parts where the entity exists

A process can have multiple parts.
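The resident part set and the part-boundary test can be sketched as a small C struct. The layout and names below are illustrative assumptions, not FMDB/PUMI's actual data structure; a fixed-size resident set stands in for whatever container the real code uses.

```c
#include <assert.h>

#define MAX_PARTS 8

/* Sketch of a part-boundary-aware mesh entity (illustrative only). */
typedef struct {
    int id;                  /* global ID                          */
    int resident[MAX_PARTS]; /* resident part set: parts where the */
    int nresident;           /* entity exists                      */
} Entity;

/* An entity lies on a part boundary iff it resides on > 1 part. */
int on_part_boundary(const Entity *e) {
    return e->nresident > 1;
}

/* Add a part to the resident set, ignoring duplicates. */
void add_resident(Entity *e, int part) {
    for (int i = 0; i < e->nresident; ++i)
        if (e->resident[i] == part)
            return;
    e->resident[e->nresident++] = part;
}
```

With one resident part the entity behaves like a serial-mesh entity; adding a second part is what turns it into a part boundary entity.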
Message Passing
Primitive functional set:
- Size: members in group
- Rank: ID of self in group
- Send: non-blocking synchronous send
- Probe: non-blocking probe
- Receive: blocking receive

Non-blocking barrier (ibarrier) API:
- Call 1: begin ibarrier
- Call 2: wait for ibarrier termination
Used for phased message passing. It will be available in MPI 3; for now a custom solution is used.
ibarrier Implementation
Built entirely from non-blocking point-to-point calls. For N ranks, messages travel lg(N) hops to and from rank 0 (a reduce to rank 0 followed by a broadcast). A separate MPI communicator keeps the barrier traffic isolated.

[Figure: ranks 0-4; a reduce tree converges on rank 0, then a broadcast fans back out.]
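The lg(N) hop pattern can be made concrete with a binomial tree rooted at rank 0. The two helpers below compute a rank's parent and children in that tree; the function names and the exact tree shape are illustrative assumptions, not PUMI's actual implementation.

```c
#include <assert.h>

/* Parent of `rank` in a binomial tree rooted at 0: clear the
 * lowest set bit. Rank 0 has no parent (-1). */
int ibarrier_parent(int rank) {
    if (rank == 0)
        return -1;
    return rank & (rank - 1);
}

/* Fill children[] with the child ranks of `rank` among `size`
 * ranks; return the child count. Children of r are r + 2^k for
 * each power of two below r's lowest set bit (any power for 0). */
int ibarrier_children(int rank, int size, int *children) {
    int n = 0;
    int low = (rank == 0) ? size : (rank & -rank);
    for (int bit = 1; bit < low && rank + bit < size; bit <<= 1)
        children[n++] = rank + bit;
    return n;
}
```

A reduce sends from children toward the parent until rank 0 is reached; the broadcast retraces the same edges outward, giving the lg(N) round trips the slide describes.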
Phased Message Passing
Similar to Bulk Synchronous Parallel; uses the non-blocking barrier.

1. Begin phase
2. Send all messages
3. Receive any messages sent this phase
4. End phase

Benefits:
- Efficient termination detection when neighbors are unknown
- Phases are implicit barriers, which simplifies algorithms
- Allows buffering all messages per rank per phase
Phased Message Passing
Implementation:
1. Post all sends for this phase
2. While local sends are incomplete: receive any message
   (local sends are now complete; remember they are synchronous)
3. Begin the "stopped sending" ibarrier
4. While the ibarrier is incomplete: receive any message
   (all sends are now complete, so receiving can stop)
5. Begin the "stopped receiving" ibarrier
6. While the ibarrier is incomplete: compute
   (all ranks have stopped receiving; it is safe to send the next phase)
7. Repeat

[Figure: timeline of alternating send and receive intervals per phase; the ibarriers are the signal edges between phases.]
Hybrid System
[Figure: on a Blue Gene/Q node, a 4x4 grid of cores maps to a process holding a 4x4 grid of threads.]

*Processes per node and threads per core are variable.
Hybrid Programming System
1. Message passing is the de facto standard programming model for distributed-memory architectures.
2. The classic shared-memory programming model: mutexes, atomic operations, lock-free structures.

Most massively parallel code currently uses model 1. The models are very different, and it is hard to convert from 1 to 2.
Hybrid Programming System
We will try message passing between threads. Threads can send to other threads in the same process and to threads in a different process. This is the same model as MPI, with "process" replaced by "thread", so porting is faster: only the message passing API changes. Shared memory is still exploited; locking is replaced with messages:
Thread 1:          Thread 2:
Write(A)           Lock(lockA)
Release(lockA)     Write(A)

becomes

Thread 1:          Thread 2:
Write(A)           ReceiveFrom(1)
SendTo(2)          Write(A)
Parallel Control Utility
Multi-threading API for hybrid MPI/thread mode:
- Launch a function pointer on N threads
- Get thread ID and number of threads in the process
- Uses pthreads directly

Phased communication API:
- Send messages in batches per phase; detect the end of a phase

Hybrid MPI/thread communication API:
- Uses hybrid ranks and size
- Same phased API; automatically switches to hybrid mode when called within threads

Future: hardware queries by wrapping hwloc (Portable Hardware Locality, http://www.open-mpi.org/projects/hwloc/)
Hybrid Message Passing
Everything is built from primitives, so we need hybrid primitives:
- Size: number of threads on the whole machine
- Rank: machine-unique ID of the thread
- Send, Probe, and Receive using hybrid ranks

[Figure: two processes with process ranks 0 and 1, each holding thread ranks 0-3; the hybrid ranks run 0-3 on the first process and 4-7 on the second.]
Hybrid Message Passing
Initial simple hybrid primitives just wrap the MPI primitives:
- MPI_Init_thread with MPI_THREAD_MULTIPLE
- MPI rank = floor(hybrid rank / threads per process)
- MPI tag bit fields: | from thread | to thread | hybrid tag |

[Figure: software stacks. MPI mode: phased layer and ibarrier on top of the MPI primitives. Hybrid mode: phased layer and ibarrier on top of the hybrid primitives, which wrap the MPI primitives.]
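The rank mapping and tag packing can be sketched directly in C. The field widths chosen here (8 + 8 + 12 bits) are an illustrative assumption, not the widths PUMI uses; a real implementation must also respect MPI_TAG_UB.

```c
#include <assert.h>

/* Hybrid rank <-> (MPI rank, thread rank), for a fixed number of
 * threads per process. */
int mpi_rank_of(int hybrid, int threads_per_proc) {
    return hybrid / threads_per_proc;
}
int thread_rank_of(int hybrid, int threads_per_proc) {
    return hybrid % threads_per_proc;
}

/* Pack (from thread, to thread, user tag) into one MPI tag.
 * Widths are illustrative: 8-bit thread fields, 12-bit tag. */
int pack_tag(int from_thread, int to_thread, int tag) {
    return (from_thread << 20) | (to_thread << 12) | tag;
}
int unpack_from_thread(int packed) { return (packed >> 20) & 0xff; }
int unpack_to_thread(int packed)   { return (packed >> 12) & 0xff; }
int unpack_tag(int packed)         { return packed & 0xfff; }
```

For example, with 4 threads per process, hybrid rank 6 maps to MPI rank 1, thread rank 2, matching the figure above.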
Hybrid Partitioning
Partition the mesh to processes, then partition within each process to threads:
- Map parts to threads, 1-to-1
- Share entities on inter-thread part boundaries

[Figure: four MPI processes, each running four pthreads, with one part per thread.]
Hybrid Partitioning
- Entities are shared within a process: a part boundary entity is created once per process and shared by all local parts
- Only the owning part can modify an entity (this avoids almost all contention)
- Remote copy: duplicate entity copy on another process
- The parallel control utility can provide architecture information to the mesh, which is distributed accordingly

[Figure: partition model. Parts on processes i and j meet at an inter-process boundary carrying remote copies; part boundaries between parts of the same process are implicit.]
Mesh Migration
Moving mesh entities between parts:
- Input: local mesh elements to send to other parts
- Other entities to move are determined by adjacencies

Complex subtasks:
- Reconstructing mesh adjacencies
- Restructuring the partition model
- Recomputing remote copies

Considerations:
- Neighborhoods change: try to maintain scalability despite the loss of communication locality
- How to benefit from shared memory
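The rule that "other entities to move are determined by adjacencies" can be sketched for the simplest case: marking every vertex downward-adjacent to a migrating tetrahedron. This is a simplified illustration only; the real migration also closes over faces and edges and works one dimension at a time.

```c
#include <assert.h>
#include <string.h>

/* Mark every vertex used by a migrating tet. tets[t] lists the
 * 4 vertex indices of tet t; migrate[t] is 1 if tet t moves.
 * Returns how many vertices were marked. */
int mark_downward(int (*tets)[4], int ntets,
                  const int *migrate,
                  int *vert_marked, int nverts) {
    int count = 0;
    memset(vert_marked, 0, nverts * sizeof(int));
    for (int t = 0; t < ntets; ++t) {
        if (!migrate[t])
            continue;
        for (int i = 0; i < 4; ++i) {
            int v = tets[t][i];
            if (!vert_marked[v]) {
                vert_marked[v] = 1;
                ++count;
            }
        }
    }
    return count;
}
```

A vertex shared by a migrating and a non-migrating tet still gets marked: it must exist on the destination part, which is exactly why migration grows part boundaries.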
Mesh Migration: Migration Steps

[Figure: a mesh partitioned among parts P0, P1, P2, with destination part IDs 1 and 2 marked on elements.
(A) Mark destination part IDs.
(B) Get affected entities and compute post-migration residence parts.
(C) Exchange entities and update the part boundary.
(D) Delete migrated entities.]
Hybrid Migration
Shared memory optimizations:
- Thread-to-part matching: use the partition model for concurrency
- Threads handle the part boundary entities they own; other entities are "released"
- Inter-process entity movement: send the entity to one thread per process
- Intra-process entity movement: send a message containing a pointer

[Figure: four threads release their shared entities before movement and grab them back afterward.]
Hybrid Migration
1. Release shared entities
2. Update entity resident part sets
3. Move entities between processes
4. Move entities between threads
5. Grab shared entities

Two-level temporary ownership: Master and Process Master.
- Master: smallest resident part ID
- Process Master: smallest on-process resident part ID
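The two ownership rules are simple minimum computations over the resident part set. The sketch below assumes parts map to processes by part ID divided by parts per process (the 1-part-per-thread layout); the function names are illustrative.

```c
#include <assert.h>

/* Master: the smallest resident part ID overall. Assumes n >= 1. */
int master_part(const int *resident, int n) {
    int m = resident[0];
    for (int i = 1; i < n; ++i)
        if (resident[i] < m)
            m = resident[i];
    return m;
}

/* Process Master: the smallest resident part ID on `process`,
 * or -1 if that process holds no copy. Parts map to processes
 * by part / parts_per_proc (illustrative layout assumption). */
int process_master_part(const int *resident, int n,
                        int process, int parts_per_proc) {
    int m = -1;
    for (int i = 0; i < n; ++i)
        if (resident[i] / parts_per_proc == process &&
            (m == -1 || resident[i] < m))
            m = resident[i];
    return m;
}
```

For the resident part set {5,6,7} with 4 parts per process, the Master is part 5, which is also the Process Master on process 1, while process 0 holds no copy at all.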
Representative phase:
1. The old Master Part sends the entity to the new Process Master Parts (the data to create the copy)
2. Receivers bounce back the addresses of the created entities (the address of the local copy)
3. Senders broadcast the union of all addresses (the addresses of all copies)

[Figure: an entity with old resident parts {1,2,3} and new resident parts {5,6,7}, moving among hybrid ranks 0-7.]
Many subtle complexities:
1. Most steps have to be done one dimension at a time
2. Assigning upward adjacencies causes thread contention
   - Use a separate phase of communication to create them
   - Use another phase to remove them when entities are deleted
3. Assigning downward adjacencies requires addresses on the new process
   - Use a separate phase to gather remote copies
Preliminary Results
- Model: bi-unit cube
- Mesh: 260K tets, 16 parts
- Migration: sort by X coordinate
First test of the hybrid algorithm, using 1 node of the CCNI Blue Gene/Q.

Case 1: 16 MPI ranks, 1 thread per rank
- 18.36 seconds for migration
- 433 MB mesh memory use (sum over all MPI ranks)

Case 2: 1 MPI rank, 16 threads per rank
- 9.62 seconds for migration plus thread create/join
- 157 MB mesh memory use (sum over all threads)
Thank You
Seegyoung Seol: FMDB architect, part boundary sharing
Micah Corah: SCOREC undergraduate, threaded part loading