
High-Level, One-Sided Models on MPI: A Case Study with Global Arrays and NWChem


James Dinan, Pavan Balaji, Jeff R. Hammond (ANL); Sriram Krishnamoorthy (PNNL); and Vinod Tipparaju (AMD)

Global Arrays (GA) is a popular high-level parallel programming model that provides data and computation management facilities to the NWChem computational chemistry suite. GA's global-view data model is supported by the ARMCI partitioned global address space runtime system, which traditionally is implemented natively on each supported platform in order to provide the best performance. The industry-standard Message Passing Interface (MPI) also provides one-sided functionality and is available on virtually every supercomputing system. We present the first high-performance, portable implementation of ARMCI using MPI one-sided communication. We interface the existing GA infrastructure with ARMCI-MPI and demonstrate that this approach reduces the amount of resources consumed by the runtime system, provides comparable performance, and enhances portability for applications like NWChem.

Global Arrays

GA allows the programmer to create large, multidimensional shared arrays that span the memory of multiple nodes in a distributed-memory system. The programmer interacts with GA through asynchronous, one-sided Get, Put, and Accumulate data movement operations as well as through high-level mathematical routines provided by GA. Accesses to a global array are specified through high-level array indices; GA translates these high-level operations into low-level communication operations that may target multiple nodes depending on the underlying data distribution.

GA's low-level communication is managed by the Aggregate Remote Memory Copy Interface (ARMCI) one-sided communication runtime system. Porting and tuning of GA is accomplished through ARMCI and traditionally involves natively implementing its core operations. Because of its sophistication, creating an efficient and scalable implementation of ARMCI for a new system requires expertise in that system and significant effort. In this work, we present the first implementation of ARMCI on MPI's portable one-sided communication interface.
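A minimal sketch of this usage model with the GA C interface (the array shape, name, and index ranges are illustrative, and error handling is omitted):

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();
    MA_init(C_DBL, 1000000, 1000000);     /* stack/heap for GA's internal allocator */

    /* Create a 1000x1000 double-precision global array distributed across
       all processes; chunk = {-1,-1} lets GA choose the distribution. */
    int dims[2]  = {1000, 1000};
    int chunk[2] = {-1, -1};
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);

    /* One-sided put of a local 10x10 patch into the global index range
       [100:109, 200:209]; the target data may reside on any remote node. */
    double buf[10][10] = {{0.0}};
    int lo[2] = {100, 200}, hi[2] = {109, 209}, ld[1] = {10};
    NGA_Put(g_a, lo, hi, buf, ld);

    GA_Sync();        /* collective: all outstanding GA operations complete */
    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}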

MPI Global Memory Regions

                  ARMCI Absolute Process ID
Allocation ID      0        1       ...      N
      0          0x6b7    0x7af     ...    0x0c1
      1          0x9d9    0x0       ...    0x0
      2          0x611    0x38b     ...    0x659
      3          0xa63    0x3f8     ...    0x0

MPI global memory regions (GMR) align MPI's remote memory access (RMA) window semantics with ARMCI's PGAS model.
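A sketch of how a GMR allocation might be backed by an MPI-2 RMA window (not the actual ARMCI-MPI code; gmr_create and its arguments are hypothetical):

#include <stdlib.h>
#include <mpi.h>

/* Each process allocates its local segment, exposes it through a window
   created over the allocation's group, and records the base address of
   every rank's segment for later address translation. */
void *gmr_create(MPI_Comm comm, MPI_Aint size, MPI_Win *win, void ***bases)
{
    int nproc;
    void *base;

    MPI_Comm_size(comm, &nproc);
    MPI_Alloc_mem(size, MPI_INFO_NULL, &base);
    MPI_Win_create(base, size, 1 /* byte displacements */,
                   MPI_INFO_NULL, comm, win);

    /* Entry i holds rank i's local base pointer for this allocation. */
    *bases = malloc(nproc * sizeof(void *));
    MPI_Allgather(&base, sizeof(void *), MPI_BYTE,
                  *bases, sizeof(void *), MPI_BYTE, comm);
    return base;
}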

GMR provides the following functionality:

MPI One-Sided Communication

[Figure: a GA_Put on a distributed global array is translated into ARMCI_PutS strided operations that target processes 0-3 according to the data distribution.]

Noncontiguous Communication

GA operations frequently access array regions that are not contiguous in memory. These operations are translated into ARMCI strided and generalized I/O vector communication operations. We have provided four implementations for ARMCI's strided interface:

Direct: MPI subarray datatypes are generated to describe local and remote buffers; a single operation is issued.
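A sketch of one way the Direct method can be realized for a 2-D double-precision patch (the function name and argument layout are illustrative):

#include <mpi.h>

void put_strided_direct(void *org_base, int org_full[2], int org_start[2],
                        MPI_Aint tgt_disp, int tgt_full[2], int tgt_start[2],
                        int sub[2], int target_rank, MPI_Win win)
{
    MPI_Datatype org_type, tgt_type;

    /* Subarray datatypes describe the noncontiguous patch on both sides. */
    MPI_Type_create_subarray(2, org_full, sub, org_start,
                             MPI_ORDER_C, MPI_DOUBLE, &org_type);
    MPI_Type_create_subarray(2, tgt_full, sub, tgt_start,
                             MPI_ORDER_C, MPI_DOUBLE, &tgt_type);
    MPI_Type_commit(&org_type);
    MPI_Type_commit(&tgt_type);

    /* A single one-sided operation transfers the entire strided region;
       tgt_disp is the window displacement of the target array's base. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
    MPI_Put(org_base, 1, org_type, target_rank, tgt_disp, 1, tgt_type, win);
    MPI_Win_unlock(target_rank, win);

    MPI_Type_free(&org_type);
    MPI_Type_free(&tgt_type);
}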

Communication Performance

[Plots: contiguous transfer bandwidth and noncontiguous transfer bandwidth]

NWChem Performance

Performance Evaluation

1. Allocation and mapping of shared segments.
2. Translation between ARMCI shared addresses and MPI windows and window displacements.
3. Translation between ARMCI absolute ranks used in communication and MPI window ranks.
4. Management of MPI RMA semantics for direct load/store accesses to shared data. Direct load/store access to shared regions in MPI must be protected with MPI_Win_lock/unlock operations.
5. Guarding of shared buffers: when a shared buffer is provided as the local buffer for an ARMCI communication operation, GMR must perform appropriate locking and copy operations in order to preserve multiple MPI rules, including the constraint that a window cannot be locked for more than one target at any given time.
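For items 2 and 3, a minimal sketch of the translation, assuming a hypothetical per-allocation descriptor gmr_t (the real ARMCI-MPI metadata differs):

#include <mpi.h>

typedef struct {
    MPI_Win   window;     /* window backing this allocation          */
    void    **base;       /* base pointer of each rank's segment     */
    MPI_Comm  comm;       /* communicator the window was created on  */
} gmr_t;

/* Map an ARMCI shared address and absolute rank to (window rank,
   displacement) for use in MPI_Put/MPI_Get/MPI_Accumulate. */
int gmr_translate(gmr_t *mreg, void *armci_addr, int armci_rank,
                  int *win_rank, MPI_Aint *disp)
{
    /* Absolute ARMCI rank -> rank within the window's communicator. */
    MPI_Group world_grp, win_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
    MPI_Comm_group(mreg->comm, &win_grp);
    MPI_Group_translate_ranks(world_grp, 1, &armci_rank, win_grp, win_rank);
    MPI_Group_free(&world_grp);
    MPI_Group_free(&win_grp);

    /* Shared address -> byte displacement within the target's segment. */
    *disp = (MPI_Aint)((char *)armci_addr - (char *)mreg->base[*win_rank]);
    return MPI_SUCCESS;
}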

[Figure: MPI passive-target access epoch. Rank 0 locks a window on Rank 1, issues Put(X) and Get(Y), and unlocks; the operations complete at unlock, when the target's public and private window copies are synchronized.]

MPI provides an interface for one-sided communication and encapsulates shared regions of memory in window objects. Windows are used to guarantee data consistency and portability across a wide range of platforms, including those with noncoherent memory access. This is accomplished through public and private copies, which are synchronized through access epochs. To ensure portability, windows have several important properties (a usage sketch follows the list below):

1. Concurrent, conflicting accesses are erroneous.
2. All access to the window data must occur within an epoch.
3. Put, get, and accumulate operations are non-blocking and occur concurrently in an epoch.
4. Only one epoch is allowed per window at any given time.
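A minimal sketch of a passive-target epoch that follows these rules, mirroring the figure above (the window, target rank, and displacements are assumed to be set up elsewhere; X and Y must not overlap at the target):

#include <mpi.h>

void update_and_read(MPI_Win win, int target,
                     double *x, MPI_Aint x_disp, int nx,
                     double *y, MPI_Aint y_disp, int ny)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);    /* open epoch */

    /* Nonblocking one-sided operations; they may complete in any order
       and must not conflict within the epoch. */
    MPI_Put(x, nx, MPI_DOUBLE, target, x_disp, nx, MPI_DOUBLE, win);
    MPI_Get(y, ny, MPI_DOUBLE, target, y_disp, ny, MPI_DOUBLE, win);

    /* Close epoch: both operations complete and the public and private
       copies of the target window are synchronized. */
    MPI_Win_unlock(target, win);
}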

• Experiments were conducted on a cross-section of leadership-class architectures

• Mature implementations of ARMCI on BG/P, InfiniBand, and XT (GA 5.0.1)

• Development implementation of ARMCI on Cray XE6

• NWChem performance study: computational modeling of a water pentamer

System                 Nodes    Cores/Node   Memory/Node   Interconnect   MPI
IBM BG/P (Intrepid)    40,960   1 x 4        2 GB          3D Torus       IBM MPI
InfiniBand (Fusion)    320      2 x 4        36 GB         IB QDR         MVAPICH2 1.6
Cray XT (Jaguar)       18,688   2 x 6        16 GB         Seastar 2+     Cray MPI
Cray XE6 (Hopper)      6,392    2 x 12       32 GB         Gemini         Cray MPI

[Plot panels: NWChem water pentamer performance on Blue Gene/P and Cray XE]

• Performance and scaling are comparable on most architectures; MPI tuning would further reduce overheads.

• Performance of ARMCI-MPI on the Cray XE6 is 30% higher than the in-development native port, and scaling is better

• Native ARMCI on Cray XE6 is a work in progress; ARMCI-MPI enables early use of GA/NWChem on such new platforms

[Plot panels: communication bandwidth on the InfiniBand cluster and Cray XE]

• ARMCI-MPI performance is comparable to native, but tuning is needed, especially on Cray XE

• Contiguous performance of MPI-RMA is nearly twice that of native ARMCI on Cray XE

• Direct strided performs best, except on BG/P, where IOV-Batched is better due to pack/unpack throughput

IOV-Conservative: Strided transfer is converted to an I/O vector and each operation is issued within its own epoch.

IOV-Batched: Strided transfer is converted to an I/O vector and N operations are issued within an epoch.

IOV-Direct: Strided transfer is converted to an I/O vector and an MPI indexed datatype is generated to describe both the local and remote buffers; a single operation is issued.
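A sketch of the IOV-Batched scheme for an I/O vector of equal-sized contiguous chunks (BATCH_SIZE and the function signature are illustrative):

#include <mpi.h>

#define BATCH_SIZE 64   /* maximum operations issued per epoch */

void put_iov_batched(void **src, MPI_Aint *dst_disp, int bytes,
                     int count, int target, MPI_Win win)
{
    for (int i = 0; i < count; ) {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);   /* open epoch */
        for (int n = 0; n < BATCH_SIZE && i < count; n++, i++)
            MPI_Put(src[i], bytes, MPI_BYTE, target,
                    dst_disp[i], bytes, MPI_BYTE, win);
        MPI_Win_unlock(target, win);     /* complete this batch of puts */
    }
}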

Additional Information: J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-Sided Communication," Preprint ANL/MCS-P1878-0411, April 2011. Online: http://www.mcs.anl.gov/publications/paper_detail.php?id=1535


Allocation Metadata

/* Metadata that GMR maintains for each shared allocation */
MPI_Win      window;        /* MPI window exposing the allocation            */
int          size[nproc];   /* size of each process's local segment          */
ARMCI_Group  grp;           /* ARMCI group over which the array is allocated */
ARMCI_Mutex  rmw;           /* mutex used for read-modify-write operations   */
...