PDCS-2000 November 9, 2000
A Generalized Portable SHMEM Library
Krzysztof Parzyszek, Ames Laboratory
Jarek Nieplocha, Pacific Northwest National Laboratory
Ricky Kendall, Ames Laboratory
PDCS-2000, Pacific Northwest National Laboratory / Ames Laboratory
Overview
Introduction: global address space programming model, one-sided communication
Cray SHMEM
GPSHMEM: Generalized Portable SHMEM
Implementation approach
Experimental results
Conclusions
Global Address Space and 1-Sided Communication
[Figure: the global address space is the collection of address spaces of the processes in a parallel job; a global address is a pair (address, pid), e.g. (0xf5670, P0) or (0xf32674, P5)]

Message passing requires a matching send on P0 and receive on P1; one-sided communication needs only a put issued by P0, with no cooperating call on P1.
Communication model
hardware examples: Cray T3E, Fujitsu VPP5000
language support: Co-Array Fortran, UPC
Motivation: global address space versus other programming models
                        shared memory     message passing         global address space
data view               shared            distributed             distributed
data locality           obscured          explicit                explicit
access to data /
ease of use             simplest (a=b)    hardest (send-receive)  simple (put/get)
scalable performance    limited           very good               very good
One-sided communication interfaces
First commercial implementation: SHMEM on the Cray T3D
  put, get, scatter, gather, atomic swap
  memory consistency issues (solved on the T3E)
  maps well to the Cray T3E hardware: excellent application performance
Vendor-specific interfaces
  IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
Portable interfaces
  MPI-2 1-sided (related but rather restrictive model)
  ARMCI one-sided communication library
  SHMEM (some platforms)
  GPSHMEM: the first fully portable implementation of SHMEM
History of SHMEM
Introduced on the Cray T3D in 1993
  one-sided operations: put, get, scatter, gather, atomic swap
  collective operations: synchronization, reduction
  cache not coherent w.r.t. SHMEM operations (problem solved on the T3E)
  highest level of performance on any MPP at that time
Increased availability
  SGI, after purchasing Cray, ported it to IRIX systems and Cray vector systems,
  but not always with full functionality (no atomic ops on vector systems such as the Cray J90)
  extensions to cover more datatypes: the SHMEM API is datatype oriented
  HPVM project, led by Andrew Chien (UIUC/UCSD), ported and extended a subset of SHMEM on top of Fast Messages for Linux (later dropped) and Windows clusters
  Quadrics/Compaq port to Elan, available on Linux and Tru64 clusters with the QSW switch
  subset on top of LAPI for the IBM SP: an internal porting tool by the IBM ACTS group at Watson
Characteristics of SHMEM
Memory addressability: symmetric objects
  stack, heap allocation on the T3D
  Cray memory allocation routine shmalloc
Ordering of operations
  ordered in the original version on the T3D
  out-of-order on the T3E (adaptive routing); shmem_quiet added
Progress rules
  fully one-sided: no explicit or implicit polling by the remote node
  much simpler model than MPI-2 1-sided: no redundant locking or remote process cooperation
[Figure: P1 issues shmem_put(a, b, n, 0), writing its local array b into the symmetric object a on P0]
GPSHMEM
Full interface of the Cray T3D SHMEM version
Ordering of operations
Portability restriction: must use shmalloc for memory allocation
Extensions for block-strided data transfers
  the original Cray strided interface operates on single elements
  GPSHMEM: shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc)
[Figure: the Cray SHMEM shmem_iget transfers single elements, while the GPSHMEM shmem_strided_get transfers nblock blocks of nbytes each, with local stride lstride, from remote buffer prem into local buffer ploc]
GPSHMEM implementation approach
[Layering: the SHMEM interfaces (one-sided operations, collective operations, run-time support) are implemented on top of ARMCI and a message-passing library (MPI, PVM), which in turn rest on platform-specific communication interfaces (active messages, RMC, threads, shared memory)]
ARMCI portable 1-sided communication library
Functionality
  put, get, accumulate (also with noncontiguous interfaces)
  atomic read-modify-write, mutexes and locks
  memory allocation operations
Characteristics
  simple progress rules: truly one-sided operations
  ordered w.r.t. target (ease of use)
  compatible with message-passing libraries (MPI, PVM)
  low-level system, no Fortran API
Portability
  MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses the vendors' 1-sided ops)
  clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP)
  large servers with shared memory: SGI, Sun, Cray SV1, HP
Multiprotocols in ARMCI (IBM SP example)

[Diagram: between SMP nodes ARMCI uses Active Messages, threads, and remote memory copy; within a node it uses shared memory with process/thread synchronization]
AMs are used for noncontiguous transfers and atomic operations
ARMCI_Malloc() places all the user's data in shared memory
[Figure: bandwidth (MB/s) vs. dimension of a square patch, comparing AMs+RMC against RMC only; and bandwidth (MB/s) vs. message size in bytes, comparing shared memory, LAPI remote, and LAPI SMP]
Experience
Performance studies
  GPSHMEM overhead over SHMEM on the Cray T3E
  comparison to MPI-2 1-sided on the Fujitsu VX-4
Applications (see paper)
  matrix multiplication on a Linux cluster
  porting Cray T3E codes
GPSHMEM Overhead on the T3E
Approach
  renamed the GPSHMEM calls to avoid conflicts with Cray SHMEM
  collected latency and bandwidth numbers
Overhead
  shmem_put: 3.5 µs; shmem_get: 3 µs
  bandwidth is the same, since GPSHMEM and ARMCI do not add extra memory copies
Discussion
  the overhead includes both GPSHMEM and ARMCI and reflects address conversion:
  searching the table of addresses of allocated objects, which can be avoided when addresses are identical
[Figure: GPSHMEM layered on top of ARMCI, measured against native Cray SHMEM]
Performance of GPSHMEM and MPI-2 on the Fujitsu VX-4
[Figure: bandwidth (MB/s) vs. message size from 1 byte to 10 MB, comparing shmem_strided_get, MPI_Get with a strided datatype, multiple MPI_Get calls, and multiple shmem_get calls]
Conclusions
Described a fully portable implementation of a SHMEM-like library
  SHMEM becomes a viable alternative to MPI-2 1-sided
  good performance, closely tied to ARMCI
  offers potential wide portability to other tools based on SHMEM, e.g. Co-Array Fortran
The Cray SHMEM API is incomplete for strided data structures
  extensions for block-strided transfers improve performance
More work with applications is needed to drive future extensions and development
Code availability: [email protected]