PDCS-2000 November 9, 2000
A Generalized Portable SHMEM Library
Krzysztof Parzyszek, Ames Laboratory
Jarek Nieplocha, Pacific Northwest National Laboratory
Ricky Kendall, Ames Laboratory
PDCS-2000, Pacific Northwest National Laboratory / Ames Laboratory
Overview
Introduction: global address space programming model, one-sided communication
Cray SHMEM
GPSHMEM: Generalized Portable SHMEM
Implementation approach
Experimental results
Conclusions
Global Address Space and 1-Sided Communication
[Figure: the global address space is the collection of address spaces of the processes in a parallel job; a global address is a pair (address, pid), e.g. (0xf5670, P0) or (0xf32674, P5)]

Message passing requires a matching send on P0 and receive on P1; one-sided communication needs only a put issued by P0, with no cooperating call on P1.
Communication model
hardware examples: Cray T3E, Fujitsu VPP5000
language support: Co-Array Fortran, UPC
Motivation: global address space versus other programming models
                        shared memory     message passing         global address space
data view               shared            distributed             distributed
data locality           obscured          explicit                explicit
access to data /
ease of use             simplest (a=b)    hardest (send-receive)  simple (put/get)
scalable performance    limited           very good               very good
One-sided communication interfaces
First commercial implementation: SHMEM on the Cray T3D
  put, get, scatter, gather, atomic swap
  memory consistency issues (solved on the T3E)
  maps well to the Cray T3E hardware: excellent application performance
Vendor-specific interfaces
  IBM LAPI, Fujitsu MPlib, NEC Parlib/CJ, Hitachi RDMA, Quadrics Elan
Portable interfaces
  MPI-2 1-sided (related but rather restrictive model)
  ARMCI one-sided communication library
  SHMEM (some platforms)
  GPSHMEM: the first fully portable implementation of SHMEM
History of SHMEM
Introduced on the Cray T3D in 1993
  one-sided operations: put, get, scatter, gather, atomic swap
  collective operations: synchronization, reduction
  cache not coherent w.r.t. SHMEM operations (problem solved on the T3E)
  highest level of performance on any MPP at that time
Increased availability
  SGI, after purchasing Cray, ported it to IRIX systems and Cray vector systems,
  but not always with full functionality (no atomic ops on vector systems such as the Cray J90)
  extensions to cover more datatypes: the SHMEM API is datatype oriented
  HPVM project, led by Andrew Chien (UIUC/UCSD), ported and extended a subset of SHMEM on top of Fast Messages for Linux (later dropped) and Windows clusters
  Quadrics/Compaq port to Elan, available on Linux and Tru64 clusters with the QSW switch
  subset on top of LAPI for the IBM SP: an internal porting tool by the IBM ACTS group at Watson
Characteristics of SHMEM
Memory addressability: symmetric objects
  stack, heap allocation on the T3D
  Cray memory allocation routine shmalloc
Ordering of operations
  ordered in the original version on the T3D
  out-of-order on the T3E (adaptive routing); shmem_quiet added
Progress rules
  fully one-sided: no explicit or implicit polling by the remote node
  much simpler model than MPI-2 1-sided: no redundant locking or remote process cooperation
[Figure: P1 issues shmem_put(a, b, n, 0), writing its local array b into the symmetric object a on P0]
GPSHMEM
Full interface of the Cray T3D SHMEM version
Ordering of operations
Portability restriction: must use shmalloc for memory allocation
Extensions for block-strided data transfers
  the original Cray strided interface operates on single elements
  GPSHMEM: shmem_strided_get(prem, ploc, rstride, lstride, nbytes, nblock, proc)
[Figure: the Cray SHMEM shmem_iget transfers single elements, while the GPSHMEM shmem_strided_get transfers nblock blocks of nbytes each, with local stride lstride, from remote buffer prem into local buffer ploc]
GPSHMEM implementation approach
[Layering: the SHMEM interfaces (one-sided operations, collective operations, run-time support) are implemented on top of ARMCI and a message-passing library (MPI, PVM), which in turn rest on platform-specific communication interfaces (active messages, RMC, threads, shared memory)]
ARMCI portable 1-sided communication library
Functionality
  put, get, accumulate (also with noncontiguous interfaces)
  atomic read-modify-write, mutexes and locks
  memory allocation operations
Characteristics
  simple progress rules: truly one-sided operations
  ordered w.r.t. target (ease of use)
  compatible with message-passing libraries (MPI, PVM)
  low-level system, no Fortran API
Portability
  MPPs: Cray T3E, Fujitsu VPP, IBM SP (uses the vendors' 1-sided ops)
  clusters of Unix and Windows systems (Myrinet, VIA, TCP/IP)
  large servers with shared memory: SGI, Sun, Cray SV1, HP
Multiprotocols in ARMCI (IBM SP example)

[Diagram: between SMP nodes ARMCI uses Active Messages, threads, and remote memory copy; within a node it uses shared memory with process/thread synchronization]
AMs are used for noncontiguous transfers and atomic operations
ARMCI_Malloc() places all the user's data in shared memory
[Figure: bandwidth (MB/s) vs. dimension of a square patch, comparing AMs+RMC against RMC only; and bandwidth (MB/s) vs. message size in bytes, comparing shared memory, LAPI remote, and LAPI SMP]
Experience
Performance studies
  GPSHMEM overhead over SHMEM on the Cray T3E
  comparison to MPI-2 1-sided on the Fujitsu VX-4
Applications (see paper)
  matrix multiplication on a Linux cluster
  porting Cray T3E codes
GPSHMEM Overhead on the T3E
Approach
  renamed the GPSHMEM calls to avoid conflicts with Cray SHMEM
  collected latency and bandwidth numbers
Overhead
  shmem_put: 3.5 µs; shmem_get: 3 µs
  bandwidth is the same, since GPSHMEM and ARMCI do not add extra memory copies
Discussion
  the overhead includes both GPSHMEM and ARMCI and reflects address conversion:
  searching the table of addresses of allocated objects, which can be avoided when addresses are identical
[Figure: GPSHMEM layered on top of ARMCI, measured against native Cray SHMEM]
Performance of GPSHMEM and MPI-2 on the Fujitsu VX-4
[Figure: bandwidth (MB/s) vs. message size from 1 byte to 10 MB, comparing shmem_strided_get, MPI_Get with a strided datatype, multiple MPI_Get calls, and multiple shmem_get calls]
Conclusions
Described a fully portable implementation of a SHMEM-like library
  SHMEM becomes a viable alternative to MPI-2 1-sided
  good performance, closely tied to ARMCI
  offers potential wide portability to other tools based on SHMEM, e.g. Co-Array Fortran
The Cray SHMEM API is incomplete for strided data structures
  extensions for block-strided transfers improve performance
More work with applications is needed to drive future extensions and development
Code availability: [email protected]