
Page 1:

Programming in Co-Array Fortran

John M. Levesque
CTO Office, Applications
With help from Bob Numrich and Jim Schwarzmeier

Page 2:

Outline

• What is Co-Array Fortran
• Why you need assistance from the compiler
• Co-arrays and the interconnect
• Why co-arrays are better than MPI
• Things to watch out for
• Tricks of the CAF coder
• Results

Page 3:

The Guiding Principle behind Co-Array Fortran

• What is the smallest change required to make Fortran 90 an effective parallel language?
• How can this change be expressed so that it is intuitive and natural for Fortran programmers?
• How can it be expressed so that existing compiler technology can implement it easily and efficiently?

Page 4:

What is Co-Array Syntax?

Co-array syntax is a simple extension to normal Fortran syntax:
• It uses normal rounded brackets ( ) to point to data in local memory.
• It uses square brackets [ ] to point to data in remote memory.
• Syntactic and semantic rules apply separately but equally to ( ) and [ ].
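A minimal illustration of the two bracket kinds (the array name and image index are hypothetical):

  real :: t(10)[*]     ! a co-array: one copy of t(1:10) on every image
  t(3)    = 1.0        ! ( ) alone: element 3 in this image's local memory
  t(3)[5] = 1.0        ! [ ] added: the same element, but in image 5's memory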

Page 5:

Examples of Co-Array Declarations


real    :: s[*]                      ! scalar co-array: one s per image
real    :: a(n)[*]                   ! array co-array
complex :: z[*]
integer :: index(n)[*]
real    :: b(n)[p, *]                ! two co-dimensions
real    :: c(n,m)[0:p, -7:q, 11:*]   ! co-dimensions may carry explicit bounds
real, allocatable :: w(:)[:]         ! allocatable co-array
type(field) :: maxwell[p,*]          ! co-array of derived type

Page 6:

CAF Memory Model

[Diagram: every image holds its own copy of X(1:N); from image P the copy on image Q is addressed as X(:)[q], and from image Q the copy on P is X(:)[p].]

Page 7:

One-to-One Model

[Same diagram as the CAF memory model, with one physical processor per image.]

Page 8:

MANY-to-One Model (OpenMP on Node)

[Same diagram, with many physical processors (e.g., OpenMP threads) behind each image.]

Page 9:

One-to-One Model (multiple images on Node)

[Same diagram, with several images sharing a node and one physical processor per image.]

Page 10:

What Do Co-Dimensions Mean?

real :: x(n)[p,q,*]

• Replicate an array of length n, one on each image.
• Build a map so each image knows how to find the array on any other image.
• Organize images in a logical (not physical) three-dimensional grid.
• The last co-dimension acts like an assumed-size array: * = num_images()/(p*q)

A specific implementation could choose to represent the memory hierarchy through the co-dimensions.
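As a minimal sketch of what that map does (the arithmetic shown is illustrative, assuming p*q divides num_images()), an image can recover its own co-subscripts from its linear index:

  me = this_image()            ! linear image index, 1..num_images()
  ip = mod(me-1, p) + 1        ! first co-subscript of this image
  iq = mod((me-1)/p, q) + 1    ! second co-subscript
  ir = (me-1)/(p*q) + 1        ! last (assumed-size) co-subscript

The intrinsic this_image(x) returns exactly these co-subscripts for co-array x.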

Page 11:

The CAF Execution Model

• The number of images is fixed, and each image has its own index, retrievable at run time:
    1 <= num_images()
    1 <= this_image() <= num_images()
• Each image executes the same program independently of the others.
• The programmer inserts explicit synchronization and branching as needed.
• An "object" has the same name in each image.
• Each image works on its own local data.
• An image moves remote data to local data through, and only through, explicit CAF syntax.
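A minimal SPMD skeleton under this model (illustrative only; it uses the call sync_all() form that appears in the examples later in this deck):

  program caf_skeleton
    implicit none
    real    :: x(100)[*]      ! one copy of x on every image
    integer :: me, np
    me = this_image()
    np = num_images()
    x  = real(me)             ! each image works on its own local data
    call sync_all()           ! explicit synchronization before remote access
    if (me == 1) print *, 'x(1) on the last image =', x(1)[np]
  end program caf_skeleton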

Page 12:

Co-Array Fortran Extension

Incorporate the SPMD model into Fortran 90:
• Multiple images of the same program
• Text and data are replicated in each image

Mark some variables with co-dimensions:
• Co-dimensions behave like normal dimensions
• Co-dimensions express a logical problem decomposition
• One-sided data exchange between co-arrays using a Fortran-like syntax

Require the underlying run-time system to map the logical problem decomposition onto specific hardware.


Page 14:

Communication Using CAF Syntax

y(:) = x(:)[p]                 ! get: copy x from image p into local y
myIndex(:) = index(:)          ! purely local copy
yourIndex(:) = index(:)[you]   ! get the index array from image you
x(index(:)) = y[index(:)]      ! gather the scalar y from the images listed in index
x(:)[q] = x(:) + x(:)[p]       ! put: store a combined result onto image q

An absent co-dimension defaults to the local object.

Page 15:

Irregular and Changing Data Structures

[Diagram: a co-array Z of derived type with a pointer component; Z%ptr reaches local data X on each image, and Z[p]%ptr reaches the corresponding X on image p.]

Page 16:

Co-Array Fortran

Can be implemented:
• Directly in the compiler. On systems where the compiler can issue memory fetches and stores directly to a remote processor's memory, the statement becomes a simple remote store. This:
  - allows co-array references in a loop to be combined into a vector load or store
  - allows the compiler to use its normal prefetch mechanism to move fetches ahead of the reference
• Via a pre-processor. Rice University is currently working on such a translator, which generates subroutine calls for transferring data to the remote processor. It is significantly more difficult to get performance better than MPI this way.

Page 17:

Importance of Vectorizing the Loop with the CAF Reference

   7.           iz = this_image(a)
   8.  V----<   do ix = 1, kx
   9.  V r--<     do iy = 1, ky
  10.  V r          a(ix,iy) = b(iy,iz)[ix]
  11.  V r-->     end do
  12.  V---->   end do

ftn-3021 ftn: INLINE File = data_distro.f90, Line = 7
  Routine _THIS_IMAGE3 was not inlined because the compiler was unable to
  locate the routine to expand it inline.
ftn-6204 ftn: VECTOR File = data_distro.f90, Line = 8
  A loop starting at line 8 was vectorized.
ftn-6005 ftn: SCALAR File = data_distro.f90, Line = 9
  A loop starting at line 9 was unrolled 4 times.
ftn-6208 ftn: VECTOR File = data_distro.f90, Line = 9
  A loop starting at line 9 was vectorized as part of the loop starting at line 8.

Page 18:

Another Example

  629.  V------<  do im = 1, 10000
  630.  V           if(blockid.eq.imon_in(4,im) .and.
  631.  V      &        ibegin(sx)  .le.imon_in(1,im) .and.
  632.  V      &        ibegin(sx+1).gt.imon_in(1,im) .and.
  633.  V      &        jbegin(sy)  .le.imon_in(2,im) .and.
  634.  V      &        jbegin(sy+1).gt.imon_in(2,im) .and.
  635.  V      &        kbegin(sz)  .le.imon_in(3,im) .and.
  636.  V      &        kbegin(sz+1).gt.imon_in(3,im)) then
  637.  V             num_mon_me = num_mon_me+1
  638.  V             lmon(im) = .true.
  639.  V             proc_mon[ioid]%array(im) = procid_global
  640.  V           endif
  641.  V------>  end do

ftn-6375 ftn: VECTOR File = main_3d.f, Line = 629
  A loop starting at line 629 would benefit from "!dir$ safe_address".
ftn-6204 ftn: VECTOR File = main_3d.f, Line = 629
  A loop starting at line 629 was vectorized.

Page 19:

Special features of Baker relating to CAF/UPC

On the X1, X1E, and 'BlackWidow', the custom processor directly emits addresses for any memory location in the machine; scalar or vector loads/stores can be done to any global address in the system.

On Baker, the Gemini NIC is used to 'extend' the address space of Opteron references to access memory on remote nodes:
• The Fortran and C compilers recognize CAF references, x(i)[dest_pe], or UPC 'shared' references, x[i][threads], and generate the appropriate ncHT messages to Gemini to load from or store to remote memory
• Users can stride on local offsets or across processor space with any stride, including gather/scatter
• The compiler should generate vector requests as appropriate

Page 20:

Things to Watch Out For

Typically one must use CAF on symmetric arrays, i.e., arrays whose virtual address is the same on all processors. This is typically arranged by allocating an array as a co-array:
• Static arrays can be used
• Allocatable arrays can be used*
• Automatic arrays can be used*

* These can be costly: it takes time to allocate a symmetric array across all processors (a sketch of the allocatable case follows).
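A minimal sketch of the allocatable case (the array name and size are hypothetical). Allocating a co-array is a collective operation over all images, which is where the cost comes from:

  real, allocatable :: buf(:)[:]   ! co-array: symmetric across all images
  ...
  allocate( buf(n)[*] )            ! collective: every image allocates together,
                                   ! implying synchronization across all images
  ...
  deallocate( buf )                ! also collective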

Page 21:

Tricks of the CAF Coder

Since CAF pointer variables are not allowed, one can use a co-array of a derived type that contains a pointer:

  TYPE RB
    real*8, dimension(:,:), pointer :: p_precv_buf
  END TYPE RB
  TYPE (RB) precv_buf[0:*]

  ! associate the pointer component with an ordinary local buffer
  precv_buf%p_precv_buf => recv_buf(1:nx,1:ny)

Page 22:

Using Derived Types

This is particularly useful when modifying a message-passing library, where you do not know the sizes of the arrays in advance. Without it, you would have to allocate the co-array each time and perform an extra copy into the co-array.

By using derived types you perform the minimum amount of data transfer, greatly reducing the overhead of performing the transfer.

Page 23:

Using Derived Types in an MPI Library

!****************************************************************
 subroutine mpigatherv (sendbuf, sendcnt, sendtype, recvbuf, recvcnts, &
                        displs, recvtype, root, comm)
!
! Collects different messages from each thread on masterproc
!
 use shr_kind_mod, only: r8 => shr_kind_r8
 use mpishorthand
 implicit none
 real (r8), intent(in)  :: sendbuf(*)
 real (r8), intent(out) :: recvbuf(*)
 integer, intent(in) :: displs(*)
 integer, intent(in) :: sendcnt
 integer, intent(in) :: sendtype
 integer, intent(in) :: recvcnts(*)
 integer, intent(in) :: recvtype
 integer, intent(in) :: root
 integer, intent(in) :: comm
 integer ier                ! MPI error code

Page 24:

#if ( defined CAF )
 integer i, j, start, end
 integer mytid, nproc, info
 target sendbuf

 TYPE R4
   real(r8), dimension(:), POINTER :: p_ptmp
 END TYPE R4
 TYPE(R4) :: ptmp[*]

 call mpi_comm_rank(MPI_COMM_WORLD, mytid, info)

 ! publish this image's send buffer through the co-array pointer
 i = this_image()
 ptmp[i]%p_ptmp => sendbuf(1:sendcnt)

 CALL mpi_barrier(MPI_COMM_WORLD, info)     ! all pointers set before any reads

 if (mytid .eq. root) then
   ! root pulls each image's contribution directly -- no intermediate buffers
   do i = 1, num_images()
     start = displs(i) + 1
     end   = start + recvcnts(i) - 1
     recvbuf(start:end) = ptmp[i]%p_ptmp(1:recvcnts(i))
   end do
 end if

 CALL mpi_barrier(MPI_COMM_WORLD, info)     ! safe to reuse sendbuf after this

Page 25:

#else
 call t_startf ('mpi_gather')
 call mpi_gatherv (sendbuf, sendcnt, sendtype, recvbuf, recvcnts, displs, &
                   recvtype, root, comm, ier)
 if (ier /= mpi_success) then
   write(6,*) 'mpi_gather failed ier=', ier
   call endrun
 end if
 call t_stopf ('mpi_gather')
#endif
 return
 end subroutine mpigatherv

Page 26:

Pointers in Derived Types

TYPE P4
  integer len1
  real(REAL8), dimension(:), POINTER :: p_send_low
END TYPE P4

TYPE R4
  integer len2
  real(REAL8), dimension(:), POINTER :: p_send_scratch
END TYPE R4

TYPE S4
  integer len3
  integer, dimension(:), POINTER :: p_rsend_index
END TYPE S4

TYPE(P4) :: send_low[*]
TYPE(R4) :: send_scratch[*]
TYPE(S4) :: rsend_index[*]

! set co-array pointers to the locations of the output arrays
send_scratch%p_send_scratch => input(1:length)
rsend_index%p_rsend_index   => send_index(1:length)
send_low%p_send_low         => send_lo(0:maxpe)

Must barrier before using the pointers from another image.

Page 27:

And Then Use Them

do n = 1, recv_num
  pe = recv_pe(n)
  tc = ilength(recv_length(pe), pe)
  ll = send_low[pe+1]%p_send_low(mype)
  do l = 1, tc
!dir$ concurrent
    ! assumes ilength(l+1,pe) marks the start of the next index segment
    do lll = ilength(l,pe), ilength(l+1,pe)-1
      rindex = rsend_index[pe+1]%p_rsend_index(ll)
      output(recv_index(lll)) = output(recv_index(lll)) + &
                                send_scratch[pe+1]%p_send_scratch(rindex)
      ll = ll + 1
    enddo ! lll
  enddo ! l
enddo ! n

Page 28:

Don't Buffer Messages

One of the tremendous advantages of co-arrays is that one does not have to do buffering to build message blocks.

Typical MPI code: pack buffer -> send/recv buffer -> unpack buffer

Good CAF code: put data directly into the remote processor's memory (see the sketch below).
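A minimal sketch of the contrast for one halo edge (array and variable names are hypothetical; the MPI calls are the standard ones):

  ! MPI: three steps -- pack, send (matching recv + unpack on the neighbor)
  buf(1:n) = u(nx,1:n)                                    ! pack the edge
  call MPI_Send(buf, n, MPI_REAL8, dest, 1, MPI_COMM_WORLD, ierr)

  ! CAF: one step -- write straight into the neighbor's halo
  u(0,1:n)[dest] = u(nx,1:n)
  call sync_all()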

Page 29:

How to Write a Global Sum Using Co-Arrays

All processors come into the routine:
• Everyone does a local sum and/or stores its local scalar into a co-array scalar
• Tells the master (processor 1 (or 0)) that it has set its value
• Spins on the master-ready flag

The master reads all the scalars and performs the sum:
• Broadcasts the scalar to all processors
• Broadcasts the master_ready flag to all processors

Page 30:

What the Children Do

! sum local contributions
reduce_real_local = c0
do j = jphys_b, jphys_e
  do i = iphys_b, iphys_e
    reduce_real_local = reduce_real_local + X(i,j)*MASK(i,j)
  end do
end do
!
! send local sum to master
reduce_real_global(1,me)[1] = reduce_real_local
call sync_memory()
child_ready(1,me)[1] = .true.

if (me .eq. 1) then
  ! ... the master code goes here (next two slides) ...
else
  ! spin until the master signals that the global result is ready
  do while (.not. master_ready(1,me))
  enddo
  master_ready(1,me) = .false.
endif

global_sum_caf = reduce_real_global(1,me)
end function global_sum_caf

Page 31:

What the Master Does

if (me .eq. 1) then
  ! wait until all local results have arrived
  children_ready = .false.
  do while (.not. children_ready)
    children_ready = .true.
    do i = 2, NPROC_X*NPROC_Y
      children_ready = children_ready .and. child_ready(1,i)
    enddo
  enddo
  do i = 2, NPROC_X*NPROC_Y
    child_ready(1,i) = .false.
  enddo

  ! global sum
  global_sum = reduce_real_global(1,1)
  do i = 2, NPROC_X*NPROC_Y
    global_sum = global_sum + reduce_real_global(1,i)
  enddo

Page 32:

What the Master Does (cont.)

  ! broadcast
  do i = 1, NPROC_X*NPROC_Y
    reduce_real_global(1,i)[i] = global_sum
  enddo
  call sync_memory()
  do i = 2, NPROC_X*NPROC_Y
    master_ready(1,i)[i] = .true.
  enddo

Make sure that child_ready and master_ready are typed volatile.
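A sketch of the flag declarations this implies (dimensions inferred from the code above; whether volatile may be combined with a co-array declaration is compiler-dependent). Volatile keeps the compiler from caching the flags in registers while an image spins on them:

  logical, volatile :: child_ready(1, NPROC_X*NPROC_Y)[*]    ! set by children, read by master
  logical, volatile :: master_ready(1, NPROC_X*NPROC_Y)[*]   ! set by master, read by children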

Page 33:

Taking full advantage of CAF/UPC

CAF/UPC can be used to do lightweight 'message passing':
• CAF/UPC do 'zero-sided' messaging by directly copying data from (local) source arrays to (remote) destination arrays, without intervening buffer copying
• The references are generated by the compiler, so there is no library-call overhead
• However, this is still the basic 'compute'/'communicate' approach, so it does not overlap communication with computation

Here we propose that the last step in the 'compute' phase include a direct store of the latest array values to the remote memory of the consumer processor:
• That is, just after the final array values are stored to local memory, while the values are still in processor registers, also store them directly to the memory locations needed by the remote consumer processor for the next iteration or time step
• This saves re-loading the array values later, as in conventional CAF. It also 'meters out' the remote PUTs on the network while other values of the arrays are computed, reducing network contention (*)

Why is this important? Because strong scaling forces small grids per MPI process, hence short messages and more benefit from fine-grain overlapping of communication and computation; for a fixed global problem, we can strongly scale runs on Baker to reduce runtime.

(*) ala Norm Troullier, Cray Inc.

Page 34:

Optimizing Short-Message Communication with CAF/UPC

Illustrate with generic nearest-neighbor explicit algorithms, such as Jacobi iteration of Laplace's equation on the unit square:

  \nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0,
  with u(x,0) = 0.0, u(x,1) = 1.0 and boundary values prescribed on u(0,y), u(1,y).

Simple explicit differencing leads to

  u_{i,j}^{n+1} = u_{i,j}^{n} + \frac{\alpha}{4}\, du_{i,j}^{n},
  where du_{i,j}^{n} = u_{i-1,j}^{n} + u_{i+1,j}^{n} + u_{i,j-1}^{n} + u_{i,j+1}^{n} - 4\, u_{i,j}^{n}.

Iterate until global \max |du_{i,j}| < \epsilon = 10^{-4}. To maintain numerical stability, choose \alpha < 1.0.

Page 35:

Parallelize with Domain Decomposition, SPMD

Give each processor 'halo' cells; here for a 4x4 processor grid, PX = PY = 4.

[Diagram: each processor owns an nx x ny patch in (x,i), (y,j) coordinates, with halo cells at i = 0, i = nx+1, j = 0, and j = ny+1.]

  Real*8 u(0:nx+1, 0:ny+1, 0:1)   ! last dimension: 'time' level n = 0:1

Page 36:

High-Level Structure of the Code

INIT: use #ifdef MPI, #ifdef CAF, #ifdef CAF_overlap in a single source
Set global boundary conditions
DO iter = 1, maxiter
• Communicate halo exchanges via MPI or CAF
• Compute on each processor:
    Call Laplace (u, k0, k1, dumax, nx, ny, alpha)
  After the loops in Laplace, MPI and CAF return. CAF_overlap also communicates
  a) halo data, and b) each PE PUTs its local dumax to the 'master'
• Communicate dumax:
    MPI -- Allreduce
    CAF -- master reads dumax[source_pe], computes global_dumax, broadcasts
           global_dumax[dest_pe]
    CAF_overlap -- master computes global_dumax from its local dumax array
           (filled by all PEs while in Laplace), broadcasts global_dumax[dest_pe]
• If (global_dumax < epsilon) go to 1000   ! convergence test

Page 37:

High-Level Source Code

 integer(4), parameter :: nx = 100, ny = 100, maxiter = 20000   ! WEAK scaling
 real(8), parameter :: epsilon = 1.d-4, alpha = 0.95d0
 integer(4) :: iter, k0, k1
 integer(4) :: PPX, PPY, px, py, master, pxmaster, pymaster
#ifdef MPI
 real(8), dimension(0:nx+1, 0:ny+1, 0:1) :: u
 real(8) :: dumax, global_dumax
 call mpi_comm_rank(mpi_comm_world, myrank, ierror)   ! myrank = mype
 call mpi_comm_size(mpi_comm_world, mysize, ierror)   ! mysize = numpes
 mype = myrank
 numpes = mysize
#endif
#ifdef CAF
 real(8), allocatable, dimension(:,:,:)[:,:] :: u
 real(8), allocatable :: dumax[:,:]
 real(8) :: global_dumax
 mype = this_image() - 1
 numpes = num_images()
#endif

Page 38:

High-Level Source Code (cont.)

#ifdef CAF_overlap                 ! '_overlap' is the optimized CAF version
 real(8), allocatable, dimension(:,:,:)[:,:] :: u
 real(8), allocatable, dimension(:)[:,:] :: dumax
 real(8) :: global_dumax
 common /CAFstuff/ PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
 mype = this_image() - 1
 numpes = num_images()
#endif

 PPX = INT(sqrt(dfloat(numpes))) ; PPY = numpes/PPX
 . . .
#ifdef CAF
 allocate( u(0:nx+1, 0:ny+1, 0:1)[0:PPX-1, 0:*] )
 allocate( dumax[0:PPX-1, 0:*] )
#endif
#ifdef CAF_overlap
 allocate( u(0:nx+1, 0:ny+1, 0:1)[0:PPX-1, 0:*] )
 allocate( dumax(0:numpes-1)[0:PPX-1, 0:*] )
#endif

Page 39:

High-Level Source Code (cont.)

!...main iteration loop
 k0 = 0
 DO iter = 1, maxiter
   k1 = mod( 1 + mod(k0, 2), 2 )
   px = MOD(mype, PPX)
   py = mype/PPY

!...before next compute step, communicate data in NSEW directions

#ifdef MPI
! send to North neighbor
   if (py < PPY-1) then
     dest = px + (py+1)*PPX
     buf_send(1:nx) = u(1:nx,ny,k0)
     call MPI_send(buf_send, nx, MPI_real8, dest, 1, MPI_COMM_WORLD, ierror)
   endif
! recv North message from South neighbor
   if (py > 0) then
     dest = px + (py-1)*PPX
     call MPI_recv(buf_recv, nx, MPI_real8, dest, 1, MPI_COMM_WORLD, status, ierror)
     u(1:nx,0,k0) = buf_recv(1:nx)
   endif
   . . .

Page 40:

High-Level Source Code (cont.)

#ifdef CAF
! send to North neighbor with stride 1
   if (py < PPY-1) u(1:nx,0,k0)[px,py+1] = u(1:nx,ny,k0)
! send to South neighbor with stride 1
   if (py > 0)     u(1:nx,ny+1,k0)[px,py-1] = u(1:nx,1,k0)
! send to East neighbor with stride nx+2
   if (px < PPX-1) u(0,1:ny,k0)[px+1,py] = u(nx,1:ny,k0)
! send to West neighbor with stride nx+2
   if (px > 0)     u(nx+1,1:ny,k0)[px-1,py] = u(1,1:ny,k0)
   call sync_all()
#endif

NOTE: NO halo communication for CAF_overlap

Page 41:

High-Level Source Code (cont.)

   call Laplace (u, k0, k1, dumax, nx, ny, alpha)
   . . .

 subroutine Laplace (u, k0, k1, dumax, nx, ny, alpha)
 integer(4) :: k0, k1, nx, ny, i, j
 real(8) :: du, alpha
#ifndef CAF_overlap
 real(8), dimension(0:nx+1, 0:ny+1, 0:1) :: u   ! MPI and CAF use the same
 real(8) :: dumax                               ! subroutine declarations
#endif
#ifdef CAF_overlap
 integer(4) :: PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
 common /CAFstuff/ PPX, PPY, px, py, master, pxmaster, pymaster, mype, numpes
 real(8), dimension(0:nx+1, 0:ny+1, 0:1)[0:PPX-1,0:*] :: u
 real(8), dimension(0:*)[0:PPX-1, 0:*] :: dumax
#endif

Page 42:

High-Level Source Code (cont.)

For MPI and CAF there is no communication in Laplace:

#ifndef CAF_overlap
!...do five-point iterative update on u(i,j)
 dumax = 0.d0
!dir$ concurrent
 do j = 1, ny
!dir$ concurrent
   do i = 1, nx
     du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
     u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
     if (dabs(du) >= dumax) dumax = dabs(du)
   enddo ! i
 enddo ! j
 return
#endif

Page 43:

High-Level Source Code (cont.)

#ifdef CAF_overlap
!...peel off surface layers of the doubly nested loop to overlap communication
!...with computation; do five-point iterative update on u(i,j)
 dumax(mype) = 0.d0

!...North + South layers
!dir$ concurrent
 do i = 1, nx
   j = 1
   du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
   u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du          ! u(i,j,k1) in vector register
   if (py > 0) u(i,ny+1,k1)[px, py-1] = u(i,1,k1)   ! store the register again, remotely
   if (dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
   j = ny
   du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
   u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
   if (py < PPY-1) u(i,0,k1)[px, py+1] = u(i,ny,k1)
   if (dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
 enddo ! i

NOTE: PUT-based CAF_overlap is more efficient than GET-based if #stores < #loads

Page 44:

High-Level Source Code (cont.)

!...East + West layers
!dir$ concurrent
 do j = 1, ny
   i = 1
   du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
   u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
   if (px > 0) u(nx+1,j,k1)[px-1, py] = u(1,j,k1)
   if (dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
   i = nx
   du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
   u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
   if (px < PPX-1) u(0,j,k1)[px+1, py] = u(nx,j,k1)
   if (dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)
 enddo ! j

!...interior
!dir$ concurrent
 do j = 2, ny-1
!dir$ concurrent
   do i = 2, nx-1
     du = u(i-1,j,k0) + u(i+1,j,k0) + u(i,j-1,k0) + u(i,j+1,k0) - 4.d0*u(i,j,k0)
     u(i,j,k1) = u(i,j,k0) + 0.25d0*alpha*du
     if (dabs(du) >= dumax(mype)) dumax(mype) = dabs(du)   ! dumax(mype) in
   enddo ; enddo                                           ! scalar register

 dumax(mype)[pxmaster,pymaster] = dumax(mype)   ! PUT dumax to pe = master
#endif

Page 45:

High-Level Source Code (cont.)

Finish the iteration loop with communication for the global convergence test:

#ifdef MPI
 call MPI_ALLREDUCE(dumax, global_dumax, 1, MPI_DOUBLE_PRECISION, &
                    MPI_MAX, MPI_COMM_WORLD, ierror)
 if (ierror .ne. 0) go to 999
#endif
#ifdef CAF
 call sync_all()              ! ensure all PEs have computed their local dumax
 if (mype == master) then
   do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
     tmp = dumax[ipx,ipy]     ! master does remote GETs of dumax (vector load)
     dumax = MAX(dumax, tmp)
   enddo ; enddo
   do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
     dumax[ipx,ipy] = dumax   ! push the maximum back to every image
   enddo ; enddo
 endif ! mype == master
 ierror = 0
 call sync_all()              ! ensure all PEs have received global_dumax
 global_dumax = dumax
#endif

Page 46:

High-Level Source Code (cont.)

#ifdef CAF_overlap
 call sync_all()              ! ensure all PEs have computed their local dumax
 if (mype == master) then     ! dumax array was filled in routine Laplace
   do i = 0, numpes-1
     dumax(master) = MAX( dumax(master), dumax(i) )   ! read dumax(i) from
   enddo ! i                                          ! local memory
   do ipy = 0,PPY-1 ; do ipx = 0,PPX-1
     dest_pe = ipx + ipy*PPX
     dumax(dest_pe)[ipx,ipy] = dumax(master)          ! broadcast global_dumax
   enddo ; enddo                                      ! via remote vector stores
 endif ! mype == master
 ierror = 0
 call sync_all()              ! ensure all PEs have received global_dumax
 global_dumax = dumax(mype)
#endif
 k0 = k1                      ! interchange old and new iteration copies
 if (global_dumax < epsilon) go to 1000
 enddo ! iter

 1000 continue

Page 47:

Weak Scaling Results on X1E

Scaling results in units of GFLOPS/MSP, with parallel efficiency in parentheses (%):

                    n = 100                  n = 200                  n = 400
               P=4    P=16   P=64       P=4    P=16   P=64       P=4    P=16   P=64
 MPI          .948   .432   .101       2.72   1.36   .697       3.17   2.09   .986
              (22)   (12)   (2.6)      (42)   (24)   (11)       (73)   (60)   (27)
 CAF          1.62   1.31   1.12       3.60   3.13   2.81       3.59   2.72   2.41
              (75)   (63)   (53)       (85)   (75)   (68)       (92)   (87)   (79)
 CAF_overlap  2.59   2.06   1.71       5.23   4.60   3.99       3.61   2.82   2.67
              (79)   (69)   (57)       (88)   (83)   (68)       (92)   (92)   (87)

Strong scaling corresponds to the table diagonals with n^2 * P = const, e.g., (n=400, P=4) -> (n=200, P=16) -> (n=100, P=64).

Page 48:

Conclusions

In terms of performance of the 2-D Laplace example on the X1E:
• For small surface-to-volume (P = 4, n = 400), MPI, CAF, and CAF_overlap are within 13% of one another
• For weak scaling, CAF_overlap > CAF > MPI in all cases
• In the strong-scaling limit (P = 64 and n = 100), CAF_overlap = 1.5 x CAF and CAF = 11 x MPI

Baker will have hardware support for efficient use of CAF/UPC/SHMEM (and excellent support for MPI).

Moreover, users can program for even better strong-scaling performance on Baker by using CAF/UPC to do fine-grain overlapping of communication and computation, as illustrated here.

Page 49:

References


John Reid (JKR Associates, UK), "Coarrays in the next Fortran Standard," ISO/IEC JTC1/SC22/WG5 N1762, December 8, 2008.