
Comparison of Communication and I/O of the Cray T3E and IBM SP

Jonathan Carter
NERSC User Services

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER


Overview

• Node Characteristics
• Interconnect Characteristics
• MPI Performance
• I/O Configuration
• I/O Performance


T3E Architecture

• Distributed memory, single-CPU processing elements

[Diagram: a processing element, consisting of a CPU and local memory, attached to the interconnect]


T3E Communication Network

• Processing Elements (PE) are connected by a 3D torus.


• The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional
• Sustainable bandwidth is about 480 Mbyte/sec bidirectional
• Latency is about 1 μs
• The shmem API gives a latency of 1 μs and a bandwidth of 350 Mbyte/sec bidirectional


SP Architecture

• Cluster of SMP nodes

[Diagram: an SMP node, with two CPUs sharing one memory, attached to the interconnect]


SP Communication Network

• Nodes are connected via adapters to the SP Switch. The switch is composed of boards, each of which links 16 nodes. Boards are linked to form a larger network.

[Diagram: 16 nodes attached to a switch board]


• The peak bandwidth of the adapter and switch is 300 Mbyte/sec bidirectional
• Sustainable bandwidth is about 185 Mbyte/sec bidirectional
• Latency of the switch is about 2 μs


MPI Performance

                      T3E    SP (intra-node)    SP (inter-node)
Latency (μs)           12                 10                 22
Bandwidth (Mbyte/s)   270                300                150

Intra-node figures are for 1 MPI process per node; 2 MPI processes per node (the typical configuration) will halve the bandwidth.
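Point-to-point figures like these are typically measured with a ping-pong microbenchmark. Below is a minimal sketch in C; the message size and repetition count are illustrative choices, not necessarily those used for the table above.

    #include <mpi.h>
    #include <stdio.h>

    #define NREPS   1000
    #define MSGSIZE 1024            /* bytes; vary to probe latency vs. bandwidth */

    int main(int argc, char **argv)
    {
        char buf[MSGSIZE];
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {        /* rank 0 sends, then waits for the echo */
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* rank 1 echoes the message back */
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)              /* one-way time is half the round trip */
            printf("%d bytes: %g us one-way\n", MSGSIZE,
                   (t1 - t0) / (2.0 * NREPS) * 1e6);

        MPI_Finalize();
        return 0;
    }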


[Chart: MPI_Reduce (sum) time in μs (0–4000) vs. number of processes (16, 32, 64, 128), comparing the T3E and SP at message sizes of 256 and 1024 bytes]


[Chart: MPI_Bcast time in μs (0–700) vs. number of processes (16, 32, 64, 128), comparing the T3E and SP at message sizes of 256 and 1024 bytes]


T3E I/O Configuration

• PEs do not have local disk
• All PEs access all filesystems equivalently
• The path for (optimal) I/O generally looks like:
  – PE to I/O node via the torus
  – I/O node to Fibre Channel Node (FCN) via the GigaRing
  – FCN to disk array via a Fibre Channel loop
• In some cases, data on an application (APP) PE must be transferred to a system buffer on an OS PE and then out to an FCN


[Diagram: I/O nodes connect over the GigaRing to FCNs, which connect to the disk arrays]


SP I/O Configuration

• Nodes have local disk: one SCSI disk for all local filesystems. Non-optimal.
• All nodes access General Parallel File System (GPFS) filesystems equivalently
• The path for GPFS I/O looks like:
  – Node to GPFS node via IP over the switch
  – GPFS node to disk array via an SSA loop


[Diagram: nodes connect through the switch to the GPFS nodes, which connect to a disk array]


T3E Filesystems

• /usr/tmp
  – fast
  – subject to a 14-day purge, not backed up
  – check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
• $TMPDIR
  – fast
  – purged at end of job or session
  – shares quota with /usr/tmp
• $HOME
  – slower
  – permanent, backed up
  – check quota with quota (usually 2 Gbyte and 3500 inodes)


SP Filesystems

• /scratch and $SCRATCH
  – global
  – fast (GPFS)
  – subject to a 14-day purge (or purged at session end for $SCRATCH), not backed up
  – check quota with myquota (usually 100 Gbyte and 6000 inodes)
• $TMPDIR
  – local (created in /scr), only 2 Gbyte total
  – slower
  – purged at end of job or session
• $HOME
  – global
  – slower (GPFS)
  – permanent, not backed up yet
  – check quota with myquota (usually 4 Gbyte and 5000 inodes)


Types of I/O

• A bewildering number of choices on both machines:
  – Standard language I/O: Fortran or C (ANSI or POSIX)
  – Vendor extensions to language I/O
  – MPI I/O
  – Cray FFIO library (can be used from Fortran or C)
  – IBM MIO library (requires code changes)


Standard Language I/O

• Fortran direct access is slightly more efficient than sequential access on both the T3E (see comments on FFIO later) and the SP. It also allows file transferability.

• C language I/O (fopen, fwrite, etc.) is inefficient on both machines.

• POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see comments on FFIO later). Works well on the SP.
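As a concrete illustration of the POSIX route, here is a minimal C sketch (the file name and payload size are illustrative). Issuing a few large requests, as here, is the pattern that performs well on the SP; on the T3E the same calls benefit from an FFIO buffering layer, as discussed later.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NDOUBLES (1 << 20)      /* 8 Mbyte payload, written in one request */

    int main(void)
    {
        double *buf = malloc(NDOUBLES * sizeof(double));
        /* open(2)/write(2) rather than fopen/fwrite: C stdio is
           inefficient on both machines */
        int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }
        /* one large write instead of many small ones */
        if (write(fd, buf, NDOUBLES * sizeof(double)) < 0)
            perror("write");
        close(fd);
        free(buf);
        return 0;
    }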


Vendor Extensions to Language I/O

• Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.

• IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.


MPI I/O

• Part of MPI-2
• Interface for high-performance parallel I/O:
  – data partitioning
  – collective I/O
  – asynchronous I/O
  – portability and interoperability between the T3E and SP
• A different subset is implemented on the T3E and the SP
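A minimal C sketch of MPI I/O using blocking reads at explicit offsets, i.e. the READ_AT row of the tables that follow (the file name and slice size are illustrative):

    #include <mpi.h>

    #define COUNT 4096              /* doubles read per process */

    int main(int argc, char **argv)
    {
        double buf[COUNT];
        int rank;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every process opens the same file collectively */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        /* each process reads its own contiguous slice at an explicit offset */
        offset = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_read_at(fh, offset, buf, COUNT, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }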


Summary of access routines for T3E

Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
              Non-blocking   IREAD_AT + WAIT       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
              Non-blocking   IREAD + WAIT          READ_ALL_BEGIN / READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
              Non-blocking   IREAD_SHARED + WAIT   READ_ORDERED_BEGIN / READ_ORDERED_END

(Routine names are abbreviated: each carries the MPI_File_ prefix, e.g. MPI_File_read_at, and the write routines are analogous.)


Summary of access routines for SP

Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
              Non-blocking   IREAD_AT + WAIT       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
              Non-blocking   IREAD + WAIT          READ_ALL_BEGIN / READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
              Non-blocking   IREAD_SHARED + WAIT   READ_ORDERED_BEGIN / READ_ORDERED_END


Cray FFIO library

• FFIO is a set of I/O layers tuned for different I/O characteristics

• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available from C through POSIX-like calls, e.g. ffopen, ffwrite
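For C codes, the FFIO calls mirror their POSIX counterparts. A minimal sketch, assuming the ffopen/ffwrite/ffclose entry points and the ffio.h header as described in Cray's documentation (file name and size illustrative):

    #include <fcntl.h>
    #include <ffio.h>               /* Cray FFIO entry points (UNICOS/mk) */

    int main(void)
    {
        static double buf[65536];   /* 512 Kbyte illustrative payload */

        /* ffopen/ffwrite/ffclose parallel open/write/close; the layer
           used (e.g. bufa) is selected at run time with assign -F */
        int fd = ffopen("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        ffwrite(fd, buf, sizeof(buf));
        ffclose(fd);
        return 0;
    }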


FFIO - The assign command

• Controls program I/O behavior at runtime
• The assign command controls:
  – which FFIO layer is active
  – striping across multiple partitions
  – lots more
• Scope of assign:
  – file name
  – Fortran unit number
  – file type (e.g. all sequential unformatted files)
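A few illustrative invocations (the unit number and file name are hypothetical; see man assign for the full syntax):

    assign -F bufa:48:2 u:11        # bufa layer, two 192 Kbyte buffers, scoped to Fortran unit 11
    assign -F global f:shared.dat   # global layer, scoped by file name
    assign -F bufa g:su             # bufa layer for all sequential unformatted files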


IBM MIO library

• User interface based on POSIX I/O routines, so requires program modification
• Useful trace module to collect statistics
• Not much experience yet with using it on GPFS filesystems
• Coming soon


I/O Strategies - Exclusive access files

• Each process reads and writes to a separate file
  – Language I/O
    • Increase language I/O performance on the T3E with the FFIO library (for example, specify a large buffer with the bufa layer). For Fortran direct access, the default buffer is only the maximum of the record length and 32 Kbytes
    • Read/write large amounts of data per request on the SP
  – MPI I/O
    • Read/write large amounts of data per request
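A common shape for this strategy, sketched in C with MPI (file names and sizes illustrative): each process derives a private file name from its rank and writes its data in one large request.

    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NDOUBLES (1 << 18)      /* 2 Mbyte written in a single request */

    int main(int argc, char **argv)
    {
        static double buf[NDOUBLES];
        char fname[64];
        int rank, fd;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one file per process, named by rank: no coordination needed */
        snprintf(fname, sizeof(fname), "out.%04d", rank);
        fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buf, sizeof(buf));   /* one large request */
            close(fd);
        }

        MPI_Finalize();
        return 0;
    }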


bufa FFIO layer Overview

• bufa is an asynchronous buffering layer
• Performs read-ahead and write-behind
• Specify buffer size with -F bufa:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers
• Buffer space increases your application's memory requirements
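For example, a hypothetical assignment giving Fortran unit 10 four 256 Kbyte buffers:

    assign -F bufa:64:4 u:10    # 64 blocks x 4 Kbyte = 256 Kbyte per buffer, 4 buffers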


I/O Strategies - Shared files

• All PEs read and write the same file simultaneously
  – Language I/O (requires the FFIO library global layer on the T3E)
  – MPI I/O
  – On the T3E, language I/O with the FFIO library global layer and Cray extensions for additional flexibility


Positioning with a shared file

• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
  – Use a direct access file and read/write(rec=num)
  – Use the Cray T3E extensions setpos and getpos to position the file pointer (not portable)
• C
  – Use ffseek
• MPI I/O
  – The MPI I/O file view generally takes care of this; positioning routines are also available.
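A minimal C sketch of the file-view approach (file name and slice size illustrative): each process sets a view whose displacement starts at its own slice, after which reads and writes through the individual file pointer need no explicit positioning.

    #include <mpi.h>

    #define COUNT 1024              /* doubles per process */

    int main(int argc, char **argv)
    {
        double buf[COUNT];
        int rank;
        MPI_File fh;
        MPI_Offset disp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* the view's displacement positions each process at its own slice;
           subsequent writes through the individual pointer start there */
        disp = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE,
                          "native", MPI_INFO_NULL);

        MPI_File_write(fh, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }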


global FFIO layer Overview

• global is a caching and buffering layer that enables multiple PEs to read and write the same file
• If one PE has already read the data, an additional read request from another PE results in a remote memory copy rather than another disk access
• File open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• Specify buffer size with -F global:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers per PE
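A hypothetical assignment for a shared file, giving each PE two 96 Kbyte buffers:

    assign -F global:24:2 f:shared.dat    # 24 blocks x 4 Kbyte = 96 Kbyte per buffer, 2 buffers per PE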


GPFS and shared files

• On the T3E, the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
• On the SP, GPFS implements a safe update scheme via tokens and a token manager.
  – If two processes access the same 256 Kbyte block of a GPFS file, a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably.
  – MPI I/O merges requests from different processes to alleviate this problem.
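One way to avoid the token traffic, sketched in C under the assumption that each process's region can be padded: align every process's file offset to the 256 Kbyte GPFS block size so that no two processes ever touch the same block (file name and sizes illustrative).

    #include <mpi.h>

    #define GPFS_BLOCK (256 * 1024)          /* GPFS block size in bytes */
    #define MYBYTES    (100 * 1024)          /* payload per process */

    int main(int argc, char **argv)
    {
        static char buf[MYBYTES];
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* round each process's region up to a whole number of GPFS blocks,
           so writes from different processes never share a block */
        MPI_Offset region = ((MYBYTES + GPFS_BLOCK - 1) / GPFS_BLOCK) * GPFS_BLOCK;
        MPI_Offset offset = (MPI_Offset)rank * region;
        MPI_File_write_at(fh, offset, buf, MYBYTES, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }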


I/O Performance Comparison

• Each process writes a 200 Mbyte file; 2 processes per node on the SP.

[Chart: aggregate I/O rate in Mbyte/sec (0–1200) vs. number of processes (16, 32, 64) for T3E write, T3E read, SP write, and SP read]


Further Information

• I/O on the T3E tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray publication: Application Programmer's I/O Guide
• Cray publication: Cray T3E Fortran Optimization Guide
• man assign
• XL Fortran User's Guide