
Comparison of Communication and I/O of the Cray T3E and IBM SP

Jonathan Carter
NERSC User Services

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER


Overview

• Node Characteristics
• Interconnect Characteristics
• MPI Performance
• I/O Configuration
• I/O Performance


T3E Architecture

• Distributed memory, single-CPU processing elements

[Diagram: a processing element, consisting of a CPU and local memory, attached to the interconnect]


T3E Communication Network

• Processing Elements (PE) are connected by a 3D torus.


• The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional
• Sustainable bandwidth is about 480 Mbyte/sec bidirectional
• Latency is about 1 μs
• The shmem API gives a latency of 1 μs and a bandwidth of 350 Mbyte/sec bidirectional


SP Architecture

• Cluster of SMP nodes

[Diagram: an SMP node, with two CPUs sharing one memory, attached to the interconnect]


SP Communication Network

• Nodes are connected via adapters to the SP Switch. The switch is composed of boards, each of which links 16 nodes. Boards are linked to form a larger network.

[Diagram: 16 nodes attached to a switch board]


• The peak bandwidth of the adapter and switch is 300 Mbyte/sec bidirectional
• Sustainable bandwidth is about 185 Mbyte/sec bidirectional
• Latency of the switch is about 2 μs


MPI Performance

                      T3E    SP (intra-node)    SP (inter-node)
Latency (μs)           12                 10                 22
Bandwidth (Mbyte/s)   270                300                150

Intra-node figures are for 1 MPI process per node; 2 MPI processes per node (the typical configuration) will halve the bandwidth.
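Point-to-point figures like these are typically measured with a ping-pong microbenchmark. Below is a minimal sketch in C; the message size and repetition count are illustrative choices, not necessarily those used for the table above.

    #include <mpi.h>
    #include <stdio.h>

    #define NREPS   1000
    #define MSGSIZE 1024            /* bytes; vary to probe latency vs. bandwidth */

    int main(int argc, char **argv)
    {
        char buf[MSGSIZE];
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < NREPS; i++) {
            if (rank == 0) {        /* rank 0 sends, then waits for the echo */
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) { /* rank 1 echoes the message back */
                MPI_Recv(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSGSIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)              /* one-way time is half the round trip */
            printf("%d bytes: %g us one-way\n", MSGSIZE,
                   (t1 - t0) / (2.0 * NREPS) * 1e6);

        MPI_Finalize();
        return 0;
    }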


[Chart: MPI_Reduce (sum) time in μs (0–4000) vs. number of processes (16, 32, 64, 128), comparing the T3E and SP at message sizes of 256 and 1024 bytes]


[Chart: MPI_Bcast time in μs (0–700) vs. number of processes (16, 32, 64, 128), comparing the T3E and SP at message sizes of 256 and 1024 bytes]


T3E I/O Configuration

• PEs do not have local disk
• All PEs access all filesystems equivalently
• The path for (optimal) I/O generally looks like:
  – PE to I/O node via the torus
  – I/O node to Fibre Channel Node (FCN) via the GigaRing
  – FCN to disk array via a Fibre Channel loop
• In some cases, data on an application (APP) PE must be transferred to a system buffer on an OS PE and then out to an FCN


[Diagram: I/O nodes connect over the GigaRing to FCNs, which connect to the disk arrays]


SP I/O Configuration

• Nodes have local disk: one SCSI disk for all local filesystems. Non-optimal.
• All nodes access General Parallel File System (GPFS) filesystems equivalently
• The path for GPFS I/O looks like:
  – Node to GPFS node via IP over the switch
  – GPFS node to disk array via an SSA loop


[Diagram: nodes connect through the switch to the GPFS nodes, which connect to a disk array]


T3E Filesystems

• /usr/tmp
  – fast
  – subject to a 14-day purge, not backed up
  – check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
• $TMPDIR
  – fast
  – purged at end of job or session
  – shares quota with /usr/tmp
• $HOME
  – slower
  – permanent, backed up
  – check quota with quota (usually 2 Gbyte and 3500 inodes)


SP Filesystems

• /scratch and $SCRATCH
  – global
  – fast (GPFS)
  – subject to a 14-day purge (or purged at session end for $SCRATCH), not backed up
  – check quota with myquota (usually 100 Gbyte and 6000 inodes)
• $TMPDIR
  – local (created in /scr), only 2 Gbyte total
  – slower
  – purged at end of job or session
• $HOME
  – global
  – slower (GPFS)
  – permanent, not backed up yet
  – check quota with myquota (usually 4 Gbyte and 5000 inodes)


Types of I/O

• A bewildering number of choices on both machines:
  – Standard language I/O: Fortran or C (ANSI or POSIX)
  – Vendor extensions to language I/O
  – MPI I/O
  – Cray FFIO library (can be used from Fortran or C)
  – IBM MIO library (requires code changes)


Standard Language I/O

• Fortran direct access is slightly more efficient than sequential access on both the T3E (see comments on FFIO later) and the SP. It also allows file transferability.

• C language I/O (fopen, fwrite, etc.) is inefficient on both machines.

• POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see comments on FFIO later). Works well on the SP.
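As a concrete illustration of the POSIX route, here is a minimal C sketch (the file name and payload size are illustrative). Issuing a few large requests, as here, is the pattern that performs well on the SP; on the T3E the same calls benefit from an FFIO buffering layer, as discussed later.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define NDOUBLES (1 << 20)      /* 8 Mbyte payload, written in one request */

    int main(void)
    {
        double *buf = malloc(NDOUBLES * sizeof(double));
        /* open(2)/write(2) rather than fopen/fwrite: C stdio is
           inefficient on both machines */
        int fd = open("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }
        /* one large write instead of many small ones */
        if (write(fd, buf, NDOUBLES * sizeof(double)) < 0)
            perror("write");
        close(fd);
        free(buf);
        return 0;
    }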


Vendor Extensions to Language I/O

• Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.

• IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.


MPI I/O

• Part of MPI-2
• Interface for high-performance parallel I/O:
  – data partitioning
  – collective I/O
  – asynchronous I/O
  – portability and interoperability between the T3E and SP
• A different subset is implemented on the T3E and the SP
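A minimal C sketch of MPI I/O using blocking reads at explicit offsets, i.e. the READ_AT row of the tables that follow (the file name and slice size are illustrative):

    #include <mpi.h>

    #define COUNT 4096              /* doubles read per process */

    int main(int argc, char **argv)
    {
        double buf[COUNT];
        int rank;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every process opens the same file collectively */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

        /* each process reads its own contiguous slice at an explicit offset */
        offset = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_read_at(fh, offset, buf, COUNT, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }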


Summary of access routines for T3E

Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
              Non-blocking   IREAD_AT + WAIT       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
              Non-blocking   IREAD + WAIT          READ_ALL_BEGIN / READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
              Non-blocking   IREAD_SHARED + WAIT   READ_ORDERED_BEGIN / READ_ORDERED_END

(Routine names are abbreviated: each carries the MPI_File_ prefix, e.g. MPI_File_read_at, and the write routines are analogous.)


Summary of access routines for SP

Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
              Non-blocking   IREAD_AT + WAIT       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
              Non-blocking   IREAD + WAIT          READ_ALL_BEGIN / READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
              Non-blocking   IREAD_SHARED + WAIT   READ_ORDERED_BEGIN / READ_ORDERED_END


Cray FFIO library

• FFIO is a set of I/O layers tuned for different I/O characteristics

• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available from C through POSIX-like calls, e.g. ffopen, ffwrite
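For C codes, the FFIO calls mirror their POSIX counterparts. A minimal sketch, assuming the ffopen/ffwrite/ffclose entry points and the ffio.h header as described in Cray's documentation (file name and size illustrative):

    #include <fcntl.h>
    #include <ffio.h>               /* Cray FFIO entry points (UNICOS/mk) */

    int main(void)
    {
        static double buf[65536];   /* 512 Kbyte illustrative payload */

        /* ffopen/ffwrite/ffclose parallel open/write/close; the layer
           used (e.g. bufa) is selected at run time with assign -F */
        int fd = ffopen("output.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return 1;
        ffwrite(fd, buf, sizeof(buf));
        ffclose(fd);
        return 0;
    }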


FFIO - The assign command

• Controls program I/O behavior at runtime
• The assign command controls:
  – which FFIO layer is active
  – striping across multiple partitions
  – lots more
• Scope of assign:
  – file name
  – Fortran unit number
  – file type (e.g. all sequential unformatted files)
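A few illustrative invocations (the unit number and file name are hypothetical; see man assign for the full syntax):

    assign -F bufa:48:2 u:11        # bufa layer, two 192 Kbyte buffers, scoped to Fortran unit 11
    assign -F global f:shared.dat   # global layer, scoped by file name
    assign -F bufa g:su             # bufa layer for all sequential unformatted files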


IBM MIO library

• User interface based on POSIX I/O routines, so requires program modification
• Useful trace module to collect statistics
• Not much experience yet with using it on GPFS filesystems
• Coming soon


I/O Strategies - Exclusive access files

• Each process reads and writes to a separate file
  – Language I/O
    • Increase language I/O performance on the T3E with the FFIO library (for example, specify a large buffer with the bufa layer). For Fortran direct access, the default buffer is only the maximum of the record length and 32 Kbytes
    • Read/write large amounts of data per request on the SP
  – MPI I/O
    • Read/write large amounts of data per request
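A common shape for this strategy, sketched in C with MPI (file names and sizes illustrative): each process derives a private file name from its rank and writes its data in one large request.

    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NDOUBLES (1 << 18)      /* 2 Mbyte written in a single request */

    int main(int argc, char **argv)
    {
        static double buf[NDOUBLES];
        char fname[64];
        int rank, fd;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one file per process, named by rank: no coordination needed */
        snprintf(fname, sizeof(fname), "out.%04d", rank);
        fd = open(fname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buf, sizeof(buf));   /* one large request */
            close(fd);
        }

        MPI_Finalize();
        return 0;
    }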


bufa FFIO layer Overview

• bufa is an asynchronous buffering layer
• Performs read-ahead and write-behind
• Specify buffer size with -F bufa:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers
• Buffer space increases your application's memory requirements
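For example, a hypothetical assignment giving Fortran unit 10 four 256 Kbyte buffers:

    assign -F bufa:64:4 u:10    # 64 blocks x 4 Kbyte = 256 Kbyte per buffer, 4 buffers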


I/O Strategies - Shared files

• All PEs read and write the same file simultaneously
  – Language I/O (requires the FFIO library global layer on the T3E)
  – MPI I/O
  – On the T3E, language I/O with the FFIO library global layer and Cray extensions for additional flexibility


Positioning with a shared file

• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
  – Use a direct access file and read/write(rec=num)
  – Use the Cray T3E extensions setpos and getpos to position the file pointer (not portable)
• C
  – Use ffseek
• MPI I/O
  – The MPI I/O file view generally takes care of this; positioning routines are also available.
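A minimal C sketch of the file-view approach (file name and slice size illustrative): each process sets a view whose displacement starts at its own slice, after which reads and writes through the individual file pointer need no explicit positioning.

    #include <mpi.h>

    #define COUNT 1024              /* doubles per process */

    int main(int argc, char **argv)
    {
        double buf[COUNT];
        int rank;
        MPI_File fh;
        MPI_Offset disp;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* the view's displacement positions each process at its own slice;
           subsequent writes through the individual pointer start there */
        disp = (MPI_Offset)rank * COUNT * sizeof(double);
        MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE,
                          "native", MPI_INFO_NULL);

        MPI_File_write(fh, buf, COUNT, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }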


global FFIO layer Overview

• global is a caching and buffering layer that enables multiple PEs to read and write the same file
• If one PE has already read the data, an additional read request from another PE results in a remote memory copy rather than another disk access
• File open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• Specify buffer size with -F global:bs:nbufs, where bs is the buffer size in units of 4 Kbyte blocks and nbufs is the number of buffers per PE
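A hypothetical assignment for a shared file, giving each PE two 96 Kbyte buffers:

    assign -F global:24:2 f:shared.dat    # 24 blocks x 4 Kbyte = 96 Kbyte per buffer, 2 buffers per PE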


GPFS and shared files

• On the T3E, the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
• On the SP, GPFS implements a safe update scheme via tokens and a token manager.
  – If two processes access the same 256 Kbyte block of a GPFS file, a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably.
  – MPI I/O merges requests from different processes to alleviate this problem.
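One way to avoid the token traffic, sketched in C under the assumption that each process's region can be padded: align every process's file offset to the 256 Kbyte GPFS block size so that no two processes ever touch the same block (file name and sizes illustrative).

    #include <mpi.h>

    #define GPFS_BLOCK (256 * 1024)          /* GPFS block size in bytes */
    #define MYBYTES    (100 * 1024)          /* payload per process */

    int main(int argc, char **argv)
    {
        static char buf[MYBYTES];
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* round each process's region up to a whole number of GPFS blocks,
           so writes from different processes never share a block */
        MPI_Offset region = ((MYBYTES + GPFS_BLOCK - 1) / GPFS_BLOCK) * GPFS_BLOCK;
        MPI_Offset offset = (MPI_Offset)rank * region;
        MPI_File_write_at(fh, offset, buf, MYBYTES, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }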


I/O Performance Comparison

• Each process writes a 200 Mbyte file; 2 processes per node on the SP.

[Chart: aggregate I/O rate in Mbyte/sec (0–1200) vs. number of processes (16, 32, 64) for T3E write, T3E read, SP write, and SP read]


Further Information

• I/O on the T3E tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray publication: Application Programmer's I/O Guide
• Cray publication: Cray T3E Fortran Optimization Guide
• man assign
• XL Fortran User's Guide