Using HPS Switch on Bassi

Jonathan Carter
User Services Group Lead
jtcarter@lbl.gov

NERSC User Group Meeting
June 12, 2006


IBM Switch Evolution


Year  Name                 Peak BW                                    Latency   Processor
1996  SP Switch            300 MB/s per node (2 x 150 MB/s channel)   20-35 us  Power2/Power3
2000  SP Switch2 (Colony)  2 GB/s per node (2 x 500 MB/s per port)    ~17 us    Power3/Power4
2003  HPS (Federation)     2 GB/s per port                            5-14 us   Power4/Power5


HPS Switch Configuration


Bassi Switch Configuration

B0101 B0201 B0301 B0401 B0501 B0601 B0701 B0801 B0901 B1001 B1101 B1201
B0102 B0202 B0302 B0402 B0502 B0602 B0702 B0802 B0902 B1002 B1102 B1202
B0103 B0203 B0303 B0403 B0503 B0603 B0703 B0803 B0903 B1003 B1103 B1203
B2904 B0304 B0404 B0504 B0704 B0804 B0904 B1004 B1104 B1204
B0205 B0305 B0405 B0505 B0705 B8905 B0905 B1005 B1105 B1205
B0206 B0306 B0406 B0506 B0706 B0806 B0906 B1006 B1106 B1206
B0207 B0307 B0407 B0507 B0707 B8907 B0907 B1007 B1107 B1207
B2908 B0308 B0408 B0508 B0708 B0808 B0908 B1008 B1108 B1208
B0209 B0309 B0409 B0709 B0809 B0909 B1009 B1109 B1209
B0210 B0310 B0410 B0710 B0810 B0910 B1010 B1110 B1210
B0211 B0311 B0411 B0711 B0811 B0911 B1011 B1111 B1211
B0212 B0312 B0412 B0712 B0812 B0912 B1012 B1112 B1212


IBM Software

• Parallel Environment (PE 4.2.2), which contains poe and MPI, remains unchanged

• Parallel System Support Programs (PSSP 3.5.0), which contained LAPI, has been absorbed into the Reliable Scalable Cluster Technology (RSCT 2.4.2) software stack


• MPI 4.2.2
  – Uses LAPI as reliable transport layer
  – Uses threads, not signals, for asynchronous activities
• Binary compatible
• New performance characteristics
  – Eager
  – Bulk transfer
  – Collectives


IBM Software Stack

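[Figure: software stack diagram. Layers as listed from the hardware upward: HPS, SMA3+ Adapter, HAL, LAPI, IF_LS, IP, MPI, Application. Other components shown: ESSL, PESSL, GPFS, Sockets, VSD, TCP, UDP.]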


Communication Modes

• FIFO mode
  – Chopped into 2 KB chunks on host, copied by CPU
• Remote Direct Memory Access (RDMA)
  – CPU offload
  – One I/O bus crossing

[Figure: diagram of the FIFO and RDMA data paths. Labels: Adapter, CPU, User Buffer, FIFO, RDMA, DMA.]
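The FIFO path can be pictured with a small sketch. The function below splits a message into 2 KB chunks, as the slide describes. The name `fifo_chunks` and the (offset, length) representation are illustrative assumptions, not part of LAPI or the adapter interface.

```python
def fifo_chunks(msg_bytes, chunk_bytes=2048):
    """Return (offset, length) pairs covering a message in FIFO-sized chunks.

    chunk_bytes defaults to the 2 KB chunk size quoted on the slide.
    This is a hypothetical helper for illustration only.
    """
    if msg_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("msg_bytes must be >= 0 and chunk_bytes > 0")
    return [
        (offset, min(chunk_bytes, msg_bytes - offset))
        for offset in range(0, msg_bytes, chunk_bytes)
    ]


if __name__ == "__main__":
    # A 5000-byte message needs three chunks, each copied by the CPU.
    print(fifo_chunks(5000))  # [(0, 2048), (2048, 2048), (4096, 904)]
```

In FIFO mode the host CPU handles every chunk, so the chunk count grows linearly with message size. RDMA avoids that per-chunk CPU work.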


RDMA (Bulk transfer)

• Overlap of communication and computation possible
  – Asynchronous-messaging applications
  – One-sided communications
• Reduce CPU work
  – Offload fragmentation and reassembly
  – Minimize packet arrival interrupts
• Reduce memory subsystem load
  – Zero copy transport
• Striping across adapters
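As a rough illustration of striping, the sketch below divides one large message into contiguous parts, one per adapter. The slides do not describe the actual striping policy used by LAPI, so the even split and the name `stripe` are assumptions made only for this example.

```python
def stripe(msg_bytes, n_adapters):
    """Split a message into contiguous (offset, length) parts, one per adapter.

    Hypothetical even split; the real striping policy is not described
    in the slides.
    """
    if msg_bytes < 0 or n_adapters <= 0:
        raise ValueError("msg_bytes must be >= 0 and n_adapters > 0")
    base, extra = divmod(msg_bytes, n_adapters)
    parts, offset = [], 0
    for i in range(n_adapters):
        # The first `extra` adapters carry one additional byte each.
        length = base + (1 if i < extra else 0)
        parts.append((offset, length))
        offset += length
    return parts


if __name__ == "__main__":
    print(stripe(1_000_001, 2))  # [(0, 500001), (500001, 500000)]
```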


RDMA vs. Packet

[Figure: PingPong bandwidth (MB/s, axis 0 to 3500) versus message size (0 to 8.4E+06), comparing RDMA and packet transfers.]


MPI Transfer Protocols

• Eager: send data immediately; store in remote buffer
  – No synchronization
  – Only one message sent
  – Uses memory for buffering (less for application)
• Rendezvous: send message header; wait for recv to be posted; send data
  – No data copy may be required
  – No memory required for buffering (more for application)
  – More messages required
  – Synchronization (standard send blocks until recv posted)

[Figure: message exchange between P0 and P1. Eager: data, ack. Rendezvous: req, ack, data, ack.]
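The two protocols can be contrasted with a minimal sketch. The step sequences are read from the P0/P1 diagram labels on this slide. The 32 KB default threshold is the upper value quoted later for MP_EAGER_LIMIT, and treating a message exactly at the limit as eager is an assumption. The function names are hypothetical.

```python
# Wire-level steps per protocol, taken from the slide's P0/P1 diagram.
EAGER_STEPS = ("data", "ack")
RENDEZVOUS_STEPS = ("req", "ack", "data", "ack")


def choose_protocol(msg_bytes, eager_limit=32 * 1024):
    """Pick a protocol by message size (boundary handling is assumed)."""
    return "eager" if msg_bytes <= eager_limit else "rendezvous"


def wire_steps(msg_bytes, eager_limit=32 * 1024):
    """Return the sequence of exchanges for one message."""
    if choose_protocol(msg_bytes, eager_limit) == "eager":
        return EAGER_STEPS
    return RENDEZVOUS_STEPS


if __name__ == "__main__":
    for size in (1024, 1024 * 1024):
        print(size, choose_protocol(size), wire_steps(size))
```

The extra req/ack round trip is why rendezvous costs more for small messages, while eager pays for its speed with buffer memory on the receiver.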


Eager vs. Rendezvous

[Figure: time (us, axis 0 to 120) versus message size (0 to 1.3E+05) for the eager and rendezvous protocols.]


Latency

System    Intranode (us)  Internode (us)
Seaborg   10.5            24.5
Jacquard  0.6             4.7
Bassi     1.1             4.5
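To see what these latencies mean for a code dominated by small messages, the sketch below multiplies the table values by a message count. It assumes the messages are sent one after another with no overlap and that bandwidth is negligible at this size. Both are simplifications, and the function name is hypothetical.

```python
# Latencies in microseconds, copied from the table above.
LATENCY_US = {
    "Seaborg": {"intra": 10.5, "inter": 24.5},
    "Jacquard": {"intra": 0.6, "inter": 4.7},
    "Bassi": {"intra": 1.1, "inter": 4.5},
}


def latency_bound_time_us(system, n_messages, path="inter"):
    """Estimate time for n serialized small messages (latency only)."""
    return LATENCY_US[system][path] * n_messages


if __name__ == "__main__":
    n = 1_000_000
    for name in LATENCY_US:
        seconds = latency_bound_time_us(name, n) / 1e6
        print(f"{name}: {seconds:.1f} s for {n} internode messages")
```

Under these assumptions the internode latency ratio between Seaborg and Bassi is 24.5 / 4.5, roughly 5.4.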


Internode Comparison

[Figure: internode bandwidth (MB/s, axis 0 to 3500) versus message size (0 to 8.4E+06) for bassi, seaborg, and jacquard.]


Internode Comparison

[Figure: internode bandwidth (MB/s, axis 0 to 400) versus message size for small messages (0 to 4.1E+03) for bassi, seaborg, and jacquard.]


Intranode Comparison

[Figure: intranode bandwidth (MB/s, axis 0 to 8000) versus message size (0 to 2.1E+06) for bassi, seaborg, and jacquard.]


Intranode Comparison

[Figure: intranode bandwidth (MB/s, axis 0 to 1400) versus message size for small messages (0 to 4.1E+03) for bassi, seaborg, and jacquard.]


Packed-node Comparison

[Figure: packed-node bandwidth (MB/s, axis 0 to 600) versus message size (0 to 8.4E+06) for bassi, seaborg, and jacquard.]


Packed-node Comparison

[Figure: packed-node bandwidth (MB/s, axis 0 to 300) versus message size for small messages (0 to 4.1E+03) for bassi, seaborg, and jacquard.]


POE environment variables

• MP_SINGLE_THREAD
  – Set to Yes for a slight latency decrease; set to No for MPI I/O, OpenMP, etc.
• MP_USE_BULK_XFER
  – Defaults to Yes
• MP_BULK_MIN_MSG_SIZE
  – Defaults to ~150 KB
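A minimal sketch of how these three settings interact is shown below. It reads the variables from an environment mapping and decides whether a message of a given size would be a candidate for bulk transfer. The helper names, the plain-integer byte format for MP_BULK_MIN_MSG_SIZE, the exact value 150 * 1024 for "~150KB", the "no" fallback for MP_SINGLE_THREAD, and the `>=` comparison are all assumptions for illustration. The IBM documentation should be consulted for the accepted value formats.

```python
import os

# The slide quotes "~150KB"; the exact byte value here is an assumption.
DEFAULT_BULK_MIN_BYTES = 150 * 1024


def poe_settings(environ=None):
    """Collect the three settings from an environment mapping."""
    env = os.environ if environ is None else environ
    return {
        # Fallback "no" is assumed; the slide gives no default.
        "single_thread": env.get("MP_SINGLE_THREAD", "no").lower() == "yes",
        # Fallback "yes" follows the slide.
        "use_bulk_xfer": env.get("MP_USE_BULK_XFER", "yes").lower() == "yes",
        # Only plain integer byte counts are parsed in this sketch.
        "bulk_min_bytes": int(
            env.get("MP_BULK_MIN_MSG_SIZE", DEFAULT_BULK_MIN_BYTES)
        ),
    }


def uses_bulk_transfer(msg_bytes, settings):
    """True if the message is large enough and bulk transfer is enabled."""
    return settings["use_bulk_xfer"] and msg_bytes >= settings["bulk_min_bytes"]


if __name__ == "__main__":
    settings = poe_settings({"MP_BULK_MIN_MSG_SIZE": "65536"})
    print(settings, uses_bulk_transfer(100_000, settings))
```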


• MP_BUFFER_MEM
  – Default is 64 MB
• MP_EAGER_LIMIT
  – Varies from 32 KB to 1 KB depending on job size; can be increased in conjunction with MP_BUFFER_MEM
• LAPI parameters for applications with many blocking sends of small messages:
  – MP_REXMIT_BUF_SIZE
    • Default is 128 bytes
  – MP_REXMIT_BUF_CNT
    • Default is 128 buffers
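The link between MP_EAGER_LIMIT and MP_BUFFER_MEM can be illustrated with simple arithmetic: a larger eager limit means fewer maximum-size eager messages fit in the same buffer. The sketch below uses the defaults quoted above. It ignores headers and any per-task accounting the library performs, so treat it as a rough bound only. The function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def eager_messages_that_fit(buffer_mem_bytes=64 * MB, eager_limit_bytes=32 * KB):
    """Upper bound on maximum-size eager messages held in the buffer."""
    if eager_limit_bytes <= 0:
        raise ValueError("eager_limit_bytes must be > 0")
    return buffer_mem_bytes // eager_limit_bytes


if __name__ == "__main__":
    print(eager_messages_that_fit())                           # 2048
    print(eager_messages_that_fit(eager_limit_bytes=1 * KB))   # 65536
    print(eager_messages_that_fit(eager_limit_bytes=64 * KB))  # 1024
```

Doubling the eager limit halves the count, which is one way to see why the slide pairs an increase in MP_EAGER_LIMIT with an increase in MP_BUFFER_MEM.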


IBM Documentation

• RSCT for AIX 5L LAPI Programming Guide (SA22-7936-03)
  – LAPI programming
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 1 (SA22-7948-04)
  – Running jobs
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 2 (SA22-7949-04)
  – Performance tools
• Parallel Environment for AIX 5L V4.2.2 MPI Programming Guide (SA22-7945-04)
  – IBM MPI implementation