
Page 1

IFS Benchmark with Federation Switch

John Hague, IBM

Page 2

Introduction

• Federation has dramatically improved pwr4 p690 communication, so

– Measure Federation performance with Small Pages and Large Pages using simulation program

– Compare Federation and pre-Federation (Colony) performance of IFS

– Compare Federation performance of IFS with and without Large Pages and Memory Affinity

– Examine IFS communication using mpi profiling

Page 3

Colony v Federation

• Colony (hpca)
  – 1.3GHz 32-processor p690s
  – Four 8-processor Affinity LPARs per p690
    • Needed to get communication performance
  – Two 180MB/s adapters per LPAR

• Federation (hpcu)
  – 1.7GHz p690s
  – One 32-processor LPAR per p690
  – Memory and MPI MCM Affinity
    • MPI task and memory from same MCM
    • Slightly better than binding task to a specific processor
  – Two 2-link 1.2GB/s Federation adapters per p690
    • Four 1.2GB/s links per node

Page 4

IFS Communication: transpositions

1. MPI Alltoall in all rows simultaneously
   • Mostly shared memory

2. MPI Alltoall in all columns simultaneously
   (a communicator sketch follows the diagram below)

[Diagram: layout of MPI tasks across nodes, showing which transpositions stay within a node (rows) and which cross nodes (columns)]
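The two Alltoall phases above map naturally onto MPI sub-communicators, one per row and one per column of the task grid. The following is a minimal sketch of that structure, not the IFS source; the 4-task row width, message size, and variable names are illustrative assumptions.

  /*
   * Minimal sketch of the two-phase transposition (illustrative, not the
   * IFS source).  The row width, message size, and variable names are
   * assumptions; rows are intended to map to tasks on the same node.
   */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int npcol = 4;                /* tasks per row (assumed)       */
      int myrow = rank / npcol;           /* row index: tasks on same node */
      int mycol = rank % npcol;           /* column index: across nodes    */

      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
      MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

      const int chunk = 1000;             /* doubles sent to each peer     */
      double *sbuf = calloc((size_t)chunk * size, sizeof(double));
      double *rbuf = calloc((size_t)chunk * size, sizeof(double));

      /* 1. Alltoall in all rows simultaneously (mostly shared memory)     */
      MPI_Alltoall(sbuf, chunk, MPI_DOUBLE, rbuf, chunk, MPI_DOUBLE, row_comm);

      /* 2. Alltoall in all columns simultaneously (mostly over the switch) */
      MPI_Alltoall(sbuf, chunk, MPI_DOUBLE, rbuf, chunk, MPI_DOUBLE, col_comm);

      free(sbuf); free(rbuf);
      MPI_Comm_free(&row_comm);
      MPI_Comm_free(&col_comm);
      MPI_Finalize();
      return 0;
  }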

Page 5

Simulation of transpositions

• All transpositions in “row” use shared memory
• All transpositions in “column” use switch
• Number of MPI tasks per node varied
  – But all processors used by using OpenMP threads
• Bandwidth measured for MPI Sendrecv calls (see the sketch below)
  – Buffers allocated and filled by threads between each call
• Large Pages give best switch performance
  – With current switch software
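A minimal sketch of the measurement loop described above (illustrative, not the actual benchmark program): bandwidth is timed over the MPI_Sendrecv calls only, and all OpenMP threads of a task refill the send buffer between calls. The task pairing, 1 MB message size, and repetition count are assumptions.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int partner = rank ^ 1;             /* pair 0<->1, 2<->3, ... (assumed) */
      if (partner >= size) {              /* needs an even task count         */
          MPI_Finalize();
          return 0;
      }

      const int nbytes = 1 << 20;         /* 1 MB message (assumed) */
      char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

      const int reps = 100;
      double t_comm = 0.0;
      MPI_Barrier(MPI_COMM_WORLD);

      for (int it = 0; it < reps; it++) {
          /* fill the send buffer with all OpenMP threads between each call */
          #pragma omp parallel for
          for (int i = 0; i < nbytes; i++)
              sbuf[i] = (char)(i + it);

          double t0 = MPI_Wtime();
          MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                       rbuf, nbytes, MPI_BYTE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          t_comm += MPI_Wtime() - t0;     /* time only the Sendrecv */
      }

      printf("task %d: ~%.0f MB/s\n", rank, reps * (double)nbytes / t_comm / 1e6);

      free(sbuf); free(rbuf);
      MPI_Finalize();
      return 0;
  }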

Page 6

“Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)

[Chart: bandwidth (MB/s) per link versus message size (100 bytes to 10,000,000 bytes) for LP: EAGER_LIMIT=64K, LP: MIN_BULK=50K, LP: BASE, and SP]

SP = Small Pages; LP = Large Pages

Page 7

“Transposition” Bandwidth per link (8 nodes, 4 links/node)

[Chart: bandwidth (MB/s) per link versus message size (bytes) for 32, 16, 8, and 4 tasks per node]

Multiple threads ensure all processors are used

Page 8

hpcu v hpca with IFS

• Benchmark jobs (provided 3 years ago)

– Same executable used on hpcu and hpca
– 256 processors used
– All jobs run with mpi_profiling (and barriers before data exchange)

Job                 Procs    Grid Points   hpca (sec)   hpcu (sec)   Speedup
T399                10x1_4      213988        5828         3810        1.52
T799                16x8_2      843532        9907         5527        1.79
4D-Var T511/T255    16x8_2                    4869         2737        1.78

Page 9

IFS Speedups: hpcu v hpca

[Charts: Total, Communication, and CPU speedup versus hpca for the 799, 399, and 4D-Var jobs, under SP no MA, SP w MA, LP no MA, and LP w MA]

LP = Large Pages; SP = Small Pages; MA = Memory Affinity

Page 10

LP/SP & MA/noMA CPU comparison

[Chart: percentage difference in CPU time for LP/SP no MA, LP/SP w MA, MA/noMA w SP, and MA/noMA w LP, for the 799, 399, and 4D-Var jobs]

Page 11

LP/SP & MA/noMA Comms comparison

[Chart: percentage difference in communication time for LP/SP no MA, LP/SP w MA, MA/noMA w SP, and MA/noMA w LP, for the 799, 399, and 4D-Var jobs]

Page 12

Percentage Communication

[Chart: percentage of time in communication for the 799, 399, and 4D-Var jobs, on hpca (SP no MA) and on hpcu (SP no MA, SP w MA, LP no MA, LP w MA)]

Page 13

Extra Memory needed by Large Pages

Large Pages are allocated in Real Memory in segments of 256 MB

• MPI_INIT
  – 80MB which may not be used
  – MP_BUFFER_MEM (default 64MB) can be reduced
  – MPI_BUFFER_ALLOCATE needs memory which may not be used

• OpenMP threads
  – Stack allocated with XLSMPOPTS="stack=…" may not be used

• Fragmentation
  – Memory is "wasted"

• Last 256 MB segment
  – Only a small part of it may be used

Page 14

mpi_profile

• Examine IFS communication using mpi profiling

– Use libmpiprof.a

– Calls and MB/s rate for each type of call
  • Overall
  • For each higher-level subroutine

– Histogram of blocksize for each type of call (a PMPI-based sketch follows)
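The mechanism behind such a wrapper library is the standard PMPI profiling interface: redefine an MPI routine, record statistics, then forward to the PMPI_ version. The sketch below is illustrative only and is not libmpiprof.a itself; it assumes MPI-2-era (non-const) prototypes and profiles MPI_Send only.

  #include <mpi.h>
  #include <stdio.h>

  static long   send_calls = 0;
  static double send_bytes = 0.0;
  static long   send_hist[32];          /* bin k: 2^k <= bytes < 2^(k+1) */

  int MPI_Send(void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      int tsize;
      PMPI_Type_size(type, &tsize);
      long bytes = (long)count * tsize;

      send_calls++;
      send_bytes += (double)bytes;

      int bin = 0;                      /* blocksize histogram */
      while (bin < 31 && (bytes >> (bin + 1)) > 0)
          bin++;
      send_hist[bin]++;

      return PMPI_Send(buf, count, type, dest, tag, comm);   /* real call */
  }

  int MPI_Finalize(void)
  {
      if (send_calls > 0)
          printf("MPI_Send: %ld calls, avg %.1f bytes\n",
                 send_calls, send_bytes / send_calls);
      return PMPI_Finalize();
  }

Linking the wrapper object ahead of the MPI library is, in general, enough for the interception to take effect.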

Page 15

mpi_profile for T799

128 MPI tasks, 2 threads
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine      #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send          49784      52733.2     2625.3       7.873
MPI_Bsend          6171     454107.3     2802.3       1.331
MPI_Isend         84524    1469867.4   124239.1       1.202
MPI_Recv          91940    1332252.1   122487.3     359.547
MPI_Waitall       75884          0.0        0.0      59.772
MPI_Bcast           362         26.6        0.0       0.028
MPI_Barrier        9451          0.0        0.0     436.818
                                                    -------
TOTAL                                               866.574
--------------------------------------------------------------

Barrier indicates load imbalance

Page 16

mpi_profile for 4D_Var min0

128 MPI tasks, 2 threads
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine      #calls   avg. bytes     Mbytes   time(sec)
--------------------------------------------------------------
MPI_Send          43995       7222.9      317.8       1.033
MPI_Bsend         38473      13898.4      534.7       0.843
MPI_Isend        326703     168598.3    55081.6       6.368
MPI_Recv         432364     127061.8    54936.9     220.877
MPI_Waitall      276222          0.0        0.0      23.166
MPI_Bcast           288     374491.7      107.9       0.490
MPI_Barrier       27062          0.0        0.0      94.168
MPI_Allgatherv      466     285958.8      133.3      26.250
MPI_Allreduce      1325         73.2        0.1       1.027
                                                    -------
TOTAL                                               374.223
--------------------------------------------------------------

Barrier indicates load imbalance

Page 17

MPI Profiles for send/recv

[Histograms: number of send/recv calls versus message size (KBytes) for the 799, 399 (hpca), and 4d_var min0 jobs]

Page 18

mpi_profiles for recv/send

                                Avg MB     MB/s per task
                                           hpca     hpcu
T799 (4 tasks per link)
  trltom  (inter node)           1.84        35      224
  trltog  (shrd memory)          4.00       116      890
  slcomm2 (halo)                 0.66        65      363

4D-Var min0 (4 tasks per link)
  trltom  (inter node)           0.167                160
  trltog  (shrd memory)          0.373                490
  slcomm2 (halo)                 0.088                222

Page 19

Conclusions

• Speedups of hpcu over hpca

  Large Pages   Memory Affinity     Speedup
       N               N          1.32 – 1.60
       Y               N          1.43 – 1.62
       N               Y          1.47 – 1.78
       Y               Y          1.52 – 1.85

• Best Environment Variables

– MPI.network=ccc0 (instead of cccs)
– MEMORY_AFFINITY=yes
– MP_AFFINITY=MCM             ! With new pvmd
– MP_BULK_MIN_MSG_SIZE=50000
– LDR_CNTRL="LARGE_PAGE_DATA=Y"   – don't use, else system calls in LP very slow
– MP_EAGER_LIMIT=64K

Page 20

hpca v hpcu

                            ------Time----------   ----Speedup-----      %
       I/O*    LP  Aff      Total   CPU   Comms    Total  CPU  Comms   Comms
min0:
  hpca  ***     N   N        2499   1408   1091                         43.6
  hpcu  H+/22   N   N        1502   1119    383    1.66   1.26  2.85    25.5
        H+/21   N   Y        1321    951    370    1.89   1.48  2.95    28.0
        H+/20   Y   N        1444   1165    279    1.73   1.21  3.91    19.3
        H+/19   Y   Y        1229    962    267    2.03   1.46  4.08    21.7
min1:
  hpca  ***     N   N        1649   1065    584                         43.6
  hpcu  H+/22   N   N        1033    825    208    1.60   1.29  2.81    20.1
        H+/21   N   Y         948    734    214    1.74   1.45  2.73    22.5
        H+/15   Y   N        1019    856    163    1.62   1.24  3.58    16.0
        H+/19   Y   Y         914    765    149    1.80   1.39  3.91    16.3



Page 23

Conclusions

• Memory Affinity with binding (see the sketch below)
  – Program binds to: MOD(task_id*nthrds+thrd_id, 32), or
  – Use new /usr/lpp/ppe.poe/bin/pmdv4
  – How to bind if whole node not used?
  – Try VSRAC code from Montpellier
  – Bind adapter link to MCM?
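A minimal sketch of the first option (the MOD formula), assuming AIX's bindprocessor() interface and a 32-way node; illustrative only, not the IFS or POE source.

  #include <sys/processor.h>   /* bindprocessor(), BINDTHREAD (AIX)   */
  #include <sys/thread.h>      /* thread_self() (AIX)                 */
  #include <stdio.h>
  #include <omp.h>
  #include <mpi.h>

  void bind_my_threads(void)
  {
      int task_id;
      MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

      #pragma omp parallel
      {
          int nthrds  = omp_get_num_threads();
          int thrd_id = omp_get_thread_num();
          int cpu     = (task_id * nthrds + thrd_id) % 32;   /* 32-way node assumed */

          /* bind the calling kernel thread to the chosen logical CPU */
          if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
              perror("bindprocessor");
      }
  }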

• Large Pages
  – Advantages
    • Need LP for best communication B/W with current software
  – Disadvantages
    • Uses extra memory (4GB more per node in 4D-Var min1)
    • Load Leveler scheduling
  – Prototype switch software indicates Large Pages not necessary

• Collective Communication
  – To be investigated

Page 24

Linux compared to PWR4 for IFS

• Linux (run by Peter Mayes)
  – Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
  – Portland Group compiler
  – Compiler flags: -O3 -Mvect=sse
  – No code optimisation or OpenMP
  – Linux 1:  1 CPU/node, Myrinet IP
  – Linux 1A: 1 CPU/node, Myrinet GM
  – Linux 2:  using 2 CPUs/node

• IBM Power4
  – MPI (intra-node shared memory) and OpenMP
  – Compiler flags: -O3 -qstrict
  – hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
  – hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch

Page 25

Linux compared to Pwr4

[Chart: seconds for steps 1 to 11 versus number of processors for T511 (Linux 1, Linux 1A, hpca, hpcu) and T159 (Linux 2, Linux 1, hpca, hpcu)]