BG/L Application Tuning and Lessons Learned

Bob Walkup

IBM Watson Research Center

Performance Decision Tree

Total Performance
    Computation
        Xprofiler: Routines/Source
        HPM: Summary/Blocks
        Compiler: Source Listing
    Communication
        MP_Profiler: Summary/Events
    I/O
        MIO Library

Timing Summary from MPI Wrappers

Data for MPI rank 0 of 32768:
Times and statistics from MPI_Init() to MPI_Finalize().
--------------------------------------------------------
MPI Routine          #calls    avg. bytes    time(sec)
--------------------------------------------------------
MPI_Comm_size             3           0.0        0.000
MPI_Comm_rank             3           0.0        0.000
MPI_Sendrecv           2816      112084.3       23.197
MPI_Bcast                 3          85.3        0.000
MPI_Gather             1626         104.2        0.579
MPI_Reduce               36         207.2        0.028
MPI_Allreduce           679       76586.3       19.810
--------------------------------------------------------
total communication time = 43.614 seconds.
total elapsed time       = 302.099 seconds.
top of the heap address  = 84.832 MBytes.
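As a quick reading of the summary above, the share of elapsed time that rank 0 spends in MPI follows from the two totals. This is arithmetic on the numbers shown, for this one rank only; other ranks may differ.

```latex
\frac{t_{\mathrm{comm}}}{t_{\mathrm{elapsed}}}
  = \frac{43.614\ \mathrm{s}}{302.099\ \mathrm{s}}
  \approx 0.144
```

Roughly 14% of the run is communication on this rank, and MPI_Sendrecv plus MPI_Allreduce account for nearly all of it (23.197 s + 19.810 s = 43.007 s of the 43.614 s).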

Compute-Bound => Use gprof/Xprofiler

Compile/link with -g -pg

Optionally link with libmpitrace.a to limit profiler output: get profile data only for rank 0 and for the ranks with the minimum, maximum, and median communication time.

Analysis is the same for serial and parallel codes.

Gprof => subroutine-level

Xprofiler => statement level: clock ticks tied to source lines
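The rank selection described above (rank 0 plus the ranks with minimum, median, and maximum communication time) can be sketched as follows. How libmpitrace.a does this internally is not shown in these slides; this is a minimal sketch that assumes the per-rank communication times have already been gathered into one array, and the names rank_time and select_profile_ranks are hypothetical.

```c
#include <stdlib.h>

/* Pairs a communication time with the rank that reported it. */
typedef struct { double comm_time; int rank; } rank_time;

/* Order by communication time, breaking ties by rank number. */
static int cmp_rank_time(const void *a, const void *b) {
    const rank_time *x = a, *y = b;
    if (x->comm_time < y->comm_time) return -1;
    if (x->comm_time > y->comm_time) return 1;
    return x->rank - y->rank;
}

/* Fills out[] with rank 0 and the ranks having the minimum, median,
 * and maximum communication time, in that order.
 * Returns 0 on success, -1 on bad input or allocation failure. */
int select_profile_ranks(const double *comm_time, int nranks, int out[4]) {
    if (nranks <= 0) return -1;
    rank_time *v = malloc((size_t)nranks * sizeof *v);
    if (!v) return -1;
    for (int i = 0; i < nranks; i++) {
        v[i].comm_time = comm_time[i];
        v[i].rank = i;
    }
    qsort(v, (size_t)nranks, sizeof *v, cmp_rank_time);
    out[0] = 0;                     /* always keep rank 0 */
    out[1] = v[0].rank;             /* minimum */
    out[2] = v[nranks / 2].rank;    /* median (upper middle if even) */
    out[3] = v[nranks - 1].rank;    /* maximum */
    free(v);
    return 0;
}
```

Only the selected ranks would then write their gprof data, which keeps the output manageable at tens of thousands of tasks.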

Gprof Example : GTC Flat Profile

Each sample counts as 0.01 seconds.
  %    cumulative    self               self     total
 time    seconds    seconds    calls   s/call   s/call   name
37.43     144.07     144.07      201     0.72     0.72   chargei
25.44     241.97      97.90      200     0.49     0.49   pushi
 6.12     265.53      23.56                              _xldintv
 4.85     284.19      18.66                              cos
 4.49     301.47      17.28                              sinl
 4.19     317.61      16.14      200     0.08     0.08   poisson
 3.79     332.18      14.57                              _pxldmod
 3.55     345.86      13.68                              _ieee754_exp

Time is concentrated in two routines and intrinsic functions.

Good prospects for tuning.

Statement-Level Profile : GTC pushi

Line   ticks   source
 115           do m=1,mi
 116     657   r=sqrt(2.0*zion(1,m))
 117     136   rinv=1.0/r
 118      34   ii=max(0,min(mpsi-1,int((r-a0)*delr)))
 119      55   ip=max(1,min(mflux,1+int((r-a0)*d_inv)))
 120     194   wp0=real(ii+1)-(r-a0)*delr
 121      52   wp1=1.0-wp0
 122     104   tem=wp0*temp(ii)+wp1*temp(ii+1)
 123      86   q=q0+q1*r*ainv+q2*r*r*ainv*ainv
 124     166   qinv=1.0/q
 125      68   cost=cos(zion(2,m))
 126      18   sint=sin(zion(2,m))
 129     104   b=1.0/(1.0+r*cost)

Can pipeline expensive operations like sqrt, reciprocal, cos, sin, …

Requires either compiler option (-qhot=vector) or hand-tuning.
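One form the hand-tuning can take is loop splitting: stage the operands of an expensive operation in an array, then apply the operation in a short loop of its own so that successive operations are independent and can be overlapped or handed to a vector routine. The sketch below uses the q and qinv statements from the profile, rewritten in C. It is an illustration under that assumption, not the change actually made to GTC, and the function names are hypothetical.

```c
/* Fused form: each iteration computes q and divides straight away. */
void qinv_fused(const double *r, double *qinv, int n,
                double q0, double q1, double q2, double ainv) {
    for (int m = 0; m < n; m++) {
        double q = q0 + q1 * r[m] * ainv + q2 * r[m] * r[m] * ainv * ainv;
        qinv[m] = 1.0 / q;
    }
}

/* Split form: the first loop stages q in qinv[]; the second loop holds
 * nothing but independent reciprocals, which a compiler can pipeline
 * or replace with a vector reciprocal call. */
void qinv_split(const double *r, double *qinv, int n,
                double q0, double q1, double q2, double ainv) {
    for (int m = 0; m < n; m++)
        qinv[m] = q0 + q1 * r[m] * ainv + q2 * r[m] * r[m] * ainv * ainv;
    for (int m = 0; m < n; m++)
        qinv[m] = 1.0 / qinv[m];
}
```

Both functions produce the same values; whether the split form is faster depends on the compiler and on how well the staging array fits in cache.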

Compiler Listing Example : GTC chargei

Source section:
   55 |! inner flux surface
   56 |  im=ii
   57 |  tdum=pi2_inv*(tflr-zetatmp*qtinv(im))+10.0
   58 |  tdum=(tdum-aint(tdum))*delt(im)
   59 |  j00=max(0,min(mtheta(im)-1,int(tdum)))
   60 |  jtion0(larmor,m)=igrid(im)+j00
   61 |  wtion0(larmor,m)=tdum-real(j00)

Register section:
   GPR's set/used: ssss ssss ssss s-ss ssss ssss ssss ssss
   FPR's set/used: ssss ssss ssss ssss ssss ssss ssss ssss
                   ssss ssss ssss ss-- ---- ---- ---s s--s

Assembler section:
   50| 000E04 stw     0  ST4A   #SPILL52(gr31,520)=gr4
   58| 000E08 bl      0  CALLN  fp1=_xldintv,0,fp1,…
   59| 000E0C mullw   2  M      gr3=gr19,gr15
   58| 000E10 rlwinm  1  SLL4   gr5=gr15,2

Issues: function call for aint(), register spills, pipelining, …

Get listing with source code: -qlist -qsource
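The listing shows aint() compiled as a call to _xldintv, which also appears in the flat profile. One common hand-tuning for the tdum-aint(tdum) pattern, offered here as an assumption and not as the change made in GTC, is to truncate with an integer conversion when the value is known to fit in an integer. The sketch below is the C equivalent; frac_by_cast is a hypothetical name.

```c
/* Fractional part of x, i.e. x - aint(x) in Fortran terms.
 * The cast truncates toward zero, as aint() does.
 * Assumption: |x| is small enough to fit in a long; outside that
 * range the conversion is undefined and a library call is needed. */
static inline double frac_by_cast(double x) {
    return x - (double)(long)x;
}
```

For example, frac_by_cast(10.75) evaluates to 0.75. Whether the conversion beats the library call depends on the compiler and target, so check the new listing and profile after the change.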

GTC – Performance on Blue Gene/L

Original code: main loop time = 384 sec (512 nodes, coprocessor)

Tuned code : main loop time = 244 sec (512 nodes, coprocessor)

Factor of ~1.6 performance improvement by tuning.

Weak scaling, relative performance per processor:

#nodes   coprocessor   virtual-node
   512         1.000          0.974
  1024         1.002          0.961
  2048         0.985          0.963
  4096         1.002          0.956
  8192         1.009          0.935
 16384         0.968          NAN

BG/L Daxpy Performance

[Figure: flops per cycle (0 to 1.2) versus bytes (1.0E+02 to 1.0E+08) for three builds: 440, 440d, and 440d+alignx.]

Daxpy: y(:) = a*x(:) + y(:), with compiler-generated code

Getting Double-FPU Code Generation

Use library routines (blas, vector intrinsics, ffts,…)

Try compiler options : -O3 -qarch=440d (-qlist -qsource)

-O3 -qarch=440d -qhot=simd

Add alignment assertions:

Fortran: call alignx(16,array(1))

C: __alignx(16,array);

Try double-FPU intrinsics:

Fortran: loadfp(A), fpadd(a,b), fpmadd(c,b,a), …

C: __lfpd(address), __fpadd(a,b), __fpmadd(c,b,a)

Can write assembler code.
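The alignment assertion can be sketched on the daxpy kernel as follows. __alignx is the XL compiler built-in named above; the fallback to the GCC/Clang built-in __builtin_assume_aligned, the macro name ASSUME_ALIGNED16, and the function name daxpy_aligned are assumptions added so that the example also compiles off Blue Gene.

```c
#include <stddef.h>

/* Tell the compiler that p is 16-byte aligned.
 * IBM XL: __alignx, as on the slide. GCC/Clang: an equivalent hint.
 * Any other compiler: no-op. */
#if defined(__IBMC__) || defined(__IBMCPP__)
#define ASSUME_ALIGNED16(p) __alignx(16, (p))
#elif defined(__GNUC__)
#define ASSUME_ALIGNED16(p) ((p) = __builtin_assume_aligned((p), 16))
#else
#define ASSUME_ALIGNED16(p) ((void)0)
#endif

/* y[i] = a*x[i] + y[i].
 * The caller must pass 16-byte-aligned x and y; the assertion is a
 * promise to the compiler, not a runtime check. */
void daxpy_aligned(size_t n, double a, const double *x, double *y) {
    ASSUME_ALIGNED16(x);
    ASSUME_ALIGNED16(y);
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

On BG/L this would be built with the options listed above (for example -O3 -qarch=440d), and the -qlist output shows whether paired loads and stores were actually generated.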

16K file write test

[Figure: time in seconds (0 to 350) versus number of directories (1, 8, 64, 512, 4096), with series for write, open, and mkdir.]

Optimizing Communication Performance

3D torus network => the bandwidth is degraded if the traffic goes many hops, sharing links => stay local if possible.

Example: 2D domain on 1024 nodes (8x8x16 torus): try 16x64 with BGLMPI_MAPPING=ZXYT

Example: QCD codes with logical 4D decomposition: try Px = torus x dimension (same for y, z), Pt = 2 (for virtual-node mode)

Layout optimizer: record the communication matrix, then minimize the cost function to obtain a good mapping. Currently limited to modest torus dimensions.
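The "stay local" advice can be checked by counting torus hops between communicating ranks. The sketch below works through the 16x64 example on an 8x8x16 torus. It assumes that in a mapping string the leftmost letter varies fastest (so ZXY means z, then x, then y) and that there is one task per node, so T is ignored; the type and function names are hypothetical.

```c
/* Hops between positions a and b on a ring of the given size;
 * the wraparound link makes the shorter way round available. */
static int ring_hops(int a, int b, int size) {
    int d = a > b ? a - b : b - a;
    return d < size - d ? d : size - d;
}

typedef struct { int x, y, z; } coord3;

/* Rank -> torus coordinates for a ZXY ordering on an nx*ny*nz torus
 * (assumption: z varies fastest, then x, then y). */
coord3 rank_to_coord_zxy(int rank, int nx, int ny, int nz) {
    coord3 c;
    c.z = rank % nz;
    c.x = (rank / nz) % nx;
    c.y = (rank / (nz * nx)) % ny;
    return c;
}

/* Total hops between two nodes: the sum over the three dimensions. */
int torus_hops(coord3 a, coord3 b, int nx, int ny, int nz) {
    return ring_hops(a.x, b.x, nx) + ring_hops(a.y, b.y, ny)
         + ring_hops(a.z, b.z, nz);
}
```

With rank = i + 16*j for the 16x64 domain, neighbors in the 16-wide direction sit on one z ring and are always 1 hop apart, including the periodic wrap from i = 15 to i = 0. Neighbors in the 64-wide direction are 1 hop apart in x, or 2 hops where x wraps and y steps.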

Finding Communication Problems

[Figure: POP communication time, 1920 processors (40x48 decomposition).]

Some Experience with Performance Tools

Follow the decision tree – don’t get buried in data.

Get details for just a subset of MPI ranks.

Use the parallel job for data analysis (min, max, median etc.).

For applications that repeat the same work: sample or trace just a few repeat units.

Save cumulative performance data for all ranks in one file.

Some Lessons Learned

Text core files are great, as long as you get the call stack (need -g).

Use addr2line … takes you from instruction address to the source file and line number. Standard GNU binutils utility; compile/link with -g.

Use the backtrace() library routine – standard GNU libc utility.

Can make wrappers for exit() and abort() routines so that normal exits provide the call stack.
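A minimal sketch of the backtrace() approach follows. It assumes a GNU libc system where <execinfo.h> is available; dump_call_stack and MAX_FRAMES are hypothetical names, and the wrapper mechanism for exit() and abort() is left out.

```c
#include <execinfo.h>

#define MAX_FRAMES 64

/* Writes the current call stack to file descriptor fd, one frame per
 * line, and returns the number of frames captured.
 * backtrace_symbols_fd() does not call malloc, so it remains usable
 * when the heap is in a bad state. */
int dump_call_stack(int fd) {
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

Calling dump_call_stack(2) from an exit or abort wrapper prints the stack to stderr. The instruction addresses in the output can then be passed to addr2line to get source files and line numbers, provided the code was compiled and linked with -g.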

What do you do when >10**4 processes are hung?

Halt the cores, dump stacks, make separate text core files, and use grep. grep -L tells you which of the >10**4 core files were not stopped in an MPI routine; grep + wc (word count) is also useful.

Lesson: Flash Flood

If (task = 0)
    for (t=1, …, P-1) recv data from task t
Else
    send data to task 0
=> results in a flood at task 0

-------------------------------------------------------------------------------------

Add flow control … slow but safe:

If (task = 0)
    for (t=1, …, P-1) {send a flag to task t; then recv data from task t}
Else
    {recv a flag from task 0; then send data to task 0}

Lesson: P*P => Can’t Scale

integer table(P,P) requires 1 GB for P = 16K

Memory requirement limits scalability: example Metis

Can sometimes replace table(P,P) with local(P) and remote(P) plus communication to get values stored elsewhere.

Some computational algorithms scale as P*P, which can limit scaling: example - certain methods for automatic mesh refinement
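The 1 GB figure for table(P,P) can be reproduced with the arithmetic below, set against the cost of the local(P) plus remote(P) replacement. It assumes 4-byte integers, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed on every task for a full P-by-P table. */
uint64_t full_table_bytes(uint64_t P, size_t elem_size) {
    return P * P * (uint64_t)elem_size;
}

/* Bytes per task when the table is replaced by two length-P arrays,
 * local(P) and remote(P). */
uint64_t split_table_bytes(uint64_t P, size_t elem_size) {
    return 2 * P * (uint64_t)elem_size;
}
```

For P = 16K = 16384 and 4-byte integers, the full table needs 16384 * 16384 * 4 = 1073741824 bytes (1 GB) on each task, while the two arrays need 131072 bytes (128 KB).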

More Processors = More Fun

Looking forward to the petaflop scale …
