BG/L Application Tuning and Lessons Learned
Bob Walkup
IBM Watson Research Center
Performance Decision Tree
Total Performance
  Computation
    Xprofiler: Routines/Source
    HPM: Summary/Blocks
    Compiler: Source Listing
  Communication
    MP_Profiler: Summary/Events
  I/O
    MIO Library
Timing Summary from MPI Wrappers
Data for MPI rank 0 of 32768:
Times and statistics from MPI_Init() to MPI_Finalize().
--------------------------------------------------------
MPI Routine        #calls   avg. bytes   time(sec)
--------------------------------------------------------
MPI_Comm_size           3          0.0       0.000
MPI_Comm_rank           3          0.0       0.000
MPI_Sendrecv         2816     112084.3      23.197
MPI_Bcast               3         85.3       0.000
MPI_Gather           1626        104.2       0.579
MPI_Reduce             36        207.2       0.028
MPI_Allreduce         679      76586.3      19.810
--------------------------------------------------------
total communication time = 43.614 seconds.
total elapsed time = 302.099 seconds.
top of the heap address = 84.832 MBytes.
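The bookkeeping behind a summary like this can be sketched in a few lines of C. This is a minimal sketch, not the code of the wrapper library used for the slide: the `routine_stats` type and the function names are hypothetical. The only MPI fact relied on is the standard profiling interface, where a wrapper named `MPI_X` can time a call to `PMPI_X`; the sketch itself is MPI-free so that it compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-routine record; field names are illustrative only. */
typedef struct {
    const char *name;   /* e.g. "MPI_Sendrecv" */
    long calls;         /* number of calls seen so far */
    double bytes;       /* total bytes over all calls */
    double seconds;     /* total time spent inside the routine */
} routine_stats;

/* Called by a wrapper after the real routine returns. In an MPI build,
 * a wrapper named MPI_X would time its call to PMPI_X and report here. */
void record_call(routine_stats *s, double bytes, double seconds)
{
    assert(s != NULL && bytes >= 0.0 && seconds >= 0.0);
    s->calls += 1;
    s->bytes += bytes;
    s->seconds += seconds;
}

/* Average message size, as in the "avg. bytes" column. */
double avg_bytes(const routine_stats *s)
{
    return s->calls > 0 ? s->bytes / (double)s->calls : 0.0;
}

/* Sum of the per-routine times, as in "total communication time". */
double total_comm_seconds(const routine_stats *table, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += table[i].seconds;
    return total;
}
```

For the table above, the per-routine times sum to 43.614 seconds out of 302.099 seconds elapsed, so communication is roughly 14% of the run.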
Compute-Bound => Use gprof/Xprofiler
Compile/link with -g -pg
Optionally link with libmpitrace.a to limit profiler output: get profile data only for rank 0 and for the ranks with the minimum, maximum, and median communication time.
Analysis is the same for serial and parallel codes.
Gprof => subroutine-level
Xprofiler => statement level: clock ticks tied to source lines
Gprof Example : GTC Flat Profile
Each sample counts as 0.01 seconds.
  %    cumulative    self               self     total
 time    seconds    seconds    calls   s/call   s/call   name
37.43     144.07     144.07      201     0.72     0.72   chargei
25.44     241.97      97.90      200     0.49     0.49   pushi
 6.12     265.53      23.56                              _xldintv
 4.85     284.19      18.66                              cos
 4.49     301.47      17.28                              sinl
 4.19     317.61      16.14      200     0.08     0.08   poisson
 3.79     332.18      14.57                              _pxldmod
 3.55     345.86      13.68                              _ieee754_exp
Time is concentrated in two routines and intrinsic functions.
Good prospects for tuning.
Statement-Level Profile : GTC pushi
Line   ticks   source
 115           do m=1,mi
 116     657     r=sqrt(2.0*zion(1,m))
 117     136     rinv=1.0/r
 118      34     ii=max(0,min(mpsi-1,int((r-a0)*delr)))
 119      55     ip=max(1,min(mflux,1+int((r-a0)*d_inv)))
 120     194     wp0=real(ii+1)-(r-a0)*delr
 121      52     wp1=1.0-wp0
 122     104     tem=wp0*temp(ii)+wp1*temp(ii+1)
 123      86     q=q0+q1*r*ainv+q2*r*r*ainv*ainv
 124     166     qinv=1.0/q
 125      68     cost=cos(zion(2,m))
 126      18     sint=sin(zion(2,m))
 129     104     b=1.0/(1.0+r*cost)
Can pipeline expensive operations like sqrt, reciprocal, cos, sin, …
Requires either compiler option (-qhot=vector) or hand-tuning.
Compiler Listing Example : GTC chargei
Source section:
 55 |! inner flux surface
 56 |        im=ii
 57 |        tdum=pi2_inv*(tflr-zetatmp*qtinv(im))+10.0
 58 |        tdum=(tdum-aint(tdum))*delt(im)
 59 |        j00=max(0,min(mtheta(im)-1,int(tdum)))
 60 |        jtion0(larmor,m)=igrid(im)+j00
 61 |        wtion0(larmor,m)=tdum-real(j00)
Register section:
 GPR's set/used: ssss ssss ssss s-ss ssss ssss ssss ssss
 FPR's set/used: ssss ssss ssss ssss ssss ssss ssss ssss
                 ssss ssss ssss ss-- ---- ---- ---s s--s
Assembler section:
 50| 000E04 stw      0  ST4A   #SPILL52(gr31,520)=gr4
 58| 000E08 bl       0  CALLN  fp1=_xldintv,0,fp1,…
 59| 000E0C mullw    2  M      gr3=gr19,gr15
 58| 000E10 rlwinm   1  SLL4   gr5=gr15,2
Issues: function call for aint(), register spills, pipelining, …
Get listing with source code: -qlist -qsource
GTC – Performance on Blue Gene/L
Original code: main loop time = 384 sec (512 nodes, coprocessor)
Tuned code : main loop time = 244 sec (512 nodes, coprocessor)
Factor of ~1.6 performance improvement by tuning.
Weak scaling, relative performance per processor:
#nodes   coprocessor   virtual-node
   512         1.000          0.974
  1024         1.002          0.961
  2048         0.985          0.963
  4096         1.002          0.956
  8192         1.009          0.935
 16384         0.968            NAN
BG/L Daxpy Performance
[Figure: flops per cycle (0 to 1.2) versus bytes (1.0E+02 to 1.0E+08, log scale), with curves labeled 440, 440d, and 440d+alignx]
Daxpy: y(:) = a*x(:) + y(:), with compiler-generated code
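For reference, the daxpy kernel in the caption can be written in C as below. This is a plain scalar sketch and not the compiler-generated code measured in the figure. The `restrict` qualifiers are an assumption about the caller (that `x` and `y` do not overlap), which is the kind of guarantee a compiler typically needs before it will pair up loads and stores.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a*x[i] + y[i] for i in [0, n).
 * restrict promises the compiler that x and y do not overlap. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Each iteration performs one multiply and one add (2 flops) while loading two doubles and storing one, so the kernel is limited by memory traffic once the arrays no longer fit in cache.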
Getting Double-FPU Code Generation
Use library routines (blas, vector intrinsics, ffts,…)
Try compiler options : -O3 -qarch=440d (-qlist -qsource)
-O3 -qarch=440d -qhot=simd
Add alignment assertions:
Fortran: call alignx(16,array(1))
C: __alignx(16,array);
Try double-FPU intrinsics:
Fortran: loadfp(A), fpadd(a,b), fpmadd(c,b,a), …
C : __ldpf(address), __fpadd(a,b), __fpmadd(c,b,a)
Can write assembler code.
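An alignment assertion is only safe when it is true, so it can help to allocate arrays on a 16-byte boundary and check the address before asserting it. The sketch below uses portable C11 (`aligned_alloc`) rather than the XL-specific calls listed above; the helper names `is_aligned16` and `alloc_doubles_aligned16` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Allocates room for n doubles on a 16-byte boundary with C11
 * aligned_alloc. The byte count is rounded up to a multiple of 16,
 * which C11 requires. Release the result with free(). */
double *alloc_doubles_aligned16(size_t n)
{
    assert(n < SIZE_MAX / sizeof(double) - 2);
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + 15u) / 16u * 16u;
    if (rounded == 0)
        rounded = 16;
    return aligned_alloc(16, rounded);
}
```

A caller would check `is_aligned16(array)` once, outside the hot loop, before making the corresponding alignment assertion.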
16K file write test
[Figure: time in seconds (0 to 350) versus number of directories (1, 8, 64, 512, 4096), with series for write, open, and mkdir]
Optimizing Communication Performance
3D Torus network => the bandwidth is degraded if the traffic
goes many hops, sharing links => stay local if possible.
Example: 2D domain on 1024 nodes (8x8x16 torus)
try 16x64 with BGLMPI_MAPPING=ZXYT
Example: QCD codes with logical 4D decomposition
try Px = torus x dimension (same for y, z)
Pt = 2 (for virtual-node-mode)
Layout optimizer : Record the communication matrix, then
minimize the cost function to obtain a good mapping.
Currently limited to modest torus dimensions.
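The idea of scoring a mapping can be sketched as follows. The slide does not give the layout optimizer's cost function, so the bytes-times-hops sum below is an assumption chosen for illustration, and the `coord` type and function names are hypothetical. Hop counts use the shortest path around each ring of the torus, including the wraparound link.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y, z; } coord;

/* Shortest distance between positions a and b on a ring of length dim. */
int ring_hops(int a, int b, int dim)
{
    assert(dim > 0 && a >= 0 && a < dim && b >= 0 && b < dim);
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;
}

/* Hop count between two nodes of a 3D torus with the given dimensions. */
int torus_hops(coord a, coord b, coord dims)
{
    return ring_hops(a.x, b.x, dims.x)
         + ring_hops(a.y, b.y, dims.y)
         + ring_hops(a.z, b.z, dims.z);
}

/* Assumed cost: sum over task pairs of bytes sent times hops travelled.
 * bytes is an n-by-n row-major communication matrix; map[i] is the
 * torus coordinate assigned to task i. */
double mapping_cost(int n, const double *bytes, const coord *map, coord dims)
{
    double cost = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            cost += bytes[i * n + j] * torus_hops(map[i], map[j], dims);
    return cost;
}
```

On an 8x8x16 torus, the nodes at z = 0 and z = 15 are one hop apart because of the wraparound link, so a mapping that places communicating neighbors there is as good as placing them side by side.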
Finding Communication Problems
[Figure: POP communication time, 1920 processors (40x48 decomposition)]
Some Experience with Performance Tools
Follow the decision tree – don’t get buried in data.
Get details for just a subset of MPI ranks.
Use the parallel job for data analysis (min, max, median etc.).
For applications that repeat the same work:
sample or trace just a few repeat units.
Save cumulative performance data for all ranks in one file.
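Choosing which ranks to report in detail can be sketched as below: given one communication time per rank, pick the ranks with the minimum, median, and maximum values. This is a serial sketch with hypothetical names; in a real job the per-rank times would first have to be collected on one task (for example with a gather), which is not shown.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int rank; double seconds; } rank_time;

/* qsort comparator: ascending by time. */
static int by_seconds(const void *a, const void *b)
{
    double x = ((const rank_time *)a)->seconds;
    double y = ((const rank_time *)b)->seconds;
    return (x > y) - (x < y);
}

/* Fills out[0], out[1], out[2] with the ranks that have the minimum,
 * median, and maximum communication time. seconds[r] is the time for
 * rank r. Returns 0 on success, -1 if memory allocation fails. */
int pick_min_median_max(const double *seconds, int nranks, int out[3])
{
    assert(seconds != NULL && out != NULL && nranks > 0);
    rank_time *v = malloc((size_t)nranks * sizeof *v);
    if (v == NULL)
        return -1;
    for (int r = 0; r < nranks; r++) {
        v[r].rank = r;
        v[r].seconds = seconds[r];
    }
    qsort(v, (size_t)nranks, sizeof *v, by_seconds);
    out[0] = v[0].rank;            /* minimum */
    out[1] = v[nranks / 2].rank;   /* median (upper one if nranks is even) */
    out[2] = v[nranks - 1].rank;   /* maximum */
    free(v);
    return 0;
}
```

Only the selected ranks, plus rank 0, then need to write detailed profile or trace data.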
Some Lessons Learned
Text core files are great, as long as you get the call stack (need -g).
Use addr2line: it takes you from an instruction address to the source file and line number. It is a standard GNU binutils utility; compile and link with -g.
Use the backtrace() library routine – standard GNU libc utility.
Can make wrappers for exit() and abort() routines so that normal
exits provide the call stack.
What do you do when >10**4 processes are hung?
halt cores, dump stacks, make separate text core files, use grep
(grep -L tells you which of the >10**4 core files were not
stopped in an MPI routine); also use grep + wc (word count).
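The backtrace() approach mentioned above can be sketched with the glibc `<execinfo.h>` calls `backtrace` and `backtrace_symbols_fd`. The function names `dump_call_stack` and `exit_with_stack` are hypothetical, and this is not the wrapper code used on BG/L, only an illustration of the idea on a system with glibc.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdlib.h>

#define MAX_FRAMES 64

/* Writes the current call stack to file descriptor fd as text, one
 * frame per line, and returns the number of frames captured. */
int dump_call_stack(int fd)
{
    void *frames[MAX_FRAMES];
    assert(fd >= 0);
    int n = backtrace(frames, MAX_FRAMES);
    backtrace_symbols_fd(frames, n, fd);
    return n;
}

/* Hypothetical stand-in for a wrapped exit(): print the call stack to
 * stderr (descriptor 2), then exit normally. */
void exit_with_stack(int status)
{
    dump_call_stack(2);
    exit(status);
}
```

The printed instruction addresses can then be passed to addr2line together with the executable (its -e option) to recover file and line numbers, provided the code was built with -g.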
Lesson: Flash Flood
If (task = 0)
for (t=1, …, P-1) recv data from task t
Else
send data to task 0 => results in a flood at task 0
-------------------------------------------------------------------------------------
Add flow control … slow but safe:
If (task = 0)
for (t=1, … P-1) {send a flag to task t; then recv data from task t}
Else
{recv a flag from task 0; then send data to task 0}
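The difference between the two patterns can be illustrated with a toy model that counts how many messages may be sitting unreceived at task 0 in the worst case. This is not MPI code and makes no claim about how any MPI library buffers messages; the function name and the worst-case ordering are assumptions made for illustration.

```c
#include <assert.h>

/* Toy model, not MPI: returns the worst-case number of messages that
 * have arrived at task 0 but have not yet been received.
 * P is the number of tasks; flow_control selects the second pattern. */
int peak_pending_at_root(int P, int flow_control)
{
    assert(P >= 1);
    int pending = 0;
    int peak = 0;
    if (!flow_control) {
        /* Worst case: every other task sends before task 0 receives. */
        for (int t = 1; t < P; t++) {
            pending++;
            if (pending > peak)
                peak = pending;
        }
        /* Task 0 then drains the messages one at a time. */
        for (int t = 1; t < P; t++)
            pending--;
    } else {
        for (int t = 1; t < P; t++) {
            /* Task 0 sends a flag to task t; only then does t send. */
            pending++;
            if (pending > peak)
                peak = pending;
            /* Task 0 receives from t before flagging the next task. */
            pending--;
        }
    }
    assert(pending == 0);
    return peak;
}
```

With P = 32768 the first pattern can leave up to 32767 messages waiting at task 0, while the flow-controlled pattern never has more than one outstanding, at the price of an extra flag message per task.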
Lesson: P*P => Can’t Scale
integer table(P,P) requires 1 GB for P = 16K
Memory requirement limits scalability: example Metis
Can sometimes replace table(P,P) with local(P) and remote(P) plus communication to get values stored elsewhere.
Some computational algorithms scale as P*P, which can limit scaling: example - certain methods for automatic mesh refinement
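The 1 GB figure follows from simple arithmetic, sketched below under the assumptions that an integer is 4 bytes and that 16K means 16384. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed on each task for a full table(P,P). */
uint64_t full_table_bytes(uint64_t P, uint64_t elem_bytes)
{
    assert(P > 0 && elem_bytes > 0);
    return P * P * elem_bytes;
}

/* Bytes needed on each task for local(P) plus remote(P). */
uint64_t split_table_bytes(uint64_t P, uint64_t elem_bytes)
{
    assert(P > 0 && elem_bytes > 0);
    return 2u * P * elem_bytes;
}
```

For P = 16384 and 4-byte integers, the full table takes 16384 * 16384 * 4 = 2^30 bytes (1 GB) per task, while the two vectors together take 128 KB.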
More Processors = More Fun
Looking forward to the petaflop scale …