31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 1
Felix Wolf, TU Darmstadt
Automatic Performance Modeling of HPC Applications
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 2
Acknowledgement
• Christian Bischof, Alexandru Calotoiu, Christian Iwainsky, Sergei Shudler
• Torsten Hoefler, Grzegorz Kwasniewski
• Bernd Mohr, Alexandre Strube
• Andreas Vogel, Gabriel Wittum
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 3
Latent scalability bugs
[Plot: execution time as a function of system size]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 5
Scalability model
[Plot: measured execution time as a function of the number of processes, with fitted scalability model t ≈ 3·10^-4·p^2 + c]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 6
Analytical scalability modeling
Identify kernels
• Parts of the program that dominate its performance at larger scales
• Identified via small-scale tests and intuition
Create models
• Laborious process
• Still confined to a small community of skilled experts
Disadvantages
• Time consuming
• Danger of overlooking unscalable code
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 7
Objective
Make performance modeling accessible to a wider audience of HPC application developers through automation
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 8
Result summary
Core methods
• Automatic empirical modeling [SC13]
• Compiler-driven modeling [PACT'14+15] [SPAA'14]
• Scalability validation framework [ICS'15]
Infrastructure
• Interactive explorer • Workflow manager • LLVM
Application case studies
• Mantevo, NAS, MPI, MAFIA, HOMME, JuSPIC, MP2C, NEST, Sweep3D, XNS, OpenMP [EuroPar'15], SKIN, CG, GMG, LAPLACE, UG4 [EuroPar'15]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 9
Automated empirical modeling
$f(p) = \sum_{k=1}^{n} c_k \cdot p^{i_k} \cdot \log_2^{j_k}(p)$
Performance model normal form (PMNF)
Generation of candidate models and selection of best fit
c1 + c2·p
c1 + c2·p^2
c1 + c2·log(p)
c1 + c2·p·log(p)
c1 + c2·p^2·log(p)
c1·log(p) + c2·p
c1·log(p) + c2·p·log(p)
c1·log(p) + c2·p^2
c1·log(p) + c2·p^2·log(p)
c1·p + c2·p·log(p)
c1·p + c2·p^2
c1·p + c2·p^2·log(p)
c1·p·log(p) + c2·p^2
c1·p·log(p) + c2·p^2·log(p)
c1·p^2 + c2·p^2·log(p)
Small-scale measurements → model output:
Kernel [2 of 40] | Model [s] t = f(p)
sweep → MPI_Recv | 4.03·√p
sweep | 582.19
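A minimal sketch of this generation-and-selection step (not the actual Extra-P implementation; the names and the tiny search space are illustrative assumptions): every two-term PMNF candidate is fitted by least squares and the best fit wins.

```python
import itertools
import numpy as np

def term(p, i, j):
    """Evaluate p^i * log2(p)^j element-wise."""
    return p**i * np.log2(p)**j

# Exponent pairs (i, j) for p^i * log2(p)^j -- a small search space,
# mirroring the candidate list on this slide.
EXPONENTS = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

def best_two_term_model(p, t):
    """Fit every candidate c1*term1 + c2*term2 by least squares; return the best R^2."""
    best = None
    for e1, e2 in itertools.combinations(EXPONENTS, 2):
        A = np.column_stack([term(p, *e1), term(p, *e2)])
        coef, *_ = np.linalg.lstsq(A, t, rcond=None)
        ss_res = float(np.sum((t - A @ coef) ** 2))
        ss_tot = float(np.sum((t - t.mean()) ** 2))
        r2 = 1.0 - ss_res / ss_tot
        if best is None or r2 > best[0]:
            best = (r2, coef, (e1, e2))
    return best

# Synthetic example: measurements that follow 10 + 3e-3 * p * log2(p)
p = np.array([64, 128, 256, 512, 1024], dtype=float)
t = 10 + 3e-3 * p * np.log2(p)
print(best_two_term_model(p, t))
```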
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 10
Model refinement
• Input data; start with n = 1 and R_0^2 = −∞
• Hypothesis generation for hypothesis size n
• Hypothesis evaluation via cross-validation
• Computation of R_n^2 for the best hypothesis
• If R_{n−1}^2 > R_n^2 or n = n_max: emit the scaling model; otherwise increment n and repeat
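The loop above can be written down directly. This sketch assumes a helper fit_hypotheses_of_size(n, p, t) (hypothetical, not part of any tool) that evaluates all PMNF hypotheses with n terms via cross-validation and returns the best R_n^2 together with the fitted model.

```python
def refine(p, t, n_max, fit_hypotheses_of_size):
    """Iterative refinement over the hypothesis size n (simplified sketch)."""
    best_model, r2_prev, n = None, float("-inf"), 1   # n = 1, R_0^2 = -infinity
    while True:
        # Assumed helper: best cross-validated R_n^2 and model for size n.
        r2_n, model = fit_hypotheses_of_size(n, p, t)
        if r2_prev > r2_n or n == n_max:
            # Larger hypotheses stopped improving the fit (or the limit was reached).
            return best_model if r2_prev > r2_n else model
        best_model, r2_prev, n = model, r2_n, n + 1
```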
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 11
Modeling operations vs. time
• Program requirements: computation (FLOPS, loads, stores) and communication (#msgs, #bytes, …)
• Requirement metrics are modeled alongside execution time
Disagreement may be indicative of wait states
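A tiny illustration of what "disagreement" means here (a sketch with made-up numbers and my own model representation, not a tool interface): if the time spent in a communication call grows with p while its message count stays constant, the extra time is unexplained by the requirements and likely consists of waiting.

```python
# Toy representation: a model is a dict {(i, j): coefficient} of PMNF terms c * p^i * log2(p)^j.
def leading_term(model):
    return max((i, j) for (i, j), c in model.items() if c != 0)

msgs = {(0, 0): 12.0}                    # #msgs per process: constant
recv = {(0, 0): 1.2, (0.5, 0): 0.4}      # time in MPI_Recv: grows with sqrt(p)

if leading_term(recv) > leading_term(msgs):
    print("Time grows faster than the communication requirements -> suspect wait states")
```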
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 12
Case studies
Sweep3D LULESH MILC HOMME JuSPIC
XNS NEST UG4 MP2C BLAST
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 13
Sweep3D - Neutron transport simulation
• LogGP model for communication developed by Hoisie et al.
Kernel [2 of 40] | Model [s] t = f(p) | Predictive error [%] at p_t = 262k (training runs p_i ≤ 8k)
sweep → MPI_Recv | 4.03·√p | 5.10
sweep | 582.19 | 0.01
(#bytes = const., #msg = const.)
Analytical LogGP model: t_comm = [2(p_x + p_y − 2) + 4(n_sweep − 1)]·t_msg ⇒ t_comm = c·√p
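A quick check (my own arithmetic, assuming a near-square process grid with p_x ≈ p_y ≈ √p and constant n_sweep and t_msg) shows why the LogGP expression and the empirical model agree:

$$ t_{\mathrm{comm}} = \bigl[\,2(p_x + p_y - 2) + 4(n_{\mathrm{sweep}} - 1)\,\bigr]\, t_{\mathrm{msg}} = \bigl[\,4\sqrt{p} + 4 n_{\mathrm{sweep}} - 8\,\bigr]\, t_{\mathrm{msg}} \in \Theta(\sqrt{p}), \quad\text{i.e. } t_{\mathrm{comm}} \approx c \cdot \sqrt{p}. $$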
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 14
HOMME – Climate
• Core of the Community Atmosphere Model (CAM)
• Spectral element dynamical core on a cubed-sphere grid
Kernel [3 of 194] | Model [s] t = f(p) | Predictive error [%] at p_t = 130k (training runs p_i ≤ 43k)
box_rearrange → MPI_Reduce | 3.63·10^-6·p·√p + 7.21·10^-13·p^3 | 30.34
vlaplace_sphere_wk | 24.44 + 2.26·10^-7·p^2 | 4.28
compute_and_apply_rhs | 49.09 | 0.83
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 15
HOMME – Climate (2)
Table 4: Models of the kernels of HOMME derived from smaller and larger-scale input configurations. The predictive error refers to the target scale of p_t = 130k.
Kernel | P4 (p_i ≤ 15,000): Model [s] t = f(p) | Pred. error [%] | P5 (p_i ≤ 43,350): Model [s] t = f(p) | Pred. error [%]
box_rearrange → MPI_Reduce | 0.026 + 2.53·10^-6·p·√p + 1.24·10^-12·p^3 | 57.02 | 3.63·10^-6·p·√p + 7.21·10^-13·p^3 | 30.34
vlaplace_sphere_wk | 49.53 | 99.32 | 24.44 + 2.26·10^-7·p^2 | 4.28
laplace_sphere_wk | 44.08 | 99.32 | 21.84 + 1.96·10^-7·p^2 | 2.34
biharmonic_wk | 34.40 | 99.33 | 17.92 + 1.57·10^-7·p^2 | 3.43
divergence_sphere_wk | 16.88 | 99.31 | 8.02 + 7.56·10^-8·p^2 | 4.25
vorticity_sphere | 9.74 | 99.55 | 6.51 + 7.09·10^-8·p^2 | 8.66
divergence_sphere | 15.36 | 99.33 | 7.74 + 6.91·10^-8·p^2 | 0.95
gradient_sphere | 14.77 | 99.33 | 6.33 + 6.88·10^-8·p^2 | 5.17
advance_hypervis | 9.76 | 99.25 | 5.5 + 3.91·10^-8·p^2 | 1.47
compute_and_apply_rhs | 48.68 | 1.65 | 49.09 | 0.83
euler_step | 28.08 | 0.51 | 28.13 | 0.33
kernels. Obviously, the enlarged set exposes a phenomenon not visible in the smaller set. With the number of processes chosen large enough, both the quadratic and the cubic terms will turn into serious bottlenecks, contradicting our initial expectation that the code would scale well. The table also shows the predictive error, which characterizes the deviation of the prediction from the measurement at the target scale p_t = 130k, highlighting the benefits of including the extra data points.
After looking at the number of times any of the quadratic kernels was visited at runtime, a metric we measure and model as well, the quadratic growth was found to be the consequence of an increasing number of iterations inside a particular subroutine. Interestingly, the formula by which the number of iterations is computed contained a ceiling term that limits the number of iterations to one for up to and including 15k processes. Beyond this threshold, a term depending quadratically on the process count causes the number of iterations executed to grow rapidly, causing a significant drop in performance. It turned out that the developers were aware of this issue and had already developed a temporary solution, involving manual adjustments of their production code configurations. Specifically, they fix the number of iterations and carefully tune other configuration parameters to ensure numerical stability. Nevertheless, the issue was correctly detected by our tool. Given the tuning necessary to ensure numerical stability, a weak scaling analysis of the workaround is beyond the scope of this paper.
In contrast to the previous problem, the cubic growth of the time spent in the reduce function was previously unknown. The reduce is needed to funnel data to dedicated I/O processes. The coefficient of the dominant term at scale is very small (i.e., on the order of 10^-13). While not visible at smaller scales, it will have an explosive effect on performance at larger scales, becoming significant even if executed just once. The reason why this phenomenon remained unnoticed until today is that it belongs to the initialization phase of the code, which was not assumed to be performance-relevant in larger production runs. While not yet crippling in terms of the overall runtime, which is on the order of days for production runs, the issue already cost more than one hour in the large-scale experiments we conducted. The problem was reported back to the developers at NCAR, who are currently working on a solution. The example demonstrates the advantage of modeling the entire application vs. only selected candidate kernels expected to be time-intensive. Some problems simply might escape attention because non-linear relationships make our intuition less reliable at larger scales.
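To see why a coefficient on the order of 10^-13 still matters, one can evaluate the cubic term of the P5 model at growing scales (my own arithmetic based on the published model, not additional measurements):

$$ 7.21\cdot 10^{-13}\, p^{3} \approx 1.6\cdot 10^{3}\ \mathrm{s}\ (\approx 26\ \mathrm{min})\ \text{at}\ p = 1.3\cdot 10^{5}, \qquad \approx 7.2\cdot 10^{5}\ \mathrm{s}\ (\approx 8\ \mathrm{days})\ \text{at}\ p = 10^{6}. $$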
Figure 4 summarizes our two findings and compares our predictions with actual measurements. While the growing iteration count seems to be more urgent now, the reduce might become the more serious issue in the future.
[Figure 4: Runtime of selected kernels in HOMME (MPI_Reduce, vlaplace_sphere_wk, compute_and_apply_rhs) as a function of the number of processes (2^10 to 2^22). The graph compares predictions (dashed or continuous lines) to measurements (small triangles, squares, and circles) and marks the boundary between the training and prediction ranges.]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 16
Which problem? Where in the program?
Which process?
www.scalasca.org
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 17
Download and training
• Tuning workshop in Garching, April 18-22, 2016
• Past tutorials
• Extra-P: http://www.scalasca.org/software/extra-p/
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 18
OpenMP
• How scalable are OpenMP runtimes in terms of the number of threads?
• Challenges
• Simultaneous entry of construct to be measured
• Different behavioral classes
• Noise
[Plot: measured OpenMP construct overhead (0 to 4·10^-4 s) as a function of the thread count (0–128) for the GNU, Intel, and PGI runtimes]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 19
OpenMP
All Models
[Table: empirically generated growth models of OpenMP construct overheads as a function of the thread count t. Rows: runtime and thread-count configuration (GNU Po2/ODD, Intel Po2, PGI Po2/ODD, Intel (Phi) Po2/Linear/8X, IBM Po2); columns: parallel, barrier, dynamic 16, static 16, guided 16, first private; entries: terms of the form t^a and t^a·log t.]
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 22
Algorithm engineering
Courtesy of Peter Sanders, KIT
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 23
How to validate scalability in practice?
• Small textbook example: the program's expectation is a verifiable analytical expression, e.g. #FLOPS = n^2·(2n − 1)
• Real application: the expectation is usually only an asymptotic complexity, e.g. #FLOPS = O(n^2.8074)
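The textbook expression can even be checked mechanically. A small sketch (my own illustration) counts the operations of the classic triple-loop matrix multiplication, with n multiplications and n − 1 additions per output element:

```python
# Count the FLOPs of the naive n x n matrix multiplication and
# verify the closed form n^2 * (2n - 1) quoted above.
def naive_matmul_flops(n):
    flops = 0
    for _ in range(n * n):      # one inner product per output element
        flops += n              # multiplications
        flops += n - 1          # additions
    return flops

for n in (2, 8, 100):
    assert naive_matmul_flops(n) == n * n * (2 * n - 1)
```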
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 24
HPC libraries
• Focus on algorithms rather than applications
• Theoretical expectations more common
• Reuse factor makes scalability even more important
Example: MPI communication library
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 25
Scalability validation framework
• Expectation (+ optional deviation limit) → search space generation
• Benchmark → performance measurements → model generation → scaling model
• Scaling model vs. expectation → divergence model
• Use cases: initial validation, comparing alternatives, regression testing
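A minimal sketch of the divergence step (my own simplified representation of models by their leading-term exponents, not the framework's actual interface): the divergence is the "ratio" of the generated model and the expectation, and a constant (or shrinking) divergence counts as a match.

```python
# Leading terms as (alpha, beta), meaning p^alpha * log(p)^beta.
def divergence(model, expectation):
    """Divergence model = generated model / expectation, as exponent differences."""
    return (model[0] - expectation[0], model[1] - expectation[1])

def matches(model, expectation):
    # O(1) or shrinking divergence => the library meets the expectation.
    return divergence(model, expectation) <= (0, 0)

# Allreduce on Piz Daint: expectation O(log p) -> (0, 1), model O(p^0.67 log p) -> (0.67, 1)
print(divergence((0.67, 1), (0, 1)))   # (0.67, 0): divergence grows like p^0.67
print(matches((0.67, 1), (0, 1)))      # False -> expectation violated
```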
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 26
Customized search space
• Constructed around expectation
• Supports wider range of model functions than original PMNF
• Search space boundaries: 1 … E(x) … E^2(x); deviation limits around the expectation: E(x)·D(x), E(x)/D(x)
• Models between a deviation limit and a search space boundary: approximate match; models outside the boundaries: no match
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 27
Platform Juqueen Juropa Piz Daint
MPI memory [MB] Expectation: O (log p)
Model O (log p) O (p) O (log p)
R2 0.72 1 0.23
Divergence O (1) O (p / log p) O (1)
Match ✔ ✘ ✔
Comm_create [B] Expectation: O (p)
Model O (p) O (p) O (p)
R2 1 1 0.99
Divergence O (1) O (1) O (1)
Match ✔ ✔ ✔
Win_create [B] Expectation: O (p)
Model O (p) O (p) O (p)
R2 1 1 0.99
Divergence O (1) O (1) O (1)
Match ✔ ✔ ✔
MPI
Platform Juqueen Juropa Piz Daint
Barrier [s] Expectation: O (log p)
Model O (log p) O (p^0.67 log p) O (p^0.33)
R2 0.99 0.99 0.99
Divergence O (1) O (p^0.67) O (p^0.33 / log p)
Match ✔ ✘ ~
Bcast [s] Expectation: O (log p)
Model O (log p) O (p^0.5) O (p^0.5)
R2 0.86 0.98 0.94
Divergence O (1) O (p^0.5 / log p) O (p^0.5 / log p)
Match ✔ ~ ~
Reduce [s] Expectation: O (log p)
Model O (log p) O (p^0.5 log p) O (p^0.5 log p)
R2 0.93 0.99 0.94
Divergence O (1) O (p^0.5) O (p^0.5)
Match ✔ ~ ~
Platform Juqueen Juropa Piz Daint
Allreduce [s] Expectation: O (log p)
Model O (log p) O (p^0.5) O (p^0.67 log p)
R2 0.87 0.99 0.99
Divergence O (1) O (p^0.5 / log p) O (p^0.67)
Match ✔ ~ ✘!
Comm_dup [B] Expectation: O (1)
Model 2.2e5 256 3770 + 18p
R2 1 1 0.99
Divergence O (1) O (1) O (p)
Match ✔ ✔ ✘
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 28
Mass-producing performance models
• Is feasible • Offers insight • Requires low effort • Improves code coverage
31/01/16 | Department of Computer Science | Laboratory for Parallel Programming | Prof. Dr. Felix Wolf | 29
References
• Arnamoy Bhattacharyya, Grzegorz Kwasniewski, Torsten Hoefler: Using Compiler Techniques to Improve Automatic Performance Modeling. PACT'15
• Andreas Vogel, Alexandru Calotoiu, Alexandre Strube, Sebastian Reiter, Arne Nägel, Felix Wolf, Gabriel Wittum: 10,000 Performance Models per Minute - Scalability of the UG4 Simulation Framework. Euro-Par’15
• Christian Iwainsky, Sergei Shudler, Alexandru Calotoiu, Alexandre Strube, Michael Knobloch, Christian Bischof, Felix Wolf: How Many Threads will be too Many? On the Scalability of OpenMP Implementations. Euro-Par’15
• Sergei Shudler, Alexandru Calotoiu, Torsten Hoefler, Alexandre Strube, Felix Wolf: Exascaling Your Library: Will Your Implementation Meet Your Expectations? ICS'15
• Torsten Hoefler, Grzegorz Kwasniewski: Automatic Complexity Analysis of Explicitly Parallel Programs. SPAA’14
• Alexandru Calotoiu, Torsten Hoefler, Marius Poke, Felix Wolf: Using Automated Performance Modeling to Find Scalability Bugs in Complex Codes. SC13