
New Developments for PAPI 5.6+

Anthony Danalis, Heike Jagode, and Jack Dongarra (University of Tennessee, Knoxville)

Vince Weaver and Yan Liu (University of Maine)

INTRODUCTION

PART 1: PAPI for Power‐Aware Computing

PART 2: PAPI’s Counter Inspection Toolkit (CIT)

PART 3: Modernizing PAPI Infrastructure

SUPPORTED ARCHITECTURES

3RD PARTY TOOLS APPLYING PAPI

• PAPI provides a consistent interface (and methodology) for hardware performance counters found across a compute system: i.e., CPUs, GPUs, on- and off-chip memory, interconnects, the I/O system, file system, energy/power, etc.

• PAPI enables software engineers to see, in near real time, the relationship between software performance and hardware events across the entire compute system.
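As context for the sections below, the basic PAPI counting workflow looks like the following minimal C sketch (not taken from the poster; the preset events PAPI_TOT_CYC and PAPI_TOT_INS are illustrative and must be available on the target CPU):

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long counts[2];

        /* Initialize the library and build an event set. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);
        if (PAPI_create_eventset(&evset) != PAPI_OK) exit(1);

        /* Illustrative preset events; availability depends on the CPU. */
        PAPI_add_event(evset, PAPI_TOT_CYC);
        PAPI_add_event(evset, PAPI_TOT_INS);

        PAPI_start(evset);
        /* ... region of interest ... */
        PAPI_stop(evset, counts);

        printf("cycles=%lld  instructions=%lld\n", counts[0], counts[1]);
        return 0;
    }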

Intel Knights Landing: FLAT mode

• Entire MCDRAM is used as addressable memory ➔ memory allocations are treated similarly to DDR4 memory allocations
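In flat mode the MCDRAM appears as a separate NUMA node, so placing data in it is an allocation choice made by the application (or forced with numactl --membind). One common option, outside of PAPI itself, is the memkind library's hbw_malloc; the short sketch below is an illustration under those assumptions, not part of the poster:

    #include <stdio.h>
    #include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API */

    int main(void) {
        size_t n = 1 << 20;
        /* Request the working set in MCDRAM; with the default policy the
           allocation falls back to DDR4 if no high-bandwidth memory exists. */
        double *a = hbw_malloc(n * sizeof(double));
        if (!a) { fprintf(stderr, "allocation failed\n"); return 1; }
        for (size_t i = 0; i < n; i++) a[i] = 0.0;
        hbw_free(a);
        return 0;
    }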

PAPI for power-aware computing

• We use PAPI's latest powercap component for measurement, control, and performance analysis.

• In the past, PAPI power components supported only reading power information.

• New component exposes RAPL functionality to allow users to read and write power (a minimal read sketch follows after this list).

• Study numerical building blocks of varying computational intensity.

• Use PAPI powercap component to detect power optimization opportunities.

• Cap the power on the architecture to reduce power usage while keeping the execution time constant ➔ energy savings.
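A minimal sketch of reading energy through the powercap component is shown below; the event name powercap:::ENERGY_UJ:ZONE0 is what papi_native_avail reports on one Linux/RAPL test system and may differ elsewhere (writing power caps is sketched in the DGEMM section further down):

    #include <stdio.h>
    #include <unistd.h>
    #include <papi.h>

    int main(void) {
        int evset = PAPI_NULL;
        long long uj;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);

        /* Package energy in microjoules; zone/event names vary by system
           (check `papi_native_avail | grep powercap`). */
        if (PAPI_add_named_event(evset, "powercap:::ENERGY_UJ:ZONE0") != PAPI_OK) {
            fprintf(stderr, "powercap event not available on this system\n");
            return 1;
        }

        PAPI_start(evset);
        sleep(1);               /* ... workload of interest ... */
        PAPI_stop(evset, &uj);  /* PAPI reports the change since PAPI_start */

        printf("energy over the interval: %.3f J\n", uj / 1.0e6);
        return 0;
    }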

Key Concepts

Goal:

• Create a set of micro-benchmarks illustrating details of hardware events and how they relate to the behavior of the micro-architecture

Target audience:

• Performance-conscious application developers

• PAPI developers working on new architectures (think preset events)

• Developers interested in validating hardware-event counters

QUESTION

How many branch instructions will these codes execute per iteration?

The expectation is that both codes will execute two branches per iteration.

The measured answer is 2 for the first and 2.5 for the second!

Can you guess why?
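(The two code snippets themselves appear only in the poster figure; the hypothetical pair below reproduces the effect. Both loops appear to have one loop branch plus one conditional branch, but in the second the short-circuited && emits an extra conditional branch that is reached only when the first test succeeds, roughly half the time for random data, giving about 2.5 branches per iteration.)

    /* Version A: loop branch + one conditional branch -> 2 branches per iteration. */
    long count_a(const int *v, long n, int t) {
        long hits = 0;
        for (long i = 0; i < n; i++)
            if (v[i] > t)
                hits++;
        return hits;
    }

    /* Version B: the short-circuit && compiles to two conditional branches, but the
       second is executed only when the first test succeeds, so the measured count is
       ~2.5 branches per iteration instead of the expected 2. */
    long count_b(const int *v, long n, int lo, int hi) {
        long hits = 0;
        for (long i = 0; i < n; i++)
            if (v[i] > lo && v[i] < hi)
                hits++;
        return hits;
    }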

Improved PAPI Test Infrastructure

• The existing PAPI test suite is used to test the correctness of PAPI before release.

• The hardware and operating systems used by PAPI are always changing, and some of the existing tests were outdated or gave false negatives.

• Existing tests were checked to ensure accurate results on modern hardware.

• New counter validation tests were created which should provide a sanity check when bringing up support for a new processor architecture.
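A sketch of the flavor of such a validation test is below (an assumed example, not copied from the PAPI test suite): run a loop with a statically known branch count and check that PAPI_BR_INS agrees within a tolerance. Compile without aggressive optimization so the loop is not unrolled or removed.

    #include <stdio.h>
    #include <papi.h>

    #define ITERS 1000000LL

    int main(void) {
        int evset = PAPI_NULL;
        long long branches;
        volatile long long sink = 0;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_BR_INS);    /* retired branch instructions */

        PAPI_start(evset);
        for (long long i = 0; i < ITERS; i++)  /* roughly one branch per iteration */
            sink += i;
        PAPI_stop(evset, &branches);

        /* Allow slack for startup code, interrupts, and measurement overhead. */
        double error = (double)(branches - ITERS) / (double)ITERS;
        printf("measured %lld branches, expected ~%lld (error %.2f%%)\n",
               branches, ITERS, 100.0 * error);
        return (error < 0.0 || error > 0.10) ? 1 : 0;
    }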

Low-Overhead PAPI_read() Support

• Traditionally, PAPI_read() counter reads went through the standard Linux read() system call, which can be slow (around 1,000 cycles).

• x86 hardware supports a userspace rdpmc instruction that bypasses the kernel and takes around 200 cycles (a 5× speedup).

• Various bugs in the Linux kernel around this interface were found and fixed so that rdpmc can be enabled by default.
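The per-read cost is easy to observe from user space by timing back-to-back PAPI_read() calls with the x86 time-stamp counter; whether the fast rdpmc path is taken depends on the kernel and PAPI versions, so the sketch below is just ordinary PAPI plus a rough timing loop:

    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc() */
    #include <papi.h>

    #define READS 1000

    int main(void) {
        int evset = PAPI_NULL;
        long long value;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_INS);
        PAPI_start(evset);

        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < READS; i++)
            PAPI_read(evset, &value);        /* takes the rdpmc fast path when enabled */
        unsigned long long t1 = __rdtsc();

        PAPI_stop(evset, &value);
        printf("average PAPI_read() cost: %.0f cycles\n", (double)(t1 - t0) / READS);
        return 0;
    }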

Enhanced Sampling Interface

• PAPI currently has a limited counter-sampling interface that only allows gathering the instruction pointer at regular intervals.

• Modern processors support much richer sampling information, including the cause of cache misses, where in the cache hierarchy the miss happened, and the cycles taken.

• We extend the PAPI sampling interface to provide this additional sampling information.
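For reference, the existing interface referred to above is PAPI_overflow(), which only delivers the interrupted instruction pointer to a user handler whenever an event crosses a threshold; the sketch below shows that baseline (the threshold value is an arbitrary illustrative choice):

    #include <stdio.h>
    #include <papi.h>

    /* Invoked on each overflow; only the interrupted address is provided. */
    static void handler(int evset, void *address, long long overflow_vector, void *context) {
        (void)evset; (void)overflow_vector; (void)context;
        printf("sample at ip=%p\n", address);
    }

    int main(void) {
        int evset = PAPI_NULL;
        long long value;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);

        /* Take a sample every 10 million cycles. */
        PAPI_overflow(evset, PAPI_TOT_CYC, 10000000, 0, handler);

        PAPI_start(evset);
        for (volatile long i = 0; i < 200000000L; i++) ;   /* busy work to generate samples */
        PAPI_stop(evset, &value);
        return 0;
    }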

Heike Jagode, Azzam Haidar, Phil Vaccaro, Asim YarKhan, and Stan Tomov

ACKNOWLEDGMENTS: This material is based upon work supported in part by the National Science Foundation (NSF) under award No. 1450429. A portion of this research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

Vince Weaver and Yan Liu

Anthony Danalis, Heike Jagode, and Hanumanthrayappa

Power awareness for Level-3 BLAS: DGEMM

68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s

• DGEMM is run repeatedly for a fixed matrix size of 18K per iteration.

• For each iteration, we dropped the power cap in 10- or 20-Watt steps.

• Powercap achieves its power limiting through DVFS scaling: note the frequency trace, which drops in step with the decreasing power.
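A sketch of how this stepping can be driven through the powercap component is below; it assumes PAPI_library_init has already been called, and the writable event name powercap:::POWER_LIMIT_A_UW:ZONE0 (package long-term limit in microwatts) is taken from one test system and may differ on other machines:

    #include <papi.h>

    /* Hypothetical driver: rerun a kernel while lowering the package power cap. */
    void sweep_power_caps(void (*kernel)(void)) {
        int evset = PAPI_NULL;
        long long cap_uw;

        PAPI_create_eventset(&evset);
        PAPI_add_named_event(evset, "powercap:::POWER_LIMIT_A_UW:ZONE0");
        PAPI_start(evset);

        for (cap_uw = 215000000LL; cap_uw >= 120000000LL; cap_uw -= 20000000LL) {
            PAPI_write(evset, &cap_uw);   /* apply the new power limit */
            kernel();                     /* e.g., one DGEMM on the fixed 18K matrices */
        }
        PAPI_stop(evset, &cap_uw);
    }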

Power awareness for Level-2 BLAS: DGEMV

68-core KNL, Peak DP = 2,662 Gflop/s; Bandwidth: MCDRAM ~425 GB/s, DDR4 ~90 GB/s

• DGEMV is run repeatedly for a fixed matrix size of 18K per iteration.
• For each iteration, we dropped the power cap in 10- or 20-Watt steps.
• Frequency is not affected until the power caps kick in.

Power awareness for applications: Jacobi

• Solving the Helmholtz equation with the Jacobi iterative method (grid size 12,800 x 12,800, 2D 5-point stencil)
  ➔ requires multiple memory accesses per update
  ➔ high memory bandwidth and low computational intensity
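The update itself is a standard 2D 5-point stencil; the simplified sweep below (boundary handling omitted, not the poster's exact code) makes the point concrete: each updated point reads four neighbors plus the right-hand side and writes one value, so there are many memory accesses per handful of flops.

    /* One Jacobi sweep for a Helmholtz/Poisson-style 5-point stencil on an n x n grid:
       ~5 reads and 1 write per point for only a few flops, hence high bandwidth demand
       and low computational intensity. */
    void jacobi_sweep(int n, const double *u, double *unew, const double *f, double h2) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                unew[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                          u[i * n + j - 1] + u[i * n + j + 1] -
                                          h2 * f[i * n + j]);
    }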

Lesson for DGEMM-type operations (compute intensive):

• Both data allocation schemes provide roughly the same performance. This is caused by the high ratio of FLOPs to bytes in the compute-intensive routines, which allows for as much cache reuse as possible while hiding the data transfer between memory and cache levels.

• Performance drops with decreasing power caps, which is expected for compute-intensive kernels.

Lesson for DGEMV-type operations (memory bound):

• Performance differs between the two storage options: data allocation in MCDRAM results in much higher performance for memory-bound kernels (bandwidth is ~425 GB/s for MCDRAM vs. ~90 GB/s for DDR4).

• DGEMV is bound by the reading of matrix A; depending on its storage location, performance and power may be affected.

• We see lower package power usage but higher DDR4 power usage.

• Capping at 140 W improves energy efficiency by ~10%.

• Capping at 120 W results in 17% energy savings with some performance degradation (8%).

• Capping at 180 Watts results in ~15% energy savings with a negligible performance penalty (less than 5%).

Lesson for Jacobi iteration:

• Computation is about 3.5× faster when the data is allocated in MCDRAM compared to DDR4.

• MCDRAM: Capping at 170 Watts improves energy efficiency by ~14% without any loss in time to solution.

• DDR4: capping at 135 Watts improves energy efficiency by ~25% without any loss in time to solution.

Power awareness for applications: Lattice-Boltzmann

• Lattice-Boltzmann simulation of computational fluid dynamics (from the SPEC CPU 2006 benchmark suite)
  ➔ high memory bandwidth and low computational intensity

Lesson for Lattice-Boltzmann:

• Computation is about 3.6× faster when the data is allocated in MCDRAM compared to DDR4.

• For MCDRAM, capping at 190 Watts results in an energy saving of ~6% without any loss in time to solution.

• For DDR4, capping at 130 Watts improves energy efficiency by ~12%.

Native Event Characterization

Highlighting non-obvious behavior

ARM Cortex A8, A9, A15, ARM64
Cray: Gemini and Aries interconnect, power
IBM Blue Gene Series, Q: 5-D Torus, I/O system, EMON power, energy

NVIDIA Tesla, Kepler: CUDA support for multiple GPUs; PC Sampling; NVML; Virtual Environment

IBM Power Series
Intel Westmere, Sandy/Ivy Bridge, Haswell, Broadwell, Skylake(-X), Kaby Lake

RAPL (power/energy), power capping

PaRSEC, UTK
http://icl.utk.edu/parsec/

HPCToolkit, Rice University
http://hpctoolkit.org/

TAU, University of Oregon
http://www.cs.uoregon.edu/research/tau/

Scalasca, FZ Juelich, TU Darmstadt
http://scalasca.org/

Vampir, TU Dresden
http://www.vampir.eu/

PerfSuite, NCSA
http://perfsuite.ncsa.uiuc.edu/

Open|SpeedShop, SGI
http://oss.sgi.com/projects/openspeedshop/

SvPablo, RENCI at UNC
http://www.renci.org/research/pablo/

ompP, UTK
http://www.ompp-tool.com/

Score-P
http://score-p.org/

Intel Xeon Phi: KNC, Knights Landing, including power/energy

[Figures (Native Event Characterization): average count per 100 accesses vs. buffer size in KB. Panels: L1/L2/L3 HIT counts; L1/L2/L3 MISS counts; access unit size = 64 B, 128 B, 256 B; ppb = 4, 8, 16, 32, 33, 34, 35.]


[Figure: Boxplot of PAPI_read() latency for various PAPI versions, showing the large improvement from using rdpmc.]

[Figure: Comparison with historical performance counter interfaces (perfmon2, perfctr), showing that perf_event rdpmc matches even the best historical interface.]

[Figures (DGEMM power capping, DDR4 and MCDRAM allocation): average power (Watts) vs. time (sec), showing accelerator (PACKAGE) and memory (DDR4) power usage, the max power limit set (stepped down from the TDP of 215 W), and core frequency (MHz); each step is annotated with performance in Gflop/s, Gflops/Watt, and Joules.]

[Figures (DGEMV power capping, DDR4 and MCDRAM allocation): average power (Watts) vs. time (sec), showing accelerator (PACKAGE) and memory (DDR4) power usage, the max power limit set, and core frequency (MHz); each step is annotated with performance in Gflop/s, achieved bandwidth in GB/s, Gflops/Watt, and Joules.]

[Figures (application power capping, DDR4 and MCDRAM allocation): average power (Watts) vs. time (sec), with total energy in Joules for power caps of 215, 200, 180, 160, 140, and 120 Watts.]

[Figures (application power capping, DDR4 and MCDRAM allocation): average power (Watts) vs. time (sec), with total energy in Joules for power caps of 270, 215, 200, 180, 160, 140, and 120 Watts.]
