1/17/18
PAPI: Performance Application Programming Interface
(adapted by Fengguang Song)
Heike McCraw [email protected]
For more details, please read the manual: http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3
OUTLINE
1. Motivation
   • What is performance?
   • Why bother with performance analysis?
2. Concepts and Definitions
   • The performance analysis cycle
   • Measurement: profiling vs. tracing
   • Analysis: manual vs. automated
3. Performance Analysis Tools
   • PAPI: access to hardware performance counters
   • Vampir Suite: instrumentation and trace visualization
   • KOJAK / Scalasca: automatic performance analysis tool
   • TAU: toolset for profiling and tracing of MPI/OpenMP/Java/Python applications
WHY PERFORMANCE ANALYSIS?
Performance Analysis is important:
• Large investments in HPC systems
  o Purchase costs: ~$40M
  o Operational costs: ~$5M per year
  o Electricity costs: 1 MW / year ~$1M
• Efficient usage is important because these resources are limited, precious, and expensive
• Scalability is important to execute bigger simulations
• Performance analysis: get the highest performance for a given cost
• Who cares about it: anyone who is associated with computer systems
  • e.g., system engineers, computer scientists, application developers, and of course users
CONCEPTS AND DEFINITIONS

Performance Optimization Cycle:
• Have an optimization phase, just like a testing & debugging phase
• Do profiling and tracing
• Use tools; for nontrivial problems, printf() is not enough

The cycle: Code Development (functionally complete and correct program) -> Instrumentation -> Measure -> Analyze -> Modify / Tune -> (complete, correct, and well-performing program) -> Usage / Production
WHAT ARE HARDWARE PERFORMANCE COUNTERS?
• For many years, hardware engineers have designed specialized registers to measure the performance of various aspects of a processor.
• In turn, HW performance counters provide application developers with valuable information about code sections that can be improved.
• Hardware performance counters can provide insight into:
  • Program time
  • Cache behaviors
  • Branch behaviors (e.g., conditional statements: if, while, ...)
  • Memory and resource contention and access frequency
  • Pipeline stalls
  • Floating point efficiency
  • Instructions per cycle
  • Subroutine resolution
  • Process or thread attribution
PAPI
• PAPI is a middleware that provides a consistent interface and methodology for using the hardware performance counters found in most major microprocessors
• PAPI enables software engineers to see, in real time, the relation between software performance and hardware events
SUPPORTED ARCHITECTURES:
• AMD
• ARM Cortex A8, A9, A15 (coming soon: ARM64)
• CRAY
• IBM Blue Gene Series; Q: 5D-Torus, I/O system, CNK (coming soon: EMON2 power)
• IBM Power Series
• Intel Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Knights Corner
• NVidia Tesla, Kepler, NVML
• Infiniband
• Intel RAPL (power/energy)
• Intel MIC power/energy

COMPONENT PAPI:
• Provides access to a number of external components that expose performance measurement opportunities across the system as a whole, including the network, the I/O system, the Compute Node Kernel, and power/energy
PAPI HARDWARE EVENTS
• The countable events are defined in two ways:
  o Platform-independent Preset Events (e.g., PAPI_TOT_INS)
  o Platform-dependent Native Events (e.g., L3_CACHE_MISS)
• Preset Events are derived from Native Events (e.g., PAPI_L1_TCM may be the sum of L1 Data Misses and L1 Instruction Misses on a given platform)
PAPI HARDWARE EVENTS
• Preset Events
  • A predefined set of over 100 events for application performance tuning
  • Commonly used, but not always provided by the hardware
  • There is no standardization of the exact definition
  • Mapped to either single native events or combinations of native events (i.e., derived)
  • Use papi_avail to see what preset events are available on a given platform
• Native Events
  • Any event countable by the CPU
  • Use the papi_native_avail utility to see all available native events (a lot of events!)
  • May use the papi_event_chooser utility to select a compatible set of events (we will see how to use it later)
Remember to run: > module load papi
See papi_avail -h for details
What Preset Events Are Available? (papi_avail)

Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
...
Of 108 possible events, 40 are available, of which 7 are derived.
PAPI COUNTER INTERFACES

PAPI architecture, from top to bottom: 3rd Party and GUI Tools; Low Level User API and High Level User API; PAPI Portable Layer; PAPI Hardware Specific Layer; Kernel Extension; Operating System; Perf Counter Hardware
PAPI provides 3 interfaces to the underlying counter hardware:

1. A Low Level API manages hardware events (both preset and native) in user-defined groups called EventSets.
   • Meant for experienced users, but the most powerful!
2. A High Level API provides the ability to start, stop, and read the counters for a specified list of events (both preset and native).
   • Meant for programmers wanting simple measurements.
3. Graphical tools provide facile data collection and visualization.
PAPI HIGH LEVEL CALLS
• PAPI_num_counters()
  • get the number of hardware counters available on the system
• PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips)
  • simplified call to get Mflips/s (floating point instruction rate), real and processor time
• PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops)
  • simplified call to get Mflops/s (floating point operation rate), real and processor time
• PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc)
  • gets instructions per cycle, real and processor time
• PAPI_start_counters(int *events, int array_len)
  • start counting hardware events
• PAPI_stop_counters(long long *values, int array_len)
  • stop counters and return current counts
• PAPI_accum_counters(long long *values, int array_len)
  • add current counts to array, reset counters, and continue counting
• PAPI_read_counters(long long *values, int array_len)
  • copy current counts to array, reset counters, and continue counting
• The high-level interface is self-initializing.
• You can mix high and low level calls, but you must call either PAPI_library_init or a high level routine before calling a low level routine.
PAPI HIGH LEVEL CALLS

/* Set up the PAPI library and begin collecting data from the counters */
if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) != PAPI_OK)
    test_fail(__FILE__, __LINE__, "PAPI_flops", retval);

/* Your target code region */
...

/* Collect the data into the variables passed in */
if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) != PAPI_OK)
    test_fail(__FILE__, __LINE__, "PAPI_flops", retval);

printf("Real_time:\t%f\nProc_time:\t%f\nTotal flp ops:\t%lld\nMFLOPS:\t\t%f\n",
       real_time, proc_time, flpins, mflops);
printf("%s\tPASSED\n", __FILE__);
PAPI_shutdown();

http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:Getting_Started
PAPI HIGH LEVEL CALLS

/* Start counting events */
if (PAPI_start_counters(Events, num_hwcntrs) != PAPI_OK)
    handle_error(1);

*events -- an array of codes for events such as PAPI_INT_INS or a native event code
EXAMPLE: LOW LEVEL API
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = { PAPI_FP_OPS, PAPI_TOT_CYC };
int EventSet = PAPI_NULL;
long long values[NUM_EVENTS];

/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add FLOPs and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);

/* Start the counters */
retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor */

/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3#Low_Level_Functions
How to Add a Native Event
• PAPI_event_name_to_code() is used to translate an ASCII PAPI event name into an integer PAPI event code.
• int PAPI_add_event(int EventSet, int EventCode)
• int PAPI_add_events(int EventSet, int *EventCodes, int number)
  - PAPI_add_event() adds one event to a PAPI EventSet

int native = 0x0;
if (PAPI_event_name_to_code("CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_5", &native) != PAPI_OK) {
    fprintf(stderr, "Error: event to code failed!\n");
    exit(1);
}
if (PAPI_add_event(NativeEventSet, native) != PAPI_OK) {
    fprintf(stderr, "thrd %d: add 1st event error\n", thrd_id);
    exit(1);
}
PAPI Utilities
login1 /opt/cray/papi/5.3.2/bin> ls
• papi_avail            • papi_clockres        • papi_command_line
• papi_component_avail  • papi_cost            • papi_decode
• papi_error_codes      • papi_event_chooser   • papi_mem_info
• papi_multiplex_cost   • papi_native_avail    • papi_version
• papi_xml_event_info

Source code: https://bitbucket.org/icl/papi/src/b4d5217cb6b6b4181f7b411dc0cf7c10c5c74fd7/src/utils/?at=master
PAPI UTILITIES: PAPI_COST
krakenpf7: cs594> papi_cost -h
This is the PAPI cost program.
It computes min / max / mean / std. deviation for PAPI start/stop pairs and for PAPI reads.

Usage: cost [options] [parameters]
       cost TESTS_QUIET

Options:
  -b BINS       set the number of bins for the graphical distribution of costs. Default: 100
  -d            show a graphical distribution of costs
  -h            print this help message
  -s            show number of iterations above the first 10 std deviations
  -t THRESHOLD  set the threshold for the number of iterations. Default: 100,000
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> papi_avail -h
Usage: papi_avail [options]

General command options:
  -a, --avail    Display only available preset events
  -d, --detail   Display detailed information about all preset events
  -e EVENTNAME   Display detailed information about specified preset or native event
  -h, --help     Print this help message

This program provides information about PAPI preset and native events.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 3.6.2.2
Vendor string and code   : AuthenticAMD (2)
Model string and code    : 6-Core AMD Opteron(tm) Processor 23 (D0) (16)
CPU Revision             : 0.000000
CPU Megahertz            : 2600.000000
CPU Clock Megahertz      : 2600
CPU's in this Node       : 12
Nodes in this System     : 1
Total CPU's              : 12
Number Hardware Counters : 4
Max Multiplex Counters   : 512
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes    No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
[...]
-------------------------------------------------------------------------
Of 103 possible events, 41 are available, of which 9 are derived.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail -a
Available events and hardware information.
--------------------------------------------------------------------------------
[... same hardware information as above ...]
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name         Code        Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  No     Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Level 1 cache misses
...
PAPI_FP_OPS  0x80000066  No     Floating point operations
-------------------------------------------------------------------------
Of 41 available events, 9 are derived.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail -e PAPI_L1_TCM
[...]
Event name:                 PAPI_L1_TCM
Event Code:                 0x80000006
Number of Native Events:    2
Short Description:          |L1 cache misses|
Long Description:           |Level 1 cache misses|
Developer's Notes:          ||
Derived Type:               |DERIVED_ADD|
Postfix Processing String:  ||
 Native Code[0]: 0x40000029  |INSTRUCTION_CACHE_MISSES|
 Number of Register Values: 4
 Register[ 0]: 0x00000081  |Event Code|
 Register[ 1]: 0x00000081  |Event Code|
 Register[ 2]: 0x00000081  |Event Code|
 Register[ 3]: 0x00000081  |Event Code|
 Native Event Description:  |Instruction Cache Misses|
 Native Code[1]: 0x40000011  |DATA_CACHE_MISSES|
 Number of Register Values: 4
 Register[ 0]: 0x00000041  |Event Code|
 Register[ 1]: 0x00000041  |Event Code|
 Register[ 2]: 0x00000041  |Event Code|
 Register[ 3]: 0x00000041  |Event Code|
 Native Event Description:  |Data Cache Misses|
PAPI UTILITIES: PAPI_NATIVE_AVAIL

krakenpf7: cs594> aprun -n1 papi_native_avail

Available native events and hardware information.

Event Code  Symbol                  | Long Description
--------------------------------------------------------------------------------
0x40000003  RETIRED_SSE_OPERATIONS  | Retired SSE Operations
0x40001003  :SINGLE_ADD_SUB_OPS     | Single precision add/subtract ops
0x40002003  :SINGLE_MUL_OPS         | Single precision multiply ops
0x40004003  :SINGLE_DIV_OPS         | Single precision divide/square root ops
0x40008003  :DOUBLE_ADD_SUB_OPS     | Double precision add/subtract ops
0x40010003  :DOUBLE_MUL_OPS         | Double precision multiply ops
0x40020003  :DOUBLE_DIV_OPS         | Double precision divide/square root ops
0x40040003  :OP_TYPE                | Op type: 0=uops. 1=FLOPS
0x40080003  :ALL                    | All sub-events selected
--------------------------------------------------------------------------------
[...]
0x40000010  DATA_CACHE_ACCESSES     | Data Cache Accesses
--------------------------------------------------------------------------------
0x40000011  DATA_CACHE_MISSES       | Data Cache Misses
--------------------------------------------------------------------------------
[...]
Total events reported: 114
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594>aprun -n1 papi_event_chooser
Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594> aprun -n1 papi_event_chooser PRESET PAPI_L1_TCM

Name          Code        Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  No     Level 2 data cache misses
PAPI_L2_ICM   0x80000003  No     Level 2 instruction cache misses
PAPI_L2_TCM   0x80000007  No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     Level 3 cache misses
PAPI_FPU_IDL  0x80000012  No     Cycles floating point units are idle
PAPI_TLB_DM   0x80000014  No     Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  No     Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  Yes    Total translation lookaside buffer misses
[...]
PAPI_FP_OPS   0x80000066  No     Floating point operations
-------------------------------------------------------------------------
Total events reported: 39
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594>aprun -n1 papi_event_chooser PRESET PAPI_L1_TCM PAPI_TLB_TL
Name         Code        Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  No     Level 1 instruction cache misses
PAPI_TLB_DM  0x80000014  No     Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  No     Instruction translation lookaside buffer misses
-------------------------------------------------------------------------
Total events reported: 4
PAPI UTILITIES: PAPI_COMMAND_LINE
krakenpf7: cs594> aprun -n1 papi_command_line PAPI_FP_OPS
Successfully added: PAPI_FP_OPS

PAPI_FP_OPS : 40000000
----------------------------------
Verification: None.
This utility lets you add events from the command line interface to see if they work.

krakenpf7: cs594> aprun -n1 papi_command_line PAPI_FP_OPS PAPI_L1_TCM
Successfully added: PAPI_FP_OPS
Successfully added: PAPI_L1_TCM

PAPI_FP_OPS : 40000000
PAPI_L1_TCM : 40
PERFORMANCE MEASUREMENT CATEGORIES
• Efficiency
  o IPC: instructions per cycle
  o Memory bandwidth
• Caches
  o L1 data cache misses and miss ratio
  o L1 instruction cache misses and miss ratio
  o L2 cache misses and miss ratio
• Translation lookaside buffers (TLB)
  o Data TLB misses and miss ratio
  o Instruction TLB misses and miss ratio
• Control transfers
  o Branch mispredictions
  o Near return mispredictions
THE CODE

#define ROWS 1000     // Number of rows in each matrix
#define COLUMNS 1000  // Number of columns in each matrix

void classic_matmul()
{
    // Multiply the two matrices
    int i, j, k;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLUMNS; j++) {
            float sum = 0.0;
            for (k = 0; k < COLUMNS; k++) {
                sum += matrix_a[i][k] * matrix_b[k][j];
            }
            matrix_c[i][j] = sum;
        }
    }
}

void interchanged_matmul()
{
    // Multiply the two matrices
    int i, j, k;
    for (i = 0; i < ROWS; i++) {
        for (k = 0; k < COLUMNS; k++) {
            for (j = 0; j < COLUMNS; j++) {
                matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
}

// Note that the nesting of the innermost loops has been changed. The index
// variables j and k change the most frequently and the access pattern through
// the operand matrices is sequential using a small stride (one). This change
// improves access to memory data through the data cache. Data translation
// lookaside buffer (DTLB) behavior is also improved.
IPC – INSTRUCTIONS PER CYCLE
• Measure instruction level parallelism
• An indicator of code efficiency

PAPI High Level:

int events[] = {PAPI_TOT_CYC, PAPI_TOT_INS};
realtime[0] = PAPI_get_real_usec();
retval = PAPI_start_counters(events, 2);
classic_matmul();
retval = PAPI_stop_counters(cvalues, 2);
realtime[1] = PAPI_get_real_usec();

PAPI Low Level:

int events[] = {PAPI_TOT_CYC, PAPI_TOT_INS};
retval = PAPI_library_init(PAPI_VER_CURRENT);
retval = PAPI_create_eventset(&EventSet);
retval = PAPI_add_events(EventSet, events, 2);
realtime[0] = PAPI_get_real_usec();
retval = PAPI_start(EventSet);
classic_matmul();
retval = PAPI_stop(EventSet, cvalues);
realtime[1] = PAPI_get_real_usec();
IPC – INSTRUCTIONS PER CYCLE
Measurement                  Classic mat_mul   Reordered mat_mul
============================================================================
High Level IPC Test (PAPI_{start,stop}_counters)
Real time                    13.6106 sec       2.9762 sec
IPC                          0.3697            1.6939
PAPI_TOT_CYC                 24362605525       5318626915
PAPI_TOT_INS                 9007034503        9009011245

Low Level IPC Test (PAPI low level calls)
Real time                    13.6113 sec       2.9772 sec
IPC                          0.3697            1.6933
PAPI_TOT_CYC                 24362750167       5320395138
PAPI_TOT_INS                 9007034381        9009011130

• Both PAPI methods are consistent
• Roughly 460% improvement in the reordered code
DATA CACHE ACCESS
A cache miss results in a main memory access with much longer latency. Data cache misses fall into 3 categories:

• Compulsory misses: occur on the first reference to a data item
  o Prefetching can help
• Capacity misses: occur when the working set exceeds the cache capacity
  o Use a smaller working set (blocking or tiling algorithms)
• Conflict misses: occur when a data item is referenced after the cache line containing the item was evicted earlier
  o Improve data layout and memory access patterns
L1 DATA CACHE ACCESS
Measurement                                                     Classic mat_mul   Reordered mat_mul
============================================================================
PAPI NATIVE EVENTS:
DATA_CACHE_ACCESSES                                             2,002,807,841     3,008,528,961
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED  205,968,263       60,716,301
DATA_CACHE_REFILLS_FROM_SYSTEM:MODIFIED:OWNED:EXCLUSIVE:SHARED  61,970,925        1,950,282
----------------------
PAPI PRESET EVENTS:
PAPI_L1_DCA                                                     2,002,808,034     3,008,528,895
PAPI_L1_DCM                                                     268,010,587       62,680,818
----------------------
Data Cache Request Rate    0.2224 req/inst    0.3339 req/inst
Data Cache Miss Rate       0.0298 miss/inst   0.0070 miss/inst
Data Cache Miss Ratio      0.1338 miss/req    0.0208 miss/req

• Two ways to measure:
  o First uses native events
  o Second uses PAPI presets only
• ~50% more requests from the reordered code
• 6x fewer misses per request!
PAPI Thread Support
• PAPI_thread_init initializes thread support in the PAPI library.
• Applications that make no use of threads do not need to call this routine.
• The function passed in MUST return a UNIQUE thread ID for every new thread/LWP created. It is much better to use the underlying thread subsystem's call, which is pthread_self() on Linux platforms.
For example,
if (PAPI_thread_init(pthread_self) != PAPI_OK)exit(1);
One User Case
Table 1: Data of hardware performance counters
collected from a 12-core experiment and a 48-core
experiment, respectively.
On 12 Cores On 48 Cores
Performance per Core 6.4 GFlops 5.1 GFlops
L1 Data Cache Miss Rate .5% .5%
L2 Data Cache Miss Rate 2.98% 2.87%
TLB Miss Rate 7.8% 8.7%
Branch Misprediction Rate 1.6% 1.5%
Cycles w/o Inst. .5% .6%
FPU Idle Cycles 3.9% 5.1%
Any-Stall Cycles 20.6% 35.9%
2.2 Reasons for Increased Any-Stall Cycles

We expect there are three possible reasons that can result in lots of stall cycles (due to any resources). They are: (1) thread synchronization, (2) remote memory access latency, and (3) memory bus or memory controller contention.

Thread synchronization can happen during the scheduling of tasks by the TBLAS runtime system. For instance, a thread is waiting to enter a critical section (e.g., a ready task queue) to pick up a task. When a synchronization occurs, the thread cannot do any computation but wait. Therefore, the synchronization overhead is a part of the thread's non-computation time (i.e., total execution time - computation time). However, as shown in Table 2, the non-computation time of our program is less than 1%. Therefore, we can rule out thread synchronization as a possible reason.
Table 2: Analysis of synchronization overhead.
Total Time (s) Computation Time (s)
12 Cores 30 29.9
48 Cores 74.8 74
The second possible reason is remote memory accesses. We conduct a different experiment to test the effect of remote memory accesses. In the experiment, the TBLAS program allocates memory only from the NUMA memory nodes from 4 to 7, that is, the second half of the eight NUMA memory nodes as shown in Figure 2. This is enforced by using the Linux command numactl. Then, we compare two configurations (see Figure 2): (a) running twelve threads on CPU cores from 0 to 11 located on NUMA nodes 0 and 1, and (b) running twelve threads on CPU cores from 36 to 47 located on NUMA nodes 6 and 7. In the first configuration, TBLAS attains a total performance of 57 GFlops. In the second configuration, it attains a total performance of 63 GFlops. Note that the matrix input is stored in NUMA nodes from 4 to 7. The performance difference shows that accessing remote memory does decrease performance. Note that we use the Linux command taskset to pin each thread to a specific core.
The third possible reason is memory contention that may come from a memory bus or a memory controller. Here we do not differentiate the controller contention from the bus contention, but consider the memory controller an end point of its attached memory bus. In our third experiment, we run "numactl -i all" to place memory on all the eight NUMA nodes using the round-robin policy. As shown in Figure 3, there are two configurations: (a) we create twelve threads and pin them on cores from 0 to 11; (b) we create twelve threads and each thread runs on one core out of every four cores. In both configurations, each thread must access all the eight NUMA nodes due to the interleaved memory placement. Since the second configuration provides more memory bandwidth to each thread (i.e., less contention) than the first configuration, it achieves a total performance of 76.2 GFlops. By contrast, the first configuration only achieves a total performance of 66.3 GFlops. The difference shows that memory contention also has a significant impact on performance. One way to reduce memory contention is to maximize data locality and minimize cross-chip memory access traffic.

Figure 2: Effect of remote NUMA memory access latency. The matrix is always allocated on NUMA nodes from 4 to 7. Then we run twelve threads in two different locations: (a) on NUMA nodes 0 and 1 (i.e., far from data), and (b) on NUMA nodes 6 and 7 (i.e., close to data).

The above three experiments, together with the performance differences they have made, provide an insight into the unscalability problem and motivate us to perform NUMA memory optimizations. We target minimizing remote memory accesses since doing so not only reduces the latency to access remote memories, but also alleviates the pressure on memory buses.
Figure 3: Effect of memory bus contention. The matrix is allocated to all eight NUMA nodes using the round-robin policy. We start twelve threads using two configurations, respectively: (a) 12 threads are located on 12 contiguous cores from 0 to 11; (b) 12 threads, each of which runs on one core out of every four cores.

3. THE TILE ALGORITHM

[1] Song et al., ICS'14.
3RD PARTY TOOLS APPLYING PAPI

• TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
• PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
• HPCToolkit (Rice University) http://hpctoolkit.org/
• KOJAK and SCALASCA (FZ Juelich, UTK) http://icl.cs.utk.edu/kojak/
• VampirTrace and Vampir (TU Dresden) http://www.vamir.eu
• PaRSEC (UTK) http://icl.cs.utk.edu/parsec/
• Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
• SvPablo (UNC Renaissance Computing Institute) http://www.renci.org/research/pablo/
• ompP (UTK) http://www.ompp-tool.com
• Also, Intel Trace Analyzer and Collector (free education version available)
• Read the manual -> http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3