1/17/18
PAPI: Performance Application Programming Interface
(adapted by Fengguang Song)
Heike McCraw [email protected]
For more details, please read the manual: http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3
OUTLINE
1. Motivation
   • What is performance?
   • Why bother with performance analysis?
2. Concepts and Definitions
   • The performance analysis cycle
   • Measurement: profiling vs. tracing
   • Analysis: manual vs. automated
3. Performance Analysis Tools
   • PAPI: access to hardware performance counters
   • Vampir Suite: instrumentation and trace visualization
   • KOJAK / Scalasca: automatic performance analysis tool
   • TAU: toolset for profiling and tracing of MPI/OpenMP/Java/Python applications
WHY PERFORMANCE ANALYSIS?
Performance Analysis is important:
• Large investments in HPC systems
  o Purchase costs: ~$40M
  o Operational costs: ~$5M per year
  o Electricity costs: 1 MW / year ~$1M
• Efficient usage is important because these resources are limited, precious, and expensive
• Scalability is important to execute bigger simulations
• Performance analysis: get the highest performance for a given cost
• Who cares about it: anyone who is associated with computer systems
  • e.g., system engineers, computer scientists, application developers, and of course users
CONCEPTS AND DEFINITIONS

Performance Optimization Cycle:
• Have an optimization phase, just like a testing & debugging phase
• Do profiling and tracing
• Use tools; for nontrivial problems, printf() is not enough

The cycle: Code Development (functionally complete and correct program) -> Instrumentation -> Measure -> Analyze -> Modify / Tune -> (complete, correct, and well-performing program) -> Usage / Production
WHAT ARE HARDWARE PERFORMANCE COUNTERS?
• For many years, hardware engineers have designed specialized registers to measure the performance of various aspects of a processor.
• In turn, HW performance counters provide application developers with valuable information about code sections that can be improved.
• Hardware performance counters can provide insight into:
  • Program time
  • Cache behaviors
  • Branch behaviors (e.g., conditional statements: if, while, ...)
  • Memory and resource contention and access frequency
  • Pipeline stalls
  • Floating point efficiency
  • Instructions per cycle
  • Subroutine resolution
  • Process or thread attribution
PAPI
• PAPI is a middleware that provides a consistent interface and methodology for using the hardware performance counters found in most major microprocessors
• PAPI enables software engineers to see, in real time, the relation between software performance and hardware events
SUPPORTED ARCHITECTURES:
• AMD
• ARM Cortex A8, A9, A15 (coming soon: ARM64)
• CRAY
• IBM Blue Gene Series; Q: 5D-Torus, I/O system, CNK (coming soon: EMON2 power)
• IBM Power Series
• Intel Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Knights Corner
• NVidia Tesla, Kepler, NVML
• Infiniband
• Intel RAPL (power/energy)
• Intel MIC power/energy

COMPONENT PAPI:
• Provides access to a number of external components that expose performance measurement opportunities across the system as a whole, including the network, the I/O system, the Compute Node Kernel, and power/energy
PAPI HARDWARE EVENTS
• The countable events are defined in two ways:
  o Platform-independent Preset Events (e.g., PAPI_TOT_INS)
  o Platform-dependent Native Events (e.g., L3_CACHE_MISS)
• Preset Events are derived from Native Events (e.g., PAPI_L1_TCM may be the sum of L1 Data Misses and L1 Instruction Misses on a given platform)
PAPI HARDWARE EVENTS
• Preset Events
  • A predefined set of over 100 events for application performance tuning
  • Commonly used, but not always provided by the hardware
  • There is no standardization of the exact definition
  • Mapped to either single native events or combinations of native events (i.e., derived)
  • Use papi_avail to see what preset events are available on a given platform
• Native Events
  • Any event countable by the CPU
  • Use the papi_native_avail utility to see all available native events (a lot of events!)
  • May use the papi_event_chooser utility to select a compatible set of events (we will see how to use it later)
Remember to run: > module load papi
See papi_avail -h for details
What Preset Events Are Available? (papi_avail)

Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
...
Of 108 possible events, 40 are available, of which 7 are derived.
PAPI COUNTER INTERFACES

PAPI architecture, from top to bottom: 3rd Party and GUI Tools; Low Level User API and High Level User API; PAPI Portable Layer; PAPI Hardware Specific Layer; Kernel Extension; Operating System; Perf Counter Hardware
PAPI provides 3 interfaces to the underlying counter hardware:

1. A Low Level API manages hardware events (both preset and native) in user-defined groups called EventSets.
   • Meant for experienced users, but the most powerful!
2. A High Level API provides the ability to start, stop, and read the counters for a specified list of events (both preset and native).
   • Meant for programmers wanting simple measurements.
3. Graphical tools provide facile data collection and visualization.
PAPI HIGH LEVEL CALLS
• PAPI_num_counters()
  • get the number of hardware counters available on the system
• PAPI_flips(float *rtime, float *ptime, long long *flpins, float *mflips)
  • simplified call to get Mflips/s (floating point instruction rate), real and processor time
• PAPI_flops(float *rtime, float *ptime, long long *flpops, float *mflops)
  • simplified call to get Mflops/s (floating point operation rate), real and processor time
• PAPI_ipc(float *rtime, float *ptime, long long *ins, float *ipc)
  • gets instructions per cycle, real and processor time
• PAPI_start_counters(int *events, int array_len)
  • start counting hardware events
• PAPI_stop_counters(long long *values, int array_len)
  • stop counters and return current counts
• PAPI_accum_counters(long long *values, int array_len)
  • add current counts to array, reset counters, and continue counting
• PAPI_read_counters(long long *values, int array_len)
  • copy current counts to array, reset counters, and continue counting
• The high-level interface is self-initializing.
• You can mix high and low level calls, but you must call either PAPI_library_init or a high level routine before calling a low level routine.
PAPI HIGH LEVEL CALLS

/* Set up the PAPI library and begin collecting data from the counters */
if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) != PAPI_OK)
    test_fail(__FILE__, __LINE__, "PAPI_flops", retval);

/* Your target code region */
...

/* Collect the data into the variables passed in */
if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) != PAPI_OK)
    test_fail(__FILE__, __LINE__, "PAPI_flops", retval);

printf("Real_time:\t%f\nProc_time:\t%f\nTotal flp ops:\t%lld\nMFLOPS:\t\t%f\n",
       real_time, proc_time, flpins, mflops);
printf("%s\tPASSED\n", __FILE__);
PAPI_shutdown();

http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:Getting_Started
PAPI HIGH LEVEL CALLS

/* Start counting events */
if (PAPI_start_counters(Events, num_hwcntrs) != PAPI_OK)
    handle_error(1);

*events -- an array of codes for events such as PAPI_INT_INS or a native event code
EXAMPLE: LOW LEVEL API
#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = { PAPI_FP_OPS, PAPI_TOT_CYC };
int EventSet = PAPI_NULL;
long long values[NUM_EVENTS];

/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new eventset and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add FLOPs and total cycles to the eventset */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);

/* Start the counters */
retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor */

/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3#Low_Level_Functions
How to Add a Native Event
• PAPI_event_name_to_code() is used to translate an ASCII PAPI event name into an integer PAPI event code.
• int PAPI_add_event(int EventSet, int EventCode)
• int PAPI_add_events(int EventSet, int *EventCodes, int number)
  - PAPI_add_event() adds one event to a PAPI EventSet

int native = 0x0;
if (PAPI_event_name_to_code("CPU_TO_DRAM_REQUESTS_TO_TARGET_NODE:LOCAL_TO_5", &native) != PAPI_OK) {
    fprintf(stderr, "Error: event to code failed!\n");
    exit(1);
}
if (PAPI_add_event(NativeEventSet, native) != PAPI_OK) {
    fprintf(stderr, "thrd %d: add 1st event error\n", thrd_id);
    exit(1);
}
PAPI Utilities
login1 /opt/cray/papi/5.3.2/bin> ls
• papi_avail            • papi_clockres        • papi_command_line
• papi_component_avail  • papi_cost            • papi_decode
• papi_error_codes      • papi_event_chooser   • papi_mem_info
• papi_multiplex_cost   • papi_native_avail    • papi_version
• papi_xml_event_info

Source code: https://bitbucket.org/icl/papi/src/b4d5217cb6b6b4181f7b411dc0cf7c10c5c74fd7/src/utils/?at=master
PAPI UTILITIES: PAPI_COST
krakenpf7: cs594> papi_cost -h
This is the PAPI cost program.
It computes min / max / mean / std. deviation for PAPI start/stop pairs and for PAPI reads.

Usage: cost [options] [parameters]
       cost TESTS_QUIET

Options:
  -b BINS       set the number of bins for the graphical distribution of costs. Default: 100
  -d            show a graphical distribution of costs
  -h            print this help message
  -s            show number of iterations above the first 10 std deviations
  -t THRESHOLD  set the threshold for the number of iterations. Default: 100,000
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> papi_avail -h
Usage: papi_avail [options]

General command options:
  -a, --avail    Display only available preset events
  -d, --detail   Display detailed information about all preset events
  -e EVENTNAME   Display detailed information about specified preset or native event
  -h, --help     Print this help message

This program provides information about PAPI preset and native events.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version             : 3.6.2.2
Vendor string and code   : AuthenticAMD (2)
Model string and code    : 6-Core AMD Opteron(tm) Processor 23 (D0) (16)
CPU Revision             : 0.000000
CPU Megahertz            : 2600.000000
CPU Clock Megahertz      : 2600
CPU's in this Node       : 12
Nodes in this System     : 1
Total CPU's              : 12
Number Hardware Counters : 4
Max Multiplex Counters   : 512
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes    No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Yes    Level 1 cache misses
[...]
-------------------------------------------------------------------------
Of 103 possible events, 41 are available, of which 9 are derived.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail -a
Available events and hardware information.
--------------------------------------------------------------------------------
[... same hardware information as above ...]
--------------------------------------------------------------------------------
The following correspond to fields in the PAPI_event_info_t structure.

Name         Code        Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  No     Level 2 instruction cache misses
PAPI_L1_TCM  0x80000006  Yes    Level 1 cache misses
...
PAPI_FP_OPS  0x80000066  No     Floating point operations
-------------------------------------------------------------------------
Of 41 available events, 9 are derived.
PAPI UTILITIES: PAPI_AVAIL
krakenpf7: cs594> aprun -n1 papi_avail -e PAPI_L1_TCM
[...]
Event name:                 PAPI_L1_TCM
Event Code:                 0x80000006
Number of Native Events:    2
Short Description:          |L1 cache misses|
Long Description:           |Level 1 cache misses|
Developer's Notes:          ||
Derived Type:               |DERIVED_ADD|
Postfix Processing String:  ||
 Native Code[0]: 0x40000029  |INSTRUCTION_CACHE_MISSES|
 Number of Register Values: 4
 Register[ 0]: 0x00000081  |Event Code|
 Register[ 1]: 0x00000081  |Event Code|
 Register[ 2]: 0x00000081  |Event Code|
 Register[ 3]: 0x00000081  |Event Code|
 Native Event Description:  |Instruction Cache Misses|
 Native Code[1]: 0x40000011  |DATA_CACHE_MISSES|
 Number of Register Values: 4
 Register[ 0]: 0x00000041  |Event Code|
 Register[ 1]: 0x00000041  |Event Code|
 Register[ 2]: 0x00000041  |Event Code|
 Register[ 3]: 0x00000041  |Event Code|
 Native Event Description:  |Data Cache Misses|
PAPI UTILITIES: PAPI_NATIVE_AVAIL

krakenpf7: cs594> aprun -n1 papi_native_avail

Available native events and hardware information.

Event Code  Symbol                  | Long Description
--------------------------------------------------------------------------------
0x40000003  RETIRED_SSE_OPERATIONS  | Retired SSE Operations
0x40001003  :SINGLE_ADD_SUB_OPS     | Single precision add/subtract ops
0x40002003  :SINGLE_MUL_OPS         | Single precision multiply ops
0x40004003  :SINGLE_DIV_OPS         | Single precision divide/square root ops
0x40008003  :DOUBLE_ADD_SUB_OPS     | Double precision add/subtract ops
0x40010003  :DOUBLE_MUL_OPS         | Double precision multiply ops
0x40020003  :DOUBLE_DIV_OPS         | Double precision divide/square root ops
0x40040003  :OP_TYPE                | Op type: 0=uops. 1=FLOPS
0x40080003  :ALL                    | All sub-events selected
--------------------------------------------------------------------------------
[...]
0x40000010  DATA_CACHE_ACCESSES     | Data Cache Accesses
--------------------------------------------------------------------------------
0x40000011  DATA_CACHE_MISSES       | Data Cache Misses
--------------------------------------------------------------------------------
[...]
Total events reported: 114
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594>aprun -n1 papi_event_chooser
Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594> aprun -n1 papi_event_chooser PRESET PAPI_L1_TCM

Name          Code        Deriv  Description (Note)
PAPI_L1_DCM   0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  No     Level 2 data cache misses
PAPI_L2_ICM   0x80000003  No     Level 2 instruction cache misses
PAPI_L2_TCM   0x80000007  No     Level 2 cache misses
PAPI_L3_TCM   0x80000008  No     Level 3 cache misses
PAPI_FPU_IDL  0x80000012  No     Cycles floating point units are idle
PAPI_TLB_DM   0x80000014  No     Data translation lookaside buffer misses
PAPI_TLB_IM   0x80000015  No     Instruction translation lookaside buffer misses
PAPI_TLB_TL   0x80000016  Yes    Total translation lookaside buffer misses
[...]
PAPI_FP_OPS   0x80000066  No     Floating point operations
-------------------------------------------------------------------------
Total events reported: 39
PAPI UTILITIES: PAPI_EVENT_CHOOSER
krakenpf7: cs594>aprun -n1 papi_event_chooser PRESET PAPI_L1_TCM PAPI_TLB_TL
Name         Code        Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  No     Level 1 data cache misses
PAPI_L1_ICM  0x80000001  No     Level 1 instruction cache misses
PAPI_TLB_DM  0x80000014  No     Data translation lookaside buffer misses
PAPI_TLB_IM  0x80000015  No     Instruction translation lookaside buffer misses
-------------------------------------------------------------------------
Total events reported: 4
PAPI UTILITIES: PAPI_COMMAND_LINE
krakenpf7: cs594> aprun -n1 papi_command_line PAPI_FP_OPS
Successfully added: PAPI_FP_OPS

PAPI_FP_OPS : 40000000
----------------------------------
Verification: None.
This utility lets you add events from the command line interface to see if they work.

krakenpf7: cs594> aprun -n1 papi_command_line PAPI_FP_OPS PAPI_L1_TCM
Successfully added: PAPI_FP_OPS
Successfully added: PAPI_L1_TCM

PAPI_FP_OPS : 40000000
PAPI_L1_TCM : 40
PERFORMANCE MEASUREMENT CATEGORIES
• Efficiency
  o IPC: instructions per cycle
  o Memory bandwidth
• Caches
  o L1 data cache misses and miss ratio
  o L1 instruction cache misses and miss ratio
  o L2 cache misses and miss ratio
• Translation lookaside buffers (TLB)
  o Data TLB misses and miss ratio
  o Instruction TLB misses and miss ratio
• Control transfers
  o Branch mispredictions
  o Near return mispredictions
THE CODE

#define ROWS 1000     // Number of rows in each matrix
#define COLUMNS 1000  // Number of columns in each matrix

void classic_matmul()
{
    // Multiply the two matrices
    int i, j, k;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLUMNS; j++) {
            float sum = 0.0;
            for (k = 0; k < COLUMNS; k++) {
                sum += matrix_a[i][k] * matrix_b[k][j];
            }
            matrix_c[i][j] = sum;
        }
    }
}

void interchanged_matmul()
{
    // Multiply the two matrices
    int i, j, k;
    for (i = 0; i < ROWS; i++) {
        for (k = 0; k < COLUMNS; k++) {
            for (j = 0; j < COLUMNS; j++) {
                matrix_c[i][j] += matrix_a[i][k] * matrix_b[k][j];
            }
        }
    }
}

// Note that the nesting of the innermost loops has been changed. The index
// variables j and k change the most frequently and the access pattern through
// the operand matrices is sequential using a small stride (one). This change
// improves access to memory data through the data cache. Data translation
// lookaside buffer (DTLB) behavior is also improved.
IPC – INSTRUCTIONS PER CYCLE
• Measure instruction level parallelism
• An indicator of code efficiency

PAPI High Level:

int events[] = {PAPI_TOT_CYC, PAPI_TOT_INS};
realtime[0] = PAPI_get_real_usec();
retval = PAPI_start_counters(events, 2);
classic_matmul();
retval = PAPI_stop_counters(cvalues, 2);
realtime[1] = PAPI_get_real_usec();

PAPI Low Level:

int events[] = {PAPI_TOT_CYC, PAPI_TOT_INS};
retval = PAPI_library_init(PAPI_VER_CURRENT);
retval = PAPI_create_eventset(&EventSet);
retval = PAPI_add_events(EventSet, events, 2);
realtime[0] = PAPI_get_real_usec();
retval = PAPI_start(EventSet);
classic_matmul();
retval = PAPI_stop(EventSet, cvalues);
realtime[1] = PAPI_get_real_usec();
IPC – INSTRUCTIONS PER CYCLE
Measurement                  Classic mat_mul   Reordered mat_mul
============================================================================
High Level IPC Test (PAPI_{start,stop}_counters)
Real time                    13.6106 sec       2.9762 sec
IPC                          0.3697            1.6939
PAPI_TOT_CYC                 24362605525       5318626915
PAPI_TOT_INS                 9007034503        9009011245

Low Level IPC Test (PAPI low level calls)
Real time                    13.6113 sec       2.9772 sec
IPC                          0.3697            1.6933
PAPI_TOT_CYC                 24362750167       5320395138
PAPI_TOT_INS                 9007034381        9009011130

• Both PAPI methods are consistent
• Roughly 460% improvement in the reordered code
DATA CACHE ACCESS
A cache miss results in a main memory access with much longer latency. Data cache misses fall into 3 categories:

• Compulsory misses: occur on the first reference to a data item
  o Prefetching can help
• Capacity misses: occur when the working set exceeds the cache capacity
  o Use a smaller working set (blocking or tiling algorithms)
• Conflict misses: occur when a data item is referenced after the cache line containing the item was evicted earlier
  o Improve data layout and memory access patterns
L1 DATA CACHE ACCESS
Measurement                                                     Classic mat_mul   Reordered mat_mul
============================================================================
PAPI NATIVE EVENTS:
DATA_CACHE_ACCESSES                                             2,002,807,841     3,008,528,961
DATA_CACHE_REFILLS:L2_MODIFIED:L2_OWNED:L2_EXCLUSIVE:L2_SHARED  205,968,263       60,716,301
DATA_CACHE_REFILLS_FROM_SYSTEM:MODIFIED:OWNED:EXCLUSIVE:SHARED  61,970,925        1,950,282
----------------------
PAPI PRESET EVENTS:
PAPI_L1_DCA                                                     2,002,808,034     3,008,528,895
PAPI_L1_DCM                                                     268,010,587       62,680,818
----------------------
Data Cache Request Rate    0.2224 req/inst    0.3339 req/inst
Data Cache Miss Rate       0.0298 miss/inst   0.0070 miss/inst
Data Cache Miss Ratio      0.1338 miss/req    0.0208 miss/req

• Two ways to measure:
  o First uses native events
  o Second uses PAPI presets only
• ~50% more requests from the reordered code
• 6x fewer misses per request!
PAPI Thread Support
• PAPI_thread_init initializes thread support in the PAPI library.
• Applications that make no use of threads do not need to call this routine.
• The function passed in MUST return a UNIQUE thread ID for every new thread/LWP created. It is much better to use the underlying thread subsystem's call, which is pthread_self() on Linux platforms.
For example,
if (PAPI_thread_init(pthread_self) != PAPI_OK)exit(1);
One User Case
Table 1: Data of hardware performance counters
collected from a 12-core experiment and a 48-core
experiment, respectively.
On 12 Cores On 48 Cores
Performance per Core 6.4 GFlops 5.1 GFlops
L1 Data Cache Miss Rate .5% .5%
L2 Data Cache Miss Rate 2.98% 2.87%
TLB Miss Rate 7.8% 8.7%
Branch Misprediction Rate 1.6% 1.5%
Cycles w/o Inst. .5% .6%
FPU Idle Cycles 3.9% 5.1%
Any-Stall Cycles 20.6% 35.9%
2.2 Reasons for Increased Any-Stall Cycles

We expect there are three possible reasons that can result in lots of stall cycles (due to any resources). They are: (1) thread synchronization, (2) remote memory access latency, and (3) memory bus or memory controller contention.

Thread synchronization can happen during the scheduling of tasks by the TBLAS runtime system. For instance, a thread is waiting to enter a critical section (e.g., a ready task queue) to pick up a task. When a synchronization occurs, the thread cannot do any computation but wait. Therefore, the synchronization overhead is a part of the thread's non-computation time (i.e., total execution time - computation time). However, as shown in Table 2, the non-computation time of our program is less than 1%. Therefore, we can rule out thread synchronization as a possible reason.
Table 2: Analysis of synchronization overhead.
Total Time (s) Computation Time (s)
12 Cores 30 29.9
48 Cores 74.8 74
The second possible reason is remote memory accesses. We conduct a different experiment to test the effect of remote memory accesses. In the experiment, the TBLAS program allocates memory only from the NUMA memory nodes from 4 to 7, that is, the second half of the eight NUMA memory nodes as shown in Figure 2. This is enforced by using the Linux command numactl. Then, we compare two configurations (see Figure 2): (a) running twelve threads on CPU cores from 0 to 11 located on NUMA nodes 0 and 1, and (b) running twelve threads on CPU cores from 36 to 47 located on NUMA nodes 6 and 7. In the first configuration, TBLAS attains a total performance of 57 GFlops. In the second configuration, it attains a total performance of 63 GFlops. Note that the matrix input is stored in NUMA nodes from 4 to 7. The performance difference shows that accessing remote memory does decrease performance. Note that we use the Linux command taskset to pin each thread to a specific core.
The third possible reason is memory contention that may come from a memory bus or a memory controller. Here we do not differentiate the controller contention from the bus contention, but consider the memory controller an end point of its attached memory bus. In our third experiment, we run "numactl -i all" to place memory on all the eight NUMA nodes using the round-robin policy. As shown in Figure 3, there are two configurations: (a) we create twelve threads and pin them on cores from 0 to 11; (b) we create twelve threads and each thread runs on one core out of every four cores. In both configurations, each thread must access all the eight NUMA nodes due to the interleaved memory placement. Since the second configuration provides more memory bandwidth to each thread (i.e., less contention) than the first configuration, it achieves a total performance of 76.2 GFlops. By contrast, the first configuration only achieves a total performance of 66.3 GFlops. The difference shows that memory contention also has a significant impact on performance. One way to reduce memory contention is to maximize data locality and minimize cross-chip memory access traffic.

Figure 2: Effect of remote NUMA memory access latency. The matrix is always allocated on NUMA nodes from 4 to 7. Then we run twelve threads in two different locations: (a) on NUMA nodes 0 and 1 (i.e., far from data), and (b) on NUMA nodes 6 and 7 (i.e., close to data).

The above three experiments, together with the performance differences they have made, provide an insight into the unscalability problem and motivate us to perform NUMA memory optimizations. We target minimizing remote memory accesses since doing so not only reduces the latency to access remote memories, but also alleviates the pressure on memory buses.
Figure 3: Effect of memory bus contention. The matrix is allocated to all eight NUMA nodes using the round-robin policy. We start twelve threads using two configurations, respectively: (a) 12 threads are located on 12 contiguous cores from 0 to 11; (b) 12 threads, each of which runs on one core out of every four cores.

3. THE TILE ALGORITHM

[1] Song et al., ICS'14.
3RD PARTY TOOLS APPLYING PAPI

• TAU (U Oregon) http://www.cs.uoregon.edu/research/tau/
• PerfSuite (NCSA) http://perfsuite.ncsa.uiuc.edu/
• HPCToolkit (Rice University) http://hpctoolkit.org/
• KOJAK and SCALASCA (FZ Juelich, UTK) http://icl.cs.utk.edu/kojak/
• VampirTrace and Vampir (TU Dresden) http://www.vamir.eu
• PaRSEC (UTK) http://icl.cs.utk.edu/parsec/
• Open|Speedshop (SGI) http://oss.sgi.com/projects/openspeedshop/
• SvPablo (UNC Renaissance Computing Institute) http://www.renci.org/research/pablo/
• ompP (UTK) http://www.ompp-tool.com
• Also, Intel Trace Analyzer and Collector (free education version available)
• Read the manual -> http://icl.cs.utk.edu/projects/papi/wiki/PAPI3:PAPI.3