Source: ...cs.anu.edu.au/~Alistair.Rendell/sc02/module1.pdf

Performance Programming: Theory, Practice and Case Studies

Module I: Measuring Program Performance

Outline

• Measuring methodology and guidelines

• Measurement tools
  - Timing tools
  - Profiling tools
  - Process monitoring and tracing tools
  - System monitoring tools

• Hardware counter measurements
  - Monitoring tools
  - Code instrumentation

• Parallel performance measurements
  - Guidelines and recommendations
  - Tools for parallel monitoring

• Summary

Measurement Methodology

• Quantifying performance is the first step in the application tuning process

• It is important to set reasonable expectations for optimization

• Measurements should be made repeatedly to identify the parts of the program that need to be optimized

• Choose measurement characteristics that suit the particular application

• Compare measurements to theoretical peak values

What to Measure

• Timing measurements
  - Wall clock time for a single job (turnaround time)
  - Wall clock time for multiple jobs (throughput measurements)
  - Wall clock time for parallel runs (scalability measurements)

• Execution and computation rates
  - MFLOPS (million floating point operations per second)
  - MIPS (million instructions per second)
  - IPC (instructions per cycle)

• Resource utilization
  - Memory usage
  - I/O utilization
  - Network usage
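
The execution and computation rates listed above follow directly from raw counts and an elapsed time. The sketch below shows the arithmetic in C. The function names and the sample numbers in the note that follows are illustrative assumptions, not taken from the slides; in practice the counts would come from hardware counters or from an operation count of the algorithm.

```c
#include <stdio.h>

/* Hypothetical helpers: the names are illustrative, not from any tool. */

/* MFLOPS = floating point operations / (seconds * 10^6) */
double mflops(double flop_count, double seconds)
{
    return flop_count / (seconds * 1.0e6);
}

/* MIPS = instructions / (seconds * 10^6) */
double mips(double instr_count, double seconds)
{
    return instr_count / (seconds * 1.0e6);
}

/* IPC = instructions / cycles */
double ipc(double instr_count, double cycle_count)
{
    return instr_count / cycle_count;
}

/* Print all three rates for one measured run. */
void print_rates(double flops, double instrs, double cycles, double seconds)
{
    printf("MFLOPS: %.1f\n", mflops(flops, seconds));
    printf("MIPS:   %.1f\n", mips(instrs, seconds));
    printf("IPC:    %.2f\n", ipc(instrs, cycles));
}
```

For instance, a made-up run that executes 2.0e9 floating point operations and 6.0e9 instructions in 4.0 seconds over 3.0e9 cycles gives 500 MFLOPS, 1500 MIPS, and an IPC of 2.0.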

Benchmarking Guidelines

• Benchmark runs should adequately represent the use of the application

• Preferably, only one parameter should change at a time

• The overhead of measurement should be considered

• Runs should be made from tmpfs or from a locally mounted ufs

• System activities should be monitored

• The system should not have any other computational jobs running during benchmarking

• System parameters and settings should be documented together with the results of the runs

Measurement Tools

• Functionality
  - Timing tools
  - Profiling tools
  - Monitoring tools

• Usage requirements
  - Tools that can operate on optimized binaries
  - Tools that require recompilation
  - Tools that require source code instrumentation

• Parallel / serial measurement tools
  - Tools measuring serial performance
  - Tools measuring parallel performance

Timing Entire Program

• Measuring the elapsed (wall-clock) time that passes during the program execution

• Example: Solaris time, timex, and ptime

Timing Program Portions

• Fortran 77: etime, dtime (neither is thread safe)

• C, C++, Fortran 90/95: gethrtime
  - High resolution timer (nanoseconds)
  - Can be called via a C wrapper from Fortran 77
  - Can be used for multithreaded applications

• Platform-specific tools and methods
  - Solaris microstate accounting
  - Fine-grain timing measurements by accessing the UltraSPARC TICK register directly

    .inline readtick,1
    rd %tick, %o1
    stx %o1, [%o0]
    .end
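
As a minimal sketch of timing a program portion, the C code below brackets a region with two timer reads. gethrtime is Solaris-specific, so the sketch substitutes the POSIX clock_gettime call with CLOCK_MONOTONIC as a portable stand-in; that substitution, the helper names (now_ns, dot, time_dot_ns), and the dot-product workload are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: nanoseconds from a monotonic clock.
 * POSIX clock_gettime() stands in for Solaris gethrtime() here. */
int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Example region to time: a dot product, kept in its own function
 * so the timed portion is easy to see. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}

/* Time one call of dot() and return the elapsed nanoseconds.
 * The result is stored through *result so the compiler cannot
 * discard the computation. */
int64_t time_dot_ns(const double *a, const double *b, int n, double *result)
{
    int64_t start = now_ns();
    *result = dot(a, b, n);
    int64_t end = now_ns();
    return end - start;
}
```

For short regions the returned value includes the cost of the timer calls themselves, so it is worth measuring that cost separately and repeating the region enough times to dominate it.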

Measurement Overhead

• Computing the overhead of a gethrtime() call

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        hrtime_t start, end;   /* gethrtime() returns hrtime_t (nanoseconds) */
        int i, iters = 100000;

        for (i = 0; i < iters; i++) {
            start = gethrtime();
            end = gethrtime();
            (void)printf("%lld\n", (long long)(end - start));
        }
        return 0;
    }

[Figure: histogram of gethrtime() call overhead. x-axis: call overhead in 5 ns bins from 180 to 230 ns; y-axis: distribution, 0 to 22500]
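
The overhead measurement above can be binned directly in the program instead of printing every sample. The following is a sketch under stated assumptions: it uses POSIX clock_gettime rather than gethrtime so that it runs outside Solaris, and the names overhead_histogram, print_histogram, and NBINS are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NBINS 10

/* Nanoseconds from a monotonic clock (POSIX stand-in for gethrtime()). */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Call the timer back to back `iters` times and count the differences
 * into NBINS bins of `bin_ns` nanoseconds starting at `first_ns`.
 * Samples outside the binned range are counted in *outliers.
 * Returns the smallest difference seen. */
int64_t overhead_histogram(int iters, int64_t first_ns, int64_t bin_ns,
                           long hist[NBINS], long *outliers)
{
    int64_t min = INT64_MAX;

    *outliers = 0;
    for (int b = 0; b < NBINS; b++)
        hist[b] = 0;

    for (int i = 0; i < iters; i++) {
        int64_t start = now_ns();
        int64_t end = now_ns();
        int64_t d = end - start;

        if (d < min)
            min = d;
        if (d >= first_ns && d < first_ns + NBINS * bin_ns)
            hist[(d - first_ns) / bin_ns]++;
        else
            (*outliers)++;
    }
    return min;
}

/* Print one line per bin: range in nanoseconds and sample count. */
void print_histogram(int64_t first_ns, int64_t bin_ns, const long hist[NBINS])
{
    for (int b = 0; b < NBINS; b++) {
        long long lo = (long long)(first_ns + b * bin_ns);
        printf("%lld-%lld ns: %ld\n", lo, lo + (long long)bin_ns, hist[b]);
    }
}
```

Calling overhead_histogram(100000, 180, 5, hist, &outliers) uses the same binning as the slide (180 to 230 ns in 5 ns steps); on other systems and timers the range will likely need adjusting.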

Program Profiling with gprof

• Application profiling
  - Special form of timing measurement that shows which functions account for large parts of the application run time
  - Should be used on multiple, representative test cases

• gprof - standard UNIX profiling utility
  - Can be used for profiling executables and shared libraries
  - Based on program counter (PC) sampling at periodic intervals
  - Requires recompilation with -pg (Linux, Solaris, Tru64) or -G (HP-UX)
  - After the run, the data is collected in the gmon.out file
  - Profiling results are displayed with the gprof command

gprof Output

• Output includes
  - Absolute time spent in a function
  - Percentage of total run time spent in a function
  - Number of calls to the function
  - Average time per call

• Functions can be sorted by
  - time they consume together with their descendants (cumulative or inclusive time)
  - time spent executing the function itself (self or exclusive time)

      %   cumulative     self                self     total
     time    seconds  seconds     calls   ms/call   ms/call  name
     66.4      65.70    65.70    186116      0.35      0.35  dmmch_ [4]
     15.2      80.72    15.02     20448      0.73      0.73  dmake_ [8]
     10.9      91.51    10.79     16924      0.64      0.64  dgemm_ [9]
     ...
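
The columns of the listing above are related by simple arithmetic: in this listing the cumulative seconds column is the running sum of the self seconds column, and self ms/call is self seconds times 1000 divided by the number of calls. The C sketch below recomputes both from the three rows shown; the struct and function names are hypothetical and are not part of gprof.

```c
#include <stdio.h>

/* One flat-profile row: self time and call count as printed by gprof. */
struct row {
    const char *name;
    double self_s;   /* self seconds */
    long calls;      /* number of calls */
};

/* Average self time per call, in milliseconds. */
double self_ms_per_call(const struct row *r)
{
    return r->self_s * 1000.0 / (double)r->calls;
}

/* Fill cum[i] with the running sum of self seconds for rows 0..i. */
void cumulative_seconds(const struct row *rows, int n, double *cum)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        total += rows[i].self_s;
        cum[i] = total;
    }
}

/* Print name, cumulative seconds, self seconds, and self ms/call. */
void print_profile(const struct row *rows, const double *cum, int n)
{
    for (int i = 0; i < n; i++)
        printf("%-8s %8.2f %8.2f %8.2f\n", rows[i].name, cum[i],
               rows[i].self_s, self_ms_per_call(&rows[i]));
}
```

With the rows dmmch_ (65.70 s, 186116 calls), dmake_ (15.02 s, 20448 calls), and dgemm_ (10.79 s, 16924 calls), the cumulative values come out as 65.70, 80.72, and 91.51 seconds and the per-call values as 0.35, 0.73, and 0.64 ms, matching the listing.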

Profiling Using Coverage Analysis

• Coverage analysis tools annotate source code with the number of times each line was executed
  - Basic block profiling
  - Results can be accumulated for multiple runs
  - Information about hot loops in the code and branches taken
  - Code coverage for quality assurance

                      DO 350 L = LL, LL+ LSEC- 1
    150483840 ->         F11 = F11 + T1( L- LL+ 1, I- II+ 1 )*
         $                     T2( L- LL+ 1, J- JJ+ 1 )

• Available on UNIX platforms
  - Linux/GNU: gcov
  - Solaris: tcov
  - IRIX: cvcov, cvxcov
  - Tru64: pixie
  - AIX: tprof

Advanced Profiling Tools

• Measurement parameters and features
  - Measurements based on hardware counters
  - Profiling by
      - functions
      - basic blocks
      - lines of high-level code
      - assembly instructions
  - Source code annotation
  - Capabilities to work with parallel programs
      - synchronization overhead
      - load balancing monitoring

• Available tools

    Tool         Vendor             Platforms
    VTune        Intel              NT
    Analyzer     Sun                Solaris
    SpeedShop    SGI                IRIX
    DCPI         DEC, Compaq, HP    Tru64, NT

Example: Sun Performance Analyzer (1 of 3)

• Profiling by function and module (no recompilation)

Example: Sun Performance Analyzer (2 of 3)

• Annotated source (recompilation with -g) and disassembly (no recompilation)

Example: Sun Performance Analyzer (3 of 3)

• Hardware counter overflow profiling

Process Monitoring Tools

• Tracing tools
  - Linux: strace (ltrace for dynamic library calls)
  - Solaris: truss (sotruss for dynamic library calls)
  - IRIX: par
  - Tru64: atom -tool ptrace

• procfs-based tools
  - pmap: prints the address space of the program
  - pldd: lists the dynamic shared objects linked into the process (including ones explicitly attached using dlopen)
  - pstack: prints a stack trace for each LWP in the process
  - pflags: prints the /proc tracing flags
  - ptree: prints process trees containing the specified pids or users
  - pwait: waits for the specified processes to terminate
  - pcred: prints the credentials (effective, real, and saved UIDs and GIDs)

Example: profiling system calls

• truss on Solaris
  - Reports the number of system calls for a process and the associated time

System Monitoring Tools

• Tools for various UNIX platforms
  - vmstat, vm_stat, memvis - virtual memory and CPU statistics
  - mpstat, mpvis - parallel memory/CPU statistics
  - netstat, nfsstat, nfsvis - network status and statistics
  - iostat, dkvis - I/O statistics
  - sar - system activity report
  - top, prstat - list of most active processes
  - systat - system activity stats
  - lockstat - kernel lock statistics
  - dkstat - file status information

vmstat - Virtual Memory Statistics

• Available on HP-UX, Tru64, Solaris, Linux, FreeBSD, etc.

• Example on Alpha/Tru64

[Figure: vmstat output for an idle system, with the Memory Usage, Paging Activity, and CPU Usage column groups annotated]

Hardware Counter Measurements

• Hardware performance counters allow low-overhead runtime measurement of various hardware events
  - Cache references
  - Cache misses
  - Pipeline stalls
  - Branch misprediction statistics
  - D-TLB (Data Translation Lookaside Buffer) misses
  - I-TLB (Instruction Translation Lookaside Buffer) misses
  - Bus statistics, including DMA and cache coherency transactions on multiprocessor systems
  - Others

• Only a few events can be monitored at the same time

Code Instrumentation

• APIs can be used directly in the code
  - High-resolution timing of performance-critical parts of the program
  - Access to HW performance counters

• Example (Solaris)

    if (cpc_take_sample(&before) == -1)
        exit(-1);
    for (k = 0; k < N-1; k++)
        sum = sum + a[k]*b[k];
    if (cpc_take_sample(&after) == -1)
        exit(-1);

• Counters are specified by setting the PERFEVENTS environment variable

    example% setenv PERFEVENTS pic0=Load_use,pic1=Load_use_RAW

• Works on UltraSPARC CPUs

Parallel Measurement Methodology

• Same guidelines as in the serial case
  - Parallel benchmarks should be representative of typical uses of applications
  - Benchmarking must be performed to ensure repeatable and consistent results
  - Probe effects and tool overheads should be minimized

• Specifics of parallel benchmarking
  - Parallelism vs. concurrency
  - Dedicated mode of benchmarking
  - Number of processors
  - Choice of timer and time criterion
  - Processor-set configuration
  - Processor allocation in clusters

Timing a Parallel Threaded Program

• timex can be used for parallel timing

• Note that the real time decreases, while the user time, which represents the combined CPU usage, stays constant

Specific Parallel Timers

• Timing MPI programs
  - time or timex can be used in combination with MPI job submission commands (mprun, mpirun, etc.)
  - For timing portions of an MPI program, one can use the MPI_Wtime function, available in the Fortran, C, and C++ bindings (typically highly accurate)

• Threaded applications can use gethrvtime (Solaris, Tru64 with Solaris Compatibility Library)
  - Shows the user time on a per-thread basis
  - Can be used in combination with gethrtime, which returns the elapsed real (wall-clock) time on a per-thread basis
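
The pairing of a per-thread CPU timer with a wall-clock timer can be sketched in portable C. gethrvtime and gethrtime are not available everywhere, so this sketch assumes the POSIX clocks CLOCK_THREAD_CPUTIME_ID and CLOCK_MONOTONIC as rough analogues; that mapping, the struct, and the function names are assumptions for illustration and not the API described on the slide.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <time.h>

/* Read a POSIX clock and return its value in nanoseconds. */
static int64_t clock_ns(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Elapsed wall-clock and per-thread CPU time for one measured region. */
struct region_times {
    int64_t wall_ns;   /* elapsed real time (the role of gethrtime) */
    int64_t cpu_ns;    /* CPU time of the calling thread (the role of gethrvtime) */
};

/* Run work(arg) in the calling thread and measure both times. */
struct region_times measure(void (*work)(void *), void *arg)
{
    struct region_times t;
    int64_t wall0 = clock_ns(CLOCK_MONOTONIC);
    int64_t cpu0 = clock_ns(CLOCK_THREAD_CPUTIME_ID);

    work(arg);

    t.cpu_ns = clock_ns(CLOCK_THREAD_CPUTIME_ID) - cpu0;
    t.wall_ns = clock_ns(CLOCK_MONOTONIC) - wall0;
    return t;
}

/* Example workload: busy loop that sums integers into *arg. */
void spin(void *arg)
{
    volatile long long *sum = arg;
    for (long long i = 0; i < 1000000; i++)
        *sum += i;
}
```

A compute-bound thread that runs undisturbed shows the two values close together, while a thread that blocks or is descheduled shows a wall-clock time noticeably larger than its CPU time.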

Parallel System Monitoring

• mpstat - multiprocessor monitoring

[Figure: sample mpstat measurements with annotated columns: CPU ID, interrupts, cross calls, context switches, thread migrations, mutex info, system calls, and CPU usage. The first snapshot is the average since boot.]

Kernel Lock Statistics

• Tools that report kernel lock statistics
  - lockstat - Solaris, IRIX, AIX, Linux
  - lockinfo - Tru64

• Allows one to specify which events to monitor
  - spin on adaptive mutex
  - block on read access to rwlock due to waiting writers

• On some platforms generates gprof-like output

    # lockstat -IWk example_tnf 24
    ...
    Profiling interrupt: 151649 events in 130.282 seconds (1164 events/sec)

    Count indv cuml rcnt     nsec Hottest CPU+PIL   Caller
    --------------------------------------------------------------------
    85698  57%  57% 1.00      188 cpu[12]           mutex_vector_enter
    14247   9%  66% 1.00      160 cpu[9]+10         disp_getwork
    12792   8%  74% 1.00      746 cpu[14]           mutex_tryenter
    10359   7%  81% 1.00      280 cpu[5]            (usermode)
     1951   1%  82% 1.00       59 cpu[1]            splx
     1648   1%  84% 1.00      365 cpu[5]+10         _resume_from_idle
     1510   1%  85% 1.00      490 cpu[9]+10         disp
     1259   1%  85% 1.00      255 cpu[15]+10        setfrontdq

Binding a Program To a Set of Processors

• Process monitoring can be difficult on multiprocessor systems due to process migration

• Single-threaded programs
  - One can bind to a processor

• For multithreaded programs
  - One can use processor sets

• Commands to set up and use processor sets
  - psrset (HP-UX, Solaris)
  - pset (IRIX)
  - pset_create, pset_assign_cpu, pset_assign_pid, etc. (Tru64)

Summary

• Monitoring performance is essential to optimization
  - If you cannot measure it, you cannot improve it

• It is important to select benchmarks carefully and identify the parameters to measure

• Select tools suitable for the task
  - System-wide or process-specific?
  - Parallel or serial?
  - Require recompilation or instrumentation?
  - Need source-level information?
  - Need hardware counter information?