University of Maryland
Towards a Methodology for Deliberate Sample-Based
Statistical Performance Analysis
Geoff Stoker
Why Deliberate Statistical Profiling?
• Move statistical profiling sample rate selection out of the realm of ad hoc
• Use mathematical model to balance change in statistical accuracy and effects of perturbation
• Allow measured program parameters and system context to inform sampling
• Likely to scale better
Abstract Model
[Figure: Measured Performance vs. total # of samples, showing true performance, Measurement Error, Perturbation Error, and the Best Possible Measurement Point]
How much execution is attributable to foo?
Analytical Model
How much execution time is attributable to foo?
• t(n) – execution time of foo, after n samples
• n – number of samples
• o – overhead time cost per sample
• p – foo's proportion of total program execution time
• T – uninstrumented (true) total program execution time
• z – standard score (z-value)

t(n) = p(T + no) ± zT·√(p(1−p)/n)
Assumptions
• Time is a representative surrogate for all perturbation effects
• Hypergeometric distribution is well approximated by a Normal (Gaussian) when n ≥ 100 and M, N ≫ n
• Systematic sampling provides results similar to random sampling and occurs asynchronously with periodic events in the measured program
Example
• Predict the expected range of measurement results for foo (20% of a program's execution) at the 95% confidence level
• o = 250 µseconds
• p = .20
• T = 300 seconds
• z = 1.96 for 95% confidence level

t(n) = p(T + no) ± zT·√(p(1−p)/n)
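The predicted range can be checked by evaluating the model directly. A minimal sketch (Python used purely for illustration; the function name is mine, not from the talk):

```python
import math

def predicted_interval(n, p, T, o, z):
    """t(n) = p(T + n*o) +/- z*T*sqrt(p*(1-p)/n), all times in seconds."""
    estimate = p * (T + n * o)                    # point estimate, perturbed by sampling overhead
    margin = z * T * math.sqrt(p * (1 - p) / n)   # confidence half-width
    return estimate - margin, estimate, estimate + margin

# parameters from the example: o = 250 usec, p = .20, T = 300 s, z = 1.96
low, est, high = predicted_interval(n=900, p=0.20, T=300.0, o=250e-6, z=1.96)
```

At n = 900 this predicts foo's measured time between roughly 52.2 and 67.9 seconds around a 60-second true value.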
Analytical Model Prediction
[Chart: predicted 95% confidence interval for foo's time (sec), roughly 45–75, vs. total # samples taken during program execution (900–44,400), with the interval minimum marked]
Example Continued

Minimizing the combined perturbation and measurement error gives:

n = ∛( (zT·√(p(1−p)) / (2po))² )

With T = 300, z = 1.96, p = .2, o = .000250:

n = ∛( (235.2 / .0001)² ) ≈ 17,686
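The optimum just computed can be reproduced numerically; a quick sketch (variable names mine):

```python
import math

# n = cbrt((z*T*sqrt(p*(1-p)) / (2*p*o))**2) with the example's parameters
T, z, p, o = 300.0, 1.96, 0.2, 250e-6
ratio = z * T * math.sqrt(p * (1 - p)) / (2 * p * o)  # 235.2 / .0001
n_opt = ratio ** (2.0 / 3.0)
```

The intermediate ratio is 235.2 / .0001 = 2,352,000, and n_opt comes out near 17,686 samples.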
Analytical Model Prediction
[Chart repeated: predicted 95% confidence interval for foo's time (sec) vs. total # samples (900–44,400), with the minimum marked]
Simulation
• 1,000,000-int array
• Each int = 300 µsec of exec
• 200,000 ints (p = .2) represent foo
• Shuffle
• Draw random sample 1000x
  – 900, then every 1,500 up to 44,400
• Sample rate 3/sec – 148/sec
• Assess 250 µsec/sample
[Figure: samples (s) drawn from the shuffled array]
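The simulation can be re-created roughly as below for a single draw at n = 900 (the talk repeated each sample size 1000×; the seed and structure here are mine):

```python
import random

# 1,000,000 slots, each representing 300 usec of execution;
# 200,000 of them (p = .2) belong to foo; 250 usec assessed per sample
SLOTS, FOO_SLOTS, UNIT, OVERHEAD = 1_000_000, 200_000, 300e-6, 250e-6

def simulate_once(n_samples, rng):
    """Shuffle the array, draw n_samples slots, estimate foo's time."""
    program = [1] * FOO_SLOTS + [0] * (SLOTS - FOO_SLOTS)   # 1 marks a foo slot
    rng.shuffle(program)
    p_hat = sum(rng.sample(program, n_samples)) / n_samples  # sampled proportion
    measured_total = SLOTS * UNIT + n_samples * OVERHEAD     # perturbed runtime
    return p_hat * measured_total                            # time credited to foo

est = simulate_once(900, random.Random(42))
```

With p = .2 and a true runtime of 300 s, a single 900-sample draw should land inside the analytical ±7.8 s band most of the time.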
Simulation Results
[Chart: time (sec) calculated for foo, roughly 45–75, vs. total # samples taken during simulated program execution (900–44,400)]
Signal Handler Experiment
• Measured program
  – executes ≈ 300 sec
  – 1,000,000 function calls
  – 300 µsec functions
  – Compiled with -g
• Tool
  – Forks measured program
  – Initialization, signal handler, close-out
  – 23 different sample rates: 3/sec to 166/sec
[Figure: tool timeline — setup, periodic samples (s) during the measured run, close-out]
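A loose in-process sketch of the sampling mechanism (POSIX interval timers; the actual tool forks the measured program, and everything named here is illustrative):

```python
import signal
import time

samples = []  # function names observed at each timer tick

def on_tick(signum, frame):
    # record which function the interrupted frame was executing
    samples.append(frame.f_code.co_name)

signal.signal(signal.SIGPROF, on_tick)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # ~100 samples/sec of CPU time

def foo():
    start = time.process_time()
    while time.process_time() - start < 0.3:  # burn ~0.3 s of CPU
        pass

foo()
signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop sampling
```

The fraction of samples landing in foo, times the measured runtime, estimates foo's execution time.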
Experimental Results
[Chart: time (sec) calculated for foo, roughly 45–75, vs. total # samples taken during program execution (≈897 to ≈51,775)]
Combined Results
[Chart: simulation and experimental series overlaid — time (sec) calculated for foo vs. total # samples taken during program execution]
Experiments with SPEC
• Omnetpp
  – Runtime ≈ 340 sec
  – 115 runs at 2 samples/sec to establish "truth"
  – 10 runs at 28 different sample rates (hpcrun 4.9.9)
  – Use Mean Absolute Percent Error (MAPE) to determine closest sample set to "truth"
• Bzip2
  – Similar experiment procedure
  – Runtime ≈ 90 sec
  – Look at functions 1–4
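MAPE over per-function timings can be computed as below (the numbers are invented for illustration; only the metric itself is from the talk):

```python
def mape(truth, measured):
    """Mean Absolute Percent Error between paired per-function times."""
    return 100.0 * sum(abs(t - m) / t for t, m in zip(truth, measured)) / len(truth)

# hypothetical "truth" vs. one sampled run, three functions
err = mape([54.4, 20.1, 10.2], [52.2, 21.0, 10.2])
```

The run set with the lowest MAPE against "truth" is taken as the closest.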
Omnetpp Analysis
[Charts: execution time (44.7–64.7 sec) and % execution time (.1385–.1785) vs. total samples (0–140,000)]
Distribution of 10 runs at 28 different sampling intervals – cMessageHeap::shiftup(int)
Omnetpp Analysis
[Chart: Mean Absolute % Error (0–.06) vs. total samples]
T = 343, z = 1.96, p = .1585, o = .000068 → n = 50,623
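Plugging the slide's omnetpp parameters into the model's optimum-sample-count formula reproduces the quoted n (sketch; variable names mine):

```python
import math

# n = cbrt((z*T*sqrt(p*(1-p)) / (2*p*o))**2), omnetpp parameters from the slide
T, z, p, o = 343.0, 1.96, 0.1585, 68e-6
n_opt = (z * T * math.sqrt(p * (1 - p)) / (2 * p * o)) ** (2.0 / 3.0)  # ~50,623
```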
Bzip2 Analysis
[Chart: % execution time (.105–.185) vs. total samples (0–30,000)]
mainGtU – distribution of 10 runs at 28 different sampling intervals
Bzip2 Analysis
[Chart: % execution time (.11–.19) vs. total samples (0–30,000)]
Comparison of 2nd–4th most computationally expensive functions at selected sample intervals: mainGtU, BZ2_decompress, BZ2_compressBlock
Bzip2 Analysis
[Chart: Mean Absolute % Error (0–.12) vs. total samples — mainGtU, BZ2_blockSort]
T = 87, z = 1.96, p = .385, o = .000050 → n = 16,685
Some Concerns of Oversampling
• Tool performance variation
• Misleading results
  – Functions heavily perturbed by sampling (SPEC OMP examples)
Example Analytical Result
[Chart: performance (time) vs. total # of samples for Tool1 and Tool2, marking the Best Possible Measurement Point and the Published Result]
Apsi Analysis
[Chart: % execution time (0–.12) vs. total samples (0–140,000); sample intervals of 1 s, 500 ms, 200 ms, 100 ms, 80 ms, 60 ms, 50 ms, 40 ms, 25 ms]
3rd–6th most expensive functions; 11 run sets; 4-core machine: radb3_, radf3_, radb2_, leapfr_.omp_fn.23
Fma3d Analysis
[Chart: % execution time (.09–.17) vs. total samples, log scale 1,000–1,000,000; sample intervals of 1 s, 500 ms, 200 ms, 100 ms, 50 ms, 25 ms, 10 ms]
2nd and 3rd most expensive functions; 11 run sets; 4-core machine: scatter_element_nodal_forces.omp_fn.5, khplq_gradient_operator_
Future Work
• More experiments with additional sequential and parallel SPEC programs
• Overhead calculation process
• Overhead function enhancement
• Deliberate statistical profiling methodology refinement
Conclusion
• Oversampling can generate misleading analysis
• Deliberate statistical profiling can lead to better results
• Questions??
Backup Slides
Determining Sample Size
• Sample size for determining proportions, from the margin of error p ± z·√(p(1−p)/n)
  – Jain (r is the absolute CI half-width on p): n = z²p(1−p) / r²
  – Lilja (r is the CI half-width relative to p): n = z²p(1−p) / (rp)²
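The two conventions imply very different counts for the same nominal r; an illustrative comparison at p = .2, r = .01, 95% confidence (the absolute-vs-relative reading of r is my interpretation of the slide):

```python
def n_jain(p, r, z=1.96):
    """Jain: r read as an absolute half-width on the proportion p."""
    return z * z * p * (1 - p) / (r * r)

def n_lilja(p, r, z=1.96):
    """Lilja: r read as a half-width relative to p itself."""
    return z * z * p * (1 - p) / (r * p) ** 2

n_abs = n_jain(0.2, 0.01)   # half-width of one percentage point
n_rel = n_lilja(0.2, 0.01)  # half-width of 1% of p
```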
Effective Sample Rates
[Chart: actual sample rate vs. target sample rate, both 0–500]
Omnetpp "Truth"
[Chart: % execution time (.1185–.1985) across the "truth" runs (x-axis 660–720)]
Analytical Model

t(n) = p_m · T_m (measured proportion × measured total time)
T_m = T + no
t(n) = p(T + no) ± zT·√(p(1−p)/n)

How much execution time is attributable to foo?
Sample Size and Accuracy
[Charts: sample sizes required for ±1% accuracy (up to ~40,000), ±.1% accuracy (up to ~4,000,000), and ±.01% accuracy (up to ~400,000,000), across confidence levels 90%–100% and p values .01–.5]
One order of magnitude accuracy change → two orders of magnitude sample size change
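The quoted two-orders-of-magnitude jump follows from n ∝ 1/r² in the proportion sample-size formula; a quick check (illustrative):

```python
def n_required(p, r, z=1.96):
    # n = z^2 * p * (1 - p) / r^2
    return z * z * p * (1 - p) / (r * r)

n_1pct = n_required(0.5, 0.01)    # +/- 1% accuracy at p = .5
n_01pct = n_required(0.5, 0.001)  # +/- .1% accuracy: 100x the samples
```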
Value of p(1−p)
[Chart: p(1−p), ranging 0–.25, as p varies from .01 to .97]

t(n) = p(T + no) ± zT·√(p(1−p)/n)
Mathematical Model

t(n) = p(T + no) ± zT·√(p(1−p)/n)

Minimizing the error e(n) = pno + zT·√(p(1−p)/n), set

e′(n) = po − zT·√(p(1−p)) / (2n^(3/2)) = 0

which gives

n = ∛( (zT·√(p(1−p)) / (2po))² )
Sample Size and Accuracy (cont.)
[Chart: sample counts (0–18,000) at 90%, 95%, and 99% confidence for p values from .01 to .99]
Another Look at Statistical Results
[Charts: the statistical results re-plotted across total sample counts 900–44,400]
Sample of Sampling Practices
• 100 samples/sec [gprof, XProfiler]
• 200 samples/sec [T09]
• 1000 samples/sec [Intel VTune]
• 5200 samples/sec [DCPI, A91]
• 10,000 samples/sec [A05]
• 2.5% all memory ops [O05]
• 15 sec CPU, 10 sec mem analysis [R08]
• 1,000,000 mem accesses, skip 9,000,000 [W08]
Current Practice
“To ensure good statistical coverage of profiled code, one must collect a large number of samples, either by measuring over a long interval, or by using a high sample rate.”
“Volume and accuracy are antithetical”
References
• A91 – Anderson, Berc, Dean, Ghemawat, …
• A91 – Andersland, Casavant
• A05 – Azimi, Stumm, Wisniewski
• K71 – Knuth
• K05 – Kumar, Childers, Soffa
• L07 – Lahiri, Chatterjee, Maiti
• M92 – Malony, Reed, Wijshoff
• M07 – Malony, Shende, Morris, Wolf
• N04 – Najafzadeh, Chaiken
• O05 – Odom, Hollingsworth, DeRose, Ekanadham, …
• P95 – Miller, Callaghan, Cargille, Hollingsworth, …
• R08 – Reiss
• T09 – Tallent, Mellor-Crummey, Fagan
• V99 – Vetter, Reed
• W47 – Wald
• W08 – Weinberg, Snavely