
Page 1

Thirteen modern ways to fool the masses with performance results on parallel computers

Georg Hager
Erlangen Regional Computing Center (RRZE)
University of Erlangen-Nuremberg

6th Erlangen International High End Computing Symposium
RRZE, 04.06.2010

Page 2

1991

David H. Bailey, Supercomputing Review, August 1991, p. 54-55:
“Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”

1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

Page 3

1991 …

If you were plowing a field, which would you rather use?

Two strong oxen or 1024 chickens? (Attributed to Seymour Cray)

Page 4

1991 …

Strong I/O facilities

32-bit vs. 64-bit FP arithmetic

SIMD/MIMD parallelism

Vectorization

System-specific optimizations

No parallelization standards

Page 5

Today we have…

Multicore processors with shared/separate caches, shared data paths

Hybrid, hierarchical systems with multi-socket, multi-core, ccNUMA, heterogeneous networks

Page 6

Today we have…

Ants all over the place: Cell, Clearspeed, GPUs...

Page 7

Today we have…

Commodity everywhere: x86-type processors, cost-effective interconnects, GNU/Linux

Page 8

The landscape of High Performance Computing and the way we think about HPC has changed over the last 19 years, and we need an update!

Still, many of Bailey's points are valid without change.

Page 9

Stunt 1

Report scalability, not absolute performance.

Speedup: S(N) = (work/time with N workers) / (work/time with 1 worker)

“Good” scalability ↔ S(N) ≈ N, but there is no mention of how fast you can solve your problem!

Consequence: Comparing different systems is much easier when using scalability instead of work/time directly.
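A minimal sketch of the effect (the run times below are assumed for illustration, not measured anywhere): two machines with identical speedup curves, one of them eight times slower in time to solution. The speedup column alone cannot tell them apart.

/* Assumed run times [s]: identical speedups S(N) = T(1)/T(N),
 * very different time to solution. */
#include <stdio.h>

int main(void) {
    const int    nw[]     = { 1, 8, 64 };
    const double t_fast[] = { 100.0, 14.0, 2.0 };    /* "fast" machine  */
    const double t_slow[] = { 800.0, 112.0, 16.0 };  /* 8x slower clone */
    for (int i = 0; i < 3; ++i)
        printf("N=%2d  S_fast=%5.1f  S_slow=%5.1f  time to solution: %6.1f s vs %6.1f s\n",
               nw[i], t_fast[0] / t_fast[i], t_slow[0] / t_slow[i],
               t_fast[i], t_slow[i]);
    return 0;
}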


Page 10

Stunt 1: Scalability vs. performance

And… instant success!

[Figure: the same measurement vs. # CPUs or nodes for “NEC” and “Cluster”, shown both as absolute performance (work/time) and as speedup.]


Page 11

Stunt 2

Slow down code execution.

This is useful whenever there is some noticeable “non-execution” overhead

Parallel speedup with work ~ N^α (α=0: strong scaling, α=1: weak scaling):

S(N) = (s + (1−s)N^α) / (s + (1−s)N^(α−1) + c(N))

Now let's slow down execution by a factor of μ>1 (for strong scaling):

S_μ(N) = (s + (1−s)) / (s + (1−s)/N + c(N)/μ) = 1 / (s + (1−s)/N + c(N)/μ)

I.e., if there is overhead, the slow code/machine scales better:

S_μ(N) > S_{μ=1}(N) if c(N) > 0
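A minimal sketch of this model (the serial fraction s and the overhead c(N) below are assumed example values, not from the talk): for any c(N) > 0, the code slowed down by μ=4 shows the “better” speedup curve.

/* Speedup model from above, strong scaling (alpha = 0):
 *   S_mu(N) = 1 / (s + (1-s)/N + c(N)/mu)
 * Assumed: s = 0.01, c(N) = 0.005*N. */
#include <stdio.h>

static double S(double N, double mu) {
    const double s = 0.01;         /* assumed serial fraction        */
    const double c = 0.005 * N;    /* assumed communication overhead */
    return 1.0 / (s + (1.0 - s) / N + c / mu);
}

int main(void) {
    for (int N = 1; N <= 64; N *= 2)
        printf("N=%2d  S_1=%6.2f  S_4=%6.2f\n", N, S(N, 1.0), S(N, 4.0));
    return 0;  /* S_4 > S_1 for every N with c(N) > 0: the slow code "scales better" */
}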


Page 12

Stunt 2: Slow computing

Corollaries:

1. Do not use high compiler optimization levels or the latest compiler versions.

2. If scalability is still bad, parallelize some short loops with OpenMP. That way you can get some extra bonus for a scalable hybrid code.

If someone asks for time to solution, answer that if you had a bigger machine, you could get the solution as fast as you want. This is of course due to the superior scalability of your code.


Page 13

Stunt 2: Slow computing

“Slowness” has some surprises in store… Let's look at μ=2:

[Figure: run time bars (“calc” + “comm”) for the fast machine at N=4 (run time Tf) and the slow machine at N=8 (run time Ts).]

Ts < Tf? This happens if c′(N) < 0 (at s=0).

1/Tf < 2/Ts? This happens if μ·c(N) > c(μN) (at s=0).

What's the catch here?


Page 14

Stunt 2: Slow computing

Example for μ=4 and c(N) ~ N^(−2/3) at strong scaling:

[Figure: performance vs. # nodes for the fast and the slow machine.]

The performance is better with μN slow CPUs than with N fast CPUs.

“Slow computing” can effectively lessen the impact of communication overhead.

We assume that the network is the same in both machines.
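A throwaway sketch of the comparison plotted above, under the stated assumptions (μ=4, c(N) ~ N^(−2/3), s=0, same network on both machines); the prefactor in c(N) is made up.

/* Absolute performance (work/time, arbitrary units) at strong scaling, s = 0:
 *   N fast CPUs:     T_fast = 1/N + c(N)
 *   mu*N slow CPUs:  T_slow = mu*(1/(mu*N)) + c(mu*N) = 1/N + c(mu*N)   */
#include <stdio.h>
#include <math.h>

static double comm(double n) { return 0.2 * pow(n, -2.0 / 3.0); }  /* c(N) ~ N^(-2/3), prefactor assumed */

int main(void) {
    const double mu = 4.0;                      /* slowdown factor from the slide */
    for (int N = 1; N <= 64; N *= 2)
        printf("N=%2d  perf(N fast)=%6.2f  perf(%3.0f slow)=%6.2f\n",
               N, 1.0 / (1.0 / N + comm(N)),
               mu * N, 1.0 / (1.0 / N + comm(mu * N)));
    return 0;
}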


Page 15

Stunt 3 (The power of obfuscation, part I)

If scalability doesn’t look good enough, use a logarithmic scale to drive your point home.

Everything looks OK if you plot it the right way!

[Figure: the same speedup data plotted four ways (legend: “Speedup”, “Ideal”).]

1. Linear plot: bad scaling, strange things at N=32
2. Log-log plot: better scaling, but still the N=32 problem
3. Log-linear plot: N=32 problem gone
4. … and remove the ideal scaling line to make it perfect!


Page 16

Stunt 3: Log scale

[Figure: Top500 performance chart on a log scale. © Top500 ’08]

Page 17

Stunt 4

If you must report actual performance, quietly employ weak scaling to show off

It’s all in that bloody denominator…

S(N) = (s + (1−s)N^α) / (s + (1−s)N^(α−1) + c(N))

At α=1 the world looks so much nicer:

S(N) = (s + (1−s)N) / (1 + c(N))

… but keep in mind: Do not mention the term “weak scaling” or you will be asked nasty questions about parallel efficiency.


Page 18

Stunt 4: Weak scaling

But weak scaling gives us much more than just a “straight” graph. It gives us perfect scaling if we choose the right metric to look at!

Assumption: Weak scaling with parallel efficiency ε = S(N)/N << 1 and no other overhead

S(N) = s + (1−s)N has a small slope.

But: If we choose a metric for work that is applicable to the parallel part alone, work/time scales linearly.

So all you need to do is plot Mflop/s, MLUP/s, or anything that doesn't happen in the serial part, and you can even show real performance numbers! See also stunt #10.
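A toy model of this trick (the serial fraction and the per-worker flop count are assumed, not from the talk): the speedup grows with slope 1−s and the efficiency stalls near 1−s, yet a Mflop/s rate that counts only the parallel part is perfectly linear in N.

/* Weak scaling, no communication: normalized run time T(N) = s + (1-s) = 1,
 * speedup S(N) = s + (1-s)*N.  Flops counted in the parallel part only:
 * N*(1-s)*F, so the "parallel-only" rate is exactly linear in N. */
#include <stdio.h>

int main(void) {
    const double s = 0.5;      /* assumed serial fraction                   */
    const double F = 200.0;    /* assumed Mflop done by each worker's share */
    for (int N = 1; N <= 64; N *= 2) {
        double speedup  = s + (1.0 - s) * N;
        double eff      = speedup / N;          /* stalls near 1-s          */
        double par_rate = N * (1.0 - s) * F;    /* run time is 1, so rate = flops */
        printf("N=%2d  S=%6.1f  eff=%4.2f  parallel-only Mflop/s=%8.1f\n",
               N, speedup, eff, par_rate);
    }
    return 0;
}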


Page 19

Stunt 5 (The power of obfuscation, part II)

Instead of performance, plot absolute runtime vs. CPU count

Very, very popular indeed! Nobody will be able to tell whether your code actually scales.

[Figure: runtime vs. # CPUs, annotated “???”.]

Corollary: CPU time per core is even better because it omits most overheads…


Page 20

Stunt 6 (The power of obfuscation, part III)

Compare different systems by showing the log of parallel efficiency vs. CPU count

Unusual ways of putting data together surprise and confuse your audience.

Remember: Legends can be any size you like!

[Figure: parallel efficiency (log scale) vs. # nodes/CPUs, legend: “Cluster eff.”, “NEC eff.”.]


Page 21

Stunt 7

Emphasize the quality of your shiny accelerator code by comparing it with scalar, unoptimized code on a single core of an old standard CPU. And use GCC 2.7.2.

Anything else is a waste of time.

And besides, don’t the compiler guys always say that they’re “multi-core enabled”?

Corollary:

Use single precision on the GPU but double precision on the CPU. This will cut the effective bandwidths, cache size, and peak performance of the latter and let the former shine.


Page 22

Stunt 8

Always quote GFlops, MIPS, Watts per Flop, or any other irrelevant interesting metric instead of inverse time to solution.

Flops are so cool it hurts:

for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*(a[i-1][j]+a[i+1][j]+a[i][j-1]+a[i][j+1]);

for(i=0; i<N; ++i)
  for(j=0; j<N; ++j)
    b[i][j] = 0.25*a[i-1][j]+0.25*a[i+1][j]
             +0.25*a[i][j-1]+0.25*a[i][j+1];

“Floptimization”
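Both loop nests compute exactly the same result. Counting one flop per addition and one per multiplication (a common, if generous, convention), the first version does 4 flops per grid point and the “floptimized” one 7 – a 75% higher Mflop/s figure for identical work and, in all likelihood, identical run time.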

Watts/Flop are an ingenious fallback – who would dare question a truly “green” application/system? Except maybe some investors…


Page 23

Stunt 9

Ignore affinity and topology issues. Real scientists are not bothered by such details.

Multi-core, cache groups, ccNUMA, SMT, network hierarchies etc. are just parts of a vicious plot to take the fun out of computing. Ignoring those issues will make them go away. If people ask specific questions about it, answer that it's the OS's or the compiler's job.

[Figure: a two-socket multi-core node, annotated with the issues to ignore: shared cache re-use, OpenMP overhead, bandwidth contention, OS buffer cache, intra-node MPI, ccNUMA page placement.]


Page 24

Stunt 9: Affinity issues

Re-using shared cache on multi-core CPUs? More cores mean more performance, do they not?

[Figure: two cores (core0, core1) with a shared cache working on the arrays x(:,:,:) and tmp(:,:,3) through memory; y- and z-directions indicated.]


Page 25

Stunt 9: Affinity issues

Memory bandwidth saturation? ccNUMA effects? Shouldn’t the OS put the threads and pages where they are supposed to be?

[Figure: parallel STREAM performance.]


Page 26

Stunt 9: Affinity issues

Intra-node MPI is infinitely fast! Look at those latencies!

[Figure: MPI intra-node and inter-node latencies on Cray XT5]

inter-node:   7.4 µs
inter-socket: 0.63 µs
intra-socket: 0.49 µs


Page 27

Stunt 9: Affinity issues

Intra-node MPI is infinitely fast! Low-level benchmarking is unreliable!

[Figure: intra-node vs. inter-node MPI performance – between two cores of one socket (shared cache advantage), between two sockets of one node (cache effects eliminated), and between two nodes via the interconnect fabric.]


Page 28

Stunt 9: Affinity issues

Why should you reverse engineer the overcomplicated cache topology of those modern systems?

Xeon E5420, 2 threads        shared L2   same socket   different socket
pthreads_barrier_wait             5863         27032              27647
omp barrier (icc 11.0)             576           760               1269
Spin loop                          259           485              11602

Nehalem, 2 threads      shared SMT threads    shared L3   different socket
pthreads_barrier_wait                23352         4796              49237
omp barrier (icc 11.0)                2761          479               1206
Spin loop                            17388          267                787


Page 29

Stunt 9: Affinity – if you still insist…

Command line tools for Linux:
easy to install
works with standard Linux 2.6 kernel
simple and clear to use
support for Intel and AMD CPUs

Current tools (see the usage sketch below):
likwid-topology: Print thread and cache topology
likwid-perfCtr: Measure performance counters
likwid-features: View and enable/disable hardware prefetchers
likwid-pin: Pin threaded application without touching code

Open source project (GPL v2): http://code.google.com/p/likwid/
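A minimal usage sketch (the binary name ./a.out is just a placeholder, and option syntax may differ between likwid versions):

likwid-topology -g           # print the thread/cache topology as an ASCII graph
likwid-pin -c 0-3 ./a.out    # run the threaded binary with its threads pinned to cores 0-3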


Page 30

Stunt 10

If you really can’t reduce communication overhead, argue in favor of “reliable inefficiency.”

Even if you spend 80% of time communicating, that’s ok if the ratio stays constant – it means you can scale to any size! And fill any machine.

[Figure: fraction of runtime spent in calculation vs. communication over # nodes/CPUs – efficiency constant for large N.]


Page 31

Stunt 11 (The power of obfuscation, part IV)

Performance modeling is for wimps. Show real data. Plenty. And then some.

Don’t try to make sense of your data by fitting it to a model. Instead, show at least 8 graphs per plot, all in bright pastel colors, with different symbols.

If nasty questions pop up, say your code is so complex that no model can describe it.

[Figure: performance vs. # nodes/CPUs for Machine 1 through Machine 8.]


Page 32

Stunt 12

If they get you cornered, blame it all on OS jitter.

They will understand and nod knowingly.

Corollary: Depending on the audience, TLB misses may work just as well.

Page 33

Stunt 13

If all else fails, show pretty pictures and animated videos, and don’t talk about performance.

In four decades of supercomputing, this was always the best-selling plan, and it will stay that way forever.


Page 34

THANK YOU