LAWRENCE BERKELEY NATIONAL LABORATORY

FUTURE TECHNOLOGIES GROUP

1

The Roofline Model

Samuel Williams

Lawrence Berkeley National Laboratory

[email protected]



2

Outline

• Challenges / Goals
• Fundamentals
• Roofline Performance Model
• Example: Heat Equation
• Example: SpMV
• Alternate Roofline Formulations
• Summary



3

Challenges / Goals



4

Four Architectures

[Figure: block diagrams of the four architectures]

• AMD Barcelona: two sockets of four Opteron cores, each core with a 512KB victim cache; 2MB shared quasi-victim cache (32 way); SRI / crossbar; 2x64b memory controllers at 10.66 GB/s to 667MHz DDR2 DIMMs; HyperTransport at 4GB/s (each direction).
• Sun Victoria Falls: two sockets of eight MT SPARC cores; crossbar at 179 GB/s and 90 GB/s; 4MB shared L2 (16 way, 64b interleaved); 4 coherency hubs; 2x128b controllers at 21.33 GB/s and 10.66 GB/s to 667MHz FBDIMMs; interconnect of 8 x 6.4 GB/s (1 per hub per direction).
• NVIDIA G80: eight thread clusters; 192KB L2 (textures only); 24 ROPs; 6 x 64b memory controllers at 86.4 GB/s to 768MB of 900MHz GDDR3 device DRAM.
• IBM Cell Blade: two sockets, each with a VMT PPE (512K L2) and eight SPEs (256K and an MFC each) on the EIB (ring network); XDR memory controllers at 25.6 GB/s to 512MB XDR DRAM; BIF at <20 GB/s (each direction).



5

Challenges / Goals

• We have extremely varied architectures.
• Moreover, the characteristics of numerical methods can vary dramatically.
• The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.
• We wish to understand whether or not we’ve attained good performance (a high fraction of a theoretical peak).
• We wish to identify performance bottlenecks and enumerate potential remediation strategies.



6

Fundamentals



7

Little’s Law

Little’s Law:

Concurrency = Latency * Bandwidth

- or -

Effective Throughput = Expressed Concurrency / Latency

• Bandwidth: conventional memory bandwidth; #floating-point units
• Latency: memory latency; functional unit latency
• Concurrency: bytes expressed to the memory subsystem; concurrent (parallel) memory operations

For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little’s law states that we must express 8-way ILP to fully utilize the machine.
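The two forms of Little’s Law above can be written as a pair of one-line functions. This is only a sketch: the function names are hypothetical, and the caller is assumed to keep units consistent (cycles with operations per cycle, or seconds with bytes per second).

```c
/* Little's Law: concurrency = latency * bandwidth.
 * Units must be consistent, e.g. cycles * (operations per cycle)
 * or seconds * (bytes per second). Names are illustrative only. */
double littles_law_concurrency(double latency, double bandwidth)
{
    return latency * bandwidth;
}

/* Rearranged form: effective throughput = expressed concurrency / latency. */
double effective_throughput(double concurrency, double latency)
{
    return concurrency / latency;
}
```

With a 4-cycle latency and 2 FPUs (2 operations per cycle), `littles_law_concurrency(4.0, 2.0)` gives the 8-way ILP quoted above. Conversely, expressing only 4 independent operations yields `effective_throughput(4.0, 4.0)`, that is 1 operation per cycle, half of what the two FPUs could sustain.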



8

Little’s Law Examples

Applied to FPUs

• Consider a CPU with 2 FPUs, each with a 4-cycle latency.
• Little’s law states that we must express 8-way ILP to fully utilize the machine.
• Solution: unroll/jam the code to express 8 independent FP operations.
• Note, simply unrolling dependent operations (e.g. reduction) does not increase ILP. It simply amortizes loop overhead.

Applied to Memory

• Consider a CPU with 20GB/s of bandwidth and 100ns memory latency.
• Little’s law states that we must express 2KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.
• On today’s superscalar processors, hardware stream prefetchers speculatively load consecutive elements.
• Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.
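The note on reductions can be made concrete with a small sketch. The first function is a dependent reduction, where unrolling alone would not help; the second keeps four independent partial sums, so four adds can be in flight at once. The function names and the choice of four accumulators are assumptions made for brevity; the 2-FPU, 4-cycle machine above would need eight. Reassociating the sum can also change floating-point rounding slightly.

```c
#include <stddef.h>

/* Dependent reduction: every add waits on the previous one, so simply
 * unrolling this loop would only amortize loop overhead. */
double sum_dependent(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent partial sums form four separate dependency chains,
 * which exposes 4-way ILP to the floating-point pipeline. */
double sum_4way(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* remainder when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions also read `a` with a single unit-stride stream, which is the access pattern the memory column asks for.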



9

Three Classes of Locality

• Temporal Locality
  • Reusing data (either registers or cache lines) multiple times amortizes the impact of limited bandwidth.
  • Transform loops or algorithms to maximize reuse.
• Spatial Locality
  • Data is transferred from cache to registers in words.
  • However, data is transferred to the cache in 64-128Byte lines.
  • Using every word in a line maximizes spatial locality.
  • Transform data structures into structure of arrays (SoA) layout.
• Sequential Locality
  • Many memory address patterns access cache lines sequentially.
  • CPUs’ hardware stream prefetchers exploit this observation to speculatively load data and hide memory latency.
  • Transform loops to generate (a few) long, unit-stride accesses.
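As an illustration of the spatial-locality advice, the sketch below contrasts an array-of-structures layout with a structure-of-arrays layout for a kernel that reads only one field. The pixel types and function names are hypothetical, chosen only for this example.

```c
#include <stddef.h>

/* Array of structures (AoS): the three fields of one pixel are adjacent. */
struct pixel_aos { double red, green, blue; };

/* Structure of arrays (SoA): each field is its own contiguous array. */
struct image_soa {
    double *red;
    double *green;
    double *blue;
    size_t  n;
};

/* Pulls every cache line of the image from memory but uses only about
 * one third of the bytes in each line. */
double sum_red_aos(const struct pixel_aos *p, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].red;
    return s;
}

/* Streams through the red array only, with unit stride, so every word
 * of every transferred line is used. */
double sum_red_soa(const struct image_soa *img)
{
    double s = 0.0;
    for (size_t i = 0; i < img->n; i++)
        s += img->red[i];
    return s;
}
```

Both functions return the same value; the difference is in how many bytes must cross the memory interface to produce it.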



10

Arithmetic Intensity

• True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
• Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality).
• Others have constant intensity.
• Arithmetic intensity is ultimately limited by compulsory traffic.
• Arithmetic intensity is diminished by conflict or capacity misses.

[Figure: kernels placed along an arithmetic intensity axis. O( 1 ): SpMV, BLAS1,2; Stencils (PDEs); Lattice Methods. O( log(N) ): FFTs. O( N ): Dense Linear Algebra (BLAS3); Particle Methods.]



NUMA

11

• Recent multicore SMPs have integrated the memory controllers on chip.
• As a result, memory access is non-uniform (NUMA).
• That is, the bandwidth to read a given address varies dramatically between cores.
• Exploit NUMA (affinity+first touch) when you malloc/init data.
• Concept is similar to data decomposition for distributed memory.
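A minimal sketch of the first-touch idea follows. It assumes an operating system that places each page on the NUMA node of the thread that first writes it, threads pinned to cores, and OpenMP for threading; none of these are specified in the slides, and the function names are hypothetical. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```c
#include <stdlib.h>

/* Allocate n doubles and initialize them in parallel.
 * ASSUMPTION: first-touch page placement and pinned threads, so each
 * thread's share of the array lands in its local memory. */
double *alloc_first_touch(size_t n, double value)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = value;            /* first write decides where the page lives */
    return a;
}

/* Compute loop using the same static schedule as the initialization,
 * so each thread mostly reads pages it touched first. */
double sum_local(const double *a, size_t n)
{
    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (long i = 0; i < (long)n; i++)
        s += a[i];
    return s;
}
```

The design point is that initialization and computation share one decomposition, much like assigning subdomains to ranks in a distributed-memory code.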






13

Roofline Model



14

Overlap of Communication

• Consider a simple example in which a FP kernel maintains a working set in DRAM.
• We assume we can perfectly overlap computation with communication (or vice versa), either through prefetching/DMA and/or pipelining (decoupling of communication and computation).
• Thus, time is the maximum of the time required to transfer the data and the time required to perform the floating point operations.

[Figure: timeline in which the transfer interval (Bytes / STREAM Bandwidth) and the compute interval (Flops / Flop/s) overlap.]



Roofline Model: Basic Concept

15

• Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.

$$
\text{Attainable Performance}_{ij} = \min
\begin{cases}
\text{FLOP/s with Optimizations}_{1..i} \\
\text{AI} \times \text{Bandwidth with Optimizations}_{1..j}
\end{cases}
$$

  where optimization i can be SIMDize, or unroll, or SW prefetch, …

• Given a kernel’s arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
• Moreover, it provides insights as to which optimizations will potentially be beneficial.
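The bound above is a single `min`, sketched here as a C function. The function and parameter names are hypothetical; the caller passes whichever in-core ceiling and bandwidth ceiling correspond to the optimizations applied so far.

```c
/* Attainable GFLOP/s for a kernel with arithmetic intensity `ai`
 * (flops per DRAM byte), given an in-core ceiling in GFLOP/s and a
 * bandwidth ceiling in GB/s. */
double roofline_gflops(double incore_gflops, double bandwidth_gbs, double ai)
{
    double memory_bound = ai * bandwidth_gbs;
    return memory_bound < incore_gflops ? memory_bound : incore_gflops;
}

/* Arithmetic intensity at which the two bounds meet (the "ridge"). */
double ridge_point(double incore_gflops, double bandwidth_gbs)
{
    return incore_gflops / bandwidth_gbs;
}
```

For a made-up machine with a 64 GFLOP/s ceiling and 16 GB/s of bandwidth, a kernel at AI = 0.5 is bounded at 8 GFLOP/s by memory, while any kernel at or above AI = 4 is bounded by the in-core ceiling.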



16

Example

Consider the Opteron 2356:

• Dual Socket (NUMA)
• limited HW stream prefetchers
• quad-core (8 total)
• 2.3GHz
• 2-way SIMD (DP)
• separate FPMUL and FPADD datapaths
• 4-cycle FP latency

Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like?



Roofline Model: Basic Concept

17

• Plot on log-log scale
• Given AI, we can easily bound performance
• But architectures are much more complicated
• We will bound performance as we eliminate specific forms of in-core parallelism

[Figure: roofline for the Opteron 2356 (Barcelona) on log-log axes: attainable GFLOP/s (0.5 to 256) versus actual FLOP:Byte ratio (1/8 to 16), bounded by the peak DP and Stream Bandwidth lines.]



Roofline Model: computational ceilings

18

• Opterons have dedicated multipliers and adders.
• If the code is dominated by adds, then attainable performance is half of peak.
• We call these Ceilings
• They act like constraints on performance

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with peak DP and Stream Bandwidth, plus a "mul / add imbalance" ceiling below peak DP.]




19

• Opterons have 128-bit datapaths.
• If instructions aren’t SIMDized, attainable performance will be halved.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with ceilings peak DP, mul / add imbalance, and w/out SIMD, bounded on the left by Stream Bandwidth.]




20

• On Opterons, floating-point instructions have a 4 cycle latency.
• If we don’t express 4-way ILP, performance will drop by as much as 4x.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with ceilings peak DP, mul / add imbalance, w/out SIMD, and w/out ILP, bounded on the left by Stream Bandwidth.]
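The in-core ceilings can be estimated from the machine parameters listed for the Opteron 2356. The sketch below assumes the ceilings stack cumulatively in the order the slides introduce them (mul/add imbalance, then no SIMD, then no ILP); that stacking, the struct, and the function name are assumptions for illustration, not measured values.

```c
/* In-core ceilings in GFLOP/s, from highest to lowest. */
struct ceilings {
    double peak;        /* all in-core parallelism expressed */
    double imbalance;   /* only one of the mul/add datapaths busy */
    double no_simd;     /* additionally, scalar instead of SIMD */
    double no_ilp;      /* additionally, one FP op in flight at a time */
};

/* First-principles estimate. fp_pipes counts the separate multiply and
 * add datapaths; fp_latency is in cycles. */
struct ceilings incore_ceilings(int cores, double ghz, int simd_width,
                                int fp_pipes, int fp_latency)
{
    struct ceilings c;
    c.peak      = cores * ghz * simd_width * fp_pipes;
    c.imbalance = c.peak / fp_pipes;
    c.no_simd   = c.imbalance / simd_width;
    c.no_ilp    = c.no_simd / fp_latency;
    return c;
}
```

Calling `incore_ceilings(8, 2.3, 2, 2, 4)` with the parameters above gives roughly 73.6, 36.8, 18.4, and 4.6 GFLOP/s.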



Roofline Model: communication ceilings

21

• We can perform a similar exercise taking away parallelism from the memory subsystem.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) showing only peak DP and Stream Bandwidth.]




22

• Explicit software prefetch instructions are required to achieve peak bandwidth.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with peak DP, Stream Bandwidth, and a lower "w/out SW prefetch" bandwidth ceiling.]




23

• Opterons are NUMA.
• As such, memory traffic must be correctly balanced among the two sockets to achieve good Stream bandwidth.
• We could continue this by examining strided or random memory access patterns.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with peak DP and bandwidth ceilings Stream Bandwidth, w/out SW prefetch, and w/out NUMA.]



Roofline Model: computation + communication ceilings

24

• We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

[Figure: Opteron roofline (attainable GFLOP/s versus FLOP:Byte ratio) with in-core ceilings peak DP, mul / add imbalance, w/out SIMD, and w/out ILP, and bandwidth ceilings Stream Bandwidth, w/out SW prefetch, and w/out NUMA.]



Roofline Model: locality walls

25

• Remember, memory traffic includes more than just compulsory misses.
• As such, actual arithmetic intensity may be substantially lower.
• Walls are unique to the architecture-kernel combination.

[Figure: roofline with all in-core and bandwidth ceilings, plus a vertical locality wall labeled "only compulsory miss traffic".]

$$
\text{AI} = \frac{\text{FLOPs}}{\text{Compulsory Misses}}
$$



26

Cache Behavior

• Knowledge of the underlying cache operation can be critical.
• For example, caches are organized into lines. Lines are organized into sets & ways (associativity).
  • Thus, we must mimic the effect of Mark Hill’s 3C’s of caches.
  • Impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent.
  • Ultimately they reduce the actual flop:byte ratio.
• Moreover, many caches are write allocate.
  • A write-allocate cache reads in an entire cache line upon a write miss.
  • If the application ultimately overwrites that line, the read was superfluous (further reduces flop:byte ratio).
• Because programs access data in words, but hardware transfers it in 64 or 128B cache lines, spatial locality is key.
  • Array-of-structure data layouts can lead to dramatically lower flop:byte ratios.
  • e.g. if a program only operates on the “red” field of a pixel, bandwidth is wasted.




27





[Figure: roofline with all in-core and bandwidth ceilings, and locality walls labeled "only compulsory miss traffic" and "+write allocation traffic".]

$$
\text{AI} = \frac{\text{FLOPs}}{\text{Allocations} + \text{Compulsory Misses}}
$$




28





[Figure: roofline with all in-core and bandwidth ceilings, and locality walls labeled "only compulsory miss traffic", "+write allocation traffic", and "+capacity miss traffic".]

$$
\text{AI} = \frac{\text{FLOPs}}{\text{Capacity} + \text{Allocations} + \text{Compulsory}}
$$




29





[Figure: roofline with all in-core and bandwidth ceilings, and locality walls labeled "only compulsory miss traffic", "+write allocation traffic", "+capacity miss traffic", and "+conflict miss traffic".]

$$
\text{AI} = \frac{\text{FLOPs}}{\text{Conflict} + \text{Capacity} + \text{Allocations} + \text{Compulsory}}
$$
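The locality-wall formula can be sketched as a function of the individual traffic components. The struct and function names below are hypothetical; in practice the byte counts would have to come from analysis or measurement of a specific kernel on a specific machine.

```c
/* DRAM traffic in bytes, split by cause. */
struct dram_traffic {
    double compulsory;  /* data that must be moved at least once */
    double allocation;  /* line fills caused by write allocation */
    double capacity;    /* refetches because the working set does not fit */
    double conflict;    /* refetches because of limited associativity */
};

/* Actual arithmetic intensity: flops divided by all DRAM bytes moved. */
double actual_ai(double flops, struct dram_traffic t)
{
    return flops /
           (t.compulsory + t.allocation + t.capacity + t.conflict);
}
```

For example, a kernel doing 8 flops per 16 compulsory bytes has AI = 0.5; adding 8 bytes of write-allocate traffic moves the wall left to about 0.33.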




30

• Optimizations remove these walls and ceilings which act to constrain performance.

[Figure: roofline with all in-core ceilings (peak DP, mul / add imbalance, w/out SIMD, w/out ILP), bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA), and locality walls (only compulsory miss traffic, +write allocation traffic, +capacity miss traffic, +conflict miss traffic).]



31

Instruction Issue Bandwidth

On a superscalar processor, there is likely ample instruction issue bandwidth.

This allows loads, integer, and FP instructions to be issued simultaneously.

As such, we assumed that expression of parallelism was the underlying challenge for in-core.

However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.



Roofline Model: Instruction Mix

32

• As the instruction mix shifts away from floating-point, finite issue bandwidth begins to limit in-core performance.
• On Niagara2, with dual issue units but only 1 FPU, FP instructions must constitute ≥50% of the mix to attain peak performance.
• A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges.

[Figure: roofline for the UltraSparc T2+ T5140 (Niagara2): attainable GFLOP/s (0.5 to 256) versus FLOP:Byte ratio (1/8 to 16). In-core ceilings: peak DP (≥50% FP), 25% FP, 12% FP. Bandwidth ceilings: Stream Bandwidth, w/out SW prefetch, w/out NUMA.]
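One way to read the instruction-mix ceilings is as a simple issue-slot model, sketched below. It assumes that issue slots are the only in-core limit and that every FP instruction occupies one slot; this model and the function name are illustrative assumptions rather than something stated on the slide.

```c
/* Fraction of peak FP throughput attainable when only `fp_fraction` of
 * the issued instructions are floating-point.
 * ASSUMPTION: the core issues `issue_width` instructions per cycle and
 * has `fpus` floating-point units, each accepting one FP op per cycle. */
double fp_issue_fraction(double fp_fraction, int issue_width, int fpus)
{
    double f = fp_fraction * issue_width / fpus;
    return f < 1.0 ? f : 1.0;
}
```

With dual issue and one FPU, a 50% FP mix reaches the full peak, a 25% mix reaches half of it, and a 12.5% mix a quarter, which is consistent with the ordering of the ceilings in the figure.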



33

Optimization Categorization

• Maximizing (attained) In-core Performance
• Minimizing (total) Memory Traffic
• Maximizing (attained) Memory Bandwidth



34


Maximizing In-core Performance

• Exploit in-core parallelism (ILP, DLP, etc…)
• Good (enough) floating-point balance
• Optimizations: unroll & jam, explicit SIMD, reorder, eliminate branches

Maximizing Memory Bandwidth

• Exploit NUMA
• Hide memory latency
• Satisfy Little’s Law
• Optimizations: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking

Minimizing Memory Traffic

• Eliminate:
  • Capacity misses
  • Conflict misses
  • Compulsory misses
  • Write allocate behavior
• Optimizations: cache blocking, array padding, compress data, streaming stores



39

Examples



40

Multicore SMPs Used

[Figure: block diagrams of the four multicore SMPs used in the examples]

• AMD Opteron 2356 (Barcelona): two sockets of four Opteron cores with 512KB victim caches; SRI / crossbar; 10.66 GB/s to 667MHz DDR2 DIMMs; HyperTransport at 4GB/s (each direction).
• Intel Xeon E5345 (Clovertown): two sockets of four cores; four 4MB shared L2 caches; two FSBs at 10.66 GB/s each; chipset (4x64b controllers) at 21.33 GB/s (read) and 10.66 GB/s (write) to 667MHz FBDIMMs.
• Sun T2+ T5140 (Victoria Falls): two sockets of eight MT SPARC cores; crossbar at 179 GB/s and 90 GB/s; 4 coherency hubs; 2x128b controllers at 21.33 GB/s and 10.66 GB/s to 667MHz FBDIMMs; 8 x 6.4 GB/s (1 per hub per direction).
• IBM QS20 Cell Blade: two sockets, each with a VMT PPE (512K L2) and eight SPEs (256K and an MFC each) on the EIB (ring network); 25.6 GB/s to 512MB XDR DRAM; BIF at <20 GB/s (each direction).



41

Heat Equation



7-point Stencil

42

[Figure: a PDE grid with +X, +Y, and +Z axes, and the stencil for the heat equation PDE: the point x,y,z and its six neighbors x-1, x+1, y-1, y+1, z-1, z+1.]

• Simplest derivation of the Laplacian operator results in a constant coefficient 7-point stencil:

  for all x,y,z:
    u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

• Clearly each stencil performs:
  • 8 floating-point operations
  • 8 memory references, all but 2 of which should be filtered by an ideal cache
  • 6 memory streams, all but 2 of which should be filtered (less than # HW prefetchers)
• Ideally, AI=0.5. However, write-allocate bounds it to 0.33.
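The stencil above might be written in C roughly as follows. This is a straightforward, untuned sketch: the array layout (x as the unit-stride dimension), the `idx` helper, and the decision to leave boundary points untouched are assumptions made for this example.

```c
#include <stddef.h>

/* Linear index into an nx*ny*nz grid; x is the unit-stride dimension. */
static size_t idx(int x, int y, int z, int nx, int ny)
{
    return ((size_t)z * ny + y) * nx + x;
}

/* One time step of the constant-coefficient 7-point stencil over the
 * interior points. Per point: 2 multiplies and 6 adds (8 flops),
 * 7 reads and 1 write (8 memory references). */
void stencil7(const double *u, double *unew,
              int nx, int ny, int nz, double alpha, double beta)
{
    for (int z = 1; z < nz - 1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++)
                unew[idx(x, y, z, nx, ny)] =
                    alpha * u[idx(x, y, z, nx, ny)] +
                    beta * (u[idx(x, y, z - 1, nx, ny)] +
                            u[idx(x, y - 1, z, nx, ny)] +
                            u[idx(x - 1, y, z, nx, ny)] +
                            u[idx(x + 1, y, z, nx, ny)] +
                            u[idx(x, y + 1, z, nx, ny)] +
                            u[idx(x, y, z + 1, nx, ny)]);
}
```

With an ideal cache, each point costs one 8-byte read of `u` and one 8-byte write of `unew`, giving 8 flops / 16 bytes = 0.5. A write-allocate cache also reads the `unew` line before it is overwritten, giving 8 flops / 24 bytes = 0.33.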



43

Roofline model for Stencil (out-of-the-box code)

• Large datasets
• 2 unit stride streams
• No NUMA
• Little ILP
• No DLP
• Far more adds than multiplies (imbalance)
• Ideal flop:byte ratio 1/3
• High locality requirements
• Capacity and conflict misses will severely impair flop:byte ratio

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:DRAM byte ratio (1/16 to 8), for Clovertown, Barcelona, Victoria Falls, and Cell Blade. Ceilings: Clovertown has peak DP, w/out SIMD, w/out ILP, mul/add imbalance, and "fits within snoop filter"; Barcelona has peak DP, w/out SIMD, w/out ILP, mul/add imbalance, w/out SW prefetch, and w/out NUMA; Victoria Falls has peak DP, 25% FP, 12% FP, w/out SW prefetch, and w/out NUMA; Cell Blade has peak DP, w/out FMA, w/out ILP, w/out SIMD, w/out NUMA, and unaligned DMA. The Cell Blade panel is marked "No naïve Cell implementation".]







45

Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, …)

• Cache blocking helps ensure flop:byte ratio is as close as possible to 1/3
• Clovertown has huge caches but is pinned to lower BW ceiling
• Cache management is essential when capacity/thread is low

[Figure: roofline plots of attainable Gflop/s (1 to 128) versus flop:DRAM byte ratio (1/16 to 8) for Clovertown, Barcelona, Victoria Falls, and Cell Blade, with each machine’s in-core and bandwidth ceilings.]
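A hedged sketch of cache blocking for the 7-point stencil is shown below. It blocks only the y dimension, so that the planes of `u` reused across consecutive z iterations are narrower and more likely to stay in cache. The block size `by`, the layout, and the function names are assumptions for illustration; this is not the tuned code behind the measurements in the talk.

```c
#include <stddef.h>

/* Linear index into an nx*ny*nz grid; x is the unit-stride dimension. */
static size_t idx(int x, int y, int z, int nx, int ny)
{
    return ((size_t)z * ny + y) * nx + x;
}

/* 7-point stencil sweep, blocked in y. For each block of `by` rows the
 * sweep walks all z planes, so only nx*by points per plane need to stay
 * resident between consecutive z iterations. */
void stencil7_blocked(const double *u, double *unew,
                      int nx, int ny, int nz,
                      double alpha, double beta, int by)
{
    if (by < 1)
        by = 1;                               /* guard against a bad block size */
    for (int ylo = 1; ylo < ny - 1; ylo += by) {
        int yhi = (ylo + by < ny - 1) ? ylo + by : ny - 1;
        for (int z = 1; z < nz - 1; z++)
            for (int y = ylo; y < yhi; y++)
                for (int x = 1; x < nx - 1; x++)
                    unew[idx(x, y, z, nx, ny)] =
                        alpha * u[idx(x, y, z, nx, ny)] +
                        beta * (u[idx(x, y, z - 1, nx, ny)] +
                                u[idx(x, y - 1, z, nx, ny)] +
                                u[idx(x - 1, y, z, nx, ny)] +
                                u[idx(x + 1, y, z, nx, ny)] +
                                u[idx(x, y + 1, z, nx, ny)] +
                                u[idx(x, y, z + 1, nx, ny)]);
    }
}
```

The x loop is left whole to keep long unit-stride streams for the hardware prefetchers. A suitable `by` depends on the cache capacity available per thread and would normally be found by tuning.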




46

Roofline model for Stencil (SIMDization + cache bypass)

• Make SIMDization explicit
  • Technically, this swaps ILP and SIMD ceilings
• Use cache bypass instruction: movntpd
  • Increases flop:byte ratio to ~0.5 on x86/Cell

[Figure: roofline plots of attainable Gflop/s (1 to 128) versus flop:DRAM byte ratio (1/16 to 8) for Clovertown, Barcelona, Victoria Falls, and Cell Blade, with each machine’s in-core and bandwidth ceilings.]



47

SpMV



48

Sparse Matrix-Vector Multiplication

• What’s a Sparse Matrix?
  • Most entries are 0.0
  • Performance advantage in only storing/operating on the nonzeros
  • Requires significant meta data to reconstruct the matrix structure
• What’s SpMV?
  • Evaluate y=Ax
  • A is a sparse matrix, x & y are dense vectors
• Challenges
  • Very low arithmetic intensity (often <0.166 flops/byte)
  • Difficult to exploit ILP (bad for pipelined or superscalar)
  • Difficult to exploit DLP (bad for SIMD)

[Figure: (a) algebra conceptualization of y = Ax; (b) CSR data structure with the arrays A.val[ ], A.rowStart[ ], and A.col[ ].]

(c) CSR reference code

    for (r=0; r<A.rows; r++) {
      double y0 = 0.0;
      for (i=A.rowStart[r]; i<A.rowStart[r+1]; i++){
        y0 += A.val[i] * x[A.col[i]];
      }
      y[r] = y0;
    }
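For reference, the CSR kernel above can be wrapped into a self-contained function. The struct mirrors the field names in the slide’s reference code; the exact types (`int` indices, `double` values) and the helper for the intensity bound are assumptions.

```c
/* CSR matrix; field names follow the reference code, types are assumed. */
struct csr {
    int           rows;
    const int    *rowStart;  /* rows + 1 entries; row r spans [rowStart[r], rowStart[r+1]) */
    const int    *col;       /* column index of each nonzero */
    const double *val;       /* value of each nonzero */
};

/* y = A*x for a CSR matrix A and dense vectors x and y. */
void spmv_csr(const struct csr *A, const double *x, double *y)
{
    for (int r = 0; r < A->rows; r++) {
        double y0 = 0.0;
        for (int i = A->rowStart[r]; i < A->rowStart[r + 1]; i++)
            y0 += A->val[i] * x[A->col[i]];
        y[r] = y0;
    }
}

/* Upper bound on flop:byte, ASSUMING 8-byte values and 4-byte column
 * indices and ignoring all vector and row-pointer traffic:
 * 2 flops per nonzero over 12 bytes per nonzero. */
double spmv_ai_bound(void)
{
    return 2.0 / (8.0 + 4.0);
}
```

Under those assumptions the bound is 2/12, about 0.166, which is consistent with the "often <0.166 flops/byte" figure quoted above; traffic for x, y, and A.rowStart only lowers it further.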



49

Roofline model for SpMV

• Double precision roofline models
• In-core optimizations 1..i
• DRAM optimizations 1..j
• FMA is inherent in SpMV (place at bottom)

$$
\text{GFlops}_{i,j}(\text{AI}) = \min
\begin{cases}
\text{InCoreGFlops}_i \\
\text{StreamBW}_j \times \text{AI}
\end{cases}
$$

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:byte ratio (1/16 to 8); panels include IBM QS20 Cell Blade, Intel Xeon E5345 (Clovertown), and Sun T2+ T5140 (Victoria Falls). Ceilings shown include peak DP, w/out SIMD, w/out ILP, w/out FMA, mul/add imbalance, 25% FP, 12% FP, w/out SW prefetch, w/out NUMA, bank conflicts, and "dataset fits in snoop filter".]



50

Roofline model for SpMV (overlay arithmetic intensity)

• Two unit stride streams
• Inherent FMA
• No ILP
• No DLP
• FP is 12-25%
• Naïve compulsory flop:byte < 0.166

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:byte ratio (1/16 to 8), with each machine’s in-core and bandwidth ceilings. The IBM QS20 Cell Blade panel is marked "No naïve SPE implementation".]



51

Roofline model for SpMV (out-of-the-box parallel)

• Two unit stride streams
• Inherent FMA
• No ILP
• No DLP
• FP is 12-25%
• Naïve compulsory flop:byte < 0.166
• For simplicity: dense matrix in sparse format

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:byte ratio (1/16 to 8), including the IBM QS20 Cell Blade, with each machine’s in-core and bandwidth ceilings.]



52

Roofline model for SpMV (NUMA & SW prefetch)

• compulsory flop:byte ~ 0.166
• utilize all memory channels

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:byte ratio (1/16 to 8), including the IBM QS20 Cell Blade, with each machine’s in-core and bandwidth ceilings.]



53

Roofline model for SpMV (matrix compression)

• Inherent FMA
• Register blocking improves ILP, DLP, flop:byte ratio, and FP% of instructions

[Figure: four roofline plots of attainable Gflop/s (1 to 128) versus flop:byte ratio (1/16 to 8), including the IBM QS20 Cell Blade, with each machine’s in-core and bandwidth ceilings.]






54

Various Kernels

We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.

We observe that for most, performance is highly correlated with DRAM bandwidth – particularly on the GPU.

Note, GTC has a strong scatter/gather component that skews STREAM-based rooflines.

[Figure: two roofline plots with a vertical axis from 2 to 1024 and a horizontal axis from 1/32 to 32, for the Xeon X5550 (Nehalem) and the NVIDIA C2050 (Fermi). Ceilings: single-precision peak, double-precision peak, DP add-only, and STREAM bandwidth (Nehalem) or device bandwidth (Fermi). Kernels plotted: RTM/wave eqn., 7pt Stencil, 27pt Stencil, SpMV, DGEMM, GTC/chargei, and GTC/pushi.]



55

Alternate Rooflines



56

No overlap of communication and computation

• Previously, we assumed perfect overlap of communication and computation.
• What happens if there is a dependency (either inherent or caused by a lack of optimization) that serializes communication and computation?

[Figure: two timelines. With overlap, the Bytes / STREAM Bandwidth interval and the Flops / Flop/s interval run concurrently; without overlap, they run back to back.]

• Time is then the sum of communication time and computation time.
• The result is that flop/s approaches peak only asymptotically as arithmetic intensity grows.



57

No overlap of communication and computation

• Consider a generic machine.
• If we can perfectly decouple and overlap communication with computation, the roofline is sharp/angular.
• However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2.
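The two timing models can be compared directly. The sketch below uses hypothetical function names and treats bandwidth and peak rate as fixed inputs; with perfect overlap time is the maximum of the two terms, and with no overlap it is their sum.

```c
/* Execution time with perfect overlap of communication and computation:
 * the longer of the two phases hides the shorter one. */
double time_with_overlap(double bytes, double flops,
                         double bytes_per_s, double flops_per_s)
{
    double t_comm = bytes / bytes_per_s;
    double t_comp = flops / flops_per_s;
    return t_comm > t_comp ? t_comm : t_comp;
}

/* Execution time when the two phases are fully serialized. */
double time_without_overlap(double bytes, double flops,
                            double bytes_per_s, double flops_per_s)
{
    return bytes / bytes_per_s + flops / flops_per_s;
}
```

The worst case is the point where both phases take equal time: moving 16e9 bytes at 16e9 B/s and doing 64e9 flops at 64e9 flop/s takes 1 second with overlap and 2 seconds without, the factor of 2 mentioned above. Far from that point the two models converge.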



58

Alternate Bandwidths

Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark)

STREAM is NOT a good proxy for short stanza/random cacheline access patterns as memory latency (instead of just bandwidth) is being exposed.

Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling)

Similarly, if data is primarily local in the LLC cache, one should construct rooflines based on LLC bandwidth and flop:LLC byte ratios.

For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.



59

Alternate Computations

• Arising from HPC kernels, it’s no surprise rooflines use DP Flop/s.
• Of course, one could use SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc…



60

Time-based roofline

In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).

We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flop’s

Additionally, we could change the horizontal axis from locality to some more appealing metric.



61

Empirical Roofline

Thus far, all in-core estimates have been based on first principles analysis of the underlying computer architecture (frequency, SIMD-width, latency, etc…)

Conceivably, one could design a series of compiled benchmarks that would extract the relevant roofline parameters.

Similarly, one could use performance counters to extract application characteristics so one could accurately determine application coordinates.



62

Questions?

Acknowledgments

Research supported by DOE Office of Science under contract number DE-AC02-05CH11231.



63

BACKUP SLIDES
