63
LAWRENCE BERKELEY NATIONAL LABORATORY F U T U R E T E C H N O L O G I E S G R O U P The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 [email protected]

L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 [email protected]

Embed Size (px)

Citation preview

Page 1: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

1

The Roofline Model

Samuel Williams

Lawrence Berkeley National Laboratory

[email protected]

Page 2: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

2

Outline

Challenges / Goals Fundamentals Roofline Performance Model Example: Heat Equation Example: SpMV Alternate Roofline Formulations Summary

Page 3: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

3

Challenges / Goals

Page 4: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

4

Four Architectures

667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllersH

yper

Tra

nspo

rtOpteron Opteron Opteron Opteron

667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers

Opteron Opteron Opteron Opteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)

SRI / crossbar

2MB Shared quasi-victim (32 way)

SRI / crossbarHyp

erT

rans

port

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs

2x128b controllers

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

Crossbar

179 GB/s 90 GB/s

667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs

2x128b controllers

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

Crossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

inte

rcon

nect

86.4 GB/s

768MB 900MHz GDDR3 Device DRAM

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

ThreadCluster

192KB L2 (Textures only)

24 ROPs

6 x 64b memory controllers

BIF

512MB XDR DRAM

25.6 GB/s

EIB (ring network)

XDR memory controllers

VMTPPE

512KL2

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C BIF

512MB XDR DRAM

25.6 GB/s

EIB (ring network)

XDR memory controllers

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

VMTPPE

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

Sun Victoria FallsAMD Barcelona

NVIDIA G80IBM Cell Blade

Page 5: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

5

Challenges / Goals

We have extremely varied architectures. Moreover, the characteristics of numerical methods can vary

dramatically.

The result is that performance and the benefit of optimization can vary significantly from one architecture x kernel combination to the next.

We wish to understand whether or not we’ve attained good performance (high fraction of a theoretical peak)

We wish to identify performance bottlenecks and enumerate potential remediation strategies.

Page 6: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

6

Fundamentals

Page 7: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

7

Little’s Law

Little’s Law:

Concurrency = Latency * Bandwidth

- or -

Effective Throughput = Expressed Concurrency / Latency

Bandwidth conventional memory bandwidth #floating-point units

Latency memory latency functional unit latency

Concurrency: bytes expressed to the memory subsystem concurrent (parallel) memory operations

For example, consider a CPU with 2 FPU’s each with a 4-cycle latency. Little’s law states that we must express 8-way ILP to fully utilize the machine.

Page 8: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

8

Little’s Law Examples

consider a CPU with 2 FPU’s each with a 4-cycle latency.

Little’s law states that we must express 8-way ILP to fully utilize the machine.

Solution: unroll/jam the code to express 8 independent FP operations.

Note, simply unrolling dependent operations (e.g. reduction) does not increase ILP. It simply amortizes loop overhead.

Applied to FPUsApplied to Memory consider a CPU with 20GB/s of

bandwidth and 100ns memory latency.

Little’s law states that we must express 2KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance

On today’s superscalar processors, hardware stream prefetchers speculatively load consecutive elements.

Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.

Page 9: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

9

Three Classes of Locality

Temporal Locality reusing data (either registers or cache lines) multiple times amortizes the impact of limited bandwidth. transform loops or algorithms to maximize reuse.

Spatial Locality data is transferred from cache to registers in words. However, data is transferred to the cache in 64-128Byte lines using every word in a line maximizes spatial locality. transform data structures into structure of arrays (SoA) layout

Sequential Locality Many memory address patterns access cache lines sequentially. CPU’s hardware stream prefetchers exploit this observation to hide

speculatively load data to memory latency. Transform loops to generate (a few) long, unit-stride accesses.

Page 10: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

10

Arithmetic Intensity

True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes

Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality)

Others have constant intensity

Arithmetic intensity is ultimately limited by compulsory traffic Arithmetic intensity is diminished by conflict or capacity misses.

A r i t h m e t i c I n t e n s i t y

O( N )O( log(N) )

O( 1 )

SpMV, BLAS1,2

Stencils (PDEs)

Lattice Methods

FFTsDense Linear Algebra

(BLAS3)Particle Methods

Page 11: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA

11

Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory-access is non-uniform (NUMA) That is, the bandwidth to read a given address varies dramatically among

between cores Exploit NUMA (affinity+first touch) when you malloc/init data. Concept is similar to data decomposition for distributed memory

Page 12: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

NUMA

12

Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory-access is non-uniform (NUMA) That is, the bandwidth to read a given address varies dramatically among

between cores Exploit NUMA (affinity+first touch) when you malloc/init data. Concept is similar to data decomposition for distributed memory

Page 13: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

13

Roofline Model

Page 14: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

14

Overlap of Communication

Consider a simple example in which a FP kernel maintains a working set in DRAM.

We assume we can perfectly overlap computation with communication or v.v. either through prefetching/DMA and/or pipelining (decoupling of communication and computation)

Thus, time, is the maximum of the time required to transfer the data and the time required to perform the floating point operations.

Byte’s / STREAM Bandwidth

Flop’s / Flop/s

time

Page 15: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline ModelBasic Concept

15

Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis.

where optimization i can be SIMDize, or unroll, or SW prefetch, … Given a kernel’s arithmetic intensity (based on DRAM traffic after

being filtered by the cache), programmers can inspect the figure, and bound performance.

Moreover, provides insights as to which optimizations will potentially be beneficial.

AttainablePerformanceij

= minFLOP/s with Optimizations1-i

AI * Bandwidth with Optimizations1-j

Page 16: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

16

Example

Consider the Opteron 2356: Dual Socket (NUMA) limited HW stream prefetchers quad-core (8 total) 2.3GHz 2-way SIMD (DP) separate FPMUL and FPADD

datapaths 4-cycle FP latency

Assuming expression of parallelism is the challenge on this architecture, what would the roofline model look like ?

Page 17: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline ModelBasic Concept

17

Plot on log-log scale Given AI, we can easily

bound performance But architectures are much

more complicated

We will bound performance as we eliminate specific forms of in-core parallelism

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

Page 18: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcomputational ceilings

18

Opterons have dedicated multipliers and adders.

If the code is dominated by adds, then attainable performance is half of peak.

We call these Ceilings They act like constraints on

performance

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

mul / add imbalance

Page 19: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcomputational ceilings

19

Opterons have 128-bit datapaths.

If instructions aren’t SIMDized, attainable performance will be halved

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

mul / add imbalance

w/out SIMD

Page 20: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcomputational ceilings

20

On Opterons, floating-point instructions have a 4 cycle latency.

If we don’t express 4-way ILP, performance will drop by as much as 4x

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

w/out ILP

peak DP

Stream

Ban

dwidth

mul / add imbalance

Page 21: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcommunication ceilings

21

We can perform a similar exercise taking away parallelism from the memory subsystem

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

Page 22: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcommunication ceilings

22

Explicit software prefetch instructions are required to achieve peak bandwidth

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

w/out

SW

pre

fetch

Page 23: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcommunication ceilings

23

Opterons are NUMA As such memory traffic

must be correctly balanced among the two sockets to achieve good Stream bandwidth.

We could continue this by examining strided or random memory access patterns

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

peak DP

Stream

Ban

dwidth

w/out

SW

pre

fetch

w/out

NUM

A

Page 24: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modelcomputation + communication ceilings

24

We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

Opteron 2356(Barcelona)

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

peak DP

mul / add imbalance

w/out ILP

Stream

Ban

dwidth

w/out

SW

pre

fetch

w/out

NUM

A

Page 25: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modellocality walls

25

Remember, memory traffic includes more than just compulsory misses.

As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

mul / add imbalance

w/out ILP

w/out

SW

pre

fetch

w/out

NUM

A

Opteron 2356(Barcelona)

peak DP

Stream

Ban

dwidth

on

ly com

pu

lsory m

iss traffic

FLOPs

Compulsory MissesAI =

Page 26: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

26

Cache Behavior

Knowledge of the underlying cache operation can be critical. For example, caches are organized into lines. Lines are organized

into sets & ways (associativity) Thus, we must mimic the effect of Mark Hill’s 3C’s of caches Impacts of conflict, compulsory, and capacity misses are both

architecture- and application-dependent. Ultimately they reduce the actual flop:byte ratio.

Moreover, many caches are write allocate. a write allocate cache read in an entire cache line upon a write miss. If the application ultimately overwrites that line, the read was

superfluous (further reduces flop:byte ratio) Because programs access data in words, but hardware transfers it

in 64 or 128B cache lines, spatial locality is key Array-of-structure data layouts can lead to dramatically lower

flop:byte ratios. e.g. if a program only operates on the “red” field of a pixel, bandwidth is

wasted.

Page 27: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modellocality walls

27

Remember, memory traffic includes more than just compulsory misses.

As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

mul / add imbalance

w/out ILP

w/out

SW

pre

fetch

w/out

NUM

A

Opteron 2356(Barcelona)

peak DP

Stream

Ban

dwidth

on

ly com

pu

lsory m

iss traffic

+w

rite a

lloca

tion

traffic

FLOPs

Allocations + Compulsory MissesAI =

Page 28: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modellocality walls

28

Remember, memory traffic includes more than just compulsory misses.

As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

mul / add imbalance

w/out ILP

w/out

SW

pre

fetch

w/out

NUM

A

Opteron 2356(Barcelona)

peak DP

Stream

Ban

dwidth

on

ly com

pu

lsory m

iss traffic

+w

rite a

lloca

tion

traffic

+ca

pa

city miss tra

ffic

FLOPs

Capacity + Allocations + CompulsoryAI =

Page 29: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modellocality walls

29

Remember, memory traffic includes more than just compulsory misses.

As such, actual arithmetic intensity may be substantially lower.

Walls are unique to the architecture-kernel combination

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

mul / add imbalance

w/out ILP

w/out

SW

pre

fetch

w/out

NUM

A

Opteron 2356(Barcelona)

peak DP

Stream

Ban

dwidth

on

ly com

pu

lsory m

iss traffic

+w

rite a

lloca

tion

traffic

+ca

pa

city miss tra

ffic

+co

nflict m

iss traffic

FLOPs

Conflict + Capacity + Allocations + CompulsoryAI =

Page 30: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline Modellocality walls

30

Optimizations remove these walls and ceilings which act to constrain performance.

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

w/out SIMD

mul / add imbalance

w/out ILP

w/out

SW

pre

fetch

w/out

NUM

A

Opteron 2356(Barcelona)

peak DP

Stream

Ban

dwidth

on

ly com

pu

lsory m

iss traffic

+w

rite a

lloca

tion

traffic

+ca

pa

city miss tra

ffic

+co

nflict m

iss traffic

Page 31: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

31

Instruction Issue Bandwidth

On a superscalar processor, there is likely ample instruction issue bandwidth.

This allows loads, integer, and FP instructions to be issued simultaneously.

As such, we assumed that expression of parallelism was the underlying challenge for in-core.

However, on some architectures, finite instruction-issue bandwidth can become a major impediment to performance.

Page 32: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

Roofline ModelInstruction Mix

32

As the instruction mix shifts away from floating-point, finite issue bandwidth begins to affect limits on in-core performance.

On Niagara2, with dual issues units but only 1 FPU, FP instructions must constitute ≥50% of the mix to attain peak performance.

A similar approach should be used on GPUs where proper use of CUDA solves the parallelism challenges.

actual FLOP:Byte ratio

atta

inab

le G

FLO

P/s

0.5

1.0

1/8

2.0

4.0

8.0

16.0

32.0

64.0

128.0

256.0

1/41/2 1 2 4 8 16

UltraSparc T2+ T5140(Niagara2)

Stream

Ban

dwidth

12% FP

25% FP

w/out

SW

pre

fetch

w/out

NUM

A

peak DP, ≥50% FP

Page 33: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

33

Optimization Categorization

Maximizing (attained)In-core Performance

Minimizing (total)Memory Traffic

Maximizing (attained)Memory Bandwidth

Page 34: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

34

Optimization Categorization

MinimizingMemory Traffic

MaximizingMemory Bandwidth

MaximizingIn-core Performance

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

Page 35: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

35

Optimization Categorization

MinimizingMemory Traffic

MaximizingMemory Bandwidth

MaximizingIn-core Performance

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

???

?unroll &jam

explicitSIMD

reorder

eliminatebranches

Page 36: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

???

?unroll &jam

explicitSIMD

reorder

eliminatebranches

36

Optimization Categorization

MaximizingIn-core Performance

MinimizingMemory Traffic

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

MaximizingMemory Bandwidth

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

?memoryaffinity ?SW

prefetch

?DMAlists

?unit-stridestreams

?TLBblocking

Page 37: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

?memoryaffinity ?SW

prefetch

?DMAlists

?unit-stridestreams

?TLBblocking

???

?unroll &jam

explicitSIMD

reorder

eliminatebranches

37

Optimization Categorization

MaximizingIn-core Performance

MaximizingMemory Bandwidth

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

MinimizingMemory Traffic

Eliminate:• Capacity misses• Conflict misses• Compulsory misses• Write allocate behavior

? ???

cacheblockingarray

padding

compressdata

streamingstores

Page 38: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

38

Optimization Categorization

MaximizingIn-core Performance

MinimizingMemory Traffic

MaximizingMemory Bandwidth

• Exploit in-core parallelism (ILP, DLP, etc…)

• Good (enough) floating-point balance

• Exploit NUMA

• Hide memory latency

• Satisfy Little’s Law

?memoryaffinity ?SW

prefetch

?DMAlists

?unit-stridestreams

?TLBblocking

Eliminate:• Capacity misses• Conflict misses• Compulsory misses• Write allocate behavior

? ???

cacheblockingarray

padding

compressdata

streamingstores?

??

?unroll &jam

explicitSIMD

reorder

eliminatebranches

Page 39: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

39

Examples

Page 40: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

40

Multicore SMPs Used

667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers

Hyp

erT

rans

portOpteron Opteron Opteron Opteron

667MHz DDR2 DIMMs

10.66 GB/s

2x64b memory controllers

Opteron Opteron Opteron Opteron

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

512KBvictim

2MB Shared quasi-victim (32 way)

SRI / crossbar

2MB Shared quasi-victim (32 way)

SRI / crossbarHyp

erT

rans

port

4G

B/s

(eac

h di

rect

ion

)

667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs

2x128b controllers

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

Crossbar

179 GB/s 90 GB/s

667MHz FBDIMMs

21.33 GB/s 10.66 GB/s

4MB Shared L2 (16 way)(64b interleaved)

4 Coherency Hubs

2x128b controllers

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

MT

SP

AR

C

Crossbar

179 GB/s 90 GB/s

8 x

6.4

GB

/s(1

pe

r h

ub p

er

dire

ctio

n)

BIF

512MB XDR DRAM

25.6 GB/s

EIB (ring network)

XDR memory controllers

VMTPPE

512KL2

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C BIF

512MB XDR DRAM

25.6 GB/s

EIB (ring network)

XDR memory controllers

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

SP

E2

56K

MF

C

VMTPPE

512KL2

<20

GB

/s(e

ach

dire

ctio

n)

AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)

IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)

667MHz FBDIMMs

Chipset (4x64b controllers)

10.66 GB/s(write)21.33 GB/s(read)

10.66 GB/s

Core

FSB

Core Core Core

10.66 GB/s

Core

FSB

Core Core Core

4MBshared L2

4MBshared L2

4MBshared L2

4MBshared L2

Page 41: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

41

Heat Equation

Page 42: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

7-point Stencil

42

PDE grid

+Y

+Z

+X

stencil for heat equation PDE

y+1

y-1

x-1

z-1

z+1

x+1x,y,z

Simplest derivation of the Laplacian operator results in a constant coefficient 7-point stencil

for all x,y,z:u(x,y,z,t+1) = alpha*u(x,y,z,t) + beta*(

u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) + u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t))

Clearly each stencil performs: 8 floating-point operations 8 memory references

all but 2 should be

filtered by an ideal

cache 6 memory streams

all but 2 should be filtered

(less than # HW prefetchers) Ideally, AI=0.5. However, write-allocate bounds it to 0.33.

Page 43: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

43

Roofline model for Stencil(out-of-the-box code)

Large datasets 2 unit stride streams No NUMA Little ILP No DLP Far more adds than

multiplies (imbalance) Ideal flop:byte ratio 1/3

High locality requirements Capacity and conflict

misses will severely impair flop:byte ratio

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 44: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

44

Roofline model for Stencil(out-of-the-box code)

Large datasets 2 unit stride streams No NUMA Little ILP No DLP Far more adds than

multiplies (imbalance) Ideal flop:byte ratio 1/3

High locality requirements Capacity and conflict

misses will severely impair flop:byte ratio

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 45: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

45

Roofline model for Stencil(NUMA, cache blocking, unrolling, prefetch, …)

Cache blocking helps ensure flop:byte ratio is as close as possible to 1/3

Clovertown has huge caches but is pinned to lower BW ceiling

Cache management is essential when capacity/thread is low

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve Cellimplementation

Page 46: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

46

Roofline model for Stencil(SIMDization + cache bypass)

Make SIMDization explicit Technically, this swaps ILP

and SIMD ceilings Use cache bypass

instruction: movntpd Increases flop:byte ratio to

~0.5 on x86/Cell

Clovertown

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

Cell BladeVictoria Falls

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

Barcelona

w/out FMA

peak DP

w/out ILP

w/out SIMD

w/out

NUM

Aun

align

ed D

MA

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

fits w

ithin

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

Page 47: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

47

SpMV

Page 48: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

48

Sparse MatrixVector Multiplication

What’s a Sparse Matrix ? Most entries are 0.0 Performance advantage in only storing/operating on the nonzeros Requires significant meta data to reconstruct the matrix structure

What’s SpMV ? Evaluate y=Ax A is a sparse matrix, x & y are dense vectors

Challenges Very low arithmetic intensity (often <0.166 flops/byte) Difficult to exploit ILP (bad for pipelined or superscalar), Difficult to exploit DLP (bad for SIMD)

(a)algebra conceptualization

(c)CSR reference code

for (r=0; r<A.rows; r++) { double y0 = 0.0; for (i=A.rowStart[r]; i<A.rowStart[r+1]; i++){ y0 += A.val[i] * x[A.col[i]]; } y[r] = y0;}

A x y

(b)CSR data structure

A.val[ ]

A.rowStart[ ]

...

...

A.col[ ]...

Page 49: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

49

Roofline model for SpMV

Double precision roofline models

In-core optimizations 1..i DRAM optimizations 1..j

FMA is inherent in SpMV (place at bottom)

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aba

nk c

onflic

ts

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

IBM QS20Cell Blade

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

data

set d

atas

et fit

s in

snoo

p filt

er

GFlopsi,j(AI) = min InCoreGFlopsi

StreamBWj * AI

Page 50: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

50

Roofline model for SpMV(overlay arithmetic intensity)

Two unit stride streams Inherent FMA No ILP No DLP FP is 12-25% Naïve compulsory

flop:byte < 0.166

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aba

nk c

onflic

ts

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve SPEimplementation

IBM QS20Cell Blade

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

data

set d

atas

et fit

s in

snoo

p filt

er

Page 51: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

51

Roofline model for SpMV(out-of-the-box parallel)

Two unit stride streams Inherent FMA No ILP No DLP FP is 12-25% Naïve compulsory

flop:byte < 0.166 For simplicity: dense

matrix in sparse format1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aba

nk c

onflic

ts

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve SPEimplementation

IBM QS20Cell Blade

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

data

set d

atas

et fit

s in

snoo

p filt

er

Page 52: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

52

Roofline model for SpMV(NUMA & SW prefetch)

compulsory flop:byte ~ 0.166

utilize all memory channels

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aba

nk c

onflic

ts

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

No naïve SPEimplementation

IBM QS20Cell Blade

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

data

set d

atas

et fit

s in

snoo

p filt

er

Page 53: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

53

Roofline model for SpMV(matrix compression)

Inherent FMA Register blocking improves

ILP, DLP, flop:byte ratio, and FP% of instructions

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

1

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 81

2

1/16

flop:DRAM byte ratio

att

ain

ab

le G

flop

/s

4

8

16

32

64

128

1/81/4

1/2 1 2 4 8

w/out SIMD

peak DP

w/out ILP

w/out FMA

w/out

NUM

Aba

nk c

onflic

ts

25% FP

peak DP

12% FP

w/out

SW

pre

fetc

h

w/out

NUM

A

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

data

set d

atas

et fit

s in

snoo

p filt

er

peak DP

w/out SIMD

w/out ILP

mul/add imbalance

w/out

SW

pre

fetc

h

w/out

NUM

A

IBM QS20Cell Blade

Opteron 2356(Barcelona)

Intel Xeon E5345(Clovertown)

Sun T2+ T5140(Victoria Falls)

Page 54: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

54

Various Kernels

We have examined and heavily optimized a number of kernels and applications for both CPUs and GPUs.

We observe that for most, performance is highly correlated with DRAM bandwidth – particularly on the GPU.

Note, GTC has a strong scatter/gather component that skews STREAM-based rooflines.

512

256

128

64

32

16

8

4

2

1024

1/16 1 2 4 8 16 321/81/4

1/21/32

single-precision peak

double-precision peak

STREAM b

andw

idth

RTM/wave eqn.

7pt Stencil27pt Stencil

Xeon X5550 (Nehalem)

DP add-only

SpMV

DGEMM

GTC/chargei

GTC/pushi

512

256

128

64

32

16

8

4

2

1024

1/16 1 2 4 8 16 321/81/4

1/21/32

single-precision peak

double-precision peak

Device

ban

dwidt

hRTM/wave eqn.

NVIDIA C2050 (Fermi)

DP add-only

SpMV

7pt Stencil

27pt Stencil

DGEMM

GTC/chargei

GTC/pushi

Page 55: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

55

Alternate Rooflines

Page 56: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

56

No overlap of communication and computation

Previously, we assumed perfect overlap of communication or computation.

What happens if there is a dependency (either inherent or by a lack of optimization) that serializes communication and computation ?

Byte’s / STREAM Bandwidth

Flop’s / Flop/s

time

Byte’s / STREAM Bandwidth Flop’s / Flop/s

time

Time is the sum of communication time and computation time. The result is that flop/s grows asymptotically.

Page 57: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

57

No overlap of communication and computation

Consider a generic machine If we can perfectly decouple

and overlap communication with computation, the roofline is sharp/angular.

However, without overlap, the roofline is smoothed, and attainable performance is degraded by up to a factor of 2x.

Page 58: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

58

Alternate Bandwidths

Thus far, we assumed a synergy between streaming applications and bandwidth (proxied by the STREAM benchmark)

STREAM is NOT a good proxy for short stanza/random cacheline access patterns as memory latency (instead of just bandwidth) is being exposed.

Thus one might conceive of alternate memory benchmarks to provide a bandwidth upper bound (ceiling)

Similarly, if data is primarily local in the LLC cache, one should construct rooflines based on LLC bandwidth and flop:LLC byte ratios.

For GPUs/accelerators, PCIe bandwidth can be an impediment. Thus one can construct a roofline model based on PCIe bandwidth and the flop:PCIe byte ratio.

Page 59: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

59

Alternate Computations

Arising from HPC kernels, its no surprise roofline use DP Flop/s. Of course, it could use

SP flop/s, integer ops, bit operations, pairwise comparisons (sorting), graphics operations, etc…

Page 60: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

60

Time-based roofline

In some cases, it is easier to visualize performance in terms of seconds (i.e. time-to-solution).

We can invert the roofline (seconds per flop) and simply multiply by the number of requisite flop’s

Additionally, we could change the horizontal axis from locality to some more appealing metric.

Page 61: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

F U T U R E T E C H N O L O G I E S G R O U P

LAWRENCE BERKELEY NATIONAL LABORATORY

61

Empirical Roofline

Thus far, all in-core estimates have been based on first principles analysis of the underlying computer architecture (frequency, SIMD-width, latency, etc…)

Conceivably, one could design a series of compiled benchmarks that would extract the relevant roofline parameters.

Similarly, one could use performance counters to extract application characteristics so one could accurately determine application coordinates.

Page 62: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

62

Questions?

AcknowledgmentsResearch supported by DOE Office of Science under contract number

DE-AC02-05CH11231.

Page 63: L AWRENCE B ERKELEY N ATIONAL L ABORATORY FUTURE TECHNOLOGIES GROUP The Roofline Model Samuel Williams Lawrence Berkeley National Laboratory 1 SWWilliams@lbl.gov

LAWRENCE BERKELEY NATIONAL LABORATORY

F U T U R E T E C H N O L O G I E S G R O U P

63

BACKUP SLIDES