The Memory System and You: A Love/Hate Relationship

Bruce Jacob
Electrical & Computer Engineering
University of Maryland at College Park
[email protected]

SOS-13, March 2009
Good Quotes from Tuesday
Al Geist: Apps get bigger and more complex
Thomas Schulthess: Memory BW is the primary limiter to solving superconductivity equations
Karl-Heinz Winkler: Memory = 50% of power (cf. Zia spec: proc = 214W, proc + mem = 230W)
Chuck Moore: image/video as data types => larger working sets, larger data types
Bill Camp: “DRAM sucks—that’s the real problem”
Some Trends
Storage per CPU socket has been flat:
(the capacity problem, Brinda Ganesh’s thesis)
Per-core capacity decreases as the number of cores per CPU increases.
[Chart: capacity per CPU socket (Gbit) over time]
Some Trends
Required BW per core (~1 GB/s):
• Thread-based load (SPECjbb), memory system set to 52 GB/s sustained
• Saturates around 64 cores/threads (~1 GB/s per core)
• cf. 32-core Sun Niagara: saturates at 25.6 GB/s (arithmetic sketched below)
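A quick back-of-the-envelope check of those per-core numbers (a sketch in C; the 52 GB/s, 64-core, and 25.6 GB/s figures are from the slide, the rest is just division):

#include <stdio.h>

/* Per-core bandwidth demand = sustained bandwidth / core count at saturation. */
int main(void) {
    double specjbb_bw    = 52.0;   /* GB/s sustained (slide figure) */
    double specjbb_cores = 64.0;   /* cores/threads at saturation (slide figure) */
    double niagara_bw    = 25.6;   /* GB/s, 32-core Sun Niagara */
    double niagara_cores = 32.0;

    printf("SPECjbb-like load: %.2f GB/s per core\n", specjbb_bw / specjbb_cores); /* ~0.81 */
    printf("Sun Niagara:       %.2f GB/s per core\n", niagara_bw / niagara_cores); /* 0.80 */
    return 0;
}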
Some Trends
Commodity Systems:
• Low double-digit GB per CPU socket
• $10–100 per DIMM
High End:
• Higher (but still not high) double-digit GB per CPU socket
• ~ $1000 per DIMM
Fully-Buffered DIMM:
• (largely failed) attempt to bridge the gap …
Some Trends: FB-DIMM
• JEDEC DDRx: ~1W/DIMM, ~10W total
• FB-DIMM: 5–10W/DIMM, ~350W total
Some Perspective
The cost of a DRAM access is high; it takes significant effort to amortize that cost over the (increasingly short) payoff.
[Timing diagram: Bank Precharge (tRP = 15 ns), then Row Activate (15 ns) and Data Restore (another 22 ns; tRAS = 37.5 ns, tRCD = 15 ns), then Column Read (CL = 8), then DATA on the bus (BL = 8).]
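A rough worked example of the cost/payoff ratio using the numbers above (a sketch; it assumes a 533 MHz command clock, 1.875 ns per cycle, so that CL = 8 works out to 15 ns; the exact device grade is an assumption, and the point is the ratio, not the absolute numbers):

#include <stdio.h>

/* Cost vs. payoff for one row-miss DRAM access, using the slide's timing.
 * Assumption: 533 MHz command clock (1.875 ns/cycle), so CL = 8 is 15 ns. */
int main(void) {
    double tRP   = 15.0;              /* ns, bank precharge */
    double tRCD  = 15.0;              /* ns, activate to column command */
    double tCK   = 1.875;             /* ns, assumed command-clock period */
    double tCL   = 8 * tCK;           /* ns, column read latency (CL = 8) */
    double burst = (8 / 2.0) * tCK;   /* ns, BL = 8 beats = 4 clocks of data */

    double overhead = tRP + tRCD + tCL;   /* time before any data appears */
    printf("overhead: %.1f ns, data on bus: %.1f ns\n", overhead, burst);
    printf("the bus carries useful data for only %.0f%% of the access\n",
           100.0 * burst / (overhead + burst));
    return 0;
}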
Bottom Line
DRAM is the performance limiter: HPC apps will always exceed your cache (if you could run them on a laptop, you would).
The memory system determines, to a large extent:
• Your performance
• Your power dissipation
• Your system cost
PROBLEM:
• Nobody models it accurately
What It Looks Like
[Diagram: two CPU/$ blocks send outgoing bus requests (Read B; Write X, data; Read Z; Write Q, data; Read A; Write A, data; Read W; Read Z; Read Y; …) to the memory controller (MC), which expands each request into per-beat DRAM command sequences (ACT, RD, WR, PRE, …) and returns read data on the bus.]
How It Is Represented
if (cache_miss(addr)) {
cycle_count += DRAM_LATENCY;
}
… even in simulators with “cycle accurate” memory systems—no lie
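For contrast, even a minimally more faithful model has to track per-bank state; a fixed DRAM_LATENCY cannot distinguish a row-buffer hit from a bank conflict. The sketch below is an illustration only (it is not DRAMsim, and the names and bookkeeping are hypothetical):

/* Minimal sketch of a bank-state-aware latency estimate (illustration only).
 * A real model (e.g., DRAMsim) also covers queues, scheduling policy, refresh,
 * bus contention, address mapping, and power. */
typedef struct {
    int  open_row;      /* row held in the bank's row buffer; < 0 means none open */
    long ready_cycle;   /* cycle at which the bank can accept a new command */
} bank_state_t;

long dram_access_cycles(bank_state_t *bank, int row, long now,
                        long tRP, long tRCD, long tCL)
{
    long start = (now > bank->ready_cycle) ? now : bank->ready_cycle;
    long latency;

    if (bank->open_row == row)
        latency = tCL;                 /* row-buffer hit: column access only */
    else if (bank->open_row < 0)
        latency = tRCD + tCL;          /* bank idle: activate, then read */
    else
        latency = tRP + tRCD + tCL;    /* conflict: precharge, activate, read */

    bank->open_row = row;
    bank->ready_cycle = start + latency;
    return (start - now) + latency;    /* queueing delay + access time */
}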
Some Cases In Point
Prefetching, Multicore, etc.
[Figure: performance comparison of various memory models. The x-axis shows the number of cores and the y-axis shows the IPC difference of the various memory models, normalized to the accurate memory-controller model; the difference grows with the number of cores.]
Some Cases In Point
Prefetching, Multicore, etc.
[Figure: performance comparison of various memory models with prefetching (a stream prefetcher of depth 4), for SJBB and TPCC. The difference between the simplistic models and the accurate memory-controller model grows with the number of cores and is larger than in the no-prefetching case.]
Your Choice
Today, DRAM is the primary limiter. If you really want to predict system behavior
• performance
• power
• cost
you have to model the memory system, accurately.
… so what do you get if you do?
DRAMsim
Latest version via Sandia & NSA
DRAMsim
System Parameters
DRAMsim
Device Parameters
DRAMsim
Overlay Graphs from Multiple Runs
DRAMsim Results: Epetra
[Charts: bandwidth, open page vs. bandwidth, closed page]
DRAMsim Results: Epetra
[Charts: latency, closed page vs. latency, open page]
DRAMsim Results: Epetra
[Charts: power, closed page vs. power, open page]
DRAMsim Results: Epetra
[Chart: bandwidth, addr2]
DRAMsim Results: Epetra
[Chart: bandwidth, addr4]
DRAMsim Results: Epetra
[Chart: power, addr2]
DRAMsim Results: Epetra
[Chart: power, addr4]
DRAMsim Results: Flash
[Chart: Read Response Time vs. Organization. Response time (msec) across configurations ranging from 1 x 8 x 25 up to 4 x 32 x 100.]
As disk-interface speeds scale up with serial interfaces and Fibre Channel, SSD performance is expected to be limited by the media transfer rate. We measured the effect of media transfer rate on the performance of a NAND Flash SSD by scaling I/O bus bandwidth from 25 MB/s (8-bit-wide bus at 25 MHz) up to 400 MB/s (32-bit-wide bus at 100 MHz). As shown in Figure 7, performance does not improve significantly beyond 100 MB/s.

However, note that even when performance saturates at high bandwidths, it is still possible to achieve significant gains by increasing the level of concurrency, either by banking or by implementing superblocks. Performance saturates at 100 MB/s because the real limitation on NAND Flash performance is the device's core interface (the requirement to read and write the flash storage array through what is effectively a single port, the read/cache registers), and this is a limitation that concurrency overcomes.

5.4. Increasing the Degree of Concurrency

As shown previously, flash memory performance can be improved significantly if request latency is reduced by dividing the flash array into independent banks and exploiting concurrency. The flash controller can support these concurrent requests through multiple flash memory banks on the same channel, through multiple independent channels to different banks, or through a combination of the two. To get a better idea of the shape of the design space, we focused on changing the degree of concurrency one I/O bandwidth at a time. Figure 8 shows example configurations modeled in our simulations, with bandwidths ranging from 25 MB/s to 400 MB/s. This is equivalent to saying, "I have 4 50-MHz, 8-bit I/O channels... what should I do? Gang them together, use them as independent channels, or a combination of the two?"

The performance results are shown in Figure 9. Though increasing the available concurrency in the storage subsystem (number of banks x number of channels) typically increases …
Figure 8: Flash SSD organizations. (a) Single channel: a single shared I/O bus with 1, 2, or 4 banks. (b) Dedicated channel for each bank: 1, 2, or 4 buses with a single bank per bus. (c) Multiple shared channels: 2 or 4 channels with 2 or 4 banks per channel.
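As a sanity check on the organizations in Figure 8, the raw I/O-bus bandwidth of a configuration is just channel count x bus width x clock (a sketch; it captures only the bus-side numbers, not the flash core-interface limit that the text identifies as the real bottleneck):

#include <stdio.h>

/* Raw I/O-bus bandwidth of a flash SSD organization (MB/s).
 * Ignores the flash core interface, which is the real limiter. */
double bus_mb_per_s(int channels, int width_bits, int mhz) {
    return channels * (width_bits / 8.0) * mhz;
}

int main(void) {
    /* "I have 4 x 50 MHz, 8-bit I/O channels ... what should I do?" */
    printf("ganged into one 32-bit bus: %.0f MB/s\n", bus_mb_per_s(1, 32, 50));
    printf("4 independent 8-bit buses:  %.0f MB/s aggregate\n", bus_mb_per_s(4, 8, 50));
    /* Endpoints of the sweep described in the text: */
    printf("1 x 8-bit @ 25 MHz:   %.0f MB/s\n", bus_mb_per_s(1, 8, 25));
    printf("1 x 32-bit @ 100 MHz: %.0f MB/s\n", bus_mb_per_s(1, 32, 100));
    return 0;
}

Both options for the four 50 MHz channels expose the same 200 MB/s of raw bus bandwidth; what differs is how much bank-level concurrency sits behind each bus, which is what the simulations in Figure 9 measure.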
… and tons more
Address-mapping policies (2–10x) (see the sketch after this list)
Scheduling policies (2–10x) (e.g., improved a Cray memory controller's sustained BW by 20%)
Paging policies (2x)
System organization (10+x)
Blocking, cache interactions, OS interactions (e.g. virtual memory), file-system interactions, queueing mechanisms, device architectures, …
If you want to reduce power, this is where to look.
If you want to increase performance, this is where to look.
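As an illustration of why address mapping alone can swing results 2–10x, here are two ways to carve the same physical address into channel/rank/bank/row/column fields (a sketch; the field widths and both orderings are hypothetical). The first spreads consecutive cache lines across channels and banks; the second walks through an entire row of one bank before touching the next, so a streaming access pattern behaves very differently under the two policies:

#include <stdint.h>

/* Hypothetical geometry: 4 channels, 4 ranks, 8 banks, 1024 columns,
 * 64-byte cache lines (the line offset is dropped before field extraction). */
typedef struct { unsigned channel, rank, bank, row, column; } dram_addr_t;

/* Policy A (high -> low bits: row : rank : column : bank : channel):
 * consecutive lines rotate across channels, then across banks. */
dram_addr_t map_interleaved(uint64_t a) {
    dram_addr_t d;
    a >>= 6;                         /* drop 64-byte line offset */
    d.channel = a & 0x3;    a >>= 2;
    d.bank    = a & 0x7;    a >>= 3;
    d.column  = a & 0x3FF;  a >>= 10;
    d.rank    = a & 0x3;    a >>= 2;
    d.row     = (unsigned)a;
    return d;
}

/* Policy B (high -> low bits: channel : rank : bank : row : column):
 * consecutive lines stay in one row of one bank on one channel. */
dram_addr_t map_linear(uint64_t a) {
    dram_addr_t d;
    a >>= 6;
    d.column  = a & 0x3FF;   a >>= 10;
    d.row     = a & 0xFFFF;  a >>= 16;
    d.bank    = a & 0x7;     a >>= 3;
    d.rank    = a & 0x3;     a >>= 2;
    d.channel = (unsigned)a & 0x3;
    return d;
}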
Shameless Plug