The Memory System and You: A Love/Hate Relationship

Bruce Jacob
Electrical & Computer Engineering
University of Maryland at College Park
[email protected]

SOS-13, March 2009
Good Quotes from Tuesday
Al Geist: Apps get bigger and more complex
Thomas Schulthess: Memory BW is the primary limiter to solving superconductivity equations
Karl-Heinz Winkler: Memory = 50% of power (cf. Zia spec: proc = 214W, proc + mem = 230W)
Chuck Moore: image/video as data types => larger working sets, larger data types
Bill Camp: “DRAM sucks—that’s the real problem”
Some Trends
Storage per CPU socket has been flat:
(the capacity problem, Brinda Ganesh’s thesis)
Per-core capacity decreases as the number of cores per CPU increases.
[Chart: capacity per CPU socket (Gbit) over time]
Some Trends
Required BW per core (~1 GB/s):
• Thread-based load (SPECjbb), memory system set to 52 GB/s sustained
• Saturates around 64 cores/threads (~1 GB/s per core)
• cf. 32-core Sun Niagara: saturates at 25.6 GB/s (arithmetic sketched below)
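A quick back-of-the-envelope check of those per-core numbers (a sketch in C; the 52 GB/s, 64-core, and 25.6 GB/s figures are from the slide, the rest is just division):

#include <stdio.h>

/* Per-core bandwidth demand = sustained bandwidth / core count at saturation. */
int main(void) {
    double specjbb_bw    = 52.0;   /* GB/s sustained (slide figure) */
    double specjbb_cores = 64.0;   /* cores/threads at saturation (slide figure) */
    double niagara_bw    = 25.6;   /* GB/s, 32-core Sun Niagara */
    double niagara_cores = 32.0;

    printf("SPECjbb-like load: %.2f GB/s per core\n", specjbb_bw / specjbb_cores); /* ~0.81 */
    printf("Sun Niagara:       %.2f GB/s per core\n", niagara_bw / niagara_cores); /* 0.80 */
    return 0;
}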
Some Trends
Commodity Systems:
• Low double-digit GB per CPU socket
• $10–100 per DIMM
High End:
• Higher (but still not high) double-digit GB per CPU socket
• ~ $1000 per DIMM
Fully-Buffered DIMM:
• (largely failed) attempt to bridge the gap …
Some Trends: FB-DIMM
• JEDEC DDRx: ~1W/DIMM, ~10W total
• FB-DIMM: 5–10W/DIMM, ~350W total
Some Perspective
The cost of a DRAM access is high; it takes significant effort to amortize that cost over the (increasingly short) payoff.
[Timing diagram: Bank Precharge (tRP = 15 ns), then Row Activate (15 ns) and Data Restore (another 22 ns; tRAS = 37.5 ns, tRCD = 15 ns), then Column Read (CL = 8), then DATA on the bus (BL = 8).]
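A rough worked example of the cost/payoff ratio using the numbers above (a sketch; it assumes a 533 MHz command clock, 1.875 ns per cycle, so that CL = 8 works out to 15 ns; the exact device grade is an assumption, and the point is the ratio, not the absolute numbers):

#include <stdio.h>

/* Cost vs. payoff for one row-miss DRAM access, using the slide's timing.
 * Assumption: 533 MHz command clock (1.875 ns/cycle), so CL = 8 is 15 ns. */
int main(void) {
    double tRP   = 15.0;              /* ns, bank precharge */
    double tRCD  = 15.0;              /* ns, activate to column command */
    double tCK   = 1.875;             /* ns, assumed command-clock period */
    double tCL   = 8 * tCK;           /* ns, column read latency (CL = 8) */
    double burst = (8 / 2.0) * tCK;   /* ns, BL = 8 beats = 4 clocks of data */

    double overhead = tRP + tRCD + tCL;   /* time before any data appears */
    printf("overhead: %.1f ns, data on bus: %.1f ns\n", overhead, burst);
    printf("the bus carries useful data for only %.0f%% of the access\n",
           100.0 * burst / (overhead + burst));
    return 0;
}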
Bottom Line
DRAM is the performance limiter: HPC apps will always exceed your cache (if you could run them on a laptop, you would).
The memory system determines, to a large extent:
• Your performance
• Your power dissipation
• Your system cost
PROBLEM:
• Nobody models it accurately
What It Looks Like
[Diagram: two CPU/$ blocks send outgoing bus requests (Read B; Write X, data; Read Z; Write Q, data; Read A; Write A, data; Read W; Read Z; Read Y; …) to the memory controller (MC), which expands each request into per-beat DRAM command sequences (ACT, RD, WR, PRE, …) and returns read data on the bus.]
How It Is Represented
if (cache_miss(addr)) {
cycle_count += DRAM_LATENCY;
}
… even in simulators with “cycle accurate” memory systems—no lie
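For contrast, even a minimally more faithful model has to track per-bank state; a fixed DRAM_LATENCY cannot distinguish a row-buffer hit from a bank conflict. The sketch below is an illustration only (it is not DRAMsim, and the names and bookkeeping are hypothetical):

/* Minimal sketch of a bank-state-aware latency estimate (illustration only).
 * A real model (e.g., DRAMsim) also covers queues, scheduling policy, refresh,
 * bus contention, address mapping, and power. */
typedef struct {
    int  open_row;      /* row held in the bank's row buffer; < 0 means none open */
    long ready_cycle;   /* cycle at which the bank can accept a new command */
} bank_state_t;

long dram_access_cycles(bank_state_t *bank, int row, long now,
                        long tRP, long tRCD, long tCL)
{
    long start = (now > bank->ready_cycle) ? now : bank->ready_cycle;
    long latency;

    if (bank->open_row == row)
        latency = tCL;                 /* row-buffer hit: column access only */
    else if (bank->open_row < 0)
        latency = tRCD + tCL;          /* bank idle: activate, then read */
    else
        latency = tRP + tRCD + tCL;    /* conflict: precharge, activate, read */

    bank->open_row = row;
    bank->ready_cycle = start + latency;
    return (start - now) + latency;    /* queueing delay + access time */
}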
Some Cases In Point
Prefetching, Multicore, etc.
[Figure: performance comparison of various memory models. The x-axis shows the number of cores and the y-axis shows the IPC difference of the various memory models, normalized to the accurate memory-controller model; the difference grows with the number of cores.]
Some Cases In Point
Prefetching, Multicore, etc.
[Figure: performance comparison of various memory models with prefetching (a stream prefetcher of depth 4), for SJBB and TPCC. The difference between the simplistic models and the accurate memory-controller model grows with the number of cores and is larger than in the no-prefetching case.]
Your Choice
Today, DRAM is the primary limiter. If you really want to predict system behavior
• performance
• power
• cost
you have to model the memory system, accurately.
… so what do you get if you do?
DRAMsim
Latest version via Sandia & NSA
DRAMsim
System Parameters
DRAMsim
Device Parameters
DRAMsim
Overlay Graphs from Multiple Runs
DRAMsim Results: Epetra
[Charts: bandwidth, open page vs. bandwidth, closed page]
DRAMsim Results: Epetra
[Charts: latency, closed page vs. latency, open page]
DRAMsim Results: Epetra
[Charts: power, closed page vs. power, open page]
DRAMsim Results: Epetra
[Chart: bandwidth, addr2]
DRAMsim Results: Epetra
[Chart: bandwidth, addr4]
DRAMsim Results: Epetra
[Chart: power, addr2]
DRAMsim Results: Epetra
[Chart: power, addr4]
DRAMsim Results: Flash
[Chart: Read Response Time vs. Organization. Response time (msec) across configurations ranging from 1 x 8 x 25 up to 4 x 32 x 100.]
As disk-interface speeds scale up with serial interfaces and Fibre Channel, SSD performance is expected to be limited by the media transfer rate. We measured the effect of media transfer rate on the performance of a NAND Flash SSD by scaling I/O bus bandwidth from 25 MB/s (8-bit-wide bus at 25 MHz) up to 400 MB/s (32-bit-wide bus at 100 MHz). As shown in Figure 7, performance does not improve significantly beyond 100 MB/s.

However, note that even when performance saturates at high bandwidths, it is still possible to achieve significant gains by increasing the level of concurrency, either by banking or by implementing superblocks. Performance saturates at 100 MB/s because the real limitation on NAND Flash performance is the device's core interface (the requirement to read and write the flash storage array through what is effectively a single port, the read/cache registers), and this is a limitation that concurrency overcomes.

5.4. Increasing the Degree of Concurrency

As shown previously, flash memory performance can be improved significantly if request latency is reduced by dividing the flash array into independent banks and exploiting concurrency. The flash controller can support these concurrent requests through multiple flash memory banks on the same channel, through multiple independent channels to different banks, or through a combination of the two. To get a better idea of the shape of the design space, we focused on changing the degree of concurrency one I/O bandwidth at a time. Figure 8 shows example configurations modeled in our simulations, with bandwidths ranging from 25 MB/s to 400 MB/s. This is equivalent to saying, "I have 4 50-MHz, 8-bit I/O channels... what should I do? Gang them together, use them as independent channels, or a combination of the two?"

The performance results are shown in Figure 9. Though increasing the available concurrency in the storage subsystem (number of banks x number of channels) typically increases …
Figure 8: Flash SSD organizations. (a) Single channel: a single shared I/O bus with 1, 2, or 4 banks. (b) Dedicated channel for each bank: 1, 2, or 4 buses with a single bank per bus. (c) Multiple shared channels: 2 or 4 channels with 2 or 4 banks per channel.
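As a sanity check on the organizations in Figure 8, the raw I/O-bus bandwidth of a configuration is just channel count x bus width x clock (a sketch; it captures only the bus-side numbers, not the flash core-interface limit that the text identifies as the real bottleneck):

#include <stdio.h>

/* Raw I/O-bus bandwidth of a flash SSD organization (MB/s).
 * Ignores the flash core interface, which is the real limiter. */
double bus_mb_per_s(int channels, int width_bits, int mhz) {
    return channels * (width_bits / 8.0) * mhz;
}

int main(void) {
    /* "I have 4 x 50 MHz, 8-bit I/O channels ... what should I do?" */
    printf("ganged into one 32-bit bus: %.0f MB/s\n", bus_mb_per_s(1, 32, 50));
    printf("4 independent 8-bit buses:  %.0f MB/s aggregate\n", bus_mb_per_s(4, 8, 50));
    /* Endpoints of the sweep described in the text: */
    printf("1 x 8-bit @ 25 MHz:   %.0f MB/s\n", bus_mb_per_s(1, 8, 25));
    printf("1 x 32-bit @ 100 MHz: %.0f MB/s\n", bus_mb_per_s(1, 32, 100));
    return 0;
}

Both options for the four 50 MHz channels expose the same 200 MB/s of raw bus bandwidth; what differs is how much bank-level concurrency sits behind each bus, which is what the simulations in Figure 9 measure.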
… and tons more
Address-mapping policies (2–10x) (see the sketch after this list)
Scheduling policies (2–10x) (e.g., improved a Cray memory controller's sustained BW by 20%)
Paging policies (2x)
System organization (10+x)
Blocking, cache interactions, OS interactions (e.g. virtual memory), file-system interactions, queueing mechanisms, device architectures, …
If you want to reduce power, this is where to look.
If you want to increase performance, this is where to look.
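As an illustration of why address mapping alone can swing results 2–10x, here are two ways to carve the same physical address into channel/rank/bank/row/column fields (a sketch; the field widths and both orderings are hypothetical). The first spreads consecutive cache lines across channels and banks; the second walks through an entire row of one bank before touching the next, so a streaming access pattern behaves very differently under the two policies:

#include <stdint.h>

/* Hypothetical geometry: 4 channels, 4 ranks, 8 banks, 1024 columns,
 * 64-byte cache lines (the line offset is dropped before field extraction). */
typedef struct { unsigned channel, rank, bank, row, column; } dram_addr_t;

/* Policy A (high -> low bits: row : rank : column : bank : channel):
 * consecutive lines rotate across channels, then across banks. */
dram_addr_t map_interleaved(uint64_t a) {
    dram_addr_t d;
    a >>= 6;                         /* drop 64-byte line offset */
    d.channel = a & 0x3;    a >>= 2;
    d.bank    = a & 0x7;    a >>= 3;
    d.column  = a & 0x3FF;  a >>= 10;
    d.rank    = a & 0x3;    a >>= 2;
    d.row     = (unsigned)a;
    return d;
}

/* Policy B (high -> low bits: channel : rank : bank : row : column):
 * consecutive lines stay in one row of one bank on one channel. */
dram_addr_t map_linear(uint64_t a) {
    dram_addr_t d;
    a >>= 6;
    d.column  = a & 0x3FF;   a >>= 10;
    d.row     = a & 0xFFFF;  a >>= 16;
    d.bank    = a & 0x7;     a >>= 3;
    d.rank    = a & 0x3;     a >>= 2;
    d.channel = (unsigned)a & 0x3;
    return d;
}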
Shameless Plug