DDR4 DIMM Devices With Hybrid DRAM & NVM
For Big Data Performance At Low Cost
Xiaobing Lee*, Florian Longnos**, Shaojie Chen**, Wei Yang**
*Storage Media Application Laboratory, Futurewei
**Shannon Laboratory, Huawei
Flash Memory Summit 2016
Santa Clara, CA
Contents
• Big data applications demand huge host memory
• Hybrid DIMM vs. separate DIMMs
• Optimizing hit rate for big data applications
Big Data Demands Huge Host Memory
[Chart: DDR4 bus utilization ~60% with 2-rank DDR4 DRAM DIMMs in slots 1, 2, 3]
• Big data applications demand huge host memory for high performance
• 4-rank DDR4 or 8-rank 3DS DRAM DIMMs for higher bus utilization
• Some DIMM devices could be hybrid DRAM and PM/NVM to reduce costs
[Figure: DDR4 channel with transaction queues feeding hybrid DIMMs; each hybrid controller manages DRAM used as an L4 cache in front of PM/NVM or NAND, sharing the ACT, CSx, CE, and CMD/Address[17:0] signals]
NVMe Block Device Over DDR4 Bus
• IOC-to-SSD storage runs DMA-IRQ operations twice per request
• PCIe-SSD/SAS-SSD read/write operations need the CPU to process DMA-IRQ operations two times, using the CPU's DDR4 bus twice, which reduces the IOPS capacity the CPU can handle
• Header-to-data ratio is roughly 1:1000
• SSD devices are better placed on the DDR4 bus: the CPU processes IOC DMA and IRQ operations only once, which eliminates all SSD-related PCIe DMA and IRQ operations
• A hybrid DIMM (NVDIMM-P) offloads DMA operations on-DIMM
• DDR4 data traffic is cut in half, which could double IOPS
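The halved-traffic claim can be sanity-checked with a back-of-envelope sketch. The bandwidth and I/O-size numbers below are assumptions chosen for illustration, not figures from this presentation; the point is only that halving the DDR4 crossings per I/O doubles the bus-limited IOPS ceiling.

```python
# Back-of-envelope model of bus-limited IOPS. The bandwidth and I/O size
# below are assumed values for illustration, not numbers from this deck.

DDR4_BW_BYTES_PER_S = 19.2e9   # assumed: one DDR4-2400 channel, ~19.2 GB/s peak
IO_SIZE_BYTES = 4096           # assumed: 4 KiB I/O

def bus_limited_iops(bus_crossings_per_io):
    """IOPS ceiling imposed by the DDR4 bus for a given number of data crossings."""
    return DDR4_BW_BYTES_PER_S / (bus_crossings_per_io * IO_SIZE_BYTES)

# PCIe/SAS SSD path: per the slide, data crosses the CPU's DDR4 bus twice per I/O
print(f"PCIe/SAS SSD path: {bus_limited_iops(2):,.0f} IOPS")
# NVDIMM-P path: data crosses the DDR4 bus once per I/O
print(f"NVDIMM-P path:     {bus_limited_iops(1):,.0f} IOPS")
```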
Contents
• Big data applications demand huge host memory
• Hybrid DIMM vs. separate DIMMs
• Optimizing hit rate for big data applications
Separate vs. Hybrid DIMM

[Figure: With separate DIMMs (a DRAM cache DIMM plus an NVM DIMM), the CPU memory controller must drive every cache-management step over the memory bus: DRAM cache read, NVM address, NVM read, NVM fill-back write, NVM victim write, and data return. With a hybrid DIMM, the on-module controller performs these steps internally and only the data return crosses the bus.]

Cache management traffic is confined within the Hybrid DIMM, thus reducing transactions on the memory bus.
Requests’ Handling in Hybrid DIMM
[Figure: On-DIMM cache controller connected to a DRAM controller and NVM controllers, with an asynchronous handshaking protocol toward the host; the write and read dataflows are numbered as in the steps below.]

Write dataflow:
1. Host issues a Write (WR) request
2. WR request is transferred to the on-DIMM cache controller
3. Cache controller schedules the WR for the DRAM cache
4. DRAM Ctrl transfers the victim cacheline to the cache controller
5. Cache controller schedules the write-back to NVM

Read dataflow:
1. Host issues a Read (RD) request
2. RD request is transferred to the on-DIMM cache controller
3. Cache controller schedules the RD for NVM
4. NVM Ctrl returns the RD data to the cache controller
5. Hybrid DIMM notifies the host that data is ready
6. Host grants the bus to the hybrid DIMM
7. Hybrid DIMM returns the RD data to the host
8. DRAM Ctrl fills the RD data into the DRAM cache
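A minimal functional sketch of these dataflows, assuming a simple fully associative LRU DRAM cache; the class and method names are illustrative, not an actual NVDIMM-P interface. A read that misses in DRAM triggers the NVM read and the DRAM fill in one call, mirroring steps 3-8 above.

```python
# Illustrative sketch of the on-DIMM cache-controller dataflows described above.
from collections import OrderedDict

class OnDimmCacheController:
    def __init__(self, dram_lines, nvm):
        self.dram = OrderedDict()    # DRAM L4 cache: line address -> data (LRU order)
        self.dram_lines = dram_lines
        self.nvm = nvm               # backing NVM: line address -> data

    def _fill(self, line, data):
        """Insert into the DRAM cache, writing the evicted victim back to NVM
        (write dataflow steps 4-5)."""
        if len(self.dram) >= self.dram_lines:
            victim_line, victim_data = self.dram.popitem(last=False)
            self.nvm[victim_line] = victim_data
        self.dram[line] = data

    def write(self, line, data):
        """Write dataflow: the host WR is scheduled into the DRAM cache (steps 1-3)."""
        if line in self.dram:
            self.dram.move_to_end(line)
            self.dram[line] = data
        else:
            self._fill(line, data)

    def read(self, line):
        """Read dataflow: serve from DRAM on a hit; otherwise read NVM,
        fill the DRAM cache, and return the data to the host (steps 3-8)."""
        if line in self.dram:
            self.dram.move_to_end(line)
            return self.dram[line]
        data = self.nvm.get(line, 0)
        self._fill(line, data)       # step 8: fill RD data into DRAM
        return data                  # steps 5-7: host notified, bus granted, data returned
```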
Latency Improvement In Hybrid DIMM
• Reduced bus contention
• Lower latency for most applications
• Higher capacity ratio requires a high hit rate for improvement
[Chart: latency difference between hybrid DIMM and separate DIMMs, in cycles and in %, for memory-intensive SPEC CPU2006 (HPC) benchmarks at a 1:8 capacity ratio]
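The "higher ratio requires high hit rate" point follows from a simple expected-latency model. The cycle counts below are assumptions for the sketch, not measured values from this study; they only show how quickly average latency degrades as the hit rate drops, which is what larger NVM:DRAM capacity ratios tend to cause.

```python
# Illustrative expected read-latency model for a DRAM L4 cache in front of NVM.
# Cycle counts are assumed for this sketch, not measurements from the study.
T_DRAM_HIT = 50     # assumed: DRAM-cache hit latency (bus cycles)
T_NVM_MISS = 400    # assumed: extra penalty when the line must come from NVM

def expected_latency(hit_rate):
    """Average read latency seen by the host, in cycles."""
    return T_DRAM_HIT + (1.0 - hit_rate) * T_NVM_MISS

for hr in (0.99, 0.95, 0.85):
    print(f"hit rate {hr:.0%} -> ~{expected_latency(hr):.0f} cycles on average")
```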
Contents
• Big data applications demand huge host memory
• Hybrid DIMM vs. separate DIMMs
• Optimizing hit rate for big data applications
Applications For Hybrid DIMM
Hybrid DIMM will benefit applications that have a large footprint and high locality, so that most of the working set fits into the DRAM cache.
Big Data Benchmarks (data traffic):
• Terasort input 1GB
• Hive Join Query
• Hive Aggregation Query
• Kmeans input 2GB
• Hive Select Query
• Pagerank
• Direct mapping or associative DRAM cache
• Replacement policy: LRU (a minimal hit-rate simulator sketch follows the chart below)
[Chart: memory usage in GB per benchmark, comparing working set vs. total footprint]
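A minimal hit-rate simulator matching the configuration above: a set-associative DRAM cache with LRU replacement, where direct mapping is the ways=1 case. The line size and any capacities passed to it are illustrative assumptions, not the parameters used in the study.

```python
# Minimal DRAM-cache hit-rate simulator sketch (set-associative, LRU).
from collections import OrderedDict

class DramCache:
    def __init__(self, capacity_lines, ways, line_bytes=64):
        self.ways = ways
        self.sets = max(1, capacity_lines // ways)
        self.line_bytes = line_bytes
        self._sets = [OrderedDict() for _ in range(self.sets)]  # LRU-ordered tags per set

    def access(self, addr):
        """Return True on a hit, False on a miss (and fill the line)."""
        line = addr // self.line_bytes
        idx, tag = line % self.sets, line // self.sets
        s = self._sets[idx]
        if tag in s:
            s.move_to_end(tag)        # refresh LRU position
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # evict the least-recently-used line
        s[tag] = None                 # fill on miss
        return False

def hit_rate(trace, cache):
    """Fraction of accesses in an address trace that hit in the cache."""
    hits = sum(cache.access(a) for a in trace)
    return hits / len(trace)
```

For example, hit_rate(trace, DramCache(capacity_lines=4096, ways=1)) gives the direct-mapped hit rate for a given address trace.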
Read/Write Intensity
[Chart: write, read, and total footprints per benchmark (terasort, join, agg, kmeans, select, pagerank), compared against the DRAM cache capacity]

[Charts: number of accesses vs. physical address for Pagerank and HIVE Select]
Impact Of Associativity On Cache Hit Rate
• Small capacity ratio & large DRAM: associativity has little effect
• Above a 1:32 ratio: highest hit rate for an associativity of 4 (+4% for 1:8)
• Further increases in associativity decrease the hit rate
• PAGERANK hit rate is affected by associativity for all capacity ratios
• HIVE Agg's hit rate is significantly degraded for high associativity values (e.g. 64 or 128)
[Charts: hit rate vs. associativity for capacity ratios 1:8, 1:32, 1:64, and 1:128]
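As an illustration of the kind of sweep behind these charts, the DramCache and hit_rate sketch from the previous slide can be driven across associativities. The trace below is synthetic (90% of accesses to a hot set, 10% uniform), so the resulting hit rates are not comparable to the benchmark results reported here.

```python
# Illustrative associativity sweep, reusing the DramCache / hit_rate sketch above.
import random

random.seed(1)
hot_lines = [random.randrange(1 << 20) for _ in range(3000)]   # synthetic hot working set
trace = [(random.choice(hot_lines) if random.random() < 0.9    # 90% hot, 10% cold
          else random.randrange(1 << 28)) * 64                 # 64 B cachelines
         for _ in range(200_000)]

for ways in (1, 4, 64):                                        # direct-mapped .. 64-way
    cache = DramCache(capacity_lines=4096, ways=ways)
    print(f"{ways:>2}-way: hit rate {hit_rate(trace, cache):.1%}")
```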
Impact Of Capacity Ratio On Cache Hit Rate
[Charts: hit rate vs. capacity ratio for a direct-mapped DRAM cache and for a 4-way associative DRAM cache]
• TERASORT, SELECT, and JOIN show a regular, almost linear hit-rate decrease
• JOIN and PAGERANK show a steep decrease in hit rate from the 4:1 to the 32:1 capacity ratio
• PAGERANK is the most affected by the decrease in capacity ratio, with a 46% drop from the 4:1 to the 128:1 capacity ratio
• The behavior is unchanged whether a direct-mapped or a 4-way associative cache is used
Improving Cache Traffic Management
Conventional cache management scheme vs. a different algorithm:
• We simulate the behavior of the two schemes over the first 10 million memory accesses
• Both solutions use the same prefetch scheme at the same granularity for fair comparison
• Both solutions use the same write/read access granularity
1) Hit rate could exceed 90% for all studied big data benchmarks when using algorithm 2 instead of algorithm 1
2) Hive AGG above all got its hit rate boosted; PAGERANK hit rate increased by 18% with algorithm 2
1) Hit rate is increased for all studied big data benchmarks when using algorithm 2 instead of algorithm 1
2) Hive AGG above all got its hit rate boosted; PAGERANK hit rate increased by 25% with algorithm 2
Conclusions
• Hybrid DIMM provides reduced latency due to less bus contention vs. a separate-DIMM design (best case –10 cycles; average –6 cycles)
• All big data applications have a large footprint, but their frequently accessed data, especially written pages, could potentially fit into the DRAM L4 cache
• Optimizing the cache hierarchy for writes can make the most of the hybrid DIMM for such applications
• Improving the hit rate can be done through the cache traffic management algorithm, in various hybrid memory designs (25% hit rate increase)
• PM/NVM/NAND memory media do not really affect the cache hit rates; big data traffic patterns and caching algorithms do, and media read latencies only marginally affect cache-miss cases
Biographies of authors
Dr. Xiaobing Lee is a Principal Engineer / Storage Architect in Huawei's Storage Media Application Laboratory. He has over 25 years of experience in the ICT industry, including previous positions with AT&T Labs, Sarnoff Labs, SeaChange International, and Wegener Communications. His major interests include unified network and DRAM/NVM hybrid controllers, all-flash-array SSDs, LPDDR4-T / DDR4-T and NVDIMM-P for hybrid or hyper memories, and emerging memories and their applications.
Dr. Florian Longnos is a Research Engineer in Huawei's Shannon Laboratory. He received his PhD degree in microelectronics and nanotechnologies from Grenoble Institute of Technology, France, in 2014, and has been with Huawei Technologies Co. since 2015. His responsibilities and current research interests include computer architecture, memory sub-system design, and emerging memory technologies.
Dr. Shaojie Chen has been a research engineer at Huawei Shannon Lab since 2008. He received his degree in signal and information processing from Hangzhou Dianzi University, China, in 2008. His responsibilities cover computer architecture, emerging memory technologies, memory controller design, and memory protocol design. He took part in the system architecture design and development of data-center servers such as the High-Throughput Computer.