Upload
virgil-cobb
View
220
Download
0
Tags:
Embed Size (px)
Citation preview
1
Data Mapping forHigher Performance and Energy Efficiency
in Multi-Level Phase Change Memory
HanBin Yoon*, Naveen Muralimanoharǂ, Justin Meza*, Onur Mutlu*, Norm Jouppiǂ
*Carnegie Mellon University, ǂHP Labs
2
Overview
• MLC PCM: Strengths and weaknesses• Data mapping scheme for MLC PCM– Exploits PCM characteristics for lower latency– Improves data integrity
• Row buffer management for MLC PCM– Increases row buffer hit rate
• Performance and energy efficiency improvements
3
Why MLC PCM?
• Emerging high density memory technology– Projected 3-12 denser than DRAM1
• Scalable DRAM alternative on the horizon– Access latency comparable to DRAM
• Multi-Level Cell: 1 of key strengths over DRAM– Further increases memory density (by 2–4)
• But MLC also has drawbacks
[1Lee+ ISCA’09]
4
Higher MLC Latencies and Energy
• MLC program/read operation is more complex– Finer control/detection of cell resistances
• Generally leads to higher latencies and energy– ~2 for reads, ~4 for writes (depending on tech. & impl.)
11 10 01 00
Number of cells
Resistance
1 0
5
• In MLC, single cell failure can lead to multi-bit faults
MLC Multi-bit Faults
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
cell
bit line
word line
LSB
MSB row
column
6
Motivation
• MLC PCM strength:– Scalable, dense memory
• MLC PCM weaknesses:– Higher latencies– Higher energy– Multi-bit faults– Endurance
Mitigate throughbit mapping schemes
androw buffer managementbased on the following
observations
7
• Read latency depends on cell state– Higher cell resistance higher read latency
Observation #1: Read Asymmetry
[2Qureshi+ ISCA’10]
8
• MSB can be determined before read completes• Quicker MSB read group LSB & MSB separately
Observation #1: Read Asymmetry
9
Observation #2: Program Asymmetry
00 01MSB LSB MSB LSB
10 11MSB LSB MSB LSB
210ns75ns
75ns
75ns 210ns
210ns
200ns
200ns
200ns250ns250ns
50ns
[3Joshi+ HPCA’11]
• Program latency depends on cell state
10
• Single-bit change reduces LSB program latency• Quicker LSB prog. group LSB & MSB
separately
Observation #2: Program Asymmetry
00 01MSB LSB MSB LSB
10 11MSB LSB MSB LSB
210ns75ns
75ns
75ns 210ns
210ns
200ns
200ns
200ns250ns250ns
50ns
Not allowed
11
• Bit mapping affects distribution of bit faults– 1 cell failure: 2 faults in 1 block vs. 1 fault each in 2
blocks (ECC-wise better)• Distributed faults group LSB & MSB separately
Observation #3: Distributed Bit Faults
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
MSB bits
LSB bits
MSB bits
LSB bits
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
12
• Decoupled bit mapping scheme– Reduced read latency for MSB pages (read asym.)– Reduced program latency for LSB pages (prog asym.)– Distributed bit faults between LSB and MSB blocks– Worse endurance
Idea #1: Bit-Decoupled Mapping
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
01
23
45
67
89
1011
1213
1415
1617
1819 Row
= Page
0256
1257
2258
3259
4260
5261
6262
7263
8264
9265 MSB page
LSB page
Coupled
Decoupled
13
Coalescing Writes
• Assuming spatial locality in writebacks• Interleaving blocks facilitates write coalescing• Improved endurance interleave blocks between LSB &
MSB
LSB blocks
MSB blocks
LSB blocks
MSB blocks
32 33 34 35 36 37 38 39 40 41 44 4742 43 45 460 3 8 9 10 11 12 13 14 151 2 4 5 6 7
3 9 11 13 15 17 19 21 23 25 27 29 311 5 70 8 10 12 14 16 18 20 22 24 26 28 302 4 6
PCM row: Decoupled bit mapping
PCM row: Decoupled bit mapping + block interleaving
dirty cache blockscache blocks
1 2 4 5 6 7
1 5 72 4 6
14
• LM-Interleaved (LMI) bit mapping scheme– Mitigates cell wear
Idea #2: LSB-MSB Block Interleaving
MSB page
LSB page0
2561
2572
2583
2594
2605
2616
2627
2638
2649
265
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit
bitbit MSB page
LSB page0
81
92
103
114
125
136
147
1516
2417
25
Decoupled
Decoupled and LM-Interleaved (LMI)
15
• Opportunity: Two latches per cell in row buffer– Use single row buffer as two “page buffers”
Row Buffer Management
01
23
45
67
89
1011
1213
1415
1617
1819
0256
1257
2258
3259
4260
5261
6262
7263
8264
9265
Row buffer
Row= Page
MSB page
LSB page
MSB bits
LSB bits
Coupled
Decoupled
16
• Increased row buffer hit rate
Idea #3: Split Page Buffering (SPB)MSB page
LSB page512
768513
769514
770515
771516
772517
773518
774519
775520
776521
777
LSB/MSB
LSB/MSB
0256
1257
2258
3259
4260
5261
6262
7263
8264
9265 MSB page
LSB page
0 1 2 3 4 5 6 7 8 9512 513 514 515 516 517 518 519 520 521
Row buffer
17
Evaluation Methodology
• Cycle-level x86 CPU-memory simulator– CPU: 8 out-of-order cores, 32 KB private L1 per
core– L2: 512 KB shared per core, DRAM-Aware LLC
Writeback4,5
– Dual channel DDR3 1066 MT/s, 2 ranks, aggregate PCM capacity 16 GB (2 bits per cell)
• Multi-programmed SPEC CPU2006 workloads– Misses per kilo-instructions > 10
[4Lee+ UTA-TechReport’10; 5Stuecheli+ ISCA’10]
18
Comparison Points and Metrics
• Baseline: Coupled bit mapping• Decoupled: Decoupled bit mapping• LMI-4: LSB-MSB interleaving every 4 blocks• LMI-16: LSB-MSB interleaving every 16 blocks• Weighted speedup (performance) = sum of
thread speedups versus when run alone• Max slowdown (fairness) = highest slowdown
experienced by any thread
190
0.2
0.4
0.6
0.8
1
1.2
Baseline Decoupled LMI-4 LMI-16
Performance
Wei
ghte
d Sp
eedu
p (n
orm
.)
Decoupled schemes benefit from reduced read latency (MSB) & program latency (LSB)
200
0.2
0.4
0.6
0.8
1
1.2
Baseline Decoupled LMI-4 LMI-16
Fairness
Max
imum
Slo
wdo
wn
(nor
m.)
Individual thread speedups and increased row buffer hit rate
210
0.2
0.4
0.6
0.8
1
1.2
Baseline Decoupled LMI-4 LMI-16
Energy Efficiency
Perf
orm
ance
per
Watt
(nor
m.)
Lower read energy (dominant case) due to exploiting read asymmetry
220
0.2
0.4
0.6
0.8
1
1.2
Baseline Decoupled LMI-4 LMI-16
Memory Lifetime
Mem
ory
Life
time
(nor
m.)
5-year lifespan feasible for system design?Point of on-going research…
23
Conclusion
• MLC PCM is a scalable, dense memory tech.– Exhibits higher latency and energy compared to SLC
1. LSB-MSB decoupled bit mapping– Exploits read asymmetry & program asymmetry– Distributes multi-bit faults
2. LSB-MSB block interleaving– Mitigates cell wear
3. Split page buffering– Increases row buffer hit rate
• Enhances perf. and energy eff. of MLC PCM
24
Thank you! Questions?