Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon, Naveen Muralimanohar ǂ, Justin Meza, Onur Mutlu*,

1

Data Mapping forHigher Performance and Energy Efficiency

in Multi-Level Phase Change Memory

HanBin Yoon*, Naveen Muralimanoharǂ, Justin Meza*, Onur Mutlu*, Norm Jouppiǂ

*Carnegie Mellon University, ǂHP Labs

2

Overview

• MLC PCM: Strengths and weaknesses• Data mapping scheme for MLC PCM– Exploits PCM characteristics for lower latency– Improves data integrity

• Row buffer management for MLC PCM– Increases row buffer hit rate

• Performance and energy efficiency improvements

3

Why MLC PCM?

• Emerging high density memory technology– Projected 3-12 denser than DRAM1

• Scalable DRAM alternative on the horizon– Access latency comparable to DRAM

• Multi-Level Cell: 1 of key strengths over DRAM– Further increases memory density (by 2–4)

• But MLC also has drawbacks

[1Lee+ ISCA’09]

4

Higher MLC Latencies and Energy

• MLC program/read operation is more complex– Finer control/detection of cell resistances

• Generally leads to higher latencies and energy– ~2 for reads, ~4 for writes (depending on tech. & impl.)

11 10 01 00

Number of cells

Resistance

1 0

5

• In MLC, single cell failure can lead to multi-bit faults

MLC Multi-bit Faults

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

cell

bit line

word line

LSB

MSB row

column

6

Motivation

• MLC PCM strength:– Scalable, dense memory

• MLC PCM weaknesses:– Higher latencies– Higher energy– Multi-bit faults– Endurance

Mitigate throughbit mapping schemes

androw buffer managementbased on the following

observations

7

• Read latency depends on cell state– Higher cell resistance higher read latency

Observation #1: Read Asymmetry

[2Qureshi+ ISCA’10]

8

• MSB can be determined before read completes• Quicker MSB read group LSB & MSB separately

Observation #1: Read Asymmetry

9

Observation #2: Program Asymmetry

00 01MSB LSB MSB LSB


210ns75ns

75ns

75ns 210ns

210ns

200ns

200ns

200ns250ns250ns

50ns

[3Joshi+ HPCA’11]

• Program latency depends on cell state

10

• Single-bit change reduces LSB program latency• Quicker LSB prog. group LSB & MSB

separately

Observation #2: Program Asymmetry



210ns75ns

75ns

75ns 210ns

210ns

200ns

200ns

200ns250ns250ns

50ns

Not allowed

11

• Bit mapping affects distribution of bit faults– 1 cell failure: 2 faults in 1 block vs. 1 fault each in 2

blocks (ECC-wise better)• Distributed faults group LSB & MSB separately

Observation #3: Distributed Bit Faults

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

MSB bits

LSB bits

MSB bits

LSB bits

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

12

• Decoupled bit mapping scheme– Reduced read latency for MSB pages (read asym.)– Reduced program latency for LSB pages (prog asym.)– Distributed bit faults between LSB and MSB blocks– Worse endurance

Idea #1: Bit-Decoupled Mapping

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

01

23

45

67

89

1011

1213

1415

1617

1819 Row

= Page

0256

1257

2258

3259

4260

5261

6262

7263

8264

9265 MSB page

LSB page

Coupled

Decoupled

13

Coalescing Writes

• Assuming spatial locality in writebacks• Interleaving blocks facilitates write coalescing• Improved endurance interleave blocks between LSB &

MSB

LSB blocks

MSB blocks

LSB blocks

MSB blocks

32 33 34 35 36 37 38 39 40 41 44 4742 43 45 460 3 8 9 10 11 12 13 14 151 2 4 5 6 7

3 9 11 13 15 17 19 21 23 25 27 29 311 5 70 8 10 12 14 16 18 20 22 24 26 28 302 4 6

PCM row: Decoupled bit mapping

PCM row: Decoupled bit mapping + block interleaving

dirty cache blockscache blocks

1 2 4 5 6 7

1 5 72 4 6

14

• LM-Interleaved (LMI) bit mapping scheme– Mitigates cell wear

Idea #2: LSB-MSB Block Interleaving

MSB page

LSB page0

2561

2572

2583

2594

2605

2616

2627

2638

2649

265

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit

bitbit MSB page

LSB page0

81

92

103

114

125

136

147

1516

2417

25

Decoupled

Decoupled and LM-Interleaved (LMI)

15

• Opportunity: Two latches per cell in row buffer– Use single row buffer as two “page buffers”

Row Buffer Management

01

23

45

67

89

1011

1213

1415

1617

1819

0256

1257

2258

3259

4260

5261

6262

7263

8264

9265

Row buffer

Row= Page

MSB page

LSB page

MSB bits

LSB bits

Coupled

Decoupled

16

• Increased row buffer hit rate

Idea #3: Split Page Buffering (SPB)MSB page

LSB page512

768513

769514

770515

771516

772517

773518

774519

775520

776521

777

LSB/MSB

LSB/MSB

0256

1257

2258

3259

4260

5261

6262

7263

8264

9265 MSB page

LSB page

0 1 2 3 4 5 6 7 8 9512 513 514 515 516 517 518 519 520 521

Row buffer

17

Evaluation Methodology

• Cycle-level x86 CPU-memory simulator– CPU: 8 out-of-order cores, 32 KB private L1 per

core– L2: 512 KB shared per core, DRAM-Aware LLC

Writeback4,5

– Dual channel DDR3 1066 MT/s, 2 ranks, aggregate PCM capacity 16 GB (2 bits per cell)

• Multi-programmed SPEC CPU2006 workloads– Misses per kilo-instructions > 10

[4Lee+ UTA-TechReport’10; 5Stuecheli+ ISCA’10]

18

Comparison Points and Metrics

• Baseline: Coupled bit mapping• Decoupled: Decoupled bit mapping• LMI-4: LSB-MSB interleaving every 4 blocks• LMI-16: LSB-MSB interleaving every 16 blocks• Weighted speedup (performance) = sum of

thread speedups versus when run alone• Max slowdown (fairness) = highest slowdown

experienced by any thread

190

0.2

0.4

0.6

0.8

1

1.2

Baseline Decoupled LMI-4 LMI-16

Performance

Wei

ghte

d Sp

eedu

p (n

orm

.)

Decoupled schemes benefit from reduced read latency (MSB) & program latency (LSB)

200

0.2

0.4

0.6

0.8

1

1.2


Fairness

Max

imum

Slo

wdo

wn

(nor

m.)

Individual thread speedups and increased row buffer hit rate

210

0.2

0.4

0.6

0.8

1

1.2


Energy Efficiency

Perf

orm

ance

per

Watt

(nor

m.)

Lower read energy (dominant case) due to exploiting read asymmetry

220

0.2

0.4

0.6

0.8

1

1.2


Memory Lifetime

Mem

ory

Life

time

(nor

m.)

5-year lifespan feasible for system design?Point of on-going research…

23

Conclusion

• MLC PCM is a scalable, dense memory tech.– Exhibits higher latency and energy compared to SLC

1. LSB-MSB decoupled bit mapping– Exploits read asymmetry & program asymmetry– Distributes multi-bit faults

2. LSB-MSB block interleaving– Mitigates cell wear

3. Split page buffering– Increases row buffer hit rate

• Enhances perf. and energy eff. of MLC PCM

24

Thank you! Questions?

Documents

Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,

Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon, Naveen Muralimanohar ǂ, Justin Meza, Onur Mutlu*,