Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory Miao Cai Chance Coats Jian Huang Systems Platform Research Group

Systems Platform Research Group - NVMW 2021


Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory

Miao Cai † Chance Coats Jian Huang

Systems Platform Research Group

Non-Volatile Memory is a Revolutionary Technology

New and emerging NVMs offer promising properties and are becoming popular:

Close-to-DRAM Performance
Data Durability
Byte Addressability

Memory Persistency Challenge: A Well-Known Problem

Volatile Processor Cache
Out-of-Order Execution
Performance vs. Persistency

Ensuring memory persistency with commodity architecture is challenging!

State-of-the-Art Approach: Redo/Undo Logging

Undo Logging

Redo Logging

Undo/Redo logging causes DOUBLE WRITES on the critical path.
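The double-write cost can be illustrated with a toy model. This is a minimal sketch, not code from the talk: the class `UndoLogNVM`, its fields, and the one-write-per-word accounting are all assumptions made for illustration. Each updated word is written once to the undo log and once in place.

```python
class UndoLogNVM:
    """Toy NVM with undo logging (hypothetical model); counts word-sized writes."""

    def __init__(self):
        self.home = {}    # address -> value (data updated in place)
        self.log = []     # undo records: (address, old value)
        self.writes = 0   # number of word-sized writes that reach NVM

    def tx_write(self, updates):
        # First write: persist the old values to the undo log.
        for addr in updates:
            self.log.append((addr, self.home.get(addr)))
            self.writes += 1
        # Second write: only then update the data in place.
        for addr, value in updates.items():
            self.home[addr] = value
            self.writes += 1
        # Commit: the undo records are no longer needed.
        self.log.clear()


nvm = UndoLogNVM()
nvm.tx_write({0x10: 1, 0x18: 2, 0x20: 3})
print(nvm.writes)  # 6 writes for 3 updated words
```

Redo logging has the same shape in this model: the new values go to the log first and are applied to the home locations afterwards.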

State-of-the-Art Approach: Shadow Paging

Page Copy

Optimized shadow paging still suffers from FREQUENT DATA FLUSHES.

State-of-the-Art Approach: Log-structured NVM

Log Index

Software-based LSNVM suffers from LONG ACCESS LATENCY.
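A rough sketch of why a software log index sits on the access path. The class `LogStructuredNVM` and its counters are hypothetical and only model the idea on this slide: every write appends to the log, so every read must first consult the index to find the latest version.

```python
class LogStructuredNVM:
    """Toy log-structured store with a software index (hypothetical model)."""

    def __init__(self):
        self.log = []           # append-only list of (address, value)
        self.index = {}         # address -> position of the latest version
        self.index_lookups = 0  # extra work added to each read

    def write(self, addr, value):
        # Out-of-place update: append, then repoint the index.
        self.log.append((addr, value))
        self.index[addr] = len(self.log) - 1

    def read(self, addr):
        # The read cannot go straight to the data; it pays an index lookup first.
        self.index_lookups += 1
        return self.log[self.index[addr]][1]


store = LogStructuredNVM()
store.write(0x10, "v1")
store.write(0x10, "v2")
latest = store.read(0x10)
print(latest, store.index_lookups)  # v2 1
```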

A Summary of State-of-the-Art Approaches

Memory persistency overheads:

Logging: double writes
Shadow Paging: frequent flushes
Log-structured NVM: long critical-path latency

Our Approach: Hardware-assisted Out-Of-Place (HOOP) Update

Reduced write traffic with data coalescing and packing
+ No requirement on persistence ordering
+ Transparent support of atomic data durability

Challenges of Supporting Out-Of-Place Update

Lightweight Indirection Layer
Limited Resource in Memory Controller
Efficient Garbage Collection

Address Remapping for Supporting Out-of-Place Update

[Diagram labels: Processor Cache; Memory Controller; Mapping Table; NVM (Home Region, OOP Region); store and load paths; GC]

Mapping Table: physical-to-physical address mapping

Insert mapping entry:
Upon a write to OOP region

Delete mapping entry:
Data migration from OOP to home (GC)
Upon a read from OOP region
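The remapping rules can be sketched as a small model. This is an illustration under stated assumptions, not the hardware design: the class `RemapController`, its dictionaries, and the sequential OOP allocator are hypothetical. It models entry insertion on a write and entry deletion on GC migration; the read-triggered deletion listed on the slide is left out because the slide gives no further detail.

```python
class RemapController:
    """Toy memory controller with a home-to-OOP mapping table (hypothetical)."""

    def __init__(self):
        self.home = {}      # home address -> data (home region)
        self.oop = {}       # OOP address -> data (OOP region)
        self.mapping = {}   # home address -> OOP address (physical-to-physical)
        self.next_oop = 0   # assumed: OOP space is allocated sequentially

    def store(self, home_addr, data):
        # Write out of place, then insert (or repoint) the mapping entry.
        oop_addr = self.next_oop
        self.next_oop += 1
        self.oop[oop_addr] = data
        self.mapping[home_addr] = oop_addr

    def load(self, home_addr):
        # A mapped address is served from the OOP region, otherwise from home.
        oop_addr = self.mapping.get(home_addr)
        if oop_addr is None:
            return self.home.get(home_addr)
        return self.oop[oop_addr]

    def gc(self):
        # Migrate the latest versions back home and delete their entries.
        for home_addr, oop_addr in self.mapping.items():
            self.home[home_addr] = self.oop[oop_addr]
        self.mapping.clear()
        self.oop.clear()


ctl = RemapController()
ctl.home[0x40] = "old"
ctl.store(0x40, "new")
before_gc = (ctl.load(0x40), ctl.home[0x40])
ctl.gc()
after_gc = (ctl.load(0x40), len(ctl.mapping))
print(before_gc, after_gc)  # ('new', 'old') ('new', 0)
```

Until GC runs, the home region keeps the old version untouched, which is what makes the update out-of-place.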

Data Packing in the Memory Controller for Improved Performance

[Diagram labels: Processor Cache; Memory Controller; Mapping Table; OOP Data Buffer; NVM (Home Region, OOP Region); store and load paths]

Many applications update data at a fine granularity

OOP Data Buffer: packed data with home address

OOP Region: OOP Block (Head), OOP Block (Head), …
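A back-of-the-envelope sketch of why packing fine-grained updates helps. The sizes below (64-byte cache lines, 8-byte words, 8 bytes of home-address metadata per packed word) and the helper names are assumptions for illustration, not parameters taken from the slides.

```python
CACHE_LINE = 64  # bytes; assumed cache-line size
WORD = 8         # bytes; assumed size of one fine-grained update
ADDR = 8         # bytes; assumed home-address metadata stored per packed word


def pack(updates):
    """Coalesce repeated updates, keeping the latest word per home address."""
    latest = {}
    for home_addr, word in updates:
        latest[home_addr] = word
    return list(latest.items())


def bytes_line_granularity(updates):
    """Bytes written if every dirty cache line is written back whole."""
    lines = {home_addr // CACHE_LINE for home_addr, _ in updates}
    return len(lines) * CACHE_LINE


def bytes_packed(packed):
    """Bytes written if only the updated words and their home addresses are stored."""
    return len(packed) * (WORD + ADDR)


# Four word-sized updates touching three cache lines; 0x000 is updated twice.
updates = [(0x000, 1), (0x040, 2), (0x080, 3), (0x000, 4)]
packed = pack(updates)
print(packed)                                                # [(0, 4), (64, 2), (128, 3)]
print(bytes_line_granularity(updates), bytes_packed(packed)) # 192 48
```

Storing the home address next to each packed word is what lets the data be placed back at its home location later.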

Ensuring Persistence Ordering in the Memory Controller

[Diagram labels: Processor Cache; Memory Controller; Mapping Table; OOP Data Buffer; NVM (Home Region, OOP Region); store and load paths]

Data packing for a memory slice is done
Upon the end of a transaction (e.g., Tx_end)
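One way to read the two conditions on this slide is as the points at which buffered data is written to the OOP region. The sketch below models that reading; the class `PersistController`, the tiny slice capacity, and the per-transaction commit mark used during recovery are all assumptions added for illustration.

```python
SLICE_WORDS = 2  # assumed tiny slice capacity so the example stays short


class PersistController:
    """Toy controller that persists buffered updates on two triggers (hypothetical)."""

    def __init__(self):
        self.buffer = []        # volatile OOP data buffer: (tx_id, addr, word)
        self.oop_region = []    # persistent slices, each a list of (tx_id, addr, word)
        self.committed = set()  # persistent commit marks (an assumption of this sketch)

    def _flush(self):
        if self.buffer:
            self.oop_region.append(list(self.buffer))
            self.buffer.clear()

    def tx_write(self, tx_id, addr, word):
        self.buffer.append((tx_id, addr, word))
        if len(self.buffer) == SLICE_WORDS:  # trigger 1: a slice has been packed
            self._flush()

    def tx_end(self, tx_id):
        self._flush()                        # trigger 2: end of the transaction
        self.committed.add(tx_id)

    def crash(self):
        self.buffer.clear()                  # volatile state is lost

    def recover(self):
        # Replay slices in order, keeping only data from committed transactions.
        state = {}
        for slice_ in self.oop_region:
            for tx_id, addr, word in slice_:
                if tx_id in self.committed:
                    state[addr] = word
        return state


ctl = PersistController()
for addr, word in [(0x10, "a"), (0x18, "b"), (0x20, "c")]:
    ctl.tx_write(1, addr, word)
ctl.tx_end(1)
ctl.tx_write(2, 0x10, "x")  # transaction 2 fills a slice but never reaches tx_end
ctl.tx_write(2, 0x18, "y")
ctl.crash()
recovered = ctl.recover()
print(recovered == {0x10: "a", 0x18: "b", 0x20: "c"})  # True
```

Because the home locations are never overwritten before commit, the partially persisted transaction 2 can simply be ignored after the crash.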

Efficient Garbage Collection for Improved Memory Utilization

[Diagram labels: Processor Cache; Memory Controller; Mapping Table; OOP Data Buffer; NVM (Home Region, OOP Region); store and load paths; GC]

OOP Region: OOP Block (Head), OOP Block (Head), …

Linked Memory Slices

Load stale data during GC

Eviction Buffer
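A minimal sketch of how GC over linked memory slices can avoid redundant writes. The function `garbage_collect`, the slice layout, and the newest-to-oldest scan order are assumptions of this sketch rather than details stated on the slide; the Eviction Buffer is not modeled.

```python
def garbage_collect(slices, home):
    """Migrate the latest version of each word from the OOP slices to home.

    slices: list of slices ordered oldest to newest; each slice is a list of
            (home_addr, word) pairs in write order (hypothetical layout).
    home:   dict modeling the home region; updated in place.
    Returns the number of words written to the home region.
    """
    migrated = set()
    writes = 0
    # Scan newest to oldest so the first version seen for an address is the latest.
    for slice_ in reversed(slices):
        for home_addr, word in reversed(slice_):
            if home_addr in migrated:
                continue  # an older version of this word: skip it
            home[home_addr] = word
            migrated.add(home_addr)
            writes += 1
    slices.clear()  # the OOP space can now be reused
    return writes


slices = [
    [(0x10, "v1"), (0x18, "a")],
    [(0x10, "v2")],
    [(0x10, "v3"), (0x20, "b")],
]
home = {}
writes = garbage_collect(slices, home)
print(writes, home[0x10])  # 3 v3
```

Five packed words collapse into three home-region writes in this example, because the two older versions of address 0x10 are never migrated.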

Handling Crash Consistency Upon Failures

[Diagram labels: Processor Cache; Memory Controller; Mapping Table; OOP Data Buffer; Eviction Buffer; NVM (Home Region, OOP Region); OOP Block (Head), OOP Block (Head), …; store and load paths]

Put It All Together

[Diagram labels: two cores, each with an L1 Cache; Last-Level Cache; Memory Controller; Mapping Table; OOP Data Buffer; Eviction Buffer; NVM (Home Region, OOP Region); load and store paths, each with a cache miss]

HOOP Implementation

Processor Simulator: McSimA+: OoO cores, 2.5GHz, 32KB L1, 256KB L2, 2MB LLC
NVM Simulator: Read/Write = 50/150ns, 512GB

Evaluation Benchmarks

Synthetic Workloads: Vector, HashMap, Queue, RB-Tree, B-Tree
Real-world Workloads: YCSB, TPCC

Improving Transaction Throughput with HOOP

[Figure: Normalized Speedup (y-axis, 0 to 2.5) for Vector, Queue, RBTree, Btree, HashMap, YCSB, TPCC. Series: Optimized Redo, Optimized Undo, Optimized Shadow Paging, Log-Structured NVM, Logless Atomic Durability, HOOP, Ideal.]

HOOP is close to the performance of a system without any persistence enforcement.

Reducing Critical-Path Latency with HOOP

[Figure: Normalized Latency (y-axis, 0 to 2.5) for Vector, Queue, RBTree, Btree, HashMap, YCSB, TPCC. Series: Ideal, Optimized Redo, Optimized Undo, Optimized Shadow Paging, Log-Structured NVM, Logless Atomic Durability, HOOP.]

HOOP achieves the lowest latency compared to state-of-the-art approaches.

Reducing Write Traffic with HOOP

[Figure: Normalized Write Traffic (y-axis, 0 to 3) for Vector, Queue, RBTree, Btree, HashMap, YCSB, TPCC. Series: Ideal, Optimized Redo, Optimized Undo, Optimized Shadow Paging, Log-Structured NVM, Logless Atomic Durability, HOOP.]

HOOP reduces write traffic by up to 2.1x compared to logging approaches.

HOOP Summary

1.7x Performance Speedup for Data-Intensive Apps

2.1x Reduction of Write Amplification

Thanks!

University of Illinois at Urbana-Champaign

Miao Cai Chance Coats Jian Huang

Systems Platform Research Group