Non-volatile Memory Logging - PGCon€¦ · • Combination of existing technologies -- DRAM, SSD,...

Preview:

Citation preview

Non-volatile Memory (NVM) Logging

2016.05.20Takashi HORIKAWA

1

Who am IName

Takashi HORIKAWA, Ph. D.

Research interestsPerformance evaluation of computer & communication systems, including

performance engineering of IT systemswith slightly shifting the focus of the research to CPU scalability

PapersLatch-free data structures for DBMS: design, implementation, and evaluation, SIGMOD ‘13An Unexpected Scalability Bottleneck in a DBMS: A Hidden Pitfall in Implementing Mutual Exclusion, PDCS ‘11An approach for scalability-bottleneck solution: identification and elimination of scalability bottlenecks in a DBMS, ICPE '11A method for analysis and solution of scalability bottleneck in DBMS, SoICT '10

2

Contents

• Introduction• Problems to be solved• Implementation• Evaluation• Technical trends• Conclusion

3

Write Ahead LoggingSync. vs Async. CommitDifference in PerformanceFundamental idea for NVM LoggingByte or Block Addressable NVMByte addressable NVMs

Introduction

4

Write Ahead Logging

Worker process

Shared buffer WAL buffer

Asynchronous Write Synchronous Write

Data files WAL files

Storage

Worker processWorker processWorker process

Memory

TableIndex

XLog

Transactionprocessing

Widely used method to make transaction durable

5

XLog recordTable, Index

Write Ahead Logging

Worker process

Shared buffer WAL buffer

Asynchronous Write Synchronous Write

Data files WAL files

Storage

Worker processWorker processWorker process

Memory

TableIndex

XLog

Write latency is included in transaction response

Overhead

Transactionprocessing

Widely used method to make transaction durable

6

XLog recordTable, Index

Sync. vs Async. Commit

7

Transaction processing WAL write (I/O)

t

Transaction becomes durable

Sync. vs Async. Commit

8

Transaction processing WAL write (I/O)

t

Notifying the client of the transaction commit

Sync. commit

Transaction becomes durable

In sync. committhe client is notified of the transaction commit after the transaction becomes durable.

Durability : OKResponse : Slow

Sync. vs Async. Commit

9

Transaction processing WAL write (I/O)

t

Notifying the client of the transaction commit

Async. commit

Transaction becomes durable

In async. committhe client is notified of the transaction commit before the transaction becomes durable.

Durability : NGResponse : Fast

Difference in Performance

10

0 40 80 120 1600

10000

20000

30000

40000

50000

ClientsTh

roug

hput

Async

Sync

0 40 80 120 1600

10000

20000

30000

40000

50000

Clients

Thro

ughp

ut (T

PS)

Async

Sync

(PGBENCH)

Disk-drive cache off Disk-drive cache on

Fundamental idea for NVM Logging

Worker process

Shared buffer WAL buffer

Asynchronous Write Synchronous Write

Data files WAL files

Storage

Worker processWorker processWorker process

Memory

TableIndex

XLog

Transactionprocessing

Non-volatile

11

XLog recordTable, Index

Volatile

Fundamental idea for NVM Logging

Worker process

Shared buffer WAL buffer

Asynchronous Write Asynchronous Write

Data files WAL files

Storage

Worker processWorker processWorker process

Memory

TableIndex

XLog

Transactionprocessing

Volatile

12

XLog recordTable, Index

XLog record is stored as non-volatile memory

NVM

Non-volatile

Byte or Block Addressable NVM

13

CPU

Memory

Store instruction

Storage

HD, SSD

I/O operation

Accessed in units of byte

Accessed in units of block

Byte addressable NVM

Block addressable NVM(Commonly used already)

(Expected to be used from now on)

Key device in NVM Logging

Byte addressable NVMs• Combination of existing technologies -- DRAM, SSD, battery (NVDIMM)

– AgigA Tech (Micron Technology)– Viking Techonogy– SK Hynix

• Use of a new memory cell (Storage Class Memory)– Phase Change Memory (PCM)– Magnetic Random Access Memory (MRAM)– Ferroelectric RAM (FRAM)– The memristor

14

Capacity

Small

Large

Access time

~100nS

~μS

Ready-to-use

Near future (?)

Availability

Fundamental idea for NVM Logging (Again)It is not simple than it looksNecessary condition for RecoveryProblems

Partial writeUnreachable XLog RecordCPU cache effect

Problems to be solved

15

Fundamental idea for NVM Logging

Worker process

Shared buffer WAL buffer

Asynchronous Write Asynchronous Write

Data files WAL files

Storage

Worker processWorker processWorker process

Memory

TableIndex

XLog

Transactionprocessing

Volatile

16

XLog recordTable, Index

XLog record is stored as non-volatile memory

NVM

Non-volatile

(Again)

It is not simple than it looks

• Naive implementation of NVM Logging– Allocating WAL buffer in NVM area– Using asynchronous commit mode

• Problems are:– Partial write– Unreachable XLog Record– CPU cache effect

17

Not sufficient

Necessary condition for Recovery

• If the recovery process reads all XLR of transaction m correctly

transaction m is possible to be recovered

• Else transaction m will be lost

18

LSN

XLRm,N-2 XLRm,N-1 XLRm,NXLog

A worker process finishes the commit of the transaction after all XLRs of the transaction are stored in the non-volatile memory.

Partial write

• The recovery process will read an incomplete XLRif the system crashes in the middle of writing a XLR

19

XLR 1

WAL buffer

LSN

(CRC in the XLR may be effective but it is not perfect)

Unreachable XLog Record

• The recovery process cannot find a XLR of a committed transaction– A worker process finishes writing of XLR 3 before

another worker process begin to write XLR 2

20

XLR 1

WAL bufferLSN

XLR 2 XLR 3

Length1

XLog reader finds the head position of a XLR by adding the head position of the previous XLR and its length

CPU cache effect

• The recovery process will read inconsistent XLR

21

CPU

Memory

Cache

Inconsistent

If the CPU internal cache uses write-back policy, a XLR written by the CPU does not reach to the memory immediately

Prototype architecturePreventing partial writePreventing an unreachable XLRUse of Write-combined modeGUC parameter for NVM LoggingAccessing NVM at recoveryWrap around of WAL buffer

Implementation

22

Prototype architecture

23

File system

PostgreSQL

open mmap

WAL buffer

Pseudo NVM(pram)

Shared buffer

userkernel

ext4 ext3 xfsext2pm

Modified

Newly added

Kernel module

Similar to RAM disk but CPU cache mode differs

(9.6devel at the middle of March)

Preventing partial write

24

XLog Record

Tail

Start

Reserve

Write data

t

LSNWrite len

New tail

1. Move the tail pointer to reserve buffer area

2. Write the XLR data other than length field

3. Write length field of the XLR

( At this point length field of XLR in XLog buffer is 0)

If XLog reader find a XLR whose length field is not zero,all XLR data is written in the XLog buffer.

Preventing an unreachable XLR

25

XLR 1Start

Reserve 1

tLSN

Write 1

Reserve 2

Write 2

Reserve 3

Write 3

Reserve 4

Write 4

XLR 2

XLR 3

WP waits until commit becomes able to be finished

WP finishes the commit as all previous XLRs are written

Write

Write Write

Write

Wait control

• A wait mechanism is already implemented in PostgreSQLstatic XLogRecPtr

WaitXLogInsertionsToFinish(XLogRecPtr upto)

If any XLR with a smaller LSN than the upto parameter is not finished to copy in the WAL buffer, the worker process sleeps until all of those XLRs are copied in the WAL buffer.

NVM Logging implements the wait mechanism by using this function

26

Use of Write-combined mode

27

File system

PostgreSQL

open mmap

WAL buffer

Pseudo NVM(pram)

Shared buffer

userkernel

ext4 ext3 xfsext2pm

Kernel module

Write-combined mode, which is a variation of write-through, is set for the memory pages for pseudo NVM.

GUC parameter for NVM Logging

• NVM Logging is enabled through one GUC parameter (described in postgresql.conf)

– PRAM_FILE_NAME = “NVM File name”

• When PRAM_FILE_NAME is set– XLOGShmemInit() invokes open() and mmap() to

“NVM File name” and uses the memory area for WAL buffer

– CopyXLogRecordToWAL() copies XLR in WAL buffer according to the procedure that prevents partial write

28

Accessing NVM at recovery

29

XLogReader

XLog buffer in NVM

n n+1 n+2 n+3n-1n-2

minLSN maxLSN

Segment number

WAL files

XLogReader accesses NVM XLog buffer to obtain XLog records whose LSN is between minLSN and maxLSN

LSN

Wrap around of WAL buffer

30

Logical view

LSN

NVM size

Saved in XLog files

InitializedUpto

Written in NVM only

EvaluatedUpto Current write point

Wrap around of WAL buffer

31

Logical view

LSN

NVM size

Saved in XLog files

InitializedUpto

Written in NVM only

EvaluatedUpto

Physical viewNVM size

InitializedUptoCurrent write point

EvaluatedUpto

LSN

WAL segment

Current write point

(n-1) th round

n th round

Experimental SetupPerformance

PGBENCHDBT-2

DurabilityDurability testResult

Write amplification Reduction

Evaluation

32

Experimental setup• DB server

– CPU: E5-2650 v2 x 2 (16 cores)– Memory: 64GB– Storage

• RAID0: 200GB SSD x 2 for data• RAID0: 1TB ATA HD x 4 for WAL

• Client– CPU: E7420 x 4 (16 cores)– Memory: 8GB

• Network– GB ether x 1

33

PGBENCH Performance

34

0 40 80 120 1600

10000

20000

30000

40000

50000

ClientsTh

roug

hput

(TPS

)

Async

NVM Logging

Sync

0 40 80 120 1600

10000

20000

30000

40000

50000

Clients

Thro

ughp

ut (T

PS)

Async

NVM

Log

ging

Sync

Disk-drive cache off Disk-drive cache on

DBT-2 Performance

35

0 10 20 30 40 500

100000

200000

300000

400000

Clients

Thro

ughp

ut (N

OTP

M)

Async

NVM Logging

Sync

0 10 20 30 40 500

100000

200000

300000

400000

ClientsTh

roug

hput

(NO

TPM

)

Async

NVM Loggingg

Sync

Disk-drive cache off Disk-drive cache on

Write amplification Reduction

36

Page of the storage

WriteWriteWriteWriteWrite

Write

The tail WAL block is written in at every commit

The tail WAL block is written in once

Sync.commit

NVM Logging

Durability test

37

table2table1

DBMS

key value(=Kp)

while (1) {BeginSelect value From table1 Where key = Kp(v = value + 1)Update table1 set value = v Where key = Kp

(… same for table 2)End(record v as committed transaction)

}

Fault

Transaction

Clientp

After the recovery, durability is examined by checking whether the value of table1 and table2 is equal andvalue is equal to or greater than v that each client recorded as the result of the last transaction.

Results

• The results were just what we expected

– Durability is ensured• Sync. commit, NVM Logging

– Durability is not ensured• Async. commit

38

NVDIMM for DB serversProgramming support for NVM

Technical trends

39

NVDIMM in DB Servers

• NVDIMM-N Standardization– JEDEC Hybrid Memory Task Group– SNIA NVDIMM SIG

• Server product: HP ProLiant XL230a Server– Up to 2 Intel® Xeon® E5-2600 v3 Series, 6/8/10/12/14/16

Cores (16) – DDR4, (512GB max), support for NVDIMM– …

40

http://community.hpe.com/t5/Servers-The-Right-Compute/Address-your-Compute-needs-with-HP-ProLiant-Gen9/ba-p/6794213#.VybkulK3GA9

DB server with NVDIMM is just around the corner!!

Programming support for NVM

• pmem.io– The Linux NVM Library builds on the Direct Access

(DAX) changes under development in Linux.– This project focuses specifically on how persistent

memory is exposed to server-class applications which will explicitly manage the placement of data among the three tiers (volatile memory, persistent memory, and storage).

41

http://pmem.io/

Conclusion

• NVM is becoming commodity– NVDIMM is already shipped as a product– Servers began to equipped with NVDIMM

• Benefits of NVM Logging– Performance improvement– Durability ensurance– Write amplitude reduction

42

Similar to sync. commit

Almost the same as async. commit

Good for SSD lifetime

Future work

• Bring to a state acceptable for the mainline – Cope with standard for NVM access

• libpmem is a promising candidate

– Check the operation in using real NVM

43

44

That’s it

Thank you for listening

Recommended