英文目录标题 :35-40pt 颜色 : R153 G0 B0 内部使用字体 : FrutigerNext LT Medium...

Scalable MVCC solution for

Many core machines

- Dilip Kumar (dilip.kumar@huawei.com)- Nirmala S

(nirmalas@huawei.com)- Zhengxiaojin

(zhengxiaojin@huawei.com)- Renchaoqian

(renchaoqian@huawei.com)

Content

Concept of MVCCConcept of MVCC22

MVCC in PGMVCC in PG

Lock Free CSN SolutionLock Free CSN Solution

ConclusionConclusion55

Multicore Scalability issueMulticore Scalability issue11

Multicore Scalability issue – TPCC test

For TPCC test using PG 9.4 on a machine with 120 cores, peak tpmc is hit at 25 terminals. After this increase in terminals reduces the performance. I/O and CPU are under utilized and maximum number of process are in idle state.

0 20 40 60 80 100 120 1400

100000

150000

200000

250000

TPCC tpmc results

terminals

0 20 40 60 80 100 120 14002468

101214161820

%CPU usage

ternimals

0 20 40 60 80 100 120 1400.0

RunQueue

terminals

sda2 sd

Disk %Busy linuxAvg. WAvg.

Multicore Scalability issue - Bottleneck Analysis

On analyzing stack for multiple concurrent terminals, it is evident that ProcArrayLock contention is increasing with number of terminals. Contention starts increasing steeply after 30 terminals and stays in constant state. This lock accounts for 80% of contention in the whole system (GetSnaphsotData and ProcArrayEndTransaction)

TPCC Lock Wait Distribution

GetSnapshotDataProcArrayLock LW_SHARED

ProcArrayEndTransac-tionProcArrayLock LW_EXCLUSIVE

XLogFlushWALWriteLockLW_EXCLUSIVE

WALInsertLock

Concept of MVCCMVCC is a locking scheme commonly used by database

implementations to control fast, safe and concurrent access to data.

MVCC is designed to provide following features for concurrent access• Readers do not block writers• Writers do not block readers

When an object has to be overwritten, it is marked obsolete and a new

copy of it is written by the writer. Reader reads the current version and

writer affects a future version of data. This ensures that readers do not

block writers. Since writer is working on a different copy of data it does

not block other readers.

MVCC is generally implemented using transactionID or some

timestamp mechanism. This is stored in each tuple so at run time, a

decision can be made on the visibility of the tuple for any user given

MVCC in PGMVCC is implemented in PG using transactionID. Each tuple maintains a xmin(XID of

transaction that created it) and a xmax (XID of transaction that deleted it). A snapshot is

taken at the beginning of a SQL statement irrespective of the isolation level . The main

purpose of taking a snapshot is to • highest-numbered committed transaction• Lowest-numbered running transaction• Transactions that are currently executing

Any tuple is said to be visible, if

If xmin is <= highest-numbered committed transaction and xmin does not present in

running transaction list, means changes of xmin is visible

1. If tuple has Xmax and xmax of the tuple is > highest-numbered committed

transaction or xmax present in running transaction list, means changes of

xmax changes are not visible and tuple is visible.

2. If Tuple don’t have Xmax then its visible.

MVCC in PGAll other tuple are not visible to current SQL.

Transaction information for a backend is stored in a data structure

named Pgxact which is protected by a latch(LWLock) named

ProcArracLock.

MVCC in PGDuring snapshot creation, for getting all the running transaction

information, a process takes Shared latch on ProcArrayLock

(GetSnapshotData)

When a transaction is committing, it takes an exclusive latch on

ProcArrayLock (ProcArrayEndTransaction) to update shared global variables.

When transactions – starting and ending try to use ProcArrayLock,

contention ensues and scalability reduces.

Lock Free CSN solution

Main purpose of fetching all running transaction information is to know, which transaction

committed before taking Snapshot and which after taking snapshot.

In order to remove bottleneck, it is necessary that either of GetSnapshotData or

ProcArrayEndTransaction do not take latch. Since ProcArrayEndTransaction needs to

update shared variables, removing latch is impossible.

Each transaction when committing, can increment a global number –

CommitSequenceNumber(CSN). Any transaction which is starting, will note down the

current CSN number of the system during its snapshot creation – this can be termed as

SnapshotCSN.

Now to check the visibility of any tuple, it is enough to see if CSN associated with its

xmin and xmax is less than SnapshotCSN.

By effectively having an integer(CSN) which is shared across all

backends, we can remove traversing Pgxact structure and thereby

removing the need to acquire ProcArrayLock in shared mode.

One of the obvious problems for the CSN based solution is how to

maintain a map between CSN and XID. This data structure should have

the following properties

• Searching should be efficient

• Read/Write should be lock-free

We decided to select a circular array, in which we can directly get the XID slot using

XID%ARRAY_SIZE. This ensures that there is no extra searching cost.

• XID to CSN mapping is stored in circular array called

Dense Map.

• Considering the high scalability in high end machines ,

size of the Dense Map is selected very high ~2GB which

can hold almost 200Millions of transactions.

• Basic assumption of keeping the size big is that when a

slot has to reused because of % operation, the older XID

may already be committed and its CSN value should be

smaller than any active Snapshot CSN.

Each Slot maintain the Map b/w XID to CSN,

XID slot can be accessed directly by XID

Size of This Circular Array is taken quite big ~2GB which can hold more than 200 Million transactions

Lock Free operations

Now major challenge is to keep the access to densemap lock free, b/w the reader (getting

the CSN value for a transaction) and writer(reusing the slot and changing the transaction

• Writer

• Writer will simply use the slot and overwrite the new XID without any lock.

• Reader

• Read Slot XID (Match XID with densemap Slot XID)

• Read Slot CSN

• Read Slot XID again (Rematch the XID, if matches than CSN value can be used.

Otherwise some write has overwritten the XID, then start by reading CSN again)

Overflow handling• Generally this scenario should happen very rarely.

• But when happen we move such transaction to a secondary map, called sparse map.

Sparse map is implemented using PG hash tables and has partition locks.

• Writers of to the sparse map active exclusive locks, and reader take shared locks.

• If a transaction is active, its XID should fall in either of densemap or sparsemap. If it is not

found in both, then the transaction is visible to all active transaction.

When starting new transaction, If

XID concurrent Move To sparse

and Set a Bit

Lock free CSN – Start transaction• Get a slot for this Transaction from

the dense MAP using XID % SIZE

• If there is Old XID in Slot and its

concurrent (either running or CSN

is not smallest than global

snapshot CSN min)

• Move to Sparse Map• Take Sparse Map Lock

• Copy Slot to Sparse Map

• Release Lock

• Set Bit for XID in XID Map

• Reuse the Slot.

Lock free CSN – End Transaction

• Get a slot for this Transaction from the

dense MAP using XID % SIZE

• Generate New CSN(Increment Global

CSN value)

• Assign SUB_CSN value to all sub

transactions.

• Assign CSN value to main transaction.

• Assign CSN value to all sub

transactions.

• In case of abort Assign special CSN

called ABORT_CSN.

Lock Free CSN – Visibility Check

• For checking a visibility of a operation done

by transaction (XID), first find the slot for the

XID, using XID%DenseMapSize.

• If XID matches, read CSN value and check

the XID again.

• If XID not match then check the ByteMAP, if

set than search in Sparse Map.

• Compare the Slot CSN with Snapshot CSN

• If Slot CSN > Snapshot CSN changes are

not visible

• Else Changes are visible.

Lock free CSN - Performance Results

Performance Comparison of Lock Free CSN with Base Code in TPCC Testbase : Base code performance of PG9.4 with best configuration

CSN : Lock Free CSN code performance

CSN_1 : Lock Free CSN performance by configuration change

Result:• fsync=off, peak terminal=70 and peak tpmC=687,325, improve 191%

• fsync=on, peak terminal=75 and peak tpmC=525,862, improve 316%

Conclusion

• From last two experiment we can observe that now ProcArrayLock Bottleneck

is completely removed from the system.

• Now new bottlenecks are getting uncovered like bufferLocks, Clog Partition

Locks.

• This is clearly visible from TPCC experiment, where by changing some

configuration , it can scale further, but same was not possible with base code

as it was already hitting earlier bottleneck (ProcArrayLock).

Thank You

英文目录标题 :35-40pt 颜色 : R153 G0 B0 内部使用字体 : FrutigerNext LT Medium...

Documents

: R153 G0 B0 100G - data.proidea.org.pldata.proidea.org.pl/plnog/10edycja/materialy/prezentacje/Building... · NE40E-X16: Comes with Cluster & 480G 480G~1T/slot ... R153 G0 B0 : LT

13. :32-35pt : R153 G0 B0 : FrutigerNext LT Medium : Arial :30-32pt : R153 G0 B0 : :20-22pt (2-5 ) :18pt : FrutigerNext LT Regular : Arial

TITEL SLIDE VERSIE 1 CALIBRI BOLD 40pt

SKINTOP - mdmetric.com · How to order: Please preface each LAPP part number with 'R153-' Example (using the part number): R153-602002. SKINTOP

Huawei PDF資料 - soumu.go.jp · PDF fileHUAWEI TECHNOLOGIES CO., LTD. 35pt : R153 G0 B0: LT Medium: Arial 32pt : R153 G0 B0 ... MIMO max.: eMU-MIMO DL 8-stream/ UL 4-stream

Presentation title is Arial Bold 40pt on up to four lines...Title: Presentation title is Arial Bold 40pt on up to four lines Author: Fernandez, Marissa Created Date: 8/6/2017 8:55:27

Presentation Title, Verdana Bold 40pt for Global Public Safet… · Presentation Title, Verdana Bold 40pt Option 1 1 Center for Global ... Funds will support the 2018 iGEM team: •

Title Slide Arial 40pt Presenter Name | Date | Georgia Regular 16pt

Pre Design Report Replacement Site og... · 2017-08-23 · HUAWEI TECHNOLOGIES CO., LTD. 35pt : R153 G0 B0: FrutigerNext LT Medium: Arial 32pt : R153 G0 B0 黑体 22pt) :18pt 黑色:

Document Title, 40pt White RockLogic · 2017-04-18 · Document Title, 40pt White CONTROLS & AUTOMATION RockLogicTM Intelligent rockbreaker control for most boom systems. ... Auto

20 pt 30 pt 40 pt 50pt 10 pt 20 pt 30 pt 40 pt 50 pt 10 pt 20 pt 30 pt 40pt 50 pt 10pt 20pt 30 pt 40 pt 50 pt 10 pt 20 pt 30 pt 40pt 50 pt 10pt Exploration

Cover Title – Arial Bold, 40pt - eefocus

Presentation title in Calibri Light 40pt

문서의 제목 나눔명조R, 40pt - contents.kocw.netcontents.kocw.net/KOCW/document/2014/dongguk/choyoungsuk1/11.pdf · 13 6.9 예제 : 합병(merge)과 합병정렬(merge sort)

Cover Title – Arial Bold, 40pt...STM8. Управление прерываниями(1/4) До 32 вектора прерываний Как минимум 1 прерывание

HUAWEI TECHNOLOGIES CO., LTD. :32-35pt : R153 G0 B0 : FrutigerNext LT Medium : Arial :30-32pt : R153 G0 B0 : :20-22pt (2-5 ) :18pt : FrutigerNext LT Regular

Title 40pt sentence case Welcome - Mbed · Title 40pt sentence case Bullets 24pt sentence case Sub-bullets 20pt sentence case Morning Agenda 09.30 – 09.40 Opening Remarks David

문서의 제목 나눔명조R, 40pt - KOCWcontents.kocw.net/KOCW/document/2014/dongguk/... · 요리재료 준비나 재료구입업무의 단순화가 요구됨 : 식재료의 교차활용(Cross-Utilization)

DWDM - OTN/ROADMDWDM - OTN/ROADM Be smart when you plan your Network MSc, Eng. Network Implementation Engineer 英文标题:32-35pt 颜色: R153 G0 B0 内部使用字体 : 外部使用

: R153 G0 B0 LTE Status & Benefits - sps.silhan.netsps.silhan.net/sps/skola/4it/lte-huawei.pdfLTE Status & Benefits Zbynek Pardubsky, Huawei Technologies TINF 2011 . HISILICON SEMICONDUCTORHUAWEI