© 2010 Ippokratis Pandis Aether: A Scalable Approach to Logging VLDB 2010 Ryan Johnson †‡ Ippokratis Pandis †‡ Radu Stoica Manos Athanassoulis Anastasia Ailamaki †‡ †Carnegie Mellon University ‡École Polytechnique Fédérale de Lausanne @ Carnegie Mellon Databases


© 2010 Ippokratis Pandis

Aether: A Scalable Approach to Logging

VLDB 2010

Ryan Johnson†‡, Ippokratis Pandis†‡, Radu Stoica‡, Manos Athanassoulis‡, Anastasia Ailamaki†‡

†Carnegie Mellon University ‡École Polytechnique Fédérale de Lausanne

@ Carnegie Mellon Databases


Scalability is key!

• Modern hardware needs software parallelism
• OLTP is inherently parallel at the request level
  • Very good at providing high concurrency
  • But internal serializations limit execution parallelism

Need for scalable OLTP components

[Figure: HW contexts per chip (0 to 16) by year, for Pentium, Itanium, Intel Core2, UltraSparc, IBM Power, and AMD processors]


Logging is crucial for OLTP

• Fault tolerance (e.g., Amazon outage*)
  • Crash recovery
  • Transaction abort/rollback
• Performance
  • Log changes for durability (no in-place updates)
  • Write dirty pages back asynchronously

$$$

Need efficient and scalable logging solution

* http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/


Logging is a bottleneck for scalability

(1) At commit, must yield for log flush
  • Synchronous I/O on the critical path
  • Locks held for a long time
  • Two context switches per commit

(2) Must insert records into the log buffer
  • Centralized main-memory structure
  • Source of contention

Working around the bottlenecks:
  • Asynchronous commit
  • Replace logging with replication and fail-over

[Figure: system diagram with CPU-1 through CPU-N and their L1/L2 caches, Data and Log; layers labeled CPU, RAM, HDD]

Workarounds compromise durability


Does “correct” logging have to be so slow?

• Locks held for long time
  • Not actually used during the flush
  • Indirect way to enforce isolation
• Two context switches per commit
  • Transactions nearly stateless at commit time
  • Easy to migrate transactions between threads
• Log buffer is source of contention
  • Log orders incoming requests, not threads
  • Log records can be combined

No! Aether: uncompromised, yet scalable logging


Agenda

• Logging-related problems
• Aether logging
  • Reducing lock contention
  • Reducing context switching
  • Scalable log buffer implementation
• Conclusions


Bottleneck 1: Amplified lock contention

[Figure: timeline of Xct 1 and Xct 2 around commit ("Commit", "Done!"); legend: Working, Lock Mgr., Log Mgr., I/O Waiting]

Other transactions wait for locks while the log flush I/O completes


Early Lock Release

In case of a single log:
• Finish transaction
• Release locks before commit
• Insert transaction commit record
• Wait until log record is flushed
• Dependent xct serialized at the log buffer

No extra overhead; the idea has been around for 30 years

…but nobody uses it so far…

With ELR, other transactions do not wait for locks held during log flushes
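The ELR steps can be illustrated with a minimal, single-threaded sketch. It is not the Shore-MT implementation: the names `Log`, `commit_with_elr`, and `row_lock` are hypothetical, and the sketch assumes the commit record is inserted before the locks are released, so that a dependent transaction's commit record lands later in the single log.

```python
import threading


class Log:
    """Toy single log: append returns an LSN; flush makes a prefix durable."""

    def __init__(self):
        self.records = []
        self.durable_lsn = 0          # records with LSN <= this are "on disk"

    def append(self, record):
        self.records.append(record)
        return len(self.records)      # LSN = 1-based position in the log

    def flush(self):
        self.durable_lsn = len(self.records)


def commit_with_elr(log, locks, xct_name):
    """Insert the commit record, then release locks without waiting for the flush."""
    commit_lsn = log.append(("commit", xct_name))
    for lock in locks:                # early release: the flush has not happened yet
        lock.release()
    return commit_lsn                 # report success only once durable_lsn >= commit_lsn


log = Log()
row_lock = threading.Lock()

row_lock.acquire()                    # Xct 1 holds the lock while it works
lsn1 = commit_with_elr(log, [row_lock], "xct1")

# Xct 2 can take the same lock although Xct 1 is not durable yet.
acquired = row_lock.acquire(blocking=False)
lsn2 = commit_with_elr(log, [row_lock], "xct2")

durable_before = log.durable_lsn
log.flush()                           # one flush covers both commit records
```

Because both commit records sit in one log and `lsn1 < lsn2`, Xct 2 cannot become durable before Xct 1 in this sketch, which is the serialization at the log buffer that the slide relies on.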


ELR benefits

• Sun Niagara T2 (64 HW contexts), 64GB RAM
• Mem. resident TPC-B in Shore-MT
• Zipfian distribution on transaction inputs

[Figure: speedup (ticks at 1, 10, 100) vs. data access skew (zipfian s parameter, 0.0 to 5.0); curves labeled 10000 us (slow disk), 1000 us (fast disk), 100 us (flash), 0 us (memory)]

ELR is simple and sometimes very useful


Bottleneck 2: Excessive context switching

• One context switch per log flush
• Pressure on the OS scheduler

[Figure: timeline of Xct 1 and Xct 2 with a context switch at commit; legend: Working, Log Mgr., I/O Waiting]

[Figure: CPU utilization (%) and context switches vs. number of clients (0 to 60); series: CPUs utilized, context switches. Sun Niagara T2 (64 HW contexts), mem. resident TPC-B in Shore-MT]

Must decouple thread scheduling from log flushes


Flush Pipelining

• Scheduler is in the critical path and wastes CPU
  • Multi-core HW only amplifies the problem
• But transactions are nearly stateless at commit
  • Detach transaction state from worker thread
  • Pass it to log writer
• Worker threads do not block at commit time

[Figure: timeline of Thread 1, Thread 2, and the Log Writer handling Xct 1 through Xct 4]

Staged-like mechanism = low scheduling costs
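The hand-off to the log writer can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' code: the names `log_writer`, `worker`, `pending`, and `committed` are hypothetical, transactions are represented by plain strings, and the actual log flush I/O is replaced by a comment.

```python
import queue
import threading


def log_writer(pending, committed):
    """Batch commit requests, flush once per batch, then finish the transactions."""
    while True:
        batch = [pending.get()]               # block until at least one request arrives
        while True:                           # drain whatever else is already queued
            try:
                batch.append(pending.get_nowait())
            except queue.Empty:
                break
        stop = None in batch                  # None is the shutdown sentinel
        xcts = [x for x in batch if x is not None]
        # A real system would issue one log flush (I/O) here for the whole batch.
        committed.extend(xcts)                # complete the detached transactions
        if stop:
            return


def worker(name, n_xcts, pending):
    for i in range(n_xcts):
        xct = f"{name}-xct{i}"                # ... the transaction's work happens here ...
        pending.put(xct)                      # hand off at commit; never wait for the flush


pending = queue.Queue()
committed = []
writer = threading.Thread(target=log_writer, args=(pending, committed))
writer.start()

workers = [threading.Thread(target=worker, args=(f"t{k}", 3, pending)) for k in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()

pending.put(None)                             # all work submitted: ask the writer to stop
writer.join()
```

Only the log writer ever waits on the (simulated) flush, so worker threads move straight on to their next transaction instead of being descheduled at each commit.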


Impact of Flush Pipelining

Sun Niagara T2 (64 HW contexts), mem. resident TPC-B in Shore-MT

[Figure: throughput vs. number of clients (0 to 60); series: Flush Pipelining, Asynchronous commit, Baseline]

[Figure: CPU utilization (%) and context switches vs. number of clients (0 to 60); series: Base - CPUs, FlushP - CPUs, Base - Ctxs, FlushP - Ctxs]

Match Asynchronous Commit throughput without compromising durability


Bottleneck 3: Log buffer contention

[Figure: timeline of Xct 1, Xct 2, and Xct 3; legend: Working, Log Mgr., I/O Waiting, Log Buffer Latch Waiting]

Centralized log buffer

Contention, which depends on
• the number of participating threads
• the size of modifications (KiB in case of physical logging)


Eliminating critical sections

Inspiration: elimination-based backoff*
• Critical sections can cancel each other out
• E.g., stack push/pop operations

How elimination-based backoff works:
• Attempt to acquire mutex
• If failed, back off, waiting on an array
• If someone else already waits there, eliminate requests w/o acquiring mutex

[Figure: push() and pop() requests meeting in a station area next to the stack]

Adapt elimination-based backoff for db logging

* D. Hendler, N. Shavit, and L. Yerushalmi. “A Scalable Lock-free Stack Algorithm.” In Proc. SPAA, 2004


Accessing the log buffer

Break log insert into three logical steps:
(a) Reserve space by updating head LSN
(b) Copy log record (memcpy)
(c) Make insert visible by updating tail LSN, in LSN order

Steps (a) + (c) can be consolidated
• Accumulate requests off the critical path
• Send only group leader to fight for the critical section

Move (b) out of critical section

[Figure: log insert timeline split into steps (a), (b), (c)]
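Moving step (b) out of the critical section can be sketched as below. This is a simplified illustration, not Aether's implementation: `DecoupledLogBuffer` and `run` are hypothetical names, the buffer has no wraparound or flushing, and step (c) uses a condition variable for readability where a real system would more likely spin or use atomic operations.

```python
import threading


class DecoupledLogBuffer:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.head = 0                       # next free byte; advanced in step (a)
        self.tail = 0                       # bytes below tail are visible; advanced in step (c)
        self.mutex = threading.Lock()       # protects only the reservation
        self.released = threading.Condition()

    def insert(self, record: bytes) -> int:
        # (a) Reserve space: the only work done while holding the mutex.
        with self.mutex:
            start = self.head
            end = start + len(record)
            if end > len(self.buf):
                raise MemoryError("toy buffer full (no wraparound in this sketch)")
            self.head = end
        # (b) Copy the record outside the critical section.
        self.buf[start:end] = record
        # (c) Publish in LSN order: wait until all earlier inserts are visible.
        with self.released:
            self.released.wait_for(lambda: self.tail == start)
            self.tail = end
            self.released.notify_all()
        return start


def run(buf, k, n):
    for i in range(n):
        buf.insert(f"{k}:{i:03d};".encode())   # 6-byte records such as b"2:017;"


buf = DecoupledLogBuffer(capacity=4096)
threads = [threading.Thread(target=run, args=(buf, k, 50)) for k in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

records = bytes(buf.buf[:buf.tail]).split(b";")[:-1]
```

The time spent holding `mutex` no longer depends on the record size, since the copy happens after the lock is dropped; the number of threads competing for it is unchanged, which is the part consolidation addresses.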


Design evolution

[Figure: per-thread timelines for four designs: (B) Baseline, (C) Consolidation array, (D) Decoupled buffer insert, (CD) Hybrid design; legend: Mutex held, Start/finish, Copy into buffer, Waiting]

contention(work) = O(1)

contention(# threads) = O(1)

Decouple contention from the # of threads and average log entry size
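The consolidation idea (accumulate requests, send only a group leader into the critical section) can be shown with a deliberately single-threaded simulation. Everything here is an assumption made for illustration: `ConsolidationSlot`, `LogBuffer`, and their methods are hypothetical, and a real consolidation array would use atomic operations on several slots rather than plain Python lists.

```python
class ConsolidationSlot:
    """One slot of a hypothetical consolidation array: requests pile up here."""

    def __init__(self):
        self.sizes = []

    def join(self, size):
        """Add a request to the group; returns (is_leader, offset inside the group)."""
        offset = sum(self.sizes)
        self.sizes.append(size)
        return len(self.sizes) == 1, offset

    def close(self):
        """The leader closes the group and learns its total size."""
        total, self.sizes = sum(self.sizes), []
        return total


class LogBuffer:
    def __init__(self):
        self.head = 0
        self.reservations = 0               # how many times the critical section ran

    def reserve(self, size):
        """Critical section: one head update, regardless of the group size."""
        start = self.head
        self.head += size
        self.reservations += 1
        return start


log, slot = LogBuffer(), ConsolidationSlot()

# Three requests back off into the same slot; the first one becomes the leader.
joined = [slot.join(size) for size in (48, 160, 48)]

# Only the leader enters the critical section, reserving space for the whole group.
base = log.reserve(slot.close())
positions = [base + offset for _, offset in joined]
```

Three inserts cost one pass through the critical section here, which is how the number of contenders can stay roughly constant as the thread count grows.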


Performance as contention increases

Microbenchmark
• Bimodal distribution: 48B and 160B
• 120B average

[Figure: log insert rate (GB/s) vs. number of threads; series: Baseline, Decoupled (D), Consolidation (C), Hybrid (CD)]

Hybrid solution combines benefits of both
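As a quick check on the microbenchmark numbers, and assuming 48B and 160B are the only two record sizes, the stated 120B average fixes the mix. With p the fraction of 48B records (a symbol introduced here, not in the slides):

```latex
\begin{aligned}
48\,p + 160\,(1 - p) &= 120 \\
160 - 112\,p &= 120 \\
p &= \tfrac{40}{112} = \tfrac{5}{14} \approx 0.36
\end{aligned}
```

Under that assumption, roughly 36% of the inserts are 48B records and 64% are 160B records.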


Sensitivity to slot count

[Figure: throughput in MB/s (shown as color/height, up to about 1700) as a function of # slots (1 to 10) and # threads (0 to 60)]

Relatively insensitive to slot count (3 or 4 slots good enough for most cases)


Case against distributed logging

Distributing TPC-C log records over 8 logs
• 1 ms wall time, ~200 in-flight transactions, 30 commits
• Horizontal blue line = 1 log
• Diagonal line = dependency (new = black, older = grey)

Large overhead from keeping track of dependencies and over-flushing


Putting it all together

Sun Niagara T2 (64 HW contexts), mem. resident, TPC-B

[Figure: throughput (KTps) vs. # CPUs utilized (0 to 60); series: Aether, Flush Pipelining + ELR, Baseline; annotations: "+60% from Baseline", "+15%", "Gap increases w/ # threads!"]

• Eliminate current log bottlenecks
• Future-proof system against contention


Conclusions

• Logging is an essential component for OLTP
  • Simplifies recovery, improves performance without the need for physically partitioning data
  • ...but need to address all lurking bottlenecks
• Aether is a holistic approach to logging
  • Leverages existing techniques (Early Lock Release)
  • Reduces context switches (Flush Pipelining)
  • Eliminates log contention (Consolidation-based backoff)
    • Can achieve 2GB/s of log throughput per node

Thank you!