© 2010 Ippokratis Pandis
Aether: A Scalable Approach to Logging
VLDB 2010
Ryan Johnson†‡ Ippokratis Pandis †‡ Radu Stoica ‡ Manos Athanassoulis ‡
Anastasia Ailamaki †‡
†Carnegie Mellon University ‡École Polytechnique Fédérale de Lausanne
Databases @ Carnegie Mellon
Scalability is key!
Modern hardware needs software parallelism
OLTP is inherently parallel at the request level
Very good at providing high concurrency
But internal serializations limit execution parallelism
Need for scalable OLTP components
[Figure: hardware contexts per chip (0–16) over the years, for Pentium, Itanium, Intel Core2, UltraSparc, IBM Power, and AMD processors]
Logging is crucial for OLTP
Fault tolerance: crash recovery (e.g., Amazon outage*), transaction abort/rollback
Performance: log changes for durability (no in-place updates), write dirty pages back asynchronously

* http://www.datacenterknowledge.com/archives/2010/05/13/car-crash-triggers-amazon-power-outage/
Need efficient and scalable logging solution
Logging is bottleneck for scalability
Working around the bottlenecks: asynchronous commit, or replacing logging with replication and fail-over

(1) At commit, must yield for the log flush: synchronous I/O in the critical path; locks held for a long time; two context switches per commit
(2) Must insert records into the log buffer: a centralized main-memory structure and source of contention
[Figure: CPU-1 … CPU-N with private L1 (and L2) caches, shared RAM, and an HDD holding the data and the log]
Workarounds compromise durability
Does “correct” logging have to be so slow?
Locks held for a long time: not actually used during the flush; an indirect way to enforce isolation
Two context switches per commit: transactions are nearly stateless at commit time; easy to migrate transactions between threads
Log buffer is a source of contention: the log orders incoming requests, not threads; log records can be combined
No! Aether: uncompromised, yet scalable logging
Agenda
Logging-related problems
Aether logging
• Reducing lock contention
• Reducing context switching
• Scalable log buffer implementation
Conclusions
Bottleneck 1: Amplified lock contention
[Figure: timeline of Xct 1 and Xct 2 with working, lock manager, log manager, and I/O waiting phases; Xct 2 cannot commit until Xct 1 is done]

Other transactions wait for locks while the log flush I/O completes
Early Lock Release, in the case of a single log:
Finish the transaction
Release locks before commit
Insert the transaction commit record
Wait until the log record is flushed
Dependent transactions are serialized at the log buffer
No extra overhead; the idea has been around for 30 years
…but nobody has used it so far…
With ELR other transactions do not wait for locks held during log flushes
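As a rough illustration of the commit path described above, here is a minimal sketch in Python. All names and the toy `Log` class are hypothetical (the real implementation lives inside Shore-MT's lock and log managers): the commit record is inserted first, locks are released before the flush, and only the final acknowledgement waits for durability.

```python
import threading

class Log:
    """Toy in-memory log: insert assigns LSNs; flush is a stand-in for I/O."""
    def __init__(self):
        self.mutex = threading.Lock()
        self.records = []
        self.flushed_upto = -1

    def insert(self, payload):
        with self.mutex:
            self.records.append(payload)
            return len(self.records) - 1          # LSN = position in the log

    def flush_until(self, lsn):
        self.flushed_upto = max(self.flushed_upto, lsn)  # pretend I/O happened

class ElrTransaction:
    """Commit path with Early Lock Release (hypothetical structure)."""
    def __init__(self, log):
        self.log = log
        self.locks = []                           # locks held by this xct

    def commit(self):
        lsn = self.log.insert(b"COMMIT")          # insert the commit record
        for lock in self.locks:                   # release locks *before* the
            lock.release()                        # flush: dependents serialize
        self.locks.clear()                        # behind us in the log instead
        self.log.flush_until(lsn)                 # wait for durability only now
        return lsn                                # acknowledge to the client
```

Because the commit record is already in the log buffer when the locks drop, any dependent transaction commits at a later LSN and therefore cannot become durable before this one.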
ELR benefits
Sun Niagara T2 (64 HW contexts), 64GB RAM
Memory-resident TPC-B in Shore-MT
Zipfian distribution on transaction inputs
[Figure: ELR speedup (log scale, 1–100×) vs. data access skew (Zipfian s parameter, 0.0–5.0), for flush latencies of 0 µs (memory), 100 µs (flash), 1000 µs (fast disk), and 10000 µs (slow disk)]
ELR is simple and sometimes very useful
Agenda
Logging-related problems
Aether logging
• Reducing lock contention
• Reducing context switching
• Scalable log buffer implementation
Conclusions
Bottleneck 2: Excessive context switching

[Figure: CPU utilization (%) and context switches vs. clients (0–60); Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT]

[Figure: timeline of Xct 1 and Xct 2 with working, log manager, I/O waiting, and context-switch phases]

One context switch per log flush puts pressure on the OS scheduler

Must decouple thread scheduling from log flushes
Flush Pipelining
The scheduler is in the critical path and wastes CPU; multi-core HW only amplifies the problem
But transactions are nearly stateless at commit:
Detach transaction state from the worker thread
• Pass it to a log writer
Worker threads do not block at commit time

[Figure: timeline of Thread 1, Thread 2, and the Log Writer pipelining Xct 1–4]

Staged-like mechanism = low scheduling costs
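A minimal sketch of this staged design (hypothetical names; the real Shore-MT log writer batches flushes and reattaches transactions to worker threads afterwards): workers enqueue the detached commit state and return immediately, while a single log-writer thread drains the queue, performs one flush per batch, and fires the completion callbacks.

```python
import queue
import threading

class FlushPipeline:
    def __init__(self):
        self.pending = queue.Queue()              # detached commits awaiting flush
        self.writer = threading.Thread(target=self._log_writer, daemon=True)
        self.writer.start()

    def commit(self, txn_state, on_durable):
        # Worker thread: hand off the (nearly stateless) transaction and
        # go pick up the next request -- no blocking, no context switch.
        self.pending.put((txn_state, on_durable))

    def _log_writer(self):
        while True:
            batch = [self.pending.get()]          # block until work arrives
            while not self.pending.empty():       # then drain: one flush I/O
                batch.append(self.pending.get())  # covers the whole group
            # ... the synchronous log flush would happen here ...
            for txn_state, on_durable in batch:
                on_durable(txn_state)             # e.g., send the client its OK
```

Grouping whatever has accumulated while the previous flush was in flight is what gives the group-commit effect: the more commits arrive, the more of them share a single I/O.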
Impact of Flush Pipelining
[Figure: throughput vs. clients (0–60) for Flush Pipelining, Asynchronous Commit, and Baseline; Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT]

[Figure: CPU utilization (%) and context switches vs. clients for Baseline and Flush Pipelining]

Match asynchronous commit throughput without compromising durability
Agenda
Logging-related problems
Aether logging
• Reducing lock contention
• Reducing context switching
• Scalable log buffer implementation
Conclusions
Bottleneck 3: Log buffer contention
[Figure: timeline of Xct 1–3 with working, log manager, I/O waiting, and log-buffer latch waiting phases]

The centralized log buffer causes contention, which depends on:
the number of participating threads
the size of modifications (KiBs in the case of physical logging)
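To make the contention concrete, here is a sketch of a naive centralized log buffer (a hypothetical simplification, not the Shore-MT code): one mutex protects the whole insert, so the critical section grows with the record size and every inserting thread serializes on it.

```python
import threading

class CentralizedLogBuffer:
    def __init__(self, size=1 << 20):
        self.buf = bytearray(size)
        self.mutex = threading.Lock()   # the single point of contention
        self.head = 0                   # next free byte / next LSN

    def insert(self, record: bytes) -> int:
        with self.mutex:
            lsn = self.head
            # The memcpy happens *inside* the critical section, so the
            # time the mutex is held scales with the record size (KiBs
            # for physical logging), not just with the bookkeeping.
            self.buf[lsn:lsn + len(record)] = record
            self.head += len(record)
            return lsn
```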
Eliminating critical sections
Inspiration: elimination-based backoff*
Critical sections can cancel each other out, e.g., stack push/pop operations

* D. Hendler, N. Shavit, and L. Yerushalmi. “A Scalable Lock-free Stack Algorithm.” In Proc. SPAA, 2004
Adapt elimination-based backoff for db logging
Attempt to acquire the mutex
If that fails, back off, waiting on an array
If someone else already waits there, eliminate both requests w/o acquiring the mutex
[Figure: a stack with a station area where a waiting push() and an arriving pop() eliminate each other]
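Adapted to logging, colliding requests do not cancel out; they combine. A sketch of the idea (a hypothetical simplification: one generation-stamped group instead of the slot array): threads that arrive while a group is forming add their sizes to it, and only the group leader enters the critical section to reserve log space for everyone.

```python
import threading

class ConsolidatedReserve:
    def __init__(self):
        self.log_mutex = threading.Lock()   # the contended log-buffer mutex
        self.head = 0                       # next free LSN, guarded by log_mutex
        self.cv = threading.Condition()     # cheap, short-lived group lock
        self.pending = 0                    # bytes requested by the open group
        self.gen = 0                        # generation of the open group
        self.bases = {}                     # gen -> base LSN chosen by the leader

    def reserve(self, nbytes):
        with self.cv:
            gen = self.gen
            my_offset = self.pending        # my position within the group
            self.pending += nbytes
            if my_offset != 0:              # follower: the leader will reserve
                while self.gen == gen:      # space for the whole group
                    self.cv.wait()
                return self.bases[gen] + my_offset
        # Leader: fight for the critical section once, on behalf of everyone.
        with self.log_mutex:
            with self.cv:
                total = self.pending        # the group closes here; later
                self.pending = 0            # arrivals start the next group
                base = self.head
                self.head += total
                self.bases[gen] = base
                self.gen += 1
                self.cv.notify_all()
        return base                         # the leader's own offset is 0
```

However many threads pile up, only one of them touches the log mutex per group, so contention stops growing with the thread count.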
Accessing the log buffer Break log insert into three logical steps
(a) Reserve space by updating the head LSN
(b) Copy the log record (memcpy)
(c) Make the insert visible by updating the tail LSN, in LSN order
Steps (a) + (c) can be consolidated: accumulate requests off the critical path and send only the group leader to fight for the critical section
Move (b) out of the critical section
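A sketch of step (b) moved out of the critical section (a hypothetical simplification, without the consolidation part): the reservation (a) is the only work done under the mutex, the memcpy (b) proceeds in parallel across threads, and (c) publishes completions strictly in LSN order.

```python
import threading

class DecoupledLogBuffer:
    def __init__(self, size=1 << 20):
        self.buf = bytearray(size)
        self.mutex = threading.Lock()       # guards only the head LSN
        self.head = 0                       # next byte to reserve
        self.cond = threading.Condition()
        self.tail = 0                       # everything below is visible/flushable

    def insert(self, record: bytes) -> int:
        with self.mutex:                    # (a) reserve: O(1), size-independent
            lsn = self.head
            self.head += len(record)
        self.buf[lsn:lsn + len(record)] = record    # (b) copy with no lock held;
                                                    # concurrent copies overlap
        with self.cond:                     # (c) publish in LSN order so the
            while self.tail != lsn:         # flusher never sees a hole
                self.cond.wait()
            self.tail = lsn + len(record)
            self.cond.notify_all()
        return lsn
```

The mutex hold time is now a few pointer updates regardless of record size, which is what decouples contention from the average log entry size.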
Design evolution

[Figure: timelines for (B) Baseline, (C) Consolidation array, (D) Decoupled buffer insert, and the (CD) Hybrid design, showing when the mutex is held, start/finish, copying into the buffer, and waiting; contention(work) = O(1); contention(# threads) = O(1)]
Decouple contention from the # of threads and average log entry size
Performance as contention increases
Microbenchmark: bimodal distribution of log record sizes, 48B and 160B (120B average)
Hybrid solution combines benefits of both
[Figure: log insert rate (GB/s, log scale) vs. threads for Baseline, Decoupled (D), Consolidation (C), and Hybrid (CD)]
Sensitivity to slot count
[Figure: throughput (shown as color/height, in MB/s, ~400–1700) as a function of slot count and thread count]

Relatively insensitive to slot count (3 or 4 slots are good enough for most cases)
Case against distributed logging
Distributing TPC-C log records over 8 logs
1 ms of wall time, ~200 in-flight transactions, 30 commits
Horizontal blue line = 1 log
Diagonal line = a dependency (new = black, older = grey)

Large overhead of keeping track of dependencies and over-flushing
Agenda
Logging-related problems
Aether logging
• Reducing lock contention
• Reducing context switching
• Scalable log buffer implementation
Conclusions
Putting it all together

[Figure: throughput (KTps, 0–80000) vs. # CPUs utilized (0–60) for Aether, Flush Pipelining + ELR, and Baseline; Sun Niagara T2 (64 HW contexts), memory-resident TPC-B. Annotations: gap increases w/ # threads; +60% from Baseline; +15%]

Eliminate current log bottlenecks
Future-proof the system against contention
Conclusions
Logging is an essential component of OLTP
Simplifies recovery and improves performance without the need to physically partition the data
…but all lurking bottlenecks need to be addressed
Aether is a holistic approach to logging:
Leverages existing techniques (Early Lock Release)
Reduces context switches (Flush Pipelining)
Eliminates log contention (consolidation-based backoff)
• Can achieve 2GB/s of log throughput per node
Thank you!