Makoto Yui, Jun Miyazaki, Shunsuke Uemura and Hayato Yamana: "Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK", Proc. ICDE, March 2010.
Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK
Makoto YUI (1), Jun MIYAZAKI (2), Shunsuke UEMURA (3) and Hayato YAMANA (4)
1. Research Fellow, JSPS (Japan Society for the Promotion of Science) / Visiting Postdoc at Waseda University, Japan and CWI, Netherlands
2. Nara Institute of Science and Technology
3. Nara Sangyo University
4. Waseda University / National Institute of Informatics
Outline
• Background
• Our approach
– Non-Blocking Synchronization
– Nb-GCLOCK
• Experimental Evaluation
• Related Work
• Conclusion
Background – Recent trends in CPU development
[Figure: CPU timeline from the 1990s to the 2000s: single-core CPUs (Pentium, Power4), multi-core CPUs (Core2, Nehalem), many-core CPUs (UltraSparc T2, Azul Vega, Larrabee?).]
The number of CPU cores on a chip is doubling in two-year cycles. The many-core era is coming.
- Niagara T2: 8 cores x 8 SMT = 64 processors
- Azul Vega3: 54 cores x 16 chips = 864 processors
Background – CPU Scalability of open source DBs
Open source DBs have faced CPU scalability problems.
[Figure: microbenchmark on UltraSparc T1 (32 procs): throughput (normalized, 0–10) vs. concurrent threads (1–32) for PostgreSQL, MySQL, and BDB. The gain after 16 threads is less than 5%.]
Ryan Johnson et al., "Shore-MT: A Scalable Storage Manager for the Multicore Era", In Proc. EDBT, 2009.
You might think… what about TPC-C?
CPU scalability of PostgreSQL
TPC-C benchmark result on a high-end Linux machine from Unisys (Xeon-SMP 32 CPUs, 16 GB memory, EMC RAID10 storage).
[Figure: TPS vs. CPU cores for PostgreSQL versions 8.0, 8.1, and 8.2. The gain after 16 CPU cores is less than 5%.]
Doug Tolbert, David Strong, Johney Tsai (Unisys), "Scaling PostgreSQL on SMP Architectures", PGCON 2007.
Q. What did the PostgreSQL community do?
A. They revised the synchronization mechanisms in the buffer management module.
Synchronization in Buffer Management Module
Several empirical studies have revealed that the largest bottleneck is synchronization in the buffer management module [1][2].
The buffer manager reduces disk access by caching database pages between the CPU/memory and the database files on disk.
[Figure: page requests enter the buffer manager, which (1) looks up a hash table; hits are served from memory, while misses (2) invoke the page replacement algorithm and read from the database files.]
[1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki: "Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines", In Proc. DaMoN, 2008.
[2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: "OLTP Through the Looking Glass, and What We Found There", In Proc. SIGMOD, 2008.
Naive buffer management schemes
PostgreSQL 8.0: one lookup hash table and an LRU replacement list behind a single giant lock. The giant lock sucks! It did not scale at all.
PostgreSQL 8.1: striped the lock across the hash buckets, but the LRU list still always needs to be locked when it is accessed. Scales up to 8 processors.
[Figure: page requests go through the lookup hash table (hash buckets); hits are served from the buffer, misses invoke the LRU page replacement algorithm and read from the database files.]
Less naive buffer management schemes
PostgreSQL 8.1: LRU replacement. The LRU list always needs to be locked when it is accessed. Scales up to 8 processors.
PostgreSQL 8.2: CLOCK replacement. CLOCK does not require a lock when an entry is touched. Scales up to 16 processors.
[Figure: the same hash-bucket lookup structure, with the LRU list of 8.1 replaced by CLOCK in 8.2.]
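To see why a CLOCK hit needs no lock, here is a minimal sketch (our illustration, not PostgreSQL's code; frame_t is a hypothetical type): marking a frame as recently used is one plain atomic store to its reference flag, with no list to reorder.

    #include <stdatomic.h>

    typedef struct {
        atomic_int referenced;  /* reference flag inspected by the clock hand */
        /* ... page data, pin count, etc. (hypothetical) ... */
    } frame_t;

    /* On a buffer hit, CLOCK only marks the frame; no lock, no list surgery. */
    static void touch_on_hit(frame_t *f) {
        /* relaxed ordering suffices: the flag is merely a replacement hint */
        atomic_store_explicit(&f->referenced, 1, memory_order_relaxed);
    }

An LRU hit, by contrast, must unlink and relink the entry at the head of a shared list, which is why the list lock is unavoidable there.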
Outline
• Background
• Our approach
– Non-Blocking Synchronization
– Nb-GCLOCK
• Experimental Evaluation
• Related Work
• Conclusion
Core idea of our approach
[Figure: in both designs, page requests flow from the CPU/memory through the buffer manager to the database files on disk.]
Previous approaches: ○ reduce disk I/Os, × but locks are contended. The intuition: with enough processors, the disk bandwidth is not utilized.
Our optimistic approach: reduce the lock granularity to one CPU instruction and remove the bottleneck. △ the number of I/Os slightly increases, ○ but there is no contention on locks.
Major Difference to Previous Approaches
Their goal: improve buffer hit rates to reduce I/Os. This was the single goal for many decades. Is it still valid in the many-core era? There are also SSDs.
Our goal: improve throughput by utilizing (many) CPUs.
Use non-blocking synchronization instead of acquiring locks!
What's non-blocking and lock-free?
Formally: stopping one thread will not prevent global progress; individual threads make progress without waiting.
Less formally: no thread 'locks' any resource; no 'critical sections', locks, mutexes, spin-locks, etc.
Lock-free: every successful step makes global progress and completes within finite time (ensuring liveness).
Wait-free: every step makes global progress and completes within finite time (ensuring fairness).
Non-blocking synchronization
A synchronization method that does not acquire any lock, enabling concurrent access to shared resources.
- Utilizes atomic CPU primitives: CAS (compare-and-swap), cmpxchg on x86
- Utilizes memory barriers

Blocking:
    acquire_lock(lock);
    counter++;
    release_lock(lock);

Non-blocking:
    int old;
    do {
        old = *counter;
    } while (!CAS(counter, old, old + 1));

The counter is incremented only if its value still equals old; otherwise the loop retries.
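The CAS() above is pseudocode; as a runnable illustration (our addition, assuming a C11 compiler), the same retry loop can be written with stdatomic:

    #include <stdatomic.h>

    /* Runnable C11 sketch of the non-blocking increment above. */
    void increment(atomic_int *counter) {
        int old = atomic_load(counter);
        /* On failure, compare_exchange_weak reloads the current
           value into old, so the loop simply retries with it. */
        while (!atomic_compare_exchange_weak(counter, &old, old + 1))
            ;  /* another thread won the race; retry */
    }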
Making the buffer manager non-blocking
[Figure: lookup hash table (hash buckets) in front of the GCLOCK page replacement algorithm; misses read pages from the database files under "lock; lseek; read; unlock".]
1. Utilized an existing lock-free hash table.
2. Removed locks on cache misses (fig. 6).
3. Need to keep consistency between the lookup hash table and GCLOCK (the right half of fig. 3): immediately after the page allocation of a buffer frame changes, the reference in the buffer lookup table still carries a different page identifier.
4. Avoided locks on I/Os by utilizing pread, CAS, and memory barriers (fig. 5).
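A minimal sketch of point 4 (our illustration under assumed names, not the paper's code; frame_t and the FRAME_* states are hypothetical): pread takes an explicit offset, so no lock is needed around seek-plus-read, and a CAS on the frame state lets exactly one thread claim the I/O.

    #include <stdatomic.h>
    #include <unistd.h>

    enum { FRAME_EMPTY, FRAME_READING, FRAME_VALID };

    typedef struct {
        atomic_int state;
        char data[8192];
    } frame_t;

    /* Returns 1 if this thread performed the read; 0 if another
       thread claimed the frame first (caller waits for FRAME_VALID). */
    int read_page_optimistic(frame_t *f, int fd, off_t offset) {
        int expected = FRAME_EMPTY;
        /* claim the frame with one CAS instead of a lock */
        if (!atomic_compare_exchange_strong(&f->state, &expected, FRAME_READING))
            return 0;
        pread(fd, f->data, sizeof f->data, offset);  /* no lseek, no lock */
        /* the release store is the memory barrier that publishes the data */
        atomic_store_explicit(&f->state, FRAME_VALID, memory_order_release);
        return 1;
    }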
State Machine-based Reasoning for selecting a replacement victim
Construct the algorithm from many small 'steps': build a state machine to ensure global progress.
[Figure: victim-selection state machine; E: denotes a state's entry action. Start finding a replacement victim: select a frame (E: try next entry; continue if null). Check whether it is evicted; if not evicted, check whether it is pinned. If not pinned, try to decrement the refcount (E: CAS the value; if the CAS did not swap, retry). If --refcount > 0, move the clock hand and advance to the next candidate. If --refcount <= 0, try to evict (E: evict); on success, fix the frame in the pool and return it as the replacement victim; if it was already evicted, move on.]
Every transition is a single atomic step, so two threads can interleave: a candidate found by Thread A can be intercepted by Thread B ("Oops! Candidate is intercepted."); Thread A then simply advances the clock hand to the next candidate.
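Rendered as code, the state machine might look like the following hedged sketch (our reconstruction from the diagram, not the paper's implementation; pool_t, pick_next_frame, frame_is_evicted, frame_is_pinned, and try_evict are hypothetical helpers):

    #include <stdatomic.h>
    #include <stddef.h>

    /* Hypothetical types and helpers, for illustration only. */
    typedef struct { atomic_int refcount; /* ... */ } frame_t;
    typedef struct pool pool_t;
    frame_t *pick_next_frame(pool_t *);   /* advances the clock hand */
    int frame_is_evicted(frame_t *);
    int frame_is_pinned(frame_t *);
    int try_evict(frame_t *);             /* CAS; fails if intercepted */

    frame_t *select_victim(pool_t *pool) {
        for (;;) {                                /* E: try next entry */
            frame_t *f = pick_next_frame(pool);
            if (f == NULL || frame_is_evicted(f) || frame_is_pinned(f))
                continue;                         /* move the clock hand */
            int w = atomic_load(&f->refcount);
            if (w > 0) {
                /* E: decrement the refcount; a failed CAS means another
                   thread changed it, so give up and advance the hand */
                if (!atomic_compare_exchange_strong(&f->refcount, &w, w - 1))
                    continue;
                if (w - 1 > 0)
                    continue;                     /* --refcount > 0 */
            }
            /* --refcount <= 0: E: evict; the CAS inside try_evict may
               fail if the candidate is intercepted by another thread */
            if (try_evict(f))
                return f;                         /* fix in pool */
        }
    }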
Outline
• Background
• Our approach
– Non-Blocking Synchronization
– Nb-GCLOCK
• Experimental Evaluation
• Related Work
• Conclusion
Experimental settings
- Workload: Zipf 80/20 distribution (a famous power law), containing 20% sequential scans; the dataset is 32 GB in total.
- Machine: UltraSPARC T2, 64 processors.
We also performed evaluations on various x86 settings in the paper.
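As a hedged illustration of such a workload (our sketch, not the paper's harness; NUM_PAGES and the RNG are assumptions), an 80/20 access pattern can be approximated by directing 80% of requests at a hot 20% of the pages:

    #include <stdlib.h>

    #define NUM_PAGES 4194304L  /* assumption: 32 GB dataset / 8 KB pages */

    /* Approximate 80/20 skew: 80% of accesses go to the hot 20% of pages. */
    long next_page(void) {
        long hot = NUM_PAGES / 5;                      /* hot set: 20% */
        if (random() % 100 < 80)
            return random() % hot;                     /* 80% of accesses */
        return hot + random() % (NUM_PAGES - hot);     /* remaining 20% */
    }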
Performance comparison on moderate I/Os (fig. 9)
[Figure: throughput (normalized by LRU, 0.0–6.0) vs. processors (8, 16, 32, 64) for LRU, GCLOCK, and Nb-GCLOCK.]
CPU utilization: previous approaches are low, about 20%; Nb-GCLOCK is high, more than 95%.
An even larger difference in CPU time can be expected as the number of CPUs increases ➜ we expect still higher throughput.
Maximum throughput to processors
Scalability to processors when all pages are resident in memory, intended to show the scalability limit each algorithm can reach.

Throughput (operations/sec; plotted on a log scale) by processors (cores):
                 8 (1)       16 (2)      32 (4)       64 (8)
    2Q             890,992     819,975     866,009      662,782
    GCLOCK       1,758,605   1,912,000   1,931,268    1,817,748
    Nb-GCLOCK    3,409,819   7,331,722  14,245,524   25,834,449

Nb-GCLOCK achieved almost linear scalability, at least up to 64 processors. This is the first attempt that removed locks from buffer management.
Also interesting: GCLOCK hits its CPU-scalability limit at around 16 processors, so caching solutions built on GCLOCK share that limit.
Max throughput (operations/sec) evaluation
Workload is Zipf 80/20, evaluated on UltraSparc T2 (64 procs). Accesses are issued from 64 threads for 60 seconds, so ideally 64 x 60 = 3,840 CPU-seconds are available.
Nb-GCLOCK uses most of that CPU time because it is non-blocking; the previous approaches use only about 10–20% of it. The difference in CPU utilization will grow as the number of processors increases, because the blocking schemes suffer ever more contention.
TPC-C evaluation using Apache Derby
[Figure: tpmC (transactions per minute, 800–1400) vs. number of terminals (threads: 8, 16, 32, 64, 128) for stock Derby and Nb-GCLOCK.]
The original scheme of Derby (CLOCK) decreased in throughput as terminals were added; our scheme showed a better result.
With the buffer management module no longer the bottleneck, throughput is limited by the latch on the root page of the B+-tree ➜ we would require a concurrent B+-tree (see OLFIT).
Sang Kyun Cha et al.: "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems", In Proc. VLDB, 2001.
Outline
• Background
• Our approach
– Non-Blocking Synchronization
– Nb-GCLOCK
• Experimental Evaluation
• Related Work
• Conclusion
Bp-Wrapper
Xiaoning Ding, Song Jiang, and Xiaodong Zhang: "BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free", In Proc. ICDE, 2009.
[Figure: page requests first pass through an access-recording layer, then the lookup hash table (hash buckets) and an arbitrary page replacement algorithm backed by the database files.]
Bp-Wrapper eliminates lock contention on buffer hits with a batching and prefetching technique, called lazy synchronization in the literature: it postpones the physical work (adjusting the buffer replacement list) and returns immediately from the logical operation.
Pros: works with any page replacement algorithm.
Cons: does not increase the throughput of CLOCK variants, because CLOCK does not require locks on buffer hits; and cache misses involve batching, so longer lock holding times create more contention.
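A tiny sketch of the batching idea behind lazy synchronization in general (our illustration, not BP-Wrapper's code; lock, unlock, replacement_list_lock, and adjust_replacement_list are hypothetical): accesses accumulate lock-free in a per-thread log and are applied to the replacement list in one short critical section.

    #define BATCH 64

    /* One log per thread, so recording an access needs no lock. */
    typedef struct {
        void *pages[BATCH];
        int   n;
    } access_log_t;

    void record_access(access_log_t *log, void *page) {
        log->pages[log->n++] = page;    /* lock-free: thread-local */
        if (log->n == BATCH) {          /* one lock per BATCH hits */
            lock(&replacement_list_lock);
            for (int i = 0; i < log->n; i++)
                adjust_replacement_list(log->pages[i]);
            unlock(&replacement_list_lock);
            log->n = 0;
        }
    }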
Conclusions
Proposed a lock-free variant of the GCLOCK page replacement algorithm, named Nb-GCLOCK:
- almost linear scalability up to 64 processors, while existing locking-based schemes do not scale beyond 16 processors
- the first attempt to introduce non-blocking synchronization into database buffer management, with optimistic I/Os using pread, CAS, and memory barriers
Linearizability and lock-freedom are proven in the paper. Lock-freedom guarantees a certain throughput: any active thread taking a bounded number of steps ensures global progress.
This work is also useful for any caching solution that requires high throughput (e.g., C10K accesses).
Thank you for your attention!