Read-Write Lock Allocation in Software Transactional Memory
Amir Ghanbari Bavarsad and Ehsan Atoofian
Lakehead University
[Figure: processors P1 … Pn, each with a cache ($), sharing one global clock]
Transactional Memory
Software transactional memory (STM) uses a global clock to validate transactional data
    Pros: reduces validation overhead
    Cons: contention on the global clock
Alternative: Read-Write Lock Allocation (RWLA)
    Pros: no central clock
    Cons: overhead if a transaction (TX) aborts
Speculative RWLA: changes the validation policy dynamically → speedup of up to 66%
Outline
Background
RWLA
Speculative RWLA
Conclusion
Counter in STM
T1:
    TM_BEGIN();
    local_counter = TM_READ(counter);
    local_counter++;
    TM_WRITE(counter, local_counter);
    TM_END();
Transactional data are validated using:
Global clock
    Shared variable
    Timestamp for transactions
Lock
    Memory is mapped to the lock table
    Each entry of the table: version #
Validation in STM
[Figure: the global clock, memory, and the lock table with a version # in each entry]
Updating Global Clock & Lock
Increment the global clock
Version # = global_clock
[Figure: the global clock, memory holding counter, and the lock table entry with its version #]
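The two update steps above can be sketched in C. This is a minimal illustration under stated assumptions, not TL2's actual code: the names `global_clock`, `lock_table`, `lock_index`, and `commit_write`, the table size, and the address hash are all hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 1024u   /* assumed size, a power of two */

static atomic_ulong global_clock;                /* shared timestamp source */
static atomic_ulong lock_table[LOCK_TABLE_SIZE]; /* one version # per entry */

/* Map a memory address to a lock-table entry (hypothetical hash). */
static unsigned lock_index(const void *addr) {
    return (unsigned)(((uintptr_t)addr >> 3) & (LOCK_TABLE_SIZE - 1u));
}

/* Publish a committed write: bump the clock, then stamp the entry. */
static unsigned long commit_write(long *addr, long value) {
    assert(addr != NULL);
    /* Increment the global clock; wv is the new clock value. */
    unsigned long wv = atomic_fetch_add(&global_clock, 1) + 1;
    *addr = value;
    /* Version # = global_clock */
    atomic_store(&lock_table[lock_index(addr)], wv);
    return wv;
}
```

Every committing writer performs the `atomic_fetch_add` on the same shared variable, which is why the clock becomes a point of contention.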
Validation in STM
At TM_BEGIN(), rv (read version) is set to global_clock
[Figure: the metadata for TX1 holds rv, copied from the global clock]
Successful Read Validation
rv >= version #: the most recent write to counter occurred before TM_BEGIN()
[Figure: the metadata for TX1 (rv) next to the global clock]
Failed Read Validation
rv < version #: the most recent write to counter occurred after TM_BEGIN()
[Figure: the metadata for TX1 (rv) next to the global clock]
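The read-validation rule from the two slides above can be written as a small C sketch. The type `tx_meta` and the functions `tx_begin` and `read_is_valid` are hypothetical names chosen for this example; a real STM keeps more metadata and reads the clock and version # atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-transaction metadata; only the read version is modelled here. */
typedef struct {
    unsigned long rv;   /* global_clock value sampled at TM_BEGIN() */
} tx_meta;

/* TM_BEGIN(): remember the clock value at which the transaction started. */
static void tx_begin(tx_meta *tx, unsigned long global_clock) {
    assert(tx != NULL);
    tx->rv = global_clock;
}

/* A read is valid if the location was last written no later than TM_BEGIN(). */
static bool read_is_valid(const tx_meta *tx, unsigned long version) {
    assert(tx != NULL);
    return tx->rv >= version;
}
```

With rv = 7, a location stamped with version # 5 or 7 passes validation, while a location stamped with 8 fails and the transaction aborts.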
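Overhead of Validation
This method, called GV4, causes many cache coherence misses when transactions commit frequently, because every commit writes the shared global clock
[Figure: processors P1 … Pn, each with a cache ($), sharing one global clock]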
Read-Write Lock Allocation (RWLA)
Lock
    Memory is mapped to the lock table
    Each entry of the table: lock bit, read bits
[Figure: memory mapped to the lock table; each entry holds a lock bit followed by read bits for P0, P1, …, Pn-1]
TM_READ
TM_READ(): is the lock bit free?
    Yes → set the read bit in the corresponding lock entry
    No → abort
[Figure: TM_READ flowchart with example lock entries (lock bit, then read bits), e.g. 000000… when the lock bit is free and 100000… when it is set]
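The TM_READ decision above can be sketched in C as follows. The bit layout (bit 0 as the lock bit, one read bit per processor above it), the 32-bit entry, and the names `LOCK_BIT`, `READ_BIT`, and `rwla_read` are assumptions for this example. The sketch is single-threaded; a concurrent implementation would need an atomic read-modify-write such as compare-and-swap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BIT    1u                      /* assumed layout: bit 0 is the lock bit */
#define READ_BIT(p) (1u << ((p) + 1u))      /* bits 1..n are read bits for P0..Pn-1 */
#define MAX_PROCS   31u                     /* read bits that fit in a 32-bit entry */

/* Returns true if the read may proceed, false if the transaction must abort. */
static bool rwla_read(uint32_t *entry, unsigned proc) {
    assert(entry != NULL && proc < MAX_PROCS);
    if (*entry & LOCK_BIT)          /* lock bit is not free: abort */
        return false;
    *entry |= READ_BIT(proc);       /* set this processor's read bit */
    return true;
}
```

No global clock is consulted here: the reader only touches the lock entry of the location it accesses.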
TM_WRITE
TM_WRITE(): are all read bits clear?
    No → abort
    Yes → acquire the lock
        Acquire failed → abort
[Figure: TM_WRITE flowchart with example lock entries, e.g. 000100… when a read bit is set and 100000… once the lock bit is acquired]
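The TM_WRITE decision above can be sketched the same way. The bit layout and the names `LOCK_BIT`, `READ_BIT`, and `rwla_write` are assumptions for this example. The sketch also assumes that the writer's own read bit is ignored, since T1 reads counter before writing it; the slides do not state this. It is single-threaded, so "acquire failed" is modelled as finding the lock bit already set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BIT    1u                      /* assumed layout: bit 0 is the lock bit */
#define READ_BIT(p) (1u << ((p) + 1u))      /* bits 1..n are read bits for P0..Pn-1 */
#define MAX_PROCS   31u                     /* read bits that fit in a 32-bit entry */

/* Returns true if the lock was acquired, false if the transaction must abort. */
static bool rwla_write(uint32_t *entry, unsigned proc) {
    assert(entry != NULL && proc < MAX_PROCS);
    /* Read bits of other processors (assumption: own read bit is ignored). */
    uint32_t others = *entry & ~LOCK_BIT & ~READ_BIT(proc);
    if (others != 0)                /* some read bit is set: abort */
        return false;
    if (*entry & LOCK_BIT)          /* acquire failed: lock already held */
        return false;
    *entry |= LOCK_BIT;             /* acquire the lock */
    return true;
}
```

A writer that aborts here has already done the work of its transaction body, which is the abort overhead the overview slide lists as the drawback of RWLA.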
Experimental Framework
Benchmarks: STAMP v0.9.7
    Run to completion
    Statistics measured over 10 runs
TL2 as the STM framework
Two Intel Xeon E5660, 6-way CMP
Performance of RWLA
[Figure: bar chart for Bayes, Kmeans, Labyrinth, Ssca2, Vacation, and Genome with 2, 4, 8, and 16 threads plus the average (AVG.); y-axis from 0 to 1.6; an arrow marks the "better" direction]
Speculative RWLA
Conflicts occur frequently → select GV4
Conflicts occur rarely → select RWLA
How to predict conflicts?
Contention Predictor
[Figure: perceptron with inputs 1, x1, …, xn, weights w0, w1, …, wn, and output y]
$$y = w_0 + \sum_{i=1}^{n} x_i w_i$$
xi: global transaction history, bipolar value
wi: weight vector
Prediction:
    y ≥ 0 → predict commit
    y < 0 → predict abort
Update:
    If the outcomes of the current TX and TXi agree/disagree → increment/decrement wi
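The predictor can be sketched in C as a small perceptron. Several details are assumptions for this example and are not given on the slide: the history length `HISTORY_LEN`, integer weights that start at zero, a history that starts as all commits, training on every outcome, and treating the bias input as a constant 1 so that w0 follows the same agree/disagree rule. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HISTORY_LEN 4               /* n: assumed history length */

typedef struct {
    int w[HISTORY_LEN + 1];         /* w[0] is the bias weight w0 */
    int x[HISTORY_LEN];             /* bipolar history: +1 commit, -1 abort */
} predictor;

static void predictor_init(predictor *p) {
    assert(p != NULL);
    for (size_t i = 0; i <= HISTORY_LEN; i++) p->w[i] = 0;
    for (size_t i = 0; i < HISTORY_LEN; i++) p->x[i] = 1;  /* assumed start state */
}

/* y = w0 + sum over i = 1..n of x_i * w_i */
static int predictor_output(const predictor *p) {
    int y = p->w[0];
    for (size_t i = 0; i < HISTORY_LEN; i++)
        y += p->x[i] * p->w[i + 1];
    return y;
}

/* y >= 0 predicts commit, y < 0 predicts abort. */
static bool predict_commit(const predictor *p) {
    return predictor_output(p) >= 0;
}

/* Train on the real outcome, then shift it into the history. */
static void predictor_update(predictor *p, bool committed) {
    int t = committed ? 1 : -1;
    p->w[0] += t;                   /* bias input is the constant 1 */
    for (size_t i = 0; i < HISTORY_LEN; i++)
        p->w[i + 1] += (p->x[i] == t) ? 1 : -1;  /* agree: +1, disagree: -1 */
    for (size_t i = HISTORY_LEN - 1; i > 0; i--)
        p->x[i] = p->x[i - 1];
    p->x[0] = t;                    /* newest outcome goes first */
}
```

Under these assumptions a fresh predictor outputs y = 0 and predicts commit; after one observed abort the output drops to -3 and the prediction flips to abort.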
Performance of Speculative RWLA
Number of threads varies from 2 to 16
The average performance change ranges from 21% in Bayes to 47% in Labyrinth
[Figure: bar chart for Bayes, Kmeans, Labyrinth, Ssca2, Vacation, and Genome with 2, 4, 8, and 16 threads plus the average (AVG.); y-axis from 0 to 1.2; an arrow marks the "better" direction]
Conclusion
RWLA overcomes contention on the global clock
Applications react differently to GV4 and RWLA
Speculative RWLA changes the validation policy dynamically
Speculative RWLA improves the performance of STMs by up to 66%
Thank You!
Questions?