CS-510
Why The Grass May Not Be Greener On The Other Side:
A Comparison of Locking vs. Transactional Memory

Paul E. McKenney, IBM Linux Technology Center
Maged M. Michael, IBM TJ Watson Research
Jonathan Walpole, Portland State University

Presented by Vidhya Priyadharshnee Palaniswamy Gnanam
Outline
- Concurrency Control Techniques Review
- Objective
- Locking Critique
- TM Critique
- Where do Locking and TM fit in?
- Conclusion
- Recent Work
- Future Work
CONCURRENCY CONTROL TECHNIQUES REVIEW
Multicore Computing
- With the speed of individual cores no longer increasing as it used to, we now rely on more CPU cores to speed up our ever-more-complicated applications.
- To use these extra cores, programs must be parallelized.
- Synchronizing access to shared data is critical for the correctness of these programs.
Lock-Based Synchronization
- The "traditional" pessimistic synchronization approach.
- Simple: partition the shared data and protect each partition with a separate lock.
- Locks prevent concurrent access and enable sequential reasoning about critical-section code.
- Reader-writer locking allows multiple readers to proceed concurrently, improving scalability when used correctly.
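The partitioning idea above can be sketched with POSIX threads. The hash-bucket layout, names, and bucket count below are illustrative assumptions, not from the paper: each partition carries its own mutex, so threads touching different buckets never contend.

```c
#include <pthread.h>

#define NBUCKETS 16

struct bucket {
    pthread_mutex_t lock;   /* guards only this bucket's data */
    long count;             /* stand-in for real per-bucket data */
};

static struct bucket table[NBUCKETS];

static void table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].count = 0;
    }
}

/* Hash the key to a partition and take only that partition's lock. */
static void table_add(unsigned long key, long delta)
{
    struct bucket *b = &table[key % NBUCKETS];

    pthread_mutex_lock(&b->lock);
    b->count += delta;          /* critical section: short, bucket-local */
    pthread_mutex_unlock(&b->lock);
}

static long table_total(void)
{
    long sum = 0;

    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_lock(&table[i].lock);
        sum += table[i].count;
        pthread_mutex_unlock(&table[i].lock);
    }
    return sum;
}
```

Two updates that hash to different buckets proceed fully in parallel; only same-bucket updates serialize.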
Lock-Based Synchronization: Downsides
Locking opens a whole new can of worms, though.
- High contention on non-partitionable data structures: coarse-grained locking limits concurrency and scales poorly.
- Fine-grained locking is hard, and lock-acquisition overhead hurts performance.
- Locks introduce dependencies among threads: a lock holder's failure propagates to other threads, hurting the fault tolerance of the system.
Non-Blocking Synchronization
- Lock-free, "optimistic" synchronization.
- Execute the critical section unconstrained, then check at the end whether you were the only one there.
- If so, continue; if not, roll back and retry.
- Optimistic synchronization keeps threads independent, yielding progress guarantees such as lock freedom, wait freedom, or obstruction freedom, depending on the implementation.
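The execute-then-check-then-retry pattern can be sketched with C11 atomics as a Treiber-style lock-free stack push and pop; this particular data structure is an illustrative choice (it matches the deck's later stack example), and the sketch deliberately ignores the ABA and memory-reclamation problems that real lock-free code must solve.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

static _Atomic(struct node *) top = NULL;

/* Optimistic push: prepare the node unconstrained, then CAS it in.
 * If the CAS fails, another thread changed `top` first; the failed CAS
 * refreshed n->next to the current top, so simply retry. */
static void push(int value)
{
    struct node *n = malloc(sizeof(*n));

    n->value = value;
    n->next = atomic_load(&top);
    while (!atomic_compare_exchange_weak(&top, &n->next, n))
        ;   /* retry: heavy contention means many of these loops */
}

/* Pop has the same optimistic shape (real code must also handle
 * ABA and safe reclamation, e.g. via hazard pointers). */
static int pop(int *out)
{
    struct node *n = atomic_load(&top);

    while (n && !atomic_compare_exchange_weak(&top, &n, n->next))
        ;   /* failed CAS reloaded n with the current top */
    if (!n)
        return 0;
    *out = n->value;
    free(n);
    return 1;
}
```

The retry loops are exactly where the "heavy bus and cache contention" downside on the next slide bites: every failed CAS is wasted work plus coherence traffic.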
Non-Blocking Synchronization: Downsides
- Difficult programming logic.
- Heavy use of atomic operations such as CAS, which combine verification with finalization on success.
- Contention can hit hard: frequent retries cause heavy bus and cache contention, slowing down even the threads that are making progress.
- May not perform as well as a lock-based approach in a non-preemptible kernel.
Objective
Each technique has both green and dry patches. The goal of the paper is to:
- Spot the green and dry patches of lock-based synchronization and transactional memory (a form of NBS).
- Criticize both constructively to understand where each technique fits.
LOCKING CRITIQUE
Locking Strengths
- Simple and elegant idea: allow only one CPU to access a given piece of data at a time.
- Provides disjoint-access parallelism, though with more effort.
- Requires no specialized hardware support; runs on existing commodity hardware.
- Supported on many platforms: locking is widely used, and well-defined, standardized locking APIs such as POSIX pthreads exist. Much legacy code uses locking, and many programmers are experienced with it.
- Contention effects are concentrated within the locking primitives, allowing critical sections to run at full speed.
Locking Strengths (continued)
- Performance degradation under contention can be mitigated by reducing power consumption while waiting on a lock.
- Good for protecting non-idempotent operations such as I/O, thread creation, memory remapping, and system rebooting.
- Interacts naturally with other synchronization mechanisms, including reference counting, atomic operations, non-blocking synchronization, and RCU.
- Interacts in a natural manner with debuggers.
Locking: Problems & Improvements
Problem: Lock Contention
- Some data structures, such as unstructured graphs and trees, are difficult to partition; we may have to settle for coarse-grained locking, leading to high contention and reduced scalability.
Solution
- Redesign algorithms to use partitionable data structures, e.g., replace trees and graphs with hash tables and radix trees.
The problem remains for non-partitionable data structures!
Locking: Problems & Improvements
Problem: Lock Overhead
- Lock granularity determines scalability: can we partition the shared data as finely as possible and protect each partition with a separate lock?
- Locking uses expensive instructions and creates high synchronization overhead.
- Locking introduces communication-related cache misses into read-mostly workloads that would otherwise run entirely within the CPU cache.
Solution
- While lock overhead cannot be eliminated entirely, it can often be avoided: in read-mostly situations, locked updates can be paired with read-copy update (RCU) or hazard pointers, reducing lock overhead in the common case and increasing read-side performance and scalability.
The problem remains in update-heavy workloads!
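The "pair locked updates with lock-free reads" idea can be sketched with C11 atomics. This is not any real RCU library's API: readers do a plain atomic pointer load with no lock and no expensive read-modify-write, while writers still serialize on a mutex, copy, update, and publish a new version. The sketch leaks the old copy; deferring that free until all pre-existing readers finish is exactly the problem RCU and hazard pointers solve.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct config {                 /* hypothetical read-mostly data */
    int threshold;
    int limit;
};

static _Atomic(struct config *) cur;
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;

static void config_init(int threshold, int limit)
{
    struct config *c = malloc(sizeof(*c));

    c->threshold = threshold;
    c->limit = limit;
    atomic_store(&cur, c);
}

/* Read side: just a pointer load -- no lock, no cache-line ping-pong.
 * Real RCU brackets this with rcu_read_lock()/rcu_read_unlock(). */
static int config_threshold(void)
{
    return atomic_load(&cur)->threshold;
}

/* Update side: writers still take a lock, but never block readers.
 * The old copy is leaked here; RCU would defer its free until every
 * reader that might still hold the old pointer has finished. */
static void config_set_threshold(int threshold)
{
    pthread_mutex_lock(&update_lock);
    struct config *old = atomic_load(&cur);
    struct config *new = malloc(sizeof(*new));

    *new = *old;                 /* copy ... */
    new->threshold = threshold;  /* ... update ... */
    atomic_store(&cur, new);     /* ... publish atomically */
    pthread_mutex_unlock(&update_lock);
}
```

This shows why the improvement targets read-mostly workloads: reads become cheap and scalable, but every update still pays for the lock plus a full copy.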
Locking: Problems & Improvements
Performance vs. Scalability
- Fine-grained locking scales better but pays acquisition overhead on every operation; coarse-grained locking is cheaper per operation but limits concurrency.
- We need the right granularity of locks!
Locking: Problems & Improvements
Problem: Deadlock
- Multiple threads acquire the same set of locks in different orders.
- Self-deadlock: an interrupt occurs while a thread holds a lock, and the interrupt handler needs that same lock.
Solution
- Require a clear locking hierarchy: multiple locks are always acquired in a pre-specified order.
- If a lock is not free, the thread surrenders its conflicting locks and retries.
- Detect deadlock and break the cycle by terminating selected threads, chosen by priority or work done.
- Track lock acquisitions to dynamically detect potential deadlock and prevent it before it occurs.
- To avoid self-deadlock, disable interrupts on entering the critical section, or avoid acquiring locks in interrupt handlers.
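The locking-hierarchy fix can be sketched concretely. The bank-transfer scenario below is an illustrative assumption; the technique is the real point: when a thread needs two locks, it always takes them in one fixed global order (here, by address), so a cycle of waiters can never form.

```c
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

/* Acquire both locks in address order.  Two concurrent transfers
 * between the same pair of accounts then agree on which lock is
 * "first", so neither can hold one lock while waiting for the other
 * in the opposite order -- the classic deadlock is impossible. */
static void transfer(struct account *from, struct account *to, long amount)
{
    pthread_mutex_t *first, *second;

    if ((uintptr_t)from < (uintptr_t)to) {
        first  = &from->lock;
        second = &to->lock;
    } else {
        first  = &to->lock;
        second = &from->lock;
    }

    pthread_mutex_lock(first);
    pthread_mutex_lock(second);
    from->balance -= amount;
    to->balance   += amount;
    pthread_mutex_unlock(second);
    pthread_mutex_unlock(first);
}
```

The weakness the slide notes remains: this only works when every module that touches these locks knows about, and obeys, the same ordering rule.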
Locking: Problems & Improvements
Problem: Priority Inversion
- Priority inversion can cause a high-priority thread to miss its real-time scheduling deadline, which is unacceptable in safety-critical systems.
Solution
- Priority inheritance: the low-priority thread holding the lock temporarily inherits the priority of the blocked high-priority thread, so no medium-priority thread can preempt it.
- Priority ceiling: the lock holder is assigned the priority of the highest-priority task that might acquire that lock.
- Disable preemption entirely while locks are held.
Locking: Problems & Improvements
Problem: Convoying
- Preemption or blocking (due to I/O, page faults, etc.) of the lock holder blocks other threads, unrealistically inflating critical-section length.
- Non-deterministic lock-acquisition latency; possible starvation of large critical sections. A problem for real-time workloads.
Solution
- Use scheduler-conscious synchronization to prevent the scheduler from preempting a thread that holds a lock.
- Use RCU for read-side critical sections to avoid non-deterministic read-side lock-acquisition latency.
- To avoid starvation, use FCFS lock-acquisition primitives that bound the number of threads, e.g., semaphores.
Locking: Problems & Improvements
Problem: Lack of Composability and Modularity
- Composing atomic operations into larger atomic operations is difficult.
- Self-deadlock results if an inner critical section tries to acquire a lock the outer critical section is already holding.
Solution
- Know what locks other modules use before calling or composing them.
Abstraction is lost!
Locking: Problems & Improvements
Problem: Indefinite Blocking
- Termination of a lock holder leaves the lock held forever, creating problems for fault-tolerant software.
Solution
- Abort and restart the entire application: simple and reliable.
- Identify the terminated lock holder and clean up its state: extremely complex.
The fault tolerance of the software is still affected!
TRANSACTIONAL MEMORY CRITIQUE
Composability
In locking, operations may be thread-safe individually, yet fail to compose. Consider popping from one stack and pushing onto another (the code from the slide, reconstructed; `first` and `lock` are the stack's assumed fields):

T1:
    struct foo *pop(struct foo_stack *src)
    {
        struct foo *q;

        lock(&src->lock);
        q = src->first;
        if (q != NULL)
            src->first = q->next;
        unlock(&src->lock);
        return q;
    }

T2:
    void push(struct foo_stack *dst, struct foo *q)
    {
        lock(&dst->lock);
        q->next = dst->first;
        dst->first = q;
        unlock(&dst->lock);
    }

A thread that pops and then pushes holds neither lock in between: the intermediate state (the item is on neither stack) is visible to other threads!
TM Approach

    struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
    {
        struct foo *q;

        begin_txn;
        q = src->first;
        if (q != NULL) {
            src->first = q->next;
            q->next = dst->first;
            dst->first = q;
        }
        end_txn;
        return q;
    }

Let the TM system take care of the rest!
Transactional Memory
- A solution to the problem of consistency in the face of concurrency, adopted from the database world: transactions.
- Simple, composable, scalable.
- Atomic blocks == transactions.
  - Atomicity: all-or-nothing execution of a transaction.
  - Isolation: partial results are invisible to other transactions/threads.
Transactional Memory
- TM is a non-blocking synchronization mechanism: at least one thread will succeed.
- Can be constructed to be either:
  - Optimistic: speculate, proceeding without asking permission (acquire no locks on reads/writes). Performs well when critical regions rarely interfere with each other.
  - Pessimistic: "always ask for permission," acquiring locks on reads/writes (blocking), as used in databases. Good when conflicts are frequent.
HW Transactional Memory
- New instructions (LT, LTX, ST, Abort, Commit, Validate).
- A fully associative transactional cache buffers updates.
- Piggybacks on the multiprocessor cache-coherence protocol to detect transaction conflicts.
SW Transactional Memory
- Obstruction-free:
  - Introduce a level of indirection.
  - Log modifications to memory locations in descriptors.
  - Based on the transaction's outcome, commit by atomically writing the new values to their memory locations, or abort by reverting to the old values.
- Non-obstruction-free:
  - Revocable two-phase locking for writes: a transaction locks every object it writes and holds those locks until it terminates. On deadlock, one transaction aborts, releasing its locks and reverting its writes.
  - Optimistic concurrency control for reads: whenever a transaction reads an object, it logs the version it read. At commit time, it verifies that these are still the objects' current versions.
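The "log the version read, re-validate at commit" scheme can be sketched with C11 atomics. This is a toy, not any real STM's API: each shared variable carries a version counter bumped on every committed write; a transaction records the versions it saw, and commit-time validation succeeds only if none have changed since.

```c
#include <stdatomic.h>

struct tvar {
    atomic_uint version;    /* bumped on every committed write */
    int value;
};

struct read_entry {
    struct tvar *var;
    unsigned seen;          /* version observed at read time */
};

struct txn {
    struct read_entry reads[8];   /* toy fixed-size read set */
    int nreads;
};

/* Transactional read: log the version alongside the value. */
static int tx_read(struct txn *tx, struct tvar *v)
{
    tx->reads[tx->nreads].var  = v;
    tx->reads[tx->nreads].seen = atomic_load(&v->version);
    tx->nreads++;
    return v->value;
}

/* Commit-time validation: every logged version must be unchanged.
 * Returns 1 if the transaction may commit, 0 if it must abort/retry. */
static int tx_validate(struct txn *tx)
{
    for (int i = 0; i < tx->nreads; i++)
        if (atomic_load(&tx->reads[i].var->version) != tx->reads[i].seen)
            return 0;       /* a writer committed underneath us */
    return 1;
}

/* A committing writer: install the value and bump the version. */
static void tvar_write(struct tvar *v, int value)
{
    v->value = value;
    atomic_fetch_add(&v->version, 1);
}
```

The validation loop is where the "cost of consistency validation" overhead on the next slides comes from: it runs on every commit and grows with the read set.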
TM Strengths
- Non-blocking: the system as a whole makes progress.
- Familiar: transactions are well known from database systems, and trivial hardware implementations resemble LL/SC.
- Scalable: allows multiple non-interfering threads to execute concurrently in a critical section.
- Automatic disjoint-access parallelism: achieved without designing a complex fine-grained locking scheme.
- Modular and composable: transactions may be nested or composed.
TM Strengths
- Deadlock-free: avoids common pitfalls of lock composition such as deadlock.
- Fault tolerance: the failure of one transaction does not affect others.
- Non-partitionable data structures: usable with difficult-to-partition data structures such as unstructured graphs.
TM Problems & Improvements
Problem: Portability in Hardware TM
- Portability: HTM needs special hardware.
- Transaction size is limited by the transactional cache; newer implementations address cache overflow through virtualization.
Solution
- Use HTM for small transactions, but fall back to STM otherwise, with language support.
- Transparency to the application requires the semantics of HTM and STM to be identical.
TM Problems & Improvements
Problem: Performance in Software TM
- Poor performance compared to locking, even at low levels of contention:
  - atomic operations for acquiring shared-object handles
  - cost of consistency validation
  - cache effects of shared-object metadata
  - dynamic allocation, data copying, and memory reclamation
Solution
- STM performance can be improved by eliminating the overheads of indirection, dynamic allocation, data copying, and memory reclamation by relaxing the non-blocking property.
But this reintroduces many of the problems of locking!
TM Problems & Improvements
Problem: Non-Idempotent Operations (I/O)
- A transaction cannot perform any operation that cannot be undone, such as I/O, memory remapping, or thread creation and destruction: re-executing it on transaction retry would, for example, send a request multiple times.
Common solution
- Postpone I/O until the transaction's outcome is known, avoiding I/O retries.
Problematic scenario
- The I/O waits for the commit, and the commit waits for I/O completion: self-deadlock!
TM Problems & Improvements
Solutions: Non-Idempotent Operations (I/O)
- Buffered I/O might be handled by bringing the buffering mechanism within the scope of the transactions doing I/O, but this cannot handle the scenario shown.
- Both sender and receiver could be enclosed in one transaction, but transactions are currently limited to a single system.
- Transactions performing non-idempotent operations can run in "inevitable" mode, where they are guaranteed to commit, avoiding the irreversibility problem of I/O. But this does not scale: at most one transaction can be inevitable at a time.
TM Problems & Improvements
Problem: Contention Management
- When transactions collide, only one can proceed; the others must be rolled back.
  - Starvation of large transactions by smaller ones.
  - Delay of a high-priority thread via rollback of its transactions due to conflicts with those of a lower-priority thread.
Solution
- Communication between the scheduler and the transaction contention manager.
- Carefully select which transactions to roll back, based on priority, amount of work done, etc.
- Convert read-only transactions to a non-transactional form, in a manner similar to pairing locking with RCU. The writer must provide the primitives needed to support non-transactional readers; see "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole.
TM Problems: Privatization
- Privatization: an optimization that allows some data to be accessed non-transactionally.
Need
- Improve performance by temporarily exempting objects from the overhead of transactional access: trading strong isolation for performance.
Problems
- Can break isolation guarantees, allowing inconsistent concurrent access. Performance vs. correctness.
TM Problems: Privatization

Example timeline (T1 intends to insert node A1 after A; T2 intends to privatize the list):
1. T1 reads A.
2. T1 locks A.
3. T2 locks the list head.
4. T2 commits (local = head; head = NULL) and unlocks.
5. T2, having privatized the list, performs its operations on it.
6. T1 commits (A1->next = B; A->next = A1) and unlocks.

Certain STM optimizations can thus end up allowing concurrent access to privatized data!
TM Problems & Improvements
Problem: Ratio of Data-Operation to Control-Operation Overhead
- DBMS: a data operation usually includes reads/writes to a mass-storage device, so the transaction overhead is comparatively negligible.
- TM: data operations are almost always just reads/writes to memory, so the transaction overhead looms large.
Solution
- Use TM for heavyweight operations, such as grouping system calls.
Problem: Debuggability
- Transactions are difficult to debug: breakpoints cause unconditional aborts.
Solution
- The debugging issue can be addressed by using STM, but this requires a high degree of compatibility between STM and HTM.
TM Problems & Improvements
Other problems:
- Interaction with other systems is important; in practice it is complicated and expensive.
- Conflict-prone variables: data structures that inevitably appear in every critical section cause excessive conflicts.
- Performance overhead from conflict resolution and excessive restarts under high conflict rates.
Where do Locking and TM fit in?

Scenario → best technique (why):
- Partitionable data structures → Locking (disjoint-access parallelism)
- Large non-partitionable data structures → TM (automatic disjoint-access parallelism)
- Read-mostly situations → Locking/TM paired with hazard pointers or RCU (readers scale)
- Update-heavy situations → TM (writers scale)
- Complex fine-grained locking design, no clear lock hierarchy → TM (deadlock avoidance)
- Atomic operations spanning multiple independent data structures, e.g. pop from one stack and push onto another → TM (composability)
- Single-threaded software with an embarrassingly parallel core containing only idempotent operations → TM (performance benefits without much programming effort)
- Non-idempotent operations → Locking (supports non-idempotent operations)
- Large critical sections → Locking (lock-acquisition cost is small compared to retry cost)
- Commodity hardware → Locking (commodity hardware suffices; HTM requires specialized hardware and depends on cache-geometry details, otherwise performance is limited by STM)
Conclusion: Use the Right Tool For The Job!
- There is no silver bullet: successful adoption of multithreaded/multicore CPUs will require a combination of techniques.
- Analogy with engineering: how many types of fasteners are there? How many subtypes? Nail, screw, clip, bolt, glue, joint, magnet...
- Neither locking nor TM solves the fundamental performance and scalability problems.
- Combine the strengths of various synchronization mechanisms according to need, and integrate them with other techniques: "use the right tool for the job."
- TM's applicability may increase if STM performance improves.
- Formalize and generalize existing techniques such as RCU.
Recent Work
- cx_spinlocks: a new hybrid TM-and-locking primitive. From "TxLinux: Using and Managing Hardware Transactional Memory in an Operating System" by Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel.
- "Inevitable transactions": special transactions containing non-idempotent operations (I/O). Such transactions unconditionally abort any conflicting transactions, so their non-idempotence is safe.
- Allowing more than one concurrent inevitable transaction is necessary for reasonable I/O performance, but its feasibility is an open question.
Recent Work
- Glue relativistic programming and transactional memory together to gain scalability for both readers and writers: "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole.
Future Work
- Expand the comparison to include other synchronization mechanisms (message passing, deferred reclamation, RCU).
- Investigate combining different mechanisms:
  - TM and locking (much work in this area)
  - RCU and locking (the typical use of RCU)
  - TM and RCU (very little work done here)
- There might still be hope for a "silver bullet," but until then it would be quite foolish to ignore combinations of existing mechanisms.
References
- Lecture slides from Winter 2008, by the authors.
- "Parallel Programming with Transactional Memory" by Ulrich Drepper, Red Hat.
- "Software Transactional Memory: Why Is It Only a Research Toy?" by Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee.
- "Privatization Techniques for Software Transactional Memory" by Michael F. Spear, Virendra J. Marathe, Luke Dalessandro, and Michael L. Scott.
- "Inevitability Mechanisms for Software Transactional Memory" by Michael F. Spear, Maged M. Michael, and Michael L. Scott.
- http://en.wikipedia.org/wiki/Software_transactional_memory
THANK YOU!