vorfeed-chen
Parallel Computing
• Cache Coherence
  • https://en.wikipedia.org/wiki/Cache_coherence
  • False sharing
• Sequential Consistency
  • https://en.wikipedia.org/wiki/Sequential_consistency
  • Compiler, CPU, multicore
  • Cache load, register
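False sharing can be avoided by giving each frequently written variable its own cache line. The sketch below is illustrative only: the 64-byte line size, the `PaddedCounter` type and the `CountOnSeparateLines` helper are assumptions, not part of the original slides.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Each counter occupies its own cache line, so a write to one counter
// does not invalidate the line that holds the other.
struct alignas(kCacheLine) PaddedCounter {
  std::atomic<long> value{0};
};

struct Counters {
  PaddedCounter a;
  PaddedCounter b;
};

// Two threads increment separate counters; returns the sum of both.
long CountOnSeparateLines(long iterations) {
  Counters counters;
  auto work = [iterations](PaddedCounter& c) {
    for (long i = 0; i < iterations; ++i) {
      c.value.fetch_add(1, std::memory_order_relaxed);
    }
  };
  std::thread t1(work, std::ref(counters.a));
  std::thread t2(work, std::ref(counters.b));
  t1.join();
  t2.join();
  return counters.a.value.load() + counters.b.value.load();
}
```

Removing `alignas` would keep the result the same but would probably place both counters on one line, which is the false-sharing case.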
Processor Guaranteed Atomic
• Bus Lock
  • https://software.intel.com/en-us/node/544402
  • LOCK# signal
• Cache Lock
  • Between CPU and Memory
  • Cache Coherence
Memory Barrier
• Memory Barrier
  • https://en.wikipedia.org/wiki/Memory_barrier
  • Causes a CPU or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction
• Compile-time Memory Ordering
  • atomic_signal_fence(memory_order_acq_rel);
  • Forbids the compiler from reordering reads and writes across it, without emitting a CPU fence instruction
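To make the barrier idea concrete, here is a minimal sketch of message passing with explicit fences. The names `payload`, `ready`, `Writer`, `Reader` and `RunFenceDemo` are made up for illustration; the fences are standard `std::atomic_thread_fence` calls.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag used to publish the data

void Writer() {
  payload = 42;
  // Release fence: the write to payload cannot move below the flag store.
  std::atomic_thread_fence(std::memory_order_release);
  ready.store(true, std::memory_order_relaxed);
}

int Reader() {
  while (!ready.load(std::memory_order_relaxed)) {
    std::this_thread::yield();
  }
  // Acquire fence: the read of payload cannot move above the flag load.
  std::atomic_thread_fence(std::memory_order_acquire);
  return payload;
}

// Runs one writer and one reader and returns what the reader saw.
int RunFenceDemo() {
  payload = 0;
  ready.store(false);
  int seen = 0;
  std::thread t1(Writer);
  std::thread t2([&seen] { seen = Reader(); });
  t1.join();
  t2.join();
  return seen;
}
```

Replacing the two fences with `std::atomic_signal_fence` would only constrain the compiler, which is not enough for ordering between threads on weakly ordered CPUs.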
Memory Ordering
• Memory Ordering
  • https://en.wikipedia.org/wiki/Memory_ordering
  • The runtime order of accesses to computer memory by a CPU
• Sequential Consistency
  • All reads and all writes are in-order
• Relaxed consistency
  • Some types of reordering are allowed
• Weak consistency
  • Reads and writes are arbitrarily reordered, limited only by explicit memory barriers
Volatile
• Volatile
  • https://en.wikipedia.org/wiki/Volatile_(computer_programming)
• Un-cacheable variable
  • Prevents reordering between volatile variables
• Not applicable when
  • The written value depends on another variable
  • The written value depends on the old value
• Enhanced in Java
  • Write: release
  • Read: acquire
Volatile in Java
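[Figure: memory barriers around volatile accesses in Java. Ordinary reads and writes are followed by a StoreStore barrier, the volatile write, and a StoreLoad barrier. A volatile read is followed by LoadLoad and LoadStore barriers, then ordinary reads and writes.]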
Memory Ordering in C++11
• memory_order_relaxed
• memory_order_acquire
• memory_order_release
• memory_order_consume
• memory_order_acq_rel
• memory_order_seq_cst
Relaxed Ordering
• Atomicity
• Modification order consistency
• Example
  • A is sequenced-before B, C is sequenced-before D
  • Is it allowed to produce r1 == r2 == 42?
• Reference counters of std::shared_ptr
Relaxed Ordering
// thread 1
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B

// thread 2
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D

// possible order
y.store(42, memory_order_relaxed);
r1 = y.load(memory_order_relaxed);
x.store(r1, memory_order_relaxed);
r2 = x.load(memory_order_relaxed);
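The reference-counter use case can be sketched as follows. This is not the actual `std::shared_ptr` implementation; `RefCount` and `CountLastReleases` are hypothetical names. The increment only needs atomicity, so it is relaxed, while the decrement uses acquire-release so that the thread dropping the last reference sees all earlier writes to the object.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

class RefCount {
 public:
  // Taking a reference only needs atomicity, not ordering.
  void Acquire() { count_.fetch_add(1, std::memory_order_relaxed); }

  // Returns true for exactly one caller: the one that dropped the last
  // reference and may therefore destroy the shared object.
  bool Release() {
    return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

 private:
  std::atomic<int> count_{1};  // starts with one owner
};

// Several threads take and drop references; returns how many Release()
// calls reported "last reference" in total, including the owner's.
int CountLastReleases(int threads, int per_thread) {
  RefCount rc;
  std::atomic<int> last_releases{0};
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&rc, &last_releases, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        rc.Acquire();
        if (rc.Release()) last_releases.fetch_add(1);
      }
    });
  }
  for (auto& th : pool) th.join();
  if (rc.Release()) last_releases.fetch_add(1);  // the owner lets go
  return last_releases.load();
}
```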
Release-Acquire Ordering
• Between the threads releasing and acquiring the same atomic variable
  • All memory writes before the atomic store happen-before the store
  • The atomic load happens-before all memory reads after it
• Example
  • A sequenced-before B sequenced-before C
  • C synchronizes-with D
  • D sequenced-before E sequenced-before F
Release-Acquire Ordering
atomic<string*> ptr;
int data;

void producer() {
  string* p = new string("Hello"); // A
  data = 42; // B
  ptr.store(p, memory_order_release); // C
}

void consumer() {
  string* p2;
  while (!(p2 = ptr.load(memory_order_acquire))); // D
  assert(*p2 == "Hello"); // E
  assert(data == 42); // F
}

thread t1(producer);
thread t2(consumer);
Release-Consume ordering
• Data-dependency relationship
• Example
  • A sequenced-before B sequenced-before C
  • C dependency-ordered-before D
  • D sequenced-before E sequenced-before F
  • A happens-before E?
  • B happens-before F?
• Discouraged
Release-Consume ordering
atomic<string*> ptr;
int data;

void producer() {
  string* p = new string("Hello"); // A
  data = 42; // B
  ptr.store(p, memory_order_release); // C
}

void consumer() {
  string* p2;
  while (!(p2 = ptr.load(memory_order_consume))); // D
  assert(*p2 == "Hello"); // E
  assert(data == 42); // F
}

thread t1(producer);
thread t2(consumer);
Sequentially-Consistent Ordering
• Order memory the same way as release/acquire ordering
• Establish a single total modification order of all atomic operations
• Example
  • Is r1 == r2 == 0 possible?
Sequentially-Consistent Ordering
atomic<int> x { 0 }, y { 0 };

// thread 1
x.store(1, memory_order_seq_cst);
r1 = y.load(memory_order_seq_cst);

// thread 2
y.store(1, memory_order_seq_cst);
r2 = x.load(memory_order_seq_cst);

// thread 1
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r1 = y.load(memory_order_relaxed);

// thread 2
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_seq_cst);
r2 = x.load(memory_order_relaxed);
Sequentially-Consistent Ordering
atomic<int> x { 0 }, y { 0 };

// Without seq_cst, r1 == r2 == 0 is possible.
// memory_order_acq_rel is not valid for a plain store or load,
// so the store uses release and the load uses acquire.

// thread 1
x.store(1, memory_order_release);
r1 = y.load(memory_order_acquire);

// thread 2
y.store(1, memory_order_release);
r2 = x.load(memory_order_acquire);

// thread 1
x.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_acq_rel);
r1 = y.load(memory_order_relaxed);

// thread 2
y.store(1, memory_order_relaxed);
atomic_thread_fence(memory_order_acq_rel);
r2 = x.load(memory_order_relaxed);
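The two-thread example can be turned into a small repeatable experiment. The harness below is a sketch; `CountBothZero` is a made-up helper. With `memory_order_seq_cst` on all four operations the C++ memory model rules out r1 == r2 == 0, so the function should always return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store/load litmus test `rounds` times with seq_cst operations
// and returns how many rounds ended with r1 == 0 and r2 == 0.
int CountBothZero(int rounds) {
  int both_zero = 0;
  for (int i = 0; i < rounds; ++i) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] {
      x.store(1, std::memory_order_seq_cst);
      r1 = y.load(std::memory_order_seq_cst);
    });
    std::thread t2([&] {
      y.store(1, std::memory_order_seq_cst);
      r2 = x.load(std::memory_order_seq_cst);
    });
    t1.join();
    t2.join();
    if (r1 == 0 && r2 == 0) ++both_zero;
  }
  return both_zero;
}
```

Switching the stores to release and the loads to acquire removes that guarantee; whether the both-zero outcome actually shows up then depends on the hardware and on timing.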
Atomic Operations
• atomic_store/load
• atomic_exchange
• atomic_compare_exchange_weak/strong
• atomic_fetch_add/sub/and/or/xor
• atomic_thread_fence
• atomic_signal_fence
Atomic Compare and Exchange
• compare_exchange_weak
  • Allowed to fail spuriously
  • May act as if (actual value != expected) even if they are equal
  • May require a loop
• compare_exchange_strong
  • Fails only when the actual value differs from expected, so a failure always means concurrent access, never a spurious failure
  • May need extra overhead to retry internally after a spurious failure
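A short sketch of when each variant fits. `AtomicMax` and `TryClaim` are hypothetical helpers: the first already needs a loop, so the weak form is sufficient; the second makes a single attempt, so a spurious failure would be a wrong answer and the strong form is used.

```cpp
#include <atomic>
#include <cassert>

// Raises `target` to at least `candidate`. The loop absorbs spurious
// failures, so compare_exchange_weak is enough.
void AtomicMax(std::atomic<int>& target, int candidate) {
  int current = target.load(std::memory_order_relaxed);
  while (current < candidate &&
         !target.compare_exchange_weak(current, candidate)) {
    // On failure `current` holds the latest value; the condition re-checks it.
  }
}

// One-shot claim of a flag. There is no retry loop, so the strong form is
// used: it returns false only if the flag was already set.
bool TryClaim(std::atomic<bool>& claimed) {
  bool expected = false;
  return claimed.compare_exchange_strong(expected, true);
}
```

Both functions rely on the fact that a failed compare-exchange writes the observed value back into its first argument.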
Concurrency Control
• Pessimistic
  • Blocks until the possibility of violation disappears
• Optimistic
  • Assumes collisions between transactions will rarely occur
  • Uses resources without acquiring locks
  • On conflict, the committing transaction rolls back and restarts
  • Compare and Swap

do {
  expected = resource;
  some operation;
} while (compare_and_swap(resource, expected, new_value) == false);
Progress Condition
• Blocking
• Obstruction-Free
  • http://cs.brown.edu/people/mph/HerlihyLM03/main.pdf
• Lock-Free
• Wait-Free
while (!lock.compare_and_set(0, 1)) {
  this_thread::yield();
}

while (!atomic_value.compare_and_set(local_value, local_value + 1)) {
  local_value = atomic_value.load();
}

counter.fetch_add(1); // XADD
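The three snippets above use a Java-style `compare_and_set`. A possible C++ rendering with `std::atomic` is sketched below; the function names and the `RunThreads` helper are assumptions made for illustration. Reading the snippets as blocking (spin lock), lock-free (CAS retry loop) and wait-free (`fetch_add`) is an interpretation, and `fetch_add` is only wait-free where the hardware provides a single instruction such as XADD.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: runs `fn` on `threads` threads and joins them.
template <class Fn>
void RunThreads(int threads, Fn fn) {
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) pool.emplace_back(fn);
  for (auto& th : pool) th.join();
}

// Blocking: a spin lock guards a plain counter. A thread that is suspended
// while holding the lock stalls every other thread.
long CountBlocking(int threads, int per_thread) {
  std::atomic<int> lock{0};
  long counter = 0;
  RunThreads(threads, [&] {
    for (int i = 0; i < per_thread; ++i) {
      int expected = 0;
      while (!lock.compare_exchange_weak(expected, 1,
                                         std::memory_order_acquire)) {
        expected = 0;  // a failed CAS overwrote it with the current value
        std::this_thread::yield();
      }
      ++counter;
      lock.store(0, std::memory_order_release);
    }
  });
  return counter;
}

// Lock-free: a failed CAS means some other thread made progress.
long CountLockFree(int threads, int per_thread) {
  std::atomic<long> value{0};
  RunThreads(threads, [&] {
    for (int i = 0; i < per_thread; ++i) {
      long local = value.load(std::memory_order_relaxed);
      while (!value.compare_exchange_weak(local, local + 1)) {
        // `local` was refreshed by the failed CAS; retry with it.
      }
    }
  });
  return value.load();
}

// Wait-free on hardware with an atomic add instruction: no retry loop.
long CountWaitFree(int threads, int per_thread) {
  std::atomic<long> value{0};
  RunThreads(threads, [&] {
    for (int i = 0; i < per_thread; ++i) value.fetch_add(1);
  });
  return value.load();
}
```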
Lock-Free Stack
• Treiber (1986) Algorithm
  • https://en.wikipedia.org/wiki/Treiber_Stack
  • 《Treiber, R.K., 1986. Systems programming: Coping with parallelism. International Business Machines Incorporated, Thomas J. Watson Research Center.》
// Copyright 2016, Xiaojie Chen. All rights reserved.
// https://github.com/vorfeed/naesala

struct IStackNode {
  IStackNode* next;
};

template <class T>
class LockfreeStack {
 public:
  void Push(T* node);
  T* Pop();
 private:
  static_assert(is_base_of<IStackNode, T>::value, "");
  atomic<uint64_t> top_ { 0 };
};
Lock-Free Stack
void Push(T* node) {
  uint64_t last_top = 0;
  uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
  do {
    // Take out the top node of the stack
    last_top = top_.load(memory_order_acquire);
    // Add a new node as the top of the stack, and point to the old top
    node->next = reinterpret_cast<T*>(last_top);
    // If the top node is modified by other threads, discard this operation and retry
  } while (!top_.compare_exchange_weak(last_top, node_ptr));
}
Lock-Free Stack
T* Pop() {
  T* top = nullptr;
  uint64_t top_ptr = 0, new_top_ptr = 0;
  do {
    // Take out the top node of the stack
    top_ptr = top_.load(memory_order_acquire);
    top = reinterpret_cast<T*>(top_ptr);
    // Empty stack
    if (!top) {
      return nullptr;
    }
    // Set the next node of the top node as the new top of the stack
    new_top_ptr = reinterpret_cast<uint64_t>(top->next);
    // If the top node is modified by other threads, discard this operation and retry
  } while (!top_.compare_exchange_weak(top_ptr, new_top_ptr));
  return top;
}
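For experimenting with the Push/Pop pattern, a compact self-contained variant may help. It is a sketch, not the naesala code: it stores `std::atomic<Node*>` instead of a `uint64_t`, and `Node`, `PointerStack` and `PushConcurrentlyThenCount` are hypothetical names. The test pushes from several threads and pops only after all of them have joined, so it does not exercise the ABA hazard.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct Node {
  int value = 0;
  Node* next = nullptr;
};

class PointerStack {
 public:
  void Push(Node* node) {
    node->next = top_.load(std::memory_order_relaxed);
    // On failure the CAS reloads the current top into node->next.
    while (!top_.compare_exchange_weak(node->next, node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
  }

  Node* Pop() {
    Node* top = top_.load(std::memory_order_acquire);
    // On failure the CAS reloads the current top into `top`.
    while (top && !top_.compare_exchange_weak(top, top->next,
                                              std::memory_order_acquire,
                                              std::memory_order_acquire)) {
    }
    return top;
  }

 private:
  std::atomic<Node*> top_{nullptr};
};

// Pushes threads * per_thread preallocated nodes concurrently, then pops
// them on the calling thread and returns how many came back.
int PushConcurrentlyThenCount(int threads, int per_thread) {
  std::vector<Node> nodes(static_cast<std::size_t>(threads) * per_thread);
  PointerStack stack;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&nodes, &stack, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) {
        stack.Push(&nodes[static_cast<std::size_t>(t) * per_thread + i]);
      }
    });
  }
  for (auto& th : pool) th.join();
  int popped = 0;
  while (stack.Pop() != nullptr) ++popped;
  return popped;
}
```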
Lock-Free Queue
• Michael & Scott (1996) Algorithm
  • Java ConcurrentLinkedQueue
  • 《Michael, Maged; Scott, Michael (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 267–275. doi:10.1145/248052.248106. ISBN 0-89791-800-2.》
Lock-Free Queue
// Copyright 2016, Xiaojie Chen. All rights reserved.
// https://github.com/vorfeed/naesala

struct IListNode {
  IListNode(uint64_t next) : next(next) {}
  atomic<uint64_t> next;
};

template <class T>
class LockfreeList {
 public:
  // Both head and tail point to a dummy if queue is empty
  LockfreeList() : dummy_(reinterpret_cast<uint64_t>(new T())),
      head_(dummy_), tail_(dummy_) {}
 private:
  // IListNode is not a template, so it is named without <T>
  static_assert(is_base_of<IListNode, T>::value, "");
  uint64_t dummy_;
  atomic<uint64_t> head_, tail_;
};
Lock-Free Queue
void Put(T* node) {
  while (true) {
    // The tail node of the queue
    uint64_t tail_ptr = tail_.load(memory_order_acquire);
    T* tail = reinterpret_cast<T*>(tail_ptr);
    // The next node of the tail node
    uint64_t tail_next_ptr = tail->next.load(memory_order_acquire);
    T* tail_next = reinterpret_cast<T*>(tail_next_ptr);
    // If the next node of tail node is modified by other threads
    if (tail_next) {
      // Try to help other threads to swing tail to the next node, and then retry
      tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(tail_next));
    // Else try to link node at the end of the queue
    } else if (tail->next.compare_exchange_weak(tail_next_ptr,
        reinterpret_cast<uint64_t>(node))) {
      // If successful, try to swing Tail to the inserted node
      // Can also be done by other threads
      tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(node));
      break;
    }
  }
}
Lock-Free Queue
[Figure: three snapshots of the linked list, each showing the Head and Tail pointers. First: Dummy, Node1, Node2. Second and third: the same list with Node3 appended.]
Lock-Free Queue
T* Take() {
  while (true) {
    // The head node of the queue
    uint64_t head_ptr = head_.load(memory_order_acquire);
    T* head = reinterpret_cast<T*>(head_ptr);
    // The tail node of the queue
    uint64_t tail_ptr = tail_.load(memory_order_acquire);
    T* tail = reinterpret_cast<T*>(tail_ptr);
    // The next node of the head node
    uint64_t head_next_ptr = head->next.load(memory_order_acquire);
    T* head_next = reinterpret_cast<T*>(head_next_ptr);
    // Empty queue or the tail falling behind
    if (head == tail) {
      // Empty queue, couldn't pop
      if (!head_next) {
        return nullptr;
      }
      // Another thread is pushing and the tail is falling behind, try to advance it
      tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(head_next));
    } else {
      // Queue is not empty, do pop operation
    }
  }
  return nullptr;
}
Lock-Free Queue
// pop operation
// Another thread had just taken a node
if (!head_next) {
  continue;
}
// Copy the next node of the head node to a buffer
T data(*head_next);
// Try to swing head to the next node
if (head_.compare_exchange_weak(head_ptr, reinterpret_cast<uint64_t>(head_next))) {
  // If successful, copy the buffer data to the head node
  *head = move(data);
  // Clear the next node pointer of the head node
  head->next.store(0, memory_order_release);
  // Return the head node
  return head;
}
ABA Problem
• https://en.wikipedia.org/wiki/ABA_problem
• Another thread changes the value, does other work, then changes the value back
• This fools the first thread into thinking "nothing has changed"
ABA Problem
template <class T>
T* Pointer(uint64_t combine) {
  return reinterpret_cast<T*>(combine & 0x0000FFFFFFFFFFFF);
}

template <class T>
uint64_t Combine(T* pointer) {
  static atomic_short version(0);
  return reinterpret_cast<uint64_t>(pointer) |
      (static_cast<uint64_t>(version.fetch_add(1, memory_order_acq_rel)) << 48);
}
ABA Problem
void Push(T* node) {
  uint64_t last_top_combine = 0;
  uint64_t node_combine = Combine(node);
  do {
    last_top_combine = top_.load(memory_order_acquire);
    node->next = Pointer<T>(last_top_combine);
    // If the top node is still next, then assume no one has changed the stack
    // (That statement is not always true because of the ABA problem)
    // Atomically replace top with new node
  } while (!top_.compare_exchange_weak(last_top_combine, node_combine));
}
ABA Problem
T* Pop() {
  T* top = nullptr;
  uint64_t top_combine = 0, new_top_combine = 0;
  do {
    top_combine = top_.load(memory_order_acquire);
    top = Pointer<T>(top_combine);
    if (!top) {
      return nullptr;
    }
    new_top_combine = Combine(top->next);
    // If the top node is still ret, then assume no one has changed the stack
    // (That statement is not always true because of the ABA problem)
    // Atomically replace top with next
  } while (!top_.compare_exchange_weak(top_combine, new_top_combine));
  return top;
}
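The effect of the version tag can be checked in isolation. The sketch below assumes, like the code above, that user-space pointers fit in the low 48 bits, which holds on common x86-64 systems but is not guaranteed everywhere. `Pack`, `UnpackPointer`, `UnpackVersion` and `StaleCasIsRejected` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumption: pointers use at most the low 48 bits of a 64-bit word.
constexpr std::uint64_t kPointerMask = 0x0000FFFFFFFFFFFFull;

std::uint64_t Pack(void* pointer, std::uint16_t version) {
  return (reinterpret_cast<std::uint64_t>(pointer) & kPointerMask) |
         (static_cast<std::uint64_t>(version) << 48);
}

void* UnpackPointer(std::uint64_t packed) {
  return reinterpret_cast<void*>(packed & kPointerMask);
}

std::uint16_t UnpackVersion(std::uint64_t packed) {
  return static_cast<std::uint16_t>(packed >> 48);
}

// Simulates A -> B -> A on a single thread: the word ends up holding the
// same pointer with a newer version, so a CAS that still expects the old
// word must fail. Returns true if it does.
bool StaleCasIsRejected() {
  static int node;  // stands in for a stack node
  std::atomic<std::uint64_t> top{Pack(&node, 1)};
  std::uint64_t stale = top.load();
  // "Another thread" pops the node and pushes it back with a new version.
  top.store(Pack(&node, 2));
  return !top.compare_exchange_strong(stale, Pack(nullptr, 3));
}
```

A 16-bit version can wrap around, so this narrows the ABA window rather than closing it completely.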
Benchmark
[Figure: four charts comparing the Condition Variable Queue with the Lock-Free Queue. SPSC: 1 producer, 1 consumer. SPMC: 1P1C, 1P2C, 1P4C, 1P8C, 1P16C, 1P32C. MPSC: 1P1C, 2P1C, 4P1C, 8P1C, 16P1C, 32P1C. MPMC: 1P1C, 2P2C, 4P4C, 8P8C, 16P16C, 32P32C. The vertical axes run from 0 to about 1.2E+09 or 1.5E+09; the unit is not given.]
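The slides do not show the condition-variable queue used as the baseline. A typical mutex plus condition-variable queue might look like the sketch below; `BlockingQueue` and `SumThroughQueue` are hypothetical names and this is not the benchmarked implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <class T>
class BlockingQueue {
 public:
  void Put(T value) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      items_.push(std::move(value));
    }
    not_empty_.notify_one();  // wake one waiting consumer
  }

  // Blocks until an item is available.
  T Take() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return !items_.empty(); });
    T value = std::move(items_.front());
    items_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_;
  std::queue<T> items_;
};

// One producer sends 1..count, one consumer adds them up.
long SumThroughQueue(int count) {
  BlockingQueue<int> queue;
  long sum = 0;
  std::thread producer([&queue, count] {
    for (int i = 1; i <= count; ++i) queue.Put(i);
  });
  std::thread consumer([&queue, &sum, count] {
    for (int i = 0; i < count; ++i) sum += queue.Take();
  });
  producer.join();
  consumer.join();
  return sum;
}
```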
Reference
• 《Java Concurrency in Practice》
• 《The Art of Multiprocessor Programming》
• 《C++ Concurrency In Action》
• http://open-std.org
• java.util.concurrent
• https://github.com/vorfeed/naesala/lockfree