Lock-Free Programming

2016.08.18

Content

• Parallel
• Barrier
• Memory Order
• Volatile
• Atomic
• Lock-Free
• ABA Problem
• Reference

Parallel Computing

• Cache Coherence
  • https://en.wikipedia.org/wiki/Cache_coherence
  • False sharing
• Sequential Consistency
  • https://en.wikipedia.org/wiki/Sequential_consistency
  • Compiler, CPU, multicore
  • Cache load, register

False Sharing

• Solution: Padding
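Padding can be sketched as follows. This is a minimal illustration, not a tuned implementation: the 64-byte line size and the names `kCacheLineSize`, `PaddedCounter` and `run_padded_counters` are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumption: 64-byte cache lines (common on x86-64, not guaranteed elsewhere).
constexpr std::size_t kCacheLineSize = 64;

// alignas rounds the size of each counter up to a full cache line,
// so two counters in an array never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per cache line");

// Two threads increment independent counters. Without the padding the two
// counters could sit in one cache line and keep invalidating each other's copy.
long run_padded_counters(long iterations) {
    assert(iterations >= 0);
    PaddedCounter counters[2];
    auto work = [&counters, iterations](int index) {
        for (long i = 0; i < iterations; ++i) {
            counters[index].value.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread t0(work, 0);
    std::thread t1(work, 1);
    t0.join();
    t1.join();
    return counters[0].value.load() + counters[1].value.load();
}
```

The result is the same with or without padding (the sum of both counters); only the amount of cache-line traffic between the two cores changes.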

Processor Guaranteed Atomic

• Bus Lock
  • https://software.intel.com/en-us/node/544402
  • LOCK# signal
• Cache Lock
  • Between CPU and Memory
  • Cache Coherence

Memory Barrier

• Memory Barrier
  • https://en.wikipedia.org/wiki/Memory_barrier
  • Causes a CPU or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction
• Compile-time Memory Ordering
  • atomic_thread_fence(memory_order_acq_rel);
  • Forbids the compiler from reordering reads and writes across it, and also emits a CPU fence where the target needs one
  • atomic_signal_fence restricts the compiler only
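A fence can be shown at work in a small message-passing sketch. The names `payload`, `ready`, `publish`, `wait_and_read` and `run_fence_demo` are hypothetical and exist only for this illustration; the atomic flag itself is accessed with relaxed operations and the ordering comes entirely from the two fences.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag, accessed with relaxed operations only

void publish(int value) {
    payload = value;
    // Release fence: the write to payload cannot move below the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int wait_and_read() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the read of payload cannot move above the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return payload;
}

int run_fence_demo() {
    payload = 0;
    ready.store(false);
    int seen = 0;
    std::thread consumer([&seen] { seen = wait_and_read(); });
    std::thread producer([] { publish(42); });
    producer.join();
    consumer.join();
    assert(seen == 42);
    return seen;
}
```

Replacing both fences with `atomic_signal_fence` would keep the compiler from reordering but would no longer be enough between two threads on a weakly ordered CPU.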


Memory Ordering

• Memory Ordering
  • https://en.wikipedia.org/wiki/Memory_ordering
  • The runtime order of accesses to computer memory by a CPU
• Sequential Consistency
  • All reads and all writes are in-order
• Relaxed consistency
  • Some types of reordering are allowed
• Weak consistency
  • Reads and writes are arbitrarily reordered, limited only by explicit memory barriers

Volatile

• Volatile
  • https://en.wikipedia.org/wiki/Volatile_(computer_programming)
  • Un-cacheable variable
  • Prevents reordering between volatile variables
  • Not applicable when a write
    • Depends on another variable
    • Depends on the old value
  • Enhanced in Java
    • Write: release
    • Read: acquire

Volatile in Java

[Figure: barriers around volatile accesses in Java. Read, Write, StoreStore barrier, Volatile Write, StoreLoad barrier, Volatile Read, LoadStore and LoadLoad barriers, Read, Write.]

Memory Ordering in C++11

• memory_order_relaxed
• memory_order_acquire
• memory_order_release
• memory_order_consume
• memory_order_acq_rel
• memory_order_seq_cst

Relaxed Ordering

• Atomicity
• Modification order consistency
• Example
  • A is sequenced-before B, C is sequenced-before D
  • Is it allowed to produce r1 == r2 == 42?
• Reference counters of std::shared_ptr

    // thread 1
    r1 = y.load(memory_order_relaxed);  // A
    x.store(r1, memory_order_relaxed);  // B

    // thread 2
    r2 = x.load(memory_order_relaxed);  // C
    y.store(42, memory_order_relaxed);  // D

    // possible order (D, A, B, C), which gives r1 == r2 == 42
    y.store(42, memory_order_relaxed);
    r1 = y.load(memory_order_relaxed);
    x.store(r1, memory_order_relaxed);
    r2 = x.load(memory_order_relaxed);
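The reference-counter use case relies only on atomicity, not on ordering. A minimal sketch of that idea, with the hypothetical helper `count_with_relaxed` standing in for a real reference count:

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Several threads bump one counter with relaxed increments.
// Relaxed ordering still guarantees that no increment is lost.
long count_with_relaxed(int thread_count, long per_thread) {
    assert(thread_count >= 0 && per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&counter, per_thread] {
            for (long i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    // join() synchronizes with the end of each thread, so this read
    // observes every increment even though it is relaxed.
    return counter.load(std::memory_order_relaxed);
}
```

For example, `count_with_relaxed(4, 10000)` returns 40000. What relaxed ordering does not provide is any ordering of other memory accesses around the increments.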

Release-Acquire Ordering

• Applies between the threads releasing and acquiring the same atomic variable
  • Every memory write sequenced before the atomic store (release) is visible to the thread whose atomic load (acquire) reads the stored value
  • Memory reads sequenced after the atomic load cannot be reordered before it
• Example
  • A sequenced-before B sequenced-before C
  • C synchronizes-with D
  • D sequenced-before E sequenced-before F

    atomic<string*> ptr;
    int data;

    void producer() {
      string* p = new string("Hello");                 // A
      data = 42;                                       // B
      ptr.store(p, memory_order_release);              // C
    }

    void consumer() {
      string* p2;
      while (!(p2 = ptr.load(memory_order_acquire)));  // D
      assert(*p2 == "Hello");                          // E
      assert(data == 42);                              // F
    }

    thread t1(producer);
    thread t2(consumer);

Release-Consume ordering

• Data-dependency relationship
• Example
  • A sequenced-before B sequenced-before C
  • C dependency-ordered-before D
  • D sequenced-before E sequenced-before F
  • A happens-before E?
  • B happens-before F?
• Discouraged (since C++17, while its specification is being revised)

    atomic<string*> ptr;
    int data;

    void producer() {
      string* p = new string("Hello");                 // A
      data = 42;                                       // B
      ptr.store(p, memory_order_release);              // C
    }

    void consumer() {
      string* p2;
      while (!(p2 = ptr.load(memory_order_consume)));  // D
      assert(*p2 == "Hello");  // E: never fires, *p2 carries a dependency from ptr
      assert(data == 42);      // F: may fire, data does not carry a dependency from ptr
    }

    thread t1(producer);
    thread t2(consumer);

Sequentially-Consistent Ordering

• Orders memory the same way as release/acquire ordering
• Establishes a single total modification order of all atomic operations
• Example
  • Is r1 == r2 == 0 possible?

    // With seq_cst operations, r1 == r2 == 0 is not possible
    atomic<int> x { 0 }, y { 0 };
    // thread 1
    x.store(1, memory_order_seq_cst);
    r1 = y.load(memory_order_seq_cst);
    // thread 2
    y.store(1, memory_order_seq_cst);
    r2 = x.load(memory_order_seq_cst);

    // Same guarantee with relaxed operations and seq_cst fences
    // thread 1
    x.store(1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = y.load(memory_order_relaxed);
    // thread 2
    y.store(1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r2 = x.load(memory_order_relaxed);

    // With release/acquire only, r1 == r2 == 0 is possible
    // (a store accepts release but not acq_rel, a load accepts acquire but not acq_rel)
    atomic<int> x { 0 }, y { 0 };
    // thread 1
    x.store(1, memory_order_release);
    r1 = y.load(memory_order_acquire);
    // thread 2
    y.store(1, memory_order_release);
    r2 = x.load(memory_order_acquire);

    // Likewise with relaxed operations and acq_rel fences
    // thread 1
    x.store(1, memory_order_relaxed);
    atomic_thread_fence(memory_order_acq_rel);
    r1 = y.load(memory_order_relaxed);
    // thread 2
    y.store(1, memory_order_relaxed);
    atomic_thread_fence(memory_order_acq_rel);
    r2 = x.load(memory_order_relaxed);
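The sequentially consistent variant of the example can be turned into a small repeatable experiment. This is only a sketch: `count_both_zero` is a hypothetical helper, and a test of this kind can confirm the guarantee but cannot prove it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-thread experiment `rounds` times and counts how often
// both loads returned 0. With seq_cst on every operation the count stays 0.
int count_both_zero(int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

In the single total order, one of the two stores comes first, so the load in the other thread must observe it. Switching the stores to `memory_order_release` and the loads to `memory_order_acquire` removes that guarantee, and a nonzero count becomes a legal outcome.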

Atomic Operations

• atomic_store/load
• atomic_exchange
• atomic_compare_exchange_weak/strong
• atomic_fetch_add/sub/and/or/xor
• atomic_thread_fence
• atomic_signal_fence

Atomic Compare and Exchange

• compare_exchange_weak
  • Allowed to fail spuriously
  • Acts as if (actual value != expected) even if they are equal
  • May require a loop
• compare_exchange_strong
  • Never fails spuriously, so a failure always means a concurrent access changed the value
  • Needs extra overhead on some platforms to retry internally after a spurious failure
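The difference between the two forms shows up in how they are typically used. The following sketch uses hypothetical helpers (`increment_with_weak`, `try_claim`, `cas_demo`) and is one reasonable way to choose between them, not the only one.

```cpp
#include <atomic>
#include <cassert>

// Retry loop: a spurious failure only costs one more iteration,
// so the weak form is sufficient.
int increment_with_weak(std::atomic<int>& value) {
    int expected = value.load(std::memory_order_relaxed);
    while (!value.compare_exchange_weak(expected, expected + 1)) {
        // On failure, expected has been overwritten with the current value.
    }
    return expected + 1;
}

// One-shot attempt: "false" must mean that someone else already claimed
// the flag, so a spurious failure is not acceptable and strong is used.
bool try_claim(std::atomic<bool>& claimed) {
    bool expected = false;
    return claimed.compare_exchange_strong(expected, true);
}

bool cas_demo() {
    std::atomic<int> value{5};
    std::atomic<bool> claimed{false};
    assert(increment_with_weak(value) == 6);
    assert(value.load() == 6);
    assert(try_claim(claimed));    // first claim succeeds
    assert(!try_claim(claimed));   // second claim fails: already taken
    return true;
}
```

Both member functions write the observed value back into `expected` when they fail, which is why the loop does not need to reload it.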

Concurrency Control

• Pessimistic
  • Blocks until the possibility of violation disappears
• Optimistic
  • Assumes collisions between transactions will rarely occur
  • Uses resources without acquiring locks
  • On conflict, the committing transaction rolls back and restarts
  • Compare and Swap

    do {
      expected = resource;
      new_value = some_operation(expected);
    } while (compare_and_swap(resource, expected, new_value) == false);
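The optimistic loop above can be written in C++ with `compare_exchange_weak`. As an illustration, here is an "atomic maximum" update; `update_max` and `max_of_all` are hypothetical names chosen for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Optimistically raise `target` to at least `candidate`:
// read, compute, then commit only if nobody changed the value in between.
void update_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    // Retry only while the candidate would still raise the maximum.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // observed now holds the value written by the conflicting thread.
    }
}

int max_of_all(int thread_count) {
    assert(thread_count >= 0);
    std::atomic<int> maximum{0};
    std::vector<std::thread> threads;
    for (int t = 1; t <= thread_count; ++t) {
        threads.emplace_back([&maximum, t] { update_max(maximum, t * 10); });
    }
    for (auto& th : threads) {
        th.join();
    }
    return maximum.load();
}
```

With 8 threads proposing 10, 20, ..., 80, `max_of_all(8)` returns 80 regardless of the order in which the threads run. No thread ever waits for a lock; a conflict only causes the losing thread to re-evaluate and retry.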

Progress Condition

• Blocking

    int expected = 0;
    while (!lock.compare_exchange_weak(expected, 1)) {
      expected = 0;
      this_thread::yield();
    }

• Obstruction-Free
  • http://cs.brown.edu/people/mph/HerlihyLM03/main.pdf
• Lock-Free

    int local_value = atomic_value.load();
    while (!atomic_value.compare_exchange_weak(local_value, local_value + 1)) {
      // local_value was reloaded by the failed compare-exchange
    }

• Wait-Free

    counter.fetch_add(1);  // XADD

Lock-Free Stack

• Treiber (1986) Algorithm
  • https://en.wikipedia.org/wiki/Treiber_Stack
  • 《Treiber, R.K., 1986. Systems programming: Coping with parallelism. International Business Machines Incorporated, Thomas J. Watson Research Center.》

    // Copyright 2016, Xiaojie Chen. All rights reserved.
    // https://github.com/vorfeed/naesala

    struct IStackNode {
      IStackNode* next;
    };

    template <class T>
    class LockfreeStack {
     public:
      void Push(T* node);
      T* Pop();
     private:
      static_assert(is_base_of<IStackNode, T>::value, "");
      atomic<uint64_t> top_ { 0 };
    };

    void Push(T* node) {
      uint64_t last_top = 0;
      uint64_t node_ptr = reinterpret_cast<uint64_t>(node);
      do {
        // Take out the top node of the stack
        last_top = top_.load(memory_order_acquire);
        // Add a new node as the top of the stack, and point to the old top
        node->next = reinterpret_cast<T*>(last_top);
        // If the top node is modified by other threads, discard this operation and retry
      } while (!top_.compare_exchange_weak(last_top, node_ptr));
    }

[Figure: Push in three steps. Top points to Node2, which links to Node1. NewNode is linked in front of Node2 while Top still points to Node2. Top is then swung to NewNode.]

    T* Pop() {
      T* top = nullptr;
      uint64_t top_ptr = 0, new_top_ptr = 0;
      do {
        // Take out the top node of the stack
        top_ptr = top_.load(memory_order_acquire);
        top = reinterpret_cast<T*>(top_ptr);
        // Empty stack
        if (!top) {
          return nullptr;
        }
        // Set the next node of the top node as the new top of the stack
        new_top_ptr = reinterpret_cast<uint64_t>(top->next);
        // If the top node is modified by other threads, discard this operation and retry
      } while (!top_.compare_exchange_weak(top_ptr, new_top_ptr));
      return top;
    }

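[Figure: Pop in three steps. Top points to Node3, which links to Node2 and then Node1. The next node of Node3 (Node2) is read as the new top. Top is then swung to Node2 and Node3 is returned.]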
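The Push and Pop logic can be exercised in a self-contained sketch. To stay short it deviates from the code above in ways that are assumptions of this example: it stores a typed `std::atomic<Node*>` instead of a `uint64_t`, the names `Node`, `TreiberStack` and `push_concurrently_then_drain` are hypothetical, and popping happens only after all pushing threads have finished, so neither the ABA problem nor node reclamation comes into play.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Node {
    int value = 0;
    Node* next = nullptr;
};

class TreiberStack {
public:
    void Push(Node* node) {
        node->next = top_.load(std::memory_order_relaxed);
        // On failure node->next is refreshed with the current top, then retried.
        while (!top_.compare_exchange_weak(node->next, node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
        }
    }

    Node* Pop() {
        Node* top = top_.load(std::memory_order_acquire);
        // On failure top is refreshed; stop when the stack is empty.
        while (top && !top_.compare_exchange_weak(top, top->next,
                                                  std::memory_order_acquire,
                                                  std::memory_order_acquire)) {
        }
        return top;
    }

private:
    std::atomic<Node*> top_{nullptr};
};

// Pushes from several threads at once, then drains from the calling thread.
long push_concurrently_then_drain(int thread_count, int per_thread) {
    assert(thread_count >= 0 && per_thread >= 0);
    std::vector<Node> nodes(static_cast<std::size_t>(thread_count) * per_thread);
    TreiberStack stack;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&nodes, &stack, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                Node& node = nodes[static_cast<std::size_t>(t) * per_thread + i];
                node.value = 1;
                stack.Push(&node);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    long popped = 0;
    while (Node* node = stack.Pop()) {
        popped += node->value;
    }
    return popped;
}
```

`push_concurrently_then_drain(4, 1000)` returns 4000: every node pushed by any thread is found again, which is the property the compare-exchange retry loop is there to protect.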

Lock-Free Queue

• Michael & Scott (1996) Algorithm
• Java ConcurrentLinkedQueue
• 《Michael, Maged; Scott, Michael (1996). Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Proc. 15th Annual ACM Symp. on Principles of Distributed Computing (PODC). pp. 267–275. doi:10.1145/248052.248106. ISBN 0-89791-800-2.》

    // Copyright 2016, Xiaojie Chen. All rights reserved.
    // https://github.com/vorfeed/naesala

    struct IListNode {
      IListNode(uint64_t next) : next(next) {}
      atomic<uint64_t> next;
    };

    template <class T>
    class LockfreeList {
     public:
      // Both head and tail point to a dummy node if the queue is empty
      LockfreeList() : dummy_(reinterpret_cast<uint64_t>(new T())),
          head_(dummy_), tail_(dummy_) {}
     private:
      // IListNode is not a template, so it is named without template arguments
      static_assert(is_base_of<IListNode, T>::value, "");
      uint64_t dummy_;
      atomic<uint64_t> head_, tail_;
    };

    void Put(T* node) {
      while (true) {
        // The tail node of the queue
        uint64_t tail_ptr = tail_.load(memory_order_acquire);
        T* tail = reinterpret_cast<T*>(tail_ptr);
        // The next node of the tail node
        uint64_t tail_next_ptr = tail->next.load(memory_order_acquire);
        T* tail_next = reinterpret_cast<T*>(tail_next_ptr);
        // If the tail node already has a next node, another thread has linked
        // a node but has not advanced the tail yet
        if (tail_next) {
          // Try to help other threads to swing tail to the next node, and then retry
          tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(tail_next));
        // Else try to link node at the end of the queue
        } else if (tail->next.compare_exchange_weak(tail_next_ptr,
                       reinterpret_cast<uint64_t>(node))) {
          // If successful, try to swing tail to the inserted node
          // Can also be done by other threads
          tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(node));
          break;
        }
      }
    }

[Figure: Put in three steps. Head points to Dummy and Tail points to Node2 (Dummy, Node1, Node2). Node3 is linked after Node2 while Tail still points to Node2. Tail is then swung to Node3.]

    T* Take() {
      while (true) {
        // The head node of the queue
        uint64_t head_ptr = head_.load(memory_order_acquire);
        T* head = reinterpret_cast<T*>(head_ptr);
        // The tail node of the queue
        uint64_t tail_ptr = tail_.load(memory_order_acquire);
        T* tail = reinterpret_cast<T*>(tail_ptr);
        // The next node of the head node
        uint64_t head_next_ptr = head->next.load(memory_order_acquire);
        T* head_next = reinterpret_cast<T*>(head_next_ptr);
        // Empty queue or the tail falling behind
        if (head == tail) {
          // Empty queue, nothing to take
          if (!head_next) {
            return nullptr;
          }
          // Another thread is putting and the tail is falling behind, try to advance it
          tail_.compare_exchange_strong(tail_ptr, reinterpret_cast<uint64_t>(head_next));
        } else {
          // Queue is not empty, do the take operation
          // Another thread has just taken a node
          if (!head_next) {
            continue;
          }
          // Copy the next node of the head node to a buffer
          T data(*head_next);
          // Try to swing head to the next node
          if (head_.compare_exchange_weak(head_ptr, reinterpret_cast<uint64_t>(head_next))) {
            // If successful, copy the buffer data to the old head node
            *head = move(data);
            // Clear the next node pointer of the old head node
            head->next.store(0, memory_order_release);
            // Return the old head node, which now carries the taken data
            return head;
          }
        }
      }
      return nullptr;
    }

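[Figure: Take in three steps. Head points to Dummy and Tail points to Node2 (Dummy, Node1, Node2). Head is swung to Node1. The old dummy node receives the data of Node1 and is returned, and the old Node1 serves as the new dummy (Node1, Dummy, Node2).]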

ABA Problem

• https://en.wikipedia.org/wiki/ABA_problem
• Another thread changes the value, does other work, then changes the value back
• This fools the first thread into thinking "nothing has changed"
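The effect can be reproduced deterministically in a single thread by replaying the interleaving by hand. This is only an illustration of the comparison being fooled; the function name `stale_cas_succeeds` and the values are made up for the example.

```cpp
#include <atomic>
#include <cassert>

// Returns true if a compare-exchange based on a stale read still succeeds
// after the value went A -> B -> A in between.
bool stale_cas_succeeds() {
    constexpr int A = 1, B = 2, C = 3;
    std::atomic<int> shared{A};

    int snapshot = shared.load();   // "thread 1" reads A, then is preempted

    shared.store(B);                // "thread 2" changes A to B ...
    shared.store(A);                // ... does other work, then restores A

    // "thread 1" resumes: the comparison sees A again and cannot tell
    // that the value was modified twice in the meantime.
    bool succeeded = shared.compare_exchange_strong(snapshot, C);
    assert(shared.load() == (succeeded ? C : A));
    return succeeded;
}
```

For a plain integer this is harmless. For a pointer to a node that was popped, freed and reused at the same address, the successful compare-exchange installs a stale `next` pointer and corrupts the stack.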


    template <class T>
    T* Pointer(uint64_t combine) {
      // The low 48 bits hold the pointer
      return reinterpret_cast<T*>(combine & 0x0000FFFFFFFFFFFF);
    }

    template <class T>
    uint64_t Combine(T* pointer) {
      // The high 16 bits hold a version that changes on every call
      static atomic_short version(0);
      return reinterpret_cast<uint64_t>(pointer) |
          (static_cast<uint64_t>(version.fetch_add(1, memory_order_acq_rel)) << 48);
    }

    void Push(T* node) {
      uint64_t last_top_combine = 0;
      uint64_t node_combine = Combine(node);
      do {
        last_top_combine = top_.load(memory_order_acquire);
        node->next = Pointer<T>(last_top_combine);
        // If top_ still holds the same pointer and version, no one has changed the stack
        // (comparing the pointer alone is not enough because of the ABA problem)
        // Atomically replace top with the new node
      } while (!top_.compare_exchange_weak(last_top_combine, node_combine));
    }

    T* Pop() {
      T* top = nullptr;
      uint64_t top_combine = 0, new_top_combine = 0;
      do {
        top_combine = top_.load(memory_order_acquire);
        top = Pointer<T>(top_combine);
        if (!top) {
          return nullptr;
        }
        new_top_combine = Combine(top->next);
        // If top_ still holds the same pointer and version, no one has changed the stack
        // (comparing the pointer alone is not enough because of the ABA problem)
        // Atomically replace top with next
      } while (!top_.compare_exchange_weak(top_combine, new_top_combine));
      return top;
    }
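Why the version bits help can be checked with a single-threaded replay of an A, B, A sequence. The sketch below is an illustration under assumptions: `pack`, `pointer_bits_of` and `stale_cas_is_rejected` are hypothetical names, the "addresses" are plain integers, and the version is a local counter rather than the shared static counter used above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPointerMask = 0x0000FFFFFFFFFFFFull;

// Pack a 16-bit version into the upper bits of a 48-bit value.
std::uint64_t pack(std::uint64_t pointer_bits, std::uint16_t version) {
    return (pointer_bits & kPointerMask) |
           (static_cast<std::uint64_t>(version) << 48);
}

std::uint64_t pointer_bits_of(std::uint64_t combined) {
    return combined & kPointerMask;
}

// Returns true if the pointer part is back to A but the stale
// compare-exchange is rejected because the version part differs.
bool stale_cas_is_rejected() {
    const std::uint64_t a = 0x1000, b = 0x2000;  // stand-ins for node addresses
    std::uint16_t version = 0;
    std::atomic<std::uint64_t> top{pack(a, version++)};

    std::uint64_t snapshot = top.load();         // stale read: (A, version 0)

    top.store(pack(b, version++));               // A -> B
    top.store(pack(a, version++));               // B -> A, with a new version

    bool same_pointer = pointer_bits_of(top.load()) == pointer_bits_of(snapshot);
    bool cas_ok = top.compare_exchange_strong(snapshot, pack(b, version++));
    assert(same_pointer);
    return same_pointer && !cas_ok;
}
```

A 16-bit version wraps around after 65536 updates, so the tag makes an undetected ABA very unlikely rather than impossible.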

Benchmark

[Figure: four bar charts comparing the Condition Variable Queue with the Lock-Free Queue. SPSC: 1 producer, 1 consumer. SPMC: 1P1C, 1P2C, 1P4C, 1P8C, 1P16C, 1P32C. MPSC: 1P1C, 2P1C, 4P1C, 8P1C, 16P1C, 32P1C. MPMC: 1P1C, 2P2C, 4P4C, 8P8C, 16P16C, 32P32C. Vertical axis ticks run from 0 to 1.5E+09 for SPSC and SPMC and from 0 to 1.2E+09 for MPSC and MPMC.]


Reference

• 《Java Concurrency in Practice》
• 《The Art of Multiprocessor Programming》
• 《C++ Concurrency In Action》
• http://open-std.org
• java.util.concurrent
• https://github.com/vorfeed/naesala/lockfree

Thank you
