You Thought You Understood Multithreading
A deep investigation of how .NET multithreading primitives map to hardware and the Windows kernel

Gaël Fraiteur
Founder & Principal Engineer, PostSharp Technologies
@gfraiteur
Hello!
My name is GAEL.
No, I don’t think my accent is funny.
My twitter: @gfraiteur
Agenda
1. Hardware
2. Windows Kernel
3. Application
Hardware (Microarchitecture)
There was no RAM in Colossus, the world’s first electronic programmable computing device (1944). http://en.wikipedia.org/wiki/File:Colossus.jpg
Hardware: Processor Microarchitecture

[Die diagram: Intel Core i7 980x]
Hardware: Hardware Latency

Latency (ns):
• DRAM: 26.2
• L3 cache: 8.58
• L2 cache: 3.0
• L1 cache: 1.2
• CPU cycle: 0.3

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra.
Hardware: Cache Hierarchy

• Distributed into caches:
  • L1 (per core)
  • L2 (per core)
  • L3 (per die)

[Diagram: two dies with four cores each; every core has its own processor and L2 cache, each die shares an L3 cache, and the dies reach the memory store (DRAM) through core and processor interconnects.]

Synchronization:
• Synchronized through an asynchronous “bus”:
  • Request/response
  • Message queues
  • Buffers
Hardware: Memory Ordering

Issues due to implementation details:
• Asynchronous bus
• Caching & Buffering
• Out-of-order execution
Hardware: Memory Ordering

Processor 0            Processor 1
Store A := 1           Load A
Store B := 1           Load B
Possible values of (A, B) for the loads?

Processor 0            Processor 1
Load A                 Load B
Store B := 1           Store A := 1
Possible values of (A, B) for the loads?

Processor 0            Processor 1
Store A := 1           Store B := 1
Load B                 Load A
Possible values of (A, B) for the loads?
Hardware: Memory Model Guarantees

[Table from http://en.wikipedia.org/wiki/Memory_ordering: which reorderings each memory model permits, across Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA64 and zSeries. Rows: loads reordered after loads, loads reordered after stores, stores reordered after stores, stores reordered after loads, atomics reordered with loads, atomics reordered with stores, dependent loads reordered, incoherent instruction cache pipeline. Every listed architecture allows stores to be reordered after loads; only Alpha reorders dependent loads.]

x86, AMD64: strong ordering. ARM: weak ordering.
Hardware: Memory Models Compared

Processor 0            Processor 1
Store A := 1           Load A
Store B := 1           Load B
Possible values of (A, B) for the loads?
• x86:
• ARM:

Processor 0            Processor 1
Load A                 Load B
Store B := 1           Store A := 1
Possible values of (A, B) for the loads?
• x86:
• ARM:
Hardware: Compiler Memory Ordering

The compiler (and the CLR JIT) can:
• Cache memory locations into registers
• Reorder reads and writes
• Coalesce writes

The volatile keyword disables these compiler optimizations. And that’s all!
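The hoisting hazard can be sketched in C# (the Worker class is a hypothetical example, not from the talk): without volatile, the JIT may cache the flag in a register and the loop might never observe the store made by another thread.

```csharp
using System.Threading;

class Worker
{
    // Marked volatile so the JIT cannot cache the flag in a register;
    // without it, the while loop below could legally spin forever.
    private volatile bool _stop;

    public void Run()
    {
        while (!_stop)
        {
            // do work
        }
    }

    public void Stop()
    {
        _stop = true;
    }
}
```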
Hardware: Thread.MemoryBarrier()

Processor 0               Processor 1
Store A := 1              Store B := 1
Thread.MemoryBarrier()    Thread.MemoryBarrier()
Load B                    Load A
Possible values of (A, B)?
• Without MemoryBarrier:
• With MemoryBarrier:

Barriers serialize access to memory.
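The two-processor experiment above can be written out as a C# sketch (the BarrierDemo class is a hypothetical illustration): each side stores, fences, then loads.

```csharp
using System.Threading;

static class BarrierDemo
{
    public static int A, B;

    // Thread 0's side: the full fence keeps the store to A from being
    // reordered with the subsequent load of B.
    public static int Thread0()
    {
        A = 1;
        Thread.MemoryBarrier();
        return B;
    }

    // Thread 1 mirrors it: store B, fence, load A.
    public static int Thread1()
    {
        B = 1;
        Thread.MemoryBarrier();
        return A;
    }
}
```

With the fences in place, at most one of the two threads can observe 0; without them, both can.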
Hardware: Atomicity

Processor 0                      Processor 1
Store A := (long) 0xFFFFFFFF     Load A
Possible values of A for the load?
• x86 (32-bit):
• AMD64:

Reads and writes are atomic:
• Up to the native word size (IntPtr)
• For properly aligned fields (attention to struct fields!)
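As a sketch of how to read a 64-bit field safely even in a 32-bit process (the Counter64 class is a hypothetical example):

```csharp
using System.Threading;

class Counter64
{
    private long _value;

    // In a 32-bit process a plain read of a long is two 32-bit loads
    // and can observe a torn value; Interlocked.Read is atomic.
    public long Read()
    {
        return Interlocked.Read(ref _value);
    }

    public void Add(long delta)
    {
        Interlocked.Add(ref _value, delta);
    }
}
```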
Hardware: Interlocked Access

• The only locking mechanism at hardware level: System.Threading.Interlocked
  • Add:             Add(ref x, int a) { x += a; return x; }
  • Increment:       Inc(ref x) { x += 1; return x; }
  • Exchange:        Exc(ref x, int v) { int t = x; x = v; return t; }
  • CompareExchange: CAS(ref x, int v, int c) { int t = x; if (t == c) x = v; return t; }
  • Read(ref long)
• Implemented by the L3 cache and/or interconnect messages
• Implies a memory barrier
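CompareExchange is the building block for lock-free algorithms. As a sketch (the AtomicMath helper is a hypothetical example, not a .NET API), here is an "interlocked maximum" built from a CAS retry loop:

```csharp
using System.Threading;

static class AtomicMath
{
    // A lock-free "interlocked maximum" built from CompareExchange:
    // retry until our value is published or a larger value is seen.
    public static int InterlockedMax(ref int target, int value)
    {
        int current = Volatile.Read(ref target);
        while (value > current)
        {
            int previous = Interlocked.CompareExchange(ref target, value, current);
            if (previous == current)
                break;          // our CAS won the race
            current = previous; // lost the race; re-read and try again
        }
        return current;
    }
}
```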
Hardware: Cost of Interlocked

Latency (ns):
• Interlocked increment (contended, same socket): 9.75
• L3 cache: 8.58
• Interlocked increment (non-contended): 5.5
• Non-interlocked increment: 1.87
• L1 cache: 1.2

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.
Hardware: Cost of Cache Coherency
Hardware: Multitasking

• No thread, no process at hardware level
• There is no such thing as “wait”
• One core never does more than one thing at a time (except with HyperThreading)
• Task-State Segment:
  • CPU state
  • Permissions (I/O, IRQ)
  • Memory mapping
• A task runs until interrupted by hardware or the OS
Lab: Non-Blocking Algorithms
• Non-blocking stack (4.5 MT/s)
• Ring buffer (13.2 MT/s)
Windows Kernel (Operating System)
SideKick (1988). http://en.wikipedia.org/wiki/SideKick
Windows Kernel: Processes and Threads

Process:
• Virtual address space
• Resources
• At least 1 thread
• (other things)

Thread:
• Program (sequence of instructions)
• CPU state
• Wait dependencies
• (other things)
Windows Kernel: Processes and Threads

8 logical processors (4 hyper-threaded cores)
2,384 threads, 155 processes
Windows Kernel: Dispatching Threads to CPUs

[Diagram of thread states:]
• Executing: threads G, H, I, J, one per CPU (CPU 0–3)
• Waiting: threads A, B, C, blocked on dispatcher objects (mutex, semaphore, event, timer)
• Ready: threads D, E, F
Windows Kernel: Dispatcher Objects (WaitHandle)

Kinds of dispatcher objects:
• Event
• Mutex
• Semaphore
• Timer
• Thread

Characteristics:
• Live in the kernel
• Sharable among processes
• Expensive!
Windows Kernel: Cost of Kernel Dispatching

Duration (ns):
• Normal increment: 2
• Non-contended interlocked increment: 6
• Contended interlocked increment: 10
• Access a kernel dispatcher object: 295
• Create + dispose a kernel dispatcher object: 2,093

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI).
Windows Kernel: Wait Methods
• Thread.Sleep
• Thread.Yield
• Thread.Join
• WaitHandle.Wait
• (Thread.SpinWait)
Windows Kernel: SpinWait, Smart Busy Loops

SpinWait escalates progressively:
1. Thread.SpinWait (hardware awareness, e.g. Intel HyperThreading)
2. Thread.Sleep(0) (gives up to threads of same priority)
3. Thread.Yield() (gives up to threads of any priority)
4. Thread.Sleep(1) (sleeps for 1 ms)

Processor 0:
A = 1;
B = 1;

Processor 1:
SpinWait w = new SpinWait();
int localB;
if (A == 1)
{
    while ((localB = B) == 0)
        w.SpinOnce();
}
Application Programming (.NET Framework)
Application Programming: Design Objectives
• Ease of use
• High performance from applications!
Application Programming: Instruments from the Platform

Hardware:
• Interlocked
• SpinWait
• Volatile

Operating System:
• Thread
• WaitHandle (mutex, event, semaphore, …)
• Timer
Application Programming: New Instruments

Concurrency & Synchronization:
• Monitor (lock), SpinLock
• ReaderWriterLock
• Lazy, LazyInitializer
• Barrier, CountdownEvent

Data Structures:
• BlockingCollection
• Concurrent(Queue|Stack|Bag|Dictionary)
• ThreadLocal, ThreadStatic

Thread Dispatching:
• ThreadPool
• Dispatcher, ISynchronizeInvoke
• BackgroundWorker

Frameworks:
• PLINQ
• Tasks
• Async/Await
Application Programming: Concurrency & Synchronization
Concurrency & Synchronization: Monitor, the lock keyword

1. Starts with interlocked operations (no contention)
2. Continues with a “spin wait”
3. Creates a kernel event and waits

Good performance if contention is low.
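The escalation above is what the lock keyword buys you. A minimal C# sketch (the SafeCounter class is a hypothetical example):

```csharp
class SafeCounter
{
    private readonly object _sync = new object();
    private int _count;

    // Monitor.Enter/Exit via the lock keyword: a cheap interlocked
    // operation when uncontended; the kernel event is only created
    // when a thread actually has to block.
    public void Increment()
    {
        lock (_sync)
        {
            _count++;
        }
    }

    public int Count
    {
        get { lock (_sync) return _count; }
    }
}
```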
Concurrency & Synchronization: ReaderWriterLock
• Concurrent reads allowed
• Writes require exclusive access
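A sketch of the read-mostly pattern, using the newer ReaderWriterLockSlim (the ReadMostlyCache class is a hypothetical example):

```csharp
using System.Collections.Generic;
using System.Threading;

class ReadMostlyCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _map = new Dictionary<string, string>();

    // Many readers may hold the read lock concurrently.
    public bool TryGet(string key, out string value)
    {
        _lock.EnterReadLock();
        try { return _map.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    // A writer waits until it has exclusive access.
    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();
        try { _map[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}
```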
Lab: ReaderWriterLock
Application Programming: Data Structures
Data Structures: BlockingCollection

• Use case: pass messages between threads
• TryTake: waits for an item and takes it
• Add: waits for free capacity (if bounded) and adds the item to the queue
• CompleteAdding: no more items
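The three operations above combine into the classic bounded producer/consumer pattern; a C# sketch (the Pipeline class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class Pipeline
{
    public static int SumInBackground(int[] items)
    {
        // Bounded to 16 items: Add blocks when the consumer lags behind.
        using (var queue = new BlockingCollection<int>(16))
        {
            var consumer = Task.Run(() =>
            {
                int sum = 0;
                // The enumeration ends once CompleteAdding has been
                // called and the queue has drained.
                foreach (int item in queue.GetConsumingEnumerable())
                    sum += item;
                return sum;
            });

            foreach (int item in items)
                queue.Add(item);
            queue.CompleteAdding();

            return consumer.Result;
        }
    }
}
```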
Lab: BlockingCollection
Data Structures: ConcurrentStack

[Diagram: a linked list of nodes with a “top” pointer; producers A, B, C push nodes while consumers A, B, C pop them.]

Consumers and producers compete for the top of the stack.
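The contention on the top pointer is hidden behind a simple push/pop API; a C# sketch (the StackDemo class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

static class StackDemo
{
    public static int[] PushThenDrain()
    {
        var stack = new ConcurrentStack<int>();

        // Push and TryPop are lock-free CAS loops on the top pointer.
        stack.Push(1);
        stack.Push(2);
        stack.Push(3);

        var items = new List<int>();
        int item;
        while (stack.TryPop(out item))
            items.Add(item); // LIFO: the last pushed node comes out first
        return items.ToArray();
    }
}
```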
Data Structures: ConcurrentQueue

[Diagram: a linked list of nodes with “head” and “tail” pointers; producers A, B, C enqueue at the tail while consumers A, B, C dequeue from the head.]

Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.
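Because producers and consumers contend on different ends, concurrent enqueues scale well; a C# sketch (the QueueDemo class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class QueueDemo
{
    public static int ParallelEnqueueCount(int producers, int itemsPerProducer)
    {
        var queue = new ConcurrentQueue<int>();

        // Producers contend only on the tail pointer, so they run
        // largely independently of consumers working at the head.
        Parallel.For(0, producers, p =>
        {
            for (int i = 0; i < itemsPerProducer; i++)
                queue.Enqueue(i);
        });

        int count = 0;
        int ignored;
        while (queue.TryDequeue(out ignored))
            count++;
        return count;
    }
}
```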
Data Structures: ConcurrentBag

[Diagram: the bag keeps a thread-local list of nodes per thread; producer A and consumer A work on thread A’s local storage, producer B and consumer B on thread B’s.]

• Threads preferentially consume nodes they produced themselves.
• No ordering.
Data Structures: ConcurrentDictionary
• TryAdd
• TryUpdate (CompareExchange semantics)
• TryRemove
• AddOrUpdate
• GetOrAdd
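AddOrUpdate turns a racy read-modify-write into a single atomic call; a C# sketch (the CounterMap helper is a hypothetical example):

```csharp
using System.Collections.Concurrent;

static class CounterMap
{
    // AddOrUpdate makes the read-modify-write on one key atomic:
    // insert 1 for a new key, otherwise increment the current value.
    // Note: the update lambda may run more than once under contention.
    public static int Count(ConcurrentDictionary<string, int> map, string key)
    {
        return map.AddOrUpdate(key, 1, (k, current) => current + 1);
    }
}
```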
Application Programming: Thread Dispatching
Thread Dispatching: ThreadPool

Problems solved:
• Executing a task asynchronously (QueueUserWorkItem)
• Waiting for a WaitHandle (RegisterWaitForSingleObject)
• Creating a thread is expensive; the pool recycles threads

• Good for I/O-intensive operations
• Bad for CPU-intensive operations
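A minimal QueueUserWorkItem sketch (the PoolDemo class is a hypothetical example):

```csharp
using System.Threading;

static class PoolDemo
{
    public static int RunOnPool()
    {
        int result = 0;
        using (var done = new ManualResetEvent(false))
        {
            // The work item runs on a recycled pool thread rather than
            // paying the cost of creating a dedicated thread.
            ThreadPool.QueueUserWorkItem(state =>
            {
                result = 6 * 7;
                done.Set();
            });
            done.WaitOne();
        }
        return result;
    }
}
```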
Thread Dispatching: Dispatcher

• Executes a delegate on the UI thread
• Synchronously: Dispatcher.Invoke
• Asynchronously: Dispatcher.BeginInvoke
• Under the hood, uses the HWND message loop
Lab: Background UI
Q&A
Gaël Fraiteur
[email protected]
@gfraiteur
Summary
http://www.postsharp.net/
[email protected]