You Thought You Understood Multithreading
A deep investigation of how .NET multithreading primitives map to hardware and the Windows kernel

Gaël Fraiteur
Founder & Principal Engineer, PostSharp Technologies
@gfraiteur
Hello!
My name is GAEL.
No, I don’t think my accent is funny.
My twitter: @gfraiteur
Agenda
1. Hardware
2. Windows Kernel
3. Application
Hardware (Microarchitecture)
There was no RAM in Colossus, the world’s first electronic programmable computing device (1944). http://en.wikipedia.org/wiki/File:Colossus.jpg
Hardware: Processor Microarchitecture

[Die diagram: Intel Core i7 980x]
Hardware: Hardware Latency

Latency (ns):
• DRAM: 26.2
• L3 cache: 8.58
• L2 cache: 3.0
• L1 cache: 1.2
• CPU cycle: 0.3

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with SiSoftware Sandra.
Hardware: Cache Hierarchy

• Distributed into caches:
  • L1 (per core)
  • L2 (per core)
  • L3 (per die)

[Diagram: two dies with four cores each; every core has its own processor and L2 cache, each die shares an L3 cache, and the dies reach the memory store (DRAM) through core and processor interconnects.]

Synchronization:
• Synchronized through an asynchronous “bus”:
  • Request/response
  • Message queues
  • Buffers
Hardware: Memory Ordering

Issues due to implementation details:
• Asynchronous bus
• Caching & Buffering
• Out-of-order execution
Hardware: Memory Ordering

Processor 0            Processor 1
Store A := 1           Load A
Store B := 1           Load B
Possible values of (A, B) for the loads?

Processor 0            Processor 1
Load A                 Load B
Store B := 1           Store A := 1
Possible values of (A, B) for the loads?

Processor 0            Processor 1
Store A := 1           Store B := 1
Load B                 Load A
Possible values of (A, B) for the loads?
Hardware: Memory Model Guarantees

[Table from http://en.wikipedia.org/wiki/Memory_ordering: which reorderings each memory model permits, across Alpha, ARMv7, PA-RISC, POWER, SPARC RMO, SPARC PSO, SPARC TSO, x86, x86 oostore, AMD64, IA64 and zSeries. Rows: loads reordered after loads, loads reordered after stores, stores reordered after stores, stores reordered after loads, atomics reordered with loads, atomics reordered with stores, dependent loads reordered, incoherent instruction cache pipeline. Every listed architecture allows stores to be reordered after loads; only Alpha reorders dependent loads.]

x86, AMD64: strong ordering. ARM: weak ordering.
Hardware: Memory Models Compared

Processor 0            Processor 1
Store A := 1           Load A
Store B := 1           Load B
Possible values of (A, B) for the loads?
• x86:
• ARM:

Processor 0            Processor 1
Load A                 Load B
Store B := 1           Store A := 1
Possible values of (A, B) for the loads?
• x86:
• ARM:
Hardware: Compiler Memory Ordering

The compiler (and the CLR JIT) can:
• Cache memory locations into registers
• Reorder reads and writes
• Coalesce writes

The volatile keyword disables these compiler optimizations. And that’s all!
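The hoisting hazard can be sketched in C# (the Worker class is a hypothetical example, not from the talk): without volatile, the JIT may cache the flag in a register and the loop might never observe the store made by another thread.

```csharp
using System.Threading;

class Worker
{
    // Marked volatile so the JIT cannot cache the flag in a register;
    // without it, the while loop below could legally spin forever.
    private volatile bool _stop;

    public void Run()
    {
        while (!_stop)
        {
            // do work
        }
    }

    public void Stop()
    {
        _stop = true;
    }
}
```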
Hardware: Thread.MemoryBarrier()

Processor 0               Processor 1
Store A := 1              Store B := 1
Thread.MemoryBarrier()    Thread.MemoryBarrier()
Load B                    Load A
Possible values of (A, B)?
• Without MemoryBarrier:
• With MemoryBarrier:

Barriers serialize access to memory.
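The two-processor experiment above can be written out as a C# sketch (the BarrierDemo class is a hypothetical illustration): each side stores, fences, then loads.

```csharp
using System.Threading;

static class BarrierDemo
{
    public static int A, B;

    // Thread 0's side: the full fence keeps the store to A from being
    // reordered with the subsequent load of B.
    public static int Thread0()
    {
        A = 1;
        Thread.MemoryBarrier();
        return B;
    }

    // Thread 1 mirrors it: store B, fence, load A.
    public static int Thread1()
    {
        B = 1;
        Thread.MemoryBarrier();
        return A;
    }
}
```

With the fences in place, at most one of the two threads can observe 0; without them, both can.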
Hardware: Atomicity

Processor 0                      Processor 1
Store A := (long) 0xFFFFFFFF     Load A
Possible values of A for the load?
• x86 (32-bit):
• AMD64:

Reads and writes are atomic:
• Up to the native word size (IntPtr)
• For properly aligned fields (attention to struct fields!)
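As a sketch of how to read a 64-bit field safely even in a 32-bit process (the Counter64 class is a hypothetical example):

```csharp
using System.Threading;

class Counter64
{
    private long _value;

    // In a 32-bit process a plain read of a long is two 32-bit loads
    // and can observe a torn value; Interlocked.Read is atomic.
    public long Read()
    {
        return Interlocked.Read(ref _value);
    }

    public void Add(long delta)
    {
        Interlocked.Add(ref _value, delta);
    }
}
```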
Hardware: Interlocked Access

• The only locking mechanism at hardware level: System.Threading.Interlocked
  • Add:             Add(ref x, int a) { x += a; return x; }
  • Increment:       Inc(ref x) { x += 1; return x; }
  • Exchange:        Exc(ref x, int v) { int t = x; x = v; return t; }
  • CompareExchange: CAS(ref x, int v, int c) { int t = x; if (t == c) x = v; return t; }
  • Read(ref long)
• Implemented by the L3 cache and/or interconnect messages
• Implies a memory barrier
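CompareExchange is the building block for lock-free algorithms. As a sketch (the AtomicMath helper is a hypothetical example, not a .NET API), here is an "interlocked maximum" built from a CAS retry loop:

```csharp
using System.Threading;

static class AtomicMath
{
    // A lock-free "interlocked maximum" built from CompareExchange:
    // retry until our value is published or a larger value is seen.
    public static int InterlockedMax(ref int target, int value)
    {
        int current = Volatile.Read(ref target);
        while (value > current)
        {
            int previous = Interlocked.CompareExchange(ref target, value, current);
            if (previous == current)
                break;          // our CAS won the race
            current = previous; // lost the race; re-read and try again
        }
        return current;
    }
}
```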
Hardware: Cost of Interlocked

Latency (ns):
• Interlocked increment (contended, same socket): 9.75
• L3 cache: 8.58
• Interlocked increment (non-contended): 5.5
• Non-interlocked increment: 1.87
• L1 cache: 1.2

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI) with C# code. Cache latencies measured with SiSoftware Sandra.
Hardware: Cost of Cache Coherency
Hardware: Multitasking

• No thread, no process at hardware level
• There is no such thing as “wait”
• One core never does more than one thing at a time (except with HyperThreading)
• Task-State Segment:
  • CPU state
  • Permissions (I/O, IRQ)
  • Memory mapping
• A task runs until interrupted by hardware or the OS
Lab: Non-Blocking Algorithms
• Non-blocking stack (4.5 MT/s)
• Ring buffer (13.2 MT/s)
Windows Kernel (Operating System)
SideKick (1988). http://en.wikipedia.org/wiki/SideKick
Windows Kernel: Processes and Threads

Process:
• Virtual address space
• Resources
• At least 1 thread
• (other things)

Thread:
• Program (sequence of instructions)
• CPU state
• Wait dependencies
• (other things)
Windows Kernel: Processes and Threads

8 logical processors (4 hyper-threaded cores)
2,384 threads, 155 processes
Windows Kernel: Dispatching Threads to CPUs

[Diagram of thread states:]
• Executing: threads G, H, I, J, one per CPU (CPU 0–3)
• Waiting: threads A, B, C, blocked on dispatcher objects (mutex, semaphore, event, timer)
• Ready: threads D, E, F
Windows Kernel: Dispatcher Objects (WaitHandle)

Kinds of dispatcher objects:
• Event
• Mutex
• Semaphore
• Timer
• Thread

Characteristics:
• Live in the kernel
• Sharable among processes
• Expensive!
Windows Kernel: Cost of Kernel Dispatching

Duration (ns):
• Normal increment: 2
• Non-contended interlocked increment: 6
• Contended interlocked increment: 10
• Access a kernel dispatcher object: 295
• Create + dispose a kernel dispatcher object: 2,093

Measured on Intel® Core™ i7-970 Processor (12M Cache, 3.20 GHz, 4.80 GT/s Intel® QPI).
Windows Kernel: Wait Methods
• Thread.Sleep
• Thread.Yield
• Thread.Join
• WaitHandle.Wait
• (Thread.SpinWait)
Windows Kernel: SpinWait, Smart Busy Loops

SpinWait escalates progressively:
1. Thread.SpinWait (hardware awareness, e.g. Intel HyperThreading)
2. Thread.Sleep(0) (gives up to threads of same priority)
3. Thread.Yield() (gives up to threads of any priority)
4. Thread.Sleep(1) (sleeps for 1 ms)

Processor 0:
A = 1;
B = 1;

Processor 1:
SpinWait w = new SpinWait();
int localB;
if (A == 1)
{
    while ((localB = B) == 0)
        w.SpinOnce();
}
Application Programming (.NET Framework)
Application Programming: Design Objectives
• Ease of use
• High performance from applications!
Application Programming: Instruments from the Platform

Hardware:
• Interlocked
• SpinWait
• Volatile

Operating System:
• Thread
• WaitHandle (mutex, event, semaphore, …)
• Timer
Application Programming: New Instruments

Concurrency & Synchronization:
• Monitor (lock), SpinLock
• ReaderWriterLock
• Lazy, LazyInitializer
• Barrier, CountdownEvent

Data Structures:
• BlockingCollection
• Concurrent(Queue|Stack|Bag|Dictionary)
• ThreadLocal, ThreadStatic

Thread Dispatching:
• ThreadPool
• Dispatcher, ISynchronizeInvoke
• BackgroundWorker

Frameworks:
• PLINQ
• Tasks
• Async/Await
Application Programming: Concurrency & Synchronization
Concurrency & Synchronization: Monitor, the lock keyword

1. Starts with interlocked operations (no contention)
2. Continues with a “spin wait”
3. Creates a kernel event and waits

Good performance if contention is low.
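The escalation above is what the lock keyword buys you. A minimal C# sketch (the SafeCounter class is a hypothetical example):

```csharp
class SafeCounter
{
    private readonly object _sync = new object();
    private int _count;

    // Monitor.Enter/Exit via the lock keyword: a cheap interlocked
    // operation when uncontended; the kernel event is only created
    // when a thread actually has to block.
    public void Increment()
    {
        lock (_sync)
        {
            _count++;
        }
    }

    public int Count
    {
        get { lock (_sync) return _count; }
    }
}
```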
Concurrency & Synchronization: ReaderWriterLock
• Concurrent reads allowed
• Writes require exclusive access
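A sketch of the read-mostly pattern, using the newer ReaderWriterLockSlim (the ReadMostlyCache class is a hypothetical example):

```csharp
using System.Collections.Generic;
using System.Threading;

class ReadMostlyCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _map = new Dictionary<string, string>();

    // Many readers may hold the read lock concurrently.
    public bool TryGet(string key, out string value)
    {
        _lock.EnterReadLock();
        try { return _map.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    // A writer waits until it has exclusive access.
    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();
        try { _map[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}
```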
Lab: ReaderWriterLock
Application Programming: Data Structures
Data Structures: BlockingCollection

• Use case: pass messages between threads
• TryTake: waits for an item and takes it
• Add: waits for free capacity (if bounded) and adds the item to the queue
• CompleteAdding: no more items
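The three operations above combine into the classic bounded producer/consumer pattern; a C# sketch (the Pipeline class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class Pipeline
{
    public static int SumInBackground(int[] items)
    {
        // Bounded to 16 items: Add blocks when the consumer lags behind.
        using (var queue = new BlockingCollection<int>(16))
        {
            var consumer = Task.Run(() =>
            {
                int sum = 0;
                // The enumeration ends once CompleteAdding has been
                // called and the queue has drained.
                foreach (int item in queue.GetConsumingEnumerable())
                    sum += item;
                return sum;
            });

            foreach (int item in items)
                queue.Add(item);
            queue.CompleteAdding();

            return consumer.Result;
        }
    }
}
```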
Lab: BlockingCollection
Data Structures: ConcurrentStack

[Diagram: a linked list of nodes with a “top” pointer; producers A, B, C push nodes while consumers A, B, C pop them.]

Consumers and producers compete for the top of the stack.
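The contention on the top pointer is hidden behind a simple push/pop API; a C# sketch (the StackDemo class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

static class StackDemo
{
    public static int[] PushThenDrain()
    {
        var stack = new ConcurrentStack<int>();

        // Push and TryPop are lock-free CAS loops on the top pointer.
        stack.Push(1);
        stack.Push(2);
        stack.Push(3);

        var items = new List<int>();
        int item;
        while (stack.TryPop(out item))
            items.Add(item); // LIFO: the last pushed node comes out first
        return items.ToArray();
    }
}
```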
Data Structures: ConcurrentQueue

[Diagram: a linked list of nodes with “head” and “tail” pointers; producers A, B, C enqueue at the tail while consumers A, B, C dequeue from the head.]

Consumers compete for the head. Producers compete for the tail. Both compete for the same node if the queue is empty.
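Because producers and consumers contend on different ends, concurrent enqueues scale well; a C# sketch (the QueueDemo class is a hypothetical example):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class QueueDemo
{
    public static int ParallelEnqueueCount(int producers, int itemsPerProducer)
    {
        var queue = new ConcurrentQueue<int>();

        // Producers contend only on the tail pointer, so they run
        // largely independently of consumers working at the head.
        Parallel.For(0, producers, p =>
        {
            for (int i = 0; i < itemsPerProducer; i++)
                queue.Enqueue(i);
        });

        int count = 0;
        int ignored;
        while (queue.TryDequeue(out ignored))
            count++;
        return count;
    }
}
```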
Data Structures: ConcurrentBag

[Diagram: the bag keeps a thread-local list of nodes per thread; producer A and consumer A work on thread A’s local storage, producer B and consumer B on thread B’s.]

• Threads preferentially consume nodes they produced themselves.
• No ordering.
Data Structures: ConcurrentDictionary
• TryAdd
• TryUpdate (CompareExchange semantics)
• TryRemove
• AddOrUpdate
• GetOrAdd
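AddOrUpdate turns a racy read-modify-write into a single atomic call; a C# sketch (the CounterMap helper is a hypothetical example):

```csharp
using System.Collections.Concurrent;

static class CounterMap
{
    // AddOrUpdate makes the read-modify-write on one key atomic:
    // insert 1 for a new key, otherwise increment the current value.
    // Note: the update lambda may run more than once under contention.
    public static int Count(ConcurrentDictionary<string, int> map, string key)
    {
        return map.AddOrUpdate(key, 1, (k, current) => current + 1);
    }
}
```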
Application Programming: Thread Dispatching
Thread Dispatching: ThreadPool

Problems solved:
• Executing a task asynchronously (QueueUserWorkItem)
• Waiting for a WaitHandle (RegisterWaitForSingleObject)
• Creating a thread is expensive; the pool recycles threads

• Good for I/O-intensive operations
• Bad for CPU-intensive operations
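A minimal QueueUserWorkItem sketch (the PoolDemo class is a hypothetical example):

```csharp
using System.Threading;

static class PoolDemo
{
    public static int RunOnPool()
    {
        int result = 0;
        using (var done = new ManualResetEvent(false))
        {
            // The work item runs on a recycled pool thread rather than
            // paying the cost of creating a dedicated thread.
            ThreadPool.QueueUserWorkItem(state =>
            {
                result = 6 * 7;
                done.Set();
            });
            done.WaitOne();
        }
        return result;
    }
}
```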
Thread Dispatching: Dispatcher

• Executes a delegate on the UI thread
• Synchronously: Dispatcher.Invoke
• Asynchronously: Dispatcher.BeginInvoke
• Under the hood, uses the HWND message loop
Lab: Background UI
Q&A
Gaël Fraiteur
[email protected]
@gfraiteur
Summary
http://www.postsharp.net/
[email protected]