44
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 5 Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition

Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

1 Copyright © 2012, Elsevier Inc. All rights reserved.

Chapter 5

Multiprocessors and Thread-Level Parallelism

Computer Architecture A Quantitative Approach, Fifth Edition

Page 2: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

2

5.1 Introduction !  Importance of multiprocessing

!  Power wall, ILP wall: power and silicon costs growed faster than performance

!  Growing interest in high-end servers, cloud computing, SaaS

!  Growth of data-intensive applications, internet, massive data….

!  Insight: current desktop performance is acceptable, since data-compute intensive applications run in the cloud

!  Improved understanding of how to use multiprocessors effectively: servers, natural parallelism in large data sets or large number of independent requests

!  Advantages of replicating a design rather than investing in a unique design

Page 3: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

3 Copyright © 2012, Elsevier Inc. All rights reserved.

Introduction !  Thread-Level parallelism

!  Have multiple program counters !  Uses MIMD model !  Targeted for tightly-coupled shared-memory

multiprocessors

!  For n processors, need n threads

!  Amount of computation assigned to each thread = grain size !  Threads can be used for data-level parallelism, but

the overheads may outweigh the benefit

Introduction

Page 4: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

4 Copyright © 2012, Elsevier Inc. All rights reserved.

Types !  Symmetric multiprocessors

(SMP) !  Small number of cores !  Share single memory with

uniform memory latency !  Distributed shared memory

(DSM) !  Memory distributed among

processors !  Non-uniform memory access/

latency (NUMA) !  Processors connected via

direct (switched) and non-direct (multi-hop) interconnection networks

Introduction

Page 5: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

5

Challenges of Parallel Processing !  Two main problems

!  Limited parallelism !  example: to achieve a speedup of 80 with 100 processors we need

to have 99.75% of code able to run in parallel !! (see exmpl p349)

!  Communication costs: 30-50 cycles between separate cores, 100-500 cycle between separate chips (next slide)

!  Solutions !  Limited parallelism

!  better algorithms

!  software systems should maximize hardware occupancy

!  Communication costs; reducing frequency of remote data access !  HW: caching shared data

!  SW: restructuring data to make more accesses local

Page 6: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

6

Example !  We want a speedup of 80 with 100 processors

Page 7: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

7

Example - Communication

Page 8: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

8

Example - Communication

Page 9: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

9 Copyright © 2012, Elsevier Inc. All rights reserved.

Cache Coherence !  Processors may see different values through

their caches:

Centralized S

hared-Mem

ory Architectures

Page 10: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

10 Copyright © 2012, Elsevier Inc. All rights reserved.

Cache Coherence !  Coherence

!  All reads by any processor must return the most recently written value

!  Writes to the same location by any two processors are seen in the same order by all processors

!  Consistency !  When a written value will be returned by a read !  If a processor writes location A followed by location B,

any processor that sees the new value of B must also see the new value of A

Centralized S

hared-Mem

ory Architectures

Page 11: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

11 Copyright © 2012, Elsevier Inc. All rights reserved.

Enforcing Coherence !  Coherent caches provide:

!  Migration: movement of data !  Replication: multiple copies of data

!  Cache coherence protocols !  Directory based

!  Sharing status of each block kept in one location !  Snooping

!  Each core tracks sharing status of each block

Centralized S

hared-Mem

ory Architectures

Page 12: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

12 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols !  Write invalidate

!  On write, invalidate all other copies !  Use bus itself to serialize

!  Write cannot complete until bus access is obtained

!  Write update !  On write, update all copies

Centralized S

hared-Mem

ory Architectures

Page 13: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

13 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols !  Locating an item when a read miss occurs

!  In write-back cache, the updated value must be sent to the requesting processor

!  Cache lines marked as shared or exclusive/modified !  Only writes to shared lines need an invalidate

broadcast !  After this, the line is marked as exclusive

Centralized S

hared-Mem

ory Architectures

Page 14: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

14 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols C

entralized Shared-M

emory A

rchitectures

Page 15: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

15 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols C

entralized Shared-M

emory A

rchitectures

Page 16: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

16

Snoopy Coherence Protocols: MSI

Estado (ação

permitida)

estímulo que causou mudança de estado bus xaction resultante

Miss para um bloco em estado ≠ inválido " dado está lá mas wrong tag " miss

M

I S

Page 17: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

17

Snoopy Coherence Protocols

Figure 5.7 Cache coherence state diagram with the state transitions induced by the local processor shown in black and by the bus activities shown in gray. Activities on a transition are shown in bold.

Page 18: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

18 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols !  Complications for the basic MSI protocol:

!  Operations are not atomic !  E.g. detect miss, acquire bus, receive a response !  Creates possibility of deadlock and races !  One solution: processor that sends invalidate can hold bus

until other processors receive the invalidate

!  Extensions: !  Add exclusive state to indicate clean block in only one

cache (MESI protocol) !  Prevents needing to write invalidate on a write if a block is

read by a single cache them written by the same cache !  Owned state (MOESI)

Centralized S

hared-Mem

ory Architectures

Page 19: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

19 Copyright © 2012, Elsevier Inc. All rights reserved.

Snoopy Coherence Protocols

!  Extensions: !  Owned state (MOESI)

!  Indicates that the block is owned by that cache and it is out of date in memory

!  Owner must supply it on a miss and write-back it in replacement

!  MSI/MESI !  Share a block in M state forces a write-back

Centralized S

hared-Mem

ory Architectures

Page 20: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

20 Copyright © 2012, Elsevier Inc. All rights reserved.

Coherence Protocols: Extensions

!  Shared memory bus and snooping bandwidth is bottleneck for scaling symmetric multiprocessors !  Duplicating tags !  Place directory in

outermost cache !  Use crossbars or point-

to-point networks with banked memory

Centralized S

hared-Mem

ory Architectures

Page 21: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

21

Evolution !  Bus + snoop + small scale multiprocessing = ok !  As number or processors increase

!  multibus: snoopy? !  interconnection network: snoopy?

!  Snoopy demands broadcast, ok with bus !  also possible in interconnection network " traffic, latency, write

serialization !  All solutions but single bus lack its easy “bus order” "

write serialization !  Races? !  Directory is more appropriate for implementing cache

coherence protocols in large scale multiprocessors !  (see history, devil in details, textbook)

Page 22: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

22 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance !  Coherence influences cache miss rate

!  Coherence misses !  True sharing misses

!  Write to shared block (transmission of invalidation) !  Read an invalidated block

!  False sharing misses !  Read an unmodified word in an invalidated block

Perform

ance of Sym

metric S

hared-Mem

ory Multiprocessors

Page 23: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

23 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance Study: Commercial Workload P

erformance of S

ymm

etric Shared-M

emory M

ultiprocessors

Page 24: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

24 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance Study: OLTP L3 P

erformance of S

ymm

etric Shared-M

emory M

ultiprocessors

Page 25: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

25 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance Study: Commercial Workload P

erformance of S

ymm

etric Shared-M

emory M

ultiprocessors

Page 26: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

26 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance Study: Commercial Workload P

erformance of S

ymm

etric Shared-M

emory M

ultiprocessors

Page 27: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

27 Copyright © 2012, Elsevier Inc. All rights reserved.

Performance Study: Commercial Workload P

erformance of S

ymm

etric Shared-M

emory M

ultiprocessors

Page 28: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

28 Copyright © 2012, Elsevier Inc. All rights reserved.

Directory Protocols !  Directory keeps track of every block

!  Which caches have each block !  Dirty status of each block

!  Implement in shared L3 cache !  Keep bit vector of size = # cores for each block in L3 !  Not scalable beyond shared L3

!  Implement in a distributed fashion:

Distributed S

hared Mem

ory and Directory-B

ased Coherence

Page 29: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

29 Copyright © 2012, Elsevier Inc. All rights reserved.

Directory Protocols !  For each block, maintain state:

!  Shared !  One or more nodes have the block cached, value in memory

is up-to-date !  Set of node IDs

!  Uncached !  Modified

!  Exactly one node has a copy of the cache block, value in memory is out-of-date

!  Owner node ID

!  Directory maintains block states and sends invalidation messages

Distributed S

hared Mem

ory and Directory-B

ased Coherence

Page 30: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

30 Copyright © 2012, Elsevier Inc. All rights reserved.

Directory Protocols !  For uncached block:

!  Read miss !  Requesting node is sent the requested data and is made the

only sharing node, block is now shared !  Write miss

!  The requesting node is sent the requested data and becomes the sharing node, block is now exclusive

!  For shared block: !  Read miss

!  The requesting node is sent the requested data from memory, node is added to sharing set

!  Write miss !  The requesting node is sent the value, all nodes in the

sharing set are sent invalidate messages, sharing set only contains requesting node, block is now exclusive

Distributed S

hared Mem

ory and Directory-B

ased Coherence

Page 31: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

31 Copyright © 2012, Elsevier Inc. All rights reserved.

Directory Protocols !  For modified block:

!  Read miss !  The owner is sent a data fetch message, block becomes

shared, owner sends data to the directory, data written back to memory, sharers set contains old owner and requestor

!  Data write back !  Block becomes uncached, sharer set is empty

!  Write miss !  Message is sent to old owner to invalidate and send the

value to the directory, requestor becomes new owner, block remains exclusive

Distributed S

hared Mem

ory and Directory-B

ased Coherence

Page 32: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

32 Copyright © 2012, Elsevier Inc. All rights reserved.

Messages D

istributed Shared M

emory and D

irectory-Based C

oherence

Page 33: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

33 Copyright © 2012, Elsevier Inc. All rights reserved.

Directory Protocols D

istributed Shared M

emory and D

irectory-Based C

oherence

Page 34: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

34 Copyright © 2012, Elsevier Inc. All rights reserved.

Synchronization !  Basic building blocks:

!  Atomic exchange !  Swaps register with memory location

!  Test-and-set !  Sets under condition

!  Fetch-and-increment !  Reads original value from memory and increments it in memory

!  Requires memory read and write in uninterruptable instruction

!  load linked/store conditional !  If the contents of the memory location specified by the load linked

are changed before the store conditional to the same address, the store conditional fails

Synchronization

Page 35: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

35 Copyright © 2012, Elsevier Inc. All rights reserved.

Synchronization !  Basic building blocks:

!  Atomic exchange !  Swaps register with memory location

Synchronization

Page 36: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

36 Copyright © 2012, Elsevier Inc. All rights reserved.

Synchronization !  Basic building blocks:

!  Fetch-and-increment !  Reads original value from memory and increments it in memory

Synchronization

Page 37: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

37 Copyright © 2012, Elsevier Inc. All rights reserved.

Implementing Locks !  Spin lock

!  If no coherence: DADDUI R2,R0,#1

lockit: EXCH R2,0(R1) ;atomic exchange BNEZ R2,lockit ;already locked?

!  If coherence: lockit: LD R2,0(R1) ;load of lock

BNEZ R2,lockit ;not available-spin DADDUI R2,R0,#1 ;load locked value EXCH R2,0(R1) ;swap BNEZ R2,lockit ;branch if lock wasn’t 0

Synchronization

Page 38: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

38 Copyright © 2012, Elsevier Inc. All rights reserved.

Implementing Locks !  Advantage of this scheme: reduces memory

traffic

Synchronization

Page 39: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

39 Copyright © 2012, Elsevier Inc. All rights reserved.

Implementing Locks !  In summary:

!  Step 1: P0 starts with the lock

!  Step 2: P0 unlocks the lock

!  Steps 3 to 5: P1 and P2 race to see which reads the unlocked value

!  Steps 6,7: P2 wins an enters the critical section

!  Steps 7,8: P1’s attempt fails, spins again

Synchronization

Page 40: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

40 Copyright © 2012, Elsevier Inc. All rights reserved.

Models of Memory Consistency M

odels of Mem

ory Consistency: A

n Introduction

Processor 1: A=0 … A=1 if (B==0) …

Processor 2: B=0 … B=1 if (A==0) …

!  Should be impossible for both if-statements to be evaluated as true !  Delayed write invalidate?

!  Sequential consistency: !  Result of execution should be the same as long as:

!  Accesses on each processor were kept in order !  Accesses on different processors were arbitrarily interleaved

Page 41: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

41 Copyright © 2012, Elsevier Inc. All rights reserved.

Models of Memory Consistency M

odels of Mem

ory Consistency: A

n Introduction

!  Sequential consistency (Lamport 1979):

!  the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program

Page 42: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

42 Copyright © 2012, Elsevier Inc. All rights reserved.

Implementing Locks !  To implement, delay completion of all memory

accesses until all invalidations caused by the access are completed !  Reduces performance!

!  Alternatives: !  Program-enforced synchronization to force write on

processor to occur before read on the other processor !  Requires synchronization object for A and another for B

!  “Unlock” after write !  “Lock” after read

Models of M

emory C

onsistency: An Introduction

Page 43: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

43 Copyright © 2012, Elsevier Inc. All rights reserved.

Relaxed Consistency Models !  Rules:

!  X → Y !  Operation X must complete before operation Y is done !  Sequential consistency requires:

!  R → W, R → R, W → R, W → W

!  Relax W → R !  “Total store ordering”

!  Relax W → W !  “Partial store order”

!  Relax R → W and R → R !  “Weak ordering” and “release consistency”

Models of M

emory C

onsistency: An Introduction

Page 44: Chapter 5 Multiprocessors and Thread-Level Parallelismsandro/cursos/mo401/1s2017/... · 2017. 5. 29. · Symmetric multiprocessors (SMP) ! Small number of cores ! Share single memory

44 Copyright © 2012, Elsevier Inc. All rights reserved.

Relaxed Consistency Models !  Consistency model is multiprocessor specific

!  Programmers will often implement explicit synchronization

!  Speculation gives much of the performance advantage of relaxed models with sequential consistency !  Basic idea: if an invalidation arrives for a result that

has not been committed, use speculation recovery

Models of M

emory C

onsistency: An Introduction