Source: cseweb.ucsd.edu/classes/fa12/cse260-b/Lectures/Lec17.pdf

Lecture 17

NUMA Architecture and Programming

Announcements

•  Extended office hours today until 6pm
•  Weds after class?
•  Partitioning and communication in Particle method project

©2012 Scott B. Baden /CSE 260/ Fall 2012 2

Uniform vs. non-uniform partitioning

•  Rectilinear partitioning maintains fixed connectivity, simplifying communication

©2012 Scott B. Baden /CSE 260/ Fall 2012 3


Load imbalance drift increases running time

©2012 Scott B. Baden /CSE 260/ Fall 2012 5

[Figure: seconds per velocity evaluation (y axis, 90 to 170) vs. evaluation number (x axis, 0 to 40); one curve labeled STATIC, the other DYNAMIC]

A comparison of static and dynamic load balancing of a vortex dynamics computation on 32 processors of the Intel iPSC-1. If the workloads are partitioned only at the beginning of the calculation (static load balancing, top curve), the loads drift gradually out of balance, and the time required to perform a velocity evaluation steadily increases with time. In contrast, a dynamic load balancing strategy (bottom curve) periodically rebalances the workloads and can maintain an almost steady running time; here loads were rebalanced every fourth velocity evaluation.

Ghost cells with irregular partitioning

©2012 Scott B. Baden /CSE 260/ Fall 2012 6

[Figure: processor i's subregion Li, the surrounding interaction region Di of thickness C = 2 bins, and neighboring subregions Lj and Lk]

Processor i is assigned Li, a subregion of the work mesh, and an external interaction region Di, which we have shaded. Di is a surrounding shell of bins and is C bins thick, where we have chosen C = 2. This region divides into 6 subregions, one for each interacting processor. Subproblems Li and Lk are locally dependent because Di and Lk intersect. In comparison, subproblems Li and Lj are locally independent, because Di and Lj do not intersect.

Repatriating particles

©2012 Scott B. Baden /CSE 260/ Fall 2012 7

[Figure: processor i's old subregion Li, its new subregion Li' after repartitioning, and the region Li - Li' that it gives up]

Processor i's assigned subproblem changes from Li to Li' as the result of repartitioning. The processor must give up the work contained in the region Li - Li' to other processors.

Today’s lecture

•  Consistency
•  NUMA
•  Example NUMA Systems
   -  Cray XE-6
   -  SGI
•  Performance programming

8 ©2012 Scott B. Baden /CSE 260/ Fall 2012

Memory model

•  Cache coherence tells us that memory will eventually be consistent
•  The memory consistency policy tells us when this will happen
•  Even if memory is consistent, changes don’t propagate instantaneously
•  These give rise to correctness issues involving program behavior

9 ©2012 Scott B. Baden /CSE 260/ Fall 2012

Memory consistency model

•  The memory consistency model determines when a written value will be seen by a reader
•  Sequential Consistency maintains a linear execution on a parallel architecture that is consistent with the sequential execution of some interleaved arrangement of the separate concurrent instruction streams
•  Expensive to implement
•  Relaxed consistency
   -  Enforce consistency only at well defined times
   -  Useful in handling false sharing

10 ©2012 Scott B. Baden /CSE 260/ Fall 2012
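To make the relaxed consistency point concrete, here is a minimal C11 sketch (not from the lecture): the release/acquire pair orders the two stores, so a reader that observes flag == 1 also observes data == 42. Under a fully relaxed ordering there would be no such guarantee, which is exactly the kind of correctness issue the consistency model pins down.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

int data = 0;               /* ordinary shared variable */
atomic_int flag = 0;        /* publication flag         */

int producer(void *arg) {
    (void)arg;
    data = 42;                                              /* plain store  */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* publish data */
    return 0;
}

int consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                /* spin until the producer publishes */
    printf("data = %d\n", data);         /* release/acquire guarantees 42     */
    return 0;
}

int main(void) {
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}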

11

Memory consistency

•  A memory system is consistent if the following 3 conditions hold:
   -  Program order
   -  Definition of a coherent view of memory
   -  Serialization of writes

©2012 Scott B. Baden /CSE 260/ Fall 2012

12

Program order

•  If a processor writes and then reads the same location X, and there are no other intervening writes by other processors to X , then the read will always return the value previously written.

[Figure: processor P writes X:=2 to memory; a subsequent read of X by P returns 2]

©2012 Scott B. Baden /CSE 260/ Fall 2012

13

Definition of a coherent view of memory

•  If a processor P reads from location X that was previously written by a processor Q , then the read will return the value previously written, if a sufficient amount of time has elapsed between the read and the write.

[Figure: processor Q writes X:=1 to memory; after sufficient time, a Load X by processor P returns 1]

©2012 Scott B. Baden /CSE 260/ Fall 2012

14

Serialization of writes

•  If two processors write to the same location X, then other processors reading X will observe the same sequence of values, in the order written

•  If 10 and then 20 are written into X, then no processor can read 20 and then 10

©2012 Scott B. Baden /CSE 260/ Fall 2012

Today’s lecture

•  Consistency
•  NUMA
•  Example NUMA Systems
   -  Cray XE-6
   -  SGI
•  Performance programming

15 ©2012 Scott B. Baden /CSE 260/ Fall 2012

16

NUMA Architectures

•  The address space is global to all processors, but memory is physically distributed
•  Point-to-point messages manage coherence
•  A directory keeps track of sharers, one for each block of memory
•  Stanford Dash; NUMA nodes of the Cray XE-6, SGI UV, Altix, Origin 2000

©2012 Scott B. Baden /CSE 260/ Fall 2012

en.wikipedia.org/wiki/Non-Uniform_Memory_Access

17

Some terminology

•  Every block of memory has an associated home: the specific processor that physically holds the associated portion of the global address space

•  Every block also has an owner: the processor whose memory contains the actual value of the data

•  Initially home = owner, but this can change …
•  … if a processor other than the home processor writes a block

©2012 Scott B. Baden /CSE 260/ Fall 2012

Inside a directory

•  Each processor has a 1-bit “sharer” entry in the directory
•  There is also a dirty bit and a PID identifying the owner in the case of a dirty block

[Figure: each node pairs a processor and cache with memory and a directory holding presence bits and a dirty bit; nodes communicate over an interconnection network. Source: Parallel Computer Architecture, Culler, Singh, & Gupta]

©2012 Scott B. Baden /CSE 260/ Fall 2012 18
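As a concrete illustration, here is a hedged C sketch of what one directory entry might hold for an 8-processor machine like the one in the examples that follow; the type and helper names (dir_entry_t, dir_add_sharer) are made up for this sketch and are not from the lecture.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t presence;   /* bit p set => processor p holds a shared copy    */
    bool    dirty;      /* true => exactly one cache holds a modified copy */
    uint8_t owner;      /* valid only when dirty: PID of the owning cache  */
} dir_entry_t;

/* Record that processor p has loaded the block (p becomes a sharer). */
static inline void dir_add_sharer(dir_entry_t *e, unsigned p) {
    e->presence |= (uint8_t)(1u << p);
}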


Operation of a directory

•  P0 loads A
•  Set directory entry for A (on P1) to indicate that P0 is a sharer

[Figure: P0’s cache holds A; the directory entry for A on P1 has presence bits 1 0 0 0 0 0 0 0]

©2012 Scott B. Baden /CSE 260/ Fall 2012 19

Operation of a directory

•  P2, P3 load A (not shown)
•  The directory entry for A (on P1) now also marks P2 and P3 as sharers

[Figure: the directory entry for A on P1 has presence bits 1 0 1 1 0 0 0 0, i.e. P0, P2, and P3 are sharers]

©2012 Scott B. Baden /CSE 260/ Fall 2012 20

Acquiring ownership of a block

•  P0 writes A
•  P0 becomes the owner of A

[Figure: P0 writes A; the directory entry for A on P1 still shows presence bits 1 0 1 1 0 0 0 0]

©2012 Scott B. Baden /CSE 260/ Fall 2012 21

Acquiring ownership of a block

•  P0 becomes the owner of A
•  P1’s directory entry for A is set to Dirty
•  Outstanding sharers are invalidated
•  Access to the line is blocked until all invalidations are acknowledged

[Figure: P1’s directory entry for A now reads Dirty with owner P0, and all presence bits are cleared; the copies at P2 and P3 are invalidated]

©2012 Scott B. Baden /CSE 260/ Fall 2012 22

Change of ownership

P0 stores into A (home & owner): Store A, #x
P1 stores into A (becomes owner): Store A, #y
P2 loads A

[Figure: the directory at the home (P0) marks A as dirty with owner P1; the entry is shown as 1 1 D P1]

©2012 Scott B. Baden /CSE 260/ Fall 2012 23

Forwarding

P0 stores into A (home & owner): Store A, #x
P1 stores into A (becomes owner): Store A, #y
P2 loads A; the home (P0) forwards the request to the owner (P1)

[Figure: same configuration as the previous slide; the directory at the home (P0) reads 1 1 D P1, and P2’s Load A is forwarded from P0 to P1]

©2012 Scott B. Baden /CSE 260/ Fall 2012 24
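Building on the dir_entry_t sketch above (again a made-up illustration, not the actual protocol engine), the home node’s handling of a read request might look like the following: reply from home memory and add the requester as a sharer when the block is clean, or forward the request to the current owner when it is dirty, as on the Forwarding slide.

typedef enum { REPLY_FROM_HOME, FORWARD_TO_OWNER } read_action_t;

/* Decide how the home node services a read request for one block. */
read_action_t home_handle_read(dir_entry_t *e, unsigned requester,
                               unsigned *forward_to) {
    if (e->dirty) {
        *forward_to = e->owner;      /* only the owner has the latest value */
        return FORWARD_TO_OWNER;
    }
    dir_add_sharer(e, requester);    /* home memory is up to date: reply    */
    return REPLY_FROM_HOME;
}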

25

Performance issues

•  False sharing
•  Locality, locality, locality
   -  Page placement
   -  Page migration
   -  Copying v. redistribution
   -  Layout

©2012 Scott B. Baden /CSE 260/ Fall 2012

Today’s lecture

•  Consistency
•  NUMA
•  Example NUMA Systems
   -  Cray XE-6
   -  SGI
•  Performance programming

26 ©2012 Scott B. Baden /CSE 260/ Fall 2012

27

Cray XE6 node

•  24 cores sharing 32 GB main memory
•  Packaged as 2 AMD Opteron 6172 “Magny-Cours” processors
•  Each processor is a directly connected Multi-Chip Module: two hex-core dies living on the same socket
•  Each die has 6 MB of shared L3, 512 KB L2/core, 64 KB L1/core
   -  1 MB of L3 is used for cache coherence traffic
   -  Direct access to 8 GB main memory via 2 memory channels
   -  4 HyperTransport (HT) links for communicating with other dies
•  Asymmetric connections between dies and processors

©2012 Scott B. Baden /CSE 260/ Fall 2012

www.nersc.gov/users/computational-systems/hopper/configuration/compute-nodes/

29

XE-6 Processor memory interconnect (node)

http://www.hector.ac.uk/cse/documentation/Phase2b/#arch

©2012 Scott B. Baden /CSE 260/ Fall 2012


Today’s lecture

•  Consistency
•  Example NUMA Systems
   -  Cray XE-6
   -  SGI Origin 2000
•  Performance programming

31 ©2012 Scott B. Baden /CSE 260/ Fall 2012

32

Origin 2000 Interconnect

©2012 Scott B. Baden /CSE 260/ Fall 2012

33

Locality

©2012 Scott B. Baden /CSE 260/ Fall 2012

34

Poor Locality

©2012 Scott B. Baden /CSE 260/ Fall 2012

Today’s lecture

•  Consistency
•  NUMA
•  Example NUMA Systems
   -  Cray XE-6
   -  SGI
•  Performance programming

35 ©2012 Scott B. Baden /CSE 260/ Fall 2012

Quick primer on paging

•  We group the physical and virtual address spaces into units called pages

•  Pages are backed up on disk

•  Virtual-to-physical mapping is done by the Translation Lookaside Buffer (TLB), which is backed up by the page tables set up by the OS

•  When we allocate a block of memory, we don’t need to allocate physical storage for its pages immediately; we do it on demand

©2012 Scott B. Baden /CSE 260/ Fall 2012 36
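A small sketch of the demand-allocation point above, assuming Linux/POSIX mmap (not part of the lecture): the anonymous mapping only reserves virtual pages, and a physical page is allocated when it is first touched. On a NUMA system with a first-touch policy, that first touch also decides which node’s memory backs the page.

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 1024 * page;                     /* 1024 virtual pages */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Nothing physical is committed yet; touching each page faults it in. */
    for (size_t off = 0; off < len; off += page)
        buf[off] = 1;

    munmap(buf, len);
    return 0;
}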

Remote access latency

•  When we allocate a block of memory, which processor(s) is (are) the owner(s)?
•  We can control memory locality with the same kinds of data layouts that we use with message passing
•  Page allocation policies
   -  First touch
   -  Round robin
•  Page placement and page migration
•  Copying v. redistribution
•  Layout

©2012 Scott B. Baden /CSE 260/ Fall 2012 37
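Beyond the policies above, placement can also be requested explicitly. Here is a hedged sketch assuming Linux with libnuma installed (link with -lnuma); the library is not discussed in the lecture, but numa_alloc_interleaved and numa_alloc_onnode correspond to the round-robin and fixed-node placements listed above.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {                 /* no NUMA support on this system */
        fprintf(stderr, "NUMA is not available\n");
        return 1;
    }
    size_t bytes = (size_t)1 << 26;             /* 64 MB */

    double *a = numa_alloc_interleaved(bytes);  /* pages spread round-robin over all nodes */
    double *b = numa_alloc_onnode(bytes, 0);    /* all pages placed on node 0              */

    if (a) numa_free(a, bytes);
    if (b) numa_free(b, bytes);
    return 0;
}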

Example

•  Consider the following loop:

   for r = 0 to nReps-1
      for i = 0 to n-1
         a[i] = b[i] + q*c[i]

©2012 Scott B. Baden /CSE 260/ Fall 2012 38
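Under a first-touch policy, where a, b, and c end up is decided by whichever thread initializes them. A minimal OpenMP sketch of the loop above (not from the lecture): initializing with the same static schedule as the compute loop places each thread’s pages in its local memory.

void triad(double *a, double *b, double *c, double q, int n, int nReps) {
    /* Parallel first-touch initialization: under a first-touch policy, each
       page is placed on the node of the thread that writes it first.        */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        a[i] = 0.0;
        b[i] = (double)i;
        c[i] = (double)i;
    }

    for (int r = 0; r < nReps; r++) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = b[i] + q * c[i];    /* same schedule => mostly local pages */
    }
}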

Page Migration: a[i] = b[i] + q*c[i]

techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/0650/bks/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html#id5224855

[Figure: running time without page migration; curves labeled “(No migration)” and “Parallel Initialization, First touch (No migration)”]

©2012 Scott B. Baden /CSE 260/ Fall 2012 39

40

Migration eventually reaches the optimal time

[Figure: running time with page migration enabled; curves labeled “Parallel initialization, first touch (optimal)”, “Parallel initialization, first touch, migration”, “Round robin initialization, w/ migration” (with parallel or serial initialization), and “One node initial placement, w/ migration”]

©2012 Scott B. Baden /CSE 260/ Fall 2012

Cumulative effect of Page Migration

©2012 Scott B. Baden /CSE 260/ Fall 2012 41

43

Bisection Bandwidth and Latency

a(i) = b(i) + q*c(i)

©2012 Scott B. Baden /CSE 260/ Fall 2012

44

Parallelization via the compiler

      integer n, lda, ldr, lds, i, j, k
      real*8 a(lda,n), r(ldr,n), s(lds,n)
!$OMP PARALLEL DO private(j,k,i), shared(n,a,r,s), schedule(static)
      do j = 1, n
         do k = 1, n
            do i = 1, n
               a(i,j) = a(i,j) + r(i,k)*s(k,j)
            enddo
         enddo
      enddo

©2012 Scott B. Baden /CSE 260/ Fall 2012

Coping with false sharing

False sharing

Successive writes by P0 and P1 cause the processors to uselessly invalidate one another’s cached copies of the line

[Figure: P0 and P1 repeatedly writing different words that fall in the same cache line]

©2012 Scott B. Baden /CSE 260/ Fall 2012 46

An example of false sharing

float a[m][n], s[m];
int i, j;
// Outer loop is in parallel
// Consider m = 4 and a 128-byte cache line size: all of s[] shares one line
// Thread i updates element s[i]
#pragma omp parallel for private(i,j), shared(s,a)
for (i = 0; i < m; i++) {
    s[i] = 0.0f;
    for (j = 0; j < n; j++)
        s[i] += a[i][j];
}

©2012 Scott B. Baden /CSE 260/ Fall 2012 47

Avoiding false sharing

float a[m][n], s[m][32];   // pad: each row of s is 128 bytes, one full cache line
int i, j;
#pragma omp parallel for private(i,j), shared(s,a)
for (i = 0; i < m; i++) {
    s[i][0] = 0.0f;
    for (j = 0; j < n; j++)
        s[i][0] += a[i][j];
}

An alternative that avoids the padding is sketched after the layout figure below.

©2012 Scott B. Baden /CSE 260/ Fall 2012 48

[Figure: s laid out as 4 rows of 32 floats (elements 0 … 31); each row is 128 bytes, so each s[i] now sits in its own cache line]
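The padding works, but a hedged alternative sketch (not from the lecture) is to accumulate into a thread-private scalar and write each s[i] exactly once, so the threads hardly ever touch the shared line holding s.

void row_sums(int m, int n, float a[m][n], float s[m]) {
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        float sum = 0.0f;            /* thread-private accumulator, kept in a register */
        for (int j = 0; j < n; j++)
            sum += a[i][j];
        s[i] = sum;                  /* one write per row instead of n writes          */
    }
}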

49

False sharing in higher dimension arrays

Jacobi’s method for solving Laplace’s equation (a row-partitioned sketch follows this slide):

for j = 1 : N
   for i = 1 : M
      u’[i,j] = (u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1])/4;

[Figure (Parallel Computer Architecture, Culler, Singh, & Gupta): a two-dimensional block decomposition of the mesh over processors P0 through P8, shown both in a single global memory and in distributed memory; the annotations mark contiguity in the memory layout, cache blocks that straddle a partition boundary, and cache blocks that lie within a partition]

©2012 Scott B. Baden /CSE 260/ Fall 2012
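A C sketch of the slide’s point (not from the lecture; it assumes C row-major storage, so rows are the contiguous dimension): if the parallel loop runs over rows, every cache line of the output is written by exactly one thread, and partition boundaries do not cause false sharing.

void jacobi_sweep(int M, int N, double u[M + 2][N + 2], double unew[M + 2][N + 2]) {
    #pragma omp parallel for
    for (int i = 1; i <= M; i++)           /* each thread gets a block of whole rows */
        for (int j = 1; j <= N; j++)       /* contiguous traversal within each row   */
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
}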

Fin