Lecture 11: Distributed Memory Machines and Programming
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics
• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Acknowledgement
• Slides adapted from the U.C. Berkeley course CS267/EngC233 Applications of Parallel Computers by Jim Demmel and Katherine Yelick, Spring 2011
  – http://www.cs.berkeley.edu/~demmel/cs267_Spr11/
• And materials from various sources
Shared Memory Parallel Systems: Multicore and Multi-CPU
Node-level Architecture and Programming
• Shared memory multiprocessors: multicore, SMP, NUMA
  – Deep memory hierarchy; distant memory is much more expensive to access
  – Machines scale to 10s or 100s of processors
  – Instruction Level Parallelism (ILP), Data Level Parallelism (DLP), and Thread Level Parallelism (TLP)
• Programming
  – OpenMP, PThreads, Cilk Plus, etc.
HPC Architectures (TOP500, Nov 2014)
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
Clusters
• A group of linked computers, working together closely so that in many respects they form a single computer.
• Consists of
  – Nodes (front + computing)
  – Network
  – Software: OS and middleware
[Diagram: computing nodes connected by a high-speed local network, tied together by cluster middleware.]
Top 10 of the Top500
http://www.top500.org/lists/2016/06/
[Slide annotation: nodes + interconnect (Ethernet, InfiniBand, ...) + (MPI)]
HPC Beowulf Cluster
• Master node: or service/front node (used to interact with users locally or remotely)
• Computing nodes: perform the computations
• Interconnect and switch between nodes: e.g., G/10G-bit Ethernet, InfiniBand
• Inter-node programming
  – MPI (Message Passing Interface) is the most commonly used one.
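To give a first taste of MPI before the details later in this lecture, here is a minimal sketch of an MPI program (the compile and launch commands vary by installation; mpicc/mpirun are typical but not universal):

/* hello.c - minimal MPI program: every process reports its rank.
 * Typical build/run (varies by installation):
 *   mpicc hello.c -o hello
 *   mpirun -np 4 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut down the MPI runtime */
    return 0;
}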
Network Switch
Network Interface Card (NIC)
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
Network Analogy
• To have a large number of different transfers occurring at once, you need a large number of distinct wires
  – Not just a bus, as in shared memory
• Networks are like streets:
  – Link = street.
  – Switch = intersection.
  – Distances (hops) = number of blocks traveled.
  – Routing algorithm = travel plan.
• Properties:
  – Latency: how long it takes to get between nodes in the network.
  – Bandwidth: how much data can be moved per unit time.
    • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.
Latency and Bandwidth
• Latency: time for a vehicle to travel from one location to another
  – Vehicle type (large or small messages)
  – Road/traffic conditions, speed limit, etc.
• Bandwidth: how many cars, and how fast, can travel from one location to another
  – Number of lanes
Performance Properties of a Network: Latency
• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Latencies differ by 1-2 orders of magnitude across network designs
  – Software/hardware overhead at source/destination dominates cost (1s-10s of usec)
  – Hardware latency varies with distance (10s-100s of nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

1 second = 10^3 milliseconds (ms) = 10^6 microseconds (us) = 10^9 nanoseconds (ns)
Latency on Some Machines/Networks
[Bar chart: 8-byte roundtrip latency (usec), measured with an MPI ping-pong, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; values range from about 6.6 to 24.2 usec.]
• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use ½ of the roundtrip time to approximate 1-way latency (which can't easily be measured)
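A ping-pong test of this kind takes only a few lines of MPI; a minimal sketch (the 8-byte message size matches the chart, but the iteration count is illustrative; run with at least 2 ranks):

/* pingpong.c - rank 0 and rank 1 bounce an 8-byte message; half the
 * averaged round-trip time approximates the 1-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const int iters = 1000;     /* illustrative iteration count */
    char buf[8] = {0};          /* 8-byte message, as in the chart */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;   /* average round trip, seconds */
    if (rank == 0)
        printf("avg roundtrip %.2f usec, ~1-way %.2f usec\n",
               rtt * 1e6, rtt * 1e6 / 2);
    MPI_Finalize();
    return 0;
}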
End-to-End Latency (½ roundtrip) Over Time
[Scatter plot, ~1990-2010: one-way latency (usec, log scale 1-100) for nCube/2, CM5, CS2, SP1, SP2, Paragon, T3D, SPP, KSR, Cenju3, T3E, SP-Power3, Quadrics, and Myrinet; values range from about 2.6 usec to 36.3 usec.]
• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was the lowest point, in 1997
Data from Kathy Yelick, UCB and NERSC
Performance Properties of a Network: Bandwidth
• The bandwidth of a link = #wires / time per bit
• Bandwidth is typically given in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead.
[Packet format: routing and control header | data payload | error code | trailer]
• Bandwidth is important for applications with mostly large messages
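For example (our own arithmetic, with hypothetical numbers): on a link with a 1 GB/s physical rate, a packet carrying a 1024-byte payload plus 64 bytes of header, error code, and trailer delivers at most 1024/(1024+64) ≈ 94% of the link bandwidth, before any software overhead is counted.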
Bandwidth on Some Networks
[Bar chart: flood bandwidth for 2 MB MPI messages as a percent of hardware peak (0-100%) for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; labeled absolute values include 1504, 630, 244, 857, 225, and 610 MB/s.]
• Flood bandwidth (throughput of back-to-back 2 MB messages)
Bandwidth Chart
[Line chart: bandwidth (MB/s, up to ~400) vs. message size (2048-131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect.]
Data from Mike Welcome, NERSC
Note: bandwidth depends on SW, not just HW
Performance Properties of a Network: Bisection Bandwidth
• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network
[Diagram: a bisection cut vs. a cut that is not a bisection; for a linear array, bisection bw = link bw; for a 2D mesh, bisection bw = sqrt(n) * link bw]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
Other Characteristics of a Network
• Topology (how things are connected):
  – Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, tree, butterfly, perfect shuffle, ...
• Routing algorithm:
  – Example in a 2D torus: all east-west, then all north-south (avoids deadlock); see the sketch after this list.
• Switching strategy:
  – Circuit switching: full path reserved for the entire message, like the telephone.
  – Packet switching: message broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
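Here is the promised sketch of that east-west-then-north-south (dimension-ordered) routing rule on a k x k torus; the representation and function names are our own illustration, not any particular machine's router:

/* Dimension-ordered (X then Y) routing on a k x k torus: move all the
 * way in x first, then in y, taking the shorter wraparound direction in
 * each dimension. Deadlock is avoided because a packet never turns from
 * the y dimension back into x. */
#include <stdio.h>

/* signed step (+1 or -1) along the shorter direction from a to b on a ring */
static int step(int a, int b, int k) {
    int d = (b - a + k) % k;          /* forward distance around the ring */
    return (d <= k - d) ? 1 : -1;
}

void route(int x, int y, int dx, int dy, int k) {
    while (x != dx) {                 /* phase 1: east-west */
        x = (x + step(x, dx, k) + k) % k;
        printf("-> (%d,%d)\n", x, y);
    }
    while (y != dy) {                 /* phase 2: north-south */
        y = (y + step(y, dy, k) + k) % k;
        printf("-> (%d,%d)\n", x, y);
    }
}

int main(void) {
    route(0, 0, 2, 3, 4);             /* a path on a 4x4 torus */
    return 0;
}

Note how the y phase uses the wraparound link: from y=0 to y=3 on a ring of 4 is a single hop in the negative direction.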
Network Topology
• In the past, there was considerable research in network topology and in mapping algorithms to topology.
  – Key cost to be minimized: number of "hops" between nodes (e.g., "store and forward")
  – Modern networks hide hop cost (i.e., "wormhole routing"), so topology is no longer a major factor in algorithm performance.
• Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• We still need some background in network topology
  – Algorithms may have a communication topology
  – Topology affects bisection bandwidth.
Linear and Ring Topologies
• Linear array
  – Diameter = n-1; average distance ~ n/3.
  – Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
  – Diameter = n/2; average distance ~ n/4.
  – Bisection bandwidth = 2.
  – Natural for algorithms that work with 1D arrays.
Meshes and Tori
• Two dimensional mesh
  – Diameter = 2*(sqrt(n) - 1)
  – Bisection bandwidth = sqrt(n)
• Two dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2*sqrt(n)
• Generalizes to higher dimensions
  – Cray XT (e.g., Franklin@NERSC) uses a 3D torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)
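As a concrete example (our own arithmetic from the formulas above): for n = 1024 processors, a 2D mesh has diameter 2*(32-1) = 62 hops and bisection bandwidth 32 links, while a 2D torus cuts the diameter to 32 and doubles the bisection bandwidth to 64 links.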
Hypercubes
• Number of nodes n = 2^d for dimension d.
  – Diameter = d.
  – Bisection bandwidth = n/2.
[Diagram: hypercubes of dimension 0d through 4d]
• Popular in early machines (Intel iPSC, NCUBE).
  – Lots of clever algorithms.
• Gray code addressing:
  – Each node is connected to the others whose labels differ in 1 bit.
[Diagram: 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
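To make the addressing concrete, a minimal sketch (our own illustration, not from any particular machine): flipping each of the d bits of a node's label gives its d neighbors, and the reflected Gray code i ^ (i >> 1) orders the nodes so that consecutive ones are neighbors:

/* Hypercube addressing for n = 2^d nodes: node i's neighbors are i with
 * one bit flipped; gray(i) = i ^ (i >> 1) enumerates the nodes so that
 * consecutive labels differ in exactly one bit. */
#include <stdio.h>

unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const int d = 3;                        /* 3-cube: 8 nodes */
    for (unsigned i = 0; i < (1u << d); i++) {
        printf("node %u neighbors:", i);
        for (int k = 0; k < d; k++)
            printf(" %u", i ^ (1u << k));   /* flip bit k */
        printf("   gray(%u)=%u\n", i, gray(i));
    }
    return 0;
}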
Trees
• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as a planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid the bisection bandwidth problem:
  – More (or wider) links near the top.
  – Example: Thinking Machines CM-5.
Butterflies
• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in the BBN Butterfly.
• Natural for FFT.
[Diagram: a butterfly switch and a multistage butterfly network]
• Example: to get from proc 101 to 110, compare bit-by-bit and switch if they disagree, else not
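That bit-by-bit rule can be written directly in code; a minimal sketch (the function name and stage numbering are our own illustration):

/* Butterfly routing: at each stage, compare the corresponding bit
 * (from the most significant end) of source and destination; cross
 * ("switch") if they disagree, go straight if they agree. */
#include <stdio.h>

void butterfly_route(unsigned src, unsigned dst, int bits) {
    for (int k = bits - 1; k >= 0; k--) {
        unsigned s = (src >> k) & 1, t = (dst >> k) & 1;
        printf("stage %d: %s\n", bits - 1 - k, s == t ? "straight" : "cross");
    }
}

int main(void) {
    butterfly_route(5, 6, 3);   /* 101 -> 110: straight, cross, cross */
    return 0;
}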
Topologies in Real Machines
(newer machines at the top, older at the bottom)

  Machine                                  | Topology
  Cray XT3 and XT4                         | 3D Torus (approx)
  Blue Gene/L                              | 3D Torus
  SGI Altix                                | Fat tree
  Cray X1                                  | 4D Hypercube*
  Myricom (Millennium)                     | Arbitrary
  Quadrics (in HP Alpha server clusters)   | Fat tree
  IBM SP                                   | Fat tree (approx)
  SGI Origin                               | Hypercube
  Intel Paragon (old)                      | 2D Mesh
  BBN Butterfly (really old)               | Butterfly

Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top.
Performance Models
Latency and Bandwidth Model
• Time to send a message of length n is roughly
    Time = latency + n*cost_per_word = latency + n/bandwidth
• Topology is assumed irrelevant.
• Often called the "a-b model" and written
    Time = a + n*b
• Usually a >> b >> time per flop.
  – One long message is cheaper than many short ones:
        a + n*b << n*(a + 1*b)
  – Can do hundreds or thousands of flops for the cost of one message.
• Lesson: need a large computation-to-communication ratio to be efficient.
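The inequality is easy to check numerically; a minimal sketch using made-up but representative values of a and b:

/* Alpha-beta model: moving n bytes in m messages costs m*a + n*b,
 * so one long message beats many short ones. */
#include <stdio.h>

/* predicted time in usec under Time = a + n*b, summed over m messages */
double msg_time(double a, double b, long n, long m) {
    return m * a + n * b;
}

int main(void) {
    double a = 7.0, b = 0.005;   /* illustrative: usec latency, usec/byte */
    long n = 100000;             /* 100 KB of data in total */
    printf("as 1 message:     %.1f usec\n", msg_time(a, b, n, 1));
    printf("as 1000 messages: %.1f usec\n", msg_time(a, b, n, 1000));
    return 0;
}

With these (illustrative) numbers, one message costs about 507 usec while 1000 small messages cost about 7500 usec: the latency term dominates.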
Alpha-Beta Parameters on Current Machines
• These numbers were obtained empirically

  machine        a      b
  T3E/Shm        1.2    0.003
  T3E/MPI        6.7    0.003
  IBM/LAPI       9.4    0.003
  IBM/MPI        7.6    0.004
  Quadrics/Get   3.267  0.00498
  Quadrics/Shm   1.3    0.005
  Quadrics/MPI   7.3    0.005
  Myrinet/GM     7.7    0.005
  Myrinet/MPI    7.2    0.006
  Dolphin/MPI    7.767  0.00529
  Giganet/VIPL   3.0    0.010
  GigE/VIPL      4.6    0.008
  GigE/MPI       5.854  0.00872

  a is latency in usec; b is the cost per byte (1/bandwidth) in usec per byte
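For example (our own arithmetic using the table): for T3E/MPI (a = 6.7, b = 0.003), the model predicts that a 1000-byte message takes 6.7 + 1000*0.003 ≈ 9.7 usec, with latency still accounting for about two-thirds of the cost.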
How well does the model Time = a + n*b predict actual performance?
Model Time Varying Message Size & Machines
[Chart: modeled message time vs. message size across machines]
Measured Message Time
[Chart: measured message time vs. message size across machines]
LogP Model
• 4 performance parameters
  – L: latency experienced in each communication event
    • time to communicate a word or a small number of words
  – o: send/recv overhead experienced by the processor
    • time the processor is fully engaged in transmission or reception
  – g: gap between successive sends or recvs by a processor
    • 1/g = communication bandwidth
  – P: number of processor/memory modules
LogP Parameters: Overhead & Latency
• Non-overlapping overhead:
    EEL = End-to-End Latency = osend + L + orecv
• Send and recv overhead can overlap:
    EEL = f(osend, L, orecv) ≥ max(osend, L, orecv)
[Timing diagrams: P0 spends osend sending, the message travels for L, and P1 spends orecv receiving; in the second diagram the send and receive overheads overlap in time.]
LogP Parameters: gap
• The gap is the delay between sending messages
• The gap can be greater than the send overhead
  – The NIC may be busy finishing the processing of the last message and cannot accept a new one.
  – Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap ⇒ time to send n messages (pipelined) =
    (osend + L + orecv - gap) + n*gap = a + n*b
[Timing diagram: P0 issues successive sends, each separated by the gap.]
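In code form (a minimal sketch; the parameter values are illustrative, not measurements):

/* LogP: with no overlap, n pipelined messages take
 * (osend + L + orecv - gap) + n*gap, i.e., alpha + n*beta. */
#include <stdio.h>

double logp_pipelined(double osend, double L, double orecv,
                      double gap, long n) {
    return (osend + L + orecv - gap) + n * gap;
}

int main(void) {
    /* illustrative values in usec */
    double osend = 1.0, L = 5.0, orecv = 1.0, gap = 2.0;
    for (long n = 1; n <= 1000; n *= 10)
        printf("n=%4ld messages: %.1f usec\n",
               n, logp_pipelined(osend, L, orecv, gap, n));
    return 0;
}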
Results: EEL and Overhead
[Bar chart, 0-25 usec: send overhead (alone), send & recv overhead, recv overhead (alone), and added latency for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL.]
Data from Mike Welcome, NERSC
Send Overhead Over Time
• Overhead has not improved significantly; T3D was best
  – Lack of integration; lack of attention in software
[Scatter plot, ~1990-2002: send overhead (usec, 0-14) for NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SP3, SCI, Compaq, Dolphin, Myrinet, and Myrinet2K.]
Data from Kathy Yelick, UCB and NERSC
Limitations of the LogP Model
• The LogP model has a fixed cost for each message
  – This is useful in showing how to quickly broadcast a single word
  – Other examples also in the LogP papers
• For larger messages, there is a variation, LogGP
  – Two gap parameters, one for small and one for large messages
  – The large-message gap is the b in our previous model
• No topology considerations (including no limits for bisection bandwidth)
  – Assumes a fully connected network
  – OK for some algorithms with nearest-neighbor communication, but with "all-to-all" communication we need to refine this further
• This is a flat model, i.e., each processor is connected directly to the network
  – Clusters of multicores are not accurately modeled
Summary
• Latency and bandwidth are two important network metrics
  – Latency matters more for small messages; bandwidth matters more for large messages
  – Time = a + n*b
• Communication has overhead at both the sending and the receiving end
  – EEL = End-to-End Latency = osend + L + orecv
• Multiple communications can overlap
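The overlap point is exactly what MPI's non-blocking operations expose; a minimal sketch (the 2-rank pairing and buffer sizes are illustrative):

/* overlap.c - overlap communication with independent computation using
 * non-blocking MPI: post the transfers, compute, then wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double sendbuf[1024], recvbuf[1024], work[1024], sum = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1024; i++) { sendbuf[i] = rank; work[i] = i; }
    if (rank < 2) {                     /* ranks 0 and 1 exchange; others idle */
        int peer = 1 - rank;
        MPI_Request reqs[2];
        /* post both transfers, then compute while they are in flight */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        for (int i = 0; i < 1024; i++)  /* touches neither message buffer */
            sum += work[i];
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* buffers now reusable */
    }
    if (rank == 0)
        printf("overlapped sum = %.0f, received %g from peer\n",
               sum, recvbuf[0]);
    MPI_Finalize();
    return 0;
}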
Historical Perspective
• Early distributed memory machines were:
  – Collections of microprocessors.
  – Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path.
  – "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time
Evolution of Distributed Memory Machines
• Special queue connections are being replaced by direct memory access (DMA):
  – Processor packs or copies messages.
  – Initiates the transfer, goes on computing.
• Wormhole routing in hardware:
  – Special message processors do not interrupt main processors along the path.
  – Long message sends are pipelined.
  – Processors don't wait for the complete message before forwarding.
• Message passing libraries provide a store-and-forward abstraction:
  – Can send/receive between any pair of nodes, not just along one wire.
  – Time depends on distance, since each processor along the path must participate.
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives