
Page 1:

Lecture 11: Distributed Memory Machines and Programming

1

CSCE 569 Parallel Computing

Department of Computer Science and Engineering
Yonghong Yan

[email protected]
http://cse.sc.edu/~yanyh

Page 2:

Topics

• Introduction
• Programming on shared memory systems (Chapter 7)
– OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large-scale systems (Chapter 6)
– MPI (point-to-point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
– Performance Metrics for Parallel Systems
• Execution Time, Overhead, Speedup, Efficiency, Cost
– Scalability of Parallel Systems
– Use of performance tools

2

Page 3:

Acknowledgement

• Slides adapted from the U.C. Berkeley course CS267/EngC233 Applications of Parallel Computers by Jim Demmel and Katherine Yelick, Spring 2011
– http://www.cs.berkeley.edu/~demmel/cs267_Spr11/
• And materials from various sources

3

Page 4:

Shared Memory Parallel Systems: Multicore and Multi-CPU


4

Page 5:

Node-level Architecture and Programming

• Shared memory multiprocessors: multicore, SMP, NUMA
– Deep memory hierarchy; distant memory is much more expensive to access.
– Machines scale to 10s or 100s of processors
– Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread Level Parallelism (TLP)
• Programming
– OpenMP, PThreads, Cilk Plus, etc.

5

Page 6:

HPC Architectures (TOP500, Nov 2014)

6

Page 7:

Outline

• Cluster Introduction
• Distributed Memory Architectures
– Properties of communication networks
– Topologies
– Performance models
• Programming Distributed Memory Machines using Message Passing
– Overview of MPI
– Basic send/receive use
– Non-blocking communication
– Collectives

7

Page 8:

Clusters

• A group of linked computers, working together so closely that in many respects they form a single computer.
• Consists of
– Nodes (front + computing)
– Network
– Software: OS and middleware

8

(Diagram: nodes connected by a high-speed local network, running cluster middleware.)

Page 9:

9

Page 10:

10

Page 11:

Top 10 of the Top500

11

http://www.top500.org/lists/2016/06/

Page 12:

(Ethernet, InfiniBand, …) + (MPI)

HPC Beowulf Cluster

• Master node, or service/front node: used to interact with users locally or remotely
• Computing nodes: perform the computations
• Interconnect and switch between nodes: e.g., G/10G-bit Ethernet, InfiniBand
• Inter-node programming
– MPI (Message Passing Interface) is the most commonly used one (a minimal sketch follows below).

12
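As a preview of the MPI material later in this lecture, here is a minimal point-to-point sketch. It assumes an MPI installation is available on the cluster (e.g., mpicc and mpirun); the rank numbers and the value 42 are just illustrative.

/* Minimal MPI sketch: rank 0 sends one integer to rank 1.
 * Compile:  mpicc hello_sendrecv.c -o hello_sendrecv
 * Run:      mpirun -np 2 ./hello_sendrecv
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value;
    MPI_Init(&argc, &argv);                 /* start MPI            */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank (process id) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of ranks      */

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0 (of %d ranks)\n", value, size);
    }

    MPI_Finalize();
    return 0;
}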

Page 13:

Network Switch

13

Page 14:

Network Interface Card (NIC)

14

Page 15:

Outline

• Cluster Introduction
• Distributed Memory Architectures
– Properties of communication networks
– Topologies
– Performance models
• Programming Distributed Memory Machines using Message Passing
– Overview of MPI
– Basic send/receive use
– Non-blocking communication
– Collectives

15

Page 16:

Network Analogy

• To have a large number of different transfers occurring at once, you need a large number of distinct wires
– Not just a bus, as in shared memory
• Networks are like streets:
– Link = street.
– Switch = intersection.
– Distances (hops) = number of blocks traveled.
– Routing algorithm = travel plan.
• Properties:
– Latency: how long it takes to get between nodes in the network.
– Bandwidth: how much data can be moved per unit time.
• Bandwidth is limited by the number of wires and the rate at which each wire can accept data.

16

Page 17:

Latency and Bandwidth

• Latency: the time for a vehicle to travel from one location to another
– Vehicle type (large or small messages)
– Road/traffic conditions, speed limit, etc.
• Bandwidth: how many cars, and how fast, can travel from one location to another
– Number of lanes

17

Page 18:

Performance Properties of a Network: Latency

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
– Latency tends to vary widely across architectures
– Vendors often report hardware latencies (wire time)
– Application programmers care about software latencies (user program to user program)
• Observations:
– Latencies differ by 1-2 orders of magnitude across network designs
– Software/hardware overhead at source/destination dominates cost (1s-10s of usecs)
– Hardware latency varies with distance (10s-100s of nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

18

1 second = 10^3 milliseconds (ms) = 10^6 microseconds (us) = 10^9 nanoseconds (ns)

Page 19:

Latency on Some Machines/Networks

(Chart: 8-byte roundtrip latency, measured with an MPI ping-pong test, in usec, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; measured values range from 6.6 to 24.2 usec.)

19

• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use ½ of the roundtrip time to approximate 1-way latency (which can't easily be measured)
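A ping-pong test like the one behind these numbers can be sketched as follows; the message size, repetition count, and use of MPI_Wtime are illustrative choices, not the exact benchmark used for the chart.

/* Sketch of an MPI ping-pong latency test (roundtrip, 8-byte messages). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double buf = 0.0;            /* 8-byte payload */
    const int reps = 1000;       /* illustrative repetition count */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {         /* rank 0: send, then wait for the echo */
            MPI_Send(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {  /* rank 1: receive, then echo back */
            MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* roundtrip time; divide by 2 to approximate 1-way latency */
        printf("avg roundtrip = %g usec\n", (t1 - t0) / reps * 1e6);

    MPI_Finalize();
    return 0;
}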

Page 20:

End-to-End Latency (1/2 roundtrip) Over Time

(Chart: one-way latency in usec, log scale from 1 to 100, vs. year from 1990 to 2010, for machines including nCube/2, CM5, CS2, SP1, SP2, Paragon, T3D, KSR, SPP, Cenju3, T3E, SP-Power3, Quadrics, and Myrinet; values range from roughly 2.6 to 36 usec.)

20

• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was the lowest point, in 1997

Data from Kathy Yelick, UCB and NERSC

Page 21:

Performance Properties of a Network: Bandwidth

• The bandwidth of a link = #wires / time per bit
• Bandwidth is typically measured in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than the physical link bandwidth, due to packet overhead.

21

(Packet format: routing and control header, data payload, error code, trailer.)

• Bandwidth is important for applications with mostly large messages
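The effect of packet overhead on effective bandwidth can be shown with a back-of-the-envelope calculation; the link bandwidth, header, and payload sizes below are made-up values, not the parameters of any specific network.

/* Sketch: effective bandwidth = link bandwidth * payload / (payload + overhead).
 * All sizes below are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    double link_bw_GBs = 1.0;      /* assumed physical link bandwidth, GB/s       */
    double payload     = 1024.0;   /* assumed data payload per packet, bytes      */
    double overhead    = 64.0;     /* assumed header + error code + trailer, bytes */

    double effective = link_bw_GBs * payload / (payload + overhead);
    printf("effective bandwidth = %.3f GB/s (%.1f%% of link bandwidth)\n",
           effective, 100.0 * payload / (payload + overhead));
    return 0;
}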

Page 22:

Bandwidth on Some Networks

(Chart: flood bandwidth for 2 MB messages using MPI, shown as a percentage of hardware peak, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; the labeled hardware peaks range from roughly 225 to 1504 MB/s.)

22

• Flood bandwidth (throughput of back-to-back 2 MB messages)

Page 23:

Bandwidth Chart

(Chart: bandwidth in MB/sec vs. message size from 2048 to 131072 bytes, for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect.)

Data from Mike Welcome, NERSC

Note: bandwidth depends on SW, not just HW

Page 24:

Performance Properties of a Network: Bisection Bandwidth

• Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network

24

(Diagram: examples of a bisection cut and a cut that is not a bisection cut; in the examples shown, bisection bw = link bw and bisection bw = sqrt(n) * link bw.)

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others

Page 25:

Other Characteristics of a Network

• Topology (how things are connected):
– Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, tree, butterfly, perfect shuffle, ...
• Routing algorithm:
– Example in a 2D torus: all east-west, then all north-south (avoids deadlock).
• Switching strategy:
– Circuit switching: the full path is reserved for the entire message, like the telephone.
– Packet switching: the message is broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):
– Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.

25

Page 26:

Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology.
– Key cost to be minimized: number of "hops" between nodes (e.g., "store and forward")
– Modern networks hide hop cost (i.e., "wormhole routing"), so topology is no longer a major factor in algorithm performance.
• Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• Need some background in network topology
– Algorithms may have a communication topology
– Topology affects bisection bandwidth.

26

Page 27:

Linear and Ring Topologies

• Linear array
– Diameter = n-1; average distance ~ n/3.
– Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or ring
– Diameter = n/2; average distance ~ n/4.
– Bisection bandwidth = 2.
– Natural for algorithms that work with 1D arrays.

27

Page 28:

Meshes and Tori

• Two-dimensional mesh
– Diameter = 2*(sqrt(n) - 1)
– Bisection bandwidth = sqrt(n)
• Two-dimensional torus
– Diameter = sqrt(n)
– Bisection bandwidth = 2*sqrt(n)

28

• Generalizes to higher dimensions
• Cray XT (e.g., Franklin @ NERSC) uses a 3D torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul); see the calculator sketch below.
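The diameter and bisection-bandwidth formulas on this slide and the neighboring ones can be collected into a small calculator. The sketch below uses the formulas as stated, with an example node count n that is assumed to be both a perfect square and a power of two.

/* Sketch: diameter and bisection bandwidth (in link-bandwidth units)
 * for some topologies with n nodes, using the formulas from the slides.
 * Compile with -lm for the math library. */
#include <math.h>
#include <stdio.h>

int main(void) {
    int n = 64;                      /* example node count (perfect square, power of 2) */
    double s = sqrt((double)n);

    printf("2D mesh  : diameter = %g, bisection bw = %g\n", 2.0 * (s - 1.0), s);
    printf("2D torus : diameter = %g, bisection bw = %g\n", s, 2.0 * s);

    int d = (int)round(log2((double)n));   /* hypercube dimension, n = 2^d */
    printf("hypercube: diameter = %d, bisection bw = %d\n", d, n / 2);
    return 0;
}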

Page 29:

Hypercubes

• Number of nodes n = 2^d for dimension d.
– Diameter = d.
– Bisection bandwidth = n/2.
• (Figure: 0d, 1d, 2d, 3d, and 4d hypercubes.)
• Popular in early machines (Intel iPSC, nCUBE).
– Lots of clever algorithms.
• Gray code addressing:
– Each node is connected to others whose labels differ in 1 bit (see the neighbor-check sketch below).

29

(Diagram: 3D hypercube with nodes labeled 000 through 111; neighbors differ in exactly one bit.)
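The addressing rule above gives a direct neighbor test: two node labels are connected exactly when their XOR has a single set bit. A minimal sketch:

/* Sketch: in a hypercube, nodes x and y are neighbors iff their labels
 * differ in exactly one bit, i.e., popcount(x ^ y) == 1. */
#include <stdio.h>

static int is_hypercube_neighbor(unsigned x, unsigned y) {
    unsigned diff = x ^ y;
    return diff != 0 && (diff & (diff - 1)) == 0;   /* exactly one bit set */
}

int main(void) {
    printf("%d\n", is_hypercube_neighbor(5, 7));  /* 101 vs 111: 1 (one bit differs)   */
    printf("%d\n", is_hypercube_neighbor(5, 6));  /* 101 vs 110: 0 (two bits differ)   */
    return 0;
}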

Page 30:

Trees

• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as a planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid the bisection bandwidth problem:
– More (or wider) links near the top.
– Example: Thinking Machines CM-5.

30

Page 31:

Butterflies

• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in the BBN Butterfly.
• Natural for FFT.

31

(Diagram: a butterfly switch and a multistage butterfly network, with outputs labeled 0 and 1 at each switch.)

Example: to get from proc 101 to 110, compare the two addresses bit by bit; switch if the bits disagree, otherwise go straight (a small routing sketch follows below).
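The bit-by-bit rule can be written out as code. The sketch below is a simplified illustration of the comparison logic described on the slide, not the exact switch implementation of the BBN Butterfly.

/* Sketch: routing decisions in a log2(n)-stage butterfly.
 * At each stage, compare one bit (from the most significant end) of the
 * source and destination addresses: if they disagree, take the "cross"
 * output, otherwise go "straight". */
#include <stdio.h>

int main(void) {
    unsigned src = 5, dst = 6;       /* 101 -> 110, as in the slide example */
    int stages = 3;                  /* log2(8) stages for 8 processors     */

    for (int k = stages - 1; k >= 0; k--) {
        int s_bit = (src >> k) & 1;
        int d_bit = (dst >> k) & 1;
        printf("stage %d: bits %d vs %d -> %s\n",
               stages - 1 - k, s_bit, d_bit,
               s_bit != d_bit ? "switch (cross)" : "straight");
    }
    return 0;
}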

Page 32:

Topologies in Real Machines

32

Machine                                   Network topology
Cray XT3 and XT4                          3D Torus (approx)
Blue Gene/L                               3D Torus
SGI Altix                                 Fat tree
Cray X1                                   4D Hypercube*
Myricom (Millennium)                      Arbitrary
Quadrics (in HP Alpha server clusters)    Fat tree
IBM SP                                    Fat tree (approx)
SGI Origin                                Hypercube
Intel Paragon (old)                       2D Mesh
BBN Butterfly (really old)                Butterfly

(Machines are listed from newer at the top to older at the bottom.)

Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top.

Page 33:

Performance Models

33

Page 34:

Latency and Bandwidth Model

• Time to send a message of length n is roughly:
  Time = latency + n*cost_per_word = latency + n/bandwidth
• Topology is assumed irrelevant.
• Often called the "α-β model" and written:
  Time = α + n*β
• Usually α >> β >> time per flop.
– One long message is cheaper than many short ones: α + n*β << n*(α + 1*β)
– Can do hundreds or thousands of flops for the cost of one message.
• Lesson: need a large computation-to-communication ratio to be efficient.

34
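A small sketch of the α-β model, using made-up parameter values in the spirit of the table on the next slide, shows why one long message beats many short ones.

/* Sketch: alpha-beta message time model, T(n) = alpha + n*beta.
 * The alpha and beta values are illustrative (usec and usec/byte). */
#include <stdio.h>

static double msg_time(double alpha, double beta, double n_bytes) {
    return alpha + n_bytes * beta;
}

int main(void) {
    double alpha = 7.0, beta = 0.005;   /* assumed: 7 usec latency, 0.005 usec/byte */
    double total = 1 << 20;             /* 1 MB of data to move */

    double one_msg  = msg_time(alpha, beta, total);
    double many_msg = 1024 * msg_time(alpha, beta, total / 1024);  /* 1024 x 1 KB */

    printf("one 1 MB message : %.1f usec\n", one_msg);
    printf("1024 x 1 KB msgs : %.1f usec\n", many_msg);
    return 0;
}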

Page 35:

Alpha-Beta Parameters on Current Machines

• These numbers were obtained empirically

35

machine         α       β
T3E/Shm         1.2     0.003
T3E/MPI         6.7     0.003
IBM/LAPI        9.4     0.003
IBM/MPI         7.6     0.004
Quadrics/Get    3.267   0.00498
Quadrics/Shm    1.3     0.005
Quadrics/MPI    7.3     0.005
Myrinet/GM      7.7     0.005
Myrinet/MPI     7.2     0.006
Dolphin/MPI     7.767   0.00529
Giganet/VIPL    3.0     0.010
GigE/VIPL       4.6     0.008
GigE/MPI        5.854   0.00872

α is the latency in usecs; β is the per-byte cost (inverse bandwidth) in usecs per byte.

How well does the model Time = α + n*β predict actual performance?

Page 36:

Model Time Varying Message Size & Machines

36

Page 37:

Measured Message Time

37

Page 38:

LogP Model

• 4 performance parameters
– L: latency experienced in each communication event
• time to communicate a word or a small number of words
– o: send/recv overhead experienced by the processor
• time the processor is fully engaged in transmission or reception
– g: gap between successive sends or recvs by a processor
• 1/g = communication bandwidth
– P: number of processor/memory modules

38

Page 39:

LogP Parameters: Overhead & Latency

• Non-overlapping overhead:
  EEL = End-to-End Latency = osend + L + orecv
• Send and recv overhead can overlap:
  EEL = f(osend, L, orecv) ≥ max(osend, L, orecv)

(Diagram: message timelines between P0 and P1 showing osend, L, and orecv for the two cases.)

39

Page 40:

LogP Parameters: gap

• The gap is the delay between sending messages
• The gap could be greater than the send overhead
– The NIC may be busy finishing the processing of the last message and cannot accept a new one.
– Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap ⇒ time to send n messages (pipelined) =
  (osend + L + orecv - gap) + n*gap = α + n*β

(Diagram: P0 issues successive sends to P1, each incurring osend and separated by the gap.)

40
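The pipelined-send formula above can be checked with a few lines of code; the LogP parameter values used here are illustrative assumptions, not measurements.

/* Sketch: LogP time to send n back-to-back messages with no overlap,
 * T(n) = (osend + L + orecv - g) + n*g, which has the alpha + n*beta form. */
#include <stdio.h>

int main(void) {
    double osend = 2.0, orecv = 2.0, L = 5.0, g = 3.0;  /* assumed values, usec */
    int n = 100;                                        /* number of messages   */

    double alpha = osend + L + orecv - g;   /* "startup" term        */
    double gap_per_msg = g;                 /* per-message term      */
    double total = alpha + n * gap_per_msg;

    printf("EEL of one message   = %.1f usec\n", osend + L + orecv);
    printf("time for %d messages = %.1f usec\n", n, total);
    return 0;
}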

Page 41:

Results: EEL and Overhead

41

(Chart: send overhead alone, receive overhead alone, combined send & receive overhead, and added latency, in usec, for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL.)

Data from Mike Welcome, NERSC

Page 42:

Send Overhead Over Time

• Overhead has not improved significantly; T3D was best
– Lack of integration; lack of attention in software

42

(Chart: send overhead in usec vs. year, 1990-2002, for machines including NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SP3, SCI, Compaq, Dolphin, Myrinet, and Myrinet2K.)

Data from Kathy Yelick, UCB and NERSC

Page 43:

Limitations of the LogP Model

• The LogP model has a fixed cost for each message
– This is useful in showing how to quickly broadcast a single word
– Other examples are also in the LogP papers
• For larger messages, there is a variation, LogGP
– Two gap parameters, one for small and one for large messages
– The large-message gap is the β in our previous model
• No topology considerations (including no limits for bisection bandwidth)
– Assumes a fully connected network
– OK for some algorithms with nearest-neighbor communication, but with "all-to-all" communication we need to refine this further
• This is a flat model, i.e., each processor is connected to the network
– Clusters of multicores are not accurately modeled

43

Page 44:

Summary

• Latency and bandwidth are two important network metrics
– Latency matters more than bandwidth for small messages
– Bandwidth matters more than latency for large messages
– Time = α + n*β
• Communication has overhead on both the sending and the receiving end
– EEL = End-to-End Latency = osend + L + orecv
• Multiple communications can overlap

44

Page 45:

Historical Perspective

• Early distributed memory machines were:
– Collections of microprocessors.
– Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path.
– "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time

45

Page 46:

Evolution of Distributed Memory Machines

• Special queue connections are being replaced by direct memory access (DMA):
– Processor packs or copies messages.
– Initiates transfer, goes on computing.
• Wormhole routing in hardware:
– Special message processors do not interrupt main processors along the path.
– Long message sends are pipelined.
– Processors don't wait for the complete message before forwarding.
• Message passing libraries provide a store-and-forward abstraction:
– Can send/receive between any pair of nodes, not just along one wire.
– Time depends on distance, since each processor along the path must participate.

46

Page 47:

Outline

• Cluster Introduction
• Distributed Memory Architectures
– Properties of communication networks
– Topologies
– Performance models
• Programming Distributed Memory Machines using Message Passing
– Overview of MPI
– Basic send/receive use
– Non-blocking communication
– Collectives

47