Lecture 11: Distributed Memory Machines and Programming
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics
• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Acknowledgement
• Slides adapted from the U.C. Berkeley course CS267/EngC233 Applications of Parallel Computers by Jim Demmel and Katherine Yelick, Spring 2011
  – http://www.cs.berkeley.edu/~demmel/cs267_Spr11/
• And materials from various sources
Shared Memory Parallel Systems: Multicore and Multi-CPU
Node-level Architecture and Programming
• Shared memory multiprocessors: multicore, SMP, NUMA
  – Deep memory hierarchy; distant memory is much more expensive to access
  – Machines scale to 10s or 100s of processors
  – Instruction Level Parallelism (ILP), Data Level Parallelism (DLP), and Thread Level Parallelism (TLP)
• Programming
  – OpenMP, PThreads, Cilk Plus, etc.
HPC Architectures (TOP500, Nov 2014)
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
Clusters
• A group of linked computers, working together closely so that in many respects they form a single computer.
• Consists of
  – Nodes (front + computing)
  – Network
  – Software: OS and middleware
[Diagram: computing nodes connected by a high-speed local network, tied together by cluster middleware.]
Top 10 of the Top500
http://www.top500.org/lists/2016/06/
[Slide annotation: nodes + interconnect (Ethernet, InfiniBand, ...) + (MPI)]
HPC Beowulf Cluster
• Master node: or service/front node (used to interact with users locally or remotely)
• Computing nodes: perform the computations
• Interconnect and switch between nodes: e.g., G/10G-bit Ethernet, InfiniBand
• Inter-node programming
  – MPI (Message Passing Interface) is the most commonly used one.
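To give a first taste of MPI before the details later in this lecture, here is a minimal sketch of an MPI program (the compile and launch commands vary by installation; mpicc/mpirun are typical but not universal):

/* hello.c - minimal MPI program: every process reports its rank.
 * Typical build/run (varies by installation):
 *   mpicc hello.c -o hello
 *   mpirun -np 4 ./hello
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                       /* shut down the MPI runtime */
    return 0;
}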
Network Switch
Network Interface Card (NIC)
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
Network Analogy
• To have a large number of different transfers occurring at once, you need a large number of distinct wires
  – Not just a bus, as in shared memory
• Networks are like streets:
  – Link = street.
  – Switch = intersection.
  – Distances (hops) = number of blocks traveled.
  – Routing algorithm = travel plan.
• Properties:
  – Latency: how long it takes to get between nodes in the network.
  – Bandwidth: how much data can be moved per unit time.
    • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.
Latency and Bandwidth
• Latency: time for a vehicle to travel from one location to another
  – Vehicle type (large or small messages)
  – Road/traffic conditions, speed limit, etc.
• Bandwidth: how many cars, and how fast, can travel from one location to another
  – Number of lanes
Performance Properties of a Network: Latency
• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Latencies differ by 1-2 orders of magnitude across network designs
  – Software/hardware overhead at source/destination dominates cost (1s-10s of usec)
  – Hardware latency varies with distance (10s-100s of nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

1 second = 10^3 milliseconds (ms) = 10^6 microseconds (us) = 10^9 nanoseconds (ns)
Latency on Some Machines/Networks
[Bar chart: 8-byte roundtrip latency (usec), measured with an MPI ping-pong, for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; values range from about 6.6 to 24.2 usec.]
• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use ½ of the roundtrip time to approximate 1-way latency (which can't easily be measured)
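A ping-pong test of this kind takes only a few lines of MPI; a minimal sketch (the 8-byte message size matches the chart, but the iteration count is illustrative; run with at least 2 ranks):

/* pingpong.c - rank 0 and rank 1 bounce an 8-byte message; half the
 * averaged round-trip time approximates the 1-way latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    const int iters = 1000;     /* illustrative iteration count */
    char buf[8] = {0};          /* 8-byte message, as in the chart */
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;   /* average round trip, seconds */
    if (rank == 0)
        printf("avg roundtrip %.2f usec, ~1-way %.2f usec\n",
               rtt * 1e6, rtt * 1e6 / 2);
    MPI_Finalize();
    return 0;
}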
End-to-End Latency (½ roundtrip) Over Time
[Scatter plot, ~1990-2010: one-way latency (usec, log scale 1-100) for nCube/2, CM5, CS2, SP1, SP2, Paragon, T3D, SPP, KSR, Cenju3, T3E, SP-Power3, Quadrics, and Myrinet; values range from about 2.6 usec to 36.3 usec.]
• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was the lowest point, in 1997
Data from Kathy Yelick, UCB and NERSC
Performance Properties of a Network: Bandwidth
• The bandwidth of a link = #wires / time per bit
• Bandwidth is typically given in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead.
[Packet format: routing and control header | data payload | error code | trailer]
• Bandwidth is important for applications with mostly large messages
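For example (our own arithmetic, with hypothetical numbers): on a link with a 1 GB/s physical rate, a packet carrying a 1024-byte payload plus 64 bytes of header, error code, and trailer delivers at most 1024/(1024+64) ≈ 94% of the link bandwidth, before any software overhead is counted.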
Bandwidth on Some Networks
[Bar chart: flood bandwidth for 2 MB MPI messages as a percent of hardware peak (0-100%) for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; labeled absolute values include 1504, 630, 244, 857, 225, and 610 MB/s.]
• Flood bandwidth (throughput of back-to-back 2 MB messages)
Bandwidth Chart
[Line chart: bandwidth (MB/s, up to ~400) vs. message size (2048-131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, and SysKonnect.]
Data from Mike Welcome, NERSC
Note: bandwidth depends on SW, not just HW
Performance Properties of a Network: Bisection Bandwidth
• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the "narrowest" part of the network
[Diagram: a bisection cut vs. a cut that is not a bisection; for a linear array, bisection bw = link bw; for a 2D mesh, bisection bw = sqrt(n) * link bw]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
Other Characteristics of a Network
• Topology (how things are connected):
  – Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, tree, butterfly, perfect shuffle, ...
• Routing algorithm:
  – Example in a 2D torus: all east-west, then all north-south (avoids deadlock); see the sketch after this list.
• Switching strategy:
  – Circuit switching: full path reserved for the entire message, like the telephone.
  – Packet switching: message broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
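Here is the promised sketch of that east-west-then-north-south (dimension-ordered) routing rule on a k x k torus; the representation and function names are our own illustration, not any particular machine's router:

/* Dimension-ordered (X then Y) routing on a k x k torus: move all the
 * way in x first, then in y, taking the shorter wraparound direction in
 * each dimension. Deadlock is avoided because a packet never turns from
 * the y dimension back into x. */
#include <stdio.h>

/* signed step (+1 or -1) along the shorter direction from a to b on a ring */
static int step(int a, int b, int k) {
    int d = (b - a + k) % k;          /* forward distance around the ring */
    return (d <= k - d) ? 1 : -1;
}

void route(int x, int y, int dx, int dy, int k) {
    while (x != dx) {                 /* phase 1: east-west */
        x = (x + step(x, dx, k) + k) % k;
        printf("-> (%d,%d)\n", x, y);
    }
    while (y != dy) {                 /* phase 2: north-south */
        y = (y + step(y, dy, k) + k) % k;
        printf("-> (%d,%d)\n", x, y);
    }
}

int main(void) {
    route(0, 0, 2, 3, 4);             /* a path on a 4x4 torus */
    return 0;
}

Note how the y phase uses the wraparound link: from y=0 to y=3 on a ring of 4 is a single hop in the negative direction.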
Network Topology
• In the past, there was considerable research in network topology and in mapping algorithms to topology.
  – Key cost to be minimized: number of "hops" between nodes (e.g., "store and forward")
  – Modern networks hide hop cost (i.e., "wormhole routing"), so topology is no longer a major factor in algorithm performance.
• Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• We still need some background in network topology
  – Algorithms may have a communication topology
  – Topology affects bisection bandwidth.
Linear and Ring Topologies
• Linear array
  – Diameter = n-1; average distance ~ n/3.
  – Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
  – Diameter = n/2; average distance ~ n/4.
  – Bisection bandwidth = 2.
  – Natural for algorithms that work with 1D arrays.
Meshes and Tori
• Two dimensional mesh
  – Diameter = 2*(sqrt(n) - 1)
  – Bisection bandwidth = sqrt(n)
• Two dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2*sqrt(n)
• Generalizes to higher dimensions
  – Cray XT (e.g., Franklin@NERSC) uses a 3D torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)
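As a concrete example (our own arithmetic from the formulas above): for n = 1024 processors, a 2D mesh has diameter 2*(32-1) = 62 hops and bisection bandwidth 32 links, while a 2D torus cuts the diameter to 32 and doubles the bisection bandwidth to 64 links.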
Hypercubes
• Number of nodes n = 2^d for dimension d.
  – Diameter = d.
  – Bisection bandwidth = n/2.
[Diagram: hypercubes of dimension 0d through 4d]
• Popular in early machines (Intel iPSC, NCUBE).
  – Lots of clever algorithms.
• Gray code addressing:
  – Each node is connected to the others whose labels differ in 1 bit.
[Diagram: 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
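To make the addressing concrete, a minimal sketch (our own illustration, not from any particular machine): flipping each of the d bits of a node's label gives its d neighbors, and the reflected Gray code i ^ (i >> 1) orders the nodes so that consecutive ones are neighbors:

/* Hypercube addressing for n = 2^d nodes: node i's neighbors are i with
 * one bit flipped; gray(i) = i ^ (i >> 1) enumerates the nodes so that
 * consecutive labels differ in exactly one bit. */
#include <stdio.h>

unsigned gray(unsigned i) { return i ^ (i >> 1); }

int main(void) {
    const int d = 3;                        /* 3-cube: 8 nodes */
    for (unsigned i = 0; i < (1u << d); i++) {
        printf("node %u neighbors:", i);
        for (int k = 0; k < d; k++)
            printf(" %u", i ^ (1u << k));   /* flip bit k */
        printf("   gray(%u)=%u\n", i, gray(i));
    }
    return 0;
}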
Trees
• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as a planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid the bisection bandwidth problem:
  – More (or wider) links near the top.
  – Example: Thinking Machines CM-5.
Butterflies
• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in the BBN Butterfly.
• Natural for FFT.
[Diagram: a butterfly switch and a multistage butterfly network]
• Example: to get from proc 101 to 110, compare bit-by-bit and switch if they disagree, else not
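That bit-by-bit rule can be written directly in code; a minimal sketch (the function name and stage numbering are our own illustration):

/* Butterfly routing: at each stage, compare the corresponding bit
 * (from the most significant end) of source and destination; cross
 * ("switch") if they disagree, go straight if they agree. */
#include <stdio.h>

void butterfly_route(unsigned src, unsigned dst, int bits) {
    for (int k = bits - 1; k >= 0; k--) {
        unsigned s = (src >> k) & 1, t = (dst >> k) & 1;
        printf("stage %d: %s\n", bits - 1 - k, s == t ? "straight" : "cross");
    }
}

int main(void) {
    butterfly_route(5, 6, 3);   /* 101 -> 110: straight, cross, cross */
    return 0;
}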
Topologies in Real Machines
(newer machines at the top, older at the bottom)

  Machine                                  | Topology
  Cray XT3 and XT4                         | 3D Torus (approx)
  Blue Gene/L                              | 3D Torus
  SGI Altix                                | Fat tree
  Cray X1                                  | 4D Hypercube*
  Myricom (Millennium)                     | Arbitrary
  Quadrics (in HP Alpha server clusters)   | Fat tree
  IBM SP                                   | Fat tree (approx)
  SGI Origin                               | Hypercube
  Intel Paragon (old)                      | 2D Mesh
  BBN Butterfly (really old)               | Butterfly

Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top.
Performance Models
Latency and Bandwidth Model
• Time to send a message of length n is roughly
    Time = latency + n*cost_per_word = latency + n/bandwidth
• Topology is assumed irrelevant.
• Often called the "a-b model" and written
    Time = a + n*b
• Usually a >> b >> time per flop.
  – One long message is cheaper than many short ones:
        a + n*b << n*(a + 1*b)
  – Can do hundreds or thousands of flops for the cost of one message.
• Lesson: need a large computation-to-communication ratio to be efficient.
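The inequality is easy to check numerically; a minimal sketch using made-up but representative values of a and b:

/* Alpha-beta model: moving n bytes in m messages costs m*a + n*b,
 * so one long message beats many short ones. */
#include <stdio.h>

/* predicted time in usec under Time = a + n*b, summed over m messages */
double msg_time(double a, double b, long n, long m) {
    return m * a + n * b;
}

int main(void) {
    double a = 7.0, b = 0.005;   /* illustrative: usec latency, usec/byte */
    long n = 100000;             /* 100 KB of data in total */
    printf("as 1 message:     %.1f usec\n", msg_time(a, b, n, 1));
    printf("as 1000 messages: %.1f usec\n", msg_time(a, b, n, 1000));
    return 0;
}

With these (illustrative) numbers, one message costs about 507 usec while 1000 small messages cost about 7500 usec: the latency term dominates.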
Alpha-Beta Parameters on Current Machines
• These numbers were obtained empirically

  machine        a      b
  T3E/Shm        1.2    0.003
  T3E/MPI        6.7    0.003
  IBM/LAPI       9.4    0.003
  IBM/MPI        7.6    0.004
  Quadrics/Get   3.267  0.00498
  Quadrics/Shm   1.3    0.005
  Quadrics/MPI   7.3    0.005
  Myrinet/GM     7.7    0.005
  Myrinet/MPI    7.2    0.006
  Dolphin/MPI    7.767  0.00529
  Giganet/VIPL   3.0    0.010
  GigE/VIPL      4.6    0.008
  GigE/MPI       5.854  0.00872

  a is latency in usec; b is the cost per byte (1/bandwidth) in usec per byte
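For example (our own arithmetic using the table): for T3E/MPI (a = 6.7, b = 0.003), the model predicts that a 1000-byte message takes 6.7 + 1000*0.003 ≈ 9.7 usec, with latency still accounting for about two-thirds of the cost.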
How well does the model Time = a + n*b predict actual performance?
Model Time Varying Message Size & Machines
[Chart: modeled message time vs. message size across machines]
Measured Message Time
[Chart: measured message time vs. message size across machines]
LogP Model
• 4 performance parameters
  – L: latency experienced in each communication event
    • time to communicate a word or a small number of words
  – o: send/recv overhead experienced by the processor
    • time the processor is fully engaged in transmission or reception
  – g: gap between successive sends or recvs by a processor
    • 1/g = communication bandwidth
  – P: number of processor/memory modules
LogP Parameters: Overhead & Latency
• Non-overlapping overhead:
    EEL = End-to-End Latency = osend + L + orecv
• Send and recv overhead can overlap:
    EEL = f(osend, L, orecv) ≥ max(osend, L, orecv)
[Timing diagrams: P0 spends osend sending, the message travels for L, and P1 spends orecv receiving; in the second diagram the send and receive overheads overlap in time.]
LogP Parameters: gap
• The gap is the delay between sending messages
• The gap can be greater than the send overhead
  – The NIC may be busy finishing the processing of the last message and cannot accept a new one.
  – Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap ⇒ time to send n messages (pipelined) =
    (osend + L + orecv - gap) + n*gap = a + n*b
[Timing diagram: P0 issues successive sends, each separated by the gap.]
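In code form (a minimal sketch; the parameter values are illustrative, not measurements):

/* LogP: with no overlap, n pipelined messages take
 * (osend + L + orecv - gap) + n*gap, i.e., alpha + n*beta. */
#include <stdio.h>

double logp_pipelined(double osend, double L, double orecv,
                      double gap, long n) {
    return (osend + L + orecv - gap) + n * gap;
}

int main(void) {
    /* illustrative values in usec */
    double osend = 1.0, L = 5.0, orecv = 1.0, gap = 2.0;
    for (long n = 1; n <= 1000; n *= 10)
        printf("n=%4ld messages: %.1f usec\n",
               n, logp_pipelined(osend, L, orecv, gap, n));
    return 0;
}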
Results: EEL and Overhead
[Bar chart, 0-25 usec: send overhead (alone), send & recv overhead, recv overhead (alone), and added latency for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, and Giganet/VIPL.]
Data from Mike Welcome, NERSC
Send Overhead Over Time
• Overhead has not improved significantly; T3D was best
  – Lack of integration; lack of attention in software
[Scatter plot, ~1990-2002: send overhead (usec, 0-14) for NCube/2, CM5, Meiko, Paragon, T3D, T3E, Cenju4, SP3, SCI, Compaq, Dolphin, Myrinet, and Myrinet2K.]
Data from Kathy Yelick, UCB and NERSC
Limitations of the LogP Model
• The LogP model has a fixed cost for each message
  – This is useful in showing how to quickly broadcast a single word
  – Other examples also in the LogP papers
• For larger messages, there is a variation, LogGP
  – Two gap parameters, one for small and one for large messages
  – The large-message gap is the b in our previous model
• No topology considerations (including no limits for bisection bandwidth)
  – Assumes a fully connected network
  – OK for some algorithms with nearest-neighbor communication, but with "all-to-all" communication we need to refine this further
• This is a flat model, i.e., each processor is connected directly to the network
  – Clusters of multicores are not accurately modeled
Summary
• Latency and bandwidth are two important network metrics
  – Latency matters more for small messages; bandwidth matters more for large messages
  – Time = a + n*b
• Communication has overhead at both the sending and the receiving end
  – EEL = End-to-End Latency = osend + L + orecv
• Multiple communications can overlap
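The overlap point is exactly what MPI's non-blocking operations expose; a minimal sketch (the 2-rank pairing and buffer sizes are illustrative):

/* overlap.c - overlap communication with independent computation using
 * non-blocking MPI: post the transfers, compute, then wait. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double sendbuf[1024], recvbuf[1024], work[1024], sum = 0.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 1024; i++) { sendbuf[i] = rank; work[i] = i; }
    if (rank < 2) {                     /* ranks 0 and 1 exchange; others idle */
        int peer = 1 - rank;
        MPI_Request reqs[2];
        /* post both transfers, then compute while they are in flight */
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        for (int i = 0; i < 1024; i++)  /* touches neither message buffer */
            sum += work[i];
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* buffers now reusable */
    }
    if (rank == 0)
        printf("overlapped sum = %.0f, received %g from peer\n",
               sum, recvbuf[0]);
    MPI_Finalize();
    return 0;
}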
Historical Perspective
• Early distributed memory machines were:
  – Collections of microprocessors.
  – Communication was performed using bi-directional queues between nearest neighbors.
• Messages were forwarded by processors on the path.
  – "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time
Evolution of Distributed Memory Machines
• Special queue connections are being replaced by direct memory access (DMA):
  – Processor packs or copies messages.
  – Initiates the transfer, goes on computing.
• Wormhole routing in hardware:
  – Special message processors do not interrupt main processors along the path.
  – Long message sends are pipelined.
  – Processors don't wait for the complete message before forwarding.
• Message passing libraries provide a store-and-forward abstraction:
  – Can send/receive between any pair of nodes, not just along one wire.
  – Time depends on distance, since each processor along the path must participate.
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives