Lecture 08: Principles of Parallel Algorithm Design

Concurrent and Multicore Programming, CSE 436/536
Department of Computer Science and Engineering
Yonghong Yan
[email protected]/~yan
Last lecture: Algorithms and Concurrency

•  Introduction to Parallel Algorithms
   –  Tasks and Decomposition
   –  Processes and Mapping
•  Decomposition Techniques
   –  Recursive Decomposition (divide-and-conquer)
   –  Data Decomposition (input, output, input+output, intermediate)
•  Terms and concepts
   –  Task dependency graph, task granularity, degree of concurrency
   –  Task interaction graph, critical path
•  Examples:
   –  Dense vector addition, matrix-vector product
   –  Dense matrix-matrix product
   –  Database query
   –  Quicksort, MIN
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
   –  Mapping tasks to processes/cores/CPUs/PEs
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Exploratory Decomposition

•  Data and recursive decompositions are fixed/static by design
•  Exploratory decomposition: exploration (search) of a state space of solutions
   –  The problem decomposition reflects the shape of the execution
   –  Decomposition goes hand-in-hand with execution
•  Examples
   –  Discrete optimization, e.g., 0/1 integer programming
   –  Theorem proving
   –  Game playing
Exploratory Decomposition: Example

Solve a 15-puzzle
•  A sequence of three moves leads from state (a) to the final state (d)
•  From an arbitrary state, we must search for a solution
Exploratory Decomposition: Example

Solving a 15-puzzle
•  Search
   –  Generate the successor states of the current state
   –  Explore each as an independent task
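The search scheme above can be sketched in Python (not from the slides): each successor of the initial board becomes an independent task. The 3x3 (8-puzzle) board, the depth-bounded BFS, and the thread pool are illustrative assumptions, not the formulation used in the slides' figures.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import deque

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank tile

def successors(state):
    """All states reachable by sliding one tile into the blank."""
    i = state.index(0)
    r, c = divmod(i, 3)
    moves = []
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < 3 and 0 <= nc < 3:
            j = nr * 3 + nc
            s = list(state)
            s[i], s[j] = s[j], s[i]
            moves.append(tuple(s))
    return moves

def bfs(start, max_depth=10):
    """Sequential bounded BFS; returns the depth of the goal or None."""
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        state, depth = frontier.popleft()
        if state == GOAL:
            return depth
        if depth < max_depth:
            for nxt in successors(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return None

def parallel_search(start):
    """Exploratory decomposition: each successor of the initial
    state is explored as an independent search task."""
    if start == GOAL:
        return 0
    with ThreadPoolExecutor() as pool:
        results = pool.map(bfs, successors(start))
    depths = [d for d in results if d is not None]
    return 1 + min(depths) if depths else None

# Two moves away from the goal
print(parallel_search((1, 2, 3, 4, 5, 6, 0, 7, 8)))  # → 2
```

Note that the tasks here are created dynamically from the current state, which is exactly why the decomposition "goes hand-in-hand" with execution.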
Exploratory Decomposition Speedup

Solve a 15-puzzle
•  The amount of work depends on the parallel formulation
   –  The parallel search may perform more, less, or the same amount of work than the serial search
•  Execution terminates as soon as a solution is found
Speculative Decomposition

•  Dependencies between tasks are not known a priori
   –  Impossible to identify independent tasks up front
•  Two approaches
   –  Conservative approaches identify independent tasks only when they are guaranteed to have no dependencies
      •  May yield little concurrency
   –  Optimistic approaches schedule tasks even when they may potentially be inter-dependent
      •  Roll back changes in case of an error
Speculative Decomposition: Example

Discrete event simulation
•  Centralized time-ordered event list
   –  get up → get ready → drive to work → work → eat lunch → work some more → drive back → eat dinner → sleep
•  Simulation
   –  Extract the next event in time order
   –  Process the event
   –  If required, insert new events into the event list
•  Optimistic event scheduling
   –  Assume the outcomes of all prior events
   –  Speculatively process the next event
   –  If an assumption turns out to be incorrect, roll back its effects and continue
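The centralized time-ordered event list can be sketched as a minimal sequential simulator (the speculative, parallel variant would add rollback on mis-speculation, which is omitted here). The `handlers` callback shape is an assumption for illustration.

```python
import heapq

def simulate(initial_events, handlers, horizon=100):
    """Minimal time-ordered discrete event simulation (sequential).
    handlers maps an event kind to a function that, given the current
    time, returns follow-up events as (delay, kind) pairs."""
    events = list(initial_events)            # (time, kind) pairs
    heapq.heapify(events)
    log = []
    while events:
        time, kind = heapq.heappop(events)   # extract next event in time order
        if time > horizon:
            break
        log.append((time, kind))             # process the event
        for delay, nxt in handlers[kind](time):
            heapq.heappush(events, (time + delay, nxt))  # insert new events
    return log

handlers = {"get_up": lambda t: [(1, "drive")],
            "drive":  lambda t: [(8, "work")],
            "work":   lambda t: []}
print(simulate([(0, "get_up")], handlers))
# → [(0, 'get_up'), (1, 'drive'), (9, 'work')]
```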
Speculative Decomposition: Example

Simulation of a network of nodes
•  Simulate network behavior for various inputs and node delays
   –  The inputs change dynamically
      •  Thus the task dependencies are unknown
•  Speculate on the tasks' inputs
   –  Correct speculation: parallelism
   –  Incorrect: roll back and redo
Speculative vs. Exploratory

•  Exploratory decomposition
   –  The output of the multiple tasks at a branch is unknown
   –  The parallel program may perform more, less, or the same amount of work as the serial program
•  Speculative decomposition
   –  The input at a branch leading to multiple parallel tasks is unknown
   –  The parallel program performs more or the same amount of work as the serial algorithm
Hybrid Decompositions

Use multiple decomposition techniques together
•  A single decomposition may not be optimal for concurrency
   –  Quicksort's recursive decomposition limits concurrency (Why?)
•  Combined recursive and data decomposition for MIN
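A minimal sketch of the combined decomposition for MIN, with Python threads standing in for processes (an illustrative assumption): data decomposition produces one task per chunk, and the partial minima are then combined by a recursive (tree) reduction.

```python
from concurrent.futures import ThreadPoolExecutor

def hybrid_min(data, nchunks=4):
    """Hybrid decomposition for MIN: data decomposition yields one
    task per chunk; the partial minima are combined by a pairwise
    tree reduction (recursive decomposition)."""
    size = (len(data) + nchunks - 1) // nchunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(min, chunks))     # data-decomposed tasks
    while len(partial) > 1:                       # tree reduction of partials
        pairs = [partial[i:i + 2] for i in range(0, len(partial), 2)]
        with ThreadPoolExecutor() as pool:
            partial = list(pool.map(min, pairs))
    return partial[0]

print(hybrid_min([7, 3, 9, 1, 8, 2, 6, 5]))  # → 1
```

The tree reduction needs only log2(nchunks) combining steps, which is what the pure recursive decomposition of MIN would give, while the initial data decomposition supplies ample concurrency in the first phase.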
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
   –  Mapping tasks to processes/cores/CPUs/PEs
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Characteristics of Tasks

•  Theory
   –  Decomposition: how to parallelize in principle
      •  The concurrency available in a problem
•  Practice
   –  Task creation, interactions, and mapping to PEs
      •  Realizing the concurrency in practice
   –  The characteristics of tasks and task interactions
      •  Impact the choice and performance of the parallelization
•  Characteristics of tasks
   –  Task generation strategies
   –  Task sizes (the amount of work, e.g., FLOPs)
   –  Size of data associated with tasks
Task Generation

•  Static task generation
   –  Concurrent tasks and the task graph are known a priori (before execution)
   –  Typically produced by recursive or data decomposition
   –  Examples
      •  Matrix operations
      •  Graph algorithms
      •  Image processing applications
      •  Other regularly structured problems
•  Dynamic task generation
   –  The computation formulates concurrent tasks and the task graph on the fly
      •  Not explicit a priori, though high-level rules or guidelines are known
   –  Typically produced by exploratory or speculative decompositions
      •  Also possible with recursive decomposition, e.g., quicksort
   –  A classic example: game playing
      •  15-puzzle board
Task Sizes / Granularity

•  The amount of work → the amount of time to complete
   –  E.g., FLOPs, number of memory accesses
•  Uniform
   –  Often from an even data decomposition, i.e., regular
•  Non-uniform
   –  Quicksort: depends on the choice of pivot
Size of Data Associated with Tasks

•  May be small or large compared to the task sizes
   –  How it relates to the input and/or output data sizes
   –  Examples:
      •  size(input) < size(computation), e.g., 15-puzzle
      •  size(input) = size(computation) > size(output), e.g., min
      •  size(input) = size(output) < size(computation), e.g., sort
•  Consider the effort needed to reconstruct the same task context
   –  Small data, small effort: the task can easily migrate to another process
   –  Large data, large effort: ties the task to a process
•  Reconstructing the context vs. communicating it
   –  It depends
Characteristics of Task Interactions

•  Aspects of interactions
   –  What: shared data or synchronizations, and the sizes of the media
   –  When: the timing
   –  Who: with which task(s), and the overall topology/patterns
   –  Do we know the details of the above three before execution?
   –  How: does the interaction involve one task or both?
•  The implementation concern: implicit or explicit
•  Orthogonal classification
   –  Static vs. dynamic
   –  Regular vs. irregular
   –  Read-only vs. read-write
   –  One-sided vs. two-sided
Characteristics of Task Interactions

•  Static interactions
   –  Partners and timing (and more) are known a priori
   –  Relatively simpler to code into programs
•  Dynamic interactions
   –  The timing or the interacting tasks cannot be determined a priori
   –  Harder to code, especially with explicit interaction
Characteristics of Task Interactions

•  Regular interactions
   –  The interactions follow a definite pattern
      •  E.g., a mesh or a ring
   –  Can be exploited for an efficient implementation
•  Irregular interactions
   –  Lack well-defined topologies
   –  Modeled as a graph
Example of Regular Static Interaction

Image processing algorithms: dithering, edge detection
•  Nearest-neighbor interactions on a 2-D mesh
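A sequential sketch of such a nearest-neighbor computation on a 2-D mesh; the 4-point averaging kernel is an illustrative stand-in for dithering or edge detection. In a parallel version each task would own a block of the mesh and exchange only its boundary rows/columns with its neighbors.

```python
def stencil_average(grid):
    """4-point nearest-neighbor stencil on a 2-D mesh: each point is
    replaced by the average of itself and its N/S/E/W neighbors
    (boundary points simply have fewer neighbors)."""
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            vals = [grid[i][j]]
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= i + di < n and 0 <= j + dj < m:
                    vals.append(grid[i + di][j + dj])
            out[i][j] = sum(vals) / len(vals)
    return out
```

Because every point touches only its four neighbors, the interaction pattern is both regular (a 2-D mesh) and static (known before execution).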
Example of Irregular Static Interaction

Sparse matrix-vector multiplication
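A small sketch of the computation, with the matrix in CSR (compressed sparse row) form; the CSR representation is a common choice, not one the slides prescribe. The irregularity is visible in the code: the task for row i reads only the entries of x indexed by its nonzero columns, so the interaction pattern depends on the sparsity structure.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A*x with A in CSR form.
    Row i's nonzeros are values[row_ptr[i]:row_ptr[i+1]] with column
    indices col_idx[row_ptr[i]:row_ptr[i+1]]."""
    y = []
    for i in range(len(row_ptr) - 1):
        y.append(sum(values[k] * x[col_idx[k]]
                     for k in range(row_ptr[i], row_ptr[i + 1])))
    return y

# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
print(csr_matvec([1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5], [1, 1, 1]))
# → [3, 3, 9]
```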
Characteristics of Task Interactions

•  Read-only interactions
   –  Tasks only read data items associated with other tasks
•  Read-write interactions
   –  Tasks read, as well as modify, data items associated with other tasks
   –  Harder to code
      •  Require additional synchronization primitives to avoid read-write and write-write ordering races
•  [Figure: tasks T1 (write) and T2 (read) accessing shared data]
Characteristics of Task Interactions

•  The implementation concern: implicit or explicit
•  One-sided interactions
   –  Initiated and completed independently by one of the two interacting tasks
      •  GET and PUT
•  Two-sided interactions
   –  Both tasks coordinate in the interaction
      •  SEND + RECV
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Mapping Techniques

•  Parallel algorithm design so far
   –  The program has been decomposed
   –  The characteristics of tasks and interactions have been identified
•  Mapping: assign a large number of concurrent tasks to an equal or relatively small number of processes for execution
   –  Though often we do a 1-1 mapping
Mapping Techniques

•  Goal of mapping: minimize overheads
   –  Parallelism has costs: interactions and idling (serialization)
•  Contradicting objectives: interactions vs. idling
   –  Idling (serialization) ↑: insufficient parallelism
   –  Interactions ↑: excessive concurrency
   –  E.g., assigning all work to one processor trivially minimizes interaction at the expense of significant idling
Mapping Techniques for Minimum Idling

•  Execution alternates between stages of computation and interaction
•  Mapping must simultaneously minimize idling and balance the load
   –  Idling means not doing useful work
   –  Load balance: each process does the same amount of work
•  Merely balancing the load does not minimize idling
   –  [Figure: a poor mapping that wastes 50% of the time]
Mapping Techniques for Minimum Idling

Static or dynamic mapping
•  Static mapping
   –  Tasks are mapped to processes a priori
   –  Needs a good estimate of task sizes
   –  Finding the optimal mapping may be NP-complete
•  Dynamic mapping
   –  Tasks are mapped to processes at runtime
   –  Because:
      •  Tasks are generated at runtime
      •  Their sizes are not known until then
•  Other factors determining the choice of mapping technique
   –  The size of data associated with a task
   –  The characteristics of inter-task interactions
   –  Even the programming models and target architectures
Schemes for Static Mapping

•  Mappings based on data decomposition
   –  Mostly 1-1 mapping
•  Mappings based on task graph partitioning
•  Hybrid mappings
Mappings Based on Data Partitioning

•  Partition the computation using a combination of
   –  Data decomposition
   –  The "owner-computes" rule
•  Example: a 1-D block distribution of a 2-D dense matrix, with a 1-1 mapping of tasks/data to processes
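The 1-D block distribution can be written down as two small index functions (a sketch, not from the slides): which process owns a given row, and which rows a given process owns. Under the owner-computes rule, each process then computes exactly the outputs in its own block.

```python
def block_owner(i, n, p):
    """1-D block distribution: row i of an n-row matrix belongs to
    the process returned here, out of p processes; each process gets
    a contiguous block of ceil(n/p) rows."""
    block = (n + p - 1) // p          # rows per process, rounded up
    return i // block

def block_rows(rank, n, p):
    """The contiguous range of rows owned by a given process rank
    (owner-computes: rank computes exactly these output rows)."""
    block = (n + p - 1) // p
    return range(rank * block, min((rank + 1) * block, n))

print(list(block_rows(1, 8, 4)))  # → [2, 3]
```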
Block Array Distribution Schemes

Multi-dimensional block distribution
•  In general, a higher-dimensional decomposition allows the use of a larger number of processes
Block Array Distribution Schemes: Examples

Multiplying two dense matrices: A * B = C
•  Partition the output matrix C using a block decomposition
   –  Load balance: each task computes the same number of elements of C
      •  Note: each element of C corresponds to a single dot product
   –  The choice of the precise decomposition, 1-D (row/col) or 2-D, is determined by the associated communication overhead
Block Distribution and Data Sharing for Dense Matrix Multiplication

A x B = C, partitioned across processes P0-P3
•  Row-based 1-D
•  Column-based 1-D
•  Row/Col-based 2-D
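The row-based 1-D variant can be sketched as follows (Python threads stand in for processes, an illustrative assumption): each task owns a block of rows of C (and the matching rows of A) and reads all of B.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_row_blocks(A, B, p=2):
    """Dense C = A*B under a row-based 1-D block decomposition of C:
    task `rank` computes its own contiguous block of rows of C."""
    n = len(A)
    block = (n + p - 1) // p

    def compute_block(rank):
        rows = range(rank * block, min((rank + 1) * block, n))
        return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
                 for j in range(len(B[0]))] for i in rows]

    with ThreadPoolExecutor(max_workers=p) as pool:
        blocks = pool.map(compute_block, range(p))
    return [row for blk in blocks for row in blk]

print(matmul_row_blocks([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19, 22], [43, 50]]
```

Note the data-sharing asymmetry that drives the 1-D vs. 2-D choice on the slide: here every task reads all of B, whereas a 2-D decomposition would let each task read only a block-row of A and a block-column of B.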
Cyclic and Block-Cyclic Distributions

•  Consider a block distribution for LU decomposition (Gaussian elimination)
   –  The amount of computation per data item varies
   –  A block decomposition would lead to significant load imbalance
LU Factorization of a Dense Matrix

•  A decomposition of LU factorization into 14 tasks
   –  [Figure: the listing of tasks 1-14 is not reproduced here]
Block Distribution for LU

•  Notice the significant load imbalance
Block-Cyclic Distributions

•  A variation of the block distribution scheme
   –  Partition the array into many more blocks (i.e., tasks) than the number of available processes
   –  Blocks are assigned to processes in a round-robin manner, so each process gets several non-adjacent blocks
   –  An N-1 mapping of tasks to processes
•  Used to alleviate the load-imbalance and idling problems
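The round-robin assignment is a one-line index function (a sketch, not from the slides):

```python
def block_cyclic_owner(i, block_size, p):
    """Block-cyclic distribution: row i belongs to block
    i // block_size, and blocks are dealt to the p processes in
    round-robin order."""
    return (i // block_size) % p

# block_size = 1 gives a cyclic distribution;
# block_size = n // p recovers a plain block distribution.
print([block_cyclic_owner(i, 2, 3) for i in range(12)])
# → [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```

Because each process owns blocks scattered across the whole index range, work concentrated in any region of the matrix (such as the shrinking active submatrix in LU) is still spread over all processes.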
Block-Cyclic Distribution for Gaussian Elimination

•  The active submatrix shrinks as elimination progresses
•  Assigning blocks in a block-cyclic fashion
   –  Each PE receives blocks from different parts of the matrix
   –  In one batch of the mapping, the PE doing the most work will most likely receive the least in the next batch
Block-Cyclic Distribution

•  A cyclic distribution: a special case with block size = 1
•  A block distribution: a special case with block size = n/p
   –  where n is the dimension of the matrix and p is the number of processes
Block Partitioning and Random Mapping

Sparse matrix computations
•  Load imbalance under block-cyclic partitioning/mapping
   –  More non-zero blocks go to the diagonal processes P0, P5, P10, and P15 than to the others
   –  P12 gets nothing
Block Partitioning and Random Mapping

•  [Figure: a random mapping of blocks to processes evens out the distribution of non-zero blocks]
Graph Partitioning Based Data Decomposition

•  Array-based partitioning and static mapping
   –  Regular domains, i.e., rectangular, mostly dense matrices
   –  Structured and regular interaction patterns
   –  Quite effective at balancing the computation and minimizing the interactions
•  Irregular domains
   –  Sparse-matrix-related problems
   –  Numerical simulations of physical phenomena
      •  Cars, water/blood flow, geographic domains
•  Partition the irregular domain so as to
   –  Assign an equal number of nodes to each process
   –  Minimize the edge count of the partition
Partitioning the Graph of Lake Superior

•  Random partitioning vs. partitioning for minimum edge-cut
•  Each mesh point carries the same amount of computation
   –  Easy to balance the load
•  Minimize the edges crossing partitions
•  Finding the optimal partition is NP-complete
   –  Use heuristics
Mappings Based on Task Partitioning

•  Schemes for static mapping
   –  Mappings based on data partitioning
      •  Mostly 1-1 mapping: data decomposition and then a 1-1 mapping of tasks to PEs
   –  Mappings based on task graph partitioning
   –  Hybrid mappings
•  Task partitioning: partition a given task-dependency graph across processes
•  Finding an optimal mapping for a general task-dependency graph is an NP-complete problem
   –  Excellent heuristics exist for structured graphs
Mapping a Binary Tree Dependency Graph

Mapping the dependency graph of quicksort to processes in a hypercube
•  Hypercube: the n-dimensional analogue of a square and a cube
   –  Nodes whose numbers differ in one bit are adjacent
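The adjacency rule is a bit-flip, which makes the neighbor set of a hypercube node a one-liner (a sketch, not from the slides):

```python
def hypercube_neighbors(node, dim):
    """Neighbors of a node in a dim-dimensional hypercube: flip each
    of the dim address bits in turn (nodes that differ in exactly one
    bit are adjacent)."""
    return [node ^ (1 << b) for b in range(dim)]

print(hypercube_neighbors(0, 3))  # → [1, 2, 4]
```

This bit structure is what makes the binary-tree dependency graph of quicksort map naturally onto a hypercube: the two children of a node can be placed on the node itself and on a neighbor that differs in one successive bit.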
Mapping a Sparse Graph

Sparse matrix-vector multiplication
•  Using data partitioning
Mapping a Sparse Graph

Sparse matrix-vector multiplication
•  Using task graph partitioning
   –  13 items to communicate
   –  Process 0: rows 0, 4, 5, 8
   –  Process 1: rows 1, 2, 3, 7
   –  Process 2: rows 6, 9, 10, 11
Hierarchical / Hybrid Mappings

•  A single mapping may be inadequate
   –  E.g., the task graph mapping of the binary tree (quicksort) cannot use a large number of processors
•  Hierarchical mapping
   –  Task graph mapping at the top level
   –  Data partitioning within each level
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Schemes for Dynamic Mapping

•  Also referred to as dynamic load balancing
   –  Load balancing is the primary motivation for dynamic mapping
•  Dynamic mapping schemes can be
   –  Centralized
   –  Distributed
Centralized Dynamic Mapping

•  Processes are designated as masters or slaves
   –  Workers ("slave" is politically incorrect)
•  General strategy
   –  The master holds the pool of tasks and acts as the central dispatcher
   –  When a worker runs out of work, it requests more work from the master
•  Challenge
   –  As the number of processes increases, the master may become the bottleneck
•  Approach
   –  Chunk scheduling: a process picks up multiple tasks at once
   –  Chunk size:
      •  Large chunk sizes may lead to significant load imbalances as well
      •  Schemes exist that gradually decrease the chunk size as the computation progresses
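A minimal sketch of centralized dynamic mapping with chunk scheduling, assuming Python threads stand in for worker processes and a shared queue stands in for the master's task pool; the squaring "work" is a placeholder.

```python
import queue
import threading

def master_worker(tasks, nworkers=3, chunk=2):
    """Centralized dynamic mapping: the master's task pool is a queue
    of chunks; each worker grabs a chunk, processes it, and comes
    back for more until the pool is empty (self-scheduling)."""
    pool = queue.Queue()
    for i in range(0, len(tasks), chunk):
        pool.put(tasks[i:i + chunk])        # master pre-chunks the task pool
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                batch = pool.get_nowait()   # worker requests a chunk of work
            except queue.Empty:
                return                      # pool exhausted: worker retires
            done = [t * t for t in batch]   # placeholder "work": square each task
            with lock:
                results.extend(done)

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

print(master_worker([1, 2, 3, 4, 5]))  # → [1, 4, 9, 16, 25]
```

With `chunk=1` every task is a separate request to the master (maximum balance, maximum master traffic); larger chunks cut master traffic at the risk of end-of-run imbalance, which is exactly the trade-off the slide describes.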
Distributed Dynamic Mapping

•  All processes are created equal
   –  Each can send work to or receive work from the others
•  Alleviates the bottleneck of centralized schemes
•  Four critical design questions:
   –  How are sending and receiving processes paired together?
   –  Who initiates a work transfer?
   –  How much work is transferred?
   –  When is a transfer triggered?
•  The answers are generally application-specific
•  Work stealing is a well-known distributed scheme
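A toy, single-threaded simulation of work stealing (a sketch, not from the slides): each worker owns a deque, works from its own end, and when idle answers the four design questions as "pair with a random victim, the idle worker initiates, transfer one task, trigger on an empty local deque."

```python
from collections import deque
import random

def work_stealing_sim(tasks_per_worker, steps=1000, seed=0):
    """Toy round-based simulation of distributed work stealing.
    Each round, every worker either pops a task from its own deque
    or, if idle, steals one task from a random victim's deque."""
    rng = random.Random(seed)
    deques = [deque(ts) for ts in tasks_per_worker]
    done = [0] * len(deques)
    for _ in range(steps):
        if not any(deques):
            break                              # all work finished
        for w, dq in enumerate(deques):
            if dq:
                dq.pop()                       # local work: owner pops its own end
                done[w] += 1
            else:
                victim = rng.randrange(len(deques))
                if deques[victim]:
                    dq.append(deques[victim].popleft())  # steal from victim's far end
    return done

print(sum(work_stealing_sim([[1] * 8, [], []])))  # → 8
```

Stealing from the opposite end of the victim's deque is the classic choice: it avoids contending with the victim's own working end and tends to move larger units of work.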
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Minimizing Interaction Overheads

Rules of thumb
•  Maximize data locality
   –  Where possible, reuse intermediate data
   –  Restructure the computation so that data can be reused in smaller time windows
•  Minimize the volume of data exchange
   –  Partition the interaction graph to minimize edge crossings
•  Minimize the frequency of interactions
   –  Merge multiple interactions into one, e.g., aggregate small messages
•  Minimize contention and hot spots
   –  Use decentralized techniques
   –  Replicate data where necessary
Minimizing Interaction Overheads (continued)

Techniques
•  Overlapping computation with interactions
   –  Use non-blocking communications
   –  Multithreading
   –  Prefetching to hide latencies
•  Replicating data or computations to reduce communication
•  Using group communications instead of point-to-point primitives
•  Overlapping interactions with other interactions
Today's lecture

•  Decomposition Techniques - continued
   –  Exploratory Decomposition
   –  Hybrid Decomposition
•  Characteristics of Tasks and Interactions
   –  Task Generation, Granularity, and Context
   –  Characteristics of Task Interactions
•  Mapping Techniques for Load Balancing
   –  Static and Dynamic Mapping
•  Methods for Minimizing Interaction Overheads
•  Parallel Algorithm Design Models
Parallel Algorithm Models

•  Ways of structuring a parallel algorithm
   –  Decomposition technique
   –  Mapping technique
   –  Strategy to minimize interactions
•  Data-Parallel Model
   –  Each task performs similar operations on different data
   –  Tasks are statically (or semi-statically) mapped to processes
•  Task Graph Model
   –  Use the task-dependency graph to guide the mapping toward better locality or lower interaction costs
Parallel Algorithm Models (continued)

•  Master-Slave Model
   –  One or more masters generate work
   –  They dispatch the work to workers
   –  Dispatching may be static or dynamic
•  Pipeline / Producer-Consumer Model
   –  A stream of data is passed through a succession of processes, each of which performs some task on it
   –  Multiple streams can run concurrently
•  Hybrid Models
   –  Applying multiple models hierarchically
   –  Applying multiple models sequentially to different phases of a parallel algorithm
References

•  Adapted from the slides "Principles of Parallel Algorithm Design" by Ananth Grama
•  Based on Chapter 3 of "Introduction to Parallel Computing" by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003