Lecture 8: Principles of Parallel Algorithm Design

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Topics

• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus and OpenMP Tasking
  – PThread, mutual exclusion, locks, synchronizations
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to offloading model in OpenMP
“parallel” and “for” OpenMP Constructs

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

    #pragma omp parallel shared(a, b)
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel shared(a, b) private(i)
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenMP Best Practices

    #pragma omp parallel private(i)
    {
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] += b[i];

        #pragma omp for nowait
        for (i = 0; i < n; i++)
            c[i] += d[i];

        #pragma omp barrier
        #pragma omp for nowait reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += a[i] + c[i];
    }
False Sharing in OpenMP and Solution

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

False sharing:

    int a[max_threads];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i] += i;

With padding:

    int a[max_threads][cache_line_size];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i][0] += i;
False Sharing (from “Getting OpenMP Up To Speed”, RvdP/V1 Tutorial, IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.

[Figure: threads T0 and T1 on separate CPUs each holding a cached copy of line A; CPUs, caches, and memory shown side by side.]
NUMA First-touch

About “First Touch” placement:

    #pragma omp parallel for num_threads(2)
    for (i = 0; i < 100; i++)
        a[i] = 0;

First touch: both memories each have “their half” of the array (a[0]:a[49] in one node’s memory, a[50]:a[99] in the other’s).
SPMD Program Models in OpenMP

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

    if (my_id == x) { } else { }
Overview: Algorithms and Concurrency (part 1)

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Overview: Concurrency and Mapping (part 2)

• Mapping Techniques for Load Balancing
  – Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
  – Maximizing Data Locality
  – Minimizing Contention and Hot-Spots
  – Overlapping Communication and Computations
  – Replication vs. Communication
  – Group Communications vs. Point-to-Point Communication
• Parallel Algorithm Design Models
  – Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models
Decomposition, Tasks, and Dependency Graphs

• Decompose work into tasks that can be executed concurrently
• Decomposition could be done in many different ways
• Tasks may be of same, different, or even indeterminate sizes
• Task dependency graph:
  – node = task
  – edge = control dependence, output-input dependency
  – No dependency == parallelism
Example: Dense Matrix-Vector Multiplication

[Figure: y = A × b]

• Computation of each element of output vector y is independent
• Decomposed into n tasks, one per element in y → easy
• Observations
  – Each task only reads one row of A, and writes one element of y
  – All tasks share vector b (the shared data)
  – No control dependencies between tasks
  – All tasks are of the same size in terms of number of operations

    REAL A[n][n], b[n], y[n], sum;
    int i, j;
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (j = 0; j < n; j++)
            sum += A[i][j] * b[j];
        y[i] = sum;
    }
Example: Database Query Processing

Consider the execution of the query:

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

on the following table:

    ID#   Model    Year  Color  Dealer  Price
    4523  Civic    2002  Blue   MN      $18,000
    3476  Corolla  1999  White  IL      $15,000
    7623  Camry    2001  Green  NY      $21,000
    9834  Prius    2001  Green  CA      $18,000
    6734  Civic    2001  White  OR      $17,000
    5342  Altima   2001  Green  FL      $19,000
    3845  Maxima   2001  Blue   NY      $22,000
    8354  Accord   2000  Green  VT      $18,000
    4395  Civic    2001  Red    CA      $17,000
    7352  Civic    2002  Red    WA      $18,000
Example: Database Query Processing

• Tasks: each task searches the whole table for entries that satisfy one predicate
  – Results: a list of entries
• Edges: output of one task serves as input to the next

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)
Example: Database Query Processing

• An alternate decomposition

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

• Different decompositions may yield different parallelism and performance
Granularity of Task Decompositions

• Granularity is task size (amount of computation)
  – Depends on the number of tasks for the same problem size
• Fine-grained decomposition = large number of tasks
• Coarse-grained decomposition = small number of tasks
• Granularity for dense matrix-vector product
  – fine-grain: each task computes an individual element in y
  – coarser-grain: each task computes 3 elements in y

[Figure: y = A × b]
Degree of Concurrency

• Definition: the number of tasks that can be executed in parallel
• May change over program execution
• Metrics
  – Maximum degree of concurrency
    • Maximum number of concurrent tasks at any point during execution
  – Average degree of concurrency
    • The average number of tasks that can be processed in parallel over the execution of the program
• Speedup: serial_execution_time / parallel_execution_time
• Inverse relationship of degree of concurrency and task granularity
  – Task granularity ↑ (fewer tasks), degree of concurrency ↓
  – Task granularity ↓ (more tasks), degree of concurrency ↑
Examples: Degree of Concurrency

• Maximum degree of concurrency
• Average degree of concurrency

• Database query
  – Max: 4
  – Average: 7/3 (each task takes the same time)
• Matrix-vector multiplication
  – Max: n
  – Average: n

[Figure: y = A × b]
Critical Path Length

• A directed path: a sequence of tasks that must be serialized
  – Executed one after another
• Critical path:
  – The longest weighted path through the graph
• Critical path length: shortest time in which the program can be finished
  – Lower bound on parallel execution time

[Figure: a building project]
Critical Path Length and Degree of Concurrency

Database query task dependency graphs.

Questions:
• What are the tasks on the critical path for each dependency graph?
• What is the shortest parallel execution time?
• How many processors are needed to achieve the minimum time?
• What is the maximum degree of concurrency?
• What is the average parallelism (average degree of concurrency)?
  – Total amount of work / (critical path length): 2.33 (63/27) and 1.88 (64/34)
Limits on Parallel Performance

• What bounds parallel execution time?
  – minimum task granularity
    • e.g. dense matrix-vector multiplication: ≤ n² concurrent tasks
  – fraction of application work that can’t be parallelized
    • more about Amdahl’s law in a later lecture…
  – dependencies between tasks
  – parallelization overheads
    • e.g., cost of communication between tasks
• Metrics of parallel performance
  – T1: sequential execution time; Tp: parallel execution time on p processors/cores/threads
  – speedup = T1 / Tp
  – parallel efficiency = T1 / (p * Tp)
Task Interaction Graphs

• Tasks generally exchange data with others
  – example: dense matrix-vector multiply
    • If b is not replicated in all tasks: each task has some, but not all, of b
    • Tasks will have to communicate elements of b
• Task interaction graph
  – node = task
  – edge = interaction or data exchange
• Task interaction graphs vs. task dependency graphs
  – Task dependency graphs represent control dependences
  – Task interaction graphs represent data dependences
Task Interaction Graphs: An Example

Sparse matrix-vector multiplication: A × b
• Computation of each result element = an independent task
• Only non-zero elements of sparse matrix A participate in the computation
• If we partition b across tasks, i.e. task Ti has b[i] only:
  – The task interaction graph of the computation = the graph of the matrix A
  – A is the adjacency matrix of the graph
Task Interaction Graphs, Granularity, and Communication

• Finer task granularity → more overhead of task interactions
  – Overhead as a ratio of useful work of a task
• Example: sparse matrix-vector product interaction graph
• Assumptions:
  – each dot (A[i][j] * b[j]) takes unit time to process
  – each communication (edge) causes an overhead of a unit time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are a task: communication = 5; computation = 15
  – coarser-grain decomposition → smaller communication/computation ratio (3/4 vs. 5/15)
Processes and Mapping

• Generally
  – # of tasks >= # of processing elements (PEs) available
  – parallel algorithm must map tasks to processes
• Mapping: aggregate tasks into processes
  – Process/thread = processing or computing agent that performs work
  – assign collection of tasks and associated data to a process
• A PE, e.g. a core/thread, has its own system abstraction
  – Not easy to bind tasks to physical PEs; thus there are more layers, at least conceptually, from PE to task mapping
  – Process in MPI, thread in OpenMP/pthread, etc.
• The overloaded terms of processes and threads
  – Task → processes → OS processes → CPU → cores
  – For the sake of simplicity, processes = PEs/cores
Processes and Mapping

• Mapping of tasks to processes is critical to the parallel performance of an algorithm
• On what basis should one choose mappings?
  – Task dependency graph
  – Task interaction graph
• Task dependency graphs
  – To ensure equal spread of work across all processes at any point
    • minimum idling
    • optimal load balance
• Task interaction graphs
  – To minimize interactions
    • Minimize communication and synchronization
Processes and Mapping

A good mapping must minimize parallel execution time by:

• Mapping independent tasks to different processes
  – Maximize concurrency
• Giving tasks on the critical path high priority of being assigned to processes
• Minimizing interaction between processes
  – mapping tasks with dense interactions to the same process
• Difficulty: these criteria often conflict with each other
  – E.g. no decomposition, i.e. one task, minimizes interaction but gives no speedup at all!
Processes and Mapping: Example

Example: mapping database queries to processes
• Consider the dependency graphs in levels
  – no nodes in a level depend upon one another
• Assign all tasks within a level to different processes
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Decomposition Techniques

So how does one decompose a task into various subtasks?

• No single recipe that works for all problems
• Practically used techniques
  – Recursive decomposition
  – Data decomposition
  – Exploratory decomposition
  – Speculative decomposition
Recursive Decomposition

Generally suited to problems solvable using the divide-and-conquer strategy.

Steps:
1. decompose a problem into a set of sub-problems
2. recursively decompose each sub-problem
3. stop decomposition when minimum desired granularity is reached
Recursive Decomposition: Quicksort

At each level and for each vector:
1. Select a pivot
2. Partition set around pivot
3. Recursively sort each subvector

    quicksort(A, lo, hi)
        if lo < hi
            p = pivot_partition(A, lo, hi)
            quicksort(A, lo, p-1)
            quicksort(A, p+1, hi)

Each vector can be sorted concurrently (i.e., each sorting represents an independent subtask).
Recursive Decomposition: Min

Finding the minimum in a vector using divide-and-conquer:

    procedure SERIAL_MIN(A, n)
        min = A[0];
        for i := 1 to n − 1 do
            if (A[i] < min) min := A[i];
        return min;

    procedure RECURSIVE_MIN(A, n)
        if (n = 1) then
            min := A[0];
        else
            lmin := RECURSIVE_MIN(A, n/2);
            rmin := RECURSIVE_MIN(&(A[n/2]), n − n/2);
            if (lmin < rmin) then min := lmin;
            else min := rmin;
        return min;

Applicable to other associative operations, e.g. sum, AND…
Known as a reduction operation.
Recursive Decomposition: Min

Finding the minimum in the set {4, 9, 1, 7, 8, 11, 2, 12}.

Task dependency graph:
• The RECURSIVE_MIN calls form a binary tree
• Each min finishes and closes its subtree
• Tasks in the same level can run in parallel
Fib with OpenMP Tasking

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

    int fib(int n) {
        int x, y;
        if (n < 2) return n;
        else {
            #pragma omp task shared(x)
            x = fib(n-1);
            #pragma omp task shared(y)
            y = fib(n-2);
            #pragma omp taskwait
            return x + y;
        }
    }
Data Decomposition -- the most commonly used approach

• Steps:
  1. Identify the data on which computations are performed
  2. Partition this data across various tasks
• Partitioning induces a decomposition of the problem, i.e. computation is partitioned
• Data can be partitioned in various ways
  – Critical for parallel performance
• Decomposition based on
  – output data
  – input data
  – input + output data
  – intermediate data
Output Data Decomposition

• Each element of the output can be computed independently of others
  – simply as a function of the input
• A natural problem decomposition
Output Data Decomposition: Matrix Multiplication

Multiplying two n×n matrices A and B to yield matrix C.

The output matrix C can be partitioned into four tasks, one per block of C:

    Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
    Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
    Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
    Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
Output Data Decomposition: Example

A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

    Decomposition I                     Decomposition II
    Task 1: C1,1 = A1,1 B1,1            Task 1: C1,1 = A1,1 B1,1
    Task 2: C1,1 = C1,1 + A1,2 B2,1     Task 2: C1,1 = C1,1 + A1,2 B2,1
    Task 3: C1,2 = A1,1 B1,2            Task 3: C1,2 = A1,2 B2,2
    Task 4: C1,2 = C1,2 + A1,2 B2,2     Task 4: C1,2 = C1,2 + A1,1 B1,2
    Task 5: C2,1 = A2,1 B1,1            Task 5: C2,1 = A2,2 B2,1
    Task 6: C2,1 = C2,1 + A2,2 B2,1     Task 6: C2,1 = C2,1 + A2,1 B1,1
    Task 7: C2,2 = A2,1 B1,2            Task 7: C2,2 = A2,1 B1,2
    Task 8: C2,2 = C2,2 + A2,2 B2,2     Task 8: C2,2 = C2,2 + A2,2 B2,2
Output Data Decomposition: Example

Count the frequency of item sets in database transactions.

• Decompose the item sets to count
  – each task computes the total count for each of its item sets
  – append total counts for item sets to produce the total count result
Output Data Decomposition: Example

From the previous example, the following observations can be made:

• If the database of transactions is replicated across the processes, each task can be independently accomplished with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.
Input Data Partitioning

• Generally applicable if each output can be naturally computed as a function of the input
• In many cases, this is the only natural decomposition if the output is not clearly known a priori
  – e.g., the problem of finding the minimum, sorting
• A task is associated with each input data partition
  – The task performs computation with its part of the data
  – Subsequent processing combines these partial results
• MapReduce
Input Data Partitioning: Example

[Figure]
Partitioning Input and Output Data

• Partition on both input and output for more concurrency
• Example: item set counting
Histogram

• Parallelizing Histogram using OpenMP in Assignment 2
  – Similar to counting frequency
Intermediate Data Partitioning

• If computation is a sequence of transforms
  – from input data to output data, e.g. an image processing workflow
• Can decompose based on the data for intermediate stages

[Figure: Tasks 1 through 4 over intermediate data]
Intermediate Data Partitioning: Example

• Dense matrix multiplication
  – visualize this computation in terms of intermediate matrices D

• A decomposition of intermediate data: 8 + 4 tasks:

Stage I:
    Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
    Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
    Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
    Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II:
    Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
    Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2
Intermediate Data Partitioning: Example

The task dependency graph for the decomposition (shown in the previous foil) into 12 tasks is as follows: [Figure]
The Owner-Computes Rule

• Each datum is assigned to a process
• Each process computes values associated with its data
• Implications
  – input data decomposition
    • all computations using an input datum are performed by its process
  – output data decomposition
    • an output is computed by the process assigned to the output data
References

• Adapted from slides “Principles of Parallel Algorithm Design” by Ananth Grama
• Based on Chapter 3 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.