Lecture 8: Principles of Parallel Algorithm Design

CSCE569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh
Topics

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
Topics

• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus and OpenMP Tasking
  – PThread, mutual exclusion, locks, synchronizations
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to offloading model in OpenMP
“parallel” and “for” OpenMP Constructs

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:

    #pragma omp parallel shared(a, b)
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a worksharing for construct:

    #pragma omp parallel shared(a, b) private(i)
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenMP Best Practices

    #pragma omp parallel private(i)
    {
        #pragma omp for nowait
        for (i = 0; i < n; i++)
            a[i] += b[i];

        #pragma omp for nowait
        for (i = 0; i < n; i++)
            c[i] += d[i];

        #pragma omp barrier
        #pragma omp for nowait reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += a[i] + c[i];
    }
False Sharing in OpenMP and Solution

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

False sharing:

    int a[max_threads];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i] += i;

With padding:

    int a[max_threads][cache_line_size];
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < max_threads; i++)
        a[i][0] += i;
False Sharing (from “Getting OpenMP Up To Speed”, RvdP/V1 Tutorial, IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010)

A store into a shared cache line invalidates the other copies of that line: the system is not able to distinguish between changes within one individual line.

[Figure: threads T0 and T1 on separate CPUs each holding a cached copy of line A; CPUs, caches, and memory shown side by side.]
NUMA First-touch

About “First Touch” placement:

    #pragma omp parallel for num_threads(2)
    for (i = 0; i < 100; i++)
        a[i] = 0;

First touch: both memories each have “their half” of the array (a[0]:a[49] in one node’s memory, a[50]:a[99] in the other’s).
SPMD Program Models in OpenMP

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID
• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code
• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

    if (my_id == x) { } else { }
Overview: Algorithms and Concurrency (part 1)

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Overview: Concurrency and Mapping (part 2)

• Mapping Techniques for Load Balancing
  – Static and Dynamic Mapping
• Methods for Minimizing Interaction Overheads
  – Maximizing Data Locality
  – Minimizing Contention and Hot-Spots
  – Overlapping Communication and Computations
  – Replication vs. Communication
  – Group Communications vs. Point-to-Point Communication
• Parallel Algorithm Design Models
  – Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models
Decomposition, Tasks, and Dependency Graphs

• Decompose work into tasks that can be executed concurrently
• Decomposition could be done in many different ways
• Tasks may be of same, different, or even indeterminate sizes
• Task dependency graph:
  – node = task
  – edge = control dependence, output-input dependency
  – No dependency == parallelism
Example: Dense Matrix-Vector Multiplication

[Figure: y = A × b]

• Computation of each element of output vector y is independent
• Decomposed into n tasks, one per element in y → easy
• Observations
  – Each task only reads one row of A, and writes one element of y
  – All tasks share vector b (the shared data)
  – No control dependencies between tasks
  – All tasks are of the same size in terms of number of operations

    REAL A[n][n], b[n], y[n], sum;
    int i, j;
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (j = 0; j < n; j++)
            sum += A[i][j] * b[j];
        y[i] = sum;
    }
Example: Database Query Processing

Consider the execution of the query:

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

on the following table:

    ID#   Model    Year  Color  Dealer  Price
    4523  Civic    2002  Blue   MN      $18,000
    3476  Corolla  1999  White  IL      $15,000
    7623  Camry    2001  Green  NY      $21,000
    9834  Prius    2001  Green  CA      $18,000
    6734  Civic    2001  White  OR      $17,000
    5342  Altima   2001  Green  FL      $19,000
    3845  Maxima   2001  Blue   NY      $22,000
    8354  Accord   2000  Green  VT      $18,000
    4395  Civic    2001  Red    CA      $17,000
    7352  Civic    2002  Red    WA      $18,000
Example: Database Query Processing

• Tasks: each task searches the whole table for entries that satisfy one predicate
  – Results: a list of entries
• Edges: output of one task serves as input to the next

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)
Example: Database Query Processing

• An alternate decomposition

    MODEL = “CIVIC” AND YEAR = 2001 AND (COLOR = “GREEN” OR COLOR = “WHITE”)

• Different decompositions may yield different parallelism and performance
Granularity of Task Decompositions

• Granularity is task size (amount of computation)
  – Depends on the number of tasks for the same problem size
• Fine-grained decomposition = large number of tasks
• Coarse-grained decomposition = small number of tasks
• Granularity for dense matrix-vector product
  – fine-grain: each task computes an individual element in y
  – coarser-grain: each task computes 3 elements in y

[Figure: y = A × b]
Degree of Concurrency

• Definition: the number of tasks that can be executed in parallel
• May change over program execution
• Metrics
  – Maximum degree of concurrency
    • Maximum number of concurrent tasks at any point during execution
  – Average degree of concurrency
    • The average number of tasks that can be processed in parallel over the execution of the program
• Speedup: serial_execution_time / parallel_execution_time
• Inverse relationship of degree of concurrency and task granularity
  – Task granularity ↑ (fewer tasks), degree of concurrency ↓
  – Task granularity ↓ (more tasks), degree of concurrency ↑
Examples: Degree of Concurrency

• Maximum degree of concurrency
• Average degree of concurrency

• Database query
  – Max: 4
  – Average: 7/3 (each task takes the same time)
• Matrix-vector multiplication
  – Max: n
  – Average: n

[Figure: y = A × b]
Critical Path Length

• A directed path: a sequence of tasks that must be serialized
  – Executed one after another
• Critical path:
  – The longest weighted path through the graph
• Critical path length: shortest time in which the program can be finished
  – Lower bound on parallel execution time

[Figure: a building project]
Critical Path Length and Degree of Concurrency

Database query task dependency graphs.

Questions:
• What are the tasks on the critical path for each dependency graph?
• What is the shortest parallel execution time?
• How many processors are needed to achieve the minimum time?
• What is the maximum degree of concurrency?
• What is the average parallelism (average degree of concurrency)?
  – Total amount of work / (critical path length): 2.33 (63/27) and 1.88 (64/34)
Limits on Parallel Performance

• What bounds parallel execution time?
  – minimum task granularity
    • e.g. dense matrix-vector multiplication: ≤ n² concurrent tasks
  – fraction of application work that can’t be parallelized
    • more about Amdahl’s law in a later lecture…
  – dependencies between tasks
  – parallelization overheads
    • e.g., cost of communication between tasks
• Metrics of parallel performance
  – T1: sequential execution time; Tp: parallel execution time on p processors/cores/threads
  – speedup = T1 / Tp
  – parallel efficiency = T1 / (p * Tp)
Task Interaction Graphs

• Tasks generally exchange data with others
  – example: dense matrix-vector multiply
    • If b is not replicated in all tasks: each task has some, but not all, of b
    • Tasks will have to communicate elements of b
• Task interaction graph
  – node = task
  – edge = interaction or data exchange
• Task interaction graphs vs. task dependency graphs
  – Task dependency graphs represent control dependences
  – Task interaction graphs represent data dependences
Task Interaction Graphs: An Example

Sparse matrix-vector multiplication: A × b
• Computation of each result element = an independent task
• Only non-zero elements of sparse matrix A participate in the computation
• If we partition b across tasks, i.e. task Ti has b[i] only:
  – The task interaction graph of the computation = the graph of the matrix A
  – A is the adjacency matrix of the graph
Task Interaction Graphs, Granularity, and Communication

• Finer task granularity → more overhead of task interactions
  – Overhead as a ratio of useful work of a task
• Example: sparse matrix-vector product interaction graph
• Assumptions:
  – each dot (A[i][j] * b[j]) takes unit time to process
  – each communication (edge) causes an overhead of a unit time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are a task: communication = 5; computation = 15
  – coarser-grain decomposition → smaller communication/computation ratio (3/4 vs. 5/15)
Processes and Mapping

• Generally
  – # of tasks >= # of processing elements (PEs) available
  – parallel algorithm must map tasks to processes
• Mapping: aggregate tasks into processes
  – Process/thread = processing or computing agent that performs work
  – assign collection of tasks and associated data to a process
• A PE, e.g. a core/thread, has its own system abstraction
  – Not easy to bind tasks to physical PEs; thus there are more layers, at least conceptually, from PE to task mapping
  – Process in MPI, thread in OpenMP/pthread, etc.
• The overloaded terms of processes and threads
  – Task → processes → OS processes → CPU → cores
  – For the sake of simplicity, processes = PEs/cores
Processes and Mapping

• Mapping of tasks to processes is critical to the parallel performance of an algorithm
• On what basis should one choose mappings?
  – Task dependency graph
  – Task interaction graph
• Task dependency graphs
  – To ensure equal spread of work across all processes at any point
    • minimum idling
    • optimal load balance
• Task interaction graphs
  – To minimize interactions
    • Minimize communication and synchronization
Processes and Mapping

A good mapping must minimize parallel execution time by:

• Mapping independent tasks to different processes
  – Maximize concurrency
• Giving tasks on the critical path high priority of being assigned to processes
• Minimizing interaction between processes
  – mapping tasks with dense interactions to the same process
• Difficulty: these criteria often conflict with each other
  – E.g. no decomposition, i.e. one task, minimizes interaction but gives no speedup at all!
Processes and Mapping: Example

Example: mapping database queries to processes
• Consider the dependency graphs in levels
  – no nodes in a level depend upon one another
• Assign all tasks within a level to different processes
Overview: Algorithms and Concurrency

• Introduction to Parallel Algorithms
  – Tasks and Decomposition
  – Processes and Mapping
• Decomposition Techniques
  – Recursive Decomposition
  – Data Decomposition
  – Exploratory Decomposition
  – Hybrid Decomposition
• Characteristics of Tasks and Interactions
  – Task Generation, Granularity, and Context
  – Characteristics of Task Interactions
Decomposition Techniques

So how does one decompose a task into various subtasks?

• No single recipe that works for all problems
• Practically used techniques
  – Recursive decomposition
  – Data decomposition
  – Exploratory decomposition
  – Speculative decomposition
Recursive Decomposition

Generally suited to problems solvable using the divide-and-conquer strategy.

Steps:
1. decompose a problem into a set of sub-problems
2. recursively decompose each sub-problem
3. stop decomposition when minimum desired granularity is reached
Recursive Decomposition: Quicksort

At each level and for each vector:
1. Select a pivot
2. Partition set around pivot
3. Recursively sort each subvector

    quicksort(A, lo, hi)
        if lo < hi
            p = pivot_partition(A, lo, hi)
            quicksort(A, lo, p-1)
            quicksort(A, p+1, hi)

Each vector can be sorted concurrently (i.e., each sorting represents an independent subtask).
Recursive Decomposition: Min

Finding the minimum in a vector using divide-and-conquer:

    procedure SERIAL_MIN(A, n)
        min = A[0];
        for i := 1 to n − 1 do
            if (A[i] < min) min := A[i];
        return min;

    procedure RECURSIVE_MIN(A, n)
        if (n = 1) then
            min := A[0];
        else
            lmin := RECURSIVE_MIN(A, n/2);
            rmin := RECURSIVE_MIN(&(A[n/2]), n − n/2);
            if (lmin < rmin) then min := lmin;
            else min := rmin;
        return min;

Applicable to other associative operations, e.g. sum, AND…
Known as a reduction operation.
Recursive Decomposition: Min

Finding the minimum in the set {4, 9, 1, 7, 8, 11, 2, 12}.

Task dependency graph:
• The RECURSIVE_MIN calls form a binary tree
• Each min finishes and closes its subtree
• Tasks in the same level can run in parallel
Fib with OpenMP Tasking

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to complete through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

    int fib(int n) {
        int x, y;
        if (n < 2) return n;
        else {
            #pragma omp task shared(x)
            x = fib(n-1);
            #pragma omp task shared(y)
            y = fib(n-2);
            #pragma omp taskwait
            return x + y;
        }
    }
Data Decomposition -- the most commonly used approach

• Steps:
  1. Identify the data on which computations are performed
  2. Partition this data across various tasks
• Partitioning induces a decomposition of the problem, i.e. computation is partitioned
• Data can be partitioned in various ways
  – Critical for parallel performance
• Decomposition based on
  – output data
  – input data
  – input + output data
  – intermediate data
Output Data Decomposition

• Each element of the output can be computed independently of others
  – simply as a function of the input
• A natural problem decomposition
Output Data Decomposition: Matrix Multiplication

Multiplying two n×n matrices A and B to yield matrix C.

The output matrix C can be partitioned into four tasks, one per block of C:

    Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
    Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
    Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
    Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
Output Data Decomposition: Example

A partitioning of output data does not result in a unique decomposition into tasks. For example, for the same problem as in the previous foil, with identical output data distribution, we can derive the following two (other) decompositions:

    Decomposition I                     Decomposition II
    Task 1: C1,1 = A1,1 B1,1            Task 1: C1,1 = A1,1 B1,1
    Task 2: C1,1 = C1,1 + A1,2 B2,1     Task 2: C1,1 = C1,1 + A1,2 B2,1
    Task 3: C1,2 = A1,1 B1,2            Task 3: C1,2 = A1,2 B2,2
    Task 4: C1,2 = C1,2 + A1,2 B2,2     Task 4: C1,2 = C1,2 + A1,1 B1,2
    Task 5: C2,1 = A2,1 B1,1            Task 5: C2,1 = A2,2 B2,1
    Task 6: C2,1 = C2,1 + A2,2 B2,1     Task 6: C2,1 = C2,1 + A2,1 B1,1
    Task 7: C2,2 = A2,1 B1,2            Task 7: C2,2 = A2,1 B1,2
    Task 8: C2,2 = C2,2 + A2,2 B2,2     Task 8: C2,2 = C2,2 + A2,2 B2,2
Output Data Decomposition: Example

Count the frequency of item sets in database transactions.

• Decompose the item sets to count
  – each task computes the total count for each of its item sets
  – append total counts for item sets to produce the total count result
Output Data Decomposition: Example

From the previous example, the following observations can be made:

• If the database of transactions is replicated across the processes, each task can be independently accomplished with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.
Input Data Partitioning

• Generally applicable if each output can be naturally computed as a function of the input
• In many cases, this is the only natural decomposition if the output is not clearly known a priori
  – e.g., the problem of finding the minimum, sorting
• A task is associated with each input data partition
  – The task performs computation with its part of the data
  – Subsequent processing combines these partial results
• MapReduce
Input Data Partitioning: Example

[Figure]
Partitioning Input and Output Data

• Partition on both input and output for more concurrency
• Example: item set counting
Histogram

• Parallelizing Histogram using OpenMP in Assignment 2
  – Similar to counting frequency
Intermediate Data Partitioning

• If computation is a sequence of transforms
  – from input data to output data, e.g. an image processing workflow
• Can decompose based on the data for intermediate stages

[Figure: Tasks 1 through 4 over intermediate data]
Intermediate Data Partitioning: Example

• Dense matrix multiplication
  – visualize this computation in terms of intermediate matrices D

• A decomposition of intermediate data: 8 + 4 tasks:

Stage I:
    Task 01: D1,1,1 = A1,1 B1,1    Task 02: D2,1,1 = A1,2 B2,1
    Task 03: D1,1,2 = A1,1 B1,2    Task 04: D2,1,2 = A1,2 B2,2
    Task 05: D1,2,1 = A2,1 B1,1    Task 06: D2,2,1 = A2,2 B2,1
    Task 07: D1,2,2 = A2,1 B1,2    Task 08: D2,2,2 = A2,2 B2,2

Stage II:
    Task 09: C1,1 = D1,1,1 + D2,1,1    Task 10: C1,2 = D1,1,2 + D2,1,2
    Task 11: C2,1 = D1,2,1 + D2,2,1    Task 12: C2,2 = D1,2,2 + D2,2,2
Intermediate Data Partitioning: Example

The task dependency graph for the decomposition (shown in the previous foil) into 12 tasks is as follows: [Figure]
The Owner-Computes Rule

• Each datum is assigned to a process
• Each process computes values associated with its data
• Implications
  – input data decomposition
    • all computations using an input datum are performed by its process
  – output data decomposition
    • an output is computed by the process assigned to the output data
References

• Adapted from slides “Principles of Parallel Algorithm Design” by Ananth Grama
• Based on Chapter 3 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.