Interaction of Architecture and Algorithm
in the Domain-based Parallelization of an
Unstructured Grid Incompressible Flow Code

Dinesh K. Kaushik & David E. Keyes
CS Department, Old Dominion University &
ICASE, NASA Langley Research Center

Barry F. Smith
MCS Division, Argonne National Laboratory
Organization of Presentation

- Issues for unstructured grid domain decomposition methods
- Background of FUN3D
- Background of PETSc
- Illustrations of general porting issues
- Summary of serial and parallel performance
- Conclusions
Solving Unstructured Grid Problems in Parallel: Main Issues

- SPMD parallelization of unstructured grid solvers is complicated by the fact that no two interprocessor data dependency patterns are alike
- The user-provided global ordering may be incompatible with the subdomain-contiguous ordering required for high performance and convenient SPMD coding
- Loss of regularity in unstructured grid solvers makes them more memory- and integer-op-intensive; nevertheless, a library-based solver should be competitive in serial with a legacy solver in terms of memory and execution time
Implications of the Memory Hierarchy on Computational Efficiency

- Storage/use patterns should follow the memory hierarchy
  - Blocks for Registers: block storage format for multicomponent systems saves CPU cycles
  - Interlaced Data Structures for Cache: choose u1, v1, w1, p1, u2, v2, w2, p2, ... in place of u1, u2, ..., v1, v2, ..., w1, w2, ..., p1, p2, ...
  - Subdomains for Distributed Memory: "chunky" domain decomposition for optimal surface-to-volume (communication-to-computation) ratio
- This hierarchy is concerned with different issues than the algorithmic efficiency issues associated with hierarchies of grids
Optimal Granularity of Decomposition

For cache-based microprocessors, granularity is determined by three forces:

- Convergence Rate: usually deteriorates with increased granularity
- Communication Volume: increases with increased granularity
- Size of Local Working Set: fits better into successively smaller cache levels with increased granularity
Description of the Legacy Code - FUN3D

- FUN3D is a tetrahedral vertex-centered unstructured grid code developed by W. K. Anderson (LaRC) for compressible and incompressible Euler and Navier-Stokes equations
- Parallel experience is with incompressible Euler so far, but nothing in the algorithms or software changes for the other cases; only convergence rate will vary with conditioning, as determined by Mach and Reynolds numbers (and mesh)
- FUN3D uses 1st- or 2nd-order Roe for convection and Galerkin for diffusion, and false timestepping with backward Euler for nonlinear continuation towards steady state
- Solver is Newton-Krylov-Schwarz; timestep is advanced towards infinity by the switched evolution/relaxation (SER) heuristic of Van Leer & Mulder
PETSc: a Portable Extensible Toolkit for Scientific Computing

- Gives relatively high-level expression to preconditioned iterative linear solvers and Newton iterative methods
- Supports complex arithmetic
- Ports wherever MPI ports; committed to progressive MPI tuning
- Permits great flexibility (through object-oriented philosophy) for algorithmic innovation
- Freely available
- Callable from FORTRAN 77, C, and C++; written in C
- Includes diagnostic, monitoring, and visualization GUIs
The PETSc Philosophy

- Library approach: compiler can't do all; users shouldn't do all more than once
- Distributed data structures as fundamental objects: index sets, vectors, and matrices (grid functions coming)
- Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensible
- Portable
- Uniform Application Programmer Interface (API)
- Multi-layered entry
- Message-passing detail suppressed
Conversion of Legacy FUN3D into PETSc/MPI Version

- Project begun 10/96, completed 3/97, undergoing continual enhancement
- Five-month (part-time) effort included:
  - learning FUN3D and the PUNS3D mesh preprocessor
  - learning the MeTiS partitioner
  - adding and testing new functionality in PETSc
  - restructuring FUN3D from vector to cache orientation
- Approximately 3,300 of 14,400 F77 lines of FUN3D retained (primarily as "node code" for flux and Jacobian evaluations); PETSc solvers used for the rest
- Effort has not yet included:
  - Parallel I/O and post-processing
- Next unstructured mesh code port should require significantly less time
Solving Unstructured Grid Problems in Parallel:
Basic Outline of the Solution Strategy

- Follow the "owner computes" rule under the dual constraints of minimizing the number of messages and overlapping communication with computation
- Each processor "ghosts" its stencil dependences in its neighbors
- Ghost nodes ordered after contiguous owned nodes
- Domain mapped from (user) global ordering into local orderings
- Scatter/gather operations created between local sequential vectors and global distributed vectors, based on runtime connectivity patterns
- Newton-Krylov-Schwarz operations translated into local tasks and communication tasks (nonblocking for overlap where hardware supports)
Three Different Orderings

[Figure: a 16-vertex model grid shown in four panels - the Application Ordering (the user's global numbering, 0-15), the subdomain-contiguous PETSc Ordering, and the Local Orderings for Processor 0 and Processor 1 (local numbers 0-11, with ghost nodes numbered after the owned nodes).]
Scattering Between the Orderings

- After establishing the different orderings, establish the "scatter" between the global and local vectors in the following way:

    ISCreateStride(MPI_COMM_SELF, bs*nvertices, 0, 1, &islocal);
    ISCreateBlock(MPI_COMM_SELF, bs, nvertices, svertices, &isglobal);
    VecScatterCreate(x, isglobal, user.localX, islocal, &user.scatter);

- Next, before using the local vector in any subroutine, carry out the scatter operation:

    VecScatterBegin(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);
    VecScatterEnd(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);
Sample Serial Performance Comparison: PETSc vs. Legacy Code

For both codes:

- same optimization level (-O3) was used
- same timer was used
- time measurement started after reading all the input files
- no output was written during timing measurements
- platform used was the IBM SP at Argonne, with enough memory to avoid page faults after loading
            Execution (s)         Memory (MB)
vertices    original    PETSc     original    PETSc
2,800       122.71      27.88     10.22       12.08
22,700      2905.30     381.09    74.74       83.67

Percentage difference in memory requirement reduces with problem size
Sample Memory Conservation Techniques and Successive Effects in FUN3D Porting History

- Precisely sized preallocation of sparse matrix objects (77 -> 47 MB of RAM)
- Pruning of legacy code solver data structures (47 -> 34 MB of RAM)
- In-place factorization of preconditioner (34 -> 21 MB of RAM)
- Moving "MatSetValues" calls into legacy subroutines (21 -> 16 MB of RAM)
- Making partitioning stage scalable (16 -> 12 MB of RAM)
- Size of legacy code on same problem: 10 MB
- Size of parallel single-node code on same problem: 12 MB
Summary of Parallel Performance on Cray T3E and IBM SP

- 1.4 million degree-of-freedom problem converged to machine precision in approximately 6 minutes with approximately 1600 flux balance operations (work units) on 128 processors of a T3E or 80 processors of an SP
- Relative efficiencies of 75% to 85% over this range
- Algorithmic efficiency (ratio of iteration count of less decomposed grid to more decomposed grid, using the "best" algorithm for each processor granularity) is in excess of 90% over this range; iteration count is only weakly dependent upon granularity
- Implementation efficiency (ratio of the cost per vertex per iteration) is in excess of 80% over this range and can be superunitary
- Superunitary implementation efficiency derives from improved cache locality at higher granularity (smaller working sets on each processor), in spite of greater nearest-neighbor communication volume
- Properly sizing working set to cache largely overcomes convergence and communication penalties of concurrency
Cray T3E Scalability - Fixed Size
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs   its   exe         speedup   eta_alg   eta_impl   eta_overall
16      77    2587.95 s   1.00      1.00      1.00       1.00
24      78    1792.34 s   1.44      0.99      0.97       0.96
32      75    1262.01 s   2.05      1.03      1.00       1.03
40      75    1043.55 s   2.48      1.03      0.97       0.99
48      76     885.91 s   2.92      1.01      0.96       0.97
64      75     662.06 s   3.91      1.03      0.95       0.98
80      78     559.93 s   4.62      0.99      0.94       0.92
96      79     491.40 s   5.27      0.97      0.90       0.88
128     82     382.30 s   6.77      0.94      0.90       0.85

85% relative efficiency at 128 nodes
IBM SP Scalability - Fixed Size
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs   its   exe         speedup   eta_alg   eta_impl   eta_overall
8       70    2897.46 s   1.00      1.00      1.00       1.00
10      73    2405.66 s   1.20      0.96      1.00       0.96
16      78    1670.67 s   1.73      0.90      0.97       0.87
20      73    1233.06 s   2.35      0.96      0.98       0.94
32      74     797.46 s   3.63      0.95      0.96       0.91
40      75     672.90 s   4.31      0.93      0.92       0.86
48      75     569.94 s   5.08      0.93      0.91       0.85
64      74     437.72 s   6.62      0.95      0.87       0.83
80      77     386.83 s   7.49      0.91      0.82       0.75

75% relative efficiency at 80 nodes
Cray T3E Scalability - Gustafson
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid

vert      procs   vert/proc   its   exe        exe/it
357,900   80      4474        78    559.93 s   7.18 s
53,961    12      4497        36    265.72 s   7.38 s
9,428     2       4714        19    131.07 s   6.89 s

Less than 7% variation in performance over a factor of nearly 40 in problem size
Notes on Efficiency

Conflicting definitions of parallel efficiency abound, depending upon two choices:

- What scaling is to be used as the number of processors is varied?
  - overall fixed-size problem
  - varying-size problem with fixed memory per processor
  - varying-size problem with fixed work per processor
- What form of the algorithm is to be used as the number of processors is varied?
  - reproduce the sequential arithmetic exactly
  - adjust parameters to perform best on each given number of processors

Our charts include both overall fixed-size scaling and approximately fixed memory per processor (Gustafson) scaling.

We always adjust the subdomain blocking parameter to match the number of processors, one subdomain per processor; this causes the number of iterations to vary.
Notes on Efficiency, cont.

Effect of changing-strength preconditioner and effect of parallel overhead are often separated into algorithmic and implementation factors.

- Customary definition of overall efficiency in going from q to p processors (p > q):

    eta(p|q) = (q * T(q)) / (p * T(p))

  where T(p) is the overall execution time on p processors (measured)
- Factor T(p) into I(p), the number of iterations, and C(p), the average cost per iteration
- Algorithmic efficiency is a measure of preconditioning quality (measured):

    eta_alg(p|q) = I(q) / I(p)

- Implementation efficiency is the remaining (inferred, not directly measurable) factor:

    eta_impl(p|q) = (q * C(q)) / (p * C(p))
Footnotes on Scalability Tables

- "its" represents the number of pseudo-transient Newton steps: one Newton step per timestep, with SER growth in timestep up to a CFL of 100,000, and with a maximum number (20) of Schwarz-preconditioned GMRES steps per Newton step with relative tolerance of 10^-2
- Convergence defined as a relative reduction in the norm of the steady-state nonlinear residual by a factor of 10^-10
- Convergence rate typically degrades slightly as number of processors is increased, due to introduction of concurrency in preconditioner; highly partition-dependent
- Implementation efficiency may improve slightly as processors are added, due to smaller working sets (better cache residency)
- Implementation efficiency ultimately degrades as communication-to-computation ratio increases
Our View of the "State-of-the-Art": Architecture and Programming Environment

- Vector-awareness is out; cache-awareness is in; vector-awareness will return in subtle ways
- Except for Tera and the installed vector base, all near-term large-scale computers will be based on commodity processors
- HPF and parallel compilers not yet up to performance
- Some useful parallel libraries, like PETSc
- Need for better memory bandwidth to harness the full capability of current and future chips
Our View of the "State-of-the-Art": Algorithms

- Explicit time integration is a solved problem, except for dynamic mesh adaptivity
- Implicit methods remain a major challenge:
  - Today's algorithms leave something to be desired in convergence rate
  - All good algorithms have global synchronization
- Data parallelism from domain decomposition is unquestionably the main source of locality-preserving concurrency, but good smoothers and preconditioners violate locality
- New forms of algorithmic latency tolerance must be found
- Exotic methods should be considered at ASCI scales
Our View of the "State-of-the-Art": Application-Algorithm-Architecture Interaction

- Ripest remaining advances are interdisciplinary
- Application-Algorithm
  - Ordering, partitioning, and coarsening must adapt to coefficients (grid spacing and flow magnitude and direction)
  - Trade-offs between pseudo-time iteration, nonlinear iteration, linear iteration, and preconditioner iteration must be understood and exploited
- Algorithm-Architecture
  - Algorithmicists must think natively in parallel and avoid introducing unnecessary sequential constraints
  - Algorithmicists should inform their choices with what their machine is good at and what it is bad at
Conclusions

- Hierarchy of domain decomposition should follow distributed memory hierarchy for computational efficiency
  - blocking for registers gives a factor of 2 in performance for multicomponent systems
  - interlaced data structure for cache reduces execution time by more than a factor of 4
  - subdomains for processor memory migrate the sequential code to SPMD parallelism
- In addition to convergence rate and communication volume, working set size is another parameter to consider for the "preferred" granularity of domain decomposition
- PETSc-ported FUN3D gives nice scalability results (parallel efficiency ranges from 75% to 85%) on two platforms, IBM SP and Cray T3E
- Library (PETSc)-based solver is competitive with the legacy solver; outperforms it by a factor of 9 even in serial mode; percentage difference in memory reduces with problem size
Reference URLs

- FUN3D: http://fmad-www.larc.nasa.gov/~wanderso/Fun/fun.html
- PETSc: http://www.mcs.anl.gov/petsc/petsc.html
- Pointers and related papers: http://www.cs.odu.edu/~keyes/keyes.html