kground of FUN3D · Description of the Legacy Co de-FUN3D FUN3D is a tetrahedral v ertex-cen tered unstructured grid co de dev el-op ed b y W. K. Anderson (LaR C) for compressible

Interaction of Architecture and Algorithm
in the Domain-based Parallelization of an
Unstructured Grid Incompressible Flow Code

Dinesh K. Kaushik & David E. Keyes
CS Department, Old Dominion University &
ICASE, NASA Langley Research Center

Barry F. Smith
MCS Division, Argonne National Laboratory

Organization of Presentation

• Issues for unstructured grid domain decomposition methods
• Background of FUN3D
• Background of PETSc
• Illustrations of general porting issues
• Summary of serial and parallel performance
• Conclusions

Solving Unstructured Grid Problems in Parallel:
Main Issues

• SPMD parallelization of unstructured grid solvers is complicated by the fact that no two interprocessor data dependency patterns are alike
• The user-provided global ordering may be incompatible with the subdomain-contiguous ordering required for high performance and convenient SPMD coding
• Loss of regularity in unstructured grid solvers makes them more memory- and integer-op-intensive; nevertheless, a library-based solver should be competitive in serial with a legacy solver in terms of memory and execution time

Implications of the Memory Hierarchy
on Computational Efficiency

• Storage/use patterns should follow memory hierarchy
  - Blocks for Registers
    block storage format for multicomponent systems: saves CPU cycles
  - Interlaced Data Structures for Cache
    choose
      u1, v1, w1, p1, u2, v2, w2, p2, ...
    in place of
      u1, u2, ..., v1, v2, ..., w1, w2, ..., p1, p2, ...
  - Subdomains for Distributed Memory
    "chunky" domain decomposition for optimal surface-to-volume (communication-to-computation) ratio
• This hierarchy is concerned with different issues than the algorithmic efficiency issues associated with hierarchies of grids
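
The two layouts differ only in how a (vertex, component) pair maps to an array offset. A minimal sketch in C, assuming four unknowns per vertex as in the u, v, w, p example above; the function names are hypothetical and are not taken from FUN3D or PETSc:

```c
#include <assert.h>
#include <stddef.h>

enum { NCOMP = 4 };  /* u, v, w, p at every vertex (assumed) */

/* Interlaced layout: u1,v1,w1,p1,u2,v2,w2,p2,... */
static size_t idx_interlaced(size_t vertex, size_t comp)
{
    assert(comp < NCOMP);
    return vertex * NCOMP + comp;
}

/* Field-by-field layout: u1,u2,...,v1,v2,...,w1,w2,...,p1,p2,... */
static size_t idx_segregated(size_t vertex, size_t comp, size_t nvertices)
{
    assert(comp < NCOMP && vertex < nvertices);
    return comp * nvertices + vertex;
}

/* Distance, in array elements, between the first and last unknown
 * of a single vertex in the field-by-field layout. */
static size_t span_segregated(size_t nvertices)
{
    return (NCOMP - 1) * nvertices;
}
```

In the interlaced layout the four unknowns of a vertex occupy adjacent array elements, so a flux evaluation that needs all of them touches one short contiguous run of memory; in the field-by-field layout the same unknowns are `nvertices` elements apart.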

Optimal Granularity of Decomposition

For cache-based microprocessors, granularity is determined by three forces:

• Convergence Rate
  usually deteriorates with increased granularity
• Communication Volume
  increases with increased granularity
• Size of Local Working Set
  fits better into successively smaller cache levels with increased granularity

Description of the Legacy Code - FUN3D

• FUN3D is a tetrahedral vertex-centered unstructured grid code developed by W. K. Anderson (LaRC) for compressible and incompressible Euler and Navier-Stokes equations
• Parallel experience is with incompressible Euler so far, but nothing in the algorithms or software changes for the other cases; only convergence rate will vary with conditioning, as determined by Mach and Reynolds numbers (and mesh)
• FUN3D uses 1st- or 2nd-order Roe for convection and Galerkin for diffusion, and false timestepping with backwards Euler for nonlinear continuation towards steady state
• Solver is Newton-Krylov-Schwarz; timestep is advanced towards infinity by the switched evolution/relaxation (SER) heuristic of Van Leer & Mulder
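
The slides do not spell out the SER formula. A commonly quoted form grows the CFL number in inverse proportion to the nonlinear residual norm, CFL_n = CFL_0 · ||R_0|| / ||R_n||, up to a cap. The sketch below assumes that form; the function and parameter names are hypothetical, and the exact variant used in FUN3D may differ:

```c
#include <assert.h>

/* SER-style CFL update (assumed form, not taken from the FUN3D source):
 * the CFL number grows as the nonlinear residual norm falls,
 * and is clipped at cfl_max. */
static double ser_cfl(double cfl0, double res0_norm, double res_norm,
                      double cfl_max)
{
    assert(cfl0 > 0.0 && res0_norm > 0.0 && res_norm > 0.0);
    double cfl = cfl0 * (res0_norm / res_norm);
    return cfl < cfl_max ? cfl : cfl_max;
}
```

For example, `ser_cfl(10.0, 1.0, 0.5, 1.0e5)` doubles the CFL number once the residual norm has halved. As the residual approaches zero the step saturates at the cap, so the pseudo-transient iteration approaches a plain Newton iteration.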

PETSc:
a Portable Extensible Toolkit for Scientific Computing

• Gives relatively high-level expression to preconditioned iterative linear solvers, and Newton iterative methods
• Supports complex arithmetic
• Ports wherever MPI ports; committed to progressive MPI tuning
• Permits great flexibility (through object-oriented philosophy) for algorithmic innovation
• Freely available
• Callable from FORTRAN77, C, and C++; written in C
• Includes diagnostic, monitoring, and visualization GUIs

The PETSc Philosophy

• Library approach: compiler can't do all; users shouldn't do all more than once
• Distributed data structures as fundamental objects: index sets, vectors, and matrices (grid functions coming)
• Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensible
• Portable
• Uniform Application Programmer Interface (API)
• Multi-layered entry
• Message-passing detail suppressed

Conversion of Legacy FUN3D
into PETSc/MPI version

• Project begun 10/96, completed 3/97, undergoing continual enhancement
• Five-month (part-time) effort included:
  - learning FUN3D and the PUNS3D mesh preprocessor
  - learning the MeTiS partitioner
  - adding and testing new functionality in PETSc
  - restructuring FUN3D from vector to cache orientation
• Approximately 3,300 of 14,400 F77 lines of FUN3D retained (primarily as "node code" for flux and Jacobian evaluations); PETSc solvers used for the rest
• Effort has not yet included:
  - Parallel I/O and post-processing
• Next unstructured mesh code port should require significantly less time

Solving Unstructured Grid Problems in Parallel:
Basic Outline of the Solution Strategy

• Follow the "owner computes" rule under the dual constraints of minimizing the number of messages and overlapping communication with computation
• Each processor "ghosts" its stencil dependences in its neighbors
• Ghost nodes ordered after contiguous owned nodes
• Domain mapped from (user) global ordering into local orderings
• Scatter/gather operations created between local sequential vectors and global distributed vectors, based on runtime connectivity patterns
• Newton-Krylov-Schwarz operations translated into local tasks and communication tasks (nonblocking for overlap where hardware supports)
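
The "owned nodes first, ghosts after" local numbering can be sketched as follows. This is a simplified illustration, assuming a one-layer halo defined by mesh edges and a precomputed owner array; the function and array names are hypothetical and this is not the FUN3D-PETSc implementation. In particular it scans the whole global mesh on every processor, which a scalable partitioning stage would avoid.

```c
#include <assert.h>
#include <stdlib.h>

/* Build the local-to-global map for one processor ("rank").
 * owner[g] : rank that owns global vertex g (0 <= g < nglobal)
 * edges    : nedges pairs (edges[2*e], edges[2*e+1]) of global vertex ids
 * l2g      : output, capacity nglobal; owned vertices first, then ghosts
 * n_owned  : output, number of owned vertices
 * Returns the total local count (owned + ghost), or -1 if malloc fails. */
static int build_local_ordering(int rank, int nglobal, const int *owner,
                                int nedges, const int *edges,
                                int *l2g, int *n_owned)
{
    assert(nglobal > 0);
    int *g2l = malloc((size_t)nglobal * sizeof *g2l);
    if (!g2l) return -1;
    int n = 0;
    for (int g = 0; g < nglobal; g++) g2l[g] = -1;

    /* Pass 1: owned vertices get the contiguous local ids 0..n_owned-1. */
    for (int g = 0; g < nglobal; g++)
        if (owner[g] == rank) { g2l[g] = n; l2g[n++] = g; }
    *n_owned = n;

    /* Pass 2: the off-processor endpoint of an edge that touches an owned
     * vertex becomes a ghost, numbered after all owned vertices. */
    for (int e = 0; e < nedges; e++) {
        int a = edges[2 * e], b = edges[2 * e + 1];
        assert(a >= 0 && a < nglobal && b >= 0 && b < nglobal);
        if (owner[a] == rank && g2l[b] < 0) { g2l[b] = n; l2g[n++] = b; }
        if (owner[b] == rank && g2l[a] < 0) { g2l[a] = n; l2g[n++] = a; }
    }
    free(g2l);
    return n;
}
```

For a four-vertex chain 0-1-2-3 split so that rank 0 owns {0, 1} and rank 1 owns {2, 3}, rank 1 obtains the local ordering 2, 3, 1: its two owned vertices followed by the single ghost it needs from rank 0.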

Three Different Orderings - In Focus

[Figure: the same 16-node mesh numbered in three ways. Panels: Application Ordering (nodes 0-15), PETSc Ordering (nodes 0-15), Local Ordering for Processor 0 (nodes 0-11), Local Ordering for Processor 1 (nodes 0-11).]

Scattering Between the Orderings

• After establishing different orderings, establish the "scatter" between the global and local vectors in the following way:

    ISCreateStride(MPI_COMM_SELF, bs*nvertices, 0, 1, &islocal);
    ISCreateBlock(MPI_COMM_SELF, bs, nvertices, svertices, &isglobal);
    VecScatterCreate(x, isglobal, user.localX, islocal, &user.scatter);

• Next, before using the local vector in any subroutine, carry out the scatter operation:

    VecScatterBegin(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);
    VecScatterEnd(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);

Sample Serial Performance Comparison:
PETSc vs. Legacy Code

For both codes
• same optimization level (-O3) was used
• same timer was used
• time measurement started after reading all the input files
• no output was written during timing measurements
• platform used was IBM SP at Argonne with enough memory to avoid page faults after loading

            Execution (s)           Memory (MB)
  vert    original     PETSc     original     PETSc
  2800      122.71     27.88        10.22     12.08
 22700     2905.30    381.09        74.74     83.67

Percentage difference in memory requirement reduces with problem size
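
The closing remark can be checked from the table itself. Taking the extra memory of the PETSc version relative to the original code, and the ratio of the execution times (a worked calculation from the tabulated values, not an additional measurement):

```latex
\[
\frac{12.08 - 10.22}{10.22} \approx 18.2\% \quad (2800 \text{ vertices}),
\qquad
\frac{83.67 - 74.74}{74.74} \approx 11.9\% \quad (22700 \text{ vertices})
\]
\[
\frac{122.71}{27.88} \approx 4.4 \quad (2800 \text{ vertices}),
\qquad
\frac{2905.30}{381.09} \approx 7.6 \quad (22700 \text{ vertices})
\]
```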

Sample Memory Conservation Techniques
and Successive Effects in FUN3D Porting History

• Precisely sized preallocation of sparse matrix objects (77 → 47 MB of RAM)
• Pruning of legacy code solver data structures (47 → 34 MB of RAM)
• In-place factorization of preconditioner (34 → 21 MB of RAM)
• Moving "MatSetValues" calls into legacy subroutines (21 → 16 MB of RAM)
• Making Partitioning Stage Scalable (16 → 12 MB of RAM)
• Size of legacy code on same problem: 10 MB
• Size of parallel single-node code on same problem: 12 MB
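
Precise preallocation needs the number of nonzero blocks in each block row before the matrix is assembled. For a vertex-centered discretization with a nearest-neighbor stencil that count is one diagonal block plus one block per incident mesh edge. A minimal sketch under that assumption (the function name and the edge-list format are hypothetical, and the slides do not say how FUN3D-PETSc computes its counts):

```c
#include <assert.h>

/* Count nonzero blocks per block row of a vertex-based Jacobian:
 * one diagonal block plus one block per mesh edge incident on the vertex.
 * Assumes each undirected edge appears exactly once in "edges". */
static void count_block_nnz(int nvertices, int nedges, const int *edges,
                            int *nnz)
{
    for (int v = 0; v < nvertices; v++) nnz[v] = 1;   /* diagonal block */
    for (int e = 0; e < nedges; e++) {
        int a = edges[2 * e], b = edges[2 * e + 1];
        assert(a != b);
        assert(a >= 0 && a < nvertices && b >= 0 && b < nvertices);
        nnz[a]++;   /* off-diagonal block (a, b) */
        nnz[b]++;   /* off-diagonal block (b, a) */
    }
}
```

The resulting array is the kind of per-row count that a sparse matrix library can take at creation time, so that storage is allocated once at the exact size instead of being grown during assembly.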

Summary of Parallel Performance on Cray T3E and IBM SP

• 1.4 million degree-of-freedom problem converged to machine precision in approximately 6 minutes with approximately 1600 flux balance operations (work units) on 128 processors of a T3E or 80 processors of an SP
• Relative efficiencies of 75% to 85% over this range
• Algorithmic efficiency (ratio of iteration count of less decomposed grid to more decomposed grid, using the "best" algorithm for each processor granularity) is in excess of 90% over this range; iteration count is only weakly dependent upon granularity
• Implementation efficiency (ratio of the cost per vertex per iteration) is in excess of 80% over this range and can be superunitary
• Superunitary implementation efficiency derives from improved cache locality at higher granularity (smaller working sets on each processor), in spite of greater nearest neighbor communication volume
• Properly sizing working set to cache largely overcomes convergence and communication penalties of concurrency

Cray T3E Scalability - Fixed Size

FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs  its       exe  speedup  η_alg  η_impl  η_overall
   16   77  2587.95s     1.00   1.00    1.00       1.00
   24   78  1792.34s     1.44   0.99    0.97       0.96
   32   75  1262.01s     2.05   1.03    1.00       1.03
   40   75  1043.55s     2.48   1.03    0.97       0.99
   48   76   885.91s     2.92   1.01    0.96       0.97
   64   75   662.06s     3.91   1.03    0.95       0.98
   80   78   559.93s     4.62   0.99    0.94       0.92
   96   79   491.40s     5.27   0.97    0.90       0.88
  128   82   382.30s     6.77   0.94    0.90       0.85

85% relative efficiency at 128 nodes

IBM SP Scalability - Fixed Size

FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs  its       exe  speedup  η_alg  η_impl  η_overall
    8   70  2897.46s     1.00   1.00    1.00       1.00
   10   73  2405.66s     1.20   0.96    1.00       0.96
   16   78  1670.67s     1.73   0.90    0.97       0.87
   20   73  1233.06s     2.35   0.96    0.98       0.94
   32   74   797.46s     3.63   0.95    0.96       0.91
   40   75   672.90s     4.31   0.93    0.92       0.86
   48   75   569.94s     5.08   0.93    0.91       0.85
   64   74   437.72s     6.62   0.95    0.87       0.83
   80   77   386.83s     7.49   0.91    0.82       0.75

75% relative efficiency at 80 nodes

Cray T3E Scalability - Gustafson

FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid

   vert  procs  vert/proc  its      exe  exe/it
357,900     80       4474   78  559.93s   7.18s
 53,961     12       4497   36  265.72s   7.38s
  9,428      2       4714   19  131.07s   6.89s

Less than 7% variation in performance
over factor of nearly 40 in problem size
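
Both closing figures follow from the table, if the variation is taken as the spread in time per iteration relative to the largest value (an assumption about how the 7% was computed):

```latex
\[
\frac{7.38 - 6.89}{7.38} \approx 6.6\%,
\qquad
\frac{357{,}900}{9{,}428} \approx 38.0
\]
```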

Notes on Efficiency

Conflicting definitions of parallel efficiency abound, depending upon two choices:

• What scaling is to be used as the number of processors is varied?
  - overall fixed-size problem
  - varying size problem with fixed memory per processor
  - varying size problem with fixed work per processor
• What form of the algorithm is to be used as the number of processors is varied?
  - reproduce the sequential arithmetic exactly
  - adjust parameters to perform best on each given number of processors

Our charts include both overall fixed-size scaling and approximately fixed memory per processor (Gustafson) scaling

We always adjust the subdomain blocking parameter to match the number of processors, one subdomain per processor; this causes the number of iterations to vary

Notes on Efficiency, cont.

Effect of changing-strength preconditioner and effect of parallel overhead are often separated into algorithmic and implementation factors

• Customary definition of overall efficiency in going from q to p processors (p > q):

  $$\eta(p|q) = \frac{q \cdot T(q)}{p \cdot T(p)}$$

  where T(p) is the overall execution time on p processors (measured)
• Factor T(p) into I(p), the number of iterations, and C(p), the average cost per iteration.
• Algorithmic efficiency is measure of preconditioning quality (measured):

  $$\eta_{alg}(p|q) = \frac{I(q)}{I(p)}$$

• Implementation efficiency is remaining (inferred, not directly measurable) factor:

  $$\eta_{impl}(p|q) = \frac{q \cdot C(q)}{p \cdot C(p)}$$
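
Because T(p) = I(p) · C(p), the three efficiencies satisfy η = η_alg · η_impl, so the implementation factor can be inferred from measured times and iteration counts. A small sketch of that calculation; the `Run` struct and function names are hypothetical:

```c
#include <assert.h>

/* One measured run: processor count, iteration count I(p), time T(p) in s. */
typedef struct { int procs; int its; double exe; } Run;

/* eta(p|q) = q*T(q) / (p*T(p)) */
static double eta_overall(Run q, Run p)
{
    assert(p.procs > 0 && p.exe > 0.0);
    return (q.procs * q.exe) / (p.procs * p.exe);
}

/* eta_alg(p|q) = I(q) / I(p) */
static double eta_alg(Run q, Run p)
{
    assert(p.its > 0);
    return (double)q.its / (double)p.its;
}

/* With C(p) = T(p)/I(p), eta_impl = eta_overall / eta_alg. */
static double eta_impl(Run q, Run p)
{
    return eta_overall(q, p) / eta_alg(q, p);
}
```

Applied to the 16- and 128-processor Cray T3E runs (77 iterations in 2587.95 s and 82 iterations in 382.30 s), these functions give approximately 0.85, 0.94 and 0.90, in agreement with the last row of the T3E fixed-size table.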

Footnotes on Scalability Tables

• "its" represents the number of pseudo-transient Newton steps: one Newton step per timestep, with SER growth in timestep up to a CFL of 100,000, and with a maximum number (20) of Schwarz-preconditioned GMRES steps per Newton step with relative tolerance of $10^{-2}$
• Convergence defined as a relative reduction in the norm of the steady-state nonlinear residual by a factor of $10^{-10}$
• Convergence rate typically degrades slightly as number of processors is increased, due to introduction of concurrency in preconditioner; highly partition-dependent
• Implementation efficiency may improve slightly as processors are added, due to smaller working sets (better cache residency)
• Implementation efficiency ultimately degrades as communication-to-computation ratio increases

Our View of the "State-of-the-Art":
Architecture and Programming Environment

• Vector-awareness is out; cache-awareness is in; vector-awareness will return in subtle ways
• Except for Tera and installed vector base, all near-term large-scale computers will be based on commodity processors
• HPF and parallel compilers not yet up to performance
• Some useful parallel libraries, like PETSc
• Need for better memory bandwidth to harness the full capability of future (& current) chips

Our View of the "State-of-the-Art":
Algorithms

• Explicit time integration is solved problem, except for dynamic mesh adaptivity
• Implicit methods remain a major challenge:
  - Today's algorithms leave something to be desired in convergence rate
  - All good algorithms have global synchronization
• Data parallelism from domain decomposition is unquestionably the main source of locality-preserving concurrency, but good smoothers and preconditioners violate locality
• New forms of algorithmic latency tolerance must be found
• Exotic methods should be considered at ASCI scales

Our View of the "State-of-the-Art":
Application-Algorithm-Architecture Interaction

• Ripest remaining advances are interdisciplinary
• Application-Algorithm
  - Ordering, partitioning, and coarsening must adapt to coefficients (grid spacing and flow magnitude and direction)
  - Trade-offs between pseudo-time iteration, nonlinear iteration, linear iteration, and preconditioner iteration must be understood and exploited
• Algorithm-Architecture
  - Algorithmicists must think natively in parallel and avoid introducing unnecessary sequential constraints
  - Algorithmicists should inform their choices with what their machine is good at and what it is bad at

Conclusions

• Hierarchy of domain decomposition should follow distributed memory hierarchy for computational efficiency
  - blocking for registers
    gives a factor of 2 in performance for multicomponent systems
  - interlaced data structure for cache
    reduces execution time by more than a factor of 4
  - subdomains for processor memory
    migrates the sequential code to SPMD parallelism
• In addition to convergence rate and communication volume, working set size is another parameter to consider for "preferred" granularity of domain decomposition
• PETSc ported FUN3D gives nice scalability results (parallel efficiency ranges from 75% to 85%) on two platforms - IBM SP and Cray T3E
• Library (PETSc) based solver is competitive with the legacy solver
  - outperforms by a factor of 9 even in serial mode
  - percentage difference in memory reduces with problem size

Reference URLs

• FUN3D
  http://fmad-www.larc.nasa.gov/~wanderso/Fun/fun.html
• PETSc
  http://www.mcs.anl.gov/petsc/petsc.html
• Pointers and related papers
  http://www.cs.odu.edu/~keyes/keyes.html