Interaction of Architecture and Algorithm
in the Domain-based Parallelization of an
Unstructured Grid Incompressible Flow Code

Dinesh K. Kaushik & David E. Keyes
CS Department, Old Dominion University &
ICASE, NASA Langley Research Center

Barry F. Smith
MCS Division, Argonne National Laboratory
Organization of Presentation

- Issues for unstructured grid domain decomposition methods
- Background of FUN3D
- Background of PETSc
- Illustrations of general porting issues
- Summary of serial and parallel performance
- Conclusions
Solving Unstructured Grid Problems in Parallel: Main Issues

- SPMD parallelization of unstructured grid solvers is complicated by the fact that no two interprocessor data dependency patterns are alike
- The user-provided global ordering may be incompatible with the subdomain-contiguous ordering required for high performance and convenient SPMD coding
- Loss of regularity in unstructured grid solvers makes them more memory- and integer-op-intensive; nevertheless, a library-based solver should be competitive in serial with a legacy solver in terms of memory and execution time
Implications of the Memory Hierarchy on Computational Efficiency

- Storage/use patterns should follow the memory hierarchy
  - Blocks for Registers: block storage format for multicomponent systems saves CPU cycles
  - Interlaced Data Structures for Cache: choose u1, v1, w1, p1, u2, v2, w2, p2, ... in place of u1, u2, ..., v1, v2, ..., w1, w2, ..., p1, p2, ...
  - Subdomains for Distributed Memory: "chunky" domain decomposition for optimal surface-to-volume (communication-to-computation) ratio
- This hierarchy is concerned with different issues than the algorithmic efficiency issues associated with hierarchies of grids
Optimal Granularity of Decomposition

For cache-based microprocessors, granularity is determined by three forces:

- Convergence Rate: usually deteriorates with increased granularity
- Communication Volume: increases with increased granularity
- Size of Local Working Set: fits better into successively smaller cache levels with increased granularity
Description of the Legacy Code - FUN3D

- FUN3D is a tetrahedral vertex-centered unstructured grid code developed by W. K. Anderson (LaRC) for compressible and incompressible Euler and Navier-Stokes equations
- Parallel experience is with incompressible Euler so far, but nothing in the algorithms or software changes for the other cases; only convergence rate will vary with conditioning, as determined by Mach and Reynolds numbers (and mesh)
- FUN3D uses 1st- or 2nd-order Roe for convection and Galerkin for diffusion, and false timestepping with backward Euler for nonlinear continuation towards steady state
- Solver is Newton-Krylov-Schwarz; timestep is advanced towards infinity by the switched evolution/relaxation (SER) heuristic of Van Leer & Mulder
PETSc: a Portable Extensible Toolkit for Scientific Computing

- Gives relatively high-level expression to preconditioned iterative linear solvers and Newton iterative methods
- Supports complex arithmetic
- Ports wherever MPI ports; committed to progressive MPI tuning
- Permits great flexibility (through object-oriented philosophy) for algorithmic innovation
- Freely available
- Callable from FORTRAN 77, C, and C++; written in C
- Includes diagnostic, monitoring, and visualization GUIs
The PETSc Philosophy

- Library approach: compiler can't do all; users shouldn't do all more than once
- Distributed data structures as fundamental objects: index sets, vectors, and matrices (grid functions coming)
- Iterative linear and nonlinear solvers, combinable modularly and recursively, and extensible
- Portable
- Uniform Application Programmer Interface (API)
- Multi-layered entry
- Message-passing detail suppressed
Conversion of Legacy FUN3D into PETSc/MPI Version

- Project begun 10/96, completed 3/97, undergoing continual enhancement
- Five-month (part-time) effort included:
  - learning FUN3D and the PUNS3D mesh preprocessor
  - learning the MeTiS partitioner
  - adding and testing new functionality in PETSc
  - restructuring FUN3D from vector to cache orientation
- Approximately 3,300 of 14,400 F77 lines of FUN3D retained (primarily as "node code" for flux and Jacobian evaluations); PETSc solvers used for the rest
- Effort has not yet included:
  - Parallel I/O and post-processing
- Next unstructured mesh code port should require significantly less time
Solving Unstructured Grid Problems in Parallel:
Basic Outline of the Solution Strategy

- Follow the "owner computes" rule under the dual constraints of minimizing the number of messages and overlapping communication with computation
- Each processor "ghosts" its stencil dependences in its neighbors
- Ghost nodes ordered after contiguous owned nodes
- Domain mapped from (user) global ordering into local orderings
- Scatter/gather operations created between local sequential vectors and global distributed vectors, based on runtime connectivity patterns
- Newton-Krylov-Schwarz operations translated into local tasks and communication tasks (nonblocking for overlap where hardware supports)
Three Different Orderings

[Figure: a 16-vertex model grid shown in four panels - the Application Ordering (the user's global numbering, 0-15), the subdomain-contiguous PETSc Ordering, and the Local Orderings for Processor 0 and Processor 1 (local numbers 0-11, with ghost nodes numbered after the owned nodes).]
Scattering Between the Orderings

- After establishing the different orderings, establish the "scatter" between the global and local vectors in the following way:

    ISCreateStride(MPI_COMM_SELF, bs*nvertices, 0, 1, &islocal);
    ISCreateBlock(MPI_COMM_SELF, bs, nvertices, svertices, &isglobal);
    VecScatterCreate(x, isglobal, user.localX, islocal, &user.scatter);

- Next, before using the local vector in any subroutine, carry out the scatter operation:

    VecScatterBegin(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);
    VecScatterEnd(X, localX, INSERT_VALUES, SCATTER_FORWARD, scatter);
Sample Serial Performance Comparison: PETSc vs. Legacy Code

For both codes:

- same optimization level (-O3) was used
- same timer was used
- time measurement started after reading all the input files
- no output was written during timing measurements
- platform used was the IBM SP at Argonne, with enough memory to avoid page faults after loading
            Execution (s)         Memory (MB)
vertices    original    PETSc     original    PETSc
2,800       122.71      27.88     10.22       12.08
22,700      2905.30     381.09    74.74       83.67

Percentage difference in memory requirement reduces with problem size
Sample Memory Conservation Techniques and Successive Effects in FUN3D Porting History

- Precisely sized preallocation of sparse matrix objects (77 -> 47 MB of RAM)
- Pruning of legacy code solver data structures (47 -> 34 MB of RAM)
- In-place factorization of preconditioner (34 -> 21 MB of RAM)
- Moving "MatSetValues" calls into legacy subroutines (21 -> 16 MB of RAM)
- Making partitioning stage scalable (16 -> 12 MB of RAM)
- Size of legacy code on same problem: 10 MB
- Size of parallel single-node code on same problem: 12 MB
Summary of Parallel Performance on Cray T3E and IBM SP

- 1.4 million degree-of-freedom problem converged to machine precision in approximately 6 minutes with approximately 1600 flux balance operations (work units) on 128 processors of a T3E or 80 processors of an SP
- Relative efficiencies of 75% to 85% over this range
- Algorithmic efficiency (ratio of iteration count of less decomposed grid to more decomposed grid, using the "best" algorithm for each processor granularity) is in excess of 90% over this range; iteration count is only weakly dependent upon granularity
- Implementation efficiency (ratio of the cost per vertex per iteration) is in excess of 80% over this range and can be superunitary
- Superunitary implementation efficiency derives from improved cache locality at higher granularity (smaller working sets on each processor), in spite of greater nearest-neighbor communication volume
- Properly sizing working set to cache largely overcomes convergence and communication penalties of concurrency
Cray T3E Scalability - Fixed Size
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs   its   exe         speedup   eta_alg   eta_impl   eta_overall
16      77    2587.95 s   1.00      1.00      1.00       1.00
24      78    1792.34 s   1.44      0.99      0.97       0.96
32      75    1262.01 s   2.05      1.03      1.00       1.03
40      75    1043.55 s   2.48      1.03      0.97       0.99
48      76     885.91 s   2.92      1.01      0.96       0.97
64      75     662.06 s   3.91      1.03      0.95       0.98
80      78     559.93 s   4.62      0.99      0.94       0.92
96      79     491.40 s   5.27      0.97      0.90       0.88
128     82     382.30 s   6.77      0.94      0.90       0.85

85% relative efficiency at 128 nodes
IBM SP Scalability - Fixed Size
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid of 357,900 vertices (1,431,600 unknowns)

procs   its   exe         speedup   eta_alg   eta_impl   eta_overall
8       70    2897.46 s   1.00      1.00      1.00       1.00
10      73    2405.66 s   1.20      0.96      1.00       0.96
16      78    1670.67 s   1.73      0.90      0.97       0.87
20      73    1233.06 s   2.35      0.96      0.98       0.94
32      74     797.46 s   3.63      0.95      0.96       0.91
40      75     672.90 s   4.31      0.93      0.92       0.86
48      75     569.94 s   5.08      0.93      0.91       0.85
64      74     437.72 s   6.62      0.95      0.87       0.83
80      77     386.83 s   7.49      0.91      0.82       0.75

75% relative efficiency at 80 nodes
Cray T3E Scalability - Gustafson
FUN3D-PETSc M6 Wing Test Case, Incompressible Euler
2nd-order Roe Scheme, 1-layer Halo
Tetrahedral grid

vert      procs   vert/proc   its   exe        exe/it
357,900   80      4474        78    559.93 s   7.18 s
53,961    12      4497        36    265.72 s   7.38 s
9,428     2       4714        19    131.07 s   6.89 s

Less than 7% variation in performance over a factor of nearly 40 in problem size
Notes on Efficiency

Conflicting definitions of parallel efficiency abound, depending upon two choices:

- What scaling is to be used as the number of processors is varied?
  - overall fixed-size problem
  - varying-size problem with fixed memory per processor
  - varying-size problem with fixed work per processor
- What form of the algorithm is to be used as the number of processors is varied?
  - reproduce the sequential arithmetic exactly
  - adjust parameters to perform best on each given number of processors

Our charts include both overall fixed-size scaling and approximately fixed memory per processor (Gustafson) scaling.

We always adjust the subdomain blocking parameter to match the number of processors, one subdomain per processor; this causes the number of iterations to vary.
Notes on Efficiency, cont.

Effect of changing-strength preconditioner and effect of parallel overhead are often separated into algorithmic and implementation factors.

- Customary definition of overall efficiency in going from q to p processors (p > q):

    eta(p|q) = (q * T(q)) / (p * T(p))

  where T(p) is the overall execution time on p processors (measured)
- Factor T(p) into I(p), the number of iterations, and C(p), the average cost per iteration
- Algorithmic efficiency is a measure of preconditioning quality (measured):

    eta_alg(p|q) = I(q) / I(p)

- Implementation efficiency is the remaining (inferred, not directly measurable) factor:

    eta_impl(p|q) = (q * C(q)) / (p * C(p))
Footnotes on Scalability Tables

- "its" represents the number of pseudo-transient Newton steps: one Newton step per timestep, with SER growth in timestep up to a CFL of 100,000, and with a maximum number (20) of Schwarz-preconditioned GMRES steps per Newton step with relative tolerance of 10^-2
- Convergence defined as a relative reduction in the norm of the steady-state nonlinear residual by a factor of 10^-10
- Convergence rate typically degrades slightly as number of processors is increased, due to introduction of concurrency in preconditioner; highly partition-dependent
- Implementation efficiency may improve slightly as processors are added, due to smaller working sets (better cache residency)
- Implementation efficiency ultimately degrades as communication-to-computation ratio increases
Our View of the "State-of-the-Art": Architecture and Programming Environment

- Vector-awareness is out; cache-awareness is in; vector-awareness will return in subtle ways
- Except for Tera and the installed vector base, all near-term large-scale computers will be based on commodity processors
- HPF and parallel compilers not yet up to performance
- Some useful parallel libraries, like PETSc
- Need for better memory bandwidth to harness the full capability of current and future chips
Our View of the "State-of-the-Art": Algorithms

- Explicit time integration is a solved problem, except for dynamic mesh adaptivity
- Implicit methods remain a major challenge:
  - Today's algorithms leave something to be desired in convergence rate
  - All good algorithms have global synchronization
- Data parallelism from domain decomposition is unquestionably the main source of locality-preserving concurrency, but good smoothers and preconditioners violate locality
- New forms of algorithmic latency tolerance must be found
- Exotic methods should be considered at ASCI scales
Our View of the "State-of-the-Art": Application-Algorithm-Architecture Interaction

- Ripest remaining advances are interdisciplinary
- Application-Algorithm
  - Ordering, partitioning, and coarsening must adapt to coefficients (grid spacing and flow magnitude and direction)
  - Trade-offs between pseudo-time iteration, nonlinear iteration, linear iteration, and preconditioner iteration must be understood and exploited
- Algorithm-Architecture
  - Algorithmicists must think natively in parallel and avoid introducing unnecessary sequential constraints
  - Algorithmicists should inform their choices with what their machine is good at and what it is bad at
Conclusions

- Hierarchy of domain decomposition should follow distributed memory hierarchy for computational efficiency
  - blocking for registers gives a factor of 2 in performance for multicomponent systems
  - interlaced data structure for cache reduces execution time by more than a factor of 4
  - subdomains for processor memory migrate the sequential code to SPMD parallelism
- In addition to convergence rate and communication volume, working set size is another parameter to consider for the "preferred" granularity of domain decomposition
- PETSc-ported FUN3D gives nice scalability results (parallel efficiency ranges from 75% to 85%) on two platforms, IBM SP and Cray T3E
- Library (PETSc)-based solver is competitive with the legacy solver; outperforms it by a factor of 9 even in serial mode; percentage difference in memory reduces with problem size
Reference URLs

- FUN3D: http://fmad-www.larc.nasa.gov/~wanderso/Fun/fun.html
- PETSc: http://www.mcs.anl.gov/petsc/petsc.html
- Pointers and related papers: http://www.cs.odu.edu/~keyes/keyes.html