Research on the Path to Exascale at NERSC
Alice E. Koniges
Sustainable Research Pathways Workshop, December 8, 2015
• National Energy Research Scientific Computing Center
  – Established 1974, first unclassified supercomputer center
  – Original mission: to enable computational science as a complement to magnetically controlled plasma experiments
• Today’s mission: Accelerate scientific discovery at the DOE Office of Science through high performance computing and extreme data analysis
• A national user facility
NERSC
Trajectory of an energetic ion in a Field Reversed Configuration (FRC) magnetic field. Magnetic separatrix denoted by green surface. Spheres are colored by azimuthal velocity. Image courtesy of Charlson Kim, U. of Washington; NERSC repos m487, mp21, m1552
• Diverse workload:
  – 4,500 users, 700+ projects
  – 700 codes; 100s of users daily
• Allocations controlled primarily by DOE
  – 80% DOE Annual Production awards (ERCAP):
    • From 10K to ~10M hours
    • Proposal-based; DOE chooses
  – 10% DOE ASCR Leadership Computing Challenge
  – 10% NERSC reserve
    • NISE, NESAP
NERSC: Mission Science Computing for the DOE Office of Science
Collision between two shells of matter ejected in two supernova eruptions, showing a slice through a corner of the event. Colors represent gas density (red is highest, dark blue is lowest). Image courtesy of Ke-Jung Chen, School of Physics and Astronomy, Univ. Minnesota. Repo m1400
DOE View of Workload
ASCR: Advanced Scientific Computing Research
BER: Biological & Environmental Research
BES: Basic Energy Sciences
FES: Fusion Energy Sciences
HEP: High Energy Physics
NP: Nuclear Physics
NERSC Allocations by DOE Office: ASCR 6%, BER 18%, BES 31%, FES 20%, HEP 13%, NP 12%
NERSC Roadmap
(peak Teraflop/s by year, 2006-2020)
• Franklin (N5) + QC: 36 TF sustained, 352 TF peak
• Hopper (N6): 1.25 PF peak
• Edison (N7): 2.4 PF peak
• Cori (N8): 10-30x Hopper
• N9: 200-500 PF peak
• N10: ~1 EF peak
• CRT Facility
The Big Picture: KNL is Coming to NERSC
• The next large NERSC production system, "Cori", will be the Intel Xeon Phi KNL (Knights Landing) architecture
  – Energy-efficient manycore system
  – >9,300 single-socket nodes, multiple NUMA domains
  – Self-hosted, not an accelerator
  – 72 cores/node, 4 hardware threads per core; total of 288 threads per node
  – AVX-512, larger vector length of 512 bits (8 double-precision elements)
  – On-package high-bandwidth memory (HBM)
  – Burst Buffer
Edison / Cori Quick Comparison

Edison (Ivy Bridge), NERSC Cray XC30 system:
• 12 cores per CPU
• 24 logical cores per CPU
• 2.4-3.2 GHz
• Vector length of 256 bits, 4 double-precision ops per cycle (+ multiply/add)
• 2.5 GB of memory per core
• ~100 GB/s memory bandwidth

Cori (Knights Landing):
• 72 physical cores per CPU
• 288 logical cores per CPU
• Much slower clock rate (GHz)
• Vector length of 512 bits, 8 double-precision ops per cycle (+ multiply/add)
• <0.3 GB of fast memory per core
• <2 GB of slow memory per core
• Fast memory has ~5x DDR4 bandwidth
• Burst Buffer for fast I/O
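One way to place bandwidth-critical data in the fast on-package memory is the memkind library's hbwmalloc interface; the sketch below is not from the slides and assumes memkind is installed, but the calls shown (hbw_check_available, hbw_malloc, hbw_free) are the library's documented entry points.

#include <hbwmalloc.h>   // memkind's high-bandwidth-memory (MCDRAM) allocator
#include <cstdio>
#include <cstdlib>

int main() {
    const std::size_t n = 1 << 24;                    // one bandwidth-critical array
    const bool have_hbm = (hbw_check_available() == 0);

    // Place the hot array in fast memory when available, otherwise fall back to DDR.
    double* a = have_hbm
        ? static_cast<double*>(hbw_malloc(n * sizeof(double)))
        : static_cast<double*>(std::malloc(n * sizeof(double)));
    if (!a) return 1;

    for (std::size_t i = 0; i < n; ++i) a[i] = 1.0;   // streaming traffic is what benefits from the ~5x DDR bandwidth
    std::printf("first element: %f\n", a[0]);

    if (have_hbm) hbw_free(a); else std::free(a);
    return 0;
}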
NESAP codes by DOE office:
• Advanced Scientific Computing Research: BoxLib AMR Framework, Chombo-Crunch
• High Energy Physics: WARP & IMPACT, MILC, HACC
• Nuclear Physics: MFDn, Chroma, DWF/HISQ
• Basic Energy Sciences: Quantum ESPRESSO, BerkeleyGW, PARSEC, NWChem, EMGeo
• Biological and Environmental Research: Gromacs, Meraculous, MPAS-O, ACME, CESM
• Fusion Energy Sciences: M3D, XGC1
NESAP: NERSC Exascale Science Application Program
Jack Deslippe, Rebecca Hartman-Baker
Programming Considerations: Running on Cori
• Applications are very likely to run on KNL with simple porting, but high performance is harder to achieve.
• Applications need to explore more on-node parallelism with thread scaling and vectorization, and also to utilize the HBM and Burst Buffer options.
• Many applications will not fit into the memory of a KNL node using pure MPI across all HW cores and threads, because of the memory overhead of each MPI task.
• Hybrid MPI/OpenMP is the recommended programming model, to achieve scaling capability and code portability (a minimal sketch follows below).
• Current NERSC systems (Edison/Hopper and Babbage) can help prepare codes for Cori.
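To make the recommended model concrete, here is a minimal hybrid MPI/OpenMP sketch in C++ (illustrative only, not from the slides): a few MPI ranks per node, each using OpenMP threads for on-node parallelism, so the per-task MPI memory overhead is paid once per rank rather than once per core.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // Request thread support so OpenMP threads can coexist with MPI calls made by the master thread.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;
    // On-node parallelism: threads share the rank's memory instead of adding more MPI tasks.
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; ++i)
        local_sum += 1.0 / (1.0 + i);

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("ranks=%d threads/rank=%d sum=%f\n", nranks, omp_get_max_threads(), global_sum);
    MPI_Finalize();
    return 0;
}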
Lots of cores with in-package memory
Source: Avinash Sodani, Hot Chips 2015 KNL talk

Connecting tiles
Source: Avinash Sodani, Hot Chips 2015 KNL talk
To run effectively on Cori users will have to:
• Manage domain parallelism
  – independent program units; explicit
• Increase node parallelism
  – independent execution units within the program; generally explicit
• Exploit data parallelism
  – same operation on multiple elements
• Improve data locality
  – cache blocking; use on-package memory (a combined sketch of the on-node levels follows after the figure below)
[Figure: domain parallelism across MPI ranks owning x, y, z subdomains; node parallelism from threads within each rank; data parallelism illustrated by a vectorizable loop:]

   DO I = 1, N
      R(I) = B(I) + A(I)
   ENDDO
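A small illustrative C++ sketch (not from the slides) of the on-node levels: OpenMP threads supply node parallelism, the inner simd loop exposes data parallelism to the 512-bit vector units, and tiling over BS x BS blocks is the cache blocking that improves data locality. The kernel and tile size are arbitrary choices for illustration.

#include <omp.h>
#include <algorithm>
#include <vector>
#include <cstdio>

// Cache-blocked, threaded, vectorized transpose: B = A^T for an n x n matrix.
void transpose_blocked(const std::vector<double>& A, std::vector<double>& B, int n) {
    const int BS = 64;  // tile size chosen so a pair of tiles stays cache-resident
    #pragma omp parallel for collapse(2) schedule(static)  // node parallelism: OpenMP threads over tiles
    for (int ii = 0; ii < n; ii += BS) {
        for (int jj = 0; jj < n; jj += BS) {
            for (int i = ii; i < std::min(ii + BS, n); ++i) {
                #pragma omp simd                            // data parallelism: vector units
                for (int j = jj; j < std::min(jj + BS, n); ++j) {
                    B[j * n + i] = A[i * n + j];            // data locality: work stays inside the tile
                }
            }
        }
    }
}

int main() {
    const int n = 2048;
    std::vector<double> A(n * n, 1.0), B(n * n, 0.0);
    transpose_blocked(A, B, n);
    std::printf("B[0] = %f\n", B[0]);
    return 0;
}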
CASE STUDY: XGC1 PIC Fusion Code
• Particle-in-cell code used to study turbulent transport in magnetic confinement fusion plasmas.
• Uses a fixed unstructured grid. Hybrid MPI/OpenMP for both spatial grid and particle data (plus PGI CUDA Fortran, OpenACC).
• Excellent overall MPI scalability.
• Internal profiling timer borrowed from CESM.
• Uses PETSc Poisson solver (separate NESAP effort).
• 60k+ lines of Fortran 90 code.
• For each time step (see the schematic sketch below):
  – Deposit charges on the grid
  – Solve an elliptic equation to obtain the electromagnetic potential
  – Push particles to follow trajectories using forces computed from the background potential (~50-70% of time)
  – Account for collision and boundary effects on the velocity grid
• Most time is spent in Particle Push and Charge Deposition.
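A schematic of that per-step structure, for orientation only; XGC1 itself is Fortran 90, and the kernel names and C++ stubs below are hypothetical stand-ins for the real routines.

#include <vector>
#include <cstdio>

struct Particle { double x[3]; double v[3]; double weight; };
using Grid = std::vector<double>;

// Stub kernels; each comment states what the corresponding XGC1 phase does.
void deposit_charge(const std::vector<Particle>&, Grid&) { /* scatter particle charge onto the grid */ }
void solve_potential(const Grid&, Grid&) { /* elliptic (Poisson) solve; PETSc in XGC1 */ }
void push_particles(std::vector<Particle>&, const Grid&, double) { /* trajectory integration; ~50-70% of runtime */ }
void collisions_and_boundaries(std::vector<Particle>&) { /* collision operator on velocity grid, boundary effects */ }

int main() {
    std::vector<Particle> particles(1000);
    Grid rho(4096), phi(4096);
    const double dt = 1.0e-7;

    for (int step = 0; step < 10; ++step) {
        deposit_charge(particles, rho);          // charge deposition (hot spot)
        solve_potential(rho, phi);               // field solve
        push_particles(particles, phi, dt);      // particle push (dominant hot spot)
        collisions_and_boundaries(particles);
    }
    std::printf("completed 10 steps\n");
    return 0;
}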
Unstructured triangular mesh grid due to complicated edge geometry
Sample matrix of communication volume
Programming Portability
• Currently XGC1 runs on many platforms
• Part of the NESAP and ORNL CAAR programs
• Applied for the ANL Theta program
• Previously used PGI CUDA Fortran for accelerators
• Exploring OpenMP 4.0 target directives and OpenACC
• Has #ifdef _OPENACC and #ifdef _OPENMP guards in the code (see the sketch below)
• Hope to have as few compiler-dependent directives as possible
• Nested OpenMP is used
• Needs thread-safe PSPLIB and PETSc libraries
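A minimal sketch of guarding directives with the standard feature-test macros (_OPENACC is defined by OpenACC compilers, _OPENMP by OpenMP compilers). XGC1 does this in Fortran; the C++ analogue below is illustrative only.

#include <cstdio>

void scale(double* a, int n, double s) {
#if defined(_OPENACC)
    // OpenACC build: offload the loop to an accelerator
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i) a[i] *= s;
#elif defined(_OPENMP)
    // OpenMP build: thread the loop on the host (e.g. KNL)
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) a[i] *= s;
#else
    // Serial fallback keeps the code compilable everywhere
    for (int i = 0; i < n; ++i) a[i] *= s;
#endif
}

int main() {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    scale(a, 8, 2.0);
    std::printf("a[7] = %f\n", a[7]);
    return 0;
}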
New algorithms should work in concert with new exascale operating systems: ParalleX Execution Model
• Lightweight multi-threading
  – Divides work into smaller tasks
  – Increases concurrency
• Message-driven computation
  – Move work to data
  – Keeps work local, stops blocking
• Constraint-based synchronization
  – Declarative criteria for work
  – Event driven
  – Eliminates global barriers
• Data-directed execution
  – Merger of flow control & data structure
• Shared name space
  – Global address space
  – Simplifies random gathers
Thomas Sterling et al., IU and XPRESS
HPX and Related Application Development
– Explore an app development alternative to "traditional MPI+X".
– Question: can a qualitatively different approach (ParalleX-based):
  • Exploit untapped and new parallelism?
  • Improve expressibility?
  • Improve productivity?
  • Get us to exascale and beyond?
– Broad sampling of app domains & algorithms:
  • Plasma physics, many-body & particle-in-cell (PIC)
  • Nuclear engineering & finite volume / eigensolvers
  • Shock physics & finite element / explicit time integration
  • Computational mechanics & implicit sparse solvers
– Full team effort involving app designers, the XPRESS team, HPX and ParalleX developers, and compiler and tools developers
Application Perspective
• Use of futures (see the sketch below):
  – Exploit previously inaccessible, fine-grain dynamic parallelism.
  – A natural framework for expressing data-driven parallelism.
• Better than MPI:
  – Go beyond a functional mimic of MPI.
  – Really use the Active Global Address Space (AGAS).
  – Take advantage of fine-grained parallelism using a generalized concept of threads.
• Overarching application team goal:
  – Demonstrate that ParalleX-based approaches work.
  – Superior to MPI+X in one or more metrics:
    • Performance: extracting latent parallelism.
    • Portability: performance obtained from the system's underlying runtime.
    • Productivity: easier to write, understand, maintain.
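A minimal sketch of the futures style in HPX (illustrative, not from the slides; hpx::async, hpx::future, and hpx::when_all are the library's standard-conforming API, but header paths vary across HPX releases, and this assumes the hpx_main convention in which main() runs as an HPX thread).

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <tuple>
#include <cstdio>

double flux(double left, double right) { return 0.5 * (left + right); }

int main() {
    // Two independent quanta of work run as lightweight HPX tasks.
    hpx::future<double> fa = hpx::async([] { return 1.0; });
    hpx::future<double> fb = hpx::async([] { return 2.0; });

    // The continuation fires only when both inputs are ready: synchronization
    // is expressed as a data dependence rather than a global barrier.
    hpx::future<double> fc =
        hpx::when_all(fa, fb).then([](auto both) {
            auto ready = both.get();   // tuple of ready futures
            return flux(std::get<0>(ready).get(), std::get<1>(ready).get());
        });

    std::printf("flux = %f\n", fc.get());
    return 0;
}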
The Ideal HPX Model from an Application Perspective
• Functionalize: figure out what a quantum of work is
• Determine data dependencies
• Create a dataflow structure (see the sketch below)
• Feed it into the data task manager
• Applications
  – Sweet spot between grain sizes for the various task granularities.
  – Strong scaling improves as you add extra levels of refinement, the opposite of what you see with MPI, giving more usable work to the simulation.
  – Lots of new results at SC15.
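As an illustration of that workflow (an assumption here is HPX's dataflow facility, spelled hpx::lcos::local::dataflow in older releases), the sketch below functionalizes two quanta of work and lets the runtime invoke the dependent step as soon as both inputs are ready.

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <utility>
#include <cstdio>

double step_a() { return 3.0; }   // one functionalized quantum of work
double step_b() { return 4.0; }   // another, independent of step_a

// The dependent step receives its inputs as (ready) futures.
double combine(hpx::future<double> a, hpx::future<double> b) {
    return a.get() * b.get();
}

int main() {
    auto fa = hpx::async(step_a);
    auto fb = hpx::async(step_b);

    // dataflow defers combine until both futures are ready; the task manager
    // schedules it, so no explicit synchronization is written by the application.
    hpx::future<double> fc = hpx::dataflow(&combine, std::move(fa), std::move(fb));

    std::printf("result = %f\n", fc.get());
    return 0;
}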
NERSC is a great place for research on applications and programming models
• Access to the latest hardware
• Application experts in a variety of domains
• NESAP program with post-docs and staff
• Participates in OpenMP and MPI language standards
• Cross-DOE projects on next-generation runtimes and programming models