Upload
ngominh
View
430
Download
13
Embed Size (px)
Citation preview
Zhengji Zhao!NERSC User Engagement Group
Using VASP at NERSC
June10,2016
Agenda
• VASPaccessatNERSC• GetstartedwithavailableVASPbuildsatNERSC• Commonproblemsusersruninto• GoodpracCces• CompilingVASPonNERSCsystems
-2-
VASP Access at NERSC
-3-
VASP is available to the users who have the license by themselves
• NeedtoconfirmyourlicensewithVASPdevelopers– Sendyourlicenseinfo(licensenumber,PI’sname,etc)[email protected]:[email protected]
– Toavoidunnecessarydelay,makesureyourarearegistereduserunderyourPI’sVASPlicense
– Thisisamanualprocess.Itmaytakeacoupleofdaysnormally,someGmetakeslonger,e.g.,whentheVASPsupportstaffisonvacaGon.
– OnceyourlicenseisconfirmedbytheVASPsupportatVienna,NERSCgivesyoutheaccesstotheVASPbinariesonNERSCmachines.Theaccessiscontrolledbyaunixfilegroup,vasp5,orvasp(forVASP4).TypegroupscommandtoseeifyouhavetheaccesstotheVASPbinariesatNERSC.
-4-
hRps://www.nersc.gov/users/soTware/applicaGons/materials-science/vasp/#toc-anchor-2
Get Started with Available VASP Builds
-5-
How many vasp builds are available at NERSC?
-6-
zz217@cori01:~>moduleavailvasp----------------------/usr/common/soTware/modulefiles-------------------------------vasp/5.3.5vasp/5.3.5-ccevasp/5.3.5_vtstvasp/5.3.5_vtst-ccevasp/5.4.1(default)
zz217@edison01:~>moduleavailvasp
-----------------------/usr/common/usg/Modules/modulefiles-----------------------vasp/4.6.35_vtstvasp/5.3.5vasp/5.3.5_vtst-ccevasp/5.4.1vasp/5.3.2vasp/5.3.5-cce(default)vasp/5.3.2_vtstvasp/5.3.5_vtst
ModulefilenamingconvenCon:vasp/<version><_vtst><-compiler>- WheretheversionistheofficialVASPreleaseversion;the-compileristhe
compilernamethatwasusedtobuildthecode,whenomiRed,theIntelandCraycompilerswereusedforEdisonandHopper,respecGvely;andthe_vtstdenotesthebuildswiththethirdpartycontributedcodes,VTST,Wannier90,etc.
- OnEdison,vasp/5.3.5-cceisabuildfortheofficialVASPrelease,compiledwithaCraycompiler;vasp/5.3.5_vtst-ccebuiltwithaCraycompiler,enabledVTST(U.Texas,AusGn)andWannier90.
How many VASP builds are available at NERSC? -cont
• Therearedifferentcompilerbuilds– ToprovidealternaGvesbecausesomeproblemsmayoccurwithone
compilerbuildbutnotwithanother– OnEdison,IntelcompilerbuildsrunfasterthantheCraycompiler
builds;however,mayrunintoLAPACKerrorsmoreoTen.SothedefaultVASPmoduleonEdisonwasbuiltwithaCraycompiler.
• Thedefaultmoduleisrecommended• Toaccess,do:moduleloadvasp• Formoreinfo,do:moduleshowvasp/<versionstring>
-7-
What do the modulefiles do?
-8-
Edison01>moduleshowvasp-------------------------------------------------------------------/usr/common/usg/Modules/modulefiles/vasp/5.3.5-cce:module-whaGs VASP:ViennaAb-iniGoSimulaGonPackageAccesstothevaspsuiteisallowedonlyforresearchgroupswithexisGnglicensesforVASP.IfyouhaveaVASPlicensepleaseemailvasp.Materialphysik@univie.ac.atandCC:vasp_licensing@nersc.govwiththeinformaGononwhichresearchgroupyourlicensederivesfrom.ThePIofthegroupaswellastheinsGtuGonandlicensenumberwillhelpspeedtheprocess.setenv PSEUDOPOTENTIAL_DIR/usr/common/usg/vasp/pseudopotenGals/5.3.5setenv VDW_KERNAL_DIR/usr/common/usg/vasp/vdw_kernalsetenv NO_STOP_MESSAGE1setenv MPICH_NO_BUFFER_ALIAS_CHECK1prepend-path PATH/usr/common/usg/vasp/vtstscripts/3.1prepend-path PATH/usr/common/usg/vasp/5.3.5-cce/bin-------------------------------------------------------------------
Where do VASP binaries reside?
-9-
zz217@edison01>ls-ld/usr/common/usg/vasp/5.3.5-cce/bindrwxr-x---+4zz217vasp5512Jan621:24/usr/common/usg/vasp/5.3.5-cce/binzz217@edison01>ls-l/usr/common/usg/vasp/5.3.5-cce/bintotal395660-rwxrwxr-x+1zz217usg69225563Jul312014gvasp-rwxrwxr-x+1zz217usg66153550Aug212014makeparam-rwxrwxr-x+1zz217usg69601938Aug222014vasp-rwxr-xr-x+1zz217usg69604098Jul312014vasp.NMAX_DEG=128-rwxrwxr-x+1zz217usg60887560Aug212014vasp.seriallrwxrwxrwx1zz217zz2175Jan621:24vasp_gam->gvasp-rwxrwxr-x+1zz217usg69676474Jul312014vasp_ncllrwxrwxrwx1zz217zz2174Jan621:24vasp_std->vasp
vasp–generalkpointVASPbuildgvasp–Gammapointonlybuildvasp_ncl–Non-collinearversionOtherbinariesarespecialbuildsperuserrequestNote,thelinkstothevasp_std,vasp_gamtobeconsistentwithvasp/5.4.1version
VASP builds with VTST and Wannier90 enabled
-10-
edison01>moduleshowvasp/5.3.5_vtst-cce-------------------------------------------------------------------/usr/common/usg/Modules/modulefiles/vasp/5.3.5_vtst-cce:
module-whaGs VASP:ViennaAb-iniGoSimulaGonPackage
Note:ThisbuildenabledtheVTST3.1code(U.Texas,AusGn)andWannier901.2.0.
AccesstothevaspsuiteisallowedonlyforresearchgroupswithexisGnglicensesforVASP.IfyouhaveaVASPlicensepleaseemail
[email protected]:[email protected]
withtheinformaGononwhichresearchgroupyourlicensederivesfrom.ThePIofthegroupaswellastheinsGtuGonandlicensenumberwillhelpspeedtheprocess.setenv PSEUDOPOTENTIAL_DIR/usr/common/usg/vasp/pseudopotenGals/5.3.5setenv VDW_KERNAL_DIR/usr/common/usg/vasp/vdw_kernalsetenv NO_STOP_MESSAGE1setenv MPICH_NO_BUFFER_ALIAS_CHECK1prepend-path PATH/usr/common/usg/vasp/vtstscripts/3.1prepend-path PATH/usr/common/usg/vasp/5.3.5_vtst-cce/bin-------------------------------------------------------------------
VASP builds with VTST and Wannier90 enabled
-11-
zz217@edison01>ls-l/usr/common/usg/vasp/5.3.5_vtst-cce/bintotal340624-rwxrwxr-x1zz217usg100514105Feb1916:21gvasp-rwxrwxr-x1zz217usg101455347Feb1916:21vasplrwxrwxrwx1zz217zz2175Feb1716:33vasp_gam->gvasp-rwxrwxr-x1zz217usg101515371Feb1916:21vasp_ncllrwxrwxrwx1zz217zz2174Feb1716:33vasp_std->vasp-rwxrwxr-x1zz217usg45308309May82015wannier90.x
Get started with VASP at NERSC
-12-
edison01>catrun.slurm#!/bin/bash-l#SBATCH-Jtest_vasp#SBATCH-pdebug#SBATCH-N2#SBATCH-t00:30:00#SBATCH–otest_vasp.o%jmoduleloadvaspsrun-n48vasp_stdedison01>sbatchrun.slurmSubmiRedbatchjob518354
hRps://www.nersc.gov/users/soTware/applicaGons/materials-science/vasp/
squeue–u<username>scancel<jobid>scontrolshow<jobid>sqs–u<username>sinfo–sscontrolhold<jobid>scontrolrelease<jobid>Scontrolupdatejob<jobid>TimeLimit=24:00:00Scontrolupdatejob<jobid>QOS=premiumFormoreinfo,readmanpages
edison01>salloc–N2–pdebug–t30:00…moduleloadvaspsrun-n48vasp_std
Runningvaspviaabatchjob
RunninginteracCvely
Commonlyusedcommands:
#SBATCH--mail-type=BEGIN,END,FAIL#SBATCH–A<yourrepo>
Common Problems Users Run into
-13-
VASP not found error
• NoaccesstoVASP–checkifyouhaveaccesstotheVASPbinariesatNERSC.Ifnot,confirmyourlicense.usgsw@edison01:~>groupsusgswoprofileusg#userusgswdoesnothaveaccesstoVASP
• Failedtoloadavaspmodule–– Youwillalsoseeanerrorinyourjobstandarderrorfile:module:commandnotfound
-14-
Running VASP jobs with too many cores
• VASPingeneralconsideredtoscaleupto1core/atom.Ifrunningoutsidetheparallelscalingregion,variouserrorsmayoccur
• internalERRORRSPHER:runningoutofbuffer– Thiserrorwasseenwhenusingtoomanycoresforasmallsystem,andwasalsoseenwhenasmallNPAR+LPLANEisused,sothatnumberofcoresperorbitalissimilarorevenlargerthanNGZ(VASPmanual:LPLANE=.TRUE.shouldonlybeusedifNGZisatleast3*(numberofcores)/NPAR).
– TherecouldbemulGplefixesincludingmodifyingthesourcecodetoallocatelargerworkarrays,butyoucaneliminatethiserrorbyreducingthetotalnumberofcoresorusingalargerNPAR(closertothedefaultvalue),orse{ngLPALNE=.FALSE.
-15-
Running with too small number of cores - Out of Memory error
• Errormessage:OOMkillerterminatedthisprocess– Edison2.6GB/core;24corespernode– Cori$GBGB/core;32corespernode
• Toavoid– Considertouse~1core/atom– GammapointcalculaGons,usegvaspinsteadofvasp– Usemorecoresand/orrunonunpackednodes,e.g.,use12core/node
-16-
How many cores to use? • UsinglargernumberofcoresmaynotreducetheCme
tosoluConcosteffecCvely– VASPmayspendmostoftheGmeinthecommunicaGonevenifitdoesnotrunintoerror
• 1Core/atomisagoodreference– ConservaGvely,youmaygowith0.5core/atomorslightlyless– UseNPAR~sqrt(totalnumberofcoresused)whereapplicable– Iftherearemanykpoints,use0.5core/atomforeachkpointgroup;thetotalnumberofcores=KPARx0.5core/atomx(#ofatoms)
– IftherearemulGpleimages,totalnumberofcores=IMAGESx0.5core/atomx(#ofatoms)
• ThiswillalsoeffecCvelyreducetheOutofMemory(OOM)error
-17-
Error when total number of cores is not divisible by NPAR
• M_divide:cannotsubdivide• NPARdefaultisthesameasthetotalnumberofcores– MaxNPAR=256– NPAR~sqrt(totalnumberofcores)performsbeRerthanNPAR=1orthedefaultNPAR(=totalnumberofcores)whenapplicable
-18-
Error due to aliasing buffers in MPI collective calls in VASP code
• FatalerrorinPMPI_Allgatherv:Invalidbufferpointer,errorstack:PMPI_Allgatherv(1235):MPI_Allgatherv(sbuf=0x9ed6000,scount=64,MPI_DOUBLE_COMPLEX,rbuf=0x9ed6000,rcounts=0xb27f7c0,displs=0xad81140,MPI_DOUBLE_COMPLEX,comm=0xc4000003)failedPMPI_Allgatherv(1183):Buffersmustnotbealiased.ConsiderusingMPI_IN_PLACEorsehngMPICH_NO_BUFFER_ALIAS_CHECK
– ThiserrorwasseenwithHSEjobsexportMPICH_NO_BUFFER_ALIAS_CHECK=1#forbash/shsetenvMPICH_NO_BUFFER_ALIAS_CHECK1#forcsh/tcsh
– InNERSCVASPmodules,wesetMPICH_NO_BUFFER_ALIAS_CHECK=1tobypassthebufferaliasingerrorcheckinginMPIcollecGvecalls.
– IfyourunyourownVASPbuildsmayneedtosetthisenv
-19-
Hung jobs – difficult to debug
• Filesystemissues–Lustre,globalhomes– Filesystemsmaysloworhangduetohardware/soTwareissues,excessiveusefromsomeusers,usersover$HOMEquota,etc.Jobsmayhangorrunslowerduetounder-performingfilesystems
• Srun–utodisabletheIObufferingofSlurm
-20-
LAPACK’s ZHEGV failure
• ErrorEDDDAV:CalltoZHEGVfailed.Returncode=25248– ZHEGVsolves(diagonalizes)thecomplexgeneralizedHermiGan
eigenproblemA*x=(lambda)*B*x,whereBmustbeposiGvedefinite(i.e.,itseigenvaluesareposiGve).
– ForsomesystemsVASPmaygenerateamatrixBthatisnotposiGvedefinite(thereasonforthisisnotwellunderstood)andZHEGVreturnsanerrorcode.--commentsprovidedbyOsniMarquesatLBNL.
• ThiserrorwasmorefrequentlyseenwithIntelbuildsonEdison
• SwitchingtoCraybuildsmayhelp• Edisondefaultvaspmodule,vasp/5.3.5-cce,wasbuilt
withaCraycompiler
-21-
Known issues with the Wannier90 enabled builds
• CraycompilerbuildsdonotworkwellwithWannierf90bothonEdisonandCori– lib-4095:UNRECOVERABLElibraryerrorAWRITEoperaGonisinvalidifthefileisposiGonedaTertheend-of-file.
-22-
Be aware that jobs may fail due to various system issues
• Systemhardwareandsomwarefailuresoccur– Filesystemsunder-performing– Filesystemsnotavailable,GPFSexpel– Batchsystemerror– UpgradesonOS,system,libraryandapplicaGonsoTware– Nodefailure– Networkfailure– Powerglitch
• Usersmisusethesystemmayaffectotherusers– Poundingthefilesystems,andmakethemunresponsiveaffecGngalluserjobsrunningoutofthesamefilesystems
– RunningI/Ointensiveparalleljobsoutof$HOME• [email protected]
– Helpustoresolvetheproblemfaster
-23-
Good practices
-24-
Which system to use? Goal:getmorecompuCngdonewithagivenallocaConwithinagivenCmeperiod• Charging:
– Edisonhasamachinechargefactor2,however,sincemostcodesruntwicefasteronEdison,theMPPchargingforthesamecorejobsonbothsystemsshouldbesimilar.VASPonEdisoncaneasilyoutperformHopperby2-3Cmes.
-25-
NERSC-6ApplicaConBenchmarks
ApplicaCon CAM GAMESS GTC IMPACT-T MAESTRO MILC PARATECConcurrency 240 1024 2048 1024 2048 8192 1024Streams/Core 2 2 2 2 1 1 1EdisonTime(s) 273.08 1,125.80 863.88 579.78 935.45 446.36 173.51HopperTime(s) 348 1389 1338 618 1901 921 353Speedup1) 2.5 2.5 3.1 2.1 2.0 2.1 2.01)Speedup=[Time(Hopper)/Time(Edison)]*Streams/Core
Which system to use - cont
-26-
HelenHeatNERSCprovidedthefigure
• Corihasachargefactorof2.5whileEdisonis2.AslongasanapplicaGonrunsmorethan25%fasteronCori,runningonCoriwouldbemoreMPPchargingefficient.
• VASPwouldbeabout30%fasteronCori(samecore),sobothsystemsshouldbesimilarinMPPcharge.
Which system to use? - cont
• Queuepolicy:– Edisonfavorslargerjobs,thesmallregularjobs(1-682nodes)havealowerpriorityonEdison.Corihasthesamepriorityforalljobs.
– Ingeneralshorterjobshavemorebackfillopportunity.MakesuretoavoidunnecessarylongwallGmerequests.
– Bundlingupmanysmallregularjobs,iftheydosimilarcomputaGons,togetahigherqueuepriorityandachargingdiscount.
• Youcanuseeithersystemtorunyourjobs
-27-
hRps://www.nersc.gov/users/computaGonal-systems/edison/running-jobs/queues-and-policies/hRps://www.nersc.gov/users/computaGonal-systems/cori/running-jobs/queues-and-policies/
Choose the right file systems to run jobs on
• Yourglobalhomes,$HOME,isnottherightfilesystemtorunyourVASPjobs– Youmayexceedyourhomequota,40GB,andcausesystemissues.– Slowdownyourselfandalsootherusers
• TheLustrefilesystems($SCRATCH)arerecommendedfilesystemstorunyourjobsbothforalargerstoragespaceandabererI/Operformance.– 10TBscratchquotaonEdison;100TBon/scratch3(forI/Ointensive
workloadonly)– 20TBscratchquotaonCori(/cscratch1)– Note:filesarenotaccessedformorethan8weeksonEdisonand12
weeksonCoriwillbepurged– Youcanbackupyourimportantfilestoyourprojectdirectory,/
project/projectdires/<yourrepo>,whichhas4TBquota,ortoHPSSaTeranalysis.
-28-
Short scaling tests are recommended to choose optimal core counts and NPAR
• ThedebugqueuehasahighpriorityonEdisonandCoriINCAR:
LWAVE=.FALSE.LCHARG=.FALSE.NELMDL=-1NELM=2NSW=1
hRp://cms.mpi.univie.ac.at/vasp/vasp/ParallelisaGon_NPAR_NCORE_LPLANE_KPAR_tag.html
• AlsododebugrunstoseeifyourjobscouldruntocompleConbeforesubmihngyourlongjobs– Makesurenomemoryandotherfailuresduringtherun
-29-
Ask questions and send problem reports to NERSC consultants
• Email:[email protected]• Phone:1-800-666-3772menuopCon3.• Wemaynotsolveallyourproblems,butwemaybeabletohelpyoutofindworkarounds
-30-
Compiling VASP on NERSC systems
-31-
Where to find the makefiles • Weprovidemakefilesfortheuserswhowanttocompilethecodesby
themselves–opentoeveryone,donotneedtoconfirmyourlicensetoaccessthemakefiles
• ThemakefilesareavailableatthevaspinstallaCondirectories– Instdir=/usr/common/usgonEdison;Instdir=/usr/common/soTwareonCori– VASP5.3.5:$instdir/vasp/<version-string>/makefiles– VASP5.4.1:themakefile.includeisavailableat$instdir/vasp/5.4.1.– Trythebuildscript,build.shforVASP5.3.5,orrunmakeaTergetthemake.include
file– Tocompilerotherolderversions(<5.3.5)youmayjustneedtoreplacetheSOURCE
variableinthemakefilewiththelistofsourcefilesthatmatchyourVASPversion(getthesourcelistfromthemakefilesthatweredistributedwithyourvaspcode).
• VASPwasbuiltwithtwocompilersingeneral,IntelandCraycompilersonEdisonandCori.– IntelCompiler+MKL+�w3wrapperrouGnesfromMKL– CrayCompiler+Libsci+�w3
-32-
Thank you
-33-