Running Jobs on Cori with SLURM
Helen He, NERSC User Engagement Group. Cori Phase 1 Training, June 14, 2016
Cori Phase 1 - Cray XC40
• 4 GB memory/core for applications
• /scratch disk quota of 20 TB
• 30 PB of /scratch disk
• Choice of full Linux operating system or optimized Linux OS (Cray Linux)
• Intel, Cray, and GNU compilers
• 52,160 cores, 1,630 nodes
• "Aries" interconnect
• 2 x 16-core Intel "Haswell" 2.3 GHz processors per node
• 32 processor cores per node, 64 with hyperthreading
• 128 GB of memory per node
• 203 TB of aggregate memory
Cori Phase 1 Compute Nodes
• Cori Phase 1: NERSC Cray XC40, 1,630 nodes, 52,160 cores
• Each node has 2 Intel Xeon 16-core Haswell processors
• 2 NUMA domains per node, 16 cores per NUMA domain. 2 hardware threads per core.
• Memory bandwidth is non-homogeneous among NUMA domains

To obtain processor info, get on a compute node:
% salloc -N 1
Then:
% cat /proc/cpuinfo
or
% hwloc-ls
User Jobs at NERSC
• Most are parallel jobs (10s to 100,000+ cores)
• Also a number of "serial" jobs
  – Typically "pleasantly parallel" simulation or data analysis
• Production runs execute in batch mode
• Our batch scheduler is SLURM (native)
• Debug jobs are supported for up to 30 minutes
• Typical run times are a few to 10s of hours
  – Each machine has different limits
  – Limits are necessary because of MTBF and the need to accommodate 6,000 users' jobs
What is SLURM
• In simple words, SLURM is a workload manager, or a batch scheduler
• SLURM stands for Simple Linux Utility for Resource Management
• SLURM unites cluster resource management (such as Torque) and job scheduling (such as Moab) into one system. Avoids inter-tool complexity.
• SLURM was used in 6 of the top 10 computers (June 2015 TOP500 list), including the #1 system, Tianhe-2, with over 3M cores
• More Cray sites adopt SLURM: NERSC, CSCS, KAUST, TACC, etc.
Advantages of Using SLURM
• Fully open source
• SLURM is extensible (plugin architecture)
• Low latency scheduling. Highly scalable
• Integrated "serial" or "shared" queue
• Integrated Burst Buffer support
• Good memory management
• Built-in accounting and database support
• "Native" SLURM runs without Cray ALPS (Application Level Placement Scheduler)
  – Batch script runs on the head compute node directly
  – Easier to use. Less chance for contention compared to a shared MOM node
Login Nodes and Compute Nodes
Each machine has 2 types of nodes visible to users
• Login nodes (external)
  – Edit files, compile codes, submit batch jobs, etc.
  – Run short, serial utilities and applications
• Compute nodes
  – Execute your application
  – Dedicated resources for your job
Submitting Batch Jobs
• To run a batch job on the compute nodes you must write a "batch script" that contains
  – Directives to allow the system to schedule your job
  – An srun command that launches your parallel executable
• Submit the job to the queuing system with the sbatch command
  – % sbatch my_batch_script
Interactive Parallel Jobs
• You can run small parallel jobs interactively for up to 30 minutes

login% salloc -N 2 -p debug -t 15:00
[wait for job to start]
compute% srun -n 64 ./mycode.exe
Launching Parallel Jobs with SLURM

[Diagram: "sbatch" sends the job from the login node to a head compute node, which coordinates the other compute nodes allocated to the job.]

Head compute node:
• Runs commands in the batch script
• Issues the job launcher "srun" to start parallel jobs on all compute nodes allocated to the job (including itself)

Login node:
• Submit batch jobs via sbatch or salloc
• Please do not issue "srun" from login nodes
• Do not run big executables on login nodes
Sample Cori Batch Script - MPI

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00
#SBATCH -n 1280
#SBATCH -J myjob
#SBATCH -L scratch

export OMP_NUM_THREADS=1
srun -n 1280 ./mycode.exe
Sample Cori Batch Script - MPI (continued)

Notes on the script above:
• Need to specify which shell to use for the batch script
• Using "-l" as login shell is optional
• Environment is automatically imported
• Job directives are instructions for the batch system:
  – Submission partition (default is "debug")
  – How many compute nodes to reserve for your job
  – How long to reserve those nodes
  – More optional SBATCH keywords
• Optional SBATCH keywords specify:
  – How many instances of the application to launch (# of MPI tasks)
  – What is my job name
  – What file system licenses are used
• By default, hyperthreading is on. SLURM sees 2 threads available for each of the 32 physical CPUs on the node.
• No need to set OMP_NUM_THREADS if your application programming model is pure MPI.
• If your code is hybrid MPI/OpenMP, set OMP_NUM_THREADS=1 to run in pure MPI mode.
• The "srun" command launches parallel executables on the compute nodes.
• srun flags overwrite SBATCH keywords.
• No need to repeat flags in the srun command if already defined in SBATCH keywords (e.g. "srun ./mycode.exe" will also do in the above example).
• There are 64 logical CPUs on each node.
• With 40 nodes, using hyperthreading, up to 40*64 = 2,560 MPI tasks can be launched: "srun -n 2560 ./mycode.exe" is OK.
Sample Batch Script - Hybrid MPI/OpenMP

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00

export OMP_NUM_THREADS=8
srun -n 160 -c 8 ./mycode.exe

• srun does most of the optimal process and thread binding automatically. Only flags such as "-n" and "-c", along with OMP_NUM_THREADS, are needed for most applications.
• Hyperthreading is enabled by default. Jobs requesting more than 32 cores (MPI tasks * OpenMP threads) per node will use hyperthreads automatically (see the sketch below).
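
As a hedged illustration of the last point (node count, task count, and executable name are placeholders, not from the original slides), a variant of the script above that requests more than 32 cores per node, and therefore runs on hyperthreads:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00

# 320 MPI tasks / 40 nodes = 8 tasks per node; 8 tasks * 8 threads = 64 > 32,
# so both hardware threads of each physical core are used
export OMP_NUM_THREADS=8
srun -n 320 -c 8 ./mycode.exe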
Long and Short Command Options

Long command options:

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --job-name=test
#SBATCH --account=mpccc
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --license=scratch
srun -n 64 ./mycode.exe

Short command options:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -J test
#SBATCH -A mpccc
#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -L scratch
srun -n 64 ./mycode.exe
Running Multiple Parallel Jobs Sequentially

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4
#SBATCH -t 12:00:00
#SBATCH -L project,cscratch1
srun -n 128 ./a.out
srun -n 128 ./b.out
srun -n 128 ./c.out

• Need to request the maximum number of nodes needed by any one srun
Running Multiple Parallel Jobs Simultaneously

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 8
#SBATCH -t 12:00:00
#SBATCH -L cscratch1
srun -n 44 -N 2 ./a.out &
srun -n 108 -N 4 ./b.out &
srun -n 40 -N 2 ./c.out &
wait

• Need to request the total number of nodes needed by all sruns
• Notice the "&" and "wait" above
• It is best if each srun takes roughly the same amount of time to complete; otherwise the allocation is wasted
Running MPMD Jobs

# Config file:
% cat mpmd.conf
0-35 ./a.out
36-95 ./b.out

# Batch script:
#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4
#SBATCH -n 96        # total of 96 tasks
#SBATCH -t 02:00:00
#SBATCH -L SCRATCH
srun --multi-prog ./mpmd.conf

• The two executables will share one MPI_COMM_WORLD
• Request the total number of nodes needed for a.out and b.out
Job Steps and Dependency Jobs
• Use the job dependency feature to chain jobs that depend on each other

cori% sbatch job1
Submitted batch job 5547
cori06% sbatch --dependency=afterok:5547 job2
or
cori06% sbatch --dependency=afterany:5547 job2

cori06% sbatch job1
Submitted batch job 5547
cori06% cat job2
#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -d afterok:5547
cd $SLURM_SUBMIT_DIR
srun -n 32 ./a.out
cori06% sbatch job2
"shared" Partition on Cori
• Users see many jobs in "shared" that appear to use 1 node per job (as displayed by the queue monitoring scripts), but actually they do NOT
• Serial jobs or small parallel jobs are shared on these nodes
• 40 nodes are set aside for "shared" jobs
• "shared" jobs do not run on other nodes currently (may change in the future)
• High submit limits (10,000) and run limits (1,000)
• Jobs are getting very good throughput
• "shared" jobs are not charged by the entire node, but by the actual physical cores used
Running Serial Jobs
• The "shared" partition allows multiple executables from different users to share a node
• Each serial job runs on a single core of a "shared" node
• Up to 32 jobs from different users per node, depending on their memory requirements

#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH --mem=4GB
#SBATCH -J my_job
./mycode.exe

• Small parallel jobs that use less than a full node can also run in the "shared" partition (see the sketch below)
• Do not specify "#SBATCH -N"
• Default "#SBATCH -n" is 1
• Default memory is 1,952 MB
• Use -n or --mem to request more slots or larger memory
• Do not use "srun" for a serial executable (reduces overhead)
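
A minimal sketch of such a small parallel job in the "shared" partition; the task count, memory request, and executable name are illustrative assumptions:

#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH -n 4            # four tasks on part of one shared node
#SBATCH --mem=10GB      # more than the 1,952 MB default
#SBATCH -J my_small_job
srun -n 4 ./my_small_parallel_code.exe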
"realtime" Partition
• Special permission is required to use "realtime", for the real-time needs of data-intensive workflows
• Highest priority for "realtime" jobs so they start almost immediately. Could be disruptive to overall queue scheduling.
• "realtime" jobs can run in "shared" or "exclusive" mode for node usage
• Nodes are set aside for "realtime" jobs (currently)
• "realtime" jobs can run on other nodes
Job Arrays
• Use job arrays for submitting and managing collections of similar jobs
  – Better for managing jobs, not necessarily faster turnaround
  – Each array task is considered a single job for scheduling and submit/run limits
  – SLURM_ARRAY_JOB_ID is set to the first job ID of the array
  – SLURM_ARRAY_TASK_ID is set to the job array index value (used in the sketch below)

# Submit a job array with index values between 0 and 31
% sbatch --array=0-31 -N 1

# Submit a job array with index values of 1, 3, 5 and 7
% sbatch --array=1,3,5,7 -N 1 -n 2

# Submit a job array with index values between 1 and 7, with a step size of 2
% sbatch --array=1-7:2 -N 1 -p regular

# Submit a job array with index values between 1 and 7, and limit max running jobs to 2
% sbatch --array=1-7%2 -N 1 -p regular
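
A minimal sketch of an array batch script that uses SLURM_ARRAY_TASK_ID to pick its input; the partition, array range, and the input_*.dat naming scheme are illustrative assumptions:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH --array=0-31
#SBATCH -J my_array_job

# Each array task runs the same script with a different index value
srun -n 32 ./mycode.exe input_${SLURM_ARRAY_TASK_ID}.dat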
stderr/stdout
• By default, while your job is running, the file slurm-$SLURM_JOB_ID.out is generated and contains both stderr and stdout
• stdout is buffered in segments of 8 KB size
• stderr is not buffered
• Can rename via the "#SBATCH -o" (for stdout) and "#SBATCH -e" (for stderr) flags (see the sketch below)
• Can also redirect from srun commands
  – srun -n 48 ./a.out >& my_output_file       (for csh/tcsh)
  – srun -n 48 ./a.out > my_output_file 2>&1   (for bash)
• The srun "-u" option disables the stdout buffer
  – Not recommended as a default since it slows down the application significantly
  – However this helps debugging, to get complete output before job termination
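
A hedged example of renaming the output files with -o and -e; the file name pattern is an assumption (%j is SLURM's placeholder for the job ID):

#SBATCH -o my_job_%j.out    # stdout
#SBATCH -e my_job_%j.err    # stderr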
Cluster Compatibility Mode (CCM) Applications
• Certain 3rd party applications need ssh to other compute nodes from the head compute node
• The CCM mode allows 3rd party applications to run on the compute nodes without rebuilding
• The "#SBATCH --ccm" option will set up the necessary environment to support CCM applications
• No separate CCM queue; CCM jobs can run in any queue now
• Since sbatch or salloc lands on a compute node, applications such as "matlab" can be launched directly without CCM support
Running Jobs Built with Intel MPI

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 8
#SBATCH -t 03:00:00
#SBATCH -L project,SCRATCH

module load impi
mpiicc -openmp -o mycode.exe mycode.c
export OMP_NUM_THREADS=8
export I_MPI_PMI_LIBRARY=/opt/slurm/default/lib/pmi/libpmi.so
srun -n 32 -c 8 ./mycode.exe
Which File Systems to Use
• Do not run from $HOME
  – Meant as permanent space for many small files (source, small input, etc.)
  – Performance not optimized for large IO
  – Running large IO jobs from $HOME can cause delayed interactive response time for other users
• Use $SCRATCH (Lustre file system) or your project directory to run your jobs, for better IO performance and a larger space quota
  – $SCRATCH: default 20 TB quota, 10 M inodes. Files not accessed for 12 weeks are subject to being purged. Back up important files.
  – /global/project/projectdirs/your_repo: default 4 TB quota, 1 M inodes. Not purged.
File System Licenses
• Use "#SBATCH -L" or "#SBATCH --license=" to specify the file systems needed
• Jobs that use file system licenses will not start if there is an issue with that particular file system
  – Protects jobs from failing with file system issues
  – Allows selective maintenance on file systems
• Example: #SBATCH -L scratch1,project
• "SCRATCH" can be used as shorthand for your default scratch file system on Edison or Cori
• Currently optional, but strongly encouraged
• May be enforced in the near future
More SBATCH Options
• Which QOS to use: normal (default), premium, low
  – #SBATCH --qos=premium
• What is my stdout file name
  – #SBATCH -o my_stdout_name
• Which account to charge
  – #SBATCH -A my_repo
• When to send email
  – #SBATCH --mail-type=BEGIN,END,FAIL
• …
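
A hedged sketch combining these options in one batch script header; the repo name, output file name, and email address are placeholders, and --mail-user (the address to notify) is a standard sbatch flag not shown on the original slide:

#SBATCH --qos=premium
#SBATCH -o my_stdout_name
#SBATCH -A my_repo
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]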
1-node "Long" Jobs Up to 96 hrs
• Run in the "regular" partition; no "premium" or "low" QOS priority
• % salloc -N 1 -p regular -t 96:00:00 -L SCRATCH (a batch-script equivalent is sketched below)
• Can only use a single node
• A user can have up to 4 long jobs running simultaneously
• A user can submit a maximum of 10 long jobs at a time
• A maximum of 10 nodes can be occupied by long jobs from all users
• Be aware that jobs won't start if the time left before a system maintenance is less than 96 hrs
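
A hedged batch-script equivalent of the salloc line above, assuming a pure MPI code that uses all 32 cores of the single node (task count and executable name are illustrative):

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 96:00:00
#SBATCH -L SCRATCH

srun -n 32 ./mycode.exe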
Recap: Running Jobs with SLURM
• Use "sbatch" to submit a batch script or "salloc" to request an interactive batch session
• Use "srun" to launch parallel jobs
• Most SLURM command options have two formats (long and short)
• Need to specify which shell to use for the batch script
• Environment is automatically imported
• The job lands in the submit directory
• The batch script runs on the head compute node
• No need to repeat flags in the srun command if already defined in SBATCH keywords
• srun flags overwrite SBATCH keywords
Some SLURM Gotchas
• Hyperthreading is enabled by default.
  – SLURM sees 64 CPUs per node (each Cori node has 32 physical cores, for a total of 64 logical cores per node)
  – srun will decide whether to use logical cores (see the last bullet)
• Need to set OMP_NUM_THREADS=1 explicitly to run in pure MPI mode (with hybrid MPI/OpenMP applications built with the OpenMP compiler flag enabled), as in the sketch below
• Always use "#SBATCH -N" to request the number of nodes. If asking for nodes with "#SBATCH -n" only (for num_MPI_tasks), you may get half the number of nodes desired.
• Automatic process and thread affinity is good. Hyperthreading will not be used if num_MPI_tasks * num_OpenMP_threads per node is <= 32. You can explore advanced settings for more complicated binding options.
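
A minimal sketch tying the first three gotchas together (node count, task count, and executable name are placeholders): request nodes explicitly with -N, keep MPI tasks per node at or below 32 so hyperthreads are not used, and set OMP_NUM_THREADS=1 to run an OpenMP-enabled binary as pure MPI:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4            # always request the node count explicitly
#SBATCH -n 128          # 32 MPI tasks per node: hyperthreads are not used
#SBATCH -t 01:00:00

export OMP_NUM_THREADS=1    # pure MPI run of a hybrid-built binary
srun -n 128 ./mycode.exe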
Two SLURM Schedulers Are at Work
• Instant scheduler (event triggered)
  – Performs a quick and simple scheduling attempt at events such as job submission or completion and configuration changes.
• Backfill scheduler (at set intervals)
  – Considers pending jobs in priority order, determining when and where each will start, taking into consideration the possibility of job preemption, gang scheduling, generic resource (GRES) requirements, memory requirements, etc.
  – If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so.
SLURM User Commands
• sbatch: submit a batch script
• salloc: request nodes for an interactive batch session
• srun: launch parallel jobs
• scancel: delete a batch job
• sqs: NERSC custom queue display with job priority ranking info
• squeue: display info about jobs in the queue
• sinfo: view SLURM configuration about nodes and partitions
• scontrol: view and modify SLURM configuration and job state
• sacct: display accounting data for jobs and job steps
• https://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
sqs: NERSC Custom Queue Monitoring Script
• Provides two columns of ranking values to give users more perspective on their jobs in the queue.
  – Column RANK_BF shows the ranking using the best estimated start time (if available) at a backfill scheduling cycle, so the ranking is dynamic and changes frequently along with changes in the queued jobs.
  – Column RANK_P shows the ranking by absolute priority value, which is a function of partition QOS, job wait time, and fair share. Jobs with higher priority won't necessarily run earlier due to various run limits, total node limits, and the backfill depth we have set.
sqs Example Commands
% sqs                    (shows user's own jobs)
% sqs -a                 (shows all jobs)
% sqs -a -p debug        (shows only debug jobs)
% sqs -a -nr -np shared  (no running jobs, no shared jobs)
% sqs -w                 (shows all my jobs in wide format with more info)
% sqs -s                 (short summary of queued jobs)

See the man page or use "sqs --help" for more options.
squeue Example Commands

% squeue -u <username>
% squeue -j <jobid> --start
% squeue -o "%.18i %.3t %.10r %.10u %.12j %.8D %.10M %.10l %.20V %.12P %.20S %.15Q"
JOBID  ST  REASON  USER  NAME  NODES  TIME  TIME_LIMIT  SUBMIT_TIME  PARTITION  START_TIME  PRIORITY

See the man page or "squeue --help" for more options.
sinfo Example Commands

% sinfo -s
PARTITION  AVAIL  TIMELIMIT   NODES(A/I/O/T)   NODELIST
debug*     up     30:00       5357/154/8/5519  nid[00008-00063,00072-00127,…]
regular    up     4-00:00:00  5110/147/6/5263  nid[00296-00323,00328-00383,…]
regularx   up     2-00:00:00  5357/154/8/5519  nid[00008-00063,00072-00127,…]
realtime   up     12:00:00    5420/158/8/5586  nid[00008-00063,00072-00127,…]
shared     up     2-00:00:00  55/0/0/55        nid[06089-06143]

% sinfo -p debug
PARTITION  AVAIL  JOB_SIZE  TIMELIMIT  CPUS  S:C:T   NODES  STATE      NODELIST
debug*     up     1-infini  30:00      48    2:12:2  6      down*      nid[01343,02062,02137,03132,03150,03307]
debug*     up     1-infini  30:00      48    2:12:2  2      drained    nid[00008-00009]
debug*     up     1-infini  30:00      48    2:12:2  2      reserved   nid[00010-00011]
debug*     up     1-infini  30:00      48    2:12:2  2      mixed      nid[00077,00090]
debug*     up     1-infini  30:00      48    2:12:2  5349   allocated  nid[00012-00063,00072-00076,…]
debug*     up     1-infini  30:00      48    2:12:2  158    idle       nid[00083-00086,00091-00094,…]

See the man page or "sinfo --help" for more options.
scontrol Example Commands

% scontrol show partition <partition>
% scontrol show job <jobid>
% scontrol hold <jobid>
% scontrol release <jobid>
% scontrol update job <jobid> timelimit=24:00:00
% scontrol update job <jobid> qos=premium

See the man page or "scontrol --help" for more options.
sacct Example Commands

% sacct -u <username> --starttime=01/12/16T00:01 --endtime=01/15/16T12:00 -o jobid,elapsed,nnodes,start,end,submit
% sacct -a --starttime=01/12/16T00:01 --endtime=01/15/16T12:00 -o User,JobID,NNodes,State,Start,End,TimeLimit,Elapsed,ExitCode,DerivedExitcode,Comment
% sacct -u <username> -j <jobid> -o jobid,elapsed,nnodes,nodelist

See the man page or "sacct --help" for more options.
Edison Queue Policy (as of 06/10/2016)

[Queue policy table not reproduced here. Annotations from the slide:]
• Specify these partitions with #SBATCH -p partition_name
• Specify these QOS with #SBATCH --qos=premium
• These limits are per-user, per-partition/QOS limits
• Jobs with insufficient allocations to run are directed to "scavenger"
Cori Queue Policy (as of 06/10/2016)

[Queue policy table not reproduced here. Callouts from the slide: large user limits; serial workload; realtime workflow.]
Cori Phase 1 Data Features Implemented in SLURM
• Cori Phase 1 is also known as the "Cori Data Partition"
• Designed to accelerate data-intensive applications, with high throughput and "realtime" needs:
  – "shared" partition. Multiple jobs on the same node. Larger submit and run limits. 40 nodes set aside.
  – The 1-2 node bin in "regular" for high-throughput jobs. Large submit and run limits.
  – "realtime" partition for jobs requiring realtime data analysis. Highest queue priority. Special permission only.
  – Internal sshd (CCM mode) in any queue
  – Large number of login/interactive nodes to support applications with advanced workflows
  – "Burst buffer" usage integrated in SLURM, in the early user period
  – Users are encouraged to run jobs using 683+ nodes on Edison, with a queue priority boost and a 40% charging discount there
Charge Factors & Discounts
• Each machine has a "machine charge factor" (MCF) that multiplies the "raw hours" used
  – Edison MCF = 2.0
  – Cori MCF = 2.5
• Each QOS has a "QOS charge factor" (QCF)
  – premium QCF = 2.0
  – normal QCF = 1.0 (default)
  – low QCF = 0.5
  – scavenger QCF = 0
• On Edison:
  – Jobs requesting 683 or more nodes get a 40% discount
How Your Jobs Are Charged
• Your repository is charged for each node your job was allocated, for the entire duration of your job.
  – The minimum allocatable unit is a node (except for the "shared" partition on Cori). Edison has 24 cores/node and Cori has 32 cores/node.
  – Example: 4 Cori nodes for 1 hour with "premium" QOS:
    MPP hours = (4) * (32) * (1 hour) * (2.0) * (2.5) = 640 MPP hours
  – "shared" jobs are charged for the physical CPUs used instead of the entire node.
• If you have access to multiple repos, pick which one to charge in your batch script: #SBATCH -A repo_name

MPP hours = (# nodes) * (# cores/node) * (walltime used) * (QCF) * (MCF)
I submitted my job, but it is not running!
• That's what the batch queue is for!
• Your jobs will wait until the resources are available for them to run
• Your job's place in the queue is a mix of time and priority, so line jumping is allowed, but it may cost more
Cori Backlog Over Time
• Current backlog is ~8 days
• Plot obtained from MyNERSC
Average Queue Wait Time
• Historic data can be obtained from https://www.nersc.gov/users/queues/queue-wait-times/
Tips for Getting Better Throughput
• Line jumping is allowed, but it may cost more
• Submit shorter jobs; they are easier to schedule
  – Checkpoint if possible to break up long jobs
  – Short jobs can take advantage of 'backfill' opportunities
  – Run short jobs just before maintenance
• Very important: make sure the wall clock time you request is accurate
  – As noted above, shorter jobs are easier to schedule
  – Many users unnecessarily enter the largest wall clock time possible as a default
• Bundle jobs (multiple "srun"s in one script, sequential or simultaneous)
• Use the "shared" partition for serial jobs or very small parallel jobs
• Queue wait time statistics
  – https://www.nersc.gov/users/queues/queue-wait-times/
Places and Tools to Check Job Status
• Completed jobs web page
  – https://www.nersc.gov/users/job-logs-statistics/completed-jobs/
• MyNERSC Queues display
  – https://my.nersc.gov/queues.php?machine=cori&full_name=Cori
• Queue Wait Times
  – http://www.nersc.gov/users/queues/queue-wait-times/
• Scripts described on the Queue Monitoring page
  – sqs, squeue, sstat, sprio, etc.
  – https://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
Not Covered in Detail Here
• TaskFarmer
  – Manage single-core ("serial") or multi-core jobs
• Using the "realtime" partition
• "xfer" jobs
  – Transfer data between Cori and the archive
  – Run on external login nodes
• Burst Buffer
  – Non-volatile storage that sits between memory and the file system
  – Accelerates IO performance
• Shifter
  – Run jobs in custom environments
• Advanced workflows
• Please refer to the Cori running jobs web page
  – http://www.nersc.gov/users/computational-systems/cori/running-jobs/
Demo and Hands-on Will Be on Edison
• Cori is currently down due to an OS upgrade to prepare for Phase 2 KNL support
• Edison is also a Cray system, with Intel Ivy Bridge processors
• Edison has 24 cores per node as compared to 32 cores per node on Cori
• Each Edison node has 64 GB of memory, as compared to 128 GB of memory for each Cori node
• To do the SLURM exercises:
  – % module load training
  – % cp -r $EXAMPLES/CoriP1/SLURM .   (notice the "dot" at the end)
Edison - Cray XC30
• 2.7 GB memory/core for applications
• /scratch disk quota of 10 TB
• 7.6 PB of /scratch disk
• Choice of full Linux operating system or optimized Linux OS (Cray Linux)
• Intel, Cray, and GNU compilers
• 133,824 cores, 5,576 nodes
• "Aries" interconnect
• 2 x 12-core Intel "Ivy Bridge" 2.4 GHz processors per node
• 24 processor cores per node, 48 with hyperthreading
• 64 GB of memory per node
• 357 TB of aggregate memory
Edison Compute Nodes
• Edison: NERSC Cray XC30, 5,576 nodes, 133,824 cores
• 2 NUMA domains per node, 12 cores per NUMA domain. 2 hardware threads per core.
• Memory bandwidth is non-homogeneous among NUMA domains.
More Information
• NERSC web pages
  – http://www.nersc.gov/users/computational-systems/cori/running-jobs/
  – http://www.nersc.gov/users/computational-systems/edison/running-jobs/
• SchedMD
  – http://www.schedmd.com/
  – Man pages for SLURM commands
• Contact NERSC Consulting
  – Toll-free 800-666-3772
  – 510-486-8611, option #3
  – [email protected]
Thank You