Running Jobs on Cori with SLURM
Helen He, NERSC User Engagement Group. Cori Phase 1 Training, June 14, 2016
Cori Phase 1 - Cray XC40
• 4 GB memory/core for applications
• /scratch disk quota of 20 TB
• 30 PB of /scratch disk
• Choice of full Linux operating system or optimized Linux OS (Cray Linux)
• Intel, Cray, and GNU compilers
• 52,160 cores, 1,630 nodes
• "Aries" interconnect
• 2 x 16-core Intel "Haswell" 2.3 GHz processors per node
• 32 processor cores per node, 64 with hyperthreading
• 128 GB of memory per node
• 203 TB of aggregate memory
Cori Phase 1 Compute Nodes
• Cori Phase 1: NERSC Cray XC40, 1,630 nodes, 52,160 cores
• Each node has 2 Intel Xeon 16-core Haswell processors
• 2 NUMA domains per node, 16 cores per NUMA domain. 2 hardware threads per core.
• Memory bandwidth is non-homogeneous among NUMA domains

To obtain processor info, get on a compute node:
% salloc -N 1
Then:
% cat /proc/cpuinfo
or
% hwloc-ls
User Jobs at NERSC
• Most are parallel jobs (10s to 100,000+ cores)
• Also a number of "serial" jobs
  – Typically "pleasantly parallel" simulation or data analysis
• Production runs execute in batch mode
• Our batch scheduler is SLURM (native)
• Debug jobs are supported for up to 30 minutes
• Typical run times are a few to 10s of hours
  – Each machine has different limits
  – Limits are necessary because of MTBF and the need to accommodate 6,000 users' jobs
What is SLURM
• In simple words, SLURM is a workload manager, or a batch scheduler
• SLURM stands for Simple Linux Utility for Resource Management
• SLURM unites cluster resource management (such as Torque) and job scheduling (such as Moab) into one system. Avoids inter-tool complexity.
• SLURM was used in 6 of the top 10 computers (June 2015 TOP500 list), including the #1 system, Tianhe-2, with over 3M cores
• More Cray sites adopt SLURM: NERSC, CSCS, KAUST, TACC, etc.
Advantages of Using SLURM
• Fully open source
• SLURM is extensible (plugin architecture)
• Low latency scheduling. Highly scalable
• Integrated "serial" or "shared" queue
• Integrated Burst Buffer support
• Good memory management
• Built-in accounting and database support
• "Native" SLURM runs without Cray ALPS (Application Level Placement Scheduler)
  – Batch script runs on the head compute node directly
  – Easier to use. Less chance for contention compared to a shared MOM node
Login Nodes and Compute Nodes
Each machine has 2 types of nodes visible to users
• Login nodes (external)
  – Edit files, compile codes, submit batch jobs, etc.
  – Run short, serial utilities and applications
• Compute nodes
  – Execute your application
  – Dedicated resources for your job
Submitting Batch Jobs
• To run a batch job on the compute nodes you must write a "batch script" that contains
  – Directives to allow the system to schedule your job
  – An srun command that launches your parallel executable
• Submit the job to the queuing system with the sbatch command
  – % sbatch my_batch_script
Interactive Parallel Jobs
• You can run small parallel jobs interactively for up to 30 minutes

login% salloc -N 2 -p debug -t 15:00
[wait for job to start]
compute% srun -n 64 ./mycode.exe
Launching Parallel Jobs with SLURM

[Diagram: "sbatch" sends the job from the login node to a head compute node, which coordinates the other compute nodes allocated to the job.]

Head compute node:
• Runs commands in the batch script
• Issues the job launcher "srun" to start parallel jobs on all compute nodes allocated to the job (including itself)

Login node:
• Submit batch jobs via sbatch or salloc
• Please do not issue "srun" from login nodes
• Do not run big executables on login nodes
Sample Cori Batch Script - MPI

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00
#SBATCH -n 1280
#SBATCH -J myjob
#SBATCH -L scratch

export OMP_NUM_THREADS=1
srun -n 1280 ./mycode.exe
Sample Cori Batch Script - MPI (continued)

Notes on the script above:
• Need to specify which shell to use for the batch script
• Using "-l" as login shell is optional
• Environment is automatically imported
• Job directives are instructions for the batch system:
  – Submission partition (default is "debug")
  – How many compute nodes to reserve for your job
  – How long to reserve those nodes
  – More optional SBATCH keywords
• Optional SBATCH keywords specify:
  – How many instances of the application to launch (# of MPI tasks)
  – What is my job name
  – What file system licenses are used
• By default, hyperthreading is on. SLURM sees 2 threads available for each of the 32 physical CPUs on the node.
• No need to set OMP_NUM_THREADS if your application programming model is pure MPI.
• If your code is hybrid MPI/OpenMP, set OMP_NUM_THREADS=1 to run in pure MPI mode.
• The "srun" command launches parallel executables on the compute nodes.
• srun flags overwrite SBATCH keywords.
• No need to repeat flags in the srun command if already defined in SBATCH keywords (e.g. "srun ./mycode.exe" will also do in the above example).
• There are 64 logical CPUs on each node.
• With 40 nodes, using hyperthreading, up to 40*64 = 2,560 MPI tasks can be launched: "srun -n 2560 ./mycode.exe" is OK.
Sample Batch Script - Hybrid MPI/OpenMP

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00

export OMP_NUM_THREADS=8
srun -n 160 -c 8 ./mycode.exe

• srun does most of the optimal process and thread binding automatically. Only flags such as "-n" and "-c", along with OMP_NUM_THREADS, are needed for most applications.
• Hyperthreading is enabled by default. Jobs requesting more than 32 cores (MPI tasks * OpenMP threads) per node will use hyperthreads automatically (see the sketch below).
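
As a hedged illustration of the last point (node count, task count, and executable name are placeholders, not from the original slides), a variant of the script above that requests more than 32 cores per node, and therefore runs on hyperthreads:

#!/bin/bash -l

#SBATCH -p regular
#SBATCH -N 40
#SBATCH -t 1:00:00

# 320 MPI tasks / 40 nodes = 8 tasks per node; 8 tasks * 8 threads = 64 > 32,
# so both hardware threads of each physical core are used
export OMP_NUM_THREADS=8
srun -n 320 -c 8 ./mycode.exe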
Long and Short Command Options

Long command options:

#!/bin/bash -l
#SBATCH --partition=regular
#SBATCH --job-name=test
#SBATCH --account=mpccc
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --license=scratch
srun -n 64 ./mycode.exe

Short command options:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -J test
#SBATCH -A mpccc
#SBATCH -N 2
#SBATCH -t 00:30:00
#SBATCH -L scratch
srun -n 64 ./mycode.exe
Running Multiple Parallel Jobs Sequentially

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4
#SBATCH -t 12:00:00
#SBATCH -L project,cscratch1
srun -n 128 ./a.out
srun -n 128 ./b.out
srun -n 128 ./c.out

• Need to request the maximum number of nodes needed by any one srun
Running Multiple Parallel Jobs Simultaneously

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 8
#SBATCH -t 12:00:00
#SBATCH -L cscratch1
srun -n 44 -N 2 ./a.out &
srun -n 108 -N 4 ./b.out &
srun -n 40 -N 2 ./c.out &
wait

• Need to request the total number of nodes needed by all sruns
• Notice the "&" and "wait" above
• It is best if each srun takes roughly the same amount of time to complete; otherwise the allocation is wasted
Running MPMD Jobs

# Config file:
% cat mpmd.conf
0-35 ./a.out
36-95 ./b.out

# Batch script:
#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4
#SBATCH -n 96        # total of 96 tasks
#SBATCH -t 02:00:00
#SBATCH -L SCRATCH
srun --multi-prog ./mpmd.conf

• The two executables will share one MPI_COMM_WORLD
• Request the total number of nodes needed for a.out and b.out
Job Steps and Dependency Jobs
• Use the job dependency feature to chain jobs that depend on each other

cori% sbatch job1
Submitted batch job 5547
cori06% sbatch --dependency=afterok:5547 job2
or
cori06% sbatch --dependency=afterany:5547 job2

cori06% sbatch job1
Submitted batch job 5547
cori06% cat job2
#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH -d afterok:5547
cd $SLURM_SUBMIT_DIR
srun -n 32 ./a.out
cori06% sbatch job2
"shared" Partition on Cori
• Users see many jobs in "shared" that appear to use 1 node per job (as displayed by the queue monitoring scripts), but actually they do NOT
• Serial jobs or small parallel jobs are shared on these nodes
• 40 nodes are set aside for "shared" jobs
• "shared" jobs do not run on other nodes currently (may change in the future)
• High submit limits (10,000) and run limits (1,000)
• Jobs are getting very good throughput
• "shared" jobs are not charged by the entire node, but by the actual physical cores used
Running Serial Jobs
• The "shared" partition allows multiple executables from different users to share a node
• Each serial job runs on a single core of a "shared" node
• Up to 32 jobs from different users per node, depending on their memory requirements

#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH --mem=4GB
#SBATCH -J my_job
./mycode.exe

• Small parallel jobs that use less than a full node can also run in the "shared" partition (see the sketch below)
• Do not specify "#SBATCH -N"
• Default "#SBATCH -n" is 1
• Default memory is 1,952 MB
• Use -n or --mem to request more slots or larger memory
• Do not use "srun" for a serial executable (reduces overhead)
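
A minimal sketch of such a small parallel job in the "shared" partition; the task count, memory request, and executable name are illustrative assumptions:

#SBATCH -p shared
#SBATCH -t 1:00:00
#SBATCH -n 4            # four tasks on part of one shared node
#SBATCH --mem=10GB      # more than the 1,952 MB default
#SBATCH -J my_small_job
srun -n 4 ./my_small_parallel_code.exe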
"realtime" Partition
• Special permission is required to use "realtime", for the real-time needs of data-intensive workflows
• Highest priority for "realtime" jobs so they start almost immediately. Could be disruptive to overall queue scheduling.
• "realtime" jobs can run in "shared" or "exclusive" mode for node usage
• Nodes are set aside for "realtime" jobs (currently)
• "realtime" jobs can run on other nodes
Job Arrays
• Use job arrays for submitting and managing collections of similar jobs
  – Better for managing jobs, not necessarily faster turnaround
  – Each array task is considered a single job for scheduling and submit/run limits
  – SLURM_ARRAY_JOB_ID is set to the first job ID of the array
  – SLURM_ARRAY_TASK_ID is set to the job array index value (used in the sketch below)

# Submit a job array with index values between 0 and 31
% sbatch --array=0-31 -N 1

# Submit a job array with index values of 1, 3, 5 and 7
% sbatch --array=1,3,5,7 -N 1 -n 2

# Submit a job array with index values between 1 and 7, with a step size of 2
% sbatch --array=1-7:2 -N 1 -p regular

# Submit a job array with index values between 1 and 7, and limit max running jobs to 2
% sbatch --array=1-7%2 -N 1 -p regular
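
A minimal sketch of an array batch script that uses SLURM_ARRAY_TASK_ID to pick its input; the partition, array range, and the input_*.dat naming scheme are illustrative assumptions:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 00:30:00
#SBATCH --array=0-31
#SBATCH -J my_array_job

# Each array task runs the same script with a different index value
srun -n 32 ./mycode.exe input_${SLURM_ARRAY_TASK_ID}.dat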
stderr/stdout
• By default, while your job is running, the file slurm-$SLURM_JOB_ID.out is generated and contains both stderr and stdout
• stdout is buffered in segments of 8 KB size
• stderr is not buffered
• Can rename via the "#SBATCH -o" (for stdout) and "#SBATCH -e" (for stderr) flags (see the sketch below)
• Can also redirect from srun commands
  – srun -n 48 ./a.out >& my_output_file       (for csh/tcsh)
  – srun -n 48 ./a.out > my_output_file 2>&1   (for bash)
• The srun "-u" option disables the stdout buffer
  – Not recommended as a default since it slows down the application significantly
  – However this helps debugging, to get complete output before job termination
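
A hedged example of renaming the output files with -o and -e; the file name pattern is an assumption (%j is SLURM's placeholder for the job ID):

#SBATCH -o my_job_%j.out    # stdout
#SBATCH -e my_job_%j.err    # stderr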
Cluster Compatibility Mode (CCM) Applications
• Certain 3rd party applications need ssh to other compute nodes from the head compute node
• The CCM mode allows 3rd party applications to run on the compute nodes without rebuilding
• The "#SBATCH --ccm" option will set up the necessary environment to support CCM applications
• No separate CCM queue; CCM jobs can run in any queue now
• Since sbatch or salloc lands on a compute node, applications such as "matlab" can be launched directly without CCM support
Running Jobs Built with Intel MPI

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 8
#SBATCH -t 03:00:00
#SBATCH -L project,SCRATCH

module load impi
mpiicc -openmp -o mycode.exe mycode.c
export OMP_NUM_THREADS=8
export I_MPI_PMI_LIBRARY=/opt/slurm/default/lib/pmi/libpmi.so
srun -n 32 -c 8 ./mycode.exe
Which File Systems to Use
• Do not run from $HOME
  – Meant as permanent space for many small files (source, small input, etc.)
  – Performance not optimized for large IO
  – Running large IO jobs from $HOME can cause delayed interactive response time for other users
• Use $SCRATCH (Lustre file system) or your project directory to run your jobs, for better IO performance and a larger space quota
  – $SCRATCH: default 20 TB quota, 10 M inodes. Files not accessed for 12 weeks are subject to being purged. Back up important files.
  – /global/project/projectdirs/your_repo: default 4 TB quota, 1 M inodes. Not purged.
File System Licenses
• Use "#SBATCH -L" or "#SBATCH --license=" to specify the file systems needed
• Jobs that use file system licenses will not start if there is an issue with that particular file system
  – Protects jobs from failing with file system issues
  – Allows selective maintenance on file systems
• Example: #SBATCH -L scratch1,project
• "SCRATCH" can be used as shorthand for your default scratch file system on Edison or Cori
• Currently optional, but strongly encouraged
• May be enforced in the near future
More SBATCH Options
• Which QOS to use: normal (default), premium, low
  – #SBATCH --qos=premium
• What is my stdout file name
  – #SBATCH -o my_stdout_name
• Which account to charge
  – #SBATCH -A my_repo
• When to send email
  – #SBATCH --mail-type=BEGIN,END,FAIL
• …
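
A hedged sketch combining these options in one batch script header; the repo name, output file name, and email address are placeholders, and --mail-user (the address to notify) is a standard sbatch flag not shown on the original slide:

#SBATCH --qos=premium
#SBATCH -o my_stdout_name
#SBATCH -A my_repo
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH [email protected]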
1-node "Long" Jobs Up to 96 hrs
• Run in the "regular" partition; no "premium" or "low" QOS priority
• % salloc -N 1 -p regular -t 96:00:00 -L SCRATCH (a batch-script equivalent is sketched below)
• Can only use a single node
• A user can have up to 4 long jobs running simultaneously
• A user can submit a maximum of 10 long jobs at a time
• A maximum of 10 nodes can be occupied by long jobs from all users
• Be aware that jobs won't start if the time left before a system maintenance is less than 96 hrs
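
A hedged batch-script equivalent of the salloc line above, assuming a pure MPI code that uses all 32 cores of the single node (task count and executable name are illustrative):

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 1
#SBATCH -t 96:00:00
#SBATCH -L SCRATCH

srun -n 32 ./mycode.exe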
Recap: Running Jobs with SLURM
• Use "sbatch" to submit a batch script or "salloc" to request an interactive batch session
• Use "srun" to launch parallel jobs
• Most SLURM command options have two formats (long and short)
• Need to specify which shell to use for the batch script
• Environment is automatically imported
• The job lands in the submit directory
• The batch script runs on the head compute node
• No need to repeat flags in the srun command if already defined in SBATCH keywords
• srun flags overwrite SBATCH keywords
Some SLURM Gotchas
• Hyperthreading is enabled by default.
  – SLURM sees 64 CPUs per node (each Cori node has 32 physical cores, for a total of 64 logical cores per node)
  – srun will decide whether to use logical cores (see the last bullet)
• Need to set OMP_NUM_THREADS=1 explicitly to run in pure MPI mode (with hybrid MPI/OpenMP applications built with the OpenMP compiler flag enabled), as in the sketch below
• Always use "#SBATCH -N" to request the number of nodes. If asking for nodes with "#SBATCH -n" only (for num_MPI_tasks), you may get half the number of nodes desired.
• Automatic process and thread affinity is good. Hyperthreading will not be used if num_MPI_tasks * num_OpenMP_threads per node is <= 32. You can explore advanced settings for more complicated binding options.
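
A minimal sketch tying the first three gotchas together (node count, task count, and executable name are placeholders): request nodes explicitly with -N, keep MPI tasks per node at or below 32 so hyperthreads are not used, and set OMP_NUM_THREADS=1 to run an OpenMP-enabled binary as pure MPI:

#!/bin/bash -l
#SBATCH -p regular
#SBATCH -N 4            # always request the node count explicitly
#SBATCH -n 128          # 32 MPI tasks per node: hyperthreads are not used
#SBATCH -t 01:00:00

export OMP_NUM_THREADS=1    # pure MPI run of a hybrid-built binary
srun -n 128 ./mycode.exe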
Two SLURM Schedulers Are at Work
• Instant scheduler (event triggered)
  – Performs a quick and simple scheduling attempt at events such as job submission or completion and configuration changes.
• Backfill scheduler (at set intervals)
  – Considers pending jobs in priority order, determining when and where each will start, taking into consideration the possibility of job preemption, gang scheduling, generic resource (GRES) requirements, memory requirements, etc.
  – If the job under consideration can start immediately without impacting the expected start time of any higher priority job, then it does so.
SLURM User Commands
• sbatch: submit a batch script
• salloc: request nodes for an interactive batch session
• srun: launch parallel jobs
• scancel: delete a batch job
• sqs: NERSC custom queue display with job priority ranking info
• squeue: display info about jobs in the queue
• sinfo: view SLURM configuration about nodes and partitions
• scontrol: view and modify SLURM configuration and job state
• sacct: display accounting data for jobs and job steps
• https://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
sqs: NERSC Custom Queue Monitoring Script
• Provides two columns of ranking values to give users more perspective on their jobs in the queue.
  – Column RANK_BF shows the ranking using the best estimated start time (if available) at a backfill scheduling cycle, so the ranking is dynamic and changes frequently along with changes in the queued jobs.
  – Column RANK_P shows the ranking by absolute priority value, which is a function of partition QOS, job wait time, and fair share. Jobs with higher priority won't necessarily run earlier due to various run limits, total node limits, and the backfill depth we have set.
sqs Example Commands
% sqs                    (shows user's own jobs)
% sqs -a                 (shows all jobs)
% sqs -a -p debug        (shows only debug jobs)
% sqs -a -nr -np shared  (no running jobs, no shared jobs)
% sqs -w                 (shows all my jobs in wide format with more info)
% sqs -s                 (short summary of queued jobs)

See the man page or use "sqs --help" for more options.
squeue Example Commands

% squeue -u <username>
% squeue -j <jobid> --start
% squeue -o "%.18i %.3t %.10r %.10u %.12j %.8D %.10M %.10l %.20V %.12P %.20S %.15Q"
JOBID  ST  REASON  USER  NAME  NODES  TIME  TIME_LIMIT  SUBMIT_TIME  PARTITION  START_TIME  PRIORITY

See the man page or "squeue --help" for more options.
sinfo Example Commands

% sinfo -s
PARTITION  AVAIL  TIMELIMIT   NODES(A/I/O/T)   NODELIST
debug*     up     30:00       5357/154/8/5519  nid[00008-00063,00072-00127,…]
regular    up     4-00:00:00  5110/147/6/5263  nid[00296-00323,00328-00383,…]
regularx   up     2-00:00:00  5357/154/8/5519  nid[00008-00063,00072-00127,…]
realtime   up     12:00:00    5420/158/8/5586  nid[00008-00063,00072-00127,…]
shared     up     2-00:00:00  55/0/0/55        nid[06089-06143]

% sinfo -p debug
PARTITION  AVAIL  JOB_SIZE  TIMELIMIT  CPUS  S:C:T   NODES  STATE      NODELIST
debug*     up     1-infini  30:00      48    2:12:2  6      down*      nid[01343,02062,02137,03132,03150,03307]
debug*     up     1-infini  30:00      48    2:12:2  2      drained    nid[00008-00009]
debug*     up     1-infini  30:00      48    2:12:2  2      reserved   nid[00010-00011]
debug*     up     1-infini  30:00      48    2:12:2  2      mixed      nid[00077,00090]
debug*     up     1-infini  30:00      48    2:12:2  5349   allocated  nid[00012-00063,00072-00076,…]
debug*     up     1-infini  30:00      48    2:12:2  158    idle       nid[00083-00086,00091-00094,…]

See the man page or "sinfo --help" for more options.
scontrol Example Commands

% scontrol show partition <partition>
% scontrol show job <jobid>
% scontrol hold <jobid>
% scontrol release <jobid>
% scontrol update job <jobid> timelimit=24:00:00
% scontrol update job <jobid> qos=premium

See the man page or "scontrol --help" for more options.
sacct Example Commands

% sacct -u <username> --starttime=01/12/16T00:01 --endtime=01/15/16T12:00 -o jobid,elapsed,nnodes,start,end,submit
% sacct -a --starttime=01/12/16T00:01 --endtime=01/15/16T12:00 -o User,JobID,NNodes,State,Start,End,TimeLimit,Elapsed,ExitCode,DerivedExitcode,Comment
% sacct -u <username> -j <jobid> -o jobid,elapsed,nnodes,nodelist

See the man page or "sacct --help" for more options.
Edison Queue Policy (as of 06/10/2016)

[Queue policy table not reproduced here. Annotations from the slide:]
• Specify these partitions with #SBATCH -p partition_name
• Specify these QOS with #SBATCH --qos=premium
• These limits are per-user, per-partition/QOS limits
• Jobs with insufficient allocations to run are directed to "scavenger"
Cori Queue Policy (as of 06/10/2016)

[Queue policy table not reproduced here. Callouts from the slide: large user limits; serial workload; realtime workflow.]
Cori Phase 1 Data Features Implemented in SLURM
• Cori Phase 1 is also known as the "Cori Data Partition"
• Designed to accelerate data-intensive applications, with high throughput and "realtime" needs:
  – "shared" partition. Multiple jobs on the same node. Larger submit and run limits. 40 nodes set aside.
  – The 1-2 node bin in "regular" for high-throughput jobs. Large submit and run limits.
  – "realtime" partition for jobs requiring realtime data analysis. Highest queue priority. Special permission only.
  – Internal sshd (CCM mode) in any queue
  – Large number of login/interactive nodes to support applications with advanced workflows
  – "Burst buffer" usage integrated in SLURM, in the early user period
  – Users are encouraged to run jobs using 683+ nodes on Edison, with a queue priority boost and a 40% charging discount there
Charge Factors & Discounts
• Each machine has a "machine charge factor" (MCF) that multiplies the "raw hours" used
  – Edison MCF = 2.0
  – Cori MCF = 2.5
• Each QOS has a "QOS charge factor" (QCF)
  – premium QCF = 2.0
  – normal QCF = 1.0 (default)
  – low QCF = 0.5
  – scavenger QCF = 0
• On Edison:
  – Jobs requesting 683 or more nodes get a 40% discount
How Your Jobs Are Charged
• Your repository is charged for each node your job was allocated, for the entire duration of your job.
  – The minimum allocatable unit is a node (except for the "shared" partition on Cori). Edison has 24 cores/node and Cori has 32 cores/node.
  – Example: 4 Cori nodes for 1 hour with "premium" QOS:
    MPP hours = (4) * (32) * (1 hour) * (2.0) * (2.5) = 640 MPP hours
  – "shared" jobs are charged for the physical CPUs used instead of the entire node.
• If you have access to multiple repos, pick which one to charge in your batch script: #SBATCH -A repo_name

MPP hours = (# nodes) * (# cores/node) * (walltime used) * (QCF) * (MCF)
I submitted my job, but it is not running!
• That's what the batch queue is for!
• Your jobs will wait until the resources are available for them to run
• Your job's place in the queue is a mix of time and priority, so line jumping is allowed, but it may cost more
Cori Backlog Over Time
• Current backlog is ~8 days
• Plot obtained from MyNERSC
Average Queue Wait Time
• Historic data can be obtained from https://www.nersc.gov/users/queues/queue-wait-times/
Tips for Getting Better Throughput
• Line jumping is allowed, but it may cost more
• Submit shorter jobs; they are easier to schedule
  – Checkpoint if possible to break up long jobs
  – Short jobs can take advantage of 'backfill' opportunities
  – Run short jobs just before maintenance
• Very important: make sure the wall clock time you request is accurate
  – As noted above, shorter jobs are easier to schedule
  – Many users unnecessarily enter the largest wall clock time possible as a default
• Bundle jobs (multiple "srun"s in one script, sequential or simultaneous)
• Use the "shared" partition for serial jobs or very small parallel jobs
• Queue wait time statistics
  – https://www.nersc.gov/users/queues/queue-wait-times/
Places and Tools to Check Job Status
• Completed jobs web page
  – https://www.nersc.gov/users/job-logs-statistics/completed-jobs/
• MyNERSC Queues display
  – https://my.nersc.gov/queues.php?machine=cori&full_name=Cori
• Queue Wait Times
  – http://www.nersc.gov/users/queues/queue-wait-times/
• Scripts described on the Queue Monitoring page
  – sqs, squeue, sstat, sprio, etc.
  – https://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/
Not Covered in Detail Here
• TaskFarmer
  – Manage single-core ("serial") or multi-core jobs
• Using the "realtime" partition
• "xfer" jobs
  – Transfer data between Cori and the archive
  – Run on external login nodes
• Burst Buffer
  – Non-volatile storage that sits between memory and the file system
  – Accelerates IO performance
• Shifter
  – Run jobs in custom environments
• Advanced workflows
• Please refer to the Cori running jobs web page
  – http://www.nersc.gov/users/computational-systems/cori/running-jobs/
Demo and Hands-on Will Be on Edison
• Cori is currently down due to an OS upgrade to prepare for Phase 2 KNL support
• Edison is also a Cray system, with Intel Ivy Bridge processors
• Edison has 24 cores per node as compared to 32 cores per node on Cori
• Each Edison node has 64 GB of memory, as compared to 128 GB of memory for each Cori node
• To do the SLURM exercises:
  – % module load training
  – % cp -r $EXAMPLES/CoriP1/SLURM .   (notice the "dot" at the end)
Edison - Cray XC30
• 2.7 GB memory/core for applications
• /scratch disk quota of 10 TB
• 7.6 PB of /scratch disk
• Choice of full Linux operating system or optimized Linux OS (Cray Linux)
• Intel, Cray, and GNU compilers
• 133,824 cores, 5,576 nodes
• "Aries" interconnect
• 2 x 12-core Intel "Ivy Bridge" 2.4 GHz processors per node
• 24 processor cores per node, 48 with hyperthreading
• 64 GB of memory per node
• 357 TB of aggregate memory
Edison Compute Nodes
• Edison: NERSC Cray XC30, 5,576 nodes, 133,824 cores
• 2 NUMA domains per node, 12 cores per NUMA domain. 2 hardware threads per core.
• Memory bandwidth is non-homogeneous among NUMA domains.
More Information
• NERSC web pages
  – http://www.nersc.gov/users/computational-systems/cori/running-jobs/
  – http://www.nersc.gov/users/computational-systems/edison/running-jobs/
• SchedMD
  – http://www.schedmd.com/
  – Man pages for SLURM commands
• Contact NERSC Consulting
  – Toll-free 800-666-3772
  – 510-486-8611, option #3
  – [email protected]
Thank You