22
Cori and Edison Queues -1- Helen He NUG Meeting, 1/21/2016

Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Cori and Edison Queues

-1-

Helen He NUG Meeting, 1/21/2016

Page 2: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Goals for Cori and Edison •  Wheretorunwhattypeofjobsa2erCarverandHopperre7red?•  TheCoriPhase1(alsoknownasthe"CoriDataPar77on")systemis

designedtoacceleratedata-intensiveapplica7ons,withhighthroughputand“real7me”need.–  "shared”par--on.Mul-plejobsonthesamenode.Largersubmitandrunlimits.–  The1-2nodebininthe"regular"par--on(mimics“thruput”queueonHopper).

Largesubmitandrunlimits.–  “real-me”par--on.Highestqueuepriority.Specialpermissiononly.–  “burstbuffer”capability,inearlyuserperiod.–  Maxwall-melimitforCoriincreasedto48hrs(from24hrs)yesterday

•  Edison’spurposeisthesupportoflargejobs–  EdisonisthelargestNERSCsystem.–  Largerjobsareboostedforqueuepriority.–  Jobsuse683+nodesonEdisonget40%chargingdiscount.–  Edisonqueuestructureislargelysimplified.

•  ThesegoalshavebeencommunicatedwithusersinweeklynewsleMerandpublishedonNERSCwebsite.

-2-

Page 3: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Cori Queues

-3-

Page 4: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Edison Queues

-4-

Page 5: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

SLURM on Cori and Edison

•  Thispresenta7onwillfocusmoreonCori.•  UsershavebeenonCoriwithSLURMlonger

–  Cori:allusersfrom11/12/2015–  Edison:allusersfrom01/04/2016–  MoreexperiencetuningSLURMconfigura-onsonCori

•  Corihasmorecomplicatedqueuestructures–  Exci-ngnewfeaturescomplicatesscheduling

•  EdisonandCorisharesimilarSLURMconfigura7ons.•  LessonslearnedfromCoriareappliedtoEdison,and

viceversa.

-5-

Page 6: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

SLURM Configuration is Ongoing •  BeforeAY16startsonJan12,wemostlyfocusedon

installingCori,movingEdison,andperformingini7aldeploymentsofSLURM.

•  A2erthemoveandalloca7onyearpolicychangesarein,we'vefocusedalotondetailedqueueturn-around,u7liza7onandschedulingofworkloadinanefficientmanner.–  Extremelysuccessfulinfixingtheissuesthatwerepresentintheini-al

configura-ons•  Wewillbetuningtowardsmoreuserfacingissues,suchas

reliablerankingsofthequeue,end-of-jobprocessing,andenablingnewfeaturestoallowuserstocon7nuerunningoncetheirrepohasbeenexhausted.

•  Userfeedbackandcommentsarealwayswelcome

-6-

Page 7: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

“shared” Partition on Cori •  Usersseemanyjobsin“shared”,appearstouse1nodeper

job(displayedwiththequeuemonitoringscripts),actuallyNOT.

•  Serialjobsorsmallparalleljobsaresharedonthesenodes.•  40nodesaresetasideforthe“shared”jobs.•  “shared”jobsdonotrunonothernodescurrently(may

changeinthefuture).•  Highsubmitlimits(2500)andrunlimits(500).•  Jobsaregedngverygoodthroughput.•  “shared”jobsarenotchargedbyen7renode,butbyactual

physicalcoresused.

-7-

Page 8: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

“realtime” Partition on Cori

•  Specialpermissiontouse“real7me”forreal-7meneedofdataintensiveworkflows.

•  Highestpriorityfor“real7me”jobssotheystartalmostimmediately.Couldbedisrup7vetooverallqueuescheduling.

•  “real7me”jobscanrunin“shared”or“exclusive”modefornodeusage.

•  8nodesaresetasideforthe“real7me”jobs(currently)•  “real7me”jobscanrunonothernodes.

-8-

Page 9: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Two SLURM Schedulers are in Work

•  InstantScheduler(eventtriggered)–  Performsaquickandsimpleschedulinga^emptateventssuchasjobsubmissionorcomple-onandconfigura-onchanges.

•  BackfillScheduler(atsetintervals)–  Considerspendingjobsinpriorityorder,determiningwhenandwhereeachwillstart,takingintoconsidera-onthepossibilityofjobpreemp-on,gangscheduling,genericresource(GRES)requirements,memoryrequirements,etc.

–  Ifthejobunderconsidera-oncanstartimmediatelywithoutimpac-ngtheexpectedstart-meofanyhigherpriorityjob,thenitdoesso.

-9-

Page 10: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

SLURM Limits and Priority Tunings •  Noseparatequeuesfor“premium”,“low”,etc.Thesearenow

availableviaQOSsedngsin“regular”par77on.•  No“idle”limitsconcept.

–  Alljobsinthequeueareeligible,except•  Userheldjobs,priorityvalueis0.

–  Dependencyjobs,priorityvalueisnot0,butdonotage•  Limitsandpoliciesenforcedtoensurefairness

–  Maxsubmitlimit–  Maxrunlimit–  Totalnodesnumbernodesperpar--on/QOS–  Backfillinterval–  Maxbackfillperuser(userssubmihngmanyjobswon’thaveadvantage)–  Maxbackfillperpar--on–  Maxtotalremainingwall-me*nodesfromallrunningjobs(usedpreviously)–  Fairsharepolicy(basedonremainingalloca-onandusagebeforeAY16,

basedonrecentusageandmuchlowerweightnow)

-10-

Page 11: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Shorter Queues After Charging Began

•  ManymorejobsweresubmiMedduringfree7me.–  Backlogsarelarge

•  ChargingbeganatAY16start–  jobswithnoac-verepowerecancelled–  Userscancelledownjobsthatwouldnotliketobecharged–  Jobsubmissionlimitsweredecreased

•  Usereduca7on–  communicatedwithindividualuserstousethe“shared”par--on,job

arrays,andbundlingjobs.

-11-

Page 12: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Job Wait Time Improves Significantly on Cori •  UserscomplainedaboutVERYLONGwait7meforjobs•  ChangesweremadefromJan15

–  Addedmaxnumberofbackfilljobsperpar--on(ontopofmaxnumberofbackfilljobsperuser)significantlyimprovedthebacklogfordebugjobs.

–  Itallowslowerprioritydebugjobstorunaheadofregularjobsthathavehigherabsolutevalueofpriority.

–  Decreasedmaxsizeofdebugfrom128to112.•  Mostdebugjobsnowstartwithin30min,manymuch

shorter!•  Theregularjobswait7mearesignificantlysmallertoo

–  Addi-onaltuning:•  Increasedmaxbackfillintervalfrom30to150sec•  Tunedmaxbackfilljobsperuser,andmaxbackfillperpar--on

–  Usersdeletemorejobssubmi^edduringfree-me•  BacklogonCoriisnowonly~4days

-12-

Page 13: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Backlogs on Cori •  Currentbacklogis4days.•  Hugesubmissionsfrom2usersincreasedbacklogssignificantly.

–  Oneusersubmitmany512nodesjobs,each24hrs.increasedbacklogfrom40to92days

–  Anotherusersubmi^eda1000-tasklargearrayjob,with1hrwall-melimit,laterincreasedto12hrs-melimit,increasedbacklogfrom33to83to644days.

–  Althoughbacklogscausedfromsuchsubmissionsareshownhigh,theywon’taffectschedulingforotherusersjobssignificantly,sincethelimitswehavesetwillbasicallycausemostofthesejobsnotbeingconsideredforscheduling.

-13-

Page 14: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Average Wait Time for Debug Jobs on Cori

-14-

1/12/16–1/15/16 1/16/16-1/20/1611/30/15-1/11/16

Page 15: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Current Debug Jobs on Cori

-15-

Page 16: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Average Wait Time for Regular Jobs on Cori (1)

-16-

11/30/15–1/11/16,Edisonmovestartedon11/30/15,Hopperre-redon12/15/15

Page 17: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Average Wait Time for Regular Jobs on Cori (2)

-17-

Dec16–Jan11

1/12/16–1/15/16,AY16startedon1/12/16

Page 18: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Average Wait Time for Regular Jobs on Cori (3)

-18-

Jan16-20,2016,aoerchangesmadeonJan15

Dec16–Jan11

Page 19: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

New “sqs” with 2 Columns of Priority Ranking •  Anewversionof“sqs”(aNERSCcustomqueuemonitoringscript)deployedon

Jan19.Original“sqs”hasonecolumnforrankingbasedonstart7meprovidedbythebackfillscheduler.

•  “sqs”indefault,onlyshowsuser’sownjobs•  “sqs-a”showsalljobs•  Othersampleop7ons:

–  “sqs-a-pdebug”(showonlydebugjobs)–  “sqs-a-nr-npshared”(norunningjobs,nosharedjobs)–  “sqs-w”(showallmyjobsinwideformatwithmoreinfo)–  “sqs–s”(shortsummaryofqueuedjobs)

•  Thisversionprovidestwocolumnsofrankingvaluestogiveusersmoreperspec7veoftheirjobsinqueue.–  ColumnRANK_Pshowstherankingwithabsolutepriorityvalue,whichisafunc-onof

par--onQOS,jobwait-me,andfairshare.Jobswithhigherprioritywon'tnecessarilyrunearlierduetovariousrunlimits,totalnodelimits,andbackfilldepthwehaveset.

–  ColumnRANK_BFshowstherankingusingthebestes-matedstart-me(ifavailable)atabackfillschedulingcycle(every150secnow),sotherankingisdynamicandchangesfrequentlyalongwiththechangesinthequeuedjobs.

–  Thefirstfewjobswithreasonbeing“Resources”arerankedbypriorityvalue,hencetheymatchinRANK_PandRANK_BFcolumns.

-19-

Page 20: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Sample “sqs” Output

-20-

%sqs-a-nr|more

Page 21: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

Places and Tools to Check Job Status

•  Completedjobswebpage:–  h^ps://www.nersc.gov/users/job-logs-sta-s-cs/completed-jobs/

•  MyNERSCQueuesdisplay–  h^ps://my.nersc.gov/queues.php?machine=cori&full_name=Cori

•  QueueWaitTimes–  h^p://www.nersc.gov/users/queues/queue-wait--mes/

•  ScriptsdescribedonQueueMonitoringPage(sqs,squeue,sstat,sprio,etc.)–  h^ps://www.nersc.gov/users/computa-onal-systems/cori/running-jobs/monitoring-jobs/

-21-

Page 22: Cori and Edison Queues - NERSC...“shared” Partition on Cori • Users see many jobs in “shared”, appears to use 1 node per job (displayed with the queue monitoring scripts),

A Few Tips to Get Faster Job Turnaround

•  Requestshorterwall7meifyoucan,donotuseallowedmaxwall7me.

•  Use“shared”par77onforserialjobsorverysmallparalleljobs.

•  Bundlejobs(mul7ple“sruns”inonescript,sequen7alorsimultaneously)

•  UseJobArrays(beMermanagingjobs,notnecessaryfasterturnaround.Eacharraytaskisconsideredasinglejobforscheduling.

-22-