NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

LoadLeveler vs. NQE/NQS: Clash of The Titans

NERSC User Services

Oak Ridge National Lab

6/6/00


NERSC Batch Systems

• LoadLeveler - IBM SP

• NQS/NQE - Cray T3E/J90’s

• This talk will focus on the MPP systems

• Using the batch system on the J90’s is similar to the T3E

• The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html

• The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html

• Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html


About the T3E

• 644 application processors (PEs)

• 33 command PEs

• Additional PEs for OS

• NQE/NQS jobs run on application PEs

• Interactive jobs (“mpprun” jobs) run on command PEs

• Single system image

• A single parallel job must run on a contiguous set of PEs

• A job will not be scheduled if there are enough idle PEs but they are fragmented throughout the torus


About the SP

• 256 compute nodes

• 8 login nodes

• Additional nodes for file system, network, etc.

• Each node has 2 processors that share memory

• Each node can have either 1 or 2 MPI tasks

• Each node runs full copy of AIX OS

• LoadLeveler jobs can run only on the compute nodes

• Interactive jobs (“poe” jobs) can run on either compute or login nodes


How To Use a Batch System

• Write a batch script
– must use keywords specific to the scheduler

– default values will be different for each site

• Submit your job
– commands are specific to scheduler

• Monitor your job
– commands are specific to scheduler

– run limits are specific to site

• Check results when complete

• Call NERSC consultants when your job disappears :o)
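
For example, a minimal cycle on the T3E looks like this (script name and task id are illustrative; the SP equivalents appear later in this talk):

% cqsub -la regular myscript
% cqstatl -a | grep $USER
% cqdel t7225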


T3E Batch Terminology

PE - processor element (a single CPU)

Torus - the high-speed connection between PEs. All communication between PEs must go through the torus.

Swapping - when a job is stopped by the system to allow a higher-priority job to run on those PEs. The job may stay in memory. Also called “gang-scheduling”.

Migrating - when a job is moved to a different set of PEs to better pack the torus.

Checkpoint - when a job is stopped by the system and an image is saved to be restarted at a later time.


More T3E Batch Terminology

Pipe Queue - a queue in the NQE portion of the scheduler. It determines which batch queues the job may be submitted to. The user must specify the pipe queue on the cqsub command line if it is anything other than “regular”.

Batch Queue - a queue in the NQS portion of the scheduler. The batch queues are served in a first-fit manner. The user should not specify a batch queue on the command line or in the script.
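
For example, to route a job to the debug pipe queue instead of the default (script name illustrative):

% cqsub -la debug myscript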


NQS/NQE

• Developed by Cray

• Very complex set of scheduling parameters

• Complicated to understand

• Fragile

• Powerful and flexible

• Allows checkpoint/restart


What NQE Does

• Users submit jobs to NQE

• NQE assigns each job a unique identifier called the task id and stores the job in a database

• The status of the job is “NPend”

• NQE examines various parameters and decides when to pass the job to the LWS (lightweight server)

• The LWS then submits the job to an NQS batch queue (see next slide for NQS details)

• After the job completes, NQE retains the job information for about 4 hours


What NQS Does

• NQS receives a job from the LWS

• The job is placed in a batch queue determined by the number of requested PEs and the requested time

• The status of the job is now “NSubm”

• NQS batch queues are served in a first-fit manner

• When the job is ready to be scheduled, it is sent to the GRM (global resource manager)

• At this point the status of the job is “R03”

• The job may be stopped for checkpointing or swapping but still have a “running” status in NQS


NQS/NQE Commands

• cqsub - submit your job

% cqsub -la regular script_file
Task id t7225 inserted into database nqedb.

• cqstatl - monitor your NQE job

• qstat - monitor your NQS job

• cqdel - delete your queued or running job

% cqdel t7225


Sample T3E Batch Script

#QSUB -s /bin/csh #Specify C Shell for 'set echo'

#QSUB -A abc #charge account abc for this job

#QSUB -r sample #Job name

#QSUB -eo -o batch_log.out #Write error and output to single file.

#QSUB -l mpp_t=00:30:00 #Wallclock time

#QSUB -l mpp_p=8 #PEs to be used (Required).

ja #Turn on Job Accounting

mpprun -n 8 ./a.out < data.in #Execute on 8 PEs reading data.in

ja -s #Print Job Accounting summary
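
Since the pipe queue defaults to “regular”, this script can be submitted without the -la option (file name illustrative):

% cqsub sample_script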


Monitoring Your Job on the T3E

% cqstatl -a | grep jimbob
t4417 l441h4 scheduler.main jimbob NQE Database NPend

t4605 (1259.mcurie) l513v8 lws.mcurie jimbob nqs@mcurie NSubm

t4777 (1082.mcurie) l541l2 monitor.main jimbob NQE Database NComp

t4884 (1092.mcurie) l543l1 lws.mcurie jimbob nqs@mcurie NSubm

t4885 (1093.mcurie) l545l1 lws.mcurie jimbob nqs@mcurie NSubm

t4960 l546 scheduler.main jimbob NQE Database NPend

% qstat -a | grep jimbob
1259.mcurie l513v8 jimbob pe32@mcurie 2771 26 255 1800 R03
1092.mcurie l543l1 jimbob pe32@mcurie 3416 26 252 1800 R03
1093.mcurie l545l1 jimbob pe32@mcurie 921 28672 1800 Qge


Monitoring Your Job on the T3E (cont.)

• Use commands pslist (see next slide) and tstat to check running jobs

• Using ps on a command PE will list all instances of a parallel job because the T3E has a single system image

% mpprun -n 4 ./a.out
% ps -u jimbob
PID   TTY TIME  CMD
7523  ?   0:01  csh
7568  ?   12:13 a.out
16991 ?   12:13 a.out
16992 ?   12:13 a.out
16993 ?   12:13 a.out


Monitoring Your Job on the T3E (cont.)

S USER RK APID JID PE_RANG NPE TTY TIME CMD STATUS

- -------- -- -------- ------ ------- --- -------- -------- ------------ -------

a user1 0 29451 29786 000-015 16 ? 02:50:32 sander

b buffysum 0 29567 29787 016-031 16 ? 02:57:45 osiris.e

<snip>

ACTIVE PEs = 631

q buffysum 1 18268 29715 146-161 16 ? 00:42:28 osiris.e Swapped 1 of 16

r miyoung 1 77041 28668 172-235 64 ? 03:52:11 vasp

s buffysum 1 53202 30069 236-275 40 ? 00:18:16 osiris.e Swapped 1 of 40

t willow 1 51069 27914 276-325 50 ? 00:53:03 MicroMag.

u hal 1 77007 30569 326-357 32 ? 00:26:09 alknemd

ACTIVE PEs = 266 BATCH = 770 INTERACTIVE = 12

WAIT QUEUE:

user uid gid acid Label Size ApId Command Reason Flags

giles 13668 2607 2607 - 64 55171 xlatqcdp Ap. limit a----

bobg 14721 2751 2751 - 54 68936 Cmdft Ap. limit a----

jimbo 15761 3009 3009 - 32 77407 pop.8x4 Ap. limit af---


Possible Job States on the T3E

ST      Job State    Description
R03     Running      The job is currently running.
NSubm   Submitted    The job has been submitted to the NQS scheduler and is being considered to run.
NPend   Pending      The job is still residing in the NQE database and is not being considered to run. This is probably because you already have 3 jobs in the queue.
NComp   Completed    The job has completed.
NTerm   Terminated   The job was terminated, probably due to an error in the batch script.


Current Queue Limits on the T3E

Pipe Q       Batch Q        MAX PE   Time
debug        debug_small    32       33 min
             debug_medium   128      10 min
production   pe16           16       4 hr
             pe32           32       4 hr
             pe64           64       4 hr
             pe128          128      4 hr
             pe256          256      4 hr
             pe512          512      4 hr
long         long128        128      12 hr
             long256        256      12 hr


Queue Configuration on the T3E

Time (PDT)   Action
7:00 am      long256 stopped
             pe256 stopped
10:00 pm     pe512 started
             long128, pe128 stopped and checkpointed
             pe64, pe32, pe16 run as backfill
1:00 am      pe512 stopped and checkpointed
             long256, pe256, long128, pe128 started


LoadLeveler

• Product of IBM

• Conceptually very simple

• Few commands and options available

• Packs system well with backfilling algorithm

• Allows MIMD jobs

• Does not have checkpoint/restart, so running jobs cannot be checkpointed to favor certain other jobs


SP/LoadLeveler Terminology

Keyword - used to specify your job parameters (e.g. number of nodes and wallclock time) to the LoadLeveler scheduler

Node - a set of 2 processors that share memory and a switch adapter. NERSC users are charged for exclusive use of a node.

Job ID - the identifier for a LoadLeveler job, e.g. gs01013.1234.0.

Switch - a high-speed connection between the nodes. All communication between nodes goes through the switch.

Class - a user submits a batch job to a particular class. Each class has a different priority and different limits.


What LoadLeveler Does

• Jobs are submitted directly to LoadLeveler

• The following keywords are set:
– node_usage = not_shared

– tasks_per_node = 2

• The user can override tasks_per_node but not node_usage

• Incorrect keywords and parameters are passed silently to the scheduler!

• NERSC only checks for valid repo and class names

• Prolog script creates $SCRATCH and $TMPDIR directories and environment variables
– $SCRATCH is a global (GPFS) filesystem and $TMPDIR is local
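
For example, a script can override the default to run one MPI task per node (e.g., to give each task all of a node’s memory); a sketch, with the node count illustrative:

#@ tasks_per_node = 1
#@ node = 16

This runs 16 tasks on 16 nodes instead of the default 32.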


LoadLeveler Commands

• llsubmit - submit your job

% llsubmit script_file

llsubmit: The job "gs01007.nersc.gov.101" has been submitted.

• llqs - monitor your job

• llq - get details about one of your queued or running jobs

• llcancel - delete your queued or running job

% llcancel gs01005.84.0

llcancel: Cancel command has been sent to the central manager.
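
For a detailed listing of a single job, llq accepts a job ID with the long-format flag (job ID illustrative):

% llq -l gs01005.84.0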


Sample SP Batch Script

#!/usr/bin/csh

#@ job_name = myjob

#@ account_no = repo_name

#@ output = myjob.out

#@ error = myjob.err

#@ job_type = parallel

#@ environment = COPY_ALL

#@ notification = complete

#@ network.MPI = css0,not_shared,us

#@ node_usage = not_shared

#@ class = regular

#@ tasks_per_node = 2

#@ node = 32

#@ wall_clock_limit = 01:00:00

#@ queue

./a.out < input
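
This script is submitted with llsubmit (file name illustrative):

% llsubmit myjob.ll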


Monitoring Your Job on the SP

gseaborg% llqs
Step Id          JobName         UserName Class   ST NDS WallClck Submit Time

---------------- --------------- -------- ------- -- --- -------- -----------

gs01007.1087.0 a240 buffy regular R 32 00:31:44 3/13 04:30

gs01001.529.0 s1.x willow regular R 64 00:28:17 3/12 21:45

gs01001.578.0 xdnull xander debug R 5 00:05:19 3/14 12:44

gs01009.929.0 gs01009.nersc.g spike regular R 128 03:57:27 3/13 05:17

gs01001.530.0 s2.x willow regular I 64 04:00:00 3/12 21:48

gs01001.532.0 s3.x willow regular I 64 04:00:00 3/12 21:50

gs01001.533.0 y1.x willow regular I 64 04:00:00 3/12 22:17

gs01001.534.0 y2.x willow regular I 64 04:00:00 3/12 22:17

gs01001.535.0 y3.x willow regular I 64 04:00:00 3/12 22:17

gs01001.537.0 gs01001.nersc.g spike regular I 128 02:30:00 3/13 06:10

gs01009.930.0 gs01009.nersc.g spike regular I 128 02:30:00 3/13 07:17


Monitoring Your Job on the SP (cont.)

• Issuing a ps command will show only what is running on that login node, not any instances of your parallel job

• If you could issue a ps command on a compute node running 2 MPI tasks of your parallel job, you would see:

gseaborg% ps -u jimbob
UID   PID   TTY TIME  CMD
14397 9444  -   58:37 a.out
14397 10878 -   0:00  pmdv2
14397 11452 -   0:00  <defunct>
14397 15634 -   0:00  LoadL_starter
14397 16828 -   58:28 a.out
14397 19696 -   0:00  pmdv2
14397 19772 -   0:02  poe
14397 20878 -   0:00  <defunct>


Possible Job States on the SP

ST   Job State    Description
R    Running      The job is currently running.
I    Idle         The job is being considered to run.
NQ   Not Queued   The job is not being considered to run. This is probably because you have submitted more than 10 jobs.
ST   Starting     The job is starting to run.
HU   User Hold    The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling.
HS   System Hold  The job was put on hold by the system. This is probably because you are over disk quota in $HOME.
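
For example, to release a job from user hold so it is again considered for scheduling (job ID illustrative):

% llhold -r gs01005.84.0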


Current Class Limits on the SP

CLASS        NODE   TIME     PRIORITY
debug        16     30 min   20000
premium      256    4 hr     10000
regular      256    4 hr     5000
low          256    4 hr     1
interactive  8      20 min   15000

Same configuration runs all the time.


More Information

• Please see NERSC Web documentation

• The IBM batch system: http://hpcf.nersc.gov/running_jobs/ibm/batch.html

• The Cray batch system: http://hpcf.nersc.gov/running_jobs/cray/batch.html

• Batch differences between IBM and Cray: http://hpcf.nersc.gov/running_jobs/ibm/lldiff.html