
IBM® Scheduler for High Throughput Computing on IBM Blue Gene®/P

Table of Contents

Introduction
Architecture
    simple_sched daemon
    startd daemon
    End-user commands
Personal HTC Scheduler
Using HTC Scheduler with Tivoli Workload Scheduler LoadLeveler®
    Using HTC Scheduler with LoadLeveler version 3.5 and later
    Using HTC Scheduler with LoadLeveler before version 3.5
Service setup
Configuration
    Configuration options
Daemons
    simple_sched daemon
        Command-line options
        Shutting down
    startd daemon
        Command-line options
        Submit plug-in
        Shutting down
End-user commands
    qcmd
        Commands
        Immediate mode
        Interactive mode
        Response format
            Submit ID response
            Submit status response
            Scheduler status response
            Request rejected response
    qsub
    qstat
    qdel
    Submitted job states
    Note about when state info is available
run_simple_sched_jobs
    Command files
    Output
    Signal handling
    Positional parameters
    LoadLeveler integration
    Configuration


Introduction

The HTC Scheduler is a simple scheduler for High Throughput Computing (HTC) jobs on Blue Gene/P. HTC on Blue Gene/P provides the ability to run independent, single-node tasks on each node in a partition. For information on the setup, configuration, and use of HTC on Blue Gene/P, refer to the IBM System Blue Gene Solution: Blue Gene/P System Administration Redbook (SG24-7417) and IBM System Blue Gene Solution: Blue Gene/P Application Development Redbook (SG24-7287).

An HTC application may involve more tasks than there are nodes in the partition. In this situation, some tasks must wait until another task finishes before they can be submitted. A resource manager, or scheduler, automates this task. The HTC Scheduler is an implementation of a resource scheduler that was specifically designed to work in the Blue Gene/P's HTC environment. The HTC Scheduler is capable of reliably and efficiently running a large number of HTC jobs on a Blue Gene/P system.

The parts that make up the HTC Scheduler are the end-user command line utilities (qsub, qstat, qdel, and qcmd), the simple_sched daemon, and the startd daemon. Also available is a utility for running a batch of jobs through a “personal” instance of the HTC Scheduler.


Architecture

Figure 1 shows the architecture of the HTC Scheduler, which includes a single simple_sched daemon, multiple startd daemons, and several instances of end-user commands (qcmd, qsub, qstat, and qdel).

simple_sched daemon

This daemon waits for the client programs (qcmd, qsub, qstat, qdel, and startd) to contact it. When a startd client connects, the simple_sched daemon adds it to a pool of startd clients to which it can assign new jobs. When an end-user program makes a request, the daemon handles that request. For example, for a new job request (sent by the qsub program), it assigns the job a submit ID and puts the job on a queue; when the job reaches the front of the queue, the daemon assigns the submitted command to a startd client. The simple_sched daemon can run on the service node or a front end node.

startd daemon

The startd daemon connects to the simple_sched daemon. When simple_sched sends it a job to run, startd forks off a process, and in the child process sets up the environment (sets the gid and uid), and execs submit with the job-specific command line options. As the submit process runs, it notifies the startd daemon of the state of the job through the submit plugin (for example, the HTC scheduler's submit plugin is called when the job ID is assigned). When the submit process ends, startd retrieves the exit status and sends the job result information to simple_sched. A single startd process can have multiple submits forked and running at the same time. The startd daemon will be calling submit, so the computer it's running on must have a submit multiplexer (submit_mux) running and configured (typically a front end node).

End-user commands

The end-user commands qcmd, qsub, qstat, and qdel are used to send commands to the simple_sched daemon. There are commands available for submitting a new job, getting the status of a submitted job, canceling a job, and performing administrative functions. These are typically run from a front end node, but can also be compiled to run on a workstation.

Figure 1: HTC Scheduler architecture (components shown: the end-user commands qsub, qstat, qdel, and qcmd; the simple_sched daemon; two startd daemons, each with three submit processes)


Personal HTC Scheduler

Users may want to run HTC jobs on their own partition using a personal instance of the HTC Scheduler (possibly under the direction of the LoadLeveler scheduler). This is made easier by the provided run_simple_sched_jobs command, which starts a personal instance of the HTC Scheduler and startd, executes commands either specified in command files or read from stdin, and exits when the commands have all completed. It creates a personal configuration file that can be used when submitting jobs externally.

In this example, a user wants to run the “location” program several times with different arguments.

First, create a text file in which each line contains the program to run and its arguments. For example, the text file cmds.run contains:

-- location argsfor1
-- location argsfor2
-- location argsfor3
... one line for each

If you want the stdout and stderr for the program to go to a different location, put the -stdout-file and -stderr-file parameters before the executable. You can also use this feature to discard output.

-stdout-file=loc_out1.out -stderr-file=loc_out1.err -- location.aout argsfor1
-stdout-file=/dev/null -stderr-file=/dev/null -- location.aout argsfor2

The first step in running jobs using run_simple_sched_jobs is to boot the partition. The htcpartition utility is provided to boot a partition in HTC mode from the command line. Refer to the IBM System Blue Gene Solution: Blue Gene/P Application Development Redbook (SG24-7287) for full documentation of the htcpartition utility. The --configfile parameter tells htcpartition to create a file that run_simple_sched_jobs can read to get the pool name and pool size.

$ htcpartition --boot --partition R00-M0-N14 --mode SMP --configfile my_config.cfg

Use run_simple_sched_jobs to start up a personal instance of HTC Scheduler and run your commands:

$ run_simple_sched_jobs -config my_config.cfg cmds.run

To run more commands using the same configuration file, pass -reuse-config to run_simple_sched_jobs:

$ run_simple_sched_jobs -config my_config.cfg -reuse-config cmds2.run
$ htcpartition --free --partition R00-M0-N14

To have run_simple_sched_jobs read the commands from stdin, use - as the command file. Using this method, the commands to run can be generated by another program or script:

$ ./gen_cmds.py | run_simple_sched_jobs -config my_config.cfg -
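The generator can be any program that writes one command per line in the command-file format. As a minimal sketch, the shell function below plays the role of gen_cmds.py; the function name gen_cmds and the task count are assumptions made for illustration, and the output file names follow the earlier location.aout example.

```shell
#!/bin/bash
# Hypothetical generator: writes one command-file line per task to stdout.
# Each line redirects stdout/stderr to per-task files, then names the
# program and its arguments after the "--" separator.
gen_cmds() {
    local count=$1
    local i=1
    while [ "$i" -le "$count" ]; do
        printf '%s\n' "-stdout-file=loc_out$i.out -stderr-file=loc_out$i.err -- location.aout argsfor$i"
        i=$((i + 1))
    done
}

# Print three command lines; in practice the output would be piped to
# run_simple_sched_jobs with "-" as the command file.
gen_cmds 3
```

Generating the lines on the fly avoids maintaining a large command file by hand when the tasks differ only in their arguments.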

To have the personal HTC Scheduler continue running after all the command files have been run, use the -keep-running option:

shell_1$ run_simple_sched_jobs -config my_config.cfg -keep-running

shell_2$ qsub -config my_config.cfg my_program arg1 arg2

run_simple_sched_jobs prints out a line to stdout whenever it's notified that a command completed, either successfully or unsuccessfully.

1 -| 1 is COMPLETED exit status 0
2 -| 2 is COMPLETED exit status 0
...

The first number is the request ID (the line number of the command in the command files) and the second is the submit ID supplied by the simple_sched daemon. Note that commands may complete in a different order than they were submitted.
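Because each completion is reported on its own line, the output is easy to post-process. The following is a sketch only: the function name summarize is hypothetical, and it assumes every completion line ends with "is COMPLETED exit status <n>" exactly as in the sample output; lines in any other format are ignored.

```shell
#!/bin/bash
# Hypothetical helper: reads run_simple_sched_jobs output on stdin and
# counts completions with a zero and a non-zero exit status.
summarize() {
    awk '
        / is COMPLETED exit status / {
            # The exit status is the last field on the line.
            if ($NF == 0) ok++; else bad++
        }
        END { printf "ok=%d nonzero=%d\n", ok, bad }
    '
}

# Example with two sample lines.
printf '%s\n' \
    '1 -| 1 is COMPLETED exit status 0' \
    '2 -| 2 is COMPLETED exit status 3' | summarize
```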

Once all the commands in the command files have completed, run_simple_sched_jobs prints out a summary containing the number of jobs that completed successfully, the number of jobs that completed with non-zero exit status, and the number of jobs that failed to run due to an error.

For details on using run_simple_sched_jobs, see the run_simple_sched_jobs chapter.


Using HTC Scheduler with Tivoli Workload Scheduler LoadLeveler®

The LoadLeveler scheduler was enhanced in version 3.5 to enable use of Blue Gene/P's HTC mode with the HTC Scheduler as a meta-scheduler. This makes integrating the HTC Scheduler into a LoadLeveler workflow much easier and more efficient. Refer to the following sections, depending on your version of LoadLeveler, for instructions on configuring LoadLeveler and creating a job command file (JCF) to submit HTC jobs.

Figure 2 illustrates how the HTC Scheduler “glides in” to LoadLeveler. LoadLeveler has selected and created a partition in the Blue Gene/P using the Control System (Bridge) API. LoadLeveler's Central Manager tells a LoadLeveler startd to run this JCF. The LoadLeveler startd executes run_simple_sched_jobs, which starts a simple_sched server, an HTC Scheduler startd, and a qsub program, then reads the cmds.run input file, converting those lines into calls to qsub. The HTC Scheduler startd process executes several submits in parallel; these contact the submit mux, which communicates with the control system processes, which in turn cause the program to run on the compute nodes.

Figure 2: HTC Scheduler as LoadLeveler glide-in (components shown: the cmds.run file, run_simple_sched_jobs, qsub, simple_sched, startd (SIMPLE), submit, startd (LL), starter, the FEN, the submit mux, the Central Manager, the service node with the Control System API, DB2, and control system processes, and the Blue Gene machine running HTC jobs)

Using HTC Scheduler with LoadLeveler version 3.5 and later

LoadLeveler version 3.5 provides new features that make using the HTC Scheduler under LoadLeveler easier. This section describes the new Job Command File (JCF) keywords and behavior available in LoadLeveler.

LoadLeveler provides a bg_partition_type keyword in the JCF that specifies whether the partition will be booted for HPC or HTC jobs and, if the partition is booted for HTC jobs, the mode in which the jobs will run. The values for bg_partition_type are as follows:

• HPC – HPC jobs, this is the default value if bg_partition_type isn't present

• HTC_SMP – HTC jobs in SMP mode

• HTC_DUAL – HTC jobs in Dual mode

• HTC_VN – HTC jobs in Virtual Node mode

• HTC_LINUX_SMP – HTC jobs in Linux / SMP mode

Note that “Linux / SMP” mode might not be available on every Blue Gene/P system.

The bg_user_list keyword is used to specify the users that can run jobs on the partition. This can be set to a space-separated list of user names or the special value “ALL” to allow any user to run on the partition. If not specified, only the step owner can submit jobs to the partition.

The bg_partition and bg_user_list keywords are not inherited by other job steps.

LoadLeveler sets several environment variables when it runs the job. run_simple_sched_jobs uses these variables to boot the partition chosen by LoadLeveler in the correct mode.

Example 1 contains a sample LoadLeveler JCF file that uses run_simple_sched_jobs:

#!/bin/bash

#@ job_name = htc_glide_in_32
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ job_type = bluegene
#@ bg_size = 32
#@ bg_partition_type = HTC_VN
#@ bg_user_list = user1 user2 user3
#@ queue

/bgsys/opt/simple_sched/bin/run_simple_sched_jobs cmds.txt

Example 1: Sample LoadLeveler 3.5 JCF

Modify the sample JCF to run your HTC application. The sections highlighted in bold should be changed to suit your application. The job_type must be “bluegene” so that LoadLeveler will run the script on a front end node and allocate a partition.

LoadLeveler also provides commands that can be used to display the partition type and user list information of an HTC partition. For example, llstatus will show the partition type as “HTC (SMP)” if the partition is booted in HTC mode for SMP jobs.


At present, LoadLeveler doesn't re-use partitions booted in HTC mode even if the cache partitions option is enabled. LoadLeveler will automatically free the partition after the job has ended.

Using HTC Scheduler with LoadLeveler before version 3.5

In order to use the HTC Scheduler with LoadLeveler prior to version 3.5, LoadLeveler must be configured to not cache Blue Gene partitions. This is done by setting BG_CACHE_PARTITIONS=false in the LoadL_config file. Refer to the “Tivoli Workload Scheduler LoadLeveler” documentation for information regarding this configuration option. This requirement has been removed with LoadLeveler 3.5.

The HTC Scheduler can be called from a LoadLeveler JCF to submit a batch of HTC jobs to a partition managed by LoadLeveler. The JCF should look like Example 2.

Modify the sample JCF to run your commands. The sections highlighted in bold should be changed to suit your application. The job_type must be “bluegene” so that LL will run the script on a FEN. The FEN that LoadLeveler chooses must have the submit_mux running.

The script in the JCF uses htcpartition to boot the partition. Set the mode parameter to htcpartition to match the mode that your application will run in (DUAL, LINUX, SMP, or VN).

The script uses run_simple_sched_jobs to run all the commands in cmds.txt. A shell trap is set up so that when the script exits, htcpartition will free the partition.


#!/bin/bash

#@ job_name = htc_glide_in_32
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ job_type = bluegene
#@ bg_size = 32
#@ queue

# Frees the partition; installed below as an EXIT trap.
function free_partition() {
    /bgsys/drivers/ppcfloor/bin/htcpartition --free
}

export RUN_JOBS_CONFIG_FILE=my_simple_sched.cfg

# Boot the partition in HTC mode; stop if the boot fails.
/bgsys/drivers/ppcfloor/bin/htcpartition --boot --mode SMP --configfile "$RUN_JOBS_CONFIG_FILE"
if [ $? != 0 ]; then
    echo "Booting HTC partition failed."
    exit 1
fi

trap free_partition EXIT

/bgsys/opt/simple_sched/bin/run_simple_sched_jobs cmds.txt

Example 2: Sample LoadLeveler JCF

Service setup

Administrators may want to have a single pool that users can submit HTC jobs to without going through another scheduler like LoadLeveler. In this case, follow the instructions in this section to set up the HTC Scheduler to run as a system service. Once the following steps are completed, end users can run qsub to submit jobs, qstat to see the status of their job, and qdel to remove a job from the queue. Note that if users are only using personal instances of the HTC Scheduler, service setup is not necessary.

1. Customize the configuration file

Copy the configuration file /bgsys/opt/simple_sched/etc/simple_sched.cfg to /bgsys/local/etc/simple_sched.cfg. Edit /bgsys/local/etc/simple_sched.cfg and set the following options:

● Set scheduler_hostname to your SN hostname. For example, "mysn.mydomain"
● Set pool_name to the pool that you will be using for HTC jobs. For example, "R00-M0"
● Set pool_size to the number of nodes in the pool. For example, "1mpV" (= 1 midplane in VN mode)

You will probably not have to change the other options.

2. Start the HTC Scheduler server daemon on the SN

Create a symlink to the init script in /etc/init.d, install the daemon to start automatically at the default run levels, and start it manually:

# ln -s /bgsys/opt/simple_sched/etc/init.d/ibm.com-simple_sched_server /etc/init.d/
# /usr/lib/lsb/install_initd -v ibm.com-simple_sched_server
# /etc/init.d/ibm.com-simple_sched_server start

3. Start the HTC Scheduler startd daemons on the submit nodes

The submit nodes are any system where the submit mux is running (i.e., FENs). At least one of these systems must be designated to run the startd daemon. The number of systems required depends on the size of the pool.

The startd daemons will be starting the submit program supplied with the Blue Gene software, which must be able to load the HTC Scheduler's submit plug-in. The location of the submit plug-in, /bgsys/opt/simple_sched/lib, needs to be configured in the dynamic linker (ld.so). This is typically done by creating a text file in /etc/ld.so.conf.d and running ldconfig.

Create a symlink to the startd daemon init script in /etc/init.d, install the daemon to start automatically at the default run levels, and start it manually:

# ln -s /bgsys/opt/simple_sched/etc/init.d/ibm.com-simple_sched_startd /etc/init.d/
# /usr/lib/lsb/install_initd -v ibm.com-simple_sched_startd
# /etc/init.d/ibm.com-simple_sched_startd start

4. Configure the end-user environment

Change the system environment so that the end-user command line utilities (qsub, qstat, and qdel) are available. The users' PATH should include /bgsys/opt/simple_sched/bin. This is usually done by creating a script in /etc/profile.d.
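A profile script for this purpose might look like the sketch below. The file name /etc/profile.d/simple_sched.sh and the variable name simple_sched_bin are assumptions for illustration; only the directory /bgsys/opt/simple_sched/bin comes from this document.

```shell
# Hypothetical /etc/profile.d/simple_sched.sh (the file name is an assumption).
# Appends the HTC Scheduler bin directory to PATH unless it is already there,
# so sourcing the script more than once does not create duplicate entries.
simple_sched_bin=/bgsys/opt/simple_sched/bin
case ":$PATH:" in
    *":$simple_sched_bin:"*) ;;                  # already present, nothing to do
    *) PATH="$PATH:$simple_sched_bin" ;;
esac
export PATH
unset simple_sched_bin
```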


Configuration

Most configuration values can be set using:

● an option on the command line. For example, -scheduler-service 12345
● an environment variable. For example, “SIMPLE_SCHEDULER_SERVICE=12345 ; export SIMPLE_SCHEDULER_SERVICE”
● a line in the configuration file. For example, scheduler_service_name=12345

If a configuration value can be set using multiple methods, the command line option takes precedence over the environment variable, which takes precedence over the config file.
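The precedence rule can be pictured with the short sketch below. It only illustrates the ordering and is not the scheduler's own code; the function name resolve_setting and its argument order are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: returns the first non-empty value among the
# command-line option, the environment variable, the config-file entry,
# and the built-in default, in that order of precedence.
resolve_setting() {
    local cli=$1 env=$2 file=$3 default=$4
    if [ -n "$cli" ]; then
        echo "$cli"
    elif [ -n "$env" ]; then
        echo "$env"
    elif [ -n "$file" ]; then
        echo "$file"
    else
        echo "$default"
    fi
}

# No command-line option given, so the environment value wins over the file.
resolve_setting "" "12345" "23456" "simple_htc_scheduler"   # prints 12345
```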

The HTC Scheduler programs use a configuration file. The file that's used is either (in order of preference):

1. specified on the command line using the -config parameter
2. specified using the SIMPLE_SCHED_CONFIG_FILE environment variable
3. if present, the current directory config file, ./simple_sched.cfg
4. if present, the system config file, /bgsys/local/etc/simple_sched.cfg
5. if present, the install config file, /bgsys/opt/simple_sched/etc/simple_sched.cfg

Typically, the administrator will have copied the configuration file from /bgsys/opt/simple_sched/etc/simple_sched.cfg to /bgsys/local/etc/simple_sched.cfg and changed any configuration options necessary for the local system.
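The configuration file search order described in this section can be sketched as a shell function. This is an illustration of the lookup order only, not the scheduler's implementation; the function name find_config is hypothetical, and its single argument stands in for the value of the -config parameter (empty if not given).

```shell
#!/bin/bash
# Hypothetical helper: prints the configuration file that would be chosen.
# $1 is the value passed with -config, or empty if the option was not used.
find_config() {
    local cli_config=$1
    local candidate
    if [ -n "$cli_config" ]; then
        echo "$cli_config"
        return 0
    fi
    if [ -n "${SIMPLE_SCHED_CONFIG_FILE:-}" ]; then
        echo "$SIMPLE_SCHED_CONFIG_FILE"
        return 0
    fi
    # Fall back to the first well-known location that exists.
    for candidate in ./simple_sched.cfg \
                     /bgsys/local/etc/simple_sched.cfg \
                     /bgsys/opt/simple_sched/etc/simple_sched.cfg; do
        if [ -e "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

find_config "" || echo "no configuration file found" >&2
```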

Configuration options

This section describes the configuration options.

Configuration file name

Description The configuration file to use. If not specified, the following are tried in this order:

1. if present, the current directory config file, ./simple_sched.cfg
2. if present, the system config file, /bgsys/local/etc/simple_sched.cfg
3. if present, the install config file, /bgsys/opt/simple_sched/etc/simple_sched.cfg

Format File name, see open()

Environment variable SIMPLE_SCHED_CONFIG_FILE

Command-line -config <filename>

Scheduler service name

Description The service name (port) that the server will listen on and the clients will attempt to contact. The default value is "simple_htc_scheduler".

Format Service name, see getaddrinfo()

Configuration file option scheduler_service_name

Environment variable SIMPLE_SCHEDULER_SERVICE

Command-line -scheduler-service <service-name>


Scheduler host name

Description The host name of the system that the server is running on.

Format Host name, see getaddrinfo()

Configuration file option scheduler_hostname

Environment variable SIMPLE_SCHEDULER_HOSTNAME

Command-line -scheduler-hostname <hostname>

Pool name

Description The name of the pool to run HTC jobs on. Defaults to “default_pool”.

Format Pool name, see the Blue Gene System Administration Redbook (SG24-7417)

Configuration file option pool_name

Environment variable SIMPLE_SCHEDULER_POOL

Command-line -pool <pool-name>

Pool size

Description Describes the partitions in the pool. For each partition in the pool the scheduler must be told its size and mode using this format:
[<count>][<type>][<mode>]
where at least one of these must be present, and
• count is a number (default is 1)
• type is a hardware type, “n”=node, “nc”=node card, “mp”=midplane, “r”=rack (default is “n”)
• mode is the mode of the partition, “D”=dual, “L”=Linux, “S”=SMP, “V”=virtual node (default is “S”)
Separate partition descriptions with a space.
Example partition: “1ncS” = 1 node card in SMP mode (32 nodes to run on)
Example pool: “1mpS 1ncV” = 1 midplane booted in SMP mode and 1 node card booted in virtual node mode.

Format Pool size, see description

Configuration file option pool_size

Environment variable SIMPLE_SCHEDULER_POOL_SIZE

Command-line -pool-size <pool-size>
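To make the pool size format concrete, the sketch below splits one partition description into its count, type, and mode, applying the documented defaults. It is not the scheduler's parser: the function name parse_partition is hypothetical, and the node counts per hardware type (32 per node card, 512 per midplane, 1024 per rack) are assumptions based on Blue Gene/P hardware, of which only the node card size appears in this document.

```shell
#!/bin/bash
# Hypothetical helper: parses one [<count>][<type>][<mode>] description.
parse_partition() {
    local desc=$1 count rest mode type nodes_per_unit
    [ -n "$desc" ] || { echo "empty partition description" >&2; return 1; }

    count=${desc%%[!0-9]*}            # leading digits, possibly empty
    rest=${desc#"$count"}             # whatever follows the count
    case $rest in
        *[DLSV]) mode=${rest#"${rest%?}"}; rest=${rest%?} ;;  # trailing mode letter
        *)       mode=S ;;                                    # default mode
    esac
    case $rest in
        '')        type=n ;;                                  # default type
        n|nc|mp|r) type=$rest ;;
        *) echo "invalid partition description: $desc" >&2; return 1 ;;
    esac
    count=${count:-1}                                         # default count
    # Strip leading zeros so the count is not read as an octal number.
    while [ "${#count}" -gt 1 ] && [ "${count#0}" != "$count" ]; do count=${count#0}; done

    case $type in                     # assumed Blue Gene/P sizes
        n)  nodes_per_unit=1 ;;
        nc) nodes_per_unit=32 ;;
        mp) nodes_per_unit=512 ;;
        r)  nodes_per_unit=1024 ;;
    esac
    echo "count=$count type=$type mode=$mode nodes=$((count * nodes_per_unit))"
}

# Parse each partition of the example pool "1mpS 1ncV".
for part in 1mpS 1ncV; do
    parse_partition "$part"
done
```

The node total is the number of compute nodes only; how many jobs each node can run in a given mode is outside the scope of this sketch.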

Submit path


Description The path to the submit program. Defaults to “/bgsys/drivers/ppcfloor/bin/submit”.

Format Executable name, see exec()

Configuration file option submit_path

Environment variable SIMPLE_SCHEDULER_SUBMIT_PATH

Command-line -submit-path <filename>

Submit Options

Description Additional options to set when calling submit. The startd daemon will put these options on the submit command when it executes submit in addition to the arguments it uses. The default is empty. If the options aren't valid then submitted jobs will fail.

Format Command-line options, like "-trace 0"

Configuration file option submit_args

Environment variable SIMPLE_SCHEDULER_SUBMIT_ARGS

Command-line -submit-args <arguments>

simple_sched daemon PID file

Description The path to use for simple_sched's PID file. Defaults to “/var/run/simple_sched.pid”.

Format File name

Configuration file option startd_pid_file

Environment variable SIMPLE_SCHEDULER_PID_FILE

Command-line -pid-file <filename>

startd daemon PID file

Description The path to use for startd's PID file. Defaults to “/var/run/startd.pid”.

Format File name

Configuration file option startd_pid_file

Environment variable SIMPLE_SCHEDULER_STARTD_PID_FILE

Command-line -pid-file <filename>

Verbose

Description The verbose level for log output. If not present, then no logging will be done. If present with no value, the level is "notice". The levels available are, from most to least verbose:

• debug, D, 7
• info, I, 6
• notice, N, 5
• warning, W, 4
• err, E, 3
• crit, 2 (not used)
• alert, 1 (not used)
• emerg, 0 (not used)

Format Verbose level, see description

Command-line -verbose[=<level>]
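As an illustration of the level table, the following sketch maps a -verbose argument to its numeric level. It assumes that the name, the single-letter form, and the number in each row are interchangeable spellings of the same level; the helper name verbose_level is hypothetical.

```python
# Rows of the level table: (name, letter or None, number).
_LEVELS = [
    ("debug", "D", 7),
    ("info", "I", 6),
    ("notice", "N", 5),
    ("warning", "W", 4),
    ("err", "E", 3),
    ("crit", None, 2),
    ("alert", None, 1),
    ("emerg", None, 0),
]

# Assumption: every spelling in a row selects the same level.
_LOOKUP = {}
for _name, _letter, _number in _LEVELS:
    for _alias in (_name, _letter, str(_number)):
        if _alias is not None:
            _LOOKUP[_alias] = _number


def verbose_level(option):
    """Map a '-verbose[=<level>]' argument to its numeric level."""
    if option == "-verbose":
        return _LOOKUP["notice"]  # present with no value means "notice"
    prefix = "-verbose="
    if not option.startswith(prefix) or option[len(prefix):] not in _LOOKUP:
        raise ValueError(f"not a valid verbose option: {option!r}")
    return _LOOKUP[option[len(prefix):]]
```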


Daemons

The following section provides details on the daemons that implement the HTC Scheduler.

simple_sched daemon

This section provides details on the simple_sched daemon.

Command-line options

In addition to the command-line options to override configuration options, the following options are available when starting the simple_sched daemon.

-accept-sd=SD

Description The simple_sched daemon accepts connections on the supplied socket descriptor. The default is to open a socket to accept client connections on. The socket descriptor must be an integer.

-log-to-stdout

Description The simple_sched daemon will log to stdout. The default is to log using the syslog() API.

-suspend

Description The simple_sched daemon will start in “suspended” state. It will not assign jobs to startd daemons until resumed. The default is to start in “running” state. To resume, use “qcmd resume”.

-pick-port

Description The simple_sched daemon will pick an ephemeral port to use. The default is to use the configured service name.

-pid-file-required[=optional|required|skip]

Description Tells the simple_sched daemon how to handle the PID file. The default is optional if the option is not used, or required if the option is used with no value. Allowed values are:

• optional - Try to create the PID file, but continue if it cannot be created
• required - Create the PID file, and fail if it cannot be created
• skip - Do not create the PID file

-boot


Description simple_sched will execute /bgsys/drivers/ppcfloor/bin/htcpartition to boot the partition. htcpartition must be able to get its boot parameters from the mpirun plugin.

Shutting down

The simple_sched daemon can be shut down in one of three ways:

• Very Slow - No more jobs will be accepted. simple_sched will wait until the submit queue is empty and all outstanding submits are complete. Trigger this by signaling with SIGINT (CTRL-C).

• Slow - No more jobs will be accepted and the submit queue will be cleared. simple_sched will wait until all outstanding jobs are complete. Trigger this by signaling with SIGQUIT (CTRL-\).

• Quick - Just exits, not waiting for jobs to complete. Trigger this by signaling with SIGTERM.
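The three shutdown modes can be triggered from a script by signaling the daemon. The sketch below assumes that the PID file contains the daemon's process ID as a decimal number, which this document does not state; the helper names are hypothetical.

```python
import os
import signal

# Shutdown mode -> signal, as listed above.
SHUTDOWN_SIGNALS = {
    "very slow": signal.SIGINT,
    "slow": signal.SIGQUIT,
    "quick": signal.SIGTERM,
}


def read_pid(pid_file="/var/run/simple_sched.pid"):
    """Read a process ID from a PID file (assumed to hold one decimal number)."""
    with open(pid_file) as handle:
        return int(handle.read().strip())


def shut_down(mode, pid_file="/var/run/simple_sched.pid"):
    """Send the signal for the requested shutdown mode to the daemon."""
    os.kill(read_pid(pid_file), SHUTDOWN_SIGNALS[mode])
```

For example, shut_down("slow") would send SIGQUIT to the process named in the default simple_sched PID file.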

startd daemon

This section provides details on the startd daemon.

Command-line options

In addition to the command-line options to override configuration options, the following options are available when starting the startd daemon.

-log-to-stdout

Description The startd daemon will log to stdout. The default is to log using the syslog() API.

-pid-file-required[=optional|required|skip]

Description Tells the startd daemon how to handle the PID file. The default is optional if the option is not used, or required if the option is used with no value. Allowed values are:

• optional - Try to create the PID file, but continue if it cannot be created
• required - Create the PID file, and fail if it cannot be created
• skip - Do not create the PID file

Submit plug-in

The startd daemon uses the submit plug-in to get information about the job back from the submit program. The submit program will call functions in the submit plug-in when a job ends. If the job failed, the data provided on the function call will include the reason for the failure. The submit command uses dlopen to load the submit plug-in, so the shared library containing the submit plug-in must be configured in the dynamic linker. Configuring the submit plug-in shared library in the dynamic linker can be done in several ways, including use of LD_LIBRARY_PATH, and editing the ld.so.conf. The HTC Scheduler's submit plug-in is located in /bgsys/opt/simple_sched/lib/libsubmit_if.so. (Note that run_simple_sched_jobs sets the LD_LIBRARY_PATH, so if the HTC Scheduler is run only through run_simple_sched_jobs then no extra configuration is required.)

The submit plug-in provided by the HTC Scheduler also prevents other users from submitting jobs to the pool it's configured to use. It does this by reading the local HTC Scheduler configuration file, /bgsys/local/etc/simple_sched.cfg, and if the pool entered on the command line is not set, or it's the same pool as is in the local configuration file, then it returns non-zero and submit will fail.

Shutting down

The startd daemon can be shut down in one of three ways:

1. Very slow – startd tells simple_sched to stop sending work; then waits until all submits have finished. Trigger this by sending SIGINT (CTRL-C).

2. Slow – startd tells simple_sched to stop sending work; all current submits will get SIGTERM and should end quickly; then waits until all the submits have finished. Trigger this by sending SIGQUIT (CTRL-\).

3. Quick – startd exits without sending results. Trigger this by sending SIGTERM.


End-user commands

qcmd

qcmd can be used to send commands to the HTC Scheduler. If qcmd can't connect to the simple_sched daemon it will exit with an error message and non-zero exit status.

Commands

Listed here are the commands accepted by qcmd. Following the list of commands is a description of the response types.

submit [OPTION]... COMMAND...

Description Submit a job to run.

Response Job status if -wait, or submit ID if no -wait

Options -mode=MODE

Description The mode that the job requires.

Parameter A mode, one of “DUAL”, “LINUX”, “SMP”, or “VN”.

Default The job can run in any mode. The server will check for an available HTC resource in this order: VN, DUAL, SMP, LINUX.

-restartable

Description Indicates that the job can be restarted if it fails.

Default The job cannot be restarted.

-cwd=DIRECTORY

Description The working directory.

Parameter A directory name.

Default The current working directory.

-exp_env=NAME

Description Export an environment variable to the job. This can be used multiple times to export multiple variables.

Parameter An environment variable name

Default The environment variable is not exported to the job

-env_all


Description Export all environment variables to the job

Default No environment variables are exported

-env=NAME=VALUE[ NAME=VALUE]

Description Define environment variables for the job. This can be used multiple times to define multiple environment variables.

Parameter A space-separated list of NAME=VALUE pairs

-name=NAME

Description The name for the job. This is used as the base name for the output files.

Parameter The name can be any value that can be used in a file name

Default “submit”

-stdin-file=FILE

Description The file from which to read standard input.

Parameter A file name. If the file name is not a full path then the file is opened from <cwd>. This file must be readable when the program runs.

Default /dev/null

-stdout-file=FILE

Description The file to which to write standard output. If this option is specified, the file will not be removed even if it is empty.

Parameter A file name. If the file name is not a full path then the file is opened from <cwd>.

Default <name>-<submit-id>.out

-stderr-file=FILE

Description The file to which to write standard error. If this option is specified, the file will not be removed even if it is empty.

Parameter A file name. If the file name is not a full path then the file is opened from <cwd>.

Default <name>-<submit-id>.err

-wait

Description Wait for results.


Default Do not wait for results.

status [-wait] <submit-id>|all

Description Display the state of a job or all jobs if “all” is specified.

Response job status

Option -wait

Description Wait for results.

Default Do not wait for results.

cancel <submit-id>

Description Cancel a submitted job.

Response job status

scheduler_status

Description Display the scheduler status.

Response scheduler status

suspend

Description The simple_sched daemon will stop assigning jobs until “resume”.

Response scheduler status

resume

Description The simple_sched daemon will resume assigning jobs.

Response scheduler status

help [<command>]

Description Display the command help summary or detailed help for the specified command

Immediate mode

If a command is supplied on the qcmd command-line, it will execute that command and exit. This can be seen in the following examples:

$ qcmd scheduler_status
[running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]

$ qcmd submit test.cna
Submit id: 6

Interactive mode

qcmd will operate in interactive mode when a command is not supplied on the command line. In this mode qcmd reads commands from stdin. Commands are processed asynchronously: a command creates a request that is sent to the simple_sched daemon, the next request can be sent before the previous command completes, and the responses to the requests may be received out of order.

When qcmd generates a request from a command, a unique request ID is generated, and the request ID and the command name are displayed, for example:

$ cat cmds.txt
suspend
submit test

$ cat cmds.txt | qcmd
1 <- suspend
2 <- submit

When qcmd receives a response from the simple_sched daemon, it prints out the request ID and the response info. Continuing the previous example, the two requests received these responses:

1 -| [running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]
2 -| Submit id: 7

When the response is not the last response for a request, qcmd will display "->" after the request ID, whereas if it is the last response for a request, qcmd will display "-|" after the request ID.
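A client that drives qcmd in interactive mode has to tell these three kinds of line apart. The following sketch classifies a line by its marker; the function name and the labels "request", "partial", and "final" are hypothetical, and the single-space layout is taken from the examples above.

```python
import re

# "<request-id> <marker> <rest>", where the marker is "<-", "->", or "-|".
_LINE_RE = re.compile(r"^(\d+) (<-|->|-\|) (.*)$")

_KINDS = {
    "<-": "request",  # qcmd echoing the request it just sent
    "->": "partial",  # a response that is not the last one for the request
    "-|": "final",    # the last response for the request
}


def parse_qcmd_line(line):
    """Split one line of interactive qcmd output into (request_id, kind, info)."""
    match = _LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unrecognized qcmd output: {line!r}")
    request_id, marker, info = match.groups()
    return int(request_id), _KINDS[marker], info
```

Because responses may arrive out of order, a caller would typically key its bookkeeping on the request ID and treat a request as finished only when a "final" line arrives for it.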

When stdin is closed, qcmd will continue receiving responses from the simple_sched daemon until the simple_sched daemon indicates that there are no more outstanding requests. Users can take advantage of this behavior to submit several jobs by writing "submit -wait" commands to qcmd, closing stdin, and then waiting for qcmd to exit, at which point all the submitted jobs have completed. The run_simple_sched_jobs command uses this feature.

Response format

The format for the response info depends on the type of the response.

Submit ID response

If you do submit without -wait, qcmd will print out the response like "Submit ID: <submit-id>". For example,

$ qcmd
submit test.cna
1 <- submit
1 -| Submit ID: 1


Submit status response

If you do a “submit -wait” or status command, qcmd will print out the current state and, if the state is COMPLETED, the exit status, and, if set, the error message. The format is "<submit-id> is <status>[ exit status <exit-status>][ error message '<error-msg>']".

$ qcmd
status 1
1 <- status 1
1 -| 1 is COMPLETED exit status 0

$ qcmd
submit -wait test
1 <- submit [wait]
1 -> 2 is QUEUED
1 -> 2 is ASSIGNED
1 -| 2 is COMPLETED exit status 0
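The documented status format can be matched with a regular expression. This sketch covers only the format string given above; it does not handle the bracketed location details or the "term signal" form that appear in the qsub, qstat, and qdel samples. The function name is hypothetical.

```python
import re

# "<submit-id> is <status>[ exit status <exit-status>][ error message '<error-msg>']"
_STATUS_RE = re.compile(
    r"^(?P<submit_id>\d+) is (?P<status>[A-Z]+)"
    r"(?: exit status (?P<exit_status>-?\d+))?"
    r"(?: error message '(?P<error_msg>.*)')?$"
)


def parse_submit_status(info):
    """Parse a submit status response into a dict; absent fields are None."""
    match = _STATUS_RE.match(info)
    if match is None:
        raise ValueError(f"not a submit status response: {info!r}")
    exit_status = match.group("exit_status")
    return {
        "submit_id": int(match.group("submit_id")),
        "status": match.group("status"),
        "exit_status": int(exit_status) if exit_status is not None else None,
        "error_msg": match.group("error_msg"),
    }
```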

Scheduler status response

If the response is a scheduler status, qcmd will print the response using the following format:

[<submit_thread_status>[ booting] <submit_queue_status> <submits_status> <resource_pool_status>]

where

• submit_thread_status is either "running" or "suspended"

• “booting” will be displayed if the server is waiting for htcpartition to finish booting the partition

• submit_queue_status is "(submit queue=<count>)". This is the number of jobs in the submit queue

• submits_status is "(submits=assigned:<count> completed:<count> notzero:<count> error:<count> canceled:<count>)". This shows the number of jobs that are currently assigned, that have completed, that ended with non-zero exit status, that did not run due to an error, and that were canceled

• resource_pool_status is "(htc resources=smp:<avail>/<total> dual:<avail>/<total> vn:<avail>/<total> linux:<avail>/<total>)"

$ qcmd scheduler_status
[running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]
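A monitoring script could turn a scheduler status response into structured data along the following lines. This is a sketch based only on the format described above; the function name and the keys of the returned dictionary are hypothetical.

```python
import re

_SCHED_RE = re.compile(
    r"^\[(?P<thread>running|suspended)(?P<booting> booting)? "
    r"\(submit queue=(?P<queue>\d+)\) "
    r"\(submits=(?P<submits>[^)]*)\) "
    r"\(htc resources=(?P<resources>[^)]*)\)\]$"
)


def parse_scheduler_status(text):
    """Parse a scheduler status response into a dict."""
    match = _SCHED_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a scheduler status response: {text!r}")
    # "assigned:0 completed:5 ..." -> {"assigned": 0, "completed": 5, ...}
    submits = {}
    for item in match.group("submits").split():
        name, count = item.split(":")
        submits[name] = int(count)
    # "smp:32/32 dual:0/0 ..." -> {"smp": (32, 32), "dual": (0, 0), ...}
    resources = {}
    for item in match.group("resources").split():
        mode, counts = item.split(":")
        avail, total = counts.split("/")
        resources[mode] = (int(avail), int(total))
    return {
        "thread": match.group("thread"),
        "booting": match.group("booting") is not None,
        "queue": int(match.group("queue")),
        "submits": submits,
        "resources": resources,
    }
```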

Request rejected response

The HTC Scheduler may reject a request. One example of when a request would be rejected is the server is shutting down and a new job is submitted. The format for this is "Request failed '<error-msg>'".

$ qcmd
submit test.cna
1 <- submit
1 -| Request failed 'submit rejected because shutting down'


qsub

qsub is simply a symbolic link to qcmd. When qcmd is called and the program name is qsub, it performs a “submit” command. Refer to the documentation on qcmd's submit command for the parameters to qsub.

Following is sample output of qsub:

$ qsub test.cna
Submit id: 2
$ qsub -wait test.cna
3 is QUEUED
3 is ASSIGNED
3 is ASSIGNED [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
3 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qsub -stdout-file=/dev/null -stderr-file=/dev/null -cwd /bgusr/myhome -env=HTC=true -- /bgusr/myhome/test.cna -opt1=opt1value
Submit id: 4

qstat

qstat is used to get the status of a submitted job. qstat is simply a symbolic link to qcmd. When qcmd is called and the program name is qstat, it performs a “status” command. Refer to the documentation on qcmd's status command for the parameters to qstat.

Following is sample output of qstat:

$ qstat 4
4 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qstat 7
Status for 7 is not available.
$ qstat -wait 9
6 is ASSIGNED [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
6 is COMPLETED exit status 0 [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
$ qstat -wait all
7 is QUEUED
7 is ASSIGNED
7 is ASSIGNED [location='R00-M0-N14-J59-C00' jobId=12345 partition='R00-M0-N14']
7 is COMPLETED exit status 0 [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
8 is QUEUED
8 is ASSIGNED
8 is ASSIGNED [location='R00-M0-N14-J44-C00' jobId=12345 partition='R00-M0-N14']
8 is COMPLETED exit status 0 [location='R00-M0-N14-J20-C00' jobId=12345 partition='R00-M0-N14']

qdel

qdel is used to cancel a submitted job. qdel is simply a symbolic link to qcmd. When qcmd is called and the program name is qdel, it performs a “cancel” command. Refer to the documentation on qcmd's cancel command for the parameters to qdel.


Following is sample output of qdel:

$ qdel 10
10 is CANCELED
$ qdel 11
11 is CANCELING [location='R00-M0-N14-J23-C00' jobId=12345 partition='R00-M0-N14']
$ qdel 11
11 is COMPLETED term signal 9 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qdel 12
12 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']

In the previous examples, submit ID 10 was queued when it was canceled; 11 was ASSIGNED when it was canceled the first time, and completed the second time; 12 had already exited when it was canceled.

Submitted job states

The states that a job can be in are as follows:

• QUEUED - The job is in the queue and will run when an HTC resource and startd are available. An error message may be available if the job failed and was requeued. When a job in this state is canceled, it goes to CANCELED state.

• ASSIGNED - The job is assigned to a startd to run. Information supplied by the submit program may be available (for example, the Blue Gene job ID and location). When a job in this state is canceled, it goes to CANCELING state.

• COMPLETED - The job has completed normally and has an exit status.

• CANCELING - The job has been canceled and the startd has been told to kill it. When it exits, it should go to COMPLETED state with the exit status indicating it was killed with a signal.

• CANCELED - The job was canceled without running.

• ERROR - The HTC Scheduler wasn't able to run this job and may have an error message.

Figure 3 illustrates the possible transitions between states.

Figure 3: Submitted job states (QUEUED, ASSIGNED, CANCELING, COMPLETED, ERROR, CANCELED)

Note about when state info is available

When a job's state is reported to a client and the job state was an end state (COMPLETED, CANCELED, or ERROR; the right-most states in Figure 3), knowledge of the submitted job will be cleared from the server. Any further request for state using the submit ID will get a response of 'not found'.
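The cancel behavior and the end states described in this section can be summarized in a few lines. This is a sketch, not the scheduler's implementation: the names are hypothetical, and leaving every other state unchanged on cancel is an assumption that is consistent with the qdel sample output but is not stated as a rule.

```python
# End states: once one of these is reported to a client, the server
# clears its knowledge of the job.
END_STATES = frozenset({"COMPLETED", "CANCELED", "ERROR"})

# What a cancel request does, per the state descriptions.
_CANCEL_TRANSITIONS = {
    "QUEUED": "CANCELED",      # canceled without running
    "ASSIGNED": "CANCELING",   # startd is told to kill the job
}


def state_after_cancel(state):
    """Return the state a job moves to when it is canceled.

    Assumption: states other than QUEUED and ASSIGNED are left unchanged.
    """
    return _CANCEL_TRANSITIONS.get(state, state)


def is_end_state(state):
    """True if reporting this state clears the job from the server."""
    return state in END_STATES
```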


run_simple_sched_jobs

run_simple_sched_jobs starts a personal instance of the HTC Scheduler and runs commands through it. To do this, it opens an ephemeral port (that is, Linux picks one that is not in use) and creates an HTC Scheduler configuration file, which is based on the base HTC Scheduler configuration with these options replaced:

• scheduler_hostname -- is set to the current system's host name
• scheduler_service_name -- is set to the ephemeral port number
• pool_name -- is set to the configured pool name
• pool_size -- is set to the configured pool size

Note that if the -reuse-config option is specified, then only the scheduler_service_name is changed and the base configuration is not used.

Next, run_simple_sched_jobs forks and execs a simple_sched process with options to use the new config file and accept connections on the socket descriptor that run_simple_sched_jobs opened to listen on the ephemeral port. Then run_simple_sched_jobs forks and execs a startd process which uses the new config file. If there are any command files specified on the command line, run_simple_sched_jobs forks and execs a qcmd process using the new config file, to which run_simple_sched_jobs writes a "submit -wait" command for each command. Output from qcmd is parsed to look for completion messages, which are echoed to stdout. Once all the command files have been processed, and if the “keep running” option is not set, run_simple_sched_jobs signals the child processes to quit. It then waits for the simple_sched and startd child processes to exit.

Tip: Use the --configfile option when booting the partition with htcpartition, then pass that same config file as the -config parameter to run_simple_sched_jobs which will read the pool name and pool size configuration options from this file.

Command files

run_simple_sched_jobs reads command files. Each line of a command file contains options to qcmd's submit command, an “end of options” indicator, the program to run, and the arguments to the command.

A line in the command file can contain arguments for qcmd's submit command. Refer to qcmd's submit command options on page 21 for the options to the submit command. After the options to qcmd's submit command in the line, the user should put an “end of options” indicator, “--”. If the “end of options” indicator is left off, any arguments to the program that start with “-” will be interpreted as options for the submit command.

The following command file line will start the “location.cna” program with the -print argument; any output will be discarded because the -stdout-file and -stderr-file options are specified for qcmd's submit command:

-stdout-file=/dev/null -stderr-file=/dev/null -cwd /bgusr/myhome -- location.cna -print

Each command line is converted into a qcmd submit command that is sent to the qcmd subprocess. qcmd will generate a request ID for the submitted job. Since request IDs are assigned starting at 1, the ID will match the line number.
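When command files are generated by a script, it is easy to forget the “end of options” indicator. The sketch below always emits it. The function name is hypothetical, and because this document does not describe any quoting rules for command files, the sketch rejects values that contain whitespace rather than guess at them.

```python
def command_file_line(program, program_args=(), submit_options=()):
    """Build one command-file line: submit options, "--", program, arguments.

    Assumption: no value contains whitespace, because no quoting rules
    for command files are documented here.
    """
    parts = list(submit_options) + ["--", program] + list(program_args)
    for part in parts:
        if not part or any(ch.isspace() for ch in part):
            raise ValueError(f"cannot represent {part!r} in a command file line")
    return " ".join(parts)


if __name__ == "__main__":
    print(command_file_line(
        "location.cna",
        program_args=["-print"],
        submit_options=["-stdout-file=/dev/null", "-stderr-file=/dev/null",
                        "-cwd", "/bgusr/myhome"],
    ))
```

Always emitting “--” means a program argument such as -print can never be mistaken for a submit option.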


Output

run_simple_sched_jobs prints out a line whenever it's notified that a submitted command has completed, or that a submitted command was rejected (this should be rare).

Completion lines look like:

1 -| 1 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']

This is the standard completion line from the qcmd program. The first number is the request ID (command number), which starts at 1 and is incremented for each command submitted. The second number is the submit ID that the simple_sched daemon assigned to the job.

When run_simple_sched_jobs exits it prints out a line summarizing the results, like this:

Submitted 128 jobs, 128 completed, 60 had non-zero exit status, 0 requests failed.
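A wrapper script might check the summary line to decide whether a run succeeded. The following sketch parses the line exactly as it is shown above; the function name and dictionary keys are hypothetical, and the wording for other counts (for example a single job) is assumed to be the same.

```python
import re

_SUMMARY_RE = re.compile(
    r"^Submitted (\d+) jobs, (\d+) completed, "
    r"(\d+) had non-zero exit status, (\d+) requests failed\.$"
)


def parse_summary(line):
    """Parse the run_simple_sched_jobs summary line into a dict of counts."""
    match = _SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    submitted, completed, nonzero, failed = (int(value) for value in match.groups())
    return {
        "submitted": submitted,
        "completed": completed,
        "nonzero": nonzero,
        "failed": failed,
    }
```

In the sample line, all 128 jobs completed but 60 of them exited with a non-zero status, so a caller that only compared "submitted" with "completed" would miss those failures.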

Signal handling

If run_simple_sched_jobs gets SIGINT (CTRL-C) it goes into slow shutdown mode. No new commands will be accepted, and all submitted jobs will complete. If run_simple_sched_jobs gets SIGQUIT (CTRL-\) or SIGTERM (kill) it goes into fast shutdown mode. No new commands will be accepted and all submitted jobs will be canceled.

Positional parameters

The positional parameters are the names of command files (see the command files section). If the name is - then run_simple_sched_jobs will read commands from stdin until it reads an end-of-file. If there are no positional parameters, then no commands will be run, which is only useful when using the -keep-running option.

LoadLeveler integration

run_simple_sched_jobs looks for environment variables set by LoadLeveler and changes its behavior when these environment variables are set. The environment variables set by LoadLeveler are LOADL_BG_PARTITION, LOADL_BG_SIZE, and LOADL_BG_PARTITION_TYPE. When these environment variables are set, run_simple_sched_jobs will create a temporary configuration file containing the pool name and pool size set from these values. It will also automatically pass -boot to simple_sched.

Configuration

There are several configuration options. When an option can be specified in multiple ways, a command-line option takes precedence over an environment variable.


Keep running

Description run_simple_sched_jobs will continue running after all command files have been processed. This can be used to submit jobs to the partition using qsub. Simply specify the same config file when invoking qsub.

Command-line option -keep-running

Default Exit after all jobs have completed.

Suspend

Description run_simple_sched_jobs will start the simple_sched daemon suspended. Use the "qcmd resume" command to cause the simple_sched daemon to start assigning jobs.

Command-line option -suspend

Default The simple_sched daemon will be started in “running” mode.

Boot

Description run_simple_sched_jobs will pass -boot to the simple_sched process.

Command-line option -boot

Default run_simple_sched_jobs will not pass -boot to the simple_sched process.

Configuration file

Description run_simple_sched_jobs will first attempt to read this file as an htcpartition output configuration file to get the partition information; then it will create or overwrite this file with the new HTC Scheduler configuration.

Command-line option -config <filename>

Environment variable RUN_JOBS_CONFIG_FILE

Default Use mkstemp() to create a temporary file whose name is like "my_simple_sched.cfg.XXXXXX". This file will be deleted when run_simple_sched_jobs exits.

Re-use configuration file

Description Tells run_simple_sched_jobs to re-use the configuration file from an earlier run rather than create a brand new one. This would be useful if calling run_simple_sched_jobs again for a single HTC boot.

Command-line option -reuse-config

Default The configuration file will not be reused.


Base configuration file

Description The configuration file to use for the base configuration. Your personal HTC Scheduler instance will use several of the options from this configuration file, for example, submit_path and submit_args.

Command-line option -base-config-file=<filename>

Environment variable SIMPLE_SCHED_CONFIG_FILE (see the HTC Scheduler configuration section above)

Default Search for the configuration file as described in the Configuration section starting on page 14.

Pool name

Description The pool that the HTC Scheduler will use.

Command-line option -pool-name=<pool-name>

Environment variable RUN_JOBS_POOL_NAME

Default
• If re-use config, gets from configuration file
• Otherwise, if the configuration file was created by htcpartition, gets from the configuration file
• Otherwise, there's no default and this must be specified

Pool size

Description The size of the pool that the HTC Scheduler will use. See the HTC Scheduler configuration for a description of the format (it also specifies the mode).

Command-line option -pool-size=<pool-size>

Environment variable RUN_JOBS_POOL_SIZE

Default
• If re-use config, gets from configuration file
• Otherwise, if the configuration file was created by htcpartition, gets from the configuration file
• Otherwise, there's no default and this must be specified

simple_sched daemon executable

Description The program that will be executed for the HTC Scheduler process. You will probably not have to change this.

Command-line option -simple-sched-exe=<executable>

Environment variable RUN_JOBS_SIMPLE_SCHED_EXE

Default /bgsys/opt/simple_sched/sbin/simple_sched


startd daemon executable

Description The program that will be executed for the startd process. You will probably not have to change this.

Command-line option -startd-exe=<executable>

Environment variable RUN_JOBS_STARTD_EXE

Default /bgsys/opt/simple_sched/sbin/startd

Qcmd executable

Description The program that will be executed for the qcmd process. You will probably not have to change this.

Command-line option -qcmd-exe=<executable>

Environment variable RUN_JOBS_QCMD_EXE

Default /bgsys/opt/simple_sched/bin/qcmd

Verboseness

Description Set the verboseness of run_simple_sched_jobs by setting the -verbose parameter. The verboseness for the child processes can be overridden using the following command-line options:

• -verbose-qcmd -- The qcmd process
• -verbose-simple-sched -- The simple_sched process
• -verbose-startd -- The startd process

See the verbose section in the Configuration section starting on page 14 for the allowed values.

Command-line option -verbose[=<verbose-level>]

Default The default for run_simple_sched_jobs is that no log messages will be displayed. For the child processes, the default log level is Warning.
