IBM® Scheduler for High Throughput Computing on IBM Blue Gene®/P
Table of Contents
Introduction
Architecture
    simple_sched daemon
    startd daemon
    End-user commands
Personal HTC Scheduler
Using HTC Scheduler with Tivoli Workload Scheduler LoadLeveler®
    Using HTC Scheduler with LoadLeveler version 3.5 and later
    Using HTC Scheduler with LoadLeveler before version 3.5
Service setup
Configuration
    Configuration options
Daemons
    simple_sched daemon
        Command-line options
        Shutting down
    startd daemon
        Command-line options
        Submit plug-in
        Shutting down
End-user commands
    qcmd
        Commands
        Immediate mode
        Interactive mode
        Response format
            Submit ID response
            Submit status response
            Scheduler status response
            Request rejected response
    qsub
    qstat
    qdel
    Submitted job states
    Note about when state info is available
run_simple_sched_jobs
    Command files
    Output
    Signal handling
    Positional parameters
    LoadLeveler integration
    Configuration
2 IBM Scheduler for HTC on IBM Blue Gene/P
Introduction
The HTC Scheduler is a simple scheduler for High Throughput Computing (HTC) jobs on Blue Gene/P. HTC on Blue Gene/P provides the ability to run independent, single-node tasks on each node in a partition. For information on the setup, configuration, and use of HTC on Blue Gene/P, refer to the IBM System Blue Gene Solution: Blue Gene/P System Administration Redbook (SG24-7417) and IBM System Blue Gene Solution: Blue Gene/P Application Development Redbook (SG24-7287).
An HTC application may involve more tasks than there are nodes in the partition. In this situation, some tasks must wait until another task finishes before they can be submitted. A resource manager, or scheduler, automates this task. The HTC Scheduler is an implementation of a resource scheduler that was specifically designed to work in the Blue Gene/P's HTC environment. The HTC Scheduler is capable of reliably and efficiently running a large number of HTC jobs on a Blue Gene/P system.
The parts that make up the HTC Scheduler are the end-user command line utilities (qsub, qstat, qdel, and qcmd), the simple_sched daemon, and the startd daemon. Also available is a utility for running a batch of jobs through a “personal” instance of the HTC Scheduler.
Architecture
Figure 1 shows the architecture of the HTC Scheduler, which includes a single simple_sched daemon, multiple startd daemons, and several instances of end-user commands (qcmd, qsub, qstat, and qdel).
simple_sched daemon
This daemon waits for the client programs (qcmd, qsub, qstat, qdel, and startd) to contact it. When a startd client connects to the simple_sched daemon, it puts the startd client into a pool to which it can assign new jobs. When an end-user program makes a request, it handles the request. For example, if it's a new job request (sent by the qsub program) it assigns the new job a submit ID and puts the job on a queue; when that job reaches the front of the queue, it will assign the submitted command to a startd client. The simple_sched daemon can run on the service node or a front end node.
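The queueing behavior described above can be pictured with a small model. This is only an illustrative sketch, not the daemon's implementation: the class name ToyScheduler, its method names, and the idea of counting free slots per startd are all hypothetical.

```python
from collections import deque


class ToyScheduler:
    """Illustrative model only: a FIFO job queue plus a pool of startd slots."""

    def __init__(self):
        self.next_submit_id = 1
        self.queue = deque()   # jobs waiting for a free slot
        self.free_slots = {}   # startd name -> free slots (hypothetical)
        self.running = {}      # submit ID -> startd name

    def add_startd(self, name, slots):
        """A startd client connected: add it to the pool."""
        self.free_slots[name] = slots
        self._dispatch()

    def submit(self, command):
        """A new job request arrived: assign a submit ID and queue the job."""
        submit_id = self.next_submit_id
        self.next_submit_id += 1
        self.queue.append((submit_id, command))
        self._dispatch()
        return submit_id

    def job_finished(self, submit_id):
        """A startd reported a result: free its slot and dispatch more work."""
        name = self.running.pop(submit_id)
        self.free_slots[name] += 1
        self._dispatch()

    def _dispatch(self):
        """Hand queued jobs, front first, to startd clients with free slots."""
        while self.queue:
            name = next((n for n, s in self.free_slots.items() if s > 0), None)
            if name is None:
                return
            submit_id, _command = self.queue.popleft()
            self.free_slots[name] -= 1
            self.running[submit_id] = name


sched = ToyScheduler()
sched.add_startd("startd-1", slots=2)
ids = [sched.submit(["pgm", str(i)]) for i in range(3)]
print(ids, sorted(sched.running), len(sched.queue))
```

With two slots and three submitted jobs, the third job stays on the queue until one of the first two finishes.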
startd daemon
The startd daemon connects to the simple_sched daemon. When simple_sched sends it a job to run, startd forks off a process, and in the child process sets up the environment (sets the gid and uid), and execs submit with the job-specific command line options. As the submit process runs, it notifies the startd daemon of the state of the job through the submit plugin (for example, the HTC scheduler's submit plugin is called when the job ID is assigned). When the submit process ends, startd retrieves the exit status and sends the job result information to simple_sched. A single startd process can have multiple submits forked and running at the same time. The startd daemon will be calling submit, so the computer it's running on must have a submit multiplexer (submit_mux) running and configured (typically a front end node).
End-user commands
The end-user commands qcmd, qsub, qstat, and qdel are used to send commands to the simple_sched daemon. There are commands available for submitting a new job, getting the status of a submitted job, canceling a job, and performing administrative functions. These are typically run from a front end node, but can also be compiled to run on a workstation.
[Figure 1: HTC Scheduler architecture. Components shown: simple_sched; the qsub, qstat, qdel, and qcmd commands; and two startd daemons, each with three submit processes.]
Personal HTC Scheduler
Users may want to run HTC jobs on their own partition using a personal instance of the HTC Scheduler (possibly under the direction of the LoadLeveler scheduler). This is made easier by the provided run_simple_sched_jobs command, which starts a personal instance of the HTC Scheduler and startd, executes commands either specified in command files or read from stdin, and exits when the commands have all completed. It creates a personal configuration file that can be used when submitting jobs externally.
In this example, a user wants to run the “location” program several times with different arguments.
First, create a text file in which each line contains the program to run and its arguments. For example, the text file cmds.run contains:
-- location argsfor1
-- location argsfor2
-- location argsfor3
... one line for each
If you want the stdout and stderr for the program to go to a different location, put the -stdout-file and -stderr-file parameters before the executable. You can also use this feature to discard output.
-stdout-file=loc_out1.out -stderr-file=loc_out1.err -- location.aout argsfor1
-stdout-file=/dev/null -stderr-file=/dev/null -- location.aout argsfor2
The first step in running jobs using run_simple_sched_jobs is to boot the partition. The htcpartition utility is provided to boot a partition in HTC mode from the command line. Refer to the IBM System Blue Gene Solution: Blue Gene/P Application Development Redbook (SG24-7287) for full documentation of the htcpartition utility. The --configfile parameter tells htcpartition to create a file that run_simple_sched_jobs can read to get the pool name and pool size.
$ htcpartition --boot --partition R00-M0-N14 --mode SMP --configfile my_config.cfg
Use run_simple_sched_jobs to start up a personal instance of HTC Scheduler and run your commands:
$ run_simple_sched_jobs -config my_config.cfg cmds.run
To run more commands using the same configuration file, pass -reuse-config to run_simple_sched_jobs:
$ run_simple_sched_jobs -config my_config.cfg -reuse-config cmds2.run
$ htcpartition --free --partition R00-M0-N14
To have run_simple_sched_jobs read the commands from stdin, use - as the command file. Using this method, the commands to run can be generated by another program or script:
$ ./gen_cmds.py | run_simple_sched_jobs -config my_config.cfg -
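The gen_cmds.py script itself is not shown in this document. As a rough sketch of what such a generator could look like, the following prints one command-file line per task in the format used by the cmds.run examples. The function names and the output file naming scheme are assumptions made for illustration, and arguments containing spaces are not quoted here.

```python
#!/usr/bin/env python3
"""Hypothetical gen_cmds.py: print one command-file line per task."""


def command_line(program, args, stdout_file=None, stderr_file=None):
    """Build one line: optional output options, then "--", program, arguments."""
    parts = []
    if stdout_file is not None:
        parts.append(f"-stdout-file={stdout_file}")
    if stderr_file is not None:
        parts.append(f"-stderr-file={stderr_file}")
    parts.append("--")
    parts.append(program)
    parts.extend(args)
    return " ".join(parts)


def generate(count):
    """Yield `count` lines for the "location" example program."""
    for i in range(1, count + 1):
        yield command_line(
            "location",
            [f"argsfor{i}"],
            stdout_file=f"loc_out{i}.out",   # assumed naming scheme
            stderr_file=f"loc_out{i}.err",
        )


# Write the lines to stdout so they can be piped to run_simple_sched_jobs.
for line in generate(3):
    print(line)
```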
To have the personal HTC Scheduler continue running after all the command files have been run, use the -keep-running option:
shell_1$ run_simple_sched_jobs -config my_config.cfg -keep-running
shell_2$ qsub -config my_config.cfg my_program arg1 arg2
run_simple_sched_jobs prints a line to stdout whenever it is notified that a command has completed, either successfully or unsuccessfully.
1 -| 1 is COMPLETED exit status 0
2 -| 2 is COMPLETED exit status 0
...
The first number is the request ID (the line command from the command files) and the second is the submit ID supplied by the simple_sched daemon. Note that commands may complete in a different order than they were submitted.
Once all the commands in the command files have completed, run_simple_sched_jobs prints out a summary containing the number of jobs that completed successfully, the number of jobs that completed with non-zero exit status, and the number of jobs that failed to run due to an error.
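A script that post-processes this output might parse the per-command lines as sketched below. The pattern is derived only from the two sample lines shown above. It is an assumption that other states are a single word in the same position and that the "exit status" part may be absent; lines that do not match, such as the final summary, are skipped.

```python
import re
from collections import Counter

# Pattern for lines such as "1 -| 1 is COMPLETED exit status 0".
# Assumption: the state is a single word and "exit status N" may be absent.
LINE_RE = re.compile(
    r"^(?P<request_id>\d+) -\| (?P<submit_id>\d+) is (?P<state>\S+)"
    r"(?: exit status (?P<exit_status>-?\d+))?\s*$"
)


def parse_line(line):
    """Return a dict for a completion line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    status = match.group("exit_status")
    return {
        "request_id": int(match.group("request_id")),
        "submit_id": int(match.group("submit_id")),
        "state": match.group("state"),
        "exit_status": int(status) if status is not None else None,
    }


def tally(lines):
    """Count zero-exit, non-zero-exit, and other (no exit status) results."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # not a per-command line
        if record["exit_status"] is None:
            counts["other"] += 1
        elif record["exit_status"] == 0:
            counts["ok"] += 1
        else:
            counts["nonzero"] += 1
    return counts


sample = [
    "1 -| 1 is COMPLETED exit status 0",
    "2 -| 2 is COMPLETED exit status 0",
]
print(tally(sample))
```

Because commands may complete out of order, the request ID in each record is what ties a result back to its command.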
For details on using run_simple_sched_jobs, see the run_simple_sched_jobs chapter on page 29.
Using HTC Scheduler with Tivoli Workload Scheduler LoadLeveler®
The LoadLeveler scheduler was enhanced in version 3.5 to enable use of Blue Gene/P's HTC mode with the HTC Scheduler as a meta-scheduler. This makes integrating the HTC Scheduler into a LoadLeveler workflow much easier and more efficient. Refer to the following sections, depending on your version of LoadLeveler, for instructions on configuring LoadLeveler and creating a job command file (JCF) to submit HTC jobs.
Figure 2 illustrates how the HTC Scheduler “glides in” to LoadLeveler. LoadLeveler has selected and created a partition in the Blue Gene/P using the Control System (Bridge) API. LoadLeveler's Central Manager tells a LoadLeveler startd to run this JCF. The LoadLeveler startd executes run_simple_sched_jobs, which starts a simple_sched server, an HTC Scheduler startd, and a qsub program, then reads the cmds.run input file, converting those lines into calls to qsub. The HTC Scheduler startd process executes several submits in parallel; these contact the submit mux, which communicates with the control system processes, which in turn cause the program to run on the compute nodes.
[Figure 2: HTC Scheduler as LoadLeveler glide-in. Components shown: the cmds.run file, run_simple_sched_jobs, qsub, simple_sched, startd (SIMPLE), and submit processes; the LoadLeveler Central Manager, startd (LL), and starter; the FEN; the service node with the Control System API, DB2, and control system processes; the Submit Mux; and the Blue Gene machine running HTC jobs.]
Using HTC Scheduler with LoadLeveler version 3.5 and later
LoadLeveler version 3.5 provides new features that make using the HTC Scheduler under LoadLeveler easier. This section describes the new Job Command File (JCF) keywords and behavior available in LoadLeveler.
LoadLeveler provides a bg_partition_type keyword in the JCF that specifies whether the partition will be booted for HPC or HTC jobs and, if the partition is booted for HTC jobs, the mode in which the jobs will run. The values for bg_partition_type are as follows:
• HPC – HPC jobs, this is the default value if bg_partition_type isn't present
• HTC_SMP – HTC jobs in SMP mode
• HTC_DUAL – HTC jobs in Dual mode
• HTC_VN – HTC jobs in Virtual Node mode
• HTC_LINUX_SMP – HTC jobs in Linux / SMP mode
Note that “Linux / SMP” mode might not be available on every Blue Gene/P system.
The bg_user_list keyword is used to specify the users that can run jobs on the partition. This can be set to a space-separated list of user names or the special value “ALL” to allow any user to run on the partition. If not specified, only the step owner can submit jobs to the partition.
The bg_partition and bg_user_list keywords are not inherited by other job steps.
LoadLeveler sets several environment variables when it runs the job; run_simple_sched_jobs uses them to boot the partition chosen by LoadLeveler in the correct mode.
Example 1 contains a sample LoadLeveler JCF file that uses run_simple_sched_jobs:
#!/bin/bash
#@ job_name = htc_glide_in_32
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ job_type = bluegene
#@ bg_size = 32
#@ bg_partition_type = HTC_VN
#@ bg_user_list = user1 user2 user3
#@ queue
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs cmds.txt
Example 1: Sample LoadLeveler 3.5 JCF
Modify the sample JCF to run your HTC application. The sections highlighted in bold should be changed to suit your application. The job_type must be “bluegene” so that LoadLeveler will run the script on a front end node and allocate a partition.
LoadLeveler also provides commands that can be used to display the partition type and user list information of an HTC partition. For example, llstatus will show the partition type as “HTC (SMP)” if the partition is booted in HTC mode for SMP jobs.
At present, LoadLeveler doesn't re-use partitions booted in HTC mode even if the cache partitions option is enabled. LoadLeveler will automatically free the partition after the job has ended.
Using HTC Scheduler with LoadLeveler before version 3.5
In order to use the HTC Scheduler with LoadLeveler prior to version 3.5, LoadLeveler must be configured to not cache Blue Gene partitions. This is done by setting BG_CACHE_PARTITIONS=false in the LoadL_config file. Refer to the “Tivoli Workload Scheduler LoadLeveler” documentation for information regarding this configuration option. This requirement has been removed with LoadLeveler 3.5.
The HTC Scheduler can be called from a LoadLeveler JCF to submit a batch of HTC jobs to a partition managed by LoadLeveler. The JCF should look like Example 2.
Modify the sample JCF to run your commands. The sections highlighted in bold should be changed to suit your application. The job_type must be “bluegene” so that LL will run the script on a FEN. The FEN that LoadLeveler chooses must have the submit_mux running.
The script in the JCF uses htcpartition to boot the partition. Set the mode parameter to htcpartition to match the mode that your application will run in (DUAL, LINUX, SMP, or VN).
The script uses run_simple_sched_jobs to run all the commands in cmds.txt. A shell trap is set up so that when the script exits, htcpartition will free the partition.
#!/bin/bash
#@ job_name = htc_glide_in_32
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ job_type = bluegene
#@ bg_size = 32
#@ queue

# Free the partition; installed below as an EXIT trap.
function free_partition() {
    /bgsys/drivers/ppcfloor/bin/htcpartition --free
}

export RUN_JOBS_CONFIG_FILE=my_simple_sched.cfg

# Boot the partition in HTC mode and write the personal configuration file.
/bgsys/drivers/ppcfloor/bin/htcpartition --boot --mode SMP --configfile "$RUN_JOBS_CONFIG_FILE"
if [ $? -ne 0 ]; then
    echo "Booting HTC partition failed."
    exit 1
fi

# Free the partition whenever the script exits.
trap free_partition EXIT

/bgsys/opt/simple_sched/bin/run_simple_sched_jobs cmds.txt
Example 2: Sample LoadLeveler JCF
Service setup
Administrators may want to have a single pool that users can submit HTC jobs to without going through another scheduler like LoadLeveler. In this case, follow the instructions in this section to set up the HTC Scheduler to run as a system service. Once the following steps are completed, end users can run qsub to submit jobs, qstat to see the status of their job, and qdel to remove a job from the queue. Note that if users are only using personal instances of the HTC Scheduler, service setup is not necessary.
1. Customize the configuration file
Copy the configuration file /bgsys/opt/simple_sched/etc/simple_sched.cfg to /bgsys/local/etc/simple_sched.cfg. Edit /bgsys/local/etc/simple_sched.cfg and set the following options:
● Set scheduler_hostname to your SN hostname. For example, "mysn.mydomain"
● Set pool_name to the pool that you will be using for HTC jobs. For example, "R00-M0"
● Set pool_size to the number of nodes in the pool. For example, "1mpV" (= 1 midplane in VN mode)
You will probably not have to change the other options.
2. Start the HTC Scheduler server daemon on the SN
Create a symlink to the init script in /etc/init.d, install the daemon to start automatically at the default run levels, and start it manually:
# ln -s /bgsys/opt/simple_sched/etc/init.d/ibm.com-simple_sched_server /etc/init.d/
# /usr/lib/lsb/install_initd -v ibm.com-simple_sched_server
# /etc/init.d/ibm.com-simple_sched_server start
3. Start the HTC Scheduler startd daemons on the submit nodes
The submit nodes are any system where the submit mux is running (i.e., FENs). At least one of these systems must be designated to run the startd daemon. The number of systems required depends on the size of the pool.
The startd daemons will be starting the submit program supplied with the Blue Gene software, which must be able to load the HTC Scheduler's submit plug-in. The location of the submit plug-in, /bgsys/opt/simple_sched/lib, needs to be configured in the dynamic linker (ld.so). This is typically done by creating a text file in /etc/ld.so.conf.d and running ldconfig.
Create a symlink to the startd daemon init script in /etc/init.d, install the daemon to start automatically at the default run levels, and start it manually:
# ln -s /bgsys/opt/simple_sched/etc/init.d/ibm.com-simple_sched_startd /etc/init.d/
# /usr/lib/lsb/install_initd -v ibm.com-simple_sched_startd
# /etc/init.d/ibm.com-simple_sched_startd start
4. Configure the end-user environment
Change the system environment so that the end-user command line utilities (qsub, qstat, and qdel) are available. The users' PATH should include /bgsys/opt/simple_sched/bin. This is usually done by creating a script in /etc/profile.d.
Configuration
Most configuration values can be set using:
● an option on the command line. For example, -scheduler-service 12345
● an environment variable. For example, “SIMPLE_SCHEDULER_SERVICE=12345 ; export SIMPLE_SCHEDULER_SERVICE”
● a line in the configuration file. For example, scheduler_service_name=12345
If a configuration value can be set using multiple methods, the command line option takes precedence over the environment variable, which takes precedence over the config file.
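The precedence rule can be expressed as a short lookup. This is a sketch of the rule as stated, not the scheduler's code: the function name resolve and the use of a plain dict for the parsed configuration file are assumptions.

```python
import os


def resolve(cli_value, env_name, config, config_key, default=None, environ=None):
    """Return the first value found: command line, environment, config file."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value              # command-line option wins
    if env_name in environ:
        return environ[env_name]      # then the environment variable
    if config_key in config:
        return config[config_key]     # then the configuration file
    return default


# Hypothetical inputs: a parsed config file and an environment mapping.
config = {"scheduler_service_name": "12345"}
env = {"SIMPLE_SCHEDULER_SERVICE": "23456"}

print(resolve(None, "SIMPLE_SCHEDULER_SERVICE", config,
              "scheduler_service_name", environ=env))
print(resolve("34567", "SIMPLE_SCHEDULER_SERVICE", config,
              "scheduler_service_name", environ=env))
```

The first call falls through to the environment variable; the second returns the command-line value.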
The HTC Scheduler programs use a configuration file. The file that's used is either (in order of preference):
1. specified on the command line using the -config parameter
2. specified using the SIMPLE_SCHED_CONFIG_FILE environment variable
3. if present, the current directory config file, ./simple_sched.cfg
4. if present, the system config file, /bgsys/local/etc/simple_sched.cfg
5. if present, the install config file, /bgsys/opt/simple_sched/etc/simple_sched.cfg
Typically, the administrator will have copied the configuration file from /bgsys/opt/simple_sched/etc/simple_sched.cfg to /bgsys/local/etc/simple_sched.cfg and changed any configuration options necessary for the local system.
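The configuration file search order can be sketched as follows. The function name and the injectable exists parameter are illustrative assumptions; only the paths, the -config parameter, and the SIMPLE_SCHED_CONFIG_FILE variable come from this document.

```python
import os

# Fallback locations, in order of preference.
CANDIDATES = [
    "./simple_sched.cfg",
    "/bgsys/local/etc/simple_sched.cfg",
    "/bgsys/opt/simple_sched/etc/simple_sched.cfg",
]


def find_config_file(cli_config=None, environ=None, exists=os.path.exists):
    """Return the configuration file to use, or None if none is found."""
    environ = os.environ if environ is None else environ
    if cli_config is not None:
        return cli_config                     # 1. -config on the command line
    env_value = environ.get("SIMPLE_SCHED_CONFIG_FILE")
    if env_value:
        return env_value                      # 2. environment variable
    for path in CANDIDATES:
        if exists(path):
            return path                       # 3-5. first file that is present
    return None


print(find_config_file(cli_config="my_config.cfg"))
```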
Configuration options
This section describes the configuration options.
Configuration file name
Description The configuration file to use. If not specified, the first of these that is present is used:
1. the current directory config file, ./simple_sched.cfg
2. the system config file, /bgsys/local/etc/simple_sched.cfg
3. the install config file, /bgsys/opt/simple_sched/etc/simple_sched.cfg
Format File name, see open()
Environment variable SIMPLE_SCHED_CONFIG_FILE
Command-line -config <filename>
Scheduler service name
Description The service name (port) that the server will listen on and the clients will attempt to contact. The default value is "simple_htc_scheduler".
Format Service name, see getaddrinfo()
Configuration file option scheduler_service_name
Environment variable SIMPLE_SCHEDULER_SERVICE
Command-line -scheduler-service <service-name>
Scheduler host name
Description The host name of the system that the server is running on.
Format Host name, see getaddrinfo()
Configuration file option scheduler_hostname
Environment variable SIMPLE_SCHEDULER_HOSTNAME
Command-line -scheduler-hostname <hostname>
Pool name
Description The name of the pool to run HTC jobs on. Defaults to “default_pool”.
Format Pool name, see the Blue Gene System Administration Redbook (SG24-7417)
Configuration file option pool_name
Environment variable SIMPLE_SCHEDULER_POOL
Command-line -pool <pool-name>
Pool size
Description Describes the partitions in the pool. For each partition in the pool the scheduler must be told its size and mode using this format:
[<count>][<type>][<mode>]
where at least one of these must be present, and
• count is a number (default is 1)
• type is a hardware type, “n”=node, “nc”=node card, “mp”=midplane, “r”=rack (default is “n”)
• mode is the mode of the partition, “D”=dual, “L”=Linux, “S”=SMP, “V”=virtual node (default is “S”)
Separate partition descriptions using space.
Example partition: “1ncS” = 1 node card in SMP mode (32 nodes to run on)
Example pool: “1mpS 1ncV” = 1 midplane booted in SMP mode and 1 node card booted in virtual node mode.
Format Pool size, see description
Configuration file option pool_size
Environment variable SIMPLE_SCHEDULER_POOL_SIZE
Command-line -pool-size <pool-size>
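To make the pool size format concrete, here is a small parser sketch for it. It is not taken from the scheduler's source. The function names are hypothetical, and while the 32 nodes per node card figure comes from the example above, the midplane and rack node counts are assumptions based on the usual Blue Gene/P packaging.

```python
import re

# Nodes per hardware type. The node card value (32) comes from the example
# above; the midplane and rack values are assumptions based on the usual
# Blue Gene/P packaging (16 node cards per midplane, 2 midplanes per rack).
NODES_PER_TYPE = {"n": 1, "nc": 32, "mp": 512, "r": 1024}
MODE_NAMES = {"D": "dual", "L": "Linux", "S": "SMP", "V": "virtual node"}

# [<count>][<type>][<mode>]; "nc" is listed before "n" so it is tried first.
PARTITION_RE = re.compile(r"(\d+)?(nc|mp|n|r)?([DLSV])?")


def parse_partition(text):
    """Parse one [<count>][<type>][<mode>] description into a dict."""
    match = PARTITION_RE.fullmatch(text)
    if not text or match is None:
        raise ValueError(f"invalid partition description: {text!r}")
    count, hw_type, mode = match.groups()
    count = int(count) if count is not None else 1   # default count is 1
    hw_type = hw_type or "n"                         # default type is node
    mode = mode or "S"                               # default mode is SMP
    return {
        "count": count,
        "type": hw_type,
        "mode": MODE_NAMES[mode],
        "nodes": count * NODES_PER_TYPE[hw_type],
    }


def parse_pool_size(text):
    """Parse a space-separated pool size such as "1mpS 1ncV"."""
    return [parse_partition(part) for part in text.split()]


for partition in parse_pool_size("1mpS 1ncV"):
    print(partition)
```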
Submit path
Description The path to the submit program. Defaults to “/bgsys/drivers/ppcfloor/bin/submit”.
Format Executable name, see exec()
Configuration file option submit_path
Environment variable SIMPLE_SCHEDULER_SUBMIT_PATH
Command-line -submit-path <filename>
Submit Options
Description Additional options to set when calling submit. The startd daemon will put these options on the submit command when it executes submit in addition to the arguments it uses. The default is empty. If the options aren't valid then submitted jobs will fail.
Format Command-line options, like "-trace 0"
Configuration file option submit_args
Environment variable SIMPLE_SCHEDULER_SUBMIT_ARGS
Command-line -submit-args <arguments>
simple_sched daemon PID file
Description The path to use for simple_sched's PID file. Defaults to “/var/run/simple_sched.pid”.
Format File name
Configuration file option startd_pid_file
Environment variable SIMPLE_SCHEDULER_PID_FILE
Command-line -pid-file <filename>
startd daemon PID file
Description The path to use for startd's PID file. Defaults to “/var/run/startd.pid”.
Format File name
Configuration file option startd_pid_file
Environment variable SIMPLE_SCHEDULER_STARTD_PID_FILE
Command-line -pid-file <filename>
Verbose
Description The verbose level for log output. If not present, then no logging will be done. If present with no value, the level is "notice". The levels available are, from most to least verbose:
• debug, D, 7
• info, I, 6
• notice, N, 5
• warning, W, 4
• err, E, 3
• crit, 2 (not used)
• alert, 1 (not used)
• emerg, 0 (not used)
Format Verbose level, see description
Command-line -verbose[=<level>]
Daemons
The following section provides details on the daemons that implement the HTC Scheduler.
simple_sched daemon
This section provides details on the simple_sched daemon.
Command-line options
In addition to the command-line options to override configuration options, the following options are available when starting the simple_sched daemon.
-accept-sd=SD
Description The simple_sched daemon accepts connections on the supplied socket descriptor. The default is to open a socket to accept client connections on. The socket descriptor must be an integer.
-log-to-stdout
Description The simple_sched daemon will log to stdout. The default is to log using the syslog() API.
-suspend
Description The simple_sched daemon will start in “suspended” state. It will not assign jobs to startd daemons until resumed. The default is to start in “running” state. To resume, use “qcmd resume”
-pick-port
Description The simple_sched daemon will pick an ephemeral port to use. The default is to use the configured service name.
-pid-file-required[=optional|required|skip]
Description Tells the simple_sched daemon how to handle the PID file. If the option is not used, the behavior is optional; if the option is used without a value, the behavior is required. Allowed values are:
• optional - Try to create the PID file, but continue if it cannot be created
• required - Create the PID file, and fail if it cannot be created
• skip - Do not create the PID file
-boot
Description simple_sched will execute /bgsys/drivers/ppcfloor/bin/htcpartition to boot the partition. htcpartition must be able to get its boot parameters from the mpirun plugin.
Shutting down
The simple_sched daemon can be shut down in one of three ways:
• Very Slow - No more jobs will be accepted. simple_sched will wait until the submit queue is empty and all outstanding submits are complete. Trigger this by signaling with SIGINT (CTRL-C).
• Slow - No more jobs will be accepted and the submit queue will be cleared. simple_sched will wait until all outstanding jobs are complete. Trigger this by signaling with SIGQUIT (CTRL-\).
• Quick - Just exits, not waiting for jobs to complete. Trigger this by signaling with SIGTERM.
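An administrator's helper script could map the three shutdown styles to their signals as sketched below. The style names, function names, and the assumption that the PID file contains a single decimal process ID are illustrative; the signals and the default PID file path are the ones given in this document.

```python
import os
import signal

# Shutdown styles and the signals that trigger them, as listed above.
SHUTDOWN_SIGNALS = {
    "very-slow": signal.SIGINT,   # drain the queue and finish running jobs
    "slow": signal.SIGQUIT,       # clear the queue, finish running jobs
    "quick": signal.SIGTERM,      # exit without waiting for jobs
}


def read_pid(pid_file="/var/run/simple_sched.pid"):
    """Read the daemon PID. Assumes the file holds one decimal PID."""
    with open(pid_file) as handle:
        return int(handle.read().strip())


def shut_down(style, pid_file="/var/run/simple_sched.pid", kill=os.kill):
    """Send the signal for the requested shutdown style to the daemon."""
    sig = SHUTDOWN_SIGNALS[style]
    pid = read_pid(pid_file)
    kill(pid, sig)
    return pid, sig
```

For example, shut_down("slow") would send SIGQUIT to the process named in the default PID file. The kill parameter exists only so the function can be exercised without signaling a real process.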
startd daemon
This section provides details on the startd daemon.
Command-line options
In addition to the command-line options to override configuration options, the following options are available when starting the startd daemon.
-log-to-stdout
Description The startd daemon will log to stdout. The default is to log using the syslog() API.
-pid-file-required[=optional|required|skip]
Description Tells the startd daemon how to handle the PID file. If the option is not used, the behavior is optional; if the option is used without a value, the behavior is required. Allowed values are:
• optional - Try to create the PID file, but continue if it cannot be created
• required - Create the PID file, and fail if it cannot be created
• skip - Do not create the PID file
Submit plug-in
The startd daemon uses the submit plug-in to get information about the job back from the submit program. The submit program will call functions in the submit plug-in when a job ends. If the job failed, the data provided on the function call will include the reason for the failure. The submit command uses dlopen to load the submit plug-in, so the shared library containing the submit plug-in must be configured in the dynamic linker. Configuring the submit plug-in shared library in the dynamic linker can be done in several ways, including use of LD_LIBRARY_PATH, and editing the ld.so.conf. The HTC Scheduler's submit plug-in is located in /bgsys/opt/simple_sched/lib/libsubmit_if.so. (Note that run_simple_sched_jobs sets the LD_LIBRARY_PATH, so if the HTC Scheduler is run only through run_simple_sched_jobs then no extra configuration is required.)
The submit plug-in provided by the HTC Scheduler also prevents other users from submitting jobs to the pool it's configured to use. It does this by reading the local HTC Scheduler configuration file, /bgsys/local/etc/simple_sched.cfg, and if the pool entered on the command line is not set, or it's the same pool as is in the local configuration file, then it returns non-zero and submit will fail.
Shutting down
The startd daemon can be shut down in one of three ways:
1. Very slow – startd tells simple_sched to stop sending work; then waits until all submits have finished. Trigger this by sending SIGINT (CTRL-C).
2. Slow – startd tells simple_sched to stop sending work; all current submits will get SIGTERM and should end quickly; then waits until all the submits have finished. Trigger this by sending SIGQUIT (CTRL-\).
3. Quick – startd exits without sending results. Trigger this by sending SIGTERM.
End-user commands
qcmd
qcmd can be used to send commands to the HTC Scheduler. If qcmd can't connect to the simple_sched daemon it will exit with an error message and non-zero exit status.
Commands
Listed here are the commands accepted by qcmd. Following the list of commands is a description of the response types.
submit [OPTION]... COMMAND...
Description Submit a job to run.
Response Job status if -wait, or submit ID if no -wait
Options -mode=MODE
Description The mode that the job requires.
Parameter A mode, one of “DUAL”, “LINUX”, “SMP”, or “VN”.
Default The job can run in any mode. The server will check for an available HTC resource in this order: VN, DUAL, SMP, LINUX.
-restartable
Description Indicates that the job can be restarted if it fails.
Default The job cannot be restarted.
-cwd=DIRECTORY
Description The working directory.
Parameter A directory name.
Default The current working directory.
-exp_env=NAME
Description Export an environment variable to the job. This can be used multiple times to export multiple variables.
Parameter An environment variable name
Default The environment variable is not exported to the job
-env_all
Description Export all environment variables to the job
Default No environment variables are exported
-env=NAME=VALUE[ NAME=VALUE]
Description Define environment variables for the job. This can be used multiple times to define multiple environment variables.
Parameter A space-separated list of NAME=VALUE pairs
-name=NAME
Description The name for the job. This is used as the base name for the output files.
Parameter The name can be any value that can be used in a file name
Default “submit”
-stdin-file=FILE
Description The file from which to read standard input.
Parameter A file name. If the file name is not a full path then the file is opened from <cwd>. This file must be readable when the program runs.
Default /dev/null
-stdout-file=FILE
Description The file to which to write standard output. If this option is specified, the file will not be removed even if it is empty.
Parameter A file name. If the file name is not a full path then the file is opened from <cwd>.
Default <name>-<submit-id>.out
-stderr-file=FILE
Description The file to which to write standard error. If this option is specified, the file will not be removed even if it is empty.
Parameter A file name. If the file name is not a full path then the file is opened from <cwd>.
Default <name>-<submit-id>.err
-wait
Description Wait for results.
Default Do not wait for results.
status [-wait] <submit-id>|all
Description Display the state of a job or all jobs if “all” is specified.
Response job status
Option -wait
Description Wait for results.
Default Do not wait for results.
cancel <submit-id>
Description Cancel a submitted job.
Response job status
scheduler_status
Description Display the scheduler status.
Response scheduler status
suspend
Description The simple_sched daemon will stop assigning jobs until “resume”.
Response scheduler status
resume
Description The simple_sched daemon will resume assigning jobs.
Response scheduler status
help [<command>]
Description Display the command help summary or detailed help for the specified command
Immediate mode
If a command is supplied on the qcmd command-line, it will execute that command and exit. This can be seen in the following examples:
$ qcmd scheduler_status
[running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]
$ qcmd submit test.cna
Submit id: 6
Interactive mode
qcmd will operate in interactive mode when a command is not supplied on the command line. In this mode qcmd reads commands from stdin. Commands are processed asynchronously: a command creates a request that is sent to the simple_sched daemon, the next request can be sent before the previous command completes, and the responses to the requests may be received out of order.
When qcmd generates a request from a command, a unique request ID is generated, and the request ID and the command name are displayed, for example:
$ cat cmds.txt
suspend
submit test
$ cat cmds.txt | qcmd
1 <- suspend
2 <- submit
When qcmd receives a response from the simple_sched daemon, it prints out the request ID and the response info. Continuing the previous example, the two requests received these responses:
1 -| [running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]
2 -| Submit id: 7
When the response is not the last response for a request, qcmd will display "->" after the request ID, whereas if it is the last response for a request, qcmd will display "-|" after the request ID.
When stdin is closed, qcmd will continue receiving responses from the simple_sched daemon until the simple_sched daemon indicates that there are no more outstanding requests. Users can take advantage of this behavior to submit several jobs by writing "submit -wait" commands to qcmd, closing stdin, and then waiting for qcmd to exit, at which point all the submitted jobs have completed. The run_simple_sched_jobs command uses this feature.
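The batch pattern described above can be sketched in Python. The helpers below build the text to write to qcmd's stdin and split each output line into request ID, marker, and response text. All function and class names are hypothetical, and the line layout is inferred from the examples in this section.

```python
import re
from typing import List, NamedTuple


class QcmdLine(NamedTuple):
    request_id: int
    marker: str  # "<-" request echoed, "->" intermediate response, "-|" last response
    text: str


_LINE = re.compile(r"^(\d+) (<-|->|-\|) (.*)$")


def batch_input(programs: List[str]) -> str:
    """Build one "submit -wait" command per line, ready to write to qcmd's stdin."""
    return "".join(f"submit -wait {program}\n" for program in programs)


def parse_line(line: str) -> QcmdLine:
    """Split one line of interactive-mode output into its three parts."""
    match = _LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unrecognized qcmd output: {line!r}")
    return QcmdLine(int(match.group(1)), match.group(2), match.group(3))


def finished_requests(lines: List[str]) -> List[int]:
    """Return the request IDs that have received their last response ("-|")."""
    return [p.request_id for p in map(parse_line, lines) if p.marker == "-|"]
```

A caller would write the result of batch_input to qcmd's stdin, close stdin, and read lines until qcmd exits.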
Response format
The format for the response info depends on the type of the response.
Submit ID response
If you do submit without -wait, qcmd will print out the response like "Submit ID: <submit-id>". For example,
$ qcmd
submit test.cna
1 <- submit
1 -| Submit ID: 1
Submit status response
If you do a “submit -wait” or status command, qcmd will print out the current state and, if the state is COMPLETED, the exit status, and, if set, the error message. The format is "<submit-id> is <status>[ exit status <exit-status>][ error message '<error-msg>']".
$ qcmd
status 1
1 <- status 1
1 -| 1 is COMPLETED exit status 0
$ qcmd
submit -wait test
1 <- submit [wait]
1 -> 2 is QUEUED
1 -> 2 is ASSIGNED
1 -| 2 is COMPLETED exit status 0
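A script that consumes these lines might parse the status format with a regular expression. The following is a sketch under the assumption that the error message contains no single quotes; the names are hypothetical. Any trailing text, such as the bracketed location information that appears in some samples, is ignored.

```python
import re
from typing import NamedTuple, Optional


class JobStatus(NamedTuple):
    submit_id: int
    state: str
    exit_status: Optional[int]
    error_message: Optional[str]


# "<submit-id> is <status>[ exit status <exit-status>][ error message '<error-msg>']"
_STATUS = re.compile(
    r"^(?P<id>\d+) is (?P<state>[A-Z]+)"
    r"(?: exit status (?P<exit>\d+))?"
    r"(?: error message '(?P<err>[^']*)')?"
)


def parse_status(text: str) -> JobStatus:
    """Parse the response text that follows the request ID and marker."""
    match = _STATUS.match(text)
    if match is None:
        raise ValueError(f"not a job status response: {text!r}")
    exit_status = match.group("exit")
    return JobStatus(
        submit_id=int(match.group("id")),
        state=match.group("state"),
        exit_status=int(exit_status) if exit_status is not None else None,
        error_message=match.group("err"),
    )
```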
Scheduler status response
If the response is a scheduler status, qcmd will print the response using the following format:
[<submit_thread_status>[ booting] <submit_queue_status> <submits_status> <resource_pool_status>]
where
• submit_thread_status is either "running" or "suspended"
• “booting” will be displayed if the server is waiting for htcpartition to finish booting the partition
• submit_queue_status is "(submit queue=<count>)". This is the number of jobs in the submit queue
• submits_status is "(submits=assigned:<count> completed:<count> notzero:<count> error:<count> canceled:<count>)". This shows the number of jobs that are currently assigned, that have completed, that ended with non-zero exit status, that did not run due to an error, and that were canceled
• resource_pool_status is "(htc resources=smp:<avail>/<total> dual:<avail>/<total> vn:<avail>/<total> linux:<avail>/<total>)"
$ qcmd scheduler_status
[running (submit queue=0) (submits=assigned:0 completed:5 notzero:0 error:0 canceled:0) (htc resources=smp:32/32 dual:0/0 vn:0/0 linux:32/32)]
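The scheduler status format can be broken into its fields with a short parser. This is a sketch based only on the format described above; the function name and the keys of the returned dictionary are hypothetical.

```python
import re
from typing import Any, Dict


def parse_scheduler_status(text: str) -> Dict[str, Any]:
    """Parse a scheduler status response into its fields."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"not a scheduler status response: {text!r}")
    body = body[1:-1]

    # Words before the first "(" are the thread status and the optional "booting".
    head = body.split("(", 1)[0].split()
    queue = re.search(r"\(submit queue=(\d+)\)", body)
    submits = re.search(r"\(submits=([^)]*)\)", body)
    resources = re.search(r"\(htc resources=([^)]*)\)", body)
    if not head or queue is None or submits is None or resources is None:
        raise ValueError(f"not a scheduler status response: {text!r}")

    counts = {}
    for item in submits.group(1).split():
        name, count = item.split(":")
        counts[name] = int(count)

    pool = {}
    for item in resources.group(1).split():
        mode, fraction = item.split(":")
        avail, total = fraction.split("/")
        pool[mode] = (int(avail), int(total))  # (available, total)

    return {
        "thread": head[0],                 # "running" or "suspended"
        "booting": "booting" in head[1:],
        "submit_queue": int(queue.group(1)),
        "submits": counts,
        "resources": pool,
    }
```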
Request rejected response
The HTC Scheduler may reject a request. For example, a request is rejected when a new job is submitted while the server is shutting down. The format for this is "Request failed '<error-msg>'".
$ qcmd
submit test.cna
1 <- submit
1 -| Request failed 'submit rejected because shutting down'
qsub
qsub is simply a symbolic link to qcmd. When qcmd is called and the program name is qsub, it performs a “submit” command. Refer to the documentation on qcmd's submit command for the parameters to qsub.
Following is sample output of qsub:
$ qsub test.cna
Submit id: 2
$ qsub -wait test.cna
3 is QUEUED
3 is ASSIGNED
3 is ASSIGNED [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
3 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qsub -stdout-file=/dev/null -stderr-file=/dev/null -cwd /bgusr/myhome -env=HTC=true -- /bgusr/myhome/test.cna -opt1=opt1value
Submit id: 4
qstat
qstat is used to get the status of a submitted job. qstat is simply a symbolic link to qcmd. When qcmd is called and the program name is qstat, it performs a “status” command. Refer to the documentation on qcmd's status command for the parameters to qstat.
Following is sample output of qstat:
$ qstat 4
4 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qstat 7
Status for 7 is not available.
$ qstat -wait 9
6 is ASSIGNED [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
6 is COMPLETED exit status 0 [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
$ qstat -wait all
7 is QUEUED
7 is ASSIGNED
7 is ASSIGNED [location='R00-M0-N14-J59-C00' jobId=12345 partition='R00-M0-N14']
7 is COMPLETED exit status 0 [location='R00-M0-N14-J19-C00' jobId=12345 partition='R00-M0-N14']
8 is QUEUED
8 is ASSIGNED
8 is ASSIGNED [location='R00-M0-N14-J44-C00' jobId=12345 partition='R00-M0-N14']
8 is COMPLETED exit status 0 [location='R00-M0-N14-J20-C00' jobId=12345 partition='R00-M0-N14']
qdel
qdel is used to cancel a submitted job. qdel is simply a symbolic link to qcmd. When qcmd is called and the program name is qdel, it performs a “cancel” command. Refer to the documentation on qcmd's cancel command for the parameters to qdel.
Following is sample output of qdel:
$ qdel 10
10 is CANCELED
$ qdel 11
11 is CANCELING [location='R00-M0-N14-J23-C00' jobId=12345 partition='R00-M0-N14']
$ qdel 11
11 is COMPLETED term signal 9 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
$ qdel 12
12 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
In the previous examples, submit ID 10 was queued when it was canceled; 11 was ASSIGNED when it was canceled the first time, and completed the second time; 12 had already exited when it was canceled.
Submitted job states
The states that a job can be in are as follows:
• QUEUED - The job is in the queue and will run when an HTC resource and startd are available. An error message may be available if the job failed and was requeued. When a job in this state is canceled, it goes to CANCELED state.
• ASSIGNED - The job is assigned to a startd to run. Information supplied by the submit program may be available (for example, the Blue Gene job ID and location). When a job in this state is canceled, it goes to CANCELING state.
• COMPLETED - The job has completed normally and has an exit status.
• CANCELING - The job has been canceled and the startd has been told to kill it. When it exits, it should go to COMPLETED state with the exit status indicating it was killed with a signal.
• CANCELED - The job was canceled without running.
• ERROR - The HTC Scheduler wasn't able to run this job and may have an error message.
Figure 3 illustrates the possible transitions between states.
Figure 3: Submitted job states (state diagram showing QUEUED, ASSIGNED, CANCELING, COMPLETED, ERROR, and CANCELED)
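The state descriptions above can be summarized as a transition table. The sketch below is an interpretation of the text, not a transcription of Figure 3: the edges into ERROR and the ASSIGNED to QUEUED requeue edge are assumptions inferred from the state descriptions, and all names are hypothetical.

```python
# Allowed next states for each job state.
# Assumption: ERROR can be reached from QUEUED or ASSIGNED, and a failed job
# that is requeued goes from ASSIGNED back to QUEUED.
TRANSITIONS = {
    "QUEUED": {"ASSIGNED", "CANCELED", "ERROR"},
    "ASSIGNED": {"COMPLETED", "CANCELING", "QUEUED", "ERROR"},
    "CANCELING": {"COMPLETED"},
    "COMPLETED": set(),
    "CANCELED": set(),
    "ERROR": set(),
}

# End states have no outgoing transitions.
END_STATES = {state for state, nxt in TRANSITIONS.items() if not nxt}


def cancel(state: str) -> str:
    """Return the state a job moves to when it is canceled."""
    if state == "QUEUED":
        return "CANCELED"   # canceled without running
    if state == "ASSIGNED":
        return "CANCELING"  # startd is told to kill the job
    return state            # canceling has no effect in any other state
```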
Note about when state info is available
When a job's state is reported to a client and the job state was an end state (COMPLETED, CANCELED, or ERROR; the right-most states in Figure 3), knowledge of the submitted job will be cleared from the server. Any further request for state using the submit ID will get a response of 'not found'.
run_simple_sched_jobs
run_simple_sched_jobs starts a personal instance of the HTC Scheduler and runs commands through it. To do this, it opens an ephemeral port (that is, Linux picks one that's not in use) and creates an HTC Scheduler configuration file which is based on the base HTC Scheduler configuration with these options replaced:
• scheduler_hostname -- is set to the current system's host name
• scheduler_service_name -- is set to the ephemeral port number
• pool_name -- is set to the configured pool name
• pool_size -- is set to the configured pool size
Note that if the -reuse-config option is specified, then only the scheduler_service_name is changed and the base configuration is not used.
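Opening an ephemeral port and collecting the replaced options can be sketched as follows. Binding to port 0 is the standard way to ask the kernel for an unused port. The function names are hypothetical, and the pool name and pool size are passed through as opaque strings because their format is defined in the Configuration section, not here.

```python
import socket
from typing import Dict, Tuple


def open_ephemeral_listener() -> Tuple[socket.socket, int]:
    """Listen on a port chosen by the kernel and return the socket and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", 0))  # port 0 asks the kernel to pick an unused port
    sock.listen()
    return sock, sock.getsockname()[1]


def config_overrides(port: int, pool_name: str, pool_size: str) -> Dict[str, str]:
    """Options replaced on top of the base HTC Scheduler configuration."""
    return {
        "scheduler_hostname": socket.gethostname(),
        "scheduler_service_name": str(port),
        "pool_name": pool_name,
        "pool_size": pool_size,
    }
```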
Next, run_simple_sched_jobs forks and execs a simple_sched process with options to use the new config file and accept connections on the socket descriptor that run_simple_sched_jobs opened to listen on the ephemeral port. Then run_simple_sched_jobs forks and execs a startd process which uses the new config file. If there are any command files specified on the command line, run_simple_sched_jobs forks and execs a qcmd process using the new config file, to which run_simple_sched_jobs writes a "submit -wait" command for each command. Output from qcmd is parsed to look for completion messages, which are echoed to stdout. Once all the command files have been processed, and if the “keep running” option is not set, run_simple_sched_jobs signals the child processes to quit. It then waits for the simple_sched and startd child processes to exit.
Tip: Use the --configfile option when booting the partition with htcpartition, then pass that same config file as the -config parameter to run_simple_sched_jobs which will read the pool name and pool size configuration options from this file.
Command files
run_simple_sched_jobs reads command files. Each line of a command file contains options to qcmd's submit command, an “end of options” indicator, the program to run, and the arguments to the command.
A line in the command file can contain arguments for qcmd's submit command. Refer to qcmd's submit command options on page 21 for the options to the submit command. After the options to qcmd's submit command in the line, the user should put an “end of options” indicator, “--”. If the “end of options” indicator is left off, any arguments to the program that start with “-” will be interpreted as options for the submit command.
The following command file line will start the “location.cna” program with the -print argument; any output will be discarded because the -stdout-file and -stderr-file options are specified for qcmd's submit command:
-stdout-file=/dev/null -stderr-file=/dev/null -cwd /bgusr/myhome -- location.cna -print
Each command line is converted into a qcmd submit command that is sent to the qcmd subprocess. qcmd will generate a request ID for the submitted job. Since request IDs are assigned starting at 1, the ID will match the line number.
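The conversion from command file lines to submit commands can be sketched as below. This assumes that every line in the file is a command (the handling of blank lines is not described here) and that each line is sent as a "submit -wait" command, as stated earlier. The function names are hypothetical.

```python
import shlex
from typing import Iterable, List, Tuple


def to_submit_commands(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Pair each command file line with the request ID qcmd will assign to it.

    Request IDs start at 1, so they match the line numbers.
    """
    return [
        (number, "submit -wait " + line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
    ]


def has_end_of_options(line: str) -> bool:
    """Report whether the line contains the "--" end-of-options indicator."""
    return "--" in shlex.split(line)
```

Checking has_end_of_options before submitting is a cheap way to catch lines where a program argument starting with "-" would be taken as a submit option.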
Output
run_simple_sched_jobs prints out a line whenever it's notified that a submitted command has completed, or that a submitted command was rejected (this should be rare).
Completion lines look like:
1 -| 1 is COMPLETED exit status 0 [location='R00-M0-N14-J10-C00' jobId=12345 partition='R00-M0-N14']
This is the standard completion line from the qcmd program. The first number is the request ID (command number), which starts at 1 and is incremented for each command submitted. The second number is the submit ID that the simple_sched daemon assigned to the job.
When run_simple_sched_jobs exits it prints out a line summarizing the results, like this:
Submitted 128 jobs, 128 completed, 60 had non-zero exit status, 0 requests failed.
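A wrapper script could read the summary line to decide whether a run succeeded. The following sketch parses the line shown above; the function names and dictionary keys are hypothetical.

```python
import re
from typing import Dict

_SUMMARY = re.compile(
    r"Submitted (\d+) jobs, (\d+) completed, "
    r"(\d+) had non-zero exit status, (\d+) requests failed\."
)


def parse_summary(line: str) -> Dict[str, int]:
    """Extract the four counters from the run_simple_sched_jobs summary line."""
    match = _SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a summary line: {line!r}")
    keys = ("submitted", "completed", "nonzero", "failed")
    return dict(zip(keys, map(int, match.groups())))


def all_succeeded(summary: Dict[str, int]) -> bool:
    """True if every job completed with exit status 0 and no request failed."""
    return (
        summary["completed"] == summary["submitted"]
        and summary["nonzero"] == 0
        and summary["failed"] == 0
    )
```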
Signal handling
If run_simple_sched_jobs gets SIGINT (CTRL-C) it goes into slow shutdown mode. No new commands will be accepted, and all submitted jobs will complete. If run_simple_sched_jobs gets SIGQUIT (CTRL-\) or SIGTERM (kill) it goes into fast shutdown mode. No new commands will be accepted and all submitted jobs will be canceled.
Positional parameters
The positional parameters are the names of command files (see the command files section). If the name is - then run_simple_sched_jobs will read commands from stdin until it reads an end-of-file. If there are no positional parameters, then no commands will be run, which is only useful when using the -keep-running option.
LoadLeveler integration
run_simple_sched_jobs looks for environment variables set by LoadLeveler and changes its behavior when these environment variables are set. The environment variables set by LoadLeveler are LOADL_BG_PARTITION, LOADL_BG_SIZE, and LOADL_BG_PARTITION_TYPE. When these environment variables are set, run_simple_sched_jobs will create a temporary configuration file containing the pool name and pool size set from these values. It will also automatically pass -boot to simple_sched.
Configuration
There are several configuration options. When an option can be specified in multiple ways, a command-line option takes precedence over an environment variable.
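The precedence rule can be expressed as a small helper. This is a sketch of the rule only; the function name and parameters are hypothetical, and the environment variable name in the usage note is one of those listed for the options below.

```python
import os
from typing import Mapping, Optional


def resolve_option(
    cli_value: Optional[str],
    env_name: Optional[str],
    default: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick an option value: command line first, then environment, then default."""
    if cli_value is not None:
        return cli_value
    if env_name is not None and env_name in environ:
        return environ[env_name]
    return default
```

For example, resolve_option(None, "RUN_JOBS_QCMD_EXE", "/bgsys/opt/simple_sched/bin/qcmd") returns the environment variable's value if it is set and the default path otherwise.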
Keep running
Description run_simple_sched_jobs will continue running after all command files have been processed. This can be used to submit jobs to the partition using qsub. Simply specify the same config file when invoking qsub.
Command-line option -keep-running
Default Exit after all jobs have completed.
Suspend
Description run_simple_sched_jobs will start the simple_sched daemon suspended. Use the "qcmd resume" command to cause the simple_sched daemon to start assigning jobs.
Command-line option -suspend
Default The simple_sched daemon will be started in “running” mode.
Boot
Description run_simple_sched_jobs will pass -boot to the simple_sched process.
Command-line option -boot
Default run_simple_sched_jobs will not pass -boot to the simple_sched process.
Configuration file
Description run_simple_sched_jobs will first attempt to read this file as an htcpartition output configuration file to get the partition information; then it will create or overwrite this file with the new HTC Scheduler configuration.
Command-line option -config <filename>
Environment variable RUN_JOBS_CONFIG_FILE
Default Use mkstemp() to create a temporary file whose name is like "my_simple_sched.cfg.XXXXXX". This file will be deleted when run_simple_sched_jobs exits.
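The default behavior can be approximated in Python with tempfile.mkstemp. This is a sketch, not the actual implementation: the C mkstemp() replaces the XXXXXX template with six characters, whereas Python's version appends its own random suffix to the prefix, and the choice of the default temporary directory is an assumption.

```python
import os
import tempfile


def create_temp_config() -> str:
    """Create an empty, uniquely named configuration file and return its path."""
    fd, path = tempfile.mkstemp(prefix="my_simple_sched.cfg.")
    os.close(fd)  # the caller reopens the file by name to write the configuration
    return path
```

The caller is responsible for deleting the file on exit, for example with os.remove(path) in a finally block.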
Re-use configuration file
Description Tells run_simple_sched_jobs to re-use the configuration file from an earlier run rather than create a brand new one. This is useful when calling run_simple_sched_jobs more than once for a single HTC boot.
Command-line option -reuse-config
Default The configuration file will not be reused.
Base configuration file
Description The configuration file to use for the base configuration. Your personal HTC Scheduler instance will use several of the options from this configuration file, for example, submit_path and submit_args.
Command-line option -base-config-file=<filename>
Environment variable SIMPLE_SCHED_CONFIG_FILE (see the HTC Scheduler configuration section above)
Default Search for the configuration file as described in the Configuration section starting on page 14.
Pool name
Description The pool that the HTC Scheduler will use.
Command-line option -pool-name=<pool-name>
Environment variable RUN_JOBS_POOL_NAME
Default
• If -reuse-config is specified, the value is read from the configuration file
• Otherwise, if the configuration file was created by htcpartition, the value is read from the configuration file
• Otherwise, there's no default and this must be specified
Pool size
Description The size of the pool that the HTC Scheduler will use. See the HTC Scheduler configuration for a description of the format (it also specifies the mode).
Command-line option -pool-size=<pool-size>
Environment variable RUN_JOBS_POOL_SIZE
Default
• If -reuse-config is specified, the value is read from the configuration file
• Otherwise, if the configuration file was created by htcpartition, the value is read from the configuration file
• Otherwise, there's no default and this must be specified
simple_sched daemon executable
Description The program that will be executed for the HTC Scheduler process. You will probably not have to change this.
Command-line option -simple-sched-exe=<executable>
Environment variable RUN_JOBS_SIMPLE_SCHED_EXE
Default /bgsys/opt/simple_sched/sbin/simple_sched
startd daemon executable
Description The program that will be executed for the startd process. You will probably not have to change this.
Command-line option -startd-exe=<executable>
Environment variable RUN_JOBS_STARTD_EXE
Default /bgsys/opt/simple_sched/sbin/startd
Qcmd executable
Description The program that will be executed for the qcmd process. You will probably not have to change this.
Command-line option -qcmd-exe=<executable>
Environment variable RUN_JOBS_QCMD_EXE
Default /bgsys/opt/simple_sched/bin/qcmd
Verboseness
Description Set the verboseness of run_simple_sched_jobs by setting the -verbose parameter. The verboseness for the child processes can be overridden using the following command-line options:
• -verbose-qcmd -- The qcmd process
• -verbose-simple-sched -- The simple_sched process
• -verbose-startd -- The startd process
See the verbose section in the Configuration section starting on page 14 for the allowed values.
Command-line option -verbose[=<verbose-level>]
Default The default for run_simple_sched_jobs is that no log messages will be displayed. For the child processes, the default log level is Warning.