Content
• MPIRUN Command
• Environment Variables
• LoadLeveler
• SUBMIT Command
• IBM Simple Scheduler
Control System
• Service Node (SN)
– An IBM System p 64-bit system
– The Control System and database run on this system
– Access to this system is generally privileged
– Communicates with Blue Gene over a private 1 Gb control Ethernet
• Database
– A commercial database tracks the state of the system
• Hardware inventory
• Partition configuration
• RAS data
• Environmental data
• Operational data, including partition state, jobs, and job history
• Service action support for hot-plug hardware
• Administration and system status
– Administration via either a console or the web “Navigator” interface
Service Node Database Structure
• The Configuration database is the representation of all the hardware on the system
• The Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history
• The Environmental database keeps current values for all hardware components on the system, such as fan speeds, temperatures, and voltages
• The RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex
DB2 databases: Configuration, Operational, Environmental, RAS
Useful log files: /bgsys/logs/BGP
Job Launching Mechanism
• mpirun Command
– Standard mpirun options supported
– May be used to launch any job, not just MPI-based applications
– Has options to allocate partitions when a scheduler is not in use
• Scheduler APIs enable various schedulers
– LoadLeveler
– SLURM
– Platform LSF
– Altair PBS Pro
– Cobalt
– ….
• Note: All the schedulers are built on top of mpirun/mpiexec
MPIRUN Implementation
• Same functionality as the BG/L implementation, plus a new implementation and new options
– No more rsh/ssh mechanism, for security reasons; replaced by a daemon running on the Service Node
– freepartition command integrated as an option (-free)
– Standard input (STDIN) is supported on BG/P (MPI task 0 only)
MPIRUN Command Parameters | 1
• -args "program args"
– Pass "program args" to the Blue Gene job on the compute nodes
• -cwd <Working Directory>
– Specifies the full path to use as the current working directory on the compute nodes. The path is specified as seen by the I/O and compute nodes
• -exe <Executable>
– Specifies the full path to the executable to run on the compute nodes. The path is specified as seen by the I/O and compute nodes
• -mode { SMP | DUAL | VN }
– Specifies what mode the job will run in: SMP, DUAL, or virtual node (VN) mode
• -np <Nb MPI Tasks>
– Create exactly n MPI ranks for the job. Aliases are -nodes and -n
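A hedged sketch combining these options (the partition name, working directory, executable path, and program arguments are placeholders):

mpirun -partition XXX -np 64 -mode VN \
       -cwd /home/bgpuser/run \
       -exe /home/bgpuser/a.out \
       -args "-i input.dat -o output.dat"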
MPIRUN Command Parameters | 2
• -enable_tty_reporting
– By default mpirun tells the control system and the C runtime on the compute nodes that STDIN, STDOUT, and STDERR are tied to TTY-type devices. This option enables STDOUT buffering (GPFS block size)
• -env "<Variable Name>=<Variable Value>"
– Set an environment variable in the environment of the job on the compute nodes
• -expenv <Variable Name>
– Export an environment variable from mpirun's current environment to the job on the compute nodes
• -label
– Use this option to have mpirun label the source of each line of output
• -partition <Block ID>
– Specify a predefined block to use
• -mapfile <mapfile>
– Specify an alternative MPI topology. The mapfile path must be fully qualified as seen by the I/O and compute nodes
• -verbose { 0 | 1 | 2 | 3 | 4 }
– Set the verbosity level. The default is 0, which means that mpirun will not output any status or diagnostic messages unless a severe error occurs. If you are curious as to what is happening, try levels 1 or 2. All mpirun-generated status and error messages appear on STDERR
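A hedged sketch of a diagnostic run using -label and -verbose (the partition name and executable path are placeholders):

mpirun -partition XXX -np 4 -mode SMP -exe /path/exe -label -verbose 1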
MPIRUN Command Reference (Documentation)
MPIRUN Example
mpirun -partition XXX -np 128 -mode SMP -exe /path/exe
-cwd working_directory
-env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack=64000000"
• Execution Settings
– 128 MPI Tasks
– SMP Mode
– 4 OpenMP Threads
– 64 MB Thread Stack
• mpirun application programming interfaces available: get_parameters, mpirun_done
MPIRUN Environment Variables
• Most command line options for mpirun can be specified using an environment variable
– -partition MPIRUN_PARTITION
– -nodes MPIRUN_NODES
– -mode MPIRUN_MODE
– -exe MPIRUN_EXE
– -cwd MPIRUN_CWD
– -host MMCS_SERVER_IP
– -env MPIRUN_ENV
– -expenv MPIRUN_EXP_ENV
– -mapfile MPIRUN_MAPFILE
– -args MPIRUN_ARGS
– -label MPIRUN_LABEL
– -enable_tty_reporting MPIRUN_ENABLE_TTY_REPORTING
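A minimal sketch of supplying options through the environment instead of the command line (the partition name and executable path are placeholders):

export MPIRUN_PARTITION=XXX
export MPIRUN_MODE=SMP
export MPIRUN_EXE=/path/exe
mpirun -np 128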
STDIN / STDOUT / STDERR Support
• STDIN, STDOUT, and STDERR work as expected
– You can pipe or redirect files into mpirun and pipe or redirect output from mpirun
– STDIN may also come from the keyboard interactively
• Any compute node may send STDOUT or STDERR data
• Only MPI rank 0 may read STDIN data
• mpirun always tells the control system and the C runtime on the compute nodes that it is writing to TTY devices. This is because logically mpirun looks like a pipe; it cannot do seeks on STDIN, STDOUT, and STDERR even if they are coming from files
• As always, STDIN, STDOUT, and STDERR are the slowest ways to get input and output from a supercomputer
– Use them sparingly
• STDOUT is not buffered and can generate a huge overhead for some applications
– Such applications should buffer STDOUT with the option
• -enable_tty_reporting
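A hedged sketch of redirecting files through mpirun (the partition name, executable path, and file names are placeholders):

mpirun -partition XXX -np 128 -mode SMP -exe /path/exe < input.dat > run.out 2> run.err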
MPIEXEC Command
• What is mpiexec?
– Method for launching and interacting with parallel Multiple Program Multiple Data (MPMD) jobs on Blue Gene/P
– Very similar to mpirun; the only difference is that the arguments supported by mpiexec are slightly different
• Command Limitations
– A pset is the smallest granularity for each executable, though one executable can span multiple psets
– You must use every compute node of each pset; specifically, different -np values are not supported
– The job's mode (SMP, DUAL, VNM) must be uniform across all psets
MPIEXEC Command Parameters
• The only parameter / environmental supported by mpiexec that is not supported by mpirun
– -configfile / MPIRUN_MPMD_CONFIGFILE
• The following parameters / environmentals are not supported by mpiexec, since their use is ambiguous for MPMD jobs
– -args / MPIRUN_ARGS
– -cwd / MPIRUN_CWD
– -env / MPIRUN_ENV
– -env_all / MPIRUN_EXP_ENV_ALL
– -exe / MPIRUN_EXE
– -exp_env / MPIRUN_EXP_ENV
– -partition / MPIRUN_PARTITION
– -mapfile / MPIRUN_MAPFILE
MPIEXEC Configuration File Syntax
• -n <Nb Nodes> -wdir <Working Directory> <Binary>
• Example
– Configuration File Content
• -n 32 -wdir /home/bgpuser /bin/hostname
• -n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world
– Runs
• /bin/hostname on one 32-node pset
• hello_world on one 32-node pset
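A hedged sketch of launching the MPMD job described by such a configuration file via the -configfile option noted above (the file path is a placeholder; partition selection is handled outside the command line, since -partition is not supported by mpiexec):

mpiexec -configfile /home/bgpuser/mpmd.cfg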
SUBMIT Command
• submit = the mpirun command for HTC
– Command used to run an HTC job and act as a lightweight shadow for the real job running on a Blue Gene node
– Simplifies user interaction with the system by providing a simple common interface for launching, monitoring, and controlling HTC jobs
– Run from a Frontend Node
– Contacts the control system to run the HTC user job
– Allows the user to interact with the running job via the job's standard input, standard output, and standard error
• Standard System Location
– /bgsys/drivers/ppcfloor/bin/submit
HTC Technical Architecture
SUBMIT Command Syntax
• /bgsys/drivers/ppcfloor/bin/submit [options]
or
/bgsys/drivers/ppcfloor/bin/submit [options] binary [arg1 arg2 ... argn]
• Options
– -exe <exe>  Executable to run
– -args "arg1 arg2 ... argn"  Arguments, must be enclosed in double quotes
– -env <env=value>  Define an environmental for the job
– -exp_env <env>  Export an environmental to the job's environment
– -env_all  Add all current environmentals to the job's environment
– -cwd <cwd>  The job's current working directory
– -timeout <seconds>  Number of seconds before the job is killed
– -mode <SMP|DUAL|VNM>  Job mode
– -location <Rxx-Mx-Nxx-Jxx-Cxx>  Compute core location, regular expression supported
– -pool <id>  Compute Node pool ID
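A hedged example of an HTC job submission (the pool ID, working directory, and executable path are placeholders):

/bgsys/drivers/ppcfloor/bin/submit -mode SMP -pool HTCPOOL -cwd /home/bgpuser /home/bgpuser/hello_htc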
IBM Scheduler for HTC
• IBM Scheduler for HTC = HTC job scheduler
– Handles scheduling of HTC jobs
• HTC Job Submission
– External work requests are routed to the HTC scheduler
• Single or multiple work requests from each source
– IBM Scheduler for HTC finds an available HTC client and forwards the work request
– HTC client runs the executable on a compute node
• A launcher program on each compute node handles work requests sent to it by the scheduler. When a work request completes, the launcher program is reloaded and the client is ready to handle another work request
IBM Scheduler for HTC Components
• IBM Scheduler for HTC Purpose
– Provides features not available with the “submit” interface
• Queuing of jobs until compute resources are available
• Tracking of failed compute nodes
– The “submit” interface is intended for use by job schedulers
• Not by end users directly
• IBM Scheduler for HTC Components
– simple_sched daemon
• Runs on the Service Node or a Frontend Node
• Accepts connections from startd and client programs
– startd daemons
• Run on a Frontend Node
• Connect to simple_sched, get jobs, and execute submit
– Client programs
• qsub = Submits a job to run
• qdel = Deletes a job submitted by qsub
• qstat = Gets the status of a submitted job
• qcmd = Admin commands
HTC Executables
• htcpartition
– Utility program shipped with Blue Gene
– Responsible for booting / freeing HTC partitions from a Frontend Node
• run_simple_sched_jobs
– Provides an instance of IBM Scheduler for HTC and startd
– Executes commands either specified in command files or read from stdin
– Creates a cfg file that can be used to submit jobs externally to the command files or stdin
– Exits when the commands have all finished (or can specify “keep running”)
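A minimal sketch of using these executables interactively from a Frontend Node, mirroring the LoadLeveler example shown later (the configuration and command file paths are placeholders):

/bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile $PWD/htc.cfg --mode linux_smp
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config $PWD/htc.cfg $PWD/cmds.txt
/bgsys/drivers/ppcfloor/bin/htcpartition --free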
IBM Scheduler for HTC Integration to LoadLeveler
• LoadLeveler handles
– Partition reservation & booting
• New LoadLeveler keyword:
– # @ bg_partition_type = HTC_LINUX_SMP
– Partition shutdown
• IBM Scheduler for HTC handles
– Queueing of batches of executions
• Either specified in command files or read from stdin
– Submission of executions
– Execution recovery when a failure occurs
• Only system faults are recovered
– Failed submissions can be retried
• User program failures are considered permanent
IBM Scheduler for HTC Glide-In to LoadLeveler
LoadLeveler Job Command File Example
#!/bin/bash
# @ bg_partition_type = HTC_LINUX_SMP
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue
# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs $COMMANDS_RUN_FILE
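A job command file like the one above is typically submitted with LoadLeveler's llsubmit command (the file name is a placeholder):

llsubmit personality_htc.cmd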
IBM Scheduler for HTC Integration to LoadLeveler < 3.5
• The IBM Scheduler for HTC / LoadLeveler integration described above is valid for LoadLeveler versions >= 3.5
• Looser integration with LoadLeveler versions < 3.5
– LoadLeveler doesn’t handle partition boot / shutdown
• Consequences
– Explicit partition boot / shutdown required in LoadLeveler job command file
– Achieved through call to HTC binary command htcpartition
• htcpartition --boot { … }
• htcpartition --free
LoadLeveler Job Command File Example (LL < v3.5)
#!/bin/bash
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue
# Command File
COMMANDS_RUN_FILE=$PWD/cmds.txt
# Local Simple Scheduler Configuration File
SIMPLE_SCHED_CONFIG_FILE=$PWD/my_simple_sched.cfg
partition_free() {
echo "Freeing HTC Partition"
/bgsys/drivers/ppcfloor/bin/htcpartition --free
}
/bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile $SIMPLE_SCHED_CONFIG_FILE --mode linux_smp
trap partition_free EXIT
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config $SIMPLE_SCHED_CONFIG_FILE $COMMANDS_RUN_FILE