
IBM System Blue Gene/P - Execution


© 2009 IBM Corporation

IBM PSSC Montpellier Customer Center

Content

• MPIRUN Command
• Environment Variables
• LoadLeveler
• SUBMIT Command
• IBM Simple Scheduler


Control System

• Service Node (SN)
  – An IBM System p 64-bit system
  – Control system and database reside on this system
  – Access to this system is generally privileged
  – Communication with Blue Gene via a private 1 Gb control Ethernet
• Database
  – A commercial database tracks the state of the system
    • Hardware inventory
    • Partition configuration
    • RAS data
    • Environmental data
    • Operational data including partition state, jobs, and job history
• Service action support for hot-plug hardware
• Administration and system status
  – Administration via either a console or the web "Navigator" interface


Service Node Database Structure

• Configuration database: the representation of all the hardware on the system
• Operational database: information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history
• Environmental database: current values for all hardware components on the system, such as fan speeds, temperatures, and voltages
• RAS database: hard errors, soft errors, machine checks, and software problems detected from the compute complex

[Figure: DB2 hosting the Configuration, Operational, Environmental, and RAS databases]

Useful log files: /bgsys/logs/BGP


Job Launching Mechanism

• mpirun Command
  – Standard mpirun options supported
  – May be used to launch any job, not just MPI-based applications
  – Has options to allocate partitions when a scheduler is not in use
• Scheduler APIs enable various schedulers
  – LoadLeveler
  – SLURM
  – Platform LSF
  – Altair PBS Pro
  – Cobalt
  – …
• Note: All the schedulers are built on mpirun/mpiexec


MPIRUN Implementation

• Identical functionality to the BG/L implementation + new implementation + new options
  – No more rsh/ssh mechanism, for security reasons; replaced by a daemon running on the Service Node
  – freepartition command integrated as an option (-free)
  – Standard input (STDIN) is supported on BG/P (MPI task 0 only)


MPIRUN Command Parameters | 1

• -args "program args"
  – Pass "program args" to the Blue Gene job on the compute nodes
• -cwd <Working Directory>
  – Specifies the full path to use as the current working directory on the compute nodes. The path is specified as seen by the I/O and compute nodes
• -exe <Executable>
  – Specifies the full path to the executable to run on the compute nodes. The path is specified as seen by the I/O and compute nodes
• -mode { SMP | DUAL | VN }
  – Specifies the mode the job will run in. Choices are SMP, DUAL, or virtual node (VN) mode
• -np <Nb MPI Tasks>
  – Create exactly n MPI ranks for the job. Aliases are -nodes and -n
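The interaction between -np and -mode can be sketched with a little arithmetic. The helper below is hypothetical (nodes_needed is not a Blue Gene command), and the ranks-per-node figures are an assumption taken from the usual description of the Blue Gene/P quad-core node rather than from these slides: 1 rank per node in SMP, 2 in DUAL, 4 in VN.

```shell
# Hypothetical helper: number of compute nodes needed for a given
# MPI task count and mode.
# Assumed ranks per node (not stated on the slides): SMP=1, DUAL=2, VN=4.
nodes_needed() {
    ntasks=$1
    case $2 in
        SMP)  per_node=1 ;;
        DUAL) per_node=2 ;;
        VN)   per_node=4 ;;
        *)    echo "unknown mode: $2" >&2; return 1 ;;
    esac
    # Integer ceiling division: a partially filled node still counts as one node.
    echo $(( (ntasks + per_node - 1) / per_node ))
}

nodes_needed 128 SMP
nodes_needed 128 VN
```

Under these assumptions, the same 128 tasks occupy four times fewer nodes in VN mode than in SMP mode, at the cost of leaving no spare cores for threads.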


MPIRUN Command Parameters | 2

• -enable_tty_reporting
  – By default, mpirun tells the control system and the C runtime on the compute nodes that STDIN, STDOUT, and STDERR are tied to TTY-type devices. This option enables STDOUT buffering (GPFS block size)
• -env "<Variable Name>=<Variable Value>"
  – Set an environment variable in the environment of the job on the compute nodes
• -exp_env <Variable Name>
  – Export an environment variable from mpirun's current environment to the job on the compute nodes
• -label
  – Have mpirun label the source of each line of output
• -partition <Block ID>
  – Specify a predefined block to use
• -mapfile <mapfile>
  – Specify an alternative MPI topology. The mapfile path must be fully qualified as seen by the I/O and compute nodes
• -verbose { 0 | 1 | 2 | 3 | 4 }
  – Set the verbosity level. The default is 0, which means that mpirun will not output any status or diagnostic messages unless a severe error occurs. To see what is happening, try levels 1 or 2. All status and error messages generated by mpirun appear on STDERR


MPIRUN Command Reference (Documentation)


MPIRUN Example

mpirun -partition XXX -np 128 -mode SMP -exe /path/exe \
    -cwd working_directory \
    -env "OMP_NUM_THREADS=4 XLSMPOPTS=spins=0:yields=0:stack=64000000"

• Execution Settings
  – 128 MPI tasks
  – SMP mode
  – 4 OpenMP threads
  – 64 MB thread stack
• mpirun application program interfaces available: get_parameters, mpirun_done
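When such a command line is generated from a script, it can help to print it for review before launching anything. The following is a minimal sketch of that idea: build_mpirun_cmd is a hypothetical helper, the paths are placeholders, and only options shown on these slides are used. It prints the command and does not run mpirun.

```shell
# Hypothetical helper: print (do not execute) an mpirun command line for a
# hybrid MPI/OpenMP run in SMP mode.
# Arguments: partition, MPI task count, executable, working directory, OpenMP threads.
build_mpirun_cmd() {
    partition=$1
    ntasks=$2
    exe=$3
    cwd=$4
    threads=$5
    # Both variables travel in a single quoted -env string, as in the slide example.
    env_string="OMP_NUM_THREADS=${threads} XLSMPOPTS=spins=0:yields=0:stack=64000000"
    printf 'mpirun -partition %s -np %s -mode SMP -exe %s -cwd %s -env "%s"\n' \
        "$partition" "$ntasks" "$exe" "$cwd" "$env_string"
}

# Placeholder values; review the printed line before running it on a Frontend Node.
build_mpirun_cmd XXX 128 /path/exe /path/workdir 4
```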


MPIRUN Environment Variables

• Most command-line options for mpirun can be specified using an environment variable
  – -partition / MPIRUN_PARTITION
  – -nodes / MPIRUN_NODES
  – -mode / MPIRUN_MODE
  – -exe / MPIRUN_EXE
  – -cwd / MPIRUN_CWD
  – -host / MMCS_SERVER_IP
  – -env / MPIRUN_ENV
  – -exp_env / MPIRUN_EXP_ENV
  – -mapfile / MPIRUN_MAPFILE
  – -args / MPIRUN_ARGS
  – -label / MPIRUN_LABEL
  – -enable_tty_reporting / MPIRUN_ENABLE_TTY_REPORTING
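As an illustration, site or project defaults could be placed in the environment once instead of being repeated on every command line. This is only a sketch: the values are placeholders, and only variables with a plain string value from the list above are used.

```shell
# Placeholder defaults expressed through the environment instead of options.
export MPIRUN_PARTITION=XXX       # same role as -partition
export MPIRUN_NODES=128           # same role as -nodes (alias of -np)
export MPIRUN_MODE=SMP            # same role as -mode
export MPIRUN_EXE=/path/exe       # same role as -exe
export MPIRUN_CWD=/path/workdir   # same role as -cwd

# List the settings now present in the environment.
env | grep '^MPIRUN_' | sort
```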


STDIN / STDOUT / STDERR Support

• STDIN, STDOUT, and STDERR work as expected
  – You can pipe or redirect files into mpirun and pipe or redirect output from mpirun
  – STDIN may also come from the keyboard interactively
• Any compute node may send STDOUT or STDERR data
• Only MPI rank 0 may read STDIN data
• mpirun always tells the control system and the C runtime on the compute nodes that it is writing to TTY devices. This is because, logically, mpirun looks like a pipe; it cannot do seeks on STDIN, STDOUT, and STDERR even if they are coming from files
• As always, STDIN, STDOUT, and STDERR are the slowest ways to get input and output from a supercomputer
  – Use them sparingly
• STDOUT is not buffered and can generate a huge overhead for some applications
  – Such applications should buffer STDOUT with the option -enable_tty_reporting


MPIEXEC Command

• What is mpiexec?
  – Method for launching and interacting with parallel Multiple Program Multiple Data (MPMD) jobs on Blue Gene/P
  – Very similar to mpirun; the only difference is that the arguments supported by mpiexec are slightly different
• Command Limitations
  – A pset is the smallest granularity for each executable, though one executable can span multiple psets
  – You must use every compute node of each pset; specifically, different -np values are not supported
  – The job's mode (SMP, DUAL, VNM) must be uniform across all psets


MPIEXEC Command Parameters

• The only parameter / environment variable supported by mpiexec that is not supported by mpirun
  – -configfile / MPIRUN_MPMD_CONFIGFILE
• The following parameters / environment variables are not supported by mpiexec, since their use is ambiguous for MPMD jobs
  – -args / MPIRUN_ARGS
  – -cwd / MPIRUN_CWD
  – -env / MPIRUN_ENV
  – -env_all / MPIRUN_EXP_ENV_ALL
  – -exe / MPIRUN_EXE
  – -exp_env / MPIRUN_EXP_ENV
  – -partition / MPIRUN_PARTITION
  – -mapfile / MPIRUN_MAPFILE


MPIEXEC Configuration File Syntax

• -n <Nb Nodes> -wdir <Working Directory> <Binary>
• Example
  – Configuration file content
    -n 32 -wdir /home/bgpuser /bin/hostname
    -n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world
  – Runs
    • /bin/hostname on one 32-node pset
    • hello_world on one 32-node pset
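A possible way to script this example is to write the configuration file with a here-document and pass it through -configfile. The sketch below assumes one executable entry per line, which is how the syntax line reads but is not spelled out on the slide. It prints the mpiexec command rather than running it.

```shell
# Write the two-executable MPMD configuration to a temporary file.
# Assumption: one "-n ... -wdir ... <binary>" entry per line.
configfile=$(mktemp)
cat > "$configfile" <<'EOF'
-n 32 -wdir /home/bgpuser /bin/hostname
-n 32 -wdir /home/bgpuser/hello_world /home/bgpuser/hello_world/hello_world
EOF

# Dry run: show the launch command instead of executing it.
echo "mpiexec -configfile $configfile"
```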


SUBMIT Command

• submit = mpirun command for HTC
  – Command used to run an HTC job and act as a lightweight shadow for the real job running on a Blue Gene node
  – Simplifies user interaction with the system by providing a simple common interface for launching, monitoring, and controlling HTC jobs
  – Run from a Frontend Node
  – Contacts the control system to run the HTC user job
  – Allows the user to interact with the running job via the job's standard input, standard output, and standard error
• Standard System Location
  – /bgsys/drivers/ppcfloor/bin/submit


HTC Technical Architecture


SUBMIT Command Syntax

• /bgsys/drivers/ppcfloor/bin/submit [options]
  or
  /bgsys/drivers/ppcfloor/bin/submit [options] binary [arg1 arg2 ... argn]
• Options
  – -exe <exe>: Executable to run
  – -args "arg1 arg2 ... argn": Arguments, must be enclosed in double quotes
  – -env <env=value>: Define an environment variable for the job
  – -exp_env <env>: Export an environment variable to the job's environment
  – -env_all: Add all current environment variables to the job's environment
  – -cwd <cwd>: The job's current working directory
  – -timeout <seconds>: Number of seconds before the job is killed
  – -mode <SMP|DUAL|VNM>: Job mode
  – -location <Rxx-Mx-Nxx-Jxx-Cxx>: Compute core location, regular expression supported
  – -pool <id>: Compute Node pool ID
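The options above can be combined as in the following sketch. build_submit_cmd is a hypothetical helper and the paths, timeout, and arguments are placeholder values. It only prints the command line, so it can be tried away from a Blue Gene system.

```shell
# Hypothetical helper: print (do not execute) a submit command line for one
# HTC job in SMP mode.
# Arguments: executable, working directory, timeout in seconds, program arguments.
build_submit_cmd() {
    exe=$1
    cwd=$2
    timeout=$3
    args=$4
    # -args takes a single double-quoted string, as the option list requires.
    printf '/bgsys/drivers/ppcfloor/bin/submit -exe %s -cwd %s -timeout %s -mode SMP -args "%s"\n' \
        "$exe" "$cwd" "$timeout" "$args"
}

build_submit_cmd /home/bgpuser/hello_world/hello_world /home/bgpuser 300 "arg1 arg2"
```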


IBM Scheduler for HTC

• IBM Scheduler for HTC = HTC job scheduler
  – Handles scheduling of HTC jobs
• HTC Job Submission
  – External work requests are routed to the HTC scheduler
    • Single or multiple work requests from each source
  – IBM Scheduler for HTC finds an available HTC client and forwards the work request
  – The HTC client runs the executable on a compute node
    • A launcher program on each compute node handles the work request sent to it by the scheduler. When the work request completes, the launcher program is reloaded and the client is ready to handle another work request


IBM Scheduler for HTC Components

• IBM Scheduler for HTC Purpose
  – Provides features not available with the "submit" interface
    • Queuing of jobs until compute resources are available
    • Tracking of failed compute nodes
  – The "submit" interface is intended for use by job schedulers
    • Not by end users directly
• IBM Scheduler for HTC Components
  – simple_sched daemon
    • Runs on the Service Node or a Frontend Node
    • Accepts connections from startd and client programs
  – startd daemons
    • Run on a Frontend Node
    • Connect to simple_sched, get jobs, and execute submit
  – Client programs
    • qsub = Submits a job to run
    • qdel = Deletes a job submitted by qsub
    • qstat = Gets the status of a submitted job
    • qcmd = Admin commands


HTC Executables

• htcpartition
  – Utility program shipped with Blue Gene
  – Responsible for booting / freeing HTC partitions from a Frontend Node
• run_simple_sched_jobs
  – Provides an instance of IBM Scheduler for HTC and startd
  – Executes commands either specified in command files or read from stdin
  – Creates a cfg file that can be used to submit jobs externally, in addition to those from the command files or stdin
  – Exits when all the commands have finished (or can be told to keep running)


IBM Scheduler for HTC Integration to LoadLeveler

• LoadLeveler handles
  – Partition reservation and booting
    • New LoadLeveler keyword: # @ bg_partition_type = HTC_LINUX_SMP
  – Partition shutdown
• IBM Scheduler for HTC handles
  – Queuing of the batch of executions
    • Either specified in command files or read from stdin
  – Submission of the executions
  – Execution recovery when a failure occurs
    • Only system faults are recovered: a failed submission can be retried
    • User program failures are considered permanent


IBM Scheduler for HTC Glide-In to LoadLeveler


LoadLeveler Job Command File Example

#!/bin/bash
# @ bg_partition_type = HTC_LINUX_SMP
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue

# Command File
COMMANDS_RUN_FILE="$PWD/cmds.txt"

# Run the queued commands; LoadLeveler boots and shuts down the HTC partition
/bgsys/opt/simple_sched/bin/run_simple_sched_jobs "$COMMANDS_RUN_FILE"


IBM Scheduler for HTC Integration to LoadLeveler < 3.5

• The IBM Scheduler for HTC / LoadLeveler integration described so far is valid for LoadLeveler versions >= 3.5
• Looser integration with LoadLeveler versions < 3.5
  – LoadLeveler doesn't handle partition boot / shutdown
• Consequences
  – Explicit partition boot / shutdown required in the LoadLeveler job command file
  – Achieved through calls to the HTC binary command htcpartition
    • htcpartition --boot { … }
    • htcpartition --free


LoadLeveler Job Command File Example (LL < v3.5)

#!/bin/bash
# @ class = BGP64_1H
# @ comment = "Personality / HTC"
# @ environment =
# @ error = $(job_name).$(jobid).err
# @ group = default
# @ input = /dev/null
# @ job_name = Personality-HTC
# @ job_type = bluegene
# @ notification = never
# @ output = $(job_name).$(jobid).out
# @ queue

# Command File
COMMANDS_RUN_FILE="$PWD/cmds.txt"

# Local Simple Scheduler Configuration File
SIMPLE_SCHED_CONFIG_FILE="$PWD/my_simple_sched.cfg"

# Free the HTC partition; registered below to run when the script exits
partition_free() {
    echo "Freeing HTC Partition"
    /bgsys/drivers/ppcfloor/bin/htcpartition --free
}

# Boot the HTC partition explicitly (LoadLeveler < 3.5 does not do it)
/bgsys/drivers/ppcfloor/bin/htcpartition --boot --configfile "$SIMPLE_SCHED_CONFIG_FILE" --mode linux_smp

trap partition_free EXIT

/bgsys/opt/simple_sched/bin/run_simple_sched_jobs -config "$SIMPLE_SCHED_CONFIG_FILE" "$COMMANDS_RUN_FILE"
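The point of the trap in this job command file is that the partition is freed whether the jobs succeed or fail. The control flow can be tried on any machine with the sketch below, in which the three functions are hypothetical stand-ins for htcpartition --boot, htcpartition --free, and run_simple_sched_jobs.

```shell
# Hypothetical stand-ins for the real commands, so the flow runs anywhere.
boot_partition() { echo "boot"; }                   # htcpartition --boot ...
free_partition() { echo "free"; }                   # htcpartition --free
run_jobs()       { echo "run"; return "${1:-0}"; }  # run_simple_sched_jobs ...

# The parentheses run the body in a subshell, so the EXIT trap fires
# when the body ends, whatever the status of run_jobs.
job_body() (
    boot_partition || exit 1
    trap free_partition EXIT
    run_jobs "$1"
)

job_body 0           # jobs succeed: the partition is freed
job_body 3 || true   # jobs fail: the partition is still freed
```

Registering the trap only after a successful boot, as the job command file does, avoids trying to free a partition that was never booted.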