Job Submission on WestGrid
Feb 15, 2005, on Access Grid
Introduction
- Simon Sharpe, one member of the WestGrid support team
- The best way to contact us is to email [email protected]
- This seminar tells you:
  - How to run, monitor, or cancel your jobs
  - How to select the best site for your job
  - How to adapt your job submission for different sites
  - How to get your jobs running as quickly as possible
- Feel free to interrupt if you have questions
Getting into the Queue
- HPC resources are valuable research tools
- A batch queuing system is needed to:
  - Match jobs to resources
  - Deliver maximum bang for the research buck
  - Distribute jobs and collect output across parallel CPUs
  - Ensure a fair sharing of resources
Getting into the Queue
- WestGrid compute sites use TORQUE/Moab, based on PBS (Portable Batch System)
- You need just a few commands common to WestGrid machines
- There are important differences in job submission among sites that you need to know about
- With the diversity of WestGrid, it is possible that there is more than one machine suitable for your job
A Simple Sample
- The script file serialhello.pbs tells TORQUE how to run the C program serialhello
- This example shows how to run a serial job on Glacier, which is a good choice for serial jobs
- The qsub command tells TORQUE to run the job described in the script file serialhello.pbs
- When your job completes, TORQUE creates two new files in the current directory, capturing:
  - standard error from the job
  - standard output
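The slide's screenshot of the script is not reproduced in this transcript. A minimal serialhello.pbs consistent with the description might look like the following; the script and program names come from the slides, while the shell choice and the walltime value are illustrative:

```shell
#!/bin/bash
# Illustrative walltime estimate; adjust to your job
#PBS -l walltime=00:10:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./serialhello
```

Submitting with `qsub serialhello.pbs` returns a job id, and on completion TORQUE writes the two capture files (standard output and standard error) named after the script and job id into the submission directory.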
End of Seminar
- Thanks for coming
- I wish it were that easy
HPC: One Size Does Not Fit All
- When the only tool you have is a hammer, every job looks like a nail
- Things that affect system selection:
  - System dictated by executable or licensing
  - MPI or OpenMP
  - Availability: how busy is the system?
  - Amount of RAM required
  - Speed or number of processors
HPC: One Size Does Not Fit All
- Things that affect system selection (continued):
  - Scalability of your application
  - Inter-processor communication requirements
  - Queue limits (walltime, number of CPUs)
  - Inertia: "It is where we've always run it"
- http://www.westgrid.ca/support/System_Status
- http://www.westgrid.ca/support/Facilities
- http://www.westgrid.ca/support/software
Uses of WestGrid MachinesUses of WestGrid MachinesMachine Use Interconnect CPUs
Glacier
IBM Xeon
Serial, moderate parallel MPI
GigE
Shared in node
1680
Dual CPUs/node
Matrix
HP XC Alpha
MPI Parallel Infiniband,
Shared in node
256
Dual CPUs/node
Lattice
HP SC Alpha
Moderate MPI parallel, serial
Quadrics,
Shared in node
144, 68 (G03)
Quad CPUs/node
Cortex
IBM Power5
OpenMP, MPI Parallel
Shared memory 64, 64, 4
Nexus
SGI Origin MIPS
OpenMP, MPI Parallel
Shared memory 256, 64, 64, 36, 32, 32, 8
Robson
IBM Power5
Serial, moderate MPI parallel
GigE,
Shared in node
56
Dual CPUs/node
TORQUE and Moab Commands
- qsub script: submit this job to the queue; common options include:
  - -l mem=1GB
  - -l nodes=4:ppn=2 (or, on Nexus, -l ncpus=4)
  - -l walltime=06:00:00
  - -q queue-name
  - -m and -M for email notifications
- showq: show the jobs in the queue
- qstat jobid: show the status of the job in the queue; common options include -a and -an
- qdel jobid: delete this job from the queue
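Pulling the commands above together, a typical session might look like the following sketch; the script name and the job id 12345 are hypothetical:

```shell
# Submit, passing resource requests on the command line
qsub -l walltime=06:00:00 -l nodes=4:ppn=2 mpihello.pbs
# List all queued and running jobs
showq
# Show detailed status for one job
qstat -an 12345
# Remove the job from the queue
qdel 12345
```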
Sample MPI Job on Glacier
- Parallel jobs have differing degrees of parallelism
- Glacier, which has a slower interconnect than other WestGrid machines, may not turn out to be the best place for your parallel job
  - Latency: like the time it takes to dial and say "hello"
  - Bandwidth: how fast can you talk?
- If your parallel job does not require intensive communications between processes, it may be worth testing on Glacier
- More info on Glacier submissions at:
  - http://www.westgrid.ca/support/programming/glacier.php
  - http://guide.westgrid.ca/guide-pages/jobs.html
MPI Submission on Glacier
- We need to tell TORQUE how many processors we need
- This asks for 2 nodes and 2 processors per node (4 CPUs)
- Similar script to last time, but now calling a program parallelized with MPI
- Adding the walltime estimate helps TORQUE schedule the job
- Note that we can pass directives on the command line or in the script
- This time we wait in the queue
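The script shown on the slide is not in this transcript. A Glacier-style script matching the description (2 nodes, 2 processors per node, a walltime estimate, an MPI program) might look like the sketch below; the program name mpihello, the walltime value, and the mpiexec launcher are illustrative, since launcher names vary by site:

```shell
#!/bin/bash
# 2 nodes x 2 processors per node = 4 CPUs
#PBS -l nodes=2:ppn=2
# A walltime estimate helps TORQUE schedule the job
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
# Launch one MPI process per allocated CPU
mpiexec ./mpihello
```

The same directives can instead go on the command line, e.g. `qsub -l nodes=2:ppn=2 -l walltime=00:30:00 mpihello.pbs`.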
Sample MPI Job on Matrix
- Matrix is an HP XC cluster using AMD Opterons and an Infiniband interconnect
- 64-bit Linux
- Not intended for serial work
- A good home for parallel jobs
- More info on Matrix submissions at: http://www.westgrid.ca/support/programming/matrix.php
Running MPI Jobs on Matrix
- For Matrix, use nodes and processors per node (ppn) to tell TORQUE how many CPUs your job needs
- Matrix machines have 2 CPUs per node
- A minimal TORQUE script runs a parallel MPI job on Matrix
- Standard and error output are dropped into the directory we submitted from
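A minimal Matrix script consistent with the slide might look like this sketch; the CPU count, walltime, program name, and mpirun launcher are illustrative:

```shell
#!/bin/bash
# Matrix nodes have 2 CPUs each: 4 nodes x 2 ppn = 8 CPUs
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# Generic MPI launcher; check the Matrix documentation for the site-specific command
mpirun ./mpihello
```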
Sample MPI Job on Lattice
- Lattice is an HP Alpha cluster connected with Quadrics
- 64-bit Tru64
- Intended for parallel work
  - Four-processor shared-memory nodes
  - Quadrics interconnect for more than 4 processors
- MPI communicates through the interconnect or shared memory, as appropriate
- Also being used for some serial work
- More info on Lattice submissions at:
  - http://hpc.ucalgary.ca/westgrid/running.html
  - http://www.westgrid.ca/support/programming/lattice.php
Running MPI Jobs on Lattice
- For Lattice, use nodes and processors per node to set the number of processors; Lattice has 4 processors on each node
- In this case we ask for 2 CPUs on one box and 2 on another
- A minimal TORQUE script runs a parallel MPI job on Lattice
- Standard and error output are dropped into the directory we submitted from
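The 2-CPUs-on-each-of-2-boxes request described above might be sketched as follows; the launcher line is an assumption (Quadrics-based Alpha systems often use prun rather than mpirun, so check the Lattice pages above):

```shell
#!/bin/bash
# 2 nodes x 2 processors per node, even though Lattice nodes have 4 CPUs
#PBS -l nodes=2:ppn=2
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# Launcher name varies by site: possibly prun on Quadrics systems
mpirun ./mpihello
```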
Sample Serial Job on Lattice
- Lattice has a high-speed Quadrics interconnect
- If your job is serial, it does not take advantage of the Quadrics interconnect
- Glacier may be an alternative
- Having said that, many serial jobs are run on Lattice
Running Serial Jobs on Lattice
- On Lattice, we tell TORQUE to run the job described in the script file serialhello.pbs
- A minimal TORQUE script runs a serial job on Lattice
- Standard and error output are dropped into the directory we submitted from
Sample Parallel Job on Cortex
- Cortex is a machine with IBM Power5 SMP processors
- Running AIX
- Not for serial work
- A good home for large parallel applications needing shared memory and/or fast interconnection
- Good for large-memory jobs
- More info on Cortex submissions at:
  - http://www.westgrid.ca/support/cortex
  - http://www.westgrid.ca/support/programming/cortex.php
Running Parallel Jobs on Cortex
- On Cortex, we tell TORQUE to run the job described in the script file mpihello.pbs
- The script describes how we want Cortex to run the parallel program mpihello
- The standard output file is dropped into our working directory
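A sketch of an mpihello.pbs for a shared-memory machine like Cortex follows; the ncpus request syntax, walltime, and launcher are assumptions, since the slide's actual script is not in this transcript (AIX sites sometimes use poe rather than mpiexec):

```shell
#!/bin/bash
# CPU request syntax varies by site; a 4-CPU request is illustrative
#PBS -l ncpus=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# Generic launcher; check the Cortex pages for the site-specific command
mpiexec ./mpihello
```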
Sample Parallel Job on Nexus
- Nexus is a collection of SGI SMP machines
- Several sizes serviced by different queues: test on smaller machines, heavy lifting on large ones
- A good home for parallel jobs with intense communication requirements and/or large memory needs
- More information at: http://www.ualberta.ca/AICT/RESEARCH/PBS/index.westgrid.html
Running OpenMP Jobs on Nexus
- For Nexus, match ncpus with OMP_NUM_THREADS
- In this case we ask for 8 CPUs on the Helios machine (8-32 CPUs)
- You can try trivial OpenMP jobs from the command line; this job ran interactively on the head node
- You should not use more than 2 processors for interactive jobs
- To run jobs requiring real processing, you must submit them to TORQUE
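The ncpus/OMP_NUM_THREADS pairing described above might be sketched like this; the program name omphello and the walltime are illustrative:

```shell
#!/bin/bash
# On Nexus, request CPUs with ncpus rather than nodes:ppn
#PBS -l ncpus=8
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# Match the OpenMP thread count to the requested CPUs
export OMP_NUM_THREADS=8
./omphello
```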
Sample Serial Job on Robson
- Robson is a new 56-processor Power5 system
- 64-bit Linux
- Good for serial work; may be suitable for some parallel processing
- Message passing through MPI
- More info at: http://www.westgrid.ca/support/robson
Running Serial Jobs on Robson
- This is a minimal serial job submission script for Robson; it runs the executable "hello"
- A more elaborate script example is available at: http://www.westgrid.ca/support/robson
- Robson also runs MPI parallel jobs, as described on the above web page
- TORQUE drops the error output (zero-length in this case) and standard output to the directory we submitted from
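A minimal script of the kind described, running the executable "hello" from the submission directory, might be as short as this sketch (the shell line and walltime are illustrative):

```shell
#!/bin/bash
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
./hello
```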
Shortening the HPC Cycle
- Try your jobs at different sites
- Test your process on small jobs
- Give realistic walltimes and memory requirements
- Apply for a larger Resource Allocation: http://www.westgrid.ca/manage_rac.html
Summary
- HPC jobs have differing requirements
- WestGrid provides an increasing variety of tools
- Use the system that is best for your job
- Start off simple and small
- Find out how well your job scales
- Getting help:
  - Because of implementation differences, "man qsub" might not be your best source of help
  - Support pages as listed throughout this presentation
  - Email [email protected]