High Performance Computing at ac3
HPC Platforms available to ac3 users
Machines at ac3, a company owned jointly by a number of NSW universities and the NSW State Government.
Machines at APAC, a national partnership of state-based organisations such as ac3.
Types of HPC
1. Vector (a single vector CPU)
2. Shared Memory (SMP): several CPUs attached to one shared memory
3. Clusters: independent CPUs (nodes) connected by a network
The Earth Simulator (currently the third fastest machine in the world) is a cluster of vector SMPs. SMP clusters are very popular.
@ac3: SGI SMP systems
SGI Origin (Clare): 64 processors with 32 GB shared memory. 50 GFlops peak speed - decommissioned.
SGI Power Challenge (Napier): 28 processors with 4 GB shared memory. 11 GFlops peak. Used for interactive code development and educational purposes - decommissioned.
SGI Altix (Swan): 16 processors with performance in the range of Clare’s.
@ac3: NEC Vector system
NEC SX5 (Hunter): 2 vector processors sharing 12 GB memory. 16 GFlops peak.
@ac3: Dell cluster system
Dell Beowulf cluster (Barossa): 155 dual-processor Pentium 4 nodes at 3 GHz. 1.7 TFlops peak, 290 GB memory in total.
Benchmarked at 1.096 TFlops (Linpack), this is Australia’s most powerful academic computer, and is currently 330th on the Top 500 list. It entered the list at position 108.
APAC National Facility
HP (Compaq) SC: 1 x 16-processor SMP node (32 GFlops peak) with 16 GB memory, and 120 quad-processor nodes (960 GFlops peak, 700 GB memory). It is being replaced by an Altix cluster consisting of approximately 1000 CPUs.
Dell Beowulf cluster LC: 150 Pentium 4 nodes at 2.66 GHz. 800 GFlops peak, 150 GB memory.
Deciding on a Machine
To decide which machine is best for your task, you need to consider:
How much memory you need.
How much independent parallelism you have.
Whether your code is parallelised in a shared-memory or a distributed-memory way.
Whether your application is compute bound, communication bound or I/O bound.
Barossa is good for...
Lots of independent tasks needing less than 1 GB memory
SMP tasks using up to 2 GB memory
Compute-bound distributed-memory tasks. (Communication-bound tasks may be better served on the SC or Swan.)
Barossa is not particularly good for I/O-bound tasks.
Getting Access
Access to ac3 systems is obtained through your Campus Coordinator. This enables minimal usage, i.e. without a project.
See http://www.ac3.edu.au for details:
Signed paper form to ac3
Online researcher database entry (providing information about your work)
Access to the APAC system is through applying for a research grant at http://nf.apac.edu.au:
Startup Grant (1000 SUs) - 1 SU ≈ 1 hr on a 1 GHz CPU
APAC Merit Allocation Grant
ac3 Research resource allocation (Partner Share)
ac3 Partner share procedure
Fill out the online form at http://nf.apac.edu.au (if a new account)
Fill out the ac3 resource allocation application at http://www.ac3.edu.au
Fill out the researcher database entry
→ Apply for an APAC grant through ac3, or do a private allocation
Introductory User Guides
These are a must-read! Topics covered:
Logging in and setting up your account
Using compilers and numerical libraries
Parallelising code
Using the batch system
Using scratch file systems
For ac3’s systems: see http://www.ac3.edu.au
For APAC NF: see http://nf.apac.edu.au
Getting Help
Scientists should do science, not computer science!
Seek professional help, don’t bang your head against a brick wall!
First port of call: [email protected]
HPCSU (UNSW) is part of the ac3 user support team; USyd Vislab has an HPC expert.
Using batch queues
PBS is used on Barossa, Swan and the APAC machines. Hunter uses an NQS variant:
qsub submit a job
qstat query job or queue status
qdel delete a job from queue
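As a quick illustration, a typical session might look like this (the script name and job id are placeholders):

qsub myjob.sh        # submit the script; PBS prints a job id, e.g. 84123.barossa
qstat 84123          # query the status of that job
qstat -u $USER       # or list all of your own jobs
qdel 84123           # delete the job from the queue if it is no longer wanted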
The qsub.pl script provides a PBS-compatible interface for Hunter.
qsub options
-lwalltime=, -lcput= request walltime or CPU time
-lncpus=, -lnodes= request a certain number of CPUs or nodes (parallel jobs)
-lmem= request a certain amount of memory
-lnodes=x:ppn=2 Request 2x CPUs, spread over x nodes.
-q queuename Specify a queue
-A projectname Specify the project (account) the job belongs to
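For instance, a hypothetical submission combining several of these options (the script, project and resource values are placeholders):

qsub -q workq -A myproject -l walltime=10:00:00 -l nodes=2:ppn=2 -l mem=2gb myjob.sh

This asks for 4 CPUs spread over 2 nodes, 2 GB of memory and 10 hours of walltime in the workq queue, charged to the project myproject.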
qstat output
[rks@barossa gen-random]$ qstat -u rks
barossa.ac3.com.au:
Req’d Req’d Elap
Job ID Username Queue Jobname NDS TSK Memory Time S Time
------ -------- -------- ------- --- --- ------ ----- - -----
83991 rks workq pt0.91 1 1 512mb 72:00 R 04:05
83992 rks workq pt0.92 1 1 512mb 72:00 R 03:53
84056 rks workq gen-ran 1 1 512mb 72:00 R 00:01
NDS: number of nodes; TSK: tasks (interesting for parallel jobs)
Maui scheduler
Used on Clare and Barossa.
showq (or “pbs showq” on Barossa) can be used to obtain additional queue information:
[rks@barossa gen-random]$ pbs showq -i
JobName Priority XFactor User Procs WCLimit SystemQueueTime
84057* 10003 1.8 alexg 20 1:00:00 Mon Jun 28 14:33:09
84059* 8721 1.0 houska 8 6:06:00:00 Mon Jun 28 15:07:20
84040* 6749 1.0 ahmadj 1 3:00:00:00 Mon Jun 28 12:54:37
84041* 6749 1.0 ahmadj 1 3:00:00:00
XFactor = 1 + (time queued) / (requested walltime)
→ a larger XFactor means a larger likelihood of getting started
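For example, a job that has been queued for 3 hours and requests 6 hours of walltime has XFactor = 1 + 3/6 = 1.5; after 12 hours in the queue the same job has XFactor = 1 + 12/6 = 3, so jobs with short walltime requests that have waited a long time rise quickly in the queue.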
Priority
Priority is determined by (+ increases, - decreases priority):
requested walltime (-)
length of time queued (+)
fair share FS (number of submitted jobs) (-)
resource allocation (memory) (-)
submitted queue (+/-)
$ diagnose -p
Job PRIORITY* Cred( User:Accnt:Class) FS( User:Accnt) Serv(QTime)
444531 20990345 100.0( 0.0:41962: 10.0) 0.0(525.1: 0.0) 0.0(3820.)
Preemption
When another job is suspended so that a high-priority job can start.
Automatic preemption is available on Swan (the new Altix) and the APAC SC, but not on Barossa. Running jobs will run to completion or their requested limit.
A large parallel job will cause queues to drain. Jobs with small wallclock requests will “backfill” (Barossa’s alternative to preemption).
Because Barossa does not have preemption, long-running jobs (in the xlong queue) are restricted to 8 nodes.
Estimating when your job will start
$ showstart 444531
job 444531 requires 16 procs for 1:06:25:00
Earliest start in 1:15:28:23 on Sun Jan 16 08:44:26
Earliest completion in 2:21:53:23 on Mon Jan 17 15:09:26
Best Partition: DEFAULT
This estimate is conservative: it assumes all jobs will run for their full requested time, so resources may become available sooner.
It cannot account for jobs submitted in the future that get higher priority (hence the caveat “earliest”).
Understanding the machine’s state
$ qview
barossa015 . 1 441911 carlm (1200mb, 512mb)
barossa025 . 1 444043 bsoule (1200mb, 512mb)
barossa037 . 1 444292 lambui01 ( 128mb, 512mb)
barossa050 . 0 free
barossa055 O 0 free
barossa064 . 1 444534 sinavafi ( 512mb, 1700mb)
barossa073 O 0 free
barossa098 O 0 free
barossa115 O 0 free
barossa117 O 0 free
barossa137 . 1 444539 sinavafi ( 512mb, 1700mb)
The O status indicates a node is offline for some reason.
Column 3 indicates the number of CPUs in use on that node.
Queues on Barossa
priority Highest priority, charged at 3× the normal SU rate (your allocation is used up faster)
xlarge For jobs with more than 32 nodes
xlong For jobs longer than 3 days
checkable For jobs that can be checkpointed
single Single CPU jobs
stampfl, robinson Private queues
Notes on queues
xlong and xlarge require you to email [email protected] before jobs are scheduled
checkable jobs can be killed at any time to make room for xlarge jobs → restart opportunity
default: if no queue is specified, PBS will route the job to the most appropriate one
workq: a “catchall” queue, low priority, with a reducing share (ac3 will reduce that queue further)
Checkpointing
Checkable jobs should
at minimum specify -r. The job will then be requeued if killed.
ideally write a checkpoint file, so no work is lost
could also use a log file to determine where the calculation left off
Writing Checkpoint files
SIGUSR1 and SIGTERM are sent by the batch system prior to the job being killed.
Trapping these signals may not give sufficient time to reach a checkpointable part of the program.
Alternatively, write a checkpoint whenever possible (within reasonable I/O demands).
Trap SIGTERM and record its arrival. If the signal was received before the checkpoint write, exit immediately; if it was received during the checkpoint, exit immediately after the checkpoint.
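A minimal sketch of this pattern at the shell level (the program name, its arguments and the "finished" marker file are hypothetical stand-ins; a real application would normally handle the signal inside the code itself):

#!/bin/sh
# Sketch only: record SIGTERM instead of dying, finish the current
# checkpoint, then exit cleanly.
caught=0
trap 'caught=1' TERM               # note the signal's arrival

while [ ! -f finished ]; do        # loop until the application marks completion
    a.out step >>output            # advance the calculation by one chunk
    a.out write_checkpoint         # refresh checkpoint.dat
    if [ $caught -eq 1 ]; then     # SIGTERM arrived: stop now that the checkpoint is safe
        exit 0
    fi
done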
Checkpointing options
No system support for checkpointing is provided on Barossa; applications must provide their own! SIGUSR1 and SIGTERM are sent by the system, but no software is provided for trapping the signal.
Use Classdesc (http://parallel.hpc.unsw.edu.au/classdesc) for C++ or FClassdesc (http://parallel.hpc.unsw.edu.au/fclassdesc) for Fortran90 to write a binary file representing the relevant state data of your program.
Traditional (non-Classdesc) checkpointing techniques are also possible (see http://www.ac3.edu.au/hints), but the code is harder to maintain.
clumon
http://barossa.ac3.com.au/clumon
Batch scripts
Typically Bourne shell programs, but can be any interpreted language (e.g. Perl) where # is a comment character.
qsub options can be specified on lines beginning with #PBS. These are overridden by the command line.
PATH usually doesn’t include “.”. Set an appropriate path, e.g. PATH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin:$PATH, or set it in .bashrc, e.g.: MPICH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin; export MPICH
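Putting this together, a minimal sketch of such a script (the executable name is a placeholder and the resource requests are arbitrary):

#!/bin/sh
#PBS -l cput=1:00:00 -l mem=256MB
# prepend the MPICH bin directory quoted above so its tools are found first
PATH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin:$PATH
export PATH
cd $PBS_O_WORKDIR                 # change to the directory the job was submitted from
./a.out >output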
Batch scripts ...
The environment contains extra information:
$PBS_O_WORKDIR The place from where the job was submitted.
$PBS_NODEFILE A file listing the node names attached to your job, e.g. wc -l $PBS_NODEFILE will return the number of CPUs your job is running on.
$PBS_JOBID Your current job’s job id, e.g. /scratch/$PBS_JOBID is the name of a temporary directory on a node’s local disk for intensive I/O applications.
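A short illustrative fragment using these variables (resource requests and names are arbitrary):

#!/bin/sh
#PBS -l nodes=2:ppn=2
cd $PBS_O_WORKDIR                  # return to the directory the job was submitted from
NCPUS=`wc -l < $PBS_NODEFILE`      # how many CPUs PBS has allocated to this job
echo "job $PBS_JOBID is using $NCPUS CPUs, scratch in /scratch/$PBS_JOBID"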
(Trivially) Parallel Batch scripts
#!/bin/sh
n=0
while [ $n -lt 10 ]; do
    echo $n >parm${n}.dat
    # generate a job script for parameter set $n ...
    cat >scr$n <<EOF
#PBS -l cput=03:00:00 -l mem=128MB
a.out parm${n}.dat >out${n}.dat
expr \`cat parm${n}.dat\` + 10 >parm${n}.dat  # backquotes escaped so this runs at job time
qsub scr$n
EOF
    # ... and submit it
    qsub scr$n
    n=`expr $n + 1`
done
OpenMP jobs
The compiler can autoparallelise, or make use of manual OpenMP compiler directives. See the intro guides for specific compiler flags.
Specify the number of threads via the OMP_NUM_THREADS environment variable. On Barossa, a job can make use of 2 threads, the SC can make use of 4, and Swan up to 16.
Use the -lnodes=1:ppn=2 option on Barossa. Use -lnodes=1:ppn=16 to select APAC’s big SMP node on the SC.
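A sketch of an OpenMP job script on Barossa along these lines (the executable name is a placeholder; the flags needed to build it are in the intro guides):

#!/bin/sh
#PBS -l nodes=1:ppn=2 -l cput=2:00:00
cd $PBS_O_WORKDIR
OMP_NUM_THREADS=2            # match the 2 CPUs requested above
export OMP_NUM_THREADS
./omp_prog >output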
MPI jobs
Swan (the new Altix) has SGI MPI and MPICH
Barossa has LAM and MPICH
LC has LAM
SC has an Elan version; the SC will be replaced by a large Altix
In terms of network performance, the order is Swan, SC, Barossa, LC. In terms of processor performance, the order is Barossa, LC, SC, Swan. Are you CPU bound or network bound?
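For example, a distributed-memory job on Barossa using MPICH might be submitted with a script along these lines (the executable name is a placeholder; make sure the matching mpirun is on your PATH as described earlier):

#!/bin/sh
#PBS -l nodes=4:ppn=2 -l cput=12:00:00
cd $PBS_O_WORKDIR
NP=`wc -l < $PBS_NODEFILE`         # CPUs allocated by PBS
mpirun -np $NP -machinefile $PBS_NODEFILE ./mpi_prog >output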
Other parallel jobs (eg PVM, CFX, ...)
On Barossa, you have ssh privileges to any node running your job. If you don’t have a job running, you don’t.
This means that any parallel transport layer built using ssh (like PVM) will work on Barossa. This is not true of the APAC systems, where you have to use MPI. (PVM is available on the SC.)
Self-submitting batch jobs
#!/bin/sh
#PBS -l cput=3:0:0 -l mem=128MB -r y
if [ ! -z "$PBS_O_WORKDIR" ]; then
    cd $PBS_O_WORKDIR
fi
if [ -f stop ]; then exit; fi
if [ -f checkpoint.dat ]; then
    a.out restart >>output
else
    a.out init >>output
fi
qsub $PBS_JOBNAME

$PBS_JOBNAME is the name of the script above.
Scratch I/O
On a cluster, a system-wide file system is provided via NFS to access home and short-term data directories.
NFS cannot cope with lots of small read/write requests.
Each node has a local disk, which can be accessed for the duration of the job. Use scp to copy data to/from the local disk (scfscp on the SC). On Barossa, this scratch directory is called /scratch/$PBS_JOBID.
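As a sketch, an I/O-intensive single-node job on Barossa might use this scratch directory like so (file and executable names are placeholders):

#!/bin/sh
#PBS -l nodes=1 -l cput=6:00:00
SCR=/scratch/$PBS_JOBID                 # node-local scratch directory for this job
cp $PBS_O_WORKDIR/input.dat $SCR/       # stage input onto the local disk
cd $SCR
$PBS_O_WORKDIR/a.out input.dat >output.dat
cp output.dat $PBS_O_WORKDIR/           # copy results back before the job ends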
Resource Allocation
Applications are made to the ac3 resource allocation committee every six months. Grants are assessed on merit (a well-composed application is a condition). Proposals are generally around 3 pages long.
Resources are granted in terms of system units (SUs). 1 SU corresponds to roughly 1 hour on a 1 GHz processor.

Machine     SU available per 6 months
Barossa     3.3 million
APAC NF     132,000
Hunter      114,000
Clare       94,000
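If charging scales with processor speed, as this definition suggests, an hour on one of Barossa’s 3 GHz CPUs costs roughly 3 SUs, so a 24-hour single-CPU run would consume about 24 × 3 = 72 SUs (a rough guide only; the priority queue is additionally charged at 3×).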
Looking at your resource usage
http://barossa.ac3.edu.au/pbs
ProjectID Grant Used Remaining
-------------------------------------------------------------
acnoise 400000 109913 290087
agero 30000 39356 -9356
ahmadj 0 63633 -63633
alexg 32000 23248 8752
apitman 1500 0 1500
ayang 0 134171 -134171
...