High Performance Computing at ac3
HPC Platforms available to ac3 users
Machines at ac3, a company owned jointly by a number of NSW universities and the NSW State Government.
Machines at APAC, a national partnership of state-based organisations such as ac3.
Types of HPC
1. Vector (a single vector CPU)
2. Shared Memory (SMP): several CPUs attached to one shared memory
3. Clusters: independent CPUs (nodes) connected by a network
The Earth Simulator (currently the third fastest machine in the world) is a cluster of vector SMPs. SMP clusters are very popular.
@ac3: SGI SMP systems
SGI Origin (Clare): 64 processors with 32 GB shared memory. 50 GFlops peak speed - decommissioned.
SGI Power Challenge (Napier): 28 processors with 4 GB shared memory. 11 GFlops peak. Used for interactive code development and educational purposes - decommissioned.
SGI Altix (Swan): 16 processors with performance in the range of Clare’s.
@ac3: NEC Vector system
NEC SX5 (Hunter): 2 vector processors sharing 12 GB memory. 16 GFlops peak.
@ac3: Dell cluster system
Dell Beowulf cluster (Barossa): 155 dual-processor Pentium 4 nodes at 3 GHz. 1.7 TFlops peak, 290 GB memory in total.
Benchmarked at 1.096 TFlops (Linpack), this is Australia’s most powerful academic computer, and is currently 330th on the Top 500 list. It entered the list at position 108.
APAC National Facility
HP (Compaq) SC: 1 x 16-processor SMP node (32 GFlops peak) with 16 GB memory, and 120 quad-processor nodes (960 GFlops peak, 700 GB memory). It is being replaced by an Altix cluster consisting of approximately 1000 CPUs.
Dell Beowulf cluster LC: 150 Pentium 4 nodes at 2.66 GHz. 800 GFlops peak, 150 GB memory.
Deciding on a Machine
To decide which machine is best for your task, you need to consider:
How much memory you need.
How much independent parallelism you have.
Whether your code is parallelised in a shared-memory or a distributed-memory way.
Whether your application is compute bound, communication bound or I/O bound.
Barossa is good for...
Lots of independent tasks needing less than 1 GB memory
SMP tasks using up to 2 GB memory
Compute-bound distributed-memory tasks. (Communication-bound tasks may be better served on the SC or Swan.)
Barossa is not particularly good for I/O-bound tasks.
Getting Access
Access to ac3 systems is obtained through your Campus Coordinator. This enables minimal usage, i.e. without a project.
See http://www.ac3.edu.au for details:
Signed paper form to ac3
Online researcher database entry (providing information about your work)
Access to the APAC system is through applying for a research grant at http://nf.apac.edu.au:
Startup Grant (1000 SUs) - 1 SU ≈ 1 hr on a 1 GHz CPU
APAC Merit Allocation Grant
ac3 Research resource allocation (Partner Share)
ac3 Partner share procedure
Fill out the online form at http://nf.apac.edu.au (if a new account)
Fill out the ac3 resource allocation application at http://www.ac3.edu.au
Fill out the researcher database entry
→ Apply for an APAC grant through ac3, or do a private allocation
Introductory User Guides
These are a must-read! Topics covered:
Logging in and setting up your account
Using compilers and numerical libraries
Parallelising code
Using the batch system
Using scratch file systems
For ac3’s systems: see http://www.ac3.edu.au
For APAC NF: see http://nf.apac.edu.au
Getting Help
Scientists should do science, not computer science!
Seek professional help, don’t bang your head against a brick wall!
First port of call: [email protected]
HPCSU (UNSW) is part of the ac3 user support team; USyd Vislab has an HPC expert.
Using batch queues
PBS is used on Barossa, Swan and the APAC machines. Hunter uses an NQS variant:
qsub submit a job
qstat query job or queue status
qdel delete a job from queue
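As a quick illustration, a typical session might look like this (the script name and job id are placeholders):

qsub myjob.sh        # submit the script; PBS prints a job id, e.g. 84123.barossa
qstat 84123          # query the status of that job
qstat -u $USER       # or list all of your own jobs
qdel 84123           # delete the job from the queue if it is no longer wanted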
The qsub.pl script provides a PBS-compatible interface for Hunter.
qsub options
-lwalltime=, -lcput= request walltime or CPU time
-lncpus=, -lnodes= request a certain number of CPUs or nodes (parallel jobs)
-lmem= request a certain amount of memory
-lnodes=x:ppn=2 Request 2x CPUs, spread over x nodes.
-q queuename Specify a queue
-A projectname Specify the project (account) the job belongs to
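For instance, a hypothetical submission combining several of these options (the script, project and resource values are placeholders):

qsub -q workq -A myproject -l walltime=10:00:00 -l nodes=2:ppn=2 -l mem=2gb myjob.sh

This asks for 4 CPUs spread over 2 nodes, 2 GB of memory and 10 hours of walltime in the workq queue, charged to the project myproject.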
qstat output
[rks@barossa gen-random]$ qstat -u rks
barossa.ac3.com.au:
Req’d Req’d Elap
Job ID Username Queue Jobname NDS TSK Memory Time S Time
------ -------- -------- ------- --- --- ------ ----- - -----
83991 rks workq pt0.91 1 1 512mb 72:00 R 04:05
83992 rks workq pt0.92 1 1 512mb 72:00 R 03:53
84056 rks workq gen-ran 1 1 512mb 72:00 R 00:01
NDS: number of nodes; TSK: tasks (interesting for parallel jobs)
Maui scheduler
Used on Clare and Barossa.
showq (or “pbs showq” on Barossa) can be used to obtain additional queue information:
[rks@barossa gen-random]$ pbs showq -i
JobName Priority XFactor User Procs WCLimit SystemQueueTime
84057* 10003 1.8 alexg 20 1:00:00 Mon Jun 28 14:33:09
84059* 8721 1.0 houska 8 6:06:00:00 Mon Jun 28 15:07:20
84040* 6749 1.0 ahmadj 1 3:00:00:00 Mon Jun 28 12:54:37
84041* 6749 1.0 ahmadj 1 3:00:00:00
XFactor = 1 + (time queued) / (requested walltime)
→ a larger XFactor means a larger likelihood of getting started
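For example, a job that has been queued for 3 hours and requests 6 hours of walltime has XFactor = 1 + 3/6 = 1.5; after 12 hours in the queue the same job has XFactor = 1 + 12/6 = 3, so jobs with short walltime requests that have waited a long time rise quickly in the queue.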
Priority
Priority is determined by (+ increases, - decreases priority):
requested walltime (-)
length of time queued (+)
fair share FS (number of submitted jobs) (-)
resource allocation (memory) (-)
submitted queue (+/-)
$ diagnose -p
Job PRIORITY* Cred( User:Accnt:Class) FS( User:Accnt) Serv(QTime)
444531 20990345 100.0( 0.0:41962: 10.0) 0.0(525.1: 0.0) 0.0(3820.)
Preemption
When another job is suspended so that a high-priority job can start.
Automatic preemption is available on Swan (the new Altix) and the APAC SC, but not on Barossa. Running jobs will run to completion or their requested limit.
A large parallel job will cause queues to drain. Jobs with small wallclock requests will “backfill” (Barossa’s alternative to preemption).
Because Barossa does not have preemption, long-running jobs (in the xlong queue) are restricted to 8 nodes.
Estimating when your job will start
$ showstart 444531
job 444531 requires 16 procs for 1:06:25:00
Earliest start in 1:15:28:23 on Sun Jan 16 08:44:26
Earliest completion in 2:21:53:23 on Mon Jan 17 15:09:26
Best Partition: DEFAULT
This estimate is conservative: it assumes all jobs will run for their full requested time, so resources may become available sooner.
It cannot account for jobs submitted in the future that get higher priority (hence the caveat “earliest”).
Understanding the machine’s state
$ qview
barossa015 . 1 441911 carlm (1200mb, 512mb)
barossa025 . 1 444043 bsoule (1200mb, 512mb)
barossa037 . 1 444292 lambui01 ( 128mb, 512mb)
barossa050 . 0 free
barossa055 O 0 free
barossa064 . 1 444534 sinavafi ( 512mb, 1700mb)
barossa073 O 0 free
barossa098 O 0 free
barossa115 O 0 free
barossa117 O 0 free
barossa137 . 1 444539 sinavafi ( 512mb, 1700mb)
The O status indicates a node is offline for some reason.
Column 3 indicates the number of CPUs in use on that node.
Queues on Barossa
priority Highest priority, charged at 3× the normal SU rate (your allocation is used up faster)
xlarge For jobs with more than 32 nodes
xlong For jobs longer than 3 days
checkable For jobs that can be checkpointed
single Single CPU jobs
stampfl, robinson Private queues
Notes on queues
xlong and xlarge require you to email [email protected] before jobs are scheduled
checkable jobs can be killed at any time to make room for xlarge jobs → restart opportunity
default: if no queue is specified, PBS will route the job to the most appropriate one
workq: a “catchall” queue, low priority, with a reducing share (ac3 will reduce that queue further)
Checkpointing
Checkable jobs should
at minimum specify -r. The job will then be requeued if killed.
ideally write a checkpoint file, so no work is lost
could also use a log file to determine where the calculation left off
Writing Checkpoint files
SIGUSR1 and SIGTERM are sent by the batch system prior to the job being killed.
Trapping these signals may not give sufficient time to reach a checkpointable part of the program.
Alternatively, write a checkpoint whenever possible (within reasonable I/O demands).
Trap SIGTERM and record its arrival. If the signal was received before the checkpoint write, exit immediately; if it was received during the checkpoint, exit immediately after the checkpoint.
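A minimal sketch of this pattern at the shell level (the program name, its arguments and the "finished" marker file are hypothetical stand-ins; a real application would normally handle the signal inside the code itself):

#!/bin/sh
# Sketch only: record SIGTERM instead of dying, finish the current
# checkpoint, then exit cleanly.
caught=0
trap 'caught=1' TERM               # note the signal's arrival

while [ ! -f finished ]; do        # loop until the application marks completion
    a.out step >>output            # advance the calculation by one chunk
    a.out write_checkpoint         # refresh checkpoint.dat
    if [ $caught -eq 1 ]; then     # SIGTERM arrived: stop now that the checkpoint is safe
        exit 0
    fi
done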
Checkpointing options
No system support for checkpointing is provided on Barossa; applications must provide their own! SIGUSR1 and SIGTERM are sent by the system, but no software is provided for trapping the signal.
Use Classdesc (http://parallel.hpc.unsw.edu.au/classdesc) for C++ or FClassdesc (http://parallel.hpc.unsw.edu.au/fclassdesc) for Fortran90 to write a binary file representing the relevant state data of your program.
Traditional (non-Classdesc) checkpointing techniques are also possible (see http://www.ac3.edu.au/hints), but the code is harder to maintain.
clumon
http://barossa.ac3.com.au/clumon
Batch scripts
Typically Bourne shell programs, but can be any interpreted language (e.g. Perl) where # is a comment character.
qsub options can be specified on lines beginning with #PBS. These are overridden by the command line.
PATH usually doesn’t include “.”. Set an appropriate path, e.g. PATH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin:$PATH, or set it in .bashrc, e.g.: MPICH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin; export MPICH
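Putting this together, a minimal sketch of such a script (the executable name is a placeholder and the resource requests are arbitrary):

#!/bin/sh
#PBS -l cput=1:00:00 -l mem=256MB
# prepend the MPICH bin directory quoted above so its tools are found first
PATH=/opt/mpich-1.2.5.10-ch_p4-gcc/bin:$PATH
export PATH
cd $PBS_O_WORKDIR                 # change to the directory the job was submitted from
./a.out >output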
Batch scripts ...
The environment contains extra information:
$PBS_O_WORKDIR The place from where the job was submitted.
$PBS_NODEFILE A file listing the node names attached to your job, e.g. wc -l $PBS_NODEFILE will return the number of CPUs your job is running on.
$PBS_JOBID Your current job’s job id, e.g. /scratch/$PBS_JOBID is the name of a temporary directory on a node’s local disk for intensive I/O applications.
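A short illustrative fragment using these variables (resource requests and names are arbitrary):

#!/bin/sh
#PBS -l nodes=2:ppn=2
cd $PBS_O_WORKDIR                  # return to the directory the job was submitted from
NCPUS=`wc -l < $PBS_NODEFILE`      # how many CPUs PBS has allocated to this job
echo "job $PBS_JOBID is using $NCPUS CPUs, scratch in /scratch/$PBS_JOBID"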
(Trivially) Parallel Batch scripts
#!/bin/sh
n=0
while [ $n -lt 10 ]; do
    echo $n >parm${n}.dat
    # generate a job script for parameter set $n ...
    cat >scr$n <<EOF
#PBS -l cput=03:00:00 -l mem=128MB
a.out parm${n}.dat >out${n}.dat
expr \`cat parm${n}.dat\` + 10 >parm${n}.dat  # backquotes escaped so this runs at job time
qsub scr$n
EOF
    # ... and submit it
    qsub scr$n
    n=`expr $n + 1`
done
OpenMP jobs
The compiler can autoparallelise, or make use of manual OpenMP compiler directives. See the intro guides for specific compiler flags.
Specify the number of threads via the OMP_NUM_THREADS environment variable. On Barossa, a job can make use of 2 threads, the SC can make use of 4, and Swan up to 16.
Use the -lnodes=1:ppn=2 option on Barossa. Use -lnodes=1:ppn=16 to select APAC’s big SMP node on the SC.
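A sketch of an OpenMP job script on Barossa along these lines (the executable name is a placeholder; the flags needed to build it are in the intro guides):

#!/bin/sh
#PBS -l nodes=1:ppn=2 -l cput=2:00:00
cd $PBS_O_WORKDIR
OMP_NUM_THREADS=2            # match the 2 CPUs requested above
export OMP_NUM_THREADS
./omp_prog >output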
MPI jobs
Swan (the new Altix) has SGI MPI and MPICH
Barossa has LAM and MPICH
LC has LAM
SC has an Elan version; the SC will be replaced by a large Altix
In terms of network performance, the order is Swan, SC, Barossa, LC. In terms of processor performance, the order is Barossa, LC, SC, Swan. Are you CPU bound or network bound?
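For example, a distributed-memory job on Barossa using MPICH might be submitted with a script along these lines (the executable name is a placeholder; make sure the matching mpirun is on your PATH as described earlier):

#!/bin/sh
#PBS -l nodes=4:ppn=2 -l cput=12:00:00
cd $PBS_O_WORKDIR
NP=`wc -l < $PBS_NODEFILE`         # CPUs allocated by PBS
mpirun -np $NP -machinefile $PBS_NODEFILE ./mpi_prog >output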
Other parallel jobs (eg PVM, CFX, ...)
On Barossa, you have ssh privileges to any node running your job. If you don’t have a job running, you don’t.
This means that any parallel transport layer built using ssh (like PVM) will work on Barossa. This is not true of the APAC systems, where you have to use MPI. (PVM is available on the SC.)
Self-submitting batch jobs
#!/bin/sh
#PBS -l cput=3:0:0 -l mem=128MB -r y
if [ ! -z "$PBS_O_WORKDIR" ]; then
    cd $PBS_O_WORKDIR
fi
if [ -f stop ]; then exit; fi
if [ -f checkpoint.dat ]; then
    a.out restart >>output
else
    a.out init >>output
fi
qsub $PBS_JOBNAME

$PBS_JOBNAME is the name of the script above.
Scratch I/O
On a cluster, a system-wide file system is provided via NFS to access home and short-term data directories.
NFS cannot cope with lots of small read/write requests.
Each node has a local disk, which can be accessed for the duration of the job. Use scp to copy data to/from the local disk (scfscp on the SC). On Barossa, this scratch directory is called /scratch/$PBS_JOBID.
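As a sketch, an I/O-intensive single-node job on Barossa might use this scratch directory like so (file and executable names are placeholders):

#!/bin/sh
#PBS -l nodes=1 -l cput=6:00:00
SCR=/scratch/$PBS_JOBID                 # node-local scratch directory for this job
cp $PBS_O_WORKDIR/input.dat $SCR/       # stage input onto the local disk
cd $SCR
$PBS_O_WORKDIR/a.out input.dat >output.dat
cp output.dat $PBS_O_WORKDIR/           # copy results back before the job ends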
Resource Allocation
Applications are made to the ac3 resource allocation committee every six months. Grants are assessed on merit (a well-composed application is a condition). Proposals are generally around 3 pages long.
Resources are granted in terms of system units (SUs). 1 SU corresponds to roughly 1 hour on a 1 GHz processor.

Machine     SU available per 6 months
Barossa     3.3 million
APAC NF     132,000
Hunter      114,000
Clare       94,000
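If charging scales with processor speed, as this definition suggests, an hour on one of Barossa’s 3 GHz CPUs costs roughly 3 SUs, so a 24-hour single-CPU run would consume about 24 × 3 = 72 SUs (a rough guide only; the priority queue is additionally charged at 3×).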
Looking at your resource usage
http://barossa.ac3.edu.au/pbs
ProjectID Grant Used Remaining
-------------------------------------------------------------
acnoise 400000 109913 290087
agero 30000 39356 -9356
ahmadj 0 63633 -63633
alexg 32000 23248 8752
apitman 1500 0 1500
ayang 0 134171 -134171
...