GC3: Grid Computing Competence Center
Cluster computing, I: Batch-queueing systems
Riccardo Murri, Sergio Maffioletti
Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich
Oct. 23, 2012
Today’s topic
Batch job processing (the purpose) on clusters (the HW architecture).
What is a cluster? I
[Diagram: compute nodes compute-0-0.local, compute-0-1.local, …, compute-0-27.local connected by a local network fabric; the frontend node frontend.node.uzh.ch links the cluster to the internet]
A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.
What is a cluster? II
Centralized:
– Authorization and Authentication
– Shared filesystem
– Application execution and management

Distributed:
– Execution of jobs
– Multiple units of the same parallel job may reside on separate resources
What is an HPC cluster?
A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.

An HPC cluster is a cluster with a fast local network interconnect, specialized for the execution of parallel distributed-memory programs.

A supercomputer is (currently) a very large HPC cluster with a very fast local network interconnect.
What’s batch job processing?
Asynchronous execution of shell commands.
Wikipedia: Asynchronous actions are actions executed in a non-blocking scheme, allowing the main program flow to continue processing.
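The ordinary shell already provides a tiny illustration of this idea. A minimal sketch (long_computation stands for any hypothetical long-running command):

long_computation &     # start in the background: the shell does not block
jobs                   # list background jobs and their state
wait                   # block until all background jobs have finished

A batch system does the same at cluster scale: submission returns immediately, and the job runs whenever resources become available.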
Lifecycle of a batch job
1. A command to run is submitted to the batch processing system
2. The batch job scheduler selects appropriate resources to run the job
3. The resource manager executes the job
4. Users monitor the job execution state
Functional components of a batch job system
Resource Manager
Monitors the compute infrastructure, launches and supervises jobs, cleans up after termination.

Job manager / scheduler
Allocates resources and time slots (scheduling).

Workload Manager
Policy and orchestration at the “job collection” level: fair share, workflow orchestration, QoS, SLAs, etc.
Reference: O. Richard, “Batch Scheduler and File Management”, The Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21–24, 2010, Bordeaux, France.
Architecture of a batch job system
[Diagram: the client on the frontend submits a job (1) to the server on the master node; the scheduler allocates resources (2); the resource manager starts the job (3) on a compute node (compute-0-0.local … compute-0-27.local) and monitors its execution (4); a monitor process on each compute node reports machine status back to the master]
Grid Engine
Sun Grid Engine (GE) is a batch-queueing system produced by Sun Microsystems; made open-source in 2001.

After the acquisition by Oracle, the product forked:

– Open Grid Scheduler (OGS) and Son of Grid Engine (SGE), independent open-source versions.
– Oracle Grid Engine, commercial and focused on enterprise technical computing.
– Univa Grid Engine, a commercial-only version, developed by the core SGE engineering team from Sun.

Used on the UZH main HPC cluster “Schroedinger”.
GE architecture, I
sge_qmaster

– Runs on the master node
– Accepts client requests (job submission, job/host state inspection)
– Schedules jobs on compute nodes (formerly a separate sge_schedd process)

Client programs: qhost, qsub, qstat

– Run by the user on a submit node
– Clients for sge_qmaster
– The master daemon has a list of authorized submit nodes
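For illustration, a typical session with these client programs might look as follows (qhost, qstat and qsub are the actual GE commands; job.sh is a hypothetical script):

qhost              # list compute nodes with their load and memory
qstat              # list your pending and running jobs
qsub job.sh        # submit job.sh to sge_qmaster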
GE architecture, II
sge_execd

– Runs on every compute node
– Accepts job start requests from sge_qmaster
– Monitors node status (load average, free memory, etc.) and reports back to sge_qmaster

sge_shepherd

– Spawned by sge_execd when starting a job
– Monitors the execution of a single job
GE architecture, III
[Diagram: the generic architecture instantiated with GE components: qsub and qstat are the clients on the frontend; sge_qmaster on the master node plays both server and scheduler; sge_execd on each compute node (compute-0-0.local … compute-0-27.local) acts as resource manager and machine status monitor; sge_shepherd supervises each running job]
Lifecycle of a Job: user perspective
1. Prepare a job script (normally a shell script)
2. Define resource requirements
3. Submit the job and record its job ID
4. Monitor the status of the job (using the job ID)
5. When done, inspect the results
6. Otherwise, check the logs
Prepare job script
#!/bin/bash

MZXMLSEARCH="./MzXML2Search"

${MZXMLSEARCH} -dta ${MZXML_NAME}.mzXML
if [ $? -ne 0 ]; then
    echo "[FATAL]"
    exit 1
fi
Submit job and monitor using jobID
# qsub test.sh
534.localhost

# qstat 534
Job id          Name     S Queue
--------------- -------- - -------
534.localhost   test.sh  R default
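To automate the monitoring step, a submission script can poll qstat until the job leaves the queue. A minimal sketch, assuming qsub prints the job ID on standard output and qstat fails for an unknown job, as in the example above:

#!/bin/bash
# Submit the job and capture the job ID printed by qsub.
jobid=$(qsub test.sh)

# Poll once a minute; qstat returns non-zero once the job is gone.
while qstat "$jobid" >/dev/null 2>&1; do
    sleep 60
done
echo "Job $jobid finished; now inspect its output and log files."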
Lifecycle of a Job: system perspective
1. Job is submitted from a DRM client
2. The Resource Manager stores the job in a queue
   – Queue selected by inspecting DRM policies and the job’s description
3. The Scheduler starts a scheduling cycle:
   – Collects resource information from exec hosts
   – Inspects jobs in queues
   – Applies scheduling policies to sort jobs in queues
   – Sends run requests to the Resource Manager
4. The Resource Manager sends the job to an exec host to run
5. The exec host receives the payload and runs it:
   – Job executed using the user’s credentials
   – Periodically reports resource utilization to the Resource Manager
   – When the job finishes, reports to the Resource Manager
6. The Resource Manager updates the job’s state
Job lifecycle

[Figure: job lifecycle state diagram, not reproduced here]
Implementation issues
I/O
How to provide input data to the job and collect output data from it?

Scheduling
When should the job start?

Resource allocation
On what computer(s) should it run? How to cope with heterogeneous resource pools?

Job monitoring and accounting
What usage records should be collected and stored?
I/O management in HPC clusters
Two main ways:
1. Shared file system
2. Data staging
Reference: O. Richard, “Batch Scheduler and File Management”, The Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21–24, 2010, Bordeaux, France.
Shared file systems
Used on most cluster systems.

A parallel filesystem (e.g., Lustre, GPFS, PVFS, NFSv4.1, …) provides performance and scalability.

Often there are separate filesystems based on features:

– a filesystem for persistent / longer-term data (e.g., /home)
– another one for ephemeral I/O (deleted after the job has finished running)
– the responsibility is on the user to move data into the appropriate filesystem

Easy to use: no difference from the local I/O model.
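As an illustration of the last point, a job script can stage its own data between the persistent and the ephemeral filesystem. A sketch only: the path /scratch and the program name simulate are assumptions, not part of these slides:

#!/bin/bash
# Keep persistent data in $HOME; do heavy I/O on fast scratch space.
WORKDIR=/scratch/$USER/job_$$                  # site-specific scratch location
mkdir -p "$WORKDIR"

cp "$HOME/input.dat" "$WORKDIR/"               # move input to the fast filesystem
cd "$WORKDIR"
"$HOME/bin/simulate" input.dat > output.dat    # hypothetical application

cp output.dat "$HOME/results/"                 # save results to persistent storage
rm -rf "$WORKDIR"                              # scratch is ephemeral: clean up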
Data staging
Job data requirements are identified and provided by the user in the submitted script.

Stage-in
Input files are transferred to the local disk of the compute nodes before the job starts.

Stage-out
Output files are transferred from the nodes to mass storage after execution.

Nowadays this is rarely used on clusters; it is mainly used in the Grid context.
Scheduling
Long-term scheduler:
– Jobs may last hours, days, even months!

HPC job scheduling is usually non-preemptive:
– Compute resources are fully utilized; there’s little room for sharing.

Common scheduling algorithms are usually variations of FCFS or priority-based scheduling.
Scheduling: terminology
Turnaround time
The total time elapsed from the moment a job is submitted to the moment it terminates running.

Wait time
The time elapsed from submission until the job actually starts running.

Wall-time
The time elapsed from job start to end. (Abbreviation of wall-clock time.)

CPU time
The total time spent by CPU(s) executing a job program’s code.
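These definitions imply a simple relation (not stated explicitly above, but it follows directly):

turnaround time = wait time + wall-time

Note also that for a parallel job the CPU time can exceed the wall-time: a job keeping N cores busy for its whole duration accumulates roughly N times the wall-time in CPU time.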
Scheduling: FCFS, I
First come, first served
Job requests are kept in a queue.

New job requests (submissions) are appended to the back of the queue.

Each time a suitable execution slot is freed, the job at the front of the queue is run.
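A toy sketch of this discipline in shell syntax (purely illustrative; submit and next_job are invented names):

queue=()                    # pending job requests, front first

submit() {                  # new submissions go to the back of the queue
    queue+=("$1")
}

next_job() {                # an execution slot freed: pop the front job
    [ ${#queue[@]} -gt 0 ] || return 1
    job=${queue[0]}         # the front of the queue runs next
    queue=("${queue[@]:1}")
}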
Scheduling: FCFS, II
Issues with bare FCFS:

1. Average waiting time might be long:
   – e.g., a user submits a large number of very long jobs; other users have to wait a long time to have their shorter jobs run.
   – Solutions: separate queues, backfill, priority-based scheduling.

2. When there are parallel jobs spanning multiple execution units, the scheduler has to keep some nodes idle in order to allocate enough resources.
   – Solution: backfill.
Scheduling: separate queues
Create separate job queues.
– The submission queue may be explicitly chosen by the user, or selected by the scheduler based on job characteristics.

Each queue is associated with a different set of execution nodes.

Each queue has different run features:
– e.g., a different maximum run time.
Scheduling: backfill
Jobs jump ahead in the queue and are executed on “reserved” nodes if they will be finished by the time the job holding the reservation is scheduled to start.

Requires the job duration to be known in advance!

Image source: http://people.ee.ethz.ch/~ballisti/computer_topics/lsf/admin/04-tunin.htm
Scheduling: SJF, I

Shortest job first

The job queue is sorted according to duration: the shortest jobs are moved to the front.

Requires the job duration to be known in advance!
Scheduling: SJF, II

If all jobs are known in advance, SJF can be proved to deliver the optimal average wait time.

Otherwise, it may delay long jobs indefinitely:

– At 10 am, job X with an expected runtime of 4 hours is submitted; it has to wait 2 hours in the queue.
– At 11 am, 10 jobs of 2 hours runtime are submitted; they jump ahead in the queue and delay job X by 20 hours.
– At 12 noon, 5 more jobs of 1 hour runtime are submitted; they delay job X by another 5 hours.

Solution: add a “deadline” factor, i.e., take into account the time a job has already spent waiting in the queue.
Priority-based scheduling
Sort job queues according to some priority function.

The “priority” function is usually a weighted sum of various contributions, e.g.:

– Requested run time

– Number of processors

– Wait time in queue

– Recent usage by the same user/group/department (fair share)
– Administrator-set QoS
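For instance, such a priority function could take a form like the following (an illustrative formula, not taken from any particular scheduler; the weights w1 … w5 are set by the administrator):

priority(job) = w1 · requested_runtime + w2 · n_processors + w3 · wait_time + w4 · fairshare_factor + w5 · QoS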
Reference: http://www.adaptivecomputing.com/resources/docs/maui/5.1jobprioritization.php
Fair-share scheduling
Fair-share prioritization assigns higher priorities to users/groups/etc. that have not used all of their resource quota (usually expressed in CPU time).

Important parameters in defining a fair-share policy:

– window length: how much historical information is kept and used for calculating resource usage

– interval: how often resource utilization is computed

– decay: weights applied to resource usage in the past (e.g., 2 hours of CPU time one week ago might weigh less than 2 hours of CPU time today)
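Combining these parameters, the effective usage charged to a user u could be computed roughly as follows (an illustrative formula, assuming a decay factor 0 < d ≤ 1, a window of W intervals, and usage_k(u) denoting the usage of u in the k-th most recent interval):

effective_usage(u) = usage_0(u) + d·usage_1(u) + d^2·usage_2(u) + … + d^(W−1)·usage_(W−1)(u)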
Reference: http://www.adaptivecomputing.com/resources/docs/maui/6.3fairshare.php
Resource allocation, I
Resource allocation is the act of selecting execution units out of the available pool for running a job.

Over time, clusters tend to grow inhomogeneously: new nodes are added that are different from the older ones.

Jobs differ in computational and hardware requirements, e.g.:

– short jobs vs. long-running jobs

– large memory, hence fewer jobs fit in a single multi-core node

– I/O bound, hence a fast filesystem is needed
Resource allocation, II
General resource allocation algorithm (match-making):

1. The user specifies resource requirements during job submission.

2. Filtering: the scheduler filters resources based on evaluation of a boolean formula
   – usually, the logical AND of the resource requirements.

3. Ranking: the matching resources are sorted, and the first-ranking one gets the job.

Normally the filtering and ranking functions are fixed or can only be modified by the cluster admin.

A notable exception is the Condor batch system, which allows users to specify arbitrary filtering and ranking functions.
Example: resource requirements in SGE
Grid Engine allows specifying resource requirements within a job script:

#!/bin/bash
#$ -q all.q          # queue name
#$ -l s_vmem=300M    # memory
#$ -l s_rt=60        # walltime
#$ -l gpu=1          # require 1 GPGPU
#$ -pe mpich 32      # CPU cores

MZXMLSEARCH="./MzXML2Search"
...

(Note that you write s_rt=60 but the system understands s_rt >= 60 for the purpose of filtering.)
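The same requirements can also be passed on the qsub command line rather than embedded in the script; the flags below mirror the directives above (-q, -l and -pe are standard GE options):

qsub -q all.q -l s_vmem=300M -l s_rt=60 -l gpu=1 -pe mpich 32 test.sh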
Condor
[Diagram: a Condor pool built from several clusters, each with compute nodes on a local 1 Gb/s Ethernet network and a batch system server; condor_master performs the match-making, condor_submit and condor_agent act on the user’s side, and condor_resource runs on every compute node]
Condor overview
Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between the Agents’ requests and the Resources’ offerings.

An Agent sends its computational job directly to the matching Resource.

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.
What is matchmaking?
Matchmaking, I
The same idea applies in Condor, except the schema is not fixed.

Agents and Resources report their requests and offers using the “ClassAd” format (an enriched key=value format).

There is no prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model.
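For example, a Resource’s ClassAd might freely mix conventional and site-defined attributes (a hypothetical sketch; HasMatlab is an invented attribute):

Arch      = "x86_64"
OpSys     = "LINUX"
Memory    = 16384
HasMatlab = True

The first three attributes are conventional; HasMatlab is an arbitrary site-defined key that a Job ClassAd could nonetheless reference in its Requirements expression.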
Matchmaking, II
1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agent’s own ClassAd (self) or the Resource’s (other).

2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded.

2b. Conversely, if the Agent’s ClassAd does not satisfy the Resource’s Requirements, the Resource is discarded.

3. The surviving Resources are sorted according to the value of the Rank expression in the Agent’s ClassAd, and their list is returned to the Agent.
Example: Job ClassAd
Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed:

Requirements = Arch == "x86_64" && OpSys == "LINUX"
Rank = TARGET.Memory + TARGET.Mips
Reference: http://research.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html
Example: Resource ClassAd
A complex access policy, giving priority to users from the owner’s research group, then other “friend” users, and then the rest:

Friend        = Owner == "tannenba"
ResearchGroup = (Owner == "jbasney" || Owner == "raman")
Trusted       = Owner != "rival"
Requirements  = Trusted && ( ResearchGroup
                             || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Rank          = Friend + ResearchGroup*10

Resource ClassAds specify an access/usage policy for the resource.
Resource allocation, III
Problem: how do you submit a job that requires 200 GB of local scratch space? Or 16 cores in a single node?
Resource allocation, IV
The names and types of resource requirements vary from cluster to cluster:

– Defaults change with the batch system software release.

– Custom requirements depend on the local system administrator.

Job management software must be adapted to the local cluster:

– When you get access to a new cluster, you must rewrite a large portion of your submission scripts.

– This applies to Condor as well: since ClassAds are free-form, defining what attributes can be used and relied upon is an organizational problem.
All these job management systems are based on a push model: you send the job to an execution cluster.

Is there, conversely, a pull model?