GC3: Grid Computing Competence Center
Cluster computing, I: Batch-queueing systems
Riccardo Murri, Sergio Maffioletti
Grid Computing Competence Center, Organisch-Chemisches Institut, University of Zurich
Oct. 23, 2012
Today’s topic
Batch job processing (the purpose) on clusters (the HW architecture).
What is a cluster? I
[Diagram: compute nodes compute-0-0.local, compute-0-1.local, …, compute-0-27.local connected by a local network fabric; the frontend node frontend.node.uzh.ch links the cluster to the internet]
A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.
What is a cluster? II
Centralized:
– Authorization and Authentication
– Shared filesystem
– Application execution and management

Distributed:
– Execution of jobs
– Multiple units of the same parallel job may reside on separate resources
What is an HPC cluster?
A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.

An HPC cluster is a cluster with a fast local network interconnect, specialized for the execution of parallel distributed-memory programs.

A supercomputer is (currently) a very large HPC cluster with a very fast local network interconnect.
What’s batch job processing?
Asynchronous execution of shell commands.
Wikipedia: Asynchronous actions are actions executed in a non-blocking scheme, allowing the main program flow to continue processing.
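The ordinary shell already provides a tiny illustration of this idea. A minimal sketch (long_computation stands for any hypothetical long-running command):

long_computation &     # start in the background: the shell does not block
jobs                   # list background jobs and their state
wait                   # block until all background jobs have finished

A batch system does the same at cluster scale: submission returns immediately, and the job runs whenever resources become available.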
Lifecycle of a batch job
1. A command to run is submitted to the batch processing system
2. The batch job scheduler selects appropriate resources to run the job
3. The resource manager executes the job
4. Users monitor the job execution state
Functional components of a batch job system
Resource Manager
Monitors the compute infrastructure, launches and supervises jobs, cleans up after termination.

Job manager / scheduler
Allocates resources and time slots (scheduling).

Workload Manager
Policy and orchestration at the “job collection” level: fair share, workflow orchestration, QoS, SLAs, etc.
Reference: O. Richard, “Batch Scheduler and File Management”, The Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21–24, 2010, Bordeaux, France.
Architecture of a batch job system
[Diagram: the client on the frontend submits a job (1) to the server on the master node; the scheduler allocates resources (2); the resource manager starts the job (3) on a compute node (compute-0-0.local … compute-0-27.local) and monitors its execution (4); a monitor process on each compute node reports machine status back to the master]
Grid Engine
Sun Grid Engine (GE) is a batch-queueing system produced by Sun Microsystems; made open-source in 2001.

After the acquisition by Oracle, the product forked:

– Open Grid Scheduler (OGS) and Son of Grid Engine (SGE), independent open-source versions.
– Oracle Grid Engine, commercial and focused on enterprise technical computing.
– Univa Grid Engine, a commercial-only version, developed by the core SGE engineering team from Sun.

Used on the UZH main HPC cluster “Schroedinger”.
GE architecture, I
sge_qmaster

– Runs on the master node
– Accepts client requests (job submission, job/host state inspection)
– Schedules jobs on compute nodes (formerly a separate sge_schedd process)

Client programs: qhost, qsub, qstat

– Run by the user on a submit node
– Clients for sge_qmaster
– The master daemon has a list of authorized submit nodes
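For illustration, a typical session with these client programs might look as follows (qhost, qstat and qsub are the actual GE commands; job.sh is a hypothetical script):

qhost              # list compute nodes with their load and memory
qstat              # list your pending and running jobs
qsub job.sh        # submit job.sh to sge_qmaster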
GE architecture, II
sge_execd

– Runs on every compute node
– Accepts job start requests from sge_qmaster
– Monitors node status (load average, free memory, etc.) and reports back to sge_qmaster

sge_shepherd

– Spawned by sge_execd when starting a job
– Monitors the execution of a single job
GE architecture, III
[Diagram: the generic architecture instantiated with GE components: qsub and qstat are the clients on the frontend; sge_qmaster on the master node plays both server and scheduler; sge_execd on each compute node (compute-0-0.local … compute-0-27.local) acts as resource manager and machine status monitor; sge_shepherd supervises each running job]
Lifecycle of a Job: user perspective
1. Prepare a job script (normally a shell script)
2. Define resource requirements
3. Submit the job and record its job ID
4. Monitor the status of the job (using the job ID)
5. When done, inspect the results
6. Otherwise, check the logs
Prepare job script
#!/bin/bash

MZXMLSEARCH="./MzXML2Search"

${MZXMLSEARCH} -dta ${MZXML_NAME}.mzXML
if [ $? -ne 0 ]; then
    echo "[FATAL]"
    exit 1
fi
Submit job and monitor using jobID
# qsub test.sh
534.localhost

# qstat 534
Job id          Name     S Queue
--------------- -------- - -------
534.localhost   test.sh  R default
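To automate the monitoring step, a submission script can poll qstat until the job leaves the queue. A minimal sketch, assuming qsub prints the job ID on standard output and qstat fails for an unknown job, as in the example above:

#!/bin/bash
# Submit the job and capture the job ID printed by qsub.
jobid=$(qsub test.sh)

# Poll once a minute; qstat returns non-zero once the job is gone.
while qstat "$jobid" >/dev/null 2>&1; do
    sleep 60
done
echo "Job $jobid finished; now inspect its output and log files."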
Lifecycle of a Job: system perspective
1. Job is submitted from a DRM client
2. The Resource Manager stores the job in a queue
   – Queue selected by inspecting DRM policies and the job’s description
3. The Scheduler starts a scheduling cycle:
   – Collects resource information from exec hosts
   – Inspects jobs in queues
   – Applies scheduling policies to sort jobs in queues
   – Sends run requests to the Resource Manager
4. The Resource Manager sends the job to an exec host to run
5. The exec host receives the payload and runs it:
   – Job executed using the user’s credentials
   – Periodically reports resource utilization to the Resource Manager
   – When the job finishes, reports to the Resource Manager
6. The Resource Manager updates the job’s state
Job lifecycle

[Figure: job lifecycle state diagram, not reproduced here]
Implementation issues
I/O
How to provide input data to the job and collect output data from it?

Scheduling
When should the job start?

Resource allocation
On what computer(s) should it run? How to cope with heterogeneous resource pools?

Job monitoring and accounting
What usage records should be collected and stored?
I/O management in HPC clusters
Two main ways:
1. Shared file system
2. Data staging
Reference: O. Richard, “Batch Scheduler and File Management”, The Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21–24, 2010, Bordeaux, France.
Shared file systems
Used on most cluster systems.

A parallel filesystem (e.g., Lustre, GPFS, PVFS, NFSv4.1, …) provides performance and scalability.

Often there are separate filesystems based on features:

– a filesystem for persistent / longer-term data (e.g., /home)
– another one for ephemeral I/O (deleted after the job has finished running)
– the responsibility is on the user to move data into the appropriate filesystem

Easy to use: no difference from the local I/O model.
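As an illustration of the last point, a job script can stage its own data between the persistent and the ephemeral filesystem. A sketch only: the path /scratch and the program name simulate are assumptions, not part of these slides:

#!/bin/bash
# Keep persistent data in $HOME; do heavy I/O on fast scratch space.
WORKDIR=/scratch/$USER/job_$$                  # site-specific scratch location
mkdir -p "$WORKDIR"

cp "$HOME/input.dat" "$WORKDIR/"               # move input to the fast filesystem
cd "$WORKDIR"
"$HOME/bin/simulate" input.dat > output.dat    # hypothetical application

cp output.dat "$HOME/results/"                 # save results to persistent storage
rm -rf "$WORKDIR"                              # scratch is ephemeral: clean up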
Data staging
Job data requirements are identified and provided by the user in the submitted script.

Stage-in
Input files are transferred to the local disk of the compute nodes before the job starts.

Stage-out
Output files are transferred from the nodes to mass storage after execution.

Nowadays this is rarely used on clusters; it is mainly used in the Grid context.
Scheduling
Long-term scheduler:
– Jobs may last hours, days, even months!

HPC job scheduling is usually non-preemptive:
– Compute resources are fully utilized; there’s little room for sharing.

Common scheduling algorithms are usually variations of FCFS or priority-based scheduling.
Scheduling: terminology
Turnaround time
The total time elapsed from the moment a job is submitted to the moment it terminates running.

Wait time
The time elapsed from submission until the job actually starts running.

Wall-time
The time elapsed from job start to end. (Abbreviation of wall-clock time.)

CPU time
The total time spent by CPU(s) executing a job program’s code.
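These definitions imply a simple relation (not stated explicitly above, but it follows directly):

turnaround time = wait time + wall-time

Note also that for a parallel job the CPU time can exceed the wall-time: a job keeping N cores busy for its whole duration accumulates roughly N times the wall-time in CPU time.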
Scheduling: FCFS, I
First come, first served
Job requests are kept in a queue.

New job requests (submissions) are appended to the back of the queue.

Each time a suitable execution slot is freed, the job at the front of the queue is run.
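A toy sketch of this discipline in shell syntax (purely illustrative; submit and next_job are invented names):

queue=()                    # pending job requests, front first

submit() {                  # new submissions go to the back of the queue
    queue+=("$1")
}

next_job() {                # an execution slot freed: pop the front job
    [ ${#queue[@]} -gt 0 ] || return 1
    job=${queue[0]}         # the front of the queue runs next
    queue=("${queue[@]:1}")
}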
Scheduling: FCFS, II
Issues with bare FCFS:

1. Average waiting time might be long:
   – e.g., a user submits a large number of very long jobs; other users have to wait a long time to have their shorter jobs run.
   – Solutions: separate queues, backfill, priority-based scheduling.

2. When there are parallel jobs spanning multiple execution units, the scheduler has to keep some nodes idle in order to allocate enough resources.
   – Solution: backfill.
Scheduling: separate queues
Create separate job queues.
– The submission queue may be explicitly chosen by the user, or selected by the scheduler based on job characteristics.

Each queue is associated with a different set of execution nodes.

Each queue has different run features:
– e.g., a different maximum run time.
Scheduling: backfill
Jobs jump ahead in the queue and are executed on “reserved” nodes if they will be finished by the time the job holding the reservation is scheduled to start.

Requires the job duration to be known in advance!

Image source: http://people.ee.ethz.ch/~ballisti/computer_topics/lsf/admin/04-tunin.htm
Scheduling: SJF, I

Shortest job first

The job queue is sorted according to duration: the shortest jobs are moved to the front.

Requires the job duration to be known in advance!
Scheduling: SJF, II

If all jobs are known in advance, SJF can be proved to deliver the optimal average wait time.

Otherwise, it may delay long jobs indefinitely:

– At 10 am, job X with an expected runtime of 4 hours is submitted; it has to wait 2 hours in the queue.
– At 11 am, 10 jobs of 2 hours runtime are submitted; they jump ahead in the queue and delay job X by 20 hours.
– At 12 noon, 5 more jobs of 1 hour runtime are submitted; they delay job X by another 5 hours.

Solution: add a “deadline” factor, i.e., take into account the time a job has already spent waiting in the queue.
Priority-based scheduling
Sort job queues according to some priority function.

The “priority” function is usually a weighted sum of various contributions, e.g.:

– Requested run time

– Number of processors

– Wait time in queue

– Recent usage by the same user/group/department (fair share)
– Administrator-set QoS
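For instance, such a priority function could take a form like the following (an illustrative formula, not taken from any particular scheduler; the weights w1 … w5 are set by the administrator):

priority(job) = w1 · requested_runtime + w2 · n_processors + w3 · wait_time + w4 · fairshare_factor + w5 · QoS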
Reference: http://www.adaptivecomputing.com/resources/docs/maui/5.1jobprioritization.php
Fair-share scheduling
Fair-share prioritization assigns higher priorities to users/groups/etc. that have not used all of their resource quota (usually expressed in CPU time).

Important parameters in defining a fair-share policy:

– window length: how much historical information is kept and used for calculating resource usage

– interval: how often resource utilization is computed

– decay: weights applied to resource usage in the past (e.g., 2 hours of CPU time one week ago might weigh less than 2 hours of CPU time today)
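Combining these parameters, the effective usage charged to a user u could be computed roughly as follows (an illustrative formula, assuming a decay factor 0 < d ≤ 1, a window of W intervals, and usage_k(u) denoting the usage of u in the k-th most recent interval):

effective_usage(u) = usage_0(u) + d·usage_1(u) + d^2·usage_2(u) + … + d^(W−1)·usage_(W−1)(u)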
Reference: http://www.adaptivecomputing.com/resources/docs/maui/6.3fairshare.php
Resource allocation, I
Resource allocation is the act of selecting execution units out of the available pool for running a job.

Over time, clusters tend to grow inhomogeneously: new nodes are added that are different from the older ones.

Jobs differ in computational and hardware requirements, e.g.:

– short jobs vs. long-running jobs

– large memory, hence fewer jobs fit in a single multi-core node

– I/O bound, hence a fast filesystem is needed
Resource allocation, II
General resource allocation algorithm (match-making):

1. The user specifies resource requirements during job submission.

2. Filtering: the scheduler filters resources based on evaluation of a boolean formula
   – usually, the logical AND of the resource requirements.

3. Ranking: the matching resources are sorted, and the first-ranking one gets the job.

Normally the filtering and ranking functions are fixed or can only be modified by the cluster admin.

A notable exception is the Condor batch system, which allows users to specify arbitrary filtering and ranking functions.
Example: resource requirements in SGE
Grid Engine allows specifying resource requirements within a job script:

#!/bin/bash
#$ -q all.q          # queue name
#$ -l s_vmem=300M    # memory
#$ -l s_rt=60        # walltime
#$ -l gpu=1          # require 1 GPGPU
#$ -pe mpich 32      # CPU cores

MZXMLSEARCH="./MzXML2Search"
...

(Note that you write s_rt=60 but the system understands s_rt >= 60 for the purpose of filtering.)
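The same requirements can also be passed on the qsub command line rather than embedded in the script; the flags below mirror the directives above (-q, -l and -pe are standard GE options):

qsub -q all.q -l s_vmem=300M -l s_rt=60 -l gpu=1 -pe mpich 32 test.sh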
Condor
[Diagram: a Condor pool built from several clusters, each with compute nodes on a local 1 Gb/s Ethernet network and a batch system server; condor_master performs the match-making, condor_submit and condor_agent act on the user’s side, and condor_resource runs on every compute node]
Condor overview
Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between the Agents’ requests and the Resources’ offerings.

An Agent sends its computational job directly to the matching Resource.

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.
What is matchmaking?
Matchmaking, I
The same idea applies in Condor, except the schema is not fixed.

Agents and Resources report their requests and offers using the “ClassAd” format (an enriched key=value format).

There is no prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model.
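For example, a Resource’s ClassAd might freely mix conventional and site-defined attributes (a hypothetical sketch; HasMatlab is an invented attribute):

Arch      = "x86_64"
OpSys     = "LINUX"
Memory    = 16384
HasMatlab = True

The first three attributes are conventional; HasMatlab is an arbitrary site-defined key that a Job ClassAd could nonetheless reference in its Requirements expression.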
Matchmaking, II
1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agent’s own ClassAd (self) or the Resource’s (other).

2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded.

2b. Conversely, if the Agent’s ClassAd does not satisfy the Resource’s Requirements, the Resource is discarded.

3. The surviving Resources are sorted according to the value of the Rank expression in the Agent’s ClassAd, and their list is returned to the Agent.
Example: Job ClassAd
Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed:

Requirements = Arch == "x86_64" && OpSys == "LINUX"
Rank = TARGET.Memory + TARGET.Mips
Reference: http://research.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html
Example: Resource ClassAd
A complex access policy, giving priority to users from the owner’s research group, then other “friend” users, and then the rest:

Friend        = Owner == "tannenba"
ResearchGroup = (Owner == "jbasney" || Owner == "raman")
Trusted       = Owner != "rival"
Requirements  = Trusted && ( ResearchGroup
                             || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
Rank          = Friend + ResearchGroup*10

Resource ClassAds specify an access/usage policy for the resource.
Resource allocation, III
Problem: how do you submit a job that requires 200 GB of local scratch space? Or 16 cores in a single node?
Resource allocation, IV
The names and types of resource requirements vary from cluster to cluster:

– Defaults change with the batch system software release.

– Custom requirements depend on the local system administrator.

Job management software must be adapted to the local cluster:

– When you get access to a new cluster, you must rewrite a large portion of your submission scripts.

– This applies to Condor as well: since ClassAds are free-form, defining what attributes can be used and relied upon is an organizational problem.
All these job management systems are based on a push model: you send the job to an execution cluster.

Is there, conversely, a pull model?