Job Submission with Globus, Condor, and Condor-G Selim Kalayci Florida International University 07/21/2009 Note: Slides are compiled from various TeraGrid Documentations


Job Submission with Globus, Condor, and Condor-G

Selim Kalayci
Florida International University

07/21/2009

Note: Slides are compiled from various TeraGrid Documentations


Grid Job Management using Globus

• Common WS interface to schedulers
  – Unix, Condor, LSF, PBS, SGE, …
• More generally: interface for process execution management
  – Lay down execution environment
  – Stage data
  – Monitor & manage lifecycle
  – Kill it, clean up


Grid Job Management Goals

Provide a service to securely:
• Create an environment for a job
• Stage files to/from environment
• Cause execution of job process(es)
  – Via various local resource managers
• Monitor execution
• Signal important state changes to client
• Enable client access to output files
  – Streaming access during execution


GRAM
• GRAM: Globus Resource Allocation and Management
• GRAM is a Globus Toolkit component
  – For Grid job management
• GRAM is a unifying remote interface to Resource Managers
  – Yet preserves local site security/control
• Remote credential management
• File staging via RFT and GridFTP


A Simple Example
• First, log in to queenbee.loni-lsu.teragrid.org
• Command example:
    % globusrun-ws -submit -c /bin/date
    Submitting job...Done.
    Job ID: uuid:002a6ab8-6036-11d9-bae6-0002a5ad41e5
    Termination time: 01/07/2005 22:55 GMT
    Current job state: Active
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
• A successful submission will create a new ManagedJob resource with its own unique EPR (endpoint reference) for messaging
• Use the -o option to create the EPR file
    % globusrun-ws -submit -o job.epr -c /bin/date


A Simple Example(2)
• To see the output, use the -s (stream) option
    % globusrun-ws -submit -s -c /bin/date
    Termination time: 06/14/2007 18:07 GMT
    Current job state: Active
    Current job state: CleanUp-Hold
    Wed Jun 13 14:07:54 EDT 2007
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.
• If you want to send the output to a file, use the -so option
    % globusrun-ws -submit -s -so job.out -c /bin/date
    …
    % cat job.out
    Wed Jun 13 14:07:54 EDT 2007


A Simple Example(3)

• Submitting your job to different schedulers
  – Fork
      % globusrun-ws -submit -Ft Fork -s -c /bin/date
    (Actually, the default is Fork. So, you can skip it in this case.)
  – PBS
      % globusrun-ws -submit -Ft PBS -s -c /bin/date
• Submitting to a remote site
    % globusrun-ws -submit -F tg-login.frost.ncar.teragrid.org -c /bin/date
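
The option combinations shown so far can be assembled programmatically. The following is a minimal Python sketch, not part of the Globus Toolkit: the function name build_submit_command and its parameter names are hypothetical, and it uses only the flags that appear on these slides (-submit, -batch, -F, -Ft, -s, -so, -o, -c). Because the slides always pair -so with -s, the sketch does the same.

```python
from typing import List, Optional


def build_submit_command(
    command: List[str],
    factory_type: Optional[str] = None,  # value for -Ft, e.g. "Fork" or "PBS"
    contact: Optional[str] = None,       # value for -F, a remote host name
    stream: bool = False,                # adds -s
    stdout_file: Optional[str] = None,   # adds -so (and -s, as on the slides)
    epr_file: Optional[str] = None,      # adds -o
    batch: bool = False,                 # adds -batch
) -> List[str]:
    """Return an argv list for globusrun-ws (hypothetical helper)."""
    if not command:
        raise ValueError("command must not be empty")
    argv = ["globusrun-ws", "-submit"]
    if batch:
        argv.append("-batch")
    if contact:
        argv += ["-F", contact]
    if factory_type:
        argv += ["-Ft", factory_type]
    if stream or stdout_file:
        argv.append("-s")
    if stdout_file:
        argv += ["-so", stdout_file]
    if epr_file:
        argv += ["-o", epr_file]
    # Everything after -c is the remote executable and its arguments.
    argv += ["-c"] + list(command)
    return argv


if __name__ == "__main__":
    print(" ".join(build_submit_command(["/bin/date"], factory_type="PBS", stream=True)))
```

The resulting list could be handed to subprocess.run on a machine where globusrun-ws is installed; the sketch itself never invokes the tool.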


Batch Job Submissions

    % globusrun-ws -submit -batch -o job_epr -c /bin/sleep 50
    Submitting job...Done.
    Job ID: uuid:f9544174-60c5-11d9-97e3-0002a5ad41e5
    Termination time: 01/08/2005 16:05 GMT

    % globusrun-ws -status -j job_epr
    Current job state: Active

    % globusrun-ws -status -j job_epr
    Current job state: Done

    % globusrun-ws -kill -j job_epr
    Requesting original job description...Done.
    Destroying job...Done.
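
When a batch job is polled from a script, the text printed by globusrun-ws has to be interpreted. The Python sketch below only parses output of the shape shown on these slides; it does not call globusrun-ws. The helper names are hypothetical, and treating "Failed" as a final state is an assumption, since only "Done" appears in the examples here.

```python
import re
from typing import List, Optional

# Patterns for the two kinds of lines shown in the slide output.
_STATE_RE = re.compile(r"^Current job state:\s*(\S+)\s*$", re.MULTILINE)
_JOB_ID_RE = re.compile(r"^Job ID:\s*(\S+)\s*$", re.MULTILINE)

# "Done" appears on the slides; "Failed" is an assumed additional final state.
_FINAL_STATES = {"Done", "Failed"}


def job_id(output: str) -> Optional[str]:
    """Return the job ID printed at submission, or None if absent."""
    match = _JOB_ID_RE.search(output)
    return match.group(1) if match else None


def job_states(output: str) -> List[str]:
    """Return every reported job state, in the order printed."""
    return _STATE_RE.findall(output)


def is_finished(output: str) -> bool:
    """True if the last reported state is a final one."""
    states = job_states(output)
    return bool(states) and states[-1] in _FINAL_STATES


if __name__ == "__main__":
    sample = "Current job state: Active\nCurrent job state: Done\n"
    print(job_states(sample), is_finished(sample))
```

A polling loop would capture the output of the -status command, pass it to is_finished, and sleep between attempts.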


Resource Specification Language (RSL)

• RSL is the language used by the clients to submit a job.

• All job submission parameters are described in RSL, including the executable file and arguments.

• You can specify the type and capabilities of resources to execute your job.

• You can also coordinate Stage-in and Stage-out operations through RSL.


Submitting a job through RSL

• Command:
    % globusrun-ws -submit -f touch.xml
• Contents of touch.xml file:
    <job>
      <executable>/bin/touch</executable>
      <argument>touched_it</argument>
    </job>
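
A job description like touch.xml can also be generated rather than typed by hand. The Python sketch below uses the standard-library xml.etree.ElementTree module and emits only the three elements shown above (job, executable, argument). The function name make_job_description is hypothetical, and other RSL elements are deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def make_job_description(executable: str, arguments: Sequence[str] = ()) -> str:
    """Build a minimal <job> document with one <argument> per argument."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for arg in arguments:
        ET.SubElement(job, "argument").text = arg
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(job, encoding="unicode")


if __name__ == "__main__":
    # Writes the same content as the touch.xml example, on a single line.
    with open("touch.xml", "w") as handle:
        handle.write(make_job_description("/bin/touch", ["touched_it"]))
```

ElementTree escapes special characters such as & and < in argument values, which is the main reason to prefer it over string concatenation.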


Condor
• Condor is a software system that creates a high-throughput computing (HTC) environment
  – Created at UW-Madison
• Condor is a specialized workload management system for compute-intensive jobs.
  – Detects machine availability
  – Harnesses available resources
  – Uses remote system calls to send R/W operations over the network
  – Provides powerful resource management by matching resource owners with consumers (broker)


How Condor works

Condor provides:
• a job queueing mechanism

• scheduling policy

• priority scheme

• resource monitoring, and

• resource management.

Users submit their serial or parallel jobs to Condor,

Condor places them into a queue,

… chooses when and where to run the jobs based upon a policy,

… carefully monitors their progress, and

… ultimately informs the user upon completion.


Condor - features

• Checkpoint & migration
• Remote system calls
  – Able to transfer data files and executables across machines
• Job ordering
• Job requirements and preferences can be specified via powerful expressions
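
The idea of matching a job's requirements and preferences against available machines can be illustrated with a toy model. The Python sketch below is not Condor's expression language or its matchmaker: the machine attributes (name, memory_mb, os) and the function best_match are invented for illustration. A requirement is a yes/no filter on a machine, and a preference is a number used to rank the machines that pass.

```python
from typing import Callable, Dict, List, Optional

Machine = Dict[str, object]


def best_match(
    machines: List[Machine],
    requirements: Callable[[Machine], bool],  # hard constraint: must be True
    rank: Callable[[Machine], float],         # preference: higher is better
) -> Optional[Machine]:
    """Return the highest-ranked machine that meets the requirements."""
    candidates = [m for m in machines if requirements(m)]
    if not candidates:
        return None
    return max(candidates, key=rank)


if __name__ == "__main__":
    # Hypothetical pool; attribute names are illustrative only.
    pool = [
        {"name": "a", "memory_mb": 512, "os": "LINUX"},
        {"name": "b", "memory_mb": 2048, "os": "LINUX"},
        {"name": "c", "memory_mb": 4096, "os": "WINDOWS"},
    ]
    chosen = best_match(
        pool,
        requirements=lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 1024,
        rank=lambda m: m["memory_mb"],
    )
    print(chosen["name"] if chosen else "no match")
```

In this toy pool the job requires Linux with at least 1024 MB and prefers more memory, so machine c is filtered out despite having the most memory.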


Condor lets you manage a large number of jobs.

• Specify the jobs in a file and submit them to Condor
• Condor runs them and keeps you notified of their progress
  – Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.
  – Handles inter-job dependencies (DAGMan)
• Users can set the priorities of their own jobs
• Condor administrators can set user priorities


Condor-G

• Condor-G is a specialization of Condor. It is also known as the “Globus universe” or “Grid universe”.

• Condor-G can submit jobs to Globus resources, just like globusrun-ws.

• Condor-G combines the inter-domain resource management protocols of the Globus Toolkit and the intra-domain resource and job management methods of Condor for managing Grid jobs.


Condor-G …

• does whatever it takes to run your jobs, even if …
  – The gatekeeper is temporarily unavailable
  – The job manager crashes
  – Your local machine crashes
  – The network goes down


Remote Resource Access: Globus

[Diagram: “globusrun myjob …” is sent over the Globus GRAM Protocol to a Globus JobManager, which starts the job with fork(). The path spans Organization A and Organization B.]



Globus + Condor

[Diagram: “globusrun myjob …” is sent over the Globus GRAM Protocol to a Globus JobManager, which submits the job to a Condor pool. The path spans Organization A and Organization B.]



Condor-G + Globus + Condor

[Diagram: Condor-G holds a queue of jobs (myjob1, myjob2, myjob3, myjob4, myjob5, …) and sends them over the Globus GRAM Protocol to a Globus JobManager, which submits them to a Condor pool. The path spans Organization A and Organization B.]


Just to be fair…

• The gatekeeper doesn’t have to submit to a Condor pool.
  – It could be PBS, LSF, Sun Grid Engine…

• Condor-G will work fine whatever the remote batch system is.


Four Steps to Run a Job with Condor

1. Choose a Universe for your job
2. Make your job batch-ready
3. Create a submit description file
4. Run condor_submit

• These choices tell Condor how, when, and where to run the job, and describe exactly what you want to run.


1. Choose a Universe
• There are many choices
  – Vanilla: any old job
  – Standard: checkpointing & remote I/O
  – Java: better for Java jobs
  – MPI: Run parallel MPI jobs
  – Virtual Machine: Run a virtual machine as job
  – …

• For now, we’ll just consider vanilla


2. Make your job batch-ready

• Must be able to run in the background:
  – no interactive input, windows, GUI, etc.

• Condor is designed to run jobs as a batch system, with pre-defined inputs for jobs

• Can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

• Organize data files
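
One way to check that a program is batch-ready is to run it locally with its standard streams connected to files instead of the keyboard and screen. The Python sketch below does that with the standard-library subprocess module. The function name run_batch is hypothetical, and the file names in the usage example simply mirror the my_job.stdin, my_job.stdout, and my_job.stderr names used later in these slides.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def run_batch(argv: List[str], stdin_path: PathLike,
              stdout_path: PathLike, stderr_path: PathLike) -> int:
    """Run argv with stdin, stdout, and stderr bound to files.

    Returns the exit code. If the program tries to read more input than
    the stdin file provides, it sees end-of-file instead of waiting for
    a user, much like it would in a batch system.
    """
    with open(stdin_path, "rb") as fin, \
         open(stdout_path, "wb") as fout, \
         open(stderr_path, "wb") as ferr:
        return subprocess.run(argv, stdin=fin, stdout=fout, stderr=ferr).returncode


if __name__ == "__main__":
    Path("my_job.stdin").write_text("hello\n")
    # A stand-in "job": upper-case whatever arrives on stdin.
    job = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    code = run_batch(job, "my_job.stdin", "my_job.stdout", "my_job.stderr")
    print(code, Path("my_job.stdout").read_text())
```

If the program runs to completion this way without prompting, it is a reasonable candidate for submission.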


3. Create a Submit Description File
• A plain ASCII text file
• Condor does not care about file extensions
• Tells Condor about your job:
  – Which executable to run and where to find it
  – Which universe
  – Location of input, output and error files
  – Command-line arguments, if any
  – Environment variables
  – Any special requirements or preferences


Simple Submit Description File

# my_job.submit file
# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = analysis
Log        = my_job.log
Queue


4. Run condor_submit

• You give condor_submit the name of the submit file you have created:

condor_submit my_job.submit

• condor_submit parses the submit file


Another Submit Description File

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/wright/condor/my_job.condor
Input      = my_job.stdin
Output     = my_job.stdout
Error      = my_job.stderr
Arguments  = -arg1 -arg2
Queue
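
Submit description files have a simple "name = value" layout followed by a Queue statement, so they are easy to generate from a script. The Python sketch below renders one from a dictionary. The function name render_submit_file is hypothetical, it performs no validation of the setting names, and it covers only the plain form shown on these slides.

```python
from typing import Mapping, Optional


def render_submit_file(settings: Mapping[str, str],
                       queue_count: Optional[int] = None) -> str:
    """Render 'name = value' lines followed by a Queue statement.

    queue_count=None emits a bare "Queue" (one job); an integer emits
    "Queue N".
    """
    lines = [f"{name} = {value}" for name, value in settings.items()]
    lines.append("Queue" if queue_count is None else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_submit_file({
        "Universe": "vanilla",
        "Executable": "/home/wright/condor/my_job.condor",
        "Input": "my_job.stdin",
        "Output": "my_job.stdout",
        "Error": "my_job.stderr",
        "Arguments": "-arg1 -arg2",
    })
    with open("my_job.submit", "w") as handle:
        handle.write(text)
    print(text, end="")
```

The generated file would then be passed to condor_submit exactly like a hand-written one.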


“Clusters” and “Processes”

• If your submit file describes multiple jobs, we call this a “cluster”
• Each job within a cluster is called a “process” or “proc”
• If you only specify one job, you still get a cluster, but it has only one process
• A Condor “Job ID” is the cluster number, a period, and the process number (“23.5”)
• Process numbers always start at 0
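
The Job ID format described above (cluster number, a period, process number) is easy to split and to generate. The Python sketch below shows both directions; the names JobId, parse_job_id, and job_ids are hypothetical helpers, not part of Condor.

```python
from typing import List, NamedTuple


class JobId(NamedTuple):
    cluster: int
    proc: int


def parse_job_id(text: str) -> JobId:
    """Split a Job ID such as "23.5" into cluster 23 and process 5."""
    cluster, sep, proc = text.strip().partition(".")
    if not sep or not cluster.isdigit() or not proc.isdigit():
        raise ValueError(f"not a cluster.process Job ID: {text!r}")
    return JobId(int(cluster), int(proc))


def job_ids(cluster: int, count: int) -> List[str]:
    """List the Job IDs of a cluster with `count` processes, starting at 0."""
    return [f"{cluster}.{proc}" for proc in range(count)]


if __name__ == "__main__":
    print(parse_job_id("23.5"))
    print(job_ids(2, 2))
```

Because process numbers start at 0, a cluster of N jobs has process numbers 0 through N-1.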


Example Submit Description File for a Cluster

# Example condor_submit input file that defines
# a cluster of two jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2

InitialDir = run_0
# This Queue statement becomes job 2.0
Queue

InitialDir = run_1
# This Queue statement becomes job 2.1
Queue


Submit Description File for a BIG Cluster of Jobs

• The initial directory for each job is specified with the $(Process) macro, and instead of submitting a single job, we use “Queue 600” to submit 600 jobs at once

• $(Process) will be expanded to the process number for each job in the cluster (from 0 up to 599 in this case), so we’ll have “run_0”, “run_1”, … “run_599” directories

• All the input/output files will be in different directories!


Submit Description File for a BIG Cluster of Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different iwd
Universe   = vanilla
Executable = my_job
Arguments  = -arg1 -arg2
InitialDir = run_$(Process)
Queue 600
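
The $(Process) expansion can be previewed, and the per-job directories prepared, with a few lines of Python. The sketch below assumes the run_N directories should already exist when the jobs are submitted; the helper names expand_process_macro and make_run_dirs are hypothetical, and the substitution handles only the literal text $(Process), not Condor's macro language in general.

```python
from pathlib import Path
from typing import List


def expand_process_macro(template: str, count: int) -> List[str]:
    """Substitute 0..count-1 for $(Process), one result per job."""
    return [template.replace("$(Process)", str(proc)) for proc in range(count)]


def make_run_dirs(base: Path, template: str, count: int) -> List[Path]:
    """Create one directory per job under `base` and return the paths."""
    dirs = [base / name for name in expand_process_macro(template, count)]
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    names = expand_process_macro("run_$(Process)", 600)
    print(names[0], names[1], "...", names[-1])
    # Creates run_0 ... run_599 in the current directory.
    make_run_dirs(Path("."), "run_$(Process)", 600)
```

Each job's input files would then be placed in its own run_N directory before running condor_submit.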


Other Condor commands

• condor_q – show status of job queue

• condor_status – show status of compute nodes

• condor_rm – remove a job

• condor_hold – hold a job temporarily

• condor_release – release a job from hold


Submitting more complex jobs

• Express dependencies between jobs (workflows)

• Condor DAGMan

• Covered next week


Hands-on Lab

• http://users.cs.fiu.edu/~skala001/Condor_Lab.htm