Part 8: DAGMan
• A: Grid Workflow Management
• B: DAGMan
• C: Laboratory: DAGMan
A: Grid Workflow Management
Job Dependencies
• In many applications, some jobs are dependent on other jobs
• E.g. job A must finish before job B starts
• Often because job B uses output from job A
• We call a set of interdependent jobs a workflow
• Condor-G can run jobs in any order
• We need a workflow manager
Two Motivating Examples
The Sloan Digital Sky Survey
The Montage Project
Sloan Digital Sky Survey
• Map one-quarter of the entire sky
• Determine the positions and absolute brightness of more than 100 million celestial objects.
• Measure the distance to a million of the nearest galaxies, and to 100,000 quasars.
http://www.sdss.org
Workflow to Find Galaxy Clusters
[Figure: workflow graph for finding galaxy clusters. Labels in the figure: tsObj, field, brg, core, cluster, catalog, fieldPrep, maxBrg, maxBcg, bcgCoal, getCatalog.]
Montage
• Create a large mosaic image from many smaller images
• Used for astronomy data
• Correct optical distortions and intensity differences
http://montage.ipac.caltech.edu
Montage Workflow
[Figure: Montage workflow with 1202 nodes. Legend: data stage-in nodes, Montage compute nodes, data stage-out nodes, inter-pool transfer nodes.]
B: DAGMan
DAGMan
• Directed Acyclic Graph Manager
• Workflow manager for Condor-G
• DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.
• By default, Condor may run your jobs in any order, or everything simultaneously, so we need DAGMan to enforce an ordering when necessary.
• (e.g., “Don’t run job B until job A has completed successfully.”)
What is a DAG?
• A DAG is the data structure used by DAGMan to represent these dependencies.
• Each job is a “node” in the DAG.
• Each node can have any number of “parent” or “children” nodes – as long as there are no loops!
Job A
Job B Job C
Job D
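The "no loops" rule can be checked mechanically. The sketch below is an illustration only, not DAGMan's own code: it stores the diamond DAG as a Python dictionary (the name `diamond` and the function `topological_order` are made up for this example) and produces an order in which every parent comes before its children, raising an error if the graph contains a loop.

```python
from collections import deque

def topological_order(children):
    """Return nodes in an order where every parent precedes its children.

    `children` maps each node name to the list of its child nodes.
    Raises ValueError if the graph has a loop, so it is not a DAG.
    """
    # Count how many parents each node is still waiting on.
    waiting = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            waiting[kid] = waiting.get(kid, 0) + 1

    # Nodes with no parents can go first.
    ready = deque(sorted(n for n, count in waiting.items() if count == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in children.get(node, []):
            waiting[kid] -= 1
            if waiting[kid] == 0:
                ready.append(kid)

    # Any node never reached is stuck behind a loop.
    if len(order) != len(waiting):
        raise ValueError("graph contains a loop, so it is not a DAG")
    return order

# The diamond DAG from the slide: A -> B, C -> D.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(topological_order(diamond))  # ['A', 'B', 'C', 'D']
```

A graph such as `{"A": ["B"], "B": ["A"]}` makes the function raise `ValueError`, which is the situation the slide rules out.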
Defining a DAG
• A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D
• Each node will run the Condor job specified by its accompanying Condor submit file
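To make the file format concrete, here is a minimal sketch of a reader for the two kinds of lines shown above. It is not DAGMan's parser: the function name `parse_dag` is hypothetical, it handles only `Job` and `Parent ... Child` lines, and it silently ignores everything else.

```python
def parse_dag(text):
    """Parse Job and Parent/Child lines from .dag file text.

    Returns (jobs, children): jobs maps node name -> submit file,
    children maps node name -> list of child node names.
    """
    jobs = {}
    children = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        words = line.split()
        keyword = words[0].upper()
        if keyword == "JOB":
            name, submit_file = words[1], words[2]
            jobs[name] = submit_file
            children.setdefault(name, [])
        elif keyword == "PARENT":
            # Everything before CHILD is a parent, everything after is a child.
            split = [w.upper() for w in words].index("CHILD")
            parents, kids = words[1:split], words[split + 1:]
            for parent in parents:
                children.setdefault(parent, []).extend(kids)
    return jobs, children

DIAMOND_DAG = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

jobs, children = parse_dag(DIAMOND_DAG)
print(jobs)
print(children)
```

Note that `Parent B C Child D` expands to two edges, B to D and C to D, so D waits for both.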
Submitting a DAG
• To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:
    % condor_submit_dag diamond.dag
• condor_submit_dag submits a job with DAGMan as the executable.
• This job happens to run on the submitting machine, not any other computer.
• Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.
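If the submission needs to be scripted, one possible approach is to wrap the command in Python. The helper names `build_submit_command` and `submit_dag` are invented for this sketch. The optional `-maxjobs` argument is the condor_submit_dag option that limits how many node jobs are submitted at once; check your Condor version's manual before relying on it.

```python
import shutil
import subprocess

def build_submit_command(dag_file, max_jobs=None):
    """Build the condor_submit_dag command line as a list of arguments."""
    command = ["condor_submit_dag"]
    if max_jobs is not None:
        # Assumption: -maxjobs is available in the installed Condor version.
        command += ["-maxjobs", str(max_jobs)]
    command.append(dag_file)
    return command

def submit_dag(dag_file, max_jobs=None):
    """Run condor_submit_dag if it is installed; otherwise only report."""
    command = build_submit_command(dag_file, max_jobs)
    if shutil.which(command[0]) is None:
        print("condor_submit_dag not found; would run:", " ".join(command))
        return None
    return subprocess.run(command, check=True)

print(" ".join(build_submit_command("diamond.dag")))
```

Calling `submit_dag("diamond.dag")` on a machine without Condor only prints the command, so the sketch is safe to try anywhere.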
Running a DAG
• DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.
[Figure: DAGMan reads the .dag file (nodes A, B, C, D) and places job A in the Condor job queue.]
Running a DAG (cont’d)
• DAGMan holds & submits jobs to the Condor queue at the appropriate times.
[Figure: DAGMan, the DAG (nodes A, B, C, D), and the Condor job queue holding jobs B and C.]
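What "the appropriate times" means can be sketched as a loop: a node becomes eligible once all of its parents have finished. The function below is a toy model under the assumption that every submitted job succeeds; `submission_rounds` is a hypothetical name, and real DAGMan reacts to job events rather than working in fixed rounds.

```python
def submission_rounds(children):
    """Group nodes into rounds; each round holds the nodes whose
    parents have all finished in earlier rounds.

    `children` maps every node name to the list of its child nodes.
    """
    # Invert the child lists so each node knows its parents.
    parents = {node: set() for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents[kid].add(node)

    done = set()
    rounds = []
    while len(done) < len(parents):
        ready = sorted(
            n for n in parents if n not in done and parents[n] <= done
        )
        if not ready:
            raise ValueError("no node is ready: the graph has a loop")
        rounds.append(ready)
        done.update(ready)  # assume every job in the round succeeds
    return rounds

diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(submission_rounds(diamond))  # [['A'], ['B', 'C'], ['D']]
```

For the diamond DAG this gives A alone, then B and C together, then D, matching the sequence of slides.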
Running a DAG (cont’d)
• In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.
[Figure: node C has failed (marked X); DAGMan writes a rescue file.]
Recovering a DAG
• Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.
[Figure: DAGMan reads the rescue file and places job C in the Condor job queue.]
Recovering a DAG (cont’d)
• Once that job completes, DAGMan will continue the DAG as if the failure never happened.
[Figure: DAGMan places job D in the Condor job queue.]
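The failure and recovery behaviour described on these slides can be imitated in a few lines. This is a simplified model, not DAGMan's implementation: `run_dag` is a hypothetical function, jobs run one at a time, and the returned set of completed nodes stands in for the rescue file.

```python
def run_dag(children, run_job, done=()):
    """Run every node whose parents are done until no progress is possible.

    `run_job(node)` returns True on success. `done` lists nodes already
    completed in an earlier run (the role the rescue file plays).
    Returns (done, failed) as two sets of node names.
    """
    parents = {node: set() for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents[kid].add(node)

    done = set(done)
    failed = set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:  # all parents finished successfully
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done, failed

diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# First run: job C fails, so D can never start.
done, failed = run_dag(diamond, lambda node: node != "C")
print(sorted(done), sorted(failed))  # ['A', 'B'] ['C']

# Recovery: restart from the saved state; only C and D are run.
rerun = []
done, failed = run_dag(diamond, lambda node: rerun.append(node) or True, done)
print(rerun, sorted(done))  # ['C', 'D'] ['A', 'B', 'C', 'D']
```

The second call never re-runs A or B, which is the point of the rescue file: work that already succeeded is not repeated.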
Finishing a DAG
• Once the DAG is complete, the DAGMan job itself is finished, and exits.
[Figure: the completed DAG (nodes A, B, C, D) and the Condor job queue.]
Additional DAGMan Features
• Provides other handy features for job management…
• nodes can have PRE & POST scripts
• failed nodes can be automatically re-tried a configurable number of times
• job submission can be “throttled”
Another sample DAGMan submit file
    # Filename: diamond.dag
    Job A A.condor
    Job B B.condor
    Job C C.condor
    Job D D.condor
    Script PRE A top_pre.csh
    Script PRE B mid_pre.perl $JOB
    Script POST B mid_post.perl $JOB $RETURN
    Script PRE C mid_pre.perl $JOB
    Script POST C mid_post.perl $JOB $RETURN
    Script PRE D bot_pre.csh
    PARENT A CHILD B C
    PARENT B C CHILD D
    Retry C 3
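How PRE scripts, POST scripts, and Retry combine for a single node can be sketched as follows. The rules encoded here are assumptions about typical DAGMan behaviour, stated loudly: a failed PRE script skips the job and the POST script, a POST script (which receives the job's return value, like `$RETURN` above) decides whether the node succeeded, and `Retry C 3` allows up to three extra attempts. The function `run_node` is hypothetical.

```python
def run_node(job, pre=None, post=None, retries=0):
    """Run one node with optional PRE/POST steps and retries.

    `pre()` and `job()` return an exit status (0 means success);
    `post(job_status)` receives the job's status and returns its own.
    Returns True if some attempt succeeds.
    """
    for _attempt in range(retries + 1):
        # Assumption: a failing PRE step fails the attempt outright.
        if pre is not None and pre() != 0:
            continue
        job_status = job()
        if post is not None:
            # Assumption: the POST step's status decides success.
            succeeded = post(job_status) == 0
        else:
            succeeded = job_status == 0
        if succeeded:
            return True
    return False

# A job that fails twice and then succeeds.
statuses = [1, 1, 0]
attempts = []

def flaky_job():
    attempts.append(len(attempts) + 1)
    return statuses[len(attempts) - 1]

print(run_node(flaky_job, retries=3))  # True
print(len(attempts))                   # 3
```

With `retries=1` the same job would be given up on after two attempts, and the node would count as failed.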
Lab 8: DAGMan
• In this lab, you’ll:
• Run a simple DAGMan job
• Run a more complex DAGMan job
• Recover a failed DAGMan job
Credits
• NSF disclaimer
• Portions of this presentation were adapted from the following sources:
• Jaime Frey, UW-Madison