High Throughput Computing with Condor at Notre Dame
Douglas Thain
30 April 2009
Today’s Talk
• High Level Introduction (20 min)
  – What is Condor?
  – How does it work?
  – What is it good for?
• Hands-On Tutorial (30 min)
  – Finding Resources
  – Submitting Jobs
  – Managing Jobs
  – Ideas for Scaling Up
The Cooperative Computing Lab
• We create software that enables the reliable sharing of cycles and storage capacity between cooperating people.
• We conduct research on the effectiveness of various systems and strategies for large scale computing.
• We collaborate with others that need to use large scale computing, so as to find the real problems and make an impact on the world.
• We operate systems like Condor that directly support research and collaboration at ND.
http://www.cse.nd.edu/~ccl
What is Condor?
• Condor is software from UW-Madison that harnesses idle cycles from existing machines. (Most workstations are ~90% idle!)
• With the assistance of CSE, OIT, and CRC staff, Condor has been installed on ~700 cores in Engineering and Science since early 2005.
• The Condor pool expands the capabilities of researchers to perform both cycle-intensive and storage-intensive research.
• New users and contributors are welcome to join!
http://condor.cse.nd.edu
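Condor Distributed Batch System (~700 cores)
[Figure: map of the Notre Dame Condor pool. A central manager sits between primary interactive users, batch users, and the machine groups listed below. The pool also reaches other Condor pools by "flocking": Purdue (~10k cores) and Wisconsin (~5k cores).]
• Clusters (as labeled in the figure): ccl 8x1, cclsun 16x2, loco 32x2, sc0 32x2, netscale 16x2, cvrl 32x2, iss 44x2, compbio 1x8, netscale 1x32
• Workstation groups (as labeled in the figure): Fitzpatrick 130, CSE 170, CHEG 25, EE 10, Nieu 20, DeBart 10
• Other labels in the figure: MPI, Hadoop, Biometrics, Storage Research, Network Research, Timeshared Collaboration, Personal Workstations, Batch Capacity, www portals, login nodes, db server, greenhouse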
http://www.cse.nd.edu/~ccl/viz
The Condor Principle
• Machine Owners Have Absolute Control
  – Set who can use the machine, for what, and when.
  – Can kick jobs off at any time manually.
• Default policy that satisfies most people:
  – Start job if console idle > 15 minutes.
  – Suspend job if console used or CPU busy.
  – Kick off job if suspended > 10 minutes.
  – After that, jobs run in this order: owner, research group, Notre Dame, elsewhere.
For the full technical details, see: http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
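The default policy above can be read as a small decision procedure. The sketch below is only an illustration of the three rules on the slide, not Condor's actual configuration (see the policy page for that). The function name `policy_action`, its arguments, and the choice to treat "console used" as zero idle minutes are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper, not part of Condor: prints the action the default
# policy on the slide would take for one machine.
#   $1 = state of the visiting job: none | running | suspended
#   $2 = minutes the console has been idle (0 means in use; an assumption)
#   $3 = 1 if the owner's work is keeping the CPU busy, else 0
#   $4 = minutes the job has been suspended (only used when suspended)
policy_action() {
    state=$1; idle_min=$2; cpu_busy=$3; suspended_min=$4
    case "$state" in
        none)
            # Start job if console idle > 15 minutes.
            if [ "$idle_min" -gt 15 ]; then echo start; else echo wait; fi
            ;;
        running)
            # Suspend job if console used or CPU busy.
            if [ "$idle_min" -eq 0 ] || [ "$cpu_busy" -eq 1 ]; then
                echo suspend
            else
                echo continue
            fi
            ;;
        suspended)
            # Kick off job if suspended > 10 minutes.
            if [ "$suspended_min" -gt 10 ]; then
                echo evict
            else
                echo stay-suspended
            fi
            ;;
        *)
            echo "unknown state: $state" >&2
            return 1
            ;;
    esac
}
```

For example, `policy_action running 0 0 0` reports that a running job is suspended as soon as the owner touches the console.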
What’s the value proposition?
• If you install Condor on your workstations, servers, or clusters, then:
  – You retain immediate, preemptive priority on your machines, both batch and interactive.
  – You gain access to the unused cycles available on other machines.
  – By the way, other people get to use your machines when you are not.
http://condor.cse.nd.edu
Condor Architecture
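[Figure: the three Condor components and the messages they exchange.]
• schedd: represents a user with jobs to run. It tells the matchmaker: "I want an INTEL CPU with > 3GB RAM."
• startd: represents an available machine. It tells the matchmaker: "I prefer to run jobs owned by user 'joe'."
• matchmaker: tells the schedd and the startd: "You two should talk to each other."
• The schedd then tells the startd: "Run job with files X, Y."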
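To make the matchmaking step concrete, here is a toy sketch of the job's side of the match from the figure ("an INTEL CPU with > 3GB RAM"). It is not how the matchmaker is implemented: the real matchmaker evaluates expressions from both the job and the machine, while this checks only the job's request. The function names and the "name arch memory_mb" input format are invented for the example.

```shell
#!/bin/sh
# Hypothetical sketch: does one machine satisfy the job's request?
#   $1 = machine architecture, $2 = machine memory in MB
machine_matches() {
    arch=$1; memory_mb=$2
    [ "$arch" = "INTEL" ] && [ "$memory_mb" -gt 3072 ]
}

# Read "name arch memory_mb" lines on stdin and print the names of the
# machines that the job could be matched with.
match_machines() {
    while read -r name arch memory_mb; do
        if machine_matches "$arch" "$memory_mb"; then
            echo "$name"
        fi
    done
}
```

Usage, with made-up machine names: `printf 'a INTEL 4096\nb INTEL 2048\n' | match_machines` prints only `a`, since `b` has too little memory.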
~700 CPUs at Notre Dame
[Figure: one matchmaker serving many startd daemons (machines) and many schedd daemons (users).]
Flocking to Other Sites
[Figure: Condor pools connected by flocking.]
• University of Wisconsin: 2000 CPUs
• Purdue University: 20,000 CPUs
• Notre Dame: 700 CPUs
What is Condor Good For?
• Condor works well on large workflows of sequential jobs, provided that they match the machines available to you.
• Ideal workload:
  – One million jobs that require one hour each.
• Doesn’t work at all:
  – An 8-node MPI job that must run now.
• Many workloads can be converted into the ideal form, with varying degrees of effort.
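A quick back-of-the-envelope calculation shows why the ideal workload is about throughput rather than speed. The sketch below assumes, optimistically, that all ~700 cores of the pool are available to you the whole time, so the result is a lower bound. The variable names are just for illustration.

```shell
#!/bin/sh
# Best-case wall-clock time for the "ideal workload" on the ND pool.
njobs=1000000        # one million jobs
hours_per_job=1      # one hour each
cores=700            # approximate size of the pool (from the slides)

total_core_hours=$(( njobs * hours_per_job ))
# Integer division rounded up: every core works until the queue is empty.
wall_hours=$(( (total_core_hours + cores - 1) / cores ))
wall_days=$(( (wall_hours + 23) / 24 ))

echo "$total_core_hours core-hours on $cores cores:"
echo "at least $wall_hours hours (about $wall_days days)"
```

Even in this best case the workload takes roughly two months, which is why access to additional pools matters for workloads of this size.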
High Throughput Computing
• Condor is not High Performance Computing
  – HPC: Run one program as fast as possible.
• Condor is High Throughput Computing
  – HTC: Run as many programs as possible before my paper deadline on May 1st.
Intermission and Questions
Getting Started:
If your shell is tcsh:
% setenv PATH /afs/nd.edu/user37/condor/software/bin:$PATH
If your shell is bash:
% export PATH=/afs/nd.edu/user37/condor/software/bin:$PATH
Then, create a temporary working space:
% mkdir /tmp/YOURNAME
% cd /tmp/YOURNAME
Viewing Available Resources
• Condor Status Web Page:
  – http://condor.cse.nd.edu
• Command Line Tool:
  – condor_status
  – condor_status -constraint '(Memory>2048)'
  – condor_status -constraint '(Arch=="INTEL")'
  – condor_status -constraint '(OpSys=="LINUX")'
  – condor_status -run
  – condor_status -submitters
  – condor_status -pool boilergrid.rcac.purdue.edu
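Constraints can be combined with `&&` and `||` in a single expression. The sketch below joins two of the constraints from the list above. The guard around `condor_status` is only there so the snippet does something sensible on a machine where Condor is not on the PATH.

```shell
#!/bin/sh
# Combine two constraints: more than 2 GB of memory AND running Linux.
CONSTRAINT='(Memory>2048) && (OpSys=="LINUX")'

if command -v condor_status >/dev/null 2>&1; then
    condor_status -constraint "$CONSTRAINT" || echo "condor_status failed" >&2
else
    echo "condor_status is not on PATH; would run:"
    echo "condor_status -constraint '$CONSTRAINT'"
fi
```

Single quotes around the expression keep the shell from interpreting the double quotes and the `>` sign.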
A Simple Script Job
% vi simple.sh

#!/bin/sh
echo $@
date
uname -a

% chmod 755 simple.sh
% ./simple.sh hello world
A Simple Submit File
% vi simple.submit

universe = vanilla
executable = simple.sh
arguments = hello condor
output = simple.stdout
error = simple.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue
Submitting and Watching a Job
• Submit the job:
  – condor_submit simple.submit
• Look at the job queue:
  – condor_q
• Remove a job:
  – condor_rm <#>
• See where the job went:
  – tail -f simple.logfile
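For scripted use, the submit and watch steps can be chained. The sketch below wraps `condor_submit` and `condor_wait` (a Condor tool that blocks until the jobs recorded in a log file have finished) in a helper. The function name `submit_and_wait` and its return codes are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: submit a job description and block until the jobs
# recorded in its log file have completed.
#   $1 = submit file, $2 = log file named in that submit file
submit_and_wait() {
    submit_file=$1; log_file=$2
    if [ ! -f "$submit_file" ]; then
        echo "no such submit file: $submit_file" >&2
        return 2
    fi
    if ! command -v condor_submit >/dev/null 2>&1; then
        echo "condor_submit is not on PATH" >&2
        return 3
    fi
    condor_submit "$submit_file" || return 1
    # condor_wait watches the user log and returns when the jobs are done.
    condor_wait "$log_file"
}
```

Usage: `submit_and_wait simple.submit simple.logfile`, then inspect `simple.stdout` and `simple.stderr`.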
Submitting Lots of Jobs
% vi simple.submit

universe = vanilla
executable = simple.sh
arguments = hello $(PROCESS)
output = simple.stdout.$(PROCESS)
error = simple.stderr.$(PROCESS)
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
queue 50
What Happened to All My Jobs?
• http://condorlog.cse.nd.edu
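A quick local check can complement the log viewer: after a `queue 50` submission, each job should have left behind its own numbered output file. The sketch below lists the process numbers whose output is missing or empty. The function name `missing_outputs` is made up for this example, and it assumes every job prints something, as simple.sh does.

```shell
#!/bin/sh
# Hypothetical helper: print the process numbers whose output file is
# missing or empty.
#   $1 = number of jobs queued, $2 = output prefix (e.g. simple.stdout)
missing_outputs() {
    n=$1; prefix=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        # -s is true only if the file exists and is not empty.
        [ -s "$prefix.$i" ] || echo "$i"
        i=$(( i + 1 ))
    done
}
```

Usage: `missing_outputs 50 simple.stdout` prints nothing once all 50 jobs have returned their output.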
Setting Requirements
• By default, Condor will only run your job on a machine with the same CPU and OS as the submitter.
• Use requirements to send your job to other kinds of machines:
  – requirements = (Memory>2048)
  – requirements = (Arch=="INTEL" || Arch=="X86_64")
  – requirements = (MachineGroup=="fitzlab")
  – requirements = (UidDomain!="nd.edu")
• (Hint: Try out your requirements expressions using condor_status as above.)
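Several conditions can be combined into one requirements expression with `&&` and `||`. As a sketch, the snippet below writes a variant of the simple submit file that asks for more than 2 GB of memory on either architecture. The file name `big_memory.submit` is made up for this example.

```shell
#!/bin/sh
# Write a submit file whose requirements combine memory and architecture.
# The quoted 'EOF' stops the shell from expanding anything in the body.
cat > big_memory.submit <<'EOF'
universe = vanilla
executable = simple.sh
arguments = hello condor
output = simple.stdout
error = simple.stderr
should_transfer_files = yes
when_to_transfer_output = on_exit
log = simple.logfile
requirements = (Memory > 2048) && (Arch == "INTEL" || Arch == "X86_64")
queue
EOF
```

Following the hint, the same expression can be tried first with `condor_status -constraint` to see how many machines would qualify before submitting.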
Setting Rank
• By default, Condor will assume any machine that satisfies your requirements is sufficient.
• Use the rank expression to indicate which machines you prefer:
  – rank = (Memory>1024)
  – rank = (MachineGroup=="fitzlab")
  – rank = (Arch=="INTEL")*10 + (Arch=="X86_64")*20
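The last rank expression relies on each comparison counting as 1 when true and 0 when false, so each machine gets a number and higher numbers are preferred. The sketch below reproduces that arithmetic for a single machine. The function name `rank_for_arch` is invented for this example and is not a Condor command.

```shell
#!/bin/sh
# Hypothetical helper: evaluate
#   (Arch=="INTEL")*10 + (Arch=="X86_64")*20
# for one machine, treating a true comparison as 1 and a false one as 0.
#   $1 = the machine's Arch value
rank_for_arch() {
    arch=$1
    is_intel=0
    is_x86_64=0
    if [ "$arch" = "INTEL" ]; then is_intel=1; fi
    if [ "$arch" = "X86_64" ]; then is_x86_64=1; fi
    echo $(( is_intel * 10 + is_x86_64 * 20 ))
}
```

An X86_64 machine scores 20 and an INTEL machine 10, so the job goes to X86_64 machines first when both satisfy the requirements.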
File Transfer
• Notes to keep in mind:
  – Condor cannot write to AFS. (no credentials)
  – Not all machines in Condor have AFS.
• So, you must specify what files your job needs, and Condor will send them there:
  – transfer_input_files = x.dat, y.calib, z.library
• By default, all files created by your job will be sent home automatically.
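Because the listed files are copied from the submitting machine, a misspelled or missing input file is a common reason for a submission to fail. The sketch below is a hypothetical pre-flight check that reads the transfer_input_files line of a submit file and reports any file that does not exist. The function name `check_inputs` is an assumption, and it handles only a single-line, comma-separated list of names without spaces.

```shell
#!/bin/sh
# Hypothetical helper: verify that every file named in the
# transfer_input_files line of a submit file exists.
#   $1 = submit file; paths are resolved relative to the current directory
check_inputs() {
    submit_file=$1
    rc=0
    # Keep only the text after "=", then turn the commas into spaces.
    files=$(sed -n 's/^[[:space:]]*transfer_input_files[[:space:]]*=[[:space:]]*//p' \
                "$submit_file" | tr ',' ' ')
    for f in $files; do
        if [ ! -e "$f" ]; then
            echo "missing input file: $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

Usage: `check_inputs simple.submit && condor_submit simple.submit` submits only when all inputs are present.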
In Class Assignment
• Execute 50 jobs that run on a machine not at Notre Dame that has >1GB RAM.