Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Condor: A Project and a System
Scientific Data Intensive Computing Workshop ’04
Microsoft Research
May 2004
Outline
› What is the Condor Project?
› What is the Condor HTC Software?
› Recipe for using desktops for science
› Data!
The Condor Project (Established ’85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who:
› face software engineering challenges in a heterogeneous distributed environment,
› are involved in national and international grid collaborations,
› actively interact with academic and commercial users,
› maintain and support large distributed production environments, and
› educate and train students.
Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, Intel, Microsoft, UW-Madison, …
A Multifaceted Project
› Harnessing the power of clusters, opportunistic and/or dedicated (Condor)
› Job management services for Grid applications (Condor-G, Stork)
› Fabric management services for Grid resources (Condor, GlideIns, NeST)
› Distributed I/O technology (Parrot, Kangaroo, NeST)
› Job-flow management (DAGMan, Condor, Hawk)
› Distributed monitoring and management (HawkEye)
› Technology for Distributed Systems (ClassAd, MW)
› Packaging and Integration (NMI, VDT)
Outline
› What is the Condor Project?
› What is the Condor HTC Software?
› Recipe for using desktops for science
› Data!
What is Condor?
Condor converts collections of distributively owned workstations and dedicated clusters into a distributed, fault-tolerant high-throughput computing (HTC) facility.
› Distributed ownership: the decrease in the cost-performance ratio caused
  • a huge increase in an organization's aggregate computing capacity
  • a much smaller increase in the capacity accessible to a single person
› HTC: large amounts of processing capacity sustained over very long time periods
Condor can manage a large number of jobs
› Managing a large number of jobs
  • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified of their progress
  • Mechanisms to help you manage huge numbers of jobs (1000s), the data, etc.
  • Condor can handle workflow / inter-job dependencies (DAGMan)
  • Condor users can set job priorities
  • Condor administrators can set user priorities
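The inter-job dependencies mentioned above can be pictured as a directed acyclic graph in which a job becomes runnable once all of its parents have finished. The following is a minimal Python sketch of that idea only; it is not DAGMan and does not use any Condor interface. The job names are hypothetical, and the ordering relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
dependencies = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collect": {"simulate_a", "simulate_b"},
}

def run_in_dependency_order(deps):
    """Return batches of jobs; jobs in the same batch could run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents have all finished
        batches.append(ready)
        sorter.done(*ready)  # assume every job in the batch succeeded
    return batches

batches = run_in_dependency_order(dependencies)
print(batches)  # [['prepare'], ['simulate_a', 'simulate_b'], ['collect']]
```

A real workflow manager would also have to handle job failure and retries; the sketch assumes every job succeeds.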
Condor can manage Dedicated Resources…
› Dedicated Resources
  • Compute clusters
› Manage
  • Node monitoring, scheduling
  • Job launch, monitor & cleanup
…and Condor can manage non-dedicated resources
› Examples of non-dedicated resources:
  • Desktop workstations in offices
  • Workstations in student labs
› Non-dedicated resources are often idle, ~70% of the time!
› Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources
Some HTC Challenges
› Condor does whatever it takes to run your jobs, even if some machines…
  • Crash (or are disconnected)
  • Run out of disk space
  • Don’t have your software installed
  • Are frequently needed by others
  • Are far away & managed by someone else
The Condor System
› Unix and Win2k/XP
› Operational since 1986
› Just at UW: more than 1800 CPUs in 10 pools on our campus
› Software available free on the web
  • Open license
› Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, Xerox, NASA, Texas Instruments, …)
Downloads and Deployments
Outline
› What is the Condor Project?
› What is the Condor HTC Software?
› Recipe for using desktops for science
› Data!
Recipe Tip: Useful Distributed Ownership mechanisms in Condor
› Checkpoint / Migration
  • Checkpoint == picture of process state
  • Enables preempt/resume scheduling and migration, ensures forward progress
› Remote System Calls
  • Redirect I/O and other system calls back to the submit machine
› Matchmaking with ClassAds
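To make the checkpoint idea concrete, here is a minimal Python sketch of preempt/resume at the application level. It is an analogy only: the slide describes a picture of the whole process state, whereas this sketch saves a small dictionary with `pickle`. The function name, the file name, and the `preempt_after` parameter are assumptions made for illustration.

```python
import os
import pickle
import tempfile

def run_job(checkpoint_path, total_steps, preempt_after=None):
    """Sum 0..total_steps-1, saving a checkpoint after every step.

    preempt_after simulates the machine owner reclaiming the workstation
    after that many steps of the current run.
    """
    # Resume from the checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"next_step": 0, "total": 0}

    steps_this_run = 0
    while state["next_step"] < total_steps:
        if preempt_after is not None and steps_this_run >= preempt_after:
            return None  # preempted; progress so far is in the checkpoint
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)  # the saved "picture" of the job's state
    return state["total"]

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "job.ckpt")
    first = run_job(ckpt, total_steps=10, preempt_after=4)  # preempted early
    second = run_job(ckpt, total_steps=10)                  # resumes, finishes

print(first, second)  # None 45
```

Because the second run starts from step 4 rather than step 0, work done before the preemption is not lost, which is the forward-progress property the slide refers to.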
ClassAds
› Set of bindings of Attribute Names to Expressions
› Self-describing (no separate schema)
› Combine query and data
› Arbitrarily composed and nested
› Bilateral
Resource owners are generous if it doesn’t cost them anything!
Examples
[
  Type = "Job";
  Owner = "raman";
  Cmd = "run_sim";
  Args = "-Q 17 3200";
  Cwd = "/u/raman";
  Memory = 31;
  Qdate = 886799469;
  ...
  Rank = other.Kflops...
  Requirements = other.Type == ...
]
[
  Type = "Machine";
  Name = "xxy.cs. ...";
  Arch = "iX86";
  OpSys = "Solaris";
  Mips = 104;
  Kflops = 21893;
  State = "Unclaimed";
  LoadAvg = 0.042969;
  ...
  Rank = ...;
  Requirements = ...;
]
Attribute Expressions
› Constants 104, 0.042969, "iX86"
› References attr, self.attr, other.attr, expr.attr
› Operators +, *, >>, <, >=, &&, ...
› Functions strcat, substr, floor, member, ...
› Lists { expr, expr, ... }
› ClassAds [ name=expr; name=expr; ... ]
Examples
› Descriptive attributes
  Type = "Job";
  Owner = "raman";
  Arch = "iX86";
  OpSys = "Solaris";
  Memory = 64;    // megabytes
  Disk = 323496;  // kilobytes
Examples
› Current state
  Daytime = 36017;      // secs past midnight
  KeyboardIdle = 1432;  // seconds
  State = "Unclaimed";
  LoadAvg = 0.042969;
Examples
› Parameters
  ResearchGrp = { "raman", "miron", "solomon", "jbasney" };
  Friends = { "tannenba", "wright" };
  Untrusted = { "rival", "riffraff" };
  WantCheckpoint = 1;
Examples
› Derived data
  Rank =  // machine's rank for job
    10 * member(other.Owner, ResearchGrp)
    + member(other.Owner, Friends);
  Rank =  // job's rank for machine
    Kflops/1E3 + other.Memory/32;
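As a rough illustration of how these two Rank expressions rank candidates, the Python sketch below evaluates them by hand. It assumes that `member(...)` contributes 1 when the owner is in the list and 0 otherwise, which is what the arithmetic on the slide implies. The function names are hypothetical, and the machine values in the example are illustrative numbers borrowed from other slides.

```python
RESEARCH_GRP = ["raman", "miron", "solomon", "jbasney"]
FRIENDS = ["tannenba", "wright"]

def machine_rank(job_owner):
    """Machine's rank for a job: research group members count ten times more."""
    # A Python bool acts as 1 or 0 in arithmetic, mirroring member() here.
    return 10 * (job_owner in RESEARCH_GRP) + (job_owner in FRIENDS)

def job_rank(machine):
    """Job's rank for a machine: prefers faster machines with more memory."""
    return machine["Kflops"] / 1e3 + machine["Memory"] / 32

print(machine_rank("raman"))   # 10
print(machine_rank("wright"))  # 1
print(machine_rank("rival"))   # 0
print(job_rank({"Kflops": 21893, "Memory": 64}))
```

A higher value means a more preferred partner, so this machine would favor a job from "raman" over one from "wright".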
Examples
› Job constraint
  Requirements =
    other.Type == "Machine"
    && Arch == "iX86"
    && OpSys == "Solaris"
    && Disk > 10000
    && other.Memory >= self.Memory;
Examples
› Machine constraint
  Requirements =
    ! member(other.Owner, Untrusted)
    && Rank >= 10 ? true
    : Rank > 0 ? (LoadAvg < 0.3 && KeyboardIdle > 15*60)
    : DayTime < 6*60*60 || DayTime > 18*60*60;
Matching Algorithm
› To match two ads A and B
  • Set up an environment such that in A
    • self evaluates to A
    • other evaluates to B
    • other attributes are searched for first in A and then in B
  • and vice versa (with A and B interchanged)
  • Check if A.Requirements and B.Requirements both evaluate to true
  • Use A.Rank and B.Rank for preferences
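The steps above can be sketched in Python as follows. This is a simplified illustration and not the ClassAd implementation: ads are plain dictionaries, Requirements and Rank are modeled as functions of `(self, other)`, and the helper names `lookup` and `match` are assumptions. The attribute values are taken from the example slides where available.

```python
def lookup(name, self_ad, other_ad):
    """Resolve an unqualified attribute: search self first, then other."""
    if name in self_ad:
        return self_ad[name]
    return other_ad.get(name)

job = {
    "Type": "Job",
    "Owner": "raman",
    "Memory": 31,
    "Requirements": lambda self, other: (
        other["Type"] == "Machine"
        and lookup("Arch", self, other) == "iX86"
        and lookup("OpSys", self, other) == "Solaris"
        and other["Memory"] >= self["Memory"]
    ),
    "Rank": lambda self, other: lookup("Kflops", self, other) / 1e3,
}

machine = {
    "Type": "Machine",
    "Arch": "iX86",
    "OpSys": "Solaris",
    "Memory": 64,
    "Kflops": 21893,
    "Untrusted": ["rival", "riffraff"],
    "Requirements": lambda self, other: other["Owner"] not in self["Untrusted"],
    "Rank": lambda self, other: 0,
}

def match(a, b):
    """Return (a's rank of b, b's rank of a) if both Requirements hold, else None."""
    if a["Requirements"](a, b) and b["Requirements"](b, a):
        return a["Rank"](a, b), b["Rank"](b, a)
    return None

accepted = match(job, machine)
rejected = match(dict(job, Owner="rival"), machine)
print(accepted is not None, rejected)  # True None
```

The match is bilateral: the job from "rival" satisfies its own Requirements against the machine, but the machine's Requirements reject it, so no match is made.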
Three-valued Logic
› All of the following evaluate to UNDEFINED if other has no "Memory" attribute:
  other.Memory > 32
  other.Memory == 32
  other.Memory != 32
  !(other.Memory == 32)
› The following is TRUE if either attribute exists and satisfies the given condition:
  other.Mips >= 10 || other.Kflops >= 1000
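The behavior described above can be imitated with a small Python sketch. It is an illustration under stated assumptions, not the ClassAd evaluator: `UNDEFINED` is a sentinel object, and the helper names are hypothetical. The case "false || UNDEFINED yields UNDEFINED" is not stated on the slide; it follows the usual three-valued convention and is an assumption here.

```python
import operator

UNDEFINED = object()  # stands in for the UNDEFINED value

def attr(ad, name):
    """Look up an attribute; a missing attribute evaluates to UNDEFINED."""
    return ad.get(name, UNDEFINED)

def compare(op, left, right):
    """Strict comparison: UNDEFINED if either operand is UNDEFINED."""
    if left is UNDEFINED or right is UNDEFINED:
        return UNDEFINED
    return op(left, right)

def logical_not(value):
    return UNDEFINED if value is UNDEFINED else not value

def logical_or(left, right):
    """TRUE if either side is TRUE, even when the other side is UNDEFINED."""
    if left is True or right is True:
        return True
    if left is UNDEFINED or right is UNDEFINED:
        return UNDEFINED  # assumed convention, see the note above
    return False

other = {"Mips": 104}  # no "Memory" and no "Kflops" attribute

gt = compare(operator.gt, attr(other, "Memory"), 32)
ne = compare(operator.ne, attr(other, "Memory"), 32)
not_eq = logical_not(compare(operator.eq, attr(other, "Memory"), 32))
either = logical_or(
    compare(operator.ge, attr(other, "Mips"), 10),
    compare(operator.ge, attr(other, "Kflops"), 1000),
)
print(gt is UNDEFINED, ne is UNDEFINED, not_eq is UNDEFINED, either)
# True True True True
```

Note that `!=` and the negated `==` do not become true just because the attribute is missing; both stay UNDEFINED.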
Recipe Tip: Build from Bottom up!
› Start with a service for a single user, on a single machine.
› “Personal Condor”
  • Condor on your own workstation, no local system/root access required, no system administrator intervention needed
[Figure: your workstation running a personal Condor with 600 Condor jobs]
Personal Condor?!
What’s the benefit of a Condor “Pool” with just one user and one machine?
Your Personal Condor will ...
› … keep an eye on your jobs and will keep you posted on their progress
› … implement your policy on the execution order of the jobs
› … keep a log of your job activities
› … add fault tolerance to your jobs
› … implement your policy on when the jobs can run on your workstation
Expand from your desktop…
› Build a Condor pool inside your organization
  • Install Condor on multiple machines, pointing them to your initial machine as the manager.
› Utilize Condor resources at remote organizations (“build a grid”)
  • Take advantage of your Condor-using friends…
  • Get permission to access their resources
  • Then configure your Condor pool to “flock” to these pools
  • Accounting system is flocking-aware
[Figure: your workstation's personal Condor with 600 Condor jobs, connected to a Condor pool and a friendly Condor pool]
Condor-G
› What about resources at remote organizations that are NOT managed via Condor? (perhaps they are managed via PBS, SGE, LSF, …)
› Condor-G
  • Job task-broker for Grid middleware.
  • Submit jobs to resources managed via grid middleware such as Globus (GT2 & GT3), NorduGrid, Unicore, or Oracle (or Condor)
  • Oracle: run PL/SQL programs on Oracle just like a normal job, via transactions, put in DAGs, etc.
Condor GlideIn
› Problems
  • What if the grid middleware or remote scheduler doesn’t provide services I want?
  • What about end-to-end semantic guarantees?
› Solution
  • Submit the Condor daemons to remote schedulers instead of the job
  • When the resources run these GlideIn jobs, they will temporarily join your Condor pool and run the job as usual.
[Figure: your workstation's personal Condor with 600 Condor jobs, connected to a Condor pool, a friendly Condor pool, and a Globus grid (PBS, LSF, Condor) that runs glide-in jobs]
Outline
› What is the Condor Project?
› What is the Condor HTC Software?
› Recipe for using desktops for science
› Data!
  • Harmonize computation w/ data storage and data movement.
Data Movement: Stork
› Scheduler for wide-area data transfer
› Condor historically focused on CPU allocation
  • Data movement was an implicit side effect
› Stork elevates data movement to be a “first class citizen”
  • Data movement is another type of node within a job dependency graph
  • Data movement is now queued, scheduled, monitored, managed, checkpointed
Data Access: Parrot
Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!).
Works with legacy programs
% gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps
% grep Yahoo /http/www.yahoo.com
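The two commands above suggest a naming convention in which the first path component names the protocol and the second names the host. The Python sketch below illustrates only that name mapping; it says nothing about how Parrot attaches to a legacy program, and the function name is hypothetical.

```python
def parrot_path_to_url(path):
    """Map '/<scheme>/<host>/<rest>' to '<scheme>://<host>/<rest>'."""
    parts = path.split("/", 3)  # ['', scheme, host] or ['', scheme, host, rest]
    if len(parts) < 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a /scheme/host/... path: {path!r}")
    scheme, host = parts[1], parts[2]
    rest = parts[3] if len(parts) > 3 else ""
    return f"{scheme}://{host}/{rest}" if rest else f"{scheme}://{host}"

print(parrot_path_to_url("/http/www.yahoo.com"))
# http://www.yahoo.com
print(parrot_path_to_url("/gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps"))
# gsiftp://www.cs.wisc.edu/condor/doc/usenix_1.92.ps
```

Seen this way, an unmodified tool such as `grep` only ever handles what looks like an ordinary file path.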
Data Storage: NeST
› Storage management software
› Complementary piece of Condor software; adds storage management to the traditional CPU management
› Key features
  • User level
  • Guaranteed storage reservations that allow higher-level scheduling and planning (e.g. Stork)
  • Flexible, extensible protocol layer allows easy integration with existing middleware and applications
  • Easily deployable via glide-in
Gliding-in storage mgmt
› Practical and easily deployable
  • User-level; requires no privilege
  • Package NeST as standard batch jobs
› Result: Managed storage
› General; glide-in works everywhere
[Figure: a home store connected over the Internet to SGE-managed nodes, each running a glided-in NeST]
BirdBath
› SOAP Interfaces to Condor Services
  • LBNL: Workflow, ZSI (soon? LIGO, Laser Interferometer Gravitational-Wave Observatory)
  • IU: Portals
  • UK College of London | Cambridge: .NET
The Idea
Computing power is everywhere; we try to make it usable by anyone.
Thank you!
Condor Project on the Web: http://www.cs.wisc.edu/condor
Email: condor-admin@cs.wisc.edu