Condor-G: A Case in Distributed Job Delegation

Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor

Job Delegation
› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain

Job Delegation in Condor-G Today
[Diagram: delegation chain Condor-G -> Globus GRAM -> Batch System Front-end -> Execute Machine]

Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
  • Mirroring
  • Load-balancing
  • Glide-in schedd
  • Multi-hop grid scheduling

Mirroring
› What it does
  • Jobs mirrored on two Condor-Gs
  • If primary Condor-G crashes, secondary one starts running jobs
  • On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
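
The failover behavior described on the Mirroring slide can be sketched as a toy model. This is only an illustration under stated assumptions: the class `MirroredSubmitPoint`, the helper functions, the host names, and the status strings are all hypothetical and are not Condor-G interfaces. Crash detection is reduced to a boolean flag.

```python
from dataclasses import dataclass, field


@dataclass
class MirroredSubmitPoint:
    """Toy stand-in for one Condor-G in a mirrored pair (hypothetical)."""
    name: str
    alive: bool = True
    job_status: dict = field(default_factory=dict)  # job id -> status string

    def submit(self, job_id):
        self.job_status[job_id] = "idle"


def mirror_submit(primary, secondary, job_id):
    # Every job is recorded on both submit points.
    primary.submit(job_id)
    secondary.submit(job_id)


def active_point(primary, secondary):
    # The secondary runs jobs only while the primary is down.
    return primary if primary.alive else secondary


def recover(primary, secondary):
    # On recovery, the primary copies job status from the secondary.
    primary.alive = True
    primary.job_status = dict(secondary.job_status)


primary = MirroredSubmitPoint("condor-g-1")
secondary = MirroredSubmitPoint("condor-g-2")
mirror_submit(primary, secondary, "job-1")

primary.alive = False                      # simulated crash of the primary
runner = active_point(primary, secondary)  # the secondary takes over
runner.job_status["job-1"] = "running"

recover(primary, secondary)                # primary catches up on restart
print(primary.job_status)
```

A real system would also need a way to detect the crash and to keep both sides from running the same job at once; the sketch leaves both out.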

Mirroring Example
[Diagram: Condor-G 1, Condor-G 2, Matchmaker, Execute Machine]

Load-Balancing
› What it does
  • Front-end Condor-G distributes all jobs among several back-end Condor-Gs
  • Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
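
As a rough illustration of the load-balancing idea, the sketch below has a front-end hand each job to the back-end that currently holds the fewest jobs, while keeping its own status table. The least-loaded policy, the `FrontEnd` class, and the back-end names are assumptions made for the example; the slides do not say which distribution policy is used.

```python
class FrontEnd:
    """Hypothetical front-end that delegates jobs to back-end submit points."""

    def __init__(self, backend_names):
        self.load = {name: 0 for name in backend_names}  # jobs per back-end
        self.placement = {}                               # job id -> back-end
        self.status = {}                                  # job id -> status

    def submit(self, job_id):
        # Pick the back-end with the fewest jobs; ties go to the first listed.
        backend = min(self.load, key=self.load.get)
        self.load[backend] += 1
        self.placement[job_id] = backend
        self.status[job_id] = "delegated"
        return backend

    def update_status(self, job_id, status):
        # Back-ends report changes so users only need to query the front-end.
        self.status[job_id] = status


front_end = FrontEnd(["backend-1", "backend-2", "backend-3"])
for i in range(6):
    front_end.submit(f"job-{i}")
front_end.update_status("job-0", "running")
print(front_end.load)
```

Users only ever talk to `front_end`, which is what preserves the single submit point.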

Load-Balancing Example
[Diagram: Condor-G Front-end delegating to Condor-G Back-end 1, Condor-G Back-end 2, and Condor-G Back-end 3]

Glide-In Schedd
› What it does
  • Drop a Condor-G onto the front-end machine of a cluster
  • Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs

Glide-In Schedd Example
[Diagram: Condor-G -> Glide-In Schedd -> Batch System]

Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
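
The two-step match (first a VO, then a resource inside that VO) might look like the sketch below. Plain dictionaries stand in for whatever matchmaking a real deployment would use; the VO names, resource names, and the `platform` and `cpus` attributes are invented for illustration.

```python
# Hypothetical catalogue: each VO advertises platforms and resources (CPU counts).
VOS = {
    "vo-a": {"platforms": {"linux"}, "resources": {"a-small": 4, "a-large": 64}},
    "vo-b": {"platforms": {"linux", "solaris"}, "resources": {"b-medium": 16}},
}


def match_vo(job):
    """First hop: pick a VO that supports the platform and can fit the job."""
    for name, vo in VOS.items():
        if job["platform"] not in vo["platforms"]:
            continue
        if any(cpus >= job["cpus"] for cpus in vo["resources"].values()):
            return name
    return None


def match_resource(job, vo_name):
    """Second hop: pick the smallest resource in the VO that fits the job."""
    fits = {r: c for r, c in VOS[vo_name]["resources"].items() if c >= job["cpus"]}
    return min(fits, key=fits.get) if fits else None


def schedule(job):
    vo = match_vo(job)
    if vo is None:
        return None
    return vo, match_resource(job, vo)


print(schedule({"platform": "linux", "cpus": 8}))
print(schedule({"platform": "solaris", "cpus": 8}))
```

The first hop needs only coarse knowledge of each VO, and the detailed resource choice stays with the VO.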

Multi-Hop Grid Scheduling Example
[Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler]

Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated

Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
  • Transparency
  • Representation
  • Scheduling Control
  • Active Job Control
  • Revocation
  • Error Handling and Debugging

Transparency
› Full information about job should be available to user
  • Information from full delegation path
  • No manual tracing across multiple machines
› Users need to know what’s happening with their jobs

Representation
› Job state is a vector
› How best to show this to user
  • Summary
      • Current delegation endpoint
      • Job state at endpoint
  • Full information available if desired
      • Series of nested ClassAds?
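
One way to picture the "state is a vector" idea is with nested dictionaries standing in for nested ClassAds. This is a sketch only: the attribute names `Host`, `State`, and `Delegated` and the host names are made up for the example and are not real ClassAd attributes.

```python
# Each level describes the job as seen by one hop; "Delegated" holds the next hop.
job = {
    "Host": "condor-g.example",
    "State": "delegated",
    "Delegated": {
        "Host": "gram.example",
        "State": "delegated",
        "Delegated": {
            "Host": "batch-frontend.example",
            "State": "running",
        },
    },
}


def state_vector(ad):
    """Full view: one (host, state) entry per hop along the delegation path."""
    vector = []
    while ad is not None:
        vector.append((ad["Host"], ad["State"]))
        ad = ad.get("Delegated")
    return vector


def summary(ad):
    """Short view: current delegation endpoint and the job state there."""
    host, state = state_vector(ad)[-1]
    return {"endpoint": host, "state": state}


print(summary(job))
for hop in state_vector(job):
    print(hop)
```

The summary answers the common question quickly, and the full vector remains available when a user needs to see every hop.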

Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
  • Allow limiting of delegation path length?
  • Allow user to specify part or all of delegation path
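
Loop avoidance and a path-length limit can both be checked at the moment a job is about to be handed to the next hop. The sketch below assumes the job carries its delegation path with it; the `delegate` function, the `max_hops` parameter, and the exception name are hypothetical.

```python
class DelegationRefused(Exception):
    """Raised when a hop would create a loop or exceed the allowed length."""


def delegate(path, next_hop, max_hops=None):
    """Return the extended path, or refuse the delegation.

    path lists every host that has held the job, starting with the origin,
    so the number of hops taken so far is len(path) - 1.
    """
    if next_hop in path:
        raise DelegationRefused(f"loop: {next_hop} is already in the path")
    if max_hops is not None and len(path) > max_hops:
        raise DelegationRefused(f"path limit of {max_hops} hops reached")
    return path + [next_hop]


path = ["submit-host"]
path = delegate(path, "vo-schedd", max_hops=2)
path = delegate(path, "cluster-schedd", max_hops=2)
print(path)

try:
    delegate(path, "vo-schedd", max_hops=5)
except DelegationRefused as err:
    print("refused:", err)
```

Letting the user specify part of the path could be layered on top by checking `next_hop` against a user-supplied list before extending the path.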

Active Job Control
› User may request certain actions
  • hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
  • Must forward along delegation path
  • User checks completion later

Active Job Control (cont)
› Endpoint systems may not support actions
  • If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
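
The two Active Job Control slides suggest a simple rule: forward the action along the path and apply it at the furthest hop that supports it, recording the request so the user can check on it later. The sketch below assumes each hop advertises a set of supported actions; the path contents and function names are hypothetical.

```python
# Delegation path, submit point first and endpoint last (illustrative values).
PATH = [
    ("condor-g", {"hold", "suspend", "vacate", "checkpoint"}),
    ("glide-in-schedd", {"hold", "suspend", "vacate"}),
    ("batch-system", {"hold"}),
]


def furthest_supporting(path, action):
    """Walk back from the endpoint to the first hop that supports the action."""
    for name, supported in reversed(path):
        if action in supported:
            return name
    return None


def request_action(pending, job_id, path, action):
    """Record an asynchronous request; the user polls `pending` later."""
    target = furthest_supporting(path, action)
    if target is None:
        return None
    pending[job_id] = {"action": action, "target": target, "done": False}
    return target


pending = {}
print(request_action(pending, "job-1", PATH, "suspend"))
print(pending["job-1"])
```

Applying an action in the middle of the path, as the last bullet proposes, would amount to letting the user name the target hop instead of computing it.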

Revocation
› Leases
  • Lease must be renewed periodically for delegation to remain valid
  • Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
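
A lease can be modeled as an expiry time that the delegating side keeps pushing forward. The sketch below passes the clock in explicitly so the behavior is easy to follow; the `Lease` class is hypothetical, and the 600-second lifetime and 200-second update interval are placeholder numbers, not recommended values.

```python
class Lease:
    """Minimal lease: valid until `expires`, extended by each renewal."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime
        self.expires = now + lifetime

    def is_valid(self, now):
        return now < self.expires

    def renew(self, now):
        # An expired lease cannot be revived; the delegation is already revoked.
        if not self.is_valid(now):
            return False
        self.expires = now + self.lifetime
        return True


lease = Lease(lifetime=600, now=0)

# The delegator renews every 200 seconds while it is healthy.
for t in (200, 400):
    lease.renew(now=t)

# The delegator then fails. The lease outlives short outages...
print(lease.is_valid(now=900))
# ...but a long-term failure lets it lapse, revoking the delegation.
print(lease.is_valid(now=1000))
```

The gap between lifetime and update interval sets how many missed renewals the delegation tolerates before the remote side gives up the job.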

Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
  • Have them everywhere

Current Status
› Done
  • Mirroring
› In Progress
  • Condor-G -> Condor-G delegation
      • User must specify hops
  • Glide-in schedd
      • Set up by hand

Thank You!
› Questions?