Page 1: Computing and Brokering

Computing and Brokering

Grid Middleware 5

David Groep, lecture series 2005-2006

Page 2: Computing and Brokering

Grid Middleware V 2

Outline Classes of computing services

MPP SHMEM Clusters with high-speed interconnect Conveniently parallel jobs Through the hourglass: basic functionalities

Representing computing services resource availability, RunTimeEnvironment Software installation and ESIA Jobs as resources, or ?

Brokering: brokering models (central view, per-user broker, ‘neighbourhood’ P2P brokering); job farming and DAGs (Condor-G, gLite WMS, Nimrod-G, DAGMan); resource selection (ERT, freeCPUs, …?)

Prediction techniques and challenges colocating jobs and data, input & output sandboxes, LogicalFiles

Specialties Supporting interactivity

Page 3: Computing and Brokering

Computing Service

resource variability and the hourglass model

Page 4: Computing and Brokering

Grid Middleware V 4

The Famous Hourglass Model

Page 5: Computing and Brokering

Grid Middleware V 5

Types of systems

Very different models and pricing; suitability depends on application

shared memory MPP systems
vector systems
cluster computing with high-speed interconnect
can perform like MPP, except for the single memory image; e.g. Myrinet, Infiniband

coarse-grained compute clusters
‘conveniently parallel’ applications without IPC; can be built of commodity components

specialty systems
visualisation, systems with dedicated co-processors, …

Page 6: Computing and Brokering

Grid Middleware V 6

Quick, cheap, or both: how to run an app?

Task: how to run your application the fastest, or the most cost-effective (this argument usually wins )

Two choices to speed up an application Use the fastest processor available

but this gives only a small factor over modest (PC) processors

Use many processors, doing many tasks in parallel and since quite fast processors are inexpensive we can think of

using very many processors in parallel but the problem must first be decomposed

“fast, cheap, good – pick any two”

Page 7: Computing and Brokering

Grid Middleware V 7

High Performance – or – High Throughput?

Key question: max. granularity of decomposition:

Have you got one big problem or a bunch of little ones? To what extent can the “problem” be decomposed into sort-of-independent parts (‘grains’) that can all be processed in parallel?

Granularity
fine-grained parallelism – the independent bits are small, need to exchange information, synchronize often
coarse-grained – the problem can be decomposed into large chunks that can be processed independently

Practical limits on the degree of parallelism – how many grains can be processed in parallel?
degree of parallelism v. grain size; grain size limited by the efficiency of the system at synchronising grains
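To make the grain-size trade-off concrete, a minimal illustration (hypothetical numbers, not from the lecture): the useful fraction of each grain's time shrinks as grains get smaller relative to the synchronisation cost.

def parallel_efficiency(grain_seconds, sync_seconds):
    # fraction of a grain's time spent on useful work rather than on synchronisation
    return grain_seconds / (grain_seconds + sync_seconds)

print(parallel_efficiency(10.0, 0.1))   # coarse grains: ~0.99
print(parallel_efficiency(0.01, 0.1))   # fine grains:   ~0.09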

Page 8: Computing and Brokering

Grid Middleware V 8

High Performance – v. – High Throughput?

fine-grained problems need a high performance system that enables rapid synchronization between the bits that can be

processed in parallel and runs the bits that are difficult to parallelize as fast as possible

coarse-grained problems can use a high throughput system that maximizes the number of parts processed per minute

High Throughput Systems use a large number of inexpensive processors, inexpensively interconnected

High Performance Systems use a smaller number of more expensive processors expensively interconnected

Page 9: Computing and Brokering

Grid Middleware V 9

High Performance – v. – High Throughput?

There is nothing fundamental here – it is just a question of financial trade-offs like:

how much more expensive is a “fast” computer than a bunch of slower ones?

how much is it worth to get the answer more quickly? how much investment is necessary to improve the degree of

parallelization of the algorithm?

But the target is moving - Since the cost chasm first opened between fast and slower computers

12-15 years ago an enormous effort has gone into finding parallelism in “big” problems

Inexorably decreasing computer costs and de-regulation of the wide area network infrastructure have opened the door to ever larger computing facilities –

clusters fabrics (inter)national grids

demanding ever-greater degrees of parallelism

Page 10: Computing and Brokering

Grid Middleware V 10

But the fact is:

Graphic: Network of Workstations, Berkeley IEEE Micro, Feb, 1995, Thomas E. Anderson, David E. Culler, David A. Patterson

‘the food chain has been reversed’, and supercomputer vendors are struggling to make a living.

Page 11: Computing and Brokering

Grid Middleware V 11

Using these systems

As clusters and capability systems are both ‘expensive’ (i.e. not on your desktop), they are resources that need to be scheduled

interface for scheduled access is a batch queue job submit, cancel, status, suspend sometimes: checkpoint-restart in OS, e.g. on SGI IRIX allocate #processors

(and amount of memory, these may be linked!) as part of the job request

systems usually also have smaller interactive partition not intended for running production jobs …

Page 12: Computing and Brokering

Grid Middleware V 12

Cluster batch system model

Page 13: Computing and Brokering

Grid Middleware V 13

Some batch systems

Batch systems and schedulers: Torque (OpenPBS, PBS Pro), Sun Grid Engine (that’s not a Grid), Condor, LoadLeveler, Load Sharing Facility (LSF)

Dedicated schedulers: MAUI can drive scheduling for Torque/PBS, SGE, LSF, … support advanced scheduling features, like:

reservation, fair-shares, accounts/banking, QoS

head node or UI system can usually be used for test jobs

Page 14: Computing and Brokering

Grid Middleware V 14

Torque/PBS job description

# PBS batch job script
# resource requests: 36 h wall-clock time, 30 h CPU time, 1 GB virtual memory, queue 'qlong'
#PBS -l walltime=36:00:00
#PBS -l cput=30:00:00
#PBS -l vmem=1gb
#PBS -q qlong

# Executing user job
UTCDATE=`date -u '+%Y%m%d%H%M%SZ'`
echo "Execution started on $UTCDATE"
echo "*****"
printenv
date
sleep 3
date
id
hostname

Page 15: Computing and Brokering

Grid Middleware V 15

PBS queue

bosui:tmp:1010$ qstat -an1|head -10

tbn20.nikhef.nl:

Req'd Req'd Elap

Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time

-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----

823302.tbn20.nikhef. biome034 qlong STDIN 20253 1 -- -- 60:00 R 20:58 node15-11

824289.tbn20.nikhef. biome034 qlong STDIN 6775 1 -- -- 60:00 R 15:25 node15-5

824372.tbn20.nikhef. biome034 qlong STDIN 10495 1 -- -- 60:00 R 15:10 node16-21

824373.tbn20.nikhef. biome034 qlong STDIN 3422 1 -- -- 60:00 R 14:40 node16-32

...

827388.tbn20.nikhef. lhcb031 qlong STDIN -- 1 -- -- 60:00 Q -- --

827389.tbn20.nikhef. lhcb031 qlong STDIN -- 1 -- -- 60:00 Q -- --

827390.tbn20.nikhef. lhcb002 qlong STDIN -- 1 -- -- 60:00 Q -- --

Page 16: Computing and Brokering

Grid Middleware V 16

Example: Condor – clusters of idle workstations

[Diagram: a Condor pool. The Central Manager runs the master, collector, negotiator, schedd and startd daemons; each Desktop runs a master, schedd and startd; each Cluster Node runs a master and startd. Arrows indicate ClassAd communication pathways and spawned processes.]

The Condor Project, Miron Livny et al. University of Wisconsin, Madison. See http://www.cs.wisc.edu/condor/

Page 17: Computing and Brokering

Grid Middleware V 17

Condor example

Write a submit file:
Executable = dowork
Input = dowork.in
Output = dowork.out
Arguments = 1 alpha beta
Universe = vanilla
Log = dowork.log
Queue

Give it to Condor: condor_submit <submit-file>

Watch it run: condor_q

Files: on shared fs (in a cluster at least; for other options see later)

From: Alan Roy, IO Access in Condor and Grid, UW Madison. See http://www.cs.wisc.edu/condor/

Page 18: Computing and Brokering

Grid Middleware V 18

Matching jobs to resources

For ‘homogeneous’ clusters: mainly policy-based
FIFO
credential-based policy
fair-share
queue wait time
banks & accounts
QoS specific

For heterogeneous clusters (like condor pools) matchmaking based on resource & job characteristics see later in grid matchmaking

Page 19: Computing and Brokering

Grid Middleware V 19

Example: scheduling policies - MAUI

RMTYPE[0]              PBS
RMHOST[0]              tbn20.nikhef.nl
...
NODEACCESSPOLICY       SHARED
NODEAVAILABILITYPOLICY DEDICATED:PROCS
NODELOADPOLICY         ADJUSTPROCS

FEATUREPROCSPEEDHEADER xps
BACKFILLPOLICY         ON
BACKFILLTYPE           FIRSTFIT
NODEALLOCATIONPOLICY   FASTEST

FSPOLICY               DEDICATEDPES
FSDEPTH                24
FSINTERVAL             24:00:00
FSDECAY                0.99

GROUPCFG[users]    FSTARGET=1  PRIORITY=10   MAXPROC=50
GROUPCFG[dteam]    FSTARGET=2  PRIORITY=5000 MAXPROC=32
GROUPCFG[alice]    FSTARGET=9  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
GROUPCFG[alicesgm] FSTARGET=1  PRIORITY=100  MAXPROC=200 QDEF=lhcalice
GROUPCFG[atlas]    FSTARGET=54 PRIORITY=100  MAXPROC=200 QDEF=lhcatlas

QOSCFG[lhccms] FSTARGET=1- MAXPROC=10

MAUI is an open source product from ClusterResources, Inc. http://www.supercluster.org/

Page 20: Computing and Brokering

Grid Interface to Computing

Page 21: Computing and Brokering

Grid Middleware V 21

Grid Interfaces to the compute services

Need common interface for job management for test jobs in ‘interactive’ mode: fork

like the interactive partition in clusters and supers batch system interface:

executable arguments #processors memory environment stdin/out/err

Note: the batch system usually doesn’t manage local file space; it assumes the executable is ‘just there’, because of a shared FS or JIT copying of the files to the worker node in the job prologue
local file space management needs to be exposed as part of the grid service, and then implemented separately

Page 22: Computing and Brokering

Grid Middleware V 22

Expectations?

What can a user expect from a compute service? Different user scenarios are all valid:

paratrooper mode: come in, take all your equipment (files, executable &c) with you, do your thing and go away

you’re supposed to clean up, but the system will likely do that for you if you forget. In all cases, garbage left behind is likely to be removed

two-stage ‘prepare’ and ‘run’: extra services to pre-install the environment and later request it; see later on such Community Software Area services

don’t think, but just do it: blindly assume the grid is like your local system; expect all software to be there; expect your results to be retained indefinitely … the realism of this scenario is quite low for ‘production’ grids, as it does not scale to larger numbers of users

Page 23: Computing and Brokering

Grid Middleware V 23

Basic Operations

Direct run/submit useless unless you have an environment already set up

Cancel Signal Suspend Resume List jobs/status Purge (remove garbage)

retrieve output first …

Other useful functions Assess submission (eligibility, ERT) Register & Start (needed if you have sandboxes)
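A sketch of what such a common ‘hourglass waist’ could look like as an abstract interface (illustrative names only, not any particular middleware API):

from abc import ABC, abstractmethod

class ComputeService(ABC):
    # minimal set of operations a grid compute service would expose
    @abstractmethod
    def submit(self, description): ...    # run/submit; returns a job identifier
    @abstractmethod
    def cancel(self, job_id): ...
    @abstractmethod
    def signal(self, job_id, signum): ...
    @abstractmethod
    def suspend(self, job_id): ...
    @abstractmethod
    def resume(self, job_id): ...
    @abstractmethod
    def status(self, job_id=None): ...    # list jobs / per-job status
    @abstractmethod
    def purge(self, job_id): ...          # remove garbage (retrieve output first)
    @abstractmethod
    def assess(self, description): ...    # eligibility, estimated response time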

Page 24: Computing and Brokering

Grid Middleware V 24

A job submission diagram for a single CE

diagram from: DJRA1.1 EGEE Middleware Architecture

Example explicit interactions

Page 25: Computing and Brokering

Grid Middleware V 25

WS-GRAM: Job management using WS-RF

same functionality, modelled with jobs represented as resources

for the input sandbox it leverages an existing (GT4) data movement service: exploit re-useable components

Page 26: Computing and Brokering

Grid Middleware V 26

[Diagram: GT4 WS GRAM architecture. A client delegates credentials and invokes job functions on the GRAM services in the GT4 Java container on the service host(s); GRAM uses the Delegation service, RFT file transfer requests (driving GridFTP to remote storage element(s) via FTP control and data channels), a sudo-based GRAM adapter for local job control through the local scheduler on the compute element where the user job runs, and an SEG for job events.]

diagram from: Carl Kesselman, ISI, ISOC/GFNL masterclass 2006

Page 27: Computing and Brokering

Grid Middleware V 27

GT2 GRAM

Informational & historical: so don’t blame the current Globus for this …

single job submission flow chart

Page 28: Computing and Brokering

Grid Middleware V 28

GRAM GT2 Protocol

RSL over http-g, targeted to a single specific resource

http-g is like https: a modified protocol (only one byte) to specify delegation; no longer interoperable with standard https; delegation is implicit in job submission

RSL, the Resource Specification Language, is used in the GRAM protocol to describe the job; requires some (detailed) knowledge about the target system

Page 29: Computing and Brokering

Grid Middleware V 29

GT2 RSL

&(executable="/bin/echo")

(arguments="12345")

(stdout=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stdout anExtraTag)

(stderr=x-gass-cache://$(GLOBUS_GRAM_JOB_CONTACT)stderr anExtraTag)

(queue=qshort)

Page 30: Computing and Brokering

Grid Middleware V 30

GT2 Job Manager interface

One job manager per running or queued job provide control interface: cancel, suspend, status GASS ‘Grid Access to Secondary Storage’:

stdin, stdout, stderr selected input/output files

listens on a specific TCP port on the Gatekeeper host

Some issues
protocol does not provide two-phase commit: no way to know if the job really made it
too many open ports
one process for each queued job, i.e. too many processes

Workaround: don’t submit a job, but instead a grid-manager process

Page 31: Computing and Brokering

Grid Middleware V 31

Performance ?

Time to submit a basic GRAM job Pre-WS GRAM: < 1 second WS GRAM (in Java): 2 seconds

so GT2-style GRAM did have one significant advantage …

Concurrent jobs Pre-WS GRAM: 300 jobs WS GRAM: 32,000 jobs

Page 32: Computing and Brokering

Grid Middleware V 32

Scaling scheduling

load on the CE head node per VO cannot be controlled with a single common job manager

1. with many VOs, might need to resolve inter-VO resource contention; different VOs may want different policies
2. make the CE ‘pluggable’
3. and provide a common CE interface, irrespective of the site-specific job submission mechanism, as long as the site supports a ‘fork’ JM

Page 33: Computing and Brokering

Grid Middleware V 33

gLite job submission model

site

one grid CEMON per VO or user

Page 34: Computing and Brokering

Grid Middleware V 34

Unicore CE

A different design and concept: eats JSDL (the GGF standard) as a description

describes job requirements in detail

security model
cannot support dynamic VOs yet
grid-wide coordinated UID space (or shared group accounts for all grid users)
no VO management tools (DEISA added a directory for that)
intra-site communication not secured

one big plus: job management uses only 1 port for all communications (including file transfer), and is thus firewall-friendly

Page 35: Computing and Brokering

Grid Middleware V 35

Unicore CE Architecture

[Diagram: UNICORE CE architecture. The UNICOREPro client (or the Arcon client toolkit with its runtime interface) handles user authentication with a user certificate, job preparation/control and plugins, uses the UNICORE site list, and sends AJO/UPL requests over the unsafe Internet (SSL) to the UNICORE Gateway of a UNICORE site (FZJ in the example), behind an optional firewall. The Gateway forwards requests over the safe intranet (TCP) to the Network Job Supervisor (NJS), which consults the IDB and UUDB for user mapping, job incarnation and job scheduling, exchanges jobs, data and status requests with other UNICORE sites, and drives a Target System Interface (TSI) that passes the incarnated job and commands to any cluster management system (the batch subsystem, e.g. an SV1 or blade cluster with its files).]

Graphic from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003

Page 36: Computing and Brokering

Grid Middleware V 36

Unicore programming model

Abstract Job Object Collection of classes representing Grid functions Encoded as Java objects (XML encoding possible)

Where to build AJOs
Pallas client GUI - the user’s view
Client plugins - Grid deployer
Arcon client tool kit - hard core

What can’t the AJO do Application level Meta-computing ???

from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003


Page 37: Computing and Brokering

Grid Middleware V 37

Interfacing to the local system

Incarnation Data Base Maps abstract representation to concrete jobs Includes resource description

Prototype auto-generation from MDS

Target System Interface Perl interface to host platform Very small system specific module for easy porting Current: NQS (several versions), PBS, Loadleveler, UNICOS,

Linux, Solaris, MacOSX, PlayStation-2 Condor: Under development (& probably done by now)

from: Dave Snelling, Fujitsu Labs Europe, “Unicore Technology”, Grid School July 2003


Page 38: Computing and Brokering

Resource Representation

CE attributes, obtaining metrics, GLUE CE

Page 39: Computing and Brokering

Grid Middleware V 39

Describing a CE

Balance between completeness and timeliness Some useful metrics almost impossible to obtain

‘when will this job of mine be finished if I submit now?’ cannot be answered!

depends on system load
need to predict runtime for already running & queued jobs
simultaneous submission in a non-FIFO scheduling model (e.g. fair share, priorities, pre-emption &c)

Page 40: Computing and Brokering

Grid Middleware V 40

GlueCE: a ‘resource description’ viewpoint

From: the GLUE Information Model version 1.2, see document for details

Page 41: Computing and Brokering

Grid Middleware V 41

Through the Glue Schema: Cluster Info

Performance info: SI2k, SF2k
Max wall time, CPU time: seconds; together these determine if a job completes in time

but clusters are not homogeneous
solve at the local end (scale max {CPU,wall} time on each node to the system speed)
CAVEAT: when doing cross-cluster grid-wide scheduling, this can make you choose the wrong resource entirely!

solve (i.e. multiply) at the broker end
but now you need a way to determine on which subcluster your job will run… oops.
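As a minimal sketch of this scaling (hypothetical numbers and function names; GLUE publishes the CPU speed as SI2k and the limits in seconds): express the job's CPU demand relative to a reference CPU and scale it to the local processor speed before comparing it with the published maximum.

def fits_on_ce(job_cpu_ref_seconds, ce_max_cpu_seconds, ce_si2k, reference_si2k=1000):
    # job_cpu_ref_seconds: CPU time the job needs on a reference CPU rated at
    # reference_si2k SpecInt2000; scale it to this cluster's CPU speed first.
    local_cpu_seconds = job_cpu_ref_seconds * reference_si2k / ce_si2k
    return local_cpu_seconds <= ce_max_cpu_seconds

# a 20-hour job (on the 1000-SI2k reference CPU) against a 30-hour limit on a 1500-SI2k cluster
print(fits_on_ce(20 * 3600, ce_max_cpu_seconds=30 * 3600, ce_si2k=1500))   # True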

Page 42: Computing and Brokering

Grid Middleware V 42

Cluster Info: total, free and max JobSlots

FreeJobSlots is the wrong metric to use for scheduling (a good cluster is always 100% full)

these metrics may be VO, user and job dependent
if a cluster has free CPUs, that does not mean that you can use them…
even if there are thousands of waiting jobs, you might get to the front of the queue because of your prio or fair-share

Page 43: Computing and Brokering

Grid Middleware V 43

Cluster info: ERT and WRT

Estimated/worst response time when will my job start to run if I submit now

Impossible to pre-determine in case of simultaneous submissions

The best one can do is estimate

Possible approaches
simulation – good but very, very slow
(“Predicting Job Start Times on Clusters”, Hui Li et al. 2004)
historical comparisons
template approach – need to discover the proper template; look for ‘similar system states’ in the past
learning approach – adapt the estimation algorithm to the actual load and ‘learn’ the best approach

see the many other papers by Hui Li, bundle on Blackboard!
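To make the template idea concrete, a minimal sketch (hypothetical field names, and much cruder than the methods in the cited papers): bucket historical jobs by a template key such as (queue, group, requested CPUs) and predict the wait of a new job as the average wait of past jobs in the same bucket.

from collections import defaultdict

def build_templates(history):
    # history: iterable of dicts with 'queue', 'group', 'ncpu' and 'wait_seconds'
    buckets = defaultdict(list)
    for job in history:
        buckets[(job['queue'], job['group'], job['ncpu'])].append(job['wait_seconds'])
    return {key: sum(waits) / len(waits) for key, waits in buckets.items()}

def estimate_wait(templates, queue, group, ncpu, default=3600):
    # fall back to a fixed guess if no 'similar system state' was seen before
    return templates.get((queue, group, ncpu), default)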

Page 44: Computing and Brokering

Brokering

Page 45: Computing and Brokering

Grid Middleware V 45

Brokering models

All current grid broker systems use global brokering consider all known resources when matching requests brokering takes longer as the system grows

Models Bubble-to-the-top-information-system based

current Condor-G, gLite WMS

Ask the world for bids Unicore Broker

Page 46: Computing and Brokering

Grid Middleware V 46

Some grid brokers

Condor-G uses Condor schedd (matchmaker) to match resources a Condor submitter has a number of backends to talk to

different CEs (GT2, GT4-GRAM, Condor (flocking)) supports DAG workflows schedd is ‘close’ to the user

gLite WMS separation between broker (based on Condor-G) and the UI additional Logging and Bookkeeping (generic, but actually only

used for the WMS) does job-data co-location scheduling

Page 47: Computing and Brokering

Grid Middleware V 47

Grid brokers (contd.)

Nimrod-G parameter sweep engine cycles through static list of resources automatically inspects the job output and uses that to drive

automatic job submission minimisation methods like simulated annealing built in

Unicore broker based on a pricing model asks for bids from resources

no large information system full of useless resources is needed; instead, bids are requested from all resources for every job

shifts, but does nothing to resolve, the info-system explosion
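The bid-based idea in a few lines (a sketch with a hypothetical bid() method, not the actual Unicore broker protocol): every resource is asked to price the job, and the cheapest valid offer wins.

def cheapest_offer(job, resources):
    # ask every resource for a bid; resources that cannot run the job return None
    offers = [(r, r.bid(job)) for r in resources]
    offers = [(r, price) for r, price in offers if price is not None]
    return min(offers, key=lambda pair: pair[1], default=None)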

Page 48: Computing and Brokering

Grid Middleware V 48

Alternative brokering

Alternatives could be ‘P2P-style’ brokering
look in the ‘neighbourhood’ for ‘reasonable’ matches; if none found, give the task to a peer super-scheduler
scheduler only considers ‘close’ resources (has no global knowledge)
job submission pattern may or may not follow the brokering pattern
if it does, it needs recursive delegation for job submission, which opens the door for worms and trojans; trust is not very transitive

(this is not a problem in sharing ‘public’ files, such as in the popular P2P file sharing applications)

Page 49: Computing and Brokering

Grid Middleware V 49

Broker detailed example: gLite WMS

Job services in the gLite architecture Computing Element (just discussed) Workload Management System (brokering, submission control) Accounting (for EGEE comes in two flavours: site or user) Job Provenance (to be done) Package management (to be done)

continuous matchmaking solution persistent list of pending jobs, waiting for matching resources

WMS task akin to what the resources did in Unicore

Page 50: Computing and Brokering

Grid Middleware V 50
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

[Diagram: WMS architecture overview. The UI talks to the Resource Broker node (Workload Manager, WM): its Network Server receives requests; the Workload Manager, with the Match Maker, Task Queue and Information Supermarket, selects resources using the Information System and Replica Catalog; and the Job Adapter and Job Controller/CondorG perform the job submission to the grid interface of a Computing Element (LCG or gLite flavour) and its LRMS. Job status is recorded in Logging & Bookkeeping, and data resides on Storage Elements.]

Page 51: Computing and Brokering

Grid Middleware V 51
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Page 52: Computing and Brokering

Grid Middleware V 52
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Job management requests (submission, cancellation) expressed via a Job Description Language (JDL)

Page 53: Computing and Brokering

Grid Middleware V 53
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Keeps submission requests. Requests are kept for a while if no matching resources are available.

Page 54: Computing and Brokering

Grid Middleware V 54
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Repository of resource information available to the matchmaker. Updated via notifications and/or active polling on sources.

Page 55: Computing and Brokering

Grid Middleware V 55
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources.

Page 56: Computing and Brokering

Grid Middleware V 56
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

WMS’s Architecture

Performs the actual job submission and monitoring.

Page 57: Computing and Brokering

Grid Middleware V 57
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

The Information Supermarket

• ISM represents one of the most notable improvements in the WM as inherited from the EU DataGrid (EDG) project
– decoupling between the collection of information concerning resources and its use allows flexible application of different policies
• The ISM basically consists of a repository of resource information that is available in read-only mode to the matchmaking engine
– the update is the result of: the arrival of notifications, active polling of resources, or some arbitrary combination of both
– can be configured so that certain notifications can trigger the matchmaking engine
this improves the modularity of the software and supports the implementation of lazy scheduling policies

Page 58: Computing and Brokering

Grid Middleware V 58
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

The Task Queue

• The Task Queue represents the second most notable improvement in the WM internal design
– possibility to keep a submission request for a while if no resources are immediately available that match the job requirements (technique used by the AliEn and Condor systems)
• Non-matching requests
– will be retried either periodically (eager scheduling approach)
– or as soon as notifications of available resources appear in the ISM (lazy scheduling approach)
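A rough sketch of the two retry strategies (hypothetical task-queue and ISM interfaces, not the real WM code):

import time

def eager_loop(task_queue, ism, match, interval=60):
    # eager scheduling: periodically retry all pending requests against the ISM content
    while task_queue:
        for job in list(task_queue):
            ce = match(job, ism.resources())
            if ce is not None:
                task_queue.remove(job)
                ce.submit(job)
        time.sleep(interval)

def on_ism_notification(task_queue, resource, match):
    # lazy scheduling: when a resource notification arrives, try only the pending jobs against it
    for job in list(task_queue):
        if match(job, [resource]) is not None:
            task_queue.remove(job)
            resource.submit(job)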

Page 59: Computing and Brokering

Grid Middleware V 59
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Job Logging & Bookkeeping

• L&B tracks jobs in terms of events
– important points of job life: submission, finding a matching CE, starting execution etc.
• gathered from various WMS components
• The events are passed to a physically close component of the L&B infrastructure
– locallogger (avoids network problems)
• stores them in a local disk file and takes over the responsibility to deliver them further
• The destination of an event is one of the bookkeeping servers
– assigned statically to a job upon its submission
– processes the incoming events to give a higher-level view on the job states
• Submitted, Running, Done
• various recorded attributes: JDL, destination CE name, job exit code
• Retrieval of both job states and raw events is available via legacy (EDG) and WS querying interfaces
– user may also register for receiving notifications on particular job state changes

Page 60: Computing and Brokering

Grid Middleware V 61
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Job Preparation

• Information to be specified when a job has to be submitted:

• Job characteristics

• Job requirements and preferences on the computing resources• Also including software dependencies

• Job data requirements

• Information specified using a Job Description Language (JDL)

• Based upon Condor’s CLASSified ADvertisement language (ClassAd)• Fully extensible language

• A ClassAd

•Constructed with the classad construction operator []

•It is a sequence of attributes separated by semi-colons.

•An attribute is a pair (key, value), where value can be a Boolean, an Integer, a list of strings, …

• <attribute> = <value>;

Page 61: Computing and Brokering

Grid Middleware V 62

ClassAds: matchmaking

Brokering based on ‘advertisements’ by both jobs and resources

Page 62: Computing and Brokering

Grid Middleware V 63

ClassAds matchmaking

Allow customers to provide requirements and preferences on the resources

Allow resources to impose constraints on the customers they wish to service.

Separation between matchmaking and claiming.

The matchmaker is stateless and thus can scale to very large systems without complex failure recovery.
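A toy illustration of symmetric matchmaking (plain Python callables standing in for ClassAd expressions; a sketch, not the Condor implementation): a match requires each party's Requirements to hold against the other's ad, and the job's Rank orders the acceptable resources.

def matches(job_ad, resource_ad):
    # symmetric match: both Requirements expressions must be satisfied
    return job_ad['Requirements'](resource_ad) and resource_ad['Requirements'](job_ad)

def best_resource(job_ad, resource_ads):
    acceptable = [r for r in resource_ads if matches(job_ad, r)]
    return max(acceptable, key=lambda r: job_ad['Rank'](r), default=None)

job = {'Requirements': lambda r: r['OpSys'] == 'LINUX' and r['FreeCPUs'] >= 4,
       'Rank':         lambda r: r['FreeCPUs']}
pool = [{'OpSys': 'LINUX',   'FreeCPUs': 12, 'Requirements': lambda j: True},
        {'OpSys': 'SOLARIS', 'FreeCPUs': 40, 'Requirements': lambda j: True}]
print(best_resource(job, pool))   # picks the Linux machine with 12 free CPUs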

Page 63: Computing and Brokering

Grid Middleware V 64
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Job Description Language (JDL)

• The supported attributes are grouped into two categories:

• Job Attributes • Define the job itself

• Resources• Taken into account by the Workload Manager for carrying out the

matchmaking algorithm (to choose the “best” resource where to submit the job)

• Computing Resource•Used to build expressions of Requirements and/or Rank attributes by the user

•Have to be prefixed with “other.”

• Data and Storage resources •Input data to process, Storage Element (SE) where to store output data, protocols spoken by application when accessing SEs

Page 64: Computing and Brokering

Grid Middleware V 65
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

JDL: Relevant Attributes (1)• JobType

• Normal (simple, sequential job), DAG, Interactive, MPICH, Checkpointable

• Executable (mandatory)• The command name

• Arguments (optional)• Job command line arguments

• StdInput, StdOutput, StdError (optional)• Standard input/output/error of the job

• Environment• List of environment settings

• InputSandbox (optional)• List of files on the UI’s local disk needed by the job for running

• The listed files will be staged automatically to the remote resource

• OutputSandbox (optional)• List of files, generated by the job, which have to be retrieved

Page 65: Computing and Brokering

Grid Middleware V 66
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

JDL: Relevant Attributes (2)• Requirements

• Job requirements on computing resources

• Specified using attributes of resources published in the Information Service

• If not specified, default value defined in UI configuration file is considered• Default: other.GlueCEStateStatus == "Production" (the resource has to be able

to accept jobs and dispatch them on WNs)

• Rank

• Expresses preference (how to rank resources that have already met the Requirements expression)

• Specified using attributes of resources published in the Information Service

• If not specified, default value defined in the UI configuration file is considered

• Default: - other.GlueCEStateEstimatedResponseTime (the lowest estimated traversal time)

• Default: other.GlueCEStateFreeCPUs (the highest number of free CPUs) for parallel jobs (see later)

Page 66: Computing and Brokering

Grid Middleware V 67
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

JDL: Relevant Attributes (3)

• InputData
• Refers to data used as input by the job: these data are published in the Replica Catalog and stored in the Storage Elements
• LFNs and/or GUIDs

• InputSandbox
• Executable, files etc. that are sent to the job

• DataAccessProtocol (mandatory if InputData has been specified)

• The protocol or the list of protocols which the application is able to speak with for accessing InputData on a given Storage Element

• OutputSE• The Uniform Resource Identifier of the output Storage Element• RB uses it to choose a Computing Element that is compatible with

the job and is close to Storage Element

Details in Data Management lecture

Page 67: Computing and Brokering

Grid Middleware V 68
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Example of JDL File

[
JobType = "Normal";
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"/home/mydir/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = {"lfn:/glite/myvo/mylfn"};
DataAccessProtocol = "gridftp";
Requirements = other.GlueHostOperatingSystemNameOpSys == "LINUX"
               && other.GlueCEStateFreeCPUs >= 4;
Rank = other.GlueCEPolicyMaxCPUTime;
]

Page 68: Computing and Brokering

Grid Middleware V 69
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (1/9)

Submitted: job is entered by the user to the User Interface but not yet transferred to Network Server for processing

Page 69: Computing and Brokering

Grid Middleware V 70
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (2/9)

Waiting: job accepted by NS and waiting for Workload Manager processing or being processed by WMHelper modules.

Page 70: Computing and Brokering

Grid Middleware V 71
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (3/9)

Ready: job processed by WM and its Helper modules (CE found) but not yet transferred to the CE (local batch system queue) via JC and CondorC. This state does not exist for a DAG, as the DAG itself is not subject to matchmaking (its nodes are) but is passed directly to DAGMan.

Page 71: Computing and Brokering

Grid Middleware V 72
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (4/9)

Scheduled: job waiting in the queue on the CE. This state also does not exist for a DAG, as it is not directly sent to a CE (its nodes are).

Page 72: Computing and Brokering

Grid Middleware V 73
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (5/9)

Running: job is running. For a DAG this means that DAGMan has started processing it.

Page 73: Computing and Brokering

Grid Middleware V 74
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (6/9)

Done: job exited or considered to be in a terminal state by CondorC (e.g., submission to CE has failed in an unrecoverable way).

Page 74: Computing and Brokering

Grid Middleware V 75
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (7/9)

Aborted: job processing was aborted by WMS (waiting in the WM queue or CE for too long, over-use of quotas, expiration of user credentials).

Page 75: Computing and Brokering

Grid Middleware V 76
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (8/9)

Cancelled: job has been successfully canceled on user request.

Page 76: Computing and Brokering

Grid Middleware V 77
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Jobs State Machine (9/9)

Cleared: output sandbox was transferred to the user, or removed due to the timeout.

Page 77: Computing and Brokering

Grid Middleware V 78
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Directed Acyclic Graphs (DAGs)

• A DAG represents a set of jobs:

Nodes = Jobs Edges = Dependencies

[Diagram: example DAG with nodes A, B, C, D and E connected by dependency edges.]

Page 78: Computing and Brokering

Grid Middleware V 79
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

DAG: JDL Structure

• Type = "DAG"  (Mandatory)
• VirtualOrganisation = "yourVO"  (Mandatory)
• Max_Nodes_Running = int > 0  (Optional)
• MyProxyServer = "…"  (Optional)
• Requirements = "…"  (Optional)
• Rank = "…"  (Optional)
• InputSandbox = … more later!  (Optional)
• OutSandbox = "…"  (Optional)
• Nodes = nodeX … more later!  (Mandatory)
• Dependencies = … more later!  (Mandatory)

Page 79: Computing and Brokering

Grid Middleware V 80
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Attribute: Nodes

The Nodes attribute is the core of the DAG description;….

Nodes = [ nodefilename1 = [...]

nodefilename2 = […]

…….

dependencies = …

]

Nodefilename1 = [ file = “foo.jdl”; ]

Nodefilename2 =

[ file = “/home/vardizzo/test.jdl”;

retry = 2; ]

Nodefilename1 = [

description = [ JobType = “Normal”;

Executable = “abc.exe”;

Arguments = “1 2 3”;

OutputSandbox = […];

InputSandbox = […];

….. ]

retry = 2;

]

Page 80: Computing and Brokering

Grid Middleware V 81
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Attribute: Dependencies

• It is a list of lists representing the dependencies between the nodes of the DAG.

….

Nodes = [ nodefilename1 = [...]

nodefilename2 = […]

…….

dependencies = …

]

dependencies =

{nodefilename1, nodefilename2}

{ nodefilename1, nodefilename2 }

{ { nodefilename1, nodefilename2 }, nodefilename3 }

{ { { nodefilename1, nodefilename2}, nodefilename3}, nodefilename4 }

MANDATORY : YES!

dependencies = {};
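To see what the dependencies mean operationally, a small sketch (hypothetical node names; DAGMan itself does much more): a node becomes runnable once all the nodes it depends on are done.

def runnable(nodes, deps, done):
    # nodes whose dependencies are all completed and which have not yet run
    return [n for n in nodes
            if n not in done and all(d in done for d in deps.get(n, []))]

deps = {'nodeB': ['nodeA'], 'nodeC': ['nodeA'], 'nodeD': ['nodeB', 'nodeC']}
nodes = ['nodeA', 'nodeB', 'nodeC', 'nodeD']
done = set()
while len(done) < len(nodes):
    batch = runnable(nodes, deps, done)
    print('can run in parallel:', batch)   # nodeA, then nodeB+nodeC, then nodeD
    done.update(batch)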

Page 81: Computing and Brokering

Grid Middleware V 82
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

InputSandbox & Inheritance

• All nodes inherit the value of the attributes from the one specified for the DAG.
• Nodes without any InputSandbox values have to contain in their description an empty list: InputSandbox = { };

Type = "DAG"
VirtualOrganisation = "yourVO"
Max_Nodes_Running = int > 0
MyProxyServer = "…"
Requirements = "…"
Rank = "…"
InputSandbox = { };
Nodes = [ nodefilename = [];
  …..
  dependencies = …;
];

NodeA = [
  description = [
    JobType = "Normal";
    Executable = "abc.exe";
    OutputSandbox = {"myout.txt"};
    InputSandbox = {
      "/home/vardizzo/myfile.txt",
      root.InputSandbox; };
  ]
]

Page 82: Computing and Brokering

Grid Middleware V 83
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Interactive Jobs

• It is a job whose standard streams are forwarded to the submitting client.

• The DISPLAY environment variable has to be set correctly, because an X window may be opened.

[Diagram: the UI runs a listener process with an X window or a standard no-GUI console, connected to the interactive job on the WN.]

Page 83: Computing and Brokering

Grid Middleware V 84
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Interactive Jobs

• Specified setting JobType = “Interactive” in JDL

• When an interactive job is executed, a window for the stdin, stdout, stderr streams is opened

• Possibility to send the stdin to the job
• Possibility to have the stderr and stdout of the job while it is running
• Possibility to start a window for the standard streams for a previously submitted interactive job with the command glite-job-attach

Page 84: Computing and Brokering

Grid Middleware V 85
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

Interactive Jobs: JDL Structure

• Type = “job”;• JobType = “interactive”;• Executable = “…”;• Argument = “…”; • ListenerPort = “int > 0”;• OutputSandbox = “”;• Requirements = “…”;• Rank = “”;

Mandatory

Mandatory

Mandatory

Optional

Optional

Optional

Mandatory

Mandatory

gLite Commands:

glite-job-attach [options] <jobID>

Page 85: Computing and Brokering

Grid Middleware V 86
Slide from the EGEE Project, see www.eu-egee.org and www.glite.org

Enabling Grids for E-sciencE

INFSO-RI-508833

gLite Commands

• JDL Submission: glite-job-submit –o guidfile jobCheck.jdl

• JDL Status: glite-job-status –i guidfile

• JDL Output: glite-job-output –i guidfile

• Get Latest Job State: glite-job-get-chkpt –o statefile –i guidfile

• Submit a JDL from a state: glite-job-submit -chkpt statefile –o guidfile jobCheck.jdl

• See also [options] by typing --help after the commands.

Page 86: Computing and Brokering

Economy based brokering

Unicore

Page 87: Computing and Brokering

Grid Middleware V 88

Unicore Broker

Distributed brokering
Sites know the state of their resources best
Sites can conceal their resource configuration
Different VOs need different selection algorithms

Preferred site sets will vary
Different applications have different performance characteristics

Uses an economic model: cost-based evaluation, like in the real world

broker developed by the University of Manchester, UK

Unicore is an open source product coordinated by the Unicore Forum, see www.unicore.org

Page 88: Computing and Brokering

Grid Middleware V 89

Unicore Broker

graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005

Page 89: Computing and Brokering

Grid Middleware V 90

Job description ontology

graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005

Page 90: Computing and Brokering

Grid Middleware V 91

Unicore Broker hierarchy

graphic from: Brokering in Unicore, John Brooke and Donal Fellows, UoM, Unicore Summit October 2005

Page 91: Computing and Brokering

Grid Middleware V 92

Unicore Broker in the system

[Diagram: the Unicore client (or an alternative client) connects over the network to the Unicore Gateway; behind it sit the Network Job Supervisor, the Resource Broker, the resource and user databases, and an external authorization service, with target systems such as Condor, NQS and GT. Multiple firewall layouts are possible.]

UoM Broker Architecture, from: Dave Snelling, Fujitsu Labs Europe, Unicore Technology, Grid School July 2003

Page 92: Computing and Brokering

Grid Middleware V 93

Unicore Broker

[Diagram: UoM broker architecture. The Broker is hosted in the NJS (with IDB and UUDB look-ups for configuration and for verifying delegated identities) and is seen from the outside world as a ComputeResource. It delegates to application-domain expert code (ExpertBroker, e.g. DWDLMExpert, ICMExpert), to a Grid-architecture-specific LocalResourceChecker (UnicoreRC or GlobusRC, the latter looking up static and dynamic resources via MDS/GRAM/TSI) for the local resource check, and to a Translator (SimpleTranslator, or OntologicalTranslator with an Ontology) for resource-domain translation appropriate to the target Globus resource schema; untranslatable resources are passed back to the Unicore resource checker. A TicketManager returns a signed ticket (contract) after looking up the signing identity, and the broker gets back a set of resource filters plus the set of untranslatable resources. The key distinguishes UNICORE components, the EUROGRID broker, Globus components, the GRIP broker, and inheritance relations.]

UoM Broker Architecture, from: Dave Snelling, Fujitsu Labs Europe, Unicore Technology, Grid School July 2003

Page 93: Computing and Brokering

VO Schedulers

Pilot jobs and overlay networks

Page 94: Computing and Brokering

Grid Middleware V 95

Towards a multi-scheduler world

expressing scheduling policies (priorities and usage shares) for multiple complex VOs in a single scheduler is proving difficult
resource owner does not want to know about VO internal structure, but assign the VO just a single share
VO wants to set fine-grained intra-VO shares
local schedulers (such as MAUI) are not geared towards non-admin defined policies: there is no ‘grid-aware’ scheduler

possible solutions
develop an interface to manage the local scheduling policies
stack the schedulers, i.e. introduce a per-VO scheduler

Page 95: Computing and Brokering

Grid Middleware V 96

traditional job submission models

There are three ‘traditional’ deployment models:

1. direct per-user job submission to a ‘gatekeeper’ running with root privileges (GT2GK, today’s model)

2. a non-privileged dedicated CE or scheduler, accepting authenticated user jobs and submitting to the batch system

3. on-demand CE, submitted by VO or user to a front-end system, that then receives user jobs and submits these to the batch system

in order to not have complex schedulers run as root, a sudo-like component, glexec, is introduced

Submitting user’s identity & job

VO identity/process or VO placeholder manager

Site managed and trusted services

Page 96: Computing and Brokering

Grid Middleware V 97

What is glexec?

glexec

a thin layer to change unix credentials

based on grid identity and attribute information

you can think of it as: ‘a replacement for the gatekeeper’ ‘a griddy version of Apache’s suexec(8)’ ‘a program wrapper around LCAS, LCMAPS or GUMS’

Page 97: Computing and Brokering

Grid Middleware V 98

What glexec does

Input
1. a certificate chain, possibly with VOMS extensions
2. a user program name & arguments to run

Action
1. check authorization (LCAS, GUMS)
• user credentials, proper VOMS attributes, executable name
2. acquire local credentials: a local (uid, gid) pair, possibly across a cluster
3. enforce the local credential on the process

Result
1. user program is run with the mapped credentials
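The Input/Action/Result flow above, expressed as a conceptual sketch (the real glexec is a setuid C program built on LCAS/LCMAPS; the authorize and map_to_local callables here are purely illustrative):

import os

def glexec_like(cert_chain, argv, authorize, map_to_local):
    # conceptual flow only; changing uid/gid requires running with privilege
    if not authorize(cert_chain, argv[0]):       # 1. authorization decision
        raise PermissionError('authorization denied')
    uid, gid = map_to_local(cert_chain)          # 2. acquire the local (uid, gid) pair
    os.setgid(gid)                               # 3. enforce the local credential
    os.setuid(uid)
    os.execvp(argv[0], argv)                     # run the user program with the mapped credentials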

Page 98: Computing and Brokering

Grid Middleware V 99

Jobs submission today (GT2 GK)

Deployment model without glexec (‘mode GT2GK’) jobs are submitted with an identity (hopefully the original user’s one)

to the site Gatekeeper running as root one job manager is run for each user on the head node with the user’s (uid,gid) as set by the gatekeeper

Page 99: Computing and Brokering

Grid Middleware V 100

Glexec in a one-per-site mode

Deployment model with a CE ‘service’ running in a non-privileged account or with a CE run (maybe one per VO) on a single front-end per site

examples
• CREAM
• GT4 WS-GRAM

Page 100: Computing and Brokering

Grid Middleware V 101

glexec with an on-demand CE

Deployment model with on-demand CEs (‘mode on-demand CEs’)
The user or the VO start their own scheduler on a front-end system
All these on-demand schedulers are resource-limited by a site-managed master scheduler (via a GT2GK or Condor)
the on-demand schedulers eat jobs for their VO or user and set the proper identity before the job gets submitted to the site batch system

Page 101: Computing and Brokering

Grid Middleware V 102

glexec with on-demand CE

Deployment model with on-demand CEs (‘mode on-demand for VOs’ with native interface)

Page 102: Computing and Brokering

Grid Middleware V 103

Traditional model summary

In all three models, the submission of the user job to the batch system is done with the original job owner’s mapped (uid, gid) identity

grid-to-local identity mapping is done only on the front-end system (CE)

batch system accounting provides per-user records inspection of Unix process on worker nodes are per-user

Page 103: Computing and Brokering

Grid Middleware V 104

Pilot jobs

A pilot job is basically just a small script which downloads a real job from a repository once it starts executing; hence it is not committed to any particular task, or perhaps even a particular user, until that point. If there are no tasks waiting, the pilot job exits immediately. In principle, if the time limits on the queue are long enough, a single pilot job could run more than one real job, although I'm not sure if anyone is actually doing that at the moment.
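A minimal sketch of the pattern (the fetch_task() function is hypothetical, standing in for whatever task repository the VO framework provides):

import subprocess, time

def run_pilot(fetch_task, wallclock_budget=6 * 3600):
    # fetch_task() returns the next payload as a command line (list of strings),
    # or None when no task is waiting.
    deadline = time.time() + wallclock_budget
    while time.time() < deadline:
        task = fetch_task()
        if task is None:
            break                            # nothing waiting: exit immediately
        subprocess.run(task, check=False)    # run the real job as this pilot's payload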

Page 104: Computing and Brokering

Grid Middleware V 105

From the VO side

Background: some large VOs develop and prefer to use their own scheduling & job management framework

late binding of jobs to job slots first establishing an overlay network subsequent scheduling and starting of jobs is faster

hide details between the various grid flavours implement VO priorities full use of allocated slots, up to max wall clock time

but these VOs will need their ‘own’ scheduler some of them do have it already, but then others don’t and most never will, so the use of pilots should not be the

only option (or even the default) way of things

Page 105: Computing and Brokering

Grid Middleware V 106

Situation today

‘VO-type’ pilot jobs
submitted as if they were regular user jobs
run with the identity of one or a few individuals from a VO
obtain jobs from any user (within the VO) and run that payload on the allocated WN
the site ‘sees’ only a single identity, not the true owner of the workload

no effective mechanisms today can deny this use model

note that this does not apply to the regular ‘per-user’ pilot jobs

Page 106: Computing and Brokering

Grid Middleware V 107

Issues

Issues that drove the original glexec-on-WN scenario:

VO supplied pilot jobs must observe and honour the same policies the site uses for normal job execution

preferably without requiring alternate mechanisms to describe the policies be continuously in synch with the site policies

again, ‘per-user’ pilot jobs satisfy these rules by design

Page 107: Computing and Brokering

Grid Middleware V 108

Pieces of a solution

Three pieces that go together:

glexec on the worker-node deployment: a mechanism for pilot jobs to submit themselves and their payload to site policy control
give incontrovertible evidence of who is running on which node at any one time: needed at selected sites for regulatory compliance; the ability to nail individual culprits by requiring the VO to present a valid delegation from each user
the VO should want this: to keep user jobs from interfering with each other; honouring site ban lists for individuals may help in not banning the entire VO in case of an incident

Page 108: Computing and Brokering

Grid Middleware V 109

Pieces of the solution

glexec on the worker-node deployment: a way to keep the pilot-job submitters to their word

system-level auditing of the pilot jobs, to see they are not doing the user job by themselves or evading the controls
relies on advanced auditing features of the OS (from EAL3+)
but auditing data on the WN is useful for incident investigations only

internal accounting should be done by the VO
the regular site accounting mechanisms are via the batch system, and will see the pilot job identity
the site can easily show from those logs the usage by the pilot job (for which wall-clock-time accounting should be used)
making a site do accounting based on glexec jobs is non-standard, requires effort, may be intrusive, and messes up normal accounting
‘a VO capable of writing their own submission framework ought to be able to write their own accounting system as well …’

Page 109: Computing and Brokering

Grid Middleware V 110

glexec on WN deployment model

VO submits a pilot job to the batch system the VO ‘pilot job’ submitter is responsible for the pilot behaviour

this might be a specific role in the VO, or a locally registered ‘badged’ user at each site

Pilot job is subject to normal site policies for jobs Pilot job obtains the true user job,

and presents the user credentials and the job (executable name) to the site (glexec) to request a decision

Submitting user’s identity & job

VO identity/process or VO placeholder manager

Site managed and trusted services

Page 110: Computing and Brokering

Grid Middleware V 111

VO pilot job on the node

Note: proper uid change by Gatekeeper or Condor-C/BLAHP on head node should remain default

• On success: the site will set the uid/gid of the new user’s job
• On failure: glexec will return with an error, and the pilot job can terminate or obtain another job

Page 111: Computing and Brokering

Grid Middleware V 112

What is needed in this model?

1. Agreement on the three ingredients
• deployment of glexec on the WN to do setuid
• detailed auditing on the head node and the WNs
• site accounting done at the VO (i.e. pilot job) level
2. glexec
• needs feature enhancements compared to the single-CE version
• see status of glexec on the next slide
3. Inspection of the audit logs
• detect abuse patterns in the system-call auditing logs
4. Grid job logging capabilities
• glexec will log (uid, user/system/real time usage) via syslog
• credential mapping framework (LCMAPS) will log the mapping (also via syslog)
• centralisation of glexec mappings, e.g. via JobRepository

Page 112: Computing and Brokering

Grid Middleware V 113

Notes and alternatives

glexec, like any site-managed ingress point, trusts the submitter not to have mixed up the user credentials and the jobs
we trust the RB today to do this correctly, and RBs are unknown quantities to the receiving site

a longer-term solution is to have the job request signed by the submitting user
since the description is modified by intermediaries (brokers), the signature can only be over the original content, and the site would have to evaluate whether the job received matches the signed JDL
or use an inheritance model for the job description, and treat the job like you would, e.g., a CIM entity

Page 113: Computing and Brokering

Grid Middleware V 114

Summary

Realize that today some VOs are doing ‘pilot’ jobs
there is no effective enforcement against this
some sites may just not care yet, whilst others have hard requirements on auditability and regulatory compliance

The glexec-on-WN model gives the VOs tools to comply with site requirements
at least it makes things ‘better’ than they are today
but you, as a site, will miss that warm and fuzzy feeling of trust

glexec-on-WN is always replaceable by the ‘null operation’ for sites that don’t care or don’t want it
but realize this is just one of the glexec deployment models