GC3: Grid Computing Competence Center
Grid Computing
Riccardo Murri, Sergio Maffioletti
GC3: Grid Computing Competence Center, University of Zurich
Oct. 31, 2012
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Reference: Foster, I., Kesselman, C., “Computational Grids”, in: “The Grid: a Blueprint for a New Computing Infrastructure”, Morgan Kaufmann, 1999.
LSCI2012 Grid Computing Oct. 31, 2012
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Dependable: a predictable and sustained level of performance.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Consistent: standardized interfaces/APIs.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Pervasive: always available in a large variety of environments.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Inexpensive: affordable and convenient (relative to the delivered performance).
HPC vs HTC
High-Performance Computing
Minimize the turnaround time of computational applications.

High-Throughput Computing
Maximize the amount of “work” done in a given time.
HPC vs HTC
Given a fixed set of resources, the two objectives are equivalent.

Fast wide-area networking and cheaper computing hardware (and now even cheaper IaaS provisioning) enable the aggregation of geographically distributed compute resources into HTC facilities.
Grid Computing (the reality)
An aggregation of computational clusters for the execution of a large number of batch jobs.
HTC system stakeholders
owners: Set the policy for the usage of a resource.

sysadmins: Take technical and operational decisions.

developers (application writers): Interact with the HTC system via its APIs and adapt applications to its conceptual model.

users (customers): Need applications to work; the HTC system must be flexible enough to adapt to their requirements.
Example: the UZH “Schrödinger” cluster
Informatik Dienste UZH: owner and admin.

Research groups: developers and users.
Example: SMSCG
Collaborative project to share HPC clusters among higher-education institutions.

http://www.smscg.ch/
Example: SMSCG
owners: Participating institutions.

sysadmins: Personnel of the participating institutions.

developers: Support groups (e.g., GC3).

users: Research groups.
Resource Management layers
user: Run tasks on the distributed infrastructure; application-level scheduling and resource selection.

grid:

– Provide a uniform/abstract view of computational resources and of authentication/authorization credentials.

– Allocate resources to tasks.

fabric: Actual execution of tasks and storage of data, according to owner policies.
Scheduling on a cluster
[Diagram: compute nodes 1…N connected to a batch system server over a local 1 Gb/s Ethernet network; users reach the server from the internet via “ssh username@server”.]

All job requests are sent to a central server.

The server decides which job runs where and when.
where: resource allocation model
Computing resources are defined by a structured set of attributes (key=value pairs).

SGE’s default configuration defines 53 such attributes: number of available cores/CPUs, total size of RAM/swap, current load average, etc.

A node is eligible to run a job iff the node attributes are compatible with the job’s resource requirements.

(Other batch systems are similar.)
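The eligibility rule above can be sketched in a few lines of Python. The attribute names are purely illustrative, not actual SGE complex names:

```python
# A node is eligible for a job iff every resource requirement is met by
# the corresponding node attribute (attribute names are made up here).
def eligible(node_attrs, job_reqs):
    return all(node_attrs.get(key, 0) >= value
               for key, value in job_reqs.items())

node = {"num_cores": 16, "mem_total_mb": 65536, "scratch_gb": 500}

assert eligible(node, {"num_cores": 4, "mem_total_mb": 8192})
assert not eligible(node, {"num_cores": 32})
```

Real batch systems also match non-numeric attributes (e.g., string equality on the CPU architecture), which this numeric-only sketch omits.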
when: scheduling policy
There are usually more jobs than the system can handle concurrently. (Even more so in the high-throughput computing cases we are interested in.)

So, job requests must be prioritized.

Prioritization of requests is a matter of the local scheduling policy.

(And this differs greatly among batch systems and among sites.)
(Hidden) assumptions
1. The scheduling server has complete knowledge of the nodes.

Local networks have low latency (average RTT 0.3 ms on a 1 Gb/s Ethernet) and the status information fits into a small packet.

2. The server has complete control over the nodes.

So a compute node will immediately execute a job when told to by the server.
How does this extend to Grid computing?
By definition of a Grid. . .
1. It’s geographically distributed:

– High-latency links (hence: resource status may be out of date).

– The network is easily partitioned, or nodes disconnected (hence: resources are dynamic in nature; they may come and go).

2. Resources come from multiple control domains:

– Prioritization is a matter of local policy!

– AuthZ and other issues may prevent execution at all.
The Globus/ARC model
[Diagram: three independent clusters, each with its own batch system server and compute nodes on a local 1 Gb/s Ethernet network; a client host on the internet interacts with them via arcsub/arcstat/arcget.]

An infrastructure is a set of independent clusters.

The client host selects one cluster and submits a job there, then periodically polls for status information.
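The client-side submit-and-poll cycle can be sketched as follows. The function and its helpers are stand-ins that mirror the arcsub/arcstat/arcget workflow, not the real ARC client API:

```python
# Hypothetical submit-and-poll loop: pick one cluster, submit the job
# there, then poll its status until it reaches a terminal state.
def run_job(cluster, job_name, poll_status):
    job_id = f"{cluster}/{job_name}"          # arcsub: submit to one site
    for state in poll_status:                 # arcstat: periodic polling
        if state in ("FINISHED", "FAILED"):
            return job_id, state              # arcget: fetch the output
    return job_id, "LOST"                     # status stream dried up

states = iter(["QUEUED", "RUNNING", "RUNNING", "FINISHED"])
job_id, final = run_job("cluster-a.example.org", "job42", states)
assert final == "FINISHED"
```

Note that each submitted job adds one such polling loop on the client, which is exactly why client load grows linearly with the number of jobs.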
Issues in the Globus/ARC approach?
1. How to select a “good” execution site?

2. How to gather the required information from the sites?

3. Based on the same information, two clients can arrive at the same scheduling decision, so they can flood a site with jobs.

4. Actual job start times are unpredictable, as scheduling is ultimately a local decision.

5. Client polling increases the load linearly with the number of jobs.
The MDS InfoSystem, I
[Diagram: each cluster runs a GRIS next to its batch system server; the GRISes register with a global index, which the arcsub/arcstat/arcget client queries over the internet.]

The Globus Monitoring and Discovery Service.
The MDS InfoSystem, II
A specialized service provides information about site status.

Each site reports its information to a local database (the GRIS).

Each GRIS registers with a global indexing service (the GIIS).

The client talks to the GIIS to get the list of sites, and then queries each GRIS for the site-specific information.
LDAP
The protocol underlying MDS is called LDAP.
LDAP allows remote read/write access to a distributed database (an “X.500 directory system”), with a flexible authentication and authorization scheme.

LDAP assumes that most accesses are reads, so LDAP servers are optimized for infrequent writes.

Reference: A. S. Tanenbaum, “Computer Networks”, ISBN 978-0-13-212695-3.
LDAP schemas
Entries in an LDAP database are sets of key/value pairs. (Keys need not be unique; equivalently, a key can map to multiple values.)

An LDAP schema specifies the names of the allowed keys and the type of the corresponding values.

Each entry declares a set of schemas it conforms to; every attribute in an LDAP entry must be defined in some schema.
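The schema-conformance rule can be modeled in a few lines of Python. The schema contents below are invented for illustration; they are not the real MDS/NorduGrid definitions:

```python
# Toy schema check: every attribute of an entry must be defined by one of
# the schemas (object classes) the entry declares.
SCHEMAS = {
    "Mds": {"Mds-validfrom", "Mds-validto"},
    "nordugrid-queue": {"nordugrid-queue-name", "nordugrid-queue-status"},
}

def conforms(entry):
    allowed = set().union(*(SCHEMAS[s] for s in entry["objectClass"]))
    return all(key in allowed for key in entry if key != "objectClass")

entry = {
    "objectClass": ["Mds", "nordugrid-queue"],
    "nordugrid-queue-name": ["all.q"],    # keys may map to several values
    "nordugrid-queue-status": ["active"],
}
assert conforms(entry)
assert not conforms({"objectClass": ["Mds"], "undeclared-attr": ["x"]})
```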
X.500/LDAP Directories
Entries are organized into a tree structure (the DIT). (So LDAP queries return subtrees, as opposed to the flat sets of rows of an RDBMS query.)

Each entry is uniquely identified by a “Distinguished Name” (DN). The DN of an entry is formed by prepending one or more attribute values to the parent entry’s DN.

LDAP accesses may result in referrals, which redirect the client to access another entry at a remote server.
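DN construction can be illustrated with the ARC queue entry shown on a later slide: each level prepends its own attribute=value pair (the RDN) to the parent’s DN, so components read from most to least specific.

```python
# Build a child DN by prepending its RDN (attribute=value) to the
# parent entry's DN.
def child_dn(parent_dn, rdn):
    return f"{rdn},{parent_dn}"

root = "o=grid"
vo = child_dn(root, "Mds-Vo-name=Switzerland")
cluster = child_dn(vo, "nordugrid-cluster-name=gordias.unige.ch")
queue = child_dn(cluster, "nordugrid-queue-name=all.q")

assert queue == ("nordugrid-queue-name=all.q,"
                 "nordugrid-cluster-name=gordias.unige.ch,"
                 "Mds-Vo-name=Switzerland,o=grid")
```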
Example
Example: this is how the ARC MDS represents information about a cluster queue in LDAP.

# all.q, gordias.unige.ch, Switzerland, grid
dn: nordugrid-queue-name=all.q,nordugrid-cluster-name=gordias.unige.ch,Mds-Vo-name=Switzerland,o=grid
objectClass: Mds
objectClass: nordugrid-queue
nordugrid-queue-name: all.q
nordugrid-queue-status: active
nordugrid-queue-comment: sge default queue
nordugrid-queue-homogeneity: TRUE
nordugrid-queue-nodecpu: Xeon 2800 MHz
nordugrid-queue-nodememory: 2048
nordugrid-queue-architecture: x86_64
nordugrid-queue-opsys: ScientificLinux-5.5
nordugrid-queue-totalcpus: 224
nordugrid-queue-gridqueued: 0
nordugrid-queue-prelrmsqueued: 4
nordugrid-queue-gridrunning: 0
nordugrid-queue-running: 0
nordugrid-queue-maxrunning: 136
nordugrid-queue-localqueued: 4
Based on the information in the previous slide, can you decide whether to send a job that requires 200 GB of scratch space to this cluster?
The MDS “cluster” model
Exactly: there’s no way to make that decision.

ARC (and Globus) only provide CPU/RAM/architecture information.

In addition, they assume clusters are organized into homogeneous queues, which might not be the case.

This is just an example of a more general problem: what information do we need about a remote cluster, and how do we represent it?

Reference: B. Konya, “The ARC Information System”, http://www.nordugrid.org/documents/arc_infosys.pdf
MDS performance
The complete LDAP tree of the SMSCG grid counts over 28,000 entries.

A full dump of the SMSCG infosystem tree takes about 30 seconds.

So:

1. The information is several seconds old (on average).

2. It does not make sense to refresh the information more often than this.

By default, ARC refreshes the infosystem every 60 seconds.
Supported and unsupported use cases, I
Pre-installed application: OK
The ARC InfoSys has a generic mechanism (“run-time environments”) for providing “installed software” information.

So you can select only sites that provide the application you want.

(And the information provided in the InfoSys is usually enough to make a good guess about the overall performance.)
Supported and unsupported use cases, II
Java/Python/Ruby/R script
These require brokering based on a large number of support libraries/packages: if the dependencies are not there, the program cannot run.

In theory, run-time environments solve this issue too. In practice, the available information is always less than would be useful, and providing all the information that would be useful is too much work.

Ultimately, it relies on convention and “good practice.”
Supported and unsupported use cases, III
Code benchmarking: FAIL
Benchmarking code requires running all cases under the same conditions.

There is just no way to guarantee that with the “federation of clusters” model: e.g., the site batch scheduler may run two jobs on compute nodes with different CPUs.
ARC: Pros and Cons
Pros:
– Very simple to deploy, easy to extend.

– System and code complexity still manageable.

Cons:

– The burden of scaling up is on each site, but not all sites have the required know-how/resources.

– The complexity of managing large collections of jobs is on the client-software side.

– The fixed infosystem schema does not accommodate certain use cases.
The gLite approach
[Diagram: three clusters, each with a site-BDII next to its batch system server; glite-job-submit sends jobs to a WMS, which consults a top-BDII that aggregates the site-BDIIs.]

Reference: http://web.infn.it/gLiteWMS/index.php/component/content/article/51-generaltechdocs/57-archoverview
The gLite WMS
Server-centric architecture:
– All jobs are submitted to the WMS server.

– The WMS inspects the Grid status, makes the scheduling decision, and submits jobs to sites.

– The WMS also monitors jobs as they run, and fetches back the output when a job is done.

– The client polls the WMS and, when a job is done, gets the output from the WMS.
The gLite infosystem, I
Hierarchical architecture, based on LDAP:
1. Each “Grid element” runs its own LDAP server (resource BDII) providing information on the software status and capabilities.

2. A site-BDII polls the local element servers and aggregates the information into a site view.

3. A top-BDII polls the site-BDIIs and aggregates the information into a global view.

Each step requires processing the collected entries and creating a new LDIF tree based on the new information.
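The aggregation cascade can be sketched with plain dictionaries standing in for LDIF trees; the entry names below are invented for illustration:

```python
# Each aggregation level polls the servers below it and merges their
# entries into a new view of its own.
def aggregate(views):
    merged = {}
    for view in views:
        merged.update(view)
    return merged

site_a = aggregate([{"cn=ce01,o=site-a": "Production"},
                    {"cn=se01,o=site-a": "Production"}])   # site-BDII
site_b = aggregate([{"cn=ce01,o=site-b": "Draining"}])     # site-BDII
top = aggregate([site_a, site_b])                          # top-BDII

assert len(top) == 3
assert top["cn=ce01,o=site-b"] == "Draining"
```

The real top-BDII must perform this merge over hundreds of sites within a fixed polling interval, which is where the scaling pressure comes from.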
The gLite infosystem, II
The CREAM computing element at CSCS has 43 entries in its resource BDII. Listing them takes 0.5 seconds.

The CSCS site-BDII has 191 entries. Listing them takes 0.5 seconds.

The CERN top-BDII has over 180,000 entries, collected from about 200 sites. Listing them all takes over 2 minutes.
The GLUE schema
The gLite information system represents system status based on the GLUE schema. (Version 1.3 is currently being phased out in favor of v. 2.0.)

A comprehensive and complex schema:

1. aimed at interoperability among Grid providers;

2. attempts to cover every feature supported by the major middlewares and production infrastructures (esp. HEP);

3. makes heavy use of cross-entry references.

It can accommodate the “scratch space” example, but there’s still no way of figuring out whether (and how) a job can request 16 cores on the same physical node.
Comparison with ARC’s InfoSystem
ARC stores information about jobs and users in the infosystem:

– relatively large number of entries in the ARC infosys;

– cannot scale to a large high-throughput infrastructure.

Conversely, gLite’s BDII puts a large load on the top-BDII:

– it must handle the load from all clients;

– it must be able to poll all site-BDIIs in a fixed time;

– so it must cope with network timeouts, slow sites, etc.
gLite WMS: Pros and Cons
Pros:
– Global view of the Grid, so it could make better meta-scheduling decisions.

– Can support aggregate job types (e.g., workflows).

– Aggregates the monitoring operations, which reduces the load on sites.

Cons:

– The WMS is a single point of failure.

– Clients still use a polling mechanism, so the WMS must sustain the load.

– An extremely complex piece of software, running on a single machine: very hard to scale up!

– Relies on an infosystem to make sensible decisions (the fixed schema/representation problem).
Condor
[Diagram: three clusters, each running a condor_resource daemon; a condor_agent (started via condor_submit) and the resources advertise to a central condor_master, which matches them.]
Condor overview
Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between the Agents’ requests and the Resources’ offerings.

An Agent sends its computational job directly to the matching Resource (“claiming”).

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.
Matchmaking
Agents and Resources publish “advertisements”: requests and offers in the “ClassAd” format (an enriched key=value format).

There is no prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model. Similarly, a client may impose arbitrary constraints.
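The symmetric matchmaking idea can be sketched as follows. This is a toy model: real ClassAds use a dedicated expression language, not Python lambdas, and the attribute names here are merely illustrative:

```python
# Two ads match when each side's Requirements expression evaluates to
# True against the other side's attributes.
def matches(request, offer):
    return (request["Requirements"](offer) and
            offer["Requirements"](request))

offer = {
    "OpSys": "LINUX", "Memory": 4096,
    "Requirements": lambda ad: ad.get("ImageSize", 0) < 2048,
}
request = {
    "ImageSize": 512,
    "Requirements": lambda ad: ad.get("OpSys") == "LINUX"
                               and ad.get("Memory", 0) >= 1024,
}

assert matches(request, offer)
```

Because neither side is bound to a schema, an offer can advertise any attribute, and a request can constrain on any attribute, whether or not the other side defines it.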
Condor: Pros and Cons
Pros:
– Fully distributed job submission and monitoring system.

– Flexible (schemaless) mechanism for specifying job requirements and enforcing resource-usage policies.

– The modular architecture and separate claiming protocol allow running many types of “workloads” (e.g., computational jobs, VMs, . . . ).

Cons:

– Schemalessness moves the correct specification of requirements and policies up into the organizational layer.
All these job management systems are based on a push model (you send the job to an execution cluster).

Is there, conversely, a pull model?
Scaling issues in the “push” model
– Scheduling entities need a comprehensive view of the system.

– The information needs frequent updates.

– There is no complete information on local scheduling policies.

– The meta-scheduler is yet another scheduling layer on top of the local cluster scheduling.
Condor glide-ins
Idea: run Condor daemons as regular grid/batch jobs and build a “personal resource pool” out of the allocated job slots.

1. The user starts a “personal” Condor match-making daemon.

2. The user submits Condor resource daemons as batch jobs to foreign systems.

3. As the submitted resource daemons start, they contact the “personal” Condor master and form a resource pool.

4. The user can now run jobs through the “personal” Condor pool.
Pilot jobs
The general name for this approach is pilot (or “late-binding”) jobs.

A pilot job is a job that lands on a resource and then pulls in a task (the payload) to execute from a central server.
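The pull loop a pilot runs after landing on a worker node can be sketched as follows; the task queue here is a local stand-in for a real pilot framework’s central task server:

```python
from collections import deque

# A pilot repeatedly pulls the next payload from the central queue,
# executes it, and reports the result back, until no tasks remain.
def pilot(task_queue, results):
    while task_queue:
        task = task_queue.popleft()   # pull the next payload
        results.append(task())        # execute it; report the result

tasks = deque([lambda: 1 + 1, lambda: 2 * 3])
results = []
pilot(tasks, results)
assert results == [2, 6]
```

Note how the binding of task to node happens only at execution time: if a pilot dies, its unfinished task simply stays in (or returns to) the queue for another pilot to pull.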
Advantages of late binding
Information about the execution node can be gathered at the moment the pilot job starts executing.

– More accurate description: the pilot can look for needed features (e.g., installed software/libraries).

– Hence, no need for an information system.

– Load balancing is done according to local system responsiveness.

Easier failure handling: if a pilot job fails, assign the same task to another pilot.

– Users only need to monitor the task success ratio!
Disadvantages
Outbound network connectivity is required.

A heterogeneous workload (e.g., a mixture of parallel and sequential tasks) leads to a combinatorial explosion of the number of submitted jobs.

Multi-user pilot-job frameworks introduce another (possibly opaque) step in the traceability chain.
References, I
– Foster, I. (2002): “What is the Grid? A Three-Point Checklist.”, Grid Today, July 20, 2002. http://dlib.cs.odu.edu/WhatIsTheGrid.pdf

– Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356. DOI: 10.1002/cpe.938

– Livny, M., Raman, R., “High-Throughput Resource Management”, in: Foster, I. and Kesselman, C. (eds.), “The Grid: Blueprint for a New Computing Infrastructure”, Morgan Kaufmann, 1999.
References, II
– Konya, B. (2010): “The ARC Information System”, http://www.nordugrid.org/documents/arc_infosys.pdf

– Cecchi, M. et al. (2009): “The gLite Workload Management System”, Lecture Notes in Computer Science, 5529/2009, pp. 256–268.

– Andreozzi, S. et al. (2009): “GLUE Specification v. 2.0”, http://www.ogf.org/documents/GFD.147.pdf