GC3: Grid Computing Competence Center
Grid Computing
Riccardo Murri, Sergio Maffioletti
GC3: Grid Computing Competence Center, University of Zurich
Oct. 31, 2012
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Reference: Foster, I., Kesselman, C., “Computational Grids”, in: “The Grid: a Blueprint for a New Computing Infrastructure”, Morgan Kaufmann, 1999.
LSCI2012 Grid Computing Oct. 31, 2012
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Dependable: a predictable and sustained level of performance.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Consistent: standardized interfaces/APIs.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Pervasive: always available in a large variety of environments.
Grid Computing (the vision)
“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Inexpensive: affordable and convenient (relative to the delivered performance).
HPC vs HTC
High-Performance Computing
Minimize the turnaround time of computational applications.

High-Throughput Computing
Maximize the amount of “work” done in a given time.
HPC vs HTC
Given a fixed set of resources, the two objectives are equivalent.

Fast wide-area networking and cheaper computing hardware (and now even cheaper IaaS provisioning) enable the aggregation of geographically distributed compute resources into HTC facilities.
Grid Computing (the reality)
An aggregation of computational clusters for the execution of a large number of batch jobs.
HTC system stakeholders
owners: Set the policy for the usage of a resource.

sysadmins: Take technical and operational decisions.

developers (application writers): Interact with the HTC system via its APIs and adapt applications to its conceptual model.

users (customers): Need applications to work; the HTC system must be flexible enough to adapt to their requirements.
Example: the UZH “Schrödinger” cluster
Informatik Dienste UZH: owner and admin.

Research groups: developers and users.
Example: SMSCG
Collaborative project to share HPC clusters among higher-education institutions.

http://www.smscg.ch/
Example: SMSCG
owners: Participating institutions.

sysadmins: Personnel of the participating institutions.

developers: Support groups (e.g., GC3).

users: Research groups.
Resource Management layers
user: Run tasks on the distributed infrastructure; application-level scheduling and resource selection.

grid:

– Provide a uniform/abstract view of computational resources and of authentication/authorization credentials.

– Allocate resources to tasks.

fabric: Actual execution of tasks and storage of data, according to owner policies.
Scheduling on a cluster
[Diagram: compute nodes 1…N connected to a batch system server over a local 1 Gb/s Ethernet network; users reach the server from the internet via “ssh username@server”.]

All job requests are sent to a central server.

The server decides which job runs where and when.
where: resource allocation model
Computing resources are defined by a structured set of attributes (key=value pairs).

SGE’s default configuration defines 53 such attributes: number of available cores/CPUs, total size of RAM/swap, current load average, etc.

A node is eligible to run a job iff the node attributes are compatible with the job’s resource requirements.

(Other batch systems are similar.)
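The eligibility rule above can be sketched in a few lines of Python. The attribute names are purely illustrative, not actual SGE complex names:

```python
# A node is eligible for a job iff every resource requirement is met by
# the corresponding node attribute (attribute names are made up here).
def eligible(node_attrs, job_reqs):
    return all(node_attrs.get(key, 0) >= value
               for key, value in job_reqs.items())

node = {"num_cores": 16, "mem_total_mb": 65536, "scratch_gb": 500}

assert eligible(node, {"num_cores": 4, "mem_total_mb": 8192})
assert not eligible(node, {"num_cores": 32})
```

Real batch systems also match non-numeric attributes (e.g., string equality on the CPU architecture), which this numeric-only sketch omits.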
when: scheduling policy
There are usually more jobs than the system can handle concurrently. (Even more so in the high-throughput computing cases we are interested in.)

So, job requests must be prioritized.

Prioritization of requests is a matter of the local scheduling policy.

(And this differs greatly among batch systems and among sites.)
(Hidden) assumptions
1. The scheduling server has complete knowledge of the nodes.

Local networks have low latency (average RTT 0.3 ms on a 1 Gb/s Ethernet) and the status information fits into a small packet.

2. The server has complete control over the nodes.

So a compute node will immediately execute a job when told to by the server.
How does this extend to Grid computing?
By definition of a Grid. . .
1. It’s geographically distributed:

– High-latency links (hence: resource status may be out of date).

– The network is easily partitioned, or nodes disconnected (hence: resources are dynamic in nature; they may come and go).

2. Resources come from multiple control domains:

– Prioritization is a matter of local policy!

– AuthZ and other issues may prevent execution at all.
The Globus/ARC model
[Diagram: three independent clusters, each with its own batch system server and compute nodes on a local 1 Gb/s Ethernet network; a client host on the internet interacts with them via arcsub/arcstat/arcget.]

An infrastructure is a set of independent clusters.

The client host selects one cluster and submits a job there, then periodically polls for status information.
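The client-side submit-and-poll cycle can be sketched as follows. The function and its helpers are stand-ins that mirror the arcsub/arcstat/arcget workflow, not the real ARC client API:

```python
# Hypothetical submit-and-poll loop: pick one cluster, submit the job
# there, then poll its status until it reaches a terminal state.
def run_job(cluster, job_name, poll_status):
    job_id = f"{cluster}/{job_name}"          # arcsub: submit to one site
    for state in poll_status:                 # arcstat: periodic polling
        if state in ("FINISHED", "FAILED"):
            return job_id, state              # arcget: fetch the output
    return job_id, "LOST"                     # status stream dried up

states = iter(["QUEUED", "RUNNING", "RUNNING", "FINISHED"])
job_id, final = run_job("cluster-a.example.org", "job42", states)
assert final == "FINISHED"
```

Note that each submitted job adds one such polling loop on the client, which is exactly why client load grows linearly with the number of jobs.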
Issues in the Globus/ARC approach?
1. How to select a “good” execution site?

2. How to gather the required information from the sites?

3. Based on the same information, two clients can arrive at the same scheduling decision, so they can flood a site with jobs.

4. Actual job start times are unpredictable, as scheduling is ultimately a local decision.

5. Client polling increases the load linearly with the number of jobs.
The MDS InfoSystem, I
[Diagram: each cluster runs a GRIS next to its batch system server; the GRISes register with a global index, which the arcsub/arcstat/arcget client queries over the internet.]

The Globus Monitoring and Discovery Service.
The MDS InfoSystem, II
A specialized service provides information about site status.

Each site reports its information to a local database (the GRIS).

Each GRIS registers with a global indexing service (the GIIS).

The client talks to the GIIS to get the list of sites, and then queries each GRIS for the site-specific information.
LDAP
The protocol underlying MDS is called LDAP.
LDAP allows remote read/write access to a distributed database (an “X.500 directory system”), with a flexible authentication and authorization scheme.

LDAP assumes that most accesses are reads, so LDAP servers are optimized for infrequent writes.

Reference: A. S. Tanenbaum, “Computer Networks”, ISBN 978-0-13-212695-3.
LDAP schemas
Entries in an LDAP database are sets of key/value pairs. (Keys need not be unique; equivalently, a key can map to multiple values.)

An LDAP schema specifies the names of the allowed keys and the type of the corresponding values.

Each entry declares a set of schemas it conforms to; every attribute in an LDAP entry must be defined in some schema.
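The schema-conformance rule can be modeled in a few lines of Python. The schema contents below are invented for illustration; they are not the real MDS/NorduGrid definitions:

```python
# Toy schema check: every attribute of an entry must be defined by one of
# the schemas (object classes) the entry declares.
SCHEMAS = {
    "Mds": {"Mds-validfrom", "Mds-validto"},
    "nordugrid-queue": {"nordugrid-queue-name", "nordugrid-queue-status"},
}

def conforms(entry):
    allowed = set().union(*(SCHEMAS[s] for s in entry["objectClass"]))
    return all(key in allowed for key in entry if key != "objectClass")

entry = {
    "objectClass": ["Mds", "nordugrid-queue"],
    "nordugrid-queue-name": ["all.q"],    # keys may map to several values
    "nordugrid-queue-status": ["active"],
}
assert conforms(entry)
assert not conforms({"objectClass": ["Mds"], "undeclared-attr": ["x"]})
```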
X.500/LDAP Directories
Entries are organized into a tree structure (the DIT). (So LDAP queries return subtrees, as opposed to the flat sets of rows of an RDBMS query.)

Each entry is uniquely identified by a “Distinguished Name” (DN). The DN of an entry is formed by prepending one or more attribute values to the parent entry’s DN.

LDAP accesses may result in referrals, which redirect the client to access another entry at a remote server.
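DN construction can be illustrated with the ARC queue entry shown on a later slide: each level prepends its own attribute=value pair (the RDN) to the parent’s DN, so components read from most to least specific.

```python
# Build a child DN by prepending its RDN (attribute=value) to the
# parent entry's DN.
def child_dn(parent_dn, rdn):
    return f"{rdn},{parent_dn}"

root = "o=grid"
vo = child_dn(root, "Mds-Vo-name=Switzerland")
cluster = child_dn(vo, "nordugrid-cluster-name=gordias.unige.ch")
queue = child_dn(cluster, "nordugrid-queue-name=all.q")

assert queue == ("nordugrid-queue-name=all.q,"
                 "nordugrid-cluster-name=gordias.unige.ch,"
                 "Mds-Vo-name=Switzerland,o=grid")
```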
Example
Example: this is how the ARC MDS represents information about a cluster queue in LDAP.

# all.q, gordias.unige.ch, Switzerland, grid
dn: nordugrid-queue-name=all.q,nordugrid-cluster-name=gordias.unige.ch,Mds-Vo-name=Switzerland,o=grid
objectClass: Mds
objectClass: nordugrid-queue
nordugrid-queue-name: all.q
nordugrid-queue-status: active
nordugrid-queue-comment: sge default queue
nordugrid-queue-homogeneity: TRUE
nordugrid-queue-nodecpu: Xeon 2800 MHz
nordugrid-queue-nodememory: 2048
nordugrid-queue-architecture: x86_64
nordugrid-queue-opsys: ScientificLinux-5.5
nordugrid-queue-totalcpus: 224
nordugrid-queue-gridqueued: 0
nordugrid-queue-prelrmsqueued: 4
nordugrid-queue-gridrunning: 0
nordugrid-queue-running: 0
nordugrid-queue-maxrunning: 136
nordugrid-queue-localqueued: 4
Based on the information in the previous slide, can you decide whether to send a job that requires 200 GB of scratch space to this cluster?
The MDS “cluster” model
Exactly: there’s no way to make that decision.

ARC (and Globus) only provide CPU/RAM/architecture information.

In addition, they assume clusters are organized into homogeneous queues, which might not be the case.

This is just an example of a more general problem: what information do we need about a remote cluster, and how do we represent it?

Reference: B. Konya, “The ARC Information System”, http://www.nordugrid.org/documents/arc_infosys.pdf
MDS performance
The complete LDAP tree of the SMSCG grid counts over 28,000 entries.

A full dump of the SMSCG infosystem tree takes about 30 seconds.

So:

1. The information is several seconds old (on average).

2. It does not make sense to refresh the information more often than this.

By default, ARC refreshes the infosystem every 60 seconds.
Supported and unsupported use cases, I
Pre-installed application: OK
The ARC InfoSys has a generic mechanism (“run-time environments”) for providing “installed software” information.

So you can select only sites that provide the application you want.

(And the information provided in the InfoSys is usually enough to make a good guess about the overall performance.)
Supported and unsupported use cases, II
Java/Python/Ruby/R script
These require brokering based on a large number of support libraries/packages: if the dependencies are not there, the program cannot run.

In theory, run-time environments solve this issue too. In practice, the available information is always less than would be useful, and providing all the information that would be useful is too much work.

Ultimately, it relies on convention and “good practice.”
Supported and unsupported use cases, III
Code benchmarking: FAIL
Benchmarking code requires running all cases under the same conditions.

There is just no way to guarantee that with the “federation of clusters” model: e.g., the site batch scheduler may run two jobs on compute nodes with different CPUs.
ARC: Pros and Cons
Pros:
– Very simple to deploy, easy to extend.

– System and code complexity still manageable.

Cons:

– The burden of scaling up is on each site, but not all sites have the required know-how/resources.

– The complexity of managing large collections of jobs is on the client-software side.

– The fixed infosystem schema does not accommodate certain use cases.
The gLite approach
[Diagram: three clusters, each with a site-BDII next to its batch system server; glite-job-submit sends jobs to a WMS, which consults a top-BDII that aggregates the site-BDIIs.]

Reference: http://web.infn.it/gLiteWMS/index.php/component/content/article/51-generaltechdocs/57-archoverview
The gLite WMS
Server-centric architecture:
– All jobs are submitted to the WMS server.

– The WMS inspects the Grid status, makes the scheduling decision, and submits jobs to sites.

– The WMS also monitors jobs as they run, and fetches back the output when a job is done.

– The client polls the WMS and, when a job is done, gets the output from the WMS.
The gLite infosystem, I
Hierarchical architecture, based on LDAP:
1. Each “Grid element” runs its own LDAP server (resource BDII) providing information on the software status and capabilities.

2. A site-BDII polls the local element servers and aggregates the information into a site view.

3. A top-BDII polls the site-BDIIs and aggregates the information into a global view.

Each step requires processing the collected entries and creating a new LDIF tree based on the new information.
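The aggregation cascade can be sketched with plain dictionaries standing in for LDIF trees; the entry names below are invented for illustration:

```python
# Each aggregation level polls the servers below it and merges their
# entries into a new view of its own.
def aggregate(views):
    merged = {}
    for view in views:
        merged.update(view)
    return merged

site_a = aggregate([{"cn=ce01,o=site-a": "Production"},
                    {"cn=se01,o=site-a": "Production"}])   # site-BDII
site_b = aggregate([{"cn=ce01,o=site-b": "Draining"}])     # site-BDII
top = aggregate([site_a, site_b])                          # top-BDII

assert len(top) == 3
assert top["cn=ce01,o=site-b"] == "Draining"
```

The real top-BDII must perform this merge over hundreds of sites within a fixed polling interval, which is where the scaling pressure comes from.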
The gLite infosystem, II
The CREAM computing element at CSCS has 43 entries in its resource BDII. Listing them takes 0.5 seconds.

The CSCS site-BDII has 191 entries. Listing them takes 0.5 seconds.

The CERN top-BDII has over 180,000 entries, collected from about 200 sites. Listing them all takes over 2 minutes.
The GLUE schema
The gLite information system represents system status based on the GLUE schema. (Version 1.3 is currently being phased out in favor of v. 2.0.)

A comprehensive and complex schema:

1. aimed at interoperability among Grid providers;

2. attempts to cover every feature supported by the major middlewares and production infrastructures (esp. HEP);

3. makes heavy use of cross-entry references.

It can accommodate the “scratch space” example, but there’s still no way of figuring out whether (and how) a job can request 16 cores on the same physical node.
Comparison with ARC’s InfoSystem
ARC stores information about jobs and users in the infosystem:

– relatively large number of entries in the ARC infosys;

– cannot scale to a large high-throughput infrastructure.

Conversely, gLite’s BDII puts a large load on the top-BDII:

– it must handle the load from all clients;

– it must be able to poll all site-BDIIs in a fixed time;

– so it must cope with network timeouts, slow sites, etc.
gLite WMS: Pros and Cons
Pros:
– Global view of the Grid, so it could make better meta-scheduling decisions.

– Can support aggregate job types (e.g., workflows).

– Aggregates the monitoring operations, which reduces the load on sites.

Cons:

– The WMS is a single point of failure.

– Clients still use a polling mechanism, so the WMS must sustain the load.

– An extremely complex piece of software, running on a single machine: very hard to scale up!

– Relies on an infosystem to make sensible decisions (the fixed schema/representation problem).
Condor
[Diagram: three clusters, each running a condor_resource daemon; a condor_agent (started via condor_submit) and the resources advertise to a central condor_master, which matches them.]
Condor overview
Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between the Agents’ requests and the Resources’ offerings.

An Agent sends its computational job directly to the matching Resource (“claiming”).

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356.
Matchmaking
Agents and Resources publish “advertisements”: requests and offers in the “ClassAd” format (an enriched key=value format).

There is no prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model. Similarly, a client may impose arbitrary constraints.
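The symmetric matchmaking idea can be sketched as follows. This is a toy model: real ClassAds use a dedicated expression language, not Python lambdas, and the attribute names here are merely illustrative:

```python
# Two ads match when each side's Requirements expression evaluates to
# True against the other side's attributes.
def matches(request, offer):
    return (request["Requirements"](offer) and
            offer["Requirements"](request))

offer = {
    "OpSys": "LINUX", "Memory": 4096,
    "Requirements": lambda ad: ad.get("ImageSize", 0) < 2048,
}
request = {
    "ImageSize": 512,
    "Requirements": lambda ad: ad.get("OpSys") == "LINUX"
                               and ad.get("Memory", 0) >= 1024,
}

assert matches(request, offer)
```

Because neither side is bound to a schema, an offer can advertise any attribute, and a request can constrain on any attribute, whether or not the other side defines it.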
Condor: Pros and Cons
Pros:
– Fully distributed job submission and monitoring system.

– Flexible (schemaless) mechanism for specifying job requirements and enforcing resource-usage policies.

– The modular architecture and separate claiming protocol allow running many types of “workloads” (e.g., computational jobs, VMs, . . . ).

Cons:

– Schemalessness moves the correct specification of requirements and policies up into the organizational layer.
All these job management systems are based on a push model (you send the job to an execution cluster).

Is there, conversely, a pull model?
Scaling issues in the “push” model
– Scheduling entities need a comprehensive view of the system.

– The information needs frequent updates.

– There is no complete information on local scheduling policies.

– The meta-scheduler is yet another scheduling layer on top of the local cluster scheduling.
Condor glide-ins
Idea: run Condor daemons as regular grid/batch jobs and build a “personal resource pool” out of the allocated job slots.

1. The user starts a “personal” Condor match-making daemon.

2. The user submits Condor resource daemons as batch jobs to foreign systems.

3. As the submitted resource daemons start, they contact the “personal” Condor master and form a resource pool.

4. The user can now run jobs through the “personal” Condor pool.
Pilot jobs
The general name for this approach is pilot (or “late-binding”) jobs.

A pilot job is a job that lands on a resource and then pulls in a task (the payload) to execute from a central server.
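The pull loop a pilot runs after landing on a worker node can be sketched as follows; the task queue here is a local stand-in for a real pilot framework’s central task server:

```python
from collections import deque

# A pilot repeatedly pulls the next payload from the central queue,
# executes it, and reports the result back, until no tasks remain.
def pilot(task_queue, results):
    while task_queue:
        task = task_queue.popleft()   # pull the next payload
        results.append(task())        # execute it; report the result

tasks = deque([lambda: 1 + 1, lambda: 2 * 3])
results = []
pilot(tasks, results)
assert results == [2, 6]
```

Note how the binding of task to node happens only at execution time: if a pilot dies, its unfinished task simply stays in (or returns to) the queue for another pilot to pull.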
Advantages of late binding
Information about the execution node can be gathered at the moment the pilot job starts executing.

– More accurate description: the pilot can look for needed features (e.g., installed software/libraries).

– Hence, no need for an information system.

– Load balancing is done according to local system responsiveness.

Easier failure handling: if a pilot job fails, assign the same task to another pilot.

– Users only need to monitor the task success ratio!
Disadvantages
Outbound network connectivity is required.

A heterogeneous workload (e.g., a mixture of parallel and sequential tasks) leads to a combinatorial explosion of the number of submitted jobs.

Multi-user pilot-job frameworks introduce another (possibly opaque) step in the traceability chain.
References, I
– Foster, I. (2002): “What is the Grid? A Three-Point Checklist.”, Grid Today, July 20, 2002. http://dlib.cs.odu.edu/WhatIsTheGrid.pdf

– Thain, D., Tannenbaum, T. and Livny, M. (2005): “Distributed computing in practice: the Condor experience.” Concurrency and Computation: Practice and Experience, 17:323–356. DOI: 10.1002/cpe.938

– Livny, M., Raman, R., “High-Throughput Resource Management”, in: Foster, I. and Kesselman, C. (eds.), “The Grid: Blueprint for a New Computing Infrastructure”, Morgan Kaufmann, 1999.
References, II
– Konya, B. (2010): “The ARC Information System”, http://www.nordugrid.org/documents/arc_infosys.pdf

– Cecchi, M. et al. (2009): “The gLite Workload Management System”, Lecture Notes in Computer Science, 5529/2009, pp. 256–268.

– Andreozzi, S. et al. (2009): “GLUE Specification v. 2.0”, http://www.ogf.org/documents/GFD.147.pdf