Grid Systems and scheduling
Grid systems
• Many!!!
• Classification (depends on the author):
  – Computational grid:
    • distributed supercomputing (parallel application execution on multiple machines)
    • high throughput (a stream of jobs)
  – Data grid: provides the means to solve large-scale data management problems
  – Service grid: systems that provide services that are not provided by any single local machine
    • on demand: aggregate resources to enable new services
    • collaborative: connect users and applications via a virtual workspace
    • multimedia: infrastructure for real-time multimedia applications
Taxonomy of Applications
• Distributed supercomputing: consumes CPU cycles and memory
• High-throughput computing: harvests unused processor cycles
• On-demand computing: meets short-term requirements for resources that cannot be cost-effectively or conveniently located locally
• Data-intensive computing
• Collaborative computing: enables and enhances human-to-human interaction (e.g., the CAVE5D system supports remote, collaborative exploration of large geophysical data sets and of the models that generated them)
Alternative classification
• independent tasks
• loosely-coupled tasks
• tightly-coupled tasks
Application Management
• Description
• Partitioning
• Mapping
• Allocation
[Figure: application management flow: the application description is successively partitioned, mapped, and allocated onto grid nodes A and B]
Description
• Use a grid application description language
• Grid-ADL and GEL
  – One can take advantage of the loop construct to use compilation mechanisms for vectorization
Grid-ADL
[Figure: example Grid-ADL task graphs, contrasting traditional systems with alternative systems]
Partitioning/Clustering
• Application represented as a graph
  – Nodes: jobs
  – Edges: precedence constraints
• Graph partitioning techniques:
  – minimize communication
  – increase throughput or speedup
  – need good heuristics
• Clustering
Graph Partitioning
• Optimally allocating the components of a distributed program over several machines
• Communication between machines is assumed to be the major factor in application performance
• NP-hard for the case of 3 or more terminals

Graph partitioning and cut sets
• The partition of the program onto machines that minimizes interprocessor communication corresponds to the minimal cut set of the graph
• Finding a minimal cut set is an NP-hard problem, so heuristics are used
Basic concept: collapse the graph
• Given G = {N, E, M}
  – N is the set of nodes
  – E is the set of edges
  – M is the set of machine nodes
Heuristic: Dominant Edge
• Take node n and its heaviest edge e
• Let e1, e2, …, er be n's other edges whose opposite end nodes are not in M
• Let e'1, e'2, …, e'k be n's edges whose opposite end nodes are in M
• If w(e) ≥ Σ w(ei) + max(w(e'1), …, w(e'k))
• then the min-cut does not contain e, so e can be collapsed
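A minimal Python sketch of the dominant-edge test may make this concrete (the adjacency-map representation and the function name are ours, not from the paper):

```python
# graph[n]: dict mapping each neighbor of node n to the weight of the
# connecting edge; machines: the set of machine nodes M.

def dominant_edge(graph, machines, n):
    """Return n's heaviest edge (n, v) if it provably avoids the min-cut."""
    edges = graph[n]
    if not edges:
        return None
    heavy = max(edges, key=edges.get)                 # e: n's heaviest edge
    rest = {v: w for v, w in edges.items() if v != heavy}
    sum_plain = sum(w for v, w in rest.items() if v not in machines)            # e1..er
    max_machine = max((w for v, w in rest.items() if v in machines), default=0)  # e'1..e'k
    if edges[heavy] >= sum_plain + max_machine:       # w(e) >= Σ w(ei) + max(w(e'j))
        return (n, heavy)                             # e is safe to collapse
    return None
```

Collapsing merges n into the opposite endpoint of e, shrinking the graph that the remaining heuristics must examine.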
Another heuristic: Machine Cut
• Let the machine cut Mi be the set of all edges between machine mi and the non-machine nodes N
• Let Wi be the sum of the weights of all edges in the machine cut Mi
• Sort the Wi so that W1 ≥ W2 ≥ …
• Any edge with weight greater than W2 cannot be part of the min-cut
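A sketch of computing the W2 threshold, under an assumed edge-list representation:

```python
# edges: list of (u, v, weight) tuples; machines: the set of machine nodes.

def machine_cut_threshold(edges, machines):
    """Return W2; any edge heavier than W2 cannot belong to the min-cut."""
    cut_weight = {m: 0 for m in machines}
    for u, v, w in edges:
        if (u in machines) != (v in machines):        # edge crosses into a machine
            cut_weight[u if u in machines else v] += w
    ws = sorted(cut_weight.values(), reverse=True)    # W1 >= W2 >= ...
    return ws[1] if len(ws) > 1 else float("inf")
```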
Yet another heuristic: Zeroing
• Assume that node n has edges to each of the m machines in M, with weights w1 ≤ w2 ≤ … ≤ wm
• Reducing the weight of each of the m edges from n to the machines by w1 does not change the assignment of nodes in the min-cut
• It reduces the cost of the minimum cut by (m-1)·w1
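The zeroing step is a short transformation over node n's machine edges (again an illustrative encoding):

```python
# to_machines: dict mapping each machine in M to the weight of n's edge to it.

def zero_node(to_machines):
    """Apply zeroing to node n; return adjusted weights and the cut-cost reduction."""
    if not to_machines:
        return to_machines, 0
    w1 = min(to_machines.values())
    adjusted = {m: w - w1 for m, w in to_machines.items()}
    return adjusted, (len(to_machines) - 1) * w1      # min-cut cost drops by (m-1)*w1
```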
Heuristics: Order of Application
• If the previous three techniques are repeatedly applied to a graph until none of them is applicable, then the resulting reduced graph is independent of the order in which the techniques were applied
Output
• The list of nodes collapsed into each of the machine nodes
• The weights of the edges connecting the machine nodes

Source: Graph Cutting Algorithms for Distributed Applications Partitioning, Karin Hogstedt, Doug Kimelman, V. T. Rajan, Tova Roth, and Mark Wegman, ACM SIGMETRICS, v. 28(4), 2001
homepages.cae.wisc.edu/~ece556/fall2002/PROJECT/distributed_applications.ppt
Graph partitioning
• Hendrickson and Kolda, 2000: edge cuts
  – are not proportional to the total communication volume
  – try to (approximately) minimize the total volume, but not the total number of messages
  – do not minimize the maximum volume and/or number of messages handled by any single processor
  – do not consider the distance between processors (the number of switches a message passes through, for example)
  – the undirected graph model can only express symmetric data dependencies
Graph partitioning
• To avoid message contention and improve the overall throughput of the message traffic, it is preferable to restrict communication to processors that are near each other
• But edge-cut is appropriate for applications whose graphs have locality and few neighbors
Resource Management
[Figure: task scheduling taxonomy (1988); source: P. K. V. Mangan, Ph.D. Thesis, 2006]
Static scheduling of task precedence graphs
DSC: Dominant Sequence Clustering
• Yang and Gerasoulis, 1994: a two-step method for scheduling with communication (focus on the critical path):
  1) schedule on an unbounded number of completely connected processors (producing clusters of tasks);
  2) if the number of clusters is larger than the number of available processors, merge the clusters until they match the number of real processors, taking the network topology into account (merging step).
Kwok and Ahmad, 1999: a taxonomy of multiprocessor scheduling
("Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors")
List Scheduling
• Make an ordered list of processes by assigning them priorities
• Repeatedly execute the following two steps until a valid schedule is obtained:
  – select from the list the process with the highest priority for scheduling;
  – select a resource to accommodate this process.
• Priorities are determined statically, before the scheduling process begins: the first step chooses the process with the highest priority, the second step selects the best possible resource
• Some known list scheduling strategies:
  – Highest Level First (HLF)
  – Longest Path (LP)
  – Longest Processing Time
  – Critical Path Method
• List scheduling algorithms only produce good results for coarse-grained applications
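A generic sketch of the two-step loop in Python (priority, ready, and est_finish are assumed inputs; precedence constraints between processes are omitted for brevity):

```python
def list_schedule(tasks, priority, ready, est_finish):
    """tasks: task ids; priority: task -> static priority value;
    ready: dict resource -> ready time; est_finish(task, resource, ready_time)
    -> estimated completion time. Returns {task: (resource, finish_time)}."""
    schedule = {}
    for task in sorted(tasks, key=priority, reverse=True):              # step 1
        best = min(ready, key=lambda r: est_finish(task, r, ready[r]))  # step 2
        finish = est_finish(task, best, ready[best])
        ready[best] = finish                # resource is busy until the task ends
        schedule[task] = (best, finish)
    return schedule
```

The choice of the priority function (HLF, LP, etc.) is what distinguishes the strategies listed above.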
Graph partitioning
• Kumar and Biswas, 2002: MiniMax
  – a multilevel graph partitioning scheme
  – Grid-aware
  – considers two weighted undirected graphs:
    • a workload graph (to model the problem domain)
    • a system graph (to model the heterogeneous system)
Resource Management
• The scheduling algorithm has four components (sketched as an interface below):
  – transfer policy: when a node can take part in a task transfer;
  – selection policy: which task must be transferred;
  – location policy: which node to transfer to;
  – information policy: when to collect system state information.
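One way to picture this separation of concerns is as a pluggable interface; the following abstract sketch uses our own hypothetical naming, not that of any particular system:

```python
from abc import ABC, abstractmethod

class LoadDistributionPolicy(ABC):
    @abstractmethod
    def should_transfer(self, node):
        """Transfer policy: may this node take part in a task transfer?"""

    @abstractmethod
    def select_task(self, node):
        """Selection policy: which task must be transferred?"""

    @abstractmethod
    def locate_peer(self, node):
        """Location policy: which node should the task be transferred to?"""

    @abstractmethod
    def should_refresh(self, node):
        """Information policy: is it time to collect system state information?"""
```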
Resource Management
• Location policy:
– Sender-initiated
– Receiver-initiated
– Symmetrically-initiated
Scheduling mechanisms for grid
• Berman, 1998 (extended by Kayser, 2006):
  – job scheduler
  – resource scheduler
  – application scheduler
  – meta-scheduler
Scheduling mechanisms for grid
• Legion
  – University of Virginia (Grimshaw, 1993)
  – Supercomputing 1997
  – commercialized in 2003 by Avaki
Legion
• An object-oriented infrastructure for grid environments, layered on top of existing software services (some call it a grid-aware operating system)
• Uses the existing operating systems, resource management tools, and security mechanisms at host sites to implement higher-level, system-wide services
• Its design is based on a set of core objects
Legion
• Uses the concept of context spaces to implement the objects (processes, file names, etc.)
• ProxyMultiObject: a container process used to represent the files and contexts residing on one host
LegionFS ProxyMultiObject [figure slide]

Lightweight and distributed [figure slide]
Legion
• Resource management is a negotiation between resources and the active objects that represent the distributed application
• Three steps to allocate resources for a task:
  – Decision: considers the task's characteristics and requirements, the resources' properties and policies, and the users' preferences
  – Enactment: the class object receives an activation request; if the placement is acceptable, the task is started
  – Monitoring: ensures that the task is operating correctly
Globus
• From version 1.0 in 1998 to the 2.0 release in 2002 and then 3.0, the emphasis has been to provide a set of components that can be used either independently or together to develop applications
• The Globus Toolkit version 2 (GT2) design is closely related to the architecture proposed by Foster et al.
• The Globus Toolkit version 3 (GT3) design is based on grid services, which are quite similar to web services; GT3 implements the Open Grid Services Infrastructure (OGSI)
• GT4 is also based on grid services, but with some changes in the standard
• GT5 provides a multithreaded API implementation based on an asynchronous event model
Globus
• A toolkit with a set of components that implement basic services:
  – security
  – resource location
  – resource management
  – data management
  – resource reservation
  – communication
Core Globus Services
• Communication Infrastructure (Nexus)
• Information Services (MDS)
• Remote File and Executable Management (GASS, RIO, and GEM)
• Resource Management (GRAM)
• Security (GSS)
Communications (Nexus)
• Communication library (ANL & Caltech)
  – asynchronous communications
  – multithreading
  – dynamic resource management
Communications (Nexus)
• 5 basic abstractions
  – nodes
  – contexts (address spaces)
  – threads
  – communication links (global pointers)
  – remote service requests
• Startpoints and endpoints
Communications (Nexus)
• A remote service request takes a global pointer (GP), a procedure name, and data:
  – transfers the data to the context referenced by the GP
  – remotely invokes the specified procedure (with the data and the local portion of the GP as arguments)
Source: "…technologies for ubiquitous supercomputing…", Foster et al. (CCPE 1997)
Information Services (Metacomputing Directory Service - MDS)
• Required information
  – configuration details about resources
    • amount of memory
    • CPU speed
  – performance information
    • network latency
    • CPU load
  – application-specific information
    • memory requirements
Remote file and executable management
• Global Access to Secondary Storage (GASS)
  – basic access to remote files; supported operations include remote read, remote write, and append
• Remote I/O (RIO)
  – a distributed implementation of MPI-IO, the parallel I/O API
• Globus Executable Management (GEM)
  – enables loading and executing a remote file through the GRAM resource manager
Resource management
• Resource Specification Language (RSL)
• Globus Resource Allocation Manager (GRAM)
  – provides a standardized interface to all of the various local resource management tools that a site might have in place (e.g., LSF, EASY-LL, NQE)
• DUROC (Dynamically-Updated Request Online Coallocator)
  – provides a co-allocation service
  – it coordinates a single request that may span multiple GRAMs
Authentication Model
• Authentication is done on a "user" basis
  – a single authentication step allows access to all grid resources
• No communication of plaintext passwords
• Most sites will use conventional account mechanisms
  – you must have an account on a resource to use that resource
Grid Security Infrastructure
• Each user has:
  – a Grid user id (called a Subject Name)
  – a private key (like a password)
  – a certificate signed by a Certificate Authority (CA)
• A "gridmap" file at each site specifies the grid-id to local-id mapping
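Each line of a gridmap file maps a certificate subject name to a local account; the DN and user name below are invented for illustration:

```
"/C=US/O=Example Org/OU=Users/CN=Jane Doe" jdoe
```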
Certificate-Based Authentication
• The user has a certificate, signed by a trusted "certificate authority" (CA)
  – the certificate contains the user's name and public key
  – the Globus project operates a CA
"Logging" onto the Grid
• To run programs, authenticate to Globus:
    % grid-proxy-init
    Enter PEM pass phrase: ******
• Creates a temporary, short-lived credential for use by our computations
• The private key is not exposed past grid-proxy-init
Simple job submission
• globus-job-run provides a simple RSH-compatible interface:
    % grid-proxy-init
    Enter PEM pass phrase: *****
    % globus-job-run host program [args]
Condor
• A specialized job and resource management system. It provides:
  – a job management mechanism
  – scheduling
  – a priority scheme
  – resource monitoring
  – resource management
Condor Terminology
• The user submits a job to an agent.
• The agent is responsible for remembering jobs in persistent storage while finding resources willing to run them.
• Agents and resources advertise themselves to a matchmaker, which is responsible for introducing potentially compatible agents and resources.
• At the agent, a shadow is responsible for providing all the details necessary to execute a job.
• At the resource, a sandbox is responsible for creating a safe execution environment for the job and protecting the resource from any mischief.
Condor-G: a computation management agent for Grid Computing
• A merging of Globus and Condor technologies
• Globus
  – protocols for secure inter-domain communications
  – standardized access to remote batch systems
• Condor
  – job submission and allocation
  – error recovery
  – creation of an execution environment
Globus: scheduling
• The Resource Specification Language (RSL) is used to communicate requirements
• To take advantage of GRAM, a user still needs a system that can remember what jobs have been submitted, where they are, and what they are doing
• To track large numbers of jobs, the user needs queuing, prioritization, logging, and accounting; these services cannot be found in GRAM alone, but are provided by systems such as Condor-G
MyGrid and OurGrid (Cirne et al.)
• Mainly for bag-of-tasks (BoT) applications
• Uses the dynamic Work Queue with Replication (WQR) algorithm:
  – hosts that have finished their tasks are assigned to execute replicas of tasks that are still running
  – tasks are replicated until a predefined maximum number of replicas is reached (in MyGrid, the default is one)
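A minimal sketch of the WQR dispatch decision, assuming a queue of unstarted tasks and a per-task replica counter (names and encoding are ours):

```python
from collections import deque

def wqr_next_task(pending, replicas, max_replicas=2):
    """pending: deque of unstarted task ids; replicas: dict task -> number of
    running copies. Returns the task an idle host should run next, or None.
    max_replicas counts total copies, so MyGrid's default of one replica
    corresponds to max_replicas=2 in this encoding."""
    if pending:                                  # plain Work Queue phase
        task = pending.popleft()
        replicas[task] = 1
        return task
    candidates = [t for t, n in replicas.items() if n < max_replicas]
    if candidates:                               # replication phase
        task = min(candidates, key=replicas.get)
        replicas[task] += 1
        return task
    return None
```

Once any copy of a task finishes, WQR cancels the remaining replicas of that task.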
OurGrid
• An extension of MyGrid
• A resource sharing system based on peer-to-peer technology
• Resources are shared according to a "network of favors" model, in which each peer prioritizes those who have credit in their past history of interactions
• Interoperates with gLite
GrADS: Grid Application Development Software
• An application scheduler
• The user invokes the Grid Routine component to execute an application
• The Grid Routine invokes the Resource Selector component
• The Resource Selector accesses the Globus MetaDirectory Service (MDS) to get a list of machines that are alive, and then contacts the Network Weather Service (NWS) to get system information for those machines
GrADS: Grid Application Development Software
• The Grid Routine then invokes a component called the Performance Modeler with the problem parameters, the machines, and the machine information
• The Performance Modeler builds the final list of machines and sends it to the Contract Developer for approval
• The Grid Routine then passes the problem, its parameters, and the final list of machines to the Application Launcher
GrADS: Grid Application Development Software
• The Application Launcher spawns the job using the Globus resource management mechanism (GRAM) and also spawns the Contract Monitor
• The Contract Monitor monitors the application, displays the actual and predicted times, and can report contract violations to a rescheduler
GrADS: Grid Application Development Software
• Although this execution model is efficient from the application's perspective, it does not take into account the existence of other applications in the system
GrADS: Grid Application Development Software
• Vadhiyar and Dongarra, 2002: proposed a metascheduling architecture in the context of the GrADS project
• The metascheduler receives candidate schedules from the different application-level schedulers and implements scheduling policies to balance the interests of the different applications
EasyGrid (Rebello & Boeres et al.)
• Mainly concerned with MPI applications
• Allows intercluster execution of MPI processes that belong to the same application

EasyGrid portal [figure slide]
Source: CCP&E, Volume 18, Issue 6, pages 549-699 (May 2006)
Nimrod (Buyya et al.)
• Uses a simple declarative parametric modeling language to express parametric experiments
• Provides machinery that automates the tasks of formulating, running, and monitoring the experiments, and of collating the results of the multiple individual runs
• Incorporates distributed scheduling, which can manage the scheduling of individual experiments onto idle computers in a local area network
• Has been applied to a range of application areas, e.g. bioinformatics, operations research, network simulation, electronic CAD, ecological modelling, and business process simulation
Nimrod/G [figure slide]
AppLeS: Application Level Scheduling (Berman et al.)
• UCSD (Berman and Casanova)
• AppLeS Parameter Sweep Template (APST)
• Uses scheduling based on min-min, max-min, and sufferage, with heuristics to estimate the performance of resources and tasks
  – performance-information-dependent algorithms (PIDA)
• Main goal: to minimize file transfers
Main scheduling algorithm

sched() {
  (1) compute the next scheduling event
  (2) create a Gantt chart, G
  (3) foreach computation and file transfer currently underway:
        compute an estimate of its completion time
        fill in the corresponding blocks in G
  (4) until each host has been assigned enough work:
        heuristically assign tasks to hosts (filling blocks in G)
  (5) convert G into a plan
}

Min-min, max-min, and sufferage differ in step (4)
Min-min algorithm
1. A task list is generated that includes all the tasks as unmapped tasks.
2. For each task in the task list, the machine that gives the task its minimum completion time (first Min) is determined (ignoring other unmapped tasks).
3. Among all task-machine pairs found in 2, the pair that has the minimum completion time (second Min) is determined.
4. The task selected in 3 is removed from the task list and is mapped to the paired machine.
5. The ready time of the machine on which the task is mapped is updated.
6. Steps 2-5 are repeated until all tasks have been mapped.

Source: Study of an Iterative Technique to Minimize Completion Times of Non-Makespan Machines, by Luis Diego Briceño, Mohana Oltikar, Howard Jay Siegel, and Anthony A. Maciejewski, 2007
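A compact Python rendering of these steps (ct is an assumed completion-time estimator: machine ready time plus the task's execution time on that machine):

```python
def min_min(tasks, ready, ct):
    """tasks: unmapped task ids; ready: dict machine -> ready time;
    ct(task, machine, ready_time) -> estimated completion time."""
    tasks, mapping = set(tasks), []
    while tasks:                                     # step 6: repeat until done
        # steps 2-3: best machine per task, then best task-machine pair overall
        t, m = min(((t, m) for t in tasks for m in ready),
                   key=lambda tm: ct(tm[0], tm[1], ready[tm[1]]))
        ready[m] = ct(t, m, ready[m])                # step 5: update ready time
        mapping.append((t, m))                       # step 4: map the task
        tasks.remove(t)
    return mapping
```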
Sufferage algorithm
1. A task list (L) is generated that includes all unmapped tasks in a given arbitrary order.
2. While there are still unmapped tasks:
   i. Mark all machines as unassigned.
   ii. For each task tk ∈ L:
      a. The machine mj that gives the earliest completion time is found.
      b. The sufferage value is calculated (sufferage value = second-earliest completion time minus earliest completion time).
      c. If machine mj is unassigned, then assign tk to machine mj, delete tk from L, and mark mj as assigned. Otherwise, if the sufferage value of the task ti already assigned to mj is less than the sufferage value of task tk, then unassign ti, add ti back to L, assign tk to machine mj, and remove tk from L.
   iii. The ready times of all machines are updated.

Source: Study of an Iterative Technique to Minimize Completion Times of Non-Makespan Machines, by Luis Diego Briceño, Mohana Oltikar, Howard Jay Siegel, and Anthony A. Maciejewski, 2007
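A hedged sketch of one round-based reading of the algorithm (ct as in the min-min sketch; within a round, a task with a larger sufferage value displaces a smaller one on the same machine, and displaced tasks wait for the next round):

```python
def sufferage(tasks, ready, ct):
    """tasks: unmapped task ids; ready: dict machine -> ready time;
    ct(task, machine, ready_time) -> estimated completion time."""
    unmapped, mapping = set(tasks), []
    while unmapped:
        chosen = {}                            # machine -> (sufferage value, task)
        for t in unmapped:                     # step ii: one pass over L
            best_m = min(ready, key=lambda m: ct(t, m, ready[m]))     # ii.a
            best = ct(t, best_m, ready[best_m])
            others = [ct(t, m, ready[m]) for m in ready if m != best_m]
            suff = (min(others) - best) if others else 0              # ii.b
            if best_m not in chosen or chosen[best_m][0] < suff:
                chosen[best_m] = (suff, t)     # ii.c: keep the task that suffers more
        for m, (_, t) in chosen.items():       # commit this round's winners
            ready[m] = ct(t, m, ready[m])      # step iii: update ready times
            mapping.append((t, m))
            unmapped.remove(t)
    return mapping
```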
Minimum Completion Time (MCT) algorithm
1. A task list is generated that includes all unmapped tasks in a given arbitrary order.
2. The first task in the list is mapped to its minimum completion time machine (machine ready time plus estimated computation time of the task on that machine).
3. The task selected in step 2 is removed from the task list.
4. The ready time of the machine on which the task is mapped is updated.
5. Steps 2-4 are repeated until all the tasks have been mapped.

Source: Study of an Iterative Technique to Minimize Completion Times of Non-Makespan Machines, by Luis Diego Briceño, Mohana Oltikar, Howard Jay Siegel, and Anthony A. Maciejewski, 2007
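MCT, the simplest of the three, in the same style:

```python
def mct(tasks, ready, ct):
    """tasks: task ids in a given arbitrary order; ready: dict machine ->
    ready time; ct(task, machine, ready_time) -> estimated completion time."""
    mapping = []
    for t in tasks:                            # steps 2-5, one task at a time
        m = min(ready, key=lambda m: ct(t, m, ready[m]))
        ready[m] = ct(t, m, ready[m])          # update machine ready time
        mapping.append((t, m))
    return mapping
```

Unlike min-min, MCT never reorders the task list, so a poorly placed early task can hurt later placements.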
GRAnD: Grid Robust Application Deployment [Kayser et al., CCP&E, 2007]
• Distributed submission control
• Data locality
• Automatic staging of data
• Optimization of file transfers
Distributed submission
Results of simulation with MONARC (http://monarc.web.cern.ch/MONARC/) [Kayser, 2006]
GRAnD
• Experiments with Globus
  – Discussion list: [email protected] (05/02/2004)
    • submission takes 2 s per task
    • placing 200 tasks in the queue: ~6 min
    • maximum number of tasks: a few hundred
  – Experiments at CERN (D. Foster et al., 2003)
    • 16 s to submit a task
    • saturation in the server: 3.8 tasks/minute
GRAnD: Grid Robust Application Deployment [figure slides]

GRAnD data management [figure slides]
Comparison (Kayser, 2006) [figure slides]
Condor performance [figure slides]

Condor vs. AppMan [figure slide]

Condor performance: experiments on a cluster of 8 nodes (Sanches et al., 2005) [figure slide]

ReGS: Condor performance [figure slides]
Toward Grid Operating Systems
• Vega GOS
• G SMA
Vega GOS (the CNGrid OS)
• GOS overview: a user-level middleware running on a client machine
• GOS has 2 components: GOS and gnetd
  – GOS is a daemon running on the client machine
  – gnetd is a daemon on the grid server
GOS
• Grid process and Grid thread
  – a Grid process is the unit for managing the whole resource of the Grid
  – a Grid thread is the unit for executing computation on the Grid
• GOS API (for application developers)
  – grid(): constructs a Grid process on the client machine
  – gridcon(): connects the grid process to the Grid system
  – gridclose(): closes a connected grid
• gnetd API (for service developers on Grid servers)
  – grid_register(): registers a service with the Grid
  – grid_unregister(): unregisters a service
Others
• XtreemOS (challenge this year!)
  http://www.xtreemos.eu/hotspot_news/xtreemos-computing-challenge
• MOSIX
• Environments: g-Eclipse
• …