CoreGRID Summer School Bonn, July 24 th - 28 th 2006 HPC4U: Realizing SLA-aware Resource Management Simon Alexandre CETIC, Charleroi, Belgium Matthias

CoreGRID Summer School
Bonn, July 24th - 28th 2006

HPC4U: Realizing SLA-aware Resource Management

Simon Alexandre
CETIC, Charleroi, Belgium

Matthias Hovestadt
University of Paderborn, Germany

Topics

• Motivation
• Architecture of an SLA-aware RMS
• Phases of Operation
• SLA-aware Scheduling
• Cross-border Migration
• Summary

Grid Computing Today

• What do Grids look like today? Grids are in use, but…
  – … commercial usage is rare and limited
    o only isolated applications
  – … they are mostly used as prototype solutions in research
    o testbeds within research projects
• Problem: no contractually fixed QoS levels
  – affects deadline-bound, business-critical jobs

What is an SLA?

• Service Level Agreement (SLA)
  – contract between provider and customer
    o describes all obligations and expectations
  – flexible formulation for each use case
• SLAs are a focus of research in Grid middleware

Structure of a Service Level Agreement:

• Name: ID or description of the SLA
• Context: contract parties, responsible persons
• Terms
  – R-Type: HW, OS, compiler, software packages, …
  – R-Quantity: number of CPUs, main memory, …
  – R-Quality: CPU > 2 GHz, network bandwidth, …
  – Deadline: date, time, …
  – Policies: demands on security and privacy, …
• Price for resource consumption (fulfilled SLA)
• Penalty fee in case of SLA violation
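The SLA structure can be pictured as a small record type. The following is only a minimal sketch: the class name, the field names, and the `settlement` helper are hypothetical and are not taken from HPC4U or from any SLA standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ServiceLevelAgreement:
    # "Name" and "Context" sections of the agreement.
    name: str
    provider: str
    customer: str
    # "Terms" section: resource type, quantity, quality, deadline, policies.
    resource_type: dict = field(default_factory=dict)      # e.g. {"os": "Linux"}
    resource_quantity: dict = field(default_factory=dict)  # e.g. {"cpus": 3}
    resource_quality: dict = field(default_factory=dict)   # e.g. {"min_cpu_ghz": 2.0}
    deadline: Optional[datetime] = None
    policies: tuple = ()
    # Price if the SLA is fulfilled, penalty fee if it is violated.
    price: float = 0.0
    penalty: float = 0.0

    def settlement(self, finished_at: datetime) -> float:
        """Return the price if the deadline was met, else the negative penalty."""
        if self.deadline is None or finished_at <= self.deadline:
            return self.price
        return -self.penalty
```

Keeping price and penalty next to the deadline makes the business consequence of a missed deadline explicit, which is the point of moving from best effort to contractually fixed QoS.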

The Gap between Grid and RMS

[Figure: user request → SLA → grid middleware → RMS instances on machines M1, M2, M3. Labels: “Reliability? Quality of Service?”, “Best Effort!”, “Guaranteed!”]

• User asks for a Service Level Agreement
• Grid middleware realizes the job by means of local RMS systems
• BUT: these RMS only offer best effort!
• Goal: SLA-aware RMS
  – runtime responsibility
  – reliability
    o fault tolerance

Demands on an SLA-aware RMS

• Negotiation
  – active negotiation with upper layers
  – accept a new job only if its SLA can be fulfilled
• System Management
  – taking the terms of SLAs into account
  – allocation of nodes according to SLAs
• Fault Tolerance
  – ensure the terms of SLAs also in case of failures
  – mechanisms for failure handling



SLA-aware RMS

• Central component
  – interface to Grid middleware for SLA negotiation
  – interfaces to subsystems for provision of FT
• Tasks
  – SLA negotiation
  – policies
    o security, …
  – monitoring
  – FT
    o checkpoints
    o migration
• Open interfaces

[Figure: HPC4U Grid-enabled, SLA-aware Resource Management System (SLA negotiation, SLA-aware scheduling, monitoring, support of checkpointing and migration, security) with an interface to Grid middleware and interfaces to storage, checkpointing, and network. Attached to the interfaces: storage solution, checkpointing solution, networking solution, each with an alternative solution. Outcome 1 (open source), Outcome 2 (commercial), Outcome 3 (non-commercial).]

Process Subsystem

• Concept: “virtual bubble”
  – virtualization of resources
    o virtual network devices, virtual process IDs, …
  – application runs in a virtual environment
  – only minimal impact on job runtime
• Checkpoint of entire “virtual bubble”
  – no re-linking necessary
    o also applicable to commercial applications
• Restart of checkpointed “virtual bubble”
  – compatibility has to be ensured
  – application does not detect the restart

Network Subsystem

• Provision of FT also for parallel jobs
  – communication between nodes
  – checkpoint of network state necessary
    o ensuring consistency between process and network at restart
• Network checkpointing
  – checkpoint of network queues
  – checkpoint of in-transit packets
• Cooperative Checkpoint Protocol (CCP)
  – direct communication between process checkpointing and network checkpointing

Storage Subsystem

• Task of Storage Subsystem
  – storage-related QoS
  – checkpointing of storage
• Overall consistency of checkpoint
  – restoring state of storage at process checkpointing time
  – checkpoint = process + network + storage
• Storage checkpoint may be huge
  – Problem: delay until restart on remote resource
    o Grid migration over “slow” WAN connections
  – Solution: data replication with COW (copy on write)
    o precautionary data transfer to remote resource

Generation of a new Checkpoint

[Figure: message sequence between RMS, checkpointing (CP), network, and storage]

1. CP job + halt
2. In-transit packets
3. Return: “Checkpoint completed!”
4. Snapshot!
5. Link to snapshot
6. Resume job
7. Job running again
8. Migration from last checkpoint
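The ordering of steps 1 to 7 can be sketched as a small orchestration routine. This is an illustration only: the `Recorder` class and `generate_checkpoint` function are hypothetical, and which subsystem performs which step is an assumption read off the step names, not the HPC4U interface.

```python
class Recorder:
    """Hypothetical stand-in for a subsystem; it only records the calls it receives."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def call(self, action):
        self.log.append(f"{self.name}: {action}")


def generate_checkpoint(process_cp, network, storage, log):
    """Run the checkpoint steps in the order listed on the slide."""
    process_cp.call("checkpoint job and halt it")    # step 1
    network.call("save in-transit packets")          # step 2
    log.append("rms: checkpoint completed")          # step 3
    storage.call("take snapshot")                    # step 4
    log.append("rms: link checkpoint to snapshot")   # step 5
    process_cp.call("resume job")                    # steps 6 and 7
    return log
```

The job stays halted from step 1 until the storage snapshot exists, so the process, network, and storage parts of the checkpoint describe the same instant. Step 8 (migration from the last checkpoint) belongs to failure handling and is left out here.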

Checkpointing

• Backup of a consistent image of the running job
  – running process
  – network state in case of parallel jobs
  – storage partition
• Process checkpointing causes delay in job completion
  – depends on number of jobs, memory, interconnect, …
• Delay has to be regarded at job scheduling
  – partition size = estimated runtime + checkpointing overhead
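A minimal sketch of the partition-size rule, assuming a fixed checkpoint interval and a fixed cost per checkpoint. Both parameters and the function name `partition_time` are assumptions for illustration; the real overhead depends on the checkpointing system and the job.

```python
import math


def partition_time(estimated_runtime_h, interval_h, checkpoint_cost_h):
    """Reservation length = estimated runtime + checkpointing overhead."""
    # Assumption: one checkpoint after every full interval, but none at the
    # very end of the job, because the job has finished by then.
    checkpoints = max(math.ceil(estimated_runtime_h / interval_h) - 1, 0)
    return estimated_runtime_h + checkpoints * checkpoint_cost_h
```

For example, `partition_time(7, 1, 0.25)` counts six checkpoints and yields 8.5 hours for a 7-hour job.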



Phases of Operation

• Negotiation of SLA
• Pre-Runtime: configuration of resources
  – e.g. network, storage, compute nodes
• Runtime: Stage-In, Computation, Stage-Out
• Post-Runtime: re-configuration

[Figure: timeline of phases. Negotiation (acceptance or rejection of SLA), Pre-Runtime, Runtime (Stage-In, Computation, Stage-Out), Post-Runtime. Also marked along the time axis: lifetime of SLA, allocation of system resources.]

Negotiation Phase

• Negotiation
  – Grid customer and provider try to agree on a Service Level Agreement
    o which resources have to be provided?
    o which QoS level is required? (e.g. specification of a deadline)
• RMS in central position
  – steering of negotiation process
  – current system condition has to be regarded

Pre-Runtime Phase

• Task of Pre-Runtime Phase
  – configuration of all allocated resources
  – goal: fulfill requirements of SLA
• Reconfiguration affects all system elements
  – Resource Management System
    o e.g. configuration of assigned compute nodes
  – Storage Subsystem
    o e.g. initialization of a new data partition
  – Network Subsystem
    o e.g. configuration of network infrastructure

Runtime Phase

• Runtime Phase
  – lifetime of job in system
  – adherence to the SLA has to be assured
  – FT mechanisms have to be utilized
• Phase consists of three distinct steps
  – Stage-In
    o transmission of required input data from Grid customer to compute resource
  – Computation
    o execution of application
  – Stage-Out
    o transmission of generated output data from compute resource back to Grid customer

Post-Runtime Phase

• Task of Post-Runtime Phase
  – re-configuration of all resources
    o e.g. re-configuration of network
    o e.g. deletion of checkpoint datasets
    o e.g. deletion of temporary data
  – counterpart to Pre-Runtime Phase
• Allocation of resources ends
  – update of schedules in RMS and storage
  – resources are available for new jobs



Negotiation of new SLA

• Incoming SLA request
  – 3 nodes, 7 h runtime, earliest start: 20:00, deadline 6:00
  – request can be accepted, buffer time frame 2 h
• Regular checkpointing
  – new checkpoint to be generated every 60 minutes
  – checkpointing causes delay in job completion
    o depends on CP system and job size

[Figure: node schedule over time; axis labels 12am, 6pm, 12pm, 6am]
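The acceptance decision for this request can be sketched as a slack computation. The function names are hypothetical, and the sketch checks only the time window, not whether three nodes are actually free. The one hour of checkpoint overhead used in the example is an assumption (for instance six hourly checkpoints of ten minutes each); it is one reading that reconciles the 10-hour window and 7-hour runtime with the stated 2-hour buffer.

```python
from datetime import datetime, timedelta


def buffer_time(earliest_start, deadline, runtime, checkpoint_overhead):
    """Slack that remains if the job starts at its earliest start time."""
    return (deadline - earliest_start) - (runtime + checkpoint_overhead)


def can_accept(earliest_start, deadline, runtime, checkpoint_overhead):
    """Accept only if runtime plus overhead fits before the deadline."""
    slack = buffer_time(earliest_start, deadline, runtime, checkpoint_overhead)
    return slack >= timedelta(0)


# Request from the slide: earliest start 20:00, deadline 6:00, 7 h runtime.
start = datetime(2006, 7, 24, 20, 0)
deadline = datetime(2006, 7, 25, 6, 0)
slack = buffer_time(start, deadline, timedelta(hours=7), timedelta(hours=1))
```

Under these assumptions `slack` is two hours. The buffer is what later allows a restart from a checkpoint without violating the deadline.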

Suspending Jobs

• Valuable resources may be blocked by non-SLA jobs
  – 23:00: SLA request: 3 nodes, 7 hours, deadline 6:00
    o insufficient capacity: rejection of new SLA request
• Checkpoint and suspend of non-SLA jobs (best effort only)
• Acceptance of request and execution of SLA job
• Resume of suspended best-effort job
  – completion of best-effort job, completion of SLA job

[Figure: node schedule over time; axis labels 12am, 6pm, 12pm, 6am]

Increasing system utilization

• Jobs request a number of nodes and a runtime
  – users do not align their requests to free capacities
• Reservations must be guaranteed
  – no other complete job fits into the gaps
• Use job suspension to fill gaps with partial job execution
• Realization of “background jobs”

Runtime of SLA job

• Pre-runtime phase
  – configuration of network, storage, and nodes
• Runtime phase
  – monitoring of system
  – regular checkpointing
• Post-runtime phase

[Figure: node schedule over time; axis labels 12am, 6pm, 12pm, 6am]

Handling of Resource Failures

• Resource outage in partition of job
  – job crashes immediately
  – last checkpoint after 4 h runtime
    o computation time since last checkpoint is lost
• Allocation of partition with 3 h runtime
• Restore from last checkpointed state
• Scheduling of regular checkpoint intervals
• Resuming computation

[Figure: node schedule over time; axis labels 12am, 6pm, 12pm, 6am]
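The arithmetic behind the 3-hour partition can be sketched as follows: a 7-hour job whose last checkpoint covers 4 hours of runtime needs 3 more hours. The names `RecoveryPlan` and `plan_recovery`, the failure time, and the time left until the deadline in the example are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPlan:
    lost_work_h: float          # computation since the last checkpoint, lost
    remaining_runtime_h: float  # length of the new partition to allocate
    meets_deadline: bool        # whether the restart still fits the deadline


def plan_recovery(total_runtime_h, last_checkpoint_h, failed_at_h, hours_to_deadline):
    """Derive the restart partition from the last checkpointed state."""
    if not 0 <= last_checkpoint_h <= failed_at_h <= total_runtime_h:
        raise ValueError("expected 0 <= last checkpoint <= failure time <= total runtime")
    lost = failed_at_h - last_checkpoint_h
    remaining = total_runtime_h - last_checkpoint_h
    return RecoveryPlan(lost, remaining, remaining <= hours_to_deadline)


# 7 h job, last checkpoint after 4 h, crash assumed at 4.5 h, 5 h until deadline.
plan = plan_recovery(7.0, 4.0, 4.5, 5.0)
```

Here `plan.remaining_runtime_h` is 3.0 and `plan.lost_work_h` is 0.5. The sketch ignores restore time and the overhead of further checkpoints, both of which would lengthen the new partition.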

Availability of spare resources

• Migration presumes availability of resources
  – but: resources may be blocked by other jobs
• Solution: suspension of other jobs
• Problem: what to do in case of SLA jobs blocking resources?
  – an SLA job can only be suspended if its deadline is still met
• Buffer nodes: execution of non-SLA jobs only

[Figure: node schedule over time; axis labels 12am, 6pm, 12pm, 6am]
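The suspension rule can be sketched as a slack test plus a simple selection of victims. Everything here is hypothetical: the `RunningJob` record, the greedy order (best-effort jobs first, then SLA jobs with the most slack), and the neglect of checkpoint and restart overhead are assumptions, not the HPC4U scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningJob:
    name: str
    nodes: int
    remaining_h: float
    hours_to_deadline: Optional[float] = None  # None marks a best-effort job


def can_suspend(job, suspension_h):
    """Best-effort jobs can always be suspended; SLA jobs only with enough slack."""
    if job.hours_to_deadline is None:
        return True
    return job.remaining_h + suspension_h <= job.hours_to_deadline


def pick_jobs_to_suspend(jobs, nodes_needed, suspension_h):
    """Greedy choice; returns None if not enough nodes can be freed."""
    candidates = [j for j in jobs if can_suspend(j, suspension_h)]
    # Best-effort jobs sort first; among SLA jobs, larger slack sorts first.
    candidates.sort(key=lambda j: (j.hours_to_deadline is not None,
                                   j.remaining_h - (j.hours_to_deadline or 0.0)))
    chosen, freed = [], 0
    for job in candidates:
        if freed >= nodes_needed:
            break
        chosen.append(job)
        freed += job.nodes
    return chosen if freed >= nodes_needed else None
```

Reserving buffer nodes for non-SLA jobs, as the last bullet suggests, guarantees that the first branch of `can_suspend` always finds candidates on those nodes.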



Cross-border migration

• Goal: successful execution of SLA jobs
  – handling of failures depends on local load situation
  – goal of provider: utilization of resources
  – high load + massive failure → no migration → SLA violation
• Idea: cross-border migration
  – usage of resources on other local machines
    o multiple clusters available on most sites
  – transfer of checkpoint dataset to remote cluster
  – resume of job from checkpointed state

[Figure: two clusters, each with its own RMS]

Grid Migration

• Cross-border migration enhances FT level
  – additional alternatives for migration process
• Grid migration = usage of Grid as migration target
  – Virtual Resource Manager as active Grid component
  – negotiation with Grid on resources
• Migration process
  – request for spare resources
  – transfer using standard protocols
• Transparent for the user
  – user will receive results from new site
• Problem: compatibility of resources

Compatibility Profile

• Checkpoint dataset needs compatible resources for restart
  – processor architecture
  – main and storage memory
  – interconnect type
  – libraries
    o exact version for loaded libs
    o compatible version for unloaded libs
  – paths
• Compatibility profile describes requirements of checkpointed jobs
• Resource query according to this profile
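A resource query against such a profile can be sketched as a predicate over candidate resources. The record types and field names are hypothetical, and treating "compatible version" as "same major version" is an assumption made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CompatibilityProfile:
    architecture: str
    min_memory_gb: float
    min_storage_gb: float
    interconnect: str
    loaded_libs: dict = field(default_factory=dict)    # name -> exact version
    unloaded_libs: dict = field(default_factory=dict)  # name -> version, major must match
    paths: tuple = ()


@dataclass
class Resource:
    architecture: str
    memory_gb: float
    storage_gb: float
    interconnect: str
    libs: dict = field(default_factory=dict)  # name -> installed version
    paths: frozenset = frozenset()


def major(version):
    """Major component of a dotted version string, e.g. '2' for '2.3.6'."""
    return version.split(".")[0]


def is_compatible(profile, res):
    """True if a checkpointed job with this profile could restart on res."""
    if (profile.architecture != res.architecture
            or profile.interconnect != res.interconnect
            or res.memory_gb < profile.min_memory_gb
            or res.storage_gb < profile.min_storage_gb):
        return False
    # Libraries already loaded into the process image need the exact version.
    for name, version in profile.loaded_libs.items():
        if res.libs.get(name) != version:
            return False
    # Libraries not yet loaded only need a compatible version (assumed: same major).
    for name, version in profile.unloaded_libs.items():
        installed = res.libs.get(name)
        if installed is None or major(installed) != major(version):
            return False
    return all(path in res.paths for path in profile.paths)
```

A migration target would then be chosen with something like `[r for r in candidates if is_compatible(profile, r)]`.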

Grid Integration

[Figure: Grid customers, Grid middleware, and HPC4U Grid fabrics; components labeled Grid Interface and RMS]

Summary

• New requirements from future commercial Grids
  – transparent fault tolerance, SLA negotiation and management
• SLA-aware Resource Management System
  – orchestrated operation of subsystems for process, storage, and network
  – SLA scheduling in RMS
• Cross-border migration for increased FT level
  – virtual resource management, compatibility profile
• Progress
  – support for single-node jobs running
  – support of parallel applications close to completion
  – next: cross-border migration

Further Information

• Please visit our website: http://www.hpc4u.org
• You will find…
  – … general information about HPC4U
  – … movies showing fault tolerance in action
  – … a downloadable demo system to experiment with
  – … links and contact addresses
