Hybrid Distributed Computing Infrastructure Experiments in Grid5000: Supporting QoS in Desktop Grids with Cloud Resources
Simon Delamare, Gilles Fedak, Oleg Lodygensky
simon.delamare@inria.fr
INRIA Rhône-Alpes
GRAAL team
Grid’5000 Spring School 2011
S. Delamare (INRIA - GRAAL Team) Hybrid DCI Experiments in G5K Grid’5000 Spring School, 19/04/2011 1 / 20
Plan
1 Introduction
2 The SpeQuloS framework
3 Grid5000 as a Best-Effort DCI
4 Grid5000 as a Cloud
5 Conclusion
Context
• Growing demand for computing power from scientific communities (large applications, large datasets)
• Distributed Computing Infrastructures continue to diversify:
  - Supercomputers, Grids, Desktop Grids, and now Cloud Computing...
  - Different characteristics in terms of performance, size, cost, reliability, quality of service, etc.
• Question: how do we mix them? According to which criteria or scenarios?
  - Example: extend a Grid infrastructure with Cloud resources to meet a peak demand.
  - Example: use a local Desktop Grid to cut the Cloud resource cost.
  - Example: use on-demand Cloud resources to improve the QoS of a Desktop Grid.
Background: The EU FP7 EDGI
• EDGeS (Enabling Desktop Grids for e-Science): bridges from EGEE to Desktop Grids (BOINC and XtremWeb)
• FP7 EDGI/DEGISCO are new projects to maintain and extend the EDGeS infrastructures
  - New Grids: ARC, Unicore
  - Clouds: Eucalyptus, OpenNebula
• International Desktop Grid Federation: 40 partners worldwide.
EDGeS: Bridging EGEE to BOINC and XtremWeb
The job is accepted if and only if all these statements are true,
• the helper script doesn’t have to be started, as the proxy doesn’t leave the CE, so there is no need to update proxies on Worker Nodes,
• there is no need to run the wrapper script, it only has to be parsed, so our GRAM jobmanager can send the job to the 3G Bridge and can interact with the L&B server instead of the wrapper script,
• moreover, the GRAM jobmanager periodically has to check the status of the job in the 3G Bridge database, and update the status of the job for EGEE.
In order to implement the EGEE → BOINC bridge, we have extended the 3G bridge with a Web Service interface. So, there is no need to place the 3G bridge and the BOINC server on the same machine as the EDGeS CE: the 3G bridge and the BOINC server are completely separated from the EDGeS CE, which uses the WS interface to communicate with the 3G bridge.
On BOINC side, the DC-API [6] plug-in is used to create BOINC work units out of entries in the 3G Bridge Job Database, query their status and get the results of processed work units. Once work units are created in the BOINC database, they are processed sooner or later by attached BOINC clients. If desired, the BOINC server can perform any job redundancy and checking as usual.
The EGEE → BOINC bridge publishes information to the BDII according to GLUE 1.3, contains an EGEE producer and a BOINC GIP plug-in. The BOINC plug-in is responsible for reporting performance information about the BOINC project. For this, BOINC statistics provided by the BOINC project are used.
5.3.3 Bridge EGEE → XtremWeb
The general principle of creating the EGEE → XtremWeb bridge is the same as in the case of the EGEE → BOINC bridge since both solutions use the 3G bridge as the heart of the EGEE → DG bridges. The architecture of EGEE → XtremWeb bridge, depicted in Fig. 9, clearly shows that the EGEE → 3G bridge part is the same as the one shown in Fig. 8 above for the EGEE → BOINC case. The only difference is the replacement, inside the 3G bridge, of the BOINC plug-in by the XtremWeb plug-in.
6 Current Operational Status of EDGeS
The EDGeS 3G bridge is not a pure research prototype, but implementations are in real operation between EGEE and Desktop Grids, as shown in Fig. 10 below:
6.1 Operational DG → EGEE Infrastructure
The DG → EGEE bridges of the EDGeS system have been prototyped in June 2008 and put into operation in September 2008.
The BOINC → EGEE bridge is currently in operation at SZTAKI in Budapest, Hungary.
[Figure: Fig. 10 EDGeS operational EGEE ↔ DG infrastructure]
EDGeS VO of EGEE (BDII, VOMS, MyProxy, WMS, LB), used by EGEE Users:
  CNRS / IN2P3 CE      1,600 CPUs
  SZTAKI CE            16 CPUs
  CIEMAT CE            20 CPUs
EDGeS components: BOINC → EGEE bridge, EGEE → XtremWeb bridge, EGEE → BOINC bridge, Application Repository
BOINC-based Desktop Grids, used by BOINC Project Owners:
  SZDG (public)                72,000 PCs
  IberCivis (public)           14,000 PCs
  AlmereGrid (public)          1,000 PCs
  UoW (local)                  1,881 PCs
  Correlation Systems (local)  10 PCs
XtremWeb-based Desktop Grids, used by XtremWeb Users:
  AlmereGrid (public)          1,000 PCs
  IN2P3 (public)               600 PCs
  AlmereGrid (local)           10 PCs
Motivation for QoS in Best-Effort DCI
• Best-Effort DCI: Desktop Grids, Grids’ best-effort queues, Amazon EC2 spot instances, ...
• Characteristics: variable amount of resources, volatility, unpredictability, unannounced departure.
• Low QoS compared to classical DCI → Tail Effect
[Figure: number of jobs completed versus time in seconds]
We define Quality of Service as a level of confidence in Bag of Tasks (BoT) execution:
  - BoT makespan is the time between the first submission of the BoT and the moment all results have been received and validated
  - What can be estimated, predicted, guaranteed?
Question: how do we provide QoS to users given the dynamism and volatility of the computing resources?
• Intrinsic approach: improve the scheduler's QoS ability
• Extrinsic approach: provide additional dedicated computing resources
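The makespan definition above can be made concrete with a small sketch. This is only an illustration, not SpeQuloS code: the `BoTRun` record and the `time_to_fraction` helper are hypothetical names introduced here, and comparing the time to validate a fraction of the tasks with the full makespan is one assumed way of making a tail visible.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BoTRun:
    """Timestamps, in seconds, of one Bag-of-Tasks execution (hypothetical record)."""
    submitted_at: float            # first submission of the BoT
    validated_at: Sequence[float]  # time each result was received and validated


def makespan(run: BoTRun) -> float:
    """Time between first submission and validation of the last result."""
    if not run.validated_at:
        raise ValueError("the BoT has no validated result yet")
    return max(run.validated_at) - run.submitted_at


def time_to_fraction(run: BoTRun, fraction: float) -> float:
    """Elapsed time until `fraction` of the results are validated (0 < fraction <= 1)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(run.validated_at)
    if not ordered:
        raise ValueError("the BoT has no validated result yet")
    index = math.ceil(fraction * len(ordered)) - 1
    return ordered[index] - run.submitted_at


# Ten tasks: nine finish steadily, the last one arrives much later (a tail).
run = BoTRun(submitted_at=100.0,
             validated_at=[400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 4100])
print(makespan(run), time_to_fraction(run, 0.5))
```

With a tail, `time_to_fraction(run, 1.0)` equals the makespan and is much larger than the time needed for most of the tasks.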
The SpeQuloS Framework
Objectives:
  - Allow users to express QoS needs for their BoT
  - Provision resources from the Cloud to satisfy these needs
Needs:
  - Monitor infrastructure activity and BoT execution
  - Decide when and how many Cloud resources to provision
  - Instantiate Cloud resources as Cloud Workers and manage them
  - Account for and arbitrate Cloud usage
[Diagram: User, EGEE / Desktop Grid, SpeQuloS and Cloud, linked by arrows labelled BoT, monitor, provision and support]
Overview of the SpeQuloS framework
Implementation details
DG middleware
• Middleware supported: BOINC, XtremWeb-HEP (XWHEP)
• Modifications ensure that workers deployed in the Cloud only process the specified BoT.
→ XWHEP version ≥ 7.3.0 & patch for BOINC server
Starting Workers on the Cloud
• For each Cloud, VM images are built with the DG worker software
• Cloud Workers are started using libcloud and configured over ssh
• Clouds supported: OpenNebula, EC2, Eucalyptus, and we added G5K
Experimental Testbed Using Grid’5000
• XWHEP Desktop Grid server @ LRI, connected to EGEE
• A gateway to allow G5K nodes to connect to the XWHEP server.
• A set of XWHEP worker nodes, executed on G5K nodes.
  - A G5K job is a pilot job running several XWHEP workers (1 per core)
  - No specific environment needed
  - G5K jobs submitted in the best-effort queue → variable amount of resources, unpredictable and unannounced node departure.
• SpeQuloS monitors the XWHEP server and starts Cloud resources from Amazon EC2.
Grid’5000 resources utilization
Deployment on seven G5K sites, running:

  max #pjobs ← 30
  max #pjobs waiting ← 7
  while true do
    if current #pjobs < max #pjobs then
      if current #pjobs waiting < max #pjobs waiting then
        "Submit start XWHEP workers to one Grid5000 node in besteffort queue"
      else
        "Too many pilot jobs waiting"
      end if
    else
      "Maximum number of pilot jobs reached"
    end if
    sleep(15 minutes)
  end while
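The submission loop above could be written in Python roughly as follows. This is a sketch under assumptions: the three callbacks (`count_jobs`, `count_waiting`, `submit`) are hypothetical placeholders for whatever queries the Grid’5000 batch scheduler and submits a best-effort job, and the slide does not say whether "current #pjobs" includes waiting jobs, so that is left to the caller.

```python
import time
from typing import Callable

MAX_PILOT_JOBS = 30          # slide: max #pjobs
MAX_PILOT_JOBS_WAITING = 7   # slide: max #pjobs waiting
POLL_INTERVAL_S = 15 * 60    # slide: sleep(15 minutes)


def submission_step(current_jobs: int, waiting_jobs: int,
                    submit: Callable[[], None]) -> str:
    """One iteration of the loop: submit at most one pilot job, return a status message."""
    if current_jobs >= MAX_PILOT_JOBS:
        return "Maximum number of pilot jobs reached"
    if waiting_jobs >= MAX_PILOT_JOBS_WAITING:
        return "Too many pilot jobs waiting"
    # Hypothetical callback: starts XWHEP workers on one node in the best-effort queue.
    submit()
    return "Submitted one pilot job to the best-effort queue"


def run_forever(count_jobs: Callable[[], int],
                count_waiting: Callable[[], int],
                submit: Callable[[], None],
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll the scheduler every 15 minutes, as in the pseudocode (never returns)."""
    while True:
        print(submission_step(count_jobs(), count_waiting(), submit))
        sleep(POLL_INTERVAL_S)
```

Keeping the decision in `submission_step` separates the policy from the polling, so the thresholds can be checked without waiting or touching a real scheduler.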
[Figure: number of XWHEP workers, G5K pilot jobs running and G5K pilot jobs waiting, versus time in hours (0 to 96)]
First results
[Figure, SpeQuloS not used: jobs completed and number of Best-Effort workers versus time in seconds]
[Figure, SpeQuloS used: jobs completed, number of additional Cloud workers and of Best-Effort workers versus time in seconds]
→ The tail has disappeared, but there are not yet enough experiments to conclude.
Using Grid’5000 as a Cloud
• For experiments, using Clouds is difficult:
  - Using public Clouds like EC2 is costly
  - Using private Clouds:
      * Is complex to deploy and maintain
      * Needs a lot of hardware to be useful.
• Grid’5000 has most IaaS Cloud features
  - On-demand resource availability
  - User-driven execution environment using deployment
  - Remote access through an API
• Idea: conduct SpeQuloS experiments with Grid5000 as a Cloud.
The Grid’5000 libcloud driver
• libcloud:
  - Python API used in SpeQuloS
  - Common interface to various Cloud technologies
  - Supports: Amazon EC2, Eucalyptus, OpenNebula, RackSpace, Nimbus, ...
• We propose a new "driver" for libcloud: the Grid’5000 driver
  - Uses the Grid’5000 API
  - Implements standard libcloud features:

    libcloud feature       Grid5000 implementation
    Start/Stop instance    Interactive job submission
    List Node Sizes        List available nodes
    List Disk Image        List available environments to deploy

• A little work remains to be completed
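The feature mapping above can be illustrated with a self-contained sketch. It is not the actual driver and does not call libcloud or the real Grid’5000 API: `FakeGrid5000` is a hypothetical in-memory stand-in, and only the method names of `Grid5000DriverSketch` (`create_node`, `destroy_node`, `list_sizes`, `list_images`) are borrowed from libcloud's `NodeDriver` interface.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class FakeGrid5000:
    """In-memory stand-in for the Grid'5000 API (hypothetical, illustration only)."""
    free_nodes: List[str]
    environments: List[str]
    jobs: Dict[int, str] = field(default_factory=dict)  # job id -> reserved node
    _ids: count = field(default_factory=lambda: count(1))

    def submit_job(self, environment: str) -> int:
        """Reserve one free node and deploy `environment` on it; return the job id."""
        if environment not in self.environments:
            raise ValueError(f"unknown environment: {environment}")
        if not self.free_nodes:
            raise RuntimeError("no node available")
        job_id = next(self._ids)
        self.jobs[job_id] = self.free_nodes.pop(0)
        return job_id

    def delete_job(self, job_id: int) -> None:
        """End the job and release its node."""
        self.free_nodes.append(self.jobs.pop(job_id))


class Grid5000DriverSketch:
    """Maps libcloud-style calls onto job-oriented operations, following the table."""

    def __init__(self, api: FakeGrid5000) -> None:
        self.api = api

    def list_sizes(self) -> List[str]:
        return list(self.api.free_nodes)       # List Node Sizes -> available nodes

    def list_images(self) -> List[str]:
        return list(self.api.environments)     # List Disk Image -> environments

    def create_node(self, image: str) -> int:
        return self.api.submit_job(image)      # Start instance -> job submission

    def destroy_node(self, node_id: int) -> bool:
        self.api.delete_job(node_id)           # Stop instance -> job deletion
        return True


api = FakeGrid5000(free_nodes=["node-1", "node-2"], environments=["env-a"])
driver = Grid5000DriverSketch(api)
job = driver.create_node("env-a")
print(job, driver.list_sizes())
```

The point of the adapter is that code written against the common interface does not need to know that an "instance" is in fact a job on a reserved node.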
Experimentation results
• Grid5000 is used as the Best-Effort DCI
• Both EC2 and Grid5000 are used as Cloud resources
• Grid5000 as a Cloud is hosted on the Rennes site
[Figure: jobs completed and number of workers versus time in seconds; series: additional Cloud workers from G5K, additional Cloud workers from Amazon EC2, Best-Effort workers only]
Conclusion
Ongoing work on SpeQuloS
• Improving SpeQuloS basics:
  - improve real-time estimation of completion time (moving average)
  - improve the detection of tails by fitting a statistical distribution to the BoT execution archive
  - improve the scheduling of Cloud resources, i.e. when and how many Cloud Workers to start?
→ More G5K experiments to validate SpeQuloS
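As an illustration of the moving-average idea mentioned above, the sketch below estimates the remaining time of a BoT from its recent completion rate. It is an assumption about how such an estimator might look, not the SpeQuloS implementation; the class name, the window size and the sampling scheme are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class CompletionEstimator:
    """Estimates remaining time from the completion rate over a sliding window."""

    def __init__(self, total_tasks: int, window: int = 5) -> None:
        self.total_tasks = total_tasks
        # window + 1 samples delimit `window` intervals.
        self.samples: Deque[Tuple[float, int]] = deque(maxlen=window + 1)

    def observe(self, timestamp: float, completed: int) -> None:
        """Record how many tasks were completed at `timestamp` (seconds)."""
        self.samples.append((timestamp, completed))

    def rate(self) -> Optional[float]:
        """Time-weighted average of tasks per second over the window, or None."""
        if len(self.samples) < 2:
            return None
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None
        return (c1 - c0) / (t1 - t0)

    def remaining_seconds(self) -> Optional[float]:
        """Remaining tasks divided by the current rate, or None if no progress."""
        current = self.rate()
        if current is None or current <= 0:
            return None
        return (self.total_tasks - self.samples[-1][1]) / current


est = CompletionEstimator(total_tasks=100, window=3)
for t, done in [(0, 0), (10, 20), (20, 40), (30, 60), (40, 80)]:
    est.observe(t, done)
print(est.rate(), est.remaining_seconds())
```

Such an estimate is only meaningful while the completion rate is steady; during a tail the rate collapses and the prediction grows quickly, which is itself a possible signal that extra resources are needed.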
Remarks on Grid5000 utilization
• Grids’ best-effort queues as a Best-Effort DCI
• Grid5000 can be used as a Cloud
  - Without additional virtualization
  - Not as flexible as a real Cloud (deployment, node size, isolated network)
  - libcloud driver release will be announced on the mailing list