
Page 1: Towards Autonomic Grids

Towards Autonomic Grids

Cécile Germain-Renaud

Laboratoire de Recherche en Informatique

Université Paris-Sud - CNRS - INRIA

Page 2: Towards Autonomic Grids

e-science infrastructures

2003 NSF Atkins Report: Revolutionizing Science and Engineering through Cyberinfrastructure

Grids of computational centers

Comprehensive libraries of digital objects

Well-curated collections of scientific data

Online instruments and vast sensor arrays

Convenient software toolkits

The largest (circ. 26 km), fastest (14 TeV), coldest (1.9 K), emptiest (10^-13 atm) machine.

Page 4: Towards Autonomic Grids

e-science infrastructures

Storage and analysis of 15 PB/year

Page 5: Towards Autonomic Grids

e-science infrastructures

The largest (40000 CPUs), most complex (200 VOs), most distributed (250 sites), most used (300K jobs/day) computing machine

Page 6: Towards Autonomic Grids

How we configure our grids

Courtesy James Casey talk @EGEE09

Page 7: Towards Autonomic Grids

Outline

1 The grid ecosystem

2 Grids and Autonomic Computing

3 The Grid Observatory

4 Learning grid models
    On-line fault detection
    Model Selection

5 Model-free policies
    Policy evaluation
    Reinforcement learning for responsive grids

Page 8: Towards Autonomic Grids

The grid ecosystem Grids and Autonomic Computing The Grid Observatory Learning grid models Model-free policies

e-science infrastructures

The classical definition of grids

A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities.

I. Foster, C. Kesselman, The Grid, 1998

An old dream

UCLA press release on the creation of Arpanet, 1969

Page 9: Towards Autonomic Grids


The niches in the ecosystem

Page 10: Towards Autonomic Grids


Grids are not about technology, but about sharing

Ian Foster's definition, 2000

Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.

The sharing is necessarily highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form a virtual organization.

Consumers: large-scale international collaborations

Different users with differentiated requirements across and within the collaborations

Page 11: Towards Autonomic Grids


Grids are not about technology, but about sharing

Providers: national and regional institutions

Organized in National Grid Initiatives, coordinated by EGI

Page 12: Towards Autonomic Grids


Grids are not about technology, but about sharing

Operators: local sites, with temporary EU support (EGI-Inspire)

Configuration, prioritization, monitoring, accounting, ...

Page 13: Towards Autonomic Grids


Do Datacenters and Cloud make Grid obsolete?

Page 14: Towards Autonomic Grids


*-aaS

Courtesy William Vambenepe - slides from the Cloud Connect keynote Freeing SaaS from Cloud

Page 15: Towards Autonomic Grids


Grids and Clouds

IaaS: on-demand, elastic, virtualization-based provisioning

A single-objective optimization target: pay less by turning on and off at the scale of minutes rather than days or weeks

Convergence path: Grids over Clouds or Clouds of Grids?

EU project Stratuslab

SaaS: the core of the IT process lies in deploying and orchestrating heterogeneous software components, and having them "in the cloud" does not help much

Page 16: Towards Autonomic Grids


Autonomic Computing

Computing systems that manage themselves in accordance with high-level objectives from humans

Kephart and Chess, A Vision of Autonomic Computing, IEEE Computer 2003

AUTONOMIC VISION & MANIFESTO http://www.research.ibm.com/autonomic/manifesto/

Relation with Machine Learning: I. Rish tutorial @ECML 2006

Self-managing system with the ability of

Self-healing: detect, diagnose and repair failures

Self-configuring: automatically incorporate and configure components

Self-optimizing: ensure the optimal functioning w.r.t. high-level requirements

Self-protecting: anticipate and defend against security breaches

On dynamical non-steady state systems

Page 17: Towards Autonomic Grids


Autonomic Computing

Page 18: Towards Autonomic Grids


Autonomic Computing

Page 19: Towards Autonomic Grids


Autonomic Grids

Emerging behaviour as the result of sites' and stakeholders' decisions

Coupled usage: Virtual Organizations, community software and activity

Feedback loops in the middleware

Incomplete and noisy information

We need

Inference of models for middleware components and applications, users and usage profiles, user interactions, inconsistencies

Self-configuration and self-optimization for management policies

Self-healing across middleware and applications

Page 20: Towards Autonomic Grids


Goals

Grid digital assets curation

Collecting verifiable digital assets

Providing digital asset search and retrieval

Certification of the trustworthiness and integrity of the collection content

Semantic and ontological continuity and comparability of the collection

Building the domain knowledge

Dimensionality and volume reduction: getting rid of the massive redundancy in operational logs

Answering operational issues

Descriptive/generative/predictive models

Design and validation of model-free policies

Page 21: Towards Autonomic Grids


Support and collaborations

Page 22: Towards Autonomic Grids


Methods

Focused on EGEE/EGI

The best approximation of the current needs of e-science

Extensive monitoring facilities

Traces were discarded after operational usage, and in any case not available to the scientific community

Now available without grid certificate

www.grid-observatory.org

Page 24: Towards Autonomic Grids


Grids are complex systems

Users/Files/Clients worker nodes graph display with AVIZ GraphDice

Page 25: Towards Autonomic Grids


Grids are complex systems

Users in green, file groups in purple. Rightmost is most "active".

And also [Lovro Iliasic, PhD, Computational Grids as Complex Networks]

Page 26: Towards Autonomic Grids


Issues

Large non-stationary system

Courtesy M. Lassnig et al. Austrian Grid Symp. 09

Trends

Academic events

Scientific events

Software events

Page 27: Towards Autonomic Grids


On-line fault detection

Abrupt changepoint detection

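Page-Hinkley statistics: jumps in the mean

$p_t$: changing distribution

$\bar{p}_t = \frac{1}{t} \sum_{\ell=1}^{t} p_\ell$

$m_t = \sum_{\ell=1}^{t} (p_\ell - \bar{p}_\ell + \delta)$

$M_t = \max_{1 \le \ell \le t} \{ m_\ell \}$

$PH_t = M_t - m_t$

CUSUM test: if $PH_t > \lambda$, change detected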

First Application

Blackhole detection

Validation requires expert interpretation
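The Page-Hinkley test above can be sketched in a few lines. This is only an illustration of the recurrences on the slide, not the code used on the grid traces; the function name `page_hinkley` and the default values of `delta` and `lam` are assumptions chosen for the example.

```python
def page_hinkley(stream, delta=0.01, lam=1.0):
    """Return the 1-based index of the first detected change, or None.

    Follows the slide's recurrences: running mean, cumulative deviation m_t,
    its running maximum M_t, and the statistic PH_t = M_t - m_t.
    """
    total = 0.0            # p_1 + ... + p_t
    m = 0.0                # m_t
    big_m = float("-inf")  # M_t
    for t, p in enumerate(stream, start=1):
        total += p
        mean = total / t               # running mean up to t
        m += p - mean + delta          # cumulative deviation
        big_m = max(big_m, m)
        if big_m - m > lam:            # PH_t > lambda: change detected
            return t
    return None


# Hypothetical signal: the mean drops from 1 to 0 after 50 observations.
signal = [1.0] * 50 + [0.0] * 50
change_at = page_hinkley(signal)
```

With this sign convention the statistic grows when the mean drops, so an upward jump would be tracked with the symmetric form (minimum instead of maximum).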

Page 28: Towards Autonomic Grids


On-line fault detection

StrAP: On-line clustering aka Streaming

Affinity Propagation (AP) [Frey2007]

Statistical physics algorithm for clustering (based on message passing)

A cluster = an exemplar (akin to k-centers)

The model = set of {exemplar, frequency}

Why AP?

Traceability: real jobs as exemplars, because of categorical variables, e.g., user id, queue name, etc.

No prior knowledge of K, the number of clusters

Quasi-optimality w.r.t. information loss -> stability [Meila2006]

Page 29: Towards Autonomic Grids


On-line fault detection

From AP to Large-scale Data Streaming

1 SCALABILITY: from $O(N^2 \log N)$ to $O(N^{\frac{h+2}{h+1}})$

Hierarchical Affinity Propagation

Negligible information loss (proof in the paper)

Page 30: Towards Autonomic Grids


On-line fault detection

From AP to Large-scale Data Streaming

2 Non stationary distribution

Various Virtual Organizations

number and expertise of users

Streaming AP (StrAP)

Page 31: Towards Autonomic Grids


On-line fault detection

Adaptive change detection test

Self-adapt λ ≡ An optimization problem

BIC: $F_\lambda = \frac{1}{|C|} \sum_{i=1}^{|C|} \left( \frac{1}{n_i} \sum_{e_j \in C_i} d(e_j, e_i^*) \right) + \varphi \frac{\rho}{2} \log N + \eta O_t$

$\propto$ loss + size of model + fraction of outliers

OPTIMIZATION:

ε-greedy search from a finite set of λ values

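$\lambda = \arg\min_\lambda \{ E(F_\lambda) \}$

Candidates:      $\lambda_1$       $\lambda_2$       $\lambda_3$       $\lambda_4$       ...
Expected cost:   $E(F_{\lambda_1})$  $E(F_{\lambda_2})$  $E(F_{\lambda_3})$  $E(F_{\lambda_4})$  ...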

Gaussian Process Regression based on {λi ,Fλi}

a continuous value of λ is generated
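The ε-greedy search over a finite set of λ values can be illustrated as follows. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical callback returning an observed cost $F_\lambda$ for one run of the detector, the running averages stand in for $E(F_\lambda)$, and the Gaussian Process Regression variant is not shown.

```python
import random


def epsilon_greedy_lambda(candidates, evaluate, rounds=200, epsilon=0.1, seed=0):
    """Pick the lambda with the lowest average observed cost.

    candidates: finite list of lambda values.
    evaluate:   hypothetical function lambda -> observed cost F_lambda.
    """
    rng = random.Random(seed)
    counts = {lam: 0 for lam in candidates}
    means = {lam: 0.0 for lam in candidates}   # running estimate of E(F_lambda)
    for _ in range(rounds):
        untried = [lam for lam in candidates if counts[lam] == 0]
        if untried:
            lam = untried[0]                    # try every candidate once
        elif rng.random() < epsilon:
            lam = rng.choice(candidates)        # explore
        else:
            lam = min(candidates, key=lambda c: means[c])  # exploit
        cost = evaluate(lam)
        counts[lam] += 1
        means[lam] += (cost - means[lam]) / counts[lam]    # incremental mean
    return min(candidates, key=lambda c: means[c]), means


# Toy cost with its minimum at lambda = 3 (illustration only).
best, estimates = epsilon_greedy_lambda([1.0, 2.0, 3.0, 4.0],
                                        lambda lam: (lam - 3.0) ** 2)
```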

Page 32: Towards Autonomic Grids


On-line fault detection

G-StrAP: A Grid Dashboard

Online Monitoring

[Figure: two bar charts of the percentage of jobs assigned (%) to each cluster and to the reservoir; each exemplar is shown as a job vector]

LogMonitor is getting clogged

Off-line Analysis

Page 33: Towards Autonomic Grids


Model Selection

The Piecewise Autoregressive model

AR process: $X_t = \gamma + \phi_1 X_{t-1} + \ldots + \phi_p X_{t-p} + \varepsilon_t$

The model parameters for piecewise AR

Number of segments $m$

Breakpoint locations/segment sizes $(n_j)_{j=1 \ldots m}$

AR orders $(p_j)_{j=1 \ldots m}$

AR parameters $(\Psi_j)_{j=1 \ldots m}$

Very large model space

Segment 1, $0 < t \le 512$: $X_t = 0.9 X_{t-1} + \varepsilon_t$

Segment 2, $512 < t \le 768$: $X_t = 1.69 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$

Segment 3, $768 < t \le 1024$: $X_t = 1.32 X_{t-1} - 0.81 X_{t-2} + \varepsilon_t$
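To make the three-segment example concrete, the process can be simulated directly from its definition. The sketch below assumes Gaussian noise with standard deviation `sigma`, a zero intercept as in the example, and zero values before $t = 1$; the names `SEGMENTS` and `simulate_piecewise_ar` are illustrative.

```python
import random

# (last time index of the segment, AR coefficients phi_1..phi_p)
SEGMENTS = [
    (512, [0.9]),
    (768, [1.69, -0.81]),
    (1024, [1.32, -0.81]),
]


def simulate_piecewise_ar(segments=SEGMENTS, sigma=1.0, seed=0):
    """Simulate X_t = sum_i phi_i * X_{t-i} + eps_t, segment by segment."""
    rng = random.Random(seed)
    x = []
    for t in range(1, segments[-1][0] + 1):
        # Coefficients of the segment that contains time t.
        coeffs = next(c for end, c in segments if t <= end)
        value = rng.gauss(0.0, sigma)            # eps_t
        for lag, phi in enumerate(coeffs, start=1):
            if len(x) >= lag:                    # missing history counts as 0
                value += phi * x[-lag]
        x.append(value)
    return x


series = simulate_piecewise_ar()
```

Fitting goes the other way: given only `series`, the number of segments, the breakpoints and the orders all have to be recovered, which is why the model space is so large.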

Page 34: Towards Autonomic Grids


Model Selection

Minimum Description Length model selection for PAR

[Davis, Lee, Rodriguez-Yam, J. American Statist. Assoc. 2006.]

The MDL principle: the best-fitting model is the one that produces the shortest code length that completely describes the observed data $y$

$CL_F(y) = CL_F(F) + CL_F(e|F)$

$CL_F(F)$: description of the model

$CL_F(e|F)$: description of the residuals, i.e. what is not explained by the model

$CL = \log m + (m+1) \log n + \sum_{j=1}^{m+1} \left[ \log p_j + \frac{p_j + 2}{2} \log n_j + \frac{n_j}{2} \log(2 \pi \sigma_j^2) \right]$
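The code length above is straightforward to evaluate once a candidate segmentation has been fitted. The following sketch assumes natural logarithms, $m + 1$ fitted segments as in the sum, at least two segments, and AR orders $p_j \ge 1$ so that every logarithm is defined; `par_code_length` is an illustrative name, and the search over segmentations is not shown.

```python
import math


def par_code_length(segments):
    """Code length of a piecewise AR model.

    segments: list of (p_j, n_j, sigma2_j) tuples, one per segment, giving the
    AR order, the segment length and the residual variance.
    """
    m = len(segments) - 1                      # the sum runs over m + 1 segments
    if m < 1 or any(p < 1 for p, _, _ in segments):
        raise ValueError("sketch assumes at least 2 segments and orders >= 1")
    n = sum(n_j for _, n_j, _ in segments)     # total series length
    total = math.log(m) + (m + 1) * math.log(n)
    for p_j, n_j, sigma2_j in segments:
        total += math.log(p_j)
        total += (p_j + 2) / 2 * math.log(n_j)
        total += n_j / 2 * math.log(2 * math.pi * sigma2_j)
    return total


# Two candidate descriptions of the same 1024 points with equal residual
# variance: the extra breakpoint only adds model cost.
two_segments = par_code_length([(1, 512, 1.0), (2, 512, 1.0)])
three_segments = par_code_length([(1, 256, 1.0), (1, 256, 1.0), (2, 512, 1.0)])
```

An extra segment is only selected when the drop in residual variance outweighs the added description cost.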

Page 35: Towards Autonomic Grids


Model Selection

Results on the workload processes

The amount of unterminated work in the system

Smoothed workload difference

Typically low-order AR models

Long segments

CE     no. of     segment start   segment end   smallest root   Ljung-Box test on
       segments   [days]          [days]        abs. value      residuals (p-value)
CE-A   18         158.91          196.53        1.5915          0.05
CE-B   19         109.61          160.65        2.1563          0.04
CE-C   17         104.86          149.31        5.5711          0.21
CE-D   27         151.39          190.16        1.1062          0.05

[T. Elteto et al. Discovering Piecewise Linear Models of Grid Workload, CCGrid 2010]

Page 36: Towards Autonomic Grids


Model Selection

Model validation

PAR: Ljung-Box test - whiteness of the AR residuals

Stability: Bootstrapping - stable breakpoints

Page 37: Towards Autonomic Grids


Model Selection

Model reconciliation – bootstrap aggregation

Outcome: a simple and robust model describing the essential part of the workload process.

Page 38: Towards Autonomic Grids


Policy evaluation

Evaluation of the matchmaking scheduling policy

ART: Actual Response Time = queuing delay at the CE

ERT: Expected Response Time, copernican principle, gLite

Question: how good is the prediction?

Question: what is your definition of good predictor?

Root Mean Squared Error?

Close statistical distribution, at normal regime, in the tail?

Correlation of time series?

ROC (Receiver Operating Characteristic): cost-benefit relation

Heterogeneous data

Page 39: Towards Autonomic Grids


Policy evaluation

Evaluation of the matchmaking scheduling policy

Overall

The distributions are not consistent

RMSE: Atl. 7.94E4, Biom. 7.2E3

Correlation (subsampling at 900 s) is not convincing

Page 40: Towards Autonomic Grids


Policy evaluation

Evaluation of the matchmaking scheduling policy

A la BQP (Batch Queue Predictor): how often does the prediction lie within a reasonable distance of the actual? Modified because BQP considers only upper bounds

ERT is a classifier; the classes are intervals of the value range

Intervals of exponentially increasing size

ROC: True Positive Rate vs False Positive Rate
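Treating ERT as a classifier over intervals of exponentially increasing size can be sketched as below. The interval boundaries (powers of two, in seconds) and the one-vs-rest reading of true and false positive rates are assumptions made for the illustration; the slides do not specify them.

```python
import math


def interval_class(seconds):
    """Class 0 is [0, 1); class k >= 1 is [2**(k-1), 2**k)."""
    return 0 if seconds < 1.0 else int(math.floor(math.log2(seconds))) + 1


def rates_per_class(predicted, actual):
    """One-vs-rest (TPR, FPR) for each interval class.

    predicted: ERT values, actual: ART values, both in seconds.
    A rate is None when its denominator is empty.
    """
    pred = [interval_class(v) for v in predicted]
    act = [interval_class(v) for v in actual]
    rates = {}
    for c in sorted(set(pred) | set(act)):
        tp = sum(p == c and a == c for p, a in zip(pred, act))
        fp = sum(p == c and a != c for p, a in zip(pred, act))
        pos = sum(a == c for a in act)
        neg = len(act) - pos
        rates[c] = (tp / pos if pos else None, fp / neg if neg else None)
    return rates


# Hypothetical predictions and observed queuing delays.
rates = rates_per_class(predicted=[0.5, 3, 3, 100], actual=[0.2, 2.5, 10, 90])
```

Each class contributes one point in the ROC plane, so short and long delays can be judged separately.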

Page 41: Towards Autonomic Grids


Reinforcement learning for responsive grids

Reinforcement learning for resource provisioning in grids

A multi-objective scheduling and dimensioning problem

Users: Differentiated QoS

Stakeholders: Fairness

Administrators: Utilization

[Figure: probability versus execution time [s] (logarithmic axis, 10^0 to 10^6) for all data, atlas and biomed]

Goals

Elastic resource provisioning: the context is Grids over Clouds - Infrastructure as a Service (IaaS)

Realistic hypotheses: organized sharing and mutualization, no central control

Autonomics: Model-free policies and configuration-free implementations

Page 42: Towards Autonomic Grids


Reinforcement learning for responsive grids

Formalisation

The scheduling MDP

State: descriptive variables of a site (queue, cluster)

Action: descriptive variables of a job (VO, execution time)

The dimensioning MDP

Action: number of computing nodes to maintain in activity

Policy learning

SARSA algorithm

Continuous state-action space: non-linear regression of $Q : (s, a) \to r$

Neural Network and Echo State Network
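For reference, the core SARSA update is shown below in its simplest tabular form. This is a deliberate simplification: the slides use a continuous state-action space with a neural or echo state network as the regressor for Q, whereas this sketch assumes discrete, hashable states and actions, and the values of `alpha` and `gamma` are illustrative.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One on-policy update: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    q maps (state, action) pairs to value estimates and is updated in place.
    """
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


# Hypothetical transition: in state "s0" the scheduler picked job "a0",
# received reward 1.0, and will pick "a1" in the next state "s1".
q_table = defaultdict(float)
q_table[("s1", "a1")] = 1.0
new_value = sarsa_update(q_table, "s0", "a0", 1.0, "s1", "a1", alpha=0.5, gamma=0.5)
```

With function approximation, the same target `r + gamma * Q(s', a')` becomes the regression label for the network instead of an in-place table update.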

Page 43: Towards Autonomic Grids


Reinforcement learning for responsive grids

The Rewards

The Responsiveness utility for job j is

$W_j = \frac{\text{execution time}_j}{\text{execution time}_j + \text{waiting time}_j}$   (1)

The Fairness utility for job j is

$F_j = 1 - \frac{\max_k (w_k - S_{kj})^+}{M}$   (2)

where $x^+ = x$ if $x > 0$ and 0 otherwise, $w_k$ is the target share of VO k, and $S_{kj}$ is the share received by VO k up to the election of job j.

The Utilization reward $U_n$ at time $T_n$ is

$U_n = \frac{f_n}{\sum_{k=0}^{n} P_k (T_{k+1} - T_k)}$   (3)

where $(T_1, \ldots, T_N)$ are the instants of decision making, $P_k$ is the number of processors allocated in the interval $[T_k, T_{k+1}]$ for $1 \le n < N$, and $f_n$ is the sum of the execution times of jobs completed at time $T_n$.
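The three rewards translate directly into code. In this sketch the function names are illustrative, and the normalizing constant M of equation (2) is passed in as a parameter because the slide does not define it.

```python
def responsiveness(execution_time, waiting_time):
    """Equation (1): fraction of the job's time in the system spent executing."""
    return execution_time / (execution_time + waiting_time)


def fairness(target_shares, received_shares, m_const):
    """Equation (2): 1 minus the largest positive share deficit, scaled by M."""
    deficit = max(max(w - s, 0.0) for w, s in zip(target_shares, received_shares))
    return 1.0 - deficit / m_const


def utilization(f_n, processors, instants):
    """Equation (3): completed work over allocated processor time.

    processors[k] is P_k on [T_k, T_{k+1}], so instants needs one more
    entry than processors.
    """
    allocated = sum(p * (instants[k + 1] - instants[k])
                    for k, p in enumerate(processors))
    return f_n / allocated


# Hypothetical values: a 60 s job that waited 20 s, two VOs with equal
# target shares, and two decision intervals of 10 s each.
w_j = responsiveness(60.0, 20.0)
f_j = fairness([0.5, 0.5], [0.25, 0.75], m_const=0.5)
u_n = utilization(30.0, processors=[2, 1], instants=[0.0, 10.0, 20.0])
```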

Page 44: Towards Autonomic Grids


Reinforcement learning for responsive grids

Experimental results on EGEE traces

[Figure: CDF of the queueing delay (sec, logarithmic axis) for EGEE-INTER, ORA-INTER-0.5, ORA-INTER-1.0, EST-INTER-0.5, EST-INTER-1.0]

Queuing delays - interactive jobs - Rigid

[Figure: fairshare difference (ELA-ORA-0.5 - EGEE) versus arrival times (sec)]

Dynamics of the fairshare - All jobs - Rigid

[Figure: CDF of the queueing delay (sec, logarithmic axis) for ELA-ORA-0.5, ELA-ORA-1.0, ELA-EST-0.5, ELA-EST-1.0, RIG-ORA-0.5, RIG-ORA-1.0]

Queuing delays - interactive jobs - Elastic

[J. Perez et al. JoGC 8/3 Sep. 2010]

Page 45: Towards Autonomic Grids

Conclusion