Grid infrastructure analysis with a simple flow model Andrey Demichev, Alexander Kryukov, Lev Shamardin, Grigory Shpiz Scobeltsyn Institute of Nuclear Physics, Moscow State University




Why a grid simulator?

A simulator allows easy changes to the grid structure and behavior.

Grid behavior under stress conditions: site failures, job execution failures, an unexpected rise in the job load.

Bottleneck analysis

System structure optimization


Different approaches to job flow simulation

Individual jobs tracking

Monte-Carlo simulation of job submission. The model system simulates the stages of a job's life, from submission to completion or failure.

Easier to implement.

Examples: simgrid, gridsim, beosim
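The individual-job approach can be sketched in a few lines. This is a minimal Monte-Carlo illustration, not the method used by simgrid, gridsim, or beosim: the stage names and the per-stage failure probabilities are hypothetical values chosen only for the example.

```python
import random

# Hypothetical life-cycle stages with made-up per-stage failure probabilities.
STAGES = [("submitted", 0.01), ("matched", 0.02), ("running", 0.05)]

def simulate_job(rng):
    """Walk one job through every stage; return the stage where it failed, or 'done'."""
    for stage, p_fail in STAGES:
        if rng.random() < p_fail:
            return stage
    return "done"

def simulate(n_jobs, seed=0):
    """Track n_jobs individually and count how many ended in each outcome."""
    rng = random.Random(seed)
    outcomes = {}
    for _ in range(n_jobs):
        result = simulate_job(rng)
        outcomes[result] = outcomes.get(result, 0) + 1
    return outcomes
```

The result is a count of jobs per outcome; quantities such as the failure rate still have to be derived from these counts afterwards.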


Different approaches to job flow simulation

Statistical models of job flows

Simulation of job flows (i.e. "jobs/second"). The model system consists of boxes, each of which takes a number of job flows as input and produces a number of job flows as output.

The output of such a model is exactly the numbers we are interested in.

Examples: optorsim
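A "box" in the statistical approach could look roughly like the sketch below. The function name `box` and the idea of splitting the total by fixed fractions are assumptions made for illustration; they are not taken from optorsim.

```python
def box(inputs, split):
    """A hypothetical 'box': sum the input flows (jobs/second) and
    distribute the total over named outputs by the given fractions."""
    total = sum(inputs)
    return {name: total * fraction for name, fraction in split.items()}

# Two input flows, one quarter of the total goes to "a", the rest to "b".
flows = box([0.5, 1.5], {"a": 0.25, "b": 0.75})
```

No individual job is ever represented; the box works directly on rates.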


Goals

Create a simple, realistic model of the grid.

The model should be capable of answering the questions:

Will the grid handle the required constant average job load?

Can it be reorganized to handle the load?


Structure of a workload management system

[Figure: diagram labelled Job Registration, Job Submission & Status, Resource Status Information, Input queue, Planner, Output queue; flows labelled "job" and "job status, output".]


Simple flow-based model

Simulation of an LCG-like grid

Four general node types:

User Interface (UI): the source of the jobs in the system.

Resource Broker (RB): accepts the jobs, queries the information system, and dispatches the jobs to the CEs.

Computing Element (CE): where the jobs are executed.

BDII nodes: the information system.


User Interface

A UI may be connected to a number of RBs.

Each UI generates a constant flow of job requests toward a connected RB.

[Figure: diagram of UI and RB nodes.]
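The input flow seen by each RB can be computed by summing the contributions of the UIs connected to it. The sketch below assumes that a UI connected to several RBs splits its flow evenly between them; the slides do not state the split rule, so this is an assumption, as are all names used here.

```python
from collections import defaultdict

def rb_input_flows(ui_flow, connections):
    """Sum the job request flow (jobs/second) arriving at each RB.

    ui_flow:     {ui_name: constant flow generated by that UI}
    connections: {ui_name: list of RB names the UI is connected to}
    Assumption: a UI splits its flow evenly between its RBs.
    """
    incoming = defaultdict(float)
    for ui, rbs in connections.items():
        for rb in rbs:
            incoming[rb] += ui_flow[ui] / len(rbs)
    return dict(incoming)

flows = rb_input_flows(
    {"ui1": 1.0, "ui2": 0.5},
    {"ui1": ["rb1", "rb2"], "ui2": ["rb2"]},
)
```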


Resource Broker

RBs are connected to BDIIs and CEs, and have UIs connected to them. An RB is characterized by:

the maximum input flow of job requests

the number of information system lookups per job and the maximum flow of information system lookups

the maximum job flow to the CEs
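These RB parameters map naturally onto a small record type. The following is a sketch under two assumptions not stated on the slide: every accepted job is forwarded to the CEs (so the flow to the CEs equals the input flow), and the lookup flow is the input flow multiplied by the lookups per job. The class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ResourceBroker:
    max_input_flow: float    # maximum input job request flow, jobs/second
    lookups_per_job: float   # information system lookups needed per job
    max_lookup_flow: float   # maximum lookup flow, lookups/second
    max_ce_flow: float       # maximum job flow to the CEs, jobs/second

    def lookup_flow(self, input_flow):
        """Lookups/second sent to the BDII for a given input job flow."""
        return input_flow * self.lookups_per_job

    def overflows(self, input_flow):
        """Names of the limits exceeded by the given input flow."""
        checks = {
            "input": input_flow > self.max_input_flow,
            "lookups": self.lookup_flow(input_flow) > self.max_lookup_flow,
            # Assumption: flow to the CEs equals the input flow.
            "ce": input_flow > self.max_ce_flow,
        }
        return [name for name, exceeded in checks.items() if exceeded]
```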


Information System (BDII)

A BDII is characterized by the maximum flow of requests it can handle.

[Figure: diagram of UI, RB, and BDII nodes.]


Computing Elements

A CE is characterized by the maximum flow of "jobs" it can process.

All jobs are assumed to be equal.

We are not interested in the exact location of the failing CE when the grid is overloaded, so we can combine all grid CEs into one virtual CE with an effective capacity.

We could do the same for the UIs.
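Collapsing the CEs into one virtual CE can be sketched as follows. The slide does not define how the effective capacity is computed; the plain sum used here is an assumption, and the function names are hypothetical.

```python
def virtual_ce(capacities):
    """Combine per-CE capacities (jobs/second) into one virtual CE.

    Assumption: the effective capacity is the plain sum of the
    individual capacities.
    """
    return sum(capacities)

def grid_overloaded(total_job_flow, capacities):
    """True if the total job flow exceeds the virtual CE capacity."""
    return total_job_flow > virtual_ce(capacities)
```

With this simplification the model can say that the grid is overloaded, but not which CE fails first.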


Simple flow-based model

[Figure: diagram of the model with UI, RB, and BDII nodes and one virtual CE.]


Flows

Think of plumbing.

The UIs generate the flow of incoming jobs to the RBs.

Each RB generates a flow of requests to the BDII and to the CE.

The flow of requests to the CE is checked against the maximum flow the CE can process.


Overflow treatment

All overflows are monitored but not truncated.

If an overflow happens, we are interested not in its exact value but in the fact that it occurred.
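"Monitored but not truncated" can be illustrated with a chain of stages. In the sketch below the flow is passed downstream unchanged even when a stage's capacity is exceeded, and only the fact of the overflow is recorded. The function name and the stage list format are hypothetical.

```python
def propagate(flow, stages):
    """Push a flow through a chain of (name, capacity) stages.

    The flow is never clipped to a stage's capacity; each exceeded
    capacity is only recorded, so downstream stages see the full flow.
    """
    overflowed = []
    for name, capacity in stages:
        if flow > capacity:
            overflowed.append(name)
        # No truncation: `flow` is passed on unchanged.
    return flow, overflowed

out, over = propagate(3.0, [("RB", 2.0), ("CE", 2.5)])
```

Had the flow been truncated to 2.0 at the RB, the CE overflow in this example would have gone unnoticed.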


Automatic structure generation

Information published in the GOC database: there is no direct access to the GOCDB, so the data is pulled from the SAM web services.

Information published in the service configuration files: there is no direct way to determine which BDII is used by a particular RB, but gsiftp access to the RB filesystem makes it possible to read and parse the RB configuration.


Automatic structure generation: UI

No information about UIs is published. We have to guess and/or estimate.

Each site is assumed to be running a UI with some default parameters. This UI is connected to the site RBs, or to the country RBs, or to the region RBs, or to the "default" RB.
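One reading of this rule is a fallback from the narrowest scope to the widest. The sketch below assumes exactly that order (site, then country, then region, then default) and uses a hypothetical record layout; neither is confirmed by the slides.

```python
def pick_rbs(site, rbs, default_rb):
    """Choose the RBs for the UI assumed to run at `site`.

    site: dict with 'name', 'country', 'region'
    rbs:  list of dicts with 'name', 'site', 'country', 'region'
    Assumption: the narrowest scope that has any RB wins.
    """
    scopes = (("name", "site"), ("country", "country"), ("region", "region"))
    for site_key, rb_key in scopes:
        found = [rb["name"] for rb in rbs if rb[rb_key] == site[site_key]]
        if found:
            return found
    return [default_rb]
```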


Automatic structure generation: RB

RB parameters are based on measurements by the CMS collaboration ("Update on gLite WMS tests" by Andrea Sciabà).

All RBs are assumed to be able to submit jobs to all CEs.


Automatic structure generation: RB

The RB uses the BDII specified in its configuration, if this data is available.

The site BDII is used if the information is unavailable.

One of the BDIIs in the same country is used if there is no site BDII.

One of the BDIIs in the same region is used if there is no BDII in the country.

The top-level "default" BDII is used if there are no BDIIs in the region.
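The selection rules above form a priority chain, which might be written as in the sketch below. The record layout and the `configured_bdii` key are hypothetical, and "one of the BDIIs" is interpreted here as "the first match in the list", which is an assumption.

```python
def pick_bdii(rb, bdiis, default_bdii):
    """Choose the BDII used by an RB.

    rb:    dict with 'site', 'country', 'region' and, when the RB
           configuration could be read, 'configured_bdii'
    bdiis: list of dicts with 'name', 'site', 'country', 'region'
    """
    configured = rb.get("configured_bdii")
    if configured:
        return configured
    for level in ("site", "country", "region"):
        for bdii in bdiis:
            if bdii[level] == rb[level]:
                return bdii["name"]  # assumption: first match wins
    return default_bdii
```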


Automatic structure generation

For the BDII performance we use the results from the talk "LCG/gLite BDII performance measurements".

The CE performance is scaled according to the number of CPUs on each CE.
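Scaling by CPU count could be as simple as the sketch below. It assumes linear scaling with a single per-CPU job flow; the slides give neither the scaling law nor a per-CPU value, so `per_cpu_flow` is a hypothetical parameter.

```python
def ce_capacity(n_cpus, per_cpu_flow):
    """Maximum job flow (jobs/second) of a CE.

    Assumption: capacity grows linearly with the number of CPUs,
    each CPU contributing `per_cpu_flow` jobs/second.
    """
    return n_cpus * per_cpu_flow
```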


Example: Russian part of LCG

[Figure: the Russian part of LCG; legend: UI, RB, BDII.]


Conclusion

A simple flow-based model describing the job load distribution in the grid

The structure of the modeled grid is automatically updated to match the real grid structure

Parameters of nodes are based on the measured values


Conclusion

Any node connections or parameters may be overridden, which makes it possible to experiment with the grid.

Numbers for the current LCG are quite optimistic:

RBs are capable of generating a job flow that uses all available resources on the CEs, but

a clever connection between RBs and UIs is required, i.e. if we do not want to overflow the RBs, the UI should become a registered service.


Future plans

Distinguish different kinds of jobs.

A large number of short jobs puts a higher load on the grid than a smaller number of long jobs.

Account for the delays in the information system.

The information about CE availability held by the RB lags behind reality, causing job submission failures and resubmissions, and hence additional "background" load on the RB.
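The size of this background load can be estimated with a back-of-the-envelope model. The sketch below is not part of the presented model: it assumes that each submission fails independently with probability `p_fail` and that a failed job is resubmitted until it succeeds, so the expected number of submissions per job is 1 / (1 - p_fail).

```python
def rb_load_with_resubmissions(job_flow, p_fail):
    """Submission flow (submissions/second) seen by the RB.

    Assumptions: independent failures with probability p_fail and
    unlimited resubmission, giving 1 / (1 - p_fail) submissions per job.
    """
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return job_flow / (1.0 - p_fail)
```

Under these assumptions a 50% failure probability doubles the load on the RB.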


Acknowledgements

The research was partially supported by

INTAS-CERN Grant 2005-7509

RFBR Grant 06-07-89199