Grid infrastructure analysis with a simple flow model
Andrey Demichev, Alexander Kryukov,
Lev Shamardin, Grigory Shpiz
Scobeltsyn Institute of Nuclear Physics, Moscow State University
Lev Shamardin 2
Why a grid simulator?
A simulator allows easy changes to a grid
structure and behavior.
The grid behavior under stress conditions:
Site failures
Job execution failures
Unexpected rise in the job load
Bottleneck analysis
System structure optimization
Different approaches to job flow simulation
Individual jobs tracking
Monte-Carlo simulation of job submission. The
model system simulates the stages of the job life cycle, from
submission to completion or failure.
Easier to implement.
Examples: simgrid, gridsim, beosim
Different approaches to job flow simulation
Statistical models of job flows
Simulation of job flows (i.e. “jobs/second”). The
model system consists of boxes, each of which takes a number
of job flows as input and produces a number of job
flows as output.
The output of such a model is exactly the
numbers we are interested in.
Examples: optorsim
Goals
Create a simple realistic model of the grid
The model should be capable of answering the
questions:
Will the grid handle the required constant average job
load?
Can it be reorganized to handle the load?
Structure of a workload management system
[Figure: diagram with components Job Registration, Job Submission & Status, Resource Status Information, Input queue, Planner, Output queue; flows labeled “job” and “job status, output”]
Simple flow-based model
Simulation of an LCG-like grid
Four general node types:
User Interface (UI): the source of the jobs in the
system.
Resource Broker (RB): accepts the jobs, queries the
information system, and dispatches the jobs to the CEs.
Computing Element (CE): where the jobs are
executed.
BDII nodes: the information system.
User Interface
Each UI may be connected to a number of RBs.
Each UI generates a constant flow of job requests
toward a connected RB.
[Figure: diagram of UI and RB nodes]
Resource Broker
RBs are connected to BDIIs and CEs, and have
connected UIs. An RB is characterized by:
maximum input flow of job requests
number of information system lookups per job and the
maximum flow of information system lookups
maximum job flow to the CEs
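
The RB parameters above can be captured in a small data structure. This is a minimal sketch, not the authors' implementation: the class name, field names and numbers are hypothetical, and it assumes that every job entering the RB is forwarded to the CEs, so the output job flow equals the input job flow.

```python
from dataclasses import dataclass


@dataclass
class ResourceBroker:
    """Hypothetical container for the RB parameters (flows in jobs/second)."""
    max_input_flow: float    # maximum input flow of job requests
    lookups_per_job: float   # information system lookups per job
    max_lookup_flow: float   # maximum flow of information system lookups
    max_output_flow: float   # maximum job flow to the CEs

    def lookup_flow(self, input_flow: float) -> float:
        """Flow of lookups sent to the BDII for a given input job flow."""
        return input_flow * self.lookups_per_job

    def overflows(self, input_flow: float) -> dict:
        """Report which RB limits the given input job flow would exceed."""
        return {
            "input": input_flow > self.max_input_flow,
            "lookups": self.lookup_flow(input_flow) > self.max_lookup_flow,
            # Assumption: every accepted job is dispatched to the CEs.
            "output": input_flow > self.max_output_flow,
        }


# Illustrative numbers only; they are not measured values.
rb = ResourceBroker(max_input_flow=2.0, lookups_per_job=3.0,
                    max_lookup_flow=5.0, max_output_flow=1.5)
report = rb.overflows(1.8)
```

With these made-up numbers the input limit holds, while the lookup and output limits are exceeded.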
Informational System (BDII)
Maximum flow of requests it can handle
[Figure: diagram of UI, RB and BDII nodes]
Computing Elements
The maximum flow of “jobs” it can process
All jobs are assumed to be equal
We are not interested in the exact location of the
failing CE when the grid is overloaded, therefore
we can combine all grid CEs into one virtual CE
with the combined effective capacity.
We could actually do the same for the UIs
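
A minimal sketch of this aggregation, assuming the capacity of the virtual CE is simply the sum of the individual CE capacities (and likewise for the UIs). The function name and the numbers are hypothetical.

```python
def combined_flow(flows):
    """Sum per-node flows (jobs/second) into the flow of one virtual node."""
    return sum(flows)


ce_capacities = [0.5, 0.25, 1.0]  # hypothetical per-CE maximum job flows
ui_rates = [0.25, 0.5]            # hypothetical per-UI generated job flows

virtual_ce = combined_flow(ce_capacities)
virtual_ui = combined_flow(ui_rates)

# The grid as a whole is overloaded if the UIs generate more than the CEs can process.
grid_overloaded = virtual_ui > virtual_ce
```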
Simple flow-based model
[Figure: diagram of UI, RB and BDII nodes and the virtual CE]
Flows
Think of plumbing.
The UIs generate the flow of incoming jobs to the
RBs.
Each RB generates flows of requests to the
BDII and the CE.
The flow of requests to the CE is checked
against the maximum flow the CE can process.
Overflows treatment
All overflows are monitored but not truncated.
If an overflow occurs, we are interested not in
the exact value of the overflow, but in the fact
that it occurred.
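
The flow propagation and overflow monitoring can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all names and numbers are hypothetical, a UI connected to several RBs is assumed to split its flow evenly between them, and all CEs are represented by one virtual CE.

```python
def propagate(ui_rates, ui_to_rb, rbs, rb_to_bdii, bdii_max, ce_max):
    """Push flows through the model and return the set of exceeded limits."""
    rb_in = {name: 0.0 for name in rbs}
    for ui, rate in ui_rates.items():
        targets = ui_to_rb[ui]
        for rb in targets:
            rb_in[rb] += rate / len(targets)  # assumption: even split across RBs

    bdii_in = {name: 0.0 for name in bdii_max}
    ce_in = 0.0
    overflows = set()
    for rb, flow in rb_in.items():
        params = rbs[rb]
        lookups = flow * params["lookups_per_job"]
        if flow > params["max_input"]:
            overflows.add(("RB input", rb))
        if lookups > params["max_lookups"]:
            overflows.add(("RB lookups", rb))
        if flow > params["max_output"]:
            overflows.add(("RB output", rb))
        # Overflows are recorded but not truncated: the full flow goes downstream.
        bdii_in[rb_to_bdii[rb]] += lookups
        ce_in += flow

    for bdii, flow in bdii_in.items():
        if flow > bdii_max[bdii]:
            overflows.add(("BDII", bdii))
    if ce_in > ce_max:
        overflows.add(("CE", "virtual"))
    return overflows


# Hypothetical two-UI, two-RB grid with illustrative limits.
rb_limits = {"lookups_per_job": 2.0, "max_input": 1.0,
             "max_lookups": 4.0, "max_output": 2.0}
result = propagate(
    ui_rates={"ui1": 1.0, "ui2": 0.5},
    ui_to_rb={"ui1": ["rb1"], "ui2": ["rb1", "rb2"]},
    rbs={"rb1": dict(rb_limits), "rb2": dict(rb_limits)},
    rb_to_bdii={"rb1": "bdii1", "rb2": "bdii1"},
    bdii_max={"bdii1": 10.0},
    ce_max=1.0,
)
```

The result only records which limits were exceeded, not by how much, which matches the interest in the fact of an overflow rather than its exact value.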
Automatic structure generation
Information published in the GOC database
There is no direct access to the GOCDB, so the data is pulled
from the SAM web services
Information published in the services
configuration files
There is no direct way to determine which BDII is used by a
particular RB, but gsiftp access to the RB filesystem
allows us to read and parse the RB configuration
Automatic structure generation: UI
No information about UIs is published. We have
to guess and/or estimate.
Each site is assumed to be running a UI with
some default parameters. This UI is connected to
the site RBs, or to the country RBs, or to the
region RBs, or to the “default” RB.
Automatic structure generation: RB
RB parameters are based on the measurements
by the CMS collaboration (“Update on gLite WMS
tests” by Andrea Sciabà).
All RBs are assumed to be able to submit jobs to
all CEs.
Automatic structure generation: RB
The RB uses the BDII specified in its
configuration if this data is available.
The site BDII is used if the information is unavailable.
One of the BDIIs in the same country is used if there
is no site BDII.
One of the BDIIs in the same region is used if there
is no BDII in the country.
The top-level “default” BDII is used if there are no BDIIs
in the region.
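
The fallback chain above can be sketched as a simple lookup. This is an illustration only: the record layout, the function name and the node names are hypothetical, and "one of the BDIIs" is taken here to mean the first match in the list.

```python
def select_bdii(rb, bdiis, default_bdii):
    """Pick the BDII for an RB: configured, then site, country, region, default."""
    configured = rb.get("configured_bdii")
    if configured:
        return configured
    # Widen the search scope step by step until some BDII matches.
    for scope in ("site", "country", "region"):
        for bdii in bdiis:
            if bdii[scope] == rb[scope]:
                return bdii["name"]  # assumption: the first match is used
    return default_bdii


# Hypothetical BDII and RB records.
bdiis = [
    {"name": "bdii-a", "site": "SiteA", "country": "RU", "region": "Russia"},
    {"name": "bdii-b", "site": "SiteB", "country": "DE", "region": "Central"},
]
rb_same_country = {"site": "SiteC", "country": "RU", "region": "Russia"}
rb_configured = {"site": "SiteB", "country": "DE", "region": "Central",
                 "configured_bdii": "bdii-x"}
rb_isolated = {"site": "SiteZ", "country": "FR", "region": "West"}

chosen = [select_bdii(rb, bdiis, "bdii-default")
          for rb in (rb_same_country, rb_configured, rb_isolated)]
```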
Automatic structure generation
For the BDII performance we use the results from
the talk “LCG/gLite BDII performance
measurements”.
The CE performance is scaled according to the
number of CPUs on each CE.
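
A minimal sketch of the CE scaling, assuming the maximum job flow of a CE is directly proportional to its CPU count. The per-CPU flow value, the CE names and the CPU counts are hypothetical.

```python
def scaled_ce_capacities(cpu_counts, per_cpu_flow):
    """Scale each CE's maximum job flow in proportion to its CPU count."""
    return {name: cpus * per_cpu_flow for name, cpus in cpu_counts.items()}


cpu_counts = {"ce-a": 128, "ce-b": 32}  # hypothetical CPU counts per CE
per_cpu_flow = 1 / 4096                  # hypothetical jobs/second per CPU
capacities = scaled_ce_capacities(cpu_counts, per_cpu_flow)
```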
Example: Russian part of LCG
[Figure: grid structure diagram; legend: UI, RB, BDII]
Conclusion
A simple flow-based model describing the job
load distribution in the grid
The structure of the modeled grid is automatically
updated to match the real grid structure
Parameters of nodes are based on the measured
values
Conclusion
Any node connections or parameters may be
overridden, which allows experimenting with the grid structure.
Numbers for the current LCG are quite
optimistic:
RBs are capable of generating a job flow large enough to
use all available resources on the CEs, but
a clever connection between RBs and UIs is required,
i.e. to avoid overflowing the RBs, the UI should
become a registered service.
Future plans
Distinguish different kinds of jobs.
A large number of short jobs puts a higher load
on the grid than a smaller number of long jobs.
Account for the delays in the information
system.
The CE availability information seen by the RB lags behind
reality, causing job submission failures
and resubmissions => additional “background” load
on the RB.
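
The point about short versus long jobs can be illustrated with simple arithmetic. The sketch assumes a steady state in which each CPU runs one job at a time, so keeping a given number of CPUs busy requires a job flow equal to the CPU count divided by the job duration. The function name and the numbers are hypothetical.

```python
def flow_to_fill(cpus, job_duration_s):
    """Job flow (jobs/second) needed to keep `cpus` CPUs busy in steady state."""
    return cpus / job_duration_s


short_jobs = flow_to_fill(1000, 600)    # 10-minute jobs on 1000 CPUs
long_jobs = flow_to_fill(1000, 36000)   # 10-hour jobs on 1000 CPUs
ratio = short_jobs / long_jobs          # how much more flow short jobs require
```

For the same CPU count, the 10-minute jobs need a job flow through the RBs and the information system 60 times higher than the 10-hour jobs.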
Acknowledgements
The research was partially supported by
INTAS-CERN Grant 2005-7509
RFBR Grant 06-07-89199