Information Systems describing resources


Grid Middleware 4

David Groep, lecture series 2005-2006


Outline

Taxonomy of information systems
  hierarchies and republishers
  Grid Monitoring Architecture
  push and pull, subscriptions

Performance of an IS
  collecting information
  sensors

IS content: schemas and approaches


Grid Information Systems

Concerns data shared between administrative domains, for use by multiple people or VOs

So it does not include things like
  cluster temperature monitoring
  debugging streams
  accounting history


Classification of information systems

Which types of monitoring system are suitable for a grid?

Paper: http://www.cs.man.ac.uk/~zanikols/fgcs05.pdf

The different types are:

Level 0
  self-contained
  not accessible by programs (only via e.g. the web)
Level 1
  events are accessible remotely at the single-producer level
Level 2
  includes republishers with fixed functionality
Level 3
  supports hierarchies of republishers
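The four levels can be read as a simple decision ladder. The sketch below is only an illustration of that reading: the `SystemTraits` flags and the `classify` function are invented here and do not come from the cited paper, which uses a richer set of criteria.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    SELF_CONTAINED = 0         # no programmatic access (e.g. web pages only)
    REMOTE_PRODUCER = 1        # events readable remotely from a single producer
    FIXED_REPUBLISHERS = 2     # republishers with fixed functionality
    REPUBLISHER_HIERARCHY = 3  # republishers can be arranged in hierarchies


@dataclass
class SystemTraits:
    # Hypothetical feature flags, invented for this illustration.
    programmatic_access: bool
    has_republishers: bool
    republisher_hierarchies: bool


def classify(traits: SystemTraits) -> Level:
    """Map the feature flags onto the taxonomy level listed above."""
    if not traits.programmatic_access:
        return Level.SELF_CONTAINED
    if not traits.has_republishers:
        return Level.REMOTE_PRODUCER
    if not traits.republisher_hierarchies:
        return Level.FIXED_REPUBLISHERS
    return Level.REPUBLISHER_HIERARCHY
```

Under these assumptions, a system with programmatic access and stackable republishers comes out as level 3, which is the class the following slides argue is suitable for a grid.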


System taxonomy: levels of systems

[Figure: components used in information systems and the corresponding taxonomy levels. Graphics and concept from S. Zanikolas et al., FGCS 21 (2005) 163-188.]


Information system classes

Level 2 or level 3 systems are suitable

Reference architecture: GMA (Grid Monitoring Architecture)

Requirements
  (performance) information with a relatively short lifetime
  frequent updates
  (should) carry quality-of-information status as well

but: when you get down to it, almost anything fits in this architecture
  including directories with relatively static information
  suitable mainly for resource state


Grid Monitoring Architecture

Definition of terms and roles (GWD-GP-16-2)

Functions:

Registry (directory)
  Add, Update, Remove, Search

Producer
  Maintain Registration, Accept Query, Accept (Un)subscribe, Locate Consumer, Notify, Initiate (Un)subscribe

Consumer
  Locate Producer, Initiate Query, Initiate (Un)subscribe, Maintain Registration, Accept Notification, Accept (Un)subscribe, Locate Event Schema
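To make the three roles concrete, here is a minimal in-process sketch. The class and method names are chosen for this illustration and are not an API from GWD-GP-16-2 or from any GMA implementation. Only a subset of the listed functions is modelled: registry add/remove/search, producer registration, query, (un)subscribe and notify.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Event = dict
Callback = Callable[[Event], None]


class Registry:
    """Directory of producers keyed by event type (Add, Remove, Search)."""

    def __init__(self) -> None:
        self._producers: Dict[str, List["Producer"]] = defaultdict(list)

    def add(self, event_type: str, producer: "Producer") -> None:
        self._producers[event_type].append(producer)

    def remove(self, event_type: str, producer: "Producer") -> None:
        self._producers[event_type].remove(producer)

    def search(self, event_type: str) -> List["Producer"]:
        return list(self._producers.get(event_type, []))


class Producer:
    def __init__(self, event_type: str, registry: Registry) -> None:
        self.event_type = event_type
        self._latest: Optional[Event] = None
        self._subscribers: List[Callback] = []
        registry.add(event_type, self)  # Maintain Registration

    def accept_query(self) -> Optional[Event]:
        return self._latest

    def accept_subscribe(self, callback: Callback) -> None:
        self._subscribers.append(callback)

    def accept_unsubscribe(self, callback: Callback) -> None:
        self._subscribers.remove(callback)

    def publish(self, event: Event) -> None:
        self._latest = event
        for callback in list(self._subscribers):  # Notify
            callback(event)


class Consumer:
    def __init__(self, registry: Registry) -> None:
        self._registry = registry
        self.received: List[Event] = []

    def query(self, event_type: str) -> List[Optional[Event]]:
        # Locate Producer via the registry, then Initiate Query on each one.
        return [p.accept_query() for p in self._registry.search(event_type)]

    def subscribe(self, event_type: str) -> None:
        # Initiate Subscribe: later events arrive in self.received.
        for producer in self._registry.search(event_type):
            producer.accept_subscribe(self.received.append)
```

Note that the registry only helps the consumer find producers; the data itself flows directly between producer and consumer.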


GMA: Intermediaries

Also referred to as ‘republishers’; they make it a level-3 system

Examples

Latest Producer
  return the ‘last’ value of an event

Archiver (history producer)
  storage of historical monitoring data
  e.g. accounting records
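The difference between the two example intermediaries can be sketched as follows. This is a toy model with invented class and method names (`on_event`, `query`); a real republisher would sit between remote producers and consumers rather than being called directly.

```python
from typing import Dict, List, Optional, Tuple


class LatestProducer:
    """Republisher that keeps only the most recent value per source."""

    def __init__(self) -> None:
        self._latest: Dict[str, Tuple[float, object]] = {}

    def on_event(self, source: str, timestamp: float, value: object) -> None:
        current = self._latest.get(source)
        # Ignore events that are older than what is already stored.
        if current is None or timestamp >= current[0]:
            self._latest[source] = (timestamp, value)

    def query(self, source: str) -> Optional[object]:
        entry = self._latest.get(source)
        return None if entry is None else entry[1]


class Archiver:
    """Republisher that stores every event (history producer)."""

    def __init__(self) -> None:
        self._history: List[Tuple[float, str, object]] = []

    def on_event(self, source: str, timestamp: float, value: object) -> None:
        self._history.append((timestamp, source, value))

    def query(self, source: str, since: float = 0.0) -> List[object]:
        # Return values for one source in timestamp order.
        ordered = sorted(self._history, key=lambda record: record[0])
        return [v for (t, s, v) in ordered if s == source and t >= since]
```

Feeding the same event stream to both shows the trade-off: the latest producer answers "what is the state now" cheaply, while the archiver can answer "what happened since time t" at the cost of unbounded storage.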


Directories

Information providers ‘publish’ information to a directory

Directories may be linked in networked hierarchies

Information is usually also in a DIT-like structure (Directory Information Tree)

Typical implementation: LDAP


Approaches to sending information

Orthogonal to the topology is the information flow model

Push model
  information gets published regardless of its use
  but it’s there (in higher-level aggregators) when it’s needed
  e.g. Condor Hawkeye, LCG BDII

Hybrid
  information location gets published
  consumers can subscribe to information and from then on continuously get it
  e.g. R-GMA, (MDS4?)

Pull model
  information is retrieved on-demand, and you cannot subscribe
  e.g. MDS-2
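The practical difference between push and pull is where the work happens when a query arrives. The sketch below contrasts the two with hypothetical `PullIndex` and `PushIndex` classes; neither corresponds to the actual interface of MDS2, Hawkeye or BDII.

```python
from typing import Callable, Dict


class PullIndex:
    """Pull model: nothing is stored; every query goes back to the source."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], dict]] = {}

    def register(self, name: str, fetch: Callable[[], dict]) -> None:
        self._sources[name] = fetch

    def query(self, name: str) -> dict:
        return self._sources[name]()  # the source is contacted on demand


class PushIndex:
    """Push model: sources publish whether or not anyone is asking."""

    def __init__(self) -> None:
        self._store: Dict[str, dict] = {}

    def publish(self, name: str, info: dict) -> None:
        self._store[name] = dict(info)

    def query(self, name: str) -> dict:
        return self._store[name]  # answered from the aggregator's own copy
```

In the pull variant the cost of a query includes the round trip to the resource, and a failed resource shows up as a failed query. In the push variant queries are cheap, but the answer is only as fresh as the last publication.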


Information Systems

Examples shown in this lecture

1. Monitoring and Discovery Service (MDS)
2. Relational Grid Monitoring Architecture (R-GMA)
3. Hawkeye
4. Berkeley Database Information Index (BDII)


1 – MDS2

Part of GT2.x

Typical use: resource selection by brokers

Architecture
  decentralized
  hierarchical
  soft-state protocols with timeouts
  supports caching in index servers

Security: GSI (optional)


MDS2 Architecture

[Figure: MDS2 architecture. Resources A and B each run a GRIS that is fed by information providers (IP). The GRISes register with a GIIS; the GIIS requests information from the GRIS services and its cache contains the information from A and B. Client 1 searches a GRIS directly; Client 2 uses the GIIS for searching collective information. Graphic: J. Schopf, GFNL masterclass 2005: Distributed Monitoring and Information Services for the Grid.]


MDS2 information flow

Soft-state registration of GRISes with GIISes
  time-out on the registration (TTL and nextUpdate)

Data retrieved on-demand from the underlying GRIS
  time-out on the answer
  resources silently drop out if they fail

GRISes collect information using scripts

GIISes can be collated in arbitrary hierarchies
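Soft-state registration means a registration is only valid for a limited time and must be renewed. The sketch below is a simplified model with a single TTL and an explicit clock argument; the class name `SoftStateRegistry` is invented here and does not reproduce the MDS2 registration protocol.

```python
from typing import Dict, List


class SoftStateRegistry:
    """Registrations expire unless they are renewed within the TTL."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._expires: Dict[str, float] = {}

    def register(self, name: str, now: float) -> None:
        """(Re-)register; must be repeated before the TTL runs out."""
        self._expires[name] = now + self.ttl

    def alive(self, now: float) -> List[str]:
        """Drop expired registrations and return the remaining names."""
        self._expires = {n: t for n, t in self._expires.items() if t > now}
        return sorted(self._expires)
```

A resource that crashes simply stops renewing and disappears after the TTL, with no explicit deregistration. This is the "silently drop out" behaviour described above.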


2 – R-GMA

‘straight’ implementation of the GMA
uses a relational representation of the data
notification/subscription directly from the source
implementation in Java
developed in EU DataGrid and EGEE JRA1
  UK cluster, Steve Fisher (RAL), et al.


R-GMA Architecture


MON Box

Every site has a MON box to proxy information
  local cache of info in memory
  through-channel to systems behind a firewall
  producers/consumers connect actively to the MON box

Multiple producers can publish in the same table
  joins can be done, but only via a secondary producer

Usually deployed with a single registry


R-GMA plain SQL interface

bosui:davidg:1001$ rgma

Welcome to the R-GMA virtual database for Virtual Organisations.

================================================================

Your local R-GMA server is:

https://eg.nikhef.nl:8443/R-GMA

You are connected to the following R-GMA Registry services:

https://lcgic01.gridpp.rl.ac.uk:8443/R-GMA/RegistryServlet

You are connected to the following R-GMA Schema service:

https://lcgic01.gridpp.rl.ac.uk:8443/R-GMA/SchemaServlet

Type "help" for a list of commands.

rgma> show tables

+------------------------------------------+

| Table Name |

+------------------------------------------+

| ArchiverTestTable |

| ... |

| GlueCE |

| ... |

+------------------------------------------+


Queries

rgma> select UniqueID,Name,TotalCPUs from GlueCE WHERE UniqueID LIKE '%ulakbim%';

+--------------------------------------------------+---------+-----------+

| UniqueID | Name | TotalCPUs |

+--------------------------------------------------+---------+-----------+

| ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-seegrid | seegrid | 126 |

| ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-trgrida | trgrida | 126 |

| ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-lhcb | lhcb | 126 |

...
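The query above is ordinary SQL over a virtual table. To experiment with the same kind of selection without an R-GMA installation, one can use a local stand-in: the sketch below uses Python's built-in `sqlite3` module, not the R-GMA API. The table holds two rows copied from the output above plus one invented row (`ce.example.org`) so that the filter has something to exclude.

```python
import sqlite3

# In-memory database as a local stand-in for the R-GMA virtual database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GlueCE (UniqueID TEXT, Name TEXT, TotalCPUs INTEGER)"
)
rows = [
    ("ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-seegrid", "seegrid", 126),
    ("ce.ulakbim.gov.tr:2119/jobmanager-lcgpbs-trgrida", "trgrida", 126),
    ("ce.example.org:2119/jobmanager-pbs-demo", "demo", 8),  # invented row
]
conn.executemany("INSERT INTO GlueCE VALUES (?, ?, ?)", rows)


def find_ces(pattern: str):
    """Return (UniqueID, Name, TotalCPUs) for CEs whose ID matches pattern."""
    cursor = conn.execute(
        "SELECT UniqueID, Name, TotalCPUs FROM GlueCE "
        "WHERE UniqueID LIKE ? ORDER BY Name",
        (pattern,),
    )
    return cursor.fetchall()
```

Calling `find_ces("%ulakbim%")` returns the two ulakbim rows and leaves out the invented one. In R-GMA itself the rows would come from whichever producers currently publish into the `GlueCE` table.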


3 – Hawkeye

Condor information system

publishes class-ads for
  matchmaking
  fault detection

periodic updates to the agents by the modules

information kept in the agents


Hawkeye architecture

[Figure: Hawkeye architecture. A Manager sits above several Agents, and each Agent is fed by several Modules. Graphic: J. Schopf, GFNL masterclass 2005: Distributed Monitoring and Information Services for the Grid.]


4 – BDII & GIP

BDII
  conceptually similar to Hawkeye, but data is pulled rather than pushed
  mentioned here because of its widespread deployment in EGEE/LCG, OSG, &c

Generic Information Providers (GIP)
  scripting framework to produce LDIF
  static values overridden by output from scripts

periodically, LDAP queries are sent to subordinate directories
  with a time-out on the answer
  the previous answer persists for a defined amount of time

contrary to MDS2, BDII will never forget

Paper: http://indico.cern.ch/materialDisplay.py?contribId=126&sessionId=23&materialId=paper&confId=0
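The GIP idea of "static values overridden by output from scripts" amounts to a dictionary merge followed by LDIF rendering. The sketch below shows that idea only: the function names are invented, the attribute names in the usage note are illustrative, and the LDIF writer ignores details of real LDIF such as base64 encoding and line folding.

```python
from typing import Dict, List

Attributes = Dict[str, List[str]]


def merge_attributes(static: Attributes, dynamic: Attributes) -> Attributes:
    """Start from the static values; script output replaces matching keys."""
    merged = dict(static)
    merged.update(dynamic)
    return merged


def to_ldif(dn: str, attrs: Attributes) -> str:
    """Render one entry as simplified LDIF (no base64, no line folding)."""
    lines = [f"dn: {dn}"]
    for name in sorted(attrs):
        for value in attrs[name]:
            lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"
```

For example, a static file might declare `GlueCEStateFreeCPUs: 0` as a placeholder, and a script that queries the batch system would override it with the current value before the entry is published.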


BDII organisation


BDII scaling

OpenLDAP update (write) is not optimized with SleepyCat Berkeley DB
  simultaneous reads and writes lead to timeouts

So, put in a forwarder service that redirects to a pool of OpenLDAP/DB backends that swap roles
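The role-swapping trick can be sketched with two backends: queries always go to the one that is currently not being written. This is a simplification of the "pool" mentioned above, with invented class names (`Backend`, `Forwarder`) and plain dictionaries standing in for the OpenLDAP databases.

```python
from typing import Dict, Optional


class Backend:
    """Stand-in for one OpenLDAP/DB instance."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, object] = {}


class Forwarder:
    """Reads go to the active backend; updates load the standby, then swap."""

    def __init__(self, first: Backend, second: Backend) -> None:
        self._active = first
        self._standby = second

    @property
    def active_name(self) -> str:
        return self._active.name

    def query(self, key: str) -> Optional[object]:
        return self._active.data.get(key)

    def refresh(self, snapshot: Dict[str, object]) -> None:
        # Write the new snapshot into the backend nobody is reading from...
        self._standby.data = dict(snapshot)
        # ...then swap roles so readers see the fresh data from now on.
        self._active, self._standby = self._standby, self._active
```

Because a backend is never read and written at the same time, the read/write contention that caused the timeouts is avoided, at the price of keeping a second copy of the data.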


WS style information systems

MDS4
  based on WS-RF and WS-Notification mechanisms
  provides a common aggregator framework for
    index service (republisher)
    trigger service (send events, mails, execute programs)
    archive service

NAREGI Distributed Information Service
  aggregator collects information from various sources
  puts these as CIM objects in a database
  OGSA-DAI front-end to the database with CIM objects

PS: OGSA-DAI (Data Access & Integration) is a system for providing uniform grid access to database resources


MDS4 Aggregator Framework


NAREGI Distributed Information Service

graphic: Satoshi Matsuoka, Tokyo Institute of Technology & NII, NAREGI


Status

Both are developed and available

Neither has been tested yet at very large scale
  i.e. O(1000) resources, thousands of simultaneous queries

Hierarchies and Views


Views on the information system

For resource information
  information view on those resources to which the viewer potentially has access
  a single global root is neither feasible nor needed
  a per-VO or per-infrastructure view is sufficient

For ‘application level’ monitoring
  fine-grained access control needed at the VO or user level
  attributes in the schema may have different privacy levels
  requires view management like in regular databases


Typical hierarchical top levels today

per-infrastructure
  e.g. EGEE/LCG, OSG, NAREGI
  used by many VOs
  needs support at the infrastructure level

per-VO view
  prevalent in ‘grass-roots’ deployment

all systems can support both, although not all in the same way:
  R-GMA works with per-site MON boxes that (today) use a central registry -> one per infrastructure

Performance

an example of a grid performance study


Performance analysis

Best paper so far: X. Zhang, J. Freschl, J. Schopf, A performance study of monitoring and information services for distributed systems, in: Proceedings of the 12th IEEE High Performance Distributed Computing (HPDC-12 2003), IEEE Computer Society Press, Seattle, WA, USA, 2003, pp. 270–282.

Perf results on R-GMA are outdated, but the basics still hold
MDS2 has since been replaced with MDS4 (in GT4)
The three systems selected are indicative of the different classes, and thus it’s a very valuable comparison!

Data in the next slides by Jennifer Schopf from the GridForum NL/ISOC NL Masterclass 2005


Roles of components in the comparison

                        MDS2                   R-GMA                     Hawkeye
Info Collector          Information Provider   Producer                  Module
Info Server             GRIS                   Producer Servlet          Agent
Aggregate Info Server   GIIS                   Combo Producer-Consumer   Manager
Directory Server        GIIS                   Registry                  Manager

ideas, graphics, results: J. Schopf, GFNL masterclass 2005: Distributed Monitoring and Information Services for the Grid


Performance analysis

Three ‘characteristic’ systems
  MDS2 (pull system, with and without caching)
  R-GMA (hybrid, straight GMA implementation with relational interface)
  Hawkeye (push system, from Condor)

Tests done on a small test bed (~7 systems)
  scaling has not been tested
  but results are at least comparable



Performance analysis: other facts

Keep in mind that MDS2 & Hawkeye are programmed in C

R-GMA is in Java

This R-GMA version relied heavily on threads
  i.e. the implementation was a straight translation of the architecture
  JVM and Linux kernel 2.4 don’t like too many, O(500), threads…


Model for evaluation

paper attempts to compare similar properties in the three systems

deploy in a standard mode (as depicted)

[Figure: evaluation model. Components: Client, Directory Server, Aggregate Information Server, Information Server, Information Collector. Arrows distinguish "Registration & Data" flows from "Client Query" flows.]



Experiments in Zhang et al.

1. How many users can query an information server at a time?

2. How many users can query a directory server?

3. How does an information server scale with the amount of data in it?

4. How does an aggregator scale with the number of information servers registered to it?



Experiments

[Figure: the evaluation model (Client, Directory Server, Aggregate Information Server, Information Server, Information Collector; registration & data flows and client queries), annotated with the experiment numbers 1 to 4.]



Comparing Information Systems

We also looked at the queries in depth, using NetLogger

3 phases: Connect, Process, Response

[Figure: the three query phases: Connect, Process, Response.]



Testbed

Lucky cluster at Argonne
  7 nodes, each with two 1133 MHz Intel PIII CPUs (with a 512 KB cache) and 512 MB main memory

Users simulated at the UC nodes
  20 P3 Linux nodes, mostly 1.1 GHz
  R-GMA has an issue with the shared file system, so we also simulated users on Lucky nodes

All figures are 10-minute averages

Queries happen with a one-second wait between each query (think synchronous send with a 1 second wait)



Metrics

Throughput
  number of requests processed per second

Response time
  average amount of time (in sec) to handle a request

Load
  percentage of CPU cycles spent in user mode and system mode, recorded by Ganglia
  high when running a small number of compute-intensive apps

Load1
  average number of processes in the ready queue waiting to run, 1-minute average, from Ganglia
  high when a large number of apps are blocking on I/O
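The first two metrics are simple to compute from a log of request start and end times. The sketch below assumes such a log is available as a list of `(start, end)` pairs in seconds; that log format and the function names are assumptions made for this illustration, not the tooling used in the study.

```python
from typing import List, Tuple


def throughput(completions: List[float], window: float) -> float:
    """Requests processed per second over an observation window (seconds)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return len(completions) / window


def mean_response_time(requests: List[Tuple[float, float]]) -> float:
    """Average time in seconds between the start and end of a request."""
    if not requests:
        raise ValueError("no requests recorded")
    return sum(end - start for start, end in requests) / len(requests)
```

With three requests taking 0.5 s, 0.25 s and 0.75 s inside a 10-second window, the mean response time is 0.5 s and the throughput is 0.3 requests per second. The two load metrics come from the operating system (via Ganglia in the study) and cannot be derived from such a log.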



Information Server Throughput vs. Number of Users

[Figure: throughput (vertical axis, 0 to 200) against number of users (1 to 600) for MDS2.4 GRIS (cache), MDS2.4 GRIS (no cache), R-GMA 3.4.6 LatestProducerServlet and Hawkeye 1.0 Agent. Larger number is better.]



Query Times

[Figure: query time in seconds (log scale, 0.001 to 1000) per phase (Connection, Processing, Response Transmission) for MDS2 GRIS (caching), MDS2 GRIS (no caching), R-GMA ProducerServlet and Hawkeye Agent. Two panels: 50 users and 400 users. Smaller number is better.]



Experiment 1 Summary

Caching can significantly improve performance of the information server
  particularly desirable if one wishes the server to scale well with an increasing number of users

When setting up an information server, care should be taken to make sure the server is on a well-connected machine
  network behavior plays a larger role than expected
  if this is not an option, thought should be given to duplicating the server if more than 200 users are expected to query it



Directory Server Throughput

[Figure: throughput (queries/sec, 0 to 160) against number of users (1 to 600) for MDS2.4 GIIS (cache), R-GMA 3.4.6 Registry and Hawkeye 1.0 Manager.]




Directory Server CPU Load

[Figure: CPU load (%, 0 to 70) against number of users (1 to 600) for MDS2.4 GIIS (cache), R-GMA 3.4.6 Registry and Hawkeye 1.0 Manager.]




Query Times

[Figure: query time in seconds (log scale, 0.001 to 100) per phase (Connection, Processing, Response Transmission) for MDS2 GIIS (caching), R-GMA Registry and Hawkeye Manager. Two panels: 50 users and 400 users.]





Because of the network contention issues, the placement of a directory server on a highly connected machine will play a large role in the scalability as the number of users grows

Significant loads are seen even with only a few users; it will be important that this service be run on a dedicated machine, or that it be duplicated as the number of users grows.



Information Server Scalability with Information Collectors

[Figure: throughput (queries/sec, 0 to 10) against number of information collectors (10 to 90) for MDS2.4 GRIS (cache), MDS2.4 GRIS (no cache), R-GMA 3.4.6 LatestProducerServlet and Hawkeye 1.0 Agent.]




Experiment 3 Load Measurements


[Figure: CPU load (%, 0 to 90) against number of information collectors (10 to 90) for MDS2.4 GRIS (cache), MDS2.4 GRIS (no cache), R-GMA 3.4.6 LatestProducerServlet and Hawkeye 1.0 Agent.]



Experiment 3 Query Times

[Figure: sample query time in seconds (log scale, 0.001 to 100) per phase (Connection, Processing, Response Transmission). Two panels: 30 info collectors and 80 info collectors. Ideas, graphics, results: J. Schopf, GFNL masterclass 2005: Distributed Monitoring and Information Services for the Grid.]



The more the data is cached, the less often it has to be fetched, thereby increasing throughput

Search time isn’t significant at these sizes



Aggregate Information Server Scalability

[Figure: throughput (queries/sec, 0 to 10) against number of information servers (1 to 600) for MDS2.4 GIIS (query all), MDS2.4 GIIS (query part), R-GMA 3.4.6 ProducerConsumer and Hawkeye 1.0 Manager.]




Load

[Figure: CPU load (%, 0 to 60) against number of information servers (1 to 600) for MDS2.4 GIIS (query all), MDS2.4 GIIS (query part), R-GMA 3.4.6 ProducerConsumer and Hawkeye 1.0 Manager.]



Query Response Times

[Figure: query response time in seconds (log scale, 0.001 to 10) for MDS2 GIIS (all), MDS2 GIIS (portion), R-GMA ProducerConsumer and Hawkeye Manager. Two panels: 50 info servers and 400 info servers.]





None of the Aggregate Information Servers scaled well with the number of Information Servers registered to them

When building hierarchies of aggregation, they will need to be rather narrow and deep, with very few Information Servers registered to any one Aggregate Information Server.
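The cost of a narrow hierarchy is depth. As a rough illustration, the sketch below computes how many aggregation levels are needed for a given number of information servers when each aggregator accepts at most `fan_out` registrations. The function name and the assumption of a perfectly balanced tree are made for this illustration only.

```python
def levels_needed(num_servers: int, fan_out: int) -> int:
    """Smallest number of levels d such that fan_out ** d >= num_servers."""
    if num_servers < 1 or fan_out < 2:
        raise ValueError("need at least one server and a fan-out of 2 or more")
    levels, capacity = 1, fan_out
    while capacity < num_servers:
        capacity *= fan_out  # one more level multiplies the reachable servers
        levels += 1
    return levels
```

Under these assumptions, 1000 information servers need 3 levels at a fan-out of 10 and 2 levels at a fan-out of 100. Each extra level adds another cache and another timeout to the path between resource and client.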



Overall Results

Performance can be a matter of deployment
  effect of background load
  effect of network bandwidth

Performance can be affected by underlying infrastructure
  LDAP/Java strengths and weaknesses

Performance can be improved using standard techniques
  caching; multi-threading; etc.



Observations on the performance study

Measures performance, not stability
  test bed size is only 7 machines and 10 clients
  local cluster, i.e. latency is well controlled

In a real-life deployment, complexity is the determining factor in success
  simple systems are more likely to ‘survive’
  systems with soft-state registration & timeouts (like MDS) are more prone to instabilities than systems based on a persistent ‘elephant-style’ memory (like BDII) (cf. typical signal processing issues)

Assorted Issues


Access Control

AuthN is simple
  keeps out 99% of the rogue information
  doesn’t do a bit for privacy preservation
  not every cert owner is a grid user for a specific infrastructure

coarse-grained ACLs are better
  grid-mapfile, ACLs on access to the service
  keeps known bad guys out
  still no privacy

fine-grained ACLs
  support within the DB engine is actually required, as it’s too hard to retro-fit otherwise


Timeout issues

differences in timeouts in information providers lead to ‘phase difference’ effects in the system
  temporary amnesia of aggregate information indices
  cumulative delays

Timeouts
  registration period for a GRIS with a GIIS
  time to bind to the GRIS (defines if a resource is up or down)
  time to produce information entries
  cache TTL in the GIIS
  timeout before removing stale information from the GIIS

essentially it’s feed-back signal theory ;-)
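Two of these effects can be put into numbers with a deliberately crude model. The functions below are assumptions made for illustration and are not taken from MDS2: `amnesia_gap` treats registration as strictly periodic, and `worst_case_age` assumes each cache level serves data until its TTL expires.

```python
from typing import Iterable


def amnesia_gap(registration_period: float, registry_ttl: float) -> float:
    """Seconds per cycle during which the index has forgotten the resource.

    If a resource re-registers less often than its registration lives,
    it vanishes from the index until the next registration arrives.
    """
    return max(0.0, registration_period - registry_ttl)


def worst_case_age(cache_ttls: Iterable[float]) -> float:
    """Upper bound on data age when each level caches for its full TTL."""
    return sum(cache_ttls)
```

For example, registering every 60 s against a 45 s registration lifetime leaves a 15 s gap in every cycle, and three cache levels with TTLs of 30 s, 60 s and 300 s can serve data that is up to 390 s old.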

Content of the Information System


Approaches to resource information

Resource description
  GLUE
  CIM
  and similar but slightly different schemas for ARC and GT2

Job description
  Unicore’s AJO


Information Schemas: GLUE

Describes resource availability information
Common for various middleware suites

Known limitations
  not even all specified info is actually used
  contains lots of info that is unused
  cannot express information needed for brokering at the appropriate granularity level (this is fundamental for all such information schemas)

More specifics discussed with each component

See http://infnforge.cnaf.infn.it/glueinfomodel/


Glue Abstractions

Core Entities
  Site: name, contact info, latitude/longitude, sponsor
  Service: type, version, endpoint, status, WSDL URL, Semantics URN, StartTime

Cluster
  ComputingElement: Info, State, Policy, ACBRule, …
  VOView: ACBRule, Running, Waiting, Free, ERT, WRT, …
  SubCluster: HostOperatingSystem*, HostAppSWRTEnv, …

StorageElement


GLUE Core Schema


GLUE Cluster


GLUE Storage


GLUE Linking compute and storage

Useful if storage is accessible via POSIX, or via faster networks
  position of such a binding is difficult
  abused for pure-SE info, as this is the only place where the file path to the storage was specified…


Alternative schemas with the same viewpoint

Original GT2 schema (obsolete)

NorduGrid ARC


CIM Common Information Model

object-oriented abstraction of information (DMTF)
  uses abstractions, dependencies, inheritance

goes beyond a mere information model by defining methods for standard object behaviour
  trying to solve every possible problem (and solve the perpetuum mobile issue in the process …)

information components of CIM can be used to represent resources


Common Information Model (CIM)

Object-oriented schema developed by the DMTF
  representation in different formats (such as XML)
  See http://www.dmtf.org/standards/cim/

Extended for grid elements by the GCS-WG
  BatchService, &c

The NAREGI grid is the main user of this system


Example: CIM Job Submission Interface

[Figure: CIM UML class diagram "Batch Jobs, Submission, and Processing". Classes taken from the Core and System models: ManagedElement, ManagedSystemElement, LogicalElement, EnabledLogicalElement, System, Service, ServiceAccessPoint, SettingData, ScopedSettingData, Process, OSProcess, OperatingSystem, LogicalFile, DataFile. Job-related classes: Job, ConcreteJob, ScheduledJob, BatchJob, RecurringBatchJob, JobDestination, BatchQueue, BatchService, BatchSAP, BatchSchedulingData. Associations shown include HostedJobDestination, JobDestinationJobs, OwningJobElement, AffectedJobElement, ProcessOfJob, ServiceProcess, ProcessExecutable, OwningBatchJobQueue, QueueForBatchService, QueueForwardsToBatchSAP, OSServicingJob and OSServicingQueue.]

Selected properties from the diagram:

Job
  JobStatus : string, TimeSubmitted : datetime, ScheduledStartTime : datetime, StartTime : datetime, ElapsedTime : datetime, UntilTime : datetime, Notify : string, Owner : string, Priority : uint32, PercentComplete {units}, DeleteOnCompletion : boolean, ErrorCode : uint16, ErrorDescription : string
  KillJob ([IN] DeleteOnKill : boolean) : uint32 {enum}

ConcreteJob
  InstanceID : string {key}, Name : string {override, req'd}

BatchJob
  JobID : string {key}, SchedulingInformation : string, MaxCPUTime : uint32 {units}, CPUTimeUsed : uint32 {units}, BatchJobStatus : uint16 {enum}, TimeCompleted : datetime, JobOrigination : string

BatchQueue
  QueueEnabled : boolean, QueueAccepting : boolean, NumberOnQueue : uint32, QueueStatus : uint16 {enum}, QueueStatusInfo : string, DefaultJobPriority : uint32, JobPriorityHigh : uint32, JobPriorityLow : uint32, MaxJobWallTime : uint32 {units}, MaxJobCPUTime : uint32 {units}, MaxTotalJobs : uint32, MaxRunningJobs : uint32, RunningJobs : uint32, WaitingJobs : uint32

BatchSAP
  BatchProtocol : uint16[ ] {enum}, BatchProtocolInfo : string[ ]


The Unicore information model

Describes the resource requests (so the opposite viewpoint compared to GLUE)

the resources themselves need not be described, since they will ‘bid’ on the job requests

we will deal with this one in the Brokering & CE lecture


Summary

Information systems
  used across multiple organisations and by multiple people or VOs
  taxonomy classification: (republishing; data flow)

Any grid information system needs
  programmatic access via producer/consumer APIs
  compositional IS freedom (VO or infrastructure hierarchies)

focus has been on resource selection
  used for brokering decisions, either by people or programs
  needs a common information schema or translators

for application-level information systems
  user-defined schema and a schema registry (like R-GMA)
