26
Future Generation Computer Systems 21 (2005) 163–188 A taxonomy of grid monitoring systems Serafeim Zanikolas , Rizos Sakellariou School of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, Manchester, UK Received 17 December 2003; accepted 28 July 2004 Available online 8 October 2004 Abstract Monitoring is the act of collecting information concerning the characteristics and status of resources of interest. Monitoring grid resources is a lively research area given the challenges and manifold applications. The aim of this paper is to advance the understanding of grid monitoring by introducing the involved concepts, requirements, phases, and related standardisation activities, including Global Grid Forum’s Grid Monitoring Architecture. Based on a refinement of the latter, the paper proposes a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value of the offered taxonomy lies in that it captures a given system’s scope, scalability, generality and flexibility. The paper concludes with, among others, a discussion of the considered systems, as well as directions for future research. © 2004 Elsevier B.V. All rights reserved. Keywords: Grid monitoring; Grid Information Services; Grid Monitoring Architecture 1. Introduction The grid is perceived as a large-scale distributed sys- tem that supports scattered communities to form Vir- tual Organisations [1], in order to collaborate for the realisation of common goals. This collaboration takes the form of managed sharing of on-line resources (e.g., software, hardware, data, instruments). However, grid resources, may dynamically join and leave, hence fea- turing a membership which varies over time; even in Corresponding author. Tel.: +44 161 275 6160; fax: +44 161 275 6236. E-mail addresses: [email protected] (S. Zanikolas), [email protected] (R. Sakellariou) fairly static settings, resource availability is subject to failures. Given this transient nature, users must be sup- ported in finding and keeping track of resources of in- terest; this is the main purpose of Grid Information Services (GIS) [2,3]. In order for information services to address the mentioned user needs, they must sys- tematically collect information regarding the current and, sometimes, past status of grid resources; a pro- cess known as monitoring. In addition to information services, monitoring is also crucial in a variety of cases such as scheduling, data replication, accounting, performance analysis and optimisation of distributed systems or individual appli- cations, self-tuning applications, and many more [4]. Also, given the increasing number of grid resources, 0167-739X/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2004.07.002

A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

  • Upload
    others

  • View
    20

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

Future Generation Computer Systems 21 (2005) 163–188

A taxonomy of grid monitoring systems

Serafeim Zanikolas∗, Rizos Sakellariou

School of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, Manchester, UK

Received 17 December 2003; accepted 28 July 2004Available online 8 October 2004

Abstract

Monitoring is the act of collecting information concerning the characteristics and status of resources of interest. Monitoringgrid resources is a lively research area given the challenges and manifold applications. The aim of this paper is to advancethe understanding of grid monitoring by introducing the involved concepts, requirements, phases, and related standardisationactivities, including Global Grid Forum’s Grid Monitoring Architecture. Based on a refinement of the latter, the paper proposesa taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value ofthe offered taxonomy lies in that it captures a given system’s scope, scalability, generality and flexibility. The paper concludeswith, among others, a discussion of the considered systems, as well as directions for future research.© 2004 Elsevier B.V. All rights reserved.

Keywords:Grid monitoring; Grid Information Services; Grid Monitoring Architecture

1

ttrtsrt

f

r

t tosup-f in-ions

t sys-entpro-

isling,and

pli-

ces,

0d

. Introduction

The grid is perceived as a large-scale distributed sys-em that supports scattered communities to form Vir-ual Organisations[1], in order to collaborate for theealisation of common goals. This collaboration takeshe form of managed sharing of on-line resources (e.g.,oftware, hardware, data, instruments). However, gridesources, may dynamically join and leave, hence fea-uring a membership which varies over time; even in

∗ Corresponding author. Tel.: +44 161 275 6160;ax: +44 161 275 6236.

E-mail addresses:[email protected] (S. Zanikolas),[email protected] (R. Sakellariou)

fairly static settings, resource availability is subjecfailures. Given this transient nature, users must beported in finding and keeping track of resources oterest; this is the main purpose of Grid InformatServices (GIS)[2,3]. In order for information serviceto address the mentioned user needs, they mustematically collect information regarding the currand, sometimes, past status of grid resources; acess known asmonitoring.

In addition to information services, monitoringalso crucial in a variety of cases such as schedudata replication, accounting, performance analysisoptimisation of distributed systems or individual apcations, self-tuning applications, and many more[4].Also, given the increasing number of grid resour

167-739X/$ – see front matter © 2004 Elsevier B.V. All rights reserved.oi:10.1016/j.future.2004.07.002

Page 2: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

164 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

real-time monitoring of their availability and utilisa-tion is becoming essential for effective management,particularly regarding the detection of faults and bot-tlenecks and in some cases even their automatic reso-lution. Finally, identifying patterns of utilisation mayform valuable input for long-term resource planning ofgrid infrastructures.

One may wonder how monitoring in the contextof the grid differs from traditional monitoring ofcomputer-related resources, which has been a researchtopic for several decades (e.g.,[5]). Grid monitoringis characterised by significant requirements including,among others, scalable support for both pull and pushdata delivery models applied over vast amounts of cur-rent and past monitoring data that may be distributedacross organisations. In addition, a monitoring system’sdata format has to balance between extensibility andself-description on one hand and compactness on theother. The former is required to accommodate the ever-expanding types of monitored resources, whereas thelatter is a prerequisite for non-intrusive and scalablebehaviour. The problem is further complicated by thecontinuous evolution of grid middleware and the lackof consensus regarding data representation, protocolsand semantics, leading toad hocsolutions of limitedinteroperability. Existing proprietary network and hostmonitoring applications lack the openness required forinteroperability and customisation, while they also im-pose significant financial costs. Few of the equivalentopen source projects have a potential and in fact someo

der-s andt pa-p ni-t e ofm stedt ni-t cter-i m’sGo ivens bilitya

fol-l pts,s rieflyd the

Grid Monitoring Architecture and other standardisa-tion/recommendation efforts promoted by the GlobalGrid Forum (GGF). InSection 3, we propose a tax-onomy of monitoring systems.Sections 4–7classifythe considered systems in order of increasing scope,according to the proposed taxonomy.Section 8brieflyconsiders related work which is not intended to deliverimplemented systems. The paper is complemented, inSection 9, by a summary and discussion of the consid-ered systems, and concluded inSection 10.

2. Background

This section introduces the basic concepts and re-lated terminology, as well as the main phases of a mon-itoring model. A brief explanation of the most evidentrequirements for monitoring systems, and an overviewof GGF’s Grid Monitoring Architecture, as well asother standardisation activities follows.

2.1. Concepts and terminology

Having defined the type of monitoring we are con-cerned with, it is important to note that people use dif-ferent terms to refer to, more or less, the same con-cepts. To this end, we explicitly set the following terms,mainly drawn from[6]:

An entityas defined in[7], is any networked resource,vingicaledi-

ses.ta,in a

to

edhat,tureding

rat-siveypi-c-

f them are actually employed for grid monitoring.The aim of this document is to advance the un

tanding of the issues involved in grid monitoring,o provide a classification of related projects. Theer’s main contribution is a taxonomy of grid mo

oring systems and its application in a wide rangonitoring systems. The key feature of the sugge

axonomy is that it allows the classification of mooring systems based on the provision and charastics of components defined in Global Grid Forurid Monitoring Architecture (GMA)[6]. The valuef the proposed taxonomy lies in that it reflects a gystem’s scope and consequently generality, scaland flexibility.

The structure of the remainder of the paper is asows. The next section defines the involved conceets the requirements of the considered problem, bescribes the process of monitoring, and outlines

which can be considered useful, unique, haa considerable lifetime and general use. Typentities are processors, memories, storage mums, network links, applications and proces

An eventis a collection of timestamped, typed daassociated with an entity, and representedspecific structure.

An event typeis an identifier which uniquely mapsan event structure.

An event schemaor simply schema, defines the typstructure and semantics of all events so tgiven an event type, one can find the strucand interpret the semantics of the corresponevent.

Asensoris a process monitoring an entity and geneing events. Sensors are distinguished in pas(i.e., use readily available measurements, tcally from operating system facilities) and a

Page 3: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188 165

tive (i.e., estimate measurements using custombenchmarks). The former typically provide OS-specific measurements while the latter are moreintrusive.

2.2. The monitoring process

Monitoring distributed systems, and hence grids,typically includes four stages[8,9]: (i) generationofevents, that is, sensors enquiring entities and encodingthe measurements according to a given schema; (ii)processingof generated events is application-specificand may take place during any stage of the monitoringprocess, typical examples include filtering accordingto some predefined criteria, or summarising a groupof events (i.e., computing the average); (iii)distribu-tion refers to the transmission of the events from thesource to any interested parties; (iv) finally,presenta-tion typically involves some further processing so thatthe overwhelming number of received events will beprovided in a series of abstractions in order to enablean end-user to draw conclusions about the operationof the monitored system. A presentation, typically pro-vided by a GUI application making use of visualisationtechniques, may either use a real-timestreamof eventsor a recordedtraceusually retrieved from an archive.However, in the context of grids, we generalise the laststage asconsumptionsince the users of the monitoringinformation are not necessarily humans and thereforevisualisation may not be involved.

2

ys-t mayv needt

ef-fi ntsa esulto for-m ievet onset s tot m tot allym ory,I the

collection, processing and distribution of events. It canbe seen that a monitoring system with moderate per-formance will be insufficient in heavy load scenarios(many resources and/or users), whereas a system in-flicting non-trivial intrusiveness degrades the capacityof the monitored resources.

Extensibility: A monitoring system must be exten-sible with respect to the supported resources and hencethe events generated by the latter. To this end, desirablefeatures include (i) an extensible and self-describingevent encoding method (i.e., data format); (ii) an eventschema service which allows controlled and dynamicextensions/modifications; (iii) a producer-consumerprotocol that can accommodate new event types. Atthe same time, (i) and (iii) must be compact to min-imise the previously described intrusiveness, which isso important for scalability.

Data delivery models: Monitoring information in-cludes fairly static (e.g., software and hardware con-figuration of a given node) and dynamic events (e.g.,current processor load, memory), which suggests theuse of different measurement policies (e.g., periodic oron demand). In addition, consumer patterns may varyfrom sparse interactions to long lived subscriptions forreceiving a constant stream of events. In this regard,the monitoring system must support both pull and pushdata delivery models (further discussed in the next sec-tion).

Portability: The portability of a monitoring system,and particularly that of the sensors, is of major impor-t cifict y ont tot latedm

tor-i ccessc ands

venti e at te ane entf s ofa n oft ser-v eedf tion

.3. Requirements

A set of general requirements for monitoring sems that are considered important follows; theseary considerably depending on the use cases thato be supported by a specific system.Scalability: Monitoring systems have to cope

ciently with a growing number of resources, evend users. This scalability can be achieved as a rf good performance and low intrusiveness. Theer guarantees that a monitoring system will ach

he needed throughput within an acceptable respime in a variety of load scenarios. The latter referhe intrusiveness imposed by the monitoring systehe entities being monitored. Intrusiveness is typiceasured as a function of host (processor, mem

/O) and network load (bandwidth) generated by

ance; otherwise a system is unable to monitor speypes of resources, and hence support their visibilithe grid[10]. The concept of portability also applieshe generated events, meaning that any encapsueasurements must be platform independent.Security: Certain scenarios may require a moni

ng service to support security services such as aontrol, single or mutual authentication of parties,ecure transport of monitoring information.

In addition, we have already mentioned that an es a collection of timestamped typed data; hencimestamp is required to allow consumers to estimavent’sfreshness. This in turn suggests the requiremor a global notion of time between the component

monitoring system and its users. A global notioime in the Grid is needed anyhow (e.g., for securityices), but in the context of monitoring there is a nor a much higher accuracy of clock synchronisa

Page 4: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

166 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

Fig. 1. The GGF Grid Monitoring Architecture.

than what can be provided by established technologies(such as the Network Time Protocol[11]).

2.4. A Grid Monitoring Architecture

In this section, we provide a brief overview of theGrid Monitoring Architecture (GMA)[6] put togetherby the Global Grid Forum to encourage discussion andimplementations (i.e., it is not a standard). The maincomponents of GMA are as follows (Fig. 1):

Aproduceris a process implementing at least one pro-ducer Application Programming Interface (API)for providing events.

A consumeris any process that receives events by us-ing an implementation of at least one consumerAPI.

A registryis a lookup service that allows producers topublish the event types they generate, and con-sumers to find out the events they are interestedin.1 Additionally, a registry holds the details re-quired for establishing communication with reg-istered parties (e.g., address, supported protocolbindings, security requirements). Even for sys-tems with no notion of events, registries can beuseful for producers and consumers discoveringeach other.

Interactions: After discovering each other throughthe registry, producers and consumers communicated est con-s er-

rys col(w

Fig. 2. A republisher implements both producer and consumer in-terfaces.

action consisting of a subscription for a specific eventtype, a stream of events from a producer to a consumer,and a termination of the subscription. Both the estab-lishment and the termination of a subscription can beinitiated by any of the two parties. Aquery/responseis an one-off interaction initiated by a consumer andfollowed by a single producer response containing oneor more events. Finally, anotificationcan be sent bya producer to a consumer without any further interac-tions.

In addition to the three core components, the GMAdefines a republisher (referred as compound compo-nent or intermediary) and a schema repository.

A republisheris any single component implementingboth producer and consumer interfaces (Fig. 2)for reasons such as filtering, aggregating, sum-marising, broadcasting, and caching.

A schema repositoryholds the event schema, that is,the collection of defined event types. If a sys-tem is to support an extensible event schema,such a repository must have an interface for dy-namic and controlled addition, modification andremoval of any custom event types.

Republishers and the schema repository are consid-ered as optional components, though one can easily seethat they are essential parts of any sophisticated moni-toring framework. The schema repository may be parto entsm tri-b

finei odel,e d so

irectly (i.e., not through the registry). GMA definhree types of interactions between producers andumers.Publish/subscriberefers to a three-phase int

1 The GMA document[6] refers to the registry as a directoervice, which implies a Lightweight Directory Access ProtoLDAP) -based engine. To this end, and in agreement with[12],e prefer the term registry because it is engine-neutral.

f the registry, but in any case these two componust be replicated and distributed to allow for dis

ution of load and robustness.The GMA, being an architecture, does not de

mplementation details such as employed data mvent schema, protocol bindings, registry engine an

Page 5: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188 167

Fig. 3. Mapping GMA components to phases of monitoring. Square brackets and parenthesis indicate optional and grouped expressions respec-tively, whereas “+” stands for at least one repetition of the preceding item (see text for further explanation).

on. Probably the most important feature of the GMAis the separation of the discovery and retrieval opera-tions (i.e., discover from the registry and retrieve fromproducers or republishers).

Revisiting GMA: Because GMA’s components arefairly general, we correlate its main components tothe phases of the monitoring process (as described inSection 2.2). As shown inFig. 3, a sensor (shown bya circle) must generate events (i.e., the first phase ofmonitoring), may process them and may make themavailable to local consumers only (e.g., through a lo-cal file); a producer (depicted as a box) may imple-ment its own sensors, may process events (generatedby built-in or external sensors) and must support theirdistribution to remote consumers, hence the producerinterface; a republisher (shown as a rounded box) mustapply some type of processing to collected events andmake them available to other consumers; a hierarchyof republishers (shown as a polygon) consists of oneor more (hence, the “+” sign) republishers; finally, aconsumer (depicted as an ellipse) may apply some pro-cessing before presenting the results to the end-user orapplication.

2.5. Other recommendations and standards

The Global Grid Forum, in addition to the GMAdocument, hosts several other activities that relateto grid monitoring. Among others, a simple XML-b de-s rd-i ork-i af entsw In-f IM-b ) isw mab M)[ tan-

dard that provides the means for specifying concep-tual information models for management systems. An-other schema is the Grid Laboratory Uniform Envi-ronment (GLUE), developed as part of the DataTagproject in order to facilitate interoperability betweenselected US and EU Grid sites. The GLUE schema hasgained wide acceptance given its adoption by GlobusMDS3. Finally, the Grid Benchmark Research Group(GB-WG) aims to specify metrics and benchmarksto promote, among others, quantitative comparabilityof grid hardware and middleware, applications, andpractices.

3. A scope-oriented taxonomy of monitoringapproaches and frameworks

The previous section has refined the GMA com-ponents by mapping them to monitoring phases. Thissection proposes a taxonomy of monitoring systems,which is primarily concerned with a system’s provi-sion of GMA components (as they were defined inFig. 3). The categories of the proposed taxonomy arenamed from zero to three depending on the provisionand characteristics of a system’s producers and repub-lishers (Fig. 4).

Level 0: Events flow from sensors to consumers ineither an on-line or an offline fashion (i.e., atthe time of measurements being taken or after-

Fig. 4. The categories of the proposed taxonomy of monitoring sys-tems. Note that, although not shown to avoid clutter, the sensor sym-bol can be omitted in systems where producers have built-in sensors.

ased producer-consumer protocol is informallycribed in[13] to encourage interoperability. Regang event types, the Network Measurements Wng Group (NM-WG) is developing an XML schemor exchanging network performance measuremithin the framework of the Open Grid Services

rastructure (OGSI). On the same topic, the Cased Grid Schema Working Group (CGS-WGorking towards the development of a Grid scheased on the Common Information Model (CI

14], an object oriented, platform-independent s

Page 6: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

168 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

wards, using a trace file). In the on-line case, thesensors store locally any measurements, whichare accessed in an application-specific way. Thistypically is via a web interface that provides in-teractive access to HTML-formatted informationthat includes measurements, hence not what onewould consider a generic API. The distinguish-ing feature of level zero orself-containedsys-tems is the lack of producer APIs that wouldenable the distribution of events to remotely lo-cated components, in a programmable fashion(as opposed to non-programmable such as webpages intended for interactive use).

Level 1: In first levelsystems, sensors are either sep-arately implemented and hosted at the same ma-chines with producers, or their functionality isprovided by producers. In either case, events areremotely accessible via a generic API providedby producers.

Level 2: In addition to producers,second levelmoni-toring systems feature at least one type of repub-lisher, which however has a fixed functionality.Republishers of different functionality may bestacked upon each other but only in predefinedways. A second level system is differentiated bya semantically equivalent first level system by thedistribution of the functionality (that would oth-erwise be provided by a single producer) amongdifferent hosts.

Level 3: Highly flexible monitoring systems, apartareanedodeevelh asra-

aingherulti-ingpe-y isplyfor

The taxonomy includes amultiplicity qualifier tocapture whether republishers in second level systemswithin an organisation are centralised (i.e., one repub-lisher), merely distributed, or distributed with supportfor replication. The multiplicity (and distribution) ofproducers over sensors (in systems of all levels) isnot significant since, according to the given definition,at least one producer is required per monitored host.The multiplicity of republishers over producers in thirdlevel systems can vary based on the adopted hierarchy.

Another qualifier refers to thetype of entitiesthatare primarily monitored by a considered system. Thisqualifier can be any of hosts, networks, applications,availability and generic. The last one denoting general-purpose systems that support events for at least hostsand networks.

Last, thestackablequalifier denotes whether a mon-itoring system is intended to employ another system’sproducers or republishers, in other words operate ontop of it. A stackable, say, first level system can bemore promising than a non-stackable system of thesame level, because the former encapsulates (part of)the functionality of another monitoring system.

Based on the above categories and qualifiers, con-sidered systems are characterised using the formL{0-3}. {H,N,A,V,G}.[S], where the number denotesthe level, the following letter the nature of monitoredentities (Hosts, Networks, Applications, Availability,Generic), and an optional S implies whether the sys-tem is stackable. Specifically for second level systems,a (a)a dis-t up-p es as hav-i epli-c

top sys-t ithc ede ys-t .

ys-t eyw rst illh rs)

from producers, provide republishers thatconfigurable, allowing their organisation inarbitrarily structured hierarchy. As explainearlier, in a hierarchy of republishers every ncollects and processes events from lower lproducers or republishers for reasons sucthe provision of customised views or prepation of higher-level events.Third levelsystemshave a potential for scalability and may formstandalone Grid Information Service (assumsupport for a variety of grid resources). Whetthey actually are scalable depends on the mplicity and distribution of components accordto the arbitrary hierarchy that is adopted in scific deployments. Since the actual scalabilitconfiguration-dependent, the taxonomy simhighlights which systems have the potentialscalability.

n a, b or c letter follows the level number to denotesingle centralised republisher; (b) more than one

ributed republisher; (c) distributed republishers sorting data replication. For instance, L2c.H denotecond level system concerning mainly hosts andng more than one republisher that supports data ration.Goals and audience: The goal of the taxonomy is

rovide a simple means to describe a monitoringem’s features with respect to: (i) its compliance wore GMA components; (ii) main target of monitorntities (e.g., network, hosts, etc.); (iii) whether a s

em can or has to operate on top of another systemDevelopers would be primarily interested in s

ems that provide their functionality via APIs; thus thould be looking first for L1 systems. Administrato

hat are interested in performance or reliability wave to look for at least L2b (distributed republishe

Page 7: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188 169

and L2c (distributed and replicated republishers) sys-tems, respectively; they can also identify systems thatdepend or may operate on top of other systems withthe stackable qualifier. Users that need general-purposesystems that, when appropriately deployed, can scalebeyond organisation-wide installations need to look forL3 systems. Users and others can also find useful themonitored entities qualifier to identify generic systemsor those primarily concerned with hosts, networks, ap-plications, or availability.

4. Level 0: self-contained systems

A monitoring system is characterised as “self-contained” when it does not expose its functionalitythrough a producer interface. To this end, such a sys-tem can be used only in predefined and rigid ways(e.g., through a GUI). The considered self-containedsystems, MapCenter and GridICE, support grid admin-istrators in monitoring the availability (in the former)and utilisation (in the latter) of grid resources, throughweb front-ends.

4.1. MapCenter (L0.V.S)

MapCenter[15], developed as part of the EU Data-Grid project, is a monitoring application which pro-vides web users a visualisation of the availability anddistribution of services throughout a Grid. It is intendeda ityp de-p ud-i

p-d in ag calv ns,g tantt r isa es( ov-e notk tiono mi-c APc ingC

MapCenter has a number of coordinated threads,polling specific ports of a set of hosts according to aconfiguration file. In addition, there is support for auto-matic discovery of UDP, TCP, and HTTP-based (e.g.,grid) services, by means of probing to well-known ports[16].

MapCenter’s configuration along with the retrievedstatus information is kept in a flat text file. The lat-ter is used to update periodically a set of HTML fileswhich present the previously mentioned logical views.Of these, the graphical view illustrates the nodes ona geographical map using localisation information re-trieved from the WHOIS Internet service.

In the taxonomy’s context, MapCenter’s pollingthreads are consideredsensorsbecause they merelystore locally any acquired events, as opposed to provid-ing them via a producer API. To this end, MapCenteris classified as L0.V.S because it lacks producers (levelzero), is concerned with hosts and services availability,and operates on top of existing information services(stackable). MapCenter can be converted to a first levelsystem by providing the monitoring events through aproducer API.

4.2. GridICE (L0.G.S)

GridICE[17], also known as the InterGrid MonitorMap and EDT-Monitor[18], was developed as part ofthe DataTag project in order to facilitate grid adminis-trators. It provides status and utilisation information atV ll asb real-t

urew odest net-w Them , hosta ndedb lu-gf ons plu-g in-f ag-g ggera itora ss the

s a grid administration tool for tracking availabilroblems. At the time of writing, MapCenter wasloyed in more than 10 major grid installations, incl

ng the EU DataGrid.Overview: MapCenter builds and periodically u

ates a model of the network services availablerid, and provides this information in several logiiews (sites, Virtual Organisations (VOs), applicatioeographical) through a web interface. It is impor

o note that the information provided by MapCentebout theavailability of grid nodes and their service.g., the daemons of Globus’ Monitoring and Discry Service (MDS), etc.); hence MapCenter doeseep details concerning configuration and utilisaf resources. However, it does allow users to dynaally query an MDS server (using a PHP-based LDlient), ping and otherwise interact with hosts (usGI scripts).

irtual Organisation, site and resource level, as weasic statistics derived from historical traces and

ime alerts, all through a web front-end.Overview: GridICE has a centralised architect

here a main server periodically queries a set of no extract information about the status of grid andork services, and the utilisation of resources.ain server is based on Nagios, an open sourcend network service monitor that can be easily extey the use of custom monitoring and notification pins. GridICE has an MDS plugin (seeSection 7.2)

or periodically querying Globus index informatiervers and information providers, whereas otherins can be built, say, for RGMA. The collected

ormation is stored in a DBMS and used to buildregate statistics (e.g., total memory per site), trilerts and dynamically configure Nagios to monny newly discovered resources. End-users acce

Page 8: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

170 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

service through a PHP-based web front-end which in-cludes logical views at VO, site and entity level as wellas a geographical map.

GridICE employs a custom extension of the GLUEschema in order to support network-, and process-related events. There are concerns in terms of scalabil-ity given the centralised architecture and the frequentpolling that has to be performed. A way of resolvingthis could be to distribute the overall load among sev-eral Nagios servers organised in a hierarchy.

GridICE is a rather uncommon system in that it hasits own sensorsfor taking raw measurements withinmonitored hosts and uses Globus MDS asproduc-ers of those measurements. A Nagios server collectsevents from MDS daemons as described earlier, con-structs higher-level events and provides them throughthe web. The Nagios server republishes raw measure-ments as useful abstractions, yet without a producerAPI and thus cannot be considered a republisher. Onthis basis, GridICE can be classified either as a zeroor a first level system depending on whether the infor-mation of interest is the abstracted events or the rawmeasurements, respectively. GridICE can be convertedto a second level system by providing its abstractionsvia a producer API.

5. Level 1: producer-only systems

As an example of a first level system, this sectionp elf-a

5

a-t nts.T a-t e.g.,s a dy-n aini lop-mf andt rida re-q ongo on’s

progress against the requirements of its contract andtriggers corrective actions in case of violations.

Overview: Application adaptivity requires real-time performance measurements (i.e., generation ofapplication-specific events), reasoning whether there isa need for a corrective action and, if so, instructing theapplication to perform the latter. Autopilot’s function-ality is implemented in separate components, namelysensors, actuators, clients and distributed name servers.

Applications instrumented for Autopilot includesensors and actuators for remotely reading and writingrespectively, application-level variables. Sensors andactuators are described by property lists (e.g., name,location, type of variable measured/controlled, etc.),have attached functions and register themselves to aname service (i.e., aregistry). Property lists are usedby clients (i.e.,consumers) to lookup in the registry forsensors. Attached functions implement data reductiontechniques (e.g., summarisation) in case of sensors orperform requested actions in case of actuators.

An Autopilot client finds out “interesting” sen-sors through the registry and subscribes for receivingtheir events. Subsequently, a client uses an application-specific logic—that is defined as a fuzzy logic rulebase—to make decisions and, if applicable, instructan actuator to perform adaptive actions. In addition,clients manage sensors, in terms of activation and sam-pling frequency, through appropriate actuators.

Autopilot’s events are represented in either binary orASCII encodings according to the PabloSelf-DefiningD ede ught for-m e ord CIIe ngtho

en-s a-s ssingt ica-t

6

mst pur-

resents Autopilot, a framework for implementing sdapting applications.

.1. Autopilot (L1.A)

Autopilot [19] is a framework for enabling applicions to dynamically adapt to changing environmehis run-time tuning ability is important for applic

ions having to choose among a variety of policies (chedulers) as well as those that need to adapt toamic environment, such as the grid. Autopilot’s m

deas are leveraged in the grid Application Deveent Software (GrADS) project[20], which aims to

acilitate end-users in the development, executionuning of grid enabled applications. In GrADS, gpplications are characterised with performanceuirements (specified in so-called “contracts”); amthers, a real-time monitor compares an applicati

ataFormat[21]. As the name implies, SDDF encodvents include descriptions of their structure, thohe actual semantics are not specified. The binaryat can be employed between machines of samifferent byte order conventions, whereas the ASncoding must be used in cases of different word ler floating point representations.

In the context of the taxonomy, the Autopilot sors operate asproducerssince they not only take meurements but also provide the means for accehem remotely. Combined with the focus on applion monitoring, Autopilot is classified as L1.A.

. Level 2: producer and republisher systems

This section is concerned with monitoring systehat include producers and one or more special

Page 9: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188 171

pose republishers. The considered systems, describedin alphabetical order, are listed here under the threesubcategories of second level systems.

Centralised republisher: an administration orientedmonitoring framework based on the CODEframework; Hawkeye, an administration ori-ented system for monitoring and management ofcomputer clusters; GridRM, a proposal for inte-grating the diverse monitoring sources typicallyavailable in a grid site.

Distributed republishers: HBM, an unreliable faultdetector of fail-stop failures; JAMM, a host mon-itoring prototype with managed sensors; Mer-cury, GridLab’s organisation-level monitoringsystem; NetLogger, an application performanceanalysis toolkit that was extended with com-ponents for providing a producer interface andcontrolling application-level sensors; OCM-G,an interactive-applications monitoring systemfor execution steering and performance analy-sis; Remos, a prototype similar to NWS thatadditionally provides logical views of networktopologies; SCALEA-G, an extensible, service-oriented monitoring and performance analysissystem for both applications and resources.

Distributed republishers with replication: NWS, anetwork monitoring and forecasting service pro-viding end-to-end measurements and predic-tions.

6

-i sedg ionP -w n-v ters

b-s Eacho ovidet encea ed,t ons,s il. Am more

producers and, based on a management logic (a seriesof if-then statements, or an expert system), instructsactors to perform specific actions.

Producers and actors register their location, eventsand actions in an LDAP-basedregistrywhere managerslookup for the appropriate producers and actors.

A management agent, consisting of an observer,an actor and a manager, is placed in each Grid Re-source and Allocation Manager (GRAM) and Grid In-formation Service (GIS) server of a Globus installa-tion. Events generated by management agents are for-warded to an event archive which is discovered throughthe registry. A GUI management front-end (i.e., acon-sumer) retrieves events from the archive (arepublisher)to illustrate the current status of hosts and networks.Also, the management application allows a user to per-form simple actions, such as Globus daemons and useraccounts management, based on what is advertisedin the registry. The GGF’s XML producer-consumerprotocol is employed for exchanging events, whilethe event archive is an XML database queried usingXPath.

The CODE monitoring system is intended for hosts,networks and services, and has a single instance ofa republisher per installation and thus is classified asL2a.G.

6.2. GridRM (L2a.H.S)

Grid Resource Monitoring (GridRM)[24] is a re-s y ofa , thata pleN ,N ).

aJ ventsf ev-eG -p in-f

yer.T cesw ridS n or-

ons.

.1. CODE-based monitoring system (L2a.G)

The considered system[22] is concerned with montoring and managing organisation-wide Globus-barid installations, and is used in the NASA Informatower Grid (IPG). It is based on CODE[23], a frameork for Control and Observation in Distributed Eironments, primarily intended to facilitate compuystems administration.Overview: The CODE framework includes o

ervers, actors, managers and a directory service.bserver process manages a set of sensors and pr

heir events through an event producer interface, hcting as aproducer. Every actor process can be ask

hrough an actor interface, to perform specific actiuch as restarting a daemon or sending an emaanager process consumes the events from one or

s

earch project that aims to provide a unified waccessing a diverse set of monitoring data sourcesre typically found in grid environments (e.g., Simetwork Management Protocol (SNMP)[25], GangliaetLogger, Network Weather Service (NWS), etc.2

Overview: In GridRM, every organisation hasava-based gateway that collects and normalises erom local monitoring systems. In this respect,ry gateway operates as arepublisherof external (toridRM) producers. A global registry is used to suport consumers in discovering gateways providing

ormation of interest.Each gateway consists of a global and a local la

he former includes an abstract layer which interfaith platform-specific consumer APIs (Java, Web/Gervices, etc.) and a security layer that applies a

2 Ganglia, NetLogger and NWS are considered in later secti

Page 10: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

172 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

ganisation’s access control policy. The local layer, hasseveral components including an abstract data layerand a request handler. The latter receives consumerqueries from the global layer and collects real-timeor archived data from appropriate sources dependingon the query’s type (last state or historical). The ab-stract data layer includes several JDBC-based drivers,each one for retrieving data from a specific producer.The Java Database Connectivity (JDBC) interface isJava’s standard way of interoperating with databases;GridRM hides the diversity of monitoring sources be-hind JDBC’s widely used interface.

Gateways represent events according to the GLUEschema. Consumers form and submit SQL queries us-ing GLUE as the vocabulary, and gateways forward thequeries to the appropriate drivers.

GridRM is a research proposal that has to resolveseveral issues. For instance, the GLUE schema is un-der development and hence not complete; custom ex-tensions must be made or adopted to support events fornetworks, processes and applications. Also, it is notclear whether the joining of information from severalgateways is performed by clients themselves or by gate-ways (e.g., iteratively or recursively). In the latter caseand in addition to the load generated by the constantconversion of collected events to the GLUE schema,gateways are likely to become a bottleneck, whereasthey also pose a single point of failure.

In the context of the taxonomy, GridRM is classifiedas L2a.H.S, namely a stackable second level systemw at isc ex-t on-i

6

ntt sess as

or-i eto mu-n rep-r ver-t ep

The central manager (republisher) indexes the cur-rent state of nodes for fast query execution, and peri-odically stores it into a round robin database to main-tain an archive. The monitoring information in the cen-tral manager, in addition to an API, can be accessedvia command line utilities, and web and GUI front-ends.

Administrators can submit jobs to monitored nodes,either for unconditional execution or to be triggered asa response to specific events (e.g., when disk space isrunning out).

Hawkeye is a second level, general-purpose mon-itoring system with a centralised republisher, i.e.,L2a.G.

6.4. HBM (L2b.V)

The Globus Heartbeat Monitor[29] (HBM) is an im-plementation of an unreliable fault-detection service offail-stop failures of processes and hosts. A fault detec-tor is considered unreliable if there is a possibility oferroneously reporting failures. A fail-stop failure of acomponent refers to the class of failures that are perma-nent and can be detected by external components. HBMwas employed in early versions of Globus to verify theavailability of grid services, but has been dropped dueto the later adoption of soft-state protocols (i.e., ser-vices subscribe to a registry and periodically renewtheir subscription, which otherwise expires).

Overview: HBM consists of local monitors (produc-ei spon-s ro-c lo-c andh pro-c ameh ofa theirs ipt ofm ve tod ses,a redc

seda tedr ofh

ith a centralised republisher per organisation, thurrently focused on host events (GLUE must beended accordingly to consider GridRM a generic mtoring system).

.3. Hawkeye (L2a.G)

Hawkeye[26], is a monitoring and managemeool for clusters of computers. Although Hawkeye uome technology from Condor[27] it is available astandalone distribution for Linux and Solaris.Overview: Every monitored node hosts a monit

ng agent (producer) that periodically calculates a sf metrics, which reflect the host’s state, and comicates them to a central manager. The metrics areesented in XML-encoded Condor’s classified adisements (classads)[28], that is, simple attribute-valuairs with optional use of expressions.

rs), data collectors (republishers) andconsumers. Fornstance, a consumer can be a program that is reible for the availability of specific services, or a pess of a distributed parallel program. In HBM, aal monitor has to be running in monitored hosts,osts of monitored processes. Every monitoredess registers to the local monitor residing in the sost. A local monitor periodically detects the statell the monitored processes and communicatestatus to interested data collectors. Upon the receessages from local monitors, data collectors haetermine the availability status of specific procesnd notify accordingly any previously registeonsumers.

With respect to the taxonomy, HBM is characteris L2b.V, namely a second level system with distribuepublishers that is concerned with the availabilityosts and processes.

Page 11: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188 173

6.5. JAMM (L2b.G)

Java Agents for Monitoring and Management(JAMM) [30] was an effort to leverage the achieve-ments of NetLogger (seeSection 6.7) for building aGMA-like monitoring system with managed sensors.

Overview: JAMM places a sensor manager per host,implemented using Java RMI, in order to achieve dy-namic activation of sensors either by (i) monitoringnetwork ports for incoming connections and enablingsensors related to the load generated by the invokedapplication, or (ii) after an explicit request made by aGUI management application.

Every host is associated with a gateway where itsevents are being sent using NetLogger’s Universal Log-ger Message (ULM) format. Consumers can lookup inan LDAP-based registry for available sensors and theirassociated gateways, and retrieve the events they areinterested in, from the latter.

In the context of the taxonomy, JAMM’s sensors op-erate asproducersbecause they generateanddissemi-nate events; gateways serve asrepublishersthat aggre-gate and optionally filter incoming events according toconsumer queries. Since there may be more than onerepublishers but replication is not supported, JAMM isclassified as L2b.G.

6.6. Mercury (L2b.G)

Mercury [31–33] is a generic and extensible mon-i ct.Tc d-a dvt ente form l en-v ).

orp gs ofs llecti tus,a Thel estsr rvesr

The monitoring service is where external consumerssubmit their queries. Upon the receipt of such a query,the monitoring service validates it against the site pol-icy and, if valid, instructs the main monitor to performthe query, which in turn coordinates the involved lo-cal monitors. Eventually, the monitoring service re-ceives the results from the main monitor, transforms theplatform-specific measurements to comparable valuesand finally forwards them to the consumer.

Mercury defines a custom producer-consumer pro-tocol that supports multi-channel communication anduses External Data Representation (XDR) for the en-coding of events. In addition, a library and a specialsensor are provided for the instrumentation of applica-tions, so that a job can generate custom events whichare sent to the sensor and read by the local monitor.Mercury also provides decision-making modules thatinform actuators on adapting the monitoring processand steering applications.

The main monitor in Mercury may be deployed inseveral instances to allow for load distribution. Com-bined with the described architecture and the supportfor events related to hosts, networks and applications,Mercury is classified as L2b.G.

6.7. NetLogger (L2a.A)

The Network Application Logger Toolkit (NetLog-ger)[37] is used for performance analysis of complexsystems such as client–server and/or multi-threadeda anda iewt tle-n

ts:a erla ogs( rap-p forv

k-i ter( ingc witha es-t . Theg yslogd ents

toring system, built as part of the GridLab projehe latter aims in the development of aGrid Appli-ation Toolkitto facilitate developers in building griware applications[34]. Mercury is a grid-enhanceersion of the GRM distributed monitor[35,36] ofhe P-GRADE graphical parallel program developmnvironment. GRM is an instrumentation libraryessage-passing applications in traditional paralle

ironments (such as clusters and supercomputersOverview: Mercury consists of one local monit

er host (producer), a main monitor, and a monitorinervice (republisher). Local monitors employ a setensors, implemented as loadable modules, to conformation about the local node, including host stapplications, etc., and send it to the main monitor.

atter coordinates local monitors according to requeceived from the monitoring service and also seequests from local (i.e., site-level) consumers.

pplications. NetLogger combines network, hostpplication events and thus provides an overall v

hat facilitates the identification of performance botecks.Overview: NetLogger consists of four componen

n API and its library (available for C, C++, Java, Pnd Python), tools for collecting and manipulating li.e., events), host and network sensors (typically wers of Unix monitoring programs), and a front-endisualisation of real-time or historical events.

An application is manually instrumented by invong NetLogger’s API calls typically before and afdisk/network) I/O requests and any time-consumomputations. Events are tagged by the developertextual description, and by the library with a tim

amp and some host- and network-related eventsenerated events are stored to either a local file, a saemon or a remote host. Prior to transmission, ev

Page 12: A taxonomy of grid monitoring systemszanikols/fgcs05.pdf · a taxonomy of grid monitoring systems, which is employed to classify a wide range of projects and frameworks. The value

174 S. Zanikolas, R. Sakellariou / Future Generation Computer Systems 21 (2005) 163–188

are locally buffered to minimise the overhead imposedin high rates of generation.

In terms of data encoding, NetLogger supports thetext-based Universal Logger Message format (ULM),along with binary and XML-based encodings, allow-ing developers to choose the imposed overhead. In ad-dition, the API was extended to allow for dynamic(de)activation of logging by periodically checking alocal or remote configuration file. Concerning robust-ness, another extension to the API supports dynamicfail-over to a secondary logging destination in case theoriginal remote host becomes unavailable.

GMA-like NetLogger application monitoring: In anattempt to line up NetLogger with the GMA concepts,Gunter et al.[38] extended NetLogger’s framework byadding a monitoring activation service which is fur-ther elaborated in[39]. An activation service consistsof three components: an activation manager and an ac-tivation producer per installation (republisher), and anactivation node per host (producer).

An activation manager holds the logging detail re-quired for applications instrumented with NetLogger(including an option for deactivation of logging) anda simple client is provided for setting these values re-motely. Each activation node periodically queries theactivation manager for the required logging levels andcommunicates this information to the local applica-tions through configuration files that are periodicallychecked by the latter. Applications are instructed tolog events in a local file wherefrom the activation nodef ro-d tiva-t crip-t

m-p P-b reg-i GFX et cersc t op-t

cedb pollt allyc ualc thed nts.

Given a registry where all components would be sub-scribing their details, it would be far less intrusive tohave the activation manager to inform activation nodesof logging level updates instead of the current designwhere activation nodes have to poll activation managersevery 5 s.

NetLogger is classified as L2a.A, namely a secondlevel system with a centralised republisher that is in-tended for application monitoring.

6.8. NWS (L2c.N)

The Network Weather Service is a portable (ANSIC-based) and non-intrusive performance monitoringand forecasting distributed system, primarily intendedto support scheduling and dynamic resource allocation[40,10].

Overview: In NWS, a host employs sensors for estimating CPU load, memory utilisation and end-to-end network bandwidth and latency for all possible sensor pairs. Sensors combine passive and active monitoring methods to accomplish accurate measurements, and are stateless to improve robustness and minimise intrusiveness. Network sensors in particular employ a set of techniques for avoiding conflicts among competing sensors. Sensors are managed through a sensor control process and their events are sent to a memory service, both of which can be replicated for distribution of load and fault-tolerance. All components subscribe to an LDAP-based registry (referred to as the name service), using a soft-state protocol.

A forecasting process consumes events from the memory service to generate load predictions using a variety of forecasting libraries. A CGI-based front-end exposes current performance measurements and predictions to end-users.

NWS has a small number of sophisticated and portable sensors, while there are prototypes for disk I/O, Network File System (NFS) and system availability. Current interfaces include C, LDAP and Globus Monitoring and Discovery Service (MDS) wrapper shell scripts, whereas a prototype has been under development for the Open Grid Services Architecture (OGSA) [41].

In terms of the taxonomy, NWS sensors operate as producers (i.e., measure and disseminate events), and memory services and forecasters serve as republishers; forecasters always operate on top of memory services. Since memory services can be configured for replication, NWS is classified as L2c.N.

More recently, Topomon [42] has extended NWS to provide network topology information, which can be used to compute minimum spanning trees between two given hosts, in terms of latency or bandwidth. As part of that extension, Topomon has a republisher operating on top of NWS memory and forecasting processes, and employs GGF's XML producer-consumer protocol.
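The general idea behind the forecasting process described above can be illustrated with a minimal sketch: several cheap predictors run side by side and, for each new measurement, the one with the lowest accumulated error supplies the published forecast. The predictor names and sample data below are invented; this is not the NWS code or API.

    # Minimal sketch of forecaster selection, assuming a set of simple predictors
    # competing on cumulative absolute error. Illustrative only.
    from statistics import mean, median

    def run_forecasters(series):
        predictors = {
            "last_value": lambda history: history[-1],
            "running_mean": lambda history: mean(history),
            "sliding_median": lambda history: median(history[-5:]),
        }
        errors = {name: 0.0 for name in predictors}
        forecasts = []
        for i in range(1, len(series)):
            history, actual = series[:i], series[i]
            # Publish the forecast of the currently best (lowest-error) predictor.
            best = min(errors, key=errors.get)
            forecasts.append(predictors[best](history))
            # Update every predictor's cumulative absolute error.
            for name, predict in predictors.items():
                errors[name] += abs(predict(history) - actual)
        return forecasts

    if __name__ == "__main__":
        bandwidth = [92.1, 90.5, 88.0, 91.2, 40.3, 42.0, 41.1, 89.9]  # Mbit/s, made-up samples
        print(run_forecasters(bandwidth))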

6.9. OCM-G (L2b.A)

OMIS Compliant Monitor (OCM-G) [43] is a monitoring system for interactive grid applications, developed as part of the EU CrossGrid project [44]. OCM-G is a grid-enhanced implementation of the On-line Monitoring Interface Specification (OMIS) [45]. OMIS defines a standard interface between instrumented applications and consumers.

Overview: OCM-G's overall architecture is rather similar to that of Mercury. There are per-host local monitors and per-site service managers. Local monitors have a producer interface for disseminating events generated by statically or dynamically instrumented applications (sensors). End-user performance tools (consumers) address commands to service managers (republishers), which in turn contact the local monitors of the involved applications.

OCM-G supports three kinds of services: on-demand retrieval of events; manipulation of running applications for performance-enhancement and steering; and execution of actions upon the detection of specific events.

OCM-G is intended to be customisable with respect to performance/overhead tradeoffs (such as the buffer size for storing events in local monitors). This is due to the emphasis on interactive applications, which require low-latency monitoring to support real-time user feedback. Also, OCM-G defines numerous low-level events (so-called metrics) to allow performance tools to define composite events with custom semantics. In addition, OCM-G allows enquiries for host- and network-related events to facilitate performance analysis (i.e., as in NetLogger), and supports certificate-based authentication and access control of users as well as of local monitors and service managers.

In the context of the taxonomy, OCM-G is characterised as L2b.A, i.e., a second level system with distributed republishers that is focused on application monitoring.
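As an illustration of the composite events that performance tools build on top of low-level metrics, the sketch below derives a message rate from raw counter events. The event fields and tuple layout are invented for the example and do not reflect the OMIS interface.

    # Illustrative sketch only: deriving a composite metric (message rate) from
    # low-level counter events, in the spirit of tool-defined composite events.
    def message_rate(events):
        """events: list of (timestamp_seconds, messages_sent_counter) tuples."""
        rates = []
        for (t0, c0), (t1, c1) in zip(events, events[1:]):
            if t1 > t0:
                rates.append((t1, (c1 - c0) / (t1 - t0)))
        return rates

    samples = [(0.0, 0), (1.0, 120), (2.0, 260), (3.0, 260)]
    print(message_rate(samples))   # [(1.0, 120.0), (2.0, 140.0), (3.0, 0.0)]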

6.10. Remos (L2b.N.S)

The Resource Monitoring System (Remos) provides to network-aware applications an application programming interface (API) for run-time enquiry of performance measurements of local and wide area networks [46,47].

Overview: Remos has a query-based interface featuring two abstractions, namely flows and network topologies. A flow represents a communication link between two applications. In contrast, a network topology graph provides a logical view of the physical interconnection between compute and network nodes, annotated with information on link capacity, current bandwidth utilisation and latency.

Remos consists of several types of collectors, a modeller and a prediction service. A variety of collectors is employed to accommodate the heterogeneity of networks: SNMP and bridge collectors for SNMP-aware routers and Ethernet switches, respectively; benchmark collectors for network links where SNMP is not available, typically wide area network (WAN) links. SNMP and bridge collectors correspond to republishers making use of external producers (SNMP); benchmark collectors are producers implementing active sensors.

In addition, every site has a master collector, which accepts queries from modellers. A master collector coordinates the appropriate collectors for the execution of a given query, and collects, merges and sends the results to the query's originator, effectively acting as a higher level republisher. These results are raw measurements and it is the modeller's responsibility to build the (flow or topology) abstractions before forwarding them to the application. Every application has its own modeller, which is a process running on the same host, which makes modellers part of consumers.

In addition to current load measurements, Remos' API supports predictions for host load and network events (e.g., bandwidth utilisation and latency), of which only the former was implemented at the time of writing, using the RPS toolkit [48].

Because of the complexity involved in networks (e.g., routing protocols, heterogeneity of devices, etc.), Remos provides "best effort" measurements annotated with statistical parameters (e.g., standard deviation, confidence). Remos' design focuses on the provision of a consistent interface, independently of the underlying network technicalities, and on portability, hence the use of SNMP and simple system-independent benchmarks.

Remos employs a variety of producers (external SNMP producers, benchmark collectors) and republishers (SNMP collectors, master collectors, predictor), which however have to be connected in a predefined way. On this basis, Remos is classified as L2b.N.S, that is, a stackable (since it operates on top of SNMP) second level system with optional support for multiple first-level republishers (SNMP collectors) per installation.
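To give a feel for the topology abstraction, the sketch below answers the kind of question a Remos-like query yields: along a path in a topology graph annotated with per-link bandwidth and latency, the available bandwidth is the bottleneck link and the latency is the sum over the links. The hosts, links and numbers are invented and unrelated to the actual Remos API.

    # Sketch: querying an annotated logical topology, assuming invented link data.
    links = {
        ("hostA", "switch1"): {"bandwidth_mbps": 1000, "latency_ms": 0.1},
        ("switch1", "router1"): {"bandwidth_mbps": 100, "latency_ms": 0.5},
        ("router1", "hostB"): {"bandwidth_mbps": 1000, "latency_ms": 0.2},
    }

    def path_characteristics(path):
        hops = list(zip(path, path[1:]))
        bandwidth = min(links[hop]["bandwidth_mbps"] for hop in hops)  # bottleneck link
        latency = sum(links[hop]["latency_ms"] for hop in hops)        # additive delay
        return bandwidth, latency

    print(path_characteristics(["hostA", "switch1", "router1", "hostB"]))  # (100, 0.8)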

6.11. SCALEA-G (L2b.G)

SCALEA-G is an extensible, service-oriented³ instrumentation, monitoring and performance analysis framework for hosts, networks and applications.

Overview: SCALEA-G [49] implements a variety of grid services (see Section 7.2), including sensor, sensor manager, instrumentation, archival, registry and client services. A sensor manager service interacts with sensor service instances, which may be on the same or different hosts. This implies that SCALEA-G sensors are equivalent to producers and sensor managers to republishers. An archival service provides persistent storage of monitoring information. Producers and republishers are registered in and discovered from a sensor repository and a directory service, respectively. (The directory service and sensor repository service, combined, provide the functionality of a registry.)

An application instrumentation service is partially built on existing systems (SCALEA [50] and Dyninst [51]) to support source-level and dynamic instrumentation. Consumers interact with the instrumentation service using an XML-based instrumentation request language that defines request/response messages.

All messages are encoded in XML according to predefined schemas, and consumers can pose XPath/XQuery queries or establish subscriptions with sensor manager services. Globus' Grid Security Infrastructure (GSI) is employed to provide security services such as consumer authentication and authorisation. SCALEA-G is complemented by GUI programs for configuring the system and conducting performance analysis.

SCALEA-G is a second level general-purpose monitoring system with distributed republishers, namely L2b.G.

³ SCALEA-G is implemented using the Open Grid Services Architecture (OGSA), which has been superseded by the newly adopted Web Services Resource Framework (WSRF).
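The kind of XPath selection a consumer might pose over XML-encoded events can be sketched with the standard library, as below. The event schema is invented for the example and is not SCALEA-G's actual schema or client API.

    # Illustrative only: a simple XPath-style selection over XML-encoded events.
    import xml.etree.ElementTree as ET

    events_xml = """
    <events>
      <event sensor="cpu_load" host="node01"><value>0.42</value></event>
      <event sensor="cpu_load" host="node02"><value>0.91</value></event>
      <event sensor="free_memory" host="node01"><value>512</value></event>
    </events>
    """

    root = ET.fromstring(events_xml)
    # ElementTree supports a limited XPath subset: select cpu_load events only.
    for event in root.findall("./event[@sensor='cpu_load']"):
        print(event.get("host"), event.find("value").text)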

7. Level 3: hierarchy of republishers

This section focuses on third level monitoring systems, that is, frameworks featuring producers and general purpose republishers which can form an arbitrarily structured hierarchy. The considered systems are: Ganglia, a fairly scalable and widely used cluster monitoring system; Globus MDS, the Monitoring and Discovery Service of the most widely deployed grid middleware; MonALISA, a Jini-based monitoring prototype for large distributed systems; Paradyn with Multicast/Reduction Network, a system for application performance analysis; and RGMA, a relational approach to GMA that is intended to operate as a standalone Grid Information Service (GIS). MDS, MonALISA and RGMA can be configured for stackable operation, i.e., to run on top of other monitoring systems; among others, they can be configured to retrieve and query information provided by a Ganglia installation.

7.1. Ganglia (L3.G)

Ganglia [52] is an open source hierarchical monitoring system, primarily designed for computer clusters but also used in grid installations. At the time of writing, Ganglia was deployed in more than 500 clusters.

Overview: At the cluster level, membership is determined with a broadcast, soft-state protocol; soft-state means that membership must be periodically renewed by explicit messages or otherwise expires. All nodes have a multi-threaded daemon (the Ganglia monitoring daemon) performing the following tasks:

• Collecting and broadcasting External Data Representation (XDR) encoded events from the local host.
• Listening to the broadcasts sent by other nodes and locally maintaining the cluster's state.
• Replying to consumer queries about any node in the local cluster, using XML encoded messages.

Given the above actions, a cluster's status is replicated among all nodes, which act as producers, resulting in distribution of load and fault-tolerance, but also in high network and host intrusiveness.

An arbitrarily structured hierarchy of republishers (referred to as Ganglia meta-daemons) periodically collects and aggregates events from lower level data sources, stores them in round-robin databases, and provides them on demand to higher level republishers. Data sources may be either producers (on behalf of a cluster) or other republishers (on behalf of several clusters); in both cases an XML encoding is employed.

Ganglia does not have a registry and therefore the location of producers and republishers must be known through out-of-band means. The databases serve as archives and are also used by a web-based visualisation application providing cluster- and node-level statistics. Simple command line utilities are provided for adding new event types and querying producers and republishers.

Remarks: Ganglia introduces considerable, albeit linear, overhead both at hosts and networks at the cluster and hierarchy levels, because of the multicast updates in the former and the XML event encoding in the latter. The network intrusiveness imposed by republishers connected through WAN links is of considerable importance given the associated costs. Other concerns include the availability of IP multicast, and the lack of a registry, since Ganglia was primarily intended for clusters, which are fairly static compared to grids.
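The soft-state membership idea described above can be sketched as follows: a node's entry is kept only while its periodic heartbeats keep arriving, and is dropped otherwise. The real daemons do this over multicast with XDR-encoded messages; the transport is omitted here, and the timeout value and node names are invented.

    # Minimal sketch of soft-state membership with heartbeat expiry. Illustrative only.
    import time

    HEARTBEAT_TIMEOUT_SECONDS = 20.0
    cluster_state = {}  # node name -> timestamp of last heartbeat

    def on_heartbeat(node: str, now: float) -> None:
        cluster_state[node] = now

    def live_members(now: float):
        """Drop nodes whose membership was not renewed in time, return the rest."""
        expired = [n for n, t in cluster_state.items() if now - t > HEARTBEAT_TIMEOUT_SECONDS]
        for node in expired:
            del cluster_state[node]
        return sorted(cluster_state)

    if __name__ == "__main__":
        t = time.time()
        on_heartbeat("node01", t)
        on_heartbeat("node02", t - 60.0)   # stale: last heard a minute ago
        print(live_members(time.time()))   # ['node01']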

7.2. Globus MDS (L3.G.S)

The Monitoring and Discovery Service [2,3], formerly known as the Metacomputing Directory Service, constitutes the information infrastructure of the Globus toolkit [53].

Overview: MDS 2.x is based on two core protocols: the Grid Information Protocol (GRIP) and the Grid Registration Protocol (GRRP). The former allows query/response interactions and search operations. GRIP is complemented by GRRP, which is used for maintaining soft-state registrations between MDS components.

The Lightweight Directory Access Protocol (LDAP) [54] is adopted as a data model and representation (i.e., hierarchical and LDIF, respectively; LDAP Directory Interchange Format), as a query language and transport protocol for GRIP, and as a transport protocol for GRRP. Given the LDAP-based hierarchical data model, entities are represented as one or more LDAP objects defined as typed attribute-value pairs and organised in a hierarchical structure, called the Directory Information Tree (DIT).

The MDS framework consists of information providers (sensors), Grid Resource Information Services (GRIS, the producers) and Grid Index Information Services (GIIS, the republishers). Both producers and republishers are implemented as backends for the open source OpenLDAP server implementation.

Producers collect events from information providers, either from a set of shell scripts or from loadable modules via an API. In addition, producers provide their events to republishers or to consumers using GRIP, and register themselves to one or more republishers using GRRP.

Republishers form a hierarchy in which each node typically aggregates the information provided by lower level republishers (and producers, in the case of first level republishers). Republishers use GRIP and GRRP as part of the consumer and producer interfaces, though custom implementations could offer alternative producer interfaces (i.e., relational). Several roles may be served by republishers, including the provision of special purpose views (e.g., application-specific), organisation-level views and so on.

Consumers may submit queries to either producers or republishers, or discover producers through republishers, in any case using GRIP.
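The hierarchical data model itself (not the LDAP or GRIP wire protocols) can be sketched as a tree of entries holding typed attribute-value pairs, with a query walking the subtree under a base entry. The distinguished names and attribute names below are invented for illustration.

    # Sketch of a DIT-like tree of typed attribute-value entries, with a subtree search.
    directory_tree = {
        "dn": "o=grid",
        "attributes": {"objectClass": ["organization"]},
        "children": [
            {
                "dn": "Mds-Vo-name=siteA,o=grid",
                "attributes": {"objectClass": ["MdsVo"]},
                "children": [
                    {"dn": "host=node01,Mds-Vo-name=siteA,o=grid",
                     "attributes": {"objectClass": ["MdsHost"], "cpuCount": [4]},
                     "children": []},
                    {"dn": "host=node02,Mds-Vo-name=siteA,o=grid",
                     "attributes": {"objectClass": ["MdsHost"], "cpuCount": [8]},
                     "children": []},
                ],
            }
        ],
    }

    def subtree_search(entry, attribute, value):
        """Return DNs of entries in the subtree whose attribute contains the value."""
        matches = []
        if value in entry["attributes"].get(attribute, []):
            matches.append(entry["dn"])
        for child in entry["children"]:
            matches.extend(subtree_search(child, attribute, value))
        return matches

    print(subtree_search(directory_tree, "objectClass", "MdsHost"))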

Remarks: The hierarchical data model along with LDAP's referral capability (i.e., forwarding a query to an authoritative server) accommodates well the need for autonomy of resource providers and decentralised maintenance. Also, MDS supports security services, such as access control, through the use of the Grid Security Infrastructure (GSI) [55]. However, LDAP features a non-declarative query interface that requires knowledge of the employed schema. In addition, the performance of OpenLDAP's update operation, which is by far the most frequently used, has been very much criticised.

MDS3: Globus was re-designed and implemented as part of the Open Grid Services Architecture (OGSA) [41], a web services-based framework aiming to enhance interoperability among heterogeneous systems through service orientation (i.e., hiding the underlying details by means of consistent interfaces). In OGSA, everything is represented as a grid service, that is, a web service that complies with certain conventions, including the implementation of a set of grid service interfaces (portTypes in WSDL terminology). Every grid service exposes its state and attributes through the implementation of the GridService portType and, optionally, the NotificationSource portType, which correspond to pull and push data delivery models, respectively.

In this respect, the functionality of the MDS2 Grid Resource Information Service (GRIS) is encapsulated within grid services. In OGSA, the equivalent of the MDS2 Grid Index Information Service (GIIS) is the Index Service which, among others, provides a framework for aggregation and indexing of subscribed grid services and lower level Index Services. Index Services are organised in a hierarchical fashion just like the GIISs in MDS2.

Information is represented in XML according to the GLUE schema. Simple queries can be formed by specifying a grid service and one or more service data elements, whereas more complex expressions are supported using XPath.

7.3. MonALISA (L3.G.S)

MonALISA (Monitoring Agents using a Large Integrated Services Architecture) [56] is a Jini-based [57], extensible monitoring framework for hosts and networks in large-scale distributed systems. It can interface with locally available monitoring and batch queueing systems through the use of appropriate modules. The collected information is locally stored and made available to higher level services, including a GUI front-end for visualising the collected monitoring events.

Overview: MonALISA is based on the Dynamic Distributed Services Architecture (DDSA) [58], which includes one station server per site or facility within a grid, and a number of Jini lookup discovery services (i.e., equivalent to registries). The latter can join and leave dynamically, while information can be replicated among discovery services of common groups.

A station server hosts, schedules, and restarts if necessary, a set of agent-based services. Each service registers to a set of discovery services wherefrom it can be found by other services. The registration in Jini is lease-based, meaning that it has to be periodically renewed, and includes contact information, event types of interest and the code required to interact with a given service.

Each station server hosts a multi-threaded monitoring service, which collects data from locally available monitoring sources (e.g., SNMP, Ganglia, LSF, PBS, Hawkeye) using readily available modules. The collected data are locally stored and indexed in either an embedded or an external database, and provided on demand to clients (i.e., consumer services).

A client, after discovering a service through the lookup service, downloading its code and instantiating a proxy, can submit real-time and historical queries or subscribe for events of a given type. Custom information (i.e., not part of the schema) can be acquired by deploying a digitally signed agent filter to the remote host. In addition, non-Java clients can use a WSDL/SOAP binding.

Services and modules can be managed through an administration GUI, allowing an authenticated user to remotely configure what needs to be monitored. Also, MonALISA has a facility for periodically checking the availability of new versions and automatically updating any obsolete services.

Current applications of MonALISA are a GUI front-end featuring several forms of status visualisation (maps, histograms, tables, etc.) and dynamic optimisation of network routing for the needs of a video conference application. Future plans include building higher level services for supporting job scheduling and data replication.

Concluding, MonALISA provides a general-purpose and flexible framework, though it could be argued that Java restricts the overall performance. Also, Jini uses multicast, which is not always available, and places scalability limits. In terms of the taxonomy, external (to MonALISA) monitoring sources are producers, whereas monitoring services and other higher-level services serve as republishers, by collecting data from producers and providing higher level abstractions, respectively. To this end, MonALISA is classified as L3.G.S, meaning that custom-built republishers can be structured in a custom hierarchy; a variety of entities can be monitored; and the system is stackable.

7.4. Paradyn/MRNet (L3.A)

Paradyn [59] is a performance analysis toolkit for long running, parallel, distributed and sequential applications. It supports dynamic instrumentation, that is, insertion, modification and removal of instrumentation code during program execution.

Paradyn originally had a first-level architecture (i.e., consumers and producers only), consisting of an end-user GUI for application performance analysis and control, which consumes data originating from the processes of an instrumented distributed application. Scalability concerns arising from performance analysis of applications with hundreds of processes motivated the development of the Multicast/Reduction Network. MRNet [60,61] is a communication system, with support for multicast and data aggregation services, for use in parallel tools; it is available separately and as part of Paradyn.

Overview: In addition to Paradyn's end-user GUI (consumer, referred to as the front-end), parallel processes of an application (sensors) and a back-end per process (producers), MRNet provides a program that can be run in many instances on potentially different hosts to form a custom hierarchy of internal processes (republishers) that transfer data from producers to a single consumer and vice versa. Producers and consumers can use MRNet's communication facilities through the provided C++ library.

MRNet republishers support synchronisation and transformation filters to determine when packets should be forwarded up the hierarchy (e.g., wait for a packet from every child node) and to apply packet-specific data-reduction techniques (e.g., deliver to the consumer the average of all producers' measurements as opposed to all values). Also, an MRNet-instrumented consumer uses multicast for sending control requests to producers, thus scaling better for large numbers of producers.

Paradyn with MRNet is reported [60] to scale well in settings of up to 500 producers per consumer, resulting in significantly improved scalability (compared to the original first level architecture) with respect to a variety of performance metrics.

Paradyn/MRNet is classified as L3.A because it is application-oriented and, in addition to consumers and producers, provides a configurable republisher for creating arbitrarily structured hierarchies of custom functionality.
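The combination of a synchronisation filter (wait for one packet per child) and a transformation filter (forward a single reduced value) described above can be sketched as a recursive reduction over a process tree. This is illustrative only and does not use MRNet's actual C++ API; the hierarchy and values are invented.

    # Sketch of tree-based data reduction: each internal process waits for all its
    # children and forwards their average upwards, so the consumer receives one value.
    from statistics import mean

    def reduce_tree(node):
        """node is either a leaf value (a producer's measurement) or a list of children."""
        if isinstance(node, list):                  # internal process (republisher)
            child_packets = [reduce_tree(child) for child in node]  # wait for every child
            return mean(child_packets)              # data reduction: forward the average
        return float(node)                          # producer measurement

    # A two-level hierarchy over six producers: the consumer receives one value.
    hierarchy = [[0.30, 0.50, 0.40], [0.70, 0.90, 0.80]]
    print(reduce_tree(hierarchy))  # 0.6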

7.5. RGMA (L3.G.S)

The Relational Grid Information Services Research Group of the Global Grid Forum supports the view that the only difference between information and monitoring services is that the data involved in the latter have to be timestamped [62]. To this end, they have built RGMA [12] as part of the EU DataGrid project; a framework which combines grid monitoring and information services based on the relational model. That is, RGMA defines the GMA components, and hence their interfaces, in relational terms.

Overview: In RGMA, producers are distinguished into five different classes, but we limit our discussion to database and stream producers, which are indicative of the main concepts. Database producers are employed for static data stored in databases, whereas stream producers are used for dynamic data stored in memory-resident circular buffers. New producers announce their relations (i.e., event types) using an SQL "create table" query, offer them via an SQL "insert" statement, and "drop" their tables when they cease to exist. A consumer is defined as an SQL "select" query. In order for a component to act as either a consumer or a producer, it has to instantiate a remote object (agent) and invoke methods from the appropriate (consumer or producer) API.

The global schema includes a core set of relations, while new relations can be dynamically created and dropped by producers as previously described. Republishers are defined as one or more SQL queries that provide a relational view on data received by producers or other republishers.
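The announce/publish/consume lifecycle just described can be made concrete with plain SQL. The standalone sketch below only mirrors that lifecycle; RGMA's real producer and consumer APIs are remote Java/C++ objects, and the relation name and columns are invented.

    # Standalone sketch of the RGMA-style lifecycle (announce a relation, publish
    # tuples, consume with a SELECT), using sqlite3 only to make the SQL concrete.
    import sqlite3

    db = sqlite3.connect(":memory:")

    # Producer: announce an event type as a relation ("create table" in RGMA terms).
    db.execute("CREATE TABLE cpu_load (host TEXT, timestamp INTEGER, load REAL)")

    # Producer: offer events via INSERT statements.
    db.executemany("INSERT INTO cpu_load VALUES (?, ?, ?)",
                   [("node01", 1096600000, 0.42), ("node02", 1096600000, 0.91)])

    # Consumer: defined as a SELECT query over the (virtual) global schema.
    for row in db.execute("SELECT host, load FROM cpu_load WHERE load > 0.5"):
        print(row)   # ('node02', 0.91)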

The registry holds the relations and views provided by database producers, stream producers and republishers. The registry includes the global schema and is centralised, while there are efforts towards a distributed implementation. A mediator uses the information available in the registry and cooperates with consumers to dynamically construct query plans for queries that cannot be satisfied by a single relation (i.e., involving "joins" from several producers).

Remarks: RGMA is implemented in Java Servlets and its API is available for C++ and Java, while wrapper API implementations exist for C, Perl and Python. The RGMA implementation is considered stable but suffers in terms of performance [63].

RGMA can be used as a standalone grid information service, assuming information providers and consumers use the RGMA APIs. Some tools are available for supporting MDS2 information providers and consumers (i.e., using an LDAP interface), but at the cost of performance degradation.

Conceptually, RGMA provides access to the information of a Virtual Organisation's resources as if that was stored in a single RDBMS; there are plans for extending this concept across VOs. Overall, the system has a potential for good scalability given the (under development) replication of the global schema and the registry, and the combination of data sources into a hierarchy of republishers.

8. Related work

Most relevant to this survey is a white paper [64] by the "Automatic Performance Analysis: Real Tools" (APART) working group. This paper provides a directory of grid-oriented performance analysis systems, which are characterised along a large number of features. The paper also underlines the lack of any single system that can serve the needs of all user classes. To this end, the paper suggests three groups of systems that, combined, can satisfy most user needs. As part of that integration, it is noted that a lot of effort is needed for the systems to interoperate.

Also relevant, though rather outdated, is a survey of selected grid monitoring tools in [65]. For a brief discussion of related work on monitoring systems evaluation see Section 9.

Other work includes: WatchTower [66], a C++ library that allows applications to extract comprehensive host and network monitoring events from Microsoft Win32 operating systems, and apply data reduction techniques; GridObserver [67], a web front-end which appears rather similar to Ganglia (GridObserver was not classified due to limited available information); Ref. [68], which discusses a set of sampling methods that aim to minimise intrusiveness in large-scale systems; and Refs. [69] and [70], which have pointers to numerous web oriented systems and general-purpose network performance measurement tools, respectively.

9. Discussion

Tables 1–4 summarise the features of the systems considered in Sections 5–7. To start with, Tables 1 and 2 show the mapping of system components to that of GMA (as they were defined in Section 2.4), as well as characteristics of sensors and producer interactions.

GMA mapping: The registry functionality, if any, is provided either by standalone components or is partially (e.g., Mercury, Remos) or fully (e.g., MDS2, Hawkeye) provided by a republisher.

Most of the considered systems provide the tools for maintaining an archive into some form of a database (a relational DBMS, an XML DB, etc.) and some of them expose the archived data through a producer or a republisher interface (e.g., RGMA, CODE). MonALISA is interesting in that it provides the option of using an embedded or an external database per monitoring service.

Table 1
Systems overview with respect to components mapping to GMA, sensors and interactions—part 1

Features/project: AutoPilot | CODE | Hawkeye | GridRM | NetLogger | HBM | JAMM | Mercury
Classification: L1.A | L2a.G | L2a.G | L2a.H | L2a.A | L2b.V | L2b.G | L2b.G
Producer: Instrumented applications | Agent | Agent | External source | Instrumented applications | Local monitor | Instrumented applications | LM
Republisher: n/a | Event archive | Manager | Gateway | Activation producer | Data collector | Gateway | MM, MS
Registry: Name service | Directory service | Manager | Registry | Registry | – | Registry | MM
Archive: – | XML DB | Round robin DB | DBMS | DBMS | – | – | DBMS
Sensors, monitored entities: A | H/N/S | H/N/S | H… | A (H/N) | V | H/N/A | H/A/N
Sensors, passive/active: A | P | P | Depends | A (P) | P | P | P
Sensors, managed: y | y | y | n/a | y | n/a | y | y
Interactions, producer–republisher (PR): n/a | Both | Both | Depends | Push | Pull | Push | Both
Interactions, producer or republisher–consumer (PRC): Push | Both | Pull | Both? | Push | Push | Pull | Both

In the sensors category, parentheses denote non-comprehensive monitoring of related entities.


Table 2
Systems overview with respect to components mapping to GMA, sensors and interactions—part 2

Features/project: OCM-G | Remos | SCALEA-G | NWS | Ganglia | MDS2 | MonALISA | Paradyn/MRNet | RGMA
Classification: L2b.A | L2b.N | L2b.G | L2c.G | L3.G | L3.G.S | L3.G.S | L3.A | L3.G.S
Producer: Local monitor | SNMP services and benchmark collectors | Sensor and/or sensor manager | Memory host | Monitoring daemon | GRIS | Monitoring service | Backends | Producers
Republisher: Monitoring service | SNMP and master collector | Sensor manager | Predictor | Meta daemon | GIIS | Other agents | Internal processes | Republishers
Registry: – | Part of collectors | Sensor repository and directory service | Name service | – | GIIS | Lookup service | – | Registry
Archive: – | – | Archival service | – | RRD tool | – | DBMS | – | DBMS
Sensors, monitored entities: A (H/N) | N (H) | A/H/N/S | N (H) | H/N | H/N/P | H/N | A | H/N/?
Sensors, passive/active: A (?) | P/A | P/A | P | P | Depends | A? | P?
Sensors, managed: y | n/a | y | n/a | n | n | Depends | y | n
Interactions, producer–republisher (PR): Both | Pull? | Both | Pull | Push | Push | Depends | Both? | ?
Interactions, producer or republisher–consumer (PRC): Both | Pull? | Both | Pull | Pull | Pull | Both? | Both? | Both

In the sensors category, parentheses denote non-comprehensive monitoring of related entities.


Table 3
Systems overview with respect to events, implementation and other issues—part 1

Features/project: Autopilot | CODE | Hawkeye | GridRM | NetLogger | HBM | JAMM | Mercury
Event format (A, B, G, L, U, X, Xd = ASCII, Binary, GLUE, ULM, XML, XDR): SDDF | X | X | G | X/U/B | ? | U | Xd
Implementation language: C++ | C++, J | C++ | J | C/Py | C | J | C, J
API: C++ | C++, J | C++, J | JDBC | C, C++, J, P, Py | C? | ? | C
Tools: V | V | W, G, CLI | W | V | V | V | V
Availability: y | n | y | n | y | y | y | y
Key features: App steering | Globus systems management | Cluster stats/management | Unified interface | | Fault detection | | Extensibility and adaptivity
Security: GSI | GSI | GSI | – | – | – | – | GSI

Language/API: Co, J, Jp, Js, P, Py, W = CORBA, Java, Java Server Pages, Java Servlets, Perl, Python, WSDL/SOAP; Tools: V, W = Visualisation GUI, Web front-end.


Sensors: Regarding the monitored entities, almost all of the considered systems support host and network events, though the degree of comprehensiveness greatly varies (e.g., NWS is primarily intended for networks, NetLogger and OCM-G for applications, etc.).

Recall from Section 2.1 that a sensor may be either active or passive, depending on whether it actively performs measurements (e.g., using benchmarking algorithms) or passively collects performance data from platform-specific facilities. This is a tradeoff between intrusiveness on the one hand, and potentially platform-independent results on the other. To this end, an ideal system would provide sensors of both types, to allow developers to choose whichever is appropriate for a given use. Apart from the application instrumentation systems (Autopilot, NetLogger, OCM-G, and the instrumentation service of SCALEA-G), which require active sensors, most of the considered systems employ passive sensors. Exceptions include NWS (both types) and Remos (active for WAN, passive for LAN links).

The managed property denotes whether dynamic sensor (de)activation is possible, which applies only to systems that support the push data delivery model. This is supported by all relevant systems except Ganglia and MDS2.

Producers: The last sections of Tables 1 and 2 refer to the supported data delivery models, between producers and republishers on the one hand, and producers or republishers with consumers on the other (denoted as PR and PRC, respectively). Pull implies query-like interactions and push refers to either subscription or one-off interactions. Event delivery in subscriptions can be conditional or unconditional (i.e., periodic). Periodic event delivery can guarantee measurements of a given frequency but is more intrusive. In managed sensors, the frequency of measurements may be dynamically configurable.

On-line application performance analysis and steering (as opposed to post mortem analysis) require a constant stream of events and thus are more suited for push data delivery, whereas infrequent queries are better served using pull mode. Particularly, push-only systems without managed sensors typically impose excessive overhead. Ideally, both modes should be supported, or otherwise unnecessary overhead is imposed (e.g., consumers establishing a short-lived subscription in place of a query, or constantly posing queries due to the lack of subscriptions).

Events: In Tables 3 and 4, the format field indicates the employed encoding of events, varying from heavyweight and self-describing approaches (XML) to very compact but rigid formats (XDR). In any case, the employed event format greatly affects a system's intrusiveness in terms of both host and network load, for processing and transmitting events, respectively. Again, the ideal is to provide several options and let the user determine what is appropriate for any given case, as is done with NetLogger's ULM and Autopilot's SDDF.
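A rough feel for how much the encoding matters can be obtained by rendering the same event in a compact name=value line (in the spirit of ULM) and in self-describing XML, and comparing the sizes. The field names and values below are illustrative, not a normative schema of any of the considered systems.

    # Same event, two encodings: name=value text vs self-describing XML.
    event = {"DATE": "20040101T120000.000000", "HOST": "node01",
             "PROG": "gridftp", "NL.EVNT": "transfer.end", "BYTES": "1048576"}

    ulm_line = " ".join(f"{key}={value}" for key, value in event.items())
    xml_doc = "<event>" + "".join(f"<{k}>{v}</{k}>" for k, v in event.items()) + "</event>"

    print(len(ulm_line.encode()), "bytes as name=value text")
    print(len(xml_doc.encode()), "bytes as XML")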


Table 4
Systems overview with respect to events, implementation and other issues—part 2

Features/project: OCM-G | Remos | SCALEA-G | NWS | Ganglia | MDS2 | MonALISA | Paradyn/MRNet | RGMA
Event format (B, G, L, U, X, Xd = Binary, GLUE, ULM, XML, XDR): Java data types | X | Xd, X | L | X? | X
Implementation language: C | J | J | C | C | C | Jini | C++ | Js
API: C, C++ | J | C, C++, J | C, sh, LDAP | C, CLI util | C, Co, J, Jp, P, Py, sh, LDAP | J, W | C++ | C, C++, J, P, Py
Tools: V | – | V | W | W | W | V, W | V | V, W
Availability: y | y | y | y | y | y | y | y | y
Key features: Interactive application steering | Flows, topologies, predictions | Services-based unified system | Load predictions | Cluster stats | Custom views | Corrective actions | Scalable on-line app analysis | Virtual DBMS
Security: GSI | GSI | GSI | – | GSI | SSL/X.509 | – | GSI

Language/API: Co, J, Jp, Js, P, Py, W = CORBA, Java, Java Server Pages, Java Servlets, Perl, Python, WSDL/SOAP; Tools: V, W = Visualisation/Analysis GUI, Web front-end.


Implementation issues: Tables 3 and 4 also refer to the implementation language of the considered systems, the supported application programming interfaces (APIs) and tools, and whether the actual software is distributed through the Internet. The language itself is a good indication of a system's host overhead and overall performance; hence Java-based systems are normally expected to have worse performance compared to C-based systems.

Concerning APIs, some of the production level projects feature many options, whereas there is a tendency (though not reflected in the tables) to provide WSDL/SOAP bindings for interfacing with OGSA.

In terms of tools, almost all of the considered systems provide a standalone visualisation GUI, a web front-end or both. Finally, most of the systems are freely available, though licence restrictions greatly vary.

Miscellany: The diversity of the considered systems is reflected in the key features row. These include high level abstractions, such as network and processor load predictions (NWS, Remos); a unified interface integrating other monitoring sources (GridRM); the organisation of monitoring information in custom views (e.g., in MDS one can have different GIIS servers for available software, processors, storage mediums and so on); the correlation of host-, network- and application-related events (NetLogger, OCM-G); the calculation of site- or VO-level load (Ganglia, GridICE); the conceptualisation of a VO's information services as a virtual DBMS (RGMA), etc. Concerning automatic actions, some systems can be configured to trigger specific commands under certain circumstances, to facilitate systems management (CODE, MapCenter, Hawkeye) or to allow the adaptation of applications to a dynamic environment (Autopilot, OCM-G, MonALISA, Mercury).

In terms of security, many systems provide authentication and/or authorisation via the Grid Security Infrastructure (GSI). MonALISA employs X.509 certificates for the secure management, upgrade and deployment of agents.

Evaluation: With respect to the evaluation of monitoring systems, we note, in agreement with [64], that the existing work is very limited. Mercury has recently been reported as coping well with the high load posed by application monitoring based on GRM, whereas RGMA's performance has been proven insufficient for the same purpose [71]. An earlier paper [31] compares design characteristics of GRM and NetLogger, and argues that the latter suffers in terms of performance and robustness.

Another study [63] describes a performance comparison of Hawkeye and a few versions of Globus MDS2 and RGMA. One of the main conclusions is that caching of events is most important for the performance of all system components. Hawkeye, with the exception of the producer (agent), which does not employ caching, generally appears to be more stable and efficient in terms of imposed host load. Throughput and response time vary considerably among different versions and components of every system. According to the study, RGMA's producers and registry appear to be less scalable with respect to the number of concurrent users and sensors (referred to as information collectors). Among the proposed suggestions are that (i) the republishers of the considered systems should be connected to fewer than 100 producers; and (ii) the number of concurrent consumers should be less than 400, 500 and 600 for RGMA, MDS and Hawkeye, respectively.

Remarks: One can easily see the lack of coordination, hence the overlapping functionality, between projects. Also evident is the diversity of APIs and protocols which, along with implied or explicit but custom semantics of event types, leads to interoperability problems. Coordination is clearly a political issue and one can only hope that there will be a driving force to encourage consolidation or at least some form of closer cooperation.

However, interoperability is more of a technical problem. Communication paradigms and protocols in the considered systems vary from sockets (Mercury, NWS), LDAP (MDS2) and Java RMI (JAMM), to web services (NetLogger's pyGMA, CODE, Topomon, MDS3 and MonALISA). GGF's producer-consumer protocol [13] is not sufficient because it is more of a methodology for designing producer-consumer interactions using web services than a protocol specification. Also, there is no commonly agreed event schema, either because systems address (slightly) different problems or due to different views of the same problems.

This problem can be faced in two ways: either agree on a widely accepted platform-independent (e.g., web services-based) protocol and event schema and assume its adoption by monitoring systems, or extend information services to interoperate with a diverse set of monitoring sources in order to collect and transform events to a common schema (e.g., GridRM, MonALISA). The former places much less overhead on information services and thus allows for better performance and scalability, but requires a consensus on protocols and semantics. In the latter approach, information services are very likely to become performance bottlenecks, given the need to implement several protocols and normalise collected events.

In addition to interoperability, which is a major obstacle for connecting the existing grid installations into a worldwide Grid, there is also the scalability requirement. In a worldwide Grid context, there is a need for highly scalable monitoring services [72] (i) to build efficient, Grid-wide resource discovery end-user applications or services, such as resource brokers [73], search engines [74] or directories of grid resources, or (ii) to provide a variety of "global view" services to enhance the understanding of the Grid.

Grid search engines may provide to users and programs an efficient means of resolving complex, large-scope queries (i.e., involving the combination of information from many organisations). However, end-users may not be able to specify exactly their needs or may be unaware of the resources that are available throughout the Grid. To this end, resource directories will support end-users in finding the appropriate matches through a browsable list of resources, offered in a variety of classifications. The latter can be based on criteria such as discipline (in the case of collaborations, data repositories, and on-line instruments), sharing policy, physical location (e.g., continent, country, city), administration hierarchy (e.g., virtual organisation, organisation, department), supported quality of service guarantees, hardware and software specifications, and so on.

On the other hand, global view services, similarly to Netcraft [75] and Alexa [76] for the worldwide web, will provide a comprehensive quantitative and qualitative characterisation of the Grid throughout time. This will help to understand the dynamics of the Grid, support long-term capacity planning and contribute to its realistic modelling and simulation by answering questions such as: how many are the grid resources in the public worldwide Grid; what is their distribution among sites; which resource types are rare or popular; which grid sites are most reliable; how do these numbers evolve throughout time; and how does the Grid's evolution relate to that of the Internet.

10. Summary

This paper introduced the problem of grid monitoring, including associated concepts, requirements and stages. It also outlined the Grid Monitoring Architecture of the Global Grid Forum and mapped its components to phases of distributed systems monitoring. More importantly, a taxonomy was defined to allow the classification of monitoring systems with respect to (i) compliance to core GMA components; (ii) the main target of monitored entities; and (iii) whether a system can or has to operate on top of another system. The taxonomy should be applicable to all monitoring systems because its categories are specified with respect to fundamental monitoring functions (i.e., generation, processing, distribution and consumption of monitoring data). Furthermore, the taxonomy was used to classify a diverse set of monitoring systems, ranging from application instrumentation tools to standalone Grid Information Services.

In addition, the paper underlined, in agreement with [64], the need for more coordination and interoperability among existing and future monitoring projects, given the currently overlapping functionality and lack of consensus regarding the employed protocols and semantics. Finally, an unexplored field was identified in the context of global monitoring, and motivated in terms of potential applications.

Acknowledgements

We would like to thank the anonymous reviewers whose feedback significantly improved the paper. Also, the first author is pleased to acknowledge the scholarship he is receiving from the Greek State Scholarships Foundation (IKY).

References

[1] I. Foster, C. Kesselman, S. Tuecke, The anatomy of the grid: enabling scalable virtual organizations, Int. J. Supercomput. Appl. 15 (3) (2001) 115–128.

[2] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, S. Tuecke, A directory service for configuring high-performance distributed computations, in: Proceedings of the Sixth IEEE Symposium on High Performance Distributed Computing, IEEE Computer Society Press, 1997, pp. 365–375.


[3] S. Czajkowski, K. Fitzgerald, I. Foster, C. Kesselman, Gridinformation services for distributed resource sharing, in: Pro-ceedings of the 10th IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), IEEE Com-puter Society Press, San Francisco, CA, 2001, pp. 181–194.

[4] Z. Balaton, P. Kacsuk, N. Podhorszki, F. Vajda, UseCases and the Proposed Grid Monitoring Architecture,Tech. Rep. LDS-1/2001, Computer and Automation Re-search Institute of the Hungarian Academy of Sciences,2001. http://www.lpds.sztaki.hu/publications/reports/lpds-1-2001.pdf.

[5] G. Carlson, How to save money with computer monitoring,in: Proceedings of the ACM Annual Conference, ACM Press,Boston, MA, USA, 1972, pp. 1018–1023.

[6] B. Tierney, R. Aydt, D. Gunter, W. Smith, M. Swany, V.Taylor, R. Wolski, A Grid Monitoring Architecture, GWD-Perf-16–3, Global Grid Forum, August 2002.http://www-didc.lbl.gov/GGF-PERF/GMA-WG/papers/GWD-GP-16-3.pdf.

[7] B. Plale, P. Dinda, M. Helm, G. von Laszewski, J. McGee,Key concepts and services of a grid information service, in:Proceedings of the 15th International Conference on Paralleland Distributed Computing Systems (PDCS 2002), Louisville,KY, 2002.

[8] M. Mansouri-Samani, Monitoring Distributed Systems (ASurvey), Tech. Rep. DOC92/23, Imperial College, Lon-don, UK, 1992. http://www.citeseer.ist.psu.edu/mansouri-samani92monitoring.html.

[9] M. Mansouri-Samani, M. Sloman, Monitoring distributed sys-tems, IEEE Network 7 (6) (1993) 20–30.

[10] R. Wolski, L.J. Miller, G. Obertelli, M. Swany, Performanceinformation services for computational grids, Grid ResourceManagement, State of the Art and Future Trends, Kluwer Aca-demic Publishers, 2003, pp. 193–213.

[11] D.L. Mills, Network Time Protocol (Version 3) Specifica-92.

[ ev-oni,N.A:

ro-ative

[ o-rid-

[ ma-03.

[ open5thput-

[ dis-the

mapcenter approach, in: F.F. Rivera, M. Bubak, A.G. Tato,R. Doallo (Eds.), Proceedings of the First European AcrossGrids Conference, vol. 2970 of Lecture Notes in Computer Sci-ence, Springer-Verlag, Santiago de Compostela, Spain, 2004,pp. 222–229.

[17] S. Andreozzi, N. De Bortoli, S. Fantinel, A. Ghiselli, G. Tor-tone, C. Vistoli, GridICE: A Monitoring Service for the Grid,in: Proceedings of the Third Cracow Grid Workshop, Cracow,Poland, October 27–29, 2003, pp. 220–226.

[18] S. Andreozzi, S. Fantinel, D. Rebatto, L. Vaccarossa, G. Tor-tone, A monitoring tool for a grid operation center, in: Comput-ing in High Energy and Nuclear Physics (CHEP03), La Jolla,CA, 2003.

[19] R.L. Ribler, J.S. Vetter, H. Simitci, D.A. Reed, Autopilot: adap-tive control of distributed applications, in: Proceedings of theSeventh IEEE Symposium on High-Performance DistributedComputing, 1998, pp. 172–179.

[20] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D.Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, D. Reed, L. Torczon, R. Wolski, The GrADS project:software support for high-level grid application development,Int. J. High Perform. Comput. Appl. 15 (4) (2001) 327–344.

[21] R. Aydt, The Pablo Self-Defining Data Format, Tech. Rep.,Department of Computer Science, University of Illinois atUrbana-Champaign (Revised at 17 March 2003), March1992. ftp://vibes.cs.uiuc.edu/pub/Pablo.Release.5/SDDF/Documentation/SDDF.ps.gz.

[22] W. Smith, A system for monitoring and management of com-putational grids, in: Proceedings of the 31st International Con-ference on Parallel Processing (ICPP2002), IEEE ComputerSociety Press, 2002, p. 55.

[23] W. Smith, A Framework for Control and Observation in Dis-tributed Environments, Tech. Rep. NAS-01–006, NASA Ad-vanced Supercomputing Division, NASA Ames Research Cen-

[ oni-ster4.

[ et-90.

[ for003.

[ eon-988,

[ ted, in:siumEE

[ ski,uta-ym-

tion, Implementation and Analysis, RFC 1305, March 19http://www.ietf.org/rfc/rfc1305.txt.

12] A. Cooke, A.J.G. Gray, L. Ma, W. Nutt, J. Magowan, M. Oers, P. Taylor, R. Byrom, L. Field, S. Hicks, J. Leake, M. SA. Wilson, R. Cordenonsi, L. Cornwall, A. Djaoui, S. Fisher,Podhorszki, B. Coghlan, S. Kenny, D. O’Callaghan, R-GMan information integration system for grid monitoring, in: Pceedings of the 10th International Conference on CooperInformation Systems, 2003.

13] W. Smith, D. Gunter, D. Quesnel, A Simple XML Prducer/Consumer Protocol, GWD-Perf-8-2, Global GForum, 2002. http://www-didc.lbl.gov/GGF-PERF/GMAWG/papers/GWD-GP-8-2.pdf.

14] Distributed Management Task Force—Common Infortion Model (CIM), accessed on 3 December 20http://www.dmtf.org/standards/cim.

15] F. Bonnassieux, R. Harakaly, P. Primet, MapCenter: angrid status visualization tool, in: Proceedings of the ISCA 1International Conference on Parallel and Distributed Coming Systems, Louisville, KY, USA, 2002.

16] F. Bonnassieux, R. Harakaly, P. Primet, Automatic servicescovery, monitoring and visualization of grid environments:

ter, July 2001.24] M.A. Baker, G.C. Smith, GridRM: an extensible resource m

toring system, in: Proceedings of the IEEE International CluComputing Conference, 1–4 December 2003, pp. 207–21

25] J.D. Case, M. Fedor, M.L. Schoffstall, C. Davin, Simple Nwork Management Protocol (SNMP), RFC 1157, May 19http://www.ietf.org/rfc/rfc1157.txt.

26] A. Hawkeye, Monitoring and Management ToolDistributed Systems, accessed on 27 October 2http://www.cs.wisc.edu/condor/hawkeye/.

27] M. Litzkow, M. Livny, M. Mutka, Condor: a hunter of idlworkstations, in: Proceedings of the Eighth International Cference of Distributed Computing Systems, San Jose, CA, 1pp. 104–111.

28] R. Raman, M. Livny, M. Solomon, Matchmaking: distriburesource management for high throughput computingProceedings of the Seventh IEEE International Sympoon High Performance Distributed Computing (HPDC), IEComputer Society Press, Chicago, IL, USA, 1998, p. 140.

29] P. Stelling, I. Foster, C. Kesselman, C. Lee, G. von LaszewA fault detection service for wide area distributed comptions, in: Proceedings of the Seventh IEEE International S


posium on High Performance Distributed Computing, Chicago,IL, 1998, pp. 268–278.

[30] B. Tierney, B. Crowley, D. Gunter, J. Lee, M. Thompson, Amonitoring sensor management system for grid environments,Cluster Comput. 4 (1) (2001) 19–28.

[31] Z. Balaton, P. Kacsuk, N. Podhorszki, Application monitoringin the grid with GRM and PROVE, in: Proceedings of the Inter-national Conference on Computational Science (ICCS2001),Part I, vol. 2073 of Lecture Notes in Computer Science,Springer-Verlag, San Francisco, CA, USA, 2001, p. 253.

[32] Z. Balaton, G. Gombas, Resource and job monitoring in thegrid, in: Proceedings of the Ninth International Euro-Par Con-ference, vol. 2790 of Lecture Notes in Computer Science,Springer-Verlag, Klagenfurt, Austria, 2003, pp. 404–411.

[33] Z. Balaton, G. Gombas, GridLab Monitoring: Detailed Archi-tecture Specification, Tech. Rep. GridLab-11-D11.2-01-v1.2,EU Information Society Technologies Programme (IST), 2001.http://www.gridlab.org/Resources/Deliverables/D11.2.pdf.

[34] G. Allen, K. Davis, K.N. Dolkas, N.D. Doulamis, T. Goodale,T. Kielmann, A. Merzky, J. Nabrzyski, J. Pukacki, T. Radke,M. Russell, E. Seidel, J. Shalf, I. Taylor, Enabling applicationson the grid: a gridlab overview, Int. J. High Perform. Comput.Appl. 17 (4) (2003) 449–466.

[35] N. Podhorszki, P. Kacsuk, Design and implementation ofa distributed monitor for semi-on-line monitoring of Vi-sualMP applications, in: Proceedings of the Third Austrian-Hungarian Workshop on Distributed and Parallel Systems(DAPSYS2000), From Instruction Parallelism to Cluster Com-puting, Hungary, 2000, pp. 23–32.

[36] Z. Balaton, P. Kacsuk, N. Podhorszki, F. Vajda, From clustermonitoring to grid monitoring based on GRM, in: R. Sakel-lariou, et al. (Eds.), Proceedings of the Seventh InternationalEuro-Par Conference, vol. 2150 of Lecture Notes in ComputerScience, Springer-Verlag, Manchester, UK, 2001, pp. 874–881.

[37] B. Tierney, D. Gunter, NetLogger: a toolkit for distributed sys-idt,

hthenter,

[ micin:

rfor-So-

[ ey,et-

sics

[ ce: aeta-

999)

[ logyis-rum,f

[42] M. den Burger, T. Kielmann, H.E. Bal, TopoMon: a monitoringtool for grid network topology, in: Proceedings of the Interna-tional Conference on Computational Science (ICCS 2002), vol.2230, Springer-Verlag, 2002, pp. 558–567.

[43] B. Balis, M. Bubak, T. Szepieniec, R. Wismuller, M. Radecki,Monitoring grid applications with grid-enabled OMIS moni-tor, in: F.F. Rivera, M. Bubak, A.G. Tato, R. Doallo (Eds.),Proceedings of the First European Across Grids Conference,vol. 2970 of Lecture Notes in Computer Science, Springer-Verlag, Santiago de Compostela, Spain, 2004, pp. 230–239.

Serafeim Zanikolas received a BSc degree in Software Engineering (2001) and an MSc in Computer Science (2003) from the Technological Educational Institute of Thessaloniki, Greece, and the University of Essex, UK, respectively. He is currently doing research in the area of monitoring and information services in Grid computing, as part of his PhD study at the School of Computer Science of the University of Manchester, UK.

Rizos Sakellariou received his PhD degree in Computer Science from the University of Manchester in 1997 for a thesis on symbolic analysis techniques with applications to loop partitioning and scheduling for parallel computers. Since January 2000, he has been a Lecturer in Computer Science at the University of Manchester. Prior to his current appointment, he was a visiting Assistant Professor at the University of Cyprus (fall 1999), and a postdoctoral research associate at Rice University (1998–1999) and the University of Manchester (1996–1998). His primary area of research is in parallel and distributed systems, but his research interests also include compilers, computer architecture, performance modelling and evaluation, scheduling, and their interactions.