
Network management by delegation


EARLY DRAFT: DO NOT DISTRIBUTE

Germán Goldszmidt and Yechiam Yemini*
Distributed Computing and Communications (DCC) Lab.
Computer Science Department, Columbia University,
New York City, NY 10027, USA
December 14, 1992

Abstract

Network management systems built on a client/server model centralize responsibilities in client manager processes, with server agents playing restrictive support roles. Managers must micro-manage agents through primitive steps, resulting in ineffective distribution of management responsibilities, failure-prone management bottlenecks, and limitations for real time responsiveness. We present a more flexible paradigm, the Manager-Agent Delegation (MAD) framework. It supports the ability to extend the functionality of servers (agents) at execution time, allowing flexible distribution of management responsibilities in a distributed environment. Managers can define management programs in special purpose management scripting languages. MAD can store and instantiate delegated scripts, and provides a concurrent runtime environment, where they can execute asynchronously without requiring the manager's intervention. A delegation protocol allows a manager to transfer programs, create process instances, and control their execution. We describe the delegation model, its application to network management, and the design of a prototype implementation.

[Keywords: network management, delegation]

*Research supported by NSF contract # NCR-91-06127.

1 Introduction

Industrial enterprises are increasingly dependent upon large-scale, complex networked systems serving as their information backbones. The likelihood and costs associated with faults or performance inefficiencies of these systems increase with their scale and complexity. Effective manageability of such systems, to ensure fault-free, efficient and secure operations, is thus a vital need of industry. The management problem presents a broad range of non-trivial technical challenges. There are growing efforts by vendors, standards committees [1] and the research community [2, 3, 4] to establish manageable systems. Unfortunately, typical networked systems have not been designed to be manageable. The fundamental technical problems involved in accomplishing manageability are not fully understood, and so their research is still in the embryonic stage.

A typical Network Management System (NMS) involves agents, which are responsible for monitoring and controlling managed objects of the network, and managers, which have the responsibility for collecting dynamic status data from the agents, interpreting that data, and directing the agents (e.g., how to handle fault scenarios). Both managers and agents use a management protocol to coordinate their activities. The fundamental characteristic of such protocols is that they establish an interaction paradigm in which program logic resides in managers while managed-object data resides within agents. For example, an agent maintains error counters, e.g., the number of packets that could not be transmitted because of errors. To compute an error rate, or to determine its variation or its relationship with other counters, managers need to query the agent frequently to obtain the counter values.

Standardization efforts have focused primarily on the management protocol and the structure of management information [5, 6].
The management model that they are based on, however, is only partially elaborated: the management needs and styles to be used are tacitly assumed, and the software architecture required to support the model is left unspecified.

In what follows, we will use the term manageability to describe the ability to monitor and control the behavior of a networked system. In this paper, we aim to address the following question: how should responsibilities be distributed among managers and agents to accomplish effective manageability?

Paradigms for distributed computing fix the functionality of processes at compilation time. In the typical client/server paradigm, a server exports a fixed interface, and a client can invoke its functions remotely in a synchronous fashion. While this scheme is sufficiently powerful for many distributed applications, it is too inflexible for many others. In particular, network management systems built on the client/server paradigm centralize responsibilities in client-manager processes, with server agents playing restrictive support roles.

Managers can delegate to agents only primitive monitoring and control tasks, so even for non-trivial management tasks they must micro-manage agents through primitive steps. This results in ineffective and costly distribution of management responsibilities; produces failure-prone management bottlenecks, whose loss can result in paralysis; and limits real-time responsiveness to problems. These problems are compounded as networks become faster and more complex, and require distribution of management function.

We here present a more flexible paradigm, which supports the ability to extend the functionality of agents at execution time, and thus enables a more flexible distribution of functionality between processes. This scheme builds on a multiapplication process model that supports instantiation, interconnection, and communication of processes. By treating programs as data that can be communicated between processes and later instantiated, we obtain a powerful delegation mechanism. Managers delegate to agents the execution of management programs prescribed in a management scripting language. Management programs include primitives that permit agents to monitor and control local managed objects effectively without unnecessarily involving managers.

The proposed manager-agent delegation (MAD) framework allows flexible distribution of management responsibilities in a distributed environment. Delegated management processes, defined in the manager's environment, are dynamically delegated to the MAD agent.
This agent implements a flexible and responsive management strategy by receiving and executing delegated programs that monitor and control objects in its local environment.

This paper aims to explore fundamental problems that arise in the design of manageable systems. In the next section we focus on the micro-management problems arising in current management protocols. In Section 3, we outline the remote delegation model. Section 4 presents an example of a delegated management program. Section 5 describes the software architecture of MAD. In Section 6 we review management scripting languages to support effective monitoring and control. In Section 7 we outline other paradigms for distributed applications, and why they cannot completely meet the requirements for delegation. In Section 8 we focus on the problems of management coordination among agents and managers. Section 9 briefly describes the current state of the implementation, and in Section 10 we present some conclusions.

2 Management Paradigms

Network management protocols (SNMP [7], CMIP [6]) present a centralized database paradigm of management. Their model separates centralized management application logic, which typically resides in a Network Operating Center (NOC) host, from distributed network data, which resides in network devices. Such a paradigm raises several problems:

Scalability. The centralized paradigm does not scale to large, complex, high-speed or low-bandwidth networks. In all these cases data polling rates between managers and agents become a significant performance bottleneck. The amount of data that may be polled is bounded by the size limitations of the manager, and by the speed of polling across the network. These problems are compounded during management emergencies, when more resources than ever are required. As the network develops and expands, its size becomes a limitation on manageability.

Micromanagement. Because of the lack of appropriate expressiveness in the management protocols, managers are forced to micromanage remote agents through the network, even for simple tasks. Complex applications require an even greater degree of interaction among managers and agents. We further discuss the problem of micro-management in Section 2.1.

Modularity. Modularity is essential in order to maintain and extend management applications. Current approaches lead to unencapsulated network management architectures whereby multiple applications share a huge space of managed global variables. This increases the hazards of a highly non-modular software architecture.

Interoperability. In an open environment, several protocols will have to be used to manage the network. Interoperability among those protocols is hard to accomplish as it amounts to conversion among distinct database frameworks.

Allocation of Resources. The centralized paradigm does not permit flexible distribution of management responsibilities to meet the needs of users and the opportunities provided by different devices.
It constrains vendors and users in accomplishing self-managed, autonomous networks.

Interpretation. The process of interpreting NM data involves applications of rules or methods of interpretation to the data collected. Aggregating data and rules increases complexity exponentially: the complexity of the search is exponential in the size of the data and the number of rules. Centralizing interpretation therefore incurs this exponential cost.

The above paradigms put both users and vendors of network devices in positions that are far from ideal. Users are positioned to retrieve network-wide core dumps of MIB variables on their operators' screens. Typically, users are not qualified to make sense of the data, do not know what most of these MIB variables mean, and wish they didn't have to. What users need are integrated solutions that provide for autonomous, self-managed network devices. In contrast, the vendors' advantage is their cumulative knowledge of how to better utilize and manage the devices that they produce. Ideally, vendors would like to be able to incorporate this knowledge into the device management and build self-managed devices. But standards expect devices to play a much more restricted role in management, e.g., dumping data on NOC operators.

2.1 The Problem of Micro-Management

The lack of appropriate expressive power of a management protocol can seriously limit the manageability of a networked system. This will occur when the protocol induces micro-management, that is, when the manager must control the execution of a program or script by stepping the agent through it. Micro-management occurs when management protocols are unable to compose primitive management actions in a flexible and efficient fashion, and can lead to highly ineffective and costly systems. Communication bandwidth and manager cycles may be required to accomplish even simple tasks, which places severe restrictions on manageability, barring all but trivial scripts from being executable. Since all management functions must be centralized into the manager, it is rendered most vulnerable to network failures, as even simple failures could load its bandwidth and cycles, bringing it down.
Once the manager is down, local agents cannot accomplish recovery, as they must wait for instructions. Thus, even a minor failure may potentially lead to an avalanche failure of the management system. The following example illustrates the problem of micro-management.

Consider a network management agent (server) that is responsible for monitoring a link (or a virtual circuit) object. A typical fault management scenario may consist of detection of certain unusual link conditions, execution of diagnostic testing to identify the cause of the conditions, procedures to handle the problem, and notification to the appropriate manager. This is illustrated by the following management script.

    On (link.control.stat>normal.stat) AND (link.q.length>normal.q);
                                        /* trigger on composite event */
    Begin                               /* handling actions */
        link.handle.congestion;         /* fix */
        link.test;                      /* diagnose */
        If link.failure then            /* if problem */
        Begin
            recover(link.failure.type);
            ....
            notify(Manager,link.failure.params)
            ....
        End
        :
        :
    End

Typical management scripts would similarly use the occurrence of an exceptional event, typically signaled via threshold conditions, to identify potential troubles and trigger diagnostic and corrective management actions. Both detection of the event and execution of the actions involve data and procedures entirely local to the agent and the managed link object. The manager must thus use the management protocol to transfer execution responsibilities to the agent to process the script. What does it take to accomplish execution of this script using a network management protocol such as CMIP?

The OSI CMIS management model [6], for example, distributes management responsibilities through the following manager-agent interaction primitives: creation (M-CREATE) and deletion (M-DELETE) of managed object instances; retrieval (M-GET) and setting (M-SET) of managed object attributes; reporting of events (M-EVENT-REPORT); and invocation of actions (M-ACTION). Thus, for example, a manager could use M-GET to retrieve certain managed-object attribute values; then use these monitored values to decide on appropriate control/diagnostic measures, which may be invoked by setting the values of certain managed-object attributes with M-SET or activating certain agent actions via M-ACTION.

The manager must first establish an appropriate association with the agent, in order to establish mechanisms to detect the occurrence of the triggering events. Suppose that it polls the respective variables link.control.stat and link.q.length using M-GET, and evaluates the event condition. (This may be the case with agents whose associations do not include M-EVENT-REPORT capabilities.) Polling, however, is a poor event-detection mechanism, since events may occur very infrequently and hold for a relatively short random time (or require a response within a short time of occurrence). In order not to miss such an occurrence, the manager would have to poll the agent at a frequency determined by the expected holding time (or permissible tolerance to detection latency). A high polling frequency can potentially exhaust the manager's communication bandwidth and cycles. Intermittent events might be unobservable through polling. In summary, polling can result in a highly ineffective division of event-detection responsibilities among manager and agents.

Because of these intrinsic limitations of polling, management protocols provide primitives for asynchronous event reporting (Traps in SNMP, M-EVENT-REPORT in CMIP). For example, using M-EVENT-REPORT, a manager could be notified of the occurrence of threshold events as in the example above. However, in the absence of mechanisms to specify composite events, it would have to seek independent notifications of the two events in

    (link.control.stat > normal.stat) AND (link.q.length > normal.q)

and decide on their conjunction. This division of event-detection responsibilities may lead to inefficiencies and hazards. Event notifications may be randomly delayed, and thus the conjunction event may go unnoticed even if its components have been adequately reported. Both the agent and the manager must spend bandwidth and cycles to divide detection responsibilities.

Once an event is detected, the management script calls upon certain diagnostic handling to be executed.
This requires the execution of certain diagnostic actions (e.g., invoking link.test), checking certain managed variables (e.g., ..If link.failure..) or invoking a corrective action (e.g., recover(link.failure.type)). Each of these computations may be accomplished via an appropriate CMIP primitive. Thus, a diagnostic or testing operation may be invoked using M-ACTION, and the value of link.failure may be retrieved by the manager via M-GET. There may be, however, a problem in invoking a procedure that requires parameters, as in recover(link.failure.type).

Moreover, both SNMP and CMIP provide limited or no capabilities to combine primitives to handle composite scripts. Thus, the manager must step the agent through the execution of the script. In this example, the manager must step the agent through the execution of:

    link.handle.congestion;
    link.test;
    :
    :

via two (or more) invocations of M-ACTION. This is followed by a retrieval of link.failure (using M-GET) to evaluate the conditional statement, and by additional calls upon the agent to complete the rest of the script. In other words, the manager must micro-manage the execution of the script.

The micro-management problem can be resolved by delegating entire composite management scripts from managers to agents. Thus, in the example above, a manager would pass the entire script to an agent that is responsible for the link object. The agent would then execute the composite script without manager intervention (except when the script itself requires it). Indeed, the agent is ultimately responsible for executing all the primitives of the script, from event detection to invocation of local actions. Manager involvement in execution is neither necessary nor useful, and is only mandated by the limitations of current management protocols.

2.2 Centralization of Management

Centralization of management is inconsistent with the emerging needs of complex, large-scale or high-speed networks. In such networks the complexity of management needs and the speed at which a response may be required may render centralization ineffective or even impossible. Also, management centralization is inconsistent with the trade-offs suggested by current and future technology trends. Agent processing cycles are abundantly available, increasingly so as one climbs in the protocol hierarchy towards managing higher-layer entities. The tacit assumption in the design of current management protocols, that agents lack management "smarts" and should leave all computation and analysis to manager cycles, may reflect the tradeoffs represented by traditional voice and teleprocessing networks of the past rather than networks of the future.

3 MAD: A Software Model for Management

Figure 1: Delegation (diagram labels: Manager, MAD Agent, Delegated Script, Managed Device)

We introduce an enhanced software model and architecture for distributed management applications. The central component of this architecture is the MAD agent, a hybrid combination of a hierarchical manager with a proxy agent which provides a collection of services required for effective monitoring and control of managed objects. A MAD agent is analogous to a middle level manager in a large organization, which has control over some scope of the global environment. As a traditional server process, it encapsulates a level of local autonomy, and exports a fixed set of services that can be invoked by client processes. In addition, it is an Elastic Server capable of dynamically adapting its set of responsibilities via delegation. Figure 1 depicts the basic delegation model, in which a Manager transfers a Delegated Management Program, which is a script of instructions, to a MAD agent.

3.1 Remote Delegation

One of the most significant properties of MAD is that it provides a new, general purpose scheme for interaction between distributed components: remote delegation. Its main feature is to allow management programs to be defined in and by managers and then delegated to MAD agents for their execution. Delegation provides a flexible way for managers to dynamically associate functionality and responsibility with other managers.

It enables a manager to transfer the responsibility for performing certain functions to a remote agent. It allows the agent to perform these functions independently of the manager's execution, except where coordination is required. This framework enables a manager to augment, during execution time, the functionality of a subordinate agent, in effect allowing it to perform an open-ended set of management programs.

Thus, management programs can be designed and coded as part of a specific management scheme rather than as part of the individual agent design. This can be done much later than agent design and deployment, and can be tailored to the specific requirements of any given installation. By taking advantage of the runtime services provided by MAD (which are described in the following sections), management programs can be dynamically delegated by managers on the fly.

Delegation empowers manager processes to flexibly adapt management responsibilities assigned to agents, to be executed in close proximity to the managed devices. This enables managers to ensure a much faster response in the event of faults. Furthermore, it provides an effective tool to balance the computational requirements of management applications as temporary conditions change over time, e.g., network load, and as the configuration of the network evolves, e.g., adding new devices.

From a software configuration perspective, delegation provides a simple yet powerful scheme to dynamically compose management systems, by connecting and integrating independent delegated scripts. It is not required, though allowed, that manager processes actually write programs to be delegated during their execution. Installations can provide libraries of such programs, which encapsulate network management expertise, to be used as management conditions require. Designers and vendors of devices should provide libraries of predefined routines that can be used to compose delegated management programs.
Using the facilities provided by MAD, manager processes can build, configure and control distributed multi-process applications.

A process that delegates responsibilities to another process is referred to as a MAD manager. The process that receives the delegated responsibilities is referred to as a MAD agent. The roles of a MAD manager and a MAD agent refer specifically to a given set of functional responsibilities. A process which is a manager with respect to a given agent may itself be an agent with respect to another manager.
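To make the contrast between micro-management and remote delegation concrete, the following is a minimal Python sketch, not part of the MAD prototype. It models an agent that counts the manager requests it serves, and runs the link congestion logic of Section 2.1 once through primitive get/action calls and once as a delegated script. The names Link, Agent, micro_managed and congestion_script, the threshold values, and the request-counting scheme are all assumptions made for illustration.

```python
"""Toy comparison of micro-management and delegation (illustrative only)."""

NORMAL_STAT, NORMAL_Q = 10, 100   # hypothetical thresholds


class Link:
    """Hypothetical managed link object held by the agent."""

    def __init__(self, control_stat, q_length, failure=False):
        self.control_stat = control_stat
        self.q_length = q_length
        self.failure = failure


class Agent:
    """Toy agent that counts every manager-to-agent request it serves."""

    def __init__(self, link):
        self.link = link
        self.requests = 0     # requests received from the manager
        self.scripts = {}     # delegated scripts, keyed by name

    # Primitive operations, in the spirit of M-GET and M-ACTION.
    def get(self, attr):
        self.requests += 1
        return getattr(self.link, attr)

    def action(self, name):
        self.requests += 1
        return f"{name} done"

    # Delegation: store a script, then run it locally on request.
    def delegate(self, name, script):
        self.requests += 1
        self.scripts[name] = script

    def instantiate(self, name):
        self.requests += 1
        return self.scripts[name](self.link)


def micro_managed(agent):
    """The manager steps the agent through the script, one primitive at a time."""
    log = []
    if agent.get("control_stat") > NORMAL_STAT and agent.get("q_length") > NORMAL_Q:
        log.append(agent.action("handle_congestion"))
        log.append(agent.action("test"))
        if agent.get("failure"):
            log.append(agent.action("recover"))
    return log


def congestion_script(link):
    """The same logic, written to run inside the agent, next to the link."""
    log = []
    if link.control_stat > NORMAL_STAT and link.q_length > NORMAL_Q:
        log += ["handle_congestion done", "test done"]
        if link.failure:
            log.append("recover done")
    return log
```

With a congested, failed link, micro_managed issues one request per primitive step, while the delegated version needs one request to transfer the script and one to instantiate it, and both produce the same sequence of actions. The toy ignores event detection over time, which is where polling costs would dominate in practice.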

3.2 Delegated Management Programs

A Delegated Management Program (DMP) is transferred via a delegation protocol from a delegating manager to an accepting agent, which stores it in its Repository. Several important issues are raised (and will be addressed) regarding the use of remote delegation: What type of programs are the DMPs? In which language(s) should they be written? What are the semantic implications of writing DMPs in a general purpose programming language or a specific management language? How is the execution of a management program controlled (e.g., when is it invoked, terminated)? How can a DMP instance obtain access to managed objects? How can a DMP coordinate its execution with other management programs and with managed objects? We examine some of these questions in the following.

From a programming perspective, there are three basic types of DMPs: complete processes, objects and pure functions. A complete process executes an independent program with its own thread of control and its own private data. It contains logic to control its communications with managers, managed objects and other DMPs. Object DMPs are similar to complete processes in that they keep state in the MAD agent environment. However, they are passive servers which go into a sleep state waiting for requests. When a request message is received, the proper actions are taken. A pure function DMP is a program whose instantiation may return some value but will not have any side effects. Thus, the code of a pure function DMP can easily be shared by many other DMPs. In the sequel, unless stated specifically, all DMPs are complete processes, since this is the most general case, i.e., all others are degenerate cases.

A DMP is written in a Management Scripting Language (MSL). There is a whole spectrum of languages that can serve as MSLs, from general-purpose programming languages such as C, to more restricted specialized languages for management.
(We discuss some of the issues in selecting and supporting a particular MSL in Section 6.)

3.3 Instantiation of Delegated Management Programs

A Delegated Management Process Instance (DMPI) is an executing instance of a DMP. A DMPI can be best conceptualized as a lightweight process with its own thread of execution in the MAD environment. DMPIs start executing when instantiated by a MAD agent upon request by an authorized manager. Typically the delegating manager is authorized to request instantiation of the DMPs that it delegated. When a DMP is delegated, the agent returns to the delegating manager a capability right to apply MAD protocol operations to the delegated DMP object. A manager may make copies and further distribute the capability right to instantiate DMPs to other processes, e.g., other manager processes and even delegated DMPIs can obtain the right to instantiate another DMP.

Figure 2 describes the relationships between DMPIs, OCPs and managed objects.

Figure 2: Delegated Management Process Instances (diagram labels: DMPI, OCP, Program DMP, Managed Objects, SNMP, CMIP, SNMP Client, SNMP Agent, SNMP MIB, MAD Client, MAD Agent, Device Specific, Comm with Managers, To SNMP Managers)

3.4 Controlling the Execution of a DMPI

The execution of DMPIs proceeds independently of the execution of the delegating manager. However, a manager may need to retain control over the execution of a DMPI.

For example, a manager may instantiate a DMPI that monitors and reports periodically some statistical value which is a function of certain managed object attributes over some period of time. At certain times, the manager may wish to suspend this reporting to release resources (e.g., cycles) used by the DMPIs. This will enable the agent to allocate the freed resources to other critical problems, and the manager to reduce the amount of information that it receives from the DMPI, thus releasing some of its own resources. Once these problems are handled and resources become available, reporting should be resumed.

MAD provides a protocol and agent services for DMPI execution control that enable such functionality. A DMPI is an integrated unit of execution over which the agent's environment can exercise control: it can be instantiated, suspended, resumed and terminated. Also, a DMPI may request to be awakened after a specified period of time.

3.5 Delegation and Access Rights

A DMPI must have some means to interact with managed objects in order to obtain information and to control the managed objects. DMPIs obtain controlled access to managed object attributes through explicit interfaces provided by Observation and Control Point (OCP) processes. Indiscriminate access to these objects would introduce concurrency conflicts. A DMPI only has access to the attributes of the managed objects which have been explicitly declared as exported through an OCP interface.
OCPs for the managed objects accessed by a DMP must be allocated and maintained in the environment of the agent to ensure that accesses by the DMPI to these objects are properly supported.

A DMPI is able to invoke locally the same actions that its creator could invoke remotely (or a subset of them), if it inherits the required access capabilities from its creator. Additionally, DMPIs can exchange messages among themselves, with their creator and with other processes, by using an asynchronous send/receive interface (see Section 5.2). For example, a DMPI can acquire the role of receiving event reports from a device which were previously routed to its creator. In effect, DMPIs have all the rights and responsibilities associated with a managerial role, including the right to use the delegation protocol.

4 A Delegated Management Program Example

The example of Section 2.1 can be handled as follows. The management script is coded as a management program in an appropriate MSL, e.g., one that provides the appropriate primitives for this case. A DMP is passed to a MAD agent which, after authenticating the manager, compiles the DMP (if needed) and installs it in a repository. When the manager requests it, the DMP is instantiated for execution. The delegated program will then monitor (using services provided by the MAD agent) for the occurrence of the following event.

    (link.control.stat > normal.stat) & (link.q.length > normal.q)

We assume that the above are attributes of objects defined in the MAD environment. To monitor this event, the DMPI requests to be notified when any of the relevant managed object attributes is updated. The MAD agent will provide the DMPI with bindings to Observation and Control Point (OCP) server processes, which provide a controlled, high-level access interface to the actual managed objects, and are dedicated to servicing such requests. The OCPs will monitor the changes in the actual managed objects, either directly or by becoming proxy agents of some network management protocol. When such an occurrence is detected, the OCPs notify the requesting DMPI. The DMPI will then be awakened and will proceed with the remaining actions: link.handle.congestion ...

5 The Architecture of MAD Agents

In this section we describe the software architecture of MAD agents.
A MAD agent is a combination of a proxy agent and hierarchical manager which implements the runtime support to enable dynamic program delegation and control over the DMPIs.

Figure 3 depicts the internal structure of the MAD agent and the relationships between its internal components.

The MAD agent includes components which are responsible for implementing the delegation protocol, storing DMPs, instantiating and controlling DMPIs, providing communications between managers and DMPIs, and providing a concurrent runtime environment where all the DMPIs execute.

The independent components of the MAD agent are: Controller, Delegation Protocol, DMP Repository, Translator, DMPI Server, Name Service, IPC, and Scheduler. Delegated Management Process Instances (DMPIs) and Observation and Control Points (OCPs) execute as independent processes inside the agent. A prototype MAD agent supporting all the above functionality has been implemented. Most of the above components are implemented as one or more independent threads, and full address space processes.

5.1 Controller

The controller contains the main logic of the MAD agent, that is, a specific algorithm that the particular agent should execute. When the agent is loaded into memory, the controller is the 'main' entry point. It is responsible for initializing the agent's environment, by instantiating all the component processes. Then it loads and instantiates all the load-time DMPs, taking instructions from a given initialization file. When the agent terminates, the controller is responsible for ensuring proper 'clean up', e.g., disposal of resources.

During normal execution the controller plays the role of local manager for the agent. For example, it may decide how to handle requests from managers, which protocols to support, etc. Installation-specific policies which are invariant for the lifetime of the agent are programmed in the controller of the agent. Examples include authentication schemes for managers and the handling of exceptional events. The controller may itself use MAD primitives to delegate this responsibility to other processes.
In the event of contradicting or conflicting management actions, the controller may establish precedence. In many cases (and in the default case), there may be no specific logic associated with the controller. In such cases, the controller will only perform the role of initiator and terminator of the agent.

Figure 3: MAD Runtime Environment (diagram labels: Delegation Protocol, Repository, Controller, Translator, Scheduler, IPC, Names, DMPI, DMP Executables, Dynamic Load)

5.2 Delegation Protocol

The delegation protocol is an application-layer protocol used by MAD managers and agents to exchange requests. It is implemented by one (or more) dedicated lightweight processes which provide associations between a manager and the MAD agent and between the manager and its DMPIs. Our current prototype uses TCP or UDP to communicate between managers and agents. An asynchronous IPC implemented over shared memory by the MAD kernel is used to communicate between lightweight processes that execute as part of the MAD agent, and between DMPIs.

The protocol enables a manager to:

1. transfer a DMP to a MAD agent;
2. instantiate DMPIs (typically executed as lightweight processes), each with its own instantiation (load time) parameters and its own private state;
3. exercise control over the execution of the DMPIs, e.g., their execution can be suspended, resumed and terminated by the manager;
4. communicate with the DMPIs to which it has handles;
5. receive information from the DMPI server about the execution state of the DMPIs.

5.3 The Repository

The Repository allows storing and retrieving of DMPs from which the DMPIs can be instantiated. DMPs can be stored at agent startup (boot time), or downloaded from a remote manager into the Repository during execution, via the delegation protocol. After receiving a Delegate request, the agent will store the DMP in a repository, and return to the delegating manager (through the protocol) a handle to it (DMPid). The manager can further distribute this handle, thus enabling other managers to have access rights over the DMP. The DMP handle can be used to request that an instance of the DMP be instantiated as a DMPI, and to request deletion of the DMP from the repository. The repository uses the name service to maintain a mapping between the manager's DMP identifier and the internal names of the objects used by the actual storage facility (file system).
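The protocol operations and repository behavior described above can be sketched as a small in-memory model. The following Python sketch is an illustration under stated assumptions, not the prototype's implementation: the class ToyMadAgent, the State enumeration, the handle formats, and the rule that only a running DMPI may be suspended are all hypothetical, and networking, authentication, translation and real process scheduling are left out.

```python
import itertools
from enum import Enum


class State(Enum):
    """Hypothetical execution states of a DMPI."""
    RUNNING = "running"
    SUSPENDED = "suspended"
    TERMINATED = "terminated"


class ToyMadAgent:
    """In-memory stand-in for a MAD agent; no networking or real processes."""

    def __init__(self):
        self._repository = {}           # DMP handle -> program text
        self._instances = {}            # DMPI handle -> instance record
        self._ids = itertools.count(1)

    def delegate(self, program):
        """Store a DMP in the repository and return a handle to it."""
        dmp_id = f"dmp-{next(self._ids)}"
        self._repository[dmp_id] = program
        return dmp_id

    def delete(self, dmp_id):
        """Remove a DMP from the repository (existing DMPIs are untouched here)."""
        del self._repository[dmp_id]

    def instantiate(self, dmp_id, **params):
        """Create a DMPI from a stored DMP with its own load-time parameters."""
        if dmp_id not in self._repository:
            raise KeyError(f"unknown DMP handle: {dmp_id}")
        dmpi_id = f"dmpi-{next(self._ids)}"
        self._instances[dmpi_id] = {
            "dmp": dmp_id, "state": State.RUNNING, "params": params,
        }
        return dmpi_id

    def suspend(self, dmpi_id):
        self._transition(dmpi_id, State.RUNNING, State.SUSPENDED)

    def resume(self, dmpi_id):
        self._transition(dmpi_id, State.SUSPENDED, State.RUNNING)

    def terminate(self, dmpi_id):
        instance = self._instances[dmpi_id]
        if instance["state"] is State.TERMINATED:
            raise ValueError(f"{dmpi_id} is already terminated")
        instance["state"] = State.TERMINATED

    def status(self, dmpi_id):
        """Report the execution state of a DMPI."""
        return self._instances[dmpi_id]["state"]

    def _transition(self, dmpi_id, expected, new):
        instance = self._instances[dmpi_id]
        if instance["state"] is not expected:
            raise ValueError(f"{dmpi_id} is {instance['state'].value}")
        instance["state"] = new
```

A manager would call delegate once, keep or pass on the returned handle, and then call instantiate, suspend, resume and terminate as conditions require. Whether deleting a DMP should affect instances already created from it is not specified in the text; the sketch simply leaves them alone.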

5.4 The Translator

The Translator component of the Mad Agent is responsible for ensuring that DMPs are legal programs, and for compiling them into executable code. Each DMP is represented in the repository as a data structure which contains the actual code of the program and additional information describing its characteristics and requirements. For example, a DMP will describe in which language it is written and the minimum amount of memory required for instantiating it. DMPs can be in source code format, which may require compilation and linking or interpretation, or in object code format, which requires only dynamic linking. The Mad Agent uses the Translator to prepare the DMP for instantiation, via compilation and linking. The Translator uses the Repository to store the executable images produced by the compilation. If the DMP violates any of a set of defined rules for the given MSL of the program, the Translator will reject the DMP. For example, the current prototype restricts the ability of DMPs to bind to external functions. If a DMP invokes some arbitrary external function, it is rejected by the Translator.

5.5 Observation and Control Points (OCPs)

An OCP is a process that executes in the Mad Agent environment and defines a set of specific services to enable DMPIs to interact with managed objects. OCPs can be defined in the Mad Agent, and can also be delegated as DMPs, thus providing a programmable management interface to managed objects.

An OCP's service interface might be generic and hide many details of the actual management protocol, or it can reproduce a given protocol interface. OCPs can serve as protocol gateways between different management protocols. OCPs provide a process representation of managed objects, trapping events and signals, and making them accessible to multiple DMPIs.
They provide concurrency control in the access of DMPIs to managed objects, by serializing their requests.

An OCP can report the current status of a managed object and can forward DMPI requests to the object, by invoking an available protocol interface to the managed object. Such a protocol can be device specific (i.e., the OCP becomes a proxy agent) or a management protocol (e.g., SNMP, CMIP). OCPs can locally store the values of attributes, caching frequently used information. Finally, OCPs provide granularity of access control, e.g., a given DMP can obtain bindings to a specific OCP but not to others.

A sample OCP implemented with the Mad prototype uses the SNMP protocol to access network objects. It also provides a translation facility between SNMP and manager-defined name spaces, and it provides concurrent access to the DMPIs.

5.6 Delegated Management Process Instances

A DMPI is instantiated from a DMP template, via the Mad Instantiate call. Each DMPI has its own private state and can have different initialization parameters. For example, two DMPIs may perform some accounting task using the same algorithm (DMP) but different parameters (resources being measured).

A typical DMPI will perform certain initialization steps, request services from OCPs and from the Mad kernel, and then `sleep' until `awakened' by some event, usually the arrival of a message. It will not poll for messages, but will (implicitly) yield the CPU when waiting for a message. Typically, DMPIs spend most of their time in a suspended state, resting between bursts of activity. They will periodically collect information, analyze it, and possibly take some action as a result. If they are not equipped to fully handle the given situation, they can send a status report to the manager processes or to other DMPIs to request assistance.

DMPIs have access to the attributes of the managed objects that have been explicitly made available via protected programming interfaces (the Mad API) provided by the Kernel and those provided by OCPs. Since the Kernel's scheduler provides pre-emptive scheduling, DMPIs cannot monopolize the CPU.

5.7 Agent Kernel

The Kernel implements several services, which together provide an execution environment for DMPIs. It is divided into the following components: Scheduler, DMPI process manager, IPC, and Naming. The Scheduler provides pre-emptive multitasking for the DMPIs, and a wakeup function that activates a suspended thread after a specified interval. The DMPI process manager allows one to instantiate, kill, suspend, and resume DMPIs, and to retrieve and modify their status. The IPC facility supports asynchronous local communications between DMPIs and OCPs, over shared memory.
The name service provides a private, protocol-neutral way to associate names with managed object attributes, OCPs, and DMPIs.
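Section 5.4 states that the prototype Translator rejects a DMP that invokes an arbitrary external function. A rough way to picture such a rule is a scan of the C-like source for calls to names that are neither defined in the script nor on a whitelist. The sketch below is an assumption-laden illustration, not the prototype's Translator: the whitelist entries (`mad_send`, `mad_receive`, `mad_sleep`) are hypothetical names, and a regular-expression scan is far weaker than a real parser (for example, it would be fooled by call-like text inside string literals).

```python
import re

# Hypothetical whitelist standing in for functions a Mad API library might export.
ALLOWED_CALLS = {"mad_send", "mad_receive", "mad_sleep"}

# C keywords that look like calls when followed by "(".
C_KEYWORDS = {"if", "while", "for", "switch", "return", "sizeof"}

# An identifier immediately followed by "(".
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

# An identifier, a parenthesized list, then "{": treated as a local definition.
DEFINITION_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\([^;{}]*\)\s*\{")


def external_calls(source):
    """Return names that are called in `source` but not defined in it."""
    called = set(CALL_PATTERN.findall(source)) - C_KEYWORDS
    defined = set(DEFINITION_PATTERN.findall(source))
    return called - defined


def translate(source):
    """Accept the script, or raise ValueError naming the offending calls."""
    violations = sorted(external_calls(source) - ALLOWED_CALLS)
    if violations:
        raise ValueError("rejected: calls " + ", ".join(violations))
    return "accepted"


ok_script = "int run(int n) { if (n > 0) { mad_send(n); } return 0; }"
print(translate(ok_script))  # prints "accepted"
```

A script containing a call such as `system("reboot")` would raise `ValueError` in this sketch, which corresponds to the rejection behavior described in the text.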

6 Management Scripting Languages

A Management Scripting Language (MSL) is used to encode management programs to be delegated from managers to agents. In this section we describe some of the requirements for MSLs.

Delegated management programs must be compatible with the computational capabilities that are supported by the respective agent's software environment and the processors on which they execute. The computational resources that are available for management purposes vary greatly among networked systems and devices. Small devices, such as modems or multiplexors, will typically offer very limited computational capabilities for management purposes. In contrast, agents controlling switches or large computing systems may be able to afford substantial resources in order to support extensive manageability.

In addition to the intrinsic limitations of the available hardware, users of the management system may want to allocate these resources according to installation-specific parameters. Administrative policies may impose additional restrictions on the allocation of management resources. For example, any user may be able to query the liveness of a given device, but only a few may be allowed to reset it. Thus, different manager processes should be allowed to perform different subsets of the management operations supported by the device.

The combination of a wide variety of computational capabilities in managed devices over a distributed system, and the administrative considerations regarding access to those resources, yields a highly heterogeneous management environment. To overcome this inherent complexity, writers of management applications require tools that can help take advantage of this heterogeneity.
A comprehensive management system environment must therefore provide facilities for a spectrum of possible manager requirements and agent capabilities.

The need to provide such restrictions has been recognized in working implementation agreements for management protocols [8] by defining protocol subsets for CMIP. Association-types are used to define a restricted subset of the CMIP protocol over a particular application layer association. Thus, a pair of processes (one in a manager role, the other in an agent role) can establish an association of a given type in order to restrict the type of messages that can be exchanged. For example, an association of type event restricts agents to only report events over that association. Thus, association-types allow a manager-agent relationship to specify a computational restriction indirectly, by specifying a protocol restriction.

In contrast to the above protocol-oriented restrictions, the Mad paradigm enforces computational restrictions via sub-languages or types of MSLs. For example, a management program can be specified in a sub-language that restricts the program to read-only access to managed object attributes. Thus, a management script can be written in an MSL that is a subset of a general-purpose programming language (e.g., C, Pascal).

A management script written in such an MSL will be processed as follows. The management script is transferred from the manager to the agent via the delegation protocol. In the header of the message, there is a field that specifies the MSL identifier for this script. If the manager process is authorized to delegate scripts in the given MSL, then the program is analyzed by a Mad pre-processor for the given MSL. For example, the pre-processor will check that there are no assignments to non-local variables in the script. The output of the pre-processor, if found to satisfy the constraints specified by the sub-language, is compiled with the compiler of the source programming language. The object code generated by the compilation is then stored in the agent's repository. Thus, a Mad agent can use a combination of compile time pre-processing and runtime authentication to enforce specific sub-language constraints.

The MSL used in the current prototype is a subset of C, which restricts the capability to invoke arbitrary external functions. Thus, a DMP can only invoke external functions made available in a Mad Application Programming Interface (MADAPI) library.
This library enables delegated management process instances to request services from the Mad kernel; control the scheduling of process execution; exchange messages with other DMPIs, OCPs, and manager processes; and receive notifications of events as messages from the Mad kernel.

Although general-purpose languages like C are sufficiently expressive, they can expose the Mad agent instance to hazards that may be unacceptable in a reliable management application (for example, overwriting unintended memory locations within the shared address space, memory leaks, non-reentrant code, etc.). The current prototype MAD agent employs a variety of runtime protection schemes to reduce the likelihood of these events and to recover from them. However, a sound solution to these problems must use compile-time analysis techniques to ensure safety.

One of the important questions that must be addressed in order to provide effective management capabilities is: what is an appropriate paradigm in which to express management specifications? Management applications need specialization in order to overcome the huge complexity involved in distributed systems. Thus, we must raise the level of abstraction of the programming interface in which to encode management scripts. We will do that by providing a very high-level special-purpose management scripting language with special primitives to simplify the encoding of such applications.

From an implementation perspective, these new primitives can be considered macro-extensions that can be compiled into any general-purpose programming language (e.g., C), or they can be interpreted by the agent.

7 Models for Manager/Agent Interaction

In this section we review existing models of process interactions, and evaluate their suitability for resolving the micro-management problem.

The client/server interaction paradigm is widely used for structuring distributed software. For example, in the remote procedure call (RPC) model [9], a server exports a number of fixed procedures that can be invoked remotely by clients. Upon a remote call, the caller is suspended, the remote procedure is executed, and its results are returned to the caller, which then resumes its execution. A detailed critique of the RPC paradigm is presented in [10].

It is natural to ask whether the RPC mechanism is sufficient to solve the micro-management problem. Let us examine this question via Example 1. Suppose that the entire management program shown there is encoded, as part of the agent code, in terms of a particular action, say, link.handler. A manager will need to execute a remote procedure call invoking execution of the script by the agent. Does this solution offer an adequate model for management?

There are two fundamental problems associated with the solution above. The first is rooted in the synchronous invocation semantics of RPC mechanisms. A management script will typically involve actions to be triggered by independent event occurrences in the agent, asynchronously with the manager's invocation of the RPC. Clearly, the manager's own execution should not be tied to the execution of the management script.
An RPC mechanism would block the manager until the completion of the management script.

A second limitation of this model for supporting manager/agent interactions is the static nature of the exported procedures. That is, there is an underlying assumption that the programming of management scripts is done by the agent/managed-object designer at agent-design time (just as server calls are part of server design), rather than at management design time as part of incorporating the agent/managed-objects in a network.

Designers of the agent (server) must thus predict and code as agent procedures the entire range of composite management scripts in which this agent may be usefully involved. It is typically not possible to predict at agent-design time all the possible management scenarios in which the agent may be involved, nor is this desirable, as it would limit the manageability afforded by the product. Furthermore, coding such scripts into the agent would significantly increase its complexity and costs.

Remote evaluation (REV) [11] goes a step further in permitting the dynamic transfer of programs for remote execution. It allows a procedure (written in LISP) to be transferred from a client to a server, where an interpreter will execute it. It thus overcomes the problem of fixing all the server procedures that can be invoked remotely at the design time of the server. However, the execution of the remotely evaluated procedure is still synchronous with the execution of the caller (manager): i.e., the REV procedure is executed upon its receipt by the server (agent). Thus, this technique is not adequate, since it does not meet the requirements for management in a distributed environment.

In summary, the RPC model fails to provide an adequate solution to the micro-management problem, as agent designers must predict and code all management scripts required as part of the agent. The remote evaluation model goes a step further in permitting the dynamic transfer of programs for remote execution. However, both models involve synchronous procedure-call interactions whereby the agent executes the management program upon invocation, while the caller is blocked until completion. Thus neither technique proves to be adequate for management in a distributed environment.

8 Non-determinism of management actions

Consider an agent to which a number of management scripts have been delegated (by one or more managers), all sharing the same "triggering" event E. When E occurs, some management actions must be executed by the agent.
Different execution orders may lead to substantially different results. Consider, for instance, the following.

Example 2: Non-determinism of management actions

An X.25 controller within a switch includes an agent responsible for managing the controller operations (mostly virtual circuit (VC) operations). The agent has been delegated three management scripts by different managers, to be triggered upon the event buffers-full (i.e., all VC buffers are full):

• D1 specifies the following: run local diagnostics to check possible link or controller failures, reboot the controller upon fault; D1 is delegated by a vendor-provided switch manager.
• D2 specifies: evaluate and report performance parameters, abort some VCs and set limits on local resources usage; D2 is a congestion handler delegated by the network control center.
• D3 specifies: dump buffer contents and reset flow-control on VCs; D3 is a flow-control action pursued by a network flow-control protocol (acting in a manager role).

This example illustrates typical forms in which management actions distributed through networked systems can lead to unpredictable behavior. First, different orders of execution of the three actions lead to very different behaviors. Thus, if D1 is executed first, the controller may be rebooted, and then when D2 evaluates the performance parameters it would sample and report to the global manager different values than those reported by executing D2 first and then D1. As a result, the controller may experience unobservable intermittent failures as it reboots and fails again. A global manager relying on the reports of D2 might never learn of the problem.

Second, the roles of managers and agents are effectively established through the very design of network devices and protocols. A global network manager is not the only entity acting in the role of a manager. Typically, devices would be equipped with local (vendor-provided) software functioning in explicit management roles. Thus, local automated recovery systems, monitoring, and diagnostics software also function in management roles. Management roles are often implicitly built into the very specifications of certain protocol entities. Thus, flow control procedures of the protocol entity may be triggered by conditions monitored by the protocol entity.
In designing an overall management system it is critical that all management actions are considered and that possible interactions among such actions and those pursued by management of a network control center are carefully coordinated.

Third, the distribution of management functions among different management entities is typically established through ad-hoc evolution. Such distribution of function may contribute to very difficult coordination problems, or even render certain devices and protocol entities unmanageable.

Non-determinism can (and should) be avoided by constraining interactions among concurrent management programs. Two management programs are concurrent if they may be simultaneously executed; they are said to be associative if they lead to the same results independent of their execution order. Obviously, the design of a management system must aim to assure that concurrent management programs are associative. The execution orders of non-associative concurrent management programs must be carefully controlled to ensure that the results are deterministic. This bears some similarity to the problem of serializability of transactions in database systems. However, unlike the problem of serializability, where arbitrary serializable execution orders are acceptable, deterministic management requires control over the specific ordering of concurrent actions. The specific ordering of non-associative management program executions can reflect management hierarchy (e.g., prioritize programs of higher-level managers) or critical priorities of the program (e.g., execute first higher priority handlers).

8.1 Interference

Management actions may interfere with each other, leading to potential hazards. A management system is coherent when interferences are properly controlled. Interference is typically caused when evaluation of an event or an action causes changes in managed object attributes, triggering other events/actions. Coherence may be accomplished by minimizing and controlling interference. For example, interference among event detection may be avoided by preventing side-effects of detection.

It is, however, undesirable to prevent interference among actions and events in general. Consider the following.

Example 3: An X.25 virtual-circuit manager

    On (heavy.load)
    Begin
        For all active VCs
            If (VC.thruput > 2.4 Kbs)
                then VC.thruput := 2.4 Kbs
        :
        :
    End

In the example, the event heavy.load triggers changes of the thruput class (set to 2.4 Kbs) for some VCs. A user of a VC (e.g., the TP4 protocol) may require notification of such changes to ensure proper handling. In the absence of TP4 handling of these thruput class changes, the network may crash as a result of the changes. Interference is thus essential to assure communication of changes among managers. Interference is also important in supporting other forms of cooperative management.
For example, in a hierarchical management organization, a high-level manager may capture and prevent attempts of lower-level managers to cause certain actions.
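The handler of Example 3, together with the notification that the surrounding text argues a VC user may need, can be sketched as follows. This is only an illustration under stated assumptions: the `VirtualCircuit` record, the `on_heavy_load` function, and the `notify` callback are hypothetical names that do not appear in the Mad prototype; only the 2.4 Kbs cap is taken from Example 3.

```python
from dataclasses import dataclass

THRUPUT_CAP_KBS = 2.4  # cap taken from Example 3


@dataclass
class VirtualCircuit:
    """Hypothetical record for one virtual circuit."""
    name: str
    thruput_kbs: float
    active: bool = True


def on_heavy_load(circuits, notify):
    """Cap the thruput class of active VCs and report each change.

    `notify` is an assumed callback through which the user of a VC
    (e.g., a transport protocol entity) learns of the change.
    """
    changed = []
    for vc in circuits:
        if vc.active and vc.thruput_kbs > THRUPUT_CAP_KBS:
            old = vc.thruput_kbs
            vc.thruput_kbs = THRUPUT_CAP_KBS
            notify(vc.name, old, vc.thruput_kbs)
            changed.append(vc.name)
    return changed


circuits = [
    VirtualCircuit("vc1", 9.6),
    VirtualCircuit("vc2", 1.2),
    VirtualCircuit("vc3", 4.8, active=False),
]
events = []
on_heavy_load(circuits, lambda name, old, new: events.append((name, old, new)))
print(events)  # prints [('vc1', 9.6, 2.4)]
```

Routing every change through an explicit callback keeps the interference visible: a higher-level manager could supply a `notify` that records or vetoes changes, in the spirit of the hierarchical organization mentioned above.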

However, interference may lead to hazardous behaviors. For example, management actions may form a loop, each triggering evaluation of the next action in the loop. This would lead agents to a management livelock. It is thus necessary for managers to control possible interference among management programs and ascertain management coherence. A management system should support coordination primitives for interference control.

9 Status

The MAD prototype [12] has successfully demonstrated the delegation model of network management. The prototype includes the following components:

1. A Process Manager, which supports all the operations over lightweight DMPIs.
2. An implementation of the delegation protocol, using ASN.1 presentation services and TCP/IP transport services.
3. A Repository providing a storage and retrieval facility for DMPs.
4. A Translator supporting DMPs written in C and C++.
5. An SNMP-based OCP, which allows the Mad Agent to also serve as an SNMP agent.
6. A message-passing IPC service over shared memory.
7. A simple Mad Agent Controller, which implements two preemptive execution schemes, wound-die and wound-wait.
8. Two user interfaces for simple managers: one runs as a window-based application over X, and the other, for simpler terminals, is based on cursors.
9. Several simple applications to run on Mad as DMPs.

The prototype agent can receive, compile, link, and execute nontrivial management scripts from remote manager processes. The scripts execute asynchronously from their delegating manager and take independent action, eliminating the micro-management problem. Managers use the protocol to control the execution of their DMPIs. Several sample DMPs have been written and tested over the prototype. In particular, we implemented a collection of DMPs for the health example described earlier. The Mad agent can optionally incorporate an SNMP agent, which maintains a private MIB, as an OCP that can be queried directly by SNMP clients.

10 Conclusions

Current management models, as pursued by standardization efforts, centralize all management functions within managers, with agents playing minor roles. This centralization induces managers to micro-manage agents, resulting in inefficient and costly management systems. It also results in intrinsically unreliable network management, since manager processes are turned into sensitive communications and processing bottlenecks whose loss can result in paralysis. Centralized management does not meet the management needs of emerging high-speed, increasingly complex networks and distributed systems. Nor is it reflective of emerging cost/performance trade-offs, where processing resources have become cheaper while the costs of mismanagement continue to increase. Network devices and their agents may require and support processing capabilities as powerful as those of manager applications.

The traditional Client/Server paradigm does not provide an adequate solution to the micro-management problem, as agent designers must predict and code all management scripts required as part of the agent. The remote evaluation model goes a step further in permitting the dynamic transfer of programs for remote execution. However, both models involve synchronous procedure-call interactions whereby the agent executes the management program upon invocation, while the caller is blocked until completion. Thus neither technique proves to be adequate for management in a distributed environment.

Effective management requires support for flexible, efficient distribution of management functions and allocation of resources. The Manager-Agent Delegation (MAD) model proposed in this paper supports such a distribution. It provides a software framework that enhances the expressive power of management interactions. A manager can delegate an entire management procedure to be executed in the agent's environment, while maintaining control over its execution.
Delegated management programs can then be executed locally at the agent's host environment, without involving the manager unnecessarily. The capabilities and limitations of agents may be reflected in the management scripting language that they use.

From a performance perspective, delegation presents an attractive trade-off. It allows lower communications costs (of management messages) by dynamically allocating management functions closer to the real devices. Of course, flexibility and expressiveness do not come for free: they require more computing cycles and memory in the MAD agent environment. These costs are divided into two parts: a fixed component (the built-in Mad support) and a varying component (the OCPs and DMPs). There are two reasons to justify this cost. The first is the increasing cost associated with mismanagement of networks. Such costs will continue to increase as long as we continue to increase our reliance on these distributed systems and networks as vital information backbones. The second is the continuous decrease in the cost of computing devices, and in particular, microprocessors. Thus, a significant increase in the abundance of processing cycles in distributed environments is expected. We therefore advocate increasing the degree of manageability at the cost of allocating, intelligently, more resources over the network.

ACKNOWLEDGMENTS:

Most of the current prototype of Mad was implemented by graduate students in the DCC Laboratory of the Computer Science Department of Columbia University. They include Antonis Maragkos, James Tuller, Chris Wood, Sharon Barkai, Cristina Auerrocoechea, and Ben Ohtsu.

References

[1] Uyless Black. Network Management Standards: The OSI, SNMP and CMOL Protocols. McGraw Hill, 1992.
[2] Branislav N. Meandzija and Jil Westcott, editors. The First IFIP International Symposium on Integrated Network Management, Boston, MA, May 1989. North Holland.
[3] Iyengar Krishnan and Wolfgang Zimmer, editors. The Second IFIP International Symposium on Integrated Network Management, Washington, DC, April 1991. North Holland.
[4] Aaron Kershenbaum, Manu Malek, and Mark Wall, editors. Network Management and Control Workshop. Plenum Press, September 1989.
[5] Marshall T. Rose. The Simple Book, An Introduction to Management of TCP/IP-based Internets. Prentice Hall, 1991.
[6] International Standards Organization (ISO). 9596 Information Technology, Open Systems Interconnection, Common Management Information Protocol Specification, May 1990.

[7] Jeffrey D. Case, Mark S. Fedor, Martin L. Schoffstall, and James R. Davin. A Simple Network Management Protocol (SNMP). RFC 1157, May 1990. DDN Network Information Center, SRI International.
[8] Frederick E. Boland, editor. Working Implementation Agreements for Open Systems Interconnection Protocols. IEEE Computer Society Press, 1990. Vol. 2, Number 2.
[9] Andrew D. Birrell and Bruce J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2(1), February 1984.
[10] Andrew S. Tanenbaum and Robert van Renesse. A critique of the remote procedure call paradigm. Research into Networks and Distributed Applications, pages 775-783, April 1988.
[11] James W. Stamos and David K. Gifford. Remote evaluation. ACM Transactions on Programming Languages and Systems, 12(4):537-565, October 1990.
[12] Germán Goldszmidt and Yechiam Yemini. The Design of a Management Delegation Engine. In Proceedings of the IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, Santa Barbara, CA, October 1991.
