An Agent Based, Dynamic Service System to Monitor, Control and Optimize Distributed Systems

Iosif Legrand
California Institute of Technology
February 2006



The MonALISA Framework

MonALISA is a Dynamic, Distributed Service System capable of collecting any type of information from different systems, analyzing it in near real time, and supporting automated control decisions and global optimization of workflows in complex grid systems.

The MonALISA system is designed as an ensemble of autonomous, multi-threaded, self-describing agent-based subsystems which are registered as dynamic services and are able to collaborate and cooperate in performing a wide range of monitoring tasks. These agents can analyze and process the information in a distributed way and provide optimization decisions in large-scale distributed applications.


MonALISA is a Dynamic, Distributed Service Architecture

The framework is based on a hierarchical structure of loosely coupled agents acting as distributed services. These are independent, autonomous entities able to discover one another and to cooperate using a dynamic set of proxies or self-describing protocols.

An agent-based architecture makes it possible to invest the system with increasing degrees of intelligence, to reduce complexity, and to make global systems manageable in real time. These services provide the adaptability and self-organization needed for an effective use of distributed resources.


MonALISA Service & Data Handling

[Diagram: the MonALISA Service and its components. Labels: Lookup Service (registration, discovery); Data Cache Service & DB; data stores (Postgres, MySQL); Configuration Control (SSL); Web Service (WSDL, SOAP); Java client (other service); Web client; Applications; user-defined loadable modules to write/send data; Predicates & Agents; communications via the ML Proxy.]


The MonALISA Discovery System & Services

Network of JINI-LUSs, Secure & Public

MonALISA services

Proxies

Clients, HL services, repositories

Distributed Dynamic Discovery, based on a lease mechanism and REN

Distributed system for gathering and analyzing information

Dynamic load balancing
Scalability & Replication
Security
AAA for Clients

Global Services or Clients

Fully Distributed System with no Single Point of Failure

AGENTS
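The lease mechanism named on this slide can be illustrated with a minimal sketch. It is not the JINI lookup service API: the class `LeaseRegistry`, its methods, and the 30-second default lease are hypothetical names and values chosen only to show the idea that a registration disappears unless the service keeps renewing it.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LeaseRegistry:
    """Toy lookup service (hypothetical): entries live only while their lease is renewed."""

    lease_seconds: float = 30.0                     # assumed lease duration
    clock: Callable[[], float] = time.monotonic     # injectable so the sketch can be tested
    _expiry: Dict[str, float] = field(default_factory=dict)

    def register(self, service_id: str) -> float:
        """Register (or renew) a service and return the time its lease expires."""
        expires_at = self.clock() + self.lease_seconds
        self._expiry[service_id] = expires_at
        return expires_at

    # Renewing a lease is the same operation as registering again.
    renew = register

    def lookup(self) -> List[str]:
        """Return services with a valid lease, dropping the expired ones."""
        now = self.clock()
        for service_id in [s for s, t in self._expiry.items() if t <= now]:
            del self._expiry[service_id]
        return sorted(self._expiry)
```

A service that crashes simply stops renewing, so it drops out of `lookup()` after one lease period without any explicit deregistration step.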


Monitoring Internet2 backbone Network

Test for a Land Speed Record: ~ 7 Gb/s in a single TCP stream from Geneva to Caltech


The UltraLight Network

BNL ESnet IN /OUT


Monitoring Network Topology, Latency, Routers

NETWORKS

AS

ROUTERS


Monitoring The GLORIAD Ring


Monitoring Grid sites, Running Jobs, Network Traffic, and Connectivity

TOPOLOGY

JOBS

ACCOUNTING


Monitoring OSG: Resources, Jobs & Accounting

42 sites
~ 4,000 nodes (10,000 CPUs)
Thousands of jobs
60,000 parameters

Running Jobs Accounting


FTP Data Transfer between GRID sites

Total FTP Traffic per VO


Bandwidth Challenge at SC2005

151 Gb/s

~ 500 TB Total in 4h


End User / Client Agent

LISA: Localhost Information Service Agent

Authorization
Service discovery
Local detection of the hardware and software configuration
Complete end-system monitoring: per-process load, I/O and network throughput, etc.
End-to-end performance measurements
Will act as an active listener for all events related to the requests generated by its local applications.


Host Monitoring at SC2005

Many “network” problems are actually endhost problems: misconfigured or underpowered end-systems.

The LISA application was designed to monitor the endhost and its view of the network.

For SC|05 we used LISA to gather the relevant host details related to network performance.

Information on the system, TCP configuration and network device setup was gathered and made accessible from one site.

Future plans are to coordinate this with LISA and deploy this as part of OSG. The Tier-2 centers are a primary target.

[Screenshot panels: Network Device Information, TCP Settings, Host/System Information]


Available Bandwidth Measurements

Embedded Pathload module.


Coordination Service for Available Bandwidth Measurements

Enforces measurement fairness
Avoids multiple probes on shared network segments
Dynamic configuration of measurement timing
Logs events
Provides service redundancy by using a master-slave model
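One way to picture "avoids multiple probes on shared network segments" is a coordinator that grants a measurement only when its path shares no segment with a measurement already in progress. The sketch below is an assumption made for illustration, not the MonALISA coordination service: `ProbeCoordinator`, its methods, and the segment names are all hypothetical, and fairness and master-slave redundancy are left out.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


class ProbeCoordinator:
    """Hypothetical coordinator: one bandwidth probe at a time per network segment."""

    def __init__(self) -> None:
        # probe id -> set of segments the running probe traverses
        self._active: Dict[str, FrozenSet[str]] = {}
        # simple event log: (event, probe id, conflicting probe id or None)
        self.log: List[Tuple[str, str, Optional[str]]] = []

    def request(self, probe_id: str, segments: Iterable[str]) -> bool:
        """Grant the probe unless it overlaps a segment used by a running probe."""
        wanted = frozenset(segments)
        for other_id, busy in self._active.items():
            if wanted & busy:
                self.log.append(("denied", probe_id, other_id))
                return False
        self._active[probe_id] = wanted
        self.log.append(("granted", probe_id, None))
        return True

    def release(self, probe_id: str) -> None:
        """Mark a probe as finished so its segments become available again."""
        if self._active.pop(probe_id, None) is not None:
            self.log.append(("released", probe_id, None))
```

A denied requester would retry later; choosing the retry delay is where the dynamic configuration of measurement timing would come in.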


Monitoring the Execution of Jobs and the Time Evolution

SPLIT JOBS

LIFELINES for JOBS

[Diagram labels: Submit a Job; DAG; Job; Job1, Job2, Job3; Job31, Job32]
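The split-job picture can be modelled as a tree in which each job keeps its own lifeline, that is, a time-ordered list of states. The sketch below assumes, from the label names only, that Job splits into Job1, Job2 and Job3 and that Job3 splits into Job31 and Job32. The `Job` class and the state names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)  # identity comparison avoids recursing through parent/children links
class Job:
    name: str
    parent: Optional["Job"] = None
    children: List["Job"] = field(default_factory=list)
    lifeline: List[Tuple[float, str]] = field(default_factory=list)  # (time, state)

    def split(self, *names: str) -> List["Job"]:
        """Create sub-jobs and attach them to this job."""
        subs = [Job(n, parent=self) for n in names]
        self.children.extend(subs)
        return subs

    def record(self, t: float, state: str) -> None:
        """Append one point to this job's lifeline."""
        self.lifeline.append((t, state))

    def leaves(self) -> List["Job"]:
        """Jobs that were not split further, in left-to-right order."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def finished(self) -> bool:
        """True when the last recorded state of every leaf is 'done'."""
        return all(
            leaf.lifeline and leaf.lifeline[-1][1] == "done" for leaf in self.leaves()
        )
```

Tracking state only on the leaves keeps the parent's status derivable, so a monitoring view can fold or unfold a split job without storing redundant data.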


ApMon – Application Monitoring

[Diagram: applications using ApMon send monitoring data over UDP/XDR to MonALISA services. Labels: APPLICATION; ApMon; MonALISA Service; Monitoring Data (UDP/XDR); Time;IP;procID; ApMonConfig (parameter1: value, parameter2: value, ...); MonALISA hosts; Config Servlet. Example application monitoring values: Mbps_out: 0.52, Status: reading, MB_inout: 562.4. Example system monitoring values: load1: 0.24, processes: 97, pages_in: 83.]

Library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA services

Flexibility, dynamic configuration, high communication performance, dynamic reloading

ApMon configuration generated automatically by a servlet / CGI script

Automated system monitoring

Accounting information

[Plot: MonALISA CPU usage (%), 0 to 70, versus messages per second, 0 to 6000. Annotation: no lost packets.]
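To make "monitoring data over UDP/XDR" concrete, here is a minimal sketch that packs named parameters using XDR-style encoding rules (big-endian, 4-byte aligned, length-prefixed strings) and sends them in one UDP datagram. It does not use the ApMon library, and the message layout (cluster, node, count, then name/type/value triples) and the type tags are assumptions for illustration, not the real ApMon wire format.

```python
import socket
import struct
from typing import Dict, Union

Value = Union[int, float, str]


def xdr_string(text: str) -> bytes:
    """XDR string: 4-byte big-endian length, the bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def encode_params(cluster: str, node: str, params: Dict[str, Value]) -> bytes:
    """Pack parameters into one datagram payload (hypothetical layout, see lead-in)."""
    parts = [xdr_string(cluster), xdr_string(node), struct.pack(">i", len(params))]
    for name, value in params.items():
        parts.append(xdr_string(name))
        if isinstance(value, int):        # assumed tag 0: 32-bit signed integer
            parts.append(struct.pack(">ii", 0, value))
        elif isinstance(value, float):    # assumed tag 1: 64-bit IEEE double
            parts.append(struct.pack(">i", 1) + struct.pack(">d", value))
        else:                             # assumed tag 2: string
            parts.append(struct.pack(">i", 2) + xdr_string(value))
    return b"".join(parts)


def send_params(host: str, port: int, payload: bytes) -> None:
    """Fire-and-forget UDP send: no connection setup and no acknowledgement."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, `encode_params("MyCluster", "node1", {"load1": 0.24, "processes": 97})` builds a payload that `send_params` can deliver to whatever host and port the receiving service listens on. Using UDP keeps the sender from blocking on a slow receiver, at the cost of possible datagram loss.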


[Diagram labels: Optical Switch (three); MonALISA ML Agent (three); ML Daemon; steps 1 to 4; Discovery & Secure Connection]

Runs an ML Daemon

>ml_path IP1 IP4 “copy file IP4”

ML proxy services used in Agent Communication

ML Daemon: Control and Monitor the switch

MonALISA agents create an optical path or tree on demand

Time to create a path on demand: <1s, independent of the location and the number of connections
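The path-on-demand idea can be sketched as a graph search: given which switches are linked, find a route between two end hosts and list the cross-connect each intermediate switch must make. This is an illustrative assumption, not how the MonALISA agents or `ml_path` compute paths. The topology, the switch names `SW1` to `SW3`, and both function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def find_path(links: Dict[str, List[str]], src: str, dst: str) -> Optional[List[str]]:
    """Breadth-first search: a fewest-hop path from src to dst, or None if unreachable."""
    previous: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path: List[str] = []
            step: Optional[str] = node
            while step is not None:      # walk back to the source
                path.append(step)
                step = previous[step]
            return path[::-1]
        for neighbour in links.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def cross_connects(path: List[str]) -> List[Tuple[str, str, str]]:
    """For each intermediate switch: (switch, incoming neighbour, outgoing neighbour)."""
    return [(path[i], path[i - 1], path[i + 1]) for i in range(1, len(path) - 1)]
```

In this sketch each tuple from `cross_connects` is the instruction one switch's agent would need, so the switches along the path can be configured independently of each other.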


Monitoring and Controlling Optical Planes

Port power monitoring

Controlling


Monitoring Optical Switches

Agents to Create on Demand an Optical Path


Communities using MonALISA

Major Communities: OSG, CMS, ALICE, D0, STAR, VRVS, LGC RUSSIA, SE Europe GRID, APAC Grid, UNAM Grid

ABILENE, ULTRALIGHT, GLORIAD, LHC Net, RoEduNET

[Screenshots: ABILENE, VRVS, ALICE, CMS-DC04]

Demonstrated at:

SC2003

Telecom World 2003

WSIS 2003

SC 2004

I2 2005

TERENA 2005

IGrid 2005

SC 2005

MonALISA running 24 x 7 at 250 sites

Collecting 250,000 parameters in near real time

Update rate of 25,000 parameter updates per second

Monitoring 12,000 computers

> 100 WAN links

Thousands of Grid jobs running concurrently


The MonALISA Architecture Provides:

Distributed Registration and Discovery for Services and Applications.

Monitoring all aspects of complex systems:
System information for computer nodes and clusters
Network information: WAN and LAN
Monitoring the performance of Applications, Jobs or services
The End User Systems and their performance
Video streaming

Can interact with any other services to provide in near real-time customized information based on monitoring data

Secure, remote administration for services and applications

Agents to supervise applications, trigger alarms, restart or reconfigure them, and notify other services when certain conditions are detected.

The MonALISA framework is used to develop higher level decision services, implemented as a distributed network of communicating agents, to perform global optimization tasks.

Graphical User Interfaces to visualize complex information
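The supervising agents mentioned in this summary can be pictured as a set of rules, each pairing a condition on the latest monitoring values with an action such as raising an alarm or restarting an application. The sketch below is a hypothetical illustration, not the MonALISA agent API: `Rule`, `SupervisorAgent`, the thresholds, and the parameter names `load1` and `processes` (borrowed from the ApMon slide) are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical rule: run `action` whenever `condition` holds for new values."""

    name: str
    condition: Callable[[Dict[str, float]], bool]
    action: Callable[[str], None]


class SupervisorAgent:
    """Evaluates every rule against each monitoring update it receives."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def on_update(self, values: Dict[str, float]) -> List[str]:
        """Fire the actions of all matching rules and return their names."""
        fired = []
        for rule in self.rules:
            if rule.condition(values):
                rule.action(rule.name)
                fired.append(rule.name)
        return fired
```

A rule's action could equally notify another service, which is how a local supervisor could feed the higher level decision services described above.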