Lemon Monitoring Presented by Bill Tomlin CERN-IT/FIO/FD WLCG-OSG-EGEE Operations Workshop CERN, 19-20 June 2006

Lemon Monitoring

Presented by Bill Tomlin
CERN-IT/FIO/FD

WLCG-OSG-EGEE Operations Workshop
CERN, 19-20 June 2006

Lemon – LHC Era Monitoring

• Distributed monitoring framework + default metrics
• For nodes, DBs, power consumption, backups, VO jobs
• Scalable to ~10k nodes, 500+ metrics
• Early error detection and automatic recovery
• Web interface
• Integrated alarm system
• Data persisted to Oracle, Oracle Express or flat files
• Framework for plug-in sensors
• Site independent: BARC, CERN IT+AB, FZK, IN2P3, INFN, RAL
• GridICE based on LEMON (~180 sites)
• Easy to install out of the box
• Well documented at http://www.cern.ch/lemon

Lemon architecture

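[Diagram: on each node, Sensors feed a Monitoring Agent, which reports over TCP/UDP to the Monitoring Repository and its repository backend. Correlation Engines and the Lemon CLI connect to the repository via SOAP. The user's web browser reaches RRDTool / PHP pages served by apache over HTTP.]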
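The sensor, agent, and repository labels in the architecture diagram suggest a simple data flow: plug-in sensors produce samples, a per-node agent collects them, and the agent forwards them to a central repository over the network. A minimal sketch of that flow follows. It is not Lemon's actual sensor API or wire protocol: the class name `MonitoringAgent`, the `Sensor` callable type, the JSON payload, and the field names are all assumptions made for illustration.

```python
import json
import socket
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical sensor type: any callable that returns one numeric reading.
Sensor = Callable[[], float]


class MonitoringAgent:
    """Hypothetical per-node agent that polls sensors and forwards samples."""

    def __init__(self, node: str, sensors: Dict[str, Sensor]) -> None:
        self.node = node
        self.sensors = sensors  # metric name -> sensor callable

    def sample(self) -> List[dict]:
        """Read every registered sensor once and return one record per metric."""
        now = time.time()
        return [
            {"node": self.node, "metric": name, "value": read(), "timestamp": now}
            for name, read in self.sensors.items()
        ]

    def send(self, repository: Tuple[str, int]) -> int:
        """Send one UDP datagram per sample (JSON is an assumed encoding).

        Returns the number of datagrams sent.
        """
        samples = self.sample()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            for record in samples:
                sock.sendto(json.dumps(record).encode("utf-8"), repository)
        return len(samples)
```

A new metric is added by registering another callable, for example `MonitoringAgent("node01", {"load": read_load})`, where `read_load` is a hypothetical function supplied by the site. UDP is used here only because the diagram labels the agent-to-repository link TCP/UDP; a TCP variant would differ only in the socket handling.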

Automatic Recovery Actions

• Actuator called for defined conditions
• Complex correlations: m1 > m2 - 50 and m3 < m4
• Retry n times before raising an alarm
• All actions logged, including success/failure
• Example: ssh daemon dead – action /sbin/service sshd start
• ~62 corrective actions defined
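The recovery behaviour listed above (evaluate a correlation, call an actuator, retry n times, log each outcome, then raise an alarm) can be sketched as a small loop. This is an illustration under stated assumptions, not Lemon's implementation: the function names `condition` and `recover`, the return strings, and the metric dictionary are hypothetical, and the dash in the slide's correlation is read as a minus sign.

```python
import logging
from typing import Callable, Dict

log = logging.getLogger("recovery")


def condition(m: Dict[str, float]) -> bool:
    """The example correlation from the slide: m1 > m2 - 50 and m3 < m4."""
    return m["m1"] > m["m2"] - 50 and m["m3"] < m["m4"]


def recover(
    read_metrics: Callable[[], Dict[str, float]],
    actuator: Callable[[], bool],
    retries: int,
) -> str:
    """Run the actuator up to `retries` times while the condition holds.

    Returns "ok" if no action was needed, "recovered" if an attempt
    cleared the condition, and "alarm" if every attempt failed.
    """
    if not condition(read_metrics()):
        return "ok"
    for attempt in range(1, retries + 1):
        succeeded = actuator()
        # Every attempt is logged, whether it succeeded or failed.
        log.info("actuator attempt %d/%d: %s", attempt, retries,
                 "success" if succeeded else "failure")
        if succeeded and not condition(read_metrics()):
            return "recovered"
    log.warning("escalating alarm after %d failed attempts", retries)
    return "alarm"
```

For the ssh example on the slide, the actuator could be a function that runs `/sbin/service sshd start` through `subprocess.run` and returns whether the exit code was zero. Re-checking the condition after each attempt is a design choice of this sketch, so that a command that exits cleanly without fixing the problem still leads to an alarm.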

[Chart: Actuator Runs and Escalated Alarms by month, Jun-05 to May-06; x-axis: Date, y-axis: count (0 to 16000)]

Web Interface

LEMON Alarm System

• Oracle based
• AJAX web based GUI
• Oracle PL/SQL based business logic (reductions of alarms for operators)
• Notifications: RSS feeds, e-mail, SMS
• Integrated with quattor and State Management System
• Plug-ins for site-specific integration e.g. Remedy
• Phasing in Lemon Alarm System (August 2006)
• Ongoing work

Summary

– Can re-use whole or part of LEMON
– Good fabric management essential to providing good grid services
– Queries to: [email protected]
– More details: http://www.cern.ch/lemon
– LEMON tutorial at CERN on 22nd of September