Connect. Communicate. Collaborate GÉANT2 monitoring Otto Kreiter, DANTE Navneet Daga, DANTE LHC Monitoring Workshop, Munich, 19.07.2006


Connect. Communicate. Collaborate

GÉANT2 monitoring

Otto Kreiter, DANTE
Navneet Daga, DANTE

LHC Monitoring Workshop, Munich, 19.07.2006

Agenda

• Extraction of monitoring information from the GÉANT2 network
• External application developed by DANTE
• Demonstration of a home grown weather-map
• Conclusion

Network Element Manager (NM)

• All network elements (NEs) communicate with the NM separately
• The NM's task is to configure and monitor each NE one by one
• It is not service aware: it has no knowledge of the intra-domain e2e path status

Regional Network Manager (RM)

[Diagram labels: Topology, Services, Correlation, “User” interface]

How we export data

[Diagram labels: Alarms, Alarms, Perf. Meas., Rem. Inv.]

Status via alarms

[Diagram labels: Alarms, SNMPTrapD, Alarms, Monitoring station]

Alarm content

• From the NM:
– Information about interfaces and associated signal status, SDH timing problems
– NE and ILA status
• From the RM:
– Information related to services
– Information related to path, trails and physical connections at all layers

One hop case: NMS vs JRA-4

[Diagram labels: Path – gen_mil_CERN; OCH trail; Phys-link, Phys link; Domain link; P. ID link, P. ID link; BOL-CERN-LHC-001]

Multiple hop case: NMS vs JRA-4

[Diagram labels: Path – gen_mil_CERN; OCH trail, Phys-link, Phys link; Domain link; P. ID Link; CERN-SARA-LHC-001; OCH trail, Phys-link; P. ID Link]

Alarm processing

• SNMP traps from the Alcatel IOO module
• Alcatel Enterprise v1/v2c MIB
• SNMP traps received by a Linux station
– snmptrapd to pick up all alarms
– For each trap a bash script is called which performs:
  • Analysis
  • Selection
  • Action
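As a rough illustration of the first step of such a per-trap script, the sketch below parses the text that snmptrapd writes to the standard input of a trap handler: the host name on the first line, the transport address on the second, then one "OID value" pair per line. It is written in Python rather than the bash used on the monitoring station. The host name, addresses and values in `SAMPLE` are invented, and the short field names stand in for whatever OIDs the Alcatel MIB actually delivers.

```python
def parse_trap(lines):
    """Parse snmptrapd handler input into (host, address, varbinds)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected at least a host line and an address line")
    host, address = rows[0], rows[1]
    varbinds = {}
    for row in rows[2:]:
        # Each remaining line is "<OID> <value>"; the value may contain spaces.
        oid, _, value = row.partition(" ")
        varbinds[oid] = value.strip().strip('"')
    return host, address, varbinds


# Invented sample input; a real handler would read sys.stdin instead.
SAMPLE = """\
nms.example.net
UDP: [192.0.2.10]:162->[192.0.2.20]:162
friendlyName "gen_mil_CERN"
perceivedSeverity critical
"""

host, address, varbinds = parse_trap(SAMPLE.splitlines())
print(host, varbinds)
```

The analysis, selection and action steps would then work on the `varbinds` dictionary.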

Alarm type & information

Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress

Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress

Used alarm information

Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress

Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress

Alarm analyzer process

[Flowchart, labels in reading order]
• SNMP trap received
• snmpTrapAddress must be registered
• Check for type of alarm
– Raise: Additional Info (path, clientpath, ochtrail, omstrail, physicallink); recordAlarm; call external program
– Clear: alarmID; read recordAlarm; call external program
• Other labels: Record all traps; delete recordAl; friendlyName (twice)
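The flowchart labels suggest a raise/clear state machine keyed on the alarm ID. The sketch below is one possible reading of it, in Python rather than the bash of the real analyzer. The `type` field, the registered address, the `notify` callback and the "DOWN"/"UP" values passed to it are assumptions made for the example; only the field names and the trail keywords come from the slides.

```python
# Hypothetical configuration; the real values live in the monitoring station's script.
REGISTERED_SOURCES = {"192.0.2.10"}
TRAIL_KINDS = ("path", "clientpath", "ochtrail", "omstrail", "physicallink")


def handle_trap(trap, records, notify):
    """Process one trap dict; `records` maps currentAlarmId -> friendlyName."""
    if trap.get("snmpTrapAddress") not in REGISTERED_SOURCES:
        return "ignored"
    alarm_id = trap["currentAlarmId"]
    if trap["type"] == "raise":
        info = trap.get("additionalInformation", "")
        if not any(kind in info for kind in TRAIL_KINDS):
            return "ignored"
        records[alarm_id] = trap["friendlyName"]  # record the alarm
        notify(trap["friendlyName"], "DOWN")      # stand-in for the external program
        return "raised"
    if trap["type"] == "clear":
        name = records.pop(alarm_id, None)        # read and delete the record
        if name is None:
            return "ignored"
        notify(name, "UP")
        return "cleared"
    return "ignored"


sent = []
records = {}


def notify(name, status):
    sent.append((name, status))


raise_trap = {"type": "raise", "snmpTrapAddress": "192.0.2.10",
              "currentAlarmId": "4711", "friendlyName": "gen_mil_CERN",
              "additionalInformation": "ochtrail"}
clear_trap = {"type": "clear", "snmpTrapAddress": "192.0.2.10",
              "currentAlarmId": "4711"}
first = handle_trap(raise_trap, records, notify)
second = handle_trap(clear_trap, records, notify)
```

Keeping the record keyed on the alarm ID is what lets a later clear be matched to its raise.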

Alarm analyzer

• Called every time a trap is received
• Written in bash
• Each trap is analyzed separately; if a new trap arrives in the meantime, it waits in the queue (snmptrapd)
– Possible problem: if an external program gets stuck, the script hangs and the alarms remain unprocessed in the queue
• Must maintain state
– SNMP traps may get lost, so a program needs to check from time to time whether the monitoring station is in sync with the NMS
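Both risks named above have common mitigations, sketched here in Python under stated assumptions. `run_with_timeout` bounds how long an external program may block the handler, and `reconcile` compares locally recorded alarm IDs with the set currently active on the NMS. How that active set would be fetched from the NMS is not covered by the slides, so it is simply passed in.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run an external program; True only if it exits with code 0 in time."""
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return False
    return completed.returncode == 0


def reconcile(recorded_ids, nms_active_ids):
    """Return (missed_raises, missed_clears) between local state and the NMS."""
    recorded, active = set(recorded_ids), set(nms_active_ids)
    missed_raises = active - recorded   # active on the NMS, never recorded locally
    missed_clears = recorded - active   # recorded locally, no longer active on the NMS
    return missed_raises, missed_clears


quick = run_with_timeout([sys.executable, "-c", "pass"], 10)
stuck = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
drift = reconcile({"a1", "a2"}, {"a2", "a3"})
```

A periodic job could call `reconcile` and replay the missing raises and clears through the normal handler.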


XML file generation

E2E data transformation

• Prototype applications developed in Java:
– E2EXMLWriter
– XMLGenerator
• E2EXMLWriter performs 2 functions:
– Takes in a template XML and produces an XML file containing live e2e path status information conforming to the JRA4 e2e data model
– Feeds a perfSONAR MA with live path status information
• E2EXMLWriter is triggered by a script listening to SNMP alarms
– Parameters passed:
  • Trail ID
  • Status
• XMLGenerator produces the template XML that E2EXMLWriter uses to export the domain’s e2e information

Design of E2EXMLWriter

• Relies on 2 configuration files to produce live XML status information
– Properties file (links.properties)
  • Contains key = value entries
  • Each key is one e2e path name
  • The value of each key is a CSV list of the trails that form one Domain Link and/or Partial ID Link
  • Currently manually maintained
– Alarm register
  • A simple CSV file
  • Application maintained
  • An “alarm raise” registers the associated path
  • An “alarm clear” de-registers the associated path

(contd.)
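To make the two files concrete, here is a small Python sketch (the prototype itself is Java) of how a key = value mapping from path names to trails and an alarm register could interact. The trail names in `SAMPLE_LINKS`, the in-memory `set` standing in for the CSV register, and the "DOWN"/"UP" status strings are assumptions for illustration.

```python
def parse_links(text):
    """Parse key = value lines into {path name: [trail, ...]}."""
    links = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        links[key.strip()] = [t.strip() for t in value.split(",") if t.strip()]
    return links


def update_register(register, links, trail_id, status):
    """Register every path that uses `trail_id` on DOWN, de-register it otherwise."""
    for path, trails in links.items():
        if trail_id in trails:
            if status == "DOWN":
                register.add(path)
            else:
                register.discard(path)
    return register


# Invented trail names; the path names are taken from the slides.
SAMPLE_LINKS = """\
CERN-SARA-LHC-001 = trail_gen_fra, trail_fra_ams
BOL-CERN-LHC-001 = trail_gen_mil
"""

links = parse_links(SAMPLE_LINKS)
register = set()
update_register(register, links, "trail_fra_ams", "DOWN")
after_raise = set(register)
update_register(register, links, "trail_fra_ams", "UP")
```

This simplified version de-registers a path on the first clear, even if another of its trails were still in alarm; a fuller version would track alarms per trail.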

Design (contd.)

• The application sets every path’s default status to UP with admin state NORMALOPERATION
• Only the paths “registered” in the alarm-register CSV file are set to DOWN with admin state MAINTENANCE
• The status DEGRADED is not implemented at the moment
• Other admin states are not implemented at the moment
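The status rule described above is small enough to state as code. This Python sketch is not the Java implementation; it only restates the rule, with the function name `path_states` and the tuple layout chosen for the example.

```python
def path_states(path_names, registered):
    """Map each path to (operational status, admin state) per the rule above."""
    states = {}
    for name in path_names:
        if name in registered:
            states[name] = ("DOWN", "MAINTENANCE")
        else:
            states[name] = ("UP", "NORMALOPERATION")
    return states


states = path_states(
    ["CERN-SARA-LHC-001", "BOL-CERN-LHC-001"],
    registered={"BOL-CERN-LHC-001"},
)
```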

Design of XMLGenerator

• Relies on 3 configuration files:
– Properties file (init.properties)
  • Contains a key = value entry
  • Key = DOMAIN
  • Value = <domain_name>
  • Enables on-the-fly domain name configuration
– Config file (config.csv)
  • A simple CSV file
  • Contains node-link-node information
– A sample XML file containing “pieces of XML” to be replicated for each node and link in the final output “template XML”
• All configuration files are currently manually maintained
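A hedged Python sketch of the generation step follows. It assumes each row of config.csv has the column order node, link, node, and it builds elements directly instead of replicating pieces from a sample XML file as the real tool does. The element and attribute names (`domain`, `node`, `link`, `id`, `from`, `to`) and the node and second link names are placeholders, not the JRA4 e2e data model.

```python
import csv
import io
import xml.etree.ElementTree as ET


def build_template(domain, config_csv_text):
    """Build a template tree with one element per node and per link."""
    root = ET.Element("domain", {"name": domain})
    seen = set()
    for row in csv.reader(io.StringIO(config_csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        node_a, link, node_b = (cell.strip() for cell in row)
        for node in (node_a, node_b):
            if node not in seen:
                seen.add(node)
                ET.SubElement(root, "node", {"id": node})
        ET.SubElement(root, "link", {"id": link, "from": node_a, "to": node_b})
    return root


# Invented rows for illustration.
root = build_template("GEANT2", "GEN,gen_mil_CERN,MIL\nMIL,mil_bol_example,BOL\n")
template_xml = ET.tostring(root, encoding="unicode")
```

Reading the domain name from a separate properties entry, as the slides describe, keeps the same CSV reusable across domains.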


Monitoring data processing “e2e path”

LHC weather-map live demonstration

1. CERN user-side down

2. CERN user-side up

3. GEN-MIL lambda down

4. GARR user-side down

5. Back-to-back interconnection in DE broken

6. AMS-FRA lambda down

7. DE interconnection up

8. AMS-FRA lambda up

9. GARR user-side up

10. GEN-MIL lambda up

Conclusion

• Status monitoring via SNMP alarms is in an advanced phase and well understood
– Once the characteristics of the equipment/alarms/faults were understood, the development was easy
• XMLGenerator is not bound to specific equipment and can be used together with the JRA-4 MP and/or to feed a perfSONAR MA

Questions?


T0-T1 CERN-CNAF

[Diagram labels: GARR, GÉANT2, CERN (CH), CNAF (IT)]

Technologies

[Diagram: CERN-CNAF-LHCOPN-001]
Domains: GÉANT2, GARR, CNAF, CERN
Links: Domain link, P. ID Link, P. ID Link, P. ID Link, P. ID Link, Domain link
Equipment: F10, 1626 LM, 1626 LM, M320, M320, C6509

Domain I – CERN

• Partial ID Link corresponds to the status of the port
• MP developed by Martin Swany exports port status information

[Diagram labels: CERN, P. ID Link, F10]

Domain II – GÉANT2

• Partial ID link – status of the ports facing the adjacent domains
• Domain Link – status of the lambda
• perfSONAR MA and GN2-JRA4 MP used to export status information

[Diagram labels: GÉANT2, Domain link, P. ID Link, P. ID Link, 1626 LM]

Domain III – GARR

• Inter Domain Link – status of the port facing GÉANT2
• Domain link – status of the LSP between the two routers + status of the interface facing CNAF (T1)
• GN2-JRA4 MP used to export measurement data

[Diagram labels: GARR, P. ID Link, Domain link]


View on the E2E monitoring system

Conclusion

• Fairly easy to establish the monitoring of the E2E path
– It took around two phone conferences with GARR + around 10 e-mails
– 3-4 phone conferences with CERN and Martin Swany + around 10-15 e-mails
– All parties were extremely familiar with their equipment and the required software
• Questions started to pop up about whether we need to monitor an End-Point (EP) and how we should do it
– Is an EP a simple client?
– Or shall we redefine the “Client” as somebody who actively participates in the e2e monitoring?

Backup

CERN user side down

Lambda CH-IT down

Lambda and user failure in IT

Lambda + POP interconnect failure

Multiple Lambda, user and POP interconnect failure