
Multi-Tenant Nagios Monitoring

Dave Williams, Technical Architect, Bull

October 14th 2014


Agenda

Background
Multi-Tenant Monitoring
Why Multi-Tenant?
Multi-Tenant Design
Service Catalogue
Futures & ‘Blue Sky thinking’
Questions


Background

UK based
– Mainframe (IBM & Honeywell)
– Unix (HP-UX, AIX, Solaris)
– Linux (Red Hat, SLES, Debian)
– Network (CASE, 3Com, Cisco)

Working for Bull
– French computer manufacturer
– Mainframes, Unix, HPC, Security, Managed Services, Advisory Services


Background

System Monitoring
– OpenView
– NetView
– Open Master

Open Source Monitoring
– NetSaint on AIX
– Nagios


Why Multi-Tenant?

Outsourcing Support & Monitoring

Multiple Customers

– Different levels of security
– Different hardware / software platforms

One Support Team
– Only need to know about real problems
– Can be driven by support ticket, not Nagios

Required 365 x 24
– Infrastructure must survive all outages without loss of service


Multi-Tenant Design

Each customer may have 2-3000 hosts
– 10-100 services per host
– Real-time monitoring

Customer profile
– SLA reporting
– Batch event completion
– Different SLAs for each business process per customer
– Different alerting & escalation methods per customer


Multi-Tenant Design

Hardware Platform – Central Support

Virtualised Platform (Intel based)

– XenServer Hypervisor
  Allows clustering with shared storage
  Inexpensive licensing

Shared Storage
– NAS

  Using QNAP appliances with underlying RAID-5 & hot-spare protection
  Network connection using dual interfaces bonded across multiple switches
  Could have used FreeNAS

LAN Infrastructure
– Dual connections to all hardware
– SNMP-managed switches


Hardware Platform – Basic Schematic


Multi-Tenant Design

Hardware Platform – Resilience

Virtualised Platform (Intel based)

– XenServer Hypervisor
  Allows clustering with shared storage
  If the primary node fails, the cluster will ‘spin up’ the image on the 2nd node

  Same data / logs (shared storage)

LAN Infrastructure
– Dual connections to all hardware

  Bonded interfaces for NAS access – no data loss / access loss on failure
– SNMP-managed switches


Hardware Setup


Multi-Tenant Design

Hardware Platform – Recovery

Virtualised Platform (Intel based)

– XenServer Hypervisor
  Allows clustering with shared storage
  If the primary site fails, the image will be spun up at the recovery site
  Internet access fails over using BGP

Shared Storage – replicated from Prime Site
– NAS

  Using QNAP appliances with underlying RAID-5 & hot-spare protection
  Using RTRR (Real-Time Remote Replication) between sites
  Network connection using dual interfaces bonded across multiple switches

LAN Infrastructure
– Dual connections to all hardware

  Bonded interfaces for NAS access – no data loss / access loss on failure
– SNMP-managed switches


Hardware Platform – Resilience


Hardware Platform – Customer Site

Using generic netbooks

Minimum requirement

– 1GB memory, Atom processor, Ethernet port
– Running CentOS 6.4 64-bit operating system

Can use Raspberry Pi for small customers
– 512MB memory, ARM processor, Ethernet port
– Running Raspbian operating system


Software Platform – Central Site

Nagios Core
– Running latest 4.0.8
– Using MK Livestatus for interfacing (see the query sketch below)
– Using Thruk for visualisation

Graylog2 / Elasticsearch
– Store all logs & syslog in a ‘Big Data’ repository using MongoDB

Asterisk PBX
– Allows all alerting to use standard dial-up with speech synthesis + IVR

SMS-Client
– Still using TAPI to send SMS texts to contacts
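
As a flavour of the Livestatus interfacing above, a minimal query can be issued straight over the module's Unix socket; the socket path below is an assumption, so adjust it to the local install:

    # List all non-OK services via MK Livestatus (socket path is an assumption)
    printf 'GET services\nColumns: host_name description state plugin_output\nFilter: state != 0\n\n' \
        | unixcat /var/lib/nagios/rw/live

Thruk and any custom dashboards talk to the same socket, which is what keeps the visualisation layer decoupled from the core.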


Software Platform – Central Site (contd)

NRPE
– Running 2.15

NSCA & NSCA-ng
– Using NSCA for external communication
– Using NSCA-ng for issuing remote commands (see the sketch below)

Postfix / Procmail
– Used to generate emails but also handle responses
– Routes unsolicited alerting emails (HP Insight, Pingdom)

OTRS
– Record alerts, track issues
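
A minimal sketch of the two NSCA paths, with hypothetical host names and paths: the classic client submits a tab-separated passive result, while the NSCA-ng client can also carry a raw external command (its -C flag is quoted from memory of the NSCA-ng client; verify against the installed version):

    # Passive service check result over classic NSCA
    printf '%s\t%s\t%d\t%s\n' remote-poller 'Batch Job' 2 'CRITICAL: job overran' \
        | send_nsca -H central.example.com -c /etc/nagios/send_nsca.cfg

    # Remote command over NSCA-ng, e.g. acknowledging a problem
    echo 'ACKNOWLEDGE_SVC_PROBLEM;remote-poller;Batch Job;1;1;1;ops;being handled' \
        | send_nsca -C -H central.example.com -c /etc/nsca-ng/send_nsca.cfg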


Software Platform – Remote Site

Nagios Core
– Running latest 4.0.8

NRPE
– Running 2.14

NSCA
– Using NSCA for external communication

OpenVPN
– Communication secured via encrypted VPN tunnel


Customer Multi-Tenant


Multi-Tenant Schematic


Service Catalogue

ITIL Flavour
– Really just services & their characteristics


Service Catalogue

Agreed list of servers / services
– With importance levels
– With alerting paths
– With escalation paths
– Recovery options

Feeds into Service Level Agreements and Operational Level Agreements
Basis of agreed reporting structures (a sample extract is shown below)
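
By way of illustration only, a catalogue entry can be as simple as a spreadsheet row; the columns below are an assumed layout, not the deck's actual format:

    host,service,importance,alert_path,escalation_path,recovery
    custloc1-swfeltsw01,PING,high,sms+email,duty-manager,auto
    custloc1-db01,Oracle Tablespace,critical,voice+sms,dba-oncall,manual
    custloc2-web01,HTTP,medium,email,servicedesk,auto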


Examples

Basic Spreadsheet plus Shell script
– Usually easy to create; the shell script is different for each customer, based on an initial standard script (a minimal sketch follows this list)

Chef or Puppet
– Use Exported Resources
– Nagios Cookbook – Nagios Conference 2012 presentation
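
A minimal sketch of the spreadsheet-plus-shell approach, assuming the CSV layout shown on the previous slide; templates, paths, and the script itself are illustrative, not the per-customer scripts from the talk:

    #!/bin/sh
    # Expand a catalogue CSV into Nagios object definitions.
    CSV=${1:?usage: csv2nagios.sh catalogue.csv}
    OUT=${2:-/etc/nagios/conf.d/customer1.cfg}
    : > "$OUT"

    # One host block per unique host name (column 1), skipping the header row
    tail -n +2 "$CSV" | cut -d, -f1 | sort -u | while read -r host; do
        printf 'define host {\n  use generic-host\n  host_name %s\n  address %s\n}\n\n' \
            "$host" "$host" >> "$OUT"
    done

    # One service block per row; the escalation column drives the contact group
    tail -n +2 "$CSV" | while IFS=, read -r host svc imp alert esc rec; do
        printf 'define service {\n  use generic-service\n  host_name %s\n  service_description %s\n  contact_groups %s\n}\n\n' \
            "$host" "$svc" "$esc" >> "$OUT"
    done

The Chef/Puppet route reaches the same end by exporting each node's host and service resources and collecting them on the monitoring server.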


Multi-Tenant Issues

Naming conventions
– Every customer has a server01
– Customers' naming conventions are obscure
– Customers have multiple physical locations or levels of security

– This gives rise to Nagios names that differ from the actual names:
  Custloc1-swfeltsw01
  Custloc2-nwfeltsw01

Not so smart when a non-Nagios-originated alert is received
– ‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
– The external alert processor has to perform table lookups before building the appropriate NSCA command (see the sketch below)
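
A minimal sketch of that lookup step, assuming a flat mapping file of '<customer name> <Nagios name>' pairs and the stock send_nsca client; file locations and the service name are hypothetical:

    #!/bin/sh
    # Translate a device name from an unsolicited alert (HP Insight etc.)
    # to its Nagios host name, then forward it as a passive check result.
    LOCAL=$1       # e.g. swfeltsw01
    MESSAGE=$2     # e.g. 'RAID battery backup failure'

    NAGIOS_NAME=$(awk -v n="$LOCAL" '$1 == n { print $2 }' /etc/nagios/names.map)
    [ -n "$NAGIOS_NAME" ] || { echo "no mapping for $LOCAL" >&2; exit 1; }

    printf '%s\t%s\t%d\t%s\n' "$NAGIOS_NAME" 'Hardware Alert' 2 "$MESSAGE" \
        | send_nsca -H central.example.com -c /etc/nagios/send_nsca.cfg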


Futures & Blue Sky thinking

The Nagios visualisation is resource heavy
– All customers want their own dashboard
– All customers want a different screen layout

Why not move the visualisation into the cloud?
– Use an Amazon EC2 image to access the central Livestatus via https
– Allow the end user to authenticate
– Customer portal allows ‘spin up’ & ‘spin down’ of images (sketched below)

– Move billing to the customer
– Scale horizontally for visualisation
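
As a rough illustration of the portal's ‘spin up’ / ‘spin down’ actions via the AWS CLI; the AMI and instance IDs are placeholders:

    # Launch a per-customer dashboard instance from a prepared image
    aws ec2 run-instances --image-id ami-12345678 --instance-type t2.micro

    # ...and tear it down again when the customer closes their session
    aws ec2 terminate-instances --instance-ids i-1234abcd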


Load Sharing

Plugins like check_wmi_plus put a strain on the monitoring system: a large number of queries that take wall-clock time to complete and parse.
Better to have ‘worker nodes’ via Merlin, Mod Gearman, or similar to perform these functions – a Raspberry Pi, for example (a minimal worker sketch follows).
No great expense to add 2-3 Pis to customer-site configurations; easy fallback if they fail – no unique locally stored data.
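
A minimal sketch of a customer-site worker, using Mod Gearman's standard worker.conf keys; the values, host names, and the hostgroup-per-customer split are assumptions:

    # /etc/mod_gearman/worker.conf (placeholder values)
    server=central-gearmand.example.com:4730
    encryption=yes
    key=per-site-shared-secret
    # only pull check jobs for this customer's hostgroup
    hostgroups=customer1

    # started on the Pi with (flags as I recall the Mod Gearman docs; verify locally):
    mod_gearman_worker --daemon --config=/etc/mod_gearman/worker.conf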


BPI Example


Dashboard Example


Questions?

