
OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl


DESCRIPTION

The complexity of a typical OpenNebula installation brings a special set of challenges on the monitoring side. In this talk, I will show monitoring of the full stack, from the physical servers to the storage layer and the ONE daemon. Providing an aggregated view of this information lets you see the real impact of a given failure. I would also like to present a use case for a "closed-loop" setup where new VMs are automatically added to the monitoring without human intervention, allowing for an efficient approach to monitoring the services an OpenNebula setup provides.

Bio: I've been into virtualization and storage for a long time, and I like the amount of abstraction OpenNebula offers. Professionally, I have been a Unix systems administrator for most of my working life. I've also done systems integration and monitoring work on the Check_MK project. I'm now one of the very few Nagios experts in Germany not working for one of the 3-5 leading Nagios outfits, and as such I'm able to speak freely about what I think works best for the users. My strength is simply sitting down and listening to what people really need.


Page 1: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

MONITORING OPENNEBULA OpenNebulaConf 2013 © Florian Heigl [email protected] There will be some heresy.

Page 2: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Hi! That's me!

Unix sysadmin / freelance consultant.

• Storage
• Virtualization
• Monitoring
• HA clusters
• Backups (if you had them)
• Bleeding edge software (fun, but makes you grumpy)

Page 3: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

What else?

• Created the first embedded Xen distro (and other weird things)
• Training: monitoring, Linux storage (LVM, Ceph...)
• On IRC @darkfader, on Twitter @FlorianHeigl1

Making monitoring more useful is <H1> for me.

Reap the benefits!

Page 4: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

OpenNebula

My love:
• Abstraction / layering (oZones, VNets, instantiation)
• Hypervisor abstraction (write a Jail driver and a moment later it could set up FreeBSD jails)
• Something happens if you report a bug.

My hate:
• Feature disparity
• Complexity "spikes"
• Unknown states
• Scheduler

Page 5: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

We've all run Nagios once?

Not new:
• Systems and application monitoring
• Nagios

But:
• #monitoringsucks on Twitter is quite busy
• Managers still unhappy?

Page 6: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Interruption

How come there were no checks for OpenNebula?
• Skipped a few demos
• Added checks so I can actually show *something*
• https://bitbucket.org/darkfader/nagios/src/
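Checks like the ones in that repository follow the usual Nagios plugin contract: exit 0/1/2 for OK/WARNING/CRITICAL plus a one-line status. A minimal sketch of such a check, assuming it is fed the XML produced by `onehost list -x` and that host STATE "3" means ERROR (function name and message format are mine, not taken from the repository):

```python
# Minimal Nagios-style check for OpenNebula host states (sketch).
# Assumes XML as produced by `onehost list -x`; the STATE "3" -> ERROR
# mapping should be verified against your ONE version.
import xml.etree.ElementTree as ET

OK, WARN, CRIT = 0, 1, 2

def check_one_hosts(xml_text):
    """Return (nagios_exit_code, status_line) for a HOST_POOL XML dump."""
    root = ET.fromstring(xml_text)
    hosts = root.findall("HOST")
    errored = [h.findtext("NAME") for h in hosts if h.findtext("STATE") == "3"]
    if errored:
        return CRIT, "CRITICAL - hosts in ERROR: %s" % ", ".join(errored)
    return OK, "OK - %d hosts, none in ERROR" % len(hosts)

if __name__ == "__main__":
    sample = ("<HOST_POOL>"
              "<HOST><NAME>hv1</NAME><STATE>2</STATE></HOST>"
              "<HOST><NAME>hv2</NAME><STATE>3</STATE></HOST>"
              "</HOST_POOL>")
    code, line = check_one_hosts(sample)
    print(line)  # CRITICAL - hosts in ERROR: hv2
```

In a real plugin the XML would come from running the CLI (or the XML-RPC API) instead of a literal string.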

Page 7: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Monitoring Systems
• Keep an eye out for redundancy
• Monitor everything. EVERYTHING. Monitor!
• But think about "capacity"
• I don't care if my disk does 200 IOPS (except when I'm tuning my IO stack)
• I do care if it's maxed!
• My manager doesn't care if it's maxed?

Page 8: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Monitoring Applications
• We know how to monitor a process, right?

Differentiate:
• Checking software components: I don't care if a process on one HV is gone. Nor does the manager, nor does the customer.
• End-to-end checks: Customers will care if Sunstone dies.

Totally different levels of impact!

Page 9: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Monitoring Apps & Systems

Choose a strategy:
• Every single piece (proactive, expensive)
• Something hand-picked (reactive)

Limited by resources, pick monitoring functionality over monitoring components. Proactively monitoring something random? Doesn't work.

Page 10: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Examples
• This is so I don't forget to give examples for the last slide.
• So, let's go back.

Page 11: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Dynamic configuration
• You might have heard of Check_MK and inventory. Some think that's it.
• But... sorry... I won't talk (a lot) about that.

• We'll be talking about dynamic configuration
• We'll be talking about rule matching
• We'll be talking about SLAs

Page 12: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Business KPIs
• "Key Performance Indicators"
• Not our kind of performance.
• I promise there is a reason to talk about this.

Were you ever asked to provide
• Reports and fancy graphs
• What impact a failure is going to have

As if you had a damn looking glass on your desk, right?

Page 13: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

The looking glass

• Assume we know how to monitor it all.
• Let's ask what we're monitoring.

Page 14: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Top down, spotted.

•  [availability] •  [performance] •  [business operations] •  [redundancy]

Page 15: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Ponder on that:
• All your aircos with their [redundancy] failed. Isn't your cloud still [available]?
• Your filers are being trashed by the Nagios VM, crippling [performance]. Everything is still [available], but cloning a template takes an hour.
• Will that impact [business operations]?

Page 16: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Ponder on that too:

Assume you're hosting a public cloud. How will your [business operations] lose more money:
1. A hypervisor is no longer [available] and you even lose 5 VM images
2. Sunstone doesn't work for 5 hours

Disclaimer: Your actual business' requirements may differ from this example. :)

Page 17: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Losing your accounting...

A very recent example (translated from German):

"This is really bad. A whole series of things stops working because of it, e.g. power and traffic accounting in the data center, creation and management of domains, etc. We have to fix this very quickly, otherwise we can't bill anything, since nothing is logged, we can't create anything and can't look anything up."

Page 18: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

That KPI stuff creeps back
• All VMs are running, Sunstone is fine. Our storage is at low utilization, lots of capacity for new VMs.
• => [availability] [redundancy] [performance] is A+
• But you have a BIG problem.
• You didn't notice, because you "just" monitored that every piece of "the cloud" works.
• Customers are switching to another provider!
• Couldn't you easily notice anyway?

Page 19: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Into: Business
• VM creations / day => revenue
• User registrations / day => revenue
• Time to "bingo point" for storage

Those are "KPIs". Talk to your boss's boss about that.

You could:
• Set alert levels for revenue
• Set alert levels for customer acquisitions
• Set alert levels on SLA penalties
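An alert level on a KPI works like any other check, just with a floor instead of a ceiling. A sketch, assuming VM creation timestamps have already been pulled from ONE accounting; the thresholds and the function name are invented for illustration:

```python
# KPI check sketch: alert when VM creations in the last 24 hours drop
# below warn/crit floors. Thresholds and names are illustrative.
DAY = 86400

def kpi_vm_creations(create_times, now, warn_floor=20, crit_floor=5):
    """create_times: unix timestamps of VM instantiations."""
    recent = sum(1 for t in create_times if now - t <= DAY)
    if recent < crit_floor:
        return 2, "CRITICAL - only %d VM creations/day" % recent
    if recent < warn_floor:
        return 1, "WARNING - only %d VM creations/day" % recent
    return 0, "OK - %d VM creations/day" % recent
```

The same shape works for user registrations per day or time-to-"bingo point" on storage.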

Page 20: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Starting point

Page 21: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Into: Business
• VM creations / day => revenue
• User registrations / day => revenue
• Time to "bingo point" for storage

Those are "KPIs". Talk to your boss's boss about that.

You could:
• Set alert levels for revenue
• Set alert levels for customer acquisitions
• Set alert levels on SLA penalties

Page 22: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Into: Availability
• Checks need to be reliable
• Avoid anything that can "flap"
• Allow for retries, even allow for larger intervals
• "Wiggle room"
• Reason: DESTROY any false alerts
• Invent more end-to-end / alive checks

Nagios/Icinga users:
• You must(!) take care of parent definitions

Page 23: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Example: Availability
• Checks that focus on availability
• Top down to:
  • "doesn't ping"
  • bonded NIC
  • missing process

Aggregation rules:
• "all" DNS servers are down
• bus factor is "too low"
• Can your config understand the SLAs?

Page 24: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Into: Performance
• Constant, low intervals
• One thing measured at multiple points
• Historical data and predicting the future
• Ideally, alert only based on performance issues

• Interface checks: BAD!
• One alert for two things? Link loss, BW limit, error rates
• => maybe historical unicorns/s?
• => loses meaning

Page 25: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Example: Performance

Monitoring the IO subsystem:
• Monitor disk BW / IOPS / queue / latency
• Per disk (xxx MB/s, 200 / 4 / 30ms)
• Per host (x GB/s, 4000 / 512 / 30ms)
• Replication traffic % disk IO % net IO

Homework: baseline / benchmark. Turn it into "power reserve" alerts, aggregated over all hosts.
• Nobody ever did it.
• Nobody stops us, either.
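The "power reserve" idea above can be sketched as a single aggregate over per-host IOPS figures, where the per-host maxima come from the baseline/benchmark homework. All names and thresholds here are illustrative, not from the talk:

```python
# "Power reserve" sketch: aggregate used vs. benchmarked max IOPS over
# all hosts and alert on the remaining headroom, not on any single disk.
def iops_power_reserve(hosts, warn_pct=30.0, crit_pct=10.0):
    """hosts: iterable of (used_iops, max_iops) per hypervisor."""
    used = sum(u for u, _ in hosts)
    cap = sum(m for _, m in hosts)
    reserve = 100.0 * (cap - used) / cap
    if reserve < crit_pct:
        return 2, "CRITICAL - %.0f%% IOPS reserve left" % reserve
    if reserve < warn_pct:
        return 1, "WARNING - %.0f%% IOPS reserve left" % reserve
    return 0, "OK - %.0f%% IOPS reserve left" % reserve
```

Alerting on the pool-wide reserve is what makes this a capacity check rather than a per-disk performance check.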

Page 26: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Capacity?

They figured it out.

Screenshot removed.

Page 27: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Capacity?

Turn some checks into "power reserve" alerts. Nobody ever did it. Nobody stops us, either.

Example: one_hosts summary check, aggregated over all hosts.

Page 28: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Into: Redundancy

Monitor all components and the sublayers making them up. Associate them:
• Physical disks
• SAN LUN, RAID vdisk, MD RAID volume
• Filesystem...

Make your alerting aware. Make it differentiate...

Page 29: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Example: Redundancy

Why would you get the same alert for:
• A broken disk in a RAID10+HSP under a DRBD volume?
• A lost LUN
• A crashed storage array

What are your goals
• for replacing a broken disk that is protected
• for MTTR on an array failure
=> you really need to adjust your "retries"

Page 30: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Create rules to bind them
• An eye on details
• Relationships
• Impact analysis
• Cloud services: constantly changing platform

⇒ Close to impossible to maintain manually
⇒ Infra as code is more than a Puppet class adding a dozen "standard" service checks.

Page 31: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Approach
1. Predefine monitoring rulesets based on expectations
2. Externalize SLA info (thresholds) for rulesets
3. Create business intelligence / process rulesets that match on attributes (no hardwiring of objects)
4. Use live, external data for identifying monitored objects
5. Handle changes: hook into ONE and Nagios
6. Sit back, watch it fall into place.

Page 32: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Predefine rules

ONEd must be running on frontends
Libvirtd must be running on HV hosts
KVM must be loaded on HV hosts
Disk space on /var/libvirt/whatever must be OK on HV hosts
Networking bridge must be up on HV hosts
Router VM must be running for networks

Page 33: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Externalize SLAs
• IOPS reserve must be over <float>% threshold
• Free storage must be enough for <float>% hours' growth plus snapshots on <float>% of existing VMs

• Create a file with those numbers
• Source it and fill the gaps in your rules simply at config generation time
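The "source it at config generation time" step can be as simple as a flat key=value file rendered into rule templates. A sketch, with a file format, key names, and rule strings of my own invention:

```python
# SLA externalization sketch: thresholds live in a key=value file and are
# substituted into rule templates when the monitoring config is built.
SLA_FILE = """\
iops_reserve_pct=25.0
storage_growth_pct=15.0
snapshot_vm_pct=50.0
"""

RULE_TEMPLATES = [
    "IOPS reserve must be over %(iops_reserve_pct)s%%",
    "Free storage must cover %(storage_growth_pct)s%% growth "
    "plus snapshots on %(snapshot_vm_pct)s%% of existing VMs",
]

def load_sla(text):
    """Parse key=value lines into a dict of threshold strings."""
    return dict(line.split("=", 1) for line in text.splitlines() if line.strip())

def render_rules(templates, sla):
    """Fill the SLA gaps in the rule templates."""
    return [t % sla for t in templates]
```

Keeping the numbers outside the rules means the SLA file can be owned by whoever negotiates the SLAs, while the rules stay untouched.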

Page 34: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Build business aggregations

ONEd must be running on the frontend
Libvirtd must be running on HV hosts
KVM must be loaded on HV hosts
Disk space on /var/libvirt/whatever must match SLA on HV hosts
Networking bridge must be up on HV hosts
Router VM must be running for networks
-> Platform is available

Page 35: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Live data
• ONE frontend nodes know about all HV hosts
• All about their resources
• All about their networks
• So let's source that.
• Add attributes (which we do know) automatically
• The rules will match on those attributes

for vnet in one_info["vnets"].keys():
    checks += [(["one-infra"], "VM vrouter-%s" % vnet)]
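The loop on this slide can be expanded into a runnable sketch: take the frontend's view of vnets and hosts (modeled here as a pre-fetched dict, since querying ONE is out of scope), and emit check definitions plus host attributes for the rulesets to match on. The structure and names are illustrative, not Check_MK's exact config syntax:

```python
# Live-data sketch: turn the ONE frontend's knowledge of vnets and hosts
# into check definitions and host tags. one_info stands in for data
# fetched via the ONE CLI or XML-RPC API; the config shapes are simplified.
one_info = {
    "vnets": {"dmz": {}, "backend": {}},
    "hosts": {"hv1": {"hypervisor": "kvm"}, "hv2": {"hypervisor": "kvm"}},
}

def build_config(info):
    checks, host_tags = [], {}
    for vnet in sorted(info["vnets"]):
        # one router-VM check per virtual network, bound to the infra tag
        checks.append((["one-infra"], "VM vrouter-%s" % vnet))
    for host, attrs in sorted(info["hosts"].items()):
        # attributes become tags that the predefined rulesets match on
        host_tags[host] = ["one-hv", attrs["hypervisor"]]
    return checks, host_tags
```

Because the rules match on tags rather than host names, nothing here is hardcoded: a new vnet or host shows up in `one_info` and the rules apply on the next config generation.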

Page 36: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

We can haz config!
• Attributes == Check_MK host tags
• Check_MK rules are made on attributes, not hosts etc.
• Rules suddenly match as objects become available
• Rules inherit SLA data
• Check_MK writes out a valid Nagios config

=> The pieces have fallen into place

Page 37: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Change... happens
• We now have a fancy config.

But... once Nagios is running, it's running.
• How will Check_MK detect new services (i.e. virtual machines)?
• How will you not get stupid alerts after onehost delete?
• How will a new system be added into Nagios automatically?

Please: don't say crontab! Use hooks!
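A hook-driven flow might look like this: ONE fires a VM hook, the hook adds or removes the object in the monitoring config, then triggers a re-inventory and reload. The sketch below only models the config side; the oned.conf hook wiring and the Check_MK re-inventory call are left as comments because their exact form depends on your versions. The entry format and function names are mine:

```python
# Hook sketch: keep a Check_MK-style host list in sync with ONE events.
# In oned.conf this would hang off a VM hook on create/done states; here
# we only model the config mutation so the flow is testable.
def on_vm_create(all_hosts, vm_name, tags=("one-vm",)):
    """Add the new VM with its tags; idempotent on repeated events."""
    entry = "%s|%s" % (vm_name, "|".join(tags))
    if entry not in all_hosts:
        all_hosts.append(entry)
    # then: re-inventory + reload, roughly subprocess.call(["cmk", "-II", vm_name])
    return all_hosts

def on_vm_delete(all_hosts, vm_name):
    """Drop the VM so a delete causes no stale alerts."""
    return [e for e in all_hosts if not e.startswith(vm_name + "|")]
```

The point of the hook over a crontab: the monitoring reacts at the moment of the change, instead of alerting on a VM that was deleted twenty minutes ago.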

Page 38: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

How do I use this?

OpenNebula Marketplace:
• Would like to add a preconfigured OMD monitoring VM
• Add context: SSH info for the ONE frontend
• Test, poke around, ask questions, create patches

Page 39: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Join? Questions?

• Thanks! Ask questions - or do it later :)
• [email protected]

Page 40: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Monitoring

3 monitoring sites:
• Availability
• Capacity
• Business processes

Use preconfigured rulesets... that differ.
Goal: nothing hardcoded

Page 41: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Monitoring

Different handling:
Interface link state -> Availability
Interface IO rates -> Capacity
Rack power % -> Capacity
Rack power OK -> Availability
Sunstone -> Availability, Business processes

Page 42: OpenNebulaConf 2013 - Monitoring of OpenNebula installations by Florian Heigl

Interface
1. HOOK injects services (or hosts)
2. Each monitoring site filters what applies to it
3. Rulesets immediately apply to new objects

• Central monitoring to aggregate (...them all)
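Step 2 above ("each monitoring site filters what applies to it") can be sketched as each site keeping only the injected objects whose tags it cares about. The site names follow the three-site split from the earlier slide; the tag scheme and code shape are mine:

```python
# Per-site filter sketch: a hook injects tagged services once; each of
# the three monitoring sites keeps only what matches its own concern.
SITE_FILTERS = {
    "availability": {"link-state", "process", "ping"},
    "capacity": {"io-rate", "power-pct", "disk-usage"},
    "business": {"sunstone", "vm-creations"},
}

def filter_for_site(site, injected):
    """injected: list of (service_name, tag) tuples from the hook."""
    wanted = SITE_FILTERS[site]
    return [svc for svc, tag in injected if tag in wanted]
```

The central monitoring then aggregates over the three filtered views rather than over every raw service.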