
Neptune : Re-thinking Incident Response Automation


Page 1: Neptune : Re-thinking Incident Response Automation

Re-thinking Incident Response Automation

Kiran Gollu, Co-founder/CEO

Neptune.io © 2015

Page 2: Neptune : Re-thinking Incident Response Automation

Brief Intro: Myself & Neptune


•  Architected an incident response automation platform for AWS

•  Founding team at Amazon S3, DynamoDB for 5 years

Strong engineering-heavy team

Page 3: Neptune : Re-thinking Incident Response Automation

Agenda


•  State of incident response automation today

•  Our learnings from building such a platform for AWS/Neptune

•  Best practices

•  Intro to Neptune

•  Examples : Incident response workflows

•  Q/A

Page 4: Neptune : Re-thinking Incident Response Automation

What is Incident Response?


How to handle incidents/outages?

Many more..

Alerts

Page 5: Neptune : Re-thinking Incident Response Automation

Incident response automation is broken today!


Page 6: Neptune : Re-thinking Incident Response Automation


Source : DevOps survey; Victor Ops incident response

95% of Time To Recovery (TTR) is still manual today

Phases: Alert, Troubleshooting (Triage | Investigate | Identify), Resolution, Documentation

73% 10% 5% 12%

Snapshots
•  Graphs & metrics
•  Logs
•  Webpages

Service health checks
•  Internal
•  External

Host/App diagnostics
•  “top”, “df -H” etc.
•  Heap dumps/Stack traces

Runbooks
•  On single/cluster of hosts
•  Any script, any language

Cloud API/CLI actions
•  Start/Stop/Reboot
•  Scale up/down

Root-cause analysis & Audit
•  Heap dumps
•  Logs
•  Graphs

Post-mortem
•  History
•  Diagnostics

Page 7: Neptune : Re-thinking Incident Response Automation

What has changed?


•  Automation, uptime, and agility: #1 priority for businesses
    •  e.g. people can’t imagine Gmail going down

•  #Servers, #VMs, #Containers, #Apps launched are exploding!

•  Maintenance has become a huge burden
    •  13 different tools for managing an app

•  Difficult to track down the root cause: what’s going on, and where
    •  Cloud, dynamic environments => knowledge sharing is a problem

Typical incident takes 1-2 hours to diagnose & fix

Page 8: Neptune : Re-thinking Incident Response Automation

Big companies built custom automation tools internally

FBAR: Facebook Auto Remediation Platform “…It’s doing the work of approximately 200 sys admins…”

“We built one for Amazon Web Services!”

40-60% of alerts get fixed automatically, without human intervention

The rest are diagnosed in minutes instead of hours

Page 9: Neptune : Re-thinking Incident Response Automation

Key takeaways


•  Uptime, automation, and agility are critical drivers for your business

•  Incident response automation gives you:
    •  More uptime, better customer experience
    •  Reduction in MTTR
    •  Happier engineers

Page 10: Neptune : Re-thinking Incident Response Automation

Maturity level of Incident Response Teams

Source: @jpaulreed, @kfinnbraun, DevOps Enterprise Summit

Page 11: Neptune : Re-thinking Incident Response Automation

3 core pieces of incident response platform


Page 12: Neptune : Re-thinking Incident Response Automation

1. Analytics


•  Helps identify the top 20% of alerts causing 80% of the pain
    •  Sorted by frequency and MTTR

•  Capture:
    •  MTTA (mean time to acknowledgement)
    •  MTTR (mean time to resolution)
    •  Frequency of occurrence (number of times a particular alert has occurred)

•  Reporting + Auditing
    •  Audit all activity (both manual + automated)
    •  Leads to data-driven post-mortems
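The metrics on this slide can be computed from a plain incident log. The following is a minimal sketch, not Neptune's implementation: the `Incident` record, the `alert_stats` and `top_pain` helpers, and the choice to rank by frequency times MTTR (total time lost) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; timestamps are epoch seconds."""
    alert_name: str
    triggered_at: float     # when the alert fired
    acknowledged_at: float  # when a person or automation acknowledged it
    resolved_at: float      # when it was resolved


def alert_stats(incidents):
    """Return frequency, MTTA and MTTR (in seconds) for each alert name."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.alert_name].append(inc)
    stats = []
    for name, group in grouped.items():
        stats.append({
            "alert": name,
            "frequency": len(group),
            "mtta": mean(i.acknowledged_at - i.triggered_at for i in group),
            "mttr": mean(i.resolved_at - i.triggered_at for i in group),
        })
    return stats


def top_pain(stats, fraction=0.2):
    """Rank alerts by frequency * MTTR (an assumed 'pain' score) and keep the top fraction."""
    ranked = sorted(stats, key=lambda s: s["frequency"] * s["mttr"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one alert
    return ranked[:keep]
```

Feeding the output of `alert_stats` into `top_pain` gives a short list of alerts worth automating first.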

Page 13: Neptune : Re-thinking Incident Response Automation

2. Context


•  When an alert occurs:
    •  Gather context automatically from 13 different tools
    •  Monitoring tools, logging tools, health checks, dependent services

Use cases:
•  High memory → capture top-10 memory hogs, memory usage graphs
•  High app error rate → capture error rate, latency trends, app logs with 5xx errors
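One way to picture automated context gathering is a registry that maps an alert type to the collectors to run when it fires. This is an illustrative sketch only: `ContextGatherer`, its methods, and the alert dictionary shape are hypothetical names, and the real collectors (graph snapshots, log queries, `top` output) are left to the caller.

```python
class ContextGatherer:
    """Maps an alert type to labeled collector callables (hypothetical design)."""

    def __init__(self):
        self._collectors = {}

    def register(self, alert_type, label, collector):
        """Register a callable that takes the alert dict and returns context data."""
        self._collectors.setdefault(alert_type, []).append((label, collector))

    def gather(self, alert):
        """Run every collector registered for alert["type"].

        A failing collector is recorded rather than raised, so one broken
        tool does not prevent the rest of the context from being captured.
        """
        context = {}
        for label, collector in self._collectors.get(alert["type"], []):
            try:
                context[label] = collector(alert)
            except Exception as exc:
                context[label] = f"collection failed: {exc}"
        return context
```

For the high-memory use case, a caller would register one collector for the top-10 memory hogs and another for the memory usage graph, then attach the result of `gather` to the alert before it reaches the on-call engineer.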

Page 14: Neptune : Re-thinking Incident Response Automation

3. Remediation


•  When an alert occurs:
    •  If it’s a known alert => Run a remediation runbook

•  Use cases:
    •  Process crashed → restart process
    •  Host is unpingable → restart 3 times and escalate if it still fails
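The "restart 3 times and escalate" use case is a bounded retry loop. The sketch below assumes the caller supplies three callables (`restart`, `is_healthy`, `escalate`); those names, the return strings, and the `wait_seconds` parameter are illustrative and not part of any Neptune API.

```python
import time


def remediate_with_escalation(restart, is_healthy, escalate,
                              max_attempts=3, wait_seconds=0.0):
    """Restart up to max_attempts times; escalate to a human if still unhealthy."""
    for attempt in range(1, max_attempts + 1):
        restart()
        if wait_seconds:
            time.sleep(wait_seconds)  # give the host or process time to come back
        if is_healthy():
            return f"recovered after {attempt} restart(s)"
    escalate(f"still failing after {max_attempts} restarts")
    return "escalated"
```

Capping the attempts keeps the automation from looping forever on a fault it cannot fix, and the escalation path ensures a person is paged in that case.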

Page 15: Neptune : Re-thinking Incident Response Automation

Our learnings


•  Automate simple things first

•  Have checks in place to avoid cascading failures
    •  Don’t automatically fix when you don’t know the root cause

•  We started with more focus on remediation, but customers really wanted automated context gathering

•  Customers were not at the maturity level we expected, though they’d like to be

•  Security is of paramount importance
    •  Customers prefer vetted runbooks to running arbitrary scripts

•  Use GitHub or Chef/Puppet recipes for runbooks (code reviewed/vetted)
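One possible "check in place" against cascading failures is a cap on how many automated actions may run within a time window, so a misfiring rule cannot restart an entire fleet. The slides do not say which checks Neptune uses; `RemediationGuard`, its parameters, and the sliding-window approach are assumptions for illustration.

```python
import time
from collections import deque


class RemediationGuard:
    """Allows at most max_actions automated actions per sliding window (hypothetical)."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self._max = max_actions
        self._window = window_seconds
        self._clock = clock      # injectable so the guard can be tested without sleeping
        self._recent = deque()   # timestamps of recently allowed actions

    def allow(self):
        """Return True and record the action if it fits within the limit."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._recent and now - self._recent[0] >= self._window:
            self._recent.popleft()
        if len(self._recent) >= self._max:
            return False         # over the limit: skip automation, escalate instead
        self._recent.append(now)
        return True
```

A runbook executor would call `allow()` before each automated fix and hand the alert to a human whenever it returns `False`.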

Page 16: Neptune : Re-thinking Incident Response Automation


Neptune: Incident Response Automation-as-a-Service

IRA as a Service | Monitoring as a Service | Alerting as a Service

Existing tools just alert somebody, without any context or diagnostics

We provide diagnostics for unknown issues, and for known issues, we fix them automatically

Page 17: Neptune : Re-thinking Incident Response Automation


Deployment Models

•  SaaS Model - available today!
    •  GitHub/vetted runbooks

•  On-premise AWS VPC deployment model – available today!
    •  Enterprise customers

•  On-premise deployment model (roadmap)

Page 18: Neptune : Re-thinking Incident Response Automation

Deep Dive: Architecture


[Architecture diagram. Labels: Event Queue; Policy-based Rule Engine; Action Queue; Neptune Web Service; Dedicated Queue Per customer; Publish action results; Neptune Agent; REST API-based Runbook repo; Custom Tool; Read-only]
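The diagram labels suggest a flow in which events are matched against rules and the matching actions are queued for an agent. The sketch below is a guess at that shape using in-process queues; the rule format, the `match_rule` and `run_rule_engine` functions, and the action names are hypothetical and not taken from Neptune.

```python
import queue

# Hypothetical rules: each maps matching event fields to a runbook action name.
RULES = [
    {"match": {"type": "process_down"}, "action": "restart_process"},
    {"match": {"type": "high_memory"}, "action": "collect_top_memory"},
]


def match_rule(event, rules):
    """Return the first rule whose match fields all equal the event's fields."""
    for rule in rules:
        if all(event.get(key) == value for key, value in rule["match"].items()):
            return rule
    return None


def run_rule_engine(event_queue, action_queue, rules):
    """Drain event_queue, queue one action per matched event, return unmatched events."""
    unmatched = []
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            break
        rule = match_rule(event, rules)
        if rule is None:
            unmatched.append(event)  # unknown alert: leave for a human
        else:
            action_queue.put({"action": rule["action"], "host": event.get("host")})
    return unmatched
```

In a hosted setup the two queues would be separate services, with an agent polling the action queue and publishing results back.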

Page 19: Neptune : Re-thinking Incident Response Automation

Quick Demo


UseCase1: Auto-Remediation
Host-level Alert – high memory
•  Collect top-10 memory hogs
•  Restart the process

UseCase2: Auto-Diagnosis
App-level Alert – high error rate
•  Collect graph snapshots, logs
•  Run script on cluster of machines

Page 20: Neptune : Re-thinking Incident Response Automation


Sample error rate incident today (before Neptune)

Page 21: Neptune : Re-thinking Incident Response Automation


Sample error rate incident (after Neptune)

Page 22: Neptune : Re-thinking Incident Response Automation

You can get started in 10 min

1.  Configure monitoring tool to send alerts to Neptune
2.  Install a light-weight agent on a few servers

Page 23: Neptune : Re-thinking Incident Response Automation

Thanks! Check out our 2-week free trial!

[email protected]


Page 24: Neptune : Re-thinking Incident Response Automation

SaaS Model: Why are we secure?


•  Go-based Agent:
    •  No dependencies (agent code is open source)
    •  Outbound access only
    •  No need to open any inbound firewall ports

•  Agent is lightweight, dumb, and not chatty
    •  Sits idle unless there is something to do
    •  Consumes < 0.01% CPU, 20MB Memory

•  Authentication:
    •  Leverage AWS STS token Auth: use temp credentials, rotate every 4 hours
    •  Neptune API_KEY

Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you

Page 25: Neptune : Re-thinking Incident Response Automation

SaaS Model: You Control Runbooks


•  All Runbooks stay within your firewall

•  Runbooks are version controlled (e.g. GitHub)

•  No one can edit your runbooks
    •  Even the Agent has read-only access to the runbook repository

Refer to On-prem AWS VPC deployment model if SaaS doesn’t work for you