A Deep Dive into Nagios Analytics Alexis Lê-Quôc (@alq) http://datadoghq.com


Performance metrics + Nagios traffic + other sources + Datadog in the cloud = real time graphs + analytics


Page 1: Deep dive into Nagios analytics

A Deep Dive into Nagios Analytics

Alexis Lê-Quôc (@alq)
http://datadoghq.com

Page 3: Deep dive into Nagios analytics

@alq

Dev & Ops

Nagios user since 2008

Datadog co-founder

Page 4: Deep dive into Nagios analytics

A little survey

Page 5: Deep dive into Nagios analytics

Top 3 failed checks

Page 6: Deep dive into Nagios analytics

Top 3 failed checks

That I responded to last week

That woke me up

That most of my team responded to at least once

That impact our business the most?

That I responded to 5 weeks ago

Page 8: Deep dive into Nagios analytics

Using memory to prioritize remediation...

At best, finding local optima

At worst, Brownian motion

Page 9: Deep dive into Nagios analytics

Analytics

Page 10: Deep dive into Nagios analytics

Performance Metrics + Nagios Traffic + Other Sources

In the “Cloud”

Real-time graphs + analytics

Page 11: Deep dive into Nagios analytics

Aggregation

Page 12: Deep dive into Nagios analytics

Real-time Analytics (Nagios et al.)

Page 13: Deep dive into Nagios analytics

Real-time Analytics

Page 14: Deep dive into Nagios analytics

Nagios Traffic

In the “Cloud”

Real-time graphs + analytics

Page 15: Deep dive into Nagios analytics

Nagios: a “chatty” source, one of the 40+ that Datadog supports

Page 16: Deep dive into Nagios analytics

One example

Page 17: Deep dive into Nagios analytics
Page 18: Deep dive into Nagios analytics

Almost 13,000 Nagios “events” over the past week

Page 19: Deep dive into Nagios analytics

Constant stream

Page 20: Deep dive into Nagios analytics

86 notifications!
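Counts like these can be pulled straight from a Nagios log file. The sketch below is a minimal illustration, not the tooling used for the talk: it assumes the standard `[epoch] KIND: field;field;...` log line layout, and the function names and sample lines are hypothetical.

```python
import re
from collections import Counter

# Standard Nagios log line layout: "[epoch] KIND: semicolon-separated fields"
LINE_RE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")

# Kinds of lines this sketch counts; everything else is ignored.
TRACKED_KINDS = {
    "SERVICE ALERT",
    "HOST ALERT",
    "SERVICE NOTIFICATION",
    "HOST NOTIFICATION",
}


def parse_line(line):
    """Return (timestamp, kind, fields), or None for lines we do not track."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp, kind, rest = match.groups()
    if kind not in TRACKED_KINDS:
        return None
    return int(timestamp), kind, rest.split(";")


def count_kinds(lines):
    """Count tracked log lines per kind (alerts vs. notifications)."""
    parsed = (parse_line(line) for line in lines)
    return Counter(event[1] for event in parsed if event is not None)


# Hypothetical sample lines, made up for illustration.
SAMPLE_LOG = [
    "[1349049600] SERVICE ALERT: web01;HTTP;CRITICAL;SOFT;1;Connection refused",
    "[1349049660] SERVICE ALERT: web01;HTTP;CRITICAL;HARD;3;Connection refused",
    "[1349049661] SERVICE NOTIFICATION: oncall;web01;HTTP;CRITICAL;notify-by-email;Connection refused",
    "[1349049700] Auto-save of retention data completed successfully.",
]

if __name__ == "__main__":
    for kind, count in count_kinds(SAMPLE_LOG).items():
        print(kind, count)
```

Pointing `count_kinds` at an open log file instead of `SAMPLE_LOG` gives the alert and notification totals for whatever period that file covers.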

Page 21: Deep dive into Nagios analytics

Pattern

Page 22: Deep dive into Nagios analytics

Pattern

Page 23: Deep dive into Nagios analytics

More data? More questions.

Page 24: Deep dive into Nagios analytics

A dialog with data

Not a scientific study

Page 25: Deep dive into Nagios analytics

[Figure: Nagios samples. x-axis: Host count (0 to 750); y-axis: Population (0 to 6); color: factor(quartile), 1 to 4]

Population

 25%   50%   75%  100%
  20    93   322   904
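To split samples by quartile as the later slides do, each Nagios sample needs a quartile label derived from its host count. A minimal sketch, assuming the four figures above are the upper bounds of each quartile and that bounds are inclusive (both are assumptions; the function name is hypothetical):

```python
from bisect import bisect_left

# Upper bound of each host-count quartile, as read off the slide
# (25%, 50%, 75%, 100%). Treating them as inclusive is an assumption.
QUARTILE_UPPER_BOUNDS = [20, 93, 322, 904]


def quartile_of(host_count):
    """Return 1..4: the first quartile whose upper bound is >= host_count."""
    index = bisect_left(QUARTILE_UPPER_BOUNDS, host_count)
    # Anything above the largest bound is clamped into quartile 4.
    return min(index, len(QUARTILE_UPPER_BOUNDS) - 1) + 1


if __name__ == "__main__":
    for hosts in (5, 20, 21, 300, 904):
        print(hosts, "hosts -> quartile", quartile_of(hosts))
```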

Page 26: Deep dive into Nagios analytics

Does size matter?

Page 27: Deep dive into Nagios analytics

[Figure: Weekly count per host split by quartile. x-axis: Nagios alert per host (0 to 1000); y-axis: count per week (0 to 40); one panel per quartile (1 to 4)]

Page 28: Deep dive into Nagios analytics

[Figure: Weekly count per host split by quartile. x-axis: Nagios alert per host (0 to 1000); y-axis: count per week (0 to 40); one panel per quartile (1 to 4)]

Outliers: sick hosts, silenced checks

Page 29: Deep dive into Nagios analytics

Notifications

Page 30: Deep dive into Nagios analytics

Notifications

1-3% of alerts notify

Little difference per quartile

Page 31: Deep dive into Nagios analytics

Does time of day matter?

Page 32: Deep dive into Nagios analytics

[Figure: Alerts per hour by hour of day. x-axis: Hour of Day (UTC), 0 to 20; y-axis: Alerts per hour (4 to 12); one panel per quartile (1 to 4)]

Page 33: Deep dive into Nagios analytics

[Figure: Alerts per hour by hour of day. x-axis: Hour of Day (UTC), 0 to 20; y-axis: Alerts per hour (4 to 12); one panel per quartile (1 to 4)]

Mean about the same across quartiles

Time-based deviation?
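The hour-of-day question comes down to bucketing alert timestamps by UTC hour and comparing the mean and spread of the buckets. A minimal sketch, assuming alerts are available as Unix timestamps (the names and sample values are hypothetical, not data from the talk):

```python
from collections import Counter
from datetime import datetime, timezone
from statistics import mean, pstdev


def alerts_per_hour(timestamps):
    """Count alerts for each UTC hour of day (0..23), including empty hours."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [counts[hour] for hour in range(24)]


# Hypothetical sample: offsets from a timestamp that falls on midnight UTC
# (1349049600 is an exact multiple of 86400).
MIDNIGHT_UTC = 1349049600
SAMPLE = [
    MIDNIGHT_UTC + 30,            # hour 0
    MIDNIGHT_UTC + 3600,          # hour 1
    MIDNIGHT_UTC + 3700,          # hour 1
    MIDNIGHT_UTC + 86400 + 3600,  # hour 1, next day
    MIDNIGHT_UTC + 13 * 3600,     # hour 13
]

hourly = alerts_per_hour(SAMPLE)

if __name__ == "__main__":
    print("per hour:", hourly)
    print("mean:", mean(hourly), "std dev:", pstdev(hourly))
```

A standard deviation that is large relative to the mean would suggest that time of day does matter for a given sample.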

Page 34: Deep dive into Nagios analytics

Does the day of week matter?

Page 35: Deep dive into Nagios analytics

[Figure: Notifying Alerts per Day. x-axis: Day of week (Sun to Sat); y-axis: Alerts per hour (0 to 40); one panel per quartile (1 to 4)]

Page 36: Deep dive into Nagios analytics

[Figure: Notifying Alerts per Day. x-axis: Day of week (Sun to Sat); y-axis: Alerts per hour (0 to 40); one panel per quartile (1 to 4)]

Not really

Page 37: Deep dive into Nagios analytics

Squeaky wheels? (checks)

Page 38: Deep dive into Nagios analytics

[Figure: Noisiest checks (overall). x-axis: Checks ranked by noise (0 to 250); y-axis: Alerts per hour (0 to 30); one panel per quartile (1 to 4)]

Outlier

Page 39: Deep dive into Nagios analytics

[Figure: Noisiest checks (outlier). x-axis: Checks ranked by noise (0 to 40); y-axis: Alerts per hour (0 to 30)]

Outlier in more detail

Page 40: Deep dive into Nagios analytics

[Figure: Noisiest checks (without outlier). x-axis: Checks ranked by noise (0 to 200); y-axis: Alerts per hour (0 to 8); one panel per quartile (1 to 4)]

Long Tail
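Ranking checks by noise is a counting exercise once alerts are reduced to (host, check) pairs. The sketch below is an illustration under that assumption; the function names and sample data are hypothetical.

```python
from collections import Counter


def rank_by_noise(alerts):
    """alerts: iterable of (host, check) pairs.

    Returns [(check, alert_count), ...] sorted noisiest first.
    """
    return Counter(check for _host, check in alerts).most_common()


def top_share(ranked):
    """Fraction of all alerts produced by the single noisiest entry."""
    total = sum(count for _name, count in ranked)
    return ranked[0][1] / total if total else 0.0


# Hypothetical sample: one check dominates, the rest form a tail.
SAMPLE_ALERTS = (
    [("db01", "disk")] * 6
    + [("web01", "http")] * 2
    + [("web02", "load"), ("web03", "ntp")]
)

ranked = rank_by_noise(SAMPLE_ALERTS)

if __name__ == "__main__":
    for check, count in ranked:
        print(check, count)
    print("share of noisiest check:", top_share(ranked))
```

Swapping `check` for `host` in the counter gives the equivalent ranking of hosts.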

Page 41: Deep dive into Nagios analytics

Squeaky wheel? (hosts)

Page 42: Deep dive into Nagios analytics

[Figure: Noisiest hosts (overall). x-axis: Hosts ranked by noise (0 to 200); y-axis: Alerts per hour (0 to 30); one panel per quartile (1 to 4)]

Same outlier

Page 43: Deep dive into Nagios analytics

[Figure: Noisiest hosts (outlier). x-axis: Hosts ranked by noise (0 to 60); y-axis: Alerts per hour (0 to 30); panel: quartile 3]

Similar pattern as checks

Page 44: Deep dive into Nagios analytics

[Figure: Noisiest checks (without outlier). x-axis: Checks ranked by noise (0 to 200); y-axis: Alerts per hour (0 to 8); one panel per quartile (1 to 4)]

Long Tail

Page 45: Deep dive into Nagios analytics

Recurring alerts

Page 46: Deep dive into Nagios analytics

[Figure: Alert age & frequency of occurrence. x-axis: Age between earliest and latest occurrence (0 to 300); y-axis: Number of days occurring (0 to 150); color: factor(quartile), 1 to 4]

x-axis annotation: Young to Old

y-axis annotation: Seldom happens to Happens often

Page 47: Deep dive into Nagios analytics

[Figure: Alert age & frequency of occurrence. x-axis: Age between earliest and latest occurrence (0 to 300); y-axis: Number of days occurring (0 to 150); color: factor(quartile), 1 to 4]

Happen once in a while

Occur often, for a long time

Tolerated
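The two axes of this chart can be derived per alert from timestamps alone: age is the span between first and last occurrence, frequency is the number of distinct days on which it fired. A minimal sketch, assuming each alert is identified by its (host, check) pair and that days are counted in UTC (both are assumptions; the names and sample data are hypothetical):

```python
from collections import defaultdict
from datetime import datetime, timezone


def age_and_frequency(events):
    """events: iterable of (unix_timestamp, host, check).

    Returns {(host, check): (age_in_days, days_occurring)} where age is the
    number of days between the earliest and latest occurrence.
    """
    days_seen = defaultdict(set)
    for ts, host, check in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        days_seen[(host, check)].add(day)
    return {
        key: ((max(days) - min(days)).days, len(days))
        for key, days in days_seen.items()
    }


DAY = 86400
START = 1349049600  # hypothetical start time, falls on midnight UTC

# Hypothetical sample: one alert seen on two days ten days apart,
# another seen only once.
SAMPLE_EVENTS = [
    (START, "web01", "http"),
    (START + 10 * DAY, "web01", "http"),
    (START + 10 * DAY + 60, "web01", "http"),
    (START, "db01", "disk"),
]

stats = age_and_frequency(SAMPLE_EVENTS)

if __name__ == "__main__":
    for key, (age, days) in stats.items():
        print(key, "age:", age, "days occurring:", days)
```

Alerts with both a large age and a large day count are the ones the slide labels as tolerated.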

Page 48: Deep dive into Nagios analytics

More data? More questions.

Page 49: Deep dive into Nagios analytics

HOWTO?

Page 50: Deep dive into Nagios analytics

Find out tomorrow!

Awk

Postgres

R

d3

ggplot2

Page 51: Deep dive into Nagios analytics

Presentation matters

Page 52: Deep dive into Nagios analytics
Page 53: Deep dive into Nagios analytics

Take-away?

Page 54: Deep dive into Nagios analytics

Take-aways

• Don’t rely on your memory to prioritize

• Your Nagios logs are a treasure trove

• Have a dialog with your data

• Presentation matters

Page 55: Deep dive into Nagios analytics

http://dtdg.co/nagios2012

Curious about Datadog?

Like cute logos?