Cloud Tech III: Actionable Metrics

Presentation for Cloud Tech III on how Netflix thinks about metrics

Actionable Metrics: Enabling Decision-Making in Netflix’s Decentralized Environment

Cloud Tech III, October 6, 2012

Roy Rapoport

@royrapoport, rsr@netflix.com

Thursday, October 18, 12

Me

• Been in tech for about 20 years

• Systems engineering, networking, software development, QA, release management

• Time at Netflix: 1195 days (3y:3m:1w)

• (Current) job at Netflix: Make things better (Security Monkey, Python Platform, Central Alert Gateway, Breaking Stuff ...)

Metrics Humor

% of instances with even public IP addresses

Technology Overview

• SOA, REST, Mostly Java

• Simple overall architecture:

Culture Overview

• Freedom and Responsibility

• Distributed Operations

• Get out of the way of Developers

The Metric Lifecycle

• Send

• Look

• Alert

Systems

• Flexible

• Scalable

• Self-Service

Telemetry

Flexible, Scalable, Self-Service

import netflix.metrics
[...]
        self.nm = netflix.metrics.Metrics("core_cag")
[...]
    def api(self):
        self.nm.nfCounter("api")
        [...]
        self.nm.nfCounter("application_%s" % application)
[...]
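A minimal sketch of what a self-service counter client along these lines might look like. The `netflix.metrics` library is internal and the slide does not show its behavior, so the `Metrics` class below, its in-memory storage, the `amount` parameter, and the `snapshot` helper are all hypothetical stand-ins. Only the `nfCounter` name and the `core_cag` / `application_%s` naming pattern come from the slide; joining prefix and name with an underscore is inferred from the `core_cag_api` metric name used later in the deck.

```python
from collections import Counter


class Metrics:
    """Hypothetical stand-in for the internal netflix.metrics.Metrics class."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._counts = Counter()

    def nfCounter(self, name, amount=1):
        # Method name mirrors the slide; counting in memory is an assumption.
        self._counts["%s_%s" % (self.prefix, name)] += amount

    def snapshot(self):
        # Hypothetical helper: return the current counts as a plain dict.
        return dict(self._counts)


nm = Metrics("core_cag")

# Made-up application names, one counter bump per simulated API call.
for application in ["api_gateway", "api_gateway", "billing"]:
    nm.nfCounter("api")
    nm.nfCounter("application_%s" % application)

print(nm.snapshot())
```

The point of the pattern is that adding a new metric takes one line in the service code and no registration step, which is what makes per-application counters like `application_%s` cheap to create.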

Visualization

Flexible, Scalable, Self-Service

Alerting

Flexible, Scalable, Self-Service

• Static vs Dynamic Thresholds

• Compare to history
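The difference between a static threshold and one that compares against history can be sketched as follows. This is an illustration only: the function names, the three-standard-deviation rule, and the sample numbers are assumptions, not the alerting system described in the talk.

```python
from statistics import mean, stdev


def static_alert(value, threshold):
    """Fire when the value exceeds a fixed, hand-picked limit."""
    return value > threshold


def dynamic_alert(value, history, num_stdevs=3.0):
    """Fire when the value strays too far from its own recent history."""
    if len(history) < 2:
        return False  # not enough history to estimate a spread
    center = mean(history)
    spread = stdev(history)
    return abs(value - center) > num_stdevs * spread


# Made-up samples of a metric that normally hovers around 100.
history = [100, 104, 98, 101, 97, 103, 99, 102]

print(static_alert(130, threshold=500))  # False: far below the fixed limit
print(dynamic_alert(130, history))       # True: unusual for this metric
```

A static limit has to be chosen per metric by someone who knows its normal range, while the history-based check derives that range from the data itself.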

For Example ...

What the ...

Last 3 hours’ core_tools.core_cag_api

For Example ... Visualization (Continued)

Last 4 days’ core_tools.core_cag_api

Even more questions!

For Example ... Visualization (Continued)

Last 10 days’ core_tools.core_cag_api

What caused the spike?

For Example ... Visualization (Continued)

Show alert volume per application

Someone had a rough few days...

Don’t Like Surprises...

{
  "alerts": [
    {
      "applyTo": "cluster",
      "condition": {
        "minPercent": 90.0,
        "noise": 0.2,
        "maxPercent": 25.0,
        "type": "DoubleExponential"
      },
      "metricName": "core_cag_api",
      "severity": "major"
    }
  ],
  "clusters": [
    "core_tools"
  ]
}
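The configuration names a `DoubleExponential` condition, but the slides do not say how it is implemented or what `minPercent`, `maxPercent`, and `noise` mean. As a rough illustration of the general idea only, here is a sketch of Holt's double exponential smoothing used to flag observations that stray from a one-step-ahead forecast. The `alpha`, `beta`, and `max_percent_deviation` parameters are hypothetical and are not mapped to the fields in the configuration above.

```python
def holt_forecasts(series, alpha=0.5, beta=0.3):
    """Return one-step-ahead forecasts for series[2:] using Holt's method."""
    level = series[1]
    trend = series[1] - series[0]
    forecasts = []
    for observed in series[2:]:
        # Forecast first, then fold the new observation into level and trend.
        forecasts.append(level + trend)
        previous_level = level
        level = alpha * observed + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return forecasts


def surprises(series, max_percent_deviation=25.0, **kwargs):
    """Return indexes where the observation strays from its forecast."""
    flagged = []
    for offset, predicted in enumerate(holt_forecasts(series, **kwargs)):
        index = offset + 2  # forecasts start at the third sample
        if predicted == 0:
            continue  # percentage deviation is undefined
        deviation = abs(series[index] - predicted) / abs(predicted) * 100.0
        if deviation > max_percent_deviation:
            flagged.append(index)
    return flagged


# Made-up data: a steady upward trend with one spike at index 5.
calls_per_minute = [100, 102, 104, 106, 108, 200, 112, 114]
print(surprises(calls_per_minute))
```

With these parameters the sketch flags the spike at index 5 and also the sample right after it, because the spike drags the forecast upward before the smoothing recovers. A production system would need some way to damp that effect.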

Threshold Tuning

• An Abbreviated History ...

Threshold Tuning (in the beginning)

• Systems owned by IT

• Want an alert? Submit a ticket

• Want to tune an alert? Submit a ticket

Some priests offer their prayers to alien creatures best left forgotten. This ill-advised worship twists their minds in odd ways. Overlords find these warped men useful due to the unnatural powers they can channel. The dark priests most favored by their strange gods have powerful protections, and defeating one of them is sure to bring down a terrible curse upon the victor.

- http://www.descentinthedark.com/_d_/dark_priests.php

Threshold Tuning (It gets better)

• You get to configure your own thresholds

• Freedom!

• Also, you have to configure your own thresholds

Threshold Tuning (Are we there yet?)

• Play with historical data

• Huge difference

• Still falls short

Threshold Tuning (Yeah, that’s the ticket)

• Computers can be good at this

If Time Allows ...

Events vs Metrics

• Irregular Interval

• Point in time

• Lack magnitude

Why Build It?

• Change management

• Vs Change control

• What Changed?

• Better Alerting

Chronos

• Rapidly Prototyped

• Adapters and reporters

• Easy querying

• Alarming

• Something happened

• ... X times in Y minutes

• Something didn’t happen

• Medium volume

• Recursive

• Recursive
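The two alarm shapes listed for Chronos ("something happened X times in Y minutes" and "something didn’t happen") can be sketched over a list of timestamped events. Chronos's actual API is not shown in the slides, so the `Event` class, both function names, and the event names in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """A point-in-time occurrence: a name and a timestamp, no magnitude."""
    name: str
    at: datetime


def happened_too_often(events, name, times, window, now):
    """True if `name` occurred at least `times` times within `window` of `now`."""
    recent = [e for e in events if e.name == name and now - window <= e.at <= now]
    return len(recent) >= times


def went_quiet(events, name, window, now):
    """True if `name` did not occur at all within `window` of `now`."""
    return not happened_too_often(events, name, 1, window, now)


now = datetime(2012, 10, 6, 12, 0)

# Made-up history: four deploys and one old backup completion.
events = [
    Event("deploy", now - timedelta(minutes=m)) for m in (1, 3, 4, 50)
] + [Event("backup_finished", now - timedelta(hours=30))]

# "Something happened X times in Y minutes": 3 deploys in 5 minutes.
print(happened_too_often(events, "deploy", 3, timedelta(minutes=5), now))

# "Something didn't happen": no backup completion in the last 24 hours.
print(went_quiet(events, "backup_finished", timedelta(hours=24), now))
```

Both checks print `True` for this sample history. Because events carry no magnitude, both alarms reduce to counting occurrences inside a time window.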

End Result

• Massive decrease in change control tickets

• Not talking about SOX or PCI

• Better visibility into changes

• Decreased TTR

• Especially for bad code deployments

• You should do this

I Didn’t Mention

• End-to-end testing and alerting

• External availability and performance

• Open Connect

• Jobs

Questions?
