Show Me the Problem: Our Insights Journey at Netflix


Suudhan Rangarajan Senior Software Engineer, Playback Features

@suudhan

On Feb 26th...

Our Insights Journey

Before Elasticsearch (ES)

Today: Elasticsearch (ES) + Kibana

The Future: Taking Insights to the Next Level

Motivation

Why is Insights a critical part of our Service?

DVDs and IFO Files

VIDEO_TS

VIDEO_TS.VOB, VIDEO_TS.BUP, VIDEO_TS.IFO

PLAYBACK CONTEXT (Tracks + Track Urls)

Our Service

NETFLIX OPENCONNECT CDN URLS

Many Many Dimensions

PLAYBACKCONTEXT

COUNTRY

USER PREFERENCES

TITLEMETADATA

DEVICE

NETWORK

Tens of Millions of custom DVDs

Errors Happen

Before ES

Distributed Grep

Log-enabled Clusters

Instances with Verbose Logging

Diagnostics REST Endpoints

Before ES

Incident To Resolution Time

10 min of Detection

2+ Hours of Analysis

5 min of Resolution

An Incident Review

Before ES

Hours and even Days of Debugging time

High Incident-To-Resolution Time

No Big picture View

No insights into QoE

Our Insights Journey

Before ES
● Hours and even days of debugging time
● High incident-to-resolution time
● No big-picture view
● No insights into QoE

Today: ES + Kibana

The Future: Taking Insights to the Next Level

What are Elasticsearch & Kibana?

Now: ES + Kibana

Log Essential Data for all Requests

Find Specific Request Fast

Interactive Exploration

Top N queries
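As a rough illustration of "Find Specific Request Fast" and "Top N queries", the sketch below builds two Elasticsearch request bodies as plain Python dictionaries. The field names (`request_id`, `error_code`, `@timestamp`) and the helper names are assumptions made for this example; the talk does not show the actual index mapping or queries.

```python
from typing import Any, Dict


def find_request(request_id: str) -> Dict[str, Any]:
    """Exact-match lookup of one request (hypothetical field name)."""
    return {"query": {"term": {"request_id": request_id}}}


def top_n(field_name: str, n: int = 10, window: str = "now-1h") -> Dict[str, Any]:
    """Top-N values of a field over a recent time window via a terms aggregation."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"range": {"@timestamp": {"gte": window}}},
        "aggs": {"top_values": {"terms": {"field": field_name, "size": n}}},
    }


if __name__ == "__main__":
    print(find_request("req-123"))
    print(top_n("error_code", n=5))
```

Either body could be sent to the search endpoint of whichever index holds the request logs; Kibana offers the same lookups interactively.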

Our Insights Philosophy

Keep It Simple, Stupid

INPUTS → DECISIONS → OUTPUTS

Just Log Essential Data

For every feature, generate insights:
● Input parameters
● Decision factors
● Results
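One way to picture "inputs, decisions, results" is as a single structured log document per request and feature. The sketch below is only an illustration: the class name `InsightEvent` and every field name and value are hypothetical, not taken from the Netflix service.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class InsightEvent:
    """One log document per request and feature (all names are hypothetical)."""
    request_id: str
    feature: str
    inputs: Dict[str, Any] = field(default_factory=dict)     # input parameters
    decisions: Dict[str, Any] = field(default_factory=dict)  # decision factors
    results: Dict[str, Any] = field(default_factory=dict)    # what was returned

    def to_json(self) -> str:
        """Serialize to a JSON string suitable for indexing as one document."""
        return json.dumps(asdict(self), sort_keys=True)


event = InsightEvent(
    request_id="req-123",
    feature="audio_selection",
    inputs={"country": "BR", "device": "ios", "preferredAudio": "pt"},
    decisions={"defaultAudioRule": "user_preference"},
    results={"audioTrack": "pt"},
)

if __name__ == "__main__":
    print(event.to_json())
```

Keeping the three groups separate means a later query can filter on an input, aggregate on a decision, and count a result without parsing free-form log lines.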

Micro-analytics

Key Observation

This customer has a problem playing this title

Our device partner is not able to test the new HEVC encodes - we don’t seem to be returning those streams

Our latest iOS client always plays Spanish audio by default.

Macro-analytics

Key Observation

How many requests had a max Video Resolution of 720p?

Are we returning Chinese audio for this title in all these countries?

With this feature roll-out, how many unique customers are impacted? Should we roll-back or fix-forward?
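Macro-analytics questions like these map naturally onto count queries and the cardinality aggregation. The sketch below builds two such request bodies; the field names (`results.maxVideoResolution`, `decisions.newFeatureEnabled`, `customer_id`) are assumptions for illustration only.

```python
from typing import Any, Dict


def requests_with_max_resolution(resolution: str) -> Dict[str, Any]:
    """Body for a count request: how many requests had this max resolution?"""
    return {"query": {"term": {"results.maxVideoResolution": resolution}}}


def unique_customers_impacted(feature_flag_field: str) -> Dict[str, Any]:
    """Approximate number of distinct customers who hit the new feature."""
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {"term": {feature_flag_field: True}},
        "aggs": {
            "unique_customers": {"cardinality": {"field": "customer_id"}}
        },
    }


if __name__ == "__main__":
    print(requests_with_max_resolution("720p"))
    print(unique_customers_impacted("decisions.newFeatureEnabled"))
```

Note that the cardinality aggregation returns an approximate distinct count, which is usually sufficient for a roll-back versus fix-forward decision.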

A Success Story

Incident To Resolution Time

Before ES: 10 min of Detection, 2+ Hours of Analysis, 5 min of Resolution

Today (ES + Kibana): 10 min of Detection, 10 min of Analysis, 5 min of Resolution

With ES and Kibana

Fast Root-Cause Analysis (Minutes and Seconds)

Quick Incident Resolution

Macro-Analytics

Still Manual

Our Insights Journey

Before ES
● Hours and even days of debugging time
● High incident-to-resolution time
● No big-picture view
● No insights into Quality of Experience

Today: ES + Kibana
● Fast root-cause analysis (minutes and seconds)
● Quick incident resolution
● Macro-analytics
● Still manual

The Future: Automated RCA

Today’s problems

When Developers are focused on Innovation and Creative Problem Solving, a context-switch becomes very costly

Automated Root Cause Analysis

Taking Insights Further

The Runbook

Look up trends in Kibana

Identify dimensions causing the issue

Figure out resolution options

Identifying Repetition

Alert fires

Show me the Problem

Alert fires

Send out Resolution options

Awesome Service for Prod Incident REsolution (ASPIRE)

Don’t Repeat Yourself

ASPIRE Workflow

ES Aggregations FTW

1. Start an ES Query with the Alert Dimension

2. Combine it with a Significant Terms Aggregation on Error Codes

3. Cardinality Aggregation on Top Dimensions → Sort on % distinctness

4. Terms or Cardinality Aggregation on specific Sub-Dimensions → Sort on % distinctness

5. Collect all results and email
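The five workflow steps above could be sketched roughly as follows. This is not the ASPIRE implementation, which the slides do not show; it only builds Elasticsearch request bodies as Python dictionaries and formats a plain-text report. The field names (`error_code`, `@timestamp`, `title_id`, and the dimension names) and all function names are assumptions for illustration.

```python
from typing import Any, Dict, List


def alert_filter(alert_field: str, alert_value: str, window: str) -> List[Dict[str, Any]]:
    """Step 1: restrict the search to the alerting dimension and a recent window."""
    return [
        {"term": {alert_field: alert_value}},
        {"range": {"@timestamp": {"gte": window}}},
    ]


def build_rca_query(alert_field: str, alert_value: str,
                    top_dimensions: List[str], window: str = "now-15m") -> Dict[str, Any]:
    """Steps 2 and 3: significant error codes, with a distinct count per top dimension."""
    per_dimension = {
        f"distinct_{dim}": {"cardinality": {"field": dim}} for dim in top_dimensions
    }
    return {
        "size": 0,
        "query": {"bool": {"filter": alert_filter(alert_field, alert_value, window)}},
        "aggs": {
            "significant_errors": {
                "significant_terms": {"field": "error_code"},
                "aggs": per_dimension,
            }
        },
    }


def build_sub_dimension_query(alert_field: str, alert_value: str, error_code: str,
                              sub_dimensions: List[str], window: str = "now-15m") -> Dict[str, Any]:
    """Step 4: for one significant error, list the most common sub-dimension values."""
    filters = alert_filter(alert_field, alert_value, window)
    filters.append({"term": {"error_code": error_code}})
    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {dim: {"terms": {"field": dim, "size": 5}} for dim in sub_dimensions},
    }


def format_report(alert: str, findings: Dict[str, List[str]]) -> str:
    """Step 5: collect the findings into a plain-text body that could be emailed."""
    lines = [f"Alert: {alert}"]
    for heading, values in findings.items():
        lines.append(f"{heading}: {', '.join(values)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rca_query("title_id", "12345", ["country", "device"]))
    print(format_report("Title Alert", {"Significant errors": ["MATURITY_ERROR"],
                                        "Top dimensions": ["country: US, BR"]}))
```

Because the significant terms aggregation compares the filtered documents against the rest of the index, it tends to surface error codes that are unusually frequent for the alerting dimension rather than merely the most common ones.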

The 80% Use-case

Title Alert

Maturity Error is statistically significant

Top Dimensions [countries: US, BR]

Sub Dimensions [titleMaturityLevel: TV-Y7, customerMaturityLevel: Age<=6]
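The slides do not define "% distinctness". One possible reading, sketched below purely as an assumption, is the share of a dimension's distinct values that appear among the failing requests: a low percentage suggests the error is concentrated in a few values (for example, two countries), which makes that dimension a good candidate to report. The function name and the sample counts are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_distinctness(failing: Dict[str, int],
                         overall: Dict[str, int]) -> List[Tuple[str, float]]:
    """Rank dimensions by (distinct values in failing requests / distinct values overall).

    Both inputs are distinct-value counts per dimension, e.g. taken from
    cardinality aggregations. Dimensions with no overall values are skipped.
    """
    ranked = []
    for dimension, failing_count in failing.items():
        total = overall.get(dimension, 0)
        if total == 0:
            continue
        ranked.append((dimension, 100.0 * failing_count / total))
    # Most concentrated (lowest percentage) first.
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    failing = {"country": 2, "device": 150, "requestType": 3}
    overall = {"country": 190, "device": 300, "requestType": 3}
    for dimension, pct in rank_by_distinctness(failing, overall):
        print(f"{dimension}: {pct:.1f}% of distinct values affected")
```

With the made-up counts above, `country` ranks first because only 2 of 190 countries are affected, matching the shape of the "Top Dimensions [countries: US, BR]" result.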

Title Alert

What Caused the Alert (Error scenario)?

Is it Specific to a Country, a Device or RequestType?

What Changed to cause the Alert?

A Complex Use-case

Device Alert

“All Video tracks are filtered out” is one statistically significant error

Sub Dimensions:
● Available Video Tracks
● Filtrations on Video Tracks

Unexpected Exception is another statistically significant error

Sub Dimensions:
● Exception Stack Trace
● Server instances

Incident To Resolution Time

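Before ES: 10 min of Detection, 2+ Hours of Analysis, 5 min of Resolution

Today (ES + Kibana): 10 min of Detection, 10 min of Analysis, 5 min of Resolution

Tomorrow (ASPIRE): 2 min of Detection, 2 min of Analysis, 5 min of Resolution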
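For a sense of scale, the three timelines can be totalled as below. The phase durations come from the slides; treating "2+ Hours" as exactly 120 minutes is an assumption, so the "Before ES" total is a lower bound.

```python
# Minutes per phase: (detection, analysis, resolution).
# "2+ Hours" of analysis is taken as 120 minutes, a lower bound.
timelines = {
    "Before ES": (10, 120, 5),
    "Today: ES + Kibana": (10, 10, 5),
    "Tomorrow: ASPIRE": (2, 2, 5),
}

totals = {name: sum(phases) for name, phases in timelines.items()}

if __name__ == "__main__":
    for name, minutes in totals.items():
        print(f"{name}: {minutes} min incident-to-resolution")
```

The totals come to at least 135 minutes before ES, 25 minutes with ES and Kibana, and 9 minutes with ASPIRE, with nearly all of the savings coming from the analysis phase.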

ASPIRE

Automated and Scalable RCA

Cut the Slow Middle-Man

Increased Developer Productivity

Our Insights Journey

Before ES
● Hours and even days of debugging time
● High incident-to-resolution time
● No big-picture view

Today: ES + Kibana
● Fast root-cause analysis (minutes and seconds)
● Quick incident resolution
● Still manual

The Future: Automated RCA
● Scalable RCA
● Cut the slow middle-man
● Increased developer productivity

Big Takeaways

Invest in a micro-&-macro analytics tool for your service

Empower your runbook automation with ES aggregations

Discussion

What’s your Insights Story?

Parting Thought

Imagine you are deeply engaged in designing the next big thing for your team. Production pages start firing, but the problems are analyzed and routed to the teams who can fix them. You can focus on your deep thinking while the machines take care of themselves.
