Watching Pigs Fly with the Netflix Hadoop Toolkit
Hadoop Summit 2013, San Jose, CA

DESCRIPTION

Frameworks and technologies in the Hadoop ecosystem are undergoing rapid innovation, but the open source tooling around usability has lagged behind. We will present a suite of tools, deployable on top of the Hadoop ecosystem, that enables even non-technical users to develop, tune, and maintain efficient Pig workflows and easily interact with and visualize datasets. Netflix's big data teams have worked for the past year implementing this framework in the AWS cloud. During that time, we have seen a massive influx of data and a corresponding increase in new development on our platform. This toolset has been a critical enabler in minimizing development time and effort. Using the development of a recommendation algorithm as an example, we'll walk through use cases for this stack of tools, showing how they interact to facilitate development. The presentation will include demos, implementation details, and our roadmap to open source various key services in the framework, including RESTful services that: provide comprehensive metadata management across data sources; enable visualization and caching of results of Hadoop jobs; visualize the execution plans produced by languages such as Pig and Hive; and provide detailed analytics on the currently executing workload and trends in historical performance.

Page 1: Watching Pigs Fly with the Netflix Hadoop Toolkit

Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Summit 2013, San Jose, CA

Page 2: Watching Pigs Fly with the Netflix Hadoop Toolkit

Data should be accessible, easy to discover, and easy to process for everyone.

Our Motivation

Page 3: Watching Pigs Fly with the Netflix Hadoop Toolkit

Our Users

Analysts Engineers

Page 4: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Platform as a Service

Page 5: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Platform as a Service

S3

Page 6: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Platform as a Service

Data Platform

Page 7: Watching Pigs Fly with the Netflix Hadoop Toolkit

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad Hoc Visualization)

Forklift (Data Movement)

Looper (Backloading)

Ignite (A/B Test Analytics)

Spock (Data Auditing)

Genie (Hadoop PaaS)

Lipstick (Pig Workflow Visualization)

Event Service (Orchestration)

Hadoop

S3

Other Processing

Page 8: Watching Pigs Fly with the Netflix Hadoop Toolkit

Let’s solve a problem using the data!

Page 9: Watching Pigs Fly with the Netflix Hadoop Toolkit

Build a recommender.

Page 10: Watching Pigs Fly with the Netflix Hadoop Toolkit

But what makes good recommendations?

Similarity

Personalization

Page 11: Watching Pigs Fly with the Netflix Hadoop Toolkit

COLORS!

Page 12: Watching Pigs Fly with the Netflix Hadoop Toolkit

COLORS!

Box art is colorful…

Page 13: Watching Pigs Fly with the Netflix Hadoop Toolkit

We’re Sorry

COLORS!

Box art is colorful…

Page 14: Watching Pigs Fly with the Netflix Hadoop Toolkit

Where can I find the data?

Page 15: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Platform as a Service

S3

Page 16: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Platform as a Service

S3, Cassandra, Teradata, Redshift, RDS

Page 17: Watching Pigs Fly with the Netflix Hadoop Toolkit

Data Platform as a Service

Franklin (Metadata API)

S3, Cassandra, Teradata, Redshift, RDS

Page 18: Watching Pigs Fly with the Netflix Hadoop Toolkit

Data Platform as a Service

Franklin (Metadata API)

Page 19: Watching Pigs Fly with the Netflix Hadoop Toolkit

Create a dataset for box art and color.

Page 20: Watching Pigs Fly with the Netflix Hadoop Toolkit

Whether your dataset is large or small, being able to visualize it makes it easier to explain.

Page 21: Watching Pigs Fly with the Netflix Hadoop Toolkit

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad Hoc Visualization)

Page 22: Watching Pigs Fly with the Netflix Hadoop Toolkit

Sting

• Allows users to cache the results of a Genie job in memory

• Sub-second response to OLAP-style operations (slicing, dicing, aggregations)

• Ad hoc / recurring schedule

• Easy to use!
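
The idea behind the bullets above (keep a job's result rows in memory, then slice and aggregate them without rerunning the job) can be sketched in a few lines of Python. This is only an illustration and not Sting's actual API: the `ResultCache` class, its method names, the dataset name, and the sample rows are all hypothetical.

```python
from collections import defaultdict


class ResultCache:
    """Hypothetical in-memory cache of job result rows (not Sting's API)."""

    def __init__(self):
        self._tables = {}

    def put(self, dataset, rows):
        # Store a copy of the result rows (a list of dicts) under a dataset name.
        self._tables[dataset] = list(rows)

    def slice(self, dataset, **equals):
        # Keep only the rows whose columns match every given value.
        return [row for row in self._tables[dataset]
                if all(row.get(col) == val for col, val in equals.items())]

    def aggregate(self, dataset, group_by, measure, **equals):
        # Sum `measure` per distinct value of `group_by`, after slicing.
        totals = defaultdict(float)
        for row in self.slice(dataset, **equals):
            totals[row[group_by]] += row[measure]
        return dict(totals)


# Made-up rows standing in for the output of a finished job.
cache = ResultCache()
cache.put("hours_by_title", [
    {"title": "title_a", "hour": 20, "hours_viewed": 3.0},
    {"title": "title_a", "hour": 21, "hours_viewed": 2.0},
    {"title": "title_b", "hour": 20, "hours_viewed": 4.0},
])

by_title = cache.aggregate("hours_by_title", "title", "hours_viewed")
by_hour_for_a = cache.aggregate("hours_by_title", "hour", "hours_viewed",
                                title="title_a")
```

Because the rows already sit in memory, each slice or aggregation is a single pass over a small list, which is what makes interactive response times plausible for result sets of modest size.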

Page 23: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hive Query

Schema

Page 24: Watching Pigs Fly with the Netflix Hadoop Toolkit

% Content Consumed / Hour

Page 25: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hemlock Grove

House of Cards

Arrested Development

Page 26: Watching Pigs Fly with the Netflix Hadoop Toolkit

Similarity

Page 27: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 28: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 29: Watching Pigs Fly with the Netflix Hadoop Toolkit

House of Cards

Macbeth

Page 30: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 31: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 32: Watching Pigs Fly with the Netflix Hadoop Toolkit

Toddlers & Tiaras

Star Trek: Voyager

Page 33: Watching Pigs Fly with the Netflix Hadoop Toolkit

Personalization

Page 34: Watching Pigs Fly with the Netflix Hadoop Toolkit

# of subscribers X # of titles = ???,000,…,000 (big data)

Big Data

Page 35: Watching Pigs Fly with the Netflix Hadoop Toolkit

Netflix Apache Pig

Page 36: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 37: Watching Pigs Fly with the Netflix Hadoop Toolkit

Lipstick

Data Platform as a Service

Franklin (Metadata API)

Sting (Ad Hoc Visualization)

Page 38: Watching Pigs Fly with the Netflix Hadoop Toolkit

Lipstick

• Allows users to visualize their data flow

• Allows users to see common errors

• Allows users to easily monitor their jobs

• Empowers users to support themselves

• Facilitates communication between infrastructure team and users

Page 39: Watching Pigs Fly with the Netflix Hadoop Toolkit

Lipstick

Page 40: Watching Pigs Fly with the Netflix Hadoop Toolkit

Overall Job Progress

Page 41: Watching Pigs Fly with the Netflix Hadoop Toolkit

Logical Plan

Overall Job Progress

Page 42: Watching Pigs Fly with the Netflix Hadoop Toolkit

Logical Operator (reduce side)

Logical Operator (map side)

Map/Reduce Job

Intermediate Row Count

Records Loaded

Page 43: Watching Pigs Fly with the Netflix Hadoop Toolkit

Hadoop Counters

Page 44: Watching Pigs Fly with the Netflix Hadoop Toolkit

My job has stalled.

Common Problem #1

Page 45: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 46: Watching Pigs Fly with the Netflix Hadoop Toolkit

Unoptimized/Optimized Logical Plan Toggle

Dangling Operator
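
One way to read "dangling operator" is an operator in the logical plan whose output never reaches a store, so it contributes nothing to the job's result. Under that assumption, spotting such operators is a reachability check on the plan graph. The sketch below is not Lipstick's code: the `find_dangling` function, the plan representation as operator names plus edges, and the example operator names are all hypothetical.

```python
def find_dangling(operators, edges, sinks):
    """Return operators whose output never reaches a sink (e.g. a STORE).

    operators: list of operator names in the plan (hypothetical format)
    edges:     list of (producer, consumer) pairs
    sinks:     operators that write final output
    """
    # Reverse adjacency: for each operator, the operators that feed it.
    parents = {op: [] for op in operators}
    for producer, consumer in edges:
        parents[consumer].append(producer)

    # Walk backwards from the sinks, marking every contributing operator.
    useful = set()
    stack = list(sinks)
    while stack:
        op = stack.pop()
        if op in useful:
            continue
        useful.add(op)
        stack.extend(parents[op])

    # Anything left unmarked produces output that nothing stores.
    return sorted(set(operators) - useful)


# Made-up plan: one branch ends in a store, the other leads nowhere.
operators = ["load_titles", "filter_recent", "group_by_color",
             "store_result", "debug_foreach"]
edges = [
    ("load_titles", "filter_recent"),
    ("filter_recent", "group_by_color"),
    ("group_by_color", "store_result"),
    ("filter_recent", "debug_foreach"),
]

dangling = find_dangling(operators, edges, sinks=["store_result"])
```

Comparing the operators flagged this way against the optimized plan is one plausible use of a toggle between the two views: operators present only in the unoptimized plan stand out immediately.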

Page 47: Watching Pigs Fly with the Netflix Hadoop Toolkit

I didn’t get the data I was expecting

Common Problem #2

Page 48: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 49: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 50: Watching Pigs Fly with the Netflix Hadoop Toolkit

I don’t understand why my job failed.

Common Problem #3

Page 51: Watching Pigs Fly with the Netflix Hadoop Toolkit

Failed Job (light red background)

Successful Job (light blue background)

Page 52: Watching Pigs Fly with the Netflix Hadoop Toolkit
Page 53: Watching Pigs Fly with the Netflix Hadoop Toolkit

Wrapping up

• Demos at the Netflix booth in the exhibit hall (see more Lipstick, Sting, and Genie).

• Lipstick is part of Netflix OSS.

• Clone it on GitHub at http://github.com/Netflix/Lipstick

• We welcome feedback and contributions!

Page 54: Watching Pigs Fly with the Netflix Hadoop Toolkit

Charles Smith

Jeff Magnusson

Thank you!

Jobs: http://jobs.netflix.com

Netflix OSS: http://netflix.github.io

Tech Blog: http://techblog.netflix.com/