Cloudera User Group - From the Lab to the Factory

DESCRIPTION

This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) Chicago meeting on 12/3/13 and NYC meeting on 12/5/13.

Page 1: Cloudera User Group - From the Lab to the Factory

From The Lab to the Factory

Building A Production Machine Learning Infrastructure

Josh Wills, Senior Director of Data Science

Cloudera

Page 2: Cloudera User Group - From the Lab to the Factory

One Other Thing About Me

Page 3: Cloudera User Group - From the Lab to the Factory

Data Science: Another Definition

Page 4: Cloudera User Group - From the Lab to the Factory

Data Scientists Build Data Products.

Page 5: Cloudera User Group - From the Lab to the Factory

A Shift In Perspective

Analytics in the Lab

• Question-driven

• Interactive

• Ad-hoc, post-hoc

• Fixed data

• Focus on speed and flexibility

• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven

• Automated

• Systematic

• Fluid data

• Focus on transparency and reliability

• Output is a production system that makes customer-facing decisions

Page 6: Cloudera User Group - From the Lab to the Factory

All* Products Become Data Products

Page 7: Cloudera User Group - From the Lab to the Factory

Identifying the Bottlenecks

Page 8: Cloudera User Group - From the Lab to the Factory

Oryx: Model Building and Serving

• Algorithms

• ALS Recommenders

• K-Means Parallel

• RDF

• Batch model building via MapReduce*

• Server for real-time scoring and updates

• PMML 4.1 Models

Page 9: Cloudera User Group - From the Lab to the Factory

Oryx Design

Page 10: Cloudera User Group - From the Lab to the Factory

Generational Thinking

Page 11: Cloudera User Group - From the Lab to the Factory

The Limits of Our Models

Page 12: Cloudera User Group - From the Lab to the Factory

Space Exploration

Page 13: Cloudera User Group - From the Lab to the Factory

Data Science Needs DevOps

Page 14: Cloudera User Group - From the Lab to the Factory

Introducing Gertrude

• Multivariate Testing

• Define and explore a space of parameters

• Overlapping Experiments

• Tang et al. (2010)

• Runs multiple independent experiments on every request
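
The idea of running multiple independent experiments on every request can be sketched as follows. This is a minimal illustration of layered diversion in the spirit of Tang et al. (2010), not Gertrude's actual implementation; the layer names, experiment names, bucket count, and the `bucket` and `assign` helpers are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed granularity of traffic diversion

# Hypothetical layers. Within a layer, each experiment claims a
# half-open range of buckets, so a request sees at most one of them.
LAYERS = {
    "ranking": [("rank_control", 0, 100), ("rank_new_model", 100, 200)],
    "ui": [("ui_control", 0, 500), ("ui_blue_button", 500, 1000)],
}


def bucket(request_id: str, layer: str) -> int:
    """Hash the request id together with the layer name, so that the
    bucket a request lands in is effectively independent across layers."""
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(request_id: str) -> dict:
    """Return at most one experiment per layer for this request."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(request_id, layer)
        for name, lo, hi in experiments:
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment
```

Because the hash is salted with the layer name, a request's experiment in one layer says nothing about its experiment in another, which is what allows the experiments to overlap on the same traffic.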

Page 15: Cloudera User Group - From the Lab to the Factory

Simple Conditional Logic

• Declare experiment flags in compiled code

• Settings that can vary per request

• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
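
The split between flags declared in code and rules kept in a config file might look roughly like the sketch below. The JSON rule format, the flag names, and the `Flag` and `flag_value` helpers are hypothetical and chosen only for illustration; they are not Gertrude's config syntax or API.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    """A setting declared in code, with a compiled-in default."""
    name: str
    default: object


# Hypothetical flags declared alongside the serving code.
NUM_RESULTS = Flag("num_results", 10)
USE_NEW_RANKER = Flag("use_new_ranker", False)

# Hypothetical config file contents: per-flag rules, checked in order.
CONFIG_TEXT = """
{
  "num_results": [
    {"when": {"device": "mobile"}, "value": 5}
  ],
  "use_new_ranker": [
    {"when": {"country": "US", "device": "desktop"}, "value": true}
  ]
}
"""


def flag_value(flag, request, config):
    """The first rule whose conditions all match the request wins;
    otherwise fall back to the default declared in code."""
    for rule in config.get(flag.name, []):
        if all(request.get(k) == v for k, v in rule["when"].items()):
            return rule["value"]
    return flag.default


config = json.loads(CONFIG_TEXT)
```

With this arrangement, `flag_value(NUM_RESULTS, {"device": "mobile"}, config)` is decided by the config rule, while a request that matches no rule gets the default from the code.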

Page 16: Cloudera User Group - From the Lab to the Factory

Separate Data Push from Code Push

• Validate config files and push updates to servers

• Zookeeper via Curator

• File-based

• Servers pick up new configs, load them, and update experiment space and flag value calculations
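
A rough sketch of the file-based variant of this loop is shown below: the server polls a config file, validates any new version, and keeps serving the last good config if validation fails. The `validate` rules and the `ConfigWatcher` class are assumptions made for this example; the ZooKeeper and Curator path is not shown.

```python
import json
import os


def validate(config):
    """Reject configs that are not a mapping of flag name -> list of rules."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    for name, rules in config.items():
        if not isinstance(rules, list):
            raise ValueError(f"rules for {name!r} must be a list")
        for rule in rules:
            if not isinstance(rule, dict) or "value" not in rule:
                raise ValueError(f"each rule for {name!r} needs a 'value'")


class ConfigWatcher:
    """Reloads a config file when it changes on disk."""

    def __init__(self, path):
        self.path = path
        self.config = {}    # last config that passed validation
        self._stamp = None  # (mtime, size) of the last file version seen

    def poll(self):
        """Return True if a new, valid config was loaded."""
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size)
        if stamp == self._stamp:
            return False
        self._stamp = stamp
        try:
            with open(self.path, encoding="utf-8") as f:
                candidate = json.load(f)
            validate(candidate)
        except (ValueError, OSError):
            return False  # bad push: keep serving the previous config
        self.config = candidate
        return True
```

A server would call `poll()` on a timer and recompute its experiment space and flag rules whenever it returns `True`, so a config push never requires a code deploy.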

Page 17: Cloudera User Group - From the Lab to the Factory

The Experiments Dashboard

Page 18: Cloudera User Group - From the Lab to the Factory

A Few Links I Love

• http://research.google.com/pubs/pub36500.html

• The original paper on the overlapping experiments infrastructure at Google

• http://www.exp-platform.com/

• Collection of all of Microsoft’s papers and presentations on their experimentation platform

• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/

• Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

Page 19: Cloudera User Group - From the Lab to the Factory

Josh Wills, Senior Director of Data Science, Cloudera @josh_wills

Thank you!