Uber’s Data Science Workbench

Randy Wei Peng Du

Mission

Unleash the productivity of the Data Science community at Uber by providing scalable infrastructure, tools, customization and support.

Tools of the Trade: Jupyter Notebooks

● Alternative to traditional CLIs
● Interactive tool which combines
  ○ Prose (HTML, Markdown)
  ○ Code (Py, R, Scala)
  ○ Visualization (charts, maps, tables)
● Shareable artifact of knowledge
● Hosted webapp
● Notebook, Notes, Cells
  ○ Each cell is an executable block of code
● Used for
  ○ Data exploration, Cleansing, Modeling
  ○ Dashboarding/reporting

[Screenshot: a notebook with callouts labeled HTML, Code, and Output]
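To make the "Notebook, Notes, Cells" idea concrete, here is a minimal sketch of a notebook as a shareable artifact. It builds the JSON document that Jupyter stores on disk (nbformat version 4) with one prose cell and one code cell. The helper names and the notebook content are hypothetical, chosen only for illustration; they are not from the talk.

```python
import json


def markdown_cell(text):
    """Return a prose cell in the nbformat v4 layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Return a code cell that has not been executed yet (no outputs)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_notebook(cells):
    """Wrap a list of cells in the top-level notebook structure."""
    return {
        "cells": cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


# Hypothetical content: a heading cell followed by a small computation.
notebook = build_notebook([
    markdown_cell("# Trip duration exploration\nHypothetical example notebook."),
    code_cell("durations = [12, 7, 31]\nsum(durations) / len(durations)"),
])

# A notebook is plain JSON, which is what makes it easy to store and share.
serialized = json.dumps(notebook, indent=2)
```

Saving `serialized` to a file with an `.ipynb` extension should give a document that a Jupyter server can open; outputs are filled in only after the code cell is run.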

Tools of the Trade: RStudio Server

● Browser interface to a remote R server
  ○ Centrally manage compute infrastructure
● IDE for R
  ○ Syntax highlighting, code completion
  ○ Debugging
  ○ Charts
  ○ File Browser
● RStudio also has Notebook functionality
● R has a huge library repository
● Used mostly for rapid prototyping of models on small datasets (UbeR)

[Screenshot: RStudio Server with callouts labeled Data, Code, and Output]

Tools of the Trade: Apache Spark

SparkR

● Distributed statistical computing framework
● Run R code without translating it to Java
● Choice of Intelligent Decision, Insurance, etc. teams

PySpark

● Distributed machine learning framework
● Easy to integrate with scientific Python libraries
● Choice of Fraud Detection, Sensing and Perception, etc. teams

Requirements

Productivity
● Py, R, Scala interpreters in Jupyter
● Hosted RStudio support
● Version Control
● Custom libraries/environment
● Single-pane lifecycle management
● PySpark, SparkR

Scale
● Scalable Jupyter Server infra.
● Large dist. computation backend
● Multitenancy
● File Persistence
● Security

Ecosystem Integration
● Scheduling: Piper
● Dashboards: Shiny
● Data Exploration: Query engine API
● Deploy: Machine learning platform
● Chargeback: Monitoring platform

Social
● Knowledge
● Search
● Access Controls
● Sharing Controls
● Publish
● Comments & Discussion

State of the Union

Problem

● Data Scientists (DSs) start at Uber with diverse skillsets and backgrounds

● Precious time wasted in infra. setup, version control, search, sharing...

● Teams are building their own solutions

Vision

● Web-based hub for all Data Scientists at Uber

● Ability to centrally:
  ○ provision tools
  ○ leverage dist. backend
  ○ search, comment, share
  ○ monitor

● Integrated with Uber’s data ecosystem

● Dedicated SRE

Opportunity

● Find and reuse knowledge

● Opportunity for a dedicated team to advocate for and build the tools needed to make DSs hyper-productive

● Cloud experience

● Chargeback

Similar offerings...

Architecture

[Architecture diagram with three layers labeled Manage, Mesos, and Spark]

● Manage: Management Service (Create, Delete, Search, Share, Publish, Schedule)
● Mesos: RStudio (Docker) and Jupyter (Docker) instances on Uber Mesos infra, with a shared file system
● Spark: MLlib, PySpark, and SparkR workers, plus the Uber Spark debugging toolkit and the Uber Spark development toolkit
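The management service operations listed on this slide can be sketched as a small in-memory registry. This is only an illustration of how create, delete, search, share, and publish might interact with ownership and visibility; the class, its method signatures, and the visibility rule are assumptions, not Uber's implementation. Scheduling is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Notebook:
    name: str
    owner: str
    shared_with: set = field(default_factory=set)
    published: bool = False


class ManagementService:
    """Hypothetical in-memory stand-in for the management service."""

    def __init__(self):
        self._notebooks = {}

    def _owned(self, name, user):
        # Assumed rule: only the owner may change or remove a notebook.
        notebook = self._notebooks[name]
        if notebook.owner != user:
            raise PermissionError(f"{user!r} does not own {name!r}")
        return notebook

    def create(self, name, owner):
        if name in self._notebooks:
            raise ValueError(f"notebook {name!r} already exists")
        self._notebooks[name] = Notebook(name, owner)
        return self._notebooks[name]

    def delete(self, name, owner):
        self._owned(name, owner)
        del self._notebooks[name]

    def share(self, name, owner, with_user):
        self._owned(name, owner).shared_with.add(with_user)

    def publish(self, name, owner):
        self._owned(name, owner).published = True

    def search(self, term, user):
        """Return sorted names of matching notebooks the user may see."""
        term = term.lower()
        return sorted(
            nb.name
            for nb in self._notebooks.values()
            if term in nb.name.lower() and self._visible(nb, user)
        )

    @staticmethod
    def _visible(notebook, user):
        # Assumed rule: published, owned, or explicitly shared.
        return (
            notebook.published
            or notebook.owner == user
            or user in notebook.shared_with
        )


# Hypothetical users and notebook names.
service = ManagementService()
service.create("fraud-features", owner="alice")
service.create("fraud-scratch", owner="alice")
service.share("fraud-features", owner="alice", with_user="bob")
```

With this state, a search by `bob` for "fraud" finds only the shared notebook, while `alice` sees both; publishing a notebook makes it visible to everyone.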

Architecture

[Architecture diagram]

● Web GUI
● Application Management Service: session / file management, proxy
● Mesos Cluster: Docker containers running RStudio Server and Jupyter Server (notebooks NB1, NB2)
● Hadoop Cluster (Hive, Presto, Spark): distributed processing
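The session management and proxy roles of the Application Management Service can be illustrated with a short sketch. It keeps one backend per (user, tool) pair, starts it on first use, and rewrites a request path to that backend's address. The URL layout, the host name, the port range, and the launcher callback are all hypothetical; a real deployment would ask Mesos to start a Docker container instead of calling a fake launcher.

```python
import itertools


class SessionManager:
    """Hypothetical session table: one backend per (user, tool) pair."""

    def __init__(self, launcher):
        # launcher is a callable (user, tool) -> "host:port".
        self._launcher = launcher
        self._sessions = {}

    def get_or_start(self, user, tool):
        """Return the backend for this session, starting it if needed."""
        key = (user, tool)
        if key not in self._sessions:
            self._sessions[key] = self._launcher(user, tool)
        return self._sessions[key]

    def stop(self, user, tool):
        """Forget a session; return its backend, or None if absent."""
        return self._sessions.pop((user, tool), None)

    def route(self, path):
        """Map a path such as /jupyter/alice/tree to a backend URL."""
        _, tool, user, *rest = path.split("/")
        backend = self.get_or_start(user, tool)
        return f"http://{backend}/" + "/".join(rest)


def fake_launcher():
    """Stand-in for container startup: hands out increasing ports."""
    ports = itertools.count(31000)

    def launch(user, tool):
        return f"mesos-agent.example:{next(ports)}"

    return launch


sessions = SessionManager(fake_launcher())
first = sessions.route("/jupyter/alice/tree")
second = sessions.route("/jupyter/alice/notebooks/NB1.ipynb")
other = sessions.route("/rstudio/alice/")
```

Both Jupyter requests from the same user resolve to the same backend, while the RStudio request starts a separate one, which mirrors the per-tool containers in the diagram.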

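Uber Ecosystem

[Diagram relating the Data Science Workbench to the rest of the Uber ecosystem. Recoverable labels: Data Science Workbench, Uber ML platform, Palette, Hive, Cassandra, HDFS, Spark (Spark SDK, Spark Debug tool, Spark templates), Query Runner, Data Visualization, PySpark for ML, Models, Production.]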

Workflow Demo

Q&A