THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin, and Michael I. Jordan UC Berkeley AMPLab CIDR 2015

THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX


Talk Outline

• ML model management today
• Velox system architecture
• Key idea: Split model family
• Prediction serving
• Model management
• Next directions


Catify: Music for Cats


MODELING TASK

Ratings

Songs

Prediction

Data → Training → Model

BERKELEY DATA ANALYTICS STACK (BDAS)

Spark

Spark Streaming

Spark SQL

BlinkDB

GraphX

MLlib

MLbase

Mesos

HDFS, S3, … Tachyon

Hadoop Yarn

Catify: Music for Cats

Pipeline

Tachyon + HDFS

CatID  Song  Score
1      16    2.1
1      14    3.7
3      273   4.2
4      14    1.9

[Figure: Prediction Latency vs. Prediction Error; lower is better on both axes. Marked approach: Online Retraining.]


Pipeline

Tachyon + HDFS

Catify: Music for Cats

Node.js App Server

Apache Web Server

MySQL


Songs

Users

O(users * songs)

Catify: Music for Cats
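To make the O(users * songs) cost of full pre-materialization concrete, here is a back-of-the-envelope sketch. The user count, song count, and 4-byte score size are illustrative assumptions, not figures from the talk.

```python
def prematerialized_entries(num_users: int, num_songs: int) -> int:
    """Number of (user, song) predictions a fully pre-materialized table holds."""
    return num_users * num_songs


def table_size_bytes(num_users: int, num_songs: int, bytes_per_score: int = 4) -> int:
    """Raw storage for the scores alone, ignoring keys and index overhead."""
    return prematerialized_entries(num_users, num_songs) * bytes_per_score


# Hypothetical workload: 1 million cats and 100 thousand songs.
entries = prematerialized_entries(1_000_000, 100_000)
size_gb = table_size_bytes(1_000_000, 100_000) / 1e9
print(f"{entries:.1e} predictions, roughly {size_gb:.0f} GB of raw scores")
```

Every retrained model would require recomputing the whole table, which is where the staleness of this approach comes from.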


Pipeline

Tachyon + HDFS

Node.js App Server

NGINX

MySQL

Training Data

New Model

Catify: Music for Cats

[Figure: Prediction Latency vs. Prediction Error. Marked approaches: Online Retraining and Full pre-materialization.]

What’s wrong?

1. Predictions have either:
   a. High latency, low staleness
   b. Low latency, high staleness
2. Limited optimization of model semantics
3. Ad-hoc lifecycle management


[Figure: Prediction Latency vs. Prediction Error. Marked approaches: Online Retraining, Full pre-materialization, and VELOX.]

VELOX GOALS

1. Low latency and low error predictions
2. Cross-cutting model-specific optimizations
3. Unified system eases operation

Key idea: split model into staleness-insensitive and staleness-sensitive components

BATCH

INCREMENTAL

THE MISSING PIECE

Mesos, Hadoop Yarn

HDFS, S3, …, Tachyon

Spark

Spark Streaming, Spark SQL, GraphX, ML library, BlinkDB, MLbase

Training

Velox: Management + Serving

Model Manager

Prediction Service

PREDICTION SERVICE

1. Implements model serving API
2. Low latency; < 10ms response time
3. “Fuzzy” materialized view of model state

MODEL MANAGER

1. Maintains models via online and batch retraining
2. Stores model catalog, metadata, versioning
3. Contains library of standard models + custom API
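As a rough illustration of what "online retraining" could look like for one user, the sketch below applies a single stochastic-gradient step to that user's weight vector after each new rating. The slides do not state the update rule Velox uses, so plain SGD on squared error is swapped in here as an assumption; `online_update`, the learning rate, and the toy feature vector are all hypothetical.

```python
from typing import Dict, List, Sequence


def predict(weights: Sequence[float], features: Sequence[float]) -> float:
    """Predicted rating: dot product of user weights and song features."""
    return sum(w * f for w, f in zip(weights, features))


def online_update(
    weights: Sequence[float],
    features: Sequence[float],
    observed_rating: float,
    learning_rate: float = 0.05,
) -> List[float]:
    """One SGD step on 0.5 * (prediction - rating)^2 with respect to the weights."""
    error = predict(weights, features) - observed_rating
    return [w - learning_rate * error * f for w, f in zip(weights, features)]


# Hypothetical state: one user, one newly observed rating for one song.
user_weights: Dict[str, List[float]] = {"cat_22": [0.0, 0.0]}
song_features = [1.0, 2.0]  # assumed output of the shared feature model

error_before = abs(predict(user_weights["cat_22"], song_features) - 5.0)
user_weights["cat_22"] = online_update(
    user_weights["cat_22"], song_features, observed_rating=5.0, learning_rate=0.1
)
error_after = abs(predict(user_weights["cat_22"], song_features) - 5.0)
```

Only the per-user weights change in this step; the shared features stay fixed until the next batch retraining.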


PERSONALIZED MODELING

A Separate Model for Each User?

Computationally inefficient: many complex models

Statistically inefficient: not enough data per user

[Figure labels: Input (Song), Rating, Split]

PERSONALIZED MODELING

Input (Song)

Shared Basis Feature Model
• Trained across users
• Changes slowly

Personalized User Model
• Trained for each user
• Changes quickly

SPLIT MODEL FORMULATION

[Figure: Input (Song), “Meow” → Shared Basis Feature Model → Personalized User Model → “Terrible”]

MATHEMATICAL FORMULATION

Input (Song): $x$

Shared Basis Feature Models: $f(x; \theta)$ (changes slowly)

Personalized User Model: $w_u$ (highly dynamic)

$f(x; \theta) \cdot w_u = \text{Rating}$ (e.g., “Terrible”)
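The formulation above can be sketched in a few lines: a shared feature function plays the role of f(x; θ), a per-user weight vector plays the role of w_u, and the rating is their dot product. The feature function, the song fields, and the weights below are made up for illustration and are not from the talk.

```python
from typing import Callable, Dict, List, Mapping, Sequence

Song = Mapping[str, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def make_split_model(
    feature_fn: Callable[[Song], List[float]],
    user_weights: Dict[str, List[float]],
) -> Callable[[str, Song], float]:
    """Combine a shared feature function with per-user weights."""

    def predict_rating(user_id: str, song: Song) -> float:
        # rating = f(x; theta) . w_u
        return dot(feature_fn(song), user_weights[user_id])

    return predict_rating


def toy_features(song: Song) -> List[float]:
    """Hypothetical shared basis: normalized tempo and a 'meow' indicator."""
    return [song["tempo"] / 200.0, song["has_meow"]]


model = make_split_model(toy_features, {"cat_22": [2.0, 3.0]})
rating = model("cat_22", {"tempo": 100.0, "has_meow": 1.0})
```

Swapping in a new `user_weights` entry changes one user's predictions without touching the shared feature function, which is the point of the split.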


System Architecture

Mesos, Hadoop Yarn

HDFS, S3, …, Tachyon

Spark

Spark Streaming, Shark SQL, GraphX, ML library, BlinkDB, MLbase

Training

Velox: Management + Serving

Model Manager

Prediction Service

PREDICTION API

Simple point queries:

GET /velox/catify/predict?userid=22&song=27632

More complex ordering queries:

GET /velox/catify/predict_top_k?userid=22&k=100
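A client could assemble the two request shapes above as follows. This is a minimal sketch that only builds URLs and sends nothing; the host name `velox.example.com` and the helper names are placeholders, while the paths and query parameters are the ones shown on the slide.

```python
from urllib.parse import urlencode, urlunsplit

VELOX_HOST = "velox.example.com"  # hypothetical host, not from the talk


def predict_url(model: str, user_id: int, song_id: int) -> str:
    """URL for a simple point query: one user, one song."""
    query = urlencode({"userid": user_id, "song": song_id})
    return urlunsplit(("http", VELOX_HOST, f"/velox/{model}/predict", query, ""))


def predict_top_k_url(model: str, user_id: int, k: int) -> str:
    """URL for an ordering query: the k best items for one user."""
    query = urlencode({"userid": user_id, "k": k})
    return urlunsplit(("http", VELOX_HOST, f"/velox/{model}/predict_top_k", query, ""))


point_query = predict_url("catify", 22, 27632)
top_k_query = predict_top_k_url("catify", 22, 100)
```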

PREDICTIONS

def predict( u: UUID, x: Context )

$w_u \cdot f(x; \theta)$

Look up user weight ($w_u$):
• Primary key lookup
• Partition query by user: always local

Compute features ($f(x; \theta)$, user independent):
• Feature computation could be costly
• Cache features for reuse across users
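The two steps on these slides, a keyed lookup of the user weight and a cached, user-independent feature computation, might be combined as in the sketch below. The in-memory dictionary, the `lru_cache` policy, and the toy feature function are assumptions made for illustration; the slides do not describe Velox's storage or cache implementation.

```python
from functools import lru_cache
from typing import Dict, Tuple

# Hypothetical per-user weights, keyed by user id (the "primary key lookup").
USER_WEIGHTS: Dict[str, Tuple[float, ...]] = {
    "cat_22": (0.5, 2.0),
    "cat_7": (1.0, 0.0),
}

FEATURE_CALLS = 0  # counts how often the feature function actually runs


@lru_cache(maxsize=10_000)
def features(song_id: int) -> Tuple[float, ...]:
    """Stand-in for f(x; theta): depends only on the song, never on the user."""
    global FEATURE_CALLS
    FEATURE_CALLS += 1
    return (float(song_id % 7), 1.0)


def predict(user_id: str, song_id: int) -> float:
    """w_u . f(x; theta), with the feature vector served from cache when possible."""
    w_u = USER_WEIGHTS[user_id]
    f_x = features(song_id)
    return sum(w * f for w, f in zip(w_u, f_x))


first = predict("cat_22", 27632)  # computes and caches the song's features
second = predict("cat_7", 27632)  # different user, same song: cache hit
```

Because the features do not depend on the user, the second call reuses the cached vector and only pays for the weight lookup and the dot product.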

FEATURE CACHING GAINS

Feature caching leads to an order-of-magnitude reduction in latency.

[Figure: latency without caching vs. with caching enabled]

Page 110: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

TOP-K QUERIES

Query predicate to pre-filter candidate set:

All Songs → Playlist Keywords → Candidate Songs

Score and rank all candidates

By exploiting split model design we can leverage:

A. Shrivastava, P. Li. “Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS).” NIPS’14 Best Paper

Y. Low and A. X. Zheng. “Fast Top-K Similarity Queries Via Matrix Compression.” CIKM 2012
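The pre-filter, score, and rank steps can be sketched as below. This is a brute-force illustration under assumed types (`Song`, `topK`, and the keyword predicate are hypothetical); it does not implement the ALSH or matrix-compression techniques cited above, which aim to avoid scoring every candidate.

```scala
object TopK {
  // Hypothetical song record: an id, playlist keywords, and cached features f(x; θ).
  final case class Song(id: Int, keywords: Set[String], features: Vector[Double])

  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // 1. Pre-filter all songs with the query predicate.
  // 2. Score each remaining candidate as w_u · f(x; θ).
  // 3. Rank by descending score and keep the first k.
  def topK(
      userWeights: Vector[Double],
      songs: Seq[Song],
      predicate: Song => Boolean,
      k: Int
  ): Seq[(Int, Double)] =
    songs
      .filter(predicate)
      .map(s => (s.id, dot(userWeights, s.features)))
      .sortBy { case (_, score) => -score }
      .take(k)
}
```

Since every score is an inner product between a user vector and an item vector, the ranking step is a maximum inner product search, which is the setting the two cited papers address.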

Page 112: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

Talk Outline

• ML model management today
• Velox system architecture
• Key idea: Split model family
• Prediction serving
• Model management
• Next directions

Page 114: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

System Architecture

[Diagram: BDAS stack showing Mesos and Hadoop Yarn; HDFS, S3, … and Tachyon; Spark; Spark Streaming, Shark SQL, GraphX, ML library, BlinkDB, and MLbase; and Velox with its Model Manager and Prediction Service. Labels: Training; Management + Serving]

1. Online and offline model training
2. Sample bias problem

Page 118: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

FEEDBACK API

Simple direct value feedback:

POST /velox/catify/observe?userid=22&song=27&score=3.7

Online Learning: continuously update user models in Velox

Offline Learning: logged to DFS for feature learning in Spark

Evaluation: continuously assess model performance

Page 122: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

ONLINE LEARNING

def observe(u: UUID, x: Context, y: Score)

w_u · f(x; θ)

Update w_u with new training point:

• Stochastic gradient descent
• Incremental linear algebra
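One way to read the stochastic gradient descent option is a single gradient step on the squared error of the new observation. The sketch below assumes a squared-error loss and a fixed `learningRate`; both are illustrative assumptions, and the names `OnlineUpdate` and `sgdStep` are hypothetical rather than Velox's API.

```scala
object OnlineUpdate {
  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (p, q) => p * q }.sum

  // One SGD step on the loss 0.5 * (w · f - y)^2 for a single observation.
  // w: current user weights w_u; f: features f(x; θ), held fixed; y: observed score.
  // The gradient with respect to w is (w · f - y) * f.
  def sgdStep(
      w: Vector[Double],
      f: Vector[Double],
      y: Double,
      learningRate: Double
  ): Vector[Double] = {
    val error = dot(w, f) - y
    w.zip(f).map { case (wi, fi) => wi - learningRate * error * fi }
  }
}
```

Only w_u changes in this step; the feature parameters θ are left untouched, which is what keeps the per-observation update cheap.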

Page 126: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

OFFLINE LEARNING

def retrain(trainingData: RDD)

w_u · f(x; θ)

Efficient batch training using Spark-based training algorithms

When do we retrain?

• Periodically
• Triggered by the evaluation system
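The two retraining triggers can be combined in a small policy check. This sketch is only an assumption about how such a check might look: `State`, `shouldRetrain`, `periodMillis`, and `errorThreshold` are hypothetical names, and the call into a Spark batch job is deliberately left out.

```scala
object RetrainPolicy {
  // Hypothetical bookkeeping: when the last batch retrain finished, and the
  // most recent error reported by the evaluation system.
  final case class State(lastRetrainMillis: Long, recentError: Double)

  // Retrain when either trigger fires:
  //  - periodically: the retrain period has elapsed, or
  //  - evaluation: the measured error exceeds a threshold.
  def shouldRetrain(
      state: State,
      nowMillis: Long,
      periodMillis: Long,
      errorThreshold: Double
  ): Boolean = {
    val periodElapsed = nowMillis - state.lastRetrainMillis >= periodMillis
    val modelDegraded = state.recentError > errorThreshold
    periodElapsed || modelDegraded
  }
}
```

When the check returns true, a caller would submit the logged observations to the batch job behind `retrain(trainingData)`.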

Page 127: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

PREDICTION LATENCY WITH ONLINE TRAINING

Velox keeps models updated at low latency

[Figure: Velox compared with Spark]

Page 129: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

Data Model

Sample Bias: model affects the training data.

Page 131: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

ALWAYS SERVE THE BEST SONG?

[Figure: Predicted Rating vs. Songs]

Page 137: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

VELOX SOLUTION

[Figure: Predicted Rating vs. Songs]

Epsilon Greedy:

• With prob. 1 - ϵ, serve the best predicted song
• With prob. ϵ, pick a random song

Active Learning

Opportunity to explore new systems for this emerging analytics workload
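The epsilon-greedy rule can be sketched in a few lines. The function name `choose` and the representation of predictions as `(songId, predictedRating)` pairs are assumptions made for illustration, not part of the Velox API.

```scala
import scala.util.Random

object EpsilonGreedy {
  // predicted: non-empty sequence of (songId, predictedRating) pairs.
  // With probability epsilon, explore by returning a uniformly random song;
  // otherwise exploit by returning the song with the highest predicted rating.
  def choose(predicted: Seq[(Int, Double)], epsilon: Double, rng: Random): Int =
    if (rng.nextDouble() < epsilon) predicted(rng.nextInt(predicted.length))._1
    else predicted.maxBy(_._2)._1
}
```

The random picks generate feedback on songs the current model would never serve, which counters the sample bias in which the model shapes its own training data.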

Page 139: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

Talk Outline

• ML model management today
• Velox system architecture
• Split model family
• Prediction serving
• Model management
• Next directions

Page 148: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

OPEN CHALLENGES FOR DATABASE SYSTEMS

• Going beyond the split model family
  • Logical model pipeline language
• More generic training pipelines
  • Standard set of physical operators
• Automatically choose split for online & offline training
  • View maintenance and query optimization
• Ensure user privacy
  • Privacy-preserving DBMS

Page 154: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

The future of research in scalable learning systems will be in the integration of the learning lifecycle:

[Diagram: learning lifecycle linking Data, Training, Model, Serving, Predictions, and Feedback]

Page 155: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

THE MISSING PIECE

[Diagram: BDAS stack showing Mesos and Hadoop Yarn; HDFS, S3, … and Tachyon; Spark; Spark Streaming, Spark SQL, GraphX, ML library, BlinkDB, and MLbase; and Velox with its Model Manager and Prediction Service. Labels: Training; Management + Serving]

Page 156: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

PredictionLatency

Prediction Error

Online Retraining e.g.,

Full pre-materialization e.g., VELOX

Key idea: split model into staleness-insensitive and staleness-sensitive components

INCREMENTAL

BATCH

Page 161: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

SUMMARY

Today: model training and serving rely on ad hoc, manual processes spread across multiple systems

The Velox system automatically maintains multiple models while providing low-latency, scalable, and personalized predictions

Velox is coming soon as part of BDAS

https://amplab.cs.berkeley.edu/projects/velox/

Page 162: THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW ...jegonzal/assets/slides/velox-cidr... · BERKELEY DATA ANALYTICS STACK (BDAS) Spark Spark Streaming Spark SQL BlinkDB GraphX

QUESTIONS?