Balancing Infrastructure with Optimization and Problem Formulation

Sailthru Data Science

How do we think about and practice Data Science?

Talk Outline

Part 1:

● What is Data Science?

● Where should we spend our time as data scientists?

Part 2:

● How we balance infrastructure, optimization and problem formulation at Sailthru.

What is Data Science?!

“Data Science is the extraction of knowledge from data” … Wikipedia

http://drewconway.com/

wikibooks.org

Data Scientists are good at …

These Interpretations Suggest:

Data Scientists are good at structuring problems, and solving for and optimizing them.

So what’s missing here?

zorger.com

The title of this talk mentions...

● problem formulation

● optimization

● infrastructure

Infrastructure

“the basic physical and organizational structures and facilities needed for the operation of a society or enterprise” … Wikipedia

Infrastructure: Often under-appreciated or undervalued by Data Scientists

A Data Scientist’s infrastructure?

Infrastructure: Something we become intimately familiar with

Infrastructure: A mission-critical component of our work!

Components of a Solid Infrastructure

● Lots of Machinery: VMs, Containers

● Machines require coordination, redundancy and fault tolerance: CAP Theorem

● Resource Allocation: Fair Scheduling, Bin Packing

● Control strategies: Auto Scaling, Feedback, PID

● Communication algorithms: Gossip, Paxos, ...

● Configuration: Dynamic Persistence, Namespaces

● Monitoring: Anomaly Detection, Visualization

● Data Storage: Relational, Graph, Key-Value

● SO MANY TOOLS!

So What is Data Science?

Problem Formulation

Infrastructure

Optimization

Central Question

As a data scientist, how do I choose where to spend my time?

As a Data Scientist, ...

...when do I:

○ build infrastructure that supports my ideas

○ optimize my existing models and problems

○ find new problems to work on

Part 2!

Here’s a glimpse of how we tackle these choices at Sailthru.

● Sailthru is a personalization platform.

● We help our clients communicate with their customers.

● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.

Sailthru Sightlines: User Predictions

Sightlines - Example Use Cases

Incentivize users with low chance of purchasing

Personalize discounts above expected order value

Suppress users likely to opt-out of messages

Engage users unlikely to open on other channels

Sightlines - How it Works

Computational Challenges

● Feature Engineering + ML → Tidyjson & GBMs

● Run many dependent jobs at scale → Stolos

● Resource allocator → Mesos + AWS Spot Instances

● Auto Scaler → Relay.Mesos

github.com/sailthru/stolos

STOLOS

What problem does it solve?

A Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.

It creates application queues that can be consumed from in any order.
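To make the idea of a task dependency scheduler concrete, here is a minimal sketch that orders jobs so each one runs only after its parents. It is not Stolos's API: the `DEPENDENCIES` map, the edges in it, and both helper functions are assumptions for illustration, and the stage names are borrowed from the Sightlines stages. It also ignores the multi-graph and queueing aspects.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key lists the jobs it depends on.
# The edges are assumed for illustration, not taken from Stolos configuration.
DEPENDENCIES = {
    "sample": set(),
    "assemble": {"sample"},
    "grid": {"assemble"},
    "build": {"grid"},
    "predict": {"build"},
    "api_push": {"predict"},
}


def run_order(dependencies):
    """Return one valid execution order in which every job follows its parents."""
    return list(TopologicalSorter(dependencies).static_order())


def ready_jobs(dependencies, completed):
    """Return the jobs that have not run yet and whose parents are all completed."""
    return sorted(
        job
        for job, parents in dependencies.items()
        if job not in completed and parents <= completed
    )
```

A scheduler built this way only needs to know which jobs have completed in order to decide what can be queued next, for example `ready_jobs(DEPENDENCIES, {"sample"})`.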

Sightlines - Stolos Pipeline

450 * 20

Each node is a job

Sightlines - Stolos Pipeline

Repeats over time (currently, 1-day periods)

github.com/sailthru/relay

github.com/sailthru/relay.mesos

Relay.Mesos

What problem does it solve?

Relay actively minimizes the difference between a measured signal and a target signal.

Relay.Mesos plugs Relay into a tool called Mesos. → Lets us auto-scale consumers of queued Stolos jobs
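As a rough illustration of minimizing the gap between a measured signal and a target, the sketch below takes one proportional feedback step for a pool of queue consumers. It is not Relay's actual API: `scale_step`, its parameters, and the default gain and limits are all hypothetical, and the measured signal is assumed to be the number of queued jobs.

```python
def scale_step(measured, target, current_workers,
               gain=0.1, min_workers=0, max_workers=50):
    """Return the worker count after one proportional feedback step.

    measured: current value of the signal (assumed here: queued jobs)
    target: the value we want the signal to settle at
    A measurement above target adds workers; one below target removes them.
    """
    error = measured - target
    change = round(gain * error)
    # Clamp so the pool never goes below min_workers or above max_workers.
    return max(min_workers, min(max_workers, current_workers + change))
```

Calling `scale_step` once per polling interval with a fresh measurement gives a simple control loop. The gain trades reaction speed against oscillation.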

FFT Visualization

[Figure: a signal and its FFT with frequency components f1, f2, f3, f4; panels labeled k=0, k=1, k=2, annotated “ai-k =1”]

The PID Algorithm

PV = Process Variable (Signal)

SP = Set Point (Target)

MV = Manipulated Variable (Output)

t = index on timesteps

**The “D” in PID is excluded here

[Equation image not recovered; the only surviving fragment is the derivative term “+ Kd Δ dt”]
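For reference, a textbook discrete-time form of the controller can be written with the symbols defined on this slide (PV, SP, MV, t). This is a sketch under assumptions, not necessarily the equation shown in the talk: the error term e, the gains K_p and K_i, the timestep Δt, and the sign convention are all assumed here. Only K_d appears in the surviving slide text.

```latex
\begin{aligned}
e_t &= SP_t - PV_t \\
MV_t &= K_p\, e_t + K_i \sum_{j=0}^{t} e_j\, \Delta t
  && \text{(PI form, with the ``D'' term excluded)} \\
MV_t^{\mathrm{PID}} &= MV_t + K_d\, \frac{e_t - e_{t-1}}{\Delta t}
  && \text{(with the derivative term added)}
\end{aligned}
```

The proportional term reacts to the current error, the integral term removes steady-state offset, and the derivative term damps rapid changes.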

Sightlines - Relay

Thank You! Our team:

Check out our open sourced tools!

● Tidyjson: github.com/sailthru/tidyjson

● Stolos: github.com/sailthru/stolos

● Relay: github.com/sailthru/relay

● Relay.Mesos: github.com/sailthru/relay.mesos

● Consulconf: github.com/sailthru/consulconf

With more in progress!

Sightlines - On Mesos

[Figure: two panels, each with CPU Units on the horizontal axis and RAM on the vertical axis]

Sightlines - Stages

● Sample & Assemble: Build train and test sets from a sample of data

● Grid: Run Grid Search to identify hyperparameters for the model

● Build: Build the model

● Predict: Generate predictions for all relevant models

● API Push

Sightlines - Pipeline

[Figure: pipeline diagram in which Database sources feed parallel Sample and Assemble jobs, followed by Grid, Build, Predict and API Push]

○ Upper branch: once per (client, day, model)

○ Lower branch: once per (client, day)
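The two branch granularities can be sketched as job-key enumeration: one key per (client, day, model) for the upper branch and one per (client, day) for the lower branch. The function names and the example client, day, and model values are hypothetical and only illustrate how the job count grows with each dimension.

```python
from itertools import product


def upper_branch_jobs(clients, days, models):
    """One job key per (client, day, model) combination."""
    return list(product(clients, days, models))


def lower_branch_jobs(clients, days):
    """One job key per (client, day) combination."""
    return list(product(clients, days))


# Hypothetical inputs, for illustration only.
clients = ["client_a", "client_b"]
days = ["day_1"]
models = ["purchase", "opt_out", "open"]

upper = upper_branch_jobs(clients, days, models)
lower = lower_branch_jobs(clients, days)
```

With these inputs the upper branch has `len(clients) * len(days) * len(models)` jobs and the lower branch has `len(clients) * len(days)`.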
