Traveloka Data Meetup v1.0.0 How to Feed a Data Hungry Organization

How to Feed a Data Hungry Organization – by Traveloka Data Team

Page 1: How to Feed a Data Hungry Organization – by Traveloka Data Team

Traveloka Data

Meetup v1.0.0

How to Feed a Data Hungry Organization

Page 2: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part One

Traveloka Data Culture

Page 3: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Five Characteristics of Data Hungry Organization

Driven Decision

Learn from Mistakes

Better Understanding

Uncertainty and Variation

High Quality Data

Data Hungry Organization

Page 4: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Our responsibility is to turn data into consumable insights

Data Team -> Better Business Decision

Page 5: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

We need the brightest people to fill our needs and create the future

Mathematics

Business

Programming

Skills

Page 6: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Some of the skills in mathematics

Mathematics

Optimization

Decision Theory

Statistics

Differential Equations

Time Series

Page 7: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Some of the skills in business

Business

Strategy

Finance

Economics

Page 8: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Some of the skills in programming

Programming

Data Wrangling

Modelling

Big Data

Page 9: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

This is how we structure our team

Data Team

Data Governance

Machine Learning Engineering

Data Analysis

Data Science

Data Engineering

Page 10: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

Houston, we have a problem.

DW: Tens of Terabytes, Hundreds of ETLs

Kafka: Hundreds of Topics, Millions of Messages per Hour, Hundreds of Megabytes per Second

S3: Hundreds of Terabytes

Redshift: Tens of Thousands of Queries Daily

DOMO: Thousands of Cards, Hundreds of Users

PeriscopeData: Thousands of Dashboards, Hundreds of Users

Page 11: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 1: Traveloka Data Culture

We need state of the art technology to feed data hungry people

Ingestion: Gobblin

Data Lake: AWS S3

Batch Processing: Spark, Airflow, Hadoop2, Python, Java App

Data Warehouse: Redshift, MongoDB, PostgreSQL

Datahub: Pubsub, Kafka

Stream Processing: DataFlow, MemSQL Pipeline

Near Real Time DW: GCP BigQuery, MemSQL

Real Time DB: AWS DynamoDB

Ingestion, Processing, Storage, Presentation

Source DB: Mongo, PostgreSQL

App / Services: Java App

Analytics Tools: PeriscopeData, Spark, R, Domo, Dataiku, Holistics, Keboola

ML Tools, Library, and Services: Jupyter, Zeppelin, Caffe, DataDog, TensorFlow, Cloud Vision API

Query Engine: Qubole, Presto, Hive

Page 12: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part Two

Data Engineering

Page 13: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Fast Food,

Or…?

Page 14: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

MINDSETS

Managed service for focus

So we could focus more on the use cases

Page 16: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Real Time Pipeline

5 min data delivery SLA. Real latency ~ 10s

100 ms query SLA. Real latency ~ 10ms (p95)

Key value data, query by service/app

Autoscale - Self service for each engineering team. We provide governance, guidance, building blocks, and consultation
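The slide quotes a 100 ms query SLA against a p95 latency, but does not say how the percentile is computed. As a minimal sketch, assuming the nearest-rank definition of a percentile, a check of latency samples against that SLA could look like this (the function names and sample values are hypothetical):

```python
def percentile_nearest_rank(samples, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("need at least one latency sample")
    # Integer form of ceil(percent * n / 100), avoiding float rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_sla(latencies_ms, sla_ms=100, percent=95):
    """True when the chosen percentile of the samples is within the SLA."""
    return percentile_nearest_rank(latencies_ms, percent) <= sla_ms


# Hypothetical samples: 95 fast queries and 5 slow ones still pass at p95.
samples_ms = [10] * 95 + [500] * 5
print(percentile_nearest_rank(samples_ms, 95), meets_sla(samples_ms))
```

Measuring at p95 rather than at the maximum means a handful of slow queries does not count as a breach, while a sixth slow query out of a hundred would.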

Page 17: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Real Time Pipeline

Page 18: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Near Real Time Pipeline

Raw data, query by BI Tools

5 min data delivery SLA. Real latency ~ 5s

Using YAML for schema definition (built and defined by ourselves)

Self service for data analysts, with guidance and governance!

Page 19: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Near Real Time Pipeline

Page 20: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Near Real Time Pipeline

But MemSQL is not a managed service; it is on EC2.

It is easy to scale, but not autoscale yet.

So we are moving to… v2!!

Currently in usability testing by analysts.

Self service, of course!

Page 21: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Near Real Time Pipeline

Page 22: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Analytical Pipeline

Heavy data processing, query by BI Tools

6 hour data delivery SLA

Page 23: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Analytical Pipeline

Interesting features:

• Custom dev/prod environment, for self service!

• Custom framework, on top of Spark

• Custom airflow, separated queue for backfill

• EMR autoscale for backfill

• Redshift microbatch bulk load

• etc...

Page 24: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 2: Data Engineering

Summary

Page 25: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part Three

Data Science in Traveloka

Page 26: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Three Things to Discuss Today

Data Science Purpose

Tools of the Trade

Model Evaluations and Applications

Page 28: How to Feed a Data Hungry Organization – by Traveloka Data Team

Novia is 25 years old. She is single, outspoken, and mathematically gifted. As a student, she was deeply interested in calculus and statistics, and also participated in the International Mathematical Olympiad.

a. Novia is a data scientist

b. Novia is a data scientist and is active as a Mathematical Olympiad tutor
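The two options follow the pattern of the classic conjunction-fallacy question. As a short worked note, write A for "Novia is a data scientist" and B for "Novia is active as a Mathematical Olympiad tutor" (symbols introduced here, not on the slide). The product rule then gives:

```latex
P(A \cap B) = P(A)\,P(B \mid A) \le P(A), \qquad \text{since } 0 \le P(B \mid A) \le 1 .
```

So option (b) can never be more probable than option (a), however well the description seems to fit a tutor.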

Part 3: Data Science in Traveloka

Page 29: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Consider a regular six-sided die with four green faces and two red faces. The die will be rolled 20 times and the sequence of greens (G) and reds (R) will be recorded. Choose one sequence from a set of three. Which one is the most likely outcome?

RGRRR

GRGRRR

GRRRRR
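The question can be checked exactly. The sketch below assumes "outcome" means the chosen sequence appearing on consecutive rolls somewhere within the 20 rolls; the slide does not spell this out. It computes the probability of each sequence over a fixed window, and the probability that it appears at least once in 20 rolls. The function names are illustrative.

```python
from fractions import Fraction

# Four green faces and two red faces on a six-sided die.
FACE_PROBABILITY = {"G": Fraction(4, 6), "R": Fraction(2, 6)}


def sequence_probability(sequence):
    """Probability that len(sequence) consecutive rolls produce exactly this sequence."""
    probability = Fraction(1)
    for face in sequence:
        probability *= FACE_PROBABILITY[face]
    return probability


def appearance_probability(pattern, rolls=20):
    """Probability that pattern appears on consecutive rolls at least once."""
    length = len(pattern)

    def next_state(matched, face):
        # Longest prefix of pattern that is a suffix of the matched text plus face.
        text = pattern[:matched] + face
        for size in range(min(length, len(text)), -1, -1):
            if text.endswith(pattern[:size]):
                return size

    # state[k]: probability that the last k rolls match pattern[:k], pattern not yet seen.
    state = [Fraction(0)] * length
    state[0] = Fraction(1)
    seen = Fraction(0)
    for _ in range(rolls):
        updated = [Fraction(0)] * length
        for matched, probability in enumerate(state):
            if probability == 0:
                continue
            for face, face_probability in FACE_PROBABILITY.items():
                target = next_state(matched, face)
                if target == length:
                    seen += probability * face_probability
                else:
                    updated[target] += probability * face_probability
        state = updated
    return seen


for candidate in ("RGRRR", "GRGRRR", "GRRRRR"):
    print(candidate, sequence_probability(candidate), float(appearance_probability(candidate)))
```

GRGRRR contains RGRRR, so every run of rolls that shows the second sequence also shows the first. The first sequence is therefore at least as likely, even though the second looks more representative of a mostly green die.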

Page 30: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Page 31: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Page 32: How to Feed a Data Hungry Organization – by Traveloka Data Team

Remember This:

The goal of a data science exercise is to help us make a good business decision

Logic

Alternatives

Information

Preferences

Part 3: Data Science in Traveloka

Page 33: How to Feed a Data Hungry Organization – by Traveloka Data Team

“if they learn nothing else about decision analysis from their studies, distinction between outcome and decisions will have been worth the price of admission”

Ron Howard, Professor at Stanford University

Father of Decision Analysis

Part 3: Data Science in Traveloka

Decisions vs. Outcome

                Good decision                                  Bad decision
Good outcome    Took a taxi and arrived safely                 Drove home and arrived safely
Bad outcome     Took a taxi and was involved in an accident    Drove home and was involved in an accident
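One way to judge a decision separately from its outcome is to score it before the outcome is known, using the elements listed earlier (alternatives, information, preferences, logic). The sketch below does this with expected utility. All probabilities, costs, and utilities are hypothetical numbers chosen for illustration; none come from the slides.

```python
# Hypothetical inputs for illustration only.
ACCIDENT_PROBABILITY = {"take a taxi": 0.001, "drive home": 0.05}  # information
DECISION_COST = {"take a taxi": -1.0, "drive home": 0.0}           # preferences
OUTCOME_UTILITY = {"arrive safely": 0.0, "accident": -1000.0}      # preferences


def expected_utility(decision):
    """Logic: weight each outcome's utility by its probability, plus the cost of deciding."""
    p_accident = ACCIDENT_PROBABILITY[decision]
    return (
        DECISION_COST[decision]
        + p_accident * OUTCOME_UTILITY["accident"]
        + (1 - p_accident) * OUTCOME_UTILITY["arrive safely"]
    )


# Alternatives: compare every available decision before any outcome is known.
best_decision = max(ACCIDENT_PROBABILITY, key=expected_utility)
for decision in ACCIDENT_PROBABILITY:
    print(decision, expected_utility(decision))
print("best decision:", best_decision)
```

Under these assumed numbers, taking a taxi remains the better decision even in the rare case where the ride ends in an accident.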

Page 34: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Three Things to Discuss Today

Data Science Purpose

Tools of the Trade

Model Evaluations and Applications

Page 35: How to Feed a Data Hungry Organization – by Traveloka Data Team

Data Science Framework: CRISP-DM

Business

Data

Data Prep

Model

Evaluation

Deployment

Common Sense

Part 3: Data Science in Traveloka

Page 36: How to Feed a Data Hungry Organization – by Traveloka Data Team

“Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world” - Atul Butte, Stanford

We use open source libraries for data science

Wrangling

• data.table

• dplyr

• sparkR

• sparklyr

• pandas

• pyspark

Visualization

• ggplot

• matplotlib

• seaborn

• shiny

Statistics

• R

• JAGS

• STAN

• Python

• Julia

Machine Learning

• scikit-learn

• caret

• e1071

• fbprophet

Part 3: Data Science in Traveloka

Page 37: How to Feed a Data Hungry Organization – by Traveloka Data Team

Are we using the algorithm? Or being used by it?

Classification

Linear Models

Naïve Bayes Classifier

Support Vector Classifier

Vowpal Wabbit Classifier

Random Forest

Decision Trees

Neural Network

Extreme Gradient Boosted Trees

Many more algos!

Prediction

Linear Models

Nystroem Regressor

Support Vector Regressor

Vowpal Wabbit Regressor

Random Forest

Decision Trees

Neural Network

Extreme Gradient Boosted Trees

More Algos!

• Scikit-learn

• Caret

• TensorFlow

• …

Part 3: Data Science in Traveloka

Page 38: How to Feed a Data Hungry Organization – by Traveloka Data Team

We need more than just off the shelf libraries to feed data hungry people

Bayesian Network

Markov Chain Monte Carlo
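To give a sense of what going beyond off-the-shelf libraries can involve, here is a minimal random-walk Metropolis sampler, one of the simplest Markov Chain Monte Carlo methods. It is a generic textbook sketch, not the team's implementation. The target density (a standard normal), the step size, and the function names are assumptions made for the example.

```python
import math
import random


def metropolis(log_density, start, steps, step_size, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional unnormalized log density."""
    rng = random.Random(seed)
    current = start
    current_log_p = log_density(current)
    samples = []
    for _ in range(steps):
        candidate = current + rng.uniform(-step_size, step_size)
        candidate_log_p = log_density(candidate)
        # Accept with probability min(1, p(candidate) / p(current)).
        if rng.random() < math.exp(min(0.0, candidate_log_p - current_log_p)):
            current, current_log_p = candidate, candidate_log_p
        samples.append(current)
    return samples


def standard_normal_log_density(x):
    """Log density of a standard normal, up to an additive constant."""
    return -0.5 * x * x


draws = metropolis(standard_normal_log_density, start=0.0, steps=20000, step_size=2.0)
mean = sum(draws) / len(draws)
variance = sum((x - mean) ** 2 for x in draws) / len(draws)
print(round(mean, 2), round(variance, 2))
```

The sampler only needs the density up to a constant, which is why the approach suits Bayesian posteriors whose normalizing constant is unknown.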

Part 3: Data Science in Traveloka

Page 39: How to Feed a Data Hungry Organization – by Traveloka Data Team

Part 3: Data Science in Traveloka

Three Things to Discuss Today

Data Science Purpose

Tools of the Trade

Model Evaluations and Applications

Page 40: How to Feed a Data Hungry Organization – by Traveloka Data Team

Model Evaluation: judging the usefulness of your model

Rule #1

Never ever peek at the test set during training/validation

Rule #2

You can never satisfy all the metrics; pick one or two metrics as your decision criteria beforehand

Rule #3

Always do comparative statics on the final model
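Rule #1 is easiest to follow when the test set is carved out once, before any modelling starts. The sketch below shows one way to do that with a three-way split. The split fractions, the seed, and the function name are assumptions, not values from the slides.

```python
import random


def three_way_split(rows, validation_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the rows into disjoint train, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Tune on train and validation only; score on test once, with the final model.
train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

The validation set absorbs all tuning and model comparison, so the metric chosen beforehand under Rule #2 is reported on the test set only once.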

Part 3: Data Science in Traveloka

Page 41: How to Feed a Data Hungry Organization – by Traveloka Data Team

Comparative Statics

commonly used as feature importance analysis
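In practice, comparative statics means changing one input at a time while holding the others fixed, and recording how the model's prediction moves. The sketch below illustrates the idea. The toy model, the feature names, and the baseline values are hypothetical stand-ins for a trained model and real features.

```python
def comparative_statics(predict, baseline, deltas):
    """Change one feature at a time, hold the rest fixed, and record the prediction shift."""
    base_prediction = predict(baseline)
    effects = {}
    for name, delta in deltas.items():
        shifted = dict(baseline)
        shifted[name] = baseline[name] + delta
        effects[name] = predict(shifted) - base_prediction
    return effects


def toy_model(features):
    # Hypothetical stand-in for a trained model's predict function.
    return 2 * features["price_discount"] + 3 * features["search_count"]


baseline = {"price_discount": 10, "search_count": 3}
effects = comparative_statics(toy_model, baseline, {"price_discount": 1, "search_count": 1})
print(effects)
```

For a non-linear model the effects depend on the chosen baseline, so it is worth repeating the exercise at several representative baselines.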

Part 3: Data Science in Traveloka

Page 42: How to Feed a Data Hungry Organization – by Traveloka Data Team

Remember the end goal: decisions

What should we do?

What might happen?

Part 3: Data Science in Traveloka

Page 43: How to Feed a Data Hungry Organization – by Traveloka Data Team

“But in my view, obsessive customer focus is by far the most protective of Day 1 vitality”

Our data is telling us:

• What do they want?

• Do we serve their needs?

• Are they trying to leave us?

Part 3: Data Science in Traveloka

My name is Jeff

Page 44: How to Feed a Data Hungry Organization – by Traveloka Data Team

Thank you!