How to Feed a Data Hungry Organization – by Traveloka Data Team

Traveloka Data

Meetup v1.0.0

How to Feed a Data Hungry Organization

Part One

Traveloka Data Culture

Part 1: Traveloka Data Culture

Five Characteristics of a Data-Hungry Organization

• Data-driven decisions
• Learn from mistakes
• Better understanding
• Uncertainty and variation
• High-quality data

Our responsibility is to turn data into consumable insights

Data Team → Better Business Decision

We need the brightest people to fill our needs and create the future

Skills: Mathematics, Business, Programming

Some of the skills in mathematics
• Optimization
• Decision Theory
• Statistics
• Differential Equations
• Time Series

Some of the skills in business
• Strategy
• Finance
• Economics

Some of the skills in programming
• Data Wrangling
• Modelling
• Big Data

This is how we structure our team

Data Team
• Data Governance
• Machine Learning Engineering
• Data Analysis
• Data Science
• Data Engineering

Houston, we have a problem.

• DW: tens of terabytes, hundreds of ETLs
• Kafka: hundreds of topics, millions of messages per hour, hundreds of megabytes per second
• S3: hundreds of terabytes
• Redshift: tens of thousands of queries daily
• DOMO: thousands of cards, hundreds of users
• PeriscopeData: thousands of dashboards, hundreds of users

We need state-of-the-art technology to feed data-hungry people

Stages: Ingestion, Processing, Storage, Presentation

• Ingestion: Gobblin
• Data Lake: AWS S3
• Batch Processing: Spark, Airflow, Hadoop2, Python, Java App
• Data Warehouse: Redshift, MongoDB, PostgreSQL
• Datahub: Pubsub, Kafka
• Stream Processing: DataFlow, MemSQL Pipeline
• Near Real Time DW: GCP BigQuery, MemSQL
• Real Time DB: AWS DynamoDB
• Source DB: Mongo, PostgreSQL
• App / Services: Java App
• Analytics Tools: PeriscopeData, Spark, R, Domo, Dataiku, Holistics, Keboola
• ML Tools, Library, and Services: Jupyter, Zeppelin, Caffe, DataDog, TensorFlow, Cloud Vision API
• Query Engine: Qubole, Presto, Hive

Part Two

Data Engineering

Part 2: Data Engineering

Fast Food,

Or…?

MINDSETS

Managed service for focus

So we could focus more on the use cases

Real Time Pipeline

• 5 min data delivery SLA. Real latency ~ 10s
• 100 ms query SLA. Real latency ~ 10ms (p95)
• Key value data, query by service/app
• Autoscale; self service for each engineering team (we provide governance, guidance, building blocks, and consultation)
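
To make the query SLA concrete, here is a minimal sketch of how a p95 latency could be checked against the 100 ms target. The sample latencies and the helper names `percentile` and `meets_sla` are hypothetical and only for illustration; they are not the team's monitoring code. The nearest-rank definition used here is one of several percentile conventions in common use.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def meets_sla(samples_ms, sla_ms=100.0, pct=95):
    """True when the chosen percentile of the latencies is within the SLA."""
    return percentile(samples_ms, pct) <= sla_ms

# Hypothetical query latencies in milliseconds (made up for illustration).
latencies_ms = [4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 11, 12, 14, 250]

p95 = percentile(latencies_ms, 95)
print(p95, meets_sla(latencies_ms))
```

A percentile-based SLA tolerates a small share of slow requests: the single 250 ms outlier in the sample does not move the p95, although it would dominate the maximum.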

Near Real Time Pipeline

• Raw data, query by BI Tools
• 5 min data delivery SLA. Real latency ~ 5s
• Using YAML for schema definition (built and defined by ourselves)
• Self service for data analysts, with guidance and governance!

Near Real Time Pipeline

But MemSQL is not a managed service; it runs on EC2.

It is easy to scale, but it does not autoscale yet.

So we are moving to… v2!!

Currently in usability testing by analysts.

Self service, of course!

Analytical Pipeline

• Heavy data processing, query by BI Tools
• 6 hour data delivery SLA

Interesting features:
• Custom dev/prod environment, for self service!
• Custom framework, on top of Spark
• Custom Airflow, separated queue for backfill
• EMR autoscale for backfill
• Redshift microbatch bulk load
• etc...

Summary

Part Three

Data Science in Traveloka

Part 3: Data Science in Traveloka

Three Things to Discuss Today
• Data Science Purpose
• Tools of the Trade
• Model Evaluations and Applications

Data Science Purpose

Novia is 25 years old. She is single, outspoken, and mathematically gifted. As a student, she was deeply interested in calculus and statistics, and also participated in the International Mathematical Olympiad.

a. Novia is a data scientist
b. Novia is a data scientist and is active as a mathematical Olympiad tutor
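
The puzzle above is a variant of the classic conjunction problem. Writing A for "Novia is a data scientist" and B for "Novia is active as a mathematical Olympiad tutor" (labels introduced here only for illustration), the product rule shows that option b can never be more probable than option a, however well the description seems to fit:

```latex
P(A \cap B) = P(A)\,P(B \mid A) \le P(A), \qquad \text{since } 0 \le P(B \mid A) \le 1 .
```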

Consider a regular six-sided die with four green faces and two red faces. The die will be rolled 20 times and the sequence of greens (G) and reds (R) will be recorded. Choose one sequence from a set of three. Which one is the most likely outcome?

a. RGRRR
b. GRGRRR
c. GRRRRR
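
The die puzzle can be checked with exact arithmetic. The sketch below computes the probability that a run of consecutive, independent rolls produces each candidate sequence exactly; the function name `sequence_probability` is a hypothetical helper introduced for this example.

```python
from fractions import Fraction

# Four green faces and two red faces on a fair six-sided die.
FACE_PROB = {"G": Fraction(4, 6), "R": Fraction(2, 6)}

def sequence_probability(seq):
    """Probability that consecutive independent rolls produce exactly `seq`."""
    prob = Fraction(1)
    for face in seq:
        prob *= FACE_PROB[face]
    return prob

candidates = ["RGRRR", "GRGRRR", "GRRRRR"]
probs = {seq: sequence_probability(seq) for seq in candidates}

for seq, p in probs.items():
    print(seq, p, float(p))

most_likely = max(probs, key=probs.get)
print("most likely:", most_likely)
```

The shortest sequence wins even though it contains the most reds relative to its length. It is also contained in the second sequence (GRGRRR is G followed by RGRRR), so any record of 20 rolls that contains the second sequence necessarily contains the first.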

Remember This:

The goal of a data science exercise is to help us make a good business decision

• Logic
• Alternatives
• Information
• Preferences

“if they learn nothing else about decision analysis from their studies, distinction between outcome and decisions will have been worth the price of admission”

Ron Howard, Professor at Stanford University, Father of Decision Analysis

Decisions vs. Outcome

                Good decision                                  Bad decision
Good outcome    Took a taxi and arrived safely                 Drove home and arrived safely
Bad outcome     Took a taxi and was involved in an accident    Drove home and was involved in an accident
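
The taxi versus driving example separates the quality of a decision from the outcome that follows it. One way to sketch this is to score each decision by its expected cost before the outcome is known. All numbers below (accident probabilities, fare, accident cost) are hypothetical values chosen only for illustration, assuming, as the example implies, that driving home is the riskier choice.

```python
# Hypothetical inputs, made up for illustration only.
ACCIDENT_PROB = {"take a taxi": 0.001, "drive home": 0.05}
FARE = {"take a taxi": 1.0, "drive home": 0.0}
ACCIDENT_COST = 1000.0

def expected_cost(decision):
    """Expected cost of a decision, evaluated before the outcome is known."""
    return FARE[decision] + ACCIDENT_PROB[decision] * ACCIDENT_COST

for decision in ACCIDENT_PROB:
    print(decision, expected_cost(decision))

best_decision = min(ACCIDENT_PROB, key=expected_cost)
print("best decision:", best_decision)
```

Under these assumptions taking a taxi is the better decision even though it can still end in an accident, and driving home remains the worse decision even on the nights it ends safely.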

Tools of the Trade

Data Science Framework: CRISP-DM

• Business
• Data
• Data Prep
• Model
• Evaluation
• Deployment

Common Sense

“Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world”
-Atul Butte, Stanford-

We use open source libraries for data science

Wrangling
• data.table
• dplyr
• sparkR
• sparklyr
• pandas
• pyspark

Visualization
• ggplot
• matplotlib
• seaborn
• shiny

Statistics
• R
• JAGS
• STAN
• Python
• Julia

Machine Learning
• scikit-learn
• caret
• e1071
• fbprophet

Are we using the algorithm? Or being used by it?

Classification
• Linear Models
• Naïve Bayes Classifier
• Support Vector Classifier
• Vowpal Wabbit Classifier
• Random Forest
• Decision Trees
• Neural Network
• Extreme Gradient Boosted Trees
• Many more algos!

Prediction
• Linear Models
• Nystroem Regressor
• Support Vector Regressor
• Vowpal Wabbit Regressor
• Random Forest
• Decision Trees
• Neural Network
• Extreme Gradient Boosted Trees
• More algos!

• Scikit-learn
• Caret
• TensorFlow
• …

We need more than just off-the-shelf libraries to feed data-hungry people

• Bayesian Network
• Markov Chain Monte Carlo
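
For readers new to Markov Chain Monte Carlo, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC algorithms. The target (a standard normal), the step size, and the function names are assumptions made for this illustration; this is not the team's implementation, and real work would more likely use a tool such as JAGS or STAN.

```python
import math
import random

def log_target(x):
    """Unnormalized log-density of a standard normal (the example target)."""
    return -0.5 * x * x

def metropolis(log_density, n_samples, step=2.0, x0=0.0, seed=0):
    """Random-walk Metropolis: propose x + noise, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_accept = log_density(proposal) - log_density(x)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal  # accept the move; otherwise stay at the current point
        samples.append(x)
    return samples

samples = metropolis(log_target, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

The sampler only needs the density up to a constant, which is why it suits Bayesian posteriors whose normalizing constant is hard to compute. The sample mean and variance should land near 0 and 1 for this target.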

Model Evaluations and Applications

Model Evaluation: judging the usefulness of your model

Rule #1: Never ever peek at the test set during training/validation

Rule #2: You can never satisfy all the metrics; pick one or two metrics as your decision criteria beforehand

Rule #3: Always do comparative statics on the final model
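
Rule #1 is easiest to follow when the test set is carved out once, up front, and then left alone. The sketch below shows one way to do a three-way split using only the standard library; the function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration. In practice scikit-learn's `train_test_split` is a common choice for the same job.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then cut into disjoint train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(100))  # stand-in for real records
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

Model fitting uses `train`, model selection and tuning use `val`, and `test` is touched once, for the final model only.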

Comparative Statics: commonly used as feature importance analysis
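
Comparative statics asks how the model output changes when one input moves and the others are held fixed. A minimal sketch of that idea follows; the model, the feature names, and the baseline values are all hypothetical and stand in for a real fitted model.

```python
def comparative_statics(model, baseline, deltas):
    """Change in model output when each input moves by its delta, others held fixed."""
    base_output = model(baseline)
    effects = {}
    for name, delta in deltas.items():
        shifted = dict(baseline)
        shifted[name] = baseline[name] + delta
        effects[name] = model(shifted) - base_output
    return effects

# Hypothetical "final model": a linear score over two made-up features.
def toy_model(x):
    return 3.0 * x["price_discount"] + 0.5 * x["days_to_departure"]

baseline = {"price_discount": 10.0, "days_to_departure": 30.0}
effects = comparative_statics(
    toy_model, baseline, {"price_discount": 1.0, "days_to_departure": 1.0}
)
print(effects)
```

Ranking inputs by the size of these effects gives a simple feature importance reading. For a nonlinear model the effects depend on the chosen baseline, so it is worth repeating the exercise at several representative points.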

Remember the end goal: decisions

• What should we do?
• What might happen?

“But in my view, obsessive customer focus is by far the most protective of Day 1 vitality”

Our data is telling us:
• What do they want?
• Do we serve their needs?
• Are they trying to leave us?

My name is Jeff

Thank you!
