H2O World - Data Science in Action @ 6sense - Viral Bajaria

Viral Bajaria, CTO & Co-Founder

@viralbajaria

@6senseInc #h2oworld #6sense

#bestpresentationyet

Let’s start with a prediction

BEST TALK @ H2O WORLD!!

ONLY 1 FEATURE!

BEST TALK @ H2O WORLD

TOP FEATURE: SAYS MY MOM

PREDICTION ?

BEST TALK @ H2O WORLD

LOW PRECISION, LOW RECALL

6sense

WE FIND PROSPECTS THAT ARE IN MARKET TO BUY

WE ARE THE CENTRAL NERVOUS SYSTEM EMPOWERING ALL MARKETING, SALES AND BIZ OPERATIONS TEAMS

AS A TEAM, WE LIVE ON: DATA, STATISTICS AND BEER

about.me

CTO & CO-FOUNDER @ 6SENSE

EARLY HADOOP ADOPTER (LATE 2008)

3B+ EVENTS PER DAY

FUN FACT: Used a sledgehammer to unrack my first Hadoop cluster

Problem

Predict who is in-market to buy!!

e.g. Company XYZ is 90% likely to buy routers in the next 90 days.

Data Needs

What kind of data do we need…. A lot!

1st Party:

- Web (e.g. Apache logs)

- Marketing Automation (e.g. Eloqua)

- CRM (e.g. Salesforce)

6sense Data Network:

- Publishers

- Ads

- Blogs

Insights

Research patterns are different for different products

- Expensive routers

- Freemium cloud services

- Open source tools (think H2O)

Data Science Needs

Need to build different models for each product

Plus, we don’t like to make our lives easy :)

- Where’s the fun in easy?

- Need to build 4 models per product


100’S OF MODELS IN PROD

Data Sync Pipeline

Pre Processing Pipeline

MOST IMPORTANT

[Diagram: Web, Identify Companies, Identify Contacts, Customer Contacts, Sales, Normalize Companies, Custom Data Set, Make Consistent]
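The slides do not show how the "Normalize Companies" and "Make Consistent" steps work. As a rough illustration only, the sketch below maps company names from different sources onto one key by lowercasing, stripping punctuation, and dropping common legal suffixes. The suffix list and the function name normalize_company are assumptions made for this example, not 6sense's implementation.

```python
import re

# Hypothetical suffix list; a real pipeline would need a much larger one.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "llc", "ltd", "co"}


def normalize_company(name: str) -> str:
    """Return a lowercase key with punctuation and legal suffixes removed."""
    # Replace anything that is not a lowercase letter, digit, or space with a space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    tokens = cleaned.split()
    # Drop trailing legal suffixes, but never drop the last remaining token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# The same company as it might appear in web logs, CRM, and marketing data.
variants = ["6sense, Inc.", "6SENSE INC", "6sense"]
keys = {normalize_company(v) for v in variants}
print(keys)
```

All three variants collapse to a single key, which is what lets records from different sources be joined on the same company.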

Modeling

Baseline Model

Model Stats

Modeling

Predictive Model

Scikit-Learn or H2O

Output Types: pickle files or POJO
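To make the pickle output type concrete, here is a minimal sketch of writing a model artifact and loading it back. A plain dict of weights stands in for the trained model so the example runs with the standard library alone; the feature names and the score helper are hypothetical. A fitted Scikit-Learn estimator can be passed to pickle.dump in the same way.

```python
import io
import pickle

# Stand-in for a trained model: hypothetical feature weights.
model = {"bias": -1.0, "weights": {"web_visits": 0.5, "emails_opened": 0.25}}


def score(m: dict, features: dict) -> float:
    """Linear score: bias plus the weighted sum of the given features."""
    return m["bias"] + sum(w * features.get(k, 0.0) for k, w in m["weights"].items())


# Serialize to an in-memory buffer (a file opened with "wb" works the same way).
buffer = io.BytesIO()
pickle.dump(model, buffer)

# Later, for example in a production run: load the artifact and score with it.
buffer.seek(0)
restored = pickle.load(buffer)
print(score(restored, {"web_visits": 4, "emails_opened": 2}))
```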

Modeling

Script to promote model to production

Puts all artifacts used in S3

e.g. data, stats, queries
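The promotion script itself is not shown in the slides. One way to picture it is a manifest that records where each artifact would live in S3. The sketch below only builds that manifest and uploads nothing; the key layout, the model name "routers-90d", and the build_manifest helper are assumptions for illustration.

```python
import hashlib
import json


def build_manifest(model_name: str, version: str, artifacts: dict) -> dict:
    """Map each artifact to a hypothetical S3 key and record a checksum."""
    prefix = f"models/{model_name}/{version}"
    entries = {}
    for kind, payload in sorted(artifacts.items()):
        entries[kind] = {
            "key": f"{prefix}/{kind}",
            "sha256": hashlib.sha256(payload).hexdigest(),
            "bytes": len(payload),
        }
    return {"model": model_name, "version": version, "artifacts": entries}


# Artifact kinds taken from the slide: data, stats, queries.
# The payloads are placeholders, not real training data or statistics.
manifest = build_manifest(
    "routers-90d",
    "v1",
    {
        "data": b"placeholder training data",
        "stats": b"placeholder model stats",
        "queries": b"SELECT 1",
    },
)
print(json.dumps(manifest, indent=2))
```

Keeping data, stats, and queries under one versioned prefix means a production model can later be traced back to exactly what produced it.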

Model Info

• Name

• Type

• Binary Location

• Active

• ……..
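The fields listed above suggest a small registry record per model. As a sketch under that assumption, the example below models the four named fields and looks up the active record for a prediction. The type labels, bucket paths, and the active_model helper are hypothetical, and the fields elided on the slide are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModelInfo:
    name: str
    type: str              # e.g. "sklearn-pickle" or "h2o-pojo" (labels are assumptions)
    binary_location: str   # where the serialized model lives
    active: bool


def active_model(registry: List[ModelInfo], name: str) -> Optional[ModelInfo]:
    """Return the active record for a model name, or None if there is none."""
    for info in registry:
        if info.name == name and info.active:
            return info
    return None


registry = [
    ModelInfo("routers-90d", "sklearn-pickle", "s3://example-bucket/routers/v1.pkl", False),
    ModelInfo("routers-90d", "h2o-pojo", "s3://example-bucket/routers/v2.java", True),
]
chosen = active_model(registry, "routers-90d")
print(chosen.binary_location if chosen else "no active model")
```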

Multiple Models for same prediction

Model 1 Model Stats

Model 2 Model Stats

Model 3 Model Stats

Continue Prod Pipeline

Experimental Modeling

Same pipeline as before…

Output written to temporary tables

Use templating to switch settings at runtime

Stats compared to production runs

top decile

raw data for top-100 items
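A minimal sketch of the two ideas on this slide: a query template whose table names are switched at runtime, and a top-decile comparison between an experimental run and a production run. The table names, the settings dict, and both helper functions are assumptions for illustration; the slides do not name the templating tool or the exact statistics used.

```python
from string import Template

# One query template; the table names are filled in at runtime.
QUERY = Template("INSERT INTO $output_table SELECT * FROM $scores_view")

# Hypothetical settings: experiments write to a temporary table.
SETTINGS = {
    "prod": {"output_table": "scores", "scores_view": "scores_prod_v"},
    "experiment": {"output_table": "tmp_scores_exp", "scores_view": "scores_exp_v"},
}


def render_query(mode: str) -> str:
    """Render the query for the given run mode ("prod" or "experiment")."""
    return QUERY.substitute(SETTINGS[mode])


def top_decile(scores: dict) -> set:
    """Return the ids in the highest-scoring 10% (at least one id)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[: max(1, len(ranked) // 10)])


# Made-up scores for 20 companies; the experiment ranks c0 much higher.
prod = {f"c{i}": i / 20 for i in range(20)}
exp = dict(prod, c0=0.99)

overlap = top_decile(prod) & top_decile(exp)
print(render_query("experiment"))
print(sorted(overlap))
```

A small overlap between the two top deciles is a cue to pull the raw data for the top items and see why the experimental model ranks them differently.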

Tech Stack

Platform: AWS

Backend: Hadoop, Hive, Presto, Redshift… and a lot more

ML: H2O, Scikit-Learn

Ops: Fabric, Mesos, Docker, Marathon and home-grown tools

Questions ??

[email protected]

THANK YOU!

VIRAL BAJARIA, CTO & [email protected]

@viralbajaria @6senseInc
