A Data Scientist And A Log File Walk Into A Bar

Paco Nathan

Concurrent, Inc.

[email protected]

@pacoid

[Diagram: Cascading workflow with nodes Document Collection, Tokenize, Scrub token, Regex token, Stop Word List, HashJoin (Left, RHS), GroupBy token, Count, Word Count; M and R mark the map and reduce steps]

Copyright @2012, Concurrent, Inc.


Unstructured Data meets Enterprise Scale

opportunity

1. backstory: how we got here

2. overview: typical use cases

3. example: a Cascading app

1. backstory: how we got here

Intro to Data Science

inflection point

• huge Internet successes after 1997 holiday season… AMZN, EBAY, then GOOG, Inktomi (YHOO Search)

• consider this metric: annual revenue per customer / amount of data stored dropped 100x within a few years after 1997

• storage and processing costs plummeted; now we must work much smarter to extract ROI from Big Data… our methods must adapt

• “conventional wisdom” of RDBMS and BI tools became less viable; business cadre still focused on pivot tables and pie charts… tends toward inertia!

• MapReduce and the Hadoop open source stack grew directly out of that contention… but only solve portions

massive disruption in retail, advertising, etc., “All of Fortune 500 is now on notice over the next 10-year period.” – Geoffrey Moore, 2012 (Mohr Davidow Ventures)

[Timeline: 1997, 1998, 2004]

the world before…

BI, SQL, and highly optimized code

[Diagram labels: Customers, transactions, Web App, RDBMS, SQL Query, result sets, BI Analysts, Excel pivot tables, PowerPoint slide decks, Stakeholder, strategy, Product, requirements, Engineering, optimized code]

data innovation: circa 1996

the world after…

machine learning, leveraging log files

[Diagram labels: Customers, customer transactions, Web Apps, Middleware, servlets, models, recommenders + classifiers, Algorithmic Modeling, Logs, event history, aggregation, DW, ETL, RDBMS, SQL Query, result sets, dashboards, Stakeholder, Product, Engineering, UX]


the world ahead…

what our customers are doing now

[Diagram labels: Customers; Web Apps, Mobile, etc.; transactions, content; social interactions; Log Events; History; RDBMS; In-Memory Data Grid; "real time" services; batch Workflow; Hadoop, etc.; Cluster Scheduler; Planner; Data Apps; DW; Data Access Patterns; s/w dev; data science; discovery + modeling; dashboard metrics; business process; optimized capacity; endpoints; Prod; Eng; Ops; roles: Domain Expert, Data Scientist, App Dev, Ops]


a key difference…

statistical thinking

employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables

this approach attempts to understand not just problems and solutions, but also the processes involved and their variances

particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…

programmers typically don’t think this way… however, both systems engineers and data scientists must!

[Diagram labels: Process, Variation, Data, Tools]

references

Statistical Modeling: The Two Cultures

by Leo Breiman

Statistical Science, 2001

http://bit.ly/eUTh9L

also check out RStudio:

http://rstudio.org/

http://rpubs.com/

most valuable skills

• approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues: ETL, log file analysis, etc.

• unfortunately, data-related budgets for many companies tend to go into frameworks which can only be used after clean up

• most valuable skills:

‣ learn to use programmable tools that prepare data

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making analysis repeatable

the rest of the skills – modeling, algorithms, etc. – those are secondary
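
One of the skills listed above, estimating the confidence for reported results, can be sketched for the simplest case: a conversion rate measured from a sample of log events. The following is a minimal sketch that computes a Wilson score interval; the class name, method name, and sample counts are illustrative assumptions, not taken from the talk.

```java
// Hypothetical helper (not from the talk): Wilson score interval for a proportion.
final class WilsonIntervalSketch {

    /** Returns {lower, upper} for successes/trials at z standard deviations. */
    static double[] interval(long successes, long trials, double z) {
        if (trials <= 0 || successes < 0 || successes > trials) {
            throw new IllegalArgumentException("need trials > 0 and 0 <= successes <= trials");
        }
        double n = trials;
        double p = successes / n;            // observed rate
        double z2 = z * z;
        double denom = 1.0 + z2 / n;
        double center = (p + z2 / (2.0 * n)) / denom;
        double half = z * Math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom;
        return new double[] { center - half, center + half };
    }

    public static void main(String[] args) {
        // Same observed rate (10%), very different sample sizes (assumed numbers).
        double[] small = interval(5, 50, 1.96);
        double[] large = interval(500, 5000, 1.96);
        System.out.printf("n=50:   [%.4f, %.4f]%n", small[0], small[1]);
        System.out.printf("n=5000: [%.4f, %.4f]%n", large[0], large[1]);
    }
}
```

Both samples report the same 10% rate, but the interval for the small sample is far wider, which is the point of attaching a confidence estimate to any reported metric.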

D3

team process

discovery: help people ask the right questions

modeling: allow automation to place informed bets

integration: deliver products at scale to customers

apps: leverage smarts in product features

systems: keep infrastructure running, cost-effective

Gephi

matrix: usage

conceptual tool for managing Data Science teams

overlay your project requirements (needs) with your team’s strengths (roles)

that will show very quickly where to focus

NB: bring in individuals who cover 2-3 needs, particularly for team leads

[Matrix: roles (stakeholder, scientist, developer, ops) against needs (discovery, modeling, integration, apps, systems)]

building teams

[Matrix: roles (stakeholder, scientist, developer, ops) against needs (discovery, modeling, integration, apps, systems)]

references

by DJ Patil

Data Jujitsu

O’Reilly, 2012

http://www.amazon.com/dp/B008HMN5BE

Building Data Science Teams

O’Reilly, 2011

http://www.amazon.com/dp/B005O4U3ZE





2. overview: typical use cases

in a nutshell, what we do…

• estimate probability

• calculate analytic variance

• manipulate order complexity

• make use of learning theory

• collab with DevOps, Stakeholders

using science in data science

Unique Registration

Launched games lobby

NUI:TutorialMode

Birthday Message

Chat PublicRoom voice

Launched heyzap game

ConnectivityTest: test suite started

Create New Pet

Movie View Started: client, community

NUI:MovieMode

Buy an Item: web

Put on Clothing

Address space remaining: 512M

Customer Made Purchase Cart Page Step 2

Feed Pet

Play Pet

Chat Now

Edit Panel

Client Inventory Panel Flip Product Over

Add Friend

Open 3D Window

Change Seat

Type a Bubble

Visit Own Homepage

Take a Snapshot

NUI:BuyCreditsMode

NUI:MyProfileClicked

Address space remaining: 1G

Leave a Message

NUI:ChatMode

NUI:FriendsModedv

Website Login

Add Buddy

NUI:PublicRoomMode

NUI:MyRoomMode

Client Inventory Panel Remove Product

Client Inventory Panel Apply Product

NUI:DressUpMode


use case: marketing funnel

• must optimize a very large ad spend

• different vendors report different metrics

• seasonal variation distorts performance

• some campaigns are much smaller than others

• hard to predict ROI for incremental spend

approach:

• log aggregation, followed by cohort analysis

• Bayesian point estimates compare different-sized ad tests

• customer lifetime value quantifies ROI of new leads

• time series analysis normalizes for seasonal variation

• geolocation adjusts for regional cost/benefit

• linear programming models estimate elasticity of demand
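
The Bayesian point estimates in the approach above can be illustrated with a minimal sketch, assuming a Beta prior over the conversion rate and a binomial likelihood. The prior parameters, class name, and campaign numbers are hypothetical; the talk does not specify a model.

```java
// Hypothetical sketch: posterior-mean conversion rate under a Beta prior.
final class BetaBinomialSketch {

    /** Posterior mean of the rate after observing conversions out of impressions. */
    static double posteriorMean(double alpha, double beta, long conversions, long impressions) {
        if (alpha <= 0 || beta <= 0 || conversions < 0 || conversions > impressions) {
            throw new IllegalArgumentException("invalid prior or counts");
        }
        return (alpha + conversions) / (alpha + beta + impressions);
    }

    public static void main(String[] args) {
        // Assumed prior: Beta(1, 9), i.e. a prior mean of 10%.
        double smallTest = posteriorMean(1, 9, 3, 10);       // raw rate 30%
        double largeTest = posteriorMean(1, 9, 150, 1000);   // raw rate 15%
        System.out.printf("small test: %.4f%n", smallTest);
        System.out.printf("large test: %.4f%n", largeTest);
    }
}
```

The small test's raw 30% rate is pulled strongly toward the prior, while the large test's estimate stays close to its raw 15%, which makes campaigns of different sizes comparable on one scale.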

Wikipedia

use case: ecommerce fraud

• sparse data means lots of missing values

• “needle in a haystack” lack of training cases

• answers are available in large-scale batch, results are needed in real-time event processing

• not just one pattern to detect – many, ever-changing

approach:

• random forest (RF) classifiers predict likely fraud

• subsampled data to re-balance training sets

• impute missing values based on density functions

• train on massive log files, run on in-memory grid

• adjust metrics to minimize customer support costs

• detect novelty – report anomalies via notifications
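
Re-balancing a training set by subsampling, as listed in the approach above, can be sketched as follows. This is a minimal sketch under the assumption that the majority class (legitimate orders) is randomly downsampled to the size of the minority class (fraud cases); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: downsample the majority class to match the minority class.
final class RebalanceSketch {

    /** Random sample without replacement of at most targetSize items. */
    static <T> List<T> downsample(List<T> items, int targetSize, long seed) {
        if (targetSize < 0) {
            throw new IllegalArgumentException("targetSize must be non-negative");
        }
        List<T> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, Math.min(targetSize, shuffled.size())));
    }

    /** All minority examples plus an equally sized random sample of the majority. */
    static <T> List<T> rebalance(List<T> minority, List<T> majority, long seed) {
        List<T> balanced = new ArrayList<>(minority);
        balanced.addAll(downsample(majority, minority.size(), seed));
        Collections.shuffle(balanced, new Random(seed + 1));
        return balanced;
    }

    public static void main(String[] args) {
        List<String> fraud = new ArrayList<>();
        List<String> legit = new ArrayList<>();
        for (int i = 0; i < 10; i++) fraud.add("fraud-" + i);
        for (int i = 0; i < 1000; i++) legit.add("ok-" + i);
        System.out.println(rebalance(fraud, legit, 42L).size()); // 20 examples, half of them fraud
    }
}
```

A fixed seed keeps the sample repeatable, which matters when the same training set has to be rebuilt for a later comparison.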

stat.berkeley.edu

use case: customer segmentation

• many millions of customers, hard to determine which features resonate

• multi-modal distributions get obscured by the practice of calculating an “average”

• not much is known about individual customers

approach:

• connected components for sessionization, determining uniques from logs

• estimates for age, gender, income, geo, etc.

• clustering algorithms to group into market segments

• social graph infers “unknown” relationships

• covariance/heat maps visualize segments vs. feature sets
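
The connected-components step in the approach above can be sketched with a union-find structure: each log event that shows two identifiers together (say a cookie and an account) links them, and each resulting component counts as one unique visitor. The identifier formats and class name below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: union-find over identifiers seen together in log events.
final class IdentityComponentsSketch {
    private final Map<String, String> parent = new HashMap<>();

    /** Finds the representative of an identifier, registering it on first use. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Records that two identifiers appeared in the same log event. */
    void link(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Number of connected components, read as the estimated count of uniques. */
    int uniques() {
        Set<String> roots = new HashSet<>();
        for (String id : new ArrayList<>(parent.keySet())) {
            roots.add(find(id));
        }
        return roots.size();
    }

    public static void main(String[] args) {
        IdentityComponentsSketch ids = new IdentityComponentsSketch();
        ids.link("cookie:a", "user:1");
        ids.link("device:x", "user:1");
        ids.link("cookie:b", "user:2");
        ids.find("cookie:c"); // seen alone, never linked
        System.out.println(ids.uniques());
    }
}
```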

MathWorks

use case: monetizing content

• need to suggest relevant content which would otherwise get buried in the back catalog

• big disconnect between inventory and limited performance ad market

• enormous amounts of text, hard to categorize

approach:

• text analytics glean key phrases from documents

• hierarchical clustering of character frequencies detects language

• latent Dirichlet allocation (LDA) reduces dimension to topic models

• recommenders suggest similar topics to customers

• collaborative filters connect known users with less known
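
As a rough illustration of the character-frequency idea above, the sketch below builds a letter-frequency profile for a text and compares two profiles by cosine similarity. This is only the pairwise similarity a clustering step could consume, not hierarchical clustering itself, and restricting profiles to the letters a to z is a simplifying assumption; all names are hypothetical.

```java
import java.util.Locale;

// Hypothetical sketch: character-frequency profiles compared by cosine similarity.
final class CharProfileSketch {

    /** Relative frequency of the letters a..z in the text; other characters are ignored. */
    static double[] profile(String text) {
        double[] freq = new double[26];
        int total = 0;
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                freq[c - 'a']++;
                total++;
            }
        }
        if (total > 0) {
            for (int i = 0; i < freq.length; i++) {
                freq[i] /= total;
            }
        }
        return freq;
    }

    /** Cosine similarity of two profiles; 0 when either profile is empty. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] p1 = profile("the quick brown fox jumps over the lazy dog");
        double[] p2 = profile("a data scientist and a log file walk into a bar");
        System.out.printf("similarity: %.3f%n", cosine(p1, p2));
    }
}
```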

Digital Humanities

plus some great tools…

scale-out: Scalr, RightScale, CycleComputing, vFabric, Beanstalk

apps: Cascading, Scalding, Cascalog, R markdown, SWF

analytics/modeling: R, Weka, Matlab, PMML, GLPK

hadoop: EMR, HW, MapR, EMC, Azure, Compute

key/val: Redis, Membase, MySQL

index: Lucene/Solr, ElasticSearch

durable storage: S3, ASV, GCS, Riak, Couch

imdg: Spark, Storm, Gigaspaces

visualization: ggplot2, D3, Gephi

graph: Gremlin, GraphLab, Neo4J

column: Vertica, HBase, Drill, Dynamo

text: LDA, WordNet, OpenNLP, Mallet, Bixo, NLTK

relational: usual suspects

reporting: Graphite, PowerPivot, Pentaho, Jaspersoft, SAS

machine data: Splunk, collectd, Nagios

3. example: a Cascading app

getting started


cascading.org/category/impatient/

composition of a workflow

business process: domain expertise, business trade-offs, market position, operating parameters, etc.

API language: Scala, Clojure, Python, Ruby, Java, etc. … envision whatever else runs in a JVM

optimize / schedule

physical plan: “assembler” code

compute substrate: Apache Hadoop, in-memory local mode … envision GPUs, other frameworks, etc.

machine data: Splunk, Nagios, Collectd, etc.

major changes in technology now

1: copy

[Diagram: Source → Sink (M)]

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class Main {
      public static void main( String[] args ) {
        // input and output paths come from the command line
        String inPath = args[ 0 ];
        String outPath = args[ 1 ];

        Properties props = new Properties();
        AppProps.setApplicationJarClass( props, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

        // create the source tap
        Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

        // create the sink tap
        Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

        // specify a pipe to connect the taps
        Pipe copyPipe = new Pipe( "copy" );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
          .addSource( copyPipe, inTap )
          .addTailSink( copyPipe, outTap );

        // run the flow
        flowConnector.connect( flowDef ).complete();
      }
    }

1 mapper, 0 reducers, 10 lines of code

ten lines of code for a file copy… seems like a lot.

wait!

same JAR, any scale…

Your Laptop:
• MBs of data
• Hadoop standalone mode
• passes unit tests, or not
• runtime: seconds – minutes

Staging Cluster:
• GBs of data
• EMR + 4 Spot Instances
• CI shows red or green lights
• runtime: minutes – hours

Production Cluster:
• TBs of data
• EMR w/ 50 HPC Instances
• Ops monitors results
• runtime: hours – days

MegaCorp Enterprise IT:
• PBs of data
• 1000+ node private cluster
• EVP calls you when app fails
• runtime: days+

2: word count

[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce steps]

1 mapper, 1 reducer, 18 lines of code
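
The slide does not show the word count code, so the logic of the flow (tokenize, group by token, count) is sketched here in plain Java rather than with the Cascading API. The class name and the delimiter set are assumptions made for illustration; this is a single-process stand-in, not the 18-line Cascading app.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java stand-in for the flow: Tokenize -> GroupBy token -> Count.
final class WordCountSketch {

    static Map<String, Integer> count(Iterable<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : documents) {
            // Tokenize: split on whitespace, brackets, parentheses, commas, periods (assumed delimiters)
            for (String token : doc.split("[\\s\\[\\](),.]+")) {
                if (token.isEmpty()) {
                    continue; // a leading delimiter yields one empty token
                }
                // GroupBy token + Count: one running total per distinct token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList("rain shadow rain", "shadow, rain."));
        counts.forEach((token, n) -> System.out.println(token + "\t" + n));
    }
}
```

In the Cascading version the tokenize step runs in the mapper and the grouping and counting run in the reducer; here the map keyed by token plays both roles.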

3: City of Palo Alto open data

github.com/Cascading/CoPA/wiki

• GIS export for parks, roads, trees (unstructured / open data)

• log files of personalized/frequented locations in Palo Alto via iPhone GPS tracks

• curated metadata, used to enrich the dataset

• could extend via mash-up with many available public data APIs

Enterprise-scale app: road albedo + tree species metadata + geospatial indexing

“Find a shady spot on a summer day to walk near downtown and take a call…”

[Diagram: CoPA workflow. Labels: CoPA GIS export, Regex parser, Checkpoint tsv, Regex filter (tree, road, park), Regex parser (tree, road), Scrub species, Tree Metadata, Road Metadata, HashJoin (Left, RHS), Estimate Albedo, Road Segments, Geohash, GroupBy tree_name, CoGroup (tree, road), Tree Distance, Filter tree_dist, Checkpoint shade, GPS logs, CoGroup (RHS), reco, Failure Traps; M and R mark the map and reduce steps]

log events

• addr: 115 HAWTHORNE AVE

• lat/lng: 37.446, -122.168

• geohash: 9q9jh0

• tree: 413 site 2

• species: Liquidambar styraciflua

• avg height 23 m

• road albedo: 0.12

• distance: 10 m

• a short walk from my train stop ✔
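
The geohash field above is what lets tree, road, and GPS records be joined on a shared key. A minimal sketch of standard geohash encoding follows; the class name is hypothetical, and this is not the code used in the CoPA app. Because the coordinates on the slide are rounded to three decimals, the sketch is only expected to agree with the slide on the leading characters; the trailing characters may differ from 9q9jh0.

```java
// Hypothetical sketch of standard geohash encoding (base32, longitude bit first).
final class GeohashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    static String encode(double lat, double lng, int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be positive");
        }
        double latLo = -90.0, latHi = 90.0;
        double lngLo = -180.0, lngHi = 180.0;
        StringBuilder hash = new StringBuilder(length);
        boolean refineLng = true; // bits alternate: longitude, latitude, longitude, ...
        int bitCount = 0;
        int value = 0;
        while (hash.length() < length) {
            if (refineLng) {
                double mid = (lngLo + lngHi) / 2.0;
                if (lng >= mid) { value = (value << 1) | 1; lngLo = mid; }
                else            { value = value << 1;       lngHi = mid; }
            } else {
                double mid = (latLo + latHi) / 2.0;
                if (lat >= mid) { value = (value << 1) | 1; latLo = mid; }
                else            { value = value << 1;       latHi = mid; }
            }
            refineLng = !refineLng;
            if (++bitCount == 5) { // every 5 bits become one base32 character
                hash.append(BASE32.charAt(value));
                bitCount = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(37.446, -122.168, 6));
    }
}
```

Nearby points usually share a prefix, so grouping records by a geohash of fixed length approximates a spatial join, at the cost of missing neighbors that fall just across a cell boundary.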

[Figure: example results. Histogram of Estimated Tree Height (meters); x-axis avg_height, 0 to 50; y-axis density, 0.00 to 0.12; fill scale count, 0 to 300]

drill-down

blog: cascading.org/

code/wiki/gists: github.com/Cascading/

jars: conjars.org/

list: goo.gl/KQtUL

DevOps products: concurrentinc.com/
