NYC Open Data Meetup: ThoughtWorks chief data scientist talk

Data Science Consulting, or

Science meets business, again. Third time a charm?

David Johnston, ThoughtWorks

March 17, 2014

Young scientists become…

• Professors

• Programmers

• Traders

• Data scientists

Talk Overview

• Agile Analytics group at ThoughtWorks

• What is data science anyway? Origins and future. Good or evil?

• Guide to technologies and limits to technology

• Process and methodology for successful data science consulting

ThoughtWorks

• Global software consulting company

• HQ in Chicago. Major offices in NY, San Fran, Dallas, India, Brazil, Australia, China - over 30 worldwide.

• Privately owned by Roy Singham

• Flat hierarchy of passionate people

The three pillars

Agile Analytics at TW

• Practice started in 2011

• Led by Ken Collier and John Spens

• About a dozen people involved

Key Themes

• BI, data warehousing and analytics have largely missed the revolution in agile methodologies.

• We can do analytics in an agile, fast, light-footprint way.

What do we do?

• Probabilistic modeling

• Predictive analytics / machine learning

• Advanced BI, prescriptive analysis

• Big Data technologies

• Advanced algorithms and data structures, streaming

Our main goals

• Use data analysis to give companies an edge in their marketplace

• Use data analysis to improve the world at large

Some typical projects

• Recommender systems

• Customer behavior analysis

• Optimization

• Efficient algorithms/tech for massive data sets

• Company-specific analytics challenges

Case Study 1: HealthCare Group Purchasing Organization

• One of the largest GPOs. 1000s of client hospitals

• Hospitals sign up, pay a fee and get group-purchasing discounts

• The GPO has to give hospitals estimates of their likely savings.

• A hospital’s data usually arrives in a non-standard spreadsheet. No SKUs in healthcare (yet).

• A data matching mess


GPO: Johnson & Johnson Sterile Scalpel #F8-505

Hospital: J&J scalpel, steel item f8505 size 3’’

• Their in-place solution – Oracle, lots of ETL tools, SQL with lots of rigid matching rules.

• The database of matching rules was difficult to maintain

• Matching accuracy was ~60%; the rest was done by hand. Automated processing took 1 day, and the lines matched by hand took weeks.


What we did

• First convinced them that their solution was highly inefficient.

• Wrote a Python program using a tree data structure and machine learning to do the matching.

• Ran on my laptop in a few minutes. Match rates > 80%

• This was done in 3 weeks. Later settled on a solution using Elasticsearch.
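A minimal sketch of the kind of fuzzy matching involved, not the program described above (which used a tree data structure and machine learning). It assumes a much simpler technique: normalize each description into a set of tokens and pick the catalog entry with the highest Jaccard overlap. The helper names (normalize, jaccard, best_match), the abbreviation table and the second catalog entry are hypothetical; the two scalpel strings are the ones from the slide.

```python
import re

# Hypothetical abbreviation table; a real one would be learned or curated.
ABBREVIATIONS = {"j&j": "johnson johnson"}


def normalize(text):
    """Lowercase, expand known abbreviations, and split into a token set."""
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Join part numbers such as "f8-505" into "f8505" before tokenizing.
    text = re.sub(r"(?<=\w)-(?=\d)", "", text)
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query, catalog):
    """Return (score, entry) for the catalog entry most similar to query."""
    query_tokens = normalize(query)
    return max((jaccard(query_tokens, normalize(entry)), entry) for entry in catalog)


catalog = [
    "Johnson & Johnson Sterile Scalpel #F8-505",
    "Acme Surgical Gloves #G2-110",  # hypothetical second entry
]
score, entry = best_match("J&J scalpel, steel item f8505 size 3''", catalog)
print(round(score, 3), entry)
```

Unlike a table of rigid SQL rules, a similarity score degrades gracefully: a low best score can be routed to manual review instead of failing outright.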

Case Study 2: Retail Rec Systems

• Client provides coupons to retailers’ customers

• Needed a better recommendation system

• We’re using a simple logistic regression model
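A minimal sketch of what a simple logistic regression model looks like, assuming nothing about the actual project: the two features (whether the customer bought in the category before, and coupons redeemed last month), the eight training rows and the function names are all made up for illustration. It fits the weights by plain batch gradient descent using only the standard library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, x):
    """Probability of redemption; weights[0] is the bias term."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


# Made-up data: [bought_in_category_before, coupons_redeemed_last_month]
rows = [[1, 3], [1, 2], [1, 1], [0, 2], [0, 1], [0, 0], [1, 0], [0, 3]]
labels = [1, 1, 1, 0, 0, 0, 0, 1]  # 1 = redeemed the offered coupon

weights = train_logistic(rows, labels)
print(round(predict(weights, [1, 3]), 2), round(predict(weights, [0, 0]), 2))
```

To recommend, score every candidate coupon for a customer with predict and offer the ones with the highest probability.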

What exactly is data science?

• Is this really new?

• Does the term “data science” make any sense?

• Is it just a fad? Over-hyped?

• Why did this term just become popular a few years back?

• Where is this going?

• Should scientists/engineers/math-types really go and make a career doing this?

What exactly is data science?

• Is this really new? - Not really

• Does the term “data science” make any sense? - Not really, but so what?

• Is it just a fad? Over-hyped? - No. Sometimes.

• Why did this term just become popular a few years back? - Productivity

• Where is this going?

• Should scientists/engineers/math-types really go and make a career doing this? - Yes, for most

Is it new?

Of course not

Combination of many subjects:

• Mathematics and statistics – probability theory

• Machine learning

• Computer science – algorithms, data structures, databases

• Operations research – process optimization

• Business consulting

• Software development

Where have we seen this before?

Business: Finance, Insurance, Sports, Government accounting, Retail, Google

Science: Physics, Astronomy, Biology

Isn’t there anything new?

Of course

• Analytics finally becoming ubiquitous in business (as it always should have been)

• Much more communication between disparate fields

• It’s finally work that’s fun

Ok, but why now?

It’s a big movement, so let’s give it a new name: Data Science

Why now? - Productivity

• There has always been plenty of data science in science

• Job prospects in academia are slim

• Productivity has been rising much faster than postdoc salaries and scientist job creation

Data scientist productivity growth

• Salary increase over postdoc requires ~2.5x

• Salaries in industry are set by productivity and supply/demand

• Crossing the threshold in productivity leads to new job creation

• A slowdown in productivity growth and/or changes in supply/demand will eventually end this burst in job creation

• Nothing magical happened in 2005!

Productivity Drivers for Data-science

Long time scale

• Compute, Moore’s law

• The internet (duh!)

• HD and RAM price drop

• Science learns to deal with Big Data

• Growing importance of statistics

More recent

• Git, code sharing

• Machine learning libraries

• Python/R, open source

• Hadoop and ecosystem

• The Cloud, AWS

• NoSQL databases, in-memory

• Growing cohesion of the “data science” community, feedback effects of popularity

Then and now

1990s data science

• Writing code in C/C++

• Working with flat files

• Even relational/SQL is new

• Using proprietary software: Matlab, IDL

• Writing all algorithms from scratch. Slow. Buggy.

Data science today

• Working in high-level open-source languages: Python, R

• We’re good at SQL and have lots of other options (NoSQL)

• Git, thousands of libraries available. Easy to install.

• Can concentrate more on what we’re good at.

So what is data science now?

Data Science:

An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.

Where is it going?

• Big Data technology is separated from data science

• Software developers take over much of Big Data roles

• Businesses begin to understand data science terminology as they now understand software terminology, and realize that they are not Twitter.

• Data scientists and businesses find a methodology that works, as industrial-scale software development has

Where is it going?

Specialization

• Most experienced data scientists move into consulting or management of teams

• Universities graduate many “data scientist-lite” students from new, more specialized BS or MA programs

• Fewer generalists

• PhD students need to learn additional skills. Not instant hires (http://bit.ly/1m3krq6)

Why won’t we have 100x more data scientists in N years?

• Pool of disgruntled postdocs will dry up or “I am not even supposed to be here!”

• Many data science problems don’t need the most cutting-edge tools. (Some do.)

• People rarely get much experience working with real data in academic settings. Requires real-world experience, takes time.

Are we there yet?

Overhyped, underhyped, mis-hyped?

• No, probably not

• Productivity growth is real

• We are solving important problems. Plenty left.

• Big Data will probably peak in the hype cycle before data science

• Just watched my first analytics commercial. IBM.

Why Big Data enthusiasm might peak soon

Big Data defined – Process for performing calculations on data that:

• Cannot possibly be done on a single machine

• When sampling and streaming are not effective

• When data reduction is not possible

• When storage and compute are closely balanced

• Parallelizing is absolutely unavoidable

Most tasks are not like this

• Sampling is usually good enough for training machine learning

• Need for rapid feedback, interactive work

• CPUs are underutilized. IO limited.

• Usually a better algorithm can solve the problem better

Hadoop (Spark)

Good use cases

• Large batch jobs like: restructuring and reducing data from raw files.

• Scoring with ML models

• When you have to do something on every data point.

• Raw storage in HDFS

Bad use cases

• Model development

• Visualization

• Brute-forcing an inefficient algorithm.

• Treating Hadoop like a database.

The data-sizes we typically see

Most companies have a few million customers (~10^7)

Often they store ~1000 items per customer

That’s 10^10 data points. At ~50 bytes/data-point that is 500 GB, up to a few TB for wider records. Fits on our laptops (but not in memory). Such data can be moved to the cloud if need be in 1-2 days.

Often we can be productive with either a sample or an aggregation.

True when

• Customer-specific items are things like purchases, manually entered text, logins, etc.

Not true when

• Things are web events, pair-wise interactions (i.e. graphs, social)
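The back-of-the-envelope sizing can be checked in a few lines. This is a sketch under stated assumptions: the record widths (5, 50 and 500 bytes per data point) and the sustained uplink of 5 MB/s are illustrative values, not figures from the talk, and the function names are made up. With 10^10 data points, about 50 bytes per point gives 500 GB.

```python
def dataset_size_bytes(customers, items_per_customer, bytes_per_point):
    """Total raw size: one data point per stored customer item."""
    return customers * items_per_customer * bytes_per_point


def transfer_days(size_bytes, megabytes_per_second):
    """Days needed to move size_bytes at a sustained upload rate."""
    seconds = size_bytes / (megabytes_per_second * 1e6)
    return seconds / 86_400


customers = 10**7          # order of magnitude used on the slide
items_per_customer = 1000  # ~1000 stored items per customer
points = customers * items_per_customer

for bytes_per_point in (5, 50, 500):  # assumed record widths
    size = dataset_size_bytes(customers, items_per_customer, bytes_per_point)
    print(f"{bytes_per_point:>3} B/point -> {size / 1e9:,.0f} GB")

# Assumed sustained uplink of 5 MB/s (about 40 Mbit/s).
print(f"{transfer_days(500e9, 5):.1f} days to upload 500 GB")
```

Under these assumptions the whole range fits on commodity disks, which is the point of the slide: the data is large, but rarely too large for one machine.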

Sources of really big data

Sensor data

• Pictures

• Video

• Health monitoring devices

• Internal device monitors

• Results of combinatorial complexity

However

• Is it really economic to store and process these huge data sets to begin with?

• We will learn to use streaming algorithms

• We will learn to focus on information, not noise
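As one example of a streaming algorithm, here is a sketch of reservoir sampling (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length without ever storing the stream. The sensor stream below is synthetic and the function name is my own; the talk does not name a specific algorithm.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Synthetic stream of one million readings, never held in memory at once.
readings = (x % 97 for x in range(1_000_000))
sample = reservoir_sample(readings, k=1000, rng=random.Random(0))
print(len(sample), sum(sample) / len(sample))
```

Memory use depends only on k, so the same code works whether the stream has a thousand items or a trillion.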

Case study: Particle Physics

Data reduction par excellence

• 600 million collisions per second

• Most are boring events and are not saved

• Save ~100 petabytes per year

Determine existence of the Higgs boson – 1 bit

Measure its mass to 1% ~ 1 byte

Data = Exabytes

Information = 9 bits

Compression 10^18

Goal

$9 billion per byte!
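The information count can be reproduced as a rough calculation. This sketch assumes "exabytes" means exactly one exabyte (10^18 bytes) and rounds the mass measurement up to a whole byte, as the slide does; the variable names are my own.

```python
import math

# Existence of the particle is a yes/no answer: one bit.
existence_bits = 1

# A 1% measurement distinguishes roughly 100 values, which needs
# log2(100) bits; rounded up to a whole byte as on the slide.
mass_bits_exact = math.log2(100)
mass_bits = 8

information_bits = existence_bits + mass_bits

# Assumption: "exabytes" taken as exactly one exabyte = 10**18 bytes.
data_bits = 8 * 10**18

compression = data_bits / information_bits
print(f"{mass_bits_exact:.1f} bits for the mass, {information_bits} bits in total")
print(f"compression ~ 10^{round(math.log10(compression))}")
```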

Data science consulting

The good

• Always something new, always learning.

• Exposed to many different people.

• Get to see how everything works on the inside.

• See the world!

• Low career risk but still fun.

The bad

• Your clients choose you

• People problems often more important than math problems

• Travel can be extreme

• Your great ideas will rarely be credited to you.

Challenges in data science consulting

• Businesses don’t yet understand the terminology, process or techniques. Much teaching involved

• A visionary CEO sends you into a not-so-visionary environment

• Problems can be vague

• Communication with business stakeholders takes much of your time

• We are still developing an effective model. More than just agile techniques

Red flags to avoid

• “Build us a platform for analytics so we can become a data-driven company” (a non sequitur)

• Wanting prediction of the unpredictable

• Attempting to use ML on noisy data

• When incentives and opinions are all over the map

• Conviction that the problem was solved 20 years ago, e.g. with linear regression, a segmentation model, SAS.

Keep offering up bold ideas

• Look for ways to achieve major productivity gains

• Keep up on cutting-edge literature in stats/ML

• All my best ideas for web-apps are now successful companies.

• Everybody laughed at them!

Data science is NOT going to be productized.

FIN
