Data Science on a Budget: Maximizing Insight and Impact (Boston Data Festival 2014)

Data Science on a Budget: Maximizing Insight and Impact

Nicholas Arcolano, Ph.D.

Senior Data Scientist

@arcolano

Photo by giuseppemilo / CC BY

A little background…

• Spent 10 years at MIT Lincoln Laboratory working in ballistic missile defense and cyber security research

• Areas of interest: statistics, machine learning, parallel computing, “big data”

• Realized these things had been collectively re-branded as “data science”

• Started calling myself a “data scientist” and joined a start-up

Nicholas Arcolano – Data Science on a Budget – November 2014 2

What does a data scientist do?

• Something that happens at the intersection of statistics, machine learning, and computer science

• Usually involves data (typically lots of it)

• Actually, this isn’t the most critical question to be worrying about

A better question…

• What does a data team do?

• Basically, two things:

1. Use data to help the rest of the company understand what our users are doing

2. Help the rest of the company use this information to improve our product and our business

The Company

• Started in 2008

• Based in Boston

• About 50 people

• 4-person data team

Our Product

• RunKeeper app for GPS and manual tracking of running, walking, cycling, other activities

• Long-term fitness goals, training plans, and performance insights

• iOS, Android, web, 3rd party devices

The Data

• 37 million users

• 450 million fitness activities

• 200 billion GPS points

• 17 billion interactions and events

[Diagram labels: DATA; SYSTEMS, PRODUCT, MARKETING, EXECUTIVE, BUSINESS DEVELOPMENT, USER EXPERIENCE, QUALITY ASSURANCE, SUPPORT]

All of the following, collectively labeled “DATA SCIENCE”:

• analytics and business intelligence

• modeling and forecasting

• data systems and archiving

• user research and testing

• data-driven features

• data stories and visualizations

How can we accomplish all this, quickly and with a small team?

It’s hard… but here are some steps to making it easier

Step 1: Communicate. A lot.

• You have a lot to learn about the rest of the company

  – Every part of the company has its own blend of tools, systems, processes, environments

  – Every part has data it understands and cares about

  – Every part knows things that affect the data that you won’t see: user interviews, support feedback, product bugs, system failures

• You also have a lot to teach people

  – What data we have

  – What it can (and can’t) do

  – Empower people to “think with data”

Step 1: Communicate. A lot.

• Be patient—sometimes you have to say the same things many times

• You may be the only one looking at certain data—if you see something, say something!

Setting expectations

[Figure: two diagrams of “Things our data team will discover”]

Anticipated impact of data exploration: exciting new things; things we already knew

Actual impact of data exploration: bugs, missing data, and bad data; things we already knew; exciting new things

Step 2: Move quickly but carefully.

“Wisely and slow. They stumble that run fast.”

– Friar Laurence, from Shakespeare’s Romeo and Juliet

Step 2: Move quickly but carefully.

• On moving fast…

  – Data science can work well in an agile framework

  – Make assumptions, but understand them

  – Don’t be afraid to provide caveats

• On being cautious…

  – Bad analysis is worse than no analysis

  – Make time for data QA

  – Use common sense: if it seems too good (or bad) to be true, it usually is
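As one way to picture "make time for data QA", here is a minimal sketch of a pre-analysis check. The `Activity` record, its field names, and the 45 km/h speed threshold are all hypothetical and chosen only for illustration; they are not RunKeeper's schema or rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Activity:
    # Hypothetical record layout, for illustration only.
    activity_id: int
    user_id: Optional[int]
    distance_km: Optional[float]
    duration_min: Optional[float]


def qa_report(activities, max_speed_kmh=45.0):
    """Count duplicates, missing fields, and implausible values."""
    report = {"total": 0, "missing": 0, "duplicate_id": 0, "implausible": 0}
    seen = set()
    for a in activities:
        report["total"] += 1
        if a.activity_id in seen:
            report["duplicate_id"] += 1
            continue
        seen.add(a.activity_id)
        if a.user_id is None or a.distance_km is None or a.duration_min is None:
            report["missing"] += 1
            continue
        if a.distance_km < 0 or a.duration_min <= 0:
            report["implausible"] += 1
            continue
        # Average speed far above the (assumed) threshold is "too good to be true".
        speed_kmh = a.distance_km / (a.duration_min / 60.0)
        if speed_kmh > max_speed_kmh:
            report["implausible"] += 1
    return report


sample = [
    Activity(1, 10, 5.0, 30.0),    # plausible: 10 km/h
    Activity(2, 11, None, 25.0),   # missing distance
    Activity(2, 11, 4.0, 25.0),    # repeated activity_id
    Activity(3, 12, 500.0, 20.0),  # 1500 km/h, implausible
    Activity(4, None, 3.0, 20.0),  # missing user
]
print(qa_report(sample))
```

Running a report like this before any analysis gives a rough sense of how much of the data can be trusted, and which caveats need to travel with the results.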

Step 3: Keep it simple.

• Go for lots of small, quick wins

• Learn and iterate

• Resist the urge to show everyone how smart you are by doing something super complicated

Step 3: Keep it simple.

• Do the “stupid thing” first

  – It helps build understanding

  – It helps uncover issues with the data

  – It may turn out that you’re not even solving the right problem

  – It may actually work pretty well

• When in doubt, favor a simpler method that you understand better over a more complex one

  – Easier to implement

  – Easier to debug

  – Easier to explain to others

You don’t have to use all the data

• Sometimes, using all the data is the right thing to do:

  SELECT COUNT(userid) FROM rk_user;

• Sometimes, though, you can solve your problem entirely with a small data set

• Benefits

  – Easier computation and data wrangling means faster results

  – “Curse of dimensionality” is a real thing

  – Mitigate bad assumptions (lack of stationarity, different product versions, changing environments, regional and seasonal effects, etc.)
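To make the small-data point concrete, the sketch below estimates an average from a simple random sample and reports a standard error, then compares it with the full-data answer. The data is synthetic (a made-up distribution of activity distances), and `estimate_mean` is a hypothetical helper, not something from the talk.

```python
import math
import random


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean from a simple random sample, with its standard error."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    mean = sum(sample) / sample_size
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    std_error = math.sqrt(variance / sample_size)
    return mean, std_error


# Synthetic stand-in for a large table: 200,000 made-up distances in km.
data_rng = random.Random(42)
population = [data_rng.gauss(5.0, 2.0) for _ in range(200_000)]

full_mean = sum(population) / len(population)
est_mean, std_error = estimate_mean(population, sample_size=2_000)

print(f"full data: {full_mean:.3f}")
print(f"1% sample: {est_mean:.3f} +/- {std_error:.3f}")
```

Here a 1% sample pins down the mean to within a few hundredths of a kilometer. Whether that is good enough depends on the question; rare events and small subgroups are the usual cases where a sample falls short.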

Step 4: Use the right tools.

• In any given scenario, the “right tool” is one of the following:

  – The tool you already know and are comfortable with

  – Something you don’t know but suspect would work really well

  – Something that doesn’t exist yet

• It’s up to you to figure out which one it is

Step 4: Use the right tools.

[Figure: languages and technologies I used during 10 years at my last job, compared with languages and technologies I’ve used during 1 year at my current job]

• Be comfortable using a variety of tools

• Make time to learn new ones

• Build your own tools for repeatable analysis—once you know it’s worth it

• Open source: take advantage of the hard work of others, but make sure you understand what you’re using

• Give back

Step 4: Use the right tools.

• Many of the same principles apply to your “analytical toolkit”

• Try to learn when to stick with a well-worn approach and when to try something new

• Be skeptical of the conventional wisdom

  – Just because a metric or analytical approach is common doesn’t mean it’s the right thing to do for your situation

  – Typical example: A/B testing

Hypothesis testing (“A/B testing”)

[Diagram: USERS are split into GROUP A “Control” (90%, standard flow) and GROUP B “Treatment” (10%, experimental flow). Each group yields a # of successes and failures; these feed a test statistic, which leads to the DECISION: “reject/accept null hypothesis”.]

“Null hypothesis”: treatment has no effect

“Alternate hypothesis”: treatment has some effect
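The flow above can be sketched with a pooled two-proportion z-test. The slides do not say which test statistic was used, so this choice is an assumption, as are the function name and the example counts (a 90%/10% split of 10,000 users).

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under the null hypothesis both groups share one success rate.
    p_pool = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 9,000 control users (10% succeed), 1,000 treatment users (12% succeed).
z, p_value = two_proportion_z_test(900, 9_000, 120, 1_000)
alpha = 0.05  # assumed significance level
decision = "reject" if p_value < alpha else "fail to reject"
print(f"z = {z:.2f}, p = {p_value:.3f}: {decision} the null hypothesis")
```

The normal approximation behind this test is only reasonable when each group has a fair number of both successes and failures; with small counts an exact test is the safer choice.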

Thoughts about A/B testing

• A/B testing is hard to do well

  – Need lots of data and good estimates of baseline rates to have a chance at significance

  – Need lots of data infrastructure to do it quickly on a large scale

  – Need to manage variables such as multiple testing, changes in product and environment, interactions between tests, subjects

  – Need to make sure tests align with high-level vision and learning goals

• An A/B test can help with one very specific decision, but typically will not...

  – Help you understand how multiple different factors interact

  – Predict long-term reactions (the “taste test” phenomenon); need longitudinal study

  – Always give you the answer you want: results may be null or inconclusive

  – Tell you anything of any value whatsoever if you did it wrong
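To illustrate why "lots of data and good estimates of baseline rates" matter, here is a rough sample-size sketch using the standard normal-approximation formula for comparing two proportions. It assumes equal-sized groups, a two-sided test, and conventional defaults (5% significance, 80% power); none of these numbers come from the talk, and an uneven split such as 90%/10% needs more users in total.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_base, p_new, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_base -> p_new."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return math.ceil(numerator / (p_new - p_base) ** 2)


# Hypothetical baseline of 10%, with progressively smaller improvements.
for p_new in (0.12, 0.11, 0.105):
    n = sample_size_per_group(0.10, p_new)
    print(f"10.0% -> {p_new:.1%}: about {n:,} users per group")
```

The required sample grows roughly with the inverse square of the effect size, so halving the lift you hope to detect about quadruples the users you need. A poor guess at the baseline rate shifts the answer as well.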

Thoughts about A/B testing

Even when performed “correctly”, an A/B test may not tell you what you think it does

Step 5: Have faith and have fun

• Don’t try to understand everything all at once—keep looking from multiple angles and trust that more understanding will come in time

Step 5: Have faith and have fun

• Working with data from millions of engaged users is awesome

• Helping your company have a real impact on their lives is even more awesome

• All the tools are available to do truly amazing things

• Make sure everyone knows how much you love the data, and they will grow to love it too

Things we’re still working on

• Synthesizing knowledge and communicating results

• Data-driven products and features

• Analytics and instrumentation

• Giving back (open source, blogging, tutorials, talks)

http://arcolano.com

@arcolano

Thanks for listening! Questions?

http://www.runkeeper.com