A Confluence of Big Data Skills in Academic and Industry R&D Bill Howe, PhD Associate Director University of Washington eScience Institute

Big Data Talent in Academic and Industry R&D


Page 1: Big Data Talent in Academic and Industry R&D

A Confluence of Big Data Skills in Academic and Industry R&D

Bill Howe, PhD, Associate Director

University of Washington eScience Institute

Page 2: Big Data Talent in Academic and Industry R&D

The Fourth Paradigm

1. Empirical + experimental

2. Theoretical

3. Computational

4. Data-Intensive

Jim Gray

Page 3: Big Data Talent in Academic and Industry R&D

“All across our campus, the process of discovery will increasingly rely on researchers’ ability to extract knowledge from vast amounts of data… In order to remain at the forefront, UW must be a leader in advancing these techniques and technologies, and in making [them] accessible to researchers in the broadest imaginable range of fields.”

2005-2008

In other words:

• Data-intensive research will be ubiquitous

• It’s about intellectual infrastructure and software infrastructure, not only computational infrastructure

http://escience.washington.edu

Page 4: Big Data Talent in Academic and Industry R&D

A 5-year, US $37.8 million cross-institutional collaboration to create a data science environment


2014

Page 5: Big Data Talent in Academic and Industry R&D


“It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research

“The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera

Page 6: Big Data Talent in Academic and Industry R&D

5/7/2015 Bill Howe, UW 6

Jake Vanderplas

Page 7: Big Data Talent in Academic and Industry R&D


…the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design

The skills required to be a successful scientific researcher are increasingly indistinguishable from the skills required to be successful in industry.

Jake Vanderplas

Page 8: Big Data Talent in Academic and Industry R&D


Page 9: Big Data Talent in Academic and Industry R&D

“Data Science” is not the only example…

• Strong Math + PhD → Quant, on Wall Street

• Strong “Data” + PhD → Data Scientist, anywhere


Page 10: Big Data Talent in Academic and Industry R&D

increased statistical rigor and data-driven decision-making

increased sophistication in the use and development of software

Industry

Academia

Page 11: Big Data Talent in Academic and Industry R&D


Page 12: Big Data Talent in Academic and Industry R&D

Maximiliaan Schillebeeckx, Brett Maricque & Cory Lewis

Nature Biotechnology 31, 938–941 (2013) doi:10.1038/nbt.2706

Page 13: Big Data Talent in Academic and Industry R&D

WHAT SKILLS ARE NEEDED?


Page 14: Big Data Talent in Academic and Industry R&D


Page 15: Big Data Talent in Academic and Industry R&D

Drew Conway’s Data Science Venn Diagram


Page 16: Big Data Talent in Academic and Industry R&D


“I worry that the Data Scientist role is like the mythical ‘webmaster’ of the 90s: master of all trades.”

-- Aaron Kimball, CTO of Zymergen, formerly CTO of WibiData, formerly co-founder of Cloudera

Page 17: Big Data Talent in Academic and Industry R&D


What to look for in data science skills

tools vs. principles

desktop vs. cloud

data structures vs. statistics

hackers vs. analysts

Page 18: Big Data Talent in Academic and Industry R&D


Cambrian Explosion of Big Data Systems

tools principles

Page 19: Big Data Talent in Academic and Industry R&D


What are the abstractions of data science?

“Data Jujitsu”

“Data Wrangling”

“Data Munging”

Translation: “We have no idea what this is all about”

tools principles

Page 20: Big Data Talent in Academic and Industry R&D


1850s: matrices and linear algebra (today: engineers and scientists)

1950s: arrays and custom algorithms (today: C/Fortran performance junkies)

1950s: s-expressions and pure functions (today: language purists)

1960s: objects and methods (today: software engineers)

1970s: files and scripts (today: system administrators)

1970s: relations and relational algebra (today: industry data pros)

1980s: data frames and functions (today: statisticians)

2000s: key-value pairs + one of the above (today: NoSQL hipsters)

But what are the abstractions of data science?

tools principles

Page 21: Big Data Talent in Academic and Industry R&D


“80% of analytics is sums and averages”

-- Aaron Kimball, WibiData
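To make the "sums and averages" point concrete, here is a minimal sketch of that kind of analytics, written as a SQL aggregate run through Python's built-in `sqlite3` module. The `purchases` table, its columns, and the sample rows are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical sample data: (customer, purchase amount). Not from the talk.
ROWS = [
    ("alice", 10.0),
    ("alice", 30.0),
    ("bob", 5.0),
    ("bob", 15.0),
    ("bob", 40.0),
]


def summarize(rows):
    """Return (customer, total, average) per customer, computed in SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
        cur = conn.execute(
            "SELECT customer, SUM(amount), AVG(amount) "
            "FROM purchases GROUP BY customer ORDER BY customer"
        )
        return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for customer, total, mean in summarize(ROWS):
        print(f"{customer}: total={total:.2f} mean={mean:.2f}")
```

The same query shape (group, then sum or average) carries over to a production database largely unchanged.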

data structures statistics

Page 22: Big Data Talent in Academic and Industry R&D

“The intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.”

Nate Silver, Oct. 26, 2012

“…the argument we’re making is exceedingly simple. Here it is: Obama’s ahead in Ohio.”

Nate Silver, Nov. 2, 2012

“The bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.”

Nate Silver, Nov. 10, 2012

DailyBeast

fivethirtyeight.com

fivethirtyeight.com

source: randy stewart

Nate Silver

data structures statistics

Page 23: Big Data Talent in Academic and Industry R&D

Data Science Workflow


1) Preparing to run a model

2) Running the model

3) Interpreting the results

Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging

“80% of the work”

-- Aaron Kimball

“The other 80% of the work”

Academia puts far too much emphasis on this step

data structures statistics

Page 24: Big Data Talent in Academic and Industry R&D

Problem

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

data structures statistics

Page 25: Big Data Talent in Academic and Industry R&D

“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval, dissection into data blocks and processing steps, order in which steps are performed to match memory/time requirements, file formats required by software used).

In addition we actually spend quite some time in iterations fixing problems with certain features (e.g. capping ENCODE data), testing features and feature products to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs human-derived variants)

So roughly 50% of the project was testing and improving the model, 30% figuring out how to do things (engineering) and 20% getting files and getting them into the right format.

I guess in total [I spent] 6 months [on this project].”

At least 3 months on issues of scale, file handling, and feature engineering.

Martin Kircher, Genome Sciences

Why?

3k NSF postdocs in 2010

$50k / postdoc

at least 50% overhead

maybe $75M annually

at NSF alone?
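One way to reproduce the back-of-envelope figure above is sketched below. It assumes that "overhead" means the share of a postdoc's time lost to data handling (consistent with the "at least 3 months" out of 6 quoted on this slide), not institutional overhead on salary. That reading, and the function and parameter names, are assumptions made for illustration.

```python
def annual_data_handling_cost(postdocs, cost_per_postdoc_usd, handling_share):
    """Estimate yearly spend on data handling rather than science.

    handling_share is the assumed fraction of each postdoc's time
    that goes to data handling (0.5 means half of their time).
    """
    if not 0.0 <= handling_share <= 1.0:
        raise ValueError("handling_share must be between 0 and 1")
    return postdocs * cost_per_postdoc_usd * handling_share


if __name__ == "__main__":
    # Slide inputs: 3k postdocs, $50k each, at least 50% overhead.
    estimate = annual_data_handling_cost(3_000, 50_000, 0.5)
    print(f"${estimate / 1e6:.0f}M per year")
```

Under this reading, 3,000 × $50,000 × 0.5 gives the slide's $75M. Because the slide says "at least 50%", that figure would be a lower bound.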

desk cloud

Page 26: Big Data Talent in Academic and Industry R&D

With “manual” approaches, you can comfortably handle…

…up to 1 GB (volume)

…up to 10 data sources (variety)

…up to 1% churn/day (velocity)

…up to 1% bad data (veracity)

…up to 10 collaborators

But we’re seeing a 10x-100x increase in every dimension, even under modest assumptions

desk cloud

data structures statistics

Page 27: Big Data Talent in Academic and Industry R&D

US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”

-- McKinsey Global Institute

hackers analysts

Page 28: Big Data Talent in Academic and Industry R&D

Where do you store your data?

src: Conversations with Research Leaders (2008)

src: Faculty Technology Survey (2011)

Other: 5%

Department-managed data center: 6%

External (non-UW) data center: 12%

Server managed by research group: 27%

Department-managed server: 41%

External device (hard drive, thumb drive): 66%

My computer: 87%

Lewis et al 2011

Page 29: Big Data Talent in Academic and Industry R&D

Conversations with DS Hiring Managers

• “How to ask the right questions and communicate results”

– DS: “I tried three methods, two didn't work, achieved 80% accuracy”

– Manager: “Ok, so… what do we do?”

• “Can you properly tell a story with the data, and properly persuade people?”

• “For my team, engineering/stats skills need to be good, not great.”

hackers analysts

Page 30: Big Data Talent in Academic and Industry R&D

If I had to pick 2…

• Experimental Design

– How to design a statistical test?

– How to interpret significance of a test?

– A/B tests

– More complicated sampling methods

– Sources of bias

– Skewed data

• SQL and Databases

– Mentioned on nearly every DS job description

– Why? Easy scalability, production data sources, IT integration
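As a small illustration of the experimental-design items above (designing a test, interpreting significance, A/B tests), the sketch below runs a two-sided two-proportion z-test using only the Python standard library. The function name and the example counts are hypothetical. The test relies on the normal approximation, so it assumes reasonably large samples in both arms.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A).

    Returns (z, p_value), using the pooled-proportion standard error.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal:
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical A/B test: 100/1000 conversions vs. 150/1000.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says only that a difference this large would be unlikely if the two arms converted at the same rate. It does not address the sources of bias or skewed data listed above, which have to be handled in the design itself.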


Page 31: Big Data Talent in Academic and Industry R&D

http://cds.nyu.edu/ http://bids.berkeley.edu/ http://escience.washington.edu/

Page 32: Big Data Talent in Academic and Industry R&D


http://escience.washington.edu

Data Scientist and Research Scientist positions available
