Data Science Developing a New Profession ©2014 Gary Rector

Data Science - DAMA Phoenixdama-phoenix.org/.../2014/02/DAMAPhoenixGaryRectorDataScience.pdf · Turing and Enigma, circa 1942 ... A Plan for Success 1. Continue Personal Development


Data Science

Developing a New Profession

©2014 Gary Rector


Influences

Data Science

Math

Data Engineering

Scientific Method

Business Knowledge

Advanced Computing

Visualization

Curiosity


Based on a diagram by Calvin Andrus


“Sexiest Job of the 21st Century”

Harvard Business Review, October 2012:

“…distributed file system processing…related open-source tools, cloud computing, and data visualization…are important breakthroughs, [but] at least as important are the people with the skill set (and the mind-set) to put them to good use. On this front, demand has raced ahead of supply.”


Some Relevant Skills

• Math

– Probability and statistics

– Algebra, calculus, logic, set theory

• Data and Software Engineering

– Algorithms and programming

– Representation, modeling

– Pattern recognition, data mining

• Business Knowledge

– Domain-specific

• Communication skills!


The Scientific Method

• Question! (Be fearless.)

• Observe, Research, Model (Work.)

• Hypothesize, Predict (Think.)

• Experiment, Test, Document (Work!)

• Analyze, Revise (OK to be wrong.)

• Communicate! (Share.)


Definition

Data Science is:

The discipline of applying the scientific method to collections of data, using appropriate technology, to reveal previously unknown information.


A Little History

I am a member of the 3rd generation of modern computer scientists.

• Gen 0: Babbage, Lovelace, Jacquard, …

• Gen 1: Turing, von Neumann, Hopper, Eckert, ...

• Gen 2: Wang, Cray, Dijkstra, Knuth, Wirth, …

• Gen 3: Yourdon, Thompson, Cerf, Berners-Lee, …

• Gen 4: Brin, Page, Zuckerberg, Stone, …

The first commercial computer was installed a year after I was born.


Actuarial Science

• John Graunt, 1662

– mortality tables

• James Dodson, 1762

– Equitable Life Assurance Society

• National Council on Workmen’s Compensation, 1920

– Calculation of rates required 2 full months of continual work by actuary teams

• 1930s and 1940s

– Development of stochastic techniques

Source: Wikipedia


Turing and Enigma, circa 1942

Alan Turing and the Bletchley Park crew used the Bombe machines, mathematics, and luck to analyze mountains of radio transcriptions and crack the German Enigma code, helping to end WWII.

Some of Turing’s work remains classified.


The US Presidential Election of 1952

Walter Cronkite and Charles Collingwood report live on CBS that the frontrunner is Stevenson, but by 8:30 pm EST, with a tiny percentage of votes counted, their Univac predicts a landslide win with 100-to-1 odds in favor of Eisenhower.

Data science enters the public mind!

Data is the Driver


Source: Volvo


A Big Bonus

• Netflix awarded $1 million to a team of scientists who improved the Netflix recommendation system’s ability to predict which movies you will like.


Source: Netflix


Saving Lives

At Stanford University, a machine learned to diagnose breast cancer better than human doctors by discovering an innovative method that considers more factors in a tissue sample.


Source: Stanford University School of Medicine


Dramatic Cost Reduction

UPS saves $600 MM/year in fuel costs by avoiding left turns (less time waiting).


Source: NY Times 12/9/2007


Disease Surveillance


©2013, Association for Computing Machinery


Increasing the Value of Data

Data

Information

Knowledge

Wisdom

Selective Use

It’s all about the “I” in “IT”

Any hardware technology is only a peripheral part of a data science solution.

The heart of a solution is the algorithm.

The value of a solution lies in the resultant information.


A Brief Aside: “Metrics”

A metric is an abstraction of the notion of distance. A true metric has 4 properties:

1. D(x,y) >= 0

2. D(x,y) = 0 if and only if x=y

3. D(x,y) = D(y,x)

4. D(x,z) <= D(x,y) + D(y,z)

Most business uses of “metric” just mean “measurement”, and some have no precise meaning at all; they may simply be opinions.
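The four properties above can be spot-checked numerically. The sketch below is illustrative only: the function names and sample points are assumptions, and passing the check on a handful of points does not prove that a function is a metric, although a single failure does disprove it. Ordinary Euclidean distance passes; squared Euclidean distance, a common "distance-like" measurement, fails the triangle inequality.

```python
import itertools
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)


def squared_euclidean(p, q):
    """Squared distance: looks like a distance but is not a true metric."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def is_metric_on(points, d, tol=1e-12):
    """Check the four metric properties on every combination of sample points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                         # 1. D(x,y) >= 0
            return False
        if (d(x, y) == 0) != (x == y):          # 2. D(x,y) = 0 iff x = y
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # 3. D(x,y) = D(y,x)
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # 4. D(x,z) <= D(x,y) + D(y,z)
            return False
    return True


# Hypothetical sample points chosen for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

print(is_metric_on(points, euclidean))
print(is_metric_on(points, squared_euclidean))
```

The squared distance fails because going from (0, 0) to (2, 0) directly costs 4, while stopping at (1, 0) on the way costs only 1 + 1 = 2.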


Business Goals

Deliver actionable information through data analytics to:

• Reduce risks

• Reduce costs

• Increase revenue

• Increase efficiency


BI and Analytics

Many enterprises associate data science with Business Intelligence and place it in an organization that does “advanced analytics”.

This is not invalid, but data science can contribute much more than just reports.


Business Value

Of the 4 major categories of Business Intelligence (BI):

• Operational Reporting

• Analysis

• Modeling

• Prediction

Prediction has the highest business value.

Unfortunately, it is also the most complex.

Source: The Data Warehousing Institute (TDWI)


React vs. Prevent vs. Predict

EPRI reports costs (per horsepower per year) of:

– $17 to $18 for reactive maintenance

– $11 to $13 for preventive maintenance

– $7 to $9 for predictive maintenance

(About half the cost of reactive maintenance!)

Source: EPRI Advanced Electric Motor Predictive Maintenance Project
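The comparison can be worked through in a few lines. This is a rough sketch: taking the midpoint of each quoted range is an assumption made here for simplicity, and the fleet size at the end is a hypothetical figure, not an EPRI number.

```python
# Cost ranges in dollars per horsepower per year, as quoted on the slide.
costs = {
    "reactive": (17, 18),
    "preventive": (11, 13),
    "predictive": (7, 9),
}


def midpoint(low_high):
    """Midpoint of a (low, high) range; an assumed summary of the range."""
    low, high = low_high
    return (low + high) / 2


baseline = midpoint(costs["reactive"])
for strategy, cost_range in costs.items():
    mid = midpoint(cost_range)
    saving = 1 - mid / baseline
    print(f"{strategy:<10} ${mid:5.2f}  saves {saving:.0%} vs. reactive")

# Hypothetical fleet: 50 motors of 100 hp each (illustrative only).
fleet_hp = 50 * 100
annual_saving = fleet_hp * (baseline - midpoint(costs["predictive"]))
print(f"Annual saving for {fleet_hp} hp: ${annual_saving:,.0f}")
```

With midpoints of $17.50 and $8.00, predictive maintenance comes out a little under half the cost of reactive maintenance, which is consistent with the slide's claim.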


Fuel for Decision-making

• Analytics can drill deep or reveal new big-picture perspectives

• But data often has a short “shelf life”

• Software delivers analyses fast, helping managers respond quickly


Some Tools

• Statistical analysis packages

• Probabilistic graphical models

• Markov Chain Monte Carlo algorithms

• Simulated annealing

• Textual disambiguation

• Visualization methods

• Cluster and pattern detection


Visualization

Graphics communicate faster than spreadsheets or tabular reports for:

• Dashboards, scorecards, alerts

• Geographic and Spatial Information

• Visual discovery and analysis


Which is Easier to Understand?

[Bar chart: # Visitors (axis 0 to 200) by # Pages Read]

# Pages Read    # Visitors
1               190
2               82
3               30
4               15
5               8
6               4
7               7
8               7
9               3
10 or more      27
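To make the chart-versus-table comparison concrete, here is a minimal sketch that draws the slide's visitor counts as a text bar chart. It assumes the ten values pair with the page counts in the order they appear on the slide; the function name and bar width are illustrative choices.

```python
# Visitor counts by number of pages read, transcribed from the slide.
visitors_by_pages = {
    "1": 190, "2": 82, "3": 30, "4": 15, "5": 8,
    "6": 4, "7": 7, "8": 7, "9": 3, "10 or more": 27,
}


def text_bar_chart(data, width=40):
    """Return one line per category, scaling the largest value to `width` characters."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:>{label_width}} | {bar} {value}")
    return lines


chart = text_bar_chart(visitors_by_pages)
print("\n".join(chart))
```

Even in plain text, the steep drop after the first page and the bump at "10 or more" stand out faster than they do in the column of numbers.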


Pattern Detection

Relationships in multi-dimensional data are hard to find without software help.

What is hidden in this sample data?

x      y      z
0.5    2.5    9.5
1      4      5.4
1.2    1.2    2.4
1.2    6.7    8.8
1.3    7.6    6.2
1.6    5.6    4.2
2.2    0.6    9.7
2.4    3.3    1.2
2.5    2.6    11.3
2.5    6.3    1.9
2.6    4.4    0.9
2.6    8.1    3.4
3.3    1.4    1.7
3.3    5.3    2.2
3.4    2.5    8.9
3.4    6.3    4.8
3.5    4.2    5
3.5    7.2    11.9
3.7    0.3    11.5
4      7      7.5
4.3    6      11.8

Here is part of a list of 300 data points in 3 dimensions. For example, they might represent Work Order, Crew, and Material or Temperature, Hour, and Load. Looking at the raw data spreadsheet is not helpful, so I’ve plotted this data to show the relationship between pairs of dimensions…

[Plots of the sample data showing pairs of dimensions; images not reproduced]

This example has 3 dimensions, but a real warehouse may have dozens of dimensions…

What is waiting to be discovered?
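One quick numeric complement to plotting pairs of dimensions is to compute a correlation for each pair. The sketch below uses only the first few rows of the sample shown earlier, not the full 300 points, so its numbers say nothing about the real data set. Pearson correlation is a technique chosen here for illustration; it detects only linear relationships, so clusters or curved patterns still need a plot or a clustering method.

```python
import itertools
import math

# First rows of the slide's (x, y, z) sample; the full list is not reproduced.
rows = [
    (0.5, 2.5, 9.5), (1.0, 4.0, 5.4), (1.2, 1.2, 2.4), (1.2, 6.7, 8.8),
    (1.3, 7.6, 6.2), (1.6, 5.6, 4.2), (2.2, 0.6, 9.7), (2.4, 3.3, 1.2),
]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Split the rows into named columns, then score every pair of dimensions.
columns = dict(zip("xyz", zip(*rows)))
pairwise = {
    (m, n): pearson(columns[m], columns[n])
    for m, n in itertools.combinations("xyz", 2)
}

for (m, n), r in pairwise.items():
    print(f"{m} vs {n}: r = {r:+.2f}")
```

With dozens of dimensions the number of pairs grows quickly, which is one reason this kind of screening is left to software.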


Descriptive Applications

• Typical BI trend reports

• eDiscovery

• Data loss prevention

• Phone call metadata mining

• NASA’s 60 yrs. in 15 seconds video


Predictive Applications

• Spare Parts Inventory Management

• Crew Scheduling

• Theft & Fraud Detection

• Demand Forecasting

• Weather Forecasting


Case Histories

• Data loss prevention

• Mortgage risk

• Golf tee-time pricing

• Retail price optimization

• Valentine’s Day promotions


We are Pioneers

This is a very new discipline. There is no single “DSBOK”.

The challenges are more cultural than technical. Ethics matter.

We must seize this opportunity now to make a difference for the profession and for the world at large.


A Plan for Success

1. Continue Personal Development

2. Integrate Data Science into Business

3. Nurture our Analytics Community


Personal Development

Self-assessment:

Weaknesses & strengths, likes & dislikes.

Decide:

Generalist or specialist? Which specialty?

Commit:

Never stop learning. Do no evil.


OCM

• Teams

• Expectations

• Basics

• Next steps

• Stages of data use

• Difficulties of achieving prescriptive use


Overload!

• No one knows ALL of this!

• Multi-disciplinary teams are needed

• Use academic sources

• Leverage vendor resources

• Nurture in-house subject experts

Set realistic expectations:

We are all in SALES.


Firmly Establish Basics

• Data Architecture

• Data Governance

• Data Quality

Getting the basics right is an absolute prerequisite to doing advanced science sustainably.


Apply Data Engineering

• Master Data Management

• Reference Data Management

• Normalization, Reduction, Projection

• Software Development Disciplines


Allow Greater Freedom

Bending some corporate rules may help nurture creativity needed for discovery.

Data Science might be best started with a skunkworks. (That may have already happened in larger enterprises.)

At the very least, give your data science team the freedom to experiment without punishment for mistakes.


User Support Needs

1. Basic reports are self-service.

2. Complex reports need developers.

3. Experts are needed to create models for forecasting and statistical analysis.


Stages of Data Use

1. Descriptive

– What do we have?

– What happened? When?

2. Predictive

– What is likely to happen?

– What are expected costs?

3. Prescriptive

– Recommended actions & timing

– Possible automation of actions

Moving to Stage Two

• Establish data quality and governance

• Build a community of analysts

– Idea exchange, discussions

– Peer support

• Embed data science with the business

• Employ more visualizations in support of predictive analytics

• Let the business guide the science


Stage Three Difficulties

• People don’t want change

• People don’t trust the technology

• People fear losing their jobs

• Myths hinder progress

• Even the law can be a barrier


Prescriptive Applications

• Price optimization

• Manufacturing plant scheduling

• Computerized securities trading

• Aircraft autopilot

• Self-driving cars

• Medical devices


Ethical Questions

• Privacy versus Public Security

• Quality Control and Correction

• Opt-in and Opt-out Rights

• Ownership and Monetization

• Limits of Liability and Responsibility


Summary

• Business wants more Data Science

• Data Science is a team effort

• Data Science is immature

• More and more data is available now that can be mined for predictive uses

• People problems > technical problems
