Presented by: DATA Big Data & Predictive Analytics SO YOU WANT TO BE A DATA SCIENTIST © Data-Magnum 2016

DATA SCIENTIST - Meetupfiles.meetup.com/11097942/So You Want to be a Data...Some Data is Big – But Not Very Often 1,220 Respondents 72 countries Rexer Analytics Respondents reported


Presented by:

DATA

Big Data & Predictive Analytics

SO YOU WANT TO

BE A

DATA SCIENTIST

© Data-Magnum 2016


Four Perspectives

• Data
• Tools
• Data Science Skills
• Business / Employer


Why Start with Data?


80 % CRISP-DM


All the Hoopla over Hadoop

A Little History (timeline: 2002, 2004, 2006, 2008, 2009)

• Google develops a proprietary search indexing tool based on Bigtable and MapReduce.
• Google releases research papers (10/03 and 12/04), read by Cutting and others.
• Doug Cutting is working on an open source version of the same, “Nutch”.
• Cutting at Yahoo. Renamed Hadoop. First prototype launched 2006.
• Hadoop becomes open source at the Apache Software Foundation.
• Yahoo is the first commercial implementation, 2008.
• First Hadoop Developers Conference.
• Facebook, Twitter, eBay adopt.
• Multiple startups spin off to commercialize, including Hortonworks, Cloudera, MapR.


Some Data is Big – But Not Very Often

1,220 Respondents 72 countries

Rexer Analytics

Respondents reported that their ‘typical’ data set size was:

• 90%: typically < 1 to 100 Million records
• 60%: typically < 100,000 to 1 Million records



How NoSQL Changed Data Science

Data types and stores:
• Structured: RDBMS
• Semi-Structured: Key Value, Document, Column, Graph
• Unstructured: Key Value

Insights are: Specific, Directional

• Predictive Modeling
• Data Lakes
• Recommenders
• Natural Language Processing
• IoT
• Deep Learning
• Reinforcement Learning
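
The storage models named on this slide (RDBMS, key value, document, column, graph) are easier to compare side by side. The sketch below is illustrative only: the customer record, its field names, and its values are hypothetical, and plain Python structures stand in for real database engines.

```python
import json

# Hypothetical customer record expressed in each storage model from the slide.
# Plain Python structures stand in for real databases (an assumption for illustration).

# Structured (RDBMS): fixed columns, one tuple per row.
rdbms_columns = ("customer_id", "name", "city")
rdbms_row = (42, "Ada", "Ventura")

# Key value: an opaque value looked up by a single key.
key_value = {"customer:42": '{"name": "Ada", "city": "Ventura"}'}

# Document: nested, self-describing fields; records need not share a schema.
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "Ventura"},
    "orders": [{"sku": "A1", "qty": 2}],
}

# Column (wide-column): row key -> column family -> column -> value.
column_family = {
    42: {
        "profile": {"name": "Ada", "city": "Ventura"},
        "activity": {"last_login": "2016-01-15"},
    }
}

# Graph: nodes plus typed edges between them.
graph_nodes = {42: {"label": "Customer", "name": "Ada"}, "A1": {"label": "Product"}}
graph_edges = [(42, "PURCHASED", "A1")]


def city_from_each_model():
    """Read the same fact (the customer's city) from four of the representations."""
    return {
        "rdbms": rdbms_row[rdbms_columns.index("city")],
        "key_value": json.loads(key_value["customer:42"])["city"],
        "document": document["address"]["city"],
        "column": column_family[42]["profile"]["city"],
    }


if __name__ == "__main__":
    print(city_from_each_model())
```

The same fact is reachable in every model; what changes is how much structure the store knows about, which is why the less structured models suit text, recommender, and sensor data.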


The Tools Perspective


All Those Algorithms Answer Only 5 Questions

1. Is this A or B?

2. Is this weird?

3. How much – or – How many?

4. How is this organized?

5. What should I do next?
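
One common reading of the five questions is that each corresponds to a family of algorithms. The mapping below is that reading, not something stated on the slide, and the `is_weird` helper is a hypothetical, deliberately simple z-score check for question 2; the order counts are made-up illustration data.

```python
from statistics import mean, stdev

# Assumed mapping from each question to an algorithm family (not from the slide).
QUESTION_TO_FAMILY = {
    "Is this A or B?": "classification",
    "Is this weird?": "anomaly detection",
    "How much or how many?": "regression",
    "How is this organized?": "clustering",
    "What should I do next?": "reinforcement learning",
}


def is_weird(value, history, threshold=3.0):
    """Flag `value` if it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No spread in the history: anything different from it is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    daily_orders = [100, 102, 98, 101, 99, 103, 97]  # hypothetical data
    print(is_weird(150, daily_orders))
    print(is_weird(101, daily_orders))
```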


Three Types of Machine Learning

Supervised
• Have Data
• Data Has Labels
• Learn by Example

Reinforcement
• No Data
• Learn by Trial and Error

Unsupervised
• Have Data
• No Labels
• Learn by Example
• See If There’s a Pattern in There
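
The difference between having labels and having none can be seen in a few lines. The sketch below is a minimal illustration, not a recommended toolkit: `fit_line` learns from labeled pairs (least squares regression), while `two_means` looks for a pattern in unlabeled points (a one-dimensional k-means with k = 2). Both function names and all data values are hypothetical.

```python
def fit_line(xs, ys):
    """Labeled data: ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_means(points, iterations=10):
    """Unlabeled data: 1-D k-means with k=2, seeded at the min and max points."""
    low, high = min(points), max(points)
    for _ in range(iterations):
        near_low = [p for p in points if abs(p - low) <= abs(p - high)]
        near_high = [p for p in points if abs(p - low) > abs(p - high)]
        if not near_high:  # all points identical; nothing to split
            break
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return low, high


if __name__ == "__main__":
    # Each x comes with a known answer y (the label).
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))
    # No answers given; the algorithm only groups what it sees.
    print(two_means([1.0, 1.2, 0.8, 9.0, 9.4, 8.6]))
```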


Three Types of Machine Learning

Supervised
• Decision trees / Random Forest
• Naïve Bayes classification
• Least squares regression
• Logistic regression
• Support vector machines
• Ensemble methods – Bagging, Boosting, Super Learners
• Neural Networks
• Linear Genetic Programs

Reinforcement
• Q-Learning
• PyBrain
• Mostly Custom Agents

Unsupervised
• Clustering
• Centroid-based algorithms
• Connectivity-based algorithms
• Density-based algorithms
• Probabilistic
• Dimensionality Reduction
• Neural networks / Deep Learning
• Principal Component Analysis
• Singular Value Decomposition
• Independent Component Analysis
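
Q-Learning, named above, is the easiest of these to show from scratch because it needs no data set, only an environment to try actions in. The sketch below is a minimal tabular version under stated assumptions: the five-state corridor, the reward of 1 at the right-hand end, and all hyperparameter values are hypothetical choices made for illustration.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # action index 0 steps left, index 1 steps right


def step(state, move):
    """Apply a move; return (next_state, reward, done)."""
    nxt = min(max(state + move, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore with probability epsilon, or when the two values tie.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: bootstrap from the best action in the next state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q


def greedy_policy(q):
    """Best learned action for each non-terminal state."""
    return ["left" if left > right else "right" for left, right in q[:-1]]


if __name__ == "__main__":
    print(greedy_policy(train()))
```

The agent starts with no examples at all and improves only through trial and error, which is the contrast with the supervised and unsupervised lists.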


2015 Algorithm Usage

1,220 Respondents 72 countries

Rexer Analytics


R versus Python versus SAS

Which do you prefer to use? Most DS use multiple languages but everyone has a favorite.

Burtch Works www.burtchworks.com/2015/05/21/2015-sas-vs-r-survey-results


The Data Scientist’s Perspective

Data Wrangler

Model Jockey

Data Scientist


Types of Data Scientists – Self Described

“Analyzing the Analyzers” by Harris, Murphy, and Vaisman, 2013. http://www.oreilly.com/data/free/analyzing-the-analyzers.csp

• Leader, Businessperson, Entrepreneur
• Jack of all trades, Artist, Hacker
• Developer, Engineer
• Researcher, Scientist, Statistician


What You Need to Know

• Foundational Statistical Theory

– Probability, statistical analysis, sampling theory, hypothesis testing, statistical distributions, correlation, standard deviation, basic regression

• Foundational Programming Skills

– R, SAS, Python, SQL

• Machine Learning

– Supervised and Unsupervised (leave Reinforcement Learning for later)

• Big Data Toolbox

– Hadoop, Spark, how to operationalize predictive models to create business value

Amy Gershkoff, Chief Data Officer, Zynga
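
Several of the foundational statistics items in the list above (standard deviation, correlation, hypothesis testing) fit in a few lines of standard-library Python. This is a sketch for orientation rather than a substitute for a statistics package; the function names and the ad-spend and sales figures are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sample has no variance")
    return cov / sqrt(sxx * syy)


def t_statistic(sample, mu0):
    """One-sample t statistic for the hypothesis that the true mean is mu0."""
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))


if __name__ == "__main__":
    ad_spend = [1, 2, 3, 4, 5]  # hypothetical data
    sales = [2, 4, 5, 4, 5]     # hypothetical data
    print("standard deviation of sales:", stdev(sales))
    print("correlation:", pearson_r(ad_spend, sales))
    print("t statistic vs. mean of 3:", t_statistic(sales, 3.0))
```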


The Business or Employer’s Perspective


Two Markets

The Big Web Developers Market


The Core Data Science Market

Banking, Insurance, Mortgage Lending, Brokerage, Telecom, Healthcare, e-commerce, B&M Retail, Utilities, Manufacturing, Transportation, Education, Government Services


Salary Increases as Experience & Responsibility Increase

Median Base $112,000


The Opportunity – Good News / Bad News

2nd Best Work/Life Balance and Plenty of Openings Going Unfilled

Market Penetration – 12% in 2012 (Gartner) – Guesstimating Maybe 20% to 25% Today.

Citizen Data Scientists and Fully Automated DS


Summing It Up

• Should you specialize?

• Build 3 competencies (Your Focus)

– Industry

– Business Process (e.g. customer acquisition, fraud detection)

– Tool Sets (languages, analytic platforms, data platforms)

• Have a life. Join a team. Decide where you want to live.


Some additional references

How to Become a Data Scientist http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist

So You Want to be a Data Scientist http://www.datasciencecentral.com/profiles/blogs/so-you-want-to-be-a-data-scientist

The New Rules for Becoming a Data Scientist http://www.datasciencecentral.com/profiles/blogs/the-new-rules-for-becoming-a-data-scientist

Become a member (for free) of DataScienceCentral.com. Use the search feature and search for “how to become a data scientist”: http://www.datasciencecentral.com/page/search

Join some Meet Ups – Westlake Village Data Science Meet Up 2nd Tuesday of each month at 5:30

Practice on some Kaggle competitions https://www.kaggle.com/


Other Blogs by Bill Vorhies http://www.datasciencecentral.com/profiles/blog/list?user=0h5qapp2gbuf8


Contact Information

Bill Vorhies

President & Chief Data Scientist

Data-Magnum

[email protected]

www.Data-Magnum.com

818.257.2035

“I shall find a way or make one.” Admiral Robert Peary
