Guerrilla Analytics: Tactics for Coping with Data Science Reality
Enda Ridge, PhD
23 February 2015
#GuerrillaAnalytics Copyright Enda Ridge 2015

What we are told about Data Science

“Data is the new science. Big data holds the answers.”

“the sexy job in the next 10 years will be statisticians”

“Data Scientist: The Sexiest Job of the 21st Century”

“Information is the oil of the 21st century, and analytics is the combustion engine.”

http://www.gapminder.org/
http://www.statistics.com/data-science-quotes/
https://github.com/mbostock/d3/wiki/Gallery

The Data Science Reality

Hi, we need an update on the insurance policy classification work. It’s going to the Head of Underwriting this afternoon.

Um. Which work? Jo and I are trying two different approaches. And Jo’s on holidays.

I’ll check my mailbox and send you my spreadsheet from last week.

Just need the change in uplift since last week.

Err... the policy population changed with the extra system extract on Tuesday.

And we added a bunch of business rules to accommodate that... so we can’t go back to the earlier numbers.

My Journey

Mechanical Engineer

PhD Computer Science

• “Design of Experiments for the Tuning of Algorithms”

Boutique Consultancy

Forensic Data Analytics

Senior Manager

Constraints: computation takes time!

Dynamic, Repeatable, Reproducible

Dynamic, Constrained

Dynamic, Constrained, Reproduce, Test, Audit

What is Data Science?

Data, Analytics, Insight

Common Misconception

Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22

Project Reality – Dynamic

• Data

• People

• Understanding

• Rules

• Code

Project Reality – Constraints

• Time

• People

• Technology

• Data

Project Reality – Transparency

• Explainable

• Testable

• Reproducible

• Repeatable

Guerrilla Analytics

Data

• Extraction

• Receipt

• Loading

Analytics

• Transform

• Algorithms

• Consolidate

Insight

• Reporting

• Work Products

Disruptions

Guerrilla Analytics Principles

Maintaining Data Provenance mitigates the effect of disruptions on your work

Guerrilla Analytics Principles

• Principle 1: Space is cheap, confusion is expensive

• Principle 2: Prefer simple, visual project structures over heavily documented and project-specific rules

• Principle 3: Prefer automation with program code over manual graphical methods

• Principle 5: Version control changes to data and program code

Etc...

Data Receipt

Guerrilla Analytics Environment

• Lost Data

• Multiple Copies of data

• No supporting information

• Local copies of data

• Renamed data

Data Receipt

Guerrilla Analytics Approach

• Have one data location

• Unique identifiers for data

• Data log

• Keep supporting material near its data
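
The four tactics above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the author's tooling: the function name `receive_data`, the `D001`-style identifiers, and the `data_log.csv` file with its column names are all hypothetical.

```python
import csv
import shutil
from datetime import date
from pathlib import Path


def receive_data(source_file, data_root, received_from, description):
    """Copy a delivered file into the single data location and log it.

    Hypothetical conventions (not from the slides): identifiers look like
    D001, D002, ... and the log is data_log.csv inside the data root.
    """
    data_root = Path(data_root)
    data_root.mkdir(parents=True, exist_ok=True)
    log_path = data_root / "data_log.csv"

    # Next identifier = number of existing D-prefixed folders + 1.
    existing = [p for p in data_root.iterdir() if p.is_dir() and p.name.startswith("D")]
    data_id = f"D{len(existing) + 1:03d}"

    # Each delivery gets its own folder; the file keeps its original name.
    target_dir = data_root / data_id
    target_dir.mkdir()
    source_file = Path(source_file)
    shutil.copy2(source_file, target_dir / source_file.name)

    # Supporting information is recorded in a log kept beside the data.
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if write_header:
            writer.writerow(
                ["data_id", "original_name", "received_from", "received_on", "description"]
            )
        writer.writerow(
            [data_id, source_file.name, received_from, date.today().isoformat(), description]
        )
    return data_id
```

Because every delivery lands in its own numbered folder under one root, a replacement file with the same name never overwrites an earlier delivery, and the log answers "where did this come from?" without searching mailboxes.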

Data Load

Files

Crazy-name spreadsheet 1

Crazy-name spreadsheet 2

Crazy-name spreadsheet 3

FNU810A

A_very_long_named_file_v0.2.1.pdf

Analytics Environment

User_markups

Customer_Table

Finance_Report_v1.0

Guerrilla Environment

• Renamed files

• Scattered inconsistent locations

• Multiple versions of files

• Replacements of files

Data Load

Files

Crazy-name spreadsheet 1

Crazy-name spreadsheet 2

Crazy-name spreadsheet 3

FNU810A

A_very_long_named_file_v0.2.1.pdf

Analytics Environment

Crazy-name spreadsheet 1

Crazy-name spreadsheet 2

Crazy-name spreadsheet 3

FNU810A

A_very_long_named_file_v0.2.1.pdf

Guerrilla Analytics Approach

• One-to-one mapping from files to datasets

• Keep crazy names

• Minimize prep work
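
As a rough illustration of the approach above, the sketch below loads one file into one table that carries the file's own name, with every column as text and no cleaning. It assumes CSV input and a SQLite database purely for the sake of a runnable example; the helper names `quote_identifier` and `load_file_as_is` are hypothetical.

```python
import csv
import sqlite3
from pathlib import Path


def quote_identifier(name):
    """Quote a name so SQLite accepts spaces, hyphens and dots in it."""
    return '"' + name.replace('"', '""') + '"'


def load_file_as_is(connection, csv_path):
    """Load one CSV file into one table named exactly after the file.

    Every column is loaded as text and nothing is cleaned or renamed,
    so the dataset can always be traced back to the file it came from.
    """
    csv_path = Path(csv_path)
    with csv_path.open(newline="") as handle:
        rows = list(csv.reader(handle))
    header, body = rows[0], rows[1:]

    table = quote_identifier(csv_path.name)
    columns = ", ".join(f"{quote_identifier(col)} TEXT" for col in header)
    placeholders = ", ".join("?" for _ in header)

    connection.execute(f"CREATE TABLE {table} ({columns})")
    connection.executemany(f"INSERT INTO {table} VALUES ({placeholders})", body)
    connection.commit()
    return csv_path.name
```

Keeping the awkward name means a query such as `SELECT * FROM "Crazy-name spreadsheet 1.csv"` points unambiguously at the file that was received, which is the link the slide is asking for.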

Guerrilla Analytics Environment

• Multiple languages

• Many code files

• Variety of outputs

• Data manipulation on file system

• Data manipulation in analytics environment

• Combinations of tools

Analytics: Code

Guerrilla Analytics Environment

WP_024

• Rates cleaned.SQL

• Rates_by_city_v1_FINAL.R

• Rates_by_city_v2.R

• MAP_POSTCODES.SQL

Guerrilla Analytics Approach

WP_024

• 010_MAP_POSTCODES.SQL

• 030_Rates cleaned.SQL

• 050_Rates_by_cityv2.R
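
The numeric prefixes in the second WP_024 listing make the run order readable from the file names alone. A small sketch of how that convention could be checked is shown below; the three-digit pattern and the function name `execution_order` are assumptions made for the example, not something the slides specify.

```python
import re
from pathlib import Path

# Assumed convention: a three-digit prefix followed by an underscore.
PREFIX = re.compile(r"^(\d{3})_")


def execution_order(work_product_dir):
    """Return code files in a work product folder sorted by numeric prefix.

    Files without a prefix are returned separately so they can be renamed
    or removed; they have no defined place in the run order.
    """
    ordered, unnumbered = [], []
    for path in Path(work_product_dir).iterdir():
        if not path.is_file():
            continue
        match = PREFIX.match(path.name)
        if match:
            ordered.append((int(match.group(1)), path.name))
        else:
            unnumbered.append(path.name)
    return [name for _, name in sorted(ordered)], sorted(unnumbered)
```

Leaving gaps between the numbers (010, 030, 050) allows a new step to be slotted in later without renaming every file.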

Analytics: Data

ID  Addr_1            City
A   10 Main St        London
C   5 Junct           London
B   54 Shop Rd        Dublin
B   123 Middle Str.   Galway

ID   Addr_1           City
A    10 MAIN STREET   London
B    54 SHOP ROAD     Dublin
C    5 JUNCTION       London
...  ...              ...

Analytics: Data

ID  Addr_1            City
A   10 Main St        London
C   5 Junct           London
B   54 Shop Rd        Dublin
B   123 Middle Str.   Galway

ID  Addr_1            Addr_1_cln          City    IS_IN_SCOPE
A   10 Main St        10 MAIN STREET      London  YES
C   5 Junct           5 JUNCTION          London  YES
B   54 Shop Rd        54 SHOP ROAD        Dublin  YES
B   123 Middle Str.   123 MIDDLE STREET   Galway  NO
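
The second pair of tables keeps the raw value, adds the cleaned value beside it, and flags out-of-scope rows instead of deleting them. The sketch below shows one way to do that in Python. The abbreviation rules and the list of in-scope cities are hypothetical, chosen only so the output matches the example rows; the slides do not say how scope was decided.

```python
import re

# Hypothetical cleaning rules and scope list, chosen only to reproduce the
# example rows on the slide; a real project would hold its own rules.
ABBREVIATIONS = {"ST": "STREET", "STR": "STREET", "RD": "ROAD", "JUNCT": "JUNCTION"}
CITIES_IN_SCOPE = {"London", "Dublin"}


def clean_address(address):
    """Upper-case an address and expand abbreviated words."""
    words = re.sub(r"[.,]", "", address).upper().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def annotate(rows):
    """Return new rows with Addr_1_cln and IS_IN_SCOPE added.

    The raw Addr_1 value is kept and no row is dropped: out-of-scope
    records are flagged rather than deleted.
    """
    return [
        {
            **row,
            "Addr_1_cln": clean_address(row["Addr_1"]),
            "IS_IN_SCOPE": "YES" if row["City"] in CITIES_IN_SCOPE else "NO",
        }
        for row in rows
    ]
```

With the original column and all rows still present, a reviewer can compare raw and cleaned values side by side and can see exactly which records a scope rule excluded.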

Reporting – what is a report?

Reporting – Guerrilla Environment

Reporting – Guerrilla Analytics approach

1

2

5

• Select min/max of transaction_time: WP_030

• Select min/max of customer_age: WP_035

• Purchases by type: WP_042

Why consolidate?

Diagram labels: Raw, Duplicates, Customers, Clean_Cust, Deduped, New_dupes

Work Product

Why consolidate?

Diagram labels: Raw, Duplicates, Customers, Clean_Cust, Deduped, New_dupes, Duplicates_02

Second copy: Customers, Duplicates, Deduped, Clean_cust, New_dupes

Work Product

Consolidating with a Build

Diagram labels: Deduped, Clean_cust, New_dupes, Duplicates_02, Duplicates, Customers, Dupes_latest, Cust_Latest

Layers: Raw, Latest, Clean, Rules, Interface

Version Controlled Code
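
One possible reading of the build diagram is that work products should depend on a stable name such as Dupes_latest rather than on whichever raw version (Duplicates, Duplicates_02) happens to be newest. The sketch below illustrates that idea only. It assumes SQLite, an `_NN` suffix for later versions, and views as the mechanism; the function names are hypothetical and none of this is confirmed by the slides.

```python
import sqlite3


def latest_version(connection, base_name):
    """Return the newest table called base_name or base_name_NN.

    Assumption: the unsuffixed table is version 1 and later deliveries
    carry a numeric suffix, as in Duplicates and Duplicates_02.
    """
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    versions = {}
    for name in names:
        suffix = name[len(base_name) + 1:]
        if name == base_name:
            versions[1] = name
        elif name.startswith(base_name + "_") and suffix.isdigit():
            versions[int(suffix)] = name
    if not versions:
        raise ValueError(f"no table found for {base_name!r}")
    return versions[max(versions)]


def build_latest_view(connection, base_name, view_name):
    """(Re)create a view that always exposes the newest raw version."""
    source = latest_version(connection, base_name)
    connection.execute(f'DROP VIEW IF EXISTS "{view_name}"')
    connection.execute(f'CREATE VIEW "{view_name}" AS SELECT * FROM "{source}"')
    return source
```

Re-running the build after a new delivery repoints the view, so downstream code that reads Dupes_latest does not need to be edited, and the build script itself can live in version control alongside the rest of the code.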

Open Questions

• Workflows

• Testing

• ‘Big Data’

• Engineering

Keep in Touch!

@Enda_Ridge

www.guerrilla-analytics.net
