31
DECEMBER 8-9, 2016

BSSML16 L6. Basic Data Transformations

Embed Size (px)

Citation preview

Page 1: BSSML16 L6. Basic Data Transformations

D E C E M B E R 8 - 9 , 2 0 1 6

Page 2: BSSML16 L6. Basic Data Transformations

BigML, Inc 2

Poul Petersen CIO, BigML, Inc.

Basic TransformationsCreating Machine Learning Ready Data

Page 3: BSSML16 L6. Basic Data Transformations

BigML, Inc 3Basic Transformations

In a Perfect World…

Q: How does a physicist milk a cow?

A: Well, first let us consider a spherical cow...

Q: How does a data scientist build a model?

A: Well, first let us consider perfectly formatted data…

Page 4: BSSML16 L6. Basic Data Transformations

BigML, Inc 4Basic Transformations

The Dream

CSV Dataset Model Profit!

Page 5: BSSML16 L6. Basic Data Transformations

BigML, Inc 5Basic Transformations

The Reality

CRM

Web Accounts

Transactions ML Ready?

Page 6: BSSML16 L6. Basic Data Transformations

BigML, Inc 6Basic Transformations

Obstacles• Data Structure

• Scattered across systems • Wrong "shape" • Unlabelled data

• Data Value • Format: spelling, units • Missing values • Non-optimal correlation • Non-existant correlation

• Data Significance • Unwanted: PII, Non-Preferred • Expensive to collect • Insidious: Leakage, obviously correlated

Data Transformation

Feature Engineering

Feature Selection

Page 7: BSSML16 L6. Basic Data Transformations

BigML, Inc 7Basic Transformations

The Process• Define a clear idea of the goal.

• sometimes this comes later… • Understand what ML tasks will achieve the goal.

• Transform the data • where is it, how is it stored? • what are the features? • can you access it programmatically?

• Feature Engineering: transform the data you have into the data you actually need.

• Evaluate: Try it on a small scale • Accept that you might have to start over….

• But when it works, automate it!!!!

Page 8: BSSML16 L6. Basic Data Transformations

BigML, Inc 8

Data Transformations

Page 9: BSSML16 L6. Basic Data Transformations

BigML, Inc 9Basic Transformations

BigML TasksGoal

• Will this customer default on a loan? • How many customers will apply for a

loan next month? • Is the consumption of this product

unusual? • Is the behavior of the customers

similar? • Are these products purchased

together?

ML TaskClassificationRegression

Anomaly Detection

Cluster Analysis

Association Discovery

Page 10: BSSML16 L6. Basic Data Transformations

BigML, Inc 10Basic Transformations

ClassificationCategorical

Trai

ning

Test

ing

Predicting

Page 11: BSSML16 L6. Basic Data Transformations

BigML, Inc 11Basic Transformations

RegressionNumeric

Trai

ning

Test

ing

Predicting

Page 12: BSSML16 L6. Basic Data Transformations

BigML, Inc 12Basic Transformations

Anomaly Detection

Page 13: BSSML16 L6. Basic Data Transformations

BigML, Inc 13Basic Transformations

Cluster Analysis

Page 14: BSSML16 L6. Basic Data Transformations

BigML, Inc 14Basic Transformations

Association Discovery

Page 15: BSSML16 L6. Basic Data Transformations

BigML, Inc 15Basic Transformations

Data LabellingData may not have labels needed for doing classification

Create specific metrics for adding labels

Name Month - 3 Month - 2 Month - 1

Joe Schmo 123,23 0 0

Jane Plain 0 0 0

Mary Happy 0 55,22 243,33

Tom Thumb 12,34 8,34 14,56

Un-Labelled Data

Labelled dataName Month - 3 Month - 2 Month - 1 Default

Joe Schmo 123,23 0 0 FALSE

Jane Plain 0 0 0 TRUE

Mary Happy 0 55,22 243,33 FALSE

Tom Thumb 12,34 8,34 14,56 FALSE

Can be done at FeatureEngineering step as well

Page 16: BSSML16 L6. Basic Data Transformations

BigML, Inc 16Basic Transformations

ML Ready DataIn

stan

ces

Fields (Features)

Tabular Data (rows and columns): • Each row

• is one instance. • contains all the information about that one instance.

• Each column • is a field that describes a property of the instance.

Page 17: BSSML16 L6. Basic Data Transformations

BigML, Inc 17Basic Transformations

SF Restaurants Example

https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26ebhttps://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/

create database sf_restaurants;use sf_restaurants;

create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

create table violations (business_id int, vdate varchar(8), description varchar(1000));load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;

Page 18: BSSML16 L6. Basic Data Transformations

BigML, Inc 18Basic Transformations

Data CleaningHomogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.

Name Date Duration (s) Genre Plays

Highway star 1984-05-24 - Rock 139

Blues alive 1990/03/01 281 Blues 239

Lonely planet 2002-11-19 5:32s Techno 42

Dance, dance 02/23/1983 312 Disco N/A

The wall 1943-01-20 218 Reagge 83

Offside down 1965-02-19 4 minutes Techno 895

The alchemist 2001-11-21 418 Bluesss 178

Bring me down 18-10-98 328 Classic 21

The scarecrow 1994-10-12 269 Rock 734

Original data

Name Date Duration (s) Genre Plays

Highway star 1984-05-24 Rock 139

Blues alive 1990-03-01 281 Blues 239

Lonely planet 2002-11-19 332 Techno 42

Dance, dance 1983-02-23 312 Disco

The wall 1943-01-20 218 Reagge 83

Offside down 1965-02-19 240 Techno 895

The alchemist 2001-11-21 418 Blues 178

Bring me down 1998-10-18 328 Classic 21

The scarecrow 1994-10-12 269 Rock 734

Cleaned data

update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;

Page 19: BSSML16 L6. Basic Data Transformations

BigML, Inc 19Basic Transformations

Denormalizing

business

inspections

violations

scores

InstancesFeatures

(millions)

join

Data is usually normalized in relational databases, ML-Ready datasets need the information de-normalized in a single file/dataset.

create table scores select * from inspections left join businesses using (business_id);

Page 20: BSSML16 L6. Basic Data Transformations

BigML, Inc 20Basic Transformations

Aggregating

User Num.Playbacks Total Time Pref.DeviceUser001 3 830 TabletUser002 1 218 SmartphoneUser003 3 1019 TVUser005 2 521 Tablet

Aggregated data (list of users)

When the entity to model is different from the provided data, an aggregation to get the entity might be needed.

Content Genre

Duration Play Time User DeviceHighway

starRock 190 2015-05-12

16:29:33User001 TV

Blues alive Blues 281 2015-05-13 12:31:21

User005 TabletLonely planet

Techno

332 2015-05-13 14:26:04

User003 TVDance, dance

Disco 312 2015-05-13 18:12:45

User001 TabletThe wall Reag

ge218 2015-05-14

09:02:55User002 Smartphone

Offside down

Techno

240 2015-05-14 11:26:32

User005 TabletThe

alchemistBlues 418 2015-05-14

21:44:15User003 TV

Bring me down

Classic

328 2015-05-15 06:59:56

User001 TabletThe

scarecrowRock 269 2015-05-15

12:37:05User003 Smartphone

Original data (list of playbacks)

select business_id,vdate,count(*) as violation_num,group_concat(description) as violation_txt from violations group by business_id, vdate;select * from scores left join violations_aggregated ON (scores.business_id=violations_aggregated.business_id) AND ( vdate=idate);

tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -ctail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'

Page 21: BSSML16 L6. Basic Data Transformations

BigML, Inc 21Basic Transformations

PivotingDifferent values of a feature are pivoted to new columns in the

result dataset.

Content Genre Duration Play Time User DeviceHighway star Rock 190 2015-05-12 16:29:33 User001 TVBlues alive Blues 281 2015-05-13 12:31:21 User005 Tablet

Lonely planet Techno 332 2015-05-13 14:26:04 User003 TVDance, dance Disco 312 2015-05-13 18:12:45 User001 Tablet

The wall Reagge 218 2015-05-14 09:02:55 User002 SmartphoneOffside down Techno 240 2015-05-14 11:26:32 User005 TabletThe alchemist Blues 418 2015-05-14 21:44:15 User003 TV

Bring me down Classic 328 2015-05-15 06:59:56 User001 TabletThe scarecrow Rock 269 2015-05-15 12:37:05 User003 Smartphone

Original data

User Num.Playbacks

Total Time Pref.Device NP_TV NP_Tablet NP_Smartphone TT_TV TT_Tablet TT_Smartphone

User001 3 830 Tablet 1 2 0 190 640 0

User002 1 218 Smartphone 0 0 1 0 0 218

User003 3 1019 TV 2 0 1 750 0 269

User005 2 521 Tablet 0 2 0 0 521 0

Aggregated data with pivoted columns

Page 22: BSSML16 L6. Basic Data Transformations

BigML, Inc 22Basic Transformations

Time WindowsCreate new features using values over different periods of time

InstancesFeatures

Time

InstancesFeatures

(milli

ons)

(thousands)

t=1 t=2 t=3

select business_id, score as score_2013, idate as idate_2013 from inspections where substr(idate,1,4) = "2013";create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);

Page 23: BSSML16 L6. Basic Data Transformations

BigML, Inc 23Basic Transformations

UpdatesNeed a current view of the data, but new data only comes in

batches of changes

day 1day 2day 3Instances

Features

Page 24: BSSML16 L6. Basic Data Transformations

BigML, Inc 24Basic Transformations

StreamingData only comes in single changes

data stream

InstancesFeatures

Stream Batch

(kafka, etc)

Page 25: BSSML16 L6. Basic Data Transformations

BigML, Inc 25Basic Transformations

Structuring Output

• A CSV file uses plain text to store tabular data. • In a CSV file, each row of the file is an instance. • Each column in a row is usually separated by a comma (,) but other

"separators" like semi-colon (;), colon (:), pipe (|), can also be used. Each row must contain the same number of fields

• but they can be null • Fields can be quoted using double quotes ("). • Fields that contain commas or line separators must be quoted. • Quotes (") in fields must be doubled (""). • The character encoding must be UTF-8 • Optionally, a CSV file can use the first line as a header to provide the

names of each field.

After all the data transformations, a CSV (“Comma-Separated Values) file has to be generated, following the rules below:

Page 26: BSSML16 L6. Basic Data Transformations

BigML, Inc 26Basic Transformations

Prosper Loan Life Cycle

Submit Bids

Cancelled Withdraw

Funded

Expired

Defaulted

Paid

Current

Late

Q: Which new loans make it to funded?Q: Which funded loans make it to paid?Q: If funded, what will be the rate?

Classification

RegressionClassification

Goal ML Task

Page 27: BSSML16 L6. Basic Data Transformations

BigML, Inc 27Basic Transformations

Prosper ExampleData Provided in XML updates!!

fetch.sh“curl”daily

export.sh

import.pyXML

bigml.sh

ModelPredictShare in gallery

Status

LoanStatus

BorrowerRate

Page 28: BSSML16 L6. Basic Data Transformations

BigML, Inc 28Basic Transformations

Prosper Example

• XML… yuck! • MongoDB has CSV export and is record based so it is easy to

handle changing data structure. • Feature Engineering

• There are 5 different classes of “bad” loans • Date cleanup • Type casting: floats and ints

• Would be better to track over time • number of late payments • compare predictions and actuals

• XML… yuck!

Tidbits and Lessons Learned….

Page 29: BSSML16 L6. Basic Data Transformations

BigML, Inc 29Basic Transformations

Tools• Command Line?

• join, cut, awk, sed, sort, uniq • Automation

• Shell, Python, etc • Talend • BigML: bindings, bigmler, API, whizzml

• Relational DB • MySQL

• Non-Relational DB • MongoDB

Page 30: BSSML16 L6. Basic Data Transformations

BigML, Inc 30Basic Transformations

Talend

https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/

Denormalization Example

Page 31: BSSML16 L6. Basic Data Transformations