Upload
bigml-inc
View
96
Download
2
Embed Size (px)
Citation preview
D E C E M B E R 8 - 9 , 2 0 1 6
BigML, Inc 2
Poul Petersen CIO, BigML, Inc.
Basic TransformationsCreating Machine Learning Ready Data
BigML, Inc 3Basic Transformations
In a Perfect World…
Q: How does a physicist milk a cow?
A: Well, first let us consider a spherical cow...
Q: How does a data scientist build a model?
A: Well, first let us consider perfectly formatted data…
BigML, Inc 4Basic Transformations
The Dream
CSV Dataset Model Profit!
BigML, Inc 5Basic Transformations
The Reality
CRM
Web Accounts
Transactions ML Ready?
BigML, Inc 6Basic Transformations
Obstacles• Data Structure
• Scattered across systems • Wrong "shape" • Unlabelled data
• Data Value • Format: spelling, units • Missing values • Non-optimal correlation • Non-existant correlation
• Data Significance • Unwanted: PII, Non-Preferred • Expensive to collect • Insidious: Leakage, obviously correlated
Data Transformation
Feature Engineering
Feature Selection
BigML, Inc 7Basic Transformations
The Process• Define a clear idea of the goal.
• sometimes this comes later… • Understand what ML tasks will achieve the goal.
• Transform the data • where is it, how is it stored? • what are the features? • can you access it programmatically?
• Feature Engineering: transform the data you have into the data you actually need.
• Evaluate: Try it on a small scale • Accept that you might have to start over….
• But when it works, automate it!!!!
BigML, Inc 8
Data Transformations
BigML, Inc 9Basic Transformations
BigML TasksGoal
• Will this customer default on a loan? • How many customers will apply for a
loan next month? • Is the consumption of this product
unusual? • Is the behavior of the customers
similar? • Are these products purchased
together?
ML TaskClassificationRegression
Anomaly Detection
Cluster Analysis
Association Discovery
BigML, Inc 10Basic Transformations
ClassificationCategorical
Trai
ning
Test
ing
Predicting
BigML, Inc 11Basic Transformations
RegressionNumeric
Trai
ning
Test
ing
Predicting
BigML, Inc 12Basic Transformations
Anomaly Detection
BigML, Inc 13Basic Transformations
Cluster Analysis
BigML, Inc 14Basic Transformations
Association Discovery
BigML, Inc 15Basic Transformations
Data LabellingData may not have labels needed for doing classification
Create specific metrics for adding labels
Name Month - 3 Month - 2 Month - 1
Joe Schmo 123,23 0 0
Jane Plain 0 0 0
Mary Happy 0 55,22 243,33
Tom Thumb 12,34 8,34 14,56
Un-Labelled Data
Labelled dataName Month - 3 Month - 2 Month - 1 Default
Joe Schmo 123,23 0 0 FALSE
Jane Plain 0 0 0 TRUE
Mary Happy 0 55,22 243,33 FALSE
Tom Thumb 12,34 8,34 14,56 FALSE
Can be done at FeatureEngineering step as well
BigML, Inc 16Basic Transformations
ML Ready DataIn
stan
ces
Fields (Features)
Tabular Data (rows and columns): • Each row
• is one instance. • contains all the information about that one instance.
• Each column • is a field that describes a property of the instance.
BigML, Inc 17Basic Transformations
SF Restaurants Example
https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores/stya-26ebhttps://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
create database sf_restaurants;use sf_restaurants;
create table businesses (business_id int, name varchar(1000), address varchar(1000), city varchar(1000), state varchar(100), postal_code varchar(100), latitude varchar(100), longitude varchar(100), phone_number varchar(100));load data local infile './businesses.csv' into table businesses fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table inspections (business_id int, score varchar(10), idate varchar(8), itype varchar(100));load data local infile './inspections.csv' into table inspections fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table violations (business_id int, vdate varchar(8), description varchar(1000));load data local infile './violations.csv' into table violations fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
create table legend (Minimum_Score int, Maximum_Score int, Description varchar(100));load data local infile './legend.csv' into table legend fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;
BigML, Inc 18Basic Transformations
Data CleaningHomogenize missing values and different types in the same feature, fix input errors, correct semantic issues, types, etc.
Name Date Duration (s) Genre Plays
Highway star 1984-05-24 - Rock 139
Blues alive 1990/03/01 281 Blues 239
Lonely planet 2002-11-19 5:32s Techno 42
Dance, dance 02/23/1983 312 Disco N/A
The wall 1943-01-20 218 Reagge 83
Offside down 1965-02-19 4 minutes Techno 895
The alchemist 2001-11-21 418 Bluesss 178
Bring me down 18-10-98 328 Classic 21
The scarecrow 1994-10-12 269 Rock 734
Original data
Name Date Duration (s) Genre Plays
Highway star 1984-05-24 Rock 139
Blues alive 1990-03-01 281 Blues 239
Lonely planet 2002-11-19 332 Techno 42
Dance, dance 1983-02-23 312 Disco
The wall 1943-01-20 218 Reagge 83
Offside down 1965-02-19 240 Techno 895
The alchemist 2001-11-21 418 Blues 178
Bring me down 1998-10-18 328 Classic 21
The scarecrow 1994-10-12 269 Rock 734
Cleaned data
update violations set description = substr(description,1,instr(description,' [ date violation corrected:')-1) where instr(description,' [ date violation corrected:') > 0;
BigML, Inc 19Basic Transformations
Denormalizing
business
inspections
violations
scores
InstancesFeatures
(millions)
join
Data is usually normalized in relational databases, ML-Ready datasets need the information de-normalized in a single file/dataset.
create table scores select * from inspections left join businesses using (business_id);
BigML, Inc 20Basic Transformations
Aggregating
User Num.Playbacks Total Time Pref.DeviceUser001 3 830 TabletUser002 1 218 SmartphoneUser003 3 1019 TVUser005 2 521 Tablet
Aggregated data (list of users)
When the entity to model is different from the provided data, an aggregation to get the entity might be needed.
Content Genre
Duration Play Time User DeviceHighway
starRock 190 2015-05-12
16:29:33User001 TV
Blues alive Blues 281 2015-05-13 12:31:21
User005 TabletLonely planet
Techno
332 2015-05-13 14:26:04
User003 TVDance, dance
Disco 312 2015-05-13 18:12:45
User001 TabletThe wall Reag
ge218 2015-05-14
09:02:55User002 Smartphone
Offside down
Techno
240 2015-05-14 11:26:32
User005 TabletThe
alchemistBlues 418 2015-05-14
21:44:15User003 TV
Bring me down
Classic
328 2015-05-15 06:59:56
User001 TabletThe
scarecrowRock 269 2015-05-15
12:37:05User003 Smartphone
Original data (list of playbacks)
select business_id,vdate,count(*) as violation_num,group_concat(description) as violation_txt from violations group by business_id, vdate;select * from scores left join violations_aggregated ON (scores.business_id=violations_aggregated.business_id) AND ( vdate=idate);
tail -n+2 playlists.csv | cut -d',' -f5 | sort | uniq -ctail -n+2 playlist.csv | awk -F',' '{arr[$5]+=$3} END {for (i in arr) {print arr[i],i}}'
BigML, Inc 21Basic Transformations
PivotingDifferent values of a feature are pivoted to new columns in the
result dataset.
Content Genre Duration Play Time User DeviceHighway star Rock 190 2015-05-12 16:29:33 User001 TVBlues alive Blues 281 2015-05-13 12:31:21 User005 Tablet
Lonely planet Techno 332 2015-05-13 14:26:04 User003 TVDance, dance Disco 312 2015-05-13 18:12:45 User001 Tablet
The wall Reagge 218 2015-05-14 09:02:55 User002 SmartphoneOffside down Techno 240 2015-05-14 11:26:32 User005 TabletThe alchemist Blues 418 2015-05-14 21:44:15 User003 TV
Bring me down Classic 328 2015-05-15 06:59:56 User001 TabletThe scarecrow Rock 269 2015-05-15 12:37:05 User003 Smartphone
Original data
User Num.Playbacks
Total Time Pref.Device NP_TV NP_Tablet NP_Smartphone TT_TV TT_Tablet TT_Smartphone
User001 3 830 Tablet 1 2 0 190 640 0
User002 1 218 Smartphone 0 0 1 0 0 218
User003 3 1019 TV 2 0 1 750 0 269
User005 2 521 Tablet 0 2 0 0 521 0
Aggregated data with pivoted columns
BigML, Inc 22Basic Transformations
Time WindowsCreate new features using values over different periods of time
InstancesFeatures
Time
InstancesFeatures
(milli
ons)
(thousands)
t=1 t=2 t=3
select business_id, score as score_2013, idate as idate_2013 from inspections where substr(idate,1,4) = "2013";create table scores_over_time select * from businesses left join scores_2013 USING (business_id) left join scores_2014 USING (business_id);
BigML, Inc 23Basic Transformations
UpdatesNeed a current view of the data, but new data only comes in
batches of changes
day 1day 2day 3Instances
Features
BigML, Inc 24Basic Transformations
StreamingData only comes in single changes
data stream
InstancesFeatures
Stream Batch
(kafka, etc)
BigML, Inc 25Basic Transformations
Structuring Output
• A CSV file uses plain text to store tabular data. • In a CSV file, each row of the file is an instance. • Each column in a row is usually separated by a comma (,) but other
"separators" like semi-colon (;), colon (:), pipe (|), can also be used. Each row must contain the same number of fields
• but they can be null • Fields can be quoted using double quotes ("). • Fields that contain commas or line separators must be quoted. • Quotes (") in fields must be doubled (""). • The character encoding must be UTF-8 • Optionally, a CSV file can use the first line as a header to provide the
names of each field.
After all the data transformations, a CSV (“Comma-Separated Values) file has to be generated, following the rules below:
BigML, Inc 26Basic Transformations
Prosper Loan Life Cycle
Submit Bids
Cancelled Withdraw
Funded
Expired
Defaulted
Paid
Current
Late
Q: Which new loans make it to funded?Q: Which funded loans make it to paid?Q: If funded, what will be the rate?
Classification
RegressionClassification
Goal ML Task
BigML, Inc 27Basic Transformations
Prosper ExampleData Provided in XML updates!!
fetch.sh“curl”daily
export.sh
import.pyXML
bigml.sh
ModelPredictShare in gallery
Status
LoanStatus
BorrowerRate
BigML, Inc 28Basic Transformations
Prosper Example
• XML… yuck! • MongoDB has CSV export and is record based so it is easy to
handle changing data structure. • Feature Engineering
• There are 5 different classes of “bad” loans • Date cleanup • Type casting: floats and ints
• Would be better to track over time • number of late payments • compare predictions and actuals
• XML… yuck!
Tidbits and Lessons Learned….
BigML, Inc 29Basic Transformations
Tools• Command Line?
• join, cut, awk, sed, sort, uniq • Automation
• Shell, Python, etc • Talend • BigML: bindings, bigmler, API, whizzml
• Relational DB • MySQL
• Non-Relational DB • MongoDB
BigML, Inc 30Basic Transformations
Talend
https://blog.bigml.com/2013/10/30/data-preparation-for-machine-learning-using-mysql/
Denormalization Example