December 8-9, 2016
BigML, Inc 2
Poul Petersen CIO, BigML, Inc.
Feature Engineering
Creating Machine Learning Ready Data
Machine Learning Secret
“…the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms.”
Facebook FBLearner 2016
Feature Engineering: applying domain knowledge of the data to create features that make machine learning algorithms work better or at all.
Obstacles
• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated
Data Transformation
Feature Engineering
Feature Selection
Feature Engineering
Automatic Date Transformation
Input (DATE-TIME): 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
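The date expansion on this slide can be sketched in plain Python. This is only an illustration of the idea, not how BigML implements it; the function name `expand_datetime` and the fixed English month abbreviations are assumptions.

```python
from datetime import datetime

# Fixed abbreviations avoid locale-dependent month names (an assumption of this sketch).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def expand_datetime(value: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM' string into separate features."""
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": ts.year,                 # NUM
        "month": MONTHS[ts.month - 1],   # CAT
        "day": ts.day,                   # NUM
        "hour": ts.hour,                 # NUM
        "minute": ts.minute,             # NUM
    }

features = expand_datetime("2013-09-25 10:02")
print(features)
```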
Feature Engineering
Automatic Categorical Transformation
Input (CAT):
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
Output (NUM NUM NUM):
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
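A minimal sketch of this categorical expansion (one-hot encoding) in Python follows. The helper name `one_hot` is hypothetical, and sorting the categories alphabetically is an assumption made to reproduce the column order shown above.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, rows = one_hot(["business", "recreation", "health"])
print(categories)
for row in rows:
    print(row)
```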
Feature Engineering
Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em.
TEXT
Automatic Text Transformation
… great afraid born achieve …
… 4 1 1 1 …
… … … … … …
NUM NUM NUM NUM
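The term counts above can be reproduced with a small bag-of-words sketch. The toy `stem` function, which only strips a trailing "ness", is an assumption standing in for a real stemmer; it is just enough to fold "greatness" into "great" for this quote.

```python
import re
from collections import Counter

QUOTE = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")

def stem(token: str) -> str:
    # Toy stemmer (assumption): strip a trailing "ness" and nothing else.
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token

def term_counts(text: str) -> Counter:
    """Lowercase, tokenize on letters, stem, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(token) for token in tokens)

counts = term_counts(QUOTE)
print({term: counts[term] for term in ("great", "afraid", "born", "achieve")})
```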
Feature Engineering
{"url": "cbsnews", "title": "Breaking News Headlines Business Entertainment World News", "body": "news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."}
Fixing "non-optimal correlations"
title body
Breaking News… news covering…
… …
TEXT TEXT
TEXT
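Splitting the single JSON text field into separate title and body fields might look like the sketch below, using Python's standard `json` module. The function name `split_fields` is hypothetical.

```python
import json

RAW = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News", '
       '"body": "news covering all the latest breaking national and world news '
       'headlines, including politics, sports, entertainment, business and more."}')

def split_fields(raw: str) -> dict:
    """Parse the JSON text field and keep title and body as separate TEXT fields."""
    record = json.loads(raw)
    return {"title": record["title"], "body": record["body"]}

fields = split_fields(RAW)
print(fields["title"])
print(fields["body"])
```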
Feature Engineering
Discretization
Total Spend (NUM)  Spend Category (CAT)
7.342,99           Top 33%
304,12             Bottom 33%
4,56               Bottom 33%
345,87             Middle 33%
8.546,32           Top 33%
NUM: “Predict will spend $3,521 with error $1,232”
CAT: “Predict customer will be Top 33% in spending”
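One way to sketch this discretization is equal-frequency binning by rank. The helper `spend_category` and the midpoint-rank convention `(rank + 0.5) / n` are assumptions of this sketch; a real dataset would normally be far larger than the five rows shown on the slide.

```python
def spend_category(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        percentile = (rank + 0.5) / n  # midpoint-rank convention (assumption)
        if percentile < 1 / 3:
            labels[i] = "Bottom 33%"
        elif percentile < 2 / 3:
            labels[i] = "Middle 33%"
        else:
            labels[i] = "Top 33%"
    return labels

spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
labels = spend_category(spend)
for amount, label in zip(spend, labels):
    print(amount, label)
```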
Feature Engineering
Combinations of Multiple Features
Kg (NUM)  M2 (NUM)  BMI (NUM)
101,4     3,24      31,29
85,2      2,8       30,42
56,2      2,9       19,38
136,1     3,6       37,81
95,9      4,1       23,39
BMI = Kg / M2
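Combining the two columns into a BMI feature is a single division per row. A minimal sketch, with the hypothetical helper `bmi` and the slide's values written with decimal points:

```python
def bmi(kg: float, m2: float) -> float:
    """Body mass index: weight in kg divided by squared height in m2."""
    return kg / m2

rows = [(101.4, 3.24), (85.2, 2.8), (56.2, 2.9), (136.1, 3.6), (95.9, 4.1)]
for kg, m2 in rows:
    print(kg, m2, round(bmi(kg, m2), 2))
```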
Feature Engineering
Flatline
• BigML’s Domain-Specific Language (DSL) for Transforming Datasets
• Limited programming language structures
  • let, cond, if, maps, list operators, */+-
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Built-in transformations
  • statistics, strings, timestamps, windows
Data Labelling
Data may not have labels needed for doing classification
Create specific metrics for adding labels
Un-Labelled Data
Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123,23     0          0
Jane Plain  0          0          0
Mary Happy  0          55,22      243,33
Tom Thumb   12,34      8,34       14,56
Labelled Data
Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123,23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55,22      243,33     FALSE
Tom Thumb   12,34      8,34       14,56      FALSE
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
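The labelling rule reads: a customer is in default when the absolute payments over the last three months sum to zero. A rough Python equivalent of the Flatline expression, with the hypothetical helper `is_default`:

```python
def is_default(payments) -> bool:
    """True when the absolute payments over all months sum to zero."""
    return sum(abs(amount) for amount in payments) == 0

customers = {
    "Joe Schmo": [123.23, 0, 0],
    "Jane Plain": [0, 0, 0],
    "Mary Happy": [0, 55.22, 243.33],
    "Tom Thumb": [12.34, 8.34, 14.56],
}
labels = {name: is_default(months) for name, months in customers.items()}
print(labels)
```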
Feature Engineering
(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
date  volume  price
1     34353   314
2     44455   315
3     22333   315
4     52322   321
5     28000   320
6     31254   319
7     56544   323
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
14    21453   302
day-4  day-3  day-2  day-1  4davg
-      -      -      -      -
-      -      -      314    -
-      -      314    315    -
-      314    315    315    -
314    315    315    321    316,25
315    315    321    320    317,75
315    321    320    319    318,75
(Current - (4-day avg)) / std dev
Shock: Deviations from a Trend
Current: (field "price")
4-day avg: (avg-window "price" -4 -1)
std dev: (standard-deviation "price")
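The shock feature can be sketched in Python from the three parts listed above. Two choices here are assumptions rather than Flatline's documented behaviour: the population standard deviation (`statistics.pstdev`) is used for the whole price column, and rows without four previous days yield `None`.

```python
from statistics import pstdev

# Price column from the slide, days 1 to 14.
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]

def shock(prices, window=4):
    """(current - average of the previous `window` days) / std dev of all prices."""
    sd = pstdev(prices)  # population std dev (assumption)
    result = []
    for i, current in enumerate(prices):
        if i < window:
            result.append(None)  # not enough history yet
        else:
            average = sum(prices[i - window:i]) / window
            result.append((current - average) / sd)
    return result

shocks = shock(prices)
for day, value in enumerate(shocks, start=1):
    print(day, value)
```

On this data the largest negative shock falls on day 9, where the price drops from 324 to 287.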
Feature Engineering
Moon Phase %
(/ (mod (- (/ (epoch (field {{date-field}})) 1000) 621300) 2551443) 2551442)
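A Python mirror of the moon phase expression is sketched below, keeping the slide's constants unchanged. It assumes the epoch value is in milliseconds (hence the division by 1000); 2551443 is approximately the length of a synodic month in seconds, and 621300 is presumably the offset in seconds to a reference new moon. The function name `moon_phase` is hypothetical.

```python
from datetime import datetime, timezone

def moon_phase(epoch_ms: float) -> float:
    """Fraction of the lunar cycle elapsed, from 0 up to about 1."""
    seconds = epoch_ms / 1000
    return ((seconds - 621300) % 2551443) / 2551442

# Example: phase fraction at midnight UTC on the first conference day.
when = datetime(2016, 12, 8, tzinfo=timezone.utc)
phase = moon_phase(when.timestamp() * 1000)
print(phase)
```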
Feature Engineering
Fixing "non-existent correlations"
Highway Number  Direction    Is Long
2               East-West    FALSE
4               East-West    FALSE
5               North-South  TRUE
8               East-West    FALSE
10              East-West    TRUE
…               …            …
Goal: Predict principal direction from highway number
(= (mod (field "Highway Number") 2) 0)
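The parity feature makes the relationship explicit: in the rows shown, even highway numbers run East-West and the odd one runs North-South. A short Python equivalent, with the hypothetical helper `is_even`:

```python
def is_even(highway_number: int) -> bool:
    """New feature: True when the highway number is even."""
    return highway_number % 2 == 0

highways = [(2, "East-West"), (4, "East-West"), (5, "North-South"),
            (8, "East-West"), (10, "East-West")]
for number, direction in highways:
    print(number, direction, is_even(number))
```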
Feature Engineering
Fix Missing Values in a “Meaningful” Way
Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin
Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset
(if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
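The four steps can be sketched end to end in plain Python. Everything specific here is an assumption for illustration: the sample rows are made up, `glucose` is a hypothetical predictor, and a one-variable least-squares line stands in for whatever model would actually be trained.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y as a function of x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fix_insulin(rows):
    # Filter Zeros: keep only rows with a recorded insulin value.
    clean = [r for r in rows if r["insulin"] != 0]
    # Model insulin: fit on the clean rows (hypothetical predictor: glucose).
    slope, intercept = fit_line([r["glucose"] for r in clean],
                                [r["insulin"] for r in clean])
    fixed = []
    for r in rows:
        predicted = slope * r["glucose"] + intercept  # Predict insulin
        # Select insulin: use the prediction only where the original is zero.
        value = predicted if r["insulin"] == 0 else r["insulin"]
        fixed.append({**r, "insulin": value})
    return fixed

# Made-up rows for illustration only.
rows = [
    {"glucose": 100, "insulin": 100},
    {"glucose": 120, "insulin": 140},
    {"glucose": 150, "insulin": 200},
    {"glucose": 110, "insulin": 0},
]
fixed = fix_insulin(rows)
for row in fixed:
    print(row)
```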
Feature Selection
Feature Selection
• Model Summary
  • Field Importance
• Algorithmic
  • Best-First Feature Selection
  • Boruta
• Leakage
  • Tight Correlations (AD, Plot, Correlations)
  • Test Data
  • Perfect future knowledge
cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
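The shell pipeline counts lines that occur more than once across the training and test files, which flags test rows leaking into training. A rough Python equivalent is sketched below on in-memory lists; the helper name and sample rows are hypothetical. Like `uniq -d`, it also counts a shared header line and duplicates within a single file.

```python
from collections import Counter

def count_duplicate_rows(train_lines, test_lines) -> int:
    """Number of distinct lines that occur more than once overall."""
    counts = Counter(train_lines + test_lines)
    return sum(1 for count in counts.values() if count > 1)

# Hypothetical file contents, one string per line.
train = ["a,b", "1,2", "3,4"]
test = ["a,b", "3,4", "5,6"]
duplicates = count_duplicate_rows(train, test)
print(duplicates)
```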
Evaluate & Automate
Evaluate & Automate
• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate - You don’t want to hand code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important
The Process
Flowchart: Define Goal, Data Transform, Feature Engineer & Selection, Model & Evaluate, Goal Met?
• yes: Automate
• no: Better Data, Better Features, Tune Algorithm, or Not Possible