DECEMBER 8-9, 2016

BSSML16 L7. Feature Engineering

Page 1: BSSML16 L7. Feature Engineering

DECEMBER 8-9, 2016

Page 2: BSSML16 L7. Feature Engineering

BigML, Inc 2

Poul Petersen CIO, BigML, Inc.

Feature Engineering

Creating Machine Learning Ready Data

Page 3: BSSML16 L7. Feature Engineering

Machine Learning Secret

“…the largest improvements in accuracy often came from quick experiments, feature engineering, and model tuning rather than applying fundamentally different algorithms.”

Facebook FBLearner 2016

Feature Engineering: applying domain knowledge of the data to create features that make machine learning algorithms work better or at all.

Page 4: BSSML16 L7. Feature Engineering

Obstacles

• Data Structure
  • Scattered across systems
  • Wrong "shape"
  • Unlabelled data
• Data Value
  • Format: spelling, units
  • Missing values
  • Non-optimal correlation
  • Non-existent correlation
• Data Significance
  • Unwanted: PII, Non-Preferred
  • Expensive to collect
  • Insidious: Leakage, obviously correlated

Data Transformation

Feature Engineering

Feature Selection

Page 5: BSSML16 L7. Feature Engineering

Feature Engineering

Automatic Date Transformation

DATE-TIME field: 2013-09-25 10:02

…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
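The expansion shown on this slide can be sketched in a few lines of Python. This is only an illustration, not BigML's implementation: the function name `expand_datetime`, the output keys, and the assumption that the input always looks like `2013-09-25 10:02` are all hypothetical.

```python
from datetime import datetime

# Fixed English abbreviations, so the result does not depend on the system locale.
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def expand_datetime(value: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM' string into separate fields."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": parsed.year,                       # numeric
        "month": MONTH_NAMES[parsed.month - 1],    # categorical
        "day": parsed.day,                         # numeric
        "hour": parsed.hour,                       # numeric
        "minute": parsed.minute,                   # numeric
    }


fields = expand_datetime("2013-09-25 10:02")
```

Treating the month as a category rather than a number follows the "Sep" value on the slide.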

Page 6: BSSML16 L7. Feature Engineering

Feature Engineering

Automatic Categorical Transformation

CAT field:

…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …

NUM fields:

business  health  recreation  …
1         0       0           …
0         0       1           …
0         1       0           …
…         …       …           …
NUM       NUM     NUM
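A minimal Python sketch of this kind of one-hot expansion, using the three category values from the slide. The helper name `one_hot` and the alphabetical column order are assumptions made for illustration.

```python
def one_hot(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))  # one output column per distinct label
    rows = [{category: int(value == category) for category in categories}
            for value in values]
    return categories, rows


columns, encoded = one_hot(["business", "recreation", "health"])
```

Each input row gets exactly one 1, in the column matching its original category.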

Page 7: BSSML16 L7. Feature Engineering

Feature Engineering

Automatic Text Transformation

TEXT field:

Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ‘em.

…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
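The term counts on this slide can be reproduced with a small bag-of-words sketch in Python. The count of 4 for "great" implies some form of stemming, so the example uses a deliberately crude stand-in (`crude_stem`, a hypothetical helper that only strips a trailing "ness"); a real text analysis pipeline would use a proper stemmer and stop-word handling.

```python
import re
from collections import Counter

QUOTE = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")


def crude_stem(token: str) -> str:
    """Illustrative stand-in for a stemmer: maps 'greatness' to 'great'."""
    if token.endswith("ness") and len(token) > 4:
        return token[:-4]
    return token


def term_counts(text: str) -> Counter:
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


counts = term_counts(QUOTE)
```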

Page 8: BSSML16 L7. Feature Engineering

Feature Engineering

Fixing "non-optimal correlations"

TEXT field:

{ "url":"cbsnews", "title":"Breaking News Headlines Business Entertainment World News ", "body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."}

Split into two TEXT fields:

title           body
Breaking News…  news covering…
…               …
TEXT            TEXT
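One way to split such a JSON blob into separate text fields is sketched below in Python. The function name `split_text_field` and the choice to trim surrounding whitespace are assumptions; the record is the one from the slide with its quotes normalized.

```python
import json

RAW = (
    '{"url": "cbsnews", '
    '"title": "Breaking News Headlines Business Entertainment World News ", '
    '"body": " news covering all the latest breaking national and world news '
    'headlines, including politics, sports, entertainment, business and more."}'
)


def split_text_field(raw: str) -> dict:
    """Parse the JSON text and keep title and body as separate text fields."""
    record = json.loads(raw)
    return {
        "title": record["title"].strip(),
        "body": record["body"].strip(),
    }


text_fields = split_text_field(RAW)
```

With title and body separated, a text analysis step can weigh words in each field independently instead of treating the whole blob as one string.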

Page 9: BSSML16 L7. Feature Engineering

Feature Engineering

Discretization

Total Spend (NUM)  Spend Category (CAT)
7.342,99           Top 33%
304,12             Bottom 33%
4,56               Bottom 33%
345,87             Middle 33%
8.546,32           Top 33%

With the NUM field: “Predict will spend $3,521 with error $1,232”

With the CAT field: “Predict customer will be Top 33% in spending”
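A hedged Python sketch of a rank-based discretization into thirds, using the five Total Spend values above. The slide does not say how the cut points are chosen, so the midpoint-rank rule used here (and the function name `tercile_labels`) is an assumption that happens to reproduce the labels shown.

```python
def tercile_labels(values):
    """Label each value Bottom/Middle/Top 33% by its rank among all values."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices, smallest first
    labels = [None] * n
    for rank, index in enumerate(order):
        position = (rank + 0.5) / n  # midpoint percentile rank in (0, 1)
        if position < 1 / 3:
            labels[index] = "Bottom 33%"
        elif position < 2 / 3:
            labels[index] = "Middle 33%"
        else:
            labels[index] = "Top 33%"
    return labels


total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = tercile_labels(total_spend)
```

The trade-off is the one the slide points at: the categorical target loses precision but turns a noisy regression into a simpler classification question.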

Page 10: BSSML16 L7. Feature Engineering

Feature Engineering

Combinations of Multiple Features

Kg (NUM)  M2 (NUM)  BMI (NUM)
101,4     3,24      31,29
85,2      2,8       30,42
56,2      2,9       19,38
136,1     3,6       37,81
95,9      4,1       23,39

BMI = Kg / M2
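The derived BMI column is a ratio of the two existing columns, which can be sketched in Python as follows. The key names `kg`, `m2`, and `bmi` are illustrative; the values are the ones from the slide, written with decimal points.

```python
def add_bmi(rows):
    """Return copies of the rows with a derived 'bmi' = kg / m2 field."""
    return [{**row, "bmi": row["kg"] / row["m2"]} for row in rows]


people = [
    {"kg": 101.4, "m2": 3.24},
    {"kg": 85.2, "m2": 2.8},
    {"kg": 56.2, "m2": 2.9},
    {"kg": 136.1, "m2": 3.6},
    {"kg": 95.9, "m2": 4.1},
]
with_bmi = add_bmi(people)
```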

Page 11: BSSML16 L7. Feature Engineering

Feature Engineering

Flatline

• BigML’s Domain-Specific Language (DSL) for Transforming Datasets
• Limited programming language structures
  • let, cond, if, maps, list operators, */+-
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Built-in transformations
  • statistics, strings, timestamps, windows

Page 12: BSSML16 L7. Feature Engineering

Data Labelling

Data may not have labels needed for doing classification

Create specific metrics for adding labels

Un-Labelled Data

Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123,23     0          0
Jane Plain  0          0          0
Mary Happy  0          55,22      243,33
Tom Thumb   12,34      8,34       14,56

Labelled Data

Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123,23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55,22      243,33     FALSE
Tom Thumb   12,34      8,34       14,56      FALSE

(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
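The same labelling rule can be sketched in Python: a row is marked as Default when the absolute values of the three monthly amounts sum to zero. The function name `label_default` and the dictionary layout are assumptions; the data is the table above.

```python
MONTH_FIELDS = ["Month - 3", "Month - 2", "Month - 1"]


def label_default(row: dict) -> bool:
    """True when every monthly amount is zero, mirroring the Flatline rule."""
    return sum(abs(row[field]) for field in MONTH_FIELDS) == 0


customers = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]
default_labels = [label_default(row) for row in customers]
```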

Page 13: BSSML16 L7. Feature Engineering

Feature Engineering

Shock: Deviations from a Trend

(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

(Current - (4-day avg)) / std dev

date  volume  price  day-4  day-3  day-2  day-1  4davg
1     34353   314    -      -      -      -      -
2     44455   315    -      -      -      314    -
3     22333   315    -      -      314    315    -
4     52322   321    -      314    315    315    -
5     28000   320    314    315    315    321    316,25
6     31254   319    315    315    321    320    317,75
7     56544   323    315    321    320    319    318,75
8     44331   324
9     81111   287
10    65422   294
11    59999   300
12    45556   302
13    19899   301
14    21453   302

Page 14: BSSML16 L7. Feature Engineering

Feature Engineering

Shock: Deviations from a Trend

(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))

(Current - (4-day avg)) / std dev

• Current: (field "price")
• 4-day avg: (avg-window "price" -4 -1)
• std dev: (standard-deviation "price")
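A rough Python equivalent of the shock feature, applied to the 14 prices from the example table. Two points are assumptions rather than documented Flatline behaviour: rows with fewer than four previous days get `None`, and the population standard deviation (`statistics.pstdev`) stands in for `standard-deviation`.

```python
from statistics import pstdev

PRICES = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301, 302]


def trailing_average(prices, index, window=4):
    """Mean of the `window` rows immediately before `index`."""
    return sum(prices[index - window:index]) / window


def shock(prices, window=4):
    """(current - trailing average) / std dev of the whole price column."""
    spread = pstdev(prices)
    result = []
    for index, price in enumerate(prices):
        if index < window:
            result.append(None)  # not enough history yet (assumed handling)
        else:
            result.append((price - trailing_average(prices, index, window)) / spread)
    return result


shocks = shock(PRICES)
```

In this data the most negative shock lands on day 9, where the price drops to 287 after four days around 320.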

Page 15: BSSML16 L7. Feature Engineering

Feature Engineering

Moon Phase %

(/ (mod (- (/ (epoch (field {{date-field}})) 1000) 621300) 2551443) 2551442)
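The same arithmetic can be written in Python. The constants are copied from the Flatline expression; the reading of them is an interpretation, not something the slide states: 2551443 seconds is roughly one lunar cycle (about 29.53 days), and 621300 is presumably the offset of a reference new moon from the Unix epoch. The function name `moon_phase` is hypothetical.

```python
from datetime import datetime, timezone

OFFSET_SECONDS = 621300     # from the slide; presumably a reference new moon
PERIOD_SECONDS = 2551443    # from the slide; about 29.53 days
DIVISOR = 2551442           # from the slide; period minus one second


def moon_phase(moment: datetime) -> float:
    """Fraction of the lunar cycle elapsed at `moment`, roughly 0.0 to 1.0."""
    seconds = moment.timestamp()  # Flatline's epoch is in ms, hence its / 1000
    return ((seconds - OFFSET_SECONDS) % PERIOD_SECONDS) / DIVISOR


reference = datetime.fromtimestamp(OFFSET_SECONDS, tz=timezone.utc)
phase_at_reference = moon_phase(reference)
phase_at_event = moon_phase(datetime(2016, 12, 8, tzinfo=timezone.utc))
```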

Page 16: BSSML16 L7. Feature Engineering

Feature Engineering

Fixing "non-existent correlations"

Highway Number  Direction    Is Long
2               East-West    FALSE
4               East-West    FALSE
5               North-South  TRUE
8               East-West    FALSE
10              East-West    TRUE
…               …            …

Goal: Predict principal direction from highway number

(= (mod (field "Highway Number") 2) 0)
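In Python, the same parity feature looks like this. The function name `is_even_highway` is illustrative; the rows are the ones in the table above.

```python
def is_even_highway(number: int) -> bool:
    """True for even highway numbers, mirroring the Flatline expression."""
    return number % 2 == 0


highways = [
    (2, "East-West"),
    (4, "East-West"),
    (5, "North-South"),
    (8, "East-West"),
    (10, "East-West"),
]
parity = [is_even_highway(number) for number, _ in highways]
```

In these five rows the new boolean lines up exactly with the direction, which the raw number does not expose to a model that only splits on thresholds.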

Page 17: BSSML16 L7. Feature Engineering

Feature Engineering

Fix Missing Values in a “Meaningful” Way

[Workflow diagram. Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin. Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset.]

(if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
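The four steps named on this slide can be sketched end to end in Python. Everything specific here is an assumption for illustration: the toy rows are made up, `glucose` is a hypothetical predictor field, and a one-variable least-squares line stands in for whatever model would actually be trained to predict insulin.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fix_insulin(rows):
    # Filter Zeros: keep only rows where insulin was actually measured.
    clean = [row for row in rows if row["insulin"] != 0]
    # Model insulin: fit a stand-in model on the clean rows.
    slope, intercept = fit_line([r["glucose"] for r in clean],
                                [r["insulin"] for r in clean])
    fixed = []
    for row in rows:
        # Predict insulin for every row of the original data.
        predicted = slope * row["glucose"] + intercept
        # Select insulin: keep the measured value unless it is the 0 placeholder.
        value = predicted if row["insulin"] == 0 else row["insulin"]
        fixed.append({**row, "insulin": value})
    return fixed


# Made-up toy data; the last row has a missing (zero) insulin value.
original = [
    {"glucose": 100, "insulin": 50},
    {"glucose": 150, "insulin": 100},
    {"glucose": 200, "insulin": 150},
    {"glucose": 120, "insulin": 0},
]
fixed_dataset = fix_insulin(original)
```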

Page 18: BSSML16 L7. Feature Engineering

Feature Selection

Page 19: BSSML16 L7. Feature Engineering

Feature Selection

• Model Summary
  • Field Importance
• Algorithmic
  • Best-First Feature Selection
  • Boruta
• Leakage
  • Tight Correlations (AD, Plot, Correlations)
  • Test Data
  • Perfect future knowledge

cat diabetes.csv diabetes_testset.csv | sort | uniq -d | wc -l
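The shell pipeline above counts distinct lines that occur more than once across the two files, which is a quick check for test rows leaking into the training data. A Python sketch of the same count follows; the function names are illustrative.

```python
from collections import Counter


def count_repeated_lines(lines):
    """Number of distinct lines that appear more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return sum(1 for occurrences in counts.values() if occurrences > 1)


def count_repeated_lines_in_files(*paths):
    """Same count over the concatenation of several text files."""
    lines = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            lines.extend(handle)
    return count_repeated_lines(lines)


# Example call, assuming both CSV files from the slide exist locally:
# count_repeated_lines_in_files("diabetes.csv", "diabetes_testset.csv")
```

Like the shell version, this also counts an identical header row shared by both files and any rows duplicated within a single file, so a small non-zero result is not necessarily leakage.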

Page 20: BSSML16 L7. Feature Engineering

Evaluate & Automate

Page 21: BSSML16 L7. Feature Engineering

Evaluate & Automate

• Evaluate
  • Did you meet the goal?
  • If not, did you discover something else useful?
  • If not, start over
  • If you did…
• Automate: you don’t want to hand-code that every time, right?
  • Consider tools that are easy to automate
    • scripting interface
    • APIs
  • Maintainability is important

Page 22: BSSML16 L7. Feature Engineering

The Process

[Flowchart. Stages: Define Goal, Data Transform, Feature Engineer & Selection, Model & Evaluate, Goal Met? Outcomes: yes leads to Automate; no leads to Better Features, Better Data, Tune Algorithm, or Not Possible.]

Page 23: BSSML16 L7. Feature Engineering