
Random Forests R vs Python by Linda Uruchurtu


RANDOM FORESTS: R vs PYTHON

R & PYTHON

Having fun when starting out in data analysis

WHO

LINDA URUCHURTU
@lindauruchurtu

Consultant at DBi Web Analytics & Data Consultancy

Physicist by training

OUTLINE OF THIS TALK

• Motivation
• Random Forests: R & Python
• Example: EMI music set
• Concluding remarks


MOTIVATION

STARTING OUT IN DATA ANALYSIS

• Online: blogs, GitHub, MOOCs, Kaggle, Data Tau, Cross Validated, Stack Overflow...
• Books
• School work

TOO MANY RESOURCES

WHICH LANGUAGE SHOULD I USE?

POPULAR QUESTION


LET’S ASK GOOGLE

MY EXPERIENCE

• Programmed in C
• Used MATLAB at Uni
• Spent a long time playing with symbolic langs Mathematica & Maple

START BY WHAT YOU KNOW & ASK YOUR FRIENDS

P.S. I had not met the IPython notebook.

MY EXPERIENCE (cont)

BIG REVEAL: I AM AN AVID R USER

• Don’t have a web dev background
• Surrounded by people doing Stats
• Pick the right tool for the task at hand

LANGUAGE WARS

Too many articles about:

• “Python Displacing R As The Programming Language For Data Analysis”
• “Is Python really supplanting R for data work?”
• “10 Reasons Python Rocks for Research”
• “Why Python is steadily eating other languages' lunch”
• “Why I’m betting on Julia”
• “What are the advantages of using Python over R?”
• “Why Python with Coffee is better than R with Ice Cream”

TL;DR - CAN BE CONFUSING FOR A NEWBIE

[FAVE LANG] is BETTER
BECAUSE I SAY SO

LANGUAGE WARS

However, it is good to have a general understanding of the + and - of the various data analysis tools, in order to pick the right tool for the job.

• R has EVERYTHING you need for performing statistical analysis.
• R / MATLAB / Python are great for prototyping
• Python is a full featured programming language
• Easier to incorporate Python outcomes into a full data product workflow

DEFINE THE PROBLEM

Time is better spent defining the problem and determining the best way to solve it

GOOD TO HAVE A BIG BAG OF TRICKS

Re-do R analysis using Python data analysis stack

WILL IT PYTHON? CREDIT: SLENDER MEANS

PYTHON SCIKIT LEARN

IT IS PRETTY AWESOME

• Library of Machine Learning Algorithms
• Open source
• API
• Python, Numpy & Co
• Accessible, many models, documentation & examples


EXAMPLE

CHOOSING A PROBLEM

1 Always a good idea to look for a data set that is interesting to you.

2 Formulate a question

3 Formulate an hypothesis

4 Build Model to answer question and Test

SCIENTIFIC METHOD FTW

CHOOSING A DATA SET

STEP 1

EMI MUSIC “ONE MILLION INTERVIEW SET”

• One of the largest preference data sets in the world.
• Extract used in Data Science London hackathon and available in KAGGLE as four separate data sets.

FOUR DATA SETS

• TRAIN / TEST - artist, track, userID, time & ratings
• WORDS - userID, heard_of, own_artist_music, like_artist, 82 adjectives
• USERS - userID, gender, age, working status, region, music, list_own (hours per day), list_back (hours per day), 19 user habits questions (0-100)

USERS

KEY  STRING
1    “Music is important to me but not necessarily most important”
2    “I like music but it does not feature heavily in my life”
3    “Music means a lot to me and it is a passion of mine”
4    “Music has no particular interest to me”
5    “Music is important to me but not necessarily more important than other hobbies”
6    “Music is no longer as important as it used to be”


WORDS DATASET

UNINSPIRED, AGGRESSIVE, UNATTRACTIVE, BORING, CHEAP, IRRELEVANT, WAY OUT, ANNOYING, CHEESY, UNORIGINAL, OUTDATED, UNAPPROACHABLE...

82 ADJECTIVES

WHOLESOME, LEGENDARY, OLD, PIONEER, DARK, WORLDLY, NOSTALGIC, PROGRESSIVE, ICONIC

USERS

19 MUSIC HABIT QUESTIONS: Rate (0-100) whether user agrees with the statements:

“I enjoy actively searching for and discovering music that I have never heard before”

“I am not willing to pay for music”

“I like to be at the cutting edge of new music”

“I love tech”


FORMULATE A QUESTION

STEP 2

MOTIVATION

TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.

• PRODUCTION - Cheaper to produce (lower barriers to entry for budding artists).
• DISTRIBUTION - Internet has made music more accessible. Artists can decide where and how to sell.
• CONSUMPTION - People’s listening habits have changed due to the internet and to the change in devices.

PROBLEMS

• ARTISTS - Easier to produce music, harder to make themselves known or earn a living.
• RECORD COMPANIES - People buy per song, easy for listener to consume without paying. Wider competition field.
• LISTENERS - Too many choices. Discovery is difficult.

QUESTIONS

• Can one predict the rating of a song?
• What factors are important to determine how much a person likes a song?
• What is the minimal set of factors that are needed to determine how much a person likes a song?

FORMULATE AN HYPOTHESIS

STEP 3

FIRST ATTEMPT

CAN ONE PREDICT THE RATING OF A SONG?

• Regression problem
• Turn categorical variables into numeric variables
• Consider ALL features and pick machine learning algorithm to do the job.
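Turning categorical variables into numeric ones can be sketched in a few lines of plain Python. This is only an illustration that assumes one-hot (indicator) encoding; the slides do not say which encoding was used, and the `one_hot` helper and the sample values are hypothetical.

```python
def one_hot(values):
    """Map a list of category labels to 0/1 indicator columns.

    Returns the sorted category names and one indicator row per input value.
    Hypothetical helper: the talk does not specify its encoding scheme.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Made-up sample of a categorical column such as gender in the USERS set.
gender = ["Female", "Male", "Female"]
categories, encoded = one_hot(gender)
print(categories)  # ['Female', 'Male']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]
```

An integer code per category would also make the column numeric; indicator columns avoid implying an order between categories.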

FIRST ATTEMPT

CAN ONE PREDICT THE RATING OF A SONG?

• Because exploratory analysis revealed ratings are highly clustered, we can look at five different scores and formulate the problem as a classification one.

We split ratings 0-100 into 5 intervals, so each becomes a class, and we label these.
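The split into five classes might look like the sketch below. It assumes five equal-width intervals of 20 points each, which is a guess: the slides do not give the interval boundaries, and `rating_to_class` is a hypothetical helper.

```python
def rating_to_class(rating, n_classes=5, max_rating=100):
    """Return a class label 1..n_classes for a rating in [0, max_rating].

    Assumes equal-width intervals; the top rating falls into the last class.
    """
    width = max_rating / n_classes
    return min(int(rating // width), n_classes - 1) + 1


sample_ratings = [0, 19, 20, 50, 99, 100]
sample_classes = [rating_to_class(r) for r in sample_ratings]
print(sample_classes)  # [1, 1, 2, 3, 5, 5]
```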

BUILD A MODEL

STEP 4

RANDOM FORESTS

Highly versatile ensemble method - combines several models into one.

• Random Forests are built by aggregating trees.
• Can be used for regression & classification problems.
• They are resistant to overfitting and can handle a large number of features
• They also output a list of features that are believed to be important in predicting the variable

A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)

RANDOM FORESTS

THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)

MOVIES

20 QUESTIONS

WILL JAMIE LIKE X?

BRIENNE IS THE DECISION TREE FOR JAMIE’S MOVIE PREFERENCES

RANDOM FORESTS

THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)

Ask Tywin, Cersei, Tyrion... Jamie gives each of them slightly different info.

THEY FORM A BAGGED FOREST OF JAMIE’S MOVIE PREFERENCES

Jamie demands getting different questions every time.

THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIE PREFERENCES

RANDOM FORESTS

• A tree of maximal depth is grown on a bootstrap sample of size N of the training set. There is no pruning.
• A number m << p is specified such that at each node, m variables are sampled at random out of p. The best split on these variables is used to split the node into two subnodes.
• Final classification is given by majority voting of the ensemble of trees in the forest.
• Only two “free” parameters: number of trees and number of variables in the random subset at each node.
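The four bullets above can be sketched end to end in plain Python. This is a toy illustration, not the randomForest or scikit-learn implementation: it assumes Gini impurity as the split criterion, and all function names and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree, sampling m of the p variables at each node."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:  # no sampled variable improves purity: majority leaf
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(node, row):
    """Follow the splits until a leaf (a bare class label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def grow_forest(rows, labels, n_trees, m, seed=0):
    """Grow each tree on its own bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Synthetic data: three informative variables and one noise variable (p = 4).
data_rng = random.Random(1)
rows, labels = [], []
for _ in range(20):
    rows.append([data_rng.uniform(1, 3) for _ in range(3)] + [data_rng.uniform(0, 10)])
    labels.append(0)
    rows.append([data_rng.uniform(7, 9) for _ in range(3)] + [data_rng.uniform(0, 10)])
    labels.append(1)

forest = grow_forest(rows, labels, n_trees=25, m=2)
low_vote = predict_forest(forest, [0, 0, 0, 5])
high_vote = predict_forest(forest, [10, 10, 10, 5])
```

The two tuning knobs from the last bullet appear here as `n_trees` and `m`.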

RANDOM FORESTS

OUT-OF-BAG (OOB) ERROR

The samples not used in the construction of a given tree (those left out of its bootstrap sample) become a test set for that tree. The OOB error estimate is given by the misclassification error (MSE for regression), averaged over all samples.
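To see how much data is left out of each bootstrap sample, the short simulation below draws one bootstrap sample and counts the rows that never appear in it. The helper name and the sample size are hypothetical; the expected share is about 1/e, roughly 37% of the rows.

```python
import random


def out_of_bag_indices(n, rng):
    """Draw a bootstrap sample of size n and return the rows it missed."""
    in_bag = {rng.randrange(n) for _ in range(n)}
    return [i for i in range(n) if i not in in_bag]


rng = random.Random(0)
n_rows = 10000
oob_fraction = len(out_of_bag_indices(n_rows, rng)) / n_rows
print(round(oob_fraction, 2))  # close to 1/e, about 0.37
```

Because every tree has its own out-of-bag rows, each training row is scored only by the trees that did not see it, which is what lets the OOB error act as a built-in test estimate.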

VARIABLE IMPORTANCE

Determined by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged.
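The permutation idea can be illustrated with a toy model. In this sketch the "model" is a hypothetical function that only uses the first variable, so shuffling that column raises the error while shuffling the unused column changes nothing. A real forest would apply the same procedure to each tree's out-of-bag rows.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a model over a data set."""
    return sum((model(row) - y) ** 2 for row, y in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, rng):
    """Increase in MSE after shuffling one column, others left unchanged."""
    baseline = mse(model, rows, targets)
    column = [row[feature] for row in rows]
    rng.shuffle(column)
    permuted = [row[:feature] + [value] + row[feature + 1:]
                for row, value in zip(rows, column)]
    return mse(model, permuted, targets) - baseline


def toy_model(row):
    """Hypothetical stand-in for a fitted model: uses variable 0 only."""
    return row[0]


rng = random.Random(0)
rows = [[float(i), rng.random()] for i in range(50)]
targets = [row[0] for row in rows]

importance_used = permutation_importance(toy_model, rows, targets, 0, rng)
importance_unused = permutation_importance(toy_model, rows, targets, 1, rng)
```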

RANDOM FORESTS IN R & PYTHON

randomForest PACKAGE

• Various implementations - randomForest, CARET, PARTY, BIGRF
• We follow the KISS procedure - KEEP IT SIMPLE S.
• One can test various values of mtry and the number of trees.

Used randomForest package 4.6-7 with R 2.15. Defaults are ntree = 500 trees, and mtry = p/3 for regression and sqrt(p) for classification.
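As a quick worked example of those defaults, the sketch below computes mtry for a given number of predictors p. The rounding down is how the randomForest package is understood to apply the rule; `default_mtry` is a hypothetical helper, and p = 114 is borrowed from the column count quoted on the results slides (which may or may not include the target).

```python
import math


def default_mtry(p, task):
    """Default number of variables tried at each split for p predictors."""
    if task == "classification":
        return max(math.floor(math.sqrt(p)), 1)
    return max(p // 3, 1)  # regression


mtry_regression = default_mtry(114, "regression")
mtry_classification = default_mtry(114, "classification")
print(mtry_regression, mtry_classification)  # 38 10
```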

RANDOM FORESTS IN R & PYTHON

SCIKIT LEARN

Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.

COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4 GB 1067 MHz DDR3 running OS X 10.6.8

For the comparison we will build “small” forests and focus on the following simple metrics:

• Training Time
• RSQ & RMSE (Regression)
• Accuracy (Classification)
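For reference, the three quality metrics can be written out directly. The sketch below uses the usual textbook definitions (RSQ as one minus the ratio of residual to total sum of squares); the sample values are made up and are not taken from the EMI data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Share of predictions that match the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


ratings_true = [10, 20, 30, 40]
ratings_pred = [12, 18, 33, 37]
sample_rmse = rmse(ratings_true, ratings_pred)      # sqrt(6.5), about 2.55
sample_rsq = r_squared(ratings_true, ratings_pred)  # 1 - 26/500 = 0.948
sample_acc = accuracy([1, 2, 3, 3], [1, 2, 2, 3])   # 0.75
```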

RANDOM FORESTS IN R

RESULTS REGRESSION

Split data into training and test sets. Each data frame has 82,714 rows and 114 columns.

Parameters: 60 trees, sample of 50,000.

rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)

Training time: 39.39 min
RMSE: 14.587
RSQ: 0.581

RANDOM FORESTS IN PYTHON

RESULTS REGRESSION

Split data into training and test sets. Each data frame has 82,714 rows and 114 columns.

Parameters: 60 trees, sample of 50,000.

rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')

Training time: 3 min 7 sec
RMSE: 14.687
RSQ: 0.575

RANDOM FORESTS IN R & PYTHON

[Figure panels: R, PYTHON / SCIKIT LEARN]

RANDOM FORESTS IN R

FEATURE IMPORTANCE

FEATURE (% INC MSE)    FEATURE (% INC NODE PURITY)
Beautiful              Talented
Boring                 Like Artist
Q16                    Catchy
Catchy                 Beautiful
Talented               Boring
Q9                     Track
Q19                    Distinctive
None of these          Cool
Age                    Q11
Track                  Q12

Q16 - I would be willing to pay for the opportunity to buy new music pre-release

Q9 - I am out of touch with new music

Q19 - I like to know about music before other people

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Like artist - To what extent do you like or dislike listening to this artist?

RANDOM FORESTS IN R

FEATURE IMPORTANCE

RANDOM FORESTS IN PYTHON

FEATURE IMPORTANCE

FEATURE             IMPORTANCE IN R RANDOM FOREST
Distinctive         7
Catchy              3
Like Artist         2
Fun                 -
Talented            1
Beautiful           4
Original            -
Unoriginal          -
Q11                 9
Own Artist Music    -

Own Artist Music - Do you have this artist in your music collection?

Q11 - Pop music is fun

RANDOM FORESTS IN R & PYTHON

RESULTS REGRESSION

Model                                RMSE
R Random Forest                      14.587
Python Scikit Learn Random Forest    14.687
Linear Regression                    16.23
Multiple Linear Regs                 15.53

RANDOM FORESTS IN R

RESULTS CLASSIFICATION

ratings_train <- as.factor(ratings_train)
rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)

Training time: 8.75 min
OOB error rate: 44.01%
Accuracy: 0.567

         1       2       3      4      5
1    16777    4863    1633    139     37
2     5760   12411    6213    504     89
3     1485    5559   13144   1880    329
4      176     888    4094   2592    625
5       59     204    1008    856   1388

RANDOM FORESTS IN PYTHON

RESULTS CLASSIFICATION

rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True)

Training time: 2.56 min
OOB Score: 0.1964
Accuracy: 0.566

         1       2       3      4      5
1    16930    4682    1758    129     53
2     5517   12369    6475    506    106
3     1500    5367   13448   1737    275
4      186     791    4171   2598    561
5       48     161     999    880   1466

Precision: 0.564
Recall: 0.5653
F1 Score: 0.5611
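The call above targets scikit-learn 0.14.1. As a rough sketch of how the same kind of model might be set up in more recent scikit-learn releases, the example below drops the `compute_importances` flag (importances are exposed through `feature_importances_` after fitting) and uses `n_jobs` to train trees in parallel. The synthetic data from `make_classification`, the sample sizes and the seeds are assumptions for illustration only, not the EMI data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (hypothetical sizes).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=60, oob_score=True, n_jobs=2,
                            random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_            # accuracy estimated on out-of-bag rows
importances = rf.feature_importances_   # impurity-based importances, sum to 1

# Column indices ordered from most to least important.
ranking = sorted(range(X.shape[1]), key=lambda j: importances[j], reverse=True)
```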

RANDOM FORESTS IN R

FEATURE IMPORTANCE

FEATURE (% INC MSE)    FEATURE (% INC NODE PURITY)
Q9                     Track
Q7                     Q11
Q5                     Q12
Q6                     Age
Age                    Q6
Q10                    Q17
listBACK               Q9
Q19                    Q16
listOWN                Q4
Q16                    Q13

Q16 - I would be willing to pay for the opportunity to buy new music pre-release

Q9 - I am out of touch with new music

Q19 - I like to know about music before other people

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Q7 - I enjoy music primarily from going out to dance

Q5 - I used to know where to find music

Q6 - I am not willing to pay for music

Q10 - My music collection is a source of pride

Q4 - I would like to buy new music but I don’t know what to buy

Q17 - I find seeing a new artist a useful way of discovering new music

RANDOM FORESTS IN PYTHON

FEATURE IMPORTANCE

FEATURE    IMPORTANCE IN R RANDOM FOREST
Q11        2
Q12        3
Age        4
Q6         5
Q17        6
Q5         -
Q4         9
Q10        -
Q16        7
Q7         -

Q16 - I would be willing to pay for the opportunity to buy new music pre-release

Q11 - Pop music is fun

Q12 - Pop music helps me escape

Q5 - I used to know where to find music

Q6 - I am not willing to pay for music

Q10 - My music collection is a source of pride

Q4 - I would like to buy new music but I don’t know what to buy

Q17 - I find seeing a new artist a useful way of discovering new music

RANDOM FORESTS IN R

CONFUSION MATRIX

         1       2       3      4      5    CLASS ERROR
1    16777    4863    1633    139     37    28.45%
2     5760   12411    6213    504     89    50.31%
3     1485    5559   13144   1880    329    41.31%
4      176     888    4094   2592    625    69.09%
5       59     204    1008    856   1388    60.51%

RANDOM FORESTS IN PYTHON

CONFUSION MATRIX

         1       2       3      4      5    CLASS ERROR
1    16930    4682    1758    129     53    28.12%
2     5517   12369    6475    506    106    50.47%
3     1500    5367   13448   1737    275    39.77%
4      186     791    4171   2598    561    68.73%
5       48     161     999    880   1466    58.75%
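The summary figures can be recomputed from the matrix itself. The sketch below uses the Python / scikit-learn counts from this slide and assumes that rows are the true classes and columns the predicted ones; under that reading, the overall accuracy and the per-row error rates should match the figures quoted in the talk.

```python
# Counts copied from the scikit-learn confusion matrix above.
# Assumption: row = true class, column = predicted class.
matrix = [
    [16930, 4682, 1758, 129, 53],
    [5517, 12369, 6475, 506, 106],
    [1500, 5367, 13448, 1737, 275],
    [186, 791, 4171, 2598, 561],
    [48, 161, 999, 880, 1466],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
overall_accuracy = correct / total

# Share of each true class that was assigned to a different class.
class_error = [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

print(round(overall_accuracy, 3))         # 0.566
print(round(class_error[0] * 100, 2))     # 28.12
```

The per-class view shows what the single accuracy number hides: the two rarest classes, 4 and 5, are misclassified far more often than the others.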

(Re)FORMULATE AN HYPOTHESIS

STEP 2

FEATURE SELECTION

PRINCIPAL COMPONENT ANALYSIS - WORDS

Determine which features account for most of the variance.

FEATURE        PC1       PC2
Distinctive    0.20      -0.059
Authentic      0.19      -0.046
Talented       0.19      -0.083
Credible       0.19      -0.084
Stylish        0.18      -0.094
Annoying       -0.06     -0.065
Intrusive      -0.06     -0.058
Irrelevant     -0.059    -0.087
Uninspired     -0.056    -0.092
Noisy          -0.053    -0.13
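Loadings like those in the table come from the eigenvectors of the covariance matrix of the features. The sketch below shows the idea on a tiny made-up data set, using power iteration in plain Python to find the first principal component. It is not the procedure used in the talk (the slides do not say how the PCA was computed), and the helper names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration.

    Starts from the first axis, which works as long as that axis is not
    orthogonal to the leading eigenvector.
    """
    p = len(cov)
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue


# Two made-up, positively correlated features.
toy_rows = [[2.0, 1.0], [1.0, 2.0], [4.0, 3.0], [3.0, 4.0]]
cov = covariance_matrix(toy_rows)
loadings, variance = first_principal_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
# loadings is about [0.707, 0.707]; explained is about 0.8
```

Features with similar loadings on PC1, such as Distinctive, Authentic and Talented in the table, move together, which is one motivation for keeping only a few of them in a reduced model.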

FEATURE SELECTION

Make a simple model choosing meaningful variables

WORDS - Annoying, Depressing, Boring, Catchy, Talented, Distinctive, Beautiful, Superstar, Soulful and Popular.

QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.

• Running time in R ~ 15 min.
• RMSE = 14.791 / Public leader board 13.076

RESULTS

[Figure panels: FULL MODEL, REDUCED MODEL]

COMMENTS

It is well known that Random Forests have been shown to be biased towards highly correlated variables. Using conditional inference trees ameliorates that bias (see the Party PACKAGE in R).

SCIKIT learn’s implementation has an n_jobs parameter to parallelise training. For a similar feature in R, see the bigRF package.

CONCLUDING REMARKS

PICK THE TOOL THAT IS BEST FOR THE JOB

We solved a problem using both R and PYTHON (via Scikit learn). Clearly, constraints for addressing a given problem might differ and would dictate the implementation of choice.

WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS

Both R and PYTHON (via SCIKIT LEARN) implementations have added functions that allow the user to explore the resulting model and its performance.

CONCLUDING REMARKS

RANDOM FORESTS ARE GREAT

They give great accuracy, can handle many features, do not require cross validation and even estimate which variables are important.

KEEP AN EYE OUT FOR INTERESTING DATA

Having data that you are interested in leads to more interesting questions and reasons to explore new methods and add a new trick to your bag.

CONCLUDING REMARKS

EMI DATASET IS GREAT TO TEST RIDE

Set has a lot of behavioural information on a subject that everyone has some intuition about.

TO DO’s - WILL IT PYTHON?

Prediction using SVMs and Matrix Factorisation techniques. Full factor analysis, etc.


THANKS!