Feature engineering practice
Roger Bohn Big Data Analytics
May 23, 2018
This week’s learning goals
• Feature engineering makes or breaks projects
• Exploratory analysis
• Measuring results
• General: reading code, debugging, managing projects
• Specific functions
Credit cards - new features
• Is this behavior unusual? For this customer? For anyone?
• Spending patterns - $ and what they buy
• Timing patterns
• Which stores
• Locations e.g. 200 km and 1 hour apart (But what if it’s in 2 train stations?)
Concept -> formula -> R code
• Defining and measuring “similarity”
• Example: Did he just buy 3 “similar” TVs in 3 stores?
• Can we rely on product codes? Probably not always
• Once we have verbal definition of “similar”, how do we implement it as code?
• Quant. variables: Try Statistical Process Control (= assume quantities always fall within 3 standard deviations). Use history to measure the mean and standard deviation; see the sketch below.
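A minimal sketch of this in R, assuming a transactions data frame txns with columns customer_id and amount (all hypothetical names):
library(dplyr)
txns.flagged <- txns %>%
  group_by(customer_id) %>%                          # per-customer history
  mutate(amt.mean = mean(amount, na.rm = TRUE),
         amt.sd   = sd(amount, na.rm = TRUE),
         unusual  = abs(amount - amt.mean) > 3 * amt.sd) %>%  # SPC-style flag
  ungroup()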
Go outside the core data
• Non-https: flag if an internet transaction website's URL was NOT https
• Non-PIN: flag if the transaction didn't require the user to enter a PIN
• Gender-specific: flag if the product had gender specificity
• Will the merchant’s data stream tell you these? Try.
What are plausible new features?
$$\mathrm{Odds(Fraud)} = \beta_0 + \beta_1 \cdot \mathrm{TxnType} \cdot \mathrm{Currency} + \beta_2 \cdot \mathrm{ForTxn} \cdot \mathrm{ForTxnAmount} + \beta_3 \cdot \mathrm{MerchantCountry}$$
Second, I add merchant country to check that the item comes from the country where the foreign transaction was made. Note that merchant country can still be inaccurate, because merchandise can be imported or exported.
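A hedged sketch of fitting this as a logistic regression in R (glm models log-odds rather than raw odds; the data frame txns and all column names are assumed, and the interaction terms mirror the products in the formula):
fraud.fit <- glm(Fraud ~ TxnType:Currency + ForTxn:ForTxnAmount + MerchantCountry,
                 data = txns, family = binomial)
summary(fraud.fit)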
• # From Chapter 6 of ISLR textbook.
• In the OS, set up documents you will need. Cheat sheets, text, previous work, etc.
1. Libraries: ISLR, glmnet, ggplot2, dplyr
2. Explore Hitters with ggplot
3. Clean data as before.
4. CHECK what happened at each stage (see the sketch below)
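A minimal sketch of steps 1 through 4:
library(ISLR); library(glmnet); library(ggplot2); library(dplyr)  # 1. libraries
ggplot(Hitters, aes(x = Years, y = Salary)) + geom_point()        # 2. explore
Hitters2 <- na.omit(Hitters)                                      # 3. clean
dim(Hitters); dim(Hitters2); summary(Hitters2$Salary)             # 4. check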
Starting with Hitters
Look at the data
Today’s resources
• bda2020.wordpress.com/2018/05/23/resources-on-data-manipulation/
Exploring data
• Scatter plots
• Overplotting and how to fix it
• Distributions
• Correlation diagrams
Graphical exploration
• ggplot(Hitters2, aes(x = Years, y = Salary)) + geom_point(shape = 19)
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(size = 1.5, shape = 19)
• Dealing with overplotting:
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(position = "jitter", size = 1.0) # jitter
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(alpha = .1) # transparency
• pairs(Hitters2[, c("Years", "CHits", "CRuns")]) # quote the names (or use column numbers); bare names fail
• ggplot(Hitters2, aes(x = Salary)) + stat_ecdf() # cumulative distribution of salaries
Creating transformed variables
• Hit.augment <- as.data.frame(Hitters[, 1]) # seed with one column
• Hit.augment$newname <- some calculation (see the sketch below)
• str(Hit.augment) # what did this do? Or use View() when ready
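A runnable version of the pattern, as a sketch (log-salary and hit rate are illustrative choices, not from the slides; Hitters2 is the cleaned data from later in the deck):
Hit.augment <- as.data.frame(Hitters2[, 1, drop = FALSE])  # seed with one column
Hit.augment$logSalary <- log(Hitters2$Salary)              # a transformed variable
Hit.augment$HitRate   <- Hitters2$Hits / Hitters2$AtBat    # a derived ratio
str(Hit.augment)                                           # check what happened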
Group Data
dplyr::group_by(iris, Species) Group data into rows with the same value of Species.
dplyr::ungroup(iris) Remove grouping information from data frame.
iris %>% group_by(Species) %>% summarise(…) Compute separate summary row for each group.
iris %>% group_by(Species) %>% mutate(…) Compute new variables by group.

Summarise Data
dplyr::summarise(iris, avg = mean(Sepal.Length)) Summarise data into single row of values.
dplyr::summarise_each(iris, funs(mean)) Apply summary function to each column.
dplyr::count(iris, Species, wt = Sepal.Length) Count number of rows with each unique value of variable (with or without weights).
Make New Variables
dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width) Compute and append one or more new columns.
dplyr::mutate_each(iris, funs(min_rank)) Apply window function to each column.
dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width) Compute one or more new columns. Drop original columns.
Summarise uses summary functions, functions that take a vector of values and return a single value, such as:
dplyr::first First value of a vector.
dplyr::last Last value of a vector.
dplyr::nth Nth value of a vector.
dplyr::n # of values in a vector.
dplyr::n_distinct # of distinct values in a vector.
IQR IQR of a vector.
min Minimum value in a vector.
max Maximum value in a vector.
mean Mean value of a vector.
median Median value of a vector.
var Variance of a vector.
sd Standard deviation of a vector.
Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:
dplyr::lead Copy with values shifted by 1.
dplyr::lag Copy with values lagged by 1.
dplyr::dense_rank Ranks with no gaps.
dplyr::min_rank Ranks. Ties get min rank.
dplyr::percent_rank Ranks rescaled to [0, 1].
dplyr::row_number Ranks. Ties go to first value.
dplyr::ntile Bin vector into n buckets.
dplyr::between Are values between a and b?
dplyr::cume_dist Cumulative distribution.
dplyr::cumall Cumulative all.
dplyr::cumany Cumulative any.
dplyr::cummean Cumulative mean.
cumsum Cumulative sum.
cummax Cumulative max.
cummin Cumulative min.
cumprod Cumulative product.
pmax Element-wise max.
pmin Element-wise min.
Combine Data Sets

Mutating Joins
dplyr::left_join(a, b, by = "x1") Join matching rows from b to a.
dplyr::right_join(a, b, by = "x1") Join matching rows from a to b.
dplyr::inner_join(a, b, by = "x1") Join data. Retain only rows in both sets.
dplyr::full_join(a, b, by = "x1") Join data. Retain all values, all rows.

Filtering Joins
dplyr::semi_join(a, b, by = "x1") All rows in a that have a match in b.
dplyr::anti_join(a, b, by = "x1") All rows in a that do not have a match in b.

Binding
dplyr::bind_rows(y, z) Append z to y as new rows.
dplyr::bind_cols(y, z) Append z to y as new columns. Caution: matches rows by position.

Set Operations
dplyr::intersect(y, z) Rows that appear in both y and z.
dplyr::union(y, z) Rows that appear in either or both y and z.
dplyr::setdiff(y, z) Rows that appear in y but not z.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0 • tidyr 0.2.0 • Updated: 1/15 • devtools::install_github("rstudio/EDAWR") for data sets
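A short sketch tying several of these verbs together, using the built-in iris data as the cheat sheet does:
library(dplyr)
iris %>% group_by(Species) %>%
  summarise(avg = mean(Sepal.Length), n = n())   # one summary row per species
iris %>% group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Length)) %>%      # window function, applied by group
  ungroup()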
Merging
• After creating 2 data frames, they have to be merged.
• You could do this one variable at a time, but only if the rows are in identical order. This is very risky.
• Instead, use merge: total <- merge(dataframeA, dataframeB, by = "ID")
• merge looks at the column called "ID" in both data frames and combines rows based on it.
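A tiny worked example with made-up data frames:
dfA <- data.frame(ID = c(1, 2, 3), spend   = c(50, 20, 75))
dfB <- data.frame(ID = c(2, 3, 4), country = c("US", "FR", "JP"))
total <- merge(dfA, dfB, by = "ID")  # keeps only IDs present in both (IDs 2 and 3)
total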
Clean Hitters
• sum(is.na(Hitters$Salary))
• Hitters2 <- na.omit(Hitters)
• ###### Work with R code ######
• sum(is.na(Hitters$Division)) # does what?
• # How many NON-NAs?
• # Many ways to calculate this; one sketch follows.
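One possible answer, as a sketch:
sum(!is.na(Hitters$Salary))                 # count the non-NA salaries directly
nrow(Hitters) - sum(is.na(Hitters$Salary))  # or subtract the NA count from n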
Creating data subsets: an example of a standard technique
• Use code from previous homeworks, or from Rattle.
• set.seed(87664)
• Nobservation <- nrow(Loansdataset)
• Train_IDs <- sample(Nobservation, 0.7*Nobservation) #random sample of 70% of row numbers
• Validate_IDs <- (1:Nobservation)[-Train_IDs] #can divide validate further into Test and Validation rows
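The row IDs are then used to index the data frame (Loansdataset as above):
Train.df    <- Loansdataset[Train_IDs, ]     # 70% of rows for training
Validate.df <- Loansdataset[Validate_IDs, ]  # the remaining 30%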
Functions for NA
• is.na()
• na.omit(Hitters) # throws away entire rows
• sum(is.na(Hitters$Salary))
• na.rm = TRUE means ignore any NA in the object:
• sum(Hitters$Salary, na.rm = TRUE)
3 debugging strategies
A. Look, guess, pray. Works well for common/simple problems like capitalization and spelling errors.
B. Microscope method: select chunks of code, see if they give the same error, and see what they return. Is it what you wanted?
• Make small changes to reproduce the error: small changes to the code, or changes to the data (e.g. numeric <-> character).
C. Work through the code by hand (forwards or backwards).
Another approach: formulas in R models
• Quick way to change variables in models
• Create interactions
• ff <- log(Volume) ~ log(Height) + log(Girth)
• str(m <- model.frame(ff, trees))
• mat <- model.matrix(ff, m)
• DD <- Hitters2[10:20, 1:7] # players 10 to 20; first 7 columns
• model.matrix(~ a + b, dd, contrasts = list(a = "contr.sum")) # from ?model.matrix; dd is a data frame with factors a and b
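A sketch of the same idea on Hitters2, with a transformation and an interaction written directly in the formula (the particular variables are illustrative):
ff2 <- log(Salary) ~ Years * Hits + League  # * expands to main effects + interaction
fit <- lm(ff2, data = Hitters2)
X   <- model.matrix(ff2, data = Hitters2)
head(X)                                     # note the Years:Hits column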
Needs list of vars!
[Figure omitted: two panels of standardized lasso coefficient paths on the Credit data, with curves for Income, Limit, Rating, and Student.]
FIGURE 6.6. The standardized lasso coefficients on the Credit data set are shown as a function of λ and ∥β̂_λ^L∥₁/∥β̂∥₁.
As an example, consider the coefficient plots in Figure 6.6, which are generated from applying the lasso to the Credit data set. When λ = 0, then the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other. Moving from left to right in the right-hand panel of Figure 6.6, we observe that at first the lasso results in a model that contains only the rating predictor. Then student and limit enter the model almost simultaneously, shortly followed by income. Eventually, the remaining variables enter the model. Hence, depending on the value of λ, the lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on λ.
Another Formulation for Ridge Regression and the Lasso
One can show that the lasso and ridge regression coefficient estimates solve the problems

$$\underset{\beta}{\operatorname{minimize}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} \right\} \ \text{subject to} \ \sum_{j=1}^{p} |\beta_j| \le s \qquad (6.8)$$

and

$$\underset{\beta}{\operatorname{minimize}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} \right\} \ \text{subject to} \ \sum_{j=1}^{p} \beta_j^{2} \le s. \qquad (6.9)$$
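Coefficient paths like Figure 6.6 can be drawn with glmnet (one of the libraries listed earlier). A minimal sketch on Hitters2; note that glmnet takes a numeric x matrix and a y vector rather than a formula, which is why it "needs a list of vars":
x <- model.matrix(Salary ~ ., Hitters2)[, -1]  # drop the intercept column
y <- Hitters2$Salary
fit.lasso <- glmnet(x, y, alpha = 1)           # alpha = 1 is lasso; alpha = 0 is ridge
plot(fit.lasso, xvar = "lambda")               # coefficient paths vs. log(lambda)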
Previous R functions
• Rf.pred <- predict(Result, Valid.df) # Valid.df is the validation dataset
• Almost every algorithm has a predict function
• confusionMatrix(Rf.pred, Valid.df$Personal.Loan)
• table(Rf.pred, Valid.df$Personal.Loan)
• addmargins(table(Rf.pred, Valid.df$Personal.Loan)) # appends row and column sums
• Mean squared error for continuous outcomes: mean((predicted - actual)^2)
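As a sketch for a continuous outcome (Salary is an assumed example here; the Personal.Loan case above is categorical):
pred <- predict(Result, Valid.df)        # Result: a fitted regression model
mse  <- mean((pred - Valid.df$Salary)^2) # the mean, not the sum, of squared errors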