Feature engineering practice
Roger Bohn Big Data Analytics
May 23, 2018
This week’s learning goals
• Feature engineering makes or breaks projects
• Exploratory analysis
• Measuring results
• General: reading code, debugging, managing projects
• Specific functions
Credit cards - new features
• Is this behavior unusual? For this customer? For anyone?
• Spending patterns - $ and what they buy
• Timing patterns
• Which stores
• Locations e.g. 200 km and 1 hour apart (But what if it’s in 2 train stations?)
Concept -> formula -> R code
• Defining and measuring “similarity”
• Example: Did he just buy 3 “similar” TVs in 3 stores?
• Can we rely on product codes? Probably not always
• Once we have verbal definition of “similar”, how do we implement it as code?
• Quant. variables: Try Statistical Process Control (= assume quantities always fall within 3 standard deviations). Use history to measure the mean and standard deviation; see the sketch below.
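A minimal sketch of this in R, assuming a transactions data frame txns with columns customer_id and amount (all hypothetical names):
library(dplyr)
txns.flagged <- txns %>%
  group_by(customer_id) %>%                          # per-customer history
  mutate(amt.mean = mean(amount, na.rm = TRUE),
         amt.sd   = sd(amount, na.rm = TRUE),
         unusual  = abs(amount - amt.mean) > 3 * amt.sd) %>%  # SPC-style flag
  ungroup()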
Go outside the core data
• Non-https: flag if an internet transaction website's URL was NOT https
• Non-PIN: flag if the transaction didn't require the user to enter a PIN
• Gender-specific: flag if the product had gender specificity
• Will the merchant’s data stream tell you these? Try.
What are plausible new features?
$$\mathrm{Odds(Fraud)} = \beta_0 + \beta_1 \cdot \mathrm{TxnType} \cdot \mathrm{Currency} + \beta_2 \cdot \mathrm{ForTxn} \cdot \mathrm{ForTxnAmount} + \beta_3 \cdot \mathrm{MerchantCountry}$$
Second, I add merchant country to check that the item comes from the country where the foreign transaction was made. Note that merchant country can still be inaccurate, because merchandise can be imported or exported.
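A hedged sketch of fitting this as a logistic regression in R (glm models log-odds rather than raw odds; the data frame txns and all column names are assumed, and the interaction terms mirror the products in the formula):
fraud.fit <- glm(Fraud ~ TxnType:Currency + ForTxn:ForTxnAmount + MerchantCountry,
                 data = txns, family = binomial)
summary(fraud.fit)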
• # From Chapter 6 of ISLR textbook.
• In the OS, set up documents you will need. Cheat sheets, text, previous work, etc.
1. Libraries: ISLR, glmnet, ggplot2, dplyr
2. Explore Hitters with ggplot
3. Clean data as before.
4. CHECK what happened at each stage (see the sketch below)
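A minimal sketch of steps 1 through 4:
library(ISLR); library(glmnet); library(ggplot2); library(dplyr)  # 1. libraries
ggplot(Hitters, aes(x = Years, y = Salary)) + geom_point()        # 2. explore
Hitters2 <- na.omit(Hitters)                                      # 3. clean
dim(Hitters); dim(Hitters2); summary(Hitters2$Salary)             # 4. check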
Starting with Hitters
Look at the data
Today’s resources
• bda2020.wordpress.com/2018/05/23/resources-on-data-manipulation/
Exploring data
• Scatter plots
• Overplotting and how to fix it
• Distributions
• Correlation diagrams
Graphical exploration
• ggplot(Hitters2, aes(x = Years, y = Salary)) + geom_point(shape = 19)
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(size = 1.5, shape = 19)
• Dealing with overplotting:
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(position = "jitter", size = 1.0) # jitter
• ggplot(Hitters2, aes(x = Years, y = Salary, colour = League)) + geom_point(alpha = .1) # transparency
• pairs(Hitters2[, c("Years", "CHits", "CRuns")]) # quote the names (or use column numbers); bare names fail
• ggplot(Hitters2, aes(x = Salary)) + stat_ecdf() # cumulative distribution of salaries
Creating transformed variables
• Hit.augment <- as.data.frame(Hitters[, 1]) # seed with one column
• Hit.augment$newname <- some calculation (see the sketch below)
• str(Hit.augment) # what did this do? Or use View() when ready
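A runnable version of the pattern, as a sketch (log-salary and hit rate are illustrative choices, not from the slides; Hitters2 is the cleaned data from later in the deck):
Hit.augment <- as.data.frame(Hitters2[, 1, drop = FALSE])  # seed with one column
Hit.augment$logSalary <- log(Hitters2$Salary)              # a transformed variable
Hit.augment$HitRate   <- Hitters2$Hits / Hitters2$AtBat    # a derived ratio
str(Hit.augment)                                           # check what happened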
Group Data
dplyr::group_by(iris, Species) Group data into rows with the same value of Species.
dplyr::ungroup(iris) Remove grouping information from data frame.
iris %>% group_by(Species) %>% summarise(…) Compute separate summary row for each group.
iris %>% group_by(Species) %>% mutate(…) Compute new variables by group.

Summarise Data
dplyr::summarise(iris, avg = mean(Sepal.Length)) Summarise data into single row of values.
dplyr::summarise_each(iris, funs(mean)) Apply summary function to each column.
dplyr::count(iris, Species, wt = Sepal.Length) Count number of rows with each unique value of variable (with or without weights).
Make New Variables
dplyr::mutate(iris, sepal = Sepal.Length + Sepal.Width) Compute and append one or more new columns.
dplyr::mutate_each(iris, funs(min_rank)) Apply window function to each column.
dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width) Compute one or more new columns. Drop original columns.
Summarise uses summary functions, functions that take a vector of values and return a single value, such as:
dplyr::first First value of a vector.
dplyr::last Last value of a vector.
dplyr::nth Nth value of a vector.
dplyr::n # of values in a vector.
dplyr::n_distinct # of distinct values in a vector.
IQR IQR of a vector.
min Minimum value in a vector.
max Maximum value in a vector.
mean Mean value of a vector.
median Median value of a vector.
var Variance of a vector.
sd Standard deviation of a vector.
Mutate uses window functions, functions that take a vector of values and return another vector of values, such as:
dplyr::lead Copy with values shifted by 1.
dplyr::lag Copy with values lagged by 1.
dplyr::dense_rank Ranks with no gaps.
dplyr::min_rank Ranks. Ties get min rank.
dplyr::percent_rank Ranks rescaled to [0, 1].
dplyr::row_number Ranks. Ties go to first value.
dplyr::ntile Bin vector into n buckets.
dplyr::between Are values between a and b?
dplyr::cume_dist Cumulative distribution.
dplyr::cumall Cumulative all.
dplyr::cumany Cumulative any.
dplyr::cummean Cumulative mean.
cumsum Cumulative sum.
cummax Cumulative max.
cummin Cumulative min.
cumprod Cumulative product.
pmax Element-wise max.
pmin Element-wise min.
Combine Data Sets

Mutating Joins
dplyr::left_join(a, b, by = "x1") Join matching rows from b to a.
dplyr::right_join(a, b, by = "x1") Join matching rows from a to b.
dplyr::inner_join(a, b, by = "x1") Join data. Retain only rows in both sets.
dplyr::full_join(a, b, by = "x1") Join data. Retain all values, all rows.

Filtering Joins
dplyr::semi_join(a, b, by = "x1") All rows in a that have a match in b.
dplyr::anti_join(a, b, by = "x1") All rows in a that do not have a match in b.

Binding
dplyr::bind_rows(y, z) Append z to y as new rows.
dplyr::bind_cols(y, z) Append z to y as new columns. Caution: matches rows by position.

Set Operations
dplyr::intersect(y, z) Rows that appear in both y and z.
dplyr::union(y, z) Rows that appear in either or both y and z.
dplyr::setdiff(y, z) Rows that appear in y but not z.
RStudio® is a trademark of RStudio, Inc. • CC BY RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more with browseVignettes(package = c("dplyr", "tidyr")) • dplyr 0.4.0 • tidyr 0.2.0 • Updated: 1/15 • devtools::install_github("rstudio/EDAWR") for data sets
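A short sketch tying several of these verbs together, using the built-in iris data as the cheat sheet does:
library(dplyr)
iris %>% group_by(Species) %>%
  summarise(avg = mean(Sepal.Length), n = n())   # one summary row per species
iris %>% group_by(Species) %>%
  mutate(rank = min_rank(Sepal.Length)) %>%      # window function, applied by group
  ungroup()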
Merging
• After creating 2 data frames, they have to be merged.
• You could do this one variable at a time, but only if the rows are in identical order. This is very risky.
• Instead, use merge: total <- merge(dataframeA, dataframeB, by = "ID")
• merge looks at the column called "ID" in both data frames and combines rows based on it.
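A tiny worked example with made-up data frames:
dfA <- data.frame(ID = c(1, 2, 3), spend   = c(50, 20, 75))
dfB <- data.frame(ID = c(2, 3, 4), country = c("US", "FR", "JP"))
total <- merge(dfA, dfB, by = "ID")  # keeps only IDs present in both (IDs 2 and 3)
total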
Clean Hitters
• sum(is.na(Hitters$Salary))
• Hitters2 <- na.omit(Hitters)
• ###### Work with R code ######
• sum(is.na(Hitters$Division)) # does what?
• # How many NON-NAs?
• # Many ways to calculate this; one sketch follows.
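One possible answer, as a sketch:
sum(!is.na(Hitters$Salary))                 # count the non-NA salaries directly
nrow(Hitters) - sum(is.na(Hitters$Salary))  # or subtract the NA count from n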
Creating data subsets: an example of a standard technique
• Use code from previous homeworks, or from Rattle.
• set.seed(87664)
• Nobservation <- nrow(Loansdataset)
• Train_IDs <- sample(Nobservation, 0.7*Nobservation) #random sample of 70% of row numbers
• Validate_IDs <- (1:Nobservation)[-Train_IDs] #can divide validate further into Test and Validation rows
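The row IDs are then used to index the data frame (Loansdataset as above):
Train.df    <- Loansdataset[Train_IDs, ]     # 70% of rows for training
Validate.df <- Loansdataset[Validate_IDs, ]  # the remaining 30%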
Functions for NA
• is.na()
• na.omit(Hitters) # throws away entire rows
• sum(is.na(Hitters$Salary))
• na.rm = TRUE means ignore any NA in the object:
• sum(Hitters$Salary, na.rm = TRUE)
3 debugging strategies
A. Look, guess, pray. Works well for common/simple problems like capitalization and spelling errors.
B. Microscope method: select chunks of code, see if they give the same error, and see what they return. Is it what you wanted?
• Make small changes to reproduce the error: small changes to the code, or changes to the data (e.g. numeric <-> character).
C. Work through the code by hand (forwards or backwards).
Another approach: formulas in R models
• Quick way to change variables in models
• Create interactions
• ff <- log(Volume) ~ log(Height) + log(Girth)
• str(m <- model.frame(ff, trees))
• mat <- model.matrix(ff, m)
• DD <- Hitters2[10:20, 1:7] # players 10 to 20; first 7 columns
• model.matrix(~ a + b, dd, contrasts = list(a = "contr.sum")) # from ?model.matrix; dd is a data frame with factors a and b
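A sketch of the same idea on Hitters2, with a transformation and an interaction written directly in the formula (the particular variables are illustrative):
ff2 <- log(Salary) ~ Years * Hits + League  # * expands to main effects + interaction
fit <- lm(ff2, data = Hitters2)
X   <- model.matrix(ff2, data = Hitters2)
head(X)                                     # note the Years:Hits column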
Needs list of vars!
[Figure omitted: two panels of standardized lasso coefficient paths on the Credit data, with curves for Income, Limit, Rating, and Student.]
FIGURE 6.6. The standardized lasso coefficients on the Credit data set are shown as a function of λ and ∥β̂_λ^L∥₁/∥β̂∥₁.
As an example, consider the coefficient plots in Figure 6.6, which are generated from applying the lasso to the Credit data set. When λ = 0, then the lasso simply gives the least squares fit, and when λ becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero. However, in between these two extremes, the ridge regression and lasso models are quite different from each other. Moving from left to right in the right-hand panel of Figure 6.6, we observe that at first the lasso results in a model that contains only the rating predictor. Then student and limit enter the model almost simultaneously, shortly followed by income. Eventually, the remaining variables enter the model. Hence, depending on the value of λ, the lasso can produce a model involving any number of variables. In contrast, ridge regression will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on λ.
Another Formulation for Ridge Regression and the Lasso
One can show that the lasso and ridge regression coefficient estimates solve the problems

$$\underset{\beta}{\operatorname{minimize}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} \right\} \ \text{subject to} \ \sum_{j=1}^{p} |\beta_j| \le s \qquad (6.8)$$

and

$$\underset{\beta}{\operatorname{minimize}} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^{2} \right\} \ \text{subject to} \ \sum_{j=1}^{p} \beta_j^{2} \le s. \qquad (6.9)$$
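Coefficient paths like Figure 6.6 can be drawn with glmnet (one of the libraries listed earlier). A minimal sketch on Hitters2; note that glmnet takes a numeric x matrix and a y vector rather than a formula, which is why it "needs a list of vars":
x <- model.matrix(Salary ~ ., Hitters2)[, -1]  # drop the intercept column
y <- Hitters2$Salary
fit.lasso <- glmnet(x, y, alpha = 1)           # alpha = 1 is lasso; alpha = 0 is ridge
plot(fit.lasso, xvar = "lambda")               # coefficient paths vs. log(lambda)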
Previous R functions
• Rf.pred <- predict(Result, Valid.df) # Valid.df is the validation dataset
• Almost every algorithm has a predict function
• confusionMatrix(Rf.pred, Valid.df$Personal.Loan)
• table(Rf.pred, Valid.df$Personal.Loan)
• addmargins(table(Rf.pred, Valid.df$Personal.Loan)) # appends row and column sums
• Mean squared error for continuous outcomes: mean((predicted - actual)^2)
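As a sketch for a continuous outcome (Salary is an assumed example here; the Personal.Loan case above is categorical):
pred <- predict(Result, Valid.df)        # Result: a fitted regression model
mse  <- mean((pred - Valid.df$Salary)^2) # the mean, not the sum, of squared errors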