Dean Abbott
Abbott Analytics
KNIME Fall Summit
#KNIMEFallSummit16
September 16, 2016
Twitter: @deanabb
Measuring
Variable Importance
with Target Shuffling
Measuring
Variable Importance
with Target Input Shuffling
Dean Abbott
Co-Founder and Chief Data Scientist and
Chief Technology Officer, SmarterHQ
Twitter: @deanabb
© Abbott Analytics, 2001-2016
A SaaS contextual marketing technology Tier 1 brands use to drive
conversion and loyalty, through multi-channel personalization
AWS: Redshift, MySQL/Aurora, EC2, S3, Kinesis
Why Am I Talking About this
Arcane Topic?
• I’ve been bothered by this for
decades... yes, I’m that old
• It’s conceptually easy to do.
Variable Importance in
Linear Regression
Variable Importance in
Decision Trees
• Decision Trees
• You think they are easy to explain?
Variable Importance in
Neural Networks
• Huh?
Variable Importance in
Neural Networks
• Or what neural
nets really look
like…
Naïve Bayes Model Outputs
Essentially a series of cross-tabs for every variable!
Remember, the final probability is the product of the individual variable probabilities.
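As a minimal sketch of that product rule (the numbers here are made up for illustration, not taken from the deck):

```python
# Naive Bayes score for one class: the prior times the product of each
# variable's individual conditional probability (the "cross-tab" entries).
import numpy as np

prior = 0.4                                # hypothetical P(class)
p_given_class = np.array([0.8, 0.3, 0.6])  # hypothetical P(x_j | class), one per variable
score = prior * np.prod(p_given_class)     # 0.4 * 0.8 * 0.3 * 0.6 = 0.0576
```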
SVM Output
Neural Networks: Interpretation via Sensitivities
• Sensitivities reflect the amount of change in the outputs when each of the inputs is changed or wiggled some small amount—a larger sensitivity means the output changes more for a small change in the input.
• Provides a measure of the importance of each input variable in the model (by itself)
• Can use sensitivities to reduce input variables in other neural network, decision tree, or regression models
KNIME Random Forest Node
Helps with Importance
Outline
• Classical variable importance: linear regression
• Hack #1: use linear regression model statistics to
infer variable importance
• Hack #2: use target shuffling to infer variable
importance
The Data: Easiest Possible!
• 3 inputs: each is a random Normal: mean = 20, std = 5
• Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3
• 95,412 records (same size as cup98lrn)
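Generating this data takes only a few lines; a sketch in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 95_412                                   # same record count as cup98lrn

# three independent Normal(mean=20, std=5) inputs
X = rng.normal(loc=20, scale=5, size=(n, 3))

# target is the exact linear combination -- no noise term
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2]
```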
Let’s Start with Normal
Variable Importance Using
Linear Regression Coefficient
• Coefficients match (by definition) the proportions used to
build the target variable
• This is the average influence of each input on the
predictions for all records
t-proportion
For Each Variable to Assess Influence
• The t-value measures the significance of the relationship.
• It turns out that the proportions of the t-values for the exact model
match the coefficients
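That check can be sketched in plain NumPy. A small noise term is added here (my assumption, not in the slides) so the standard errors are nonzero; with equal-variance inputs the t-value proportions then land on the coefficient proportions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 95_412
X = rng.normal(20, 5, size=(n, 3))
# small noise term added so the t-values are well defined
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0, 0.5, n)

A = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sigma2 = resid @ resid / (n - A.shape[1])     # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_abs = np.abs(beta[1:] / se[1:])             # drop the intercept
t_proportion = t_abs / t_abs.sum()            # roughly [0.5, 0.2, 0.3]
```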
Variable Importance Using
Prediction Proportion
• How would an empiricist compute influence?
1. Compute the proportion of the prediction that comes from each term in the model
1. Influence of variable 1 = W1 * var1
2. Influence of variable 2 = W2 * var2
3. Influence of variable 3 = W3 * var3
2. Average the influences over all records
Variable Importance Using
Prediction Proportion
• Compute the contribution of each term in the linear regression model
separately (for each record).
• Var1_influence = $var1coef$ * $var1$, etc.
• Compute each contribution as a proportion of the predicted target variable value
• Average the contributions of each variable over all records to compute the average influence of each variable
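A sketch of those steps on the synthetic data (the coefficients are hard-coded here as the known generating weights rather than fitted):

```python
import numpy as np

coefs = np.array([0.5, 0.2, 0.3])            # the generating weights
rng = np.random.default_rng(0)
X = rng.normal(20, 5, size=(95_412, 3))

contrib = X * coefs                          # per-record W_j * var_j
# proportion of each record's prediction coming from each term
prop = np.abs(contrib) / np.abs(contrib).sum(axis=1, keepdims=True)
influence = prop.mean(axis=0)                # average over all records
```

Because the three inputs share the same distribution, the averaged proportions land close to the coefficient proportions 0.5 / 0.2 / 0.3.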
So Far So Good
• Now let’s do the same
analysis for
• Neural Networks
• Support Vector
Machines.
So Far So Good
• Now let’s do the same
analysis for
• Neural Networks
• Support Vector
Machines.
• Uh... maybe not
Do it the YACK way
• Yet
• Another
• Creative
use of
• KNIME
Why “Target Shuffling”?
• We don’t always have nice
metrics to identify the best
inputs with predictive models
(NNets, SVM, … anything
other than regression!)
• Even with regression, we don’t
always have nice inputs
• See John Elder’s introduction
of Target Shuffling to the data
mining community
http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf
Input Distributions Are
Not Always Ideal
Why “Target Shuffling”?
• Don’t care about the “target” part
• The Target shuffling node doesn’t care either
• Scramble (randomly) a single (input variable) column
• Target Shuffling Node doesn’t have to be in a loop; it can scramble a column while leaving the others in their natural order
• Captures the actual distribution of the data
Let’s call it
Input Shuffling
Principles of Input Shuffling
• Key: randomly re-select the value of a single input variable while
leaving all other variables with their original values
• Compute the standard deviation (or some other measure of
perturbation) for each record
• Of the Target Variable Predictions
• NOT the actual target variable
• This perturbation is a measure of how influential the variable is in
the model
• High standard deviation -> lots of influence
• Low standard deviation -> not much influence
• ~0 standard deviation -> no influence
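The principles above can be sketched model-agnostically. The deck implements this as a KNIME metanode with two loops; this Python version is only an illustration, with `predict` standing in for any trained model's scorer:

```python
import numpy as np

def input_shuffling_influence(predict, X, n_shuffles=50, seed=0):
    """Shuffle one input column at a time (all other columns keep their
    original values), score the model, and average the per-record
    standard deviation of the resulting predictions."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    influence = np.empty(p)
    for j in range(p):
        preds = np.empty((n_shuffles, n))
        for s in range(n_shuffles):
            Xs = X.copy()
            Xs[:, j] = rng.permutation(Xs[:, j])  # shuffle just this column
            preds[s] = predict(Xs)
        # per-record spread of predictions, averaged over all records
        influence[j] = preds.std(axis=0).mean()
    return influence

# on the idealized linear data the proportions recover the weights
predict = lambda X: X @ np.array([0.5, 0.2, 0.3])
X = np.random.default_rng(1).normal(20, 5, size=(500, 3))
infl = input_shuffling_influence(predict, X)
proportions = infl / infl.sum()              # roughly [0.5, 0.2, 0.3]
```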
Shuffled Inputs Meta Node
Two Loops: (1) loop on input variables and (2) loop on shuffled input variable (50x or so)
Shuffling Inputs
All inputs and target
Just 1 input
Shuffling Inputs
All inputs and target
Just 1 input at a time
Single Record:
What It Looks Like
• Single record: 50 “shuffles”: Row0
Average for All Records in the Data
(~9K for this data set)
• Measures the spread of the predictions when randomly
perturbing the single input variable
Variable Importance Using
Input Shuffling for
Idealized Linear Regression Data
• Compute proportion of the average standard deviation from shuffling
the input (keeping others with the original values)
• (yes, I know I’m averaging standard deviations!)
Realistic Data:
KDD Cup 1998
• 95,412 records: cup98lrn from the KDD Cup 1998 Competition
• Use only the responders (4843) in linear regression models
• Hundreds of fields in data, but only use 4 for research
purposes
• LASTGIFT, NGIFTALL,
RFA_2F, D_RFA_2A
• Continuous target
• Two continuous inputs
• One ordinal (RFA_2F)
• One dummy (D_RFA_2A)
Realistic Data:
KDD Cup 1998
• Heavy skew of LASTGIFT, NGIFTALL, TARGET_D
• Makes visualization difficult
• Biases regression coefficients (if one cares)
Could Use Normalized Data
• To remove influence of skew and scale
• Log10 transform LASTGIFT, NGIFTALL, TARGET_D
• Scale all variables (post log10) to [0, 1]
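Those two steps, sketched on toy stand-ins for the skewed fields (made-up lognormal data, not the real cup98lrn values):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy right-skewed stand-ins for LASTGIFT, NGIFTALL, TARGET_D
raw = rng.lognormal(mean=2.5, sigma=0.8, size=(1000, 3))

logged = np.log10(raw)                       # tame the right skew
lo, hi = logged.min(axis=0), logged.max(axis=0)
scaled = (logged - lo) / (hi - lo)           # min-max scale to [0, 1]
```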
Normalized Data
• Relationships clearer
• LASTGIFT strong positive correlation with TARGET_D
• NGIFTALL, RFA_2F, D_RFA_2A all have apparently slight negative correlation with TARGET_D
The Basic Model:
Linear Regression
Coefficient
Use abs() for all calculations
Linear Regression: Compare
Influence Using Different Methods
Coefficient, T-Proportion
Use abs() for all calculations
Linear Regression: Compare
Influence Using Different Methods
Coefficient, T-Proportion
Prediction Proportion, Input Shuffling
Use abs() for all calculations
Linear Regression, Neural Network, and
Random Forest: Input Shuffling Influence
Input Shuffling - LR, Input Shuffling - MLP
Repeat for More Inputs –
KDD Cup 98
Apply Input Shuffling to
Larger KDD Cup 98 Data
Shuffle
LASTGIFT_log10
Variable Influence from
Regression Diagnostics
Input Shuffling Variable Influence:
Regression
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
D_RFA_2A 0.0518 0.0139 0.0051
LASTGIFT_log10 0.0477 0.0383 0.0596
E_RFA_2A 0.0426 0.0155 0.0153
F_RFA_2A 0.0266 0.0105 0.0037
MINRAMNT_log10 0.0077 0.0127 0.0113
RFA_2F 0.0073 0.0122 0.0063
A_GEOCODE2 0.0060 0.0020 0.0008
B_GEOCODE2 0.0057 0.0011 0.0002
MINRDATE 0.0040 0.0061 0.0085
NGIFTALL 0.0038 0.0075 0.0066
MAXRDATE 0.0028 0.0035 0.0044
C_GEOCODE2 0.0025 0.0005 0.0000
NUMPRM12 0.0024 0.0022 0.0033
DOMAIN3 0.0021 0.0008 0.0009
CARDPM12 0.0016 0.0026 0.0037
LASTDATE 0.0005 0.0029 0.0018
AGE_imputerand 0.0004 0.0029 0.0046
DOMAIN2 0.0002 0.0012 0.0002
NUMPROM 0.0001 0.0036 0.0067
DOMAIN1 0.0000 0.0000 0.0000
Accuracy Comparison on
Testing Data
Linear Regression, Random Forests, Gradient Boosting
Input Shuffling Variable Influence:
Regression (Unnormalized!)
currentColumnName VariableInfluence_Linear VariableInfluence_RF VariableInfluence_GBM
E_RFA_2A 4.337 0.807 0.396
LASTGIFT 4.052 2.252 4.016
D_RFA_2A 3.566 0.625 0.245
F_RFA_2A 3.552 0.457 0.000
RAMNTALL 2.429 0.540 1.239
NGIFTALL 2.258 0.692 0.957
MINRAMNT 2.111 0.708 0.722
RFA_2F 1.274 0.618 0.480
FISTDATE 0.970 0.298 0.731
A_GEOCODE2 0.754 0.130 0.086
B_GEOCODE2 0.519 0.082 0.017
DOMAIN3 0.362 0.052 0.066
DOMAIN1 0.358 0.080 0.036
C_GEOCODE2 0.307 0.028 0.000
NUMPRM12 0.304 0.154 0.262
DOMAIN2 0.289 0.072 0.028
MAXRDATE 0.213 0.297 0.444
MINRDATE 0.200 0.345 0.455
CARDPM12 0.178 0.139 0.296
AGE_imputerand 0.174 0.202 0.363
MAXRAMNT 0.168 1.791 1.547
LASTDATE 0.036 0.240 0.243
Input Shuffling Variable Influence:
Classification
Variable VariableInfluence_Logistic VariableInfluence_RF VariableInfluence_GBM
FISTDATE 0.0123 0.0349 0.0124
D_RFA_2A 0.0080 0.0024 0.0027
RFA_2F 0.0080 0.0176 0.0040
DOMAIN3 0.0072 0.0056 0.0057
E_RFA_2A 0.0069 0.0069 0.0055
NGIFTALL 0.0057 0.0347 0.0180
DOMAIN1 0.0011 0.0084 0.0013
LASTGIFT 0.0004 0.0236 0.0132
F_RFA_2A 0.0003 0.0103 0.0001
Discussion
• Why Input Shuffling is good
• Works for any input distribution
• Works with any algorithm
• Measures importance based on other input variables in natural patterns
rather than an idealized value (like the mean or mode)
• Can use many metrics to measure what “importance” means to you
• Why Input Shuffling is not so good
• Takes a long time to run if you have lots of inputs, lots of records
• No statistically defensible metric to use (yet)
Conclusion
• Variable influence can be computed as a single score
• Coefficients aren’t good measures unless the variables conform to linear
regression assumptions
• Some models don’t have “coefficients” at all so we can’t use the linear regression
approach
• Using target shuffling, we can generate randomized sensitivity scores easily for any
model
• If inputs are not normally distributed, average overall influence doesn’t tell
the full story (or may even tell a misleading story) about how valuable the
variable is in predicting the target
• Breaking predictions into bins (deciles or other number of bins) allows us to
compute an influence score for every part of the predicted range
• Answers the question: for high predicted values, which variables are most
influential?
Binning Predicted Values
into Buckets (Deciles, Quintiles,…)
• Deciling predicted values allows us to compute
variable influence for each of these ranges of
the predicted values. Note that the top and
bottom bins have much larger variances.
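That binning step can be sketched as follows (the function and the toy inputs are mine, not from the deck): average the per-record shuffling spread within each decile of the predicted values.

```python
import numpy as np

def influence_by_decile(spread, predictions, n_bins=10):
    """Average the per-record shuffling standard deviation within
    each quantile bin of the predicted values."""
    edges = np.quantile(predictions, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.digitize(predictions, edges[1:-1]), 0, n_bins - 1)
    return np.array([spread[bins == b].mean() for b in range(n_bins)])

# toy example where influence grows with the predicted value,
# so the top decile shows the largest average spread
preds = np.linspace(0.0, 100.0, 1000)
spread = 0.05 * preds
by_bin = influence_by_decile(spread, preds)
```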
LASTGIFT Influence
• LASTGIFT has stronger influence (positive) at the high end of predictions
• Significant influence for all predicted values
• Nearly constant influence for Bins 7-10
• Monotonic influence vs. predicted values
RFA_2F Influence
• RFA_2F has
stronger influence
(negative) at the
low end of
predictions
• Almost no
influence for Bin 7
– Bin 10
• Monotonic
influence vs.
predicted values
NGIFTALL Influence
• NGIFTALL has
stronger influence
(negative) at the
low end of
predictions
• Mostly monotonic
influence vs.
predicted values
D_RFA_2A Influence
• D_RFA_2A has strong influence at the low end of predictions only (Bin 1 and Bin 2)
• No influence at all for Bin 3 through Bin 10