Measuring Variable Importance with Target Shuffling. Dean Abbott, Abbott Analytics. KNIME Fall Summit (#KNIMEFallSummit16), September 16, 2016. dean@abbottanalytics.com, Twitter: @deanabb


Dean Abbott

Abbott Analytics

KNIME Fall Summit

#KNIMEFallSummit16

September 16, 2016

dean@abbottanalytics.com

Twitter: @deanabb

Measuring

Variable Importance

with Target Shuffling


Dean Abbott

Abbott Analytics

KNIME Fall Summit

September 16, 2016

dean@abbottanalytics.com

Twitter: @deanabb

Measuring

Variable Importance

with Target Input Shuffling


Dean Abbott

Co-Founder and Chief Data Scientist and

Chief Technology Officer, SmarterHQ

dean@abbottanalytics.com

Twitter: @deanabb


© Abbott Analytics, 2001-2016

A SaaS contextual marketing technology Tier 1 brands use to drive conversion and loyalty, through multi-channel personalization


AWS: Redshift, MySQL/Aurora, EC2, S3, Kinesis


Why Am I Talking About this

Arcane Topic?

• I’ve been bothered by this for decades... yes, I’m that old

• It’s conceptually easy to do.


Variable Importance in

Linear Regression


Variable Importance in

Decision Trees

• Decision Trees

• You think they are easy to explain?


Variable Importance in

Neural Networks

• Huh?


Variable Importance in

Neural Networks

• Or what neural nets really look like…


Naïve Bayes Model Outputs


Essentially a series of cross-tabs for every variable!

Remember, the final probability is the product of the individual variable probabilities.
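
The product rule can be illustrated with a tiny sketch. The cross-tab probabilities below are invented for illustration, and the variable names are hypothetical; the score is normalized at the end so the class probabilities sum to 1.

```python
import numpy as np

# Hypothetical numbers, invented for illustration only.
# Index 0 = class 0, index 1 = class 1.
prior = np.array([0.7, 0.3])
p_var1_given_class = np.array([0.2, 0.6])  # P(var1 = observed value | class)
p_var2_given_class = np.array([0.5, 0.9])  # P(var2 = observed value | class)

# Naive Bayes score: prior times the product of per-variable probabilities.
score = prior * p_var1_given_class * p_var2_given_class

# Normalize so the class probabilities sum to 1.
posterior = score / score.sum()
```

Because every variable enters only through its own cross-tab, no single number in the model output says how much each variable matters overall.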


SVM Output


Neural Networks: Interpretation via Sensitivities

• Sensitivities reflect the amount of change in the outputs when each of the inputs is changed or wiggled some small amount—a larger sensitivity means the output changes more for a small change in the input.

• Provide measure of the importance of each input variable in the model (by itself)

• Can use sensitivities to reduce input variables in other neural network, decision tree, or regression models
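
One way to picture a sensitivity is a finite-difference "wiggle" of each input. The sketch below is only an illustration: `sensitivities`, `toy_model`, and the step size `delta` are hypothetical names and choices, and a small fixed function stands in for a trained network.

```python
import numpy as np

def sensitivities(predict, X, delta=1e-3):
    """Mean absolute change in the output per unit change in each input."""
    X = np.asarray(X, dtype=float)
    base = predict(X)
    result = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_wiggled = X.copy()
        X_wiggled[:, j] += delta  # wiggle only input j
        result[j] = np.mean(np.abs(predict(X_wiggled) - base)) / delta
    return result

# Hypothetical stand-in for a trained network: a fixed nonlinear function.
def toy_model(X):
    return np.tanh(0.5 * X[:, 0]) + 0.1 * X[:, 1]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))
sens = sensitivities(toy_model, X_demo)
```

In this toy setup the first input comes out with the larger sensitivity, and each input is assessed by itself while the others are held fixed.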


KNIME Random Forest Node

Helps with Importance


Outline

• Classical variable importance: linear regression

• Hack #1: use linear regression model statistics to infer variable importance

• Hack #2: use target shuffling to infer variable importance


The Data: Easiest Possible!

• 3 inputs: each is a random Normal: mean = 20, std = 5

• Target variable: 0.5*var1 + 0.2*var2 + 0.3*var3

• 95,412 records (same size as cup98lrn)
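
A minimal sketch of how such a data set could be generated with NumPy. The seed and the variable names are arbitrary choices, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an arbitrary choice
n_records = 95_412               # same size as cup98lrn

# Three independent inputs, each Normal(mean=20, std=5).
X = rng.normal(loc=20.0, scale=5.0, size=(n_records, 3))

# Target is an exact (noise-free) weighted sum of the inputs.
weights = np.array([0.5, 0.2, 0.3])
y = X @ weights
```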


Let’s Start with Normal


Variable Importance Using

Linear Regression Coefficient

• Coefficients match (by definition) the proportions used to build the target variable

• This is the average influence of each input on the predictions for all records
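
As a quick check of this claim, an ordinary least-squares fit on data built this way should return the weights used to construct the target. This is a sketch using NumPy's `lstsq`, with a smaller sample and an arbitrary seed; the talk itself used KNIME.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(20.0, 5.0, size=(10_000, 3))
y = X @ np.array([0.5, 0.2, 0.3])  # noise-free target

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, slopes = coef[0], coef[1:]
```

Here `slopes` should come back as approximately 0.5, 0.2, and 0.3, with an intercept near zero.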


t-proportion

For Each Variable to Assess Influence

• The t-value measures the significance of the relationship.

• It turns out that the proportions of the t-values for the exact model match the coefficients
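
A sketch of the t-proportion calculation. One assumption is added here that is not on the slide: a little Gaussian noise is mixed into the target so the standard errors are not zero. The proportions track the coefficients in this setup because all three inputs share the same spread.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
X = rng.normal(20.0, 5.0, size=(n, 3))
# Assumption: small noise added so the t-values are finite.
y = X @ np.array([0.5, 0.2, 0.3]) + rng.normal(0.0, 1.0, size=n)

A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

residuals = y - A @ coef
dof = n - A.shape[1]
sigma2 = residuals @ residuals / dof
cov = sigma2 * np.linalg.inv(A.T @ A)  # covariance of the estimates
t_values = coef / np.sqrt(np.diag(cov))

# Proportion of |t| across the three inputs (intercept excluded).
t_prop = np.abs(t_values[1:]) / np.abs(t_values[1:]).sum()
```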


Variable Importance Using

Prediction Proportion

• How would an empiricist compute influence?

1. Compute the proportion of the prediction that comes from each term in the model

   1. Influence of variable 1 = W1 * var1

   2. Influence of variable 2 = W2 * var2

   3. Influence of variable 3 = W3 * var3

2. Average the influences over all records


Variable Importance Using

Prediction Proportion

• Compute the contribution of each term in the linear regression model separately for each record.

• Var1_influence = $var1coef$ * $var1$, etc.

• Express each contribution as a proportion of the predicted target variable value

• Average each variable’s proportion over all records to compute the average influence of each variable
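
The prediction-proportion steps can be sketched in a few lines of NumPy. The data and names are illustrative, and absolute values are taken so that a rare negative input cannot produce a negative share.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(20.0, 5.0, size=(10_000, 3))
coefs = np.array([0.5, 0.2, 0.3])  # var1coef, var2coef, var3coef

# Per-record contribution of each term: coefficient times input value.
contrib = np.abs(X * coefs)

# Per-record share of the summed contributions. With no intercept and
# positive terms, this sum is the prediction itself.
proportion = contrib / contrib.sum(axis=1, keepdims=True)

# Averaging over all records gives one influence score per variable.
influence = proportion.mean(axis=0)
```

On the idealized data the three scores land close to 0.5, 0.2, and 0.3.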


So Far So Good

• Now let’s do the same analysis for

  • Neural Networks

  • Support Vector Machines.

• Uh.....maybe not


Do it the YACK way

• Yet

• Another

• Creative use of

• KNIME


Why “Target Shuffling”?

• We don’t always have nice metrics to identify the best inputs with predictive models (NNets, SVM, … anything other than regression!)

• Even with regression, we don’t always have nice inputs

• See John Elder’s introduction of Target Shuffling to the data mining community

http://semanticommunity.info/@api/deki/files/30744/Elder_-_Target_Shuffling_Sept.2014.pdf

Input Distributions Are

Not Always Ideal


Why “Target Shuffling”?

• Don’t care about the “target” part

• The Target shuffling node doesn’t care either

• Scramble (randomly) a single (input variable) column

• Target Shuffling Node doesn’t have to be in a loop; it can scramble a column while leaving the others in their natural order

• Captures the actual distribution of the data
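
Outside KNIME, the same idea can be sketched with a permutation of a single column. The array and seed below are illustrative. The shuffled column keeps exactly the same set of values, so its distribution is unchanged, and the other columns stay in their natural order.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(20.0, 5.0, size=(1_000, 3))

X_shuffled = X.copy()
X_shuffled[:, 0] = rng.permutation(X[:, 0])  # scramble only column 0
```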


Let’s call it

Input Shuffling


Principles of Input Shuffling

• Key: randomly re-select the value of a single input variable while leaving all other variables at their original values

• Compute the standard deviation (or some other measure of perturbation) for each record

  • Of the target variable predictions

  • NOT the actual target variable

• This perturbation is a measure of how influential the variable is in the model

  • High standard deviation -> lots of influence

  • Low standard deviation -> not much influence

  • ~0 standard deviation -> no influence
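
These principles can be sketched in Python. This is not the KNIME workflow from the talk: `input_shuffling_influence` is a hypothetical helper, `predict` is any function that maps an input array to predictions, and 50 shuffles is just the default.

```python
import numpy as np

def input_shuffling_influence(predict, X, n_shuffles=50, seed=0):
    """Average per-record std of predictions when one input is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    influence = np.empty(X.shape[1])
    for j in range(X.shape[1]):          # loop over input variables
        preds = np.empty((n_shuffles, X.shape[0]))
        for s in range(n_shuffles):      # loop over shuffles
            X_s = X.copy()
            X_s[:, j] = rng.permutation(X[:, j])
            preds[s] = predict(X_s)
        # Spread of each record's predictions, then averaged over records.
        influence[j] = preds.std(axis=0).mean()
    return influence

# Demo on idealized linear data; the "model" is the known weighted sum.
rng = np.random.default_rng(4)
X_demo = rng.normal(20.0, 5.0, size=(5_000, 3))
weights = np.array([0.5, 0.2, 0.3])
infl = input_shuffling_influence(lambda X: X @ weights, X_demo)
infl_prop = infl / infl.sum()
```

Because only `predict` is called, the same function applies unchanged to a neural network, SVM, or tree ensemble.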


Shuffled Inputs Meta Node

Two loops: (1) loop over the input variables and (2) loop over repeated shuffles of the current input variable (50x or so)


Shuffling Inputs

All inputs and target

Just 1 input at a time


Single Record:

What It Looks Like

• Single record: 50 “shuffles”: Row0


Average for All Records in Data

(~9K for this data set)

• Measures the spread of the predictions when randomly perturbing the single input variable


Variable Importance Using

Input Shuffling for

Idealized Linear Regression Data

• Compute the proportion of the average standard deviation that comes from shuffling each input (keeping the others at their original values)

• (yes, I know I’m averaging standard deviations!)
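
For the idealized linear data, the expected result can be worked out by hand. This is a sketch, assuming the fitted model reproduces the linear target exactly and that the spread over a record's shuffles approaches the input's standard deviation. The symbols $w_j$ (coefficient) and $\sigma_j$ (standard deviation of input $j$) are introduced here for illustration.

```latex
\hat{y} = \sum_k w_k x_k
\quad\Rightarrow\quad
\operatorname{sd}_{\text{shuffle } j}(\hat{y}) \approx |w_j|\,\sigma_j
\\[4pt]
\text{influence}_j
= \frac{|w_j|\,\sigma_j}{\sum_k |w_k|\,\sigma_k},
\qquad
\text{influence}_1
= \frac{0.5 \cdot 5}{0.5 \cdot 5 + 0.2 \cdot 5 + 0.3 \cdot 5} = 0.5
```

The same calculation gives 0.2 and 0.3 for the other two inputs. The proportions match the coefficients here only because all three inputs share the same standard deviation.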


Realistic Data:

KDD Cup 1998

• 95,412 records: cup98lrn from the KDD Cup 1998 competition

• Use only the responders (4,843) in linear regression models

• Hundreds of fields in the data, but only 4 used for research purposes

  • LASTGIFT, NGIFTALL, RFA_2F, D_RFA_2A

• Continuous target

• Two continuous

• One ordinal (RFA_2F)

• One dummy (D_RFA_2A)


Realistic Data:

KDD Cup 1998

• Heavy skew of LASTGIFT, NGIFTALL, TARGET_D

• Makes visualization difficult

• Biases regression coefficients (if one cares)


Could Use Normalized Data

• To remove influence of skew and scale

• Log10 transform LASTGIFT, NGIFTALL, TARGET_D

• Scale all variables (post log10) to [0, 1]
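
A minimal sketch of the two normalization steps for a skewed column. `log10_minmax` is a hypothetical helper that assumes strictly positive, non-constant values; the slide does not say how zeros were handled. Columns that are not log-transformed would only get the min-max step.

```python
import numpy as np

def log10_minmax(values):
    """Log10-transform strictly positive values, then rescale to [0, 1]."""
    logged = np.log10(np.asarray(values, dtype=float))
    lo, hi = logged.min(), logged.max()
    return (logged - lo) / (hi - lo)

scaled = log10_minmax([1.0, 10.0, 100.0])
```

For the three sample values, the log10 step yields 0, 1, and 2, which the min-max step maps to 0, 0.5, and 1.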


Normalized Data

• Relationships clearer

• LASTGIFT has a strong positive correlation with TARGET_D

• NGIFTALL, RFA_2F, D_RFA_2A all appear to have a slight negative correlation with TARGET_D


The Basic Model:

Linear Regression

Coefficient

Use abs() for all calculations


Linear Regression: Compare

Influence Using Different Methods

Coefficient, T-Proportion, Prediction Proportion, Input Shuffling

Use abs() for all calculations

Use abs() for all t-proportion calculations


Linear Regression, Neural Network, and

Random Forest: Input Shuffling Influence

Input Shuffling - LR, Input Shuffling - MLP


Repeat for More Inputs –

KDD Cup 98


Apply Input Shuffling to

Larger KDD Cup 98 Data

Shuffle LASTGIFT_log10

Variable Influence from

Regression Diagnostics


Input Shuffling Variable Influence:

Regression


currentColumnName   VariableInfluence_Linear   VariableInfluence_RF   VariableInfluence_GBM
D_RFA_2A            0.0518                     0.0139                 0.0051
LASTGIFT_log10      0.0477                     0.0383                 0.0596
E_RFA_2A            0.0426                     0.0155                 0.0153
F_RFA_2A            0.0266                     0.0105                 0.0037
MINRAMNT_log10      0.0077                     0.0127                 0.0113
RFA_2F              0.0073                     0.0122                 0.0063
A_GEOCODE2          0.0060                     0.0020                 0.0008
B_GEOCODE2          0.0057                     0.0011                 0.0002
MINRDATE            0.0040                     0.0061                 0.0085
NGIFTALL            0.0038                     0.0075                 0.0066
MAXRDATE            0.0028                     0.0035                 0.0044
C_GEOCODE2          0.0025                     0.0005                 0.0000
NUMPRM12            0.0024                     0.0022                 0.0033
DOMAIN3             0.0021                     0.0008                 0.0009
CARDPM12            0.0016                     0.0026                 0.0037
LASTDATE            0.0005                     0.0029                 0.0018
AGE_imputerand      0.0004                     0.0029                 0.0046
DOMAIN2             0.0002                     0.0012                 0.0002
NUMPROM             0.0001                     0.0036                 0.0067
DOMAIN1             0.0000                     0.0000                 0.0000


Accuracy Comparison on

Testing Data

Linear Regression, Random Forests, Gradient Boosting

Input Shuffling Variable Influence:

Regression (Unnormalized!)


currentColumnName   VariableInfluence_Linear   VariableInfluence_RF   VariableInfluence_GBM
E_RFA_2A            4.337                      0.807                  0.396
LASTGIFT            4.052                      2.252                  4.016
D_RFA_2A            3.566                      0.625                  0.245
F_RFA_2A            3.552                      0.457                  0.000
RAMNTALL            2.429                      0.540                  1.239
NGIFTALL            2.258                      0.692                  0.957
MINRAMNT            2.111                      0.708                  0.722
RFA_2F              1.274                      0.618                  0.480
FISTDATE            0.970                      0.298                  0.731
A_GEOCODE2          0.754                      0.130                  0.086
B_GEOCODE2          0.519                      0.082                  0.017
DOMAIN3             0.362                      0.052                  0.066
DOMAIN1             0.358                      0.080                  0.036
C_GEOCODE2          0.307                      0.028                  0.000
NUMPRM12            0.304                      0.154                  0.262
DOMAIN2             0.289                      0.072                  0.028
MAXRDATE            0.213                      0.297                  0.444
MINRDATE            0.200                      0.345                  0.455
CARDPM12            0.178                      0.139                  0.296
AGE_imputerand      0.174                      0.202                  0.363
MAXRAMNT            0.168                      1.791                  1.547
LASTDATE            0.036                      0.240                  0.243


Input Shuffling Variable Influence:

Classification


Variable            VariableInfluence_Logistic   VariableInfluence_RF   VariableInfluence_GBM
FISTDATE            0.0123                       0.0349                 0.0124
D_RFA_2A            0.0080                       0.0024                 0.0027
RFA_2F              0.0080                       0.0176                 0.0040
DOMAIN3             0.0072                       0.0056                 0.0057
E_RFA_2A            0.0069                       0.0069                 0.0055
NGIFTALL            0.0057                       0.0347                 0.0180
DOMAIN1             0.0011                       0.0084                 0.0013
LASTGIFT            0.0004                       0.0236                 0.0132
F_RFA_2A            0.0003                       0.0103                 0.0001


Discussion

• Why Input Shuffling is good

  • Works for any input distribution

  • Works with any algorithm

  • Measures importance based on other input variables in natural patterns rather than an idealized value (like the mean or mode)

  • Can use many metrics to measure what “importance” means to you

• Why Input Shuffling is not so good

  • Takes a long time to run if you have lots of inputs, lots of records

  • No statistically defensible metric to use (yet)


Conclusion

• Variable influence can be computed as a single

• Coefficients aren’t good measures unless the variables conform to linear regression assumptions

• Some models don’t have “coefficients” at all so we can’t use the linear regression approach

• Using target shuffling, we can generate randomized sensitivity scores easily for any model

• If inputs are not normally distributed, average overall influence doesn’t tell the full story (or may even tell a misleading story) about how valuable the variable is in predicting the target

• Breaking predictions into bins (deciles or other number of bins) allows us to compute an influence score for every part of the predicted range

• Answers the question: for high predicted values, which variables are most influential?


Binning Predicted Values

into Buckets (Deciles, Quintiles,…)

• Deciling predicted values allows us to compute variable influence for each of these ranges of the predicted values. Note that the top and bottom bins have much larger variances.
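
A sketch of per-bin influence, assuming a per-record shuffle standard deviation has already been computed for one variable. `influence_by_bin` and the demo arrays are hypothetical. Bins are numbered from 0 here, so bin 0 corresponds to the lowest decile (Bin 1 on the slides).

```python
import numpy as np

def influence_by_bin(predictions, per_record_std, n_bins=10):
    """Mean per-record shuffle std within each quantile bin of predictions."""
    predictions = np.asarray(predictions, dtype=float)
    per_record_std = np.asarray(per_record_std, dtype=float)
    # Interior quantile edges; bin 0 holds the lowest predictions.
    edges = np.quantile(predictions, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(predictions, edges)
    return np.array([per_record_std[bins == b].mean() for b in range(n_bins)])

# Hypothetical inputs: influence that grows with the predicted value.
preds = np.arange(100, dtype=float)
stds = preds / 100.0
by_bin = influence_by_bin(preds, stds)
```

Plotting `by_bin` against the bin number shows where in the predicted range a variable matters most.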


LASTGIFT Influence

• LASTGIFT has stronger influence (positive) at the high end of predictions

• Significant influence for all predicted values

• Nearly constant influence for Bins 7-10

• Monotonic influence vs. predicted values


RFA_2F Influence

• RFA_2F has stronger influence (negative) at the low end of predictions

• Almost no influence for Bin 7 through Bin 10

• Monotonic influence vs. predicted values


NGIFTALL Influence

• NGIFTALL has stronger influence (negative) at the low end of predictions

• Mostly monotonic influence vs. predicted values


D_RFA_2A Influence

• D_RFA_2A has strong influence at the low end of predictions only (Bin 1 and Bin 2)

• No influence at all for Bin 3 through Bin 10
