
Many Faces of Feature Importance: Comparing Built-in and Post-hoc Feature Importance in Text Classification

Vivian Lai and Zheng Cai and Chenhao Tan
Department of Computer Science
University of Colorado Boulder
Boulder, CO
vivian.lai, jon.z.cai, [email protected]

Abstract

Feature importance is commonly used to explain machine predictions. While feature importance can be derived from a machine learning model with a variety of methods, the consistency of feature importance via different methods remains understudied. In this work, we systematically compare feature importance from built-in mechanisms in a model such as attention values and post-hoc methods that approximate model behavior such as LIME. Using text classification as a testbed, we find that 1) no matter which method we use, important features from traditional models such as SVM and XGBoost are more similar to each other than to those from deep learning models; 2) post-hoc methods tend to generate more similar important features for two models than built-in methods. We further demonstrate how such similarity varies across instances. Notably, important features do not always resemble each other better when two models agree on the predicted label than when they disagree.

1 Introduction

As machine learning models are adopted in societally important tasks such as recidivism prediction and loan approval, explaining machine predictions has become increasingly important (Doshi-Velez and Kim, 2017; Lipton, 2016). Explanations can potentially improve the trustworthiness of algorithmic decisions for decision makers, facilitate model developers in debugging, and even allow regulators to identify biased algorithms.

A popular approach to explaining machine predictions is to identify important features for a particular prediction (Luong et al., 2015; Ribeiro et al., 2016; Lundberg and Lee, 2017). Typically, these explanations assign a value to each feature (usually a word in NLP), and thus enable visualizations such as highlighting the top k features.

In general, there are two classes of methods: 1) built-in feature importance that is embedded in the machine learning model, such as coefficients in linear models and attention values in attention mechanisms; 2) post-hoc feature importance through credit assignment based on the model, such as LIME. It is well recognized that robust evaluation of feature importance is challenging (Jain and Wallace, 2019; Nguyen, 2018, inter alia), which is further complicated by the different use cases of explanations (e.g., for decision makers vs. for developers). Throughout this work, we refer to machine learning models that learn from data as models and methods to obtain local explanations (i.e., feature importance in this work) for a prediction by a model as methods.

While prior research tends to focus on the internals of models in designing and evaluating methods of explanations, e.g., how well explanations reflect the original model (Ribeiro et al., 2016), we view feature importance itself as a subject of study, and aim to provide a systematic characterization of important features obtained via different methods for different models. This view is particularly important when explanations are used to support decision making because they are the only exposure to the model for decision makers. It would be desirable that explanations are consistent across different instances. In comparison, debugging represents a distinct use case where developers often know the mechanism of the model beyond explanations. Our view also connects to studying explanation as a product in cognitive studies of explanations (Lombrozo, 2012), and is complementary to the model-centric perspective.

Given a wide variety of models and methods to generate feature importance, there are basic open questions such as how similar important features are between models and methods, how important features distribute across instances, and what linguistic properties important features tend to have.


One of favorite places to eat on the King W side, simple and relatively quick. I typically always get the chicken burrito and the small is enough for me for dinner. Ingredients are always fresh and watch out for the hot sauce cause it’s skull scratching hot. Seating is limited so be prepared to take your burrito outside or you can even eat at Metro Hall Park.

built-in
  SVM (ℓ2): sauce, seating, park, prepared, even, always, can, fresh, quick, favorite
  XGBoost: is, can, quick, fresh, at, to, always, even, favorite, and
  LSTM with attention: me, be, relatively, enough, always, fresh, ingredients, prepared, quick, favorite
  BERT: ., ingredients, relatively, quick, places, enough, dinner, typically, me, i

LIME
  SVM (ℓ2): the, dinner, be, quick, and, even, you, always, fresh, favorite
  XGBoost: you, to, fresh, quick, at, can, even, always, and, favorite
  LSTM with attention: dinner, ingredients, typically, fresh, places, cause, quick, and, favorite, always
  BERT: one, watch, to, enough, limited, cause, and, fresh, hot, favorite

Table 1: 10 most important features (separated by commas) identified by different methods for different models for the given review. In the interest of space, we only show built-in and LIME here.

We use text classification as a testbed to answer these questions. We consider built-in importance from both traditional models such as linear SVM and neural models with attention mechanisms, as well as post-hoc importance based on LIME and SHAP. Table 1 shows important features for a Yelp review in sentiment classification. Although most approaches consider “fresh” and “favorite” important, there exists significant variation.

We use three text classification tasks to characterize the overall similarity between important features. Our analysis reveals the following insights:

• (Comparison between approaches) Deep learning models generate important features that differ more from those of traditional models such as SVM and XGBoost. Post-hoc methods tend to reduce the dissimilarity between models by making important features more similar than the built-in method does. Finally, different approaches do not generate more similar important features even if we focus on the most important features (e.g., the top one feature).

• (Heterogeneity between instances) Similarity between important features is not always greater when two models agree on the predicted label, and longer instances are less likely to share important features.

• (Distributional properties) Deep models generate more diverse important features with higher entropy, which indicates lower consistency across instances. Post-hoc methods bring the POS distribution closer to the background distribution.

In summary, our work systematically compares important features from different methods for different models, and sheds light on how different models/methods induce important features. Our work takes the first step toward understanding important features as a product and helps inform the adoption of feature importance for different purposes. Our code is available at https://github.com/BoulderDS/feature-importance.

2 Related Work

To provide further background for our work, we summarize currently popular approaches to generating and evaluating explanations of machine predictions, with an emphasis on feature importance.

Approaches to generating explanations. A battery of approaches have recently been proposed to explain machine predictions (see Guidotti et al. (2019) for an overview), including example-based approaches that identify “informative” examples in the training data (e.g., Kim et al., 2016) and rule-based approaches that reduce complex models to simple rules (e.g., Malioutov et al., 2017). Our work focuses on characterizing properties of feature-based approaches. Feature-based approaches tend to identify important features in an instance and enable visualizations with important features highlighted. We discuss several directly related post-hoc methods here and introduce the built-in methods in §3. A popular approach, LIME, fits a sparse linear model to approximate model predictions locally (Ribeiro et al., 2016); Lundberg and Lee (2017) present a unified framework based on Shapley values, which can be computed with different approximation methods for different models. Gradients are popular for identifying important features in deep learning models since these models are usually differentiable (Shrikumar et al., 2017); for instance, Li et al. (2016) use gradient-based saliency to compare LSTMs with simple recurrent networks.


Definition and evaluation of explanations. Despite a myriad of studies on approaches to explaining machine predictions, explanation is a rather overloaded term and evaluating explanations is challenging. Doshi-Velez and Kim (2017) lay out three levels of evaluation: functionally-grounded evaluations based on proxy automatic tasks, human-grounded evaluations with laypersons on proxy tasks, and application-grounded evaluations based on expert performance in the end task. In text classification, Nguyen (2018) shows that automatic evaluation based on word deletion moderately correlates with human-grounded evaluations that ask crowdworkers to infer machine predictions based on explanations. However, explanations that help humans infer machine predictions may not actually help humans make better decisions/predictions. In fact, recent studies find that feature-based explanations alone have limited improvement on human performance in detecting deceptive reviews and media biases (Lai and Tan, 2019; Horne et al., 2019).

In another recent debate, Jain and Wallace (2019) examine attention as an explanation mechanism based on how well attention values correlate with gradient-based feature importance and whether they exclusively lead to the predicted label, and conclude that attention is not explanation. Similarly, Serrano and Smith (2019) show that attention is not a fail-safe indicator for explaining machine predictions based on intermediate representation erasure. However, Wiegreffe and Pinter (2019) argue that attention can be explanation depending on the definition of explanations (e.g., plausibility and faithfulness).

In comparison, we treat feature importance itself as a subject of study and compare different approaches to obtaining feature importance from a model. Instead of providing a normative judgment with respect to what makes good explanations, our goal is to allow decision makers or model developers to make informed decisions based on properties of important features using different models and methods.

3 Approach

In this section, we first formalize the problem of obtaining feature importance and then introduce the models and methods that we consider in this work. Our main contribution is to compare important features identified for a particular instance through different methods for different models.

Feature importance. For any instance t and a machine learning model m : t → y ∈ {0, 1}, we use method h to obtain feature importance on an interpretable representation of t, I_t^{m,h} ∈ R^d, where d is the dimension of the interpretable representation. In the context of text classification, we use unigrams as the interpretable representation. Note that the machine learning model does not necessarily use the interpretable representation. Next, we introduce the models and methods in this work.

Models (m). We include both recent deep learning models for NLP and popular machine learning models that are not based on neural networks. In addition, we make sure that the chosen models have some built-in mechanism for inducing feature importance and describe the built-in feature importance as we introduce the model.1 A brief sketch of reading such built-in scores off a fitted model follows the list.

• Linear SVM with ℓ2 (or ℓ1) regularization. Linear SVM has shown strong performance in text categorization (Joachims, 1998). The absolute value of the coefficients in these models is typically considered a measure of feature importance (e.g., Ott et al., 2011). We also consider ℓ1 regularization because it is often used to induce sparsity in the model.

• Gradient boosting tree (XGBoost). XGBoost represents an ensembled tree algorithm that shows strong performance in competitions (Chen and Guestrin, 2016). We use the default option in XGBoost to measure feature importance with the average training loss gained when using a feature for splitting.

• LSTM with attention (often shortened as LSTM in this work). Attention is a commonly used technique in deep learning models for NLP (Bahdanau et al., 2015). The intuition is to assign a weight to each token before aggregating into the final prediction (or decoding in machine translation). We use the dot product formulation in Luong et al. (2015). The weight on each token has been commonly used to visualize the importance of each token. To compare with the previous bag-of-words models, we use the average weight of each type (unique token) in this work to measure feature importance.

• BERT. BERT represents an example architecture based on Transformers, which could show different behavior from LSTM-style recurrent networks (Devlin et al., 2019; Vaswani et al., 2017; Wolf, 2019). It also achieves state-of-the-art performance in many NLP tasks. Similar to LSTM with attention, we use the average attention values of the 12 heads used by the first token at the final layer (the representation passed to fully connected layers) to measure feature importance for BERT.2 Since BERT uses a subword tokenizer, for each word, we aggregate the attention on the related subparts. BERT also requires special processing due to its length constraint; please refer to the supplementary material for details. As a result, we focus on presenting LSTM with attention in the main paper for ease of understanding.

1 For instance, we do not consider a plain LSTM (without attention) as a model here due to the lack of a commonly accepted built-in mechanism.
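The following is a minimal sketch, not the authors' released code, of how built-in importance can be read off two of these models: the coefficients of a fitted scikit-learn LinearSVC restricted to the words present in an instance, and the per-type average of attention weights for an attention model. The `svm`, `vectorizer`, `tokens`, and `attn_weights` inputs are illustrative assumptions.

```python
import numpy as np

def svm_builtin_topk(svm, vectorizer, text, k=10):
    # Rank the words appearing in `text` by the absolute value of the
    # corresponding unigram coefficient of a fitted linear SVM.
    vocab = vectorizer.vocabulary_            # word -> feature index
    coefs = svm.coef_.ravel()                 # one coefficient per unigram
    words = set(vectorizer.build_analyzer()(text))
    scored = {w: abs(coefs[vocab[w]]) for w in words if w in vocab}
    return sorted(scored, key=scored.get, reverse=True)[:k]

def attention_builtin_topk(tokens, attn_weights, k=10):
    # Average the attention weight of each type (unique token), as done
    # for LSTM with attention, and return the k highest-scoring types.
    per_type = {}
    for tok, w in zip(tokens, attn_weights):
        per_type.setdefault(tok, []).append(float(w))
    scored = {t: float(np.mean(ws)) for t, ws in per_type.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```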

Methods (h). For each model, in addition to the built-in feature importance that we described above, we consider the following two popular methods for extracting post-hoc feature importance (see the supplementary material for details of using the post-hoc methods; a brief usage sketch follows the list).

• LIME (Ribeiro et al., 2016). LIME generates post-hoc explanations by fitting a local sparse linear model to approximate model predictions. As a result, the explanations are sparse.

• SHAP (Lundberg and Lee, 2017). SHAP unifies several interpretations of feature importance through Shapley values. The main intuition is to account for the importance of a feature by examining the change in prediction outcomes for all the combinations of other features. Lundberg and Lee (2017) propose various approaches to approximate the computation for different classes of models (including gradient-based methods for deep models).
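As a hedged illustration (not the exact configuration used in this work), the lime package can produce such unigram-level importance for any black-box text classifier; the class names and the `predict_proba` function below are placeholders.

```python
from lime.lime_text import LimeTextExplainer

def lime_topk(text, predict_proba, k=10, num_samples=1000):
    # Fit LIME's local sparse linear model around `text` and return the
    # k unigrams with the largest absolute weights. `predict_proba` maps
    # a list of raw strings to an array of class probabilities.
    explainer = LimeTextExplainer(class_names=["negative", "positive"])
    explanation = explainer.explain_instance(
        text, predict_proba, num_features=k, num_samples=num_samples)
    return [word for word, weight in explanation.as_list()]
```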

Note that feature importance obtained via all approaches is local, because the top features are conditioned on an instance (i.e., words present in an instance), even for the built-in method for SVM and XGBoost.

Comparing feature importance. Given I_t^{m,h} and I_t^{m',h'}, we use the Jaccard similarity based on the top k features with the greatest absolute feature importance,

|TopK(I_t^{m,h}) ∩ TopK(I_t^{m',h'})| / |TopK(I_t^{m,h}) ∪ TopK(I_t^{m',h'})|,

as our main similarity metric for two reasons. First, the most typical way to use feature importance for interpretation purposes is to show the most important features (Lai and Tan, 2019; Ribeiro et al., 2016; Horne et al., 2019). Second, some models and methods inherently generate sparse feature importance, so most feature importance values are 0.

2 We also tried using the max of the 12 heads and previous layers; the average of the final layer is more similar to SVM (ℓ2) than the average of the first layer. Results are in the supplementary material. Vig (2019) shows that attention in BERT tends to fall on first words, neighboring words, and even separators. The complex choices for BERT further motivate our work to view feature importance as a subject of study.
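A minimal sketch of this metric, assuming each importance vector is represented as a dictionary from features to scores (the example values are made up):

```python
def topk(importance, k):
    # Features with the k greatest absolute importance values.
    return set(sorted(importance, key=lambda f: -abs(importance[f]))[:k])

def topk_jaccard(imp_a, imp_b, k=10):
    # Jaccard similarity of the top-k feature sets of two importance maps.
    a, b = topk(imp_a, k), topk(imp_b, k)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# The top-2 sets {fresh, favorite} and {fresh, hot} share one of three
# distinct features, so the similarity is 1/3.
print(topk_jaccard({"fresh": 0.9, "favorite": 0.8, "the": 0.1},
                   {"fresh": 0.7, "hot": 0.6, "favorite": 0.2}, k=2))
```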

It is useful to discuss the implication of similarity before we proceed. On the one hand, it is possible that different models/methods identify the same set of important features (high similarity) and the performance difference in prediction is due to how different models weigh these important features. If this were true, the choice of model/method would matter little for visualizing important features. On the other hand, a low similarity poses challenges for choosing which model/method to use for displaying important features. In that case, this work aims to develop an understanding of how the similarity varies depending on models and methods, instances, and features. We leave it to future work to examine the impact on human interaction with feature importance. Low similarity may enable model developers to understand the differences between models, but may lead to challenges for decision makers in getting a consistent picture of what the model relies on.

4 Experimental Setup and Hypotheses

Our goal is to characterize the similarities and differences between feature importance obtained with different methods and different models. In this section, we first present our experimental setup and then formulate our hypotheses.

Experimental setup. We consider the following three text classification tasks in this work. We choose to focus on classification because classification is the most common scenario used for examining feature importance and the associated human interpretation (e.g., Jain and Wallace, 2019).

• Yelp (Yelp, 2019). We set up a binary classification task to predict whether a review is positive (rating ≥ 4) or negative (rating ≤ 2). As the original dataset is huge, we subsample 12,000 reviews for this work.

• SST (Socher et al., 2013). It is a sentence-level sentiment classification task and represents a common benchmark. We only consider the binary setup here.

• Deception detection (Ott et al., 2013, 2011). This dataset was created by extracting genuine reviews from TripAdvisor and collecting deceptive reviews using Turkers. It is relatively small, with 1,200 reviews, and represents a distinct task from sentiment classification.


Model                Yelp   SST    Deception
SVM (ℓ2)             92.3   80.8   86.3
SVM (ℓ1)             91.5   79.2   84.4
XGBoost              88.8   75.9   83.4
LSTM w/ attention    93.9   82.6   88.4
BERT                 95.5   92.2   90.9

Table 2: Accuracy on the test set.

For all the tasks, we use 20% of the dataset as the test set. For SVM and XGBoost, we use cross validation on the other 80% to tune hyperparameters. For LSTM with attention and BERT, we use 10% of the dataset as a validation set and choose the best hyperparameters based on the validation performance. We use spaCy to tokenize and obtain part-of-speech tags for all the datasets (Honnibal and Montani, 2017). Table 2 shows the accuracy on the test set, and our results are comparable to prior work. Not surprisingly, BERT achieves the best performance in all three tasks. For important features, we use k ≤ 10 for Yelp and deception detection, and k ≤ 5 for SST as it is a sentence-level task. See the supplementary material for details of preprocessing, learning, and dataset statistics.

Hypotheses. We aim to examine the following three research questions in this work: 1) How similar are important features between models and methods? 2) What factors relate to the heterogeneity across instances? 3) What words tend to be chosen as important features?

Overall similarity. Here we focus on discussing comparative hypotheses, but we would like to note that it is important to understand to what extent important features are similar across models (i.e., the value of the similarity score). First, as deep learning models and XGBoost are non-linear, we hypothesize that built-in feature importance is more similar between SVM (ℓ1) and SVM (ℓ2) than between other model pairs (H1a). Second, LIME generates important features that are more similar to those from SHAP than to built-in feature importance because both LIME and SHAP make additive assumptions, while built-in feature importance is based on drastically different models (H1b). It also follows that post-hoc explanations of different models show higher similarity than built-in explanations across models.

            SVM (ℓ2)   SVM (ℓ1)   XGB    LSTM   BERT
SVM (ℓ2)    1.00       0.76       0.38   0.26   0.18
SVM (ℓ1)    0.76       1.00       0.37   0.25   0.17
XGB         0.38       0.37       1.00   0.19   0.14
LSTM        0.26       0.25       0.19   1.00   0.21
BERT        0.18       0.17       0.14   0.21   1.00

Figure 1: Jaccard similarity between the top 10 features of different models based on built-in feature importance on Yelp. The similarity is the greatest between SVM (ℓ2) and SVM (ℓ1), while LSTM with attention and BERT pay attention to quite different features from the other models.

Third, the similarity with small k is higher (H1c) because, hopefully, all models and methods agree on what the most important features are.

Heterogeneity between instances. Given a pair of (model, method) combinations, our second question is concerned with how instance-level properties affect the similarity in important features between different combinations. We hypothesize that 1) when two models agree on the predicted label, the similarity between important features is greater (H2a); 2) longer instances are less likely to share similar important features (H2b); 3) instances with a higher type-token ratio,3 which might be more complex, are less likely to share similar important features (H2c).

Distribution of important features. Finally, we examine what words tend to be chosen as important features. This question certainly depends on the nature of the task, but we would like to understand how consistent different models and methods are. We hypothesize that 1) deep learning models generate more diverse important features (H3a); 2) adjectives are more important in sentiment classification, while pronouns are more important in deception detection, as shown in prior work (H3b).

5 Similarity between Instance-level Feature Importance

We start by examining the overall similarity between different models using different methods. In a nutshell, we compute the average Jaccard similarity of the top k features for each pair of (m, h) and (m', h').

3 Type-token ratio is defined as the number of unique tokens divided by the number of tokens.
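A one-line illustration of this definition (a sketch; the token list is assumed to come from the spaCy tokenization used elsewhere in the paper):

```python
def type_token_ratio(tokens):
    # Number of unique tokens divided by the number of tokens.
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```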


[Figure 2: six line plots of Jaccard similarity against the number of important features (k) for Yelp, SST, and Deception, each with a random baseline. Top row: the built-in method, comparing SVM x XGB, SVM x LSTM, and XGB x LSTM. Bottom row: SVM x LSTM, comparing built-in, LIME, and SHAP.]

Figure 2: Similarity comparison between models with the same method. The x-axis represents the number of important features that we consider, while the y-axis shows the Jaccard similarity. Error bars represent standard errors throughout the paper. The top row compares three pairs of models using the built-in method, while the second row compares three methods on SVM and LSTM with attention (LSTM in figure legends always refers to LSTM with attention in this work). The random line is derived using the average similarity between two random samples of k features from 100 draws.

To facilitate effective comparisons, we first fix the method and compare the similarity of different models, and then fix the model and compare the similarity of different methods. Figure 1 shows the similarity between different models using the built-in feature importance for the top 10 features in Yelp (k = 10). Consistent with H1a, SVM (ℓ2) and SVM (ℓ1) are very similar to each other, and LSTM with attention and BERT clearly lead to quite different top 10 features from the other models. As the number of important features (k) can be useful for evaluating the overall trend, we focus on line plots, as in Figure 2, in the rest of the paper; the heatmap visualization represents a snapshot for k = 10 using the built-in method. Also, we only include SVM (ℓ2) in the main paper for ease of visualization and sometimes refer to it in the rest of the paper as SVM.

No matter which method we use, important features from SVM and XGBoost are more similar to each other than to those from deep learning models (Figure 2). First, we compare the similarity of feature importance between different models using the same method. Using the built-in method (first row in Figure 2), the solid line (SVM x XGBoost) is always above the other lines, usually by a significant margin, suggesting that deep learning models such as LSTM with attention are less similar to traditional models. In fact, the similarity between XGBoost and LSTM with attention is lower than between random samples for k = 1, 2 in SST. Similar results also hold for BERT (see the supplementary material). Another interesting observation is that post-hoc methods tend to generate greater similarity than built-in methods, especially LIME (the dashed line (LIME) is always above the solid line (built-in) in the second row of Figure 2). This is likely because LIME only depends on the model behavior (i.e., what the model predicts) and does not account for how the model works.

The similarity between important features from different methods tends to be lower for LSTM with attention (Figure 3). Second, we compare the similarity of feature importance derived from the same model with different methods.


[Figure 3: Jaccard similarity against k for Yelp, SST, and Deception, comparing built-in x LIME, LIME x SHAP, and built-in x SHAP for each of SVM, XGB, and LSTM, with a random baseline.]

Figure 3: Similarity comparison between methods using the same model. The similarity between different methods based on LSTM with attention is generally lower than for other models. Similar results hold for BERT (see the supplementary material).

[Figure 4: Jaccard similarity against k for Yelp, SST, and Deception, with curves for built-in, LIME, and SHAP, each split by whether the two models agree or disagree on the predicted label.]

Figure 4: Similarity between SVM (ℓ2) and LSTM with attention with different methods, grouped by whether these two models agree on the predicted label. The similarity is not always greater when they agree on the predicted labels than when they disagree.

[Figure 5: Spearman correlation between instance length and similarity, against k, for built-in x LIME, LIME x SHAP, and built-in x SHAP on each of SVM, XGB, and LSTM, for Yelp, SST, and Deception.]

Figure 5: In most cases, the similarity between feature importance is negatively correlated with length. Here we only show the comparison between different methods based on the same model. Similar results hold for the comparison between different models using the same method. For ease of comparison, the gray line marks the value 0. Generally, as k grows, the relationship becomes even more negatively correlated.

For deep learning models such as LSTM with attention, the similarity between feature importance generated by different methods is the lowest, especially when comparing LIME with SHAP. Notably, the results are much more cluttered in deception detection. Contrary to H1b, we do not observe that LIME is more similar to SHAP than to built-in importance. The order seems to depend on both the task and the model: even within SST, the similarity between built-in and LIME can rank third, second, or first. In other words, post-hoc methods generate more similar important features when we compare different models, but that is not the case when we fix the model. It is reassuring that the similarity between any pair is above random, with a sizable margin in most cases (BERT with SHAP is an exception; see the supplementary material).


[Figure 6: entropy of important features against k for Yelp, SST, and Deception, with curves for built-in, LIME, and SHAP on each of SVM, XGB, and LSTM.]

Figure 6: The entropy of important features. LSTM with attention generates more diverse important features than SVM and XGBoost.

[Figure 7: percentage of NOUN, VERB, ADJ, ADV, PRON, and DET among important features for the background, SVM, XGB, and LSTM, for Yelp, SST, and Deception.]

Figure 7: Part-of-speech tag distribution with the built-in method. “Background” shows the distribution of all words in the test set. LSTM with attention puts a strong emphasis on nouns in deception detection, but is not necessarily more different from the background than the other models.

Relation with k. As the relative order between different approaches can change with k, we have so far only focused on patterns that are relatively consistent over k and classification tasks. Contrary to H1c, the similarity between most approaches is not drastically greater for small k, which suggests that different approaches may not even agree on the most important features. In fact, there is no consistent trend as k grows: similarity mostly increases in SST (while our hypothesis is that it decreases), increases or stays level in Yelp, and shows varying trends in deception detection.

6 Heterogeneity between Instances

Given the overall low similarity between different methods/models, we next investigate how the similarity may vary across instances.

The similarity between models is not always greater when two models agree on the predicted label (Figure 4). One hypothesis for the overall low similarity between models is that different models tend to give different predictions and therefore choose different features to support their decisions. However, we find that the similarity between models is not particularly high when they agree on the predicted label, and is sometimes even lower than when they disagree. This is true for LIME in Yelp and for all methods in deception detection. In SST, the similarity when the models agree on the predicted label is generally greater than when they disagree. We show the comparison between SVM (ℓ2) and LSTM here, and similar results hold for other combinations (see the supplementary material). This observation suggests that feature importance may not connect with the predicted labels: different models agree for different reasons and also disagree for different reasons.

The similarity between models and methods is generally negatively correlated with length but positively correlated with type-token ratio (Figure 5). Our results support H2b: the Spearman correlation between length and similarity is mostly below 0, which indicates that the longer an instance is, the less similar the important features are. The negative correlation becomes stronger as k grows, indicating that length has a stronger effect on similarity when we consider more top features. However, this is not true in the case of LIME and SHAP, where the correlation between length and similarity is occasionally above 0 and sometimes even the declining relationship with k does not hold.
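A minimal sketch of this per-instance analysis, assuming a list of tokenized instances and the matching list of top-k Jaccard similarities for one pair of (model, method) combinations (the inputs are placeholders):

```python
from scipy.stats import spearmanr

def length_similarity_correlation(token_lists, similarities):
    # Spearman correlation between instance length (in tokens) and the
    # per-instance top-k Jaccard similarity of two (model, method) pairs.
    lengths = [len(tokens) for tokens in token_lists]
    rho, p_value = spearmanr(lengths, similarities)
    return rho, p_value
```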


[Figure 8: Jensen-Shannon score between the POS-tag distribution of important features and the background, against k, for built-in, LIME, and SHAP on each of SVM, XGB, and LSTM, for Yelp, SST, and Deception.]

Figure 8: Distance of the part-of-speech tag distributions between important features and all words (background). The distance is generally smaller with post-hoc methods for all models, although some exceptions exist for LSTM with attention and BERT.

Our result on type-token ratio is the opposite of H2c: the greater the type-token ratio, the higher the similarity (see the supplementary material). We believe that the reason is that type-token ratio is strongly negatively correlated with length (the Spearman correlation for the Yelp, SST, and deception datasets is -0.92, -0.59, and -0.84, respectively). In other words, type-token ratio becomes redundant with length and fails to capture text complexity beyond length.

7 Distribution of Important Features

Finally, we examine the distribution of important features obtained from different approaches. These results may partly explain the previously observed low similarity in feature importance.

Important features show higher entropy using LSTM with attention and lower entropy with XGBoost (Figure 6). As expected from H3a, LSTM with attention (the pink lines) is usually at the top (similar results for BERT are in the supplementary material). Such high entropy can contribute to the low similarity between LSTM with attention and other models. However, as the order in similarity between SVM and XGBoost is less stable, entropy cannot be the sole cause.

Distribution of POS tags (Figure 7 and Figure 8). We further examine the linguistic properties of important words. Consistent with H3b, adjectives are more important in sentiment classification than in deception detection. Contrary to our hypothesis, we find that pronouns do not always play an important role in deception detection. Notably, LSTM with attention puts a strong emphasis on nouns in deception detection. In all cases, determiners are under-represented among important words. With respect to the distance of the part-of-speech tag distributions between important features and all words (background), post-hoc methods tend to bring important words closer to the background words, which echoes the previous observation that post-hoc methods tend to increase the similarity between important words (Figure 8).
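A minimal sketch of the two quantities discussed in this section, assuming the top-k features per instance and per-word POS tags are already available; base-2 entropy and SciPy's Jensen-Shannon distance are our assumptions, as the paper does not pin down these details:

```python
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def feature_entropy(topk_per_instance):
    # Entropy of the corpus-level distribution over words chosen as
    # top-k features; higher entropy means less consistent features.
    counts = Counter(w for feats in topk_per_instance for w in feats)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

POS_TAGS = ("NOUN", "VERB", "ADJ", "ADV", "PRON", "DET")

def pos_distance(important_pos, background_pos, tags=POS_TAGS):
    # Jensen-Shannon distance between the POS-tag distribution of
    # important words and that of all words (the background).
    def normalize(pos_list):
        counts = Counter(pos_list)
        v = np.array([counts.get(t, 0) for t in tags], dtype=float)
        return v / v.sum()
    return float(jensenshannon(normalize(important_pos),
                               normalize(background_pos)))
```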

8 Concluding Discussion

In this work, we provide the first systematic characterization of feature importance obtained from different approaches. Our results show that different approaches can sometimes lead to very different important features, but there exist some consistent patterns between models and methods. For instance, deep learning models tend to generate diverse important features that differ from those of traditional models; post-hoc methods lead to more similar important features than built-in methods.

As important features are increasingly adopted for varying use cases (e.g., decision making vs. model debugging), we hope to encourage more work in understanding the space of important features, and how they should be used for different purposes. While we focus on consistent patterns across classification tasks, it is certainly interesting to investigate how properties related to tasks and data affect the findings. Another promising direction is to understand whether more concentrated important features (lower entropy) lead to better human performance in supporting decision making.

Acknowledgments

We thank Jim Martin, anonymous reviewers, and members of the NLP+CSS research group at CU Boulder for their insightful comments and discussions. This work was supported in part by NSF grant IIS-1849931.


References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of KDD.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.

Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2019. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

Benjamin D Horne, Dorit Nevo, John O'Donovan, Jin-Hee Cho, and Sibel Adali. 2019. Rating reliability and bias in news articles: Does AI assistance help everyone? In Proceedings of ICWSM.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In Proceedings of NAACL.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML.

Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of NeurIPS.

Vivian Lai and Chenhao Tan. 2019. On human predictions with explanations and predictions of machine learning models: A case study on deception detection. In Proceedings of FAT*.

Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2016. Visualizing and understanding neural models in NLP. In Proceedings of NAACL.

Zachary C Lipton. 2016. The mythos of model interpretability. arXiv preprint arXiv:1606.03490.

Tania Lombrozo. 2012. Explanation and abductive inference. Oxford Handbook of Thinking and Reasoning, pages 260-276.

Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of NeurIPS.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP.

Dmitry M Malioutov, Kush R Varshney, Amin Emad, and Sanjeeb Dash. 2017. Learning interpretable classification rules with boolean compressed sensing. In Transparent Data Mining for Big and Small Data, pages 95-121. Springer.

Dong Nguyen. 2018. Comparing automatic and human evaluation of local explanations for text classification. In Proceedings of NAACL.

Myle Ott, Claire Cardie, and Jeffrey T Hancock. 2013. Negative deceptive opinion spam. In Proceedings of NAACL.

Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of ACL.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of KDD.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of ACL.

Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In Proceedings of ICML.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS.

Jesse Vig. 2019. Deconstructing BERT: Distilling 6 patterns from 100 million parameters. https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77. [Online; accessed 27-Apr-2019].

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of EMNLP.

Thomas Wolf. 2019. huggingface/pytorch-pretrained-bert. https://github.com/huggingface/pytorch-pretrained-BERT.

Yelp. 2019. Yelp dataset 2019. https://www.yelp.com/dataset. Accessed: 2019-01-01.


A Preprocessing and Computational Details

Preprocessing. We used spaCy for tokenization and part-of-speech tagging. All the words are lowercased. Table 3 shows basic data statistics.

Dataset     Average tokens
Yelp        134.6
SST         20.0
Deception   163.7

Table 3: Dataset statistics.

Hyperparameter tuning. Hyperparameters for both SVM and XGBoost are tuned using cross validation. The only hyperparameter tuned for SVM is C; we try a range of Cs from log space -5 to 5, and the finalized value of C ranges between 1 and 5. Hyperparameters tuned for XGBoost include the learning rate, max depth of tree, gamma, number of estimators, and colsample by tree. The ranges of values tried in the process of hyperparameter tuning are: learning rate, 0.1 to 0.0001; max depth of tree, 3 to 7; gamma, 1 to 10; number of estimators, 1000 to 10000; and colsample by tree, 0.1 to 1.0. Hyperparameters for LSTM with attention are tuned using the validation dataset, which comprises 10% of the entire dataset. They include the embedding dimension, hidden dimension, learning rate, number of epochs, and the type of optimizer. The ranges of values tried are: hidden dimension, 256 and 512; learning rate, 0.01 to 0.0001; number of epochs, 3 to 20; and type of optimizer, SGD and Adam.

BERT fine-tuning. We fine-tuned BERT from a pre-trained BERT model provided by its original release and PyTorch implementation (Wolf, 2019). We use the same architecture of a 12-layer Transformer with 12 attention heads. The hidden dimension of each layer is 768. The vocabulary size is 30,522. The initial learning rate we use is 5e-5, and we add an extra ℓ2 regularization with a coefficient of 0.01 on the parameters that are not bias terms or normalization layers. We do early stopping according to the validation set within the first 20 epochs, with a batch size no larger than 4. The attention weights we consider are the self-attention weights of the first token of each text instance, namely the attention weights from "[CLS]", since according to BERT's design, the first token generates the sentence representation fed into the classification layer. For the three target tasks, we choose different maximum lengths according to their natural length. For the deception detection task, the maximum sequence length is 300 tokens. For the SST binary classification task, we choose the default 128 tokens as the maximum length, and for the Yelp review classification task we use 512 tokens.
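For concreteness, a minimal sketch of the cross-validated C search for the linear SVM described above; the toy texts and labels are placeholders, and the released code may differ:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy data standing in for the unigram-featurized training split.
texts = ["great food and fresh ingredients",
         "terrible service and bland food",
         "quick lunch, always fresh",
         "slow, cold, and disappointing"]
labels = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(texts)

# Search C over a log-spaced grid from 1e-5 to 1e5, as described above.
search = GridSearchCV(LinearSVC(), {"C": np.logspace(-5, 5, 11)}, cv=2)
search.fit(X, labels)
print(search.best_params_)
```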

BERT alignment. Given that BERT tokenizes a text instance with its own tokenizer, we map the important features from BERT tokens to the tokenization from spaCy used for the other models. To be specific, we generate token start-end information as a tuple and call it token spans. We show an example for the text instance "It's a good day.":

tokenization 1: [It's], [a], [good], [day], [.]
token spans 1: (0,3), (4,4), (5,8), (9,11), (12,12)
tokenization 2: [It], ['s], [a], [go], [od], [day], [.]
token spans 2: (0,1), (2,3), (4,4), (5,6), (7,8), (9,11), (12,12)

With the span information, we can identify how a token in the first tokenization relates to tokens in the second tokenization and then aggregate all the attention values on the sub-parts. Formally,

W^{(1)}_{(i,j)} = Σ_{(s,t): t ≥ i, s ≤ j} min(1, (t−i+1)/(t−s+1), (j−s+1)/(t−s+1)) · W^{(2)}_{(s,t)}.

In other words, for partial span overlap, we allocate the weight according to the span overlap ratio. For example, if span^{(1)}_i = (2, 5) and span^{(2)}_{k−1} = (2, 3), span^{(2)}_k = (4, 6), then W^{(1)}_{(2,5)} = W^{(2)}_{(2,3)} + (2/3) W^{(2)}_{(4,6)}. Here W^{(2)} represents the importance weight according to the second tokenization, and W^{(1)}_{(i,j)} represents the aligned feature importance for the token that has span (i, j) in the first tokenization. By definition, Σ_{(i,j)} W^{(1)}_{(i,j)} = Σ_{(i,j)} W^{(2)}_{(i,j)} = 1 for attention values.
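A minimal sketch of this span-overlap aggregation (the attention weights in the example are made up; the released code may differ):

```python
def align_importance(spans1, spans2, w2):
    # Map importance weights defined over tokenization 2 onto the spans
    # of tokenization 1 using the overlap-ratio rule above. Spans are
    # inclusive (start, end) character offsets; w2[k] is the weight of
    # spans2[k].
    w1 = {}
    for (i, j) in spans1:
        total = 0.0
        for (s, t), weight in zip(spans2, w2):
            if t >= i and s <= j:  # the two spans overlap
                frac = min(1.0,
                           (t - i + 1) / (t - s + 1),
                           (j - s + 1) / (t - s + 1))
                total += frac * weight
        w1[(i, j)] = total
    return w1

# The example from the text: "It's a good day."
spans1 = [(0, 3), (4, 4), (5, 8), (9, 11), (12, 12)]
spans2 = [(0, 1), (2, 3), (4, 4), (5, 6), (7, 8), (9, 11), (12, 12)]
w2 = [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.1]  # hypothetical attention weights
print(align_importance(spans1, spans2, w2))
```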

LIME. We use the LimeTextExplainer and write a wrapper function that returns the actual probabilities of the respective model. Since the linear SVM generates only binary predictions, we return 0.999 and 0.001 instead. We use 1,000 samples for fitting the local classifier.

SHAP. We use a LinearExplainer for linear SVM, a TreeExplainer for XGBoost, and adapt the gradient-based DeepExplainer for our neural models. The main adaptation required for the neural models is to view the embedding lookup layer as a matrix multiplication layer so that the entire network is differentiable with respect to the input token ids.
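A minimal sketch of such a wrapper for the linear SVM (the fitted `svm` and `vectorizer` objects are placeholders):

```python
import numpy as np

def make_classifier_fn(svm, vectorizer):
    # Turn LinearSVC's hard decisions into the pseudo-probabilities
    # (0.999 / 0.001) passed to LimeTextExplainer, as described above.
    def classifier_fn(texts):
        preds = svm.predict(vectorizer.transform(texts))
        positive = np.where(preds == 1, 0.999, 0.001)
        return np.column_stack([1.0 - positive, positive])
    return classifier_fn
```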


[Figure 9: Jaccard similarity (k = 10) against the BERT layer l (1 to 12), using the average or the maximum attention-head score, for Yelp, SST, and Deception.]

Figure 9: Similarity comparison between BERT layers using the average or maximum attention-head score (k = 10). In general, similarity becomes greater as l increases, but the last layer is not necessarily the greatest. Similarity is slightly higher when the average attention-head score is computed.

B Additional Figures

Similarity between BERT layers and SVM (ℓ2). Important features using the final layer are more similar to those from SVM (ℓ2) than those using the first layer. See Figure 9.

Built-in similarity is much lower with deep learning models, and post-hoc methods "smooth" the distance. Similar results are observed for SVM (ℓ1) and BERT. See Figure 10.

Similarity between methods is lower for deep learning models. Similar results are observed for SVM (ℓ1), XGBoost, and BERT. See Figure 11.

Similarity vs. predicted labels. Similarity is not necessarily higher when predictions agree, nor is it necessarily lower when predictions disagree. See Figure 12 and Figure 13.

Similarity vs. length. The negative correlation between length and similarity grows stronger as k grows. See Figure 14.

Similarity vs. type-token ratio. The positive correlation between type-token ratio and similarity grows stronger as k grows. See Figure 15 and Figure 16.

Entropy. Deep learning models generate more diverse important features than traditional models. See Figure 17.

Jensen-Shannon distance between POS distributions. The distance of the part-of-speech tag distributions between important features and all words is generally smaller with post-hoc methods for traditional models. See Figure 18.


[Figure 10: Jaccard similarity against k for Yelp, SST, and Deception, with a random baseline. Top row: the built-in method, comparing SVM (ℓ1) x XGB, SVM (ℓ1) x BERT, and XGB x BERT. Bottom row: SVM (ℓ1) x BERT, comparing built-in, LIME, and SHAP.]

Figure 10: Similarity comparison between models with the same method. The x-axis represents the number of important features that we consider, while the y-axis shows the Jaccard similarity. Error bars represent standard errors throughout the paper. The top row compares three pairs of models using the built-in method, while the second row compares three methods on SVM (ℓ1) and BERT. The random line is derived using the average similarity between two random samples of k features from 100 draws.

[Figure 11 appears here: three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and Jaccard similarity on the y-axis. Lines: Random, plus built-in x LIME, LIME x SHAP, and built-in x SHAP for each of SVM (ℓ1), XGB, and BERT.]

Figure 11: Similarity comparison between methods using the same model for SVM (ℓ1), XGBoost, and BERT. BERT is much closer to random in deception.


[Figure 12 appears here: two rows of three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and Jaccard similarity on the y-axis. Top row: SVM (ℓ2) vs. XGBoost; bottom row: XGBoost vs. LSTM with attention. Lines: built-in, LIME, and SHAP, each split by whether the two models agree or disagree on the predicted label.]

Figure 12: Similarity between two models is not necessarily greater when they agree on the predictions; in some cases, e.g., SVM (ℓ2) x XGB with the LIME method, it is even lower than when they disagree on the predicted labels.

[Figure 13 appears here: two rows of three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and Jaccard similarity on the y-axis. Top row: SVM (ℓ1) vs. XGBoost; bottom row: XGBoost vs. BERT. Lines: built-in, LIME, and SHAP, each split by whether the two models agree or disagree on the predicted label.]

Figure 13: Similarity between two models is not necessarily greater when they agree on the predictions; in some scenarios, e.g., SVM (ℓ1) x XGB with the LIME method, XGB x BERT with the LIME method, and XGB x BERT with the built-in method, it is even lower than when they disagree on the predicted labels.


[Figure 14 appears here: three rows of three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and the Spearman correlation between instance length and similarity on the y-axis. Row 1: similarity between different models based on the same method (SVM, XGB, LSTM). Row 2: the same comparison for the BERT setting (SVM (ℓ1), XGB, BERT). Row 3: similarity between different methods based on the same model for the BERT setting.]

Figure 14: Similarity comparison vs. length. The longer an instance is, the less similar the important features are. The negative correlation becomes stronger as k grows. In certain scenarios, e.g., XGB - built-in x LIME and XGB - LIME x SHAP, the correlation occasionally goes above 0.


[Figure 15 appears here: two rows of three panels (Yelp, SST, Deception) comparing models using the same method, with the number of important features (k) on the x-axis and the Spearman correlation between type-token ratio and similarity on the y-axis. Top row: SVM, XGB, and LSTM; bottom row: SVM (ℓ1), XGB, and BERT.]

Figure 15: Similarity comparison vs. type-token ratio. The higher the type-token ratio, the more similar the important features are. The positive correlation becomes stronger as k grows. In some cases, e.g., the LIME method on the deception dataset, the correlation becomes weaker as k grows.

[Figure 16 appears here: two rows of three panels (Yelp, SST, Deception) comparing methods using the same model, with the number of important features (k) on the x-axis and the Spearman correlation between type-token ratio and similarity on the y-axis. Top row: SVM, XGB, and LSTM; bottom row: SVM (ℓ1), XGB, and BERT; each model with built-in x LIME, LIME x SHAP, and built-in x SHAP.]

Figure 16: Similarity comparison vs. type-token ratio. The higher the type-token ratio, the more similar the important features are. The positive correlation becomes stronger as k grows. In some cases, e.g., XGB - built-in x LIME and XGB - LIME x SHAP on the Yelp dataset, the correlation becomes weaker as k grows.


[Figure 17 appears here: three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and entropy on the y-axis. Lines: SVM (ℓ1), XGB, and BERT, each with the built-in method, LIME, and SHAP.]

Figure 17: The entropy of important features. In general, BERT generates more diverse important features than SVM (ℓ1) and XGBoost.

[Figure 18 appears here: three panels (Yelp, SST, Deception), with the number of important features (k) on the x-axis and the Jensen-Shannon score on the y-axis. Lines: SVM (ℓ1), XGB, and BERT, each comparing the built-in method, LIME, and SHAP against the background distribution.]

Figure 18: Distance of the part-of-speech tag distributions between important features and all words (background). Distance is generally smaller with post-hoc methods for all models, although some exceptions exist for LSTM with attention and BERT.
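
This distance compares a part-of-speech distribution over important features against the distribution over all words. A minimal sketch is shown below; it assumes POS tags are already available (e.g., from a tagger such as nltk.pos_tag, with the example tags here being hypothetical) and uses scipy's jensenshannon, which returns the square root of the Jensen-Shannon divergence.

from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def pos_distribution(tags, tagset):
    """Normalized frequency of each tag in `tagset` for the given tag sequence."""
    counts = Counter(tags)
    p = np.array([counts[t] for t in tagset], dtype=float)
    return p / p.sum()

# Hypothetical POS tags for the important features vs. all words of a corpus.
important_tags = ["ADJ", "ADJ", "NOUN", "ADJ", "VERB"]
background_tags = ["DET", "NOUN", "VERB", "ADJ", "ADP", "NOUN", "DET", "VERB", "NOUN", "ADJ"]

tagset = sorted(set(important_tags) | set(background_tags))
p = pos_distribution(important_tags, tagset)
q = pos_distribution(background_tags, tagset)
print(jensenshannon(p, q, base=2))  # 0 = identical distributions, 1 = maximally different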