
Explainable Clinical Decision Support from Text

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1478–1489, November 16–20, 2020. © 2020 Association for Computational Linguistics


Jinyue Feng
Unity Health Toronto, Toronto ON
[email protected]

Chantal Shaib
Unity Health Toronto, Toronto ON
[email protected]

Frank Rudzicz
Unity Health Toronto, Toronto ON
Department of Computer Science, University of Toronto, Toronto ON
Vector Institute for Artificial Intelligence, Toronto ON
Surgical Safety Technologies, Toronto ON
[email protected]

Abstract

Clinical prediction models often use structured variables and provide outcomes that are not readily interpretable by clinicians. Further, free-text medical notes may contain information not immediately available in structured variables. We propose a hierarchical CNN-transformer model with explicit attention as an interpretable, multi-task clinical language model, which achieves an AUROC of 0.75 and 0.78 on sepsis and mortality prediction on the English MIMIC-III dataset, respectively. We also explore the relationships between learned features from structured and unstructured variables using projection-weighted canonical correlation analysis. Finally, we outline a protocol to evaluate model usability in a clinical decision support context. From domain-expert evaluations, our model generates informative rationales that have promising real-life applications.

1 Introduction

Electronic medical records (EMRs) store both structured data (e.g., vitals and laboratory measurements) and unstructured data (e.g., nursing and physician notes). Previous clinical prediction tasks have focused on structured data (e.g., Desautels et al., 2016; Gultepe et al., 2013; Ghassemi et al., 2014) which, despite their utility, may not capture all of the useful information in associated text. Clinical decision support systems rarely take advantage of free-text notes due to the complex nature of clinical language and interpretation. Rules and specialized grammars can be applied to circumvent issues around clinical language; however, these methods rely on the presence of certain phrases and spelling, and do not account for the highly variable note structures across departments and hospitals (Yao et al., 2019; Mykowiecka et al., 2009; Assale et al., 2019). Further, opaque models without explainability are often met with resistance in medical contexts (Challen et al., 2019; Ahmad et al., 2018; Gordon et al., 2019). To address these challenges, we propose a novel multi-task language model that also provides rationales for decisions in medicine.

Our multi-task model leverages ClinicalBERT (Alsentzer et al., 2019), which is a transformer-based model pre-trained on clinical corpora. Given the uniqueness of medical text, we introduce a combination of CNN and transformer encoders to capture phrase-level patterns and global contextual relationships. Additionally, we explore latent attention layers to generate rationales.

Based on availability, we use the MIMIC-III database (Johnson et al., 2016) to predict two outcomes: sepsis and mortality in the intensive care unit (ICU). All experiments are conducted on notes written in English. We define the task of sepsis prediction more rigorously than previous work, both because we use textual data only and to emphasize the practicality of this model in real-world applications. Moreover, we use canonical correlation analysis (CCA; Hotelling 1992) to explore relationships between latent features learned from structured and unstructured data. Finally, we propose an evaluation protocol to examine the usability of our model as an interpretable decision support tool.

2 Related work

2.1 Transformers in the clinical domain

Transformers (Vaswani et al., 2017) have gained popularity given their strong performance and parallelizability. The success of the transformer-based BERT (Devlin et al., 2019) has inspired numerous studies to apply it in various domains. For example, BioBERT was pretrained on PubMed abstracts and articles and was able to better identify biomedical entities and boundaries than base BERT (Lee et al., 2020). Alsentzer et al. (2019) further fine-tuned BioBERT on the MIMIC-III clinical dataset (Johnson et al., 2016) and released the model as ClinicalBERT. We use these pretrained BERT-based models as static feature extractors and build layers upon the word embeddings to learn task-specific representations spanning long documents.

2.2 Language model explainability

Explainable AI is an emerging field with no standardized methodology or evaluation metrics. The definition of model explainability also varies by application; however, a generally accepted approach to language-model explainability is through extractive rationales (Lei et al., 2016; Mullenbach et al., 2018; Wiegreffe and Pinter, 2019).

The wide application of attention mechanisms has led to an ongoing debate over whether attention can be used as explanation (Serrano and Smith, 2019; Jain and Wallace, 2019; Wiegreffe and Pinter, 2019). Jain and Wallace (2019) claimed that attention scores in recurrent neural networks (RNNs) did not correlate with other feature-importance measures, and that adversarial attentions did not affect model predictions, concluding that attention was not explanation. Wiegreffe and Pinter (2019) challenged these assumptions by proposing diagnostic tests that allow for meaningful interpretation of attention, but also showed that adversarial attention distributions failed to achieve the same level of prediction performance as real model attention.

We propose a clinical decision support tool that uses explanations to enhance model usability and reliability. Therefore, we adopt a view similar to that of Wiegreffe and Pinter (2019), in that attention provides plausible rationales for use in practice, even though it may not provide a complete internal representation of the model's behaviour (Serrano and Smith, 2019; Jain and Wallace, 2019).

2.3 Clinical tasks

Sepsis is an extreme systemic inflammatory response to infection. If left untreated, sepsis can lead to life-threatening complications such as organ failure and septic shock. The ability to predict sepsis before symptom onset allows for earlier intervention, thus improving patient outcomes. Previous work on sepsis detection focused both on post-hoc identification and on predicting the need for early intervention from structured data (Desautels et al., 2016; Taylor et al., 2016; Nemati et al., 2018; Gultepe et al., 2013). As mortality has an explicit label in EMRs, the focus there has been on expiry likelihood for early intervention rather than post-hoc identification (Ghassemi et al., 2014; Grnarova et al., 2016). We focus on work that used the MIMIC-III database (Johnson et al., 2016).

Insight (Desautels et al., 2016) provided a method for predicting sepsis from vital signs within a fixed time window before suspected onset on retrospective data. Gultepe et al. (2013) proposed a similar structured-data model for mortality and sepsis prediction; however, the features were pre-selected and only considered five measurements. While these methods achieved robust results compared to traditional clinical measures (e.g., MEWS, qSOFA, SIRS; Churpek et al. 2017), none took advantage of the unstructured data found in EMRs.

Culliton et al. (2017) claimed that unstructured data in EMRs contain information not found in the structured variables. They used GloVe word embeddings to represent notes for each patient, and only excluded discharge summaries to minimize explicit mentions of sepsis. Simply excluding discharge summaries, however, is not sufficient to avoid label leakage: a diagnosis may appear in the notes as the clinician becomes aware of symptoms. We carefully filter notes to ensure no label leakage occurs and further refine our definition of sepsis prediction, as described in Section 4. Ghassemi et al. (2014) used topic modeling for textual representations aggregated with structured patient data to predict mortality, but Grnarova et al. (2016) showed that using convolutional document embeddings for each patient outperformed these topic modelling strategies for mortality prediction. Similarly, we deploy convolutional layers in our model to obtain sentence-level embeddings. Horng et al. combined structured and unstructured data for sepsis prediction, using topic models and continuous bag-of-words (CBOW) to represent text. Despite their success, GloVe word embeddings, topic models, and CBOW do not generally capture the complexity of, and contextual relationships between, words in a given text. Specifically, these methods rely primarily on word frequency and collapse multiple meanings of a word into a single representation. To this end, we implement a transformer-based model to represent our clinical notes, which we hypothesize may capture the contextual complexity between tokens more completely.


3 Methods

3.1 Model architectures

The structure of our model is illustrated in Figure 1. We now explain each component in detail.

BERT word embeddings: BERT and its variants have exhibited strong performance in various tasks, and we are interested in their application specifically in medical contexts. As shown in Figure 2, medical documents can easily contain thousands of tokens. With the sequence-length limit of 512 tokens, using BERT as a fine-tuning language model on long documents is practically challenging or impossible. Instead, we approach this problem in a depth-first manner and use BERT as a static feature extractor on a sentence-by-sentence basis. Such a feature-based approach with BERT has proved to be nearly as effective as the fine-tuning approach in other tasks (Devlin et al., 2019).

We split each document into n sentences of m tokens and use a separate data loader with a sequential sampler to group them into sub-batches. The input is truncated or padded at both the sentence and token level. We then feed the sentences into a BERT model and take the mean of the last four encoder layers as token embeddings. For tokenization, we omit two irrelevant tokens: [CLS], which is used as a pooling mechanism in fine-tuning models, and [SEP], which is used in next-sentence prediction and sentence-pair classification tasks. BERT-related modeling and processing code comes from HuggingFace's Transformers library (Wolf et al., 2019).

Given an input T = [t_{11}, t_{12}, \ldots, t_{ij}, \ldots, t_{nm}], where t_{ij} denotes the j-th token of the i-th sentence, the BERT feature extractor outputs

X = [x_{11}, \ldots, x_{nm}] = \mathrm{BERT}(T),

where x_{ij} is a d_{\mathrm{emb}}-dimensional vector (i.e., the hidden dimension of the BERT configuration) corresponding to t_{ij}.
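To make this step concrete, the following is a minimal sketch (not the authors' released code) of sentence-by-sentence feature extraction with a frozen BERT model, averaging the last four hidden layers and omitting [CLS]/[SEP]. The checkpoint name, sentence length, and function name are illustrative assumptions.

```python
# Hedged sketch: static BERT token features per sentence, mean of last 4 layers.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"   # illustrative checkpoint choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
bert = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

@torch.no_grad()
def sentence_token_embeddings(sentences, max_tokens=64):
    """Return a (n_sentences, max_tokens, d_emb) tensor of static token features."""
    enc = tokenizer(sentences, padding="max_length", truncation=True,
                    max_length=max_tokens, add_special_tokens=False,  # drop [CLS]/[SEP]
                    return_tensors="pt")
    out = bert(**enc)
    # Average the last four encoder layers to obtain token embeddings X.
    return torch.stack(out.hidden_states[-4:]).mean(dim=0)
```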

Convolutional layer: Previous studies using CNNs to process medical notes have achieved good results on tasks such as mortality prediction and ICD-9-CM diagnosis code classification (Grnarova et al., 2016; Mullenbach et al., 2018; Si and Roberts, 2019). Specifically, a qualitative evaluation of text snippets from an attentional CNN indicated the model's ability to learn features that are deemed informative and diagnosis-relevant by a physician (Mullenbach et al., 2018). This suggests that the CNN is suitable for extracting information regarding patient status at the phrase level. We use a simple 1D convolutional layer along the sequence of each sentence, followed by ReLU activation and 1D max-pooling, to obtain sentence representations.

Taking X as the input, the CNN outputs an n \times d_{\mathrm{feature}} matrix

S = \mathrm{MaxPool}(\mathrm{ReLU}(\mathrm{Conv}(X))),

where d_{\mathrm{feature}} is the number of output channels of the convolutional layer.
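A minimal PyTorch sketch of this sentence encoder follows; the embedding size, number of output channels, and kernel size are illustrative choices, not values reported in the paper.

```python
# Hedged sketch of the sentence-level CNN: Conv1d over tokens, ReLU, max-pool.
import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    def __init__(self, d_emb=768, d_feature=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_emb, d_feature, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                 # x: (n_sentences, m_tokens, d_emb)
        x = x.transpose(1, 2)             # Conv1d expects (batch, channels, length)
        h = torch.relu(self.conv(x))      # (n_sentences, d_feature, m_tokens)
        s, _ = h.max(dim=-1)              # max-pool over the token axis
        return s                          # S: (n_sentences, d_feature)
```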

Transformer patient encoder: Medical notes frequently contain repeated segments of medical histories as well as plans for future treatment. Although related work in patient-clinician dialogue has explicitly used time-series information (Khattak et al., 2019), the strict temporal order of patient conditions in clinical notes can be disrupted by repeating information. Yet, the highly complex mechanisms of medical outcomes entail that the co-existence of some conditions may change the indication of others. We apply a two-layer transformer encoder on top of the sentence features to capture a unified representation among descriptions. This step of encoding results in a matrix

S_T = \mathrm{Transformer}(S)

that shares the same dimension as S.

Although multi-head attention is powerful (Clark et al., 2019), it is not yet clear how to derive rationales for model predictions from such an approach. For model explainability, we instead apply an explicit attention mechanism that is directly implementable and interpretable.
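A short sketch of how such a two-layer encoder can be applied to the sentence features (assuming a recent PyTorch with batch_first support; nhead and dim_feedforward are illustrative hyperparameters):

```python
# Hedged sketch of the two-layer transformer patient encoder over S.
import torch
import torch.nn as nn

d_feature = 256
layer = nn.TransformerEncoderLayer(d_model=d_feature, nhead=4,
                                   dim_feedforward=512, batch_first=True)
patient_encoder = nn.TransformerEncoder(layer, num_layers=2)

S = torch.randn(1, 120, d_feature)   # one patient, 120 sentence vectors
S_T = patient_encoder(S)             # same shape as S: (1, 120, d_feature)
```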

Latent attention: The outputs of the transformer encoders are sentence-level features. To obtain patient representations, we use a latent attention mechanism adapted from similar work in if-then program synthesis (Liu et al., 2016). The goal of latent attention is to dedicate a component of the model to explicitly learning the importance of each unit of explanation, such as the sentence or word.

The latent attention scores are calculated from the sentence features using a position-wise feed-forward network (Vaswani et al., 2017). Given S_T, an n-dimensional vector a_{\mathrm{input}} is computed as

a_{\mathrm{input}} = \mathrm{FeedForward}(S_T)

and the attention weight is

a = \mathrm{Softmax}(a_{\mathrm{input}} + a_{\mathrm{mask}}),

where a_{\mathrm{mask}} is an n-dimensional vector whose values are 0 at unmasked positions and −10,000 at padding positions.

Figure 1: Model architecture and data flow. Each patient document undergoes various levels of feature extraction to arrive at token-, sentence-, and patient-level representations. The explicit attention layer provides a latent representation for a patient. The final, attended-to patient representation is used in the classification task.

Figure 2: Distribution of documents by token length for the mortality dataset. 2,842 out of 5,147 documents exceed the token limit of BERT, indicated by the vertical dashed line.

The final d_{\mathrm{feature}}-dimensional patient vector p is computed as the weighted sum of the sentence features, which we can write as the dot product

p = \sum_{i=1}^{n} a_i S_{T_i} = S_T \cdot a,

which is fed to a linear layer and a softmax classifier.
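The sketch below illustrates this latent attention layer: a small position-wise feed-forward network scores each sentence, a large negative value masks padding before the softmax, and the patient vector is the attention-weighted sum of sentence features. The hidden size and interface are illustrative assumptions.

```python
# Hedged sketch of latent attention with masked softmax and weighted pooling.
import torch
import torch.nn as nn

class LatentAttention(nn.Module):
    def __init__(self, d_feature=256, d_hidden=64):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_feature, d_hidden), nn.ReLU(),
                                nn.Linear(d_hidden, 1))

    def forward(self, s_t, pad_mask):
        # s_t: (n, d_feature) sentence features; pad_mask: (n,) bool, True at padding.
        a_input = self.ff(s_t).squeeze(-1)        # (n,) raw attention scores
        a_mask = pad_mask.float() * -10_000.0     # 0 at real sentences, -10000 at padding
        a = torch.softmax(a_input + a_mask, dim=-1)
        p = (a.unsqueeze(-1) * s_t).sum(dim=0)    # weighted sum -> (d_feature,) patient vector
        return p, a                               # a doubles as the rationale weights
```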

3.2 Canonical Correlation Analysis

Classic canonical correlation analysis (CCA) provides a set of linear transformations that maximally correlate data points from multiple views (Hotelling, 1992). We use projection-weighted CCA (PWCCA; Morcos et al., 2018) to investigate the correlation between learned textual features and various structured data that are split into their respective clinical tests, shown in Table 1. Given two views x \in \mathbb{R}^{d \times n} and y \in \mathbb{R}^{d \times m}, where n and m denote feature dimensions and d denotes the number of data points, the objective is

(w_1^*, w_2^*) = \arg\max_{w_1, w_2} \frac{w_1' K_{XY} w_2}{\sqrt{w_1' K_{XX} w_1 \, w_2' K_{YY} w_2}},

where K_{XY} denotes the cross-covariance and K_{XX} and K_{YY} denote the covariances.

Table 1: Mapping of clinical tests to their corresponding structured variables.

Complete Blood Count (CBC): Hemoglobin; Hematocrit; Mean Corpuscular Hemoglobin; Platelets; Red Blood Cell Count; White Blood Cell Count
Prothrombin Time (PT): Partial Thromboplastin Time; Prothrombin Time Inr; Prothrombin Time Pt
Urea, Creatinine, and Electrolytes (UCE): Bicarbonate; Blood Urea Nitrogen; Chloride; Creatinine; Potassium; Sodium
Arterial Blood Gases (ABG): Anion Gap; CO2 (etco2, pco2, etc.); Partial Pressure of Carbon Dioxide; pH
Blood Pressure (BP): Central Venous Pressure; Diastolic Blood Pressure; Mean Blood Pressure; Pulmonary Artery Pressure Systolic; Systolic Blood Pressure
Individual Tests (IND): Glucose; Calcium; Calcium Ionized; Magnesium; Phosphate; Phosphorous; Glascow Coma Scale Total
Pulmonary Flowmetry (PF): Fraction Inspired Oxygen Set; Peak Inspiratory Pressure; Positive End-Expiratory Pressure Set; Respiratory Rate; Tidal Volume Observed
Primary Vitals (PV): Heart Rate; Oxygen Saturation; Temperature

Following the method of singular value CCA (Raghu et al., 2017), we use singular value decomposition to obtain the weights w_1 and w_2. From this, we get a total of \min\{n, m\} canonical correlation coefficients. The high dimensionality of the feature representations may result in noisy coefficients that hinder the similarity measurements. We use projection weighting to compute a weighted mean of the canonical variates, which accounts for the importance of the CCA vectors relative to the original input (Morcos et al., 2018). The PWCCA similarity between x and y is computed from the distance

d_{\mathrm{PWCCA}}(x, y) = 1 - \sum_{i=1}^{c} \alpha_i \rho^{(i)},

where \alpha_i denotes the normalized importance weights and \rho^{(i)} denotes the i-th CCA coefficient. We use an open-source implementation of PWCCA¹ in our experiments. Understanding the correlated information in patient features between textual and structured data may provide insight into what latent information is learned from the text.

¹ https://github.com/google/svcca/
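For illustration, here is a self-contained numpy sketch of classic CCA (via QR decomposition and SVD) followed by projection weighting, written from the definitions above; it is not the open-source implementation the paper uses, and the shapes and function name are assumptions.

```python
# Hedged sketch of PWCCA similarity between two feature views of the same patients.
import numpy as np

def pwcca_similarity(x, y):
    """x: (d, n) text features, y: (d, m) structured features, d = number of patients."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    qx, _ = np.linalg.qr(x)                 # orthonormal bases of each view
    qy, _ = np.linalg.qr(y)
    u, rho, _ = np.linalg.svd(qx.T @ qy)    # rho: canonical correlation coefficients
    h = qx @ u                              # canonical variates for view x: (d, c)
    alpha = np.abs(h.T @ x).sum(axis=1)     # importance of each variate w.r.t. x
    alpha = alpha[: len(rho)] / alpha[: len(rho)].sum()
    return float((alpha * rho).sum())       # similarity; PWCCA distance = 1 - similarity

# Example with random features for 1,000 patients (128-dim text vs. 64-dim structured).
sim = pwcca_similarity(np.random.randn(1000, 128), np.random.randn(1000, 64))
```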


4 Data

MIMIC-III: MIMIC-III is a clinical database comprising de-identified EMRs of 58,976 hospital admissions to the critical care units of the Beth Israel Deaconess Medical Center (Johnson et al., 2016). All variables were recorded between 2001 and 2012. Note that, although ClinicalBERT is pretrained on MIMIC-III, this does not preclude its use for downstream tasks on the same dataset; Alsentzer et al. emphasize that any impact is negligible given the size of the entire MIMIC-III corpus compared to the sub-sampled task corpora. In this study, we choose the sepsis and mortality tasks because these are standard tasks for this dataset. However, our model is not specifically tailored to these tasks and may be generalized to a wide range of potential applications.

Data preprocessing: To avoid data leakage among hospital admissions of the same patient, we only include patients with one hospital admission. We select adult patients from the single-admission group and obtain a base population of 31,245 hospital admissions. We randomly sample negative cases to balance the dataset in both tasks.

For text, we concatenate text from different note entries into one document for each patient and remove punctuation (except periods and commas), masked identifiers, digits, and single characters. When merging patients' notes, we remove sentences that have already appeared in previous notes to avoid repetition. The notes are appended in chronological order according to their timestamps and truncated to a maximum of 50,000 tokens.
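A rough sketch of this note-merging step is shown below; the cleaning regexes, sentence splitter, and function name are illustrative assumptions rather than the authors' exact rules.

```python
# Hedged sketch: merge a patient's notes chronologically, drop repeated sentences,
# strip masked identifiers, digits, stray punctuation and single characters,
# and truncate to 50,000 tokens.
import re

def merge_notes(notes, max_tokens=50_000):
    """notes: list of (timestamp, text) tuples for one patient."""
    seen, kept = set(), []
    for _, text in sorted(notes, key=lambda n: n[0]):
        text = re.sub(r"\[\*\*[^\]]*\*\*\]", " ", text)   # MIMIC-style masked identifiers
        text = re.sub(r"[^a-zA-Z.,\s]", " ", text)        # digits and other punctuation
        for sent in re.split(r"(?<=\.)\s+", text):
            sent = re.sub(r"\b\w\b", " ", sent).strip()   # single characters
            if sent and sent.lower() not in seen:
                seen.add(sent.lower())
                kept.append(sent)
    tokens = " ".join(kept).split()
    return " ".join(tokens[:max_tokens])
```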

For mortality prediction, we do not differentiate note types. For sepsis, we find differences in the frequencies of note types between the positive and negative populations, which may result in a trivially learned solution. After consulting with clinicians, we exclude note types that are irrelevant to sepsis and select nursing and physician notes only.

Whereas structured variables have explicit timestamps that can easily be related to symptom onset, the timestamp of a note may not be. For example, a note containing descriptions of a possible infection may be entered after antibiotic administration. Anchoring notes with lab-measurement timestamps significantly limits the number of positive cases in our dataset, especially compared to other studies containing similar sepsis cohorts (Section 2.3). Nonetheless, we view the imposed time-window constraints as necessary to create an honest representation of prediction. Discharge summaries and any notes written after the patient outcome occurred are excluded to avoid direct access to the solution. Unfortunately, these steps are not always taken in the literature.

For the structured data used in Section 3.2, we use MIMIC-Extract² to ensure a standard patient population. After obtaining time-binned cohort data, we extract measurements within the same time frames as the selected notes.

Sepsis: Systemic inflammatory response syndrome (SIRS), characterized by abnormal body temperature, heart rate, respiratory rate, and white blood cell count, often precedes sepsis. In this task, we aim to predict whether a patient in SIRS would become septic. In contrast to previous work, where the negative sepsis populations did not necessarily have SIRS (Section 2.3), our task is more restrictive, as the model must learn features that are distinctive of sepsis onset rather than general indications of SIRS. We use ICD-9-CM codes to label cases, where patients with codes for explicit sepsis, or a combination of infection and either organ failure or SIRS, are considered positive. Although ICD-9-CM codes can be unreliable (O'Malley et al., 2005), we use multiple criteria to deal with false negatives and SIRS as a filter to avoid false positives (Angus and Wax, 2001). We notice that very few notes are recorded before the first onset of SIRS, possibly due to a time delay in writing or logging notes. To compensate for the lack of data, notes before and within 24 hours of the first onset of SIRS are included. To avoid possible label leakage, we remove sentences containing mentions of "sepsis" or "septic". The final cohort contains 1262 positive cases and 1500 negative cases.

In-ICU mortality: MIMIC-III has an expiry timestamp for patients who died in the hospital, which identifies the positive cohort for in-ICU mortality prediction. To ensure that all samples represent patient conditions in the ICU, we only include notes written within ICU stays. The dataset has 2562 positive cases and 2587 negative cases.

² https://github.com/MLforHealth/MIMIC_Extract


5 Experiments

Our experiments explore 1) differences in prediction due to pretraining, 2) multiview projection, and 3) evaluable explainable AI.

5.1 Clinical vs. Non-Clinical BERT

To compare the effect of pretraining BERT with domain-specific clinical data on the overall quality and performance of the model, we substitute BioBERT (Lee et al., 2020) and base BERT (Devlin et al., 2019) as the token-embedding component. We run both the sepsis and mortality tasks on the different *BERT models and compare the final performance. The results are shown in Table 2.

Comparing performance between tasks, the models achieve better performance on mortality than on sepsis. Considering that the negative cases in the sepsis task all had SIRS, which is one of the diagnostic criteria of sepsis, the high false positive rate among all three models is expected.

ClinicalBERT models converge faster and outperform the other two models in both the sepsis and mortality tasks. BioBERT and BERT models are comparable in performance; however, BioBERT models exhibit a tendency to output positive results, resulting in high recall and high false positive rates. The fact that BioBERT does not perform better than base BERT suggests that clinical-specific pretraining is crucial and cannot be replaced by pretraining on general biomedical corpora.

5.2 Structured vs. Textual Data

To investigate the relationships between patient features extracted from structured and text data, we separately train RNN models to learn representations from different groups (see Table 1) of laboratory measurements, and we conduct PWCCA (Figure 3) to compute their similarities to the patient features from the language model.

Structured data model: To obtain a single vector from the time-series structured data, we construct a 2-layer single-directional GRU network followed by a linear layer that projects the mean GRU output to a feature vector with the same dimension as the language-model feature vectors. Only the patients that appear in the language-model cohort are selected. Each model is trained for 50 epochs, and the best-performing one is used to extract features.
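A minimal sketch of such a structured-data encoder is shown below; the number of input variables, hidden size, and class names are illustrative assumptions rather than reported hyperparameters.

```python
# Hedged sketch: 2-layer unidirectional GRU over time-binned measurements,
# mean-pooled over time and projected to the language-model feature dimension.
import torch
import torch.nn as nn

class StructuredEncoder(nn.Module):
    def __init__(self, n_variables=24, d_hidden=128, d_feature=256, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_variables, d_hidden, num_layers=2, batch_first=True)
        self.project = nn.Linear(d_hidden, d_feature)
        self.classify = nn.Linear(d_feature, n_classes)

    def forward(self, x):                        # x: (batch, time_bins, n_variables)
        out, _ = self.gru(x)                     # (batch, time_bins, d_hidden)
        feat = self.project(out.mean(dim=1))     # mean GRU output -> patient feature
        return self.classify(feat), feat         # logits, plus features for PWCCA
```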

Figure 3: Visualization of PWCCA. The patient representations are taken from the models before the classifier. First, (a) a latent space is learned with SVCCA; then, (b) the original representation is projected onto the learned latent space and the PWCCA is computed.

CCA details: To avoid spurious correlations typically found in small datasets, the number of data points (n_sample) should be at least five times³ the feature dimension (d_feature). Therefore, we include all shared patients between the structured and unstructured datasets, and over-sample the data for the sepsis task. We set up random baselines for each test, where we randomly generate n_sample d_feature-dimensional vectors using the same sampling strategy as the real features. To ensure that our features are meaningful, we only analyze features extracted by models that reach an AUROC of at least 0.75. It is important to note that we constructed the structured dataset to obtain patient representations, not to compare model performance. The structured inputs contain measurements after the onset of patient outcomes, so the metrics should not be compared to those of the language model. Additionally, the structured data models fail to learn to predict sepsis from the SIRS cohort, so we include negative samples without SIRS whose data are extracted from random time frames. Model performance and PWCCA similarity (described by Morcos et al. (2018)) are listed in Table 3.

Feature correlation: The similarity scores are subject to confounding factors such as noise and sample size. Due to limited data availability, we can only comment on general patterns. The structured data models and the language model converge to correlated solutions, compared to random baselines. We do not observe any clear relationship between structured-model performance and similarity. The features learned from all lab measurements, which supposedly encode a more comprehensive patient representation than any subgroup alone, are close to the features learned from medical notes, especially in the mortality task. For the sepsis task, the test groups that are highly related to systemic inflammation or organ dysfunction (CBC, BP, IND) show especially strong correlation with the textual features. The results suggest that our language models learn to encode the most relevant patient conditions for each outcome. Future work includes further examining representation correlations, and other multi-view models combining structured and unstructured data as inputs.

³ Experiments demonstrating the choice of sample sizes in CCA can be found at https://github.com/google/svcca

Table 2: Test performance scores using different BERT models.

Model        | Sepsis: AUROC / F1 / Precision / Recall | Mortality: AUROC / F1 / Precision / Recall
BERT         | 0.72 / 69.3 / 64.3 / 75.0               | 0.75 / 74.2 / 77.7 / 70.9
BioBERT      | 0.72 / 71.2 / 59.8 / 88.1               | 0.76 / 76.8 / 72.6 / 81.6
ClinicalBERT | 0.75 / 73.0 / 64.4 / 84.3               | 0.78 / 78.9 / 78.2 / 79.7

Table 3: Structured-model test performance and PWCCA similarity to text features. The All category encompasses all test groups and their features. Table 1 shows the full list of features and their corresponding test categories.

Features | Sepsis AUROC | Sepsis Similarity | Mortality AUROC | Mortality Similarity
All      | 0.75 | 0.68 | 0.92 | 0.762
CBC      | 0.77 | 0.80 | 0.5  | -
PT       | 0.76 | 0.60 | 0.5  | -
UCE      | 0.68 | -    | 0.57 | -
ABG      | 0.77 | 0.60 | 0.62 | -
BP       | 0.76 | 0.65 | 0.5  | -
IND      | 0.77 | 0.93 | 0.88 | 0.686
PF       | 0.78 | 0.61 | 0.62 | -
PV       | 0.5  | -    | 0.5  | -
Random   | -    | 0.45 | -    | 0.361

5.3 Evaluating Explanations

Evaluating model explainability remains a broad area of research. Our primary objective is a usable model that can be deployed as a real-life decision support tool. Therefore, we focus on human evaluation as our assessment of rationale quality. We outline a novel evaluation protocol that measures the quality of the extracted rationales by leveraging clinical domain expertise. To avoid arbitrary judgements, we work with the physician to tailor the definition of utility for each task; this is expanded upon in the Appendix, along with a stand-alone quantitative evaluation, on non-clinical data, of latent attention as an explanation mechanism.

Figure 4: Example attention distribution over sentences in one patient document.

To obtain succinct, meaningful explanations, we calculate an attention threshold score

a_{\mathrm{threshold}} = \max\left(\frac{1}{n_s},\, a^{\mathrm{sentence}}_i\right),

where a denotes the attention scores, n_s is the number of sentences, and i = \min(20, \lceil n_s / 10 \rceil). This ensures that selected sentences have higher attention scores than uniform attention and that at most 10% of the original text is included. To avoid burdening the evaluator, at most 20 sentences are selected for documents with more than 200 sentences. Figure 4 shows an example distribution of attention scores and demonstrates our explanation-generation criteria. To prevent overly complicated results, we only evaluate the correctly predicted cases.
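A small sketch of this selection rule, written directly from the threshold definition above (tie handling and the function name are assumptions):

```python
# Hedged sketch: keep sentences whose attention exceeds both uniform attention
# (1/n_s) and the i-th largest score, with i = min(20, ceil(n_s / 10)).
import math

def select_rationales(sentences, attention):
    n_s = len(sentences)
    i = min(20, math.ceil(n_s / 10))
    ith_score = sorted(attention, reverse=True)[i - 1]
    threshold = max(1.0 / n_s, ith_score)
    return [s for s, a in zip(sentences, attention) if a >= threshold]
```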

All independent evaluations use a command-line user interface.

5.3.1 Labeling task

Labeling is designed to evaluate the informativeness of our generated explanations. Sentences are presented sequentially to an expert physician, who chooses at each step either to predict the patient outcome or to check the next sentence. Sepsis has defined diagnostic criteria that must be followed in clinical practice, and information about such criteria is not necessarily available even in complete documents. However, mortality risk assessment, despite its difficulty, is common in critical care. Therefore, we only conduct the labeling task on the mortality dataset. We compare human predictions to those of our model and note the number of selected sentences necessary for each prediction. A test case fails if the evaluator does not make a decision after reviewing all selected sentences. This method evaluates whether the attended sentences provide sufficient information for a clinical decision, and empirically evaluates the number of sentences needed for rationales.

Table 4: Labeling task results. We list the number of cases, the percentage of concluded cases out of all cases, the percentage of correct cases out of all concluded cases, and the average number of sentences read for both correct (c) and incorrect (i) cases.

                   | Pos   | Neg   | Total
N_cases            | 119   | 136   | 255
Conclusion         | 98.4% | 98.5% | 98.5%
Correctness        | 69.2% | 96.3% | 82.7%
Sentences Read (c) | 4.0   | 3.5   | -
Sentences Read (i) | 4.2   | 8.2   | -

The results are presented in Table 4. On average, the evaluator reaches a correct conclusion in mortality prediction 82.7% of the time by reading approximately 4 sentences per case (or a selected 0.5% of the note, on average). Such evidence strongly suggests that our model is capable of extracting the most relevant information from long documents. We also observe a general pattern that fewer sentences are needed for a correctly predicted case, which indicates that the ordering of sentences based on attention is generally reliable.

Interestingly, the evaluator correctly predicts almost all negative cases but not the positive cases in the mortality task. Multiple reasons may account for the high false negative rate. First, mortality prediction is an intrinsically challenging task for humans. A bias towards survival may naturally occur when a sentence can be interpreted differently based on various contexts. Second, explanations for negative cases are more likely to be independent of the contextual information that is not included in the rationales. Our evaluator comments that a seemingly poor patient condition may translate to completely opposite outcomes depending on the coexistence of other conditions. In real-life applications, providing full documents with highlighted explanations may be an easy solution that helps to direct users' attention to the most important parts without losing reference to additional context.

5.3.2 Rating task

In a second evaluation, we sample cases not used in the labeling task. We present model predictions and the entirety of the rationales, sentence by sentence, to an expert physician. The physician is instructed to decide whether each sentence in the rationale contains information that helps explain the model decision. To avoid arbitrary judgements, we work with the physician to develop clear definitions of explanation utility, as shown in the appendix. This method assesses the average informativeness of the selected sentences as well as the usability of our model for the purpose of clinical decision support.

Table 5: Rating task results.

                 | Sepsis: Pos / Neg / Total | Mortality: Pos / Neg / Total
N_sentences      | 1016 / 464 / 1480         | 958 / 486 / 1444
N_cases          | 64 / 54 / 118             | 76 / 52 / 128
% helpful, All   | 41.8 / 95.0 / -           | 61.7 / 82.7 / 72.2
% helpful, Top 4 | - / - / -                 | 75.9 / 86.4 / 80.0
% helpful, Cases | 96.4 / 74.1 / 86.0        | - / - / -

Given the characteristics of mortality and sepsis (see the appendix for a detailed discussion), the evaluation is meaningful at the sentence and case levels for the two tasks, respectively. Table 5 summarizes the results. Across the positive and negative cases, an average of 72.2% of sentences in the mortality task and 86% of cases in the sepsis task are rated as helpful for understanding model decisions. A closer look at the results shows that 80% of the first four sentences are rated as helpful, which indicates that the algorithm that generates rationales should be refined in future work to further exclude sentences with lower attention scores (see Figure 4). Nonetheless, the application of our model as an explainable decision support tool is very promising.

6 Conclusion

Language can provide valuable support to improve clinical decision-making. We conduct a diverse set of experiments to explore several aspects of the applicability of deep NLP in the clinical domain. We also address challenges in extracting medical documents that are representative of a predictive task.

We augment the power of domain-specific BERT and build a hierarchical CNN-transformer that can potentially be applied to any long-document processing task. The model achieves AUROC scores of 0.75 and 0.78 on the sepsis and mortality tasks, respectively. We also address model explainability by experimenting with a simple (yet effective) linear attention mechanism, and emphasize the interaction between models and users in the design of a novel protocol to evaluate explanations. Not only are we able to predict cases with performance comparable to models that use structured EMR data, but we are also able to provide useful rationales to support the predictions, as validated by medical domain expertise. This has important implications for the real-world application of explainable clinical decision support from text.

Acknowledgements

Rudzicz is supported by a CIFAR Chair in artificial intelligence.

References

Muhammad Aurangzeb Ahmad, Carly Eckert, and Ankur Teredesai. 2018. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pages 559–560. ACM.

Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. 2019. Publicly available clinical BERT embeddings. CoRR, abs/1904.03323.

Derek C Angus and Randy S Wax. 2001. Epidemiology of sepsis: An update. Critical Care Medicine, 29(7):S109–S116.

Michela Assale, Linda Greta Dui, Andrea Cina, Andrea Seveso, and Federico Cabitza. 2019. The revival of the notes field: Leveraging the unstructured content in electronic health records. Frontiers in Medicine, 6.

Robert Challen, Joshua Denny, Martin Pitt, Luke Gompels, Tom Edwards, and Krasimira Tsaneva-Atanasova. 2019. Artificial intelligence, bias and clinical safety. BMJ Qual Saf, 28(3):231–237.

Matthew M Churpek, Ashley Snyder, Xuan Han, Sarah Sokol, Natasha Pettit, Michael D Howell, and Dana P Edelson. 2017. Quick sepsis-related organ failure assessment, systemic inflammatory response syndrome, and early warning scores for detecting clinical deterioration in infected patients outside the intensive care unit. American Journal of Respiratory and Critical Care Medicine, 195(7):906–911.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Phil Culliton, Michael Levinson, Alice Ehresman, Joshua Wherry, Jay S Steingrub, and Stephen I Gallant. 2017. Predicting severe sepsis using text from the electronic health record.

Thomas Desautels, Jacob Calvert, Jana Hoffman, Melissa Jay, Yaniv Kerem, Lisa Shieh, David Shimabukuro, Uli Chettipally, Mitchell D Feldman, Chris Barton, et al. 2016. Prediction of sepsis in the intensive care unit with minimal electronic health record data: A machine learning approach. JMIR Medical Informatics, 4(3):e28.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, and Peter Szolovits. 2014. Unfolding physiological state: Mortality modelling in intensive care units. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 75–84.

Lauren Gordon, Teodor Grantcharov, and Frank Rudzicz. 2019. Explainable artificial intelligence for safe intraoperative decision support. JAMA Surgery, pages 10–11.

Paulina Grnarova, Florian Schmidt, Stephanie L Hyland, and Carsten Eickhoff. 2016. Neural document embeddings for intensive care patient mortality prediction. NIPS 2016 Workshop on Machine Learning for Health.

Eren Gultepe, Jeffrey P Green, Hien Nguyen, Jason Adams, Timothy Albertson, and Ilias Tagkopoulos. 2013. From vital signs to clinical outcomes for patients with sepsis: A machine learning basis for a clinical decision support system. Journal of the American Medical Informatics Association, 21(2):315–325.

Steven Horng, David A Sontag, Yoni Halpern, Yacine Jernite, Nathan I Shapiro, and Larry A Nathanson. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PLoS ONE.

Harold Hotelling. 1992. Relations between two sets of variates. In Breakthroughs in Statistics, pages 162–190. Springer.

Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. In NAACL-HLT.

Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035.

Faiza Khan Khattak, Serena Jeblee, Noah Crampton, Muhammad Mamdani, and Frank Rudzicz. 2019. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. In MEDINFO 2019, pages 1512–1513.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, D. Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In EMNLP.

Chang Liu, Xinyun Chen, Richard Shin, Mingcheng Chen, and Dawn Song. 2016. Latent attention for if-then program synthesis. In Advances in Neural Information Processing Systems, pages 4574–4582.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In 2012 IEEE 12th International Conference on Data Mining, pages 1020–1025. IEEE.

Ari S. Morcos, Maithra Raghu, and Samy Bengio. 2018. Insights on representational similarity in neural networks with canonical correlation. In NeurIPS.

James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. Pages 1101–1111.

Agnieszka Mykowiecka, Małgorzata Marciniak, and Anna Kupsc. 2009. Rule-based information extraction from patients' clinical data. Journal of Biomedical Informatics, 42(5):923–936.

Shamim Nemati, Andre Holder, Fereshteh Razmi, Matthew D Stanley, Gari D Clifford, and Timothy G Buchman. 2018. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Critical Care Medicine, 46(4):547–553.

Kimberly J O'Malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. 2005. Measuring diagnoses: ICD code accuracy. Health Services Research, 40(5p2):1620–1639.

Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085.

Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Florence, Italy. Association for Computational Linguistics.

Yuqi Si and Kirk Roberts. 2019. Deep patient representation of clinical notes via multi-task learning for mortality prediction. AMIA Summits on Translational Science Proceedings, 2019:779.

R Andrew Taylor, Joseph R Pare, Arjun K Venkatesh, Hani Mowafi, Edward R Melnick, William Fleischman, and M Kennedy Hall. 2016. Prediction of in-hospital mortality in emergency department patients with sepsis: A local big data-driven, machine learning approach. Academic Emergency Medicine, 23(3):269–278.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. Pages 11–20.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Clinical text classification with rule-based features and knowledge-guided convolutional neural networks. BMC Medical Informatics and Decision Making, 19(3):71.


Appendix A: On explainability evaluation

Quantitatively validating latent attention as explanation: As previously noted, evaluating language-model explanations is not yet standardized. Despite the effort to make human evaluation fair and reliable, such qualitative measurements are still prone to bias and subjectivity. To validate that latent attention can be used as an explanation, we conduct a stand-alone experiment on the BeerAdvocate dataset used by McAuley et al. (2012) and adapted by Lei et al. (2016). This dataset has ground-truth annotations of the sentences relevant to the prediction results. Although the dataset was not crafted for the purpose of rationale evaluation, we use it as a proxy to examine the quality of our attention scores.

Figure 5: Test-case example from the BeerAdvocate dataset. Blue background: attended tokens in the annotation; red background: attended tokens not in the annotation; underscore: annotation.

The full BeerAdvocate dataset contains 1.5 million beer reviews describing four aspects (i.e., appearance, smell, palate, and taste), each corresponding to a rating on a scale of 0 to 5. Lei et al. (2016) published a subset of 90k reviews selected to minimize correlation between appearance and the other aspects. In our experiment, we use these 90k reviews for training and 994 annotated reviews for testing. The training set only has rating labels, whereas the testing set has both rating labels and human annotations of sentence-level relevancy. Since all aspects have the exact same setup, it suffices to use appearance rating prediction as a proof of concept.

We build a model with only two of the components described in Section 3.1, namely BERT (the pretrained base-cased model) and latent attention. We feed static token embeddings from BERT into a latent attention layer, which outputs sequence representations used for regression through a linear layer with a sigmoid activation. We train the model for 20 epochs and select the best-performing one for testing.

In contrast to our clinical model, this model only attends to individual tokens and only generates word-level explanations. For words split by the WordPiece tokenizer, we merge the tokens and average the attention weights. For each sentence, we sort the words based on their attention weights and take the top n words as the prediction rationale, where n equals the total length of the human-annotated sentences. We only use the attention mechanism, without additional constraints such as selection continuity, which makes the test task even more challenging, as the annotations are ranges of words.
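The following sketch illustrates merging WordPiece pieces back into words, averaging their attention weights, and taking the top-n words as the rationale; the function names and interface are illustrative assumptions.

```python
# Hedged sketch: merge '##'-prefixed WordPiece continuations, average their
# attention, then keep the n highest-weighted word indices as the rationale.
def merge_wordpieces(tokens, weights):
    words, scores = [], []
    for tok, w in zip(tokens, weights):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            scores[-1].append(w)
        else:
            words.append(tok)
            scores.append([w])
    return words, [sum(s) / len(s) for s in scores]

def top_n_rationale(tokens, weights, n):
    words, avg = merge_wordpieces(tokens, weights)
    ranked = sorted(range(len(words)), key=lambda i: avg[i], reverse=True)
    return set(ranked[:n])        # word indices used as the prediction rationale
```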

The model is evaluated according to mean squared error (MSE) and rationale precision

P_{\mathrm{rationale}} = \frac{\sum_{i=1}^{N} |S_i \cap A_i|}{\sum_{i=1}^{N} |S_i|},

where N is the number of test cases, y is the ground-truth appearance rating and \hat{y} is the predicted rating (used for the MSE), A_i is the set of word indices in the annotation, S_i is the set of word indices selected as the model explanation, and |S_i| = |A_i|.
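For reference, a direct computation of this micro-averaged precision from the sets defined above (the function name is illustrative):

```python
# Hedged sketch: fraction of model-selected word indices inside the annotation,
# micro-averaged over all test cases.
def rationale_precision(selected, annotated):
    """selected, annotated: lists of sets of word indices, one pair per test case."""
    hits = sum(len(s & a) for s, a in zip(selected, annotated))
    total = sum(len(s) for s in selected)
    return hits / total

# e.g. rationale_precision([{0, 3, 5}], [{3, 5, 7}]) == 2 / 3
```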

Our model reaches a rationale precision of 76.39%, which indicates that our most-attended words are mostly consistent with the annotations. Figure 5 shows an example of the appearance test results. The experiment demonstrates the usability of latent attention as an explanation mechanism.

Definition of explanation utility in the rating task: For mortality, each sentence is evaluated individually based on how the described situation would contribute to a patient's survival rate. Sentences describing highly life-threatening complications (such as multiple organ failures) support a positive prediction, whereas sentences indicating improving conditions (such as stable lab measurements) support a negative prediction. In both cases, these sentences are considered helpful. Sentences that are irrelevant (i.e., that support neither a positive nor a negative prediction) are considered unhelpful in both populations.

Many of the conditions that present with sepsis onset (such as hypotension) can have numerous etiologies. Diagnostic criteria specify that bacteremia (i.e., bacteria in the bloodstream) must be present in order to predict the development of sepsis. Yet the administration of antibiotics is also not considered a direct indication of bacteremia without other indications of potential sepsis. Therefore, sentences describing sepsis-related symptoms are not rated as helpful in understanding a positive sepsis prediction until an indication of infection (for example, compromised skin integrity) also appears, and vice versa. For negative cases, sentences that are either irrelevant to sepsis or explain other origins of sepsis-related symptoms are rated as helpful. Given this definition, the existence of any helpful sentence means the explanation is valid for a positive case. Similarly, the existence of any unhelpful sentence invalidates a negative case.

Examples of sepsis and mortality explanations are shown in Figure 6. We truncate and edit these texts to avoid data disclosure.

Figure 6: Example explanations. Highlighted sentences are rationales picked by our model. Elaborations on the meanings of sentences are written in footnotes. These examples have been edited for increased privacy.