
What is Interpretable? Using ML to Design Interpretable Decision-Support Systems

OWEN LAHAV1, NICHOLAS MASTRONARDE2, AND MIHAELA VAN DER SCHAAR1,3,4

1UNIVERSITY OF OXFORD, 2UNIVERSITY AT BUFFALO, 3UNIVERSITY OF CALIFORNIA LOS ANGELES (UCLA), 4ALAN TURING INSTITUTE

INTRODUCTION
The need for interpretability
• Machine learning models can accurately predict medical outcomes
• However, clinicians cannot professionally or ethically use black-box models without understanding and trusting them
• As a result, we need interpretability

Interpretability in clinical settings
• ML interpretability has focused on user comprehension: interpretability modules presented alongside the ML model's outputs
• However, comprehensibility is insufficient
• Clinicians must also trust models before they can use them

[Figure: 2×2 quadrant chart with Comprehensibility (low–high) on one axis and Trustworthiness (low–high) on the other; quadrants labelled Poor Model, Accepted Black-Box, Understandable Model, and Interpretable Model.]

Solution: Ask doctors!
• Use reinforcement learning to design comprehensible, trustworthy systems
• Present supplementary information to clinicians, and learn from their responses

EXPERIMENTAL DESIGN
We designed an RL-based clinical decision-support system (DSS) around the neural network model, in the form of an online survey.
Below are screenshots showing some of the model evidence presented (counter-clockwise): a patient scenario, a local linear model, a local decision-tree model, and a feature sensitivity sample.

[Screenshot: patient scenario]
• Patient Betty is an 86-year-old non-Caucasian female suffering from heart failure
• Betty has a BMI of 21.6
• Betty exhibits rales and shortness of breath at rest
• Our model predicts the probability of Betty dying within 1 year is 83.5%

[Screenshot: feature sensitivity sample]
Examine how the following features might impact Betty's risk score, based on our neural network model:
  Predicted Risk Score: 77.2%
  New York Heart Association Score: IV
  Prescription: No prescription / ACE Inhibitors / Beta Blockers
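For reference, a minimal sketch of how such a feature-sensitivity sample can be produced: hold the patient's other features fixed, vary one feature, and re-query the model. The helper name `feature_sensitivity`, the `predict_risk` callable, and the feature index in the example are illustrative assumptions, not the study's actual implementation.

```python
# Sketch: vary one input feature while keeping the patient fixed, and record
# how the model's predicted 1-year mortality risk changes.
import numpy as np

def feature_sensitivity(predict_risk, patient: np.ndarray, feature_idx: int, values):
    """Return {candidate feature value: predicted risk}, patient held fixed."""
    risks = {}
    for v in values:
        perturbed = patient.copy()
        perturbed[feature_idx] = v
        risks[v] = predict_risk(perturbed.reshape(1, -1))[0]  # one risk per row
    return risks

# Example (hypothetical index for a binary "ACE inhibitor prescribed" feature):
# feature_sensitivity(predict_risk, betty, feature_idx=12, values=[0, 1])
```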

Local linear approximation for 80-100% risk strata (significant coefficients):

Patient Characteristic              Coefficient (80-100% risk)
New York Heart Association Score     0.548
ACE Inhibitors or ARB prescribed    -0.452
Beta Blockers prescribed            -0.263
Shortness of breath at rest          0.248
Ethnicity (Caucasian)               -0.241
Rales                                0.211
Diabetes                             0.138
Gender                               0.057
Age                                  0.051
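A local linear approximation of this kind can be obtained by fitting a linear surrogate to the neural network's own predictions within one risk stratum. The sketch below is one way to do this with scikit-learn; `predict_risk`, `X`, and `feature_names` are assumed inputs, and the poster does not specify the exact fitting procedure used.

```python
# Sketch: fit a linear surrogate to the DNN's predictions on the 80-100% stratum
# and rank features by the magnitude of their coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

def local_linear_surrogate(predict_risk, X: np.ndarray, feature_names, lo=0.8, hi=1.0):
    risk = predict_risk(X)                      # DNN risk for every patient
    mask = (risk >= lo) & (risk <= hi)          # restrict to the chosen stratum
    surrogate = LinearRegression().fit(X[mask], risk[mask])
    coefs = sorted(zip(feature_names, surrogate.coef_),
                   key=lambda kv: abs(kv[1]), reverse=True)
    return coefs                                # e.g., [("NYHA score", 0.548), ...]
```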

Local decision-tree approximation for 80-100% risk strata:
[Figure: decision tree with splits on Heart Failure Duration < 246, Hemoglobin < 48, and Systolic Blood Pressure < 150 (T/F branches); one leaf shown contains 79 patients with a risk of 84.1%.]
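Analogously, the local decision-tree approximation can be sketched as a shallow regression tree fit to the network's predictions on the same stratum, so each leaf reads as "N patients, average risk R%". The depth limit and inputs below are assumptions.

```python
# Sketch: shallow tree surrogate of the DNN within the 80-100% risk stratum.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

def local_tree_surrogate(predict_risk, X: np.ndarray, feature_names, lo=0.8, hi=1.0, depth=3):
    risk = predict_risk(X)
    mask = (risk >= lo) & (risk <= hi)
    tree = DecisionTreeRegressor(max_depth=depth).fit(X[mask], risk[mask])
    # Human-readable rules, e.g. "Heart Failure Duration < 246 -> ..."
    return tree, export_text(tree, feature_names=list(feature_names))
```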

DECISION-SUPPORT SYSTEM
MAGGIC data-set
• 30,389 heart-failure (HF) patients
• 31 features: patient characteristics, symptoms, medications, etc.
• Average 1-year mortality rate of 18.8%

Machine Learning Model
• Predicts 1-year mortality risk after HF
• Simple Deep Neural Network (DNN) with 2 layers of 100 and 20 nodes (architecture sketch below)
  – Outperforms the MAGGIC Risk Score used by clinicians

Benchmark comparison (AUC-ROC and AUC-PR): Linear Regression, Random Forest, Gradient Boosting Machine, XGBoost, Neural Network, and the MAGGIC Risk Score.
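A minimal sketch of the described architecture (31 inputs, hidden layers of 100 and 20 nodes, one risk output), written here in PyTorch; the ReLU activations, sigmoid output, and any training details are assumptions, since the poster only specifies the layer sizes.

```python
# Sketch of the 2-hidden-layer DNN for 1-year mortality risk after HF.
import torch
import torch.nn as nn

class MortalityRiskDNN(nn.Module):
    def __init__(self, n_features: int = 31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 100),  # first hidden layer: 100 nodes
            nn.ReLU(),
            nn.Linear(100, 20),          # second hidden layer: 20 nodes
            nn.ReLU(),
            nn.Linear(20, 1),            # one logit for 1-year mortality
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x))  # predicted risk in [0, 1]

model = MortalityRiskDNN()
risk = model(torch.randn(1, 31))  # shape (1, 1): predicted risk for one patient
```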

Model Evidence
• Collated a large set of possible evidence to present to users
  – Model Details: data set, training, accuracy, DNN approximation methods
  – Interpretability Modules: linear approximations, local decision-tree, feature sensitivity
• Consulted medical experts to reduce the evidence space and inform design

Reinforcement Learning Model
• Multi-armed bandit using the UCB1 algorithm
  – Arms = evidence sequences
• Any RL method could be utilized for identifying the optimal sequence of evidence
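A minimal UCB1 sketch in this spirit: each arm is one evidence sequence, and the reward is the clinician's reported confidence after seeing it (rescaled to [0, 1]). The arm list and the random reward placeholder are illustrative assumptions, not the survey's actual arms.

```python
# Sketch: UCB1 bandit over evidence sequences, rewarded by clinician confidence.
import math
import random

arms = [
    ("patient",),
    ("patient", "sensitivity"),
    ("patient", "sensitivity", "outcome"),
    ("data", "accuracy", "linreg", "dtree"),
]
counts = [0] * len(arms)    # times each evidence sequence was shown
values = [0.0] * len(arms)  # running mean confidence per sequence

def select_arm(t: int) -> int:
    """UCB1: play each arm once, then maximise mean + sqrt(2 ln t / n)."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(arms)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

def update(i: int, reward: float) -> None:
    counts[i] += 1
    values[i] += (reward - values[i]) / counts[i]  # incremental mean

for t in range(1, 101):          # each round = one surveyed clinician
    i = select_arm(t)
    reward = random.random()     # placeholder for the clinician's confidence rating
    update(i, reward)
```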

MAIN RESULTS
• We surveyed 14 doctors who rated their confidence in the model based on the evidence shown
• We also surveyed 30 ML experts who predicted the average doctor's confidence in the model

The average ratings provided by doctors (MD) and ML experts (CS) for each evidence sequence are below:

(a) General model evidence sequences

Evidence sequence                          MD     CS
Data, Accuracy                             0.50   0.58
Data, Accuracy, Dtree                      0.75   0.58
Data, Accuracy, LinReg                     0.75   0.58
Data, Accuracy, LinReg, Dtree              0.63   0.67
Data, Method, Accuracy, LinReg, Dtree      0.63   0.38

(b) Patient scenario sequences

Evidence sequence                               MD     CS
Patient                                         0.59   0.64
Patient, Sensitivity                            0.72   0.63
Patient, Sensitivity, Outcome                   0.75   0.73
Patient, Sensitivity, LinReg, Dtree, Outcome    0.81   0.72

FUTURE WORK
Our proposed framework, which utilizes reinforcement learning to design comprehensible, trustworthy systems based on ML models, can be extended:
• Test different ML models, data-sets, or contexts in medicine and beyond
• Test the effectiveness of a wide variety of interpretability modules, including LIME, DeepLIFT, associative classifiers, feature rankings, and more
• Test different RL algorithms, including contextual bandits and deep RL

Next step: an improved, larger-scale survey
• Fewer arms, more doctors and ML experts
  – Statistically significant results
• Contextualize clinicians by specialization, years in practice, familiarity with ML, etc.

* This study was reviewed by the Ethics Committee of the University of Oxford's Department of Computer Science, 2018.

KEY FINDINGS
• Machine learning experts appear unable to predict which interpretability modules will best engender doctor trust
• Evidence is not super-additive: more information may not increase confidence, possibly due to information overload
• Doctors must be consulted to create ML-driven DSSs that are truly useful in healthcare settings

TAKE OUR SURVEY!
Contact the research team for details!
Contact: [email protected]