

Machine Learning Approaches for Survival Prediction of Critically Ill Patients Under Insulin Therapy

Bernardo Marreiros Firme

Thesis to obtain the Master of Science Degree in

Mechanical Engineering

Supervisors: Prof. João Miguel Da Costa Sousa
Eng. Aldo Robles Arévalo

Examination Committee

Chairperson: Prof. Carlos Baptista Cardeira
Supervisor: Prof. João Miguel Da Costa Sousa

Member of the Committee: Prof. Jorge Dos Santos Salvador Marques

June 2019


Acknowledgments

First of all, I thank my supervisors, Professor João Sousa and Eng. Aldo Arévalo, for all the guidance and knowledge shared throughout the development of this work.

A special thanks to my family, for all the autonomy and support given during these years, always with the most assertive advice to help me follow the right path, even with some essential deviations.

To my friends, for all the extra-school ’work’.

Last but not least, a word to Imortal BC, a club representing a sport that helped me grow and build my character as a person.


Resumo

Esta dissertação propõe o desenvolvimento de um modelo capaz de predizer a mortalidade de pacientes sob o efeito de insulina em UCI, utilizando a base de dados MIMIC-III. A terapia com insulina é crucial para controlar os níveis de açúcar no sangue em pacientes em estado crítico. No entanto, não há consenso sobre que controlo glicémico, intensivo ou convencional, é mais benéfico de modo a reduzir a mortalidade.

Gradient boosting e regressão logística foram as técnicas escolhidas após uma extensiva comparação entre várias técnicas de machine learning. Data sampling foi aplicado para neutralizar o desequilíbrio presente no conjunto de dados e técnicas de feature selection, incluindo uma nova abordagem intitulada recursive feature selection, foram igualmente aplicadas.

No geral, gradient boosting com um total de 187 variáveis obteve o melhor desempenho (AUC de 91.4 ± 1.36) para dados coletados nas primeiras 24 horas na UCI, superando o melhor índice de gravidade, SAPS-II (AUC de 77.4 ± 2.44).

Diferentes tempos de previsão foram testados e o mais próximo da alta médica obteve o melhor desempenho (AUC de 94.8 ± 0.92).

Após feature selection, um modelo com apenas 7 variáveis obteve um bom desempenho (AUC de 90.2 ± 1.34). Este modelo foi validado usando dados da base de dados eICU-CRD, alcançando um desempenho semelhante (AUC de 88.0).

Finalizando, os modelos foram interpretados usando valores SHAP. Assim, identificaram-se as variáveis que globalmente e individualmente mais afetam os pacientes, dando origem à construção de painéis clínicos individualizados. Estes podem ser uma ferramenta importante numa perspectiva de decisões médicas auxiliadas por dados.

Palavras-chave: Machine Learning, Previsão de Mortalidade, Insulina, Gradient Boosting, Interpretação de Modelos


Abstract

This thesis proposes the development of a classification model capable of predicting mortality in patients under insulin therapy in the ICU, using data from the MIMIC-III database. Insulin therapy is crucial to control blood sugar levels in critical-care patients. However, there is no consensus on which glucose control, intensive or conventional, is most beneficial for these patients in terms of reducing mortality.

Gradient boosting and logistic regression were the chosen modelling techniques after an extensive comparison of several machine learning techniques. Data sampling was applied to counteract the imbalance present in the dataset, and feature selection techniques, including a novel approach entitled recursive feature selection, were also applied.

Overall, gradient boosting with a total of 187 features achieved the highest performance (AUC of 91.4 ± 1.36) for data collected in patients’ first 24 hours in the ICU and outperformed the best-performing severity score, SAPS-II (AUC of 77.4 ± 2.44). Different prediction time-windows were tested, and the one nearest to ICU discharge achieved the highest performance among all tested (AUC of 94.8 ± 0.92).

After feature selection, a model with only 7 features achieved a good performance (AUC of 90.2 ± 1.34). This model was validated using previously unseen data from the eICU-CRD database, and a similar performance was achieved (AUC of 88.0).

Lastly, the models were interpreted using SHAP values. Thus, the variables that overall and individually most affect patients were identified, giving rise to the construction of individualized clinical dashboards. These may be an important tool from the perspective of data-aided decisions by physicians.

Keywords: Machine Learning, Mortality Prediction, Insulin, Gradient Boosting, Model Interpretation


Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix

1 Introduction 1

1.1 Applications of Data Mining in Healthcare . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Mortality in Patients under Insulin Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Insulin Therapy 7

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 Homeostatic Regulation of Blood Glucose Levels . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Diabetes Mellitus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.3 Types of Insulin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Insulin Therapy in Intensive Care Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Intensive Care Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.2 Dysglycemia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.3 Intensive Insulin Therapy vs Conventional Insulin Therapy . . . . . . . . . . . . . . . . 10

2.3 Mortality Prediction in Intensive Care Units . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Previous studies aiming mortality prediction in ICU . . . . . . . . . . . . . . . . . . . 12

2.3.2 Severity Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Data pre-processing 15

3.1 MIMIC-III Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Patients and Variables Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.1 Inclusion Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.2 Input Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21


3.3.1 Removal of Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.3.2 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3.3 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4 Feature Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.1 Discretization of Time-series Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4.2 Glycemic Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4.3 Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.5 Selected Time-windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.5.1 Description of Processed Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.6 Data Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.6.1 Over Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.6.2 Undersampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

4 Modeling 29

4.1 Knowledge Discovery Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.2 Modeling Techniques Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.2.2 Linear and Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . 32

4.2.3 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.4 K-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2.5 Gaussian Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3 Ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.3.1 Parallel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.3.2 Sequential Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.4 Gradient Boosting Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4.1 Tree Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4.2 Best split method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.4.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.4.4 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.5 Feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.5.1 Recursive Feature Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.5.2 Sequential Forward Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.5.3 Recursive Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.6 Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.6.1 Features Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.6.2 SHapley Additive exPlanation values . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.7 Model Performance Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.7.1 Repeated K-Fold Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.7.2 Sensitivity, Specificity and Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


4.7.3 Area under the receiver operating characteristic curve . . . . . . . . . . . . . . . . . . 44

4.7.4 Area under the Precision-Recall curve . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5 Results 45

5.1 Descriptive Analysis of the Cohort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2 Selection of a Machine Learning Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.3 Selection of a Gradient Boosting Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.4 First-day Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.4.1 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.4.2 Effects of Sampling Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.4.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.4.4 Feature Selection - Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.4.5 Comparing Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.5 Comparison with severity scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.6 Analysis for different time-windows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.7 Comparison with similar studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.8 External Validation - eICU-CRD database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6 Model Analysis and Interpretation 63

6.1 Model Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.1.1 Glasgow Coma Scale, Age and Ventilation Time . . . . . . . . . . . . . . . . . . . . 65

6.1.2 Number of Insulin Infusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.1.3 Diabetes and Long-Term Insulin Users . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.1.4 Ethnicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.1.5 Glucose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.1.6 Respiratory Rate and Respiratory Diseases . . . . . . . . . . . . . . . . . . . . . . . . 69

6.1.7 Anion Gap and Bicarbonate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.2 Individualized Clinical Dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

7 Conclusions 77

7.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7.2 Comparison with Previous Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Bibliography 81

A Outlier Detection 90

B Gradient Boosting Machines 94


C eICU-CRD Collaborative Research Database 97

C.1 Inclusion Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

C.2 Input Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

C.3 Data Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

D SHAP values 100


List of Tables

2.1 Types of Insulin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Blood glucose levels classification (mg/dL) . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Chronological summary of major randomized controlled trials. . . . . . . . . . . . . . . . . . 11

3.1 Demographic variables for the working cohort. . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Multi-level approach for patients’ diagnoses . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Covariates associated to patients’ diagnoses . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.4 Laboratory variables and normal ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.5 Vital variables and normal ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.6 Number of patients included in each time-window. . . . . . . . . . . . . . . . . . . . . . . . 25

3.7 Description of the input variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Machine learning algorithms assessed in this thesis work. . . . . . . . . . . . . . . . . . . . . 32

4.2 Hyperparameters to tune. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1 Performance metrics for machine learning techniques . . . . . . . . . . . . . . . . . . . . . . 47

5.2 Performance metrics for each GB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.3 Performance metrics for first-day dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.4 Performance metrics for oversampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 51

5.5 Performance metrics for undersampling techniques . . . . . . . . . . . . . . . . . . . . . . . . 51

5.6 Input features after feature selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.7 Performance metrics after feature selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.8 Performance metrics for comparison between approaches for LGB . . . . . . . . . . . . . . 59

5.9 Performance metrics to compare with common severity scores used in clinical setting. . . . . . 59

5.10 Performance metrics for different data extraction time-windows. . . . . . . . . . . . . . . . . 60

5.11 Performance comparison with literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.12 Performance metrics for external validation with eICU database . . . . . . . . . . . . . . . . . 62

C.1 Number of patients included in each feature subset . . . . . . . . . . . . . . . . . . . . . . . 99


List of Figures

1.1 Internet of Things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Internet of Things - Analogy Human-Car . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Data Mining in databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1 Homeostasis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3.1 Inclusion criteria applied to extract the cohort used in this work. . . . . . . . . . . . . . . . . 16

3.2 Patients percentage per variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Outliers Detection - Glucose Feature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.4 Percentage of patients lost and its relationship with missing data in some features. . . . . . 23

3.5 Prediction time-windows switching for the study. . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 Model construction layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.2 Trade-off Select Machine Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.3 Timeline of ensemble learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.4 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.1 General overview of the working dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.2 Comparison of different machine learning algorithms (first-day) . . . . . . . . . . . . . . . . 47

5.3 Comparison of different machine learning algorithms (Last-day) . . . . . . . . . . . . . . . . 47

5.4 Performance analysis comparing GB algorithms (First day) . . . . . . . . . . . . . . . . . . . 48

5.5 Performance analysis comparing GB algorithms (Last day) . . . . . . . . . . . . . . . . . . . 48

5.6 Number of Estimators vs Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.7 Number of leaves vs Max depth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.8 Subsampling and feature sampling for LGB. . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.9 Recursive Feature Elimination - LGB with SHAP Values . . . . . . . . . . . . . . . . . . . . . 53

5.10 Sequential Forward Selection - LGB with AUC metric . . . . . . . . . . . . . . . . . . . . . . 53

5.11 Recursive Feature Selection - LGB with SHAP Values . . . . . . . . . . . . . . . . . . . . . . 54

5.12 Recursive Feature Elimination - LR with Weight Vectors . . . . . . . . . . . . . . . . . . . . 55

5.13 Sequential Forward Selection - LR with AUC metric . . . . . . . . . . . . . . . . . . . . . . . 56

5.14 Recursive Feature Selection - LR with Weight Vectors . . . . . . . . . . . . . . . . . . . 57

5.15 Analysis for different time-windows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


6.1 Feature’s importance ranked through SHAP values. . . . . . . . . . . . . . . . . . . . . . . . 64

6.2 SHAP values for the 20 most important features. . . . . . . . . . . . . . . . . . . . . . . . . 64

6.3 SHAP ranking and the relationship to different covariates. . . . . . . . . . . . . . . . . . . . 66

6.4 Number of infusions and SHAP values for LightGBM and LDA models. . . . . . . . . . . . . 67

6.5 Diabetic condition and SHAP values for LGB and LR models. . . . . . . . . . . . . . . . . . 68

6.6 Ethnicity SHAP values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.7 Glucose readings vs SHAP values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.8 Respiratory Rate SHAP values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.9 Respiratory Rate (Respiratory Diseases) SHAP values . . . . . . . . . . . . . . . . . . . . . . 71

6.10 Anion gap and SHAP values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.11 Bicarbonate Mean SHAP values for LGB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.12 Patients mortality probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.13 Clinical dashboard for a patient expected to survive . . . . . . . . . . . . . . . . . . . . . . . 74

6.14 Clinical dashboard for a patient expected to die . . . . . . . . . . . . . . . . . . . . . . . . . 74

6.15 Patients mortality probability coloured by real outcomes . . . . . . . . . . . . . . . . . . . . . 75

A.1 Outliers Detection - Anion Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.2 Outliers Detection - Bicarbonate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.3 Outliers Detection - Chloride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

A.4 Outliers Detection - Creatinine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.5 Outliers Detection - Hemoglobin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.6 Outliers Detection - Hematocrit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.7 Outliers Detection - MCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

A.8 Outliers Detection - MCHC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.9 Outliers Detection - MCV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.10 Outliers Detection - Platelet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.11 Outliers Detection - RBC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.12 Outliers Detection - RDW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.13 Outliers Detection - Sodium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.14 Outliers Detection - BUN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

A.15 Outliers Detection - Glucose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

C.1 Databases analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

C.2 Inclusion criteria applied to extract the cohort used in this work. . . . . . . . . . . . . . . . . 98

C.3 Missing values for each variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


Nomenclature

Acronyms

ADA AdaBoost

AUC Area Under ROC curve

AUPRC Area Under Precision-Recall curve

CB Categorical Boosting (CatBoost)

CCS Clinical Classifications Software

CIT Conventional Insulin Therapy

CRISP-DM Cross-Industry Standard Process for Data Mining

DIKW Data-Information-Knowledge-Wisdom

DT Decision Trees

ENN Edited Nearest Neighbours

GB Gradient Boosting

GNB Gaussian Naive Bayes

ICD-9-CM International Classification of Diseases, Ninth Revision, Clinical Modification

ICU Intensive Care Unit

IIT Intensive Insulin Therapy

IoT Internet of Things

KDD Knowledge Discovery in Databases

KNN K-Nearest Neighbour

LDA Linear Discriminant Analysis

LGB Light Gradient Boosting (LightGBM)

LOCF Last Observation Carried Forward


LODS Logistic Organ Dysfunction System

MAR Missing at Random

MCAR Missing Completely at Random

MCHC Mean Corpuscular Hemoglobin Concentration

MCH Mean Corpuscular Hemoglobin

MCV Mean Corpuscular Volume

MIMIC III Medical Information Mart for Intensive Care III

MNAR Missing Not at Random

NCR Neighborhood Cleaning Rule

OASIS Oxford Acute Severity of Illness Score

QDA Quadratic Discriminant Analysis

QR Quick Response

QSOFA Quick Sequential Organ Failure Assessment

RBC Red Blood Cells

RDW Red cell Distribution Width

RFID Radio-Frequency Identification

RFS Recursive Feature Selection

RF Random Forest

SAPS III Simplified Acute Physiology Score 3

SAPS Simplified Acute Physiology Score

SHAP SHapley Additive exPlanations

SOFA Sequential Organ Failure Assessment

SVM Support Vector Machines

WBC White Blood Cells

XGB Extreme Gradient Boosting (XGBoost)


Chapter 1

Introduction

The world’s digitization has brought a multitude of ways to generate and collect data in the most diverse contexts. Data is scattered everywhere. Everything is data. However, data alone is inherently powerless, and that is where machines and current machine learning techniques make a difference. The ability to learn from data has changed the way machines are programmed, and companies have realized that investing in transforming data into information can support decision-making processes.

Surely everyone has already come across machine learning in everyday situations: weather forecasting in the news, e-mail spam filters for unwanted information, alerts when logging into our personal accounts from a new device, or online retailers offering personalized recommendations based on previous purchases and activity. Everyday apps are loaded with machine learning to give the consumer solid and empowered support. Apps use image recognition to identify familiar faces from the contact list, suggest movies and music tracks from likes and dislikes, use voice recognition to imitate human interaction, and analyze traffic to reduce travel time by suggesting faster routes. Even smart homes nowadays adjust the indoor climate, switch and regulate lights, or detect a leak in a water pipe purely by making data-driven decisions [1, 2].

The importance of data has grown so much in the past decade that it is nowadays considered one of the most important commodities and is giving rise to a new economy comparable to that of the petroleum industry. Data has become a new currency and is considered the petroleum of the digital era [3].

But how and why is data changing this paradigm of society?

Technological advancements are constantly changing how society interacts. The internet has been the catalyst for people to leverage connectivity to interact with each other and with multiple devices. The world is at our fingertips, and with just a click data is generated.

The way data is handled is what makes it so valuable, and machine learning makes computing processes more efficient, cost-effective, reliable and personalized for the consumer. Nevertheless, machine learning only works when data is shared, which turns data into a valuable asset from the economic perspective of data sharing.

From computing devices to mechanical and digital machines, and even animals and people, at some point everything will be connected and able to identify itself in a data-sharing environment, the so-called Internet of Things (IoT) (figure 1.1), tailored to transfer data from one to another to optimize each one’s performance.


This assumption does not necessarily mean the communication method is strictly restricted to the internet, since radio-frequency identification (RFID), sensor technologies or QR codes may be included to facilitate the data flow within and between ”things”. The IoT is expected to produce a tremendous amount of data whose value can be boosted with the help of machine learning. As more data becomes available, more diverse and challenging problems can be tackled [4–6].

Figure 1.1: Internet of Things

With the help of the IoT and machine learning, it is possible to bridge the gap between medicine and mechanical engineering to better understand the aim of this thesis. Thereby, an analogy between a car and a human is made, keeping in mind the proper separation of ideas (figure 1.2).

Considering the car’s engine as its heart and its fluids as its blood, either a device monitoring the heart signal of a person or a sensor measuring engine performance in a car gives data about the object under study, from which it is possible to infer possible malfunctions.

In a practical example related to this thesis, a monitored control of blood glucose levels leads to a decision by the physician in the presence of abnormal values (occasionally insulin is administered when values are high), in the same way that a sensor recording abnormal oil levels results in an action on the part of the mechanic. The expected outcome is that both levels return to their normal range of values, but external constraints, such as a patient’s resistance to insulin or a leak in the oil reservoir, may change the desired outcome.

Thus, it is important to take into account more data to infer the possible cause, i.e., to collect more clinical or mechanical information about the object under study, which is possible by accessing sensors and monitoring devices embedded in the object. These sensors are connected to a gateway that analyzes and sorts data, transmitting only valuable information to the IoT platform.

The IoT platform constantly gathers and stores data, not only from the object under study but also from other similar objects, in order to build a historical record and a database. Coupling machine learning with the data collected in the IoT platform leads to the identification of possible diseases/anomalies, suggesting a treatment/repair to the physician/mechanic and supporting the decision-making process.

Estimating the lifetime of each object from data is a more ambitious task, which will be addressed in the case of patients in intensive care units in the course of this master’s thesis.

(a) Analogy Human-Car (b) ’Things’ in Internet of Things

Figure 1.2: Internet of Things - Analogy Human-Car

After all, no matter the topic, it all comes down to handling the data and climbing the data-information-knowledge-wisdom (DIKW) hierarchy [7], where data leads to information, which is transformed into knowledge by identifying patterns, and only with a formulation and an understanding of the principles can it be turned into wisdom. Put succinctly, the process of discovering useful information from data is called data mining.

1.1 Applications of Data Mining in Healthcare

Data mining is becoming increasingly essential in healthcare. Insurance companies attempting fraud detection, diagnoses through the analysis of images, and resource allocation control in hospitals based on predicted high-risk areas are examples of data mining applications in healthcare [8, 9]. Notwithstanding, data mining is especially relevant for data from intensive care units (ICUs), both because of its high dimensionality and because of the possibility of achieving major outcomes in critically ill patients [10].

Due to the critical condition of the patients, ICUs are the most data-intensive part of any hospital. Highly sophisticated devices have been implemented to closely monitor patients’ health status and are capable of collecting electronic health records (EHR) at an extremely high frequency [11]. EHRs cover not only bedside-monitored vital signs or laboratory measurements collected from devices but also patients’ demographic information, past medical history and diagnosis notes, just to name a few, in order to provide a detailed clinical record for each patient. Having ICU data in an electronic format has led to the development of large datasets whose latent information has the potential to improve the understanding of health and disease [10, 11].

However, there is no consensus framework for applying data mining to datasets. The Knowledge Discovery in Databases (KDD) process [12] and the Cross-Industry Standard Process for Data Mining (CRISP-DM) [13] are widely used frameworks in a variety of different projects, and their incorporation for data mining applications in healthcare is viable. Despite minor differences, these frameworks are quite similar and can be merged into a single framework in order to meet the demanding requirements of this thesis.

Thus, the problem can be branched into seven major steps, described in figure 1.3. Predominantly following the steps proposed in [13], an extra step for model interpretation was introduced as a result of the requirement to understand model outputs, which is essential to reach possible medical conclusions.

Project understanding, data understanding, data preparation, modelling using machine learning, evaluation,

model interpretation and deployment are the steps that will be discussed throughout the constituent chapters

of this thesis.

Figure 1.3: Data Mining in databases

1.2 Mortality in Patients under Insulin Therapy

Over the past years, glucose control among patients in ICUs has become a widely discussed and controversial topic. Since a study conducted by Van den Berghe [14] reported that critically ill patients under intensive insulin therapy (IIT), aiming to achieve normoglycemia, had lower mortality rates than those following the conventional protocol, several studies have been carried out comparing intensive and conventional insulin therapy (section 2.2.3). After years as the standard of care in the ICU, intensive insulin therapy came to be seen as counterproductive, and conventional insulin therapy (CIT) was established as the standard practice [15].

Insulin therapy is intrinsically related to hyperglycemic events, since it is the preferred regimen for effectively treating hyperglycemia in the ICU [16]. Hyperglycemia occurs in the majority of critically ill patients regardless of a previous history of diabetes and is associated with many adverse clinical outcomes [15, 17, 18], including mortality and morbidity [19, 20]. The mortality rate for newly hyperglycemic patients approaches one in three [21], so it is consensual that hyperglycemia should be treated to improve the chances of survival.

On the other hand, hypoglycemia is a limiting factor when dealing with the maintenance of blood glucose levels [22]. Although not directly related, hypoglycemic events can be linked to insulin infusion. Absolute or relative insulin excess, together with inadequate nutrition and features of critical illness, are the fundamental causes of hypoglycemia in ICUs [22, 23]. Like hyperglycemia, hypoglycemia is a concern during the management of critically ill patients and is associated with unfavourable patient outcomes.

1.3 Objectives and Contributions

The aim of this thesis is to develop a model capable of predicting mortality in patients under insulin therapy in the ICU, using data from the Medical Information Mart for Intensive Care (MIMIC) III database.

After the model’s construction, it is intended to validate it with data from the eICU Collaborative Research Database from Philips Healthcare; to the best of our knowledge, this will be one of the first works to present results using the aforementioned database.

The work mainly focuses on patients’ first 24 hours in the ICU, since this is the time window used to calculate the common severity scores used in ICUs. However, data is also taken from different time-windows during the patient’s ICU stay, in the perspective of real-time mortality prediction.

To the best of our knowledge, this is the first work to focus entirely on patients under insulin therapy. Although there are several studies comparing the influence of different insulin therapies in ICUs (identified in section 2.2.3), none of them is intended to predict mortality. On the other hand, studies aiming to predict mortality, even if not insulin-related, served for performance assessment. Compared to the highest performance found in the literature [24], a study with no restrictions on patient choice, the model has a similar performance despite dealing with a less predictable cohort of patients.

Regarding data preprocessing, this work is one of the few to rearrange diagnoses upon admission, more precisely ICD-9 codes, in a multi-level approach following the rules of the Clinical Classifications Software (CCS) [25].

In the modeling strand, several machine learning techniques were tested, namely: k-nearest neighbours (KNN); support vector machines (SVM); decision trees (DT); random forest (RF); logistic regression (LR); AdaBoost (ADA); gradient boosting (GB); Gaussian naive Bayes (GNB); linear discriminant analysis (LDA); and quadratic discriminant analysis (QDA).
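As a minimal illustration of how such a comparison can be set up (a sketch, not the exact pipeline of this thesis), the snippet below scores a few of the listed classifiers with repeated stratified k-fold cross-validation and AUC; the data is a synthetic, imbalanced placeholder rather than the MIMIC-III cohort.

```python
# Hedged sketch: compare candidate classifiers with repeated k-fold CV and AUC.
# The data below is a synthetic, imbalanced placeholder, not the MIMIC-III cohort.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
    print(f"{name}: AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```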

LR achieved the second highest performance among all techniques and, as a rule, it is the baseline model in health data analysis. For that reason, it was kept throughout the work.

GB achieved the highest performance among all techniques, and three different gradient boosting frameworks were subsequently tested: XGBoost (XGB), LightGBM (LGB) and CatBoost (CB). These three achieved a higher performance than GB but a similar performance to each other. Among them, LGB was selected for further work in the study because it stood out in terms of computational performance.

Regarding feature selection, a novel approach entitled recursive feature selection (RFS) was implemented, using the importance of features calculated through SHapley Additive exPlanations (SHAP) values as the ranking criterion for recursive feature elimination, coupled with principles from sequential selection methods. For comparison, the widely used sequential forward selection and recursive feature elimination were also performed.
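To illustrate the underlying idea, ranking features by mean absolute SHAP value and recursively discarding the least important one, the following is a minimal sketch assuming a LightGBM classifier and synthetic placeholder data; it is not the exact RFS procedure developed in this thesis.

```python
# Hedged sketch: recursive elimination of the feature with the smallest
# mean |SHAP| value. Placeholder data, not the thesis dataset or method.
import numpy as np
import pandas as pd
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X_raw, y = make_classification(n_samples=500, n_features=15, random_state=0)
X = pd.DataFrame(X_raw, columns=[f"feat_{i}" for i in range(15)])

def least_important_feature(model, X):
    """Return the name of the feature with the smallest mean |SHAP| value."""
    shap_values = shap.TreeExplainer(model).shap_values(X)
    if isinstance(shap_values, list):      # older shap: one array per class
        shap_values = shap_values[1]
    importance = np.abs(shap_values).mean(axis=0)
    return X.columns[int(np.argmin(importance))]

features = list(X.columns)
history = []
while len(features) > 5:                   # arbitrary minimum subset size
    model = LGBMClassifier(n_estimators=100).fit(X[features], y)
    auc = cross_val_score(model, X[features], y, scoring="roc_auc", cv=5).mean()
    history.append((len(features), auc))
    features.remove(least_important_feature(model, X[features]))

print(history)   # AUC as a function of the number of retained features
```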

Moreover, this work underpins its real contribution not only in its predictive performance but especially in the way the models are interpreted using SHAP values. Thus, the variables that overall and individually most affect patients are identified, giving rise to the construction of individualized clinical dashboards. These may be an important tool from the perspective of data-aided decisions by physicians.
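A minimal sketch of this kind of SHAP-based interpretation is given below, using the shap library on a fitted LightGBM model with synthetic placeholder data; the thesis dashboards are built on this type of output but are not reproduced here.

```python
# Hedged sketch: global SHAP interpretation of a fitted LightGBM model on
# synthetic placeholder data (not the thesis cohort or its dashboards).
import pandas as pd
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification

X_raw, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X_raw, columns=[f"feat_{i}" for i in range(10)])

model = LGBMClassifier(n_estimators=100).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
if isinstance(shap_values, list):   # older shap versions return one array per class
    shap_values = shap_values[1]

# Global view: features ranked by mean |SHAP| with the direction of their effect.
shap.summary_plot(shap_values, X)

# Per-feature view: how one variable's values relate to its SHAP contribution.
shap.dependence_plot("feat_0", shap_values, X)
```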

It should be highlighted that the work developed was the basis of the conference paper entitled ’Fuzzy Modeling of Survival Associated to Insulin Therapy in the ICU’, submitted to FUZZ-IEEE 2019 - International Conference on Fuzzy Systems.

1.4 Thesis Outline

This thesis is organized into 7 chapters. Following the present chapter, the 6 subsequent chapters are described by topic as follows:

Chapter 2 - Insulin Therapy in Intensive Care Units

Theoretical framework of the influence of insulin on human body and its importance to control blood

glucose levels in intensive care units. Survey of studies comparing intensive and conventional insulin therapy.

Previous studies aiming mortality prediction. Severity scores in ICU.

Chapter 3 - Data Pre-Processing

MIMIC III database description. Patients and variables inclusion criteria. Variables description by type and

typical range of values. Missing data, outliers and max-min normalization. Time experiments. Data sampling.

Chapter 4 - Modelling

Problem Assessment. Machine learning techniques explanation. Model interpretation explanation. Feature

selection techniques. Performance Metrics.

Chapter 5 - Results

Overview of data for the study. Choice of machine learning techniques. Hyperparameter tuning. Results

after: data sampling and feature selection. Mortality prediction for different time-windows. External validation

using eICU-CRD database.

Chapter 6 - Model Analysis and Interpretation

Interpretation of the model. Individualized clinical dashboards.

Chapter 7 - Conclusion

Main conclusions achieved with this master’s thesis. Future work.


Chapter 2

Insulin Therapy

This chapter describes the role of insulin in human metabolism. Section 2.1 succinctly explains key concepts such as homeostasis, how the homeostatic regulation of blood glucose levels occurs, which metabolic disorders are associated with glucose levels, and which types of insulin are used to counteract these disorders. Section 2.2 describes insulin therapy in the ICU setting through a brief review of the studies performed to compare different types of insulin therapies for glucose control. Lastly, section 2.3 presents a brief state of the art of current studies in mortality prediction and the severity scores used to evaluate patient health status in the ICU.

2.1 Introduction

2.1.1 Homeostatic Regulation of Blood Glucose Levels

The human body is composed of trillions of cells that work together for the maintenance of the entire organism. Cells make up tissues, tissues are grouped into organs, and separate organs work together, forming human body systems with distinct functions. Maintaining equilibrium and stability in response to environmental changes is crucial for the welfare of an individual.

Homeostasis is the tendency of human body systems to maintain internal stability owing to a coordinated response to any stimulus that would tend to disturb the normal functionality of the human body [28]. Figure 2.1 summarizes the human body systems and describes their role in homeostasis.

The response to a stimulus is characterized by feedback regulation. In the first instance, a stimulus is detected by a receptor that sends information to a control center, which in turn sends commands to an effector to counteract the stimulus. The response to the stimulus may itself become a new stimulus, and the process is repeated until a set point is reached, resulting in homeostasis.

Regulating blood glucose concentration is part of homeostatic regulation [29]. Glucose is required for cellular respiration and is the preferred fuel for all human body cells. An imbalance in normal blood glucose levels acts as a stimulus that is detected by the pancreas. Pancreatic islets, namely the islets of Langerhans, are islets dispersed throughout the pancreas that release regulatory hormones from different cells to counteract the stimulus. When blood glucose levels rise, insulin is released from β-cells into the bloodstream to stimulate human body cells to take glucose up from the blood and the liver to convert excess glucose into glycogen (glycogenesis). Conversely, when blood glucose levels fall, α-cells release glucagon into the bloodstream to stimulate the liver to break down stored glycogen and release glucose into the blood (glycogenolysis). Figure 2.1 presents a schematic picture of the homeostatic regulation of blood glucose levels.

(a) Human body systems and role on homeostasis (source [26])

(b) Homeostatic regulation of blood glucose levels (source [27])

Figure 2.1: Homeostasis

2.1.2 Diabetes Mellitus

Diabetes mellitus is a metabolic disorder that causes high blood glucose levels over a prolonged period due to defects in insulin production or the cells’ resistance to insulin. Individuals who carry this condition have problems with homeostatic regulation when blood glucose levels rise and need special care or treatment with insulin. There are two major types of diabetes mellitus, type I and type II.

In type I diabetes there is an absence of beta cells in the pancreas, mistakenly killed by the immune system, so there is no insulin production. Glucose is not removed from the bloodstream, and insulin-dependent treatment is necessary to avoid death.

Type II diabetes is characterized by the cells’ insensitivity or resistance to insulin. Due to exposure to high blood glucose levels for an extended period and an overproduction of insulin, the human body adapts and becomes ineffective at using insulin or simply unable to produce enough insulin. However, this is a non-insulin-dependent condition.

Secondary diabetes is a consequence of another medical condition and is considered a response to the effects of hyperglycemia. It is important to mention that diabetes can appear in a wide range of types, for instance gestational diabetes, LADA diabetes, MODY diabetes, double diabetes, type III diabetes or diabetes insipidus [30]. However, these types of diabetes are outside the main purpose of this work.


2.1.3 Types of Insulin

In a historical context, insulin was discovered in 1921 by Banting and Best in pancreatic extracts from dogs and, with the help of MacLeod, it was possible to purify insulin for human needs. Banting and MacLeod were awarded the Nobel Prize in Physiology or Medicine in 1923 ”for the discovery of insulin” [31].

Initially extracted from animal pancreases, human insulin is now produced synthetically by growing genetically engineered strains of bacteria, namely E. coli. However, insulin analogs are replacing synthetic human insulin. Insulin analogs better mimic the body’s natural pattern of insulin release and have a more predictable duration of action [32].

The types of insulin are described in table 2.1 by their action profile, which differs in terms of onset, peak and duration in the bloodstream. Their role in blood sugar management is also explained.

Table 2.1: Types of insulin [32].

Rapid-Acting (covers insulin needs for meals eaten at the same time as the injection; often used with a longer-acting insulin):
  Lispro (Humalog)*                  Onset 15-30 min   Peak 30-90 min   Duration 3-5 h
  Aspart (Novolog)                   Onset 10-20 min   Peak 40-50 min   Duration 3-5 h
  Glulisine (Apidra)                 Onset 20-30 min   Peak 30-90 min   Duration 1-2.5 h

Short-Acting (covers insulin needs for meals eaten within 30-60 min):
  Regular* (Humulin R, Novolin R)    Onset 30-60 min   Peak 2-5 h       Duration 5-8 h
  Insulin Pump (Velosulin)           Onset 30-60 min   Peak 1-2 h       Duration 2-3 h

Intermediate-Acting (covers insulin needs for half the day or overnight; often combined with a rapid- or short-acting insulin):
  NPH* (Humulin N, Novolin N)        Onset 1-2 h       Peak 4-12 h      Duration 18-24 h

Long-Acting (covers insulin needs for one full day; often combined with a rapid- or short-acting insulin):
  Glargine* (Lantus)                 Onset 1-1.5 h     No peak          Duration 20-24 h
  Detemir (Levemir)                  Onset 1-2 h       Peak 6-8 h       Duration up to 24 h
  Degludec (Tresiba)                 Onset 30-90 min   No peak          Duration 42 h

Pre-Mixed (taken two or three times a day before mealtime; combine intermediate- and short-acting insulin; the factors in the name indicate the percentage of each type, intermediate/short):
  NPH + Regular (Humulin 70/30)                    Onset 30 min      Peak 2-4 h        Duration 14-24 h
  NPH + Regular (Novolin 70/30)                    Onset 30 min      Peak 2-12 h       Duration up to 24 h
  Aspart Protamine + Aspart (Novolog 70/30)        Onset 10-20 min   Peak 1-4 h        Duration up to 24 h
  NPH + Regular (Humulin 50/50)                    Onset 30 min      Peak 2-5 h        Duration 18-24 h
  Lispro Protamine + Lispro (Humalog mix 75/25)*   Onset 15 min      Peak 30-150 min   Duration 16-20 h

* Types of insulin present in the MIMIC-III database.


2.2 Insulin Therapy in Intensive Care Units

2.2.1 Intensive Care Unit

Intensive care, also known as critical care, is a medical specialty dedicated to the comprehensive management of patients having, or being at risk of developing, acute, life-threatening injuries and illnesses [33].

An intensive care unit (ICU) is an organized system for the provision of care to critically ill patients that

provides intensive and specialized medical and nursing care, an enhanced capacity for monitoring, and multiple

modalities of physiologic organ support to sustain life during a period of acute organ system insufficiency [33].

2.2.2 Dysglycemia

Dysglycemia is a term that refers to any disorder of blood sugar metabolism and can appear in two different forms: hyperglycemia and hypoglycemia. As introduced in section 1.2, these episodes frequently occur in critically ill hospitalized patients and are associated with adverse outcomes, including morbidity and mortality [19, 20].

Hypoglycemia is a condition in which the amount of circulating glucose in the bloodstream is lower than normal, while a higher amount than normal refers to hyperglycemia. Table 2.2 shows how blood glucose levels are classified to describe each dysglycemic condition.

Hyperglycemia, besides being commonly associated with patients with a medical history of diabetes, frequently appears as stress hyperglycemia [34, 35], which is an adaptive and appropriate response to a life-threatening condition in previously normoglycemic patients [36].

Table 2.2: Blood glucose levels classification (mg/dL) [37, 38].

Hypoglycemia Level 3   < 50
Hypoglycemia Level 2   [50, 54[
Hypoglycemia Level 1   [54, 70[
Normal                 [70, 180]
Hyperglycemia          > 180
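For reference, the thresholds in table 2.2 translate directly into a simple classification rule; the function below is an illustrative sketch of that mapping, not code from this thesis.

```python
# Hedged sketch: classify a blood glucose reading (mg/dL) using the
# thresholds of table 2.2. Illustrative only.
def classify_glucose(glucose_mg_dl: float) -> str:
    if glucose_mg_dl < 50:
        return "Hypoglycemia Level 3"
    if glucose_mg_dl < 54:
        return "Hypoglycemia Level 2"
    if glucose_mg_dl < 70:
        return "Hypoglycemia Level 1"
    if glucose_mg_dl <= 180:
        return "Normal"
    return "Hyperglycemia"

print(classify_glucose(45))    # Hypoglycemia Level 3
print(classify_glucose(120))   # Normal
print(classify_glucose(250))   # Hyperglycemia
```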

2.2.3 Intensive Insulin Therapy vs Conventional Insulin Therapy

The concept of controlling glycemia (normoglycemia) in ICU patients through insulin therapy to affect outcomes has become increasingly complicated to apply and achieve. The debate between intensive insulin therapy (IIT) and conventional insulin therapy (CIT) has been the object of study over the years.

The first large study was the Diabetes Mellitus Insulin Glucose Infusion in Acute Myocardial Infarction (DIGAMI) study [39] in 1995, in which 620 patients with diabetes mellitus and myocardial infarction were assigned to receive conventional therapy or intensive therapy with a glucose-insulin infusion. In the trial, an improved long-term prognosis was achieved in patients with intensive therapy.

In 2001, the Leuven surgical trial conducted by Van den Berghe [40] resulted in substantially reduced mortality and morbidity in a surgical ICU with the use of IIT to maintain glucose levels.

Five years later, Van den Berghe conducted the Leuven medical trial [41] to assess the role of IIT in patients in a medical ICU. IIT prevented morbidity but did not significantly reduce the risk of death among all patients. However, those who stayed in the ICU for three or more days presented reduced morbidity and mortality.

The largest study to date was conducted in 2009: the Normoglycemia in Intensive Care Evaluation-Survival Using Glucose Algorithm Regulation (NICE-SUGAR) trial [42], with 6104 patients from medical and surgical ICUs, contradicted the previous studies because patients assigned to IIT had a higher mortality and a higher incidence of hypoglycemic events compared to conventional control.

The Glucontrol study [43], published by Preiser in 2009, randomized medical and surgical ICU patients. IIT did not reduce mortality, and the risk of hypoglycemia was increased. The study was prematurely interrupted, which precluded definitive conclusions from being drawn.

One of the most recently published studies was [44], in 2014, with a different approach. Instead of traditional IIT, patients underwent tight computerized glucose control. Compared to conventional control, mortality did not change significantly, but this approach was associated with more frequent severe hypoglycemic episodes.

The most recently published studies [42–44] demonstrate that IIT is associated with a higher frequency of hypoglycemic events, which is the main reason why CIT is nowadays the standard practice in the ICU.

A chronological summary of major randomized controlled trials is presented in table 2.3. It shows the number of patients associated with each trial, the type of center where the trials were conducted and whether the trial was single- or multi-center. Lastly, the target glucose ranges for each type of insulin therapy in the trials are identified.

Table 2.3: Chronological summary of major randomized controlled trials.

Year  Clinical Trial               No. Patients  Center  Type of Center               IIT (mg/dL)  CIT (mg/dL)
1995  Malmberg - DIGAMI [39]       620           Multi   Acute Myocardial Infarction  126-180      NA
2001  Berghe - Leuven [40]         1548          Single  Surgical ICU                 80-110       180-200
2006  Berghe - Leuven 2 [41]       1200          Single  Medical ICU                  80-110       180-200
2007  Gandhi [45]                  620           Single  Surgical ICU                 80-100       <200
2008  Arabi [46]                   523           Single  Mixed ICU                    80-110       180-200
2008  De la Rosa [47]              504           Single  Mixed ICU                    80-110       180-200
2008  Brunkhorst [48]              537           Multi   Mixed ICU                    80-110       180-200
2008  Mackenzie [49]               240           Multi   Mixed ICU                    72-108       198
2009  NICE-SUGAR trial [42]        6104          Multi   Mixed ICU                    81-108       144-180
2009  Preiser - Glucontrol [43]    1078          Multi   Mixed ICU                    79-110       140-180
2009  Yang [50]                    240           Multi   Neurological ICU             80-110       <200
2009  Bilotta [51]                 483           Single  Neurosurgical ICU            80-110       <215
2010  Annane - COIITSS [52]        509           Multi   Septic Shock Patients        80-110       180-200
2012  Desai [53]                   189           Single  Surgical ICU                 90-120       121-180
2013  Giakoumidakis [54]           212           Single  Surgical ICU                 120-160      161-180
2014  Macrae [55]                  1369          Multi   Mixed pediatric ICU          72-126       180-216
2014  Kalfon [44]                  2648          Multi   Mixed ICU                    79-109       <180


2.3 Mortality Prediction in Intensive Care Units

2.3.1 Previous studies aiming mortality prediction in ICU

To date, no published studies have been conducted restricting the patient cohort to only patients under insulin therapy. Since insulin is highly related to diabetes and hyper/hypoglycemic events, studies addressing these topics were prioritized, but even these topic-related studies were scarce.

Three studies predicting mortality in patients during their ICU stay were analyzed for comparison purposes. These studies have in common the use of the same database as this thesis, MIMIC-III [56].

A recently published study [57], with a cohort of 4111 diabetic patients and a mortality rate of 9.3%, used a random forest and a logistic regression combining three severity scores (CCI, DCSI and Elixhauser) to achieve AUC values of 78.7 and 78.5, respectively. Mean blood glucose was the variable most strongly associated with mortality in diabetic patients. Among all variables, a robust classification to predict the risk of mortality can be obtained from just five variables (diagnoses at admission, type of admission, glycated hemoglobin, age and mean glucose).

A different approach was applied in [58], where only patients who stayed in the coronary care unit (CCU) were analyzed, using only heart rate signals to predict mortality. Heart rate is a time-series feature and was described in terms of 12 statistical and signal-based features from the first hour in the ICU. The cohort was composed of 2979 patients, and 8 different classifiers were employed. Random forest and decision tree classifiers had the best results, with sensitivities of 97% and 92% and precisions of 97% and 90%, respectively.

In a real-time mortality prediction attempt, [59] extracted data from a random time during the ICU stay. Results for the first 24 hours in the ICU were also recorded for comparison with common severity scores. The best performing of the four models tested was gradient boosting: for the real-time experiment it achieved an AUC of 92.0, and for the first 24 hours experiment the AUC was 92.7. It should also be noted that the developed model outperformed the severity scores at predicting mortality. The final cohort comprised 50488 ICU stays from this database.

Some nuances from these three studies described above were applied to define some of the main objectives

of this thesis:

• Interpretation of which features are associated with mortality/survival in patients under insulin therapy.

• Describe time-series features in terms of statistical features.

• Evaluation of diverse models to compare their performance.

• Extracting data from different times within ICU admission.

• Comparison of results with common severity scores.


2.3.2 Severity Scores

In section 2.3.1, severity scores were used by the researchers as a comparison term against the developed

models. These severity scores are used in the ICU in order to aid physicians to predict patient outcome and

assess trauma severity. There are several severity scoring systems, but five scores will be described and used

in this work.

LODS

Logistic Organ Dysfunction System (LODS) is a way to assess organ dysfunction using 12 variables to rep-

resent the function of six organ systems (neurologic, cardiovascular, renal, pulmonary, hematologic, hepatic).

Variables are scored from 0 (no dysfunction) to 5 (maximum dysfunction) based on the worst value recorded in the first 24 hours in the ICU.

SAPS

Simplified Acute Physiology Score (SAPS) uses 14 physiological variables and their degree of deviation from normal to assign a score based on the first 24 hours of the ICU stay.

SAPS II

Simplified Acute Physiology Score 2 (SAPS II) is an upgrade to SAPS. It assesses only 12 physiological variables in the first 24 hours of ICU admission.

SOFA

Sequential Organ Failure Assessment (SOFA) scores the worst value of each day in ICU in a range from 0

(low) to 4 (high) representing malfunction of six organ systems (respiratory, cardiovascular, renal, hepatic,

central nervous and coagulation), for a total of 10 variables. Since SOFA score changes over time, just first

day values will be used in the study for comparison with the remaining severity scores.

QSOFA

Quick SOFA identifies patients with suspected infection, since mortality increases among infected patients. It is a simplified version of the SOFA score that takes into account just three variables: blood pressure, respiratory rate and Glasgow coma scale. The total score ranges from 0 to 3 and can be easily measured by physicians, while SOFA requires more laboratory tests.


Chapter 3

Data pre-processing

This chapter explains how the final input dataset was obtained and how the cohort used in this work was defined. Section 3.1 briefly describes the database used for this study. Section 3.2 starts with the patient inclusion criteria, followed by a description of the input variables. Data preparation is then conducted in sections 3.3 and 3.4, covering outlier detection, handling of missing data, discretization of time-series variables and variable normalization. An explanation of the selected time-windows and an analysis of the resulting datasets are presented in section 3.5. Lastly, the data sampling techniques to be applied are presented in section 3.6.

3.1 MIMIC-III Database

Medical Information Mart for Intensive Care (MIMIC) III [56] is a large clinical database containing health-

related data from patients that were admitted in critical care units of the Beth Israel Deaconess Medical Center

between 2001 and 2012.

Developed by the MIT Laboratory for Computational Physiology, this database includes information such

as demographics, vital sign measurements made at the bedside (∼ 1 data point per hour), laboratory test

results, procedures, medications, nurse and physician notes, imaging reports, and out-of-hospital mortality.

Data are de-identified so as not to compromise the confidentiality and safety of patients. For this purpose, patients are identified by codes: subject_id identifies a patient, hadm_id refers to each hospital admission and icustay_id to each ICU admission. Dates are shifted into the future by a random offset, and the ages of individuals over 89 years old were changed to values greater than 300 years.

The MIMIC III database is extracted from two different systems: CareVue and MetaVision. Patients in the CareVue system were admitted between 2001 and 2008, and admissions at a later date are recorded in MetaVision. The two systems archive data in different formats, which can lead to inconsistencies. The most prominent is related to the Item ID, since several concepts are identified in multiple manners throughout the database.


3.2 Patients and Variables Selection

3.2.1 Inclusion Criteria

Initially, all patients were extracted to analyze the number of admissions in the hospital per patient, and for

each admission, the number of ICU stays per admission. Readmissions were discarded, i.e., from patients with

multiple admissions or with more than one ICU stay per admission, only first admission and ICU stay were

included to avoid biased assessments.

From the subset, adult (≥ 16 years old) patients that received insulin during ICU stay were selected.

Infants were discarded because they have a different metabolism and, therefore, a different glucose control

protocol in the ICU. Lastly, only patients with a length of stay equal or higher than 24 hours remained for the

study.

The number of patients extracted in each step is shown in figure 3.1. The cohort prior to data treatment and modeling comprises 12338 patients.

[Figure 3.1 flow: 61501 patients' ICU stays in the database → 57328 patients' first ICU stay during the admission → 46428 patients' first hospital admission → 13195 patients that received insulin during the ICU stay and age ≥ 16 (8044 from MetaVision, 5151 from CareVue) → length of stay ≥ 24 h → 12338 patients.]

Figure 3.1: Inclusion criteria applied to extract the cohort used in this work.


3.2.2 Input Variables

Patients' information was divided into four major categories: demographic, diagnoses, laboratory and vital variables.

Demographic variables were extracted from Admission, Icustays and Patients tables.

Patients’ diagnoses with their respective ICD9 codes were extracted from Diagnoses icd table. ICD9 codes

meanings are described in D icd diagnoses table. Ventilation related data were extracted from Cpt events

and Chartevents tables. Inputevents mv and Inputevents cv tables served as support to extract insulin

infusions in patients during ICU stay.

Clinical measurements are both present in Chartevents and Labevents tables with duplicated values in

some instances. Labevents was chosen as the ground truth (as suggested in [56]) to laboratory variables

with Chartevents providing the vital variables. Codes associated to each measurement are identified in

D lab items and D items tables for Labevents and Chartevents respectively.

Demographic variables

Demographic variables are attributes of the population that constitutes the study and are presented in table 3.1, along with their type and abbreviation.

As mentioned in section 3.1, the age of patients older than 89 years old was changed to values greater than 300 years. Since the median age of those patients is 91.4, the age of each such patient was set to 91.

For the admission type, emergency and urgent patients were merged into a single category, while the elective category was attributed to patients with a previously planned hospital admission.

Ethnicity was branched into five categories [Asian, Black, Hispanic, White and Other] based on the most common categories present in the dataset. The last category [Other] contains all the patients not assigned to the first four categories.

Patients' weight is recorded over time, though in an inconsistent way; thus, only the first recorded value is taken into account. Height is extracted following the same method, and body mass index (BMI) is calculated as weight/height². Gender and length of stay are the remaining demographic variables added to the list.

Table 3.1: Demographic variables.

# | Demographics | Type | Abbreviation
1 | Age | Continuous | age
2 | Gender | Categorical/Binary | gender
3 | Ethnicity | Categorical | ethnicity
4 | Admission Type | Categorical/Binary | admission_type
5 | Length of Stay | Continuous | los_icu
6 | Weight | Continuous | weight_first
7 | Height | Continuous | height
8 | BMI | Continuous | bmi


Diagnoses variables

The International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes were

extracted for the diagnoses variables. ICD-9-CM is a standard set of alphanumeric codes used to describe

patients’ symptoms and diagnoses. There are 6985 different ICD9 codes in MIMIC III database and taking

into account all the codes would be computationally unfeasible.

In order to reduce the number of variables, codes were grouped and arranged in a multi-level approach following the rules of the Clinical Classifications Software (CCS) [25]. From level to level, the groups are divided into multiple smaller and more descriptive groups, for a total of four levels. The first level is constituted by 18 groups, the second level by 136 groups and the third level by 367 groups. The fourth level serves as a more descriptive level for only some groups of the third level and is constituted by 209 groups. Notice that there are a total of 15072 ICD9 codes, of which only 6985 are in MIMIC III.

Table 3.2 shows an example of how codes are rearranged for the specific case of the patient with admission number hadm_id = 1506814.

Table 3.2: Multi-level approach for patient diagnoses (hadm_id = 1506814).

Level 1 | Level 2 | Level 3 | Level 4 | MIMIC III description | ICD9 code
Endocrine, nutritional and metabolic disorders | Diabetes mellitus with complications | Diabetes with ophthalmic manifestations | - | Diabetes with ophthalmic manifestations, type I [juvenile type], not stated as uncontrolled | 25051
Diseases of nervous system and sense organs | Eye disorders | Retinal detachments, defects, vascular occlusion and retinopathy | Other retinal disorders | Background diabetic retinopathy | 36201
Diseases of circulatory system | Hypertension | Essential hypertension | - | Unspecified essential hypertension | 4019
Diseases of circulatory system | Cerebrovascular disease | Acute cerebrovascular disease | Intracranial hemorrhage | Intracerebral hemorrhage | 431
Diseases of genitourinary system | Diseases of urinary system | Urinary tract infections | Urinary tract infections; site not specified | Urinary tract infections; site not specified | 431

The first level was used to extract codes related to diseases of the human body systems: digestive system (1), circulatory system (2), respiratory system (3), nervous system and sense organs (4), musculoskeletal system and connective tissue (5), genitourinary system (6) and skin and subcutaneous tissue (integumentary system) (7). Diseases of the blood and blood-forming organs (8), mental illness (9) and infectious and parasitic diseases (10) were also extracted. Each diagnosis numbered [1-10] above defines a different variable that counts the number of ICD9 codes associated with each patient for that specific diagnosis group.

Codes associated with diabetes were identified from the second level since, in the first level, they are all assigned to endocrine, nutritional and metabolic disorders. Codes were split into diabetes mellitus type I, type II and secondary diabetes (see table 3.2). For the identified diabetic patients, it was also verified whether or not they were long-term insulin users.

Some patients needed ventilation during their stay in the ICU. The total duration in hours and the number of different times a patient was under external ventilation were calculated. Lastly, the Glasgow coma scale and the number of insulin infusions (noting that this number can be related to changes in the insulin infusion rate) were also extracted. Table 3.3 details the diagnoses variables included in this work.


Table 3.3: Covariates associated with patients' diagnoses.

Diagnoses | Type | Abbreviation
Digestive system (1) | Continuous | digestive_sys
Circulatory system (2) | Continuous | circulatory_sys
Respiratory system (3) | Continuous | respiratory_sys
Nervous system and sense organs (4) | Continuous | nervous_sys
Musculoskeletal system and connective system (5) | Continuous | musculoskeletal_sys
Genitourinary system (6) | Continuous | genitourinaty_sys
Integumentary system (7) | Continuous | skin_sys
Diseases of the blood and blood-forming organs (8) | Continuous | blodd_sys
Mental illness (9) | Continuous | mental_sys
Infectious and parasitic diseases (10) | Continuous | infetous_sys
Glasgow coma scale | Discrete | gcs
Number of times ventilated | Discrete | ventilation_num
Time under ventilation | Discrete | ventilation_time
Diabetes Type I | Binary | diabetes_typeI
Diabetes Type II | Binary | diabetes_typeII
Secondary diabetes | Binary | diabetes_sec
Long-Term Insulin User | Binary | user_insulin
Number of insulin infusions | Discrete | num_infusions

Regarding laboratory and vital sign variables, measurements of the same variable are collected over time. Data are stored under several codes with the same or similar descriptions (homonyms), meaning that the same measurement may appear under different identifiers (e.g. systolic blood pressure has 6 different codes associated with it).

Laboratory variables

For each variable, the percentage of patients with at least one measurement in the first 24 hours of the ICU stay was extracted. Due to the importance of glucose for the study, it was set as the baseline variable to include or exclude variables. In figure 3.2 it is clear that glucose is the most common variable (X axis). In an initial approach to verify the trade-off between the number of patients and the number of variables, the histogram was divided into three subsets representing the percentage {70%, 80%, 90%} of the information available for each variable (horizontal dotted lines). Then, the percentage of patients (Y axis) that could be included in each subset was counted; subjects with at least one missing variable are removed and not included in the subset.

There is a noticeable loss of patients between the 90% and 80% subsets: at 80%, the loss represents almost 40% of the total number of patients. For this reason, the laboratory variables chosen for the study were those in the 90% subset.

Given the absence of a universal range of normal values for each variable, a standard interval was deduced in a conservative manner from specialized sources [60–62]. The criterion was to keep the minimum values found in the literature for the lower limits and the maximum values for the upper limits. Table 3.4 summarizes all laboratory variables and the normal ranges associated with each one.


[Figure 3.2, right-hand table: 90% subset - 11876 patients (NP), 17 variables (NV); 80% subset - 8196 patients, 22 variables; 70% subset - 7983 patients, 26 variables.]

Figure 3.2: On the left, a histogram with the percentage of patients per laboratory variable and the chosen subsets; on the right, a table with the total number of patients (NP) and number of variables (NV) for each subset.

Table 3.4: Laboratory variables and normal ranges [60–62].

# | Variable (units) | Range | Abbreviation
1 | Anion Gap (mEq/L) | [7-20] | aniongap
2 | Bicarbonate (mEq/L) | [23-28] | bicarbonate
3 | Chloride (mEq/L) | [96-108] | chloride
4 | Creatinine (mg/dL) | [0.4-1.3] | creatinine
5 | Hemoglobin (g/dL) | [12-18] | hemoglobin
6 | Hematocrit (%) | [37-52] | hematocrit
7 | MCH - Mean corpuscular hemoglobin (pg) | [28-32] | mch
8 | MCHC - Mean corpuscular hemoglobin concentration (%) | [33-36] | mchc
9 | MCV - Mean corpuscular volume (fL) | [80-98] | mcv
10 | Platelet Count (K/uL) | [150-450] | platelet
11 | Potassium (mEq/L) | [3.5-5.1] | potassium
12 | RBC - Red blood cells (m/uL) | [4.2-5.6] | rbc
13 | RDW - Red cell distribution width (%) | [9-14.5] | rdw
14 | Sodium (mEq/L) | [134-145] | sodium
15 | Urea Nitrogen (mg/dL) | [6-25] | bun
16 | WBC - White blood cells (K/uL) | [5-10] | wbc
17 | Glucose (mg/dL) | [70-110] | glucose

Vital variables

Vital variables collects patients’ vital signs during their ICU stay. Heart rate, respiratory rate, both systolic and

diastolic blood pressure, mean arterial pressure, peripheral oxygen saturation, temperature and urine output

are the variables were included. In table 3.5 is summarized the vital variables and their respective normal

ranges.


Table 3.5: Vital variables and normal ranges [60–62].

# | Variable (units) | Range | Abbreviation
1 | Heart Rate (bpm) | [60-100] | heartrate
2 | Respiratory Rate (breaths per min) | [12-16] | resprate
3 | Systolic Blood Pressure (mmHg) | [90-120] | sysbp
4 | Diastolic Blood Pressure (mmHg) | [60-80] | diasp
5 | Mean Arterial Pressure (mmHg) | [70-110] | meanbp
6 | Peripheral Oxygen Saturation (%) | [95-100] | spO2
7 | Temperature (°C) | [36-37] | tempc
8 | Urine output (mL/h) | [30] | urineoutput

3.3 Data Preparation

Real-world datasets are generally incomplete, noisy and inconsistent. Medical datasets, especially ICU-related ones, represent a bigger challenge than conventional datasets because they exhibit unique features: due to the patients' critical condition, their values might be abnormal [63]. Nevertheless, measurements outside the normal range can also come from systematic errors caused by equipment malfunction or from random errors due to human mistakes when recording the data. Measurements in a medical dataset must therefore be distinguished between abnormal but plausible values and outliers that are physiologically impossible or very unlikely, and it is important to filter the information to differentiate between them.

On the other hand, missing data can occur when no data are stored or recorded for a variable during an observation and/or result recording. Handling missing data is essential in order not to produce biased conclusions that may lead to invalid results.

3.3.1 Removal of Outliers

Common practices to remove outliers or identify novelties in medical datasets are based on statistical methods such as the interquartile range (IQR) method and Tukey's method, which uses the IQR approach in a conservative way, the z-score for a standard deviation approach, or the local outlier factor (LOF) based on local density deviation. Despite the efficiency of these methods, clinical knowledge in conjunction with a careful inspection of each variable is more time-consuming, but a better outcome is expected in return. This approach was applied to the selected variables.
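For reference, a minimal sketch of the statistical alternative mentioned above (Tukey's fences over the IQR) is shown below. The column name aniongap, the example values and the 1.5 multiplier are illustrative assumptions, not the inclusion boundaries actually used in this work, which were set by inspection.

import pandas as pd

def tukey_outlier_mask(values, k=1.5):
    # Boolean mask marking values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Example usage on a hypothetical laboratory column.
df = pd.DataFrame({"aniongap": [12, 14, 13, 15, 80, 11, 13]})
outliers = tukey_outlier_mask(df["aniongap"])
df_clean = df.loc[~outliers]  # keep only values inside the fences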

Figure 3.3 shows an example of outlier detection for the variable aniongap. The measurements associated with each patient are plotted to give a perception of the underlying distribution of the variable, and a box plot is used to graphically represent the data. Values inside the normal range are delimited by a red dashed line in order to check for patterns in the data. Through careful inspection, an inclusion boundary is set (green dashed lines) and values outside this boundary, considered outliers, are removed.

Outlier detection for the remaining variables is presented in Appendix A.


Figure 3.3: Outliers detection for anion gap variable

3.3.2 Missing data

Relatively few absent observations on distinct variables can shrink the dataset sample size on a large scale. Missing data are either missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR) [64, 65].

If missing data are randomly distributed across all observations, they are considered MCAR. This usually happens due to equipment malfunction, samples lost in transit or technically unsatisfactory samples [64].

MAR is a more realistic assumption than MCAR because the observed data are no longer a random sample; in addition to following a pattern, the missing data are correlated with a set of observed variables. MAR may appear in laboratory variables for patients who are more severely injured, since clinical staff are busier providing time-sensitive care and may skip result annotation [66].

MNAR cases are commonly complicated and the causes of missingness are difficult to determine. They are normally associated with study dropout and patients' illness or refusal to contribute. Missing data in demographic variables can be placed in this group.

The handling of missing data is attenuated by a well-planned and careful data extraction for the study. If necessary, conventional or statistical methods can be coupled to treat missing data.

Listwise or complete case deletion is the most used and simplest method: if a specific case has missing data for any variable, or a variable has several missing cases, an exclusion criterion is applied. This method was applied in section 3.2.2 in an initial approach to select laboratory and vital variables, and will be applied again in section 3.5.1 to achieve the final input dataset for the study.

Imputation methods replace missing values with a reasonable guess. Mean substitution, regression imputation and last observation carried forward (LOCF) are practical procedures, but they are not useful for laboratory and vital variables, since a time-series discretization will be applied (section 3.4.1).

An intuitive zero-imputation is implemented for missing data in the diagnoses variables, even though the lack of records does not necessarily mean that a patient was not under a given condition. For example, ventilation-related data are missing for about 900 patients, and the intentional zero-imputation implies that those patients were not under any type of ventilation.

The first 24 h after admission time-window will be used to obtain the final variable set. For this analysis, the graph presented in figure 3.4 shows the 7 features (X axis) with the highest number of missing samples (left Y axis), ordered from left to right. The green line shows the amount of missing data for each variable. The blue line represents the total percentage of patients that would remain if the variable on the X axis were removed. If BMI, height and temperature were excluded, there would be a considerable gain in the number of patients included, whereas moving further to the right yields a less significant gain. For that reason, these three variables were excluded from this point on.

Figure 3.4: Percentage of patients lost and its relationship with missing data in some features.

3.3.3 Normalization

Data from different variables come in different orders of magnitude. All input data was normalized to mitigate

the potential bias of one variable with large numeric values dominating other variables having smaller values.

Min-max normalization (equation 3.1) was applied to normalize the values (x) of each feature (i) in an interval

[0, 1]. Normalization is not necessary for categorical and binary variables.

x_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} \quad (3.1)
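As an illustration, equation 3.1 can be applied column-wise with pandas; the feature names and values below are placeholders, not data from the cohort.

import pandas as pd

def min_max_normalize(df, columns):
    # Scale the selected continuous columns to the [0, 1] interval (equation 3.1).
    out = df.copy()
    for col in columns:
        col_min, col_max = out[col].min(), out[col].max()
        out[col] = (out[col] - col_min) / (col_max - col_min)
    return out

# Example with two hypothetical continuous features.
data = pd.DataFrame({"age": [45, 63, 91], "glucose_mean": [110.0, 180.0, 95.0]})
normalized = min_max_normalize(data, ["age", "glucose_mean"])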

3.4 Feature Construction

Feature construction is the process of augmenting the space of variables by inferring or creating additional

variables [67]. Feature construction methods may be applied to improve prediction performance and allow

easy addition of domain knowledge. Thus, it is important to generate a set of variables that are generalizable to different classifiers [68].

3.4.1 Discretization of Time-series Variables

Laboratory and vital variables are recorded over time at a pace of approximately one measurement per hour. Quantitative features were calculated to represent those variables, so each variable is described in terms of six statistical features: maximum, minimum, mean, median, standard deviation and variance. Moreover, one additional feature counting the number of abnormal laboratory measurements, i.e., values recorded outside the normal ranges (see table 3.4), was included for each variable, except for glucose, for which more features detailing the abnormal measurements were constructed due to its relevance for the study.
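A minimal sketch of this discretization with pandas is shown below; the long-format column names (icustay_id, variable, value) and the example rows are illustrative assumptions about how the time-series measurements might be organized, not the exact extraction used in this work.

import pandas as pd

# Hypothetical long-format measurements: one row per recorded value.
measurements = pd.DataFrame({
    "icustay_id": [1, 1, 1, 2, 2],
    "variable":   ["glucose", "glucose", "glucose", "glucose", "glucose"],
    "value":      [95.0, 210.0, 140.0, 60.0, 88.0],
})

# Six statistical features per patient and variable (section 3.4.1).
stats = (measurements
         .groupby(["icustay_id", "variable"])["value"]
         .agg(["max", "min", "mean", "median", "std", "var"]))

# Count of measurements outside the normal range, e.g. glucose [70-110] mg/dL.
low, high = 70, 110
measurements["abnormal"] = ~measurements["value"].between(low, high)
flags = measurements.groupby(["icustay_id", "variable"])["abnormal"].sum()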

3.4.2 Glycemic Covariates

In addition to the statistical features, further covariates were derived from the glucose information. These are represented in four features: the count of hypoglycemic events for each hypoglycemia level and the count of hyperglycemic events, according to the values presented in table 2.2.

3.4.3 Categorical Variables

Categorical variables are split into multiple variables using one-hot encoding. One-hot encoding is used instead of label encoding to avoid models assuming that the variables have some kind of hierarchy when they clearly do not (e.g. nominal features).

According to the n different labels present in each categorical variable, n binary variables are constructed. This is done for ethnicity (a nominal variable), resulting in 5 binary variables. In the case n = 2, as for gender and admission type, only one of the constructed binary variables is kept, avoiding redundancy.
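A small sketch of this encoding with pandas is given below; the column values are illustrative and do not reproduce the actual cohort categories one-to-one.

import pandas as pd

df = pd.DataFrame({
    "ethnicity": ["White", "Black", "Asian"],
    "gender":    ["M", "F", "M"],
})

# Nominal variable with n > 2 labels: keep all n indicator columns.
ethnicity_ohe = pd.get_dummies(df["ethnicity"], prefix="ethnicity")

# Binary variable (n = 2): drop one indicator to avoid redundancy.
gender_ohe = pd.get_dummies(df["gender"], prefix="gender", drop_first=True)

encoded = pd.concat([ethnicity_ohe, gender_ohe], axis=1)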

3.5 Selected Time-windows

Sensitivity analyses in this work were performed independently in four different time-windows: mortality prediction using the first 12 hours after ICU admission, the first 24 hours after ICU admission, the 12 hours before ICU discharge and the 24 hours before ICU discharge. The cohort size for each time-window is highly dependent on the num_infusions variable, because only patients already under insulin therapy during the specified time-window are considered. Hence, those who received insulin outside the specified time-window are neglected.

The study will mainly focus on the first 24 hours after admission, in order to compare the results with common severity scores, which, as a rule, are calculated in the same time-window (section 2.3.2). The variable length of stay in the ICU (los_icu) was excluded, since its value would be the same for all patients within the same time-window.

Some data associated with a specific patient may appear and remain the same in two or more time-windows, because different time-windows may overlap. For example, for a patient who stayed in the ICU for only 24 hours, the value of any variable in the first 12 hours after admission and in the 12 hours before discharge will be exactly the same.

Figure 3.5 shows how the time-stamps can switch relative to each other depending on the duration of the stay after ICU admission (green solid line). Time-stamps after admission are static (blue solid lines), but that is not true for discharge times (black lines). Time-stamps before discharge (black continuous lines) for stays longer than 48 hours (red solid line) appear after the 12 and 24 hours-after-admission time-stamps, while for stays shorter than 48 h they may appear before the after-admission times (dashed black lines).

In any case, this switching between time-stamps has little influence, since each trial is carried out separately.

Figure 3.5: Prediction time-windows switching for the study.

3.5.1 Description of Processed Datasets

The constitution of each time-window to be analyzed is presented in table 3.6. As expected, the number of available patients rises as the ICU discharge time-stamp gets closer, and the mortality ratio also increases. Furthermore, in all time-windows there is a marked imbalance between mortality and survival.

Table 3.6: Number of patients included in each time-window.

Time-window | Patients under insulin therapy | Patients after missing data removal | Died/Survived (Mortality ratio)
12h after admission | 7626 | 7100 | 541/6559 (0.076)
24h after admission | 9643 | 9098 | 826/8272 (0.091)
24h before discharge | 11932 | 9593 | 1270/8323 (0.132)
12h before discharge | 11956 | 11435 | 1353/10082 (0.118)
All ICU stay | 12338 | 11788 | 1377/10041 (0.114)

To conclude this section, table 3.7 summarizes the 188 input variables used in this thesis.

3.6 Data Sampling

Class imbalance is a recurrent problem found in real-world datasets, where the instances of the dataset predominantly belong to one class. It is an extremely common problem in medical datasets, especially in mortality prediction [24, 57, 58]. The final datasets detailed in section 3.5.1 have this characteristic as well.

There are two approaches to deal with class imbalance: cost functions and sampling techniques. The sampling approach can be divided into three categories: oversampling, undersampling and hybrid, the latter being a mix of over- and undersampling. For this study, the first two categories of sampling techniques will be used to adjust the dataset class distribution.

3.6.1 Over Sampling

Oversampling focuses on the minority class to overcome the imbalance problem. Some of the most used techniques are described below.

• Random Over Sampling. Random over sampling simply picks samples at random, with replacement, from the minority class.

• Synthetic Minority Over-sampling Technique (SMOTE). It was proposed in [69], where new minority instances are synthesized. In general, SMOTE takes one real minority sample and its k closest minority class neighbours. At each iteration one of those k neighbours is chosen and a new minority sample is synthesized between the minority sample and that neighbour. This process is repeated until a balance between the minority and majority classes is achieved (a short sketch follows this list).
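A minimal sketch of applying SMOTE with the imbalanced-learn library [76] is shown below. The synthetic dataset and the sampling settings are illustrative assumptions; in practice, resampling should be applied only to the training folds.

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset (roughly 10% positive class).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# Oversample the minority class until both classes are balanced.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))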

3.6.2 Undersampling

Undersampling focuses on the majority class to overcome the imbalance problem. Some of the most used

methods are:

• Random Under Sampling. Random under sampling randomly picks and eliminates samples from the majority class, adjusting the dataset class distribution.

• Tomek Links. The method [70] starts by calculating the distance between samples in a dataset. If two samples from different classes are the nearest neighbours of each other, then the pair is considered a Tomek link. Samples belonging to Tomek links are usually located at the boundary between classes. From each Tomek link, the sample from the majority class is removed. This process is repeated until all nearest-neighbour pairs belong to the same class.

• Edited Nearest Neighbours. Edited nearest neighbours (ENN) [71] removes samples from the majority class according to their k nearest neighbours. A majority class sample is removed if one of its k nearest neighbours does not belong to the same class.

• Neighborhood Cleaning Rule. The neighborhood cleaning rule (NCR) modifies the ENN method by adding data cleaning [72], making it a more conservative method. First, it identifies noisy data using the ENN rule. Then, the minority class is analyzed and the k nearest samples of each minority instance are found; neighbours that belong to the majority class are removed (a short sketch follows this list).
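As a hedged illustration, the undersampling methods above are also available in imbalanced-learn [76]; the dataset below is synthetic and the parameter values are assumptions rather than the settings used in this work.

from collections import Counter

from imblearn.under_sampling import NeighbourhoodCleaningRule, TomekLinks
from sklearn.datasets import make_classification

# Hypothetical imbalanced dataset, as in the oversampling sketch.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Remove majority samples that form Tomek links ...
X_tl, y_tl = TomekLinks().fit_resample(X, y)
# ... or clean the majority class with the neighborhood cleaning rule.
X_ncr, y_ncr = NeighbourhoodCleaningRule(n_neighbors=3).fit_resample(X, y)
print(Counter(y), Counter(y_tl), Counter(y_ncr))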


Table 3.7: Description of the input variables.

Feature | Units or Categories | Type
Age | Years | Continuous
Gender | [Male, Female] | Binary
Ethnicity | [Asian, Black, Hispanic/Latino, White, Other] | Categorical
Admission Type | [Elective, Emergency] | Binary
Length of Stay | Fractional days | Continuous
Ventilation-Time | Hours | Continuous
Ventilation-Number | Frequency/ICU stay | Discrete
Infusions-Number | Frequency/ICU stay | Discrete
Diabetes | [Type I, Type II, Secondary] | Categorical
Long-Term Insulin User | [True, False] | Binary

Feature parameter | Feature (units) | Type
Minimum, Maximum, Mean, Median, Standard deviation, Variance, Flag, HyperFlag*, HypoFlag{1,2,3}* | Laboratory analyses: anion gap (mEq/L), bicarbonate (mEq/L), chloride (mEq/L), creatinine (mg/dL), hemoglobin (g/dL), hematocrit (%), mean corpuscular hemoglobin (pg), mean corpuscular hemoglobin concentration (%), mean corpuscular volume (fL), platelets (K/uL), potassium (mEq/L), red blood cells (m/uL), red cell distribution width (%), sodium (mEq/L), urea nitrogen (mg/dL), white blood cells (K/uL), glucose (mg/dL)*. Vital signs: heart rate (bpm), respiratory rate (breaths per min), systolic blood pressure (mmHg), diastolic blood pressure (mmHg), mean arterial pressure (mmHg), peripheral oxygen saturation (%) | Continuous
Minimum, Last | Glasgow Coma Scale | Continuous
First | Weight (kg) | Continuous
Sum | Urine Output (mL) | Continuous
Diagnoses | Problems in: digestive system (1), circulatory system (2), respiratory system (3), nervous system and sense organs (4), musculoskeletal system and connective system (5), genitourinary system (6), integumentary system (7); diseases of the blood and blood-forming organs (8); mental illness (9); infectious and parasitic diseases (10) | Continuous

* Exclusive to the categorization of the glucose feature.
* HypoFlag: hypoglycemia at any of its severity levels (Table 2.2).
* HyperFlag: hyperglycemia event.


Chapter 4

Modeling

This chapter starts with a schematic of the modeling process, identifying all techniques used along with the respective libraries, in section 4.1. Then, an assessment of the most appropriate modeling techniques is made in section 4.2. The following sections give a theoretical explanation of the machine learning techniques that will be applied for modeling. Subsequently, feature selection methods are presented in section 4.5. In section 4.6, model interpretation with SHAP values is detailed. The performance metrics used in this work are explained in section 4.7.

4.1 Knowledge Discovery Process

Model construction was developed in Python 3.6 with several libraries widely used in data science [73–81]. Each step and the associated library are presented in figure 4.1. The processor used to perform all the tests was an Intel® Core 8th generation i7-8750H Hexa-Core, 2.20 GHz (turbo up to 4.10 GHz), 9 MB cache.

Figure 4.1 is divided into two schematics: one for the model construction process and another for its external validation.

For the construction process, after data preparation (chapter 3) and obtaining the final input dataset, a 5x10 cross validation is performed with diverse machine learning techniques. Data sampling and feature selection are optional steps implemented to counteract the imbalance present in the dataset and to reduce the final feature subset, respectively. Models are interpreted, and feature rankings (weights or SHAP values, depending on the machine learning technique used) aid the feature selection process when recursive feature elimination (RFE) is used. From the models' outputs, performance metrics are calculated and serve as support for feature selection when sequential forward selection (SFS) is used; in the case of recursive feature selection (RFS), both rankings and performances serve as support. Threshold-based metrics use as threshold the value that minimizes the difference between the true and false positive rates, assigning values above it to class 1 (dead) and values below it to class 0 (alive). Lastly, individualized clinical dashboards are constructed using the models' outputs and interpretation. For the external validation, a final model is constructed using all patients extracted from the MIMIC database and the features resulting from the feature selection process. The model is validated using the patients extracted from the eICU database. As in the construction process, the model's outputs are interpreted and individualized clinical dashboards are constructed.

[Figure 4.1 layout. Construction process: data preparation [73–75] produces the input dataset (MIMIC database), which feeds a repeated 5x10 k-fold cross validation [75]; optional data sampling (oversampling or undersampling) [76] and optional feature selection (recursive feature elimination (RFE) [75], sequential forward selection (SFS) [81] or recursive feature selection (RFS)) are applied on the training folds; models are built with the machine learning techniques [75, 77–79], a threshold is chosen on the model output, performance metrics (AUC, AUPRC, sensitivity, specificity) [75] are computed, models are interpreted through SHAP values [82] or weights [75], and individualized clinical dashboards are constructed [80]. External validation: a final model is trained on the MIMIC dataset and validated on the eICU dataset, again with performance metrics, model interpretation and individualized clinical dashboards.]

Figure 4.1: Model construction layout. *Optional steps.
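To make the repeated cross validation concrete, a minimal sketch with scikit-learn [75] is shown below. The synthetic dataset, the gradient boosting estimator and the AUC scoring are illustrative assumptions standing in for the actual pipeline of figure 4.1.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical stand-in for the final input dataset.
X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.9, 0.1], random_state=0)

# 5 repetitions of 10-fold cross validation (the "5x10" scheme), scored by AUC.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X, y,
                         scoring="roc_auc", cv=cv)
print("AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))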


4.2 Modeling Techniques Assessment

It is important to assess the advantages and limitations of each modeling technique that could be applied to a particular problem. There is a variety of machine learning techniques that can be used separately or together to predict an outcome.

The attributes taken into account when selecting the right algorithm were predictive accuracy, computational speed, interpretability, simplicity, robustness and scalability [83]. Each one can be classified on the scale [Low, Medium, High], as represented in figure 4.2.

As the objective of this work is to predict mortality in critically ill patients under insulin therapy for glycemic control, predictive accuracy and interpretability play an important role, since the main purpose is to create a model capable of predicting patient mortality and, at the same time, exhibiting a complete picture of each patient's health status. On the other hand, scalability owes its importance to the fact that the study is conducted in time-windows of different sizes (section 3.5), bearing in mind that, in the long term, more patients can be included in the study.

Figure 4.2: Importance of the trade-offs to take into account when choosing a machine learning algorithm (adapted from [83]).

In addition, a set of machine learning algorithms was tested to corroborate the characteristics of the problem and to select the most suitable algorithms for the study.

Table 4.1 describes the machine learning algorithms assessed for this work in terms of interpretability [84] and simplicity [85], considering a dataset of high dimensionality. The remaining attributes (predictive accuracy, speed, robustness and scalability) were deduced after testing the algorithms in the different prediction time-windows of this study (section 5.2). These modeling techniques are described in the sections below.

Table 4.1: Machine learning algorithms assessed based on [84, 85].

Method Abbreviation Interpretability Simplicity

K-Nearest Neighbours (n=3) KNN Low High

Support Vector Machines SVM Low Low

Decision Trees DT High High

Random Forest RF Medium Medium

Logistic Regression LR High High

AdaBoost ADA Medium Medium

Gradient Boosting GB Medium Medium

Gaussian Naive Bayes GNB High High

Linear Discriminant Analysis LDA High High

Quadratic Discriminant Analysis QDA High High

4.2.1 Logistic Regression

Logistic Regression (LR) is a widely used machine learning technique for binary classification problems. Founded on a statistical background in 1958 by Cox [86], LR owes its name to the logistic function that is at the technique's core. The logistic function is an S-shaped curve that takes any real-valued number and maps it into the interval between 0 and 1, never reaching the limits.

LR describes the relation between the output value and the input values through a linear combination of weights (coefficients) applied to the input variables:

p(x) = \frac{e^{\beta_0 + \sum_{i=1}^{N} \beta_i x_i}}{1 + e^{\beta_0 + \sum_{i=1}^{N} \beta_i x_i}} \quad (4.1)

where \beta_0 represents the bias or intercept term and \beta_i the regression coefficient (weight) for each input variable.
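As a brief illustration (not the exact configuration used in this work), a logistic regression can be fitted and its coefficients inspected with scikit-learn [75]; the dataset below is synthetic.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-outcome data standing in for the ICU cohort.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("intercept (beta_0):", model.intercept_)
print("weights (beta_i):", model.coef_)
print("p(x) for the first sample:", model.predict_proba(X[:1])[0, 1])  # equation 4.1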

4.2.2 Linear and Quadratic Discriminant Analysis

Linear Discriminant Analysis (LDA) is a simple and mathematically robust technique for classification. LDA makes the following assumptions about the dataset in order to estimate the mean and variance of each class: the data of each class are modeled as a multivariate Gaussian distribution with the density given in equation 4.2, and the Gaussians of all classes are assumed to share the same covariance matrix, i.e. \Sigma_k = \Sigma for all k.

Quadratic Discriminant Analysis (QDA) is equivalent to LDA, with the difference that a covariance matrix is estimated for each class.


P(X \mid y = k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(X - \mu_k)^t \Sigma_k^{-1} (X - \mu_k)\right) \quad (4.2)

In equation 4.2, k represents each class, \mu_k is the class mean and d is the number of features. Both methods use Bayes' theorem to estimate the probability of the data belonging to each class (equation 4.3):

P(y = k \mid X) = \frac{P(X \mid y = k)\,P(y = k)}{P(X)} \quad (4.3)

4.2.3 Support Vector Machines

Support vector machines (SVM) is a classifier defined by the construction of a hyperplane, or a set of hyperplanes, in a multidimensional space separating the different classes. The chosen hyperplane is the one with the maximum distance between the data points of both classes, also called the maximum margin.

The data points closest to the hyperplane are the support vectors, which influence the position and orientation of the hyperplane and maximize the classifier's margin so that test points can be classified more accurately.

SVM can also create non-linear decision regions to separate the classes more efficiently using kernel functions. Among the available kernel functions, the radial basis function (equation 4.4) is the one used in this work.

k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \quad (4.4)

where x_i and x_j are feature vectors and \|x_i - x_j\|^2 is the squared Euclidean distance between them. \sigma defines how much influence a single training example has.

4.2.4 K-Nearest Neighbours

K-nearest neighbours (KNN) is one of the simplest classifiers and works as a majority vote over the nearest neighbours of each data point. The principle is to find a predefined number of training samples (k) closest in distance to the test sample and predict its label from these through majority voting.

The metric used to compute proximity among samples is the Minkowski distance, as presented in equation 4.5.

\left(\sum_{i=1}^{k} |x_i - y_i|^q\right)^{1/q} \quad (4.5)

Here, x represents a training sample and y a test sample. q can assume the values 1 or 2, representing the Manhattan and Euclidean distances, respectively. In this work, q = 2 and k = 3.

4.2.5 Gaussian Naive Bayes

Gaussian Naive Bayes (GNB) is a probabilistic classifier based on Bayes' theorem with the naive assumption of conditional independence between every pair of features. The likelihood of the features is assumed to be Gaussian and the probability is calculated as described in equation 4.6.


P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \quad (4.6)

where the parameters \sigma_y^2 and \mu_y are estimated using maximum likelihood.

4.3 Ensemble learning

Ensemble learning is considered the machine learning analogue of the wisdom of the crowd. Considering learning algorithms as individuals, the collective knowledge of different and independent individuals typically exceeds the knowledge of any single individual.

Diverse approaches to the concept have been made over time, commencing in 1785 with the Condorcet jury theorem [87], which proved that a jury of partially informed voters is more likely to take the correct decision under the plurality voting rule than any single voter alone. In the early 1900s, Galton [88] carried out an experiment at a fair, encouraging 787 uneducated farmers to guess the weight of an ox, and obtained an error of less than 1% between the median of the guesses and the true weight.

The four major concepts required to form a wise crowd capable of improving on individual knowledge [89]

are:

• Diversity of opinion - People in crowd should have a range of experiences, education and opinions.

• Independence - Each person’s opinion is not affected or influenced by others.

• Decentralization - People have specializations and can make conclusions based on local information.

• Aggregation - Mechanism for turning all predictions into a collective decision.

Applying the concept to machine learning, ensemble learning can be described as a combination of several

base estimators (”weak learners”) in order to produce one optimal predictive model (”strong learner”).

Ensemble methods can be classified into two main groups: parallel/independent methods and sequen-

tial/dependent methods. In parallel methods, multiple estimators are built independently and predictions

combined using model averaging techniques (e.g. bagging [90], random forests [91]). In sequential methods,

estimators are built sequentially and one tries to reduce the errors of the combined estimator (e.g. boosting

[92], gradient boosting [93]).

4.3.1 Parallel Methods

Bagging

Bagging stands for bootstrap aggregating. It was proposed by Breiman [90] in 1996. It is a method for generating multiple versions of a base estimator using bootstrap replicates of the training set obtained by sampling with replacement. Each replicate works as a new training set containing the same number of instances as the original dataset, to ensure a sufficient amount of instances per estimator. The aggregation performs an average of the outputs (for regression) or majority voting (for classification) to obtain an optimized estimator. Although it is usually used with decision trees, the method can be used with any type of estimator.


Random Forest

Random forests were first introduced in 1995 in [94] but came out to be efficient in [91] as an extension

over bagging. Instead of just replicate samples with replacement also just a random subsample of features

were used in the training process. This avoids excess of similarities among estimators and high correlations in

model’s predictions. For the work, a total of 100 decision trees as estimators were used.
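A hedged sketch of such a forest with scikit-learn [75] is shown below; the 100 trees match the number stated above, but the synthetic data and remaining settings are only illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# 100 decision trees, each trained on a bootstrap sample and a random
# subsample of features at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict_proba(X[:1]))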

4.3.2 Sequential Methods

The idea of boosting is that a weak learner can be converted into a strong learner by sequentially training weak learners, each trying to improve on its predecessor's predictions.

Boosting has its beginnings in the study of Valiant [95], called the Probably Approximately Correct (PAC) model, which became the basis for Kearns' work [96]. The conception that a weak learner which performs slightly better than random guessing can be 'boosted' into a strong learner is credited to Kearns, Valiant and Schapire [95–97].

AdaBoost

AdaBoost, short for adaptive boosting, is the most well-known boosting algorithm and the first to achieve real success, being the root of several notable variations [98].

The key idea behind AdaBoost is to re-weight the training data, increasing the weights of misclassified instances while decreasing the weights of correctly classified instances. Initially all instances have the same weight; new weak learners are then added iteratively, each focusing on the instances that the previous learners misclassified. Each weak learner receives an individual weight according to its overall predictive performance, and predictions are made by calculating the weighted average of the weak classifiers.

Statistical boosting and Gradient Boosting Machines

Many machine learning approaches, including AdaBoost, can be considered black boxes, since they might yield accurate predictions but the structure of the relationships between input data and output is not clearly interpretable. In contrast, statistical models aim to describe and explain relationships in a structured way, providing a quantification of variable importance as well as the effect of these variables on the outcome.

AdaBoost and related algorithms were recast in a statistical framework by Breiman [99], showing that boosting can be understood as a functional gradient descent algorithm. Afterwards, Friedman [93], building on Breiman's initial approach, elaborated a statistical boosting point of view called gradient boosting machines (GBM).

GBM builds a stage-wise additive model in which each weak learner is added depending on the previous weak learners, performing gradient descent in a functional space [93]. Algorithm 1 presents an overview of the technique; for a more detailed mathematical formulation, consult Appendix B.

GBM using decision trees as weak learners, commonly named gradient boosting decision trees (GBDT), became widely used due to the simplicity and advantages of these learners, namely requiring less effort for data preparation, dealing with nonlinear relationships between variables and handling outliers as well as missing values.


Algorithm 1 Gradient boosting algorithm

Inputs:
• Input data (X_i, y_i), i = 1, ..., N
• Number of iterations M
• Choice of loss function ψ(y, F)
• Weak learner h(X, θ)

Algorithm:
1: Initialize f_0
2: for t = 1 to M do
3:   Compute the negative gradient g_t(x)
4:   Fit a new weak learner h(X, θ_t)
5:   Find the best gradient descent step-size ρ_t:
6:     ρ_t = arg min_ρ \sum_{i=1}^{N} ψ[y_i, f_{t-1}(X_i) + ρ h(X_i, θ_t)]
7:   Update the function estimate:
8:     f_t ← f_{t-1} + ρ_t h(X, θ_t)
9: end for

In recent years, new GBDT frameworks have gained recognition. The framework proposed by Chen [77], called XGBoost (XGB), has demonstrated superior performance compared to random forests [91]. More recently, Microsoft presented a novel approach called LightGBM (LGB) [78], with faster training speed, higher efficiency and better accuracy. Then CatBoost (CB) [79] was presented by Yandex, achieving significant improvements on benchmark datasets compared to XGB and LGB.

Figure 4.3 represents a timeline of the ensemble learning techniques described so far.

Figure 4.3: Timeline of ensemble learning.


4.4 Gradient Boosting Frameworks

Gradient boosting models are based on decision tree models. These split a dataset into small subsets while incrementally developing a structured decision-making process. Making an analogy with real-world trees, decision trees grow upside down: the dataset is split into branches through decision nodes, partitioning the data into subsets on the feature with the largest information gain.

The top decision node is called the root node, and nodes that do not lead to further decisions are called leaves. Branches make the connection between the root and the leaves with the support of internal decision nodes.

Since this thesis will focus on these techniques, it is worth explaining in more detail how the GBM models used in this work (i.e. CB [79], XGB [77] and LGB [78]) are tuned.

4.4.1 Tree Structure

There are two different strategies for growing the decision trees: level-wise and leaf-wise.

In the level-wise strategy, each node splits the data prioritizing the nodes closer to the root, maintaining a balanced tree, whereas in the leaf-wise strategy the tree grows by splitting the data at the nodes with the highest loss change, being more prone to overfitting.

XGB and LGB use the leaf-wise strategy, while CB uses the level-wise strategy with the particularity of using oblivious trees, characterized by the constraint of allowing only one feature to be selected at a given level.

4.4.2 Best split method

Finding the best split for each node is a key challenge in training a GBDT. Decision tree models split each node at the feature with the largest information gain, measured by the variance after splitting. In large datasets it is computationally expensive to go through every data point of each feature to find the best split, so approximation methods are required to decrease training time.

The histogram-based method splits each feature's data points into discrete bins and uses these bins to find the best split value of the histogram.

The pre-sorted splitting method sorts the data points by feature value in order to calculate gradient statistics and propose candidate split points. It then calculates the information gain for each candidate split point along each feature to find the best split and, among all features, takes the best split solution for the node.

The Gradient-based One-Side Sampling (GOSS) method ranks the training data points in descending order according to the absolute values of their gradients. It preserves the top a × 100% of data points with larger gradients and performs random sampling within the remaining (1 − a) × 100% of data points with smaller gradients. The sampled data with small gradients are amplified by a constant (1 − a)/b when calculating the information gain. Thus, the split point is calculated over a smaller subset, reducing the computational cost.

GOSS is exclusively used by LGB, while XGB and CB use the pre-sorted algorithm. Both XGB and LGB also have the option to use the histogram-based method.


4.4.3 Loss Function

In agreement with algorithm 1, determining the loss function (also known as the objective parameter) is required to fit a new weak learner.

Logarithmic loss, typically referred to as logloss or cross-entropy loss, is a classification metric based on probabilities: a probability of belonging to a class is assigned to each sample, rather than simply yielding the most likely class for the sample.

Logloss per sample is the negative log-likelihood of the classifier (equation 4.7)

Llog(y, P ) = − log(y|p) = −(y log(p) + (1− y) log(1− p)) (4.7)

Logloss is used in XGB, LGB, and CB for binary classification.
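As a small numerical illustration of equation 4.7 (the sample values are arbitrary), the per-sample term can be computed by hand and compared with scikit-learn's averaged version:

import numpy as np
from sklearn.metrics import log_loss

# Equation 4.7 evaluated by hand for one sample ...
y_true, p = 1, 0.8
manual = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# ... and the averaged logloss over a small hypothetical batch.
y = [1, 0, 1, 0]
probs = [0.8, 0.1, 0.6, 0.4]
print(manual, log_loss(y, probs))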

4.4.4 Hyperparameter tuning

Hyperparameters are the variables which determine the structure of an algorithm and establish how it is trained. Their importance lies in controlling the behavior and performance of the training algorithm. Hyperparameters differ from model parameters, since they cannot be learned directly during training (e.g. in algorithm 1, a model parameter is optimized by evaluating the gradient of a loss function during training).

The process of tuning hyperparameters, also called hyperparameter optimization, is based on two questions, according to [100]:

• Which of the algorithm’s hyperparameters matter most for empirical performance?

• Which values of these hyperparameters are likely to yield high performance?

Following the contributions of [101] and the developers' suggestions [77–79], the most important hyperparameters and the respective typical ranges of values used are described in table 4.2.

Table 4.2: Hyperparameters to tune.

Function | Parameter | XGBoost | LightGBM | CatBoost | Range
Control overfitting | Learning rate | learning_rate | learning_rate | learning_rate | [0.01; 0.1]
Control overfitting | Maximum tree depth | max_depth | max_depth | depth | [1; 10]
Control overfitting | Tree's number of leafs | min_child_weight | num_leaves | - | [1; 50]
Control overfitting | Iterations/No. of trees | n_estimators | n_estimators | iterations | [50; 500]
Control speed | Feature subsample | colsample_bytree | feature_fraction | bootstrap_type | [0.1; 0.9]
Control speed | Bagging | subsample | bagging_fraction | rsm | [0.1; 0.9]
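As an illustration, the XGBoost hyperparameters of table 4.2 could be explored with a randomized search over the stated ranges; the synthetic data, the number of search iterations and the scoring choice below are assumptions, not the tuning procedure reported later in this work.

from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=30,
                           weights=[0.9, 0.1], random_state=0)

# Search space following the ranges in table 4.2.
param_distributions = {
    "learning_rate": uniform(0.01, 0.09),   # [0.01; 0.1]
    "max_depth": randint(1, 11),            # [1; 10]
    "min_child_weight": randint(1, 51),     # [1; 50]
    "n_estimators": randint(50, 501),       # [50; 500]
    "colsample_bytree": uniform(0.1, 0.8),  # [0.1; 0.9]
    "subsample": uniform(0.1, 0.8),         # [0.1; 0.9]
}

search = RandomizedSearchCV(XGBClassifier(), param_distributions,
                            n_iter=20, scoring="roc_auc", cv=5,
                            random_state=0)
search.fit(X, y)
print(search.best_params_)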

4.5 Feature selection

Feature selection targets to remove irrelevant, redundant or noisy features to obtain a subset of relevant

features fulfilling higher accuracy, lower computational cost and better model interpretability. Feature selection

methods can be categorized in four groups: filter, wrapper, hybrid and embedded.


Filter methods use statistical data analyses such as chi-square or mutual information to assign scores to features, removing the ones with a lower score according to a specified threshold. Despite being fast and independent of a classifier, they are potentially naive in minimizing redundancy between features.

In contrast with filter methods, wrapper methods use a classifier and a performance measure to score a subset of features. Many search procedures are possible, which makes them slower and computationally costly when dealing with high-dimensional datasets. However, they usually produce better results than filter methods.

Hybrid methods combine filter and wrapper methods sequentially. At first, a subset is selected through a

filter method and then a wrapper selection is performed to search for the best subset.

Embedded methods were proposed to bridge the gap between filter and wrapper methods. In contrast to

those methods, embedded methods do not separate the classifier learning from the feature selection.

4.5.1 Recursive Feature Elimination

Recursive feature elimination (RFE) was introduced in [102] as an instance of backward feature elimination for gene selection in cancer classification. The main principle is to recursively remove features based on their importance, producing progressively smaller feature subsets until a desired number of features is reached or a performance criterion is met. RFE is described in algorithm 2.

In this work, RFE was performed by removing at each step the least important feature, i.e., the feature with the lowest absolute SHAP value or weight. The absolute value is used because a feature can either have a positive impact on the final outcome, favoring the patient's mortality, or a negative impact, favoring the patient's chance of survival. This process is carried out over a 5x10-fold cross validation (see section 4.7) and, at the end, the mean absolute impact is calculated for each feature. Features are then ranked by importance in ascending order and cross validation is performed again, eliminating each feature one by one according to its ranking. The mean AUC and mean AUPRC are recorded when the corresponding feature is eliminated.

Algorithm 2 Recursive Feature Elimination

Inputs:

• Stop criterion, M

• Ranking criterion, ω

• Classifier, h

Algorithm:

1: while M do
2:    Train classifier, h
3:    Compute ranking criterion ω for all features
4:    Remove the n features with the smallest ranking criterion
5: end while
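A minimal sketch of this SHAP-based elimination loop is given below, assuming a pandas DataFrame X, a label vector y and the shap and lightgbm packages; the helper name rfe_shap is illustrative and the 5x10-fold cross validation is omitted for brevity:

    import numpy as np
    import shap
    from lightgbm import LGBMClassifier

    def rfe_shap(X, y, n_features_to_keep):
        """Recursively drop the feature with the lowest mean |SHAP| value (algorithm 2)."""
        features = list(X.columns)
        while len(features) > n_features_to_keep:
            model = LGBMClassifier(n_estimators=200, learning_rate=0.03).fit(X[features], y)
            sv = shap.TreeExplainer(model).shap_values(X[features])
            sv = sv[1] if isinstance(sv, list) else sv   # some shap versions return one array per class
            importance = np.abs(sv).mean(axis=0)         # mean absolute SHAP value per feature
            features.remove(features[int(np.argmin(importance))])
        return features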


4.5.2 Sequential Forward Selection

Sequential forward selection (SFS) is a wrapper method that uses a greedy search approach to select the best subset of features. Succinctly, each feature present in the dataset is evaluated individually and, based on a specific metric, the one returning the best performance is selected. From there, in an iterative process, each of the remaining features is evaluated jointly with the first feature selected, and the subset presenting the best performance is retained. This process is repeated for all the features present in the dataset or until a desired number of features is reached. SFS is described in algorithm 3.

In this work, SFS was performed up to a subset of 30 features, since a higher number of features is not desired and evaluating every possible feature subset would be infeasible in terms of computational cost, given the relatively high number of features in the dataset. AUC was the performance metric used to select the feature at each iteration.

Algorithm 3 Sequential Feature Selection

Inputs:

• Stop criterion, M

• Performance metric, ω

• Classifier, h

Algorithm:

1: Create an empty set of features Y_k = ∅
2: while M do
3:    for each feature y not in Y_k do
4:        Train classifier h on subset Y_k + y
5:        Compute performance metric, ω
6:    end for
7:    Y_k = Y_k + y with max ω(Y_k + y)
8: end while
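A simplified, illustrative version of this greedy forward search is sketched below; cv_auc is a hypothetical helper returning the cross-validated AUC of a model on a given feature subset, and is not part of the thesis code:

    def sfs(X, y, model, cv_auc, max_features=30):
        """Greedy sequential forward selection guided by cross-validated AUC (algorithm 3)."""
        selected, remaining = [], list(X.columns)
        while remaining and len(selected) < max_features:
            # Score every candidate subset "selected + [f]" and keep the best one
            scores = {f: cv_auc(model, X[selected + [f]], y) for f in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected

scikit-learn provides a similar utility (SequentialFeatureSelector) that could be used instead of a hand-written loop.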

4.5.3 Recursive Feature Selection

In this work a novel approach entitled recursive feature selection (RFS) is proposed. RFS is a hybrid approach between RFE and sequential methods. Following the same principle as RFE of eliminating the least important features at each iteration, in this method each of the k least important features is removed individually, one at a time, and the feature subset that returns the best performance, according to a specific metric, is the one chosen. This process is repeated iteratively until a desired number of features is reached or the whole feature set has been traversed. RFS is described in algorithm 4.

In this work, RFS was performed testing at each step the 5 least important features, i.e., the features with the lowest absolute SHAP value or weight, depending on the model used. This process is carried out over a 5x10-fold cross validation, both to evaluate the features' importance and to evaluate the performance obtained when removing each of the candidate features. The feature whose removal yields the best performance is then the one selected for elimination. AUC is the metric used to evaluate performance during this feature selection process.


Algorithm 4 Recursive Feature Selection

Inputs:

• Stop criterion, M

• Ranking criterion, ω1

• Performance metric, ω2

• Classifier, h

Algorithm:

1: while M do
2:    Train classifier, h
3:    Compute ranking criterion ω1 for all features
4:    for each of the k lowest ranked features do
5:        Eliminate the feature from the subset
6:        Compute performance metric, ω2
7:    end for
8:    Select the subset that achieves the highest performance
9: end while
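A compact sketch of the proposed RFS loop follows; shap_importance (returning a dictionary of mean |SHAP| values per feature) and cv_auc are hypothetical helpers, not the exact thesis implementation:

    def rfs(X, y, model, shap_importance, cv_auc, k=5, n_features_to_keep=3):
        """Recursive feature selection (algorithm 4): among the k least important
        features, remove the one whose elimination yields the best AUC."""
        features = list(X.columns)
        while len(features) > n_features_to_keep:
            ranking = shap_importance(model, X[features], y)      # feature -> mean |SHAP|
            candidates = sorted(features, key=ranking.get)[:k]    # k least important features
            scores = {f: cv_auc(model, X[[c for c in features if c != f]], y)
                      for f in candidates}
            features.remove(max(scores, key=scores.get))          # drop the best feature to remove
        return features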

4.6 Model Interpretation

As referred in section 4.3.2, black-box algorithms can be recast in statistical frameworks to achieve model interpretability. However, the high dimensionality of datasets increases model complexity, and interpretability is often sacrificed in order to achieve better results. This creates a trade-off between accuracy and interpretability which must be managed.

4.6.1 Features Ranking

Knowing the importance of each feature in the final model's prediction is valuable knowledge that can be used to rank features and inform a feature selection process. Feature importances can be calculated either in an individualized manner, for a single prediction, or over an entire dataset to describe the model's global behaviour.

Linear models, such as LR and LDA, in their simplest form assign a weight to each feature reflecting its importance to the model. Other techniques used are p-values or bootstrap scores.

Tree-based ensemble methods, such as GBDT, use three different methods to estimate the importance of the features in a dataset: gain, split count and permutation.

The simplest approach, split count, counts how many times a feature is used to split a tree node, while gain sums the information gains obtained by all splits on a given feature. Permutation, in contrast, observes how the model's error changes when the values of a feature are randomly permuted in a test set.

Lundberg et al. [82] showed that, with the exception of permutation methods, these feature importance attribution methods are inconsistent, i.e. a given feature's attributed importance can decrease after a model change that in fact makes the model depend more on that feature. To overcome this limitation, a unified approach was proposed through SHapley Additive exPlanation (SHAP) values.


4.6.2 SHapley Additive exPlanation values

SHAP values are founded on ideas from game theory and local explanations. Developed from the premise of viewing any explanation of a model's prediction as a model itself [82], this method unified six existing methods into a class named additive feature attribution methods. Combining game theory with this unified class, SHAP values provide a unified measure of feature importance that is better aligned with human intuition.

Initially focused on linear models (such as LR and LDA), kernel methods (such as SVM) and deep learning algorithms, a seventh method was later added to compute SHAP values for trees and tree-based ensemble methods such as GBDT [103].

SHAP values for each feature in a linear model are estimated using equation 4.8, while for tree-based models they are computed by estimating E[f(x) | x_S] using equation 4.9, where f_x(S) = E[f(x) | x_S]. At a high level, equations 4.8 and 4.9 express the difference between the model's prediction with and without feature i (equation 4.10).

φ_i = β_i (x_i − E[x_i])    (4.8)

φ_i = Σ_{S ⊆ N \ {i}} [ |S|! (M − |S| − 1)! / M! ] [ f_x(S ∪ {i}) − f_x(S) ]    (4.9)

Importance of feature i = f_x(with feature i) − f_x(without feature i)    (4.10)

For further reading and a more detailed mathematical description, see Appendix D and the authors' explanations [82, 103].
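In practice, SHAP values for tree-based models are obtained with the shap package; the snippet below is a minimal illustration for a fitted LightGBM classifier, where X_train, y_train and X_test are placeholder names for data already prepared:

    import shap
    from lightgbm import LGBMClassifier

    model = LGBMClassifier(n_estimators=200, learning_rate=0.03).fit(X_train, y_train)

    explainer = shap.TreeExplainer(model)        # tree-specific SHAP algorithm [103]
    sv = explainer.shap_values(X_test)           # one value per (sample, feature)
    sv = sv[1] if isinstance(sv, list) else sv   # some shap versions return one array per class

    # sv[i, j] is the contribution of feature j to the prediction for patient i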

4.7 Model Performance Assessment

4.7.1 Repeated K-Fold Cross Validation

In k-fold cross validation, the original dataset is partitioned into k equally sized subsamples, also called folds. Of those, k − 1 folds are used for training and the remaining fold is used for testing. This process is repeated k times, so that every fold serves once as the test set.

The k-fold cross validation is then repeated n times with a different random assignment of samples to folds, and the results obtained over all n × k folds are averaged.
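The 5x10-fold scheme used throughout this work (n = 5 repetitions of 10 folds) maps directly onto scikit-learn's repeated cross-validation utilities; the sketch below assumes stratified folds (a common choice for imbalanced outcomes, stated here as an assumption) and placeholder data X, y:

    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from lightgbm import LGBMClassifier

    # 10 folds repeated 5 times = 50 train/test splits
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
    scores = cross_val_score(LGBMClassifier(), X, y, cv=cv, scoring="roc_auc")
    print(scores.mean(), scores.std())   # averaged over all n x k folds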

Performance metrics help to evaluate and compare different models. The study objective is a binary classification problem where a patient either dies (1) or survives (0), and the goal is to predict that outcome. Classification models predict the likelihood of a patient dying as the probability of the patient belonging to a given class (die or survive). A patient can be assigned to a class by choosing the class with the highest probability or by defining a threshold that allocates probabilities above or below it into classes.

4.7.2 Sensitivity, Specificity and Precision

A confusion matrix is a table that describes the performance of a model by summarizing its prediction outcomes. In a binary classification problem, four outcomes are possible:


Figure 4.4: Confusion Matrix (source [104])

• True positive (TP): positive cases correctly classified

• True negative (TN): negative cases correctly classified

• False positive (FP): negative cases incorrectly classified as positive

• False negative (FN): positive cases incorrectly classified as negative

From these four outcomes, the performance metrics used to assess the models can be estimated. These metrics are described in the next subsections.

Sensitivity or Recall

Sensitivity is the fraction of positive cases correctly classified among the total number of positive cases. It represents the number of patients correctly classified as dead among those who actually died.

Sensitivity = TP / (TP + FN)    (4.11)

Specificity

Specificity is the fraction of negative cases correctly classified among the total number of negative cases. It represents the number of patients correctly classified as survivors among those who actually survived.

Specificity = TN / (TN + FP)    (4.12)

Precision

Precision is the fraction of correct positive predictions among the total number of cases classified as positive. It represents the number of patients correctly classified as dead among all patients classified as dead, i.e., those correctly classified plus those incorrectly classified as dead.

Precision = TP / (TP + FP)    (4.13)

Specificity, sensitivity and precision are threshold-based metrics, and choosing the threshold in favor of one of these metrics can decrease the performance of the others. When dealing with imbalanced datasets, these metrics also tend to favor the majority class, making it impossible to obtain a coherent model classification.


Thus, for this study, threshold-free metrics are used to obtain a generalized view of the models' classification performance. The area under the receiver operating characteristic curve (AUC) and the area under the Precision-Recall curve (AUPRC) are the metrics chosen.

The classification threshold is selected based on the ROC curve, minimizing the distance between the false positive rate (1 − specificity) and the sensitivity.
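As an illustration of this kind of threshold choice only: one possible reading of the criterion, consistent with the nearly equal sensitivity and specificity values reported in chapter 5, is to pick the point of the ROC curve where sensitivity and specificity are closest to each other. This is an interpretation, not a statement of the exact implementation (y_true and y_prob are assumed to be available):

    import numpy as np
    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    specificity = 1 - fpr

    # Threshold where sensitivity (tpr) and specificity are as close as possible
    idx = np.argmin(np.abs(tpr - specificity))
    print(thresholds[idx], tpr[idx], specificity[idx])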

4.7.3 Area under the receiver operating characteristic curve

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at different thresholds. AUC measures the entire two-dimensional area underneath the ROC curve and characterizes how well a model is capable of distinguishing between classes. An AUC value of 0.5 corresponds to a random classifier, while a value of 1 corresponds to a perfect classifier.

However, AUC might give an overoptimistic picture of the model in the presence of class imbalance, which may lead to a false interpretation of the model's performance [105].

4.7.4 Area under the Precision-Recall curve

The Precision-Recall curve (PRC) shows the trade-off between precision and recall (sensitivity) for different thresholds, and AUPRC measures the area underneath it. The baseline of a random classifier depends on the ratio Positives / (Positives + Negatives) between positives (1) and negatives (0). AUPRC must therefore be read relative to the class imbalance of the dataset: an AUPRC of 0.5 on a balanced dataset is a poor performance and represents a random classifier, but on a highly imbalanced dataset it is considered a good performance.

AUPRC is preferred for imbalanced datasets because it is more informative than AUC when evaluating binary classifiers [105].
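Both threshold-free metrics are available in scikit-learn; the following self-contained sketch uses synthetic labels and scores purely for illustration (average_precision_score is the commonly used approximation of AUPRC):

    import numpy as np
    from sklearn.metrics import roc_auc_score, average_precision_score

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                          # synthetic binary outcomes
    y_prob = np.clip(y_true * 0.3 + rng.random(1000) * 0.7, 0, 1)   # noisy synthetic scores

    print(roc_auc_score(y_true, y_prob))             # area under the ROC curve (AUC)
    print(average_precision_score(y_true, y_prob))   # area under the Precision-Recall curve (AUPRC)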


Chapter 5

Results

This chapter presents the main results and justifies the decisions taken to choose a final model. Section 5.1 briefly describes the cohort containing all patients under insulin therapy, giving a general overview of the data distribution. Sections 5.2 and 5.3 focus on choosing the best-performing machine learning algorithm to proceed with the study, comparing performance between first- and last-day data. Section 5.4 presents a detailed analysis using the first-day data in order to fine-tune the hyperparameters and to examine how sampling techniques affect the models' performance; feature selection is also implemented to reduce the final number of features. In section 5.5, the results obtained are compared with common severity scores used in ICUs to predict patient mortality. Results for different time-windows are then presented in section 5.6. Lastly, the best model results are compared with the results from similar studies previously presented in section 2.3.

5.1 Descriptive Analysis of the Cohort

Figure 5.1 shows a dashboard with a general overview of the data from all patients (12 338) under insulin therapy over their entire ICU stay.

The dataset is composed predominantly of white male patients. Patients' ages lie mostly between 60 and 80 years, and there is also a significant number of patients above 89 years old, indicating an aged cohort.

Besides the condition of being under insulin therapy, more than half of these patients (56.8%) are non-diabetic. Among diabetic patients, the majority are type II diabetics, representing 36.3% of the whole dataset.

As previously analyzed in section 3.5.1, the class imbalance present in the dataset is significant (11.4%). However, when analyzing patients' length of stay in the ICU in days, there is a noticeable decrease in imbalance as the length of stay increases, indicating that a higher risk of mortality is associated with a longer ICU stay.

The most common length of stay in the ICU is between 1 and 2 days, but stays longer than 10 days also account for a significant number of patients.


Figure 5.1: General overview of the working dataset (All ICU stay, n = 12 338)

5.2 Selection of a Machine Learning Technique

Several machine learning algorithms were tested with a 5x10-fold cross validation, using independently as inputs the covariates gathered during the first and the last day of the patient's ICU admission.

Analyzing figures 5.2 and 5.3 along with table 5.1, five algorithms stand out in predictive performance and deserve further analysis, namely GB, LR, LDA, ADA and SVM. Of these, GB and LR were chosen to proceed with the study, for the reasons given below.

The GB model, despite being computationally costly, as seen in figures 5.2c and 5.3c where GB is the second most time-consuming model (275.73s and 329.07s), has the highest predictive performance in both the first-day (AUC 90.97±1.23, AUPRC 54.16±4.86) and last-day (AUC 92.53±1.12, AUPRC 73.50±3.09) analyses.

LR and LDA present quite similar performances. LR gives a slightly better predictive performance (AUC 89.85±1.55 and 91.55±1.03, AUPRC 50.69±5.36 and 69.48±3.20), while LDA (6.65s and 6.86s) outperforms LR (16.14s and 20.66s) in computational cost. Since predictive performance plays the more important role, and since LR is the baseline model in health data analysis, LR was retained for further study.

SVM and ADA were discarded because they compare unfavorably with the chosen algorithms in terms of computational cost (SVM: 981.91s and 1560.42s) and predictive performance (ADA: AUC 88.98±1.65 and 90.94±1.31, AUPRC 47.01±5.52 and 68.47±3.00).


Figure 5.2: Performance analysis where different machine learning algorithms are compared (First-day)

Figure 5.3: Performance analysis where different machine learning algorithms are compared (Last-day)

Table 5.1: Performance metrics for machine learning techniques

Algorithm AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] Time [s]

First-day

KNN 71.74± 2.52 27.05± 3.58 52.47± 4.95 89.92± 0.90 115.51

SVM 88.28± 1.84 51.84± 5.09 80.80± 2.22 80.90± 2.33 981.91

DT 65.35± 2.64 19.14± 2.77 37.56± 5.28 93.14± 0.96 65.51

RF 83.53± 2.22 37.25± 4.71 75.03± 4.69 80.31± 3.46 24.44

LR 89.85± 1.55 50.69± 5.36 82.35± 1.96 82.18± 1.93 16.14

ADA 88.98± 1.65 47.01± 5.52 81.38± 2.36 81.41± 2.34 166.23

GB 90.97± 1.23 54.16± 4.86 83.19± 2.07 83.13± 2.03 275.73

GNB 81.58± 2.14 26.59± 2.93 73.47± 2.14 73.54± 2.20 2.22

LDA 88.45± 1.94 50.21± 5.09 80.99± 2.00 81.11± 1.95 6.65

QDA 78.17± 3.02 28.79± 4.26 73.58± 3.09 73.42± 3.20 4.77

Last-day

KNN 73.30± 2.09 37.08± 3.50 57.43± 4.01 86.92± 1.00 138.36

SVM 90.27± 1.40 68.12± 3.69 82.28± 1.78 82.29± 1.88 1560.42

DT 70.79± 2.16 30.89± 2.87 49.59± 4.21 91.99± 0.90 70.88

RF 86.76± 1.94 55.44± 3.81 81.76± 4.18 76.86± 4.47 27.05

LR 91.55± 1.03 69.48± 3.20 83.98± 1.52 83.88± 1.59 20.66

ADA 90.94± 1.31 68.47± 3.00 82.93± 1.55 82.92± 1.67 201.80

GB 92.53± 1.12 73.50± 3.09 84.69± 1.76 84.83± 1.73 329.07

GNB 81.47± 1.58 35.04± 2.72 74.28± 1.58 74.22± 1.53 2.36

LDA 90.83± 1.28 68.21± 3.46 83.42± 1.42 83.34± 1.51 6.86

QDA 81.51± 2.51 36.28± 3.14 75.00± 1.93 75.10± 1.87 4.81


5.3 Selection of a Gradient Boosting Framework

For the case of GB, three different frameworks (i.e. XGB, LGB and CB) were tested and compared with the original algorithm (GB).

A set of default parameters was used to evaluate the performance of each GB algorithm, in order to select one model to be extensively evaluated through hyperparameter tuning. For this step, subsampling and feature sampling were discarded to avoid biasing the comparison between the algorithms with regard to computational and overall performance. The parameters chosen were learning_rate = 0.1, max_depth = 5 and iterations = 100, along with a 5x10-fold cross validation.

Analyzing figures 5.4 and 5.5 along with table 5.2, the overall performance is quite similar across all algorithms. XGB emerges as the best-performing algorithm in both AUC (91.36±1.28 and 93.22±1.06) and AUPRC (56.57±4.66 and 75.04±2.94) for the first- and last-day analyses, followed by LGB and CB. GB has the lowest performance (considering all metrics) among all algorithms.

Nonetheless, LGB clearly stands out in terms of computational time (5.30s and 11.40s), running in roughly 80% less time than the second fastest algorithm (i.e. CB) and roughly 95% less time than XGB, which achieved the highest performance. For that reason, LGB is the algorithm selected instead of the original algorithm (GB).

Figure 5.4: Performance analysis comparing GB algorithms (First-day)

Figure 5.5: Performance analysis comparing GB algorithms (Last-day)


Table 5.2: Performance metrics for each GB algorithm

Algorithm AUC [%] AUPRC [%] Sensitivity [%] Specificity [%] Time [s]

First-day

GB 90.97± 1.23 54.16± 4.86 83.19± 2.07 83.13± 2.03 275.73

XGBoost 91.36± 1.28 56.57± 4.66 83.70± 1.76 83.65± 1.89 154.12

LightGBM 91.06± 1.32 56.05± 4.70 83.17± 1.89 83.35± 2.03 5.30

CatBoost 91.19± 1.35 55.41± 5.06 83.63± 2.03 83.78± 1.85 42.58

Last-day

GB 92.53± 1.12 73.50± 3.09 84.69± 1.76 84.83± 1.73 329.07

XGBoost 93.22± 1.06 75.04± 2.94 85.80± 1.71 85.81± 1.75 172.26

LightGBM 93.09± 1.07 74.94± 2.81 85.57± 1.59 85.51± 1.70 11.40

CatBoost 92.57± 1.14 74.28± 3.06 84.85± 1.71 84.80± 1.83 74.84

5.4 First-day Analysis

5.4.1 Hyperparameter tuning

First-day data will be the baseline dataset to tune the hyperparameters that will be used in the other time-

windows. LR has a simpler approach than LGB, so hyperparameter tuning is not required.

For LGB, hyperparameter tuning was divided into two steps, in order to control the model's tendency to over-fit and to check how fast the model converges.

The first step was to check the trade-off between the number of estimators and the learning rate. The values chosen for the learning rate were [0.01, 0.03, 0.05, 0.07, 0.1], while the number of estimators varied within [25−1500]. These ranges represent a more exploratory process than the ones proposed in section 4.4.4.

Figure 5.6 shows that high learning rates lead to faster convergence of the model, but it is visible, through the decrease of the performance metrics, that the model tends to over-fit as the number of estimators increases. A lower learning rate requires more estimators to converge, which increases the computational cost. Hence, learning_rate = 0.03 and n_estimators = 200 were chosen for modeling, since these values simultaneously avoid over-fitting and preserve a relatively fast convergence.

Figure 5.6: Number of Estimators vs Learning Rate
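A sketch of how such a grid could be evaluated is given below (illustrative only; it assumes the cross-validation splitter cv and the first-day data X, y are already defined, and uses AUC as in the rest of this work):

    import itertools
    from lightgbm import LGBMClassifier
    from sklearn.model_selection import cross_val_score

    learning_rates = [0.01, 0.03, 0.05, 0.07, 0.1]
    n_estimators_grid = [25, 50, 100, 200, 500, 1000, 1500]

    results = {}
    for lr, n_est in itertools.product(learning_rates, n_estimators_grid):
        model = LGBMClassifier(learning_rate=lr, n_estimators=n_est)
        scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
        results[(lr, n_est)] = scores.mean()

    # Inspect results to see where the AUC plateaus or starts to drop (a sign of over-fitting)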

The second step, with the learning rate and number of estimators now fixed, was to analyze how the tree depth and the number of leaves influence the model's performance. The values chosen for max_depth were [3, 4, 5, 7, 9], with num_leaves varying within the range [3−50].

From the results shown in figure 5.7, and recalling that LGB uses leaf-wise growth (section 4.4.1), it is noticeable that a greater number of leaves combined with a smaller tree depth has no influence, since a tree is constrained by its depth, and vice-versa. num_leaves = 10 and max_depth = 5 were the parameters chosen for the study because this combination gives the better performance in both AUC (91.44 ± 1.36) and AUPRC (56.78 ± 4.85) before the results begin to vary in a non-linear way, which may be an indication of over-fitting.

Figure 5.7: Number of leaves vs Max depth.

It should be noted that, in both hyperparameter choices, the values chosen from the graphs are never the ones that achieve the maximum registered AUC and/or AUPRC; as a rule, they are the preceding values, which steer the models away from over-fitting.

Table 5.3 shows the results of the first-day analysis with LR and LGB using the hyperparameters selected above.

Table 5.3: Performance metrics for first-day dataset.

Method AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]

First-day
LGB 91.44± 1.36 56.78± 4.85 83.97± 2.03 83.68± 1.91

LR 89.85± 1.55 50.69± 5.36 82.18± 1.93 82.35± 1.96

5.4.2 Effects of Sampling Techniques

Sampling techniques were applied in order to counteract the imbalance present in the dataset. Oversampling and undersampling were applied to both algorithms, while subsampling and feature sampling were only applied to LGB, since it is an ensemble method (see section 4.4.4).
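These resampling steps correspond to standard utilities of the imbalanced-learn package; the brief sketch below is illustrative only, assuming a feature matrix X and labels y (in practice the resampling is applied to the training folds only, inside the cross-validation loop):

    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.under_sampling import RandomUnderSampler, TomekLinks

    # Oversampling: replicate or synthesize minority-class (death) samples
    X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
    X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)

    # Undersampling: discard majority-class (survivor) samples
    X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
    X_tomek, y_tomek = TomekLinks().fit_resample(X, y)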

Oversampling techniques

The results for each technique are presented in table 5.4. Randomly replicating samples from the minority class is preferable to creating new synthetic samples (AUC 91.12±1.28, AUPRC 54.71±5.28). Compared to the results in table 5.3, there is a reduction in all performance measures when oversampling techniques are used.


Table 5.4: Performance metrics for oversampling techniques

Model | Method | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%]
LGB | Random Oversampling | 91.12± 1.28 | 54.71± 5.28 | 83.31± 2.13 | 83.39± 2.03
LGB | SMOTE | 90.63± 1.38 | 53.18± 4.69 | 81.98± 2.04 | 82.03± 1.97
LR | Random Oversampling | 89.93± 1.42 | 47.51± 5.17 | 82.11± 1.65 | 82.11± 1.57
LR | SMOTE | 89.60± 1.48 | 48.04± 5.33 | 82.03± 1.82 | 82.04± 1.81

Undersampling Techniques

Table 5.5 shows the results for each technique. It can be observed that the more instances of the majority class are removed, the worse the results obtained. AUPRC is the performance measure most negatively affected. Among all techniques, Tomek links achieves the highest performance (AUC 91.40± 1.36, AUPRC 56.53± 4.90).

Table 5.5: Performance metrics for undersampling techniques

Model | Method | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%]
LGB | Random Undersampling | 90.45± 1.27 | 50.92± 4.74 | 82.50± 2.27 | 82.56± 2.17
LGB | Tomek Links | 91.40± 1.36 | 56.53± 4.90 | 83.62± 1.97 | 83.75± 2.00
LGB | ENN (n=1) | 91.31± 1.37 | 55.51± 5.39 | 83.55± 2.04 | 83.63± 1.96
LGB | ENN (n=2) | 91.25± 1.34 | 54.45± 5.23 | 83.76± 2.10 | 83.63± 2.12
LGB | ENN (n=3) | 91.25± 1.35 | 54.09± 5.07 | 83.68± 2.09 | 83.56± 2.12
LGB | NCR (n=1) | 91.33± 1.35 | 55.03± 5.15 | 83.40± 2.17 | 83.31± 2.03
LGB | NCR (n=2) | 91.30± 1.34 | 55.75± 5.00 | 83.45± 2.12 | 83.46± 1.97
LGB | NCR (n=3) | 91.25± 1.31 | 55.09± 4.97 | 83.56± 1.93 | 83.70± 1.89
LR | Random Undersampling | 89.76± 1.38 | 46.77± 5.07 | 82.05± 1.59 | 81.91± 1.70
LR | Tomek Links | 89.86± 1.54 | 50.67± 5.37 | 82.19± 1.85 | 82.13± 1.80
LR | ENN (n=1) | 89.97± 1.51 | 50.23± 5.28 | 82.25± 1.73 | 82.28± 1.74
LR | ENN (n=2) | 90.03± 1.49 | 49.65± 5.40 | 82.35± 1.85 | 82.37± 1.81
LR | ENN (n=3) | 90.04± 1.45 | 49.09± 5.40 | 82.43± 1.74 | 82.33± 1.70
LR | NCR (n=1) | 89.94± 1.51 | 49.95± 5.37 | 82.20± 1.89 | 82.18± 1.77
LR | NCR (n=2) | 89.79± 1.55 | 50.20± 5.36 | 82.04± 1.95 | 82.01± 1.84
LR | NCR (n=3) | 89.72± 1.59 | 49.95± 5.34 | 81.73± 2.06 | 81.84± 1.95


Subsampling and Feature Sampling

Figure 5.8 summarizes the results for both AUC and AUPRC with LGB. When the majority of the features or samples are discarded during training, i.e., values of feature_fraction or bagging_fraction below 0.5, the results worsen. This is due to the higher probability of discarding features important to the final prediction, since this is a random process, or, in the case of subsampling, to the number of samples being greatly reduced. Thus, there is not much advantage in using these sampling techniques in LGB beyond making the process faster, which by itself has already been shown not to be needed.

Figure 5.8: Subsampling and feature sampling for LGB.

5.4.3 Feature Selection

Recursive Feature Elimination (RFE) with SHAP Values for LGB Modeling

The mean absolute SHAP value after removing each feature is plotted in figure 5.9a, along with the respective mean values of AUC and AUPRC. Figures 5.9b,c,d show magnified views of the mean absolute SHAP value, and magnified views of the mean AUC and AUPRC can be seen in figures 5.9e,f and 5.9g,h, respectively.

Analyzing figures 5.9e,g, it is possible to conclude that there is almost no variation of AUC and AUPRC when eliminating features with null importance. Nonetheless, there is a slight variation in both performance metrics when higher-ranked features start to be eliminated. As more features with higher importance are eliminated, a better performance is achieved, until it starts to decrease.

The AUC gradually increases until 136 features have been removed, while the AUPRC only increases until 96 features are removed. After these points, performance decreases slightly during the remainder of the feature elimination process. From figures 5.9f,h, there is a considerable drop in both metrics when the 20 highest ranked features in the model are removed.

Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected. The performance is quite similar in both approaches, the AUC approach being preferable due to its smaller number of features. Performances for fixed feature subsets are also registered for further comparison.


Figure 5.9: Recursive Feature Elimination - LGB with SHAP Values

Sequential Forward Selection (SFS) for LGB modeling

Mean AUC values as features are added during feature selection are plotted in figure 5.10. There is a continuous improvement in performance until 20 features are reached; from there, the variation in performance is not significant. Performances for fixed subsets are registered in table 5.7.

Figure 5.10: Sequential Forward Selection - LGB with AUC metric


Recursive Feature Selection (RFS) with SHAP Values for LGB Modeling

The mean absolute SHAP value for LGB after removing each feature is plotted in figure 5.11a, along with the respective mean values of AUC and AUPRC. Figures 5.11b,c,d show magnified views of the mean absolute SHAP value, and magnified views of the mean AUC and AUPRC can be seen in figures 5.11e,f and 5.11g,h, respectively.

Analyzing figures 5.11e,g, it is possible to conclude that there is a gradual increase in both AUC and AUPRC as features are eliminated, until 141 and 142 features have been eliminated for each metric, respectively.

From figures 5.11f,h, there is a considerable drop in both metrics when the 20 highest ranked features in the model are removed.

Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected. The performance is quite similar in both approaches. Performances for fixed feature subsets are also registered for further comparison.

Figure 5.11: Recursive Feature Selection - LGB with SHAP Values


Recursive Feature Elimination (RFE) with weight vectors for LR modeling

Calculating SHAP values for LR has a high computational cost. In this case, instead of eliminating at each step the feature with the lowest absolute SHAP value, the feature with the lowest absolute weight was removed. Analogously to the previous processes, the mean absolute weight after removing each feature is plotted in figure 5.12a, along with the respective mean values of AUC and AUPRC. Figures 5.12b,c,d show magnified views of the mean absolute weight, and figures 5.12e,f and 5.12g,h show magnified views of the mean AUC and AUPRC, respectively.

Figures 5.12e,g show the evolution of AUC and AUPRC as the features with less impact are eliminated. AUC peaks with a subset of 102 features, while the highest AUPRC occurs with a subset of 81 features.

Table 5.7 also presents the performances for both subsets. In contrast to LGB, the AUC approach results in a feature subset of higher dimensionality than the AUPRC approach. Performances for fixed subsets are also registered for further comparison.

Figure 5.12: Recursive Feature Elimination - LR with Weight Vectors


Sequential Forward Selection (SFS) for LR modeling

Mean AUC values as features are added during feature selection are plotted in figure 5.13. There is a continuous improvement in performance until 20 features are reached; from there, the variation in performance is not significant. Performances for fixed subsets are registered in table 5.7.

Figure 5.13: Sequential Forward Selection - LR with AUC metric

Recursive Feature Selection (RFS) with Weight Vectors for LR Modeling

The mean absolute weight after removing each feature is plotted in figure 5.14a, along with the respective mean values of AUC and AUPRC. Figures 5.14b,c,d show magnified views of the mean absolute weight, and magnified views of the mean AUC and AUPRC can be seen in figures 5.14e,f and 5.14g,h, respectively.

Analyzing figures 5.14e,g, it is possible to observe a linear increase of AUC and AUPRC when eliminating the features with less weight, until about 70 features have been removed.

The AUC reaches its maximum value with a subset of 61 features, and the AUPRC with 75 features. After these points, performance decreases slightly during the remainder of the feature selection process. From figures 5.14f,h, there is a considerable drop in both metrics when the highest ranked features are removed.

Table 5.7 shows the results when the feature sets with the highest AUC and AUPRC were selected. The performance is quite similar in both approaches, the AUC approach being preferable due to its smaller number of features. Performances for fixed feature subsets are also registered for further comparison.

5.4.4 Feature Selection - Comparison

The features with the highest importance after the feature selection processes, i.e., the features removed in the last 20 iterations of RFE and RFS and the first 20 features added during SFS, are presented in table 5.6.

Among these highest ranked features, the features commonly shared between the different feature selection methods are highlighted in blue for the LGB model and in green for the LR model in table 5.6. In the case of LGB, 9 features are shared among the feature selection processes, while for LR, 6 features are shared.

Within each model, it is noticeable that the majority of the selected features appear in more than one feature selection process, and between models (LGB and LR) there is also a high similarity between the features selected.


Figure 5.14: Recursive Feature Selection - LR with Weight Vectors

Table 5.6: Input features after feature selection.

Position RFE-LGB-SHAP SFS-LGB RFS-LGB-SHAP RFE-LR-WEIGHT SFS-LR RFS-LR-WEIGHT

1st last gcs last gcs last gcs aniongap mean last gcs last gcs

2nd respiratory sys respiratory sys respiratory sys aniongap max bun max aniongap max

3rd aniongap mean aniongap med aniongap mean last gcs respiratory sys respiratory sys

4th glucose mean infectious sys platelet min resprate mean aniongap mean resprate mean

5th adm type EMERGENCY age age aniongap med resprate med age

6th ventilation time platelet min infectious sys num infusion rdw max rdw min

7th age sysbp mean sysbp med glucose mean age num infusion

8th resprate mean bun flag glucose mean bun min num infusion aniongap mean

9th urineoutput num infusion ventilation time age heartrate max wbc flag

10th num infusion wbc min num infusion weight first potassium flag glucose mean

11th rdw min spo2 mean resprate mean wbc min glucose mean ethnicity OTHER

12th bun mean glucose std urineoutput creatinine max user insulin bun min

13th infetious sys ethnicity OTHER bun mean creatinine flag ethnicity OTHER spo2 mean

14th bun min nervous sys ethnicity OTHER bicarbonate flag nervous sys creatinine max

15th rdw flag sodium flag rdw min wbc flag sysbp min creatinine flag

16th platelet min creatinine max nervous sys rdw mean creatinine max heartrate max

17th bun max urineoutput sodium min heartrate max wbc flag user insulin

18th diasp min resprate mean diabp min spO2 min ethnicity BLACK aniongap med

19th ethnicity OTHER diasbp med heartrate max rdw min adm type EMERGENCY nervous sys

20th spO2 mean user insulin mental sys creatinine med sodium var diasbp min


Since there isn’t a consensus about which feature set is the best to pursue with the study, it is necessary

to access the performance associated to each process of feature selection.

As previously commented, results for all feature selection processes are summarized in table 5.7. The best

performance comes from SFS in LGB model with 30 features (AUC of 92.03 and AUPRC of 57.05). However

for smaller subsets (3, 5 and 7 features), which are the desired ones in a perspective of faster diagnoses and

smaller collection of variables, RFS stands out both for LGB and LR models.

For those reasons, subsets from RFS are the selected to perform external validation in eICU-CRD database.

Table 5.7: Performance metrics after feature selection

Method | Subset | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%] | No. Features
RFE-LGB-SHAP | Best AUC | 91.57± 1.31 | 56.86± 4.88 | 83.61± 2.12 | 83.61± 2.12 | 51
RFE-LGB-SHAP | Best AUPRC | 91.54± 1.36 | 57.18± 4.72 | 83.73± 2.28 | 83.80± 2.30 | 91
RFE-LGB-SHAP | Fixed subset | 90.30± 1.37 | 52.69± 4.85 | 81.86± 2.23 | 81.71± 2.17 | 14
RFE-LGB-SHAP | Fixed subset | 88.80± 1.67 | 48.98± 4.80 | 80.97± 2.05 | 81.07± 2.01 | 7
RFE-LGB-SHAP | Fixed subset | 87.36± 1.63 | 46.09± 4.47 | 79.15± 1.91 | 79.16± 1.95 | 5
RFE-LGB-SHAP | Fixed subset | 86.86± 1.66 | 45.74± 4.54 | 79.26± 1.89 | 79.32± 1.89 | 3
SFS-LGB | Fixed subset | 92.03± 1.16 | 57.05± 5.28 | 84.45± 1.74 | 84.45± 1.72 | 30
SFS-LGB | Fixed subset | 91.31± 1.28 | 54.15± 4.49 | 83.78± 1.82 | 83.75± 1.82 | 14
SFS-LGB | Fixed subset | 90.06± 1.32 | 51.63± 4.97 | 82.53± 1.78 | 82.56± 1.85 | 7
SFS-LGB | Fixed subset | 88.61± 1.44 | 47.78± 4.42 | 80.50± 2.01 | 80.55± 1.91 | 5
SFS-LGB | Fixed subset | 86.86± 1.65 | 45.59± 4.82 | 79.36± 1.72 | 79.22± 1.71 | 3
RFS-LGB-SHAP | Best AUC | 91.83± 1.27 | 57.97± 5.00 | 83.88± 1.95 | 83.89± 1.97 | 45
RFS-LGB-SHAP | Best AUPRC | 91.83± 1.28 | 58.00± 4.92 | 83.96± 2.03 | 83.90± 1.98 | 44
RFS-LGB-SHAP | Fixed subset | 90.99± 1.29 | 54.88± 5.00 | 83.34± 1.73 | 83.17± 1.79 | 14
RFS-LGB-SHAP | Fixed subset | 90.17± 1.34 | 52.41± 4.60 | 82.30± 1.96 | 82.17± 1.85 | 7
RFS-LGB-SHAP | Fixed subset | 88.78± 1.39 | 48.94± 4.54 | 81.33± 1.68 | 81.28± 1.72 | 5
RFS-LGB-SHAP | Fixed subset | 86.85± 1.66 | 45.73± 4.54 | 79.26± 1.89 | 79.32± 1.89 | 3
RFE-LR-WEIGHT | Best AUC | 90.09± 1.56 | 51.71± 5.31 | 82.47± 1.86 | 82.41± 1.86 | 102
RFE-LR-WEIGHT | Best AUPRC | 89.94± 1.59 | 52.05± 5.40 | 82.08± 1.88 | 82.13± 1.93 | 81
RFE-LR-WEIGHT | Fixed subset | 88.32± 1.53 | 48.49± 5.31 | 80.10± 2.20 | 80.20± 2.30 | 14
RFE-LR-WEIGHT | Fixed subset | 86.41± 1.66 | 44.93± 4.99 | 78.45± 2.20 | 78.65± 2.48 | 7
RFE-LR-WEIGHT | Fixed subset | 85.29± 1.93 | 42.85± 4.84 | 77.82± 2.43 | 77.75± 2.57 | 5
RFE-LR-WEIGHT | Fixed subset | 83.59± 2.30 | 40.54± 5.24 | 75.08± 2.37 | 75.09± 2.38 | 3
SFS-LR | Fixed subset | 90.12± 1.30 | 49.56± 5.19 | 82.65± 1.76 | 82.54± 1.68 | 30
SFS-LR | Fixed subset | 89.35± 1.31 | 47.79± 5.15 | 81.28± 1.86 | 81.38± 1.81 | 14
SFS-LR | Fixed subset | 87.97± 1.47 | 44.18± 5.23 | 80.24± 1.86 | 80.17± 1.84 | 7
SFS-LR | Fixed subset | 87.01± 1.56 | 41.53± 4.98 | 79.21± 2.08 | 79.05± 1.90 | 5
SFS-LR | Fixed subset | 85.41± 1.66 | 35.97± 5.07 | 75.26± 1.97 | 76.32± 1.94 | 3
RFS-LR-WEIGHT | Best AUC | 90.28± 1.53 | 51.45± 5.44 | 82.37± 1.86 | 82.37± 2.01 | 61
RFS-LR-WEIGHT | Best AUPRC | 90.19± 1.53 | 51.78± 5.43 | 82.28± 1.71 | 82.38± 1.73 | 75
RFS-LR-WEIGHT | Fixed subset | 89.07± 1.39 | 47.97± 5.25 | 81.38± 1.99 | 81.39± 2.03 | 14
RFS-LR-WEIGHT | Fixed subset | 88.16± 1.39 | 45.89± 5.33 | 80.65± 2.13 | 80.60± 2.22 | 7
RFS-LR-WEIGHT | Fixed subset | 87.14± 1.49 | 43.99± 5.09 | 79.30± 1.96 | 79.24± 1.94 | 5
RFS-LR-WEIGHT | Fixed subset | 85.65± 1.76 | 38.83± 4.94 | 76.68± 2.24 | 76.67± 2.29 | 3


5.4.5 Comparing Approaches

Table 5.8 shows a general overview of the best results obtained with all the different approaches. LGB is the chosen model, since it outperforms LR in all performance metrics for all approaches. The original model after hyperparameter tuning and the models after feature selection stand out among these approaches.

The model resulting from feature selection using SFS with LGB is the best choice (AUC 92.03±1.16, AUPRC 57.05±5.28).

Table 5.8: Performance metrics for the comparison between approaches for LGB

Approach | Method | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%] | No. Features
Original | - | 91.44± 1.36 | 56.78± 4.85 | 83.97± 2.03 | 83.68± 1.91 | All
Oversampling | Random Oversampling | 91.12± 1.28 | 54.71± 5.28 | 83.31± 2.13 | 83.39± 2.03 | All
Undersampling | Tomek Links | 91.40± 1.36 | 56.53± 4.90 | 83.62± 1.97 | 83.75± 2.00 | All
Feature Selection | SFS-LGB | 92.03± 1.16 | 57.05± 5.28 | 84.45± 1.74 | 84.45± 1.72 | 30

5.5 Comparison with severity scores

Performance metrics for common severity scores were calculated using the same 5x10-fold cross validation on the same patients as the constructed models. For comparison purposes, results are presented for models with the same minimum and maximum number of features across all severity scores. The features selected for each model are derived from the RFS method with LGB.

Table 5.9 shows that the proposed model clearly surpasses all the severity scores, which is an indication of how useful these models could be for predictive purposes.

Table 5.9: Performance metrics compared with common severity scores used in the clinical setting.

Method | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%] | No. Features
LODS | 73.18± 2.87 | 25.41± 3.74 | 69.43± 3.97 | 66.56± 4.56 | 12
SAPS | 73.60± 2.33 | 26.92± 3.59 | 68.35± 2.27 | 66.49± 3.33 | 14
SAPS II | 77.41± 2.44 | 29.16± 4.07 | 70.40± 2.85 | 70.46± 2.61 | 12
SOFA | 68.49± 3.50 | 24.97± 3.98 | 64.80± 4.37 | 63.76± 4.16 | 10
QSOFA | 54.84± 2.44 | 10.32± 0.74 | 27.74± 14.90 | 77.91± 13.21 | 3
Model (LGB) | 90.99± 1.29 | 54.88± 5.00 | 83.34± 1.73 | 83.17± 1.79 | 14
Model (LGB) | 86.85± 1.66 | 45.73± 4.54 | 79.26± 1.89 | 79.32± 1.89 | 3


5.6 Analysis for different time-windows.

With a view to real-time mortality prediction, data was extracted for different time-windows besides the first 24 hours, which was exhaustively analyzed in the previous sections. The hyperparameters used in this section were those previously selected in section 4.4.4, using all features as inputs. As a reminder, for the windows of 12 and 24 hours prior to discharge, the length of stay in the ICU (los_icu) is used as an extra feature.

Figure 5.15 and table 5.10 show the evolution of the performance metrics for the different time-windows. It would be expected that, when using information closer to the patients' ICU discharge, whether dead or alive, the predictive performance would be higher, since patients have begun to show signs of improvement or worsening of their clinical status. This is verified in all performance metrics, with AUPRC showing the most considerable improvement.

Figure 5.15: Analysis for different time-windows.

Table 5.10: Performance metrics for different data extraction time-windows.

Time-window AUC [%] AUPRC [%] Sensitivity [%] Specificity [%]

LGB

12h after admission 90.65± 1.63 48.41± 5.89 83.76± 2.33 83.81± 2.33

24h after admission 91.44± 1.36 56.78± 4.85 83.97± 2.03 83.68± 1.91

24h before discharge 92.80± 1.10 74.14± 2.90 85.24± 1.57 85.13± 1.55

12h before discharge 94.84± 0.92 78.76± 2.72 87.40± 1.57 87.44± 1.52

LR

12h after admission 89.32± 1.68 43.99± 5.55 81.40± 2.65 81.51± 2.63

24h after admission 89.85± 1.55 50.69± 5.36 82.18± 1.93 82.35± 1.96

24h before discharge 91.55± 1.03 69.48± 3.20 83.88± 1.59 83.98± 1.52

12h before discharge 93.97± 0.99 76.15± 2.75 86.97± 1.68 87.01± 1.61


5.7 Comparison with similar studies

A comparison of performance metrics between studies is presented in table 5.11. Not all the metrics used in this work are represented, due to the lack of information in the studies.

Compared to [57], where the cohort is restricted to diabetic patients, the model developed in this work achieved better results. Using a small subset with just five variables (last_gcs, respiratory_sys, aniongap_mean, platelet_min and age), the results were better, with an AUC of 88.8 achieved in this work compared to the 78.7 achieved in that study. It should be noted that 96.6% of the population in the mentioned study is under insulin therapy, which indicates the similarity between that study's cohort and the cohort of patients in this work.

Compared to the highest performance found in the literature (AUC of 92.7), reported in a study [59] with no restrictions on patient selection, the model achieved a similar performance (AUC of 92.0) despite dealing with a less predictable cohort of patients, as can be seen by comparing the frequently used Simplified Acute Physiology Score II (SAPS II) in each study (SAPS II: AUC = 77.4 in this work vs AUC = 80.9 in the study). The performance presented in this work was also achieved with a smaller feature set (30 features) than in the mentioned study. However, unlike the model proposed here, that study does not take diagnoses into consideration.

Table 5.11: Performance comparison with literature

Study | AUC [%] | Sensitivity [%] | Specificity [%] | No. Patients | No. Features | Info
Johnson et al., 2017 [59] | 92.7 | - | - | 50488 | 144 | No patient restrictions
Anand et al., 2018 [57] | 78.7 | 70 | 73 | 4111 | 5 | Diabetic patients
Models proposed | 92.03± 1.16 | 84.45± 1.74 | 84.45± 1.72 | 9098 | 30 | Patients under insulin therapy
Models proposed | 88.78± 1.39 | 81.33± 1.68 | 81.28± 1.72 | 9098 | 5 | Patients under insulin therapy


5.8 External Validation - eICU-CRD database

The resulting dataset extracted from the eICU-CRD database is detailed in appendix A.

Models trained with all the data from the MIMIC database and with the feature subsets obtained in section 5.4.3 were validated on an external dataset from the eICU-CRD database [106]. These validation results are detailed in table 5.12.

The results are compared with the previous 5x10 cross-validation results using only the MIMIC database, and also with the results of using the same 5x10 cross-validation for training but always using the patients from the eICU-CRD database as the testing fold.

There is a slight decrease in performance in external validation, around 2% in AUC and 4% in AUPRC, when compared with the results using only the MIMIC database.

Focusing only on the external validation and the cross-validation using eICU-CRD as the testing fold, a consistent increase is verified for external validation across all subsets tested and all metrics. This may be due to the higher number of patients used to train the model.

This result indicates how important it would be to have an even more representative dataset and suggests that, in the future, integrating the eICU-CRD dataset into model training may result in an even more robust model with better predictive performance.

Overall, the model with the highest number of features (7) achieved the best performance in external validation (AUC of 87.99 and AUPRC of 47.09). However, the models with fewer features also achieved interesting results.

Table 5.12: Performance metrics for external validation with eICU database

Method (all with LGB) | AUC [%] | AUPRC [%] | Sensitivity [%] | Specificity [%] | No. Features
External Validation (MIMIC+eICU) | 87.99 | 47.09 | 81.17 | 81.26 | 7
External Validation (MIMIC+eICU) | 86.79 | 45.61 | 79.91 | 79.92 | 5
External Validation (MIMIC+eICU) | 84.11 | 41.83 | 77.68 | 77.76 | 3
5x10 Cross-Validation (MIMIC+eICU) | 87.62± 0.28 | 46.08± 0.81 | 81.15± 0.46 | 81.14± 0.46 | 7
5x10 Cross-Validation (MIMIC+eICU) | 86.67± 0.31 | 45.23± 0.83 | 79.59± 0.47 | 79.59± 0.48 | 5
5x10 Cross-Validation (MIMIC+eICU) | 83.89± 0.56 | 41.42± 0.77 | 77.32± 1.04 | 77.31± 1.03 | 3
5x10 Cross-Validation (MIMIC) | 90.17± 1.34 | 52.41± 4.60 | 82.30± 1.96 | 82.17± 1.85 | 7
5x10 Cross-Validation (MIMIC) | 88.78± 1.39 | 48.94± 4.54 | 81.33± 1.68 | 81.28± 1.72 | 5
5x10 Cross-Validation (MIMIC) | 86.85± 1.66 | 45.73± 4.54 | 79.26± 1.89 | 79.32± 1.89 | 3


Chapter 6

Model Analysis and Interpretation

This chapter presents an analysis and interpretation of the results of the proposed model. First, an interpretation of which features are more relevant and how each feature influences the outcome is carried out in section 6.1. A detailed analysis of the features with high importance and of insulin-related features is then conducted, corroborated with medical studies when possible. From another perspective, section 6.2 presents individual clinical dashboards showing which features are influencing each patient individually.

6.1 Model Interpretation

First-day data, together with the cross-validation fold that achieved the best relation between AUC and AUPRC, was selected to interpret the model. All features were used, since the overall performance did not increase significantly during the feature selection process (section 5.4.3). This analysis, taking all features into consideration, does not invalidate the feature selection previously performed; it only provides insight into how the features influence the final outcome.

The SHAP values provide the importance of each feature. Figure 6.1 shows the 20 most important features for each model on the selected fold. Comparing both models, some features are important to both for predicting patient mortality during the ICU stay. The last GCS value (last_gcs) is the feature with the highest importance for both models. The patient's age (age) and the time a patient spent under mechanical ventilation (ventilation_time) are also among the features that most influenced the models' outputs. These features will be used in the following analyses.

For visualization purposes, and to know whether features affect the outcome negatively or positively, dots representing each patient are plotted horizontally by their SHAP value (figure 6.2) and coloured by their nominal feature value, from low (blue in LGB / green in LR) to high (red in LGB / orange in LR). As an example, glucose_mean values can range between 40 mg/dL (blue or green), represented as Low, and 500 mg/dL (red or orange), represented as High.

Moreover, dots are stacked vertically when they run out of space, creating a density effect that makes it possible to see how patients concentrate around a given SHAP value [82]. It is also noteworthy that negative SHAP values favor class 0 (Alive) and positive SHAP values favor class 1 (Dead).

(a) Feature importance for LGB model (b) Feature importance for LR model

Figure 6.1: Features' importance ranked through SHAP values.

(a) SHAP values for LGB (b) SHAP values for LR

Figure 6.2: SHAP values for the 20 most important features.
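Plots like figures 6.1 and 6.2 (and the per-feature scatter plots later in this chapter) are produced with the shap plotting helpers; the calls below are illustrative, assuming the SHAP values sv and feature matrix X_fold computed as described in section 4.6.2:

    import shap

    # Bar plot of mean |SHAP| per feature: global ranking, as in figure 6.1
    shap.summary_plot(sv, X_fold, plot_type="bar")

    # Beeswarm plot of per-patient SHAP values coloured by feature value, as in figure 6.2
    shap.summary_plot(sv, X_fold)

    # Feature value vs SHAP value for a single covariate, as in figures 6.3-6.8
    shap.dependence_plot("last_gcs", sv, X_fold, interaction_index=None)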

A characteristic of some features is the presence of long tails on only one side of the plots, especially in the LGB model due to its non-linearity. For example, the general pattern of long tails reaching to the right, but not to the left, means that extreme values of these features can significantly increase mortality, but low values cannot significantly lower that risk. Long tails on the right side of the plots indicate a higher risk of mortality; conversely, a left-sided tail indicates a higher probability of survival.

To give some examples for better interpretation: lower values of last_gcs indicate an especially high risk of mortality, while higher values have less influence on patients' survival; ventilation_time has more influence on mortality as its value increases, but lower values have almost no influence on the chances of survival; diseases of human body systems (respiratory_sys and nervous_sys) also influence mortality, but their absence has little effect; and the mean respiratory rate (resprate_mean) influences the final outcome more as its nominal value increases.

Furthermore, features like age show a gradual influence of the nominal values. Lower values have a high impact on survival chances but, as the value increases, this impact decreases until a turning point where the risk of mortality begins to be favored.

In the following subsections, some of the most important features are described and analyzed in greater detail.

6.1.1 Glasgow Coma Scale, Age and Ventilation Time

As verified above, the Glasgow Coma Scale (GCS), age and the duration of mechanical ventilation are the three most important features for both LGB and LR according to the ranking list.

Figure 6.3 shows the distribution of patients by SHAP rank and the respective nominal values associated with each of these features. The different behaviour of each algorithm is notable: LR has a linear distribution, compared to the non-linear distribution of LGB.

The covariate last_gcs is a discrete feature with values between 3 and 15. A gradual increase favoring mortality (positive SHAP rank) is seen as the Glasgow Coma Scale decreases. For this model, all feature values above 14 lie below a SHAP rank of 0. In the case of LR, the values are spread along the SHAP rank scale.

In the case of age, mortality begins to be favored around 70 years old for LGB and between [55−75] years old for LR. This trade-off appears in a smoother way in the LR case, due to the already mentioned behaviour of each model. It should also be noted that, in LGB, there is an increased chance of survival for patients below 51 years old and an increased risk of mortality for patients above 75 years old.

Lastly, in LGB, ventilation_time influences survival chances for small values (below 17 hours), but with quite a small impact. The longer a patient stays under ventilation, the higher the risk of mortality, predominantly for patients who spent almost the entire first day (more than 22 hours) under ventilation. For the LR model, values above 10 hours start to favor mortality, while values below favor survival chances.

It should be pointed out that ventilation_time values above 24 hours correspond to patients who entered the ICU already under ventilation.

These analyses could in fact be carried out for all the features in the dataset (except for those with null importance). Nonetheless, from this point on the work focuses on crucial features that can potentially be related to the condition of patients who are under insulin therapy to control their blood glucose.


(a) Last GCS and SHAP values for LGB (b) Last GCS and SHAP values for LR

(c) Age and SHAP values for LGB (d) Age SHAP values for LR

(e) Ventilation time and SHAP values for LGB (f) Ventilation time and SHAP values for LR

Figure 6.3: SHAP ranking and the relationship to different covariates.


6.1.2 Number of Insulin Infusions

The number of insulin infusions (num_infusion) is a feature with relative importance in both models, and a higher number of infusions is associated with a higher chance of survival. The importance of the number of insulin infusions can be seen in figures 6.1 and 6.2.

Figure 6.4a shows that the number of infusions starts to influence the chances of survival at a smaller value for LGB (8 infusions) than for LR, where it is around 10 infusions (figure 6.4b). Indeed, in the LR model, the sample distribution forms a long orange tail on the left side (with the same interpretation as in figure 6.2), which indicates that a higher number of infusions has a much greater influence on the patient's chance of survival and, consequently, is associated with lower patient mortality.

This conclusion is crucial for the discussion between the CIT and IIT regimes. A higher number of infusions is associated with an increased chance of survival; therefore, the results of this work may favour IIT. Even so, an individualized study of each patient is necessary to support such a conclusion, taking into consideration whether the patient is diabetic or a long-term insulin user. Validation by clinicians is also needed.

(a) Number of infusions and SHAP values for the LightGBM model (b) Number of infusions and SHAP values for the LR model

Figure 6.4: Number of infusions and SHAP values for the LightGBM and LR models.

6.1.3 Diabetes and Long-Term Insulin Users

Regarding the influence of diabetes and long-term insulin use, figure 6.5 shows that a patient having type I diabetes or secondary diabetes has null impact on mortality prediction in LGB. However, long-term insulin users and patients with type II diabetes have an increased chance of survival.

In LR, each condition has a positive impact on the chance of survival, while the impact of not having any of these conditions is practically null for the majority of patients. It should be emphasized that being a long-term insulin user and being a type I diabetic are among the highest ranked features in LR (figure 6.1).

From an overall perspective, patients who have any type of diabetes have an increased chance of survival. This may be due to the fact that they have previously been exposed to insulin. In fact, the condition of being a long-term insulin user has the highest influence in both models, which serves as the basis for the previous inference. However, further refinement and delimitation of these groups should be carried out.

(a) Diabetic condition and SHAP values for LGB model (b) Diabetic condition and SHAP values for LR model

Figure 6.5: Diabetic condition and SHAP values for LGB and LR models.

6.1.4 Ethnicity

Regarding ethnicity, it is possible to conclude from figure 6.6 that, in the LGB model, black patients have a higher chance of survival under insulin therapy. This observation is consistent with previous studies [107, 108], in which it was shown that people of African descent have an exacerbated insulin response and a lower insulin sensitivity compared to the remaining ethnicities.

The same is true for the LR model. Additionally, Asian and Hispanic/Latino ethnicity has some importance, albeit lower, for survival. Nonetheless, patients with undefined ethnicity (ethnicity_OTHER) tend to have a higher risk of mortality in both models. This is one of the limitations of the MIMIC-III database.

(a) Ethnicity SHAP values for LGB (b) Ethnicity SHAP values for LR

Figure 6.6: Ethnicity SHAP values

6.1.5 Glucose

Among the discretized time-series features, glucose, respiratory rate and anion gap seem to play an important role in the models. Glucose had already been identified as a crucial feature because it is intrinsically related to many metabolic responses and mainly to insulin administration (section 2.2.2). The built models corroborate this assumption, since the mean and minimum glucose readings were identified as influencing the outcome.

Figures 6.7a,b show that, for mean glucose values, there is a trade-off between patient survival and mortality around 150 mg/dL in both models. Values above this point contribute to patient mortality. As the values increase, there is a small risk escalation for LGB, whereas in the LR model the risk grows proportionally with the mean glucose values.

Minimum glucose values are represented in figures 6.7c,d. In the LR model, the trade-off point lies between 80 and 100 mg/dL, and the risk also grows proportionally as the values increase. However, in the LGB model there are some significant differences (figure 6.7c). Minimum glucose values below 40 mg/dL, representing severe hypoglycemia (level 3, table 2.2), favor patient mortality. Values between 40 and 100 mg/dL mostly favor the chance of survival, despite a few cases where mortality is favored with less impact. Values above 100 mg/dL favor the risk of mortality, which is accentuated in some cases as the values rise.

(a) Glucose mean values and SHAP values for LGB model (b) Glucose mean values and SHAP values for LR model

(c) Glucose minimum values and SHAP values for LGB model (d) Glucose minimum values and SHAP values for LR model

Figure 6.7: Glucose readings vs SHAP values.

6.1.6 Respiratory Rate and Respiratory Diseases

Mean respiratory rate values (resprate mean) and median values (resprate med) are also important features for the models.

Analyzing figure 6.8, there are no significant differences between mean and median values in the LR model. The slope and the associated SHAP values are very similar for both measurements, and the mortality/survival trade-off occurs between 15 and 20 breaths per minute, which coincides with the typical values in adults (table 3.5). Smaller values increase survival chances and higher values increase the risk of mortality. Mean and median are usually highly correlated measurements, which may lead to the importance of a single feature being distributed between the two, as seems to happen in the LR model. Nevertheless, significant differences in specific cases can make a difference in the model's outcome when both features are used, each with a smaller share of the importance, instead of just one carrying the totality of it.

This conclusion can be drawn by looking at figure 6.8 for the LGB model. In this case, values between 8 and 17 for both measurements favor survival chances, although with a small impact. resprate mean values between 17 and 22 tend to have an almost null impact, and as the values increase beyond that range the mortality risk grows, especially above 28 breaths per minute. However, the same is not verified for resprate med: besides a slight increase in mortality risk for values above 25 compared with the trade-off values, the mortality risk tends to remain constant as the values increase.

(a) Respiratory Rate Mean SHAP values for LGB (b) Respiratory Rate Mean SHAP values for LR

(c) Respiratory Rate Median SHAP values for LGB (d) Respiratory Rate Median SHAP Values for LR

Figure 6.8: Respiratory Rate SHAP values

Given that diseases of the respiratory system (respiratory sys) is also a feature with high importance for the model, it is interesting to examine how the respiratory rate features relate to it. That relation is visible in figure 6.9 for the LGB model, where data points are coloured by the number of respiratory diseases. There is a tendency for patients with more identified respiratory diseases to have higher resprate mean and resprate med values, which also represent a higher mortality risk.

The importance of these respiratory-related features in the study may be due, besides other complications, to the patients' resistance to insulin. Some medical studies [109–111] connect insulin resistance with lung disorders, which may be a cause for this outcome.

(a) Mean respiratory rate and SHAP values for the LGB model coloured by number of respiratory diseases (b) Median respiratory rate and SHAP values for the LGB model coloured by the number of respiratory diseases

Figure 6.9: Respiratory rate (respiratory diseases) SHAP values. Respiratory rate (mean and median) and SHAP values are plotted for the LGB model as in figures 6.8a,c, but data points are coloured by the number of respiratory diseases.
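The colouring used in figure 6.9 can be reproduced, under the same assumptions as the previous sketch (model, X and shap_values already computed, and hypothetical column names resprate_mean and respiratory_sys), by explicitly setting the interaction feature of the dependence plot:

```python
import shap
import matplotlib.pyplot as plt

# Assumed: `shap_values` and `X` come from the earlier TreeExplainer sketch.
# Colour each point by the (assumed) respiratory_sys column instead of the
# feature chosen automatically by shap.
shap.dependence_plot(
    "resprate_mean",
    shap_values,
    X,
    interaction_index="respiratory_sys",
    show=False,
)
plt.show()
```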

6.1.7 Anion Gap and Bicarbonate

Finally, mean values of anion gap (aniongap mean) follow an evolution similar to age or glucose mean. Chances of survival decrease from the lowest value up to a trade-off point (17 mEq/L for LGB and between 12 and 18 mEq/L for LR). Beyond that point, the risk of mortality grows as aniongap mean values increase. These deductions can be extracted from figure 6.10.

(a) Mean anion gap and SHAP values for the LGB model (b) Mean anion gap and SHAP values for the LR model

Figure 6.10: Anion gap and SHAP values.

A study [112] concluded that higher values of anion gap and lower values of bicarbonate are associated with insulin resistance. Although insulin resistance is not directly associated with mortality, it is worth investigating a possible meaningful association between them. For that purpose, checking the behaviour of bicarbonate values in the models gives more weight to this assumption.

Bicarbonate is not an important feature in the LR model. However, in the LGB model, mean values of bicarbonate (bicarbonate mean) appear as an important feature (figures 6.1 and 6.2). Figure 6.11 corroborates the previous assumption: there is a turning point (23 mEq/L) from which chances of survival start to increase, so lower values increase the mortality risk and higher values benefit the chances of survival.

Figure 6.11: Bicarbonate Mean SHAP values for LGB


6.2 Individualized Clinical Dashboards

General patterns and conclusions were drawn from the analysis of the whole cohort, but knowing how each feature influences each individual patient might facilitate a personalized diagnosis. Creating a model capable of responding to a patient's particular needs is of high importance, because each individual case is different: what might be relevant for a specific patient can have null or little impact on another. Therefore, an individualized medical record was created for each patient from the LGB model, using the best performing fold from the first-day analysis.

First of all, it is necessary to analyze the model's outputs. As mentioned above (section 4.7), a threshold-based choice categorizes which patients are expected to die or survive from the model's output probabilities. In figure 6.12, the patients above the threshold (t = 0.116) are those predicted to die, while those below are predicted to survive. Nonetheless, it is necessary to understand how far these patients' probabilities are from that threshold in order not to draw inaccurate conclusions. To explain this further, two patients highlighted in figure 6.12 will be used as examples: one represents a mortality prediction (red dot) and the other a survival prediction (blue dot). They will also serve as the basis to explain the concept of individualized clinical dashboards.

Figure 6.12: Patients mortality probability.
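The threshold-based categorization described above can be sketched as follows; model, X_test and the way the probabilities are obtained are assumptions for the example, while the threshold value is the one mentioned in the text.

```python
import numpy as np

# Assumed: `model` is the trained LGB classifier and `X_test` the testing fold.
y_prob = model.predict_proba(X_test)[:, 1]   # predicted probability of mortality
threshold = 0.116                            # threshold selected in section 4.7

# Patients above the threshold are predicted to die, the remaining to survive.
y_pred = (y_prob >= threshold).astype(int)
```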

For this purpose, the 20 features that most affect each patient's outcome were ranked and ordered in a bar plot. For visualization simplicity, features favoring the chance of survival were coloured green while those favoring mortality risk were coloured red (see figures 6.13 and 6.14).

Along with the features, their associated values were also plotted in an adjacent graphic, represented according to their type. Continuous features with a normal range of values are plotted against that range, so that it can be immediately observed whether the value is below, inside or above it. Binary features appear on a dashed line where the associated value (0 or 1) is highlighted. For the remaining features, the values are simply presented.
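A possible sketch of such a dashboard, assuming the per-patient SHAP values computed earlier (shap_values) and the feature names of the test fold, is shown below; the helper name plot_patient_dashboard and the colour convention mirror the description above but are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_patient_dashboard(shap_row, feature_names, top_n=20):
    """Rank the top_n features by absolute SHAP value for one patient and colour
    them by the sign of their contribution (red = towards mortality, green = towards survival)."""
    order = np.argsort(np.abs(shap_row))[::-1][:top_n]
    values = shap_row[order]
    names = [feature_names[i] for i in order]
    colors = ["red" if v > 0 else "green" for v in values]

    positions = list(range(top_n))[::-1]     # most important feature on top
    plt.barh(positions, values, color=colors)
    plt.yticks(positions, names)
    plt.xlabel("SHAP value (impact on mortality prediction)")
    plt.tight_layout()
    plt.show()

# Example for the i-th patient of the test fold:
# plot_patient_dashboard(shap_values[i], list(X_test.columns))
```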

Figure 6.13 presents the clinical dashboard for a patient that is expected to survive. Comparing this figure with figure 6.1, where features are ranked over the whole cohort, it is noticeable that most features appear in both figures but with different ranks of importance, which immediately indicates an individualization of the diagnosis.

Figure 6.13: Clinical dashboard for a patient expected to survive

In a general and quick diagnosis for this patient, the number of abnormal creatinine measurements registered (creatinine flag), the mean systolic blood pressure value (sysbp mean) and the minimum blood urea nitrogen value (bun min) registered at that moment, along with the patient's age (age), are the variables that should most concern physicians when treating the patient.

In contrast, figure 6.14 shows a medical record for a patient expected to die. The number of features indicating mortality risk evidences the critical health status of the patient, which should lead to redoubled care focused on the variables that present the greatest risk.

Figure 6.14: Clinical dashboard for a patient expected to die


Comparing figure 6.12 with the real outcomes of the patients, by coloring the dots in figure 6.15, it is possible to conclude that the above predictions were correct. It is also worth noting that the majority of patients who actually died (red dots) were correctly classified and lie above the threshold. Considering that the testing fold is composed of 911 patients, of which only 83 actually died, and that among those only 11 were misclassified, 4 of them quite near the threshold, this mirrors well the predictive capacity of the model. Still, it cannot be excluded that these misclassified patients were in extremely poor health conditions that influenced the outcome, an observation reinforced by the high mortality probabilities attributed to 106 patients who actually survived.

Figure 6.15: Patients mortality probability coloured by real outcomes

In conclusion, these mortality probabilities, coupled with individualized clinical dashboards, give physicians a faster diagnosis and a general overview of the patient's health status. Together with clinical knowledge, those records can support physicians' decisions, leading to faster treatment that may eventually make a difference in the patient's real outcome.


Chapter 7

Conclusions

The main purpose of this thesis was to predict mortality in patients admitted to an ICU who were under insulin therapy. In a work mainly focused on the first 24 hours spent in the care unit, machine learning, data sampling and feature selection techniques were tested and compared in order to achieve an optimal predictive performance. The work was extended to different time windows within the ICU stay, in the perspective of real-time mortality prediction. The variables that most affect patient outcome were identified and interpreted with recourse to medical studies. From a medical decision support point of view, individualized clinical dashboards were designed from the models created.

7.1 Achievements

The work developed demonstrated that gradient boosting (LGB) models showed a substantial improvement over all the other models tested, presenting the following performance metrics: AUC of 91.44 ± 1.36, AUPRC of 56.78 ± 4.85, sensitivity of 83.97 ± 2.03 and specificity of 83.68 ± 1.91.

Data sampling techniques used to counteract the imbalance present in the dataset proved not to be beneficial, as performance decreased both for under- and oversampling.

Regarding feature selection, a subset of 30 features resulting from SFS achieved the best performance, with an AUC of 92.03 ± 1.16, an AUPRC of 57.05 ± 5.28, a sensitivity of 84.45 ± 1.74 and a specificity of 84.45 ± 1.72. Notwithstanding, considerably smaller subsets achieved quite interesting results, although slightly lower than those mentioned above.

Among all tested feature selection techniques, namely RFE, SFS and RFS, the novel technique proposed in this work (RFS) achieved the best results for smaller subsets of features (3, 5 and 7), which were used for external validation with data from the eICU-CRD database. The best validation performance achieved an AUC of 87.99, an AUPRC of 47.09, a sensitivity of 81.17 and a specificity of 81.26 with a subset of 7 features. Reaching this performance with a reduced number of features, and on a completely different dataset made up of different sources of information (208 hospitals), presents an extra argument in favor of the model constructed.

In a real-time prediction perspective, when using information up to the patients' ICU discharge, whether dead or alive, the predictive performance was higher, with an AUC of 94.84 ± 0.92, an AUPRC of 78.76 ± 2.72, a sensitivity of 87.40 ± 1.57 and a specificity of 87.44 ± 1.52 for the 12 hours prior to ICU discharge. This is possible because patients had already begun to show signs of improvement or worsening of their clinical status.

Additionally, the variables that most influence the models and the respective mortality prediction were identified. However, as explained and analyzed during model interpretation, each variable's importance is patient-dependent, and overall importance may differ from individual importance.

Finally, this work presents the construction of individualized clinical dashboards, which may be an important tool for data-aided decisions by physicians.

7.2 Comparison with Previous Works

Compared to [57], where the cohort is restricted to diabetic patients, the model developed in this work achieved better results. Using a small subset with just five variables (last gcs, respiratory sys, aniongap mean, platelet min and age), an AUC of 88.8 was achieved in this work compared to 78.7 in that study. It is worth remarking that 96.6% of the population in the mentioned study is under insulin therapy, which indicates a good basis for comparison.

Following the approach proposed by [58] of using statistical features for time-series variables, the mean stood out among all statistical features used, followed by the minimum and maximum. The variance proved to be useless for the model, presenting almost null importance. Due to major differences in the patient cohorts and in the variables used in the two studies, performances are not comparable.

Compared to the highest performance found in the literature (AUC of 92.7), a study [59] with no restrictions on patient selection, the model has a similar performance (AUC of 92.0) despite dealing with a less predictable cohort of patients, as can be seen by comparing the frequently used Simplified Acute Physiology Score II (SAPS II) associated with each study (SAPS II: AUC = 77.4 vs AUC = 80.9). The performance presented in this work was also achieved with a smaller subset (30 features) compared to the mentioned study. However, that study does not take diagnoses into consideration as the model proposed here does.

7.3 Future Work

The work developed has room to evolve in a multitude of ways.

Firstly, in the context of continuing the insulin-related work, it may focus on studying the influence of different-acting types of insulin (i.e. short, intermediate and long acting) and on considering other common inputs for these patients (e.g. dextrose boluses and glycated hemoglobin).

A more ambitious task would be to predict the type and amount of insulin needed according to each patient's necessities, and to predict its influence not only on glucose variability but also on the remaining variables, both those identified as having an important role in the course of this thesis and those that may yet prove to be important.

Lastly, it is important to highlight that further validation of these models and of the conclusions drawn is needed on the part of physicians, to improve and validate their interpretability.


Still in the medical field, the work can be extended to the entire patient cohort with fewer restrictions, or directed to other areas of medical interest in order to reach different conclusions.

From another perspective, the model construction process, from data treatment to hyperparameter tuning and model interpretation, can be reused in distinct studies in the most varied areas, as long as data are available.


Bibliography

[1] P. Domingos. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake

Our World. Basic Books, Inc., 2018.

[2] I. A. Berg, O. E. Khorev, A. I. Matvevnina, and A. V. Prisjazhnyj. Machine learning in smart home

control systems - algorithms and new opportunities. AIP Conference Proceedings, 1906(1):070007,

2017.

[3] The Economist. The world's most valuable resource is no longer oil, but data, 2017. URL www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.

[4] E. Ahmed, I. Yaqoob, I. A. T. Hashem, I. Khan, A. I. A. Ahmed, M. Imran, and A. V. Vasilakos. The

role of big data analytics in Internet of Things. Computer Networks, 129:459–471, 2017.

[5] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth. Machine

learning for internet of things data analysis: a survey. Digital Communications and Networks, 4(3):

161–175, 2018.

[6] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55

(10):78, 2012.

[7] J. Rowley. The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information

Science, 33(2):163–180, 2007.

[8] N. Jothi, N. A. Rashid, and W. Husain. Data Mining in Healthcare - A Review. Procedia Computer

Science, 72:306–313, 2015.

[9] E. M. Beulah, S. N. S. Rajini, and N. Rajkumar. Application of Data mining in healthcare: A survey.

Asian Journal of Microbiology, Biotechnology and Environmental Sciences, 18(4):1001–1003, 2016.

[10] F. Guiza, J. Van Eyck, and G. Meyfroidt. Predictive data mining on monitoring data from the intensive

care unit. Journal of Clinical Monitoring and Computing, 27(4):449–453, 2013.

[11] T. J. Pollard and L. A. Celi. Enabling Machine Learning in Critical Care. ICU management & practice,

17(3):198–199, 2017.


[12] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. Data mining and knowledge discovery in databases. Commun. American Association for Artificial Intelligence, 17(3):82–88, 1996.

[13] R. Wirth and J. Hipp. CRISP-DM: Towards a standard process model for data mining. Proceedings

of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data

Mining, (24959):29–39, 2000.

[14] A. G. Pittas, R. D. Siegel, and J. Lau. Insulin Therapy for Critically Ill Hospitalized Patients: A Meta-

analysis of Randomized Controlled Trials. Archives of Internal Medicine, 164(18):2005–2011, 10 2004.

[15] J. Clain. Glucose control in critical care. World Journal of Diabetes, 6(9):1082, 2015.

[16] M. E. McDonnell and G. E. Umpierrez. Insulin Therapy for the Management of Hyperglycemia in

Hospitalized Patients. Endocrinol Metab Clin North Am, 41(1):175–201, 2012.

[17] J. C. Preiser, J. G. Chase, R. Hovorka, J. I. Joseph, J. S. Krinsley, C. De Block, T. Desaive, L. Foubert,

P. Kalfon, U. Pielmeier, T. Van Herpe, and J. Wernerman. Glucose Control in the ICU: A Continuing

Story. Journal of Diabetes Science and Technology, 10(6):1372–1381, 2016.

[18] M. Haluzik, M. Mraz, P. Kopecky, M. Lips, and S. Svacina. Glucose control in the ICU: Is there a time

for more ambitious targets again? Journal of Diabetes Science and Technology, 8(4):652–657, 2014.

[19] F. G. Smith, A. M. Sheehy, J. L. Vincent, and D. B. Coursin. Critical illness-induced dysglycaemia:

Diabetes and beyond. Critical Care, 14(6):4–6, 2010.

[20] J. M. Boutin and L. Gauthier. Insulin infusion therapy in critically ill patients. Canadian Journal of

Diabetes, 38(2):144–150, 2014.

[21] C. De Block, B. Manuel-Y-Keenoy, L. Van Gaal, and P. Rogiers. Intensive insulin therapy in the intensive

care unit: Assessment by continuous glucose monitoring. Diabetes Care, 29(8):1750–1756, 2006.

[22] P. E. Cryer. Hypoglycaemia: The limiting factor in the glycaemic management of Type I and Type II

diabetes. Diabetologia, 45(7):937–948, 2002.

[23] J.-C. Lacherade, S. Jacqueminet, and J.-C. Preiser. An overview of hypoglycemia in the critically ill.

Journal of Diabetes Science and Technology, 3(6):1242–1249, 2009.

[24] A. E. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard. The MIMIC Code Repository: Enabling

reproducibility in critical care research. Journal of the American Medical Informatics Association, 25(1):

32–39, 2018.

[25] Agency for Healthcare Research and Quality. HCUP CCS. Healthcare Cost and Utilization Project

(HCUP), 2017. URL www.hcup-us.ahrq.gov/toolssoftware/ccs/ccs.jsp.

[26] UAGC. Homeostasis and immunity - overview. URL http://anatomysciences.com/wp-content/

uploads/2018/03/human-body-systems-human-body-systems-photos-anatomy-human.jpg.

[27] J. B. Reece and N. A. Campbell. Campbell Biology. 9th ed. edition, 2012.


[28] J. Torday. Homeostasis as the Mechanism of Evolution. Biology, 4(3):573–590, 2015.

[29] C. Uluseker, G. Simoni, L. Marchetti, M. Dauriz, A. Matone, and C. Priami. A closed-loop multi-level

model of glucose homeostasis. PLoS ONE, 13(2):1–23, 2018.

[30] Diabetes - The global diabetes community. Diabetes types, 2017. URL https://www.diabetes.co.

uk/diabetes-types.html.

[31] The Nobel Prize. The nobel prize in physiology or medicine, 1923. URL www.nobelprize.org/prizes/

medicine/1923/summary/.

[32] Diabetes Education Online. Types of insulin, 2018. URL dtc.ucsf.edu/types-of-diabetes/

type2/treatment-of-type-2-diabetes/medications-and-therapies/type-2-insulin-rx/

types-of-insulin/.

[33] J. C. Marshall, L. Bosco, N. K. Adhikari, B. Connolly, J. V. Diaz, T. Dorman, R. A. Fowler, G. Meyfroidt,

S. Nakagawa, P. Pelosi, J. L. Vincent, K. Vollman, and J. Zimmerman. What is an intensive care unit?

A report of the task force of the World Federation of Societies of Intensive and Critical Care Medicine.

Journal of Critical Care, 37:270–276, 2017.

[34] E. S. Moghissi, M. T. Korytkowski, M. DiNardo, D. Einhorn, R. Hellman, I. B. Hirsch, S. E. Inzucchi,

F. Ismail-Beigi, M. S. Kirkman, and G. E. Umpierrez. American Association of Clinical Endocrinologists

and American Diabetes Association consensus statement on inpatient glycemic control. Diabetes Care,

32(6):1119–1131, 2009.

[35] American Diabetes Association. Standards of Medical Care in Diabetes. Diabetes Care, 41(Supplement

1):S1–S2, 2018.

[36] P. E. Marik and R. Bellomo. Stress hyperglycemia: an essential survival response! Critical Care, 17(2):

305, 2013.

[37] J. Jacobi, N. Bircher, J. Krinsley, M. Agus, S. S. Braithwaite, C. Deutschman, A. X. Freire, D. Geehan,

B. Kohl, S. A. Nasraway, M. Rigby, K. Sands, L. Schallom, B. Taylor, G. Umpierrez, J. Mazuski, and

H. Schunemann. Guidelines for the use of an insulin infusion for the management of hyperglycemia in

critically ill patients. Critical Care Medicine, 40(12):3251–3276, 2012.

[38] S. R. Heller. Glucose Concentrations of Less Than 3.0 mmol/L (54 mg/dL) Should Be Reported in

Clinical Trials: A Joint Position Statement of the American Diabetes Association and the European

Association for the Study of Diabetes. Diabetes Care, 40(1):155–157, 2017.

[39] K. Malmberg, L. Ryden, S. Efendic, J. Herlitz, P. Nicol, A. Waldenstrom, H. Wedel, and L. Welin.

Randomized trial of insulin-glucose infusion followed by subcutaneous insulin treatment in diabetic

patients with acute myocardial infarction (DIGAMI study): Effects on mortality at 1 year. Journal of

the American College of Cardiology, 26(1):57–65, 1995.


[40] G. Van den Berghe, P. Wouters, F. Weekers, C. Verwaest, F. Bruyninckx, M. Schetz, D. Vlasselaers,

P. Ferdinande, P. Lauwers, and R. Bouillon. Intensive Insulin Therapy in Critically Ill Patients. New

England Journal of Medicine, 345(19):1359–1367, nov 2001.

[41] G. Van den Berghe, A. Wilmer, G. Hermans, W. Meersseman, P. J. Wouters, I. Milants, E. Van

Wijngaerden, H. Bobbaers, and R. Bouillon. Intensive Insulin Therapy in the Medical ICU. New England

Journal of Medicine, 354(5):449–461, feb 2006.

[42] W. M. Clark, W. Brooks, A. Mackey, M. D. Hill, P. P. Leimgruber, A. J. Sheffet, D. Ph, V. J. Howard,

D. Ph, W. S. Moore, J. H. Voeks, D. Ph, L. N. Hopkins, D. E. Cutlip, D. J. Cohen, J. J. Popma,

R. D. Ferguson, S. N. Cohen, J. L. Blackshear, F. L. Silver, J. P. Mohr, B. K. Lal, J. F. Meschia, and

C. Investigators. Intensive versus Conventional Glucose Control in Critically Ill Patients. New England

Journal of Medicine, 360(13):11–23, 2016.

[43] J. C. Preiser, P. Devos, S. Ruiz-Santana, C. Melot, D. Annane, J. Groeneveld, G. Iapichino, X. Leverve,

G. Nitenberg, P. Singer, J. Wernerman, M. Joannidis, A. Stecher, and R. Chiolero. A prospective

randomised multi-centre controlled trial on tight glucose control by intensive insulin therapy in adult

intensive care units: The Glucontrol study. Intensive Care Medicine, 35(10):1738–1748, 2009.

[44] P. Kalfon, B. Giraudeau, C. Ichai, A. Guerrini, N. Brechot, R. Cinotti, P. F. Dequin, B. Riu-Poulenc,

P. Montravers, D. Annane, H. Dupont, M. Sorine, and B. Riou. Tight computerized versus conventional

glucose control in the ICU: A randomized controlled trial. Intensive Care Medicine, 40(2):171–181, 2014.

[45] G. Y. Gandhi, G. A. Nuttall, M. D. Abel, C. J. Mullany, H. V. Schaff, P. C. O’Brien, M. G. Johnson,

A. R. Williams, S. M. Cutshall, L. M. Mundy, R. A. Rizza, and M. M. McMahon. Intensive Intraoperative

Insulin Therapy versus Conventional Glucose Management during Cardiac Surgery. Annals of Internal

Medicine Article, 146(4):233–243, 2007.

[46] Y. M. Arabi, O. C. Dabbagh, H. M. Tamim, A. A. Al-Shimemeri, Z. A. Memish, S. H. Haddad, S. J.

Syed, H. R. Giridhar, A. H. Rishu, M. O. Al-Daker, S. H. Kahoul, R. J. Britts, and M. H. Sakkijha.

Intensive versus conventional insulin therapy: A randomized controlled trial in medical and surgical

critically ill patients. Critical Care Medicine, 36(12):3190–3197, 2008.

[47] G. Del Carmen De La Rosa, J. H. Donado, A. H. Restrepo, A. M. Quintero, L. G. Gonzalez, N. E.

Saldarriaga, M. Bedoya, J. M. Toro, J. B. Velasquez, J. C. Valencia, C. M. Arango, P. H. Aleman,

E. M. Vasquez, J. C. Chavarriaga, A. Yepes, W. Pulido, and C. A. Cadavid. Strict glycaemic control

in patients hospitalised in a mixed medical and surgical intensive care unit: A randomised clinical trial.

Critical Care, 12(5):1–9, 2008.

[48] F. M. Brunkhorst, C. Engel, F. Bloos, A. Meier-Hellmann, M. Ragaller, N. Weiler, O. Moerer, M. Gru-

endling, M. Oppert, S. Grond, D. Olthoff, U. Jaschinski, S. John, R. Rossaint, T. Welte, M. Schaefer,

P. Kern, E. Kuhnt, M. Kiehntopf, C. Hartog, C. Natanson, M. Loeffler, and K. Reinhart. Intensive

Insulin Therapy and Pentastarch Resuscitation in Severe Sepsis. New England Journal of Medicine, 358

(2):125–139, 2008.


[49] I. M. Mackenzie and A. Ercole. Glycaemic control and outcome in general intensive care : the East

Anglian GLYCOGENIC study. British Journal of Intensive Care, (December):121–126, 2008.

[50] M. Yang, Q. Guo, X. Zhang, S. Sun, Y. Wang, L. Zhao, E. Hu, and C. Li. Intensive insulin therapy on

infection rate, days in NICU, in-hospital mortality and neurological outcome in severe traumatic brain

injury patients: A randomized controlled trial. International Journal of Nursing Studies, 46(6):753–758,

2009.

[51] F. Bilotta, R. Caramia, F. P. Paoloni, R. Delfini, and G. Rosa. Safety and efficacy of intensive insulin

therapy in critical neurosurgical patients. Anesthesiology, 110(3):611—619, 2009.

[52] T. C. S. Investigators. Corticosteroid Treatment and Intensive Insulin Therapy for Septic Shock in

Adults. JAMA: The Journal of the American Medical Association, 303(4):341–348, 2013.

[53] S. P. Desai, L. L. Henry, S. D. Holmes, S. L. Hunt, C. T. Martin, S. Hebsur, and N. Ad. Strict versus

liberal target range for perioperative glucose in patients undergoing coronary artery bypass grafting:

A prospective randomized controlled trial. Journal of Thoracic and Cardiovascular Surgery, 143(2):

318–325, 2012.

[54] K. Giakoumidakis, R. Eltheni, E. Patelarou, S. Theologou, V. Patris, N. Michopanou, T. Mikropoulos,

and H. Brokalaki. Effects of intensive glycemic control on outcomes of cardiac surgery. Heart and Lung:

Journal of Acute and Critical Care, 42(2):146–151, 2013.

[55] D. Macrae, R. Grieve, E. Allen, Z. Sadique, K. Morris, J. Pappachan, R. Parslow, R. C. Tasker, and

D. Elbourne. A Randomized Trial of Hyperglycemic Control in Pediatric Intensive Care. New England

Journal of Medicine, 370(2):107–118, 2014.

[56] A. E. Johnson, T. J. Pollard, L. Shen, L. W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits,

L. Anthony Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data,

3:1–9, 2016.

[57] R. S. Anand, P. Stey, S. Jain, D. R. Biron, H. Bhatt, K. Monteiro, E. Feller, M. L. Ranney, I. N. Sarkar,

and E. S. Chen. Predicting mortality in diabetic icu patients using machine learning and severity in-

dices. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational

Science, 2017:310—319, 2018.

[58] R. Sadeghi, T. Banerjee, and W. Romine. Early hospital mortality prediction using vital signals. CoRR,

abs/1803.06589, 2018.

[59] A. E. W. Johnson and R. G. Mark. Real-time mortality prediction in the intensive care unit. AMIA ...

Annual Symposium proceedings. AMIA Symposium, 2017:994–1003, 2017.

[60] American Board of Internal Medicine. ABIM Laboratory Reference Ranges – July 2014. 6(July):3–10,

2014.


[61] S. Gorman, A. Hauber, M. Kroohs, E. Moritz, and B. Sanders. Laboratory Values Interpretation Re-

source Academy of Acute Care Physical Therapy – APTA Task Force on Lab Values Evolution of the

2017 Edition of the Laboratory Values Interpretation Resource by the Academy of Acute Care Physical

Therapy. pages 1–42, 2017.

[62] P. Eaton. Clinical Biochemistry Reference Ranges. 10(2):1–20, 2014.

[63] C. H. Lee and H.-J. Yoon. Medical big data: promise and challenges. Kidney Research and Clinical

Practice, 36(1):3–11, 2017.

[64] H. Kang. The prevention and handling of the missing data. Korean Journal of Anesthesiology, 64(5):

402–406, 2013.

[65] F. Cismondi, A. S. Fialho, S. M. Vieira, S. R. Reti, J. M. Sousa, and S. N. Finkelstein. Missing data in

medical databases: Impute, delete or classify? Artificial Intelligence in Medicine, 58(1):63–72, 2013.

[66] G. M. O’Reilly, P. A. Cameron, and D. J. Jolley. Which patients have missing data? An analysis of

missingness in a trauma registry. Injury, 43(11):1917–1923, 2012.

[67] H. Motoda and H. Liu. Feature selection extraction and construction. 2002.

[68] P. Sondhi. Feature construction methods : A survey. 2009.

[69] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. Smote: Synthetic minority over-

sampling technique. J. Artif. Int. Res., 16(1):321–357, June 2002.

[70] I. Tomek. Two modifications of cnn. 1976.

[71] I. Tomek. An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man,

and Cybernetics, SMC-6(6):448–452, 1976.

[72] J. Laurikkala. Improving identification of difficult small classes by balancing class distribution. In

Proceedings of the 8th Conference on AI in Medicine in Europe: Artificial Intelligence Medicine, AIME

’01, pages 63–66, 2001.

[73] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL

www.scipy.org.

[74] W. Mckinney. Python data analysis library, 2014–. URL pandas.pydata.org.

[75] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,

R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-

esnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830,

2011.

[76] G. Lemaître, F. Nogueira, and C. K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of

imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.


[77] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22Nd ACM

SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

[78] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-y. Liu. LightGBM : A Highly

Efficient Gradient Boosting Decision Tree. (Nips):1–9, 2017.

[79] A. V. Dorogush, A. Gulin, G. Gusev, N. Kazeev, L. O. Prokhorenkova, and A. Vorobev. Catboost:

unbiased boosting with categorical features. CoRR, abs/1706.09516, 2017.

[80] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):

90–95, 2007.

[81] S. Raschka. Mlxtend: Providing machine learning and data science utilities and extensions to python’s

scientific computing stack. The Journal of Open Source Software, 3(24), Apr. 2018.

[82] S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. CoRR, abs/1705.07874,

2017.

[83] M. Wilcox. Occam's razor and machine learning. URL https://www.teradata.com/Blogs/Occam's-razor-and-machine-learning#/.

[84] C. Molnar. Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/, 2018.

[85] S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31:

249–268, 2007.

[86] D. R. Cox. The regression analysis of binary sequences. Journal of the Royal Statistical Society. Series

B (Methodological), 20(2):215–242, 1958.

[87] N. d. Condorcet. Essai sur l’application de l’analyse a la probabilite des decisions rendues a la pluralite

des voix. Cambridge Library Collection - Mathematics. 2014.

[88] F. Galton. Vox populi. Nature, 75:450–451.

[89] J. Surowiecki. The Wisdom of Crowds. Anchor, 2005.

[90] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.

[91] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.

[92] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the

Thirteenth International Conference on International Conference on Machine Learning, ICML’96, pages

148–156, 1996.

[93] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:

1189–1232, 2000.

[94] T. K. Ho. Random decision forests. In Proceedings of the Third International Conference on Document

Analysis and Recognition (Volume 1) - Volume 1, ICDAR ’95, pages 278–, 1995.


[95] L. Valiant. Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a

Complex World. Basic Books, Inc., 2013.

[96] M. Kearns. Thoughts on hypothesis boosting. Unpublished, 1988.

[97] R. E. Schapire. A brief introduction to boosting. In Proceedings of the 16th International Joint Confer-

ence on Artificial Intelligence - Volume 2, IJCAI’99, pages 1401–1406, 1999.

[98] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application

to boosting. Journal of Computer and Systems Science, 55(1):119–139, Aug. 1997.

[99] L. Breiman. Prediction games and arcing algorithms. Neural Computation, 11(7):1493–1517, 1999.

[100] J. N. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Proceedings of the 24th

ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, pages

2367–2376, 2018.

[101] A. Swalin. Catboost vs. light gbm vs. xgboost – towards data science, Mar 2018. URL https:

//towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db.

[102] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support

vector machines. Machine Learning, 46(1):389–422, 2002.

[103] S. M. Lundberg, G. G. Erion, and S. Lee. Consistent individualized feature attribution for tree ensembles.

CoRR, abs/1802.03888, 2018.

[104] B. Hayes. Demystifying confusion matrix. URL http://benhay.es/posts/

demystifying-confusion-matrix/.

[105] T. Saito and M. Rehmsmeier. The precision-recall plot is more informative than the roc plot when

evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3):1–21, 03 2015.

[106] T. J. Pollard, A. E. W. Johnson, J. D. Raffa, L. A. Celi, and R. G. Mark. The eICU Collaborative

Research Database , a freely available multi-center database for critical care research. Scientific Data,

5:1–13, 2018.

[107] T. C. Hyatt, R. P. Phadke, G. R. Hunter, N. C. Bush, A. J. Munoz, and B. A. Gower. Insulin sensitivity

in African-American and white women: association with inflammation. Obesity (Silver Spring, Md.), 17

(2):276–282, 2009.

[108] K. Kodama, D. Tojjar, S. Yamada, K. Toda, C. J. Patel, and A. J. Butte. Ethnic differences in the

relationship between insulin sensitivity and insulin response: A systematic review and meta-analysis.

Diabetes Care, 36(6):1789–1796, 2013.

[109] S. Singh, Y. S. Prakash, A. Linneberg, and A. Agrawal. Insulin and the Lung: Connecting Asthma and

Metabolic Syndrome. Journal of Allergy, 2013:1–8, 2013.


[110] G. Sagun, C. Gedik, E. Ekiz, E. Karagoz, M. Takir, and A. Oguz. The relation between insulin resistance

and lung function: A cross sectional study. BMC Pulmonary Medicine, 15(1):1–8, 2015.

[111] G. Piazzolla, A. Castrovilli, V. Liotino, M. R. Vulpi, M. Fanelli, A. Mazzocca, M. Candigliota, E. Berardi,

O. Resta, C. Sabba, and C. Tortorella. Metabolic syndrome and Chronic Obstructive Pulmonary Disease

(COPD): The interplay among smoking, insulin resistance and vitamin D. PLoS ONE, 12(10), 2017.

[112] W. R. Farwell and E. N. Taylor. Serum bicarbonate, anion gap and insulin resistance in the national

health and nutrition examination survey. Diabetic Medicine, 25(7):798–804, 2008.

[113] A. Natekin and A. Knoll. Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, 7, December 2013.


Appendix A

Outlier Detection

Figure A.1: Outliers detection for anion gap variable

Figure A.2: Outliers detection for bicarbonate variable

Figure A.3: Outliers detection for chloride variable


Figure A.4: Outliers detection for creatinine variable

Figure A.5: Outliers detection for hemoglobin variable

Figure A.6: Outliers detection for hematocrit variable

Figure A.7: Outliers detection for MCH variable


Figure A.8: Outliers detection for MCHC variable

Figure A.9: Outliers detection for MCV variable

Figure A.10: Outliers detection for Platelet variable

Figure A.11: Outliers detection for RBC variable


Figure A.12: Outliers detection for RDW variable

Figure A.13: Outliers detection for sodium variable

Figure A.14: Outliers detection for BUN variable

Figure A.15: Outliers detection for glucose variable


Appendix B

Gradient Boosting Machines

The gradient boosting mathematical formulation is based on the work of [113].

Function Estimation

Given a dataset $(X, y)_{i=1}^{N}$, where $X = (X_1, ..., X_N)$ refers to the explanatory input variables and $y$ to the output variable, the goal is to reconstruct the functional dependence $F(X) = y$ with an estimate $\hat{F}(X)$ that minimizes a specified loss function $\psi(y, F)$:

$$\hat{F}(X) = \underset{F(X)}{\arg\min}\ \psi(y, F(X)) \qquad \text{(B.1)}$$

Rewriting the estimation in terms of expectations, the equivalent formulation is to minimize the expected loss function over the response variable, $E_y(\psi[y, F(X)])$, conditioned on the observed explanatory data $X$:

$$\hat{F}(X) = \underset{F(X)}{\arg\min}\ E_X\big[E_y(\psi[y, F(X)])\,\big|\,X\big] \qquad \text{(B.2)}$$

To make the function estimation problem tractable, the function space can be restricted to a parametric family of functions $F(X, \theta)$:

$$\hat{F}(X) = F(X, \hat{\theta}) \qquad \text{(B.3)}$$

$$\hat{\theta} = \underset{\theta}{\arg\min}\ E_X\big[E_y(\psi[y, F(X, \theta)])\,\big|\,X\big] \qquad \text{(B.4)}$$

To perform the estimation, iterative numerical procedures are considered.

Numerical optimization

Given $M$ iteration steps, the parameter estimates can be written in incremental form:

$$\hat{\theta} = \sum_{i=1}^{M} \hat{\theta}_i \qquad \text{(B.5)}$$

The simplest and most frequently used parameter estimation procedure is steepest gradient descent.


Given $N$ data points $(X, y)_{i=1}^{N}$, the objective is to decrease the empirical loss function $J(\theta)$ over the observed data:

$$J(\theta) = \sum_{i=1}^{N} \psi(y_i, F(X_i, \theta)) \qquad \text{(B.6)}$$

The steepest descent optimization procedure is organized as in Algorithm 5.

Algorithm 5 Steepest descent optimization
1: Initialize the parameter estimates $\hat{\theta}_0$
2: for $t = 1$ to $M$ do
3:   Compile the current parameter estimate $\hat{\theta}^{t} = \sum_{i=1}^{t-1} \hat{\theta}_i$
4:   Evaluate the gradient of the loss function: $\nabla J(\theta) = \{\nabla J(\theta_i)\} = \big[\frac{\partial J(\theta)}{\partial \theta_i}\big]_{\theta = \hat{\theta}^{t}}$
5:   Calculate the new incremental parameter estimate $\hat{\theta}_t \leftarrow -\nabla J(\theta)$
6:   Add the new estimate $\hat{\theta}_t$ to the ensemble
7: end for
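As a numerical illustration of Algorithm 5, the following minimal sketch (with an assumed quadratic empirical loss and synthetic data) performs steepest descent with a fixed step size:

```python
import numpy as np

# Assumed quadratic empirical loss J(theta) = ||A theta - b||^2 on synthetic data.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)

theta = np.zeros(3)                                # initialize theta_0
step = 0.5 / np.linalg.eigvalsh(A.T @ A).max()     # fixed step size, safe for this loss
for _ in range(200):                               # M iteration steps
    grad = 2 * A.T @ (A @ theta - b)               # gradient of the empirical loss
    theta = theta - step * grad                    # incremental estimate: minus the gradient
```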

Optimization in function space

The optimization in gradient boosting is performed in the function space, through a parameterization of the function estimate $\hat{F}$ in additive functional form,

$$\hat{F}(x) = \hat{F}^{M}(x) = \sum_{i=0}^{M} \hat{f}_i(x) \qquad \text{(B.7)}$$

where $\hat{f}_0$ is an initial guess and $\{\hat{f}_i\}_{i=1}^{M}$ are incremental functions called "steps" or "boosts".

To attain a "greedy stagewise" approach of function incrementing with weak learners, in which previously added terms are not readjusted when new ones are added, an optimal step size $\rho$ has to be selected at each iteration, so the optimization is defined as

$$\hat{F}_t \leftarrow \hat{F}_{t-1} + \rho_t\, h(X, \theta_t) \qquad \text{(B.8)}$$

$$(\rho_t, \theta_t) = \underset{\rho,\theta}{\arg\min} \sum_{i=1}^{N} \psi\big(y_i,\, \hat{F}_{t-1}(X_i) + \rho\, h(X_i, \theta)\big) \qquad \text{(B.9)}$$

Gradient boost algorithm

In order to specify a particular GBM, it is necessary to determine the loss function $\psi(y, F)$ to be optimized and to choose the type of weak learner $h(X, \theta)$ used to make predictions.

Each new weak learner $h(X, \theta_t)$ is chosen to be the most parallel to the negative gradient $\{g_t(X_i)\}_{i=1}^{N}$ along the observed data,

$$g_t(x) = E_y\Big[\frac{\partial \psi(y, F(x))}{\partial F(x)} \,\Big|\, x\Big]_{F(x) = \hat{f}_{t-1}(x)} \qquad \text{(B.10)}$$


The "boost" increment in the function space is chosen so that the new weak learner is the most highly correlated with $-g_t(x)$ over the data distribution:

$$(\rho_t, \theta_t) = \underset{\rho,\theta}{\arg\min} \sum_{i=1}^{N} \big[-g_t(X_i) + \rho\, h(X_i, \theta)\big]^2 \qquad \text{(B.11)}$$

The gradient boosting procedure is summarized in Algorithm 6.

Algorithm 6 Gradient boosting algorithm
Inputs:
• Input data $(X, y)_{i=1}^{N}$
• Number of iterations $M$
• Choice of the loss function $\psi(y, F)$
• Choice of the weak learner $h(X, \theta)$

Algorithm:
1: Initialize $\hat{f}_0$
2: for $t = 1$ to $M$ do
3:   Compute the negative gradient $g_t(x)$
4:   Fit a new weak learner $h(X, \theta_t)$
5:   Find the best gradient descent step size $\rho_t$: $\rho_t = \underset{\rho}{\arg\min} \sum_{i=1}^{N} \psi\big[y_i,\, \hat{f}_{t-1}(X_i) + \rho\, h(X_i, \theta_t)\big]$
6:   Update the function estimate: $\hat{f}_t \leftarrow \hat{f}_{t-1} + \rho_t\, h(X, \theta_t)$
7: end for
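As a complement to Algorithm 6, the following minimal sketch illustrates the procedure for a squared-error loss, for which the negative gradient is simply the residual, using regression trees as weak learners and a fixed step size (shrinkage) instead of the line search of step 5; X, y and all names are assumptions for the example, not part of this work's pipeline.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_iter=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting for squared-error loss (negative gradient = residual)."""
    f0 = float(np.mean(y))                      # initial guess: constant prediction
    prediction = np.full(len(y), f0)
    learners = []
    for _ in range(n_iter):
        residual = y - prediction               # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                   # fit the weak learner to the negative gradient
        prediction += learning_rate * tree.predict(X)   # boost: F_t = F_{t-1} + rho * h
        learners.append(tree)
    return f0, learners

def boosted_predict(X_new, f0, learners, learning_rate=0.1):
    pred = np.full(len(X_new), f0)
    for tree in learners:
        pred += learning_rate * tree.predict(X_new)
    return pred
```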


Appendix C

eICU-CRD Collaborative Research Database

The Philips eICU program is a critical care telehealth program that delivers need-to-know information to caregivers, empowering them to care for patients; the data utilized by the remote caregivers is archived for research purposes.

The eICU Collaborative Research Database [106] is composed of data from patients who were admitted between 2014 and 2015 to critical care units across the United States, in a total of 208 hospitals.

Analogously to the MIMIC III database, the data is de-identified so as not to compromise patients' confidentiality and safety, and patients are identified by codes. Hospitalid identifies each hospital in the database, uniquepid uniquely specifies a patient, patienthealthsystemstayid refers to each hospital stay and patientunitstayid to each ICU admission.

Figure C.1 presents an analogy between the MIMIC III database and the eICU-CRD database in terms of how patients are identified.

Databases analogy:

                          eICU-CRD                 MIMIC III
Hospital identification   HospitalID               – (single hospital)
Patient identification    UniquepID                SubjectID
Hospital admission        PatientHealthSystemID    HadmID
ICU admission             PatientUnitStayID        IcustayID

Figure C.1: Databases analogy


C.1 Inclusion Criteria

Initially, all patients were extracted to analyze the number of hospital admissions per patient and, for each admission, the number of ICU stays. Readmissions were discarded, i.e., for patients with multiple admissions or with more than one ICU stay per admission, only the first admission and ICU stay were included, to avoid biased assessments.

From this subset, adult (≥ 16 years old) patients that received insulin during the ICU stay were selected. Infants were discarded because they have a different metabolism and, therefore, a different glucose control protocol in the ICU. Lastly, only patients with a length of stay equal to or higher than 24 hours remained for the study.
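A hedged sketch of these inclusion criteria, assuming a hypothetical flat patient table whose column names (admission_seq, icu_stay_seq, received_insulin, icu_los_hours) do not correspond to the actual eICU-CRD schema, could be:

```python
import pandas as pd

patients = pd.read_csv("eicu_cohort_raw.csv")   # hypothetical extracted table

# Keep only the first hospital admission and the first ICU stay per patient.
first_stays = patients[(patients["admission_seq"] == 1) & (patients["icu_stay_seq"] == 1)]

# Adults (>= 16 years old) that received insulin during the ICU stay.
adults_insulin = first_stays[(first_stays["age"] >= 16) & (first_stays["received_insulin"] == 1)]

# Length of stay of at least 24 hours.
cohort = adults_insulin[adults_insulin["icu_los_hours"] >= 24]
```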

The number of patients extracted in each step is described in figure C.2. The cohort prior to data treatment and modeling is composed of 8379 patients.

208 hospitals in database → 139 367 patients in database → 132 933 patients' first ICU stay during admission → 11 999 patients that received insulin during ICU stay and age ≥ 16 → length of stay ≥ 24 h → 8 379 patients

Figure C.2: Inclusion criteria applied to extract the cohort used in this work.

C.2 Input Variables

The variables extracted to test the models were those resulting from the feature selection process (section 5.4.3).

The last measured value of the Glasgow coma scale (last gcs) was extracted from the Nursecharting table with recourse to previous work developed in [106].

Diseases of the respiratory system (respiratory sys) and infectious diseases (infectious sys) were extracted from the diagnoses table, where they are identified as pulmonary and infectious, respectively. This mapping was obtained by observing the diagnosisstring column and the associated ICD-9 codes.

Minimum values of platelets and mean values of anion gap were extracted from the lab table, taking into account the boundaries previously defined for MIMIC III (see appendix A).

The patient's age (age) was extracted from the patients table.


C.3 Data Treatment

Data treatment is performed independently for each subset of features used for external validation, i.e., the subsets of 3, 5 and 7 features, respectively.

For the cohort of 8379 patients, the missing values of each variable are plotted in figure C.3. The list-wise deletion method is used to deal with missing values, and the number of patients after missing data removal is presented in table C.1.

Figure C.3: Missing values for each variable

Furthermore, some variables have values outside the range delimited by the max-min normalization previously applied to the dataset from the MIMIC III database. Patients with such values are also excluded from the final cohort, as described in table C.1.

In fact, this step would not be necessary for the LGB model, since it is a tree-based model in which values above or below the limit values would be branched into the same group as those limits. However, for the LR model such values might induce biased results.
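The two data treatment steps can be sketched as follows, assuming a hypothetical DataFrame with the extracted variables; the column names and the example bounds are illustrative and do not correspond to the exact limits used for MIMIC III.

```python
import pandas as pd

features = ["last_gcs", "respiratory_sys", "infectious_sys",
            "aniongap_mean", "platelet_min", "age"]
limits = {"aniongap_mean": (3, 40), "platelet_min": (10, 1000)}   # example bounds only

cohort = pd.read_csv("eicu_features.csv")        # hypothetical extracted table

# List-wise deletion: drop any patient with at least one missing feature value.
cohort = cohort.dropna(subset=features)

# Remove patients with values outside the ranges used for the MIMIC III normalization.
for col, (lo, hi) in limits.items():
    cohort = cohort[(cohort[col] >= lo) & (cohort[col] <= hi)]
```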

Lastly, the final cohort for each feature subset is also presented in table C.1, along with the associated mortality ratio.

Table C.1: Number of patients included in each feature subset

Number of features   Patients under insulin therapy   Patients after missing data removal   Patients after removal of values outside the range   Died/Survived (mortality ratio)
3                    8379                             5515                                  5032                                                 490/4542 (0.097)
5                    8379                             5493                                  5003                                                 483/4520 (0.097)
7                    8379                             5262                                  4765                                                 459/4306 (0.096)


Appendix D

SHAP values

SHAP values for tree-based models are computed as follows [103]. A tree is represented by a vector of six variables, $tree = \{v, a, b, t, r, d\}$, and how many subsets (and their size) pass down each branch of the tree is kept track of through two methods, EXTEND and UNWIND, which operate on a path $m$.

The path $m = \{d, z, o, w\}$, representing the unique features split on so far, is constituted by the feature index $d$, the fraction of paths in which the feature is in the set $S$ flowing through the branch, $o$, the fraction in which it is not, $z$, and the proportion of sets of a given cardinality that are present, $w$.

EXTEND is used as the tree is traversed, keeping track of the subsets at each node. UNWIND reverses the process, undoing extensions when the same feature is split on twice and undoing each extension of the path inside a leaf, in order to correctly compute the weights of the features in the path.

Algorithm 7 represents the SHAP value computation for tree-based models.

Algorithm 7 Tree SHAP values [103]
Inputs: instance $x$ and $tree = \{v, a, b, t, r, d\}$
1: $\phi$ = array of len($x$) zeros
2: procedure RECURSE($j, m, p_z, p_o, p_i$)
3: if $v_j \neq$ internal then
4:   for $i = 2$ to len($m$) do
5:     $w$ = sum(UNWIND($m, i$).$w$)
6:     $\phi_{m_i.d} = \phi_{m_i.d} + w\,(m_i.o - m_i.z)\,v_j$
7:   end for
8: else
9:   $h, c$ = ($a_j, b_j$) if $x_{d_j} \leq t_j$ else ($b_j, a_j$)
10:  $i_z = i_o = 1$
11:  $k$ = FINDFIRST($m.d, d_j$)
12:  if $k \neq$ nothing then
13:    ($i_z, i_o$) = ($m_k.z, m_k.o$)
14:    $m$ = UNWIND($m, k$)
15:  end if
16:  RECURSE($h, m, i_z r_h / r_j, i_o, d_j$)
17:  RECURSE($c, m, i_z r_c / r_j, 0, d_j$)
18: end if
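In practice, these Tree SHAP values can be obtained directly from the shap library [82]; the sketch below also checks the additivity property, i.e. that the SHAP values of each instance plus the expected value reconstruct the model's raw (log-odds) output. The objects model (a trained LightGBM classifier) and X are assumptions, and the exact shapes returned may vary with the shap and LightGBM versions.

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
if isinstance(shap_values, list):                 # binary case in some shap versions
    shap_values = shap_values[1]
    expected_value = explainer.expected_value[1]
else:
    expected_value = explainer.expected_value

raw_pred = model.predict(X, raw_score=True)       # LightGBM raw margin output
reconstruction = shap_values.sum(axis=1) + expected_value
print(np.allclose(reconstruction, raw_pred, atol=1e-4))
```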
