The Use of Activity Monitoring and Machine Learning for
the Functional Classification of Heart Failure
by
Jonathan-F. Benjamin Jason Jérémy Baril
A thesis submitted in conformity with the requirements
for the degree of Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto
CC BY 4.0 by Jonathan-F. Benjamin Jason Jérémy Baril, unless otherwise prohibited
The Use of Activity Monitoring and Machine Learning for the Functional
Classification of Heart Failure
Jonathan-F. Benjamin Jason Jérémy Baril
Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto
2018
Abstract
Background: Assessing the functional status of a heart failure patient is a highly subjective task.
Objective: This thesis aimed to find an accessible, objective means of assessing the New York Heart
Association (NYHA) functional classification (FC) of a patient by leveraging modern machine learning
techniques.
Methods: We first identified relevant quantitative data and upgraded Medly, a remote patient
monitoring system (RPMS), to support data collection. We then proceeded to build six different machine
learning classifiers including hidden Markov model, Generalized Linear Model (GLM), random forest and
neural network based classifiers.
Results: The best overall classifier was found to be a boosted GLM, which achieved a classification
performance (Cohen’s Kappa statistic 𝜅=0.73, balanced accuracy=85%) comparable to human level
performance (𝜅=0.75).
Conclusions: Although the investigated classifiers are not ready for implementation into a real RPMS,
they show promise for making the evaluation of NYHA FC more universally consistent and reliable.
dedicated to Papa,
without your encouragement this thesis would never have existed
Acknowledgments
Ah! The acknowledgements. As painful and lonely as it may be to compose a thesis, the
acknowledgements section is by far the easiest and most pleasant section to write. It is both heart-
warming and humbling to be reminded of how much, and how many others, have sacrificed to breathe life
into this work; truly, without the help of these people this project would still be a mere figment of an idea
in someone's mind. If you've contributed to this work, whether directly or indirectly, know that, even if
I've somehow forgotten to include your name here, I am eternally grateful for your help and contribution
to this work.
Firstly, I need to acknowledge our patients: it is probably only those of us who do health research who
truly understand how much these projects live and die by the pure selfless generosity of patients. Thank
you for trusting us with your health and your data. I can only hope this work will somehow contribute to
ultimately making the need for your generosity obsolete.
Second, my committee: Drs. Joe Cafazzo, Cedric Manlhiot, Heather Ross, and Babak Taati. Your
contributions to this project cannot be overstated; in fact, my biggest regret in this project is not
having taken greater advantage of your experience and wisdom. Your guidance, correction, teaching,
encouragement and advice were invaluable in getting this project anywhere. Thanks also go to Dr.
Rob Nolan for taking time to serve as the external examiner for this thesis.
I am also hugely indebted to Simon Bromberg, Raghad Abdulmajeed and Dr. Yasbanoo Moayedi, not
only for your foundational work on which I was able to build my work but also for leaving behind a treasure
trove of data that was indispensable for getting this project started.
Special thanks to Edgar Crowdy, Steven Fan, Bridgette Mueller, Mohammad Peikari, Emily Somerset,
and Kabir Sakhrani at the Cardiovascular Data Management Centre for your advice and tips with regards
to the analytics but also your incredible help with much of the last-minute data collection, analytics,
processing and people-power that went into the ‘research’ part of this project.
Heartfelt thanks also go to Jason Hearn, not only for contributions to this work as part of the
aforementioned group, but also your puns, listening ear and friendship journeying through the adventure
of doing an MHSc at the Centre these last 2 years. If only all graduate students were so fortunate.
Enormous thanks to Iqra Ashfaq, Alana Tibbles, Patrick Ware, Dr. Emily Seto, and Mary O’Sullivan.
Goodness knows how many times I interrupted your work for this project. Thank you so much for your
patience and for being so willing to share your time, your resources, and expertise around all things Medly
(as well as for rooting for me all along the way).
Additional thanks go to:
Stephanie Wilson, Diane De Sousa and especially Larissa Maderia for all the hard work you put in so we
could get Fitbit integrated into Medly.
Damon Pfaff, Owen Thijssen and Mike Lovas for your design advice and allowing me to leech off your
expertise.
James Agnew and Vlad Voloshyn for your technical help.
Melanie Yeung and Akib Uddin, not only for your operational and project management help on the Fitbit
integration (and for the internship) but also for your timely encouragement and advice for getting through
this degree.
Aarti Mathur and Alison Bison for your always joyous help with various admin and purchasing issues.
Similarly, Jess Fifield, who also deserves additional accolades for her eternal patience in filtering my
incessant requests, and for arranging, rearranging and further rearranging Dr. Cafazzo’s calendar and
always managing to find an available slot for Jason or for myself to meet with Dr. Cafazzo when
necessary. Thanks also to Anna Yuan for managing to wrangle the schedules of 5 incredibly busy
university professors so I could defend on time.
Quynh Pham, for your mentorship and encouragement, and for your unwavering enthusiasm at the
Centre; for always always [sic] finding time to thoughtfully answer my questions, whether on REB
applications, thesis writing, EPR or the myriad other elements of the research student life.
Plinio Morita, for your help and suggestions regarding some of the analytics in this project.
Shivani Goyal, especially for your help and advice regarding my OGS/CGS-M proposal. And speaking of:
Many thanks are owed to the Ted Rogers Centre for Heart Research and Peter Munk Cardiac Centre,
Health Support through Information Technology Enhancements (hSITE), the Natural Sciences and
Engineering Research Council (NSERC), the Canadian Institutes of Health Research (CIHR), the Government
of Ontario, and the University of Toronto for funding various parts of this project at various times.
And of course, thank you to everyone else at Healthcare Human Factors and at eHealth Innovation who
at various times pitched in, shared their expertise, provided advice or an encouraging word, or even just
expressed interest in the work. Thank you also to Wayne, Chris and Anjum for extending the opportunity
to learn, work and travel with the human factors team as part of my internships.
Thanks to Rhonda Marley, our wonderful Clin. Eng. coordinator, for alleviating, as much as you could, a lot of the
burdensome administrative workload involved in a graduate degree.
Thank you to BESA, the IBBME community and especially the Clin. Eng. students who were part of our
program. It was a true pleasure. We made it.
And lastly, on a personal note, none of this work would have been possible without friends and family
who supported and encouraged me over these last 2 years - words cannot express how grateful I am for
you. Thank you Maman, Papa, Alisson, Benjamin; Ruth and Alvis (my home away from home); Kyle F,
Thomas, Esteban (when I needed a nice invigorating round of PUBG or GTA); Vanessa, Rebecca,
Theresa, Duela, Sara & Matthew, Matt & Moni, Rachel & Justin, Melanie, Kyle N, Shawn, Valerie,
Jamie, and Courtney (all of whom graciously let me go to the big TO but would probably rather I had
stayed with them in Winnipeg). Special thanks in particular though have to go to: Paul White, who had
the dubious honor of reviewing the first draft of this thesis; Cameron MacGregor, who brought this
program to my attention and joined me on the adventure; Knox Church (and my home church in
particular; Sam, Chris, Hendrick, Stephen, Andrew, Bella, Roydon, Sarah, Lori, Thomas, Emily, Deborah,
Larissa, Katie, Jackie, Danielle, and so many others), for your open arms and being my much-needed
community in this new city; to Tanisha Strachan, for keeping me sane these past few months, even
though no one warned you that dating a grad student is often too much akin to dating a hermit; and of
course, Jesus, because ultimately this was all for you.
Thank you all for your love, for your encouragement, and for your patience.
Now on to the main event…
Table of Contents
Acknowledgments ......................................................................................................................................... iv
Table of Contents ........................................................................................................................................ vii
List of Tables ................................................................................................................................................ xi
List of Figures ............................................................................................................................................. xiii
List of Abbreviations .................................................................................................................................. xvi
Chapter 1 - Introduction ................................................................................................................................ 1
Thesis Objective ................................................................................................................................ 1
Formal Thesis Statement .................................................................................................................. 2
Thesis Summary ............................................................................................................................... 2
1.3.1 Phase 1 – Replication of Previous Study ............................................................................. 2
1.3.2 Phase 2 – Activity Tracker Monitoring Implementation ..................................................... 2
1.3.3 Phase 3 – Machine Learning Implementation & Validation ................................................ 3
Chapter 2 - Background & Literature Review ............................................................................................... 4
Congestive Heart Failure .................................................................................................................. 4
2.1.1 New York Heart Association Functional Classification ....................................................... 6
Assessing Exercise Capacity ............................................................................................................. 7
2.2.1 The Medical Interview (Standardized & Unstandardized Questioning) .............................. 8
2.2.2 Standardized In-Clinic Exercise Testing ............................................................................ 11
2.2.3 Fitness Trackers/Monitors ................................................................................................. 14
Remote Patient Monitoring ............................................................................................................ 22
2.3.1 Medly ................................................................................................................................. 24
Artificial Intelligence & Machine Learning ..................................................................................... 24
2.4.1 Machine Learning .............................................................................................................. 26
2.4.2 Supervised, Unsupervised and Reinforcement Learning .................................................... 26
2.4.3 Classification vs Prediction Problems ................................................................................ 27
2.4.4 The Effect of Sample Size on Machine Learning ............................................................... 28
2.4.5 State-of-the-art .................................................................................................................. 29
Summary ......................................................................................................................................... 32
Chapter 3 - Replication of Previous Study .................................................................................................. 35
Abstract .......................................................................................................................................... 35
Introduction .................................................................................................................................... 36
Methods .......................................................................................................................................... 37
3.3.1 Recruitment ....................................................................................................................... 37
3.3.2 Statistics ............................................................................................................................ 39
Results and Discussion .................................................................................................................... 42
3.4.1 Principal Results ................................................................................................................ 48
3.4.2 Strengths and Limitations ................................................................................................. 51
Conclusion ....................................................................................................................................... 52
3.5.1 Acknowledgements ............................................................................................................. 52
3.5.2 Ethics Approval ................................................................................................................. 52
3.5.3 Conflicts of Interest ........................................................................................................... 52
Chapter 4 - Activity Tracker Monitoring Implementation .......................................................................... 53
Medly User Interface Overview ...................................................................................................... 53
Requirements .................................................................................................................................. 54
Design & Implementation ............................................................................................................... 57
4.3.1 Activity Tracker Selection ................................................................................................. 57
4.3.2 User Interface Design ......................................................................................................... 64
Summary ......................................................................................................................................... 82
Chapter 5 - Assessment of NYHA Functional Classification using Hidden Markov Models ...................... 84
Hidden Markov Models ................................................................................................................... 84
5.1.1 Rationale for the use of HMMs .......................................................................................... 84
Methods .......................................................................................................................................... 86
5.2.1 Training Data .................................................................................................................... 86
5.2.2 Model Design ..................................................................................................................... 89
5.2.3 Model Validation ............................................................................................................... 93
Results and Discussion .................................................................................................................... 94
5.3.1 Classification Performance ................................................................................................. 94
5.3.2 Training Challenges ........................................................................................................... 94
Summary ....................................................................................................................................... 101
Chapter 6 - Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning
Models ................................................................................................................................................... 103
Machine Learning Models ............................................................................................................. 103
6.1.1 Generalized Linear Models ............................................................................................... 103
6.1.2 Boosted Generalized Linear Models ................................................................................. 105
6.1.3 Random Forest ................................................................................................................ 105
6.1.4 Artificial Neural Networks ............................................................................................. 107
6.1.5 Principal Component Analysis Artificial Neural Networks ............................................ 109
Methods ........................................................................................................................................ 110
6.2.1 Training Data .................................................................................................................. 110
6.2.2 Model Design ................................................................................................................... 111
6.2.3 Model Validation ............................................................................................................. 117
Results and Discussion .................................................................................................................. 120
6.3.1 Classification Performance ............................................................................................... 120
6.3.2 Best Features ................................................................................................................... 124
6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation .......................................... 128
Summary ....................................................................................................................................... 129
Chapter 7 - Conclusions, Recommendations & Future Work ................................................................... 132
Conclusions ................................................................................................................................... 132
Recommendations ......................................................................................................................... 135
Future Work ................................................................................................................................. 136
References .................................................................................................................................................. 138
Appendix A - Research Ethics ................................................................................................................... 168
I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of Heart
Failure Severity ............................................................................................................................. 168
II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers for
Monitoring Heart Failure .............................................................................................................. 169
III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program for Heart
Failure Patients ............................................................................................................................ 170
IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a mobile phone-
based telemonitoring program for heart failure patients .............................................................. 171
Appendix B – A Primer on Hidden Markov Models ................................................................................. 172
I. Basics of Markov Models (Hidden or Otherwise) ......................................................................... 172
II. Semi-Markov Model ...................................................................................................................... 174
III. Hidden Markov & Semi-Markov Models Parameters ................................................................... 174
IV. Determining Markov Model Parameters ....................................................................................... 175
Appendix C – Software Repository ............................................................................................................ 177
Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier Performance Measures ..... 178
List of Tables
Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79] ............................ 19
Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41] ..... 20
Table 3: Inclusion criteria ............................................................................................................................ 37
Table 4: Exclusion criteria ........................................................................................................................... 37
Table 5: Study dataset demographics .......................................................................................................... 38
Table 6: Study dataset demographics (overall and just NYHA II or III) .................................................... 38
Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*) ................................. 39
Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class
II vs. III. ....................................................................................................................................................... 43
Table 9: Significant findings for comparisons between group II* and group III* ........................................ 44
Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III) and just
between class II vs. III. ................................................................................................................................ 45
Table 11: Non-significant findings for comparisons between group II* and group III* ............................... 46
Table 12: Candidate activity trackers ......................................................................................................... 58
Table 13: Medly inclusion criteria ............................................................................................................... 78
Table 14: Medly exclusion criteria ............................................................................................................... 78
Table 15: iPhone vs. Android patients on Medly system using Fitbit a) all patients onboarded, b) only
new Medly patients onboarded during thesis ............................................................................................... 79
Table 16: Patient adherence on Fitbit ......................................................................................................... 80
Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT .................... 80
Table 18: Minute-by-minute step count features ....................................................................................... 111
Table 19: Cardiopulmonary exercise testing data features ........................................................................ 113
Table 20: Patient demographic data features ............................................................................................ 114
Table 21: Header abbreviations for Table 22 ............................................................................................. 178
Table 22: Cross-sectional machine learning classifier performance metrics ............................................... 179
List of Figures
Figure 2-1. Renin-Angiotensin-Aldosterone system [286] ............................................................................... 5
Figure 2-2 Nervous system response to drop in blood pressure [287] ............................................................ 6
Figure 2-3 PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288]. .......................... 16
Figure 3-1. Histogram of per minute step count values for each patient, grouped by individual NYHA
class .............................................................................................................................................................. 40
Figure 3-2. Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0).
Stacked internal segments indicate relative contributions by each patient. ................................................ 41
Figure 3-3. Individual frequency of per minute step counts for each patient (zoomed in to step counts >
0), grouped by NYHA class ......................................................................................................................... 42
Figure 3-4. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each
individual NYHA class ................................................................................................................................ 48
Figure 3-5. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count
maximums for each individual NYHA class ................................................................................................ 49
Figure 3-6. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count
maximums for each individual NYHA class ................................................................................................ 50
Figure 3-7. Number of zero step count minutes as a percentage of individual patient two-week data stream
..................................................................................................................................................................... 51
Figure 4-1. Medly system patient smartphone user interface a) home screen b) trends screen [289] ....... 53
Figure 4-2. Medly system clinical user web interface ................................................................................... 55
Figure 4-3. Fitbit data flow diagram ........................................................................................................... 60
Figure 4-4. Fitbit authentication process with a client app ......................................................................... 61
Figure 4-5. Medly Fitbit patient access sequence ........................................................................................ 62
Figure 4-6. Medly Fitbit clinician access sequence ...................................................................................... 63
Figure 4-7. Proposed designs for patient user interface (home screen) a) combined heart rate and steps
data on one card, b) combined heart rate and steps data with pictorial representations, c) separated heart rate and
step data, d) only pictorial representation with mini graph ......................................................................... 65
Figure 4-8. Proposed designs for patient user interface (trends) a) simple sparklines, b) data with bands to
indicate min (resting), mean and max values for each time period, c) whisker plot to indicate daily range,
d) heart rate (maximum and resting) and average step count values broken out for each time period, and
e) Tufte style medical data visualization as per f) which is reproduced from [201] .................................... 66
Figure 4-9. Proposed design for authorization of new Fitbit by patient via Medly smartphone application.
..................................................................................................................................................................... 67
Figure 4-10. Proposed designs for clinical user interface (activity and heart rate graphs) a) simple graph
design with indicator lines for alert levels and mean, b) design inspired by the Sick Kids T3 (tracking,
trajectory and trigger) tool [206–208], c) mix of T3 tool with Medly range bands, d) whisker plot style
and e) simple graph with range bands and NYHA class prediction display (bottom of the more info page
for step count graph) ................................................................................................................................... 71
Figure 4-11. Final web interface Fitbit authorization flow .......................................................................... 72
Figure 4-12. Final web interface activity tracker profile & deauthorization flow ........................................ 73
Figure 4-13. Final web interface activity tracker data display .................................................................... 73
Figure 4-14. Distribution of patient Fitbit adherence (as percent of days using the system) ..................... 79
Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model .................... 85
Figure 5-2: Architecture for hidden Markov model based classifier ............................................................. 90
Figure 5-3: Distribution of per-minute step count for patients with NYHA class II and NYHA III (*
grouped) ....................................................................................................................................................... 93
Figure 5-4: Overview of HMM based classifier performance ........................................................................ 94
Figure 5-5: Example patient step count data (per 6 hour resolution) ......................................................... 95
Figure 5-6: Example patient step count data (per minute resolution) ........................................................ 96
Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236]. ................................ 100
Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the
distribution belongs in the family only when certain parameters are fixed). Adapted from [290]. ............ 104
Figure 6-2: Example of a decision tree (above) with corresponding feature space (below). ...................... 106
Figure 6-3: A perceptron ............................................................................................................................ 108
Figure 6-4: A neural network ..................................................................................................................... 108
Figure 6-5: 𝒌-fold cross-validation ............................................................................................................. 117
Figure 6-6: Performance of the best CPET only classifier ......................................................................... 121
Figure 6-7: Performance of the best step data only classifier .................................................................... 121
Figure 6-8: Performance of the best CPET + step data classifier ............................................................. 121
Figure 6-9: Performance of the second best CPET + step data classifier ................................................. 121
Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with
CPET & step data (with no data imputation) .......................................................................................... 122
Figure 6-11: Feature importance scores for GLM classifier using only step count data ............................ 125
Figure 6-12: Feature importance scores for random forest classifier using CPET + step count data ....... 126
Figure 6-13: Performance of the best model with cross-validation performance difference ....................... 128
Figure B-1: Markov model ......................................................................................................................... 173
List of Abbreviations
6MWT 6 minute walk test
Acc accuracy
API application programming interface
AI artificial intelligence
AT anaerobic threshold
BNP brain natriuretic peptide
BP blood pressure
bpm beats per minute
CART classification and regression tree
CC correlation coefficient
CI confidence interval
CV cross validation
CHF congestive heart failure
CO2 carbon dioxide
CPET cardiopulmonary exercise test
DPMSC daily per minute step count
ECG electrocardiography. Alternatively: electrocardiogram, or electrocardiograph
GLM generalized linear model
HF heart failure
HFrEF heart failure with reduced ejection fraction
HMM hidden Markov model
HMMBC hidden Markov model based classifier
HT home telemonitoring
HR heart rate
HRV heart rate variability
ICC intraclass correlation coefficients
IMU inertial measurement unit
LED light-emitting diode
LVEF left ventricular ejection fraction
LOOCV leave-one-out cross validation
ML machine learning
MVP minimum viable product
NIR no information rate
NNet neural net
NYHA New York Heart Association
O2 oxygen
PCA principal components analysis
PPG photoplethysmography
QI quality improvement
RCT randomized control trial
REB research ethics board
RER respiratory exchange ratio
RF random forest
ROC receiver operating characteristic
RPM remote patient monitoring
SC step count
SEM standard error of the mean
TGH Toronto General Hospital
UHN University Health Network
UI user interface
Chapter 1 - Introduction
Heart failure (HF), a complex chronic terminal phase of many cardiovascular diseases, is slowly becoming
a worldwide silent pandemic [1]. The symptoms of heart failure are complex and difficult to manage for
both patients and their physicians [2–4]. Care is made even more difficult because there is no reliable
objective method for assessing the symptomatic (functional) status of a given HF patient, or by extension,
if their symptoms have recently measurably deteriorated [5–7].
The current clinical gold standard for assessing a patient's symptom state is the New York Heart
Association (NYHA) functional classification [8,9]. This system grades a patient's degree of heart failure
based on a physician's interpretation of the patient's reported symptoms (mainly with respect to their
degree of intolerance to exercise/physical activity) and is by its nature highly subjective. Despite these
limitations, years of medical research and clinical observations have established many important
relationships between a patient's symptom status and their prognostic outcomes [7,10] which makes it
undesirable to simply replace or modify the existing NYHA functional classification scheme. However,
finding an objective means of determining a patient's NYHA class would be of great benefit to both HF
care and research as it would allow intra- and inter-physician and patient assessments of HF functional
status to be more consistent [7,11,12]. At the very least, consistency would make communication of
patient heart failure functional status in research, clinic notes, or other medical documentation more
transparent and reliable.
Thesis Objective
The objective of this thesis is to design and develop a means of making the evaluation of NYHA
functional class more consistent and reliable for the medical research and clinical community. The larger
goal of this research work can be subdivided into four major sub-objectives:
1. To identify available relevant, objective data which may be useful for providing insights into
patients' underlying NYHA functional class and, where required, to start the collection of this
data.
2. To establish a basic foundational procedure for use by future researchers, data scientists and
engineers to develop and assess machine learning based methods of evaluating NYHA functional
class (trained to replicate classification by experienced physicians).
3. To perform a pilot analytics experiment, using data collected during an initial brief data
collection period, to explore the viability of a few machine learning algorithms which could form
the core of an objective and consistent system for evaluating NYHA functional class (one that mirrors
classification by experienced physicians).
4. To provide a reflection on ‘lessons learned’, potential pitfalls and hazards to be mitigated in a
real-life implementation of a machine learning based NYHA functional classification system.
Formal Thesis Statement
We hypothesize that it is possible to assess NYHA functional class with an expected level of
performance at least equal to that of skilled humans, namely trained cardiologists, using objective data
readily available or recordable as part of routine care.
Thesis Summary
The three phases of this thesis are summarized in the following sections 1.3.1 to 1.3.3. We first
replicated a previous scientific study as part of initial investigations into relevant data. A basic physical
activity data collection system was then implemented as part of an established remote patient monitoring
system at the TGH HF clinic. Once sufficient data had been gathered by this system, we sought to train
and validate several machine learning models and assess their potential usefulness for the task of
classifying patients into their appropriate NYHA functional class. All research performed as part of this
thesis was reviewed by and received the required approvals from the UHN Research Ethics Board (REB). The
approval letters are included as part of Appendix A.
1.3.1 Phase 1 – Replication of Previous Study
A previously published pilot study [13] showed a statistically significant association between NYHA
functional class and total daily step count activity measured by wrist-worn activity monitors in patients
with heart failure. However, the study’s small sample had the unfortunate side-effect of limiting scientific
confidence in the generalizability of these findings. Since step count activity is expected to be a highly
relevant, useful, and massively feature-rich dataset, we replicated the study on a separate, otherwise
limited dataset collected during another previous study to increase our confidence in the relevance and
usefulness of step data for this particular thesis. This phase of the thesis was approved and covered under
REB #15-9832.
1.3.2 Phase 2 – Activity Tracker Monitoring Implementation
Having validated the relevance of step data for this particular application, we upgraded Medly, the
remote patient monitoring system already in use at the TGH HF clinic, so it could support the collection
and display of continuous free living activity data from a commercially available fitness tracker (a Fitbit),
including minute-by-minute step count and heart rate data, which would form an important cornerstone of
the rest of our analysis. This phase of the thesis, upon review by the UHN REB, was accorded a waiver of
requirement for REB approval under REB #18-0221. The analysis of patient compliance was approved
and covered under REB #16-5789.
1.3.3 Phase 3 – Machine Learning Implementation & Validation
In the final phase of this research thesis, we identified potential candidate machine learning algorithms
and implemented six of them in an attempt to create a classifier that could take the collected clinical data
and use it to objectively assess patient NYHA class. We also evaluated the performance of these
systems compared to the expected ability of experienced physicians to perform the same task. This phase of
research, upon review by the UHN REB, was accorded a waiver of requirement for REB approval under
REB #18-0221.
The following chapters provide, first, the necessary background needed to understand the rest of the research
discussed in this thesis, followed by a detailed description of the methods employed in each phase of the
research and the corresponding findings of that phase.
Chapter 2 - Background & Literature Review
Congestive Heart Failure
Congestive Heart Failure (CHF), or Heart Failure (HF), as previously stated, is a complex chronic
terminal phase of many cardiovascular diseases, and is slowly becoming a worldwide silent pandemic
[1,14]. Aside from being complex, it is also an incurable, continually worsening condition that looms
threateningly over even a myriad of relatively ‘benign’ heart problems. In the words of Dr. Paul
Fedak, it is the “end result of all cardiac disease. You get heart failure from everything that goes wrong
with your heart – all roads lead to heart failure” [2]. Recent estimates would suggest that in 2016 at least
50,000 new Canadians will have officially joined an existing cohort of more than 600,000 Canadians, and
26 million persons globally, living with heart failure [2,14]. Of course, these numbers are only expected to
grow as the population of persons at high risk of developing cardiac disease and, almost inevitably, the
prevalence of cardiac disease in general, continues to increase. Globally, the prognosis of HF patients is
bleak [1,14]. Even in Canada, despite its relatively advanced medical system, the expected median
survival time of Canadian HF patients is still very short - 2.1 years [15].
But what is heart failure? In short, heart failure is when the heart suffers a reduced ability to pump
blood, and by extension is unable to adequately supply the body with the nutrients and oxygen it requires
[1,2,14]. This inability of the heart to pump blood is sometimes termed cardiac insufficiency. This term
helps to avoid the popular misconception that heart failure is when a person’s heart has stopped as in the
case of a heart attack [2,16]. While cardiac insufficiency has the, likely obvious, effect of reducing a
person’s ability to perform demanding physical activities at any given moment, the full effects of heart
failure are rather more insidious.
Galen is perhaps the first recorded physician to have conjectured that organs aside from the heart and
arterial-venous network might be involved in regulating circulation [17]. While he erroneously concluded
that the liver was the body's main blood producing organ (due to its high degree of vascularization, i.e. it
has lots of blood vessels), an error which remained regrettably uncorrected for 15 centuries, it turns out
that the liver, along with the lungs and adrenal glands, but most importantly the kidneys, do have major
biochemical involvement in regulating a hugely important aspect of the circulatory system: blood pressure
[17]. The natural response of these organs to an event of cardiac decompensation (i.e. cardiac
insufficiency), is to attempt to correct these drops by activating a series of body systems and reflexes to
increase both blood volume and blood pressure and by extension cardiac output [18,19]. This is done
primarily through the renin-angiotensin-aldosterone system (see Figure 2-1) which effects an increase in
sodium and fluid retention along with an
increase in vasoconstriction (narrowing of
blood vessels) [18,19]. The autonomic nervous system also contributes, both by increasing vasoconstriction
and by attempting to increase heart rate and contraction force (see Figure 2-2) [18,19]. In short, the body
engages its ‘fight-or-flight’ emergency response mechanism.
While the aforementioned response is
highly appropriate for acute events of
cardiac insufficiency such as significant
blood loss, or even to prevent fainting as
a result of standing up suddenly from a
resting position, it is the incorrect
response to chronic persistent heart
failure [18,19]. Not only does this response
not resolve the underlying cause of the
chronic heart failure such as abnormal
heart rhythms or damage to or malformation of the heart, among other root causes, but constantly
engaging the body's ‘fight-or-flight’ mechanism has damaging side-effects [19]. Elevated blood pressure
(hypertension) is associated with increased risk for a myriad of other conditions including: pulmonary
edema (leaking of fluid into the lung), atherosclerosis (hardening of arteries as a result of plaques formed
due to damage to the vessels), and hemorrhagic stroke (rupture of a blood vessel) [18,19]. Increased
sodium and fluid retention causes not just the blood to retain more water, but the whole body; fluid often
builds up in other organs and in the arms and legs which can cause undesirable compression of internal
organs and result in damage to those organs [19]. Furthermore, the reduced blood flow combined with
inappropriate pressure increases in certain organs can cause fluid in general to back up, or become
congested in areas along the circulatory network, which is what gives congestive heart failure its name
[18,19]. In addition, the whole response system has the effect of causing what is known as ‘cardiac
remodelling’ whereby the actual physical structure of the heart changes to adapt to its new environment
[19,20]. Many of these changes have an overall damaging effect in the long-run and the exact nature and
extent of this remodelling
depends greatly on the
type of heart failure, for
example whether it is
localized in the left or
right side of the heart (or
both), whether it has the
effect of weakening or
stiffening of the heart
muscles, or whether the
heart failure is due to
other causes such as
abnormal heart rhythms
or blockages [19,20].
Suffice it to say that the
symptoms and pathology
of heart failure are
complex.
As a result of the
complexity of heart
failure, it can be difficult
to manage for both
patients and their physicians [2–4,19]. This is especially unfortunate because heart failure is essentially
impossible to cure since the heart, unlike many other muscles, does not heal or regenerate naturally and
modern medicine has not yet found a way to cause it to do so [19,21]. Care is made even more difficult
because there is no reliable objective method for assessing the functional state of any given patient’s HF,
never mind determining if it is likely to worsen irreparably [5–7].
2.1.1 New York Heart Association Functional Classification
The current clinical gold standard for communicating the severity of symptoms experienced by a CHF
patient is the New York Heart Association (NYHA) functional classification system [8,9,22]. Under this
system, patients are classified based on the physician's interpretation of patient-reported symptoms
(mainly with respect to their degree of exercise/activity intolerance). The physician will then assign the
patient to one of the four NYHA functional classes they believe is most appropriate based on their
clinical experience, professional judgement and according to the NYHA class definitions. These definitions
are copied below for the reader's convenience [23]:
I. “Patients with cardiac disease but without resulting limitation of physical activity. Ordinary
physical activity does not cause undue fatigue, palpitation, dyspnea, or anginal pain.”
II. “... slight limitation of physical activity. They are comfortable at rest. Ordinary physical
activity results in fatigue, palpitation, dyspnea, or anginal pain.”
III. “... marked limitation of physical activity. They are comfortable at rest. Less than ordinary
activity causes fatigue, palpitation, dyspnea, or anginal pain.”
IV. “Patients with cardiac disease resulting in inability to carry on any physical activity without
discomfort. Symptoms of heart failure or the anginal syndrome may be present even at rest. If
any physical activity is undertaken, discomfort is increased.”
This classification system is highly subjective [6,7], especially for NYHA class II and III, which call for
patients experiencing “slight” versus “marked limitation of physical activity” [9]. The application of the
criteria thus varies widely based on the patient’s self-report and the individual physician’s interpretation
of the report [6,7]. Despite these limitations, clinical evidence and medical research have established many
important relationships between a patient's symptom status and their prognostic outcomes which makes
the assessment of NYHA functional class a useful part of care [7,10]. Aside from this prognostic
utility, it also provides clinicians and medical researchers a standardized way of quickly communicating
the clinical severity of a given patient’s heart failure [19,24]. As such, scientific papers dealing with CHF
often report the NYHA class of their patient population (amongst other metrics) to provide a universally
recognized, although perhaps imprecise, description of the clinical make-up of their population.
Unfortunately, approximately 99% of these papers also fail to provide details as to how the NYHA
functional classes were assessed [6].
Assessing Exercise Capacity
The core determinant of NYHA class is the impact of a patient's heart failure on their ability to
perform physical activity without “undue fatigue, palpitation, dyspnea, or anginal pain”. While the NYHA
functional classification system does not prescribe a standardized method by which to evaluate limitations
of physical activity, there are certainly several methods of evaluating a patient's exercise capacity,
whether for NYHA functional class assessment or for other purposes. These include questions posed as
part of a medical interview, cardiopulmonary exercise testing, and physical activity/fitness
trackers/monitors.
2.2.1 The Medical Interview (Standardized & Unstandardized Questioning)
The familiar medical interview, whereby a clinician carefully queries a patient to elucidate the
patient’s relevant medical history and symptoms, is a staple of medical care. It is also the classic method
of assessing NYHA functional class; adding a few pertinent questions is inexpensive, relatively quick, fits
neatly into the existing workflow of clinicians and also happens to be the established best practice. It is
however highly inconsistent with regards to NYHA class assessment both between physicians and for the
same physician across time, and is thus highly unreliable [6,11,25–27]. Carroll et al. report (bibliographic
reference numbers updated to reflect ours):
[One study] used two physicians to estimate NYHA functional class in 75 patients on
the same day without chronic heart failure, reporting an interrater reliability of 56%
(weighted kappa = 0.41)[11]. In a second study, two cardiologists assessed the same 50
chronic heart failure patients on the same day in random order, observing 54%
agreement in NYHA classes [6]. In a third study, two physicians assigned NYHA class
to 56 patients with stable angina within the same hour, resulting in the highest reported
agreement of 75% [26]. Among these studies, disagreement by more than one functional
class was low and, for the most part, was concentrated on determining the discrete
differences between Classes II and III. Taken together, the reliability of the NYHA
system is limited in the few trials that have measured it directly [25].
These results are very low: 54% and 56% levels of agreement represent only weak agreement between
physicians, and a 75% level of agreement still implies that only about 56% of the examined cases should
be considered correct [28].
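To make the relationship between raw agreement and chance-corrected agreement concrete, the short sketch below computes Cohen's kappa for a hypothetical two-rater agreement table over the four NYHA classes. The counts, and the snippet itself, are purely illustrative and are not drawn from any of the studies cited above; chance-corrected figures of this kind are what the kappa values quoted throughout this thesis refer to.

```python
# Illustrative sketch only: Cohen's kappa for two raters assigning NYHA classes,
# computed from a hypothetical (made-up) agreement table.
import numpy as np

def cohens_kappa(agreement: np.ndarray) -> float:
    """agreement[i, j] = number of patients rater A placed in class i and rater B in class j."""
    total = agreement.sum()
    p_observed = np.trace(agreement) / total            # raw agreement (diagonal of the table)
    row_marginals = agreement.sum(axis=1) / total       # rater A's class proportions
    col_marginals = agreement.sum(axis=0) / total       # rater B's class proportions
    p_expected = float(row_marginals @ col_marginals)   # agreement expected by chance alone
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for NYHA classes I-IV (rows: rater A, columns: rater B)
table = np.array([[20, 5, 0, 0],
                  [6, 18, 8, 0],
                  [0, 7, 15, 2],
                  [0, 0, 1, 8]])
print(round(cohens_kappa(table), 2))  # ~0.55, even though raw agreement is ~0.68
```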
It should be noted that the third study (Christensen et al.) examined only NYHA functional classes I to
III, and the first study (Goldman et al.) examined all four functional classes [11,26]. In the second study
(Raphael et al.), the researchers investigated class II and III assessments specifically [6]. Furthermore each
study had an imbalanced distribution of classes which makes reporting raw accuracy somewhat misleading
since classes I and IV end up being relatively easy to distinguish in clinical practice whereas the middle
classes II and III generally represent the actual classification challenge for physicians [25]. Approximately
half of patients in Goldman et al.’s study exhibited NYHA class I symptoms which may have contributed
to the slightly higher agreement found in this study compared to Raphael et al.’s study. Unfortunately
Christensen et al. neglected to provide any information on their class distribution entirely, although it
appears to be slightly unbalanced, since visual examination of their figures indicates that a significant
subset (possibly a quarter to a third) of their study population were patients with NYHA class I. We agree
with the authors (Christensen et al.), however, that the real reason they saw higher agreement was
likely that “they used the same two physicians through the study … who, in addition, had a small
training session prior to data collection” [26].
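The effect of an imbalanced class distribution on raw accuracy can be illustrated with a small, hypothetical example (not drawn from the studies above): if half of a sample falls in the comparatively easy class I, a rater who identifies every class I patient correctly but defaults to class II for everyone else still appears to perform reasonably well on raw accuracy, while a balanced (per-class) accuracy exposes the weakness.

```python
# Hypothetical illustration of raw vs. balanced accuracy under class imbalance.
import numpy as np

truth = np.array(["I"] * 50 + ["II"] * 20 + ["III"] * 20 + ["IV"] * 10)
rating = np.where(truth == "I", "I", "II")     # correct on class I, guesses "II" otherwise

raw_accuracy = np.mean(rating == truth)        # (50 + 20) / 100 = 0.70
per_class_recall = [np.mean(rating[truth == c] == c) for c in ["I", "II", "III", "IV"]]
balanced_accuracy = np.mean(per_class_recall)  # (1.0 + 1.0 + 0.0 + 0.0) / 4 = 0.50
print(raw_accuracy, balanced_accuracy)
```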
In normal practice clinicians usually differ in the exact criteria and questions they would use to assess the
NYHA class of their patients [6]. The most popular were self-reported walking distance (70% of the 30
cardiologists surveyed), difficulty in climbing stairs (60%), ability to walk to a recognized local landmark
(30%) and breathlessness interfering with performing daily activities or when walking around the house
(23%) [6]. Thirteen percent of cardiologists had no specific question or criteria for assessing NYHA class [6].
Even among those who used a common question or criterion, the application of the criteria often differed. For
example, in choosing between class II and III patients, two-thirds of physicians would classify a patient who
couldn't make it up a flight of stairs without stopping as class II, while one-third would classify them as class
III [6].
Assessment at the Toronto General Hospital Heart Function Clinic
At the TGH HF clinic, NYHA class is typically assessed for every patient with known cardiac disease,
which is first objectively verified using some sort of medical imaging. NYHA class is then reassessed at
every clinic visit by the physician responsible for the patient's care as part of the medical interview. At
minimum, the physician will pose questions to attempt to elucidate the patient's degree of exercise
intolerance, for example: "How far can you walk before becoming short of breath?", although the
established preferred criterion is "How many flights of stairs can you climb before needing to stop?" The
classes are broken down as follows:
Class I. Asymptomatic; able to perform physical activity normally. (As a specialized tertiary care centre, the
Heart Function Clinic rarely sees NYHA class I patients, as they are often asymptomatic with regards to their
heart failure, or at least rarely require the specialized level of care offered by the clinic.)
Class II. Able to walk up more than one flight of stairs, or 100+ meters, before being breathless.
Class III. Only able to walk up one flight of stairs before being breathless/requiring a break.
Alternatively, gets tired walking to the washroom.
Class IV. Always breathless; symptoms even at rest.
Of course, these questions are adjusted as per the clinical demands. For example, the stair question is
unsuitable for a patient who is wheelchair-bound or has significant mobility impairment, but the
principle of using internally consistent criteria remains the same.
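Purely as an illustration of what these informal criteria look like when written down as explicit rules, a naive rule-based assessor might be sketched as below. The field names and thresholds are assumptions made for this example, not part of any clinic protocol, and the point of the surrounding discussion is precisely that such fixed rules must be adjusted case by case and applied with clinical judgement.

```python
# Illustrative sketch only: the clinic's interview criteria written as explicit rules.
# Field names and thresholds are assumptions for this example; in practice the
# questions are adapted to each patient rather than applied as a fixed algorithm.
from dataclasses import dataclass

@dataclass
class InterviewFindings:
    asymptomatic: bool              # no reported limitation of physical activity
    symptoms_at_rest: bool          # breathless even at rest
    flights_before_stopping: float  # flights of stairs climbed before needing to stop

def nyha_from_interview(findings: InterviewFindings) -> str:
    if findings.symptoms_at_rest:
        return "IV"
    if findings.asymptomatic:
        return "I"
    if findings.flights_before_stopping > 1:   # more than one flight (or roughly 100+ m)
        return "II"
    return "III"                               # limited to about one flight or less

print(nyha_from_interview(InterviewFindings(False, False, 1.0)))  # -> "III"
```

Even this toy version runs into the problems raised above: a wheelchair-bound patient has no meaningful count of "flights before stopping", and the thresholds themselves are exactly where physicians tend to disagree.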
Unsurprisingly, prior agreement on assessment criteria has been demonstrated to improve inter-physician
agreement drastically [27]. Kubo et al. for example developed a patient questionnaire with the express
intent of addressing the problem of inconsistent NYHA classification in multi-centre trials, although the
questionnaire was “not meant to replace or improve the traditional method by which clinicians assess
NYHA in everyday clinical encounters” [27]. The questionnaire is composed of 7 major questions that echo
some of the popular interview questions including questions such as: “How often do you walk up and
down stairs?” and “How often do you go for walks, either outside or inside, on level ground at a normal
pace under normal conditions?” with follow up questions including “Do you avoid stairs [/walks] because
it makes you tired or short of breath?” and “How often would you get short of breath when you walk up
or down a flight of stairs at a normal pace under normal conditions?” that are typically answered with one
of ‘Never, Rarely, Some or Frequently’ and occasionally with just a simple ‘Yes/No’ response [27]. The
questionnaire uses a separate scoring tool (not provided) that assesses the frequency of both activities and
their associated symptoms including symptoms or lack of symptoms at rest [27]. The scoring tool however,
at least in its current state, eschews the use of an automated algorithm “because of the inability of simple
algorithms to reconcile inconsistent patient responses” [27]. In validating the use of this questionnaire,
Kubo et al. found about a 60% agreement comparing independent assessments performed at a remote
site and their core central site, a 75% agreement comparing independent assessments performed at the
same core central site, and a 90% agreement on repeat assessment of a random subset of the same
questionnaires 3 months later [27]. These results are in the same range as Christensen et al.’s results,
which possibly indicates that even informal prior agreement on assessment criteria (in the form of a
preparatory training session) drastically improves inter-physician agreement on NYHA class. Of course,
subjectivity in the NYHA classification is not just introduced by clinicians. It is also introduced by
patients.
2.2.2 Standardized In-Clinic Exercise Testing
A second challenge of NYHA class assessment is that it relies heavily on patient reported symptoms
and on patient memory, which can be unreliable even in the best of circumstances [29–31]. Clinicians, who
face this challenge on a routine basis in the field, even outside the context of NYHA class assessment,
have come up with a myriad of ways to address this problem. In fact, a great deal of research tries to
identify or create tests that measure physical fitness, maximum exercise capacity, or some proxy thereof
in a standardized way [32–39]. In general, these tests measure a patient's exertion over a period of time
[32,34–36,38–40]. Exertion is usually calculated by raw distance traveled (being generally more convenient
to measure) [32,34,36,40], patient step count (which can be linked to distance if the patient's stride length
is known) [38,41–47], movement recorded by raw accelerometer data [39,48–50], activity difficulty (e.g.
surface incline, resistance band strength) [41,46] or energy consumption (e.g. Metabolic Equivalents:
METS) [8,32,37].
Timed Walking Tests
Timed walking tests are an excellent example of a basic, easy to run standardized in-clinic exercise
test. The 6 minute walk test (6MWT), one of the more recently developed timed walking tests, typifies the
general approach used in these tests. For this particular test, a patient is asked to walk as far as they can
(being permitted to rest as needed) over a hard flat surface over the period of 6 minutes; the total
distance walked is then used as an indicator of the exercise capacity of the individual [40] and by
inference, their symptomatic limitations due to heart failure [7].
While timed walking tests have shown that measures of exertion over time (whether distance, step count
or otherwise) are correlated to the NYHA functional classification of patients, there often remains a
notable gap in the explanatory power of these measures. For example Demers et al. found that for the 768
patients in their multi-centre study the "baseline 6MWT distance was ... moderately inversely correlated
to the New York Heart Association functional classification (NYHA-FC) (r = -0.43, P=.001)” [51]. One
would expect that walking distance should be correlated with evaluated NYHA functional class, but
distance travelled in this case only explains approximately 18.5% of the variance in the data (r² = 0.1849).
This may be because NYHA functional class is not predominantly attempting to ascertain maximal
exercise capacity but rather the degree of abnormally symptomatic response to exercise – a much more
nuanced question. Therefore tests, measures, or metrics which can reliably mirror NYHA functional class
will likely need to measure not just exertion, but the patient’s physiological response to that exertion -
beyond the simple binary yes/no response of being able to continue the exertion demanded (which is the case for
all the previously mentioned tests).
Cardiopulmonary Exercise Test (CPET)
The cardiopulmonary exercise test (CPET), or more colloquially ‘the treadmill test’, is the gold
standard for in-clinic exercise testing [52]. It is a supervised test run by trained staff in a controlled
clinical environment. In this test, the patient walks on a treadmill or cycles on a stationary bicycle,
typically until they become exhausted or experience muscle fatigue, respiratory difficulty or
some other symptom that indicates termination of the test [32,53]. While the patient is
exercising, their detailed physiological response to increasing resistance on the treadmill/bike is measured
using:
• surface electrocardiography (ECG), to measure pulse and cardiac waveform (sinus rhythm);
• pulse oximetry, to measure blood oxygen saturation;
• a blood pressure (BP) cuff, to measure blood pressure;
• spirometry equipment, to measure lung capacity, volumes and flow; and
• pulmonary gas equipment, to measure oxygen (O2) and carbon dioxide (CO2) exchange [32,53].
Together, this data provides an informative picture from which clinicians can further derive metrics
measuring a patient’s lung and cardiac response to exercise [24,32,53,54]. Some of the more unique and
important measures derived from this test include:
• Peak V̇O2 [mL/kg/min] (relative peak V̇O2), the peak measured oxygen uptake, is an estimate of the true
maximal aerobic capacity, V̇O2max [mL/kg/min], of a patient [32]. V̇O2max, or relative V̇O2max, is
the body-weight-normalized version of (absolute) V̇O2max [L/min]. Absolute V̇O2max is
“considered to be the metric that defines the limits of the cardiopulmonary system. It is defined
by the Fick equation as the product of cardiac output [heart rate & stroke volume] and
arteriovenous oxygen difference … at peak exercise” [32]. Reporting the relative (normalized)
version is preferred since patients with higher body weight will naturally have a higher absolute
V̇O2max but will not necessarily have fundamentally greater functional capacity, exercise capacity
or exercise tolerance [32]. It is also important to note that peak V̇O2 is always an estimate of true
maximal aerobic capacity; its recorded value depends not only on the test modality used (treadmill
or bike) but is importantly predicated on the attainment of maximal/peak exercise by the patient
during the test [32].
• Ventilatory threshold (𝑉𝑇) [mL/kg/min], an estimate for, and sometimes interchangeably known
as, anaerobic threshold (𝐴𝑇), attempts to measure the exertion level at which a patient’s body
stops being able to keep up with their muscles’ oxygen demands [32]. It is an alternate index used
to infer exercise capacity but is predicated on the idea that people do not constantly perform
activities at maximal effort. AT, in a sense, is a measure of maximum continuously sustainable
exertion [32]. As AT is a submaximal index of exercise capacity, it is sometimes reported as a
percentage of peak V̇O2 [32].
• Respiratory exchange ratio (𝑅𝐸𝑅), the ratio between exhaled CO2 and inhaled O2 [32]. Of
particular interest is the peak RER which can be used to gauge if a subject is likely to have
achieved peak (or at the very least sufficient) exerted effort as part of the test [32]. It is known to
be more robust than heart rate response for measuring exertion, as heart rate response is often
highly variable even in healthy populations (and worse for patients with heart failure, since their
response is often modulated by medications).
• V̇E/V̇CO2 [breaths/L], or the relationship between minute ventilation and carbon dioxide output, is
used to estimate ventilatory efficiency: how many breaths it takes for the body to clear a given
unit of CO2 [32]. The relationship most often reported is a linear approximation of the
V̇E/V̇CO2 slope, which is highly robust against test modality and attainment of peak exercise by the
patient [32]. It is often used to infer the possible existence of ventilation-perfusion mismatching:
where the lungs are unable to efficiently clear CO2 from the circulatory system, either due to
circulatory problems causing poor blood flow or inefficient CO2 transfer due to some sort of lung
damage or disease [32]. (A brief illustrative sketch of estimating this slope is given below.)
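To make the idea of the V̇E/V̇CO2 slope concrete, a minimal sketch is given below: it fits a straight line to hypothetical minute ventilation and CO2 output measurements, with both quantities expressed in L/min as the slope is commonly computed. The values and variable names are invented for illustration and are not taken from any CPET dataset used in this work.

    import numpy as np

    # Hypothetical CPET measurements (illustrative values only)
    vco2 = np.array([0.5, 0.8, 1.1, 1.5, 1.9, 2.3])   # V̇CO2 [L/min]
    ve = np.array([15.0, 22.0, 30.0, 41.0, 53.0, 66.0])  # V̇E [L/min]

    # The commonly reported metric is the slope m of the linear approximation V̇E ≈ m·V̇CO2 + b
    slope, intercept = np.polyfit(vco2, ve, deg=1)
    print(f"V̇E/V̇CO2 slope ≈ {slope:.1f}")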
Many of these CPET measurements have been clinically validated and recommended to help inform
important decisions regarding heart failure care. For example, peak V̇O2 is used to risk-stratify certain
classes of HF patients when considering a heart transplant [55].
Others have already attempted to discover the relationship between NYHA class and various CPET
measures [11,24,25,56]. Rostagno et al. looked at 143 HF patients with NYHA functional class ranging
from I to IV but found low agreement between the assigned NYHA class and both peak V̇O2 and AT
(41.7% and 35%, respectively) [24].
Goldman et al. looked at the duration of treadmill tests and similarly found low agreement, with only
51% of their 150 estimates (75 patients with one estimate each by two independent physicians) agreeing
with the NYHA class assigned [11]. This is not terribly surprising but is instead consistent with what we
would expect based on Demers et al.'s 6MWT findings.
In a more recent analysis, Lim et al. performed a systematic review of 38 studies that investigated
the correlation between NYHA classification and peak V̇O2 (other CPET metrics were not reported
consistently enough for analysis) [56]. They found a significant difference between pooled peak V̇O2 values
for NYHA classes I vs. II and II vs. III (P < 0.0001 in both cases) [56]. However, they did not find a
significant difference when looking at classes III vs. IV [56]. Peak V̇O2 and NYHA classes I to III were
inversely correlated, although the strength of the correlation was not quantified [56].
To our knowledge no one else has published attempts to characterize the relationship between NYHA
class and other CPET measures. Despite the lack of research and evidence surrounding most of the CPET
metrics, Lim et al.’s findings regarding peak V̇O2 and NYHA class are an encouraging waypoint in the
quest to objectively assess NYHA classification. However, CPET studies do have some important
drawbacks.
One of the biggest drawbacks of running CPET studies is that they require access to expensive
equipment, trained personnel and a lab environment in which to perform the test [32]. Due to the
financial cost and time burden alone, it is likely that relying on CPET studies to assess NYHA class will
severely limit how often NYHA class can be re-assessed, which makes it less desirable for use in creating a
quick and easy method of assessing the severity of patients’ HF symptoms [54].
2.2.3 Fitness Trackers/Monitors
Modern commercially available fitness trackers, such as those developed by Fitbit Inc. [57–59], are a
promising, albeit little-used, candidate for assessing patient exercise capacity that would overcome many of
the drawbacks of cardiopulmonary exercise tests.
Activity & Step Detection
Activity trackers are small, portable devices that are worn on one’s person. They may be worn on
one’s feet or shoes, clipped on the belt near one’s hip, or worn on one’s wrist like a wristwatch
[41,43,64,65,45,57–63]. The classic pedometers of yore are in fact a type of activity tracker, but they are
specifically limited to counting steps [65,66]. Most modern activity trackers are more precise and
often more multi-functional than the classic pedometer [57–59,64]. Even from a pure motion detection
perspective, older pedometers were often limited to single-axis accelerometers which could only detect
movement (specifically acceleration) in one axis [66].
Modern activity trackers have been found to track minute-by-minute step count fairly reliably
[37,41,43,45,46,65,67–70]. Straiton et al. [70], in a systematic review of 7 observational studies
including a total of 290 elderly patients (mean age 70.2 ± 4.8 [years]), found a high correlation
between step counts recorded by the test devices and those recorded by the reference devices used in each study. The
reference devices used in the individual studies varied but were typically a previously validated research-
grade activity monitor such as an ActiGraph™ [71] or BodyMedia Sensewear device (no longer available).
In their review they found that “daily step count for all consumer wearables correlated highly with
validation criterion, especially the ActiGraph device: intraclass correlation coefficients (ICC) were 0.94 for
Fitbit One, 0.94 for [Fitbit] Zip, 0.86 for [Fitbit] Charge HR and 0.96 for Misfit Shine. Slower walking
pace and impaired ambulation reduced the levels of agreement” [70]. Physical activity and energy
expenditure estimation, as supported by these devices, was also found to be accurate but generally less so
than step count measurements.
Evenson et al. (2015) [68], who cast a wider net and conducted a systematic review that included 22
observational studies on adults and youth (20:2), similarly found generally high correlations between the
step measurements of the various Fitbit and Jawbone devices investigated in these studies and those of the
reference devices used. The correlation coefficients (intraclass or Pearson) were found to be ≥ 0.8
for all the devices (Fitbit and Jawbone) investigated in all the laboratory studies reviewed. Many of the
studies found an even higher correlation, in the > 0.9 range, and even up to 0.99 for both
Jawbone and Fitbit devices [68]. Evenson et al. also found that physical activity and energy expenditure
estimation were generally less highly correlated than pure step-tracking.
El-Amrawy in 2015 [44] recorded 4 participants who performed 40 repeated sets of 200, 500 and 1000 step
walks and found that step count accuracy varied from an average of 99.1% for the MisFit Shine and
Apple Watch, to 79.8% for the Samsung Gear 2, as compared to the steps counted by a tally counter
equipped observer. Other popular mainstream contenders included the Fitbit Flex (80.5%), the Jawbone UP
(82.51%) and the Xiaomi Mi Band (96.6%).
Overall, research points to step-tracking by modern mainstream commercial activity trackers as being
highly correlated to equivalent research grade reference devices. Certain activity trackers such as the
MisFit Shine appear to be more consistently in agreement with validated reference devices, which may
make them optimal for studies where step count values must be as accurate as possible. However, we
maintain that all the activity trackers discussed are likely suitable for practical applications of step count
tracking. Other features that should be considered are easier access to gathered data, lower cost, improved
ease of use for the patient, or the ability to detect some other important physiological marker.
Heart Rate Detection
With respect to other physiological markers, some of the major players in the commercial activity
tracker market, namely Fitbit™ [58] and Apple™ [64], have recently pioneered the integration of heart
rate monitoring capability alongside the step counting provided by their devices. These augmented fitness
trackers, which are worn on the wrist, also monitor heart rate non-invasively by detecting the flow of
blood under the surface of the wearer’s skin [41,44,72–74]. This technique, known as
photoplethysmography (PPG), has been well validated since its discovery in the 1930s and is commonly
used in various clinical settings [75,76]. In fact, it is the core technology that underpins pulse oximetry
[75,76].
The fundamental principle that underpins PPG itself is the absorption and reflection of light by various
body tissues [75,76]. By shining carefully selected frequencies of light on the surface of the skin and
recording either, the light reflected off of, or transmitted through the skin, one can detect changes in
perfusion of the surface tissues being illuminated. An example of the resulting waveform is shown in
Figure 2-3. Although the precise physiological cause of the perfusion changes measured by the PPG
Figure 2-3 PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288].
17
waveform is still a matter of debate [76], it is clear that certain characteristics of the waveform are
synchronized with heartbeat, and can thus be used to track heart rate. The shape of the waveform is also
known to be correlated with arterial blood pressure, another clinically important physiological marker
[75,76].
One important parameter that can also affect the PPG waveform is the choice of light [75,76]. Light
absorption/reflection characteristics of various body tissues are highly frequency dependent [75,76]. One of
the most important applications of PPG, arterial blood oxygen measurement, depends on this fact [75,76].
Furthermore, the frequency response of oxygen saturated versus desaturated blood is known to vary at
different light frequencies. If we measure separate PPG waveforms using red and near-infrared light, we
can measure the relative difference in light absorbed at these different frequencies [75,76]. The resulting
difference can then be used to infer the degree to which the blood is saturated vs. desaturated [75,76].
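As a rough illustration of the principle just described, the sketch below computes the classic "ratio of ratios" used in pulse oximetry from the pulsatile (AC) and baseline (DC) components of red and infrared PPG signals. The calibration constants shown are generic, textbook-style approximations introduced here for illustration; they are not values from any particular device, and real devices use device-specific empirical calibrations.

    import numpy as np

    def toy_spo2_estimate(red: np.ndarray, infrared: np.ndarray) -> float:
        """Estimate oxygen saturation from red and infrared PPG traces (illustrative only)."""
        # Pulsatile (AC) and non-pulsatile (DC) components of each channel
        r_ac, r_dc = red.max() - red.min(), red.mean()
        ir_ac, ir_dc = infrared.max() - infrared.min(), infrared.mean()

        # "Ratio of ratios": relative absorption difference between the two wavelengths
        ratio = (r_ac / r_dc) / (ir_ac / ir_dc)

        # Generic empirical linear calibration (hypothetical constants, device-specific in practice)
        return 110.0 - 25.0 * ratio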
While fitness trackers do not yet measure arterial blood pressure or use different types of light to measure
oxygen saturation, some newer models of fitness trackers (e.g. the Fitbit Charge HR 2 [58] and Apple
Watch [64]) take advantage of the varying light frequency response of blood by instead using green light
which has been found to be more reliable for pulse rate monitoring [77].
Research has shown that consumer heartrate trackers are fairly reliable as compared to clinical grade
devices [41,44,73,74,78,79]. However, they do provide considerably less detail than clinical grade devices.
Consumer devices generally only capture a minute-by-minute pulse rate, as opposed to the complete ECG
waveform provided by a Holter monitor or non-portable ECG setup.
In a 2016 study, Wang et al. monitored 50 healthy patients on a treadmill test and compared the heart
rate measured by various fitness trackers to the heart rate recorded by an ECG and found them all to be
highly correlated [78]. The concordance coefficients were .99 for the Polar H7 device, .91 for the Apple
Watch, .91 for the Mio Fuse, .84 for the Fitbit Charge HR and .83 for the Basis Peak.
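Wang et al.'s concordance metric is not defined in the excerpt cited here; one common choice is Lin's concordance correlation coefficient, sketched below for paired tracker and ECG readings. The sample values are invented for illustration only.

    import numpy as np

    def concordance_correlation(x: np.ndarray, y: np.ndarray) -> float:
        """Lin's concordance correlation coefficient between two paired measurement series."""
        covariance = np.cov(x, y, ddof=0)[0, 1]
        return 2 * covariance / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    # Invented example: heart rates from an ECG vs. a wrist tracker during one test
    ecg = np.array([72, 95, 110, 128, 140, 155], dtype=float)
    tracker = np.array([70, 96, 108, 130, 137, 158], dtype=float)
    print(round(concordance_correlation(ecg, tracker), 3))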
In a previously mentioned study, El-Amrawy et al. recorded 4 participants who performed 40 repeated
sets of 200, 500 and 1000 step walks. As part of this study they also compared the heart rate of various
activity monitors to the heart rate reported by a research validated professional clinical pulse oximeter
[44]. The devices investigated, with their corresponding heart rate accuracy (as percent mean deviation
from the average recorded heart rate) and the associated standard deviation (σ) of the measurements,
ordered from most to least accurate, were the Apple Watch (99.9%, σ = 5.7%), Samsung Galaxy Note
Edge (99.6%, σ = 14.4%), Apple iPhone 6 running Cardioo App [80] (99.2%, σ = 6.3%), Samsung Galaxy
S6 Edge (98.8%, σ =11.6%), Samsung Gear 2 (97.7%, σ = 16.5%), Apple iPhone 5S running Cardioo App
(97.6%, σ = 12.4%), Samsung Gear Fit (97.4%, σ = 28.8%), Samsung Gear S (95.0%, σ = 20.9%), and
Motorola Moto 360 (92.8%, σ = 14.1%).
Cadmus-Bertram et al. in a 2017 study, also investigated the heart rate accuracy of several wrist-worn
activity trackers [79]. They were particularly interested in the limits of agreement of the reported
beats/minute (bpm) of each of the devices at different heart rate intensity levels. They also studied the
devices’ accuracy by measuring the mean difference between the heart rates measured by the trackers and
a simultaneously recorded reference ECG. The limits of agreement were defined as the 95% prediction
interval for the mean difference between the tracker and ECG measurements. They also compared
measurement agreement of different devices from the same model series (i.e. comparing measurements
between 2 Fitbit Surges in otherwise identical test conditions), which they termed measurement
repeatability. As for the different heart rate intensity levels, they investigated the heart rate accuracy at
rest and at 65% of each study participant's maximum heart rate while running on a treadmill
(as determined by the maximum heart rate equation: 𝑀𝑎𝑥 𝐻𝑒𝑎𝑟𝑡 𝑅𝑎𝑡𝑒 = 220 − 𝑎𝑔𝑒). The 40 study
participants were all healthy and between 30 and 65 years old (mean ± σ of 49.3 ± 9.5 [years]), and wore
2 trackers on each wrist (randomly assigned left vs. right, and proximal vs. distal to the wrist). Cadmus-
Bertram et al.’s findings, including the mean difference, limits of agreement and measurement
repeatability results, are reproduced for easier reading in Table 1. They found that the activity trackers
had excellent accuracy with a mean difference of ≤±2.8 [bpm] between activity trackers and reference
device whether at rest or while exercising. No further quantitative comparison was made between mean
difference at rest vs exercise. For reference, a 1 [bpm] agreement error at 65% of the maximum heart rate
of a 30, 49.3 and 65-year-old (minimum, mean and maximum age of participants in this study) represents
a percent error of 0.8, 0.9 and 1.0%. At rest, or rather, at heart rates of 60 and 100 [bpm] - the lower and
upper limits of the commonly accepted resting heart rate range [81,82] - the same 1 [bpm] agreement error
represents a percent error of 1.6 and 1.0%2. The precision, as measured by the limits of agreement, was
found to be less impressive. At rest, they ranged from good, -5.1 to 4.5 [bpm] (Fitbit Surge), to relatively
poor, -17.1 to 22.6 [bpm] (Basis Peak). The performance of the intermediate devices investigated (Fitbit
Charge and Mio Fuse), which had limits of agreement of ~±10 [bpm], was closer to the performance of
the Fitbit Surge than the Basis Peak. During exercise (@ 65% maximum heart rate), the precision
degraded considerably, with lower limits of agreement ranging from -41.0 [bpm] in the worst case (Fitbit
Charge) to -22.5 [bpm] (Mio Fuse) in the best case, and upper limits of agreement ranging from 39.0
2 ∴ as a rule of thumb for mental calculations: 1 [bpm] error = 1% (2% when in the 40-60 [bpm] range)
[bpm] (Fitbit Surge) in the worst case to 26.0 [bpm] (Mio Fuse) in the best case. With respect to
repeatability between devices, most devices were found to be around half as repeatable as the ECG
whether at rest or during exercise, with only two exceptions: 1) the Fitbit Surge, which was found to be
possibly slightly more repeatable than the ECG at rest (unfortunately no significance test was provided),
and 2) the Basis Peak which was found to be only a quarter as repeatable as the ECG at rest.
Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79]

                    -------------- @ Rest --------------    ----- @ 65% Maximum Heart Rate -----
Device              Mean        Limits of       Repeat-     Mean        Limits of       Repeat-
                    Difference  Agreement       ability     Difference  Agreement       ability
                    [bpm]       [bpm]           [bpm]       [bpm]       [bpm]           [bpm]
ECG                 reference   - to -          5.3         reference   - to -          9.1
Fitbit Surge        2.8         -5.1 to 4.5     4.2         1.0         -34.8 to 39.0   20.6
Mio Fuse            -0.7        -7.8 to 9.9     10.9        -2.5        -22.5 to 26.0   23.7
Fitbit Charge       -0.3        -10.5 to 9.2    9.3         2.1         -41.0 to 36.0   21.6
Basis Peak          1.0         -17.1 to 22.6   19.3        1.8         -27.1 to 29.2   20.2
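For readers unfamiliar with the limits-of-agreement analysis summarized in Table 1, the sketch below computes the mean difference and the 95% limits of agreement (Bland-Altman style) for paired tracker and ECG heart rate readings. The data are invented for illustration and do not reproduce Cadmus-Bertram et al.'s raw measurements.

    import numpy as np

    def limits_of_agreement(tracker_bpm: np.ndarray, ecg_bpm: np.ndarray):
        """Mean difference and 95% limits of agreement between paired heart rate readings."""
        diff = tracker_bpm - ecg_bpm
        mean_diff = diff.mean()
        spread = 1.96 * diff.std(ddof=1)   # half-width of the 95% prediction interval
        return mean_diff, (mean_diff - spread, mean_diff + spread)

    # Invented paired readings [bpm]
    tracker = np.array([68.0, 75.0, 92.0, 101.0, 120.0, 131.0])
    ecg = np.array([70.0, 74.0, 95.0, 100.0, 126.0, 128.0])
    mean_diff, (lower, upper) = limits_of_agreement(tracker, ecg)
    print(f"mean difference: {mean_diff:+.1f} bpm, limits of agreement: {lower:.1f} to {upper:.1f} bpm")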
Our lab, the Centre for Global eHealth Innovation, also recently investigated the heart rate accuracy of
two of the most popular activity trackers at the time: the Fitbit Charge HR and the Apple Watch [41]. In
this 2016 study, R. Abdulmajeed studied 8 healthy participants using a similar methodology to Cadmus-
Bertram et al. although at different exercise intensity levels, which were controlled using a variable
resistance stationary bicycle. The accuracy of the two trackers (worn simultaneously) was measured
against the ECG results of a portable Holter monitor. Abdulmajeed found a similar, though slightly worse,
level of agreement at rest between the Holter monitor and the trackers investigated (mean heart rate difference for the Fitbit Charge HR: 6.00%;
Apple Watch: 3.32%) compared to Cadmus-Bertram et al.'s findings. Abdulmajeed’s findings also hint at a
possibly slightly non-linear relationship between percent agreement and heart workload/heart rate as it
appeared to decrease slightly with increasing workload (Fitbit Charge HR: peak of 8.68% at 40 [watts];
Apple Watch: peak of 7.51% at 30 [watts]) before improving to near complete agreement at higher
workloads (Fitbit Charge HR: <±0.5% when ≥ 80 [watts]; Apple Watch: <±0.75% when ≥ 60 [watts],
except at 90 [watts], where the agreement was -1.64%). These findings are reproduced in an easier-to-read
format in Table 2, along with the heart rates corresponding to the quoted workload intensities.
Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41]

Workload   Holter Monitor Heart Rate [bpm]     Mean Heart Rate Difference [%]     Pearson Correlation Coefficient
[Watts]    Minimum  Average [sic]  Maximum     Fitbit Charge HR   Apple Watch     Fitbit Charge HR   Apple Watch
0          68       85             102         6.00               3.32            0.406              0.567
10         69       86             102         6.93               4.56            0.593              0.305
20         68       89             114         5.41               6.12            0.951              0.597
30         68       93             129         8.34               7.51            0.973              0.61
40         73       96             129         8.68               5.49            0.93               0.78
50         84       102            132         8.10               2.27            0.88               0.811
60         87       109            136         3.69               -0.45           0.957              0.965
70         88       116            142         1.63               -0.75           0.98               0.994
80         95       122            150         -0.20              -0.72           0.994              0.997
90         99       129            155         -0.10              -1.64           0.986              0.993
100        105      136            161         0.46               0.37            0.992              0.994
Summarizing the findings of these 4 studies, it appears that the findings of Wang et al., Cadmus-Bertram
et al. and Abdulmajeed are in clear agreement that heart rate measurements of activity monitors
generally have high accuracy and correlation with measurements performed by clinical grade equipment.
It also appears, based on El-Amrawy et al.’s findings, that there is very high correlation between the
individual heart rate measurements of the many commercial trackers on the market; this is perhaps unsurprising,
as most of the contenders leverage the same well-validated PPG technology with some minor
modifications to make it fit the form factor of the wearable device. Where the performance of these
trackers appeared to differ greatly from clinical reference devices was in the variance of repeated
measurements. Of the trackers investigated in the study, Cadmus-Bertram et al. found that the devices
were typically half as consistent as an ECG regardless of whether the measurements were done while
active or at rest.
Comparison to Cardiopulmonary Exercise Testing
Based on recent research findings, it is clear that modern activity trackers are fairly reliable at tracking
both step count and heart rate [37,41,79,43–45,65,67–69,73]. It is also clear
however that these devices are definitely less accurate and less precise than the gold-standard CPET.
That being said, these devices have significantly lower upfront costs than CPET equipment and require
little to no dedicated personnel or physical space in the hospital to run tests. "Replacing" patient memory
with activity trackers could still eliminate a significant source of subjectivity and potential error while
being potentially easier and less costly to administer than a full CPET.
Of course, fitness trackers provide fewer distinct data streams than a CPET, usually limited to just steps
and possibly heart rate. While few researchers have attempted to examine the interplay between fitness
tracker heart rate and step count data streams, it is possible that, in the same way that an inertial measurement
unit (IMU) can combine the disparate, independently error-prone sensor outputs of an accelerometer, gyroscope and magnetometer
using sensor-fusion, the same might be done with activity tracker step count and heart rate data for HF
patients and thereby reduce or remove the need for the extra data provided by a CPET. Whether these
two data streams alone are sufficient to objectively assess NYHA class or perform a useful clinical
function for HF patients though is still yet to be determined. The concept however is clearly not
unreasonable: even though hospitals have only recently begun to consider the use of fitness trackers as
part of regular care, there have been some very early successes in using single data streams from trackers
to perform useful clinical functions such as monitoring step count for post-surgical readmission prediction,
or using the heart-rate data for arrhythmia detection outside the hospital [83–88].
Fitness monitors though have another advantage over CPETs: the low cost and portable nature of fitness
trackers means that patients can even be monitored outside the hospital during free-living. Capturing
real-world free-living activity of HF patients might provide a quantitative insight into the limitations
brought about by a patients’ HF symptoms. In fact, a recent exploratory study investigated this exact
concept, sending 8 HF patients home with activity trackers for a period of two weeks. The study found a
statistically significant difference between the daily average step counts of patients in different NYHA
functional classes [13]. Unfortunately, the study’s very small sample size greatly limits scientific
confidence in the generalizability of these findings. In response, we replicated this study using a larger
sample size as the first phase of this work (detailed in Chapter 3) to independently verify these very
promising findings. It would be hugely beneficial to patient care if data streams of regular real-world free-
living activity data made it possible to more routinely reassess NYHA class and even allow for more
prompt detection of important HF status changes.
Remote Patient Monitoring
Regular reassessment of a patient’s status and the continued monitoring of said patient while they are
outside the hospital falls under the broader umbrella of telemedicine [89] and is formally termed Remote
Patient Monitoring (RPM).
RPM, as a specific application of telemedicine, is of particular interest for patients with chronic conditions
[90–92]. An acute exacerbation of a chronic condition can often bring patients into costly hospital
emergency rooms for post-hoc care instead of less costly pre-emptive care/management that might have
prevented the exacerbation in the first place [4,14,92,93]. This leads to both suboptimal care for the
patient and misallocation of resources in an already and increasingly strained health sector [4,14,93–
95].
There have been many documented attempts at creating RPM systems targeted towards HF patients.
Even though researchers have not come to a consensus about the exact effect of RPM systems on
outcomes, based on several meta-analyses of recent literature, it appears that these systems are sometimes
capable of delivering on the promise of providing better care at lower cost.
In a 2018 meta-analysis, Yun et al. [96] reviewed 37 randomized control trials (RCT) covering a total of
9582 HF patients and found that the patient groups receiving telemonitoring care had significantly lower
HF-related mortality (risk ratio: 0.68, 95% confidence interval (CI): 0.50-0.91, no P-value) as well as all-
cause mortality (risk ratio: 0.81, 95% CI: 0.70-0.94, no P-value) compared to standard care.
Patients were found to benefit significantly when their RPM system transmitted data at least once per
day, or when it transmitted multiple (≥3) streams of biological data (e.g. weight, blood pressure and heart
rate). Yun et al. also noted that monitoring patient symptoms, medication adherence and prescription
changes was also associated with reduced mortality risk.
Klersy et al. [97] in their 2014 meta-analysis of 21 RCTs covering a total of 5715 patients, investigated
the healthcare utilization and economic impact of RPM on HF care. They found that, compared to the
control groups, the telemonitored patient groups experienced significantly fewer HF-related
hospitalizations (incidence rate ratio: 0.77, 95% CI: 0.65-0.91) as well as all-cause hospitalizations
(incidence rate ratio: 0.87, 95% CI: 0.79-0.96) resulting in a per patient quality-adjusted life years gain of
0.06 years (approximately 22 days). Furthermore, RPM was associated with a yearly patient cost savings
of €300 to €1000 (approximately $460 to $1535 CAD based on the 2014 exchange rate). The cost savings
were conservatively estimated solely based on the associated third-party payer hospitalization
reimbursement costs for the patients in the meta-analysis.
As mentioned though, not all evidence points towards RPM being a unilaterally positive effector of
change: of note are 3 commonly cited large high-powered RCTs that found no significant effect on
outcomes for HF patients undergoing telemonitoring [50,98,99]. While these 3 studies are certainly
not the only studies to have found little positive change from RPM implementations, their scope makes
them hard to simply dismiss. Ware et al. [100], in a comprehensive review piece, discuss the various
reasons why it is so hard to form a definitive consensus regarding the effects of home telemonitoring
systems in healthcare. They argue that RPM implementations are often viewed as simple one-size-fits-all
interventions (perhaps like a silver bullet) but they are in fact complex socio-technologic systems that are
(or should be) adequately tailored to suit the specific context in which they are implemented - a fact that
is often overlooked when assessing them. Some of the very important factors that impact the successful
implementation of any technology often go unreported or unaddressed in studies. This includes:
appropriate characterization of the intended and actual user groups (both patient population and clinical
staff), suitability of the home telemonitoring (HT) service for the implementation context (e.g. how is the
system resourced, and what actual user needs is it attempting to address), implementation strategy used
(including training, methods of ensuring adherence to the ‘system as-intended’), suitability of the evaluation
approach for capturing the desired outcome (e.g. are RCTs an adequate trial design for capturing outcomes in an
evolving socio-technical system?), what the actual desired outcomes for the intervention are (reduced
mortality? increased patient quality of life? purely cost reduction?), and whether these outcomes match up with
stakeholder expectations. In their words:
“HT has been shown to reduce mortality and HF hospitalizations and improve clinical
outcomes in HF patients. Despite this evidence, significant heterogeneity exists in the
design of HT interventions, the implementation context, and outcomes of individual
studies, leading to ambiguity about the true effect of HT on HF outcomes. HT is not
one, but rather a collection of complex interventions for which success or failure is
linked to a range of contextual factors. These factors cannot be ignored if we are to
design studies that will offer more definitive answers about the effect of HT on HF
outcomes.” [100]
2.3.1 Medly
For this particular thesis we piggy-backed off of a specific RPM system: Medly, a mobile-phone based
HF patient telemonitoring system currently in place at (and adapted for use by) the Ted Rogers Centre of
Excellence for Heart Function, a tertiary care clinic for HF patients located in TGH in Toronto, Canada
[101,102]. A previous iteration of Medly, and thus its core features, have previously been validated
through a 6 month RCT, which found that its targeted telemonitored patient user group, relative to base-
line, had improved self-care maintenance (Δ = +7 points, P = .05) and management (Δ = +14 points, P
= .03) as measured with the Minnesota Living with Heart Failure Questionnaire, improved levels of brain
natriuretic peptide (BNP) - a biomarker associated with HF stability (Δ = -150pg/ml, P = .03) and
improved left-ventricular-ejection-fraction (LVEF) (Δ = +7.4%, P = .005) compared to the control group
[103]. In recognition of the complex multi-faceted nature of telemonitoring interventions, we provide a
more detailed discussion of the intervention and its unique context in Chapter 4, as part of the larger
discussion of how we implemented an initial version of activity tracker monitoring as part of Medly.
One of the important core features of Medly is an innovative computer algorithm capable of generating
timely, safe, and clinically-relevant messages (instructions or alerts) to patients and clinical staff [104].
The intent of this feature is to enable Medly to provide a cost-effective and scalable way of monitoring
patients on a daily basis by limiting the impact on the workload of clinical staff while simultaneously
leveraging ‘teachable moments’ to improve patient self-care maintenance and management [3,104,105].
This is accomplished by imbuing the system with a limited ability to mimic the decision making and
actioning process of the expert clinical staff at the Heart Function clinic so that the system is able to
adequately triage, and respond to or elevate clinical concerns to staff as necessary while providing patients
with regular feedback about their own condition [104]. Of course, the concept of imbuing a machine with
decision making ability (limited or otherwise) belongs to the now resurging field of artificial intelligence.
Artificial Intelligence & Machine Learning
Artificial intelligence (AI) broadly refers to the concept of intelligence (e.g. learning, decision making,
perception and recognition, creativity and problem solving) exhibited by machines (typically computers,
but formally, anything not imbued with natural intelligence as humans and animals are) [106–109]. The field
of AI is as fascinating as it is expansive. Although the field only became a formal academic discipline unto
itself in 1956³ [108,109], it spans and draws from the fields of mathematical, statistical and computer
sciences, delves into psychology and neurology, and is even starting to pose new and challenging
philosophical, ethical and economic questions (such as ‘what actually is intelligence? what decisions should
and shouldn’t we delegate to a computer? what will be the place of humanity if computers can beat us at
everything?’).
One of the early successful approaches to creating artificial intelligence was to train a computer program
(like Medly) to mimic the decisions of a human expert, like a cardiologist or nurse, in what is formally
termed an ‘expert system’ [106,110]. Expert systems are typically created by first extracting a series of
formalized facts from the target experts and translating them, typically, into formal conditional, ‘if-then’,
logic statements. For example: if a patient is male and older than 35 and has chest pain, then suspect a
heart attack; if a heart attack is suspected, then perform an ECG. These facts form the ‘knowledge base’
of the expert system. The machine can then use this knowledge base in conjunction with an ‘inference
engine’, which uses some formal logic system - such as zeroth-order propositional logic (i.e. modus
ponens4, modus tollens5, etc.) - to manipulate the contents of the knowledge base and draw conclusions,
make decisions or supply recommendations (if a patient is male and older than 35 and has chest pains
then perform ECG). The machine can then also be asked to ‘show its work’ by displaying the exact step
by step deductive, inductive and/or abductive logic processes used to reach its final conclusion [110].
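A minimal sketch of the idea, using the chest-pain example above: the 'knowledge base' is a list of if-then rules over known facts, and a naive 'inference engine' repeatedly applies the rules until no new facts can be derived. This is only a toy illustration of the concept, not a description of how Medly or any real expert system is implemented.

    # Toy knowledge base: (antecedent facts, consequent fact) pairs
    rules = [
        ({"male", "older_than_35", "chest_pain"}, "suspect_heart_attack"),
        ({"suspect_heart_attack"}, "perform_ecg"),
    ]

    def forward_chain(facts):
        """Naive forward-chaining inference: apply rules until no new facts appear."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for antecedents, consequent in rules:
                if antecedents <= derived and consequent not in derived:
                    derived.add(consequent)
                    changed = True
        return derived

    print(forward_chain({"male", "older_than_35", "chest_pain"}))
    # {'male', 'older_than_35', 'chest_pain', 'suspect_heart_attack', 'perform_ecg'}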
Expert systems have seen application in various sectors, but are especially useful where demand for
expertise is high but supply is relatively low or expensive, for example in the health care, finance,
and legal sectors [106,110].
In the case of NYHA functional class assessment, (a function not presently performed by Medly), one
might theoretically create an expert system which could mimic expert grading by (an) experienced ‘model’
physician(s). However, in doing so one would run into one of the major issues with expert systems: the
knowledge acquisition problem. Since creating traditional expert systems relies on the premise that there
are experts available who can formalize their knowledge into statements suitable for interpretation by
3 McCarthy et al. famously “proposed a 2 month, 10 man study of artificial intelligence to be carried out during the
summer of 1956… [they thought] that a significant advance [could] be made… if a carefully selected group scientists
work on it together for a summer.” Suffice it to say, the problem of AI turned out to need more than a small summer
research project to solve.
4 affirming the antecedent: If P then Q; P; ∴ Q
5 denying the consequent: If P then Q; not Q; ∴ not P
some inference engine, the actual implementation of these expert systems becomes compromised when 1)
there are insufficient experts available, or 2) their knowledge cannot be formalized adequately (or even at
all). In the case of objective NYHA functional class assessment (an unsolved problem), the situation is
fairly simple: there are no experts available - which precludes the creation of a traditional expert system
entirely. Fortunately, the field of AI has developed beyond just expert systems.
2.4.1 Machine Learning
An alternative to having experts a-priori supply all the knowledge required for an AI to ‘think’ is to
instead make an AI that can ‘learn’ that knowledge by itself from input data or example cases. This is a
sub-domain of AI called machine learning6 [106,107]. This sub-domain is also fairly large, as many
different approaches have been developed since 1956 as part of different attempts to get computers to
extract useful knowledge from data [111]. Some of these approaches are more suitable for different types
of machine learning problems; so it might be helpful to first clarify how machine learning problems are
classified, broadly, before determining which machine learning category the problem of NYHA functional
class assessment falls into.
2.4.2 Supervised, Unsupervised and Reinforcement Learning
The first important way to classify machine learning problems is by learning modality. Machine
learning problems come in 3 major types: supervised learning, unsupervised learning and reinforcement
learning problems [111–113].
1) Supervised learning problems, the most common type, are those where both the input and output
variables are provided. The computer learns a mapping function to accurately convert the inputs
to outputs, even inputs that haven’t been seen before [111,112]. In other words, for a given input
variable 𝑥 and output variable 𝑦, where 𝑦 = 𝑓(𝑥), find a suitable 𝑓 [111,112].
2) Unsupervised learning problems are those where only the input variable (𝑥) is provided – neither the
output variable (𝑦) nor the mapping function (𝑓) is known – and the objective of unsupervised
learning is usually to have the machine discover underlying patterns in the data [111,112].
6 Colloquially, the terms ‘artificial intelligence’ and ‘machine learning’ are sometimes used interchangeably (e.g.
[107]). However, machine learning technically refers to the task of getting machines to mimic the ‘learning’ process of
intelligence, whereas artificial intelligence refers to the field (inclusive of all its subdomains) as a whole. In this work
we use the technical terms exclusively.
3) Reinforcement learning approaches the concept of learning from an entirely different perspective
than supervised and unsupervised learning [113]. In reinforcement learning there is, in a sense,
neither a static 𝑥, 𝑦 nor 𝑓. Rather, the machine learns by trial and error from successive
interactions with an external environment what actions it should take to optimize the value of
some future reward [113]. In other words, the machine must not only consider how to interpret
the present state of its environment, but also which actions to take (and by extension which
additional input data to collect about its environment), and finally decide which actions are most
appropriate to bring it closest to its goal based on the past success or failure of previous actions
[113]. Reinforcement learning methods are thus the realm of ‘game-playing’ AIs, such as AlphaGo
[114], which ‘plays’ the board game Go, OpenAI Five [115,116], which competes at Dota 2 (a
multiplayer online battle arena video game), and the various AIs that compete at real time
strategy video games like Starcraft/Starcraft 2 [117].
The question of objective NYHA class assessment clearly falls under the class of supervised learning, since
we have a known output label – NYHA functional class – that we wish to determine based on some input
variables, or ‘features’, in our dataset. Our question is whether it is possible to find an adequate mapping
function given our input data.
2.4.3 Classification vs Prediction Problems
Supervised learning algorithms can be further categorized by the expected output of the algorithm:
either a categorical label or a numerical prediction. The former is termed a ‘classification’ problem, and
the latter a ‘prediction’ or ‘regression’ problem [111–113]. Note that while the term ‘prediction’ has a
temporal connotation, prediction problems need not be temporal in nature – a prediction need not
necessarily be a forecast for or of the future. Inferring a missing value in a dataset, such as a missing
grade for a student’s assignment based on their other assignments, would be just as valid a prediction
problem as forecasting the next day’s temperature based on historical temperature data. In contrast,
forecasting whether the next day will be ‘hot’ or ‘cold’ is an example of a classification problem.
Determining the probability that a patient falls within a given NYHA class would be a supervised
prediction problem. However, since we wish to assign a categorical label (i.e. a NYHA functional class) to
each patient, we are instead tackling a supervised classification learning problem.
There are various algorithms for addressing supervised classification problems. These include Generalized
Linear Models, Random Trees & Forests, Neural Networks and Support Vector Machines. The author
whole-heartedly recommends the book “Programming collective intelligence” by T. Segaran for an
accessible, yet thorough primer on these and other modern machine learning techniques [111]. Segaran’s
book mostly discusses machine learning algorithms that are fed with cross-sectional data (i.e. where all
the data is acquired at a particular ‘slice’ of time or where the order or sequence of the data is not
necessarily considered important). Since our application involves the use of time series data where the
order of data is important, we also specifically explored the use of hidden Markov models, which are a
type of machine learning algorithm that is considered highly suitable for learning from time series data. It
has been applied to problems as disparate as speech recognition [118], stock market pricing analysis [119],
seizure classification [120] and human physical activity recognition [62,121]. A brief introduction to HMMs is
provided for the reader's convenience in Appendix B.
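To ground the terminology before moving on, the sketch below shows what a supervised classification set-up looks like for this kind of problem: input features 𝑥 (here, invented daily step count and resting heart rate summaries) paired with output labels 𝑦 (NYHA class), and a learned mapping 𝑓. The features, values and the random-forest choice are purely illustrative; they are not the features or models developed later in this thesis.

    from sklearn.ensemble import RandomForestClassifier

    # Invented training examples: [mean daily steps, mean resting heart rate] -> NYHA label
    X = [[7500, 62], [6800, 65], [4200, 74], [3900, 78], [1800, 84], [1500, 88]]
    y = ["II", "II", "III", "III", "IV", "IV"]

    f = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # The learned mapping can then label a previously unseen patient summary
    print(f.predict([[5000, 70]]))  # e.g. ['II'] or ['III'], depending on the learned boundaries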
2.4.4 The Effect of Sample Size on Machine Learning
Before we address the current state of research at the intersection of machine learning and HF
assessment, we briefly comment on an important consideration of machine learning: the amount of data
required to train a machine learning algorithm. Machine learning is notorious for being particularly data
intensive [119,122,123]. This notoriety likely explains why the term Big Data is often (incorrectly) used
interchangeably with machine learning in popular parlance [124].
Machine learning practitioners generally consider data sets on the order of hundreds of samples to be
relatively small [122,123,125]. In fact, most traditional ML algorithms are hard to properly validate even
when the training dataset in question contains more than 200 events of interest per candidate ML feature
- even some of the simplest models using logistic regression require at least 20-50 per candidate feature
[126]. The exact size of a data set required to properly train a typical Hidden Markov Model (or any
machine learning algorithm in general) depends on a number of different factors including: the method of
classification, complexity of the classifier, separation between classes, variance and presence of noise in the
data. The noisier, the more complex and the greater the variance in the data, typically the larger the
dataset required to achieve good performance. There is no upper limit for how much data should be used
for training but there is a point at which increasing input data begins to yield diminishing returns in
improving predictive performance [123]. The exact relationship between training set size and predictive
performance for an algorithm and problem in question is often shown as a 'learning curve' graph (which
plots training set size versus prediction error(s)). To the best of the author's knowledge the learning curve
for this particular application (or a sufficiently analogous application) has not yet been determined.
However, given that we expect the data collected in this study to be relatively noisy and complex,
the model will likely lean towards requiring more data rather than less. Since biomedical
data is typically in short supply, we will endeavour to collect as much data as possible in order not to
prematurely limit the power or the generalizability of the algorithm developed.
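For illustration, the sketch below shows how a learning curve of the sort described above could be produced, assuming a scikit-learn-style workflow and a synthetic stand-in dataset (not the data collected in this work).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    # Synthetic stand-in dataset; in practice X, y would be the collected patient features and labels
    X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

    train_sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    # Plotting train_sizes against 1 - mean test score gives the 'learning curve'
    for n, err in zip(train_sizes, 1 - test_scores.mean(axis=1)):
        print(f"training size {n}: cross-validated error {err:.2f}")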
2.4.5 State-of-the-art
Tripoliti et al. [127] published a comprehensive review in 2017 on the state-of-the-art for machine
learning applications in HF management. They found that across the 45+ unique studies reviewed,
various machine learning techniques have been applied to both: a) the prediction of adverse HF events
including destabilizations, mortality and hospitalization, as well as b) the diagnosis of HF including HF
detection, recognition of sub-types of HF, and estimation of severity (e.g. NYHA functional class). Input
data included the standard demographic data, but also variously: clinical history, laboratory and ECG
data, and various features that were extracted or computed from the input data. NYHA functional class
was often included in the studies as part of the input demographic data, but only 4 studies investigated it
specifically as a classification task.
In 2011, Pecchia et al. [128] presented a telemonitoring system that collected and used patient ECG data
for HF detection and classified patients as having either NYHA class III (labeled as ‘severe HF’), or
NYHA class I or II (labeled as ‘mild HF’). The detection and severity classification tasks are each
performed with a single decision tree, specifically one generated using the Classification And Regression
Trees (CART) algorithm. The decision trees each use different Heart Rate Variability (HRV) features
[129] extracted from the ECG waveform, HRV having already been shown to be useful for discriminating
between patients of different NYHA classes [130–134]. Pecchia et al. trained and tested their severity
classifier on Holter monitor data available from a public database: the Congestive Heart Failure RR
Interval Database [135] (i.e. not data recorded using their telemonitoring system). The dataset consisted
of 29 patients (12 mild, 17 severe), with which they were able to achieve an overall classification
accuracy7 of 79.31%, sensitivity8 of 82.35%, specificity9 of 75.00%, and precision10 of 82.35% - although
the authors failed to specify the validation technique used.
7 The proportion of patients correctly classified into their actual true class
8 a.k.a. recall, or true positive rate: The proportion of patients correctly identified as belonging to the ‘positive’ test
class (e.g. class A in A vs. B)
9 a.k.a. true negative rate: the proportion of patients correctly identified as belonging to the ‘negative’ test class (e.g.
class B in A vs. B)
10 a.k.a. positive predictive value: the proportion of patients correctly classified as belonging to the ‘positive’ test class
amongst all the patients identified by the classifier as belonging to the ‘positive’ test class.
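To make these four metrics concrete, the sketch below computes them from a small invented 2×2 confusion matrix for a 'severe' (positive) vs. 'mild' (negative) classifier. The counts are illustrative and are not Pecchia et al.'s results.

    # Invented confusion-matrix counts for a binary 'severe' vs. 'mild' classification task
    tp, fn = 14, 3   # severe patients correctly / incorrectly classified
    tn, fp = 9, 3    # mild patients correctly / incorrectly classified

    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall proportion correct
    sensitivity = tp / (tp + fn)                 # recall / true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)                   # positive predictive value

    print(f"accuracy={accuracy:.2%}, sensitivity={sensitivity:.2%}, "
          f"specificity={specificity:.2%}, precision={precision:.2%}")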
In 2013, Melillo et al. [136] performed a similar study, but using a larger superset of data containing
additional patients from the publicly available BIDMC Congestive Heart Failure Database [137,138]. This
data superset also included class IV patients, which were grouped with class III patients in the ‘severe
HF’ class. In this study Melillo et al. performed some additional corrections to their decision trees to
permit them to perform feature selection in a way that accounted for the now small and rather
unbalanced dataset (12:32, mild:severe). Melillo et al. also compared the performance of their single
CART decision tree to a random forest classifier [111,139], as well as a single tree generated using the
more popular C4.5 algorithm [139]. Of the 3 classifiers they found that their revised CART performed
best with a classification accuracy of 85.40% (Δ = +6.09% compared to [128]), sensitivity of 93.30% (Δ =
+10.95%), specificity of 63.60% (Δ = -11.4%), and precision of 87.50% (Δ = +5.15%). In this paper,
Melillo et al. specified that they used 10-fold cross validation. 10-fold or 𝑘-Fold cross-validation
(generally) is a common technique for validating machine learning algorithms whereby the complete
dataset is separated into 𝑘 number of groups or ‘folds’ (in this case 10). One of the folds is held aside as
the initial test set, while the remaining folds are made to constitute the initial training set [140,141]. The
folds held aside as the test and training sets are then rotated such that each fold has been held aside once
as a test set with the non-test set folds in that round being used as the training set [140,141]. In this way
each data point in the dataset is well utilized and supplies information for both testing and training
[140,141].
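A minimal sketch of 10-fold cross-validation, assuming a scikit-learn-style workflow and an invented stand-in dataset (leave-one-out cross-validation, discussed in the next paragraph, is simply the special case where 𝑘 equals the number of samples):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Invented stand-in data of a size comparable to the studies discussed here
    X, y = make_classification(n_samples=44, n_features=8, random_state=0)

    # 10-fold CV: each sample is used for testing exactly once and for training nine times
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=KFold(n_splits=10, shuffle=True, random_state=0))
    print(f"mean 10-fold accuracy: {scores.mean():.2%}")

    # Leave-one-out CV is k-fold CV with k equal to the number of samples
    loo_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
    print(f"mean leave-one-out accuracy: {loo_scores.mean():.2%}")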
In 2015, Shahbazi et al. [142] used the same dataset and labelling schema, although they dropped 5
patients based on a pre-established data-reliability measure for a final dataset of 10:29 (mild:severe). In
this study, Shahbazi et al. used a different machine learning algorithm known as k-Nearest Neighbour
[111,139]. Since the k-Nearest Neighbour algorithm does not have inherent feature selection baked in (in
contrast to decision trees), Shahbazi et al. performed feature selection using a method known as
generalized discriminant analysis [143] to select a reduced subset of the best available features to present
to their k-Nearest Neighbour algorithm. The whole feature selection-classifier chain was validated using
leave-one-out cross validation. Leave-one-out cross validation is a variant of 𝑘-Fold cross validation where
𝑘 is equal to the number of data points [140]. In other words, for a dataset of size 𝑁, leave-one-out cross
validation is 𝑘-Fold cross validation where 𝑘 = 𝑁. Leave-one-out cross validation is thus often preferred
when the dataset in question is particularly small, since only 1 data point is held out as a test set for each
round, thus maximizing the amount of data available for training. In any case, Shahbazi et al. were able
to achieve a remarkable 100% and 97.43% accuracy, respectively, for classifiers trained using only non-linear
HRV features and using both linear and non-linear HRV features11.
Lastly, in a 2010 study, Yang et al. described an attempt to perform both diagnosis and severity
assessment together, using a dataset of 153 patients labelled as either ‘Healthy’, ‘HF-prone’ or ‘HF’
(65:30:58). The ‘Healthy’ group corresponded to those with no cardiac dysfunction, the ‘HF-prone’ group
corresponded to those patients with NYHA class I symptoms and the ‘HF’ group corresponding to those
with either NYHA classes II or III symptoms. Due to their relative abundance of data points, Yang et al.
opted to do a simple training/test set split, allocating 63 (24:14:25) samples for training and 90 for testing
(41:16:33). Yang et al. chose to use a support-vector-machine algorithm [111,139], a supervised prediction algorithm whose raw output is a numeric value rather than a class label. As such they had to convert the numeric prediction value into a final output classification, which they performed by first mapping the SVM prediction 𝑣 to a new mapped output value 𝑦 using the following tan-sigmoid function:

𝑦 = 4 / (1 + e^(−4𝑣)) − 2     (1)
and then proceeding to determine the decision cutoff points for the groups using Youden’s index [144].
Their approach gave them an overall accuracy of 74.44% with an accuracy of 87.50% and 65.85% for the
NYHA I group and NYHA II and III group respectively (78.79% for the healthy group). As input data,
Yang et al. used parameters from blood tests (specifically sodium and BNP levels), ECGs (including HRV
features), chest radiography (i.e. LVEF and cardiac dimensions), 6MWT (distance) and a “physical test”
[145]. Other noteworthy parameters employed by the SVM models include peak V̇O2.
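For concreteness, equation (1) and a Youden's index cutoff search can be sketched as follows; the vectors score and truth are hypothetical stand-ins for raw SVM outputs and binary group membership, and this is not Yang et al.'s actual code.

    # Sketch of the tan-sigmoid mapping in equation (1) and a Youden's index
    # cutoff search (illustrative only; inputs are hypothetical).
    map_svm_output <- function(v) 4 / (1 + exp(-4 * v)) - 2  # squashes v into the range (-2, 2)

    youden_cutoff <- function(score, truth) {
      cutoffs <- sort(unique(score))
      j <- sapply(cutoffs, function(cut) {
        sens <- mean(score[truth == 1] >= cut)  # sensitivity at this cutoff
        spec <- mean(score[truth == 0] <  cut)  # specificity at this cutoff
        sens + spec - 1                         # Youden's J statistic
      })
      cutoffs[which.max(j)]                     # cutoff that maximizes J
    }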
To the author's knowledge, no other studies have used machine learning for assessing NYHA
functional class. Certainly, no study appears to have done more than a binary (two-class) prediction of
NYHA class. Of course, this is likely a result of the difficult and time-consuming nature of acquiring a
sufficiently large dataset that includes all 4 NYHA functional classes. Fortunately, as previously
mentioned, the practical challenge of NYHA functional class assessment mostly centers around
distinguishing the middle two classes, II and III, such that studies that use the 'mild'/'severe' labelling scheme like the one used in the Pecchia, Melillo, and Shahbazi studies are essentially addressing the central
NYHA functional class assessment challenge. It appears clear too from these studies that machine learning
methods are a potent tool for objectively assessing NYHA functional class: case in point, Shahbazi et al.’s
11 granted, a model with 100% accuracy is very possibly overfit to the dataset used.
k-Nearest Neighbour approach appears to have achieved incredible accuracy at separating HF patients
with class I or II vs. III or IV - albeit on what is still a relatively small sample of 39 patients. All of
these aforementioned studies however relied solely on data recorded in the clinic, and on HRV specifically.
While we do not doubt the utility of HRV measurements for various aspects of cardiovascular care, they
do have some important drawbacks. For example, the preferred standard recording interval for an ECG
used for HRV analysis is 24 hours although it is possible to record very long-term ECGs (i.e. for longer
than a period of 1-2 days) [129,146,147]. However, very long-term ECGs require slightly different
treatment than shorter term ECGs since the longer an ECG recording, the more unreasonable it is to
maintain the assumption that the ECG signal is stationary - an important assumption for the underlying
mathematics that underpins much of the HRV signal processing [146]. While some researchers have
developed new approaches for HRV signal analysis, these have not been validated against outcomes [146].
This is known to be an important step for HRV analysis since the features used for short- and long-term ECGs are not always interchangeable [129]; it is only reasonable to assume that the same would apply to very long-term ECGs. ECG HRV analysis is also not common practice in many clinics and requires specialized knowledge and equipment (in particular for use in telemonitoring). As an
additional drawback, ECGs are often replete with artefacts and noise, and so sometimes require manual
cleaning before they can be used for HRV analysis [129]. Altogether, this makes HRV analysis a powerful,
but relatively inaccessible tool (at least at present) for use in performing regular assessment of NYHA
class as part of care. It would be useful to determine if it were possible to objectively assess NYHA class using more commonly accessible technology like the standard CPET, or fitness trackers which, although not ubiquitous in the hospital, are ubiquitous in the consumer space and would be an ideal tool for remotely monitoring HF patients and regularly reassessing their NYHA class.
Summary
To summarize: heart failure, a global epidemic, is a complex chronic progressive condition associated
with significant morbidity and mortality. Patients often present with exacerbations to acute care centers,
and hospital emergency rooms at significant cost.
Exercise intolerance, one of the main manifestations of heart failure (HF), is an integral part of HF care
evaluations. The New York Heart Association (NYHA) classification is a functional assessment of exercise
capacity where a higher NYHA class is associated with increased symptoms, decreased quality of life and
poor survival. This classification system is highly subjective, especially for NYHA class II and III, which
call for patients experiencing “slight” versus “marked limitation of physical activity” [9]. The application of
the criteria thus varies widely based on the patient's self-report and the individual physician's
interpretation [6,7]. A quantifiable measure that removes this subjectivity to make the assignment of
NYHA class more repeatable and objective is highly desirable, especially if such a measurement could be
made on a regular basis to more closely track progression of the disease.
In common clinical practice, most assessments of exercise intolerance are performed through standardized
or non-standardized questions posed as part of the medical interview. More quantifiably,
CardioPulmonary Exercise Testing (CPET) is a validated clinical tool that is used to assess exercise
intolerance. Other researchers have identified some relationships between CPET measures, specifically peak V̇O2, and NYHA class, although none have attempted to predict NYHA class from CPET measures. Performing CPET
studies also has some important drawbacks: they require access to expensive equipment in a lab
environment, and trained personnel to run the tests. Consumer targeted wearable physical activity
trackers overcome these disadvantages: they are inexpensive, simple to use, and can measure moment-to-
moment physical activity (and thus hopefully infer exercise intolerance) during free-living activities
instead of simulated activity in a lab. A previous exploratory study [13] investigated wearable activity
trackers in HF patients and found a link between patients’ daily average step counts and their
corresponding NYHA functional classes. However, the study’s small sample (n=8) limits scientific
confidence in the generalizability of this finding, so we resolved to begin (in the next chapter) by
investigating whether these results are generalizable to a larger study sample.
Activity trackers could thus also be used to remotely monitor patients to help both patients and clinical
staff better manage their condition. Remote monitoring has been shown to improve HF patient outcomes
when properly implemented. To maximize chances of successful implementation, we proposed integrating
activity tracker monitoring as part of Medly [101,102], an existing well validated phone-based HF patient
monitoring solution already integrated and in use at our hospital.
One of the important features of Medly is that it leverages an expert system (an early type of artificial
intelligence algorithm) to triage, respond to or elevate clinical concerns to staff as necessary while
handling regular ‘run-of-the-mill’ clinical tasks without needing human intervention, thus providing a
cost-effective and scalable way of monitoring patients on a daily basis. We suggest that a similar
intelligent system could be used for NYHA class assessment. By using an artificial intelligence system that
could translate relevant data into the desired clinical outcome (NYHA classification), or a sufficiently
equivalent outcome (an 'NYH-AI' or 'NYHAI' classification if you will), we could provide a way to assess
a patient's functional classification in an objective, consistent manner while still leveraging the advantages
of the existing 'traditional' NYHA classification method. Some researchers have already investigated
intelligent classification algorithms, but unfortunately these all relied on analysing heart rate variability
from ECGs. We suggest that it might be possible to perform the same classification using more accessible
or ubiquitous technology like a CPET or fitness tracker.
Chapter 3 - Replication of Previous Study
As discussed in the Section 2.2.3.3, a previous exploratory study [13] investigated wearable activity
trackers in HF patients and demonstrated a statistically significant difference between the daily average
step counts of patients experiencing NYHA class II vs NYHA class III symptoms. However, the study’s
small sample (n=8) limits scientific confidence in the generalizability of this finding. Since step count activity is expected to be a highly relevant, useful and massively feature-rich dataset, we replicated the
study on a separate otherwise limited dataset collected during another previous study, to increase our
confidence in the relevance and usefulness of step data for this particular research thesis. Our primary
objective was to validate the pilot study on a larger sample of patients with HF with reduced ejection
fraction (HFrEF). Our secondary objective in analyzing the larger dataset was to also better characterize
the distribution of step counts for patients in different NYHA classes.
The remaining part of this chapter, our replication of the pilot study, has been submitted for publication
to a peer-reviewed journal [148]. The thesis author was responsible for the direction and execution of the
research as well as the drafting of the initial paper. The other authors on the submitted paper (S.
Bromberg, M. Yasbanoo, B. Taati, H. Ross, C. Manlhiot, and J. Cafazzo) contributed feedback and edits
to subsequent drafts of the manuscript. Additionally, S. Bromberg collected the original dataset used in
the study, H. Ross & M. Yasbanoo provided clinical guidance, and J. Cafazzo and C. Manlhiot provided
general consultation.
Abstract
Background: A previously published pilot study showed a statistically significant difference between
New York Heart Association (NYHA) functional class and step count activity measured by wrist-worn
activity monitors in patients with heart failure (HF). However, the study’s small sample size severely
limits scientific confidence in the generalizability of this finding to a larger HF population.
Objective: Validate the pilot study on a larger sample of patients with HF with reduced ejection fraction
(HFrEF) and attempt to characterize the step count distribution.
Methods: We repeated the analysis performed during the pilot study on an independently recorded dataset consisting of a total of 50 patients with HFrEF (35 NYHA II and 15 NYHA III).
Participants were monitored for step count with a Fitbit Flex for a period of two weeks in a free-living
environment.
Results: Patients exhibiting NYHA class III symptoms had significantly lower recorded mean of daily total step count (4012 ± 1933 vs. 5484 ± 2640 [steps/day], P = .04), lower recorded mean of daily mean step count (2.8 ± 1.3 vs. 3.8 ± 1.8 [steps/minute], P = .04), and lower mean and maximum of the daily per minute step count maximums (80.5 vs. 95.6 and 112.9 vs. 125.7 [steps/minute], P = .02 and .004 respectively).
Conclusions: Patients with NYHA II and III symptoms differed significantly by various aggregate
measures of free-living step count including 1) mean daily total step count as well as, newly discovered, by
2) mean, and 3) maximum of the daily per minute step count maximums. These findings affirm that the
degree of exercise intolerance of NYHA II and III patients as a group is quantifiable in a replicable
manner. This is a novel and promising finding that is highly suggestive of a possible completely objective measure for assessing HF functional class, something which would be a great boon in the continuing quest
to improve patient outcomes for this burdensome and costly disease.
Introduction
Heart Failure (HF), a global epidemic [1,14], is a complex chronic progressive condition associated with
significant morbidity and mortality. HF is the leading cause of hospitalizations costing Canadians an
estimated 3 billion dollars annually [2]. Clinicians caring for patients with HF have a strong desire to
reduce hospitalizations from both a systems and patient-centered perspective [2,4]. To do so, it is
important for clinicians caring for these patients to understand each patient's physiologic parameters.
Evaluating exercise intolerance, one of the main manifestations of HF, is an integral part of HF care. The
New York Heart Association (NYHA) classification is a functional assessment of exercise capacity where a
higher NYHA class is associated with increased symptoms, decreased quality of life and poor survival
[8,10,149]. This classification system is highly subjective [6,7], especially for NYHA class II and III [9]. The
application of the criteria thus varies widely based on the patient's self-report and the individual
physician’s interpretation [6,7]. A quantifiable measure that removes this subjectivity to make the
assignment of NYHA class more repeatable and objective would be beneficial.
A previous exploratory study [13] investigated wearable activity trackers in HF patients and demonstrated a statistically significant difference in daily average step counts, a proxy for exercise intolerance, between patients with class II and III symptoms. However, the study's small sample (n=8)
limits the generalizability of these findings. The aim of this study is to determine if these findings can be
replicated using a larger sample collected independently from the original pilot study data.
Methods
As a replication, we repeated the analysis performed during the pilot study [13], but on an
independently recorded dataset consisting of a total of 50 patients with HFrEF (9 NYHA I/II, 26 NYHA
II, 4 NYHA II/III, and 11 NYHA III). Participants were monitored for step count with a Fitbit
Flex [59] for a period of two weeks in a free-living environment.
3.3.1 Recruitment
Patients in a moderately larger dataset (n=50) were originally consecutively recruited from the Heart
Function Clinic at Toronto General Hospital (TGH) in Toronto, Canada from September 2014 to June
2015. The inclusion and exclusion criteria used are outlined in Table 3 & Table 4 respectively.
Table 3: Inclusion criteria
- Adults (18+ years of age)
- Stable chronic HF
- NYHA Class II or III
- LVEF (Left Ventricular Ejection Fraction) ≤ 35%
- Able to walk without walking aids
- Capable of undergoing consent, understanding English instructions and complying with
the use of the study devices.
Table 4: Exclusion criteria
- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study
costs – i.e. roaming charges)
Data Collection
Patients were supplied with a Fitbit Flex [57], an Android smartphone (Moto-G), the associated
charging equipment for both devices, as well as a data plan to facilitate syncing the tracker to the Fitbit
server. Patients were instructed to wear the Fitbit daily on the same wrist, preferably their non-dominant
hand, for a period of 2 weeks, except during water activities like showering or swimming, as the Flex is
not water-proof. Patients were also instructed to charge the Fitbit at least every three days, preferably
while they slept. The Fitbit data was retrieved using an open source script published and available on
GitHub and adapted for this study [150].
Population
Patients in our larger dataset were labeled as either NYHA class II or III, or, when a physician was
uncertain about the classification or felt that patients exhibited symptoms from different class levels, as a
borderline/mixed class I/II or II/III. Table 5 provides demographic information for each of the patients in
the dataset according to their NYHA class, Table 6 provides the same but for all patients overall and just
for the subset of patients that were labelled NYHA class II or III. In either case, the patients are predominantly male (86% vs. 89%), aged 54 ± 14 vs. 56 ± 14 years old, and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 kg/m2).
Table 5: Study dataset demographics
NYHA I/II NYHA II NYHA II/III NYHA III
Total Participants (n [%]) 9 (18%) 26 (52%) 4 (8%) 11 (22%)
# Male (n [%]) 6 (67%) 23 (89%) 4 (100%) 10 (91%)
Age [years] 52 ± 16 55 ± 14 52 ± 13 58 ± 13
Height [cm] 171 ± 12 174 ± 8 177 ± 3 175 ± 10
Weight [kg] 79.5 ± 25.5 87.6 ± 18.6 88.4 ± 22.7 94.4 ± 17.4
BMI [kg/m2] 26.6 ± 7.1 29.0 ± 6.1 28.4 ± 7.5 30.9 ± 6.7
Table 6: Study dataset demographics (overall and just NYHA II or III)
Overall NYHA II or III*
Total Participants (n [%]) 50 37 (74% of total)
# Male (n [%]) 43 (86%) 33 (89%)
Age [years] 54 ± 14 56 ± 14
Height [cm] 174 ± 9 174 ± 9
Weight [kg] 87.7 ± 20.0 89.6 ± 18.5
BMI [kg/m2] 28.9 ± 6.4 29.6 ± 6.3
Since NYHA class I/II and II/III are not formally recognized NYHA classes, we performed our analysis
using the original class labels, as well as a second time but with the borderline/mixed classes grouped into
one of the traditional 4 NYHA classes. Since NYHA class I corresponds to ‘no limitation of physical
activity’, a binary distinction, we reasoned that a patient assigned as class I/II, must have exhibited
something more than ‘no limitation of physical activity’, however slight. Since NYHA class II corresponds
to ‘a slight limitation of physical activity’ we reasoned that class I/II and class II should be grouped
together. We designate the class I/II and class II group as Group II*. We extended the same line of
reasoning for II/III patients, noting that patients assigned as class II/III must have experienced some
more marked limitation of physical activity beyond class II limitations. As such we grouped them with the lower class III as a conservative approach, assuming the worst-case scenario. We designated the class II/III and III group as Group III*. Table 7 provides
demographic information for the patients when the dataset is re-grouped according to the labeling scheme
as described above.
Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*)
NYHA Group II* NYHA Group III*
Total Participants (n [%]) 35 (70%) 15 (30%)
# Male (n [%]) 29 (83%) 14 (93%)
Age [years] 54 ± 14 56 ± 13
Height [cm] 173 ± 9 176 ± 8
Weight [kg] 85.5 ± 20.6 92.8 ± 18.3
BMI [kg/m2] 28.4 ± 6.3 30.2 ± 6.7
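Expressed as code, the regrouping described above amounts to a simple relabeling; the data frame patients and its column nyha_label are illustrative names only.

    # Illustrative relabeling of the borderline/mixed classes into Group II* and
    # Group III*; nyha_label is an assumed character column taking the values
    # "I/II", "II", "II/III" and "III".
    patients$nyha_group <- ifelse(patients$nyha_label %in% c("I/II", "II"),
                                  "Group II*", "Group III*")

    table(patients$nyha_group)  # for this dataset the expected split is 35 vs. 15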
3.3.2 Statistics
Consistent with our previous study [13], we used a Kruskal-Wallis rank test to compare the
experimental variables of interest, including the mean daily total step count. Since the data is clearly not
normally distributed, as can be seen in Figure 3-1, Figure 3-2 and Figure 3-3, we also computed various
other aggregations of the minute by minute step count data to attempt to better characterize the data.
Namely, we calculated statistical summaries (mean, standard deviation, five number summaries,
interquartile range, skewness and kurtosis) for each patient’s overall two week period and then for each
individual patient-day of step data. We then calculated the max, min, mean and standard error across
each patient’s daily summaries (producing a maximum daily mean, minimum daily mean, mean of daily
means, etc.) to assess overall variation on a daily basis. We then performed Kruskal-Wallis rank tests on each of the overall statistical summaries. The analysis was performed using R [151], RStudio [152] with
supporting packages [153–158].
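As a concrete illustration of the aggregation and testing described above, the following sketch computes one of the daily summaries and compares groups with a Kruskal-Wallis rank test; the data frames steps (columns: patient_id, day, steps_per_minute) and labels (columns: patient_id, nyha_class) and all column names are assumptions for the example, not the actual analysis scripts.

    # Illustrative sketch: each patient's daily per-minute step count maximum,
    # averaged across days, then compared between NYHA classes.
    daily_max <- aggregate(steps_per_minute ~ patient_id + day,
                           data = steps, FUN = max)

    mean_daily_max <- aggregate(steps_per_minute ~ patient_id,
                                data = daily_max, FUN = mean)
    names(mean_daily_max)[2] <- "mean_of_daily_max"

    summaries <- merge(mean_daily_max, labels, by = "patient_id")

    kruskal.test(mean_of_daily_max ~ nyha_class, data = summaries)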
Figure 3-1. Histogram of per minute step count values for each patient, grouped by individual NYHA class

Figure 3-2. Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0). Stacked internal segments indicate relative contributions by each patient.

Figure 3-3. Individual frequency of per minute step counts for each patient (zoomed in to step counts > 0), grouped by NYHA class
Results and Discussion
Table 8 and Table 9 include results that were found to be significant at the P=.05 level in at least one
comparison. Table 10 and Table 11 contain the remaining non-significant results excluding any statistical
summary that returned a 0 value for all classes (e.g. aggregations involving daily or overall minimum, 1st,
2nd and 3rd quartile) due to the overwhelming frequency of 0 per minute step counts. Table 8 and Table
10 tabulate the results of the comparison using the original class labels, i.e. comparisons between class II
vs. III, and the comparison of all available classes, i.e. I/II vs. II vs. II/III vs. III, whereas Table 9 and
Table 11 tabulate the results of the comparison of the relabeled dataset, i.e. group II* vs. group III*. The
mean daily total steps, and the mean and max of daily per minute step count maxes (with standard error
bars) are plotted graphically in Figure 3-4, Figure 3-5, and Figure 3-6 respectively.
Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and
just between class II vs. III.
I/II II II/III III P-value (all classes) P-value (II vs. III)
Maximum
Maximum 2 Week PMSCa
[steps/minute] 126.33 125.54 112.75 112.91 .04* .0104*
Maximum of Maximum
DPMSCb [steps/minute] 126.33 125.54 112.75 112.91 .04* .0104*
Mean of Maximum DPMSCb
[steps/minute] 96.94 95.10 80.26 80.65 .12 .04*
Mean
Mean 2 Week PMSCa
[steps/minute] 3.85 3.79 3.12 2.66 .22 .0499*
Maximum of Mean DPMSCb
[steps/minute] 6.33 7.53 6.11 5.02 .07 .014*
Mean of Mean DPMSCb
[steps/minute] 3.85 3.79 3.12 2.66 .22 .0499*
Standard Deviation of Mean
DPMSCb [steps/minute] 1.40 1.98 1.70 1.21 .054 .0095**
Standard Error of Mean
DPMSCb [steps/minute] 0.36 0.50 0.43 0.31 .07 .013*
Standard Deviation
Standard Deviation of 2
Week PMSCa [steps/minute] 12.90 13.09 10.51 9.99 .15 .03*
Maximum of DPMSCb
Standard Deviation
[steps/minute]
18.61 20.10 15.53 14.94 .02* .0053**
Mean of DPMSCb Standard
Deviation [steps/minute] 12.24 11.87 9.44 9.23 .17 .0499*
Standard Error
Standard Error of 2 Week
PMSCa [steps/minute] 0.088 0.087 0.071 0.067 .16 .04*
Maximum of DPMSCb
Standard Error
[steps/minute]
0.49 0.53 0.41 0.39 .02* .005**
Mean of DPMSCb Standard
Error [steps/minute] 0.32 0.31 0.25 0.24 .17 .0499*
Total
Total 2 Week SCc [kilosteps] 8.19 8.51 6.95 5.87 .16 .03*
Maximum of Total DPMSCb
[steps] 9113 10837 8803 7232 .07 .014*
Mean of Total DPMSCb
[steps] 5542 5464 4499 3835 .22 .0499*
Standard Deviation of Total
DPMSCb [steps] 2019 2856 2452 1745 .054 .0095**
Standard Error of Total
DPMSCb [steps] 523 713 624 441 .07 .013*
aPMSC: Per Minute Step Count bDPMSC: Daily Per Minute Step Count cSC: step count
Table 9: Significant findings for comparisons between group II* and group III*
Group II* (= I/II + II) Group III* (= II/III + III) P-value
Maximum
Maximum 2 Week PMSCa [steps/minute] 125.74 112.87 .004**
Maximum of Maximum DPMSCb [steps/minute] 125.74 112.87 .004**
Mean of Maximum DPMSCb [steps/minute] 95.57 80.55 .02*
Mean
Mean 2 Week PMSCa [steps/minute] 3.81 2.79 .04*
Maximum of Mean DPMSCb [steps/minute] 7.22 5.31 .03*
Mean of Mean DPMSCb [steps/minute] 3.81 2.79 .04*
Standard Deviation of Mean DPMSCb [steps/minute] 1.83 1.34 .04*
Standard Error of Mean DPMSCb [steps/minute] 0.46 0.34 .045*
Standard Deviation
Standard Deviation of 2 Week PMSCa [steps/minute] 13.04 10.13 .02*
Maximum of DPMSCb Standard Deviation
[steps/minute] 19.72 15.09 .002**
Mean of DPMSCb Standard Deviation [steps/minute] 11.97 9.29 .03*
Standard Error
Standard Error of 2 Week PMSCa [steps/minute] 0.09 0.07 .02*
Maximum of DPMSCb Standard Error [steps/minute] 0.52 0.40 .002**
Mean of DPMSCb Standard Error [steps/minute] 0.32 0.24 .03*
Total
Total 2 Week SCc [steps] 84293 61612 .03*
Maximum of Total DPMSCb [steps] 10393 7651 .03*
Mean of Total DPMSCb [steps] 5484 4012 .04*
Standard Deviation of Total DPMSCb [steps] 2640 1933 .04*
Standard Error of Total DPMSCb [steps] 664 490 .045*
aPMSC: Per Minute Step Count bDPMSC: Daily Per Minute Step Count cSC: step count
Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III)
and just between class II vs. III.
I/II II II/III III P-value (all classes) P-value (II vs. III)
Demographics
Sex [M=0, F=1] 0.33 0.12 0.00 0.09 .29 .83
Age [years] 51.56 54.96 51.50 57.82 .65 .55
Height [cm] 171.44 173.96 176.50 175.27 .76 .69
Weight [kg] 79.53 87.62 88.35 94.35 .53 .21
BMIa [kg/m2] 26.59 29.00 28.41 30.88 .53 .39
Righthanded?b
[No=0, Yes=1] 0.89 0.88 1.00 1.00 .61 .25
Wristband Preferencec
[Left=0, Right=1] 0.67 0.35 0.25 0.20 .18 .40
Maximum
Standard Deviation of
Maximum DPMSCd
[steps/minute]
19.91 26.21 29.13 21.45 .31 .30
Standard Error of Maximum
DPMSCd [steps/minute] 5.06 6.43 7.42 5.26 .28 .32
Minimum of Maximum
DPMSCd [steps/minute] 58.89 36.81 17.75 40.82 .22 .62
75th Percentile
Maximum of 75th Percentile
of DPMSCd [steps/minute] 0.56 3.02 4.00 1.09 .46 .36
Mean of 75th Percentile of
DPMSCd [steps/minute] 0.04 0.50 0.72 0.08 .44 .33
Standard Deviation of 75th
Percentile of DPMSCd
[steps/minute]
0.14 0.91 1.41 0.29 .43 .33
Standard Error of 75th
Percentile of DPMSCd
[steps/minute]
0.04 0.23 0.35 0.08 .43 .33
Mean
Minimum of Mean DPMSCd
[steps/minute] 1.31 0.67 0.57 0.88 .21 .36
Standard Deviation
Minimum of DPMSCd
Standard Deviation
[steps/minute]
5.42 3.01 2.07 3.67 .21 .42
Standard Error
Minimum of DPMSCd
Standard Error
[steps/minute]
0.14 0.08 0.05 0.10 .21 .42
Total
Minimum of Total DPMSCd
[steps] 1887 971 818 1270 .21 .36
IQR (Interquartile Range)
Maximum of DPMSCd IQRg
[steps/minute] 0.56 3.02 4.00 1.09 .46 .36
Mean of DPMSCd IQRg
[steps/minute] 0.04 0.50 0.72 0.08 .44 .33
Standard Deviation of
DPMSCd IQRg
[steps/minute]
0.14 0.91 1.41 0.29 .43 .33
Standard Error of DPMSCd
IQRg [steps/minute] 0.04 0.23 0.35 0.08 .43 .33
Skewness
2 Week PMSCe Skewness 5.14 5.20 5.29 6.50 .62 .27
Maximum of Daily SCf
Skewness 11.36 13.22 5.24 12.39 .56 .91
Mean of Daily SCf Skewness 5.20 5.30 4.11 5.77 .76 .65
Standard Deviation of Daily
SCf Skewness 2.00 2.54 0.58 2.18 .37 .73
Standard Error of Daily SCf
Skewness 0.51 0.65 0.16 0.55 .33 .73
Minimum of Daily SCf
Skewness 3.61 3.21 2.59 3.70 .42 .34
Kurtosis
2 Week PMSCe Kurtosis 35.32 33.44 36.72 61.42 .61 .24
Maximum of Daily SCf
Kurtosis 249.66 283.85 31.17 237.06 .58 .87
Mean of Daily SCf Kurtosis 43.12 44.82 19.12 49.53 .68 .57
Standard Deviation of Daily
SCf Kurtosis 59.92 68.46 5.53 54.44 .39 .78
Standard Error of Daily SCf
Kurtosis 15.08 17.33 1.48 13.55 .39 .87
Minimum of Daily SCf
Kurtosis 15.38 10.74 6.62 15.64 .36 .23
aBMI: Body Mass Index bRighthanded?: is patient righthanded? cWristband Preference: right or left handed preference for wristband dDPMSC: Daily Per Minute Step Count ePMSC: Per Minute Step Count fSC: step count gIQR: interquartile range
Table 11: Non-significant findings for comparisons between group II* and group III*
Group II* (= I/II + II) Group III* (= II/III + III) P-value
Demographics
Sex [M=0, F=1] 0.17 0.07 .33
Age [years] 54.09 56.13 .71
Height [cm] 173.31 175.60 .38
Weight [kg] 85.54 92.75 .17
BMIa [kg/m2] 28.38 30.22 .28
Righthanded?b [No=0, Yes=1] 0.89 1.00 .18
Wristband Preferencec [Left=0, Right=1] 0.43 0.21 .16
Maximum
Standard Deviation of Maximum DPMSCd
[steps/minute] 24.59 23.50 .76
Standard Error of Maximum DPMSCd [steps/minute] 6.08 5.84 .86
Minimum of Maximum DPMSCd [steps/minute] 42.49 34.67 .58
75th Percentile
Maximum of 75th Percentile of DPMSCd
[steps/minute] 2.39 1.87 .93
Mean of 75th Percentile of DPMSCd [steps/minute] 0.38 0.25 .89
Standard Deviation of 75th Percentile of DPMSCd
[steps/minute] 0.71 0.59 .91
Standard Error of 75th Percentile of DPMSCd
[steps/minute] 0.18 0.15 .91
Mean
Minimum of Mean DPMSCd [steps/minute] 0.84 0.80 .90
Standard Deviation
Minimum of DPMSCd Standard Deviation
[steps/minute] 3.63 3.24 .80
Standard Error
Minimum of DPMSCd Standard Error [steps/minute] 0.10 0.09 .80
Total
Minimum of Total DPMSCd [steps] 1207 1149 .90
IQR (Interquartile Range)
Maximum of DPMSCd IQR [steps/minute] 2.39 1.87 .93
Mean of DPMSCd IQR [steps/minute] 0.38 0.25 .89
Standard Deviation of DPMSCd IQR [steps/minute] 0.71 0.59 .91
Standard Error of DPMSCd IQR [steps/minute] 0.18 0.15 .91
Skewness
2 Week PMSCe Skewness 5.18 6.18 .29
Maximum of Daily SCf Skewness 12.60 11.68 .97
Mean of Daily SCf Skewness 5.26 5.60 .76
Standard Deviation of Daily SCf Skewness 2.36 2.02 .76
Standard Error of Daily SCf Skewness 0.60 0.51 .79
Minimum of Daily SCf Skewness 3.34 3.59 .65
Kurtosis
2 Week PMSCe Kurtosis 33.93 54.83 .25
Maximum of Daily SCf Kurtosis 272.45 216.47 .97
Mean of Daily SCf Kurtosis 44.25 46.49 .71
Standard Deviation of Daily SCf Kurtosis 65.62 49.55 .73
Standard Error of Daily SCf Kurtosis 16.58 12.34 .79
Minimum of Daily SCf Kurtosis 12.29 14.74 .47
aBMI: Body Mass Index bRighthanded?: is patient righthanded? cWristband Preference: right or left handed preference for wristband dDPMSC: Daily Per Minute Step Count ePMSC: Per Minute Step Count fSC: step count gIQR: interquartile range
3.4.1 Principal Results
This study, using an independent, larger group of participants, replicated and validated the findings of our previous pilot study: that the daily free-living step counts of HF patients exhibiting NYHA class II vs class III symptoms are statistically different [13]. Specifically, HF patients categorized as NYHA II vs. III were found to differ significantly by mean of daily total step count (5464 vs. 3835, P = .0499), as well as by mean of daily mean step count (3.8 vs. 2.7, P = .0499). NYHA II vs III patients also differed significantly by mean (95.1 vs. 80.7, P = .04) and maximum (125.5 vs. 112.9, P = .0104) of the daily per minute step count maximums.

Figure 3-4. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each individual NYHA class
Similarly, group II* and group III* also differed significantly by mean of daily total step counts (5484 vs.
4012, P = .04), mean of daily mean step count (3.8 vs. 2.8, P = .04) as well as by mean (95.6 vs. 80.5, P
= .02), and maximum of the daily per minute step count maximums (125.7 vs. 112.9, P = .004
respectively).
In both cases quoted above, the daily step count results mimicked the two-week overall step count
analysis.
Of the 4 metrics identified above only the maximum daily per minute step count maximum was found to differ significantly between the 4 classes I/II, II, II/III and III (126.3 vs. 125.5 vs. 112.8 vs. 112.9, P = .04). It is reasonable that step count maximum, which better captures a patient's peak exercise during the day, might as a result better capture the "limitation of physical activity" experienced by a patient and thus differentiate more consistently between NYHA classes (compared to a simple mean or sum of a patient's activity over said day). Visual inspection of the overall step count density (see Figure 3-2) corroborates this suspicion.
We however suggest another alternative. As can clearly be seen in Figure 3-1 (which shows a histogram of the step count data for each NYHA class), zero per minute step counts made up an overwhelming portion of the data. Specifically, they accounted for a mean 87.3% (standard deviation 4.9%) of the two week data stream for each patient, accounting for as much as 97.6% of the two week data stream for one patient - the full breakdown can be seen in Figure 3-7. Unfortunately, the meaning of these 0 per minute step count values is ambiguous since the trackers used in this study record a 0 value not only during patient inactivity but also when the patient was simply not wearing the device. As a result, it is challenging to accurately determine if a given series of zeroes indicates a pattern of low physical activity - presumably explanatory of NYHA class - or simply a pattern of non-device use - essentially introducing noise into the physical activity signal.

Figure 3-5. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count maximums for each individual NYHA class
A visual inspection of Figure 3-2 and Figure 3-3, both of which show different perspectives of the non-zero per minute step count data distribution, seems to strongly suggest that there is a difference in the activity patterns of patients, for example, a longer, fatter tail for class I/II and II patients. Quantitatively however we failed to extract many insights into the shape of the activity distribution. Notably the 1st, 2nd, and 3rd quartile (and thus interquartile range) were all found to be fairly consistently 0 for all patients. In other words, 0's typically accounted for more than 75% of data points for any given patient day. In fact, when looking at the two week period as a whole they accounted for at least 76.7% of all the data points for any given patient (the complete breakdown is shown in Figure 3-7).

Figure 3-6. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count maximums for each individual NYHA class
The maximum daily per minute step counts on the other hand are naturally least susceptible to the ambiguous 0 per minute step count values. We suggest that this may have contributed to their being most consistent at differentiating between patients in different NYHA classes. Ultimately though, we believe that the disambiguation of inactive vs disengaged time in pedometer-like trackers and the subsequent effect on the aforementioned step data distribution are worth investigating further to better understand the true nature of the relationship between free-living step count and NYHA functional classification.
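The zero-minute breakdown discussed above can be reproduced in a few lines; the long-format data frame steps and its columns patient_id and steps_per_minute are, again, assumed names rather than the actual analysis objects.

    # Sketch of the per-patient percentage of zero per-minute step counts over
    # the two-week data stream (illustrative only).
    pct_zero <- tapply(steps$steps_per_minute, steps$patient_id,
                       function(x) 100 * mean(x == 0))

    summary(pct_zero)  # the reported mean was roughly 87% zero minutes per patient
    stem(pct_zero)     # stem-and-leaf display comparable to Figure 3-7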
3.4.2 Strengths and Limitations
A strength of this replication study is that it uses a separate dataset collected by a different researcher (S.B.) independently of (and prior to the analysis performed in) the referenced pilot study [13]. Except for one patient who participated in both studies, the dataset is also comprised of completely different patients. On the other hand, the data being sourced as a convenience sample at the same single site as the pilot study, i.e. consecutively recruited from the TGH Heart Function Clinic, represents a limitation of this study with regards to generalizing our findings. Our analysis was also limited as it did not include any NYHA class I or IV patients. While these are not typically as difficult to classify as
NYHA class II or III patients, analysis of all 4 NYHA classes would have potentially provided additional
useful insight into the true underlying relationship between step count and NYHA class. Knowing this
relationship might be of tremendous value if it could allow us to invert the question posed in this study:
to instead see if step count could be used to assess NYHA class or gradation changes in NYHA class for a
patient. We suggest that this might be the subject of an important future study. The most significant
limitation of our study though was the step tracker utilized, since it introduced significant ambiguity into
the 0 per minute step count values which comprised most of each patient’s step data stream. This limits
our ability to precisely quantify the distribution of the activity/inactivity of patients especially since it is
as of yet unclear how much significance patient inactivity should be accorded when it comes to capturing ‘physical activity limitation’ and by extension NYHA functional class.

Figure 3-7. Number of zero step count minutes as a percentage of individual patient two-week data stream (stem-and-leaf display; e.g. 76 | 7 represents 76.7%)
Conclusion
NYHA II and NYHA III patients differ significantly by various aggregate measures of step count
including 1) mean daily total step count but also importantly by 2) mean, and 3) maximum of the daily
per minute step count maximums. These findings validate our previous pilot study. However, the
discovery of additional significant aggregate measures raises several questions, amongst them: what is the
exact underlying relationship between NYHA class and step count? What features of the step count
waveform are most associated or correlated with NYHA class? These questions will no doubt feature as
the subjects of future studies, but the findings of this study are an important milestone on the road to an
objective means of assessing HF functional classification on our continuing quest to improve outcomes of patients with the burdensome and costly disease that is congestive heart failure.
3.5.1 Acknowledgements
This project was supported by funds from: the Ted Rogers Centre for Heart Research and Peter Munk
Cardiac Centre, (hSITE) Healthcare Support through Information Technology Enhancements and
(NSERC) the Natural Sciences and Engineering Research Council, (CIHR) the Canadian Institutes for
Health Research, the Government of Ontario, and the University of Toronto.
3.5.2 Ethics Approval
This study is covered by institutional and research ethics approval (REB #14-7595) received from the
University Health Network REB.
3.5.3 Conflicts of Interest
None declared.
Chapter 4 - Activity Tracker Monitoring Implementation
Having confirmed the potential utility of remotely monitoring the physical activity of heart failure patients, we moved to update Medly, the remote patient monitoring system in use at the TGH HF clinic,
as part of a Quality Improvement (QI) initiative so it could support the collection and display of the
aforementioned data.
In this chapter we provide a brief overview of the Medly user interface, before discussing the activity
tracker monitoring implementation requirements. We then discuss the proposed designs, what was
actually finally implemented, as well as the success of the implementation in terms of the patients
onboarded and their adherence to the system.
Medly User Interface Overview
The concept behind the Medly remote monitoring system is relatively simple: patients download the Medly app on their smartphone (provided by the clinic if required), and use the app every morning to input their weight, blood pressure and pulse – either manually or using a 'smart' weight scale and blood pressure cuff which can wirelessly transmit the corresponding data to the smartphone app. Additionally, patients answer a series of questions about the symptoms they experienced the day before. Medly's innovative computer algorithm then assesses the patients' state and alerts them about further actions they may need to take such as: taking an additional dose of medication, calling their physician, or even going to the nearest emergency room (if the patient is assessed as being in a high-risk state). By shortening the cause-effect feedback cycle and leveraging 'teachable moments' the system helps improve patient self-care maintenance and management. Patients can also review past readings and observe their overall trends on a separate screen. Examples of two of the primary screens of the patient user interface, the home and trends screen, are shown in Figure 4-1.

In the example home screen, a patient has been alerted to 'contact the heart function clinic or [their] family doctor' due to their elevated heart rate (156 bpm) and reported symptoms (tired, short of breath and lightheaded) which are highlighted in orange. A patient can also take additional readings by pressing the green '+' circle near the bottom right corner of the screen, although new readings will not remove previous alerts. In the example trends screen the patient appears to be maintaining a constant weight higher than the light blue target weight band (~160 lbs), with two unrecorded days (Nov 2nd and 3rd). Their blood pressure (BP) in contrast appears to be fluctuating: initially trending downwards with the diastolic BP stabilizing but the systolic BP recently trending upwards to exceed the gray target BP band.

Figure 4-1. Medly system patient smartphone user interface: a) home screen b) trends screen [289]
All of the patients' readings are sent back to servers at the hospital (UHN) and are displayed on a web interface which is accessible by clinical staff, where they can review alerts and the patient trend data. An example of the main screen of the clinical web interface, showing the weight data for a Mr./Mrs. Demo Patient, is shown in Figure 4-2. In this example the patient had 1 of 3 readings during the period of July 12th to July 19th, 2018 fall outside of their target normal weight range (this time indicated by a gray coloured band on the graph). The user could also scroll down to see the patient's BP and pulse readings as well as a chart of their answers to the symptoms questions.

Figure 4-2. Medly system clinical user web interface
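Purely for illustration, the kind of threshold rule such an algorithm might evaluate against a morning reading is sketched below; the thresholds, band width and function names are invented for the example and are not Medly's actual rule set.

    # Hypothetical example of a simple triage rule (not Medly's actual rules):
    # flag a weight reading that exceeds the patient's target band.
    triage_weight <- function(weight_lbs, target_lbs, band_lbs = 4) {
      if (weight_lbs > target_lbs + band_lbs) {
        "Alert: weight above target range - contact the heart function clinic"
      } else {
        "No action required"
      }
    }

    triage_weight(weight_lbs = 166, target_lbs = 160)  # returns the alert message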
Requirements
In keeping with engineering best practice, we performed some basic requirements gathering before
proceeding to implement changes to the Medly system. Initial requirements gathering was performed by
discussing the proposed system update with the developers, designers, researchers, project managers and
telehealth personnel at the Centre for Global eHealth Innovation, who already had significant expertise in
designing, developing, implementing and working with Medly. Their suggestions were supplemented with
findings from previously published studies discussing insights on the design and implementation of
previous versions of Medly [95,103–105,159].
The following requirements were identified with regards to fitness tracker selection:
1. The selected activity tracker must be readily available for purchase by patients (as established by
the ‘Best Buy Test’: is the fitness tracker available at a local big box electronic store such as Best
Buy?)
2. The fitness tracker must be compatible with Apple iOS v9.3.5 and above.
3. The fitness tracker must be compatible with the 2014 Samsung Galaxy Grand Prime (Android 5.1
Lollipop) and above.
4. The fitness tracker must be able to record minute by minute step data.
5. The fitness tracker must be able to record minute by minute heart rate data.
6. The data recorded from the fitness tracker must be able to be retrieved for storage and archival
at UHN.
7. The fitness tracker must be able to operate continuously for a minimum of 2 days without requiring syncing or charging (to ensure recording continuity in the event that a patient forgets or is unable to sync or charge the device overnight).
The following additional, user experience, requirements were identified:
1. The system must provide a method to de-authenticate a fitness tracker or authenticate a new fitness tracker.
2. The system must allow for connection and authentication of a fitness tracker.
3. The system must provide a means by which activity tracker functionality can be enabled/disabled
for a patient.
4. The system must provide feedback to clinicians that the fitness tracker is working.
5. The system must provide a means by which clinicians can view patient heart rate data.
6. The system must provide a means by which clinicians can view patient activity data.
7. The system must provide a means by which fitness tracker data can be accessed and downloaded
including:
a. anonymized bulk data.
b. analytics data (e.g. usage, interaction patterns)
8. Clinical access must continue to be secured against access by non-authorized (non-Clinical) staff.
9. Research data access must be secured against access by non-authorized (non-QI/research) staff.
The following were also identified as being important for providing an optimal user experience:
1. The system should provide feedback to clinicians that the fitness tracker is being worn by the
patient.12
2. Data visualization should be done in such a manner that clinical staff are able to easily &
simultaneously relate heart rate and contextual ‘explainers’ of heart rate (e.g. activity data,
medications, etc.)
12 where technically feasible
3. The system should provide feedback to the patient that the fitness tracker is connected.
4. The system should provide feedback to the patient that the fitness tracker is working and
collecting data.
Design & Implementation
After having completed the initial requirements gathering we moved to the design and implementation
phase.
4.3.1 Activity Tracker Selection
To select an appropriate activity tracker, an initial search of modern consumer activity trackers was
performed, revealing 33 potential candidates. These are briefly detailed in Table 12. Most of these activity
trackers did not support continuous heart rate monitoring, had battery lives that did not meet the
continuity requirement outlined in fitness tracker requirement 7 of Section 4.2, or were simply no longer
available on the market (e.g. the Basis Peak which was recalled by Intel Corporation for safety reasons
[160], as well as the Jawbone devices since Jawbone (the company) filed for bankruptcy in July of 2017
[161]). The short list of activity trackers remaining included the Fitbit Charge 2, Ionic and Versa; the
Garmin Vivosmart 3, the Nokia/Withings Steel HR, the Wavelet Health Biostrap, and the Xiaomi Band 2
(all highlighted in Table 12). We quickly eliminated a) the Nokia/Withings Steel HR since it was not yet
released in the Canadian market at the time of the study, b) the Garmin devices in general since access to
the device data through their application programming interface (API) required a steep access fee of
$5000, and c) the Xiaomi Band 2 since it did not appear to have a reliable manufacturer-supported method of accessing device data. Although the Xiaomi Band 2 was advertised as supporting data download using Google Fit, anecdotal evidence from user forums appeared to suggest that this approach was unreliable –
notwithstanding this possible unreliability there was no way to access the data using iOS (fitness tracker
requirement 2 of Section 4.2). This left us with the Fitbit devices and the Wavelet Health Biostrap. We
eliminated the Wavelet Health device after encountering unresolvable issues while attempting to connect a
trial device to our Android devices, although the device worked fine on iOS. Furthermore, in choosing
between Fitbit devices and a relatively new and unproven contender on the relatively volatile activity
tracker market (Wavelet Health), we determined that it was a more prudent choice to opt for the market
leader, Fitbit. Additionally, due to the popularity of Fitbit devices, investigating the accuracy and
reliability of these devices is a more active area of research [41,46,48,65,67,68,84,162]. We opted to use the
Fitbit Charge 2, the successor to the Fitbit Charge HR, since it was the lowest cost option of the three
short-listed Fitbit devices.
Table 12: Candidate activity trackers
Company Product Step Count Heart Rate Battery Life13 Data Access Price Link
Apple Watch Yes Yes 1 day HealthKit [163] 360-590
CAD [64]
Empatica E4 Wristband Yes Yes 1 day Unclear 1700 USD [164]
Fitbit Alta HR Yes Yes 5 days Fitbit API [165] 200 CAD [166]
Fitbit Alta Yes No 5 days Fitbit API [165] 170 CAD [167]
Fitbit Charge 2 Yes Yes 5 days Fitbit API [165] 200 CAD [58]
Fitbit Flex 2 Yes No 5 days Fitbit API [165] 80 CAD [168]
Fitbit Ionic Yes Yes 5 days Fitbit API [165] 400 CAD [169]
Fitbit Versa Yes Yes 4 days Fitbit API [165] 250 CAD [170]
Garmin Fenix Yes Yes 1 day Garmin API [171] 600 USD [172]
Garmin Vivosmart 3 Yes Yes < 5 days Garmin API [171] 150 USD [173]
Huawei Watch 2 Yes Yes 1 day Google Fit [174] 350 USD [175]
Intel Basis Peak recalled August 1, 2016 [160]
Jawbone Various company undergoing liquidation [161]
LG Watch Sport Yes Yes 1 day Google Fit [174] 350 US [176]
mc10 BioStampRC Yes Yes 1.5 days Unclear 500 US [177]
Misfit Flare Yes No 4 months Misfit API [178]
or Google Fit [174] 70 CAD [179]
Misfit Phase Yes No 6 months Misfit API [178]
or Google Fit [174] 150 CAD [180]
Misfit Ray Yes No 4 months Misfit API [178]
or Google Fit [174] 80 CAD [181]
Misfit Shine Yes No 6 months Misfit API [178]
or Google Fit [174] 80 CAD [182]
Misfit Shine 2 Yes No 6 months Misfit API [178]
or Google Fit [174] 80 CAD [183]
13 Listed battery life is always approximate.
Misfit Vapor Yes Yes 1 day Misfit API [178]
or Google Fit [174] 200 CAD [184]
Moov HR Yes Yes < 1 day None 60-100
CAD [185]
Moov Now Yes No 6 months None 60 CAD [186]
Nokia/
Withings Go Yes No > 8 months
Nokia Health API
[187] 50 USD [188]
Nokia/
Withings Steel Yes No > 8 months
Nokia Health API
[187] 130 USD [189]
Nokia/
Withings Steel HR Yes Yes 25 days
Nokia Health API
[187]a 180 USD [190]
TomTom Spark 3 Yes NCb < 1 day to 3
weeks No new users [191] 290 CAD [192]
TomTom Touch Yes NCb 5 days No new users [191] 130 CAD [193]
Under
Armour UA Band Yes NCb 2.5 days Unclear
170-230
CAD [194]
Wavelet
Health Biostrap Yes Yes 5 days Wavelet API [195] 250 USD [195]
Xiaomi Band Yes No 30 days
Google Fit [174],
via unofficial API
[161], or via BLEc
15 USD [196]
Xiaomi Band 2 Yes Yes 20 days Google Fit [174],
or via BLEc 30 USD [197]
aheart rate data access unclear
bNC: non-continuous
cBLE: bluetooth low energy (N.B. device commands are obfuscated by manufacturer)
Proposed Data Access Design
Third party access to Fitbit data is mediated exclusively through the Fitbit web API [165]. It is
possible to both write and read data through the API, but impossible to access data directly from the
device, as illustrated in Figure 4-3. Access to intraday time series data (i.e. step count and heart rate data
at a resolution of less than 1 day, e.g. at the minute level) is also restricted to either ‘personal’
applications, or authorized entities. Authorization to access this data is granted on a case-by-case basis by
Fitbit. After submitting an initial request on June 22nd, 2017 we received approval to access intraday data
2.5 months later, on September 5th 2017. Access to the individual patient data is mediated through the
OAuth 2.0 authentication framework which specifies a secure communications protocol by which Fitbit
and third party servers can confidentially exchange security access tokens to maintain secured and
encrypted transmission of data between the Fitbit servers and the client – in this case UHN - servers. The
complete process for authentication (including initial authentication and maintenance of expired security
tokens), and data retrieval is mapped out in a sequence diagram in Figure 4-4. Since the individual
patient access tokens, which must be refreshed after each use, must be shared between several users (the
patient, clinical staff and research admin/QI personnel) the system was designed such that the central
Medly server would mediate requests for data, supplying the requested data from its internal database
negating the need to re-request data from the Fitbit servers for each user request. The Medly server then
periodically updates this internal database with new data, archiving it according to hospital policy and
local, provincial and federal requirements. Figure 4-5 illustrates this proposed design for patient users and Figure 4-6 illustrates the proposed design but for clinical users. The sequence for research admin/QI personnel is essentially identical to that of clinical users.

Figure 4-3. Fitbit data flow diagram

Figure 4-4. Fitbit authentication process with a client app
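As a rough illustration of the retrieval step (and not the production Medly implementation), one day of intraday step data could be requested with the httr package in R given a valid OAuth 2.0 access token; the endpoint path and response field names follow Fitbit's published intraday time series format and should be treated as assumptions here, and token refresh and error handling are omitted.

    # Illustrative sketch: fetch one day of minute-level step data from the
    # Fitbit web API using an OAuth 2.0 bearer token (token management omitted).
    library(httr)
    library(jsonlite)

    access_token <- "PATIENT_ACCESS_TOKEN"  # placeholder obtained via the OAuth 2.0 flow

    resp <- GET(
      "https://api.fitbit.com/1/user/-/activities/steps/date/2017-09-05/1d/1min.json",
      add_headers(Authorization = paste("Bearer", access_token))
    )

    parsed <- fromJSON(content(resp, as = "text"))
    head(parsed[["activities-steps-intraday"]][["dataset"]])  # time and per-minute step values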
Final Data Access Implementation
The final implementation for data access was managed by the development team at the Centre for Global eHealth Innovation (a partner of UHN). As a result, the final implementation differed slightly from the proposed implementation
due to time constraints and lack of programming resources as a result of concurrent updates, bug fixes
and general QI updates to Medly that were deemed to be a higher priority. The final implementation
therefore did not include an update to the client side patient smartphone application. The proposed
design was reduced to a pared-down Minimum Viable Product14 (MVP). In this pared-down version, clinical admin staff (such as the onboarding coordinator) authenticated Fitbits on behalf of patients on the clinical client application. No functionality was provided for patients to authenticate Fitbits with Medly or to access data through the Medly application. Furthermore, the ability to authenticate new devices and access patient data was only available for patients using Medly on an Apple iPhone15. Clinicians wishing to access data for patients using the standard Android device usually provided as part of the Medly patient kit were only able to access said patient data through the official Fitbit website. Patients, whether Android or iPhone users, were able to access their data either through the Fitbit website or through the Fitbit app that had to be installed on their smartphone. No provisions were made for data access by research/QI personnel - in fact the Medly server was implemented to only receive daily step data summaries and not intraday data. The server also did not retrieve heart rate data.

14 a feature-sparse software platform that includes only the bare minimum functionality required to operate.
15 as of the time of publication Medly now supports Fitbit authentication and data access for patients using either Apple iPhone or Android devices.

Figure 4-5. Medly Fitbit patient access sequence
To access intraday heart rate and step data the author created an open source script using the R
programming language [151] (available with the rest of the software artifacts generated from this thesis as
per Appendix C or directly from [198]). This script connects to the Fitbit API, manages the security
access tokens for the patients in the study (both Android and iPhone patients) and is able to download
both the minute-by-minute step count and heart rate data for analysis. It is also registered as a separate
third-party application with Fitbit to permit separate administration from the clinical system and to
avoid technical issues with the script affecting the clinical system. This script was based on previous work
by S. Bromberg [46,150], whose original script is available on GitHub [150].
Figure 4-6. Medly Fitbit clinician access sequence
4.3.2 User Interface Design
The Medly user interface (UI) also required updates to support the addition of fitness tracker
functionality.
Proposed User Interface Designs
Several designs were proposed, which were based on best practices from the fields of data visualization
[199–201], human factors & user experience design [202–205], as well as insights from consultations with
the Medly design team at Healthcare Human Factors (a partner of UHN) and the development team at
the Centre for Global eHealth Innovation.
In order to provide a more optimal user experience, patients should receive feedback that their device is operating as expected. In the case of the Fitbit activity tracker this means not only that the
device is charged and collecting data, but also that the device is syncing data to the patient’s smartphone,
and ultimately to UHN. Displaying the patient’s Fitbit data on the Medly app on the patient’s
smartphone would provide this feedback since it requires an unbroken chain of communication between
the Fitbit, Fitbit App, Fitbit Servers, UHN Servers and the Medly app as shown in Figure 4-3. We
proposed 4 design each for both the home and trends screen that were consistent with the UI design
language already established by Medly. The 4 proposed home screen designs are illustrated in Figure 4-7,
the designs for displaying trends data are shown in Figure 4-8. Since the fitness tracker step count and
heart rate data is generated at every moment, instead of being collected usually only once a day in the
morning, the proposed designs, although adhering loosely to the established design language of Medly
intentionally treat fitness tracker data in a visually distinct manner so as to help users identify the less
static nature of the fitness tracker data (compare Figure 4-1a and Figure 4-7). Similarly, the proposed
trends screens are slightly modified to better adapt to nature of the fitness tracker data. For example,
daily or weekly heart rate summaries not only report mean heart rate, but also the lower and upper range
of heart rate during those periods.
Along with the aforementioned changes to the trends and home screen, we designed a UI flow for changes
to the Medly smartphone app to allow patients to link a Fitbit account to their Medly account, this UI
flow is illustrated in Figure 4-9. However, as mentioned in Section 4.3.1.2, this flow was ultimately not
implemented. Instead Fitbit account linking was redesigned to be done through the clinician web
interface. The final authentication flow is discussed in Section 4.3.2.2.
Figure 4-7. Proposed designs for patient user interface (home screen)
a) combined heart rate and steps data on one card, b) combined heart rate
and steps data with pictorial representations, c) separated heart rate and step data, d)
only pictorial representation with mini graph
Figure 4-8. Proposed designs for patient user interface (trends)
a) simple sparklines, b) data with bands to indicate min (resting), mean and max values for
each time period, c) whisker plot to indicate daily range, d) heart rate (maximum and
resting) and average step count values broken out for each time period, and e) Tufte style
medical data visualization as per f) which is reproduced from [201]
Figure 4-9. Proposed design for authorization of new Fitbit by patient via Medly smartphone application.
With respect to the clinician web interface, changes were much more limited and mostly centered on
adding new graph components to display the new fitness tracker data which differs from the rest of the
data collected by Medly since it is available at up to minute-level resolution. The proposed
web interface graph designs are shown in Figure 4-10 (which can be contrasted to the existing graph
design in Figure 4-2).
The design of the clinical user interface was approached in a similar fashion to the patient smartphone
trends screen. Although the web interface has more available screen real estate than the smartphone
screen, the performance of the web interface was known to drop drastically when made to process large numbers of
data points for display on graphs. As such, the design of the clinical user interface represented a similar
challenge to the smartphone trends screen: the need to collapse voluminous high resolution minute-by-
minute data into more concise daily or weekly summaries; this explains the successive data simplification
that occurs while transitioning from Figure 4-10b to Figure 4-10d. The design shown in Figure 4-10b for
example is inspired by the UI of an intensive care monitoring system designed for use in the data-rich
environment of the pediatric critical care units at SickKids: The Hospital for Sick Children in Toronto
and Boston Children’s Hospital in Boston [206–208]. Consequently, it is the strongest of the proposed
designs from a data fidelity point of view since it cuts out minimal data and allows a user to more easily
visualize concurrent trends in multiple data streams. However, due to the technical limitations of the
Medly web interface, it is also the least feasible to implement. Figure 4-10c and Figure 4-10d were later
design iterations attempting to reduce the number of visual elements that the interface would need to
process and draw while still maintaining as much information content as possible. Figure 4-10e returns to
the same simple graph style of Figure 4-10a and Figure 4-2 but with range bands and a UI element for
displaying something useful derived from the step count data such as the predicted NYHA class
(compared to the last assessed NYHA class). This UI element also provides the option for the clinical staff
to provide feedback as to whether they agree with the prediction, or not, by pressing on the ‘x’ or check
mark and correcting the prediction (this later pop-up is not shown). This functionality would be useful for
collecting feedback (and training examples) from the user to assess the accuracy (and dynamically teach)
an NYHA functional classification suggestion algorithm once it gets implemented into Medly. Lastly, we
proposed simple alerts for both step count and heart rate consistent with those implemented for weight,
blood pressure and pulse: namely a lower limit for step count and upper and lower limit alerts for heart
rate. We also proposed adding adherence phone call functionality for the fitness tracker similar to the
already implemented system that triggers an automated reminder phone call when a patient does not
submit their daily readings.
Figure 4-10. Proposed designs for clinical user interface (activity and heart rate graphs)
a) simple graph design with indicator lines for alert levels and mean, b) design inspired by
the Sick Kids T3 (tracking, trajectory and trigger) tool [206–208], c) mix of T3 tool with
Medly range bands, d) whisker plot style, and e) simple graph with range bands and
NYHA class prediction display (bottom of the more info page for step count graph)
Figure 4-11. Final web interface Fitbit authorization flow
Figure 4-13. Final web interface activity tracker data display
Figure 4-12. Final web interface activity tracker profile & deauthorization flow
Final User Interface Design
As with the back-end components required to download and access the fitness tracker data, the actual
programming of the UI components required for the activity tracker update to Medly was managed by the
development team at the Centre for Global eHealth Innovation (a partner of UHN). Again, due to time
and resource constraints caused by higher priority fixes and updates, the final UI implementation was
reduced down to a proof-of-concept. Due to a lack of available iOS and Android programmers, no updates
were possible to the patient smartphone UI, so patients were instead instructed to use the Fitbit app on
their phone to confirm that data was being collected and synced to the Fitbit servers. The task of
confirming that the Fitbit data was being properly synced to the Medly servers was instead left to the
author as part of the research work documented in this thesis. Afterwards, this task is anticipated to be
delegated to the clinical admin staff to be performed on a manual basis using elements that were added to
the clinician web interface. The inability to update the smartphone UI also necessitated the creation of a
new UI design for the task of linking patient Fitbits (whether provided by the clinic, or patient’s personal
Fitbits) to Medly servers through the clinician web interface. The final version of this UI flow is shown in
Figure 4-11.
As required by the Fitbit applications programming interface (API) for web applications, as part of the
authorization process, the user is redirected directly to the official Fitbit website (Figure 4-11 step 3) so
they can confirm that they are connecting to the genuine Fitbit.com site [209]. Once logged into the
Fitbit website the user can then select what data to share (Figure 4-11 step 4).
When linking activity trackers we instructed users to select ‘Allow All’ to allow all data to be shared
(refer to Figure 4-11 step 4). Admittedly, this violates an old principle of computer security: the principle of
least privilege (or least authority), which dictates that user access rights be a) limited to the bare
minimum required to perform the desired task and b) provided only for the duration required for said
task. We recognized, however, that Medly was likely to receive updates in the near future to
enable more complete use of Fitbit functionality, and that if these future updates used data outside
of the already required ‘heart rate’ and ‘activity and exercise’ data it would necessitate manually
unlinking and then relinking all of the Fitbit accounts to select additional permissions, likely at significant
time cost. Furthermore, clicking the single ‘Allow All’ button was a simpler task for users to perform
compared to having users select the separate individual ‘heart rate’, ‘activity and exercise’ and ‘Fitbit
devices and settings’ radio buttons. A less complicated task is expected to reduce the likelihood of error
when linking a Fitbit account. Lastly, even in the case of a real security concern such as a data breach,
the tokens exchanged through the authorization process, which provide the Fitbit data access rights in the
first place, can be remotely revoked through the Fitbit website both on an individual basis and en masse.
This reduced the actual security risk to what we deemed to be an acceptable level.
We were actually able to confirm this loss of data access to linked Fitbit accounts as a result of a
suspected security breach inadvertently triggered during data collection. The incident occurred on May
31st, 2017 while authenticating patients using the custom R script written to download the minute-by-
minute heart rate and step count data and manage the associated access tokens.
The script accepts a list of user accounts and loops through a pared down version of the authentication
flow shown in Figure 4-11 (i.e. just steps 3 and 4) for each account one immediately after another. This
makes it possible to quickly add and retrieve access tokens for multiple patients in bulk, reducing
workload for research/QI work. Fitbit’s automated security system interpreted the rapid automated
linking of multiple Fitbit accounts as suspicious and potentially indicative of malicious activity. As a
result, Fitbit’s security system subsequently banned the internet address of the machine running the script
and flagged the 34 recently linked accounts as potentially compromised, forcing password resets and
invalidating the access tokens for each of these accounts (both for the script and the clinical system).
It took approximately 3 weeks to: 1) confirm with Fitbit that we were the actual cause of the suspected
‘data breach’ (as opposed to an actual malicious third party), 2) reset patient passwords, 3) relink
accounts on the clinical system, 4) contact patients to ensure that they had successfully logged back into
the Fitbit app on their phone, and 5) slowly relink accounts to the research system (which we did at a
rate no higher than 1 per 30 seconds and in batches of no more than 25, with a pause of at least 45 minutes
between batches). Because we experienced delays in reaching patients to inform them that they needed to log
back into their Fitbit account (at least half were initially unreachable on the first day and had to be left a
voicemail message or equivalent), some of the patients may have suffered about 1-2 weeks of data loss.
The potential data loss would have been caused by the limited internal memory of the Fitbit; since the
Fitbit only has sufficient internal memory to record 1 full week of minute-by-minute data it must be
synced at least once a week to the Fitbit servers, usually via the Fitbit app, to make more room for new
data. Due to accounts being flagged as compromised, patients needed to log back into their account using
their new password to reenable syncing between their Fitbit and Fitbit servers. Since Fitbit only provides
the last device sync date, which was not actively monitored during this period (as opposed to a complete
sync history) we were unable to confirm the actual extent of data loss for patients. We were also unable
to ascertain the extent of data loss simply by examining the data since it is difficult to determine if
potential lack of data during this period was due to the incident or simply due to patient disengagement,
in particular since those patients most likely to have not noticed that they had been logged out of the
Fitbit app are almost by definition those least engaged with the system.
Aside from the potential loss of data, the incident had no other reported impact on the system. The loss
of data also had minimal impact on the QI/research objectives of this study since most patients impacted
by the incident had already been using the monitoring system for several weeks (and even months), and
data collection for all patients would still continue for several weeks post incident (to attain a minimum 3
week recording period for each patient).
Returning to the UI: once users proceed through the authentication flow in Figure 4-11 - thus enabling
syncing of their Fitbit account to Medly - they are returned to the patient profile page which now displays
status information about the connected Fitbit account and the option to unlink the account if desired (see
Figure 4-12). This profile page displays information about the last time the Medly server was synced
with the Fitbit server – ‘Last Server Sync’ – as well as the last time a Fitbit device was synced to the
Fitbit account16 – ‘Last Device Sync’ – the latter of which can never be more recent than the ‘Last Server
Sync’. These two values were added to help users determine if a lack of displayed step count data is
caused by: a communication problem between the Fitbit server and Medly server (the ‘Last Server Sync’
value is not up to date and does not update even when the user presses the ‘Force Sync’ button); the
Fitbit device has not yet been synced (the ‘Last Device Sync’ value is not up to date although the ‘Last
Server Sync’ value is up to date); or the patient has simply not used the Fitbit or performed any physical
activity (both the ‘Last Device Sync’ and ‘Last Server Sync’ values are up to date but no step data shows
up on the web interface).
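The troubleshooting logic that these two timestamps support can be summarized in a few lines; the sketch below is purely illustrative (the function, field names and the one-day staleness threshold are assumptions, not part of the implemented web interface).

```r
# Hypothetical triage helper based on the two displayed sync timestamps.
diagnose_missing_steps <- function(last_server_sync, last_device_sync,
                                   now = Sys.time(), max_age_days = 1) {
  stale <- function(t) difftime(now, t, units = "days") > max_age_days
  if (stale(last_server_sync)) {
    "Medly <-> Fitbit server communication problem (try the 'Force Sync' button)"
  } else if (stale(last_device_sync)) {
    "Fitbit device has not synced: ask the patient to open the Fitbit app"
  } else {
    "Syncing looks healthy: the patient has likely not worn or used the Fitbit"
  }
}

diagnose_missing_steps(last_server_sync = Sys.time() - 3600,       # 1 hour ago
                       last_device_sync = Sys.time() - 5 * 86400)  # 5 days ago
```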
As for displaying the Fitbit data: heart rate data was deemed to be non-essential for inclusion as part of
the activity tracker MVP, in particular since it would risk confusion with the already displayed
daily recorded pulse data (recorded using a blood pressure cuff). As a result, no graphical display was
implemented for the Fitbit-acquired heart rate data. The step data graph, on the other hand, was
modelled after the existing graph design (Figure 4-2), showing total daily steps for each day in the view
windows (see Figure 4-13). In the ‘More Info’ page to the immediate left of the graph, the whole time
period being viewed was summarized by providing the lowest, average, and highest daily step count and
16 This process occurs automatically every time the user opens the Fitbit app on their smartphone.
total readings during the period in question. It is worth noting that this final step data graph design also only
represents a minimum technically viable product as it does not fully honor the best practices and
principles outlined in the Fitbit API terms of service, the most relevant being the following:
“Offer Users a clear path back to their Fitbit Account.
• Always provide clear documentation and links for Users to access their Fitbit
Account from your Application.
• Paths to Fitbit User accounts should be available wherever User Data is
displayed.
• Paths to Users’ Fitbit accounts should be available in "Setting," "Account," or
a similar location from within your Application.
• When displaying Fitbit Data in your Application, Fitbit must be noted as the
source of Fitbit Data using the text link and/or logo icon made available to you
through the Fitbit Developer Portal.” [210]
As is, the step data graph adheres to none of these provisions.
Despite all of the aforementioned limitations we were able to onboard 46 patients onto the upgraded
system over a 5 month period (from January 9th to June 13th). These patients were subject to the same
inclusion and ‘exclusion’ criteria used for the general Medly system. The inclusion criteria are detailed in
Table 13. While there are no explicit exclusion criteria for Medly, we note that since the system (and by
extension this update) is used as part of the prevailing standard of care at the Heart Function clinic, the
decision to prescribe or exclude a patient from the Medly program is ultimately up to the professional
judgement of the attending cardiologist. As of the time of writing a total of 7 attending cardiologists use
Medly as part of patient care, although one of the cardiologists (the medical director of the clinic) is
disproportionately responsible for a majority of the patients monitored. During this period 2 (4%) of the
46 patients later changed their mind about being monitored via Fitbit and subsequently chose to return
their devices and be removed from QI initiatives related to Fitbit monitoring. On the other end of the
spectrum, 3 (7%) of the 44 patients who remained in the study chose to supply and use their own Fitbit
device and Fitbit account instead of being provided one by the clinic (these patients were unsurprisingly
all very adherent with their Fitbits).
Table 13: Medly inclusion criteria
- a consenting adult (18+ years of age),
- diagnosed with heart failure,
- followed by a licensed cardiologist at the UHN Heart Function Clinic (who in turn bears
the primary responsibility for the management and care of that patient's heart failure
diagnosis)
- sufficiently capable of speaking and reading English, or having an informal caregiver
(spouse, parent, etc.) capable of the same so as to both:
o undergo the process of and provision of informed consent for participation in the
Medly program
o understand and follow the text prompts provided by the Medly patient-side
application
- capable of complying with the use of Medly (e.g. capable of truthfully answering
symptom questions, capable of safely and correctly using the peripherals such as the
weight scale, activity tracker and blood pressure cuff)
Table 14: Medly exclusion criteria
- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study
costs – i.e. roaming charges)
Of the 44 patients who remained on the monitoring system, 12 (27.3%) used and provided their own
Apple iPhone devices, and 32 (72.7%) used Android devices provided by the clinic. Based on the number
of mobile wireless subscribers in Ontario (88.1% in 2015 [211]), the iPhone market share in Canada
(51.37% in October 2017 [212]), and proportion of devices using an iOS version supported by Medly
(version 9.4 or above; 96.75% in October 2017 [213]) the expected proportion of iPhone to Android was
closer to 43.8% (19:25). These expected values and the actual proportions of onboarded patients by device are
tabulated for easier reading in Table 15. By proportion, the number of iPhone users onboarded was
slightly less than expected. The higher relative proportion of Android users was anticipated since
we recruited Android users not just from the pool of new patients onboarded onto Medly during the 5
month period but also from patients who had already been onboarded onto Medly and happened to be
returning to the clinic for follow-up during this period. No iPhone users had previously been onboarded
onto Medly; therefore, all of the 7 returning patients (16%) upgraded with Fitbits were Android users.
Removing these patients, 32.4% of new patients used iPhones and 67.6% used an Android device, which is
closer to the distribution expected based on market share calculations. In either case the relative
proportion of iPhone to Android users was not found to be statistically different to the expected
proportion at the 5% level of significance and given the sample size (P=0.18, and P=0.47 respectively for
the cases discussed above; assessed using a chi-squared test with R [151]).
Table 15: iPhone vs. Android patients on Medly system using Fitbit
a) all patients onboarded, b) only new Medly patients onboarded during thesis

a)                    All Onboarded     Expected (by Market Share)   P-value
iPhone Users          12 (27.3%)        19 (43.8%)                   .18
Android Users         32 (72.7%)        25 (56.2%)

b)                    New Patients Only   Expected (by Market Share)   P-value
iPhone Users          12 (32.4%)          16 (43.8%)                   .47
Android Users         25 (67.6%)          21 (56.2%)
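For illustration, the device-mix comparison can be reproduced in R by arranging the observed counts and the market-share-expected counts from Table 15 as 2 x 2 tables; with the default Yates continuity correction, chisq.test() returns p-values matching those reported. This is a sketch of one way to run the test, not necessarily the exact call used for the thesis analysis.

```r
# Observed vs. market-share-expected device counts (Table 15).
all_onboarded <- matrix(c(12, 32,    # observed: iPhone, Android
                          19, 25),   # expected by market share
                        nrow = 2,
                        dimnames = list(Device = c("iPhone", "Android"),
                                        Group  = c("Observed", "Expected")))
new_only <- matrix(c(12, 25,
                     16, 21), nrow = 2,
                   dimnames = list(Device = c("iPhone", "Android"),
                                   Group  = c("Observed", "Expected")))

chisq.test(all_onboarded)  # p ~ .18 (all onboarded patients)
chisq.test(new_only)       # p ~ .47 (new patients only)
```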
Patient adherence was also recorded at two points
during the study, at 3 months into the study (April
9th, 2018) and at the end of the data recording period
(August 1st, 2018; 7 months). At both of these
junctures, patients were found to be overall
moderately adherent with using the Fitbit – e.g. at
the 3 and 7 month timepoints 50% of patients had
used the Fitbit (recorded steps or heart rate) on at
least half of the days they were on the system. Only
around 1/3 to 1/4 of patients (at 3 and 7 months
respectively) had excellent levels of adherence
(average at least 9 of 10 days using the system). A
more complete breakdown of adherence is available in
Table 16, with the stem and leaf plots in Figure 4-14
illustrating the comparative distribution of the
percentage of days patients had used the system (relative to the total number of days they were on the
upgraded system) at the 3 and 7 month timepoints. A paired Wilcoxon signed rank test (since the data is
non-normal, as can clearly be discerned from Figure 4-14) revealed that there was no statistically
significant difference between the adherence at 3 and 7 months (P = 0.625).
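A sketch of the paired comparison is shown below; the two vectors stand in for each patient's percentage of days used at the two timepoints and hold placeholder values, not study data.

```r
# Placeholder adherence percentages, paired by patient (3 vs. 7 months).
adh_3m <- c(91, 100, 45, 12, 88, 97)
adh_7m <- c(90,  98, 42, 16, 83, 91)

# Non-parametric paired test (the real data gave P = 0.625).
wilcox.test(adh_3m, adh_7m, paired = TRUE)
```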
Compared to the adherence levels recorded during the original Medly RCT, where “about 42, 33, and 16
out of the 50 telemonitoring group patients (84%, 66%, and 32%) completed at least 91 (50%), 146 (80%),
and 173 (95%) of possible daily readings over the six months respectively (prior to the adherence phone
call deadline at 10am)” [103], patients using activity trackers in this study were found to be significantly
less adherent (at the 5% level of significance) at both the 50% and 80% adherence thresholds (but not the
95% threshold); detailed results are tabulated in Table 17.

Figure 4-14. Distribution of patient Fitbit adherence (as percent of days using the system)
Back-to-back stem-and-leaf plot (3 months on the left, 7 months on the right); the decimal point is 1 digit to the right of the ‘|’, so 9 | 1 represents 91%.

   3 Months |    | 7 Months
        980 |  0 | 0001235589
       6431 |  1 | 15
        842 |  2 | 45899
          1 |  3 | 012
         64 |  4 | 19
            |  5 | 237
          8 |  6 | 2
         21 |  7 | 4
          5 |  8 | 0357
        710 |  9 | 0111137888
     000000 | 10 | 000
Table 16: Patient adherence on Fitbit

                                          3 Months                7 Months
Adherence      Definition                 sum(a)      delta(b)    sum(a)      delta(b)
Near Perfect   > 95% of days used         7 (26.9%)     -         7 (15.9%)     -
Excellent      > 90% of days used         9 (34.6%)   2 (7.7%)    12 (27.3%)  5 (11.4%)
Consistent     > 68% of days used17       13 (50.0%)  4 (15.4%)   18 (40.1%)  6 (14.6%)
50-50          > 1/2 of days used         13 (50.0%)  0 (0%)      22 (50%)    4 (9.1%)
Sporadic       > 1/7 of days used         21 (80.8%)  7 (30.8%)   33 (75%)    11 (25%)
Onboarded      all patients               26 (100%)   5 (19.2%)   44 (100%)   11 (25%)

(a) i.e. # (%) of patients meeting or exceeding the specified level of adherence
(b) i.e. difference between # (%) of patients at the specified level of adherence and the next highest adherence level
Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT

                        Medly RCT [103]   Fitbit @ 3 Months          Fitbit @ 7 Months
Adherence Level         # of patients     # of patients   P-value    # of patients   P-value
> 95% of days used      16 (32%)          7 (26.9%)       .85        7 (15.9%)       .12
> 80% of days used      33 (66%)          10 (38.5%)      .04*       17 (38.6%)      .014*
> 50% of days used      42 (84%)          13 (50.0%)      .004**     22 (50.0%)      <.001
Total                   50                26              -          44              -
A recent study by Hermsen et al. [214], who examined sustained use of a provided Fitbit activity tracker
in 711 patients, found that 232 days into their study, of those who were non-adherent at that stage (187
patients), 56.7% stopped adhering due to technical problems or difficulties18, 12.8% lost the device, 12.8%
forgot to wear the device, 9.7% felt they had no use or motivation to use the particular device given to
them (including because they used a different device), 3.7% stopped due to health issues and 5.4% didn’t
want to use the device for various other reasons (excluding health issues).
From this study we can infer that people, broadly-speaking, are non-adherent to technology for one of
three reasons:
17 68% of days equates to roughly 20-21 days out of the month (i.e. every weekday)
18 in our study we had 2 devices (both replaced) reported as non-functional (one that over-reported steps and one
that simply didn’t work).
1) they are (humanly) unable to use the technology, namely because the technology is non-
functional, whether due to technical or human factors problems;
2) they want to use the technology but forget to do so; or
3) they don’t want to use the technology, for example because they have concerns about
detrimental effects of the technology on their wellbeing, or generally don’t recognize any
benefits to using the technology.
For patients who are unable to use the technology, in particular due to human factors problems, the
pared down UI designs ultimately implemented do little to make the Fitbit more usable from a patient
perspective. However, they also do little to make things worse. Since no UI updates were made to the
Medly patient app to help support the fitness tracker, a patient’s interactions with the Fitbit are limited
to interactions with the device itself and the proprietary Fitbit app (and optionally the Fitbit website). As
a result, difficulties interacting with the technology are in a way more representative of Fitbit as a
technology than of our RPM system. Our findings therefore actually form a baseline for patient
adherence on a Fitbit RPM system since the components implemented into our system represent the bare
minimum required to actually make a Fitbit enabled RPM system function. Furthermore, the fact that the Fitbit
user experience design is largely outside of the control of third-party researchers and programmers
makes it harder to make real improvements to this part of the user experience, perhaps aside from
providing better user education (generally considered by human factors experts as the least effective
means of effecting meaningful change [215,216]).
In the other case of patients who simply forget to wear the tracker, a solution already exists: adherence
phone calls. These were coincidentally used with great effectiveness during the Medly RCT although they
were not added as part of the Medly Fitbit MVP.
As for patients who did not want to use our technology: we suspect that this was a less likely
contributor to non-adherence in our particular study since the patients onboarded onto this system all
willingly consented to participate. That being said, we fully expect this willingness to decrease as time
goes on. In the same Hermsen et al. study (which examined the sustained use of a provided Fitbit activity tracker),
the authors found a “slow exponential decay in Fitbit use, with 73.9% (526/711) of participants still
tracking after 100 days and 16.0% (114/711) … after 320 days.” [214]. Although, as previously mentioned,
we found no significant difference between adherence at 3 and 7 months our study was not powered ahead
of time to address this question.
We suspect that the easiest and most cost-effective solution to most if not all of the aforementioned
problems is adding the fitness tracker to the adherence phone call system already implemented as part of
Medly. Adherence phone calls would not only help to address the problem of patients simply forgetting to
wear the activity tracker (which might otherwise necessitate an update to the Medly UI), but they would
also provide increased opportunity to address technical or usability issues experienced by patients by
providing patients with an additional compelling reason to get these issues addressed by contacting Medly
support staff (i.e. avoiding nuisance phone calls). If the Medly UI were to be updated, adding some sort of
alert or reminder when a patient was taking their morning readings would be even better since it would
prevent more unintentional data loss. An ideal system would also notify this same Medly support staff of
patients who are consistently experiencing difficulties with the activity tracker, to properly close the
feedback loop between patients and the clinic and ensure that patient difficulties are being properly
addressed. While adherence phone calls would help catch technical or usability issues earlier, they might also
help patients see the benefit of this system in that they would be held accountable to this element of their
self-care and management. From a research perspective, having already established the baseline adherence
of the Fitbit system, we could even quantify the actual impact of adherence phone calls by re-running this
analysis after this feature is implemented.
As for the usage of the updated system by clinical staff: we unfortunately have no quantitative data to
perform an analysis similar to the one done for patient users, as the upgraded iteration of the Medly
system did not record data that would permit the assessment of clinician usage of the newly available
Fitbit data.
The analysis in this chapter was performed using R [151] and supporting packages [217–219].
Summary
In summary, we updated Medly, the remote patient monitoring system in use at the TGH HF clinic, to
support the collection and partial display of Fitbit activity tracker data. Although the system supports all
Fitbits, we specifically chose to provide patients at the clinic with the Fitbit Charge 2, which was the
most inexpensive tracker that met our requirements: namely that it was readily available for purchase,
supported the hardware (smartphones) being used as part of the Medly program, could last at least a few
(2) days without syncing or charging (to help avoid data loss), and provided a means for downloading and
accessing continuous minute-by-minute step count and heart rate data from the device (even if
indirectly). Data access was performed through the Fitbit API with a separate connection for the clinical
system (which allowed clinicians to monitor patient activity through Medly’s custom web interface) and
for the research system (a custom R script which allows research/QI staff to manage access tokens, and
download patient activity data in bulk for offline analysis – see Appendix C or [198]).
Updating Medly to support Fitbit activity tracker data also required an update to the UI of the system to
allow users to 1) link a Fitbit account to the corresponding Medly patient account and 2) monitor patient
activity through the Medly system. In view of this, several UI designs were proposed to the professional
development team whose task it was to program the final design into the existing Medly system. However,
due to time and resource constraints caused by other concurrent higher priority updates and bug fixes to
Medly, all of the initially proposed designs were eschewed in favor of producing a pared down minimum
viable product which demonstrated the technical viability of the solution. As a result, no changes were
made to support the Fitbit activity tracker on the patient smartphone applications. Patients were instead
instructed to use the Fitbit app alone to access their Fitbit data. As for linking patient’s Fitbit accounts
to their Medly account, the authentication flow was adapted so it could be performed by clinical staff
through their clinical web interface. The display of Fitbit activity tracker data on said web interface was
limited to daily step data only since heart rate data was deemed as non-essential. The updated system
also only supported patients using Apple iPhones - clinicians wanting to monitor patients who were using
the standard Android phones provided as part of the Medly system instead had to go through the Fitbit
website directly (although as of the time of publication the Medly system now fully supports patients
using both iPhone and Android).
Despite these limitations, we were able to monitor 44 patients over a 5 month period (from January 9th to
June 13th), with an additional 2 patients who were onboarded but later changed their minds. 3
of the 44 patients actually brought and used their own Fitbit. 12 (27.3%) of the patients used iPhones
(and could be monitored using the updated Medly web interface), whereas 32 (72.7%) of the patients used
Android (which was not supported by the updated Medly web interface). Overall, patients were found to
be only moderately adherent with using the Fitbit. At the 3 and 7 month time points, 50% of patients
had used the Fitbit (recorded steps or heart rate) on at least half of the days they were on the system.
Only around 1/3 to 1/4 of patients, respectively at the 3 and 7 month timepoints, had excellent
levels of adherence (average at least 9 of 10 days using the system). We proposed that adding adherence
phone calls or reminder notifications would help improve patient adherence to the system, or at least help
staff catch and address patient issues in a timely manner.
Chapter 5 – Assessment of NYHA Functional
Classification using Hidden Markov Models
Having completed the essential groundwork of building a system to collect relevant input data, we set
out to assess the NYHA functional classification of patients in an example dataset using 6 different
machine learning (ML) algorithms, specifically: Hidden Markov Models (HMM); Generalized Linear
Models (GLM); a variant thereof: boosted GLMs; Random Forests (RF); Artificial Neural Networks
(NNet); and a variant thereof: Principal Component Analysis Neural Networks (PCA NNet). Since the
approach used to create the HMM based classifier (HMMBC) differed slightly from the rest of the
candidate models, we discuss the HMMBC separately as part of this chapter, while the remaining ML
models are treated in Chapter 6.
First, we provide a brief refresher on HMMs - a more detailed introduction is provided in Appendix B –
followed by our rationale for using HMMs in the first place. We then proceed to explain our methodology
for training and testing a HMMBC. Finally, we discuss the results of our investigation and, since our
HMMBC approach was ultimately unsuccessful, we touch on the problems encountered and provide
recommendations for future attempts.
Hidden Markov Models
Any introduction to hidden Markov Models must start with Markov models. Markov models are
probabilistic state machines where the transitions between states occur randomly according to some pre-
determined and pre-specified transition probabilities between each of the states [118,220–223]. Hidden
Markov Models (HMM) are simply Markov Models where the underlying states cannot directly be
observed [118,220,222,224,225]. Instead, the underlying states of the HMM are inferred from an associated
set of possible observations that are linked to each state. In other words, from the possible outputs that
can be produced when the system is in a particular state. These observed outputs could be speech
phonemes, written characters of the alphabet, or genome sequences [118,226], or in our case step count or
heart rate readings, amongst others.
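As a toy illustration (not taken from the thesis models), the snippet below simulates a two-state ‘resting’/‘active’ Markov chain whose state is never observed directly, only the noisy per-minute step counts it emits; fitting an HMM amounts to recovering this hidden structure from such an observation sequence.

```r
set.seed(1)

# Hidden two-state Markov chain: rows are the current state,
# columns the probability of the next state.
trans <- matrix(c(0.95, 0.05,    # resting -> resting, resting -> active
                  0.30, 0.70),   # active  -> resting, active  -> active
                nrow = 2, byrow = TRUE)
means <- c(resting = 0, active = 60)   # mean emitted steps/minute per state

state <- 1                      # start in the resting state
obs   <- numeric(120)           # 2 hours of per-minute observations
for (t in seq_along(obs)) {
  state  <- sample(1:2, size = 1, prob = trans[state, ])
  obs[t] <- max(0, round(rnorm(1, mean = means[state], sd = 10)))
}
# 'obs' is what an HMM is fit to; the generating state sequence remains hidden.
```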
5.1.1 Rationale for the use of HMMs
The rationale for using hidden Markov Models is that they can embrace the complexity and nuance of
the entire time series data streams (and sequential data in general). In contrast, the remaining ML models
investigated in this thesis (in their standard form) must be provided with input predictors formulated as
cross-sectional data (i.e. with the observations coming from a single point in time).
Of course, it is possible to format, or distill, time series data into cross-sectional data. For example, one
could use the values at discrete time points in a time series as separate independent input features for a
ML model. This is illustrated in Figure 5-1, where the value at time 𝑡𝑛 and the 𝑚 values preceding it:
𝑡𝑛−1, 𝑡𝑛−2, 𝑡𝑛−3, …. to 𝑡𝑛−𝑚 are provided as separate inputs to the ML Model. But, by decoupling the
individual time points one loses an, if not the, essential characteristic of time series data (and sequential
data generally): the interrelationship between individual data points in the series. An ML model trained
in this manner will therefore be robbed of very important information about the time series in question.
To avoid completely throwing away this interrelationship information, one could instead compute various
metrics or characteristics to describe the entire time series such as: the mean and variance of the signal,
the total number or location of peaks, the signal auto-correlation, cross-correlation, frequency distribution,
and so on, using these as input features. Ultimately though, any computation which takes an entire time
series signal and boils it down to a single parameter before providing it to the ML model must be
prematurely throwing away possibly relevant information. This is not to say that feature extraction is
something to be avoided - in fact, it forms a core part of most machine learning pipelines and is also
something we performed as part of training the cross-sectional models detailed in Chapter 6. That being
said, we reasoned that a HMM, which has access to the full time series waveform, with all its complexities,
nuances and interrelationships, would be a better initial candidate for attempting to replicate the
complex task that is assessing NYHA functional class.

Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model
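To make the two formatting strategies concrete, the sketch below builds both kinds of cross-sectional input from a toy per-minute step vector: lagged values as separate inputs (the Figure 5-1 approach) and whole-series summary features; it is illustrative only and uses placeholder data.

```r
steps <- c(0, 0, 12, 40, 55, 0, 0, 3, 0, 80, 75, 0)   # placeholder per-minute data

# 1) Lagged values as independent features (Figure 5-1): each row contains
#    the value at time t_n and the m = 4 values preceding it.
lagged <- embed(steps, dimension = 5)

# 2) Whole-series summary features: a single row describing the entire stream.
features <- data.frame(mean_steps = mean(steps),
                       var_steps  = var(steps),
                       n_peaks    = sum(diff(sign(diff(steps))) == -2))
```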
Methods
In the following section we briefly detail our methodology used for a) implementing and b)
subsequently assessing the performance of our HMMBC.
The work done in this chapter was performed using the R programming language [151] in conjunction
with RStudio [152], an integrated development environment for R, along with various other supporting R
packages [153–158,217]. The R package depmixS4 was used specifically for the training of the HMM
models [227,228].
5.2.1 Training Data
Dataset
Although we originally intended to use the new data collected from the upgraded Medly system (with
the additional activity tracker functionality), we opted to instead use data that was collected during a
previous study (the same data used in Chapter 3). Analysis of the data collected, and continuing to be
collected, from the upgraded Medly system is instead left to future work. The reasoning for this choice
was three-fold.
First, the previous (Chapter 3) study data had a marginally larger sample size of 50 patients, vs. a
nominal 44 patients from the new Medly data. Furthermore, since 5 of the 44 had almost no recorded
activity, and an additional 6 had less than 1 week of recorded activity, the practical size of the Medly
dataset is really closer to 33 patients. While neither of these datasets is large even when considered from a
classical statistics perspective, machine learning is notorious for being particularly data intensive, and
typically the noisier, the more complex and the greater the variance in the data, the larger the dataset
required to achieve good classification performance. Given that continuous daily step data
is simultaneously noisy, complex, and highly variable, we expect that the models will lean towards requiring
more data rather than less. Aside from considering the complexity and nature of the machine
learning algorithms we are investigating, the use of the somewhat larger 50 patient dataset is further
justified since some fraction of the 50 samples will also need to be set aside and reserved for testing and
validation of the models.
The second reason we chose to use the previous study data was that we had insufficient time to download
the last bits of activity data, collect the additional non-activity portions of the data set (e.g.
demographics, NYHA class and CPET data), and subsequently properly clean, and then re-run the
analysis that follows on the new Medly Fitbit data set. The lack of time was mostly a result of pushing
back the final deadline for the inclusion of new onboarded patients into the study dataset, in order to
scrape together as much data as possible for ML in the face of the relatively low onboarding rate (~1.5
patients/week including both new patients and upgraded returning patients) and the delays in
implementing the required data collection infrastructure (as discussed in Chapter 4).
The third reason we opted to use the previous study data is that it included summary cardiopulmonary
exercise testing data for all the patients in the dataset (a by-product of the inclusion criteria) whereas
approximately half of the patients on the upgraded Medly system had not had a CPET performed and
therefore had no such data available at the time of publication. Using the previous study data therefore
had the benefit of allowing us to create models and perform some initial comparisons of the classification
performance of models trained using only CPET data (recall, the gold standard test for assessing exercise
capacity) as compared to models which use activity tracker data.
Our choice of dataset however did come with a significant drawback. As already mentioned, the previous
study data used an activity tracker that did not collect heart rate data. As a result, the dataset only
consisted of the following data:
1. Minute-by-minute step count data – recorded using a commercially available activity-tracker, a
Fitbit Flex [59], continuously throughout the day.
2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine
care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients
who received it prior to recruitment19).
3. Patient demographic/meta data – recorded as part of onboarding, and specifically including:
a. Sex [Male or Female],
b. Age [years],
c. Height [cm],
d. Weight [kg],
19 Specifically, 1, 15, 20 and 22 days prior to recruitment.
e. Handedness [left or right], and
f. Wristband Preference [left or right].
Population
In short, the data ultimately used in the development and validation of all the ML classifiers discussed
in this work is the same data used to perform the replication study in Chapter 3. Recall that the data
was originally sourced between September 2014 and June 2015 from a closed (prospective) cohort of adult
outpatients at the Heart Function Clinic (a tertiary care clinic specializing in the management of heart
failure) at Toronto General Hospital, a part of the University Health Network (UHN) in Toronto,
Canada. The inclusion and exclusion criteria are respectively detailed in Table 3 (page 37) and Table 4
(page 37). The dataset includes 50 patients whose demographics are fully detailed in Table 5 (page 38),
Table 6 (page 38) and Table 7 (page 39), but in short, to reiterate, the patients are predominantly male
(86 vs. 89 [%]), aged: 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3
[kg/m2]) with no significant difference in handedness or wristband preference (see Table 11).
Patients in the dataset were recorded for 2 weeks during which time their HF, and by extension their
NYHA class, was assumed to be stable (stability of HF being one of the criteria for inclusion into the study
which originally generated this dataset; see Table 3).
Label Assignment
The “true” underlying NYHA class of a patient was assessed at onboarding by their physician as either
NYHA functional class II (n=26) or III (n=11), according to the criteria outlined in Section 2.2.1.1, or as
some intermediate/mixed class I/II (n=9) or II/III (n=4). Patients were assessed as an
intermediate/mixed class when a physician was uncertain about the classification or felt that patients
exhibited symptoms from different class levels. However, since class I/II and II/II are not formally
recognized NYHA classes (nor are the sample sizes for the classes in question large enough for any sort of
machine learning), it was necessary to group these intermediate/mixed classes together with the existing
traditional NYHA classes for the purpose of developing our ML classifiers. We grouped the
intermediate/mixed classes according to the most ‘severe’ NYHA class in the set20, i.e. I/II with NYHA
class II, and II/III with NYHA class III.
20 recall our extended reasoning on page 39 for grouping according to the more severe class in the mix.
5.2.2 Model Design
Predictor(s)
In order to predict the class labels, the HMMBC was supplied with only one predictor: the step count
data, since this was the only available time series data. Adding in either the demographic or available
cardiopulmonary testing data would have required stratifying our patients into groups and training
separate sub-classifiers for each group. Since our dataset was so small and relatively homogenous we
reasoned that stratification was not likely to significantly improve performance but would definitely have
at least some detrimental impact on performance by reducing the already meager number of examples
available to train a given classifier in the first place (due to the stratification process).
We did however use multiple variations of the step count data after encountering difficulties getting our
classifier to converge to a valid model using the high-resolution minute-per-minute data. We re-attempted
training our classifier using data at progressively lower temporal resolutions, from 2 to 6 hours. The
algorithm was finally able to converge when we used a resolution of 6-hours21. The result is that we
investigated five separate variant classifiers as part of this work, with each variant supplied with step
count data at a different time resolution, specifically at either:
a) a per minute level resolution [steps/minute], or;
b) a per 2-hour level resolution [steps/2 hours], or;
c) a per 3-hour level resolution [steps/3 hours], or;
d) a per 4-hour level resolution [steps/4 hours], or;
e) a per 6-hour level resolution [steps/6 hours]
Normalization
Additionally, before using the step count data for training, we also normalized the per minute values
to between 0 and 1 via linear scaling, using a minimum of 0 and a maximum of 300 [steps/minute].
Normalizing predictors typically has beneficial effects on training speed but is usually most important for
ensuring each predictor is considered equally by the learning algorithm (as a result of being similarly
21 i.e. the per minute data summed into non-overlapping 6 hour intervals
weighted). In our case, since our HMMBC does not use multiple predictor inputs at the same time we
normalized the data for its secondary effect on learning speed and efficiency.
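A brief sketch of these two preprocessing steps (aggregation into coarser bins, as used for the 6-hour variant, and linear scaling to [0, 1] using the 300 steps/minute ceiling) is shown below; 'minute_steps' is a placeholder vector standing in for one day of per-minute step counts.

```r
set.seed(2)
minute_steps <- rpois(1440, lambda = 2)        # placeholder: one day, 1440 minutes

# Sum the per-minute counts into four non-overlapping 6-hour bins (cf. footnote 21).
six_hour_bins <- tapply(minute_steps, rep(1:4, each = 360), sum)

# Linear scaling to [0, 1] using a fixed range of 0 to 300 steps/minute.
normalized <- pmin(minute_steps, 300) / 300
```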
Architecture
In order to actually construct a classifier using the aforementioned predictors, we used one HMM per
classification label - 2 total: 1 each for NYHA functional class II and III22 - combined as per Figure 5-2.
Each HMM is trained with data from the subset of patients corresponding to the target NYHA class
label, i.e. one HMM is trained using the 35 patients with NYHA class II and the second with data from
the 15 patients with NYHA class III. Classification of new patients can then be performed by evaluating
the likelihood that the given patient's predictor sequence (i.e. step count data stream) was generated from
each of the corresponding HMMs in a set. Evaluating this likelihood, or similarity score, is done using an
‘inference’ algorithm, typically the ‘forward’ or ‘backward’ algorithm, whose functionality is included in
most HMM programming libraries. The interested reader can read up on the finer details of these
inference algorithms in any of these referenced works [118,220,222–224]. Regardless of the algorithm used,
the NYHA class of the patient in question is deemed to correspond to the class of the HMM with the
22 by extension, a 3 or 4 class multi-class classifier would contain an additional HMM trained using
NYHA class I or IV patients as required.
Figure 5-2: Architecture for hidden Markov model based classifier
highest similarity score returned by the inference algorithm. In other words, the class of the model with
the highest likelihood of having generated a sequence similar to the input predictor corresponds to the
predicted class of the input patient data stream.
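A minimal sketch of this architecture using depmixS4 is shown below, assuming placeholder objects: steps_ii/steps_iii stand for data frames with a normalized 'steps' column holding the concatenated sequences of the NYHA II and NYHA III training patients, lengths_ii/lengths_iii give each patient's sequence length (so the patients are treated as independent series), and new_patient is a single patient's sequence. It is illustrative only, not the thesis code.

```r
library(depmixS4)

fit_class_hmm <- function(df, lengths, nstates = 3) {
  mod <- depmix(steps ~ 1, data = df, nstates = nstates,
                family = gaussian(), ntimes = lengths)
  fit(mod)
}

hmm_ii  <- fit_class_hmm(steps_ii,  lengths_ii)   # trained on the 35 class II patients
hmm_iii <- fit_class_hmm(steps_iii, lengths_iii)  # trained on the 15 class III patients

# Score a new patient's sequence against a trained model: build an identical
# model around the new data, copy in the trained parameters, and evaluate the
# (forward-algorithm) log-likelihood.
score <- function(trained, new_df) {
  mod <- depmix(steps ~ 1, data = new_df, nstates = 3,
                family = gaussian(), ntimes = nrow(new_df))
  logLik(setpars(mod, getpars(trained)))
}

# The predicted class is that of the HMM most likely to have generated the sequence.
predicted <- ifelse(score(hmm_ii, new_patient) > score(hmm_iii, new_patient),
                    "NYHA II", "NYHA III")
```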
Model Generation and Selection
As to how we generate the individual HMM models, the process can be divided, at least logically, into
two separate parts. The first is that of generating a model for each of the classes. The second involves
generating different variant models within each class group using different initial HMM parameters with
the goal of trying to find the parametrization that creates the single best model that most accurately
represents the class group in question. In other words, to find as close as possible to the globally optimal set of
parameters (as opposed to simply a local optimum).
The first part, model generation for each class, as already touched on, is accomplished by simply selecting
all the patients that belong to a given class (NYHA class II or NYHA class III) and using these as the
training data for the model training function of our HMM library for R: depmixS4 [227,228]. The
depmixS4 training function outputs a potential model which we can add to a list of potential models for
that class. This list of models will later be passed onto the optimal model set selection process.
The second part, generating different parametrizations, simply involves repeating the first part of the
process, but updating the initial parameters that form the second part of the required input for the
depmixS4 model training function, until we have swept through all the desired parameter variations. Each
of these models is in turn added to the appropriate list of potential class II or class III models.
As for selecting the final model pair, this can be accomplished by simply taking every paired combination
of class II and class III models in the potential model lists, assessing the performance of each of these
combinations against an example test set of data, and selecting the model set with the best overall
performance. Unfortunately, we did not actually investigate this last part of the model generation process
as a result of the critical problems encountered in the first part of the model generation process: namely
that we were unable to get the training algorithms to converge, or actually train a HMM model using the step
count data (whether with the depmixS4 library or others [223,225]). Although we were able to discover a
way to overcome these training difficulties - using lower resolution step count data (the per-minute step
count data aggregated into non-overlapping 6-hour bins) - this solution fundamentally violated the whole rationale for
using a HMM model based approach in the first place (being able to use the complete per-minute time
series waveform without having to dilute it down). This prompted us to instead pursue and focus on the
other more classic cross-sectional ML methods discussed in Chapter 6. As a result, although we managed
to train a single set of HMMs, which we used to build an initial HMMBC, the performance of the
classifier was so obviously poor (as discussed in Section 5.3) that we eschewed spending significant
time optimizing the algorithm performance when the cross-sectional ML methods proved more effective.
Initial Parameterization
The initial parameterization for the successful trained classifier, with some rationale for the selection,
is provided below. We emphasize however that little weight should be given to these parameters since
they are hand-picked, essentially arbitrary and not-verified against other parameters. Although we
attempted several different variations on model parameterizations as part of the debugging process none
of these were thoroughly documented.
1. States: 3
Although we only tested an HMMBC built with 3 underlying states (per HMM), our original
intent was to sweep the state parameter from 3 to 6-8 states depending on available
computational power. We started with the lowest number in that range - 3 states - to help with
debugging our training problems. Since we never performed the optimal parameterization search
our final successful trained classifier therefore only had 3 states23.
2. Starting State Probabilities: [0.95 0.00 0.05]
Based on our initial exploration of the data (Chapter 3), patients spent most of their time in a
non-active state. In other words, at any given moment, if we were to look at the step count time
stream, it is most likely that a patient would be in a non-active state as opposed to any other
state. We assumed the HMM would likely detect this as a strong pattern and model the non-active
state as one of the 3 states, so we set our starting state probabilities to suggest this in advance.
3. Transition Probabilities (each column sums to 1):
   [0.90 0.30 0.33]
   [0.05 0.50 0.33]
   [0.05 0.20 0.33]
23 The computational power limit is important since the computational cost increases with the square of the number of states (since each
state is interconnected). That is, with 3 states there are 9 possible transitions between states which must be solved.
Doubling the number of states to 6 causes a quadrupling of the number of possible transitions to 36 and at 8 states
there are 64 possible transitions, almost double that of the 6 state case.
The selection of initial transition probabilities was done almost completely arbitrarily due to a
lack of relevant precedent information. However, to remain consistent with the assumption made
for the starting state probabilities - that a patient was likely to remain in the non-active state the
majority of the time - we did tweak the initial transition probabilities for the corresponding state
(dictated by the starting state probability matrix) to heavily favor remaining in that state. The
remainder of the transition probabilities were selected completely arbitrarily with the only
restrictions being that the sum of each state’s transition probabilities should of course be equal to 1
and that no transition probability should be 0.
4. Emission Probabilities: normally distributed with means ± variances (in steps/minute) of
[1 40 100] ± [10 80 1000]
The emission probabilities were based on the range of values graphically observed from the per-minute step-count distribution (shown in Figure 5-3). The specific choices for mean and variance were arbitrarily selected, although in such a way that they very loosely separated the distributions into three equidistant parts.
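A sketch of how these hand-picked starting values could be passed to depmixS4 is shown below; the exact ordering expected by the instart/trstart/respstart arguments should be verified against the package documentation, and 'train_df'/'train_lengths' are placeholders for the training data described in Section 5.2.1.

```r
library(depmixS4)

instart   <- c(0.95, 0.00, 0.05)            # starting state probabilities
trstart   <- c(0.90, 0.05, 0.05,            # outgoing transition probabilities,
               0.30, 0.50, 0.20,            # one "from" state per line
               0.33, 0.33, 0.33)            # (each line sums to 1)
respstart <- c(1,   sqrt(10),               # per-state Gaussian mean and sd
               40,  sqrt(80),               # (the text specifies variances; the
               100, sqrt(1000))             #  gaussian response uses sd)

mod <- depmix(steps ~ 1, data = train_df, nstates = 3, family = gaussian(),
              ntimes = train_lengths,
              instart = instart, trstart = trstart, respstart = respstart)
fitted_mod <- fit(mod)
```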
5.2.3 Model Validation
Since the classifier did not perform well even when tested with the training data, which should provide
overly optimistic performance estimates, we did not feel it necessary to perform additional internal or
external validation of the HMMBC discussed in this chapter. The performance reported in the Results
and Discussion section that follows is therefore based on using identical training and testing sets (all n=50
patients) and should therefore be considered to be overly optimistic about the real-life performance of the
HMMBC on actual new data.
Figure 5-3: Distribution of per-minute step count for
patients with NYHA class II and NYHA III (* grouped)
Results and Discussion
As previously mentioned (in Section 5.2.2.1), we encountered significant difficulties during the HMM
training process. Specifically, the HMM training algorithm was unable to converge to a valid model when
supplied with the per-minute step count data. The resolution to this problem was ultimately to supply the
HMM training algorithm with progressively lower and lower resolution data. The algorithm was finally
able to converge when the data supplied had a temporal resolution of 6 hours.
5.3.1 Classification Performance
The performance of the HMM based classifier produced using the per 6-hour step count data is
presented in Figure 5-4. As can be seen from the confusion matrix,
only 19 of the total 35 NYHA class II patients and 10 of the total
15 NYHA class III patients were correctly classified by the
HMMBC yielding an overall raw (unbalanced) accuracy of 58%.
The balanced accuracy (not shown in Figure 5-4) - which corrects
for the unequal distribution of class II and class III patients - can
be calculated to be 60%. Unfortunately, the HMMBC accuracy is
lower than the no information rate (70%). This indicates that,
given the class distribution in the dataset - 70% of patients with
NYHA class II - the classifier actually performs no better than if
we had simply assigned every patient to the majority class (NYHA class II). The
poor agreement between the physician assigned NYHA class and
classifier assigned NYHA class is also reflected in the low value of
the Cohen’s Kappa coefficient24 (𝜅=0.18).
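The reported figures can be recomputed from the Figure 5-4 confusion matrix, for example with the caret package; the sketch below assumes class II is treated as the ‘positive’ class, which reproduces the listed sensitivity and predictive values.

```r
library(caret)

# Rows: classifier ("AI") prediction; columns: physician-assigned class.
cm <- matrix(c(19, 16,    # physician class II classified as II / III
                5, 10),   # physician class III classified as II / III
             nrow = 2,
             dimnames = list(Prediction = c("II", "III"),
                             Physician  = c("II", "III")))

confusionMatrix(as.table(cm), positive = "II")
# Accuracy 0.58, NIR 0.70, Kappa 0.18, Sensitivity 0.54,
# Specificity 0.67, Balanced Accuracy 0.60
```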
5.3.2 Training Challenges
That the HMMBC performance is sub-par does not necessarily come as a surprise. The amount of
training data, for one, is possibly simply insufficient to adequately train the HMMBC: 35 examples of
NYHA class II patients and 15 examples of NYHA class III patients is not a lot of training data. This potential
24 The Cohen’s Kappa coefficient quantifies agreement between independent raters, correcting for the degree of
agreement that would be expected if the raters were simply guessing by chance [28]. Since Cohen’s Kappa is a
standardized statistic it is particularly useful for comparing performance between algorithms (and studies) [28].
                Physician
                II      III
AI      II      19        5
        III     16       10

No Information Rate (NIR): 0.70
Unbalanced Accuracy (Acc): 0.58
Cohen’s Kappa: 0.18
Sensitivity: 0.5429
Specificity: 0.6667
Positive Predictive Value: 0.7917
Negative Predictive Value: 0.3846

Figure 5-4: Overview of HMM based classifier performance
problem is easily resolved by simply collecting more data – something which is currently still in progress
as a result of the activity tracker update made to Medly as part of this research.
Another likely explanation for the low performance is that the 6-hour resolution step data is significantly less nuanced than the per-minute resolution data. Measured by number of data points alone, the 6-hour resolution step data contains 360 times (more than two orders of magnitude) fewer data points than the per-minute resolution data. It is likely that this lower resolution data yielded coarser and less nuanced models (due to the reduced data stream size) that did not necessarily take full advantage of the modelling capabilities of
HMMs. These coarse models may not have been sufficiently differentiated to really allow for accurate
discrimination between different NYHA classes. In a similar vein, it is possible that binning the per-
minute data over 6 hours resulted in the washing out of many of the important nuances in the data that
might in fact be the key to discriminating between patients in the different NYHA classes.
Compare for example Figure 5-5 and Figure 5-6, respectively the per-6-hour and per-minute step count data for the same patient. Observe, in Figure 5-5 at the 6-hour resolution, that for days 12 and 13 the step count pattern is visually similar, with only a small variation in the overall step count. One might be led to conclude from these similarities that the patient perhaps had a slightly more intense workout session, or perhaps a little longer walk, near the middle of day 12 compared to day 13, but that the underlying activity pattern remained essentially the same. Visualization of the underlying data in Figure 5-6 quickly dispels this notion. The activity near the middle of the day on day 12 is best characterized as isolated but extended high-intensity physical activity, in contrast to day 13, where the activity is better characterized as punctuated, frequent, low-duration, low-intensity activity. The former might be proposed to be characteristic of NYHA class II activity, with the latter being more characteristic of a patient experiencing NYHA class III symptoms, but where one might be able to assess this difference based on the per-minute data, it is clearly harder to distinguish between the two activity patterns on the basis of the 6-hour aggregate data alone.
Figure 5-5: Example patient step count data (per 6 hour resolution)
Figure 5-6: Example patient step count data (per minute resolution)
In any case, it is clear that unlocking the potential in the per-minute resolution data is highly preferable
to being stuck with using low resolution data.
Analysis of Potential Root Cause
This brings us back to the question of why we were unable to get the HMM training algorithm to work with the per-minute resolution data in the first place. As mentioned, although we tried various initialization parameters, ultimately the solution was to aggregate the data. We hypothesize that the root cause may simply be the fact that most of the per-minute step count values in any given day are simply zero25, and furthermore, that these 0 values, although sometimes briefly interspersed between long periods of activity, more often exist as long uninterrupted sequences. These sequences occur not only in the mornings and evenings - such as when a person is sleeping - but also at random intervals during the middle of the day - for example, when a person might simply be inactive - see, for example, days 3, 5, 8, 11, 12, and 13 in Figure 5-6.
Recall that HMMs are stochastic models; in other words, the underlying models they use to represent a process are constrained by the rules of probability. There is, therefore, some expectation of inherent variance in the training data, which the training algorithm must capitalize on to start formulating a model of the underlying process. The presence of low (or no) variance sequences may therefore present a real problem to training.
For example, take a very long uninterrupted sequence of identical values, like a string of 0's. Depending on the length of the sequence and the expected nature of the distribution, it may in fact be considered statistically impossible. The probability of a given sequence being produced by some Markov model can be calculated using the forward algorithm, which relies on the chain rule: namely, the probability of a chain of events $E_n$ to $E_1$ can be calculated as the probability of event $E_n$ occurring, given that the sequence $E_{n-1}$ to $E_1$ has occurred, multiplied by the probability of the sequence $E_{n-1}$ to $E_1$ having occurred:

$$P(E_n, \ldots, E_1) = P(E_n \mid E_{n-1}, \ldots, E_1) \cdot P(E_{n-1}, \ldots, E_1) \qquad (2)$$
The probability of the sequence $E_{n-1}$ to $E_1$ having occurred can be recursively calculated using the same formula, continuously chaining (thus lending the rule its name) the conditional probability of the new event in question, $E_{n-1}$, on all the prior events in the sequence. In the case of a produced sequence $S_{\text{repeat}}$ of length $n$, composed of the same repeated event, which is known to occur with some probability $p$, Equation 2 simplifies to the following:

$$P(S_{\text{repeat}}) = p^n \qquad (3)$$

25 Recall that for our dataset, more than 75% of the per-minute step count values for any given patient are 0 (as measured over their whole two-week monitoring period).
An oft-quoted value for the threshold of statistical impossibility is $10^{-50}$, but the exact cut-off is rather arbitrary [229]. Since our objective is not to provide a rigorous proof of our hypothesis but rather to suggest a theory to future researchers interested in tackling this problem, $10^{-50}$ is a reasonable choice of threshold. The choice of probability $p$ is, by extension, also somewhat arbitrary. Suppose, for simplicity's sake, that since step count ranges from approximately 0 to approximately 125 in our patients, the probability of a 0 step count value lies around $1/125 \approx 1/100 = 10^{-2}$. Since a more conservative reader might prefer we use the actual probability of 0 step counts in our sample - approximately 75% of the dataset - and say that $p$ should be closer to $0.75 = 3/4 = 10^{\log(3/4)/\log(10)} \approx 10^{-0.12}$, we also perform the calculation with this value for comparison. The overall conclusion remains the same.
Assuming a rest period of approximately 8 hours (which occurs fairly consistently, once a day), the corresponding sequence length of $n = 480$ minutes has an associated probability of:

$$P(S_{\text{repeat, 8 hours}}) = 10^{-2n} = 10^{-2 \cdot 480} = 10^{-960}$$

or, conservatively:

$$P(S_{\text{repeat, 8 hours}})\big|_{\text{conservative}} = 10^{-0.12n} = 10^{-57.6}$$

Whether by the more conservative estimate or not, these probabilities are well beyond (i.e. far smaller than) the statistical impossibility threshold of $10^{-50}$.
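For readers who want to check this arithmetic, the calculation is easy to reproduce on the log10 scale (the raw probabilities are far too small to represent as ordinary floating-point numbers). The snippet below is purely illustrative and is not code from the thesis.

```r
# log10 of Equation 3: the probability of a run of n identical per-minute values,
# each occurring with probability p, is p^n, so log10(P) = n * log10(p).
seq_prob_log10 <- function(p, n_minutes) n_minutes * log10(p)

seq_prob_log10(p = 1e-2,     n_minutes = 480)  # -960:  the 8-hour estimate above
seq_prob_log10(p = 10^-0.12, n_minutes = 480)  # -57.6: the conservative 8-hour estimate
```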
All this is not to say that these sequences are impossible - they quite clearly are not - however, from the perspective of the Markov model, and from the hidden Markov model attempting to guess at the underlying hidden model, such sequences are considered very unlikely26, and therefore not likely to be interpreted as regular parts of the sequence, although they actually are. Even a one-hour period is found to be highly, although relatively less, unlikely: $P(S_{\text{repeat, 1 hour}}) = 10^{-120}$ and $P(S_{\text{repeat, 1 hour}})\big|_{\text{conservative}} = 10^{-7.2}$.

26 For a 6-hour sequence, $P(S_{\text{repeat, 6 hours}}) = 10^{-720}$ and $P(S_{\text{repeat, 6 hours}})\big|_{\text{conservative}} = 10^{-43.2}$. For a 4-hour sequence, $P(S_{\text{repeat, 4 hours}})\big|_{\text{conservative}} = 10^{-28.8}$. Although these conservative values do not fall below the $10^{-50}$ threshold, they are still tremendously small, and the sequences thus less, but still very, unlikely.
Of course, any given long predetermined sequence of variables being produced by a Markov model will
have a low associated probability. So why do we feel that we can make this special claim about a string of
0’s, or a string of identical values generally? Because of the key fact that the values in the series are
identical.
Take an arbitrary sequence of two or more alternating values of the same length $n$ as the sequence $S_{\text{repeat}}$ above. It would have the same probability as calculated above; however, such a sequence is unlikely to represent a problem to an HMM. Why? Because the different values are easily associated with different underlying states. With a single unchanging value, however, it becomes impossible to determine which value belongs to a particular state: is a single state producing the sequence and we have yet to transition to another state (what our probability calculations above actually represent), or are all states producing this same value - in which case what makes them different states, except perhaps their transition probabilities? But how does one determine the transition probabilities of the underlying states if the emitted symbols observed from the states are identical? We believe that, ultimately, the intractability of these questions may explain why the HMM training algorithm has difficulty converging to a valid model, and why decreasing the resolution - which reduces the length of identical-value sequences, but also generally increases the variance in the sequences, making possible states more differentiable - resolves the training problem.
Proposed Solution 1: Dithering
It would actually be very easy to test this hypothesis by using a signal processing technique known as dithering. Dithering is the act of introducing dither, that is, very low-amplitude random noise intentionally introduced into a system to improve its performance [230]. It was famously found to have the curious effect of improving navigation and ordnance trajectory calculations performed on aircraft-based mechanical computers during the Second World War, as a result of the aircraft-induced vibrations, which smoothed out the operation of the moving mechanical parts [231]. Since then, it has been successfully used to improve performance in applications as diverse as analog-to-digital conversion in microelectronics [232] and trading on stock exchanges (where it is used to reduce high-frequency trading, an oft-maligned trading practice) [233]. More commonly though, it is used to increase the visual quality of low resolution images [234,235] - an excellent example of which has been reproduced from Wikipedia [236] in Figure 5-7. Compare in particular sub-figures: 1, the raw image; 2, a lower resolution version of the same image; and 3, the low resolution image dithered using a classic image dithering algorithm [234]. Note in particular that image 3, despite having the same resolution as image 2, approaches the visual fidelity of image 1. We propose that, in an analogous way, careful application of dithering to the step count signal might counterintuitively improve our ability to train an HMMBC with high resolution data. A small amount of noise would at least eliminate the impossibly long uniform sequences in the data, and provide the necessary variance required for the HMM training algorithm to perform as intended, while simultaneously not meaningfully degrading the overall quality of the step count data stream.

Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236].
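This hypothesis would be straightforward to test in code. The sketch below is only an illustration of the idea (not code from this thesis): the dither function, the uniform noise amplitude of 0.5 steps/minute, and the synthetic day of data are all arbitrary assumptions chosen to show how dithering removes long runs of identical values without materially altering the signal.

```r
set.seed(42)

# Additive uniform dither on [0, amplitude): keeps step counts non-negative while
# breaking up the long runs of identical (zero) values discussed above.
dither <- function(step_counts, amplitude = 0.5) {
  step_counts + runif(length(step_counts), min = 0, max = amplitude)
}

# Hypothetical day of per-minute counts: long blocks of zeros surrounding one walk.
raw_day      <- c(rep(0, 400), rpois(40, lambda = 60), rep(0, 1000))
dithered_day <- dither(raw_day)

max(rle(raw_day)$lengths)        # longest run of identical values in the raw signal (1000 minutes)
max(rle(dithered_day)$lengths)   # after dithering, runs of identical values essentially vanish
```

A natural next step would be to feed the dithered series back into the HMM training routine and check whether it now converges at the per-minute resolution.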
Proposed Solution 2: Activity Segmentation
An alternative to dithering is to do away with the inactive sequences altogether, ignoring all the long periods of 0 per-minute step counts, and instead training a HMMBC to use activity segments as opposed to the complete raw daily signal. Unfortunately, this alternative, although conceptually simpler, is likely harder to put into practice and test than dithering. Dithering can be fairly easily tested by adding various different types and magnitudes of random noise to the high-resolution test signal and seeing if the HMM training algorithm can successfully converge. Training on activities, however, first requires determining what should constitute an activity segment, i.e. where it should begin, but also where it ends, including how many (if any) inactive minutes should be allowed in the middle of the activity (in case of missed readings, brief pauses, etc.). Additionally, it likely requires the development of some sort of automated or quasi-automated data segmentation algorithm, not only for the case where the HMMBC might be implemented in practice as part of, say, a remote patient monitoring system, but also to help consistently and
accurately segment the relatively large volume of data that would be required to train and improve such a classifier. Activity segmentation therefore likely involves first investigating in more detail the finer characteristics of the per-minute step count data stream generally. Although the task of activity classification, at least for healthy subjects, is already a very active area of research, the data used is typically raw accelerometry data as opposed to per-minute step count data.
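As a starting point for such a segmentation algorithm, the sketch below shows one very simple, rule-based approach (not code from this thesis): a minute is 'active' if it contains any steps, short inactive gaps inside a bout are closed, and only bouts longer than a minimum duration are kept. The thresholds (min_steps, max_gap, min_duration) and the synthetic data are illustrative assumptions only.

```r
# Per-minute step counts for a hypothetical stretch of a day: two walking bouts
# separated by a long inactive block, with a 2-minute pause inside the second bout.
set.seed(7)
steps <- c(rep(0, 60), rpois(20, 40), rep(0, 180),
           rpois(10, 50), rep(0, 2), rpois(15, 45), rep(0, 120))

segment_activities <- function(steps, min_steps = 1, max_gap = 2, min_duration = 5) {
  active <- steps >= min_steps
  # Close inactive gaps of up to max_gap minutes so brief pauses stay inside a bout.
  runs <- rle(active)
  runs$values[!runs$values & runs$lengths <= max_gap] <- TRUE
  active <- inverse.rle(runs)
  # Report the start/end minute of each remaining active run of sufficient length.
  runs   <- rle(active)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  keep   <- runs$values & runs$lengths >= min_duration
  data.frame(start_minute = starts[keep], end_minute = ends[keep],
             duration_minutes = runs$lengths[keep])
}

segment_activities(steps)   # one row per detected activity bout
```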
Although more challenging, training on separate activity segments might provide significant additional
secondary benefits not attainable through simple dithering. For example, assessing patients using smaller periods of activity, as opposed to an entire day's or week's worth of data, might reduce assessment latency,
thereby improving response time for any application that depends on assessments provided through an
activity tracker. Alternatively, it might provide additional insight into the specific physical exercise
routines of patients which might enable the provision of timely and relevant feedback to patients
regarding this aspect of their HF self-management.
Both dithering and activity segmentation have their relative advantages and disadvantages as possible
solutions to resolving the training challenges encountered with the HMMBC when compared with simply
reducing the temporal resolution of the input data. Ultimately though, since both dithering and activity
segmentation each represent very different but complementary approaches to the training challenge, they
are likely both worth investigating in their own right.
Summary
To summarize, in this chapter we discussed a proposed method for building a hidden Markov model
based machine learning classifier and the results of implementing and testing said classifier. We chose to
use hidden Markov models, which are a tool for modeling a system as a stochastic process, because we
hypothesized that these might be able to fully embrace the complexity and nuance of the entire time
series data streams produced by the activity trackers worn by patients in free-living conditions. We
detailed the architecture of the model, which used two hidden Markov models, one each to model the
activity patterns of patients with NYHA class II and III symptoms. Instead of using the new 44 person
dataset collected from the activity tracker monitoring system detailed in Chapter 4, we opted to use the
same 50 person dataset investigated in Chapter 3, primarily because there was more data available for us
to use to train machine learning classifiers. Since the 50 person dataset does not also have heart rate data,
the only time series input provided to the hidden Markov model was patient step count data.
Unfortunately, we encountered difficulties in getting the hidden Markov model training algorithm to converge using the per-minute step count data, which we were ultimately able to resolve by converting the data to a coarser 6-hour temporal resolution. Regrettably, using lower resolution data contradicted our entire rationale for using hidden Markov models in the first place: attempting to use the entire unadulterated time series data stream. Furthermore, although the hidden Markov model based classifier we did train using the per-6-hour step count data was able to classify patients, the classifier did not perform any better than a naive classifier that simply assigns every patient to the majority class (58% unbalanced accuracy for the HMMBC vs. the 70% no-information rate). The Cohen's Kappa statistic (0.18) confirmed the poor agreement between the physician-assigned NYHA class and that assigned by the hidden Markov model based classifier. Of note, since the performance of our classifier was evaluated on the exact same data used to train said classifier, the performance reported above should also be interpreted as being highly optimistic compared to the real expected performance of the classifier on new data it hasn't seen before.
Although our initial attempts to use a hidden Markov model based classifier were met with some
significant setbacks, we don’t believe that this means that the approach does not have value, but rather,
that it might require more dedicated attention to get such an approach to work. We posited a possible
theory for why the training algorithm has difficulty creating hidden Markov models of the step count
data, namely that the presence of long low variance sequences of identical step count values makes it
impossible for the training algorithm to determine the transitions between states. In response we proposed
two possible approaches which might be investigated as part of future work: 1) dithering, that is,
intentionally applying low-amplitude random noise to the time series step count data, thereby artificially
introducing variance into the low variance sequences (which might allow the hidden Markov model
training algorithm to function properly while not meaningfully degrading the overall performance of the
system), and 2) doing away with the inactive sequences altogether and approaching the task of NYHA
class assessment from the perspective of individual periods of activity as opposed to attempting to classify
the whole free-living time series data in one fell swoop.
Ultimately, we opted to take a third approach for the purpose of this thesis and put the hidden Markov
model based classifier to the side and instead investigate the effectiveness of some other more classic
approaches to supervised classification, which we discuss in the next chapter.
Chapter 6 - Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning Models
As mentioned in the introduction of the previous chapter, we set out to attempt to objectively assess the NYHA functional classification of some example patients using modern machine learning (ML) algorithms. Having discussed our unsuccessful attempt to build a useful hidden Markov model based classifier, we decided to investigate some cross-sectional machine learning algorithms that are popular starting points for supervised classification problems: Generalized Linear Models (GLM) and a variant thereof, boosted GLMs; Random Forests (RF); and Artificial Neural Networks (NNet) and a variant thereof, Principal Component Analysis Neural Networks (PCA NNet).
In this chapter we first provide a brief refresher on the above ML techniques. The curious reader is
invited to consult T. Segaran’s book, “Programming Collective Intelligence: Building Smart Web 2.0
Applications” [111], for a more thorough introduction to these and other popular ML algorithms. We then
proceed to explain our methodology for training and testing the ML models investigated and finally, we
discuss the results of our investigation and detail some possible future directions to take this research.
Machine Learning Models
What follows is a very brief introduction to the cross-sectional machine learning models investigated in
this chapter, in order of relative algorithm complexity.
6.1.1 Generalized Linear Models
The generalized linear model, or GLM, is, unsurprisingly, a generalized version of classic linear regression [237,238].

Recall that the idea behind ordinary linear regression is that we can model some randomly distributed response variable $y$ as a linear combination of predictors $X = \{x_1, x_2, \ldots, x_n\}$, subject to some noise/error represented as the error term $\varepsilon$. If we define $B = \{\beta_0, \beta_1, \beta_2, \ldots, \beta_n\}$ as the regression parameters, with $\beta_0$ being the intercept term, we can express the relationship formally as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon \qquad (4)$$
This equation, which defines linear regression, can be decomposed into two parts: 1) a linear part and 2) a random error part. The linear part, $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$, tells us that there is some expected value for $y$ conditional on the value of $x$: $E(y|x)$. The error term then tells us that there is some random error or variance about this expected value; in classic linear regression this error is specifically assumed to be normally distributed with some constant variance, $\sigma^2$. If we call the expectation value $E(y|x)$ the mean as a function of $x$, $\mu(x)$, of the normal distribution for $y$, we could alternatively represent Equation 4 as:

$$y \sim N(\mu(x), \sigma^2) \qquad (5)$$
The generalized27 linear model asks us: what if the relationship between $y$ and $x$ were not normally distributed but were instead modelled by some other distribution? Specifically, what if we could use any distribution within the wider family of exponential distributions, of which the normal distribution is just one example (see Figure 6-1 for more examples)? To effect this change, thus generalizing the linear model, we need to modify the way we link together the expectation value, $E(y|x)$, produced by the linear predictors, and the mean value, $\mu(x)$, of our error distribution. That is, instead of defining the link between $E(y|x)$ and $\mu(x)$ as:

$$E(y|x) = \mu(x) \qquad (6)$$

we would first generalize the relationship, expressing $E(y|x)$ as a link function $g$ of $\mu(x)$:

$$E(y|x) = g(\mu(x)) \qquad (7)$$

The link function for the normal distribution is then simply the identity, $g(a) = a$. The link function must always be smooth, invertible, and linearizing, and is changed according to the desired noise distribution. A list of common link functions, and their inverses, can be found in most basic texts on GLMs, for example [237].

27 N.B. not the 'general' linear model, which is just a special case of the GLM, namely the one expressed in Equation 4.

Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the distribution belongs in the family only when certain parameters are fixed). Adapted from [290].
A model can then be fit using maximum likelihood estimation [237,238]. The end result of this entire process is that we gain a fairly simple yet powerful and versatile method of modelling a wide variety of processes. As a result, although oft forgotten, GLMs usually make a great first choice to try before moving on to more sophisticated ML models.
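To make the idea concrete, the toy sketch below fits a binomial GLM (i.e. logistic regression, the GLM with a logit link) to a small, entirely hypothetical data frame whose feature names and values are invented for illustration; it is not the model trained in this thesis.

```r
# Hypothetical 50-patient dataset: four made-up step-count/demographic features
# and a binary NYHA label, mimicking the 35/15 class split used in this work.
set.seed(1)
toy <- data.frame(
  mean_daily_steps = c(rnorm(35, 5000, 1500), rnorm(15, 3000, 1200)),
  max_pmsc         = c(rnorm(35, 90, 20),     rnorm(15, 60, 20)),
  sd_daily_steps   = c(rnorm(35, 1500, 400),  rnorm(15, 1000, 350)),
  age              = round(c(rnorm(35, 54, 14), rnorm(15, 56, 14))),
  nyha             = factor(c(rep("II", 35), rep("III", 15)))
)

# family = binomial(link = "logit"): the Bernoulli member of the exponential family,
# with the logit as the link function g() of Equation 7.
glm_fit <- glm(nyha ~ ., data = toy, family = binomial(link = "logit"))
summary(glm_fit)

# Predicted probability of NYHA class III (the second factor level) for new patients.
predict(glm_fit, newdata = toy[1:3, ], type = "response")
```

The later sketches in this chapter reuse this hypothetical `toy` data frame so that the examples stay short.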
6.1.2 Boosted Generalized Linear Models
Boosting, or rather gradient boosting, is an ensemble learning technique [239,240]. Instead of using one single strong predictive model, the idea behind gradient boosting is to use an ensemble of weakly performant models that build on each other, learning from the mistakes of previous models, to create a final model that is more accurate than any single (strong or weak) constituent model. Although boosting can produce more performant classifiers overall, it must be carefully managed to prevent overfitting the model, that is, training the model to be too good at predicting the training data at the expense of making the model generalizable to data it has never seen before. The algorithm used to do gradient boosting is fairly complex and well out of the scope of this thesis. The algorithm, however, supports a range of possible underlying ML models [240], and a boosted GLM is one that specifically uses generalized linear models as the underlying ML model.
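A hedged sketch of the boosted-GLM variant follows, reusing the hypothetical `toy` data frame from the GLM example above. It assumes caret's "glmboost" method (provided by the mboost package) is installed; it is illustrative only and not the thesis training pipeline.

```r
# Gradient-boosted GLM via caret; the number of boosting iterations is the main
# hyper-parameter, chosen here by 10-fold cross-validation to limit overfitting.
library(caret)
set.seed(1)
bglm_fit <- train(nyha ~ ., data = toy, method = "glmboost",
                  trControl = trainControl(method = "cv", number = 10))
bglm_fit$bestTune   # the boosting settings selected by cross-validation
```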
6.1.3 Random Forest
The second type of ensemble learning technique is known as bagging. Bagging forms a core part of
Random Forests (RF). The best place however to start discussing random forests is with decision trees.
A decision tree is simply a branching set of
rules, or boundary cut-points, that separate a
feature space into various partitions, each of which is associated with some sort of
classification or decision outcome [111]. A
very simple example is shown in Figure 6-2.
In this example, the decision tree is used to
classify the three different colors of data
points (green, orange, and purple) according
to two arbitrary features, A & B, associated
with the data points. Note that due to the
placement of the boundaries, some of the
dots are misclassified.
One of the simple approaches to training a
decision tree is to start from the top of the
tree (the root) and go down, selecting several
candidate boundary cut-point that divide the
dataset, and then computing how well the
data is split by each boundary [111]. For
example, one could use the Gini impurity (a measure of diversity in the dataset), or the measure of
information gain (reduction in entropy) that results from the split. One then selects the best candidate
boundary and repeats this process down each new branch. As with all ML algorithms, one must be wary
of over-fitting the learner. In the case of decision trees this is especially true as the complexity or even
just the number of the boundaries used increases. Even with just the use of linear boundaries a decision
tree can get very precise as the tree gets deeper and larger, with more branches and leaves to cut the
feature space into smaller and smaller more ultra-specific partitions. As a result, many decision tree
creation algorithms feature a way to stop growing the tree – usually by setting a hard limit on the depth -
or to prune the tree after growth - to remove unnecessary, unhelpful or weak branches – all to help avoid
overfitting.
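As a small illustration of these ideas, the sketch below grows a single depth-limited decision tree with the rpart package on the hypothetical `toy` data frame introduced earlier; the depth limit of 3 is an arbitrary illustrative choice, and this is not a tree used in the thesis.

```r
# A single classification tree with a hard depth limit, one of the over-fitting
# safeguards described above. Printing the fit shows the learned boundary cut-points.
library(rpart)
tree_fit <- rpart(nyha ~ ., data = toy, method = "class",
                  control = rpart.control(maxdepth = 3))
print(tree_fit)
```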
Decision trees are hugely useful since they are interpretable; in other words, a human can look at a decision tree and understand the decisions being made. This is why decision trees, albeit expert-trained ones, are often popular for use in expert systems where the decision process may need to be inspected - the Medly algorithm in fact uses an expert-trained decision tree for triaging patients [104].

Figure 6-2: Example of a decision tree (above) with corresponding feature space (below).
Despite all this, ML decision trees are still often highly sensitive to the input training data and have a
tendency to over-fit and not generalize well to new data. One solution to this problem is the use of the
ensemble learning technique of bagging (the counter-point to boosting). Bagging, in a similar fashion to
boosting, uses an ensemble of learners to improve learner performance, but whereas in boosting the
learners build on each other sequentially, in bagging one trains several independent learners – in this case,
multiple independently trained decision trees – and aggregates their responses. Each tree (learner) in the
forest (ensemble) produces its own separate prediction using the input predictor data and the resulting
ensemble of independent predictions are combined, for example using a majority voting scheme, to
produce the overall final prediction. The aptly named random forest is a variation on tree bagging
whereby a random subset of features is used to train individual trees in the forest, as opposed to the
entire feature space being provided to each tree. This reduces the likelihood of having highly correlated
trees, while retaining the random forest's beneficial properties, such as its ability to naturally perform feature selection: the most predictive features will tend to feature more prominently as part of the random forest, whereas less important features will tend to be more sparsely distributed and
therefore be less heavily weighted as part of the forest.
All in all, the effect of bagging together decision trees into a random forest creates a ML model that has additional useful emergent properties (e.g. natural feature selection), can better generalize to new data, and yet maintains many of the inherent advantages of the underlying decision trees. Because of their simplicity and ease of use, RFs are therefore often (along with GLMs) a good early candidate for ML tasks [111].
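A minimal sketch of a random forest on the same hypothetical `toy` data frame follows; the randomForest package used here is also what caret's method = "rf" wraps, and the number of trees is an illustrative choice rather than a thesis setting.

```r
# Bagged, feature-subsampled decision trees; the out-of-bag confusion matrix and
# the built-in importance scores illustrate the emergent feature-selection property.
library(randomForest)
set.seed(1)
rf_fit <- randomForest(nyha ~ ., data = toy, ntree = 500, importance = TRUE)
rf_fit$confusion     # out-of-bag confusion matrix
importance(rf_fit)   # per-feature importance measures
```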
6.1.4 Artificial Neural Networks
In contrast, Neural Networks (NNet), or as they are more formally termed, artificial neural networks, are far on the other end of the complexity spectrum - they are the bazooka to the RF and GLM pea-shooters. The use of NNets for the sole purpose of assessing NYHA class is therefore likely overkill since, as previously discussed, less complex models are likely to actually perform better, owing to their simplicity, in the face of limited data. However, in the context of assessing NYHA class as part of a remote monitoring system, NNets have an interesting property that makes them particularly worth investigating. NNets
support what is known as online learning, which means that the trained model can be progressively and continuously updated and improved as more and more data becomes available, without needing to retrain the entire model from scratch. This is a particularly useful property within the context of a remote patient monitoring system, where new data becomes available each and every day. While the specific NNet investigated as part of this work may not necessarily be immediately transferable to the task of daily assessment of NYHA class, an initial foray into training NNets with activity monitoring data is likely to provide useful insights for future work.
The fundamental building block of the NNet is the perceptron. It is
a digital neuron and operates in an analogous fashion: summing its
input signals which it then converts to an output signal using some
predefined thresholding function. An example is shown in Figure 6-3.
A NNet is built by creating a weighted directed network of
perceptrons, as shown in Figure 6-4 (for clarity, the inter-perceptron
weights are not shown). The network is arranged in a
layered fashion and these layers are logically divided into
three types depending on their function. At the front of
the NNet is the input layer, which connects each input feature to a perceptron. The input layer of the NNet
shown in Figure 6-4 as an example, would be suitable for
use with 4 input predictors or features. The input layer
acts as the interface, connecting the input features to the
first layer of perceptrons in the next set of layers in the
network: the hidden layers.
The hidden layers of the NNet are the innermost layers
and form the bulk of the network. They are where the NNet learns the various complex
relationships and patterns in the data. Unfortunately, the nature of this method of learning is that NNets
typically remain black-boxes and it is never quite clear how or what relationships the NNet has learned
from the data. The number of hidden layers and the number of nodes in each layer can be altered to
make a deeper and wider NNet capable of learning more complicated relationships. While NNets can
theoretically be made arbitrarily large, training large NNets is computationally expensive and therefore
limited by the computational power available. Although NNets have existed since the 1950s, it is only due
to modern advances in computing that training large multi-layered NNets, known as deep neural
networks [241], has recently become feasible [111,124,242,243]. The success of deep neural nets at tackling
complex problems is generally credited as a cause for the recent popular resurgence in AI research [242].
Figure 6-3: A perceptron
Figure 6-4: A neural network
Once the hidden layers, regardless of depth, have processed the input data, the data is picked up by the
output layer.
The purpose of the output layer is simply to extract data from the hidden network and convert it to a final output prediction. The output layer of the NNet in Figure 6-4, for example, produces 3 output predictions.
Training the hidden network is commonly performed using what is known as the backpropagation algorithm, which is also what enables online learning for the NNet. Essentially, new data is provided, one example at a time, to the input layer of the NNet. Examining the output values produced at the output layer of the network, one can then determine how far off the current NNet prediction is, and then work backwards through the network, making minor adjustments to the weights of the links to slowly push the output in the correct direction. The degree of tweaking, the learning rate, is carefully controlled to make sure that the NNet is neither underfit (insufficiently trained) nor overfit (i.e. that the NNet does not overshoot or overgeneralize from individual data points). The finer details of the backpropagation algorithm are outside the scope of this thesis, but the interested reader is invited to refer to either [111] or [241] for further reading.
Overall, NNets are complex but very powerful ML algorithms that have been successfully used to learn the relationships present in challenging, highly non-linear data and systems. They also support continuous incremental learning, which may be particularly useful in the context of remote patient monitoring.
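For a concrete, if toy-scale, example, the sketch below trains a single-hidden-layer network with the nnet package (the same implementation caret's method = "nnet" wraps) on the hypothetical `toy` data frame from earlier; the 5 hidden units and the weight-decay value are arbitrary illustrative choices.

```r
# A small feed-forward network: 4 inputs -> 5 hidden perceptrons -> 1 output unit.
library(nnet)
set.seed(1)
nn_fit <- nnet(nyha ~ ., data = toy, size = 5, decay = 0.1, maxit = 200, trace = FALSE)

# Predicted NYHA class for the first few hypothetical patients.
predict(nn_fit, toy[1:5, ], type = "class")
```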
6.1.5 Principal Component Analysis Artificial Neural Networks
Aside from computational cost, NNets also have the drawback of typically requiring a lot of data to train well. The latter point is particularly challenging given our small dataset. One of the ways to make more effective use of a data-limited but feature-rich dataset is to perform dimensionality reduction on the feature set prior to presenting it to a ML algorithm [244–247]. Dimensionality reduction is related to the concept of feature selection and extraction. In both cases, a large set of features is reduced to a smaller, more concise set of principal features that encodes, as much and as accurately as possible, all the information originally contained in the large feature set [247].
Principal Component Analysis (PCA) is a standard and hugely popular technique for performing dimensionality reduction [248]. In PCA, the larger $m$-dimensional feature set is projected onto the best $n$-dimensional orthogonal subspace of $m$ (where $n < m$) in such a way that the greatest variance in the projected data comes to lie on the lowest (first) order coordinate axis (of the $n$-dimensional subspace), with successively lower variance data being reserved for successively higher order coordinate axes28 [244,248]. In this way PCA trims out the features (dimensions) that provide the least new information, either because the information is already accounted for as part of another correlated feature, or because the feature has low variance and therefore provides little additional information to consider. The interested reader can find a more complete mathematical treatment of the algorithm in [248].
In theory, by applying PCA to our set of features before passing it to a NNet, the resulting PCA NNet should perform better, since the algorithm should be able to focus on learning the high-information patterns common to the limited dataset, while being less distracted by low-value features. Furthermore, the PCA NNet should be trainable at a reduced overall computational cost, since the reduced number of features will likely require a lower complexity NNet to model.
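The idea chains together naturally in caret, which can insert PCA as a pre-processing step ahead of the network. The sketch below reuses the hypothetical `toy` data frame; the 95% variance-retention threshold is an illustrative assumption, not necessarily the setting used in this work.

```r
# PCA is applied (after centering and scaling) inside each resampling loop, and the
# retained principal components are then fed to the neural network.
library(caret)
set.seed(1)
pca_nnet <- train(nyha ~ ., data = toy, method = "nnet",
                  preProcess = c("center", "scale", "pca"),
                  trControl = trainControl(method = "cv", number = 10,
                                           preProcOptions = list(thresh = 0.95)),
                  trace = FALSE)
pca_nnet$preProcess   # reports how many principal components were kept
```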
Methods
We chose to use the R programming language [151] in combination with RStudio [152], the open-source
integrated development environment for R, and various supporting R packages for the research work
documented in this chapter [153–158,217,249,250]. For the specific tasks of building, training, and
validating the ML models we used the caret (Classification And Regression Training) package for R
[251,252]. We also used the caret package for data pre-processing including normalization and imputation,
although we used the leaps package [253] for feature selection.
To simplify comparison between the sometimes-disparate models discussed in this chapter (as well as the hidden Markov model based classifier discussed in the previous chapter), we kept the methodology as consistent as possible between the different machine learning approaches. We also aligned our methodology as much as possible with current best practice for the creation and validation of supervised classification ML models.
6.2.1 Training Data
Dataset
We used the same data to develop and validate the cross-sectional algorithms that we used for the
hidden Markov model based classifier investigated in Chapter 5. This data, again, is the same data used
28 i.e. the second greatest variance lies on the second-order axis, the third greatest variance on the third-order axis, and so on, up to the least variant data, which resides on the final $n$-th order coordinate axis.
for the replication study discussed in Chapter 3. Recall that the dataset was selected primarily since it had the largest sample size of the available datasets, but also because it contained cardiopulmonary exercise testing data, permitting us to establish a helpful baseline performance (based on the gold-standard CPET) against which to evaluate the impact of step count data on our algorithm performance.
Population
Recall that the Chapter 3/Chapter 5 dataset included 50 patients, predominantly male (86 vs. 89 [%]),
aged 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 [kg/m2]), whose
demographics are fully detailed in Table 5 (page 38), Table 6 (page 38) and Table 7 (page 39). These
patients come from a closed (prospective) cohort of adult outpatients at a tertiary care clinic specializing
in the management of heart failure at a major hospital in Toronto, Canada. The exact inclusion and
exclusion criteria are detailed in Table 3 (page 37) and Table 4 (page 37) respectively.
Label Assignment
Again, recall that the patients in the dataset were originally classified at onboarding by their physician
as either NYHA functional class II (n=26) or III (n=11) - according to the criteria outlined in Section
2.2.1.1 - or as some intermediate/mixed class I/II (n=9) or II/III (n=4), as outlined in Section 5.2.1.3.
However, for the purposes of the ML classification task being investigated, patients assigned the
intermediate/mixed classes I/II were relabelled as NYHA class II patients, and patients assigned as class
II/III were relabeled as NYHA class III. This final dataset was therefore composed of only patients
labelled as NYHA class II (n=35=26+9) and NYHA class III (n=15=11+4).
6.2.2 Model Design
Predictors
In order to predict the outcome label, each of the machine learning models was fed with a series of
predictors (or features) built from available data in the dataset. Recall that the dataset consisted of the
following data:
1. Minute-by-minute step count data – recorded using a commercially available activity-tracker
(Fitbit Flex) continuously throughout the day. From which we extracted the same metrics
calculated and explored in Chapter 3, as listed in Table 18 below:
Table 18: Minute-by-minute step count features
Maximum
1 Maximum 2 Week PMSCa [steps/minute]
2 Maximum of Maximum DPMSCb [steps/minute]
3 Mean of Maximum DPMSCb [steps/minute]
4 Standard Deviation of Maximum DPMSCb [steps/minute]
5 Standard Error of Maximum DPMSCb [steps/minute]
6 Minimum of Maximum DPMSCb [steps/minute]
75th Percentile
7 Maximum of 75th Percentile of DPMSCb [steps/minute]
8 Mean of 75th Percentile of DPMSCb [steps/minute]
9 Standard Deviation of 75th Percentile of DPMSCb [steps/minute]
10 Standard Error of 75th Percentile of DPMSCb [steps/minute]
Mean
11 Mean 2 Week PMSCa [steps/minute]
12 Maximum of Mean DPMSCb [steps/minute]
13 Mean of Mean DPMSCb [steps/minute]
14 Standard Deviation of Mean DPMSCb [steps/minute]
15 Standard Error of Mean DPMSCb [steps/minute]
16 Minimum of Mean DPMSCb [steps/minute]
Standard Deviation
17 Standard Deviation of 2 Week PMSCa [steps/minute]
18 Maximum of DPMSCb Standard Deviation [steps/minute]
19 Mean of DPMSCb Standard Deviation [steps/minute]
20 Minimum of DPMSCb Standard Deviation [steps/minute]
Standard Error
21 Standard Error of 2 Week PMSCa [steps/minute]
22 Maximum of DPMSCb Standard Error [steps/minute]
23 Mean of DPMSCb Standard Error [steps/minute]
24 Minimum of DPMSCb Standard Error [steps/minute]
Total
25 Total 2 Week SCc [steps]
26 Maximum of Total DPMSCb [steps]
27 Mean of Total DPMSCb [steps]
28 Standard Deviation of Total DPMSCb [steps]
29 Standard Error of Total DPMSCb [steps]
30 Minimum of Total DPMSCb [steps]
IQR (Interquartile Range)
31 Maximum of DPMSCb IQRd [steps/minute]
32 Mean of DPMSCb IQRd [steps/minute]
33 Standard Deviation of DPMSCb IQRd [steps/minute]
34 Standard Error of DPMSCb IQRd [steps/minute]
Skewness
35 2 Week PMSCa Skewness
36 Maximum of Daily SCc Skewness
37 Mean of Daily SCc Skewness
38 Standard Deviation of Daily SCc Skewness
39 Standard Error of Daily SCc Skewness
40 Minimum of Daily SCc Skewness
Kurtosis
41 2 Week PMSCa Kurtosis
42 Maximum of Daily SCc Kurtosis
43 Mean of Daily SCc Kurtosis
44 Standard Deviation of Daily SCc Kurtosis
45 Standard Error of Daily SCc Kurtosis
46 Minimum of Daily SCc Kurtosis
a PMSC: Per-Minute Step Count; b DPMSC: Daily Per-Minute Step Count; c SC: Step Count; d IQR: Interquartile Range
2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine
care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients
who received it prior to recruitment29). From this data we extracted the following features:
Table 19: Cardiopulmonary exercise testing data features

CPET Feature - Brief Description of Feature
1  CPET Duration [frac. min.] - duration of CPET in fractional minutes
2  CPET Max Watts [W] - max resistance achieved at end of CPET
3  % Predicted CPET Watts [%] - percentage of expected CPET Max Watts for patient
4  SBP, Resting [mmHg] - resting Systolic Blood Pressure before CPET
5  DBP, Resting [mmHg] - resting Diastolic Blood Pressure before CPET
6  HR, Resting [bpm] - resting Heart Rate before CPET
7  O2 Sat., Resting [%] - resting oxygen saturation before CPET
8  FEV, Resting [L] - resting Forced Expiratory Volume before CPET
9  % Predicted Resting FEV [%] - percentage of expected Forced Expiratory Volume achieved by patient during CPET
10 FVC, Resting - resting Forced Vital Capacity before CPET
11 % Predicted Resting FVC [%] - percentage of expected Forced Vital Capacity achieved by patient during CPET
12 SBP [mmHg] - Systolic Blood Pressure at end of CPET
13 DBP [mmHg] - Diastolic Blood Pressure at end of CPET
14 HR [bpm] - maximum Heart Rate at end of CPET
15 HR 1 min. Post Test [bpm] - Heart Rate 1 minute after end of CPET
16 HR Drop in 1 min. [bpm] - Heart Rate drop (recovery) 1 minute after end of CPET
17 O2 Saturation [%] - oxygen saturation at end of CPET
18 VO2 Peak (rel.) [ml/kg/min] - peak oxygen consumption during CPET relative to patient body weight
19 Predicted VO2 Peak (rel.) [ml/kg/min] - expected peak oxygen consumption for patient (relative to body weight) during CPET
20 % Predicted VO2 Peak (rel.) [%] - percentage of predicted peak oxygen consumption for patient (relative to body weight) achieved during CPET
21 VO2 Peak [L/min] - peak oxygen consumption during CPET (not corrected for patient body weight)
22 Predicted VO2 Peak [L/min] - expected peak oxygen consumption for patient during CPET
23 % Predicted VO2 Peak [%] - percentage of predicted peak oxygen consumption for patient achieved during CPET
24 Anaerobic Threshold [ml/kg/min] - patient's anaerobic threshold
25 AT as % Measured VO2 Peak [%] - Anaerobic Threshold as a percentage of the measured peak oxygen consumption of the patient (relative to their body weight)
26 AT as % Predicted VO2 Peak [%] - Anaerobic Threshold as a percentage of the predicted peak oxygen consumption of the patient
27 VE Peak [L] - peak minute VEntilation during CPET
28 VCO2 Peak [L] - peak CO2 expiration during CPET
29 VE/VCO2 Slope @ AT - slope of minute VEntilation to CO2 output at Anaerobic Threshold during CPET
30 VE/VCO2 Slope @ Peak - slope of minute VEntilation to CO2 output at CPET peak
31 RER Peak - peak Respiratory Exchange Ratio during CPET

29 Specifically, 1, 15, 20 and 22 days prior to recruitment.
3. Patient demographic/meta data – recorded as part of onboarding, specifically:
Table 20: Patient demographic data features
Feature
1 Sex [Male or Female]
2 Age [years]
3 Height [cm]
4 Weight [kg]
5 BMI (Body Mass Index) [kg/m2]
6 Handedness [left or right]
7 Wristband preference [left or right]
We tested three different variants of models using three different combinations of the above features:
a) The ‘CPET feature group’, to establish a baseline performance using only data available from
CPET tests. This feature set consisted of all the CPET features and the patient demographic
features, for a total of 38 features.
b) The ‘CPET + Step Data Metrics feature group’, to establish the additional benefit derived from
adding the basic step data features. This feature set consisted of all the CPET features, all the
step data features and the patient demographic features, for a total of 84 features.
c) The ‘Step Data Metrics only feature group’, to investigate the effectiveness of using only data
derived from an activity tracker. This feature set consisted of all the step data features and the
patient demographic features, for a total of 53 features.
Normalization
We normalized the input predictors as the first step in the training process for our cross-sectional ML classifiers: 1) to improve training speed, and 2) to ensure that each of the predictors was similarly weighted for consideration by the learning algorithm. Specifically, we shifted each predictor to be centered about its mean value and scaled the predictors by their corresponding standard deviations using the preProcess function in the caret R package.
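For illustration, this centering and scaling step looks roughly as follows with caret's preProcess function, here applied to the numeric columns of the hypothetical `toy` data frame used in the earlier sketches (the thesis applied it to the real feature sets described above).

```r
# Centre each predictor on its mean and scale it by its standard deviation, so that
# all predictors enter the learning algorithms on a comparable scale.
library(caret)
numeric_cols <- sapply(toy, is.numeric)
pp           <- preProcess(toy[, numeric_cols], method = c("center", "scale"))
toy_scaled   <- predict(pp, toy[, numeric_cols])

round(colMeans(toy_scaled), 10)   # each predictor now has mean ~0 ...
apply(toy_scaled, 2, sd)          # ... and standard deviation 1
```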
Treatment of Missing Data
Some of the CPET data was missing from the records of some patients. Since the algorithms used do not handle missing data by themselves, we removed patients with missing data from the supplied training data, only including the complete cases (those without missing data). However, because the aforementioned caret package's preProcess function also has the ability to perform data imputation, we also trained a variant of each model where the missing training data was imputed, to salvage as many of the otherwise incomplete cases in the dataset as possible. The preProcess function used a k-Nearest Neighbour algorithm (k was set to 5), which chooses an imputation value based on the k nearest neighbouring non-missing data points, as measured by their Euclidean (straight-line) distance from the missing data point [254].
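The imputation variant can be sketched in the same way; the artificially injected missing values below stand in for the genuinely missing CPET records, and k = 5 matches the setting stated above. Note that caret's knnImpute implementation also centres and scales the data as part of computing the neighbour distances.

```r
# Inject a couple of hypothetical missing values, then impute them with 5-nearest-
# neighbour imputation via preProcess.
toy_missing <- toy[, sapply(toy, is.numeric)]
toy_missing$max_pmsc[c(3, 17)] <- NA

pp_impute   <- preProcess(toy_missing, method = "knnImpute", k = 5)
toy_imputed <- predict(pp_impute, toy_missing)   # complete (and scaled) cases
```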
Feature Selection
Since we had such a large list of input predictors for each model (up to 84) we compared the impact of
performing feature selection on the input list of predictors that were being provided to the model training
function. The purpose of automated feature selection is to try to prevent the model from overfitting to the
data, thereby improving the ability of the classifier to generalize to new data. Traditional machine
learning heuristics dictate that, given our sample size of 50, the number of features used to train our algorithms should be somewhere around 5-10, but possibly up to 49 features, to prevent overfitting30. In view of this, we used an R package called leaps [253], which uses linear regression, to identify and separate out the single best combination of up to 10 features. We evaluated the best feature combination using the Bayes information criterion, usually abbreviated BIC [255], which is very similar to the more commonly used Akaike information criterion, usually abbreviated AIC. In both cases, models with lower values are preferred; however, the Bayes information criterion penalizes complex, feature-rich models more heavily and should therefore favor models that use fewer features. Based on the previously mentioned heuristics, models with fewer features are likely to be more appropriate given the limited size of our dataset.
Feature selection was done as a last step before generating the ML classifier models. Note also that the
feature selection was performed using only the data being made available for training the model and did
not include any of the validation data which would skew our estimation of the overall final classifier
performance.
All this said, in a similar fashion to the normalization and missing data treatment process, we also created
variant models where the pre-processing step was not applied, i.e. feature selection was not performed and
instead the whole unaltered list of input predictors was provided to the model for training.
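A hedged sketch of the leaps-based selection step follows, again on the hypothetical `toy` data frame; because leaps fits linear regressions, the class label is temporarily coded as a number, and the subset size is chosen by the lowest BIC. The thesis searched combinations of up to 10 features; the toy example only has 4 candidates.

```r
# Best-subset search with regsubsets(), scored by BIC.
library(leaps)
leaps_data <- transform(toy, nyha_num = as.integer(nyha))   # 1 = class II, 2 = class III

subsets        <- regsubsets(nyha_num ~ mean_daily_steps + max_pmsc + sd_daily_steps + age,
                             data = leaps_data, nvmax = 4)
subset_summary <- summary(subsets)

best_size <- which.min(subset_summary$bic)   # subset size with the lowest BIC
coef(subsets, best_size)                     # the selected features and their coefficients
```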
Model Generation
To actually generate and train the ML classifiers, we provided the appropriate set of preprocessed
features to the model training function of the R caret package. Instead of setting fixed hyper-parameters
for the models - e.g. maximum decision tree depth of 5 in the RFs, 4 hidden layers for the NNets, etc. -
we had the model training function perform a grid search of the model hyperparameters to identify the
optimal hyper-parameterization for each model, assessing the performance of each model using k-fold
cross-validation (CV).
30 Pre-hoc determination of the optimal number of features for a given data set size is unfortunately still very much a
matter of debate in the field. As a result, various researchers have developed and published various heuristics for the
task, which can sometimes greatly vary in their recommendations. Some of these heuristics include: having 10 data
points per model parameter/feature [283], having “3-5 independent cases per class and feature” [284] for training
stable albeit not necessarily ‘good’ models [125], or for a dataset of size 𝑛 about √𝑛 highly correlated features to
about 𝑛 − 1 features when said features are completely uncorrelated [285]. For our dataset this puts us at 5, 3-5, 7
(highly correlated) to 49 (uncorrelated) features.
k-fold CV is a technique used for performing training and testing/validation where it is undesirable for an
already small dataset to be further divided into proportionately smaller separate training, testing and
validation datasets, but where it is
still necessary to assess how well a
classifier is expected to perform on
data it has never seen before [256].
In k-fold CV, the original dataset
is instead first segmented into 𝑘,
typically approximately equally
sized, partitions termed folds.
Testing and training of a given
model is then performed 𝑘 times
such that each fold is used once as
part of a test set, with the remaining $k-1$ folds in each round being used to train a model for
evaluation on the test fold. The overall performance is then reported as the mean of the performance of
the models across the rounds. The process is shown visually in Figure 6-5.
In each case, we set the number of folds for the testing CV procedure to be the same as the number used for
the overall model CV procedure detailed in the next section.
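Putting the two previous ideas together, a caret training call of the kind described above looks roughly like the sketch below (hypothetical `toy` data, a random forest as the example learner, and an illustrative 10-fold setting): caret builds a hyper-parameter grid, evaluates each candidate by k-fold CV, and keeps the best-performing parameterization.

```r
# Grid search over hyper-parameters, with each grid point scored by 10-fold CV.
library(caret)
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

rf_cv <- train(nyha ~ ., data = toy, method = "rf",
               trControl = ctrl, tuneLength = 3)   # caret generates the mtry grid

rf_cv$bestTune   # the hyper-parameters selected by cross-validation
rf_cv$results    # mean CV performance for each point on the grid
```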
6.2.3 Model Validation
Since a suitable external validation dataset was not available, we again performed CV using the
Chapter 3/Chapter 5 dataset to perform an internal validation of our ML classifiers and estimate the real-
world performance of our classifier against new, unseen data. Specifically, we validated the model using
both nested 10-fold CV and nested leave-one-out cross-validation (LOOCV). In other words, we cross-validated the overall pre-processing, feature selection and models, but nested within the evaluation of each model we used a further round of cross-validation (splitting out further training and test folds) to select the optimally hyper-parameterized model. LOOCV is a special case of k-fold CV where the number of folds, $k$, is set to be equal to the number of observations in the dataset. In other words, every training/test set split repeatedly leaves out one new data point for testing or validation and uses the rest for training. Before proceeding to a discussion of the rationale for using both 10-fold and leave-one-out cross-validation, we first define some important terms for assessing ML model performance.
Figure 6-5: k-fold cross-validation
On Bias and Variance
The bias of a machine learner is simply its error rate: i.e. how much or how little the algorithm errs in performing whatever task it is attempting to accomplish; it reflects the "erroneous assumptions in the model" [257]. Notably though, the bias is separate from the unavoidable or irreducible error of the
problem and only measures how distant the learner is from the ‘optimal’ overall error rate. For example,
if a system was trying to recognize speech from very noisy low quality audio streams where even humans
failed at the task 10% of the time, and a machine learning algorithm was able to recognize the speech
with an error rate of 15%, the bias of the algorithm would only be 5% since the gold-standard classifier for
this problem, the human ear, still erred 10% of the time due to the inherent nature of the problem [258].
In contrast the variance is how well, or rather how badly, the ML classifier generalizes to never before
seen data – i.e. how much the classifier errs due to ‘sensitivity to small fluctuations in the training set’
[257]. For example, if the same speech recognition classifier were provided with new test data (separate
from the data used to train it) and found to have a new error rate of 27%, the bias of the classifier would
still be 5% but the variance would be estimated at 12%, since the algorithm suffered an additional 12%
loss in performance in the face of the new test data. Knowing a classifier's bias and variance allows us to estimate how under-, over- or both under- & over-fit a given classifier may be; high bias being indicative of an under-fit classifier, high variance indicative of an over-fit classifier, and high bias & variance indicative of an under- and over-fit classifier [259,260]. By extension, most changes made to a ML classifier have an associated bias and variance trade-off, where an amelioration in one results in a deterioration of the other - e.g. decreasing bias (reducing under-fitting) typically results in increased variance (increased over-fitting); somewhere in the middle lies the optimal fit point where bias and variance are both minimized.
Rationale for multiple cross-validation
Returning to 10-fold and leave-one-out cross-validation: LOOCV is known to be the least
pessimistically biased estimator of model performance [256,261–265]. However it has been accused of
having “high [estimator] variance, leading to unreliable estimates (Efron 1983)” [263]. This accusation is
typically attributed to the cited paper by R. Kohavi, presumably citing alleged findings by B. Efron [266].
Efron, however, only elaborates on CV generally and does not appear to investigate or make any claims about the effect of higher k values on the variance of the estimate provided by the CV process. Kohavi's own research findings in fact also repudiate the claim of higher variance, as do the findings and simulations of a myriad of other investigators, who in fact suggest quite the opposite [261,264,265,267]. Only in special, highly
specific cases do simulations suggest that higher variance performance estimates result from LOOCV
[267]. The conclusion that LOOCV results in higher variance estimates therefore appears likely to simply be an erroneous intuitive over-generalization (dare we say overfitting) of the bias-variance trade-off, so ever-present in ML performance assessment, to the performance estimators themselves.
Our rationale for also performing 10-fold cross validation therefore is not to improve our estimate of
model performance - although in the event that both the 10-fold cross-validation and leave-one-out cross-
validation estimates are similar, we would have additional confirmation that the performance estimates
are in fact accurate. Rather our objective is in fact to measure the difference in the estimate of model
performance using different sized training datasets to roughly determine our location on the learning curve
of these algorithms and ascertain if collecting more training data is likely to provide improved model
performance. It may seem strange to do this using 10-fold cross validation since we have previously
mentioned that LOOCV is known to be a less biased estimator of model performance than lower k-fold
CV and we could simply perform LOOCV on an artificially reduced dataset. However, to do so we would
have to artificially reduce the dataset and arbitrarily throw away data we could otherwise use for some
useful purpose, namely testing, which is why we opt to use 10-fold CV vs. LOOCV. Furthermore, previous
simulations and experiments have demonstrated that in most datasets, even as small as 40 datapoints, 10-
fold cross-validation provides an estimate that is nearly as unbiased as LOOCV or at least within 7-9
percentage points of the LOOCV value [261,263,267].
Since performing nested 10-fold cross-validation on our dataset represents a large, nearly 15%, reduction
in available training data31, most of the performance delta above 7-9 percentage points is reasonably attributable
to the reduced training data in our already small dataset and can therefore be used to make a rough
approximation of our location on the learning curve (i.e. determine if we are still in the location of high
increase in performance for small increase in dataset size). Of course, if the performance delta is within 7-
9% points we unfortunately will not be able to approximate our location on the learning curve since we
will be unable to differentiate the bias delta due to using 10-fold CV vs LOOCV and the improvement
resulting from an increase in training data. However, in the unlikely event that the performance delta is
very low, i.e. both 10-fold and LOOCV converge to the same estimate, we can conclude that either
method is suitable for cross-validation of our algorithm given our sample size, and recommend that future
31 From 50 patients, nested leave-one-out results in 2 hold-outs for a total training set size of 48 patients. 10-fold
cross-validation results in a hold-out of 5 data points for validation, and a further 4.5 (on average) for the second
hold-out for model optimization leaving a total of 41.5 patients for training. (48 - 41.5) / 48 = 15%
120
work utilize 10-fold CV and take advantage of the associated decreased computational cost and simply use
the datapoints generated by this work to start plotting the learning curve.
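To make the nesting concrete, the following is a minimal sketch of how such a nested cross-validation loop might be arranged with the caret package; the data frame `dat`, its outcome column `nyha`, and caret's "glmboost" method (standing in for the boosted GLM) are assumptions for illustration only, not the exact pipeline used in this work.

# Sketch of nested LOOCV: the outer loop holds out one patient for performance
# estimation; the inner resampling (here also LOOCV) is used only for hyper-parameter tuning.
# For nested 10-fold CV, both loops would instead use trainControl(method = "cv", number = 10).
library(caret)

outer_preds <- character(nrow(dat))
for (i in seq_len(nrow(dat))) {
  train_i <- dat[-i, ]                                        # outer hold-out: patient i
  fit <- train(nyha ~ ., data = train_i, method = "glmboost",
               trControl = trainControl(method = "LOOCV"),    # inner loop: tunes hyper-parameters
               metric = "Kappa")
  outer_preds[i] <- as.character(predict(fit, newdata = dat[i, , drop = FALSE]))
}

# Outer-loop estimate of performance on unseen patients
confusionMatrix(factor(outer_preds, levels = levels(dat$nyha)), dat$nyha)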
Results and Discussion
Using the methodology detailed in the previous section we were able to successfully train GLMs,
boosted GLMs, RF, NNets and PCA NNets for each of the outlined feature groups: the CPET feature
group, the CPET + Step Data Metrics feature group, and the Step Data Metrics only feature group.
6.3.1 Classification Performance
The final overall validation performance of each of the variant classifiers is tabulated in Table 22, located
in Appendix D, for completeness. For brevity’s sake however, we summarize only the top performing
classifiers for each feature group in this chapter. In general, we found that pre-selecting features did not
change the classification performance of the models, and although imputing missing data did have an
effect on classifier performance, 3 of the 4 best performing models were built by simply excluding
incomplete cases as opposed to performing imputation.
The best CPET only classifier (and the third best classifier variant overall), summarized in Figure 6-7, was found to be a simple boosted GLM with no imputed data and either with or without feature pre-selection. The classifier achieved an unbalanced accuracy of 79% (corresponding to a balanced accuracy of 72%), better than the no-information rate of 70%. The level of agreement as measured by Cohen's Kappa was moderate (𝜅=0.47). This classifier is a huge improvement over the hidden Markov model based classifier trained in Chapter 5. That being said, the agreement between the GLM and the physician assigned label (𝜅=0.47) is still lower than the lower end of comparable human-level performance; recall that the interrater agreement between physicians was found to be between 𝜅=0.54 and 𝜅=0.75 [6,26]32. Solely based on the performance of this classifier, human performance remains the gold-standard baseline against which to compare the agreement in assessed NYHA functional class.

32 The study by Goldman et al. [11] which found a 41% agreement is excluded, as their result is not directly comparable since they used a weighted kappa to account for disagreements by more than 1 NYHA class. The other cited studies did not encounter this problem.
Unfortunately, the ML classifiers provided with just the step data did not fare as well as the CPET based classifiers. The best of these step data only classifiers – tied between a regular GLM, a boosted GLM and a NNet, all using imputed data and either with or without feature selection – achieved an unbalanced accuracy of only 72% (63% balanced), marginally higher than the no-information rate of 70%. The low agreement between the classifier and physician assigned label was also reflected in the low kappa coefficient (𝜅=0.28). That being said, the step data GLM/NNet/boosted GLM still performed better than the hidden Markov model based classifier.
The best performing classifier overall, another boosted GLM which used only complete cases (i.e. no imputed data) and either with or without feature selection, used the combination of CPET and step count data to achieve a solid 89% unbalanced accuracy (85% balanced), which was significantly higher than the no-information rate of the dataset (P=.02 at the 5% level of significance). There was substantial agreement between the machine and physician assigned labels (𝜅=0.73), approaching that of the best reported human analogues (𝜅=0.75 [26]).
Figure 6-9: Performance of the best CPET + step data classifier

              Physician
              II    III
  AI   II      6      2
       III     1     19

  No Information Rate (NIR): 0.71
  Unbalanced Accuracy (Acc): 0.89
  Cohen's Kappa: 0.73
  P-value [Acc > NIR]: 0.02
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-9: Performance of the second best CPET + step data classifier

              Physician
              II    III
  AI   II      5      3
       III     0     20

  No Information Rate (NIR): 0.71
  Unbalanced Accuracy (Acc): 0.89
  Cohen's Kappa: 0.70
  P-value [Acc > NIR]: 0.02
  Model Type: Random Forest
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-7: Performance of the best CPET only classifier

              Physician
              II    III
  AI   II      7      6
       III     3     27

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.79
  Cohen's Kappa: 0.47
  P-value [Acc > NIR]: 0.12
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-7: Performance of the best step data only classifier

              Physician
              II    III
  AI   II      6      9
       III     5     30

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.72
  Cohen's Kappa: 0.28
  P-value [Acc > NIR]: 0.45
  Model Type: (boosted) GLM/NNet
  Imputed Data: Yes
  Pre-selected Features: Yes or No
The second best performing classifier overall was a RF in the same variant class as the best overall GLM (no imputed data, with or without feature preselection, using CPET and step count data). It achieved an equivalent unbalanced accuracy (89%), at the same level of significance relative to the no-information rate, but it had a marginally lower agreement coefficient (𝜅=0.70) and balanced accuracy (81%).
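The summary statistics reported in these figures (unbalanced accuracy, Cohen's kappa, the no-information rate, the one-sided test of Acc > NIR, and balanced accuracy) are all standard outputs of caret's confusionMatrix function. The following is a minimal sketch of how they can be obtained; the `ai` and `physician` vectors below are hypothetical placeholder labels for illustration, not the study data.

# Sketch: obtaining the figure statistics with caret::confusionMatrix.
# `ai` (predicted class) and `physician` (assigned class) are hypothetical vectors.
library(caret)

physician <- factor(c("II","II","II","III","III","III","III","III","III","III"),
                    levels = c("II", "III"))
ai        <- factor(c("II","II","III","III","III","III","II","III","III","III"),
                    levels = c("II", "III"))

cm <- confusionMatrix(data = ai, reference = physician)
cm$overall[c("Accuracy", "Kappa", "AccuracyNull", "AccuracyPValue")]  # Acc, kappa, NIR, P[Acc > NIR]
cm$byClass["Balanced Accuracy"]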
The receiver operating characteristic (ROC) curve, which graphically represents the trade-off between a classifier's sensitivity (true positive rate) and specificity (the mathematical complement of the false positive rate33), is shown in Figure 6-10 for the best RF and boosted GLM built using CPET and step data. The figure also includes the NNet, PCA NNet and plain GLM from the same variant class: no imputed data, with or without feature selection. We can see from this curve that the diagnostic error rate for the boosted GLM is always expected to be more, or at least as, favorable as that of the RF based classifier, regardless of the discrimination threshold chosen.

33 i.e. 1 – the false positive rate

Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with CPET & step data (with no data imputation)
As an aside, we can also see from this graph that our choice to use PCA for feature selection before
providing our features to the NNet was well justified, since the PCA NNet shows greatly improved
discriminatory ability compared to the pure NNet. This suggests that a NNet might still have use for
assessing NYHA functional class, but may require more careful selection of input features or at least more
data to properly take advantage of its powerful modelling capabilities.
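For reference, curves like those in Figure 6-10 can be generated from the cross-validated class probabilities of each classifier; the sketch below uses the pROC package, with `physician` and `prob_III` as assumed placeholder vectors (the physician-assigned class and the predicted probability of class III, e.g. from predict(fit, type = "prob")), not the study data.

# Sketch: ROC curve (sensitivity vs. specificity trade-off) with pROC.
library(pROC)

physician <- factor(c("II", "II", "II", "III", "III", "III"), levels = c("II", "III"))
prob_III  <- c(0.21, 0.40, 0.65, 0.35, 0.72, 0.90)            # predicted P(NYHA III)

roc_obj <- roc(response = physician, predictor = prob_III,
               levels = c("II", "III"), direction = "<")       # "II" = control, "III" = case
plot(roc_obj, print.auc = TRUE)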
Regardless, both of our boosted GLM and RF based CPET + step data classifiers showed improved performance over the classifiers using heart rate variability (HRV) data created by 1) Pecchia et al. [128] – a cross-validated classification and regression tree that had moderate agreement (𝜅=0.57) and good discrimination accuracy (79.3%, unbalanced) on a slightly unbalanced dataset (12:17, 59% severe) – and 2) Melillo et al. [136] – another classification and regression tree, 10-fold cross-validated, which achieved a marginally better level of agreement (𝜅=0.60) and discrimination accuracy (85.4%, unbalanced) than Pecchia et al.'s tree, but on a different, more unbalanced dataset (12:32, 73% severe). Our classifier, however, does not approach the performance of Shahbazi et al.'s [142] leave-one-out cross-validated, HRV based k-Nearest Neighbour classifier (with generalized discriminant analysis feature selection), which achieved perfect agreement (𝜅=1.0) and accuracy (100%) at the classification task (I or II vs. III or IV) on their unbalanced dataset (10:29, 74% severe). We suspect that Shahbazi et al.'s classifier may possibly be overfit to their data.
Unfortunately, the practical applications of our classifier are not clear cut. Our early investigation of the combination of data from the relatively more established CPET and the simpler to administer activity tracker monitoring does demonstrate that it is possible to create a classifier that performs comparably to those that use relatively esoteric HRV data. Administering a CPET augmented with two weeks of activity tracker data might therefore prove a useful alternative for clinicians or researchers wishing to objectively assess NYHA functional classification without requiring access to the specialized software and know-how required to perform an HRV analysis. Unfortunately, this alternative still requires the administration of a CPET, which remains an expensive, cumbersome, and labor-intensive ordeal. Furthermore, to achieve near-human levels of classification performance, it presently appears necessary to augment CPET data with activity tracker step data, since neither CPET nor step data alone suffices to achieve reasonable levels of classification agreement. While activity tracker data is less expensive and labor-intensive to collect than CPET data, in its currently investigated form it is associated with at least a two-week delay. Although two weeks is not necessarily longer than the time required to get certain blood or pathology tests – which can sometimes also take several weeks [268–270] – this time delay certainly limits the practical applications of our classifier.
While an obvious next step is to investigate smaller monitoring periods, we suggest that an equally profitable step may be to identify better features in the step count data and, ideally, alternate data sources to reduce the dependence on CPET data outright.
6.3.2 Best Features
As it stands, the top 5 features for the best step count data classifier (GLM) were, in decreasing order of importance:
1) the total 2 week step count,
2) the mean 2 week per minute step count (PMSC),
3) patient weight,
4) the standard error of the 2 week per minute step count (PMSC), and
5) the standard error of the total daily per minute step count.
The features were assessed by summing their weighted importance scores across folds. The raw importance score was computed from the default variable importance measure for the specific model in question, using the varImp function in the caret package [271]. Each of these scores was then scaled to be between 0 and 1 (from least to most important). Therefore, the highest possible importance score is 50, which is achieved only if a variable scores as most important in all 50 leave-one-out cross-validated folds.
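As an illustration, the scoring just described might be implemented roughly as follows; `fold_models` is an assumed list holding the 50 per-fold caret model fits and is a placeholder for the actual pipeline objects.

# Sketch: per-fold caret::varImp scores rescaled to [0, 1] and summed across the
# leave-one-out folds, so the maximum attainable score equals the number of folds (50).
library(caret)

fold_scores <- lapply(fold_models, function(fit) {
  imp <- varImp(fit, scale = FALSE)$importance                # model-specific raw importances
  v   <- setNames(imp[[1]], rownames(imp))
  (v - min(v)) / (max(v) - min(v))                            # 0 = least, 1 = most important
})

total_importance <- Reduce(`+`, fold_scores)                  # summed weighted importance
head(sort(total_importance, decreasing = TRUE), 5)            # top 5 features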
The full ordered list of top features for the step count data only GLM is shown in Figure 6-11. We can see from the graph that very few of the features clearly stood out as being relatively more important; in fact, only the total 2 week step count and the mean 2 week per minute step count scored higher than 25 importance points (of 50). The third-ranked feature, weight, is not even a step count metric, and is already known to be not significantly different between classes (P=.21) at the 5% level of significance in this dataset (see Table 10). Given that the ML classifier used in this case was a GLM (which is linear regression based), it is not unreasonable to conclude that features at and below this level likely provided increasingly little discriminatory value, which goes a long way towards explaining the relatively low performance of this classifier.

Figure 6-11: Feature importance scores for GLM classifier using only step count data
Unfortunately, at the time of writing, the caret package’s varImp function did not adequately support
variable importance analysis for boosted GLMs, the model type of our best performing model and the
CPET only model. We instead provide as contrast the top 10 features identified by our second best
performing classifier, the CPET + step count data RF classifier. The top 10 features for the RF classifier
are shown in Figure 6-12.
Only two of the top 10 features used by the RF classifier are step count derived metrics:
1) the mean of the maximum daily per minute step count, and
2) the standard deviation of the total daily per minute step count.
The remaining 8 features are all CPET features, of which the respiratory exchange ratio peak (RER
Peak) is particularly noteworthy, having scored the highest possible importance score of 50 points,
indicating that it was voted the single most important feature by every single leave-one-out cross-
validated fold. The next most important overall feature (also from the CPET data) is the slope of minute ventilation (VE) to CO2 output (VCO2) at anaerobic threshold (AT) during CPET (VE/VCO2 Slope @ AT), which scored less than 20 importance points, indicating relatively low importance across folds. The third most important feature, the duration of CPET in fractional minutes (CPET Duration), scored less than 10 importance points.
For reference, weight – the 3rd best feature for the step data only GLM – was found to be only the 31st most important feature for the RF, with a score of 0.878, indicating that weight actually has relatively low overall predictive value. Interestingly, leanness in HF patients has been found to be associated with worse prognostic outcomes – in what is known as the 'obesity paradox' [272–275]. However, more recent findings from a large study of over 300,000 patients suggest that this association is likely the result of other unaccounted-for confounding factors [276]. This might explain the low ranking of
weight (correlated with BMI) in the face of other explanatory variables. The mean 2 week per minute step
count and the total 2 week step count, the top two highest scoring features for the GLM trained using
only step count, also scored as being low importance for the RF classifier: 0.967 and 0.945 respectively.
Figure 6-12: Feature importance scores for random forest classifier using CPET + step
count data
The RF classifier in fact scored 14 other step count derived features as being more important than these
(although none of these 14 others scored any higher than 2.6 points).
It is curious that the step count metrics as a whole appear to be considered by the classifiers to be relatively unimportant in contributing to the successful assessment of patient NYHA class, yet our analysis of the models from a holistic perspective indicates that the interaction of the step data metrics with the CPET data notably enhances the overall performance of the classifier.
We suspect that one possible cause of this paradox is that the step data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, are in fact only weakly correlated with NYHA class, noisy, and uncontextualized, and in general only weakly explanatory of NYHA functional class alone. Furthermore, these metrics are likely also highly intercorrelated. This makes it difficult for a ML algorithm to identify which single metric is most helpful. This is evidenced by the pattern visible in Figure 6-11, where most of the metrics are considered only mildly important, with none standing out as specifically important. This pattern, although not shown in Figure 6-12, is also reflected in the RF classifier's scoring, where similar metrics closely neighbour each other.
When framed around CPET data – which helps contextualize and account for some of the noise in the step count data – some of the step count metrics begin to stand out as being more explanatory (they are in the top 10 features). These features therefore appear to possibly be explaining variance otherwise left unexplained by the CPET data. However, feature importance is rated inconsistently between models. Although this is not necessarily unexpected, it may indicate that although the RF classifier assesses these features as important, they are in fact only interpreted as important as a result of the chance subset of training data within the folds. This leads us to an alternative explanation: that the classifier is simply overfit. This is a less compelling explanation than the step data being simply weakly explanatory, since the RF classifier clearly still assesses the step count data as being relatively unimportant. That being said, the possibility of overfitting certainly still exists, but it could be checked by computing the variance of the importance scores across the random folds (high overall variance being an indicator of potential overfitting to individual training folds), as sketched below.
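A rough check along these lines, reusing the per-fold `fold_scores` list assumed in the earlier importance sketch, might look like the following.

# Sketch: fold-to-fold variance of the rescaled importance scores; features with
# high variance are the ones whose ranking depends most on the chance training subset.
importance_matrix <- do.call(rbind, fold_scores)              # folds x features
fold_variance     <- apply(importance_matrix, 2, var)
head(sort(fold_variance, decreasing = TRUE), 5)               # most unstable features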
The overall conclusion of our feature analysis, however, is that the step count metrics provided to the ML classifiers for training are generally inadequate and that most of the predictive power resides in the CPET features. In light of the desire to not be dependent on CPET for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, any continuation of this work should therefore seriously consider investing time in identifying and engineering more relevant step count features, as well as adding other data sources like heart rate, which would be complementary to step count and would help contextualize the step data, hopefully reducing the dependence on cumbersome CPET data. However, we also note the lack of impact feature pre-selection had on the performance of our variant models and suggest that increasing the amount of training data available may be a better approach than pre-trimming the available features. That being said, other researchers have had significant success performing clever feature selection to improve their algorithm performance [142].
6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation
Recall that we cross-validated our classifiers not only with leave-one-out cross-validation, but also with 10-fold CV, to try to approximate our location on the classifier learning curve. Excluding models whose unbalanced accuracy was less than the no-information rate, the smallest difference in performance estimation between 10-fold CV and LOOCV of the same classifier was 0.19 (𝜅 = 0.47 under LOOCV vs. 𝜅 = 0.28 under 10-fold CV). The classifier with this smallest estimator difference was in fact the CPET only classifier discussed in Section 6.3.1. A summary of the performance estimations from the 10-fold and LOOCV of this classifier (the CPET Only GLM) is shown in Figure 6-13.

The largest and second largest performance differences were associated with the best performing classifier (CPET + Step Data GLM, 𝜅 = 0.73 under LOOCV vs. 𝜅 = 0.10 under 10-fold CV) and the second best performing classifier (CPET + Step Data RF, 𝜅 = 0.70 under LOOCV vs. 𝜅 = 0.10 under 10-fold CV). It is worth noting that the 10-fold CV versions of these classifiers in fact had unbalanced accuracies (68%) that were marginally less than the associated no-information rate (70%) for the classifiers.
Figure 6-13: Performance of the best model with cross-validation performance difference

  LOOCV confusion matrix:
              Physician
              II    III
  AI   II      7      6
       III     3     27

  10-fold CV confusion matrix:
              Physician
              II    III
  AI   II      6      9
       III     5     30

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.79 (LOOCV) | 0.72 (10-fold CV)
  Cohen's Kappa: 0.47 (LOOCV) | 0.28 (10-fold CV)
  P-value [Acc > NIR]: 0.12 (LOOCV) | 0.45 (10-fold CV)
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No
  Data Source: CPET Only
Since, as previously mentioned in Section 6.2.3.2, we expect at most about a 7-9 percentage point difference in performance estimation due to the bias of 10-fold CV vs LOOCV, these large differences in performance estimation between 10-fold CV and LOOCV are clear indications that our model is still highly sensitive to the amount of input data used to train it, and that it may possibly be overfit to the training data. From a learning curve perspective, these values indicate that we are still at the point in the curve where we are likely to derive significant benefit from adding more training data. Since adding more training data is also often an adequate remedy for overfitting, the appropriate response in either case is to collect more data. Certainly, we appear to have been justified in using this larger 50 patient dataset for our experiments as opposed to the 44 patient dataset, despite the associated loss of activity monitoring heart rate data.
Fortunately, as a result of the activity tracker monitoring upgrade made to Medly, as detailed in Chapter
4, more data (containing both heart rate and step count) is still actively being collected and should soon
result in a larger (n > 50) activity monitoring dataset than the one used for the classification experiments
in this thesis.
As the dataset increases in size, we suggest that future work performed with the dataset continue to be assessed using both 10-fold and LOOCV until the estimates from these approaches are found to converge. This will not only increase confidence in the performance estimates of the classifiers, but also help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. Furthermore, recording the performance of otherwise identical ML models as the amount of available data continues to increase would permit more accurate mapping of the learning curve than our initial single datapoint [258]. Knowing the actual learning curve associated with this problem would be helpful for diagnosing the source of classifier errors and ascertaining possible future steps to improve algorithm performance, and it would also be helpful for determining the incremental cost/benefit of continuing to collect more data [258].
Summary
To summarize, in this chapter we discussed a method for building cross-sectional machine learning
classifiers to assess NYHA functional class using CPET and activity monitoring step data. We chose to
investigate some popular starting points for supervised classification problems: Generalized Linear Models
(GLM); a variant thereof: boosted GLMs; Random Forests (RF); Artificial Neural Networks (NN); and a
variant thereof: Principal Component Analysis Neural Networks (PCA NN). We trained multiple variants
of each model to investigate the effect of a) performing separate feature selection ahead of model training,
b) imputing missing data instead of just dropping incomplete cases, and c) supplying different groups of
input predictors to our models for training. Specifically, we investigated the performance of the classifiers
when supplied with demographic data and a) just CPET data, b) just the step data metrics investigated
in Chapter 3, and c) the combination of both the CPET data and step data metrics.
To properly determine the expected performance of the classifiers in the face of new data we also cross-
validated all the models using 10-fold cross-validation and leave-one-out cross-validation. Since we also
optimized the model hyper-parameters and cross-validated these selections, we ended up performing
nested 10-fold and nested leave-one-out cross-validation of each of the models.
In general, we found that pre-selecting features did not change the classification performance of the
models, and although imputing missing data sometimes had an effect on classifier performance, 3 of the 4
best performing models (all except the step data only classifier) discussed in this chapter were built by
simply excluding incomplete cases as opposed to performing 5-Nearest Neighbour imputation.
The best overall classifier was found to be a boosted GLM, trained using only complete cases of both
CPET and step data, which achieved an unbalanced accuracy of 89% (85% balanced) versus a no-
information rate of 70%. As a result, this classifier had a substantial level of agreement with the physician
assigned NYHA class (𝜅=0.73). The performance of the classifier was therefore comparable to human level performance (𝜅=0.75 [26]).
The CPET + step data classifier exceeded the baseline level of performance established by the best CPET
data only classifier. The best classifier trained with only CPET data (another boosted GLM) achieved an
unbalanced accuracy of 79% (72%, balanced) which was also better than the no-information rate of 70%.
The CPET only classifier therefore showed a moderate level of agreement with the physician assigned
label (𝜅=0.47) which was lower than the lower end of comparable human-level performance (𝜅=0.54 [6]).
The step data only classifiers (tied between a regular GLM, boosted GLM and NNet) fared much worse,
achieving an unbalanced accuracy of 72% (63% balanced) – only marginally higher than the no-
information rate of 70%, and with a low level of agreement between the classifier and physician assigned
label (𝜅=0.28).
When comparing which features were considered most important by the classifiers, we found that the step data metrics as a whole were less important than the CPET metrics. We theorized that this is because the step data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, are in fact only weakly correlated, noisy and uncontextualized, and in general only weakly explanatory of NYHA functional class alone. This makes it difficult for a ML algorithm to use the features effectively for classification. In light of the desire to also not remain dependent on CPET data for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, we suggested that a reasonable next step would be to invest in engineering more relevant step count features. We also recommend adding other data sources like heart rate, which is presumed to be complementary to step count and would help contextualize the step data – hopefully replacing the currently required CPET data.
In comparing the performance estimations from the 10-fold and leave-one-out cross-validation, we found a notable difference between the measurements of agreement (𝜅), varying from 0.19 to 0.63 for the well performing algorithms, and always in favor of the leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but that it is likely also largely attributable to the 15% reduction in the already limited data available for training the classifier resulting from the nesting of the 10-fold cross-validation process (compared to nesting leave-one-out cross-validation), and thus more indicative of our location on the learning curve. Regardless, these numbers indicated that there is likely considerable benefit to collecting more training data. We suggested that future work performed with larger datasets should continue to assess performance using 10-fold and LOOCV until the estimates from these approaches are found to converge. This would increase confidence in the performance estimates of the classifiers, as well as help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. We also suggested that, at minimum, keeping the number of folds consistent for cross-validation would be helpful for better mapping out the learning curve for this problem – which would be a helpful tool for diagnosing classifier error and assessing the cost/benefit of continuing to collect more data.
Conclusions, Recommendations & Future Work
In this chapter we reflect on this work as a whole, briefly reiterating its major conclusions and findings, and providing some recommendations and suggested directions for future work.
Conclusions
The objective of this thesis was to design and develop a means of making New York Heart Association
(NYHA) classification more consistent and reliable for the medical research and clinical community. We
proposed that a good way to accomplish this objective was to find a means of objectively assessing NYHA
functional class. In light of this, we performed a thorough review of the current state-of-the-art for
assessing NYHA functional class, including the state-of-the-art in applying artificial intelligence machine
learning algorithms to the task of assessing or classifying patients into their NYHA functional class.
We found that other researchers have already attempted to use machine learning for NYHA functional
classification. These, however, used heart rate variability data, which is not necessarily readily accessible or usable by all heart function clinics, nor, at least at present, highly suitable for long-term remote patient monitoring. Remote patient monitoring is a growing trend in the pursuit of more cost-efficient care for chronic conditions, and specifically in the quest to improve patient- and physician-management of the heart failure condition. We proposed that a useful but more accessible data source that would synergize well with remote patient monitoring would be activity tracker data.
We proposed updating an existing remote patient monitoring system with the ability to collect and
display activity tracker data, which could provide data for use by a machine learning algorithm to
perform automated assessment of NYHA functional class. For this task we selected Medly, the remote
patient monitoring system presently in use at the Toronto General Hospital Heart Function Clinic, as a
suitable candidate system. However, since activity tracker data has not seen wide use in actual clinic
settings - in fact we only found one small pilot study that investigated the relationship between NYHA
class and activity tracker step count - we first replicated the pilot study on a larger dataset that we had
available from a previous study performed at our lab, verifying the findings of the pilot study: that NYHA
II and NYHA III patients differ significantly by mean daily total step count. Additionally, we discovered
that these patients actually differed by various aggregate measures of step count also including mean and
maximum of the daily per minute step count maximums. Overall, our findings reaffirmed the findings of
the previous pilot study, giving us some additional reassurance that remote monitored step count might
be beneficial for objectively assessing NYHA class. We noted however that the recorded step count data
was often ambiguous, since the data recorded by the fitness trackers used in this study, which only
recorded step count, did not allow us to differentiate between when the wearer was inactive versus the
tracker simply not being worn. This significantly limited our ability to draw precise practical conclusions
from the dataset.
We then proceeded to engineer an upgrade to the Medly remote patient monitoring system to allow it to support activity tracker monitoring data from Fitbit devices, specifically the Fitbit Charge HR 2, which supported collection of both step count and heart rate data (to avoid the ambiguity problems which were identified in the replication study). Despite delays in the actual implementation of the activity tracker upgrade, we were successfully able to onboard 44 patients over a 5 month period, with some (3) of the patients even providing their own Fitbit for use with the system. Unfortunately, the patients were found to be only moderately adherent with using the Fitbit, with only around 1⁄3 to 1⁄4 of patients (at 3 months and 7 months respectively) having excellent levels of adherence (using the system an average of at least 9 of 10 days). We theorized that the many compromises made to the user experience throughout the implementation process may have detrimentally impacted patient adherence.
Since the effective size of the Medly Fitbit dataset was drastically reduced to 33 patients after removing
those patients with less than 1 week of recorded activity, we opted to instead use the dataset investigated
as part of the replication study to explore if it would be possible to assess NYHA class using free-living
fitness tracker data. The marginally larger replication data set we opted to use consisted of 50 patients
(35 NYHA class II; 15 NYHA class III), and although it lacked activity monitor heart rate data to
complement the step count data, all of the patients in the dataset had recorded cardiopulmonary exercise
test data which we proposed to use to establish a baseline performance level against which to evaluate our
classifiers.
We investigated 6 different types of supervised machine learning classifiers to assess NYHA functional
classification: a hidden Markov model based classifier, several Generalized Linear Models, boosted
Generalized Linear Models, Random Forests, Artificial Neural Networks and Principal Component
Analysis Neural Networks.
We found that the hidden Markov model based classifier performed worst overall, and in fact in many cases refused to train properly. The hidden Markov model based classifier we did manage to train had poor agreement (Cohen's Kappa statistic, 𝜅=0.18) between the physician assigned NYHA class and that assigned by the classifier, with a resulting low (unbalanced) accuracy of 58% (assessed on the same data used to train the classifier), which was actually worse than the no-information rate of the dataset (70%).
In contrast, the best overall classifier was found to be a boosted GLM (leave-one-out cross-validated),
trained using only complete cases of both CPET and step data, which demonstrated substantial
agreement with the physician assigned NYHA class (𝜅=0.73) comparable to human level performance
(𝜅=0.75 [26]) and better than 2 of the 3 heart rate variability based machine learning classifiers. The level
of agreement of our classifier corresponded to an unbalanced accuracy of 89% (85% balanced) against a
no-information rate of 70%.
The best classifier trained with only CPET data – our proposed performance baseline (another boosted GLM) – showed a moderate level of agreement with the physician assigned label (𝜅=0.47), with a corresponding unbalanced accuracy of 79% (72% balanced), again better than the no-information rate of 70%. The performance of this classifier, however, was lower than the reported lower range of human-level performance (𝜅=0.54 [6]) and as a result surprisingly did not dislodge physicians as the gold-standard against which to assess NYHA functional class agreement, despite the notoriously high degree of subjectivity in their assessments.
The step data only classifier (tied between a regular GLM, boosted GLM and NNet) fared even worse
than the classifier trained with only CPET data, although still better than the hidden Markov model
based classifier, achieving an unbalanced accuracy of 72% (63% balanced) – only marginally higher than
the no-information rate of 70%, and with a low level of agreement between the classifier and physician
assigned label (𝜅=0.28).
An analysis of the important input features revealed notably that, of the CPET + step data features
investigated, the respiratory exchange ratio was found to be rated most consistently important. The step
data metrics, as a whole, were found to be less important generally than the CPET metrics and were also
found to be inconsistent in their ratings of relative importance amongst themselves.
We also found a notable difference between the estimates of the measurements of agreement (|∆𝜅| = [0.19, 0.63]) generated using 10-fold versus leave-one-out cross-validation for the well performing classifiers, always in favor of leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but is more likely an indication that 10-fold cross-validation caused a severe reduction in the already limited amount of data available for classifier training.
In summary, we found that it is possible to objectively assess NYHA functional classification with a level of performance comparable to that of human physicians by using a combination of CPET and step count data. Although CPET data and step count data were found to be generally inadequate for performing objective NYHA functional classifications by themselves, this may have been due to the lack of data and the lack of useful and relevant features. In particular, for the step count data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, more intentional feature engineering of relevant step count metrics might further improve performance using this data. As well, adding other data sources, for example heart rate data, which is presumed complementary to step count and might help re-contextualize and clean up ambiguity in the data, could further improve classifier performance.
In general, although the machine learning classifiers developed in this work are not yet ready for
implementation into a real-life remote patient monitoring system, the classifiers investigated in this thesis
certainly show promise for making the assessment of NYHA functional class more objective and by
extension more universally consistent and reliable.
Recommendations
In this section we propose several recommendations and ‘lessons learned’ in light of our findings:
1. Avoid activity trackers that label disengagement with the monitoring solution and patient
inactivity identically. These contribute significant ambiguity to later data analysis that is often
difficult or impossible to reconcile.
2. For data collected remotely from patients, provide a means of helping staff catch and address
patient issues in a timely manner, thereby improving the overall quality of the data. For example,
adding automated adherence phone calls or reminder notifications (for a smartphone-based
application) may improve adherence at little cost.
3. When adding new sources of data to an existing system, either a) begin data collection as soon as
possible, improving as required, and collecting lots of lower quality data which can be cleaned and
noise-corrected post-hoc, or b) fully commit to designing a user experience that will result in high
adherence – collecting a smaller amount of high-quality data. Delaying data collection to design
an incomplete user experience will likely only result in collecting an insufficient amount of
moderate quality data that will be more challenging to analyze.
4. Notwithstanding the above, prefer collecting more data (especially for machine learning
applications). While it is possible to build a machine learning classifier with little data, it becomes
significantly more difficult to properly assess if the classifier is of good quality.
5. The corollary to 3 and 4 is to invest in data collection infrastructure. Collecting a suitably large
dataset can take a long time and should be started well in advance of a proposed research project.
6. Invest time in visualizing and understanding the data being collected. In the case of this thesis,
we discovered several limitations in our data, for example the prevalence of 0 step count values,
that had drastic implications on model design and development. This could have been addressed
in a more timely fashion with foresight derived from a more thorough earlier investigation of the
source data.
7. Prefer simpler machine learning classifiers over more complex ones especially in the face of smaller
datasets. Almost all of the best performing classifiers investigated in this thesis were simple
generalized linear models or variants thereof.
8. Prefer the use of the R programming language (along with the tidyverse package by H. Wickham [217]) for analysis and visualization of data, but use Python along with the well-established scikit-learn library to accelerate creation of the machine learning pipeline required to build and
adequately assess a series of machine learning classifiers. Aside from cleaning data, building the
machine learning pipeline is one of the most time-consuming parts of a machine learning project.
Future Work
Having outlined some general recommendations and lessons that should be taken from this work, we provide some suggested future directions for this work:
1. A more thorough study of the characterization of the minute-by-minute step count waveform for
both healthy persons and patients with congestive heart failure should be undertaken. This would
provide very valuable insights for projects investigating the use of fitness trackers for monitoring
tasks.
2. Revisit the user interfaces and user experience design of the fitness tracker upgrade applied to
Medly. Aside from the fact that the system as is does not fully honor the best practices and
principles outlined in the Fitbit API terms of service, patients using the system are only
moderately adherent which reduces the amount and quality of data being collected for use by
patients, by clinicians, and as part of any future quality improvement or research projects.
Adding adherence phone calls or reminder notifications would likely provide significant benefit at
little cost.
3. Investigate the effects of applying dithering to the training of the HMMBC.
4. Repeat the work performed in this thesis but using the combination of activity tracker step count
and heart rate data. The data being collected from Medly patients would be suitable for this
purpose once a sufficient number of patients are onboarded onto the upgraded system.
5. Furthermore, investigate the effect of including other data available from the Medly system such
as daily symptoms data, which could potentially help further contextualize patient step count data.
6. Investigate the effect of reducing the analysis window duration for the step count data from 2-
weeks to some shorter time period.
7. In a similar vein, investigate activity segmentation with an eye towards using it in combination
with a HMMBC (or more standard cross-sectional ML model).
8. Perform careful manual feature engineering or automated feature extraction to identify more
relevant features from available time series data streams (including step count).
9. And finally, regardless of other work performed, continue to assess the cross-validated
performance of otherwise identical models as dataset size increases, to better map the learning
curve associated with the NYHA functional class supervised classification problem.
References
1. Mehra MR, Butler J. Heart Failure: A Global Pandemic and Not Just a Disease of the West.
Heart Fail Clin [Internet] 2015 Oct [cited 2017 Oct 13];11(4):xiii–xiv. PMID:26462110
2. Heart and Stroke Foundation. 2016 Report on the Health of Canadians: The Burden of Heart
Failure. 2016 [cited 2016 Oct 29]; Available from: https://www.heartandstroke.ca/-/media/pdf-
files/canada/2017-heart-month/heartandstroke-reportonhealth-
2016.ashx?la=en&hash=0478377DB7CF08A281E0D94B22BED6CD093C76DB (Archived by
WebCite® at http://www.webcitation.org/706UliccA)
3. Seto E, Leonard KJ, Cafazzo J a, Masino C, Barnsley J, Ross HJ. Self-care and quality of life of
heart failure patients at a multidisciplinary heart function clinic. J Cardiovasc Nurs [Internet]
2011;26(5):377–85. PMID:21263339
4. Lawrence S. Canada is failing our heart failure patients - Heart and Stroke Foundation of Canada
[Internet]. Marketwired. 2016 [cited 2016 Oct 7]. Available from:
http://www.marketwired.com/press-release/canada-is-failing-our-heart-failure-patients-
2093022.htm (Archived by WebCite® at http://www.webcitation.org/706U7G8oI)
5. Cox J, Naylor CD. The Canadian Cardiovascular Society Grading Scale for Angina Pectoris: Is It
Time for Refinements? Ann Intern Med [Internet] American College of Physicians; 1992 Oct 15
[cited 2016 Oct 30];117(8):677. [doi: 10.7326/0003-4819-117-8-677]
6. Raphael C, Briscoe C, Davies J, Ian Whinnett Z, Manisty C, Sutton R, Mayet J, Francis DP,
Raphael C. Limitations of the New York Heart Association functional classification system and
self-reported walking distances in chronic heart failure. Heart [Internet] 2007 Apr 1 [cited 2016 Oct
30];93(4):476–482. [doi: 10.1136/hrt.2006.089656]
7. Bennett JA, Riegel B, Bittner V, Nichols J. Validity and reliability of the NYHA classes for
measuring research outcomes in patients with cardiac disease. Hear Lung J Acute Crit Care
2002;31(4):262–270. PMID:12122390
8. Heart Foundation. New York Heart Association (NYHA) Classification [Internet]. Heart
Foundation; 2014 [cited 2017 Jun 30]. p. 1. Available from:
http://www.heartonline.org.au/media/DRL/New_York_Heart_Association_(NYHA)_classificati
on.pdf
9. American Heart Association. Classes of Heart Failure [Internet]. 2015 [cited 2016 Oct 30].
Available from:
http://www.heart.org/HEARTORG/Conditions/HeartFailure/AboutHeartFailure/Classes-of-
Heart-Failure_UCM_306328_Article.jsp#.WvyuQYgvyiN (Archived by WebCite® at
http://www.webcitation.org/6zT3C5Rpx)
10. Ahmed A, Aronow WS, Fleg JL. Higher New York Heart Association classes and increased
mortality and hospitalization in patients with heart failure and preserved left ventricular function.
Am Heart J [Internet] NIH Public Access; 2006 Feb [cited 2017 Oct 30];151(2):444–50.
PMID:16442912
11. Goldman L, Hashimoto B, Cook EF, Loscalzo A. Comparative reproducibility and validity of
systems for assessing cardiovascular functional class: advantages of a new specific activity scale.
Circulation [Internet] 1981;64(6):1227–1234. PMID:7296795
12. Williams BA, Doddamani S, Troup MA, Mowery AL, Kline CM, Gerringer JA, Faillace RT.
Agreement between heart failure patients and providers in assessing New York Heart Association
functional class. Hear Lung J Acute Crit Care [Internet] Elsevier Inc; 2017 Jul 1 [cited 2017 Oct
30];46(4):293–299. PMID:28558929
13. Moayedi Y, Abdulmajeed R, Posada JD, Foroutan F, Alba AC, Cafazzo J, Ross HJ, Duero Posada
J, Foroutan F, Alba AC, Cafazzo J, Ross HJ. Assessing the Use of Wrist-Worn Devices in Patients
With Heart Failure: Feasibility Study. JMIR Cardio [Internet] JMIR Cardio; 2017 Dec 19 [cited
2018 Jan 25];1(2):8. [doi: 10.2196/cardio.8301]
14. Savarese G, Lund LH. Global Public Health Burden of Heart Failure. Card Fail Rev [Internet]
Radcliffe Cardiology; 2017 Apr [cited 2018 Jun 4];3(1):7–11. PMID:28785469
15. University of Toronto Faculty of Medicine. The State of the Heart in Canada [Internet]. 2014.
Available from:
http://medicine.utoronto.ca/sites/default/files/TRCHR_StateOfHeart_Infographsm.png
16. cardiac insufficiency. McGraw-Hill Concise Dict Mod Med [Internet] The McGraw-Hill Companies,
Inc.; 2018 [cited 2018 Jul 21]. Available from: https://medical-
dictionary.thefreedictionary.com/cardiac+insufficiency
17. Aird WC. Discovery of the cardiovascular system: From Galen to William Harvey. J Thromb
Haemost [Internet] 2011;9(1 S):118–129. PMID:21781247
18. Silverthorn DU, Johnson BR, Ober WC, Garrison CW, Silverthorn AC. Blood Flow and the
Control of Blood Pressure. Hum Physiol An Integr Approach 5th ed Pearson Benjamin Cummings;
2009. p. 512–545.
19. Shah SJ. Heart Failure (HF) [Internet]. Merck Manuals Prof Ed. 2017 [cited 2018 Jul 21].
Available from: https://www.merckmanuals.com/en-ca/professional/cardiovascular-
disorders/heart-failure/heart-failure-hf
20. Azevedo PS, Polegato BF, Minicucci MF, Paiva SAR, Zornoff LAM. Cardiac Remodeling:
Concepts, Clinical Impact, Pathophysiological Mechanisms and Pharmacologic Treatment. Arq
Bras Cardiol [Internet] Arquivos Brasileiros de Cardiologia; 2016 Jan [cited 2018 Jul 21];106(1):62–
9. PMID:26647721
21. Laflamme MA, Murry CE. Heart regeneration. Nature [Internet] NIH Public Access; 2011 May 19
[cited 2018 Jul 21];473(7347):326–35. PMID:21593865
22. National Heart Foundation of Australia and the Cardiac Society of Australia and New Zealand
(Chronic Heart Failure Guidelines Expert Writing Panel). Guidelines for the prevention, detection
and management of chronic heart failure in Australia. 2011 [cited 2018 May 10];84. Available from:
https://www.heartfoundation.org.au/images/uploads/publications/Chronic_Heart_Failure_Guide
lines_2011.pdf
23. The Criteria Committee of the New York Heart Association. Classification of Functional Capacity
and Objective Assessment [Internet]. 9th ed. Nomencl Criteria Diagnosis Dis Hear Gt Vessel.
Boston, Mass.: Little, Brown and Co.; 1994 [cited 2017 Oct 13]. Available from:
http://professional.heart.org/professional/General/UCM_423811_Classification-of-Functional-
Capacity-and-Objective-Assessment.jsp
24. Rostagno C, Galanti G, Comeglio M, Boddi V, Olivo G, Gastone G, Serneri N. Comparison of
different methods of functional evaluation in patients with chronic heart failure. Eur J Heart Fail
[Internet] 2000 [cited 2018 Jun 4];2:273–280. Available from:
https://onlinelibrary.wiley.com/doi/pdf/10.1016/S1388-9842(00)00091-X
25. Carroll SL, Harkness K, Mcgillion MH. A Comparison of the NYHA Classification and the Duke
Treadmill Score in Patients with Cardiovascular Disease. Open J Nurs [Internet] 2014 [cited 2017
Nov 3];4:774–783. [doi: 10.4236/ojn.2014.411083]
26. Christensen HW, Haghfelt T, Vach W, Johansen A, Hoilund-Carlsen PF. Observer reproducibility
and validity of systems for clinical classification of angina pectoris: comparison with radionuclide
imaging and coronary angiography. Clin Physiol Funct Imaging [Internet] Blackwell Science Ltd;
2006 Jan [cited 2017 Nov 6];26(1):26–31. [doi: 10.1111/j.1475-097X.2005.00643.x]
27. Kubo SH, Schulman S, Starling RC, Jessup M, Wentworth D, Burkhoff D. Development and
validation of a patient questionnaire to determine New York heart association classification. J
Card Fail [Internet] Churchill Livingstone; 2004 [cited 2017 Nov 3];10(3):228–235. [doi:
10.1016/J.CARDFAIL.2003.10.005]
28. McHugh ML. Interrater reliability: the kappa statistic. Biochem medica [Internet] Croatian Society
for Medical Biochemistry and Laboratory Medicine; 2012 [cited 2018 Aug 25];22(3):276–82.
PMID:23092060
29. Sallis JF, Saelens BE. Assessment of Physical Activity by Self-Report: Status, Limitations, and Future Directions. Research Quarterly for Exercise and Sport. 2015 [cited 2018 Jul 24]; [doi: 10.1080/02701367.2000.11082780]
30. Okura Y, Urban LH, Mahoney DW, Jacobsen SJ, Rodeheffer RJ. Agreement between self-report
questionnaires and medical record data was substantial for diabetes, hypertension, myocardial
infarction and stroke but not for heart failure. J Clin Epidemiol [Internet] Pergamon; 2004 Oct 1
[cited 2018 Jul 24];57(10):1096–1103. [doi: 10.1016/J.JCLINEPI.2004.04.005]
31. Baranowski T. Validity and Reliability of Self Report Measures of Physical Activity: An Information-Processing Perspective. Res Q Exerc Sport [Internet] 1988 [cited 2018 Jul 24];59(4):314–327. [doi: 10.1080/02701367.1988.10609379]
32. Balady GJ, Arena R, Sietsema K, Myers J, Coke L, Fletcher GF, Forman D, Franklin B, Guazzi
M, Gulati M, Keteyian SJ, Lavie CJ, Macko R, Mancini D, Milani R V. AHA Scientific Statement
Clinician’s Guide to Cardiopulmonary Exercise Testing in Adults A Scientific Statement From the
American Heart Association. Am Hear Assoc Exerc Clin Cardiol Counc Epidemiol Prev [Internet]
[cited 2017 May 2]; [doi: 10.1161/CIR.0b013e3181e52e69]
33. Uth N, Sørensen H, Overgaard K, Pedersen PK. Estimation of VO2max from the Ratio between
HRmax and HRrest - the Heart Rate Ratio Method. Eur J Appl Physiol [Internet] 2004 [cited
2017 May 2];91(1):111–115. [doi: 10.1007/s00421-003-0988-y]
34. Kline GM, Porcari JP, Hintermeister R, Freedson PS, Ward A, McCarron RF, Ross J, Rippe JM.
Estimation of VO2max from a one-mile track walk, gender, age, and body weight. Med Sci Sports
Exerc [Internet] 1987 Jun [cited 2017 May 2];19(3):253–9. PMID:3600239
35. Cooper KH. Aerobics. Bantam Books; 1969. ISBN:9780553144901
36. Saalasti S, Pulkkinen A. Method and system for determining the fitness index of a person
[Internet]. United States Patent Office; 2012 [cited 2017 May 2]. Available from:
https://www.google.com/patents/US20140088444
37. Butte NF, Ekelund U, Westerterp KR. Assessing Physical Activity Using Wearable Monitors:
Measures of Physical Activity. Med Sci Sport Exerc [Internet] 2012 [cited 2017 Jun 15];44(1S):5–
12. [doi: 10.1249/MSS.0b013e3182399c0e]
38. ap507. Study shows slow walking pace is good predictor of heart-related deaths — University of
Leicester [Internet]. Univ Leicester News. 2017 [cited 2017 Aug 30]. Available from:
https://www2.le.ac.uk/news/blog/2017-archive/august/study-shows-slow-walking-pace-good-
predictor-of-heart-related-deaths
39. Zhao S, Chen K, Su Y, Hua W, Chen S, Liang Z, Xu W, Dai Y, Liu Z, Fan X, Hou C, Zhang S.
Association between patient activity and long-term cardiac death in patients with implantable
cardioverter-defibrillators and cardiac resynchronization therapy defibrillators. Eur J Prev Cardiol
[Internet] 2017;24(7):760–767. [doi: 10.1177/2047487316688982]
40. Roul G, Germain P, Bareiss P. Does the 6-minute walk test predict the prognosis in patients with
NYHA class II or III chronic heart failure? Am Heart J [Internet] 1998 Sep [cited 2017 Jun
30];136(3):449–457. [doi: 10.1016/S0002-8703(98)70219-4]
41. Abdulmajeed R. The Use of Continuous Monitoring of Heart Rate as a Prognosticator of
Readmission in Heart Failure Patients. University of Toronto; 2016.
42. Eapen ZJ, Turakhia MP, McConnell M V., Graham G, Dunn P, Tiner C, Rich C, Harrington RA,
Peterson ED, Wayte P. Defining a Mobile Health Roadmap for Cardiovascular Health and
Disease. J Am Heart Assoc [Internet] 2016 Jul 12 [cited 2016 Oct 30];5(7):e003119. [doi:
143
10.1161/JAHA.115.003119]
43. Wen D, Zhang X, Liu X, Lei J. Evaluating the Consistency of Current Mainstream Wearable
Devices in Health Monitoring: A Comparison Under Free-Living Conditions. J Med Internet Res
[Internet] Journal of Medical Internet Research; 2017 Mar 7 [cited 2017 Mar 9];19(3):e68.
PMID:28270382
44. El-Amrawy F, Nounou MI, Volpp K, Patel M, Lin N, Lewis R. Are Currently Available Wearable
Devices for Activity Tracking and Heart Rate Monitoring Accurate, Precise, and Medically
Beneficial? Healthc Inform Res [Internet] Apress Media; 2015 [cited 2017 Jul 7];21(4):315. [doi:
10.4258/hir.2015.21.4.315]
45. An H-S, Jones GC, Kang S-K, Welk GJ, Lee J-M. How valid are wearable physical activity
trackers for measuring steps? Eur J Sport Sci [Internet] Routledge; 2017 Mar 16 [cited 2017 Jul
12];17(3):360–368. [doi: 10.1080/17461391.2016.1255261]
46. Bromberg SE. Consumer Wristband Activity Monitors as a Simple and Inexpensive Tool for
Remote Heart Failure Monitoring. 2015.
47. Abeles A, Kwasnicki RM, Pettengell C, Murphy J, Darzi A. The relationship between physical
activity and post-operative length of hospital stay: A systematic review. Int J Surg [Internet] 2017
Jul [cited 2017 Jul 12]; [doi: 10.1016/j.ijsu.2017.06.085]
48. Bornstein DB, Beets MW, Byun W, Welk G, Bottai M, Dowda M, Pate R. Equating
accelerometer estimates of moderate-to-vigorous physical activity: In search of the Rosetta Stone. J
Sci Med Sport [Internet] BioMed Central; 2011 Sep [cited 2017 Jul 12];14(5):404–410. [doi:
10.1016/j.jsams.2011.03.013]
49. Awais M, Mellone S, Chiari L. Physical activity classification meets daily life: Review on existing
methodologies and open challenges. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS
2015;2015–Novem:5050–5053. PMID:26737426
50. Jehn M, Prescher S, Koehler K, Von Haehling S, Winkler S, Deckwart O, Honold M, Sechtem U,
Baumann G, Halle M, Anker SD, Koehler F. Tele-accelerometry as a novel technique for assessing
functional status in patients with heart failure: Feasibility, reliability and patient safety. Int J
Cardiol [Internet] 2013 [cited 2017 Sep 5];168:4723–4728. [doi: 10.1016/j.ijcard.2013.07.171]
51. Demers C, McKelvie RS, Negassa A, Yusuf S. Reliability, validity, and responsiveness of the six-
minute walk test in patients with heart failure. Am Heart J 2001;142(4):698–703. PMID:11579362
52. Guazzi M, Myers J, Arena R. Cardiopulmonary Exercise Testing in the Clinical and Prognostic
Assessment of Diastolic Heart Failure. J Am Coll Cardiol [Internet] Elsevier; 2005 Nov 15 [cited
2018 Jul 25];46(10):1883–1890. [doi: 10.1016/J.JACC.2005.07.051]
53. Albouaini K, Egred M, Alahmar A, Wright DJ. Cardiopulmonary exercise testing and its
application. Postgrad Med J [Internet] BMJ Group; 2007 Nov [cited 2016 Sep 20];83(985):675–82.
PMID:17989266
54. Chatterjee S, Sengupta S, Nag M, Kumar P, Goswami S, Rudra A. Cardiopulmonary Exercise
Testing: A Review of Techniques and Applications. 2013 [cited 2018 Jul 25]; [doi: 10.4172/2155-
6148.1000340]
55. Mehra MR, Canter CE, Hannan MM, Semigran MJ, Uber PA, Baran DA, Danziger-Isakov L,
Kirklin JK, Kirk R, Kushwaha SS, Lund LH, Potena L, Ross HJ, Taylor DO, Verschuuren EAM,
Zuckermann A. The 2016 International Society for Heart Lung Transplantation listing criteria for
heart transplantation: A 10-year update. [cited 2018 Jun 2]; [doi: 10.1016/j.healun.2015.10.023]
56. Lim FY, Yap J, Gao F, Teo LL, Lam CSP, Yeo KK. Correlation of the New York Heart
Association classification and the cardiopulmonary exercise test: A systematic review. Int J Cardiol
[Internet] Elsevier; 2018 Jul 15 [cited 2018 Jun 4];263:88–93. [doi: 10.1016/J.IJCARD.2018.04.021]
57. Fitbit Inc. Fitbit Official Site for Activity Trackers & More [Internet]. 2016. Available from:
https://www.fitbit.com/en-ca/home (Archived by WebCite® at
http://www.webcitation.org/6zTITrK95)
58. Fitbit Inc. Fitbit Charge 2TM Heart Rate + Fitness Wristband [Internet]. 2018 [cited 2018 Apr 17].
Available from: https://client.fitbit.com/en-ca/charge2 (Archived by WebCite® at
http://www.webcitation.org/6zTIzBoj5)
59. Fitbit Inc. Fitbit Flex [Internet]. [cited 2018 Apr 17]. Available from: https://client.fitbit.com/en-
ca/shop/flex (Archived by WebCite® at http://www.webcitation.org/6zTIrGkAE)
60. Bromberg SE. Consumer wristband activity monitors as a simple and inexpensive tool for remote
heart failure monitoring [Internet]. [Toronto]: University of Toronto; 2015. Available from:
http://hdl.handle.net/1807/70232
61. Piwek L, Ellis DA, Andrews S, Joinson A. The Rise of Consumer Health Wearables: Promises and
Barriers. PLoS Med [Internet] Public Library of Science; 2016 Feb [cited 2016 Sep
20];13(2):e1001953. PMID:26836780
62. Attal F, Mohammed S, Dedabrishvili M, Chamroukhi F, Oukhellou L, Amirat Y. Physical Human
Activity Recognition Using Wearable Sensors. Sensors (Basel) [Internet] 2015;15(12):31314–38.
PMID:26690450
63. James CJ. Editorial: “Longer term monitoring through wearables brings with it the promise of
predicting the onset of disease - moving from managing illness to maintaining wellness.”. Healthc
Technol Lett [Internet] IET: Institution of Engineering and Technology; 2015 Feb [cited 2016 Sep
20];2(1):1. PMID:26609395
64. Apple Inc. Watch - Apple (CA) [Internet]. 2016. Available from:
https://www.apple.com/ca/watch/
65. Storm FA, Heller BW, Mazzà C. Step detection and activity recognition accuracy of seven
physical activity monitors. PLoS One [Internet] Public Library of Science; 2015 [cited 2018 May
7];10(3):e0118723. PMID:25789630
66. Fitbit Inc. Help article: How does my Fitbit device count steps? [Internet]. Fitbit Help. 2017 [cited
2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1143
67. Diaz KM, Krupka DJ, Chang MJ, Peacock J, Ma Y, Goldsmith J, Schwartz JE, Davidson KW.
Fitbit®: An accurate and reliable device for wireless physical activity tracking. Int J Cardiol. 2015.
PMID:25795203
68. Evenson KR, Goto MM, Furberg RD. Systematic review of the validity and reliability of
consumer-wearable activity trackers. Int J Behav Nutr Phys Act [Internet] 2015 Dec 18 [cited 2017
May 18];12(1):159. PMID:26684758
69. Al M. Personalization of energy expenditure and cardiorespiratory fitness estimation using
wearable sensors in supervised and ... Eindhoven
University of Technology; 2015.
70. Straiton N, Alharbi M, Bauman A, Neubeck L, Gullick J, Bhindi R, Gallagher R. The validity and
reliability of consumer-grade activity trackers in older, community-dwelling adults: A systematic
review. Maturitas [Internet] Elsevier; 2018 Jun 1 [cited 2018 Jul 30];112:85–93. [doi:
10.1016/J.MATURITAS.2018.03.016]
71. ActiGraph Corporation. ActiGraph [Internet]. [cited 2018 Jul 30]. Available from:
https://www.actigraphcorp.com/
72. Fitbit Inc. Help article: What should I know about my heart rate data? [Internet]. Fitbit Help.
2017 [cited 2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1565
73. Kroll RR, Boyd JG, Maslove DM. Accuracy of a Wrist-Worn Wearable Device for Monitoring
Heart Rates in Hospital Inpatients: A Prospective Observational Study. J Med Internet Res
[Internet] 2016 [cited 2016 Sep 22];18(9):e253. PMID:27651304
74. Ra H-K, Ahn J, Yoon HJ, Yoon D, Son SH, Ko J. I am a “Smart” watch, Smart
Enough to Know the Accuracy of My Own Heart Rate Sensor. [cited 2017 May 15]; [doi:
10.1145/3032970.3032977]
75. Allen J. Photoplethysmography and its application in clinical physiological measurement. Physiol
Meas [Internet] 2007 [cited 2017 Nov 7];28:1–39. [doi: 10.1088/0967-3334/28/3/R01]
76. Alian AA, Shelley KH. Photoplethysmography. Best Pract Res Clin Anaesthesiol [Internet]
Baillière Tindall; 2014 Dec 1 [cited 2018 Jul 30];28(4):395–406. [doi: 10.1016/J.BPA.2014.08.006]
77. Maeda Y, Sekine M, Tamura T. The Advantages of Wearable Green Reflected
Photoplethysmography. J Med Syst [Internet] 2011 Oct 18 [cited 2018 Jul 30];35(5):829–834.
PMID:20703690
78. Wang R, Blackburn G, Desai M, Phelan D, Gillinov L, Houghtaling P, Gillinov M. Accuracy of
Wrist-Worn Heart Rate Monitors. JAMA Cardiol
[Internet] 2016 Oct 12 [cited 2016 Nov 10];313(6):625–626. [doi: 10.1001/jamacardio.2016.3340]
79. Cadmus-Bertram L, Gangnon R, Wirkus EJ, Thraen-Borowski KM, Gorzelitz-Liebhauser J. The
Accuracy of Heart Rate Monitoring by Some Wrist-Worn Activity Trackers. Ann Intern Med
[Internet] 2017;10–13. PMID:28395305
80. Cardioo Inc. Cardiio: Heart Rate Monitor (iOS App) [Internet]. Apple Inc; 2012. Available from:
https://itunes.apple.com/ca/app/cardiio-heart-rate-monitor/id542891434?mt=8
81. Laskowski ER. Heart rate: What’s normal? [Internet]. Mayo Clin. 2015 [cited 2018 Jul 31].
Available from: https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/heart-
rate/faq-20057979
82. American Heart Association. All About Heart Rate (Pulse) [Internet]. Am Hear Assoc Website.
2015 [cited 2018 Jul 31]. Available from: https://www.heart.org/en/health-topics/high-blood-
pressure/the-facts-about-high-blood-pressure/all-about-heart-rate-pulse#.Wg1mcBO0OCU
83. Low CA, Bovbjerg DH, Ahrendt S, Choudry MH, Holtzman M, Jones HL, Pingpank JF,
Ramalingam L, Zeh HJ, Zureikat AH, Bartlett DL. Fitbit step counts during inpatient recovery
from cancer surgery as a predictor of readmission. Ann Behav Med [Internet] Oxford University
Press; 2018 Jan 5 [cited 2018 Jul 26];52(1):88–92. [doi: 10.1093/abm/kax022]
84. Hartman SJ, Nelson SH, Weiner LS. Patterns of Fitbit Use and Activity Levels Throughout a
Physical Activity Intervention: Exploratory Analysis from a Randomized Controlled Trial. JMIR
mHealth uHealth [Internet] JMIR mHealth and uHealth; 2018 Feb 5 [cited 2018 Mar 8];6(2):e29.
PMID:29402761
85. Wicklund E. Hospital’s mHealth Project Finds Value in Fitbit Data [Internet].
mHealthIntelligence. 2016 [cited 2018 Jul 26]. Available from:
https://mhealthintelligence.com/news/hospitals-diabetes-mhealth-project-finds-value-in-fitbit-data
86. Apple Inc. Apple Heart Study launches to identify irregular heart rhythms [Internet]. Apple
Newsroom. 2017 [cited 2018 Jul 31]. Available from:
https://www.apple.com/newsroom/2017/11/apple-heart-study-launches-to-identify-irregular-heart-
rhythms/
87. Eadicicco L. EXCLUSIVE: Fitbit Working On Atrial Fibrillation Detection [Internet]. Time. 2017
[cited 2018 Jul 31]. Available from: http://time.com/4907284/fitbit-detect-atrial-fibrillation/
88. Griffith E. When Your Fitbit Goes From Activity Tracker to Personal Medical Device [Internet].
Wired. 2018 [cited 2018 Jul 26]. Available from: https://www.wired.com/story/when-your-activity-
tracker-becomes-a-personal-medical-device/
89. Field MJ, Grigsby J. Telemedicine and Remote Patient Monitoring. JAMA [Internet] American
Medical Association; 2002 Jul 24 [cited 2018 Aug 1];288(4):423. [doi: 10.1001/jama.288.4.423]
90. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle
Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A
Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of
Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253
91. Noah B, Keller MS, Mosadeghi S, Stein L, Johl S, Delshad S, Tashjian VC, Lew D, Kwan JT,
Jusufagic A, Spiegel BMR. Impact of remote patient monitoring on clinical outcomes: an updated
meta-analysis of randomized controlled trials. npj Digit Med [Internet] Nature Publishing Group;
2018 Dec 15 [cited 2018 Aug 1];1(1):20172. [doi: 10.1038/s41746-017-0002-4]
92. Hanlon P, Daines L, Campbell C, McKinstry B, Weller D, Pinnock H. Telehealth Interventions to
Support Self-Management of Long-Term Conditions: A Systematic Metareview of Diabetes, Heart
Failure, Asthma, Chronic Obstructive Pulmonary Disease, and Cancer. J Med Internet Res
[Internet] Journal of Medical Internet Research; 2017 May 17 [cited 2017 May 18];19(5):e172. [doi:
10.2196/jmir.6688]
93. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle
Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A
Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of
Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253
94. Clark RA, Inglis SC, McAlister FA, Cleland JGF, Stewart S. Telemonitoring or structured
telephone support programmes for patients with chronic heart failure: systematic review and meta-
analysis. BMJ [Internet] 2007 May 5 [cited 2018 Apr 4];334(7600):942. PMID:17426062
95. Ware P, Ross HJ, Cafazzo JA, Laporte A, Gordon K, Seto E. Evaluating the Implementation of a
Mobile Phone–Based Telemonitoring Program: Longitudinal Study Guided by the Consolidated
Framework for Implementation Research. JMIR mHealth uHealth [Internet] JMIR mHealth and
uHealth; 2018 Jul 31 [cited 2018 Aug 1];6(7):e10768. [doi: 10.2196/10768]
96. Yun JE, Park J-E, Park H-Y, Lee H-Y, Park D-A. Comparative Effectiveness of Telemonitoring
Versus Usual Care for Heart Failure: A Systematic Review and Meta-analysis. J Card Fail
[Internet] 2018 Jan [cited 2018 Aug 1];24(1):19–28. [doi: 10.1016/j.cardfail.2017.09.006]
97. Klersy C, De Silvestri A, Gabutti G, Raisaro A, Curti M, Regoli F, Auricchio A. Economic impact
of remote patient monitoring: an integrated economic model derived from a meta-analysis of
randomized controlled trials in heart failure. Eur J Heart Fail [Internet] Wiley-Blackwell; 2011 Apr
1 [cited 2018 Aug 1];13(4):450–459. [doi: 10.1093/eurjhf/hfq232]
98. Ong MK, Romano PS, Edgington S, Aronow HU, Auerbach AD, Black JT, De Marco T, Escarce
JJ, Evangelista LS, Hanna B, Ganiats TG, Greenberg BH, Greenfield S, Kaplan SH, Kimchi A,
Liu H, Lombardo D, Mangione CM, Sadeghi B, Sadeghi B, Sarrafzadeh M, Tong K, Fonarow GC.
Effectiveness of Remote Patient Monitoring After Discharge of Hospitalized Patients With Heart
Failure. JAMA Intern Med [Internet] American Medical Association; 2016 Mar 1 [cited 2018 Aug
1];176(3):310. [doi: 10.1001/jamainternmed.2015.7712]
99. Chaudhry SI, Mattera JA, Curtis JP, Spertus JA, Herrin J, Lin Z, Phillips CO, Hodshon B V.,
Cooper LS, Krumholz HM. Telemonitoring in Patients with Heart Failure. N Engl J Med
[Internet] Massachusetts Medical Society ; 2010 Dec 9 [cited 2018 Aug 1];363(24):2301–2309. [doi:
10.1056/NEJMoa1010029]
100. Ware P, Seto E, Ross HJ. Accounting for Complexity in Home Telemonitoring: A Need for
Context-Centred Evidence. Can J Cardiol [Internet] Elsevier; 2018 Jul 1 [cited 2018 Aug
1];34(7):897–904. [doi: 10.1016/J.CJCA.2018.01.022]
101. Centre for Global eHealth Innovation. Medly - Chronic Complex Diseases Self-care Management
[Internet]. 2016 [cited 2016 Oct 30]. Available from: http://ehealthinnovation.org/what-we-
do/projects/medly/
102. Healthcare Human Factors. Medly: Managing Chronic Conditions [Internet]. 2016 [cited 2016 Oct
30]. Available from: http://humanfactors.ca/projects/medly/
103. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Mobile phone-based
telemonitoring for heart failure management: a randomized controlled trial. J Med Internet Res
2012;14(1):1–14. PMID:22356799
104. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Developing healthcare rule-
based expert systems: Case study of a heart failure telemonitoring system. Int J Med Inform
[Internet] Elsevier Ireland Ltd; 2012;81(8):556–565. PMID:22465288
105. Seto E, Leonard KJ, Masino C, Cafazzo JA, Barnsley J, Ross HJ. Attitudes of heart failure
patients and health care providers towards mobile phone-based remote monitoring. J Med Internet
Res 2010;12(4):3–12. PMID:21115435
106. Smith C, McGuire B, Huang T, Yang G. The History of Artificial Intelligence [Internet]. Seattle:
University of Washington; 2006 [cited 2018 Apr 4]. p. 27. Available from:
https://courses.cs.washington.edu/courses/csep590/06au/projects/history-ai.pdf
107. Anyoha R. The History of Artificial Intelligence [Internet]. Sci News. 2017 [cited 2018 Aug 4].
Available from: http://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/
108. McCarthy J, Minsky ML, Rochester N, Shannon CE. A Proposal for the Dartmouth Summer
Research Project on Artificial Intelligence [Internet]. Dartmouth; 1955 [cited 2018 Aug 4].
Available from: http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
109. Coward C. AI and the Ghost in the Machine [Internet]. hackaday. 2017 [cited 2018 Aug 4].
Available from: https://hackaday.com/2017/02/06/ai-and-the-ghost-in-the-machine/
110. Shu-Hsien Liao. Expert system methodologies and applications—a decade review from 1995 to
2004. Expert Syst Appl [Internet] Pergamon; 2005 Jan 1 [cited 2018 Aug 4];28(1):93–103. [doi:
10.1016/J.ESWA.2004.08.003]
111. Segaran T. Programming collective intelligence : building smart web 2.0 applications. O’Reilly;
2007. ISBN:9780596529321
112. Brownlee J. Supervised and Unsupervised Machine Learning Algorithms [Internet]. Mach Learn
Mastery. 2016 [cited 2018 Aug 6]. Available from:
https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
113. Alpaydin E. Introduction to Machine Learning (Adaptive Computation and Machine Learning)
[Internet]. MIT Press; 2004 [cited 2018 Aug 6]. Available from:
https://dl.acm.org/citation.cfm?id=1036287 ISBN:0262012111
114. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai
M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, Van Den Driessche G, Graepel T, Hassabis D.
Mastering the game of Go without human knowledge. Nature [Internet] Nature Publishing Group;
2017;550(7676):354–359. PMID:29052630
115. OpenAI Five [Internet]. OpenAI. 2018 [cited 2018 Aug 6]. Available from:
https://blog.openai.com/openai-five/
116. Savov V. The OpenAI Dota 2 bots just defeated a team of former pros [Internet]. The Verge. 2018
[cited 2018 Aug 6]. Available from: https://www.theverge.com/2018/8/6/17655086/dota2-openai-
bots-professional-gaming-ai
117. Thompson T. Zerg Rush: A History of StarCraft AI Research [Internet]. Medium. 2018 [cited 2018
Aug 6]. Available from: https://medium.com/@t2thompson/zerg-rush-a-history-of-starcraft-ai-
research-4478759a3c53
118. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition.
Proc IEEE [Internet] 1989 [cited 2017 Aug 28];77(2):257–286. [doi: 10.1109/5.18626]
119. Visser I, Raijmakers MEJ, van der Maas HLJ. Hidden Markov Models for Individual Time Series.
In: Valsiner J, Molenaar PCM, Lyra MCDP, Chaudhary N, editors. Dyn Process Methodol Soc
Dev Sci 2009. p. 269–289. PMID:25246403
120. Iskandar J. RPubs - Classifying Seizure State (using R package depmixS4) [Internet]. RPubs; 2014
[cited 2017 Aug 30]. p. 6. Available from: https://rpubs.com/jimmyiskandar/30484
121. Mannini A, Sabatini AM. Machine Learning Methods for Classifying Human Physical Activity
from On-Body Accelerometers. Sensors [Internet] Molecular Diversity Preservation International;
2010 Feb 1 [cited 2017 Aug 22];10(2):1154–1175. [doi: 10.3390/s100201154]
122. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for
classification performance. BMC Med Inform Decis Mak [Internet] 2012 Dec 15 [cited 2017 Oct
7];12(1):8. [doi: 10.1186/1472-6947-12-8]
123. Brownlee J. How Much Training Data is Required for Machine Learning? [Internet]. Mach Learn
Mastery. 2017 [cited 2017 Oct 7]. Available from: https://machinelearningmastery.com/much-
training-data-required-machine-learning/
124. Denham L. Aren’t The IoT, Big Data And Machine Learning The Same? [Internet]. Innov Enterp.
2017 [cited 2018 Aug 20]. Available from:
https://channels.theinnovationenterprise.com/articles/aren-t-the-iot-big-data-and-machine-
learning-the-same
125. Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification
models. Anal Chim Acta [Internet] Elsevier; 2013 Jan 14 [cited 2017 Oct 7];760:25–33. [doi:
10.1016/j.aca.2012.11.007]
126. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a
simulation study for predicting dichotomous endpoints. BMC Med Res Methodol [Internet] 2014
Dec 22 [cited 2018 Aug 20];14(1):137. [doi: 10.1186/1471-2288-14-137]
127. Tripoliti EE, Papadopoulos TG, Karanasiou GS, Naka KK, Fotiadis DI. Heart Failure: Diagnosis,
Severity Estimation and Prediction of Adverse Events Through Machine Learning Techniques.
Comput Struct Biotechnol J [Internet] 2017 [cited 2017 Oct 7];15:26–47. [doi:
10.1016/j.csbj.2016.11.001]
128. Pecchia L, Melillo P, Bracale M. Remote Health Monitoring of Heart Failure With Data Mining
via CART Method on HRV Features. IEEE Trans Biomed Eng [Internet] 2011 Mar [cited 2018
Aug 6];58(3):800–804. [doi: 10.1109/TBME.2010.2092776]
129. Shaffer F, Ginsberg JP. An Overview of Heart Rate Variability Metrics and Norms. Front public
Heal [Internet] Frontiers Media SA; 2017 [cited 2018 Aug 7];5:258. PMID:29034226
130. Melillo P, Fusco R, Sansone M, Bracale M, Pecchia L. Discrimination power of long-term heart
rate variability measures for chronic heart failure detection. Med Biol Eng Comput [Internet]
Springer-Verlag; 2011 Jan 4 [cited 2018 Aug 6];49(1):67–74. [doi: 10.1007/s11517-010-0728-5]
131. Pecchia L, Melillo P, Sansone M, Bracale M. Discrimination Power of Short-Term Heart Rate
Variability Measures for CHF Assessment. IEEE Trans Inf Technol Biomed [Internet] 2011 Jan
[cited 2018 Aug 6];15(1):40–46. [doi: 10.1109/TITB.2010.2091647]
132. Panina G, Khot UN, Nunziata E, Cody RJ, Binkley PF. Role of spectral measures of heart rate
variability as markers of disease progression in patients with chronic congestive heart failure not
treated with angiotensin-converting enzyme inhibitors. Am Heart J [Internet] Mosby; 1996 Jan 1
[cited 2018 Aug 6];131(1):153–157. [doi: 10.1016/S0002-8703(96)90064-2]
133. Mietus JE, Peng C-K, Henry I, Goldsmith RL, Goldberger AL. The pNNx files: re-examining a
widely used heart rate variability measure. Heart [Internet] BMJ Publishing Group Ltd; 2002 Oct
1 [cited 2018 Aug 6];88(4):378–80. PMID:12231596
134. Casolo GC, Stroder P, Sulla A, Chelucci A, Freni A, Zerauschek M. Heart rate variability and
functional severity of congestive heart failure secondary to coronary artery disease. Eur Heart J
[Internet] Oxford University Press; 1995 Mar 1 [cited 2018 Aug 6];16(3):360–367. [doi:
10.1093/oxfordjournals.eurheartj.a060919]
135. Goldsmith R. Congestive Heart Failure RR Interval Database [Internet]. [cited 2018 Aug 6]. [doi:
10.13026/C2F598]
136. Melillo P, De Luca N, Bracale M, Pecchia L. Classification Tree for Risk Assessment in Patients
Suffering From Congestive Heart Failure via Long-Term Heart Rate Variability. IEEE J Biomed
Heal Informatics [Internet] 2013 May [cited 2018 Aug 6];17(3):727–733. [doi:
10.1109/JBHI.2013.2244902]
137. Beth Israel Deaconess Medical Center. The BIDMC Congestive Heart Failure Database [Internet].
PhysioNet. 1986 [cited 2018 Aug 6]. [doi: 10.13026/C29G60]
138. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Plamen CI, Mark RG, Mietus JE, Moody
GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet Components of a New
Research Resource for Complex Physiologic Signals. Circulation [Internet] 2000 [cited 2018 Aug
6];(101):215–220. [doi: 10.1161/circ.101.23.e215]
139. Witten IH (Ian H., Frank E, Hall MA (Mark A, Pal CJ. Data mining : practical machine learning
tools and techniques. ISBN:9780128042915
140. Vanwinckelen G, Blockeel H. On Estimating Model Accuracy with Repeated Cross-Validation.
[cited 2018 Apr 25]; Available from:
https://lirias.kuleuven.be/bitstream/123456789/346385/3/OnEstimatingModelAccuracy.pdf
141. Forman G, Scholz M. Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier
Performance Measurement. SIGKDD Explor [Internet] 2010 [cited 2017 Nov 3];12(1):49–57.
Available from: http://www.kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf
142. Shahbazi F, Asl BM. Generalized discriminant analysis for congestive heart failure risk assessment
based on long-term heart rate variability. Comput Methods Programs Biomed [Internet] Elsevier;
2015 Nov 1 [cited 2018 Aug 6];122(2):191–198. [doi: 10.1016/J.CMPB.2015.08.007]
143. Baudat G, Anouar F. Generalized Discriminant Analysis Using a Kernel Approach. Neural
Comput [Internet] MIT Press; 2000 Oct 13 [cited 2018 Aug 6];12(10):2385–2404. [doi:
10.1162/089976600300014980]
144. Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point.
Biom J [Internet] 2005 Aug [cited 2018 Aug 7];47(4):458–72. PMID:16161804
145. Guiqiu Yang, Yinzi Ren, Qing Pan, Gangmin Ning, Shijin Gong, Guolong Cai, Zhaocai Zhang, Li
Li, Jing Yan. A heart failure diagnosis model based on support vector machine. 2010 3rd Int Conf
Biomed Eng Informatics [Internet] IEEE; 2010 [cited 2018 Aug 6]. p. 1105–1108. [doi:
10.1109/BMEI.2010.5639619]
146. Wu H-T, Soliman EZ. A new approach for analysis of heart rate variability and QT variability in
long-term ECG recording. Biomed Eng Online [Internet] BioMed Central; 2018 Dec 3 [cited 2018
Aug 7];17(1):54. [doi: 10.1186/s12938-018-0490-8]
147. Pang D, Igasaki T, Maehara J. Long-term monitoring of heart rate variability toward practical use
in intensive/high care unit. 2016 9th Biomed Eng Int Conf [Internet] IEEE; 2016 [cited 2018 Aug
7]. p. 1–6. [doi: 10.1109/BMEiCON.2016.7859631]
148. Baril J-F, Bromberg S, Moayedi Y, Taati B, Manlhiot C, Ross HJ, Cafazzo J. Use of free-living
step count monitoring for heart failure functional classification: a validation study. Toronto: JMIR
Cardio; 2018. [doi: 10.2196/preprints.12122]
149. Stein KM, Mittal S, Merkel S, Meye TE. Baseline Physical Activity and NYHA Classification
Affects Future Ventricular Event Rates in a General ICD Population. J Card Fail [Internet]
Churchill Livingstone; 2006 Aug 1 [cited 2017 Oct 13];12(6):S58. [doi:
10.1016/J.CARDFAIL.2006.06.203]
150. Bromberg SE. googlefitbit [Internet]. Toronto; 2015. Available from:
https://github.com/simonbromberg/googlefitbit
151. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna,
Austria; 2017. Available from: https://www.r-project.org
152. RStudio Team. RStudio: Integrated Development Environment for R [Internet]. Boston, MA;
2015. Available from: http://www.rstudio.com/
153. Wickham H. A Layered Grammar of Graphics. 2010 [cited 2017 May 31]; [doi:
10.1198/jcgs.2009.07098]
154. Arnold JB. ggthemes: Extra Themes, Scales and Geoms for “ggplot2” [Internet]. 2017. Available
from: https://cran.r-project.org/package=ggthemes
155. Wickham H. The Split-Apply-Combine Strategy for Data Analysis. J Stat Softw [Internet]
2011;40(1):1–29. Available from: http://www.jstatsoft.org/v40/i01/
156. Wickham H, Francois R, Henry L, Müller K. dplyr: A Grammar of Data Manipulation [Internet].
2017. Available from: https://cran.r-project.org/package=dplyr
157. Wickham H. Reshaping Data with the {reshape} Package. J Stat Softw [Internet] 2007;21(12):1–
20. Available from: http://www.jstatsoft.org/v21/i12/
158. Hester J. glue: Interpreted String Literals [Internet]. 2017. Available from: https://cran.r-
project.org/package=glue
159. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Perceptions and experiences of
heart failure patients and clinicians on the use of mobile phone-based telemonitoring. J Med
Internet Res 2012;14(1):1–15. PMID:22328237
160. Intel Corporation. Safety Recall Notice for all Basis PeakTM Watches [Internet]. 2018 [cited 2018
Aug 13]. Available from:
https://www.intel.ca/content/www/ca/en/support/articles/000025310/emerging-
technologies/wearable-devices.html
161. Somerville H. Jawbone’s demise a case of “death by overfunding” in Silicon Valley | Reuters
[Internet]. Thomson Reuters. 2018 [cited 2018 Aug 14]. Available from:
https://www.reuters.com/article/us-jawbone-failure/jawbones-demise-a-case-of-death-by-
overfunding-in-silicon-valley-idUSKBN19V0BS
162. Alharbi M, Straiton N, Gallagher R. Harnessing the Potential of Wearable Activity Trackers for
Heart Failure Self-Care. [cited 2017 May 15]; [doi: 10.1007/s11897-017-0318-z]
163. Apple Inc. HealthKit - Apple Developer [Internet]. 2018 [cited 2018 Aug 14]. Available from:
https://developer.apple.com/healthkit/
164. empatica. E4 wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.empatica.com/research/e4/
165. Fitbit Inc. Fitbit SDK [Internet]. 2018. Available from: https://dev.fitbit.com/
166. Fitbit Inc. AltaHR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/altahr
167. Fitbit AltaTM Fitness Wristband [Internet]. [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/alta
168. Fitbit Inc. Fitbit Flex 2TM Fitness Wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/flex2
169. Fitbit Inc. Fitbit IonicTM Watch [Internet]. 2018. [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/ionic
170. Fitbit Inc. Fitbit Versa [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/versa
171. Garmin. Home | Garmin Developers [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://developer.garmin.com/
172. Garmin. fenix 5 [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://buy.garmin.com/en-
CA/CA/p/552982
173. Garmin. vivosmart [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://buy.garmin.com/en-US/US/p/154886
174. Google Developers. Google Fit [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://developers.google.com/fit/
175. Huawei Technology Co. Ltd. Huawei Watch 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://consumer.huawei.com/ca/wearables/watch2/
176. LG Electronics. LG Smart Watch Sport for AT&T With Android Wear 2.0 | LG USA [Internet].
2018 [cited 2018 Aug 13]. Available from: https://www.lg.com/us/smart-watches/lg-W280A-sport
177. mc10. BiostampRC System [Internet]. Available from: https://www.mc10inc.com/our-
products/biostamprc
178. Misfit. Build @ Misfit [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://build.misfit.com/
179. Misfit. Misfit Flare [Internet]. 2018. Available from: https://misfit.com/misfit-flare
180. Misfit. Misfit Phase [Internet]. 2018. Available from: https://misfit.com/misfit-phase
181. Misfit. Misfit Ray [Internet]. 2018. Available from: https://misfit.com/misfit-ray
182. Misfit. Misfit Shine. 2018.
183. Misfit. Misfit Shine 2 [Internet]. 2018. Available from: https://misfit.com/misfit-shine-2
184. Misfit. Misfit Vapor [Internet]. 2018. Available from: https://misfit.com/misfit-vapor
185. Moov Inc. Moov HR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://welcome.moov.cc/moovhr/
186. Moov Inc. Moov Now [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://welcome.moov.cc/moovnow/
187. Nokia. Nokia Health API [Internet]. 2018 [cited 2018 Aug 13]. Available from:
http://developer.health.nokia.com/oauth2/
188. Nokia | Withings. Nokia Go [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/go
189. Nokia | Withings. Nokia Steel [Internet]. [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/steel
190. Nokia | Withings. Nokia Steel HR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/steel-hr
191. TomTom Sports Team. TomTom Sports Cloud [Internet]. 2018. Available from:
https://developer.tomtom.com/tomtom-sports-cloud
192. TomTom. TomTom Spark 3 Cardio + Music GPS Fitness Watch [Internet]. 2018 [cited 2018 Aug
13]. Available from: https://www.tomtom.com/en_ca/sports/fitness-trackers/gps-fitness-watch-
cardio-music-spark3/black-large/
193. TomTom. TomTom Touch Fitness Tracker [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.tomtom.com/en_ca/sports/fitness-trackers/fitness-tracker-touch/black-large/
194. Under Armour I. Under Armour UA Band [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.underarmour.com/en-ca/ua-band
195. Wavelet Health. Products [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://wavelethealth.com/products/
196. MI. Mi Band [Internet]. 2018. [cited 2018 Aug 13]. Available from:
https://www.mi.com/en/miband/
197. MI. Mi Band 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.mi.com/en/miband2/
198. Baril J-F. fitbit4research [Internet]. Toronto; 2018 [cited 2018 Aug 16]. Available from:
https://github.com/cosmomeese/fitbit4research
199. Tufte ER. The visual display of quantitative information. Graphics Press; 2001. ISBN:1930824130
200. Wong DM. The Wall Street journal guide to information graphics : the dos and don’ts of
presenting data, facts, and figures. ISBN:0393347281
201. Tufte ER, McKay SR, Christian W, Matey JR. Visual Explanations: Images and Quantities,
Evidence and Narrative. Comput Phys 1998; PMID:1659109
202. Zhang J, Johnson TR, Patel VL, Paige DL, Kubose T. Using usability heuristics to evaluate
patient safety of medical devices. 2003;36:23–30. [doi: 10.1016/S1532-0464(03)00060-1]
203. Tognazzini B. First Principles of Interaction Design (Revised & Expanded) | askTog [Internet].
askTog.com. [cited 2017 Jan 13]. Available from: http://asktog.com/atc/principles-of-interaction-
design/
204. Nielsen J. 10 Heuristics for User Interface Design [Internet]. Nielsen Norman Gr. 1995 [cited 2017
Jan 13]. Available from: https://www.nngroup.com/articles/ten-usability-heuristics/
205. Norman DA. The Design of Everyday Things [Internet]. Hum Factors Ergon Manuf. 2013.
PMID:13182255ISBN:0465067107
206. Laussen PC, Almodovar M, Goodwin A, Sick Kids: The Hospital for Sick Children. T3 - Tracking,
trajectory and trigger tool [Internet]. Crit Care Med Programs Serv. 2018. Available from:
http://www.sickkids.ca/Critical-Care/programs-and-services/T3/index.html
207. Laussen PC. Precision monitoring. Crit Care Canada Forum [Internet] Toronto; 2015 [cited 2018
Aug 15]. Available from:
https://criticalcarecanada.com/presentations/2015/precision_monitoring.pdf
208. Guerguerian A-M. BME1439 Critical Care Instrumentation Lecture. Toronto; 2016.
209. Fitbit Inc. Accessing the Fitbit API [Internet]. Fitbit Dev Website. 2018. Available from:
https://dev.fitbit.com/build/reference/web-api/oauth2/
210. Fitbit Inc. Fitbit Platform Terms of Service (Revised August 1st, 2018) [Internet]. Fitbit Dev
Website. 2018. Available from: https://dev.fitbit.com/legal/platform-terms-of-service/
211. Canadian Radio-television and Telecommunications Commission. Communications Monitoring
Report 2017: Canada’s Communication System: An Overview for Canadians (Table 2.0.6)
[Internet]. Ottawa; 2017. Available from:
https://crtc.gc.ca/eng/publications/reports/policymonitoring/2017/cmr2.htm#s20i
212. Mobile Operating System Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29].
Available from: http://gs.statcounter.com/os-market-share/mobile/canada/#monthly-201706-
201711
213. Mobile iOS Version Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29].
Available from: http://gs.statcounter.com/ios-version-market-share/mobile/canada/#monthly-
201611-201711
214. Hermsen S, Moons J, Kerkhof P, Wiekens C, De Groot M. Determinants for Sustained Use of an
Activity Tracker: Observational Study. JMIR mHealth uHealth [Internet] JMIR Publications Inc.;
2017 Oct 30 [cited 2018 Aug 18];5(10):e164. PMID:29084709
215. Cafazzo J, St-Cyr O. From Discovery to Design: The Evolution of Human Factors in Healthcare.
Healthc Q [Internet] 2012 Apr 11 [cited 2018 Aug 18];15(sp):24–29. [doi: 10.12927/hcq.2012.22845]
216. Canadian Patient Safety Institute, Institute for Safe Medication Practices Canada, Saskatchewan
Health, Patients for Patient Safety Canada, Beard P, Hoffman CE, Ste-Marie M. Canadian
Incident Analysis Framework [Internet]. Edmonton, AB; 2012. Available from:
http://www.patientsafetyinstitute.ca/en/toolsResources/PatientSafetyIncidentManagementToolkit
/Documents/CIAF Key Features - Analysis Process.pdf
217. Wickham H. tidyverse: Easily Install and Load the “Tidyverse” [Internet]. 2017. Available from:
https://cran.r-project.org/package=tidyverse
218. Wolf HP. aplpack: Another Plot Package: “Bagplots”, “Iconplots”, “Summaryplots”, Slider
Functions and Others [Internet]. 2018 [cited 2018 Aug 17]. Available from: https://cran.r-
project.org/web/packages/aplpack/index.html
219. Champely S. PairedData: Paired Data Analysis [Internet]. 2018 [cited 2018 Aug 17]. Available
from: https://cran.r-project.org/web/packages/PairedData/index.html
220. Jurafsky D, Martin J. Hidden Markov Models. Speech Lang Process [Internet] 3rd ed Pearson;
2017 [cited 2017 Nov 11]. p. 21. Available from: https://web.stanford.edu/~jurafsky/slp3/9.pdf
221. Bobick A, Essa I, Chakraborty A, Udacity. Markov Models [Internet]. Udacity Introd to Comput
Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from:
https://www.youtube.com/watch?v=4XqWadvEj2k
222. Gagniuc PA. Markov chains: from theory to implementation and experimentation. 1st ed. John
Wiley and Sons, Inc; 2017. [doi: 10.1002/9781119387596]ISBN:9781119387558
223. O’Connell J, Højsgaard S. Hidden Semi Markov Models for Multiple Observation Sequences: The
mhsmm Package for R. J Stat Softw [Internet] 2011 [cited 2017 Nov 1];39(4):1–22. [doi:
10.18637/jss.v039.i04]
224. Bobick A, Essa I, Chakraborty A, Udacity. Hidden Markov Models [Internet]. Udacity Introd to
Comput Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from:
https://www.youtube.com/watch?v=5araDjcBHMQ
225. O’Connell J, Højsgaard S. Package “mhsmm.” CRAN 2017;(0.4.16).
226. Altman RM. Mixed Hidden Markov Models: An Extension of the Hidden Markov Model to the
Longitudinal Data Setting. J Am Stat Assoc [Internet] 2007 [cited 2017 Aug 28];102(477):201–210.
[doi: 10.1198/016214506000001086]
227. Visser I, Speekenbrink M. depmixS4: An R Package for Hidden Markov Models [Internet].
Available from: http://cran.r-project.org/package=depmixS4.
228. Visser I, Speekenbrink M. depmixS4: Dependent Mixture Models - Hidden Markov Models of
GLMs and Other Distributions in S4 [Internet]. 2016 [cited 2018 Aug 23]. Available from:
https://cran.r-project.org/web/packages/depmixS4/index.html
229. Rohan. Can something be statistically impossible? [Internet]. Math Stack Exch. 2016 [cited 2018
Aug 24]. Available from: https://math.stackexchange.com/q/2049722
230. Pohlmann KC. Principles of digital audio. McGraw-Hill; 2011. ISBN:9780071663465
231. Farmer WC, editor. Ordnance Field Guide: Restricted, Volume 2 [Internet]. Military service
publishing company; 1944 [cited 2018 Aug 24]. Available from:
https://books.google.ca/books?id=15ffO4UVw8QC&q=dither&redir_esc=y
232. Analog Devices. A Technical Tutorial on Digital Signal Synthesis [Internet]. 1999. Available from:
http://www.analog.com/media/cn/training-seminars/tutorials/450968421DDS_Tutorial_rev12-2-
99.pdf
233. Mannix BF. Races, Rushes, and Runs: Taming the Turbulence in Financial Trading [Internet].
Washington; 2013. Available from: www.regulatorystudies.gwu.edu
234. Floyd RW, Steinberg L. An Adaptive Algorithm for Spatial Greyscale. Proc Soc Inf Disp
1976;17(2):75–77.
235. Roberts LG. Picture Coding Using Pseudo-Random Noise. IRE Trans Inf Theory 1962;8(2):145–
154. [doi: 10.1109/TIT.1962.1057702]
236. Wikipedia Contributors. Dither [Internet]. Wikipedia, Free Encycl. 2018 [cited 2018 Aug 24].
Available from: https://en.wikipedia.org/wiki/Dither
237. Fox J. Generalized Linear Models. Appl Regres Gen Linear Model [Internet] SAGE Publications;
2015 [cited 2018 Aug 27]. p. 379–424. Available from: http://kilpatrick.eeb.ucsc.edu/wp-
content/uploads/2015/04/GLMs-Chapter_15.pdf
238. Rigollet P. Lecture 21. Generalized Linear Models from MIT 18.650: Statistics for Applications
[Internet]. YouTube; 2016 [cited 2018 Aug 27]. Available from:
https://www.youtube.com/watch?v=X-ix97pw0xY
239. Gao J, Fan W, Han J. On the Power of Ensemble: Supervised and Unsupervised Methods
Reconciled. Tutor SIAM Data Min Conf [Internet] Columbus, OH; 2010 [cited 2018 Aug 27].
Available from: https://cse.buffalo.edu/~jing/sdm10ensemble.htm
240. Grover P. Gradient Boosting from scratch [Internet]. ML Rev. 2017 [cited 2018 Aug 27]. Available
from: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
241. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436–444.
PMID:26017442
242. Parloff R. The AI Revolution: Why Deep Learning Is Suddenly Changing Your Life [Internet].
Fortune. 2016 [cited 2018 Aug 29]. Available from: http://fortune.com/ai-artificial-intelligence-
deep-machine-learning/
243. Goodfellow I, Bengio Y, Courville A. Deep Learning [Internet]. 2016. Available from:
http://www.deeplearningbook.org
244. Zekić-Sušac M, Šarlija N, Pfeifer S. Combining PCA Analysis And Artificial Neural Networks In
Modelling Entrepreneurial Intentions Of Students. Croat Oper Res Rev [Internet] 2013 Feb 1
[cited 2018 Aug 29];4(1):306–317. Available from:
https://hrcak.srce.hr/index.php?id_clanak_jezik=143365&show=clanak
245. Seuret M, Alberti M, Ingold R, Liwicki M. PCA-Initialized Deep Neural Networks Applied To
Document Image Analysis [Internet]. Available from: https://arxiv.org/pdf/1702.00177.pdf
246. Marsupial D. Does Neural Networks based classification need a dimension reduction [Internet].
Cross Validated. 2013 [cited 2018 Aug 29]. Available from:
https://stats.stackexchange.com/q/67988
247. Hartmann WM. Dimension Reduction vs. Variable Selection. Springer, Berlin, Heidelberg; 2006
[cited 2018 Aug 29]. p. 931–938. [doi: 10.1007/11558958_113]
248. Sorzano COS, Vargas J, Pascual-Montano A. A survey of dimensionality reduction techniques
[Internet]. [doi: arXiv:1403.2877]
249. Turck N, Vutskits L, Sanchez-Pena P, Robin X, Hainard A, Gex-Fabry M, Fouda C, Bassem H,
Mueller M, Lisacek F, Puybasset L, Sanchez J-C. pROC: an open-source package for R and S+ to
analyze and compare ROC curves. BMC Bioinformatics [Internet] BioMed Central; 2011 Mar 17
[cited 2017 Nov 1];12(77). [doi: 10.1007/s00134-009-1641-y]
250. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M, Siegert S. Package
“pROC.” CRAN [Internet] 2017 [cited 2017 Nov 1];(1.10). Available from: https://cran.r-
project.org/web/packages/pROC/pROC.pdf
251. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. caret:
Classification and Regression Training [Internet]. 2017. Available from: https://cran.r-
project.org/package=caret
252. Kuhn M. Predictive Modeling with R and the caret Package. useR! R User Conf [Internet]
Albacete, Spain; 2013 [cited 2018 Aug 21]. Available from: http://www.edii.uclm.es/~useR-
2013/Tutorials/kuhn/user_caret_2up.pdf
253. Lumley T, Miller A. leaps: Regression Subset Selection [Internet]. 2017. Available from:
https://cran.r-project.org/package=leaps
254. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T.
preProcess function [Internet]. R Doc. 2017 [cited 2018 Aug 30]. Available from:
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/preProcess
255. Schwarz G. Estimating the Dimension of a Model. Ann Stat [Internet] Institute of Mathematical
Statistics; 1978 Mar [cited 2018 Aug 30];6(2):461–464. [doi: 10.1214/aos/1176344136]
256. Refaeilzadeh P, Tang L, Liu H. Cross-Validation. In: Liu L, Özsu MT, editors. Encycl Database
Syst [Internet] Boston, MA: Springer US; 2009 [cited 2018 Aug 25]. p. 532–538. [doi: 10.1007/978-
0-387-39940-9_565]
257. Zemel R. Ensemble Methods from University of Toronto CSC411 Machine Learning & Data
Mining [Internet]. Toronto; 2014. Available from:
http://www.cs.toronto.edu/~rsalakhu/CSC411/notes/lecture_ensemble1.pdf
258. Ng A. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep
Learning [draft] [Internet]. draft. deeplearning.ai. 2018. Available from:
https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/704291d2-365e-45bf-a9f5-
719959dfe415/Ng_MLY01.pdf
259. Brownlee J. Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning [Internet].
Mach Learn Mastery. 2016 [cited 2018 Aug 25]. Available from:
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-
learning/
260. Geng D, Shih S. Machine Learning Crash Course: Part 4 - The Bias-Variance Dilemma [Internet].
Mach Learn @ Berkeley. 2017 [cited 2018 Aug 25]. Available from:
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
261. Sicotte XB. Bias and variance in leave-one-out vs K-fold cross validation [Internet]. Cross
Validated. 2018 [cited 2018 Aug 25]. Available from: https://stats.stackexchange.com/q/357749
262. Little MA, Varoquaux G, Saeb S, Lonini L, Jayaraman A, Mohr DC, Kording KP. Using and
understanding cross-validation strategies. Perspectives on Saeb et al. Gigascience [Internet] Oxford
University Press; 2017 May 1 [cited 2018 Aug 25];6(5):1–6. PMID:28327989
263. Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection. Proc 14th Int Jt Conf Artif Intell - Vol 2 [Internet] Montreal: Morgan Kaufmann
Publishers Inc.; 1995 [cited 2018 Aug 30]. p. 1137–1143. Available from:
http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf
264. Bengio Y, Grandvalet Y. No Unbiased Estimator of the Variance of K-Fold Cross-Validation.
J Mach Learn Res [Internet] 2004 [cited 2018 Aug 31];5:1089–
1105. Available from: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf
265. Zhang Y, Yang Y. Cross-validation for selecting a model selection procedure. J Econom [Internet]
2015 Jul [cited 2018 Aug 31];187(1):95–112. [doi: 10.1016/j.jeconom.2015.02.006]
266. Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J Am
Stat Assoc [Internet] 1983 Jun [cited 2018 Aug 31];78(382):316–331. [doi:
10.1080/01621459.1983.10477973]
267. Sicotte XB. Variance of K-fold cross-validation estimates as f(K): what is the role of “stability”?
[Internet]. Cross Validated. 2018. Available from: https://stats.stackexchange.com/q/358278
268. National Health Service. Blood tests - Overview [Internet]. Natl Heal Serv. 2016 [cited 2018 Aug
31]. Available from: https://www.nhs.uk/conditions/blood-tests/
269. The Royal College of Pathologists of Australasia. Pathology: The Facts [Internet]. 2013. Available
from:
http://www.health.gov.au/internet/publications/publishing.nsf/Content/CA2578620005D57ACA2
57B6A000862D3/$File/What I Should Know Pathology-FS.pdf
270. Dynacare. After My Test [Internet]. [cited 2018 Aug 31]. Available from:
https://www.dynacare.ca/patients-and-individuals/preparation-and-tips/after-my-test.aspx
271. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. varImp
function [Internet]. R Doc. 2017 [cited 2018 Aug 31]. Available from:
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/varImp
272. Habbu A, Lakkis NM, Dokainish H. The Obesity Paradox: Fact or Fiction? Am J Cardiol
[Internet] Excerpta Medica; 2006 Oct 1 [cited 2018 Sep 24];98(7):944–948. [doi:
10.1016/J.AMJCARD.2006.04.039]
273. Curtis JP, Selter JG, Wang Y, Rathore SS, Jovin IS, Jadbabaie F, Kosiborod M, Portnay EL,
Sokol SI, Bader F, Krumholz HM. The Obesity Paradox. Arch Intern Med [Internet] 2005 Jan 10
[cited 2018 Sep 24];165(1):55. [doi: 10.1001/archinte.165.1.55]
274. Kenchaiah S, Evans JC, Levy D, Wilson PWF, Benjamin EJ, Larson MG, Kannel WB, Vasan RS.
Obesity and the Risk of Heart Failure. N Engl J Med [Internet] 2002 Aug [cited 2018 Sep
24];347(5):305–313. [doi: 10.1056/NEJMoa020245]
275. Mosterd A. The prognosis of heart failure in the general population. The Rotterdam Study. Eur
Heart J [Internet] 2001 Aug 1 [cited 2018 Sep 24];22(15):1318–1327. [doi: 10.1053/euhj.2000.2533]
276. Iliodromiti S, Celis-Morales CA, Lyall DM, Anderson J, Gray SR, Mackay DF, Nelson SM, Welsh
P, Pell JP, Gill JMR, Sattar N. The impact of confounding on the associations of different
adiposity measures with the incidence of cardiovascular disease: a cohort study of 296 535 adults of
white European descent. Eur Heart J [Internet] Oxford University Press; 2018 May 1 [cited 2018
Sep 24];39(17):1514–1520. [doi: 10.1093/eurheartj/ehy057]
277. Mailund T, Storm Pedersen CN. Machine Learning in Bioinformatics Lecture Week 5 - Hidden
Markov Models Selecting model parameters or “training” Hidden Markov Models [Internet].
Aarhus, Denmark; 2014 [cited 2017 Aug 28]. p. 56. Available from: http://users-
birc.au.dk/cstorm/courses/MLiB_f14/slides/hidden-markov-models-4.pdf
278. Jelinek B. Review on Training Hidden Markov Models with Multiple Observations. [cited 2017
Aug 28]; Available from:
https://www.isip.piconepress.com/courses/msstate/ece_8443/papers/2001_spring/multi_obs/p00
_paper_v0.pdf
279. user34790, de Azevdeo R, Morat, hxd1011, Bulatov Y, Masterfool, Dernoncourt F. What is the
difference between the forward-backward and Viterbi algorithms? - Cross Validated [Internet].
Cross Validated. 2016 [cited 2017 Nov 11]. Available from:
https://stats.stackexchange.com/questions/31746/what-is-the-difference-between-the-forward-
backward-and-viterbi-algorithms
280. Rodríguez LJ, Torres I. Comparative Study of the Baum-Welch and Viterbi Training Algorithms
Applied to Read and Spontaneous Speech Recognition. Pattern Recognit Image Anal [Internet]
Springer, Berlin, Heidelberg; 2003 [cited 2017 Nov 11]. p. 847–857. [doi: 10.1007/978-3-540-44871-
6_98]
281. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer
P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M,
Duchesnay É. Scikit-learn: Machine Learning in Python. J Mach Learn Res [Internet] 2011 [cited
2018 Aug 22];12:2825–2830. Available from: http://scikit-learn.org/stable/about.html#citing-
scikit-learn
282. Baril J-F. mhsc-thesis [Internet]. Toronto; 2018. Available from:
https://github.com/cosmomeese/mhsc-thesis
283. Abu-Mostafa Y. Lecture 07 - The VC Dimension from Caltech CS 156: Learning Systems
[Internet]. YouTube; 2012 [cited 2018 Aug 30]. Available from:
https://www.youtube.com/watch?v=Dc0sr0kdBVI&hd=1#t=57m20s
284. Beleites C, Klein A. Any “rules of thumb” on number of features versus number of instances?
(small data sets) [Internet]. Data Sci (Stack Exch. 2018. Available from:
https://datascience.stackexchange.com/a/29478
285. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER. Optimal number of features as a function of
sample size for various classification rules. Bioinformatics [Internet] Oxford University Press; 2005
Apr 15 [cited 2018 Aug 30];21(8):1509–1515. [doi: 10.1093/bioinformatics/bti171]
286. Häggström M. Renin-angiotensin_system_in_man_shadow. Wikimedia Commons; 2009.
287. Ober WC, Garrison CW, Silverthorn DU. Adapted from Figure 15-24 The baroreceptor reflex: the
response to orthostatic hypotension. Hum Physiol An Integr Approach. Pearson Benjamin
Cummings; 2009. p. 991.
288. Alian AA, Shelley KH. Fig. 3. The effect of cardiac arrhythmia (PVCs) on the PPG. Best Pract
Res Clin Anaesthesiol [Internet] 2014 [cited 2018 Jul 30];28(4). [doi: 10.1016/j.bpa.2014.08.006]
289. University Health Network (UHN). Medly for Heart Failure [Internet]. iTunes; 2018. Available
from: https://itunes.apple.com/ca/app/medly-for-chronic-conditions/id1310832707?mt=8
290. Owen S. Common Probability Distributions: The Data Scientist’s Crib Sheet - Cloudera
Engineering Blog [Internet]. Cloudera Eng Blog. 2015 [cited 2018 Aug 27]. Available from:
https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-
sheet/
Appendix A - Research Ethics
I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of
Heart Failure Severity
II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers
for Monitoring Heart Failure
III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program
for Heart Failure Patients
IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a
mobile phone-based telemonitoring program for heart failure patients
Appendix B – A Primer on Hidden Markov Models
I. Basics of Markov Models (Hidden or Otherwise)
Markov Models (hidden or otherwise) are probabilistic state machines where the transitions between
states are executed randomly according to pre-specified transition probabilities between states [118,220–
223]. Markov Models are used to model Markov chains/processes which are stochastic (i.e. random)
processes that satisfy the Markovian property. That is, the transitions from a given state in the chain to
the next immediate state (and by extension all future states) must depend solely on the current
state of the model [118,220–224]. They must not depend on the path taken to arrive at that state, i.e. on
any previous states in which the system has existed. The Markovian property is alternatively known as
the 'memoryless' property: essentially, the Markov process or Markov chain has no memory of the
past [118,220–224]. The transition probabilities, along with the number of states, form the fundamental
model parameters which uniquely describe the Markov Model. Where relevant, a Markov Model may also
have initial starting parameters which dictate the likelihood associated with the Markov Model starting in
each possible state (e.g. 10% chance to start in State S1, 20% chance to start in State S2 and so on)
[118,220–224].
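To make this concrete, the short sketch below (illustrative base R only; the states and probabilities are invented for this primer and are not values estimated in this thesis) writes a three-state Markov Model down as a starting-probability vector and a transition-probability matrix, and then simulates a state sequence by repeated sampling. Note that each draw depends only on the current state, which is exactly the Markovian property described above:

set.seed(42)

states <- c("S1", "S2", "S3")

# starting (initial) probabilities: the chance of beginning in each state
start.probs <- c(S1 = 0.4, S2 = 0.3, S3 = 0.3)

# transition probabilities: rows are the current state, columns the next state;
# each row must sum to 1
trans.probs <- matrix(c(0.7, 0.2, 0.1,
                        0.2, 0.6, 0.2,
                        0.1, 0.2, 0.7),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(states, states))

# simulate a chain of length n: the next state depends only on the current one
simulate.chain <- function(n, start.probs, trans.probs) {
  chain <- character(n)
  chain[1] <- sample(states, 1, prob = start.probs)
  for (t in 2:n) {
    chain[t] <- sample(states, 1, prob = trans.probs[chain[t - 1], ])
  }
  chain
}

simulate.chain(10, start.probs, trans.probs)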
In many Markov Models (and in every Hidden Markov Model) there is also an associated set of possible
observations that are linked to each state, i.e. that can possibly be output when the system is in a given
state. For example, Figure B-1 shows a Markov Model of the weather outside an
office with possible states S1 = Sunny, S2 = Cloudy and S3 = Rainy with associated transition
probabilities between each state [221]. The observations associated with each state might be the clothing
that a given person in a stream of passers-by is wearing, say a shirt, a sweater or a rainjacket [221]. A
person might be wearing any of these types of clothing in any given type of weather, but the likelihood
of observing each clothing type will differ based on the underlying weather
state; for example rainjackets are probably more likely to be observed in rainy weather than in sunny
weather [221]. These probabilities are termed observation probabilities and link the states in the Markov
Model to the observations that are measured as outputs of the Markov Model. These observations could
be speech phonemes, written characters of the alphabet, or genome sequences [118,226]. Observe that in
Figure B-1, our hypothetical example Markov Model of the weather includes the starting, transition and
observation probabilities. The starting probabilities are indicated by very light lines between the
rectangular ‘start’ box and the state circles, and are almost uniformly distributed, with a slight bias towards it
being state S1: Sunny (perhaps unjustified optimism). The transition probabilities, indicated by lines
between the three state circles, favor the state remaining the same, with low probability of the state
jumping directly between the S1: Sunny and S3: Rainy states. The observation probabilities model our
hypothesis that shirts are most likely to be associated with sunny weather, and rainjackets with rainy
weather. In cloudy weather, people are almost equally likely to wear shirts, sweaters or rainjackets, with a
minor preference towards sweaters.
Figure B-1: Markov model
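As a rough illustration of how observation probabilities connect states to what is actually measured, the sketch below (base R again; the probabilities are invented for illustration and are not the values depicted in Figure B-1) encodes P(clothing | weather) as a matrix and generates the clothing an observer would see for a given weather sequence:

set.seed(7)

weather  <- c("Sunny", "Cloudy", "Rainy")
clothing <- c("Shirt", "Sweater", "Rainjacket")

# observation (emission) probabilities: P(clothing | weather state);
# cloudy weather is roughly uniform, with a minor preference for sweaters
obs.probs <- matrix(c(0.6, 0.3, 0.1,    # Sunny
                      0.3, 0.4, 0.3,    # Cloudy
                      0.1, 0.2, 0.7),   # Rainy
                    nrow = 3, byrow = TRUE,
                    dimnames = list(weather, clothing))

# given a sequence of (possibly hidden) weather states, generate what an
# observer outside the office would actually see
emit <- function(state.seq) {
  sapply(state.seq, function(s) sample(clothing, 1, prob = obs.probs[s, ]))
}

emit(c("Sunny", "Sunny", "Rainy", "Cloudy"))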
The appropriately named Hidden Markov Models (HMM) are simply Markov Models where the
underlying states are hidden - i.e. cannot directly be observed [118,220,222,224,225]. Specifically, we don’t
know the number of states the system has, nor the transition probabilities between states, the sequence of
states it has been through, or even the present state of the system [118,220,222,224,225]. However, if we
assume the system has a certain number of states (e.g. 3) for which we have some given observation
probabilities, it is actually possible to work backwards and try to infer the current state of the hidden
underlying Markov Model, including the sequence of states that the particular model went through, and
more generally to create a model of the underlying process [118,220,222–224,277]. We can then use the model to
replicate the modelled process. A relatable example is text prediction, where an HMM might be
trained using text a user inputs into their smartphone and then used to dynamically suggest the next
word as a user types in new text. Alternatively, one could use a model to quantify how similar a new
process is to an existing modeled process: for example one could model the stock market using the trade
volume and price of a major index during a known bullish (rising) period, and then provide this bull
market trained HMM a recent sample of the index trade volume and pricing information to quantify how
similar the current market is to the known bull market period.
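This 'working backwards' is exactly what fitting an HMM to an observation stream does. The following is a minimal sketch using the depmixS4 package (the same package referred to later in this primer); the simulated step-count-like series, the choice of three states and the Gaussian emission family are all assumptions made purely for illustration and do not correspond to the models built for this project:

library(depmixS4)

set.seed(1)

# toy observation stream standing in for a minute-by-minute step count series
steps <- c(rpois(200, lambda = 5), rpois(200, lambda = 60), rpois(200, lambda = 15))
df <- data.frame(steps = steps)

# assume 3 hidden states and Gaussian emissions for the observed counts
mod <- depmix(response = steps ~ 1, data = df, nstates = 3, family = gaussian())

fm <- fit(mod)         # expectation-maximization (Baum-Welch style) fitting
summary(fm)            # estimated starting, transition and emission parameters
post <- posterior(fm)  # most likely hidden state at each time point
head(post)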
Of course, the process of modelling an underlying process using an HMM relies on many assumptions,
both about the input data and properties of the underlying process. As previously mentioned, one of the
major assumptions that come with hidden Markov Models (the Markovian assumption), as with Markov
Models in general, is that the underlying process being modelled adheres to the Markovian
property: that the future state of the model does not depend on the past states or sequence of states, only
on the present state [118,220,221,224]. That being said, Hidden Markov Models have been found, in certain
cases, to model processes that violate this Markovian assumption fairly successfully, for
example in the classic cases of speech recognition and gesture recognition [118,226,278]. Of course, both
patient activity and heart rate data likely violate the Markovian assumption 'demanded' of hidden
Markov Models, and although HMMs have been used successfully in some applications of physical activity
recognition using accelerometer data [62], the jury is still out when it comes to modelling with heart rate
data or even with minute-by-minute step count data.
II. Semi-Markov Model
The violation of the pure Markovian assumption leads us to a variation on Hidden Markov Models:
Hidden Semi-Markov Models (HSMM) [223]. HSMMs are HMMs that formally relax the 'Markovian'
assumption of the model by permitting the model to specifically retain the memory of how long it has
been in a certain state (sometimes to force the model not to remain in a state for more than a desired time)
[223]. As such, HSMMs require that an additional set of parameters be defined: the sojourn distribution of
each state [223]. That is, the distribution of waiting times expected in each given state. These
waiting times can follow any distribution desired - normal, geometric, gamma, etc. - or appropriate for the
problem at hand [223]. For example, in the case of patient activity and heart rate, where it might be
unreasonable to assume that there is no time-dependence in state changes due to the dynamic nature
of human exercise and activity (e.g. people who are performing high-intensity activity are less likely to
continue as time goes by since they get tired) one might train equivalent multivariate hidden semi-
Markov models to explore and measure the effect of formally relaxing the Markovian assumption (or time-
independence) of a pure Markov models. Although HSMMs are likely highly relevant to the problem of
assessing NYHA class they were not investigated as part of the research documented in this thesis.
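For intuition, and as a standard textbook observation rather than a result from this thesis, an ordinary HMM already implies a sojourn distribution for each state; it is simply constrained to be geometric, whereas an HSMM lets it be specified freely:

\[
P(D_j = u) = a_{jj}^{\,u-1}\,(1 - a_{jj}) \quad \text{(HMM: implicitly geometric)}
\qquad \text{versus} \qquad
P(D_j = u) = d_j(u) \quad \text{(HSMM: explicit sojourn distribution)}
\]

where D_j is the number of consecutive time steps spent in state j and a_jj is that state's self-transition probability.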
III. Hidden Markov & Semi-Markov Model Parameters
To summarize, the complete set of parameters that determines a hidden Markov model is as follows:
1. the number of states in the model
2. the starting probabilities (for each state)
3. the transition probabilities (between each state)
4. the (observation) emission probabilities (of the observable by-products of each state; e.g.
shirt/sweater/rainjacket)
For Hidden Semi-Markov Models, the individual state sojourn distributions must also be specified.
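In the conventional notation (e.g. Rabiner's; included here for reference rather than taken from the thesis), this parameter set is often written as a single tuple:

\[
\lambda = (\pi, A, B), \qquad
\pi_i = P(q_1 = S_i), \quad
a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad
b_j(o) = P(o_t = o \mid q_t = S_j)
\]

with the number of states fixed in advance; for an HSMM the tuple is extended with the per-state sojourn distributions d_j(u).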
IV. Determining Markov Model Parameters
Determining the single best or most optimal hidden Markov model parametrization for a given data
stream is, unfortunately, an intractable problem [118,220,222]. That being said, there are known algorithms
for efficiently computing a locally optimal 'maximum likelihood' parametrization for a stream. Generally
speaking, the specific sub-class of algorithms used to solve this problem in the Markov model space are
known as expectation-maximization (EM) algorithms [118,220,222]. One of the most common EM
algorithm implementations used for hidden Markov model training is the Baum-Welch algorithm
[118,220,222,279]. Another common algorithm used to approximate EM is the Viterbi training algorithm
(N.B. not the Viterbi algorithm), which can yield less accurate models than the Baum-Welch algorithm but
is usually much less computationally intensive [279,280]. We eschew further discussion of the
implementation details of either of these algorithms: the availability of pre-programmed libraries
implementing them makes such in-depth knowledge unnecessary for new students of HMMs, and there are
many excellent sources available that explore the finer details of these algorithms much more completely
than can be done as part of a quick primer [118,220,222,280]. In any case, none of these algorithms is able
to determine all of the parameters by itself. Some of the parameters must be provided as 'initial
conditions' for the algorithm to execute; typically these are the emission probabilities, the starting
probabilities, the sojourn distributions (and sometimes even initial transition probabilities). Depending on
the library used, it may try to make an educated guess for these starting points or leave the 'initial
conditions' to be specified solely by the user. It is possible (and encouraged) to try various combinations
of parameters to determine the most effective set - in fact, more fully featured software libraries will
sometimes offer to do this automatically, although it is ultimately up to the researcher to determine
appropriate 'initial conditions.'
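As an illustration of this 'try several initial conditions' advice, the following generic R sketch (not the thesis code, and assuming a data frame df with a numeric steps column) refits the same three-state model from several randomized starting points and keeps the run with the best log-likelihood:

# Illustrative sketch: EM only finds local optima, so refit from several
# randomized starting points and keep the best-scoring parametrization.
# (Assumes depmixS4's default EM behaviour of randomized starting values;
# set.seed simply makes each restart reproducible.)
library(depmixS4)

best_fit <- NULL
for (seed in 1:10) {
  set.seed(seed)
  mod <- depmix(steps ~ 1, data = df, nstates = 3, family = gaussian())
  fm  <- tryCatch(fit(mod), error = function(e) NULL)  # a poor start can fail to converge
  if (!is.null(fm) && (is.null(best_fit) || logLik(fm) > logLik(best_fit))) {
    best_fit <- fm
  }
}
summary(best_fit)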
In the case of this work, where we used the R package depmixS4 [227,228], the user must provide the
number of desired states and the emission probabilities (which are assumed to remain fixed), as well as an
initial starting point for the state probabilities and transition probabilities, which the algorithm then
adjusts as it searches for a local optimum. Other hidden Markov model packages exist for R as well as for
other programming languages, including Python (as part of the package scikit-learn [281]), which is
particularly popular for machine learning.
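For reference, a minimal depmixS4 call of the kind described above might look as follows. This is an illustrative sketch rather than the code actually used in this thesis (see Appendix C for the latter); the data frame activity with minute-by-minute steps and hr columns, and the choice of three states, are assumptions made purely for the example.

# Illustrative sketch: a 3-state multivariate Gaussian HMM over step count
# and heart rate using depmixS4.
library(depmixS4)

mod <- depmix(list(steps ~ 1, hr ~ 1),      # one response model per observed stream
              data    = activity,
              nstates = 3,
              family  = list(gaussian(), gaussian()))

fm <- fit(mod)       # EM search for a locally optimal parametrization

summary(fm)          # estimated initial, transition and emission parameters
head(posterior(fm))  # most likely hidden state (and posteriors) per observation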
Appendix C – Software Repository
All of the software written by the author and used for, or as part of this project, can be accessed at [282]:
https://github.com/cosmomeese/mhsc-thesis
The Fitbit data management and access script can also be found at [198]:
https://github.com/cosmomeese/fitbit4research
Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier
Performance Measures
An exhaustive list of all the performance measures recorded for the final cross-sectional machine learning classifiers evaluated in Chapter 6 is
tabulated in Table 22. To maximize the legibility of the tables, the headers were abbreviated; Table 21 provides the key to these
abbreviations, along with any relevant abbreviated codes used in Table 22. For ease of navigation, similar model variants are grouped together in
Table 22, in roughly descending order of performance (the grouping is why the order is only approximate). Furthermore, the column with the performance metric used for
model comparison in this thesis - Cohen's Kappa (indicated by the 𝜅 symbol) - is highlighted in purple. Models whose unbalanced accuracy does not
improve over their no-information rate are highlighted in red, and the best-performing models are highlighted in green. The models with the lowest |Δ𝜅| (of
the models that improve over their default no-information rate) are highlighted in yellow.
Table 21: Header abbreviations for Table 22

Type - Machine learning model type
Feats - Features used (C = CPET only; S = step data only; C+S = CPET and step data)
Imp - Imputed missing data?
F Sel - Feature selection performed?
k - k-fold cross-validation method used (-1 = leave-one-out cross-validation; 10 = 10-fold cross-validation)
𝜅 - Cohen's Kappa
|Δ𝜅| - Absolute value of the difference between the leave-one-out and 10-fold cross-validation kappa for the particular model configuration
Bal Acc - Balanced Accuracy
Raw Acc - Unbalanced Accuracy
Acc UB - Unbalanced Accuracy Upper Bound
Acc LB - Unbalanced Accuracy Lower Bound
NIR - No Information Rate
P - P-Value (Unbalanced Accuracy)
McN P - McNemar P-Value
Sens - Sensitivity
Spec - Specificity
+ve PV - Positive (NYHA Class II) Predictive Value
-ve PV - Negative (NYHA Class III) Predictive Value
Pre - Precision
Rec - Recall
F1 - F1 Score
Prev - Prevalence
DR - Detection Rate
DP - Detection Prevalence
AUC - Area Under ROC Curve
TP - True Positive (Correct NYHA II Classification) Count
FN - False Negative (Incorrect NYHA III Classification) Count
FP - False Positive (Incorrect NYHA II Classification) Count
TN - True Negative (Correct NYHA III Classification) Count
Table 22: Cross-sectional machine learning classifier performance metrics
Type  Feats  Imp  F Sel  k  𝜅  |Δ𝜅|  Bal Acc  Raw Acc  Acc UB  Acc LB  NIR  P  McN P  Sens  Spec  +ve PV  -ve PV  Pre  Rec  F1  Prev  DR  DP  AUC  TP  FN  FP  TN
Boosted GLM C+S No No -1 0.73 0.63 0.85 0.89 0.98 0.72 0.71 .02 1.00 0.75 0.95 0.86 0.90 0.86 0.75 0.80 0.29 0.21 0.25 0.94 6 2 1 19
Boosted GLM C+S No Yes -1 0.73 0.63 0.85 0.89 0.98 0.72 0.71 .02 1.00 0.75 0.95 0.86 0.90 0.86 0.75 0.80 0.29 0.21 0.25 0.94 6 2 1 19
Boosted GLM C+S No No 10 0.10 0.63 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.54 3 12 4 31
Boosted GLM C+S No Yes 10 0.10 0.63 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.54 3 12 4 31
Random Forest C+S No No -1 0.70 0.60 0.81 0.89 0.98 0.72 0.71 .02 .25 0.63 1.00 1.00 0.87 1.00 0.63 0.77 0.29 0.18 0.18 0.80 5 3 0 20
Random Forest C+S No Yes -1 0.70 0.60 0.81 0.89 0.98 0.72 0.71 .02 .25 0.63 1.00 1.00 0.87 1.00 0.63 0.77 0.29 0.18 0.18 0.80 5 3 0 20
Random Forest C+S No No 10 0.10 0.60 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.46 3 12 4 31
Random Forest C+S No Yes 10 0.10 0.60 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.46 3 12 4 31
Boosted GLM C No No -1 0.47 0.19 0.72 0.79 0.90 0.64 0.70 .12 .50 0.54 0.90 0.70 0.82 0.70 0.54 0.61 0.30 0.16 0.23 0.80 7 6 3 27
Boosted GLM C No Yes -1 0.47 0.19 0.72 0.79 0.90 0.64 0.70 .12 .50 0.54 0.90 0.70 0.82 0.70 0.54 0.61 0.30 0.16 0.23 0.80 7 6 3 27
Boosted GLM C No No 10 0.28 0.19 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.55 6 9 5 30
Boosted GLM C No Yes 10 0.28 0.19 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.55 6 9 5 30
PCA NNet C Yes No -1 0.45 0.31 0.73 0.76 0.87 0.62 0.70 .22 .77 0.67 0.80 0.59 0.85 0.59 0.67 0.63 0.30 0.20 0.34 0.68 10 5 7 28
PCA NNet C Yes Yes -1 0.45 0.31 0.73 0.76 0.87 0.62 0.70 .22 .77 0.67 0.80 0.59 0.85 0.59 0.67 0.63 0.30 0.20 0.34 0.68 10 5 7 28
PCA NNet C Yes No 10 0.14 0.31 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.54 4 11 5 30
PCA NNet C Yes Yes 10 0.14 0.31 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.54 4 11 5 30
Boosted GLM C Yes No -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.76 9 6 6 29
Boosted GLM C Yes Yes -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.76 9 6 6 29
NNet C Yes No -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.73 9 6 6 29
NNet C Yes Yes -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.73 9 6 6 29
Boosted GLM C Yes No 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.53 4 11 5 30
Boosted GLM C Yes Yes 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.53 4 11 5 30
NNet C Yes No 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.56 4 11 5 30
NNet C Yes Yes 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.56 4 11 5 30
NNet C No No -1 0.41 0.45 0.71 0.74 0.86 0.59 0.70 .32 1.00 0.62 0.80 0.57 0.83 0.57 0.62 0.59 0.30 0.19 0.33 0.73 8 5 6 24
NNet C No Yes -1 0.41 0.45 0.71 0.74 0.86 0.59 0.70 .32 1.00 0.62 0.80 0.57 0.83 0.57 0.62 0.59 0.30 0.19 0.33 0.73 8 5 6 24
NNet C No No 10 -0.05 0.45 0.48 0.56 0.70 0.41 0.70 .99 1.00 0.27 0.69 0.27 0.69 0.27 0.27 0.27 0.30 0.08 0.30 0.55 4 11 11 24
NNet C No Yes 10 -0.05 0.45 0.48 0.56 0.70 0.41 0.70 .99 1.00 0.27 0.69 0.27 0.69 0.27 0.27 0.27 0.30 0.08 0.30 0.55 4 11 11 24
GLM C Yes No -1 0.37 0.23 0.68 0.74 0.85 0.60 0.70 .33 1.00 0.53 0.83 0.57 0.81 0.57 0.53 0.55 0.30 0.16 0.28 0.70 8 7 6 29
GLM C Yes Yes -1 0.37 0.23 0.68 0.74 0.85 0.60 0.70 .33 1.00 0.53 0.83 0.57 0.81 0.57 0.53 0.55 0.30 0.16 0.28 0.70 8 7 6 29
GLM C Yes No 10 0.14 0.23 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.49 4 11 5 30
GLM C Yes Yes 10 0.14 0.23 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.49 4 11 5 30
PCA NNet C+S No No -1 0.36 0.43 0.68 0.75 0.89 0.55 0.71 .43 1.00 0.50 0.85 0.57 0.81 0.57 0.50 0.53 0.29 0.14 0.25 0.63 4 4 3 17
PCA NNet C+S No Yes -1 0.36 0.43 0.68 0.75 0.89 0.55 0.71 .43 1.00 0.50 0.85 0.57 0.81 0.57 0.50 0.53 0.29 0.14 0.25 0.63 4 4 3 17
PCA NNet C+S No No 10 -0.06 0.43 0.47 0.52 0.66 0.37 0.70 1.00 .54 0.33 0.60 0.26 0.68 0.26 0.33 0.29 0.30 0.10 0.38 0.56 5 10 14 21
PCA NNet C+S No Yes 10 -0.06 0.43 0.47 0.52 0.66 0.37 0.70 1.00 .54 0.33 0.60 0.26 0.68 0.26 0.33 0.29 0.30 0.10 0.38 0.56 5 10 14 21
PCA NNet C No No -1 0.34 0.24 0.67 0.72 0.85 0.56 0.70 .44 1.00 0.54 0.80 0.54 0.80 0.54 0.54 0.54 0.30 0.16 0.30 0.74 7 6 6 24
PCA NNet C No Yes -1 0.34 0.24 0.67 0.72 0.85 0.56 0.70 .44 1.00 0.54 0.80 0.54 0.80 0.54 0.54 0.54 0.30 0.16 0.30 0.74 7 6 6 24
PCA NNet C No No 10 0.10 0.24 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.53 3 12 4 31
PCA NNet C No Yes 10 0.10 0.24 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.53 3 12 4 31
GLM S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
GLM S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
Boosted GLM S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
Boosted GLM S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
NNet S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.69 6 9 5 30
NNet S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.69 6 9 5 30
GLM S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
GLM S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
Boosted GLM S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
Boosted GLM S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
NNet S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.33 0 4 0 8
NNet S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.33 0 4 0 8
Random Forest C Yes No -1 0.21 0.13 0.60 0.68 0.80 0.53 0.70 .68 .80 0.40 0.80 0.46 0.76 0.46 0.40 0.43 0.30 0.12 0.26 0.68 6 9 7 28
Random Forest C Yes Yes -1 0.21 0.13 0.60 0.68 0.80 0.53 0.70 .68 .80 0.40 0.80 0.46 0.76 0.46 0.40 0.43 0.30 0.12 0.26 0.68 6 9 7 28
Random Forest C Yes No 10 0.08 0.13 0.54 0.62 0.75 0.47 0.70 .92 1.00 0.33 0.74 0.36 0.72 0.36 0.33 0.34 0.30 0.10 0.28 0.52 5 10 9 26
Random Forest C Yes Yes 10 0.08 0.13 0.54 0.62 0.75 0.47 0.70 .92 1.00 0.33 0.74 0.36 0.72 0.36 0.33 0.34 0.30 0.10 0.28 0.52 5 10 9 26
Boosted GLM C+S Yes No -1 0.17 0.23 0.59 0.66 0.79 0.51 0.70 .78 1.00 0.40 0.77 0.43 0.75 0.43 0.40 0.41 0.30 0.12 0.28 0.65 6 9 8 27
Boosted GLM C+S Yes Yes -1 0.17 0.23 0.59 0.66 0.79 0.51 0.70 .78 1.00 0.40 0.77 0.43 0.75 0.43 0.40 0.41 0.30 0.12 0.28 0.65 6 9 8 27
Boosted GLM C+S Yes No 10 -0.06 0.23 0.48 0.64 0.77 0.49 0.70 .86 .03 0.07 0.89 0.20 0.69 0.20 0.07 0.10 0.30 0.02 0.10 0.45 1 14 4 31
Boosted GLM C+S Yes Yes 10 -0.06 0.23 0.48 0.64 0.77 0.49 0.70 .86 .03 0.07 0.89 0.20 0.69 0.20 0.07 0.10 0.30 0.02 0.10 0.45 1 14 4 31
Random Forest S Yes No -1 0.14 0.38 0.58 0.62 0.75 0.47 0.70 .92 .65 0.47 0.69 0.39 0.75 0.39 0.47 0.42 0.30 0.14 0.36 0.62 7 8 11 24
Random Forest S Yes Yes -1 0.14 0.38 0.58 0.62 0.75 0.47 0.70 .92 .65 0.47 0.69 0.39 0.75 0.39 0.47 0.42 0.30 0.14 0.36 0.62 7 8 11 24
Random Forest S Yes No 10 -0.24 0.38 0.38 0.42 0.72 0.15 0.67 .98 1.00 0.25 0.50 0.20 0.57 0.20 0.25 0.22 0.33 0.08 0.42 0.41 1 3 4 4
Random Forest S Yes Yes 10 -0.24 0.38 0.38 0.42 0.72 0.15 0.67 .98 1.00 0.25 0.50 0.20 0.57 0.20 0.25 0.22 0.33 0.08 0.42 0.41 1 3 4 4
Random Forest C No No -1 0.11 0.23 0.55 0.67 0.81 0.51 0.70 .70 .18 0.23 0.87 0.43 0.72 0.43 0.23 0.30 0.30 0.07 0.16 0.65 3 10 4 26
Random Forest C No Yes -1 0.11 0.23 0.55 0.67 0.81 0.51 0.70 .70 .18 0.23 0.87 0.43 0.72 0.43 0.23 0.30 0.30 0.07 0.16 0.65 3 10 4 26
Random Forest C No No 10 -0.12 0.23 0.44 0.54 0.68 0.39 0.70 .99 1.00 0.20 0.69 0.21 0.67 0.21 0.20 0.21 0.30 0.06 0.28 0.47 3 12 11 24
Random Forest C No Yes 10 -0.12 0.23 0.44 0.54 0.68 0.39 0.70 .99 1.00 0.20 0.69 0.21 0.67 0.21 0.20 0.21 0.30 0.06 0.28 0.47 3 12 11 24
GLM S No No -1 0.10 0.09 0.55 0.65 0.80 0.46 0.71 .83 .77 0.30 0.79 0.38 0.73 0.38 0.30 0.33 0.29 0.09 0.24 0.65 3 7 5 19
GLM S No Yes -1 0.10 0.09 0.55 0.65 0.80 0.46 0.71 .83 .77 0.30 0.79 0.38 0.73 0.38 0.30 0.33 0.29 0.09 0.24 0.65 3 7 5 19
GLM S No No 10 0.01 0.09 0.50 0.52 0.66 0.37 0.70 1.00 .15 0.47 0.54 0.30 0.70 0.30 0.47 0.37 0.30 0.14 0.46 0.49 7 8 16 19
GLM S No Yes 10 0.01 0.09 0.50 0.52 0.66 0.37 0.70 1.00 .15 0.47 0.54 0.30 0.70 0.30 0.47 0.37 0.30 0.14 0.46 0.49 7 8 16 19
NNet C+S Yes No -1 0.08 0.24 0.54 0.60 0.74 0.45 0.70 .95 .82 0.40 0.69 0.35 0.73 0.35 0.40 0.38 0.30 0.12 0.34 0.49 6 9 11 24
NNet C+S Yes Yes -1 0.08 0.24 0.54 0.60 0.74 0.45 0.70 .95 .82 0.40 0.69 0.35 0.73 0.35 0.40 0.38 0.30 0.12 0.34 0.49 6 9 11 24
NNet C+S Yes No 10 -0.15 0.24 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.51 1 14 7 28
NNet C+S Yes Yes 10 -0.15 0.24 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.51 1 14 7 28
Random Forest C+S Yes No -1 0.07 0.10 0.53 0.64 0.77 0.49 0.70 .86 .48 0.27 0.80 0.36 0.72 0.36 0.27 0.31 0.30 0.08 0.22 0.61 4 11 7 28
Random Forest C+S Yes Yes -1 0.07 0.10 0.53 0.64 0.77 0.49 0.70 .86 .48 0.27 0.80 0.36 0.72 0.36 0.27 0.31 0.30 0.08 0.22 0.61 4 11 7 28
Random Forest C+S Yes No 10 -0.03 0.10 0.49 0.60 0.74 0.45 0.70 .95 .50 0.20 0.77 0.27 0.69 0.27 0.20 0.23 0.30 0.06 0.22 0.62 3 12 8 27
Random Forest C+S Yes Yes 10 -0.03 0.10 0.49 0.60 0.74 0.45 0.70 .95 .50 0.20 0.77 0.27 0.69 0.27 0.20 0.23 0.30 0.06 0.22 0.62 3 12 8 27
NNet C+S No No -1 0.05 0.04 0.53 0.64 0.81 0.44 0.71 .85 .75 0.25 0.80 0.33 0.73 0.33 0.25 0.29 0.29 0.07 0.21 0.46 2 6 4 16
NNet C+S No Yes -1 0.05 0.04 0.53 0.64 0.81 0.44 0.71 .85 .75 0.25 0.80 0.33 0.73 0.33 0.25 0.29 0.29 0.07 0.21 0.46 2 6 4 16
NNet C+S No No 10 0.02 0.04 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.53 5 10 11 24
NNet C+S No Yes 10 0.02 0.04 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.53 5 10 11 24
GLM C+S Yes No -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25
GLM C+S Yes Yes -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25
GLM C+S Yes No 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24
GLM C+S Yes Yes 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24
PCA NNet C+S Yes No -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24
PCA NNet C+S Yes Yes -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24
PCA NNet C+S Yes No 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28
PCA NNet C+S Yes Yes 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28
PCA NNet S No No -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17
PCA NNet S No Yes -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17
PCA NNet S No No 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28
PCA NNet S No Yes 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28
GLM C No No 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15
GLM C No Yes 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15
GLM C No No -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16
GLM C No Yes -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16
NNet S No No -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16
NNet S No Yes -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16
NNet S No No 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24
NNet S No Yes 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24
GLM C+S No No 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23
GLM C+S No Yes 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23
GLM C+S No No -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10
GLM C+S No Yes -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10
Boosted GLM S No No 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28
Boosted GLM S No Yes 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28
Boosted GLM S No No -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18
Boosted GLM S No Yes -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18
Random Forest S No No 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32
Random Forest S No Yes 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32
Random Forest S No No -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20
Random Forest S No Yes -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20
PCA NNet S Yes No 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8
PCA NNet S Yes Yes 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8