The Use of Activity Monitoring and Machine Learning for
the Functional Classification of Heart Failure
by
Jonathan-F. Benjamin Jason Jérémy Baril
A thesis submitted in conformity with the requirements
for the degree of Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto
CC BY 4.0 by Jonathan-F. Benjamin Jason Jérémy Baril, unless otherwise prohibited
The Use of Activity Monitoring and Machine Learning for the Functional
Classification of Heart Failure
Jonathan-F. Benjamin Jason Jérémy Baril
Master of Health Science, Clinical Engineering
Institute of Biomaterials and Biomedical Engineering
University of Toronto
2018
Abstract
Background: Assessing the functional status of a heart failure patient is a highly subjective task.
Objective: This thesis aimed to find an accessible, objective means of assessing the New York Heart
Association (NYHA) functional classification (FC) of a patient by leveraging modern machine learning
techniques.
Methods: We first identified relevant quantitative data and upgraded Medly, a remote patient
monitoring system (RPMS), to support data collection. We then proceeded to build six different machine
learning classifiers including hidden Markov model, Generalized Linear Model (GLM), random forest and
neural network based classifiers.
Results: The best overall classifier was found to be a boosted GLM, which achieved a classification
performance (Cohen’s Kappa statistic 𝜅=0.73, balanced accuracy=85%) comparable to human level
performance (𝜅=0.75).
Conclusions: Although the investigated classifiers are not ready for implementation into a real RPMS,
they show promise for making the evaluation of NYHA FC more universally consistent and reliable.
dedicated to Papa,
without your encouragement this thesis would never have existed
Acknowledgments
Ah! The acknowledgements. As painful and lonely as it may be to compose a thesis, the
acknowledgements section is by far the easiest and most pleasant section to write. It is both heart-
warming and humbling to be reminded of how much, and how many others, have sacrificed to breathe life
into this work; truly, without the help of these people this project would still be a mere figment of an idea
in someone's mind. If you've contributed to this work, whether directly or indirectly, know that, even if
I've somehow forgotten to include your name here, I am eternally grateful for your help and contribution
to this work.
Firstly, I need to acknowledge our patients: it is probably only those of us who do health research who
truly understand how much these projects live and die by the pure selfless generosity of patients. Thank
you for trusting us with your health and your data. I can only hope this work will somehow contribute to
ultimately making the need for your generosity obsolete.
Second, my committee: Drs. Joe Cafazzo, Cedric Manlhiot, Heather Ross, and Babak Taati. Your
contributions to this project cannot be overstated; in fact, my biggest regret in this project is not
having taken greater advantage of your experience and wisdom. Your guidance, correction, teaching,
encouragement and advice were invaluable in getting this project anywhere. Thanks also go to Dr.
Rob Nolan for taking time to serve as the external examiner for this thesis.
I am also hugely indebted to Simon Bromberg, Raghad Abdulmajeed and Dr. Yasbanoo Moayedi, not
only for your foundational work on which I was able to build my work but also for leaving behind a treasure
trove of data that was indispensable for getting this project started.
Special thanks to Edgar Crowdy, Steven Fan, Bridgette Mueller, Mohammad Peikari, Emily Somerset,
and Kabir Sakhrani at the Cardiovascular Data Management Centre for your advice and tips with regards
to the analytics but also your incredible help with much of the last-minute data collection, analytics,
processing and people-power that went into the ‘research’ part of this project.
Heartfelt thanks also go to Jason Hearn, not only for contributions to this work as part of the
aforementioned group, but also your puns, listening ear and friendship journeying through the adventure
of doing an MHSc at the Centre these last 2 years. If only all graduate students were so fortunate.
Enormous thanks to Iqra Ashfaq, Alana Tibbles, Patrick Ware, Dr. Emily Seto, and Mary O’Sullivan.
Goodness knows how many times I interrupted your work for this project. Thank you so much for your
patience and for being so willing to share your time, your resources, and expertise around all things Medly
(as well as for rooting for me all along the way).
Additional thanks go to:
Stephanie Wilson, Diane De Sousa and especially Larissa Maderia for all the hard work you put in so we
could get Fitbit integrated into Medly.
Damon Pfaff, Owen Thijssen and Mike Lovas for your design advice and allowing me to leech off your
expertise.
James Agnew and Vlad Voloshyn for your technical help.
Melanie Yeung and Akib Uddin, not only for your operational and project management help on the Fitbit
integration (and for the internship) but also for your timely encouragement and advice for getting through
this degree.
Aarti Mathur and Alison Bison for your always joyous help with various admin and purchasing issues.
Similarly, Jess Fifield, who also deserves additional accolades for her eternal patience in filtering my
incessant requests, and for arranging, rearranging and further rearranging Dr. Cafazzo’s calendar and
always managing to find an available slot for Jason or for myself to meet with Dr. Cafazzo when
necessary. Thanks also to Anna Yuan for managing to wrangle the schedules of 5 incredibly busy
university professors so I could defend on time.
Quynh Pham, for your mentorship and encouragement, and for your unwavering enthusiasm at the
Centre; for always always [sic] finding time to thoughtfully answer my questions, whether on REB
applications, thesis writing, EPR or the myriad other elements of the research student life.
Plinio Morita, for your help and suggestions regarding some of the analytics in this project.
Shivani Goyal, especially for your help and advice regarding my OGS/CGS-M proposal. And speaking of:
Many thanks are owed to the Ted Rogers Centre for Heart Research and Peter Munk Cardiac Centre,
Health Support through Information Technology Enhancements (hSITE), the Natural Sciences and
Engineering Research Council (NSERC), the Canadian Institutes of Health Research (CIHR), the Government
of Ontario, and the University of Toronto for funding various parts of this project at various times.
And of course, thank you to everyone else at Healthcare Human Factors and at eHealth Innovation who
at various times pitched in, shared their expertise, provided advice or an encouraging word, or even just
expressed interest in the work. Thank you also to Wayne, Chris and Anjum for extending the opportunity
to learn, work and travel with the human factors team as part of my internships.
Thanks to Rhonda Marley, our wonderful Clin. Eng. coordinator, for alleviating, as much as you could, a lot of the
burdensome administrative workload involved in a graduate degree.
Thank you to BESA, the IBBME community and especially the Clin. Eng. students who were part of our
program. It was a true pleasure. We made it.
And lastly, on a personal note, none of this work would have been possible without friends and family
who supported and encouraged me over these last 2 years - words cannot express how grateful I am for
you. Thank you Maman, Papa, Alisson, Benjamin; Ruth and Alvis (my home away from home); Kyle F,
Thomas, Esteban (when I needed a nice invigorating round of PUBG or GTA); Vanessa, Rebecca,
Theresa, Duela, Sara & Matthew, Matt & Moni, Rachel & Justin, Melanie, Kyle N, Shawn, Valerie,
Jamie, and Courtney (all of whom graciously let me go to the big TO but would probably rather I had
stayed with them in Winnipeg). Special thanks in particular though have to go to: Paul White, who had
the dubious honor of reviewing the first draft of this thesis; Cameron MacGregor, who brought this
program to my attention and joined me on the adventure; Knox Church (and my home church in
particular; Sam, Chris, Hendrick, Stephen, Andrew, Bella, Roydon, Sarah, Lori, Thomas, Emily, Deborah,
Larissa, Katie, Jackie, Danielle, and so many others), for your open arms and being my much-needed
community in this new city; to Tanisha Strachan, for keeping me sane these past few months, even
though no one warned you that dating a grad student is often too much akin to dating a hermit; and of
course, Jesus, because ultimately this was all for you.
Thank you all for your love, for your encouragement, and for your patience.
Now on to the main event…
Table of Contents
Acknowledgments ......................................................................................................................................... iv
Table of Contents ........................................................................................................................................ vii
List of Tables ................................................................................................................................................ xi
List of Figures ............................................................................................................................................. xiii
List of Abbreviations .................................................................................................................................. xvi
Chapter 1 - Introduction ................................................................................................................................ 1
Thesis Objective ................................................................................................................................ 1
Formal Thesis Statement .................................................................................................................. 2
Thesis Summary ............................................................................................................................... 2
1.3.1 Phase 1 – Replication of Previous Study ............................................................................. 2
1.3.2 Phase 2 – Activity Tracker Monitoring Implementation ..................................................... 2
1.3.3 Phase 3 – Machine Learning Implementation & Validation ................................................ 3
Chapter 2 - Background & Literature Review ............................................................................................... 4
Congestive Heart Failure .................................................................................................................. 4
2.1.1 New York Heart Association Functional Classification ....................................................... 6
Assessing Exercise Capacity ............................................................................................................. 7
2.2.1 The Medical Interview (Standardized & Unstandardized Questioning) .............................. 8
2.2.2 Standardized In-Clinic Exercise Testing ............................................................................ 11
2.2.3 Fitness Trackers/Monitors ................................................................................................. 14
Remote Patient Monitoring ............................................................................................................ 22
2.3.1 Medly ................................................................................................................................. 24
Artificial Intelligence & Machine Learning ..................................................................................... 24
2.4.1 Machine Learning .............................................................................................................. 26
2.4.2 Supervised, Unsupervised and Reinforcement Learning .................................................... 26
2.4.3 Classification vs Prediction Problems ................................................................................ 27
2.4.4 The Effect of Sample Size on Machine Learning ............................................................... 28
2.4.5 State-of-the-art .................................................................................................................. 29
Summary ......................................................................................................................................... 32
Chapter 3 - Replication of Previous Study .................................................................................................. 35
Abstract .......................................................................................................................................... 35
Introduction .................................................................................................................................... 36
Methods .......................................................................................................................................... 37
3.3.1 Recruitment ....................................................................................................................... 37
3.3.2 Statistics ............................................................................................................................ 39
Results and Discussion .................................................................................................................... 42
3.4.1 Principal Results ................................................................................................................ 48
3.4.2 Strengths and Limitations ................................................................................................. 51
Conclusion ....................................................................................................................................... 52
3.5.1 Acknowledgements ............................................................................................................. 52
3.5.2 Ethics Approval ................................................................................................................. 52
3.5.3 Conflicts of Interest ........................................................................................................... 52
Chapter 4 - Activity Tracker Monitoring Implementation .......................................................................... 53
Medly User Interface Overview ...................................................................................................... 53
Requirements .................................................................................................................................. 54
Design & Implementation ............................................................................................................... 57
4.3.1 Activity Tracker Selection ................................................................................................. 57
4.3.2 User Interface Design ......................................................................................................... 64
Summary ......................................................................................................................................... 82
Chapter 5 - Assessment of NYHA Functional Classification using Hidden Markov Models ...................... 84
Hidden Markov Models ................................................................................................................... 84
5.1.1 Rationale for the use of HMMs .......................................................................................... 84
Methods .......................................................................................................................................... 86
5.2.1 Training Data .................................................................................................................... 86
5.2.2 Model Design ..................................................................................................................... 89
5.2.3 Model Validation ............................................................................................................... 93
Results and Discussion .................................................................................................................... 94
5.3.1 Classification Performance ................................................................................................. 94
5.3.2 Training Challenges ........................................................................................................... 94
Summary ....................................................................................................................................... 101
Chapter 6 - Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning
Models ................................................................................................................................................... 103
Machine Learning Models ............................................................................................................. 103
6.1.1 Generalized Linear Models ............................................................................................... 103
6.1.2 Boosted Generalized Linear Models ................................................................................. 105
6.1.3 Random Forest ................................................................................................................ 105
6.1.4 Artificial Neural Networks ............................................................................................. 107
6.1.5 Principal Component Analysis Artificial Neural Networks ............................................ 109
Methods ........................................................................................................................................ 110
6.2.1 Training Data .................................................................................................................. 110
6.2.2 Model Design ................................................................................................................... 111
6.2.3 Model Validation ............................................................................................................. 117
Results and Discussion .................................................................................................................. 120
6.3.1 Classification Performance ............................................................................................... 120
6.3.2 Best Features ................................................................................................................... 124
6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation .......................................... 128
Summary ....................................................................................................................................... 129
Chapter 7 - Conclusions, Recommendations & Future Work ................................................................... 132
Conclusions ................................................................................................................................... 132
Recommendations ......................................................................................................................... 135
Future Work ................................................................................................................................. 136
References .................................................................................................................................................. 138
Appendix A - Research Ethics ................................................................................................................... 168
I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of Heart
Failure Severity ............................................................................................................................. 168
II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers for
Monitoring Heart Failure .............................................................................................................. 169
III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program for Heart
Failure Patients ............................................................................................................................ 170
IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a mobile phone-
based telemonitoring program for heart failure patients .............................................................. 171
Appendix B – A Primer on Hidden Markov Models ................................................................................. 172
I. Basics of Markov Models (Hidden or Otherwise) ......................................................................... 172
II. Semi-Markov Model ...................................................................................................................... 174
III. Hidden Markov & Semi-Markov Models Parameters ................................................................... 174
IV. Determining Markov Model Parameters ....................................................................................... 175
Appendix C – Software Repository ............................................................................................................ 177
Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier Performance Measures ..... 178
List of Tables
Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79] ............................ 19
Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41] ..... 20
Table 3: Inclusion criteria ............................................................................................................................ 37
Table 4: Exclusion criteria ........................................................................................................................... 37
Table 5: Study dataset demographics .......................................................................................................... 38
Table 6: Study dataset demographics (overall and just NYHA II or III) .................................................... 38
Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*) ................................. 39
Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and just between class
II vs. III. ....................................................................................................................................................... 43
Table 9: Significant findings for comparisons between group II* and group III* ........................................ 44
Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III) and just
between class II vs. III. ................................................................................................................................ 45
Table 11: Non-significant findings for comparisons between group II* and group III* ............................... 46
Table 12: Candidate activity trackers ......................................................................................................... 58
Table 13: Medly inclusion criteria ............................................................................................................... 78
Table 14: Medly exclusion criteria ............................................................................................................... 78
Table 15: iPhone vs. Android patients on Medly system using Fitbit a) all patients onboarded, b) only
new Medly patients onboarded during thesis ............................................................................................... 79
Table 16: Patient adherence on Fitbit ......................................................................................................... 80
Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT .................... 80
Table 18: Minute-by-minute step count features ....................................................................................... 111
Table 19: Cardiopulmonary exercise testing data features ........................................................................ 113
Table 20: Patient demographic data features ............................................................................................ 114
Table 21: Header abbreviations for Table 22 ............................................................................................. 178
Table 22: Cross-sectional machine learning classifier performance metrics ............................................... 179
List of Figures
Figure 2-1. Renin-Angiotensin-Aldosterone system [286] ............................................................................... 5
Figure 2-2 Nervous system response to drop in blood pressure [287] ............................................................ 6
Figure 2-3 PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288]. .......................... 16
Figure 3-1. Histogram of per minute step count values for each patient, grouped by individual NYHA
class .............................................................................................................................................................. 40
Figure 3-2. Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0).
Stacked internal segments indicate relative contributions by each patient. ................................................ 41
Figure 3-3. Individual frequency of per minute step counts for each patient (zoomed in to step counts >
0), grouped by NYHA class ......................................................................................................................... 42
Figure 3-4. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each
individual NYHA class ................................................................................................................................ 48
Figure 3-5. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count
maximums for each individual NYHA class ................................................................................................ 49
Figure 3-6. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count
maximums for each individual NYHA class ................................................................................................ 50
Figure 3-7. Number of zero step count minutes as a percentage of individual patient two-week data stream
..................................................................................................................................................................... 51
Figure 4-1. Medly system patient smartphone user interface a) home screen b) trends screen [289] ....... 53
Figure 4-2. Medly system clinical user web interface ................................................................................... 55
Figure 4-3. Fitbit data flow diagram ........................................................................................................... 60
Figure 4-4. Fitbit authentication process with a client app ......................................................................... 61
Figure 4-5. Medly Fitbit patient access sequence ........................................................................................ 62
Figure 4-6. Medly Fitbit clinician access sequence ...................................................................................... 63
Figure 4-7. Proposed designs for patient user interface (home screen) a) combined heart rate and steps
data on one card, b) combined heart rate and steps data with pictorial representations, c) separated heart rate and
step data, d) only pictorial representation with mini graph ......................................................................... 65
Figure 4-8. Proposed designs for patient user interface (trends) a) simple sparklines, b) data with bands to
indicate min (resting), mean and max values for each time period, c) whisker plot to indicate daily range,
d) heart rate (maximum and resting) and average step count values broken out for each time period, and
e) Tufte style medical data visualization as per f) which is reproduced from [201] .................................... 66
Figure 4-9. Proposed design for authorization of new Fitbit by patient via Medly smartphone application.
..................................................................................................................................................................... 67
Figure 4-10. Proposed designs for clinical user interface (activity and heart rate graphs) a) simple graph
design with indicator lines for alert levels and mean, b) design inspired by the Sick Kids T3 (tracking,
trajectory and trigger) tool [206–208], c) mix of T3 tool with Medly range bands, d) whisker plot style
and e) simple graph with range bands and NYHA class prediction display (bottom of the more info page
for step count graph) ................................................................................................................................... 71
Figure 4-11. Final web interface Fitbit authorization flow .......................................................................... 72
Figure 4-12. Final web interface activity tracker profile & deauthorization flow ........................................ 73
Figure 4-13. Final web interface activity tracker data display .................................................................... 73
Figure 4-14. Distribution of patient Fitbit adherence (as percent of days using the system) ..................... 79
Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model .................... 85
Figure 5-2: Architecture for hidden Markov model based classifier ............................................................. 90
Figure 5-3: Distribution of per-minute step count for patients with NYHA class II and NYHA III (*
grouped) ....................................................................................................................................................... 93
Figure 5-4: Overview of HMM based classifier performance ........................................................................ 94
Figure 5-5: Example patient step count data (per 6 hour resolution) ......................................................... 95
Figure 5-6: Example patient step count data (per minute resolution) ........................................................ 96
Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236]. ................................ 100
Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the
distribution belongs in the family only when certain parameters are fixed). Adapted from [290]. ............ 104
Figure 6-2: Example of a decision tree (above) with corresponding feature space (below). ...................... 106
Figure 6-3: A perceptron ............................................................................................................................ 108
Figure 6-4: A neural network ..................................................................................................................... 108
Figure 6-5: 𝒌-fold cross-validation ............................................................................................................. 117
Figure 6-6: Performance of the best CPET only classifier ......................................................................... 121
Figure 6-7: Performance of the best step data only classifier .................................................................... 121
Figure 6-8: Performance of the best CPET + step data classifier ............................................................. 121
Figure 6-9: Performance of the second best CPET + step data classifier ................................................. 121
Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with
CPET & step data (with no data imputation) .......................................................................................... 122
Figure 6-11: Feature importance scores for GLM classifier using only step count data ............................ 125
Figure 6-12: Feature importance scores for random forest classifier using CPET + step count data ....... 126
Figure 6-13: Performance of the best model with cross-validation performance difference ....................... 128
Figure B-1: Markov model ......................................................................................................................... 173
List of Abbreviations
6MWT 6 minute walk test
Acc accuracy
API application programming interface
AI artificial intelligence
AT anaerobic threshold
BNP brain natriuretic peptide
BP blood pressure
bpm beats per minute
CART classification and regression tree
CC correlation coefficient
CI confidence interval
CV cross validation
CHF congestive heart failure
CO2 carbon dioxide
CPET cardiopulmonary exercise test
DPMSC daily per minute step count
ECG electrocardiography. Alternatively: electrocardiogram, or electrocardiograph
GLM generalized linear model
HF heart failure
HFrEF heart failure with reduced ejection fraction
HMM hidden Markov model
HMMBC hidden Markov model based classifier
HT home telemonitoring
HR heart rate
HRV heart rate variability
ICC intraclass correlation coefficients
IMU inertial measurement unit
LED light-emitting diode
LVEF left ventricular ejection fraction
LOOCV leave-one-out cross validation
ML machine learning
MVP minimum viable product
NIR no information rate
NNet neural net
NYHA New York Heart Association
O2 oxygen
PCA principal components analysis
PPG photoplethysmography
QI quality improvement
RCT randomized control trial
REB research ethics board
RER respiratory exchange ratio
RF random forest
ROC receiver operating characteristic
RPM remote patient monitoring
SC step count
SEM standard error of the mean
TGH Toronto General Hospital
UHN University Health Network
UI user interface
Chapter 1 - Introduction
Heart failure (HF), a complex chronic terminal phase of many cardiovascular diseases, is slowly becoming
a worldwide silent pandemic [1]. The symptoms of heart failure are complex and difficult to manage for
both patients and their physicians [2–4]. Care is made even more difficult because there is no reliable
objective method for assessing the symptomatic (functional) status of a given HF patient, or by extension,
if their symptoms have recently measurably deteriorated [5–7].
The current clinical gold standard for assessing a patient's symptom state is the New York Heart
Association (NYHA) functional classification [8,9]. This system grades a patient's degree of heart failure
based on a physician's interpretation of the patient's reported symptoms (mainly with respect to their
degree of intolerance to exercise/physical activity) and is by its nature highly subjective. Despite these
limitations, years of medical research and clinical observations have established many important
relationships between a patient's symptom status and their prognostic outcomes [7,10] which makes it
undesirable to simply replace or modify the existing NYHA functional classification scheme. However,
finding an objective means of determining a patient's NYHA class would be of great benefit to both HF
care and research as it would allow intra- and inter-physician and patient assessments of HF functional
status to be more consistent [7,11,12]. At the very least, consistency would make communication of
patient heart failure functional status in research, clinic notes, or other medical documentation more
transparent and reliable.
Thesis Objective
The objective of this thesis is to design and develop a means of making the evaluation of NYHA
functional class more consistent and reliable for the medical research and clinical community. The larger
goal of this research work can be subdivided into four major sub-objectives:
1. To identify available relevant, objective data which may be useful for providing insights into
patients' underlying NYHA functional class and, where required, to start the collection of this
data.
2. To establish a basic foundational procedure for use by future researchers, data scientists and
engineers to develop and assess machine learning based methods of evaluating NYHA functional
class (trained to replicate classification by experienced physicians).
3. To perform a pilot analytics experiment, using data collected during an initial brief data
collection period, to explore the viability of a few machine learning algorithms which could form
the core of an objective and consistent system for evaluating NYHA functional class (one that mirrors
classification by experienced physicians).
4. To provide a reflection on ‘lessons learned’, potential pitfalls and hazards to be mitigated in a
real-life implementation of a machine learning based NYHA functional classification system.
Formal Thesis Statement
We hypothesize that it is possible to assess NYHA functional class with an expected level of
performance at least equal to that of skilled humans, namely trained cardiologists, using objective data
readily available or recordable as part of routine care.
Thesis Summary
The three phases of this thesis are summarized in the following sections 1.3.1 to 1.3.3. We first
replicated a previous scientific study as part of initial investigations into relevant data. A basic physical
activity data collection system was then implemented as part of an established remote patient monitoring
system at the TGH HF clinic. Once sufficient data had been gathered by this system, we sought to train
and validate several machine learning models and assess their potential usefulness for the task of
classifying patients into their appropriate NYHA functional class. All research performed as part of this
thesis was reviewed by and received the required approvals from the UHN Research Ethics Board (REB). The
approval letters are included as part of Appendix A.
1.3.1 Phase 1 – Replication of Previous Study
A previously published pilot study [13] showed a statistically significant association between NYHA
functional class and total daily step count activity measured by wrist-worn activity monitors in patients
with heart failure. However, the study’s small sample had the unfortunate side-effect of limiting scientific
confidence in the generalizability of these findings. Since step count activity is expected to be a highly
relevant, useful, and massively feature-rich dataset, we replicated the study on a separate, otherwise
limited dataset collected during another previous study to increase our confidence in the relevance and
usefulness of step data for this particular thesis. This phase of the thesis was approved and covered under
REB #15-9832.
1.3.2 Phase 2 – Activity Tracker Monitoring Implementation
Having validated the relevance of step data for this particular application, we upgraded Medly, the
remote patient monitoring system already in use at the TGH HF clinic, so it could support the collection
and display of continuous free living activity data from a commercially available fitness tracker (a Fitbit),
including minute-by-minute step count and heart rate data, which would form an important cornerstone of
the rest of our analysis. This phase of the thesis, upon review by the UHN REB, was accorded a waiver of
requirement for REB approval under REB #18-0221. The analysis of patient compliance was approved
and covered under REB #16-5789.
1.3.3 Phase 3 – Machine Learning Implementation & Validation
In the final phase of this research thesis, we identified potential candidate machine learning algorithms
and implemented six of them in an attempt to create a classifier that could take the collected clinical data
and use it to objectively assess patient NYHA class. We also evaluated the performance of these
systems compared to the expected ability of experienced physicians to perform the same task. This phase of
research, upon review by the UHN REB, was accorded a waiver of requirement for REB approval under
REB #18-0221.
The following chapters provide, first, the necessary background needed to understand the rest of the research
discussed in this thesis, followed by a detailed description of the methods employed in each phase of the
research and the corresponding findings of that phase.
Chapter 2 - Background & Literature Review
Congestive Heart Failure
Congestive Heart Failure (CHF), or Heart Failure (HF), as previously stated, is a complex chronic
terminal phase of many cardiovascular diseases, and is slowly becoming a worldwide silent pandemic
[1,14]. Aside from being complex, it is also an incurable, continually worsening condition that looms
threateningly over even a myriad of relatively ‘benign’ heart problems. In the words of Dr. Paul
Fedak, it is the “end result of all cardiac disease. You get heart failure from everything that goes wrong
with your heart – all roads lead to heart failure” [2]. Recent estimates would suggest that in 2016 at least
50,000 new Canadians will have officially joined an existing cohort of more than 600,000 Canadians, and
26 million persons globally, living with heart failure [2,14]. Of course, these numbers are only expected to
grow as the population of persons at high risk of developing cardiac disease and, almost inevitably, the
prevalence of cardiac disease in general, continues to increase. Globally, the prognosis of HF patients is
bleak [1,14]. Even in Canada, despite its relatively advanced medical system, the expected median
survival time of Canadian HF patients is still very short - 2.1 years [15].
But what is heart failure? In short, heart failure is when the heart suffers a reduced ability to pump
blood, and by extension is unable to adequately supply the body with the nutrients and oxygen it requires
[1,2,14]. This inability of the heart to pump blood is sometimes termed cardiac insufficiency. This term
helps to avoid the popular misconception that heart failure is when a person’s heart has stopped as in the
case of a heart attack [2,16]. While cardiac insufficiency has the, likely obvious, effect of reducing a
person’s ability to perform demanding physical activities at any given moment, the full effects of heart
failure are rather more insidious.
Galen is perhaps the first recorded physician to have conjectured that organs aside from the heart and
arterial-venous network might be involved in regulating circulation [17]. While he erroneously concluded
that the liver was the body's main blood producing organ (due to its high degree of vascularization, i.e. it
has lots of blood vessels), an error which remained regrettably uncorrected for 15 centuries, it turns out
that the liver, along with the lungs and adrenal glands, but most importantly the kidneys, do have major
biochemical involvement in regulating a hugely important aspect of the circulatory system: blood pressure
[17]. The natural response of these organs to an event of cardiac decompensation (i.e. cardiac
insufficiency), is to attempt to correct these drops by activating a series of body systems and reflexes to
increase both blood volume and blood pressure and by extension cardiac output [18,19]. This is done
primarily through the renin-angiotensin-aldosterone system (see Figure 2-1) which effects an increase in
sodium and fluid retention along with an
increase in vasoconstriction (narrowing of
blood vessels) [18,19]. The autonomic nervous system also contributes, both by increasing vasoconstriction
and by attempting to increase heart rate and contraction force (see Figure 2-2) [18,19]. In short, the body
engages its ‘fight-or-flight’ emergency response mechanism.
While the aforementioned response is
highly appropriate for acute events of
cardiac insufficiency such as significant
blood loss, or even to prevent fainting as
a result of standing up suddenly from a
resting position, it is the incorrect
response to chronic persistent heart
failure [18,19]. Not only does this response
not resolve the underlying cause of the
chronic heart failure such as abnormal
heart rhythms or damage to or malformation of the heart, among other root causes, but constantly
engaging the body's ‘fight-or-flight’ mechanism has damaging side-effects [19]. Elevated blood pressure
(hypertension) is associated with increased risk for a myriad of other conditions including: pulmonary
edema (leaking of fluid into the lung), atherosclerosis (hardening of arteries as a result of plaques formed
due to damage to the vessels), and hemorrhagic stroke (rupture of a blood vessel) [18,19]. Increased
sodium and fluid retention causes not just the blood to retain more water, but the whole body; fluid often
builds up in other organs and in the arms and legs which can cause undesirable compression of internal
organs and result in damage to those organs [19]. Furthermore, the reduced blood flow combined with
inappropriate pressure increases in certain organs can cause fluid in general to back up, or become
congested in areas along the circulatory network, which is what gives congestive heart failure its name
[18,19]. In addition, the whole response system has the effect of causing what is known as ‘cardiac
remodelling’ whereby the actual physical structure of the heart changes to adapt to its new environment
[19,20]. Many of these changes have an overall damaging effect in the long-run and the exact nature and
extent of this remodelling
depends greatly on the
type of heart failure, for
example whether it is
localized in the left or
right side of the heart (or
both), whether it has the
effect of weakening or
stiffening of the heart
muscles, or whether the
heart failure is due to
other causes such as
abnormal heart rhythms
or blockages [19,20].
Suffice it to say that the
symptoms and pathology
of heart failure are
complex.
As a result of the
complexity of heart
failure, it can be difficult
to manage for both
patients and their physicians [2–4,19]. This is especially unfortunate because heart failure is essentially
impossible to cure since the heart, unlike many other muscles, does not heal or regenerate naturally and
modern medicine has not yet found a way to cause it to do so [19,21]. Care is made even more difficult
because there is no reliable objective method for assessing the functional state of any given patient’s HF,
never mind determining if it is likely to worsen irreparably [5–7].
2.1.1 New York Heart Association Functional Classification
The current clinical gold standard for communicating the severity of symptoms experienced by a CHF
patient is the New York Heart Association (NYHA) functional classification system [8,9,22]. Under this
system, patients are classified based on the physician's interpretation of patient-reported symptoms
(mainly with respect to their degree of exercise/activity intolerance). The physician will then assign the
patient to one of the four NYHA functional classes they believe is most appropriate based on their
clinical experience, professional judgement and according to the NYHA class definitions. These definitions
are copied below for the reader's convenience [23]:
I. “Patients with cardiac disease but without resulting limitation of physical activity. Ordinary
physical activity does not cause undue fatigue, palpitation, dyspnea, or anginal pain.”
II. “... slight limitation of physical activity. They are comfortable at rest. Ordinary physical
activity results in fatigue, palpitation, dyspnea, or anginal pain.”
III. “... marked limitation of physical activity. They are comfortable at rest. Less than ordinary
activity causes fatigue, palpitation, dyspnea, or anginal pain.”
IV. “Patients with cardiac disease resulting in inability to carry on any physical activity without
discomfort. Symptoms of heart failure or the anginal syndrome may be present even at rest. If
any physical activity is undertaken, discomfort is increased.”
This classification system is highly subjective [6,7], especially for NYHA class II and III, which call for
patients experiencing “slight” versus “marked limitation of physical activity” [9]. The application of the
criteria thus varies widely based on the patient’s self-report and the individual physician’s interpretation
of the report [6,7]. Despite these limitations, clinical evidence and medical research have established many
important relationships between a patient's symptom status and their prognostic outcomes which makes
the assessment of NYHA functional class a useful part of care [7,10]. Aside from this prognostic
utility, it also provides clinicians and medical researchers a standardized way of quickly communicating
the clinical severity of a given patient’s heart failure [19,24]. As such, scientific papers dealing with CHF
often report the NYHA class of their patient population (amongst other metrics) to provide a universally
recognized, although perhaps imprecise, description of the clinical make-up of their population.
Unfortunately, approximately 99% of these papers also fail to provide details as to how the NYHA
functional classes were assessed [6].
Assessing Exercise Capacity
The core determinant of NYHA class is the impact of a patient's heart failure on their ability to
perform physical activity without “undue fatigue, palpitation, dyspnea, or anginal pain”. While the NYHA
functional classification system does not prescribe a standardized method by which to evaluate limitations
of physical activity, there are certainly several methods of evaluating a patient's exercise capacity,
whether for NYHA functional class assessment or for other purposes. These include questions posed as
part of a medical interview, cardiopulmonary exercise testing, and physical activity/fitness
trackers/monitors.
2.2.1 The Medical Interview (Standardized & Unstandardized Questioning)
The familiar medical interview, whereby a clinician carefully queries a patient to elucidate the
patient’s relevant medical history and symptoms, is a staple of medical care. It is also the classic method
of assessing NYHA functional class; adding a few pertinent questions is inexpensive, relatively quick, fits
neatly into the existing workflow of clinicians and also happens to be the established best practice. It is
however highly inconsistent with regards to NYHA class assessment both between physicians and for the
same physician across time, and is thus highly unreliable [6,11,25–27]. Carroll et al. report (bibliographic
reference numbers updated to reflect ours):
[One study] used two physicians to estimate NYHA functional class in 75 patients on
the same day without chronic heart failure, reporting an interrater reliability of 56%
(weighted kappa = 0.41)[11]. In a second study, two cardiologists assessed the same 50
chronic heart failure patients on the same day in random order, observing 54%
agreement in NYHA classes [6]. In a third study, two physicians assigned NYHA class
to 56 patients with stable angina within the same hour, resulting in the highest reported
agreement of 75% [26]. Among these studies, disagreement by more than one functional
class was low and, for the most part, was concentrated on determining the discrete
differences between Classes II and III. Taken together, the reliability of the NYHA
system is limited in the few trials that have measured it directly [25].
These results are very low: 54% and 56% levels of agreement represent only weak agreement between
physicians, and a 75% level of agreement still implies that only about 56% of the examined cases should
be considered correct [28].
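To make the relationship between raw agreement and chance-corrected agreement concrete, the short sketch below computes Cohen's kappa for a hypothetical two-rater agreement table over the four NYHA classes. The counts, and the snippet itself, are purely illustrative and are not drawn from any of the studies cited above; chance-corrected figures of this kind are what the kappa values quoted throughout this thesis refer to.

```python
# Illustrative sketch only: Cohen's kappa for two raters assigning NYHA classes,
# computed from a hypothetical (made-up) agreement table.
import numpy as np

def cohens_kappa(agreement: np.ndarray) -> float:
    """agreement[i, j] = number of patients rater A placed in class i and rater B in class j."""
    total = agreement.sum()
    p_observed = np.trace(agreement) / total            # raw agreement (diagonal of the table)
    row_marginals = agreement.sum(axis=1) / total       # rater A's class proportions
    col_marginals = agreement.sum(axis=0) / total       # rater B's class proportions
    p_expected = float(row_marginals @ col_marginals)   # agreement expected by chance alone
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical counts for NYHA classes I-IV (rows: rater A, columns: rater B)
table = np.array([[20, 5, 0, 0],
                  [6, 18, 8, 0],
                  [0, 7, 15, 2],
                  [0, 0, 1, 8]])
print(round(cohens_kappa(table), 2))  # ~0.55, even though raw agreement is ~0.68
```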
It should be noted that the third study (Christensen et al.) examined only NYHA functional classes I to
III, and the first study (Goldman et al.) examined all four functional classes [11,26]. In the second study
(Raphael et al.), the researchers investigated class II and III assessments specifically [6]. Furthermore each
study had an imbalanced distribution of classes which makes reporting raw accuracy somewhat misleading
since classes I and IV end up being relatively easy to distinguish in clinical practice whereas the middle
classes II and III generally represent the actual classification challenge for physicians [25]. Approximately
half of patients in Goldman et al.’s study exhibited NYHA class I symptoms which may have contributed
to the slightly higher agreement found in this study compared to Raphael et al.’s study. Unfortunately
Christensen et al. neglected to provide any information on their class distribution entirely, although it
appears to be slightly unbalanced, since visual examination of their figures indicates that a significant
subset (possibly a quarter to a third) of their study population were patients with NYHA class I. We agree
with the authors (Christensen et al.), however, that the real reason they saw higher agreement was
likely that “they used the same two physicians through the study … who, in addition, had a small
training session prior to data collection” [26].
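The effect of an imbalanced class distribution on raw accuracy can be illustrated with a small, hypothetical example (not drawn from the studies above): if half of a sample falls in the comparatively easy class I, a rater who identifies every class I patient correctly but defaults to class II for everyone else still appears to perform reasonably well on raw accuracy, while a balanced (per-class) accuracy exposes the weakness.

```python
# Hypothetical illustration of raw vs. balanced accuracy under class imbalance.
import numpy as np

truth = np.array(["I"] * 50 + ["II"] * 20 + ["III"] * 20 + ["IV"] * 10)
rating = np.where(truth == "I", "I", "II")     # correct on class I, guesses "II" otherwise

raw_accuracy = np.mean(rating == truth)        # (50 + 20) / 100 = 0.70
per_class_recall = [np.mean(rating[truth == c] == c) for c in ["I", "II", "III", "IV"]]
balanced_accuracy = np.mean(per_class_recall)  # (1.0 + 1.0 + 0.0 + 0.0) / 4 = 0.50
print(raw_accuracy, balanced_accuracy)
```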
In normal practice clinicians usually differ in the exact criteria and questions they would use to assess the
NYHA class of their patients [6]. The most popular were self-reported walking distance (70% of the 30
cardiologists surveyed), difficulty in climbing stairs (60%), ability to walk to a recognized local landmark
(30%) and breathlessness interfering with performing daily activities or when walking around the house
(23%) [6]. Thirteen percent of cardiologists had no specific question or criteria for assessing NYHA class [6].
Even among those who used a common question or criterion, the application of the criteria often differed. For
example, in choosing between class II and III patients, two-thirds of physicians would classify a patient who
couldn't make it up a flight of stairs without stopping as class II, while one-third would classify them as class
III [6].
Assessment at the Toronto General Hospital Heart Function Clinic
At the TGH HF clinic, NYHA class is typically assessed for every patient with known cardiac disease,
which is first objectively verified using some sort of medical imaging. NYHA class is then reassessed at
every clinic visit by the physician responsible for the patient's care as part of the medical interview. At
minimum, the physician will pose questions to attempt to elucidate the patient's degree of exercise
intolerance, for example: "How far can you walk before becoming short of breath?", although the
established preferred criterion is "How many flights of stairs can you climb before needing to stop?" The
classes are broken down as follows:
Class I. Asymptomatic; able to perform physical activity normally. (As a specialized tertiary care centre, the
Heart Function Clinic rarely sees NYHA class I patients, as they are often asymptomatic with regards to their
heart failure, or at least rarely require the specialized level of care offered by the clinic.)
Class II. Able to walk up more than one flight of stairs, or 100+ meters, before being breathless.
Class III. Only able to walk up one flight of stairs before being breathless/requiring a break.
Alternatively, gets tired walking to the washroom.
Class IV. Always breathless; symptoms even at rest.
Of course, these questions are adjusted as per the clinical demands. For example, the stair question is
unsuitable for a patient who is wheelchair-bound or has significant mobility impairment, but the
principle of using internally consistent criteria remains the same.
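Purely as an illustration of what these informal criteria look like when written down as explicit rules, a naive rule-based assessor might be sketched as below. The field names and thresholds are assumptions made for this example, not part of any clinic protocol, and the point of the surrounding discussion is precisely that such fixed rules must be adjusted case by case and applied with clinical judgement.

```python
# Illustrative sketch only: the clinic's interview criteria written as explicit rules.
# Field names and thresholds are assumptions for this example; in practice the
# questions are adapted to each patient rather than applied as a fixed algorithm.
from dataclasses import dataclass

@dataclass
class InterviewFindings:
    asymptomatic: bool              # no reported limitation of physical activity
    symptoms_at_rest: bool          # breathless even at rest
    flights_before_stopping: float  # flights of stairs climbed before needing to stop

def nyha_from_interview(findings: InterviewFindings) -> str:
    if findings.symptoms_at_rest:
        return "IV"
    if findings.asymptomatic:
        return "I"
    if findings.flights_before_stopping > 1:   # more than one flight (or roughly 100+ m)
        return "II"
    return "III"                               # limited to about one flight or less

print(nyha_from_interview(InterviewFindings(False, False, 1.0)))  # -> "III"
```

Even this toy version runs into the problems raised above: a wheelchair-bound patient has no meaningful count of "flights before stopping", and the thresholds themselves are exactly where physicians tend to disagree.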
Unsurprisingly, prior agreement on assessment criteria has been demonstrated to improve inter-physician
agreement drastically [27]. Kubo et al. for example developed a patient questionnaire with the express
intent of addressing the problem of inconsistent NYHA classification in multi-centre trials, although the
questionnaire was “not meant to replace or improve the traditional method by which clinicians assess
NYHA in everyday clinical encounters” [27]. The questionnaire is composed of 7 major questions that echo
some of the popular interview questions including questions such as: “How often do you walk up and
down stairs?” and “How often do you go for walks, either outside or inside, on level ground at a normal
pace under normal conditions?” with follow up questions including “Do you avoid stairs [/walks] because
it makes you tired or short of breath?” and “How often would you get short of breath when you walk up
or down a flight of stairs at a normal pace under normal conditions?” that are typically answered with one
of ‘Never, Rarely, Some or Frequently’ and occasionally with just a simple ‘Yes/No’ response [27]. The
questionnaire uses a separate scoring tool (not provided) that assesses the frequency of both activities and
their associated symptoms including symptoms or lack of symptoms at rest [27]. The scoring tool however,
at least in its current state, eschews the use of an automated algorithm “because of the inability of simple
algorithms to reconcile inconsistent patient responses” [27]. In validating the use of this questionnaire,
Kubo et al. found about a 60% agreement comparing independent assessments performed at a remote
site and their core central site, a 75% agreement comparing independent assessments performed at the
same core central site, and a 90% agreement on repeat assessment of a random subset of the same
questionnaires 3 months later [27]. These results are in the same range as Christensen et al.’s results,
which possibly indicates that even informal prior agreement on assessment criteria (in the form of a
preparatory training session) drastically improves inter-physician agreement on NYHA class. Of course,
subjectivity in the NYHA classification is not just introduced by clinicians. It is also introduced by
patients.
2.2.2 Standardized In-Clinic Exercise Testing
A second challenge of NYHA class assessment is that it relies heavily on patient reported symptoms
and on patient memory, which can be unreliable even in the best of circumstances [29–31]. Clinicians, who
face this challenge on a routine basis in the field, even outside the context of NYHA class assessment,
have come up with a myriad of ways to address this problem. In fact, a great deal of research tries to
identify or create tests that measure physical fitness, maximum exercise capacity, or some proxy thereof
in a standardized way [32–39]. In general, these tests measure a patient's exertion over a period of time
[32,34–36,38–40]. Exertion is usually calculated by raw distance traveled (being generally more convenient
to measure) [32,34,36,40], patient step count (which can be linked to distance if the patient's stride length
is known) [38,41–47], movement recorded by raw accelerometer data [39,48–50], activity difficulty (e.g.
surface incline, resistance band strength) [41,46] or energy consumption (e.g. Metabolic Equivalents:
METS) [8,32,37].
Timed Walking Tests
Timed walking tests are an excellent example of a basic, easy to run standardized in-clinic exercise
test. The 6 minute walk test (6MWT), one of the more recently developed timed walking tests, typifies the
general approach used in these tests. For this particular test, a patient is asked to walk as far as they can
(being permitted to rest as needed) over a hard flat surface over the period of 6 minutes; the total
distance walked is then used as an indicator of the exercise capacity of the individual [40] and by
inference, their symptomatic limitations due to heart failure [7].
While timed walking tests have shown that measures of exertion over time (whether distance, step count
or otherwise) are correlated to the NYHA functional classification of patients, there often remains a
notable gap in the explanatory power of these measures. For example Demers et al. found that for the 768
patients in their multi-centre study the "baseline 6MWT distance was ... moderately inversely correlated
to the New York Heart Association functional classification (NYHA-FC) (r = -0.43, P=.001)” [51]. One
would expect that walking distance should be correlated with evaluated NYHA functional class, but
distance travelled in this case only explains approximately 18.5% of the variance in the data (r² = 0.1849).
This may be because NYHA functional class is not predominantly attempting to ascertain maximal
exercise capacity but rather the degree of abnormally symptomatic response to exercise – a much more
nuanced question. Therefore tests, measures, or metrics which can reliably mirror NYHA functional class
will likely need to measure not just exertion, but the patient’s physiological response to that exertion -
beyond the simple binary yes/no response of being able to continue the exertion demanded (which is the case for
all the previously mentioned tests).
Cardiopulmonary Exercise Test (CPET)
The cardiopulmonary exercise test (CPET), or more colloquially ‘the treadmill test’, is the gold
standard for in-clinic exercise testing [52]. It is a supervised test run by trained staff in a controlled
clinical environment. In this test, the patient walks on a treadmill or cycles on a stationary bicycle,
typically until they become exhausted or experience muscle fatigue, respiratory difficulty or
some other symptom that indicates termination of the test [32,53]. While the patient is
exercising, their detailed physiological response to increasing resistance on the treadmill/bike is measured
using:
• surface electrocardiography (ECG), to measure pulse and cardiac waveform (sinus rhythm);
• pulse oximetry, to measure blood oxygen saturation;
• a blood pressure (BP) cuff, to measure blood pressure;
• spirometry equipment, to measure lung capacity, volumes and flow; and
• pulmonary gas equipment, to measure oxygen (O2) and carbon dioxide (CO2) exchange [32,53].
Together, this data provides an informative picture from which clinicians can further derive metrics
measuring a patient’s lung and cardiac response to exercise [24,32,53,54]. Some of the more unique and
important measures derived from this test include:
• Peak V̇O2 [mL/kg/min] (relative peak V̇O2), the peak measured oxygen uptake, is an estimate of the true
maximal aerobic capacity, V̇O2max [mL/kg/min], of a patient [32]. V̇O2max, or relative V̇O2max, is
the body-weight-normalized version of (absolute) V̇O2max [L/min]. Absolute V̇O2max is
“considered to be the metric that defines the limits of the cardiopulmonary system. It is defined
by the Fick equation as the product of cardiac output [heart rate & stroke volume] and
arteriovenous oxygen difference … at peak exercise” [32]. Reporting the relative (normalized)
version is preferred since patients with higher body weight will naturally have a higher absolute
V̇O2max but will not necessarily have fundamentally greater functional capacity, exercise capacity
or exercise tolerance [32]. It is also important to note that peak V̇O2 is always an estimate of true
maximal aerobic capacity; its recorded value depends not only on the test modality used (treadmill
or bike) but is importantly predicated on the attainment of maximal/peak exercise by the patient
during the test [32].
• Ventilatory threshold (𝑉𝑇) [mL/kg/min], an estimate for, and sometimes interchangeably known
as, anaerobic threshold (𝐴𝑇), attempts to measure the exertion level at which a patient’s body
stops being able to keep up with their muscles’ oxygen demands [32]. It is an alternate index used
to infer exercise capacity but is predicated on the idea that people do not constantly perform
activities at maximal effort. AT, in a sense, is a measure of maximum continuously sustainable
exertion [32]. As AT is a submaximal index of exercise capacity, it is sometimes reported as a
percentage of peak V̇O2 [32].
• Respiratory exchange ratio (𝑅𝐸𝑅), the ratio between exhaled CO2 and inhaled O2 [32]. Of
particular interest is the peak RER which can be used to gauge if a subject is likely to have
achieved peak (or at the very least sufficient) exerted effort as part of the test [32]. It is known to
be more robust than heart rate response for measuring exertion, as heart rate response is often
highly variable even in healthy populations (and worse for patients with heart failure, since their
response is often modulated by medications).
• V̇E/V̇CO2 [breaths/L], or the relationship between minute ventilation and carbon dioxide output, is
used to estimate ventilatory efficiency: how many breaths it takes for the body to clear a given
unit of CO2 [32]. The relationship most often reported is a linear approximation of the
V̇E/V̇CO2 slope, which is highly robust against test modality and attainment of peak exercise by the
patient [32]. It is often used to infer the possible existence of ventilation-perfusion mismatching:
where the lungs are unable to efficiently clear CO2 from the circulatory system, either due to
circulatory problems causing poor blood flow or inefficient CO2 transfer due to some sort of lung
damage or disease [32]. (A brief illustrative sketch of estimating this slope is given below.)
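To make the idea of the V̇E/V̇CO2 slope concrete, a minimal sketch is given below: it fits a straight line to hypothetical minute ventilation and CO2 output measurements, with both quantities expressed in L/min as the slope is commonly computed. The values and variable names are invented for illustration and are not taken from any CPET dataset used in this work.

    import numpy as np

    # Hypothetical CPET measurements (illustrative values only)
    vco2 = np.array([0.5, 0.8, 1.1, 1.5, 1.9, 2.3])   # V̇CO2 [L/min]
    ve = np.array([15.0, 22.0, 30.0, 41.0, 53.0, 66.0])  # V̇E [L/min]

    # The commonly reported metric is the slope m of the linear approximation V̇E ≈ m·V̇CO2 + b
    slope, intercept = np.polyfit(vco2, ve, deg=1)
    print(f"V̇E/V̇CO2 slope ≈ {slope:.1f}")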
Many of these CPET measurements have been clinically validated and recommended to help inform
important decisions regarding heart failure care. For example, peak V̇O2 is used to risk-stratify certain
classes of HF patients when considering a heart transplant [55].
Others have already attempted to discover the relationship between NYHA class and various CPET
measures [11,24,25,56]. Rostagno et al. looked at 143 HF patients with NYHA functional class ranging
from I to IV but found low agreement between the assigned NYHA class and both peak V̇O2 and AT
(41.7% and 35%, respectively) [24].
Goldman et al. looked at the duration of treadmill tests and similarly found low agreement, with only
51% of their 150 estimates (75 patients with one estimate each by two independent physicians) agreeing
with the NYHA class assigned [11]. This is not terribly surprising but is instead consistent with what we
would expect based on Demers et al.'s 6MWT findings.
In a more recent analysis, Lim et al. performed a systematic review of 38 studies that investigated
the correlation between NYHA classification and peak V̇O2 (other CPET metrics were not reported
consistently enough for analysis) [56]. They found a significant difference between pooled peak V̇O2 values
for NYHA classes I vs. II and II vs. III (P < 0.0001 in both cases) [56]. However, they did not find a
significant difference when looking at classes III vs. IV [56]. Peak V̇O2 and NYHA classes I to III were
inversely correlated, although the strength of the correlation was not quantified [56].
To our knowledge no one else has published attempts to characterize the relationship between NYHA
class and other CPET measures. Despite the lack of research and evidence surrounding most of the CPET
metrics, Lim et al.’s findings regarding peak V̇O2 and NYHA class are an encouraging waypoint in the
quest to objectively assess NYHA classification. However, CPET studies do have some important
drawbacks.
One of the biggest drawbacks of running CPET studies is that they require access to expensive
equipment, trained personnel and a lab environment in which to perform the test [32]. Due to the
financial cost and time burden alone, it is likely that relying on CPET studies to assess NYHA class will
severely limit how often NYHA class can be re-assessed, which makes it less desirable for use in creating a
quick and easy method of assessing the severity of patients’ HF symptoms [54].
2.2.3 Fitness Trackers/Monitors
Modern commercially available fitness trackers, such as those developed by Fitbit Inc. [57–59], are a
promising, albeit little-used, candidate for assessing patient exercise capacity that would overcome many of
the drawbacks of cardiopulmonary exercise tests.
Activity & Step Detection
Activity trackers are small, portable devices that are worn on one’s person. They may be worn on
one’s feet or shoes, clipped on the belt near one’s hip, or worn on one’s wrist like a wristwatch
[41,43,64,65,45,57–63]. The classic pedometers of yore are in fact a type of activity tracker, but they are
specifically limited to counting steps [65,66]. Most modern activity trackers are more precise and
often more multi-functional than the classic pedometer [57–59,64]. Even from a pure motion detection
perspective, older pedometers were often limited to single-axis accelerometers which could only detect
movement (specifically acceleration) in one axis [66].
Modern activity trackers have been found to track minute-by-minute step count fairly reliably
[37,41,43,45,46,65,67–70]. Straiton et al. [70], in a systematic review of 7 observational studies
including a total of 290 elderly patients (mean age 70.2 ± 4.8 [years]), found a high correlation
between step counts recorded by the test devices and those recorded by the reference devices used in each study. The
reference devices used in the individual studies varied but were typically a previously validated research-
grade activity monitor such as an ActiGraph™ [71] or BodyMedia Sensewear device (no longer available).
In their review they found that “daily step count for all consumer wearables correlated highly with
validation criterion, especially the ActiGraph device: intraclass correlation coefficients (ICC) were 0.94 for
Fitbit One, 0.94 for [Fitbit] Zip, 0.86 for [Fitbit] Charge HR and 0.96 for Misfit Shine. Slower walking
pace and impaired ambulation reduced the levels of agreement” [70]. Physical activity and energy
expenditure estimation, as supported by these devices, was also found to be accurate but generally less so
than step count measurements.
Evenson et al. (2015) [68], who cast a wider net and conducted a systematic review that included 22
observational studies on adults and youth (20:2), similarly found generally high correlations between the
step measurements of the various Fitbit and Jawbone devices investigated in these studies and those of the
reference devices used. The correlation coefficients (intraclass or Pearson) were found to be ≥ 0.8
for all the devices (Fitbit and Jawbone) investigated in all the laboratory studies reviewed. Many of the
studies found an even higher correlation, in the > 0.9 range, and even up to 0.99 for both
Jawbone and Fitbit devices [68]. Evenson et al. also found that physical activity and energy expenditure
estimation were generally less highly correlated than pure step-tracking.
El-Amrawy in 2015 [44] recorded 4 participants who performed 40 repeated sets of 200, 500 and 1000 step
walks and found that step count accuracy varied from an average of 99.1% for the MisFit Shine and
Apple Watch, to 79.8% for the Samsung Gear 2, as compared to the steps counted by a tally counter
equipped observer. Other popular mainstream contenders included the Fitbit Flex (80.5%), the Jawbone UP
(82.51%) and the Xiaomi Mi Band (96.6%).
Overall, research points to step-tracking by modern mainstream commercial activity trackers as being
highly correlated to equivalent research grade reference devices. Certain activity trackers such as the
MisFit Shine appear to be more consistently in agreement with validated reference devices, which may
make them optimal for studies where step count values must be as accurate as possible. However, we
maintain that all the activity trackers discussed are likely suitable for practical applications of step count
tracking. Other features that should be considered are easier access to gathered data, lower cost, improved
ease of use for the patient, or the ability to detect some other important physiological marker.
Heart Rate Detection
With respect to other physiological markers, some of the major players in the commercial activity
tracker market, namely Fitbit™ [58] and Apple™ [64], have recently pioneered the integration of heart
rate monitoring capability alongside the step counting provided by their devices. These augmented fitness
trackers, which are worn on the wrist, also monitor heart rate non-invasively by detecting the flow of
blood under the surface of the wearer’s skin [41,44,72–74]. This technique, known as
photoplethysmography (PPG), has been well validated since its discovery in the 1930s and is commonly
used in various clinical settings [75,76]. In fact, it is the core technology that underpins pulse oximetry
[75,76].
The fundamental principle that underpins PPG itself is the absorption and reflection of light by various
body tissues [75,76]. By shining carefully selected frequencies of light on the surface of the skin and
recording either, the light reflected off of, or transmitted through the skin, one can detect changes in
perfusion of the surface tissues being illuminated. An example of the resulting waveform is shown in
Figure 2-3. Although the precise physiological cause of the perfusion changes measured by the PPG
Figure 2-3 PPG, ECG and arterial pressure waveforms (with cardiac arrhythmia) [288].
17
waveform is still a matter of debate [76], it is clear that certain characteristics of the waveform are
synchronized with heartbeat, and can thus be used to track heart rate. The shape of the waveform is also
known to be correlated with arterial blood pressure, another clinically important physiological marker
[75,76].
One important parameter that can also affect the PPG waveform is the choice of light [75,76]. Light
absorption/reflection characteristics of various body tissues are highly frequency dependent [75,76]. One of
the most important applications of PPG, arterial blood oxygen measurement, depends on this fact [75,76].
Furthermore, the frequency response of oxygen saturated versus desaturated blood is known to vary at
different light frequencies. If we measure separate PPG waveforms using red and near-infrared light, we
can measure the relative difference in light absorbed at these different frequencies [75,76]. The resulting
difference can then be used to infer the degree to which the blood is saturated vs. desaturated [75,76].
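As a rough illustration of the principle just described, the sketch below computes the classic "ratio of ratios" used in pulse oximetry from the pulsatile (AC) and baseline (DC) components of red and infrared PPG signals. The calibration constants shown are generic, textbook-style approximations introduced here for illustration; they are not values from any particular device, and real devices use device-specific empirical calibrations.

    import numpy as np

    def toy_spo2_estimate(red: np.ndarray, infrared: np.ndarray) -> float:
        """Estimate oxygen saturation from red and infrared PPG traces (illustrative only)."""
        # Pulsatile (AC) and non-pulsatile (DC) components of each channel
        r_ac, r_dc = red.max() - red.min(), red.mean()
        ir_ac, ir_dc = infrared.max() - infrared.min(), infrared.mean()

        # "Ratio of ratios": relative absorption difference between the two wavelengths
        ratio = (r_ac / r_dc) / (ir_ac / ir_dc)

        # Generic empirical linear calibration (hypothetical constants, device-specific in practice)
        return 110.0 - 25.0 * ratio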
While fitness trackers do not yet measure arterial blood pressure or use different types of light to measure
oxygen saturation, some newer models of fitness trackers (e.g. the Fitbit Charge HR 2 [58] and Apple
Watch [64]) take advantage of the varying light frequency response of blood by instead using green light
which has been found to be more reliable for pulse rate monitoring [77].
Research has shown that consumer heartrate trackers are fairly reliable as compared to clinical grade
devices [41,44,73,74,78,79]. However, they do provide considerably less detail than clinical grade devices.
Consumer devices generally only capture a minute-by-minute pulse rate, as opposed to the complete ECG
waveform provided by a Holter monitor or non-portable ECG setup.
In a 2016 study, Wang et al. monitored 50 healthy patients on a treadmill test and compared the heart
rate measured by various fitness trackers to the heart rate recorded by an ECG and found them all to be
highly correlated [78]. The concordance coefficients were .99 for the Polar H7 device, .91 for the Apple
Watch, .91 for the Mio Fuse, .84 for the Fitbit Charge HR and .83 for the Basis Peak.
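Wang et al.'s concordance metric is not defined in the excerpt cited here; one common choice is Lin's concordance correlation coefficient, sketched below for paired tracker and ECG readings. The sample values are invented for illustration only.

    import numpy as np

    def concordance_correlation(x: np.ndarray, y: np.ndarray) -> float:
        """Lin's concordance correlation coefficient between two paired measurement series."""
        covariance = np.cov(x, y, ddof=0)[0, 1]
        return 2 * covariance / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

    # Invented example: heart rates from an ECG vs. a wrist tracker during one test
    ecg = np.array([72, 95, 110, 128, 140, 155], dtype=float)
    tracker = np.array([70, 96, 108, 130, 137, 158], dtype=float)
    print(round(concordance_correlation(ecg, tracker), 3))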
In a previously mentioned study, El-Amrawy et al. recorded 4 participants who performed 40 repeated
sets of 200, 500 and 1000 step walks. As part of this study they also compared the heart rate of various
activity monitors to the heart rate reported by a research validated professional clinical pulse oximeter
[44]. The devices investigated, with their corresponding heart rate accuracy (as percent mean deviation
from the average recorded heart rate) and the associated standard deviation (σ) of the measurements,
ordered from most to least accurate, were the Apple Watch (99.9%, σ = 5.7%), Samsung Galaxy Note
Edge (99.6%, σ = 14.4%), Apple iPhone 6 running Cardioo App [80] (99.2%, σ = 6.3%), Samsung Galaxy
S6 Edge (98.8%, σ =11.6%), Samsung Gear 2 (97.7%, σ = 16.5%), Apple iPhone 5S running Cardioo App
(97.6%, σ = 12.4%), Samsung Gear Fit (97.4%, σ = 28.8%), Samsung Gear S (95.0%, σ = 20.9%), and
Motorola Moto 360 (92.8%, σ = 14.1%).
Cadmus-Bertram et al. in a 2017 study, also investigated the heart rate accuracy of several wrist-worn
activity trackers [79]. They were particularly interested in the limits of agreement of the reported
beats/minute (bpm) of each of the devices at different heart rate intensity levels. They also studied the
devices’ accuracy by measuring the mean difference between the heart rates measured by the trackers and
a simultaneously recorded reference ECG. The limits of agreement were defined as the 95% prediction
interval for the mean difference between the tracker and ECG measurements. They also compared
measurement agreement of different devices from the same model series (i.e. comparing measurements
between 2 Fitbit Surges in otherwise identical test conditions), which they termed measurement
repeatability. As for the different heart rate intensity levels, they investigated the heart rate accuracy at
rest and at 65% of each study participant's maximum heart rate while running on a treadmill
(as determined by the maximum heart rate equation: 𝑀𝑎𝑥 𝐻𝑒𝑎𝑟𝑡 𝑅𝑎𝑡𝑒 = 220 − 𝑎𝑔𝑒). The 40 study
participants were all healthy and between 30 and 65 years old (mean ± σ of 49.3 ± 9.5 [years]), and wore
2 trackers on each wrist (randomly assigned left vs. right, and proximal vs. distal to the wrist). Cadmus-
Bertram et al.’s findings, including the mean difference, limits of agreement and measurement
repeatability results, are reproduced for easier reading in Table 1. They found that the activity trackers
had excellent accuracy with a mean difference of ≤±2.8 [bpm] between activity trackers and reference
device whether at rest or while exercising. No further quantitative comparison was made between mean
difference at rest vs exercise. For reference, a 1 [bpm] agreement error at 65% of the maximum heart rate
of a 30, 49.3 and 65-year-old (minimum, mean and maximum age of participants in this study) represents
a percent error of 0.8, 0.9 and 1.0%. At rest, or rather, at heart rates of 60 and 100 [bpm] - the lower and
upper limits of the commonly accepted resting heart rate range [81,82] - the same 1 [bpm] agreement error
represents a percent error of 1.6 and 1.0%2. The precision, as measured by the limits of agreement, was
found to be less impressive. At rest, they ranged from good, -5.1 to 4.5 [bpm] (Fitbit Surge), to relatively
poor, -17.1 to 22.6 [bpm] (Basis Peak). The performance of the intermediate devices investigated (Fitbit
Charge and Mio Fuse), which had limits of agreement of ~±10 [bpm], was closer to the performance of
the Fitbit Surge than the Basis Peak. During exercise (@ 65% maximum heart rate), the precision
degraded considerably, with lower limits of agreement ranging from -41.0 [bpm] in the worst case (Fitbit
Charge) to -22.5 [bpm] (Mio Fuse) in the best case, and upper limits of agreement ranging from 39.0
2 ∴ as a rule of thumb for mental calculations: 1 [bpm] error = 1% (2% when in the 40-60 [bpm] range)
[bpm] (Fitbit Surge) in the worst case to 26.0 [bpm] (Mio Fuse) in the best case. With respect to
repeatability between devices, most devices were found to be around half as repeatable as the ECG
whether at rest or during exercise, with only two exceptions: 1) the Fitbit Surge, which was found to be
possibly slightly more repeatable than the ECG at rest (unfortunately no significance test was provided),
and 2) the Basis Peak which was found to be only a quarter as repeatable as the ECG at rest.
Table 1: Summary of Cadmus-Bertram activity tracker heart rate accuracy study [79]

                    -------------- @ Rest --------------    ----- @ 65% Maximum Heart Rate -----
Device              Mean        Limits of       Repeat-     Mean        Limits of       Repeat-
                    Difference  Agreement       ability     Difference  Agreement       ability
                    [bpm]       [bpm]           [bpm]       [bpm]       [bpm]           [bpm]
ECG                 reference   - to -          5.3         reference   - to -          9.1
Fitbit Surge        2.8         -5.1 to 4.5     4.2         1.0         -34.8 to 39.0   20.6
Mio Fuse            -0.7        -7.8 to 9.9     10.9        -2.5        -22.5 to 26.0   23.7
Fitbit Charge       -0.3        -10.5 to 9.2    9.3         2.1         -41.0 to 36.0   21.6
Basis Peak          1.0         -17.1 to 22.6   19.3        1.8         -27.1 to 29.2   20.2
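For readers unfamiliar with the limits-of-agreement analysis summarized in Table 1, the sketch below computes the mean difference and the 95% limits of agreement (Bland-Altman style) for paired tracker and ECG heart rate readings. The data are invented for illustration and do not reproduce Cadmus-Bertram et al.'s raw measurements.

    import numpy as np

    def limits_of_agreement(tracker_bpm: np.ndarray, ecg_bpm: np.ndarray):
        """Mean difference and 95% limits of agreement between paired heart rate readings."""
        diff = tracker_bpm - ecg_bpm
        mean_diff = diff.mean()
        spread = 1.96 * diff.std(ddof=1)   # half-width of the 95% prediction interval
        return mean_diff, (mean_diff - spread, mean_diff + spread)

    # Invented paired readings [bpm]
    tracker = np.array([68.0, 75.0, 92.0, 101.0, 120.0, 131.0])
    ecg = np.array([70.0, 74.0, 95.0, 100.0, 126.0, 128.0])
    mean_diff, (lower, upper) = limits_of_agreement(tracker, ecg)
    print(f"mean difference: {mean_diff:+.1f} bpm, limits of agreement: {lower:.1f} to {upper:.1f} bpm")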
Our lab, the Centre for Global eHealth Innovation, also recently investigated the heart rate accuracy of
two of the most popular activity trackers at the time: the Fitbit Charge HR and the Apple Watch [41]. In
this 2016 study, R. Abdulmajeed studied 8 healthy participants using a similar methodology to Cadmus-
Bertram et al. although at different exercise intensity levels, which were controlled using a variable
resistance stationary bicycle. The accuracy of the two trackers (worn simultaneously) was measured
against the ECG results of a portable Holter monitor. Abdulmajeed found a similar, though slightly worse,
level of agreement at rest between the Holter monitor and the trackers investigated (mean heart rate difference for the Fitbit Charge HR: 6.00%;
Apple Watch: 3.32%) compared to Cadmus-Bertram et al.'s findings. Abdulmajeed’s findings also hint at a
possibly slightly non-linear relationship between percent agreement and heart workload/heart rate as it
appeared to decrease slightly with increasing workload (Fitbit Charge HR: peak of 8.68% at 40 [watts];
Apple Watch: peak of 7.51% at 30 [watts]) before improving to near complete agreement at higher
workloads (Fitbit Charge HR: <±0.5% when ≥ 80 [watts]; Apple Watch: <±0.75% when ≥ 60 [watts],
except at 90 [watts], where the agreement was -1.64%). These findings are reproduced in an easier-to-read
format in Table 2, along with the heart rates corresponding to the quoted workload intensities.
Table 2: Summary of Abdulmajeed activity tracker heart rate accuracy study. Reproduced from [41]

Workload   Holter Monitor Heart Rate [bpm]     Mean Heart Rate Difference [%]     Pearson Correlation Coefficient
[Watts]    Minimum  Average [sic]  Maximum     Fitbit Charge HR   Apple Watch     Fitbit Charge HR   Apple Watch
0          68       85             102         6.00               3.32            0.406              0.567
10         69       86             102         6.93               4.56            0.593              0.305
20         68       89             114         5.41               6.12            0.951              0.597
30         68       93             129         8.34               7.51            0.973              0.61
40         73       96             129         8.68               5.49            0.93               0.78
50         84       102            132         8.10               2.27            0.88               0.811
60         87       109            136         3.69               -0.45           0.957              0.965
70         88       116            142         1.63               -0.75           0.98               0.994
80         95       122            150         -0.20              -0.72           0.994              0.997
90         99       129            155         -0.10              -1.64           0.986              0.993
100        105      136            161         0.46               0.37            0.992              0.994
Summarizing the findings of these 4 studies, it appears that the findings of Wang et al., Cadmus-Bertram
et al. and Abdulmajeed are in clear agreement that heart rate measurements of activity monitors
generally have high accuracy and correlation with measurements performed by clinical grade equipment.
It also appears, based on El-Amrawy et al.’s findings, that there is very high correlation between the
individual heart rate measurements of the many commercial trackers on the market; this is perhaps unsurprising,
as most of the contenders leverage the same well-validated PPG technology with some minor
modifications to make it fit the form factor of the wearable device. Where the performance of these
trackers appeared to differ greatly from clinical reference devices was in the variance of repeated
measurements. Of the trackers investigated in the study, Cadmus-Bertram et al. found that the devices
were typically half as consistent as an ECG regardless of whether the measurements were done while
active or at rest.
Comparison to Cardiopulmonary Exercise Testing
Based on recent research findings, it is clear that modern activity trackers are fairly reliable at tracking
both step count and heart rate [37,41,79,43–45,65,67–69,73]. It is also clear
however that these devices are definitely less accurate and less precise than the gold-standard CPET.
That being said, these devices have significantly lower upfront costs than CPET equipment and require
little to no dedicated personnel or physical space in the hospital to run tests. "Replacing" patient memory
with activity trackers could still eliminate a significant source of subjectivity and potential error while
being potentially easier and less costly to administer than a full CPET.
Of course, fitness trackers provide fewer distinct data streams than a CPET, usually limited to just steps
and possibly heart rate. While few researchers have attempted to examine the interplay between fitness
tracker heart rate and step count data streams, it is possible that, in the same way that an inertial measurement
unit (IMU) can combine the disparate, independently error-prone sensor outputs of an accelerometer, gyroscope and magnetometer
using sensor-fusion, the same might be done with activity tracker step count and heart rate data for HF
patients and thereby reduce or remove the need for the extra data provided by a CPET. Whether these
two data streams alone are sufficient to objectively assess NYHA class or perform a useful clinical
function for HF patients though is still yet to be determined. The concept however is clearly not
unreasonable: even though hospitals have only recently begun to consider the use of fitness trackers as
part of regular care, there have been some very early successes in using single data streams from trackers
to perform useful clinical functions such as monitoring step count for post-surgical readmission prediction,
or using the heart-rate data for arrhythmia detection outside the hospital [83–88].
Fitness monitors though have another advantage over CPETs: the low cost and portable nature of fitness
trackers means that patients can even be monitored outside the hospital during free-living. Capturing
real-world free-living activity of HF patients might provide a quantitative insight into the limitations
brought about by a patients’ HF symptoms. In fact, a recent exploratory study investigated this exact
concept, sending 8 HF patients home with activity trackers for a period of two weeks. The study found a
statistically significant difference between the daily average step counts of patients in different NYHA
functional classes [13]. Unfortunately, the study’s very small sample size greatly limits scientific
confidence in the generalizability of these findings. In response, we replicated this study using a larger
sample size as the first phase of this work (detailed in Chapter 3) to independently verify these very
promising findings. It would be hugely beneficial to patient care if data streams of regular real-world free-
living activity data made it possible to more routinely reassess NYHA class and even allow for more
prompt detection of important HF status changes.
Remote Patient Monitoring
Regular reassessment of a patient’s status and the continued monitoring of said patient while they are
outside the hospital falls under the broader umbrella of telemedicine [89] and is formally termed Remote
Patient Monitoring (RPM).
RPM, as a specific application of telemedicine, is of particular interest for patients with chronic conditions
[90–92]. An acute exacerbation of a chronic condition can often bring patients into costly hospital
emergency rooms for post-hoc care instead of less costly pre-emptive care/management that might have
prevented the exacerbation in the first place [4,14,92,93]. This leads to both suboptimal care for the
patient and misallocation of resources in an already and increasingly strained health sector [4,14,93–
95].
There have been many documented attempts at creating RPM systems targeted towards HF patients.
Even though researchers have not come to a consensus about the exact effect of RPM systems on
outcomes, based on several meta-analyses of recent literature, it appears that these systems are sometimes
capable of delivering on the promise of providing better care at lower cost.
In a 2018 meta-analysis, Yun et al. [96] reviewed 37 randomized control trials (RCT) covering a total of
9582 HF patients and found that the patient groups receiving telemonitoring care had significantly lower
HF-related mortality (risk ratio: 0.68, 95% confidence interval (CI): 0.50-0.91, no P-value) as well as all-
cause mortality (risk ratio: 0.81, 95% CI: 0.70-0.94, no P-value) compared to standard care.
Patients were found to benefit significantly when their RPM system transmitted data at least once per
day, or when it transmitted multiple (≥3) streams of biological data (e.g. weight, blood pressure and heart
rate). Yun et al. also noted that monitoring patient symptoms, medication adherence and prescription
changes was also associated with reduced mortality risk.
Klersy et al. [97] in their 2014 meta-analysis of 21 RCTs covering a total of 5715 patients, investigated
the healthcare utilization and economic impact of RPM on HF care. They found that, compared to the
control groups, the telemonitored patient groups experienced significantly fewer HF-related
hospitalizations (incidence rate ratio: 0.77, 95% CI: 0.65-0.91) as well as all-cause hospitalizations
(incidence rate ratio: 0.87, 95% CI: 0.79-0.96) resulting in a per patient quality-adjusted life years gain of
0.06 years (approximately 22 days). Furthermore, RPM was associated with a yearly patient cost savings
of €300 to €1000 (approximately $460 to $1535 CAD based on the 2014 exchange rate). The cost savings
were conservatively estimated solely based on the associated third-party payer hospitalization
reimbursement costs for the patients in the meta-analysis.
As mentioned though, not all evidence points towards RPM being a unilaterally positive effector of
change: of note are 3 commonly cited large high-powered RCTs that found no significant effect on
outcomes for HF patients undergoing telemonitoring [50,98,99]. While these 3 studies are certainly
not the only studies to have found little positive change from RPM implementations, their scope makes
them hard to simply dismiss. Ware et al. [100], in a comprehensive review piece, discuss the various
reasons why it is so hard to form a definitive consensus regarding the effects of home telemonitoring
systems in healthcare. They argue that RPM implementations are often viewed as simple one-size-fits-all
interventions (perhaps like a silver bullet) but they are in fact complex socio-technologic systems that are
(or should be) adequately tailored to suit the specific context in which they are implemented - a fact that
is often overlooked when assessing them. Some of the very important factors that impact the successful
implementation of any technology often go unreported or unaddressed in studies. This includes:
appropriate characterization of the intended and actual user groups (both patient population and clinical
staff), suitability of the home telemonitoring (HT) service for the implementation context (e.g. how is the
system resourced, and what actual user needs is it attempting to address), implementation strategy used
(including training, methods of ensuring adherence to the ‘system as-intended’), suitability of the evaluation
approach for capturing the desired outcome (e.g. are RCTs an adequate trial design for capturing outcomes in an
evolving socio-technical system?), what the actual desired outcomes for the intervention are (reduced
mortality? increased patient quality of life? purely cost reduction?), and whether these outcomes match up with
stakeholder expectations. In their words:
“HT has been shown to reduce mortality and HF hospitalizations and improve clinical
outcomes in HF patients. Despite this evidence, significant heterogeneity exists in the
design of HT interventions, the implementation context, and outcomes of individual
studies, leading to ambiguity about the true effect of HT on HF outcomes. HT is not
one, but rather a collection of complex interventions for which success or failure is
linked to a range of contextual factors. These factors cannot be ignored if we are to
design studies that will offer more definitive answers about the effect of HT on HF
outcomes.” [100]
2.3.1 Medly
For this particular thesis we piggy-backed off of a specific RPM system: Medly, a mobile-phone based
HF patient telemonitoring system currently in place at (and adapted for use by) the Ted Rogers Centre of
Excellence for Heart Function, a tertiary care clinic for HF patients located in TGH in Toronto, Canada
[101,102]. A previous iteration of Medly, and thus its core features, have previously been validated
through a 6 month RCT, which found that its targeted telemonitored patient user group, relative to base-
line, had improved self-care maintenance (Δ = +7 points, P = .05) and management (Δ = +14 points, P
= .03) as measured with the Minnesota Living with Heart Failure Questionnaire, improved levels of brain
natriuretic peptide (BNP) - a biomarker associated with HF stability (Δ = -150pg/ml, P = .03) and
improved left-ventricular-ejection-fraction (LVEF) (Δ = +7.4%, P = .005) compared to the control group
[103]. In recognition of the complex multi-faceted nature of telemonitoring interventions, we provide a
more detailed discussion of the intervention and its unique context in Chapter 4, as part of the larger
discussion of how we implemented an initial version of activity tracker monitoring as part of Medly.
One of the important core features of Medly is an innovative computer algorithm capable of generating
timely, safe, and clinically-relevant messages (instructions or alerts) to patients and clinical staff [104].
The intent of this feature is to enable Medly to provide a cost-effective and scalable way of monitoring
patients on a daily basis by limiting the impact on the workload of clinical staff while simultaneously
leveraging ‘teachable moments’ to improve patient self-care maintenance and management [3,104,105].
This is accomplished by imbuing the system with a limited ability to mimic the decision making and
actioning process of the expert clinical staff at the Heart Function clinic so that the system is able to
adequately triage, and respond to or elevate clinical concerns to staff as necessary while providing patients
with regular feedback about their own condition [104]. Of course, the concept of imbuing a machine with
decision making ability (limited or otherwise) belongs to the now resurging field of artificial intelligence.
Artificial Intelligence & Machine Learning
Artificial intelligence (AI) broadly refers to the concept of intelligence (e.g. learning, decision making,
perception and recognition, creativity and problem solving) exhibited by machines (typically computers,
but formally, anything not imbued with natural intelligence as humans and animals are) [106–109]. The field
of AI is as fascinating as it is expansive. Although the field only became a formal academic discipline unto
itself in 1956³ [108,109], it spans and draws from the fields of mathematical, statistical and computer
sciences, delves into psychology and neurology, and is even starting to pose new and challenging
philosophical, ethical and economic questions (such as ‘what actually is intelligence? what decisions should
and shouldn’t we delegate to a computer? what will be the place of humanity if computers can beat us at
everything?’).
One of the early successful approaches to creating artificial intelligence was to train a computer program
(like Medly) to mimic the decisions of a human expert, like a cardiologist or nurse, in what is formally
termed an ‘expert system’ [106,110]. Expert systems are typically created by first extracting a series of
formalized facts from the target experts and translating them, typically, into formal conditional, ‘if-then’,
logic statements. For example: if a patient is male and older than 35 and has chest pain, then suspect a
heart attack; if a heart attack is suspected, then perform an ECG. These facts form the ‘knowledge base’
of the expert system. The machine can then use this knowledge base in conjunction with an ‘inference
engine’, which uses some formal logic system - such as zeroth-order propositional logic (i.e. modus
ponens4, modus tollens5, etc.) - to manipulate the contents of the knowledge base and draw conclusions,
make decisions or supply recommendations (if a patient is male and older than 35 and has chest pains
then perform ECG). The machine can then also be asked to ‘show its work’ by displaying the exact step
by step deductive, inductive and/or abductive logic processes used to reach its final conclusion [110].
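A minimal sketch of the idea, using the chest-pain example above: the 'knowledge base' is a list of if-then rules over known facts, and a naive 'inference engine' repeatedly applies the rules until no new facts can be derived. This is only a toy illustration of the concept, not a description of how Medly or any real expert system is implemented.

    # Toy knowledge base: (antecedent facts, consequent fact) pairs
    rules = [
        ({"male", "older_than_35", "chest_pain"}, "suspect_heart_attack"),
        ({"suspect_heart_attack"}, "perform_ecg"),
    ]

    def forward_chain(facts):
        """Naive forward-chaining inference: apply rules until no new facts appear."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for antecedents, consequent in rules:
                if antecedents <= derived and consequent not in derived:
                    derived.add(consequent)
                    changed = True
        return derived

    print(forward_chain({"male", "older_than_35", "chest_pain"}))
    # {'male', 'older_than_35', 'chest_pain', 'suspect_heart_attack', 'perform_ecg'}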
Expert systems have seen application in various sectors, but are especially useful where demand for
expertise is high but supply is relatively low or expensive, for example in the health care, finance,
and legal sectors [106,110].
In the case of NYHA functional class assessment, (a function not presently performed by Medly), one
might theoretically create an expert system which could mimic expert grading by (an) experienced ‘model’
physician(s). However, in doing so one would run into one of the major issues with expert systems: the
knowledge acquisition problem. Since creating traditional expert systems relies on the premise that there
are experts available who can formalize their knowledge into statements suitable for interpretation by
3 McCarthy et al. famously “proposed a 2 month, 10 man study of artificial intelligence to be carried out during the
summer of 1956… [they thought] that a significant advance [could] be made… if a carefully selected group scientists
work on it together for a summer.” Suffice it to say, the problem of AI turned out to need more than a small summer
research project to solve.
4 affirming the antecedent: If P then Q; P; ∴ Q
5 denying the consequent: If P then Q; not Q; ∴ not P
some inference engine, the actual implementation of these expert systems becomes compromised when 1)
there are insufficient experts available, or 2) their knowledge cannot be formalized adequately (or even at
all). In the case of objective NYHA functional class assessment (an unsolved problem), the situation is
fairly simple: there are no experts available - which precludes the creation of a traditional expert system
entirely. Fortunately, the field of AI has developed beyond just expert systems.
2.4.1 Machine Learning
An alternative to having experts a-priori supply all the knowledge required for an AI to ‘think’ is to
instead make an AI that can ‘learn’ that knowledge by itself from input data or example cases. This is a
sub-domain of AI called machine learning6 [106,107]. This sub-domain is also fairly large, as many
different approaches have been developed since 1956 as part of different attempts to get computers to
extract useful knowledge from data [111]. Some of these approaches are more suitable for different types
of machine learning problems; so it might be helpful to first clarify how machine learning problems are
classified, broadly, before determining which machine learning category the problem of NYHA functional
class assessment falls into.
2.4.2 Supervised, Unsupervised and Reinforcement Learning
The first important way to classify machine learning problems is by learning modality. Machine
learning problems come in 3 major types: supervised learning, unsupervised learning and reinforcement
learning problems [111–113].
1) Supervised learning problems, the most common type, are those where both the input and output
variables are provided. The computer learns a mapping function to accurately convert the inputs
to outputs, even inputs that haven’t been seen before [111,112]. In other words, for a given input
variable 𝑥 and output variable 𝑦, where 𝑦 = 𝑓(𝑥), find a suitable 𝑓 [111,112].
2) Unsupervised learning problems are those where only the input variable (𝑥) is provided – neither the
output variable (𝑦) nor the mapping function (𝑓) is known – and the objective of unsupervised
learning is usually to have the machine discover underlying patterns in the data [111,112].
6 Colloquially, the terms ‘artificial intelligence’ and ‘machine learning’ are sometimes used interchangeably (e.g.
[107]). However, machine learning technically refers to the task of getting machines to mimic the ‘learning’ process of
intelligence, whereas artificial intelligence refers to the field (inclusive of all its subdomains) as a whole. In this work
we use the technical terms exclusively.
3) Reinforcement learning approaches the concept of learning from an entirely different perspective
than supervised and unsupervised learning [113]. In reinforcement learning there is, in a sense,
neither a static 𝑥, 𝑦 nor 𝑓. Rather, the machine learns by trial and error from successive
interactions with an external environment what actions it should take to optimize the value of
some future reward [113]. In other words, the machine must not only consider how to interpret
the present state of its environment, but also which actions to take (and by extension which
additional input data to collect about its environment), and finally decide which actions are most
appropriate to bring it closest to its goal based on the past success or failure of previous actions
[113]. Reinforcement learning methods are thus the realm of ‘game-playing’ AIs, such as AlphaGo
[114], which ‘plays’ the board game Go, OpenAI Five [115,116], which competes at Dota 2 (a
multiplayer online battle arena video game), and the various AIs that compete at real time
strategy video games like Starcraft/Starcraft 2 [117].
The question of objective NYHA class assessment clearly falls under the class of supervised learning, since
we have a known output label – NYHA functional class – that we wish to determine based on some input
variables, or ‘features’, in our dataset. Our question is whether it is possible to find an adequate mapping
function given our input data.
2.4.3 Classification vs Prediction Problems
Supervised learning algorithms can be further categorized by the expected output of the algorithm:
either a categorical label or a numerical prediction. The former is termed a ‘classification’ problem, and
the latter a ‘prediction’ or ‘regression’ problem [111–113]. Note that while the term ‘prediction’ has a
temporal connotation, prediction problems need not be temporal in nature – a prediction need not
necessarily be a forecast for or of the future. Inferring a missing value in a dataset, such as a missing
grade for a student’s assignment based on their other assignments, would be just as valid a prediction
problem as forecasting the next day’s temperature based on historical temperature data. In contrast,
forecasting whether the next day will be ‘hot’ or ‘cold’ is an example of a classification problem.
Determining the probability that a patient falls within a given NYHA class would be a supervised
prediction problem. However, since we wish to assign a categorical label (i.e. a NYHA functional class) to
each patient, we are instead tackling a supervised classification learning problem.
There are various algorithms for addressing supervised classification problems. These include Generalized
Linear Models, Random Trees & Forests, Neural Networks and Support Vector Machines. The author
whole-heartedly recommends the book “Programming collective intelligence” by T. Segaran for an
accessible, yet thorough primer on these and other modern machine learning techniques [111]. Segaran’s
book mostly discusses machine learning algorithms that are fed with cross-sectional data (i.e. where all
the data is acquired at a particular ‘slice’ of time or where the order or sequence of the data is not
necessarily considered important). Since our application involves the use of time series data where the
order of data is important, we also specifically explored the use of hidden Markov models, which are a
type of machine learning algorithm that is considered highly suitable for learning from time series data. It
has been applied to problems as disparate as speech recognition [118], stock market pricing analysis [119],
seizure classification [120] and human physical activity recognition [62,121]. A brief introduction to HMMs is
provided for the reader's convenience in Appendix B.
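To ground the terminology before moving on, the sketch below shows what a supervised classification set-up looks like for this kind of problem: input features 𝑥 (here, invented daily step count and resting heart rate summaries) paired with output labels 𝑦 (NYHA class), and a learned mapping 𝑓. The features, values and the random-forest choice are purely illustrative; they are not the features or models developed later in this thesis.

    from sklearn.ensemble import RandomForestClassifier

    # Invented training examples: [mean daily steps, mean resting heart rate] -> NYHA label
    X = [[7500, 62], [6800, 65], [4200, 74], [3900, 78], [1800, 84], [1500, 88]]
    y = ["II", "II", "III", "III", "IV", "IV"]

    f = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # The learned mapping can then label a previously unseen patient summary
    print(f.predict([[5000, 70]]))  # e.g. ['II'] or ['III'], depending on the learned boundaries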
2.4.4 The Effect of Sample Size on Machine Learning
Before we address the current state of research at the intersection of machine learning and HF
assessment, we briefly comment on an important consideration of machine learning: the amount of data
required to train a machine learning algorithm. Machine learning is notorious for being particularly data
intensive [119,122,123]. This notoriety likely explains why the term Big Data is often (incorrectly) used
interchangeably with machine learning in popular parlance [124].
Machine learning practitioners generally consider data sets on the order of hundreds of samples to be
relatively small [122,123,125]. In fact, most traditional ML algorithms are hard to properly validate even
when the training dataset in question contains more than 200 events of interest per candidate ML feature
- even some of the simplest models using logistic regression require at least 20-50 per candidate feature
[126]. The exact size of a data set required to properly train a typical Hidden Markov Model (or any
machine learning algorithm in general) depends on a number of different factors including: the method of
classification, complexity of the classifier, separation between classes, variance and presence of noise in the
data. The noisier, the more complex and the greater the variance in the data, typically the larger the
dataset required to achieve good performance. There is no upper limit for how much data should be used
for training but there is a point at which increasing input data begins to yield diminishing returns in
improving predictive performance [123]. The exact relationship between training set size and predictive
performance for an algorithm and problem in question is often shown as a 'learning curve' graph (which
plots training set size versus prediction error(s)). To the best of the author's knowledge the learning curve
for this particular application (or a sufficiently analogous application) has not yet been determined.
However, given that we expect the data collected in this study to be relatively noisy and complex,
the model will likely lean towards requiring more data rather than less. Since biomedical
data is typically in short supply, we will endeavour to collect as much data as possible in order not to
prematurely limit the power or the generalizability of the algorithm developed.
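For illustration, the sketch below shows how a learning curve of the sort described above could be produced, assuming a scikit-learn-style workflow and a synthetic stand-in dataset (not the data collected in this work).

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import learning_curve

    # Synthetic stand-in dataset; in practice X, y would be the collected patient features and labels
    X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

    train_sizes, train_scores, test_scores = learning_curve(
        RandomForestClassifier(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    # Plotting train_sizes against 1 - mean test score gives the 'learning curve'
    for n, err in zip(train_sizes, 1 - test_scores.mean(axis=1)):
        print(f"training size {n}: cross-validated error {err:.2f}")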
2.4.5 State-of-the-art
Tripoliti et al. [127] published a comprehensive review in 2017 on the state-of-the-art for machine
learning applications in HF management. They found that across the 45+ unique studies reviewed,
various machine learning techniques have been applied to both: a) the prediction of adverse HF events
including destabilizations, mortality and hospitalization, as well as b) the diagnosis of HF including HF
detection, recognition of sub-types of HF, and estimation of severity (e.g. NYHA functional class). Input
data included the standard demographic data, but also variously: clinical history, laboratory and ECG
data, and various features that were extracted or computed from the input data. NYHA functional class
was often included in the studies as part of the input demographic data, but only 4 studies investigated it
specifically as a classification task.
In 2011, Pecchia et al. [128] presented a telemonitoring system that collected and used patient ECG data
for HF detection and classified patients as having either NYHA class III (labeled as ‘severe HF’), or
NYHA class I or II (labeled as ‘mild HF’). The detection and severity classification tasks are each
performed with a single decision tree, specifically one generated using the Classification And Regression
Trees (CART) algorithm. The decision trees each use different Heart Rate Variability (HRV) features
[129] extracted from the ECG waveform, HRV having already been shown to be useful for discriminating
between patients of different NYHA classes [130–134]. Pecchia et al. trained and tested their severity
classifier on Holter monitor data available from a public database: the Congestive Heart Failure RR
Interval Database [135] (i.e. not data recorded using their telemonitoring system). The dataset consisted
of 29 patients (12 mild, 17 severe), with which they were able to achieve an overall classification
accuracy7 of 79.31%, sensitivity8 of 82.35%, specificity9 of 75.00%, and precision10 of 82.35% - although
the authors failed to specify the validation technique used.
7 The proportion of patients correctly classified into their actual true class
8 a.k.a. recall, or true positive rate: The proportion of patients correctly identified as belonging to the ‘positive’ test
class (e.g. class A in A vs. B)
9 a.k.a. true negative rate: the proportion of patients correctly identified as belonging to the ‘negative’ test class (e.g.
class B in A vs. B)
10 a.k.a. positive predictive value: the proportion of patients correctly classified as belonging to the ‘positive’ test class
amongst all the patients identified by the classifier as belonging to the ‘positive’ test class.
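To make these four metrics concrete, the sketch below computes them from a small invented 2×2 confusion matrix for a 'severe' (positive) vs. 'mild' (negative) classifier. The counts are illustrative and are not Pecchia et al.'s results.

    # Invented confusion-matrix counts for a binary 'severe' vs. 'mild' classification task
    tp, fn = 14, 3   # severe patients correctly / incorrectly classified
    tn, fp = 9, 3    # mild patients correctly / incorrectly classified

    accuracy = (tp + tn) / (tp + tn + fp + fn)   # overall proportion correct
    sensitivity = tp / (tp + fn)                 # recall / true positive rate
    specificity = tn / (tn + fp)                 # true negative rate
    precision = tp / (tp + fp)                   # positive predictive value

    print(f"accuracy={accuracy:.2%}, sensitivity={sensitivity:.2%}, "
          f"specificity={specificity:.2%}, precision={precision:.2%}")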
In 2013, Melillo et al. [136] performed a similar study, but using a larger superset of data containing
additional patients from the publicly available BIDMC Congestive Heart Failure Database [137,138]. This
data superset also included class IV patients, which were grouped with class III patients in the ‘severe
HF’ class. In this study Melillo et al. performed some additional corrections to their decision trees to
permit them to perform feature selection in a way that accounted for the now small and rather
unbalanced dataset (12:32, mild:severe). Melillo et al. also compared the performance of their single
CART decision tree to a random forest classifier [111,139], as well as a single tree generated using the
more popular C4.5 algorithm [139]. Of the 3 classifiers they found that their revised CART performed
best with a classification accuracy of 85.40% (Δ = +6.09% compared to [128]), sensitivity of 93.30% (Δ =
+10.95%), specificity of 63.60% (Δ = -11.4%), and precision of 87.50% (Δ = +5.15%). In this paper,
Melillo et al. specified that they used 10-fold cross validation. 10-fold or 𝑘-Fold cross-validation
(generally) is a common technique for validating machine learning algorithms whereby the complete
dataset is separated into 𝑘 number of groups or ‘folds’ (in this case 10). One of the folds is held aside as
the initial test set, while the remaining folds are made to constitute the initial training set [140,141]. The
folds held aside as the test and training sets are then rotated such that each fold has been held aside once
as a test set with the non-test set folds in that round being used as the training set [140,141]. In this way
each data point in the dataset is well utilized and supplies information for both testing and training
[140,141].
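A minimal sketch of 10-fold cross-validation, assuming a scikit-learn-style workflow and an invented stand-in dataset (leave-one-out cross-validation, discussed in the next paragraph, is simply the special case where 𝑘 equals the number of samples):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Invented stand-in data of a size comparable to the studies discussed here
    X, y = make_classification(n_samples=44, n_features=8, random_state=0)

    # 10-fold CV: each sample is used for testing exactly once and for training nine times
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             cv=KFold(n_splits=10, shuffle=True, random_state=0))
    print(f"mean 10-fold accuracy: {scores.mean():.2%}")

    # Leave-one-out CV is k-fold CV with k equal to the number of samples
    loo_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
    print(f"mean leave-one-out accuracy: {loo_scores.mean():.2%}")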
In 2015, Shahbazi et al. [142] used the same dataset and labelling schema, although they dropped 5
patients based on a pre-established data-reliability measure for a final dataset of 10:29 (mild:severe). In
this study, Shahbazi et al. used a different machine learning algorithm known as k-Nearest Neighbour
[111,139]. Since the k-Nearest Neighbour algorithm does not have inherent feature selection baked in (in
contrast to decision trees), Shahbazi et al. performed feature selection using a method known as
generalized discriminant analysis [143] to select a reduced subset of the best available features to present
to their k-Nearest Neighbour algorithm. The whole feature selection-classifier chain was validated using
leave-one-out cross validation. Leave-one-out cross validation is a variant of 𝑘-Fold cross validation where
𝑘 is equal to the number of data points [140]. In other words, for a dataset of size 𝑁, leave-one-out cross
validation is 𝑘-Fold cross validation where 𝑘 = 𝑁. Leave-one-out cross validation is thus often preferred
when the dataset in question is particularly small, since only 1 data point is held out as a test set for each
round, thus maximizing the amount of data available for training. In any case, Shahbazi et al. were able
to achieve a remarkable 100% and 97.43% accuracy, respectively, for classifiers trained using only non-linear
HRV features and using both linear and non-linear HRV features11.
Lastly, in a 2010 study, Yang et al. described an attempt to perform both diagnosis and severity
assessment together, using a dataset of 153 patients labelled as either ‘Healthy’, ‘HF-prone’ or ‘HF’
(65:30:58). The ‘Healthy’ group corresponded to those with no cardiac dysfunction, the ‘HF-prone’ group
corresponded to those patients with NYHA class I symptoms and the ‘HF’ group corresponding to those
with either NYHA classes II or III symptoms. Due to their relative abundance of data points, Yang et al.
opted to do a simple training/test set split, allocating 63 (24:14:25) samples for training and 90 for testing
(41:16:33). Yang et al. chose to use a support-vector-machine algorithm [111,139], a supervised prediction algorithm whose raw output is a numeric value rather than a class label. As such they had to convert the numeric prediction value into a final output classification, which they performed by first mapping the SVM prediction 𝑣 to a new mapped output value 𝑦 using the following tan-sigmoid function:

𝑦 = 4 / (1 + e^(−4𝑣)) − 2     (1)
and then proceeding to determine the decision cutoff points for the groups using Youden’s index [144].
Their approach gave them an overall accuracy of 74.44% with an accuracy of 87.50% and 65.85% for the
NYHA I group and NYHA II and III group respectively (78.79% for the healthy group). As input data,
Yang et al. used parameters from blood tests (specifically sodium and BNP levels), ECGs (including HRV
features), chest radiography (i.e. LVEF and cardiac dimensions), 6MWT (distance) and a “physical test”
[145]. Other noteworthy parameters employed by the SVM models include peak V̇O2.
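For concreteness, equation (1) and a Youden's index cutoff search can be sketched as follows; the vectors score and truth are hypothetical stand-ins for raw SVM outputs and binary group membership, and this is not Yang et al.'s actual code.

    # Sketch of the tan-sigmoid mapping in equation (1) and a Youden's index
    # cutoff search (illustrative only; inputs are hypothetical).
    map_svm_output <- function(v) 4 / (1 + exp(-4 * v)) - 2  # squashes v into the range (-2, 2)

    youden_cutoff <- function(score, truth) {
      cutoffs <- sort(unique(score))
      j <- sapply(cutoffs, function(cut) {
        sens <- mean(score[truth == 1] >= cut)  # sensitivity at this cutoff
        spec <- mean(score[truth == 0] <  cut)  # specificity at this cutoff
        sens + spec - 1                         # Youden's J statistic
      })
      cutoffs[which.max(j)]                     # cutoff that maximizes J
    }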
To the author's knowledge, no other studies have used machine learning for assessing NYHA
functional class. Certainly, no study appears to have done more than a binary (two-class) prediction of
NYHA class. Of course, this is likely a result of the difficult and time-consuming nature of acquiring a
sufficiently large dataset that includes all 4 NYHA functional classes. Fortunately, as previously
mentioned, the practical challenge of NYHA functional class assessment mostly centers around
distinguishing the middle two classes, II and III, such that studies that use the 'mild'/'severe' labelling scheme like the one used in the Pecchia, Melillo, and Shahbazi studies are essentially addressing the central
NYHA functional class assessment challenge. It appears clear too from these studies that machine learning
methods are a potent tool for objectively assessing NYHA functional class: case in point, Shahbazi et al.’s
11 granted, a model with 100% accuracy is very possibly overfit to the dataset used.
k-Nearest Neighbour approach appears to have achieved incredible accuracy at separating HF patients
with class I or II vs. III or IV - albeit on what is still a relatively small sample of 39 patients. All of
these aforementioned studies however relied solely on data recorded in the clinic, and on HRV specifically.
While we do not doubt the utility of HRV measurements for various aspects of cardiovascular care, they
do have some important drawbacks. For example, the preferred standard recording interval for an ECG
used for HRV analysis is 24 hours although it is possible to record very long-term ECGs (i.e. for longer
than a period of 1-2 days) [129,146,147]. However, very long-term ECGs require slightly different
treatment than shorter term ECGs since the longer an ECG recording, the more unreasonable it is to
maintain the assumption that the ECG signal is stationary - an important assumption for the underlying
mathematics that underpins much of the HRV signal processing [146]. While some researchers have
developed new approaches for HRV signal analysis, these have not been validated against outcomes [146].
This is known to be an important step for HRV analysis since the features used for short- and long-term ECGs are not always interchangeable [129]; it is only reasonable to assume that the same would apply to very long-term ECGs. ECG HRV analysis is also not common practice in many clinics and requires specialized knowledge and equipment (in particular for use in telemonitoring). As an
additional drawback, ECGs are often replete with artefacts and noise, and so sometimes require manual
cleaning before they can be used for HRV analysis [129]. Altogether, this makes HRV analysis a powerful,
but relatively inaccessible tool (at least at present) for use in performing regular assessment of NYHA
class as part of care. It would be useful to determine if it were possible to objectively assess NYHA class using more commonly accessible technology like the standard CPET, or fitness trackers which, although not ubiquitous in the hospital, are ubiquitous in the consumer space and would be an ideal tool for remotely monitoring HF patients and regularly reassessing their NYHA class.
Summary
To summarize: heart failure, a global epidemic, is a complex chronic progressive condition associated
with significant morbidity and mortality. Patients often present with exacerbations to acute care centers,
and hospital emergency rooms at significant cost.
Exercise intolerance, one of the main manifestations of heart failure (HF), is an integral part of HF care
evaluations. The New York Heart Association (NYHA) classification is a functional assessment of exercise
capacity where a higher NYHA class is associated with increased symptoms, decreased quality of life and
poor survival. This classification system is highly subjective, especially for NYHA class II and III, which
call for patients experiencing “slight” versus “marked limitation of physical activity” [9]. The application of
the criteria thus varies widely based on the patient's self-report and the individual physician's
interpretation [6,7]. A quantifiable measure that removes this subjectivity to make the assignment of
NYHA class more repeatable and objective is highly desirable, especially if such a measurement could be
made on a regular basis to more closely track progression of the disease.
In common clinical practice, most assessments of exercise intolerance are performed through standardized
or non-standardized questions posed as part of the medical interview. More quantifiably,
CardioPulmonary Exercise Testing (CPET) is a validated clinical tool that is used to assess exercise
intolerance. Other researchers have identified some relationships between CPET measures, specifically peak V̇O2, and NYHA class, although none have attempted to predict NYHA class from CPET measures. Performing CPET
studies also has some important drawbacks: they require access to expensive equipment in a lab
environment, and trained personnel to run the tests. Consumer targeted wearable physical activity
trackers overcome these disadvantages: they are inexpensive, simple to use, and can measure moment-to-
moment physical activity (and thus hopefully infer exercise intolerance) during free-living activities
instead of simulated activity in a lab. A previous exploratory study [13] investigated wearable activity
trackers in HF patients and found a link between patients’ daily average step counts and their
corresponding NYHA functional classes. However, the study’s small sample (n=8) limits scientific
confidence in the generalizability of this finding, so we resolved to begin (in the next chapter) by
investigating whether these results are generalizable to a larger study sample.
Activity trackers could thus also be used to remotely monitor patients to help both patients and clinical
staff better manage their condition. Remote monitoring has been shown to improve HF patient outcomes
when properly implemented. To maximize chances of successful implementation, we proposed integrating
activity tracker monitoring as part of Medly [101,102], an existing well validated phone-based HF patient
monitoring solution already integrated and in use at our hospital.
One of the important features of Medly is that it leverages an expert system (an early type of artificial
intelligence algorithm) to triage, respond to or elevate clinical concerns to staff as necessary while
handling regular ‘run-of-the-mill’ clinical tasks without needing human intervention, thus providing a
cost-effective and scalable way of monitoring patients on a daily basis. We suggest that a similar
intelligent system could be used for NYHA class assessment. By using an artificial intelligence system that
could translate relevant data into the desired clinical outcome (NYHA classification), or a sufficiently
equivalent outcome (an 'NYH-AI' or 'NYHAI' classification if you will), we could provide a way to assess
a patient's functional classification in an objective, consistent manner while still leveraging the advantages
of the existing 'traditional' NYHA classification method. Some researchers have already investigated
intelligent classification algorithms, but unfortunately these all relied on analysing heart rate variability
from ECGs. We suggest that it might be possible to perform the same classification using more accessible
or ubiquitous technology like a CPET or fitness tracker.
Chapter 3 - Replication of Previous Study
As discussed in the Section 2.2.3.3, a previous exploratory study [13] investigated wearable activity
trackers in HF patients and demonstrated a statistically significant difference between the daily average
step counts of patients experiencing NYHA class II vs NYHA class III symptoms. However, the study’s
small sample (n=8) limits scientific confidence in the generalizability of this finding. Since step count activity is expected to be a highly relevant, useful and massively feature-rich dataset, we replicated the
study on a separate otherwise limited dataset collected during another previous study, to increase our
confidence in the relevance and usefulness of step data for this particular research thesis. Our primary
objective was to validate the pilot study on a larger sample of patients with HF with reduced ejection
fraction (HFrEF). Our secondary objective in analyzing the larger dataset was to also better characterize
the distribution of step counts for patients in different NYHA classes.
The remaining part of this chapter, our replication of the pilot study, has been submitted for publication
to a peer-reviewed journal [148]. The thesis author was responsible for the direction and execution of the
research as well as the drafting of the initial paper. The other authors on the submitted paper (S.
Bromberg, M. Yasbanoo, B. Taati, H. Ross, C. Manlhiot, and J. Cafazzo) contributed feedback and edits
to subsequent drafts of the manuscript. Additionally, S. Bromberg collected the original dataset used in
the study, H. Ross & M. Yasbanoo provided clinical guidance, and J. Cafazzo and C. Manlhiot provided
general consultation.
Abstract
Background: A previously published pilot study showed a statistically significant difference between
New York Heart Association (NYHA) functional class and step count activity measured by wrist-worn
activity monitors in patients with heart failure (HF). However, the study’s small sample size severely
limits scientific confidence in the generalizability of this finding to a larger HF population.
Objective: Validate the pilot study on a larger sample of patients with HF with reduced ejection fraction
(HFrEF) and attempt to characterize the step count distribution.
Methods: We repeated the analysis performed during the pilot study on an independently recorded dataset consisting of a total of 50 patients with HFrEF (35 NYHA II and 15 NYHA III).
Participants were monitored for step count with a Fitbit Flex for a period of two weeks in a free-living
environment.
Results: Patients exhibiting NYHA class III symptoms had significantly lower recorded mean of daily total step count (4012 ± 1933 vs. 5484 ± 2640 [steps/day], P = .04), lower recorded mean of daily mean step count (2.8 ± 1.3 vs. 3.8 ± 1.8 [steps/minute], P = .04), and lower mean and maximum of the daily per minute step count maximums (80.5 vs. 95.6 and 112.9 vs. 125.7 [steps/minute], P = .02 and .004 respectively).
Conclusions: Patients with NYHA II and III symptoms differed significantly by various aggregate
measures of free-living step count including 1) mean daily total step count as well as, newly discovered, by
2) mean, and 3) maximum of the daily per minute step count maximums. These findings affirm that the
degree of exercise intolerance of NYHA II and III patients as a group is quantifiable in a replicable
manner. This is a novel and promising finding that is highly suggestive of a possible completely objective measure for assessing HF functional class, something which would be a great boon in the continuing quest
to improve patient outcomes for this burdensome and costly disease.
Introduction
Heart Failure (HF), a global epidemic [1,14], is a complex chronic progressive condition associated with
significant morbidity and mortality. HF is the leading cause of hospitalizations costing Canadians an
estimated 3 billion dollars annually [2]. Clinicians caring for patients with HF have a strong desire to
reduce hospitalizations from both a systems and patient-centered perspective [2,4]. To do so, it is
important for clinicians caring for these patients to understand each patient's physiologic parameters.
Evaluating exercise intolerance, one of the main manifestations of HF, is an integral part of HF care. The
New York Heart Association (NYHA) classification is a functional assessment of exercise capacity where a
higher NYHA class is associated with increased symptoms, decreased quality of life and poor survival
[8,10,149]. This classification system is highly subjective [6,7], especially for NYHA class II and III [9]. The
application of the criteria thus varies widely based on the patient's self-report and the individual
physician’s interpretation [6,7]. A quantifiable measure that removes this subjectivity to make the
assignment of NYHA class more repeatable and objective would be beneficial.
A previous exploratory study [13] investigated wearable activity trackers in HF patients and demonstrated a statistically significant difference in daily average step counts, a proxy for exercise intolerance, between patients with class II and III symptoms. However, the study's small sample (n=8)
limits the generalizability of these findings. The aim of this study is to determine if these findings can be
replicated using a larger sample collected independently from the original pilot study data.
Methods
As a replication, we repeated the analysis performed during the pilot study [13], but on an
independently recorded dataset consisting of a total of 50 patients with HFrEF (9 NYHA I/II, 26 NYHA
II, 4 NYHA II/III, and 11 NYHA III). Participants were monitored for step count with a Fitbit
Flex [59] for a period of two weeks in a free-living environment.
3.3.1 Recruitment
Patients in a moderately larger dataset (n=50) were originally consecutively recruited from the Heart
Function Clinic at Toronto General Hospital (TGH) in Toronto, Canada from September 2014 to June
2015. The inclusion and exclusion criteria used are outlined in Table 3 & Table 4 respectively.
Table 3: Inclusion criteria
- Adults (18+ years of age)
- Stable chronic HF
- NYHA Class II or III
- LVEF (Left Ventricular Ejection Fraction) ≤ 35%
- Able to walk without walking aids
- Capable of undergoing consent, understanding English instructions and complying with
the use of the study devices.
Table 4: Exclusion criteria
- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study
costs – i.e. roaming charges)
Data Collection
Patients were supplied with a Fitbit Flex [57], an Android smartphone (Moto-G), the associated
charging equipment for both devices, as well as a data plan to facilitate syncing the tracker to the Fitbit
server. Patients were instructed to wear the Fitbit daily on the same wrist, preferably their non-dominant
hand, for a period of 2 weeks, except during water activities like showering or swimming, as the Flex is
not water-proof. Patients were also instructed to charge the Fitbit at least every three days, preferably
while they slept. The Fitbit data was retrieved using an open source script published and available on
GitHub and adapted for this study [150].
Population
Patients in our larger dataset were labeled as either NYHA class II or III, or, when a physician was
uncertain about the classification or felt that patients exhibited symptoms from different class levels, as a
borderline/mixed class I/II or II/III. Table 5 provides demographic information for each of the patients in
the dataset according to their NYHA class, Table 6 provides the same but for all patients overall and just
for the subset of patients that were labelled NYHA class II or III. In either case, the patients are predominantly male (86% vs. 89%), aged 54 ± 14 vs. 56 ± 14 years old, and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 kg/m2).
Table 5: Study dataset demographics
NYHA I/II NYHA II NYHA II/III NYHA III
Total Participants (n [%]) 9 (18%) 26 (52%) 4 (8%) 11 (22%)
# Male (n [%]) 6 (67%) 23 (89%) 4 (100%) 10 (91%)
Age [years] 52 ± 16 55 ± 14 52 ± 13 58 ± 13
Height [cm] 171 ± 12 174 ± 8 177 ± 3 175 ± 10
Weight [kg] 79.5 ± 25.5 87.6 ± 18.6 88.4 ± 22.7 94.4 ± 17.4
BMI [kg/m2] 26.6 ± 7.1 29.0 ± 6.1 28.4 ± 7.5 30.9 ± 6.7
Table 6: Study dataset demographics (overall and just NYHA II or III)
Overall NYHA II or III*
Total Participants (n [%]) 50 37 (74% of total)
# Male (n [%]) 43 (86%) 33 (89%)
Age [years] 54 ± 14 56 ± 14
Height [cm] 174 ± 9 174 ± 9
Weight [kg] 87.7 ± 20.0 89.6 ± 18.5
BMI [kg/m2] 28.9 ± 6.4 29.6 ± 6.3
Since NYHA class I/II and II/III are not formally recognized NYHA classes, we performed our analysis
using the original class labels, as well as a second time but with the borderline/mixed classes grouped into
one of the traditional 4 NYHA classes. Since NYHA class I corresponds to ‘no limitation of physical
activity’, a binary distinction, we reasoned that a patient assigned as class I/II, must have exhibited
something more than ‘no limitation of physical activity’, however slight. Since NYHA class II corresponds
to ‘a slight limitation of physical activity’ we reasoned that class I/II and class II should be grouped
together. We designate the class I/II and class II group as Group II*. We extended the same line of
reasoning for II/III patients, noting that patients assigned as class II/III must have experienced some
more marked limitation of physical activity beyond class II limitations. As such we grouped them with the lower class III as a conservative approach, assuming the worst-case scenario. We designated the class II/III and III group as Group III*. Table 7 provides
demographic information for the patients when the dataset is re-grouped according to the labeling scheme
as described above.
Table 7: Study re-grouped dataset demographics (NYHA group II* and group III*)
NYHA Group II* NYHA Group III*
Total Participants (n [%]) 35 (70%) 15 (30%)
# Male (n [%]) 29 (83%) 14 (93%)
Age [years] 54 ± 14 56 ± 13
Height [cm] 173 ± 9 176 ± 8
Weight [kg] 85.5 ± 20.6 92.8 ± 18.3
BMI [kg/m2] 28.4 ± 6.3 30.2 ± 6.7
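Expressed as code, the regrouping described above amounts to a simple relabeling; the data frame patients and its column nyha_label are illustrative names only.

    # Illustrative relabeling of the borderline/mixed classes into Group II* and
    # Group III*; nyha_label is an assumed character column taking the values
    # "I/II", "II", "II/III" and "III".
    patients$nyha_group <- ifelse(patients$nyha_label %in% c("I/II", "II"),
                                  "Group II*", "Group III*")

    table(patients$nyha_group)  # for this dataset the expected split is 35 vs. 15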
3.3.2 Statistics
Consistent with our previous study [13], we used a Kruskal-Wallis rank test to compare the
experimental variables of interest, including the mean daily total step count. Since the data is clearly not
normally distributed, as can be seen in Figure 3-1, Figure 3-2 and Figure 3-3, we also computed various
other aggregations of the minute by minute step count data to attempt to better characterize the data.
Namely, we calculated statistical summaries (mean, standard deviation, five number summaries,
interquartile range, skewness and kurtosis) for each patient’s overall two week period and then for each
individual patient-day of step data. We then calculated the max, min, mean and standard error across
each patient’s daily summaries (producing a maximum daily mean, minimum daily mean, mean of daily
means, etc.) to assess overall variation on a daily basis. We then performed Kruskal-Wallis rank tests on each of the overall statistical summaries. The analysis was performed using R [151], RStudio [152] with
supporting packages [153–158].
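As a concrete illustration of the aggregation and testing described above, the following sketch computes one of the daily summaries and compares groups with a Kruskal-Wallis rank test; the data frames steps (columns: patient_id, day, steps_per_minute) and labels (columns: patient_id, nyha_class) and all column names are assumptions for the example, not the actual analysis scripts.

    # Illustrative sketch: each patient's daily per-minute step count maximum,
    # averaged across days, then compared between NYHA classes.
    daily_max <- aggregate(steps_per_minute ~ patient_id + day,
                           data = steps, FUN = max)

    mean_daily_max <- aggregate(steps_per_minute ~ patient_id,
                                data = daily_max, FUN = mean)
    names(mean_daily_max)[2] <- "mean_of_daily_max"

    summaries <- merge(mean_daily_max, labels, by = "patient_id")

    kruskal.test(mean_of_daily_max ~ nyha_class, data = summaries)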
Figure 3-1. Histogram of per minute step count values for each patient, grouped by individual NYHA class

Figure 3-2. Distribution of per minute step counts by NYHA class (zoomed in to step counts > 0). Stacked internal segments indicate relative contributions by each patient.

Figure 3-3. Individual frequency of per minute step counts for each patient (zoomed in to step counts > 0), grouped by NYHA class
Results and Discussion
Table 8 and Table 9 include results that were found to be significant at the P=.05 level in at least one
comparison. Table 10 and Table 11 contain the remaining non-significant results excluding any statistical
summary that returned a 0 value for all classes (e.g. aggregations involving daily or overall minimum, 1st,
2nd and 3rd quartile) due to the overwhelming frequency of 0 per minute step counts. Table 8 and Table
10 tabulate the results of the comparison using the original class labels, i.e. comparisons between class II
vs. III, and the comparison of all available classes, i.e. I/II vs. II vs. II/III vs. III, whereas Table 9 and
Table 11 tabulate the results of the comparison of the relabeled dataset, i.e. group II* vs. group III*. The
mean daily total steps, and the mean and max of daily per minute step count maxes (with standard error
bars) are plotted graphically in Figure 3-4, Figure 3-5, and Figure 3-6 respectively.
Table 8: Significant findings for comparisons between all classes (I/II, II, II/III, III) and
just between class II vs. III.
I/II II II/III III P-value (all classes) P-value (II vs. III)
Maximum
Maximum 2 Week PMSCa
[steps/minute] 126.33 125.54 112.75 112.91 .04* .0104*
Maximum of Maximum
DPMSCb [steps/minute] 126.33 125.54 112.75 112.91 .04* .0104*
Mean of Maximum DPMSCb
[steps/minute] 96.94 95.10 80.26 80.65 .12 .04*
Mean
Mean 2 Week PMSCa
[steps/minute] 3.85 3.79 3.12 2.66 .22 .0499*
Maximum of Mean DPMSCb
[steps/minute] 6.33 7.53 6.11 5.02 .07 .014*
Mean of Mean DPMSCb
[steps/minute] 3.85 3.79 3.12 2.66 .22 .0499*
Standard Deviation of Mean
DPMSCb [steps/minute] 1.40 1.98 1.70 1.21 .054 .0095**
Standard Error of Mean
DPMSCb [steps/minute] 0.36 0.50 0.43 0.31 .07 .013*
Standard Deviation
Standard Deviation of 2
Week PMSCa [steps/minute] 12.90 13.09 10.51 9.99 .15 .03*
Maximum of DPMSCb
Standard Deviation
[steps/minute]
18.61 20.10 15.53 14.94 .02* .0053**
Mean of DPMSCb Standard
Deviation [steps/minute] 12.24 11.87 9.44 9.23 .17 .0499*
Standard Error
Standard Error of 2 Week
PMSCa [steps/minute] 0.088 0.087 0.071 0.067 .16 .04*
Maximum of DPMSCb
Standard Error
[steps/minute]
0.49 0.53 0.41 0.39 .02* .005**
Mean of DPMSCb Standard
Error [steps/minute] 0.32 0.31 0.25 0.24 .17 .0499*
Total
Total 2 Week SCc [kilosteps] 8.19 8.51 6.95 5.87 .16 .03*
Maximum of Total DPMSCb
[steps] 9113 10837 8803 7232 .07 .014*
Mean of Total DPMSCb
[steps] 5542 5464 4499 3835 .22 .0499*
Standard Deviation of Total
DPMSCb [steps] 2019 2856 2452 1745 .054 .0095**
Standard Error of Total
DPMSCb [steps] 523 713 624 441 .07 .013*
aPMSC: Per Minute Step Count bDPMSC: Daily Per Minute Step Count cSC: step count
Table 9: Significant findings for comparisons between group II* and group III*
Group II* (= I/II + II) Group III* (= II/III + III) P-value
Maximum
Maximum 2 Week PMSCa [steps/minute] 125.74 112.87 .004**
Maximum of Maximum DPMSCb [steps/minute] 125.74 112.87 .004**
Mean of Maximum DPMSCb [steps/minute] 95.57 80.55 .02*
Mean
Mean 2 Week PMSCa [steps/minute] 3.81 2.79 .04*
Maximum of Mean DPMSCb [steps/minute] 7.22 5.31 .03*
Mean of Mean DPMSCb [steps/minute] 3.81 2.79 .04*
Standard Deviation of Mean DPMSCb [steps/minute] 1.83 1.34 .04*
Standard Error of Mean DPMSCb [steps/minute] 0.46 0.34 .045*
Standard Deviation
Standard Deviation of 2 Week PMSCa [steps/minute] 13.04 10.13 .02*
Maximum of DPMSCb Standard Deviation
[steps/minute] 19.72 15.09 .002**
Mean of DPMSCb Standard Deviation [steps/minute] 11.97 9.29 .03*
Standard Error
Standard Error of 2 Week PMSCa [steps/minute] 0.09 0.07 .02*
Maximum of DPMSCb Standard Error [steps/minute] 0.52 0.40 .002**
Mean of DPMSCb Standard Error [steps/minute] 0.32 0.24 .03*
Total
Total 2 Week SCc [steps] 84293 61612 .03*
Maximum of Total DPMSCb [steps] 10393 7651 .03*
Mean of Total DPMSCb [steps] 5484 4012 .04*
Standard Deviation of Total DPMSCb [steps] 2640 1933 .04*
Standard Error of Total DPMSCb [steps] 664 490 .045*
aPMSC: Per Minute Step Count bDPMSC: Daily Per Minute Step Count cSC: step count
Table 10: Non-significant findings for comparisons between all classes (I/II, II, II/III, III)
and just between class II vs. III.
I/II II II/III III P-value (all classes) P-value (II vs. III)
Demographics
Sex [M=0, F=1] 0.33 0.12 0.00 0.09 .29 .83
Age [years] 51.56 54.96 51.50 57.82 .65 .55
Height [cm] 171.44 173.96 176.50 175.27 .76 .69
Weight [kg] 79.53 87.62 88.35 94.35 .53 .21
BMIa [kg/m2] 26.59 29.00 28.41 30.88 .53 .39
Righthanded?b
[No=0, Yes=1] 0.89 0.88 1.00 1.00 .61 .25
Wristband Preferencec
[Left=0, Right=1] 0.67 0.35 0.25 0.20 .18 .40
Maximum
Standard Deviation of
Maximum DPMSCd
[steps/minute]
19.91 26.21 29.13 21.45 .31 .30
Standard Error of Maximum
DPMSCd [steps/minute] 5.06 6.43 7.42 5.26 .28 .32
Minimum of Maximum
DPMSCd [steps/minute] 58.89 36.81 17.75 40.82 .22 .62
75th Percentile
Maximum of 75th Percentile
of DPMSCd [steps/minute] 0.56 3.02 4.00 1.09 .46 .36
Mean of 75th Percentile of
DPMSCd [steps/minute] 0.04 0.50 0.72 0.08 .44 .33
Standard Deviation of 75th
Percentile of DPMSCd
[steps/minute]
0.14 0.91 1.41 0.29 .43 .33
Standard Error of 75th
Percentile of DPMSCd
[steps/minute]
0.04 0.23 0.35 0.08 .43 .33
Mean
Minimum of Mean DPMSCd
[steps/minute] 1.31 0.67 0.57 0.88 .21 .36
Standard Deviation
Minimum of DPMSCd
Standard Deviation
[steps/minute]
5.42 3.01 2.07 3.67 .21 .42
Standard Error
Minimum of DPMSCd
Standard Error
[steps/minute]
0.14 0.08 0.05 0.10 .21 .42
Total
Minimum of Total DPMSCd
[steps] 1887 971 818 1270 .21 .36
IQR (Interquartile Range)
Maximum of DPMSCd IQRg
[steps/minute] 0.56 3.02 4.00 1.09 .46 .36
Mean of DPMSCd IQRg
[steps/minute] 0.04 0.50 0.72 0.08 .44 .33
Standard Deviation of
DPMSCd IQRg
[steps/minute]
0.14 0.91 1.41 0.29 .43 .33
Standard Error of DPMSCd
IQRg [steps/minute] 0.04 0.23 0.35 0.08 .43 .33
Skewness
2 Week PMSCe Skewness 5.14 5.20 5.29 6.50 .62 .27
Maximum of Daily SCf
Skewness 11.36 13.22 5.24 12.39 .56 .91
Mean of Daily SCf Skewness 5.20 5.30 4.11 5.77 .76 .65
Standard Deviation of Daily
SCf Skewness 2.00 2.54 0.58 2.18 .37 .73
Standard Error of Daily SCf
Skewness 0.51 0.65 0.16 0.55 .33 .73
Minimum of Daily SCf
Skewness 3.61 3.21 2.59 3.70 .42 .34
Kurtosis
2 Week PMSCe Kurtosis 35.32 33.44 36.72 61.42 .61 .24
Maximum of Daily SCf
Kurtosis 249.66 283.85 31.17 237.06 .58 .87
Mean of Daily SCf Kurtosis 43.12 44.82 19.12 49.53 .68 .57
Standard Deviation of Daily
SCf Kurtosis 59.92 68.46 5.53 54.44 .39 .78
Standard Error of Daily SCf
Kurtosis 15.08 17.33 1.48 13.55 .39 .87
Minimum of Daily SCf
Kurtosis 15.38 10.74 6.62 15.64 .36 .23
aBMI: Body Mass Index bRighthanded?: is patient righthanded? cWristband Preference: right or left handed preference for wristband dDPMSC: Daily Per Minute Step Count ePMSC: Per Minute Step Count fSC: step count gIQR: interquartile range
Table 11: Non-significant findings for comparisons between group II* and group III*
Group II* (= I/II + II) Group III* (= II/III + III) P-value
Demographics
Sex [M=0, F=1] 0.17 0.07 .33
Age [years] 54.09 56.13 .71
Height [cm] 173.31 175.60 .38
Weight [kg] 85.54 92.75 .17
BMIa [kg/m2] 28.38 30.22 .28
Righthanded?b [No=0, Yes=1] 0.89 1.00 .18
Wristband Preferencec [Left=0, Right=1] 0.43 0.21 .16
Maximum
Standard Deviation of Maximum DPMSCd
[steps/minute] 24.59 23.50 .76
Standard Error of Maximum DPMSCd [steps/minute] 6.08 5.84 .86
Minimum of Maximum DPMSCd [steps/minute] 42.49 34.67 .58
75th Percentile
Maximum of 75th Percentile of DPMSCd
[steps/minute] 2.39 1.87 .93
Mean of 75th Percentile of DPMSCd [steps/minute] 0.38 0.25 .89
Standard Deviation of 75th Percentile of DPMSCd
[steps/minute] 0.71 0.59 .91
Standard Error of 75th Percentile of DPMSCd
[steps/minute] 0.18 0.15 .91
Mean
Minimum of Mean DPMSCd [steps/minute] 0.84 0.80 .90
Standard Deviation
Minimum of DPMSCd Standard Deviation
[steps/minute] 3.63 3.24 .80
Standard Error
Minimum of DPMSCd Standard Error [steps/minute] 0.10 0.09 .80
Total
Minimum of Total DPMSCd [steps] 1207 1149 .90
IQR (Interquartile Range)
Maximum of DPMSCd IQR [steps/minute] 2.39 1.87 .93
Mean of DPMSCd IQR [steps/minute] 0.38 0.25 .89
Standard Deviation of DPMSCd IQR [steps/minute] 0.71 0.59 .91
Standard Error of DPMSCd IQR [steps/minute] 0.18 0.15 .91
Skewness
2 Week PMSCe Skewness 5.18 6.18 .29
Maximum of Daily SCf Skewness 12.60 11.68 .97
Mean of Daily SCf Skewness 5.26 5.60 .76
Standard Deviation of Daily SCf Skewness 2.36 2.02 .76
Standard Error of Daily SCf Skewness 0.60 0.51 .79
Minimum of Daily SCf Skewness 3.34 3.59 .65
Kurtosis
2 Week PMSCe Kurtosis 33.93 54.83 .25
Maximum of Daily SCf Kurtosis 272.45 216.47 .97
Mean of Daily SCf Kurtosis 44.25 46.49 .71
Standard Deviation of Daily SCf Kurtosis 65.62 49.55 .73
Standard Error of Daily SCf Kurtosis 16.58 12.34 .79
Minimum of Daily SCf Kurtosis 12.29 14.74 .47
aBMI: Body Mass Index bRighthanded?: is patient righthanded? cWristband Preference: right or left handed preference for wristband dDPMSC: Daily Per Minute Step Count ePMSC: Per Minute Step Count fSC: step count gIQR: interquartile range
3.4.1 Principal Results
This study, using an independent, larger group of participants, replicated and validated the findings of our previous pilot study: that the daily free-living step counts of HF patients exhibiting NYHA class II vs class III symptoms are statistically different [13]. Specifically, HF patients categorized as NYHA II vs. III were found to differ significantly by mean of daily total step count (5464 vs. 3835, P = .0499), as well as by mean of daily mean step count (3.8 vs. 2.7, P = .0499). NYHA II vs III patients also differed significantly by mean (95.1 vs. 80.7, P = .04) and maximum (125.5 vs. 112.9, P = .0104) of the daily per minute step count maximums.

Figure 3-4. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily total steps for each individual NYHA class
Similarly, group II* and group III* also differed significantly by mean of daily total step counts (5484 vs.
4012, P = .04), mean of daily mean step count (3.8 vs. 2.8, P = .04) as well as by mean (95.6 vs. 80.5, P
= .02), and maximum of the daily per minute step count maximums (125.7 vs. 112.9, P = .004
respectively).
In both cases quoted above, the daily step count results mimicked the two-week overall step count
analysis.
Of the 4 metrics identified above only the maximum daily per minute step count maximum was found to differ significantly between the 4 classes I/II, II, II/III and III (126.3 vs. 125.5 vs. 112.8 vs. 112.9, P = .04). It is reasonable that step count maximum, which better captures a patient's peak exercise during the day, might as a result better capture the "limitation of physical activity" experienced by a patient and thus differentiate more consistently between NYHA classes (compared to a simple mean or sum of a patient's activity over said day). Visual inspection of the overall step count density (see Figure 3-2) corroborates this suspicion.
We however suggest another alternative. As can clearly be seen in Figure 3-1 (which shows a histogram of the step count data for each NYHA class), zero per minute step counts made up an overwhelming portion of the data. Specifically, they accounted for a mean 87.3% (standard deviation 4.9%) of the two week data stream for each patient, accounting for as much as 97.6% of the two week data stream for one patient - the full breakdown can be seen in Figure 3-7. Unfortunately, the meaning of these 0 per minute step count values is ambiguous since the trackers used in this study record a 0 value not only during patient inactivity but also when the patient was simply not wearing the device. As a result, it is challenging to accurately determine if a given series of zeroes indicates a pattern of low physical activity - presumably explanatory of NYHA class - or simply a pattern of non-device use - essentially introducing noise into the physical activity signal.

Figure 3-5. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of mean daily per minute step count maximums for each individual NYHA class
A visual inspection of Figure 3-2 and Figure 3-3, both of which show different perspectives of the non-zero per minute step count data distribution, seems to strongly suggest that there is a difference in the activity patterns of patients, for example, a longer, fatter tail for class I/II and II patients. Quantitatively however we failed to extract many insights into the shape of the activity distribution. Notably the 1st, 2nd, and 3rd quartile (and thus interquartile range) were all found to be fairly consistently 0 for all patients. In other words, 0's typically accounted for more than 75% of data points for any given patient day. In fact, when looking at the two week period as a whole they accounted for at least 76.7% of all the data points for any given patient (the complete breakdown is shown in Figure 3-7).

Figure 3-6. Boxplots (min, mean-1SEM, mean, mean+1SEM, max) of max daily per minute step count maximums for each individual NYHA class
The maximum daily per minute step counts on the other hand are naturally least susceptible to the ambiguous 0 per minute step count values. We suggest that this may have contributed to their being most consistent at differentiating between patients in different NYHA classes. Ultimately though, we believe that the disambiguation of inactive vs disengaged time in pedometer-like trackers and the subsequent effect on the aforementioned step data distribution are worth investigating further to better understand the true nature of the relationship between free-living step count and NYHA functional classification.
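The zero-minute breakdown discussed above can be reproduced in a few lines; the long-format data frame steps and its columns patient_id and steps_per_minute are, again, assumed names rather than the actual analysis objects.

    # Sketch of the per-patient percentage of zero per-minute step counts over
    # the two-week data stream (illustrative only).
    pct_zero <- tapply(steps$steps_per_minute, steps$patient_id,
                       function(x) 100 * mean(x == 0))

    summary(pct_zero)  # the reported mean was roughly 87% zero minutes per patient
    stem(pct_zero)     # stem-and-leaf display comparable to Figure 3-7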
3.4.2 Strengths and Limitations
A strength of this replication study is that it uses a separate dataset collected by a different researcher (S.B.) independently of (and prior to the analysis performed in) the referenced pilot study [13]. Except for one patient who participated in both studies, the dataset is also comprised of completely different patients. On the other hand, the data being sourced as a convenience sample at the same single site as the pilot study, i.e. consecutively recruited from the TGH Heart Function Clinic, represents a limitation of this study with regards to generalizing our findings. Our analysis was also limited as it did not include any NYHA class I or IV patients. While these are not typically as difficult to classify as
NYHA class II or III patients, analysis of all 4 NYHA classes would have potentially provided additional
useful insight into the true underlying relationship between step count and NYHA class. Knowing this
relationship might be of tremendous value if it could allow us to invert the question posed in this study:
to instead see if step count could be used to assess NYHA class or gradation changes in NYHA class for a
patient. We suggest that this might be the subject of an important future study. The most significant
limitation of our study though was the step tracker utilized, since it introduced significant ambiguity into
the 0 per minute step count values which comprised most of each patient’s step data stream. This limits
our ability to precisely quantify the distribution of the activity/inactivity of patients especially since it is
as of yet unclear how much significance patient inactivity should be accorded when it comes to capturing ‘physical activity limitation’ and by extension NYHA functional class.

Figure 3-7. Number of zero step count minutes as a percentage of individual patient two-week data stream (stem-and-leaf display; e.g. 76 | 7 represents 76.7%)
Conclusion
NYHA II and NYHA III patients differ significantly by various aggregate measures of step count
including 1) mean daily total step count but also importantly by 2) mean, and 3) maximum of the daily
per minute step count maximums. These findings validate our previous pilot study. However, the
discovery of additional significant aggregate measures raises several questions, amongst them: what is the
exact underlying relationship between NYHA class and step count? What features of the step count
waveform are most associated or correlated with NYHA class? These questions will no doubt feature as
the subjects of future studies, but the findings of this study are an important milestone on the road to an
objective means of assessing HF functional classification on our continuing quest to improve outcomes of patients with the burdensome and costly disease that is congestive heart failure.
3.5.1 Acknowledgements
This project was supported by funds from: the Ted Rogers Centre for Heart Research and Peter Munk
Cardiac Centre, (hSITE) Healthcare Support through Information Technology Enhancements and
(NSERC) the Natural Sciences and Engineering Research Council, (CIHR) the Canadian Institutes for
Health Research, the Government of Ontario, and the University of Toronto.
3.5.2 Ethics Approval
This study is covered by institutional and research ethics approval (REB #14-7595) received from the
University Health Network REB.
3.5.3 Conflicts of Interest
None declared.
Chapter 4 - Activity Tracker Monitoring Implementation
Having confirmed the potential utility of remotely monitoring the physical activity of heart failure patients, we moved to update Medly, the remote patient monitoring system in use at the TGH HF clinic,
as part of a Quality Improvement (QI) initiative so it could support the collection and display of the
aforementioned data.
In this chapter we provide a brief overview of the Medly user interface, before discussing the activity
tracker monitoring implementation requirements. We then discuss the proposed designs, what was
actually finally implemented, as well as the success of the implementation in terms of the patients
onboarded and their adherence to the system.
Medly User Interface Overview
The concept behind the Medly remote monitoring system is relatively simple: patients download the Medly app on their smartphone (provided by the clinic if required), and use the app every morning to input their weight, blood pressure and pulse – either manually or using a 'smart' weight scale and blood pressure cuff which can wirelessly transmit the corresponding data to the smartphone app. Additionally, patients answer a series of questions about the symptoms they experienced the day before. Medly's innovative computer algorithm then assesses the patients' state and alerts them about further actions they may need to take such as: taking an additional dose of medication, calling their physician, or even going to the nearest emergency room (if the patient is assessed as being in a high-risk state). By shortening the cause-effect feedback cycle and leveraging 'teachable moments' the system helps improve patient self-care maintenance and management. Patients can also review past readings and observe their overall trends on a separate screen. Examples of two of the primary screens of the patient user interface, the home and trends screen, are shown in Figure 4-1.

In the example home screen, a patient has been alerted to 'contact the heart function clinic or [their] family doctor' due to their elevated heart rate (156 bpm) and reported symptoms (tired, short of breath and lightheaded) which are highlighted in orange. A patient can also take additional readings by pressing the green '+' circle near the bottom right corner of the screen, although new readings will not remove previous alerts. In the example trends screen the patient appears to be maintaining a constant weight higher than the light blue target weight band (~160 lbs), with two unrecorded days (Nov 2nd and 3rd). Their blood pressure (BP) in contrast appears to be fluctuating: initially trending downwards with the diastolic BP stabilizing but the systolic BP recently trending upwards to exceed the gray target BP band.

Figure 4-1. Medly system patient smartphone user interface: a) home screen b) trends screen [289]
All of the patients' readings are sent back to servers at the hospital (UHN) and are displayed on a web interface which is accessible by clinical staff, where they can review alerts and the patient trend data. An example of the main screen of the clinical web interface, showing the weight data for a Mr./Mrs. Demo Patient, is shown in Figure 4-2. In this example the patient had 1 of 3 readings during the period of July 12th to July 19th, 2018 fall outside of their target normal weight range (this time indicated by a gray coloured band on the graph). The user could also scroll down to see the patient's BP and pulse readings as well as a chart of their answers to the symptoms questions.

Figure 4-2. Medly system clinical user web interface
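Purely for illustration, the kind of threshold rule such an algorithm might evaluate against a morning reading is sketched below; the thresholds, band width and function names are invented for the example and are not Medly's actual rule set.

    # Hypothetical example of a simple triage rule (not Medly's actual rules):
    # flag a weight reading that exceeds the patient's target band.
    triage_weight <- function(weight_lbs, target_lbs, band_lbs = 4) {
      if (weight_lbs > target_lbs + band_lbs) {
        "Alert: weight above target range - contact the heart function clinic"
      } else {
        "No action required"
      }
    }

    triage_weight(weight_lbs = 166, target_lbs = 160)  # returns the alert message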
Requirements
In keeping with engineering best practice, we performed some basic requirements gathering before
proceeding to implement changes to the Medly system. Initial requirements gathering was performed by
discussing the proposed system update with the developers, designers, researchers, project managers and
telehealth personnel at the Centre for Global eHealth Innovation, who already had significant expertise in
designing, developing, implementing and working with Medly. Their suggestions were supplemented with
findings from previously published studies discussing insights on the design and implementation of
previous versions of Medly [95,103–105,159].
The following requirements were identified with regards to fitness tracker selection:
1. The selected activity tracker must be readily available for purchase by patients (as established by
the ‘Best Buy Test’: is the fitness tracker available at a local big box electronic store such as Best
Buy?)
2. The fitness tracker must be compatible with Apple iOS v9.3.5 and above.
3. The fitness tracker must be compatible with the 2014 Samsung Galaxy Grand Prime (Android 5.1
Lollipop) and above.
4. The fitness tracker must be able to record minute by minute step data.
5. The fitness tracker must be able to record minute by minute heart rate data.
6. The data recorded from the fitness tracker must be able to be retrieved for storage and archival
at UHN.
7. The fitness tracker must be able to operate continuously for a minimum of 2 days without requiring syncing or charging (to ensure recording continuity in the event that a patient forgets or is unable to sync or charge the device overnight).
The following additional, user experience, requirements were identified:
1. The system must provide a method to de-authenticate a fitness tracker or authenticate a new fitness tracker.
2. The system must allow for connection and authentication of a fitness tracker.
3. The system must provide a means by which activity tracker functionality can be enabled/disabled
for a patient.
4. The system must provide feedback to clinicians that the fitness tracker is working.
5. The system must provide a means by which clinicians can view patient heart rate data.
6. The system must provide a means by which clinicians can view patient activity data.
7. The system must provide a means by which fitness tracker data can be accessed and downloaded
including:
a. anonymized bulk data.
b. analytics data (e.g. usage, interaction patterns)
8. Clinical access must continue to be secured against access by non-authorized (non-Clinical) staff.
9. Research data access must be secured against access by non-authorized (non-QI/research) staff.
The following were also identified as being important for providing an optimal user experience:
1. The system should provide feedback to clinicians that the fitness tracker is being worn by the
patient.12
2. Data visualization should be done in such a manner that clinical staff are able to easily &
simultaneously relate heart rate and contextual ‘explainers’ of heart rate (e.g. activity data,
medications, etc.)
12 where technically feasible
3. The system should provide feedback to the patient that the fitness tracker is connected.
4. The system should provide feedback to the patient that the fitness tracker is working and
collecting data.
Design & Implementation
After having completed the initial requirements gathering we moved to the design and implementation
phase.
4.3.1 Activity Tracker Selection
To select an appropriate activity tracker, an initial search of modern consumer activity trackers was
performed, revealing 33 potential candidates. These are briefly detailed in Table 12. Most of these activity
trackers did not support continuous heart rate monitoring, had battery lives that did not meet the
continuity requirement outlined in fitness tracker requirement 7 of Section 4.2, or were simply no longer
available on the market (e.g. the Basis Peak which was recalled by Intel Corporation for safety reasons
[160], as well as the Jawbone devices since Jawbone (the company) filed for bankruptcy in July of 2017
[161]). The short list of activity trackers remaining included the Fitbit Charge 2, Ionic and Versa; the
Garmin Vivosmart 3, the Nokia/Withings Steel HR, the Wavelet Health Biostrap, and the Xiaomi Band 2
(all highlighted in Table 12). We quickly eliminated a) the Nokia/Withings Steel HR since it was not yet
released in the Canadian market at the time of the study, b) the Garmin devices in general since access to
the device data through their application programming interface (API) required a steep access fee of
$5000, and c) the Xiaomi Band 2 since it did not appear to have a reliable manufacturer-supported method of accessing device data. Although the Xiaomi Band 2 was advertised as supporting data download using Google Fit, anecdotal evidence from user forums appeared to suggest that this approach was unreliable –
notwithstanding this possible unreliability there was no way to access the data using iOS (fitness tracker
requirement 2 of Section 4.2). This left us with the Fitbit devices and the Wavelet Health Biostrap. We
eliminated the Wavelet Health device after encountering unresolvable issues while attempting to connect a
trial device to our Android devices, although the device worked fine on iOS. Furthermore, in choosing
between Fitbit devices and a relatively new and unproven contender on the relatively volatile activity
tracker market (Wavelet Health), we determined that it was a more prudent choice to opt for the market
leader, Fitbit. Additionally, due to the popularity of Fitbit devices, investigating the accuracy and
reliability of these devices is a more active area of research [41,46,48,65,67,68,84,162]. We opted to use the
Fitbit Charge 2, the successor to the Fitbit Charge HR, since it was the lowest cost option of the three
short-listed Fitbit devices.
Table 12: Candidate activity trackers
Company Product Step Count Heart Rate Battery Life13 Data Access Price Link
Apple Watch Yes Yes 1 day HealthKit [163] 360-590
CAD [64]
Empatica E4 Wristband Yes Yes 1 day Unclear 1700 USD [164]
Fitbit Alta HR Yes Yes 5 days Fitbit API [165] 200 CAD [166]
Fitbit Alta Yes No 5 days Fitbit API [165] 170 CAD [167]
Fitbit Charge 2 Yes Yes 5 days Fitbit API [165] 200 CAD [58]
Fitbit Flex 2 Yes No 5 days Fitbit API [165] 80 CAD [168]
Fitbit Ionic Yes Yes 5 days Fitbit API [165] 400 CAD [169]
Fitbit Versa Yes Yes 4 days Fitbit API [165] 250 CAD [170]
Garmin Fenix Yes Yes 1 day Garmin API [171] 600 USD [172]
Garmin Vivosmart 3 Yes Yes < 5 days Garmin API [171] 150 USD [173]
Huawei Watch 2 Yes Yes 1 day Google Fit [174] 350 USD [175]
Intel Basis Peak recalled August 1, 2016 [160]
Jawbone Various company undergoing liquidation [161]
LG Watch Sport Yes Yes 1 day Google Fit [174] 350 US [176]
mc10 BioStampRC Yes Yes 1.5 days Unclear 500 US [177]
Misfit Flare Yes No 4 months Misfit API [178]
or Google Fit [174] 70 CAD [179]
Misfit Phase Yes No 6 months Misfit API [178]
or Google Fit [174] 150 CAD [180]
Misfit Ray Yes No 4 months Misfit API [178]
or Google Fit [174] 80 CAD [181]
Misfit Shine Yes No 6 months Misfit API [178]
or Google Fit [174] 80 CAD [182]
Misfit Shine 2 Yes No 6 months Misfit API [178]
or Google Fit [174] 80 CAD [183]
13 Listed battery life is always approximate.
Misfit Vapor Yes Yes 1 day Misfit API [178]
or Google Fit [174] 200 CAD [184]
Moov HR Yes Yes < 1 day None 60-100
CAD [185]
Moov Now Yes No 6 months None 60 CAD [186]
Nokia/
Withings Go Yes No > 8 months
Nokia Health API
[187] 50 USD [188]
Nokia/
Withings Steel Yes No > 8 months
Nokia Health API
[187] 130 USD [189]
Nokia/
Withings Steel HR Yes Yes 25 days
Nokia Health API
[187]a 180 USD [190]
TomTom Spark 3 Yes NCb < 1 day to 3
weeks No new users [191] 290 CAD [192]
TomTom Touch Yes NCb 5 days No new users [191] 130 CAD [193]
Under
Armour UA Band Yes NCb 2.5 days Unclear
170-230
CAD [194]
Wavelet
Health Biostrap Yes Yes 5 days Wavelet API [195] 250 USD [195]
Xiaomi Band Yes No 30 days
Google Fit [174],
via unofficial API
[161], or via BLEc
15 USD [196]
Xiaomi Band 2 Yes Yes 20 days Google Fit [174],
or via BLEc 30 USD [197]
aheart rate data access unclear
bNC: non-continuous
cBLE: bluetooth low energy (N.B. device commands are obfuscated by manufacturer)
Proposed Data Access Design
Third party access to Fitbit data is mediated exclusively through the Fitbit web API [165]. It is
possible to both write and read data through the API, but impossible to access data directly from the
device, as illustrated in Figure 4-3. Access to intraday time series data (i.e. step count and heart rate data
at a resolution of less than 1 day, e.g. at the minute level) is also restricted to either ‘personal’
applications, or authorized entities. Authorization to access this data is granted on a case-by-case basis by
Fitbit. After submitting an initial request on June 22nd, 2017 we received approval to access intraday data
2.5 months later, on September 5th 2017. Access to the individual patient data is mediated through the
OAuth 2.0 authentication framework which specifies a secure communications protocol by which Fitbit
and third party servers can confidentially exchange security access tokens to maintain secured and
encrypted transmission of data between the Fitbit servers and the client – in this case UHN - servers. The
complete process for authentication (including initial authentication and maintenance of expired security
tokens), and data retrieval is mapped out in a sequence diagram in Figure 4-4. Since the individual
patient access tokens, which must be refreshed after each use, must be shared between several users (the
patient, clinical staff and research admin/QI personnel) the system was designed such that the central
Medly server would mediate requests for data, supplying the requested data from its internal database
negating the need to re-request data from the Fitbit servers for each user request. The Medly server then
periodically updates this internal database with new data, archiving it according to hospital policy and
local, provincial and federal requirements. Figure 4-5 illustrates this proposed design for patient users and Figure 4-6 illustrates the proposed design but for clinical users. The sequence for research admin/QI personnel is essentially identical to that of clinical users.

Figure 4-3. Fitbit data flow diagram

Figure 4-4. Fitbit authentication process with a client app
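As a rough illustration of the retrieval step (and not the production Medly implementation), one day of intraday step data could be requested with the httr package in R given a valid OAuth 2.0 access token; the endpoint path and response field names follow Fitbit's published intraday time series format and should be treated as assumptions here, and token refresh and error handling are omitted.

    # Illustrative sketch: fetch one day of minute-level step data from the
    # Fitbit web API using an OAuth 2.0 bearer token (token management omitted).
    library(httr)
    library(jsonlite)

    access_token <- "PATIENT_ACCESS_TOKEN"  # placeholder obtained via the OAuth 2.0 flow

    resp <- GET(
      "https://api.fitbit.com/1/user/-/activities/steps/date/2017-09-05/1d/1min.json",
      add_headers(Authorization = paste("Bearer", access_token))
    )

    parsed <- fromJSON(content(resp, as = "text"))
    head(parsed[["activities-steps-intraday"]][["dataset"]])  # time and per-minute step values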
Final Data Access Implementation
The final implementation for data access was managed by the development team at the Centre for Global eHealth Innovation (a partner of UHN). As a result, the final implementation differed slightly from the proposed implementation
due to time constraints and lack of programming resources as a result of concurrent updates, bug fixes
and general QI updates to Medly that were deemed to be a higher priority. The final implementation
therefore did not include an update to the client side patient smartphone application. The proposed
design was reduced to a pared-down Minimum Viable Product14 (MVP). In this pared-down version, clinical admin staff (such as the onboarding coordinator) authenticated Fitbits on behalf of patients on the clinical client application. No functionality was provided for patients to authenticate Fitbits with Medly or to access data through the Medly application. Furthermore, the ability to authenticate new devices and access patient data was only available for patients using Medly on an Apple iPhone15. Clinicians wishing to access data for patients using the standard Android device usually provided as part of the Medly patient kit were only able to access said patient data through the official Fitbit website. Patients, whether Android or iPhone users, were able to access their data either through the Fitbit website or through the Fitbit app that had to be installed on their smartphone. No provisions were made for data access by research/QI personnel - in fact the Medly server was implemented to only receive daily step data summaries and not intraday data. The server also did not retrieve heart rate data.

14 a feature-sparse software platform that includes only the bare minimum functionality required to operate.
15 as of the time of publication Medly now supports Fitbit authentication and data access for patients using either Apple iPhone or Android devices.

Figure 4-5. Medly Fitbit patient access sequence
To access intraday heart rate and step data the author created an open source script using the R
programming language [151] (available with the rest of the software artifacts generated from this thesis as
per Appendix C or directly from [198]). This script connects to the Fitbit API, manages the security
access tokens for the patients in the study (both Android and iPhone patients) and is able to download
both the minute-by-minute step count and heart rate data for analysis. It is also registered as a separate
third-party application with Fitbit to permit separate administration from the clinical system and to
avoid technical issues with the script affecting the clinical system. This script was based on previous work
by S. Bromberg [46,150], whose original script is available on GitHub [150].
Figure 4-6. Medly Fitbit clinician access sequence
4.3.2 User Interface Design
The Medly user interface (UI) also required updates to support the addition of fitness tracker
functionality.
Proposed User Interface Designs
Several designs were proposed, which were based on best practices from the fields of data visualization
[199–201], human factors & user experience design [202–205], as well as insights from consultations with
the Medly design team at Healthcare Human Factors (a partner of UHN) and the development team at
the Centre for Global eHealth Innovation.
In order to provide a more optimal user experience, patients should receive feedback that their device is operating as expected. In the case of the Fitbit activity tracker this means not only that the
device is charged and collecting data, but also that the device is syncing data to the patient’s smartphone,
and ultimately to UHN. Displaying the patient’s Fitbit data on the Medly app on the patient’s
smartphone would provide this feedback since it requires an unbroken chain of communication between
the Fitbit, Fitbit App, Fitbit Servers, UHN Servers and the Medly app as shown in Figure 4-3. We
proposed 4 design each for both the home and trends screen that were consistent with the UI design
language already established by Medly. The 4 proposed home screen designs are illustrated in Figure 4-7,
the designs for displaying trends data are shown in Figure 4-8. Since the fitness tracker step count and
heart rate data is generated at every moment, instead of being collected usually only once a day in the
morning, the proposed designs, although adhering loosely to the established design language of Medly
intentionally treat fitness tracker data in a visually distinct manner so as to help users identify the less
static nature of the fitness tracker data (compare Figure 4-1a and Figure 4-7). Similarly, the proposed
trends screens are slightly modified to better adapt to nature of the fitness tracker data. For example,
daily or weekly heart rate summaries not only report mean heart rate, but also the lower and upper range
of heart rate during those periods.
Along with the aforementioned changes to the trends and home screen, we designed a UI flow for changes
to the Medly smartphone app to allow patients to link a Fitbit account to their Medly account, this UI
flow is illustrated in Figure 4-9. However, as mentioned in Section 4.3.1.2, this flow was ultimately not
implemented. Instead Fitbit account linking was redesigned to be done through the clinician web
interface. The final authentication flow is discussed in Section 4.3.2.2.
Figure 4-7. Proposed designs for patient user interface (home screen)
a) combined heart rate and steps data on one card, b) combined heart rate
and steps data with pictorial representations, c) separated heart rate and step data, d)
only pictorial representation with mini graph
Figure 4-8. Proposed designs for patient user interface (trends)
a) simple sparklines, b) data with bands to indicate min (resting), mean and max values for
each time period, c) whisker plot to indicate daily range, d) heart rate (maximum and
resting) and average step count values broken out for each time period, and e) Tufte style
medical data visualization as per f) which is reproduced from [201]
Figure 4-9. Proposed design for authorization of new Fitbit by patient via Medly smartphone application.
With respect to the clinician web interface, changes were much more limited and mostly centered on
adding new graph components to display the new fitness tracker data which differs from the rest of the
data collected by Medly since it is available at up to minute-level resolution. The proposed
web interface graph designs are shown in Figure 4-10 (which can be contrasted to the existing graph
design in Figure 4-2).
The design of the clinical user interface was approached in a similar fashion to the patient smartphone
trends screen. Although the web interface has more available screen real estate than the smartphone
screen, the performance of the web interface was known to drop drastically when made to process large numbers of
data points for display on graphs. As such, the design of the clinical user interface represented a similar
challenge to the smartphone trends screen: the need to collapse voluminous high resolution minute-by-
minute data into more concise daily or weekly summaries; this explains the successive data simplification
that occurs while transitioning from Figure 4-10b to Figure 4-10d. The design shown in Figure 4-10b for
example is inspired by the UI of an intensive care monitoring system designed for use in the data-rich
environment of the pediatric critical care units at SickKids: The Hospital for Sick Children in Toronto
and Boston Children’s Hospital in Boston [206–208]. Consequently, it is the strongest of the proposed
designs from a data fidelity point of view since it cuts out minimal data and allows a user to more easily
visualize concurrent trends in multiple data streams. However, due to the technical limitations of the
Medly web interface, it is also the least feasible to implement. Figure 4-10c and Figure 4-10d were later
design iterations attempting to reduce the number of visual elements that the interface would need to
process and draw while still maintaining as much information content as possible. Figure 4-10e returns to
the same simple graph style of Figure 4-10a and Figure 4-2 but with range bands and a UI element for
displaying something useful derived from the step count data such as the predicted NYHA class
(compared to the last assessed NYHA class). This UI element also provides the option for the clinical staff
to provide feedback as to whether they agree with the prediction, or not, by pressing on the ‘x’ or check
mark and correcting the prediction (this later pop-up is not shown). This functionality would be useful for
collecting feedback (and training examples) from the user to assess the accuracy (and dynamically teach)
an NYHA functional classification suggestion algorithm once it gets implemented into Medly. Lastly, we
proposed simple alerts for both step count and heart rate consistent with those implemented for weight,
blood pressure and pulse: namely a lower limit for step count and upper and lower limit alerts for heart
rate. We also proposed adding adherence phone call functionality for the fitness tracker similar to the
already implemented system that triggers an automated reminder phone call when a patient does not
submit their daily readings.
Figure 4-10. Proposed designs for clinical user interface (activity and heart rate graphs)
a) simple graph design with indicator lines for alert levels and mean, b) design inspired by
the Sick Kids T3 (tracking, trajectory and trigger) tool [206–208], c) mix of T3 tool with
Medly range bands, d) whisker plot style, and e) simple graph with range bands and
NYHA class prediction display (bottom of the more info page for step count graph)
Figure 4-11. Final web interface Fitbit authorization flow
Figure 4-13. Final web interface activity tracker data display
Figure 4-12. Final web interface activity tracker profile & deauthorization flow
Final User Interface Design
As with the back-end components required to download and access the fitness tracker data, the actual
programming of the UI components required for the activity tracker update to Medly was managed by the
development team at the Centre for Global eHealth Innovation (a partner of UHN). Again, due to time
and resource constraints caused by higher priority fixes and updates, the final UI implementation was
reduced down to a proof-of-concept. Due to a lack of available iOS and Android programmers, no updates
were possible to the patient smartphone UI, so patients were instead instructed to use the Fitbit app on
their phone to confirm that data was being collected and synced to the Fitbit servers. The task of
confirming that the Fitbit data was being properly synced to the Medly servers was instead left to the
author as part of the research work documented in this thesis. Afterwards, this task is anticipated to be
delegated to the clinical admin staff to be performed on a manual basis using elements that were added to
the clinician web interface. The inability to update the smartphone UI also necessitated the creation of a
new UI design for the task of linking patient Fitbits (whether provided by the clinic, or patient’s personal
Fitbits) to Medly servers through the clinician web interface. The final version of this UI flow is shown in
Figure 4-11.
As required by the Fitbit applications programming interface (API) for web applications, as part of the
authorization process, the user is redirected directly to the official Fitbit website (Figure 4-11 step 3) so
they can confirm that they are connecting to the genuine Fitbit.com site [209]. Once logged into the
Fitbit website the user can then select what data to share (Figure 4-11 step 4).
When linking activity trackers we instructed users to select ‘Allow All’ to allow all data to be shared
(refer to Figure 4-11 step 4). Admittedly, this violates an old principle of computer security: the principle of
least privilege (or least authority), which dictates that user access rights be a) limited to the bare
minimum required to perform the desired task and b) provided only for the duration required for said
task. We recognized, however, that Medly was likely to receive updates in the near future to
enable more complete use of Fitbit functionality, and that if these future updates used data outside
of the already required ‘heart rate’ and ‘activity and exercise’ data it would necessitate manually
unlinking and then relinking all of the Fitbit accounts to select additional permissions, likely at significant
time cost. Furthermore, clicking the single ‘Allow All’ button was a simpler task for users to perform
compared to having users select the separate individual ‘heart rate’, ‘activity and exercise’ and ‘Fitbit
devices and settings’ radio buttons. A less complicated task is expected to reduce the likelihood of error
when linking a Fitbit account. Lastly, even in the case of a real security concern such as a data breach,
the tokens exchanged through the authorization process, which provide the Fitbit data access rights in the
first place, can be remotely revoked through the Fitbit website both on an individual basis and en masse.
This reduced the actual security risk to what we deemed to be an acceptable level.
We were actually able to confirm this loss of data access to linked Fitbit accounts as a result of a
suspected security breach inadvertently triggered during data collection. The incident occurred on May
31st, 2017 while authenticating patients using the custom R script written to download the minute-by-
minute heart rate and step count data and manage the associated access tokens.
The script accepts a list of user accounts and loops through a pared down version of the authentication
flow shown in Figure 4-11 (i.e. just steps 3 and 4) for each account one immediately after another. This
makes it possible to quickly add and retrieve access tokens for multiple patients in bulk, reducing
workload for research/QI work. Fitbit’s automated security system interpreted the rapid automated
linking of multiple Fitbit accounts as suspicious and potentially indicative of malicious activity. As a
result, Fitbit’s security system subsequently banned the internet address of the machine running the script
and flagged the 34 recently linked accounts as potentially compromised, forcing password resets and
invalidating the access tokens for each of these accounts (both for the script and the clinical system).
It took approximately 3 weeks to: 1) confirm with Fitbit that we were the actual cause of the suspected
‘data breach’ (as opposed to an actual malicious third party), 2) reset patient passwords, 3) relink
accounts on the clinical system, 4) contact patients to ensure that they had successfully logged back into
the Fitbit app on their phone, and 5) slowly relink accounts to the research system (which we did at a
rate no higher than 1 per 30 seconds and in batches of no more than 25, with a pause of at least 45 minutes
between batches). Because we experienced delays in reaching patients to inform them that they needed to log
back into their Fitbit account (at least half were initially unreachable on the first day and had to be left a
voicemail message or equivalent), some of the patients may have suffered about 1-2 weeks of data loss.
The potential data loss would have been caused by the limited internal memory of the Fitbit; since the
Fitbit only has sufficient internal memory to record 1 full week of minute-by-minute data it must be
synced at least once a week to the Fitbit servers, usually via the Fitbit app, to make more room for new
data. Due to accounts being flagged as compromised, patients needed to log back into their account using
their new password to reenable syncing between their Fitbit and Fitbit servers. Since Fitbit only provides
the last device sync date, which was not actively monitored during this period (as opposed to a complete
sync history) we were unable to confirm the actual extent of data loss for patients. We were also unable
to ascertain the extent of data loss simply by examining the data since it is difficult to determine if
potential lack of data during this period was due to the incident or simply due to patient disengagement,
in particular since those patients most likely to have not noticed that they had been logged out of the
Fitbit app are almost by definition those least engaged with the system.
Aside from the potential loss of data, the incident had no other reported impact on the system. The loss
of data also had minimal impact on the QI/research objectives of this study since most patients impacted
by the incident had already been using the monitoring system for several weeks (and even months), and
data collection for all patients would still continue for several weeks post incident (to attain a minimum 3
week recording period for each patient).
Returning to the UI: once users proceed through the authentication flow in Figure 4-11 - thus enabling
syncing of their Fitbit account to Medly - they are returned to the patient profile page which now displays
status information about the connected Fitbit account and the option to unlink the account if desired (see
Figure 4-12). This profile page displays information about the last time the Medly server was synced
with the Fitbit server – ‘Last Server Sync’ – as well as the last time a Fitbit device was synced to the
Fitbit account16 – ‘Last Device Sync’ – the latter of which can never be more recent than the ‘Last Server
Sync’. These two values were added to help users determine if a lack of displayed step count data is
caused by: a communication problem between the Fitbit server and Medly server (the ‘Last Server Sync’
value is not up to date and does not update even when the user presses the ‘Force Sync’ button); the
Fitbit device has not yet been synced (the ‘Last Device Sync’ value is not up to date although the ‘Last
Server Sync’ value is up to date); or the patient has simply not used the Fitbit or performed any physical
activity (both the ‘Last Device Sync’ and ‘Last Server Sync’ values are up to date but no step data shows
up on the web interface).
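The troubleshooting logic that these two timestamps support can be summarized in a few lines; the sketch below is purely illustrative (the function, field names and the one-day staleness threshold are assumptions, not part of the implemented web interface).

```r
# Hypothetical triage helper based on the two displayed sync timestamps.
diagnose_missing_steps <- function(last_server_sync, last_device_sync,
                                   now = Sys.time(), max_age_days = 1) {
  stale <- function(t) difftime(now, t, units = "days") > max_age_days
  if (stale(last_server_sync)) {
    "Medly <-> Fitbit server communication problem (try the 'Force Sync' button)"
  } else if (stale(last_device_sync)) {
    "Fitbit device has not synced: ask the patient to open the Fitbit app"
  } else {
    "Syncing looks healthy: the patient has likely not worn or used the Fitbit"
  }
}

diagnose_missing_steps(last_server_sync = Sys.time() - 3600,       # 1 hour ago
                       last_device_sync = Sys.time() - 5 * 86400)  # 5 days ago
```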
As for displaying the Fitbit data: heart rate data was deemed to be non-essential for inclusion as part of
the activity tracker MVP, in particular since it would risk confusion with the already displayed
daily recorded pulse data (recorded using a blood pressure cuff). As a result, no graphical display was
implemented for the Fitbit-acquired heart rate data. The step data graph, on the other hand, was
modelled after the existing graph design (Figure 4-2), showing total daily steps for each day in the view
windows (see Figure 4-13). In the ‘More Info’ page to the immediate left of the graph, the whole time
period being viewed was summarized by providing the lowest, average, and highest daily step count and
16 This process occurs automatically every time the user opens the Fitbit app on their smartphone.
total readings during the period in question. It is worth noting that this final step data graph design also only
represents a minimum technically viable product as it does not fully honor the best practices and
principles outlined in the Fitbit API terms of service, the most relevant being the following:
“Offer Users a clear path back to their Fitbit Account.
• Always provide clear documentation and links for Users to access their Fitbit
Account from your Application.
• Paths to Fitbit User accounts should be available wherever User Data is
displayed.
• Paths to Users’ Fitbit accounts should be available in "Setting," "Account," or
a similar location from within your Application.
• When displaying Fitbit Data in your Application, Fitbit must be noted as the
source of Fitbit Data using the text link and/or logo icon made available to you
through the Fitbit Developer Portal.” [210]
As is, the step data graph adheres to none of these provisions.
Despite all of the aforementioned limitations we were able to onboard 46 patients onto the upgraded
system over a 5 month period (from January 9th to June 13th). These patients were subject to the same
inclusion and ‘exclusion’ criteria used for the general Medly system. The inclusion criteria are detailed in
Table 13. While there are no explicit exclusion criteria for Medly, we note that since the system (and by
extension this update) is used as part of the prevailing standard of care at the Heart Function clinic, the
decision to prescribe or exclude a patient from the Medly program is ultimately up to the professional
judgement of the attending cardiologist. As of the time of writing a total of 7 attending cardiologists use
Medly as part of patient care, although one of the cardiologists (the medical director of the clinic) is
disproportionately responsible for a majority of the patients monitored. During this period 2 (4%) of the
46 patients later changed their mind about being monitored via Fitbit and subsequently chose to return
their devices and be removed from QI initiatives related to Fitbit monitoring. On the other end of the
spectrum, 3 (7%) of the 44 patients who remained in the study chose to supply and use their own Fitbit
device and Fitbit account instead of being provided one by the clinic (these patients were unsurprisingly
all very adherent with their Fitbits).
Table 13: Medly inclusion criteria
- a consenting adult (18+ years of age),
- diagnosed with heart failure,
- followed by a licensed cardiologist at the UHN Heart Function Clinic (who in turn bears
the primary responsibility for the management and care of that patient's heart failure
diagnosis)
- sufficiently capable of speaking and reading English, or having an informal caregiver
(spouse, parent, etc.) capable of the same so as to both:
o undergo the process of and provision of informed consent for participation in the
Medly program
o understand and follow the text prompts provided by the Medly patient-side
application
- capable of complying with the use of Medly (e.g. capable of truthfully answering
symptom questions, capable of safely and correctly using the peripherals such as the
weight scale, activity tracker and blood pressure cuff)
Table 14: Medly exclusion criteria
- Congenital heart disease
- Diagnosis less than 6 months prior to recruitment
- Travelling out of Canada for more than 1 week during the study period (to limit study
costs – i.e. roaming charges)
Of the 44 patients who remained on the monitoring system, 12 (27.3%) used and provided their own
Apple iPhone devices, and 32 (72.7%) used Android devices provided by the clinic. Based on the number
of mobile wireless subscribers in Ontario (88.1% in 2015 [211]), the iPhone market share in Canada
(51.37% in October 2017 [212]), and proportion of devices using an iOS version supported by Medly
(version 9.4 or above; 96.75% in October 2017 [213]) the expected proportion of iPhone to Android was
closer to 43.8% (19:25). These expected values and the actual proportions of onboarded patients by device are
tabulated for easier reading in Table 15. By proportion, the number of iPhone users onboarded was
slightly less than expected. The higher relative proportion of Android users was anticipated since
we recruited Android users not just from the pool of new patients onboarded onto Medly during the 5
month period but also from patients who had already been onboarded onto Medly and happened to be
returning to the clinic for follow-up during this period. No iPhone users had previously been onboarded
onto Medly; therefore, all of the 7 returning patients (16%) upgraded with Fitbits were Android users.
Removing these patients, 32.4% of new patients used iPhones and 67.6% used an Android device, which is
closer to the distribution expected based on market share calculations. In either case the relative
proportion of iPhone to Android users was not found to be statistically different to the expected
proportion at the 5% level of significance and given the sample size (P=0.18, and P=0.47 respectively for
the cases discussed above; assessed using a chi-squared test with R [151]).
Table 15: iPhone vs. Android patients on Medly system using Fitbit
a) all patients onboarded, b) only new Medly patients onboarded during thesis

a)                    All Onboarded     Expected (by Market Share)   P-value
iPhone Users          12 (27.3%)        19 (43.8%)                   .18
Android Users         32 (72.7%)        25 (56.2%)

b)                    New Patients Only   Expected (by Market Share)   P-value
iPhone Users          12 (32.4%)          16 (43.8%)                   .47
Android Users         25 (67.6%)          21 (56.2%)
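For illustration, the device-mix comparison can be reproduced in R by arranging the observed counts and the market-share-expected counts from Table 15 as 2 x 2 tables; with the default Yates continuity correction, chisq.test() returns p-values matching those reported. This is a sketch of one way to run the test, not necessarily the exact call used for the thesis analysis.

```r
# Observed vs. market-share-expected device counts (Table 15).
all_onboarded <- matrix(c(12, 32,    # observed: iPhone, Android
                          19, 25),   # expected by market share
                        nrow = 2,
                        dimnames = list(Device = c("iPhone", "Android"),
                                        Group  = c("Observed", "Expected")))
new_only <- matrix(c(12, 25,
                     16, 21), nrow = 2,
                   dimnames = list(Device = c("iPhone", "Android"),
                                   Group  = c("Observed", "Expected")))

chisq.test(all_onboarded)  # p ~ .18 (all onboarded patients)
chisq.test(new_only)       # p ~ .47 (new patients only)
```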
Patient adherence was also recorded at two points
during the study, at 3 months into the study (April
9th, 2018) and at the end of the data recording period
(August 1st, 2018; 7 months). At both of these
junctures, patients were found to be overall
moderately adherent with using the Fitbit – e.g. at
the 3 and 7 month timepoints 50% of patients had
used the Fitbit (recorded steps or heart rate) on at
least half of the days they were on the system. Only
around 1/3 to 1/4 of patients (at 3 and 7 months
respectively) had excellent levels of adherence
(average at least 9 of 10 days using the system). A
more complete breakdown of adherence is available in
Table 16, with the stem and leaf plots in Figure 4-14
illustrating the comparative distribution of the
percentage of days patients had used the system (relative to the total number of days they were on the
upgraded system) at the 3 and 7 month timepoints. A paired Wilcoxon signed rank test (since the data is
non-normal, as can clearly be discerned from Figure 4-14) revealed that there was no statistically
significant difference between the adherence at 3 and 7 months (P = 0.625).
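A sketch of the paired comparison is shown below; the two vectors stand in for each patient's percentage of days used at the two timepoints and hold placeholder values, not study data.

```r
# Placeholder adherence percentages, paired by patient (3 vs. 7 months).
adh_3m <- c(91, 100, 45, 12, 88, 97)
adh_7m <- c(90,  98, 42, 16, 83, 91)

# Non-parametric paired test (the real data gave P = 0.625).
wilcox.test(adh_3m, adh_7m, paired = TRUE)
```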
Compared to the adherence levels recorded during the original Medly RCT, where “about 42, 33, and 16
out of the 50 telemonitoring group patients (84%, 66%, and 32%) completed at least 91 (50%), 146 (80%),
and 173 (95%) of possible daily readings over the six months respectively (prior to the adherence phone
call deadline at 10am)” [103], patients using activity trackers in this study were found to be significantly
less adherent (at the 5% level of significance) at both the 50% and 80% adherence thresholds (but not the
95% threshold); detailed results are tabulated in Table 17.

Figure 4-14. Distribution of patient Fitbit adherence (as percent of days using the system)
Back-to-back stem-and-leaf plot (3 months on the left, 7 months on the right); the decimal point is 1 digit to the right of the ‘|’, so 9 | 1 represents 91%.

   3 Months |    | 7 Months
        980 |  0 | 0001235589
       6431 |  1 | 15
        842 |  2 | 45899
          1 |  3 | 012
         64 |  4 | 19
            |  5 | 237
          8 |  6 | 2
         21 |  7 | 4
          5 |  8 | 0357
        710 |  9 | 0111137888
     000000 | 10 | 000
Table 16: Patient adherence on Fitbit

                                          3 Months                7 Months
Adherence      Definition                 sum(a)      delta(b)    sum(a)      delta(b)
Near Perfect   > 95% of days used         7 (26.9%)     -         7 (15.9%)     -
Excellent      > 90% of days used         9 (34.6%)   2 (7.7%)    12 (27.3%)  5 (11.4%)
Consistent     > 68% of days used17       13 (50.0%)  4 (15.4%)   18 (40.1%)  6 (14.6%)
50-50          > 1/2 of days used         13 (50.0%)  0 (0%)      22 (50%)    4 (9.1%)
Sporadic       > 1/7 of days used         21 (80.8%)  7 (30.8%)   33 (75%)    11 (25%)
Onboarded      all patients               26 (100%)   5 (19.2%)   44 (100%)   11 (25%)

(a) i.e. # (%) of patients meeting or exceeding the specified level of adherence
(b) i.e. difference between # (%) of patients at the specified level of adherence and the next highest adherence level
Table 17: Fitbit adherence compared to adherence recorded for original Medly during RCT

                        Medly RCT [103]   Fitbit @ 3 Months          Fitbit @ 7 Months
Adherence Level         # of patients     # of patients   P-value    # of patients   P-value
> 95% of days used      16 (32%)          7 (26.9%)       .85        7 (15.9%)       .12
> 80% of days used      33 (66%)          10 (38.5%)      .04*       17 (38.6%)      .014*
> 50% of days used      42 (84%)          13 (50.0%)      .004**     22 (50.0%)      <.001
Total                   50                26              -          44              -
A recent study by Hermsen et al. [214], who examined sustained use of a provided Fitbit activity tracker
in 711 patients, found that 232 days into their study, of those who were non-adherent at that stage (187
patients), 56.7% stopped adhering due to technical problems or difficulties18, 12.8% lost the device, 12.8%
forgot to wear the device, 9.7% felt they had no use or motivation to use the particular device given to
them (including because they used a different device), 3.7% stopped due to health issues and 5.4% didn’t
want to use the device for various other reasons (excluding health issues).
From this study we can infer that people, broadly-speaking, are non-adherent to technology for one of
three reasons:
17 68% of days equates to roughly 20-21 days out of the month (i.e. every weekday)
18 in our study we had 2 devices (both replaced) reported as non-functional (one that over-reported steps and one
that simply didn’t work).
1) they are (humanly) unable to use the technology, namely because the technology is non-
functional, whether due to technical or human factors problems;
2) they want to use the technology but forget to do so; or
3) they don’t want to use the technology, for example because they have concerns about
detrimental effects of the technology on their wellbeing, or generally don’t recognize any
benefits to using the technology.
For patients who are unable to use the technology, in particular due to human factors problems, the
pared down UI designs ultimately implemented do little to make the Fitbit more usable from a patient
perspective. However, they also do little to make things worse. Since no UI updates were made to the
Medly patient app to help support the fitness tracker, a patient’s interactions with the Fitbit are limited
to interactions with the device itself and the proprietary Fitbit app (and optionally the Fitbit website). As
a result, difficulties interacting with the technology are in a way more representative of Fitbit as a
technology than of our RPM system. Our findings therefore actually form a baseline for patient
adherence on a Fitbit RPM system since the components implemented into our system represent the bare
minimum required to actually make a Fitbit enabled RPM system function. Furthermore, the fact that the Fitbit
user experience design is largely outside of the control of third-party researchers and programmers
makes it harder to make real improvements to this part of the user experience, perhaps aside from
providing better user education (generally considered by human factors experts as the least effective
means of effecting meaningful change [215,216]).
In the other case of patients who simply forget to wear the tracker, a solution already exists: adherence
phone calls. These were coincidentally used with great effectiveness during the Medly RCT although they
were not added as part of the Medly Fitbit MVP.
As for patients who did not want to use our technology: we suspect that this was a less likely
contributor to non-adherence in our particular study since the patients onboarded onto this system all
willingly consented to participate. That being said, we fully expect this willingness to decrease as time
goes on. In the same Hermsen et al. study (which examined the sustained use of a provided Fitbit activity tracker),
the authors found a “slow exponential decay in Fitbit use, with 73.9% (526/711) of participants still
tracking after 100 days and 16.0% (114/711) … after 320 days.” [214]. Although, as previously mentioned,
we found no significant difference between adherence at 3 and 7 months our study was not powered ahead
of time to address this question.
We suspect that the easiest and most cost-effective solution to most if not all of the aforementioned
problems is adding the fitness tracker to the adherence phone call system already implemented as part of
Medly. Adherence phone calls would not only help to address the problem of patients simply forgetting to
wear the activity tracker (which might otherwise necessitate an update to the Medly UI), but they would
also provide increased opportunity to address technical or usability issues experienced by patients by
providing patients with an additional compelling reason to get these issues addressed by contacting Medly
support staff (i.e. avoiding nuisance phone calls). If the Medly UI were to be updated, adding some sort of
alert or reminder when a patient was taking their morning readings would be even better since it would
prevent more unintentional data loss. An ideal system would also notify this same Medly support staff of
patients who are consistently experiencing difficulties with the activity tracker, to properly close the
feedback loop between patients and the clinic and ensure that patient difficulties are being properly
addressed. While adherence phone calls would help catch technical or usability issues earlier, they might also
help patients see the benefit of this system in that they would be held accountable to this element of their
self-care and management. From a research perspective, having already established the baseline adherence
of the Fitbit system, we could even quantify the actual impact of adherence phone calls by re-running this
analysis after this feature is implemented.
As for the usage of the updated system by clinical staff: we unfortunately have no quantitative data to
perform an analysis similar to the one done for patient users, as the upgraded iteration of the Medly
system did not record data that would permit the assessment of clinician usage of the newly available
Fitbit data.
The analysis in this chapter was performed using R [151] and supporting packages [217–219].
Summary
In summary, we updated Medly, the remote patient monitoring system in use at the TGH HF clinic, to
support the collection and partial display of Fitbit activity tracker data. Although the system supports all
Fitbits, we specifically chose to provide patients at the clinic with the Fitbit Charge 2, which was the
most inexpensive tracker that met our requirements: namely that it was readily available for purchase,
supported the hardware (smartphones) being used as part of the Medly program, could last at least a few
(2) days without syncing or charging (to help avoid data loss), and provided a means for downloading and
accessing continuous minute-by-minute step count and heart rate data from the device (even if
indirectly). Data access was performed through the Fitbit API with a separate connection for the clinical
system (which allowed clinicians to monitor patient activity through Medly’s custom web interface) and
for the research system (a custom R script which allows research/QI staff to manage access tokens, and
download patient activity data in bulk for offline analysis – see Appendix C or [198]).
Updating Medly to support Fitbit activity tracker data also required an update to the UI of the system to
allow users to 1) link a Fitbit account to the corresponding Medly patient account and 2) monitor patient
activity through the Medly system. In view of this, several UI designs were proposed to the professional
development team whose task it was to program the final design into the existing Medly system. However,
due to time and resource constraints caused by other concurrent higher priority updates and bug fixes to
Medly, all of the initially proposed designs were eschewed in favor of producing a pared down minimum
viable product which demonstrated the technical viability of the solution. As a result, no changes were
made to support the Fitbit activity tracker on the patient smartphone applications. Patients were instead
instructed to use the Fitbit app alone to access their Fitbit data. As for linking patient’s Fitbit accounts
to their Medly account, the authentication flow was adapted so it could be performed by clinical staff
through their clinical web interface. The display of Fitbit activity tracker data on said web interface was
limited to daily step data only since heart rate data was deemed as non-essential. The updated system
also only supported patients using Apple iPhones - clinicians wanting to monitor patients who were using
the standard Android phones provided as part of the Medly system instead had to go through the Fitbit
website directly (although as of the time of publication the Medly system now fully supports patients
using both iPhone and Android).
Despite these limitations, we were able to monitor 44 patients over a 5 month period (from January 9th to
June 13th), with an additional 2 patients who were onboarded but later changed their minds. 3
of the 44 patients actually brought and used their own Fitbit. 12 (27.3%) of the patients used iPhones
(and could be monitored using the updated Medly web interface), whereas 32 (72.7%) of the patients used
Android (which was not supported by the updated Medly web interface). Overall, patients were found to
be only moderately adherent with using the Fitbit. At the 3 and 7 month time points, 50% of patients
had used the Fitbit (recorded steps or heart rate) on at least half of the days they were on the system.
Only around 1/3 to 1/4 of patients, respectively at the 3 and 7 month timepoints, had excellent
levels of adherence (average at least 9 of 10 days using the system). We proposed that adding adherence
phone calls or reminder notifications would help improve patient adherence to the system, or at least help
staff catch and address patient issues in a timely manner.
Chapter 5 – Assessment of NYHA Functional
Classification using Hidden Markov Models
Having completed the essential groundwork of building a system to collect relevant input data, we set
out to assess the NYHA functional classification of patients in an example dataset using 6 different
machine learning (ML) algorithms, specifically: Hidden Markov Models (HMM); Generalized Linear
Models (GLM); a variant thereof: boosted GLMs; Random Forests (RF); Artificial Neural Networks
(NNet); and a variant thereof: Principal Component Analysis Neural Networks (PCA NNet). Since the
approach used to create the HMM based classifier (HMMBC) differed slightly from the rest of the
candidate models, we discuss the HMMBC separately as part of this chapter, while the remaining ML
models are treated in Chapter 6.
First, we provide a brief refresher on HMMs - a more detailed introduction is provided in Appendix B –
followed by our rationale for using HMMs in the first place. We then proceed to explain our methodology
for training and testing a HMMBC. Finally, we discuss the results of our investigation and, since our
HMMBC approach was ultimately unsuccessful, we touch on the problems encountered and provide
recommendations for future attempts.
Hidden Markov Models
Any introduction to hidden Markov Models must start with Markov models. Markov models are
probabilistic state machines where the transitions between states occur randomly according to some pre-
determined and pre-specified transition probabilities between each of the states [118,220–223]. Hidden
Markov Models (HMM) are simply Markov Models where the underlying states cannot directly be
observed [118,220,222,224,225]. Instead, the underlying states of the HMM are inferred from an associated
set of possible observations that are linked to each state. In other words, from the possible outputs that
can be produced when the system is in a particular state. These observed outputs could be speech
phonemes, written characters of the alphabet, or genome sequences [118,226], or in our case step count or
heart rate readings, amongst others.
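As a toy illustration (not taken from the thesis models), the snippet below simulates a two-state ‘resting’/‘active’ Markov chain whose state is never observed directly, only the noisy per-minute step counts it emits; fitting an HMM amounts to recovering this hidden structure from such an observation sequence.

```r
set.seed(1)

# Hidden two-state Markov chain: rows are the current state,
# columns the probability of the next state.
trans <- matrix(c(0.95, 0.05,    # resting -> resting, resting -> active
                  0.30, 0.70),   # active  -> resting, active  -> active
                nrow = 2, byrow = TRUE)
means <- c(resting = 0, active = 60)   # mean emitted steps/minute per state

state <- 1                      # start in the resting state
obs   <- numeric(120)           # 2 hours of per-minute observations
for (t in seq_along(obs)) {
  state  <- sample(1:2, size = 1, prob = trans[state, ])
  obs[t] <- max(0, round(rnorm(1, mean = means[state], sd = 10)))
}
# 'obs' is what an HMM is fit to; the generating state sequence remains hidden.
```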
5.1.1 Rationale for the use of HMMs
The rationale for using hidden Markov Models is that they can embrace the complexity and nuance of
the entire time series data streams (and sequential data in general). In contrast, the remaining ML models
investigated in this thesis (in their standard form) must be provided with input predictors formulated as
cross-sectional data (i.e. with the observations coming from a single point in time).
Of course, it is possible to format, or distill, time series data into cross-sectional data. For example, one
could use the values at discrete time points in a time series as separate independent input features for a
ML model. This is illustrated in Figure 5-1, where the value at time 𝑡𝑛 and the 𝑚 values preceding it:
𝑡𝑛−1, 𝑡𝑛−2, 𝑡𝑛−3, …. to 𝑡𝑛−𝑚 are provided as separate inputs to the ML Model. But, by decoupling the
individual time points one loses an, if not the, essential characteristic of time series data (and sequential
data generally): the interrelationship between individual data points in the series. An ML model trained
in this manner will therefore be robbed of very important information about the time series in question.
To avoid completely throwing away this interrelationship information, one could instead compute various
metrics or characteristics to describe the entire time series such as: the mean and variance of the signal,
the total number or location of peaks, the signal auto-correlation, cross-correlation, frequency distribution,
and so on, using these as input features. Ultimately though, any computation which takes an entire time
series signal and boils it down to a single parameter before providing it to the ML model must be
prematurely throwing away possibly relevant information. This is not to say that feature extraction is
something to be avoided - in fact, it forms a core part of most machine learning pipelines and is also
something we performed as part of training the cross-sectional models detailed in Chapter 6. That being
said, we reasoned that a HMM, which has access to the full time series waveform, with all its complexities,
nuances and interrelationships, would be a better initial candidate for attempting to replicate the
complex task that is assessing NYHA functional class.

Figure 5-1: A method of inputting sequential (time series) data into a cross-sectional model
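To make the two formatting strategies concrete, the sketch below builds both kinds of cross-sectional input from a toy per-minute step vector: lagged values as separate inputs (the Figure 5-1 approach) and whole-series summary features; it is illustrative only and uses placeholder data.

```r
steps <- c(0, 0, 12, 40, 55, 0, 0, 3, 0, 80, 75, 0)   # placeholder per-minute data

# 1) Lagged values as independent features (Figure 5-1): each row contains
#    the value at time t_n and the m = 4 values preceding it.
lagged <- embed(steps, dimension = 5)

# 2) Whole-series summary features: a single row describing the entire stream.
features <- data.frame(mean_steps = mean(steps),
                       var_steps  = var(steps),
                       n_peaks    = sum(diff(sign(diff(steps))) == -2))
```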
Methods
In the following section we briefly detail our methodology used for a) implementing and b)
subsequently assessing the performance of our HMMBC.
The work done in this chapter was performed using the R programming language [151] in conjunction
with RStudio [152], an integrated development environment for R, along with various other supporting R
packages [153–158,217]. The R package depmixS4 was used specifically for the training of the HMM
models [227,228].
5.2.1 Training Data
Dataset
Although we originally intended to use the new data collected from the upgraded Medly system (with
the additional activity tracker functionality), we opted to instead use data that was collected during a
previous study (the same data used in Chapter 3). Analysis of the data collected, and continuing to be
collected, from the upgraded Medly system is instead left to future work. The reasoning for this choice
was three-fold.
First, the previous (Chapter 3) study data had a marginally larger sample size of 50 patients, vs. a
nominal 44 patients from the new Medly data. Furthermore, since 5 of the 44 had almost no recorded
activity, and an additional 6 had less than 1 week of recorded activity, the practical size of the Medly
dataset is really closer to 33 patients. While neither of these datasets is large even when considered from a
classical statistics perspective, machine learning is notorious for being particularly data intensive, and
typically the noisier, the more complex and the greater the variance in the data, the larger the dataset
required to achieve good classification performance. Given that continuous daily step data
is simultaneously noisy, complex, and highly variable, we expect that the models will lean towards requiring
more data rather than less. Aside from considering the complexity and nature of the machine
learning algorithms we are investigating, the use of the somewhat larger 50 patient dataset is further
justified since some fraction of the 50 samples will also need to be set aside and reserved for testing and
validation of the models.
The second reason we chose to use the previous study data was that we had insufficient time to download
the last bits of activity data, collect the additional non-activity portions of the data set (e.g.
demographics, NYHA class and CPET data), and subsequently properly clean, and then re-run the
analysis that follows on the new Medly Fitbit data set. The lack of time was mostly a result of pushing
back the final deadline for the inclusion of new onboarded patients into the study dataset, in order to
scrape together as much data as possible for ML in the face of the relatively low onboarding rate (~1.5
patients/week including both new patients and upgraded returning patients) and the delays in
implementing the required data collection infrastructure (as discussed in Chapter 4).
The third reason we opted to use the previous study data is that it included summary cardiopulmonary
exercise testing data for all the patients in the dataset (a by-product of the inclusion criteria) whereas
approximately half of the patients on the upgraded Medly system had not had a CPET performed and
therefore had no such data available at the time of publication. Using the previous study data therefore
had the benefit of allowing us to create models and perform some initial comparisons of the classification
performance of models trained using only CPET data (recall, the gold standard test for assessing exercise
capacity) as compared to models which use activity tracker data.
Our choice of dataset however did come with a significant drawback. As already mentioned, the previous
study data used an activity tracker that did not collect heart rate data. As a result, the dataset only
consisted of the following data:
1. Minute-by-minute step count data – recorded using a commercially available activity-tracker, a
Fitbit Flex [59], continuously throughout the day.
2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine
care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients
who received it prior to recruitment19).
3. Patient demographic/meta data – recorded as part of onboarding, and specifically including:
a. Sex [Male or Female],
b. Age [years],
c. Height [cm],
d. Weight [kg],
19 Specifically, 1, 15, 20 and 22 days prior to recruitment.
e. Handedness [left or right], and
f. Wristband Preference [left or right].
Population
In short, the data ultimately used in the development and validation of all the ML classifiers discussed
in this work is the same data used to perform the replication study in Chapter 3. Recall that the data
was originally sourced between September 2014 and June 2015 from a closed (prospective) cohort of adult
outpatients at the Heart Function Clinic (a tertiary care clinic specializing in the management of heart
failure) at Toronto General Hospital, a part of the University Health Network (UHN) in Toronto,
Canada. The inclusion and exclusion criteria are respectively detailed in Table 3 (page 37) and Table 4
(page 37). The dataset includes 50 patients whose demographics are fully detailed in Table 5 (page 38),
Table 6 (page 38) and Table 7 (page 39), but in short, to reiterate, the patients are predominantly male
(86 vs. 89 [%]), aged: 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3
[kg/m2]) with no significant difference in handedness or wristband preference (see Table 11).
Patients in the dataset were recorded for 2 weeks during which time their HF, and by extension their
NYHA class, was assumed to be stable (stability of HF being one of the criteria for inclusion into the study
which originally generated this dataset; see Table 3).
Label Assignment
The “true” underlying NYHA class of a patient was assessed at onboarding by their physician as either
NYHA functional class II (n=26) or III (n=11), according to the criteria outlined in Section 2.2.1.1, or as
some intermediate/mixed class I/II (n=9) or II/III (n=4). Patients were assessed as an
intermediate/mixed class when a physician was uncertain about the classification or felt that patients
exhibited symptoms from different class levels. However, since class I/II and II/II are not formally
recognized NYHA classes (nor are the sample sizes for the classes in question large enough for any sort of
machine learning), it was necessary to group these intermediate/mixed classes together with the existing
traditional NYHA classes for the purpose of developing our ML classifiers. We grouped the
intermediate/mixed classes according to the most ‘severe’ NYHA class in the set20, i.e. I/II with NYHA
class II, and II/III with NYHA class III.
20 recall our extended reasoning on page 39 for grouping according to the more severe class in the mix.
5.2.2 Model Design
Predictor(s)
In order to predict the class labels, the HMMBC was supplied with only one predictor: the step count
data, since this was the only available time series data. Adding in either the demographic or available
cardiopulmonary testing data would have required stratifying our patients into groups and training
separate sub-classifiers for each group. Since our dataset was so small and relatively homogenous we
reasoned that stratification was not likely to significantly improve performance but would definitely have
at least some detrimental impact on performance by reducing the already meager number of examples
available to train a given classifier in the first place (due to the stratification process).
We did however use multiple variations of the step count data after encountering difficulties getting our
classifier to converge to a valid model using the high-resolution minute-per-minute data. We re-attempted
training our classifier using data at progressively lower temporal resolutions, from 2 to 6 hours. The
algorithm was finally able to converge when we used a resolution of 6-hours21. The result is that we
investigated five separate variant classifiers as part of this work, with each variant supplied with step
count data at a different time resolution, specifically at either:
a) a per minute level resolution [steps/minute], or;
b) a per 2-hour level resolution [steps/2 hours], or;
c) a per 3-hour level resolution [steps/3 hours], or;
d) a per 4-hour level resolution [steps/4 hours], or;
e) a per 6-hour level resolution [steps/6 hours]
Normalization
Additionally, before using the step count data for training, we also normalized the per minute values
to between 0 and 1 via linear scaling, using a minimum of 0 and a maximum of 300 [steps/minute].
Normalizing predictors typically has beneficial effects on training speed but is usually most important for
ensuring each predictor is considered equally by the learning algorithm (as a result of being similarly
21 i.e. the per minute data summed into non-overlapping 6 hour intervals
weighted). In our case, since our HMMBC does not use multiple predictor inputs at the same time we
normalized the data for its secondary effect on learning speed and efficiency.
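A brief sketch of these two preprocessing steps (aggregation into coarser bins, as used for the 6-hour variant, and linear scaling to [0, 1] using the 300 steps/minute ceiling) is shown below; 'minute_steps' is a placeholder vector standing in for one day of per-minute step counts.

```r
set.seed(2)
minute_steps <- rpois(1440, lambda = 2)        # placeholder: one day, 1440 minutes

# Sum the per-minute counts into four non-overlapping 6-hour bins (cf. footnote 21).
six_hour_bins <- tapply(minute_steps, rep(1:4, each = 360), sum)

# Linear scaling to [0, 1] using a fixed range of 0 to 300 steps/minute.
normalized <- pmin(minute_steps, 300) / 300
```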
Architecture
In order to actually construct a classifier using the aforementioned predictors, we used one HMM per
classification label - 2 total: 1 each for NYHA functional class II and III22 - combined as per Figure 5-2.
Each HMM is trained with data from the subset of patients corresponding to the target NYHA class
label, i.e. one HMM is trained using the 35 patients with NYHA class II and the second with data from
the 15 patients with NYHA class III. Classification of new patients can then be performed by evaluating
the likelihood that the given patient's predictor sequence (i.e. step count data stream) was generated from
each of the corresponding HMMs in a set. Evaluating this likelihood, or similarity score, is done using an
‘inference’ algorithm, typically the ‘forward’ or ‘backward’ algorithm, whose functionality is included in
most HMM programming libraries. The interested reader can read up on the finer details of these
inference algorithms in any of these referenced works [118,220,222–224]. Regardless of the algorithm used,
the NYHA class of the patient in question is deemed to correspond to the class of the HMM with the
22 by extension, a 3 or 4 class multi-class classifier would contain an additional HMM trained using
NYHA class I or IV patients as required.
Figure 5-2: Architecture for hidden Markov model based classifier
highest similarity score returned by the inference algorithm. In other words, the class of the model with
the highest likelihood of having generated a sequence similar to the input predictor corresponds to the
predicted class of the input patient data stream.
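A minimal sketch of this architecture using depmixS4 is shown below, assuming placeholder objects: steps_ii/steps_iii stand for data frames with a normalized 'steps' column holding the concatenated sequences of the NYHA II and NYHA III training patients, lengths_ii/lengths_iii give each patient's sequence length (so the patients are treated as independent series), and new_patient is a single patient's sequence. It is illustrative only, not the thesis code.

```r
library(depmixS4)

fit_class_hmm <- function(df, lengths, nstates = 3) {
  mod <- depmix(steps ~ 1, data = df, nstates = nstates,
                family = gaussian(), ntimes = lengths)
  fit(mod)
}

hmm_ii  <- fit_class_hmm(steps_ii,  lengths_ii)   # trained on the 35 class II patients
hmm_iii <- fit_class_hmm(steps_iii, lengths_iii)  # trained on the 15 class III patients

# Score a new patient's sequence against a trained model: build an identical
# model around the new data, copy in the trained parameters, and evaluate the
# (forward-algorithm) log-likelihood.
score <- function(trained, new_df) {
  mod <- depmix(steps ~ 1, data = new_df, nstates = 3,
                family = gaussian(), ntimes = nrow(new_df))
  logLik(setpars(mod, getpars(trained)))
}

# The predicted class is that of the HMM most likely to have generated the sequence.
predicted <- ifelse(score(hmm_ii, new_patient) > score(hmm_iii, new_patient),
                    "NYHA II", "NYHA III")
```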
Model Generation and Selection
As to how we generate the individual HMM models, the process can be divided, at least logically, into
two separate parts. The first is that of generating a model for each of the classes. The second involves
generating different variant models within each class group using different initial HMM parameters with
the goal of trying to find the parametrization that creates the single best model that most accurately
represents the class group in question. In other words, to find as close as possible to the globally optimal set of
parameters (as opposed to simply a local optimum).
The first part, model generation for each class, as already touched on, is accomplished by simply selecting
all the patients that belong to a given class (NYHA class II or NYHA class III) and using these as the
training data for the model training function of our HMM library for R: depmixS4 [227,228]. The
depmixS4 training function outputs a potential model which we can add to a list of potential models for
that class. This list of models will later be passed onto the optimal model set selection process.
The second part, generating different parametrizations, simply involves repeating the first part of the
process, but updating the initial parameters that form the second part of the required input for the
depmixS4 model training function, until we have swept through all the desired parameter variations. Each
of these models is in turn added to the appropriate list of potential class II or class III models.
As for selecting the final model pair, this can be accomplished by simply taking every paired combination
of class II and class III models in the potential model lists, assessing the performance of each of these
combinations against an example test set of data, and selecting the model set with the best overall
performance. Unfortunately, we did not actually investigate this last part of the model generation process
as a result of the critical problems encountered in the first part of the model generation process: namely
that we were unable to get the training algorithms to converge, or actually train a HMM model using the step
count data (whether with the depmixS4 library or others [223,225]). Although we were able to discover a
way to overcome these training difficulties - using lower resolution step count data (the per-minute step
count data aggregated into non-overlapping 6-hour bins) - this solution fundamentally violated the whole rationale for
using a HMM model based approach in the first place (being able to use the complete per-minute time
series waveform without having to dilute it down). This prompted us to instead pursue and focus on the
other more classic cross-sectional ML methods discussed in Chapter 6. As a result, although we managed
to train a single set of HMMs, which we used to build an initial HMMBC, the performance of the
classifier was so obviously poor (as discussed in Section 5.3) that we eschewed spending significant
time optimizing the algorithm performance when the cross-sectional ML methods proved more effective.
Initial Parameterization
The initial parameterization for the successful trained classifier, with some rationale for the selection,
is provided below. We emphasize however that little weight should be given to these parameters since
they are hand-picked, essentially arbitrary and not-verified against other parameters. Although we
attempted several different variations on model parameterizations as part of the debugging process none
of these were thoroughly documented.
1. States: 3
Although we only tested an HMMBC built with 3 underlying states (per HMM), our original
intent was to sweep the state parameter from 3 to 6-8 states depending on available
computational power. We started with the lowest number in that range - 3 states - to help with
debugging our training problems. Since we never performed the optimal parameterization search
our final successful trained classifier therefore only had 3 states23.
2. Starting State Probabilities: [0.95 0.00 0.05]
Based on our initial exploration of the data (Chapter 3), patients spent most of their time in a
non-active state. In other words, at any given moment, if we were to look at the step count time
stream, it is most likely that a patient would be in a non-active state as opposed to any other
state. We assumed the HMM would likely detect this as a strong pattern and model the non-active
state as one of the 3 states, so we set our starting state probabilities to suggest this in advance.
3. Transition Probabilities (each column sums to 1):
   [0.90 0.30 0.33]
   [0.05 0.50 0.33]
   [0.05 0.20 0.33]
23 The computational power limit is important since the computational cost increases with the square of the number of states (since each
state is interconnected). That is, with 3 states there are 9 possible transitions between states which must be solved.
Doubling the number of states to 6 causes a quadrupling of the number of possible transitions to 36 and at 8 states
there are 64 possible transitions, almost double that of the 6 state case.
The selection of initial transition probabilities was done almost completely arbitrarily due to a
lack of relevant precedent information. However, to remain consistent with the assumption made
for the starting state probabilities - that a patient was likely to remain in the non-active state the
majority of the time - we did tweak the initial transition probabilities for the corresponding state
(dictated by the starting state probability matrix) to heavily favor remaining in that state. The
remainder of the transition probabilities were selected completely arbitrarily with the only
restrictions being that the sum of each state’s transition probabilities should of course be equal to 1
and that no transition probability should be 0.
4. Emission Probabilities: normally distributed with means ± variances (in steps/minute) of
[1 40 100] ± [10 80 1000]
The emission probabilities were based on the range of values graphically observed from the per-minute step-count distribution (shown in Figure 5-3). The specific choices for mean and variance were arbitrarily selected, although in such a way that they very loosely separated the distributions into three equidistant parts.
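A sketch of how these hand-picked starting values could be passed to depmixS4 is shown below; the exact ordering expected by the instart/trstart/respstart arguments should be verified against the package documentation, and 'train_df'/'train_lengths' are placeholders for the training data described in Section 5.2.1.

```r
library(depmixS4)

instart   <- c(0.95, 0.00, 0.05)            # starting state probabilities
trstart   <- c(0.90, 0.05, 0.05,            # outgoing transition probabilities,
               0.30, 0.50, 0.20,            # one "from" state per line
               0.33, 0.33, 0.33)            # (each line sums to 1)
respstart <- c(1,   sqrt(10),               # per-state Gaussian mean and sd
               40,  sqrt(80),               # (the text specifies variances; the
               100, sqrt(1000))             #  gaussian response uses sd)

mod <- depmix(steps ~ 1, data = train_df, nstates = 3, family = gaussian(),
              ntimes = train_lengths,
              instart = instart, trstart = trstart, respstart = respstart)
fitted_mod <- fit(mod)
```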
5.2.3 Model Validation
Since the classifier did not perform well even when tested with the training data, which should provide
overly optimistic performance estimates, we did not feel it necessary to perform additional internal or
external validation of the HMMBC discussed in this chapter. The performance reported in the Results
and Discussion section that follows is therefore based on using identical training and testing sets (all n=50
patients) and should therefore be considered to be overly optimistic about the real-life performance of the
HMMBC on actual new data.
Figure 5-3: Distribution of per-minute step count for
patients with NYHA class II and NYHA III (* grouped)
Results and Discussion
As previously mentioned (in Section 5.2.2.1), we encountered significant difficulties during the HMM
training process. Specifically, the HMM training algorithm was unable to converge to a valid model when
supplied with the per-minute step count data. The resolution to this problem was ultimately to supply the
HMM training algorithm with progressively lower and lower resolution data. The algorithm was finally
able to converge when the data supplied had a temporal resolution of 6 hours.
5.3.1 Classification Performance
The performance of the HMM based classifier produced using the per 6-hour step count data is
presented in Figure 5-4. As can be seen from the confusion matrix,
only 19 of the total 35 NYHA class II patients and 10 of the total
15 NYHA class III patients were correctly classified by the
HMMBC yielding an overall raw (unbalanced) accuracy of 58%.
The balanced accuracy (not shown in Figure 5-4) - which corrects
for the unequal distribution of class II and class III patients - can
be calculated to be 60%. Unfortunately, the HMMBC accuracy is
lower than the no information rate (70%). This indicates that,
given the class distribution in the dataset - 70% of patients with
NYHA class II - the classifier actually performs no better than if
we had simply assigned every patient to the majority class (NYHA class II). The
poor agreement between the physician assigned NYHA class and
classifier assigned NYHA class is also reflected in the low value of
the Cohen’s Kappa coefficient24 (𝜅=0.18).
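The reported figures can be recomputed from the Figure 5-4 confusion matrix, for example with the caret package; the sketch below assumes class II is treated as the ‘positive’ class, which reproduces the listed sensitivity and predictive values.

```r
library(caret)

# Rows: classifier ("AI") prediction; columns: physician-assigned class.
cm <- matrix(c(19, 16,    # physician class II classified as II / III
                5, 10),   # physician class III classified as II / III
             nrow = 2,
             dimnames = list(Prediction = c("II", "III"),
                             Physician  = c("II", "III")))

confusionMatrix(as.table(cm), positive = "II")
# Accuracy 0.58, NIR 0.70, Kappa 0.18, Sensitivity 0.54,
# Specificity 0.67, Balanced Accuracy 0.60
```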
5.3.2 Training Challenges
That the HMMBC performance is sub-par does not necessarily come as a surprise. The amount of
training data, for one, is possibly simply insufficient to adequately train the HMMBC: 35 examples of
NYHA class II patients and 15 examples of NYHA class III patients is not a lot of training data. This potential
24 The Cohen’s Kappa coefficient quantifies agreement between independent raters, correcting for the degree of
agreement that would be expected if the raters were simply guessing by chance [28]. Since Cohen’s Kappa is a
standardized statistic it is particularly useful for comparing performance between algorithms (and studies) [28].
                Physician
                II      III
AI      II      19        5
        III     16       10

No Information Rate (NIR): 0.70
Unbalanced Accuracy (Acc): 0.58
Cohen’s Kappa: 0.18
Sensitivity: 0.5429
Specificity: 0.6667
Positive Predictive Value: 0.7917
Negative Predictive Value: 0.3846

Figure 5-4: Overview of HMM based classifier performance
problem is easily resolved by simply collecting more data – something which is currently still in progress
as a result of the activity tracker update made to Medly as part of this research.
Another likely explanation for the low performance is that the 6-hour resolution step data is significantly less nuanced than the per-minute resolution data. Measured by number of data points alone, the 6-hour resolution step data contains 360 times (more than two orders of magnitude) fewer data points than the per-minute resolution data. It is likely that this lower resolution data yielded coarser and less nuanced models (due to the reduced data stream size) that did not necessarily take full advantage of the modelling capabilities of
HMMs. These coarse models may not have been sufficiently differentiated to really allow for accurate
discrimination between different NYHA classes. In a similar vein, it is possible that binning the per-
minute data over 6 hours resulted in the washing out of many of the important nuances in the data that
might in fact be the key to discriminating between patients in the different NYHA classes.
Compare for example Figure 5-5 and Figure 5-6, respectively the per-6-hour and per-minute step count data for the same patient. Observe, in Figure 5-5 at the 6-hour resolution, that for days 12 and 13 the step count pattern is visually similar, with only a small variation in the overall step count. One might be led to conclude from these similarities that the patient perhaps had a slightly more intense workout session, or perhaps a little longer walk, near the middle of day 12 compared to day 13, but that the underlying activity pattern remained essentially the same. Visualization of the underlying data in Figure 5-6 quickly dispels this notion. The activity near the middle of the day on day 12 is best characterized as isolated but extended high-intensity physical activity, in contrast to day 13, where the activity is better characterized as punctuated, frequent, low-duration, low-intensity activity. The former might be proposed to be characteristic of NYHA class II activity, with the latter being more characteristic of a patient experiencing NYHA class III symptoms, but where one might be able to assess this difference based on the per-minute data, it is clearly harder to distinguish between the two activity patterns on the basis of the 6-hour aggregate data alone.
Figure 5-5: Example patient step count data (per 6 hour resolution)
Figure 5-6: Example patient step count data (per minute resolution)
In any case, it is clear that unlocking the potential in the per-minute resolution data is highly preferable
to being stuck with using low resolution data.
Analysis of Potential Root Cause
This brings us back to the question of why we were unable to get the HMM training algorithm to work with the per-minute resolution data in the first place. As mentioned, although we tried various initialization parameters, ultimately the solution was to aggregate the data. We hypothesize that the root cause may simply be the fact that most of the per-minute step count values in any given day are simply zero25, and furthermore, that these 0 values, although sometimes briefly interspersed between long periods of activity, more often exist as long uninterrupted sequences. These sequences occur not only in the mornings and evenings - such as when a person is sleeping - but also at random intervals during the middle of the day - for example, when a person might simply be inactive - see, for example, days 3, 5, 8, 11, 12, and 13 in Figure 5-6.
Recall that HMMs are stochastic models; in other words, the underlying models they use to represent a process are constrained by the rules of probability. There is, therefore, some expectation of inherent variance in the training data, which the training algorithm must capitalize on to start formulating a model of the underlying process. The presence of low (or no) variance sequences may therefore present a real problem to training.
For example, take a very long uninterrupted sequence of identical values, like a string of 0's. Depending on the length of the sequence and the expected nature of the distribution, it may in fact be considered statistically impossible. The probability of a given sequence being produced by some Markov model can be calculated using the forward algorithm, which relies on the chain rule: namely, the probability of a chain of events $E_n$ to $E_1$ can be calculated as the probability of event $E_n$ occurring, given that the sequence $E_{n-1}$ to $E_1$ has occurred, multiplied by the probability of the sequence $E_{n-1}$ to $E_1$ having occurred:

$$P(E_n, \ldots, E_1) = P(E_n \mid E_{n-1}, \ldots, E_1) \cdot P(E_{n-1}, \ldots, E_1) \qquad (2)$$
The probability of the sequence $E_{n-1}$ to $E_1$ having occurred can be recursively calculated using the same formula, continuously chaining (thus lending the rule its name) the conditional probability of the new event in question, $E_{n-1}$, on all the prior events in the sequence. In the case of a produced sequence $S_{\text{repeat}}$ of length $n$, composed of the same repeated event, which is known to occur with some probability $p$, Equation 2 simplifies to the following:

$$P(S_{\text{repeat}}) = p^n \qquad (3)$$

25 Recall that for our dataset, more than 75% of the per-minute step count values for any given patient are 0 (as measured over their whole two-week monitoring period).
An oft-quoted value for the threshold of statistical impossibility is $10^{-50}$, but the exact cut-off is rather arbitrary [229]. Since our objective is not to provide a rigorous proof of our hypothesis but rather to suggest a theory to future researchers interested in tackling this problem, $10^{-50}$ is a reasonable choice of threshold. The choice of probability $p$ is, by extension, also somewhat arbitrary. Suppose, for simplicity's sake, that since step count ranges from approximately 0 to approximately 125 in our patients, the probability of a 0 step count value lies around $1/125 \approx 1/100 = 10^{-2}$. Since a more conservative reader might prefer we use the actual probability of 0 step counts in our sample - approximately 75% of the dataset - and say that $p$ should be closer to $0.75 = 3/4 = 10^{\log(3/4)/\log(10)} \approx 10^{-0.12}$, we also perform the calculation with this value for comparison. The overall conclusion remains the same.
Assuming a rest period of approximately 8 hours (which occurs fairly consistently, once a day), the corresponding sequence length of $n = 480$ minutes has an associated probability of:

$$P(S_{\text{repeat, 8 hours}}) = 10^{-2n} = 10^{-2 \cdot 480} = 10^{-960}$$

or, conservatively:

$$P(S_{\text{repeat, 8 hours}})\big|_{\text{conservative}} = 10^{-0.12n} = 10^{-57.6}$$

Whether by the more conservative estimate or not, these probabilities are well beyond (i.e. far smaller than) the statistical impossibility threshold of $10^{-50}$.
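For readers who want to check this arithmetic, the calculation is easy to reproduce on the log10 scale (the raw probabilities are far too small to represent as ordinary floating-point numbers). The snippet below is purely illustrative and is not code from the thesis.

```r
# log10 of Equation 3: the probability of a run of n identical per-minute values,
# each occurring with probability p, is p^n, so log10(P) = n * log10(p).
seq_prob_log10 <- function(p, n_minutes) n_minutes * log10(p)

seq_prob_log10(p = 1e-2,     n_minutes = 480)  # -960:  the 8-hour estimate above
seq_prob_log10(p = 10^-0.12, n_minutes = 480)  # -57.6: the conservative 8-hour estimate
```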
All this is not to say that these sequences are impossible - they quite clearly are not - however, from the perspective of the Markov model, and from the hidden Markov model attempting to guess at the underlying hidden model, such sequences are considered very unlikely26, and therefore not likely to be interpreted as regular parts of the sequence, although they actually are. Even a one-hour period is found to be highly, although relatively less, unlikely: $P(S_{\text{repeat, 1 hour}}) = 10^{-120}$ and $P(S_{\text{repeat, 1 hour}})\big|_{\text{conservative}} = 10^{-7.2}$.

26 For a 6-hour sequence, $P(S_{\text{repeat, 6 hours}}) = 10^{-720}$ and $P(S_{\text{repeat, 6 hours}})\big|_{\text{conservative}} = 10^{-43.2}$. For a 4-hour sequence, $P(S_{\text{repeat, 4 hours}})\big|_{\text{conservative}} = 10^{-28.8}$. Although these conservative values do not fall below the $10^{-50}$ threshold, they are still tremendously small, and the sequences thus less, but still very, unlikely.
Of course, any given long predetermined sequence of variables being produced by a Markov model will
have a low associated probability. So why do we feel that we can make this special claim about a string of
0’s, or a string of identical values generally? Because of the key fact that the values in the series are
identical.
Take an arbitrary sequence of two or more alternating values of the same length $n$ as the sequence $S_{\text{repeat}}$ above. It would have the same probability as calculated above; however, such a sequence is unlikely to represent a problem to an HMM. Why? Because the different values are easily associated with different underlying states. With a single unchanging value, however, it becomes impossible to determine which value belongs to a particular state: is a single state producing the sequence and we have yet to transition to another state (what our probability calculations above actually represent), or are all states producing this same value - in which case what makes them different states, except perhaps their transition probabilities? But how does one determine the transition probabilities of the underlying states if the emitted symbols observed from the states are identical? We believe that, ultimately, the intractability of these questions may explain why the HMM training algorithm has difficulty converging to a valid model, and why decreasing the resolution - which reduces the length of identical-value sequences, but also generally increases the variance in the sequences, making possible states more differentiable - resolves the training problem.
Proposed Solution 1: Dithering
It would actually be very easy to test this hypothesis by using a signal processing technique known as dithering. Dithering is the act of introducing dither, that is, very low-amplitude random noise intentionally introduced into a system to improve its performance [230]. It was famously found to have the curious effect of improving navigation and ordnance trajectory calculations performed on aircraft-based mechanical computers during the Second World War, as a result of the aircraft-induced vibrations, which smoothed out the operation of the moving mechanical parts [231]. Since then, it has been successfully used to improve performance in applications as diverse as analog-to-digital conversion in microelectronics [232] and trading on stock exchanges (where it is used to reduce high-frequency trading, an oft-maligned trading practice) [233]. More commonly though, it is used to increase the visual quality of low resolution images [234,235] - an excellent example of which has been reproduced from Wikipedia [236] in Figure 5-7. Compare in particular sub-figures: 1, the raw image; 2, a lower resolution version of the same image; and 3, the low resolution image dithered using a classic image dithering algorithm [234]. Note in particular that image 3, despite having the same resolution as image 2, approaches the visual fidelity of image 1. We propose that, in an analogous way, careful application of dithering to the step count signal might counterintuitively improve our ability to train an HMMBC with high resolution data. A small amount of noise would at least eliminate the impossibly long uniform sequences in the data, and provide the necessary variance required for the HMM training algorithm to perform as intended, while simultaneously not meaningfully degrading the overall quality of the step count data stream.

Figure 5-7: Dithering as applied to a cat photo. Reproduced from Wikipedia [236].
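This hypothesis would be straightforward to test in code. The sketch below is only an illustration of the idea (not code from this thesis): the dither function, the uniform noise amplitude of 0.5 steps/minute, and the synthetic day of data are all arbitrary assumptions chosen to show how dithering removes long runs of identical values without materially altering the signal.

```r
set.seed(42)

# Additive uniform dither on [0, amplitude): keeps step counts non-negative while
# breaking up the long runs of identical (zero) values discussed above.
dither <- function(step_counts, amplitude = 0.5) {
  step_counts + runif(length(step_counts), min = 0, max = amplitude)
}

# Hypothetical day of per-minute counts: long blocks of zeros surrounding one walk.
raw_day      <- c(rep(0, 400), rpois(40, lambda = 60), rep(0, 1000))
dithered_day <- dither(raw_day)

max(rle(raw_day)$lengths)        # longest run of identical values in the raw signal (1000 minutes)
max(rle(dithered_day)$lengths)   # after dithering, runs of identical values essentially vanish
```

A natural next step would be to feed the dithered series back into the HMM training routine and check whether it now converges at the per-minute resolution.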
Proposed Solution 2: Activity Segmentation
An alternative to dithering is to do away with the inactive sequences altogether, ignoring all the long periods of 0 per-minute step counts, and instead training a HMMBC to use activity segments as opposed to the complete raw daily signal. Unfortunately, this alternative, although conceptually simpler, is likely harder to put into practice and test than dithering. Dithering can be fairly easily tested by adding various different types and magnitudes of random noise to the high-resolution test signal and seeing if the HMM training algorithm can successfully converge. Training on activities, however, first requires determining what should constitute an activity segment, i.e. where it should begin, but also where it ends, including how many (if any) inactive minutes should be allowed in the middle of the activity (in case of missed readings, brief pauses, etc.). Additionally, it likely requires the development of some sort of automated or quasi-automated data segmentation algorithm, not only for the case where the HMMBC might be implemented in practice as part of, say, a remote patient monitoring system, but also to help consistently and
accurately segment the relatively large volume of data that would be required to train and improve such a classifier. Activity segmentation therefore likely involves first investigating in more detail the finer characteristics of the per-minute step count data stream generally. Although the task of activity classification, at least for healthy subjects, is already a very active area of research, the data used is typically raw accelerometry data as opposed to per-minute step count data.
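As a starting point for such a segmentation algorithm, the sketch below shows one very simple, rule-based approach (not code from this thesis): a minute is 'active' if it contains any steps, short inactive gaps inside a bout are closed, and only bouts longer than a minimum duration are kept. The thresholds (min_steps, max_gap, min_duration) and the synthetic data are illustrative assumptions only.

```r
# Per-minute step counts for a hypothetical stretch of a day: two walking bouts
# separated by a long inactive block, with a 2-minute pause inside the second bout.
set.seed(7)
steps <- c(rep(0, 60), rpois(20, 40), rep(0, 180),
           rpois(10, 50), rep(0, 2), rpois(15, 45), rep(0, 120))

segment_activities <- function(steps, min_steps = 1, max_gap = 2, min_duration = 5) {
  active <- steps >= min_steps
  # Close inactive gaps of up to max_gap minutes so brief pauses stay inside a bout.
  runs <- rle(active)
  runs$values[!runs$values & runs$lengths <= max_gap] <- TRUE
  active <- inverse.rle(runs)
  # Report the start/end minute of each remaining active run of sufficient length.
  runs   <- rle(active)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  keep   <- runs$values & runs$lengths >= min_duration
  data.frame(start_minute = starts[keep], end_minute = ends[keep],
             duration_minutes = runs$lengths[keep])
}

segment_activities(steps)   # one row per detected activity bout
```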
Although more challenging, training on separate activity segments might provide significant additional
secondary benefits not attainable through simple dithering. For example, assessing patients using smaller periods of activity, as opposed to an entire day's or week's worth of data, might reduce assessment latency,
thereby improving response time for any application that depends on assessments provided through an
activity tracker. Alternatively, it might provide additional insight into the specific physical exercise
routines of patients which might enable the provision of timely and relevant feedback to patients
regarding this aspect of their HF self-management.
Both dithering and activity segmentation have their relative advantages and disadvantages as possible
solutions to resolving the training challenges encountered with the HMMBC when compared with simply
reducing the temporal resolution of the input data. Ultimately though, since both dithering and activity
segmentation each represent very different but complementary approaches to the training challenge, they
are likely both worth investigating in their own right.
Summary
To summarize, in this chapter we discussed a proposed method for building a hidden Markov model
based machine learning classifier and the results of implementing and testing said classifier. We chose to
use hidden Markov models, which are a tool for modeling a system as a stochastic process, because we
hypothesized that these might be able to fully embrace the complexity and nuance of the entire time
series data streams produced by the activity trackers worn by patients in free-living conditions. We
detailed the architecture of the model, which used two hidden Markov models, one each to model the
activity patterns of patients with NYHA class II and III symptoms. Instead of using the new 44 person
dataset collected from the activity tracker monitoring system detailed in Chapter 4, we opted to use the
same 50 person dataset investigated in Chapter 3, primarily because there was more data available for us
to use to train machine learning classifiers. Since the 50 person dataset does not also have heart rate data,
the only time series input provided to the hidden Markov model was patient step count data.
Unfortunately, we encountered difficulties in getting the hidden Markov model training algorithm to converge using the per-minute step count data, which we were ultimately able to resolve by converting the data to a coarser 6-hour temporal resolution. Regrettably, using lower resolution data contradicted our entire rationale for using hidden Markov models in the first place: attempting to use the entire unadulterated time series data stream. Furthermore, although the hidden Markov model based classifier we did train using the per-6-hour step count data was able to classify patients, the classifier did not perform any better than a naive classifier that simply assigns every patient to the majority class (58% unbalanced accuracy for the HMMBC vs. the 70% no-information rate). The Cohen's Kappa statistic (0.18) confirmed the poor agreement between the physician-assigned NYHA class and that assigned by the hidden Markov model based classifier. Of note, since the performance of our classifier was evaluated on the exact same data used to train said classifier, the performance reported above should also be interpreted as being highly optimistic compared to the real expected performance of the classifier on new data it hasn't seen before.
Although our initial attempts to use a hidden Markov model based classifier were met with some
significant setbacks, we don’t believe that this means that the approach does not have value, but rather,
that it might require more dedicated attention to get such an approach to work. We posited a possible
theory for why the training algorithm has difficulty creating hidden Markov models of the step count
data, namely that the presence of long low variance sequences of identical step count values makes it
impossible for the training algorithm to determine the transitions between states. In response we proposed
two possible approaches which might be investigated as part of future work: 1) dithering, that is,
intentionally applying low-amplitude random noise to the time series step count data, thereby artificially
introducing variance into the low variance sequences (which might allow the hidden Markov model
training algorithm to function properly while not meaningfully degrading the overall performance of the
system), and 2) doing away with the inactive sequences altogether and approaching the task of NYHA
class assessment from the perspective of individual periods of activity as opposed to attempting to classify
the whole free-living time series data in one fell swoop.
Ultimately, we opted to take a third approach for the purpose of this thesis and put the hidden Markov
model based classifier to the side and instead investigate the effectiveness of some other more classic
approaches to supervised classification, which we discuss in the next chapter.
Chapter 6 - Assessment of NYHA Functional Classification Using Cross-sectional Machine Learning Models
As mentioned in the introduction of the previous chapter, we set out to attempt to objectively assess the NYHA functional classification of some example patients using modern machine learning (ML) algorithms. Having discussed our unsuccessful attempt to build a useful hidden Markov model based classifier, we decided to investigate some cross-sectional machine learning algorithms that are popular starting points for supervised classification problems: Generalized Linear Models (GLM) and a variant thereof, boosted GLMs; Random Forests (RF); and Artificial Neural Networks (NNet) and a variant thereof, Principal Component Analysis Neural Networks (PCA NNet).
In this chapter we first provide a brief refresher on the above ML techniques. The curious reader is
invited to consult T. Segaran’s book, “Programming Collective Intelligence: Building Smart Web 2.0
Applications” [111], for a more thorough introduction to these and other popular ML algorithms. We then
proceed to explain our methodology for training and testing the ML models investigated and finally, we
discuss the results of our investigation and detail some possible future directions to take this research.
Machine Learning Models
What follows is a very brief introduction to the cross-sectional machine learning models investigated in
this chapter, in order of relative algorithm complexity.
6.1.1 Generalized Linear Models
The generalized linear model, or GLM, is, unsurprisingly, a generalized version of classic linear regression [237,238].

Recall that the idea behind ordinary linear regression is that we can model some randomly distributed response variable $y$ as a linear combination of predictors $X = \{x_1, x_2, \ldots, x_n\}$, subject to some noise/error represented as the error term $\varepsilon$. If we define $B = \{\beta_0, \beta_1, \beta_2, \ldots, \beta_n\}$ as the regression parameters, with $\beta_0$ being the intercept term, we can express the relationship formally as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon \qquad (4)$$
This equation, which defines linear regression, can be decomposed into two parts: 1) a linear part and 2) a random error part. The linear part, $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$, tells us that there is some expected value for $y$ conditional on the value of $x$: $E(y|x)$. The error term then tells us that there is some random error or variance about this expected value; in classic linear regression this error is specifically assumed to be normally distributed with some constant variance, $\sigma^2$. If we call the expectation value $E(y|x)$ the mean as a function of $x$, $\mu(x)$, of the normal distribution for $y$, we could alternatively represent Equation 4 as:

$$y \sim N(\mu(x), \sigma^2) \qquad (5)$$
The generalized27 linear model asks us: what if the relationship between $y$ and $x$ were not normally distributed but were instead modelled by some other distribution? Specifically, what if we could use any distribution within the wider family of exponential distributions, of which the normal distribution is just one example (see Figure 6-1 for more examples)? To effect this change, thus generalizing the linear model, we need to modify the way we link together the expectation value, $E(y|x)$, produced by the linear predictors, and the mean value, $\mu(x)$, of our error distribution. That is, instead of defining the link between $E(y|x)$ and $\mu(x)$ as:

$$E(y|x) = \mu(x) \qquad (6)$$

we would first generalize the relationship, expressing $E(y|x)$ as a link function $g$ of $\mu(x)$:

$$E(y|x) = g(\mu(x)) \qquad (7)$$

The link function for the normal distribution is then simply the identity, $g(a) = a$. The link function must always be smooth, invertible, and linearizing, and is changed according to the desired noise distribution. A list of common link functions, and their inverses, can be found in most basic texts on GLMs, for example [237].

27 N.B. not the 'general' linear model, which is just a special case of the GLM, namely the one expressed in Equation 4.

Figure 6-1: Examples of distributions in the family of exponential distributions (* indicates the distribution belongs in the family only when certain parameters are fixed). Adapted from [290].
A model can then be fit using maximum likelihood estimation [237,238]. The end result of this entire process is that we gain a fairly simple yet powerful and versatile method of modelling a wide variety of processes. As a result, although oft forgotten, GLMs usually make a great first choice to try before moving on to more sophisticated ML models.
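To make the idea concrete, the toy sketch below fits a binomial GLM (i.e. logistic regression, the GLM with a logit link) to a small, entirely hypothetical data frame whose feature names and values are invented for illustration; it is not the model trained in this thesis.

```r
# Hypothetical 50-patient dataset: four made-up step-count/demographic features
# and a binary NYHA label, mimicking the 35/15 class split used in this work.
set.seed(1)
toy <- data.frame(
  mean_daily_steps = c(rnorm(35, 5000, 1500), rnorm(15, 3000, 1200)),
  max_pmsc         = c(rnorm(35, 90, 20),     rnorm(15, 60, 20)),
  sd_daily_steps   = c(rnorm(35, 1500, 400),  rnorm(15, 1000, 350)),
  age              = round(c(rnorm(35, 54, 14), rnorm(15, 56, 14))),
  nyha             = factor(c(rep("II", 35), rep("III", 15)))
)

# family = binomial(link = "logit"): the Bernoulli member of the exponential family,
# with the logit as the link function g() of Equation 7.
glm_fit <- glm(nyha ~ ., data = toy, family = binomial(link = "logit"))
summary(glm_fit)

# Predicted probability of NYHA class III (the second factor level) for new patients.
predict(glm_fit, newdata = toy[1:3, ], type = "response")
```

The later sketches in this chapter reuse this hypothetical `toy` data frame so that the examples stay short.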
6.1.2 Boosted Generalized Linear Models
Boosting, or rather gradient boosting, is an ensemble learning technique [239,240]. Instead of using one single strong predictive model, the idea behind gradient boosting is to use an ensemble of weakly performant models that build on each other, learning from the mistakes of previous models, to create a final model that is more accurate than any single (strong or weak) constituent model. Although boosting can produce more performant classifiers overall, it must be carefully managed to prevent overfitting the model, that is, training the model to be too good at predicting the training data at the expense of making the model generalizable to data it has never seen before. The algorithm used to do gradient boosting is fairly complex and well out of the scope of this thesis. The algorithm, however, supports a range of possible underlying ML models [240], and a boosted GLM is one that specifically uses generalized linear models as the underlying ML model.
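A hedged sketch of the boosted-GLM variant follows, reusing the hypothetical `toy` data frame from the GLM example above. It assumes caret's "glmboost" method (provided by the mboost package) is installed; it is illustrative only and not the thesis training pipeline.

```r
# Gradient-boosted GLM via caret; the number of boosting iterations is the main
# hyper-parameter, chosen here by 10-fold cross-validation to limit overfitting.
library(caret)
set.seed(1)
bglm_fit <- train(nyha ~ ., data = toy, method = "glmboost",
                  trControl = trainControl(method = "cv", number = 10))
bglm_fit$bestTune   # the boosting settings selected by cross-validation
```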
6.1.3 Random Forest
The second type of ensemble learning technique is known as bagging. Bagging forms a core part of
Random Forests (RF). The best place however to start discussing random forests is with decision trees.
A decision tree is simply a branching set of
rules, or boundary cut-points, that separate a
feature space into various partitions, each of which is associated with some sort of
classification or decision outcome [111]. A
very simple example is shown in Figure 6-2.
In this example, the decision tree is used to
classify the three different colors of data
points (green, orange, and purple) according
to two arbitrary features, A & B, associated
with the data points. Note that due to the
placement of the boundaries, some of the
dots are misclassified.
One of the simple approaches to training a
decision tree is to start from the top of the
tree (the root) and go down, selecting several
candidate boundary cut-point that divide the
dataset, and then computing how well the
data is split by each boundary [111]. For
example, one could use the Gini impurity (a measure of diversity in the dataset), or the measure of
information gain (reduction in entropy) that results from the split. One then selects the best candidate
boundary and repeats this process down each new branch. As with all ML algorithms, one must be wary
of over-fitting the learner. In the case of decision trees this is especially true as the complexity or even
just the number of the boundaries used increases. Even with just the use of linear boundaries a decision
tree can get very precise as the tree gets deeper and larger, with more branches and leaves to cut the
feature space into smaller and smaller more ultra-specific partitions. As a result, many decision tree
creation algorithms feature a way to stop growing the tree – usually by setting a hard limit on the depth -
or to prune the tree after growth - to remove unnecessary, unhelpful or weak branches – all to help avoid
overfitting.
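As a small illustration of these ideas, the sketch below grows a single depth-limited decision tree with the rpart package on the hypothetical `toy` data frame introduced earlier; the depth limit of 3 is an arbitrary illustrative choice, and this is not a tree used in the thesis.

```r
# A single classification tree with a hard depth limit, one of the over-fitting
# safeguards described above. Printing the fit shows the learned boundary cut-points.
library(rpart)
tree_fit <- rpart(nyha ~ ., data = toy, method = "class",
                  control = rpart.control(maxdepth = 3))
print(tree_fit)
```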
Decision trees are hugely useful since they are interpretable; in other words, a human can look at a decision tree and understand the decisions being made. This is why decision trees, albeit expert-trained ones, are often popular for use in expert systems where the decision process may need to be inspected - the Medly algorithm in fact uses an expert-trained decision tree for triaging patients [104].

Figure 6-2: Example of a decision tree (above) with corresponding feature space (below).
Despite all this, ML decision trees are still often highly sensitive to the input training data and have a
tendency to over-fit and not generalize well to new data. One solution to this problem is the use of the
ensemble learning technique of bagging (the counter-point to boosting). Bagging, in a similar fashion to
boosting, uses an ensemble of learners to improve learner performance, but whereas in boosting the
learners build on each other sequentially, in bagging one trains several independent learners – in this case,
multiple independently trained decision trees – and aggregates their responses. Each tree (learner) in the
forest (ensemble) produces its own separate prediction using the input predictor data and the resulting
ensemble of independent predictions are combined, for example using a majority voting scheme, to
produce the overall final prediction. The aptly named random forest is a variation on tree bagging
whereby a random subset of features is used to train individual trees in the forest, as opposed to the
entire feature space being provided to each tree. This reduces the likelihood of having highly correlated
trees, while retaining the random forest's beneficial properties, such as its ability to naturally perform feature selection: the most predictive features will tend to feature more prominently as part of the random forest, whereas less important features will tend to be more sparsely distributed and
therefore be less heavily weighted as part of the forest.
All in all, the effect of bagging together decision trees into a random forest creates a ML model that has additional useful emergent properties (e.g. natural feature selection), can better generalize to new data, and yet maintains many of the inherent advantages of the underlying decision trees. Because of their simplicity and ease of use, RFs are therefore often (along with GLMs) a good early candidate for ML tasks [111].
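A minimal sketch of a random forest on the same hypothetical `toy` data frame follows; the randomForest package used here is also what caret's method = "rf" wraps, and the number of trees is an illustrative choice rather than a thesis setting.

```r
# Bagged, feature-subsampled decision trees; the out-of-bag confusion matrix and
# the built-in importance scores illustrate the emergent feature-selection property.
library(randomForest)
set.seed(1)
rf_fit <- randomForest(nyha ~ ., data = toy, ntree = 500, importance = TRUE)
rf_fit$confusion     # out-of-bag confusion matrix
importance(rf_fit)   # per-feature importance measures
```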
6.1.4 Artificial Neural Networks
In contrast, Neural Networks (NNet), or as they are more formally termed, artificial neural networks, are far on the other end of the complexity spectrum - they are the bazooka to the RF and GLM pea-shooters. The use of NNets for the sole purpose of assessing NYHA class is therefore likely overkill since, as previously discussed, less complex models are likely to actually perform better, owing to their simplicity, in the face of limited data. However, in the context of assessing NYHA class as part of a remote monitoring system, NNets have an interesting property that makes them particularly worth investigating. NNets
support what is known as online learning, which means that the trained model can be progressively and continuously updated and improved as more and more data becomes available, without needing to retrain the entire model from scratch. This is a particularly useful property within the context of a remote patient monitoring system, where new data becomes available each and every day. While the specific NNet investigated as part of this work may not necessarily be immediately transferable to the task of daily assessment of NYHA class, an initial foray into training NNets with activity monitoring data is likely to provide useful insights for future work.
The fundamental building block of the NNet is the perceptron. It is
a digital neuron and operates in an analogous fashion: summing its
input signals which it then converts to an output signal using some
predefined thresholding function. An example is shown in Figure 6-3.
A NNet is built by creating a weighted directed network of
perceptrons, as shown in Figure 6-4 (for clarity, the inter-perceptron
weights are not shown). The network is arranged in a
layered fashion and these layers are logically divided into
three types depending on their function. At the front of
the NNet is the input layer, which connects each input feature to a perceptron. The input layer of the NNet
shown in Figure 6-4 as an example, would be suitable for
use with 4 input predictors or features. The input layer
acts as the interface, connecting the input features to the
first layer of perceptrons in the next set of layers in the
network: the hidden layers.
The hidden layers of the NNet are the innermost layers
and form the bulk of the network. They are where the NNet learns the various complex
relationships and patterns in the data. Unfortunately, the nature of this method of learning is that NNets
typically remain black-boxes and it is never quite clear how or what relationships the NNet has learned
from the data. The number of hidden layers and the number of nodes in each layer can be altered to
make a deeper and wider NNet capable of learning more complicated relationships. While NNets can
theoretically be made arbitrarily large, training large NNets is computationally expensive and therefore
limited by the computational power available. Although NNets have existed since the 1950s, it is only due
to modern advances in computing that training large multi-layered NNets, known as deep neural
networks [241], has recently become feasible [111,124,242,243]. The success of deep neural nets at tackling
complex problems is generally credited as a cause for the recent popular resurgence in AI research [242].
Figure 6-3: A perceptron
Figure 6-4: A neural network
Once the hidden layers, regardless of depth, have processed the input data, the data is picked up by the
output layer.
The purpose of the output layer is simply to extract data from the hidden network and convert it to a final output prediction. The output layer of the NNet in Figure 6-4, for example, produces 3 output predictions.
Training the hidden network is commonly performed using what is known as the backpropagation algorithm, which is also what enables online learning for the NNet. Essentially, new data is provided, one example at a time, to the input layer of the NNet. Examining the output values produced at the output layer of the network, one can then determine how far off the current NNet prediction is, and then work backwards through the network, making minor adjustments to the weights of the links to slowly push the output in the correct direction. The degree of tweaking, the learning rate, is carefully controlled to make sure that the NNet is neither underfit (insufficiently trained) nor overfit (i.e. that the NNet does not overshoot or overgeneralize from individual data points). The finer details of the backpropagation algorithm are outside the scope of this thesis, but the interested reader is invited to refer to either [111] or [241] for further reading.
Overall, NNets are complex but very powerful ML algorithms that have been successfully used to learn the relationships present in challenging, highly non-linear data and systems. They also support continuous incremental learning, which may be particularly useful in the context of remote patient monitoring.
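For a concrete, if toy-scale, example, the sketch below trains a single-hidden-layer network with the nnet package (the same implementation caret's method = "nnet" wraps) on the hypothetical `toy` data frame from earlier; the 5 hidden units and the weight-decay value are arbitrary illustrative choices.

```r
# A small feed-forward network: 4 inputs -> 5 hidden perceptrons -> 1 output unit.
library(nnet)
set.seed(1)
nn_fit <- nnet(nyha ~ ., data = toy, size = 5, decay = 0.1, maxit = 200, trace = FALSE)

# Predicted NYHA class for the first few hypothetical patients.
predict(nn_fit, toy[1:5, ], type = "class")
```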
6.1.5 Principal Component Analysis Artificial Neural Networks
Aside from computational cost, NNets also have the drawback of typically requiring a lot of data to train well. The latter point is particularly challenging given our small dataset. One of the ways to make more effective use of a data-limited but feature-rich dataset is to perform dimensionality reduction on the feature set prior to presenting it to a ML algorithm [244–247]. Dimensionality reduction is related to the concept of feature selection and extraction. In both cases, a large set of features is reduced to a smaller, more concise set of principal features that encodes, as much and as accurately as possible, all the information originally contained in the large feature set [247].
Principal Component Analysis (PCA) is a standard and hugely popular technique for performing dimensionality reduction [248]. In PCA, the larger $m$-dimensional feature set is projected onto the best $n$-dimensional orthogonal subspace of $m$ (where $n < m$) in such a way that the greatest variance in the projected data comes to lie on the lowest (first) order coordinate axis (of the $n$-dimensional subspace), with successively lower variance data being reserved for successively higher order coordinate axes28 [244,248]. In this way PCA trims out the features (dimensions) that provide the least new information, either because the information is already accounted for as part of another correlated feature, or because the feature has low variance and therefore provides little additional information to consider. The interested reader can find a more complete mathematical treatment of the algorithm in [248].
In theory, by applying PCA to our set of features before passing it to a NNet, the resulting PCA NNet should perform better, since the algorithm should be able to focus on learning the high-information patterns common to the limited dataset, while being less distracted by low-value features. Furthermore, the PCA NNet should be trainable at a reduced overall computational cost, since the reduced number of features will likely require a lower complexity NNet to model.
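The idea chains together naturally in caret, which can insert PCA as a pre-processing step ahead of the network. The sketch below reuses the hypothetical `toy` data frame; the 95% variance-retention threshold is an illustrative assumption, not necessarily the setting used in this work.

```r
# PCA is applied (after centering and scaling) inside each resampling loop, and the
# retained principal components are then fed to the neural network.
library(caret)
set.seed(1)
pca_nnet <- train(nyha ~ ., data = toy, method = "nnet",
                  preProcess = c("center", "scale", "pca"),
                  trControl = trainControl(method = "cv", number = 10,
                                           preProcOptions = list(thresh = 0.95)),
                  trace = FALSE)
pca_nnet$preProcess   # reports how many principal components were kept
```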
Methods
We chose to use the R programming language [151] in combination with RStudio [152], the open-source
integrated development environment for R, and various supporting R packages for the research work
documented in this chapter [153–158,217,249,250]. For the specific tasks of building, training, and
validating the ML models we used the caret (Classification And Regression Training) package for R
[251,252]. We also used the caret package for data pre-processing including normalization and imputation,
although we used the leaps package [253] for feature selection.
To simplify comparison between the sometimes-disparate models discussed in this chapter (as well as the hidden Markov model based classifier discussed in the previous chapter), we kept the methodology as consistent as possible between the different machine learning approaches. We also aligned our methodology as much as possible with current best practice for the creation and validation of supervised classification ML models.
6.2.1 Training Data
Dataset
We used the same data to develop and validate the cross-sectional algorithms that we used for the
hidden Markov model based classifier investigated in Chapter 5. This data, again, is the same data used
28 i.e. the second greatest variance lies on the second-order axis, the third greatest variance on the third-order axis, and so on, up to the least variant data, which resides on the final $n$-th order coordinate axis.
for the replication study discussed in Chapter 3. Recall that the dataset was selected primarily since it had the largest sample size of the available datasets, but also because it contained cardiopulmonary exercise testing data, permitting us to establish a helpful baseline performance (based on the gold-standard CPET) against which to evaluate the impact of step count data on our algorithm performance.
Population
Recall that the Chapter 3/Chapter 5 dataset included 50 patients, predominantly male (86 vs. 89 [%]),
aged 54 ± 14 vs. 56 ± 14 [years old], and overweight (BMI: 28.9 ± 6.4 vs. 29.6 ± 6.3 [kg/m2]), whose
demographics are fully detailed in Table 5 (page 38), Table 6 (page 38) and Table 7 (page 39). These
patients come from a closed (prospective) cohort of adult outpatients at a tertiary care clinic specializing
in the management of heart failure at a major hospital in Toronto, Canada. The exact inclusion and
exclusion criteria are detailed in Table 3 (page 37) and Table 4 (page 37) respectively.
Label Assignment
Again, recall that the patients in the dataset were originally classified at onboarding by their physician
as either NYHA functional class II (n=26) or III (n=11) - according to the criteria outlined in Section
2.2.1.1 - or as some intermediate/mixed class I/II (n=9) or II/III (n=4), as outlined in Section 5.2.1.3.
However, for the purposes of the ML classification task being investigated, patients assigned the
intermediate/mixed classes I/II were relabelled as NYHA class II patients, and patients assigned as class
II/III were relabeled as NYHA class III. This final dataset was therefore composed of only patients
labelled as NYHA class II (n=35=26+9) and NYHA class III (n=15=11+4).
6.2.2 Model Design
Predictors
In order to predict the outcome label, each of the machine learning models was fed with a series of
predictors (or features) built from available data in the dataset. Recall that the dataset consisted of the
following data:
1. Minute-by-minute step count data – recorded using a commercially available activity-tracker
(Fitbit Flex) continuously throughout the day. From which we extracted the same metrics
calculated and explored in Chapter 3, as listed in Table 18 below:
Table 18: Minute-by-minute step count features
Maximum
1 Maximum 2 Week PMSCa [steps/minute]
2 Maximum of Maximum DPMSCb [steps/minute]
3 Mean of Maximum DPMSCb [steps/minute]
4 Standard Deviation of Maximum DPMSCb [steps/minute]
5 Standard Error of Maximum DPMSCb [steps/minute]
6 Minimum of Maximum DPMSCb [steps/minute]
75th Percentile
7 Maximum of 75th Percentile of DPMSCb [steps/minute]
8 Mean of 75th Percentile of DPMSCb [steps/minute]
9 Standard Deviation of 75th Percentile of DPMSCb [steps/minute]
10 Standard Error of 75th Percentile of DPMSCb [steps/minute]
Mean
11 Mean 2 Week PMSCa [steps/minute]
12 Maximum of Mean DPMSCb [steps/minute]
13 Mean of Mean DPMSCb [steps/minute]
14 Standard Deviation of Mean DPMSCb [steps/minute]
15 Standard Error of Mean DPMSCb [steps/minute]
16 Minimum of Mean DPMSCb [steps/minute]
Standard Deviation
17 Standard Deviation of 2 Week PMSCa [steps/minute]
18 Maximum of DPMSCb Standard Deviation [steps/minute]
19 Mean of DPMSCb Standard Deviation [steps/minute]
20 Minimum of DPMSCb Standard Deviation [steps/minute]
Standard Error
21 Standard Error of 2 Week PMSCa [steps/minute]
22 Maximum of DPMSCb Standard Error [steps/minute]
23 Mean of DPMSCb Standard Error [steps/minute]
24 Minimum of DPMSCb Standard Error [steps/minute]
Total
25 Total 2 Week SCc [steps]
26 Maximum of Total DPMSCb [steps]
27 Mean of Total DPMSCb [steps]
28 Standard Deviation of Total DPMSCb [steps]
29 Standard Error of Total DPMSCb [steps]
30 Minimum of Total DPMSCb [steps]
IQR (Interquartile Range)
31 Maximum of DPMSCb IQRd [steps/minute]
32 Mean of DPMSCb IQRd [steps/minute]
33 Standard Deviation of DPMSCb IQRd [steps/minute]
34 Standard Error of DPMSCb IQRd [steps/minute]
Skewness
35 2 Week PMSCa Skewness
36 Maximum of Daily SCc Skewness
37 Mean of Daily SCc Skewness
38 Standard Deviation of Daily SCc Skewness
39 Standard Error of Daily SCc Skewness
40 Minimum of Daily SCc Skewness
Kurtosis
41 2 Week PMSCa Kurtosis
42 Maximum of Daily SCc Kurtosis
43 Mean of Daily SCc Kurtosis
44 Standard Deviation of Daily SCc Kurtosis
45 Standard Error of Daily SCc Kurtosis
46 Minimum of Daily SCc Kurtosis
a PMSC: Per-Minute Step Count; b DPMSC: Daily Per-Minute Step Count; c SC: Step Count; d IQR: Interquartile Range
2. Cardiopulmonary exercise testing data – administered by trained clinical staff as part of routine
care at the TGH Heart Function Clinic on the same day as recruitment (except for 4 patients
who received it prior to recruitment29). From this data we extracted the following features:
Table 19: Cardiopulmonary exercise testing data features

CPET Feature - Brief Description of Feature
1  CPET Duration [frac. min.] - duration of CPET in fractional minutes
2  CPET Max Watts [W] - max resistance achieved at end of CPET
3  % Predicted CPET Watts [%] - percentage of expected CPET Max Watts for patient
4  SBP, Resting [mmHg] - resting Systolic Blood Pressure before CPET
5  DBP, Resting [mmHg] - resting Diastolic Blood Pressure before CPET
6  HR, Resting [bpm] - resting Heart Rate before CPET
7  O2 Sat., Resting [%] - resting oxygen saturation before CPET
8  FEV, Resting [L] - resting Forced Expiratory Volume before CPET
9  % Predicted Resting FEV [%] - percentage of expected Forced Expiratory Volume achieved by patient during CPET
10 FVC, Resting - resting Forced Vital Capacity before CPET
11 % Predicted Resting FVC [%] - percentage of expected Forced Vital Capacity achieved by patient during CPET
12 SBP [mmHg] - Systolic Blood Pressure at end of CPET
13 DBP [mmHg] - Diastolic Blood Pressure at end of CPET
14 HR [bpm] - maximum Heart Rate at end of CPET
15 HR 1 min. Post Test [bpm] - Heart Rate 1 minute after end of CPET
16 HR Drop in 1 min. [bpm] - Heart Rate drop (recovery) 1 minute after end of CPET
17 O2 Saturation [%] - oxygen saturation at end of CPET
18 VO2 Peak (rel.) [ml/kg/min] - peak oxygen consumption during CPET relative to patient body weight
19 Predicted VO2 Peak (rel.) [ml/kg/min] - expected peak oxygen consumption for patient (relative to body weight) during CPET
20 % Predicted VO2 Peak (rel.) [%] - percentage of predicted peak oxygen consumption for patient (relative to body weight) achieved during CPET
21 VO2 Peak [L/min] - peak oxygen consumption during CPET (not corrected for patient body weight)
22 Predicted VO2 Peak [L/min] - expected peak oxygen consumption for patient during CPET
23 % Predicted VO2 Peak [%] - percentage of predicted peak oxygen consumption for patient achieved during CPET
24 Anaerobic Threshold [ml/kg/min] - patient's anaerobic threshold
25 AT as % Measured VO2 Peak [%] - Anaerobic Threshold as a percentage of the measured peak oxygen consumption of the patient (relative to their body weight)
26 AT as % Predicted VO2 Peak [%] - Anaerobic Threshold as a percentage of the predicted peak oxygen consumption of the patient
27 VE Peak [L] - peak minute VEntilation during CPET
28 VCO2 Peak [L] - peak CO2 expiration during CPET
29 VE/VCO2 Slope @ AT - slope of minute VEntilation to CO2 output at Anaerobic Threshold during CPET
30 VE/VCO2 Slope @ Peak - slope of minute VEntilation to CO2 output at CPET peak
31 RER Peak - peak Respiratory Exchange Ratio during CPET

29 Specifically, 1, 15, 20 and 22 days prior to recruitment.
3. Patient demographic/meta data – recorded as part of onboarding, specifically:
Table 20: Patient demographic data features
Feature
1 Sex [Male or Female]
2 Age [years]
3 Height [cm]
4 Weight [kg]
5 BMI (Body Mass Index) [kg/m2]
6 Handedness [left or right]
7 Wristband preference [left or right]
We tested three different variants of models using three different combinations of the above features:
a) The ‘CPET feature group’, to establish a baseline performance using only data available from
CPET tests. This feature set consisted of all the CPET features and the patient demographic
features, for a total of 38 features.
b) The ‘CPET + Step Data Metrics feature group’, to establish the additional benefit derived from
adding the basic step data features. This feature set consisted of all the CPET features, all the
step data features and the patient demographic features, for a total of 84 features.
c) The ‘Step Data Metrics only feature group’, to investigate the effectiveness of using only data
derived from an activity tracker. This feature set consisted of all the step data features and the
patient demographic features, for a total of 53 features.
Normalization
We normalized the input predictors as the first step in the training process for our cross-sectional ML classifiers: 1) to improve training speed, and 2) to ensure that each of the predictors was similarly weighted for consideration by the learning algorithm. Specifically, we shifted each predictor to be centered about its mean value and scaled the predictors by their corresponding standard deviations using the preProcess function in the caret R package.
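For illustration, this centering and scaling step looks roughly as follows with caret's preProcess function, here applied to the numeric columns of the hypothetical `toy` data frame used in the earlier sketches (the thesis applied it to the real feature sets described above).

```r
# Centre each predictor on its mean and scale it by its standard deviation, so that
# all predictors enter the learning algorithms on a comparable scale.
library(caret)
numeric_cols <- sapply(toy, is.numeric)
pp           <- preProcess(toy[, numeric_cols], method = c("center", "scale"))
toy_scaled   <- predict(pp, toy[, numeric_cols])

round(colMeans(toy_scaled), 10)   # each predictor now has mean ~0 ...
apply(toy_scaled, 2, sd)          # ... and standard deviation 1
```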
Treatment of Missing Data
Some of the CPET data was missing from the records of some patients. Since the algorithms used do not handle missing data by themselves, we removed patients with missing data from the supplied training data, only including the complete cases (those without missing data). However, because the aforementioned caret package's preProcess function also has the ability to perform data imputation, we also trained a variant of each model where the missing training data was imputed, to salvage as many of the otherwise incomplete cases in the dataset as possible. The preProcess function used a k-Nearest Neighbour algorithm (k was set to 5), which chooses an imputation value based on the k nearest neighbouring non-missing data points, as measured by their Euclidean (straight-line) distance from the missing data point [254].
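The imputation variant can be sketched in the same way; the artificially injected missing values below stand in for the genuinely missing CPET records, and k = 5 matches the setting stated above. Note that caret's knnImpute implementation also centres and scales the data as part of computing the neighbour distances.

```r
# Inject a couple of hypothetical missing values, then impute them with 5-nearest-
# neighbour imputation via preProcess.
toy_missing <- toy[, sapply(toy, is.numeric)]
toy_missing$max_pmsc[c(3, 17)] <- NA

pp_impute   <- preProcess(toy_missing, method = "knnImpute", k = 5)
toy_imputed <- predict(pp_impute, toy_missing)   # complete (and scaled) cases
```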
Feature Selection
Since we had such a large list of input predictors for each model (up to 84) we compared the impact of
performing feature selection on the input list of predictors that were being provided to the model training
function. The purpose of automated feature selection is to try to prevent the model from overfitting to the
data, thereby improving the ability of the classifier to generalize to new data. Traditional machine
learning heuristics dictate that, given our sample size of 50, the number of features used to train our algorithms should be somewhere around 5-10, but possibly up to 49 features, to prevent overfitting30. In view of this, we used an R package called leaps [253], which uses linear regression, to identify and separate out the single best combination of up to 10 features. We evaluated the best feature combination using the Bayes information criterion, usually abbreviated BIC [255], which is very similar to the more commonly used Akaike information criterion, usually abbreviated AIC. In both cases, models with lower values are preferred; however, the Bayes information criterion penalizes complex, feature-rich models more heavily and should therefore favor models that use fewer features. Based on the previously mentioned heuristics, models with fewer features are likely to be more appropriate given the limited size of our dataset.
Feature selection was done as a last step before generating the ML classifier models. Note also that the
feature selection was performed using only the data being made available for training the model and did
not include any of the validation data which would skew our estimation of the overall final classifier
performance.
All this said, in a similar fashion to the normalization and missing data treatment process, we also created
variant models where the pre-processing step was not applied, i.e. feature selection was not performed and
instead the whole unaltered list of input predictors was provided to the model for training.
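A hedged sketch of the leaps-based selection step follows, again on the hypothetical `toy` data frame; because leaps fits linear regressions, the class label is temporarily coded as a number, and the subset size is chosen by the lowest BIC. The thesis searched combinations of up to 10 features; the toy example only has 4 candidates.

```r
# Best-subset search with regsubsets(), scored by BIC.
library(leaps)
leaps_data <- transform(toy, nyha_num = as.integer(nyha))   # 1 = class II, 2 = class III

subsets        <- regsubsets(nyha_num ~ mean_daily_steps + max_pmsc + sd_daily_steps + age,
                             data = leaps_data, nvmax = 4)
subset_summary <- summary(subsets)

best_size <- which.min(subset_summary$bic)   # subset size with the lowest BIC
coef(subsets, best_size)                     # the selected features and their coefficients
```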
Model Generation
To actually generate and train the ML classifiers, we provided the appropriate set of preprocessed
features to the model training function of the R caret package. Instead of setting fixed hyper-parameters
for the models - e.g. maximum decision tree depth of 5 in the RFs, 4 hidden layers for the NNets, etc. -
we had the model training function perform a grid search of the model hyperparameters to identify the
optimal hyper-parameterization for each model, assessing the performance of each model using k-fold
cross-validation (CV).
30 Pre-hoc determination of the optimal number of features for a given data set size is unfortunately still very much a
matter of debate in the field. As a result, various researchers have developed and published various heuristics for the
task, which can sometimes greatly vary in their recommendations. Some of these heuristics include: having 10 data
points per model parameter/feature [283], having “3-5 independent cases per class and feature” [284] for training
stable albeit not necessarily ‘good’ models [125], or for a dataset of size 𝑛 about √𝑛 highly correlated features to
about 𝑛 − 1 features when said features are completely uncorrelated [285]. For our dataset this puts us at 5, 3-5, 7
(highly correlated) to 49 (uncorrelated) features.
k-fold CV is a technique used for performing training and testing/validation where it is undesirable for an
already small dataset to be further divided into proportionately smaller separate training, testing and
validation datasets, but where it is
still necessary to assess how well a
classifier is expected to perform on
data it has never seen before [256].
In k-fold CV, the original dataset
is instead first segmented into 𝑘,
typically approximately equally
sized, partitions termed folds.
Testing and training of a given
model is then performed 𝑘 times
such that each fold is used once as
part of a test set, with the remaining $k-1$ folds in each round being used to train a model for
evaluation on the test fold. The overall performance is then reported as the mean of the performance of
the models across the rounds. The process is shown visually in Figure 6-5.
In each case, we set the number of folds for the testing CV procedure to be the same as the number used for
the overall model CV procedure detailed in the next section.
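Putting the two previous ideas together, a caret training call of the kind described above looks roughly like the sketch below (hypothetical `toy` data, a random forest as the example learner, and an illustrative 10-fold setting): caret builds a hyper-parameter grid, evaluates each candidate by k-fold CV, and keeps the best-performing parameterization.

```r
# Grid search over hyper-parameters, with each grid point scored by 10-fold CV.
library(caret)
set.seed(1)
ctrl <- trainControl(method = "cv", number = 10)

rf_cv <- train(nyha ~ ., data = toy, method = "rf",
               trControl = ctrl, tuneLength = 3)   # caret generates the mtry grid

rf_cv$bestTune   # the hyper-parameters selected by cross-validation
rf_cv$results    # mean CV performance for each point on the grid
```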
6.2.3 Model Validation
Since a suitable external validation dataset was not available, we again performed CV using the
Chapter 3/Chapter 5 dataset to perform an internal validation of our ML classifiers and estimate the real-
world performance of our classifier against new, unseen data. Specifically, we validated the model using
both nested 10-fold CV and nested leave-one-out cross-validation (LOOCV). In other words, we cross-validated the overall pre-processing, feature selection and models, but nested within the evaluation of each model we used a further round of cross-validation (splitting out further training and test folds) to select the optimally hyper-parameterized model. LOOCV is a special case of k-fold CV where the number of folds, $k$, is set to be equal to the number of observations in the dataset. In other words, every training/test set split repeatedly leaves out one new data point for testing or validation and uses the rest for training. Before proceeding to a discussion of the rationale for using both 10-fold and leave-one-out cross-validation, we first define some important terms for assessing ML model performance.
Figure 6-5: k-fold cross-validation
On Bias and Variance
The bias of a machine learner is simply its error rate: i.e. how much or how little the algorithm errs in performing whatever task it is attempting to accomplish; it reflects the "erroneous assumptions in the model" [257]. Notably though, the bias is separate from the unavoidable or irreducible error of the
problem and only measures how distant the learner is from the ‘optimal’ overall error rate. For example,
if a system was trying to recognize speech from very noisy low quality audio streams where even humans
failed at the task 10% of the time, and a machine learning algorithm was able to recognize the speech
with an error rate of 15%, the bias of the algorithm would only be 5% since the gold-standard classifier for
this problem, the human ear, still erred 10% of the time due to the inherent nature of the problem [258].
In contrast the variance is how well, or rather how badly, the ML classifier generalizes to never before
seen data – i.e. how much the classifier errs due to ‘sensitivity to small fluctuations in the training set’
[257]. For example, if the same speech recognition classifier were provided with new test data (separate
from the data used to train it) and found to have a new error rate of 27%, the bias of the classifier would
still be 5% but the variance would be estimated at 12%, since the algorithm suffered an additional 12%
loss in performance in the face of the new test data. Knowing a classifier's bias and variance allows us to estimate how under-, over- or both under- & over-fit a given classifier may be; high bias being indicative of an under-fit classifier, high variance indicative of an over-fit classifier, and high bias & variance indicative of an under- and over-fit classifier [259,260]. By extension, most changes made to a ML classifier have an associated bias and variance trade-off, where an amelioration in one results in a deterioration of the other - e.g. decreasing bias (reducing under-fitting) typically results in increased variance (increased over-fitting); somewhere in the middle lies the optimal fit point where bias and variance are both minimized.
Rationale for multiple cross-validation
Returning to 10-fold and leave-one-out cross-validation: LOOCV is known to be the least
pessimistically biased estimator of model performance [256,261–265]. However it has been accused of
having “high [estimator] variance, leading to unreliable estimates (Efron 1983)” [263]. This accusation is
typically attributed to the cited paper by R. Kohavi, presumably citing alleged findings by B. Efron [266].
Efron, however, only elaborates on CV generally and does not appear to investigate or make any claims about the effect of higher k values on the variance of the estimate provided by the CV process. Kohavi's own research findings in fact also repudiate the claim of higher variance, as do the findings and simulations of a myriad of other investigators, who in fact suggest quite the opposite [261,264,265,267]. Only in special, highly
specific cases do simulations suggest that higher variance performance estimates result from LOOCV
[267]. The conclusion that LOOCV results in higher variance estimates therefore appears likely to simply be an erroneous intuitive over-generalization (dare we say overfitting) of the bias-variance trade-off, so ever-present in ML performance assessment, to the performance estimators themselves.
Our rationale for also performing 10-fold cross validation therefore is not to improve our estimate of
model performance - although in the event that both the 10-fold cross-validation and leave-one-out cross-
validation estimates are similar, we would have additional confirmation that the performance estimates
are in fact accurate. Rather our objective is in fact to measure the difference in the estimate of model
performance using different sized training datasets to roughly determine our location on the learning curve
of these algorithms and ascertain if collecting more training data is likely to provide improved model
performance. It may seem strange to do this using 10-fold cross validation since we have previously
mentioned that LOOCV is known to be a less biased estimator of model performance than lower k-fold
CV and we could simply perform LOOCV on an artificially reduced dataset. However, to do so we would
have to artificially reduce the dataset and arbitrarily throw away data we could otherwise use for some
useful purpose, namely testing, which is why we opt to use 10-fold CV vs. LOOCV. Furthermore, previous
simulations and experiments have demonstrated that in most datasets, even as small as 40 datapoints, 10-
fold cross-validation provides an estimate that is nearly as unbiased as LOOCV or at least within 7-9
percentage points of the LOOCV value [261,263,267].
Since performing nested 10-fold cross-validation on our dataset represents a large, nearly 15%, reduction
in available training data31, most of the performance delta above 7-9 percentage points is reasonably attributable
to the reduced training data in our already small dataset and can therefore be used to make a rough
approximation of our location on the learning curve (i.e. determine if we are still in the location of high
increase in performance for small increase in dataset size). Of course, if the performance delta is within 7-
9% points we unfortunately will not be able to approximate our location on the learning curve since we
will be unable to differentiate the bias delta due to using 10-fold CV vs LOOCV and the improvement
resulting from an increase in training data. However, in the unlikely event that the performance delta is
very low, i.e. both 10-fold and LOOCV converge to the same estimate, we can conclude that either
method is suitable for cross-validation of our algorithm given our sample size, and recommend that future
31 From 50 patients, nested leave-one-out results in 2 hold-outs for a total training set size of 48 patients. 10-fold
cross-validation results in a hold-out of 5 data points for validation, and a further 4.5 (on average) for the second
hold-out for model optimization leaving a total of 41.5 patients for training. (48 - 41.5) / 48 = 15%
120
work utilize 10-fold CV and take advantage of the associated decreased computational cost and simply use
the datapoints generated by this work to start plotting the learning curve.
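To make the nesting concrete, the following is a minimal sketch of how such a nested cross-validation loop might be arranged with the caret package; the data frame `dat`, its outcome column `nyha`, and caret's "glmboost" method (standing in for the boosted GLM) are assumptions for illustration only, not the exact pipeline used in this work.

# Sketch of nested LOOCV: the outer loop holds out one patient for performance
# estimation; the inner resampling (here also LOOCV) is used only for hyper-parameter tuning.
# For nested 10-fold CV, both loops would instead use trainControl(method = "cv", number = 10).
library(caret)

outer_preds <- character(nrow(dat))
for (i in seq_len(nrow(dat))) {
  train_i <- dat[-i, ]                                        # outer hold-out: patient i
  fit <- train(nyha ~ ., data = train_i, method = "glmboost",
               trControl = trainControl(method = "LOOCV"),    # inner loop: tunes hyper-parameters
               metric = "Kappa")
  outer_preds[i] <- as.character(predict(fit, newdata = dat[i, , drop = FALSE]))
}

# Outer-loop estimate of performance on unseen patients
confusionMatrix(factor(outer_preds, levels = levels(dat$nyha)), dat$nyha)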
Results and Discussion
Using the methodology detailed in the previous section we were able to successfully train GLMs,
boosted GLMs, RF, NNets and PCA NNets for each of the outlined feature groups: the CPET feature
group, the CPET + Step Data Metrics feature group, and the Step Data Metrics only feature group.
6.3.1 Classification Performance
The final overall validation performance of each of the variant classifiers is tabulated in Table 22, located
in Appendix D, for completeness. For brevity’s sake however, we summarize only the top performing
classifiers for each feature group in this chapter. In general, we found that pre-selecting features did not
change the classification performance of the models, and although imputing missing data did have an
effect on classifier performance, 3 of the 4 best performing models were built by simply excluding
incomplete cases as opposed to performing imputation.
The best CPET only classifier (and the third best classifier variant overall), summarized in Figure 6-7, was found to be a simple boosted GLM with no imputed data and either with or without feature pre-selection. The classifier achieved an unbalanced accuracy of 79% (corresponding to a balanced accuracy of 72%), better than the no-information rate of 70%. The level of agreement as measured by Cohen's Kappa was moderate (𝜅=0.47). This classifier is a huge improvement over the hidden Markov model based classifier trained in Chapter 5. That being said, the agreement between the GLM and the physician assigned label (𝜅=0.47) is still lower than the lower end of comparable human-level performance; recall that the interrater agreement between physicians was found to be between 𝜅=0.54 and 𝜅=0.75 [6,26]32. Solely based on the performance of this classifier, human performance remains the gold-standard baseline against which to compare the agreement in assessed NYHA functional class.

32 The study by Goldman et al. [11] which found a 41% agreement is excluded, as their result is not directly comparable since they used a weighted kappa to account for disagreements by more than 1 NYHA class. The other cited studies did not encounter this problem.
Unfortunately, the ML classifiers provided with just the step data did not fare as well as the CPET based classifiers. The best of these step data only classifiers – tied between a regular GLM, a boosted GLM and a NNet, all using imputed data and either with or without feature selection – achieved an unbalanced accuracy of only 72% (63% balanced), marginally higher than the no-information rate of 70%. The low agreement between the classifier and physician assigned label was also reflected in the low kappa coefficient (𝜅=0.28). That being said, the step data GLM/NNet/boosted GLM still performed better than the hidden Markov model based classifier.
The best performing classifier overall, another boosted GLM which used only complete cases (i.e. no imputed data) and either with or without feature selection, used the combination of CPET and step count data to achieve a solid 89% unbalanced accuracy (85% balanced), which was significantly higher than the no-information rate of the dataset (P=.02 at the 5% level of significance). There was substantial agreement between the machine and physician assigned labels (𝜅=0.73), approaching that of the best reported human analogues (𝜅=0.75 [26]).
Figure 6-9: Performance of the best CPET + step data classifier

              Physician
              II    III
  AI   II      6      2
       III     1     19

  No Information Rate (NIR): 0.71
  Unbalanced Accuracy (Acc): 0.89
  Cohen's Kappa: 0.73
  P-value [Acc > NIR]: 0.02
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-9: Performance of the second best CPET + step data classifier

              Physician
              II    III
  AI   II      5      3
       III     0     20

  No Information Rate (NIR): 0.71
  Unbalanced Accuracy (Acc): 0.89
  Cohen's Kappa: 0.70
  P-value [Acc > NIR]: 0.02
  Model Type: Random Forest
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-7: Performance of the best CPET only classifier

              Physician
              II    III
  AI   II      7      6
       III     3     27

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.79
  Cohen's Kappa: 0.47
  P-value [Acc > NIR]: 0.12
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No

Figure 6-7: Performance of the best step data only classifier

              Physician
              II    III
  AI   II      6      9
       III     5     30

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.72
  Cohen's Kappa: 0.28
  P-value [Acc > NIR]: 0.45
  Model Type: (boosted) GLM/NNet
  Imputed Data: Yes
  Pre-selected Features: Yes or No
The second best performing classifier overall was a RF in the same variant class as the best overall GLM (no imputed data, with or without feature preselection, using CPET and step count data). It achieved an equivalent unbalanced accuracy (89%), at the same level of significance relative to the no-information rate, but it had a marginally lower agreement coefficient (𝜅=0.70) and balanced accuracy (81%).
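The summary statistics reported in these figures (unbalanced accuracy, Cohen's kappa, the no-information rate, the one-sided test of Acc > NIR, and balanced accuracy) are all standard outputs of caret's confusionMatrix function. The following is a minimal sketch of how they can be obtained; the `ai` and `physician` vectors below are hypothetical placeholder labels for illustration, not the study data.

# Sketch: obtaining the figure statistics with caret::confusionMatrix.
# `ai` (predicted class) and `physician` (assigned class) are hypothetical vectors.
library(caret)

physician <- factor(c("II","II","II","III","III","III","III","III","III","III"),
                    levels = c("II", "III"))
ai        <- factor(c("II","II","III","III","III","III","II","III","III","III"),
                    levels = c("II", "III"))

cm <- confusionMatrix(data = ai, reference = physician)
cm$overall[c("Accuracy", "Kappa", "AccuracyNull", "AccuracyPValue")]  # Acc, kappa, NIR, P[Acc > NIR]
cm$byClass["Balanced Accuracy"]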
The receiver operating characteristic (ROC) curve, which graphically represents the trade-off between a classifier's sensitivity (true positive rate) and specificity (the mathematical complement of the false positive rate33), is shown in Figure 6-10 for the best RF and boosted GLM built using CPET and step data. The figure also includes the NNet, PCA NNet and plain GLM from the same variant class: no imputed data, with or without feature selection. We can see from this curve that the diagnostic error rate for the boosted GLM is always expected to be more, or at least as, favorable as that of the RF based classifier, regardless of the discrimination threshold chosen.

33 i.e. 1 – the false positive rate

Figure 6-10: Receiver Operating Characteristic (ROC) curve for machine learning classifiers trained with CPET & step data (with no data imputation)
As an aside, we can also see from this graph that our choice to use PCA for feature selection before
providing our features to the NNet was well justified, since the PCA NNet shows greatly improved
discriminatory ability compared to the pure NNet. This suggests that a NNet might still have use for
assessing NYHA functional class, but may require more careful selection of input features or at least more
data to properly take advantage of its powerful modelling capabilities.
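For reference, curves like those in Figure 6-10 can be generated from the cross-validated class probabilities of each classifier; the sketch below uses the pROC package, with `physician` and `prob_III` as assumed placeholder vectors (the physician-assigned class and the predicted probability of class III, e.g. from predict(fit, type = "prob")), not the study data.

# Sketch: ROC curve (sensitivity vs. specificity trade-off) with pROC.
library(pROC)

physician <- factor(c("II", "II", "II", "III", "III", "III"), levels = c("II", "III"))
prob_III  <- c(0.21, 0.40, 0.65, 0.35, 0.72, 0.90)            # predicted P(NYHA III)

roc_obj <- roc(response = physician, predictor = prob_III,
               levels = c("II", "III"), direction = "<")       # "II" = control, "III" = case
plot(roc_obj, print.auc = TRUE)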
Regardless, both of our boosted GLM and RF based CPET + step data classifiers showed improved performance over the classifiers using heart rate variability (HRV) data created by 1) Pecchia et al. [128] – a cross-validated classification and regression tree that had moderate agreement (𝜅=0.57) and good discrimination accuracy (79.3%, unbalanced) on a slightly unbalanced dataset (12:17, 59% severe) – and 2) Melillo et al. [136] – another classification and regression tree, 10-fold cross-validated, which achieved a marginally better level of agreement (𝜅=0.60) and discrimination accuracy (85.4%, unbalanced) than Pecchia et al.'s tree, but on a different, more unbalanced dataset (12:32, 73% severe). Our classifier, however, does not approach the performance of Shahbazi et al.'s [142] leave-one-out cross-validated, HRV based k-Nearest Neighbour classifier (with generalized discriminant analysis feature selection), which achieved perfect agreement (𝜅=1.0) and accuracy (100%) at the classification task (I or II vs. III or IV) on their unbalanced dataset (10:29, 74% severe). We suspect that Shahbazi et al.'s classifier may possibly be overfit to their data.
Unfortunately, the practical applications of our classifier are not clear cut. Our early investigation of the combination of data from the relatively more established CPET and the simpler to administer activity tracker monitoring does demonstrate that it is possible to create a classifier that performs comparably to those that use relatively esoteric HRV data. Administering a CPET augmented with two weeks of activity tracker data might therefore prove a useful alternative for clinicians or researchers wishing to objectively assess NYHA functional classification without requiring access to the specialized software and know-how required to perform an HRV analysis. Unfortunately, this alternative still requires the administration of a CPET, which remains an expensive, cumbersome, and labor-intensive ordeal. Furthermore, to achieve near-human levels of classification performance, it presently appears necessary to augment CPET data with activity tracker step data, since neither CPET nor step data alone suffices to achieve reasonable levels of classification agreement. While activity tracker data is less expensive and labor-intensive to collect than CPET data, in its currently investigated form it is associated with at least a two-week delay. Although two weeks is not necessarily longer than the time required to get certain blood or pathology tests – which can sometimes also take several weeks [268–270] – this time delay certainly limits the practical applications of our classifier.
While an obvious next step is to investigate smaller monitoring periods, we suggest that an equally profitable step may be to identify better features in the step count data and, ideally, alternate data sources to reduce the dependence on CPET data outright.
6.3.2 Best Features
As it stands, the top 5 features for the best step count data classifier (GLM) were, in decreasing order of importance:
1) the total 2 week step count,
2) the mean 2 week per minute step count (PMSC),
3) patient weight,
4) the standard error of the 2 week per minute step count (PMSC), and
5) the standard error of the total daily per minute step count.
The features were assessed by summing their weighted importance scores across folds. The raw importance score was computed from the default variable importance measure for the specific model in question, using the varImp function in the caret package [271]. Each of these scores was then scaled to be between 0 and 1 (from least to most important). Therefore, the highest possible importance score is 50, which is achieved only if a variable scores as most important in all 50 leave-one-out cross-validated folds.
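As an illustration, the scoring just described might be implemented roughly as follows; `fold_models` is an assumed list holding the 50 per-fold caret model fits and is a placeholder for the actual pipeline objects.

# Sketch: per-fold caret::varImp scores rescaled to [0, 1] and summed across the
# leave-one-out folds, so the maximum attainable score equals the number of folds (50).
library(caret)

fold_scores <- lapply(fold_models, function(fit) {
  imp <- varImp(fit, scale = FALSE)$importance                # model-specific raw importances
  v   <- setNames(imp[[1]], rownames(imp))
  (v - min(v)) / (max(v) - min(v))                            # 0 = least, 1 = most important
})

total_importance <- Reduce(`+`, fold_scores)                  # summed weighted importance
head(sort(total_importance, decreasing = TRUE), 5)            # top 5 features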
The full ordered list of top features for the step count data only GLM is shown in Figure 6-11. We can see from the graph that very few of the features clearly stood out as being relatively more important; in fact, only the total 2 week step count and the mean 2 week per minute step count scored higher than 25 importance points (of 50). The third-ranked feature, weight, is not even a step count metric, and is already known to be not significantly different between classes (P=.21) at the 5% level of significance in this dataset (see Table 10). Given that the ML classifier used in this case was a GLM (which is linear regression based), it is not unreasonable to conclude that features at and below this level likely provided increasingly little discriminatory value, which goes a long way towards explaining the relatively low performance of this classifier.

Figure 6-11: Feature importance scores for GLM classifier using only step count data
Unfortunately, at the time of writing, the caret package’s varImp function did not adequately support
variable importance analysis for boosted GLMs, the model type of our best performing model and the
CPET only model. We instead provide as contrast the top 10 features identified by our second best
performing classifier, the CPET + step count data RF classifier. The top 10 features for the RF classifier
are shown in Figure 6-12.
Only two of the top 10 features used by the RF classifier are step count derived metrics:
1) the mean of the maximum daily per minute step count, and
2) the standard deviation of the total daily per minute step count.
The remaining 8 features are all CPET features, of which the respiratory exchange ratio peak (RER
Peak) is particularly noteworthy, having scored the highest possible importance score of 50 points,
indicating that it was voted the single most important feature by every single leave-one-out cross-
validated fold. The next most important overall feature (also from the CPET data) is the slope of minute ventilation (VE) to CO2 output (VCO2) at anaerobic threshold (AT) during CPET (VE/VCO2 Slope @ AT), which scored less than 20 importance points, indicating relatively low importance across folds. The third most important feature, the duration of CPET in fractional minutes (CPET Duration), scored less than 10 importance points.
For reference, weight – the 3rd best feature for the step data only GLM – was found to be only the 31st most important feature for the RF, with a score of 0.878, indicating that weight actually has relatively low overall predictive value. Interestingly, leanness in HF patients has been found to be associated with worse prognostic outcomes – in what is known as the 'obesity paradox' [272–275]. However, more recent findings from a large study of over 300,000 patients suggest that this association is likely the result of other unaccounted-for confounding factors [276]. This might explain the low ranking of
weight (correlated with BMI) in the face of other explanatory variables. The mean 2 week per minute step
count and the total 2 week step count, the top two highest scoring features for the GLM trained using
only step count, also scored as being low importance for the RF classifier: 0.967 and 0.945 respectively.
Figure 6-12: Feature importance scores for random forest classifier using CPET + step
count data
The RF classifier in fact scored 14 other step count derived features as being more important than these
(although none of these 14 others scored any higher than 2.6 points).
It is curious that the step count metrics as a whole appear to be considered by the classifiers to be relatively unimportant in contributing to the successful assessment of patient NYHA class, yet our analysis of the models from a holistic perspective indicates that the interaction of the step data metrics with the CPET data notably enhances the overall performance of the classifier.
We suspect that one possible cause of this paradox is that the step data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, are in fact only weakly correlated with NYHA class, noisy, and uncontextualized, and in general only weakly explanatory of NYHA functional class alone. Furthermore, these metrics are likely also highly intercorrelated. This makes it difficult for a ML algorithm to identify which single metric is most helpful. This is evidenced by the pattern visible in Figure 6-11, where most of the metrics are considered only mildly important, with none standing out as specifically important. This pattern, although not shown in Figure 6-12, is also reflected in the RF classifier's scoring, where similar metrics closely neighbour each other.
When framed around CPET data – which helps contextualize and account for some of the noise in the step count data – some of the step count metrics begin to stand out as being more explanatory (they are in the top 10 features). These features therefore appear to possibly be explaining variance otherwise left unexplained by the CPET data. However, feature importance is rated inconsistently between models. Although this is not necessarily unexpected, it may indicate that although the RF classifier assesses these features as important, they are in fact only interpreted as important as a result of the chance subset of training data within the folds. This leads us to an alternative explanation: that the classifier is simply overfit. This is a less compelling explanation than the step data being simply weakly explanatory, since the RF classifier clearly still assesses the step count data as being relatively unimportant. That being said, the possibility of overfitting certainly still exists, but it could be checked by computing the variance of the importance scores across the random folds (high overall variance being an indicator of potential overfitting to individual training folds), as sketched below.
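A rough check along these lines, reusing the per-fold `fold_scores` list assumed in the earlier importance sketch, might look like the following.

# Sketch: fold-to-fold variance of the rescaled importance scores; features with
# high variance are the ones whose ranking depends most on the chance training subset.
importance_matrix <- do.call(rbind, fold_scores)              # folds x features
fold_variance     <- apply(importance_matrix, 2, var)
head(sort(fold_variance, decreasing = TRUE), 5)               # most unstable features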
The overall conclusion of our feature analysis, however, is that the step count metrics provided to the ML classifiers for training are generally inadequate and that most of the predictive power resides in the CPET features. In light of the desire to not be dependent on CPET for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, any continuation of this work should therefore seriously consider investing time in identifying and engineering more relevant step count features, as well as adding other data sources like heart rate, which would be complementary to step count and would help contextualize the step data, hopefully reducing the dependence on cumbersome CPET data. However, we also note the lack of impact feature pre-selection had on the performance of our variant models and suggest that increasing the amount of training data available may be a better approach than pre-trimming the available features. That being said, other researchers have had significant success performing clever feature selection to improve their algorithm performance [142].
6.3.3 Comparison of 10-fold and Leave-One-Out Cross-Validation
Recall that we cross-validated our classifiers not only with leave-one-out cross-validation, but also with 10-fold CV, to try to approximate our location on the classifier learning curve. Excluding models whose unbalanced accuracy was less than the no-information rate, the smallest difference in performance estimation between 10-fold CV and LOOCV of the same classifier was 0.19 (𝜅 = 0.47 under LOOCV vs. 𝜅 = 0.28 under 10-fold CV). The classifier with this smallest estimator difference was in fact the CPET only classifier discussed in Section 6.3.1. A summary of the performance estimations from the 10-fold and LOOCV of this classifier (the CPET Only GLM) is shown in Figure 6-13.

The largest and second largest performance differences were associated with the best performing classifier (CPET + Step Data GLM, 𝜅 = 0.73 under LOOCV vs. 𝜅 = 0.10 under 10-fold CV) and the second best performing classifier (CPET + Step Data RF, 𝜅 = 0.70 under LOOCV vs. 𝜅 = 0.10 under 10-fold CV). It is worth noting that the 10-fold CV versions of these classifiers in fact had unbalanced accuracies (68%) that were marginally less than the associated no-information rate (70%) for the classifiers.
Figure 6-13: Performance of the best model with cross-validation performance difference

  LOOCV confusion matrix:
              Physician
              II    III
  AI   II      7      6
       III     3     27

  10-fold CV confusion matrix:
              Physician
              II    III
  AI   II      6      9
       III     5     30

  No Information Rate (NIR): 0.70
  Unbalanced Accuracy (Acc): 0.79 (LOOCV) | 0.72 (10-fold CV)
  Cohen's Kappa: 0.47 (LOOCV) | 0.28 (10-fold CV)
  P-value [Acc > NIR]: 0.12 (LOOCV) | 0.45 (10-fold CV)
  Model Type: Boosted GLM
  Imputed Data: No
  Pre-selected Features: Yes or No
  Data Source: CPET Only
Since, as previously mentioned in Section 6.2.3.2, we expect at most about a 7-9 percentage point difference in performance estimation due to the bias of 10-fold CV vs LOOCV, these large differences in performance estimation between 10-fold CV and LOOCV are clear indications that our model is still highly sensitive to the amount of input data used to train it, and that it may possibly be overfit to the training data. From a learning curve perspective, these values indicate that we are still at the point in the curve where we are likely to derive significant benefit from adding more training data. Since adding more training data is also often an adequate remedy for overfitting, the appropriate response in either case is to collect more data. Certainly, we appear to have been justified in using this larger 50 patient dataset for our experiments as opposed to the 44 patient dataset, despite the associated loss of activity monitoring heart rate data.
Fortunately, as a result of the activity tracker monitoring upgrade made to Medly, as detailed in Chapter
4, more data (containing both heart rate and step count) is still actively being collected and should soon
result in a larger (n > 50) activity monitoring dataset than the one used for the classification experiments
in this thesis.
As the dataset increases in size, we suggest that future work performed with the dataset continue to be assessed using both 10-fold and LOOCV until the estimates from these approaches are found to converge. This will not only increase confidence in the performance estimates of the classifiers, but also help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. Furthermore, recording the performance of otherwise identical ML models as the amount of available data continues to increase would permit more accurate mapping of the learning curve than our initial single datapoint [258]. Knowing the actual learning curve associated with this problem would be helpful for diagnosing the source of classifier errors and ascertaining possible future steps to improve algorithm performance, and it would also be helpful for determining the incremental cost/benefit of continuing to collect more data [258].
Summary
To summarize, in this chapter we discussed a method for building cross-sectional machine learning
classifiers to assess NYHA functional class using CPET and activity monitoring step data. We chose to
investigate some popular starting points for supervised classification problems: Generalized Linear Models
(GLM); a variant thereof: boosted GLMs; Random Forests (RF); Artificial Neural Networks (NN); and a
variant thereof: Principal Component Analysis Neural Networks (PCA NN). We trained multiple variants
of each model to investigate the effect of a) performing separate feature selection ahead of model training,
b) imputing missing data instead of just dropping incomplete cases, and c) supplying different groups of
input predictors to our models for training. Specifically, we investigated the performance of the classifiers
when supplied with demographic data and a) just CPET data, b) just the step data metrics investigated
in Chapter 3, and c) the combination of both the CPET data and step data metrics.
To properly determine the expected performance of the classifiers in the face of new data we also cross-
validated all the models using 10-fold cross-validation and leave-one-out cross-validation. Since we also
optimized the model hyper-parameters and cross-validated these selections, we ended up performing
nested 10-fold and nested leave-one-out cross-validation of each of the models.
In general, we found that pre-selecting features did not change the classification performance of the
models, and although imputing missing data sometimes had an effect on classifier performance, 3 of the 4
best performing models (all except the step data only classifier) discussed in this chapter were built by
simply excluding incomplete cases as opposed to performing 5-Nearest Neighbour imputation.
The best overall classifier was found to be a boosted GLM, trained using only complete cases of both
CPET and step data, which achieved an unbalanced accuracy of 89% (85% balanced) versus a no-
information rate of 70%. As a result, this classifier had a substantial level of agreement with the physician
assigned NYHA class (𝜅=0.73). The performance of the classifier was therefore comparable to human level performance (𝜅=0.75 [26]).
The CPET + step data classifier exceeded the baseline level of performance established by the best CPET
data only classifier. The best classifier trained with only CPET data (another boosted GLM) achieved an
unbalanced accuracy of 79% (72%, balanced) which was also better than the no-information rate of 70%.
The CPET only classifier therefore showed a moderate level of agreement with the physician assigned
label (𝜅=0.47) which was lower than the lower end of comparable human-level performance (𝜅=0.54 [6]).
The step data only classifiers (tied between a regular GLM, boosted GLM and NNet) fared much worse,
achieving an unbalanced accuracy of 72% (63% balanced) – only marginally higher than the no-
information rate of 70%, and with a low level of agreement between the classifier and physician assigned
label (𝜅=0.28).
When comparing which features were considered most important by the classifiers, we found that the step data metrics as a whole were less important than the CPET metrics. We theorized that this is because the step data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, are in fact only weakly correlated, noisy and uncontextualized, and in general only weakly explanatory of NYHA functional class alone. This makes it difficult for a ML algorithm to use the features effectively for classification. In light of the desire to also not remain dependent on CPET data for assessment of NYHA class, especially within the context of remote patient monitoring for Medly, we suggested that a reasonable next step would be to invest in engineering more relevant step count features. We also recommend adding other data sources like heart rate, which is presumed to be complementary to step count and would help contextualize the step data – hopefully replacing the currently required CPET data.
In comparing the performance estimations from the 10-fold and leave-one-out cross-validation, we found a notable difference between the measurements of agreement (𝜅), varying from 0.19 to 0.63 for the well performing algorithms, and always in favor of the leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but that it is likely also largely attributable to the 15% reduction in the already limited data available for training the classifier resulting from the nesting of the 10-fold cross-validation process (compared to nesting leave-one-out cross-validation), and thus more indicative of our location on the learning curve. Regardless, these numbers indicated that there is likely considerable benefit to collecting more training data. We suggested that future work performed with larger datasets should continue to assess performance using 10-fold and LOOCV until the estimates from these approaches are found to converge. This would increase confidence in the performance estimates of the classifiers, as well as help determine when it is appropriate to switch over to the less computationally expensive 10-fold CV. We also suggested that, at minimum, keeping the number of folds consistent for cross-validation would be helpful for better mapping out the learning curve for this problem – which would be a helpful tool for diagnosing classifier error and assessing the cost/benefit of continuing to collect more data.
Conclusions, Recommendations & Future Work
In this chapter we reflect on this work as a whole, briefly reiterating its major conclusions and findings, and providing some recommendations and suggested directions for future work.
Conclusions
The objective of this thesis was to design and develop a means of making New York Heart Association
(NYHA) classification more consistent and reliable for the medical research and clinical community. We
proposed that a good way to accomplish this objective was to find a means of objectively assessing NYHA
functional class. In light of this, we performed a thorough review of the current state-of-the-art for
assessing NYHA functional class, including the state-of-the-art in applying artificial intelligence machine
learning algorithms to the task of assessing or classifying patients into their NYHA functional class.
We found that other researchers have already attempted to use machine learning for NYHA functional
classification. These, however, used heart rate variability data, which is not necessarily readily accessible or usable by all heart function clinics, nor, at least at present, highly suitable for long-term remote patient monitoring. Remote patient monitoring is a growing trend in the pursuit of more cost-efficient care for chronic conditions, and specifically in the quest to improve patient- and physician-management of the heart failure condition. We proposed that a useful but more accessible data source that would synergize well with remote patient monitoring would be activity tracker data.
We proposed updating an existing remote patient monitoring system with the ability to collect and
display activity tracker data, which could provide data for use by a machine learning algorithm to
perform automated assessment of NYHA functional class. For this task we selected Medly, the remote
patient monitoring system presently in use at the Toronto General Hospital Heart Function Clinic, as a
suitable candidate system. However, since activity tracker data has not seen wide use in actual clinic
settings - in fact we only found one small pilot study that investigated the relationship between NYHA
class and activity tracker step count - we first replicated the pilot study on a larger dataset that we had
available from a previous study performed at our lab, verifying the findings of the pilot study: that NYHA
II and NYHA III patients differ significantly by mean daily total step count. Additionally, we discovered
that these patients actually differed by various aggregate measures of step count also including mean and
maximum of the daily per minute step count maximums. Overall, our findings reaffirmed the findings of
the previous pilot study, giving us some additional reassurance that remote monitored step count might
be beneficial for objectively assessing NYHA class. We noted however that the recorded step count data
was often ambiguous, since the data recorded by the fitness trackers used in this study, which only
recorded step count, did not allow us to differentiate between when the wearer was inactive versus the
tracker simply not being worn. This significantly limited our ability to draw precise practical conclusions
from the dataset.
We then proceeded to engineer an upgrade to the Medly remote patient monitoring system to allow it to support activity tracker monitoring data from Fitbit devices, specifically the Fitbit Charge HR 2, which supported collection of both step count and heart rate data (to avoid the ambiguity problems which were identified in the replication study). Despite delays in the actual implementation of the activity tracker upgrade, we were successfully able to onboard 44 patients over a 5 month period, with some (3) of the patients even providing their own Fitbit for use with the system. Unfortunately, the patients were found to be only moderately adherent with using the Fitbit, with only around 1⁄3 to 1⁄4 of patients (at 3 months and 7 months respectively) having excellent levels of adherence (using the system an average of at least 9 of 10 days). We theorized that the many compromises made to the user experience throughout the implementation process may have detrimentally impacted patient adherence.
Since the effective size of the Medly Fitbit dataset was drastically reduced to 33 patients after removing
those patients with less than 1 week of recorded activity, we opted to instead use the dataset investigated
as part of the replication study to explore if it would be possible to assess NYHA class using free-living
fitness tracker data. The marginally larger replication data set we opted to use consisted of 50 patients
(35 NYHA class II; 15 NYHA class III), and although it lacked activity monitor heart rate data to
complement the step count data, all of the patients in the dataset had recorded cardiopulmonary exercise
test data which we proposed to use to establish a baseline performance level against which to evaluate our
classifiers.
We investigated 6 different types of supervised machine learning classifiers to assess NYHA functional
classification: a hidden Markov model based classifier, several Generalized Linear Models, boosted
Generalized Linear Models, Random Forests, Artificial Neural Networks and Principal Component
Analysis Neural Networks.
We found that the hidden Markov model based classifier performed worst overall, and in fact in many cases refused to train properly. The hidden Markov model based classifier we did manage to train had poor agreement (Cohen's Kappa statistic, 𝜅=0.18) between the physician assigned NYHA class and that assigned by the classifier, with a resulting low (unbalanced) accuracy of 58% (assessed on the same data used to train the classifier), which was actually worse than the no-information rate of the dataset (70%).
In contrast, the best overall classifier was found to be a boosted GLM (leave-one-out cross-validated),
trained using only complete cases of both CPET and step data, which demonstrated substantial
agreement with the physician assigned NYHA class (𝜅=0.73) comparable to human level performance
(𝜅=0.75 [26]) and better than 2 of the 3 heart rate variability based machine learning classifiers. The level
of agreement of our classifier corresponded to an unbalanced accuracy of 89% (85% balanced) against a
no-information rate of 70%.
The best classifier trained with only CPET data – our proposed performance baseline (another boosted GLM) – showed a moderate level of agreement with the physician assigned label (𝜅=0.47), with a corresponding unbalanced accuracy of 79% (72% balanced), again better than the no-information rate of 70%. The performance of this classifier, however, was lower than the reported lower range of human-level performance (𝜅=0.54 [6]) and as a result surprisingly did not dislodge physicians as the gold-standard against which to assess NYHA functional class agreement, despite the notoriously high degree of subjectivity in their assessments.
The step data only classifier (tied between a regular GLM, boosted GLM and NNet) fared even worse
than the classifier trained with only CPET data, although still better than the hidden Markov model
based classifier, achieving an unbalanced accuracy of 72% (63% balanced) – only marginally higher than
the no-information rate of 70%, and with a low level of agreement between the classifier and physician
assigned label (𝜅=0.28).
An analysis of the important input features revealed notably that, of the CPET + step data features
investigated, the respiratory exchange ratio was found to be rated most consistently important. The step
data metrics, as a whole, were found to be less important generally than the CPET metrics and were also
found to be inconsistent in their ratings of relative importance amongst themselves.
We also found a notable difference between the estimates of the measurements of agreement (|∆𝜅| = [0.19, 0.63]) generated using 10-fold versus leave-one-out cross-validation for the well performing classifiers, always in favor of leave-one-out cross-validation. We proposed that this might be evidence of overfitting of the classifiers, but is more likely an indication that 10-fold cross-validation caused a severe reduction in the already limited amount of data available for classifier training.
In summary, we found that it is possible to objectively assess NYHA functional classification with a level of performance comparable to that of human physicians by using a combination of CPET and step count data. Although CPET data and step count data were found to be generally inadequate for performing objective NYHA functional classifications by themselves, this may have been due to the lack of data and the lack of useful and relevant features. In particular, for the step count data metrics, which were originally selected due to their ability to characterize the step count distributions and not their predictive capacity, more intentional feature engineering of relevant step count metrics might further improve performance using this data. As well, adding other data sources, for example heart rate data, which is presumed complementary to step count and might help re-contextualize and clean up ambiguity in the data, could further improve classifier performance.
In general, although the machine learning classifiers developed in this work are not yet ready for
implementation into a real-life remote patient monitoring system, the classifiers investigated in this thesis
certainly show promise for making the assessment of NYHA functional class more objective and by
extension more universally consistent and reliable.
Recommendations
In this section we propose several recommendations and ‘lessons learned’ in light of our findings:
1. Avoid activity trackers that label disengagement with the monitoring solution and patient
inactivity identically. These contribute significant ambiguity to later data analysis that is often
difficult or impossible to reconcile.
2. For data collected remotely from patients, provide a means of helping staff catch and address
patient issues in a timely manner, thereby improving the overall quality of the data. For example,
adding automated adherence phone calls or reminder notifications (for a smartphone-based
application) may improve adherence at little cost.
3. When adding new sources of data to an existing system, either a) begin data collection as soon as
possible, improving as required, and collecting lots of lower quality data which can be cleaned and
noise-corrected post-hoc, or b) fully commit to designing a user experience that will result in high
adherence – collecting a smaller amount of high-quality data. Delaying data collection to design
an incomplete user experience will likely only result in collecting an insufficient amount of
moderate quality data that will be more challenging to analyze.
4. Notwithstanding the above, prefer collecting more data (especially for machine learning
applications). While it is possible to build a machine learning classifier with little data, it becomes
significantly more difficult to properly assess if the classifier is of good quality.
5. The corollary to 3 and 4 is to invest in data collection infrastructure. Collecting a suitably large
dataset can take a long time and should be started well in advance of a proposed research project.
6. Invest time in visualizing and understanding the data being collected. In the case of this thesis,
we discovered several limitations in our data, for example the prevalence of 0 step count values,
that had drastic implications on model design and development. This could have been addressed
in a more timely fashion with foresight derived from a more thorough earlier investigation of the
source data.
7. Prefer simpler machine learning classifiers over more complex ones especially in the face of smaller
datasets. Almost all of the best performing classifiers investigated in this thesis were simple
generalized linear models or variants thereof.
8. Prefer the use of the R programming language (along with the tidyverse package by H. Wickham [217]) for analysis and visualization of data, but use Python along with the well-established scikit-learn library to accelerate creation of the machine learning pipeline required to build and
adequately assess a series of machine learning classifiers. Aside from cleaning data, building the
machine learning pipeline is one of the most time-consuming parts of a machine learning project.
Future Work
Having outlined some general recommendations and lessons that should be taken from this work, we provide some suggested future directions for this work:
1. A more thorough study of the characterization of the minute-by-minute step count waveform for
both healthy persons and patients with congestive heart failure should be undertaken. This would
provide very valuable insights for projects investigating the use of fitness trackers for monitoring
tasks.
2. Revisit the user interfaces and user experience design of the fitness tracker upgrade applied to
Medly. Aside from the fact that the system as is does not fully honor the best practices and
principles outlined in the Fitbit API terms of service, patients using the system are only
moderately adherent which reduces the amount and quality of data being collected for use by
patients, by clinicians, and as part of any future quality improvement or research projects.
Adding adherence phone calls or reminder notifications would likely provide significant benefit at
little cost.
3. Investigate the effects of applying dithering to the training of the HMMBC.
4. Repeat the work performed in this thesis but using the combination of activity tracker step count
and heart rate data. The data being collected from Medly patients would be suitable for this
purpose once a sufficient number of patients are onboarded onto the upgraded system.
5. Furthermore, investigate the effect of including other data available from the Medly system such
as daily symptoms data, which could potentially help further contextualize patient step count data.
6. Investigate the effect of reducing the analysis window duration for the step count data from 2-
weeks to some shorter time period.
7. In a similar vein, investigate activity segmentation with an eye towards using it in combination
with a HMMBC (or more standard cross-sectional ML model).
8. Perform careful manual feature engineering or automated feature extraction to identify more
relevant features from available time series data streams (including step count).
9. And finally, regardless of other work performed, continue to assess the cross-validated
performance of otherwise identical models as dataset size increases, to better map the learning
curve associated with the NYHA functional class supervised classification problem.
References
1. Mehra MR, Butler J. Heart Failure: A Global Pandemic and Not Just a Disease of the West.
Heart Fail Clin [Internet] 2015 Oct [cited 2017 Oct 13];11(4):xiii–xiv. PMID:26462110
2. Heart and Stroke Foundation. 2016 Report on the Health of Canadians: The Burden of Heart
Failure. 2016 [cited 2016 Oct 29]; Available from: https://www.heartandstroke.ca/-/media/pdf-
files/canada/2017-heart-month/heartandstroke-reportonhealth-
2016.ashx?la=en&hash=0478377DB7CF08A281E0D94B22BED6CD093C76DB (Archived by
WebCite® at http://www.webcitation.org/706UliccA)
3. Seto E, Leonard KJ, Cafazzo J a, Masino C, Barnsley J, Ross HJ. Self-care and quality of life of
heart failure patients at a multidisciplinary heart function clinic. J Cardiovasc Nurs [Internet]
2011;26(5):377–85. PMID:21263339
4. Lawrence S. Canada is failing our heart failure patients - Heart and Stroke Foundation of Canada
[Internet]. Marketwired. 2016 [cited 2016 Oct 7]. Available from:
http://www.marketwired.com/press-release/canada-is-failing-our-heart-failure-patients-
2093022.htm (Archived by WebCite® at http://www.webcitation.org/706U7G8oI)
5. Cox J, Naylor CD. The Canadian Cardiovascular Society Grading Scale for Angina Pectoris: Is It
Time for Refinements? Ann Intern Med [Internet] American College of Physicians; 1992 Oct 15
[cited 2016 Oct 30];117(8):677. [doi: 10.7326/0003-4819-117-8-677]
6. Raphael C, Briscoe C, Davies J, Ian Whinnett Z, Manisty C, Sutton R, Mayet J, Francis DP,
Raphael C. Limitations of the New York Heart Association functional classification system and
self-reported walking distances in chronic heart failure. Heart [Internet] 2007 Apr 1 [cited 2016 Oct
30];93(4):476–482. [doi: 10.1136/hrt.2006.089656]
7. Bennett JA, Riegel B, Bittner V, Nichols J. Validity and reliability of the NYHA classes for
measuring research outcomes in patients with cardiac disease. Hear Lung J Acute Crit Care
2002;31(4):262–270. PMID:12122390
8. Heart Foundation. New York Heart Association (NYHA) Classification [Internet]. Heart
Foundation; 2014 [cited 2017 Jun 30]. p. 1. Available from:
http://www.heartonline.org.au/media/DRL/New_York_Heart_Association_(NYHA)_classificati
on.pdf
9. American Heart Association. Classes of Heart Failure [Internet]. 2015 [cited 2016 Oct 30].
Available from:
http://www.heart.org/HEARTORG/Conditions/HeartFailure/AboutHeartFailure/Classes-of-
Heart-Failure_UCM_306328_Article.jsp#.WvyuQYgvyiN (Archived by WebCite® at
http://www.webcitation.org/6zT3C5Rpx)
10. Ahmed A, Aronow WS, Fleg JL. Higher New York Heart Association classes and increased
mortality and hospitalization in patients with heart failure and preserved left ventricular function.
Am Heart J [Internet] NIH Public Access; 2006 Feb [cited 2017 Oct 30];151(2):444–50.
PMID:16442912
11. Goldman L, Hashimoto B, Cook EF, Loscalzo A. Comparative reproducibility and validity of
systems for assessing cardiovascular functional class: advantages of a new specific activity scale.
Circulation [Internet] 1981;64(6):1227–1234. PMID:7296795
12. Williams BA, Doddamani S, Troup MA, Mowery AL, Kline CM, Gerringer JA, Faillace RT.
Agreement between heart failure patients and providers in assessing New York Heart Association
functional class. Hear Lung J Acute Crit Care [Internet] Elsevier Inc; 2017 Jul 1 [cited 2017 Oct
30];46(4):293–299. PMID:28558929
13. Moayedi Y, Abdulmajeed R, Posada JD, Foroutan F, Alba AC, Cafazzo J, Ross HJ, Duero Posada
J, Foroutan F, Alba AC, Cafazzo J, Ross HJ. Assessing the Use of Wrist-Worn Devices in Patients
With Heart Failure: Feasibility Study. JMIR Cardio [Internet] JMIR Cardio; 2017 Dec 19 [cited
2018 Jan 25];1(2):8. [doi: 10.2196/cardio.8301]
14. Savarese G, Lund LH. Global Public Health Burden of Heart Failure. Card Fail Rev [Internet]
Radcliffe Cardiology; 2017 Apr [cited 2018 Jun 4];3(1):7–11. PMID:28785469
15. University of Toronto Faculty of Medicine. The State of the Heart in Canada [Internet]. 2014.
Available from:
http://medicine.utoronto.ca/sites/default/files/TRCHR_StateOfHeart_Infographsm.png
16. cardiac insufficiency. McGraw-Hill Concise Dict Mod Med [Internet] The McGraw-Hill Companies,
Inc.; 2018 [cited 2018 Jul 21]. Available from: https://medical-
dictionary.thefreedictionary.com/cardiac+insufficiency
17. Aird WC. Discovery of the cardiovascular system: From Galen to William Harvey. J Thromb
Haemost [Internet] 2011;9(1 S):118–129. PMID:21781247
18. Silverthorn DU, Johnson BR, Ober WC, Garrison CW, Silverthorn AC. Blood Flow and the
Control of Blood Pressure. Hum Physiol An Integr Approach 5th ed Pearson Benjamin Cummings;
2009. p. 512–545.
19. Shah SJ. Heart Failure (HF) [Internet]. Merck Manuals Prof Ed. 2017 [cited 2018 Jul 21].
Available from: https://www.merckmanuals.com/en-ca/professional/cardiovascular-
disorders/heart-failure/heart-failure-hf
20. Azevedo PS, Polegato BF, Minicucci MF, Paiva SAR, Zornoff LAM. Cardiac Remodeling:
Concepts, Clinical Impact, Pathophysiological Mechanisms and Pharmacologic Treatment. Arq
Bras Cardiol [Internet] Arquivos Brasileiros de Cardiologia; 2016 Jan [cited 2018 Jul 21];106(1):62–
9. PMID:26647721
21. Laflamme MA, Murry CE. Heart regeneration. Nature [Internet] NIH Public Access; 2011 May 19
[cited 2018 Jul 21];473(7347):326–35. PMID:21593865
22. National Heart Foundation of Australia and the Cardiac Society of Australia and New Zealand
(Chronic Heart Failure Guidelines Expert Writing Panel). Guidelines for the prevention, detection
and management of chronic heart failure in Australia. 2011 [cited 2018 May 10];84. Available from:
https://www.heartfoundation.org.au/images/uploads/publications/Chronic_Heart_Failure_Guide
lines_2011.pdf
23. The Criteria Committee of the New York Heart Association. Classification of Functional Capacity
and Objective Assessment [Internet]. 9th ed. Nomencl Criteria Diagnosis Dis Hear Gt Vessel.
Boston, Mass.: Little, Brown and Co.; 1994 [cited 2017 Oct 13]. Available from:
http://professional.heart.org/professional/General/UCM_423811_Classification-of-Functional-
Capacity-and-Objective-Assessment.jsp
24. Rostagno C, Galanti G, Comeglio M, Boddi V, Olivo G, Gastone G, Serneri N. Comparison of
different methods of functional evaluation in patients with chronic heart failure. Eur J Heart Fail
[Internet] 2000 [cited 2018 Jun 4];2:273–280. Available from:
https://onlinelibrary.wiley.com/doi/pdf/10.1016/S1388-9842(00)00091-X
25. Carroll SL, Harkness K, Mcgillion MH. A Comparison of the NYHA Classification and the Duke
Treadmill Score in Patients with Cardiovascular Disease. Open J Nurs [Internet] 2014 [cited 2017
Nov 3];4:774–783. [doi: 10.4236/ojn.2014.411083]
26. Christensen HW, Haghfelt T, Vach W, Johansen A, Hoilund-Carlsen PF. Observer reproducibility
and validity of systems for clinical classification of angina pectoris: comparison with radionuclide
imaging and coronary angiography. Clin Physiol Funct Imaging [Internet] Blackwell Science Ltd;
2006 Jan [cited 2017 Nov 6];26(1):26–31. [doi: 10.1111/j.1475-097X.2005.00643.x]
27. Kubo SH, Schulman S, Starling RC, Jessup M, Wentworth D, Burkhoff D. Development and
validation of a patient questionnaire to determine New York heart association classification. J
Card Fail [Internet] Churchill Livingstone; 2004 [cited 2017 Nov 3];10(3):228–235. [doi:
10.1016/J.CARDFAIL.2003.10.005]
28. McHugh ML. Interrater reliability: the kappa statistic. Biochem medica [Internet] Croatian Society
for Medical Biochemistry and Laboratory Medicine; 2012 [cited 2018 Aug 25];22(3):276–82.
PMID:23092060
29. Sallis JF, Saelens BE. Assessment of Physical Activity by Self-Report: Status, Limitations, and Future Directions. Research Quarterly for Exercise and Sport. 2015 [cited 2018 Jul 24]; [doi: 10.1080/02701367.2000.11082780]
30. Okura Y, Urban LH, Mahoney DW, Jacobsen SJ, Rodeheffer RJ. Agreement between self-report
questionnaires and medical record data was substantial for diabetes, hypertension, myocardial
infarction and stroke but not for heart failure. J Clin Epidemiol [Internet] Pergamon; 2004 Oct 1
[cited 2018 Jul 24];57(10):1096–1103. [doi: 10.1016/J.JCLINEPI.2004.04.005]
31. Baranowski T. Validity and Reliability of Self Report Measures of Physical Activity: An Information-Processing Perspective. Res Q Exerc Sport [Internet] 1988 [cited 2018 Jul 24];59(4):314–327. [doi: 10.1080/02701367.1988.10609379]
32. Balady GJ, Arena R, Sietsema K, Myers J, Coke L, Fletcher GF, Forman D, Franklin B, Guazzi
M, Gulati M, Keteyian SJ, Lavie CJ, Macko R, Mancini D, Milani R V. AHA Scientific Statement
Clinician’s Guide to Cardiopulmonary Exercise Testing in Adults A Scientific Statement From the
American Heart Association. Am Hear Assoc Exerc Clin Cardiol Counc Epidemiol Prev [Internet]
[cited 2017 May 2]; [doi: 10.1161/CIR.0b013e3181e52e69]
33. Uth N, Sørensen H, Overgaard K, Pedersen PK. Estimation of VO2max from the Ratio between
HRmax and HRrest - the Heart Rate Ratio Method. Eur J Appl Physiol [Internet] 2004 [cited
2017 May 2];91(1):111–115. [doi: 10.1007/s00421-003-0988-y]
34. Kline GM, Porcari JP, Hintermeister R, Freedson PS, Ward A, McCarron RF, Ross J, Rippe JM.
Estimation of VO2max from a one-mile track walk, gender, age, and body weight. Med Sci Sports
Exerc [Internet] 1987 Jun [cited 2017 May 2];19(3):253–9. PMID:3600239
35. Cooper KH. Aerobics. Bantam Books; 1969. ISBN:9780553144901
36. Saalasti S, Pulkkinen A. Method and system for determining the fitness index of a person
[Internet]. United States Patent Office; 2012 [cited 2017 May 2]. Available from:
https://www.google.com/patents/US20140088444
37. Butte NF, Ekelund U, Westerterp KR. Assessing Physical Activity Using Wearable Monitors:
Measures of Physical Activity. Med Sci Sport Exerc [Internet] 2012 [cited 2017 Jun 15];44(1S):5–
12. [doi: 10.1249/MSS.0b013e3182399c0e]
38. ap507. Study shows slow walking pace is good predictor of heart-related deaths — University of
Leicester [Internet]. Univ Leicester News. 2017 [cited 2017 Aug 30]. Available from:
https://www2.le.ac.uk/news/blog/2017-archive/august/study-shows-slow-walking-pace-good-
predictor-of-heart-related-deaths
39. Zhao S, Chen K, Su Y, Hua W, Chen S, Liang Z, Xu W, Dai Y, Liu Z, Fan X, Hou C, Zhang S.
Association between patient activity and long-term cardiac death in patients with implantable
cardioverter-defibrillators and cardiac resynchronization therapy defibrillators. Eur J Prev Cardiol
[Internet] 2017;24(7):760–767. [doi: 10.1177/2047487316688982]
40. Roul G, Germain P, Bareiss P. Does the 6-minute walk test predict the prognosis in patients with
NYHA class II or III chronic heart failure? Am Heart J [Internet] 1998 Sep [cited 2017 Jun
30];136(3):449–457. [doi: 10.1016/S0002-8703(98)70219-4]
41. Abdulmajeed R. The Use of Continuous Monitoring of Heart Rate as a Prognosticator of
Readmission in Heart Failure Patients. University of Toronto; 2016.
42. Eapen ZJ, Turakhia MP, McConnell M V., Graham G, Dunn P, Tiner C, Rich C, Harrington RA,
Peterson ED, Wayte P. Defining a Mobile Health Roadmap for Cardiovascular Health and
Disease. J Am Heart Assoc [Internet] 2016 Jul 12 [cited 2016 Oct 30];5(7):e003119. [doi:
143
10.1161/JAHA.115.003119]
43. Wen D, Zhang X, Liu X, Lei J. Evaluating the Consistency of Current Mainstream Wearable
Devices in Health Monitoring: A Comparison Under Free-Living Conditions. J Med Internet Res
[Internet] Journal of Medical Internet Research; 2017 Mar 7 [cited 2017 Mar 9];19(3):e68.
PMID:28270382
44. El-Amrawy F, Nounou MI, Volpp K, Patel M, Lin N, Lewis R. Are Currently Available Wearable
Devices for Activity Tracking and Heart Rate Monitoring Accurate, Precise, and Medically
Beneficial? Healthc Inform Res [Internet] Apress Media; 2015 [cited 2017 Jul 7];21(4):315. [doi:
10.4258/hir.2015.21.4.315]
45. An H-S, Jones GC, Kang S-K, Welk GJ, Lee J-M. How valid are wearable physical activity
trackers for measuring steps? Eur J Sport Sci [Internet] Routledge; 2017 Mar 16 [cited 2017 Jul
12];17(3):360–368. [doi: 10.1080/17461391.2016.1255261]
46. Bromberg SE. Consumer Wristband Activity Monitors as a Simple and Inexpensive Tool for
Remote Heart Failure Monitoring. 2015.
47. Abeles A, Kwasnicki RM, Pettengell C, Murphy J, Darzi A. The relationship between physical
activity and post-operative length of hospital stay: A systematic review. Int J Surg [Internet] 2017
Jul [cited 2017 Jul 12]; [doi: 10.1016/j.ijsu.2017.06.085]
48. Bornstein DB, Beets MW, Byun W, Welk G, Bottai M, Dowda M, Pate R. Equating
accelerometer estimates of moderate-to-vigorous physical activity: In search of the Rosetta Stone. J
Sci Med Sport [Internet] BioMed Central; 2011 Sep [cited 2017 Jul 12];14(5):404–410. [doi:
10.1016/j.jsams.2011.03.013]
49. Awais M, Mellone S, Chiari L. Physical activity classification meets daily life: Review on existing
methodologies and open challenges. Proc Annu Int Conf IEEE Eng Med Biol Soc EMBS
2015;2015–Novem:5050–5053. PMID:26737426
50. Jehn M, Prescher S, Koehler K, Von Haehling S, Winkler S, Deckwart O, Honold M, Sechtem U,
Baumann G, Halle M, Anker SD, Koehler F. Tele-accelerometry as a novel technique for assessing
functional status in patients with heart failure: Feasibility, reliability and patient safety. Int J
Cardiol [Internet] 2013 [cited 2017 Sep 5];168:4723–4728. [doi: 10.1016/j.ijcard.2013.07.171]
51. Demers C, McKelvie RS, Negassa A, Yusuf S. Reliability, validity, and responsiveness of the six-
minute walk test in patients with heart failure. Am Heart J 2001;142(4):698–703. PMID:11579362
52. Guazzi M, Myers J, Arena R. Cardiopulmonary Exercise Testing in the Clinical and Prognostic
Assessment of Diastolic Heart Failure. J Am Coll Cardiol [Internet] Elsevier; 2005 Nov 15 [cited
2018 Jul 25];46(10):1883–1890. [doi: 10.1016/J.JACC.2005.07.051]
53. Albouaini K, Egred M, Alahmar A, Wright DJ. Cardiopulmonary exercise testing and its
application. Postgrad Med J [Internet] BMJ Group; 2007 Nov [cited 2016 Sep 20];83(985):675–82.
PMID:17989266
54. Chatterjee S, Sengupta S, Nag M, Kumar P, Goswami S, Rudra A. Cardiopulmonary Exercise
Testing: A Review of Techniques and Applications. 2013 [cited 2018 Jul 25]; [doi: 10.4172/2155-
6148.1000340]
55. Mehra MR, Canter CE, Hannan MM, Semigran MJ, Uber PA, Baran DA, Danziger-Isakov L,
Kirklin JK, Kirk R, Kushwaha SS, Lund LH, Potena L, Ross HJ, Taylor DO, Verschuuren EAM,
Zuckermann A. The 2016 International Society for Heart Lung Transplantation listing criteria for
heart transplantation: A 10-year update. [cited 2018 Jun 2]; [doi: 10.1016/j.healun.2015.10.023]
56. Lim FY, Yap J, Gao F, Teo LL, Lam CSP, Yeo KK. Correlation of the New York Heart
Association classification and the cardiopulmonary exercise test: A systematic review. Int J Cardiol
[Internet] Elsevier; 2018 Jul 15 [cited 2018 Jun 4];263:88–93. [doi: 10.1016/J.IJCARD.2018.04.021]
57. Fitbit Inc. Fitbit Official Site for Activity Trackers & More [Internet]. 2016. Available from:
https://www.fitbit.com/en-ca/home (Archived by WebCite® at
http://www.webcitation.org/6zTITrK95)
58. Fitbit Inc. Fitbit Charge 2TM Heart Rate + Fitness Wristband [Internet]. 2018 [cited 2018 Apr 17].
Available from: https://client.fitbit.com/en-ca/charge2 (Archived by WebCite® at
http://www.webcitation.org/6zTIzBoj5)
59. Fitbit Inc. Fitbit Flex [Internet]. [cited 2018 Apr 17]. Available from: https://client.fitbit.com/en-
ca/shop/flex (Archived by WebCite® at http://www.webcitation.org/6zTIrGkAE)
60. Bromberg SE. Consumer wristband activity monitors as a simple and inexpensive tool for remote
heart failure monitoring [Internet]. [Toronto]: University of Toronto; 2015. Available from:
http://hdl.handle.net/1807/70232
61. Piwek L, Ellis DA, Andrews S, Joinson A. The Rise of Consumer Health Wearables: Promises and
Barriers. PLoS Med [Internet] Public Library of Science; 2016 Feb [cited 2016 Sep
20];13(2):e1001953. PMID:26836780
62. Attal F, Mohammed S, Dedabrishvili M, Chamroukhi F, Oukhellou L, Amirat Y. Physical Human
Activity Recognition Using Wearable Sensors. Sensors (Basel) [Internet] 2015;15(12):31314–38.
PMID:26690450
63. James CJ. Editorial: “Longer term monitoring through wearables brings with it the promise of
predicting the onset of disease - moving from managing illness to maintaining wellness.”. Healthc
Technol Lett [Internet] IET: Institution of Engineering and Technology; 2015 Feb [cited 2016 Sep
20];2(1):1. PMID:26609395
64. Apple Inc. Watch - Apple (CA) [Internet]. 2016. Available from:
https://www.apple.com/ca/watch/
65. Storm FA, Heller BW, Mazzà C. Step detection and activity recognition accuracy of seven
physical activity monitors. PLoS One [Internet] Public Library of Science; 2015 [cited 2018 May
7];10(3):e0118723. PMID:25789630
66. Fitbit Inc. Help article: How does my Fitbit device count steps? [Internet]. Fitbit Help. 2017 [cited
2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1143
67. Diaz KM, Krupka DJ, Chang MJ, Peacock J, Ma Y, Goldsmith J, Schwartz JE, Davidson KW.
Fitbit®: An accurate and reliable device for wireless physical activity tracking. Int J Cardiol. 2015.
PMID:25795203
68. Evenson KR, Goto MM, Furberg RD. Systematic review of the validity and reliability of
consumer-wearable activity trackers. Int J Behav Nutr Phys Act [Internet] 2015 Dec 18 [cited 2017
May 18];12(1):159. PMID:26684758
69. Al M. Personalization of energy expenditure and cardiorespiratory fitness estimation using
wearable sensors in supervised and ... Eindhoven
University of Technology; 2015.
70. Straiton N, Alharbi M, Bauman A, Neubeck L, Gullick J, Bhindi R, Gallagher R. The validity and
reliability of consumer-grade activity trackers in older, community-dwelling adults: A systematic
review. Maturitas [Internet] Elsevier; 2018 Jun 1 [cited 2018 Jul 30];112:85–93. [doi:
10.1016/J.MATURITAS.2018.03.016]
71. ActiGraph Corporation. ActiGraph [Internet]. [cited 2018 Jul 30]. Available from:
https://www.actigraphcorp.com/
72. Fitbit Inc. Help article: What should I know about my heart rate data? [Internet]. Fitbit Help.
2017 [cited 2017 Nov 7]. Available from: https://help.fitbit.com/articles/en_US/Help_article/1565
73. Kroll RR, Boyd JG, Maslove DM. Accuracy of a Wrist-Worn Wearable Device for Monitoring
Heart Rates in Hospital Inpatients: A Prospective Observational Study. J Med Internet Res
[Internet] 2016 [cited 2016 Sep 22];18(9):e253. PMID:27651304
74. Ra H-K, Ahn J, Yoon HJ, Yoon D, Son SH, Ko J. I am a “Smart” watch, Smart
Enough to Know the Accuracy of My Own Heart Rate Sensor. [cited 2017 May 15]; [doi:
10.1145/3032970.3032977]
75. Allen J. Photoplethysmography and its application in clinical physiological measurement. Physiol
Meas [Internet] 2007 [cited 2017 Nov 7];28:1–39. [doi: 10.1088/0967-3334/28/3/R01]
76. Alian AA, Shelley KH. Photoplethysmography. Best Pract Res Clin Anaesthesiol [Internet]
Baillière Tindall; 2014 Dec 1 [cited 2018 Jul 30];28(4):395–406. [doi: 10.1016/J.BPA.2014.08.006]
77. Maeda Y, Sekine M, Tamura T. The Advantages of Wearable Green Reflected
Photoplethysmography. J Med Syst [Internet] 2011 Oct 18 [cited 2018 Jul 30];35(5):829–834.
PMID:20703690
78. Wang R, Blackburn G, Desai M, Phelan D, Gillinov L, Houghtaling P, Gillinov M. Accuracy of
Wrist-Worn Heart Rate Monitors. JAMA Cardiol
[Internet] 2016 Oct 12 [cited 2016 Nov 10];313(6):625–626. [doi: 10.1001/jamacardio.2016.3340]
79. Cadmus-Bertram L, Gangnon R, Wirkus EJ, Thraen-Borowski KM, Gorzelitz-Liebhauser J. The
Accuracy of Heart Rate Monitoring by Some Wrist-Worn Activity Trackers. Ann Intern Med
[Internet] 2017;10–13. PMID:28395305
80. Cardioo Inc. Cardiio: Heart Rate Monitor (iOS App) [Internet]. Apple Inc; 2012. Available from:
https://itunes.apple.com/ca/app/cardiio-heart-rate-monitor/id542891434?mt=8
81. Laskowski ER. Heart rate: What’s normal? [Internet]. Mayo Clin. 2015 [cited 2018 Jul 31].
Available from: https://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/heart-
rate/faq-20057979
82. American Heart Association. All About Heart Rate (Pulse) [Internet]. Am Hear Assoc Website.
2015 [cited 2018 Jul 31]. Available from: https://www.heart.org/en/health-topics/high-blood-
pressure/the-facts-about-high-blood-pressure/all-about-heart-rate-pulse#.Wg1mcBO0OCU
83. Low CA, Bovbjerg DH, Ahrendt S, Choudry MH, Holtzman M, Jones HL, Pingpank JF,
Ramalingam L, Zeh HJ, Zureikat AH, Bartlett DL. Fitbit step counts during inpatient recovery
from cancer surgery as a predictor of readmission. Ann Behav Med [Internet] Oxford University
Press; 2018 Jan 5 [cited 2018 Jul 26];52(1):88–92. [doi: 10.1093/abm/kax022]
84. Hartman SJ, Nelson SH, Weiner LS. Patterns of Fitbit Use and Activity Levels Throughout a
Physical Activity Intervention: Exploratory Analysis from a Randomized Controlled Trial. JMIR
mHealth uHealth [Internet] JMIR mHealth and uHealth; 2018 Feb 5 [cited 2018 Mar 8];6(2):e29.
PMID:29402761
85. Wicklund E. Hospital’s mHealth Project Finds Value in Fitbit Data [Internet].
mHealthIntelligence. 2016 [cited 2018 Jul 26]. Available from:
https://mhealthintelligence.com/news/hospitals-diabetes-mhealth-project-finds-value-in-fitbit-data
86. Apple Inc. Apple Heart Study launches to identify irregular heart rhythms [Internet]. Apple
Newsroom. 2017 [cited 2018 Jul 31]. Available from:
https://www.apple.com/newsroom/2017/11/apple-heart-study-launches-to-identify-irregular-heart-
rhythms/
87. Eadicicco L. EXCLUSIVE: Fitbit Working On Atrial Fibrillation Detection [Internet]. Time. 2017
[cited 2018 Jul 31]. Available from: http://time.com/4907284/fitbit-detect-atrial-fibrillation/
88. Griffith E. When Your Fitbit Goes From Activity Tracker to Personal Medical Device [Internet].
Wired. 2018 [cited 2018 Jul 26]. Available from: https://www.wired.com/story/when-your-activity-
tracker-becomes-a-personal-medical-device/
89. Field MJ, Grigsby J. Telemedicine and Remote Patient Monitoring. JAMA [Internet] American
Medical Association; 2002 Jul 24 [cited 2018 Aug 1];288(4):423. [doi: 10.1001/jama.288.4.423]
90. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle
Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A
Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of
Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253
91. Noah B, Keller MS, Mosadeghi S, Stein L, Johl S, Delshad S, Tashjian VC, Lew D, Kwan JT,
Jusufagic A, Spiegel BMR. Impact of remote patient monitoring on clinical outcomes: an updated
meta-analysis of randomized controlled trials. npj Digit Med [Internet] Nature Publishing Group;
2018 Dec 15 [cited 2018 Aug 1];1(1):20172. [doi: 10.1038/s41746-017-0002-4]
92. Hanlon P, Daines L, Campbell C, McKinstry B, Weller D, Pinnock H. Telehealth Interventions to
Support Self-Management of Long-Term Conditions: A Systematic Metareview of Diabetes, Heart
Failure, Asthma, Chronic Obstructive Pulmonary Disease, and Cancer. J Med Internet Res
[Internet] Journal of Medical Internet Research; 2017 May 17 [cited 2017 May 18];19(5):e172. [doi:
10.2196/jmir.6688]
93. Hargreaves S, Hawley MS, Haywood A, Enderby PM. Informing the Design of “Lifestyle
Monitoring” Technology for the Detection of Health Deterioration in Long-Term Conditions: A
Qualitative Study of People Living With Heart Failure. J Med Internet Res [Internet] Journal of
Medical Internet Research; 2017 Jun 28 [cited 2017 Jun 30];19(6):e231. PMID:28659253
94. Clark RA, Inglis SC, McAlister FA, Cleland JGF, Stewart S. Telemonitoring or structured
telephone support programmes for patients with chronic heart failure: systematic review and meta-
analysis. BMJ [Internet] 2007 May 5 [cited 2018 Apr 4];334(7600):942. PMID:17426062
95. Ware P, Ross HJ, Cafazzo JA, Laporte A, Gordon K, Seto E. Evaluating the Implementation of a
Mobile Phone–Based Telemonitoring Program: Longitudinal Study Guided by the Consolidated
Framework for Implementation Research. JMIR mHealth uHealth [Internet] JMIR mHealth and
uHealth; 2018 Jul 31 [cited 2018 Aug 1];6(7):e10768. [doi: 10.2196/10768]
96. Yun JE, Park J-E, Park H-Y, Lee H-Y, Park D-A. Comparative Effectiveness of Telemonitoring
Versus Usual Care for Heart Failure: A Systematic Review and Meta-analysis. J Card Fail
[Internet] 2018 Jan [cited 2018 Aug 1];24(1):19–28. [doi: 10.1016/j.cardfail.2017.09.006]
97. Klersy C, De Silvestri A, Gabutti G, Raisaro A, Curti M, Regoli F, Auricchio A. Economic impact
of remote patient monitoring: an integrated economic model derived from a meta-analysis of
randomized controlled trials in heart failure. Eur J Heart Fail [Internet] Wiley-Blackwell; 2011 Apr
1 [cited 2018 Aug 1];13(4):450–459. [doi: 10.1093/eurjhf/hfq232]
98. Ong MK, Romano PS, Edgington S, Aronow HU, Auerbach AD, Black JT, De Marco T, Escarce
JJ, Evangelista LS, Hanna B, Ganiats TG, Greenberg BH, Greenfield S, Kaplan SH, Kimchi A,
Liu H, Lombardo D, Mangione CM, Sadeghi B, Sadeghi B, Sarrafzadeh M, Tong K, Fonarow GC.
Effectiveness of Remote Patient Monitoring After Discharge of Hospitalized Patients With Heart
Failure. JAMA Intern Med [Internet] American Medical Association; 2016 Mar 1 [cited 2018 Aug
1];176(3):310. [doi: 10.1001/jamainternmed.2015.7712]
99. Chaudhry SI, Mattera JA, Curtis JP, Spertus JA, Herrin J, Lin Z, Phillips CO, Hodshon B V.,
Cooper LS, Krumholz HM. Telemonitoring in Patients with Heart Failure. N Engl J Med
[Internet] Massachusetts Medical Society ; 2010 Dec 9 [cited 2018 Aug 1];363(24):2301–2309. [doi:
10.1056/NEJMoa1010029]
100. Ware P, Seto E, Ross HJ. Accounting for Complexity in Home Telemonitoring: A Need for
Context-Centred Evidence. Can J Cardiol [Internet] Elsevier; 2018 Jul 1 [cited 2018 Aug
1];34(7):897–904. [doi: 10.1016/J.CJCA.2018.01.022]
101. Centre for Global eHealth Innovation. Medly - Chronic Complex Diseases Self-care Management
[Internet]. 2016 [cited 2016 Oct 30]. Available from: http://ehealthinnovation.org/what-we-
do/projects/medly/
102. Healthcare Human Factors. Medly: Managing Chronic Conditions [Internet]. 2016 [cited 2016 Oct
30]. Available from: http://humanfactors.ca/projects/medly/
103. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Mobile phone-based
telemonitoring for heart failure management: a randomized controlled trial. J Med Internet Res
2012;14(1):1–14. PMID:22356799
104. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Developing healthcare rule-
based expert systems: Case study of a heart failure telemonitoring system. Int J Med Inform
[Internet] Elsevier Ireland Ltd; 2012;81(8):556–565. PMID:22465288
105. Seto E, Leonard KJ, Masino C, Cafazzo JA, Barnsley J, Ross HJ. Attitudes of heart failure
patients and health care providers towards mobile phone-based remote monitoring. J Med Internet
Res 2010;12(4):3–12. PMID:21115435
106. Smith C, McGuire B, Huang T, Yang G. The History of Artificial Intelligence [Internet]. Seattle:
University of Washington; 2006 [cited 2018 Apr 4]. p. 27. Available from:
https://courses.cs.washington.edu/courses/csep590/06au/projects/history-ai.pdf
107. Anyoha R. The History of Artificial Intelligence [Internet]. Sci News. 2017 [cited 2018 Aug 4].
Available from: http://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/
108. McCarthy J, Minsky ML, Rochester N, Shannon CE. A Proposal for the Dartmouth Summer
Research Project on Artificial Intelligence [Internet]. Dartmouth; 1955 [cited 2018 Aug 4].
Available from: http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
109. Coward C. AI and the Ghost in the Machine [Internet]. hackaday. 2017 [cited 2018 Aug 4].
Available from: https://hackaday.com/2017/02/06/ai-and-the-ghost-in-the-machine/
110. Shu-Hsien Liao. Expert system methodologies and applications—a decade review from 1995 to
2004. Expert Syst Appl [Internet] Pergamon; 2005 Jan 1 [cited 2018 Aug 4];28(1):93–103. [doi:
10.1016/J.ESWA.2004.08.003]
111. Segaran T. Programming collective intelligence : building smart web 2.0 applications. O’Reilly;
2007. ISBN:9780596529321
112. Brownlee J. Supervised and Unsupervised Machine Learning Algorithms [Internet]. Mach Learn
Mastery. 2016 [cited 2018 Aug 6]. Available from:
https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
113. Alpaydin E. Introduction to Machine Learning (Adaptive Computation and Machine Learning)
[Internet]. MIT Press; 2004 [cited 2018 Aug 6]. Available from:
https://dl.acm.org/citation.cfm?id=1036287 ISBN:0262012111
114. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai
M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, Van Den Driessche G, Graepel T, Hassabis D.
Mastering the game of Go without human knowledge. Nature [Internet] Nature Publishing Group;
2017;550(7676):354–359. PMID:29052630
115. OpenAI Five [Internet]. OpenAI. 2018 [cited 2018 Aug 6]. Available from:
https://blog.openai.com/openai-five/
116. Savov V. The OpenAI Dota 2 bots just defeated a team of former pros [Internet]. The Verge. 2018
[cited 2018 Aug 6]. Available from: https://www.theverge.com/2018/8/6/17655086/dota2-openai-
bots-professional-gaming-ai
117. Thompson T. Zerg Rush: A History of StarCraft AI Research [Internet]. Medium. 2018 [cited 2018
Aug 6]. Available from: https://medium.com/@t2thompson/zerg-rush-a-history-of-starcraft-ai-
research-4478759a3c53
118. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition.
Proc IEEE [Internet] 1989 [cited 2017 Aug 28];77(2):257–286. [doi: 10.1109/5.18626]
119. Visser I, Raijmakers MEJ, van der Maas HLJ. Hidden Markov Models for Individual Time Series.
In: Valsiner J, Molenaar PCM, Lyra MCDP, Chaudhary N, editors. Dyn Process Methodol Soc
Dev Sci 2009. p. 269–289. PMID:25246403
120. Iskandar J. RPubs - Classifying Seizure State (using R package depmixS4) [Internet]. RPubs; 2014
[cited 2017 Aug 30]. p. 6. Available from: https://rpubs.com/jimmyiskandar/30484
121. Mannini A, Sabatini AM. Machine Learning Methods for Classifying Human Physical Activity
from On-Body Accelerometers. Sensors [Internet] Molecular Diversity Preservation International;
2010 Feb 1 [cited 2017 Aug 22];10(2):1154–1175. [doi: 10.3390/s100201154]
122. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for
classification performance. BMC Med Inform Decis Mak [Internet] 2012 Dec 15 [cited 2017 Oct
7];12(1):8. [doi: 10.1186/1472-6947-12-8]
123. Brownlee J. How Much Training Data is Required for Machine Learning? [Internet]. Mach Learn
Mastery. 2017 [cited 2017 Oct 7]. Available from: https://machinelearningmastery.com/much-
training-data-required-machine-learning/
124. Denham L. Aren’t The IoT, Big Data And Machine Learning The Same? [Internet]. Innov Enterp.
2017 [cited 2018 Aug 20]. Available from:
https://channels.theinnovationenterprise.com/articles/aren-t-the-iot-big-data-and-machine-
learning-the-same
125. Beleites C, Neugebauer U, Bocklitz T, Krafft C, Popp J. Sample size planning for classification
models. Anal Chim Acta [Internet] Elsevier; 2013 Jan 14 [cited 2017 Oct 7];760:25–33. [doi:
10.1016/j.aca.2012.11.007]
126. van der Ploeg T, Austin PC, Steyerberg EW. Modern modelling techniques are data hungry: a
simulation study for predicting dichotomous endpoints. BMC Med Res Methodol [Internet] 2014
Dec 22 [cited 2018 Aug 20];14(1):137. [doi: 10.1186/1471-2288-14-137]
127. Tripoliti EE, Papadopoulos TG, Karanasiou GS, Naka KK, Fotiadis DI. Heart Failure: Diagnosis,
Severity Estimation and Prediction of Adverse Events Through Machine Learning Techniques.
Comput Struct Biotechnol J [Internet] 2017 [cited 2017 Oct 7];15:26–47. [doi:
10.1016/j.csbj.2016.11.001]
128. Pecchia L, Melillo P, Bracale M. Remote Health Monitoring of Heart Failure With Data Mining
via CART Method on HRV Features. IEEE Trans Biomed Eng [Internet] 2011 Mar [cited 2018
Aug 6];58(3):800–804. [doi: 10.1109/TBME.2010.2092776]
129. Shaffer F, Ginsberg JP. An Overview of Heart Rate Variability Metrics and Norms. Front public
Heal [Internet] Frontiers Media SA; 2017 [cited 2018 Aug 7];5:258. PMID:29034226
130. Melillo P, Fusco R, Sansone M, Bracale M, Pecchia L. Discrimination power of long-term heart
rate variability measures for chronic heart failure detection. Med Biol Eng Comput [Internet]
Springer-Verlag; 2011 Jan 4 [cited 2018 Aug 6];49(1):67–74. [doi: 10.1007/s11517-010-0728-5]
131. Pecchia L, Melillo P, Sansone M, Bracale M. Discrimination Power of Short-Term Heart Rate
Variability Measures for CHF Assessment. IEEE Trans Inf Technol Biomed [Internet] 2011 Jan
[cited 2018 Aug 6];15(1):40–46. [doi: 10.1109/TITB.2010.2091647]
132. Panina G, Khot UN, Nunziata E, Cody RJ, Binkley PF. Role of spectral measures of heart rate
variability as markers of disease progression in patients with chronic congestive heart failure not
treated with angiotensin-converting enzyme inhibitors. Am Heart J [Internet] Mosby; 1996 Jan 1
[cited 2018 Aug 6];131(1):153–157. [doi: 10.1016/S0002-8703(96)90064-2]
133. Mietus JE, Peng C-K, Henry I, Goldsmith RL, Goldberger AL. The pNNx files: re-examining a
widely used heart rate variability measure. Heart [Internet] BMJ Publishing Group Ltd; 2002 Oct
1 [cited 2018 Aug 6];88(4):378–80. PMID:12231596
134. Casolo GC, Stroder P, Sulla A, Chelucci A, Freni A, Zerauschek M. Heart rate variability and
functional severity of congestive heart failure secondary to coronary artery disease. Eur Heart J
[Internet] Oxford University Press; 1995 Mar 1 [cited 2018 Aug 6];16(3):360–367. [doi:
10.1093/oxfordjournals.eurheartj.a060919]
135. Goldsmith R. Congestive Heart Failure RR Interval Database [Internet]. [cited 2018 Aug 6]. [doi:
10.13026/C2F598]
136. Melillo P, De Luca N, Bracale M, Pecchia L. Classification Tree for Risk Assessment in Patients
Suffering From Congestive Heart Failure via Long-Term Heart Rate Variability. IEEE J Biomed
Heal Informatics [Internet] 2013 May [cited 2018 Aug 6];17(3):727–733. [doi:
10.1109/JBHI.2013.2244902]
137. Beth Israel Deaconess Medical Center. The BIDMC Congestive Heart Failure Database [Internet].
PhysioNet. 1986 [cited 2018 Aug 6]. [doi: 10.13026/C29G60]
138. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Plamen CI, Mark RG, Mietus JE, Moody
GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet Components of a New
Research Resource for Complex Physiologic Signals. Circulation [Internet] 2000 [cited 2018 Aug
6];(101):215–220. [doi: 10.1161/circ.101.23.e215]
139. Witten IH (Ian H., Frank E, Hall MA (Mark A, Pal CJ. Data mining : practical machine learning
tools and techniques. ISBN:9780128042915
140. Vanwinckelen G, Blockeel H. On Estimating Model Accuracy with Repeated Cross-Validation.
[cited 2018 Apr 25]; Available from:
https://lirias.kuleuven.be/bitstream/123456789/346385/3/OnEstimatingModelAccuracy.pdf
141. Forman G, Scholz M. Apples-to-Apples in Cross-Validation Studies: Pitfalls in Classifier
Performance Measurement. SIGKDD Explor [Internet] 2010 [cited 2017 Nov 3];12(1):49–57.
Available from: http://www.kdd.org/exploration_files/v12-1-p49-forman-sigkdd.pdf
142. Shahbazi F, Asl BM. Generalized discriminant analysis for congestive heart failure risk assessment
based on long-term heart rate variability. Comput Methods Programs Biomed [Internet] Elsevier;
2015 Nov 1 [cited 2018 Aug 6];122(2):191–198. [doi: 10.1016/J.CMPB.2015.08.007]
143. Baudat G, Anouar F. Generalized Discriminant Analysis Using a Kernel Approach. Neural
Comput [Internet] MIT Press; 2000 Oct 13 [cited 2018 Aug 6];12(10):2385–2404. [doi:
10.1162/089976600300014980]
144. Fluss R, Faraggi D, Reiser B. Estimation of the Youden Index and its associated cutoff point.
Biom J [Internet] 2005 Aug [cited 2018 Aug 7];47(4):458–72. PMID:16161804
145. Guiqiu Yang, Yinzi Ren, Qing Pan, Gangmin Ning, Shijin Gong, Guolong Cai, Zhaocai Zhang, Li
Li, Jing Yan. A heart failure diagnosis model based on support vector machine. 2010 3rd Int Conf
Biomed Eng Informatics [Internet] IEEE; 2010 [cited 2018 Aug 6]. p. 1105–1108. [doi:
10.1109/BMEI.2010.5639619]
146. Wu H-T, Soliman EZ. A new approach for analysis of heart rate variability and QT variability in
long-term ECG recording. Biomed Eng Online [Internet] BioMed Central; 2018 Dec 3 [cited 2018
Aug 7];17(1):54. [doi: 10.1186/s12938-018-0490-8]
147. Pang D, Igasaki T, Maehara J. Long-term monitoring of heart rate variability toward practical use
in intensive/high care unit. 2016 9th Biomed Eng Int Conf [Internet] IEEE; 2016 [cited 2018 Aug
7]. p. 1–6. [doi: 10.1109/BMEiCON.2016.7859631]
148. Baril J-F, Bromberg S, Moayedi Y, Taati B, Manlhiot C, Ross HJ, Cafazzo J. Use of free-living
step count monitoring for heart failure functional classification: a validation study. Toronto: JMIR
Cardio; 2018. [doi: 10.2196/preprints.12122]
149. Stein KM, Mittal S, Merkel S, Meye TE. Baseline Physical Activity and NYHA Classification
Affects Future Ventricular Event Rates in a General ICD Population. J Card Fail [Internet]
Churchill Livingstone; 2006 Aug 1 [cited 2017 Oct 13];12(6):S58. [doi:
10.1016/J.CARDFAIL.2006.06.203]
150. Bromberg SE. googlefitbit [Internet]. Toronto; 2015. Available from:
https://github.com/simonbromberg/googlefitbit
151. R Core Team. R: A Language and Environment for Statistical Computing [Internet]. Vienna,
Austria; 2017. Available from: https://www.r-project.org
152. RStudio Team. RStudio: Integrated Development Environment for R [Internet]. Boston, MA;
2015. Available from: http://www.rstudio.com/
153. Wickham H. A Layered Grammar of Graphics. 2010 [cited 2017 May 31]; [doi:
10.1198/jcgs.2009.07098]
154. Arnold JB. ggthemes: Extra Themes, Scales and Geoms for “ggplot2” [Internet]. 2017. Available
from: https://cran.r-project.org/package=ggthemes
155. Wickham H. The Split-Apply-Combine Strategy for Data Analysis. J Stat Softw [Internet]
2011;40(1):1–29. Available from: http://www.jstatsoft.org/v40/i01/
156. Wickham H, Francois R, Henry L, Müller K. dplyr: A Grammar of Data Manipulation [Internet].
2017. Available from: https://cran.r-project.org/package=dplyr
157. Wickham H. Reshaping Data with the {reshape} Package. J Stat Softw [Internet] 2007;21(12):1–
20. Available from: http://www.jstatsoft.org/v21/i12/
158. Hester J. glue: Interpreted String Literals [Internet]. 2017. Available from: https://cran.r-
project.org/package=glue
159. Seto E, Leonard KJ, Cafazzo JA, Barnsley J, Masino C, Ross HJ. Perceptions and experiences of
heart failure patients and clinicians on the use of mobile phone-based telemonitoring. J Med
Internet Res 2012;14(1):1–15. PMID:22328237
160. Intel Corporation. Safety Recall Notice for all Basis PeakTM Watches [Internet]. 2018 [cited 2018
Aug 13]. Available from:
https://www.intel.ca/content/www/ca/en/support/articles/000025310/emerging-
technologies/wearable-devices.html
161. Somerville H. Jawbone’s demise a case of “death by overfunding” in Silicon Valley | Reuters
[Internet]. Thomson Reuters. 2018 [cited 2018 Aug 14]. Available from:
https://www.reuters.com/article/us-jawbone-failure/jawbones-demise-a-case-of-death-by-
overfunding-in-silicon-valley-idUSKBN19V0BS
162. Alharbi M, Straiton N, Gallagher R. Harnessing the Potential of Wearable Activity Trackers for
Heart Failure Self-Care. [cited 2017 May 15]; [doi: 10.1007/s11897-017-0318-z]
163. Apple Inc. HealthKit - Apple Developer [Internet]. 2018 [cited 2018 Aug 14]. Available from:
https://developer.apple.com/healthkit/
164. empatica. E4 wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.empatica.com/research/e4/
165. Fitbit Inc. Fitbit SDK [Internet]. 2018. Available from: https://dev.fitbit.com/
166. Fitbit Inc. AltaHR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/altahr
167. Fitbit AltaTM Fitness Wristband [Internet]. [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/alta
168. Fitbit Inc. Fitbit Flex 2TM Fitness Wristband [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/flex2
169. Fitbit Inc. Fitbit IonicTM Watch [Internet]. 2018. [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/ionic
170. Fitbit Inc. Fitbit Versa [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.fitbit.com/en-ca/versa
171. Garmin. Home | Garmin Developers [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://developer.garmin.com/
172. Garmin. fenix 5 [Internet]. 2018 [cited 2018 Aug 13]. Available from: https://buy.garmin.com/en-
CA/CA/p/552982
173. Garmin. vivosmart [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://buy.garmin.com/en-US/US/p/154886
174. Google Developers. Google Fit [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://developers.google.com/fit/
175. Huawei Technology Co. Ltd. Huawei Watch 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://consumer.huawei.com/ca/wearables/watch2/
176. LG Electronics. LG Smart Watch Sport for AT&T With Android Wear 2.0 | LG USA [Internet].
2018 [cited 2018 Aug 13]. Available from: https://www.lg.com/us/smart-watches/lg-W280A-sport
177. mc10. BiostampRC System [Internet]. Available from: https://www.mc10inc.com/our-
products/biostamprc
178. Misfit. Build @ Misfit [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://build.misfit.com/
179. Misfit. Misfit Flare [Internet]. 2018. Available from: https://misfit.com/misfit-flare
180. Misfit. Misfit Phase [Internet]. 2018. Available from: https://misfit.com/misfit-phase
181. Misfit. Misfit Ray [Internet]. 2018. Available from: https://misfit.com/misfit-ray
182. Misfit. Misfit Shine. 2018.
183. Misfit. Misfit Shine 2 [Internet]. 2018. Available from: https://misfit.com/misfit-shine-2
184. Misfit. Misfit Vapor [Internet]. 2018. Available from: https://misfit.com/misfit-vapor
185. Moov Inc. Moov HR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://welcome.moov.cc/moovhr/
186. Moov Inc. Moov Now [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://welcome.moov.cc/moovnow/
187. Nokia. Nokia Health API [Internet]. 2018 [cited 2018 Aug 13]. Available from:
http://developer.health.nokia.com/oauth2/
188. Nokia | Withings. Nokia Go [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/go
189. Nokia | Withings. Nokia Steel [Internet]. [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/steel
190. Nokia | Withings. Nokia Steel HR [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://health.nokia.com/ca/en/steel-hr
191. TomTom Sports Team. TomTom Sports Cloud [Internet]. 2018. Available from:
https://developer.tomtom.com/tomtom-sports-cloud
192. TomTom. TomTom Spark 3 Cardio + Music GPS Fitness Watch [Internet]. 2018 [cited 2018 Aug
13]. Available from: https://www.tomtom.com/en_ca/sports/fitness-trackers/gps-fitness-watch-
cardio-music-spark3/black-large/
193. TomTom. TomTom Touch Fitness Tracker [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.tomtom.com/en_ca/sports/fitness-trackers/fitness-tracker-touch/black-large/
194. Under Armour I. Under Armour UA Band [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.underarmour.com/en-ca/ua-band
195. Wavelet Health. Products [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://wavelethealth.com/products/
196. MI. Mi Band [Internet]. 2018. [cited 2018 Aug 13]. Available from:
https://www.mi.com/en/miband/
197. MI. Mi Band 2 [Internet]. 2018 [cited 2018 Aug 13]. Available from:
https://www.mi.com/en/miband2/
198. Baril J-F. fitbit4research [Internet]. Toronto; 2018 [cited 2018 Aug 16]. Available from:
https://github.com/cosmomeese/fitbit4research
199. Tufte ER. The visual display of quantitative information. Graphics Press; 2001. ISBN:1930824130
200. Wong DM. The Wall Street journal guide to information graphics : the dos and don’ts of
presenting data, facts, and figures. ISBN:0393347281
201. Tufte ER, McKay SR, Christian W, Matey JR. Visual Explanations: Images and Quantities,
Evidence and Narrative. Comput Phys 1998; PMID:1659109
202. Zhang J, Johnson TR, Patel VL, Paige DL, Kubose T. Using usability heuristics to evaluate
patient safety of medical devices. 2003;36:23–30. [doi: 10.1016/S1532-0464(03)00060-1]
203. Tognazzini B. First Principles of Interaction Design (Revised & Expanded) | askTog [Internet].
askTog.com. [cited 2017 Jan 13]. Available from: http://asktog.com/atc/principles-of-interaction-
design/
204. Nielsen J. 10 Heuristics for User Interface Design [Internet]. Nielsen Norman Gr. 1995 [cited 2017
Jan 13]. Available from: https://www.nngroup.com/articles/ten-usability-heuristics/
205. Norman DA. The Design of Everyday Things [Internet]. Hum Factors Ergon Manuf. 2013.
PMID:13182255ISBN:0465067107
206. Laussen PC, Almodovar M, Goodwin A, Sick Kids: The Hospital for Sick Children. T3 - Tracking,
trajectory and trigger tool [Internet]. Crit Care Med Programs Serv. 2018. Available from:
http://www.sickkids.ca/Critical-Care/programs-and-services/T3/index.html
207. Laussen PC. Precision monitoring. Crit Care Canada Forum [Internet] Toronto; 2015 [cited 2018
Aug 15]. Available from:
https://criticalcarecanada.com/presentations/2015/precision_monitoring.pdf
208. Guerguerian A-M. BME1439 Critical Care Instrumentation Lecture. Toronto; 2016.
209. Fitbit Inc. Accessing the Fitbit API [Internet]. Fitbit Dev Website. 2018. Available from:
https://dev.fitbit.com/build/reference/web-api/oauth2/
210. Fitbit Inc. Fitbit Platform Terms of Service (Revised August 1st, 2018) [Internet]. Fitbit Dev
Website. 2018. Available from: https://dev.fitbit.com/legal/platform-terms-of-service/
211. Canadian Radio-television and Telecommunications Commission. Communications Monitoring
Report 2017: Canada’s Communication System: An Overview for Canadians (Table 2.0.6)
[Internet]. Ottawa; 2017. Available from:
https://crtc.gc.ca/eng/publications/reports/policymonitoring/2017/cmr2.htm#s20i
212. Mobile Operating System Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29].
Available from: http://gs.statcounter.com/os-market-share/mobile/canada/#monthly-201706-
201711
213. Mobile iOS Version Market Share Canada [Internet]. StatCounter. 2017 [cited 2017 Nov 29].
Available from: http://gs.statcounter.com/ios-version-market-share/mobile/canada/#monthly-
201611-201711
214. Hermsen S, Moons J, Kerkhof P, Wiekens C, De Groot M. Determinants for Sustained Use of an
Activity Tracker: Observational Study. JMIR mHealth uHealth [Internet] JMIR Publications Inc.;
2017 Oct 30 [cited 2018 Aug 18];5(10):e164. PMID:29084709
215. Cafazzo J, St-Cyr O. From Discovery to Design: The Evolution of Human Factors in Healthcare.
Healthc Q [Internet] 2012 Apr 11 [cited 2018 Aug 18];15(sp):24–29. [doi: 10.12927/hcq.2012.22845]
216. Canadian Patient Safety Institute, Institute for Safe Medication Practices Canada, Saskatchewan
Health, Patients for Patient Safety Canada, Beard P, Hoffman CE, Ste-Marie M. Canadian
Incident Analysis Framework [Internet]. Edmonton, AB; 2012. Available from:
http://www.patientsafetyinstitute.ca/en/toolsResources/PatientSafetyIncidentManagementToolkit
/Documents/CIAF Key Features - Analysis Process.pdf
217. Wickham H. tidyverse: Easily Install and Load the “Tidyverse” [Internet]. 2017. Available from:
https://cran.r-project.org/package=tidyverse
218. Wolf HP. aplpack: Another Plot Package: “Bagplots”, “Iconplots”, “Summaryplots”, Slider
Functions and Others [Internet]. 2018 [cited 2018 Aug 17]. Available from: https://cran.r-
project.org/web/packages/aplpack/index.html
219. Champely S. PairedData: Paired Data Analysis [Internet]. 2018 [cited 2018 Aug 17]. Available
from: https://cran.r-project.org/web/packages/PairedData/index.html
220. Jurafsky D, Martin J. Hidden Markov Models. Speech Lang Process [Internet] 3rd ed Pearson;
2017 [cited 2017 Nov 11]. p. 21. Available from: https://web.stanford.edu/~jurafsky/slp3/9.pdf
221. Bobick A, Essa I, Chakraborty A, Udacity. Markov Models [Internet]. Udacity Introd to Comput
Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from:
https://www.youtube.com/watch?v=4XqWadvEj2k
222. Gagniuc PA. Markov chains: from theory to implementation and experimentation. 1st ed. John
Wiley and Sons, Inc; 2017. [doi: 10.1002/9781119387596]ISBN:9781119387558
223. O’Connell J, Højsgaard S. Hidden Semi Markov Models for Multiple Observation Sequences: The
mhsmm Package for R. J Stat Softw [Internet] 2011 [cited 2017 Nov 1];39(4):1–22. [doi:
10.18637/jss.v039.i04]
224. Bobick A, Essa I, Chakraborty A, Udacity. Hidden Markov Models [Internet]. Udacity Introd to
Comput Vis. YouTube; 2015 [cited 2017 Nov 11]. Available from:
https://www.youtube.com/watch?v=5araDjcBHMQ
225. O’Connell J, Højsgaard S. Package “mhsmm.” CRAN 2017;(0.4.16).
226. Altman RM. Mixed Hidden Markov Models: An Extension of the Hidden Markov Model to the
Longitudinal Data Setting. J Am Stat Assoc [Internet] 2007 [cited 2017 Aug 28];102(477):201–210.
[doi: 10.1198/016214506000001086]
227. Visser I, Speekenbrink M. depmixS4: An R Package for Hidden Markov Models [Internet].
Available from: http://cran.r-project.org/package=depmixS4.
228. Visser I, Speekenbrink M. depmixS4: Dependent Mixture Models - Hidden Markov Models of
GLMs and Other Distributions in S4 [Internet]. 2016 [cited 2018 Aug 23]. Available from:
https://cran.r-project.org/web/packages/depmixS4/index.html
229. Rohan. Can something be statistically impossible? [Internet]. Math Stack Exch. 2016 [cited 2018
Aug 24]. Available from: https://math.stackexchange.com/q/2049722
230. Pohlmann KC. Principles of digital audio. McGraw-Hill; 2011. ISBN:9780071663465
231. Farmer WC, editor. Ordnance Field Guide: Restricted, Volume 2 [Internet]. Military service
publishing company; 1944 [cited 2018 Aug 24]. Available from:
https://books.google.ca/books?id=15ffO4UVw8QC&q=dither&redir_esc=y
232. Analog Devices. A Technical Tutorial on Digital Signal Synthesis [Internet]. 1999. Available from:
http://www.analog.com/media/cn/training-seminars/tutorials/450968421DDS_Tutorial_rev12-2-
99.pdf
233. Mannix BF. Races, Rushes, and Runs: Taming the Turbulence in Financial Trading [Internet].
Washington; 2013. Available from: www.regulatorystudies.gwu.edu
234. Floyd RW, Steinberg L. An Adaptive Algorithm for Spatial Greyscale. Proc Soc Inf Disp
1976;17(2):75–77.
235. Roberts LG. Picture Coding Using Pseudo-Random Noise. IRE Trans Inf Theory 1962;8(2):145–
154. [doi: 10.1109/TIT.1962.1057702]
236. Wikipedia Contributors. Dither [Internet]. Wikipedia, Free Encycl. 2018 [cited 2018 Aug 24].
Available from: https://en.wikipedia.org/wiki/Dither
237. Fox J. Generalized Linear Models. Appl Regres Gen Linear Model [Internet] SAGE Publications;
2015 [cited 2018 Aug 27]. p. 379–424. Available from: http://kilpatrick.eeb.ucsc.edu/wp-
content/uploads/2015/04/GLMs-Chapter_15.pdf
238. Rigollet P. Lecture 21. Generalized Linear Models from MIT 18.650: Statistics for Applications
[Internet]. YouTube; 2016 [cited 2018 Aug 27]. Available from:
https://www.youtube.com/watch?v=X-ix97pw0xY
239. Gao J, Fan W, Han J. On the Power of Ensemble: Supervised and Unsupervised Methods
Reconciled. Tutor SIAM Data Min Conf [Internet] Columbus, OH; 2010 [cited 2018 Aug 27].
Available from: https://cse.buffalo.edu/~jing/sdm10ensemble.htm
240. Grover P. Gradient Boosting from scratch [Internet]. ML Rev. 2017 [cited 2018 Aug 27]. Available
from: https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
241. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015;521(7553):436–444.
PMID:26017442
242. Parloff R. The AI Revolution: Why Deep Learning Is Suddenly Changing Your Life [Internet].
Fortune. 2016 [cited 2018 Aug 29]. Available from: http://fortune.com/ai-artificial-intelligence-
deep-machine-learning/
243. Goodfellow I, Bengio Y, Courville A. Deep Learning [Internet]. 2016. Available from:
http://www.deeplearningbook.org
244. Zekić-Sušac M, Šarlija N, Pfeifer S. Combining PCA Analysis And Artificial Neural Networks In
Modelling Entrepreneurial Intentions Of Students. Croat Oper Res Rev [Internet] 2013 Feb 1
[cited 2018 Aug 29];4(1):306–317. Available from:
https://hrcak.srce.hr/index.php?id_clanak_jezik=143365&show=clanak
245. Seuret M, Alberti M, Ingold R, Liwicki M. PCA-Initialized Deep Neural Networks Applied To
Document Image Analysis [Internet]. Available from: https://arxiv.org/pdf/1702.00177.pdf
246. Marsupial D. Does Neural Networks based classification need a dimension reduction [Internet].
Cross Validated. 2013 [cited 2018 Aug 29]. Available from:
https://stats.stackexchange.com/q/67988
247. Hartmann WM. Dimension Reduction vs. Variable Selection. Springer, Berlin, Heidelberg; 2006
[cited 2018 Aug 29]. p. 931–938. [doi: 10.1007/11558958_113]
248. Sorzano COS, Vargas J, Pascual-Montano A. A survey of dimensionality reduction techniques
[Internet]. [doi: arXiv:1403.2877]
249. Turck N, Vutskits L, Sanchez-Pena P, Robin X, Hainard A, Gex-Fabry M, Fouda C, Bassem H,
Mueller M, Lisacek F, Puybasset L, Sanchez J-C. pROC: an open-source package for R and S+ to
analyze and compare ROC curves. BMC Bioinformatics [Internet] BioMed Central; 2011 Mar 17
[cited 2017 Nov 1];12(77). [doi: 10.1007/s00134-009-1641-y]
250. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M, Siegert S. Package
“pROC.” CRAN [Internet] 2017 [cited 2017 Nov 1];(1.10). Available from: https://cran.r-
project.org/web/packages/pROC/pROC.pdf
251. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. caret:
Classification and Regression Training [Internet]. 2017. Available from: https://cran.r-
project.org/package=caret
252. Kuhn M. Predictive Modeling with R and the caret Package. useR! R User Conf [Internet]
Albacete, Spain; 2013 [cited 2018 Aug 21]. Available from: http://www.edii.uclm.es/~useR-
2013/Tutorials/kuhn/user_caret_2up.pdf
253. Lumley T, Miller A. leaps: Regression Subset Selection [Internet]. 2017. Available from:
https://cran.r-project.org/package=leaps
254. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T.
preProcess function [Internet]. R Doc. 2017 [cited 2018 Aug 30]. Available from:
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/preProcess
255. Schwarz G. Estimating the Dimension of a Model. Ann Stat [Internet] Institute of Mathematical
Statistics; 1978 Mar [cited 2018 Aug 30];6(2):461–464. [doi: 10.1214/aos/1176344136]
256. Refaeilzadeh P, Tang L, Liu H. Cross-Validation. In: Liu L, Özsu MT, editors. Encycl Database
Syst [Internet] Boston, MA: Springer US; 2009 [cited 2018 Aug 25]. p. 532–538. [doi: 10.1007/978-
0-387-39940-9_565]
257. Zemel R. Ensemble Methods from University of Toronto CSC411 Machine Learning & Data
Mining [Internet]. Toronto; 2014. Available from:
http://www.cs.toronto.edu/~rsalakhu/CSC411/notes/lecture_ensemble1.pdf
258. Ng A. Machine Learning Yearning: Technical Strategy for AI Engineers in the Era of Deep
Learning [draft] [Internet]. draft. deeplearning.ai. 2018. Available from:
https://gallery.mailchimp.com/dc3a7ef4d750c0abfc19202a3/files/704291d2-365e-45bf-a9f5-
719959dfe415/Ng_MLY01.pdf
259. Brownlee J. Gentle Introduction to the Bias-Variance Trade-Off in Machine Learning [Internet].
Mach Learn Mastery. 2016 [cited 2018 Aug 25]. Available from:
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-
learning/
260. Geng D, Shih S. Machine Learning Crash Course: Part 4 - The Bias-Variance Dilemma [Internet].
Mach Learn @ Berkeley. 2017 [cited 2018 Aug 25]. Available from:
https://ml.berkeley.edu/blog/2017/07/13/tutorial-4/
261. Sicotte XB. Bias and variance in leave-one-out vs K-fold cross validation [Internet]. Cross
Validated. 2018 [cited 2018 Aug 25]. Available from: https://stats.stackexchange.com/q/357749
262. Little MA, Varoquaux G, Saeb S, Lonini L, Jayaraman A, Mohr DC, Kording KP. Using and
understanding cross-validation strategies. Perspectives on Saeb et al. Gigascience [Internet] Oxford
University Press; 2017 May 1 [cited 2018 Aug 25];6(5):1–6. PMID:28327989
263. Kohavi R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model
Selection. Proc 14th Int Jt Conf Artif Intell - Vol 2 [Internet] Montreal: Morgan Kaufmann
Publishers Inc.; 1995 [cited 2018 Aug 30]. p. 1137–1143. Available from:
http://web.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf
264. Bengio Y, Grandvalet Y. No Unbiased Estimator of the Variance of K-Fold Cross-Validation.
J Mach Learn Res [Internet] 2004 [cited 2018 Aug 31];5:1089–
1105. Available from: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf
265. Zhang Y, Yang Y. Cross-validation for selecting a model selection procedure. J Econom [Internet]
2015 Jul [cited 2018 Aug 31];187(1):95–112. [doi: 10.1016/j.jeconom.2015.02.006]
266. Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J Am
Stat Assoc [Internet] 1983 Jun [cited 2018 Aug 31];78(382):316–331. [doi:
10.1080/01621459.1983.10477973]
267. Sicotte XB. Variance of K-fold cross-validation estimates as f(K): what is the role of “stability”?
[Internet]. Cross Validated. 2018. Available from: https://stats.stackexchange.com/q/358278
268. National Health Service. Blood tests - Overview [Internet]. Natl Heal Serv. 2016 [cited 2018 Aug
31]. Available from: https://www.nhs.uk/conditions/blood-tests/
269. The Royal College of Pathologists of Australasia. Pathology: The Facts [Internet]. 2013. Available
from:
http://www.health.gov.au/internet/publications/publishing.nsf/Content/CA2578620005D57ACA2
57B6A000862D3/$File/What I Should Know Pathology-FS.pdf
270. Dynacare. After My Test [Internet]. [cited 2018 Aug 31]. Available from:
https://www.dynacare.ca/patients-and-individuals/preparation-and-tips/after-my-test.aspx
271. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B,
R Core Team, Benesty M, Lescarbeau R, Ziem A, Scrucca L, Tang Y, Candan C, Hunt T. varImp
function [Internet]. R Doc. 2017 [cited 2018 Aug 31]. Available from:
https://www.rdocumentation.org/packages/caret/versions/6.0-80/topics/varImp
272. Habbu A, Lakkis NM, Dokainish H. The Obesity Paradox: Fact or Fiction? Am J Cardiol
[Internet] Excerpta Medica; 2006 Oct 1 [cited 2018 Sep 24];98(7):944–948. [doi:
10.1016/J.AMJCARD.2006.04.039]
273. Curtis JP, Selter JG, Wang Y, Rathore SS, Jovin IS, Jadbabaie F, Kosiborod M, Portnay EL,
Sokol SI, Bader F, Krumholz HM. The Obesity Paradox. Arch Intern Med [Internet] 2005 Jan 10
[cited 2018 Sep 24];165(1):55. [doi: 10.1001/archinte.165.1.55]
274. Kenchaiah S, Evans JC, Levy D, Wilson PWF, Benjamin EJ, Larson MG, Kannel WB, Vasan RS.
Obesity and the Risk of Heart Failure. N Engl J Med [Internet] 2002 Aug [cited 2018 Sep
24];347(5):305–313. [doi: 10.1056/NEJMoa020245]
275. Mosterd A. The prognosis of heart failure in the general population. The Rotterdam Study. Eur
Heart J [Internet] 2001 Aug 1 [cited 2018 Sep 24];22(15):1318–1327. [doi: 10.1053/euhj.2000.2533]
276. Iliodromiti S, Celis-Morales CA, Lyall DM, Anderson J, Gray SR, Mackay DF, Nelson SM, Welsh
P, Pell JP, Gill JMR, Sattar N. The impact of confounding on the associations of different
adiposity measures with the incidence of cardiovascular disease: a cohort study of 296 535 adults of
white European descent. Eur Heart J [Internet] Oxford University Press; 2018 May 1 [cited 2018
Sep 24];39(17):1514–1520. [doi: 10.1093/eurheartj/ehy057]
277. Mailund T, Storm Pedersen CN. Machine Learning in Bioinformatics Lecture Week 5 - Hidden
Markov Models Selecting model parameters or “training” Hidden Markov Models [Internet].
Aarhus, Denmark; 2014 [cited 2017 Aug 28]. p. 56. Available from: http://users-
birc.au.dk/cstorm/courses/MLiB_f14/slides/hidden-markov-models-4.pdf
278. Jelinek B. Review on Training Hidden Markov Models with Multiple Observations. [cited 2017
Aug 28]; Available from:
https://www.isip.piconepress.com/courses/msstate/ece_8443/papers/2001_spring/multi_obs/p00
_paper_v0.pdf
279. user34790, de Azevdeo R, Morat, hxd1011, Bulatov Y, Masterfool, Dernoncourt F. What is the
difference between the forward-backward and Viterbi algorithms? - Cross Validated [Internet].
Cross Validated. 2016 [cited 2017 Nov 11]. Available from:
https://stats.stackexchange.com/questions/31746/what-is-the-difference-between-the-forward-
backward-and-viterbi-algorithms
280. Rodríguez LJ, Torres I. Comparative Study of the Baum-Welch and Viterbi Training Algorithms
Applied to Read and Spontaneous Speech Recognition. Pattern Recognit Image Anal [Internet]
Springer, Berlin, Heidelberg; 2003 [cited 2017 Nov 11]. p. 847–857. [doi: 10.1007/978-3-540-44871-
6_98]
281. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer
P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M,
Duchesnay É. Scikit-learn: Machine Learning in Python. J Mach Learn Res [Internet] 2011 [cited
2018 Aug 22];12:2825–2830. Available from: http://scikit-learn.org/stable/about.html#citing-
scikit-learn
282. Baril J-F. mhsc-thesis [Internet]. Toronto; 2018. Available from:
https://github.com/cosmomeese/mhsc-thesis
283. Abu-Mostafa Y. Lecture 07 - The VC Dimension from Caltech CS 156: Learning Systems
[Internet]. YouTube; 2012 [cited 2018 Aug 30]. Available from:
https://www.youtube.com/watch?v=Dc0sr0kdBVI&hd=1#t=57m20s
284. Beleites C, Klein A. Any “rules of thumb” on number of features versus number of instances?
(small data sets) [Internet]. Data Sci (Stack Exch. 2018. Available from:
https://datascience.stackexchange.com/a/29478
285. Hua J, Xiong Z, Lowey J, Suh E, Dougherty ER. Optimal number of features as a function of
sample size for various classification rules. Bioinformatics [Internet] Oxford University Press; 2005
Apr 15 [cited 2018 Aug 30];21(8):1509–1515. [doi: 10.1093/bioinformatics/bti171]
286. Häggström M. Renin-angiotensin_system_in_man_shadow. Wikimedia Commons; 2009.
287. Ober WC, Garrison CW, Silverthorn DU. Adapted from Figure 15-24 The baroreceptor reflex: the
response to orthostatic hypotension. Hum Physiol An Integr Approach. Pearson Benjamin
Cummings; 2009. p. 991.
288. Alian AA, Shelley KH. Fig. 3. The effect of cardiac arrhythmia (PVCs) on the PPG. Best Pract
Res Clin Anaesthesiol [Internet] 2014 [cited 2018 Jul 30];28(4). [doi: 10.1016/j.bpa.2014.08.006]
289. University Health Network (UHN). Medly for Heart Failure [Internet]. iTunes; 2018. Available
from: https://itunes.apple.com/ca/app/medly-for-chronic-conditions/id1310832707?mt=8
290. Owen S. Common Probability Distributions: The Data Scientist’s Crib Sheet - Cloudera
Engineering Blog [Internet]. Cloudera Eng Blog. 2015 [cited 2018 Aug 27]. Available from:
https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-
sheet/
Appendix A - Research Ethics
I. REB #14-7595: Validation of A Wearable Activity Tracker for the Estimation of
Heart Failure Severity
II. REB #15-9832: Feasibility Study of Wearable Heart Rate and Activity Trackers
for Monitoring Heart Failure
III. REB #16-5789: Evaluation of A Mobile Phone-Based Telemonitoring Program
for Heart Failure Patients
IV. REB #18-0221: Artificial intelligence-based quality improvement initiative of a
mobile phone-based telemonitoring program for heart failure patients
Appendix B – A Primer on Hidden Markov Models
I. Basics of Markov Models (Hidden or Otherwise)
Markov Models (hidden or otherwise) are probabilistic state machines where the transitions between
states are executed randomly according to pre-specified transition probabilities between states [118,220–
223]. Markov Models are used to model Markov chains/processes which are stochastic (i.e. random)
processes that satisfy the Markovian property. That is, the transitions from a given state in the chain to
the next immediate state (and by extension all future states) must depend solely on the current
state of the model [118,220–224]. They must not depend on the path taken to arrive at that state, i.e. on
any previous states in which the system has existed. The Markovian property is alternatively known as
the 'memoryless' property: essentially, the Markov process or Markov chain has no memory of the
past [118,220–224]. The transition probabilities, along with the number of states, form the fundamental
model parameters which uniquely describe the Markov Model. Where relevant, a Markov Model may also
have initial starting parameters which dictate the likelihood associated with the Markov Model starting in
each possible state (e.g. 10% chance to start in State S1, 20% chance to start in State S2 and so on)
[118,220–224].
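To make this concrete, the short sketch below (illustrative base R only; the states and probabilities are invented for this primer and are not values estimated in this thesis) writes a three-state Markov Model down as a starting-probability vector and a transition-probability matrix, and then simulates a state sequence by repeated sampling. Note that each draw depends only on the current state, which is exactly the Markovian property described above:

set.seed(42)

states <- c("S1", "S2", "S3")

# starting (initial) probabilities: the chance of beginning in each state
start.probs <- c(S1 = 0.4, S2 = 0.3, S3 = 0.3)

# transition probabilities: rows are the current state, columns the next state;
# each row must sum to 1
trans.probs <- matrix(c(0.7, 0.2, 0.1,
                        0.2, 0.6, 0.2,
                        0.1, 0.2, 0.7),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(states, states))

# simulate a chain of length n: the next state depends only on the current one
simulate.chain <- function(n, start.probs, trans.probs) {
  chain <- character(n)
  chain[1] <- sample(states, 1, prob = start.probs)
  for (t in 2:n) {
    chain[t] <- sample(states, 1, prob = trans.probs[chain[t - 1], ])
  }
  chain
}

simulate.chain(10, start.probs, trans.probs)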
In many Markov Models (and in every Hidden Markov Model) there is also an associated set of possible
observations that are linked to each state, i.e. that can possibly be output when the system is in a given
state. For example, Figure B-1 shows a Markov Model of the weather outside an
office with possible states S1 = Sunny, S2 = Cloudy and S3 = Rainy with associated transition
probabilities between each state [221]. The observations associated with each state might be the clothing
that a given person in a stream of passers-by is wearing, say a shirt, a sweater or a rainjacket [221]. A
person might be wearing any of these types of clothing in any given type of weather, but the likelihood
of observing each clothing type will differ based on the underlying weather
state; for example rainjackets are probably more likely to be observed in rainy weather than in sunny
weather [221]. These probabilities are termed observation probabilities and link the states in the Markov
Model to the observations that are measured as outputs of the Markov Model. These observations could
be speech phonemes, written characters of the alphabet, or genome sequences [118,226]. Observe that in
Figure B-1, our hypothetical example Markov Model of the weather includes the starting, transition and
observation probabilities. The starting probabilities are indicated by very light lines between the
rectangular ‘start’ box and the state circles, and are almost uniformly distributed, with a slight bias towards it
being state S1: Sunny (perhaps unjustified optimism). The transition probabilities, indicated by lines
between the three state circles, favor the state remaining the same, with low probability of the state
jumping directly between the S1: Sunny and S3: Rainy states. The observation probabilities model our
hypothesis that shirts are most likely to be associated with sunny weather, and rainjackets with rainy
weather. In cloudy weather, people are almost equally likely to wear shirts, sweaters or rainjackets, with a
minor preference towards sweaters.
Figure B-1: Markov model
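As a rough illustration of how observation probabilities connect states to what is actually measured, the sketch below (base R again; the probabilities are invented for illustration and are not the values depicted in Figure B-1) encodes P(clothing | weather) as a matrix and generates the clothing an observer would see for a given weather sequence:

set.seed(7)

weather  <- c("Sunny", "Cloudy", "Rainy")
clothing <- c("Shirt", "Sweater", "Rainjacket")

# observation (emission) probabilities: P(clothing | weather state);
# cloudy weather is roughly uniform, with a minor preference for sweaters
obs.probs <- matrix(c(0.6, 0.3, 0.1,    # Sunny
                      0.3, 0.4, 0.3,    # Cloudy
                      0.1, 0.2, 0.7),   # Rainy
                    nrow = 3, byrow = TRUE,
                    dimnames = list(weather, clothing))

# given a sequence of (possibly hidden) weather states, generate what an
# observer outside the office would actually see
emit <- function(state.seq) {
  sapply(state.seq, function(s) sample(clothing, 1, prob = obs.probs[s, ]))
}

emit(c("Sunny", "Sunny", "Rainy", "Cloudy"))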
The appropriately named Hidden Markov Models (HMM) are simply Markov Models where the
underlying states are hidden - i.e. cannot directly be observed [118,220,222,224,225]. Specifically, we don’t
know the number of states the system has, nor the transition probabilities between states, the sequence of
states it has been through, or even the present state of the system [118,220,222,224,225]. However, if we
assume the system has a certain number of states (e.g. 3) for which we have some given observation
probabilities, it is actually possible to work backwards and try to infer the current state of the hidden
underlying Markov Model, including the sequence of states that the particular model went through, and
more generally to create a model of the underlying process [118,220,222–224,277]. We can then use the model to
replicate the modelled process. A relatable example is text prediction, where an HMM might be
trained using text a user inputs into their smartphone and then used to dynamically suggest the next
word as a user types in new text. Alternatively, one could use a model to quantify how similar a new
process is to an existing modeled process: for example one could model the stock market using the trade
volume and price of a major index during a known bullish (rising) period, and then provide this bull
market trained HMM a recent sample of the index trade volume and pricing information to quantify how
similar the current market is to the known bull market period.
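This 'working backwards' is exactly what fitting an HMM to an observation stream does. The following is a minimal sketch using the depmixS4 package (the same package referred to later in this primer); the simulated step-count-like series, the choice of three states and the Gaussian emission family are all assumptions made purely for illustration and do not correspond to the models built for this project:

library(depmixS4)

set.seed(1)

# toy observation stream standing in for a minute-by-minute step count series
steps <- c(rpois(200, lambda = 5), rpois(200, lambda = 60), rpois(200, lambda = 15))
df <- data.frame(steps = steps)

# assume 3 hidden states and Gaussian emissions for the observed counts
mod <- depmix(response = steps ~ 1, data = df, nstates = 3, family = gaussian())

fm <- fit(mod)         # expectation-maximization (Baum-Welch style) fitting
summary(fm)            # estimated starting, transition and emission parameters
post <- posterior(fm)  # most likely hidden state at each time point
head(post)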
Of course, the process of modelling an underlying process using an HMM relies on many assumptions,
both about the input data and properties of the underlying process. As previously mentioned, one of the
major assumptions that come with hidden Markov Models (the Markovian assumption), as with Markov
Models in general, is that the underlying process being modelled adheres to the Markovian
property: that the future state of the model does not depend on the past states or sequence of states, only
on the present state [118,220,221,224]. That being said, Hidden Markov Models have been found, in certain
cases, to model processes that violate this Markovian assumption fairly successfully, for
example in the classic cases of speech recognition and gesture recognition [118,226,278]. Of course, both
patient activity and heart rate data likely violate the Markovian assumption 'demanded' of hidden
Markov Models, and although HMMs have been used successfully in some applications of physical activity
recognition using accelerometer data [62], the jury is still out when it comes to modelling with heart rate
data or even with minute-by-minute step count data.
II. Semi-Markov Model
The violation of the pure Markovian assumption leads us to a variation on Hidden Markov Models:
Hidden Semi-Markov Models (HSMM) [223]. HSMMs are HMMs that formally relax the 'Markovian'
assumption of the model by permitting the model to specifically retain the memory of how long it has
been in a certain state (sometimes to force the model not to remain in a state for more than a desired time)
[223]. As such, HSMMs require that an additional set of parameters be defined: the sojourn distribution of
each state [223]. That is, the distribution of waiting times expected in each given state. These
waiting times can follow any distribution desired - normal, geometric, gamma, etc. - or appropriate for the
problem at hand [223]. For example, in the case of patient activity and heart rate, where it might be
unreasonable to assume that there is no time-dependence in state changes due to the dynamic nature
of human exercise and activity (e.g. people who are performing high-intensity activity are less likely to
continue as time goes by since they get tired) one might train equivalent multivariate hidden semi-
Markov models to explore and measure the effect of formally relaxing the Markovian assumption (or time-
independence) of a pure Markov models. Although HSMMs are likely highly relevant to the problem of
assessing NYHA class they were not investigated as part of the research documented in this thesis.
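For intuition, and as a standard textbook observation rather than a result from this thesis, an ordinary HMM already implies a sojourn distribution for each state; it is simply constrained to be geometric, whereas an HSMM lets it be specified freely:

\[
P(D_j = u) = a_{jj}^{\,u-1}\,(1 - a_{jj}) \quad \text{(HMM: implicitly geometric)}
\qquad \text{versus} \qquad
P(D_j = u) = d_j(u) \quad \text{(HSMM: explicit sojourn distribution)}
\]

where D_j is the number of consecutive time steps spent in state j and a_jj is that state's self-transition probability.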
III. Hidden Markov & Semi-Markov Model Parameters
To summarize, the complete set of parameters that determines a hidden Markov model is as follows:
1. the number of states in the model
2. the starting probabilities (for each state)
3. the transition probabilities (between each state)
4. the (observation) emission probabilities (of the observable by-products of each state; e.g.
shirt/sweater/rainjacket)
For Hidden Semi-Markov Models, the individual state sojourn distributions must also be specified.
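In the conventional notation (e.g. Rabiner's; included here for reference rather than taken from the thesis), this parameter set is often written as a single tuple:

\[
\lambda = (\pi, A, B), \qquad
\pi_i = P(q_1 = S_i), \quad
a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \quad
b_j(o) = P(o_t = o \mid q_t = S_j)
\]

with the number of states fixed in advance; for an HSMM the tuple is extended with the per-state sojourn distributions d_j(u).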
IV. Determining Markov Model Parameters
Determining the single best or most optimal hidden Markov model parametrization for a given data
stream is, unfortunately, an intractable problem [118,220,222]. That being said, there are known algorithms
for efficiently computing a locally optimal 'maximum likelihood' parametrization for a stream. Generally
speaking, the specific sub-class of algorithms used to solve this problem in the Markov model space are
known as expectation-maximization (EM) algorithms [118,220,222]. One of the most common EM
algorithm implementations used for hidden Markov model training is the Baum-Welch algorithm
[118,220,222,279]. Another common algorithm used to approximate EM is the Viterbi training algorithm
(N.B. not the Viterbi algorithm), which can yield less accurate models than the Baum-Welch algorithm but
is usually much less computationally intensive [279,280]. We eschew further discussion of the
implementation details of either of these algorithms: the availability of pre-programmed libraries
implementing them makes such in-depth knowledge unnecessary for new students of HMMs, and there are
many excellent sources available that explore the finer details of these algorithms much more completely
than can be done as part of a quick primer [118,220,222,280]. In any case, none of these algorithms is able
to determine all of the parameters by itself. Some of the parameters must be provided as 'initial
conditions' for the algorithm to execute; typically these are the emission probabilities, the starting
probabilities, the sojourn distributions (and sometimes even initial transition probabilities). Depending on
the library used, it may try to make an educated guess for these starting points or leave the 'initial
conditions' to be specified solely by the user. It is possible (and encouraged) to try various combinations
of parameters to determine the most effective set - in fact, more fully featured software libraries will
sometimes offer to do this automatically, although it is ultimately up to the researcher to determine
appropriate 'initial conditions.'
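As an illustration of this 'try several initial conditions' advice, the following generic R sketch (not the thesis code, and assuming a data frame df with a numeric steps column) refits the same three-state model from several randomized starting points and keeps the run with the best log-likelihood:

# Illustrative sketch: EM only finds local optima, so refit from several
# randomized starting points and keep the best-scoring parametrization.
# (Assumes depmixS4's default EM behaviour of randomized starting values;
# set.seed simply makes each restart reproducible.)
library(depmixS4)

best_fit <- NULL
for (seed in 1:10) {
  set.seed(seed)
  mod <- depmix(steps ~ 1, data = df, nstates = 3, family = gaussian())
  fm  <- tryCatch(fit(mod), error = function(e) NULL)  # a poor start can fail to converge
  if (!is.null(fm) && (is.null(best_fit) || logLik(fm) > logLik(best_fit))) {
    best_fit <- fm
  }
}
summary(best_fit)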
In the case of this work, where we used the R package depmixS4 [227,228], the user must provide the
number of desired states and the emission probabilities (which are assumed to remain fixed), as well as an
initial starting point for the state probabilities and transition probabilities, which the algorithm then
adjusts as it searches for a local optimum. Other hidden Markov model packages exist for R as well as for
other programming languages, including Python (as part of the package scikit-learn [281]), which is
particularly popular for machine learning.
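For reference, a minimal depmixS4 call of the kind described above might look as follows. This is an illustrative sketch rather than the code actually used in this thesis (see Appendix C for the latter); the data frame activity with minute-by-minute steps and hr columns, and the choice of three states, are assumptions made purely for the example.

# Illustrative sketch: a 3-state multivariate Gaussian HMM over step count
# and heart rate using depmixS4.
library(depmixS4)

mod <- depmix(list(steps ~ 1, hr ~ 1),      # one response model per observed stream
              data    = activity,
              nstates = 3,
              family  = list(gaussian(), gaussian()))

fm <- fit(mod)       # EM search for a locally optimal parametrization

summary(fm)          # estimated initial, transition and emission parameters
head(posterior(fm))  # most likely hidden state (and posteriors) per observation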
Appendix C – Software Repository
All of the software written by the author and used for, or as part of this project, can be accessed at [282]:
https://github.com/cosmomeese/mhsc-thesis
The Fitbit data management and access script can also be found at [198]:
https://github.com/cosmomeese/fitbit4research
Appendix D – Tabulation of All Cross-sectional Machine Learning Classifier
Performance Measures
An exhaustive list of all the performance measures recorded for the final cross-sectional machine learning classifiers evaluated in Chapter 6 is
tabulated in Table 22. To maximize the legibility of the tables, the headers were abbreviated; Table 21 provides the key to these
abbreviations, along with any relevant abbreviated codes used in Table 22. For ease of navigation, similar model variants are grouped together in
Table 22, in roughly descending order of performance (the grouping is why the order is only approximate). Furthermore, the column with the performance metric used for
model comparison in this thesis - Cohen's Kappa (indicated by the 𝜅 symbol) - is highlighted in purple. Models whose unbalanced accuracy does not
improve over their no-information rate are highlighted in red, and the best-performing models are highlighted in green. The models with the lowest |Δ𝜅| (of
the models that improve over their default no-information rate) are highlighted in yellow.
Table 21: Header abbreviations for Table 22

Type - Machine learning model type
Feats - Features used (C = CPET only; S = step data only; C+S = CPET and step data)
Imp - Imputed missing data?
F Sel - Feature selection performed?
k - k-fold cross-validation method used (-1 = leave-one-out cross-validation; 10 = 10-fold cross-validation)
𝜅 - Cohen's Kappa
|Δ𝜅| - Absolute value of the difference between the leave-one-out and 10-fold cross-validation kappa for the particular model configuration
Bal Acc - Balanced Accuracy
Raw Acc - Unbalanced Accuracy
Acc UB - Unbalanced Accuracy Upper Bound
Acc LB - Unbalanced Accuracy Lower Bound
NIR - No Information Rate
P - P-Value (Unbalanced Accuracy)
McN P - McNemar P-Value
Sens - Sensitivity
Spec - Specificity
+ve PV - Positive (NYHA Class II) Predictive Value
-ve PV - Negative (NYHA Class III) Predictive Value
Pre - Precision
Rec - Recall
F1 - F1 Score
Prev - Prevalence
DR - Detection Rate
DP - Detection Prevalence
AUC - Area Under ROC Curve
TP - True Positive (Correct NYHA II Classification) Count
FN - False Negative (Incorrect NYHA III Classification) Count
FP - False Positive (Incorrect NYHA II Classification) Count
TN - True Negative (Correct NYHA III Classification) Count
Table 22: Cross-sectional machine learning classifier performance metrics
Type  Feats  Imp  F Sel  k  𝜅  |Δ𝜅|  Bal Acc  Raw Acc  Acc UB  Acc LB  NIR  P  McN P  Sens  Spec  +ve PV  -ve PV  Pre  Rec  F1  Prev  DR  DP  AUC  TP  FN  FP  TN
Boosted GLM C+S No No -1 0.73 0.63 0.85 0.89 0.98 0.72 0.71 .02 1.00 0.75 0.95 0.86 0.90 0.86 0.75 0.80 0.29 0.21 0.25 0.94 6 2 1 19
Boosted GLM C+S No Yes -1 0.73 0.63 0.85 0.89 0.98 0.72 0.71 .02 1.00 0.75 0.95 0.86 0.90 0.86 0.75 0.80 0.29 0.21 0.25 0.94 6 2 1 19
Boosted GLM C+S No No 10 0.10 0.63 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.54 3 12 4 31
Boosted GLM C+S No Yes 10 0.10 0.63 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.54 3 12 4 31
Random Forest C+S No No -1 0.70 0.60 0.81 0.89 0.98 0.72 0.71 .02 .25 0.63 1.00 1.00 0.87 1.00 0.63 0.77 0.29 0.18 0.18 0.80 5 3 0 20
Random Forest C+S No Yes -1 0.70 0.60 0.81 0.89 0.98 0.72 0.71 .02 .25 0.63 1.00 1.00 0.87 1.00 0.63 0.77 0.29 0.18 0.18 0.80 5 3 0 20
Random Forest C+S No No 10 0.10 0.60 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.46 3 12 4 31
Random Forest C+S No Yes 10 0.10 0.60 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.46 3 12 4 31
Boosted GLM C No No -1 0.47 0.19 0.72 0.79 0.90 0.64 0.70 .12 .50 0.54 0.90 0.70 0.82 0.70 0.54 0.61 0.30 0.16 0.23 0.80 7 6 3 27
Boosted GLM C No Yes -1 0.47 0.19 0.72 0.79 0.90 0.64 0.70 .12 .50 0.54 0.90 0.70 0.82 0.70 0.54 0.61 0.30 0.16 0.23 0.80 7 6 3 27
Boosted GLM C No No 10 0.28 0.19 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.55 6 9 5 30
Boosted GLM C No Yes 10 0.28 0.19 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.55 6 9 5 30
PCA NNet C Yes No -1 0.45 0.31 0.73 0.76 0.87 0.62 0.70 .22 .77 0.67 0.80 0.59 0.85 0.59 0.67 0.63 0.30 0.20 0.34 0.68 10 5 7 28
PCA NNet C Yes Yes -1 0.45 0.31 0.73 0.76 0.87 0.62 0.70 .22 .77 0.67 0.80 0.59 0.85 0.59 0.67 0.63 0.30 0.20 0.34 0.68 10 5 7 28
PCA NNet C Yes No 10 0.14 0.31 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.54 4 11 5 30
PCA NNet C Yes Yes 10 0.14 0.31 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.54 4 11 5 30
Boosted GLM C Yes No -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.76 9 6 6 29
Boosted GLM C Yes Yes -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.76 9 6 6 29
NNet C Yes No -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.73 9 6 6 29
NNet C Yes Yes -1 0.43 0.29 0.71 0.76 0.87 0.62 0.70 .22 1.00 0.60 0.83 0.60 0.83 0.60 0.60 0.60 0.30 0.18 0.30 0.73 9 6 6 29
Boosted GLM C Yes No 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.53 4 11 5 30
Boosted GLM C Yes Yes 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.53 4 11 5 30
NNet C Yes No 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.56 4 11 5 30
NNet C Yes Yes 10 0.14 0.29 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.56 4 11 5 30
NNet C No No -1 0.41 0.45 0.71 0.74 0.86 0.59 0.70 .32 1.00 0.62 0.80 0.57 0.83 0.57 0.62 0.59 0.30 0.19 0.33 0.73 8 5 6 24
NNet C No Yes -1 0.41 0.45 0.71 0.74 0.86 0.59 0.70 .32 1.00 0.62 0.80 0.57 0.83 0.57 0.62 0.59 0.30 0.19 0.33 0.73 8 5 6 24
NNet C No No 10 -0.05 0.45 0.48 0.56 0.70 0.41 0.70 .99 1.00 0.27 0.69 0.27 0.69 0.27 0.27 0.27 0.30 0.08 0.30 0.55 4 11 11 24
NNet C No Yes 10 -0.05 0.45 0.48 0.56 0.70 0.41 0.70 .99 1.00 0.27 0.69 0.27 0.69 0.27 0.27 0.27 0.30 0.08 0.30 0.55 4 11 11 24
GLM C Yes No -1 0.37 0.23 0.68 0.74 0.85 0.60 0.70 .33 1.00 0.53 0.83 0.57 0.81 0.57 0.53 0.55 0.30 0.16 0.28 0.70 8 7 6 29
GLM C Yes Yes -1 0.37 0.23 0.68 0.74 0.85 0.60 0.70 .33 1.00 0.53 0.83 0.57 0.81 0.57 0.53 0.55 0.30 0.16 0.28 0.70 8 7 6 29
GLM C Yes No 10 0.14 0.23 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.49 4 11 5 30
GLM C Yes Yes 10 0.14 0.23 0.56 0.68 0.80 0.53 0.70 .68 .21 0.27 0.86 0.44 0.73 0.44 0.27 0.33 0.30 0.08 0.18 0.49 4 11 5 30
PCA NNet C+S No No -1 0.36 0.43 0.68 0.75 0.89 0.55 0.71 .43 1.00 0.50 0.85 0.57 0.81 0.57 0.50 0.53 0.29 0.14 0.25 0.63 4 4 3 17
PCA NNet C+S No Yes -1 0.36 0.43 0.68 0.75 0.89 0.55 0.71 .43 1.00 0.50 0.85 0.57 0.81 0.57 0.50 0.53 0.29 0.14 0.25 0.63 4 4 3 17
PCA NNet C+S No No 10 -0.06 0.43 0.47 0.52 0.66 0.37 0.70 1.00 .54 0.33 0.60 0.26 0.68 0.26 0.33 0.29 0.30 0.10 0.38 0.56 5 10 14 21
PCA NNet C+S No Yes 10 -0.06 0.43 0.47 0.52 0.66 0.37 0.70 1.00 .54 0.33 0.60 0.26 0.68 0.26 0.33 0.29 0.30 0.10 0.38 0.56 5 10 14 21
PCA NNet C No No -1 0.34 0.24 0.67 0.72 0.85 0.56 0.70 .44 1.00 0.54 0.80 0.54 0.80 0.54 0.54 0.54 0.30 0.16 0.30 0.74 7 6 6 24
PCA NNet C No Yes -1 0.34 0.24 0.67 0.72 0.85 0.56 0.70 .44 1.00 0.54 0.80 0.54 0.80 0.54 0.54 0.54 0.30 0.16 0.30 0.74 7 6 6 24
PCA NNet C No No 10 0.10 0.24 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.53 3 12 4 31
PCA NNet C No Yes 10 0.10 0.24 0.54 0.68 0.80 0.53 0.70 .68 .08 0.20 0.89 0.43 0.72 0.43 0.20 0.27 0.30 0.06 0.14 0.53 3 12 4 31
GLM S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
GLM S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
Boosted GLM S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
Boosted GLM S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.72 6 9 5 30
NNet S Yes No -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.69 6 9 5 30
NNet S Yes Yes -1 0.28 0.28 0.63 0.72 0.84 0.58 0.70 .45 .42 0.40 0.86 0.55 0.77 0.55 0.40 0.46 0.30 0.12 0.22 0.69 6 9 5 30
GLM S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
GLM S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
Boosted GLM S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
Boosted GLM S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.47 0 4 0 8
NNet S Yes No 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.33 0 4 0 8
NNet S Yes Yes 10 0.00 0.28 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.33 0 4 0 8
Random Forest C Yes No -1 0.21 0.13 0.60 0.68 0.80 0.53 0.70 .68 .80 0.40 0.80 0.46 0.76 0.46 0.40 0.43 0.30 0.12 0.26 0.68 6 9 7 28
Random Forest C Yes Yes -1 0.21 0.13 0.60 0.68 0.80 0.53 0.70 .68 .80 0.40 0.80 0.46 0.76 0.46 0.40 0.43 0.30 0.12 0.26 0.68 6 9 7 28
Random Forest C Yes No 10 0.08 0.13 0.54 0.62 0.75 0.47 0.70 .92 1.00 0.33 0.74 0.36 0.72 0.36 0.33 0.34 0.30 0.10 0.28 0.52 5 10 9 26
Random Forest C Yes Yes 10 0.08 0.13 0.54 0.62 0.75 0.47 0.70 .92 1.00 0.33 0.74 0.36 0.72 0.36 0.33 0.34 0.30 0.10 0.28 0.52 5 10 9 26
Boosted GLM C+S Yes No -1 0.17 0.23 0.59 0.66 0.79 0.51 0.70 .78 1.00 0.40 0.77 0.43 0.75 0.43 0.40 0.41 0.30 0.12 0.28 0.65 6 9 8 27
Boosted GLM C+S Yes Yes -1 0.17 0.23 0.59 0.66 0.79 0.51 0.70 .78 1.00 0.40 0.77 0.43 0.75 0.43 0.40 0.41 0.30 0.12 0.28 0.65 6 9 8 27
Boosted GLM C+S Yes No 10 -0.06 0.23 0.48 0.64 0.77 0.49 0.70 .86 .03 0.07 0.89 0.20 0.69 0.20 0.07 0.10 0.30 0.02 0.10 0.45 1 14 4 31
Boosted GLM C+S Yes Yes 10 -0.06 0.23 0.48 0.64 0.77 0.49 0.70 .86 .03 0.07 0.89 0.20 0.69 0.20 0.07 0.10 0.30 0.02 0.10 0.45 1 14 4 31
Random Forest S Yes No -1 0.14 0.38 0.58 0.62 0.75 0.47 0.70 .92 .65 0.47 0.69 0.39 0.75 0.39 0.47 0.42 0.30 0.14 0.36 0.62 7 8 11 24
Random Forest S Yes Yes -1 0.14 0.38 0.58 0.62 0.75 0.47 0.70 .92 .65 0.47 0.69 0.39 0.75 0.39 0.47 0.42 0.30 0.14 0.36 0.62 7 8 11 24
Random Forest S Yes No 10 -0.24 0.38 0.38 0.42 0.72 0.15 0.67 .98 1.00 0.25 0.50 0.20 0.57 0.20 0.25 0.22 0.33 0.08 0.42 0.41 1 3 4 4
Random Forest S Yes Yes 10 -0.24 0.38 0.38 0.42 0.72 0.15 0.67 .98 1.00 0.25 0.50 0.20 0.57 0.20 0.25 0.22 0.33 0.08 0.42 0.41 1 3 4 4
Random Forest C No No -1 0.11 0.23 0.55 0.67 0.81 0.51 0.70 .70 .18 0.23 0.87 0.43 0.72 0.43 0.23 0.30 0.30 0.07 0.16 0.65 3 10 4 26
Random Forest C No Yes -1 0.11 0.23 0.55 0.67 0.81 0.51 0.70 .70 .18 0.23 0.87 0.43 0.72 0.43 0.23 0.30 0.30 0.07 0.16 0.65 3 10 4 26
Random Forest C No No 10 -0.12 0.23 0.44 0.54 0.68 0.39 0.70 .99 1.00 0.20 0.69 0.21 0.67 0.21 0.20 0.21 0.30 0.06 0.28 0.47 3 12 11 24
Random Forest C No Yes 10 -0.12 0.23 0.44 0.54 0.68 0.39 0.70 .99 1.00 0.20 0.69 0.21 0.67 0.21 0.20 0.21 0.30 0.06 0.28 0.47 3 12 11 24
GLM S No No -1 0.10 0.09 0.55 0.65 0.80 0.46 0.71 .83 .77 0.30 0.79 0.38 0.73 0.38 0.30 0.33 0.29 0.09 0.24 0.65 3 7 5 19
GLM S No Yes -1 0.10 0.09 0.55 0.65 0.80 0.46 0.71 .83 .77 0.30 0.79 0.38 0.73 0.38 0.30 0.33 0.29 0.09 0.24 0.65 3 7 5 19
GLM S No No 10 0.01 0.09 0.50 0.52 0.66 0.37 0.70 1.00 .15 0.47 0.54 0.30 0.70 0.30 0.47 0.37 0.30 0.14 0.46 0.49 7 8 16 19
GLM S No Yes 10 0.01 0.09 0.50 0.52 0.66 0.37 0.70 1.00 .15 0.47 0.54 0.30 0.70 0.30 0.47 0.37 0.30 0.14 0.46 0.49 7 8 16 19
NNet C+S Yes No -1 0.08 0.24 0.54 0.60 0.74 0.45 0.70 .95 .82 0.40 0.69 0.35 0.73 0.35 0.40 0.38 0.30 0.12 0.34 0.49 6 9 11 24
NNet C+S Yes Yes -1 0.08 0.24 0.54 0.60 0.74 0.45 0.70 .95 .82 0.40 0.69 0.35 0.73 0.35 0.40 0.38 0.30 0.12 0.34 0.49 6 9 11 24
NNet C+S Yes No 10 -0.15 0.24 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.51 1 14 7 28
NNet C+S Yes Yes 10 -0.15 0.24 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.51 1 14 7 28
Random Forest C+S Yes No -1 0.07 0.10 0.53 0.64 0.77 0.49 0.70 .86 .48 0.27 0.80 0.36 0.72 0.36 0.27 0.31 0.30 0.08 0.22 0.61 4 11 7 28
Random Forest C+S Yes Yes -1 0.07 0.10 0.53 0.64 0.77 0.49 0.70 .86 .48 0.27 0.80 0.36 0.72 0.36 0.27 0.31 0.30 0.08 0.22 0.61 4 11 7 28
Random Forest C+S Yes No 10 -0.03 0.10 0.49 0.60 0.74 0.45 0.70 .95 .50 0.20 0.77 0.27 0.69 0.27 0.20 0.23 0.30 0.06 0.22 0.62 3 12 8 27
Random Forest C+S Yes Yes 10 -0.03 0.10 0.49 0.60 0.74 0.45 0.70 .95 .50 0.20 0.77 0.27 0.69 0.27 0.20 0.23 0.30 0.06 0.22 0.62 3 12 8 27
NNet C+S No No -1 0.05 0.04 0.53 0.64 0.81 0.44 0.71 .85 .75 0.25 0.80 0.33 0.73 0.33 0.25 0.29 0.29 0.07 0.21 0.46 2 6 4 16
NNet C+S No Yes -1 0.05 0.04 0.53 0.64 0.81 0.44 0.71 .85 .75 0.25 0.80 0.33 0.73 0.33 0.25 0.29 0.29 0.07 0.21 0.46 2 6 4 16
NNet C+S No No 10 0.02 0.04 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.53 5 10 11 24
NNet C+S No Yes 10 0.02 0.04 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.53 5 10 11 24
GLM C+S Yes No -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25
GLM C+S Yes Yes -1 0.05 0.24 0.52 0.60 0.74 0.45 0.70 .95 1.00 0.33 0.71 0.33 0.71 0.33 0.33 0.33 0.30 0.10 0.30 0.50 5 10 10 25
GLM C+S Yes No 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24
GLM C+S Yes Yes 10 -0.19 0.24 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.37 2 13 11 24
PCA NNet C+S Yes No -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24
PCA NNet C+S Yes Yes -1 0.02 0.17 0.51 0.58 0.72 0.43 0.70 .97 1.00 0.33 0.69 0.31 0.71 0.31 0.33 0.32 0.30 0.10 0.32 0.49 5 10 11 24
PCA NNet C+S Yes No 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28
PCA NNet C+S Yes Yes 10 -0.15 0.17 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.42 1 14 7 28
PCA NNet S No No -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17
PCA NNet S No Yes -1 0.01 0.16 0.50 0.59 0.75 0.41 0.71 .95 1.00 0.30 0.71 0.30 0.71 0.30 0.30 0.30 0.29 0.09 0.29 0.45 3 7 7 17
PCA NNet S No No 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28
PCA NNet S No Yes 10 -0.15 0.16 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.40 1 14 7 28
GLM C No No 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15
GLM C No Yes 10 0.07 -0.08 0.55 0.50 0.64 0.36 0.70 1.00 .01 0.67 0.43 0.33 0.75 0.33 0.67 0.44 0.30 0.20 0.60 0.58 10 5 20 15
GLM C No No -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16
GLM C No Yes -1 0.00 -0.08 0.50 0.51 0.67 0.35 0.70 1.00 .19 0.46 0.53 0.30 0.70 0.30 0.46 0.36 0.30 0.14 0.47 0.53 6 7 14 16
NNet S No No -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16
NNet S No Yes -1 -0.03 0.16 0.48 0.56 0.73 0.38 0.71 .98 1.00 0.30 0.67 0.27 0.70 0.27 0.30 0.29 0.29 0.09 0.32 0.44 3 7 8 16
NNet S No No 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24
NNet S No Yes 10 -0.19 0.16 0.41 0.52 0.66 0.37 0.70 1.00 .84 0.13 0.69 0.15 0.65 0.15 0.13 0.14 0.30 0.04 0.26 0.42 2 13 11 24
GLM C+S No No 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23
GLM C+S No Yes 10 -0.07 -0.03 0.46 0.54 0.68 0.39 0.70 .99 1.00 0.27 0.66 0.25 0.68 0.25 0.27 0.26 0.30 0.08 0.32 0.55 4 11 12 23
GLM C+S No No -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10
GLM C+S No Yes -1 -0.11 -0.03 0.44 0.46 0.66 0.28 0.71 1.00 .30 0.38 0.50 0.23 0.67 0.23 0.38 0.29 0.29 0.11 0.46 0.47 3 5 10 10
Boosted GLM S No No 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28
Boosted GLM S No Yes 10 -0.15 -0.01 0.43 0.58 0.72 0.43 0.70 .97 .19 0.07 0.80 0.13 0.67 0.13 0.07 0.09 0.30 0.02 0.16 0.41 1 14 7 28
Boosted GLM S No No -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18
Boosted GLM S No Yes -1 -0.16 -0.01 0.43 0.56 0.73 0.38 0.71 .98 .61 0.10 0.75 0.14 0.67 0.14 0.10 0.12 0.29 0.03 0.21 0.46 1 9 6 18
Random Forest S No No 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32
Random Forest S No Yes 10 -0.11 -0.09 0.46 0.64 0.77 0.49 0.70 .86 .01 0.00 0.91 0.00 0.68 0.00 0.00 NaN 0.30 0.00 0.06 0.39 0 15 3 32
Random Forest S No No -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20
Random Forest S No Yes -1 -0.20 -0.09 0.42 0.59 0.75 0.41 0.71 .95 .18 0.00 0.83 0.00 0.67 0.00 0.00 NaN 0.29 0.00 0.12 0.43 0 10 4 20
PCA NNet S Yes No 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8
PCA NNet S Yes Yes 10 0.00 NA 0.50 0.67 0.90 0.35 0.67 .63 .13 0.00 1.00 NaN 0.67 NA 0.00 NA 0.33 0.00 0.00 0.38 0 4 0 8