
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Unsupervised learning to detect patient subgroups in electronic health records

ELIN LÜTZ

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Unsupervised machine learning to detect patient subgroups in electronic health records

ELIN LÜTZ

Master of Science in Computer Science
Date: January 6, 2019
Supervisor: Johan Gustavsson
Examiner: Viggo Kann
Swedish title: Identifiering av patientgrupper genom oövervakad maskininlärning av digitala patientjournaler


Abstract

The use of Electronic Health Records (EHR) for reporting patient data has been widely adopted by healthcare providers. This data can encompass many forms of medical information, such as disease symptoms, results from laboratory tests, ICD-10 classes and other information from patients. Structured EHR data is often high-dimensional and contains many missing values, which complicates many computing problems. Detecting meaningful structures in EHR data could provide valuable insights for diagnosis detection and for the development of medical decision support systems.

In this work, a subset of EHR data from patient questionnaires is explored through two well-known clustering algorithms: k-means and agglomerative hierarchical clustering. The algorithms were tested on different types of data, primarily the raw data and data where missing values had been imputed using different imputation techniques. The primary evaluation index for the clustering algorithms was the silhouette value, using Euclidean and cosine distance measures.

The results showed that natural groupings most likely exist in the data set. Hierarchical clustering created higher-quality clusters than k-means, and the cosine measure yielded a good interpretation of distance. The data imputation had large effects on the data, and likewise on the clustering results, and other or more sophisticated techniques are needed for handling missing values in the data set.

Keywords: Machine learning, unsupervised learning, clustering, EHR, electronic health records, ICD, diagnosis codes.


Sammanfattning

The use of electronic health records for reporting patient data has grown with the digitization of healthcare. These data can contain many types of medical information, such as disease symptoms, laboratory results, ICD-10 diagnosis codes and other patient information. EHR data is typically high-dimensional and contains missing values, which can lead to computational difficulties in a digital format. Detecting groupings in such patient data can give valuable insights in diagnosis prediction and in the development of medical decision support systems.

In this work, we examine a subset of electronic patient data containing patient answers to questions about their illness. This data set is examined by applying two popular clustering algorithms: k-means and agglomerative hierarchical clustering. The algorithms are compared against each other and on different types of data sets, primarily the raw data and two data sets where missing values have been replaced through imputation techniques. The primary evaluation measure for the clustering algorithms was the silhouette value, together with the calculation of a Euclidean distance measure and a cosine measure.

The results show that natural groupings are very likely to be found in the data set. Hierarchical clustering showed a higher cluster quality than k-means, and the cosine measure was preferable for this data set. Imputation of missing data led to large changes in the data structure, and thus in the results of the clustering experiments, which suggests that other, more advanced, data-specific imputation techniques are preferable.

Keywords: Machine learning, unsupervised learning, clustering, EHR, electronic health records, ICD, diagnosis codes.


Contents

1 Introduction
  1.1 Machine Learning
  1.2 Electronic Health Records
  1.3 Motivation
    1.3.1 Public Interest
    1.3.2 Principal's Interest
  1.4 Problem Specification
    1.4.1 Objective
    1.4.2 Research Question
    1.4.3 Challenges
    1.4.4 Implications of High-dimensionality
    1.4.5 Implications of Missing Data
    1.4.6 Implications of Accuracy
    1.4.7 Limitations

2 Background
  2.1 Related Theory
    2.1.1 Clustering
    2.1.2 Hierarchical Clustering Algorithms
    2.1.3 Partitional Clustering Algorithms
    2.1.4 Density-Based Clustering Algorithms
    2.1.5 Subspace Clustering Algorithms
    2.1.6 Similarity Measures
    2.1.7 Clustering Evaluation
    2.1.8 Silhouette Index
  2.2 Related Work
  2.3 Summary of Literature Study

3 Methods
  3.1 Data sets
    3.1.1 Original data set
    3.1.2 Imputed Data sets
  3.2 Clustering
    3.2.1 Determining the Number of Clusters
    3.2.2 Software and Hardware
    3.2.3 Implementation of Algorithms

4 Results
  4.1 Determination of Number of Clusters
  4.2 Evaluation Measures
    4.2.1 k-means
    4.2.2 Agglomerative Hierarchical
    4.2.3 Running Times

5 Discussion and Analysis
  5.1 Determination of Number of Clusters
  5.2 Handling of Missing Values
  5.3 Comparison of the Algorithms
  5.4 Context
  5.5 Ethical Perspective
  5.6 Limitations
  5.7 Future Research

6 Conclusions

Bibliography


Chapter 1

Introduction

Healthcare is becoming more and more digitized through advances in computer-based medical systems, a development accelerated by hospitals' adoption of electronic health record (EHR) systems. Medical systems in clinical settings today generate large amounts of patient data in the form of EHRs, and the digital format makes it easy to analyse the data to retrieve important information [1]. Data science researchers have recognized the value and potential of this data, and have recently proposed effective machine learning approaches to help obtain actionable insights from EHRs for improving healthcare [2]. The findings may have a significant impact by supporting clinical decision making, preventing avoidable adverse clinical outcomes, improving the triaging of patients, automatically assigning diagnosis codes, detecting diseases at an early stage, and improving statistics on clinical outcomes, to name a few examples [3]. There is a need to research machine learning techniques capable of effectively and efficiently analysing the available amounts of EHR data, with the purpose of improving quality and productivity in health care. This thesis focuses on investigating how machine learning might improve the automated assignment of diagnosis codes through clustering EHR patient data.

1.1 Machine Learning

Machine Learning (ML) is an area within computer science that studies algorithms that learn from processing data in order to make decisions or predictions. The idea is to build computer systems that automatically gain knowledge with experience. ML is often divided into supervised and unsupervised problems. This thesis focuses on clustering, which is one of the most fundamental unsupervised ML problems. The main goal of clustering is to separate unlabelled data into clusters of similar data points. Besides having its own applications, clustering is beneficial for multiple other tasks. For instance, it can serve for data labelling for further supervised learning and as a pre-processing step for data visualization and analysis [4].

Clustering methods can be categorized into hierarchical and partitional (centroid-based) methods [5]. Hierarchical clustering aims to build a hierarchy of clusters and data points, whilst partitional (centroid-based) clustering methods create cluster centers and use metric relations to assign each data point to the cluster with the most similar center [4]. Methods for clustering are described in detail in section 2.1.1.

1.2 Electronic Health Records

The increased penetration of information technology into the health care sector has led to an increased amount of clinical data in the form of EHRs, which consist of patient data journalized by health care professionals. EHRs are a promising data source for applying ML algorithms because of their tremendous amount of clinical information [6]. However, typical EHR data is challenging to represent and model in ML applications for a number of reasons. It is often high-dimensional, noisy, sparse and incomplete, and it contains systematic biases, amongst other factors [7]. The problem of unstructured EHR data has previously been researched through deep learning methods with promising results [1]. Moreover, there is a lack of studies using large sets of structured EHR data in ML applications, since this type of data is notably demanding to collect and there is a lack of standard structures between different caregivers, hospitals and regions [2]. An important asset for this thesis is therefore a structured EHR data set provided by the principal, which contains information about symptom events, diagnosis codes and laboratory tests in a readily computable form.


1.3 Motivation

1.3.1 Public Interest

The disciplines of health and data science are intersecting more and more as health systems worldwide are confronted with large amounts of data. Processing this data to gain a greater understanding of patient groups would benefit the research community, health care professionals and the public. Recent advances where machine learning has been used in the field of health care have shown impressive results. In the fields of image recognition and pattern classification, deep learning has been successfully applied to medical data. For example, Google has developed a deep learning algorithm to identify cancerous tumors in x-ray mammography images [8].

Two main approaches are emerging in the field: the automated ML approach, where machines propose solutions autonomously based on historical patterns, and the interactive approach, where machines serve agents (e.g. health care professionals) who complement the automation with manual analysis and medical expertise [9].

Diagnosis is a challenging process that today involves many human-to-human interactions (often the procedure of a doctor's visit). A machine would initially serve as a complement to and extension of this medical procedure, helping physicians make considerations in diagnoses and treatments. Patient groupings could provide a higher accuracy in diagnosis, where the chance of certain diseases could be ruled out. This could increase the speed of giving a diagnosis and lead to a more rapid treatment decision. With large amounts of data to process, it is reasonable to believe that an ML approach would be able to detect rare events more easily than humans.

1.3.2 Principal's Interest

The external principal of this degree project is Doctrin, a company developing an online tool for digital primary care visits. Doctrin's platform gathers structured information about patients' health history and state, in order to support health care professionals in making decisions about subsequent care. Doctrin is interested in diagnosis categorization and grouping, in order to investigate the possibilities of an automated decision support tool for medical doctors. Since diagnosis determination today requires much manual handling, such as meetings and consensus amongst medical experts, an automated decision support system could be beneficial for improved efficiency and accuracy. Doctrin has contributed to the project through external supervision and by supplying the data needed for the experiments.

1.4 Problem Specification

1.4.1 Objective

The main objective of this master's degree project is to examine different clustering algorithms in order to detect groups in a real-world, high-dimensional EHR data set. The goal of the thesis is to find algorithms that detect high-quality clusters within a reasonable computational time. In order for the chosen algorithms to scale with an increasing number of patient samples and/or features, it is desirable to find an algorithm that achieves high-quality results with minimal preliminary processing of the data.

1.4.2 Research Question

The main research question for this degree project is:

Which unsupervised methods and algorithms are suited for clustering high-dimensional EHR data with many missing values?

1.4.3 Challenges

The choice of methods for the identification of groups in the data set imposes challenges due to several aspects:

• High-dimensionality: The data set originally has about 1200 features, which constrains the choice of algorithms and evaluation measures.

• Missing data: The set contains missing values that are not missing at random.

• Accuracy: Data are collected through health questionnaires, and the accuracy of this self-reported data is unknown.


1.4.4 Implications of High-dimensionality

High-dimensional data imposes major challenges for traditional clustering algorithms to have meaningful application. This is recognized as effects attributed to the curse of dimensionality, a phrase originated by Richard Bellman in 1961 [10]. The term captures the increase in complexity of algorithmic problems when dimensionality increases, and it has been studied in various research fields.

When clustering high-dimensional data, there is a chance that many irrelevant dimensions in the data set mask the relevant clusters. Also, assessing cluster similarities through distance-based metrics is difficult, due to the problem of many points being nearly equidistant from each other when dealing with many dimensions. Furthermore, clusters may exist only in subspaces of the high-dimensional data.

Note that there is no clear consensus in the literature regarding what 'high-dimensionality' really means. Data sets with as few as ten features have been referred to as high-dimensional in some works [11].
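To make the loss of distance contrast concrete, the following minimal sketch (not part of the thesis experiments; the uniform random data is assumed purely for illustration) shows how the ratio between the largest and smallest pairwise Euclidean distance approaches 1 as the number of dimensions grows:

    import numpy as np
    from scipy.spatial.distance import pdist

    # Sample 200 random points and compare all pairwise Euclidean distances
    # as the dimensionality grows: the max/min ratio approaches 1, meaning
    # the points become nearly equidistant from each other.
    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        points = rng.random((200, dim))
        dists = pdist(points)  # condensed vector of all pairwise distances
        print(f"dim = {dim:4d}   max/min distance ratio = {dists.max() / dists.min():.2f}")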

1.4.5 Implications of Missing Data

Many real-world data sets have missing values: parameters that were not observed at all, or that were incorrectly collected and removed. In all statistical tasks dealing with missing data, it is important to consider why data is missing and to handle missing variables accordingly. Missing data poses a problem for computation, since many algorithms need a complete input matrix of observations. This has been dubbed the matrix completion problem [12].

Missing data can be categorized as missing at random, missing completely at random, and missing not at random [13]. As an example retrieved from the data used in this thesis, consider a questionnaire that patients answer with a section on painkillers. One question could be "Do you take painkillers to reduce your pain?". If the response is "No", the patient is directed to skip the next items on painkillers, such as whether painkillers are helping with the pain or which painkillers they use regularly. These attributes become missing in this patient's EHR, even though we do not actually know whether painkillers would in fact help with the pain. This example constitutes a case where data is clearly missing not at random and contains many unobserved latent values. Thus, the missing values in the provided data set are non-ignorable in the experiments, and the handling of missing data is essential for efficient clustering results.

Common strategies for handling missing values are [14]:

• Select relevant features so as to use only complete cases. This means using only patient samples which have existing values for all features.

• Ignore null inputs in the clustering.

• Apply imputation techniques to fill in missing values, i.e. predict what the value of a certain feature would be.

When a data set has a large fraction of missing values in every sample and every feature, using only complete cases is impossible, since all rows and columns would be removed from the data set. Furthermore, keeping null inputs and ignoring them in the clustering might lead the subsequent algorithm to depend on the patient samples which have the most complete data collection [15]. Imputation techniques are an interesting approach if done right, and research shows that imputation has been successfully applied to EHR data. Specific imputation techniques are discussed further in chapter 3.
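As a minimal illustration of why the complete-case strategy fails here (the matrix below is randomly generated with the dimensions and sparsity reported in chapter 3, not the real data), filtering to complete cases removes essentially every patient:

    import numpy as np

    # Fabricate a stand-in questionnaire matrix: 5315 patients, 1211 questions,
    # answers coded 0/1/2, with roughly 97% of the entries unanswered (NaN).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(5315, 1211)).astype(float)
    X[rng.random(X.shape) < 0.97] = np.nan

    # Complete-case filtering: keep only rows without any missing value.
    complete = ~np.isnan(X).any(axis=1)
    print(complete.sum(), "complete cases out of", X.shape[0])  # almost surely 0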

1.4.6 Implications of Accuracy

The accuracy of self-reported health data has been studied with varying results [16], and it affects the validity of our results. Patients might be unable to answer questions on their health because they do not know the answer, do not understand the question, or simply choose to hide the truth [16]. In a digital format, it is also reasonable to consider that some patients might accidentally enter incorrect values into the questionnaire. Although it is important to note that inaccurate data points would impact the quality of a potential decision support system, investigating the accuracy of the self-reported data is out of scope for this work.

1.4.7 Limitations

The limitations of this thesis arise mainly from the specificity of the data set and the implications discussed above. The thesis is of a comparative style and does not aim to find a generally best algorithm for clustering, but instead compares relevant algorithms.

Since the data is unlabelled, we are limited to investigating unsupervised approaches. The main challenge when evaluating the methods is that different clustering approaches define clusters in different ways, and no universal measure of quality exists [17]. Instead, we must make reasonable assumptions about the clustering parameters.

Since there are many algorithms for clustering data and for evaluating clusterings, not all methods can be tested in this thesis.


Chapter 2

Background

In this chapter the literature study is presented. The literature study consists of two sections: related theory and related work. The first section contains additional theory, that is, theory from academic papers and books on unsupervised ML that has influenced the methodology of this thesis. The second section covers research papers in which the authors have attempted to answer research questions similar to that of this thesis.

2.1 Related Theory

2.1.1 Clustering

Clustering refers to a set of techniques for grouping unlabelled data points into distinctive groups, based on the similarity of the items. This task is fundamentally unsupervised, since we have data without predefined labels and want to discover the structure of this data. The general goal of clustering is to obtain clusters where data points within a cluster are as similar as possible, whereas data points in different clusters are fairly different from each other. The challenge in clustering and cluster analysis is how we choose to define observation points as being similar or different. Traditional clustering algorithms can be broadly divided into two main groups: hierarchical and partitional algorithms. In this section, we also include the techniques of density-based and subspace clustering, due to their demonstrated relevance for data in higher dimensions.


2.1.2 Hierarchical Clustering Algorithms

Hierarchical clustering algorithms [18] are widely used for clustering many types of data. The hierarchical algorithms can be divided into two categories: agglomerative algorithms (bottom-up) and divisive algorithms (top-down) [19]. In agglomerative mode, one starts with each data point in its own cluster and merges the most similar clusters successively to form a hierarchy [20]. In contrast, in divisive mode, one starts with all the data points in one cluster and recursively divides each cluster into smaller sub-clusters [20].

Hierarchical algorithms' main advantage is that they produce a hierarchy of clusters which is straightforward to interpret; it offers a visualization of the clusters called a dendrogram, seen in figure 2.1. The dendrogram has a tree-like structure and offers information such as where in the clustering process clusters split and sub-clusters are created. Because of this, we do not have to know in advance how many clusters k we want, and we can view the clusterings for every possible number of clusters from 1 to n in the dendrogram and cut the tree at some point (visualized in figure 2.1) [19]. Hierarchical algorithms are often described as a good-quality clustering approach, as they are generally robust to noise and able to create clusters of arbitrary shape. However, a drawback of these methods is their quadratic time complexity of O(n²) [21].

Figure 2.1: An example plot of a dendrogram. The center and right dendrograms are cut at different heights to produce two and three distinct clusters respectively (shown in different colors). Figure adapted from [19].


One main parameter that changes the result of the clustering is, as previously mentioned, the point at which the tree is cut. Another essential parameter is how to define the dissimilarity between two groups of observations. This is called linkage, with the most common types being complete, average, centroid and single linkage [19].
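As a brief illustration of linkage and tree-cutting with SciPy (one of the libraries used in chapter 3), the sketch below builds a hierarchy, plots its dendrogram and cuts the tree into two flat clusters; the toy two-group data and the choice of average linkage are assumptions made only for this example:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    # Toy data: two loose groups of points (illustration only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

    # Build the merge hierarchy bottom-up; 'method' selects the linkage type.
    Z = linkage(X, method="average", metric="euclidean")

    dendrogram(Z)  # tree-like visualization of the successive merges
    plt.show()

    # "Cutting the tree": request a flat clustering with two clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)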

Algorithm 1 Agglomerative Hierarchical
    for all data points p do
        construct a cluster containing p
    end for
    while number of clusters > 1 do
        compute the distances between all clusters
        merge the two closest clusters
    end while

2.1.3 Partitional Clustering Algorithms

Partitional clustering algorithms do not impose a hierarchical structure. Instead, these methods find the clusters simultaneously, as a partition of the data based on a centroid (center point) in each cluster [20]. The idea is to define clusters whose intra-cluster variation is minimized. The most popular and simplest partitional clustering algorithm is k-means [22][20], explained in this section.

Briefly explained, the k-means algorithm first assigns all observations to a cluster randomly. Then, a centroid is calculated for each cluster, using the mean value of the variables in the data set. Each data point is then reassigned to the cluster whose centroid is most similar. This process is repeated until some convergence criterion, calculated by a similarity measure, is satisfied.

In k-means, the number of clusters k must be pre-defined. This is a major drawback of the algorithm, as in unsupervised problems we sometimes do not know how many clusters there are. Another drawback is the initial random assignment of cluster centers, which can lead to non-ideal solutions [23]. Techniques for choosing the number of clusters in k-means should be carefully evaluated based on the data one wants to explore.


Algorithm 2 k-Means
    for all data points p do
        randomly assign a cluster number from 1 to k to p
    end for
    while the convergence criterion is not met do
        compute the centroid of each cluster
        for all p in the data set do
            assign p to the cluster whose centroid is closest
        end for
    end while

A popular technique for choosing k is the elbow method. The method runs k-means multiple times with different values of k; one then chooses the k at which the intra-cluster variation begins to flatten out, at the "elbow" of the curve. Some data sets do not produce an obvious elbow, making it hard to identify a k with this method. Kodinariya and Makwana (2013) [24] propose another method as a rule of thumb: set k naively to k = √(n/2), where n is the number of data points.
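A minimal sketch of the elbow method with scikit-learn (the library used for the experiments in chapter 3); the synthetic blob data is assumed for illustration:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data with a known group structure (illustration only).
    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    # Run k-means for a range of k and record the sum of squared errors
    # (scikit-learn's inertia_); the k where the curve flattens is the elbow.
    ks = range(1, 11)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), sse, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("sum of squared errors")
    plt.show()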

2.1.4 Density-Based Clustering Algorithms

In density-based clustering, clusters are defined as areas of higher data point density than the rest of the data set [25]. Objects in the sparse remaining areas are usually considered to be noise or border points. A well-known density-based clustering algorithm is DBSCAN [26]. This algorithm computes clusters iteratively: it initializes a new cluster C with a not yet assigned core point p, by assigning to C all points that are connected to p [25].

Algorithm 3 DBSCAN
    for all data points p do
        if p is not yet classified and p is a core point then
            collect all points which are density-reachable from p
            and assign them to cluster C
        else
            mark p as an outlier
        end if
    end for


Advantages of density-based algorithms are that they are robust to noise and outliers, and that clusters of arbitrary shape can be detected. However, these algorithms were developed to discover clusters in the full-dimensional space and are not effective at finding clusters in subspaces of the original data set [27]. Also, two parameters have to be set manually by the implementer: ε, the maximum distance between two points, and minPts, the minimum number of points in a cluster. DBSCAN has an average runtime complexity of O(n log n).
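As a minimal illustration (DBSCAN was not used in the thesis experiments), scikit-learn exposes exactly these two parameters as eps and min_samples:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-moons: arbitrarily shaped clusters that a
    # centroid-based method struggles with (toy data for illustration).
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps corresponds to the maximum neighbour distance, min_samples to minPts.
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(np.unique(labels))  # cluster ids; -1 marks noise/outlier points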

2.1.5 Subspace Clustering Algorithms

Subspace clustering is an extension of the traditional clustering algorithms. The idea is to find clusters in different subspaces within a data set, through a top-down or a bottom-up approach. As mentioned earlier in section 1.4.4, it is common for data points in high dimensions to be nearly equidistant from each other. This leads to a masking of clusters when evaluated with some distance measures in full-space clustering. A common approach to tackling this is feature selection, where unnecessary features are removed and dimensionality is reduced. Subspace clustering methods extend feature selection, since they attempt to localize the search and uncover clusters that exist in possibly overlapping subspaces [17].

To illustrate the need for subspace clustering, consider figure 2.2. Along the x and y axes respectively, the data in each single dimension is relatively tightly packed, and it is easy to interpret groups. However, when these features are combined, the data points lie further apart, making the data sparser.

The CLIQUE algorithm [27] is a well-known subspace clustering algorithm, and the one that originally introduced the concept of subspace clustering. It uses a grid-based approach combined with density-based clustering (described in 2.1.4) to discretize the feature space into regular intervals in each dimension [27]. When dense subspaces of the data are found, they are first sorted by coverage, where coverage is defined as the fraction of the data set covered by the high-density units in a subspace [17]. The subspace with the greatest coverage is kept and the rest are removed. Using a depth-first search, CLIQUE then finds the adjacent grid units in each of the subspaces, and clusters are formed by combining these units. In bottom-up fashion, the algorithm starts with a unit and grows a maximal region in each dimension, until the combination of all these units covers the entire cluster [17].


Figure 2.2: In one dimension (along each axis), the data points are seemingly grouped. When projected onto the two-dimensional space, the observations become sparse.


2.1.6 Similarity Measures

A crucial step in cluster analysis is to measure the similarities between objects, and there are a number of methods for doing this. The most common distance measure is the Euclidean distance, which is the length of the vector between two data points. More formally, the Euclidean distance can be defined as:

\[
\lVert p - q \rVert = \sqrt{(p - q) \cdot (p - q)} \tag{2.1}
\]

As mentioned in 1.4.4, in high-dimensional spaces concepts such as distance and proximity become less meaningful due to a loss of contrast in distance measures [28]. This fundamental problem makes the use of the Euclidean distance challenging, due to the setback of many data points being at a nearly equal squared distance from each other. Therefore, we will also consider alternative similarity indices in the experiments, such as the Manhattan distance and the cosine similarity.

The Manhattan distance, also called the "city block" distance, is the sum of the absolute differences of the data points' Cartesian coordinates:


\[
d(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} |p_i - q_i| \tag{2.2}
\]

Another common measure when clustering high-dimensional spaces is the cosine similarity. Given two non-zero vectors p and q, it gives the cosine of the angle between them using the dot product:

\[
\cos(\theta) = \frac{p \cdot q}{\lVert p \rVert \lVert q \rVert} \tag{2.3}
\]
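All three measures can be computed directly with SciPy (one of the libraries used in chapter 3); the toy vectors are made up for illustration, and note that SciPy's cosine function returns the cosine distance 1 − cos(θ) rather than the similarity:

    import numpy as np
    from scipy.spatial.distance import cityblock, cosine, euclidean

    # Two toy answer vectors (illustration only).
    p = np.array([2.0, 0.0, 1.0, 2.0])
    q = np.array([2.0, 1.0, 0.0, 2.0])

    print(euclidean(p, q))   # eq. (2.1), Euclidean distance
    print(cityblock(p, q))   # eq. (2.2), Manhattan / city block distance
    print(1 - cosine(p, q))  # eq. (2.3): recover cos(theta) from the cosine distance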

2.1.7 Clustering Evaluation

Since clustering is an unsupervised method, there are no predefined labels to indicate what a correct clustering result is. Also, there exists no single preferred criterion of what a good division of the data set means, making the task of validating clusters highly non-trivial [29]. Evaluation techniques should therefore be carefully considered. Evaluation indices are often separated into internal, external and relative criteria [29]. Internal clustering indices measure the compactness and separation of the clusters produced by the clustering algorithm. Contrarily, external criteria are based on some external data that contains information about the optimal grouping of the data [29]. For example, a classification of the data points manually crafted by a domain expert could be used as an external criterion. Finally, relative quality indices are often used for choosing a stable output through repeated clustering of the same data set, varying one or more parameters. A common example is to vary the number of clusters in the repeated experiments and compare the values of some validity index for every run.

To evaluate the results of our clustering approaches, validity indices are needed. In this work, we make use of the internal Silhouette Index, due to its compatibility with k-means.

2.1.8 Silhouette Index

A well-known internal criterion is the silhouette index [30]. It calculates the intra-cluster distance and contrasts this with the average distance to elements in other clusters. A high silhouette value indicates a high quality of the clusters, while a low value might indicate a low cluster quality and the occurrence of outliers [30]. The silhouette index is defined as:

\[
s(p) = \frac{b(p) - a(p)}{\max\{a(p), b(p)\}} \tag{2.4}
\]

Here, a(p) is the mean distance between point p and all other points within the same cluster (the intra-cluster distance), whereas b(p) is the smallest mean distance from p to all points in any other cluster of which p is not a member (the nearest-cluster distance).
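A minimal sketch of the index with scikit-learn (toy data assumed): silhouette_samples returns the per-point s(p) of eq. (2.4), and silhouette_score their mean over the whole data set:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_samples, silhouette_score

    # Toy data and a toy clustering (illustration only).
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_samples(X, labels)[:5])             # s(p) for the first points
    print(silhouette_score(X, labels))                   # mean silhouette, Euclidean
    print(silhouette_score(X, labels, metric="cosine"))  # mean silhouette, cosine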

2.2 Related Work

Several unsupervised approaches attempting to find meaningful clusters in general EHR data have been proposed in the past years. Cohen et al. [31] made use of hierarchical clustering to visualize patient states in an intensive care unit, clustering 45 data points collected every minute. They used agglomerative hierarchical clustering with the complete linkage method, and measured similarities between data points with a standard Euclidean distance. Their results found meaningful clusters among patients, which were defined as distinct patient states.

Another traditional clustering approach was used by Panahiazar et al. [32], who wanted to assess patient similarities from EHR data in order to predict an appropriate treatment plan for heart failure patients. They first used traditional clustering algorithms such as k-means and hierarchical clustering. Then, Panahiazar et al. compared the results using a supervised technique, using class labels from patients' medication plans. They used a generalized Mahalanobis distance to measure inter- and intra-cluster similarities. In their unsupervised experiments, hierarchical clustering performed slightly better than k-means.

An interesting attempt to group patients by their EHR textual data was made by Roque et al. [33]. They extracted relevant information from patients' non-structured free-text EHRs by text mining, calculating each patient's similarity to ICD-10 diagnosis vectors extracted from a corpus. ICD-10 is the international standard classification system of diseases, structured after the category of the diagnosis and the location on the body. They grouped the patients using the cosine similarity between all pairs of ICD-10 vectors, which they used as the basis for a hierarchical clustering approach. This method later led them to investigate disease comorbidities and protein/gene occurrence in patients that had multiple diagnosis codes similar to their EHRs.

Promising work has emerged using autoencoders for medical data. An autoencoder is a type of neural network that focuses on learning data representations in an unsupervised manner. An early example is the work of Lasko et al. [34], who applied an unsupervised feature learner using autoencoders to discover phenotypes in patients diagnosed with gout or leukaemia. To do this, Lasko et al. used a stochastic tool (Gaussian process regression) to transform the raw medical data into a continuous probability density, and then used autoencoders to infer meaningful features. Their results showed that the unsupervised features were as accurate as features proposed by a domain expert in detecting subgroups [15]. Lasko et al. aimed at detecting meaningful features in the data rather than detecting patient groups.

Additional noteworthy work in the field of unsupervised learning using neural networks on EHRs was done by Miotto et al. [1], who wanted to create patient representations using a stack of autoencoders. They compared the autoencoder approach with traditional unsupervised algorithms such as PCA and k-means, and measured the accuracy of the representations through supervised classification. Their results show that a stack of denoising autoencoders outperformed the traditional unsupervised approaches.

2.3 Summary of Literature Study

Traditional clustering algorithms can be divided into four main groups: partitional, hierarchical, density-based and subspace clustering. Of these, partitional algorithms and hierarchical clustering are the most widely used methods for cluster analysis. The review of related work suggests that two traditional methods in particular, k-means and hierarchical clustering, are often used when clustering EHR data. They also serve as good benchmark clustering algorithms when developing and evaluating data-specific or novel clustering techniques. No literature was found on density-based or subspace clustering of EHR data, even though both of these approaches are stated to be suitable for high-dimensional data.


The literature review reveals that autoencoders have been applied successfully to EHR data. Recent research by Lasko et al. [34] and Miotto et al. [1] suggests using autoencoders to detect structures and groupings in data sets. The advantage of neural networks mainly lies in their ability to outperform many traditional ML algorithms, although these models are generally more computationally expensive and require much more data than traditional algorithms. Furthermore, neural network-based methods are not as transparent as traditional clustering algorithms, and their results can be difficult to interpret.

Partitional algorithms are computationally efficient, with a time complexity of O(n). These methods are also transparent and their results are interpretable. A drawback of these algorithms, and especially of the k-means algorithm, is that they require prior knowledge of the number of clusters k, although choosing a proper k can be approached through various techniques such as the elbow method. Partitional algorithms assume that the clusters are of spherical shape. K-means minimizes the distances between all points p and their cluster centroids, and makes the assumption that clusters are of similar size. In the context of the medical data used for this study, this might be another disadvantage of the method, as it cannot be determined beforehand whether clusters should appear of similar size.

Hierarchical clustering algorithms, on the other hand, generally seem to perform better than partitional methods in creating high-quality clusters. As opposed to partitional methods, they are also able to discover non-spherical clusters. Hierarchical clustering algorithms additionally offer a good visualization in the form of a dendrogram, which is suitable for interpretation. Their main drawback is that the time complexity is quadratic, O(n²).

Because of the popularity in the literature and widespread use of the two traditional methods, hierarchical clustering and k-means clustering are of interest in this study. Although our high-dimensional data set might imply long running times for the hierarchical method, leading to lengthy experiments, this drawback is compensated by the production of high-quality clusters, which is of main interest for this work.


Chapter 3

Methods

3.1 Data sets

3.1.1 Original data set

We are given a structured data set X = {x1, x2, ..., xm} which contains m = 5315 cases of anonymized patient data. This data primarily consists of discrete elements made up of predefined categories, such as multiple-choice questions and diagnosis codes based on ICD-10 categories. Some attributes also consist of quantifiable numeric values (e.g. body temperature and age).

Features     Q1    Q2    Q3    ...   Qn
Patient 1    -     0     1     ...   -
Patient 2    1     0     -     ...   0
Patient 3    0     2     0     ...   1
...          ...   ...   ...   ...   ...
Patient m    -     1     -     ...   0

Table 3.1: An example of the provided data set

Every attribute corresponds to a question Q = Q1, Q2, ..., Qn that a patient could answer. Each patient case has n = 1211 possible questions (features) that could contain an answer (a value). Patients are asked questions dynamically, based on their previous answers. The answers are coded as 0 = No, 1 = I do not know, and 2 = Yes.

In the given data, patients have answered 32 questions on average. Since the number of questions is very large, whereas the frequency of each answer's appearance in each EHR is small, the features in the data set are highly sparse. Over 97% of the entries of the data matrix are not answered, which, as previously mentioned, challenges the choice of clustering algorithms.

3.1.2 Imputed Data sets

In addition to our raw data set, three additional data sets were fabricated. Two of these tackled the problem of missing values through sophisticated imputation techniques that have previously shown good results with multiple imputation of medical data [35]. These data sets were fabricated using k-nearest neighbours (KNN) imputation and soft imputation, implemented using the fancyimpute [36] library for Python.

In the KNN imputation, missing values were replaced by a value approximated from the values of their neighbours, through inferring an indirect partition along every row and column. The number of neighbours was set to K = 10, as this resulted in a reasonable computing time. If fewer than the chosen ten neighbours were available, the missing value was replaced with zero.

The soft imputation method was implemented to produce a second data set. A soft-thresholded singular value decomposition (SVD) was derived from the original data, taking the matrix with missing values and iteratively replacing these values with values from the SVD. The maximum number of SVD iterations was set to n = 100.

Finally, a third data set was created by randomly scrambling the values of the original data set, yielding a randomized data set with the same dimensions and sparsity as the original. This was used as a baseline data set for comparison.
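A minimal sketch of how the three fabricated data sets might be produced (the toy matrix stands in for the real questionnaire data, and the fancyimpute parameter names follow the library's published API; treat the exact values as assumptions):

    import numpy as np
    from fancyimpute import KNN, SoftImpute

    # Stand-in for the questionnaire matrix: NaN marks an unanswered question.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(100, 50)).astype(float)
    X[rng.random(X.shape) < 0.5] = np.nan

    # Data set 1: k-nearest-neighbour imputation with K = 10 neighbours.
    X_knn = KNN(k=10).fit_transform(X)

    # Data set 2: soft imputation via soft-thresholded SVD, at most 100 iterations.
    X_soft = SoftImpute(max_iters=100).fit_transform(X)

    # Data set 3: randomized baseline, scrambling the same values (NaNs included)
    # so that the dimensions and sparsity are preserved.
    flat = X.copy().ravel()
    rng.shuffle(flat)
    X_rand = flat.reshape(X.shape)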

3.2 Clustering

3.2.1 Determining the Number of Clusters

In order to perform k-means clustering, one first has to decide the number of clusters k. One also has to decide where to cut the dendrogram in the hierarchical clustering approach when evaluating cluster results. To determine k, the elbow method explained in section 2.1.3 was used. We had some initial suggestions to consider before testing which k would be appropriate. Firstly, it is known that the ICD-10 standard consists of 22 main categories. However, this structure was created by domain experts to sort diseases into a convenient taxonomy rather than to account for their real-world occurrence. Furthermore, 178 of the cases had been labelled by a domain expert, resulting in 29 distinct ICD-10 diagnosis codes. Even though we could access these 3.3% of labelled data, we cannot assume that this subset necessarily represents the structures of the whole data set. Moreover, some of these diagnoses may have similar features, leading to two or more diagnoses being clustered together. Still, this knowledge gives us valuable insight to consider before estimating the number of clusters. Lastly, we calculated the rule of thumb suggested by [24] (see section 2.1.3), resulting in approximately 50 distinct clusters. The elbow method was therefore examined with values of k ranging from 1 to 50.

3.2.2 Software and Hardware

All code was written in Python (version 3.7.0) using the Scikit-learn [37], NumPy and SciPy libraries. Python was chosen primarily due to its wide use in computing research and its variety of machine learning packages. The imputation methods were implemented using the fancyimpute library for NumPy matrices. The matplotlib [38] package was used for plotting. Experiments were conducted on a 2.2 GHz 4-core Intel Core i7 processor.

3.2.3 Implementation of Algorithms

The k-means algorithm described in 2.1.3 was the first method to be implemented, using the KMeans class offered by the Scikit-learn library. Clusters were initialized randomly, meaning that the algorithm chooses the initial cluster centroids randomly from the samples. Because of this randomized initialization, each experiment was run 10 times and the average of the evaluation measures was calculated. After the clustering, the results were transformed to the cluster representation, using an internal index, namely the silhouette value described in 2.1.8. Using this representation, clustering results can be compared more easily between algorithms.


Hierarchical clustering was performed next. The algorithm was implemented using the AgglomerativeClustering class, also offered through Scikit-learn. First, the ward, average and complete linkage functions were tested on the raw data set, together with the two similarity measures 'euclidean' and 'cosine'. Comparing the six possible combinations of linkage functions and similarity measures, the average linkage function with the cosine distance measure yielded the best silhouette value, and this combination was therefore selected for further experiments. Thereafter, hierarchical clustering was run on all four data sets. As in the k-means experiments, the results were transformed into the internal cluster representation, and the silhouette scores for the cluster representations were calculated.
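A minimal sketch of this experimental loop, assuming a complete data matrix X (synthetic blobs below stand in for the four data sets). Note that in recent scikit-learn versions the distance parameter of AgglomerativeClustering is called metric, while older versions, such as the one available in 2019, call it affinity:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=6, random_state=0)  # stand-in data

    for k in range(2, 7):
        # k-means: average the silhouette value over 10 random initializations.
        km = [
            silhouette_score(
                X,
                KMeans(n_clusters=k, init="random", n_init=1,
                       random_state=seed).fit_predict(X),
                metric="cosine",
            )
            for seed in range(10)
        ]
        # Agglomerative clustering with average linkage and cosine distances.
        hier = AgglomerativeClustering(
            n_clusters=k, linkage="average", metric="cosine"
        ).fit_predict(X)
        print(k, np.mean(km), silhouette_score(X, hier, metric="cosine"))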


Chapter 4

Results

4.1 Determination of Number of Clusters

The elbow method described in section 2.1.3 was used to determine the number of clusters. The k-means algorithm was run for values of k ranging from 2 to 50 clusters, and the sum of squared errors (SSE) was calculated. A line chart of the SSE as a function of k was plotted in order to detect a possible elbow. As seen in figures 4.1a and 4.1b, the SSE decreases rapidly down to k = 6 for both the original data set and the soft-imputed data set. The SSE then decreases more slowly up to k = 50, indicating the presence of clusters at around k = 6. In figure 4.1c, a clear elbow is not seen, and figure 4.1d reveals that the method run on the randomized data set produces no distinct elbow.

4.2 Evaluation Measures

4.2.1 k-means

The k-means experiments were run and the results are reported in table 4.1. The first experiment was to calculate the silhouette index for k-means with the Euclidean and cosine distances on the original data set. Here, the cosine measure generally produced higher silhouette values than the Euclidean distance. The silhouette value decreases steadily as the number of clusters increases, a pattern seen in the majority of the experiments, which indicates a loss of cluster quality when the number of clusters increases.


Figure 4.1: Plots of the sum of squared errors as a function of k for (a) the original data set, (b) the soft-imputed data set, (c) the KNN-imputed data set and (d) the randomized data. An elbow is seemingly visible in the first two plots, around k = 6.

On the original data set, the silhouette value with the cosine similarity index was highest when run with two clusters, indicating that the points are well matched to their clusters. The very same experiments were then run on the SoftImpute and Filled KNN data sets. The clustering performed on the SoftImpute data set with two clusters and cosine similarity led to a high silhouette value of 0.843, indicating the best cluster quality out of all experiments. With the SoftImpute data set, the silhouette value drastically decreased to a low value for three clusters, and became negative when the number of clusters increased to four; negative silhouette values represent a poor cluster quality. Lastly, k-means clustering was run on the Filled KNN matrix with a varying number of clusters. The silhouette values here were significantly lower than the values on the original and SoftImpute data sets, although the cosine measure yielded a slightly better silhouette value for the Filled KNN data set than the Euclidean distance.



Metric (scale)         2 Clusters   3 Clusters   4 Clusters   5 Clusters   6 Clusters
Original Data
  SSE (x10^3)                1569         1187         1020          867          785
  Euclidean (x10^-3)          559          306          300          273          233
  Cosine (x10^-3)             741          390          374          336          296
SoftImpute
  Euclidean (x10^-3)          601          374          374          343          257
  Cosine (x10^-3)             843           82         -316          -94         -394
Filled KNN
  Euclidean (x10^-3)          277          257          260          247          208
  Cosine (x10^-3)             400          413          395          390          285

Table 4.1: k-means results. Silhouette values are scaled by 10^-3 and SSE values by 10^3.

4.2.2 Agglomerative Hierarchical

Results from the agglomerative hierarchical clustering experiments are displayed in table 4.2. For the original data set, two clusters yielded a silhouette value of 0.691 with Euclidean distances. When calculating the cluster results with four clusters, the silhouette index was higher, which indicates that four distinct clusters are of interest for the hierarchical clusterer. Using the SoftImpute data set, the index was slightly lower overall, and then decreased significantly with five and six clusters respectively. Of all experiments, the implementation with a cosine distance measure on the SoftImpute data set produced very high silhouette values overall, with a value of 0.824 for two clusters. It is also noticeable that the silhouette values with cosine distances on the SoftImpute data set remain high for all numbers of k, ranging between 0.824 and 0.729, compared to the k-means results in table 4.1 on the very same data set, where the silhouette index dropped drastically for k > 3. Thereafter, the silhouette index on the Filled KNN data set was calculated, producing low indices in general. The cosine distance measure resulted in slightly higher values in these experiments, and is, interestingly, slightly higher for 6 clusters than for 5.


Metric (scale)         2 Clusters   3 Clusters   4 Clusters   5 Clusters   6 Clusters
Original Data
  Euclidean (x10^-3)          691          599          720          570          561
  Cosine (x10^-3)             798          806          811          762          713
SoftImpute
  Euclidean (x10^-3)          678          608          558          332          330
  Cosine (x10^-3)             824          785          777          752          729
Filled KNN
  Euclidean (x10^-3)          397          303          280          232          260
  Cosine (x10^-3)             449          309          332          309          401
Randomized Data
  Euclidean (x10^-3)          548          536          500          492          492
  Cosine (x10^-3)             402          400          384          378          374

Table 4.2: Agglomerative hierarchical clustering using average linkage. Silhouette values are scaled by 10^-3.

Clustering the randomized data set produced even higher silhouette values than clustering the Filled KNN data, which suggests that this type of imputation does not preserve the cluster structure that we can presume exists in the original data set and the SoftImpute data set.

4.2.3 Running Times

Running times are presented in table 4.3. Clearly, hierarchical clustering had longer running times than k-means, which aligns with the computational complexity of the algorithms described in sections 2.1.2 and 2.1.3. Hierarchical clustering with average linkage resulted in a longer computing time than with the complete and ward linkages. Calculating running times also revealed that k-means had longer running times on the original data set than on the other data sets.

K-means on different data sets                    Average running time
  Original                                        3.12 s
  SoftImpute                                      2.89 s
  Filled KNN                                      2.60 s
  Randomized Data                                 2.08 s

Hierarchical clustering with different linkages   Average running time
  Complete linkage                                16.78 s
  Average linkage                                 18.31 s
  Ward linkage                                    16.98 s

Table 4.3: Running times of relevant experiments
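The thesis does not publish its timing harness, but average running times of the kind reported in table 4.3 can be measured with a sketch like the following; all names and parameters here are illustrative.

```python
# Averaging wall-clock time over repeated fits; an illustrative sketch only.
import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.default_rng(0).random((200, 50))  # placeholder data

def average_runtime(fit_once, repeats=5):
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit_once()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

for linkage in ("complete", "average", "ward"):
    seconds = average_runtime(
        lambda: AgglomerativeClustering(n_clusters=6, linkage=linkage).fit(X))
    print(f"{linkage} linkage: {seconds:.2f} s on average")
```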


Chapter 5

Discussion and Analysis

Unsupervised machine learning is the task of learning patterns in data where no labels are present. Unlike many optimization problems, no single optimal clustering result exists. Different algorithms with different parameters produce different clusters, and none can be proved to be the best solution, since several good partitions of the data might be found. Discussing the result of unsupervised methods and its implications is therefore highly non-trivial, since no benchmark results or ground truth are available for comparison, as with supervised techniques.

In our work, a data set with a high proportion of missing data was provided. This challenged the choice of algorithms as well as of evaluation indices. To tackle the problem of missing values, two imputation techniques were used in order to produce comparable data sets while preserving the structures of the original data set. Generally, such methods of multiple imputation should be carefully considered. In the context of this work, imputing missing values by an SVD or KNN approach might induce bias or lead to a loss of information, since the missing data in the set are clearly not missing at random. Most likely, the KNN-imputed data set did in fact result in a loss of information, since silhouette values for this data set were on average lower in all experiments: natural groupings that presumably exist in the original data set were blurred together with their neighboring data points. In a medical context, this implies that the KNN-imputed set contained patients whose symptoms appeared artificially similar.

The quality of the clusters was measured through the silhouette index in our experiments, which indicates how similar a data point is to the other points in the same cluster. Although this index is limited to measuring the quality of the clustering structure without any external information, it yielded information about the cluster similarities that indicated the quality of the different partitions made by the clustering algorithms.
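For reference, the silhouette value of a single point i, following Rousseeuw [30], can be written as below, where a(i) is the mean distance from i to the other points of its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1.
```

Values close to 1 indicate a point well inside its cluster, values near 0 a point on a cluster border, and negative values a point that would fit better in another cluster, which is how the negative SoftImpute entries in table 4.1 should be read.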

5.1 Determination of Number of Clusters

The clear elbow showing in plots 4.1a and 4.1b indicates that there exists a number of clusters that fits the data better than other numbers. Interpreting the plot, an elbow is visible at around k = 6 and k = 7. This was a useful insight for further experiments, since it can be challenging to interpret plots from the elbow method. Also, using the elbow method on a randomized data set with the same dimensions and sparsity yielded a completely different result, with no indication of an elbow. Furthermore, our subsequent intra-cluster measures and intra-cluster distances imply that natural clusters exist in the data for this number of clusters. Not finding an appropriate k would possibly have led to uncertainty about whether the data set contains clusters at all, and setting an arbitrary k might unintentionally bias the following clustering experiments. In a medical context, a presence of clusters at around k = 6 offers meaningful insights. Indeed, our hypothesis was that the appropriate k would be found at around 20 to 30, due to the sample size, the proposed rule-of-thumb method, and the notion of having a large feature space. In the ICD-10 context, this finding might indicate that the clustering algorithms do not reproduce the international standard ICD-10 taxonomy.
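A minimal sketch of this procedure, assuming a placeholder matrix X and using scikit-learn [37] and matplotlib [38], is:

```python
# Plot k-means SSE (inertia) against k and look for the bend ("elbow");
# X is a placeholder for the real data matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 50))  # placeholder data

ks = list(range(2, 15))
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (inertia)")
plt.title("Elbow method")
plt.show()
```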

5.2 Handling of Missing values

Indeed, handling missing values is a difficult problem, and techniques to tackle incomplete matrices should be carefully considered. In our context, imputing missing values into patient records means imputing values that are certainly not missing at random. One proposed approach to dealing with missing values is to completely delete rows or columns that contain them, which was impossible in our case because of the sparsity of the raw data matrix. Other approaches suggest replacing the value with a mean value for the feature. On this data, such techniques would ruin structures of the data by diminishing the difference between filled values and missing values, resulting in an information loss in the following clustering stage. The imputation methods used were implemented after reading the work of [35]. The SoftImpute method seems to be the better choice for our data set. In the k-means experiments especially, the SoftImpute data set yielded a higher silhouette value than the raw data set, indicating that this imputation method is applicable to our data set. In contrast, the experiments on the KNN data set yielded poorer silhouette values. Most likely, the KNN-imputed data set did lead to a loss of information, since its silhouette values were on average lower in all experiments: natural groupings that presumably exist in the original data set and in the SoftImpute data set were blurred together with their neighboring data points. In a medical context, this implies that the KNN-imputed set contained patients who appeared too similar to each other.
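Both imputation routes are available through the fancyimpute package [36]. The sketch below assumes a small toy matrix with np.nan marking missing answers; note that the method name (fit_transform, or complete in older releases) depends on the package version.

```python
# SoftImpute (low-rank SVD completion) versus KNN imputation on a toy
# matrix; np.nan marks missing questionnaire answers.
import numpy as np
from fancyimpute import SoftImpute, KNN

nan = np.nan
X_incomplete = np.array([[1.0, nan, 0.0, 1.0],
                         [nan, 1.0, 1.0, 0.0],
                         [1.0, 0.0, nan, 1.0],
                         [0.0, 1.0, 1.0, nan]])

X_soft = SoftImpute().fit_transform(X_incomplete)  # SVD-based completion
X_knn = KNN(k=2).fit_transform(X_incomplete)       # neighbour-based filling
```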

5.3 Comparison of the Algorithms

Both clustering algorithms performed better on the raw data set than on the imputed data sets. This indicates that the imputation methods were not appropriate for our data set. In k-means, using the data set with SoftImputed values resulted in almost the same silhouette value as the raw data set. Also, the SoftImpute data set showed a clear elbow in the plot when looking for an optimal k, indicating that this data set kept the information structure of the data well.

Agglomerative hierarchical clustering performed slightly better than k-means on the same data. This aligns with previous research, which generally portrays hierarchical algorithms as a higher-quality clustering approach. Hierarchical clustering algorithms are generally robust to noise and create clusters of arbitrary shape, while the k-means algorithm generates spherical clusters. Also, hierarchical algorithms have been shown to work well on high-dimensional data. Our initial results show that the hierarchical algorithm with average linkage had a clear advantage over runs with other linkages.

The cosine measure was applicable in our experiments, since a larger magnitude of a sample vector does not necessarily mean that the algorithm should put extra weight on that sample. Cosine similarity is generally used for measuring distance when the magnitude of the vectors does not matter, and is often used in text mining applications. According to previous theory, cosine has been shown to be a good similarity index for high-dimensional data sets. This was also visible in our results, where usage of the cosine metric yielded a higher silhouette value in almost all runs. The largest difference between the euclidean distance measure and the cosine measure was observed for the hierarchical clustering algorithm on the SoftImputed data set.
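Concretely, the cosine distance between two patient vectors x and y depends only on the angle between them, not on their lengths:

```latex
d_{\cos}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}.
```

Scaling a patient's vector, for instance because one patient answered many more questions than another, leaves this distance unchanged, whereas the euclidean distance grows with vector magnitude.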

From our results, we can also see that both algorithms revealed more grouping structure in our raw data and imputed data than in the randomized data. For the hierarchical clustering, we observed a silhouette value on the KNN-imputed data that was lower than for the randomized data. This might indicate that the imputation in fact degrades the natural cluster structures present in the raw data set.

5.4 Context

In a medical context, it is interesting to find the presence of patient subgroups in EHR data using unsupervised machine learning. However, an unsupervised approach is hard to assess in a medical context because of the difficulty of interpretation and evaluation. In our work, the initial hypothesis for the optimal number of clusters was significantly greater than the result. It is unlikely that our data set provided a balanced representation of all diseases mentioned in the ICD-10 classification taxonomy. However, knowing that there are notably fewer natural clusters than there are diagnosis types, or than the number of reasons for encounter that a patient can submit, might indicate that the disease symptoms of certain ICD-10 codes are very similar in their disease phenotype. This finding could, however, also have been induced by the data set itself, since the questionnaire is filled in dynamically and previous answers determine the next questions a patient is given.


5.5 Ethical Perspective

Medical data contains highly personal and perhaps sensitive information. With an increasing collection of digital patient data such as EHRs, security is a concern for users of such services [39]. The data set used in this thesis had been anonymized by the principal. Additionally, attribute tags that may contain sensitive information had been removed and replaced by an unrecognizable label. Free-text responses from patients in the EHR had been left out.

Despite the growth in using the web and apps for diagnosis and triage advice, the clinical performance of such platforms has not been thoroughly assessed [40]. As the work in this thesis may be integrated into a computerized decision support system, it is important to discuss the consequences of such systems. Doctors report that a concern with patients reporting their symptoms digitally is that a seriously ill patient could enter their information and get reassurance back that they are okay [41]. If such false-negative situations occur based on an algorithm's suggestion, the quality of these systems could have a large impact and potentially be life-threatening for the patient.

5.6 Limitations

This work has several limitations. Unsupervised learning is difficult to evaluate, and discussing the implications of clusters is generally hard. Firstly, we used structured EHR data extracted from a specific patient questionnaire. No benchmarks for this type of sparse and high-dimensional categorical data were found, so the results were hard to interpret and discuss. Also, additional effort could have been spent on experimenting with and evaluating the imputation techniques, since these techniques have a high impact on data sets where the ratio of missing values is high. Only two algorithms were evaluated in this work, while an enormous number of clustering algorithms exists in the literature. The chosen algorithms do not propose a generally best solution for this type of data in any sense, just as with any unsupervised technique, but rather act as a benchmark for conducting experiments on our specific data.


5.7 Future Research

Many interesting applications can be explored in future work. Further research in unsupervised techniques for clustering patient data could try more sophisticated methods for clustering, and perhaps combine different methods using more complete, real-world EHR data. In such data, free-text notes from doctors or patients and laboratory tests could be of interest to include in the clustering analysis. Furthermore, the review of relevant work on clustering medical data reveals promising results when using unsupervised autoencoders to detect representations in high-dimensional EHR data, which could play an important role in future work; a sketch of that direction follows below. It would be relevant to compare the traditional unsupervised methods to neural network-based autoencoders and contrast the results of the different approaches. Another possible improvement to the cluster evaluation could be to complement the silhouette index with some external evaluation index. For example, it would be interesting to simulate the hierarchy of ICD-10 categories and compare these groupings to the clustering partitions. Lastly, to overcome the problem of interpreting cluster quality objectively, it would be relevant to extend the validation of clusters and compare unsupervised methods with results on labeled data, perhaps from a data set that is completely labeled with patients' diagnoses.
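As a very rough illustration of that autoencoder direction, the sketch below (an assumption in every detail: PyTorch, layer sizes, training loop; nothing here comes from the thesis) compresses the feature matrix to a low-dimensional representation that could then be clustered as in chapter 4.

```python
# A minimal autoencoder sketch; all sizes and data are placeholders.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_features, n_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_latent))
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 64), nn.ReLU(), nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(model, x, epochs=50, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), x)  # reconstruction error
        loss.backward()
        optimizer.step()
    return model

x = torch.rand(200, 50)                      # placeholder for (imputed) data
model = train(Autoencoder(x.shape[1]), x)
latent = model.encoder(x).detach().numpy()   # representation to cluster
```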


Chapter 6

Conclusions

In this study, we examined the field of unsupervised learning. We compared different unsupervised machine learning approaches to research which clustering algorithms are suitable for grouping high-dimensional, sparse, real-world EHR data. Clustering high-dimensional and sparse data is a difficult task, partly due to the curse of dimensionality and its effect on internal evaluation, such as distance metrics and the silhouette value. It is difficult to interpret the structure of such high-dimensional data, and to visualize it. Furthermore, a challenge of this thesis was to conduct cluster analysis on a data set with a high proportion of missing values. Imputing missing data has been done with promising results on medical data, although it is not clearly examined how well such data reflects the actual problem behind clustering EHR data. In our work, we conducted clustering experiments using two well-known traditional algorithms, k-means and hierarchical clustering, on four comparable data sets: in addition to the original data set, two data sets with imputed values and one randomized data set were used. Our results using the elbow method with k-means suggest that the data set does contain cluster structures. The particular clustering algorithms in the thesis showed that natural groupings existed around k = 6, which indicates that 6 patient subgroups were found in the data set. We could also see that using an SVD technique for imputing missing values preserved the cluster structures. Since the concept of distance is difficult to evaluate with many features in a data set, the cosine similarity yielded a higher internal evaluation index. Each algorithm had its advantages and disadvantages, although our results showed that hierarchical clustering clearly performed better on our data set.


Bibliography

[1] Riccardo Miotto et al. "Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records". In: Scientific Reports 6 (2016), p. 26094.

[2] Djordje Gligorijevic et al. "Large-Scale Discovery of Disease-Disease and Disease-Gene Associations". In: Scientific Reports 6 (2016), p. 32404.

[3] Trang Pham et al. "DeepCare: A Deep Dynamic Memory Model for Predictive Medicine". In: arXiv preprint (2017). arXiv: 1602.00357.

[4] Elie Aljalbout et al. "Clustering with Deep Learning: Taxonomy and New Methods". In: arXiv preprint (2018). arXiv: 1801.07648.

[5] Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review". In: ACM Computing Surveys 31.3 (1999), pp. 264–323.

[6] Jing Zhao. "Learning Predictive Models from Electronic Health Records". PhD thesis. 2017.

[7] George Hripcsak and David J. Albers. "Next-generation phenotyping of electronic health records". In: Journal of the American Medical Informatics Association 20.1 (2013), pp. 117–121.

[8] Martin Stumpe and Lily Peng. Assisting Pathologists in Detecting Cancer with Deep Learning. 2017. URL: https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html (visited on 03/16/2018).

[9] Andreas Holzinger. Machine Learning for Health Informatics: State-of-the-Art and Future Challenges. Graz: Springer Nature, 2016, p. 211.

[10] Richard Bellman. Adaptive Control Processes: A Guided Tour. 1961, p. 255.

[11] Ira Assent. "Clustering high dimensional data". In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2.4 (2012), pp. 340–350.

[12] Emmanuel J. Candès and Benjamin Recht. "Exact matrix completion via convex optimization". In: Foundations of Computational Mathematics 9.6 (2009), pp. 717–772.

[13] Joseph L. Schafer and John W. Graham. "Missing Data: Our View of the State of the Art". In: Psychological Methods 7.2 (2002), pp. 147–177.

[14] Brett K. Beaulieu-Jones. "Machine Learning for Structured Clinical Data". In: arXiv preprint (2017), pp. 1–18. arXiv: 1707.06997.

[15] Brett K. Beaulieu-Jones and Casey S. Greene. "Semi-supervised learning of the electronic health record for phenotype stratification". In: Journal of Biomedical Informatics 64 (2016), pp. 168–178.

[16] Sallie Anne Newell et al. "The accuracy of self-reported health behaviors and risk factors relating to cancer and cardiovascular disease in the general population". In: American Journal of Preventive Medicine 17.3, pp. 211–229.

[17] Lance Parsons, Ehtesham Haque, and Huan Liu. "Subspace Clustering for High Dimensional Data: A Review". In: SIGKDD Explor. 6.1 (2004), pp. 90–105.

[18] Stephen C. Johnson. "Hierarchical clustering schemes". In: Psychometrika 32.3 (1967), pp. 241–254.

[19] Gareth James et al. An Introduction to Statistical Learning. 2013.

[20] Anil K. Jain. "50 years beyond K-means". In: Pattern Recognition Letters 31.8 (2010), pp. 651–666.

[21] Michael Steinbach, George Karypis, and Vipin Kumar. A Comparison of Document Clustering Techniques. Tech. rep. X. 2000, pp. 1–2.

[22] James B. MacQueen. "Some methods for classification and analysis of multivariate observations". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Vol. 1. 1967, pp. 281–297.

[23] Sarka Brodinova et al. "Robust and sparse k-means clustering for high-dimensional data". In: ArXiv e-prints (2017). arXiv: 1709.10012.

[24] Trupti M. Kodinariya and Prashant R. Makwana. "Review on determining number of Cluster in K-Means Clustering". In: International Journal of Advance Research in Computer Science and Management Studies 1.6 (2013), pp. 2321–7782.

[25] Hans-Peter Kriegel et al. "Density-based clustering". In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1.3 (2011), pp. 231–240.

[26] Martin Ester et al. "Density-based algorithm for discovering clusters in large spatial databases with noise". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1996, pp. 226–231.

[27] Rakesh Agrawal et al. "Automatic subspace clustering of high dimensional data for data mining applications". In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. 1998, pp. 94–105.

[28] Michael E. Houle, Hans-Peter Kriegel, and Peer Kröger. "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?" In: Proceedings of SSDBM. 2010, pp. 482–500.

[29] Nenad Tomašev and Miloš Radovanović. "Clustering Evaluation in High-Dimensional Data". In: Unsupervised Learning Algorithms. Springer, Cham, 2016, pp. 71–107.

[30] Peter J. Rousseeuw. "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20 (1987), pp. 53–65.

[31] Mitchell J. Cohen et al. "Identification of complex metabolic states in critically injured patients using bioinformatic cluster analysis". In: Critical Care 14.1 (2010), pp. 1–11.

[32] Maryam Panahiazar et al. "Using EHRs for Heart Failure Therapy Recommendation Using Multidimensional Patient Similarity Analytics". In: Studies in Health Technology and Informatics. Vol. 210. 2015, pp. 369–373.

[33] Francisco S. Roque et al. "Using electronic patient records to discover disease correlations and stratify patient cohorts". In: PLoS Computational Biology 7.8 (2011), pp. 1–10.

[34] Thomas A. Lasko, Joshua C. Denny, and Mia A. Levy. "Computational Phenotype Discovery Using Unsupervised Feature Learning over Noisy, Sparse, and Irregular Clinical Data". In: PLoS ONE 8.6 (2013), pp. 1–13.

[35] Brett K. Beaulieu-Jones and Jason H. Moore. "Missing Data Imputation in the Electronic Health Record using Deeply Learned Autoencoders". In: Pacific Symposium on Biocomputing. 2017, pp. 207–218.

[36] Alex Rubinsteyn and Sergey Feldman. Fancy Impute: A variety of matrix completion and imputation algorithms implemented in Python. Version 0.0.16. 2016. URL: https://github.com/iskandr/fancyimpute (visited on 07/15/2018).

[37] Fabian Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[38] John D. Hunter. "Matplotlib: A 2D graphics environment". In: Computing in Science & Engineering 9.3 (2007), pp. 90–95.

[39] Deborah Lupton and Annemarie Jutel. "'It's like having a physician in your pocket!' A critical analysis of self-diagnosis smartphone apps". In: Social Science and Medicine 133 (2015), pp. 128–135.

[40] Hannah L. Semigran et al. "Evaluation of symptom checkers for self diagnosis and triage: Audit study". In: BMJ 351 (2015), pp. 1–8.

[41] Matt Burgess. "Can you really trust the medical apps on your phone?" 2017. URL: http://www.wired.co.uk/article/health-apps-test-ada-yourmd-babylon-accuracy (visited on 07/09/2018).


TRITA-EECS-EX-2019:95

www.kth.se