Syndromic Surveillance using Generic Medical Entities on Twitter

Pin Huang♦, Andrew MacKinlay♦♠ and Antonio Jimeno Yepes♦
♦ IBM Research – Australia, Melbourne, VIC, Australia

♠ Dept of Computing and Information Systems, University of Melbourne, Australia
[email protected], {admackin,antonio.jimeno}@au1.ibm.com

Abstract

Public health surveillance is challenging due to difficulties accessing medical data in real-time. We present a novel, effective and computationally inexpensive method for syndromic surveillance using Twitter data. The proposed method uses a regression model on a database previously built using named entity recognition to identify mentions of symptoms, disorders and pharmacological substances over GNIP Decahose Twitter data. The result of our method is compared to the reported weekly flu and Lyme disease rates from the US Centers for Disease Control and Prevention (CDC) website. Our method predicts the 2014 CDC reported flu prevalence with 94.9% Spearman correlation using 2012 and 2013 CDC flu statistics as training data, and the CDC Lyme disease rate for July to December 2014 with 89.6% Spearman correlation. It also predicts the prevalences for the same diseases and time periods using the Twitter data from the previous week with 93.31% and 86.9% Spearman correlations respectively.

1 Introduction

Real-time public health surveillance for tasks such as syndromic surveillance is challenging due to difficulties accessing medical data. Twitter is a social media platform in which people share their views, opinions and their lives. Data from Twitter is accessible in real-time and it could potentially be used for syndromic surveillance. Even if only a small portion of the tweets potentially contains information about the health of Twitter users (Jimeno-Yepes et al., 2015a), there is still a large volume of data that could be useful for public health surveillance.

Several approaches to predict flu prevalences from Twitter data already exist. These approaches either rely on topic modelling (e.g. Latent Dirichlet Allocation (LDA) (Blei et al., 2003)) (Paul and Dredze, 2012; Paul and Dredze, 2011) or rely on regression models on keyword frequency (Culotta, 2010a; Culotta, 2010b).

The topic modelling approach for flu prevalence prediction requires manually labelling a large number of tweets (e.g. 5,128 tweets), which are used to train a Support Vector Machine (SVM) (Joachims, 1999) classifier applied to 11.7 million messages. The tweet-level predictions are then fed into an LDA-based topic model over millions of tweets (Paul and Dredze, 2012; Paul and Dredze, 2011). Regression approaches (Culotta, 2010a; Culotta, 2010b) require prior knowledge to develop a keyword list {flu, cough, sore throat, headache} that could identify tweets relevant to flu.

In this paper, we propose an effective and computationally efficient alternative for disease prevalence prediction based on an already existing database developed by Jimeno-Yepes et al. (2015b). Our approach to predicting disease prevalence does not require manual labelling of Twitter posts to determine whether the posts are related to a particular disease or not. Our training dataset uses aggregated weekly term frequencies, so it is less computationally expensive to train compared to other approaches trained on millions of tweets. In addition, compared with regression approaches (Culotta, 2010a; Culotta, 2010b), no prior knowledge was used to manually develop a list of keywords indicative of a disease. Overall, we used our method to effectively predict the prevalence of flu and Lyme disease one week ahead of reported CDC data.

Pin Huang, Andrew MacKinlay and Antonio Jimeno Yepes. 2016. Syndromic Surveillance using Generic Medical Entities on Twitter. In Proceedings of Australasian Language Technology Association Workshop, pages 35−44.


2 Modelling weekly syndromic rate

2.1 Dataset Introduction

The Twitter data for the years 2012, 2013 and 2014 was obtained from the GNIP Decahose,1 which provides a random 10% selection of available tweets. From here, only English tweets were considered and retweets were removed.

Each tweet was annotated with three types of medical named entities: disorders, symptoms and pharmacological substances (PharmSub) (Jimeno-Yepes et al., 2015b). These entity types are defined using the UMLS (Unified Medical Language System) semantic types (Bodenreider, 2004). Recognition of entities was performed using a trained conditional random field annotator. Statistics on the entities annotated by this classifier for the first half of 2014 are available from Jimeno-Yepes et al. (2015a). Annotation of pharmacological substances is complemented by a dictionary-based annotator using terms from the UMLS.

Since just a small portion of tweets contains declared location information, posts containing medical entities were automatically geolocated using the method presented in Han et al. (2012). This geolocation was used to select tweets from the USA, since our reference is the US CDC.

Based on the annotated tweets in the USA, Twitter term counts are aggregated on a weekly basis; the term counts are then normalized by the weekly total number of tweets. For the three years of data, the sample size of the dataset used for prevalence prediction is 156, since there are 52 weeks per year.
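The aggregation and normalization step can be sketched as follows. This is an illustrative sketch only: the column names, week handling and counts are hypothetical stand-ins, not the paper's actual schema.

```python
import pandas as pd

# Hypothetical input: one row per medical-entity mention in a tweet.
mentions = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2012-01-02", "2012-01-03", "2012-01-03", "2012-01-10"]),
    "term": ["coughing", "coughing", "fever", "coughing"],
})
# Total tweets observed per week (from the full Decahose sample).
weekly_totals = pd.Series({
    pd.Period("2012-01-02", freq="W"): 1000,
    pd.Period("2012-01-10", freq="W"): 2000,
})

# Aggregate term counts on a weekly basis ...
mentions["week"] = mentions["timestamp"].dt.to_period("W")
counts = mentions.groupby(["week", "term"]).size().unstack(fill_value=0)
# ... then normalize by each week's total tweet volume.
weekly_freq = counts.div(weekly_totals, axis=0)
```

Each row of `weekly_freq` is then one weekly sample of normalized term frequencies, matching the paper's 52-samples-per-year setup.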

The weekly term frequency dataset is then mapped to the weekly CDC data. Three years of data are available for flu prevalence prediction. Only 2013 and 2014 CDC data are available for Lyme disease prevalence prediction, so the Lyme disease dataset has a sample size of 104.

2.2 Overall Architecture

The proposed methodology is a predictive model which aims to achieve the following goals:

• Predict the reported CDC flu and Lyme disease trend using weekly term frequencies to predict syndromic weekly rates.

• Predict the reported CDC flu and Lyme disease trend one week in advance, using weekly term frequencies to predict the following week's syndromic rates.

1 http://support.gnip.com/apis/firehose/overview.html

The overall architecture of the proposed methodology is shown in Figure 1. The first step is data preprocessing, followed by feature engineering and support vector machine (SVM) (Gunn and others, 1998) regression modelling. This regression model is trained to combine the engineered features from our Twitter database to perform syndromic prediction.

A major challenge of the first step is mapping Twitter terms with similar meanings from our database to a unique term. A mapping algorithm is proposed to map synonyms into a unique term.

After the synonym mapping, a series of feature engineering methods is applied to engineer a final set of the most important features. Finally, prediction is made using a trained SVM regression model on the final set of features.

Twitter Term            Concept          Entity Type
adrenal disease         adrenal disease  Disease
adrenal disorder        adrenal disease  Disease
adrenal gland disease   adrenal disease  Disease
adrenal gland disorder  adrenal disease  Disease
acne treatment          acne treatment   PharmSub
treatment acne          acne treatment   PharmSub
abdomen pain            abdominal pain   Symptom
abdominal pain          abdominal pain   Symptom
abdominal pains         abdominal pain   Symptom
gut pain                abdominal pain   Symptom

Table 1: A sample of Concept Mapping

Entity Type  Found in UMLS  Not Found in UMLS
Disease      9162           19454
PharmSub     15891          23556
Symptom      2604           53142

Table 2: Unique Twitter entities found in UMLS

2.3 Twitter entity synonyms mapping

Terms in medical entities from our Twitter dataset may have the same meaning but different surface forms, e.g. vomit and throw up. Treating these synonyms as different input features to a regression model may result in a performance bias. Aggregating weekly term counts for synonyms maximizes the probability that the input features are not highly correlated with each other.

Therefore, we propose a synonym mapping algorithm that uses the UMLS to map Twitter medical entity synonyms to a unique term. Table 2



Figure 1: Overall Architecture

shows the statistics of how many Twitter entities could or could not be found in the UMLS. The unique term is considered to be a concept term for its synonyms. In the UMLS, medical terms with the same meaning are associated with one concept ID. Twitter terms are mapped to the UMLS medical terms in order to find concept IDs for the Twitter synonyms. A C# program was developed to automate this task. Details of the algorithm are explained below.

As already mentioned, Twitter terms are annotated with three types of medical entities: symptoms, disorders and pharmacological substances. Based on the medical entity associated with each Twitter medical term, the terms are segmented into three groups: Symptom Terms, Disorder Terms and Pharmacological Substance Terms. In the UMLS, each concept ID is associated with a TUI (Type Unique Identifier), indicating the semantic type of the concept ID. Three types of TUI are used for the synonym mapping: symptom, disorder and pharmacological substance. Each group of Twitter medical terms is mapped to the corresponding type of concept IDs in the UMLS.

If a Twitter medical term can be found in the UMLS dataset and is mapped to only one UMLS concept ID, that concept ID is used as the reference for the term. If the Twitter term cannot be found in the UMLS, the term is its own reference concept.

An advantage of mapping the three types of Twitter terms separately is that, when a term is associated with more than one UMLS concept ID, the medical entity type associated with the term can help determine the most suitable UMLS concept ID of the same type. For example, suppose a UMLS concept ID is associated with two semantic types, symptom and disorder, and a Twitter term annotated with the disorder entity is mapped to this concept ID. The medical entity type of the Twitter term helps the algorithm determine that the most suitable concept ID for the term is the UMLS concept ID associated with the disorder semantic type. It is still possible, however, that a term is mapped to more than one UMLS concept ID. In this case, each UMLS concept ID is related to one or more Twitter terms, and the most appropriate concept ID for the Twitter term is the UMLS concept ID associated with the largest number of Twitter terms.

After determining the most suitable concept ID for each term, the algorithm continues to identify the best concept label for each concept ID. Using the concept label instead of the ID gives us a better understanding of the model outcomes.

A UMLS concept ID may be associated with more than one Twitter medical term. The best label for a concept ID is the Twitter term that appears most often in the Twitter database. Table 1 shows a sample of the concept mapping.
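The mapping procedure above can be sketched in a few lines. This is an illustrative sketch, not the paper's C# implementation: `umls_index` is a toy stand-in for a real UMLS lookup, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def map_synonyms(twitter_terms, umls_index, term_counts):
    """Map each Twitter term to a reference concept label.

    twitter_terms: {term: entity_type} from the NER annotator.
    umls_index: {(term, semantic_type): [concept_ids]} -- a toy stand-in
        for a real UMLS lookup, already filtered by matching TUI.
    term_counts: Counter of how often each term occurs in the tweets.
    """
    # Count, per candidate concept ID, how many Twitter terms map to it,
    # so ambiguous terms can prefer the most widely supported concept.
    support = Counter()
    for term, etype in twitter_terms.items():
        for cid in umls_index.get((term, etype), []):
            support[cid] += 1

    concept_of = {}
    for term, etype in twitter_terms.items():
        cids = umls_index.get((term, etype), [])
        if not cids:
            concept_of[term] = term          # not in UMLS: its own reference
        elif len(cids) == 1:
            concept_of[term] = cids[0]
        else:                                # ambiguous: most-supported CUI
            concept_of[term] = max(cids, key=lambda c: support[c])

    # Label each concept with its most frequent Twitter surface form.
    members = defaultdict(list)
    for term, cid in concept_of.items():
        members[cid].append(term)
    label_of = {cid: max(terms, key=lambda t: term_counts[t])
                for cid, terms in members.items()}
    return {term: label_of[cid] for term, cid in concept_of.items()}

# Toy example: 'vomit' and 'throw up' share one (hypothetical) concept ID.
umls = {("vomit", "Symptom"): ["C0042963"],
        ("throw up", "Symptom"): ["C0042963"]}
terms = {"vomit": "Symptom", "throw up": "Symptom", "gut pain": "Symptom"}
counts = Counter({"throw up": 50, "vomit": 10, "gut pain": 3})
mapping = map_synonyms(terms, umls, counts)
```

Here both synonyms are labelled with the more frequent surface form, while the term missing from the toy index maps to itself, mirroring the fallback described above.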

2.4 Feature Engineering

After mapping synonyms to unique terms, the dataset contained 112,690 unique terms. A series of feature engineering steps is conducted to improve the computational efficiency and the predictive performance of our methodology. Through feature engineering, a set of the most important features is selected and mathematically reduced using Partial Least Squares and Recursive Feature Elimination with SVM. These engineered features are the input features for the final regression model that is trained to predict weekly disease rates.

The architecture of the feature engineering is shown in Figure 2. We use a nested cross-validation strategy. The outer division is two-fold or three-fold, into a training dataset and a separate validation set. We then apply 10-fold CV to the training set of each outer fold.

Non-frequent and irrelevant features are first removed. Partial Least Squares (PLS) regression (Abdi, 2003) is then applied to reduce the number of dimensions. Different numbers of components are computed from PLS. A dimension reduction technique for selecting the optimal number of PLS components is proposed in a later subsection. With the optimal number of components selected from PLS, each feature's Variable Importance in Projection (VIP) (Wold and others, 1995) is



Figure 2: Feature Engineering Workflow.

Algorithm 1 Pseudocode of Selecting the Optimal Number of PLS Components
1: N = maximum number of components resulting from PLS
2: MC = maximum validation correlation
3: BN = selected optimal number of components
4: for n = 1; n <= N; n = n + 1 do
5:     Let Validation Correlation be the correlation for the outer validation dataset
6:     Validation Correlation = Cor(Predicted Result, CDC Rates of validation set)
7:     if MC < Validation Correlation then
8:         MC = Validation Correlation
9:         BN = n
10:    end if
11: end for
12: Return MC, BN

calculated. Wold and others (1995) suggest that features with very low VIP are unimportant and can be removed. A PLS VIP based feature removal technique is proposed to further remove unimportant features. Recursive Feature Elimination using Support Vector Machines (SVM) (Guyon et al., 2002; Gunn and others, 1998) is then used to retrieve the final set of the most important features.

2.4.1 Non-Frequent and Irrelevant Features Removal

Non-frequent unique terms tend to have near-zero variance in the dataset and do not significantly impact the prediction outcome. Therefore, unique terms were removed if they appeared in fewer than 30 tweets in our dataset. This threshold was selected based on the examination of a histogram, to determine the cutoff point that excludes the "long tail" of terms while still retaining important terms likely to be useful for our modelling process. Applying this cutoff reduces the number of unique concept terms from 112,690 to 8,525. However, the number of features is still far greater than the number of samples in the reference CDC dataset (8,525 features vs 52 weeks per year in our study). Recent studies have shown that

PLS is able to deal with datasets with more features than the sample size (Li and Zeng, 2009); therefore, PLS is our first preferred algorithm to train the dataset. It has been shown that PLS's predictive performance improves if irrelevant features are removed beforehand (Li et al., 2007). Our approach to determining irrelevant features is different to that of Li et al. (2007). In this paper, the PLS predictive performance is considered to be the correlation between the predicted CDC weekly rates and the actual CDC weekly rates. We apply Pearson correlation to determine irrelevant features. Pearson correlation measures the linear relationship between two sets of variables. For each input feature, the correlation between the feature's weekly frequency and the CDC weekly rates is calculated in the training set. If the correlation is less than 0.1, it is assumed that the linear relationship between the feature and the CDC rates is very weak, so the feature is regarded as irrelevant and removed. The remaining set of relevant features is used as input for the next step.
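The filtering step can be sketched as below. One assumption is made explicit in the code: we apply the 0.1 cutoff to the absolute correlation, since negatively correlated terms (such as 'spider web' later in the paper) evidently survive the filter; the data and feature names are illustrative.

```python
import numpy as np

def drop_irrelevant(X, y, names, threshold=0.1):
    """Keep features whose |Pearson r| with the CDC weekly rates y
    reaches the threshold. Sketch of the paper's filter; we assume
    the 0.1 cutoff applies to the absolute correlation."""
    kept = []
    for j, name in enumerate(names):
        col = X[:, j]
        if col.std() == 0:          # constant feature: r undefined, drop
            continue
        r = np.corrcoef(col, y)[0, 1]
        if abs(r) >= threshold:
            kept.append(name)
    return kept

# Toy weekly data: one perfectly correlated feature, one uncorrelated
# oscillation, one perfectly anti-correlated feature.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([y, np.array([1.0, -1.0, 1.0, -1.0, 1.0]), -y])
kept = drop_irrelevant(X, y, ["pos", "noise", "neg"])
```

The anti-correlated feature is retained under this reading of the rule, while the uncorrelated one is removed.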

2.4.2 PLS Components Selection

With the set of relevant features obtained from the previous step, the PLS algorithm is applied to the training dataset with 10-fold cross validation (results of this are not shown in the paper). Different numbers of components are created by applying PLS. In order to select the optimal number of components from PLS, the "outer" validation set (from the outer CV and separate from training) is used to validate the predictive performance of different numbers of PLS components. The term 'outer validation set' is used in later sections to refer to the validation set that is separate from the training set. The optimal number of PLS components is selected based on the maximum correlation among all components on the validation set.

Algorithm 1 shows pseudocode of the selection process that determines the optimal number of PLS components with the best predictive performance on the validation set. A loop that calculates the correlation on the validation set for every number of PLS components is used in Algorithm 1.

Algorithm 2 Pseudocode of VIP-Based Feature Removal
1: T = {T1, ..., Ti, ..., Tn} as the collection of VIP thresholds
2: T1 = 0.02, Tn = 1
3: MaxValCor = maximum validation correlation among all Ti in T
4: BestVIPThreshold = VIP threshold associated with MaxValCor
5: for Ti = 0.02; Ti <= 1; Ti = Ti + 0.02 do
6:     Remove features with VIP < Ti
7:     Run PLS on the dataset, and let MC be the maximum validation correlation among PLS components
8:     MC = result of running Algorithm 1
9:     if MaxValCor < MC then
10:        MaxValCor = MC
11:        BestVIPThreshold = Ti
12:    end if
13: end for

Figure 3: Flu Trend Prediction Experiment Work Flow

2.4.3 PLS VIP based Feature Removal

With the selected number of PLS components, each input feature's VIP is calculated. Features' VIP values are only valid for the selected set of PLS components; they would be different if a different set of PLS components were selected. Each feature's VIP value is related to the feature's weights for each latent component and the variance explained by each latent component. The formula for the jth feature's VIP is shown below (Wold and others, 1995; Mehmood et al., 2011), where N is the number of features, M is the number of PLS latent components, w_mj is the PLS weight of the jth feature for the mth latent component, and P_m is the percentage of the response factor (in our experiment, the CDC weekly disease rate) explained by the mth latent component:

VIP_j = \sqrt{\frac{N \sum_{m=1}^{M} w_{mj}^{2} \, P_m}{\sum_{m=1}^{M} P_m}}
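In code, the formula amounts to a couple of array operations. A sketch under stated assumptions: `w` and `p` are presumed already extracted from a fitted PLS model, with each component's weight vector normalized to unit length.

```python
import numpy as np

def vip(w, p):
    """VIP scores per the formula above.
    w: (M, N) array of PLS weights w[m, j] (component m, feature j).
    p: (M,) fraction of response variance explained per component.
    Assumes each component's weight vector has unit norm."""
    n_features = w.shape[1]
    num = n_features * ((w ** 2) * p[:, None]).sum(axis=0)
    return np.sqrt(num / p.sum())

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 5))
w /= np.linalg.norm(w, axis=1, keepdims=True)   # unit-norm weight vectors
p = np.array([0.5, 0.3, 0.2])
scores = vip(w, p)
```

A useful sanity check on this formula: with unit-norm weight vectors the mean squared VIP is exactly 1, so a VIP above 1 flags above-average importance, consistent with the use of 1 as the upper threshold in the next subsection.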



Features' VIP values are used to determine whether a feature should be removed or not. If a feature's VIP is less than a particular threshold, the feature is removed before applying PLS again to train the dataset.

We set the VIP threshold using the following methodology: values from 0.02 to 1 are considered. We use 1 as the maximum, as heuristically any VIP greater than 1 indicates that the feature is important (Cassotti and Grisoni, 2012). The optimal VIP threshold is determined by running a loop in which VIP threshold values ranging from 0.02 to 1 are each used to remove features. Let T = {T1, ..., Ti, ..., Tn} be a collection of VIP thresholds, where n is the number of thresholds and Ti is the ith threshold in T. For each Ti (1 ≤ i ≤ n) in T, features with VIP less than Ti are removed, then PLS is used to train on the remaining features, resulting in different numbers of PLS components. The optimal number of PLS components is the one with the maximum correlation on the outer validation dataset (the same dataset used in the previous step). These components represent the best result produced by removing features with VIP lower than Ti. The optimal VIP threshold in T is the one that yields the maximum correlation on the outer validation dataset. Pseudocode for VIP-based feature removal is shown in Algorithm 2. Features with VIP less than the selected optimal threshold are not included in the next step.

2.4.4 Recursive Feature Elimination (RFE)

In terms of computational cost, if hundreds of features resulting from the previous step are used as input to RFE, RFE may take too long to produce results. Therefore, if the number of features is greater than 200, features with VIP less than 0.2 are removed before applying RFE. The reduced set of features is then used as input for linear-SVM-based RFE (Guyon et al., 2002; Gunn and others, 1998). Ten-fold cross-validation, repeated five times, is used for RFE with SVM. The features selected by RFE with SVM are the final set of features, which are the input features for the next step.
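The elimination step can be sketched with scikit-learn's RFE wrapper around a linear-kernel SVM regressor. The data here is synthetic and the target feature count is illustrative, not the paper's configuration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# Synthetic stand-in: only the first two of ten features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + 0.01 * rng.normal(size=50)

# Linear-SVM-based RFE: iteratively drops the feature with the
# smallest |coefficient| until the requested number remains.
rfe = RFE(estimator=SVR(kernel="linear"), n_features_to_select=2).fit(X, y)
selected = np.flatnonzero(rfe.support_)   # indices of surviving features
```

On this toy data the two informative features survive, illustrating why the paper uses RFE to obtain the final feature set.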

2.5 Linear SVM Regression and Prediction

After feature engineering, an SVM regression model with a linear kernel function (Gunn and others, 1998) is trained on the most important features selected by the previous RFE step. The final prediction is made using this SVM regression model.
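The final train-and-predict step can be sketched end to end. This is an illustrative sketch on synthetic "weekly" data; the shapes merely echo the paper's setup (two training years of 52 weeks, one held-out year), and the evaluation mirrors the paper's use of rank correlation.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

# Synthetic weekly features and rates (shapes echo 2012-2013 vs 2014).
rng = np.random.default_rng(1)
X_train = rng.normal(size=(104, 5))            # e.g. two training years
y_train = 2.0 * X_train[:, 0] + 0.05 * rng.normal(size=104)
X_test = rng.normal(size=(52, 5))              # e.g. the held-out year
y_test = 2.0 * X_test[:, 0] + 0.05 * rng.normal(size=52)

# Linear-kernel SVM regression on the selected features, then
# rank-correlation scoring of the held-out year's predictions.
model = SVR(kernel="linear").fit(X_train, y_train)
rho = spearmanr(model.predict(X_test), y_test)[0]
```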

Figure 5: Second half of 2014 predicted weekly Lyme disease rates versus US CDC

Figure 6: Weekly frequency of spider web versus US CDC weekly flu rate

3 Experimental Results

In this section, we show results for flu prevalence prediction for the year 2014, as well as Lyme disease prevalence prediction for the second half of 2014, based both on the current week's data and on posts from a week in advance, to evaluate the possibility of getting a signal earlier. Data from 2012 and 2013 is used as the training set, and 2014's weekly data is used to predict the weekly flu rates of 2014. For Lyme disease prediction, only 18 months of data (from 2013 to the first half of 2014) is available as the training set.

Two-fold cross validation is used for feature tuning for flu prevalence. Figure 3 shows details of



                   Stage   Training Set       Validation/Test Set   Features for PLS   VIP Feature Removal   VIP > 0.2   RFE SVM

Flu, current       1st     2012               2013                  5353               412                   103         61
                   2nd     2013               2012                  5353               992                   154         74
                   Final   2012, 2013         2014                  –                  –                     –           22

Flu, one week      1st     2012               2013                  5060               141                   –           141
in advance         2nd     2013               2012                  5060               78                    –           77
                   Final   2012, 2013         2014                  –                  –                     –           21

Lyme Disease,      1st     2013 H1, 2013 H2   2014 H1               6139               693                   103         22
current            2nd     2013 H1, 2014 H1   2013 H2               6139               65                    –           35
                   3rd     2013 H2, 2014 H1   2013 H1               6139               627                   105         24
                   Final   2013, 2014 H1      2014 H2               –                  –                     –           41

Lyme Disease,      1st     2013 H1, 2013 H2   2014 H1               6076               63                    –           21
one week in        2nd     2013 H1, 2014 H1   2013 H2               6076               61                    –           36
advance            3rd     2013 H2, 2014 H1   2013 H1               6076               167                   –           67
                   Final   2013, 2014 H1      2014 H2               6076               –                     –           54

Table 3: Number of Input Features After Each Dimension Reduction

                                    Testing Period   Pearson Correlation   Spearman Correlation   R²      RMSE

Flu, current                        2014             92.4%                 94.9%                  85.3%   1.51E-05
                                    2014 H1          96.3%                 96.6%                  92.7%   1.06E-05
                                    2014 H2          94.8%                 92.3%                  89.8%   1.90E-05
Flu, one week in advance            2014             91.3%                 93.3%                  83.3%   1.55E-05
                                    2014 H1          91.6%                 94.6%                  84.0%   1.41E-05
                                    2014 H2          96.0%                 92.3%                  92.1%   1.69E-05
Lyme Disease, current               2014 H2          86.6%                 89.6%                  75%     3.41E-06
Lyme Disease, one week in advance   2014 H2          90.32%                86.9%                  81.6%   3.02E-06

Table 4: Flu and Lyme Disease trend prediction results

the flu trend prediction experiment. When data from 2012 is used for training, data from 2013 is used as an outer validation set, and vice versa. After PLS-based dimension reduction, RFE with SVM is applied to obtain the most important features from each fold. Another round of RFE with SVM is then applied to the 2012 and 2013 data with all unique input features selected from the previous step. This results in a final input feature set, and a regression SVM with a linear kernel function is trained using the 2012 and 2013 data. Finally, the prediction of weekly flu rates for 2014 is made from the trained SVM.

For Lyme disease, we have only two years of CDC data (2013 and 2014) which overlap with our dataset of NER-tagged tweets. We set aside 2013 and the first six months of 2014 for training and feature tuning, keeping the final six months for testing. We use three-fold cross-validation for feature tuning. Within each fold, six months of data is used as an outer validation set, with the remainder used as a training set. Similar to the flu trend experiment procedure shown in Figure 3, important features selected from each cross-validation

round are all included in another round of feature reduction using RFE with SVM. With the final feature set determined by RFE with SVM, an SVM with a linear kernel function is then trained on the training set to make the final prediction of the Lyme disease trend for the second half of 2014.

3.1 Results of Flu and Lyme Disease Trend Prediction and Detection

Table 3 shows the number of features remaining after each step of dimension reduction. After VIP-based feature removal, if the number of features exceeds 200, only features with VIP greater than 0.2 are selected; otherwise, all remaining features are used as input for the next step.

Figure 4: Year 2014 predicted weekly flu rates versus the US CDC weekly flu rates.

Flu, Current: 'stomach flu' 'pneumonia' 'bronchitis' 'coughing' 'sick' 'cough medicine' 'cold sore' 'cough syrup' 'sickness' 'cold' 'red nose' 'fever' 'sinus infection' 'ear infection' 'body ache' 'blush' 'spider web' 'throat hurt' 'aching' 'strep throat' 'alcide' 'gelato'

Flu, In Advance: 'stomach flu' 'pneumonia' 'bronchitis' 'coughing' 'sick' 'cough medicine' 'red nose' 'cough syrup' 'cold sore' 'cold' 'sickness' 'sinus infection' 'aloe' 'ear infection' 'fever' 'spider web' 'sleepy' 'aching' 'body ache' 'sore' 'seeing double'

Lyme Disease, Current: 'coughing' 'bronchitis' 'pneumonia' 'runny nose' 'stuffy nose' 'cold' 'stomach flu' 'sick' 'throat hurt' 'sinus infection'

Lyme Disease, In Advance: 'cold' 'coughing' 'pneumonia' 'bronchitis' 'runny nose' 'stuffy nose' 'sick' 'stress' 'aloe' 'stomach flu' 'sinus infection' 'caffeine' 'shaking' 'snoring' 'fart' 'concussion' 'throw up' 'migraine' 'dizzy' 'sore throat'

Table 5: Important Features for Flu and Lyme Disease Trend Prediction

Experimental results for flu and Lyme disease trend prediction are presented in Table 4. Both Pearson and Spearman correlations are included. Pearson correlation measures the linear relationship between two sets of variables, while Spearman correlation measures the correlation between two sets of ranked variables, i.e. whether one variable increases when the other increases. Therefore, Spearman correlation is used in this paper as an alternative measurement to examine similarities among the downward or upward movements of the predicted trend and the CDC trend.
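The distinction between the two measures can be seen on a toy monotone but nonlinear series: Spearman correlation is exactly 1 because the rankings agree perfectly, while Pearson correlation falls below 1 because the relationship is not linear (the values here are purely illustrative).

```python
from scipy.stats import pearsonr, spearmanr

# A monotone but nonlinear pair: same ordering, values not on a line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # y = x**2

rho = spearmanr(x, y)[0]       # rank correlation: the ranks match exactly
r = pearsonr(x, y)[0]          # linear correlation: high, but below 1
```

This is why the paper reports Spearman correlation alongside Pearson: it directly captures whether the predicted trend moves up and down with the CDC trend, regardless of scale or linearity.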

When making flu prevalence predictions for the first half, second half and the whole year of 2014, the Spearman correlations are 96.6%, 92.3% and 94.9% respectively. The first half year's Spearman correlation is higher than the second half year's. When the proposed methodology is used to predict the flu trend one week earlier, the first half year's Spearman correlation (94.6%) is again higher than the second half's (92.3%). This suggests that the final set of the most important features better represents the first half of 2014's flu prevalence than the second half. The Spearman correlation for predicting the flu trend one week before the CDC for the year 2014 is 93.3%, which indicates that the proposed methodology has some advance predictive power ahead of the CDC data, which is inherently less timely due to delays in collection. Figure 4 illustrates predicted flu prevalence against current CDC data as well as one week before.

For Lyme disease, as shown in Table 4, the Pearson correlation between the predicted prevalence

and CDC weekly rates is 86.6%, while the Spear-man correlation is higher,as 89.6%. A few weeksat the end of the year are predicted with nega-tive rates, contributing to a relatively low Pear-son correlation; without considering the last fiveweeks, the Pearson correlation increases to 93.3%.The relatively high Spearman correlation for Lymedisease has indicated that the upward or down-ward trends are well predicted. The Spearmancorrelation of detecting Lyme disease trend oneweek before CDC is 86.9%, which is lower thanfor the current week but still shows that a use-ful signal is being predicted. Predicted Lyme dis-ease prevalence and CDC-reported Lyme diseaseweekly rates are shown in Figure 5.
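The gap between the two correlation measures can be reproduced on toy data: Spearman depends only on ranks, so a monotone but miscalibrated tail (such as negative end-of-year predictions) penalises Pearson far more. The series below are invented for illustration:

```python
# Pearson vs Spearman when a prediction drifts negative at the end of the
# series. Values are hypothetical, not the paper's Lyme disease data.
from scipy.stats import pearsonr, spearmanr

cdc = [30, 28, 25, 20, 15, 10, 8, 6, 5, 4]       # toy weekly rates
pred = [29, 27, 26, 19, 16, 11, 7, 5, -8, -12]   # tail goes negative

r_all, _ = pearsonr(cdc, pred)      # dragged down by the aberrant tail
rho_all, _ = spearmanr(cdc, pred)   # both series are monotone: ranks agree

# Excluding the final aberrant weeks recovers a higher Pearson correlation,
# mirroring the effect of dropping the last five weeks in the text.
r_trim, _ = pearsonr(cdc[:-2], pred[:-2])
```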

The most important features selected by the proposed methodology for flu and Lyme disease trend predictions are presented in Table 5. Most of the features for flu prevalence prediction are reasonable, such as coughing, cold and fever, which are flu symptoms. However, spider web has been ranked as one of the features for flu prediction; it appears in our database because spider web is listed as a pharmacological substance in the UMLS. The weekly term frequency for spider web is highly negatively correlated with CDC weekly flu rates, as shown in Figure 6, because many spider webs are observed in the Northern hemisphere in September, close to the low point of the flu season. Gelato is detected as relevant for a similar reason: a coincidental negative correlation with the flu season. Gelato has been wrongly annotated by our system as a pharmacological substance, since in the UMLS it refers to gelato sodium fluoride rather than to ice cream.
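The spider-web effect can be illustrated with synthetic weekly series: any term whose seasonal cycle is roughly in anti-phase with the flu season will show a strong negative correlation and can be picked up as a "predictive" feature. The curves and their phase below are invented for illustration:

```python
# A seasonally fluctuating but disease-unrelated term can correlate strongly
# (and negatively) with a seasonal disease. Toy series, one year of weeks.
import math
from scipy.stats import pearsonr

weeks = range(52)
# Toy flu curve peaking in winter (weeks near 0 and 51)
flu_rate = [2.0 + math.cos(2 * math.pi * w / 52) for w in weeks]
# Toy 'spider web' mentions peaking near the flu-season trough
spider_web = [1.0 - math.cos(2 * math.pi * w / 52) for w in weeks]

r, _ = pearsonr(flu_rate, spider_web)
# r is close to -1: the term tracks the season, not the disease
```

This is why improving the named entity tagging alone cannot remove all such features: the confound is in the seasonality, not the annotation.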

Table 5 also shows the most important features for Lyme disease prevalence prediction. Many of the selected features are very similar to flu symptoms, in line with many symptoms of Lyme disease matching those of flu;2 in addition, dizzy matches a Lyme disease symptom. However, overall the term list for Lyme disease is less convincing than for flu, with more symptoms of Lyme disease missed and more terms included with no immediately obvious relationship to the disease. It seems that the relative rarity of Lyme disease leads to a noisier signal in tweets about its symptoms.

4 Discussion

The proposed methodology is an effective approach to predicting prevalences for influenza and Lyme disease based on social media posts. It predicts flu prevalences for 2014 with Pearson correlations ranging from 92.4% to 96.3%. Similar results have been reported with other existing approaches for flu prevalence prediction: Paul and Dredze (2012) and Paul and Dredze (2011) predicted the flu rate from August 2009 to May 2010 with Pearson correlations of 95.8% and 93.4% respectively; Culotta (2010a) made predictions for the flu rate from September 2009 to May 2010 with a 95% Pearson correlation. However, our method has some advantages over these, as they require labour-intensive manual labelling of tweets and significant computational resources to train their systems on millions of data samples. In the method proposed here, the only computationally intensive step is a one-off application of an NER tagger to a large Twitter corpus, and that step is reusable for other diseases and other kinds of analytics. In addition, Culotta (2010a) presented a method that requires prior knowledge to manually identify flu-related keywords. Here, a manually pre-built keyword list is not required, as the most important features related to flu are automatically selected based on the data. We also show that our method can predict disease prevalence with some reliability in a small time window ahead of the reported CDC figures, which has potential utility for real-time disease monitoring and alerts.

2http://www.cdc.gov/lyme/signs_symptoms/

Our method is somewhat generalisable, with roughly the same approach achieving good correlations against CDC data for Lyme disease. An existing approach to tracking Lyme disease (Seifter et al., 2010) requires prior knowledge to select keywords from Google Trends, and no evaluation is provided. To our knowledge there is relatively little other work on Lyme disease surveillance, so this application is somewhat novel. However, accuracy for Lyme disease was weaker than for flu, in terms of raw correlations as well as basic plausibility checks on the most important indicative terms: canonical indicators such as the erythema migrans rash did not make the list. An important factor is probably the lower overall prevalence of the disease (an average of 1,500 cases reported to the CDC per week in our test set versus 15,000 for flu): there are fewer instances of Twitter users experiencing the disease and the relevant symptoms, which they could then tweet about.

This hints at a limitation of the method we have developed. Terms that show a natural seasonal fluctuation but are not indicative of disease (such as gelato) may accidentally emerge as important terms in the analysis when their cycle happens to coincide with a disease's. One way to mitigate this would be to improve the accuracy of the named entity tagging in the source data.

Future extensions may also generalise to other regions with a high number of English-language tweets.

5 Conclusion

We have presented an effective methodology which produces predictions for flu and Lyme disease prevalences with strong to moderate correlations against current CDC figures for the whole of the US, as well as against those of a week later, and we expect the approach to be somewhat generalisable across diseases and regions.

References

Hervé Abdi. 2003. Partial least square regression (PLS regression). Encyclopedia for Research Methods for the Social Sciences, pages 792–795.


David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.

Matteo Cassotti and Francesca Grisoni. 2012. Variable selection methods: an introduction.

Aron Culotta. 2010a. Detecting influenza outbreaks by analyzing Twitter messages. arXiv preprint arXiv:1007.4748.

Aron Culotta. 2010b. Towards detecting influenza epidemics by analyzing Twitter messages. In Proceedings of the First Workshop on Social Media Analytics, pages 115–122. ACM.

Steve R Gunn et al. 1998. Support vector machines for classification and regression. ISIS Technical Report, 14.

Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422.

Bo Han, Timothy Baldwin, and Paul Cook. 2012. Geolocation prediction in social media data by finding location indicative words. In Proceedings of COLING 2012: Technical Papers, pages 1045–1062.

Antonio Jimeno-Yepes, Andrew MacKinlay, and Bo Han. 2015a. Investigating public health surveillance using Twitter. ACL-IJCNLP 2015, page 164.

Antonio Jimeno-Yepes, Andrew MacKinlay, Bo Han, and Qiang Chen. 2015b. Identifying diseases, drugs, and symptoms in Twitter. Studies in Health Technology and Informatics, 216:643–647.

Thorsten Joachims. 1999. SVMlight: Support vector machine. http://svmlight.joachims.org/, University of Dortmund, 19(4).

Guo-Zheng Li and Xue-Qiang Zeng. 2009. Feature selection for partial least square based dimension reduction. In Foundations of Computational Intelligence Volume 5, pages 3–37. Springer.

Guo-Zheng Li, Xue-Qiang Zeng, Jack Y Yang, and Mary Qu Yang. 2007. Partial least squares based dimension reduction with gene selection for tumor classification. In 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering, pages 1439–1444. IEEE.

Tahir Mehmood, Harald Martens, Solve Sæbø, Jonas Warringer, and Lars Snipen. 2011. A partial least squares based algorithm for parsimonious variable selection. Algorithms for Molecular Biology, 6(1):1.

Michael J Paul and Mark Dredze. 2011. You are what you tweet: Analyzing Twitter for public health. ICWSM, 20:265–272.

Michael J Paul and Mark Dredze. 2012. A model for mining public health topics from Twitter. Health, 11:16–6.

Ari Seifter, Alison Schwarzwalder, Kate Geis, and John Aucott. 2010. The utility of Google Trends for epidemiological research: Lyme disease as an example. Geospatial Health, 4(2):135–137.

S Wold et al. 1995. PLS for multivariate linear modeling. Chemometric Methods in Molecular Design, 2:195.
