

Deep Multi-Output Forecasting: Learning to Accurately Predict Blood Glucose Trajectories

Ian Fox (1), Lynn Ang (2), Mamta Jaiswal (2), Rodica Pop-Busui (2), Jenna Wiens (1)
(1) CSE, University of Michigan; (2) Internal Medicine, University of Michigan

ABSTRACT

In many forecasting applications, it is valuable to predict not only the value of a signal at a certain time point in the future, but also the values leading up to that point. This is especially true in clinical applications, where the future state of the patient can be less important than the patient's overall trajectory. This requires multi-step forecasting, a forecasting variant where one aims to predict multiple values in the future simultaneously. Standard methods to accomplish this can propagate error from prediction to prediction, reducing quality over the long term. In light of these challenges, we propose multi-output deep architectures for multi-step forecasting in which we explicitly model the distribution of future values of the signal over a prediction horizon. We apply these techniques to the challenging and clinically relevant task of blood glucose forecasting. Through a series of experiments on a real-world dataset consisting of 550K blood glucose measurements, we demonstrate the effectiveness of our proposed approaches in capturing the underlying signal dynamics. Compared to existing shallow and deep methods, we find that our proposed approaches improve performance individually and capture complementary information, leading to a large improvement over the baseline when combined (4.87 vs. 5.31 absolute percentage error (APE)). Overall, the results suggest the efficacy of our proposed approach in predicting blood glucose levels and in multi-step forecasting more generally.

ACM Reference Format:
Ian Fox, Lynn Ang, Mamta Jaiswal, Rodica Pop-Busui, Jenna Wiens. 2018. Deep Multi-Output Forecasting: Learning to Accurately Predict Blood Glucose Trajectories. In KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018, London, UK. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3219819.3220102

1 INTRODUCTION

In a typical signal forecasting problem, one aims to estimate the future value of the signal using past values. For example, one may aim to predict a blood glucose measurement occurring 30 minutes in the future, given past blood glucose measurements. This single-step setting generalizes to the multi-step setting, in which one aims to predict multiple values within a time horizon.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. KDD '18, August 19–23, 2018, London, United Kingdom. © 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-5552-0/18/08...$15.00. https://doi.org/10.1145/3219819.3220102

This multi-step setting is inherently more difficult, since it requires modeling the joint probability of future measurements. While more challenging, if successful, this joint modeling of observations within a sequence can improve overall performance. For example, while the word 'the' occurs often in English, the phrase 'the the the' does not.

Recursive approaches, in which a single-step forecaster predicts several values by using the current prediction to make the next prediction, are commonly used in multi-step forecasting [22]. However, such approaches often suffer from poor long-term performance, since any error introduced will enter a positive feedback loop. Alternatively, multi-output forecasting aims to estimate multiple values at once. While no longer susceptible to the feedback issue, multi-output forecasting may not adequately capture dependencies among predictions.

We propose two complementary solutions to these issues. Thefirst is a multi-output recurrent neural network where explicittemporal dependencies between outputs capture the relationshipbetween the predictions. The second is a novel architecture thatdirectly models the underlying generating function of the signal bylearning a polynomial approximation for the outputs. The problemof error accumulation during sequence prediction has been previ-ously studied in NLP [1, 13]. We distinguish ourselves from thispast work by focusing on new models that alleviate this problem,as opposed to new training schemes.

We apply the proposed approaches to a challenging real-world forecasting problem (described below). Our main contributions can be summarized as follows:

• We propose two novel and complementary deep multi-output forecasting architectures: an autoregressive multi-output forecaster and a polynomial "function forecasting" system.

• We improve over existing approaches by leveraging the proposed forecasting architectures on a large-scale real-world multi-step forecasting problem.

• In additional analyses, we demonstrate that predicting multiple values can provide extra supervision, improving single-output forecasting performance.

This work focuses on forecasting blood glucose values. Forecast-ing blood glucose is relevant to individuals with type 1 diabetes. Inthe United States alone, there are over 1 million type 1 diabetics [23].Tight glucose control, which can reduce the risk of complicationsin diabetes [8–10, 16], can be challenging to maintain. There is atremendous decision burden placed on diabetic patients who areconstantly faced with decisions pertaining to food intake, activities,and insulin administration. Thus there is a continuous interest inthe field to develop sensitive technologies and algorithms to closethe loop in insulin delivery and continouos glucose monitoring[4]. Better predictive algorithms are critical in the development ofsuch technologies [14]. This work could also be useful for type 2diabetics with poor blood glucose control.


Learning the dynamics of the glucoregulatory system is difficult because the long-term system dynamics are highly nonlinear [11]. We tackle this challenge using data from over 550K blood glucose measurements. Our proposed approaches, when used together, achieve better results in terms of absolute percentage error (APE) than existing shallow or deep forecasting approaches, both on average (4.87 vs. 5.31) and particularly in periods of extreme fluctuation (12.05 vs. 13.34).

The remainder of the paper is organized as follows. In the next section, we introduce notation and formally define our problem. We then present our proposed forecasting architectures and methods. After, we present a series of experiments on the real-world dataset and discuss the results.

2 PROBLEM SETUP AND BACKGROUND

In signal forecasting, one aims to estimate the next value in a signal, x_{t+1}, given past values x_{0:t}, where x represents the signal of interest and t the current time step. Here, we focus on the univariate setting, i.e., x_0, ..., x_t ∈ ℝ, though our approaches generalize to the multidimensional setting. The most common approaches to signal forecasting focus on learning a model for p(x_{t+1} | x_{0:t}) [17, 22]. Sometimes a prediction offset d is added to learn the model for p(x_{t+d} | x_{0:t}). Given the recent successes of deep architectures for this problem in general [6, 17, 25], and specifically in the domain of glucose forecasting [15], we focus on building upon deep learning methods for signal forecasting.

A model that accurately predicts x_{t+1} can be used for either single- or multi-step forecasting. Applied recursively, single-step models enable multi-step forecasting, i.e., predicting multiple values over a time horizon of length h (see Figure 1). Of particular note are deep conditional generative models, which model joint distributions by sequentially estimating terms in the conditional factorization of the distribution with deep neural nets [17]. This style of forecasting, where p(x_{t+1:t+h} | x_{0:t}) is estimated by the factorization p(x_{t+1} | x_{0:t}) p(x_{t+2} | x_{0:t}, x̂_{t+1}) ... p(x_{t+h} | x_{0:t}, x̂_{t+1:t+h−1}), is called recursive forecasting, and is the primary form of multi-step forecasting [22]. Recursive models have the advantage of modeling the joint probability of the signal within the prediction window. However, the re-use of predictions creates a feedback loop, amplifying potential errors and leading to lower quality predictions as the time horizon increases.
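To make the feedback loop concrete, the following minimal sketch (our illustration, not the authors' released code) shows how a trained single-step model could be applied recursively to produce an h-step forecast; the `model` callable and tensor shapes are assumptions.

```python
import torch

def recursive_forecast(model, history, h=6):
    """Illustrative recursive multi-step forecast: each prediction is
    appended to the history and fed back in to produce the next one.

    model   : any callable mapping a (1, T) tensor to a (1,) next-value estimate
    history : (1, T) tensor of past signal values x_{0:t}
    """
    preds = []
    window = history.clone()
    for _ in range(h):
        x_next = model(window)                    # estimate of x_{t+1} from current window
        preds.append(x_next)
        # feed the prediction back in as if it were an observation;
        # this is the feedback loop that lets errors compound
        window = torch.cat([window, x_next.view(1, 1)], dim=1)
    return torch.stack(preds, dim=1)              # (1, h) forecast of x_{t+1:t+h}
```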

In contrast, multi-output forecasting aims to estimate p(x_{t+1:t+h} | x_{0:t}) in one step [22]. Multi-output approaches sidestep the issue of error feedback by jointly estimating over the prediction window, and will be the main focus of this paper.

3 METHODS

Our main methodological contribution is the development of a deep multi-output forecasting framework, which we extend in two directions: 1) we propose a method to propagate information across the prediction window, and 2) we propose a method to directly predict the underlying generative function of the signal. We investigate both approaches, as they represent different, complementary methods to enhance multi-output forecasting. In this section, we first describe the multi-output deep learning framework shown in Figure 2, then explain both of our extensions, shown in Figure 3.

Figure 1: An example of multi-step recursive forecasting. Predictions at one time step are fed back into the network as input. This allows single-step methods to produce multi-step forecasts.

Figure 2: Forecasting with DeepMO involves transforming the input to a shared representation and then learning separate output networks for each time point in the prediction horizon h.

We finish by providing additional details on how to train the models.

3.1 Deep Multi-Output Forecasting (DeepMO)

A neural network can function as a multi-output forecaster by using multiple output channels to infer multiple time points into the future from a shared hidden state. At time t, a standard multi-output neural network derives a hidden state vector z_t from input x_{0:t} through a series of hidden layers composed of linear combinations and nonlinear activations, all parameterized by θ_z. This hidden state is then translated to predictions via the network's output channels o_{1:h} as follows:

x̂_{t+i} = o_i(z_t; θ_i) for i ∈ [1:h]   (1)

where o_i is defined in terms of linear combinations and nonlinear activations, parameterized by θ_i. This approach is illustrated in Figure 2. Note that the value of the predicted output at time step t+1 is not explicitly propagated through the remainder of the prediction window (as it would be in a recursive setting). The mapping defined by o_i has no direct impact on o_j at inference time for j ≠ i. However, x̂_i is not independent of x̂_j, since they both depend on the shared representation z_t. Additionally, temporal dependencies among the outputs are implicitly propagated by the joint


optimization over θ = [θ_z, θ_{o_1}, ..., θ_{o_h}] during training. However, since this approach does not explicitly encode dependencies among the outputs, learning such relationships may be more difficult.

Standard neural networks require a fixed-size input. To eliminate this limitation, we use recurrent neural networks (RNNs), which allow for variable-sized input. This allows the network to learn the amount of history that is useful for prediction and to make predictions at any point in the signal. As such, this and all subsequent architectures use recurrent cells to generate z_t. We use GRU cells [2]; however, other recurrent cells could be used as well. The recurrent cells are depicted by the orange cell in Figure 2. We refer to the architecture described above as "DeepMO" (Deep Multi-Output Forecaster).
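As a concrete illustration, here is a minimal PyTorch sketch of a DeepMO-style network, assuming (per Sections 3.5 and 5.3) a two-layer GRU encoder, 512 hidden units, and 361 discretization bins; the class and argument names are ours, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class DeepMO(nn.Module):
    """Sketch of a multi-output forecaster: a GRU encoder produces a shared
    hidden state z_t, and h separate heads o_1..o_h map z_t to a distribution
    over discretized signal values, one head per step of the horizon."""

    def __init__(self, n_bins=361, hidden=512, horizon=6):
        super().__init__()
        self.encoder = nn.GRU(input_size=1, hidden_size=hidden,
                              num_layers=2, batch_first=True)
        # one independent output channel per future time step
        self.heads = nn.ModuleList([nn.Linear(hidden, n_bins)
                                    for _ in range(horizon)])

    def forward(self, x):              # x: (batch, T, 1) past values
        _, z = self.encoder(x)         # z: (num_layers, batch, hidden)
        z_t = z[-1]                    # shared representation from the top layer
        # per-step logits; softmax/cross-entropy is applied by the loss
        return torch.stack([head(z_t) for head in self.heads], dim=1)
```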

3.2 Sequential Multi-Output Forecasting (SeqMO)

We extend the approach described above by combining 1) the ability of a recursive forecaster to explicitly model temporal dependencies within a sequence with 2) the ability of a multi-output system to model multiple time steps at once. To combine the advantages of these two forecasting systems, we use the DeepMO architecture described above, but introduce temporal dependencies between sequential outputs.

Our sequential multi-output approach modifies DeepMO by replacing the multiple output channels o_i for i ∈ [1:h] with a recurrent decoder network, parameterized by θ_dec. The decoder network unrolls the hidden state z_t into [z'_1, z'_2, ..., z'_h]. Each z'_i is then independently passed through the same shared output channel. Specifically, we replace (1) with:

z'_1, z'_2, ..., z'_h = Dec(z_t; θ_dec)   (2)
x̂_{t+i} = o(z'_i; θ_o)   (3)

With this setup, the model can learn to trade off between recursively propagating error and capturing temporal dependencies. We refer to this approach as a sequential multi-output forecaster (SeqMO) (Figure 3a). We hypothesize that by explicitly encoding a temporal relationship among predictions, we will learn a more accurate forecasting strategy. This approach uses a recurrent encoder-decoder framework [3] for time-series forecasting. Note that this involves a many-to-many mapping, since we make multiple sequential predictions at each time step.
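The sketch below illustrates one way to realize SeqMO in PyTorch, under the same assumptions as the DeepMO sketch above; feeding a constant token to the decoder at each step is our simplification, since the paper does not specify the decoder input.

```python
import torch
import torch.nn as nn

class SeqMO(nn.Module):
    """Sketch of a sequential multi-output forecaster: the encoder's final
    hidden state z_t is unrolled by a recurrent decoder into h states
    z'_1..z'_h, each mapped through one shared output channel."""

    def __init__(self, n_bins=361, hidden=512, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(1, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.GRUCell(1, hidden)      # unrolls z_t across the horizon
        self.out = nn.Linear(hidden, n_bins)      # shared output channel o

    def forward(self, x):                         # x: (batch, T, 1)
        _, z = self.encoder(x)
        state = z[-1]                             # z_t
        token = x.new_zeros(x.size(0), 1)         # constant decoder input (our choice)
        logits = []
        for _ in range(self.horizon):
            state = self.decoder(token, state)    # z'_i
            logits.append(self.out(state))        # same head at every step
        return torch.stack(logits, dim=1)         # (batch, h, n_bins)
```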

3.3 Polynomial Function Forecasting (PolyMO)

Our second proposed extension reframes the forecasting task. Instead of learning the distribution of future signal values conditioned on the past, we learn to predict an underlying representation of the data. We call this function forecasting. In particular, we assume the prediction window x_{t+1:t+h} ~ f(0:h−1; w), where w parameterizes the function class f. Instead of directly modeling p(x_{t+1:t+h} | x_{0:t}), we estimate the parameters of the underlying generative function, p(w | x_{0:t}) (see Figure 3b). Function forecasting is analogous to SeqMO, where input data are encoded into a hidden state best parameterizing a decoder network. The key difference is that here the decoding step is restricted to the function f.

We restrict our generating function class f to polynomials of degree n. Such functions are parameterized by n+1 real numbers: f(t; w) = w_0 + w_1 t + ... + w_n t^n. We modify equation (1) so that

w_j = o_j(z_t; θ_{o_j}) for j ∈ [0:n]   (4)

At each time step t, we predict the set of coefficients w parameterizing the best approximation of the future values x_{t+1:t+h}. For training, we determine the actual value of the parameters by taking the best-fit polynomial of degree n over x_{t+1:t+h}. Since we want the generating function f to actually model the underlying signal, and not just the observations, we limit the polynomial's capacity by setting n ≪ h.

We refer to this approach of polynomial function forecasting as "PolyMO" (Figure 3b). We hypothesize that focusing on estimating the underlying generative function, versus the values themselves, will result in improved forecasting performance. By compactly representing future data, the output complexity of the network can be lowered, reducing noise. In addition, by predicting a generating function, the network must reason about the joint distribution of the values (since each parameter w_i affects the entire output window). This helps address the error accumulation inherent to recursive forecasting.
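To make the target construction concrete, here is a minimal NumPy sketch of how the degree-n best-fit coefficients for each training window might be computed; the function name and array layout are illustrative.

```python
import numpy as np

def poly_targets(windows, degree=1):
    """Sketch of PolyMO target construction: for each future window
    x_{t+1:t+h}, compute the coefficients of the best-fit polynomial of
    the given degree; the network is trained to predict these
    coefficients instead of the raw values.

    windows : (N, h) array of future glucose values
    returns : (N, degree + 1) array of coefficients [w_0, ..., w_n]
    """
    h = windows.shape[1]
    t = np.arange(h)                               # f is evaluated at 0..h-1
    # np.polyfit returns highest-degree coefficients first; flip to [w_0, ..., w_n]
    coeffs = np.polyfit(t, windows.T, deg=degree)  # shape (degree+1, N)
    return coeffs[::-1].T
```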

3.4 Sequential Polynomial Function Forecasting (PolySeqMO)

The two extensions to DeepMO shown in Figure 3 both seek to improve our estimation of p(x_{t+1:t+h} | x_{0:t}). However, these proposed techniques represent somewhat orthogonal improvements. SeqMO provides a way to learn the trade-off between relying on intermediate value estimates and avoiding recursive error accumulation. PolyMO, meanwhile, facilitates prediction by constraining the intermediate representation, predicting values parameterizing the function approximation. The prediction of these parameters is itself a multi-output prediction. While the standard PolyMO forecaster uses the DeepMO framework to generate w_0, ..., w_n, there is no reason that it couldn't use or wouldn't benefit from the SeqMO framework instead. Thus, we also examine a PolyMO forecaster with recurrent parameter decoding, denoted "PolySeqMO" (Figure 4).

3.5 Training and Inference Details

In the above methods, the parameters θ can be learned using stochastic gradient descent. The standard deep forecasting formulation defines training loss based on actual values in the signal. However, previous work has found that it can be beneficial to transform the problem into a multi-class classification task [17]. Thus, we replace the task of directly predicting signal values x̂_{t+i} with the task of predicting the probability mass function over possible discretized values of the signal, p̂(x_{t+i}), using a cross-entropy loss against the one-hot distribution for the actual value. Each output channel o_i encodes not a single number, but a distribution over possible values. Similarly, we predict distributions over parameter values w in PolyMO forecasting. While the multi-class formulation allows us to use the cross-entropy loss during training, ultimately we are interested in evaluating the quality of real-valued forecasts.


Figure 3: Two extensions to the DeepMO forecasting framework. (a) SeqMO uses a decoder network to generate a representation for each time point in the prediction horizon, which feeds into a shared output network to produce h predictions. (b) PolyMO learns n+1 separate output networks based on a shared representation z_t to infer the parameters of an nth-degree polynomial, which is then used to generate the predicted output.

Figure 4: A combination of SeqMO and PolyMO. The function forecasting framework from PolyMO is used to predict parameters w for a function approximating output values. These parameters are predicted using the recurrent decoding network from SeqMO.

Thus, we translate these distributions to predictions by taking the value represented by the class with maximum probability in p̂. This approach has been found to work well in the field of speech generation [17], but has not, to our knowledge, been investigated in the context of physiological signal forecasting.

Finally, at inference time, we smooth predictions by replacing the predicted values x̂_{t+1:t+h} with the values occurring at that time in a best-fit polynomial, with the polynomial degree set using validation data. That is, we find the polynomial f(·; w) that best approximates x̂_{t+1:t+h}, and return as output the vector [f(0), ..., f(h−1)]. This polynomial smoothing allows for a more direct comparison between models that predict glucose values and the PolyMO approach.
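A minimal sketch of this smoothing step, assuming NumPy and a degree chosen on validation data:

```python
import numpy as np

def smooth_forecast(preds, degree=1):
    """Sketch of inference-time polynomial smoothing: replace the raw
    multi-step predictions with the values of their best-fit polynomial.

    preds : (h,) array of predicted values for x_{t+1:t+h}
    """
    t = np.arange(len(preds))
    w = np.polyfit(t, preds, deg=degree)   # best-fit coefficients
    return np.polyval(w, t)                # [f(0), ..., f(h-1)]
```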

In the sections that follow, we test our hypotheses and evaluate our proposed forecasting systems on a real dataset. We begin by describing the forecasting task, then explain the experimental setup and provide the detailed implementation of the methods we evaluate.

4 DATASET & FORECASTING TASK

We consider the task of predicting future blood glucose values in patients with type 1 diabetes. These data present a challenging and clinically meaningful forecasting task.

4.1 The Data

The data consist of a large number of continuous glucose readings from 40 patients with type 1 diabetes, collected over the course of three years. At three-month intervals, individuals included in the study were given a continuous glucose monitor (CGM) that recorded their blood glucose at regular five-minute intervals over the course of several consecutive days. All subjects were blinded to the output of the CGM, so as not to affect the management of their disease. Subjects continued regular activities and insulin administration, either with injections or an insulin pump. In total, the dataset consists of 1.9k days of blood glucose measurements, totaling nearly 550k distinct glucose measurements. Blood glucose measurements were of integer resolution in the range of 40-400 mg/dL.


An example of a few days' worth of measurements from four different patients is shown in Figure 5.

Figure 5: Four examples of continuous blood glucose values collected over the course of 24 hours from four different patients.

4.2 The Task

There has been extensive work on using CGM data to predict short-term outcomes (e.g., predicting hypo- and hyperglycemic events [18, 21]). In contrast, here we focus on the more general task of glucose forecasting. More specifically, we consider the challenging task of predicting future blood glucose levels using only data about past blood glucose measurements. In this context, previous work has focused on using ARIMA [5] and machine learning algorithms such as random forests [21]. Relatedly, others have proposed machine learning techniques for leveraging data pertaining to external factors [20, 24, 26]. While blood glucose is affected by external factors, e.g., carbohydrate intake and insulin, such data are not always readily available [27] and come at the cost of increasing patient burden in the form of data collection. Though the dataset we consider does not contain these additional data, it is important to note that the proposed methods generalize to a multivariate setting.

With enough advance warning via a forecast, one can correct blood glucose levels through the administration of either insulin or glucose. How far in advance is far enough? It is important to note that there is 1) often a delay before insulin or glucose begins to act on the glucoregulatory system and 2) a lag between changes in blood glucose levels and CGM measurements. Thus, to have the greatest impact (e.g., to help patients avoid hypo- and hyperglycemic events), we must be able to predict several measurements into the future. Previous work in blood glucose forecasting has settled on a 30-minute prediction window as adequate for this task [20, 24, 26]. Thus, to test the efficacy of our forecasting systems, we evaluate a multi-step forecast with a 30-minute (h = 6) prediction window.

We evaluate performance for any given prediction by calculating the mean absolute percentage error (APE) over the prediction window. This evaluation metric varies from previous work, which reports performance on only the last sample in the prediction window. We report over the entire prediction window for two reasons.

Figure 6: The splitting procedure used to train and test our models. The complete dataset is separated into three disjoint subsets: train, validation, and test. The test set is then further split into four non-disjoint sets: the full test set, test set examples that denote new hypo/hyperglycemic events, and subsets for each event type separately. This results in widely varying subset sizes, given in the image.

First, from a clinical perspective, we are interested in the trend the values suggest. An ultimate decrease in blood glucose can have different interpretations if the rate of decline is accelerating vs. decelerating. Second, from a technical perspective, we are interested in evaluating our systems as multi-step forecasters, naturally suggesting evaluation over multiple steps.
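For concreteness, a minimal sketch of the evaluation metric as we understand it (expressing APE in percent units here is our assumption):

```python
import numpy as np

def window_ape(y_true, y_pred):
    """Sketch of the evaluation metric: mean absolute percentage error
    computed over the whole prediction window, not just its final step.

    y_true, y_pred : (h,) arrays of actual and forecast glucose values
    """
    return 100.0 * np.mean(np.abs((y_pred - y_true) / y_true))
```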

5 EXPERIMENTAL SETUP & BASELINES

Through a series of experiments on the data and task described above, we measure the ability of our proposed methods to forecast blood glucose values. We compare against several baselines that have been used for forecasting in previous work [21]. We also investigate the advantage the extra supervision inherent in multi-output forecasting offers over single-output forecasting on a single-output task.

5.1 Train, Test, and Validation

In all of our experiments, we split the data into training, validation, and held-out test datasets. This procedure is shown in Figure 6. These splits were determined using the CGM recording sessions across patients. For each subject, the entirety of the final recording session is added to the test set, the second-to-last session is added to the validation set, and the remaining data are added to the training set. Recording sessions vary in length, but this split results in approximately 85% of the data being used for training, 7.5% for validation, and 7.5% for testing. Compared to a random split, a temporal split more closely mimics how we expect the model to perform in practice. Note that several months elapse between recording sessions, so data in the training set have no immediate connection to the testing data.

We evaluate the models at any point in time at which we have at least ten samples (i.e., 50 minutes) of prior data. We select this minimum to ensure that there is sufficient information to make a reasonable prediction. As CGMs are used continuously in the real world, this does not restrict applicability. We remove measurements that represent physiologically unrealistic glucose fluctuations (over 40 mg/dL in under 5 minutes) to remove noisy CGM measurements. Using a sliding window sampling method with a stride of 1 results in over 39k distinct test samples.
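The sketch below illustrates this sampling and filtering scheme under simplifying assumptions of ours: a fixed 10-sample history (the actual RNNs accept variable-length context) and a per-window noise check.

```python
import numpy as np

def make_examples(signal, history=10, horizon=6, max_jump=40):
    """Sketch of the sampling scheme described above: slide a window with
    stride 1, require at least `history` samples of context, and skip
    windows containing physiologically unrealistic jumps
    (> 40 mg/dL between consecutive 5-minute readings).

    signal : (T,) array of CGM readings from one recording session
    """
    inputs, targets = [], []
    for t in range(history, len(signal) - horizon + 1):
        window = signal[t - history:t + horizon]
        if np.max(np.abs(np.diff(window))) > max_jump:
            continue                       # drop noisy CGM measurements
        inputs.append(signal[t - history:t])
        targets.append(signal[t:t + horizon])
    return np.array(inputs), np.array(targets)
```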

For evaluation purposes, we divide our test set into four overlapping groups: 1) the full test set, 2) windows in the test set that contain either hypoglycemic or hyperglycemic onsets, and two sets containing 3) only hypoglycemic and 4) only hyperglycemic onsets.


Specifically, we examine performance on a second test set of samples filtered such that a hypo- or hyperglycemic event begins in the 30-minute prediction window and the last training sample was at a normal blood glucose level (between 70-180 mg/dL). Focusing on only hypo- and hyperglycemic events reduces our test set size to 3,068 samples: 1,156 hypoglycemic events and 1,912 hyperglycemic events. We look at each of these subgroups to better understand forecaster performance across a range of relevant situations. The dynamics of the glucoregulatory system are highly nonlinear, and can vary dramatically depending on the state of the glucoregulatory system and environmental context [14].

The complete test set is most representative of general model performance. However, the event test set is indicative of performance at points critical for maintaining healthy glucose levels. This event test set is further broken into a hypoglycemic event set and a hyperglycemic event set. The prevention of hypo- and hyperglycemic events is important for different reasons, and depending on the outcome of interest and the patient's personal history, it may be more important to effectively predict one versus the other. Thus, performance across all the test sets can be relevant.

5.2 Baseline Forecasting Methods

To compare the performance of our proposed approaches to existing methods, we consider the following shallow and deep architectures.

• Baseline: Linear Extrapolation. This baseline simply uses the most recent 30 minutes of data to extrapolate 30 minutes into the future (see the sketch after this list). We chose to use the most recent 30 minutes (as opposed to a longer history) based on performance on the validation set. We include this naive baseline to give the reader a sense of how challenging the task is. It also provides an interesting comparison to the performance of the 1st-degree PolyMO and PolySeqMO models, as they have identical output capacity.

• Baseline: Random Forest (RF). Simple but effective, the RF algorithm has been successfully used in glucose forecasting [21]. A robust ensemble method, it can be parallelized for rapid training and prediction. Applied to the task of predicting hypoglycemic events, Sudharsan et al. achieved results competitive with the state of the art. We experimented with two random forest baselines: i) a random forest trained to predict one time step into the future, used to recursively generate the multi-output prediction, and ii) a true multi-output random forest.

• Recursive RNN. As our next baseline, we consider a recurrent neural network (RNN) which makes multi-step predictions using the recursive approach outlined in Figure 1. RNNs have recently been shown to achieve state-of-the-art results in glucose forecasting [15]. Our RNN uses two layers of GRU cells, regularized via early stopping on a validation set and weight decay. While we are interested in minimizing the APE of our forecasts, we do not use APE as our loss function. Instead, as discussed in Section 3.5, our network outputs a probability mass function over a discretized set of glucose values, so we use a cross-entropy loss.
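As referenced in the first bullet, here is a minimal sketch of the linear extrapolation baseline; using six 5-minute samples per 30 minutes is our reading of the setup.

```python
import numpy as np

def linear_extrapolation(history, horizon=6):
    """Sketch of the linear extrapolation baseline: fit a line to the most
    recent 30 minutes (six 5-minute samples) and extend it 30 minutes
    into the future."""
    recent = history[-6:]
    t = np.arange(len(recent))
    w = np.polyfit(t, recent, deg=1)                        # best-fit line
    future_t = np.arange(len(recent), len(recent) + horizon)
    return np.polyval(w, future_t)                          # extrapolated forecast
```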

5.3 Implementation Details

We implemented all deep learning models using PyTorch. We learned the model parameters using stochastic gradient descent with an ADAM optimizer [12]. We implemented the RF baseline using Scikit-learn [19].

All of our models have a number of hyperparameters. To ensure fair comparison between methods, we set hyperparameters by optimizing performance on the training and validation data. Our hyperparameter search space for the deep architectures included model depth, recurrent layer size, initial learning rate, and input normalization. For the RF, we tuned the number of trees and the size of the input.

Values reported for the RF forecaster were obtained using 100 estimators with a 10-sample input size. The remaining hyperparameters used the default Scikit-learn values. The deep architectures were found to have performance robust to hyperparameter selection. All results reported were obtained using two recurrent layers of 512 GRU hidden units. Output channels were implemented as fully connected layers with softmax activations. Training was run until performance on a separate validation set failed to increase for 50 epochs. A weight decay value of 10^-5 was used for all models. All remaining model details, such as the initialization procedure and the initial learning rate for ADAM, used the PyTorch default values.
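Putting these details together, here is a minimal training-loop sketch under the stated settings (Adam defaults plus 10^-5 weight decay, cross-entropy over discretized outputs, early stopping with 50-epoch patience); the helper structure and names are ours.

```python
import torch

def evaluate(model, loader, loss_fn):
    """Mean validation loss over a data loader."""
    model.eval()
    with torch.no_grad():
        losses = [loss_fn(model(x).flatten(0, 1), y.flatten())
                  for x, y in loader]
    return torch.stack(losses).mean().item()

def train(model, train_loader, val_loader, max_epochs=1000, patience=50):
    # Adam with PyTorch defaults plus the paper's 1e-5 weight decay
    opt = torch.optim.Adam(model.parameters(), weight_decay=1e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_val, stale = float("inf"), 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:          # y: (batch, h) target bin indices
            opt.zero_grad()
            logits = model(x)              # (batch, h, n_bins)
            loss = loss_fn(logits.flatten(0, 1), y.flatten())
            loss.backward()
            opt.step()
        val = evaluate(model, val_loader, loss_fn)
        if val < best_val:                 # early stopping on validation loss
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:
                break
```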

To train our PolyMO model, we tested a variety of polynomial degrees. On the training data, we observed that blood glucose values over a 30-minute window can be well approximated by something as simple as a 1st-degree polynomial (n = 1). On average, a linear approximation of the output window incurred a reasonably small reconstruction loss. The first-degree polynomial struck a good balance between performance and capacity. On the validation data, we investigated the performance of the PolyMO model using different degrees and found that the 1st degree performed best.

We found the best-fit 1st-degree polynomials over all length-six prediction windows in the training set and used the maximum and minimum values of each coefficient as the range for our categorical output prediction, except for the bias term, which we limited to 40-400 to mirror the glucose monitor output limitations. Outputs were quantized into 361 equal bins, both when predicting glucose values and for each polynomial coefficient. This number was chosen to give the real-value network the capacity to predict any recorded value of blood glucose, as most continuous glucose monitors have integer resolution. All source code for this project is available online at https://github.com/igfox/multi-output-glucose-forecasting.
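A minimal sketch of this output discretization, assuming equal-width bins derived from training-set ranges:

```python
import numpy as np

def make_bins(train_targets, n_bins=361):
    """Sketch of the discretization described above: each target (a glucose
    value or a polynomial coefficient) is quantized into 361 equal-width
    bins spanning the range observed in the training set (the bias term
    would be clipped to the 40-400 mg/dL CGM range)."""
    lo, hi = train_targets.min(), train_targets.max()
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2   # value represented by each class
    return edges, centers

def to_class(values, edges):
    """Map real-valued targets to bin indices for the cross-entropy loss."""
    return np.clip(np.digitize(values, edges) - 1, 0, len(edges) - 2)
```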

6 RESULTS & DISCUSSION

In total, we tested eight distinct forecasting systems: 1) Linear Extrapolation, 2) a recursive RF (RF: Rec), 3) a multi-output RF (RF: MO), 4) a recursive RNN (Recursive), 5) a multi-output RNN (DeepMO), 6) a sequential multi-output RNN (SeqMO), 7) a polynomial multi-output RNN (PolyMO), and 8) a sequential polynomial multi-output RNN (PolySeqMO).

Table 1 presents the forecasting model performance, in terms of APE over the prediction window, on the held-out test data. We noted that the error distribution was non-normal, so we report the median APE and the 2.5th-97.5th percentile errors.


That said, all observed trends hold when instead considering the mean APE, with the exception that SeqMO outperforms PolySeqMO on the Event and Hypo subtasks (12.63 vs. 12.79 and 16.60 vs. 17.04, respectively). These results illustrate the strengths (and weaknesses) of the proposed forecasting systems applied to the task of predicting blood glucose. We discuss the implications of these results in the sections that follow.

6.1 Deep vs. Shallow

While RF: MO achieves good performance on the full test set, it does worse than our three improved deep multi-output methods. Moreover, it underperforms the deep approaches on the event subset. This is due mainly to very poor performance on hypoglycemic predictions. These results suggest that, compared to shallow models, deep models can more accurately learn the underlying dynamics of the glucoregulatory system from raw data. Still, we note that the RF is a competitive forecaster, in line with previous work [21]. In particular, it achieves lower APE on the hyperglycemic test set than all models except the PolySeqMO Ensemble.

6.2 Multi-Output vs. Recursive

We observe that among both deep (DeepMO vs. Recursive: 5.01 vs. 5.31) and shallow approaches (RF: MO vs. RF: Rec: 5.18 vs. 8.00), multi-output forecasting offers significant advantages over recursive forecasting. We also observe that all deep multi-output models improve on the Recursive model in the hypoglycemic task (12.05-12.91 vs. 13.34).

We highlight these differences further in Figure 7. We plot the average performance at each of the six time points within the 30-minute prediction window. The difference between the approaches is amplified as we predict further out. At the first place in the prediction window, corresponding to predicting x_{t+1}, the recursive approach outperforms most other approaches. As the target moves further into the future, we observe two trends. First, the problem becomes more difficult for all approaches (i.e., error increases). This makes sense, as it is inherently more difficult to predict events further in the future. Second, the relative performance of the Recursive forecaster degrades with respect to the multi-output approaches. By the final step, the recursive model is far and away the worst predictor (8.38 vs. 7.51-7.74).

6.3 Adding Sequential Dependencies

Examining the difference in performance between the DeepMO and SeqMO models, we note that the autoregressive connections improve performance across every subset of the data (4.91 vs. 5.01 on the full test set). While the multi-output approach underperforms the deep recursive forecaster on the hyperglycemic task, once we add the sequential decoding, the resulting model beats the recursive forecaster on every task. This indicates that, while multi-output forecasting represents a step in the right direction, it is important to consider sequential dependencies between outputs.

Figure 7: A comparison of per-step error between the various forecasters. While the multi-output models initially perform worse, they do not accumulate error as rapidly as the recursive approach, achieving lower error at later prediction steps.

Figure 8: We examine the single-output error (APE at 30 minutes) across a range of models determined using an exponential loss weighting, plotted against the proportion of loss weight allocated to the final output (which represents the evaluation target). Surprisingly, we observe that a multi-output loss improves single-output performance, suggesting that it is helpful to model forecast trajectories even when only the final value is of interest.

6.4 Predicting Underlying Function vs. Values

In our investigation of PolyMO, we began by looking at the performance attained using a range of polynomial function classes. We looked at four different degree settings for the best-fit polynomial we predict (degree 0-3 polynomials). We found that the degree-1 model achieved the best performance on the validation set (4.87 vs. next best 5.14). While higher-degree polynomials allow for strictly better approximations of prediction windows, they also allow for more variation in output. Moreover, minor errors in high-degree coefficients rapidly compound into large errors.

Focusing on the 1st-degree PolyMO, we see that it is advantageous to rephrase the value forecasting problem as a function forecasting one. In particular, we find that PolyMO beats DeepMO on every task.


Table 1: Results. We examine the performance of our eight forecasting approaches across different subsets of the CGM test data. Results are reported as 50th percentile APE over the prediction window; values in parentheses are the 2.5th-97.5th percentiles. In the original table, underlining indicates the best single-model performance and bold the best overall (single or ensembled) performance.

Type      Architecture        Full (39k)         Event (3.1k)        Hypo (1.2k)         Hyper (1.9k)
Shallow   Extrapolation       6.48 (0.21-42.12)  10.76 (1.42-63.98)  14.85 (1.89-86.81)  8.73 (1.30-36.87)
baseline  RF: Rec             8.00 (0.62-40.83)  10.45 (1.99-65.21)  14.31 (2.73-91.12)  8.82 (1.85-30.42)
          RF: MO              5.18 (0.71-30.16)  10.64 (1.41-55.28)  17.88 (2.70-75.46)  8.20 (1.14-28.00)
Deep      Recursive           5.31 (0.00-29.32)  10.00 (1.45-46.22)  13.34 (2.17-62.86)  8.43 (1.24-30.49)
baseline  DeepMO              5.01 (0.00-28.74)  9.93 (1.62-41.67)   12.91 (2.26-56.04)  8.52 (1.43-30.02)
Proposed  SeqMO               4.91 (0.00-28.95)  9.69 (1.51-41.54)   12.48 (2.28-54.02)  8.37 (1.29-29.46)
          PolyMO              4.95 (0.51-28.30)  9.79 (1.48-43.67)   12.49 (1.93-60.78)  8.46 (1.31-30.75)
          PolySeqMO           4.87 (0.48-27.80)  9.57 (1.43-43.59)   12.05 (2.03-60.90)  8.31 (1.24-29.76)
          PolySeqMO Ensemble  4.59 (0.41-21.12)  9.38 (1.35-42.34)   11.61 (1.99-59.89)  8.13 (1.18-29.49)

Moreover, we find that the improvements from function forecasting are complementary to those achieved by accounting for sequential output dependencies. By combining the SeqMO output decoder with PolyMO function forecasting, resulting in PolySeqMO, we achieve better performance than all other non-ensemble models in every task under consideration. The fact that PolySeqMO does well across all subsets of the data suggests that it encodes a more accurate and complete view of the underlying dynamics of the glucoregulatory system.

Both PolyMO and PolySeqMO focus entirely on modeling value trajectories as opposed to the values themselves. Given that we are evaluating using a multi-output metric, this built-in assumption may appear to drive the boost in performance. However, upon inspecting the performance of the Linear Extrapolation baseline, we conclude that this is not the case. Degree-1 PolyMO and PolySeqMO are equivalent in output capacity to the Linear Extrapolation baseline, and both inherently emphasize trajectories. However, Linear Extrapolation does far worse. PolySeqMO significantly reduces the error of the Linear Extrapolation approach on the full dataset (4.87 vs. 6.48). This demonstrates that the value of PolySeqMO lies in its ability to predict the future, not its assumption of linear trajectories. However, the fact that all models improve performance under polynomial smoothing suggests there is some value in the trajectory assumption.

6.5 Ensembling

While PolySeqMO is the best individual model on every task, it underperforms RF: MO on hyperglycemic prediction (8.31 vs. 8.20). While this could be because RF: MO is simply better suited to that task, it could also be a result of the general improvement observed when ensembling different models. To test this effect, we trained 10 PolySeqMO models on the same training set, varying only the random seed for initialization and the ordering of training batches.

We then averaged each model's predictions on the test set.

We found that even this simple ensembling scheme with few models (relative to the 100-model ensemble used in RF: MO) leads to a sizeable increase in performance across all tasks. In particular, we find that the PolySeqMO ensemble outperforms RF: MO on the hyperglycemic prediction task (8.13 vs. 8.20).

6.6 Multi-Output vs. Direct Forecasting

There are many forecasting problems in which an accurate single-output forecast may suffice. In such cases, it is common to use a direct forecaster [22], where one directly estimates p(x_{t+h} | x_{0:t}). In a follow-up analysis, we demonstrate that even in cases where only a single output is desired, it can be beneficial to consider a multi-output forecasting framework.

To demonstrate this, we begin by introducing a method to transition from multi-output to direct forecasting, focusing on SeqMO. This model operates by predicting multiple values at each time step. The training loss is the average across each of the six time steps in the prediction window. A direct forecaster, aiming to make a single prediction at the final time step, can be approximated by zeroing out all losses except those incurred at the final step. This focuses the full capacity of the network on predicting the final value. We can flexibly transition between direct and multi-output forecasting by manipulating the per-step loss weighting, transitioning from a one-hot vector on the final output (direct) to a uniform allocation of weight across the window (multi-output). We encapsulate this transition in a single hyperparameter 0 ≤ b_w ≤ 1, the base for the per-step loss weight. For each step i in a prediction window of length h, we set the loss weight:

w_{l,i} = b_w^{h−i} / Σ_{i=0}^{h} b_w^{h−i}

In Figure 8 we show the single-output performance (predicting 30 minutes into the future) of SeqMO with different settings of b_w. We observe that modeling the full trajectory does not worsen performance, and in fact appears to slightly improve it (MO 7.67 vs. Direct 7.61).


Interestingly, we found that the best single-output performance was achieved using an intermediate value of b_w (7.58 with b_w = 0.5). Mixing multi-output and direct forecasting strategies could be a promising direction for improving single-output forecasting performance.
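A minimal sketch of the weighting scheme, with the step indexing (i running over the h window positions) being our assumption:

```python
import numpy as np

def step_loss_weights(b_w, h=6):
    """Sketch of the per-step loss weighting: the weight for step i is
    proportional to b_w^(h-i), so small b_w concentrates weight on the
    final step (approximating a direct forecaster) and b_w = 1 recovers
    the uniform multi-output loss."""
    i = np.arange(1, h + 1, dtype=float)
    raw = b_w ** (h - i)
    return raw / raw.sum()

# b_w = 1.0 -> uniform weights (multi-output);
# b_w -> 0 -> one-hot weight on the final step (direct)
print(step_loss_weights(0.5))   # weight rises toward the final prediction
```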

7 CONCLUSIONS

In this work, we investigated methods for deep multi-output blood glucose forecasting. We demonstrated the importance of balancing autoregressive behavior and sequential error accumulation, and provided a forecasting model, SeqMO, that accomplishes this. We developed the idea of function forecasting, and introduced the novel forecasting methods PolyMO and PolySeqMO. We compared our proposed approaches to both shallow and deep baselines. Applied to the challenging task of predicting blood glucose, we demonstrated that: 1) multi-output methods outperform recursive alternatives, 2) modeling underlying dependencies among outputs using explicit connections and function forecasting leads to better performance, and 3) the proposed approaches are complementary, and combining them significantly improves performance. Additionally, we demonstrated that multi-output forecasting improves performance even on a single-output task.

These experimental results suggest that a multi-output approach can effectively capture the underlying dynamics of the glucoregulatory system. While we focus on blood glucose, forecasting real-valued signals is a problem with applications across a number of different domains, including speech processing, weather prediction, and medicine [6, 7, 17]. Our proposed methods are generally applicable to any forecasting problem requiring multi-step predictions.

8 ACKNOWLEDGEMENTS

This work was supported by the National Science Foundation (NSF award no. IIS-1553146), the Michigan Institute for Data Science (MIDAS), and the National Institutes of Health (NIH research grant RO1 NIH/NHLBI 1R01HL102334-01). The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the NSF or the NIH.

REFERENCES

[1] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. arXiv:1506.03099 [cs] (June 2015). http://arxiv.org/abs/1506.03099
[2] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv:1409.1259 [cs, stat] (Sept. 2014).
[3] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 (June 2014).
[4] Claudio Cobelli, Eric Renard, and Boris Kovatchev. 2011. Artificial pancreas: past, present, future. Diabetes 60, 11 (2011), 2672–2682.
[5] Meriyan Eren-Oruklu, Ali Cinar, and Lauretta Quinn. 2010. Hypoglycemia prediction with subject-specific recursive time-series models. SAGE Publications.
[6] Andre Gensler, Janosch Henze, Bernhard Sick, and Nils Raabe. 2016. Deep Learning for solar power forecasting — An approach using AutoEncoder and LSTM Neural Networks. In 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 002858–002865.
[7] Marzyeh Ghassemi, Marco A.F. Pimentel, Tristan Naumann, Thomas Brennan, David A. Clifton, Peter Szolovits, and Mengling Feng. 2015. A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 2015. NIH Public Access, 446.
[8] DCCT/EDIC Research Group. [n. d.]. The Effect of Intensive Treatment of Diabetes on the Development and Progression of Long-Term Complications in Insulin-Dependent Diabetes Mellitus.
[9] DCCT/EDIC Research Group. [n. d.]. Intensive Diabetes Therapy and Glomerular Filtration Rate in Type 1 Diabetes.
[10] DCCT/EDIC Research Group. [n. d.]. Sustained effect of intensive treatment of type 1 diabetes mellitus on development and progression of diabetic nephropathy: the Epidemiology of Diabetes Interventions and Complications (EDIC) study.
[11] Roman Hovorka, Valentina Canonico, Ludovic J. Chassin, Ulrich Haueter, Massimo Massi-Benedetti, Marco Orsini Federici, Thomas R. Pieber, Helga C. Schaller, Lukas Schaupp, Thomas Vering, and others. 2004. Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes. Physiological Measurement 25, 4 (2004), 905.
[12] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[13] Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016. Professor Forcing: A New Algorithm for Training Recurrent Networks. arXiv:1610.09038 [cs, stat] (Oct. 2016). http://arxiv.org/abs/1610.09038
[14] Chiara Dalla Man, Francesco Micheletto, Dayu Lv, Marc Breton, Boris Kovatchev, and Claudio Cobelli. 2014. The UVA/PADOVA type 1 diabetes simulator: new features. Journal of Diabetes Science and Technology 8, 1 (2014), 26–34.
[15] Sadegh Mirshekarian, Razvan Bunescu, Cindy Marling, and Frank Schwartz. 2017. Using LSTMs to Learn Physiological Models of Blood Glucose Behavior. EMBC (2017).
[16] David M. Nathan, Margaret Bayless, Patricia Cleary, Saul Genuth, Rose Gubitosi-Klug, John M. Lachin, Gayle Lorenzi, Bernard Zinman, and the DCCT/EDIC Research Group. 2013. Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Study at 30 Years: Advances and Contributions. Diabetes 62, 12 (Dec. 2013), 3976–3986. https://doi.org/10.2337/db13-1093
[17] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[18] Silvia Oviedo, Josep Vehí, Remei Calm, and Joaquim Armengol. 2017. A review of personalized blood glucose prediction strategies for T1DM patients. International Journal for Numerical Methods in Biomedical Engineering 33, 6 (2017).
[19] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[20] Kevin Plis, Razvan Bunescu, Cindy Marling, Jay Shubrook, and Frank Schwartz. 2014. A machine learning approach to predicting blood glucose levels for diabetes management. Modern Artificial Intelligence for Health Analytics. Papers from the AAAI-14 (2014).
[21] Bharath Sudharsan, Malinda Peeples, and Mansur Shomali. 2014. Hypoglycemia prediction using machine learning models for patients with type 2 diabetes. Journal of Diabetes Science and Technology 9, 1 (2014), 86–90.
[22] Souhaib Ben Taieb, Gianluca Bontempi, Amir Atiya, and Antti Sorjamaa. 2012. A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Systems with Applications 39, 8 (2012), 7067–7083.
[23] Betty Tao, Massimo Pietropaolo, Mark Atkinson, Desmond Schatz, and David Taylor. 2010. Estimating the Cost of Type 1 Diabetes in the U.S.: A Propensity Score Matching Method. PLOS ONE 5, 7 (July 2010), e11501.
[24] Kamuran Turksoy, Elif S. Bayrak, Lauretta Quinn, Elizabeth Littlejohn, Derrick Rollins, and Ali Cinar. 2013. Hypoglycemia early alarm systems based on multivariable models. Industrial & Engineering Chemistry Research 52, 35 (2013), 12329–12336.
[25] Wenzu Wu, Kunjin Chen, Ying Qiao, and Zongxiang Lu. 2016. Probabilistic short-term wind power forecasting based on deep neural networks. In 2016 International Conference on Probabilistic Methods Applied to Power Systems (PMAPS). 1–8.
[26] Chiara Zecchin, Andrea Facchinetti, Giovanni Sparacino, and Claudio Cobelli. 2015. Jump Neural Network for Real-Time Prediction of Glucose Concentration. In Artificial Neural Networks. Springer New York, 245–259. DOI: 10.1007/978-1-4939-2239-0_15.
[27] Chiara Zecchin, Andrea Facchinetti, Giovanni Sparacino, and Claudio Cobelli. 2016. How Much Is Short-Term Glucose Prediction in Type 1 Diabetes Improved by Adding Insulin Delivery and Meal Content Information to CGM Data? A Proof-of-Concept Study. Journal of Diabetes Science and Technology 10, 5 (Sept. 2016), 1149–1160.