

Collaborative Learning between Cloud and End Devices: An Empirical Study on Location Prediction

Yan Lu†‡, Yuanchao Shu‡, Xu Tan‡, Yunxin Liu‡, Mengyu Zhou‡, Qi Chen‡, Dan Pei∗
Microsoft Research‡, New York University†, Tsinghua University∗

ABSTRACT

Over the years, numerous learning methods have been put forward to model and predict different user behaviors on end devices (e.g., ads click, location change, app launch). While learn-then-deploy approaches achieve promising results in many scenarios, data heterogeneity and variability impede the deployment of pre-learned models to a large cluster of end devices. On the other hand, learning on devices like smartphones suffers from limited data, computing power and energy budget. This paper proposes Colla, a collaborative learning approach for behavior prediction that allows cloud and devices to learn collectively and continuously. Colla finds a middle ground to build a tailored model for each device, leveraging local data and computation resources to update the model, while at the same time exploiting the cloud to aggregate and transfer device-learned knowledge across the network to solve the cold-start problem and prevent over-fitting. We fully implemented Colla with a multi-feature RNN model on both smartphones and in the cloud, and applied it to predict user locations. Evaluation results based on large-scale real data show that, compared with training on centralized data, Colla improves prediction accuracy by 21%. Our experiments also validate the efficiency of Colla, showing that one overnight training session on a commodity smartphone can process one year of data from a typical smartphone, at the cost of 2000 mWh and a few hundred KB of communication overhead.

CCS CONCEPTS

• Computer systems organization → Distributed architectures; Cloud computing; • Information systems → Data analytics; • Networks → Network algorithms.

KEYWORDS

edge computing; collaborative learning; smartphone; DNN; location prediction; knowledge distillation

ACM Reference Format:

Yan Lu, Yuanchao Shu, Xu Tan, Yunxin Liu, Mengyu Zhou, Qi Chen, Dan Pei. 2019. Collaborative Learning between Cloud and End Devices: An Empirical Study on Location Prediction. In SEC ’19: ACM/IEEE Symposium

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SEC ’19, November 7–9, 2019, Arlington, VA, USA

© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6733-2/19/11...$15.00
https://doi.org/10.1145/3318216.3363304

on Edge Computing, November 7–9, 2019, Arlington, VA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3318216.3363304

1 INTRODUCTION

User behavior prediction on end devices (e.g., PCs, laptops, smartphones) has long been a topic of interest in both the systems and machine learning communities. Research has been carried out to predict, for example, online user behaviors from weblogs [1], word input [2] and app usage [3] from OS system logs, and physical activities such as the locations a user will visit [4] based on data from portable sensing devices. These prediction results benefit numerous third-party applications including personal digital assistants, recommendation systems, advertising, etc.

Despite this broad range of applications, the vast majority of approaches used for user behavior prediction are data-driven: they use data mining and machine learning techniques to learn behavior models from historical data [5–11]. More recently, Deep Neural Networks (DNNs) have also been widely applied in behavior modeling due to their great success in sequence prediction [4, 12, 13]. Albeit with implementation differences, these approaches learn in a centralized way (e.g., in the cloud) from labeled activities, and make predictions on each end device (e.g., desktops and smartphones).

This classical learning paradigm has several downsides. First, most designs focus on building one single model that achieves the best performance on a given "mixed-user" dataset. However, when deployed to end devices, the global model may perform badly for some users due to skewed data distributions. For instance, the mobility patterns of kids, college students and company employees are distinctive from each other. A natural solution is to train different models for different groups of users. However, it is non-trivial to determine the number of groups as well as the consequent methodology of training and deployment. Second, to achieve good performance on the "mixed-user" dataset, large model capacity is needed, eventually increasing the resource usage on end devices. Third, the learn-then-deploy paradigm does not take runtime model execution results or newly-generated data on end devices into account, failing to adapt to data drift over time.

This paper is driven by a simple question: can each end device (e.g., a smartphone) learn its own prediction model and improve it over time? We believe the answer is yes. Our key insight is that despite the limited amount of data and skewed data distribution on each device, in user behavior prediction there exist common patterns [14] across users/devices due to the intrinsic correlations in the physical space, hence allowing the cloud and (multiple) end devices to complement each other and learn collectively. For example, users may install very different apps on their smartphones. But all these apps come from the same app store, and Yelp, for instance, is more likely to be launched around the same time (e.g., lunch time) on all devices.


The design of such a system, however, has to address several challenges. The key is to enable effective and efficient local training, allowing knowledge transfer from both cloud to device and device to device, while at the same time accommodating the huge gap between the cloud and end devices in terms of compute power, data variation, and energy budget. To realize the benefits of cloud-device collaboration, we let the cloud side do the heavy lifting at the beginning to train an initial cloud model. Devices then take over to perform incremental learning tasks using their local data, and build their own models (a.k.a., client models) in a distributed way for local inference. In the meantime, the cloud serves as a sink node that enables knowledge sharing across the network from time to time, expediting learning progress on each device and alleviating the impact of insufficient data and over-fitting.

Under this framework, we choose recurrent neural networks (RNN) as a template predictive model, and further propose diverse model architectures for the cloud and end devices. Specifically, the cloud hosts a heavy model with more layers and a larger hidden layer size, resulting in a larger capacity, while each device only maintains a lightweight model tailored to local data and computing power. To enable knowledge sharing between models, an update mechanism using a novel dual knowledge distillation approach is devised. At the client side, akin to classical knowledge distillation [15, 16], a device fine-tunes its model using both local ground truth (i.e., hard labels) as well as soft labels created by the cloud model, whereas on the cloud side, the cloud model is updated by distilling the knowledge from lightweight device models¹. To deal with data variation, a model grouping mechanism is incorporated in the cloud. It dynamically classifies device models, and updates the cloud by generating multiple cloud models. This way, each client only pulls one cloud model and hence the client model benefits from peer devices' training on correlated data. The cloud model is also enhanced over time, providing a good baseline for newly arriving devices and thus solving the cold-start problem. In what follows, we first present the problem formulation and model design in §2, and then elaborate on knowledge transfer and model update in §3.

To explore the feasibility of applying collaborative learning to user behavior prediction and quantify the benefits along multiple dimensions, we take the smartphone as an example and conduct a case study on location prediction. Our empirical experiments over a large-scale dataset collected from real users reveal the following key observations. First, we found that, despite limited amounts of data and computation resources, end devices can still develop knowledge of their mobility patterns promptly and efficiently by training a neural network model. The key is to customize the device model architecture and leverage the cloud to bootstrap the learning process. Second, by exploiting knowledge distilled from the crowd, the collaboratively-trained model achieves a higher prediction accuracy than both the centrally-trained model based on the aggregated data and the client-trained models of individual devices, as well as state-of-the-art baseline prediction methods. Third, prediction accuracy increases with successive model updates with the help of model grouping. Nevertheless, device variations are observed and the gain from each update diminishes over time. Our main contributions in the paper are as follows.

¹ Note that raw data uploading is not required.

• We revisit the problem of user behavior prediction, and propose Colla, a collaborative learning framework that allows devices and the cloud to learn collectively. In contrast to the traditional learn-then-deploy paradigm, Colla allows local devices to play an active role in learning their own data, and wisely leverages the cloud and other devices for both data and computational resources.

• We study the feasibility of Colla by applying a multi-feature RNN network to the problem of smartphone-based location prediction. A novel dual distillation mechanism with model customization and grouping is proposed, demonstrating superior prediction accuracy over state-of-the-art methods.

• We fully implement Colla on Android and Microsoft Azure, and evaluate its performance on a large-scale real dataset. Key observations and insights are reported, shedding light on how collaborative learning would work for behavior prediction in practice, and how a better learning system can be designed with a swarm of networked devices.

2 USER BEHAVIOR PREDICTION: A NON-COLLABORATIVE PERSPECTIVE

We set the context by formulating behavior prediction problems from a non-collaborative perspective (§2.1). This represents a class of classical prediction algorithms that run on centralized data. Among these methods, RNNs have recently demonstrated superior performance due to their ability to handle sequence dependence. Hence, in §2.2, we first introduce an RNN with Long Short-Term Memory (LSTM) units, and use it as our template model design. Note that although the design of the RNN is not a key contribution of this paper, it lays the foundation for the design of Colla (§3), making it a generally applicable method for different behavior predictions.

2.1 Problem Formulation

Sequence data is one of the most widely collected types of data on end devices. The particular focus of this paper on sequence prediction is to decide whether, given time-series patterns – (randomly sampled) data observations from the past – we can learn to make reasonable predictions on future data, such as the next location. Formally, we define the problem as follows.

Definition 2.1 (Sequence). S_d is a sequence of samples q_1^d q_2^d ... q_n^d on device d, where q_i^d = (t_i, v_i), and t_i and v_i ∈ V are the timestamp and value of sample q_i^d. The time gap between t_i and t_{i+1} may vary between samples, and we mainly focus on discrete sequences where v_i is a discrete value, such as a location landmark or an application ID.

Definition 2.2 (Sequence Prediction Problem). Given sequence data S_d, where d is from a set of end devices D = {d_0, ..., d_k}, the sequence prediction problem seeks to predict the next data point q_{n+1}^d for each device d ∈ D.

Note that q_{n+1}^d is a tuple of timestamp t_{n+1} and value v_{n+1}, which makes prediction a complicated multivariate prediction problem. In this study, the timestamp information is taken as an input of the RNN and the value v_{n+1} is the prediction target. Therefore, we can vary the input t_{n+1} to predict the next value v_{n+1} at any given time. As we assume discrete values in a sequence, the prediction can be regarded as a series of multiclass classification problems, with each class representing a possible value of v_{n+1}.

In §2.2, we take location prediction as an example to describe our model design. In this situation, the sequence data consists of trajectory sequences which record the time and location ID of a device. Note that in many applications like keyboard, location and app launch prediction, the ground truth value v_{n+1} at t_{n+1} can be naturally obtained, so there is no need for local data labeling.

2.2 Prediction Methods

RNNs have recently demonstrated outstanding performance due to their advantages in sequence dependency modeling, generalization ability, high flexibility and scalability [4, 12, 13]. Considering that contextual data such as time also provides valuable information for user-behavior prediction, we propose a Multiple Feature RNN (M-RNN) model. Figure 1 shows the structure of M-RNN. In mobility prediction, the M-RNN model takes multiple sequence data streams as input, and outputs a vector of multi-class probabilities with each entry representing the probability of a possible location.

Figure 1: M-RNN architecture. Six input sequences (location, hour, minute, weekday, duration and gap) are each embedded, concatenated, and fed through an LSTM layer and a linear layer to produce class probabilities.

To leverage an RNN, we first convert the location ID and relevant time information in the trajectory sequence into vectors that are friendly to neural networks. We borrow the idea of word embedding [17, 18], and convert each location ID into an embedding with an embedding lookup table. Five time-related features are extracted and converted to discrete values. Specifically, the Hour, Minute and Weekday features represent the likelihood of seeing similar mobility patterns in the temporal domain. Using timestamps of consecutive data points that contain the same location ID, we extract Duration, which represents how long a user stays at one location. We also extract Gap, the time gap between two adjacent but different locations, to characterize, e.g., commute time. During inference, we take Gap as an input to predict future locations at any given time. Second, we transform the extracted information above into discrete ID features. The numbers of discrete IDs for Hour, Minute and Weekday are 24, 60 and 7, respectively. Duration and Gap are discretized into 144 values, each covering a time span of 10 minutes, so that the 144 values together cover 1440 minutes (i.e., one day). This is the upper bound of the time span of a sequence we consider in location prediction. Finally, we convert the trajectory sequence into 6 discrete ID sequences: the Location sequence, Hour sequence, Minute sequence, Weekday sequence, Duration sequence and Gap sequence.

After extraction, we get six sequences of features in total (the location sequence plus five time-related sequences). We use the location ID sequence in a 10-minute time window (like a sentence in natural language processing) to pre-train the embedding model that converts each location ID into a dense vector. Similarly, at each timestamp, we train the other embeddings end-to-end, and concatenate the embeddings of the six different IDs as the input to the M-RNN model. We use X_n to represent the embedding concatenation at the n-th step.

The M-RNN model contains an LSTM layer. We feed the output hidden state of the last step in the sequence to a fully connected layer f, which maps the feature vector into a |V|-dimensional vector, where |V| is the number of location IDs. In the LSTM, h_0 and c_0 are the initial hidden state and cell state, and

h_n, c_n = LSTM(X_n, (h_{n−1}, c_{n−1})),   (1)

which denotes the hidden state and cell state at the n-th step. In the fully connected layer f, we use location_{n+1} = f(h_n) to represent the future location at the (n+1)-th step. Since devices have different visited locations, in Colla, each device customizes its local model by setting an adaptive value of V (§3.3). For example, a device sets the size of its fully-connected layer to 25 when it has visited 25 unique locations. If it collects 5 new location IDs, it will expand the size to 30.
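A minimal sketch of the M-RNN structure in Figure 1, written against the current tf.keras functional API rather than the Keras 1.0/Theano stack used in our implementation (§4.2). Layer sizes follow the cheap client model (embedding size 4, hidden size 16), and num_locations stands in for the device-specific |V|; all names here are illustrative.

from tensorflow.keras import layers, Model

def build_mrnn(num_locations, emb_size=4, hidden_size=16):
    # Six discrete-ID input sequences, one per feature (Figure 1).
    vocab_sizes = {"location": num_locations, "hour": 24, "minute": 60,
                   "weekday": 7, "duration": 144, "gap": 144}
    inputs, embedded = [], []
    for name, vocab in vocab_sizes.items():
        inp = layers.Input(shape=(None,), dtype="int32", name=name)
        inputs.append(inp)
        embedded.append(layers.Embedding(vocab, emb_size)(inp))  # per-ID-space lookup table
    x = layers.Concatenate()(embedded)                 # concatenation layer
    h = layers.LSTM(hidden_size)(x)                    # last-step hidden state h_n
    out = layers.Dense(num_locations, activation="softmax")(h)   # f(h_n) + softmax over |V|
    return Model(inputs, out)

# The FC size tracks the number of unique locations the device has visited so far.
client_model = build_mrnn(num_locations=25)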

In training, we split a sequence that has n elements into n − 1 training sequences. For instance, [id1, id2, id3] would be split into [id1]->[id2] and [id1, id2]->[id3]. Finally, a softmax operation is performed to get the output probability.
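A one-function illustration of this prefix split (not the paper's code):

def split_sequence(ids):
    # [id1, id2, id3] -> ([id1], id2) and ([id1, id2], id3)
    return [(ids[:k], ids[k]) for k in range(1, len(ids))]

print(split_sequence(["id1", "id2", "id3"]))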

3 COLLA LEARNING FRAMEWORK

In this section, we describe the collaborative learning framework. It consists of the cloud and a set of end devices such as smartphones. We illustrate the learning process using a star topology as an example, where each device connects directly to the cloud. In Colla, the cloud trains and maintains a large base model M whereas each device holds a small client model m customized for itself. We call the base model on the cloud and the client model on a device the cloud model and client model, respectively.

3.1 Learning Flow

As shown in Figure 2, the Colla learning process consists of the following four stages.

Stage 1: Bootstrapping. The cloud trains the very first cloud model from an initial dataset. This is done only once when the system is deployed, e.g., when a keyboard application releases a new version that can learn a model to predict the next word given the prefix or previous word [19]. At this stage, it is reasonable to expect that the cloud owns some data (e.g., anonymous user data or an existing large public dataset like a word corpus) to initiate the learning process.

Figure 2: Colla learning flow. Numbers in dashed red boxes represent learning stages. Model grouping is not included.

Stage 2: First pulling. When a new device d_i joins the system at time T_0, it asks the cloud for the latest model. The cloud compresses (see §3.2 for more details) its latest cloud model M_0 into a small one and sends it to the device. d_i uses the compressed small model as its first client model m_i^0 to perform inference in the coming time period T = T_1 − T_0.

Stage 3: Client model update. After collecting a reasonable amount of data, or simply after a fixed time period T, device d_i pulls the latest cloud model M_1 and merges it with the current client model m_i^0 through knowledge distillation, resulting in a new client model m_i^1. In the simplest form of distillation, the heavy cloud model is used as the teacher model to fine-tune the client model (i.e., the student model) over the local dataset (a.k.a., the transfer set). Thus, the client model is able to learn from both its local data and the cloud model. §3.3 will describe our knowledge distillation process in detail. In §5.4 and §7, we also evaluate the cost of model training on commodity smartphones, and list works that can further expedite DNN execution on resource-constrained devices. After client model m_i^1 is generated, its parameters are pushed to the cloud while the device uses the lightweight client model m_i^1 to perform inference till time T_2.

Stage 4: Cloud model update. Once it has received model parameters from N devices, the cloud also updates its model. This is done again by knowledge distillation, but with multiple teacher models (i.e., the N pushed models from end devices d_i, i ∈ [1, N]). Here the transfer set includes all data available in the cloud. This stage results in a new cloud model. Besides updating the base cloud model, Colla also performs model grouping, dividing the N device models into K groups. Each model group is used to teach the base cloud model into a new classified cloud model (not shown in Figure 2). Details are described in §3.2.

Client model update (Stage 3) and cloud model update (Stage 4) happen repeatedly. For example, at time T_2, d_i pulls the latest cloud model M_2 (or M_2^k if d_i was classified into group k) and distills the knowledge into client model m_i^1 using the data collected during time period T_2 − T_1, resulting in a new client model m_i^2, and then pushes it to the cloud. Similarly, the cloud updates its model from time to time using received models from devices. Since we perform distillation to transfer knowledge from multiple devices to the cloud as well as from the cloud to each device, we call the mechanism dual knowledge distillation. Note that model updates in the cloud and on devices may work in an asynchronous way so as to adapt to various data collection rates and different device constraints (e.g., energy).

3.2 Model Compression and Grouping

To reduce the overhead of running model inference on resource-limited end devices and to account for the behavior diversity of different groups of users, Colla uses model compression and model grouping.

Model compression. As end devices usually have limited memory and computation resources, it is critical not to run a large model on end devices. However, the cloud needs to use a large model with enough capacity to combine the knowledge learned from all devices. We propose to use model compression to balance the needs of both the cloud and devices. To this end, the cloud compresses its large base model into a small one through knowledge distillation over the cloud dataset. As shown in Stage 2 in Figure 2, when a new device comes, the cloud sends the compressed small model to the device. The large model and the small model have different LSTM unit sizes and different embedding sizes. Specifically, for the large model, we set the size of the LSTM unit to 128, and the embedding size for all six different ID spaces to 32. In total, the large M-RNN model contains 128,796 parameters with a size of 503KB (128,796 × 4 bytes). For the small model, we set the hidden size of the LSTM unit to 16 and the embedding size to 4, resulting in a model with 11,868 parameters and a size of 88KB (11,868 × 4 bytes). In §5.4, we show the impact of different embedding and LSTM unit sizes on prediction accuracy as well as execution cost on mobile devices.

Page 5: Collaborative Learning between Cloud and End Devices: An ...ads click, location change, app launch). While the learn-then-deploy approaches achieve promising results in many scenarios,

Model grouping. With a growing number of users in Colla, maintaining a single model in the cloud may not achieve satisfactory prediction accuracy for all users, as different groups of users may have very different behavior patterns. To solve this problem, instead of using a one-size-fits-all cloud model, Colla splits up the cloud model over time to serve different groups of users. Doing so requires dividing devices into groups where users in the same group share similar behavior patterns. To this end, Colla conducts model grouping in Stage 4 in Figure 2. After receiving N small client models, the cloud performs model inference using each of the N models over the cloud dataset. Based on the N output feature vectors, we use the cross entropy between two feature vectors as the distance between two models, and cluster models into K groups using Affinity Propagation [20]. In our current implementation, we take a brute-force approach and choose the K with the highest Silhouette Coefficient [21] from a set of {2, 3, 5, 10}.
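The grouping step can be sketched with scikit-learn as below (an assumption for illustration; the paper's implementation does not name its clustering library). client_outputs is an N × |V| array where row i is client model i's averaged softmax output over the cloud dataset. Since scikit-learn's AffinityPropagation does not take K directly, this sketch sweeps its preference parameter and keeps the clustering with the highest Silhouette Coefficient, a stand-in for the brute-force search over K ∈ {2, 3, 5, 10} described above.

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

def cross_entropy_dist(p, q, eps=1e-12):
    # Cross entropy between two probability vectors, used as the model distance.
    return -float(np.sum(p * np.log(q + eps)))

def group_client_models(client_outputs, preferences=(-1.0, -5.0, -10.0)):
    n = len(client_outputs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = cross_entropy_dist(client_outputs[i], client_outputs[j])
    dist = (dist + dist.T) / 2.0          # symmetrize
    np.fill_diagonal(dist, 0.0)           # silhouette_score expects a zero diagonal
    best_labels, best_score = None, -1.0
    for pref in preferences:
        ap = AffinityPropagation(affinity="precomputed", preference=pref, random_state=0)
        labels = ap.fit_predict(-dist)    # AP wants similarities, so negate distances
        k = len(set(labels))
        if k < 2 or k >= n:
            continue
        score = silhouette_score(dist, labels, metric="precomputed")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score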

Afterwards, the cloud uses the client models in each group to generate a cloud model M_k, k ∈ [1, K], per group, through cloud distillation (see §3.3) using the cloud dataset. This happens at every cloud model update stage (i.e., Stage 4 in Figure 2) and generates a new set of groups each time. The cloud records which client model (and thus the corresponding device) belongs to which group. As a result, when a device asks for the latest cloud model again (Stage 3 in Figure 2), the cloud will send back the classified cloud model of that device rather than the base cloud model. Note that the base cloud model is also updated using all the received client models, and thus new devices can always get the latest cloud model to start with, before they are assigned to a group.

3.3 Model Update through Dual Distillation

Client and cloud model updates are both critical in Colla, as they transfer knowledge between the cloud and different client models. Knowledge distillation [15, 16] is widely used in machine learning to transfer the knowledge from a heavy model (a.k.a., the teacher model) to a cheap model (a.k.a., the student model) that is more suitable for deployment on edge devices [16, 22–26]. Taking a classification problem as an example, in the training process without knowledge distillation, the model generates class probabilities using a softmax function, and then matches them to one-hot ground truth labels to calculate the loss for back propagation. The major difference after adding knowledge distillation is that the student model not only matches the output probabilities p of the input x to the true label y, but also matches p to the class probabilities q predicted by the teacher model on the same input x. Mathematically, the distillation loss is calculated as:

L = λ L(y, p) + (1 − λ) L(q, p),   (2)

where L(·, ·) is the loss function (i.e., Cross-Entropy or Kullback-Leibler (KL) Divergence), L(y, p) is the original loss term, L(q, p) is the distillation loss term, and λ is a hyper-parameter to trade off the contribution of the two loss terms. It can be determined by hyper-parameter search during training.
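A small numeric sketch of Equation 2 with cross-entropy for both terms (pure numpy, illustrative): y is the one-hot hard label, p the student's softmax output, and q the teacher's soft label. For cloud distillation (Equation 4 below), q would be the average softmax output of the N client models.

import numpy as np

def cross_entropy(target, pred, eps=1e-12):
    # H(target, pred) = -sum_i target_i * log(pred_i)
    return -float(np.sum(target * np.log(pred + eps)))

def distillation_loss(y, p, q, lam=0.5):
    # Equation 2: L = lambda * L(y, p) + (1 - lambda) * L(q, p)
    return lam * cross_entropy(y, p) + (1.0 - lam) * cross_entropy(q, p)

y = np.array([0.0, 1.0, 0.0])   # hard label
p = np.array([0.2, 0.7, 0.1])   # student (client model) prediction
q = np.array([0.1, 0.8, 0.1])   # teacher (cloud model) soft label
print(distillation_loss(y, p, q, lam=0.5))   # lambda = 0.5, as used in Section 4.2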

With the popularity of distillation, much work [27, 28] has been proposed to allow two models to teach each other. For instance, Deep Mutual Learning [27] trains two models on the same dataset simultaneously and makes them match each other's probability estimates. BAN [28], on the other hand, enables one network to teach itself and generate new models via consecutive distillation. It also adopts ensemble learning to aggregate predictions of these models to generate a more reliable teacher model. Inspired by these works, we extend basic knowledge distillation to a dual distillation paradigm on both the client and the cloud. When the cloud model teaches a client model, the device can utilize the general knowledge from the cloud model to avoid overfitting despite the variations of local data. Conversely, the cloud model benefits from ensemble learning by aggregating outputs from multiple client models.

Client Distillation. On the device side, knowledge is transferred from the cloud model to the client model. Therefore, we use the cloud model as the teacher model and conduct knowledge distillation on the data available on each device. Following Equation 2, the loss function of client model i can be calculated as

L^{d_i} = λ_d L(y^{d_i}, p^{d_i}) + (1 − λ_d) L(q^c, p^{d_i}),   (3)

where the superscripts c and d denote the cloud side and device side. y^{d_i} represents the label of the data on device i, and p^{d_i} and q^c represent the output probabilities of the client model (student) and the cloud model (teacher) on device i, respectively. We use early stopping in client distillation to avoid over-fitting on limited local data. Here we assume local data is annotated (i.e., with known labels). This might be impractical for tasks like object segmentation, but is trivial in many applications like sequence prediction, where word inputs, device locations and app launches are natural labels.

Cloud Distillation. Cloud distillation uses multiple client models to teach the cloud model by fine-tuning it on the cloud dataset. It aims to match the output probabilities of the cloud model to the average of the softmax outputs of the client models. Similarly, we have

L^c = λ_c L(y^c, p^c) + (1 − λ_c) L((1/N) Σ_{i∈N} q^{d_i}, p^c),   (4)

where the superscripts c and d denote the cloud side and device side, respectively. y^c represents the hard labels of the cloud data, and p^c and q^{d_i} represent the class probabilities of the cloud data given by the cloud model and client model i, respectively. This cloud distillation procedure is the same for both the base cloud model and the classified group models. The only difference is the number of client models used in the distillation.

Note that in Colla, devices have different values of V (i.e., customized FC layers) to cover the diverse data classes on the client side. Usually, client models have a smaller label size |V| than the cloud model. In order to facilitate knowledge distillation, which requires the same label set between teacher and student models, we modify the output probabilities of the teacher model. Specifically, in client distillation, only the output probabilities that match existing labels of the client model are picked and re-normalized as soft labels. In cloud distillation, the output probabilities of the device model are padded with zeros to match the feature vector size of the cloud model.
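The label-space alignment can be sketched as follows (numpy, illustrative): for client distillation, the cloud model's output is restricted to the client's label subset and re-normalized; for cloud distillation, the client's output is zero-padded to the cloud's label space. The mapping client_label_ids from client output positions to cloud label indices is an assumed bookkeeping structure.

import numpy as np

def renormalize_for_client(cloud_probs, client_label_ids):
    # Client distillation: keep only the labels the client knows, re-normalize to a soft label.
    sub = cloud_probs[client_label_ids]
    return sub / sub.sum()

def pad_for_cloud(client_probs, client_label_ids, cloud_num_labels):
    # Cloud distillation: scatter client probabilities into the cloud's larger label space.
    padded = np.zeros(cloud_num_labels)
    padded[client_label_ids] = client_probs
    return padded

cloud_probs = np.array([0.5, 0.1, 0.1, 0.3])          # cloud model output over 4 labels
ids = np.array([0, 3])                                 # client has only seen labels 0 and 3
soft = renormalize_for_client(cloud_probs, ids)        # -> [0.625, 0.375]
back = pad_for_cloud(soft, ids, cloud_num_labels=4)    # -> [0.625, 0.0, 0.0, 0.375]
print(soft, back)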

4 DATASET AND IMPLEMENTATION

To empirically evaluate Colla, we fully implement it on the Microsoft Azure cloud and Android smartphones, and conduct trace-driven emulations using large-scale real data.


4.1 Dataset Description

Location prediction is conducted using WiFi AP scanning results collected on a large university campus for four months (from March to June 2016). In total, there are 120,624,600 WiFi scanning records between 201,041 devices and 2,890 Cisco enterprise APs deployed in 116 buildings on the campus. At peak time, there are ∼20,000 devices concurrently connected to the campus WLAN. The total number of unique devices is more than 60,000 each day. On average, 4.89 APs are scanned by each device per day.

Based on the hypothesis that APs close to each other are more likely to be observed by the same mobile device, we group all 2,890 APs into a smaller number of clusters (|V| = 368). Colla aims to figure out which AP cluster each device is more likely to visit given a future time t. To extract devices' trajectory sequences, we slice scanning records using a 10-minute time window. The device location is set to the ID of the cluster that contains the most-visited AP in each time window. In data preprocessing, we filter out inactive devices that have fewer than 10 active days per month – a day is active only if it contains more than 5 unique location IDs. As a result, 12,540 devices are left. The distributions of the number of active days per month and of the length of location sequences (i.e., number of different locations) across all devices are shown in Figure 3.
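A pandas-style sketch of the slicing and filtering described above (illustrative; the column names device_id, timestamp and cluster_id and the exact per-month filtering rule are assumptions, not taken from the dataset or our pipeline).

import pandas as pd

def build_trajectories(scans: pd.DataFrame) -> pd.DataFrame:
    """scans: one row per WiFi scan with a datetime column `timestamp` and
    columns `device_id`, `cluster_id`. Returns one location per device per
    10-minute window (the most-visited AP cluster in that window)."""
    scans = scans.copy()
    scans["window"] = scans["timestamp"].dt.floor("10min")
    return (scans.groupby(["device_id", "window"])["cluster_id"]
                 .agg(lambda s: s.value_counts().idxmax())
                 .reset_index(name="location"))

def filter_active_devices(loc: pd.DataFrame, min_days=10, min_locs=5):
    # A day is active only if it contains more than `min_locs` unique location IDs;
    # keep devices with at least `min_days` active days in every month.
    loc = loc.copy()
    loc["day"] = loc["window"].dt.date
    loc["month"] = loc["window"].dt.to_period("M")
    uniq = loc.groupby(["device_id", "month", "day"])["location"].nunique()
    active_days = (uniq > min_locs).groupby(level=["device_id", "month"]).sum()
    ok = active_days.groupby(level="device_id").min() >= min_days
    return loc[loc["device_id"].isin(ok[ok].index)]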

Figure 3: Location prediction data distribution. (a) Distribution of the number of active days per month; (b) distribution of the length of location sequences. Both are shown before and after preprocessing.

4.2 Implementation

We implement the cloud part of Colla on the Microsoft Azure cloud. Specifically, we use an Azure virtual machine with Intel Xeon CPUs, 128GB memory, 3TB storage and 8 NVIDIA Tesla P100 GPU cards (CUDA 9.0.0) with 16GB GPU memory for experiments. The cloud part has three main components: data storage, model training engine and communication interface. The model structure is stored in a .json file and the weights are stored in a .npy file (numpy array). Initial training data is stored in HDFS format using Azure HDInsight. Model training is conducted using Keras (v1.0.0) and Theano (v1.0.0). A RESTful API is provided for the communication between cloud and devices.

On the client side, we use Android smartphones and Pydroid 3², an IDE for Android featuring an offline Python (v3.6) interpreter and built-in C and C++ compilers. We install Keras, numpy, scipy and Theano on Pydroid 3 using pip to enable model training with data stored on the SD card. Due to the lack of a programming model for mobile GPUs³, we train the M-RNN model on smartphones using the CPU. Execution cost is presented in §5.4.

² https://play.google.com/store/apps/details?id=ru.iiec.pydroid3

In terms of training configuration, we use cross entropy as the loss function for both the true loss and the distillation loss. Both λ_c and λ_d in Equation 4 and Equation 3 are set to 0.5 after a hyper-parameter search. The batch size is set to 32 and we use the Adam optimizer with a learning rate of 0.1. We adopt cross validation with the ratio of the training/validation/test sets set to 16:4:5. An early stopping mechanism is adopted, which terminates training if the loss on the validation set does not improve for twenty consecutive epochs.
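The configuration above maps onto Keras roughly as follows. This sketch uses the current tf.keras API (our implementation uses Keras 1.0 with Theano, §4.2) and a toy stand-in model and data; only the hard-label loss term is shown, with the distillation term of Equation 2 omitted.

import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

num_classes = 10  # toy stand-in for |V|
model = models.Sequential([layers.Embedding(num_classes, 4),
                           layers.LSTM(16),
                           layers.Dense(num_classes, activation="softmax")])

x = np.random.randint(0, num_classes, size=(640, 8))   # 640 toy sequences of length 8
y = np.random.randint(0, num_classes, size=(640,))
x_train, x_val, y_train, y_val = x[:512], x[512:], y[:512], y[512:]

model.compile(optimizer=Adam(learning_rate=0.1),           # Adam, learning rate 0.1
              loss="sparse_categorical_crossentropy",      # cross entropy on hard labels
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=20)  # stop after 20 stale epochs

model.fit(x_train, y_train, batch_size=32, epochs=500,       # 500 is just an upper bound
          validation_data=(x_val, y_val), callbacks=[early_stop])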

5 EVALUATION

We present evaluation results of Colla in this section, using the dataset described in Section 4.1.

5.1 Experimental Settings

Colla: We use synchronized model updates by default. The model update cycle is set to T = 20 days. This means all devices pull the cloud model at the end of each cycle, and upload (changed) parameters right after their local models have been updated using local data from the past period. The cloud periodically updates its model based on the model parameters from all clients.

Model: As described in Section 3, Colla adopts different M-RNN architectures in the cloud and on devices. The embedding layer of the heavy M-RNN is sized at 32 with a hidden layer size of 128, whereas the cheap model is sized at 4 with a hidden layer size of 16.

Data: We select 10,000 devices that have at least 10 active days per month, and evaluate Colla across the entire four months (i.e., March 2016 to June 2016). In particular, we take a closer look at collaboration performance among the 100 most active devices (a.k.a., top-100). Data from the first month is used as the initial training set for the cloud. Because we set T = 20 days and extract the first month as initial data, we split the data for mobility prediction (April 2016 to June 2016) into four parts. In evaluation, we use the current part as the training set and the next part as the evaluation set. Thus, there are three cycles in mobility prediction and two cycles in app launch prediction.

Metrics: Trained models are evaluated on the data from the subsequent cycle. We use the following metrics to evaluate the performance of Colla:

prediction accuracy: the proportion of correct predictions among all predictions (i.e., top-1 accuracy).

weighted-precision and weighted-recall: Pr = Σ_{i=1}^{|V|} w_i · TP_i / (TP_i + FP_i) and Re = Σ_{i=1}^{|V|} w_i · TP_i / (TP_i + FN_i), which are widely used in multi-class classification [29]. Here w_i is the fraction of class i in all data (a computation sketch follows this list).

execution cost: local model inference time, training time and communication cost on smartphones.
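These definitions correspond to scikit-learn's 'weighted' averaging, used below as an equivalent formulation (illustrative; the labels are made up):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0, 1, 2, 2, 1, 0, 2])   # true next-location classes
y_pred = np.array([0, 1, 2, 1, 1, 2, 2])   # top-1 predictions

top1_accuracy = accuracy_score(y_true, y_pred)
# 'weighted' averages per-class precision/recall, weighting class i by its share w_i of the data.
weighted_precision = precision_score(y_true, y_pred, average="weighted")
weighted_recall = recall_score(y_true, y_pred, average="weighted")
print(top1_accuracy, weighted_precision, weighted_recall)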

Since there is no well-developed Receiver Operating Characteristic (ROC) analysis for multi-class classification or prediction, we do not include AUC (Area Under the ROC Curve) in the list of measures.

³ Most smartphones feature GPUs from Qualcomm and Intel, which only focus on speeding up inference/prediction but not training. We expect computing platforms like CUDA to be released for smartphone GPU-based training in the near future.

Figure 4: CDF of overall mobility prediction accuracy, precision and recall over three updates (DT, CT, DT-ws and COLLA).

Figure 5: CDF of mobility prediction accuracy, precision and recall compared with baseline methods (Decision Tree, Markov-2, Markov-4, RNN and M-RNN).

5.2 End-to-end Prediction Performance

We first evaluate end-to-end performance of four different methods at the end of the evaluation period. These four methods are: i) device-training (DT), where each device trains its own model from scratch using its own data and never exchanges any information with the cloud or other devices; ii) cloud-training (CT), where the cloud collects data from all devices periodically and trains a global model (using the heavy M-RNN) for prediction; i) and ii) are two extremes that are commonly used in legacy prediction approaches where clients are either treated independently or uniformly; iii) device-training with warm start (DT-ws), where each device pulls the very first trained model from the cloud and utilizes local data to fine-tune it at the beginning of each cycle, but never exchanges parameters with the cloud or other devices after the first pulling; this is the default approach to deploy models in transfer learning; and iv) Colla, where collaborative learning is adopted.

5.2.1 Top-100 devices with fixed size of initial dataset. Figure 4 shows the overall performance over three updates for mobility prediction. We see that CT performs slightly better than both DT and DT-ws, due to the extra data from other devices. In all three figures, Colla performs the best among these methods, demonstrating the effectiveness of collaborative learning. Some detailed accuracy numbers are listed in Table 1. It can be seen that the median accuracy of Colla is 0.51, 21.42% higher than the second best approach (0.42 for DT-ws), and Colla yields the best accuracy among all four methods on 68 devices (out of 100). Another interesting finding is that the performance of DT is nearly equal to that of DT-ws in Figure 4. Given that DT-ws outperforms DT by 10% in terms of median accuracy after the first update, this comparison reveals that the help from a warm start diminishes with successive client model updates.

Table 1: Mobility prediction performance. Best percentage means the percentage of devices on which the corresponding method performs the best.

Method   Best pct.   Avg. acc.   Med. acc.
DT       0.0%        0.40        0.40
CT       28.00%      0.40        0.39
DT-ws    4.00%       0.41        0.42
Colla    68.00%      0.47        0.51

Figure 5 also compares Colla with multiple state-of-the-art methods, including decision tree [6], Markov Chain models [7] and RNN [4], after the first update. As can be seen, Colla with M-RNN achieves superior performance on all three metrics. For instance, M-RNN exceeds RNN by 116% (0.39 vs. 0.18) in terms of median accuracy.

5.2.2 Randomly sampled devices with varying sizes of initial dataset. In a more realistic setting, we sample 100 devices randomly from all 10,000 devices, and extend the cloud initial dataset up to the top-10^4 devices.


Figure 6: CDF of prediction accuracy on the random-100 devices in mobility prediction. (a) Initial dataset with 10^3 devices; (b) initial dataset with 10^4 devices.

In Figure 6, we can see that the model fine-tuning-based methods (i.e., DT, DT-ws and Colla) obtain better performance when the initial dataset is small. However, with a large initial dataset in the cloud, CT catches up and outperforms DT and DT-ws. In the best case, it achieves a median accuracy of 0.33 when the initial training set contains 10^4 devices. Note that the gain of CT comes at the cost of continuous local data uploading and heavy model exchange. The execution cost of local inference using a heavy model could also be prohibitively expensive (more experimental results in Section 5.4). Compared with CT and DT-ws, Colla obtains the best prediction accuracy on different initial datasets. In addition, comparing the two figures, we can see it also benefits from the improvement of the pre-trained cloud model, bringing the median accuracy from 0.32 to 0.39.

Figure 7: CDF of prediction accuracy using different methods in mobility prediction, with initial datasets of the top-100, top-1000 and top-10000 devices. (a) CT; (b) COLLA.

Figure 7 shows another perspective of Figure 6 by putting together models trained in the same way but with different initial data sizes. It is clear that the larger the data size used for CT, the better the generalization capability the model can achieve, and hence the higher the prediction accuracy.

5.3 Model Update Performance

In this section, we zoom in to examine gains from each model update.

Figure 8 shows mobility prediction performance for the top-100 devices over four months (three updates). Compared with the steady improvements of Colla, DT-ws can hardly boost devices' prediction capabilities over time. Since DT-ws relies purely on each device's own data, over-fitting is more likely to happen. In Colla, however, devices take advantage of the cloud model during subsequent local training, which can be seen as regularization that prevents over-fitting.

We found that the continuous improvement from each model update is largely brought by model grouping. From Figure 9(a), it can be seen that without grouping, although prediction results get better over time, the gain becomes marginal after three months. We also investigate the performance of Affinity Propagation with the Silhouette Coefficient for determining the number of groups K (algorithm in Section 3.2). To this end, during cloud model update, we cluster models into different numbers of groups K, and examine their end-to-end prediction performance. For the top-100 devices, we find that clustering into two groups achieves the best overall accuracy (Figure 9(b)) after the second update, matching the ranking by Silhouette Coefficient (inside the legend box in Figure 9(b)). This verifies the effectiveness of grouping using model inference results.

Another component that has an impact on local model update is the customization of the FC layer. Here we compare Customized FC, where each device dynamically expands the size of its FC layer, against Unified FC, where each device owns an identical model architecture (the cheap model with the same FC layer size), on the top-100 devices in the first update. In Figure 10(a), we can see that Customized FC strictly outperforms Unified FC in terms of accuracy. This can be explained by the location class distribution of each user across cycles: new locations are more likely to be seen in the first few cycles (Figure 10(b)). Therefore, the gain from having a smaller FC layer, and thus higher prediction confidence, outweighs the prediction errors from missing location classes in the next cycle.

5.4 Execution Cost

Colla obtains the aforementioned accuracy improvements at the cost of periodic on-device model training and cloud model updates. Since smartphones have orders of magnitude less computing power and energy budget, we focus on execution cost on smartphones. To obtain a comprehensive understanding of runtime cost, we compare three model update approaches: i) Same-Whole, where devices share the same architecture (i.e., the cheap M-RNN) and fine-tune the whole network during update; ii) Same-FC, where devices share the same architecture but only fine-tune the last fully connected layer; and iii) Customization, where devices gradually expand (and fine-tune) the last FC layer with growing local observations, and all other layers are frozen during training. We conducted experiments on multiple smartphones at different levels. Results from four M-RNNs with different sizes (i.e., 4-16, 8-32, 16-64, 32-128) are presented, where 4-16, for instance, means the embedding size is 4 and the hidden size is 16. The sizes of these models are shown in Figure 11(a), where Customization(n) denotes the n-th local update in the customization update approach.
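One way to realize the Customization update is sketched below in tf.keras (illustrative, not our Keras 1.0 implementation): the embedding and LSTM layers are frozen, and the final fully connected layer is rebuilt with a larger output size, copying over the weights for locations that are already known.

import numpy as np
from tensorflow.keras import layers, Model

def expand_fc(model, new_num_locations):
    """Freeze all layers except the last Dense layer and grow its output
    from the current |V| to new_num_locations, keeping learned weights."""
    old_fc = model.layers[-1]
    old_w, old_b = old_fc.get_weights()      # shapes: (hidden, old_V), (old_V,)
    hidden, old_v = old_w.shape

    for layer in model.layers[:-1]:          # freeze embeddings and LSTM
        layer.trainable = False

    new_fc = layers.Dense(new_num_locations, activation="softmax", name="fc_expanded")
    out = new_fc(model.layers[-2].output)    # reattach on top of the frozen body
    new_model = Model(model.input, out)

    # Copy old weights into the first old_V output units; new units start at zero.
    new_w = np.zeros((hidden, new_num_locations), dtype=old_w.dtype)
    new_b = np.zeros((new_num_locations,), dtype=old_b.dtype)
    new_w[:, :old_v], new_b[:old_v] = old_w, old_b
    new_fc.set_weights([new_w, new_b])
    return new_model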

We first examine the number of parameters to evaluate the communication overhead. For the down-link, all three model update approaches send the same cloud model to the client side. However, for client-to-cloud communication, Same-FC and Customization only need to upload the parameters of the fully connected layer while Same-Whole needs to upload all parameters, since only modified model parameters are required for cloud model update. Figure 11(b) shows the communication overhead from the top-100 devices.


Figure 8: CDF of the top-100 devices' performance (accuracy, precision and recall) during three updates in mobility prediction, for DT-ws and COLLA.

Figure 9: (a) Accuracy without grouping (COLLA-1) and with grouping (COLLA-2) during three updates. (b) Accuracy and the corresponding Silhouette Coefficient score (inside the legend box) with different numbers of groups after the second update.

Figure 10: Performance of local model customization. (a) Prediction accuracy CDF for Unified FC vs. Customized FC; (b) class distribution across cycles.

In our current implementation with the client model sized at 4-16, the communication cost of one model update is as low as 520KB (130,000 × 4 bytes).

Another concern with running Colla on smartphones is the cost of local inference and client model training. As inference time can be affected by many factors (e.g., different optimizations of matrix multiplication), we use FLOPS to measure the inference cost. In Figure 11(c), compared with Same-FC and Same-Whole (they have the same FLOPS due to the same model architecture), Customization nearly halves the inference computation. Due to having the smallest number of uploaded parameters, inference FLOPS and model size, FC customization proves to be the most efficient local model update approach in iterative training.

Next, we dig deeper to measure model training/inference time as well as energy consumption. We randomly sampled a user from the top-100 list, and use the 293 data samples from the first cycle (i.e., 20 days) to fine-tune the local model, whereas 224 samples from the subsequent cycle are used for testing. Note that this user is at the 23rd percentile of all users in terms of the size of local samples. Figure 12(a) shows the local model update time of four M-RNNs on five smartphones. We find that Customization has the least training time among the three updating approaches in all cases: it only takes 2.42 minutes on average to fine-tune the local model on a commodity smartphone, which is an order of magnitude smaller than fine-tuning the whole model. Similar results are observed in Figure 12(b), where one prediction execution only needs 133.5 milliseconds. This makes it feasible to offer location prediction as a component to third-party applications on end devices. In terms of different model sizes, interestingly, we find that model 8-32 needs the maximum time to fine-tune on all five phones. This is because, although 8-32 ranks third in size, it demands many more training epochs to converge (as shown in Figure 13). We also used a Monsoon Power Monitor as the power supply for the smartphone, and tracked both runtime current and voltage to calculate energy consumption (Table 2). In summary, although training a large CNN model (e.g., ResNet-152) is still prohibitively expensive for smartphones, we find the cost of training a lightweight but effective RNN model is very viable in terms of time, energy and network consumption. For applications like mobility prediction, one hour of training on a commodity smartphone can handle data collected over one year, at the cost of 2000 mWh of energy consumption and a few hundred KB of communication overhead. Note that training can be further accelerated using GPUs or FPGAs, and it can also be scheduled to run at night when the charger is plugged in.

Table 2: Time (s) and energy consumption (mWh).

Model arch.   Huawei Mate 9 Pro (Time, Energy)   Huawei Mate 10 (Time, Energy)
4-16          492.6, 276.2                       216.6, 127.3

6 DISCUSSION

Despite promising results yielded by collaborative learning on mobility prediction, there are several practical issues and limitations of this study that warrant further investigation.

Figure 11: Runtime cost of Colla across the top-100 devices, for Customization(1)-(3), Same-FC and Same-Whole. (a) Model size; (b) number of parameters (×1e5); (c) inference FLOPS.

Figure 12: Fine-tuning and inference time on five smartphones (Huawei Mate 9 Pro, Huawei Mate 10, Xiaomi Mi Note 3, Xiaomi 5X, Pixel 3XL); each bar represents one M-RNN architecture. (a) Fine-tuning time (minutes) for Same-Whole, Same-FC and Customization. (b) Inference time (milliseconds); Same-FC and Same-Whole are strictly higher than Customization, but their values are very close in many cases.

Figure 13: Fine-tuning epochs of each M-RNN architecture (4-16, 8-32, 16-64, 32-128) under Same-Whole, Same-FC and Customization.

Beyond location prediction. Colla can naturally be adapted to user behavior predictions other than location changes. Taking app launch prediction as another example, we can replace the trajectory sequence used in M-RNN (§ 2.2) with an app launch sequence and predict the next app to be launched using features such as time, previous app launch instances, battery level and CPU usage, which may have strong correlations with app launch behaviors.
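As a concrete illustration of this adaptation (a sketch, not the exact M-RNN architecture of § 2.2), one can embed the previous app ID, concatenate it with the auxiliary features, and feed the result to a recurrent layer that predicts the next app; all layer sizes and feature choices below are hypothetical.

import torch
import torch.nn as nn

class AppLaunchRNN(nn.Module):
    """Sketch of a multi-feature RNN for next-app prediction."""

    def __init__(self, num_apps, app_emb=16, aux_dim=3, hidden=32):
        super().__init__()
        self.app_emb = nn.Embedding(num_apps, app_emb)   # previous app ID
        self.rnn = nn.GRU(app_emb + aux_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_apps)           # next-app logits

    def forward(self, app_ids, aux):
        # app_ids: (batch, seq) int64; aux: (batch, seq, aux_dim) floats,
        # e.g. hour of day, battery level, CPU usage.
        x = torch.cat([self.app_emb(app_ids), aux], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])  # predict the app following the sequence

# Usage: logits = AppLaunchRNN(num_apps=200)(app_ids, aux); next_app = logits.argmax(-1)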

On an app launch prediction dataset collected from 27 volunteers using the SherLock smartphone agent between January and March 2016 [30], Colla achieved results comparable to those in Table 1: as Table 3 shows, Colla improves median prediction accuracy by 31.8% over CT. The evaluation used the same settings as in Section 5.2, and we constructed the app launch sequences from resource utilization traces sampled every five seconds, as sketched after Table 3.

Table 3: App launch prediction performance.

Method   Best pct.   Avg. acc.   Med. acc.
DT       22.2%       0.52        0.48
CT       22.2%       0.40        0.44
DT-ws     0.0%       0.47        0.46
Colla    55.6%       0.55        0.58
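The sequence construction referenced above amounts to a single pass over the 5-second trace: whenever the foreground app changes, emit a launch event together with the auxiliary features at that moment. The field names below are placeholders; the actual SherLock schema differs in detail.

def build_launch_sequence(trace):
    """trace: iterable of dicts sampled every 5 s, e.g.
    {"ts": 1456790400, "fg_app": "com.yelp", "battery": 0.81, "cpu": 0.12}.
    Returns a list of (timestamp, app, features) launch events."""
    events, prev_app = [], None
    for sample in trace:
        app = sample["fg_app"]
        if app is not None and app != prev_app:
            events.append((sample["ts"], app,
                           {"battery": sample["battery"], "cpu": sample["cpu"]}))
        prev_app = app
    return events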

Learning process. Open questions on the learning process also remain. First, in our design, learning starts from a fixed amount of initial data. There are still insights to be gained on the quality (e.g., diversity) of initial data needed to bootstrap a confident general model. It is also interesting to explore the gains from extra data uploading after bootstrapping. For instance, each newly joining device could push a certain amount of data once, or some edge devices may be willing to upload data constantly but at a very low frequency. Besides, the fixed architecture of the client model may not be suitable for incremental settings. When the capacity of the client model cannot accommodate all the new data, Colla needs to tune the client model architecture.
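One simple way to operationalize the last point, offered as an assumption rather than a mechanism Colla currently implements, is to monitor accuracy on newly arrived data and grow the client model (followed by re-distillation from the cloud model) once accuracy degrades past a threshold.

def maybe_grow_client(model_cfg, val_acc_history, drop_threshold=0.05, growth=2):
    """Hypothetical capacity check: if accuracy on recent data has dropped by more
    than drop_threshold from its best value, suggest a larger hidden size; the new
    model would then be re-distilled from the cloud model."""
    if not val_acc_history:
        return model_cfg
    if max(val_acc_history) - val_acc_history[-1] > drop_threshold:
        model_cfg = dict(model_cfg, hidden_size=model_cfg["hidden_size"] * growth)
    return model_cfg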

Selectively using device models for the cloud model update could not only give the cloud model better generalization ability, but also prevent the cloud model from being corrupted by adversarial input. As shown in Figure 14, our preliminary results from filtering out the 30% of device-uploaded models with the poorest inference accuracy on the cloud dataset show an end-to-end median accuracy increase of 3.46%. Extending Colla to CNNs is also an interesting subject to pursue.
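A minimal sketch of this filtering step, assuming the cloud holds a small held-out dataset and a standard accuracy metric (the evaluate() helper here is hypothetical), keeps only the best-scoring fraction of uploaded models before the cloud-side aggregation/distillation:

def select_device_models(models, cloud_data, evaluate, keep_frac=0.7):
    """Score each uploaded client model on the cloud's held-out data and keep
    the top keep_frac (e.g., 70%, i.e., drop the worst 30%).
    evaluate(model, data) -> accuracy is assumed to be provided."""
    scored = sorted(models, key=lambda m: evaluate(m, cloud_data), reverse=True)
    keep = max(1, int(len(scored) * keep_frac))
    return scored[:keep]  # the survivors feed the cloud-side aggregation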

Figure 14: Location prediction accuracy gap between using the Top-70 and the Top-100 device models, per device.

With growing concern about data privacy, it is also worth investigating how Colla can be designed in a privacy-preserving way. This is challenging given that prior work [31] has shown that even the most innocuous aggregates, including the parameters of ML models, can reveal information about the training sets.

7 RELATED WORK

Distributed machine learning has been a hot topic in the machine learning community due to the emergence of big data and big models. Model averaging [32-35] is a simple yet efficient technique that iteratively averages locally trained models and performs parameter synchronization. Allreduce-based communication [36] is also used for gradient exchange in the data-parallelism setting, both for single-machine multi-GPU and multi-machine multi-GPU training. A more flexible architecture for distributed machine learning is the parameter server [37-39], which uses dedicated servers to synchronize model parameters and implement other computation logic related to the optimization algorithms. However, most previous work on distributed machine learning treats each working node as a computing machine that receives i.i.d. data and model parameters and generates gradients or updated parameters for synchronization. An endpoint in this work is not simply a computing device: it also consumes the trained model itself, with locally generated non-i.i.d. training data. More recently, federated learning [19, 40, 41] attempts to leverage locally collected data to train a global model. However, it fails to take device diversity into consideration, and no model customization is allowed.

Knowledge distillation was first proposed to transfer the knowledge from a cumbersome model (or an ensemble of models) to a single small model more suitable for deployment [15]. It has since been used in a wide range of tasks such as image classification [16], neural machine translation [22, 23] and speech recognition [16, 24]. The typical setting of knowledge distillation transfers knowledge from a teacher model to a student model, while there have been studies transferring knowledge across all the models in a collaborative way, in order to boost the task performance of all participants [25, 26]. Unlike CoDistillation [26], which uses the same dataset to train all the models, or Mutual Learning [25], which advocates an ensemble of students learning collaboratively and teaching each other, dual distillation in Colla features a completely different network where N different client models distill knowledge from their own data and share it with each other with the help of the cloud.

Researchers have been exploring various approaches to enable deep learning on mobile and edge devices that have limited computing power and energy budgets. These efforts include building smaller models without sacrificing too much accuracy [42-44], leveraging or building customized hardware for fast learning [45-48], model compression to reduce resource consumption [49-51], and system optimization to achieve a better resource-accuracy tradeoff [52-54]. For example, DeepEar [42] proposes a special model for audio sensing on smartphones in unconstrained acoustic environments. DianNao [45] designs a dedicated ASIC to accelerate ubiquitous machine learning. NestDNN [55] designs a dynamic framework to choose the most suitable model when the resources available to an application change. DeepX [49] leverages Runtime Layer Compression (RLC) and Deep Architecture Decomposition (DAD) to reduce resource usage. These learning practices on mobile edges all focus on individual devices and thus are complementary to our collaborative learning.

User behavior prediction has been studied for decades. It has been shown that the potential average predictability of, for example, human mobility can be as high as 93% [9, 56, 57]. Various methods have been proposed to profile user behaviors and make predictions on whereabouts, including Markov models [7, 8, 58-60], neural networks [12, 61], Bayesian networks [62], random forests [63], eigendecomposition [64], etc. Markov models, as well as their variations, model the probability of future movements by building a transition matrix between several locations based on past trajectories. Given its success in speech and NLP, the RNN has also been proposed for mobility prediction. For instance, the Spatial Temporal Recurrent Neural Network (ST-RNN) is designed to model temporal and spatial contexts [4]. Nevertheless, it only applies to continuous spatial prediction and assumes distances between location points are known. More recently, DeepSense, a unified deep learning framework for mobile sensing data, was proposed by integrating convolutional and recurrent neural networks [65]. However, it focuses on accommodating diverse sensor noise patterns with a model trained remotely from uniformly sampled data. Different from existing work, we demonstrate the advantages of applying a more general learning framework that combines intelligence from both the cloud and edges to the location prediction problem.

8 CONCLUSION

We propose Colla, a learning framework designed for user behavior prediction that enables end devices and the cloud to learn from sequence data in a collaborative way. A cloud-client collaboration mechanism is carefully designed to make the learning approach flexible and scalable. In particular, we propose a novel dual distillation method with model compression and model grouping to empower the cloud to aggregate the knowledge learned from end devices into a global cloud model, and to enable each device to distill knowledge from the cloud model and build a customized client model. We demonstrate the feasibility of the framework on mobility prediction using a multi-feature RNN. Experimental results on Azure and commodity smartphones with large-scale real data show that Colla builds effective device models in terms of both prediction accuracy and execution cost.


REFERENCES

[1] P. G. Om Prakash and A. Jaya. Analyzing and Predicting User Behavior Pattern from Weblogs. International Journal of Applied Engineering Research, 11(0973-4562):6278–6283, 2016.
[2] Zhe Zeng and Matthias Roetting. A text entry interface using smooth pursuit movements and language model. In ACM Symposium on Eye Tracking Research & Applications, 2018.
[3] Chang Tan, Qi Liu, Enhong Chen, and Hui Xiong. Prediction for Mobile Application Usage Patterns. In Nokia MDC Workshop, 2012.
[4] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. Predicting the Next Location: A Recurrent Model with Spatial and Temporal Contexts. In AAAI, 2016.
[5] Joao Bartolo Gomes, Clifton Phua, and Shonali Krishnaswamy. Where Will You Go? Mobile Data Mining for Next Place Prediction. Lecture Notes in Computer Science, 8057:146–158, 2013.
[6] Ivana Semanjski and Sidharta Gautama. Smart City Mobility Application - Gradient Boosting Trees for Mobility Prediction and Analysis Based on Crowdsourced Data. Sensors, 15:15974–15987, 2015.
[7] Akinori Asahara, Kishiko Maruyama, Akiko Sato, and Kouichi Seto. Pedestrian-movement Prediction Based on Mixed Markov-chain Model. In ACM SIGSPATIAL, 2011.
[8] Sébastien Gambs, Marc-Olivier Killijian, and Miguel Núñez del Prado Cortez. Next Place Prediction Using Mobility Markov Chains. In Proceedings of the First Workshop on Measurement, Privacy, and Mobility, 2012.
[9] C. Song, Z. Qu, N. Blumm, and A.-L. Barabasi. Limits of Predictability in Human Mobility. Science, 327(5968):1018–1021, 2010.
[10] Fatma Somaa, Cédric Adjih, Inès El Korbi, and Leila Azouz Saidane. A Bayesian model for mobility prediction in wireless sensor networks. In International Conference on Performance Evaluation and Modeling in Wired and Wireless Networks, 2016.
[11] Binh T. Nguyen, Nhan V. Nguyen, Nam T. Nguyen, and My Huynh T. Tran. A Potential Approach for Mobility Prediction using GPS Data. In ICIST, 2017.
[12] Z. Lin, M. Yin, S. Feygin, M. Sheehan, and J.-F. Paiement. Deep Generative Models of Urban Mobility. In ACM KDD, 2017.
[13] Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. Predicting the Next Location: A Recurrent Model with Spatial and Temporal Contexts. In AAAI, 2016.
[14] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[15] Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In ACM KDD, 2006.
[16] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv, 2013.
[19] Jakub Konecný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. CoRR, abs/1610.05492, 2016.
[20] Renchu Guan, Xiaohu Shi, Maurizio Marchese, Chen Yang, and Yanchun Liang. Text Clustering with Seeds Affinity Propagation. IEEE Transactions on Knowledge and Data Engineering, 23(4):627–637, 2011.
[21] Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
[22] Yoon Kim and Alexander M. Rush. Sequence-Level Knowledge Distillation. In EMNLP, 2016.
[23] Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. Ensemble distillation for neural machine translation. CoRR, abs/1702.01802, 2017.
[24] Liang Lu, Michelle Guo, and Steve Renals. Knowledge distillation for small-footprint highway networks. In ICASSP, pages 4820–4824. IEEE, 2017.
[25] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep Mutual Learning. In IEEE CVPR, 2017.
[26] Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormándi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. CoRR, abs/1804.03235, 2018.
[27] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. Deep Mutual Learning. In IEEE CVPR, 2018.
[28] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born-Again Neural Networks. In ICML, 2018.
[29] Marina Sokolova and Guy Lapalme. A Systematic Analysis of Performance Measures for Classification Tasks. Information Processing and Management, 45(4):427–437, 2009.
[30] Yisroel Mirsky, Asaf Shabtai, Lior Rokach, Bracha Shapira, and Yuval Elovici. SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research. In AISec (with CCS), 2016.
[31] Reza Shokri, Marco Stronati, and Vitaly Shmatikov. Membership Inference Attacks against Machine Learning Models. In IEEE Symposium on Security and Privacy, 2017.
[32] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J. Smola. Parallelized Stochastic Gradient Descent. In NIPS, 2010.
[33] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed Training Strategies for the Structured Perceptron. In ACL, 2010.
[34] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel Training of Deep Neural Networks with Natural Gradient and Parameter Averaging. arXiv, 2014.
[35] Sixin Zhang, Anna E. Choromanska, and Yann LeCun. Deep Learning with Elastic Averaging SGD. In NIPS, 2015.
[36] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv, 2017.
[37] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, et al. Large Scale Distributed Deep Networks. In NIPS, 2012.
[38] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling Distributed Machine Learning with the Parameter Server. In USENIX OSDI, 2014.
[39] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A New Platform for Distributed Machine Learning on Big Data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
[40] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.
[41] Jakub Konecný, Brendan McMahan, and Daniel Ramage. Federated Optimization: Distributed Optimization Beyond the Datacenter. CoRR, abs/1511.03575, 2015.
[42] Nicholas D. Lane, Petko Georgiev, and Lorena Qendro. DeepEar: Robust Smartphone Audio Sensing in Unconstrained Acoustic Environments Using Deep Learning. In ACM UbiComp, 2015.
[43] Guoguo Chen, Carolina Parada, and Georg Heigold. Small-footprint Keyword Spotting Using Deep Neural Networks. In IEEE ICASSP, 2014.
[44] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.
[45] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine Learning. In ACM ASPLOS, 2014.
[46] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In ACM/SIGDA FPGA, 2015.
[47] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. In ACM/IEEE ISCA, 2016.
[48] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In ACM/IEEE ISCA, 2016.
[49] Nicholas D. Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. DeepX: A Software Accelerator for Low-power Deep Learning Inference on Mobile Devices. In ACM/IEEE IPSN, 2016.
[50] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized Convolutional Neural Networks for Mobile Devices. In IEEE CVPR, 2016.
[51] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. In NIPS, 2014.
[52] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. Chameleon: Video Analytics at Scale via Adaptive Configurations and Cross-camera Correlations. In ACM SIGCOMM, 2018.
[53] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints. In ACM MobiSys, 2016.
[54] Ganesh Ananthanarayanan, Victor Bahl, Landon Cox, Alex Crown, Shadi Nogbahi, and Yuanchao Shu. Demo: Video Analytics - Killer App for Edge Computing. In ACM MobiSys, 2019.
[55] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-Aware Multi-Tenant On-Device Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom), 2018.
[56] Xin Lu, Erik Wetter, Nita Bharti, Andrew J. Tatem, and Linus Bengtsson. Approaching the Limit of Predictability in Human Mobility. Scientific Reports, 3:1–9, 2013.
[57] Xin Lu, Linus Bengtsson, and Petter Holme. Predictability of Population Displacement after the 2010 Haiti Earthquake. Proceedings of the National Academy of Sciences, 109(29):11576–11581, 2012.
[58] George Liu and Gerald Maguire. A Class of Mobile Motion Prediction Algorithms for Wireless Mobile Computing and Communications. Mobile Networks and Applications, 1(2):113–121, 1996.
[59] Yu Zheng, Quannan Li, Yukun Chen, Xing Xie, and Wei-Ying Ma. Understanding Mobility Based on GPS Data. In ACM UbiComp, 2008.
[60] L. Song, D. Kotz, R. Jain, and Xiaoning He. Evaluating Location Predictors with Extensive Wi-Fi Mobility Data. In IEEE INFOCOM, 2004.
[61] Shiang-Chun Liou and Hsuan-Chia Lu. Applied Neural Network for Location Prediction and Resources Reservation Scheme in Wireless Networks. In International Conference on Communication Technology Proceedings, 2003.
[62] Sherif Akoush and Ahmed Sameh. Mobile User Movement Prediction Using Bayesian Learning for Neural Networks. In International Conference on Wireless Communications and Mobile Computing, 2007.
[63] Zidong Yang, Ji Hu, Yuanchao Shu, Peng Cheng, Jiming Chen, and Thomas Moscibroda. Mobility Modeling and Prediction in Bike-Sharing Systems. In ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
[64] Nathan Eagle and Alex Sandy Pentland. Eigenbehaviors: Identifying Structure in Routine. Behavioral Ecology and Sociobiology, 63(7):1057–1066, 2009.
[65] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing. In WWW, 2017.