
Pushing Analytics to the Edge

Goutham Kamath, Pavan Agnihotri, Maria Valero∗, Krishanu Sarker, Wen-Zhan Song∗

Department of Computer Science, Georgia State University
∗College of Engineering, University of Georgia

{gkamath1,pagnihotri2,ksarker1}@student.gsu.edu, {maria.valero,wsong}@uga.edu

Abstract—Edge or Fog computing is emerging as a new computing paradigm where data processing, networking, storage and analytics are performed closer to the devices (IoT) and applications. The edge of a network plays an important role in the IoT system: it is an optimal site for off-loading bandwidth-hungry IoT data. In order to generate business value out of the large volume of data on the edge, we need decentralized machine learning (ML) algorithms. These algorithms must enable edge devices to communicate autonomously and deliver information seamlessly to the decision makers. In this paper, we present EdgeSGD, a decentralized stochastic gradient descent method to solve a large linear regression problem on the edge network. The solution is applied to a seismic imaging use-case and evaluated on an edge computing testbed. The proposed algorithm avoids sending raw data to the cloud, and offers faster and balanced computation. We compare the algorithm with existing methods such as MapReduce, DGD and EXTRA. The results show that EdgeSGD converges faster to the optimal value and is robust to node/link failure. Finally, we test the algorithm using a real-world seismic data trace from Parkfield, CA and show that our proposed algorithm can illuminate the underlying fault region of the San Andreas.

Index Terms—Edge Computing, Decentralized Machine Learning, Linear Regression, Online Learning, Internet of Things.

I. INTRODUCTION

In the era of the Internet of Things (IoT), high-data-rate sensors such as video cameras and cell phones are becoming ubiquitous. Today, most high-volume data obtained from these sensors is stored close to the point of capture, and only a small fraction is transferred to the cloud. In the future, sending all the data from billions of IoT devices to the cloud could easily overwhelm the existing infrastructure. To overcome these issues, a new paradigm called Edge or Fog computing is emerging [14]. This paradigm brings data processing, networking, storage and analytics closer to the devices (IoT) and applications. Edge computing aims at placing small data centers, called cloudlets, at the edge of the Internet, in close proximity to mobile devices, sensors and end users. The new paradigm counters the theme of consolidation and massive data centers, and opens the door for further research in decentralized computing and analytics.

The edge of a network plays an important role in the IoT system. It is an optimal site for off-loading bandwidth-hungry sensor data. To generate business value out of the large volume of data on the edge, we need decentralized machine learning (ML) algorithms. These algorithms must enable edge devices to communicate autonomously and deliver information seamlessly to the decision makers. Unlike the cloud infrastructure, an edge network exhibits the following properties: i) heterogeneous hardware, ii) an unreliable low-bandwidth communication network, and iii) limited on-board memory and processing power. Moreover, an edge analytics algorithm should not rely on any central coordinator and must be fault-tolerant, as node/link failures are a common occurrence. Existing distributed machine learning platforms such as MapReduce [18] are not fault-tolerant and require a coordinator node to perform the reduce operation.

In this paper, we present EdgeSGD, a decentralized stochastic gradient descent algorithm suitable for machine learning and analytics on an edge network. This method can be applied to a wide range of problems arising in decentralized machine learning [1]. As a case study, we solve a fundamental machine learning problem of linear regression, where the objective is to estimate the feature vector θ ∈ ℝ^n using m training examples {(x_1, y_1), (x_2, y_2), · · · , (x_m, y_m)}, given by

    min_{θ ∈ ℝ^n}  (1/2m) ∑_{i=1}^{m} ‖y_i − x_i^⊤ θ‖_2^2 + λ‖θ‖_2^2,    (1)

where λ is the regularization parameter.
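For concreteness, objective (1) and its gradient can be sketched in a few lines of Python. This is a minimal illustration; the helper names ridge_loss and ridge_grad are ours, not part of any library.

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    """Objective (1): mean squared residual plus L2 regularization."""
    m = X.shape[0]
    r = y - X @ theta
    return r @ r / (2 * m) + lam * theta @ theta

def ridge_grad(theta, X, y, lam):
    """Gradient of objective (1) with respect to theta."""
    m = X.shape[0]
    return -X.T @ (y - X @ theta) / m + 2 * lam * theta
```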

As a real-world use-case, we apply EdgeSGD to predict subsurface seismic anomalies via real-time imaging [6]. To the best of our knowledge, we are the first to explore the potential of such algorithms in an edge computing framework. The key contributions of this paper are:

• We present a novel decentralized stochastic gradient descent algorithm for solving the linear regression problem on edge nodes.

• The proposed algorithm is applied to learning/predicting seismic anomalies via real-time imaging.

• We evaluate the algorithm on an edge computing testbed, using both synthetic and real seismic datasets.

• We compare the proposed solution with other existing decentralized methods such as MapReduce, DGD and EXTRA, and examine in particular the effects of node/link failure and communication cost.

The rest of the paper is structured as follows. In Section II, we provide the seismic imaging background and the problem formulation. In Section III, we describe previous work related to decentralized machine learning. We provide a detailed design of EdgeSGD in Section IV. In Section V, we evaluate the performance of our algorithm and provide experimental results. We conclude the paper in Section VI.

II. PRELIMINARIES

A. Seismic Imaging Background

The current subsurface imaging technology used to visualize oil/gas reservoirs or fault/magma movement lacks the capability of obtaining real-time information. The principle of seismic imaging consists of three main steps, as illustrated in Fig. 1(a): i) Seismic sensors (green triangles on the surface) measure the vibration after the occurrence of an earthquake, calculate the arrival time of the P-wave [8] and obtain the measurement y. ii) Next, the earthquake origin location and time are estimated using prior estimates. iii) Lastly, rays are traced from the earthquake to node i (blue rays) to form the data set x_i. The feature vector θ represents the image, where the components {θ_1, · · · , θ_n} represent the values of n pixels.

Today, thousands of dense arrays of seismic sensors (geophones) are deployed to monitor subsurface anomalies such as earthquakes. These geophones passively listen to the ground vibration and record it to local storage [5]. The acoustic signals are sampled at 16-24 bit and 100-500 Hz, which accumulates to a few GB/day per node. Transferring such a large volume of data to a central repository (cloud) using low-power radio is often infeasible due to energy and bandwidth limitations [14]. For this reason, we propose an edge computing framework to perform real-time seismic imaging. Here the edge nodes form a layer between the geophones and the cloud (Fig. 1(b)). Several geophones are connected to each edge node, and since the edge nodes are located close to the sensors, seismic signals are transferred to them in real time. The edge nodes form a mesh network, and our goal is to design a decentralized algorithm to obtain θ (the image) by collaboratively optimizing the objective function Eq. (1) over the edge network.

Fig. 1. Illustrating the process of real-time seismic imaging on an edge network. (a) Principle of travel-time seismic imaging: surface seismic stations record waves from an earthquake at depth; δt = observed − predicted travel time, with δt > 0 near a low-Vp anomaly. (b) Edge computing architecture for seismic imaging: sensors, edge nodes and the cloud.

B. Problem Formulation

In this paper, we consider N edge nodes connected to form an arbitrary topology given by an undirected graph G(V, E), with node set V = {1, · · · , N} and edge set E containing the links in the network. We have {i, j} ∈ E if node i and node j can communicate with each other. N_i denotes the neighbor set of node i. Each edge node i is connected to one or more seismic sensors and has m_i training samples {(x_1, y_1), (x_2, y_2), · · · , (x_{m_i}, y_{m_i})}, where ∑_{i=1}^{N} m_i = m. Using Eq. (1), we can form a local loss function at node i,

    f_i(θ) = (1/2m_i) ∑_{j=1}^{m_i} ‖y_j − x_j^⊤ θ‖_2^2 + λ‖θ‖_2^2.

Combining all nodes into a single equation, we get the following optimization problem:

    min_{θ ∈ ℝ^n}  (1/2N) ∑_{i=1}^{N} ‖Y_i − X_i θ‖_2^2 + λ‖θ‖_2^2,    (2)

where X_i ∈ ℝ^{m_i×n} and Y_i ∈ ℝ^{m_i} denote the m_i training rows at node i. Here, the regularization parameter λ is a positive number that controls the weight between ‖Y_i − X_i θ‖_2^2 (the goodness-of-fit measure) and ‖θ‖_2^2 (the regularity measure). We denote by θ^k the feature vector estimated at iteration k. The goal now is to develop an algorithm that solves the regularized regression problem Eq. (2) in a decentralized and asynchronous way on the edge network.
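To make the decentralized setting concrete, the following minimal Python sketch (ours; the class name EdgeNode is illustrative) shows how each node might hold its local block (X_i, Y_i) and evaluate its local loss f_i.

```python
import numpy as np

class EdgeNode:
    """One edge node: local data block (X_i, Y_i) and neighbor set N_i."""
    def __init__(self, X, y, neighbors, lam=1e-3):
        self.X, self.y = X, y        # m_i x n block and m_i measurements
        self.neighbors = neighbors   # ids of the nodes it can talk to
        self.lam = lam
        self.theta = np.zeros(X.shape[1])

    def local_loss(self, theta):
        """f_i(theta) = (1/2 m_i) ||y - X theta||^2 + lam ||theta||^2."""
        m_i = self.X.shape[0]
        r = self.y - self.X @ theta
        return r @ r / (2 * m_i) + self.lam * theta @ theta
```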

III. RELATED WORK

Distributed machine learning algorithms can be classified based on 1) shared memory and 2) distributed memory architectures. In a shared memory architecture, the data resides in a location accessible to all the compute nodes, e.g., GPU computing. Some of the recent ML algorithms for such architectures are [9], [11]; these algorithms perform distributed stochastic gradient descent using several processors. Since they rely on compute nodes having access to shared memory, ML algorithms designed for such architectures are not suitable for edge analytics.

In a distributed memory architecture, each compute node assumes its data is available locally. We can further classify this into client-server and peer-to-peer (P2P) architectures. MapReduce is one of the most popular machine learning frameworks for the client-server architecture. The authors of [18] describe a distributed stochastic gradient descent method suitable for large multi-core platforms. This method incorporates the online learning principle of [10]. These methods scale linearly with the data size and log-linearly with the number of compute nodes. In MapReduce, although the data and computation are distributed, the reduce operation takes place at a central coordinator, and our experiments show that these schemes are not suitable for an edge network.

In recent years, several P2P-based optimization methods have been developed for decentralized machine learning [4], [12], [16]. In these methods, compute nodes do not require any central coordinator and perform the optimization by communicating only with their immediate neighbors [2], [3], [7]. Some of the recent advances include decentralized gradient descent (DGD) [16] and EXTRA [12]. Methods that keep both computation and communication local are well suited for the edge computing platform. In this paper, we combine the benefits of the online learning model [18] with the asynchronous averaging strategy proposed by [13] to design a robust edge analytics algorithm.

IV. ALGORITHM DESIGN

To design a machine learning algorithm for a specific application, we assume that the feature vectors are already known and that the incoming data is normalized. However, the other two important characteristics of big data, i.e., volume and velocity, remain an issue when dealing with a specific application. For instance, in seismic imaging, the row size m_i of the samples {x_i, y_i} at node i increases with each occurrence of a vibration. Therefore, the computation algorithm should be iterative in nature, processing a single row or multiple rows at a time. Since the computation is carried out on the edge, the algorithm should also be lightweight and robust. Optimization algorithms suitable for these scenarios are based on the method of online learning [1], and Stochastic Gradient Descent (SGD) is the most prominent among them.

A. Stochastic Gradient Descent

Unlike Conjugate Gradient and batch gradient descent, SGD does not require knowledge of the complete training set. The method performs a random projection onto the available training samples (hyperplanes) until convergence. It starts with an arbitrary initial vector θ^0 and, at every iteration k, randomly selects a row i(k) ∈ {1, · · · , m} of the training set. It then uses the sampled training example to calculate the gradient of the local loss function, i.e., ∂f_i(x_i, y_i). The parameter θ^k is updated by moving a small step size α along the negative gradient, as shown in Algorithm 1.

Algorithm 1 Stochastic Gradient Descent (SGD)
1: Initialize: α, iterations T, θ^0 ← 0
2: for k ← 0 until convergence or maximum iteration T do
3:   draw i ∈ {1, · · · , m}
4:   update θ^{k+1} = θ^k − α ∂f_i(x_i, y_i)
5: end
6: return θ
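A minimal runnable Python version of Algorithm 1 follows, assuming the regularized loss of Eq. (1); the function name sgd and its default λ are ours for illustration.

```python
import numpy as np

def sgd(X, y, T, alpha, theta0, lam=1e-3, rng=None):
    """Algorithm 1: plain SGD on the regularized regression loss."""
    rng = rng or np.random.default_rng()
    theta = theta0.copy()
    for _ in range(T):
        i = rng.integers(X.shape[0])  # draw a random training row
        grad = -(y[i] - X[i] @ theta) * X[i] + 2 * lam * theta
        theta -= alpha * grad         # step along the negative gradient
    return theta
```

Each step touches a single row of X, which is what makes the method attractive when m_i grows with every new seismic event.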

B. MapReduce SGD

Due to its inherently sequential nature, the SGD algorithm does not scale very well and is hard to parallelize [11]. With the influx of big data, there has been renewed interest in large-scale machine learning methods [18], especially for MapReduce and, more recently, Spark architectures. The main idea behind these frameworks is to partition the data and distribute it onto the compute nodes (MAP), followed by joining or aggregating the partial results from each node (REDUCE). This process is repeated until convergence. Algorithm 2 gives a brief description of the MapReduce scheme; its application to the seismic imaging problem can be seen in [6].

Algorithm 2 MapReduce SGD
1: Initialize: α, iterations T, nodes N, θ^0 ← 0
2: Map data to each node
3: while not converged do
4:   for all i ∈ {1, · · · , N} in parallel do
5:     θ_i^k = SGD(x_i, y_i, T, α, θ^k)
6:   end
7:   θ^{k+1} = (1/N) ∑_{i=1}^{N} θ_i^k
8: end

Line 5 of Algorithm 2 essentially performs multiple rounds of SGD in parallel on node i ∈ {1, · · · , N} using the initial parameter θ^k obtained from the k-th iteration. Line 7 denotes the REDUCE operation, which computes θ^{k+1} by averaging the partial solutions from all the nodes. The reduce operation requires synchronization and aggregation. The communication cost of this method is of the order 2kNn, where k, N and n represent the number of iterations, the number of nodes and the size of θ, respectively. These methods are suitable for GPU, multi-core or cloud-based architectures, which can facilitate a scatter-gather operation via a coordinator node. Data collection and dissemination on a large-scale mesh network can become challenging and can limit the system's scalability. Our previous research [6] shows that applying the MapReduce scheme on an edge network can create a bottleneck near the coordinator node and also degrades performance as the number of nodes grows. The multihop communication increases the communication overhead and packet loss, and slows the convergence. These practical limitations motivated us to develop another class of algorithms to solve Eq. (2) that limit communication to the neighbors only.
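To make the coordinator bottleneck concrete, here is a serial Python simulation of Algorithm 2 (ours, reusing the sgd sketch above); blocks is a hypothetical list of per-node (X_i, y_i) pairs. The REDUCE line is exactly the global synchronization point that the next section removes.

```python
import numpy as np

def mapreduce_sgd(blocks, T, alpha, n, rounds=100, tol=1e-6):
    """Algorithm 2: local SGD on every node (MAP), coordinator average (REDUCE)."""
    theta = np.zeros(n)
    for _ in range(rounds):
        # MAP: each node runs T steps of SGD from the shared iterate
        partials = [sgd(X, y, T, alpha, theta) for X, y in blocks]
        # REDUCE: the coordinator averages the partial solutions
        new_theta = np.mean(partials, axis=0)
        if np.linalg.norm(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta
```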

C. Edge Stochastic Gradient Descent

To overcome the drawbacks of MapReduce, we introduce a novel algorithm called EdgeSGD, which is completely decentralized (no coordinator) and does not require synchronization. Unlike the single-coordinator-based reduce strategy [18] used in MapReduce, we propose a decentralized reduce operation based on Gossip Averaging [2]. Gossip methods are emerging as a new communication paradigm for large-scale distributed systems [7]. Some of the features that make gossip methods attractive are: 1) the absence of a central entity or coordinator node, 2) high fault tolerance and robustness, 3) self-healing or error-recovery mechanisms [15], 4) efficient message exchange due to neighbor-only communication, and 5) provision for asynchronous communication. These characteristics make them suitable for edge nodes carrying out decentralized computation.

In the gossip reduce, when a node i activates at the k-th iteration, the following sequence of events occurs: i) Node i sends its current parameter estimate θ_i^k to a neighboring node j. ii) Node j receives θ_i^k and updates its own estimate as θ_j^{k+1} = βθ_i^k + (1−β)θ_j^k, where β ∈ (0, 1). iii) Node j sends θ_j^{k+1} back to i, which sets θ_i^{k+1} = θ_j^{k+1}. To avoid a race condition, we must ensure that no two nodes update a neighbor's value at the same time. Notice that the gossip reduce operation averages only with a neighbor. Next, we show how this operation is used to design the EdgeSGD algorithm.
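As a concrete illustration of this exchange, a minimal sketch (ours, continuing the illustrative EdgeNode class from Section II-B); with β = 0.5 both nodes end up holding the pairwise average.

```python
def gossip_reduce(node_i, node_j, beta=0.5):
    """One gossip exchange: both nodes agree on a weighted average."""
    merged = beta * node_i.theta + (1 - beta) * node_j.theta
    node_i.theta = merged.copy()  # node j sends the merged value back to i
    node_j.theta = merged.copy()
    return merged
```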

We assume that node i contains m_i training samples {(x_1, y_1), (x_2, y_2), · · · , (x_{m_i}, y_{m_i})}. The row size increases with the occurrence of events (seismic vibrations), and new samples are appended to the existing set without discarding the previous data. Since the algorithm is decentralized and asynchronous, we allow nodes to wake up at any time. In practice, we can control the wake-up cycle based on the events or the available energy. Once a node i wakes up, it performs a gossip reduce with one of its neighbors j, given by θ_{ij}^k = βθ_i^k + (1−β)θ_j^k. Next, if ‖θ_i^k − θ_{ij}^k‖ > ε, where ε is the update threshold, we invoke SGD locally at node i. A similar step happens at node j as well. Parallel updates can happen independently between any other pairs in the network. In case other nodes try to communicate with a pair, say i and j, while they are in the middle of their update phase, the received packet is either buffered or dropped.

Algorithm 3 Edge Stochastic Gradient Descent
Initialize:
1: Node ID id, ε, α, β, iterations T, total nodes N.
2: Node i ∈ V has m_i samples.
3: θ_ℓ ← previously recorded estimates.

1: Upon the detection of an event
2: Node receives samples x, y.
3: k ← 0, θ^k ← θ_ℓ
4: while not converged do
5:   Node i ∈ V wakes up uniformly at random.
6:   Node i now selects a node j ∈ N_i.
7:   At node i:
8:     θ_{ij}^k ← βθ_i^k + (1−β)θ_j^k
9:     if ‖θ_i^k − θ_{ij}^k‖ > ε then
10:      θ_i^{k+1} = SGD(x_i, y_i, T, α, θ_{ij}^k)
11:    end if
12:  At node j:
13:    θ_{ij}^k ← βθ_i^k + (1−β)θ_j^k
14:    if ‖θ_j^k − θ_{ij}^k‖ > ε then
15:      θ_j^{k+1} = SGD(x_j, y_j, T, α, θ_{ij}^k)
16:    end if
17: end while
18: TERMINATE

Algorithm 3 outlines the EdgeSGD algorithm. In Lines 10 and 15, SGD randomly picks a few samples from the local training set to obtain the local partial estimates. These partial updates are merged using the gossip reduce given in Lines 8 and 13. This process is repeated until convergence. The algorithm adapts to newly arriving updates, and the random sample size can always be adjusted to the internal memory of the edge device. Algorithm 3 shows the update process only in terms of nodes i and j; similar updates happen in parallel between other node pairs in the network. The proposed EdgeSGD is adaptable, robust and fault-tolerant. We evaluate these characteristics in the next section.
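For completeness, an illustrative serial simulation of Algorithm 3 over a list of EdgeNode objects is sketched below (ours; a real deployment would run the loop body asynchronously on each device, and neighbor ids are assumed to index into the nodes list).

```python
import numpy as np

def edge_sgd(nodes, T, alpha, beta=0.5, eps=1e-6, wakeups=200, rng=None):
    """Algorithm 3 (simulated serially): gossip reduce + local SGD per wake-up."""
    rng = rng or np.random.default_rng()
    for _ in range(wakeups):
        node_i = nodes[rng.integers(len(nodes))]      # a node wakes up
        node_j = nodes[rng.choice(node_i.neighbors)]  # it picks a neighbor
        merged = beta * node_i.theta + (1 - beta) * node_j.theta
        for node in (node_i, node_j):                 # both ends update
            if np.linalg.norm(node.theta - merged) > eps:
                node.theta = sgd(node.X, node.y, T, alpha, merged, node.lam)
            else:
                node.theta = merged.copy()
    return np.mean([n.theta for n in nodes], axis=0)
```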

V. EVALUATION

In this section, we evaluate the EdgeSGD algorithm and provide empirical results using both synthetic and real seismic data sets. We carry out our evaluation on an edge computing testbed consisting of 16 BeagleBone nodes (Fig. 2(a)). We compare our results with the MapReduce algorithm [18], DGD [16] and EXTRA [12]. Our results indicate that the proposed method is faster in terms of execution time, is more robust, and has a balanced communication cost. We also observe that MapReduce takes fewer iterations than EdgeSGD, but takes longer per iteration due to synchronization overhead.

Fig. 2. (a) Testbed consisting of 16 BeagleBone Black nodes (inset) connected to each other, forming a cluster. (b) Network benchmarking of the cluster using iPerf (throughput in Mbits/s per node). The plot shows a significant difference in network throughput across the nodes, even with identical hardware.

A. Edge Analytics Testbed

To evaluate the edge analytics algorithms, we developed an experimental testbed consisting of 16 BeagleBone Black (BBB) boards, as shown in Fig. 2(a). These computing units are interconnected via a switch, forming a cluster. Each BBB runs a Linux-based OS and has 512 MB DDR3 RAM, 16 GB flash storage and a 1 GHz ARM Cortex-A8 processor. They are credit-card-sized, low-power computing units that cost under $50. We benchmark the cluster throughput (Mbits/sec) using the iPerf tool. Fig. 2(b) shows a significant difference in throughput across the nodes. This variation in network performance exhibits the heterogeneity and unreliability of IoT and edge nodes. These characteristics make the BBB cluster a suitable platform to evaluate edge analytics algorithms.

B. Synthetic Dataset

Typically, a synthetic model is used to test seismic imaging algorithms, mainly because: 1) the optimal solution is usually unknown for real data, 2) only a few very large datasets are publicly available, and 3) we need a collection of datasets with varying parameters such as dimensionality n, size m and number of clusters k in order to evaluate scalability. To evaluate our algorithm, we use a synthetic magma model (Fig. 2(c)) covering an area of 16×16×16 km. The velocities of the seismic waves inside and outside the magma area are 0.9V and V respectively, where V = 4.5 km/s is a typical P-wave velocity. We simulate m_i = 5000 earthquakes (training set) at random locations within the observed area. Based on these settings, the data matrix at node i is X_i ∈ ℝ^{5000×4096} with measurements Y_i ∈ ℝ^{5000}. We perform T = 10 local SGD iterations to solve f_i(θ) = (1/2m_i) ∑_{j=1}^{m_i} ‖y_j − x_j^⊤ θ‖_2^2 + λ‖θ‖_2^2. We choose the following parameters throughout the experiments: β = 0.5, α = 0.8 and ε = 10^{−6}. For asynchronous algorithms, epochs are used instead of iterations; one epoch in EdgeSGD corresponds to every node waking up at least once. We use the relative update φ = ‖θ^{(k+1)} − θ^{(k)}‖/‖θ^{(k)}‖ < ε between two epochs to terminate each node.
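For reference, a minimal sketch (ours) of a stand-in synthetic regression set and the relative-update stopping rule; the actual experiments derive X and y from ray tracing rather than from random draws.

```python
import numpy as np

def make_synthetic(m=5000, n=4096, noise=0.01, rng=None):
    """Hypothetical stand-in for the ray-traced data: y = X theta* + noise."""
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((m, n))
    theta_star = rng.standard_normal(n)  # the 'true image' to recover
    y = X @ theta_star + noise * rng.standard_normal(m)
    return X, y, theta_star

def converged(theta_new, theta_old, eps=1e-6):
    """Relative update between two epochs, used to terminate each node."""
    denom = np.linalg.norm(theta_old) + 1e-12  # guard against theta_old = 0
    return np.linalg.norm(theta_new - theta_old) / denom < eps
```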

C. Performance Evaluation

In this section, we first evaluate the performance of EdgeSGD empirically and compare it to other standard analytics techniques. Fig. 4(a) shows the convergence behavior of the different algorithms. Both MapReduce and EdgeSGD converge to the true solution, after 75 and 150 epochs respectively. Due to the faster mixing of intermediate results at the coordinator node, MapReduce takes fewer epochs to converge to the true solution.

Fig. 3. Visual comparison of the results (layer 7 of 16 layers along Y, resolution 16×16) obtained from (a) the ground truth (optimal solution), (b) the MapReduce algorithm, with distance from the optimal φ = 0.0019, and (c) the EdgeSGD algorithm, with distance from the optimal φ = 0.0054. These results are obtained after 30 epochs; the proposed EdgeSGD result is close to both the optimal solution and the MapReduce result.

Fig. 4. (a) Convergence characteristics of the different algorithms against the number of epochs (relative error on a log scale); EdgeSGD converges to the same solution as MapReduce. (b) Time (sec) taken by EdgeSGD and MapReduce to decrease their relative error. Around the 85th second MapReduce stalls due to a delay at node 10; the speed of MapReduce is governed by the slowest node in the network, which makes it unsuitable for edge analytics.

However, from Fig. 4(b) we see that MapReduce takes longer per epoch due to synchronization overhead. The speed of MapReduce is governed by the slowest node in the network. Fig. 4(b) shows this effect: node 10 temporarily stalls at the 85th second for about a minute, preventing the other nodes from continuing. In EdgeSGD, due to its decentralized nature, the failure of one node or a group of nodes does not have a significant effect on the overall convergence rate, as shown in Fig. 5(c).

A detailed comparison of the different algorithms is given in Table I. Except for SGD, all algorithms are run on 16 nodes. Each node has a training set of 5K samples with 4096 features. Since SGD is centralized, it takes very few epochs to reach the true solution; on the other hand, each epoch requires frequent disk access due to the large training set, which increases the execution time. DGD and EXTRA both distribute computation and communication, yet are slower than EdgeSGD, which shares the same property. Properties such as robustness, asynchrony, and distributed computation and communication make EdgeSGD a suitable choice for edge analytics.

Next, we demonstrate the correctness of our algorithm through visualization. Fig. 3 shows layer 7 of the 16 layers along the Y direction. Since EdgeSGD outperforms DGD and EXTRA, we now compare it only with MapReduce. From this experiment, we can see that EdgeSGD is able to generate a sharp image similar to MapReduce. This evaluation suggests that EdgeSGD, which has other advantages over MapReduce, can be a good candidate for decentralized seismic imaging.

TABLE I
COMPARISON OF DIFFERENT ALGORITHMS

Algorithm   | Distributed Computation | Decentralized Communication | Epochs | Time (sec)
SGD         | No                      | No                          | 22     | 205
MapReduce   | Yes                     | No                          | 35     | 175
EdgeSGD     | Yes                     | Yes                         | 60     | 110
EXTRA       | Yes                     | Yes                         | 175    | 232
DGD         | Yes                     | Yes                         | 200    | 251

D. Robustness

One of the important characteristics of EdgeSGD is its fault tolerance, and here we validate it by simulating node and link failures. We run each experiment for 200 seconds under three different cases: Case 1) no failure; Case 2) 25% of the nodes fail for 10% of the time; Case 3) 50% of the nodes fail for 10% of the time. From Fig. 5(c) we see that node failure has no significant effect on convergence. However, if more than half of the nodes fail, the algorithm slows down, as on average there are no neighbors left to communicate with. MapReduce, in contrast, requires all nodes to be running, and its speed depends on the slowest node. This property is extremely important when choosing algorithms for edge computing.

E. Communication Cost

We now compare the communication patterns of MapReduce and EdgeSGD. We plot the number of messages exchanged by each node on a grid to finish 200 epochs. The x and y axes in Fig. 5(a)-(b) show the 4 × 4 grid layout, and the z axis denotes the number of messages exchanged. The communication is balanced in EdgeSGD, with each node exchanging roughly 250 messages. In MapReduce, since the center node is used as a coordinator, we observe an imbalanced communication pattern: there is an increased message overhead near the center node, accumulating to roughly 300 messages. Although the messages exchanged by the other nodes are fewer than in EdgeSGD, the reliance on a central coordinator limits the system's scalability and performance. Building the infrastructure with a special coordinator, routing protocols and synchronization adds extra overhead and is not suitable for decentralized systems such as an edge network.

F. Effect of Topology

Topology plays a key role in decentralized algorithms like EdgeSGD. Since the method communicates only with immediate neighbors, the topology decides how fast information diffuses through the network. Therefore, EdgeSGD on a strongly connected topology such as the complete graph (K_n) converges faster than on the grid topology, as shown in Fig. 5(d).

G. Imaging Parkfield, CA

In this section, we apply our algorithm to a real data set and compare the results with an existing centralized method [17]. The dataset is obtained from Parkfield, CA, which is located over the San Andreas fault region. The goal is to obtain the P-wave velocity map of the region, in other words an image of the fault zone. We preprocess the raw seismic signals using the algorithms given in [6], [8].

Fig. 5. (a)-(b) Number of messages transmitted (z axis) by each node on a 4 × 4 grid (x and y axes): (a) communication pattern during EdgeSGD; (b) communication pattern during MapReduce. (c) Performance of EdgeSGD under different cases of node/link failure: Case 1) no failure; Case 2) 25% of nodes fail for 10% of the time; Case 3) 50% of nodes fail for 10% of the time. (d) EdgeSGD convergence on different edge topologies (Grid, RGG, K_n).

Fig. 6 shows horizontal slices of the P-wave velocity at 1 km depth. The image in Fig. 6(a) is obtained using the centralized algorithm presented in [17]; this is currently the widely accepted model of the San Andreas fault region. The result from our proposed algorithm is shown in Fig. 6(b). From the visual comparison, our decentralized algorithm is able to recover the main features of the fault, similar to the centralized algorithm. These results are very promising, and we hope to extend them to other seismic monitoring sites as well.

Fig. 6. Horizontal slices of the P-wave velocity at a depth of 1 km (layer 8 of 72 layers along Z, resolution 120×160). The vertical partition at the center shows the San Andreas fault of California; the black dots depict earthquake locations. (a) Image obtained by centralized computing using the method from [17]. (b) Image obtained using the decentralized EdgeSGD algorithm. The image from the decentralized computation closely matches the existing state-of-the-art centralized scheme.

VI. CONCLUSION

In this paper, we presented a novel decentralized stochastic gradient descent algorithm, EdgeSGD, for solving machine learning problems such as linear regression on an edge network. We demonstrated the applicability of the method on the real-world problem of learning/predicting seismic anomalies via real-time imaging. EdgeSGD outperformed the existing MapReduce algorithm, as well as decentralized P2P machine learning methods such as DGD and EXTRA, on an edge computing testbed.

ACKNOWLEDGMENT

Our research is partially supported by NSF-CNS-1066391, NSF-CNS-0914371, NSF-CPS-1135814 and NSF-CDI-1125165.

REFERENCES

[1] L. Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of COMPSTAT'2010, pages 177–186. Physica-Verlag HD, 2010.

[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, June 2006.

[3] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione. Gossip Algorithms for Distributed Signal Processing. Proceedings of the IEEE, 98(11):1847–1864, Nov. 2010.

[4] G. Kamath, P. Ramanan, and W.-Z. Song. Distributed Randomized Kaczmarz and Applications to Seismic Imaging in Sensor Network. In The 11th IEEE International Conference on Distributed Computing in Sensor Systems (IEEE DCOSS), Fortaleza, Brazil, 2015.

[5] G. Kamath, L. Shi, and W.-Z. Song. Component-Average Based Distributed Seismic Tomography in Sensor Networks. In The 9th IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS), pages 88–95, May 2013.

[6] G. Kamath, L. Shi, W.-Z. Song, and J. Lees. Distributed travel-time seismic tomography in large-scale sensor networks. Journal of Parallel and Distributed Computing, 89:50–64, Mar. 2016.

[7] D. Kempe, A. Dobra, and J. Gehrke. Gossip-Based Computation of Aggregate Information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS '03, pages 482–491, 2003.

[8] G. Liu, R. Tan, R. Zhou, G. Xing, W. Song, and J. Lees. Volcanic Earthquake Timing using Wireless Sensor Networks. In The 12th ACM/IEEE Conference on Information Processing in Sensor Networks (IPSN), pages 91–102, 2013.

[9] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm, Nov. 2014.

[10] R. McDonald, M. Mohri, N. Silberman, D. Walker, and G. S. Mann. Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239. Curran Associates, Inc., 2009.

[11] F. Niu, B. Recht, C. Ré, and S. J. Wright. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS, 2011.

[12] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, Nov. 2014.

[13] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the Linear Convergence of the ADMM in Decentralized Consensus Optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, Apr. 2014.

[14] I. Stojmenovic and S. Wen. The Fog computing paradigm: Scenarios and security issues. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 1–8. IEEE, 2014.

[15] R. Van Renesse, K. P. Birman, and W. Vogels. Astrolabe: A Robust and Scalable Technology for Distributed… ACM Transactions on Computer Systems, 2003.

[16] K. Yuan, Q. Ling, and W. Yin. On the Convergence of Decentralized Gradient Descent, Feb. 2014.

[17] H. Zhang, C. Thurber, and P. Bedrosian. Joint inversion for Vp, Vs, and Vp/Vs at SAFOD, Parkfield, California. Geochem. Geophys. Geosyst., 10(11):Q11002+, Nov. 2009.

[18] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized Stochastic Gradient Descent. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.